Generating Comics from 3D Interactive Computer Graphics

Ariel Shamir, Michael Rubinstein, and Tomer Levinboim
Interdisciplinary Center, Israel

This comics-generation system enhances story understanding by transforming dynamic graphics from interactions in a computer game's 3D virtual world into a sequence of static comic images that summarize the story's main events.

Published by the IEEE Computer Society, 0272-1716/06/$20.00 © 2006 IEEE. IEEE Computer Graphics and Applications, May/June 2006.

Feature Article

Storyboards and scripts use comic-like frames to narrate stories. Such displays can also summarize the main events in a film, video, or virtual world interaction, creating a more succinct representation of the original graphics.1 Our goal is to create a sequence of comic-like images summarizing the main events in a virtual world environment and present them in a coherent, concise, and visually pleasing manner. The transformation from dynamic graphics to static nonphotorealistic images reduces the complexity of both the temporal and the spatial domains. Hence, the challenge is twofold:

■ In the temporal domain, we want to automatically detect and choose the most important or most interesting (as specified by the viewer) events in the interaction.

■ In the spatial domain, we want to depict these events with one or more images that will convey their meaning faithfully. This includes choosing the points in time to portray the event, selecting camera parameters to create descriptive images, choosing a rendering style, and arranging the image layout.

Traditionally in comics, selecting or producing image and text that communicate the right message is the artist's job. Indeed, we're still far from an artist's capability and expressiveness. However, our system can extract a sequence of important events from a continuous temporal story line and convert the events into a graphical representation automatically. Our automatic end-to-end system transforms interactions in 3D graphics into a succinct representation depicted by comics. The system is based on principles of comic theory and can produce different comic sequences on the basis of user-provided semantic parameters, such as viewpoint and level of detail, using the same reduction principle in both the temporal and the spatial domains.

Figure 1 gives an overview of our system. The system starts with a log of events based on player interactions inside a 3D virtual world and progresses through several steps to generate a series of comic strips highlighting important events from the log.

The language of comics

Comics have the reputation of being somewhat childish, with images that are often deliberately simplified and nonphotorealistic. Still, comics can invoke strong emotional reactions, create identification, and effectively convey a story in an appealing manner. In fact, symbolism and abstraction in comics can promote identification because symbolic figures can represent anyone or anywhere, whereas realistic images resemble specific people or places.

A primary principle of comics is the translation of time into space. Comic frames are inherently static, but they display dynamic information. The principle of depicting dynamic events using a sequence of still images is basic to human perception. Still, comics present several possible time transitions between frames that let a sequence of still images convey dynamic events:2

■ Moment-to-moment transitions break the action or motion into images based on time intervals. The results might look like a sequence of frames from a movie.

■ Action-to-action transitions break the action or motion into images based on the action's type or essence. This transition is most common in comics.

■ In subject-to-subject transitions, images switch from one action to another or from one character to another but still remain within the same scene.

■ In scene-to-scene transitions, consecutive images leaping in time or space usually signify a scene change.

Comics use two more transitions in which time doesn't play a central role:2 a concept-to-concept transition, which doesn't depict much action but rather different angles of the same scene; and a non sequitur, which is a sequence of images with no (apparent) logical relationship. Our work doesn't currently use these transitions, however.



Another leading principle of comics is the use of visual icons. Here, an icon is any image representing a person, place, thing, or idea. Using abstract icons instead of realistic images focuses our attention through simplification, eliminates superfluous features, and creates higher identification and involvement.

Our work follows these two principles to create a succinct depiction of 3D interaction.

Logging events in a 3D world

The system's first component is the logger. The logger traces all events in an interaction and stores them in a log file. For our technique to be applicable to real 3D virtual interaction, the logger must be incorporated inside the 3D world's engine. After examining several options, we resolved to use a 3D game as the basis for the 3D interaction input. Not many open source games let you modify the core engine. For our experiments, we use Doom (http://zdoom.org), a simple 3D shooting game. Although we use the game's multiplayer version, this type of game offers limited possible events and interactions. Nevertheless, we've designed our system in a modular fashion, which lets us easily define, recognize, and support new types of events and interactions.

Each entity in the virtual world holds a set of attributes defining its state. These include its position, possessions (such as tools or weapons), and sometimes a set of measures such as hunger, vitality, or strength. Furthermore, an entity's state includes its current action from a set of predefined possible actions such as walking, shooting, watching, and picking. An event in the virtual world is defined as any change in an entity's state, for example, a change in position, action, or attribute (such as getting hungry or tired). Events typically result from a character's action, such as shooting, picking up an object, sitting down, or opening a door.

Because many events can occur simultaneously, we measure the world's change of state at some atomic time unit called a tick. For each tick, the log stores a set of events defining all changes that occurred to all entities between the previous and the current tick. It also stores each event's related parameters, such as the entity identifier, its position, and the action type. The log also holds all the initial information regarding the world's geometry and entities at time (tick) 0. Thus, at any tick t, we can reconstruct the world's state by accumulating the event information from tick 0 to tick t. When the difference between ticks is small, the granularity of events is finer; when the difference is large, the granularity is coarser and some events might even be missed. To preserve all important events and still prevent the log from exploding, we use adaptive tick differences and a filtering scheme. When concentrating on specific characters' stories, we store only events related to them. That is, the system creates a new tick and stores data in the log only when a change occurs in the state of one of these characters. To track the characters' positional information, we also add uniform ticks with only positional information every second or so.
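As a rough illustration of this log structure, the following Python sketch shows one way to store per-tick events and reconstruct the world state by replaying them; the record fields and class names are hypothetical, not the paper's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    entity_id: int                               # which entity changed state
    action: str                                  # e.g., "shoot", "sit", "pick"
    position: tuple                              # (x, y, z) when the event occurred
    params: dict = field(default_factory=dict)   # extra data, e.g., a target entity

@dataclass
class Tick:
    t: int                                       # tick index (atomic time unit)
    events: list                                 # all state changes since the previous tick

class EventLog:
    def __init__(self, initial_world_state):
        # Simplified: a dict mapping entity id -> (action, position) at tick 0.
        self.world0 = dict(initial_world_state)
        self.ticks = []

    def log_tick(self, t, events):
        """Store a new tick only when something changed (adaptive ticks)."""
        if events:
            self.ticks.append(Tick(t, events))

    def state_at(self, t):
        """Reconstruct the world state by accumulating events from tick 0 to t."""
        state = dict(self.world0)
        for tick in self.ticks:
            if tick.t > t:
                break
            for ev in tick.events:
                state[ev.entity_id] = (ev.action, ev.position)
        return state
```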

Recognizing scenes: The iHood model

Many narrational forms such as theater and films decompose a story into a sequence of scenes or chapters. Hence, one of the first steps in transforming an event log into a coherent story is to separate it into scenes. In our system, the scener performs this function. However, it's difficult to define the notion of a scene precisely. In general, we can combine a set of related actions or events occurring in one place at one time to form a scene. Hence, the two main parameters defining a scene are usually time and space, and we can separate the events according to their location and time. Nevertheless, many unrelated events can occur simultaneously at the same location, with only some of them relevant to a specific story. Similarly, some events occurring in different locations might belong to the same scene (such as a car chase scene). A simple scheme separating events in time and space wouldn't be sufficient; the transformation would require some level of understanding and classification.

Story understanding is a complex problem. We use a simpler scheme that models interest, or event importance. We model the interactions between entities in the virtual world to recognize scenes, and we analyze interactions (which involve two entities) instead of events (a single entity's actions). Before transforming a sequence of events into a story, we select a viewpoint, or narrator. We use these choices to weight interaction importance by entity and action importance, creating a bias toward a specific viewpoint or event type.

An interaction is a relation between two entities in the world. See, shoot, pick, and hold can be interactions, but we can also define spatial proximity as a feeble type of interaction. Let E be the set of all entities in the world, F the set of all possible interactions, and N the natural numbers (representing ticks). An interaction R is a function R: N × F × E × E → {0, 1}. If two entities e1 and e2 interact at tick t in an event of type f, then R(t, f, e1, e2) = 1; otherwise it's 0.

1 Overview of the process of transforming 3D graphics interaction into comics. The user interactions in a virtual world are logged, analyzed, and processed based on parameters given by the user to create scenes. Next, specific scenes and events are chosen and transformed into a series of comic strips. (The pipeline components are the Logger, Scener, Director, and Renderer.)

An interaction neighborhood, or iHood, of entity e is the set of all active interactions at tick t of entity e with any other entity in the world: I(e, t) = {R | ∃e′ ∈ E, ∃f ∈ F, R(t, f, e, e′) = 1}.

Each entity e in the virtual world owns an iHood, which lists all entities it's currently interacting with, and in what manner. We update these iHoods for each tick using two procedures:

■ by continuously simulating all log events to find events involving entity e, and

■ by examining the vicinity of entity e for other entities' entrance or departure.

An entity's vicinity is a sphere S centered at the entity's position and with a given radius r (see Figure 2).
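A minimal sketch of this iHood bookkeeping, assuming hypothetical entity objects with `id` and `position` attributes and the Event records from the earlier sketch; it illustrates the two update procedures rather than the authors' implementation.

```python
import math

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def update_ihoods(entities, tick_events, radius):
    """Rebuild each entity's interaction neighborhood I(e, t) for one tick.

    An interaction is recorded either because an explicit event links two
    entities (e.g., e1 shoots e2) or because another entity is inside the
    vicinity sphere of radius `radius` around e.
    """
    ihoods = {e.id: set() for e in entities}

    # 1. Explicit interactions found by replaying this tick's events.
    for ev in tick_events:
        target = ev.params.get("target")          # hypothetical field
        if target is not None:
            ihoods[ev.entity_id].add((ev.action, target))
            ihoods[target].add((ev.action, ev.entity_id))

    # 2. Weak "proximity" interactions from the vicinity sphere.
    for e in entities:
        for other in entities:
            if other.id != e.id and distance(e.position, other.position) <= radius:
                ihoods[e.id].add(("proximity", other.id))
    return ihoods
```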

Related Work

Developers have successfully used automatic comic creation to portray chat sessions in comic chat.1 Comic chat converts text conversations to 2D comic frames by choosing iconic 2D figures with several predefined postures to represent the chat participants. Next, the system displays the conversation text inside balloons connected to the figures. Because comic chat's main purpose is to portray a textual chat, its textual balloon positioning is more elaborate than our simple one. However, comic chat is inherently 2D in nature, whereas our system involves converting 3D world interaction into images, including story line extraction, camera positioning, and frame layout.

Automatic comic creation is also closely related to automatic animation or movie creation.2-4 In both cases, the main challenge is to decompose a series of events into scenes and create a visual display. In fact, our system's general outline follows that of similar systems for movie creation:

■ it segments the events into scenes,
■ it selects specific scenes or scene parts, and
■ it transforms each one into a visual depiction.

Moreover, comic sequences must follow some cinematographic rules. For example, the 180-degree camera line rule also applies to comics. Halper and Masuch discuss a framework similar to ours for creating a summary of computer games using video and possibly static images.3 Nevertheless, their results were limited and weren't oriented toward comics creation.

Although video summarization is also related to our work, most of the research there has a somewhat different focus, such as image analysis and understanding. Our input comes from a 3D virtual world, and its different nature includes more elaborate semantic information.

Two previous approaches sought to preserve cinematographic constraints in movie creation. One uses rule-based constraint satisfaction4 and the other uses idioms, which are stereotypical ways to capture some specific actions as a series of shots.2 Our approach for transforming events into comics extends the notion of idioms to the creation of comic frame transitions based on comic principles.

The language of comics differs from the language of cinematography. Although it seems simpler in that only static images are created, the choices are more critical because they must convey movements or actions using only static images.5 Furthermore, the general approach of automatic movie creation is inherently photorealistic. In contrast, we use the same principle of detail reduction and abstraction in rendering as in the temporal story extraction. Because we don't have full geometric information for rendering, we create nonphotorealistic images by applying postprocessing stylization to the rendered images to enhance important details and reduce cluttering.

Analyzing an event log is directly connected to story understanding, which researchers have studied since the early days of artificial intelligence.6 The more thorough the story understanding, the better the selection of interesting and important events. Rather than depend on deep story understanding, our approach uses a model of the interactions between entities or characters in the world to recognize scenes and choose events.

Chittaro and Ieronutti use transformations of a virtual environment or game interactions to a static visualization to trace user behavior,7 and Hoobler, Humphreys, and Agrawala use a similar approach to create an enhanced spectator view.8 However, these works mostly deal with global views of the events in the environment, and the visualization doesn't aim to convey a narrative, but rather some global characteristics of the events. Nienhaus and Döllner also use this type of transformation.5 They present a system for rendering dynamic VR events using symbols from visual arts. They use these depiction types to convert specific events to comics, whereas our work deals more in the global extraction of a narrative story line and its depiction.

References

1. D. Kurlander, T. Skelly, and D. Salesin, "Comic Chat," Proc. Siggraph, vol. 30, ACM Press, 1996, pp. 225-236.

2. L. He, M.F. Cohen, and D.H. Salesin, "The Virtual Cinematographer: A Paradigm for Automatic Real-Time Camera Control and Directing," Proc. Siggraph, vol. 30, ACM Press, 1996, pp. 217-224.

3. N. Halper and M. Masuch, "Action Summary for Computer Games: Extracting and Capturing Action for Spectator Modes and Summaries," Proc. 2nd Int'l Conf. Application and Development of Computer Games, 2003, pp. 124-132; http://wwwisg.cs.uni-magdeburg.de/graphik/pub/files/Halper_2003_ASF.pdf.

4. D. Friedman et al., "Automated Creation of Movie Summaries in Interactive Virtual Environments," Proc. IEEE Virtual Reality, IEEE CS Press, 2004, pp. 191-199.

5. M. Nienhaus and J. Döllner, "Depicting Dynamics Using Principles of Visual Art and Narrations," IEEE Computer Graphics and Applications, vol. 25, no. 3, 2005, pp. 40-51.

6. R.C. Schank and R.P. Abelson, Scripts, Plans, Goals, and Understanding: An Inquiry into Human Knowledge Structures, Lawrence Erlbaum, 1977.

7. L. Chittaro and L. Ieronutti, "A Visual Tool for Tracing Users' Behavior in Virtual Environments," Proc. Working Conf. Advanced Visual Interfaces (AVI 04), ACM Press, 2004, pp. 40-47.

8. N. Hoobler, G. Humphreys, and M. Agrawala, "Visualizing Competitive Behaviors in Multiuser Virtual Environments," Proc. IEEE Visualization, IEEE CS Press, 2004, pp. 163-170.


For each interaction type f ∈ F, we define a weight w_f, a real-valued number the user configures. The weights denote these interactions' importance in a specific story, for example, from a specific viewpoint. Hence, different choices of weights for different interactions eventually govern whether an action will appear in a story.

At any tick t, we define the interaction level of an entity e as

l(e, t) = ∑_{R(t, f, e, e′) ∈ I(e, t)} w_f = ∑_{f ∈ F} ∑_{e′ ∈ E} w_f · R(t, f, e, e′).

As Figure 3a shows, for a specific entity e we can graph l(e, t) as a function of the tick t. The interaction level has several peaks and valleys during the course of the interaction. Furthermore, we can change this function's shape by changing the weights of specific interaction types.
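A small sketch of this computation; the weight table and the iHood format follow the earlier hypothetical snippets rather than the authors' actual code.

```python
def interaction_level(ihood, weights, default_weight=0.0):
    """Compute l(e, t) as the weighted sum of e's active interactions.

    `ihood` is I(e, t) as a set of (interaction_type, other_entity) pairs;
    `weights` maps interaction types f to user-configured values w_f.
    """
    return sum(weights.get(f, default_weight) for f, _other in ihood)

# Example: biasing the story toward combat by weighting "shoot" highly.
weights = {"shoot": 10.0, "see": 1.0, "proximity": 0.2}
ihood_at_t = {("shoot", 4), ("proximity", 12), ("see", 12)}
print(interaction_level(ihood_at_t, weights))   # 10.0 + 1.0 + 0.2
```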

Our key assumption is that a high interaction level reflects significance in the story, and vice versa. Hence, we'd want to include the events around the peaks of l(e, t) in any story involving entity e, and skip these events when l(e, t) is low. We can do this using a simple threshold over the l(e, t) value. Nevertheless, this simple scheme doesn't create a good partitioning of the story into scenes and is sensitive to local fluctuations in l(e, t). The interaction level l(e, t) captures only one of the arguments defining a scene: the events' time. We need to augment it with the second argument: the events' location.

We define a change-of-location function s(e, t) that is always 0 except for a finite number of times k when it's defined as a uniform spike with maximum height c_k > 0. This function is similar to a series of k approximated delta functions. The times when s(e, t) ≠ 0 are the times the entity e has moved from one physical location to another, and the constant c_k depends on the type of location change. This information is either in the log file or we can extract it from the entities' positions and actions. For instance, opening a door or using a portal signifies moving from one room to another, or exiting or entering a building. A change in the z coordinate signifies going up or down stairs or elevators.

We separate scenes when two conditions hold:

■ the interaction level is low, and
■ the location changes.

This means we subtract the change-of-location function from the interaction-level function: l(e, t) − s(e, t). We also smooth the result using a Gaussian with local support to gain an ease-in and ease-out effect in scenes and events: h(e, t) = G(t) ∗ (l(e, t) − s(e, t)). Function h(e, t) is a smoother version of l(e, t) that also incorporates location-change information (see Figure 3b).
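In sampled form, the combination and smoothing step could look like the following sketch, assuming l and s have already been evaluated per tick; the Gaussian width is illustrative, not a value from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def combined_score(l, s, sigma=25.0):
    """h(e, t) = G(t) * (l(e, t) - s(e, t)): subtract the location-change
    spikes from the interaction level, then smooth with a local Gaussian."""
    return gaussian_filter1d(np.asarray(l, float) - np.asarray(s, float), sigma)
```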

We next use h(e, t) to segment the event log into scenes.

2 Two examples of interaction neighborhoods (iHoods) of two entities, the orange and the blue circles. The image shows a top view of a 3D game environment in which circles represent each entity. Entities are part of another entity's iHood if they're within its vicinity (large orange or blue circles) or if they interact with it (dotted lines).

3 The interaction level and change-of-location functions are the basis for scene segmentation and event selection. (a) An example of the interaction level l(e, t) over time. (b) Combining a smoother version of the interaction level and the location change to create h(e, t), and scene partitioning based on h(e, t). Using a slightly higher threshold, we could have separated scene 3 into two scenes. (c) We choose interesting scenes and events on the basis of a threshold on the smoothed l(e, t). We could remove scenes 4 and 5 because no interesting event occurred in them.


A transition from a period of low h(e, t) to a period of high h(e, t), that is, ∂h(e, t)/∂t > 0, signifies the beginning of a scene. Conversely, a decreasing h(e, t), that is, ∂h(e, t)/∂t < 0, indicates the end of a scene. Inside each scene, we use a threshold T on a smoothed version of l(e, t) to indicate an interesting event (see Figure 3c). An interesting scene has at least one interesting event. We remove noninteresting scenes, in which l(e, t) is low throughout, such as periods when the character is waiting, because the character isn't participating in any interesting interaction.
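Putting the pieces together, one rough segmentation pass over h(e, t) might look like this (a sketch under the assumptions of the earlier snippets; the thresholds are illustrative):

```python
def segment_scenes(h, l_smooth, event_threshold, scene_threshold=0.0):
    """Split ticks into scenes where h rises above / falls below scene_threshold,
    and keep only scenes containing at least one interesting event
    (smoothed l above event_threshold)."""
    scenes, start = [], None
    for t in range(1, len(h)):
        rising = h[t - 1] <= scene_threshold < h[t]
        falling = h[t - 1] > scene_threshold >= h[t]
        if rising and start is None:
            start = t                      # scene begins
        elif falling and start is not None:
            scenes.append((start, t))      # scene ends
            start = None
    if start is not None:
        scenes.append((start, len(h) - 1))
    # Discard scenes with no interesting event.
    return [(a, b) for a, b in scenes
            if max(l_smooth[a:b + 1]) > event_threshold]
```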

Selecting a main character segments the log into scenes from that character's viewpoint. Changing the main character changes the segmentation and story. Hence, this character-selection mechanism is a type of parametric temporal detail reduction that can emphasize important information and remove insignificant data depending on the viewpoint. Furthermore, choosing different thresholds for defining interesting scenes and events changes the story's level of detail. (Results for different levels of detail and viewpoints are available at http://www.faculty.idc.ac.il/arik/comics.)

Converting to visual depiction

After we've selected the interesting scenes from the log file, the director converts them into a visual display. Currently, the director examines each scene independently (in the future, we might also use interscene relations, for example, to change scene order). Our goal is to transform each scene into a sequence of images depicting the main happenings. This transformation includes three major steps:

1. Choose the specific events within the scene to be portrayed.

2. Use one of the possible comics' temporal frame transitions to portray the events, as Figure 4 illustrates.

3. Choose the camera parameters for each frame image: position, direction, zoom, and so on.

We'll use Figure 5 to illustrate the conversion process from scenes to visual depiction.

4 Schematic view of the visual conversion process from log to comic frames, and the places where we used different types of comic transitions.


5 Comics page visual depiction created using our system. This is the beginning of the comic strip from the green player's viewpoint. The full sequence and more examples are available at http://www.faculty.idc.ac.il/arik/comics.


Each scene opens with one change-of-scene frame representing the scene-to-scene transition and creating continuity in the story line. This frame is usually a wide-angle image that gives an exposition of the scene (for instance, the first and fourth frames in Figure 5). Next, we recognize the scene's main events using a threshold on a smoothed version of l(e, t) within the scene (see Figure 3c). If more than one event occurs in a scene, we use a subject-to-subject transition. By depicting each event separately, a subject-to-subject transition emerges implicitly between the different event frames, as Figure 6 illustrates.

We can display each event using moment-to-moment or action-to-action transitions. Creating moment-to-moment transitions is simple because the event is shown using k frames at equal intervals in time. Nevertheless, these transitions are rarely used in comics, and when they are, they mainly stress important actions or build tension by prolonging the perceived time (see Figure 6d).

Action-to-action transitions are more appropriate and useful in comics. They depict events using fewer frames that present significant movements in the event. Although even the same action performed by the same entity on different occasions can use different significant frames, the frames typically depend on the action type and its interval in time. Therefore, to depict an event using an action-to-action frame transition, we use predefined idioms depending on the event type.

In general, an idiom is a mapping from a specific pattern of events in the scene to a specific time-frame sequence. We can use the same idiom for different event types or, depending on the entities' state, choose different idioms to depict the same event. For instance, for a shooting sequence the director mechanism uses an idiom that displays two frames in the manner of an action-to-action transition: one just before the action's peak (the peak in l(e, t)) and one a little after the outcome (see frames 5 to 8 in Figure 5 and Figures 6a and 6c). An idiom of a single wide-shot frame depicts a single peak in l(e, t) containing interactions with many entities. However, when a player is shielded, it's important to stress that player's personal viewpoint. To do so, the director uses an idiom that defines first-person viewpoint shots (frames 12 and 16 in Figure 5).

Idioms can also define higher-level mappings between otherwise solitary events. In a conversation scene, a prolonged idiom that later enables the insertion of text balloons (for example, frames 2 and 3 and 18 to 20 in Figure 5) replaces many single-interaction events. Table 1 lists the primary idioms used in Doom comics, and Figure 7 designates the origin of each frame in Figure 5. Although this game offers limited interaction types, the idiom mechanism is versatile and easily extendible to other fields, including other interaction types.
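One way to picture the idiom mechanism is as a lookup from event patterns to frame directives. The following sketch is illustrative only; the directive fields and the two example idioms are hypothetical stand-ins for the authors' actual data structures.

```python
from dataclasses import dataclass

@dataclass
class FrameDirective:
    time: int             # tick at which to take the snapshot
    shot: str             # "close-up", "medium", "long", or "wide"
    main_entity: int      # entity the camera should center on
    first_person: bool = False

# Hypothetical idiom: two action-to-action frames around a shooting event.
def shooting_idiom(event, peak_tick, outcome_tick):
    return [
        FrameDirective(peak_tick - 1, "medium", event.shooter),    # just before the peak
        FrameDirective(outcome_tick + 1, "medium", event.target),  # a little after the outcome
    ]

# Hypothetical idiom: a single wide shot for a peak with many participants.
def peak_interaction_idiom(event, peak_tick):
    return [FrameDirective(peak_tick, "wide", event.main_entity)]
```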

As a result of the director's process, the system portrays the story of a specific player or entity in the virtual world using discrete points in time defined by idioms or indicated by high interaction levels. However, we still need to convert the 3D scene at these points in time to 2D images. We do this by positioning the camera and choosing the parameters for shooting the comic frames, as Figure 8 illustrates.

6 The shooting sequence idiom consists of two sets of action-to-action transitions, one for the shooting entity and one for the target entity, implicitly creating a subject-to-subject transition between the two sets. (a) We depict the green player's shooting sequence using an idiom composed of two action-to-action and one subject-to-subject transitions. (b) Snapshots from (a) as seen in the game, without our system's camera repositioning. (c) The same type of idiom depicting a similar event from the blue player's story, only this time the player is hit. (d) A moment-to-moment transition that's not used in practice.

Table 1. Examples of idioms used to depict interactions.

Idiom                    Depiction
Change of scene          Wide shot
Shooting                 Two action-to-action shots
Conversation             Interchanging shots
Peak interaction         Wide shot
Object picking           Long shot
Shielded interaction     First-person viewpoint shots


The director uses idioms that define high-level directives (such as the scene's main entity, a point in space at the scene's center, or a list of secondary entities to be portrayed if possible). An idiom can also indicate the desired shot type: close-up, medium, long, or wide. The renderer translates these directives to explicit camera parameters. For example, it translates a central point or entity into the camera's direction and the shot type into the camera's distance from the object.
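A sketch of how such a directive might be turned into camera parameters, reusing the hypothetical FrameDirective from the earlier idiom sketch; the per-shot distances and the flat placement are made-up values for illustration.

```python
import math

# Assumed distances (world units) per shot type; purely illustrative.
SHOT_DISTANCE = {"close-up": 2.0, "medium": 6.0, "long": 15.0, "wide": 40.0}

def camera_for(directive, entity_position, angle=0.0, height=1.5):
    """Place the camera at the shot-type distance from the target entity,
    looking back at the target position."""
    d = SHOT_DISTANCE[directive.shot]
    cx = entity_position[0] + d * math.cos(angle)
    cy = entity_position[1] + d * math.sin(angle)
    cz = entity_position[2] + height
    camera_pos = (cx, cy, cz)
    look_at = entity_position          # camera direction derived from the central entity
    return camera_pos, look_at
```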

Several 3D rendering engines and techniques that give objects a cartoon-like form are available. Unfortunately, Doom's game engine supports only the game's original dark-mood shading. Furthermore, the game itself has low-quality graphics and is based primarily on sprites and textures. Instead of full 3D data, we can acquire only fixed-resolution 2-1/2D images from the game. Using these extracted images, we stylize the resulting image using several image-processing techniques, as Figure 9 shows. We use color clustering based on mean shift for the background and k-means clustering for the foreground. We apply an anisotropic Laplacian kernel to the image to create stroke-like edges.
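The stylization pipeline could be approximated with off-the-shelf OpenCV operations, as in the sketch below. Note that it uses a standard Laplacian as a stand-in for the anisotropic kernel mentioned above, and all parameter values (spatial/color radii, k, edge threshold) are illustrative rather than taken from the paper.

```python
import cv2
import numpy as np

def stylize(frame, fg_mask, k=8):
    """Cartoon-like stylization: mean-shift color clustering for the
    background, k-means color quantization for the foreground, and
    Laplacian-based dark strokes along strong edges."""
    # Background: mean-shift color clustering (spatial radius 15, color radius 30).
    background = cv2.pyrMeanShiftFiltering(frame, 15, 30)

    # Foreground: k-means color quantization on the masked pixels.
    fg_pixels = frame[fg_mask > 0].reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, centers = cv2.kmeans(fg_pixels, k, None, criteria, 3,
                                    cv2.KMEANS_RANDOM_CENTERS)
    quantized = centers[labels.flatten()].astype(np.uint8)

    result = background.copy()
    result[fg_mask > 0] = quantized

    # Stroke-like edges from a Laplacian of the grayscale image.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Laplacian(gray, cv2.CV_8U, ksize=5)
    result[edges > 100] = (0, 0, 0)    # draw dark strokes where edges are strong
    return result
```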

The renderer also creates text balloons when a conversation occurs in the log. Currently, the idiom used for conversation is a simple alternation of images in frontal view of the speakers. The renderer chooses the balloon size that best fits the image text length and positions it to the left or right of the speaker's head, alternating between speakers. The position takes into account the speaker's foreground mask, other entities' masks, and the image borders. If the text doesn't fit into one balloon, the renderer distributes it among several consecutive equivalent images.
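A toy sketch of the balloon-splitting behavior described above; the character budget per balloon is an assumption, not a figure from the paper.

```python
def split_into_balloons(text, max_chars=90):
    """Break conversation text into chunks that fit one balloon each;
    overflow is carried to the next (equivalent) frame."""
    words, balloons, current = text.split(), [], ""
    for w in words:
        if current and len(current) + len(w) + 1 > max_chars:
            balloons.append(current)
            current = w
        else:
            current = (current + " " + w).strip()
    if current:
        balloons.append(current)
    return balloons
```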

Comics layout

The last stage in the comic creation is the page layout. After rendering and styling, all images are the same size and can be displayed consecutively, similar to the top two rows in Figure 5. Nevertheless, many comic pages don't use a sequence of uniform images, but present variations in the images' size, aspect ratio, and positioning to create effects such as stressing or prolonging. Although we can't yet imitate all of a comic artist's skills, we devised a heuristic image-layout algorithm that breaks the symmetry to create a comic look and feel.

To simplify the problem, we constrain all rows in our page to the same height, as is the case in many real comic sequences.

7 The idiom origin of the frames in Figure 5.


8 A plot of the green player's full path during an interaction. At different points in time and space, and depending on the interaction level and idioms, the director chooses a snapshot and positions the camera (green pyramids) to shoot the comics' frames. In this example, the granularity is high and many shots were taken (as in the lower right corner).


Therefore, our layout algorithm becomes a 1D problem of choosing the horizontal length of the images in each row. Next, we classify the images based on their semantics, that is, on the idiom or event they portray. We define four classes of images:

■ B (big): peak-interaction images and change-of-scene images, which can be expanded horizontally;
■ S (small): images from an action-to-action pair or first-person viewpoint shots, which can be condensed;
■ F (fixed): images that include text balloons (to protect the balloon layout); and
■ N (neutral): all other images, which can be expanded or condensed.

The layout process receives the row's length as input. We always choose a whole multiple of a regular image's length (usually k = 4) as the row length. The layout proceeds in a greedy manner by fitting sets of up to k images, one row at a time, beginning from the first image and continuing until the last. At each row's start, the layout examines the next k images. If this set doesn't include any B-type image, the layout puts k images in the row with a fixed size (see rows 2 and 6 in Figure 5). Otherwise, it puts the first B-type image aside as a filler. For all other images, the layout tries to comply with their constraints based on their types. It assigns every B-type image a random expansion ratio between 1.2 and 2.0, and every S-type image a random condensing ratio between 0.7 and 1.0. Lastly, the layout expands the filler image to compensate for all other images. If this isn't sufficient, it expands other N-type images. If no solution is found, the layout tries another round with different random values (see rows 4 and 5 in Figure 5). If after five trials it still fails, it inserts the images in their original sizes and proceeds to the next row (see the first row in Figure 5). We further enhance this basic scheme for specific situations. For instance, when the layout finds two consecutive B-type images in a row, it expands the first as usual by a random ratio r1, and the second by r2 so that r1 + r2 = 3 (see the third row in Figure 5).
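A compressed sketch of one row of this greedy layout, assuming each image carries one of the B/S/F/N labels; the expansion and condensing ratios and the retry count follow the text, while everything else (the dict-based image records, the filler acceptance test) is illustrative.

```python
import random

def layout_row(images, row_len, unit=1.0, trials=5):
    """Assign horizontal widths to up to k = row_len/unit images so they fill
    the row, honoring B (expand), S (condense), F (fixed), N (neutral)."""
    k = int(row_len / unit)
    row = images[:k]
    if not any(img["cls"] == "B" for img in row):
        return [(img, unit) for img in row]          # uniform fixed-size row
    for _ in range(trials):
        filler = next(img for img in row if img["cls"] == "B")
        widths, used = {}, 0.0
        for img in row:
            if img is filler:
                continue
            if img["cls"] == "B":
                widths[id(img)] = unit * random.uniform(1.2, 2.0)
            elif img["cls"] == "S":
                widths[id(img)] = unit * random.uniform(0.7, 1.0)
            else:                                     # F and N keep unit width here
                widths[id(img)] = unit
            used += widths[id(img)]
        remainder = row_len - used                    # the filler absorbs the slack
        if unit * 1.2 <= remainder <= unit * 2.0:
            widths[id(filler)] = remainder
            return [(img, widths[id(img)]) for img in row]
    return [(img, unit) for img in row]               # give up: original sizes
```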

After setting the expansion and condensing factors for a row of images, we prune the images at the fringes either horizontally (mostly for expansion) or vertically (mostly for condensing) to achieve the right aspect ratio. We prune the images symmetrically, from the top and bottom, or left and right, with as little damage to the foreground mask as possible.

Results

We present results created from the interactions of a two-player episode of Doom, with the two players denoted as green and blue. The original interaction from the green player's viewpoint takes 160 seconds. We created all examples on an Intel Pentium 4 with 2.8 GHz and 500 Mbytes memory.

The log created from the interaction is 6.3 Mbytes in XML format and includes 5,574 ticks. The scener first runs on the log to compose an interaction script that includes all of the iHoods for all entities. Creating the 683-Kbyte script takes 392 seconds. This lets us optimize subsequent calls to the scener, which then take only a few seconds. Hence, given a specific entity (viewpoint) and a scene-cutting threshold (granularity), the scener uses the interaction script and log to create a scened log. This takes 3 seconds for the blue and green players. The scened log is larger (8.68 Mbytes) because it's now segmented into scenes, and at the start of each scene it contains a synchronization point for all entities in the world. Later, this lets us portray each scene independently. We can also use the synchronization point to change the scene order in the future.

The director uses the level-of-detail threshold to convert the scened log into a directing script.

9 The 2-1/2D image data extracted from the game engine and the different image-processing techniques used to stylize images for our comics. We use two styles: a cartoon-like style and one that enhances the entities in the foreground by removing all color from the background.


10 An extract from the green player's comic sequence created in the rendering style that stresses the foreground.


This script includes the high-level directives used for creating snapshots of the story over time. In our example, this script is 60.4 Kbytes. Lastly, the director uses the directing script to create the comic images. We used different thresholds to create several level-of-detail examples of the story for the green and blue players (see our Web site for the full set of examples).

Assuming E is the number of entities in the world, T is the number of ticks, and V is the number of events, the most costly stage of the process other than rendering is the creation and update of the iHood for each entity, which is O(T × E² × V). We use the game engine to render the basic images and extract all the masks for stylization, which we perform by filtering all images in postprocessing. Currently, this takes from 90 seconds in the focused style (black-and-white background) to 150 seconds in the cartoon style per image. The rendering stylization process is in fact the major bottleneck because its complexity depends on the number of pixels and not on the entities or time steps.

Figure 5 shows some of the results of the green player's story using the coarsest granularity (that is, only highlights of the interaction). Lowering the threshold automatically yields more frames, as you can see at our Web site. We used four levels of detail each for the blue and green players. Figure 10 shows an example of our results using the second rendering style.

Conclusion and future directions

Although our examples come from the Doom game, our approach and solutions are applicable to more complex games such as the Sims or massive multiplayer games, and possibly other types of interactive applications. In this work, we use the notion of detail reduction in a more general sense than is usual in graphics, by portraying continuous temporal events as static images. However, the same principle can guide similar graphics representations in other applications. Moreover, the study of comics as juxtaposed pictorial and other images in a deliberate sequence2 is useful not only in art and entertainment. In medical or scientific visualizations, for example, researchers can use juxtaposition to compare or illustrate data. In education and training, visual juxtaposition can help explain structures, assemblies, or other processes. Our work, along with that of several others,3-5 suggests that automatic generation of compact but descriptive graphics representations might be feasible.

Causality between events might be the most challenging problem in developing real story understanding. Causality lets us create better scene partitioning and more accurately identify important events in the story line. Our current system for choosing transitions based on idioms is almost memoryless. Remembering the history of frames, idioms, and plot might create a smoother and more interesting flow in the story and might bring us one step toward building causality.

In terms of rendering and stylization, the use of 3D rendering instead of 2D image postprocessing will open new possibilities. We'd like to experiment with more elaborate camera placement and more interesting shots, such as extreme angles and close-ups. Researchers should also further investigate the use of text in comics. Text is often used as a type of narration to show thoughts and add onomatopoeic sounds. These powerful tools can advance the story, fill gaps in the plot, and add atmosphere.

In terms of layout, algorithms for automatically combining text and images to create effective displays, and algorithms for automatic page layout, are still in their infancy and need further investigation.

Lastly, the creation of comics and the creation of movies have much in common. Nevertheless, we need additional research on camera movements and dynamic flow to support the automatic creation of high-quality movies as well as better comics. ■

References

1. M. Brand, "The 'Inverse Hollywood Problem': From Video to Scripts and Storyboards via Causal Analysis," Proc. 14th Conf. Artificial Intelligence, AAAI Press, 1997, pp. 132-137.

2. S. McCloud, Understanding Comics: The Invisible Art, Harper Perennial, 1994.

3. S.K. Feiner and K.R. McKeown, "Automating the Generation of Coordinated Multimedia Explanations," Computer, vol. 24, no. 10, 1991, pp. 33-41.

4. E. André and T. Rist, "Generating Coherent Presentations Employing Textual and Visual Material," Artificial Intelligence Rev., vol. 9, no. 2-3, 1995, pp. 147-165.

5. M. Agrawala et al., "Designing Effective Step-by-Step Assembly Instructions," ACM Trans. Graphics, vol. 22, no. 3, 2003, pp. 828-837.

Ariel Shamir is a professor at the Interdisciplinary Center in Herzliya, Israel. His research interests include geometric modeling, computer graphics, visualization, and machine learning. Shamir received his PhD in computer science from the Hebrew University in Jerusalem. Contact him at [email protected].

Michael Rubinstein is in algorithmic research at Aternity Ltd. His research interests include computer graphics and machine learning. Rubinstein received a BA in computer science from the Interdisciplinary Center in Herzliya, Israel. Contact him at [email protected].

Tomer Levinboim is founder of uTrade, a company in the field of prediction markets. His research interests include machine learning and AI. Levinboim received a BA in computer science from the Interdisciplinary Center in Herzliya, Israel. Contact him at [email protected].