
Video Inlays: A System for User-Friendly Matchmove

Dmitry Rudoy
dmitry.rudoy@gmail.com

Technion, Israel

Lihi Zelnik-Manor
[email protected]

Technion, Israel

Figure 1: Adding an artificial object to an existing video usually requires high-end tools and intensive user interaction. Our system allows the user to inlay any number of objects, such as the balcony and the wall lamps on the right, into any video via a simple user interface and minimal user interaction. This is achieved by representing the video structure and texture as a mosaic (center).

Abstract

Digital editing technology is highly popular as it makes it easy to change photos and add artificial objects to them. Conversely, video editing is still challenging and mainly left to the professionals. Even basic video manipulations involve complicated software tools that are typically not adopted by the amateur user. In this paper we propose a system that allows an amateur user to perform a basic matchmove by adding an inlay to a video. Our system does not require any previous experience and relies on simple user interaction. We allow adding 3D objects and volumetric textures to virtually any video. We demonstrate the method's applicability on a variety of videos downloaded from the web.

CR Categories: I.4.3 [Image Processing and Computer Vision]: Scene Analysis—Depth cues; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Virtual reality; I.2.10 [Artificial Intelligence]: Vision and Scene Understanding—Video analysis;

Keywords: video editing, matchmove, video representation

1 Introduction

Video editing is composed of three main steps: sequencing, matchmoving and compositing. Sequencing includes managing the temporal dimension of a video, namely, rearranging scenes and modifying the time flow. Matchmoving refers to matching the camera movement with the motion of an artificial object, in order to place the object correctly in each frame. Finally, compositing takes care of the seamless composition of two, or more, sequences.

In the world of professional video editing there exist plenty of sophisticated tools for each of the three tasks. For instance, sequencing can be performed using Adobe® Premiere® or Apple's Final Cut Pro®. In high-end production Boujou is the common matchmoving tool, and Adobe® After Effects® or Sony® Vegas™ are used for compositing. Unfortunately, these very expensive software packages require high user skills and intensive user interaction.

In amateur video editing there is a lack of matchmoving and compositing tools. There are sequencing tools that are home-user oriented, like Adobe® Premiere Elements® and CyberLink® PowerDirector®, but these are limited to basic effects, such as scene transitions and textual and image overlays. Although there are open-source tools like Blender, which is capable of matchmoving, their ease of use for untrained users is questionable. To perform even basic camera motion modeling the user is required to be an expert in the tool. Furthermore, the level of user interaction is very high. Therefore amateurs rarely perform any video edits beyond scene arrangement and textual overlays.

In this paper we propose a user-friendly system for amateur matchmoving. We do not presume to compete with the professional tools for matchmoving, but rather seek to achieve acceptable performance with as simple user interaction as possible. The main contribution of the paper is a system that:

• allows adding multiple 3D objects or a volumetric texture to a video (see Figure 1),

• reduces the user interaction to a minimum by eliminating the need to check every video frame,

• enables adding an object or 3D texture with trivial interaction, and

• renders a basic composite video.

To achieve the desired simplicity our system follows several steps. First, we represent the entire video as a single mosaic image; this provides the user a simple 2D interaction to inlay artificial objects or textures. Second, we further construct a corresponding mosaic of the surface normals in the scene. The normals mosaic is used by our system for accurate 3D positioning of the inlays. Next, we estimate the camera motion, thus allowing the system to match the motion of the camera to the inlay. Last, we automatically compose the final video by inserting the inlay into the original footage.

The rest of the paper is organized as follows. In Section 1.1 we review recent methods for video editing. Section 2 provides a high-level overview of the proposed system. In Section 3 we describe the video representation used for editing. Section 4 presents the matchmoving method along with the required user interactions, and Section 5 explains the generation of the final video. We present the results of our experiments in Section 6, demonstrate limitations in Section 7 and draw conclusions in Section 8.

1.1 Related work

Amateur video editing has gained increasing attention over the last years. Rav-Acha et al. proposed in [Rav-Acha et al. 2008] an unwarp mosaic video representation that makes it easy to change the textures in the video. With a few brush strokes on the mosaic the user can add new textures, or remove undesired ones, on the surface of objects. Li et al. proposed to use 3D graph cuts for simple and effective segmentation of objects from a video sequence [Li et al. 2005]. Since their system is user-friendly and does not require heavy user interaction, it can be used by amateurs for recomposition.

To address depth effects in a video Chen et al. proposed a “video mesh” – a way to model the given video as a 2.5D sequence [Chen et al. 2011]. In their system the user can easily add depth-of-focus effects or place different layers at different distances from the camera. However, that system involves numerous user actions and precise object selection. Pavic et al. go further and add 3D video effects using 2D interaction only [Pavic et al. 2008]. However, their system is limited to near-planar objects.

Adding more complicated effects to a video may be accomplished by 3D modeling of the video objects. Van den Hengel et al. presented VideoTrace – a system for video-based modeling of objects [van den Hengel et al. 2007]. Their system allows the user to create a realistic 3D model of any object found in a given video. Although VideoTrace is intuitive and easy to use, it still requires much user interaction during the modeling stage. Once the user has a good enough 3D model, any effect can be added. In our work we avoid 3D reconstruction of the video object to make the user interaction simpler.

2 System overview

Before presenting the details of the proposed system we overview its high-level components. Our system consists of three main elements: video modeling, matchmoving, and compositing. A visual overview of the system is provided in Figure 2.

Video modeling. To represent the video in a convenient fashion we model it as a single image. In this work we adopt the unwarp mosaic, proposed by Rav-Acha et al. [Rav-Acha et al. 2008]. The representation maps every pixel in the video to a point in the mosaic. Editing the mosaic image is thus equivalent to changing the texture on the surface of the object-of-interest, in all video frames simultaneously. Furthermore, we propose using another mosaic which reflects the scene structure. It embeds the orientation of each surface in the scene in a single image. Our system combines these mosaics together and uses the resulting representation to attach a 3D inlay to the object's surface.

User-friendly matchmoving. The first part of matchmoving is camera motion estimation. We use the factorization method proposed by Tomasi and Kanade to estimate the camera location and orientation in the scene-centered coordinate system [Tomasi and Kanade 1993].

At the second stage of matchmoving the user adds an inlay to the scene. Our system supports inserting multiple localized objects as well as drawing a volumetric texture. To insert an object the user just drags it to the desired place on the mosaic and scales it accordingly. To add a volumetric texture the user selects the brush profile and draws its location on the mosaic.

The last stage creates a video of the inlay, referred to as the inlay video. To create it we rotate the object according to the camera motion. To keep the inlay attached to the original scene we translate it according to the mosaic representation.

Compositing. To render the final video we compose the original video and the matchmoved inlay video. We adjust the colors of the inlay video to fit the original one. Furthermore, the inlay should agree with the overall quality of the video. We achieve this by estimating the image blur using the technique of Fergus et al. and applying it to the inlay video [Fergus et al. 2006].

In the following sections we describe each of these steps in detail.

3 Video representation

We first describe the construction of the two mosaics: the video mosaic and the mosaic of normals. The video mosaic models the color space of the edited video in a single image. The mosaic of normals represents the structure of the scene in a single image.

3.1 Video mosaic

To guarantee simple user interaction we begin by modeling the entire video as a single image. This is done via the unwarp mosaic approach, proposed by Rav-Acha et al. [Rav-Acha et al. 2008]. Along with the mosaic image we store the dense mapping between the mosaic and the pixels of every video frame. This mapping, denoted by w = (w_x(i, j, t), w_y(i, j, t)), is used later for accurate placement of the inlay in a specific frame. Here, i, j are the indices of the mosaic pixels, and t is the frame index. w_x and w_y are the mappings of the x and y frame coordinates, respectively.
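As a rough sketch (not the authors' implementation), if the dense mapping is stored as arrays w_x, w_y indexed by mosaic pixel and frame (a hypothetical layout), an edit marked on the mosaic can be projected into any frame t by a simple lookup:

```python
import numpy as np

def mosaic_to_frame(edit_mask, wx, wy, t, frame_shape):
    """Project mosaic-space edit pixels into frame t via the dense unwarp mapping.

    edit_mask : (H_m, W_m) bool array of edited mosaic pixels (hypothetical input).
    wx, wy    : (H_m, W_m, T) arrays mapping mosaic pixel (i, j) to frame-t
                coordinates (wx[i, j, t], wy[i, j, t]).
    """
    out = np.zeros(frame_shape, dtype=bool)
    ii, jj = np.nonzero(edit_mask)               # edited mosaic pixels
    x = np.round(wx[ii, jj, t]).astype(int)      # frame x-coordinates
    y = np.round(wy[ii, jj, t]).astype(int)      # frame y-coordinates
    ok = (0 <= x) & (x < frame_shape[1]) & (0 <= y) & (y < frame_shape[0])
    out[y[ok], x[ok]] = True                     # mark the projected pixels
    return out
```

In practice the edit would carry colors rather than a mask, but the lookup pattern is the same.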

To construct such an unwarp mosaic the following cost function is minimized:

E(C, w, b) = E_data^I(C, w, b) + λ_w E_w(w) + λ_b E_b(b)   (1)

where C denotes the colors in the scene and b are time-dependent occlusion masks. E_w denotes the cost function for the unwarp mapping smoothness (spatial and temporal, with different weights), and E_b is a similar smoothness cost for the occlusions. E_data^I encourages the model to predict well the colors of the input frames. For further details on the cost function formulation and the solution method please refer to [Rav-Acha et al. 2008].

To create a good mosaic representation one needs a good feature tracker. In this work we use the particle video tracker proposed by Sand and Teller [Sand and Teller 2008], which deals with complex real-life videos, varying illumination and self-occlusions. Furthermore, the method is capable of tracking points on relatively low-textured surfaces and produces hundreds of tracks. We use the method of Jacobs for completing missing data [Jacobs 2001].


Figure 2: Overview of the system. Given an input video we represent it as a single image (top). Then the camera motion is reconstructed and the user adds an inlay (middle). The system automatically matches the motion of the inlay to the camera and renders the inlay video (bottom, right). Finally, the output video is created by merging the original and inlay videos (bottom, left).

For videos with moving objects the mosaic requires a user-selected object of interest – the basis of our inlay. To separate the object-of-interest from the background we use “video cut and paste” [Li et al. 2005], or video SnapCut [Bai et al. 2009].

3.2 Mosaic of normals

Similarly to the unwarp mosaic above we propose using an additional mosaic, referred to as the mosaic of normals. While the video mosaic represents the colors of the scene in a mosaic space, the mosaic of normals represents the structure of the scene in the same space. The mosaic of normals allows us to model the scene more precisely, which in turn increases the accuracy of positioning of the video inlay. Furthermore, using the same space for both mosaics allows us to optimize them simultaneously, which in turn leads to refined results in both the video mosaic and the mosaic of normals.

To model the scene structure as a mosaic image we propose an “image” formation model similar to the one used for the unwarp mosaic. Inspired by the geometry images representation of [Gu et al. 2002] we define a function S : Q → ℝ³ which maps each point in the mosaic space Q ⊂ ℝ² to a 3D point. Each point in the 3D world is associated with the normal to the surface it lies on, denoted by N, and with its 3D coordinates, denoted by S. Thus, for each point on the mosaic u = (u, v) ∈ Q we wish to associate a corresponding normal N(u). Furthermore, since not all the points of the scene are visible from all the viewpoints we associate a visibility variable b(u) with this point:

b(u) = { 1,  if S(u) is visible
       { 0,  otherwise            (2)

Consequently the “image” of the normals of the scene projected to some camera is:

I_N(x) = π(S(u)) · N(u) · b(u) · J(u)   (3)

where π : ℝ³ → ℝ² is a generalized camera function and J(u) is the Jacobian of the mapping. We further note that π(S(u)) maps the 2D mosaic space to a 2D image. To simplify the notation we denote w(u) = π(S(u)). We also note that for rigid scenes the normals do not change over time. Assuming rigidity for the time being we can write the full “image” formation model as:

I_N(x, t) = w(u, t) · N(u) · b(u, t) · J(u, t)   (4)

To estimate the surface normals we use the method of [Rusu 2009] on the set of sparse points tracked in the construction of the video mosaic.
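The paper defers normal estimation to [Rusu 2009]; a common variant of that approach, sketched below under the assumption of a brute-force neighbor search (our simplification, not necessarily the authors' exact method), takes each point's normal as the least-variance principal axis of its k nearest neighbors:

```python
import numpy as np

def estimate_normals(points, k=8):
    """Estimate a surface normal per 3D point from its k nearest neighbours,
    PCA-style (smallest eigenvector of the local covariance), in the spirit
    of Rusu's point-cloud normal estimation. Brute-force neighbours for clarity.
    """
    normals = np.zeros_like(points)
    for i, p in enumerate(points):
        d = np.linalg.norm(points - p, axis=1)
        nbrs = points[np.argsort(d)[:k + 1]]        # include the point itself
        cov = np.cov((nbrs - nbrs.mean(axis=0)).T)  # 3x3 local covariance
        _, vecs = np.linalg.eigh(cov)               # eigenvectors, ascending
        normals[i] = vecs[:, 0]                     # least-variance axis
    return normals
```

The sign of each normal is ambiguous; a real pipeline would orient them consistently, e.g. toward the camera.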

The next step in the mosaic construction is the formulation of the cost function. Similarly to the data cost used for the unwarp mosaic we define a data cost for the mosaic of normals as:

E_data^N = Σ_t Σ_x ||I_N(x, t) − Î_N(x, t)||_τ   (5)


Figure 3: “Image” formation model for the mosaic of normals. The scene is represented by its normals calculated on a sparse set of points. The normals are projected to the cameras C_i and C_j according to the camera motion and create an “image” of normals.

where Î_N(x, t) is the projection of the scene normals to the current frame, or “image” of normals, as predicted by the model. Here we use a robust norm ||e||_τ = min(||e||, τ) that overcomes the outliers originating from small occlusions. Joining Equations 4 and 5 we get the full data cost for the mosaic of normals:

E_data^N = Σ_t Σ_{x,u} ||I_N(x, t) − w(u, t) N(u) b(u, t) J(u, t)||_τ   (6)
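The truncated norm in Equations 5 and 6 is straightforward to evaluate. A minimal sketch, assuming the observed and model-predicted normal “images” are stored as (T, H, W, 3) arrays (a hypothetical layout):

```python
import numpy as np

def robust_data_cost(I_obs, I_pred, tau):
    """Truncated-norm data cost  sum_t sum_x min(||I_obs - I_pred||, tau).

    I_obs, I_pred : (T, H, W, 3) observed and model-predicted normal 'images'.
    tau           : truncation threshold of the robust norm.
    """
    err = np.linalg.norm(I_obs - I_pred, axis=-1)  # per-pixel error norm
    return np.minimum(err, tau).sum()              # truncate, then sum over t, x
```

The truncation simply caps the penalty of any single pixel at τ, so small occlusions cannot dominate the cost.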

Since the optimization of the mosaic of normals will be performed jointly with the optimization of the video mosaic we do not add any smoothness constraints. Instead we replace the data cost in Equation 1 with a sum of image and normals data costs. The resulting cost function thus is

E(N, C, w, b) = E_data^I(C, w, b) + E_data^N(N, w, b) + λ_w E_w(w) + λ_b E_b(b)   (7)

where E_data^I is as defined by [Rav-Acha et al. 2008] in Section 2. The optimization is performed by iterating between solving for the mapping w and optimizing for the data and smoothness terms, given the mapping w. To solve for w we unwarp a sparse set of tracked points and interpolate to get a dense map.

The suggested optimization process generates two mosaic images: an image that encodes the colors of the scene (the ordinary unwarp video mosaic) and a representation of the direction of the normals of the same scene. Both images are demonstrated in Figure 4. As can be seen, while the unwarp mosaic provides a visual summary of the video, the mosaic of normals provides a corresponding summary of the 3D information.

As mentioned above, the scene is assumed to be rigid for successful normals estimation. For scenes that are not rigid we separate the main object by segmenting it from the background. Then both mosaics are created for this object only. Furthermore, the framework is robust to small non-rigidity of the main object, as demonstrated in the giraffe example in the experiments section.

4 Matchmoving

There are two essential steps in matchmoving – camera motion estimation and object placement. In this section we explain how the proposed system performs both steps with as simple as possible user interaction.

Figure 4: Unwarp mosaic (a) together with the corresponding mosaic of normals (b). In the mosaic of normals each color component encodes a spatial direction: red is up, green is right and blue is left. As one can see, the mosaic of normals reflects the general orientation of the surface normals in the scene. Since the facade on the right faces “east” the corresponding pixels are colored green. The facade facing “west” has lots of balconies, hence there is a mixture of normal directions in the corresponding region.

4.1 Camera motion estimation

For the time being we limit ourselves to the case of rigid objects only. Using the tracks computed for the mosaic we reconstruct the camera motion using the factorization method of Tomasi and Kanade [Tomasi and Kanade 1993]. This method reconstructs the 3D location of the tracked points together with the camera motion. The estimation of the camera motion is typically more stable, hence, we further calculate the camera rotation matrix for every frame. We do not use the camera position since it is reflected more accurately by the mapping obtained in the mosaic global optimization process.

The factorization method reconstructs the camera rotation only up to some arbitrary rotation of the entire coordinate system. We resolve this ambiguity by assuming that the vertical axis of the camera corresponds to the vertical axis of the scene. In practice we estimate the plane of all the camera direction vectors and align it with the ground plane (the xy plane of the world coordinates). Most amateur videos satisfy this assumption. We denote by R_t the recovered rotation matrix between camera t and the world coordinates.
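A sketch of this disambiguation step under the stated assumption: fit the plane spanned by the per-frame camera direction vectors and rotate its normal onto the world z-axis. The SVD plane fit and Rodrigues' formula are our choices for illustration, not necessarily the authors':

```python
import numpy as np

def ground_alignment(cam_dirs):
    """Resolve the factorization's global-rotation ambiguity (sketch):
    fit the plane spanned by the camera direction vectors and return the
    rotation taking its normal to the world z-axis, so the camera plane
    coincides with the ground (xy) plane.

    cam_dirs : (T, 3) array of per-frame camera viewing directions.
    """
    _, _, vt = np.linalg.svd(cam_dirs)     # plane normal = least-variance axis
    n = vt[-1] / np.linalg.norm(vt[-1])
    z = np.array([0.0, 0.0, 1.0])
    v = np.cross(n, z)                     # Rodrigues: rotate n onto z
    c, s = n @ z, np.linalg.norm(v)
    if s < 1e-12:                          # already aligned (or exactly opposite)
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    K = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + K + K @ K * ((1 - c) / s**2)
```

Applying the returned rotation to every recovered camera leaves the direction vectors lying in the ground plane.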

4.2 Placing the inlay

The second part of matchmoving must include user interaction, since we need to know where the user wishes to place the inlay. To avoid interaction with every video frame and to allow simultaneous work with several points of view, the user adds the inlays to the mosaic. Our system supports two different types of inlays: volumetric textures and 3D objects.

4.2.1 Adding 3D objects

To add an object the user selects the desired 3D model from a pool of objects. By clicking a point on the mosaic the model is placed there using the default scale and orientation. To change the location of the object the user can either drag it or click a different location on the mosaic. Scale and in-plane 2D rotation can be easily changed by resetting the zoom and the rotation angle. The entire interaction for placing a custom object is illustrated in Figure 5.

Figure 5: The process of placing an object. First the user selects the 3D model (left). Then the location of the model is chosen by clicking on the mosaic (center). Here three different locations are selected by the user (marked by arrows). The system automatically chooses the frontal frame for each inserted model and renders them in the correct 3D orientation (right).

Figure 6: The process of adding a volumetric texture. The user selects the desired brush profile and width and draws on top of the mosaic (left). Our system then adds a volumetric texture along the brush stroke (center) and renders it correctly in each frame (right).

The 3D orientation of the inserted model is computed automatically by the system. To achieve this we first define the anchor region of the mosaic as the part of the mosaic directly contacting the inlay. We use the mapping between the anchor region and the video to select the “frontal frame”, i.e., the frame in which the inlay will appear frontally. If the anchor region has sources in several frames, we use the median frame as the frontal one. By construction, the coordinate systems of the frontal frame and of the mosaic are the same; thus, having the rotations between the cameras of different frames, the system is able to automatically match the rotation of the camera and the 3D mesh. Matching of the translation is handled by using the mosaic mapping.

Finally, we use the mosaic of normals to increase the accuracy of the 3D inlay orientation. Since the system knows the surface normals inside the anchor region we can align it with the object base. This is accomplished by approximating the object base as planar and estimating the complementary plane from the surface normals of the anchor region. Then we align the two planes to be parallel by rotating the inlaid object. This, together with the frontal frame estimation, allows us to “stick” the object to any surface in the video.

4.2.2 Adding a volumetric texture

An intuitive way of adding a volumetric texture is “brushing” it on top of the mosaic. Our system offers the user a selection of several volumetric texture brushes with different profiles. The user can vary the brush width and texture scale as he/she pleases. The process of adding a volumetric texture to the mosaic is illustrated in Figure 6.

Our system tiles the marked region of the stroke with appropriately scaled texture units and removes the parts that are outside the stroke boundaries (Figure 6, center). Each texture unit is treated as a separate 3D object and is automatically placed on top of the mosaic.
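A minimal sketch of the tiling step, assuming the stroke is given as a binary mosaic mask and texture units are anchored on a regular grid (both hypothetical simplifications of the system's brush machinery):

```python
import numpy as np

def tile_stroke(stroke_mask, unit_size):
    """Tile a brushed stroke region with texture-unit anchor points: lay units
    on a regular grid and keep only those whose centres fall inside the stroke.
    Each kept unit would then become a separate 3D object on the mosaic.
    """
    H, W = stroke_mask.shape
    anchors = []
    for i in range(unit_size // 2, H, unit_size):      # grid of unit centres
        for j in range(unit_size // 2, W, unit_size):
            if stroke_mask[i, j]:                      # centre inside the stroke
                anchors.append((i, j))
    return anchors
```

Clipping texture geometry that crosses the stroke boundary, as the system does, would be a separate per-unit step.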

4.2.3 The benefits of the mosaic

An alternative to using the mosaic representation could be inserting the object in a single video frame. While this solution is simpler it has several disadvantages. First, when several objects are to be placed at different locations, frame-based placement will require interaction with multiple frames, thus making relative positioning of the inserted objects difficult. By using the mosaic the user can easily position the inserted objects relative to each other, in various view points.

Second, the mosaic is even more crucial when adding volumetric textures, since every part of the texture has an independent “frontal frame”. This makes drawing a texture on a single frame almost impossible. Furthermore, one continuous texture may cover larger regions that are never seen in a single frame. Using the mosaic one can add a volumetric texture with a single stroke to scene parts that are never observed in a single frame, as illustrated in Figure 7.

4.3 Rendering the inlay video

To render the inlay video we assume that the inlay is defined by a 3D mesh. We start by denoting all the vertices of the 3D mesh, concatenated into a single matrix, as

V_mesh = [ x_1 x_2 · · · x_n
           y_1 y_2 · · · y_n
           z_1 z_2 · · · z_n ]   (8)

where n is the number of vertices in the mesh and (x_i, y_i, z_i)^T are the 3D coordinates of each vertex in the mesh's coordinate system. We also assume that every object has a model base, which will be snapped to the mosaic, in the x–y plane.

We set the mosaic to be in the z = 0 plane of its coordinate system. We align the base of the inlay mesh with the mosaic. The rotation matrix from the mesh space to the mosaic space is denoted as M (see Figure 8). We additionally denote the mesh scale as S.

Figure 8: Coordinate systems used in the inlay positioning. The camera motion estimation process computes the rotations between the scene coordinate system and each camera's image coordinates (R_t). The frontal camera coordinate system, which is parallel to the mosaic's one, is used as a reference for all the cameras (R̄_t). The user interaction defines the rotation between the coordinates of the inlay mesh and the mosaic's (M). Note that the scene coordinate system is not used by our system.

To render the inlay video we calculate the position and orientation of the inserted mesh in each frame. Figure 8 displays the transformation process for the orientation calculation. First, we compute V_f, the vertices of the mesh in the frontal frame. Since the mesh should be displayed in the frontal frame in its frontal view, i.e., exactly as displayed on the mosaic, there is no rotation between the mosaic coordinate system and the frontal frame, and V_f = V_mos. The transformation between the mesh and the mosaic coordinates is:

V_mos = S · M · V_mesh   (9)

We compute the rotation R̄_t between the frontal frame f and each frame t, as follows:

R̄_t = R_t^{-1} · R_f   (10)

where R_t is the rotation between camera t and the scene, and R_f is the rotation of the frontal frame with respect to the scene (as computed in Section 4.1). The final coordinates of the mesh in frame t are therefore:

V_t = R̄_t · V_mos   (11)
    = R_t^{-1} · R_f · S · M · V_mesh   (12)

To place the inlay in the correct location we use the warp mapping (w_x, w_y) from the mosaic representation. We accomplish this by a two-dimensional shift of all the mesh's vertices:

V_t = V_t + [ w_x(a_x, a_y, t) · · · w_x(a_x, a_y, t)
              w_y(a_x, a_y, t) · · · w_y(a_x, a_y, t)
                     0         · · ·        0        ]   (13)

where w_x(a_x, a_y, t) and w_y(a_x, a_y, t) are the horizontal and the vertical components of the warp mapping for frame t, and (a_x, a_y) is the center of mass of the anchor region.
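Equations (9)-(13) amount to a short chain of matrix operations. A sketch, assuming the warp mapping is stored as dense arrays and (ax, ay) indexes the anchor's center of mass (a hypothetical data layout):

```python
import numpy as np

def inlay_vertices(V_mesh, M, S, R_f, R_t, wx, wy, ax, ay, t):
    """Per-frame inlay vertices (Eqs. 9-13): rotate the mesh into the mosaic
    frame (scale S, rotation M), carry it from the frontal frame f to frame t
    via R_t^{-1} R_f, then shift in-plane by the warp mapping evaluated at the
    anchor region's centre of mass (ax, ay).

    V_mesh : (3, n) vertex matrix;  M, R_f, R_t : (3, 3) rotation matrices.
    """
    V_mos = S * (M @ V_mesh)                  # Eq. (9): mesh -> mosaic
    V_t = np.linalg.inv(R_t) @ R_f @ V_mos    # Eqs. (10)-(12): frontal -> frame t
    shift = np.array([[wx[ax, ay, t]], [wy[ax, ay, t]], [0.0]])
    return V_t + shift                        # Eq. (13): anchor translation
```

Since R_t is a rotation, its transpose could replace the explicit inverse.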

We render every frame of the inlay video on a transparent background to enable easy composition. The inlays are rendered using the vertices V_t of Equation (13) and the original set of faces with their colors and textures. We light the model using a combination of 50% ambient and 50% directional light, placed at infinity behind the frontal camera. We use the Gouraud lighting model.

5 Compositing

Figure 9: Video frame compositing. (a) An original video frame. (b) The corresponding frame of the inlay video. (c) A composite frame without blurring. (d) The final composite frame, after blurring the inlay and merging.

Finally we create a single composite video by merging the original video and the inlay video, which includes all the artificial objects. The original video serves as the background of the composite.


Figure 7: Boeing sequence with inlaid texture. The red bumps were easily added to the side of the airplane, even though the entire edited region of the airplane is never seen in a single frame. In the mosaic the entire airplane is captured, making the inlay process feasible.

Since the inlays are artificial we blur the inlay video using a kernel computed for every frame using the method proposed by Fergus et al. [Fergus et al. 2006]. We then merge the two videos together using transparency as a mask. All the pixels that are transparent in the inlay video are replaced by the original video pixels. An example of the compositing process is depicted in Figure 9.
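A minimal sketch of the per-frame compositing step, assuming the rendered inlay frame carries an alpha channel and the estimated blur kernel is given as a small normalized array (the naive convolution is for illustration only, not the system's renderer):

```python
import numpy as np

def composite_frame(background, inlay_rgba, blur_kernel):
    """Composite one frame: blur the rendered inlay with the per-frame kernel
    estimated from the footage, then alpha-blend it over the original frame,
    so transparent inlay pixels fall back to the background.

    background  : (H, W, 3) float frame;  inlay_rgba : (H, W, 4), alpha in [0, 1];
    blur_kernel : (k, k) normalized kernel (hypothetical inputs).
    """
    kh, kw = blur_kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(inlay_rgba, ((ph, ph), (pw, pw), (0, 0)), mode="edge")
    blurred = np.zeros_like(inlay_rgba)
    for dy in range(kh):                   # naive convolution of the RGBA frame
        for dx in range(kw):
            blurred += blur_kernel[dy, dx] * padded[
                dy:dy + inlay_rgba.shape[0], dx:dx + inlay_rgba.shape[1]]
    a = blurred[..., 3:4]                  # blurred alpha doubles as blend mask
    return a * blurred[..., :3] + (1.0 - a) * background
```

Blurring the alpha channel together with the colors softens the inlay's silhouette, which is what keeps the composite from looking pasted-on.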

6 Experiments

We next test the proposed system on a diverse set of real-life videos. Since the results are video clips we highly recommend that the reader watch the supplementary video.

The first video is a visualization of a spinning Earth as seen from outer space. The original video is high definition and was downloaded from YouTube. Figure 10 depicts the results of inlaying a model of the Empire State Building in New York and adding a mountain chain along the west coast. The only interaction required to add these inlays was to place them on top of the mosaic. As can be seen, the inlays preserve the correct orientation during the spin and stay in the same place on the Earth.

We next demonstrate the ease of adding multiple objects to a video. We pick another video for this – the Escala sequence – aerial footage of a lodge, captured from a helicopter. The helicopter was carefully controlled to preserve the stability and smoothness of the shot. The original high-definition video was downloaded from YouTube. Figure 11 shows three objects added to the original sequence: a balcony on the front wall and two different wall lights on the sides. The only required user interaction is to place the objects on top of the mosaic in the desired locations. As one can see, the objects are added seamlessly to the video and follow the camera motion.

In Figure 12 we use the face sequence of Rav-Acha et al. [Rav-Acha et al. 2008]. Since [Rav-Acha et al. 2008] can only add patterns to videos, they added facial hair to this video. Our system extends this to 3D, and hence we have added a rose above the right ear.

Our next example presents a different camera motion. Instead of rotational motion, we demonstrate the applicability of the proposed approach to another type of motion. We use a video of an airplane passing by the camera and add a texture of bumps to the side of the plane. As one can see in Figure 7, the texture is rendered correctly throughout the sequence.

In the following example we use an aerial video of a cruise ship. With very simple user interaction we manage to add a gigantic lifeboat to its side. As one can see in Figure 13, the boat moves and rotates together with the ship.

The last example presents a new challenge – a non-rigid object. We use a video of a giraffe rotating its head, downloaded from YouTube. Despite the non-rigidity, our system worked well and we managed to inlay a flower into the giraffe's mouth. The results are depicted in Figure 14.

7 Limitations

The proposed system has a few drawbacks. First, we inherit the main limitation of the mosaic representation – the restriction to objects with disk-like topology. However, when the system is applied to a rigid scene this limitation is less restrictive, since at a large scale the scene can usually be approximated as having disk-like topology. Errors usually occur near small details; while they do not propagate through the rest of the output video, an inlay placed in these regions may be positioned incorrectly.

Second, the system does not handle occlusions of the inlay by video objects. Since we model only the surface of the objects in the input video, and not their depth, they cannot occlude the inlay. This limits the locations where the inlay can be positioned.

The last limitation concerns dolly camera motion. Because of ambiguity in camera motion estimation, the system does not handle zooming and dolly. This limits the type of footage with which the system can be used. We believe this limitation can be resolved by incorporating a more sophisticated camera motion method into the system.

8 Conclusions

In this paper we presented a system for amateur video editing. Our system allows the user to perform matchmoving in several simple steps. By representing the video as a mosaic we achieve both simplicity of interaction and the ability to cover multiple frames with a single stroke. To complete the matchmoving we automatically estimate the camera motion throughout the sequence. We experimentally show that the proposed system is capable of editing a wide range of videos.

To better automate the inlay placement process we represent the structure of the scene by a mosaic of normals. This representation is created together with the video mosaic and used to accurately place the inlaid object. We demonstrate that this additional model further simplifies the user interaction required by the system.

Acknowledgements

The research of Lihi Zelnik-Manor is supported in part by the Ollendorf Foundation, the Israel Ministry of Science, and by the Israel Science Foundation under Grant 1179/11.

References

BAI, X., WANG, J., SIMONS, D., AND SAPIRO, G. 2009. Video SnapCut: robust video object cutout using localized classifiers.


Figure 10: Earth sequence with inlaid building and mountains. We used a video of the spinning Earth and added a cartoon-style building and mountains to it. It can be seen that both inlays preserve correct location and orientation. The frames are enlarged and cropped for better visibility.

Figure 11: Escala sequence with three inlaid objects. We added three artificial objects to the aerial shot of a lodge. The inlays remain attached to the walls of the building and preserve correct orientation. The frames are enlarged and cropped for better visibility.

ACM Transactions on Graphics (Proc. of SIGGRAPH) 28 (July), 70:1–70:11.

CHEN, J., PARIS, S., WANG, J., MATUSIK, W., COHEN, M., AND DURAND, F. 2011. The video mesh: A data structure for image-based three-dimensional video editing. In Proceedings of the International Conference on Computational Photography (ICCP'11).

FERGUS, R., SINGH, B., HERTZMANN, A., ROWEIS, S. T., AND FREEMAN, W. 2006. Removing camera shake from a single photograph. ACM Transactions on Graphics (Proc. of SIGGRAPH) 25, 787–794.

GU, X., GORTLER, S. J., AND HOPPE, H. 2002. Geometry images. In ACM Transactions on Graphics (TOG), vol. 21, ACM, 355–361.

JACOBS, D. 2001. Linear fitting with missing data for structure-from-motion. Computer Vision and Image Understanding 82, 1, 57–81.

LI, Y., SUN, J., AND SHUM, H. 2005. Video object cut and paste. ACM Transactions on Graphics (Proc. of SIGGRAPH) 24, 3, 595–600.

PAVIC, D., SCHOENEFELD, V., KRECKLAU, L., HABBECKE, M., AND KOBBELT, L. 2008. 2D video editing for 3D effects. In Proceedings of the 13th International Fall Workshop on Vision, Modeling, and Visualization (VMV), 389–398.

RAV-ACHA, A., KOHLI, P., ROTHER, C., AND FITZGIBBON, A. 2008. Unwrap mosaics: A new representation for video editing. ACM Transactions on Graphics (SIGGRAPH) (August).

RUSU, R. B. 2009. Semantic 3D Object Maps for Everyday Manipulation in Human Living Environments. PhD thesis, Computer Science Department, Technische Universitaet Muenchen, Germany.

SAND, P., AND TELLER, S. 2008. Particle video: Long-range motion estimation using point trajectories. International Journal of Computer Vision 80, 1, 72–91.

TOMASI, C., AND KANADE, T. 1993. Shape and motion from image streams: a factorization method. Proceedings of the National Academy of Sciences 90, 21, 9795.

VAN DEN HENGEL, A., DICK, A., THORMAHLEN, T., WARD, B., AND TORR, P. 2007. VideoTrace: rapid interactive scene modelling from video. In ACM Transactions on Graphics (TOG), vol. 26, ACM, 86.


Figure 12: Face sequence with inlaid rose. A red rose was easily added above the subject's ear by placing it on top of the mosaic.

Figure 13: Ship sequence with inlaid boat. The boat rotates together with the ship and stays attached to it. The frames are enlarged and cropped for better visibility.

Figure 14: Giraffe sequence with inlays. Despite the non-rigid nature of the giraffe's motion, our system manages to matchmove the flower (top) and the piercing (bottom) with the video correctly. As one can see, the inlays remain attached and rotate correctly.