Trends in Advanced Video Representation Techniques

Broadcast Technology No.60, Spring 2015 © NHK STRL

The recent applications of computer vision and computer graphics (CG) have caused a dramatic expansion in the variety of video representations in TV programs and the like. This paper describes the trends in representation techniques that use video analysis in program production. We also comment on issues affecting video production, present approaches to their resolution, and give examples of video representation applications using video analysis with real-time processing.

1. Introduction

A wide variety of video representations are incorporated in TV programs. In TV drama in particular, it is becoming commonplace to see highly realistic representations created by adding visual effects (VFX) to digital video footage during the processing and editing stage (called post-production, or "postpro").

VFX is used for a wide variety of purposes, from the striking, e.g., the depiction of scenes that are impossible in real life, to the inconspicuous, such as the removal of manmade structures that would not have existed during the times in which a period drama is set. However, while VFX satisfies various production needs, the content creator must invest a great deal of effort in using it. This labor-intensive aspect is a major issue because content creators also have to do all of the program production work they would normally do and productions are constrained in terms of time and cost.

Under these circumstances, there has been growing recognition of the importance of process management in program production and of proper management of video materials. For example, new workflows have been proposed, such as the scene-linear workflow, in which the original data (RAW data) of the captured video is left untouched and saved together with related metadata, such as lens information acquired during filming. This video information is then handled in a consistent way, i.e., "linearly," within the same color space 1) 2). While such management makes it unlikely that video materials will be damaged or degraded, it also removes much of the subjectivity of the creator from the production environment. That is, while the aim is to make the work involved in VFX more efficient so that the creator can concentrate on purely creative activities, being creative requires technology that supports the complicated work inherent to each process. This paper describes issues faced by broadcasters in video production and shows how analysis techniques such as computer vision *1 can support VFX. It also describes new applications of video analysis in program production.

2. Issues of video production faced by broadcasters

Program production essentially consists of research, planning, materials recording, post-production, and transmission. Image processing tasks such as VFX are additional steps in this process.

As shown in Table 1, the objective of materials management is not simply to collect raw video footage, but also to enable the efficient handling of large volumes of video footage by providing functions to list and search the available materials. At present, however, except for labels indicating temporal information, users must add their own labels to enable the content of the video to be searched. Given the constraints of production, it is impractical to manually add detailed labels to a huge amount of video. Moreover, once the desired video images have been retrieved, the job of adding information for processing that video footage has just begun.

The pre-visualization ("previz") step in this table creates a rough video for imagining what the finished video will be like 3). Pre-visualizations are used for selecting lenses or for showing producers and performers, for example, what the finished video would look like. The CGs of pre-visualizations are simplified but still require substantial knowledge and effort to produce. Hence, the previz step is omitted in TV program productions when the preparation period before filming is too short. In such cases, producers can use storyboards to convey their ideas about the finished video to the performers and staff, but this form of expression does not facilitate intuitive comprehension the way previz does.

As regards the subject area extraction step listed in the table, it often happens that it is not possible to use chroma key on location. In such cases, it is necessary to create the subject area information by hand (matte generation, also called key generation). This task takes considerable time and requires experience and skill, and it is especially difficult in TV program productions that have many time-related constraints.
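For reference, matte generation with chroma key reduces to a per-pixel color test, which is exactly what must be replicated by hand when no blue screen is available. The following is a minimal sketch in Python; the backing color and tolerance are hypothetical values, and production keyers add soft edges, spill suppression, and garbage mattes on top of this.

```python
import numpy as np

def chroma_key_matte(frame_bgr, key_bgr=(255, 0, 0), tol=80.0):
    # frame_bgr: HxWx3 uint8 frame shot against a backing color.
    # key_bgr and tol are illustrative values for a blue screen.
    # Pixels close to the backing color become background (0),
    # everything else becomes subject (255).
    diff = frame_bgr.astype(np.float32) - np.float32(key_bgr)
    dist = np.linalg.norm(diff, axis=2)   # per-pixel color distance
    return np.where(dist > tol, 255, 0).astype(np.uint8)
```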

How to obtain camera movement information is another issue. This information is used to composite CGs and live-action images. To ensure that the composited image appears natural, it is necessary to match the environment information of the live-action filming space with that of the CG space, as shown in Table 2. In particular, to make natural-appearing composited images, the movements of the camera and the lighting conditions during the live-action filming have to be recorded and used in making the CGs. However, acquiring this information places further constraints on the production; for instance, it may require special equipment and measurements to be made on the set.


*1 A technique that uses a computer to analyze images captured by a camera to elucidate the world of the subject and the relationship between that world and the camera.


As the above has illustrated, there is a strong demand for technical support that frees content creators from having to do complicated and time-consuming production work that can sap their creativity.

3. Approaches to resolving the issues

The problems outlined in the previous section boil down to two main issues: how can the desired video footage be retrieved efficiently from a large volume of raw video, and how can helpful information (metadata) be obtained quickly and easily in each step of the image processing?

This section first deals with the question of how to make video production more efficient. It then gives an overview of technologies for acquiring camera orientation information, subject area information, and lighting information that can lighten the burden on workers at the production site.

3.1 Making the video production workflow more efficient

At the start of a typical video production workflow, there is only the video footage that has been filmed; the metadata necessary for image processing is acquired afterwards by using various video analysis tools and authoring tools. At present, most of this work cannot be done automatically. Moreover, it usually requires experience and a great deal of effort and has to be accomplished within strict time constraints. The use of file-based content within a broadcast station can make the production environment more efficient, not just by facilitating centralized management and non-linear editing, but by providing new ways of making content.

To address the issues raised above, we have proposed a “video bank” that automatically assigns metadata for image processing in addition to metadata for searching by using video analysis techniques and sensor technology. A conceptual diagram is shown in Figure 1.

The video bank performs video analysis around the clock by collecting and automatically assigning metadata to new and recorded video. Thus, if a content creator searches the video bank for raw video footage by keyword, he or she can simultaneously acquire the metadata necessary for image processing.
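To make the idea concrete, a video bank record might pair search metadata with processing metadata along the following lines. This schema is purely illustrative; the field names are assumptions for this sketch, not the actual system.

```python
from dataclasses import dataclass, field

@dataclass
class Clip:
    clip_id: str
    keywords: list = field(default_factory=list)      # metadata for searches
    camera_track: list = field(default_factory=list)  # per-frame camera orientation
    mattes: dict = field(default_factory=dict)        # frame index -> subject matte
    lighting: dict = field(default_factory=dict)      # estimated lighting conditions

def search(bank, term):
    # A keyword search returns the footage together with the processing
    # metadata that was assigned automatically, so compositing work can
    # start immediately.
    return [c for c in bank if term in c.keywords]

bank = [Clip("c001", keywords=["castle", "dusk"])]
print(search(bank, "castle"))
```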

Another issue is that if there is insufficient camera orientation information and lighting information on the captured video images to act as clues for analysis, it becomes difficult to obtain enough metadata for image processing. To alleviate this problem, the video bank stores additional information gathered from sensors that operate in a way that does not interfere with filming.

We are working on increasing the accuracy and robustness of each technology that makes up the video bank, with the aim of putting this system to practical use.

Table 1: List of VFX tasks

Task: Materials management
Objective and details: Version management, searching, etc., of materials
Typical technique: Version management and searching of materials using a video database (labels attached manually)

Task: Pre-visualization (previz)
Objective and details: A simple visualization for imagining the finished video before filming starts
Typical technique: CG production using authoring tools (software for creating information such as the shapes and movements of CG subjects)

Task: Post-visualization (postviz)
Objective and details: A simple visualization of the finished video made directly after filming
Typical technique: Provisional compositing of simple CG and live-action video

Task: Camera tracking (acquisition of camera orientation information)
Objective and details: Composite video images in which the movements of the camera (live action) match the movements of the CG
Typical technique: Estimating camera movement by analyzing captured video images and using sensors mounted on tripods or cranes that can measure the camera orientation

Task: Color correction
Objective and details: Adjustment of color and matching of color between cameras and between video materials
Typical technique: Linear or non-linear scaling (expansion or compression processing) in a specific color space

Task: CG production
Objective and details: Raw images for video compositing
Typical technique: Preparation of textures (surface patterns) using a CG authoring tool or paint tool, modeling, rendering, etc.

Task: Extraction and compositing of subject area
Objective and details: Extraction and compositing of the subject area by using video analysis
Typical technique: Chroma key (filming against a background such as a blue screen, then using color information to separate the subject and background areas in the video images); or specifying subject areas by hand, e.g., preparing a mask by rotoscoping (tracing the outline of the subject by hand from the captured video image)

Table 2: Matching of environmental information in the production of naturally appearing composite video images

Temporal matching: Factors such as timestamps and shutter speeds of composite materials
Optical matching: Factors such as lighting conditions and mutual reflections of composite materials
Spatial matching: Factors such as coordinate system, scaling, and camera parameters (lens state, position, and orientation) of composite materials


3.2 Camera orientation information acquisition technique

Camera orientation information is necessary for making natural-looking composites of live-action images and CGs. If the camera in the CG world does not move in the same way as the live-action camera, the resulting video appears unnatural. For example, if inaccurate camera orientation information is used in compositing, the CG may appear to slide or sway unnaturally in relation to the subject.

The current techniques for acquiring camera orientation information use physical sensors, which are subject to displacement inaccuracies. In contrast, the camera orientation can be determined more accurately by video analysis of the captured images, so that CG objects can be composited with no displacement. However, conventional video analysis techniques have the disadvantage that the estimation depends on the filming environment and video content, and in some cases it has to be done manually.

With the aim of automating and increasing the accuracy of camera orientation information acquisition, we have developed a technique that augments the video analysis with physical sensor information when the analytical estimate becomes unstable 4). The estimation is based on bundle adjustment 5).

Figure 1: Conceptual diagram of the video bank (video and sensor information flow in; metadata for searching and for processing is generated automatically by video analysis, with check tools for verification, correction, and addition)

Figure 2: Basic principle of camera orientation estimation (known feature points on the image planes of successive frames; unknown 3D feature points and camera orientations; re-projection errors between them)


Bundle adjustment first extracts feature points, such as the corners of subjects, in each frame of the captured video and then tracks them between frames. Since these feature points have 2D coordinates, they can be placed on the image planes as shown in Figure 2. The 3D position of each feature point and the orientation of the camera, on the other hand, are unknown, so each is allocated a suitable initial value and the 3D coordinates of the feature points are re-projected onto the image planes. The camera orientation is obtained by iterative processing that minimizes the sum of the errors (re-projection errors) between each re-projected point and the corresponding observed feature point. Bundle adjustment can also be used to estimate lens distortion and focal length.
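A minimal numerical sketch of this idea follows, assuming a simple pinhole camera with a fixed focal length and using SciPy's general least-squares solver. A production implementation would exploit the problem's sparsity and also estimate the lens parameters mentioned above.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

F = 1000.0  # assumed focal length in pixels (could itself be estimated)

def project(pts3d, rvec, tvec):
    # Pinhole projection of Nx3 world points through a camera pose (rvec, tvec).
    pc = Rotation.from_rotvec(rvec).apply(pts3d) + tvec
    return F * pc[:, :2] / pc[:, 2:3]

def residuals(params, n_cams, n_pts, obs, cam_idx, pt_idx):
    # Re-projection errors between observed and re-projected feature points.
    cams = params[:n_cams * 6].reshape(n_cams, 6)   # 6 DOF per camera
    pts = params[n_cams * 6:].reshape(n_pts, 3)     # 3D feature positions
    err = [project(pts[j:j + 1], cams[i, :3], cams[i, 3:])[0] - obs[k]
           for k, (i, j) in enumerate(zip(cam_idx, pt_idx))]
    return np.concatenate(err)

# Tiny synthetic example: 3 camera poses observing 4 tracked feature points.
rng = np.random.default_rng(0)
true_pts = rng.uniform(-1.0, 1.0, (4, 3)) + np.array([0.0, 0.0, 10.0])
true_cams = np.zeros((3, 6))
true_cams[:, 3] = [0.0, 0.5, 1.0]                   # camera slides sideways
cam_idx = np.repeat(np.arange(3), 4)
pt_idx = np.tile(np.arange(4), 3)
obs = np.vstack([project(true_pts[j:j + 1], true_cams[i, :3], true_cams[i, 3:])
                 for i, j in zip(cam_idx, pt_idx)])

# Start from perturbed initial values and minimize the re-projection error.
x0 = np.concatenate([(true_cams + 0.05).ravel(), (true_pts + 0.1).ravel()])
fit = least_squares(residuals, x0, args=(3, 4, obs, cam_idx, pt_idx))
print("re-projection RMS after optimization:", np.sqrt(np.mean(fit.fun ** 2)))
```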

A factor that dictates the accuracy and robustness of the estimate is the correspondence of feature points between frames. Here, mismatches can be detected by applying a condition that the topological relationships between feature points in a frame should carry over to the next frame. The results of mismatch detection are shown in Figure 3; each pink triangle in Figure 3(b) indicates a mismatched feature point.
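The topology-based check is specific to this technique, but the general idea of flagging tracks that disagree with their neighbors can be sketched with a standard RANSAC model fit as a stand-in, assuming the dominant inter-frame motion is roughly homography-consistent.

```python
import numpy as np
import cv2

def flag_mismatches(prev_pts, next_pts, thresh_px=3.0):
    # prev_pts/next_pts: Nx2 float32 arrays of corresponding feature points
    # in consecutive frames (N >= 4). A RANSAC-fitted homography models the
    # dominant motion; tracks disagreeing with it by more than thresh_px
    # are flagged as suspected mismatches. (A common stand-in, not the
    # topology-based method described above.)
    _, inlier_mask = cv2.findHomography(prev_pts, next_pts, cv2.RANSAC, thresh_px)
    return ~inlier_mask.ravel().astype(bool)   # True = suspected mismatch
```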

The camera orientation cannot be accurately estimated when the feature points are distributed unevenly within the frame. We can use this unevenness, together with the re-projection error, to evaluate the accuracy of the estimate. If it is poor, we can instead measure the orientation by attaching a hybrid sensor 6) to the camera. This sensor is equipped with a micro-electromechanical systems (MEMS) gyroscope and a sensor camera aimed at the floor.
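One simple way to quantify such unevenness is grid coverage of the tracked feature points; this is an illustrative measure, not necessarily the exact criterion used.

```python
import numpy as np

def grid_coverage(pts, width, height, grid=4):
    # Fraction of grid cells that contain at least one feature point.
    # Low coverage means the points are bunched up, which degrades the
    # orientation estimate and argues for falling back to the hybrid sensor.
    gx = np.clip((pts[:, 0] / width * grid).astype(int), 0, grid - 1)
    gy = np.clip((pts[:, 1] / height * grid).astype(int), 0, grid - 1)
    occupied = set(zip(gx.tolist(), gy.tolist()))
    return len(occupied) / grid ** 2
```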

3.3 Area division and extraction technique

The area division and extraction technique extracts specific subject areas from the filmed video footage so that they can be composited with different background video. If a still image is the target, a software technique such as GrabCut 7) *2 can be used to roughly distinguish the subject area to be extracted, and graph cut 8) *3 can then be used to perform an optimal extraction. It uses a Gaussian mixture model *4 to determine the color distributions of the extracted subject area and the other areas. More complicated processing is necessary for moving images. In this case, an initial tracing value is manually assigned to the outline of the subject in a specified frame, and the outline is then traced through the following frames to extract the subject areas. Errors in tracking the outline of the subject have to be corrected manually. This sort of work requires an experienced operator, since the quality of the resulting composite depends on how the initial tracking values are allocated.
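For the still-image case, OpenCV ships a GrabCut implementation that can be driven from a rough bounding box; the file name and rectangle below are hypothetical.

```python
import numpy as np
import cv2

img = cv2.imread("frame.png")                  # hypothetical input frame
mask = np.zeros(img.shape[:2], np.uint8)
rect = (50, 50, 400, 300)                      # rough box around the subject
bgd = np.zeros((1, 65), np.float64)            # internal GMM state (background)
fgd = np.zeros((1, 65), np.float64)            # internal GMM state (foreground)

# GrabCut fits Gaussian mixture models to the foreground and background
# colors and solves the labeling with graph cuts.
cv2.grabCut(img, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
matte = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD),
                 255, 0).astype(np.uint8)
cv2.imwrite("matte.png", matte)
```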

To address these issues, we have developed a technique that can automatically extract a specific subject area throughout a set of frames 9). It turns the moving images into voxels (video information in which pixel values are arranged in a 3D space formed by the two horizontal and vertical axes of 2D imagery plus a time axis). The mean-shift method *5 is used to segment the images according to their color information. An overview of the procedure is shown in Figure 4. Subjects are tracked by determining whether or not pixels are adjacent in the voxel space. Several sets of area divisions with different granularities are prepared so that the accuracy of the extraction can be tuned by dividing each frame into a suitable number of areas.
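A toy version of such spatio-temporal over-segmentation can be written with an off-the-shelf mean-shift clusterer over (x, y, t, color) feature vectors. The weights and bandwidth are illustrative assumptions, and the published method is far more scalable than this sketch, which is only practical for a small video volume.

```python
import numpy as np
from sklearn.cluster import MeanShift

def segment_voxels(video, spatial_w=1.0, time_w=2.0, bandwidth=30.0):
    # video: (T, H, W, 3) uint8 volume. Each voxel becomes a feature
    # vector of scaled coordinates plus color; mean-shift clustering of
    # these vectors yields spatio-temporal regions, and regions adjacent
    # in voxel space let subjects be tracked across frames.
    T, H, W, _ = video.shape
    t, y, x = np.meshgrid(np.arange(T), np.arange(H), np.arange(W),
                          indexing="ij")
    feats = np.column_stack([
        x.ravel() * spatial_w, y.ravel() * spatial_w, t.ravel() * time_w,
        video.reshape(-1, 3).astype(np.float32),
    ])
    labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit_predict(feats)
    return labels.reshape(T, H, W)
```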

This technique thus consists of two phases: area division and subject area extraction. The first is automated and done beforehand. The area extraction processing is based on GrabCut, which is fast and interactive.

3.4 Lighting information acquisition technique

Composite images can be made to look more natural by using lighting information gathered from the filming space in the CG rendering. A commonly used technique for doing so is to install a spherical mirror in the vicinity of where the CG will be composited, shoot it with a digital camera, and acquire ambient lighting information over a wide dynamic range 10).

Figure 3: Results of detecting feature point mismatches ((a) filmed video image; (b) triangular patches (pink regions) that include mismatched points)

*2 A technique of separating areas using graph cuts. A Gaussian mixture model represents the color distributions of the subject area to be extracted and the background area.

*3 An optimization technique that minimizes an energy function.

*4 A model that is represented as a number of superimposed Gaussian distributions.

*5 An iterative technique for finding the modes (local maxima) of a density distribution by repeatedly shifting each sample toward the local mean.


There are a number of problems with this approach, such as the difficulty of dealing with varying lighting conditions. It is also necessary to acquire high-dynamic-range information, which means that either an ordinary camera has to take a number of shots at different exposures or a special camera capable of taking high-dynamic-range images has to be used.
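The multi-exposure route can be sketched with OpenCV's implementation of the Debevec-Malik method cited as reference 10; the file names and exposure times below are hypothetical.

```python
import numpy as np
import cv2

# Hypothetical exposure bracket of the mirror-ball shot.
paths = ["ball_1_60s.png", "ball_1_250s.png", "ball_1_1000s.png"]
times = np.array([1 / 60, 1 / 250, 1 / 1000], dtype=np.float32)
imgs = [cv2.imread(p) for p in paths]

# Recover the camera response curve, then merge the bracket into a
# floating-point radiance map usable as a light probe.
response = cv2.createCalibrateDebevec().process(imgs, times)
hdr = cv2.createMergeDebevec().process(imgs, times, response)
cv2.imwrite("light_probe.hdr", hdr)
```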

We have developed a technique that can indirectly estimate the color tone and intensity of each piece of studio lighting by analyzing video footage taken by a small, wide-angle sensor camera shooting the ceiling of the studio. Here, the effect of the lighting on the studio set is analyzed rather than the lighting directly 11). This approach requires lighting information to be acquired in the studio, including the positions of the lights and images of objects in the studio lit by each individual light.

Figure 5: Comparison of a real image and a CG rendered using estimated lighting conditions (the subject is a white sphere; lighting conditions 1 to 3)

Figure 4: Technique of extracting subject areas from a moving image (raw video stored in the video bank undergoes time-space area division; the extraction subject area is then specified interactively and extracted using the area division information)


Since the brightness being measured is not that of the direct lighting, these measurements can be done with an ordinary camera. This method thus has the virtue of creating natural-looking composite video with simple and inexpensive equipment. Figure 5 compares live-action video of a real white sphere filmed under the actual lighting with a CG white sphere rendered using lighting conditions estimated by this method.
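The underlying idea can be sketched as a linear inverse problem: if images of the set lit by each light alone are available, the live sensor-camera frame is approximately a nonnegative combination of them. This is a minimal reading of the approach, not the published algorithm 11).

```python
import numpy as np
from scipy.optimize import nnls

def estimate_light_weights(basis_imgs, live_img):
    # basis_imgs[i]: the set lit by light i alone (acquired beforehand);
    # live_img: the current sensor-camera frame (same size). Assuming
    # light adds linearly, live ~= sum_i w_i * basis_i, so solve for
    # w >= 0 by nonnegative least squares. Downsample the frames first
    # in practice to keep the system small.
    A = np.stack([b.astype(np.float64).ravel() for b in basis_imgs], axis=1)
    w, _ = nnls(A, live_img.astype(np.float64).ravel())
    return w   # per-light scale factors, used to set CG light intensities
```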

4. Latest trends in video representation using real-time video analysis

Video analysis techniques combining various sensor technologies and software can handle video at 30 frames per second in real time. This section introduces examples of applications that take advantage of this real-time capability.

4.1 Virtual studio

A virtual studio is a video representation technique that puts a CG in live-action video in real time and makes it move in accordance with the movements of the camera. It is used in live broadcasts to show performers interacting with CGs or a virtual world, or for visualizing complicated information in an easy-to-understand manner.

The current virtual studio technology uses special tripods, dollies (carts used for travel shots), or cranes fitted with physical sensors to measure the camera orientation; it is not possible to use handheld cameras with it. Although a method exists to acquire video feature points and their 3D positions from the captured video image itself and then use that information to estimate the orientation of the camera 12), it does not have the accuracy required for TV program production.

We have developed a technique that accurately estimates the camera orientation by analyzing how the edges of the subjects move in relation to the shapes and textures of the studio set. The shape and texture information is acquired before the shoot, which enables the method to work in real time 13). However, it is still subject to delays that sometimes cause the composite video to lag more than three frames behind the captured video images and interfere with the synchronization of lip movements and speech. We intend to solve this problem by using the hybrid sensor developed for the video bank, making it possible to use handheld cameras in a virtual studio.

A hybrid sensor measures the camera's rotation by using a small gyroscope and its translation in the vertical direction by using a laser rangefinder. It estimates the translation in the horizontal direction by analyzing the video images of a sensor camera aimed at the floor (Figure 6), using clues such as the floor's pattern. This method can measure camera orientation information with little delay and has been used in live broadcasts by NHK.
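A rough sketch of one update step of such a sensor fusion follows. The phase-correlation step and the calibration constant are assumptions standing in for the actual floor-pattern analysis.

```python
import numpy as np
import cv2
from scipy.spatial.transform import Rotation

def update_pose(R_prev, gyro_rads, dt, floor_prev, floor_curr,
                height_m, m_per_px=0.002):
    # One update step of a hybrid-sensor camera pose (a rough sketch).
    # Rotation: integrate the MEMS gyro's angular velocity over the frame.
    R = R_prev * Rotation.from_rotvec(np.asarray(gyro_rads, float) * dt)

    # Horizontal travel: image shift between consecutive floor-camera
    # frames; m_per_px is an assumed calibration of the floor camera.
    (dx, dy), _ = cv2.phaseCorrelate(np.float32(floor_prev),
                                     np.float32(floor_curr))
    delta_xz = np.array([dx, dy]) * m_per_px   # per-frame travel (m)

    # Vertical position comes directly from the laser rangefinder.
    return R, delta_xz, height_m

# Example: no rotation, a 10-pixel floor shift, camera 1.5 m above the floor.
f0 = np.zeros((64, 64), np.float32); f0[20:30, 20:30] = 1.0
f1 = np.roll(f0, 10, axis=1)
R, dxz, h = update_pose(Rotation.identity(), [0, 0, 0], 1 / 30, f0, f1, 1.5)
print(dxz, h)
```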

4.2 Display applications: "Augmented TV"

The Hybridcast platform can send information related to the program being broadcast on one screen, e.g., a TV set in a living room, to a second screen, e.g., a tablet computer held by the viewer. "Augmented TV" 14) is a style of viewing content in which the TV and a second screen interact.

Figure 6: Overview of the hybrid sensor that measures camera orientation information (the gyroscope measures the amounts of rotation Rx, Ry, Rz; the laser rangefinder measures the height Ty; video analysis of the floor-facing sensor camera measures the amount of travel Tx, Tz along the floor)


Augmented TV uses a handheld terminal to analyze images acquired by its built-in camera, synchronizes the content on the TV and the handheld terminal, and acquires the relative orientation of the TV with respect to the handheld terminal. With this information, it is possible to create an augmented reality in which a character jumps from the TV screen to the second screen.
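The relative-orientation step can be sketched as a planar pose (PnP) problem once the TV screen's corners have been detected in the tablet's camera image. The screen dimensions and the corner detection are assumptions here, not details of the published system.

```python
import numpy as np
import cv2

# Physical corner positions of the TV screen in its own coordinate
# system (a hypothetical 40-inch 16:9 panel, in meters).
W, H = 0.885, 0.498
TV_CORNERS = np.array([[0, 0, 0], [W, 0, 0], [W, H, 0], [0, H, 0]],
                      np.float32)

def tv_pose(img_corners, K):
    # img_corners: the four detected screen corners in the tablet camera
    # image (4x2, pixels, same order as TV_CORNERS); K: 3x3 intrinsics.
    # solvePnP yields the TV's orientation and position relative to the
    # tablet, which is what lets CG appear to leave the screen.
    ok, rvec, tvec = cv2.solvePnP(TV_CORNERS, np.float32(img_corners), K, None)
    assert ok
    return rvec, tvec
```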

5. Conclusions

Video analysis technology will steadily become more sophisticated. We believe that the objective is not to automate all the creative activities in video representation but to provide an environment that enables producers to concentrate on their creative work. The video bank we have proposed on the basis of this concept makes it possible to access raw video footage quickly and effectively, expand the variety of possible video representations, and implement an efficient workflow. In the future, we will improve its elemental technologies and put it to use in practical program production.

We will continue to develop services and applications that make use of video analysis techniques, which are not limited to video representation and program production.

(Hideki Mitsumine)

References

1) ACES (Academy Color Encoding Specification), Science and Technology Council, Academy of Motion Picture Arts and Sciences: "The Academy Color Encoding System -- Idealized System," https://www.oscars.org/sites/default/files/acesoverview.pdf

2) “Open Source Color Management,” http://opencolorio.org/

3) Mori, Ichikari, Shibata, Kimura, and Tamura: “Development of MR-PreViz System for On-site S3D Pre-Visualization of Real World Scenes,” Transactions of the Virtual Reality Society of Japan, Vol. 17, No. 3, pp. 231-240 (2012)

4) Mitsumine, Muto, Fujii, and Kato: “A Robust Camera Tracking Method Considering Visual Effects,” Annual Conference of the Institute of Image Electronics Engineers of Japan, R6-1 (2013)

5) Okatani: “Bundle Adjustment,” Information Processing Society of Japan CVIM Research Materials, No. 37, pp. 1-16 (2009)

6) Kato, Muto, Mitsumine, Okamoto, A. Moro, Seki, and Mizukami: “Study of Detecting Method of Broadcasting Camera’s Movement using MEMS Sensor and Image Processing Technology,” Journal of the Robotics Society of Japan, Vol. 32, No. 7, pp. 2-11 (2014)

7) C. Rother, A. Blake and V. Kolmogorov: “‘GrabCut’: Interactive Foreground Extraction Using Iterated Graph Cuts,” ACM Trans. Graphics, Vol. 23, No. 3, pp. 309-314 (2004)

8) Ishikawa: "Graph Cut," Information Processing Society of Japan CVIM Research Materials, No. 31, pp. 193-204 (2007)

9) Okubo and Mitsumine: "A Study of Spatio-temporal Segmentation Method Suitable for Interactive Object Extraction for Video Asset Processing and Management System "Video Bank"," Vision Engineering Workshop ViEW 2013, OS1-O3 (2013)

10) P. E. Debevec and J. Malik: "Recovering High Dynamic Range Radiance Maps from Photographs," Proc. ACM SIGGRAPH 1997, pp. 369-378 (1997)

11) Morioka, Okubo, and Mitsumine: "Real-Time Estimation Method of Lighting Condition with Sensor Cameras for Image Compositing," Transactions of the 13th Forum on Information Technology, No. 3, I-006, pp. 169-170 (2014)

12) G. Klein and D. Murray: “Parallel Tracking and Mapping for Small AR Workspaces,” Proc. IEEE/ACM ISMAR 2007, pp. 225-234 (2007)

13) H. Park, H. Mitsumine, and M. Fujii: “Adaptive Edge Detection for Robust Model-Based Camera Tracking,” IEEE Trans. Consum. Electron., Vol. 57, No. 4, pp. 1465-1470 (2011)

14) Kawakita, Nakagawa, and Sato: "Augmented TV: An Augmented Reality System for TV Pictures Beyond the TV Screen," Transactions of the Virtual Reality Society of Japan, Vol. 19, No. 3, pp. 319-328 (2014)