Reconstructing 3D Models from Movies with Canonical Cinematography Settings

Siddharth Choudhary Boyang Li Nam Vo

Figure 1. The 180-degree rule.

1. Introduction

Reconstructing the 3D scene depicted by a video clip is an active field of research. For this task, we need two or more overlapping images of the same scene taken from different angles. From these, we can recover the depth of points that appear in both images.

Characteristics of modern movies make it difficult to reconstruct 3D scenes from them. Movies often use cameras with more than 45° difference in orientation and large differences in focal length, and cut directly from one camera to another. Figure 1 shows an example of a canonical cinematography setting, known as the 180-degree rule. The bottom three camera positions are commonly used, and the top position is disallowed [4]. The drastic changes between cameras result in small regions of overlap, creating difficulties for finding points that appear in both images.

However, cinematographic conventions are not meant to confuse the audience. Rather, they facilitate comprehension of the movie, arguably because they cater to human cognition [5]. Hence, a reasonable hypothesis is that knowledge of cinematic conventions, more specifically knowledge of camera poses and positions, can help computational systems understand movies as well.

This paper investigates 3D reconstruction under the adverse conditions of modern cinematography. Small overlaps between images lead to many wrongly matched points, so the key to finding correct matches is to increase the signal-to-noise ratio. In this paper, we attempt to employ the 180-degree rule and other conventions to filter out wrong matches. We find that knowledge of camera positions is beneficial for 3D reconstruction, but still insufficient for some difficult situations.

2. Related Work

In a closely related work, Pollefeys et al. [6] proposed to build visual models using images from a hand-held camera. The matched features between images are used to infer scene structure and camera motion at the same time. In contrast, we consider videos that switch between multiple cameras. Some recent work focuses on efficient large-scale 3D reconstruction of millions of images (e.g. [1, 2]). In stark contrast, we face exactly the opposite problem: data sparsity.

3. Approach

Our process includes three steps. First, we find as many matching points as possible between two given images. Then, RANSAC is used to filter out wrongly matched points. Finally, we triangulate the points in 3D space. The three steps are discussed below.

Given a point in one image I1, the corresponding point in the other image I2 lies on the epipolar line. For two camera poses P1 and P2, we can approximate the epipolar line l for a point x ∈ I1 as:

l = (P2 C1) × (P2 P1^+ x)    (1)

where C1 is the camera center of P1 and P1^+ is the pseudo-inverse of P1 (a 3×4 camera matrix is not invertible).

Assuming the cameras are set up according to the 180-degree rule, we can estimate the epipolar line and search for the corresponding point near this estimated line. For each interest point in I1, we find the 500 interest points in I2 that are closest to the epipolar line, out of which the best match is selected. A match is accepted only if the best candidate is sufficiently better than the second best, where the degree of matching is the Euclidean distance between the SIFT features of the two points, as detailed in Algorithm 1.
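The following is a minimal NumPy sketch of this matching step, not the authors' implementation. It assumes 3×4 projection matrices P1 and P2 guessed from the 180-degree rule, keypoint coordinates kp1 and kp2 as (N, 2) arrays, and SIFT descriptors des1 and des2 as (N, 128) arrays; the helper names are ours.

    import numpy as np

    def epipolar_line(P1, P2, x):
        # Epipolar line in image 2 for homogeneous point x in image 1 (Eq. 1).
        C1 = np.linalg.svd(P1)[2][-1]     # camera center: right null vector of P1
        e2 = P2 @ C1                      # epipole in image 2
        return np.cross(e2, P2 @ (np.linalg.pinv(P1) @ x))

    def line_distance(pts, l):
        # Perpendicular distance of (N, 2) points to homogeneous line l.
        return np.abs(pts @ l[:2] + l[2]) / np.hypot(l[0], l[1])

    def epipolar_match(kp1, des1, kp2, des2, P1, P2, k=500, ratio=0.6):
        matches = []
        for j in range(len(kp1)):
            x = np.array([kp1[j, 0], kp1[j, 1], 1.0])
            # Keep the k keypoints in image 2 nearest the estimated line.
            near = np.argsort(line_distance(kp2, epipolar_line(P1, P2, x)))[:k]
            d = np.linalg.norm(des2[near] - des1[j], axis=1)  # Euclidean SIFT distance
            best, second = np.argsort(d)[:2]
            if d[best] / d[second] < ratio:                   # ratio test of Algorithm 1
                matches.append((j, near[best]))
        return matches

Because the 180-degree rule only roughly constrains the cameras, the estimated line is crude; keeping the 500 nearest candidates rather than only points on the line compensates for that.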

3.1. Color SIFT and SIFT orientation

In order to increase the signal-to-noise ratio, we utilize color information during the matching process. We use Color SIFT features [3] to compute the distance between points. In addition, we observe that correctly matched points usually do not have a large difference in their SIFT orientation. We attribute this to the fact that the up direction is usually well maintained in professional cinematography.


Algorithm 1 EpipolarMatch(I1, I2)
for each interest point pj ∈ I1 do
    Find an estimated epipolar line in I2 using the camera position constraint.
    Find the 500 interest points spatially closest to the epipolar line in I2.
    Compute the match between the SIFT feature sj and the SIFT features of the 500 points; let the best match be s1 and the second best be s2.
    if dist(s1, sj) / dist(s2, sj) < 0.6 then
        s1 and sj are matched
    end if
end for

As a result, a match with a large disparity in orientation is probably wrong. Therefore, we reject matches between two Color SIFT features whose orientations differ by more than 60°.
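As a sketch of this orientation constraint, assuming matches are the index pairs produced above and kp1, kp2 are OpenCV KeyPoint lists whose angle attribute stores the SIFT orientation in degrees:

    def orientation_filter(matches, kp1, kp2, max_diff=60.0):
        # Drop matches whose SIFT orientations differ by more than max_diff degrees.
        kept = []
        for i, j in matches:
            diff = abs(kp1[i].angle - kp2[j].angle) % 360.0
            diff = min(diff, 360.0 - diff)   # compare angles on the circle
            if diff <= max_diff:
                kept.append((i, j))
        return kept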

3.2. 3D reconstruction

Inevitably, the points matched by the above procedures contain noise. We use the RANSAC procedure to filter out wrong matches and keep the largest group of matched points that are consistent with one set of camera positions. From these matched points, we estimate the fundamental matrix. After that, we triangulate the locations of the matched feature points in 3D space by minimizing the sum of reprojection errors of all points when projected back to the estimated cameras. Points with reprojection error greater than a threshold are discarded.
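A sketch of this pipeline with OpenCV is given below. Note that cv2.triangulatePoints performs linear (DLT) triangulation rather than the reprojection-error minimization described above, so this only approximates our procedure, and the 2-pixel threshold is a placeholder rather than a tuned value.

    import cv2
    import numpy as np

    def reconstruct(pts1, pts2, P1, P2, thresh=2.0):
        # RANSAC keeps the largest group of matches consistent with one
        # fundamental matrix, i.e. one pair of camera positions.
        F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)
        in1 = pts1[mask.ravel() == 1].astype(np.float64)
        in2 = pts2[mask.ravel() == 1].astype(np.float64)
        # Linear triangulation: takes 2xN point arrays, returns 4xN homogeneous.
        X = cv2.triangulatePoints(P1, P2, in1.T, in2.T)
        X /= X[3]  # dehomogenize

        def err(P, pts):
            # Reprojection error of the 3D points against one camera.
            p = P @ X
            return np.linalg.norm((p[:2] / p[2]).T - pts, axis=1)

        keep = np.maximum(err(P1, in1), err(P2, in2)) < thresh
        return X[:3, keep].T  # Euclidean 3D points with acceptable reprojection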

4. Evaluation

We selected three different video clips, and two frames from each clip, as three test sets. Two clips are taken from the TV show White Collar, and one is taken from Harry Potter and the Prisoner of Azkaban. We compare three different algorithms: the traditional 3D reconstruction algorithm without any knowledge of the 180-degree rule, a 3D reconstruction algorithm aware of camera positions under the 180-degree rule, and an algorithm that utilizes Color SIFT and the orientation constraint but is not aware of the 180-degree rule. To evaluate the three algorithms, we report (1) the number of matched points, (2) the number of matched points considered to be inliers by RANSAC, (3) the number of correct inliers as determined by manual inspection, (4) the signal-to-noise ratio, computed as correct inliers divided by all matched points, and (5) the number of points that remain after triangulation (i.e., points with acceptable reprojection error). Greater numbers of correctly matched points indicate better results. The results are shown in Table 1. The matched points for the algorithm with knowledge of cameras are shown in Figure 2.

Figure 3 shows the camera positions recovered by the algorithm with camera knowledge, which are consistent with the 180-degree rule. Due to the small number of correct inliers, we are not able to recover the camera setups for the second and third test sets.

Figure 3. Reconstructed positions of cameras from the test set White Collar 1. The front of a camera is aligned with the direction of the z-axis. The y-axis is the up direction of images. The black dots are points reconstructed in the 3D space.

All three algorithms are able to find a good number of correct matches for the first test set. The knowledge of camera positions and the use of Color SIFT with the orientation constraint can both significantly improve the number of correctly matched and triangulated points. However, the algorithms generally do not achieve good results for the second and third test sets. Knowledge of cameras is not very effective in suppressing noise because the 180-degree rule only roughly specifies camera positions and leaves a lot of freedom for the director. Thus, our estimate of the epipolar line is crude, and we still have to consider points far away from this line. Although Color SIFT and the orientation constraint are effective in suppressing noise and hence improving the signal-to-noise ratio, their ability to increase the number of correct matches is limited.

4.1. The horizontal panning condition

A simpler situation arises when the camera pans horizontally. In this case, the epipolar line is horizontal. Assuming the camera motion is known, we can restrict the search for the matching point to points at a similar height to the given point, as sketched below. We tested this algorithm on three other test sets, and the results are shown in Table 2. Knowledge of horizontal panning is more powerful than the 180-degree rule, as it directly gives us the epipolar line rather than an estimate of it. Hence, we can find a good number of correctly triangulated points.
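Under this assumption, the candidate search of Algorithm 1 collapses to a horizontal band of pixels. A minimal sketch, where the 5-pixel half-width is an assumed parameter and kp2 is again an (N, 2) keypoint array:

    import numpy as np

    def horizontal_candidates(pt, kp2, band=5.0):
        # Under horizontal panning the epipolar line is the row y = pt[1],
        # so candidate matches are keypoints within `band` pixels of that row.
        return np.flatnonzero(np.abs(kp2[:, 1] - pt[1]) <= band)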



Algorithm              Image Set        Matches  Inliers  Correct Inliers  Correct / Matched  Triangulated Points
Traditional Algorithm  White Collar 1   52       32       29               56%                26
                       White Collar 2   17       15       2                12%                0
                       Harry Potter     10       9        2                22%                0
Camera Knowledge       White Collar 1   139      61       59               42%                59
                       White Collar 2   31       14       3                10%                0
                       Harry Potter     53       24       2                3.8%               0
Color + Orientation    White Collar 1   148      77       70               47%                65
                       White Collar 2   21       14       6                29%                0
                       Harry Potter     18       11       4                22%                0

Table 1. Reconstruction results from the three algorithms.

Figure 2. Inlier matches for the three test sets from the algorithm with camera knowledge. Red lines connect matched points from two images. From left to right: White Collar 1, White Collar 2, and Harry Potter and the Prisoner of Azkaban.

Image Set            Inliers  Triangulated Points
A Beautiful Mind     107      29
The Big Bang Theory  110      92
Friends              55       51

Table 2. Reconstruction results under the horizontal panning condition.

5. Discussion

Our investigation shows that the biggest challenge in 3D reconstruction of movies is our limited ability to find matching points between two frames. In contrast to applications with millions of images (e.g. [2]), we need to find as many matched points as possible from each image. Finding matching points is difficult because of certain characteristics of modern cinematography. The first challenge is rapid cuts between cameras with different poses and focal lengths. Further, movies tend to focus on characters' faces, using a long focal length that blurs the background, which makes it difficult to match points on the background.

Faces can be important anchors for estimating camera poses; our third test set is a good example. Thus, the ability to recognize and interpolate facial features can be vital for 3D reconstruction. Although we did not utilize such features, they are an interesting future direction.

Our method matches points, but point-based methods are less powerful than segment-based methods. In the third test case, for example, we match points on one hat to points on two different hats. Matching entire segments can avoid this kind of error. Conversely, points at different depths are unlikely to belong to the same segment, so depth information can also inform segmentation. The synergy between depth information and segmentation could be an interesting research direction.

References

[1] S. Agarwal, N. Snavely, I. Simon, S. Seitz, and R. Szeliski. Building Rome in a day. In ICCV, 2009.
[2] J. Frahm, P. Fite-Georgel, D. Gallup, T. Johnson, R. Raguram, C. Wu, Y. Jen, E. Dunn, B. Clipp, S. Lazebnik, et al. Building Rome on a cloudless day. In ECCV, 2010.
[3] J. M. Geusebroek, R. van den Boomgaard, A. W. M. Smeulders, and H. Geerts. Color invariance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(12), 2001.
[4] J. Mascelli. The Five C's of Cinematography: Motion Picture Filming Techniques. Cine/Grafic Publications, 1965.
[5] J. May and P. Barnard. Cinematography and interface design. In Human-Computer Interaction, 1995.
[6] M. Pollefeys, L. Van Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, and R. Koch. Visual modeling with a hand-held camera. International Journal of Computer Vision, 59(3):207–232, 2004.
