Shape-Time Photography

William T. Freeman
EECS Dept., Massachusetts Institute of Technology
Cambridge, MA 02139

Hao Zhang
EECS Dept., U.C. Berkeley
Berkeley, CA 94720

IEEE Computer Vision and Pattern Recognition (CVPR), Madison, WI, June 2003.

We introduce a new method to describe shape relationships over time in a photograph. We acquire both range and image information in a sequence of frames using a stationary stereo camera. From the pictures taken, we compute a composite image consisting of the pixels from the surfaces closest to the camera over all the time frames. Through occlusion cues, this composite reveals 3-D relationships between the shapes at different times. We call the composite a shape-time photograph.

Small errors in stereo depth measurements can create artifacts in the shape-time images. We correct most of these using a Markov network to estimate the most probable front-surface pixel, taking into account (a) the stereo depth measurements and their uncertainties, and (b) spatial continuity assumptions for the time-frame assignments of the front-surface pixels.

1 Introduction

With a single still image, we seek to describe the changes in the shape of an object over time. Applications could include artistic photographs, instructional images (e.g., how does the hand move while sewing?), action summarization, and photography of physical phenomena.

How might one convey, in a still image, changes in shape? A photograph depicts the object, of course, but not its relationship to objects at other times. Multiple-exposure techniques, pioneered in the late 1800s by Marey and Muybridge [1, 9], can give beautiful depictions of objects over time. They have two drawbacks, however: (1) The control of image contrast is a problem; the image becomes over-exposed where objects at different times overlap. Backgrounds may need to be dark to avoid over-exposure. (2) The result doesn't show how the various shapes relate to each other in three dimensions. What we see is like an X-ray photograph, showing only a flattened comparison between 2-D shapes.

(This work was initiated when both authors were at Mitsubishi Electric Research Labs (MERL), WTF as a researcher and HZ as a student intern.)

Using background stabilization techniques from computer vision, researchers have developed video summarization tools which improve on multiple-exposure methods. Researchers at both Sarnoff Labs [13] and Salient Stills [7] have shown single-frame composites where the foreground image at each time overwrites the overlapping portions of all the previous foreground images, over a single, stabilized background. We will refer to this compositing as the "layer-by-time" algorithm, since it is time, not 3-D shape, which determines object visibility. The layer-by-time method avoids the contrast reduction of multiple-exposure techniques. However, since temporal order, not shape, determines the occlusion relationships, this method cannot describe the shape relationships between foreground objects at different times. Video cubism [5] is a less structured approach to rendering video information into a single frame, and also does not incorporate shape information into the composite.

Our solution for displaying shape changes over time makes use of 3-D information which is captured along with the images. We form a composite image where the pixels displayed are those showing the surfaces closest to the viewer among all surfaces seen over the entire sequence. The effect is to display a photograph of the union of the surfaces in all the photographs (without mutual illumination and shading effects). This allows occlusion cues to reveal the 3-D shape relationships between objects seen over different times in the original video sequence.

Figure 1 illustrates these summarization methods for the case of a familiar motion sequence: the rattling spiral of a coin as it rolls to a stop on a table. (a) shows the individual frames of the sequence. (To avoid motion blur, we placed the coin in those positions, using clay underneath.) The multiple-exposure summary, (b), shows the loss of image contrast where foreground objects overlap. The layer-by-time algorithm, (c), shows more detail than (b), but doesn't reveal how the coins of different times relate spatially. (d) is our proposed summary of the sequence.



Figure 1: (a) Image sequence of rolling coin. (b) Multiple-exposure summary. (c) Layer-by-time summary. (d) Shape-time summary. (Color-based foreground masks were used in (c) and (d) to isolate the foreground coins from the background: in (c) to specify the foreground object and in (d) to remove the unreliable stereo depth for the featureless background.)

The composite image is constructed to make sense in 3-D. We can see how the coin occludes itself at other times; these occlusions let us picture the 3-D relationships between the different spatial configurations of the coin. To emphasize that the technique describes shapes over time, we call it "shape-time photography".

1.1 Related effects

In some special cases of natural viewing, we are accustomed to viewing shape-time images. Extrusion processes, such as squeezed blobs of toothpaste or shaving cream, leave a shape-time history of the motion of the extrusion source. Shape-time photographs have some resemblance to Duchamp's "Nude Descending a Staircase", the classic depiction of motion and shape in a static image. The comic book Nogegon uses drawn shape-time outlines in its story [14]. In unpublished independent work, researchers at Georgia Tech have made graphical displays of data from a motion-capture system using a shape-time style rendering, but not using visual input [2].

2 Problem Specification

To make a shape-time photograph, we need to record both image and depth information. Various technologies can measure depth everywhere in a scene, including shape-from-defocus, structured-light systems, and stereo. While stereo range can be less accurate than the others, a stereo camera is quite portable, allowing a broad range of photographic subjects in different locations. Stereo also avoids the problem of registering range and image data, since disparities are computed from the image data itself. Fig. 2 shows the stereo camera we used. The beam-splitter system allowed us to capture left and right images using a single shutter, assuring temporally synchronized images.

The simplest version of shape-time photography assumes a stationary camera which photographs $N$ time-frames of stereo image pairs. (Background stabilization techniques such as [16] might be used to generalize the results of this paper to non-stationary cameras.) At each position, we need to select for display a pixel from one of the $N$ frames captured over all times at that position. We can then generate a single-frame composite from one camera's viewpoint (left, for our examples), or a composite stereo image.

Let $L_t(i)$ and $R_t(i)$ denote the values at the $i$th pixel at time frame $t$ recorded in the left and right images, respectively. Let $d_t(i)$ be the distance to the surface imaged at the $i$th pixel (of the left camera) at frame $t$. Pixel $i$ of the left-view shape-time image, $S(i)$, is simply

$$S(i) = L_{\hat{t}(i)}(i), \qquad \hat{t}(i) = \arg\min_t d_t(i). \qquad (1)$$


Figure 2: The apparatus for taking synchronized stereo image sequences: Olympus Camedia C-3040 camera and a Pentax stereo adapter (connected using a Kenco 41mm-52mm adapter ring). The digital camera can take 5 full-resolution shots in a row at 1/3-second intervals. The L/R split-screen image is visible in this photo on the camera's LCD display. Inset: a typical split-screen image recorded by the camera.

We call $\hat{t}(i) = \arg\min_t d_t(i)$ the frame assignment at pixel $i$, since it indicates which time-frame's pixel is displayed at position $i$ in the shape-time composite image.

With perfect depth data, the frame assignment is trivial to compute; we simply find the frame of minimum depth for each position. However, substituting the measured stereo depth, $\tilde{d}_t(i)$, for the true depth in Eq. (1) typically gives unacceptable artifacts, illustrated in Fig. 3. Equation (1) for shape-time rendering involves comparisons between very similar depth values from different frames and can reveal even small errors in stereo depth. We will need to estimate the proper frame assignments for each pixel in much the same way as one approaches other low-level vision problems: combining local evidence (the measured stereo depths and their uncertainties) with regularization constraints (penalties for inconsistent frame assignments over space). Shape-time composites made using depth measurements other than stereo may also benefit from the processing steps below.
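As a concrete reference point, the following is a minimal sketch of the naive composite of Eq. (1): select, at each pixel, the frame of minimum depth. The names here are illustrative; the paper provides no code.

```python
import numpy as np

def naive_shape_time(left_images, depths):
    """Naive shape-time composite of Eq. (1): at each pixel, display
    the left-camera pixel from the frame of minimum depth.

    left_images: (N, H, W, 3) image sequence from the left camera.
    depths:      (N, H, W) per-frame depth at each left-image pixel.
    Returns the composite image and the per-pixel frame assignment."""
    assignment = depths.argmin(axis=0)          # (H, W) frame indices
    y, x = np.indices(assignment.shape)
    return left_images[assignment, y, x], assignment
```

Run on measured stereo depths, this direct rule produces exactly the frame-switch artifacts of Fig. 3(a, b); the MRF machinery of Section 3 exists to clean up this step.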

3 Algorithm

One might design an algorithm to estimate the time-frame assignments directly from image data without first computing stereo depth. Motion coherence over time, as well as stereo, could then be used to estimate depth, as in [12]. However, we are often interested in sequences with large motions between frames, which are not amenable to that integrated approach. To address that more general case, we chose a modular architecture. We first measure stereo disparity, $\tilde{d}_t(i)$, and its uncertainty, $\sigma_t(i)$, independently at each time, $t$. (Since we are only interested in ordinal relationships, we treat stereo disparity like depth.) We then assign the time frame to be displayed at each pixel based on those depth estimates. This modular approach will also let us incorporate improved stereo algorithms into our shape-time system as they are developed in the future. We can also substitute other depth-measurement methods instead of stereo.

Our stereo camera is uncalibrated. We found the fundamental matrix by using the web-based automatic point-matching algorithm of Zhang [18]. We rectified the images so that epipolar lines lie along scan lines, using the algorithm of [11].
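As a rough modern stand-in for this step, here is a sketch using OpenCV's feature matching and uncalibrated rectification in place of Zhang's point matcher [18] and the rectification of [11]; it is not the paper's pipeline.

```python
import cv2
import numpy as np

def rectify_uncalibrated(left, right):
    """Epipolar rectification for an uncalibrated stereo pair,
    using OpenCV in place of the tools cited in the paper."""
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(left, None)
    k2, d2 = orb.detectAndCompute(right, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([k2[m.trainIdx].pt for m in matches])
    # Robust fundamental matrix from the putative matches.
    F, inliers = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)
    inliers = inliers.ravel().astype(bool)
    h, w = left.shape[:2]
    ok, H1, H2 = cv2.stereoRectifyUncalibrated(
        pts1[inliers], pts2[inliers], F, (w, h))
    return (cv2.warpPerspective(left, H1, (w, h)),
            cv2.warpPerspective(right, H2, (w, h)))
```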

We used the stereo algorithm of Zitnick and Kanade (ZK) [19], which constructs a 3-dimensional array of match values in disparity space. The iterative algorithm enforces global constraints of uniqueness (one disparity value per point) and continuity (neighboring pixels have similar disparities) by diffusing or inhibiting support among neighboring match values. After the algorithm has converged, occluded areas are explicitly identified. A version of the code is available for download [19].

3.1 Probabilistic formulation

Small errors in depth estimates lead to islands of pixels where the selected frame switches in the shape-time composite, illustrated in Fig. 3. To remove those spurious frame switches, we add two assumptions: (1) that a pixel's time-frame assignment is likely to be the same as its neighbors', and (2) that time-frame assignment transitions are more likely to occur at image edges, because they may be occluding edges where a frame switch should occur. These are analogous to assumptions about disparity used in stereo [6, 15], but we are applying this to estimate the frame of minimum depth from a collection of frames, not the disparity.

A probabilistic formulation can combine these assumptions with the noisy stereo disparity data to compute the most probable frame of minimum disparity for each pixel. We assume that the frame assignments at each pixel form a Markov random field (MRF) [4, 6, 3, 15]. Let the vector $\mathbf{x}$ denote the frame assignments at all the pixel locations, i.e., $\mathbf{x} = (x_1, x_2, \ldots, x_M)$, where $M$ is the number of pixels in the image. We seek to define a probability $P(\mathbf{x})$ such that the time-frame assignments which maximize $P(\mathbf{x})$ result in an artifact-free shape-time composite image.

Toward this end, we write $P(\mathbf{x})$ as a product of (a) local evidence terms, $\phi_i(x_i)$, incorporating stereo disparity information, and (b) neighbor compatibility matrices, $\psi_{ij}(x_i, x_j)$, encouraging spatially consistent pixel frame assignments for the composite image:

$$P(\mathbf{x}) = \frac{1}{Z} \prod_i \phi_i(x_i) \prod_{(i,j)} \psi_{ij}(x_i, x_j). \qquad (2)$$

$Z$ is a normalization constant, independent of $\mathbf{x}$. The parentheses in $(i,j)$ mean the product is taken over only neighboring pixels $i$ and $j$.

3.1.1 Local evidence terms

The local evidence $\phi_i$ is an $N$-vector describing the probability, based on stereo depth and uncertainty measurements, that time-frame $x_i$ is in front of the others at pixel $i$. We assume that each frame's depth measurement, at pixel $i$, is an independent Gaussian random variable of mean $\tilde{d}_t(i)$ and standard deviation $\sigma_t(i)$. We want to transform these uncertain depth measurements into a set of probabilities that each time-frame corresponds to the front-most surface; see Fig. 5. Let $d_1, d_2, \ldots, d_N$ be the independent, but not identically distributed, Gaussians (dropping the pixel subscript $i$ for brevity here). We want to compute the probability that $d_1$ (say) is the smallest among all the $d_t$'s, namely $P(d_1 \le d_t,\ \forall t)$. We condition on the value of $d_1$:

$$P(d_1 \le d_t,\ \forall t) = \int_{-\infty}^{\infty} p(d_1 = u)\, P(d_2 > u, \ldots, d_N > u)\, du \qquad (3)$$

$$= \int_{-\infty}^{\infty} p(d_1 = u) \prod_{t=2}^{N} P(d_t > u)\, du, \qquad (4)$$

because the $d_t$'s are independent. This lets us evaluate the integration over the joint probability in $N$ dimensions using a 1-D integral. We evaluate that integral numerically using simple trapezoid quadrature.
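A direct numerical transcription of Eqs. (3) and (4) is straightforward; this sketch (names ours, not the paper's) integrates with the trapezoid rule over a grid spanning all the Gaussians:

```python
import numpy as np
from scipy.stats import norm

def front_most_probs(means, sigmas, n_samples=2001):
    """Probability that each frame's depth is the minimum, per
    Eqs. (3)-(4), for one pixel. means[t], sigmas[t]: Gaussian
    parameters of frame t's depth estimate at this pixel."""
    means = np.asarray(means, float)
    sigmas = np.asarray(sigmas, float)
    # Integration grid covering essentially all the Gaussians' mass.
    u = np.linspace((means - 5 * sigmas).min(),
                    (means + 5 * sigmas).max(), n_samples)
    N = len(means)
    probs = np.empty(N)
    for t in range(N):
        # p(d_t = u) * prod_{s != t} P(d_s > u), integrated over u.
        integrand = norm.pdf(u, means[t], sigmas[t])
        for s in range(N):
            if s != t:
                integrand *= norm.sf(u, means[s], sigmas[s])
        probs[t] = np.trapz(integrand, u)
    return probs / probs.sum()   # absorb quadrature error
```

As Fig. 5 illustrates, the result need not be monotonic in the mean disparity: a frame with a slightly worse mean but a tighter uncertainty can be the more probable front surface.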

The ZK stereo algorithm returns a confidence value for every disparity estimate. We used the confidence value to estimate a standard deviation, $\sigma_t(i)$, of the Gaussian distribution used to model each pixel's true disparity. We used a heuristically selected function, mapping ZK confidence 0.05 to a small Gaussian disparity standard deviation and confidence 0.002 to a large one, linearly interpolating in between. We clamped $\sigma_t(i)$ to the extremal values outside that interval.
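In code, this mapping is a clamped linear interpolation; the endpoint standard deviations below are placeholders, since the paper's exact values are not given here.

```python
import numpy as np

def confidence_to_sigma(conf, sigma_at_002=4.0, sigma_at_005=0.5):
    """Map a ZK stereo confidence to a disparity standard deviation.
    Linear between confidence 0.002 (high sigma) and 0.05 (low sigma);
    np.interp clamps to the endpoint values outside that interval.
    The endpoint sigmas are illustrative placeholders."""
    return np.interp(conf, [0.002, 0.05], [sigma_at_002, sigma_at_005])
```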

In Eq. (5) below for belief propagation, the resulting local evidence vectors, describing the probability that each time-frame corresponds to the front-most surface pixel, are treated as a message coming into the node at that pixel.


Figure 3: (a) Time-frame assignments for the front-most surface pixels, based on stereo depth measurements alone, without MRF processing. Grey level indicates the time-frame assignment at each pixel. (b) Shape-time image based on those assignments. (c) Most probable time-frame assignments, computed by MRF. (d) Resulting shape-time image. Note that the belief propagation in the MRF has removed spurious frame assignment changes.


Figure 4: (a) Component frames of banner in wind. (b) Shape-time composite, showing the evolving flag shapes in relation to each other.


(a) output from stereo (b) local evidence input to MRF

Figure 5: (a) From the stereo algorithm, for any given pixel, we find the disparity estimate and its uncertainty for each time frame. (b) We convert these to the probability that each time frame has the maximum disparity at the given pixel. Note that the probability is not necessarily monotonic with the mean disparity (illustrated by the synthetic data plotted here, which are monotonically decreasing in mean disparity over the time frames shown, but not monotonically decreasing in the probability of maximum disparity). These probabilities (in (b)) are the input to the MRF, which computes an approximation to the optimal time-frame assignment at each pixel.

3.1.2 Neighbor compatibility matrices

The compatibility matrix, $\psi_{ij}(x_i, x_j)$, between neighboring nodes $i$ and $j$ is an $N \times N$ matrix determining the probability of a minimum-disparity frame transition from frame $x_i$ to frame $x_j$ between the (neighboring) pixels $i$ and $j$. We set these probabilities to more easily allow frame-assignment breaks in regions of large image gradients. Let the squared magnitude of the image gradient at time $t$ and pixel $i$, scaled to range from 0 to 1, be $g_t(i)$. The diagonal entries of $\psi_{ij}$ are 1; the off-diagonal entries increase with the local gradient magnitudes, so that frame transitions are penalized less at image edges.
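The exact off-diagonal expression is not reproduced above, so the sketch below adopts one plausible form with the stated behavior: a small base transition probability that grows with the gradient strength at the two frames involved. The functional form and constant are our assumptions.

```python
import numpy as np

def compatibility(g_i, g_j, base=0.05):
    """N x N compatibility matrix psi_ij for neighboring pixels i, j.
    g_i[t], g_j[t]: squared image-gradient magnitude at time t,
    scaled to [0, 1]. Diagonal entries are 1 (no frame switch);
    off-diagonal entries grow with the gradients, making switches
    cheaper at image edges. This form is an assumption, not the
    paper's formula."""
    g_i = np.asarray(g_i, float)
    g_j = np.asarray(g_j, float)
    # Entry (s, t): switch from frame s at pixel i to frame t at j.
    psi = base + (1.0 - base) * 0.5 * np.add.outer(g_i, g_j)
    np.fill_diagonal(psi, 1.0)
    return psi
```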

3.1.3 Belief propagation

We have constructed Eq. (2) so that the $\mathbf{x}$ which maximizes $P(\mathbf{x})$ represents our best estimate for the desired time-frame assignments for the shape-time composite image. Exact maximization of $P(\mathbf{x})$ is NP-hard, but good approximate methods exist [17, 6]. We found good results using belief propagation [10, 17, 3] (see [8] for code), which imposes no constraints on the form of the matrices $\psi_{ij}$. (See [15] for stereo depth estimation using belief propagation in a Markov random field.)

In belief propagation, each node sends a "message", $m_{ij}$, to all the neighboring nodes. The messages are $N$-vectors, initialized to all ones. The iterative update equation for the message from node $i$ to node $j$ is:

$$m_{ij}(x_j) \leftarrow \kappa \sum_{x_i} \psi_{ij}(x_i, x_j)\, \phi_i(x_i) \prod_{k \in N(i) \setminus j} m_{ki}(x_i), \qquad (5)$$

where $\kappa$ is a normalization constant and $N(i) \setminus j$ denotes the neighbors of node $i$ other than $j$.

Upon convergence, the marginal probability that time-frame $x_i$ has the maximum disparity is contained in the "belief" at node $i$, $b_i(x_i)$, obtained from the messages through:

$$b_i(x_i) = \kappa\, \phi_i(x_i) \prod_{k \in N(i)} m_{ki}(x_i). \qquad (6)$$

Twenty iterations of passing messages between all pairs of neighboring pixels yielded an estimate for the belief $b_i(x_i)$, the marginal probability that time-frame $x_i$ is in front at pixel $i$. We selected the $\hat{x}_i$ maximizing $b_i(x_i)$; the displayed shape-time composite image was $S(i) = L_{\hat{x}_i}(i)$.
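Here is a compact sum-product implementation of Eqs. (5) and (6) on the 4-connected image grid, vectorized with NumPy; the array layout and names are our choices, not the paper's.

```python
import numpy as np

def shape_time_bp(phi, psi_h, psi_v, n_iters=20):
    """Sum-product belief propagation for the grid MRF of Eq. (2).

    phi:   (H, W, N)      local evidence (assumed strictly positive).
    psi_h: (H, W-1, N, N) psi for edge (y, x)-(y, x+1); axis 2 indexes
                          the left pixel's frame, axis 3 the right's.
    psi_v: (H-1, W, N, N) psi for edge (y, x)-(y+1, x), analogously.
    Returns beliefs (H, W, N); argmax over the last axis gives the
    per-pixel frame assignment."""
    H, W, N = phi.shape
    # One message per edge and direction, initialized to all ones.
    R = np.ones((H, W - 1, N))   # (y, x)   -> (y, x+1)
    L = np.ones((H, W - 1, N))   # (y, x+1) -> (y, x)
    D = np.ones((H - 1, W, N))   # (y, x)   -> (y+1, x)
    U = np.ones((H - 1, W, N))   # (y+1, x) -> (y, x)

    def norm(m):
        return m / m.sum(axis=-1, keepdims=True)

    def evidence_product():
        # phi times all incoming messages, at every pixel.
        prod = phi.copy()
        prod[:, 1:] *= R; prod[:, :-1] *= L   # horizontal neighbors
        prod[1:, :] *= D; prod[:-1, :] *= U   # vertical neighbors
        return prod

    for _ in range(n_iters):
        prod = evidence_product()
        # Eq. (5): divide out the message that came back from the
        # receiver, then contract with the edge compatibility.
        R2 = norm(np.einsum("ywn,ywnm->ywm", prod[:, :-1] / L, psi_h))
        L2 = norm(np.einsum("ywm,ywnm->ywn", prod[:, 1:] / R, psi_h))
        D2 = norm(np.einsum("ywn,ywnm->ywm", prod[:-1, :] / U, psi_v))
        U2 = norm(np.einsum("ywm,ywnm->ywn", prod[1:, :] / D, psi_v))
        R, L, D, U = R2, L2, D2, U2

    return norm(evidence_product())   # Eq. (6)
```

With per-pixel evidence from `front_most_probs` and edge matrices from `compatibility`, `beliefs.argmax(-1)` yields the frame assignment used to index the left-image stack, as in the naive sketch of Section 2.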

We obtained improved stereo disparity results if we bandpass filtered and contrast normalized the rectified images before calculating disparities [3]. This lessened the effect of brightness variations within our stereo camera and fixed some matching problems in low-contrast regions of the image. For Figs. 6 and 8, the stereo disparity values in the distant background were too noisy to be useful, so we generated a mask isolating the foreground person from the image. Both the disparity calculation and the shape-time computation took roughly 90 seconds for the typical images shown here, computed on a 500 MHz machine.

4 Results

We show results indicating possible applications of the shape-time technique. Fig. 4 shows a blowing flag, where fluid dynamics controls the shape evolution over time. The method allows a new way to visualize those shape changes over time.

Figs. 6, 7, and 8 show shape-time applied to people. Figs. 6 and 7 give instructional single-frame summaries of short actions, such as someone's style of throwing, or a particular sewing stitch. Fig. 8 shows the relationship between different shapes on the body or face.

Fig. 9 examines the water height at different phases of a wave breaking on the shore, revealing the surge in water height, relative to the other frames, at the final frame of the sequence, which dominates the shape-time composite image. Fig. 1 (d) shows shape over time as the coin falls.

Some artifacts are visible in the composite images. The face is broken up slightly in Fig. 6, and some background pixels appear attached to the index finger of the left hand in Fig. 7. These are caused by edge artifacts in the depth data from the stereo algorithm. The MRF processing can tolerate small stereo depth errors, but the approach cannot work for imaging conditions where stereo depth gives no useful information, or when such regions cannot be masked out of the image.

Figure 6: (a) Frames of girl throwing snowball. (b) Shape-time photograph showing the girl's throwing form.

One feature of our present implementation is that the composite image does not depict the temporal ordering; we treat all time frames identically. This could be easily changed by altering each displayed pixel by some function of its time frame: for example, by darkening, reducing the color saturation, or reducing the opacity of pixels further back in time. We chose the current implementation to emphasize the depiction of shape relationships over the depiction of temporal dynamics.

5 Conclusions

We proposed a new method for showing the shape relationships between objects at different times. We pointed out the usefulness of shape-time photography, and showed a method to implement it. We developed an algorithm to reduce the artifacts resulting from noise in stereo disparity measurements. Since this rendering method is sensitive to small errors in depth, the algorithm may be needed for shape-time renderings from other depth modalities as well.

The method occupies a special-effects niche. Shape-time photography could be useful for summarizing action, for instructional photographs, or for physics illustrations. It can describe in a picture how things move. It may reveal a pattern or spatial relationship in the world that is not clear from the video sequence or its individual frames.



Figure 7: (a) Frames of sewing stitch example. (b) Shape-time rendering of the sewing stitch, illustrating the hand's movement.

Figure 8: Portrait of a man, from frontal and profile views. Intersection contours in the shape-time image describe his face shape.


Figure 9: (a) Images in wave sequence. (b) Shape-time composite image of ocean wave breaking. (c) Inside-out rendering of wave (furthest surface shown at every point).

The shape-time rendering of this paper is a special case of a more general problem: given a stack of images captured from one viewpoint, use computer vision analysis to select which pixels to display in a composite image. The pixel selection could depend on object motion (show where the objects moved fastest, or where something moved toward you), or on the orientation of a face (show wherever the dancer looked back). As one example of this generalization, in Fig. 9 (c) we show the wave rendered "inside out": we display the surfaces furthest away from the camera. This gives a picture of the lowest water in the breaking wave during its cycle.
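In the hypothetical names of the earlier sketch, the inside-out rendering of Fig. 9(c) is just the naive composite with argmax in place of argmin:

```python
import numpy as np

def inside_out_shape_time(left_images, depths):
    """Fig. 9(c)-style rendering: show the furthest surface at
    every pixel by maximizing rather than minimizing depth."""
    assignment = depths.argmax(axis=0)
    y, x = np.indices(assignment.shape)
    return left_images[assignment, y, x]
```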

Acknowledgments

We thank Steve Seitz for helpful conversations about this research.

References

[1] M. Braun. Picturing Time. University of Chicago Press, 1992.

[2] I. Essa, 2002. Personal communication.

[3] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael. Learning low-level vision. Intl. J. Computer Vision, 40(1):25–47, 2000.

[4] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Analysis and Machine Intelligence, 6:721–741, 1984.

[5] A. Klein, P. Sloan, A. Colburn, A. Finkelstein, and M. F. Cohen. Video cubism. Technical Report MSR-TR-2001-45, Microsoft Research, 2001.

[6] V. Kolmogorov and R. Zabih. Computing visual correspondence with occlusions using graph cuts. In Intl. Conf. on Computer Vision (ICCV), 2001.

[7] M. Massey and W. Bender. Salient stills: process and practice. IBM Systems Journal, 35(3&4):557–574, 1996.

[8] K. Murphy, 2001. www.cs.berkeley.edu/~murphyk/Bayes/bnt.html.

[9] E. Muybridge. Horses and Other Animals in Motion. Dover, 1985.

[10] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[11] M. Pollefeys, R. Koch, and L. Van Gool. A simple and efficient rectification method for general motion. In Intl. Conf. on Computer Vision (ICCV), pages 496–501, 1999.

[12] H. S. Sawhney, Y. Guo, K. Hanna, R. Kumar, S. Adkins, and S. Zhou. Hybrid stereo camera. In ACM SIGGRAPH, Computer Graphics Proceedings, Annual Conference Series, 2001.

[13] H. S. Sawhney and R. Kumar, 2001. Personal communication.

[14] L. Schuiten and F. Schuiten. Nogegon. Humanoids Publishing, www.humanoids-publishing.com, 2001.

[15] J. Sun, H. Shum, and N. Zheng. Stereo matching using belief propagation. In Proc. ECCV, 2002.

[16] H. Tao, H. Sawhney, and R. Kumar. Dynamic layer representation with applications to tracking. In Proc. IEEE Computer Vision and Pattern Recognition, Hilton Head, SC, 2000.

[17] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Generalized belief propagation. In Adv. in Neural Info. Proc. Systems, volume 13, pages 689–695. MIT Press, 2001.

[18] Z. Zhang. Determining the epipolar geometry and its uncertainty: a review. Technical Report 2927, INRIA Sophia-Antipolis, France, 1996. See http://www-sop.inria.fr/robotvis/demo/f-http/html/.

[19] C. L. Zitnick and T. Kanade. A cooperative algorithm for stereo matching and occlusion detection. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(7), July 2000.
