
Spatial Navigation for Context-Aware Video Surveillance


IEEE Computer Graphics and Applications, Multimedia Analytics, September/October 2010. Published by the IEEE Computer Society, 0272-1716/10/$26.00 © 2010 IEEE.


Gerwin de Haan, Huib Piguillet, and Frits H. Post ■ Delft University of Technology

Interactive spatial navigation for video surveillance networks can be difficult. This is especially true for live tracking of complex events along many cameras, in which operators must make quick, accurate navigation decisions on the basis of the actual situation. The proposed spatial-navigation interface facilitates such video surveillance tasks.

User interfaces for video surveillance networks are traditionally based on arrays of video displays, maps, and indirect controls. To address such interfaces' usability limitations when the number of cameras increases, novel video surveillance systems aim to improve context awareness. However, interactive spatial navigation is still difficult, especially for live tracking of complex events along many cameras. To visually keep track of such events, the operator must make quick, correct decisions on where to navigate along the many available cameras.

To support this demanding task, we've designed a spatial-navigation interface that combines navigation tools with improved view transitions. As guidelines for this design, we used the operator's attention on the event actions and spatial awareness. First, to help operators focus on the action instead of an external navigation interface, our interface lets them directly navigate in the current visible video via the mouse. While the operator tracks the action with the mouse, interactive 3D widgets augmented in the video provide visual feedback on the available camera transitions. Second, we've dynamically optimized visual transitions between videos to ensure context awareness at those points.

Motivation

Video surveillance control rooms must address an ever-growing amount and diversity of data from complex environments. Currently, operators observe video streams directly on large matrix display arrangements, combined with an interactive camera layout plan (for example, see Figure 1). This basic layout is useful primarily for a general overview and for first detection of abnormal events. However, it's less suitable for coherently following an event's course through time and space.

In a complex environment with densely placed cameras, maintaining orientation and spatial context between individual streams is difficult. Cognitive overload can easily occur, especially in time-critical situations involving live video streams. For example, tracking persons between different screens or finding an uncluttered viewing angle of an important object in a crowded scene can be extremely difficult.

To overcome some of these problems, recent systems have begun displaying individual video streams within their spatial context—for example, in a 2D map or 3D virtual environment. (The "Toward Context-Aware Surveillance Systems" sidebar summarizes related work in this area.) However, because of imperfect video alignment and complex 3D interaction, these integrated environments introduce new challenges in navigation and comprehensive viewing.

In a previous paper,1 we argued that major bottlenecks in many surveillance tasks stem from the difference between view reference frames in different camera views and the map view,2 and the shifting of visual attention. Whereas operators can naturally reason and act upon first-person (egocentric) views from a single camera, they must mentally translate their reasoning back to the other camera views or the map for third-person (exocentric) representations.

So, we proposed a first-person viewing and navigation interface for integrated surveillance monitoring in a 3D virtual environment.



Because most of the spatial information and overlap of all cameras is available, the system performs the mental spatial mapping for the user and provides this as feedback in the first-person 3D view. This feature aims to provide effortless navigation along sets of available video streams, as if the operator were flying from camera to camera in the 3D environment. During navigation along images, the interface enhances the visual flow to maximize coherence between video streams and spatial context (see Figure 2).

Our current research looks beyond the visual characteristics of spatial transitions and extends context awareness to interactive navigation and visual feedback. Our original interaction was restricted to back-and-forth navigation along a predefined 3D guidance path for a series of camera views defined in the context graph. Although guidance is essential, having a predefined path is often too limited for live tracking of events in a complex camera network: persons or vehicles simply don't follow fixed paths. Operators must be able to make quick, accurate navigation decisions based on the actual situation.

Also, we found that the manual construction of guidance paths and view transition parameters for all camera pairs was not only labor intensive but also difficult to specify correctly for different events. Because different events could happen in different regions of a single camera pair's observed video, the parameters for a smooth transition could differ completely for each region. So, improving automatic, adaptive creation of transition parameters is important.

View Transitions

View transitions help operators preserve spatial awareness while navigating from camera to camera in a context graph. Besides maintaining spatial awareness, operators must be able to maintain their attention on a point of interest (POI) in the scene. This requires a transition function for enabling a changing view that's perceptually the easiest to follow. So, for a given view transition, the resulting camera movement should be simple in direction and velocity to make it predictable for the observer.

Types of Transitions

From analyzing various surveillance scenarios (see the "Surveillance Scenarios" sidebar), we identified three types of view changes: broadening or narrowing views, sliding views, and rotating views. We call the corresponding view transitions zooming, panoramic, and orbiting. For each of these transitions, we designed a function that results in a steady view change (see Figure 3).

We define these functions as adjustments of the viewpoint's position, orientation, and field of view, all parameterized over time. Each function interpolates both the orientation and the field of view linearly over time but determines the viewpoint's position differently.
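To make this concrete, the sketch below shows one way such a transition function could be organized. It is a minimal illustration rather than the authors' implementation; the class name, the yaw/pitch encoding of orientation, and the per-type position callback are assumptions made for brevity.

```python
# Minimal sketch (not the authors' code): a view transition that interpolates
# orientation and field of view linearly over time, while delegating the
# viewpoint position to a per-type function (zoom, panoramic, or orbit).
from dataclasses import dataclass

@dataclass
class Viewpoint:
    position: tuple    # (x, y, z) in world coordinates
    yaw_pitch: tuple   # orientation as (yaw, pitch) in degrees, for simplicity
    fov: float         # vertical field of view in degrees

def lerp(a, b, t):
    return a + (b - a) * t

def transition_view(cam_a, cam_b, t, position_fn):
    """Blend from camera A's view to camera B's view at time t in [0, 1]."""
    yaw = lerp(cam_a.yaw_pitch[0], cam_b.yaw_pitch[0], t)
    pitch = lerp(cam_a.yaw_pitch[1], cam_b.yaw_pitch[1], t)
    fov = lerp(cam_a.fov, cam_b.fov, t)
    return Viewpoint(position_fn(cam_a, cam_b, t), (yaw, pitch), fov)

def zoom_position(cam_a, cam_b, t):
    # For zooming, a plain linear interpolation of the position already works well.
    return tuple(lerp(pa, pb, t) for pa, pb in zip(cam_a.position, cam_b.position))
```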

Improving the View Transition Function

Although a linear interpolation of the viewpoint's position and orientation would be trivial, in many cases it doesn't produce desirable view changes. For zooming, it works fine because the view broadens or narrows predictably without a lot of movement.

Figure 1. A surveillance control room with a traditional video matrix display arrangement. Operators select sets of videos from 2D maps. This setup isn't suitable for coherently following an event's course through time and space.

Figure 2. A view transition sequence for tracking a person between camera A and adjacent camera B in an office hallway surveillance scenario.1 The guided 3D navigation enables simple first-person video observation while ensuring a good visual flow for spatial context. This method dynamically embeds and blends video canvases in a 3D virtual environment, which consists of characteristic model landmarks and perspective lines.


But applying this method to a panoramic transition often results in a rapid visual flow, which can disorient the user. For orbiting, things get even worse, because the distance to the POI can vary considerably, and the POI can move in an unpredictable path on the screen (see Figure 4a).

We can solve the issue introduced by panoramic movement by increasing the distance to the video canvases gradually, providing a broader view that slides through the screen at a lower speed. Takeo Igarashi and Ken Hinckley applied such a mechanism to a document-viewing problem.3 Jarke van Wijk and Wim Nuij further refined this approach in a search for an optimal viewpoint trajectory to keep the perceived speed constant.4

Desney Tan and his colleagues applied this concept to 3D navigation, using speed-coupled flying,5 which couples the camera's height and tilt to its movement speed. We've developed a similar technique to define panoramic transitions by using Bézier curves.

We can parameterize these Bézier curves by time t and a set of control points, which we can easily compute from the position and orientation of cameras A and B (see Figure 3c). The control points influence the trajectory's curvature, allowing for tweaking to obtain an optimal path. We then determine the viewpoint's position at a certain time by computing B(t), the position of a point on the curve at t.

Figure 3. The three view transition types between four cameras (A through D) around a house and a person—the point of interest (POI): (a) the four individual-camera views (right to left), along with the zoom, orbit, and panoramic transitions between them; (b) each camera's pose and field of view and the corresponding transition in plan view; and (c) each transition type's resulting visual effect.

Figure 4. A comparison of visual-transition results between two viewpoint interpolation schemes. (a) A simple camera viewpoint interpolation in space results in a curved screen trajectory of the POI. (b) Optimizing the viewpoint interpolation results in a smooth line of the POI on the screen.
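A cubic Bézier curve evaluated with De Casteljau's construction is one straightforward way to realize such a panoramic path. The sketch below is illustrative only: the choice of inner control points (pushed backward along each camera's viewing direction, assumed available as a `backward` unit vector) is an assumption, not the authors' exact parameterization.

```python
# Sketch of a Bezier-based panoramic path; the control-point choice is illustrative.
def lerp3(p, q, t):
    return tuple(pi + (qi - pi) * t for pi, qi in zip(p, q))

def bezier_point(control_points, t):
    """Evaluate B(t) with De Casteljau's construction."""
    pts = list(control_points)
    while len(pts) > 1:
        pts = [lerp3(pts[i], pts[i + 1], t) for i in range(len(pts) - 1)]
    return pts[0]

def panoramic_position(cam_a, cam_b, t, pull_back=5.0):
    """Viewpoint position on a cubic Bezier between cameras A and B.

    The two inner control points are pushed backward along each camera's
    viewing direction (assumed available as cam.backward, a unit vector),
    so the view widens mid-transition and the visual flow slows down.
    """
    p0, p3 = cam_a.position, cam_b.position
    p1 = tuple(a + pull_back * d for a, d in zip(p0, cam_a.backward))
    p2 = tuple(b + pull_back * d for b, d in zip(p3, cam_b.backward))
    return bezier_point((p0, p1, p2, p3), t)
```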

Orbit transitions are based on a rotation around a certain POI. We can enhance the view transition by interpolating the distance between the viewpoint and the POI, as well as the POI's position on the screen. From the resulting values, along with the viewpoint's orientation and field of view, we can then calculate the viewpoint's position. This approach is an extension of orienting POI movement.6 The result is a smooth change of the viewpoint-to-POI distance and a straight path of the POI on the screen. Figure 4 illustrates the difference between direct linear interpolation of the viewpoint position and implicit determination based on the POI.
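The following sketch approximates this behavior by interpolating the viewpoint's angle around the POI, its horizontal distance, and its height. The article instead derives the position implicitly from the POI's interpolated screen position, so treat this as a simplified stand-in rather than the described method.

```python
import math

def orbit_position(cam_a, cam_b, poi, t):
    """Simplified orbit around the POI: blend angle, distance, and height.

    cam_a and cam_b expose a .position tuple; poi is an (x, y, z) tuple.
    """
    def offset(cam):
        return tuple(c - p for c, p in zip(cam.position, poi))

    oa, ob = offset(cam_a), offset(cam_b)
    # Horizontal angle of each camera around the POI; take the shorter arc.
    ang_a, ang_b = math.atan2(oa[1], oa[0]), math.atan2(ob[1], ob[0])
    d_ang = (ang_b - ang_a + math.pi) % (2 * math.pi) - math.pi
    ang = ang_a + t * d_ang
    # Blend the horizontal distance to the POI and the height above it.
    dist_a, dist_b = math.hypot(oa[0], oa[1]), math.hypot(ob[0], ob[1])
    dist = dist_a + t * (dist_b - dist_a)
    height = oa[2] + t * (ob[2] - oa[2])
    return (poi[0] + dist * math.cos(ang),
            poi[1] + dist * math.sin(ang),
            poi[2] + height)
```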

Navigation Tools

The main purpose for integrating a set of videos with a 3D environment is to reduce the mental effort required to execute certain surveillance tasks such as person tracking. Many of these tasks involve controlling the view-transition sequence, which corresponds to navigation through the 3D environment.

On the basis of formative evaluation and early experiments on various surveillance scenarios, we describe two navigation interface requirements. To perform interactive navigation from the current camera view to other cameras, the user first needs information on the availability and pose (position and orientation) of the neighboring cameras at a certain POI (for example, a person). Second, the user must be able to select and initiate predictable transitions to these cameras, preferably while focusing on the POI.

With these requirements in mind, we developed a comprehensive set of interaction tools. Figure 5 provides an overview of the tools visible during a transition. We selected a simple 2D mouse as the main input method because it provides fast, absolute 2D pointing at a POI in the video. The tools are connected either to cameras (glyphs, view frustum bounds, and border thumbnails) or to other tools (the 3D cursor). We've made the connection to a camera clear by labeling the tools with distinct colors for each camera. The tools related to the cameras can be used interchangeably, and they compensate for one another when certain tools can't be shown or are preferred for any other reason.

Figure 5. Screenshots from using the navigation tools for a view transition. The active video is superimposed on an abstract 3D model of the scene. Arrow and orbit glyphs indicate related cameras; the thumbnails preview adjacent videos. (a) The user locates the POI by controlling the 3D cursor with the mouse. (b) A quick mouse click-and-drag over one of the glyphs highlights the glyph itself and the respective camera view frustum and thumbnail. (c) A mouse release starts the transition, and a 3D cursor ghost appears. (d) The target camera is reached, and the user can reselect a new POI. The large numbers on the bottom left of each screenshot are the camera numbers.

Arrow Glyphs

Arrow shapes are widely used to represent motion. Marc Nienhaus and Jürgen Döllner employed arrow glyphs to depict dynamics in illustrations of animations, using their shape, size, and other properties to illustrate the motion path and the moving objects' direction and velocity.7 Each glyph corresponds to a transition between two cameras. While viewing a scene using a given camera, users can see only the glyphs that start from this camera. They can directly control these glyphs from within the camera view by selecting them with the mouse pointer.

The 3D Cursor and Orbit Glyphs

To enable navigation based on predictions about activity in videos, a tool for selecting a POI is necessary. You can shoot such a 3D cursor onto the scene via a ray from the mouse pointer into the 3D scene. In our case, we specifically use floor parts of the scene—the parts that persons or vehicles can move across. In this way, the cursor appears to slide over the model when the user moves the mouse pointer on the screen.
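A minimal sketch of this picking step, assuming the floor parts can be approximated by a horizontal plane and that the pick ray has already been unprojected from the mouse position, could look as follows (a real implementation would intersect the ray with the actual floor geometry):

```python
def cursor_on_floor(ray_origin, ray_direction, floor_z=0.0):
    """Intersect the mouse pick ray with a horizontal floor plane z = floor_z.

    ray_origin and ray_direction are (x, y, z) tuples in world space; the ray
    is typically obtained by unprojecting the mouse position. Returns the 3D
    cursor position, or None if the ray does not hit the floor in front of
    the viewpoint.
    """
    dz = ray_direction[2]
    if abs(dz) < 1e-9:
        return None  # Ray runs parallel to the floor.
    t = (floor_z - ray_origin[2]) / dz
    if t <= 0:
        return None  # Intersection lies behind the viewpoint.
    return tuple(o + t * d for o, d in zip(ray_origin, ray_direction))
```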

Although a point would be a simple cursor shape, it would give almost no visual feedback regarding its 3D position in the scene. A circle-shaped cursor would give a better indication because, under perspective viewing, it would scale with the distance from the camera and appear rounder when viewed from above. But such a cursor could be misleading when placed on an object in the video that's not modeled in the scene.


So, a full 3D cursor can best indicate its position in the scene. To prevent the cursor from covering a large part of the video, it's important to build it only from the parts needed to span a 3D shape.

Meeting these requirements, we designed a 3D cursor consisting of the outlines of a cylinder that would fit around the average person (1 meter wide and 2 meters high). To create the illusion of a transparent cylinder, the cylinder is always rotated toward the viewer so that the vertical outlines are always on the cylinder's outer left and right. The cursor is visible only when the user points the mouse at a specified floor part and not at a selectable interaction tool.

When the cursor is visible, connections to cameras that are directly reachable with an orbit transition appear with orbit glyphs. We added these glyphs because normal glyphs indicate only the transition's direction, whereas orbit transitions mainly involve a rotation around a point in the scene. Orbit transitions and orbit glyphs are useful when there's no particular direction that our straight glyphs can depict. An orbit glyph consists of

■ a dashed line from the 3D cursor's top to the relevant camera, and

■ a polygon from the cursor's bottom to the camera position's projection on the same vertical plane as the cursor's bottom.

Pressing the mouse button makes connections to the regular glyphs. The user can then drag the mouse pointer around the cylinder to select a glyph from the cursor, which is similar to selecting an item in a pie menu. Releasing the mouse button initiates a transition to the selected camera. Figure 6 illustrates how to use the cursor menu.
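A hypothetical helper for this pie-menu-style selection might compare the drag direction against the screen-space angles of the available glyphs around the cursor; the threshold and data layout below are illustrative assumptions:

```python
import math

def pick_glyph(drag_vector, glyph_angles, min_drag_px=5.0):
    """Pick the glyph whose screen-space direction best matches the drag.

    drag_vector: (dx, dy) in pixels from the cursor to the current mouse position.
    glyph_angles: mapping of camera id -> angle (radians) of that glyph around
    the cursor. Returns a camera id, or None for a negligible drag.
    """
    if math.hypot(drag_vector[0], drag_vector[1]) < min_drag_px:
        return None  # Treat tiny drags as a plain click, not a selection.
    drag_angle = math.atan2(drag_vector[1], drag_vector[0])

    def angular_distance(a, b):
        return abs((a - b + math.pi) % (2 * math.pi) - math.pi)

    return min(glyph_angles, key=lambda cam: angular_distance(glyph_angles[cam], drag_angle))
```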

During a transition initiated from the cursor menu, keeping track of the selected POI can be difficult. So, we spawn a "ghost" copy in the scene just before the transition, and we position it at the POI. The human brain is well trained to follow distinct objects' motion, especially if the object follows a predictable path (which our view transition functions make possible). The cursor ghost makes it immediately clear where to focus when the transition is over. The ghost also shows the orientation change, because the cross-hair indicator at the bottom circle stays aligned with the world's two horizontal axes. When the ghost fades away, the user can use the 3D cursor again.

Sidebar: Toward Context-Aware Surveillance Systems

Researchers have proposed new types of interfaces for spatial context in multicamera video surveillance systems.1,2 Such systems generally convey camera streams' spatial context through placement of video thumbnails, camera icons, and coverage indicators on a 2D map of the environment. For instance, in a user study conducted by Andreas Girgensohn and his colleagues, the Spatial Multi-video (SMV) Player provided coherence between video streams by carefully arranging spatially related video thumbnails around the current, selected camera stream.3 In a sense, these interfaces convey spatial context by embedding multiple egocentric (first-person) representations (video streams) into the exocentric (third-person) 2D map representation. In this user study, spatial context, provided by either a map or spatially related videos, improved user performance and acceptance for tracking persons walking on an office floor across multiple cameras.

Other researchers have added a third dimension to this spatial representation by putting the video streams directly in a 3D model of the monitored area. The Augmented Virtual Environment (AVE) system4–6 and the video flashlight technique7 integrate live video streams with an accurate 3D environment model using projective texture mapping. These approaches aim to provide sufficient model reconstruction of the 3D world for observation from arbitrary viewpoints.

Other approaches use 3D canvas objects to project videos; they focus more on the user navigation interface.8–10 To explore the design space of combining videos with 3D environment models, Yi Wang and his colleagues developed a testbed that combines existing, static rendering techniques and provides basic navigation, and they report on possible usage patterns.8 Their user study indicates that 3D integration can indeed be useful in surveillance tasks.11

Gerwin de Haan and his colleagues spatially embedded complete video streams in a first-person view to alleviate duality in the navigation interface.10 Photo Tourism dynamically transforms texture billboards and focuses on improved real-time navigation.12 The user should be able to smoothly navigate to this position instead of teleporting, to maintain spatial context and orientation.13 Benjamin Bederson and Angela Boltman show that, for inspecting spatial information, animations and smooth transitions of the viewpoint can help users build a mental map of spatial-object relations.14 Using a 3D environment makes fast, easy navigation challenging—for instance, because of awkward controls or loss of orientation. Yi Wang and his colleagues have emphasized automated and guided navigation's importance for a complex surveillance context.8

Although Suya You and Ulrich Neumann reported on automated fly-throughs of the AVE-based V-Sentinel system with optimal and natural displays along predefined paths and generated alarms, they didn't provide details of these transitions.6 The 3D video player developed by Girgensohn and his colleagues provides arbitrary 3D viewing or camera transitions to follow automatically tracked persons.9 Our method focuses on simplifying interactive navigation, such that the user is guided or constrained along the views in a comprehensible manner. (Also, see the work of Tinsley Galyean.15)

In an extension of Photo Tourism, Noah Snavely and his colleagues combined path-based, egocentric navigation and blended photographs.16 We're currently extending our navigation interface by augmenting the videos with context-aware feedback.

References

1. A. Girgensohn et al., "Support for Effective Use of Multiple Video Streams in Security," Proc. 4th ACM Int'l Workshop Video Surveillance and Sensor Networks (VSSN 06), ACM Press, 2006, pp. 19–26.

2. Y.A. Ivanov and C.R. Wren, "Toward Spatial Queries for Spatial Surveillance Tasks," Proc. Pervasive Workshop Pervasive Technology Applied to Real-World Experiences with RFID and Sensor Networks (PTA 06), 2006; www.hcilab.org/events/pta2006/pdf/n010-f05-wren.pdf.

3. A. Girgensohn et al., "Effects of Presenting Geographic Context on Tracking Activity between Cameras," Proc. SIGCHI Conf. Human Factors in Computing Systems (CHI 07), ACM Press, 2007, pp. 1167–1176.

4. U. Neumann et al., "Augmented Virtual Environments (AVE): Dynamic Fusion of Imagery and 3D Models," Proc. IEEE Virtual Reality, IEEE CS Press, 2003, pp. 61–67.

5. I.O. Sebe et al., "3D Video Surveillance with Augmented Virtual Environments," Proc. 1st ACM SIGMM Int'l Workshop Video Surveillance (IWVS 03), ACM Press, 2003, pp. 107–112.

6. S. You and U. Neumann, "V-Sentinel: A Novel Framework for Situational Awareness and Surveillance," Proc. SPIE, vol. 5778, no. 713, 2005, pp. 713–724.

7. H.S. Sawhney et al., "Video Flashlights: Real Time Rendering of Multiple Videos for Immersive Model Visualization," Proc. 13th Eurographics Workshop Rendering (EGWR 02), Eurographics Assoc., 2002, pp. 157–168.

8. Y. Wang et al., "Contextualized Videos: Combining Videos with Environment Models to Support Situational Understanding," IEEE Trans. Visualization and Computer Graphics, vol. 13, no. 6, 2007, pp. 1568–1575.

9. A. Girgensohn et al., "DOTS: Support for Effective Video Surveillance," Proc. 15th Int'l Conf. Multimedia, ACM Press, 2007, pp. 423–432.

10. G. de Haan et al., "Egocentric Navigation for Video Surveillance in 3D Virtual Environments," Proc. IEEE Symp. 3D User Interfaces (3DUI 09), IEEE Press, 2009, pp. 103–110.

11. Y. Wang et al., "Effects of Video Placement and Spatial Context Presentation on Path Reconstruction Tasks with Contextualized Videos," IEEE Trans. Visualization and Computer Graphics, vol. 14, no. 6, 2008, pp. 1755–1762.

12. N. Snavely, S.M. Seitz, and R. Szeliski, "Photo Tourism: Exploring Photo Collections in 3D," ACM Trans. Graphics, vol. 25, no. 3, 2006, pp. 835–846.

13. D.A. Bowman, D. Koller, and L.F. Hodges, "Travel in Immersive Virtual Environments: An Evaluation of Viewpoint Motion Control Techniques," Proc. Virtual Reality Ann. Int'l Symp. (VRAIS 97), IEEE CS Press, 1997, pp. 45–52.

14. B.B. Bederson and A. Boltman, "Does Animation Help Users Build Mental Maps of Spatial Information?" Proc. IEEE Symp. Information Visualization (InfoVis 99), IEEE CS Press, 1999, pp. 28–35.

15. T.A. Galyean, "Guided Navigation of Virtual Environments," Proc. Symp. Interactive 3D Graphics (SI3D 95), ACM Press, 1995, pp. 103–104.

16. N. Snavely et al., "Finding Paths through the World's Photos," ACM Trans. Graphics, vol. 27, no. 3, 2008, pp. 11–21.


Sidebar: Surveillance Scenarios

To evaluate our interface prototype, we employ several surveillance scenarios. To prepare the scenarios, we use the Blender 3D modeler (www.blender.org) and a Python script. This script can render animations to videos, letting us create artificial datasets. Because the availability of real camera data is limited, such artificial datasets can be useful for creating any scenario that suits specific needs.

Preparing a scenario involves creating a model of the environment and then aligning virtual cameras with the real cameras. Subsequently, we can place arrow glyphs in the model, as well as any animated objects. Using a Python script, we can then export the cameras' spatial properties and the glyphs to an XML file. Optionally, when we want to create an artificial scenario, we can render videos from the virtual cameras. The model, XML file, and videos form a scenario that we can load into the prototype.
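As an illustration of such an export step, the sketch below collects camera poses from Blender's Python API and writes them to an XML file. It must run inside Blender, and the element and attribute names are made up for this example rather than taken from the authors' file format.

```python
# Runs inside Blender; element and attribute names are illustrative only.
import math
import xml.etree.ElementTree as ET
import bpy

def export_cameras(path="scenario_cameras.xml"):
    root = ET.Element("scenario")
    for obj in bpy.data.objects:
        if obj.type != 'CAMERA':
            continue
        cam = ET.SubElement(root, "camera", name=obj.name)
        x, y, z = obj.location
        ET.SubElement(cam, "position", x=str(x), y=str(y), z=str(z))
        rx, ry, rz = (math.degrees(a) for a in obj.rotation_euler)
        ET.SubElement(cam, "rotation", x=str(rx), y=str(ry), z=str(rz))
        # obj.data.angle is the camera's field of view in radians.
        ET.SubElement(cam, "fov", degrees=str(math.degrees(obj.data.angle)))
    ET.ElementTree(root).write(path)

export_cameras()
```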

PETS Datasets

The University of Reading's Computational Vision Group (www.cvg.rdg.ac.uk) has made several datasets, which the IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS) uses as publicly available benchmarks. We've prepared three of these datasets for our system (see Figures A1, A2, and A3 for examples). These datasets contain video footage from cameras in public environments where many different people walk in and out. The datasets are useful for investigating crowds and certain events when viewed from different angles and distances. (They are, however, less interesting for evaluating our set of navigation tools as a whole, because all cameras focus on the same area.)

Office Hallway

The office scenario (see Figure A4) offers a suitable environment for testing all our navigation tools. It contains a camera setup that includes various camera relations. However, the videos don't show much activity, making this scenario less interesting for tracking experiments.

Highway Traffic

Camera setups for monitoring traffic are unique in that they consist mainly of sequential cameras placed at large distances from one another along a road. We created a synthetic scenario that incorporates a highway portion, and we animated some car models in it (see Figure A5). The resulting dataset can then help smooth out zooming transitions.

Evaluation Environment

To evaluate our prototype, we needed an environment featuring various camera setups and video activity that wasn't trivial to follow. Because surveillance videos aren't widely available, especially those that suit our needs, we created a synthetic evaluation environment (see Figure A6).


Figure A. Screenshots and maps of some surveillance scenarios we've used: (1) train station, (2) airport, (3) crossing, (4) office hallway, (5) highway traffic, and (6) synthetic test environment. We used these scenarios to evaluate the design of the navigation tools and transition techniques. Each scenario has a different characteristic in terms of camera setup and viewing angles.



A disadvantage of such a static ghost is the time gap. The person or object being followed will likely have moved during the transition, appearing in a different place than the cursor ghost. We can partly solve this problem via a prediction based on the menu selection's direction or other gestures, but an exact prediction isn't possible without perfect automatic tracking.

Video Thumbnails

Sometimes, a camera has connections to other cameras, but the user can't see any floor part of the model and can view only some or none of the glyphs. In such cases, navigation directly from within the camera view isn't possible, so other interaction tools are needed. Inspired by the Spatial Multi-video (SMV) Player,8 we add thumbnails of the videos recorded by neighboring cameras in a border around the main view. Pressing the mouse button while the cursor hovers over a thumbnail initiates a transition to the corresponding camera; this transition updates the border with that camera's thumbnails.

View Frustum Bounds

When zooming in on one or more regions of the video is possible, the 3D tools don't make completely clear which part of the video will be visible after the transition. We could intersect the corresponding camera's view frustum with the model and project it on the image plane. However, this projection won't be correct if many objects are visible in the video that aren't included in the model. A simpler approach is to take a rectangular slice of the view frustum, but then, at which distance from the camera should we do this?

We can provide a better estimation of the content that a camera sees by showing the shape of that camera's view frustum so that the camera's position, orientation, and visible area are clear. To realize this, we display the outlines of planes intersecting the view frustum at various distances up to the camera's focal distance.
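One way to compute such an outline is to slice the frustum with planes at increasing distances and draw the resulting rectangles. The sketch below assumes the camera's unit axes and vertical field of view are available and is only an approximation of the tool described here.

```python
import math

def frustum_slice_corners(cam_position, forward, right, up, distance, vfov_deg, aspect):
    """World-space corners of the rectangle where a plane at the given
    distance cuts the camera's view frustum.

    forward, right, and up are the camera's unit axes; vfov_deg is the
    vertical field of view and aspect the width/height ratio. Drawing these
    rectangles at several distances up to the focal distance outlines the
    frustum.
    """
    half_h = distance * math.tan(math.radians(vfov_deg) / 2.0)
    half_w = half_h * aspect
    center = [p + distance * f for p, f in zip(cam_position, forward)]

    def corner(sx, sy):
        return tuple(c + sx * half_w * r + sy * half_h * u
                     for c, r, u in zip(center, right, up))

    return [corner(-1, -1), corner(1, -1), corner(1, 1), corner(-1, 1)]
```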

Evaluation

To gain more insight and feedback on our navigation tools' usability and effectiveness, we performed an experiment involving 12 participants.

Experiment Description

Our evaluation aimed mainly to get qualitative results and feedback on how the navigation interface performed in basic scenarios, especially any difficulties it posed. So, the participants performed simplified surveillance tracking tasks. The main task was to track one specific actor, the subject, as recorded on several video cameras while walking randomly in an indoor office scene. We asked the participants to track the subject along the camera views and then, once the subject walked through one of the doors, write down that door's identification marker.

Because the experiment emphasized how well people track subjects during various transitions, we kept the scenarios and video material as basic as possible to avoid uncontrolled influences. First, we selected three basic surveillance scenarios, each based on a camera setup for a specific view transition: orbit, panoramic, and zoom (see Figure 7). Second, we limited the number of cameras to four. Larger-scale tracking would be very difficult in the simple matrix interface. Third, we used prerendered synthetic videos with simplistic and identical virtual actors. We made the actors identical to ensure that participants would follow the subject from the beginning and not simply spot or retrack the subject solely through visual recognition. For each scenario, we prepared four synthetic videos, which showed various animated actors walking across the scene, as viewed from the cameras.

Figure 6. The 3D selection cursor and glyphs: (a) The glyphs of possible orbit transitions connect directly to the mouse cursor. (b) When the user clicks the mouse, other glyphs connect to the cursor. (c) A pie menu appears around the cursor so that the user can quickly drag to a region to select a camera. Releasing the mouse triggers the selected transition.

Each participant performed a tracking task with two interfaces (see Figure 8). The first involved a traditional static matrix layout; the second employed our navigation tools. Although the results from the matrix interface can serve as a comparison with our interface, we included the matrix interface mainly to get user feedback on which scenarios are difficult, independent of the interface used.

Before the participants performed the tasks, we introduced them to the system and gave them time to get used to the interfaces. For the matrix interface, we presented them with a printed plan view of the camera layout. Each tracking task took about 2 minutes. We divided the scenarios among all participants such that every scenario was used an equal number of times and four participants performed each task for every scenario. Participants had two chances for each task, so they could start over if they thought they had lost the subject.

Figure 9 shows the experimental setup. During each session, we asked participants to think aloud while performing the task, which we recorded via a microphone. At the same time, we recorded synchronized videos of the participants and the interactive session's screen content. After the sessions, the participants filled out a questionnaire about their experiences during the tasks.

Results

From the experiment results, the recordings, and the questionnaire answers, we distilled some noticeable issues, including differences between the scenarios and the surveillance interface used (matrix versus spatial).

Figure 7. Plan views of the spatial camera layout for the (a) orbit, (b) panoramic, and (c) zoom scenarios used in the experiments. Red icons indicate the four numbered camera positions; marked green areas indicate the target doors.

Figure 8. Screenshots of the two interfaces used in the experiment, showing videos of the orbit scenario captured for a synchronized moment in time: (a) the numbered videos in a traditional matrix layout and (b) the videos in our proposed spatial interface, including the 3D cursor, the navigation glyphs, and the border thumbnails, with the cross-hair indicators highlighting the cursor and POI position in the thumbnail videos.

The questionnaire first asked how useful each navigation tool was and why; Figure 10 shows the results. This histogram shows that participants found glyphs the most useful. The main reason was that the view transition's direction was immediately clear from the arrow shape, so using glyphs to navigate felt natural. Participants also found the 3D cursor useful, but they used it actively only in the orbit scenario, in which it was critical for distinguishing between different actors. In the other scenarios, participants mainly used the 3D cursor passively to mark and compare a subject with its instances on neighboring video thumbnails.

Participants used video thumbnails mainly to check whether a subject appeared on one of the videos other than the main video; some participants used the cross-hair indicator as well. Finally, participants found view frustum bounds the least useful. Sometimes, participants were confused regarding the direction in which the view frustum bounds pointed. They also sometimes found the view frustum bounds misleading in how they occluded the videos and the underlying 3D model.

The orbit scenario appeared to pose the most difficulties in tracking for both interfaces—only one correctly determined door (identification marker) out of four in both cases. For the matrix interface, this was because camera angles were opposite, making spatial reasoning difficult. In real images, people tend to compare appearance and image features between multiple cameras. But, in this scenario, the scene was symmetric and all actors were identical, so participants couldn't directly compare them on the basis of appearance alone. The one participant who succeeded did so by comparing door letters with those on the map and then reasoning about which actor was the subject.

In the spatial interface, participants found the 3D cursor very helpful, especially when combined with the dynamic feedback on which transitions were available. However, difficulties still originated from the requirement to quickly choose an orbit glyph while actively tracking the subject with the 3D cursor. If a user makes even one wrong choice or loses concentration even for a moment, the subject is lost very easily. Participants sometimes experienced difficulties matching the 3D cursor ghost with the correct actor when actors walked closely together.

Participants in the panoramic scenario, on the other hand, performed quite well with both interfaces (matrix: three correct out of four; spatial: two correct out of four). In this simple configuration with four cameras, the matrix interface is straightforward, and motion predictions are easy. In both interfaces, some difficulties arose when two actors left the video at the same moment in the same area or when actors walked in a dead area (that is, an area that isn't covered by any of the cameras). For the spatial interface, timing proved difficult sometimes when transitions were slow. Participants could use thumbnails to determine the right moment for a transition, but most people focused too much on the main view. Most participants preferred simple clicking on arrow glyphs to navigate; they avoided using thumbnails and drag selection with the 3D cursor.

Figure 9. The experimental setup: (a) a participant, (b) the experiment leader, (c) the task screen, (d) the control screen, (e) a mouse to perform the 3D navigation task, (f) a keyboard to control the experiment, and (g) a microphone for recording.

Figure 10. Navigation tool preferences. Users generally considered the tools useful, except for the view frustum bounds. (The histogram shows the percentage of participants rating each tool, that is, glyphs, 3D cursor, border thumbnails, and view frustum bounds, on a scale from "not useful" to "very useful.")

In the zooming scenario, we observed remarkable improvements in the spatial interface (matrix: 0 correct out of 4; spatial: 4 correct out of 4). For this scenario, most participants failed to keep track of the subject in the matrix interface. This was due mainly to difficulty in verifying which actors were the same ones in multiple videos. If a user's attention changes from one video to another, the subject might not have appeared in that video yet. At that moment, it's already too late to regain focus on the first video, which means the subject is lost. In the spatial interface, participants could navigate from within the same view. As the videos smoothly merged, participants could always keep their attention on the subject if they selected the right moment for a transition. Some participants achieved this by following the subject with the 3D cursor until the cross-hair indicator cursor appeared on the thumbnail of the next video.

Discussion

Our recordings and questionnaire answers indicate that the augmented navigation tools helped participants keep track of the subject and maintain a mental model of the spatial context. The limited amount of training the participants received greatly influenced the experiment. Studying the recordings, we found that participants often failed to respond quickly enough to keep track of the subject. Future experiments, therefore, should incorporate more-extensive training sessions to ensure that participants are accustomed enough to the navigation tools and the task scenario environments.

The difficulty participants had performing the tasks in the orbit scenario correctly was due partly to inconsistencies between different navigation tools. For example, whereas arrow glyphs point in the same direction as their corresponding view transition, orbit glyphs point in the opposite direction of their view transition (namely, in the direction of their corresponding camera). So, the orbit glyphs' behavior is exactly the opposite of the regular glyphs' behavior because users must make a gesture in the opposite direction of where they want their view to change to. We therefore expect that eliminating these inconsistencies can make the orbit transitions and corresponding camera-choosing task less difficult.

In addition, the view frustum tool could do a better job of indicating the view frustum. Especially in the zooming scenario, it isn't exactly clear which area of the video is visible to other cameras. This is because the view frustum bounds are fully rendered atop the video canvas without intersecting the underlying model. So, having the tool indicate the intersection of a camera's view frustum with the model would be an improvement.

Finally, the results suggest that, using our spatial interface, participants could track a subject as well as, or only sometimes better than, they could using the matrix interface. Although our original goal wasn't to compare the two interfaces, we attribute this apparently minimal advantage to the experiment design, in which we deliberately kept only four cameras and focused on a single view transition. Such setups aren't realistic; real configurations are usually far larger and incorporate various camera relations to ensure sufficient camera coverage. So, a more extensive evaluation on larger, more complex surveillance scenarios would enable a more realistic comparison between our interface and a matrix interface.

With first-person navigation using augmented controls within the scene, the operator can focus on where the action is, while preserving spatial orientation just as someone would do when walking through the environment. But a person can still get lost in a complex environment or lose sight of the focus object. In such situations, adding a map or annotating objects could be helpful.

The participants in our study and the domain experts were impressed by the transitions' fluidity and the navigation tools' ease of use. Feedback from early experiments and demonstrations has helped us redesign the interface details and provide guidelines for future experiments. In the evaluation we reported here, our research hasn't yet yielded conclusive statistical results on our approach's effectiveness compared with a traditional video surveillance interface. We observed many interesting things during these experiments, however. For example, users found the 3D cursor to be very helpful in orbit scenarios, especially when combined with the dynamic feedback on which transitions were available. Also, judging from the smooth controls and transitions, zoom transitions appear highly beneficial. On the basis of the qualitative results and case studies, we expect that more benefits will become clear as the real-world scenarios' complexity increases.

The spatial interface is useful mainly for target object tracking in real-time observation. The third-person overview provided by the video screens' matrix would still be useful for routine surveillance tasks. So, we plan to explore hybrid interfaces combining a video screen matrix with guided navigation. Because operators for large-scale surveillance usually work in groups, collaborative work scenarios are necessary.

The techniques presented here focus on locations with static camera placement. However, operators are increasingly using adjustable pan-tilt-zoom cameras to get more detailed views of small events and individual persons or vehicles. But zooming comes at the expense of camera coverage, and the operator could easily lose context.

The research presented supports either direct observation of scenes and events or reconstruction of events using recorded surveillance videos. In real-time observation, an instant-replay functionality would be useful for backtracking to see missed details and events. Time management and navigation constitute an important topic, and techniques are necessary to preserve the spatial and temporal context during such operations.

For tracking persons, researchers are investigating automatic techniques from computer vision.9 When these techniques are sufficiently fast and reliable for real-time surveillance, integrating them in an interface such as ours to enhance operator navigation will be easy.

Because the perceptual and cognitive processes involved in video surveillance aren't completely understood, improvements of new techniques claiming to support effectiveness or efficiency should always be evaluated experimentally. Moreover, evaluation of surveillance interfaces would require a variety of surveillance footage. However, because of privacy concerns, usually enforced by legislation, obtaining authentic video materials can be difficult. So, we'll also have to revert to artificial scenarios, such as self-generated movies or synthetically generated scenes. Setting up artificial test scenarios and designing performance metrics for this purpose is an interesting challenge in itself.

References

1. G. de Haan et al., "Egocentric Navigation for Video Surveillance in 3D Virtual Environments," Proc. IEEE Symp. 3D User Interfaces (3DUI 09), IEEE Press, 2009, pp. 103–110.

2. P.W. Thorndyke and B. Hayes-Roth, "Differences in Spatial Knowledge Acquired from Maps and Navigation," Cognitive Psychology, vol. 14, no. 4, 1982, pp. 560–589.

3. T. Igarashi and K. Hinckley, "Speed-Dependent Automatic Zooming for Browsing Large Documents," Proc. 13th Ann. ACM Symp. User Interface Software and Technology (UIST 00), ACM Press, 2000, pp. 139–148.

4. J.J. van Wijk and W.A.A. Nuij, "Smooth and Efficient Zooming and Panning," Proc. 9th Ann. IEEE Symp. Information Visualization (InfoVis 03), IEEE CS Press, 2003, pp. 15–22.

5. D.S. Tan, G.G. Robertson, and M. Czerwinski, "Exploring 3D Navigation: Combining Speed-Coupled Flying with Orbiting," Proc. SIGCHI Conf. Human Factors in Computing Systems (CHI 01), ACM Press, 2001, pp. 418–425.

6. J.D. Mackinlay, S.K. Card, and G.G. Robertson, "Rapid Controlled Movement through a Virtual 3D Workspace," Proc. Siggraph, ACM Press, 1990, pp. 171–176.

7. M. Nienhaus and J. Döllner, "Dynamic Glyphs—Depicting Dynamics in Images of 3D Scenes," Proc. 3rd Int'l Symp. Smart Graphics (SG 03), LNCS 2733, Springer, 2003, pp. 102–111.

8. A. Girgensohn et al., "Effects of Presenting Geographic Context on Tracking Activity between Cameras," Proc. SIGCHI Conf. Human Factors in Computing Systems (CHI 07), ACM Press, 2007, pp. 1167–1176.

9. H.M. Dee and S.A. Velastin, "How Close Are We to Solving the Problem of Automated Visual Surveillance? A Review of Real-World Surveillance, Scientific Progress and Evaluative Mechanisms," Machine Vision Applications, vol. 19, nos. 5–6, 2008, pp. 329–343.

Gerwin de Haan is a postdoctoral researcher in Delft University of Technology's Visualization Group. His research interests include video surveillance systems, VR, and software architectures for interactive systems. De Haan has a PhD in computer science from Delft University of Technology. Contact him at [email protected].

Huib Piguillet is a software developer at CleVR. He completed the research described in this article while pursuing graduate studies in Delft University of Technology's Visualization Group. His interests include video surveillance systems, 3D interaction, human-computer interaction, and VR. Piguillet has an MSc in media and knowledge engineering from Delft University of Technology. Contact him at [email protected].

Frits H. Post is an associate professor of computer science at Delft University of Technology. His research interests include flow visualization, medical visualization, VR, and 3D interaction. Post has an MSc in industrial design engineering from Delft University of Technology. He's a fellow of the Eurographics Association. Contact him at [email protected].