
Video-guided Camera Control for Target Tracking and Following ⋆

Jake Gemerek ∗ Silvia Ferrari ∗ Brian H. Wang ∗ Mark E. Campbell ∗

∗ Department of Mechanical and Aerospace Engineering, Cornell University, Ithaca, NY, USA (e-mail: (jrg362, ferrari, bhw45, mc288)@cornell.edu)

Abstract: This paper considers the problem of controlling a nonholonomic mobile ground robot equipped with an onboard camera characterized by a bounded field-of-view, tasked with detecting and following a potentially moving human target using onboard computing and video processing in real time. Computer vision algorithms have recently been shown to be highly effective at object detection and classification in images obtained by vision sensors. Existing methods typically assume a stationary camera and/or use pre-recorded image sequences that do not provide a causal relationship with future images. The control method developed in this paper seeks to improve the performance of the computer vision algorithms by planning the robot/camera trajectory relative to the moving target based on the desired size and position of the target in the image plane, without the need to estimate the target's range. The method is tested and validated using a highly realistic and interactive game programming environment, known as Unreal Engine™, that allows for closed-loop simulations of the robot-camera system. Results are further validated through physical experiments using a Clearpath™ Jackal robot equipped with a camera, which is capable of following a human target for long time periods. Both simulation and experimental results show that the proposed vision-based controller is capable of stabilizing the target object size and position in the image plane for extended periods of time.

Keywords: Robot vision, Tracking applications, Robot navigation, Robot control

1. INTRODUCTION

Recent advancements in computer vision, particularly video-based object detection and classification, pave the way for future autonomous systems comprised of camera-equipped mobile robots that can decide how to obtain and process images or videos without human intervention (Wei et al. (2014)). The development of autonomous mobile cameras has recently been shown to impact a variety of applications that include automated surveillance (Esterle et al. (2017)), intelligent cinematography (Nageli et al. (2017)), and autonomous social navigation (Chen et al. (2017)). The challenge of detecting, tracking, and following a mobile human target of interest is critical to all of the aforementioned applications. As a result, several human tracking algorithms have been developed, some of which make use of carefully designed hand-crafted features, such as Histograms of Oriented Gradients (HOG) for detection with simultaneous KLT (Kanade-Lucas-Tomasi) feature tracking (Benfold and Reid (2011)), and optical flow-based human tracking methods using multiple cameras (Tsutsui et al. (2001)). More recent human tracking algorithms take advantage of advancements in deep convolutional networks, and instead use convolutional features for tracking human appearance in videos (McLaughlin et al. (2017)). Human tracking in video has become one of the challenging problems at the forefront of computer vision, leading to

⋆ This research is funded by the Office of Naval Research Grant N00014-17-1-2175.

the development of benchmark datasets (Leal-Taixé et al. (2015)). The state-of-the-art methods according to such benchmarks base their hypotheses on multiple cues, such as appearance, kinematics, and interactions (Sadeghian et al. (2017)).

Although the problem of human detection and tracking has been studied extensively over recent years, almost none of the tracking algorithms incorporate any type of control over the camera field of view (FoV), and instead assume the video sequence is recorded a priori. The few works that do actively control the camera FoV simplify the detection and tracking of the human target (Goldhoorn et al. (2014), Nageli et al. (2017)). This paper considers the problem of controlling a mobile camera with a bounded FoV that is rigidly attached to a mobile robot capable of onboard computing for real-time video processing. The camera is controlled such that a target (human) of interest is detected, tracked, and followed while moving through a complex, unstructured environment. The novel approach in this paper leverages a state-of-the-art deep learning algorithm for detecting a human target in the video (Liu et al. (2016)), whose output is processed to compute a control input command for the mobile robot without requiring a 3-dimensional position estimate or kinematic model of the target. This is advantageous since human motion is generally very difficult, if not impossible, to accurately predict.


Fig. 1. The inertial, body, and image reference frames are illustrated, along with relative position vectors and dimensions. The target is shown in the 3-dimensional world being transformed, via perspective projection, onto the image plane.

This paper also proposes the use of the game development software Unreal Engine™ for simulation of the vision-based control algorithm in a photo-realistic virtual space. Most robot simulation and visualization software emphasizes physical accuracy and lacks visual realism. Furthermore, simulation in a visually realistic environment provides the ability to quickly and efficiently recreate and redefine an infinite set of testing conditions, while providing a deterministically repeatable test environment. This work further validates the proposed controller, as well as the reliability of Unreal Engine™ as a control algorithm simulator, through physical experiments which demonstrate that the proposed methods are capable of real-world implementation.

2. PROBLEM FORMULATION

Consider a region of interest, W ⊂ R^3, populated with a human target T ⊂ W and a mobile robot A ⊂ W. The robot is equipped with a camera which has a bounded FoV, characterized by a focal length λ ∈ R, a half-angle α ∈ S^1, and an aspect ratio AR. The image plane, S = [0, w] × [0, h], is the perspective projection of W as seen through the camera FoV, where w and h are the width and height of the image, respectively, which may be computed from the camera parameters λ, α, and AR as w = 2λ tan α and h = w/AR (Fig. 1).
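As a simple numerical illustration of these relations, the short Python sketch below computes w and h from the camera parameters; the values of λ, α, and AR are hypothetical and chosen only for the example.

```python
import math

# Hypothetical camera parameters (illustrative values, not from the paper).
lam = 320.0                 # focal length lambda, in pixels
alpha = math.radians(30.0)  # half-angle of the field of view
AR = 4.0 / 3.0              # aspect ratio

# Image-plane dimensions implied by the FoV model.
w = 2.0 * lam * math.tan(alpha)  # w = 2*lambda*tan(alpha)
h = w / AR                       # h = w / AR

print(f"image plane: {w:.1f} x {h:.1f} pixels")
```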

A reference frame F_A is embedded in A, such that the 1st axis is aligned with the camera optical axis, the 3rd axis points vertically, and the 2nd axis completes the right-hand rule. The origin of F_A is known as the focal point of the camera, whose position r(t) ∈ R^3 is expressed relative to an inertial reference frame F_W, as illustrated in Fig. 1. The coordinates of the focal point position, expressed in F_W, are given as r(t) = [x(t), y(t), z]^T, where z is the constant height of the focal point. Assuming the z-axes of F_W and F_A remain parallel, the rotation matrix which maps vectors expressed in F_W to F_A is given as

R^A_W = \begin{bmatrix} \cos\theta(t) & \sin\theta(t) & 0 \\ -\sin\theta(t) & \cos\theta(t) & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad (1)

where θ(t) is the yaw angle of the robot.

The perspective projection P^S_A maps vectors in R^3 expressed with respect to F_A, whose origin is the focal point of the camera, to vectors in R^2 expressed with respect to F_S. Fig. 1 shows the vector r_{t/A} ∈ W being mapped to the vector p_t ∈ S via P^S_A, where p_t(t) = [x_b(t), y_b(t)]^T is the position vector to the center of the bounding box in the image plane. The perspective projection can be written as a scaled linear operation on r_{t/A} using homogeneous coordinates (Sonka et al. (2014)). The perspective projection mapping of an arbitrary vector r ∈ W to the associated image plane vector p ∈ S is

\mathcal{P}^S_A(\mathbf{r}) = \frac{1}{2}\begin{bmatrix} w \\ h \\ 0 \end{bmatrix} - \gamma \begin{bmatrix} \lambda & 0 & 0 \\ 0 & \lambda & 0 \\ 0 & 0 & 1 \end{bmatrix}\mathbf{r} = \begin{bmatrix} \mathbf{p} \\ 1 \end{bmatrix}, \quad (2)

where γ ∈ R is a scaling parameter used to enforce the 3rd element to equal unity. Therefore, all of the required transformations have been defined which take the target position r_t and the robot position r, expressed in the inertial frame, and define a position vector p_t on the image plane. This complete transformation is illustrated in Fig. 1 and may be expressed as

\begin{bmatrix} \mathbf{p}_t \\ 1 \end{bmatrix} = \mathcal{P}^S_A\left( R^A_W (\mathbf{r}_t - \mathbf{r}) \right). \quad (3)
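A minimal Python sketch of the complete transformation (1)–(3) is given below. It assumes a conventional pinhole form for (2), with depth measured along the optical (first) axis of F_A and the image origin shifted to the corner by [w/2, h/2]^T; the sign and axis conventions, as well as all numerical values, are illustrative assumptions rather than a verbatim implementation of the paper.

```python
import numpy as np

def R_A_W(theta):
    """Rotation in (1): maps vectors from the inertial frame F_W to the body frame F_A."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[  c,   s, 0.0],
                     [ -s,   c, 0.0],
                     [0.0, 0.0, 1.0]])

def project_to_image(r_t, r, theta, lam, w, h):
    """Map an inertial target position r_t to image coordinates p_t, as in (3).

    Assumes a pinhole model with the optical axis along the first body axis;
    the axis ordering and signs are illustrative assumptions.
    """
    r_rel = R_A_W(theta) @ (r_t - r)        # target position expressed in F_A
    depth, lateral, vertical = r_rel        # 1st axis = optical axis
    x_b = 0.5 * w - lam * lateral / depth   # image column of the target center
    y_b = 0.5 * h - lam * vertical / depth  # image row of the target center
    return np.array([x_b, y_b])

# Illustrative usage with hypothetical positions and parameters.
p_t = project_to_image(r_t=np.array([3.0, 0.5, 1.0]),
                       r=np.array([0.0, 0.0, 0.8]),
                       theta=0.1, lam=320.0, w=554.0, h=415.0)
```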

Assuming A is rigid, and the camera is rigidly attached to A, the robot configuration (state) vector can be described as q(t) = [x(t), y(t), θ(t)]^T, which is governed by the nonholonomic unicycle kinematic model,

\dot{\mathbf{q}}(t) = \begin{bmatrix} \cos\theta(t) & 0 \\ \sin\theta(t) & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} v(t) \\ \omega(t) \end{bmatrix} = G(\mathbf{q}(t))\,\mathbf{u}(t), \quad (4)

where v(t) ∈ R is the forward velocity and ω(t) ∈ R is the yaw rate. The control input vector is defined as u(t) = [v(t), ω(t)]^T ∈ U, where U ⊂ R^2 is the set of admissible control inputs.
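For reference, a forward-Euler discretization of the unicycle model (4) can be written in a few lines of Python; the time step and input values below are arbitrary example choices.

```python
import numpy as np

def unicycle_step(q, u, dt):
    """One forward-Euler step of the nonholonomic unicycle model (4).

    q = [x, y, theta] is the robot configuration, u = [v, omega] the input.
    """
    theta = q[2]
    G = np.array([[np.cos(theta), 0.0],
                  [np.sin(theta), 0.0],
                  [0.0,           1.0]])
    return q + dt * (G @ u)

# Illustrative usage: drive forward at 0.5 m/s while turning at 0.2 rad/s.
q = np.zeros(3)
u = np.array([0.5, 0.2])
for _ in range(100):   # simulate 1 s with dt = 0.01 s
    q = unicycle_step(q, u, dt=0.01)
```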

A bounding box b(t) ∈ R^4 associated with the target T is extracted from the video and projected on S. The elements composing b(t) = [x_b(t), y_b(t), w_b(t), h_b(t)]^T are the coordinates of the bounding box center p_t(t) = [x_b(t), y_b(t)]^T ∈ S, expressed in the image frame F_S, and the width w_b(t) ∈ [0, w] and height h_b(t) ∈ [0, h] of the bounding box. Several computer vision algorithms exist which extract such a bounding box containing an object of interest from an image, some of which will be reviewed in the following section.

2.1 Control Objective

In order to maintain the target within the camera FoV, a reliable target bounding box must be consistently extracted from the video sequence. Therefore, it is desirable to maintain the target not only within the FoV, but also at a reliable range for accurate image processing. The control objective is to drive the bounding box b(t) to a desired set point b̄ by suitable choice of the control input vector u(t) over some time interval of interest [t_0, T) ⊆ R. That is,

[x_b, y_b, w_b, h_b]^T \to [\bar{x}_b, \bar{y}_b, \bar{w}_b, \bar{h}_b]^T, \quad (5)


where x̄_b, ȳ_b are the desired (constant) coordinates of the bounding box center, and w̄_b, h̄_b are the desired (constant) width and height of the bounding box, respectively. Typically, the set point b̄ is chosen such that the bounding box remains in the center of the image at a sufficient scale for reliable image processing. However, since there is no control over the pitch of the camera or the relative height of the camera focal point and the target center, y_b(t) is not controllable. As long as the target and robot both remain on a level surface with a reasonable relative height between the camera focal point and the target center, this uncontrollable variable will not affect the proposed algorithm's performance. A more complex model may assume the camera is mounted on a gimbal, enabling the camera to rotate with respect to the robot, which would provide a means to control y_b(t), but such a model is not considered in this work.

Similarly, the width w_b(t) and height h_b(t) of the bounding box provide some redundant information, since the robot cannot control the orientation of the target. Therefore, the width and height are combined into a single size metric ∆_b(t), the length of the bounding box diagonal, ∆_b(t) = ‖[w_b(t), h_b(t)]^T‖_2. After post-processing the extracted bounding box b(t) = [x_b(t), y_b(t), w_b(t), h_b(t)]^T into a useful output vector y(t) = [∆_b(t), x_b(t)]^T, the control objective is reduced to

\mathbf{y}(t) \to \bar{\mathbf{y}}, \quad (6)

where \bar{\mathbf{y}} = [\bar{\Delta}_b, \bar{x}_b]^T and \bar{\Delta}_b = \|[\bar{w}_b, \bar{h}_b]^T\|_2.
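This post-processing step is straightforward to implement; a small Python helper might look as follows, with the bounding-box layout matching the definition of b(t) above and the example values being purely illustrative.

```python
import numpy as np

def bbox_to_output(b):
    """Reduce a bounding box b = [x_b, y_b, w_b, h_b] to the output y = [Delta_b, x_b].

    Delta_b is the length of the bounding box diagonal, used as the size metric.
    """
    x_b, y_b, w_b, h_b = b
    delta_b = np.hypot(w_b, h_b)   # ||[w_b, h_b]||_2
    return np.array([delta_b, x_b])

y = bbox_to_output([277.0, 207.0, 60.0, 160.0])   # illustrative bounding box
```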

3. METHODOLOGY

This paper presents a novel unified video processing and control approach for detecting and pursuing a human target in a complex unstructured environment, which is accomplished by stabilizing the control objective defined in the previous section. Owing to recent advancements in object recognition in computer vision, the presented methodology employs a state-of-the-art deep convolutional neural network (CNN) (Liu et al. (2016)) to detect and classify objects within the image plane. The output of the CNN is then processed to extract a bounding box b(t) associated with the target. This bounding box is then used to compute a control law designed to maintain the target human in a desirable position within the image for reliable future detections.

3.1 Target Detection and Identification

Over recent years, algorithms for multi-class object detection in images have become extremely accurate, mostly due to the use of deep CNNs (Huang et al. (2017)). Three recent well-known architectures are Faster R-CNN proposed by Ren et al. (2016), the Single Shot MultiBox Detector (SSD) developed by Liu et al. (2016), and the Region-based Fully Convolutional Network (R-FCN) by Dai et al. (2016). It is difficult to single out the best architecture due to the use of interchangeable feature extraction and classification techniques. Furthermore, the work by Huang et al. (2017) presents a comprehensive study of the speed-accuracy tradeoff between different CNN architectures and feature extractors. The object detection algorithm used in this work is chosen to be as accurate as possible while remaining capable of real-time implementation on a physical robot with onboard computing. This work uses a MobileNet (Howard et al. (2017)) implementation of the SSD architecture (Liu et al. (2016)) in order to satisfy real-time resource constraints. The CNN is pre-trained on the Microsoft COCO data set of Lin et al. (2014).
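As one possible way to reproduce this detection step (not necessarily the authors' implementation), a COCO-pretrained SSD-MobileNet can be run through OpenCV's dnn module. The model file names below are placeholders, and the person class id of 1 assumes the standard COCO label map used with such models.

```python
import cv2

# Placeholder file names for a COCO-pretrained SSD-MobileNet (hypothetical paths).
net = cv2.dnn.readNetFromTensorflow("frozen_inference_graph.pb",
                                    "ssd_mobilenet_coco.pbtxt")

def detect_people(frame, conf_thresh=0.5, person_class_id=1):
    """Return a list of (bounding box, confidence) pairs for detected people.

    Boxes are [x_b, y_b, w_b, h_b] in pixels (center coordinates, width, height),
    matching the definition of b(t) in Section 2.
    """
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, size=(300, 300), swapRB=True, crop=False)
    net.setInput(blob)
    detections = net.forward()              # shape (1, 1, N, 7)

    pairs = []
    for det in detections[0, 0]:
        class_id, conf = int(det[1]), float(det[2])
        if class_id != person_class_id or conf < conf_thresh:
            continue
        x1, y1, x2, y2 = det[3] * w, det[4] * h, det[5] * w, det[6] * h
        box = [0.5 * (x1 + x2), 0.5 * (y1 + y2), x2 - x1, y2 - y1]
        pairs.append((box, conf))
    return pairs
```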

The CNN takes as input an image S(t) at time t and outputs a set of detection-confidence pairs B(t) = {(b_i(t), c_i(t))}_{i=1}^{N_det(t)}, where b_i(t) are bounding boxes containing objects of the same class as the target (i.e., human), c_i(t) ∈ [0, 1) are the associated confidence scores of the bounding boxes, and N_det(t) ∈ N is the number of detections. The target bounding box b(t) is then computed as the bounding box with the highest associated confidence score when multiple detections are extracted. If no detections are extracted, i.e., B(t) = ∅, the target bounding box b(t) is set equal to the set point bounding box b̄ in order to stop the robot. This processing step is expressed as

\mathbf{b}(t) = \begin{cases} \arg\max_{\mathbf{b}_i} \{\, c_i : (\mathbf{b}_i, c_i) \in \mathcal{B}(t) \,\}, & \mathcal{B}(t) \neq \emptyset \\ \bar{\mathbf{b}}, & \mathcal{B}(t) = \emptyset. \end{cases} \quad (7)

This formulation guarantees that b(t) exists and is unique by construction. The bounding box b(t) is then transformed into the output vector y(t), which is used as the control variable in the controller to be designed in the following subsection.
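The selection rule (7) reduces to a few lines of Python; here b_bar stands for the set-point bounding box b̄ and detections for the set B(t) of (box, confidence) pairs.

```python
def select_target_bbox(detections, b_bar):
    """Implement (7): return the highest-confidence detection, or fall back to the
    set point b_bar when no detections are available so that the error is zeroed
    and the robot stops.
    """
    if not detections:
        return b_bar
    best_box, _ = max(detections, key=lambda pair: pair[1])
    return best_box
```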

3.2 Video-guided Camera Control Design

The control law for tracking and following the target based on video frames obtained by the robot camera, and processed according to the previous subsections, is developed by considering properties of the perspective projection, and by noting how points in three dimensions move across a two-dimensional image while the camera is moving. Due to the properties of the perspective projection, objects which are closer to the focal point appear larger, and objects that are farther from the focal point appear smaller. This provides a natural method for controlling the size of the bounding box ∆_b(t) without requiring kinematic estimation in the 3-dimensional world. Similarly, the position of the bounding box x_b(t) in the image provides a natural error signal for the yaw rate ω(t) of the robot. Because the target human may be moving, integral compensation is proposed in order to reduce the steady-state errors that would be present if the control input were simply proportional to these error signals. The proposed video-guided control input is therefore designed using the following proportional-integral compensation,

\mathbf{u}(t) = -K_1 \Delta\mathbf{y}(t) - K_2 \int_0^{t} \Delta\mathbf{y}(\tau)\, d\tau, \quad (8)

where K_1, K_2 ≻ 0 are diagonal gain matrices of appropriate dimension, and ∆y(t) = y(t) − ȳ. The proposed control law is validated using photo-realistic simulations in highly complex environments, as well as through physical experiments in a laboratory setting.
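A discrete-time sketch of the PI law (8) is shown below. The gains, set point, and sample period are illustrative placeholders, and the integral is approximated by a simple rectangular sum; input saturation and anti-windup are not discussed in the paper and are omitted here.

```python
import numpy as np

class VideoGuidedPI:
    """Discrete-time form of the proportional-integral law in (8)."""

    def __init__(self, K1, K2, y_bar, dt):
        self.K1 = np.asarray(K1)          # diagonal proportional gain matrix
        self.K2 = np.asarray(K2)          # diagonal integral gain matrix
        self.y_bar = np.asarray(y_bar)    # set point [Delta_b_bar, x_b_bar]
        self.dt = dt                      # sample period
        self.integral = np.zeros_like(self.y_bar)

    def update(self, y):
        """Return u = [v, omega] for the current output y = [Delta_b, x_b]."""
        dy = np.asarray(y) - self.y_bar
        self.integral += dy * self.dt
        return -self.K1 @ dy - self.K2 @ self.integral

# Illustrative gains and set point (hypothetical values, not from the paper).
controller = VideoGuidedPI(K1=np.diag([0.004, 0.003]),
                           K2=np.diag([0.0005, 0.0004]),
                           y_bar=[260.0, 277.0], dt=0.05)
u = controller.update(y=[180.0, 300.0])   # -> np.array([v, omega])
```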


4. SIMULATION ENVIRONMENT

The Unreal Engine™ is a leading game development software capable of advanced open-source environment development and manipulation. These capabilities have recently been exploited by several industries outside of the game development community, including architectural visualization, film-making, and virtual reality training simulations. The use of Unreal Engine™ for simulation of computer vision-based automatic control algorithms makes no sacrifice in physical accuracy, but has the advantage of a vast user community composed of artists and developers who create visually realistic environments, characters, behaviors, and objects, which may be used in simulations. This ease of access to a diverse set of environments and scenarios helps test the robustness of proposed methods in ways not feasible in real-world experiments or conventional robotics software.

Similar game development software has been used in previous works for generating synthetic data to train deep CNNs, such as Johnson-Roberson et al. (2017). Unreal Engine™ has also been used for similar computer vision tasks by several authors, for example for semantic segmentation by Qiu and Yuille (2016) and for simulating stereo-vision applications by Zhang et al. (2016). Furthermore, the work by Shah et al. (2017) uses Unreal Engine™ to develop realistic quadrotor and vehicle simulations for autonomous vehicle research. This work proposes the use of Unreal Engine™ for the novel task of simulating a fully autonomous robot using visual feedback to track a target in a realistic virtual space. The setup and results of these simulations are presented in the following section.

5. SIMULATION RESULTS

The control law developed in Section 3 is tested in the visually realistic and complex subway environment using Unreal Engine™, in which a mobile robot equipped with a camera tracks and follows a potentially moving human without knowledge of the target dynamics or environment geometry. The subway environment consists of realistic lighting, a moving subway, and other moving objects.

A number of simulations are conducted in order to analyze the proposed control algorithm: (1) a step response of the velocity input v(t), (2) a ramp response of the velocity input v(t), (3) a step response of the yaw rate input ω(t), (4) a ramp response of the yaw rate input ω(t). The set point used for all simulations is chosen such that the bounding box stabilizes to a desired size for reliable image processing at the center of the camera FoV. The single setting for the diagonal gain matrices K_1 and K_2 is manually chosen and not changed between simulations.

The first simulation tests the controller response to a step input in the size error of the target, i.e., ∆_b(t) − ∆̄_b. This is done by initially placing the target human in the center of the camera FoV, such that x_b(t_0) = x̄_b, and at a distance away from the robot such that ∆_b(t_0) < ∆̄_b. The bottom of Fig. 2 shows the initial configuration of the robot and target in a geometrically simplified visualization of the subway, along with the visual input to the robot at the initial time t_0. The set point bounding box b̄ is illustrated as the orange bounding box, and the estimated target bounding box b(t_0) output from the CNN is illustrated as the green bounding box. The top of Fig. 2 shows the resulting error signal e_∆(t) = (∆_b(t) − ∆̄_b)/∆̄_b and control input v(t). The step response slightly overshoots the set point and stabilizes in roughly five seconds.

Fig. 2. Simulated step response to an initial error in the size of the target, e_∆ = (∆_b(t) − ∆̄_b)/∆̄_b. The bottom of the figure illustrates the initial configuration of the target and robot in a geometrically simplified visualization of the subway, along with the initial visual input at the initial configuration.

The second simulated experiment tests the controller response to a ramp input in the size error of the target, ∆_b(t) − ∆̄_b. This is done by initially placing the target human in the center of the camera FoV, such that x_b(t_0) = x̄_b, and programming the human target to walk at a constant velocity in the direction of the initial camera optical axis. The human is programmed to walk at 1.3 m/s, which is a typical human walking speed. Fig. 3 illustrates the controller response as well as several snapshots throughout the simulation showing the robot-target configuration and the associated visual input. The response of the controller to such an input again stabilizes without any steady-state error, due to the integral term of the control law in (8).

Fig. 3. Simulated ramp response of the error in the size of the target, e_∆ = (∆_b(t) − ∆̄_b)/∆̄_b. The bottom of the figure illustrates the configuration of the target and robot in a geometrically simplified visualization of the subway at a number of snapshots throughout the simulation (t = 0.0, 4.9, 10.0, 15.0, and 20.1 s), along with the visual input at the time of these snapshots.

The next simulated experiment tests the controller response to a step input in the lateral position of the target, i.e., x_b(t) − x̄_b. This is done by initially placing the target human at a distance from the robot such that ∆_b(t_0) = ∆̄_b, but offset from the optical axis such that x_b(t_0) < x̄_b. The bottom of Fig. 4 shows the initial configuration of the robot and target in a geometrically simplified visualization of the subway, along with the visual input to the robot at the initial time t_0. The top of Fig. 4 shows the resulting error signal e_{x_b}(t) = (x_b(t) − x̄_b)/x̄_b and control input ω(t). The step response slightly overshoots the desired position but quickly stabilizes about the set point configuration.

Fig. 4. Simulated step response to an initial error in the position of the target, e_{x_b} = (x_b(t) − x̄_b)/x̄_b. The bottom of the figure illustrates the initial configuration of the target and robot in a geometrically simplified visualization of the subway, along with the initial visual input at the initial configuration.

The final simulated experiment tests the controller response to a ramp input in the lateral position of the target, x_b(t) − x̄_b. This is done by programming the human target to walk in a circular motion centered at the camera focal point at a constant speed. The radius of the target's circular path is chosen such that ∆_b(t) = ∆̄_b. Fig. 5 illustrates the controller response as well as several snapshots throughout the simulation showing the robot-target configuration and the associated visual input. The response of the controller to this ramp input very rapidly stabilizes without any steady-state error. Some high-frequency oscillations in the error signal are visible, and are caused by the periodic nature of the human walking as viewed from the side. That is, the bounding box slightly changes in shape and position due to the swinging arms and legs of a walking human. Two of the spikes in the signal are caused by errors in the CNN detection algorithm, but are only present at single frames, which does not affect performance.

The four simulations performed in this study all show the controller stabilizing about the desired set point. Furthermore, the robot was programmed to follow the human through the subway environment using the proposed controller, while the target moved arbitrarily through the environment. Even in this case, where the robot was subject to small disturbances such as brief or partial occlusions, lighting variations, and changes in the target motion, the robot successfully stabilized about the set point. The robot was able to follow the human through the subway environment for several minutes, and possibly longer. These simulation results therefore suggest that, as long as the human target does not intentionally evade the robot, the proposed controller is capable of following the human indefinitely under reasonable conditions.


Fig. 5. Simulated ramp response of the error in the position of the target, e_{x_b} = (x_b(t) − x̄_b)/x̄_b. The bottom of the figure illustrates the configuration of the target and robot in a geometrically simplified visualization of the subway at a number of snapshots throughout the simulation (t = 0.0, 10.0, 20.0, 30.0, and 40.0 s), along with the visual input at the time of these snapshots.

6. EXPERIMENTAL RESULTS

The experimental validation of the proposed control algorithm is done using a Clearpath™ Jackal robot equipped with a camera and onboard computing capabilities. The robot can be accurately modeled by the nonholonomic unicycle model (4). A Vicon motion capture system is used to provide ground-truth measurements of the robot and target states. It should be made clear that the Vicon data is never made available to the robot, and is only used to collect accurate pose data for results visualization. The four physical experiments presented here are exactly the same as the four simulated experiments in the previous section, that is: (1) a step response of the velocity input v(t), (2) a ramp response of the velocity input v(t), (3) a step response of the yaw rate input ω(t), and (4) a ramp response of the yaw rate input ω(t).

The first physical experiment tests the controller response to a step input in the size error of the target, i.e., ∆_b(t) − ∆̄_b. This is done by initially placing the target human in the center of the camera FoV, such that x_b(t_0) = x̄_b, and at a distance away from the robot such that ∆_b(t_0) < ∆̄_b. The bottom of Fig. 6 shows the visual input to the robot at the initial time t_0. The set point bounding box b̄ is illustrated as the orange bounding box, and the estimated target bounding box b(t_0) output from the CNN is illustrated as the green bounding box. The top of Fig. 6 shows the resulting error signal e_∆(t) = (∆_b(t) − ∆̄_b)/∆̄_b and control input v(t). The step response slightly overshoots the set point and then quickly stabilizes.

Fig. 6. Experimental step response to an initial error in the size of the target, e_∆ = (∆_b(t) − ∆̄_b)/∆̄_b. The bottom of the figure illustrates the initial visual input.

The second physical experiment tests the controller response to a ramp input in the size error of the target, ∆_b(t) − ∆̄_b. This is done by initially placing the target human in the center of the camera FoV, such that x_b(t_0) = x̄_b, and instructing the human target to walk at a constant velocity in the direction of the initial camera optical axis. Fig. 7 illustrates the controller response as well as several snapshots throughout the experiment showing the robot-target configuration and the associated visual input. The response of the controller to such an input again stabilizes. However, due to the physical limitations of the laboratory setup (i.e., the size of the Vicon area), the human cannot walk far enough to allow the robot to fully reach steady state, but extrapolation of the available response is promising. This further illustrates the power of visually realistic simulation in Unreal Engine™.

Fig. 7. Experimental ramp response of the error in the size of the target, e_∆ = (∆_b(t) − ∆̄_b)/∆̄_b. The bottom of the figure illustrates the configuration of the target and robot at a number of snapshots throughout the experiment (t = 0.0, 5.5, 11.0, 16.5, and 22.0 s), along with the visual input at the time of these snapshots.

The next physical experiment tests the controller response to a step input in the lateral position of the target, i.e., x_b(t) − x̄_b. This is done by initially placing the target human at a distance from the robot such that ∆_b(t_0) = ∆̄_b, but offset from the optical axis such that x_b(t_0) < x̄_b. The bottom of Fig. 8 shows the initial visual input to the robot at t_0. The top of Fig. 8 shows the resulting error signal e_{x_b}(t) = (x_b(t) − x̄_b)/x̄_b and control input ω(t). The step response has no overshoot and quickly stabilizes about the set point configuration.

The final physical experiment tests the controller response to a ramp input in the lateral position of the target, x_b(t) − x̄_b. This is done by the human target walking in a circular motion centered at the camera focal point at a constant speed. Fig. 9 illustrates the controller response as well as several snapshots throughout the experiment showing the robot-target configuration and the associated visual input. The response of the controller to this ramp input very rapidly stabilizes without any steady-state error. These results confirm that the proposed controller, as well as future computer vision-based controllers, can be readily simulated in Unreal Engine™ and then successfully implemented on physical robotic platforms.

Fig. 8. Experimental step response to an initial error in the position of the target, e_{x_b} = (x_b(t) − x̄_b)/x̄_b. The bottom of the figure illustrates the initial visual input.

Fig. 9. Experimental ramp response of the error in the position of the target, e_{x_b} = (x_b(t) − x̄_b)/x̄_b. The bottom of the figure illustrates the configuration of the target and robot at a number of snapshots throughout the experiment (t = 0.0, 6.2, 10.6, 16.8, and 22.7 s), along with the visual input at the time of these snapshots.


7. CONCLUSION

This paper presents a method for mobile camera control using real-time video feedback in order to detect and pursue a human target. Because video frames depend on the camera position and orientation, the interactive and highly realistic game programming environment Unreal Engine™ is used to perform virtual experiments in real time. The proposed approach relies on consistent bounding box extraction to control the camera's forward speed and yaw rate so as to maintain the target within its FoV and at a specified distance for accurate image processing. The simulation results show the camera tracking the human target, keeping the target within the FoV and at a reasonable distance for reliable image processing. The same control algorithm is successfully implemented on the Clearpath™ Jackal robot, which also successfully follows the target and maintains it in the camera FoV. Future work includes tracking a particular human target using multiple cameras with different viewpoints in a crowded environment.

REFERENCES

Benfold, B. and Reid, I. (2011). Stable multi-target tracking in real-time surveillance video. Computer Vision and Pattern Recognition.

Chen, Y.F., Everett, M., Liu, M., and How, J.P. (2017). Socially aware motion planning with deep reinforcement learning. CoRR, abs/1703.08862. URL http://arxiv.org/abs/1703.08862.

Dai, J., Li, Y., He, K., and Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. CoRR, abs/1605.06409. URL http://arxiv.org/abs/1605.06409.

Esterle, L., Lewis, P., Yao, X., and McBride, R. (2017). The future of camera networks: staying smart in a chaotic world. In International Conference on Distributed Smart Cameras.

Goldhoorn, A., Garrell, A., Alquézar, R., and Sanfeliu, A. (2014). Continuous real time POMCP to find-and-follow people by a humanoid service robot. In 2014 IEEE-RAS International Conference on Humanoid Robots, 741–747. doi:10.1109/HUMANOIDS.2014.7041445.

Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861. URL http://arxiv.org/abs/1704.04861.

Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., Guadarrama, S., et al. (2017). Speed/accuracy trade-offs for modern convolutional object detectors. In IEEE CVPR.

Johnson-Roberson, M., Barto, C., Mehta, R., Sridhar, S.N., Rosaen, K., and Vasudevan, R. (2017). Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? In 2017 IEEE International Conference on Robotics and Automation (ICRA), 746–753. doi:10.1109/ICRA.2017.7989092.

Leal-Taixé, L., Milan, A., Reid, I.D., Roth, S., and Schindler, K. (2015). MOTChallenge 2015: Towards a benchmark for multi-target tracking. CoRR, abs/1504.01942. URL http://arxiv.org/abs/1504.01942.

Lin, T., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C.L. (2014). Microsoft COCO: Common objects in context. CoRR, abs/1405.0312. URL http://arxiv.org/abs/1405.0312.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., and Berg, A.C. (2016). SSD: Single shot multibox detector. arXiv preprint.

McLaughlin, N., Martinez del Rincon, J., and Miller, P. (2017). Video person re-identification for wide area tracking based on recurrent neural networks. IEEE Transactions on Circuits and Systems for Video Technology, 1–1. doi:10.1109/TCSVT.2017.2736599.

Nageli, T., Alonso-Mora, J., Domahidi, A., Rus, D., and Hilliges, O. (2017). Real-time motion planning for aerial videography with dynamic obstacle avoidance and viewpoint optimization. IEEE Robotics and Automation Letters, 2(3), 1696–1703. doi:10.1109/LRA.2017.2665693.

Qiu, W. and Yuille, A. (2016). UnrealCV: Connecting computer vision to Unreal Engine. Computing Research Repository (CoRR).

Ren, S., He, K., Girshick, R., and Sun, J. (2016). Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint.

Sadeghian, A., Alahi, A., and Savarese, S. (2017). Tracking the untrackable: Learning to track multiple cues with long-term dependencies. CoRR, abs/1701.01909. URL http://arxiv.org/abs/1701.01909.

Shah, S., Dey, D., Lovett, C., and Kapoor, A. (2017). AirSim: High-fidelity visual and physical simulation for autonomous vehicles. CoRR, abs/1705.05065. URL http://arxiv.org/abs/1705.05065.

Sonka, M., Hlavac, V., and Boyle, R. (2014). Image processing, analysis, and machine vision. Cengage Learning.

Tsutsui, H., Miura, J., and Shirai, Y. (2001). Optical flow-based person tracking by multiple cameras. In International Conference on Multisensor Fusion and Integration for Intelligent Systems.

Wei, H., Lu, W., Zhu, P., Ferrari, S., Klein, R.H., Omidshafiei, S., and How, J.P. (2014). Camera control for learning nonlinear target dynamics via Bayesian nonparametric Dirichlet-process Gaussian-process (DP-GP) models. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Zhang, Y., Qiu, W., Chen, Q., Hu, X., and Yuille, A.L. (2016). Unreal stereo: A synthetic dataset for analyzing stereo vision. Computing Research Repository (CoRR).