


An Energy Minimization Approach to Automatic Traffic Camera Calibration

Douglas N. Dawson and Stanley T. Birchfield, Senior Member, IEEE

Abstract—We present a method for automatic calibration of traffic cameras. The problem is formulated as one of energy minimization in reduced road-parameter space, from which internal and external camera parameters are determined. Our approach combines bottom-up processing of a video to find a vanishing point, lines in the background, and a directed activity map, along with top-down processing to fit a road model to these detected features using Markov chain Monte Carlo (MCMC). Enhanced autocorrelation along the dashed lines is used in conjunction with a best-fit road model to find road-to-image parameters. To maximize both robustness to noise and flexibility (e.g., to handle cases in which the camera is looking straight down the road), a single-vanishing-point length-based approach (VWL, according to the taxonomy in the work of Kanhere and Birchfield) is used. On a large number of data sets exhibiting a wide variety of conditions (including distractions such as bridges and on/off-ramps), our approach performs well, achieving less than 10% error in measuring test lengths in all cases.

Index Terms—Camera calibration, computer vision, Markov chain Monte Carlo, traffic monitoring.

I. INTRODUCTION

TRAFFIC cameras are becoming increasingly commonplace on streets and highways in metropolitan areas throughout the world. These cameras are versatile tools in traffic analysis, allowing human operators to monitor traffic conditions in many sections of roadways to detect and quickly report incidents. Since there are often too many cameras in a location for operators to watch continuously, autonomous video-based processing of the camera feeds would greatly facilitate such operations. Video-based processing is inherently flexible and nonintrusive and takes advantage of the existing infrastructure.

However, most cameras installed for manual surveillance are pan–tilt–zoom (PTZ). These cameras preclude the use of traditional video-based processing because the processing must automatically adapt to new viewpoints whenever operators move the cameras. Many traffic analysis systems require a mapping from image to road coordinates to compute vehicle speed and dimensions in metric units. Such a mapping involves both intrinsic (focal length) and extrinsic (height, distance, and angles) camera parameters.


With PTZ cameras, therefore, automatic calibration of these parameters is crucial to successful continuous operation whenever metric quantities are being measured.

Early work on roadside camera calibration required manual intervention from the user to draw lines or rectangles on the image [8], [9], [12], [16], [23], [30]. Alternative approaches require previous knowledge of some of the camera parameters, such as height or tilt angle [1], [21], [26], or a calibration pattern that is not available in real-world scenes [10], [24], [29]. Some of the more successful approaches to automatic roadside camera calibration include methods that use one vanishing point [21], two vanishing points [14], [20], or three vanishing points [28]. Using multiple vanishing points, however, is problematic due to the ill-conditioned scenario when the camera is looking down the road (since the second vanishing point tends toward infinity). Searching the literature, we were unable to find any automatic roadside calibration method using a single vanishing point without requiring prior knowledge of some of the camera parameters.

In this paper, we present a novel method for automatic roadside camera calibration using a single vanishing point, known lane widths, and known dash lengths (VWL, using the taxonomy in [13]). Our approach uses a combination of bottom-up and top-down processing. The bottom-up processing extracts lines from the image, finds the vanishing point, and computes a directed activity map. Then, top-down processing fits a road model to the lines using a Markov chain Monte Carlo (MCMC) approach to minimize the energy associated with the model. MCMC is used because of its computational efficiency, its wide applicability (no convexity assumption is required), and its robustness. A final step estimates the scaling parameter between the road and the image. The technique does not require any prior knowledge of the camera parameters but rather relies on the assumption of known lane widths and dash lengths. Such measurements are standardized in many regions (e.g., in the United States by the Department of Transportation's Manual on Uniform Traffic Control Devices [25]). Extensive experimental results on both simulated and real data sets from a wide variety of positions and angles (including cases in which the camera looks straight down the road) demonstrate the accuracy and robustness of the method.

II. PREVIOUS WORK

We have recently introduced a taxonomy for roadside camera calibration algorithms [13]. According to this taxonomy, which covers both manual and automatic approaches, techniques can be classified based on the information used.


TABLE I
COMPARISON OF PREVIOUS WORK IN TRAFFIC CAMERA CALIBRATION, ALONG WITH OUR PROPOSED METHOD

Table I compares the capabilities of existing traffic camera calibration approaches. By far, the most popular approach to date has been VVW, which means that two vanishing points and a known width are needed. From these values, the internal and external parameters of the camera can be determined using straightforward geometry. Examples of manual VVW calibration approaches are those by Fung et al. [8], Zhaoxue and Pengfei [30], Kanhere and Birchfield [12], and Lai and Yung [16].

Other approaches use a single vanishing point, such as the manual approaches of Gupte et al. [9] and He and Yung [10], both of which use a vanishing point, a known width, and a known length (VWL); the manual approach of Bas and Crisman [1], which uses a vanishing point, a known height, and a known tilt angle (VHΦ); and the manual approach of Wu et al. [26], which uses a vanishing point, a known width, and a known focal length (VWF). All these methods are manual in the sense that some user intervention is required.

Several fully automatic roadside camera calibration methods using multiple vanishing points have been developed. When three orthogonal vanishing points can be determined, the camera's internal calibration matrix can be estimated; such approaches require either a known height (VVVH) [28] or a known width (VVVW) [11] to overcome the scale ambiguity. Assuming that the line connecting the start of each dash is perpendicular to the direction of travel, Dong et al. [6] describe a technique using a known camera height and two vanishing points, where the latter are estimated by detecting the dashed lines on a highway. Kanhere et al. [14] use pattern recognition to find vehicles that are then tracked to yield the first vanishing point (in the direction of travel); image gradient estimation on the vehicles provides the second vanishing point. A similar approach is that of Schoepflin and Dailey [20], in which the first vanishing point is estimated from the edges of the lane activity map, whereas the second vanishing point is estimated from the bottom edges of vehicles. By detecting and tracking vehicle headlights at night and assuming that the headlights are at a known distance apart, level with the ground, and moving in a straight line, Zhang et al. [27] describe a calibration procedure (VVW) using techniques from projective geometry. One of the limitations of these approaches is that when the camera points down the road, the second vanishing point cannot be reliably estimated because it goes to infinity.

To avoid this problem, Song and Tai [21] propose an automatic calibration method using a single vanishing point, a known width, and a known height (VWH). Edge detection on a background image yields the lane lines; assuming that the longest lines in the image are the outer-lane lines, their overwhelming support in the image enables them to be distinguished from distractions, and their intersection then provides an estimate of the vanishing point. A similar VWH approach is followed by Li et al. [17], who also detect lane lines and a single vanishing point. While these approaches have the advantage of requiring a single vanishing point, we have shown in our earlier work [13] that decreased accuracy in the presence of noise occurs when calculating speed or length-based measurements using a camera that has been calibrated with width-based measurements. For increased accuracy, length should be used in the calibration process.

Some research into traffic video analysis focuses on recovering only a partial calibration. Cathey and Dailey [4] and Schoepflin and Dailey [19] propose methods to convert pixels to meters along a specific direction, yielding speed estimation using only a single calibration parameter. We call these methods VL since they use a known length and a single vanishing point. We formalize this concept in our definition of the image-to-length scaling factor (ILSF), which also allows for a nonzero roll angle. An earlier work in this area is that of Dailey et al. [5], where it appears that a known camera height is combined with a known distribution of vehicle lengths to estimate speeds without full calibration. Full calibration, however, is necessary to perform measurements in other directions and to compute the locations of the lanes. Other calibration techniques that work with vehicle shapes or are applicable to mounted vehicles include those in [2], [7], and [18].

III. GEOMETRIC MODELS

Calibrating a roadside camera involves determining the geometric relationship between the camera and road coordinate systems. Here, we describe this relationship, along with the road model used to perform the calibration.

A. Road to Camera Transformation

As in the work of Kanhere and Birchfield [13], we simplify the problem by assuming that the camera is a pinhole camera with zero roll angle, perpendicular image axes (zero skew), principal point in the center of the image, and square pixels (unity aspect ratio) viewing a level straight road. The camera coordinate system is placed so that its origin is the center of projection, the positive x-axis points along the image rows, the positive y-axis points down along the image columns, and the positive z-axis points along the optical axis toward the world.


Fig. 1. Road and camera coordinate systems. The camera, with tilt angle φ and pan angle θ, is at a height h above the road and a distance dm to the left of the median line (or −dm to the right); wm is the width of the median, and wl is the width of a lane. (a) Top-down view (looking down the negative zr-axis), assuming φ = 0. If φ ≠ 0, then the camera coordinate system is rotated so that zc is no longer parallel to the road plane, and yc is no longer perpendicular to it. (b) Side view (looking down the positive xc-axis), assuming θ = 0. If θ ≠ 0, then the camera coordinate system is rotated so that yr is no longer parallel to the yc–zc plane, xr is no longer perpendicular to it, and the road coordinate system is no longer directly beneath the camera coordinate system in this view. Both coordinate systems are right-handed; hence, the symbol ⊙ indicates an axis coming out of the page, whereas ⊗ indicates an axis going into the page.

The road coordinate system is centered on the road plane at the intersection of the line running along the middle of the median (which we call the “median line”) and the perpendicular plane passing through the camera's center of projection. The tilt angle of the camera with respect to the road plane is 0 < φ < π/2, the pan angle with respect to the road is −π/2 < θ < π/2, the height of the camera's center of projection above the road is h, and the signed distance between the two coordinate systems in the road plane is dm (see Fig. 1).

With these coordinate systems defined, a point (x, y, z) in the road coordinate system projects onto the image plane at (u, v) according to

p ∼ Px = KRφRθTx (1)

where p = [u v 1]^T and x = [x y z 1]^T are the homogeneous coordinates of the image point and road point, respectively, and ∼ means equality up to an arbitrary scaling factor. Matrix K captures the camera's internal parameters, i.e.,

$$K = \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{2}$$

rotation matrices Rφ and Rθ describe the camera tilt and pan, respectively, i.e.,

$$R_\phi = \begin{bmatrix} 1 & 0 & 0 \\ 0 & -\sin\phi & -\cos\phi \\ 0 & \cos\phi & -\sin\phi \end{bmatrix} \tag{3}$$

$$R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{4}$$

and matrix T captures the translation between the coordinate systems, i.e.,

$$T = \begin{bmatrix} 1 & 0 & 0 & d_m \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & -h \end{bmatrix} = [\,t_1\ \ t_2\ \ t_3\ \ t_4\,] \tag{5}$$

where ti is the ith column of T. Points on the road plane satisfy zr = 0, yielding a simpler expression. Thus

p ∼ KRφRθ[t1 t2 t4]x̄ = Hx̄ (6)

where x̄ = [x y 1]^T, and the homography H = KRφRθ[t1 t2 t4] is given by

$$H = \begin{bmatrix} f\cos\theta & -f\sin\theta & f d_m \cos\theta \\ -f\sin\phi\sin\theta & -f\sin\phi\cos\theta & f h\cos\phi - f d_m\sin\phi\sin\theta \\ \cos\phi\sin\theta & \cos\phi\cos\theta & h\sin\phi + d_m\cos\phi\sin\theta \end{bmatrix}. \tag{7}$$
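To make the transformation concrete, the following minimal Python sketch (not part of the original paper; the numeric parameter values are arbitrary) assembles H from (1)-(7) and projects a road-plane point into the image.

```python
import numpy as np

def road_to_image_homography(f, phi, theta, h, d_m):
    # H = K * R_phi * R_theta * [t1 t2 t4], as in (1)-(7), mapping a road-plane
    # point (x, y, 1) to homogeneous image coordinates.
    K = np.diag([f, f, 1.0])
    R_phi = np.array([[1.0, 0.0, 0.0],
                      [0.0, -np.sin(phi), -np.cos(phi)],
                      [0.0,  np.cos(phi), -np.sin(phi)]])
    R_theta = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                        [np.sin(theta),  np.cos(theta), 0.0],
                        [0.0, 0.0, 1.0]])
    T = np.array([[1.0, 0.0, 0.0, d_m],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, -h]])
    return K @ R_phi @ R_theta @ T[:, [0, 1, 3]]   # keep columns t1, t2, t4

def project(H, x, y):
    # Project a road point (x, y) in meters to pixel coordinates (u, v).
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]

# Arbitrary example values: f in pixels, angles in radians, h and d_m in meters.
H = road_to_image_homography(f=1000.0, phi=np.radians(12), theta=np.radians(8),
                             h=10.0, d_m=-8.0)
print(project(H, 3.0, 40.0))
```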

B. Road Model

As can be seen from (7), five parameters are necessary to describe the road-to-image transformation: f (in pixels), φ (in radians), θ (in radians), h (in meters), and dm (in meters). In addition, we introduce four parameters to capture road characteristics: the number of lanes in the oncoming roadway no, the number of lanes in the receding roadway nr, the width of the median wm (in meters), and the lane width wl (in meters). Together, these nine parameters completely specify the road geometry for our purposes. We assume that all lanes are of equal width to keep the math simple, but our model is easily extended to the case when the lanes have different (but still known) widths.

Our approach involves searching a state space to find the road model that best fits the image data. Searching randomly in the given state space defined by f, φ, θ, and so on would be possible but quite slow.


Fig. 2. Projection of multiple lanes of a straight road onto an image. The projected lane lines intersect at the vanishing point (u0, v0). The clockwise angle of the median line in the image with respect to the positive horizontal axis is ψ, whereas the angle of the ith oncoming lane is ψo^(i), and the angle of the ith receding lane is ψr^(i). At a vertical distance v (pixels) below the vanishing point, the projected median width is Δum = v·w̃m, whereas the projected lane width is Δul = v·w̃l, where w̃m and w̃l are unitless quantities.

Instead, to ensure efficient computation, we use the image data to perform a directed search. To facilitate this, we propose a state space consisting of three parameters from the given space, namely, no, nr, and wl, along with six additional parameters: ψ (radians), u0 (pixels), v0 (pixels), w̃m (unitless), w̃l (unitless), and κ (pixel·meters). The advantage of these parameters is that they are more closely tied to the image, thus facilitating directed search. These nine parameters can be mapped into the nine parameters of the given state space, as we will show in Section VII-A. To summarize:

road-to-image: no, nr, wl, wm, dm, f, φ, θ, h

road model: no, nr, wl, u0, v0, ψ, w̃l, w̃m, κ.

Fig. 2 shows five of these six additional parameters. Of these, the most straightforward are (u0, v0), which specifies the vanishing point in the image, and ψ, which is the clockwise angle in the image of the median line with respect to the positive horizontal axis. To understand w̃m and w̃l, let us define ψo^(i), i = 0, . . . , no, as the angle of the ith oncoming lane, and let us define ψr^(j), j = 0, . . . , nr, as the angle of the jth receding lane. Given our simplified imaging model (square pixels, zero skew, zero roll angle), at any given vertical distance v (pixels) below the vanishing point, a horizontal line intersects projected parallel lines at equally spaced intervals if the parallel lines are equally spaced in the real world. Let Δum and Δul be the horizontal spacing of the median and the lane, respectively, at a vertical distance of v. We define

$$\tilde{w}_m = \frac{\Delta u_m}{v} \tag{8}$$

$$\tilde{w}_l = \frac{\Delta u_l}{v}. \tag{9}$$

From the right triangles in Fig. 3, along with the observation that cot ψ = a/v, we see that

$$\cot\psi_r^{(i)} = \frac{a + \tfrac{1}{2}\tilde{w}_m v + i\tilde{w}_l v}{v} = \cot\psi + \frac{\tilde{w}_m}{2} + i\tilde{w}_l \tag{10}$$

$$\cot\psi_o^{(j)} = \frac{a - \tfrac{1}{2}\tilde{w}_m v - j\tilde{w}_l v}{v} = \cot\psi - \frac{\tilde{w}_m}{2} - j\tilde{w}_l. \tag{11}$$

Fig. 3. Projection of the median line and the two lines surrounding the innermost lane of the receding traffic onto the image. The relationship between the angles is found by considering the three right triangles formed at a distance of v pixels from the horizon line. In addition, given a known length ℓ in the world, the quantity ℓ̃ is defined by the absolute difference between the inverses of v1 and v2, which are the vertical coordinates from the horizon of the start and end points of the projection of the line segment. The value ℓ̃ is independent of the location of the line segment.

We define the sixth parameter, i.e., κ, after first defining a closely related quantity. Given a perspective image formed by a pinhole camera with square pixels and zero skew of a flat road and a line segment parallel to the road, the quantity

$$\tilde{\ell} = \left| \frac{1}{v_2} - \frac{1}{v_1} \right| \tag{12}$$

is invariant to translations of the line segment in the road plane, where v1 and v2 are the distances in the image from the line segment endpoints to the horizon line (see Fig. 3). We use the term inverse projected length (IPL) to denote the quantity ℓ̃.

The IPL is proportional to the line segment's length. The constant of proportionality

$$\kappa = \frac{\ell}{\tilde{\ell}} \tag{13}$$

where ℓ is the length of the line segment, is uniquely determined by the focal length f of the camera, the pan θ and tilt φ angles of the camera with respect to the road, and the height h of the camera above the road plane; it does not depend on the location of the line segment. The units of ℓ̃ are pixels⁻¹, whereas the units of κ are pixel·meters.

The constant κ, which we call the ILSF, relates lengths of line segments in the road along the direction of travel to their corresponding IPLs. The IPLs, in turn, are related to projections of the line segments in the image plane. Notice that, although Fig. 3 shows the zero-roll case, the definition of the ILSF is not dependent on the roll angle.
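The following short sketch (with hypothetical values only) illustrates how these road-model quantities interact: the lane-line angles follow the cotangent relations (10)-(11), and a known real-world length together with the IPL of (12) gives the ILSF via (13).

```python
import numpy as np

def lane_angle(psi, wm_t, wl_t, index, receding=True):
    # Image angle of a lane line from (10)-(11), where wm_t and wl_t are the
    # unitless projected median and lane widths.
    offset = wm_t / 2.0 + index * wl_t
    c = 1.0 / np.tan(psi)
    c = c + offset if receding else c - offset
    return np.arctan2(1.0, c)          # arccot, result in (0, pi)

def inverse_projected_length(v1, v2):
    # IPL (12): absolute difference of inverse vertical distances to the horizon.
    return abs(1.0 / v2 - 1.0 / v1)

# Hypothetical numbers: a line segment whose endpoints lie 120 and 150 pixels
# below the horizon, with a known real-world length of 12.2 m, gives kappa by (13).
ell_tilde = inverse_projected_length(120.0, 150.0)
kappa = 12.2 / ell_tilde               # pixel-meters
print(lane_angle(np.radians(60), 0.02, 0.05, index=1), kappa)
```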

IV. BOTTOM-UP PROCESSING

We approach calibration as a model fitting problem in which we find the parameters of the road model that best fit the image data. Of the nine parameters, three are assumed to be known, namely, no, nr, and wl.


Fig. 4. Overview of our approach to finding the six unknown road model parameters (u0, v0, ψ, w̃m, w̃l, and κ). Background subtraction yields foreground masks, which are used to generate a directed activity map. RANSAC applied to Canny edges of the background image yields a set of straight lines, from which the vanishing point (u0, v0) is estimated. The lines passing through the vanishing point are used to generate candidate models (ψ, w̃m, and w̃l), which are evaluated using the lines and directed activity map in an MCMC framework. The best model is then used, along with the vanishing point, to unwarp the background image, from which enhanced autocorrelation yields an estimate for κ.

The remaining six parameters are found by a combination of bottom-up and top-down processing, as shown in Fig. 4. Here, we describe the bottom-up processing, which consists in computing a directed activity map and detecting a set of straight lines in the image. From the lines, the vanishing point (u0, v0) is estimated.

A. Directed Activity Map Generation

The first step is to generate a background image and to perform background subtraction. We use the adaptive Gaussian-mixture-model-based approach of Zivkovic [31], which is available in OpenCV. This algorithm provides both a background image and binary foreground masks for each image frame. Each foreground mask is then median filtered for noise removal. One drawback of using the resulting blobs directly is that tall vehicles may spill over into adjacent lanes. To avoid this problem, we filter out all but the foreground pixels directly above background pixels. The remaining pixels, which we call vehicle base points (VBPs), are near the bases of the vehicles and are closely related to the idea of vehicle base fronts [15], except that we make no distinction based on the location of the points relative to the front, side, or rear of the vehicle. A typical result of this foreground/background segmentation and VBP filtering is shown in Fig. 5.

Each VBP is tracked from the previous frame using a simple block matching technique. A window (size 7 × 7) around the VBP is extracted, and the binary image containing 1s indicating the neighboring VBPs is compared with the binary image from the previous frame. The best match from this 2-D search within ten pixels of the location in the previous frame yields a vector representing the image motion for this point, which is then accumulated in a directed activity map. This directed activity map provides a vector for each pixel that indicates the sum of activity at that location in the image.

Fig. 5. (Top left) Image from the sequence. (Top right) Foreground blobs. (Bottom left) Background image. (Bottom right) VBPs.

Fig. 6. Horizontal and vertical components of the directed activity map. Bright pixels indicate large positive values, whereas dark pixels indicate large negative values.

Fig. 6 shows the horizontal and vertical components of the directed activity map for a video sequence.
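A rough sketch of this stage is given below, assuming OpenCV's MOG2 background subtractor (the Zivkovic method mentioned above); the video file name, thresholds, and the simplified block-matching loop are illustrative assumptions rather than the authors' implementation.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("traffic.avi")                 # illustrative file name
subtractor = cv2.createBackgroundSubtractorMOG2()     # Zivkovic's method
activity, prev_vbp = None, None

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = cv2.medianBlur(subtractor.apply(frame), 5)
    fg = mask == 255
    # Vehicle base points: foreground pixels whose neighbor directly below is background.
    below_bg = np.zeros_like(fg)
    below_bg[:-1, :] = ~fg[1:, :]
    vbp = fg & below_bg
    if activity is None:
        activity = np.zeros(fg.shape + (2,), np.float32)
    if prev_vbp is not None:
        rows, cols = fg.shape
        for y, x in zip(*np.nonzero(vbp)):
            if y < 13 or x < 13 or y >= rows - 13 or x >= cols - 13:
                continue                               # keep all windows in bounds
            win = vbp[y - 3:y + 4, x - 3:x + 4]        # 7 x 7 binary window
            best, best_score = None, -1
            for dy in range(-10, 11):                  # 2-D search within +/- 10 px
                for dx in range(-10, 11):
                    ref = prev_vbp[y + dy - 3:y + dy + 4, x + dx - 3:x + dx + 4]
                    score = np.count_nonzero(ref & win)
                    if score > best_score:
                        best_score, best = score, (dx, dy)
            # Accumulate the motion vector from the previous frame to this one.
            activity[y, x, 0] -= best[0]
            activity[y, x, 1] -= best[1]
    prev_vbp = vbp

background = subtractor.getBackgroundImage()
```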

B. Line and Vanishing Point Detection

Next, Canny edges [3] are detected on the background image, followed by erosion and dilation to remove noise.


Fig. 7. (Left) Canny edge image (white pixels), with the lines (red) detected by our RANSAC procedure overlaid. (Right) Lines that pass through the vanishing point. Note that the lines on the right are not necessarily a subset of the lines on the left.

We run a repeated RANSAC procedure to fit straight lines to the set of edge pixels in the Canny image, removing inlier edge pixels at each iteration until no more straight lines can be found. The vanishing point is then found by applying a similar RANSAC procedure to the set of straight lines, proposing candidate locations as the intersection of two lines and evaluating such locations by the number of lines within a small distance.

Once the vanishing point has been found, we determine the lines passing through the vanishing point. The most obvious approach would be to simply discard any of the detected lines that is not within some distance of the vanishing point. Unfortunately, this approach would yield lines passing near the vanishing point but not necessarily directly through it. As a result, we run yet another RANSAC procedure, proposing lines using the vanishing point and one point chosen at random from the Canny edge image and evaluating such lines by the amount of support they obtain in the edge image. Another advantage of this approach is that it is less likely to produce false negatives, that is, it is more likely to find all the lane lines. The results of edge detection, line detection, vanishing point estimation, and the detection of lines through the vanishing point are shown in Fig. 7.
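The following sketch outlines one possible implementation of the two RANSAC stages described above (line fitting on edge pixels, then vanishing point estimation from line intersections); the thresholds and iteration counts are made-up values, not those used in the paper.

```python
import numpy as np

def fit_lines_ransac(edge_pts, n_iters=500, tol=1.5, min_inliers=80):
    # Repeatedly fit straight lines to Canny edge pixels with RANSAC,
    # removing the inliers after each accepted line.
    pts = edge_pts.astype(float)            # N x 2 array of (u, v)
    lines = []
    while len(pts) > min_inliers:
        best_inl, best_line = None, None
        for _ in range(n_iters):
            p, q = pts[np.random.choice(len(pts), 2, replace=False)]
            d = q - p
            n = np.array([-d[1], d[0]])
            if np.linalg.norm(n) < 1e-9:
                continue
            n = n / np.linalg.norm(n)
            inl = np.abs((pts - p) @ n) < tol
            if best_inl is None or inl.sum() > best_inl.sum():
                best_inl, best_line = inl, (p, q)
        if best_inl is None or best_inl.sum() < min_inliers:
            break
        lines.append(best_line)
        pts = pts[~best_inl]
    return lines

def vanishing_point_ransac(lines, tol=3.0, n_iters=500):
    # Propose candidate vanishing points as intersections of two lines and
    # score them by how many lines pass within tol pixels.
    def homog(line):
        p, q = line
        return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])
    def dist(line, pt):
        h = homog(line)
        return abs(h @ np.array([pt[0], pt[1], 1.0])) / np.hypot(h[0], h[1])
    best, best_count = None, -1
    for _ in range(n_iters):
        i, j = np.random.choice(len(lines), 2, replace=False)
        v = np.cross(homog(lines[i]), homog(lines[j]))
        if abs(v[2]) < 1e-9:
            continue
        v = v[:2] / v[2]
        count = sum(dist(l, v) < tol for l in lines)
        if count > best_count:
            best, best_count = v, count
    return best
```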

V. TOP-DOWN PROCESSING

After bottom-up processing, only four parameters remain to be estimated: ψ, w̃m, w̃l, and κ. This section describes the top-down processing to estimate the first three of these parameters. We use an MCMC algorithm to search the space of detected line segments; fit ψ, w̃m, and w̃l to the line segments selected; and evaluate the resulting road model. This process is repeated for some number of iterations, retaining the best model found so far.

A. Line Selection

Line selection is the random jump step of MCMC, producing two sequences of lines that possibly represent the oncoming and receding lane lines. We make the observation that there is generally little texture in the road portion of the images except for the lane-line paint. As a result, we expect to generate few false positives on the road, although there may be large numbers of false positives in the median and on the sides of the road. Based on this observation, our procedure is as follows. The detected line segments that pass through the vanishing point are ordered according to their angle in the image. For each side of the road, a line segment is selected at random; then the neighboring line segment is either randomly skipped or selected; then its neighboring line segment is either randomly skipped or selected, with the process repeating until the proper number of line segments have been selected. By setting the skip probability to a low value (10%), this procedure yields a sequence of nr + 1 lines and another sequence of no + 1 lines, such that the sequences are either contiguous (no false positives in the road) or nearly contiguous (few false positives in the road).
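A minimal sketch of this random-jump step for one side of the road is shown below; the single walk direction and the handling of running out of candidates are simplifying assumptions.

```python
import numpy as np

def select_lines(sorted_angles, n_lines, p_skip=0.1, rng=np.random):
    # Start at a random line (the lines are ordered by angle), then visit each
    # neighbor in turn, randomly skipping it with probability p_skip, until
    # n_lines lines have been selected.
    idx = rng.randint(len(sorted_angles))
    selected = []
    while len(selected) < n_lines and idx < len(sorted_angles):
        if not selected or rng.rand() >= p_skip:
            selected.append(sorted_angles[idx])
        idx += 1
    return selected

# Example: choose n_r + 1 = 4 candidate receding lane lines from detected angles.
angles = np.sort(np.radians([35, 41, 48, 52, 57, 63, 70]))
print(select_lines(angles, 4))
```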

B. Model Fitting

Once the lines have been selected, ψ, w̃m, and w̃l must be computed. Let ψr^(i) be the ith selected line in the group of receding lines, and similarly, let ψo^(j) be the jth line in the group of oncoming lines. From (10) and (11), assuming no noise, we have

$$\begin{bmatrix}
\cot\psi + \tfrac{1}{2}\tilde{w}_m + 0 \\
\cot\psi + \tfrac{1}{2}\tilde{w}_m + \tilde{w}_l \\
\vdots \\
\cot\psi + \tfrac{1}{2}\tilde{w}_m + n_r\tilde{w}_l \\
\cot\psi - \tfrac{1}{2}\tilde{w}_m - 0 \\
\cot\psi - \tfrac{1}{2}\tilde{w}_m - \tilde{w}_l \\
\vdots \\
\cot\psi - \tfrac{1}{2}\tilde{w}_m - n_o\tilde{w}_l
\end{bmatrix}
=
\begin{bmatrix}
\cot\psi_r^{(0)} \\ \cot\psi_r^{(1)} \\ \vdots \\ \cot\psi_r^{(n_r)} \\
\cot\psi_o^{(0)} \\ \cot\psi_o^{(1)} \\ \vdots \\ \cot\psi_o^{(n_o)}
\end{bmatrix} \tag{14}$$

yielding the following overconstrained linear system:

$$\begin{bmatrix}
1 & 1 & 0 \\
1 & 1 & 1 \\
\vdots & \vdots & \vdots \\
1 & 1 & n_r \\
1 & -1 & 0 \\
1 & -1 & -1 \\
\vdots & \vdots & \vdots \\
1 & -1 & -n_o
\end{bmatrix}
\begin{bmatrix}
\cot\psi \\ \tfrac{1}{2}\tilde{w}_m \\ \tilde{w}_l
\end{bmatrix}
=
\begin{bmatrix}
\cot\psi_r^{(0)} \\ \cot\psi_r^{(1)} \\ \vdots \\ \cot\psi_r^{(n_r)} \\
\cot\psi_o^{(0)} \\ \cot\psi_o^{(1)} \\ \vdots \\ \cot\psi_o^{(n_o)}
\end{bmatrix} \tag{15}$$

which is solved for cot(ψ) (and thus ψ), ½w̃m (and thus w̃m), and w̃l.

In some scenarios, only one side of the road is visible. This is a special case of our more general model and is easily implemented. With no receding lanes, for example, nr = 0, and no lines for receding lanes are selected in the previous step. Parameter w̃m is forced to zero, and the following equation is used instead of (15) to estimate ψ and w̃l:

$$\begin{bmatrix}
1 & 0 \\
1 & -1 \\
\vdots & \vdots \\
1 & -n_o
\end{bmatrix}
\begin{bmatrix}
\cot\psi \\ \tilde{w}_l
\end{bmatrix}
=
\begin{bmatrix}
\cot\psi_o^{(0)} \\ \cot\psi_o^{(1)} \\ \vdots \\ \cot\psi_o^{(n_o)}
\end{bmatrix}. \tag{16}$$
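The least-squares fit of (15) is straightforward to implement; the sketch below (with hypothetical input angles) solves for cot ψ, w̃m/2, and w̃l using numpy.

```python
import numpy as np

def fit_road_model(psi_r, psi_o):
    # Least-squares solution of (15) for cot(psi), w_m~/2, and w_l~, given the
    # angles (radians) of the selected receding and oncoming lane lines.
    rows, rhs = [], []
    for i, ang in enumerate(psi_r):
        rows.append([1.0, 1.0, float(i)])
        rhs.append(1.0 / np.tan(ang))
    for j, ang in enumerate(psi_o):
        rows.append([1.0, -1.0, -float(j)])
        rhs.append(1.0 / np.tan(ang))
    sol, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    cot_psi, half_wm, wl = sol
    return np.arctan2(1.0, cot_psi), 2.0 * half_wm, wl   # psi, w_m~, w_l~

# Hypothetical angles for n_r = 2 receding and n_o = 2 oncoming lane lines.
psi, wm_t, wl_t = fit_road_model(np.radians([62, 55, 49]), np.radians([70, 78, 85]))
```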


C. Model Evaluation

Let E(M|A, Ψ) be the energy of the road model M = (ψ, w̃m, w̃l), given the directed activity map A and the selected line segments Ψ. We assume that the other six road model parameters are fixed. We define the energy as the sum of two energy terms, i.e.,

E(M|A, Ψ) = λΨEΨ(M|Ψ) + λAEA(M|A). (17)

The first term is the error between the projected lines (according to the parameters) and the detected lines. Rather than using an algebraic error from the residue of the least-squares fitting, we notice from (15) that the angle of a projected line is given by arccot(cot ψ + ½w̃m + i·w̃l) for receding lanes and, similarly, arccot(cot ψ − ½w̃m − j·w̃l) for oncoming lanes. Comparing these angles with the detected lines yields a geometric error, i.e.,

$$E_\Psi(\mathcal{M}|\Psi) = \frac{1}{n_r}\sum_{i=0}^{n_r}\left|\operatorname{arccot}\!\left(\cot\psi + \frac{\tilde{w}_m}{2} + i\tilde{w}_l\right) - \psi_r^{(i)}\right| \tag{18}$$

$$\qquad\qquad + \frac{1}{n_o}\sum_{j=0}^{n_o}\left|\operatorname{arccot}\!\left(\cot\psi - \frac{\tilde{w}_m}{2} - j\tilde{w}_l\right) - \psi_o^{(j)}\right|. \tag{19}$$

The activity map error, i.e., EA(M|A), is defined as

$$E_A(\mathcal{M}|A) = \frac{1}{|C_r|}\sum_{(u,v)\in C_r} [\cos\psi_r\ \ \sin\psi_r]\, A(u,v) \tag{20}$$

$$\qquad\qquad - \frac{1}{|C_o|}\sum_{(u,v)\in C_o} [\cos\psi_o\ \ \sin\psi_o]\, A(u,v) \tag{21}$$

where Cr and Co are two sets containing the image pixels in the road on either side (not including the median), and ψr and ψo are the predicted directions of travel for the two sides. The equation therefore calculates, for each pixel (u, v) in these regions, the inner product of the observed direction of travel, given by the 2 × 1 vector in the directed activity map A(u, v) for that pixel, and the predicted direction of travel according to the model. For computational efficiency, we set ψr = ½(ψr^(0) + ψr^(nr)) and ψo = ½(ψo^(0) + ψo^(no)), which allows the inner product to be pulled out of the summation, which further reduces computation. (A more accurate approach would be to compute ψr and ψo separately for each pixel (u, v) as the angle between the pixel and the vanishing point (u0, v0), but we did not notice a significant difference in the results obtained using this much slower method.)

To find appropriate values for the scaling factors λΨ and λA, ten simulated data sets were used. These data sets were generated by randomly perturbing the lines on the image so that the lane widths were not all identical. Our method was then run on these data sets, and the correct road model was evaluated, yielding values for EΨ(Mk|Ψk) and EA(Mk|Ak) for k = 1, . . . , 10, where k is the training set index. The scaling factors were then set to

$$\lambda_\Psi = \frac{\sum_k E_A(\mathcal{M}_k|A_k)}{\sum_k E_\Psi(\mathcal{M}_k|\Psi_k) + \sum_k E_A(\mathcal{M}_k|A_k)} \tag{22}$$

$$\lambda_A = \frac{\sum_k E_\Psi(\mathcal{M}_k|\Psi_k)}{\sum_k E_\Psi(\mathcal{M}_k|\Psi_k) + \sum_k E_A(\mathcal{M}_k|A_k)}. \tag{23}$$

This procedure causes λΨEΨ(Mk|Ψk) and λAEA(Mk|Ak) to be of roughly the same scale. From our simulated data sets, we used λΨ = 0.17 and λA = 0.83.
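A compact sketch of the energy evaluation of (17)-(21) is given below; the region masks, activity map layout, and normalization follow the description above, but the data structures themselves are assumptions made for illustration.

```python
import numpy as np

def model_energy(psi, wm_t, wl_t, psi_r, psi_o, activity, C_r, C_o,
                 lam_psi=0.17, lam_a=0.83):
    # Energy (17) of a candidate model: geometric line error (18)-(19) plus the
    # directed-activity term (20)-(21). activity is an H x W x 2 array of summed
    # motion vectors, and C_r / C_o are boolean masks of the two road regions.
    arccot = lambda c: np.arctan2(1.0, c)
    cot_psi = 1.0 / np.tan(psi)
    e_line = sum(abs(arccot(cot_psi + wm_t / 2 + i * wl_t) - a)
                 for i, a in enumerate(psi_r)) / max(len(psi_r) - 1, 1)
    e_line += sum(abs(arccot(cot_psi - wm_t / 2 - j * wl_t) - a)
                  for j, a in enumerate(psi_o)) / max(len(psi_o) - 1, 1)
    # Mean direction per side, so the inner product can be pulled out of the sum.
    dir_r = 0.5 * (psi_r[0] + psi_r[-1])
    dir_o = 0.5 * (psi_o[0] + psi_o[-1])
    e_act = np.array([np.cos(dir_r), np.sin(dir_r)]) @ activity[C_r].sum(axis=0)
    e_act /= max(C_r.sum(), 1)
    e_act -= (np.array([np.cos(dir_o), np.sin(dir_o)]) @ activity[C_o].sum(axis=0)
              ) / max(C_o.sum(), 1)
    return lam_psi * e_line + lam_a * e_act
```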

VI. FINDING THE IMAGE-TO-LENGTH SCALING FACTOR κ

The eight parameters already estimated (no, nr, wl, u0, v0, ψ, w̃l, and w̃m) are sufficient to accurately overlay projected lane lines on the image. To fully calibrate, however, we also need to know the ILSF κ, which is a scaling factor that relates lengths along the road with distances in the image. To estimate κ, our approach involves detecting the dashed lines in the image and relying on the assumption of a known dash length (10 ft) and known dash spacing (30 ft).

The first step is to unwarp the image into a skewed bird's-eye view. The skew occurs because our unwarping converts to an affine plane rather than to a Euclidean/similarity plane. To simplify the following analysis, consider a translated image coordinate system shifted so that the vanishing point is the origin; that is, in the following, (u, v) denotes the original image coordinates shifted by (−u0, −v0). Let the unwarped image coordinates of a point be represented by (x, y), where the y-axis points along the direction of travel (aligned with the yr-axis), but the x-axis is not necessarily aligned with anything in particular.

We define the unwarping transformation Hw between the projective plane of the original image and the resulting affine plane as

$$\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \sim \underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}}_{H_w} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}. \tag{24}$$

This choice of Hw satisfies two important constraints. First, the vanishing point (u, v) = (0, 0) in the image (which is the point at infinity associated with the yr-direction) maps to the point at infinity associated with the y-direction. This is easily seen by the fact that Hw[0 0 1]^T = [0 1 0]^T. Second, the horizon line in the original image (which is the line at infinity associated with the road plane) maps to the line at infinity in the unwarped plane. Using duality, this is seen by Hw^(−T)[0 1 0]^T = [0 0 1]^T, where [0 1 0]^T is the homogeneous representation of the line v = 0, and [0 0 1]^T is the homogeneous representation of the line at infinity. Other choices for Hw are possible, but this one has the advantage of simplicity, because Hw is orthogonal and symmetric, i.e., Hw^(−1) = Hw^T = Hw. For illustrative purposes, Fig. 8 shows the unwarped image, although our procedure does not actually unwarp the entire image since it only operates along lane lines.


Fig. 8. Road in Fig. 5 unwarped using Hw. The left–right direction is parallel to the y-axis, and the up–down direction is parallel to the x-axis. Notice that there is a skew between this image and the real top-down view of the road; only the y-direction is metric.

The next step is to extract a 1-D signal parallel to y along each interior (dashed) lane line. To perform this step, two parameters must be determined: the row index (x) of the unwarped image for each lane and the sampling period T along the y-direction, which is the same for all lane lines. First, we describe how to determine x for a given lane. From (24), we see that for any image point (u, v), x = u/v. By inspection, Fig. 3 reveals that tan ψr^(i) = v/u for any image point (u, v) along the ith receding lane line and, similarly, for the oncoming lane lines. Combining with (10) and (11) yields expressions for x for each lane, i.e.,

$$x_r^{(i)} = \cot\psi_r^{(i)} = \cot\psi + \frac{\tilde{w}_m}{2} + i\tilde{w}_l \tag{25}$$

$$x_o^{(j)} = \cot\psi_o^{(j)} = \cot\psi - \frac{\tilde{w}_m}{2} - j\tilde{w}_l. \tag{26}$$

Now, we describe how to estimate the sampling period T. Let (u1, v1) and (u2, v2) be two points along the median line, and let (x1, y1) and (x2, y2) be their transformations onto the unwarped image. To avoid losing information, we want no more than one pixel in the original image to map to each pixel in the unwarped image. As a result, we place the original two points so that they lie near the intersection of the median line and the image boundary and are one pixel apart from each other, that is, √((u2 − u1)² + (v2 − v1)²) = 1. Then, the sampling period for all dashed line signals is set as T ≡ y2 − y1.

The final step is to estimate the period of the 1-D lane-line signals in the unwarped image using the enhanced autocorrelation method of Tolonen and Karjalainen [22]. The enhanced autocorrelation of a signal is given by

$$\hat{R}(\delta) = \max\left\{ R(\delta) - R\!\left(\left\lfloor \frac{\delta}{2} \right\rfloor\right),\ 0 \right\} \tag{27}$$

where R(δ) is the standard autocorrelation. In other words, a stretched version of the autocorrelation function is subtracted from the original, and the result is then clamped to only nonnegative values. This method has the effect of removing the even-ordered harmonics, whose peaks are located at 0, 2, 4, . . . times the location of the first peak. The same process could be repeated with additional stretching to remove the third harmonic or more, but we have not found this to be necessary.

The period of one signal is estimated as

$$\tau = T \arg\max_\delta \hat{R}(\delta). \tag{28}$$

In our case, the enhanced autocorrelation is calculated for each of the unwarped dashed lane lines, producing R̂r^(1), . . . , R̂r^(nr−1), R̂o^(1), . . . , R̂o^(no−1), and these are added together for the overall estimate of the period, i.e.,

$$\tau = T \arg\max_\delta \left( \sum_{i=1}^{n_r-1} \hat{R}_r^{(i)}(\delta) + \sum_{j=1}^{n_o-1} \hat{R}_o^{(j)}(\delta) \right). \tag{29}$$

This period τ provides a measurement of the length from the start of one dash to the next in the unwarped image. (See Fig. 9 for the image gray levels of the dashed lines, the autocorrelation of each of the lines, and the sum of the enhanced autocorrelation functions.)
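The sketch below shows one way to compute the enhanced autocorrelation of (27) and the combined period estimate of (29) with numpy; the synthetic dash signal is purely illustrative.

```python
import numpy as np

def enhanced_autocorrelation(signal):
    # Enhanced autocorrelation (27): subtract a time-stretched copy of the
    # standard autocorrelation and clamp the result to nonnegative values.
    s = np.asarray(signal, float) - np.mean(signal)
    r = np.correlate(s, s, mode="full")[len(s) - 1:]     # R(delta) for delta >= 0
    stretched = r[np.arange(len(r)) // 2]                # R(floor(delta / 2))
    return np.maximum(r - stretched, 0.0)

def dash_period(lane_signals, T):
    # Sum the enhanced autocorrelations of all dashed-lane signals and take the
    # peak location, scaled by the sampling period T, as in (29).
    total = sum(enhanced_autocorrelation(s) for s in lane_signals)
    delta = np.argmax(total[1:]) + 1                     # skip the trivial lag 0
    return T * delta

# Purely synthetic example: two 1-D gray-level signals with a 40-sample period.
sig = np.tile(np.r_[np.ones(10), np.zeros(30)], 8)
tau = dash_period([sig, np.roll(sig, 5)], T=0.01)
```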

To understand what τ represents, consider a point (u1, v1) with some distinctive feature on a lane line. Such a point will transform to the unwarped image according to Hw[u1 v1 1]^T. If we can detect another feature at a distance of τ along the y-axis from the previous point, then this new point corresponds to coordinates (u2, v2) in the original image, where

$$\begin{bmatrix} u_2 \\ v_2 \\ 1 \end{bmatrix} \sim H_w^{-1} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & \tau \\ 0 & 0 & 1 \end{bmatrix} H_w \begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix} \tag{30}$$

$$= \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & \tau & 1 \end{bmatrix} \begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix}. \tag{31}$$

Thus

$$v_2 = \frac{v_1}{1 + v_1\tau}. \tag{32}$$

Rearranging and solving for τ reveals

$$\tau = \frac{1}{v_2} - \frac{1}{v_1}. \tag{33}$$

Comparing this result with (12), we see that τ is identical to the ℓ̃ associated with the two points in the image. If we know the real-world distance ℓ, then we can use (13) to compute κ = ℓ/τ.

VII. CALIBRATION

The set of nine parameters in the road-to-image transformation (no, nr, wl, wm, dm, f, φ, θ, and h) is directly related to the nine parameters in the road model (no, nr, wl, u0, v0, ψ, w̃l, w̃m, and κ). Three of the parameters are exactly the same, namely, no, nr, and wl.


Fig. 9. (Top) Unwarped gray levels along each dashed lane line, after applying a gray-level dilation to the original image to remove noise and extend the width of the dashed lines. Multiples of 100 were added to the signals to prevent overlap, and the colors used for display were arbitrarily chosen. (Middle) Result of autocorrelation for each dashed line. (Bottom) Sum of all enhanced autocorrelation sequences. Notice that the second harmonic is removed and that the third harmonic is noticeably smaller, so that the peak at 186 samples (when multiplied by T) reveals the desired period τ.

The other six parameters (wm, dm, f, φ, θ, and h) are calculated from the road model parameters as follows.

A. Relation Between Road Model and Camera Model

Since the direction of travel is along the yr-axis, the point at infinity corresponding to that direction is given by x̄∞ = [0 1 0]^T, where the overbar indicates coordinates in the 2-D road plane. Applying the homography from (7), the vanishing point in the image is given by

$$p_\infty = \begin{bmatrix} u_0 \\ v_0 \\ 1 \end{bmatrix} \sim H\bar{x}_\infty = \begin{bmatrix} -f\sin\theta \\ -f\sin\phi\cos\theta \\ \cos\phi\cos\theta \end{bmatrix} \tag{34}$$

or converting to inhomogeneous coordinates

$$u_0 = \frac{-f\tan\theta}{\cos\phi} \tag{35}$$

$$v_0 = -f\tan\phi. \tag{36}$$

Here, we continue to use (u, v) to represent coordinates translated so that the vanishing point is the origin, i.e., the original image coordinates shifted by (−u0, −v0). Applying this translation yields the mapping from a point x̄ in the road plane to the point (u, v) as

$$p \sim \begin{bmatrix} 1 & 0 & -u_0 \\ 0 & 1 & -v_0 \\ 0 & 0 & 1 \end{bmatrix} H \bar{x} \tag{37}$$

$$= \underbrace{\begin{bmatrix} \dfrac{f}{\cos\theta} & 0 & \dfrac{f d_m \cos\phi + f h \sin\phi\sin\theta}{\cos\phi\cos\theta} \\ 0 & 0 & \dfrac{f h}{\cos\phi} \\ \cos\phi\sin\theta & \cos\phi\cos\theta & d_m\cos\phi\sin\theta + h\sin\phi \end{bmatrix}}_{\bar{H}} \bar{x} \tag{38}$$

where p = [u v 1]^T are the homogeneous coordinates of (u, v), and (35) and (36) were substituted to yield the expression for H̄. A point (x, y) on the road, therefore, projects to

$$p \sim \begin{bmatrix} \dfrac{f(d_m + x)\cos\phi + f h\sin\phi\sin\theta}{\cos\phi\cos\theta} \\[1ex] \dfrac{f h}{\cos\phi} \\[1ex] (d_m + x)\cos\phi\sin\theta + h\sin\phi + y\cos\phi\cos\theta \end{bmatrix}. \tag{39}$$

To derive the relationship between ψ and the road-to-image parameters, we note that, for any point (u, v), the angle of the line passing through the point and the origin (vanishing point) is given by cot ψ(u,v) = u/v. Therefore, if we let ψ(x′) refer to the angle in the image of the projection of the line x = x′ (which is along the direction of travel), then we can divide the first and second elements of (39) to yield

$$\cot\psi(x') = \frac{(d_m + x')\cos\phi + h\sin\phi\sin\theta}{h\cos\theta}. \tag{40}$$

In particular, a point on the median line satisfies x′ = 0, leading to

$$\cot\psi = \frac{d_m\cos\phi + h\sin\phi\sin\theta}{h\cos\theta}. \tag{41}$$

To derive the relationship between w̃m and the road-to-image parameters, we first notice that ψr^(0) = ψ(wm/2), because x′ = wm/2 for the innermost lane line. Substituting x′ = wm/2 into (40) and subtracting (41) yields

$$\cot\psi_r^{(0)} - \cot\psi = \frac{w_m\cos\phi}{2h\cos\theta}. \tag{42}$$

From (10), we notice that

$$\cot\psi_r^{(0)} - \cot\psi = \tilde{w}_m/2. \tag{43}$$

Equating these two yields the desired expression, i.e.,

$$\tilde{w}_m = \frac{w_m\cos\phi}{h\cos\theta}. \tag{44}$$

Applying the same procedure to the angles ψr^(0) and ψr^(1) yields a similar relationship for the lane width parameter, i.e.,

$$\tilde{w}_l = \frac{w_l\cos\phi}{h\cos\theta}. \tag{45}$$


To derive the relationship between κ and the road-to-image parameters, let (u(x,y), v(x,y)) be the image coordinates in the translated coordinate system of the projection of the point (x, y) on the road. By converting (39) to inhomogeneous coordinates, we obtain expressions for u(x,y) and v(x,y), but only the latter is important. Thus

$$v_{(x,y)} = \frac{f h}{(d_m + x)\cos^2\phi\sin\theta + y\cos^2\phi\cos\theta + h\cos\phi\sin\phi}. \tag{46}$$

It is easy to see (see Fig. 3) that v(x,y) represents the vertical distance in the image from the projection of (x, y) to the vanishing point. Hence, from (12), the IPL of a line segment of length ℓ in the direction of the road with endpoints (x, y) and (x, y + ℓ) is given by

$$\tilde{\ell} = \frac{1}{v_{(x,y+\ell)}} - \frac{1}{v_{(x,y)}} \tag{47}$$

$$= \frac{\ell\cos^2\phi\cos\theta}{f h}. \tag{48}$$

Because ℓ̃ does not depend on x or y, it is independent of the position of the line segment in the road. From the definition of κ in (13) and (48), we have

$$\kappa = \frac{\ell}{\tilde{\ell}} = \frac{f h}{\cos^2\phi\cos\theta}. \tag{49}$$

From this equation, we can clearly see that the ILSF, i.e., κ, is uniquely determined by f, φ, θ, and h. Notice from this equation that, in the case of zero roll, the ILSF is the same as the "scale factor" introduced by Cathey and Dailey [4].

B. Estimating the Road-to-Image Parameters

We now show how to estimate the road-to-image parameters from the road model parameters using the relationships derived earlier.

From (36), an expression can be found for f, i.e.,

$$f = \frac{-v_0\cos\phi}{\sin\phi}. \tag{50}$$

From (45), an expression can be found for h, i.e.,

$$h = \frac{w_l\cos\phi}{\tilde{w}_l\cos\theta}. \tag{51}$$

Substituting (50) into (35) and solving for sin φ yields

$$\sin\phi = \frac{v_0\tan\theta}{u_0}. \tag{52}$$

This equation can be substituted into (50) and the result, along with (51), substituted into (49) to yield

$$\sin\theta\cos\theta = \frac{-u_0 w_l}{\kappa\tilde{w}_l} \tag{53}$$

which is simplified using the double-angle formula to yield

$$\sin 2\theta = \frac{-2 u_0 w_l}{\kappa\tilde{w}_l}. \tag{54}$$

Rearranging (41) yields an expression for dm, i.e.,

$$d_m = \frac{h\cos\theta\cot\psi - h\sin\phi\sin\theta}{\cos\phi}. \tag{55}$$

Finally, an equation for wm can be found by rearranging (44), i.e.,

$$w_m = \frac{\tilde{w}_m h\cos\theta}{\cos\phi}. \tag{56}$$

To summarize, the procedure for finding the road-to-image parameters is as follows.

1) Compute θ using (54), which relies only on road model parameters.
2) Given θ, use (52) to solve for φ.
3) Solve for f and h using (50) and (51), respectively.
4) Compute dm and wm using (55) and (56), respectively.
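A minimal sketch of this recovery procedure is given below; it returns only the arcsine branch of (54) with |θ| ≤ π/4 (the m+ solution in the notation used in the results) and assumes that u0 and v0 are measured relative to the principal point.

```python
import numpy as np

def recover_camera(u0, v0, psi, wm_t, wl_t, kappa, w_l):
    # Road model -> road-to-image parameters, following steps 1)-4):
    # theta from (54), phi from (52), f and h from (50)-(51), then d_m and
    # w_m from (55)-(56). wm_t and wl_t are the unitless w~ quantities.
    theta = 0.5 * np.arcsin(np.clip(-2.0 * u0 * w_l / (kappa * wl_t), -1.0, 1.0))
    phi = np.arcsin(v0 * np.tan(theta) / u0)                      # (52)
    f = -v0 * np.cos(phi) / np.sin(phi)                           # (50)
    h = w_l * np.cos(phi) / (wl_t * np.cos(theta))                # (51)
    d_m = (h * np.cos(theta) / np.tan(psi)
           - h * np.sin(phi) * np.sin(theta)) / np.cos(phi)       # (55)
    w_m = wm_t * h * np.cos(theta) / np.cos(phi)                  # (56)
    return f, phi, theta, h, d_m, w_m
```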

The careful reader may have noticed that an ambiguity exists when solving for θ in (54) because −π < 2θ < π. As a result, for any θ that satisfies (54), ±π/2 − θ also satisfies the equation, where the sign is determined by the sign of θ. However, some values of θ cause the calibration solution to include nonreal values [e.g., when the right-hand side of (52) is greater than 1]. If the solution for one value of θ is real while the other is nonreal, then there is no ambiguity. The actual conditions under which this occurs are given in [13, eq. (42)]. Thus

$$\left( \frac{\sin^2\theta}{\sin\phi} + \cos^2\theta\,\sin\phi \right)^{2} < 1. \tag{57}$$

As a rule of thumb, this condition approximately holds when θ < 40° and when φ is greater than (or at least not much less than) θ. One way to overcome the ambiguity when it occurs is to use a known value for one of the parameters to test against the two possible solutions. For example, the height of the camera could be found during an initial calibration step in which the camera is pointed in a nonambiguous direction, after which the now-known height can be used to resolve future ambiguities. It should also be noted that either set of calibration parameters will yield correct measurements of length and speed along the direction of the road; errors will only occur with measurements that are not aligned with the direction of the road.

VIII. RESULTS

The proposed algorithm was tested on 15 data sets, both simulated and real. The first six data sets have ground truth available to show accuracy. The other nine data sets were used to show the robustness of the method. The data sets cover a wide variety of angles and road positions with challenging conditions such as bridges (data set 9), on/off ramps (data sets 9, 11, 12, and 14), road curvature (data sets 4, 11, and 13), and rain (data set 5). In all the data sets, we assume that the roads obey the suggestions in the United States Department of Transportation's Manual on Uniform Traffic Control Devices, such that lane widths are 12 ft (3.7 m), and the distance from the start of one dash to the start of the next dash is 40 ft (12.2 m) [25].


Fig. 10. Images of the road (background image) for each data set with the road model overlaid. The red lines indicate the lane markings, whereas green indicates the median line. For data sets 4, 5, 6, and 15, only the oncoming direction is clearly visible; hence, only one roadway is modeled. For these data sets, the roadway marking closest to the median is drawn in green. For data sets 7–15, the cyan lines indicate the distance (12.2 m) between dashes used for length error calculations.

Fig. 11. Sample image from data set 1, which is a simulated data set. The cars are represented as gray boxes (and are needed for activity map generation).

Images from the data sets, with the detected lane lines and median center overlaid, are shown in Fig. 10.

Data sets 1–3 were simulated with a GNU Octave script to generate an accurate set of images representing a highway with gray boxes ("cars") that move along the lanes to properly populate the directed activity maps. (These data sets are different from those used in training λΨ and λA.) An example from data set 1 is shown in Fig. 11. The results from our implementation on these data sets are shown in Table II along with the ground truth. Since data set 2 has an ambiguity, there are two possible sets of parameters; using the notation in [13], these are represented as m+ (|θ| < π/4) and m− (otherwise). Overall, on these data sets, our algorithm performs well, achieving very small error on every parameter except for the focal length of data set 3. Note that on data set 3, the test error (4.1%) is smaller than the focal length error (7.5%). While this situation may appear somewhat counterintuitive at first, it arises because road measurements are used in the calibration process, thus making road distances estimated using the calibration parameters more accurate than the parameters themselves. This outcome further supports the practical utility of the approach.

To test the ability of our system to make measurements using the estimated parameters on these data, we projected the road plane points (0 m, 0 m) and (3 m, 4 m) onto each image using the ground truth mapping and then used our parameters to map the coordinates back and estimate the distance. The ground truth distance, the estimated distance, and the percent error are provided in Table II. Assuming that the ambiguity can be resolved, the test error was less than 5% in all cases. Note that the incorrect choice for data set 2 leads to a large error since the test length is not in the direction of the road.

Data sets 4–6 were used by Kanhere and Birchfield [13] to evaluate different manual calibration procedures. For evaluating accuracy, each of these data sets has a known height, known distance to the median, known focal length, and a known real-world distance (between the cones in data sets 4 and 5 or between tar marks on the road in data set 6). Since this distance is not used by the calibration procedure, it can be used to evaluate our method's ability to make measurements. The results are shown in Table III. In addition, data set 6 has a known tilt angle. Since these three cameras do not have a good view of the road lines on the roadway for receding traffic, the road parameters are estimated using only the oncoming roadway (nr = 0). Despite this loss of information, which affects the accuracy of some of the parameters, the method yields a test length error of less than 10%. Although data set 5 is ambiguous, either option produces a small test length error, since the length is in the direction of travel.

To further test the robustness of the system, nine other real-world traffic data sets (data sets 7–15) were used.


TABLE IIRESULTS FOR SIMULATED DATA SETS 1–3 ALONG WITH THE GROUND TRUTH. NOTE, FOR DATA SET 2, THE IMPORTANCE OF

CHOOSING THE CORRECT SOLUTION

TABLE IIIRESULTS FOR DATA SETS 4–6 ALONG WITH THE GROUND TRUTH. THESE DATA SETS HAVE ONLY ONE ROADWAY WHERE LANE MARKINGS ARE

VISIBLE; HENCE, THE MEDIAN WIDTH CANNOT BE ESTIMATED. IN EACH OF THESE DATA SETS, THERE IS A KNOWN REAL-WORLD DISTANCE

BETWEEN CERTAIN MARKS. THIS DISTANCE AND THE ESTIMATION OF THIS DISTANCE USING THE ESTIMATED MODEL

ROAD-TO-IMAGE PARAMETERS IS GIVEN UNDER THE TEST LENGTH HEADING

TABLE IVRESULTS FOR DATA SETS 7–15. DATA SET 15 WAS A VIEW OF ONLY ONE ROADWAY; HENCE, THE MEDIAN WIDTH COULD NOT BE ESTIMATED

overlaid on the background using the estimated road modelparameters (see Fig. 10). From these images, along with theresults for the other sequences, we see that the estimated lanelines are accurately aligned with the actual lane lines. Thecalculated road-to-model parameters, as shown in Table IV,were used to measure 12.2-m lengths along each road (fromone dash to the next), yielding length errors from 0.2% to 5.4%.Since data set 15 has only oncoming lanes visible, nr = 0 wasagain used for this data set. Although the numbers for data set12 are unusually high, we verified from satellite imagery thatthe distance to the road is indeed more than 65 m and that thecamera is very much zoomed in (the exit lane, which is visibleon the left side of the image, begins about 500 m from thecamera); because of an incline in the roadway, this large zoomfactor results in an overestimation of the height, which is, inactuality, no more than 20 m above the road.

To analyze the performance of our calibration model over a larger portion of the road, six length segments were manually drawn on an image from one of the data sets (see Fig. 12). Using the parameters estimated by our algorithm, the length of each segment was estimated and compared with ground truth (12.2 m), with the errors recorded in Table V. Although the errors generally increase as the segments recede into the image, as one would expect, the errors remain small over a wide range of locations.

Fig. 12. Image from data set 7, with several different length segments marked (in cyan). (See Table V for the estimated lengths of each segment.)

TABLE V
LENGTH AND ERROR MEASURED USING THE ESTIMATED CALIBRATION PARAMETERS FOR THE SIX LENGTH SEGMENTS SHOWN IN FIG. 12. THE GROUND TRUTH LENGTH IS 12.2 m FOR ALL SEGMENTS

Additional verification of the algorithm's accuracy is obtained by checking on which side of the road the camera is located. Data sets 8, 10, and 14 have the camera on the right side of the road, whereas the other data sets have the camera on the left side. This is verified in the table, since data sets 8, 10, and 14 have negative values for both the pan angle θ and the distance to the median dm. In data set 15, the camera is on the left side of the road but looking slightly to the left, causing dm to be positive even when θ is negative.
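As a simple illustration of this consistency check, the snippet below encodes the sign convention described above; the helper is hypothetical and is not part of the calibration pipeline itself.

// Hypothetical sketch of the side-of-road check, using the sign convention
// described in the text: theta (pan angle) and dm (distance to the median)
// both negative indicates a camera on the right side of the road.
const char* cameraSide(double theta, double dm) {
  if (theta < 0.0 && dm < 0.0) return "right";
  return "left";  // includes left-side cameras looking slightly left (dm > 0)
}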

Although our calibration procedure worked well most of the time, errors were observed. The typical error encountered is an "off by one lane" error (see data sets 10 and 12 in Fig. 10). With this error, our method sometimes detects a lane where there is none, likely because the shoulders on these roadways are about the same width as a lane and the sharp gradients at the pavement-to-grass boundary cause spurious line detections. This, combined with the fact that inside lanes often do not have much activity, leads to the error.

The computational workload can be broken down into two parts: 1) background subtraction and directed activity map calculation, which occur on each frame, and 2) line detection and MCMC, which run once per data set. Using a C++ implementation on an AMD 2.8-GHz processor, the former part easily runs in real time (70 frames per second), whereas the latter takes an average of 7 s (depending on the data set, it ranges from 3 to 15 s). Although this computation time is acceptable in a real-world system, it could nevertheless be further improved by replacing the activity map, which is the current bottleneck in the system because it requires sufficient traffic volume, with an operation (e.g., texture analysis) that processes a single image.
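A structural sketch of this two-part split is shown below, using OpenCV's adaptive-GMM background subtractor as a stand-in for the model of [31]; the directed activity map and the MCMC fitting stage are indicated only by comments, since their details are specific to our implementation.

// Sketch of the two-part workload: per-frame processing (background
// subtraction and activity accumulation) versus once-per-data-set
// processing (line detection and MCMC road-model fitting).
#include <opencv2/opencv.hpp>

void calibrateFromVideo(cv::VideoCapture& cap) {
  cv::Ptr<cv::BackgroundSubtractor> bg = cv::createBackgroundSubtractorMOG2();
  cv::Mat frame, fgMask, activity;

  // Part 1: runs on every frame and must keep up with the video rate.
  while (cap.read(frame)) {
    bg->apply(frame, fgMask);  // adaptive-GMM background subtraction
    if (activity.empty()) activity = cv::Mat::zeros(fgMask.size(), CV_32F);
    cv::accumulate(fgMask, activity);  // crude (undirected) activity accumulation
    // ...per-blob direction estimates would turn this into a directed activity map...
  }

  // Part 2: runs once per data set, after enough traffic has been observed:
  // line detection on the background image, vanishing-point estimation, and
  // MCMC fitting of the road model to the detected lines and activity map.
}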

IX. CONCLUSION

We have described a novel method of automatic camera calibration that estimates all the road-to-image parameters needed to map from road coordinates to image coordinates, and vice versa. Our method consists of two parts: 1) bottom-up processing, which automatically finds the vanishing point, lines, and directed activity map, and 2) top-down processing, which fits a road model to the image using MCMC energy minimization, along with an enhanced autocorrelation to retrieve a known length. Our method performs well on both simulated and real data sets exhibiting a wide variety of camera positions and angles. The algorithm achieves less than 10% test length error on all the data sets for which ground truth is known and, in many cases, achieves much less error.

There are many ways our algorithm can be improved. In the line detection stage, the continuity of the lines could be enforced, since our RANSAC implementation currently detects many false positives in areas of texture. Similarly, the directed activity map could be improved by using more robust tracking methods or texture-based approaches. In the top-down processing, improvement could be achieved by allowing false negatives in line selection, along with color-based model evaluation. Additionally, approaches are needed that work at night, when lane lines are much harder to detect, or for roads that violate our straight, flat assumptions, such as roads with hills, curves, intersections, traffic circles, or roundabouts, where cars may not be moving in straight lines. An interesting avenue to explore would be to couple image processing techniques with the parameters obtained by querying a PTZ camera to produce more robust calibration.

REFERENCES

[1] E. K. Bas and J. D. Crisman, "An easy to install camera calibration for traffic monitoring," in Proc. IEEE Conf. Intell. Transp. Syst., 1997, pp. 362–366.
[2] M. Bertozzi and A. Broggi, "GOLD: A parallel real-time stereo vision system for generic obstacle and lane detection," IEEE Trans. Image Process., vol. 7, no. 1, pp. 62–81, Jan. 1998.
[3] J. F. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679–698, Nov. 1986.
[4] F. W. Cathey and D. J. Dailey, "One-parameter camera calibration for traffic management cameras," in Proc. 7th Int. IEEE Conf. Intell. Transp. Syst., 2004, pp. 865–869.
[5] D. J. Dailey, F. W. Cathey, and S. Pumrin, "An algorithm to estimate mean traffic speed using uncalibrated cameras," IEEE Trans. Intell. Transp. Syst., vol. 1, no. 2, pp. 98–107, Jun. 2000.
[6] R. Dong, B. Li, and Q.-M. Chen, "An automatic calibration method for PTZ camera in expressway monitoring system," in Proc. WRI World Congr. Comput. Sci. Inf. Eng., Apr. 2009, vol. 6, pp. 636–640.
[7] G. S. K. Fung, N. H. C. Yung, and G. K. H. Pang, "Vehicle shape approximation from motion for visual traffic surveillance," in Proc. IEEE Conf. Intell. Transp. Syst., Aug. 2001, pp. 608–613.
[8] G. S. K. Fung, N. H. C. Yung, and G. K. H. Pang, "Camera calibration from road lane markings," Opt. Eng., vol. 42, no. 10, pp. 2967–2977, Oct. 2003.
[9] S. Gupte, O. Masoud, R. F. K. Martin, and N. P. Papanikolopoulos, "Detection and classification of vehicles," IEEE Trans. Intell. Transp. Syst., vol. 3, no. 1, pp. 37–47, Mar. 2002.
[10] X. C. He and N. H. C. Yung, "New method for overcoming ill-conditioning in vanishing-point-based camera calibration," Opt. Eng., vol. 46, no. 3, pp. 037202-1–037202-12, Mar. 2007.
[11] M. Hödlmoser, B. Micusik, and M. Kampel, "Camera auto-calibration using pedestrians and zebra-crossings," in Proc. IEEE ICCV-VS, Nov. 2011, pp. 1697–1704.
[12] N. K. Kanhere and S. T. Birchfield, "Real-time incremental segmentation and tracking of vehicles at low camera angles using stable features," IEEE Trans. Intell. Transp. Syst., vol. 9, no. 1, pp. 148–160, Mar. 2008.
[13] N. K. Kanhere and S. T. Birchfield, "A taxonomy and analysis of camera calibration methods for traffic monitoring applications," IEEE Trans. Intell. Transp. Syst., vol. 11, no. 2, pp. 441–452, Jun. 2010.
[14] N. K. Kanhere, S. T. Birchfield, and W. A. Sarasua, "Automatic camera calibration using pattern detection for vision-based speed sensing," Transp. Res. Rec., J. Transp. Res. Board, vol. 2086, pp. 30–39, 2008.
[15] N. K. Kanhere, S. T. Birchfield, W. A. Sarasua, and T. C. Whitney, "Real-time detection and tracking of vehicle base fronts for measuring traffic counts and speeds on highways," Transp. Res. Rec., J. Transp. Res. Board, vol. 1993, pp. 155–164, 2007.
[16] A. H. S. Lai and N. H. C. Yung, "Lane detection by orientation and length discrimination," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 30, no. 4, pp. 539–548, Aug. 2000.
[17] Y. Li, F. Zhu, Y. Ai, and F.-Y. Wang, "On automatic and dynamic camera calibration based on traffic visual surveillance," in Proc. IEEE Intell. Veh. Symp., Jun. 2007, pp. 358–363.
[18] C. C. C. Pang, W. W. L. Lam, and N. H. C. Yung, "A method for vehicle count in the presence of multiple-vehicle occlusions in traffic images," IEEE Trans. Intell. Transp. Syst., vol. 8, no. 3, pp. 441–459, Sep. 2007.
[19] T. N. Schoepflin and D. J. Dailey, "Algorithms for calibrating roadside traffic cameras and estimating mean vehicle speed," in Proc. IEEE Intell. Transp. Syst. Conf., Oct. 2007, pp. 277–283.
[20] T. N. Schoepflin and D. J. Dailey, "Dynamic camera calibration of roadside traffic management cameras for vehicle speed estimation," IEEE Trans. Intell. Transp. Syst., vol. 4, no. 2, pp. 90–98, Jun. 2003.
[21] K.-T. Song and J.-C. Tai, "Dynamic calibration of pan–tilt–zoom cameras for traffic monitoring," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 36, no. 5, pp. 1091–1103, Oct. 2006.
[22] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Trans. Speech Audio Process., vol. 8, no. 6, pp. 708–716, Nov. 2000.
[23] M. Trajkovic, "Interactive calibration of a PTZ camera for surveillance applications," in Proc. Asian Conf. Comput. Vis., 2002, pp. 1–8.
[24] R. Y. Tsai, "A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses," IEEE J. Robot. Autom., vol. RA-3, no. 4, pp. 323–344, Aug. 1987.
[25] Manual on Uniform Traffic Control Devices, U.S. Dept. Transp., Federal Highway Admin., Washington, DC, USA, 2009.
[26] B.-F. Wu, W.-H. Chen, C.-W. Chang, C.-C. Liu, and C.-J. Chen, "Dynamic CCD camera calibration for traffic monitoring and vehicle applications," in Proc. IEEE Int. Conf. SMC, Oct. 2007, pp. 1717–1722.



[27] W. Zhang, Q. Wu, G. Wang, and X. You, "Tracking and pairing vehicle headlight in night scenes," IEEE Trans. Intell. Transp. Syst., vol. 13, no. 1, pp. 140–153, Mar. 2012.
[28] Z. Zhang, M. Li, K. Huang, and T. Tan, "Practical camera auto-calibration based on object appearance and motion for traffic scene visual surveillance," in Proc. IEEE CVPR, 2008, pp. 1–8.
[29] Z. Zhang, "A flexible new technique for camera calibration," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 11, pp. 1330–1334, Nov. 2000.
[30] C. Zhaoxue and S. Pengfei, "Efficient method for camera calibration in traffic scenes," Electron. Lett., vol. 40, no. 6, pp. 368–369, Mar. 2004.
[31] Z. Zivkovic, "Improved adaptive Gaussian mixture model for background subtraction," in Proc. 17th ICPR, Aug. 2004, vol. 2, pp. 28–31.

Douglas N. Dawson received the B.S. degree in computer engineering from Clarkson University, Potsdam, NY, USA, in 2010. He is currently working toward the Ph.D. degree in computer engineering with Clemson University, Clemson, SC, USA.

His current research interests include computer vision, object tracking, and classification.

Stanley T. Birchfield (S'91–M'99–SM'06) received the B.S. degree in electrical engineering from Clemson University, Clemson, SC, USA, in 1993 and the M.S. and Ph.D. degrees from Stanford University, Stanford, CA, USA, in 1996 and 1999, respectively.

While at Stanford University, his research was supported by a National Science Foundation Graduate Research Fellowship, and he was part of the winning team of the Association for the Advancement of Artificial Intelligence Mobile Robotics Competition in 1994. From 1999 to 2003, he was a Research Engineer with Quindi Corporation, a startup company in Palo Alto, CA, USA. Since 2003, he has been with the Department of Electrical and Computer Engineering, Clemson University, where he is currently an Associate Professor. His research interests include visual correspondence, tracking, and segmentation, particularly applied to real-time systems.