
UNIVERSITY OF CALIFORNIA, SAN DIEGO

Video-based Car Surveillance: License Plate, Make, and Model Recognition

A thesis submitted in partial satisfaction of the

requirements for the degree Master of Science

in Computer Science

by

Louka Dlagnekov

Committee in charge:

Professor Serge J. Belongie, Chairperson
Professor David A. Meyer
Professor David J. Kriegman

2005


Copyright

Louka Dlagnekov, 2005

All rights reserved.


The thesis of Louka Dlagnekov is approved:

Chair

University of California, San Diego

2005



TABLE OF CONTENTS

Signature Page
Table of Contents
List of Figures
List of Tables
Acknowledgments
Abstract

I Introduction
  1.1. Problem Statement
  1.2. Social Impact
  1.3. Datasets
  1.4. Thesis Structure

II License Plate Detection
  2.1. Introduction
  2.2. Previous Work
  2.3. Feature Selection
  2.4. The AdaBoost Algorithm
  2.5. Optimizations
    2.5.1. Integral Images
    2.5.2. Cascaded Classifiers
  2.6. Results
    2.6.1. Datasets
    2.6.2. Results
  2.7. Future Work

III License Plate Recognition
  3.1. Tracking
  3.2. Super-Resolution
    3.2.1. Registration
    3.2.2. Point Spread Function
    3.2.3. Algorithm
    3.2.4. Maximum Likelihood Estimate
    3.2.5. Maximum a Posteriori Estimate
    3.2.6. Discussion
  3.3. Optical Character Recognition
    3.3.1. Previous Work
    3.3.2. Datasets
    3.3.3. Template Matching
    3.3.4. Other Methods
  3.4. Results

IV Make and Model Recognition
  4.1. Previous Work
  4.2. Datasets
  4.3. Appearance-based Methods
    4.3.1. Eigencars
  4.4. Feature-based Methods
    4.4.1. Feature Extraction
    4.4.2. Shape Contexts
    4.4.3. Shape Context Matching
    4.4.4. SIFT Matching
    4.4.5. Optimizations
  4.5. Summary of Results

V Conclusions and Future Work
  5.1. Conclusions
    5.1.1. Difficulties
  5.2. Future Work
    5.2.1. Color Inference
    5.2.2. Database Query Algorithm Development
    5.2.3. Make and Model 3-D Structure

Bibliography


LIST OF FIGURES

1.1 (a) A Dutch license plate and (b) a California license plate. Most cars in our datasets have plates of the form shown in (b), but at a much lower resolution.
1.2 A frame from the video stream of (a) the 'Regents' dataset and (b) the 'Gilman' dataset.
1.3 (a) 1,200 of 1,520 training examples for the 'Regents' dataset. (b) Same images variance normalized.
2.1 PCA on 1,520 license plate images. Note that about 70 components are required to capture 90% of the energy.
2.2 The means of the absolute value of the (a) x-derivative and (b) y-derivative, and the variance of the (c) x-derivative and (d) y-derivative.
2.3 Types of features selected by AdaBoost. The sum of values computed over colored regions is subtracted from the sum of values over non-colored regions.
2.4 Typical class conditional densities for weak classifier features. For some features there is clearly a large amount of error that cannot be avoided when making classifications; however, this error is much smaller than the 50% AdaBoost requires to be effective.
2.5 (a) The integral image acceleration structure. (b) The sum of the values in each rectangular region can be computed using just four array accesses.
2.6 A cascaded classifier. The early stages are very efficient and good at rejecting the majority of false windows.
2.7 The three sets of positive examples used in training the license plate detector – sets 1, 2, and 3, with resolutions of 71×16, 80×19, and 104×31, respectively.
2.8 ROC curves for a 5-stage cascade trained using 359 positive examples and three different choices of negative training examples.
2.9 ROC curves for (a) a single-stage, 123-feature detector and (b) a 6-stage cascaded detector with 2, 3, 6, 12, 40, and 60 features per stage, respectively. The sizes of the images trained on in sets 1, 2, and 3 are 71×16, 80×19, and 104×31, respectively. The x-axis scales in (a) and (b) were chosen to highlight the performance of the detector on each set.
2.10 Examples of regions incorrectly labeled as license plates in the set 3 test set.
2.11 Detection on an image from the Caltech Computer Vision group's car database.
3.1 A car tracked over 10 frames (1.7 seconds) with a blue line indicating the positions of the license plate in the tracker.
3.2 Our image formation model. The (a) full-resolution image H undergoes (b) a geometric transformation T_k followed by (c) a blur with a PSF h(u, v), is (d) sub-sampled by S, and finally (e) additive Gaussian noise η is inserted. The actual observed image L_k from our camera is shown in (f). The geometric transformation is exaggerated here for illustrative purposes only.
3.3 (a) The Huber penalty function used in the smoothness prior with α = 0.6, with red and blue corresponding to the regions |x| ≤ α and |x| > α, respectively; (b) an un-scaled version of the bi-modal prior with µ0 = 0.1 and µ1 = 0.9.
3.4 Super-resolution results: (a) sequence of images processed, (b) an up-sampled version of one low-resolution image, (c) the average image, (d) the final high-resolution estimate.
3.5 The alphabet created from the training set. There are 10 examples for each character for the low-resolution, average-image, and super-resolution classes, shown in that respective order.
3.6 Character frequencies across our training and test datasets.
3.7 Template matching OCR results on the low-resolution test set for 'standard' and 'loose' comparisons between recognized characters and actual characters.
3.8 Recognition results for the images in our test set. Each horizontal section lists plates whose read text contained 0, 1, 2, 3, 4, 5, 6, and 7 mistakes.
4.1 Our automatically generated car database. Each image is aligned such that the license plate is centered a third of the distance from bottom to top. Of these images, 1,102 were used as examples, and 38 were used as queries to test the recognition rates of various methods. We used the AndreaMosaic photo-mosaic software to construct this composite image.
4.2 (a) The average image, and (b) the first 10 eigencars.
4.3 The first 19 query images and the top 10 matches in the database for each, using all N eigencars.
4.4 The second 19 query images and the top 10 matches in the database for each, using all N eigencars.
4.5 The first 19 query images and the top 10 matches in the database for each, using N − 3 eigencars.
4.6 The second 19 query images and the top 10 matches in the database for each, using N − 3 eigencars.
4.7 Harris corner detections on a car image. Yellow markers indicate occlusion junctions, formed by the intersection of edges on surfaces of different depths.
4.8 Kadir and Brady salient feature extraction results on (a) a car image from our database and (b) an image of a leopard.
4.9 SIFT keypoints and their orientations for a car image.
4.10 (a) Query car image with two interest points shown, (b) database car image with one corresponding interest point shown, (c) diagram of log-polar bins used for computing shape context histograms, (d,e,f) shape context histograms for points marked 'B', 'C', and 'A', respectively. The x-axis represents θ and the y-axis represents log r, increasing from top to bottom.
4.11 (a) Image edges and (b) a random sampling of 400 points from the edges in (a).
4.12 Query images 1–10 and the top 10 matches in the database using SIFT matching. Yellow lines indicate correspondences between matched keypoints of the query (top) and database (bottom) images.
4.13 Query images 11–20 and the top 10 matches in the database using SIFT matching. Yellow lines indicate correspondences between matched keypoints of the query (top) and database (bottom) images.
4.14 Query images 21–29 and the top 10 matches in the database using SIFT matching. Yellow lines indicate correspondences between matched keypoints of the query (top) and database (bottom) images.
4.15 Query images 30–38 and the top 10 matches in the database using SIFT matching. Yellow lines indicate correspondences between matched keypoints of the query (top) and database (bottom) images.


LIST OF TABLES

2.1 Negative examples remaining during training at each stage of the cascade. The three training operations shown are (1) initial training with 10,052 randomly chosen negative examples, (2) first bootstrap training with an additional 4,974 negative examples taken from false positives, (3) second bootstrap operation with another 4,974 negative examples taken from false positives from the previous training stage.
4.1 Summary of overall recognition rates for each method.
4.2 Test set of queries used, with 'Size' indicating the number of cars similar to the query in the database and which method classified each query correctly.


ACKNOWLEDGEMENTS

I would like to thank the following people for helping make this thesis possible:

Serge Belongie, for being there every step of the way and always being available for consultation, even at three in the morning. David Meyer, for arranging funding and for ongoing consultation. David Kriegman, for very helpful initial guidance.

My family, for being understanding and supportive throughout my education. My best friend Brian, for many enlightening discussions and for proofreading drafts.

David Rose of the UCSD Police Department and Robert Meza of the Campus Loss Prevention Center, for providing access to car video data.

This work has been partially supported by DARPA under contract F49620-02-C-0010.


ABSTRACT OF THE THESIS

Video-based Car Surveillance: License Plate, Make, and Model Recognition

by

Louka Dlagnekov

Master of Science in Computer Science

University of California, San Diego, 2005

Professor Serge J. Belongie, Chair

License Plate Recognition (LPR) is a fairly well explored problem and is already a component of several commercially operational systems. Many of these systems, however, require sophisticated video capture hardware, possibly combined with infrared strobe lights, or exploit the large size of license plates in certain geographical regions and the (artificially) high discriminability of characters. One of the goals of this project is to develop an LPR system that achieves a high recognition rate without the need for a high-quality video signal from expensive hardware. We also explore the problem of car make and model recognition for the purpose of searching surveillance video archives for a partial license plate number combined with some visual description of a car. Our proposed methods will provide valuable situational information for law enforcement units in a variety of civil infrastructures.


Chapter I

Introduction

License plate recognition (LPR) is widely regarded to be a solved problem, the technology behind the London Congestion Charge program being a well-known example. In an effort to reduce traffic congestion in Central London, the city imposes a daily fee on motorists entering a specified zone [21]. In order to automate the enforcement of the fee, over two hundred closed-circuit television (CCTV) cameras are in operation whose video streams are processed by an LPR system. If a plate is found whose registered owner has not paid the fee, the owner is fined.

Other LPR systems are used by U.S. Customs for more efficient cross-checks in the National Crime Information Center (NCIC) and the Treasury Enforcement Communications System (TECS) for possible matches with criminal suspects [22]. The 407 ETR toll road in Ontario, Canada also uses LPR to fine motorists who do not carry a radio transponder and have not paid a toll fee. In the Netherlands, LPR systems are in place that are fully automated, from detecting speeding violations to reading the license plate and billing the registered owner.

All of these systems treat license plates as cars' fingerprints. In other words, they determine a vehicle's identity based solely on the plate attached to it. One can imagine, however, a circumstance where two plates from completely different make and model cars are swapped with malicious intent, in which case these systems would not find a problem. We as humans are also not very good at reading cars' license plates unless they are quite near us, nor are we very good at remembering all the characters. However, we are good at identifying and remembering the appearance of cars, and therefore their makes and models, even when they are speeding away from us. In fact, the first piece of information Amber Alert signs show is the car's make and model, and then its license plate number, sometimes not even a complete number. Therefore, given the description of a car and a partial license plate number, the authorities should be able to query their surveillance systems for similar vehicles and retrieve a timestamp of when that vehicle was last seen, along with archived video data for that time.

Despite the complementary nature of license plate and make and model information, to the best of our knowledge, make and model recognition is an unexplored problem. Various research has been done on detecting cars in satellite imagery and on detecting and tracking cars in video streams, but we are unaware of any work on the make and model recognition (MMR) aspect. Because of the benefits that could arise from the unification of LPR and MMR, we explore both problems in this thesis.

1.1 Problem Statement

Although few details are released to the public about the accuracy of commercially deployed LPR systems, it is known that they work well under controlled conditions and require high-resolution imaging hardware. Most of the academic research in this area also requires high-resolution images, or relies on geographically specific license plates and takes advantage of the large spacing between characters in those regions, and even of the special character features of commonly misread characters, as shown in Figure 1.1 (a). Although the majority of license plates in our datasets were Californian and of the form of Figure 1.1 (b), the difficulty of the recognition task is comparable to that of other United States plates. The image shown in Figure 1.1 (b) is of much higher resolution than the images in our datasets and is shown for illustrative purposes only.

Figure 1.1: (a) A Dutch license plate and (b) a California license plate. Most cars in our datasets have plates of the form shown in (b), but at a much lower resolution.

Our goal in this thesis is to design a car recognition system for surveillance purposes which, given low-resolution video data as input, is able to maintain a database of the license plate and make and model information of all cars observed, for the purposes of performing queries on license plates and makes and models. In this thesis, we do not explore algorithms for such queries, but our results in this project are an invaluable foundation for that task.

1.2 Social Impact

The use of any system that stores personally identifiable information should be strictly monitored for adherence to all applicable privacy laws. Our system is no exception. Since license plates can be used to personally identify individuals, queries to the surveillance database collected should only be performed by authorized users and only when necessary, such as in car theft or child abduction circumstances. Because our system is query-driven rather than alarm-driven, where by alarm-driven we mean the system issues an alert when a particular behavior is observed (such as running a red light), slippery-slope arguments toward a machine-operated automatic justice system do not apply here. The query-driven aspect also alleviates fears that such technology could be used to maximize state revenue rather than to promote safety.

Although there exists the possibility of abuse of our system, this possibility exists in other systems too, such as financial databases employed by banks and other institutions that hold records of persons' social security numbers. Even cell phone providers can determine a subscriber's location by measuring the distance between the phone and cell towers in the area. In the end, we feel the benefits of using our system far outweigh the potential negatives, and it should therefore be considered for deployment.

1.3 Datasets

We made use of two video data sources in developing and testing our LPR and MMR algorithms. We shall refer to them as the 'Regents' dataset and the 'Gilman' dataset. The video data in both datasets is captured from digital video cameras mounted on top of street lamp poles overlooking stop signs. Figure 1.2 shows a typical frame captured from each camera. These cameras, along with nearly 20 others, were set up in the Regents parking lots of UCSD as part of the RESCUE-ITR (Information Technology Research) program by the UCSD Police Department. The 'Regents' video stream has a resolution of 640 × 480 and is sampled at 10 frames per second, while the 'Gilman' video stream has a resolution of 720 × 480 and is sampled at 6 frames per second.

Figure 1.2: A frame from the video stream of (a) the 'Regents' dataset and (b) the 'Gilman' dataset.

Due to the different hardware and different spatial positions of the two cameras, the datasets have different characteristics. The camera in the 'Regents' dataset is mounted at a much greater distance from the stop sign and is set to its full optical zoom, while the 'Gilman' camera is much closer. The plates in the 'Regents' dataset are therefore much smaller, but exhibit less projective distortion as cars move through the intersection. On the other hand, the 'Gilman' camera is of higher quality, which, combined with the larger plate sizes, made for an easier character recognition task.

Since only about a thousand cars pass through both intersections in an 8-hour recorded period, some sort of automation was necessary to at least scan through the video stream to find frames containing cars. An application was written for this purpose, which searches frames for cars (using a crude but effective method of red color component thresholding, which catches cars' taillights) and facilitates the process of extracting training data by providing an interface for hand-clicking on points.

Using this process, over 1,500 training examples were extracted for the 'Regents' dataset, as shown in Figure 1.3(a). In the figure, time flows in raster scan order, such that the top left license plate image was captured at 8am and the bottom right at 4pm. Note the dark areas in the image – these are most likely a result of cloud cover, and this illumination change can be accounted for by variance normalizing the images, as shown in Figure 1.3(b). Although this variance normalization technique does improve the consistency of license plate examples, it had little effect on the overall results and was not used, so as to reduce unnecessary computation. However, we point it out as a reasonable solution to concerns that illumination differences may adversely affect recognition.
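Variance normalization amounts to subtracting a patch's mean and dividing by its standard deviation; the minimal sketch below illustrates the operation (the function name and the epsilon guard are ours, not from the thesis):

```python
import numpy as np

def variance_normalize(patch, eps=1e-8):
    # Zero-mean, unit-variance normalization of a grayscale patch;
    # compensates for global illumination changes such as the
    # cloud-cover darkening visible in Figure 1.3(a).
    patch = patch.astype(np.float64)
    return (patch - patch.mean()) / (patch.std() + eps)
```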

Unless otherwise indicated, all references to datasets shall refer to the 'Gilman' dataset.

Figure 1.3: (a) 1,200 of 1,520 training examples for the 'Regents' dataset. (b) Same images variance normalized.

1.4 Thesis Structure

Chapter 2 discusses the design and performance of a license plate detector trained in a boosting framework. In Chapter 3 we present several important pre-processing steps on detected license plate regions and describe a simple algorithm to perform optical character recognition (OCR). The problem of make and model recognition is explored in Chapter 4, where we evaluate several well-known and some state-of-the-art object recognition algorithms in this novel setting. We conclude the thesis in Chapter 5 and discuss ideas for future research on car recognition.


Chapter II

License Plate Detection

2.1 Introduction

In any object recognition system, there are two major problems that need to be solved – that of detecting an object in a scene and that of recognizing it, detection being an important prerequisite. In our system, the quality of the license plate detector is doubly important, since the make and model recognition subsystem uses the location of the license plate as a reference point when querying the car database. In this chapter we shall discuss our chosen detection mechanism.

The method we employ for detecting license plates can be described as follows. A window of interest, of roughly the dimensions of a license plate image, is placed over each frame of the video stream, and its image contents are passed as input to a classifier whose output is 1 if the window appears to contain a license plate and 0 otherwise. The window is then placed over all possible locations in the frame, and candidate license plate locations are recorded for which the classifier outputs a 1.

In reality, this classifier, which we shall call a strong classifier, weighs the decisions of many weak classifiers, each specialized for a different feature of license plates, thereby making a much more accurate decision. This strong classifier is trained using the AdaBoost algorithm. Over several rounds, AdaBoost selects the best performing weak classifier from a set of weak classifiers, each acting on a single feature. The AdaBoost algorithm is discussed in detail in Section 2.4.

Scanning every possible location of every frame would be very slow were it not for two key optimization techniques introduced by Viola and Jones – integral images and cascaded classifiers [49]. The integral image technique allows for an efficient implementation, and the cascaded classifiers greatly speed up the detection process, as not all classifiers need be evaluated to rule out most non-license-plate sub-regions. With these optimizations in place, the system was able to process 10 frames per second at a resolution of 640 × 480 pixels. The optimizations are discussed in Section 2.5.

Since the size of a license plate image can vary significantly with the distance from the car to the camera, using a fixed-size window of interest is impractical. Window-based detection mechanisms often scan a fixed-size window over a pyramid of image scales. Instead, we used three different sizes of windows, each having a custom-trained strong classifier for that scale.

2.2 Previous Work

Most LPR systems employ detection methods such as corner template matching [20] and Hough transforms [26] [51] combined with various histogram-based methods. Kim et al. [28] take advantage of the color and texture of Korean license plates (white characters on a green background, for instance) and train a Support Vector Machine (SVM) to perform detection. Their license plate images range in size from 79 × 38 to 390 × 185 pixels, and they report processing low-resolution input images (320 × 240) in over 12 seconds on a Pentium III 800 MHz, with a 97.4% detection rate and a 9.4% false positive rate. Simpler methods, such as adaptive binarization of an entire input image followed by character localization, also appear to work, as shown by Naito et al. [36] and [5], but are used in settings with little background clutter and are most likely not very robust.

Since license plates contain a form of text, we decided to approach the detection task as a text extraction problem. Of particular interest to us was the work done by Chen and Yuille on extracting text from street scenes for reading for the blind [10]. Their work, based on the efficient object detection work by Viola and Jones [49], uses boosting to train a strong classifier with a good detection rate and a very low false positive rate. We found that this text detection framework also works well for license plate detection.

2.3 Feature Selection

The goal of this section is to find good features in the image contents of the window of interest, one for each weak classifier. The features to which the weak classifiers respond are important in terms of overall accuracy and should be chosen to discriminate well between license plates and non-license plates.

Viola and Jones use Haar-like features, where sums of pixel intensities are computed over rectangular sub-windows [49]. Chen and Yuille argue that, while this technique may be useful for face detection, text has little in common with faces [10]. To support their assumption, they perform principal component analysis (PCA) on their training examples and find that about 150 components are necessary to capture 90 percent of the variance, whereas in typical face datasets only a handful would be necessary. To investigate whether this was the case with license plates, a similar plot was constructed, shown in Figure 2.1. Unlike the text of various fonts and orientations with which Chen and Yuille were working, license plates require far fewer components to capture most of the variance. However, an eigenface-based approach [48] to classification yielded very unsatisfactory results and is extremely expensive to compute over many search windows. Fisherface-based classification [3], which is designed to maximize the ratio of between-class scatter to within-class scatter, also yielded unsatisfactory results.

Figure 2.1: PCA on 1,520 license plate images. Note that about 70 components are required to capture 90% of the energy.

It is desirable to select features that produce similar results on all license plate images and are good at discriminating between license plates and non-license plates. After pre-scaling all training examples in the 'Regents' dataset to the same 45 × 15 size and aligning them, the sums of the absolute values of their x- and y-derivatives exhibit the pattern shown in Figure 2.2. The locations of the 7 digits of a California license plate are clearly visible in the y-derivative and y-derivative variance. Although the x-derivative and x-derivative variance show the form which Chen and Yuille report for text images, the y-derivative and y-derivative variance are quite different and yield a wealth of information.

Figure 2.2: The means of the absolute value of the (a) x-derivative and (b) y-derivative, and the variance of the (c) x-derivative and (d) y-derivative.

A total of 2,400 features were generated as input to the AdaBoost algorithm. These were a variation of the Haar-like features used by Viola and Jones [49], but more generalized, yet still computationally simple. A scanning window was evenly divided into between 2 and 7 regions of equal size, either horizontal or vertical. Each feature was then a variation on the sum of values computed in one set of the regions subtracted from the sum of values in the remaining set of regions. Each feature therefore applied a thresholding function to a scalar value. Some of these features are shown in Figure 2.3.

Figure 2.3: Types of features selected by AdaBoost (x-derivative, y-derivative, x-derivative variance, and y-derivative variance). The sum of values computed over colored regions is subtracted from the sum of values over non-colored regions.
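The enumeration of these strip features can be sketched as follows. The exact subset scheme that yields 2,400 features is not spelled out above, so the combination choice here over-generates and is illustrative only:

```python
from itertools import combinations

def strip_features(n_min=2, n_max=7):
    # Each feature descriptor: (orientation, number of equal strips,
    # indices of the strips whose sums are added; the remaining strips
    # are subtracted).  The same descriptors are applied to mean
    # intensities, derivatives, and derivative variances.
    features = []
    for orientation in ('horizontal', 'vertical'):
        for n in range(n_min, n_max + 1):
            for k in range(1, n):
                for added in combinations(range(n), k):
                    features.append((orientation, n, added))
    return features
```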

The values of the regions of each window were the means of pixel intensities, derivatives, or variances of derivatives. None of the features actually selected by AdaBoost used raw pixel intensities, however, probably because of their poor discriminating ability with respect to wide illumination differences. Each weak classifier was a Bayes classifier, trained on a single feature by forming class conditional densities (CCDs) from the training examples. The CCD for a typical weak classifier is shown in Figure 2.4. When making a decision, regions where the license plate CCD is larger than the non-license-plate CCD are classified as license plate, and vice versa, instead of using a simple one-dimensional threshold.

Figure 2.4: Typical class conditional densities for weak classifier features. For some features there is clearly a large amount of error that cannot be avoided when making classifications; however, this error is much smaller than the 50% AdaBoost requires to be effective.
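A histogram-based sketch of such a weak classifier follows; the bin count and the fit/predict interface are our assumptions, not details from the thesis:

```python
import numpy as np

class BayesWeakClassifier:
    # Weak classifier over one scalar feature: build (weighted) class
    # conditional densities as histograms, then classify each value by
    # whichever class has the higher density in its bin.
    def __init__(self, n_bins=32):
        self.n_bins = n_bins

    def fit(self, values, labels, weights):
        self.edges = np.linspace(values.min(), values.max(), self.n_bins + 1)
        pos, neg = labels == 1, labels == 0
        self.ccd_pos, _ = np.histogram(values[pos], self.edges, weights=weights[pos])
        self.ccd_neg, _ = np.histogram(values[neg], self.edges, weights=weights[neg])
        return self

    def predict(self, values):
        bins = np.clip(np.digitize(values, self.edges) - 1, 0, self.n_bins - 1)
        return (self.ccd_pos[bins] > self.ccd_neg[bins]).astype(int)
```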

Although the features we have described are rather primitive and inflexible, in the sense that they are not able to respond to discontinuities other than vertical and horizontal ones, they lend themselves nicely to the optimization techniques discussed in Section 2.5. Steerable filters, Gabor filters, or other wavelet-based approaches are more general, but would be slower to compute.


2.4 The AdaBoost Algorithm

AdaBoost is a widely used instance of boosting algorithms. The term boosting refers to the process of strengthening a collection of weak learning algorithms to create a strong learning algorithm. It was developed by Schapire [40] in 1990, who showed that any weak learning algorithm could be transformed, or "boosted", into a strong learning algorithm. A more efficient version of the algorithm outlined by Schapire was later presented by Freund [16], called "boost-by-majority", and in 1995 Schapire and Freund developed AdaBoost [17], "Ada" standing for "adaptive", since it adjusts adaptively to the errors observed in the weak learners.

The idea of boosting can be explained with the following example. Consider the problem of classifying email messages into junk email and regular email by examining messages' keywords. An example of a keyword we may tend to see often in junk email is "click here", and we can classify messages as junk if they contain the keyword. Although this may work for many junk emails, it will almost certainly also lead to many legitimate messages being misclassified. Classifying solely based on the "click here" keyword is a good rule of thumb, but it is rather coarse. A better approach would be to find several of these rough rules of thumb and take advantage of boosting to combine them.

In its original form, AdaBoost is used to boost the classification accuracy of a single classifier, such as a perceptron, by combining a set of classification functions to form a strong classifier. As applied to this project, AdaBoost is used to select a combination of weak classifiers to form a strong classifier. The weak classifiers are called weak because they only need to be correct just over 50% of the time.

At the start of training, each training example $(x_1, y_1), \ldots, (x_n, y_n)$ is assigned a weight $w_i = \frac{1}{2m}$ for negatives and $w_i = \frac{1}{2l}$ for positives, where the $x_i$ are positive and negative inputs, $y_i \in \{0, 1\}$, $m$ is the number of negatives, and $l$ is the number of positives. The uneven initial distribution of weights leads to the name "Asymmetric AdaBoost" for this boosting technique.

Then, for $t = 1, \ldots, T$ rounds, each weak classifier $h_j$ is trained and its error is computed as $\varepsilon_t = \sum_i w_i \, |h_j(x_i) - y_i|$. The $h_j$ with the lowest error is selected, and the weights are updated according to

$$w_{t+1,i} = w_{t,i} \left( \frac{\varepsilon_t}{1 - \varepsilon_t} \right)$$

if $x_i$ is classified correctly, and are not modified if it is classified incorrectly. This essentially forces the weak classifiers to concentrate on "harder" examples that are most often misclassified. We implemented the weighting process in our Bayes classifiers by scaling the values used to build the CCDs.

After $T$ rounds, $T$ weak classifiers have been selected, and the strong classifier makes classifications according to

$$h(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) \geq \tau \sum_{t=1}^{T} \alpha_t \\ 0 & \text{otherwise,} \end{cases} \qquad (2.1)$$

where $\alpha_t = \ln\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right)$ and $\tau$ is set to $\frac{1}{2}$ to minimize the error.

Schapire and Freund showed that the overall error of the boosted classifier is bounded by a quantity that decreases exponentially in $T$.
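The training loop just described can be condensed into the following sketch, reusing the weak classifier interface from Section 2.3; the one-candidate-per-feature-column arrangement and the `make_weak` factory are our assumptions:

```python
import numpy as np

def adaboost_train(X, y, make_weak, T):
    # X: (n, d) array of feature values; y in {0, 1}.
    # make_weak() returns a fresh weak classifier with fit/predict.
    n, d = X.shape
    l, m = int((y == 1).sum()), int((y == 0).sum())
    w = np.where(y == 1, 1.0 / (2 * l), 1.0 / (2 * m))  # asymmetric init
    chosen, alphas = [], []
    for t in range(T):
        w = w / w.sum()
        best = None
        for j in range(d):                       # try every feature
            h = make_weak().fit(X[:, j], y, w)
            eps = float(np.sum(w * np.abs(h.predict(X[:, j]) - y)))
            if best is None or eps < best[0]:
                best = (eps, j, h)
        eps, j, h = best
        beta = max(eps, 1e-10) / (1.0 - eps)
        correct = h.predict(X[:, j]) == y
        w = np.where(correct, w * beta, w)       # down-weight easy examples
        chosen.append((j, h))
        alphas.append(np.log(1.0 / beta))        # alpha_t = ln((1-eps)/eps)
    return chosen, np.array(alphas)

def strong_classify(x, chosen, alphas, tau=0.5):
    # Equation (2.1): weighted vote of the T selected weak classifiers.
    votes = sum(a * h.predict(x[j:j + 1])[0] for (j, h), a in zip(chosen, alphas))
    return int(votes >= tau * alphas.sum())
```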

2.5 Optimizations

In this section we discuss two key optimization techniques introduced by Viola and Jones [49], which allowed us to achieve very fast detection rates – 10 frames per second on 640 × 480 frames.

2.5.1 Integral Images

The features described in Section 2.3 add values over one group of sections and subtract them from another group of sections. If these sections are $m \times n$ pixels in size, we would normally require $mn$ array accesses. However, if we take advantage of their rectangular nature, we can reduce the accesses to four, regardless of the size of the section, using an integral image data structure.

Figure 2.5: (a) The integral image acceleration structure. (b) The sum of the values in each rectangular region can be computed using just four array accesses.

An integral image $I'$ of an image $I$ has the same dimensions as $I$ and at each location $(x, y)$ contains the sum of all the pixels in $I$ above and to the left of the pixel $(x, y)$:

$$I'(x, y) = \sum_{x' \leq x, \, y' \leq y} I(x', y').$$

With this structure in place, the sum of the pixel values in region $D$ in Figure 2.5 (b) can be computed as

$$D = I'(w) + I'(z) - (I'(x) + I'(y)).$$

The integral image itself can be efficiently computed in a single pass over the image using the following recurrences:

$$r(x, y) = r(x - 1, y) + I(x, y)$$
$$I'(x, y) = I'(x, y - 1) + r(x, y),$$

where $r(-1, y)$ and $I'(x, -1)$ are defined to be 0.

For the images on which we trained and classified, we created integral images for raw pixel values, x-derivatives, and y-derivatives, as well as integral images for the squares of these three types of values. The integral image of squared values is useful for quickly computing the variance of the values in the sections of our features, since the variance can be computed as

$$\sigma^2 = \frac{1}{N} \sum x^2 - m^2,$$

where $m$ is the mean and the sum ranges over the $N$ values $x$ in the section.
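Both the four-access region sum and the variance computation can be written compactly; the sketch below handles borders with explicit checks rather than the padded row and column one would use in practice:

```python
import numpy as np

def integral_image(img):
    # I'(x, y): cumulative sum over rows and columns.
    return img.astype(np.float64).cumsum(axis=0).cumsum(axis=1)

def region_sum(ii, y0, x0, y1, x1):
    # Sum of img[y0:y1, x0:x1] from at most four array accesses.
    s = ii[y1 - 1, x1 - 1]
    if y0 > 0:
        s -= ii[y0 - 1, x1 - 1]
    if x0 > 0:
        s -= ii[y1 - 1, x0 - 1]
    if y0 > 0 and x0 > 0:
        s += ii[y0 - 1, x0 - 1]
    return s

def region_variance(ii, ii_sq, y0, x0, y1, x1):
    # sigma^2 = (1/N) * sum(x^2) - m^2, using the integral images of
    # the raw and squared values.
    n = (y1 - y0) * (x1 - x0)
    mean = region_sum(ii, y0, x0, y1, x1) / n
    return region_sum(ii_sq, y0, x0, y1, x1) / n - mean ** 2
```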

2.5.2 Cascaded Classifiers

At any given time, there are at most a handful of license plates visible in a frame of video, yet on the order of $(640 - 100) \times (480 - 30) \approx 200{,}000$ window positions require scanning, assuming a license plate image is 100 × 30 pixels. The number of regions to be classified as not containing a license plate clearly far exceeds the number that do. Luckily, it is not necessary to employ all classifiers selected by AdaBoost at each window position.

The idea behind a cascaded classifier is to group the classifiers into several stages in order of increasing complexity, with the hope that the majority of regions can be rejected quickly by very few classifiers. Such a cascaded structure is depicted in Figure 2.6. Although a positive instance will pass through all stages of the cascade, this is a very rare event, and the cost is amortized.

Figure 2.6: A cascaded classifier. The early stages are very efficient and good at rejecting the majority of false windows.
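Evaluating a window against such a cascade is a short loop; the stage representation below (lists of weak classifiers with their weights and a per-stage threshold) is our sketch of the idea, not the exact implementation:

```python
def cascade_classify(window, stages):
    # stages: list of (weak_classifiers, alphas, tau) triples, ordered
    # from cheapest to most expensive.
    for classifiers, alphas, tau in stages:
        score = sum(a * h(window) for h, a in zip(classifiers, alphas))
        if score < tau * sum(alphas):
            return 0   # rejected early; later stages never run
    return 1           # survived every stage: candidate license plate
```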

Training the cascade is done stage by stage: the first stage is trained on all positive and negative examples, the second stage is trained on all positive examples and only the false positives of the first stage used as negative examples, and so on for the remaining stages. The justification for this selection of negative examples is that when the cascade is in operation, there are many window instances which the latter stages will never be asked to classify, since the early stages will have rejected them; therefore, the training of the latter stages should reflect the type of data those stages will see in practice. Usually, the largest percentage of negative examples will be rejected in the first two stages, so the rest of the stages in the cascade train on "harder" examples, and thus have much higher false positive rates than the early stages and as a result require more classifiers.

By increasing the $\tau$ threshold in Equation (2.1), which is designed to yield a low error on the training data, we can decrease the false positive rate at the expense of a decrease in the detection rate. This adjustment allows us to generate the receiver operating characteristic (ROC) curves shown in the next section, and it also allows us to design the cascade with a desirable detection rate at each stage. Since the overall detection rate is given by

$$N = \prod_{i=1}^{K} n_i,$$

where $n_i$ is the detection rate of each stage in the cascade and $K$ is the number of stages, if we desire a 90% overall detection rate and $K = 10$, we would require each $n_i$ to be 99%, since $0.99^{10} \approx 0.90$. The 99% per-stage detection rate can easily be achieved by decreasing the $\tau$ threshold in Equation (2.1), even at the expense of a high false positive rate at each stage. The overall false positive rate is given by

$$P = \prod_{i=1}^{K} p_i,$$

where $p_i$ is the false positive rate of each stage. Even a high false positive rate of 40% at each stage would equate to an overall false positive rate of only 0.01%, since $0.40^{10} \approx 0.0001$.

The design of a good cascade is not trivial. Viola and Jones present a simple algorithm that determines the number of features to be used at each stage by selecting a desired false negative and false positive rate [49]; however, it assumes that each feature is of equal computational complexity. In our case, and in Chen and Yuille's [10] cascaded classifier, this assumption does not hold. In principle, one could design an algorithm to evaluate the time complexity of each feature type and choose how many features, and of what type, should be placed in each of the $K$ stages in order to minimize the overall running time of the classifier. Unfortunately, this is a very difficult problem. In practice, however, one can design a reasonably good cascade using the guiding principle that efficient features should be evaluated near the front of the cascade and more computationally expensive features should be evaluated near the end.

In our chosen cascaded classifier, we did not allow AdaBoost to select variance-based features for the first stage, since we wanted it to be very efficient at eliminating a large portion of window locations early on. We should also mention that not only is detection fast in a cascaded classifier, but so is its training. Since each stage eliminates a large number of negative examples, the latter stages train on a much smaller set of examples. A 123-feature single-stage classifier takes 18 hours for full training with two bootstrap operations, whereas a 6-stage classifier with the same total number of features takes 5 hours.

2.6 Results

In this section we present our results on the ‘Gilman’ dataset.

2.6.1 Datasets

Unlike in our 'Regents' dataset, the camera in the 'Gilman' dataset was mounted much closer to the intersection, which resulted in greater projective distortion of the license plate as each car progresses through the intersection. We investigated training our license plate detector on plate images of a single scale and performing detection on a pyramid of scales for each frame, but found that the detection rate was not as good as having dedicated detectors trained on several scales. Therefore, the final training and test datasets were created by sampling three images of each car as it approached, entered, and exited the intersection, for 419 cars over several hours of video. The plates were then manually extracted from these images and split into three sets of small, medium, and large area. This provided 359 training images and 60 test images for each of the three sets. The average size of a plate in each set was 71 × 16, 80 × 19, and 104 × 31, respectively. The images in each set are shown in Figure 2.7. To allow for an easier method of extracting negative examples for training, and to test our detector, we ensured that each of the 419 frames sampled for each set contained at most one visible license plate.

Figure 2.7: The three sets of positive examples used in training the license plate detector – sets 1, 2, and 3, with resolutions of 71 × 16, 80 × 19, and 104 × 31, respectively.

Figure 2.8: ROC curves for a 5-stage cascade trained using 359 positive examples and three different choices of negative training examples.

We generated additional positive training examples for each set by extracting images at 10 random offsets (up to 1/8 of the width and 1/4 of the height of a license plate) around each license plate location (for a total of 3,590), all of the same size as the average license plate size for that set. We found that this yielded better results than using the license plate location as a single positive example per hand-labeled region. Of course, when the detector was in operation, it fired at many regions around a license plate, which we in fact used as an indication of the quality of a detection.

To generate negative examples, we picked 28 license-plate-sized images from random regions known not to contain license plates in each positive frame, which resulted in 10,052 per set. We then applied a sequence of two bootstrap operations in which false positives obtained from testing on the training data were used as additional negative examples for re-training the cascade. We found that two sequential bootstrap operations of 4,974 negative examples each were more effective than a single bootstrap operation with 9,948 negative examples. A comparison of these two methods is given in Figure 2.8.
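In pseudocode form, the bootstrap procedure looks like the sketch below; `train_cascade`, `scan`, and `is_plate_region` are assumed helpers standing in for the machinery described above, and the round sizes mirror the two 4,974-example operations:

```python
def bootstrap_train(positives, negatives, frames,
                    train_cascade, scan, is_plate_region,
                    rounds=2, per_round=4974):
    # Retrain the cascade on its own false positives ("hard" negatives).
    cascade = train_cascade(positives, negatives)
    for _ in range(rounds):
        false_pos = [w for frame in frames for w in scan(cascade, frame)
                     if not is_plate_region(frame, w)][:per_round]
        negatives = negatives + false_pos
        cascade = train_cascade(positives, negatives)
    return cascade
```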

2.6.2 Results

Figure 2.9 shows receiver operating characteristic (ROC) curves for our cascaded detector and for a single-stage detector with the same number of features. There appears to be a trend indicating that a larger set (in terms of image size) is learned better than a smaller set. This is most likely due to the detector having access to more information content per image, and as a result being able to better discriminate between license plates and non-license plates. In fact, when our detector was trained on the 'Regents' dataset, where plate sizes were on average only 45 × 15 pixels, the detection rates were much lower even though more training examples were used. The ROC improvement for the resolution increase between sets 1 and 2 does not appear for the single-stage detector, most likely because the increase is not large.

Table 2.1 shows the number of negative examples remaining at each stage of the cascade during the three training operations. Stages using the same number of negative examples as the previous stage indicate that the desired detection rate of 99.5% could not be maintained at the previous stage, and the τ threshold of Equation (2.1) was left unchanged. Note that with each bootstrap operation, the number of negative examples that enter the last stage of the cascade increases much more quickly than the linear increase in negative examples, because the false positives represent 'harder' examples.

As was to be expected, the cascaded classifier was much faster in operation, requiring about 100 ms per frame, whereas the single-stage classifier required over 3 seconds but exhibited a superior ROC curve.

Figure 2.10 shows a few examples of regions that our detector incorrectly labeled as license plates in our test dataset. Perhaps not surprisingly, a large number of them are text from advertising on city buses or the UCSD shuttle. Those that contain taillights can easily be pruned by applying a color threshold.

Figure 2.9: ROC curves for (a) a single-stage, 123-feature detector and (b) a 6-stage cascaded detector with 2, 3, 6, 12, 40, and 60 features per stage, respectively. The sizes of the images trained on in sets 1, 2, and 3 are 71 × 16, 80 × 19, and 104 × 31, respectively. The x-axis scales in (a) and (b) were chosen to highlight the performance of the detector on each set.

Table 2.1: Negative examples remaining during training at each stage of the cascade. The three training operations shown are (1) initial training with 10,052 randomly chosen negative examples, (2) first bootstrap training with an additional 4,974 negative examples taken from false positives, (3) second bootstrap operation with another 4,974 negative examples taken from false positives from the previous training stage.

Stage             1       2      3      4      5     6   Remaining
# of Features     2       3      6     12     40    60
(1)          10,052   1,295  1,295    537    207     0           0
(2)          15,026   4,532  4,532  2,217    861   152           0
(3)          20,000  20,000  8,499  5,582  2,320   552          14

Figure 2.10: Examples of regions incorrectly labeled as license plates in the set 3 test set.

We also applied our license plate detector to a few car images from the

Caltech Computer Vision group's car database, whose image quality is far better than that of the video cameras used to create our datasets, and we found that many license

plates were detected correctly, at the expense of a high number of false positives

due to vegetation, for which our detector was not given negative examples. These

could easily be pruned as well simply by applying yet another color threshold.

Figure 2.11 shows the output of our detector on one of these images.

We did not achieve as low a false positive rate at a given detection rate on our datasets as either Chen and Yuille or Viola and Jones, but the false positive rate of 0.002% at a detection rate of 96.67% in set 3 is quite tolerable. In practice, the number of false positives per region of each frame is small compared to the number of

detections around a license plate in the frame. Therefore, in our final detector we


Figure 2.11: Detection on an image from the Caltech Computer Vision group’s car

database.

do not consider a region to contain a license plate unless the number of detections

in the region is above a threshold.

2.7 Future Work

It would be advantageous to investigate other types of features to place

in the latter stages of the cascade in order to reduce the false positive rate. Color-

based discrimination would be especially useful, since most plates contain a bi-

modal color distribution of a white background and black or dark blue text. Other

features mentioned by Chen and Yuille [10] such as histogram tests and edge-

linking were not tried here but should be evaluated for their performance in a license plate detection setting.


Chapter III

License Plate Recognition

In this chapter, we present a process to recognize the characters on de-

tected license plates. We begin by describing a method for tracking license plates

over time and how this can provide multiple samplings of each license plate for

the purposes of enhancing it for higher quality character recognition. We then

describe our optical character recognition (OCR) algorithm and present our recog-

nition rates.

3.1 Tracking

More often than not, the false positive detections from our license plate

detector were erratic, and, if on the car body, their positions were not temporally

consistent. We use this fact to our advantage by tracking candidate license plate

regions over as many frames as possible. Then, only those regions with a smooth

trajectory are deemed valid. The tracking of license plates also yields a sequence

of samplings of the license plate, which are used as input to a super-resolution

pre-processing step before OCR is performed on them.

Numerous tracking algorithms exist that could be applied to our problem.

Perhaps the most well-known and popular is the Kanade-Lucas-Tomasi (KLT)

tracker [45]. The KLT tracker makes use of a Harris corner detector to detect

good features to track in a region of interest (our license plate) and measures the


similarity of every frame to the first allowing for an affine transformation. Sullivan

et al. [47] make use of a still camera for the purposes of tracking vehicles by defining

regions of interest (ROI) chosen to span individual lanes. They initiate tracking

when a certain edge characteristic is observed in the ROI and make predictions on

future positions of vehicles. Those tracks with a majority of accurate predictions

are deemed valid. Okuma et al. [38] use the Viola and Jones [49] framework to

detect hockey players and then apply a mixture particle filter using the detections

as hypotheses to keep track of the players.

Although each of these tracking methods would probably have worked

well in our application, we chose a far simpler approach which worked well in

practice. Because detecting license plates is efficient, we simply run our detector

on each frame and for each detected plate we determine whether that detection is

a new plate or an instance of a plate already being tracked. To determine whether

a detected plate is new or not, the following conditions are checked:

• the plate is within T pixels of an existing tracker

• the plate is within T ′ pixels of an existing tracker and the plate is within θ

degrees of the general direction of motion of the plates in the tracker’s history

If either condition holds, the plate is added to the corresponding tracker; otherwise,

a new tracker is created for that plate. In our application T ′ was an order of

magnitude larger than T . Figure 3.1 shows the tracking algorithm in action.
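The association test is simple enough to sketch in a few lines. The following is a minimal Python sketch, not the implementation used in this thesis; the tracker representation, function name, and constants are illustrative assumptions (the thesis states only that T′ was an order of magnitude larger than T).

```python
import math

def assign_to_tracker(plate_xy, trackers, T=8.0, T_prime=80.0, theta_deg=20.0):
    """Return the tracker a detected plate belongs to, or None if it is new.
    Implements the two association rules above; all constants illustrative."""
    for tracker in trackers:  # each tracker: {'positions': [(x, y), ...]}
        last = tracker['positions'][-1]
        dist = math.hypot(plate_xy[0] - last[0], plate_xy[1] - last[1])
        if dist <= T:
            return tracker  # rule 1: within T pixels of the tracker
        if dist <= T_prime and len(tracker['positions']) >= 2:
            # rule 2: within T' pixels and within theta degrees of the
            # general direction of motion in the tracker's history
            p0 = tracker['positions'][0]
            motion = math.atan2(last[1] - p0[1], last[0] - p0[0])
            step = math.atan2(plate_xy[1] - last[1], plate_xy[0] - last[0])
            diff = abs((step - motion + math.pi) % (2 * math.pi) - math.pi)
            if math.degrees(diff) <= theta_deg:
                return tracker
    return None  # caller creates a new tracker for this plate
```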

Our tracking algorithm was also useful for discarding false positives from

the license plate detector. The erratic motion of erroneous detections usually re-

sulted in the initiation of several trackers each of which stored few image sequences.

Image sequences of 5 frames or fewer were discarded.

3.2 Super-Resolution

Video sequences such as the ones obtained from our cameras provide

multiple samplings of the same surface in the physical world. These multiple


Figure 3.1: A car tracked over 10 frames (1.7 seconds) with a blue line indicating

the positions of the license plate in the tracker.

samples can sometimes be used to extract higher-resolution images than any of

the individual samples. The process of extracting a single high-resolution image

from a set of lower-resolution images is called super-resolution. Super-resolution

is different from what is known as image restoration where a higher-resolution

image is obtained from a single image, a process also sometimes referred to as

enhancement.

The investigation into super-resolution was inspired by the low-resolution

license plate images in our ‘Regents’ dataset. In that dataset, the noisy and blurry

45 × 15 pixel license plate images made it very difficult to read the text on the

plates.

Before we describe the super-resolution algorithm we shall use, we first describe our assumed image formation model. A plane in the scene undergoes a


Figure 3.2: Our image formation model. The (a) full-resolution image H undergoes

(b) a geometric transformation Tk followed by (c) a blur with a PSF h(u, v); is (d)

sub-sampled by S, and finally (e) additive Gaussian noise η is inserted. The actual

observed image Lk from our camera is shown in (f). The geometric transformation

is exaggerated here for illustrative purposes only.

geometric transformation that maps its world coordinates to those of the camera.

The optics of the camera blur the resulting projection at which point the cam-

era samples it at the low-resolution we observe. Because of imperfections in the

sampling device, noise is introduced, which we shall assume to be spatially uncor-

related, additive, and Gaussian-distributed with zero-mean and constant variance.

Expressed in more formal terms, the imaging process is:

L_k(x, y) = S\downarrow \left( h(x, y) * H(T_k(x, y)) \right) + \eta(x, y), \quad (3.1)

with the following notation:

L_k – kth observed low-resolution image
S\downarrow – down-sampling operator by a factor of S
h – point spread function (PSF)
* – convolution operator
H – high-resolution image
T_k – geometric transformation
\eta – additive noise

This image formation process is illustrated in Figure 3.2. Note that the actual

observed image in Figure 3.2 (f) appears to have a further blurring effect after the

additive noise step when compared to Figure 3.2 (e). This could be due to a slight


motion-blur, which is not taken into account by our model.

The goal of a super-resolution algorithm is to find H given each observed

Lk. The sub-sampling factor S is usually chosen to be 2 or 4, and the estimation

of Tk and h(x, y) are discussed in sections 3.2.1 and 3.2.2 respectively.
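For concreteness, the forward model of Equation (3.1) can be simulated directly. The following Python sketch assumes a purely translational T_k (as in our registration, Section 3.2.1) and an isotropic Gaussian PSF (Section 3.2.2); the function name and noise level are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, shift

def form_low_res(H, t, S=4, sigma=7.0, noise_std=0.01, seed=0):
    """Synthesize one observation L_k from a high-resolution image H per
    Equation (3.1): warp by T_k, blur with the PSF h, down-sample by S,
    and add zero-mean Gaussian noise. t = (dy, dx) is the translation."""
    rng = np.random.default_rng(seed)
    warped = shift(H, t, order=1)             # geometric transformation T_k
    blurred = gaussian_filter(warped, sigma)  # convolution with the PSF h
    sampled = blurred[::S, ::S]               # down-sampling operator S
    return sampled + rng.normal(0.0, noise_std, sampled.shape)  # noise eta
```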

We shall use the hat symbol to differentiate between estimated and actual images. In other words, H represents the actual high-resolution image, and \hat{H} denotes its estimate.

3.2.1 Registration

The process of determining the transformation Tk for each image is known

as registration. In the general case, Tk is a projective transformation (planar ho-

mography) and its reference coordinates are usually those of one of the images in

the sequence. If all the images are roughly aligned by the detector, as was the case

with our detector, the choice of a reference image is arbitrary, and we chose the

first of each sequence.

As a simplification, we are assuming that Tk is simply translational since,

as the reader may recall from Chapter 2, our license plate detector is custom-

designed for three different scales, and the variation in size of detections within a

scale is minimal. To calculate the translation of each image Lk in the sequence

relative to the reference image L1, we divided each image into several patches and

used the normalized cross-correlation measure of similarity

NCC(I_1, I_2) = \frac{\sum_x (I_1(x) - \bar{I}_1)(I_2(x) - \bar{I}_2)}{\sqrt{\sum_x (I_1(x) - \bar{I}_1)^2 \sum_x (I_2(x) - \bar{I}_2)^2}} \quad (3.2)

to find the best placement of each patch in L_1. In Equation (3.2),

\bar{I}_1 = \frac{1}{N} \sum_x I_1(x) \quad \text{and} \quad \bar{I}_2 = \frac{1}{N} \sum_x I_2(x)

are the means of I1 and I2. NCC(I1, I2) takes on values in [−1, 1], with 1 repre-

senting most similar and −1 representing least similar. Each patch I_1 is compared against every same-size window I_2 of the reference image L_1, and the average


offset of each correspondence is computed and treated as the translation from Lk

to L1. This simple process leads to sub-pixel accuracies for each translation.
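The per-patch search can be sketched as follows; `best_offset` slides one patch over every offset of the reference image, and averaging its output over several patches gives the translation estimate. This is a brute-force illustration with hypothetical names, not the thesis's code.

```python
import numpy as np

def ncc(I1, I2):
    """Normalized cross-correlation of Equation (3.2) for equal-size arrays."""
    a, b = I1 - I1.mean(), I2 - I2.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def best_offset(patch, reference):
    """Return the (dx, dy) placement of `patch` in `reference` with the
    highest NCC score, along with that score."""
    ph, pw = patch.shape
    best_score, best_xy = -1.0, (0, 0)
    for dy in range(reference.shape[0] - ph + 1):
        for dx in range(reference.shape[1] - pw + 1):
            s = ncc(patch, reference[dy:dy + ph, dx:dx + pw])
            if s > best_score:
                best_score, best_xy = s, (dx, dy)
    return best_xy, best_score
```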

Since registration is a crucial pre-processing step for the extraction of an

accurate high-resolution estimate, we applied an all-pairs cross-correlation proce-

dure on the plates in each tracked sequence to ensure all images in the sequence

are somewhat similar and no erroneous detections are included. Those images with

poor correlation to the rest are discarded.

3.2.2 Point Spread Function

The blur operation in Equation (3.1) is modeled by a convolution with a

point spread function (PSF). The PSF should approximate the blur of both the

optics of the camera as well as its sensor. Zomet and Peleg [24] suggest three

methods of estimating it:

• Use camera specifications obtained from manufacturer (if available)

• Analyze a picture of a known object

• Use the images in the sequence

Capel and Zisserman [7] instead suggest simply using an isotropic Gaussian, which

Capel found to work well in practice [6]. For our experiments we chose a Gaussian

of size 15 × 15 and standard deviation of 7, which was used to create the blur

operation in Figure 3.2.

3.2.3 Algorithm

Our super-resolution algorithm is based on a probabilistic framework.

The algorithm estimates the super-resolution image \hat{H} by maximizing the conditional probability \Pr(\hat{H}|L) of the super-resolution estimate \hat{H} given the set of observed low-resolution images L = \{L_k\}. We do not know \Pr(\hat{H}|L) directly, but using the imaging model of Equation (3.1) we can determine \Pr(L|\hat{H}). Using


Bayes’ Rule,

\Pr(\hat{H}|L) = \frac{\Pr(L|\hat{H})\,\Pr(\hat{H})}{\Pr(L)}.

To find the most probable high-resolution image \hat{H}, we need to maximize
\Pr(L|\hat{H})\,\Pr(\hat{H}). \quad (3.3)
We can drop the \Pr(L) term since it does not depend on \hat{H}. A further simplification

is sometimes made by assuming that all high-resolution images are equally likely,

in which case just Pr(L|H) is maximized. The high-resolution estimate obtained

from this process is the maximum likelihood (ML) estimate. In our case, however,

we do have some prior knowledge of the high-resolution images of license plates,

which we can use to our advantage. We shall first describe a method of finding

the ML estimate and then describe the priors we use in Section 3.2.5.

3.2.4 Maximum Likelihood Estimate

Using our assumption that image noise is Gaussian with zero-mean and

variance σ2, Capel and Zisserman [7] suggest the total probability of an observed

image Lk given an estimate of the super-resolution image H is

\Pr(L_k|\hat{H}) = \prod_{x,y} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(\hat{L}_k(x,y) - L_k(x,y))^2}{2\sigma^2}}. \quad (3.4)

The log-likelihood function of Equation (3.4) is:

\mathcal{L}(L_k) = -\sum_{x,y} \left( \hat{L}_k(x, y) - L_k(x, y) \right)^2. \quad (3.5)

If we assume independent observations,

\Pr(L|\hat{H}) = \prod_k \Pr(L_k|\hat{H}), \quad (3.6)

and the corresponding log-likelihood function for all images in the set L becomes

\mathcal{L}(L) = \sum_k \mathcal{L}(L_k) = -\sum_k \|\hat{L}_k - L_k\|^2. \quad (3.7)


The ML estimate is then obtained by finding the \hat{H} that maximizes Equation (3.7):
\hat{H}_{ML} = \arg\max_{\hat{H}} \sum_k \mathcal{L}(L_k) = \arg\min_{\hat{H}} \sum_k \|\hat{L}_k - L_k\|^2. \quad (3.8)

If the formation process in Equation (3.1) that maps the high-resolution estimate \hat{H} to \hat{L}_k is expressed in matrix form as
\hat{L}_k = M_k \hat{H}, \quad (3.9)

we have a system of N linear equations for all N images in the sequence. Stacking

these vertically, we have:
\begin{bmatrix} L_1 \\ L_2 \\ \vdots \\ L_N \end{bmatrix} = \begin{bmatrix} M_1 \\ M_2 \\ \vdots \\ M_N \end{bmatrix} \hat{H}, \qquad \text{i.e.,} \qquad L = M\hat{H}. \quad (3.10)

Using this notation, the solution of Equation (3.7) can be obtained by

\hat{H} = (M^\top M)^{-1} M^\top L. \quad (3.11)

In practice, M is very large and its pseudo-inverse is prohibitive to compute and

therefore iterative minimization techniques are used. The iterative methods also

facilitate the computation of H when the high-resolution images are not all equally

likely, and several priors are included in Equation (3.3). We use simple gradient

descent as our minimization method.

3.2.5 Maximum a Posteriori Estimate

In this section we shall describe the priors Pr(H) we use for obtain-

ing a maximum a posteriori estimate (MAP). The MAP estimate is obtained by

maximizing the full expression in Equation (3.3). The most common prior used


in the super-resolution literature is the smoothness prior introduced by Schultz

and Stevenson [42]. Capel and Zisserman also use a learnt face-space prior [8]. For

super-resolution of text specifically, Donaldson and Myers [12] use a bi-modal prior

taking into account the bi-modal appearance of dark text on light background. The

two priors we experimented with were the smoothness and bi-modal prior.

Smoothness Prior

The smoothness prior we used was introduced by Schultz and Stevenson

[42] and has the probability density:

\Pr_s(\hat{H}(x, y)) = c_s\, e^{-\rho(\hat{H}(x,y) - \bar{H}(x,y))}, \quad (3.12)
where c_s is a normalizing constant, \bar{H}(x, y) is the average of the pixel intensities of the four nearest neighbors of \hat{H}(x, y):
\bar{H}(x, y) = \frac{\hat{H}(x-1, y) + \hat{H}(x+1, y) + \hat{H}(x, y-1) + \hat{H}(x, y+1)}{4}, \quad (3.13)

and ρ(x) is the Huber cost function:

\rho(x) = \begin{cases} x^2 & |x| \le \alpha \\ 2\alpha|x| - \alpha^2 & |x| > \alpha \end{cases}. \quad (3.14)

The \hat{H}(x, y) - \bar{H}(x, y) expression is a measure of the local smoothness around a pixel (x, y), where large values indicate discontinuities and small values indicate a smooth

region. A plot of the Huber function is shown in Figure 3.3 (a). Its use is justified

by Donaldson and Myers [12] who suggest the linear region of ρ(x) for |x| > α

preserves steep edges because of the constant derivative.
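The Huber cost and its derivative, which is needed later for the smoothness gradient of Equation (3.22), are straightforward to implement. A minimal Python sketch, using the α = 0.6 of Figure 3.3 (a):

```python
import numpy as np

def huber(x, alpha=0.6):
    """Huber cost of Equation (3.14): quadratic for |x| <= alpha, linear beyond."""
    return np.where(np.abs(x) <= alpha, x ** 2, 2 * alpha * np.abs(x) - alpha ** 2)

def huber_deriv(x, alpha=0.6):
    """rho'(x); the constant slope beyond alpha is what preserves steep edges."""
    return np.where(np.abs(x) <= alpha, 2 * x, 2 * alpha * np.sign(x))
```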

Bi-Modal Prior

The bi-modal prior used by Donaldson and Myers [12] is an exponential

fourth-order polynomial probability density with maxima at the corresponding


Figure 3.3: (a) The Huber penalty function used in the smoothness prior with α =

0.6 and red and blue corresponding to the regions |x| ≤ α and |x| > α respectively;

(b) an un-scaled version of the bi-modal prior with µ0 = 0.1 and µ1 = 0.9.

black and white peaks of the pixel intensity distributions of the high-resolution

image:

\Pr_b(\hat{H}(x, y)) = c_b\, e^{-(\hat{H}(x,y) - \mu_0)^2 (\hat{H}(x,y) - \mu_1)^2}, \quad (3.15)

where cb is a normalizing constant and µ0 and µ1 are the centers of the peaks. The

function is shown in Figure 3.3 (b) for a choice of µ0 = 0.1 and µ1 = 0.9.

Donaldson and Myers estimate \mu_0 and \mu_1 for each high-resolution estimate, but we instead used the constants shown in Figure 3.3 (b).

Computing the Estimate

Combining the likelihood and two prior probability distributions and sub-

stituting into Equation (3.3), we have

\hat{H} = \arg\max_{\hat{H}} \prod_k \Pr(L_k|\hat{H}) \cdot \prod_{x,y} \Pr_s(\hat{H}(x, y)) \cdot \Pr_b(\hat{H}(x, y)). \quad (3.16)


Taking the negative log-likelihood of the right-hand side,

\hat{H} = \arg\min_{\hat{H}} \sum_k \|M_k\hat{H} - L_k\|^2 + \sum_{x,y} \rho\left(\hat{H}(x, y) - \bar{H}(x, y)\right) + \sum_{x,y} (\hat{H}(x, y) - \mu_0)^2 (\hat{H}(x, y) - \mu_1)^2. \quad (3.17)

For convenience, we shall refer to each of the three terms as E_M(\hat{H}), E_S(\hat{H}), and E_B(\hat{H}). To control the contribution of each term, we weigh E_M(\hat{H}), E_S(\hat{H}), and E_B(\hat{H}) by the constants c_M, c_S, and c_B, respectively:
\hat{H} = \arg\min_{\hat{H}}\; c_M E_M(\hat{H}) + c_S E_S(\hat{H}) + c_B E_B(\hat{H}). \quad (3.18)

We chose to use gradient descent to minimize Equation (3.17); therefore, we need to find the derivative of the entire expression with respect to \hat{H}. The derivative of the ML term is straightforward:
\partial_{\hat{H}} E_M(\hat{H}) = 2 \sum_k M_k^\top (M_k\hat{H} - L_k), \quad (3.19)
and the derivative of the bi-modal term is:
\partial_{\hat{H}} E_B(\hat{H}) = 2(\hat{H}(x, y) - \mu_0)(\hat{H}(x, y) - \mu_1)(2\hat{H}(x, y) - \mu_0 - \mu_1). \quad (3.20)

The derivative of the smoothness term is trickier to compute, since each neighbor of \hat{H}(x, y) involves \hat{H}(x, y) in its own \bar{H} calculation. Therefore, we need to unroll E_S(\hat{H}) around \hat{H}(x, y) and then find the derivative:
E_S(\hat{H}) = \cdots + \rho\!\left(x_1 - \tfrac{\cdots + x}{4}\right) + \rho\!\left(x_2 - \tfrac{\cdots + x}{4}\right) + \rho\!\left(x - \tfrac{x_1 + x_2 + x_3 + x_4}{4}\right) + \rho\!\left(x_3 - \tfrac{\cdots + x}{4}\right) + \rho\!\left(x_4 - \tfrac{\cdots + x}{4}\right) + \cdots, \quad (3.21)


where

x = \hat{H}(x, y), \quad x_1 = \hat{H}(x, y-1), \quad x_2 = \hat{H}(x-1, y), \quad x_3 = \hat{H}(x+1, y), \quad x_4 = \hat{H}(x, y+1).

The derivative then is

\partial_{\hat{H}} E_S(\hat{H}(x, y)) = -\tfrac{1}{4}\rho'\!\left(x_1 - \tfrac{\cdots + x}{4}\right) - \tfrac{1}{4}\rho'\!\left(x_2 - \tfrac{\cdots + x}{4}\right) + \rho'\!\left(x - \tfrac{x_1 + x_2 + x_3 + x_4}{4}\right) - \tfrac{1}{4}\rho'\!\left(x_3 - \tfrac{\cdots + x}{4}\right) - \tfrac{1}{4}\rho'\!\left(x_4 - \tfrac{\cdots + x}{4}\right). \quad (3.22)

Having obtained the derivatives of each term, we iteratively step in the direction

opposite the gradient until we reach a local minimum. At each step we add some

portion of the gradient of the E_M(\hat{H}), E_S(\hat{H}), and E_B(\hat{H}) terms, controlled by the factors c_M, c_S, and c_B, respectively, and by the step size. Instead of constructing each M_k matrix in Equation (3.19) explicitly, we only apply image operations such as warp, blur, and sample for multiplications with M_k and M_k^\top, similar to Zomet and Peleg's work [52]. Since M_k is the product of these image operations, it

can be decomposed into

M_k = S B W_k, \quad (3.23)

where S is the down-sampling matrix, B is the matrix expressing the blurring with

the PSF, and Wk is the transformation matrix representing Tk. Therefore,

M_k^\top = W_k^\top B^\top S^\top, \quad (3.24)


Figure 3.4: Super-resolution results: (a) sequence of images processed, (b) an up-

sampled version of one low-resolution image, (c) the average image, (d) the final

high-resolution estimate.

where W_k^\top is implemented by applying the reverse of the transformation W_k applies, B^\top is implemented by applying the same blur operation as B since we are using an isotropic Gaussian PSF, and S^\top is implemented by up-sampling without any

interpolation.
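A minimal Python sketch of these image-operation implementations of M_k and M_k^⊤, together with one gradient-descent step on the data term E_M alone (the two prior terms and their weights are omitted for brevity; function names and parameter values are illustrative assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, shift

def apply_M(H, t, S=4, sigma=7.0):
    """M_k = S B W_k of Equation (3.23): warp, blur, then down-sample."""
    return gaussian_filter(shift(H, t, order=1), sigma)[::S, ::S]

def apply_Mt(L, t, hr_shape, S=4, sigma=7.0):
    """M_k^T = W_k^T B^T S^T of Equation (3.24): up-sample without
    interpolation, blur with the same isotropic PSF, reverse the warp."""
    up = np.zeros(hr_shape)
    up[::S, ::S] = L
    return shift(gaussian_filter(up, sigma), (-t[0], -t[1]), order=1)

def gradient_step(H, frames, offsets, step=0.1, c_M=1.0):
    """One descent step on E_M(H) = sum_k ||M_k H - L_k||^2, Equation (3.19)."""
    grad = np.zeros_like(H)
    for L_k, t in zip(frames, offsets):
        grad += 2.0 * apply_Mt(apply_M(H, t) - L_k, t, H.shape)
    return H - step * c_M * grad
```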

We use the average image in the sequence resized to four times the res-

olution using bi-linear interpolation as the initial high-resolution estimate. The

choice of the average image as an initial estimate is justified since it contains little

of the noise found in the individual images as is seen in Figure 3.4 (c).

Since we are performing a cross-check of each image with each other image

in the sequence during registration, the first few images (which have most detail)

are pruned. Had we implemented a more general transformation estimation for registration, we would have been able to take advantage of these images, but

simple translation estimation with them included negatively affected the average

image and thus the initial super-resolution estimate.

3.2.6 Discussion

There are numerous parameters in our image formation model and our

super-resolution algorithm that require either estimation or initial adjustment. The


values of these parameters have a profound effect on the final super-resolution im-

ages. Some of these results may look more appealing to us as humans, but the only way to determine whether super-resolution in general is worthwhile is to actually determine whether it improves the OCR rate. This was the approach taken by

Donaldson and Myers [12]; however, they used Scansoft's DevKit 2000, a commer-

cial OCR engine, on printed text, for which most commercial OCR packages are

designed. Although we were unable to obtain a copy of their choice of OCR pack-

age, the commercial OCR software we experimented with performed very poorly

on our super-resolution images, most likely because the OCR engines were not specifically trained on our forms of license plate text, or because our images were not of sufficiently high resolution.

Donaldson and Myers found that the biggest effect super-resolution had on improving OCR performance was the clearer separation of characters rather

than the reduction of noise. The separation of characters, which is a result of the

bi-modal prior, can also be observed in our data as shown in the super-resolution

estimate in Figure 3.4 (d). The image also exhibits a clear bi-modal pixel intensity

distribution, and in fact, the contrast is good enough to not require binarization

algorithms to be applied, a pre-processing step often necessary for OCR packages

to work correctly.

3.3 Optical Character Recognition

In this section we describe a very simple algorithm to recognize the char-

acters on detected plates and propose additional methods that may be used in

further research.

3.3.1 Previous Work

It was our initial intent to apply a binarization algorithm, such as a

modified version of Niblack’s algorithm as used by Chen and Yuille [10], on the


extracted license plate images from our detector, and then use the binarized im-

age as input to a commercial OCR package. We found, however, that even at a

resolution of 104× 31 the OCR packages we experimented with yielded very poor

results. Perhaps this should not come as a surprise considering the many custom

OCR solutions used in existing LPR systems.

The most common custom OCR approach used by existing LPR systems

is correlation-based template matching [35], sometimes done on a group of char-

acters [11]. Sometimes, the correlation is done with principal component analysis

(PCA) [27]. Others [44] apply connected component analysis on binarized images

to segment the characters and minimize a custom distance measure between char-

acter candidates and templates. Classification of segmented characters can also be

done using neural networks [37] with good results.

Instead of explicitly segmenting characters in detected plates, Amit et al.

[2] use a coarse-to-fine approach for both detection and recognition of characters

on license plates. Although they present high recognition rates, the license plate

images they worked with were of high-resolution, and it is not clear whether their

method will be as effective on the low-resolution images in our datasets.

Because of the simplicity of the template matching method, we chose to

experiment with it first, and it proved to work reasonably well.

3.3.2 Datasets

We generated training and test data by running our license plate detector

on several hours of video and extracting sequences of images for each tracked license

plate. This process resulted in a total of 879 plate sequences, each of which was

labeled by hand. Of these, 121 were chosen at random to form an alphabet of

characters for training. These 121 sequences contained the necessary distribution

of characters to form 10 examples per character, for a total of 360 examples (26

letters and 10 digits). This alphabet of training images is shown in Figure 3.5.

The remaining 758 plates were used for testing the OCR rate.


Figure 3.5: The alphabet created from the training set. There are 10 examples for

each character for the low-resolution, average image, and super-resolution classes,

shown in that respective order.

Figure 3.6 shows a histogram of the frequency of all characters in our

training and test datasets. Note that the majority of characters are numbers with

‘4’ being most common since most of today’s California plates start with that

number. The frequencies of 'I', 'O', and 'Q' were relatively small, most likely due

to their potential confusion with other similarly shaped characters.

3.3.3 Template Matching

Unless text to be read is in hand-written form, it is common for OCR

software to segment the characters and then perform recognition on the segmented

image. The simplest methods for segmentation usually involve the projection of

row and column pixels and placing divisions at local minima of the projection

functions. In our data, the resolution is too low to segment characters reliably in

this fashion, and we therefore decided to apply simple template matching instead,

which can simultaneously find both the location of characters and their identity.


Figure 3.6: Character frequencies across our training and test datasets.

The algorithm can be described as follows. For each example of each

character, we search all possible offsets of the template image in the license plate

image and record the top N best matches. The searching is done using the NCC

metric shown in Equation (3.2), and a threshold on the NCC score is applied before

considering a location a possible match. If more than one character matches a

region the size of the average character, the character with the higher correlation is

chosen and the character with the lower correlation is discarded. Once all templates

have been searched, the characters for each region found are read left to right

forming a string. N depends on the resolution of the license plate image and should be chosen large enough that the N best matches are not all clustered around a single character when the same character occurs more than once on a plate, yet not so large that every possible region is processed.
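A sketch of this procedure is shown below, reusing the `ncc` helper from the registration sketch in Section 3.2.1 and, for simplicity, one template per character. The threshold, the per-template N, and the average character width are illustrative values, not the ones used in our experiments.

```python
def read_plate(plate, alphabet, n_best=5, ncc_thresh=0.6, char_w=6):
    """Template-matching OCR sketch: record the top N NCC matches for every
    template, resolve overlapping characters by score, read left to right."""
    candidates = []  # (score, x position, character)
    for ch, tmpl in alphabet.items():
        th, tw = tmpl.shape
        scores = []
        for dy in range(plate.shape[0] - th + 1):
            for dx in range(plate.shape[1] - tw + 1):
                s = ncc(tmpl, plate[dy:dy + th, dx:dx + tw])
                if s >= ncc_thresh:
                    scores.append((s, dx, ch))
        scores.sort(reverse=True)
        candidates.extend(scores[:n_best])  # top N matches per template
    kept = []  # resolve overlaps: keep the higher-correlation character
    for s, x, ch in sorted(candidates, reverse=True):
        if all(abs(x - kx) > char_w for kx, _ in kept):
            kept.append((x, ch))
    return ''.join(ch for _, ch in sorted(kept))  # read left to right
```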

This method may seem inefficient; however, the recognition process takes

on the order of half a second for a resolution of 104 × 31, which we found to be

acceptable. This recognition time is much smaller than the several seconds required

to estimate a super-resolution image. Our results for this method are shown in

Section 3.4.


3.3.4 Other Methods

We would like to propose several ideas for future work on license plate

OCR. The first method is to apply shape context matching [4] on characters

segmented after applying connected components and a thinning pre-processing

step [44] on the high-resolution estimates. Shape contexts have been shown to be very

effective at recognizing hand-written digits, and it is reasonable to presume that

the method might work well on license plate characters.

The second method that might benefit further research in this area is to

apply the AdaBoost framework to recognizing segmented characters. At the time

of this writing we are not aware of any OCR algorithms that use boosted classi-

fiers, but the filters we presented in Chapter 2 may also be adapted to individual

characters, with the caveat that many more training examples would be required

and the AdaBoost classifier we presented would need to be modified for multiclass

classification.

Mori and Malik [32] use a Hidden Markov Model (HMM) to choose the

most likely word when performing text recognition in images with adversarial clut-

ter. A similar method may apply to license plate recognition to learn and recognize

common character sequence types, such as a digit, followed by three letters, fol-

lowed by three digits.

3.4 Results

Our template matching method was not well-suited for recognition on

the super-resolution images using the super-resolution templates in our training

alphabet. Our low-resolution templates yielded far better results on the test set,

which is most likely due to a better correlation resulting from the natural blur that

occurs in the low-resolution images, which tolerates more intra-class variance. Therefore,

in this section we present our results on just the low-resolution image sequences.

Figure 3.7 shows our recognition results on the low-resolution images


Figure 3.7: Template matching OCR results on the low-resolution test set for ‘stan-

dard’ and ‘loose’ comparisons between recognized characters and actual characters.

from the test set, taken from the second frame in the image sequence of the plate

trackers. We used the edit distance, sometimes also referred to as the Levenshtein

distance, to measure how similar our recognized text was to the labeled plates in

the test set. Because certain characters are easily confused with others, even by

humans, we also applied a ‘loose’ character equality test whenever the edit distance

algorithm compared two characters. The groups of characters {‘O’, ‘0’, ‘D’, ‘Q’},

{‘E’, ‘F’}, {‘I’, ‘T’, ‘1’}, {‘B’, ‘8’}, and {‘Z’, ‘2’} were each considered of the same

type and no penalty was applied for incorrect readings within the group. Figure

3.7 shows the number of license plates read with various numbers of mistakes with

and without using the ‘loose’ comparison measure.
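The 'loose' measure amounts to the standard dynamic-programming edit distance with a modified character-equality test. A minimal Python sketch with the confusion groups listed above:

```python
GROUPS = [set('O0DQ'), set('EF'), set('IT1'), set('B8'), set('Z2')]

def loose_eq(a, b):
    """'Loose' equality: no penalty for confusions within a group."""
    return a == b or any(a in g and b in g for g in GROUPS)

def edit_distance(s, t, eq=loose_eq):
    """Levenshtein distance with a pluggable character-equality test."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if eq(cs, ct) else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]
```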

Figure 3.8 shows the template matching method applied to the actual low-resolution images in the test set. Note that over half of the test set was recognized

with two or fewer mistakes. One can observe a large degradation in image quality

with each progressive horizontal section. The template matching is most often

thwarted by plate boundaries, which are more and more visible as the size of the

plate decreases.

Our goal for this thesis was to have an unconstrained LPR system, and


Figure 3.8: Recognition results for the images in our test set. Each horizontal

section lists plates whose read text contained 0, 1, 2, 3, 4, 5, 6, and 7 mistakes.


these OCR rates are quite satisfactory for our purposes. An alternative to super-

resolution would be to perform OCR on each image in the sequence and obtain

the most likely text in that fashion; however, this experiment was not performed.


Chapter IV

Make and Model Recognition

As with our license plate recognition problem, detecting the car is the

first step to performing make and model recognition (MMR). To this end, one can

apply a motion segmentation method such as [50] to estimate a region of interest

(ROI) containing the car. Instead, we decided to use the location of detected

license plates as an indication of the presence and location of a car in the video

stream and to crop an ROI of the car for recognition. This method would also be

useful for make and model recognition in static images, where the segmentation

problem is more difficult.

In this chapter, we describe several feature-based and appearance-based

methods commonly used in object recognition and evaluate their recognition rates

on car images extracted from our video stream.

4.1 Previous Work

To the best of our knowledge, MMR is a fairly unexplored recognition

problem. Various work has been done on car detection in street scene images

[29] [43] [39] and aerial photographs [41]. Dorko and Schmid [13] use scale invariant

features to detect cars in images with 50% background on average. Agarwal et

al. [1] automatically create a vocabulary of car parts, such as tires and windshields,

from training images and detect cars by finding individual parts and comparing


their spatial relations. Interestingly, most of the car detection literature only deals

with side-views of cars, perhaps because from a large distance the side profile

provides richer and thus more discriminating features.

The work of Ferencz et al. [14] is most closely related to our problem

statement. Their work is helping develop a wide-area car tracking system and is

not formulated as a recognition problem, but what they call an object identification

problem. In our system we are interested in determining to which make and model

class a new vehicle belongs, and although all classes consist of cars, there is fair

amount of variation within each of the make and model classes. In contrast, Ferencz

et al. are interested in determining whether two images taken at different times

and camera orientations are of the exact same car, where there is really only a

single example that serves as a model. They solve this problem by automatically

finding good features on side views of cars from several hundred pairs of training

examples, where good features refer to features that are good at discriminating

between cars from many small classes.

4.2 Datasets

We automatically generated a database of car images by running our

license plate detector and tracker on several hours of video data and cropping a

fixed window of size 400× 220 pixels around the license plate of the middle frame

of each tracked sequence. This method yielded 1,140 images in which cars of each

make and model were of roughly the same size since the license plate detector was

specialized to respond to a narrow range of license plate sizes. The majority of

these images are shown in Figure 4.1. The crop window was positioned such that

the license plate was centered in the bottom third of the image. We chose this

position as a reference point to ensure matching was done with only car features

and not background features. Had we centered the license plate both vertically

and horizontally, cars that have their plates mounted on their bumper would have


exposed the road in the image. Although this method worked well in most cases,

for some cars, the position of the license plate was off-center horizontally, which

allowed for non-car regions to be included in the ROI.

After collecting these images, we manually assigned make, model, and

year labels to 790 of the 1,140 images. We were unable to label the remaining

350 images due to our limited familiarity with those cars. We often made use of

the California Department of Motor Vehicles’ web site [23] to determine the makes

and models of cars with which we were not familiar. The web site allows users to

enter a license plate or vehicle identification number for the purposes of checking

whether or not a car has passed recent smog checks. For each query, the web site

returns smog history as well as the car’s make and model description if available.

The State of California requires all vehicles older than three years to pass a smog

check every two years. Therefore, we were unable to query cars that were three

years old or newer and relied on our personal experience to label them.

We split the 1,140 images into a query set and a database set.

The query set contains 38 images chosen to represent a variety of make and model

classes, in some cases with multiple queries of the same make and model but

different year in order to capture the variation of model designs over time. We

evaluated the performance of each of the recognition methods by finding the best

match in the database for each of the query images.

4.3 Appearance-based Methods

Appearance-based object recognition methods work by treating entire

images as feature vectors and comparing these vectors with a training vector space.

An M × N image would be transformed into a single MN -dimensional feature

vector consisting of just pixel intensities from the image. In practice, M and

N are too large to search the training vector space efficiently for a best match

and some sort of dimensionality reduction is done first. Common dimensionality


Figure 4.1: Our automatically generated car database. Each image is aligned such

that the license plate is centered a third of the distance from bottom to top. Of

these images, 1,102 were used as examples, and 38 were used as queries to test the

recognition rates of various methods. We used the AndreaMosaic photo-mosaic

software to construct this composite image.


reduction techniques are principal component analysis (PCA) [33] [34] and the

Fisher transform [3].

Because appearance-based methods work directly with feature vectors

consisting entirely of pixel brightness values (which directly correspond to the ra-

diance of light emitted from the object), they are not good at handling illumination

variability in the form of intensity, direction, and number of light sources, nor vari-

ations in scale. The Fisherface method [3] and Illumination Cones [18] address the

illumination variability problem but are not invariant to scale.

In this section, we describe the Eigenface recognition method, which has

frequently been used in face recognition, and evaluate its performance on MMR.

4.3.1 Eigencars

In principal component analysis, a set of feature vectors from a high-

dimensional space is projected onto a lower dimensional space, chosen to cap-

ture the variation of the feature vectors. More formally, given a set of N images

{x1,x2, ...,xN}, each expressed as an n-dimensional feature vector, we seek a lin-

ear transformation, W ∈ Rn×m, that maps each xk into an m-dimensional space,

where m < n, such that

W^\top \Sigma W \quad (4.1)

is maximized. Here,

\Sigma = \frac{1}{N-1} \sum_{k=1}^{N} (x_k - \mu)(x_k - \mu)^\top,

and

\mu = \frac{1}{N} \sum_{i=1}^{N} x_i

is the average image. The covariance matrix Σ is also referred to as the total scatter

matrix [3] since it measures the variability of all classes in the n-dimensional feature

vectors.


Finding the W that maximizes Equation (4.1) is an eigenvalue prob-

lem. Since n is usually very large (in our case 88,000) and much larger than N

(1,102), computing the eigenvectors of Σ directly is computationally and storage-

prohibitive. Instead, consider the matrix

A = [x_1 - \mu,\; x_2 - \mu,\; \ldots,\; x_N - \mu]. \quad (4.2)

Then, Σ = AA>. Using singular value decomposition (SVD), A can be decomposed

as A = UDV >, where U and V are orthonormal and of size n × N and N × N

respectively, and D is an N × N diagonal matrix. Using this decomposition, Σ

becomes

\Sigma = AA^\top = UDV^\top (UDV^\top)^\top = UDV^\top V D^\top U^\top = U D^2 U^\top, \quad (4.3)

where D^2 contains \{\lambda_1, \lambda_2, \ldots, \lambda_N\}, the first N eigenvalues of \Sigma; the corresponding eigenvectors, and therefore the columns of W, are the columns of U. Because these eigenvectors are of the same dimensions as the set of x_i images,

they can be visualized and in the face recognition literature are referred to as

‘eigenfaces’ [48]. We chose to more aptly call them ‘eigencars’ since our domain of

input images consists of cars. The first ten eigenvectors corresponding to the ten

largest eigenvalues are shown in Figure 4.2 (b), and µ is shown in Figure 4.2 (a).

The eigencars recognition algorithm can then be described as follows:

Off-line

1. Construct the A matrix from a set of N images \{x_1, x_2, \ldots, x_N\}

2. Compute the SVD of A to obtain the eigenspace U and the diagonal matrix

D containing the eigenvalues in decreasing order

3. Project each of the N column vectors of A onto the eigenspace U to obtain

a low-dimensional N ×N feature matrix F = A>U , and scale each row of F

by the diagonal of D


Figure 4.2: (a) The average image, and (b) the first 10 eigencars.

On-line

1. Subtract the average image µ from the query image q, q′ = q− µ

2. Project q′ onto the eigenspace U to obtain an N -dimensional feature vector

f and scale f by the diagonal of D

3. Find the row k of F that has the smallest L2 distance to f and consider x_k to be the best match to q (a NumPy sketch of both stages follows)
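Both stages map directly onto a few lines of NumPy. A sketch (the function names are ours, and the `drop` parameter anticipates the eigenvector-discarding variant discussed below):

```python
import numpy as np

def train_eigencars(X):
    """Off-line stage. X is n x N with one vectorized image per column.
    Returns the mean image, the eigenspace U, the singular values, and the
    feature matrix F = A^T U with rows scaled by the diagonal of D."""
    mu = X.mean(axis=1, keepdims=True)
    A = X - mu
    U, d, _ = np.linalg.svd(A, full_matrices=False)  # A = U D V^T
    F = (A.T @ U) * d
    return mu, U, d, F

def match_query(q, mu, U, d, F, drop=0):
    """On-line stage: project the query, optionally dropping the first few
    eigenvectors (which mostly capture illumination), return the best index."""
    f = ((q.reshape(-1, 1) - mu).T @ U) * d
    dists = np.linalg.norm(F[:, drop:] - f[:, drop:], axis=1)
    return int(np.argmin(dists))
```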

Results

We applied the algorithm to our database and query sets and obtained a

recognition rate of only 23.7%. This is a very low recognition rate; however, the recognition rate using random guessing is 2.5%.

Figures 4.3 and 4.4 show the query images and the top ten matches

in the database for each query using the on-line recognition method. Note the

stark similarity in overall illumination of all matches for each query, even though


the matches contain a large variation of makes and models. This suggests the

algorithm is not recognizing car features, but rather illumination similarity.

Belhumeur et al. [3] suggest that the three eigencars corresponding to the

three largest eigenvalues capture most of the variation due to lighting and that it is

best to ignore them. Indeed, discarding these eigenvectors increased the recognition

rate to 44.7%. The results of this modified approach are shown in Figures 4.5 and

4.6. Note that the matches no longer exhibit the strong similarity in illumination

as before. We also tried removing the top 7 largest eigenvectors, which led to

a recognition rate of 47.4%. Removing any more eigenvectors, however, had a

negative effect.

Discussion

The most computationally intensive part of the eigencars algorithm is the

computation of F = A^\top U. With A consisting of the full-resolution images, the

process takes about four hours, and requires roughly 1,500MB of RAM. We also

performed the recognition experiment on sub-scaled versions of the images with

200× 110 resolution and found that this greatly reduced the off-line training time

and significantly reduced the memory requirements without adversely affecting the

recognition rate.

The on-line part of the algorithm is reasonably fast. It only takes one or

two seconds to project q′ onto the eigenspace U . We shall see that this is a strong

advantage of the appearance-based method when we evaluate the performance of

feature-based methods in Section 4.4.

The Fisherface [3] method is a more recent appearance-based recognition

method that has computational requirements similar to those of the Eigenface method and

has been shown to yield superior recognition rates in the face recognition domain

because it selects a linear transformation that maximizes the ratio of the between-

class scatter to the within-class scatter. It therefore requires us to place our set

of xk training images into separate classes. Due to time constraints, we did not


Figure 4.3: The first 19 query images and the top 10 matches in the database for

each using all N eigencars.


Figure 4.4: The second 19 query images and the top 10 matches in the database

for each using all N eigencars.


Figure 4.5: The first 19 query images and the top 10 matches in the database for

each using N − 3 eigencars.


Figure 4.6: The second 19 query images and the top 10 matches in the database

for each using N − 3 eigencars.


evaluate this method.

4.4 Feature-based Methods

In contrast to appearance-based recognition methods, feature-based

recognition methods first find a number of interesting features in an image and

then use a descriptor representative of the image area around the feature location

to compare with features extracted from training images of objects. The features

should belong to the objects to be recognized, and should be sparse, informative,

and reproducible, the latter two properties being most important for object

recognition. If the features themselves are not sufficiently informative, descriptors are used for matching instead, usually constructed from the image structure around the features.

4.4.1 Feature Extraction

Here, we discuss several feature extraction methods commonly used in

object recognition.

Corner Detectors

In the computer vision community, interest point detection is often called

corner detection even though not all features need be corners. Corner detection

is often used for solving correspondence problems, such as in stereopsis. Corner

features occur in an image where there is a sharp change in the angle of the

gradient. In practice, these points of sharp change in the angle of the gradient

do not always correspond to real corners in the scene, for example in the case of

occlusion junctions.

Two popular corner detectors are the Harris [19] and Forstner [15] detec-

tors. The output of a Harris corner detector on a car image from our dataset is

shown in Figure 4.7.


Figure 4.7: Harris corner detections on a car image. Yellow markers indicate

occlusion junctions, formed by the intersection of edges on surfaces of different

depths.

Corner features by themselves are not sufficiently informative for object

recognition, but Agarwal et al. [1] combine them with patches of the image used

as a descriptor.

Salient Features

Kadir and Brady [25] have developed a low-level feature extraction

method inspired by studies of the human visual system. Their feature detector

extracts features at various scales that contain high entropy. For each pixel

location x, the scale s is chosen in which the entropy is maximum, where by scale

we mean the patch size around x used to obtain a probability distribution P on

the pixel intensities used in the entropy H calculation:

H(s, \mathbf{x}) = -\sum_{i=0}^{255} P_{s,\mathbf{x}}(i) \log P_{s,\mathbf{x}}(i). \quad (4.4)

Equation (4.4) assumes pixel intensities take on values between 0 and 255. Unlike

the corner detector, Kadir and Brady features carry a scale descriptor in addition

to their position in the image.
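A direct, unoptimized sketch of the entropy-over-scales search of Equation (4.4), without the integral-image speedup described next; the scale set and names are illustrative assumptions.

```python
import numpy as np

def patch_entropy(img, x, y, s):
    """Entropy H(s, x) of Equation (4.4) over an s x s patch centered at
    (x, y); img is an 8-bit grayscale array."""
    patch = img[max(0, y - s // 2): y + s // 2 + 1,
                max(0, x - s // 2): x + s // 2 + 1]
    p = np.bincount(patch.ravel(), minlength=256).astype(float)
    p /= p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def best_scale(img, x, y, scales=(5, 7, 9, 11, 15, 21)):
    """Choose the scale with maximum entropy at pixel (x, y)."""
    return max(scales, key=lambda s: patch_entropy(img, x, y, s))
```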

We created an efficient implementation of the detector using our integral

image optimization technique from Section 2.5.1 for the calculation of P around

x for the various scales. Our results on our car image are shown in Figure 4.8.


Figure 4.8: Kadir and Brady salient feature extraction results on (a) a car image

from our database, and (b) an image of a leopard.

We found that Kadir and Brady features had low repeatability when applied to

our car images and were, therefore, not explored further. They seem to be more suitable for some images than others, as can be seen in Figure 4.8 (b).

SIFT Features

The corner detector we described earlier is sensitive to changes in image

size and, therefore, does not provide useful features for matching images of dif-

ferent sizes. Scale invariant feature transform (SIFT) features recently developed

by Lowe [30] overcome this problem and are also invariant to rotation and even

partially invariant to illumination differences.

The process of extracting SIFT features consists of four steps: scale-space

extremum detection, keypoint localization, orientation assignment, and descriptor

assignment. The scale space L(x, y, σ) of an image I(x, y) is defined as a convolu-

tion of a variable-scale Gaussian kernel:

L(x, y, σ) = G(x, y, σ) ∗ I(x, y),

where

G(x, y, σ) =1

2πσ2e−(x2+y2)/2σ2

.


Figure 4.9: SIFT keypoints and their orientations for a car image.

The scale parameter σ is quantized and keypoints are then localized by finding

extrema in

D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma),
where k\sigma is the next highest scale. The locations of these extrema are the keypoints.

Orientation assignment of each keypoint is then done by computing the gradient

magnitude m(x, y) and orientation θ(x, y) of the scale space for the scale of that

keypoint:

m(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2}
\theta(x, y) = \tan^{-1} \frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}

Figure 4.9 shows 352 keypoints and their orientations extracted from our example car image from the database.

Finally, the descriptor is assigned by dividing the region around the key-

point into 16 symmetric sub-regions and assigning 8 orientation bins to each sub-

region. The final result is a 16 × 8 = 128-dimensional feature vector. When

comparing two SIFT descriptors, the L2 distance measure is used.

4.4.2 Shape Contexts

A shape context is an image descriptor introduced by Belongie et al.

[4] and has been shown to be very good for matching shapes. Some success-


Figure 4.10: (a) Query car image with two interest points shown, (b) database car

image with one corresponding interest point shown, (c) diagram of log-polar bins

used for computing shape context histograms, (d,e,f) shape context histograms for

points marked ‘B’, ‘C’, and ‘A’ respectively. The x-axis represents θ and the y-axis

represents log r increasing from top to bottom.

ful applications include hand-written digit recognition [4] and breaking “Com-

pletely Automated Public Turing Tests to Tell Computers and Humans Apart”

(CAPTCHA) [32] protection mechanisms used by internet companies such as Ya-

hoo to deter automated signups for thousands of email accounts. Although the

shape context descriptor is best suited for binary images, we felt it would be in-

teresting to test it in the context of grayscale car images.

The shape context descriptor is computed as follows. Given an interest

point x, we consider a circle of radius r centered on x and divide it into sections

according to a log-polar grid as shown in Figure 4.10 (c). We then count the number

of edge pixels within a radius r that fall in each bin. The resulting histogram is

known as the shape context of x. Figure 4.10 shows the shape context for a pair

of matching points and a point on the shape far away from each matching point.


Note the similarity in the descriptor for the corresponding points and how vastly

different it is for point A.

The shape context descriptors are usually compared using the χ2 distance

d(h_i, h_j) = \sum_{\text{bins } k} \frac{(h_i(k) - h_j(k))^2}{h_i(k) + h_j(k)}, \quad (4.5)

where hi and hj are the two descriptors. Sometimes, the L2 distance is used

instead, though we found that using it had little effect on the overall recognition

results.

The original shape context work [4] used a histogram with 5 logarithmic

divisions of the radius and 12 linear divisions of the angle. In our recognition

experiments we also tried a histogram of size 9 × 4 in addition to the original

5× 12. In [31], Mori et al. augment the shape context histogram to include edge

orientations, which we have not experimented with.
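As a concrete illustration of the descriptor and the distance of Equation (4.5), here is a minimal sketch (our own, assuming NumPy; the bin-edge details are one reasonable choice, not necessarily those of [4]):

    import numpy as np

    def shape_context(points, center, r_max=35.0, r_min=2.0, n_r=5, n_theta=12):
        # Log-polar histogram of edge points around `center`
        # (5 radial x 12 angular bins, as in the original work [4]).
        d = np.asarray(points, float) - np.asarray(center, float)
        r = np.hypot(d[:, 0], d[:, 1])
        theta = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)
        keep = (r >= r_min) & (r <= r_max)
        r_edges = np.geomspace(r_min, r_max, n_r + 1)          # logarithmic in r
        r_bin = np.clip(np.digitize(r[keep], r_edges) - 1, 0, n_r - 1)
        t_bin = np.minimum((theta[keep] * n_theta / (2 * np.pi)).astype(int),
                           n_theta - 1)                        # linear in theta
        hist = np.zeros((n_r, n_theta))
        np.add.at(hist, (r_bin, t_bin), 1)
        return hist.ravel()

    def chi2(hi, hj, eps=1e-10):
        # Chi-squared distance of Equation (4.5); eps guards empty bins.
        return np.sum((hi - hj) ** 2 / (hi + hj + eps))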

4.4.3 Shape Context Matching

Shape context matching was the first feature-based method we tried. The

algorithm we implemented works as follows:

1. For each image d in the database and a query image q, take a random

sampling of N points from the edge images (as shown in Figure 4.11) of q

and d and compute the shape context around each point.

2. For each database image d:

(a) For each sampled edge point pq in q, find the sampled point pd in d, within some radius threshold, whose shape context has the smallest χ² distance to that of pq according to Equation (4.5).

(b) Sum the χ² distances over all correspondences and treat the sum as the cost c.

3. Choose the d with the lowest cost c and consider it the best match.
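A minimal sketch of this matching loop (our illustration, assuming NumPy; `d.sc` and `d.pts` are hypothetical per-image fields holding precomputed shape contexts and sample points):

    import numpy as np

    def chi2(hi, hj, eps=1e-10):
        return np.sum((hi - hj) ** 2 / (hi + hj + eps))

    def match_cost(sc_q, pts_q, sc_d, pts_d, radius=50.0):
        # Step 2: match each query shape context to the closest-in-chi2
        # database context within `radius` pixels; sum distances as cost c.
        sc_d, pts_d = np.asarray(sc_d, float), np.asarray(pts_d, float)
        cost = 0.0
        for h_q, p_q in zip(sc_q, pts_q):
            near = np.hypot(pts_d[:, 0] - p_q[0], pts_d[:, 1] - p_q[1]) <= radius
            if near.any():
                cost += min(chi2(h_q, h_d) for h_d in sc_d[near])
        return cost

    # Step 3: best = min(database, key=lambda d: match_cost(sc_q, pts_q, d.sc, d.pts))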


Figure 4.11: (a) Image edges and (b) a random sampling of 400 points from the

edges in (a).

In the original work on shape contexts, step 2 was performed for several iterations, using the correspondences to compute a thin plate spline transformation of q. Since we are matching 3-D rigid bodies under an approximately affine camera, we instead estimated an affine transformation using RANSAC to best align q with d. We found, however, that the affine estimate was not sufficiently stable, and that the natural alignment obtained by using the license plate as a reference point was sufficient.

Our recognition rates on the query set were 65.8% using a 5 × 12 shape context and 63.2% using a 9 × 4 shape context. The descriptor radius was 35 pixels, and the sampling size N was 400 points.

4.4.4 SIFT Matching

We also explored matching query images using the SIFT feature extractor

and descriptor discussed earlier. The algorithm we used was the following:

1. For each image d in the database and a query image q, perform keypoint

localization and descriptor assignment as described in Section 4.4.1.

2. For each database image d:

(a) For each keypoint kq in q, find the keypoint kd in d that has the smallest L2 distance to kq, where that distance is at least a factor of α smaller than the distance to the next closest descriptor. If no such kd exists, examine the next kq.

(b) Count the number of descriptors n that successfully matched in d.

3. Choose the d with the largest n and consider it the best match.
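A minimal sketch of steps 2–3 (our illustration, assuming NumPy arrays of 128-dimensional descriptors; `d.descriptors` is a hypothetical per-image field):

    import numpy as np

    def count_matches(desc_q, desc_d, alpha=0.6):
        # Step 2(a): accept a query keypoint only if its nearest database
        # descriptor is at least a factor alpha closer than the second nearest.
        if len(desc_d) < 2:
            return 0
        n = 0
        for q in desc_q:
            dists = np.sort(np.linalg.norm(desc_d - q, axis=1))
            if dists[0] < alpha * dists[1]:
                n += 1
        return n

    # Step 3: best = max(database, key=lambda d: count_matches(desc_q, d.descriptors))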

Discussion

We found that a few types of keypoint matches resulting from the above algorithm did not contribute to the selection of the best car match. For example, some matching keypoints corresponded to entire groups of digits and letters on the license plates of a query image and a database image even though the cars to which they belonged looked quite different. Since the best car match in the database is determined by the number of matched keypoints, spurious matches should be ignored. We therefore applied the following keypoint pruning procedures:

• Limit horizontal distance between matching keypoints. This helps remove

outliers when estimating an affine transformation between the query and

database images.

• Ignore keypoints that occur in the license plate region.

• Do not allow multiple query keypoints to match to the same database key-

point.

• Compute an affine transformation from the query to the database image

when there are more than three matching keypoints. If the scale, shear, or

translation parameters of the transformation are outside a threshold, set the

number of matching keypoints n to 0.
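The last check might look like the following minimal sketch (our illustration; the threshold values are placeholders, not the thesis's):

    import numpy as np

    def affine_is_plausible(pts_q, pts_d, max_scale=2.0, max_shear=0.3,
                            max_shift=100.0):
        # Least-squares fit of pts_d ~ A @ pts_q + t (needs more than 3 matches).
        pts_q = np.asarray(pts_q, float)
        pts_d = np.asarray(pts_d, float)
        if len(pts_q) <= 3:
            return True
        X = np.hstack([pts_q, np.ones((len(pts_q), 1))])   # N x 3
        P, *_ = np.linalg.lstsq(X, pts_d, rcond=None)      # 3 x 2 parameters
        A, t = P[:2].T, P[2]
        scale = np.sqrt(abs(np.linalg.det(A)))
        shear = abs(A[0, 1]) + abs(A[1, 0])
        return (1.0 / max_scale <= scale <= max_scale
                and shear <= max_shear
                and np.all(np.abs(t) <= max_shift))

    # If this returns False, set the number of matching keypoints n to 0.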

We used Lowe’s implementation [30] of the keypoint localization part of the algorithm. Unlike in Lowe’s implementation, the query’s keypoint descriptors were compared with the keypoint descriptors of each database image separately. This means that the second-best descriptor was never chosen from an object other than the current database image. Also, modifying the threshold from the 0.36 appearing in the published code to 0.60 (which is closer to the value suggested in Lowe’s paper) increased the number of matches but had little effect on the overall recognition rate: cars misclassified with one setting were correctly classified with the other, at the expense of different misclassifications.

When the number of matching descriptors between the query image and a database image is equal to that of another database image, we break the tie by selecting the database image with the smaller total L2 distance over all matched descriptors. Such ties only occurred when the best matches in the database had one or two matching descriptors, and applying the tie-break procedure had little effect on the overall recognition rate.

Results

The SIFT matching algorithm described above yielded a recognition rate of 89.5% on the query set. The recognition results for each query image are shown in Figures 4.12–4.15. Note that for some of the queries with over 20 similar cars in the database, the top 10 matches were all of the same make and model.


Figure 4.12: Query images 1-10 and the top 10 matches in the database using

SIFT matching. Yellow lines indicate correspondences between matched keypoints

of the query (top) and database (bottom) images.


Figure 4.13: Query images 11-20 and the top 10 matches in the database using

SIFT matching. Yellow lines indicate correspondences between matched keypoints

of the query (top) and database (bottom) images.


Figure 4.14: Query images 21-29 and the top 10 matches in the database using

SIFT matching. Yellow lines indicate correspondences between matched keypoints

of the query (top) and database (bottom) images.


Figure 4.15: Query images 30-38 and the top 10 matches in the database using

SIFT matching. Yellow lines indicate correspondences between matched keypoints

of the query (top) and database (bottom) images.


Table 4.1: Summary of overall recognition rates for each method.

Method                                      Recognition rate
Eigencars using all eigenvectors            23.7%
Eigencars without 3 highest                 44.7%
Shape context matching with 9 × 4 bins      63.2%
Shape context matching with 5 × 12 bins     65.8%
SIFT matching                               89.5%

4.4.5 Optimizations

Finding the best match for a query image in our database of 1,102 images

for both shape context and SIFT matching takes about 30 seconds, compared to

0.5 seconds with the Eigencars method. The high recognition rate achieved with

SIFT matching is certainly appealing, but for our system to be real-time, MMR

must be as fast as the LPR algorithms.

Several possibilities exist that may help in that regard. Instead of comparing features in the query image with every single database image, it would be useful to cluster the database images into groups of similar type, such as sedan, SUV, etc., and perform a hierarchical search to reduce the number of comparisons.

A promising method that is applicable to our situation is the recent work by Sivic and Zisserman [46]. They formulate the object recognition problem as a text retrieval problem, a class of problem that, judging by our daily experience with internet search engines, can be solved remarkably efficiently. Future work on MMR should investigate incorporating a similar approach.
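In that spirit, a minimal sketch of such an index (our illustration, assuming NumPy; the k-means codebook `centers` is assumed to be precomputed from the database descriptors):

    import numpy as np
    from collections import defaultdict

    def quantize(descriptors, centers):
        # Map each SIFT descriptor to its nearest cluster center ("visual word").
        d = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        return d.argmin(axis=1)

    def build_inverted_index(db_words):
        # db_words: {image_id: array of word ids}. The index maps each word to
        # the set of images containing it, so a query touches only those images.
        index = defaultdict(set)
        for image_id, words in db_words.items():
            for w in np.unique(words):
                index[int(w)].add(image_id)
        return index

    def candidate_images(query_words, index):
        # Rank database images by the number of visual words shared with the query.
        votes = defaultdict(int)
        for w in np.unique(query_words):
            for image_id in index.get(int(w), ()):
                votes[image_id] += 1
        return sorted(votes, key=votes.get, reverse=True)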

4.5 Summary of Results

Table 4.1 summarizes the overall recognition rates of the appearance-

based and feature-based methods we evaluated.

Table 4.2 lists the queries used in our test set and shows which methods were able to classify each query correctly. Note that most of the queries SIFT matching was not able to classify correctly had 5 or fewer similar entries in the database. It is reasonable to expect that having more examples per make and model class would increase the recognition rate.


Table 4.2: Test set of queries. ‘Size’ indicates the number of cars in the database similar to the query; a check mark indicates that a method classified the query correctly.

Make and Model            Size   Eigencars   Eigencars   SC       SC      SIFT
                                 Full        Minus 3     5 × 12   9 × 4
VW Beetle                   5    √
Honda Accord-1             18    √ √ √
Honda Accord-2             20    √ √ √ √
Honda Accord-3             17    √
Honda Civic-1              19    √ √ √
Honda Civic-2              16    √ √ √
Honda Civic-3              12    √ √ √ √ √
Honda Civic-4              11    √ √ √
Toyota Camry-1             20    √ √ √
Toyota Camry-2             16    √ √ √ √
Toyota Camry-3             11    √ √ √
Toyota Camry-4              4    √ √ √ √
Toyota Corolla-1 (dent)     -    √ √ √ √
Toyota Corolla-1           14    √ √
Toyota Corolla-2           26    √ √ √ √ √
Toyota Corolla-4            8    √ √
VW Jetta-1                 15    √
VW Jetta-2                  6    √ √ √ √
Ford Explorer-1             4    √ √ √ √
Ford Explorer-2             6    √ √ √
Van Van                    18    √ √ √ √ √
Van Van (occluded)          -    √ √ √
Nissan Altima-1             5    √ √ √
Nissan Altima-2             4    √ √ √ √
Nissan Altima-3             3
Nissan Altima-4             3
Nissan Sentra-5             4
Toyota 4Runner              9    √ √ √ √
Ford Focus-1                6    √ √
Ford Focus-2                6    √
Ford Mustang                9    √ √ √
Honda CR-V                  7    √ √ √ √ √
BMW 323                    10    √
VW Passat                   9    √
Toyota Tundra               7    √ √ √ √ √
Toyota RAV4                 3    √ √ √ √ √
Toyota Sienna-1             3    √
Toyota Sienna-2             6    √ √ √ √


Chapter V

Conclusions and Future Work

5.1 Conclusions

We have presented a useful framework for car recognition that combines LPR and MMR. Our recognition rates for both sub-problems are very promising and can serve as an important foundation for a query-based car surveillance system. Our LPR solution is real-time, works well with inexpensive camera hardware, and does not require the infrared lighting or sensors normally used in commercial LPR systems. Our MMR solution is also very accurate; however, further research is required to make it real-time. We have suggested ideas on how this may be achieved in Section 4.4.5.

5.1.1 Difficulties

At the start of our project, we anticipated several difficulties that we might encounter, including the following:

1. The weather can sometimes make the background undesirably dynamic, with swaying branches and even wind-induced camera shake.

2. Variability exists in the license plate designs of different states, and even in character spacing, such as that found in vanity plates.


3. Depending on the Sun’s position, a vehicle’s shadow may be mistaken for part of the vehicle.

4. Various types of vehicle body damage or even dirt might impact LPR and

MMR.

5. Recognition algorithms might work only in broad daylight or with very good lighting.

6. The surface of most cars is specular, a material property known to cause

problems for appearance-based recognition algorithms.

In this section, we discuss the observed performance of our system in each of the

above situations.

1. The effects of wind were especially pronounced in the ‘Regents’ camera, since its optics were extended to their full zoom range and even light winds caused camera shake. Although image stabilization techniques could alleviate this effect, camera shake does not influence our license plate detection algorithms, because the entire frame is searched for license plates and the license plate tracker is sufficiently robust to handle the camera movement we observed.

2. Our datasets did not include an adequate sampling of out-of-state plates

and vanity plates to determine how well our system would handle these in-

stances. However, the few such plates we observed seemed to be detected

and recognized no differently.

3. Vehicle shadows did not affect our car recognition algorithms. Because we use the license plate location as a reference point when segmenting the car image, the segmented image contained only pixels belonging to a car and no background, except in very rare cases with SUVs whose plates are mounted off-center. Even in those cases, our MMR algorithm performed well, as seen in Figure 4.14.


4. Figure 4.13 shows an example of a query of a car with a dent and a van with partial occlusion. In both cases SIFT matching found the correct vehicle in the database, while the appearance-based methods failed. License plate detection did in fact perform quite poorly on very old or dirty plates; however, those instances were rare, and even we, as humans, were unable to read those plates.

5. It might be worthwhile to investigate night-time make and model recognition methods, where some crude intuition about the examined vehicle might be formed from its taillight design. We have not experimented with night-time video data, but external lighting would certainly be required in those cases for our system to operate.

6. The specular material of car bodies had little observed effect on our MMR rates. In most cases, reflections of tree branches on cars’ windows resulted in features that simply did not match features in the database. In a few instances, as seen in Figure 4.13, several features caused by a tree branch reflection resulted in a match, but they were not enough to impact the overall recognition rate; in general, with more examples per make and model, this would hardly be a problem.

5.2 Future Work

Although our work is a good start toward a query-based car surveillance system, further research is necessary to make such a system possible. In this section, we discuss several features that still need to be researched and developed.

5.2.1 Color Inference

In addition to searching the surveillance database for cars using some make and model description and a partial license plate, it would also be useful to be able to search for a car of a particular color, as the make and model information may be incomplete. Various color- and texture-based image segmentation techniques used in content-based image retrieval, such as [9], may be suitable for our purpose.

Since we already segment cars statically using the license plate tracker,

we could simply compute a color histogram for the entire region and store this as a

color feature vector in addition to the existing SIFT feature vectors for each image

in the database. To assign a meaningful label to the color histogram, such as ‘red’,

‘white’, ‘blue’, etc., we can find the minimum distance, as described in [9], to a

list of pre-computed and hand-labeled color histograms for each interesting color

type.
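A minimal sketch of this idea (our illustration, assuming NumPy and 8-bit RGB pixels; the hand-labeled histograms are assumed to be given):

    import numpy as np

    def color_histogram(region_pixels, bins=8):
        # 3-D RGB histogram of the segmented car region, normalized to sum to 1.
        hist, _ = np.histogramdd(region_pixels.reshape(-1, 3),
                                 bins=(bins,) * 3, range=((0, 256),) * 3)
        return (hist / hist.sum()).ravel()

    def label_color(hist, labeled_histograms):
        # labeled_histograms: {'red': h_red, 'white': h_white, ...} built by hand;
        # return the label of the nearest pre-computed histogram.
        return min(labeled_histograms,
                   key=lambda name: np.linalg.norm(hist - labeled_histograms[name]))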

5.2.2 Database Query Algorithm Development

Due to the heavy computation necessary to perform both LPR and MMR, a production surveillance system would require constant updates to the surveillance database as cars are detected in the live video stream, rather than as an overnight batch process. An algorithm for querying such a database might work as follows. Given any combination of partial license plate, make and model, or color description of a car:

1. If a partial license plate description is provided, perform a search for the license plate substring and sort the results using the edit distance measure described in Chapter 3.

2. If a make and model description is provided, search the top results from Step 1 for the desired make and model. Otherwise, search the entire database for the given make and model description.

3. If color is provided, return results from Step 2 with a similar color, as de-

scribed in Section 5.2.1.
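A minimal sketch of this query flow (our illustration; a plain Levenshtein distance stands in for the measure of Chapter 3, and the record fields are hypothetical):

    def edit_distance(a, b):
        # Standard Levenshtein distance via dynamic programming.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[-1] + 1,                 # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def query(records, plate=None, make_model=None, color=None):
        # `records` items are assumed to carry .plate, .make_model, and .color.
        hits = list(records)
        if plate:                                     # Step 1
            hits = [r for r in hits if plate in r.plate]
            hits.sort(key=lambda r: edit_distance(plate, r.plate))
        if make_model:                                # Step 2
            hits = [r for r in hits if r.make_model == make_model]
        if color:                                     # Step 3
            hits = [r for r in hits if r.color == color]
        return hits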

Queries in the database should return the times each matching vehicle

was observed and allow the user to replay the video stream for those times.


5.2.3 Make and Model 3-D Structure

In our MMR work, we have not explored car pose variation beyond what normally occurs at the stop signs in our scenes. A robust MMR system should also work well in scenes with a large variation in pose. This could require estimating a car’s 3-D structure for use as additional input to the MMR algorithms.


Bibliography

[1] S. Agarwal, A. Awan, and D. Roth. Learning to detect objects in images via a sparse, part-based representation. PAMI, 26(11):1475–1490, 2004.

[2] Y. Amit, D. Geman, X. Fan. A coarse-to-fine strategy for multiclass shape detection. PAMI, 26:1606–1621, 2004.

[3] P. Belhumeur, J. Hespanha, D. Kriegman. Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. PAMI, pp. 711–720, 1997.

[4] S. Belongie, J. Malik, J. Puzicha. Matching shapes. Proc. ICCV, pp. 454–461, 2001.

[5] G. Cao, J. Chen, J. Jiang. An adaptive approach to vehicle license plate localization. Proc. IECON ’03 (Industrial Electronics Society), vol. 2, pp. 1786–1791, 2003.

[6] D. Capel. Image Mosaicing and Super-resolution. PhD thesis, University of Oxford, 2001.

[7] D. Capel, A. Zisserman. Super-resolution enhancement of text image sequences. International Conference on Pattern Recognition, pp. 600–605, Barcelona, 2000.

[8] D. Capel, A. Zisserman. Super-resolution from multiple views using learnt image models. Proc. CVPR, 2001.

[9] C. Carson, S. Belongie, H. Greenspan, J. Malik. Blobworld: color- and texture-based image segmentation using EM and its application to image querying and classification. PAMI, 24(8):1026–1038, 2002.

[10] X. Chen, A. Yuille. Detecting and reading text in natural scenes. Proc. CVPR, vol. 2, pp. 366–373, 2004.

[11] P. Comelli, P. Ferragina, M. Granieri, F. Stabile. Optical recognition of motor vehicle license plates. IEEE Trans. on Vehicular Technology, 44(4):790–799, 1995.


[12] K. Donaldson, G. Myers. Bayesian Super-Resolution of Text in Video with a Text-Specific Bi-Modal Prior. SRI. http://www.sri.com/esd/projects/vace/docs/IJDAR2003-Myers-Donaldson.pdf

[13] G. Dorko and C. Schmid. Selection of scale-invariant parts for object class recognition. Proc. ICCV, 2003.

[14] A. Ferencz, E. Miller, J. Malik. Learning hyper-features for visual identification. NIPS, 2004.

[15] W. Forstner, E. Gulch. A fast operator for detection and precise location of distinct points, corners and circular features. Proc. Intercommission Conference on Fast Processing of Photogrammetric Data, Interlaken, pp. 281–305, 1987.

[16] Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121:256–285, 1995.

[17] Y. Freund, R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. EuroCOLT 95, pp. 23–37, Springer-Verlag, 1995.

[18] A. Georghiades, P. Belhumeur, D. Kriegman. From few to many: Illumination Cone models for face recognition under variable lighting and pose. IEEE Trans. PAMI, pp. 643–660, 2001.

[19] C. Harris, M. Stephens. A combined corner and edge detector. Alvey Vision Conference, pp. 147–151, 1988.

[20] H. Hegt, R. de la Haye, N. Khan. A high performance license plate recognition system. Proc. IEEE International Conference on Systems, Man, and Cybernetics (SMC’98), vol. 5, pp. 4357–4362, 1998.

[21] http://en.wikipedia.org/wiki/London_Congestion_Charge

[22] http://www.cbp.gov/xp/CustomsToday/2001/December/custoday_lpr.xml

[23] California DMV Smog Check Web Site. http://www.smogcheck.ca.gov/vehtests/pubtstqry.aspx

[24] M. Irani, S. Peleg. Super resolution from image sequences. International Conference on Pattern Recognition, pp. 115–120, 1990.

[25] T. Kadir and M. Brady. Saliency, scale and image description. IJCV, 45(2):83–105, 2001.


[26] V. Kamat, S. Ganesan. An efficient implementation of the Hough transform for detecting vehicle license plates using DSPs. Proc. Real-Time Technology and Applications Symposium, pp. 58–59, 1995.

[27] N. Khan, R. de la Haye, A. Hegt. A license plate recognition system. SPIE Conf. on Applications of Digital Image Processing, 1998.

[28] K. Kim, K. Jung, and J. Kim. Color texture-based object detection: an application to license plate localization. Lecture Notes in Computer Science: International Workshop on Pattern Recognition with Support Vector Machines, pp. 293–309, 2002.

[29] B. Leung. Component-based Car Detection in Street Scene Images. Master’s Thesis, Massachusetts Institute of Technology, 2004.

[30] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.

[31] G. Mori, S. Belongie, J. Malik. Efficient shape matching using shape contexts. PAMI (to appear), 2005.

[32] G. Mori and J. Malik. Recognizing objects in adversarial clutter: breaking a visual CAPTCHA. Proc. CVPR, 2003.

[33] H. Murase, S. K. Nayar. Visual learning and recognition of 3-D objects from appearance. IJCV, 1995.

[34] H. Murase, M. Lindenbaum. Spatial temporal adaptive method for partial eigenstructure decomposition of large images. Tech. Report 6527, Nippon Telegraph and Telephone Corporation, 1992.

[35] T. Naito, T. Tsukuda, K. Yamada, K. Kozuka. Robust recognition methods for inclined license plates under various illumination conditions outdoors. Proc. of IEEE/IEEJ/JSAI International Conference on Intelligent Transportation Systems, pp. 697–702, 1999.

[36] T. Naito, T. Tsukada, K. Yamada, K. Kozuka, S. Yamamoto. Robust license-plate recognition method for passing vehicles under outside environment. IEEE Trans. Vehicular Technology, 49(6):2309–2319, 2000.

[37] J. Nijhuis, M. Brugge, K. Helmholt, J. Pluim, L. Spaanenburg, R. Venema, M. Westenberg. Car license plate recognition with neural networks and fuzzy logic. Proceedings of IEEE International Conference on Neural Networks, Perth, Western Australia, pp. 2185–2903, 1995.

[38] K. Okuma, A. Teleghani, N. de Freitas, J. Little and D. Lowe. A boosted particle filter: multitarget detection and tracking. ECCV, 2004.


[39] C. Papageorgiou, T. Poggio. A trainable object detection system: car detection in static images. MIT AI Memo 1673 (CBCL Memo 180), 1999.

[40] R. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.

[41] C. Schlosser, J. Reitberger, S. Hinz. Automatic car detection in high-resolution urban scenes based on an adaptive 3D-model. Proc. IEEE/ISPRS Workshop on “Remote Sensing and Data Fusion over Urban Areas”, 2003.

[42] R. Schultz, R. Stevenson. Extraction of high-resolution frames from video sequences. IEEE Transactions on Image Processing, 5(6):996–1011, 1996.

[43] H. Schneiderman, T. Kanade. A statistical method for 3D object detection applied to faces and cars. IEEE CVPR, 2000.

[44] V. Shapiro, G. Gluhchev. Multinational license plate recognition system: segmentation and classification. Proc. ICPR, 2004.

[45] J. Shi, C. Tomasi. Good features to track. Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR94), Seattle, June 1994.

[46] J. Sivic, A. Zisserman. Video Google: a text retrieval approach to object matching in videos. Proc. ICCV, 2003.

[47] G. Sullivan, K. Baker, A. Worrall, C. Attwood, P. Remagnino. Model-based vehicle detection and classification using orthographic approximations. Image and Vision Computing, 15(8):649–654, 1997.

[48] M. Turk, A. Pentland. Face recognition using eigenfaces. Proc. CVPR, 1991.

[49] P. Viola, M. Jones. Rapid object detection using a boosted cascade of simple features. Proc. CVPR, vol. 1, pp. 511–518, 2001.

[50] J. Wills, S. Agarwal, S. Belongie. What went where. Proc. CVPR, pp. 98–104, 2003.

[51] Y. Yanamura, M. Goto, D. Nishiyama, M. Soga, H. Nakatani, H. Saji. Extraction and tracking of the license plate using Hough transform and voted block matching. Proc. IEEE IV2003 Intelligent Vehicles Symposium, pp. 243–246, 2003.

[52] A. Zomet, S. Peleg. Super-resolution from multiple images having arbitrary mutual motion. Super-Resolution Imaging, Kluwer Academic, 2001.