Autonomous Systems Lab
Prof. Roland Siegwart

Master Thesis

Robust People Detection using Computer Vision

Author: Endri Dibra
Supervised by: Jerome Maye, Paul Beardsley

Spring Term 2013




Declaration of Originality

I hereby declare that the written work I have submitted entitled

Robust People Detection using Computer Vision

is original work which I alone have authored and which is written in my own words.1

Author(s)

Endri Dibra

Supervising lecturer

Jerome Maye

Company Supervisor

Paul Beardsley

With the signature I declare that I have been informed regarding normal academic citation rules and that I have read and understood the information on 'Citation etiquette' (http://www.ethz.ch/students/exams/plagiarism_s_en.pdf). The citation conventions usual to the discipline in question here have been respected.

The above written work may be tested electronically for plagiarism.

Place and date Signature

1 Co-authored work: The signatures of all authors are required. Each signature attests to the originality of the entire piece of written work in its final form.


Abstract

People detection is a challenging and interesting field of research that has many applications, one of which being video surveillance. Besides monitoring or detection of abnormalities, there is an increasing trend in using video surveillance to draw statistics that can help in managing and planning. In an environment like a restaurant, an automatic system could be used to determine the transition of people and the presence of seated people at tables. This information could help to direct the service for the tables. The purpose of this thesis was therefore to develop a framework for automatic people detection in restaurants. Adopting state-of-the-art Histograms of Oriented Gradients (HOG) as feature descriptors and Linear Support Vector Machines (SVM) to train classifiers, we propose a novel method for people detection in scenes seen from an oblique camera view. Unlike other holistic methods, which train one general people classifier from available datasets of people, like the INRIA dataset, and search through the whole image applying the classifier model in a sliding window multi-scale fashion, we propose a method based on training multiple specific classifiers for specific positions in the scene. This scheme allows us to overcome the difficulties arising from people occlusions and foreshortening that are inherent in other holistic approaches. Furthermore, by performing training only with data generated by synthetic models of humans in synthetic 3D models of the scene, we avoid the problem of acquiring datasets of real images for training. In addition, training only with synthetic models provides us with a rich dataset of humans in various poses and articulations, not only for standing configurations, but also for sitting (our method is general enough to extend to other poses, but detection of standing and sitting people is the focus here).

We demonstrate that this method outperforms the state-of-the-art holistic HOG approach by comparing it to four detectors trained on the INRIA dataset and on three different datasets from real footage of our scene. Finally, we show that our detector, which performs single-scale detection only at previously trained scene positions, is twice as fast as the sliding window multi-scale methods.



Acknowledgment

Before I elaborate further on this master thesis, I would like to express my gratitude to the people who have helped in realizing it.

First of all, I would really like to thank my supervisors, Jerome Maye (ASL) and Paul Beardsley (DRZ), for their great support and presence during the whole thesis and even before that. I would sincerely like to thank the DRZ artist Maurizio Nitti for helping me get quickly acquainted with Autodesk Maya in the very beginning. Then, I would like to thank Marcin Eichner for his great insights during the evaluation process of this thesis. Furthermore, I would like to thank ASL, represented by Prof. Roland Siegwart, as well as DRZ, represented by Prof. Markus Gross, for enabling this master thesis.

Finally, I would like to thank my Disney colleagues, friends and family, whose presence and input made this work easier and more enjoyable.



Contents

Abstract

1 Introduction
  1.1 Motivation
  1.2 Problem Description
    1.2.1 Camera- and Scene-Specific Human Training and Detection
    1.2.2 Synthetic Data Generation
    1.2.3 Classifiers Training
    1.2.4 Human Detection System
    1.2.5 Assumption of Prior 3D Model of the Scene
  1.3 Report Outline

2 Related Work
  2.1 People Detection Methods
    2.1.1 Background Subtraction
    2.1.2 Motion
    2.1.3 Silhouette
    2.1.4 Appearance
  2.2 People Detection using Synthetic Data

3 Method Description
  3.1 System Calibration
  3.2 Training Data Generation
    3.2.1 Synthetic Models
    3.2.2 Tile Division
    3.2.3 Positive Data Generation
    3.2.4 Negative Data Generation
  3.3 Per Tile Classifiers Training
    3.3.1 Features Extraction with HOG
    3.3.2 Training with Linear SVM
  3.4 People Detection
    3.4.1 Per Tile Confidence Computation
    3.4.2 Non Maximum Suppression
    3.4.3 Grouping
    3.4.4 Standing vs Sitting

4 Implementation
  4.1 Hardware
  4.2 Software and Tools
  4.3 Code Structure
    4.3.1 Detection Module
    4.3.2 Training Module
    4.3.3 Annotation Module
  4.4 Graphical User Interface

5 Evaluation and Results
  5.1 Datasets
  5.2 Training Methods
    5.2.1 Positive data generation
    5.2.2 Negative data generation
    5.2.3 Hard training
  5.3 Evaluation method
  5.4 Quantitative results
    5.4.1 P-R curves comparison
    5.4.2 Runtime Computation
  5.5 Qualitative Results
  5.6 Contributions

6 Conclusion and Outlook
  6.1 Achievements
  6.2 Future Work

Bibliography

List of Figures

List of Tables


Acronyms and Abbreviations

ASL Autonomous Systems Lab

BLOB Binary Large Object

C-HOG Circular Histograms of Oriented Gradients

DRZ Disney Research Zurich

GMM Gaussian Mixture Model

GUI Graphical User Interface

HOG Histograms of Oriented Gradients

HW Haar Wavelets

IOU Intersection over Union

ISM Implicit Shape Model

MEL Maya Embedded Language

MOCAP Motion Capture

R-HOG Rectangular Histograms of Oriented Gradients

ROI Region of Interest

ROS Robot Operating System

SCAPE Shape Completion and Animation of People

SIFT Scale Invariant Feature Transform

SVM Support Vector Machines

VOC Visual Object Classes


Chapter 1

Introduction

In this introductory chapter we present the motivation behind the realization of the project. First, we outline the importance of people detection. Then, we present the project goal, along with our approach to tackle it and its components. Finally, the structure of this report is outlined.

1.1 Motivation

Object detection is an increasingly popular research area in computer vision. Vision researchers are continuously developing new algorithms and improving existing methods for object detection. The Pascal Visual Object Classes (VOC)1 challenge was created in 2005 and has since been used as a benchmark in visual object category recognition and detection. It provides the computer vision and machine learning communities with a yearly growing dataset of images and annotations, as well as standard evaluation procedures, on which researchers can test their algorithms and detectors. Considered the benchmark for object detection, the Pascal VOC challenge consists of different object detection categories, on which detection algorithms are run and their performance is evaluated [1]. Out of all detection categories, the people category is particularly challenging because it exhibits high variability in human pose, both in the VOC dataset and in real life. Recent results from the challenge on the detection performance for this particular category are promising; however, there is a lot of room for improvement.

People detection is important not only because it is a state-of-the-art research topic in the vision community, but also because a wide range of applications relate to and benefit from it. These applications include, but are not limited to, video surveillance and people counting, human–robot interaction, games, and augmented reality. Video surveillance is becoming more and more feasible nowadays because of the reduced cost in the production of the sensors used in these applications. Cameras are being deployed both indoors and outdoors, in private and public spaces such as houses, airports, restaurants, shopping malls and parking lots. They are mainly being used for monitoring and detection of abnormalities, as well as for security purposes; however, lately they are also being exploited to draw statistics which could be useful for planning and management.

1 http://pascallin.ecs.soton.ac.uk/challenges/VOC/

Video surveillance through the help of humans, however, presents several problems, such as the need to monitor a great number of people in one camera view, or a number of cameras simultaneously, while minimizing the distractions that come along with the task. Performing such tasks is by no means easy for humans; a computer vision system could therefore assist in these cases. Using an automatic video surveillance system has difficulties of its own, such as the variety of the scenes or the acquisition conditions. Most of these difficulties could be overcome by designing systems that utilize single, stereo or multiple cameras. These sensors could be static or mobile, color or infrared. In this master thesis, we deal with indoor scenes using a static monocular color camera.

A typical automatic monitoring system involves three main tasks. The first task is detection, whose purpose is to locate previously specified objects in the images. Once detection is performed, tracking is used to robustly estimate the objects' positions in the scene over time. Finally, one could extend the system with a third task, which involves describing what happens in the scene over time and is generally known as event recognition. Even though it is the first step of a real monitoring system, detection has not had as much emphasis and attention in research as tracking has had. Furthermore, the majority of errors that arise during tracking stem from wrong detections [2]. This is the reason why we go into greater depth with regard to detection in this thesis.

1.2 Problem Description

The aim of this project was to develop a robust system for people detection in restaurants using a monocular static color camera from an oblique view (see Fig. 1.1). The camera is placed at one of the top corners of the restaurant and is tilted downwards to form an angle of ∼60 degrees with respect to the scene's ground plane.

Figure 1.1: Camera setup. Left: oblique view of a restaurant, camera tilted ∼60 degrees. Right: top view of the restaurant.

Various people detection techniques have been developed in recent years. Broadly speaking, they can be divided into techniques that require some preprocessing, such as background subtraction or segmentation, and techniques that can detect people using features directly extracted from the image fed as input. Because monocular cameras tend to have problems with techniques from the first group, due to sudden illumination changes, ghost effects or people standing still for a long time, we adopt a popular technique from the second group that has proven to have very good performance in people detection tasks. Being one of the state-of-the-art people detectors, Histograms of Oriented Gradients (HOG), developed by Dalal and Triggs [3], is a shape-based descriptor that exploits the fact that an object's shape can be well represented by a distribution of edge directions. It is trained on positive (images containing people) and negative (images containing no people) data to build a classifier, which is used at detection time in a multi-scale sliding window approach to determine whether a window at a certain position and scale in an image contains a person or not. The available implementation is trained on the INRIA Person Dataset2, which consists of images taken from several viewpoints under varying lighting conditions in indoor and outdoor scenes. These images contain mostly upright full-body generic people in generic settings. Being trained under such circumstances, the detector does not perform as well on foreshortened and occluded people.
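The multi-scale sliding-window scan described above can be sketched as follows. This is a minimal illustration of the mechanism only; the window size, stride, scale factor, and number of scales below are assumptions for the example, not the exact values of any particular HOG implementation.

```python
import numpy as np

def sliding_windows(img_h, img_w, win=(128, 64), stride=8, scale=1.2, n_scales=4):
    """Enumerate candidate windows (x, y, w, h) over an image pyramid.

    At each pyramid level the image is conceptually downscaled by
    scale**level; equivalently, the detection window grows in the
    coordinates of the original image, as computed here.
    """
    windows = []
    for level in range(n_scales):
        f = scale ** level
        wh, ww = int(win[0] * f), int(win[1] * f)   # window size at this scale
        step = int(stride * f)                      # stride grows with the scale
        for y in range(0, img_h - wh + 1, step):
            for x in range(0, img_w - ww + 1, step):
                windows.append((x, y, ww, wh))
    return windows

# Every enumerated window would then be cropped, described by HOG features,
# and scored by the trained SVM classifier.
cands = sliding_windows(480, 640)
```

The number of windows grows quickly with image size and scale count, which is exactly the cost that the per-tile single-scale scheme proposed in this thesis avoids.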

1.2.1 Camera- and Scene-Specific Human Training and Detection

As described above, a HOGs detector is generic and works across multiple scenes and all likely human poses. In contrast, this thesis describes how to train HOGs for a specific camera viewpoint, in a specific scene, and even potentially for specific human poses at an individual scene location. The system assumes a prior 3D model of the scene.

The focus is to show a system working in a restaurant. This is an ordinary type of human environment, but it has characteristics which show the strengths of our approach. Firstly, there are specific locations where people walk (between tables) and specific locations where they sit (at the tables). The work will demonstrate how sitting person detectors are applied only where subjects may be sitting, and standing person detectors are used only where subjects are standing or walking. The work also addresses the case where a standing person is in close proximity to a sitting person.

A second characteristic of the restaurant environment is that occlusion of human subjects will happen at certain locations. For example, the legs of a human standing behind a table relative to the camera viewpoint will be occluded. This will degrade the detection results of a generic human detector. In contrast, the system described here addresses this problem by specifically training for occluded humans at locations where there are fixed foreground objects in the scene.

A final characteristic, which is typical of all systems, is that images of human subjects are truncated when they are at the edge of the image. For example, for an elevated camera which is angled downwards into the scene, only the bottom part of the body may be visible for a distant human subject appearing in the top part of the image. In this case, a generic human detector fails. Our approach is to specifically train for truncated human appearance at locations around the edge of the image.

2 http://pascal.inrialpes.fr/data/human/


1.2.2 Synthetic Data Generation

There are two consequences of the decision to train classifiers for a specific camera view in a specific scene. The first consequence is that we need to quantize the scene into a rectangular grid of overlapping tiles on the ground plane, and train a different classifier at each tile. For example, a tile at a location behind a table will be associated with human subjects that are partly occluded, whereas at a tile in open space the human subjects are not occluded. Clearly the visual appearance differs, and the training images will be quite different for these two tiles.

The second consequence is that we need to obtain sufficient training data in order to train good classifiers. This is a very tedious process, which consists of gathering images of people under specific settings and manually annotating them. Even though annotation tools and websites exist, such as the Amazon Mechanical Turk3 or LabelMe4, which can ease the annotation process, the results are not always satisfactory. Even gathering real footage of people in a certain environment might not always be possible because of privacy issues. Furthermore, assuming that enough data is obtained for a specific camera viewpoint and environment, what would happen if either of these specific settings changed? New training data would be needed and the tedious process would have to start from the beginning. In order to overcome all these difficulties, we propose a system which creates a 3D model of the calibrated camera and the environment. Instead of real people, we insert a variety of synthetic humans, under different poses and orientations, and we train only on those. Using only synthetic data, we create an automatic training data generation system, which could be used in any other scenario, provided that the camera calibration parameters and the 3D model of the environment are known.
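To illustrate how known camera calibration ties the ground-plane tiles to image locations, here is a minimal pinhole-projection sketch. All numbers below (intrinsics, tilt, camera position, tile spacing) are made-up values chosen for the example, not the calibration of the actual system.

```python
import numpy as np

# Hypothetical intrinsics of a camera tilted ~60 degrees toward the floor.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
tilt = np.deg2rad(60.0)
R = np.array([[1.0, 0.0,           0.0],
              [0.0, np.cos(tilt), -np.sin(tilt)],
              [0.0, np.sin(tilt),  np.cos(tilt)]])
t = np.array([0.0, 1.0, 5.0])     # made-up camera translation

def project(points_w):
    """Project Nx3 world points through K[R|t] to Nx2 pixel coordinates."""
    cam = points_w @ R.T + t      # world -> camera frame
    uvw = cam @ K.T               # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]

# Tile centers on the ground plane (z = 0), spaced every 0.5 m.
tile_centers = [(x, y) for x in np.arange(-2.0, 2.1, 0.5)
                        for y in np.arange(1.0, 4.1, 0.5)]
centers_px = project(np.array([[x, y, 0.0] for x, y in tile_centers]))
```

Each projected tile center (plus the projected tile corners, computed the same way) defines the image patch on which that tile's specific classifier is trained and later evaluated.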

1.2.3 Classifiers Training

We generate enough positive and negative training data per tile to train a specific classifier. In order to achieve the best detection results, we try different combinations of the positive and negative training data by varying the amount of data, as well as the human or non-human models, used for training. The HOG features extracted from the data are then fed into a Linear Support Vector Machine (SVM) classifier [4] for training. The SVM is a binary classifier, which in our case builds a model that assigns new and unseen data to the person or non-person category. Besides determining the category, the SVM classifier also assigns a score which determines the confidence of a person detection/non-detection. We perform specific standing classifier training for adults and children in every tile of the scene, as well as sitting classifier training for adults and children only where chairs and couches are in the scene. The whole process is automatic.
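The thesis uses the linear SVM implementation of [4]; as a self-contained illustration of what such a binary classifier does (not the actual library), the toy example below trains a linear SVM by stochastic subgradient descent on the hinge loss (Pegasos-style). The decision value w·x + b plays the role of the signed per-tile confidence score; the feature vectors here are random stand-ins for HOG descriptors.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Pegasos-style hinge-loss SGD. X: (n, d) features, y: labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    step = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            step += 1
            eta = 1.0 / (lam * step)          # decaying learning rate
            margin = y[i] * (X[i] @ w + b)
            w *= (1.0 - eta * lam)            # shrink w (regularizer gradient)
            if margin < 1.0:                  # hinge-loss subgradient step
                w += eta * y[i] * X[i]
                b += eta * y[i]
    return w, b

def decision(w, b, X):
    """Signed confidence: positive => 'person'; magnitude => confidence."""
    return X @ w + b

# Toy separable data standing in for HOG vectors of person / non-person patches.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.0, 0.4, (60, 8)), rng.normal(-1.0, 0.4, (60, 8))])
y = np.hstack([np.ones(60), -np.ones(60)])
w, b = train_linear_svm(X, y)
acc = np.mean(np.sign(decision(w, b, X)) == y)
```

In the real system, the decision values produced per tile are exactly the scores later fed to the non-maxima suppression stage.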

3 http://www.mturk.com/mturk/welcome
4 http://labelme.csail.mit.edu/Release3.0/


1.2.4 Human Detection System

Unlike sliding window multi-scale detection approaches, we do not need to search the entire image for people. Instead, the search is performed only at specific locations (tiles) where people are expected to be either standing or sitting in the scene. Having pre-computed all the per tile classifiers, during detection we simply need to extract single scale HOG features from the image patches corresponding to the tiles and determine whether a person is present or not. This results in a more efficient and faster single-scale detection per tile. In order to get rid of multiple detections around the tiles where people are, we use and compare two non-maxima suppression schemes based on tile detection scores. The non-maxima suppression schemes need to be applied only to standing detections, because the sitting detections always occur on isolated tiles (where a chair is present in the real scene). In the end, based on a score comparison and box overlapping scheme, we remove ambiguities for tiles where a standing and a sitting detection are observed at the same time.
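Suppressing multiple detections around neighboring tiles can be sketched with the standard greedy non-maxima suppression procedure below. This is an illustrative sketch, not the thesis's exact scheme; in particular the 0.5 overlap threshold is an assumed value.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring box, drop overlapping neighbors, repeat."""
    order = np.argsort(scores)[::-1]          # indices by descending score
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < thresh for j in keep):
            keep.append(i)
    return keep

# Three overlapping detections of the same person plus one isolated detection.
boxes = [(10, 10, 74, 138), (14, 12, 78, 140), (8, 8, 72, 136), (200, 50, 264, 178)]
scores = np.array([0.9, 0.8, 0.7, 0.6])
kept = nms(boxes, scores)   # -> [0, 3]
```

Because the per-tile scheme produces at most one candidate box per tile, this suppression only has to consider the small set of neighboring standing-detection tiles rather than thousands of sliding windows.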

1.2.5 Assumption of Prior 3D Model of the Scene

As also mentioned in Sec. 1.2.1, our scheme assumes a prior 3D model of the scene. In this work, the 3D model of the scene was obtained by taking real distance measurements of spaces and objects in the scene, and then creating a CAD model from them. Even though this is performed manually, it is feasible to envisage an integrated system which automatically extracts the 3D model of the scene. Many different techniques for 3D reconstruction have been proposed in recent years. They vary from 3D reconstruction of small or medium sized objects to reconstruction of entire scenes. The latter is a challenging task; however, works similar to [5] and [6] show that quite realistic and accurate reconstructions can be achieved through the use of vision-based techniques and laser scanners. This, however, is not the focus of our work and remains a future research direction.

1.3 Report Outline

This report consists of six chapters. In this first chapter, we provided the motivation and a quick overview of the project. In the following Chapter 2, we present a summary of related work in the field of people detection. Then, in Chapter 3, we describe our data generation, training and detection methods in detail, followed by Chapter 4, where we give the project implementation details. The results and evaluation of our method, where we also compare it to four other training methods, are presented in Chapter 5. In the end, we conclude in Chapter 6 with a summary of the achievements and novelties of the method, as well as suggestions for further improvement.


Chapter 2

Related Work

As also mentioned in Sec. 1.1, automatic video monitoring systems differ from each other because of the various setups in which they are used. The choice of sensors is usually constrained by the specific environment or scenario, while the methods used during the inference processes are strongly coupled with the sensor and setup. Regardless of the setup, a video surveillance system passes through some general steps which are common to most systems. These steps generally consist of:

• Motion Extraction or Background Subtraction

• People Detection

• People Tracking

• Semantic Analysis of Human Activity

Most visual surveillance systems start with motion detection, where regions corresponding to moving objects are segmented from the rest of the image. This first step, which depending on the setup is not always present, is considered a pre-processing step. It is strongly coupled with the people detection step and needs to be done quite accurately, because its accuracy directly and strongly affects the following stages and the system as a whole. Because of the difficulties that arise from using background subtraction with static monocular cameras, the focus of this project, and also of this chapter, is mainly on the second step of the monitoring pipeline, which is people detection. For this reason, in Sec. 2.1 we try to provide a comprehensive overview of the work that has been done in the field of people detection in the past 15 years, mainly focusing on the methods and algorithms used, and we conclude with some recent work on using synthetic data for training detection classifiers in Sec. 2.2.

2.1 People Detection Methods

People detection has had a lot of emphasis in the computer vision community in recent years. The great variability that humans possess, in terms of body poses and articulations, makes it such a difficult category of object recognition; hence, many methods and techniques have been developed. One could use many criteria to classify people detection techniques. Some are based on whether pre-processing (background subtraction) or motion estimation is used, while others are based on which models are used (e.g. figure based or statistical), or on the camera attributes (e.g. color vs infrared, monocular vs stereo vs multiple cameras, static vs mobile, oblique view vs pedestrian horizontal view, etc.). Many surveys exist that compare various people detection methods in terms of their accuracy and efficiency. Ogale et al. [7] survey techniques for human detection from a given input image or video. The techniques there are classified with respect to the need for pre-processing (background subtraction vs direct detection), the features that describe human appearance (shape vs color vs motion) and the use of explicit body models and learning techniques. Santhanam et al. [8] likewise review various aspects of human detection in static images, while focusing mainly on methods that use learning to build people classifiers. Enzweiler's work [9] consists of a survey and an experimental study of state-of-the-art systems used for monocular pedestrian detection. Such systems include wavelet-based features used in a cascaded AdaBoost fashion [10], histograms of oriented gradient based features (HOGs) in combination with linear support vector machines (LinSVM) [3], a combination of shape and texture detection [11], etc. Similarly, Wojek et al. [12] compare the most prominent people detectors based on the features and classifiers they use and also propose a stronger detector by combining multiple features. A comprehensive evaluation of people detection approaches is contained in the survey of Geronimo et al. [13], where they are divided into silhouette matching and appearance methods. Finally, Schiele et al. [14] first provide an overview of the most successful methods for visual people detection, separating them into holistic sliding-window techniques that exhaustively scan the input images over many positions and scales, independently classifying each sliding window (e.g. [10], [3]), and methods which generate hypotheses of people by using part-based body models (e.g. [15], [16], [17], [18], [19], [20]).

Figure 2.1: Left: holistic representation. Right: part-based representation (taken from [8]).

Taking into consideration the above surveys and the literature investigated during this master thesis, people detection methods can be divided into four big groups, consisting of methods based on pre-processing or background subtraction, motion, appearance and silhouette (see Fig. 2.1).


2.1.1 Background Subtraction

There are many approaches to background subtraction and [21] gives a comprehensive overview of them. They range from simple frame difference methods to more complicated ones, which allow foreground segmentation by inferring a background model. Some of these methods consist of:

• Running Gaussian average

• Temporal median filter

• Mixture of Gaussians

• Kernel density estimation (KDE)

• Sequential KD approximation

• Co-occurrence of image variations

• Eigenbackgrounds

Static camera setups allow such pre-processing steps to be used; however, they entail various problems linked to sudden illumination changes, ghost effects, partial occlusions and especially people remaining in the same position for a long time, which is also the case in restaurants. Even though works like [22] try to suppress shadow effects, they struggle with occlusions, illumination changes and proper people localization. In order to overcome the aforementioned difficulties, depth/disparity based systems along with stereo cameras (e.g. [23], [24], [25], [26], [27]) have recently been adopted and have been shown to improve accuracy; however, a thorough investigation of disparity based background subtraction systems is out of the scope of this work.
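The first method in the list above, the running Gaussian average, is simple enough to sketch. The learning rate and deviation threshold below are illustrative assumptions; the sketch also makes visible the failure mode noted above: a person who stays still long enough is gradually absorbed into the background model.

```python
import numpy as np

def update_background(frame, mean, var, alpha=0.05, k=2.5):
    """One step of per-pixel running Gaussian background modelling.

    A pixel is foreground if it deviates from the running mean by more
    than k standard deviations; mean and variance are updated online.
    """
    diff = frame - mean
    fg = np.abs(diff) > k * np.sqrt(var)       # classify before updating
    mean = mean + alpha * diff                 # running mean update
    var = var + alpha * (diff ** 2 - var)      # running variance update
    return fg, mean, var

# A static scene, then a bright "person" blob appearing in the last frame.
rng = np.random.default_rng(0)
mean = np.full((48, 64), 100.0)
var = np.full((48, 64), 4.0)
for _ in range(20):                            # learn the empty background
    frame = 100.0 + rng.normal(0.0, 1.0, (48, 64))
    _, mean, var = update_background(frame, mean, var)
frame = 100.0 + rng.normal(0.0, 1.0, (48, 64))
frame[10:30, 20:35] = 200.0                    # foreground object appears
fg, mean, var = update_background(frame, mean, var)
```

Running the update repeatedly on the frame containing the blob would slowly pull the mean toward 200 at those pixels, illustrating why stationary diners defeat this class of methods.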

2.1.2 Motion

Methods based on motion represent the smallest group. Sometimes considered complementary to the appearance-based methods, they are useful when the latter do not work well: cases where human appearance varies due to external factors such as illumination, clothing, or contrast, but also when the objects are too small. For these reasons, detection is sometimes realized using motion information only. As an example, in [28] the authors detect human motion patterns because of their independence from differences in appearance. For each person present in two consecutive images, a pattern flow (dense optical and horizontal flow) is calculated and then a normalization scheme is applied (see Fig. 2.3). A support vector machine is then trained with the flow patterns, and the model is used for classification during detection. One big problem with these methods is that they do not support partial occlusions, since motion patterns would not be extracted correctly. Furthermore, the results obtained are worse than those of methods based on appearance or silhouette; however, given their quite low complexity, they could be used to complement them in difficult cases.
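Dense optical flow itself takes more than a few lines, but the idea of extracting per-block motion between two consecutive frames can be illustrated with a crude exhaustive block-matching search. This is a hedged stand-in for flow computation in general, not the pattern-flow scheme of [28].

```python
import numpy as np

def block_matching_flow(f0, f1, block=8, radius=3):
    """Per-block displacement between two grayscale frames by exhaustive
    SSD search: a crude stand-in for dense optical flow."""
    H, W = f0.shape
    flows = []
    for y in range(0, H - block + 1, block):
        for x in range(0, W - block + 1, block):
            ref = f0[y:y + block, x:x + block]
            best_ssd, best_d = None, (0, 0)
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block > H or xx + block > W:
                        continue  # candidate window falls outside the frame
                    cand = f1[yy:yy + block, xx:xx + block]
                    ssd = float(((ref - cand) ** 2).sum())
                    if best_ssd is None or ssd < best_ssd:
                        best_ssd, best_d = ssd, (dy, dx)
            flows.append(best_d)
    return flows
```

Running this on a frame pair where the second frame is a pure translation of the first recovers that translation for every interior block, which is the kind of pattern a motion-based classifier is trained on.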


Figure 2.2: People Detection Methods

Figure 2.3: top Human motion patterns. bottom Non-human motion patterns (taken from [28])


2.1.3 Silhouette

Methods based on silhouettes generally focus on extracting object contours and then matching them to pre-computed people models. Internally, they can be classified into methods that model the silhouette's shape and those that extract discriminative features from the silhouette (aspect ratio, ellipse fit, etc.). One of the simplest approaches is the binary shape model, where an edge image is matched to the upper-body shape by simple correlation [29]. Gavrila et al. [30] propose a more sophisticated approach called the Chamfer System, which consists of a hierarchical template-based classifier used to match distance-transformed ROIs with template shapes (see Fig. 2.4 (left)). However, these methods are mainly based on detecting the whole person (holistic) and do not deal well with partially occluded people. Wu and Nevatia [20] propose a novel scheme to overcome this. They detect partially occluded people by matching small chamfer segments (up to 12 pixels long, see Fig. 2.4 (right)), which they refer to as edgelets. Part detectors based on these features are then learned by AdaBoost, and the part responses are combined with a probabilistic voting scheme.

Figure 2.4: left Hierarchy of templates used in the chamfer system (figure from [30]). right First five edgelet features selected by AdaBoost (figure from [20])
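The chamfer matching underlying these systems can be sketched with an exact two-pass city-block distance transform: the matching score of a template at some offset is the average distance-transform value under its edge points, zero for a perfect match. This is a simplified single-template illustration of the idea, not the hierarchical system of [30].

```python
import numpy as np

def chamfer_dt(edges):
    """Exact two-pass city-block (L1) distance transform of a binary
    edge map: each pixel gets its distance to the nearest edge pixel."""
    INF = 10 ** 6
    d = np.where(edges, 0, INF).astype(np.int64)
    H, W = d.shape
    for y in range(H):                  # forward raster pass
        for x in range(W):
            if y > 0:
                d[y, x] = min(d[y, x], d[y - 1, x] + 1)
            if x > 0:
                d[y, x] = min(d[y, x], d[y, x - 1] + 1)
    for y in range(H - 1, -1, -1):      # backward raster pass
        for x in range(W - 1, -1, -1):
            if y < H - 1:
                d[y, x] = min(d[y, x], d[y + 1, x] + 1)
            if x < W - 1:
                d[y, x] = min(d[y, x], d[y, x + 1] + 1)
    return d

def chamfer_score(dt, template_pts, oy, ox):
    """Average distance of the template's edge points placed at offset
    (oy, ox); zero means a perfect match."""
    return sum(dt[oy + y, ox + x] for y, x in template_pts) / len(template_pts)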

2.1.4 Appearance

Methods based on appearance define a feature space on top of which a classifier is trained using ROIs that contain positive examples (people) and negative examples (non-people). They are divided into methods that model the person as a unique region and others that try to detect different regions and later join them into a person model.

Holistic Approaches. The first are called global representation or holistic approaches. Papageorgiou and Poggio [31] proposed to work with Haar-like features, which they called Haar wavelets, instead of the usual image intensities. The Haar wavelets compute the pixel difference between rectangular areas, as can be seen in Fig. 2.5 (left), which is then used to train a classifier consisting of quadratic Support Vector Machines (SVMs) for front- and rear-viewed pedestrians, demonstrating very good results in pedestrian, car, and face detection. Viola and Jones [10] adapted the Haar-like features by adding two more rectangles and using them in a speeded-up cascade classifier structure. The framework, referred to as cascaded AdaBoost, is based on layers of threshold-rule weak classifiers and is motivated by the


fact that the majority of windows in an image are non-people. Therefore, the cascade structure is supposed to detect almost all people while rejecting non-people as early as possible. Such features and methods have proven quite successful in people-surveillance scenarios.

One of the best state-of-the-art feature descriptors was developed in 2005 by Dalal and Triggs [3]; it is based on SIFT-inspired features called Histograms of Oriented Gradients (HOGs). These histograms of gradient orientations are computed over dense grids of uniformly spaced cells, which in the original paper consist of 8 by 8 pixels, by dividing the gradient directions into nine orientation bins. The cells are combined into square groups of four, called blocks, and in the end a contrast normalization scheme and a Gaussian mask are applied in order to weight the center pixels more heavily and improve detection accuracy. These features are passed into a linear SVM classifier trained on positive (people) and negative (non-people) image patches to generate a model which classifies new unseen image patches into people or non-people, in a sliding-window, multi-scale approach (see Fig. 2.5 (right)). These features are greatly exploited in the literature, as they have the best performance on holistic people detection tasks. Inspired by Porikli [32], who proposed a fast and efficient way of computing histograms over rectangular patches, which he called Integral Histograms, Zhu et al. [33] used HOGs as a weak rule of AdaBoost, achieving similar performance while decreasing computation time. Recent works also show how HOG feature computation can be sped up 67 times using the GPU [34]. Lastly, Tuzel et al. [35] propose a novel algorithm that uses the covariance of measures such as position, gradient orientation, and first- and second-order derivatives in different subwindows as features, together with additive logistic regression on Riemannian manifolds, achieving state-of-the-art performance.

Figure 2.5: left Different Haar wavelets and Haar features applied on a human patch (figure from [31]). right Block placed on sample image and HOG descriptor weighted by positive/negative SVM weights (figure from [3])
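The core of the HOG descriptor, per-cell histograms of unsigned gradient orientations, can be sketched as follows; block grouping and contrast normalization are omitted for brevity, and the magnitude-weighted hard binning is a simplification of the interpolated binning in [3].

```python
import numpy as np

def cell_hog(img, cell=8, nbins=9):
    """Unsigned gradient-orientation histograms over non-overlapping
    cells: the core of the HOG descriptor (no block normalization here)."""
    gy, gx = np.gradient(img.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0      # unsigned: [0, 180)
    bins = np.minimum((ang / (180.0 / nbins)).astype(int), nbins - 1)
    H, W = img.shape
    hists = np.zeros((H // cell, W // cell, nbins))
    for cy in range(H // cell):
        for cx in range(W // cell):
            b = bins[cy * cell:(cy + 1) * cell, cx * cell:(cx + 1) * cell]
            m = mag[cy * cell:(cy + 1) * cell, cx * cell:(cx + 1) * cell]
            for k in range(nbins):
                hists[cy, cx, k] = m[b == k].sum()   # magnitude-weighted vote
    return hists
```

On an image with purely horizontal intensity gradients, every vote lands in the first orientation bin, which is the behavior a HOG-based classifier exploits to encode edge structure.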

Part Based Approaches. The second group of methods are called part-based approaches; they combine the classification of various parts of the body (head, legs, arms, etc.) instead of classifying the whole person. While Mohan et al. [36] use Haar wavelets and a quadratic SVM to independently classify four human parts (head, legs, right arm, and left arm), Wu and Nevatia [37] propose to use the full body, head-shoulder, torso, and legs, and three view categories (front/rear, left profile, and right profile) for training an AdaBoost nested classifier with edgelets as features. Bayesian reasoning, with a camera-looking-down-the-plane assumption, is used to combine the different body parts. One of the most prominent papers that has


state-of-the-art performance on many object recognition datasets was developed by Felzenszwalb et al. [38]; it describes an object detection system based on mixtures of multi-scale deformable part models. Their system is based on new methods for discriminative training with partially labeled data using a latent SVM, which is a reformulation of MI-SVM in terms of latent variables and is used for data-mining of hard negative examples. The features extracted are the HOGs from [3], and the sum of the classification scores of the ROIs and six different dynamic parts gives the final classifier. Similarly, Dollar et al. [39] use a part-based scheme called Multiple Component Learning with Haar features. Both approaches avoid the task of manually annotating parts, since the parts are automatically determined by the method.

Some of the other recent works noticed during the literature research include, but are not limited to, the paper by Belongie et al. [40], which uses a Hessian-Laplace keypoint detector. A codebook is then constructed by computing shape context descriptors for each keypoint and clustering them. This idea was based on the work by Leibe et al. [41], known as the Implicit Shape Model (ISM). During detection, the keypoints are matched to a cluster, and a Hough voting scheme decides on an object hypothesis, avoiding the need for a candidate generation step. Seeman et al. [19] propose an effective and efficient multi-viewpoint detection method which handles global appearance changes caused by different articulations and viewpoints of people through a generalization of ISM, requiring few training samples.

Holistic methods are based on simplified people models and hence have a lower complexity than part-based models; however, they do not support partial occlusions or pose variations, which are captured by part-based methods and are crucial for achieving high detection accuracy. Regardless of the approach taken, HOG and Shape Contexts seem to be the best feature descriptors independent of which learning method is used (SVM, AdaBoost, etc.), and a combination of different features improves the detector's performance. Because we need a robust system that also operates in real time, we have decided to adopt the HOG feature descriptor in a holistic approach, adapted to our specific scenario. While partial occlusions and foreshortening of people are captured by a training and classification scheme for specific positions in the scene, pose variations are captured by performing training with various poses and models of synthetic humans.

2.2 People Detection using Synthetic Data

The use of HOGs as feature descriptors requires a large amount of training data. Because our approach is not generic but specific to the environment, it is difficult to obtain training data, as will be explained in the next chapter. For this reason, we explore a recent field of research: the generation of synthetic data. There have been different works which make use of synthetic data to train their classifiers. The classic example is the work by Shotton et al. [42], who propose a new method to predict the 3D positions of body joints from a single depth image extracted from the Kinect camera. What is important in this work, as related to our project, is that they use only synthetic data to generate their training examples. The synthetic depth images of people are generated through Motion Capture (MOCAP) sequences, providing a highly varied training dataset in terms of poses and articulations (see Fig. 2.6 (left)). Another recent work which makes use of synthetic training data is that of Pischulin et al. [43], where real human datasets are enriched by synthetic samples generated from SCAPE [44], covering more articulations and poses (see Fig. 2.6 (right)). The advances in computer graphics make it possible for very realistic models to be generated on real backgrounds, enriching the existing datasets and improving the overall detection performance. Lastly, Marin et al. [45] propose a method for pedestrian detection based on training sequences in virtual scenarios, where appearance-based pedestrian classifiers are learnt using HOG and a linear SVM, providing very good results. Based on the aforementioned works, we train our classifiers using only synthetic models of humans placed in real scene backgrounds.

Figure 2.6: left Kinect synthetic depth images (figure from [42]), right SCAPE examples (figure from [44])


Chapter 3

Method Description

In the previous Chapter 2, we gave an overview of recent works on people detection, focusing mainly on methods based on appearance. Taking inspiration from the Histograms of Oriented Gradients (HOG) paper [3] and the other works which use synthetic data for training, we propose a novel approach to people detection for video surveillance. We use HOG as the feature descriptor, but instead of training on generic people in generic settings, as is done in the original paper using the INRIA dataset, we train for specific positions in the scene where people could be observed walking or sitting. In order to generate training data, we use synthetic models of humans in synthetic scenes. This chapter is structured in the following way: first, we show how we calibrate the system and generate the training data; then we present how we train classifiers for specific positions; and finally we present our people detection scheme.

3.1 System Calibration

In order to generate data for training classifiers specific to our environment, we need a calibrated camera and a model of the environment. In our scenario, a monocular camera (Axis Network Camera) with a resolution of 1,024 by 768 pixels is used. The internal and external calibration is performed on site and stored in files. We use the internal camera parameters to undistort the images (which can either be static or a video stream). In order to create a 3D model of the environment, some manual measurements are made on site for simplicity, and a CAD model is created which accurately models the static objects in the scene (chairs, couches, walls, pavement, etc.). It is worth mentioning that this step could also be performed automatically, by generating point clouds of the restaurant in order to infer the 3D model. Given the scene CAD model, the calibrated camera parameters, and an image of the empty background scene, we use Autodesk Maya to create a synthetic Maya scene. An image of the CAD model and the camera is shown in Fig. 3.1. In order to align the camera to the scene, a program was written that converts the internal and external parameters into a Maya Embedded Language (MEL) script which is passed directly to Maya. Once we have the synthetic scene, we simply need to insert synthetic human models into it, as explained in the following section.
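The camera/scene alignment above relies on the standard pinhole model. As a minimal sketch, projecting a 3D world point with intrinsics K and extrinsics (R, t) looks like this; the focal length in the test values is illustrative, with the principal point placed at the center of the 1,024 by 768 image.

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection of a 3D world point X into pixel coordinates,
    given camera intrinsics K and extrinsics (R, t)."""
    x_cam = R @ X + t              # world frame -> camera frame
    x_img = K @ x_cam              # camera frame -> homogeneous image coords
    return x_img[:2] / x_img[2]    # perspective division
```

The MEL export mentioned above performs the inverse of this setup: it transfers the same K, R, and t into Maya's camera so that renders of the synthetic scene line up with real images.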


Figure 3.1: CAD model in Maya

3.2 Training Data Generation

Because we adopt a detection approach which involves training classifiers, there is a high demand for sufficient and varied training examples. Commonly, training data consist of datasets of images containing people or non-people, such as the INRIA dataset, which was created by N. Dalal mainly from pictures taken with his personal camera over the course of many years. Even though such datasets exist, they normally contain images of full-body people taken in generic settings. At the same time, they are to some extent limited in their ability to capture many people poses and articulations. As also mentioned in the introductory chapter, we train for specific scenarios, which makes it difficult to obtain real training images, especially if the camera's viewpoint changes along with the environment. For this reason, we use only synthetic people models as training data. The advantage is that we can automatically generate as much data as we want, while being able to control the variety in poses and articulations. Furthermore, we have full control over where to place the models in the scene, so that we can train classifiers at desired positions or over the whole scene.

3.2.1 Synthetic Models

With the advances in computer graphics, very realistic human models are easy to obtain nowadays. There are many websites where various human models can be downloaded and used in different scenarios and applications. In our case, we downloaded some free human models from Mixamo1 (see Fig. 3.2).

In total, ten human models were downloaded, consisting of five male models and five female models of various sizes and clothing. Because these models need to be integrated in Maya, it is desired that they already come in T-

1http://www.mixamo.com/


Figure 3.2: Mixamo characters

pose, with texture, and rigged. A script was written to align the skeleton of the Mixamo characters to the Maya human skeleton. This is an important step, since in order to obtain different poses and articulations, we pair the synthetic human models with walking and sitting animations, which depend on the Maya skeleton rig model. Maya already comes with two walking animations and one sitting animation (among others); however, we modified and lengthened the sequences to add more variation. Then, by rendering the images at specific time frames (the animations consist of approximately 100 frames each), we obtain various poses for the positive training data. Because we perform single-scale detection, as also mentioned in Sec. 1.2.4, we need to train specifically for children, as they vary greatly in size from adults. Ideally, one would download synthetic children models, because of the different head-to-shoulder and head-to-body ratios they have compared to adults; however, because of time constraints, we use the same adult models and scale them down by a factor of 0.6.

3.2.2 Tile Division

Once we have the synthetic models, we need to place them in the scene in order to generate training examples. We decided to divide the whole scene floor into areas of roughly 1 m by 1 m, which we call tiles. The size of the tile is chosen heuristically and represents the space that a person would normally occupy if projected on the floor. In order to capture as many positions in the scene as possible where a person can stand, and at the same time the fact that people might stand close to each other, we arrange the tiles such that they overlap each other by half of their surface (see Fig. 3.3). We place standing synthetic humans at every possible tile in the environment inside the camera view range. On the other hand, under the assumption that tables do not move too much, we place sitting synthetic humans only where chairs are placed in the scene, as can also be seen in Fig. 3.3. This assumption was made because of time constraints; in the future, we will train classifiers for sitting people around areas where tables and chairs might be, and not only at fixed positions. It is worth mentioning that the placement of synthetic models in the scene is an automatic process, done by utilizing MEL scripts. Furthermore, we also place synthetic models at those positions at the limits of the camera view range where people appear truncated, as in Fig. 3.4 (b).
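The overlapping tile layout can be sketched as a simple grid generator, where the half-tile stride realizes the half-surface overlap described above; the floor dimensions in any example are of course scene-specific.

```python
def tile_grid(floor_w, floor_h, tile=1.0):
    """Overlapping floor tiles: with a half-tile stride, each tile
    shares half of its surface with its neighbors."""
    stride = tile / 2.0
    tiles = []
    y = 0.0
    while y + tile <= floor_h + 1e-9:       # epsilon guards float drift
        x = 0.0
        while x + tile <= floor_w + 1e-9:
            tiles.append((x, y, tile, tile))  # (x, y, width, height) in metres
            x += stride
        y += stride
    return tiles
```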


Figure 3.3: Top view of the scene showing tile division and chair positions

Figure 3.4: (a) Synthetic human, (b) Truncated synthetic human, (c) and (d) Standing positive training patch, (e) and (f) Sitting positive training patch


3.2.3 Positive Data Generation

Here, we explain the process of positive data generation for standing and sitting people, for every tile. Since the process is exactly the same for every tile, we explain it for only one of them. In order to generate positive training data, which consist of images of synthetic people on the scene background, we use nine synthetic models and place them in eight different orientations by rotating them every 45 degrees. In order to obtain different poses and articulations, we render at each frame of the walking or standing animations respectively, and we heuristically choose 11 articulations different from each other, making a total of 792 positive training images per tile. To perform the training, however, we do not feed the whole rendered image of the scene containing the synthetic person. Instead, we extract a 2D bounding box of fixed size around the synthetic person. One would normally apply contour extraction techniques around the synthetic people and then extract a bounding box; however, because the models vary in size, making it difficult to extract a fixed-size bounding box, we propose a better and faster method. We create a 3D capped cylinder which completely encapsulates all of our models and place it at every tile. Then we render it separately from the human model, and we extract its contours and bounding box (see Fig. 3.5). The bounding box is used to extract from the rendered synthetic image only the patch that is fed as a training example to the SVM classifier (see Fig. 3.4 (c) and (d)). We perform the same steps for the children models, rescaling the capped cylinder to the children's size. For sitting people, we extract bounding boxes only around the chairs and couches. These bounding boxes come in two sizes (for adults and children) for every sitting position.
We use nine synthetic models, 40 articulations, and two sitting elevations above the chair, making a total of 720 positive training image patches for each sitting position (see Fig. 3.4 (e) and (f)).
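The positive-example counts above follow directly from enumerating all rendering combinations, which can be checked with a few lines:

```python
from itertools import product

def render_jobs(models, orientations, articulations):
    """Every (model, orientation, articulation) combination rendered
    as one positive training image for a tile."""
    return list(product(models, orientations, articulations))

# 9 models x 8 orientations (45-degree steps) x 11 articulations = 792
standing = render_jobs(range(9), range(0, 360, 45), range(11))
# 9 models x 40 articulations x 2 sitting elevations = 720
sitting = render_jobs(range(9), range(40), range(2))
```

In practice, each such tuple corresponds to one MEL-driven render of the Maya scene.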

Figure 3.5: left Capped cylinder encapsulating the human model, right Bounding box extracted


3.2.4 Negative Data Generation

Along with the positive training data, we also need to generate negative training data, which consist of images containing non-people. For that, we pick random synthetic object models (e.g. trolleys, chairs, dogs, cats, etc.) and automatically place and scan them through the whole image. This process ensures that we get a variety of objects that might be present in the restaurants. At the same time, the fact that we scan through the whole image, and not only where tiles are, creates negative training data containing only partial non-people objects, hence enriching the negative dataset. In this way, using the same tile bounding box size as during the positive training example generation phase, we create 5,000 negative training image patches per tile, consisting of either empty backgrounds or partial and full object models on the background (see Fig. 3.6).
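Scanning objects over the whole image amounts to enumerating window origins on a regular grid; a sketch follows, where the scan stride is an assumed free parameter — only the resulting total of 5,000 patches per tile is fixed above.

```python
def negative_patch_origins(img_w, img_h, box_w, box_h, stride):
    """Top-left corners of fixed-size windows scanned over the whole
    image, from which negative patches (empty background or partial
    objects) are cut."""
    return [(x, y)
            for y in range(0, img_h - box_h + 1, stride)
            for x in range(0, img_w - box_w + 1, stride)]
```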

Figure 3.6: Examples of negative training patches. (a) Part of a trolley, (b) Part of a dog, (c) Empty background, (d) Whole ball

3.3 Per Tile Classifiers Training

Having generated the positive and negative training data, we can finally train the people/non-people classifiers at each tile. We perform four different trainings: standing adults vs. non-adults, standing children vs. non-children, sitting adults vs. non-adults, and sitting children vs. non-children. Training for standing people is performed at every tile in the scene, while training for sitting people is performed only where chairs and couches are. The training process consists of two phases: per-tile feature extraction from an automatically downscaled fixed-size box, and training of the features with a linear SVM. The whole training process for a standing person and a specific tile is depicted schematically in Fig. 3.7. Like the data generation process, this process is also completely automatic.

3.3.1 Features Extraction with HOG

As also mentioned in Chapter 2, Histograms of Oriented Gradients (HOG) are state-of-the-art feature descriptors extracted from image patches, which, in combination with support vector machines, provide some of the best people detectors. They are computed over dense grids of uniformly spaced cells, concatenating the obtained feature vector for each cell, and then normalizing over groups of cells called blocks (see Fig. 3.8 (left)).


Figure 3.7: Overview schematic of the training process for a standing person and a specific tile

The features consist of oriented gradients of an image region, which are divided into orientation bins ([3] proposes a division into nine orientation bins). There are two variations of HOG: one divides the computed gradients into rectangular blocks (R-HOG), while the other divides them into circular blocks (C-HOG). We adopt the R-HOG scheme. A cell normally consists of 8 by 8 pixels, while a block is divided into 2 by 2 cells (i.e. a block consists of 16 by 16 pixels). Considering a block stride of 8 pixels, a window of 64 by 128 pixels (the standard window used in [3]) then consists of 7 by 15 overlapping blocks. Because there are nine orientation bins for each cell, and a block has four cells, the length of the feature vector for a block is 36. The length of the feature vector for a whole window can be computed as (number of features per block × number of blocks per window), which for a window of 64 by 128 pixels is 3,780. During detection, this window is scanned through the whole image in a sliding-window fashion over multiple scales of the image.
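The window arithmetic above can be verified programmatically; the following sketch reproduces the 3,780 figure from the stated cell, block, and stride sizes.

```python
def hog_length(win=(64, 128), block=(16, 16), stride=(8, 8),
               cell=(8, 8), nbins=9):
    """Feature-vector length of an R-HOG window from its geometry."""
    bx = (win[0] - block[0]) // stride[0] + 1   # blocks across: 7
    by = (win[1] - block[1]) // stride[1] + 1   # blocks down: 15
    cells_per_block = (block[0] // cell[0]) * (block[1] // cell[1])  # 4
    return bx * by * cells_per_block * nbins    # 7 * 15 * 4 * 9
```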

In our case, we do not train a general classifier. Instead, as mentioned in the previous section, we train specifically for different positions in the scene. Because the bounding boxes vary in size and aspect ratio, we cannot expect to have the same window for all; however, we constrain the window size to be smaller than or equal to the standard 64 by 128 pixels by rescaling. The main reason is the time it takes to compute the features: we use the CPU implementation of HOG in OpenCV, and extracting from windows that have more than 3,780 features makes the detection process no longer real time. This speed issue could be overcome by using a more efficient implementation, like Integral HOG or GPU HOG; however, the rescaling does not seem to have any effect on the detector's performance. One detail is worth mentioning regarding the generation of the rescaled bounding boxes: the aspect ratio of the person in a specific box must not change when the box is rescaled. Furthermore, because the cells and blocks are all multiples of eight pixels, the detection window width and height also need to be multiples of eight. For this reason, we pad the initial bounding boxes extracted from the contours of the capped cylinders with extra pixels from the background, so that when rescaled they fulfill the above constraints of multiplicity and size.
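The padding step amounts to rounding each box dimension up to the next multiple of the cell size; the helper below is an illustrative sketch of that rule.

```python
def pad_to_multiple(w, h, m=8):
    """Grow a bounding box with background pixels until both sides are
    multiples of the HOG cell size m; the aspect of the person inside
    is untouched, since we only add padding."""
    return w + (-w) % m, h + (-h) % m
```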


Figure 3.8: left HOG feature vector pipeline (taken from [46]), right SVM classifier

3.3.2 Training with Linear SVM

Support vector machines are supervised learning models used for classification and regression tasks [4]. Trained on positive and negative data from two different categories (e.g. people vs. non-people, apples vs. bananas, etc.), the basic SVM builds a model that assigns new unseen data to one category or the other, making it a non-probabilistic binary classifier. The model first maps the training examples as points in space and then tries to find a hyperplane as a decision function which best separates the two categories. The best separation is achieved by finding the hyperplane that creates the widest gap between the closest examples of either category (see Fig. 3.8 (right)). New data are then mapped into the same space and, depending on which side of the gap they fall on, are predicted to belong to one category or the other. The simplest type of SVM is the linear SVM, which performs a linear separation of the data. Sometimes the data are not easily separable by a linear method; in these cases, the examples are mapped into a higher-dimensional feature space, where they are better separable, using the so-called kernel trick, making the SVM a non-linear classifier. Such kernels include, but are not limited to, the quadratic, polynomial, RBF, and sigmoid kernels.
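For illustration, a linear SVM of this kind can be fitted with simple stochastic subgradient descent on the primal hinge loss. This toy trainer is a stand-in for SVMLight, and its learning rate and regularization constant are arbitrary choices.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200, seed=0):
    """Primal hinge-loss linear SVM fitted by stochastic subgradient
    descent (a toy stand-in for SVMLight); labels y must be in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            if y[i] * (X[i] @ w + b) < 1:          # margin violated
                w = (1 - lr * lam) * w + lr * y[i] * X[i]
                b += lr * y[i]
            else:                                   # only regularize
                w = (1 - lr * lam) * w
    return w, b
```

The returned pair (w, b) plays exactly the role of the weight vector and bias described in the next section.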

In this work, in order to train the per-tile classifiers, we use the SVMLight package by Thorsten Joachims, which is available on the web2. For every tile, we extract the feature vectors from each training patch (positive and negative), along with the class they belong to, and save them all in one file used to build the SVM model for that tile. This model is then converted through a script to a weight file, which is used during detection to categorize the boxes into person or non-person by the following formula,

f(x) = w^T x + b,    (3.1)

where w is the weight vector, b is the SVM bias, and x is the input feature vector of the image patch being classified. For a box of 64 by 128 pixels, the weight file consists of 3,781 elements, of which 3,780 are the

2http://svmlight.joachims.org/


feature weights and the last one is the bias. This weight file is used to set the HOG SVM detector in OpenCV, replacing the default one trained on the INRIA dataset. In this work we use a linear kernel for the SVM; we set the parameters to their defaults and the regularization parameter C to 0.0096 (see [47]). The main reasons for using a linear SVM are efficiency and speed; moreover, the authors of [3] also use a linear SVM, with a regularization parameter C set to 0.01. The efficiency of linear classifiers in training and detection is related to the fact that a linear classifier uses the training data only to learn the weight vector w. Afterwards, the training data are discarded, since only the weights are needed to classify new data, while a non-linear classifier also needs to carry them, which can be costly in terms of storage. As mentioned in Sec. 3.2.3 and Sec. 3.2.4, we use around 750 positive and 5,000 negative training examples, which roughly follows the positive-to-negative ratio used to train the default HOG detector (1,200 positive vs. 12,000 negative). It is worth mentioning that, besides trying different ratios between positive and negative training data, we also tried changing the backgrounds behind the positive models, using combinations of them, and trying different models. However, the best results were observed when using the same background and increasing the variety of poses, as well as increasing the number of synthetic human models to nine; a further increase in the number of models did not yield a big improvement in the qualitative results.
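The use of the exported weight file at detection time can be sketched as follows: splitting off the bias and evaluating Eq. 3.1 is all that is needed to score a patch. The file layout (3,780 weights followed by the bias) is as described above; the thresholding at zero is the standard decision rule.

```python
import numpy as np

def load_detector(weight_file_values):
    """Split the exported weight file (3,781 values for a 64 x 128
    window) into the weight vector w and the bias b."""
    v = np.asarray(weight_file_values, dtype=np.float64)
    return v[:-1], v[-1]

def classify(w, b, x, threshold=0.0):
    """Evaluate f(x) = w.x + b (Eq. 3.1); the patch 'fires' when the
    score exceeds the decision threshold."""
    score = float(w @ x + b)
    return score, score > threshold
```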

3.4 People Detection

Having pre-computed all the weight files for standing and sitting people, we use a per-tile single-scale approach to find people in the image. We go over each tile and extract the respective image patch, delineated by a pre-computed 2D bounding box, as explained in Sec. 3.2.3. This bounding box is then downscaled, applying the same scheme as in Sec. 3.3.1, and the HOG features are extracted. These features are then combined with the weight vector and the bias parameter, as in Eq. 3.1, and based on the outcome, the 2D bounding box is classified as firing or non-firing (containing a person or not). In addition, we can extract the score of the classification, which expresses the confidence of a classification or misclassification. In feature space, the score represents the distance of the mapped example from the dividing hyperplane. This score is very helpful, not only because it can be used in a non-maximum suppression scheme to remove false boxes (false positives) which occur in the vicinity of people, but also because it can be utilized to distinguish between sitting and standing people in cases where two detections occur in the same region. The whole detection process is depicted in Fig. 3.9.

3.4.1 Per Tile Confidence Computation

Figure 3.9: Overview schematic of the whole detection process

Computing the confidence in our scenario is a bit tricky. Because we have multiple classifiers, implying the existence of multiple hyperplanes, the score cannot directly be used to compare one classifier to another: it is just a floating point number and has no absolute meaning. It is subjectively related only to the classifier and hyperplane it represents, which implies that a higher score for one specific tile over a nearby one does not really mean a higher chance of a person being present. For this reason, we needed to standardize the scores. The approach we adopted was to compute, for each tile, the median (or max) of all the scores obtained from the positive training examples for a specific classifier. In this way, a new comparable detection score for a specific tile is computed over the median tile score as in,

score = relative score / median score,   (3.2)

guaranteeing that all the scores are in the same range [0, 1]. We compute this only once for every tile and save the results. The process is the same for sitting people and children.
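The standardization of Eq. 3.2 can be sketched as follows; the median would be computed once per tile over that tile's positive training scores and cached (function names are ours, not the thesis code):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Median of the scores obtained from the positive training examples of one tile.
// Taking the vector by value lets std::sort work on a local copy.
double medianScore(std::vector<double> scores) {
    std::sort(scores.begin(), scores.end());
    std::size_t n = scores.size();
    return (n % 2 == 1) ? scores[n / 2]
                        : 0.5 * (scores[n / 2 - 1] + scores[n / 2]);
}

// Eq. 3.2: a raw detection score divided by the tile's precomputed median
// score becomes comparable across tiles and classifiers.
double normalizedScore(double rawScore, double tileMedian) {
    return rawScore / tileMedian;
}
```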

3.4.2 Non Maximum Suppression

After classifying all the boxes from each tile into firing and non-firing, we notice that there are more boxes than people in the image. This happens mainly in the neighborhood of the tile where a person might be standing, or when multiple people are close to each other. In order to suppress the boxes which are false positives, we try two different non-maximum suppression schemes.

Over Tiles

The first non-maximum suppression scheme that we tried was based on the reasoning that if two tiles fire next to each other, then the true detection should be on the tile that has the biggest score. In this way, for each tile, if any of the immediate neighbors has a bigger score, the tile is suppressed and the box is set to non-firing; otherwise it is considered a valid box. A maximum of eight comparisons are made for every tile, depending on where it resides on the 2D tile grid (see Fig. 3.10 (left)). It is noteworthy that a larger neighborhood could be searched (e.g. including the neighbors of the immediate neighbors); however, this could potentially remove valid boxes if two people stand very close to each other. One might alter the neighborhood parameter to trade off between higher precision and higher recall (see Sec. 5.4.1).
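The over-tiles scheme can be sketched as a scan over the tile grid, where a firing tile survives only if none of its up-to-eight immediate neighbors scores higher (a hypothetical helper; the thesis operates on its own tile data structures, and here scores <= 0 mark non-firing tiles):

```cpp
#include <cassert>
#include <vector>

// For each firing tile (score > 0), keep it only if no 8-connected neighbor
// on the 2D tile grid has a strictly higher score.
std::vector<std::vector<bool>> suppressOverTiles(
        const std::vector<std::vector<double>>& score) {
    int rows = static_cast<int>(score.size());
    int cols = rows ? static_cast<int>(score[0].size()) : 0;
    std::vector<std::vector<bool>> keep(rows, std::vector<bool>(cols, false));
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            if (score[r][c] <= 0.0) continue;   // non-firing tile
            bool best = true;
            for (int dr = -1; dr <= 1 && best; ++dr)
                for (int dc = -1; dc <= 1; ++dc) {
                    int nr = r + dr, nc = c + dc;
                    if ((dr || dc) && nr >= 0 && nr < rows &&
                        nc >= 0 && nc < cols && score[nr][nc] > score[r][c]) {
                        best = false;           // a neighbor scores higher
                        break;
                    }
                }
            keep[r][c] = best;                  // valid detection box
        }
    return keep;
}
```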


Figure 3.10: Non-maximum suppression schemes. Left: over tiles. Right: IOU over 2D boxes

Over 2D Boxes

A second non-maximum suppression scheme was implemented, which is more standard in the computer vision literature and is also used in the PASCAL VOC challenge. It is based on the intersection over union (IOU) between two boxes A and B,

IOU(A, B) = area(A ∩ B) / area(A ∪ B),   (3.3)

In the beginning, all the boxes are sorted by their scores in descending order. Then, starting from the first box, which has the highest score, the intersection over union with the remaining boxes is computed, and the boxes that have an IOU greater than 0.5 are suppressed and set to non-firing. This process is repeated with the box having the next highest score, until the only boxes remaining have an IOU smaller than 0.5 with one another. We noticed that this scheme is more effective than the first one in removing false boxes.
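This greedy, score-sorted suppression can be sketched as follows, assuming a simple axis-aligned box layout (a simplification of the thesis code; names are ours):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Box { double x, y, w, h, score; };

// Eq. 3.3: intersection area over union area of two axis-aligned boxes.
double iou(const Box& a, const Box& b) {
    double ix = std::max(0.0, std::min(a.x + a.w, b.x + b.w) - std::max(a.x, b.x));
    double iy = std::max(0.0, std::min(a.y + a.h, b.y + b.h) - std::max(a.y, b.y));
    double inter = ix * iy;
    return inter / (a.w * a.h + b.w * b.h - inter);
}

// PASCAL-VOC-style NMS: keep boxes in descending score order, suppressing any
// box whose IOU with an already-kept box exceeds the threshold.
std::vector<Box> nmsIou(std::vector<Box> boxes, double thresh = 0.5) {
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& b : boxes) {
        bool suppressed = false;
        for (const Box& k : kept)
            if (iou(b, k) > thresh) { suppressed = true; break; }
        if (!suppressed) kept.push_back(b);
    }
    return kept;
}
```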

3.4.3 Grouping

After applying the first non-maximum suppression there might still be some boxes remaining. This is due to the fact that the suppression is performed only around the immediate neighbors, implying that there might be boxes of farther tiles that still include some body parts and might fire as positive. While these boxes are false positives and could be removed by improving the negative training data with body parts of a synthetic human model, at this stage they are handled by a 2D box weighted averaging and grouping step. This is similar to the post-processing step performed after the multi-scale detection in the sliding window default HOG approach. A weighted average and grouping of the boxes is done by setting two parameters, the boxes' proximity and relative size. In the end the expected results look like Fig. 3.11.


Figure 3.11: Weighted averaging and grouping scheme. (a) Initial boxes around the person, (b) Remaining single box

3.4.4 Standing vs Sitting

After applying the non-maximum suppression schemes (note that they are applied only to standing people), we are left with bounding boxes around standing and/or sitting people. Because HOG puts more weight on the head and shoulders zone (as can be seen in Fig. 2.5 (right)), there are cases where standing boxes fire along with sitting boxes where sitting people are, and vice versa. In order to overcome this problem, we apply a small non-maximum suppression scheme only around the sitting places. The area overlap between the sitting boxes and the neighboring standing boxes is computed, and only if it is bigger than a certain threshold (which we set to 0.9) is the box with the lowest score discarded. In this way, we are able to remove many false positives which appear around sitting places. The results of this step are presented in Sec. 5.4.1.


Chapter 4

Implementation

In this chapter, we present an overview of the hardware, software and scripts used in this project, as well as the implementation details. We first present the hardware, then the software and packages used. In the end, we present the structure of the implementation along with an illustration of the Graphical User Interface.

4.1 Hardware

During this project, we have used two machines. The first machine runs the Windows 7 operating system and has a 3.0 GHz Intel Core i7 processor with 12 GB of RAM. Its main use is for generating synthetic training data with the Autodesk Maya 2013 software, which runs on Windows. The second machine runs the Ubuntu 10.04 Linux operating system and has a 2.55 GHz Intel Core 2 Quad processor. The whole detection and training was performed on this machine.

We need to note that our detection and evaluation results were based on a dataset of images extracted from real footage of the environment; however, for completeness, we would like to mention that the camera used was an Axis Network Monocular Color Camera with a resolution of 1,024 by 768 pixels, as presented in Fig. 4.1.

4.2 Software and Tools

The 3D environment modeling, synthetic model placement, training data generation and rendering were performed with the animation and rendering software Autodesk Maya 2013 in Windows 7. The scripting language of Maya, called MEL, was used for automatic placement and rendering. In addition, we used the Avisynth tool in order to produce videos.

The implementation is mainly done using the programming language C++ on Linux Ubuntu 10.04 LTS, while sometimes resorting to Python scripting. For some of the programs we use ROS (Robot Operating System) Electric, which provides libraries and tools to help software developers create robot applications. One of the libraries that comes along with ROS, and which we use frequently throughout the project, is the OpenCV library. This library provides an extensive framework for computer vision, image analysis and machine learning methods and has some state-of-the-art algorithms implemented in it. We use OpenCV version 2.4.3 and, because we only use the CPU implementation of HOG, there is no need to build OpenCV with GPU support. Besides including these necessary packages, ROS also offers a simple way to create packages and maintain make files.

Figure 4.1: Axis Network Color Camera

1 http://www.autodesk.com/products/autodesk-maya/overview
2 http://avisynth.org/mediawiki/Main_Page

In order to train the classifiers, we use the SVMlight package by Thorsten Joachims; however, since the classifiers are computed offline, we do not integrate it in the C++ code. Instead, we use the package executables to read files of training features which we create, and we use a Python script to convert the model built by the SVM into a weight file that we read during detection. We implement a Graphical User Interface (GUI) using the Qt 4.8 library and Qt Designer, a well documented, multi-platform framework that is easy to integrate in C++ code. In order to plot evaluation results, we use the Gnuplot Linux tool.

4.3 Code Structure

We would like to divide the code structure into three groups. The first group consists of scripts written in the Maya scripting language MEL, in order to automatically place synthetic models into the scene and render them. Below is a list with a short description of the scripts, which are used to:

- Perform skeleton matching of the downloaded models' skeletons with the Maya skeleton

3 http://www.ros.org/wiki/
4 http://svmlight.joachims.org/
5 http://qt.nokia.com/
6 http://www.gnuplot.info/


- Bake the standing and sitting animations to the converted skeleton models

- Load, place and move standing models (adults and children) in the tiles

- Load, place and move capped cylinders (for adults and children) in the tiles

- Load, place and move sitting models (adults and children) in the chairs and couches

- Load, place and move random downloaded synthetic objects in the whole scene

- Render (with and without scene background) and save images properly into folders

The second group consists of programs written using the ROS package framework. Because of the need to write script-like C++ programs, and also to try different methods (especially in the beginning of this thesis), the current ROS package consists of single executables. Below we briefly list some of the important programs, which can be used to:

- Automatically convert camera calibration parameters into Maya-readable MEL scripts

- Compute three different background subtraction methods from monocular cameras

- Un-distort and save images from the dataset

- Project the tiles and the plane of our scene into the 2D image in order to visualize detector accuracy

- Perform training with different backgrounds

- Generate positive and negative data for training from the images rendered with Maya

- Detect standing vs sitting people by applying the first non-maximum suppression scheme and grouping

- Compute the standard non-maximum suppression scheme from PASCAL VOC on bounding boxes read from files

- Read, write and split annotation files for further processing

- Compute Precision-Recall (P-R) curves

- Extract silhouettes and apply morphological operations

Afterwards, we implemented our final code in a modular and flexible fashion as a standalone application (independent from ROS). The code consists of the main Image class, which contains image properties, and three main modules that communicate with it. The modular implementation allows removing and adding individual modules, which at the moment consist of a detection, a training and an annotation module. On top of that, a GUI is implemented, which communicates with the modules and the main class through wrapper classes.


4.3.1 Detection Module

The detection module is responsible for all the detection functions. By setting parameters, it can initialize detectors on the whole image or parts of it. In terms of detection strategies, two main detectors are implemented. One is based on the sliding window multi-scale approach and the other is based on our new strategy of a per-tile single-scale approach. The first detector is trained on the INRIA dataset, while the second one can be divided into two further groups. One group of detectors goes over all tiles to detect standing people, while the other one detects sitting people only in pre-specified chairs and couches. Each of these detectors is divided further into a children and an adult detector. There is also a detector which combines both sitting and standing detection through the non-maximum suppression schemes. Further, the number of backgrounds used for training any of the detectors can be specified. We provide the option to use the single scene background, the scene background plus a white background, as well as many randomly picked backgrounds.

Because detection is strongly linked with display, this module serves as a display module too. It manages visual data outputs in the graphics scene of the GUI, which mainly consist of color images and superimposed detections. The detection display mode consists of either 2D bounding box detections or blobs projected on the plane of the scene, which is in turn projected into the 2D image. In addition, this module can be used to go through the images in a video mode, by going back and forth at a desired speed (fps), and also to select specific images from a folder. It performs image conversion from OpenCV image formats into Qt image formats and can also be used to save images, with or without detections, in specified folders.

4.3.2 Training Module

The training module, on the other hand, is responsible for all the trainings on the Maya-generated training data, as well as for the pre-computation of the confidences into files. This module consists of four different trainers, for sitting and standing adults and sitting and standing children, each coupled with the confidence computations. Several options can be set, like the number of chairs for sitting or the number of tiles for standing to be used for training. Further, we can set the number of backgrounds and of positive and negative synthetic models we would like to train with.

4.3.3 Annotation Module

The annotation module was implemented when the need arose to manually annotate ground truth data. It enables the user to manually label real humans in the images through mouse clicks and keyboard commands, and to automatically dump the annotations into files. There are two methods for manually labeling the data. One way is by clicking at a certain point and dragging the mouse until the human is included in a bounding box. The other way is by placing the center of a bounding box with a pre-specified human size and then adjusting the width and the height with the keyboard. The output is then saved into a specified *.txt file in the following output format:

<imagenumber> <label> <center.x> <center.y> <width> <height>

There are three possible labels, for standing adults, standing children and sitting people. The saved *.txt files are further processed through scripting programs.
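A line of this format could be parsed as in the following sketch (a hypothetical helper, not the thesis code; the label encoding is assumed to be an integer):

```cpp
#include <cassert>
#include <sstream>
#include <string>

// One annotation record of the *.txt format:
// <imagenumber> <label> <center.x> <center.y> <width> <height>
struct Annotation {
    int imageNumber;
    int label;   // e.g. standing adult, standing child, sitting person
    double centerX, centerY, width, height;
};

// Returns true if the line contained all six whitespace-separated fields.
bool parseAnnotation(const std::string& line, Annotation& out) {
    std::istringstream iss(line);
    return static_cast<bool>(iss >> out.imageNumber >> out.label
                                 >> out.centerX >> out.centerY
                                 >> out.width >> out.height);
}
```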

4.4 Graphical User Interface

The main reason for including the Graphical User Interface (GUI) was to ease the visualization and debugging of our methods, while being able to set parameters for each of the modules, as well as to perform the manual annotation of the ground truth evaluation data. It is implemented using the Qt library and Qt Designer, and it consists of a main window for display on the left and either of the modules (or all of them together, accessed through different tabs) on the right. The following figures display our three different modules (Fig. 4.2 - 4.4).

Figure 4.2: Screenshot of the detection module with the main window.


Figure 4.3: Screenshot of the training module.

Figure 4.4: Screenshot of the detection module with the 3 labels in the main window.


Chapter 5

Evaluation and Results

In this chapter, we first introduce the dataset used for generating qualitative and quantitative results. Then, we present the accuracy of our own per-tile detector using a state-of-the-art evaluation method, compared to detectors trained on different datasets or with different methods. In the end, we conclude with efficiency and speed measurements and compare our method to sliding window multi-scale approaches.

5.1 Datasets

The dataset used during this thesis was obtained from real footage of the restaurant and it contains around 8,000 images. The resolution of the images was 1,024 by 768 pixels; however, both for training and detection they were rescaled, keeping the aspect ratio, to 640 by 480 pixels. While qualitative results were observed over the entire dataset, quantitative results were generated over a subset of it. This subset is chosen carefully over consecutive images from the real footage, so that it contains various scenarios that can occur in a restaurant (e.g. people entering or leaving the restaurant alone or in groups, guests sitting at the tables or on the couches, waiters serving food and talking to the guests, etc.). In order to generate quantitative results, we created ground truth images by hand using the annotation module of our GUI, as explained in Sec. 4.4. We annotated standing adults separately from standing children and sitting adults/children, as can be seen in Fig. 4.4. More detailed information on the number of annotations made is presented in Tab. 5.1.

Table 5.1: Evaluation Details

Details                              Amount
Video frames evaluated               1,430
Annotations of standing adults       2,415
Annotations of sitting people        5,880
Annotations of standing children     50

During the annotation process, it was confirmed that people fit perfectly inside the bounding boxes centered on them, and the relative size of the bounding box around the person is annotated based on the standards used in recognition challenges such as the PASCAL VOC challenge. At the same time, annotations were not only made in cases where the entire person could be seen, but also in very hard cases, such as extreme occlusions or waiters bending and staying in a sitting position outside of the chairs and couches on which training for sitting positions is done. As can be noticed from Tab. 5.1, there are more than twice as many sitting annotations as standing annotations, for the obvious reason that once guests sit, they stay at those positions for extended periods of time. Unfortunately, the number of children annotations is too small to draw conclusions, hence we have not evaluated the standing children detector.

5.2 Training Methods

In order to evaluate the performance of our per-tile single-scale detector against the state of the art, we compare it with two training methods. The first method is the default one, which is trained using windows of 64 by 128 pixels on the INRIA dataset, with HOG as the feature descriptor. The number of positive training samples is 1,200, while that of negative training samples is 12,000. OpenCV already provides the default feature weights, hence no training of our own is needed. During detection, a sliding window multi-scale approach is used, followed by a weighted averaging and grouping scheme similar to ours (see Sec. 3.4.3). Because the grouping parameters are chosen heuristically, in order to make a fair comparison we do not perform grouping in any of the compared methods. Instead, we apply to all of them the same non-maximum suppression scheme explained in Sec. 3.4.2.

Comparing only to the method which trains HOG on the INRIA dataset is not sufficient. Because the INRIA dataset consists of generic pedestrians in generic settings, training HOGs on our specific scenario's image dataset was necessary in order to perform a thorough evaluation. We use our manually annotated ground truth data as training data by performing two-fold cross validation on three different splits. At this point, it is worth mentioning that the performance comparison is only done for standing detections, for two reasons. One reason is that sitting detections are only performed at a few specific positions in the scene, leading to a high precision curve which is not very informative. The second and most important reason is that HOG trained on the INRIA dataset is only trained on standing people. We believe that performing cross validation on three different splits is enough to avoid bias in our results. Furthermore, as can be seen from Tab. 5.1, the number of standing annotations is 2,415. By dividing it in half, we get around 1,200 positive training samples, which is the same as the number of samples used to train on the INRIA dataset, and the remainder is left for testing. We then perform three trainings, varying the subset of the annotation set used for training and testing, as follows:

- Training performed on the first half and tested on the second half

- Training performed on the second half and tested on the first half


- Training performed on every second annotation and tested on the remaining ones

5.2.1 Positive data generation

It is not sufficient to simply take the ground truth image patches and use them as positive training data. An additional processing step is needed, because the bounding boxes are not all of the same size and aspect ratio. We standardize the window size to 64 by 128 pixels by first padding with extra pixels from the background until an aspect ratio of 1 to 2 is reached, and then rescaling.
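The padded size before rescaling can be computed as in this sketch (a hypothetical helper; the integer rounding is our choice, and the actual padding of pixels from the background is omitted):

```cpp
#include <cassert>
#include <utility>

// Given an annotated box of w x h pixels, return the (width, height) it is
// padded to so that the aspect ratio becomes 1:2; the padded patch is then
// rescaled to 64 x 128 pixels.
std::pair<int, int> padToAspect(int w, int h) {
    if (h >= 2 * w)
        return {(h + 1) / 2, h};   // too tall/narrow: pad the width up to h / 2
    return {w, 2 * w};             // too wide/short: pad the height up to 2 * w
}
```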

5.2.2 Negative data generation

As for the negative training data, we use 10 times more patches than positive training data, following the style of [3]. In order to generate negative samples, we run through the images and randomly extract 10 patches per image. Because the detection is performed multi-scale, we first pre-compute the possible scales for a window of size 64 by 128 pixels in an image of size 640 by 480 pixels. In addition to randomizing the position in the image where we take the negative data, we also randomize the scales of the patches extracted. After extraction, they are rescaled to 64 by 128 pixels. Because we extract negative samples from images with people in them, to avoid including people patches, for each random patch we compute the IOU with the ground truth patches, and we accept it as a negative sample only if the IOU is smaller than 0.3. We continue extracting negatives until we reach a total of 12,000 samples.
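The acceptance test for a candidate negative patch can be sketched as follows (hypothetical helper names; the random placement and scale selection are omitted):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Rect { double x, y, w, h; };

// Intersection over union of two axis-aligned rectangles.
double rectIou(const Rect& a, const Rect& b) {
    double ix = std::max(0.0, std::min(a.x + a.w, b.x + b.w) - std::max(a.x, b.x));
    double iy = std::max(0.0, std::min(a.y + a.h, b.y + b.h) - std::max(a.y, b.y));
    double inter = ix * iy;
    return inter / (a.w * a.h + b.w * b.h - inter);
}

// A randomly drawn candidate patch is accepted as a negative sample only if
// its IOU with every ground truth (person) box stays below 0.3.
bool isValidNegative(const Rect& candidate,
                     const std::vector<Rect>& groundTruth) {
    for (const Rect& gt : groundTruth)
        if (rectIou(candidate, gt) >= 0.3) return false;
    return true;
}
```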

5.2.3 Hard training

Having generated the positive and negative training data, we train a classifier for each of the three dataset splits and then evaluate the detection performance in a sliding window multi-scale approach, only computing the non-maximum suppression scheme without weighted averaging and grouping. In order to improve the performance of the detector, we adopt a popular approach in the machine learning community called hard training. Hard training, also known as mining of hard negatives, runs over images not containing the category to be detected (here, people) and tries to find falsely firing boxes with a high confidence, which determine the hyperplane. It then uses these hard negatives, along with the original negative data samples (subsampling if needed), and retrains the classifier on the new training data. We perform this step for each of the three classifiers; however, it is noteworthy that we do not perform this step on our per-tile classifiers.


5.3 Evaluation method

In order to evaluate and compare the performance of our detector, we use a state-of-the-art evaluation method for object detection, the Precision-Recall curve (P-R curve). The precision value represents the relation between true detections and all detections obtained for a test dataset. In other words, in our case the precision value is the ratio between the correct people detections (true positives) and all detections, with all detections being nothing else but the sum of correct and incorrect detections. The recall value, on the other hand, represents the relation between the detected objects and the total number of objects, with the total number of objects being the number of ground truths in the test dataset. Precision and recall are computed through:

Precision = TP / (TP + FP),   (5.1)

Recall = TP / N,   (5.2)

N = TP + FN,   (5.3)

where true positives (TP) is the number of correct detections, false positives (FP) the number of wrong detections, N the total number of objects, and false negatives (FN) the number of missed detections. Both precision and recall are real numbers in the range from 0 to 1.

In order for a detection to be considered a true positive, we use a standard method from the PASCAL VOC recognition challenge. First, the IOU of every detected box with the ground truth boxes is computed, and if its value is greater than 0.5, the box is considered a true positive. Otherwise, it is considered a false positive. In addition, it must be mentioned that only one-to-one assignments of detected boxes and ground truth boxes are accepted, assigning in this way a ground truth to only one detection. The remaining ground truths that do not find a match are considered false negatives.
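The matching and the resulting precision and recall of Eqs. 5.1-5.3 can be sketched as follows, using a greedy, highest-score-first one-to-one matcher (helper names and box layout are ours, not the thesis code):

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

struct Det { double x, y, w, h, score; };

// Intersection over union of two axis-aligned boxes.
double boxIou(const Det& a, const Det& b) {
    double ix = std::max(0.0, std::min(a.x + a.w, b.x + b.w) - std::max(a.x, b.x));
    double iy = std::max(0.0, std::min(a.y + a.h, b.y + b.h) - std::max(a.y, b.y));
    double inter = ix * iy;
    return inter / (a.w * a.h + b.w * b.h - inter);
}

// Match detections (highest score first) one-to-one to ground truth boxes with
// IOU > 0.5; matched detections are TP, the rest FP, unmatched ground truths FN.
// Returns (precision, recall).
std::pair<double, double> evaluate(std::vector<Det> detections,
                                   const std::vector<Det>& groundTruth) {
    std::sort(detections.begin(), detections.end(),
              [](const Det& a, const Det& b) { return a.score > b.score; });
    std::vector<bool> matched(groundTruth.size(), false);
    int tp = 0;
    for (const Det& d : detections)
        for (std::size_t g = 0; g < groundTruth.size(); ++g)
            if (!matched[g] && boxIou(d, groundTruth[g]) > 0.5) {
                matched[g] = true;   // each ground truth matches only once
                ++tp;
                break;
            }
    double precision = detections.empty() ? 1.0 : double(tp) / detections.size();
    double recall = groundTruth.empty() ? 1.0 : double(tp) / groundTruth.size();
    return {precision, recall};
}
```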

In order to generate the curves, the detected boxes, after being flagged as true positives or false positives, are sorted in descending order based on their scores. Then, starting from the box with the highest score, the precision and recall values of each box are plotted consecutively. For example, the box that has the m-th highest score has the following precision and recall values:

Precision_m = TP_m / m,   (5.4)

Recall_m = TP_m / N,   (5.5)

where TP_m represents the number of true positives seen up to the m-th box. In order to have complete curves, it is a common approach to include more boxes (from the false ones) in the plot than are actually detected. These boxes are chosen from those detections which, when mapped to the feature space, reside just on the negative side, very close to the hyperplane (a distance of -1 from the hyperplane is commonly used). After all the points are plotted, an ideal P-R curve starts from a precision value equal to 1 (assuming the box with the highest score is a true positive) and monotonically decreases while processing more and more boxes, until it reaches a value of zero in precision and a high value in recall (ideally reaching one).
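Generating the curve points of Eqs. 5.4 and 5.5 can be sketched as follows (a hypothetical helper; the input flags are assumed to be sorted by descending score):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// One (precision, recall) point per detection: the m-th point (1-based) is
// Precision_m = TP_m / m and Recall_m = TP_m / N.
std::vector<std::pair<double, double>> prCurve(
        const std::vector<bool>& isTruePositive,  // sorted by descending score
        int totalGroundTruth) {                   // N, the number of ground truths
    std::vector<std::pair<double, double>> curve;
    int tp = 0;
    for (std::size_t m = 0; m < isTruePositive.size(); ++m) {
        if (isTruePositive[m]) ++tp;
        curve.push_back({double(tp) / (m + 1),            // precision
                         double(tp) / totalGroundTruth}); // recall
    }
    return curve;
}
```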

5.4 Quantitative results

In this section, we present three plots of P-R curves: we first compare the different training methods on standing people, then we observe the effect of including sitting people, and in the end we compare soft and hard training results. Besides comparing the performance, we also compare the runtimes of our single-scale method and the sliding window multi-scale method.

5.4.1 P-R curves comparison

The first comparison we perform is between the tile-specific and the sliding window approach for different training methods. The plot shows five curves: the red curve represents our tile-specific single-scale approach trained on synthetic humans, the green curve represents the sliding window multi-scale approach trained on the INRIA dataset, and the other three curves represent the sliding window multi-scale approach trained on our restaurant dataset split in three different ways. As can be seen from Fig. 5.1, our method outperforms all the others, achieving higher values both in precision and recall.

Figure 5.1: P-R curves for different training methods

Another obvious and expected result is that the methods trained on our restaurant dataset are better than the method trained on the INRIA dataset.


Furthermore, if we compare only the three curves of the methods trained on our dataset, we can see that the curves belonging to training performed on one half and testing on the other (and vice-versa) have similar performance. Meanwhile, the performance is noticeably increased if training is done on every second image and testing on the remaining ones. The main reason for this is that in the latter case the training and testing examples are more similar to each other than when we split the dataset into two halves.

As we also mention in Sec. 3.4.4, by incorporating sitting people evidence we should be able to remove some of the false positives created. In Fig. 5.2 we show the effect of incorporating sitting people (through the green curve) in the standing people detections.

Figure 5.2: P-R curves when incorporating sitting detections

As can be seen, we achieve a large improvement in precision, while noticing a very small decrease in recall. The decrease in recall can be explained by the fact that some true positives may be mistakenly removed by the small non-maximum suppression scheme, simply because for that particular image the scores do not reflect the true scenario.
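The suppression step referred to above can be illustrated with a generic greedy IoU-based non-maximum suppression routine; this is a sketch of the general technique under our own simplifying assumptions, not the exact scheme used in the thesis:

```python
# Sketch of greedy IoU-based non-maximum suppression: keep the highest-scoring
# box and drop overlapping lower-scoring boxes. Generic illustration, not the
# exact suppression scheme used in this work.
def iou(a, b):
    # Boxes as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 20), (1, 1, 11, 21), (30, 0, 40, 20)]
scores = [0.9, 0.7, 0.8]
print(nms(boxes, scores))  # the second box overlaps the first and is suppressed
```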

The third plot is not a highlight of the thesis, but it illustrates an interesting fact that we observed during the evaluation process. Here we compare the effect that hard training has on different datasets (see Fig. 5.3). An increase in both precision and recall is observed if hard training is performed when similar data is used for training and testing, while we observe a decrease in recall if hard training is performed when different data is used for training and testing.

5.4.2 Runtime Computation

In order to observe how fast our method is compared to the other sliding window multi-scale methods, we follow two approaches.

Figure 5.3: P-R curves for soft vs hard training

The first approach simply measures the time it takes either method to perform all detections in one image. Recalling the computer specifications presented in Sec. 4.1 for the Linux machine, and the fact that HOG feature extraction is done on the CPU and not on the GPU, we observe a detection time of 67 [ms] for the sliding window multi-scale approach using a scale factor of 1.05. This scale factor was heuristically chosen by the authors for the default HOG detector trained on the INRIA dataset, and we adopt it as well. We noticed that increasing the scale factor reduces the time it takes to process all detections in an image, while at the same time worsening the detection performance. On the other hand, the time it takes our tile-specific single-scale approach to detect people in all the tiles depends only on the number of tiles. We recall that we do not search the image in places where no classifier training was performed, and that the overlapping tile scheme uses a half-tile overlap. Considering roughly 800 tiles in our whole scene, the time for all detections is 34 [ms]. We therefore observe a twofold speed increase for our approach, from 1.5 to 3 frames per second.

The other measuring approach is based on the number of HOG windows that need to be computed per image. In our case, the number of windows computed equals the number of tiles, hence 800 windows. As a reminder, before feature extraction we rescale each window so that its size does not surpass the 64 by 128 pixels used in the sliding window multi-scale approach (see Sec. 3.3.1). This allows us to compare the performance in terms of window evaluations, since the time needed to extract the features for a window is approximately the same (except in cases where our window is smaller than 64 by 128 pixels). Keeping this in mind, and using the same scale factor of 1.05, for an image of 640 by 480 pixels the sliding window multi-scale approach performs around 27 scale evaluations, which in turn leads to 24,320 HOG windows in total. As one can see, we compute around 30 times fewer windows, which is a big improvement.
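The scale count can be reproduced from the image and window dimensions. The small sketch below assumes a standard image pyramid, a stride of 8 pixels and no padding (our illustrative assumptions, not stated thesis parameters), and yields a level count consistent with the roughly 27 scale evaluations above:

```python
# Count pyramid levels for a 640x480 image, a 64x128 detection window and a
# scale factor of 1.05, and sum the sliding windows over all levels.
# The 8-px stride and zero padding are illustrative assumptions.
def count_levels_and_windows(img_w, img_h, win_w, win_h, factor, stride=8):
    levels, windows, scale = 0, 0, 1.0
    # Keep shrinking the image until the detection window no longer fits.
    while img_w / scale >= win_w and img_h / scale >= win_h:
        w, h = int(img_w / scale), int(img_h / scale)
        windows += ((w - win_w) // stride + 1) * ((h - win_h) // stride + 1)
        levels += 1
        scale *= factor
    return levels, windows

levels, windows = count_levels_and_windows(640, 480, 64, 128, 1.05)
print(levels, windows)  # level count lands in the same range as the ~27 reported
```

With these assumptions the window total also comes out well above the 800 windows of the tile-specific scheme, in line with the roughly 30-fold reduction stated above.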

The reason why the speeds measured by the two aforementioned approaches do not match lies in how sliding window approaches are implemented. Recent research, such as the work by Dubout [48], shows that the bottleneck of sliding window approaches, namely the computational cost of the convolutions between the multiple rescalings of the image and the linear filters, can be overcome by exploiting properties of the Fourier transform and clever implementation strategies. They claim a speed increase by a factor proportional to the filters' sizes. In our case we cannot apply this speed-up, because we do not detect with a sliding window over multiple scales, and furthermore the size of the windows changes from tile to tile. However, we can still implement a more efficient detector by parallelizing the process or using multi-threaded programming.

5.5 Qualitative Results

In this section we present some visual results obtained by applying the detector directly to the images. Based on the argumentation in Sec. 1.2.1, we captured scenarios that are characteristic of restaurant-like environments, showing the strengths of our approach as well as the hard detection cases. Fig. 5.4 depicts the situation where two humans are standing or walking close to each other, with the respective detection boxes in green.

Figure 5.4: Detection results on people standing or walking close to each other

On the other hand, Fig. 5.5 shows the performance of the detector in a scenario with multiple people relatively close to each other. While the picture on the left shows five adults and a child entering the restaurant, the picture on the right is representative of a situation where multiple humans are sitting at the table and simultaneously interacting with the standing waiter. While the standing boxes are green, the sitting boxes are yellow.

Figure 5.5: Detection results on multiple people sitting, standing or walking close to each other

Besides training for full-body or sitting people, as already mentioned, in this work we also train for situations where humans are occluded or truncated. These situations are depicted in Fig. 5.6. While the two left pictures show humans truncated at the bottom and side of the camera image, the two right ones show occlusions of the human legs by the tables or chairs near them.

Figure 5.6: Detection results on truncated and occluded people

As can be seen from the above figures, our approach finds correct bounding boxes around standing or sitting people even when they are truncated or occluded. However, there are cases where our detectors do not perform as well. Examples of such cases are shown in Fig. 5.7. The left picture depicts situations where waiters are serving and adopt bending-like body articulations. At the current stage the detectors give very low scores to such poses; however, this could be overcome by training specifically for bending humans. On the other hand, the picture on the right depicts a situation where multiple people occlude each other, and as we can see there are missing detections. This is due to the fact that the current training is performed only for single humans in the tiles; however, training for multiple humans or using the deformable part-based models of Felzenszwalb [38] instead of the holistic HOG approach could help to overcome these difficulties.


Figure 5.7: Hard cases

5.6 Contributions

In this section we point out the contributions of this work, as observed from the quantitative and qualitative results explained in the sections above.

• Better detection accuracy. We evaluated our tile-specific detection method thoroughly and compared it to four other trained methods based on the sliding window multi-scale approach, and we showed through the P-R curves that our method outperforms the others by a large margin. Furthermore, we showed that by incorporating sitting-people inferences, we are able to remove false positives, hence increasing the precision of our detector.

• Faster detection. We observed a performance speed-up of two times in terms of implementation-dependent PC runtime, and a speed-up of thirty times in terms of HOG window computations per image, as compared to the sliding window multi-scale approaches.

• Captures partial occlusions and foreshortening. Our method is able to capture partially occluded and truncated people, while the default holistic HOG trained on generic data cannot.

In conclusion, we can say that using HOGs as the feature descriptor, our method performs better than the state of the art for holistic approaches. Further investigation is needed to observe how it compares to part-based approaches, as mentioned in Sec. 2.1; however, such an evaluation is out of the scope of this master thesis.


Chapter 6

Conclusion and Outlook

In this report, we presented a novel approach to people detection from oblique-view monocular color cameras. We used Histograms of Oriented Gradients as feature descriptors in combination with linear Support Vector Machines for training; however, instead of training a single classifier for generic settings and scenes, we train multiple classifiers for specific positions of a specific scene. We presented a new training approach using only synthetic human models placed on a calibrated 3D model of the scene and rendered through Autodesk Maya. In addition, we presented a per-tile single-scale detection scheme that is different from and more efficient than the commonly used sliding window multi-scale approaches. We would also like to stress that the whole process of data generation and training is completely automatic, making it reusable for many other scenes and environments. Lastly, the modular implementation makes the whole scheme usable with existing or more powerful feature descriptors, other than HOG, that could be developed in the future.

6.1 Achievements

In this section, we present the achievements reached during this project. First of all, to the best of our knowledge, there is no other approach that exploits the power of HOG feature descriptors and classifier training and applies it to a specialized scheme of people detection for oblique-view cameras. This specialized approach, as presented in Chapter 5, noticeably increases the performance of the detector as compared to approaches trained on generic people and generic settings using sliding window multi-scale detection schemes, while at the same time speeding up the detection. Furthermore, it is able to capture occlusions of people behind objects (tables, couches, etc.) as well as foreshortening and truncations caused by the camera viewpoint and range. One additional contribution is the fact that training is done using only synthetic human models, which to the best of our knowledge has been adopted only by Shotton et al. [42] in their attempt to train Kinect pose estimation using synthetic depth images; however, while the latter trains for generic humans in normal indoor scenes where the Kinect can be deployed, we train for a specific camera viewpoint and scene. This brings a great advantage, not only because we have full control over the variability of models and poses that we can use for training, but also because it allows this approach to be used for any desired scene and environment. To conclude, our system detects both standing and sitting people by specifically training for them, and achieves better results than the other evaluated methods from the previous section.

6.2 Future Work

This project can be considered only the first and main step towards a proper video surveillance system. In order to further increase the accuracy, tracking should and will be integrated into the system. Furthermore, we would like to include a preprocessing step of image background subtraction to eliminate further false detections in places where there are no people. As also mentioned in Sec. 2, performing background subtraction from monocular cameras alone is not very accurate; hence we could integrate another calibrated camera and perform stereo-disparity background subtraction. This would make it possible to apply our detection classifiers only in the ROI extracted by the background subtraction methods. Another possibility would be to include background subtraction or depth images directly in the training data, similar to what is done in the Kinect paper [42]. In terms of the current training data, we plan to enrich the negative training data with parts of human bodies, while enriching the positive training data with more poses similar to bending (waiters or people sitting down). Furthermore, we would like to improve the training for sitting people by developing specific sitting detectors across the whole scene, as well as to incorporate child detections in our evaluation. In terms of speed, we would like to try the GPU implementation of HOG, as well as to parallelize our per-tile detection process. To conclude, we would like to compare our holistic approach to the part-based approaches, and observe how the performance changes if we replace the HOGs with better feature descriptors.


Bibliography

[1] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International Journal of Computer Vision (IJCV), vol. 88, pp. 303–338, June 2010.

[2] J. Varona, J. Gonzalez, I. Rius, and J. J. Villanueva, “Importance of detection for video surveillance applications,” International Society for Optical Engineering (SPIE), vol. 47, no. 8, pp. 087201–087201, 2008.

[3] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 886–893, June 2005.

[4] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[5] M. Habbecke and L. Kobbelt, “Laser brush: a flexible device for 3d reconstruction of indoor scenes,” in Proc. of the Symposium on Solid and Physical Modeling (SPM) (E. Haines and M. McGuire, eds.), pp. 231–239, ACM, 2008.

[6] Lu-Xingchang and Liu-Xianlin, “Reconstruction of 3d model based on laser scanning,” in Proc. of the Innovations in 3D Geo Information Systems, First International Workshop on 3D Geoinformation, 7-8 August 2006, Kuala Lumpur, Malaysia (A. Abdul-Rahman, S. Zlatanova, and V. Coors, eds.), Lecture Notes in Geoinformation and Cartography, pp. 317–332, Springer, 2006.

[7] N. A. Ogale, “A survey of techniques for human detection from video,”Survey, University of Maryland, 2006.

[8] T. Santhanam, C. P. Sumathi, and S. Gomathi, “A survey of techniques for human detection in static images,” in Proc. of the International Conference on Computational Science, Engineering and Information Technology (CCSEIT), (New York, NY, USA), pp. 328–336, ACM, 2012.

[9] M. Enzweiler and D. M. Gavrila, “Monocular pedestrian detection: Survey and experiments,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 31, no. 12, pp. 2179–2195, 2009.

[10] P. Viola, M. J. Jones, and D. Snow, “Detecting pedestrians using patterns of motion and appearance,” in Proc. of the International Conference on Computer Vision (ICCV), pp. 734–741, 2003.


[11] D. M. Gavrila and S. Munder, “Multi-cue pedestrian detection and tracking from a moving vehicle,” International Journal of Computer Vision (IJCV), 2006.

[12] C. Wojek and B. Schiele, “A performance evaluation of single and multi-feature people detection,” in Proc. of the Symposium of the German Association for Pattern Recognition (DAGM), (Berlin, Heidelberg), pp. 82–91, Springer-Verlag, 2008.

[13] D. Geronimo, A. M. Lopez, A. D. Sappa, and T. Graf, “Survey of pedestrian detection for advanced driver assistance systems,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 32, pp. 1239–1258, July 2010.

[14] B. Schiele, M. Andriluka, N. Majer, S. Roth, and C. Wojek, “Visual people detection: Different models, comparison and discussion,” in Proc. of the IEEE International Conference on Robotics and Automation (ICRA) Workshop on People Detection and Tracking, pp. 1–8, 2009.

[15] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient matching of pictorial structures,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 66–73, 2000.

[16] K. Mikolajczyk, C. Schmid, and A. Zisserman, “Human detection based on a probabilistic assembly of robust part detectors,” in Proc. of the European Conference on Computer Vision (ECCV), pp. 69–82, 2004.

[17] B. Leibe, E. Seemann, and B. Schiele, “Pedestrian detection in crowded scenes,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 878–885, 2005.

[18] M. Andriluka, S. Roth, and B. Schiele, “People-tracking-by-detection and people-detection-by-tracking,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

[19] E. Seemann, B. Leibe, and B. Schiele, “Multi-aspect detection of articulated objects,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (Washington, DC, USA), pp. 1582–1588, IEEE Computer Society, 2006.

[20] B. Wu and R. Nevatia, “Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors,” in Proc. of the International Conference on Computer Vision (ICCV), 2005.

[21] M. Piccardi, “Background subtraction techniques: a review,” in Proc. of the IEEE International Conference on Systems, Man and Cybernetics (SMC), vol. 4, pp. 3099–3104, IEEE, Oct. 2004.

[22] R. Cucchiara, C. Grana, M. Piccardi, A. Prati, and S. Sirotti, “Improving shadow suppression in moving object detection with hsv color information,” in Proc. of the IEEE Intelligent Transportation Systems (ITS), 2001.

[23] S. Bahadori, L. Iocchi, G. R. Leone, D. Nardi, and L. Scozzafava, “Real-time people localization and tracking through fixed stereo vision,” in Lecture Notes on Artificial Intelligence (LNAI), pp. 44–54, 2005.


[24] A. Ess, B. Leibe, K. Schindler, and L. van Gool, “A mobile vision system for robust multi-person tracking,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Press, June 2008.

[25] D. Mitzel and B. Leibe, “Taking mobile multi-object tracking to the next level: people, unknown objects, and carried items,” in Proc. of the European Conference on Computer Vision (ECCV), (Berlin, Heidelberg), pp. 566–579, Springer-Verlag, 2012.

[26] S. Se and M. Brady, “Ground plane estimation, error analysis and applications,” Robotics and Autonomous Systems, vol. 39, no. 2, pp. 59–71, 2002.

[27] S. Yous, H. Laga, and K. Chihara, “People detection and tracking with world-z map from a single stereo camera,” in Proc. of the International Workshop on Visual Surveillance (VS) - (ECCV), (Marseille, France), October 2008.

[28] H. Sidenbladh, “Detecting human motion with support vector machines,” in Proc. of the IEEE International Conference on Pattern Recognition (ICPR), (Washington, DC, USA), pp. 188–191, IEEE Computer Society, 2004.

[29] A. Broggi, M. Bertozzi, A. Fascioli, and M. Sechi, “Shape-based pedestrian detection,” in Proc. of the IEEE Intelligent Vehicles Symposium (IV), pp. 215–220, 2000.

[30] D. M. Gavrila, J. Giebel, and S. Munder, “Vision-based pedestrian detection: The protector system,” in Proc. of the IEEE Intelligent Vehicles Symposium (IV), pp. 13–18, 2004.

[31] C. Papageorgiou and T. Poggio, “A trainable system for object detection,” International Journal of Computer Vision (IJCV), vol. 38, pp. 15–33, June 2000.

[32] F. Porikli, “Integral histogram: A fast way to extract histograms in cartesian spaces,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 829–836, 2005.

[33] Q. Zhu, M.-C. Yeh, K.-T. Cheng, and S. Avidan, “Fast human detection using a cascade of histograms of oriented gradients,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (Washington, DC, USA), pp. 1491–1498, IEEE Computer Society, 2006.

[34] V. A. Prisacariu and I. Reid, “fastHOG - a real-time GPU implementation of HOG,” Technical Report 2310/09, 2009.

[35] O. Tuzel, F. Porikli, and P. Meer, “Pedestrian detection via classification on riemannian manifolds,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 30, pp. 1713–1727, Oct. 2008.

[36] A. Mohan, C. Papageorgiou, and T. Poggio, “Example based object detection in images by components,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 23, pp. 349–361, 2001.


[37] B. Wu and R. Nevatia, “Detection and tracking of multiple, partially occluded humans by bayesian combination of edgelet based part detectors,” International Journal of Computer Vision (IJCV), vol. 75, pp. 247–266, Nov. 2007.

[38] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 32, pp. 1627–1645, 2010.

[39] P. Dollar, B. Babenko, S. Belongie, P. Perona, and Z. Tu, “Multiple component learning for object detection,” in Proc. of the European Conference on Computer Vision (ECCV), 2008.

[40] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recognition using shape contexts,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 24, pp. 509–522, Apr. 2002.

[41] B. Leibe, A. Leonardis, and B. Schiele, “Robust object detection with interleaved categorization and segmentation,” International Journal of Computer Vision (IJCV), vol. 77, pp. 259–289, May 2008.

[42] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, “Real-time human pose recognition in parts from single depth images,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.

[43] L. Pishchulin, A. Jain, M. Andriluka, T. Thormaehlen, and B. Schiele, “Articulated people detection and pose estimation: Reshaping the future,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (Providence, United States), pp. 1–8, June 2012.

[44] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis, “Scape: shape completion and animation of people,” ACM Transactions on Graphics (TOG), vol. 24, pp. 408–416, 2005.

[45] J. Marin, D. Vazquez, D. Geronimo, and A. M. Lopez, “Learning appearance in virtual scenarios for pedestrian detection,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 137–144, IEEE, 2010.

[46] T. Kozakaya, T. Shibata, M. Yuasa, and O. Yamaguchi, “Facial feature localization using weighted vector concentration approach,” International Conference on Image, Vision and Computing (ICIVC), vol. 28, pp. 772–780, May 2010.

[47] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, “A practical guide to support vector classification,” 2010.

[48] C. Dubout and F. Fleuret, “Exact acceleration of linear object detectors,” in Proc. of the European Conference on Computer Vision (ECCV), (Berlin, Heidelberg), pp. 301–311, Springer-Verlag, 2012.


List of Figures

1.1 Camera setup. left: Oblique view of a random restaurant, camera tilted ∼60 [degrees]. right: Top view of the restaurant.

2.1 left: Holistic representation. right: Part-based representation (taken from [8])
2.2 People Detection Methods
2.3 top: Human motion patterns. bottom: Non-human motion patterns (taken from [28])
2.4 left: Hierarchy of templates used in the chamfer system (figure from [30]). right: First five edgelet features selected by AdaBoost (figure from [20])
2.5 left: Different haar wavelets and haar features applied on a human patch (figure from [31]). right: Block placed on a sample image and HOG descriptor weighted by positive/negative SVM weights (figure from [3])
2.6 left: Kinect synthetic depth images (figure from [42]). right: SCAPE examples (figure from [44])

3.1 CAD model in Maya
3.2 Mixamo characters
3.3 Top view of the scene showing tile division and chair positions
3.4 (a) Synthetic human, (b) Truncated synthetic human, (c) and (d) Standing positive training patch, (e) and (f) Sitting positive training patch
3.5 left: Capped cylinder encapsulating the human model. right: Bounding box extracted
3.6 Examples of negative training patches. (a) Part of a trolley, (b) Part of a dog, (c) Empty background, (d) Whole ball
3.7 Overview schematic of the training process for a standing person and specific tile
3.8 left: HOG feature vector pipeline (taken from [46]). right: SVM classifier
3.9 Overview schematic of the whole detection process
3.10 Non-Maximum Suppression schemes. left: Over tiles. right: IOU over 2D boxes
3.11 Weighted averaging and grouping scheme. (a) Initial boxes around the person, (b) Remaining single box

4.1 Axis Network Color Camera
4.2 Screenshot of the detection module with the main window
4.3 Screenshot of the training module
4.4 Screenshot of the detection module with the 3 labels in the main window

5.1 P-R curves for different training methods
5.2 P-R curves when incorporating sitting detections
5.3 P-R curves for soft vs hard training
5.4 Detection results on people standing or walking close to each other
5.5 Detection results on multiple people sitting, standing or walking close to each other
5.6 Detection results on truncated and occluded people
5.7 Hard cases


List of Tables

5.1 Evaluation Details
