Real-Time Head Pose Estimation in Low-Resolution Football Footage

Andreas Launila

Master of Science Thesis Stockholm, Sweden 2009

Master’s Thesis in Computer Science (30 ECTS credits)
at the School of Computer Science and Engineering
Royal Institute of Technology, year 2009
Supervisor at CSC was Josephine Sullivan
Examiner was Stefan Carlsson

TRITA-CSC-E 2009:130
ISRN-KTH/CSC/E--09/130--SE
ISSN-1653-5715

Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.csc.kth.se

Abstract

This report examines the problem of real-time head pose estimation in low-resolution football footage. A method is presented for inferring the head pose using a combination of footage and knowledge of the locations of the football and players. An ensemble of randomized ferns is compared with a support vector machine for processing the footage, while a support vector machine performs pattern recognition on the location data. Combining the two sources of information outperforms either in isolation. The location of the football turns out to be an important piece of information.

Referat (Swedish Abstract)

Real-Time Estimation of Head Rotation in Low-Resolution Video Sequences from Football Matches

The report addresses the problem of inferring, in real time, the rotation of the head in low-resolution video sequences from football matches. A method is described that uses a combination of visual data and knowledge of the positions of the football and the players. An ensemble of randomized ferns (a special case of decision trees) is compared with a support vector machine (SVM) for processing the visual data, while a support vector machine is used for pattern recognition on the positions of the football and the players. The two sources of information in combination give better results than either one alone. Knowledge of the football’s position turns out to be an important source of information.

Contents

1 Introduction
    1.1 Resources and Requirements
    1.2 Outline

2 Head Detection
    2.1 Method
        2.1.1 Region of Interest
        2.1.2 Support Vector Machine Pattern Recognition
        2.1.3 Temporal Filtering
        2.1.4 Background Subtraction and Ellipse Fitting
        2.1.5 K-means Cleaning
        2.1.6 Median Template Matching
    2.2 Summary

3 Bottom-up Head Pose Estimation
    3.1 Randomized Ferns
    3.2 Methods Investigated
        3.2.1 Randomized Ferns
        3.2.2 Support Vector Machine
    3.3 Summary

4 Top-down Head Pose Estimation
    4.1 Method
        4.1.1 Camera Angle
        4.1.2 Feature Selection
    4.2 Summary

5 Tying Things Together
    5.1 Fusion
        5.1.1 Linear Interpolation
        5.1.2 Log-Linear Interpolation
        5.1.3 Independent Likelihood Pool
        5.1.4 Monolithic Support Vector Machine
    5.2 Temporal Smoothing
    5.3 Angle Estimation
    5.4 Summary

6 Experiments and Results
    6.1 Dataset Division and Labeling
    6.2 Evaluation Method
    6.3 Results
        6.3.1 Bottom-up
        6.3.2 Top-down
        6.3.3 Fusion
        6.3.4 Temporal Smoothing
        6.3.5 The System as a Whole
    6.4 Discussion
        6.4.1 Bottom-up
        6.4.2 Top-down
        6.4.3 Fusion
        6.4.4 Temporal Smoothing
        6.4.5 The System as a Whole

7 Conclusions and Future Work
    7.1 Conclusions
    7.2 Future Work

Bibliography

Appendices

A Assumed Background Knowledge
    A.1 Bayes’ Rule
    A.2 Naive Bayes Classifier
    A.3 Support Vector Machines
    A.4 Decision Trees
    A.5 Boosting
    A.6 Bagging
    A.7 Histogram of Oriented Gradients
    A.8 Homographies
    A.9 K-means Clustering

Chapter 1

Introduction

The ACTVIS (Capturing and Visualizing Large scale Human Action) project aims to perform 3D reconstruction of football games. More specifically, the aim is to reconstruct the football players’ motions using visual information. As a step in that direction, this degree project explores how to estimate in which direction a football player is looking. The project was carried out at the Computer Vision and Active Perception laboratory (CVAP) at KTH.

1.1 Resources and Requirements

The goal is to utilize video footage to automatically estimate in which direction each player is looking (which will be referred to as the head pose). As an additional aid, tracking information detailing each player’s position on the pitch at each time step is available. The tracking data also contains additional information, such as the location of the football and referees. Both the footage and tracking data are provided by TRACAB[56].

The footage is taken by four HD color cameras covering the entire field. Example frames are shown in figure 1.1. The cameras are stationary and do not pan during the game. Hence the background remains comparatively static. The sizes of the heads range from approximately 10 × 12 pixels to 18 × 21 pixels. Occlusion of the heads and players occurs from time to time. Motion blur and encoding artifacts are to be expected.

The intent is to perform the estimation in real time. Hence the method used should not have any underlying properties that prevent it from being run at those speeds (i.e. it should be computationally light). The goal of this degree project is, however, not to produce an implementation that has been optimized to that level.

1.2 Outline

Two main approaches will be used to garner information. The bottom-up approach aims to locate the player’s head and then estimate its direction based on its appearance. The top-down approach instead tries to model where the player could be looking, i.e., where the points of interest are in relation to the player. The two approaches are then combined to produce estimates that are hopefully more accurate than either approach used in isolation.

Figure 1.1: Frames from the four cameras, panels (a) to (d). The original resolution is 1920 × 1080 pixels.

The outline of the report is as follows. The problem is divided into several subproblems which are each tackled separately. The experiments and results are presented in a common chapter at the end. Chapter 2 deals with the problem of locating the heads of the players, which is the first step to performing the bottom-up analysis described in chapter 3. Chapter 4 details how the top-down analysis is performed, which is fused with the bottom-up analysis in chapter 5. Chapter 6 presents experimental results, from which conclusions are drawn in chapter 7. Appendix A covers some of the background knowledge assumed on the part of the reader.

Chapter 2

Head Detection

The first step towards performing the bottom-up analysis is to locate the head. In later chapters, we will employ a learning approach to head pose estimation, which requires a large number of examples of heads pointing in various directions. Manually marking the location of a sufficient number of heads would require far too much work. Hence an automated method is used.

Head detection and head tracking are both difficult problems. The focus of this degree project is on head pose estimation; hence we will not dig too deep into head detection. Rather, we will present an automated method that is good enough for the purpose of producing large datasets, but presumably not close to the state of the art in terms of accuracy.

Head detection and tracking is related to tracking humans and body parts as well as face detection. General-purpose approaches include using mean shift trackers[11, 5] and particle filters[39, 46, 3]. Pattern recognition can be used to recognize the appearance[14, 50], as well as the motion[15, 59], of humans. More purpose-specific approaches include fitting ellipses to edge detections[52], using face detectors and skin color[24], and utilizing silhouettes produced through background subtraction[20].

We cannot assume that the face, or even any skin on the head, is visible. Even when the face is visible, the low resolution makes detecting face structure difficult. The amount of blur introduced by the low resolution also destroys much of the edge information.

A mean shift implementation[57] was tested, but did not produce satisfying results. Taking into account spatial information, as done in [5], was not attempted.

While the resolution causes problems, the background is static (and the cameras do not move). Hence, background subtraction is relatively easy. Using silhouettes might work, but it is complicated by the low resolution. A few examples of player silhouettes are shown in figure 2.1.

A combination of pattern recognition and background subtraction is selected.

Figure 2.1: An example of a foreground mask produced through background subtraction: (a) original frame, (b) foreground mask, (c) zoomed-in frame, (d) zoomed-in foreground mask. The foreground mask is shown after applying a median filter and morphological opening.

2.1 Method

Head detection is performed in multiple steps, gradually improving the localization of the head. The steps are as follows.

1. Cut out a region of interest using the tracking data.

2. Use a linear SVM trained with HOG-features to roughly locate the head-area in the region of interest.

3. Filter the location and size of the head-area.

4. Use background subtraction to construct a contour of the player. Locate the head by fitting an ellipse to the head-area of the contour.

5. Do a k-means segmentation to clear out any background that is still left after the background subtraction.

6. Take the median of a single player’s head images across a time window. Use the result as a template of what a well-centered head of that player should look like. Use cross-correlation to relocate the head in the original region of interest.

Each step is now explained in further detail.

2.1.1 Region of Interest

The tracking data provides the player’s location on the real world pitch. The player’s location can be used to obtain a rough estimate of where the player’s head should be located. Additionally, it is necessary to use the data to identify the players.

The tracking data provides the location as a point on the pitch, while we want to localize the head in the camera frame. Hence, the first hurdle is to compute the projective transformation (homography) between a bird’s-eye view of the pitch and the camera frames. This is done by manually marking several points on the pitch in the camera frames and using the implementation of [28] to compute homographies from frames of each of the four cameras to the bird’s-eye view of the pitch.

The field of view of each camera is manually marked. After having figured out in which camera’s field of view a tracked position is located, the inverse homography of that camera is applied to obtain the corresponding location in the frame. The tracking data provides the location of the players’ feet rather than the head. To account for possible body poses and inaccuracies in the transformed coordinate, a generous region of interest is extracted.
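As a rough illustration of the mapping described above, the following sketch applies a 3 × 3 homography to a pitch coordinate and cuts out a box above the projected foot position. The matrix values, the helper name `roi_around_feet`, and the box dimensions are assumptions chosen for illustration; they are not taken from the project.

```python
def apply_homography(H, x, y):
    """Map the point (x, y) through the 3x3 homography H (nested lists)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity")
    # Divide by the homogeneous coordinate to get pixel coordinates.
    return u / w, v / w


def roi_around_feet(foot_px, width=60, height=120):
    """Hypothetical helper: a box whose bottom edge is centred on the feet.

    The default width and height are placeholder values, not the project's.
    Returns (left, top, width, height) in pixels.
    """
    fx, fy = foot_px
    left = int(round(fx - width / 2))
    top = int(round(fy - height))
    return left, top, width, height


# Illustrative pitch-to-frame homography (pure scaling and translation).
H_pitch_to_frame = [[2.0, 0.0, 100.0],
                    [0.0, 2.0, 200.0],
                    [0.0, 0.0, 1.0]]
feet = apply_homography(H_pitch_to_frame, 10.0, 20.0)
roi = roi_around_feet(feet)
```

In practice the box would be clipped to the frame boundaries before cropping.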

2.1.2 Support Vector Machine Pattern Recognition

Pattern recognition is used to locate the head-shoulder region, which is hopefully fairly distinct and invariant in shape. A linear SVM with HOG features is used since it has proven to work for human detection[14]. The SVM is trained on head-shoulder regions extracted from the MIT pedestrian dataset[41], as well as negative examples from the Daimler pedestrian classification benchmark[37]. HOG features are computed for the region of interest and an SVM is applied using Dalal’s object detection and localization toolkit[13]. The window that gives the highest response is selected.

2.1.3 Temporal Filtering

Given the temporal relationship between player positions in neighbouring frames, we can constrain detections to move smoothly from frame to frame and thereby correct stray outliers. This is implemented using a Kalman filter[61].
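To make the idea concrete, the sketch below filters a single coordinate of the detection window with a scalar Kalman filter. It assumes a random-walk state model applied independently to each coordinate, and the noise variances are placeholder values; the state model and parameters actually used in the project are not specified here.

```python
def kalman_filter_1d(measurements, process_var=1.0, meas_var=25.0):
    """Filter a 1-D sequence with a random-walk Kalman filter.

    process_var and meas_var are assumed noise variances (placeholders).
    """
    estimate = measurements[0]   # initial state: trust the first detection
    variance = meas_var          # initial uncertainty of that state
    filtered = [estimate]
    for z in measurements[1:]:
        # Predict: the state is assumed unchanged, so only uncertainty grows.
        variance += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = variance / (variance + meas_var)
        estimate += gain * (z - estimate)
        variance *= 1.0 - gain
        filtered.append(estimate)
    return filtered


# A stray detection (140) in an otherwise smooth track is pulled back.
xs = [100, 101, 140, 103, 104]
smoothed = kalman_filter_1d(xs)
```

A larger measurement variance makes the filter trust individual detections less, which suppresses outliers more strongly at the cost of a slower response to real motion.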

2.1.4 Background Subtraction and Ellipse Fitting

The color of the background is static and differs from the color of the players. Therefore background subtraction is relatively easy. The subtraction is performed by modelling the background color of each pixel as a mixture of Gaussians and treating deviations as foreground. Out of the many background subtraction techniques, mixture of Gaussians was selected because it provides high accuracy while being comparatively light computationally[44]. The OpenCV library[40] is used to compute foreground masks, such as the one shown in figure 2.1, for each frame.

The foreground mask is cleaned up using a 2 × 2 median filter (rounding up) and a 4 × 4 morphological opening. The contour of the blob present in the SVM detection window (from the previous step) is computed. An ellipse is fitted to the contour inside the detection window as shown in figure 2.2. The SVM detection is then cropped so that it only covers the fitted ellipse and the immediate surrounding.

The ellipse is fitted using a brute force search to find the ellipse with the boundary that overlaps the most with the contour. The ellipse is assumed to have a vertical major axis and the size is estimated beforehand. Thereby, the ellipse only has two degrees of freedom (the location). This, in combination with the small detection window, makes brute force search feasible.
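The brute force search can be sketched as follows. The sketch assumes that the contour is given as a set of pixel coordinates, that the ellipse boundary is rasterised by sampling angles, and that overlap is counted as the number of shared pixels; these representation choices and all function names are illustrative assumptions.

```python
import math


def ellipse_boundary(cx, cy, a, b, samples=72):
    """Rasterise the boundary of an axis-aligned ellipse into pixel coordinates.

    a is the horizontal semi-axis and b the vertical one (b >= a for a
    vertical major axis).
    """
    pts = set()
    for k in range(samples):
        t = 2.0 * math.pi * k / samples
        pts.add((int(round(cx + a * math.cos(t))),
                 int(round(cy + b * math.sin(t)))))
    return pts


def fit_ellipse_location(contour, width, height, a, b):
    """Try every centre in the window; return (best_centre, overlap)."""
    contour = set(contour)
    best, best_score = None, -1
    for cy in range(height):
        for cx in range(width):
            # Overlap = number of boundary pixels that lie on the contour.
            score = len(ellipse_boundary(cx, cy, a, b) & contour)
            if score > best_score:
                best, best_score = (cx, cy), score
    return best, best_score


# Synthetic check: a contour that is exactly an ellipse centred at (10, 12).
contour = ellipse_boundary(10, 12, 4, 6)
centre, overlap = fit_ellipse_location(contour, 20, 24, 4, 6)
```

With a fixed size, the cost is proportional to the number of candidate centres, which stays small for detection windows of a few tens of pixels per side.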

The size of the ellipse, i.e. the head size, is estimated from the size of the player’s blob (identified as being the dominant blob inside the detection window). This makes the head detection procedure sensitive to even slight occlusion between players. Occlusion causes the players’ overlapping blobs to be seen as a single (unusually large) player, throwing off the head size estimation.

An alternative to estimating the head size based on blob size is to use a Hough transform[25] to search across both location and size. It was attempted, but did not produce satisfying results. The head did not always manifest as an ellipse in the foreground contour, causing the fitted ellipse to be of an incorrect size. Partial ellipses would be more fitting as a model for the head contour[43], but were not used.

2.1.5 K-means Cleaning

The background subtraction is far from perfect. Therefore the head localization is refined by trying to further eliminate the background. This is done by clustering the pixels into four clusters based on color and then finding the one that is most similar to background. The heuristic used is that a cluster occupying both the top-left and top-right pixels is background. Otherwise, the cluster whose centroid is closest (in a Euclidean sense) to a precomputed background color is background. Background-heavy rows and columns close to the edges are subsequently removed.
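The heuristic for choosing the background cluster can be written down directly. The sketch below assumes that the clustering has already been performed and is available as a 2-D map of cluster indices plus a list of centroid colors; the function name and data layout are assumptions made for illustration.

```python
def pick_background_cluster(labels, centroids, background_color):
    """Return the index of the cluster treated as background.

    labels: 2-D list of cluster indices, one per pixel.
    centroids: list of colour tuples, one per cluster.
    background_color: precomputed background colour (same colour space).
    """
    top_left, top_right = labels[0][0], labels[0][-1]
    if top_left == top_right:
        # One cluster occupies both top corners: treat it as background.
        return top_left

    def sq_dist(color):
        return sum((p - q) ** 2 for p, q in zip(color, background_color))

    # Otherwise fall back to the centroid nearest the background colour.
    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k]))


corner_case = pick_background_cluster(
    [[2, 2, 2], [2, 0, 2], [1, 0, 1]],
    [(10, 10, 10), (90, 20, 20), (50, 60, 10)],
    (0, 0, 0),
)
fallback_case = pick_background_cluster(
    [[0, 1], [1, 1]],
    [(0, 0, 0), (50, 10, 10)],
    (48, 12, 9),
)
```

Comparing squared distances gives the same ordering as comparing Euclidean distances, so the square root is omitted.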

2.1.6 Median Template Matching

The head should look approximately the same from frame to frame. We can therefore take the median of all detections within a small time window (more specifically, 20 frames, centered around the time of interest) to obtain a template image that is robust to occasional poor detections. The cross-correlation[19] between the template image and the original region of interest is then computed to find the final detection. The k-means cleaning is repeated to eliminate any background introduced in this step.

Figure 2.2: An example showing the ellipse fitting: (a) the SVM detection window, (b) the foreground mask, (c) the contour with the fitted ellipse, (d) the detection window with the background removed and the fitted ellipse displayed, (e) the centered detection. The ellipse size is estimated from the foreground mask. An ellipse is then fit to the contour of the foreground mask. The detection window is finally cropped to the size of the ellipse.
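The median template matching of section 2.1.6 can be sketched as below. The sketch assumes grayscale images stored as 2-D lists, detections that have already been resized to a common size, and a zero-mean normalised form of cross-correlation; whether the project used a normalised variant is not stated, so that choice is an assumption.

```python
import math
from statistics import median


def median_template(detections):
    """Pixel-wise median of equally sized grayscale images (2-D lists)."""
    rows, cols = len(detections[0]), len(detections[0][0])
    return [[median(d[r][c] for d in detections) for c in range(cols)]
            for r in range(rows)]


def best_match(image, template):
    """Return (row, col) of the window with the highest normalised correlation."""
    th, tw = len(template), len(template[0])
    t_vals = [v for row in template for v in row]
    t_mean = sum(t_vals) / len(t_vals)
    t_zero = [v - t_mean for v in t_vals]
    t_norm = math.sqrt(sum(v * v for v in t_zero))
    best, best_score = None, -2.0
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            w = [image[r + i][c + j] for i in range(th) for j in range(tw)]
            w_mean = sum(w) / len(w)
            w_zero = [v - w_mean for v in w]
            w_norm = math.sqrt(sum(v * v for v in w_zero))
            if t_norm == 0 or w_norm == 0:
                continue  # flat window or template: correlation undefined
            score = sum(a * b for a, b in zip(t_zero, w_zero)) / (t_norm * w_norm)
            if score > best_score:
                best, best_score = (r, c), score
    return best


# Two good detections and one poor one; the median ignores the outlier.
detections = [[[1, 2], [3, 4]], [[1, 2], [3, 4]], [[9, 9], [9, 9]]]
template = median_template(detections)
region = [[0, 0, 0, 0],
          [0, 0, 1, 2],
          [0, 0, 3, 4],
          [0, 0, 0, 0]]
location = best_match(region, template)
```

The median is what gives robustness here: a single poor detection in the window does not affect the template, whereas a mean would blur it.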

2.2 Summary

An overview of the head detector is shown in figure 2.3.

Figure 2.3a (Head Detector): A diagram summarizing the head detector. Rounded rectangles represent operations. Rectangles without rounded corners represent computed results. Parallelograms represent input. The head localization and computation of the foreground contour are shown in more detail in figures 2.3b and 2.3c.

Figures 2.3b (Compute Foreground Contour) and 2.3c (Roughly Locate the Head): Diagrams summarizing how the foreground contour is computed and how the rough localization of the head is performed. Rounded rectangles represent operations. Rectangles without rounded corners represent computed results. Parallelograms represent input.


Chapter 3

Bottom-up Head Pose Estimation

Thanks to the head detector, we can now locate the head. The next step is to estimate the head pose direction based on the head’s appearance.

Head pose estimation is closely related to gaze estimation, where one wants to find out in what direction a person is looking, and visual focus of attention, where one wants to find out what a person is looking at. A recent survey of head pose estimation can be found in [38]. The low resolution of the footage means that the method used for head pose estimation must work well with resolutions around 16 × 16 pixels. This rules out approaches such as facial feature-tracking[64], since e.g. the eyes and nose are not reliably identifiable. Additionally, the head detections are far from perfect. The head pose estimator needs to be robust enough to deal with slight shifts, rotations, and scale differences. It also needs to be able to perform the estimation in real time, which rules out computationally heavy methods such as MCMC[53].

Approaches to low-resolution head pose estimation in particular include skin detection combined with nearest neighbor[49], neural networks[60, 9, 55], probabilistic models[63, 9], and tree-based methods[4].

The method presented in [60] uses multiple cameras, and is hence not applicable to our data. In [49], information about the body pose is added, making it difficult to compare with the rest.

The randomized ferns classifier presented in [4] has been shown to work on images of size 10 × 10 pixels, which is approximately the size of our smallest head detections. Additionally, the method is shown to have some resistance to scale changes and detections that are off-center. Still, for the method to work well, it is important that the heads are well-centered. Heads not being of the same size is less of a problem.

Neural networks have been shown to perform well on images that are 20 × 30 pixels large[54, 55]. Those studies do not, however, investigate the susceptibility to imperfect detections. A similar neural network is used in [9], which is there shown to be susceptible to shifts in the head detection window.

The probabilistic model of [9] is promising, but is not investigated. We choose to compare a method based on the randomized ferns of [4] with an SVM, using the features that have proven successful for the neural networks.

Figure 3.1: The difference between (a) a decision tree and (b) a fern. A fern has the same test (in this case t1, t2, t3) in all branches on the same level, while a decision tree can have different tests.

3.1 Randomized Ferns

Randomized ferns, which were introduced by [42], are closely related to decision trees. They stem from the randomized trees classifiers introduced in [1], where several decision trees are constructed by selecting tests in a random fashion. The tests used are often simple, such as comparing the intensities of two pixels[29, 42]. There is nothing stopping more complex tests though, such as using simple linear classifiers[6].

The basic classification procedure of randomized trees is as follows. Let $T_1, T_2, \ldots, T_n$ be an ensemble of random decision trees. Let $D$ be the image to be classified. Let $c$ be the class. The image is sent down each tree, ending up at leaf nodes that each contain a posterior learned from the training data. Let the posteriors be $P(c \mid T_1, D), P(c \mid T_2, D), \ldots, P(c \mid T_n, D)$. The output of the ensemble is then computed as the arithmetic mean[1, 29, 6].

\[
P(c \mid T_1, T_2, \ldots, T_n, D) = \frac{1}{n} \sum_{i=1}^{n} P(c \mid T_i, D)
\]

The choice of using the arithmetic mean is motivated simply by it reportedly working as well as using the posteriors as high-dimensional inputs to one of several classifiers[1].

The difference between ferns and trees is that ferns use the same test regardless of the branch. That is, the test used at each level is independent of the result of tests on the previous levels, as shown in figure 3.1. This difference simplifies the implementation. When randomized ferns were introduced in [42], they also differed from randomized trees by not reusing tests and by combining the posteriors multiplicatively in a naive Bayesian fashion. Since then, other authors have chosen to use the flat fern structure, but to select tests with replacement and combine the posteriors using the arithmetic mean of the randomized trees[6, 4].
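Because every level of a fern applies the same test, the leaf reached by an image is simply the binary number formed by the test outcomes. The sketch below illustrates this together with the arithmetic-mean combination, using the simple pixel-intensity comparison mentioned above as the test type. The data layout (tests as pairs of pixel coordinates, posteriors as one list per leaf) is an assumption made for illustration.

```python
def fern_leaf(image, tests):
    """Leaf index of an image in a fern.

    Each test is ((r1, c1), (r2, c2)) and asks whether pixel 1 is brighter
    than pixel 2. The outcomes form the bits of the leaf index.
    """
    leaf = 0
    for (r1, c1), (r2, c2) in tests:
        bit = 1 if image[r1][c1] > image[r2][c2] else 0
        leaf = (leaf << 1) | bit
    return leaf


def ensemble_posterior(image, ferns):
    """Arithmetic mean of the per-fern posteriors.

    ferns: list of (tests, posteriors), where posteriors[leaf] is a list of
    class probabilities learned for that leaf.
    """
    n_classes = len(ferns[0][1][0])
    total = [0.0] * n_classes
    for tests, posteriors in ferns:
        p = posteriors[fern_leaf(image, tests)]
        total = [t + v for t, v in zip(total, p)]
    return [t / len(ferns) for t in total]


image = [[5, 1],
         [2, 9]]
fern_a = ([((0, 0), (0, 1)), ((1, 0), (1, 1))],
          [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1], [0.5, 0.5]])
fern_b = ([((1, 1), (0, 0))],
          [[0.5, 0.5], [0.3, 0.7]])
posterior = ensemble_posterior(image, [fern_a, fern_b])
```

A fern with k tests has 2^k leaves, so the learned posteriors fit in a flat table indexed by the leaf number, which is the implementation simplification referred to above.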

Combining the posteriors multiplicatively is motivated by seeing the randomized ferns classifier as something halfway between the ideal classifier, which makes no independence assumptions, and the naive Bayes classifier, which assumes that every feature is independent given the true class. In comparison, the randomized ferns classifier presented in [42] only makes some independence assumptions. The tests are chosen without replacement, partitioning the set of all features $F = \{f_1, f_2, \ldots, f_m\}$ (the result of each possible test) into several sets $F_1, F_2, \ldots, F_{m/s}$ of equal size $s$. Each fern is then used to model the dependencies between the features within a single set. Conditional independence is only assumed between the different sets of features.

\[
P(c \mid F) \propto P(c) \prod_{i=1}^{m/s} P(F_i \mid c)
\]
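The multiplicative combination in the formula above is usually evaluated in log space to avoid numerical underflow when many ferns are multiplied. The following sketch shows this under the assumption that all probabilities are strictly positive; the function name and argument layout are illustrative.

```python
import math


def semi_naive_bayes(prior, fern_likelihoods):
    """Posterior over classes from P(c) and the per-fern terms P(F_i | c).

    prior[c] = P(c); fern_likelihoods[i][c] = P(F_i | c) for the observed
    outcome of fern i. All values are assumed to be strictly positive.
    """
    log_post = [math.log(p) for p in prior]
    for lik in fern_likelihoods:
        log_post = [lp + math.log(v) for lp, v in zip(log_post, lik)]
    # Normalise with the log-sum-exp trick.
    m = max(log_post)
    unnorm = [math.exp(lp - m) for lp in log_post]
    z = sum(unnorm)
    return [u / z for u in unnorm]


post = semi_naive_bayes([0.5, 0.5], [[0.2, 0.1], [0.4, 0.1]])
```

In this example the unnormalised scores are 0.5 · 0.2 · 0.4 = 0.04 and 0.5 · 0.1 · 0.1 = 0.005, so the first class receives 8/9 of the probability mass.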

There are many ways to randomly select trees and ferns. As we have seen, some use purely randomly chosen tests[4], while others randomly partition the available tests[42]. One can also combine random selection with additional conditions, such as choosing randomly from the 20 tests with the best information gain[16]. Another way to inject randomness is to train different ferns with different random subsets of the training data (i.e. bagging), which can be combined with greedily choosing tests based on information gain[29, 6].

Selecting tests randomly might seem suboptimal, given that one could use e.g. boosting to select tests (or entire trees) that take into account what previous tests/trees failed to classify correctly. In noisy situations, however, classification using random tests can outperform boosting[16]. The intuition is that boosting ends up focusing on getting the mislabeled examples correct, while randomization puts no extra weight on them.

3.2 Methods Investigated

Two methods are compared. One is based on randomized ferns and the other is based on support vector machines.

3.2.1 Randomized Ferns

The first method is based on the randomized ferns classifier presented in [4]. Eight classes are used, each representing a 45 degree slice of the pan angle, as shown in figure 3.2. Given an image, the randomized ferns produce a probability distribution over the eight classes.
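Mapping a pan angle to one of the eight classes amounts to a simple binning operation. The sketch below assumes that class k is centered on k · 45 degrees, so that class 0 covers −22.5 to 22.5 degrees; the actual placement and numbering of the bins is given by figure 3.2 and may differ from this assumption.

```python
def pan_angle_to_class(angle_deg, n_classes=8):
    """Class index for a pan angle, assuming class k is centred on k * 45 deg."""
    width = 360.0 / n_classes
    # Shift by half a bin so that bin edges fall halfway between centres.
    return int(((angle_deg % 360.0) + width / 2.0) // width) % n_classes


classes = [pan_angle_to_class(a) for a in (0, 22.4, 22.5, 90, 180, 337.5, -10)]
```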

Random selection of the tests (with replacement) is used, along with combining the posteriors using the arithmetic mean. Alternatives, such as test selections guided by information gain, randomly chosen training sets, and combining posteriors from partitioned features in a naive Bayes-fashion, have not been explored.

Figure 3.2: The pan-angle bins used as classes. Each bin spans 45 degrees. The labels shown are the centers of the circle sectors that make up the bin.

Pre-processing

The ferns are applied on the head detections produced by the head detector from chapter 2. The size of the detections is normalized to 20 × 20 pixels. The size is chosen to be larger than the vast majority of detections so as not to subsample the information.

The detections’ raw pixel colors are not used directly in the tests. Rather, a segmentation algorithm is applied to the image to divide it into six segments based on pixel color. Each of the six segments is labeled as hair, skin or background, prior to being sent off to the fern. The fern uses tests of the type “Is the pixel at (17, 7) labeled as skin?”, where both the pixel coordinate and test type (skin, hair or background) are chosen randomly.

The segmentation algorithm used is k-means clustering with Euclidean distance. The segmentation is not performed in the RGB color space; rather, the perceptually uniform CIELUV color space[58] (henceforth referred to simply as “Luv”) is used (see footnote 1). The clustering is performed a total of three times. Only the clustering with the smallest sum of intra-cluster distances (between cluster points and the centroid) is kept. The reason for the repeated clustering is to lessen the effect of a poor k-means initialization.
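The repeated clustering can be sketched as follows. The sketch uses a plain Lloyd-style k-means on color tuples and keeps the run with the smallest sum of point-to-centroid distances. The conversion to Luv is not shown, and the initialization scheme, iteration count and function names are assumptions made for illustration.

```python
import math
import random


def dist(p, q):
    """Euclidean distance between two colour tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]


def kmeans(points, k, rng, iterations=20):
    """Plain Lloyd iterations; returns (centroids, labels, cost)."""
    centroids = rng.sample(points, k)  # random initialisation (assumed)
    for _ in range(iterations):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    labels = assign(points, centroids)
    # Cost: sum of distances between each point and its cluster centroid.
    cost = sum(dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, cost


def best_of_restarts(points, k, restarts=3, seed=0):
    """Run k-means several times and keep the run with the lowest cost."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[2])


pixels = [(0, 0, 0), (0, 1, 0), (10, 10, 10), (10, 11, 10)]
centroids, labels, cost = best_of_restarts(pixels, k=2)
```

For the segmentation described above, `pixels` would hold the Luv values of the 400 pixels in a detection and `k` would be six.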

Footnote 1: The implementation used for the experiments has an implementation error that causes negative values of the u and v components to be truncated to 0. This affects a total of 176 out of 1096 detections. The mean number of values affected in the affected detections is 17 out of 800.

Figure 3.3: An example of a head detection that has been segmented and labeled: (a) the head detection, (b) the segmented head detection, (c) the segments labeled as skin, hair and background, (d) the skin segments separated from the rest, (e) the hair segments separated from the rest, (f) the background segments separated from the rest. The hair and jersey having similar colors causes some of the jersey and hair to end up in the same segment.

Training and Classification

Consider a single fern and a single image. Let $L = \{l_1, l_2, \ldots, l_{729}\}$ be the set of all possible ways to assign the labels “skin”, “hair”, “background” to six segments ($3^6 = 729$ ways in total). Let $d_l$ be the leaf where the image ends up when labeled using $l$. Let $c$ be the class of the image. Let $s_l$ be the event that labeling $l$ is the true labeling.
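The set of labelings is small enough to enumerate explicitly. The sketch below generates all assignments of three labels to six segments and applies one of them to a segment map; the function names and the representation of a segmented image as a 2-D list of segment indices are assumptions made for illustration.

```python
from itertools import product

LABELS = ("skin", "hair", "background")


def all_labelings(n_segments=6):
    """Every way to assign one of the three labels to each segment."""
    return list(product(LABELS, repeat=n_segments))


def label_image(segment_map, labeling):
    """Replace each segment index in a 2-D map with that segment's label."""
    return [[labeling[s] for s in row] for row in segment_map]


labelings = all_labelings()
example = label_image(
    [[0, 1], [5, 0]],
    ("skin", "hair", "skin", "skin", "skin", "background"),
)
```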

The training data consist of segmented images for which the true class and labeling are known. An example of an image with labeled segments is shown in figure 3.3. A fern is trained by taking each training example and computing $d_l$ for all $l \in L$. Two histograms are then constructed for each class. One records which leaves were reached by the training images when using the true labeling. The other records which leaves were reached by the training images when using an incorrect labeling (a labeling other than the true labeling).

As we shall see, we want to estimate $P(d_l \mid c, s_l)$ and $P(d_l \mid c, \neg s_l)$. These are estimated by simply normalizing the computed histograms. However, to avoid assigning probability 0 to any leaf, add-one smoothing (adding one to all histogram bins) is first applied.

Classification is performed by, for each $l \in L$, applying $l$ on the image’s segments and then sending it through the fern. Let $\vec{d} = (d_{l_1}, d_{l_2}, \ldots, d_{l_{729}})$. We want to estimate $P(c \mid \vec{d})$, the probability of each class given the leaves that the image ended up in.

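Assuming that $d_{l_1}, d_{l_2}, \ldots, d_{l_{729}}$ are conditionally independent given the true class, we can utilize Bayes’ rule and the law of total probability to obtain the following (J. Sullivan, personal communication).

\[
P(c \mid \vec{d}) \propto P(\vec{d} \mid c)\, P(c)
\]

\[
\begin{aligned}
P(\vec{d} \mid c) &= \prod_{l \in L} P(d_l \mid c) \\
&= \prod_{l \in L} \left[ P(d_l \mid c, s_l)\, P(s_l \mid c) + P(d_l \mid c, \neg s_l)\, P(\neg s_l \mid c) \right] \\
&= \prod_{l \in L} \left[ P(d_l \mid c, s_l)\, P(s_l \mid c) + P(d_l \mid c, \neg s_l)\, \bigl(1 - P(s_l \mid c)\bigr) \right]
\end{aligned}
\]

$P(d_l \mid c, s_l)$ and $P(d_l \mid c, \neg s_l)$ are the quantities estimated during training. Note that this derivation differs from the one described in [4].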

The priors $P(s_l \mid c)$ and $P(c)$ are estimated from the training data. $P(s_l \mid c)$ is estimated as the fraction of images of class $c$ that have any permutation of $l$ as true labeling. Reflections around the half-way line are included when estimating $P(c)$, making it mimic the symmetry of the pitch. Add-one smoothing is applied when estimating $P(s_l \mid c)$ to avoid assigning probability 0.
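The per-class likelihood from the derivation above can be evaluated as in the following sketch for one fern and one class. It assumes that the smoothed histograms and the priors are already available as lists, and it works in log space to avoid underflow across the 729 factors; the function names and data layout are assumptions made for illustration.

```python
import math


def smoothed_distribution(counts):
    """Add-one smoothing followed by normalisation of a histogram."""
    total = sum(counts) + len(counts)
    return [(c + 1) / total for c in counts]


def log_likelihood(leaves, p_leaf_true, p_leaf_false, p_true):
    """log P(d | c) for one fern and one class c.

    leaves[l]          : leaf reached under labeling l
    p_leaf_true[leaf]  : P(d_l = leaf | c, s_l)
    p_leaf_false[leaf] : P(d_l = leaf | c, not s_l)
    p_true[l]          : P(s_l | c)
    """
    total = 0.0
    for l, leaf in enumerate(leaves):
        # Law of total probability over "labeling l is true / is not true".
        mix = (p_leaf_true[leaf] * p_true[l]
               + p_leaf_false[leaf] * (1.0 - p_true[l]))
        total += math.log(mix)
    return total


p_true_leaf = smoothed_distribution([0, 2])   # -> [0.25, 0.75]
value = log_likelihood([0, 1], p_true_leaf, [0.5, 0.5], [0.5, 0.5])
```

Adding $\log P(c)$ to this value for every class and normalising gives the posterior of a single fern.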

Let $\vec{d}_1, \vec{d}_2, \ldots, \vec{d}_n$ be the vectors of leaves from ferns 1 through $n$. The ferns’ posteriors are combined using the arithmetic mean.

\[
P(c \mid \vec{d}_1, \ldots, \vec{d}_n) = \frac{1}{n} \sum_{i=1}^{n} P(c \mid \vec{d}_i)
\]

3.2.2 Support Vector Machine

Given the success of neural networks, it seems reasonable that an SVM with similar features would perform well. Therefore, an SVM is tested and compared to the randomized ferns.

An SVM is trained using features computed from the head detection. Three sets of features are considered. The first is to simply give the SVM the image encoded in the Luv color space. The other two sets of features are grayscale and edge responses, which have been shown to work for neural networks[55]. The image is converted to grayscale and histogram equalization[19] is applied to give better resilience to illumination changes. Edge responses are computed for both the x and y directions using a Sobel operator[19]. Before being sent to the SVM, the size of the head detection is normalized to 20 × 20 pixels and features are scaled to the range [−1, 1].
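Two of the feature computation steps can be sketched with the standard library alone: the Sobel edge responses and the scaling to [−1, 1]. The sketch assumes a grayscale image stored as a 2-D list, computes responses for interior pixels only, and scales each feature over the whole training matrix; border handling and the exact scaling procedure used in the project are not specified, so these are assumptions. Histogram equalization is omitted.

```python
def sobel(image):
    """Return (gx, gy) Sobel responses for the interior pixels of a 2-D list."""
    kx = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
    ky = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))
    gx, gy = [], []
    for r in range(1, len(image) - 1):
        row_x, row_y = [], []
        for c in range(1, len(image[0]) - 1):
            sx = sy = 0
            for i in range(3):
                for j in range(3):
                    v = image[r + i - 1][c + j - 1]
                    sx += kx[i][j] * v
                    sy += ky[i][j] * v
            row_x.append(sx)
            row_y.append(sy)
        gx.append(row_x)
        gy.append(row_y)
    return gx, gy


def scale_columns(rows, lo=-1.0, hi=1.0):
    """Linearly map each feature (column) of the training matrix to [lo, hi]."""
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    scaled = []
    for row in rows:
        scaled.append([
            lo + (hi - lo) * (v - mn) / (mx - mn) if mx > mn else 0.0
            for v, mn, mx in zip(row, mins, maxs)
        ])
    return scaled


# A vertical edge gives a strong x response and no y response.
gx, gy = sobel([[0, 0, 10],
                [0, 0, 10],
                [0, 0, 10]])
features = scale_columns([[0, 5], [10, 5]])
```

The minima and maxima found on the training set would have to be stored and reused when scaling test examples, so that both sets are mapped in the same way.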

The LIBSVM library[10] is used with an RBF-kernel and C-style soft margins. This setup is recommended as the first thing to try by [22]. No other kernels were explored. LIBSVM supports both multi-class classification (using the “one-against-one” approach[23]) and posterior estimation (using the second approach from [62]). This is used to produce a probability distribution over the eight classes.

Figure 3.4: A diagram summarizing the randomized ferns approach to bottom-up head pose estimation. Rounded rectangles represent operations. Rectangles without rounded corners represent computed results. Parallelograms represent inputs. [Diagram not reproduced. Box labels: Footage, Tracking data, Homographies, Head detector, Head detection, Normalize size and segment, Segmented detection, Apply fern 1 ... Apply fern n, Combine results, Probability distribution.]

3.3 Summary

Overviews of the two approaches to bottom-up head pose estimation are shown in figures 3.4 and 3.5.

Figure 3.5: A diagram summarizing the SVM approach to bottom-up head pose estimation. Rounded rectangles represent operations. Rectangles without rounded corners represent computed results. Parallelograms represent inputs. [Diagram not reproduced. Box labels: Footage, Tracking data, Homographies, Head detector, Head detection, Normalize size and compute features, Apply SVM.]

Chapter 4

Top-down Head Pose Estimation

The appearance of the head is not the only source of information about a player’s head pose. It would for instance seem reasonable that the player is unlikely to have the head oriented in such a way that nothing is in his or her field of view. Furthermore, one could imagine that players are often looking in the direction of the football, as seen in figure 4.1. An example of using top-down information as a complement to bottom-up head pose estimation can be found in [49], where information about the body orientation is used to augment the bottom-up head pose estimation.

Figure 4.1: An example of a situation when the head pose is easily inferred by just knowing where the action is.

The additional information used is primarily drawn from the supplied tracking data. Features that can be directly extracted from the tracking data include:

• Position of the player whose head pose is being estimated

• Positions of other players

• Position of the football

It seems reasonable that the absolute position of e.g. the football is not all that important. Rather, the relative position, or the angle and distance to the football, ought to capture the most relevant information. Some such secondary features that might be of interest include the following.

• The direction and speed of the player’s movement

• The angles and distances to other players

• The angle and distance to the football

Examples of some of these features are shown in figure 4.2.
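The secondary features listed above follow directly from the tracked positions. The sketch below is one possible way to compute them; the coordinate convention (pitch coordinates as `(x, y)` pairs, angles in radians measured counter-clockwise from the positive x-axis) and the function names are assumptions made for illustration.

```python
import math


def angle_and_distance(origin, target):
    """Angle (radians, counter-clockwise from the +x axis) and distance
    from `origin` to `target`, both given as (x, y) pitch coordinates."""
    dx = target[0] - origin[0]
    dy = target[1] - origin[1]
    return math.atan2(dy, dx), math.hypot(dx, dy)


def angles_to_players(player, others):
    """Angles to the other players, sorted by increasing distance."""
    pairs = [angle_and_distance(player, other) for other in others]
    pairs.sort(key=lambda pair: pair[1])
    return [angle for angle, _ in pairs]


def direction_and_speed(previous, current, dt):
    """Direction of motion (radians) and speed between two positions
    observed `dt` seconds apart."""
    angle, distance = angle_and_distance(previous, current)
    return angle, distance / dt
```

Sorting by distance gives the SVM a consistent ordering of the other players from one frame to the next.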

4.1 Method

Rather than handcrafting a model, an SVM classifier is trained on the features and then used to estimate a posterior over the head pose directions. This is done using the LIBSVM library[10] in a fashion similar to section 3.2.2, but with other features. An RBF-kernel is used along with C-style soft margins. No other kernels are explored.

The football is not always present in the tracking data. To deal with this, two SVMs are trained: one for situations when the football is present, and one for situations when it is not. In both cases, the SVMs have eight classes, one for each bin used for bottom-up head pose estimation (shown in figure 3.2).

It might seem natural to treat the problem as a regression problem (since some misclassifications are worse than others). The periodicity of the angle, however, complicates matters. A classifier is used because no readily available implementation of periodic regression was found.

4.1.1 Camera Angle

There is some mismatch in the sense that the features are measured relative to the pitch, while the bins describe the head’s orientation relative to the camera. The orientation of the bins hence depends on the player’s location on the pitch, as shown in figure 4.3. This means that even if e.g. all top-down angles are the same, the true class might differ due to differences in the camera angle. Therefore, the camera angle is added as an additional feature.

The camera angle is computed in a fashion similar to [49]. Using the homographies computed in chapter 2, we exploit that the homography applied to the vertical vanishing point (the point at which two parallel vertical lines converge in the image plane) yields the camera location[12]. The vertical vanishing point is computed for each camera by manually marking two vertical lines that are parallel in the real world and computing their intersection. This method is sensitive to small errors in the manual clicking. While the vanishing point can be computed without the need for manual labor[33], the manual approach is used because of its simplicity.

Figure 4.2: The tracking data for the situation shown in figure 4.1. The circles and crosses are the players from the two teams. The asterisk is the football. A dashed line is drawn between one of the players and the football, with the angle to the football marked. The dash-dotted line is to one of the other players (with the corresponding angle marked).

Figure 4.3: The pan angles measured relative to the camera do not necessarily correspond to the absolute angles. Each pie represents the class bins at a specific location on the football pitch. The black slice represents the bin used when the player is looking directly at the camera. As can be seen, a player looking directly at the camera could be looking anywhere between south-east and south-west relative to the pitch.

4.1.2 Feature Selection

It is desirable to limit the number of features used to reduce the risk of over-fitting and to avoid unnecessary computation[21]. We want to select a minimal set of features that captures most of the relevant information. Testing all possible sets of features is not feasible. Instead, the features are selected in a greedy fashion. The selection procedure is as follows:

1. Start with the empty set of features.

2. For each feature not in the set of features:

   (a) Temporarily add the feature to the set of features used.

   (b) Optimize the SVM parameters with respect to the classification rate on the validation set (explained in section 6.1).

   (c) Compute the performance gain as the improvement in the mean prediction error (explained in section 6.2) on the validation set compared to the mean prediction error before the feature was added.

3. If no feature yields a positive performance gain, then stop. Otherwise: add the feature that gave the greatest performance gain to the set of features.

4. Go to step 2.
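The greedy procedure above can be sketched as follows. The callback `evaluate` is hypothetical: it stands in for training an SVM on the given features (including the parameter optimization of step 2(b)) and returning the mean prediction error on the validation set. The `baseline_error` for the empty feature set is also an assumption, since the text does not say what error is assigned before any feature is added.

```python
def greedy_forward_selection(candidates, evaluate, baseline_error):
    """Greedy forward feature selection.

    candidates     : list of feature names
    evaluate       : callable(list of features) -> mean prediction error
                     (lower is better); hypothetical stand-in for training
                     and validating an SVM
    baseline_error : error assigned to the empty feature set (assumption)
    Returns the selected features and the error after each addition.
    """
    selected = []
    current_error = baseline_error
    history = []
    while True:
        best_feature, best_error = None, current_error
        for feature in candidates:
            if feature in selected:
                continue
            error = evaluate(selected + [feature])
            # Keep the feature with the largest positive gain.
            if error < best_error:
                best_feature, best_error = feature, error
        if best_feature is None:
            # No remaining feature improves the error: stop.
            return selected, history
        selected.append(best_feature)
        current_error = best_error
        history.append((best_feature, best_error))
```

The returned history has the same shape as tables 4.1 and 4.2: one row per selected feature, with the error after that feature was added.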

The following features are considered. Each item is considered as one feature, and is not subdivided.

• Angle to the football (when present)

• Distance to the football (when present)

• Angles to all players (sorted by distance)

• Distances to all players (sorted by distance)

• Direction and speed of motion of the player

• Camera angle

• x and y position of the player

The features selected by the described greedy procedure are shown in tables 4.1 and 4.2.

Table 4.1: The features selected when the football is present. The features are listed in order of importance along with the mean prediction error (explained in section 6.2) after the feature has been added.

Feature                          Mean prediction error
Angle to the football            31.9
Camera angle                     12.4
Direction and speed of motion    7.8

Table 4.2: The features selected when the football is not present. The features are listed in order of importance along with the mean prediction error (explained in section 6.2) after the feature has been added.

Feature                          Mean prediction error
Angles to all players            36.7
Camera angle                     26.2
Direction and speed of motion    24.3

4.2 Summary

An overview of the top-down head pose estimation is shown in figure 4.4.

Figure 4.4: A diagram summarizing the top-down head pose estimation. Rounded rectangles represent operations. Rectangles without rounded corners represent computed results. Parallelograms represent input. [Diagram not reproduced. Box labels: Homographies, Tracking data, Compute camera angle, Camera angle, Is the football present? (Yes/No), Compute features for when the football is present / not present, Football angle, Angles to players, Direction and speed of motion, Apply the SVM trained for situations when the football is present / not present.]

Chapter 5

Tying Things Together

5.1 Fusion

The last step is to fuse the bottom-up and top-down posteriors. Four fusion methods are compared. The first two methods are linear interpolation[45] and independent likelihood pool[51], taken from the problem of multi-sensor data fusion. The third is log-linear interpolation[17], which is used for merging language models (probability distributions over word sequences) in speech recognition. The fourth alternative is to give all the features to a single monolithic SVM, and hence not separate the bottom-up and top-down estimation.

These fusion methods will now be explained in general terms. Let $o_1, \ldots, o_n$ be observations from $n$ different sources. Assume that we know $P(c \mid o_1), \ldots, P(c \mid o_n)$. We want to estimate $P(c \mid o_1, \ldots, o_n)$.

5.1.1 Linear Interpolation

Weights $\alpha_1, \ldots, \alpha_n$ with $\sum_{i=1}^{n} \alpha_i = 1$ are assigned to the sources. The fused posterior is computed as a linear sum[45].

$$P(c \mid o_1, \ldots, o_n) = \sum_{i=1}^{n} \alpha_i P(c \mid o_i)$$

The averaging of posteriors used by the randomized ferns in chapter 3 can be seen as a special case of linear interpolation.

5.1.2 Log-Linear Interpolation

Log-linear interpolation is similar to linear interpolation, but the linear sum is taken in log-scale[17].

$$P(c \mid o_1, \ldots, o_n) \propto \prod_{i=1}^{n} P(c \mid o_i)^{\alpha_i}$$
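Both interpolation schemes are short to write down for discrete posteriors. The sketch below assumes each posterior is a list of class probabilities and that the weights sum to one; the function names are chosen for this example only.

```python
def linear_interpolation(posteriors, weights):
    """Weighted arithmetic mean of the source posteriors."""
    n_classes = len(posteriors[0])
    return [sum(w * p[c] for w, p in zip(weights, posteriors))
            for c in range(n_classes)]


def log_linear_interpolation(posteriors, weights):
    """Weighted geometric mean of the source posteriors, renormalized."""
    n_classes = len(posteriors[0])
    scores = [1.0] * n_classes
    for w, p in zip(weights, posteriors):
        for c in range(n_classes):
            scores[c] *= p[c] ** w
    total = sum(scores)
    return [s / total for s in scores]
```

Note the practical difference: in the log-linear case a class that any source with non-zero weight assigns probability zero gets a fused probability of zero, whereas linear interpolation only scales it down.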


5.1.3 Independent Likelihood Pool

Assume that $o_1, \ldots, o_n$ are conditionally independent given $c$. Bayes’ rule is applied in the following manner to arrive at what is known as renormalized multiplication or independent likelihood pool[45, 51, 2].

$$P(c \mid o_1, \ldots, o_n) \propto P(c) \prod_{i=1}^{n} P(o_i \mid c) \tag{5.1}$$

Equation 5.1 requires that we know $P(o_i \mid c)$, while we only have access to distributions of the form $P(c \mid o_i)$. Bayes’ rule can be used to flip the probabilities (J. Sullivan, personal communication).

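$$\begin{aligned}
P(c \mid o_1, \ldots, o_n) &\propto P(c) \prod_{i=1}^{n} P(o_i \mid c) \\
&= P(c) \prod_{i=1}^{n} \frac{P(c \mid o_i)\,P(o_i)}{P(c)} \\
&= \frac{1}{P(c)^{n-1}} \left( \prod_{i=1}^{n} P(o_i) \right) \prod_{i=1}^{n} P(c \mid o_i)
\end{aligned}$$

Given $o_1, \ldots, o_n$, $\prod_{i=1}^{n} P(o_i)$ is a constant. Hence

$$P(c \mid o_1, \ldots, o_n) \propto \frac{1}{P(c)^{n-1}} \prod_{i=1}^{n} P(c \mid o_i)$$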
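The final expression translates into a few lines of code. This is a minimal sketch assuming discrete posteriors stored as lists and a strictly positive class prior; the function name is chosen for illustration.

```python
def independent_likelihood_pool(posteriors, prior):
    """Fuse posteriors P(c | o_i) from conditionally independent sources.

    Implements P(c | o_1..o_n) proportional to
    prod_i P(c | o_i) / P(c)^(n - 1), followed by renormalization.
    The prior is assumed to be strictly positive for every class.
    """
    n = len(posteriors)
    scores = []
    for c in range(len(prior)):
        product = 1.0
        for p in posteriors:
            product *= p[c]
        scores.append(product / prior[c] ** (n - 1))
    total = sum(scores)
    return [s / total for s in scores]
```

With a uniform prior the division has no effect after renormalization, and the method reduces to multiplying the posteriors.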

The independence assumption should hold in our case. There are two sources: the bottom-up analysis from the ferns and the top-down SVMs. Intuitively, the appearance of the head should not be affected by where other players are located beyond what the head pose already explains. Conversely, the location of other players should not be affected by the appearance of the player’s head given the head pose (and probably not even by the head pose).

5.1.4 Monolithic Support Vector Machine

Instead of fusing separate posteriors, the top-down and bottom-up features can be given to the same SVM. This essentially merges the top-down and bottom-up estimation in the case when an SVM is used for both parts. One downside of using a monolithic SVM is that the bottom-up training data has to be split into the two top-down situations, resulting in e.g. less bottom-up training data when the football is in play. Additionally, the amount of top-down training data is also decreased since not all top-down examples have a head detection (as explained in section 6.1). The question is whether this can be compensated by the SVM having access to both top-down and bottom-up features.


5.2 Temporal Smoothing

The human head has a maximum rotational speed of about 700 degrees per second[18]. Hence, the head pose must maintain some temporal smoothness between frames. This constraint can be exploited by applying a sequential Bayesian filter. We make the simplifying Markov assumption that the head pose depends only on the head pose that directly precedes it (which is not necessarily true, since it takes time to accelerate and decelerate the head).

Let $c_t$ be the true bin and $z_t$ be the observation (both footage and locations) at time $t$. $P(c_t \mid z_t)$ is obtained from the fusion. We are interested in the distribution over the bins given all observations up to time $t$, i.e. $P(c_t \mid z_1, \ldots, z_t)$. Bayesian filtering gives the following recursive solution[8].

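$$P(c_t \mid z_1, \ldots, z_t) \propto P(z_t \mid c_t)\,P(c_t \mid z_1, \ldots, z_{t-1}) \tag{5.2}$$

$$P(c_t \mid z_1, \ldots, z_t) \propto \frac{P(c_t \mid z_t)}{P(c)}\,P(c_t \mid z_1, \ldots, z_{t-1}) \tag{5.3}$$

$$P(c_t \mid z_1, \ldots, z_{t-1}) = \int P(c_t \mid c_{t-1})\,P(c_{t-1} \mid z_1, \ldots, z_{t-1})\,dc_{t-1} \tag{5.4}$$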

$P(c_1 \mid z_1)$ is obtained directly without filtering. Equation 5.3 is then recursively applied until $P(c_t \mid z_1, \ldots, z_t)$ is obtained for the desired $t$. If an observation $z_t$ is missing (e.g. due to a poor detection) then the prior $P(c)$ is used in place of $P(c_t \mid z_t)$.

Additionally, we can utilize some future observations (by adding a slight delay to the stream of estimations, so-called fixed-lag smoothing). Delaying the result by ten frames seems reasonable. Taking observations up to $n > t$ into account, Bayesian smoothing gives the following recursive update rule[8].

$$P(c_{t-1} \mid z_1, \ldots, z_n) = P(c_{t-1} \mid z_1, \ldots, z_{t-1}) \int \frac{P(c_t \mid c_{t-1})\,P(c_t \mid z_1, \ldots, z_n)}{P(c_t \mid z_1, \ldots, z_{t-1})}\,dc_t \tag{5.5}$$

$P(c_{t-1} \mid z_1, \ldots, z_{t-1})$ and $P(c_n \mid z_1, \ldots, z_n)$ are obtained using equation 5.3, while $P(c_t \mid z_1, \ldots, z_{t-1})$ is computed using equation 5.4. Equation 5.5 is then recursively applied until $P(c_t \mid z_1, \ldots, z_n)$ is obtained for the desired $t$.

In a similar fashion to [4], the transition probability $P(c_t \mid c_{t-1})$ between two bins $c_t$ and $c_{t-1}$ is modelled as a Gaussian function of the smallest positive angle $a_{c_t, c_{t-1}}$ between the bin centers.

$$P(c_t \mid c_{t-1}) \propto \exp\left( -\frac{a_{c_t, c_{t-1}}^2}{2\sigma^2} \right) \tag{5.6}$$
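Because the head pose is discretized into bins, the integrals in equations 5.4 and 5.5 become sums. The sketch below illustrates the filtering and smoothing recursions under that reading. It assumes that σ and the angles between bin centers are expressed in radians (the text does not state the unit), and all function names are chosen for this example.

```python
import math


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def transition_matrix(sigma, n_bins=8):
    """trans[i][j] ~ P(c_t = j | c_{t-1} = i), following equation 5.6.

    Assumption: sigma and the bin-center angles are in radians."""
    step = 2.0 * math.pi / n_bins
    trans = []
    for i in range(n_bins):
        row = []
        for j in range(n_bins):
            k = abs(i - j)
            angle = min(k, n_bins - k) * step  # smallest angle on the circle
            row.append(math.exp(-angle ** 2 / (2.0 * sigma ** 2)))
        trans.append(normalize(row))
    return trans


def forward_filter(observations, prior, trans):
    """Equations 5.3 and 5.4 with the integral replaced by a sum over bins.

    observations[t][c] ~ P(c_t = c | z_t); pass `prior` for a missing frame.
    Returns the filtered and the one-step predicted distributions."""
    n = len(prior)
    filtered = [list(observations[0])]
    predicted = [None]  # no prediction exists for the first frame
    for obs in observations[1:]:
        pred = [sum(trans[i][j] * filtered[-1][i] for i in range(n))
                for j in range(n)]
        predicted.append(pred)
        filtered.append(normalize(
            [obs[c] / prior[c] * pred[c] for c in range(n)]))
    return filtered, predicted


def backward_smooth(filtered, predicted, trans):
    """Equation 5.5, run backwards from the last available frame."""
    n = len(filtered[0])
    smoothed = [None] * len(filtered)
    smoothed[-1] = filtered[-1]
    for t in range(len(filtered) - 1, 0, -1):
        smoothed[t - 1] = normalize([
            filtered[t - 1][i] * sum(
                trans[i][j] * smoothed[t][j] / predicted[t][j]
                for j in range(n))
            for i in range(n)])
    return smoothed
```

For fixed-lag smoothing, one would run `forward_filter` up to the newest frame and apply `backward_smooth` over the last few frames only, reporting the estimate for the frame that lies the chosen lag behind.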

5.3 Angle Estimation

The posterior produced is over eight discrete bins, while the angle that we want to estimate is continuous. Picking e.g. the center angle of the most likely bin is hardly optimal. If we for instance have two bins that are equally likely, then we would much rather choose an angle between the bins.

The mean angle, where each bin is represented by its center angle, is used as the estimate. The arithmetic mean will not do since the angles are periodic. Instead we take a page from directional statistics and use the circular mean, which is defined as follows for angles $\theta_1, \ldots, \theta_n$[34].

$$\bar{C} = \frac{1}{n} \sum_{i=1}^{n} \cos\theta_i, \qquad \bar{S} = \frac{1}{n} \sum_{i=1}^{n} \sin\theta_i$$

$$\bar{\theta} = \begin{cases} \arctan(\bar{S}/\bar{C}) & \text{if } \bar{C} \geq 0 \\ \arctan(\bar{S}/\bar{C}) + \pi & \text{otherwise} \end{cases}$$

In essence, the circular mean $\bar{\theta}$ is the angle to the center of mass of the unit vectors corresponding to $\theta_1, \ldots, \theta_n$.
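A small sketch of the circular mean, extended with weights so that it can be applied to a posterior over bins. Two assumptions are made here: bin $k$ is taken to be centered at $k \cdot 360/8$ degrees (the actual bin layout is given in figure 3.2), and `math.atan2` is used in place of the piecewise arctangent, which yields the same direction but reports it in a different range.

```python
import math


def circular_mean(angles, weights=None):
    """Weighted circular mean of angles in radians, returned in [0, 2*pi)."""
    if weights is None:
        weights = [1.0] * len(angles)
    total = sum(weights)
    c_bar = sum(w * math.cos(a) for w, a in zip(weights, angles)) / total
    s_bar = sum(w * math.sin(a) for w, a in zip(weights, angles)) / total
    # atan2 picks the correct quadrant without a separate case for c_bar < 0.
    return math.atan2(s_bar, c_bar) % (2.0 * math.pi)


def posterior_mean_angle(posterior):
    """Mean angle of a posterior over equally spaced bins.

    Assumption: bin k is centered at k * 2*pi / len(posterior)."""
    n = len(posterior)
    centers = [2.0 * math.pi * k / n for k in range(n)]
    return circular_mean(centers, posterior)
```

With equal mass on two adjacent bins, the estimate falls halfway between their centers, which is the behavior argued for above.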

5.4 Summary

A diagram of the whole system put together, combining the bottom-up and top-down estimations, is shown in figure 5.1. A diagram of the monolithic SVM approach is shown in figure 5.2.

Figure 5.1: A diagram summarizing the head pose estimation process. Rounded rectangles represent operations. Rectangles without rounded corners represent computed results. Parallelograms represent input. [Diagram not reproduced. Box labels: Footage, Tracking data, Homographies, Head detector, Head detection, Bottom-up head pose estimator, Top-down head pose estimator, Fusion, Apply temporal smoothing, Smoothed probability distribution, Pick the mean, Estimated angle.]

Figure 5.2: A diagram summarizing the head pose estimation process when a monolithic SVM is used. Rounded rectangles represent operations. Rectangles without rounded corners represent computed results. Parallelograms represent input. [Diagram not reproduced. Box labels: Footage, Tracking data, Homographies, Head detector, Head detection, Is the football present? (Yes/No), Apply the SVM trained for situations when the football is present / not present, Apply temporal smoothing, Smoothed probability distribution, Pick the mean, Estimated angle.]

Chapter 6

Experiments and Results

6.1 Dataset Division and Labeling

In total, eleven minutes (16500 frames) of footage is available. The footage is used to construct the following three datasets.

• The validation set consists of frames 111 through 789 (679 frames, 27 seconds). It is used to optimize parameters and forms the basis for design decisions.

• The training set consists of pairs of frames and players selected randomly from frames between 1500 and 9000 (5 minutes).

• The test set consists of the last 501 frames (20 seconds). It is used to estimate the generalization performance of the final system.

Additionally, the 100 frames before each set were used to initialize the background subtraction model.

The head detector described in chapter 2 was applied to the entire dataset. We make the assumption of having a reliable head detector, removing all sub-par detections. The quality was judged by a human. In general, any detections where the (correct player’s) head was not present, did not cover most of the image, or was too off-center, were removed. In total, 9859 out of 14894 (66%) validation set detections were kept, and 6987 out of 11022 (63%) test set detections were kept. Samples of removed and kept head detections are shown in figure 6.1.

For the validation set, the head pose of each player is labeled in each frame. The labels assigned correspond to the eight head pose direction classes (shown in figure 3.2). Two special labels, representing that the head is occluded and that the player is looking straight down, are also used.

The training and test sets are labeled similarly, but only good detections are labeled. Additionally, some of the detection images in the training set are segmented as described in chapter 3. The segments are labeled as either “skin”, “hair” or “background”. The labeled segmentations, as well as their reflections, are used when training the randomized ferns. In total, that encompasses 1192 training examples.

Not all of the examples in the training set had their segments labeled; the labeled subset was chosen so as to avoid any direction being severely underrepresented.

Figure 6.1: Random samples of kept and removed head detections: (a) kept head detections, (b) removed head detections. Note that good-looking detections can be removed if the detection is of the wrong player.

Assigning a bin to a displayed head is not necessarily easy. The head might be too blurry, or too far away for a human to reliably tell the direction. Additionally, players might be looking along the edge of two bins, making it difficult to decide in which bin the player belongs.

In an attempt to ascertain how reliable the ground truth is, three individuals separately labeled a single player during 400 frames. The labels assigned are shown in figure 6.2. The pairwise disagreements total 61, 50 and 47 out of 400 frames. All disagreements involve neighbouring classes, i.e. the distance is never greater than one bin. Hence one can expect that some of the ground truth is mislabeled by one bin.

6.2 Evaluation Method

The performance is only evaluated based on the detections that were not removed. Only instances labeled with a direction class label are used during evaluation (i.e. occluded detections and detections of players looking straight down are omitted).

When evaluating the bottom-up methods, one must take into consideration that all three sets contain instances of the same players. In the general case, we cannot expect the classifier to have training data available of the player being processed. To get a better picture of the true generalization performance, one classifier is trained for each player using all training data except the instances of that specific player. This is done for all experiments where bottom-up estimation is involved. There is one caveat though: the training set contains misdetected instances, i.e. there exist instances where the head detected does not belong to the intended player. This is a rare occurrence though, and should hopefully not affect the results much.

Figure 6.2: The labels assigned when three persons individually labeled the same player. Each bin is split up into three lanes. If e.g. the top lane of a bin is filled then that indicates that the person corresponding to that lane assigned that bin during those frames. Note that bin -2 corresponds to the player looking straight down (regardless of head pose) and that bins 1 and 8 are adjacent.

The ground truth is discretized into the eight bins, while the estimated angle is continuous. Any angle within the true direction’s bin must be considered to be equally valid. Therefore the error measure used is 0 if the estimated angle is within the circle sector corresponding to the true direction. If the angle is not within the true direction’s bin, then the smallest positive angle between the estimated angle and an edge of the true direction’s circle sector is used. The error measure will be referred to as the prediction error.

A prediction error of 0 corresponds to a perfect predictor. A useful comparison is the expected prediction error of a random predictor, which just assigns a random angle between 0 and 360 degrees. Without loss of generality, assume that the correct bin is the one between −22.5 and 22.5 degrees. Such a predictor will in 1/8 of the cases assign an angle within the bin, resulting in a prediction error of 0. Otherwise, it will in half of the cases assign an angle between 22.5 and 180 degrees, with an expected prediction error of (180 − 22.5)/2 degrees. In the rest of the cases, the assigned angle will be between 180 and 360 − 22.5 degrees, with the same expected prediction error. Hence the expected prediction error of a random predictor is

$$\frac{1}{8} \cdot 0 + 2 \cdot \frac{7}{8 \cdot 2} \cdot \frac{180 - 22.5}{2} \approx 68.9 \text{ degrees}$$
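The prediction error and the random-predictor baseline can be checked numerically. The sketch below is an illustration with assumed function names; it represents the true bin by its center angle and a 45-degree width, and averages the error over evenly spaced estimates around the circle.

```python
def prediction_error(estimate_deg, bin_center_deg, bin_width_deg=45.0):
    """Zero inside the true bin, otherwise the angle to the nearest bin edge."""
    # Smallest absolute angular difference to the bin center, in [0, 180].
    diff = abs((estimate_deg - bin_center_deg + 180.0) % 360.0 - 180.0)
    return max(0.0, diff - bin_width_deg / 2.0)


def random_predictor_error(step_deg=0.5):
    """Average error of uniformly spread estimates when the true bin is
    centered at 0 degrees (midpoint rule over the full circle)."""
    n = int(round(360.0 / step_deg))
    angles = [(k + 0.5) * step_deg for k in range(n)]
    return sum(prediction_error(a, 0.0) for a in angles) / n
```

Under these assumptions `random_predictor_error()` evaluates to 68.90625, in agreement with the figure of approximately 68.9 degrees derived above.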

Throughout the experiments, we will aim to minimize the mean prediction error. When evaluating separate parts, such as the randomized ferns classifier, the mean angle of the posterior is used to compute the prediction error.

6.3 Results

6.3.1 Bottom-up

Randomized Ferns

The mean prediction errors obtained when varying the number of tests (depth of the ferns) are shown in figure 6.3a. The mean prediction errors obtained when varying the number of ferns are shown in figure 6.3b. A fern count of 30 was used when varying the depth. A depth of 13 was used when varying the fern count.

A total of 50 randomized ferns classifiers with depth 13 and fern count 30 were trained on the training set. The approximate 95% confidence interval of the mean prediction error is 23.0 ± 1.4 degrees on the validation set and 26.4 ± 2.1 degrees on the test set.

A representative classifier was chosen by training three classifiers and picking the one with the best mean prediction error on the validation set. The selected randomized ferns classifier, which is used throughout the rest of the experiments, has a mean prediction error of 21.5 degrees on the validation set and 26.1 degrees on the test set. Histograms of the prediction error for the validation and test sets are shown in figure 6.4.

Using a mixture of Matlab and unoptimized C-code, the estimator executes at a speed of somewhere between 0.15 and 0.20 seconds per estimation (from the point of being given a head-detection to the point of returning the posterior). About 40% of this time is spent computing the segmentation.

Support Vector Machine

The sets of features described in section 3.2.2 are tested. The two SVM parameters, $C$ and $\gamma$, are optimized using a grid search over $(\gamma, C) \in \{(2^{e_\gamma}, 2^{e_C}) : e_\gamma \in \{-15, -13, \ldots, 3\},\ e_C \in \{-5, -3, \ldots, 15\}\}$. The pair of parameters that gives the best classification rate on the validation set is selected. The mean prediction errors, after having optimized the parameters on the validation set, are shown in table 6.1.
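The parameter grid is small enough to enumerate exhaustively. In the sketch below, `score` is a hypothetical callback standing in for training an SVM with the given parameters and returning its classification rate on the validation set.

```python
def parameter_grid():
    """All (gamma, C) pairs with exponents -15, -13, ..., 3 for gamma
    and -5, -3, ..., 15 for C."""
    return [(2.0 ** e_gamma, 2.0 ** e_c)
            for e_gamma in range(-15, 4, 2)
            for e_c in range(-5, 16, 2)]


def grid_search(score):
    """Return the (gamma, C) pair with the highest score.

    score: callable(gamma, C) -> classification rate (higher is better);
    a hypothetical stand-in for training and validating an SVM."""
    return max(parameter_grid(), key=lambda pair: score(*pair))
```

The grid contains 10 × 11 = 110 parameter pairs, so each feature set requires 110 training runs.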

Figure 6.3: The mean prediction error of the randomized ferns classifier when the depth and number of ferns are varied. The results of training a classifier with 30 ferns and a depth of 2, 4, 6, 8, 9, 10, . . . , 16 are shown in figure 6.3a. The memory is exhausted for depths greater than 16. Figure 6.3b contains the results of using a depth of 13 with $\{2^k : k \in \{0, \ldots, 7\}\}$ ferns. In both cases, the approximate 95% confidence intervals resulting from training ten classifiers are shown. [Plots not reproduced. (a) Mean prediction error when varying the fern depth; axes: depth vs. mean prediction error (degrees). (b) Mean prediction error when varying the number of ferns; axes: number of ferns vs. mean prediction error (degrees).]

An SVM using just the grayscale image feature is used as bottom-up classifier throughout the rest of the experiments. Histograms of the prediction error on the validation and test sets are shown in figure 6.5. The mean prediction errors for individual players are shown in tables 6.2 and 6.3.

6.3.2 Top-down

Two SVM classifiers are used: one when the football is present and one when it is not. The football is present in 478 out of 1380 examples (35%) in the training set, 291 out of 679 frames (43%) in the validation set, and 52 out of 501 frames (10%) in the test set. In addition, the reflections of the coordinates about the half-way line are used to double the amount of training data.

The two SVM parameters, $C$ and $\gamma$, are optimized using a grid search over $(\gamma, C) \in \{(2^{e_\gamma}, 2^{e_C}) : e_\gamma \in \{-15, -13, \ldots, 3\},\ e_C \in \{-5, -3, \ldots, 15\}\}$. The pair of parameters that gives the best classification rate on the validation set is selected. The mean prediction errors on the validation and test sets are shown in table 6.4.

Figure 6.4: Histograms of the bottom-up prediction errors on the validation set and test set when the randomized ferns classifier is used. Only prediction errors strictly greater than zero are included in the histogram. In total, 4241 examples out of the 9769 examples (43%) in the validation set had a prediction error of zero. In the test set, the corresponding figures were 2842 out of 6656 examples (43%). [Plots not reproduced. (a) Histogram of the randomized ferns classifier’s non-zero prediction errors on the validation set. (b) Histogram of the randomized ferns classifier’s non-zero prediction errors on the test set. Axes: prediction error (degrees) vs. number of examples.]

Table 6.1: The mean bottom-up prediction error (in degrees) of an SVM trained on varying sets of features. The approximate number of estimations per second, when running the SVM on a single 1.60 GHz core, is also shown.

Feature                                              Validation set   Test set   Speed (Hz)
Grayscale ($C = 2^{3}$, $\gamma = 2^{-7}$)           13.4             14.1       375
Edge ($C = 2^{1}$, $\gamma = 2^{-5}$)                14.7             14.5       150
Grayscale + edge ($C = 2^{3}$, $\gamma = 2^{-7}$)    13.0             14.3       130
Luv ($C = 2^{3}$, $\gamma = 2^{-7}$)                 13.2             11.6       135

Figure 6.5: Histograms of the bottom-up prediction errors on the validation set and test set when the SVM is used. Only prediction errors strictly greater than zero are included in the histogram. In total, 5491 examples out of the 9769 examples (55%) in the validation set had a prediction error of zero. In the test set, the corresponding figures were 3345 out of 6656 examples (50%). [Plots not reproduced. (a) Histogram of the non-zero prediction errors on the validation set for the SVM. (b) Histogram of the non-zero prediction errors on the test set for the SVM. Axes: prediction error (degrees) vs. number of examples.]

Table 6.2: The mean prediction error (in degrees) of individual players when the bottom-up SVM is applied on the home team.

Player (jersey number)     1    3    5    7   16   18   21   23   26   29   33
Mean error, validation     9   90    2   11   17   13    2   13   10    9    5
Mean error, test          19   72   20    7   17   17   24   16    9   10    7

Table 6.3: The mean prediction error (in degrees) of individual players when the bottom-up SVM is applied on the away team.

Player (jersey number)     1    3    4    5    6    8    9   10   15   18   19   23
Mean error, validation    16   10    9    9    9    7    4    5    –    6    5    5
Mean error, test          70   14   10    –   14    6   12    8    3   16    7    3

Table 6.4: The mean prediction error (in degrees) and selected parameters of the top-down approach on the validation and test set.

Situation                                                   Validation set   Test set
Football present ($\gamma = 2^{-5}$, $C = 2^{7}$)           7.8              4.7
Football not present ($\gamma = 2^{-13}$, $C = 2^{13}$)     24.3             29.2

Table 6.5: Mean prediction error (in degrees) of the various fusion methods on the validation and test sets when the football is present.

Method                         Validation set   Test set
Without fusion: top-down       7.8              4.7
Without fusion: bottom-up      14.3             11.3
Log-linear (α = 0.64)          5.5              4.1
Linear (α = 0.60)              5.6              4.2
Independent likelihood pool    7.3              6.0
Monolithic SVM                 15.8             9.3

Table 6.6: Mean prediction error (in degrees) of the various fusion methods on the validation and test sets when the football is not present.

Method                         Validation set   Test set
Without fusion: top-down       24.3             29.2
Without fusion: bottom-up      12.6             14.5
Log-linear (α = 0.54)          7.0              13.2
Linear (α = 0.52)              8.9              14.2
Independent likelihood pool    7.4              13.5
Monolithic SVM                 11.3             13.8

6.3.3 Fusion

The weights for the linear and log-linear interpolation are selected using the validation set. The mean prediction errors for different weights and methods are shown in figures 6.6a and 6.6b. The mean prediction errors with the selected weights are shown in tables 6.5 and 6.6. The best mean prediction error on the validation set is achieved using log-linear interpolation with the weight 0.64 given to the top-down approach when the football is present, and 0.54 when the football is not present. Posteriors are fused using log-linear interpolation, with the selected weights, throughout the rest of the experiments.

6.3.4 Temporal Smoothing

The temporal smoothing σ-parameter, used in equation 5.6, is selected by computingthe mean prediction error for the fused classifiers on the validation set and thenpicking the σ that gives the best result. The results are shown in figure 6.7a, whereσ = 0.85 is the optimum.

The effect of applying the temporal smoothing on the results of the fused classi-fier is shown in table 6.7. Temporal smoothing does not decrease the mean predic-tion error and is not used throughout the rest of the experiments. As a comparison,the effects of smoothing when randomized ferns are used for bottom-up estimation

37


0 0.2 0.4 0.6 0.8 15

6

7

8

9

10

11

12

13

14

15

Weight

Pre

dict

ion

erro

r (d

egre

es)

Log−linear interpolation

Linear interpolation

Independent likelihood pool

(a) When the football is present

0 0.2 0.4 0.6 0.8 16

8

10

12

14

16

18

20

22

24

26

WeightP

redi

ctio

n er

ror

(deg

rees

)

Log−linear interpolation

Linear interpolation

Independent likelihood pool

(b) When the football is not present

Figure 6.6: The mean prediction error on the validation set when using inde-pendent likelihood pool, linear interpolation and log-linear interpolation as fusionmethod. The weight is the weight given to the top-down SVM. I.e. when it is equalto 1, only the top-down information is used. When it is equal to 0, the top-downinformation is not used at all. The independent likelihood pool does not use anyweight.

0 0.5 1 1.5 2 2.5 36

8

10

12

14

16

18

20

22

24

σ

Mea

n pr

edic

tion

erro

r (d

egre

es)

(a) Temporal smoothing applied on the entiresystem using the SVM approach for bottom-upestimation.

0 0.5 1 1.5 2 2.5 38

10

12

14

16

18

20

22

24

σ

Mea

n pr

edic

tion

erro

r (d

egre

es)

(b) Temporal smoothing applied on the entiresystem using the randomized ferns approach forbottom-up estimation.

Figure 6.7: The mean prediction error of the entire system on the validation setwhen varying the σ-parameter used during temporal smoothing. The dotted linesrepresent the mean prediction error without any temporal smoothing.

38


Table 6.7: Mean prediction error (in degrees) of the estimator using an SVM for bottom-up estimation before and after temporal smoothing is applied. The smoothing is performed using σ = 0.85 and ten frames of look-ahead.

                     Validation set                    Test set
                     Both   Football   No Football     Both   Football   No Football
Without smoothing    6.4    5.5        7.0             12.0   4.1        13.2
With smoothing       6.4    5.7        7.0             12.4   4.8        13.5

Table 6.8: Mean prediction error (in degrees) of the estimator using randomized ferns for bottom-up estimation before and after temporal smoothing is applied. The smoothing is performed using σ = 0.72 and ten frames of look-ahead.

                     Validation set                    Test set
                     Both   Football   No Football     Both   Football   No Football
Without smoothing    9.9    5.8        13.0            15.3   4.9        16.8
With smoothing       8.3    5.3        10.8            13.4   4.7        14.8

are shown in table 6.8 and figure 6.7b.

6.3.5 The System as a Whole

On the test set, the estimator gets the bin right in 52% of the cases. When it does not get the bin right, the errors are distributed as shown in figure 6.8. The mean prediction error over all estimations is 12.0 degrees.

The time consumption is divided roughly as shown in table 6.9. The measurements are only meant to give an impression of the execution speed and how it is divided amongst the tasks. As such, too much attention should not be paid to the absolute values.



[Figure 6.8: four histograms with the number of examples on the vertical axis. (a) Bottom-up ferns. (b) Bottom-up SVM. (c) Top-down. (d) Fused (bottom-up SVM + top-down).]

Figure 6.8: Histograms of the non-zero prediction errors on the test set for the bottom-up and top-down parts as well as the final fused system.

Table 6.9: Approximate execution speed of the different parts of the system. The times are measured on a single 1.6 GHz core with code that has not been written with execution speed in mind (except for the estimations done using LIBSVM).

Step                                                      Time (ms)
Bottom-up feature computation                             6
Bottom-up estimation                                      3
Top-down feature computation (regardless of situation)    8
Top-down estimation (football present)                    < 1
Top-down estimation (football not present)                < 1
Fusion                                                    < 1
Angle estimation                                          3



6.4 Discussion

For all intents and purposes, performance is taken to be the negative mean prediction error, so a lower mean prediction error means better performance.

6.4.1 Bottom-up

When deciding on the number of ferns, the gain in performance from adding ferns must be weighed against the additional computational cost incurred. Using 30 ferns seems like a good trade-off between mean prediction error and time consumption (which grows linearly with the number of ferns).

Grayscale is selected as the SVM feature because it offers a good trade-off between prediction performance and execution speed. Using larger feature vectors slows down the SVM considerably without giving much in terms of performance improvement. At 375 estimations per second, the SVM can handle an entire team on one processing unit with time to spare.

The downside of the SVM is its sensitivity to skin color. As can be seen in table 6.2, the SVM performs worse than random on player number three, who is the player with the darkest skin in the datasets. When this player is removed from the training set, the remaining examples become too dissimilar for the SVM to reliably estimate the player's head pose. Given that there are no other players to test with, it is difficult to tell whether this is simply a problem of not having a varied enough training set. The randomized ferns approach is not affected, since it is only concerned with variation in color within the individual images.

The SVM approach clearly outperforms the randomized ferns approach, both in terms of mean prediction error and execution speed. Some of the difference could be explained by the difference in methods, but randomized ferns can produce results that are competitive with SVMs[6], so that cannot explain everything. The difference in features used is probably a better explanation. Normally, randomized ferns hold the advantage over SVMs when it comes to execution speed[6]. That advantage is however squandered by having to consider all possible ways to label the segments. Instead of applying the ensemble just once, we in effect have to apply it 729 times.

6.4.2 Top-down

As seen in table 6.4, knowing the location of the football significantly improves the estimation. The low prediction error on the test set when the football is present should be taken with a grain of salt due to the small sample size. The location of the football is only known during 10% of the test set.

6.4.3 Fusion

As can be seen in figures 6.5 and 6.6, the independent likelihood pool is more or less the same as log-linear interpolation with α = 0.5. They only differ by the direction prior.
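The relationship between the fusion rules can be illustrated with a small sketch. The fusion rules themselves are defined earlier in the report; the forms below are the common textbook versions and are an assumption here, as are all function names. The sketch takes a top-down posterior, a bottom-up posterior and (for the pool) a direction prior with non-zero entries, all over the same direction bins.

```python
def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def linear_interpolation(p_td, p_bu, alpha):
    """Weighted arithmetic mean; alpha is the weight of the top-down posterior."""
    return normalize([alpha * t + (1.0 - alpha) * b for t, b in zip(p_td, p_bu)])


def log_linear_interpolation(p_td, p_bu, alpha):
    """Weighted geometric mean; alpha is the weight of the top-down posterior."""
    return normalize([(t ** alpha) * (b ** (1.0 - alpha)) for t, b in zip(p_td, p_bu)])


def independent_likelihood_pool(p_td, p_bu, prior):
    """Product of the two posteriors divided by the prior (assumed form)."""
    return normalize([t * b / p for t, b, p in zip(p_td, p_bu, prior)])
```

Under these assumed forms and a uniform prior, the pool is proportional to the product of the two posteriors, while log-linear interpolation with α = 0.5 is proportional to the square root of that product. Both therefore rank the direction bins identically, and any difference in the selected bin has to come from a non-uniform prior.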



When the football is present, the top-down information provides a good initial estimate. The bottom-up information can still help disambiguate between the choices presented by the top-down approach though, as shown in figure 6.9.

The poor results of the monolithic SVM when the football is present are puzzling. It performs worse than the top-down SVM, despite having a superset of the top-down features. Possible explanations include the reduction in training set size (956 to 568 examples), the training/validation data not being representative enough with the additional features, and the parameter optimization being done with respect to the classification rate rather than the prediction error.

6.4.4 Temporal Smoothing

The smoothing does not help when an SVM is used for bottom-up estimation. It should hence not be used. It does however help when randomized ferns are used. This would indicate that the ferns are more erratic in their predictions, presumably due to the instability added by the clustering step. It could also be an inherent property of the randomized ferns classifier. In order to tell, one would have to apply the ferns to the same features as the SVM.
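To make concrete why smoothing helps an erratic estimator more than a stable one, the following sketch averages per-frame posteriors with a Gaussian window. It is not the smoothing method used in the report, which is defined in an earlier chapter; the symmetric window, the interpretation of σ in units of frames and the function name are all assumptions made for illustration.

```python
import math


def gaussian_smooth_posteriors(posteriors, sigma, look_ahead=10):
    """Smooth a sequence of per-frame posteriors with a Gaussian window.

    Assumption: frame t is replaced by a weighted average of the frames
    within look_ahead frames on either side, weighted by exp(-d^2 / (2 sigma^2)).
    """
    if sigma <= 0:
        return [list(p) for p in posteriors]  # no smoothing
    smoothed = []
    n = len(posteriors)
    for t in range(n):
        first = max(0, t - look_ahead)
        last = min(n - 1, t + look_ahead)
        acc = [0.0] * len(posteriors[t])
        for s in range(first, last + 1):
            weight = math.exp(-0.5 * ((s - t) / sigma) ** 2)
            for k, p in enumerate(posteriors[s]):
                acc[k] += weight * p
        total = sum(acc)
        smoothed.append([a / total for a in acc])
    return smoothed
```

A single frame whose posterior disagrees with its neighbours is pulled back towards them, which is the behaviour one would expect to benefit a classifier with erratic predictions. A classifier that is already stable from frame to frame gains little and may lose responsiveness at genuine head turns.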

As for the overall performance between the validation and test sets: it is important to keep in mind that the football is present during more of the validation set than the test set (43% vs. 10%). Therefore, comparing the overall errors is not advised, since the validation set is influenced more by the lower prediction error achieved with knowledge of the football location.

6.4.5 The System as a Whole

The system is capable of performing in real-time. The top-down and bottom-up estimations are likely to become the bottlenecks if the system is implemented with execution speed in mind.

Comparing tables 6.7 and 6.8, the ferns do not fare as badly against SVMs as indicated by the results in section 6.3.1. After temporal smoothing, the ferns approach is only 1.4 degrees worse than the SVM approach on the test set (13.4 versus 12.0 degrees). The ferns approach is still significantly slower though.



[Figure 6.9: (a) The player being processed (the player on the left). (b) The fused posterior, (c) the top-down posterior and (d) the bottom-up posterior, each drawn as a polar plot over the directions 0 to 360 degrees. (e) The game situation.]

Figure 6.9: An example of a situation when the bottom-up approach proves valuable, despite the football being present. The player has his head turned in preparation for a possible dash. The ground truth puts the head somewhere between 292.5 and 337.5 degrees. The top-down approach presents three viable alternatives, of which the bottom-up approach only agrees with the one representing the labeled direction.


Chapter 7

Conclusions and Future Work

7.1 Conclusions

A method for real-time estimation of the head pose using a combination of bottom-up and top-down features has been presented. Assuming a good (but not necessarily perfectly centered) head detection, it can, when applied on a short sequence of football footage, estimate the head pose down to a mean error of 5 degrees from the correct bin when the location of the football is known and 13 degrees from the correct bin when the location of the football is not known.

Evaluating the performance on longer sequences would be required to draw any solid conclusions as to the generalization performance, but the shorter sequences should provide at least a hint. All data used is from the same game, making the generalization performance between games uncertain. Efforts have been taken to eliminate the aspect of having the same players in all datasets, but other conditions remain unchanged.

For bottom-up estimation, an SVM trained on grayscale images performs better than the described randomized ferns approach. The SVM is however sensitive to dark skin, which is not an issue for the feature used by the ferns. Whether that can be corrected by adding more training data of people with dark skin is unknown.

The combination of bottom-up and top-down information outperforms either source in isolation. When the location of the football is known, the usage of bottom-up information helps detect cases when the player is not looking directly at the football. This has a marginal effect on the mean prediction error, but is presumably important for the subjective quality of the predictions.

The estimation is significantly better when the location of the football is known compared to when it is not. There are currently many sequences when the football is in play, but not successfully tracked. Therefore, improving the football tracking (as in successfully localizing the football more often) would have a beneficial effect on the head pose estimation.


7.2 Future Work

There are many things that could be attempted to further improve the estimation. Below are some ideas that were not attempted due to lack of time and resources.

As noted in chapter 3, there are many ways in which the randomized ferns classifier can be altered. Comparing the presented classifier with other ways to combine the posteriors, other types of tests, other ways to choose tests and other ways to inject randomness would be of interest. Applying randomized ferns on grayscale images (with a different set of tests) would allow a better comparison with SVMs, given that the difference in features is a possible explanation for the performance gap. A direct comparison with other promising probabilistic models, such as the one presented in [9], would also be of interest.

This report has focused on the situation when one does not have training data of the player whose head pose is being estimated. If there is a desire to get a bit better performance on specific players, then one could possibly adapt the general-purpose estimator using a relatively small amount of person-specific training data. This would be akin to e.g. speech recognition systems that are adapted to specific persons by using samples of that person's speech to adapt a general-purpose model.

There is a significant overlap between some of the cameras. Hence multiple views of the same head can sometimes be extracted. Only one view has been used in such situations. Using them all would give an indication of whether it would be beneficial to have multiple setups of cameras.

When it comes to top-down information, one could probably garner some additional information by taking into account where the surrounding players are looking. For example, two players standing side-by-side are probably looking in a similar direction. This could be done by estimating everyone's head pose at once (utilizing measurements from all players), or in some iterative fashion by re-estimating each head pose in turn until (hopefully) reaching convergence. Other features that might prove useful include the direction and speed of the football, as well as the distance between other players and the football.

A more fine-grained ground truth could also be a good next step. The current bins are 45 degrees wide. Knowing that the head pose is estimated within x degrees of the right 45 degree circle sector is not necessarily all that reassuring. A related problem is the uncertainty in the labeling. Examples are frequently mislabeled by one bin. A one-bin difference makes the prediction error of an estimated angle jump from 0 to anything between 0 and 45 degrees. A set of examples with heads close to bin edges will therefore pose more of a challenge than a set of examples where the heads are centered in the bins, since the former will have a noisier ground truth.
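The effect of a one-bin mislabeling can be made concrete with a small sketch of a bin-based error measure. It assumes that the prediction error is the angular distance from the estimated angle to the labeled bin, that it is zero anywhere inside the bin, and that bin k is centered at k times 45 degrees (which matches the bin from 292.5 to 337.5 degrees mentioned in figure 6.9). The function names are hypothetical.

```python
def circular_distance(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def bin_prediction_error(estimate_deg, labeled_bin, bin_width=45.0):
    """Distance in degrees from an estimated angle to the labeled bin.

    Assumption: bin k is centered at k * bin_width, so with the default
    width bin 7 spans 292.5 to 337.5 degrees.
    """
    center = labeled_bin * bin_width
    return max(0.0, circular_distance(estimate_deg, center) - bin_width / 2.0)
```

Under these assumptions, an estimate of 300 degrees has zero error against bin 7 but an error of 7.5 degrees if the example was instead labeled as bin 6, even though the estimate itself did not change.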

Labeling longer sequences of games would open up for investigations as to how different game situations (e.g. free kicks, throw-ins etc.) affect the behavior. One could also investigate whether players in different strategic positions behave differently.

The head detector is slow and far from optimal. It ended up being more complex than desired. Many approaches that might end up simpler, such as dropping HOG and relying more on the background subtraction silhouette, as well as detecting skin and jersey colors, were not attempted due to not wanting to spend more time than necessary on head detection. The first thing to test would be to simply drop HOG and Kalman filtering, and instead assume that the top part of each player contour is the head region.

Access to a reliable head detector has been assumed in this report. Given the performance of the top-down approach, it might be interesting to see how well a combined approach works when the head detector makes frequent errors. Errors in the head detection will presumably lower the performance of the bottom-up approach. The question is how well that can be compensated by weighting down the bottom-up approach during fusion.

Jointly estimating the head position and the head pose (and indeed, the complete body pose in the ACTVIS project) would allow the uncertainty in e.g. the head tracking to be factored into the head pose estimation and vice versa[3].


Bibliography

[1] Y. Amit and D. Geman. Shape Quantization and Recognition with Randomized Trees. Neural Computation, 9(7):1545–1588, 1997.

[2] P.K. Atrey and M.S. Kankanhalli. Probability Fusion for Correlated Multimedia Streams. In ACM International Conference on Multimedia, pages 408–411, 2004.

[3] S.O. Ba and J.M. Odobez. A probabilistic framework for joint head tracking and pose estimation. In International Conference on Pattern Recognition, volume 4, pages 264–267, 2004.

[4] B. Benfold and I. Reid. Colour invariant head pose classification in low resolution video. In British Machine Vision Conference, 2008.

[5] S.T. Birchfield and S. Rangarajan. Spatiograms Versus Histograms for Region-Based Tracking. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005.

[6] A. Bosch, A. Zisserman, and X. Munoz. Image Classification using Random Forests and Ferns. In IEEE International Conference on Computer Vision, pages 1–8, 2007.

[7] L. Breiman. Bagging Predictors. Machine Learning, 24(2):123–140, 1996.

[8] M. Briers, A. Doucet, and S. Maskell. Smoothing algorithms for state-space models. Technical report, Cambridge University Engineering Department, CUED/F-INFENG/TR.498, 2004.

[9] L.M. Brown and Y.L. Tian. Comparative Study of Coarse Head Pose Estimation. In IEEE Workshop on Motion and Video Computing, pages 125–130, 2002.

[10] C.C. Chang and C.J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[11] D. Comaniciu and V. Ramesh. Real-time tracking of non-rigid objects using mean shift, 2003. US Patent 6,590,999.


[12] A. Criminisi, I. Reid, and A. Zisserman. Single View Metrology. International Journal of Computer Vision, 40(2):123–148, 2000.

[13] N. Dalal. Finding people in images and videos. PhD thesis, Institut National Polytechnique de Grenoble, 2006.

[14] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005.

[15] N. Dalal, B. Triggs, and C. Schmid. Human Detection Using Oriented Histograms of Flow and Appearance. In European Conference on Computer Vision, 2006.

[16] T.G. Dietterich. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Machine Learning, 40(2):139–157, 2000.

[17] D. Gildea and T. Hofmann. Topic-based language models using EM. In European Conference on Speech Communication and Technology, 1999.

[18] A. Gonshor and G.M. Jones. Short-term adaptive changes in the human vestibulo-ocular reflex arc. The Journal of Physiology, 256(2):361, 1976.

[19] R. Gonzalez and R. Woods. Digital Image Processing. Prentice Hall, 2008.

[20] I. Haritaoglu, D. Harwood, and L.S. Davis. Hydra: Multiple people detection and tracking using silhouettes. In International Conference on Image Analysis and Processing, pages 280–285, 1999.

[21] L. Hermes and J.M. Buhmann. Feature Selection for Support Vector Machines. In International Conference on Pattern Recognition, pages 712–715, 2000.

[22] C.W. Hsu, C.C. Chang, C.J. Lin, et al. A Practical Guide to Support Vector Classification, 2009.

[23] C.W. Hsu and C.J. Lin. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.

[24] G. Hua, M. Yang, and Y. Wu. Learning to Estimate Human Pose with Data Driven Belief Propagation. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, page 747, 2005.

[25] J. Illingworth and J. Kittler. A survey of the Hough transform. Computer Vision, Graphics, and Image Processing, 44(1):87–116, 1988.

[26] K. Kanatani. Geometric Computation for Machine Vision. Oxford University Press, 1993.


[27] K. Kanatani, N. Ohta, and Y. Kanazawa. Machine Vision Applications. Optimal Homography Computation with a Reliability Measure. IEICE Transactions on Information and Systems, 83(7):1369–1374, 2000.

[28] P. D. Kovesi. MATLAB and Octave functions for computer vision and image processing. School of Computer Science & Software Engineering, The University of Western Australia. Software available at: http://www.csse.uwa.edu.au/~pk/research/matlabfns/.

[29] V. Lepetit, P. Lagger, and P. Fua. Randomized Trees for Real-Time Keypoint Recognition. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005.

[30] D.D. Lewis. Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval. In European Conference on Machine Learning, pages 4–15, 1998.

[31] D.G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

[32] L. Lucchese and S.K. Mitra. Colour Image Segmentation: A State-of-the-Art Survey. In Proceedings of the Indian National Science Academy, 67(2):207–222, 2001.

[33] M.J. Magee and J.K. Aggarwal. Determining vanishing points from perspective images. Computer Vision, Graphics, and Image Processing, 26(2):256–267, 1984.

[34] K.V. Mardia and P.E. Jupp. Directional Statistics. Wiley New York, 2000.

[35] R. Meir and G. Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning, LNCS, pages 119–184, 2003.

[36] K.R. Muller, S. Mika, G. Ratsch, K. Tsuda, and B. Scholkopf. An Introduction to Kernel-Based Learning Algorithms. IEEE Transactions on Neural Networks, 12(2):181–201, 2001.

[37] S. Munder and D.M. Gavrila. An Experimental Study on Pedestrian Classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11):1863, 2006.

[38] E. Murphy-Chutorian and M.M. Trivedi. Head pose estimation in computer vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4):607–626, 2009.

[39] K. Nummiaro, E. Koller-Meier, and L. Van Gool. An adaptive color-based particle filter. Image and Vision Computing, 21(1):99–110, 2003.


[40] OpenCV: Open Computer Vision Library. Software available at http://sourceforge.net/projects/opencvlibrary/.

[41] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio. Pedestrian Detection Using Wavelet Templates. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 193–199, 1997.

[42] M. Ozuysal, P. Fua, and V. Lepetit. Fast Keypoint Recognition in Ten Lines of Code. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007.

[43] S. Park and J.K. Aggarwal. Head Segmentation and Head Orientation in 3D Space for Pose Estimation of Multiple People. In IEEE Southwest Symposium Image Analysis and Interpretation, pages 192–196, 2000.

[44] M. Piccardi. Background subtraction techniques: a review. In IEEE International Conference on Systems, Man and Cybernetics, 2004.

[45] O. Punska. Bayesian approaches to multi-sensor data fusion. M. phil., Cambridge University, 1999.

[46] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet. Color-based probabilistic tracking. In European Conference on Computer Vision, pages 661–675, 2002.

[47] J.R. Quinlan. Induction of Decision Trees. Machine Learning, 1(1):81–106, 1986.

[48] I. Rish. An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, pages 41–46, 2001.

[49] N. Robertson and I. Reid. Estimating Gaze Direction from Low-Resolution Faces in Video. In European Conference on Computer Vision, 2006.

[50] H. Schneiderman and T. Kanade. A Statistical Method for 3D Object Detection Applied to Faces and Cars. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2000.

[51] B. Siciliano and O. Khatib. Springer handbook of robotics. Springer, 2008.

[52] S.A. Sirohey. Human Face Segmentation and Identification. Technical report, University of Maryland at College Park, CS-TR-3176, 1993.

[53] K. Smith, S.O. Ba, J.M. Odobez, and D. Gatica-Perez. Tracking the Visual Focus of Attention for a Varying Number of Wandering People. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(7):1212–1229, 2008.

[54] R. Stiefelhagen. Estimating head pose with neural networks-results on the pointing04 icpr workshop evaluation data. In Pointing 2004 Workshop: Visual Observation of Deictic Gestures, 2004.


[55] R. Stiefelhagen, M. Finke, J. Yang, and A. Waibel. From Gaze to Focus of Attention. Lecture Notes in Computer Science, pages 761–768, 1999.

[56] TRACAB. http://www.tracab.com/.

[57] R. Valenti and S. Hageloh. Implementation and evaluation of the mean shift tracker. Technical report, University of Amsterdam, 2005. Software available at http://staff.science.uva.nl/~rvalenti/?content=projects.

[58] V. Vezhnevets, V. Sazonov, and A. Andreeva. A Survey on Pixel-Based Skin Color Detection Techniques. In GraphiCon, International Conference on Computer Graphics and Vision, pages 85–92, 2003.

[59] P. Viola, M.J. Jones, and D. Snow. Detecting Pedestrians Using Patterns of Motion and Appearance. International Journal of Computer Vision, 63(2):153–161, 2005.

[60] M. Voit, K. Nickel, and R. Stiefelhagen. A Bayesian Approach for Multi-view Head Pose Estimation. In IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, pages 31–34, 2006.

[61] G. Welch and G. Bishop. An Introduction to the Kalman Filter. Technical report, University of North Carolina, TR 95-041, 1995.

[62] T.F. Wu, C.J. Lin, and R.C. Weng. Probability Estimates for Multi-Class Classification by Pairwise Coupling. The Journal of Machine Learning Research, 5:975–1005, 2004.

[63] Y. Wu and K. Toyama. Wide-range, person- and illumination-insensitive head orientation estimation. In IEEE International Conference on Face and Gesture Recognition, pages 183–188, 2000.

[64] H. Yamazoe, A. Utsumi, T. Yonezawa, and S. Abe. Remote gaze estimation with a single camera based on facial-feature tracking without special calibration actions. In Eye Tracking Research & Applications Symposium, pages 245–250, 2008.


Appendix A

Assumed Background Knowledge

The reader is assumed to have some familiarity with machine learning and computer vision. This appendix briefly lists some specific concepts that are used in the report. A short summary is provided along with a reference in case the reader feels unfamiliar with the concept.

A.1 Bayes’ Rule

Bayes' rule is the basis of Bayesian statistics, and is derived by expressing the probability of a conjunction of two events, $P(H, D)$, using conditional probabilities.

\[ P(H, D) = P(H \mid D)\,P(D) = P(D \mid H)\,P(H) \]

\[ P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)} \tag{A.1} \]

Equation A.1 is called Bayes' rule, and can be used to compute the probability of a hypothesis $H$ (e.g. “The player is looking directly at the camera.”) given observed data $D$. $P(H)$ will be referred to as the prior distribution. $P(H \mid D)$ will be referred to as the posterior distribution. In practice, $P(D)$ is a normalization constant that one avoids computing explicitly by noting that $P(H \mid D)$ must sum to 1 over all $H$.
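The normalization trick can be sketched in a few lines. The function name and the example numbers below are made up for illustration; the code simply multiplies prior and likelihood for each hypothesis and divides by their sum, which plays the role of $P(D)$.

```python
def posterior(prior, likelihood):
    """Return P(H|D) for every hypothesis H, given P(H) and P(D|H)."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())  # equals P(D)
    return {h: value / evidence for h, value in unnormalized.items()}


# Hypothetical numbers: a prior over two head poses and the likelihood
# of some observed image data under each pose.
example = posterior(
    prior={"facing camera": 0.25, "facing away": 0.75},
    likelihood={"facing camera": 0.8, "facing away": 0.2},
)
```

Here the unnormalized values are 0.20 and 0.15, so the posterior probability of "facing camera" is 0.20 / 0.35, even though its prior was only 0.25.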

A.2 Naive Bayes Classifier

Let $f_1, f_2, \ldots, f_n$ be a number of features and $c$ be the class of an object. The naive Bayes classifier assumes that the features are independent given that the true class is known. In combination with Bayes' rule, one can use the assumption to break down $P(c \mid f_1, f_2, \ldots, f_n)$ into

\[ P(c \mid f_1, f_2, \ldots, f_n) \propto P(c) \prod_{i=1}^{n} P(f_i \mid c) \]

In practice, the independence assumption is often violated to some degree, but the method tends to work surprisingly well anyway[30, 48]. The classification is performed by picking the $c$ which has the highest probability.
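A minimal sketch of the classification step is given below. It assumes discrete features and likelihood tables with non-zero entries, and it sums logarithms instead of multiplying probabilities to avoid numerical underflow; the function name and the data layout are illustrative choices, not taken from the report.

```python
import math


def naive_bayes_classify(class_priors, feature_likelihoods, features):
    """Pick the class c that maximizes P(c) * prod_i P(f_i | c).

    class_priors: dict mapping class -> P(c)
    feature_likelihoods: dict mapping class -> list with one dict per
        feature, each mapping a feature value to P(f_i = value | c)
    features: the observed feature values f_1, ..., f_n
    """
    best_class, best_score = None, -math.inf
    for c, prior in class_priors.items():
        score = math.log(prior)
        for i, value in enumerate(features):
            score += math.log(feature_likelihoods[c][i][value])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Since the logarithm is monotonic, the class with the highest sum of logarithms is also the class with the highest product of probabilities.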

A.3 Support Vector Machines

Support Vector Machines (SVMs) are a group of methods from statistical learning theory. In this report we will only be concerned with the supervised-learning classifiers with C-style slack variables[36]. Several kernels can be used, but only the RBF kernel will be touched upon in this report.
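For reference, the RBF kernel in its usual form, exp(-γ ||x - y||²), can be written as a short function. The function name is illustrative, and the value of γ is a free parameter that has to be selected, typically together with the slack penalty C.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * ||x - y||^2) for two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

The kernel value is 1 for identical vectors and decays towards 0 as the vectors move apart, with γ controlling how quickly.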

A.4 Decision Trees

A classifier represented as a tree. Each non-leaf node represents a (for our purposes, binary) test that decides which child the example being classified should go to next. Each leaf represents a (not necessarily distinct) label. The example to be classified starts at the root, and is then guided downwards by the tests until it reaches a leaf node. The example is given the label corresponding to the reached leaf node.

Decision trees can be constructed in many ways. One approach is to select the tests based on a concept from information theory called information gain[47].
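As an illustration of information gain, the sketch below computes the reduction in label entropy obtained by splitting a set of examples with a binary test. The function names are illustrative; the definition used is the standard one of entropy before the split minus the size-weighted entropy after it.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(labels, test_outcomes):
    """Entropy reduction from splitting the labels by a binary test."""
    passed = [l for l, t in zip(labels, test_outcomes) if t]
    failed = [l for l, t in zip(labels, test_outcomes) if not t]
    total = len(labels)
    remainder = sum(
        len(part) / total * entropy(part) for part in (passed, failed) if part
    )
    return entropy(labels) - remainder
```

A test that separates the classes perfectly yields a gain equal to the original entropy, whereas a test whose outcome is unrelated to the class yields a gain of zero. When building a tree, the test with the highest gain is placed at the current node.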

A.5 Boosting

An iterative method where classifiers are trained on the same set of training data, but with different weights given to the examples. The training data is weighted based on the performance of the past classifiers, putting more stress on previously misclassified examples. The classifiers are in the end combined to produce a stronger classifier[35].

A.6 Bagging

A method where a single set of training data is used to train an ensemble of classifiers. Each classifier is trained using a set sampled uniformly at random, with replacement, from the training data. The final label can then be decided using e.g. a majority vote amongst the ensemble of classifiers[7].
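The procedure can be sketched as follows. The base learner is passed in as a function `train_fn`, which is a hypothetical placeholder for any training routine that takes a list of examples and returns a callable classifier; the other names are illustrative as well.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) examples uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


def train_bagged_ensemble(data, train_fn, n_classifiers, seed=0):
    """Train n_classifiers base classifiers, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train_fn(bootstrap_sample(data, rng)) for _ in range(n_classifiers)]


def majority_vote(classifiers, example):
    """Return the label predicted by most classifiers in the ensemble."""
    votes = Counter(classifier(example) for classifier in classifiers)
    return votes.most_common(1)[0][0]
```

Because each bootstrap sample leaves out some examples and repeats others, the base classifiers differ from one another, and the vote averages out part of their individual variance.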


A.7 Histogram of Oriented Gradients

Histogram of Oriented Gradients (HOG) is a feature descriptor used for images[14]. HOG is similar to SIFT[31], but the descriptors are computed over dense overlapping blocks rather than at keypoints.

A.8 Homographies

Homographies, also called projective transformations, are transformations between two images of a planar surface[27, 26]. For our purposes, they are used to map coordinates from a bird's-eye view of the football pitch (the planar surface) to the camera view of the same football pitch.
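Applying a homography to a point amounts to a matrix multiplication in homogeneous coordinates followed by a division. The sketch below assumes that the homography is given as a 3x3 nested list `H`; how `H` is estimated is outside the scope of the example, and the function name is illustrative.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    # Divide by the homogeneous coordinate to get back to 2D.
    return (xh / w, yh / w)
```

Multiplying every entry of `H` by the same non-zero constant leaves the mapping unchanged, which is why a homography has only eight degrees of freedom despite having nine entries.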

A.9 K-means Clustering

K-means clustering is one of many algorithms for image segmentation[32]. In its most common incarnation, a fixed number of cluster centroids are seeded in the feature space. The instances to be clustered are subsequently assigned to the closest centroid. Each centroid is then moved to the center of the instances assigned to it. The process is repeated until convergence.
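The steps above can be sketched as follows. The sketch takes the initial centroids as an argument rather than seeding them itself, uses squared Euclidean distance, and leaves the centroid of an empty cluster where it is; these are choices made for the illustration and are not taken from the report.

```python
def kmeans(points, centroids, max_iterations=100):
    """Cluster tuples of floats, starting from the given centroids.

    Returns the final centroids and the cluster index of each point.
    """
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_iterations):
        # Assign every point to its closest centroid.
        assignment = [
            min(
                range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])),
            )
            for p in points
        ]
        # Move every centroid to the mean of the points assigned to it.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(coords) / len(members) for coords in zip(*members))
                )
            else:
                new_centroids.append(old)  # empty cluster: keep the centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assignment
```

For image segmentation, each point would typically be the feature vector of one pixel (for instance its color), so that pixels with similar features end up in the same segment.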
