
WARSAW UNIVERSITY OF TECHNOLOGY

Faculty of Electronics and Information Technology

Ph.D. THESIS Adam Strupczewski, M. Sc.

Commodity Camera Eye Gaze Tracking

Supervisor Professor Władysław Skarbek, Ph.D., D.Sc.

Warsaw, 2016


Acknowledgements

To begin with, I would like to thank all the people at Samsung Research Poland involved in supporting this thesis and the Eye Gaze Tracking project that it is based on. I would especially like to thank Dr. Taehwa Hong from Samsung Electronics, who has always shown support for the project and the resulting publications.

I would also like to express my gratitude to Professor Władysław Skarbek for his great support of my work. He has always been helpful and has made every effort to help me with any problems I was facing.

Moreover, I would like to give many thanks to my colleague Błażej Czupryński who did a lot of work in the field of eye gaze tracking. Some of the research described in this thesis would have been much more difficult without his input.

I also wish to acknowledge the work of Jacek Naruniec, Kamil Mucha and Marek Kowalski for their contributions to the development of eye gaze tracking and computer vision in general. Their research was important in many ways for me personally and for the results presented in this thesis.


Abstract

This thesis presents a complete eye gaze tracking system intended for use with a commodity RGB or RGBD camera. The proposed system requires only a simple device setup and no personal calibration during usage. At the same time, the system demonstrates state-of-the-art accuracy while running in real time on ordinary computers without using the GPU. The author believes that this is an important contribution to the field of eye gaze tracking and can help make eye gaze tracking technology easily accessible to the community. Along with the whole system, highly accurate component algorithms for head pose estimation and iris localization are proposed and discussed in detail. Apart from theoretical analysis, this thesis contains an extensive real-world evaluation of the proposed component algorithms and of the whole system, which justifies the claims about its novelty and high accuracy.


Polish Abstract

Niniejsza praca opisuje kompletny system do śledzenia wzroku przeznaczony do użycia z kamerą RGB lub RGBD. Proponowany system wymaga jedynie bardzo prostego zestawienia urządzeń oraz nie wymaga osobistej kalibracji użytkownika. Jednocześnie, prezentowany system działa w czasie rzeczywistym na zwykłym komputerze bez użycia karty graficznej do obliczeń i wykazuje dokładność porównywalną z najlepszymi tego typu systemami na świecie. Autor jest przekonany, że proponowany system jest istotnym wkładem w rozwój dziedziny śledzenia wzroku i może pomóc w udostępnieniu tej technologii dla społeczności. Oprócz całego systemu, w pracy są zaprezentowane i szczegółowo przedyskutowane algorytmy składowe do śledzenia pozy głowy i lokalizacji tęczówki oka. Poza analizą teoretyczną, niniejsza praca zawiera obszerną ewaluację całego systemu i algorytmów składowych przeprowadzoną w rzeczywistych warunkach użytkowania, która potwierdza ich innowacyjność oraz wysoką dokładność.


Table of contents

List of figures
List of used abbreviations
1 Introduction
1.1 Problem definition
1.2 Human eye model
1.3 Motivation
1.4 Dissertation Structure
1.5 Theses
2 Previous work
2.1 Eye gaze tracking
2.1.1 Appearance based approaches
2.1.2 Model based approaches
2.1.3 Methods using a depth camera
2.2 Head pose estimation
2.2.1 Appearance-based methods
2.2.2 Geometric feature based methods
2.2.3 Flexible model methods
2.2.4 Methods based on tracking
2.2.5 Hybrid methods
2.3 Iris localization
3 Head pose estimation
3.1 Algorithm initialization
3.2 Head pose tracking
3.2.1 Lukas-Kanade intensity based tracking
3.2.2 Feature based tracking
3.3 Proposed hybrid algorithm
3.3.1 Tracker error measurement
3.3.2 Keyframe initialization
3.3.3 Illumination compensation
3.4 Usage of a depth camera
3.4.1 Improved head pose initialization
3.4.2 Improved head pose tracking
4 Iris localization
4.1 Coarse iris localization
4.1.1 Key modifications of the original algorithm
4.2 Iris location refinement
4.2.1 Adaptive arc selection
4.2.2 Other key modifications of the original algorithm
5 Proposed system
5.1 Proposed eye model
5.2 Geometric model initialization
5.2.1 Initialization of head and pupils
5.2.2 Initialization of eyeball rotation centers
5.3 Geometric eye gaze tracking
5.3.1 Eyeball center tracking
5.3.2 Pupil center tracking
5.3.3 Gaze point estimation
5.4 Block diagram of proposed system
6 Model parameter inaccuracy analysis
6.1 Theoretical error resulting from incorrect interpupillary distance
6.1.1 Impact on initial eyeball center position
6.1.2 Impact on gaze point estimation
6.2 Theoretical error resulting from the user having a non-frontal pose in the initialization frame
6.2.1 Impact of eyeball location
6.2.2 Impact on head pose tracking
6.3 Theoretical error resulting from incorrect angles between opt. and vis. axis
6.4 Theoretical error resulting from incorrect eye dimensions
6.5 Theoretical error resulting from incorrect camera external calibration relative to display
7 Experimental validation
7.1 Environment setup
7.2 Head pose estimation
7.2.1 Evaluation framework
7.2.2 Results – RGB input
7.2.3 Results – RGBD input
7.2.4 Analysis of non-frontal initialization
7.2.5 Analysis of lateral illumination
7.3 Iris localization
7.3.1 Evaluation framework
7.3.2 Results
7.4 Eye gaze tracking
7.4.1 Evaluation framework
7.4.2 Model parameter selection
7.4.3 Results – RGB input
7.4.4 Results – RGBD input
7.4.5 Analysis of non-frontal initialization
7.4.6 Analysis of lateral illumination
8 Conclusions
8.1 Justification of theses
8.2 Future work
Author’s published work
References


List of figures

Figure 1.1. Classification of eye gaze tracking methods.
Figure 1.2. Human eye anatomy.
Figure 1.3. Gaze formation in the eyeball.
Figure 2.1. Gaze vector calculation method in [42].
Figure 2.2. Eyeball rotation treated as 2D projection in [46].
Figure 2.3. Gaze estimation in [11].
Figure 2.4. Eye model used in [55].
Figure 2.5. Circular Hough transform voting.
Figure 3.1. Generic face model used for head pose estimation initialization.
Figure 3.2. Warped mesh after initialization for 3 different users.
Figure 3.3. Illustration of AAM results for three head poses.
Figure 3.4. Illustration of face alignment by cascaded regression.
Figure 3.5. SIFT and STAR features.
Figure 3.6. Result of simplified multiscale retinex.
Figure 3.7. Result of dynamic range compression.
Figure 3.8. Result of discarding DCT coefficients.
Figure 4.1. Extracted eye regions using high quality webcam.
Figure 4.2. Illustration of radial symmetry transform voting.
Figure 4.3. Comparison of original (left) and modified (right) voting algorithm.
Figure 4.4. Results of voting after smoothing for various radii.
Figure 4.5. Illustration of iris localization refinement.
Figure 4.6. Typically visible iris boundaries.
Figure 4.7. Changes in visible iris boundaries.
Figure 4.8. Illustration of adaptive iris boundary selection with frontal head pose.
Figure 4.9. Illustration of adaptive iris boundary selection with varying head pose.
Figure 4.10. Elliptical shapes used in refinement stage.
Figure 5.1. Comparison of high quality and low quality eye region.
Figure 5.2. Typical eye and gaze model.
Figure 5.3. Proposed eye gaze model with labelled parameters.
Figure 5.4. Natural head pose when looking at a display.
Figure 5.5. Illustration of binocular vision in system initialization phase.
Figure 5.6. Analysis of pupil deviation when looking at initialization point (top view).
Figure 5.7. Head pose and pupil initialization illustration (side view).
Figure 5.8. Head pose and pupil initialization illustration (perspective view).
Figure 5.9. Eyeball rotation center calculation during initialization (cross-sectional view).
Figure 5.10. Triangle OCP showing the deviation between measured and true visual axis.
Figure 5.11. Eyeball rotation center calculation during initialization (perspective view).
Figure 5.12. Eyeball optical and visual axes (perspective view).
Figure 5.13. Eyeball optical and visual axes (perspective view).
Figure 5.14. Block diagram of proposed system.
Figure 6.1. Head yaw during initialization (top view).
Figure 6.2. Schematic presentation of inconsistent face model topology and face topology caused by initial head yaw (top view).
Figure 6.3. Impact of eyeball size on estimated gaze direction (one dimension).
Figure 6.4. Relative camera and display orientation errors – horizontal plane.
Figure 6.5. Relative camera and display orientation errors – vertical plane.
Figure 7.1. Relative hardware placement (frontal view).
Figure 7.2. User placement in test environment.
Figure 7.3. Points displayed during test sequence recording.
Figure 7.4. Average landmark displacement in pixels for various head pose tracking algorithms on standard test set.
Figure 7.5. Test stage specific landmark displacement in pixels for OF + FT aggregate template algorithm on standard test set.
Figure 7.6. Sequence specific landmark displacement in pixels for OF + FT aggregate template algorithm on standard test set.
Figure 7.7. Average landmark displacement in pixels for three head pose tracking algorithms on three test sets.
Figure 7.8. Test stage specific landmark displacement in pixels for OF + FT aggregate template algorithm on three test sets.
Figure 7.9. Landmark displacement in pixels for various head pose tracking algorithms using depth on standard test set.
Figure 7.10. Test stage specific landmark displacement in pixels for baseline tracking algorithm with depth used for initialization on standard test set.
Figure 7.11. Sequence specific landmark displacement in pixels for baseline tracking algorithm on standard test set with and without depth used for initialization.
Figure 7.12. Average landmark displacement in pixels for five head pose tracking algorithms using depth on three test sets.
Figure 7.13. Test stage specific landmark displacement in pixels for baseline tracking algorithm with depth used for initialization on three test sets.
Figure 7.14. Sequence specific landmark displacement in pixels for baseline tracking algorithm without depth on test set with initial head rotation.
Figure 7.15. Sequence specific landmark displacement in pixels for baseline tracking algorithm with depth used for initialization on test set with initial head rotation.
Figure 7.16. Comparison of face images with ambient and lateral lighting.
Figure 7.17. Sequence specific landmark displacement in pixels for OF + FT aggregate template algorithm with and without depth used for initialization on two test sequences with lateral illumination.
Figure 7.18. Average landmark displacement in pixels for OF + FT aggregate template algorithm without depth used for initialization on two test sequences with lateral illumination; for three illumination compensation algorithms.
Figure 7.19. Average landmark displacement in pixels for OF + FT aggregate template algorithm with depth used for initialization on two test sequences with lateral illumination; for three illumination compensation algorithms.
Figure 7.20. Sequence specific landmark displacement in pixels for OF + FT aggregate template algorithm with depth used for initialization on standard test set with and without illumination compensation.
Figure 7.21. Tool developed for purpose of iris tagging.
Figure 7.22. Iris localization algorithm average errors in pixels on standard test set – head pose tracking algorithms without depth.
Figure 7.23. Iris localization algorithm average errors in pixels on standard test set – head pose tracking algorithms with depth.
Figure 7.24. Sequence specific iris localization algorithm average errors in pixels on standard test set.
Figure 7.25. Test stage specific iris localization algorithm average errors in pixels on standard test set.
Figure 7.26. Test stage specific iris localization algorithm average errors in pixels on three test sets.
Figure 7.27. Test stage specific iris localization algorithm average errors in pixels on test set with lateral illumination.
Figure 7.28. Various configurations of the iris localization algorithm tested on the standard test set – errors in pixels.
Figure 7.29. Average eye gaze tracking errors in degrees on standard test set depending on observed eyeball radius, baseline head pose tracking algorithm is used.
Figure 7.30. Average eye gaze tracking errors in degrees on standard test set depending on observed eyeball ellipsoid shape, baseline head pose tracking algorithm is used.
Figure 7.31. Average eye gaze tracking errors in degrees on standard test set depending on angle β of the visual axis, baseline head pose tracking algorithm is used.
Figure 7.32. Average eye gaze tracking errors in degrees on standard test set depending on angle α of the visual axis, baseline head pose tracking algorithm is used.
Figure 7.33. Average eye gaze tracking errors in degrees on standard test set for various eye radii and eyeball shapes assuming a complex eye model with angles α = 5.5° and β = 2.0°, baseline head pose tracking algorithm is used.
Figure 7.34. Average eye gaze tracking errors for the left and right eye in degrees on standard test set depending on angle α of the visual axis, baseline head pose tracking algorithm is used.
Figure 7.35. Average eye gaze tracking errors in degrees on standard test set depending on interpupillary distance, baseline head pose tracking algorithm without depth is used.
Figure 7.36. Average eye gaze tracking errors in degrees for several head pose tracking algorithms on standard test set.
Figure 7.37. Test stage specific eye gaze tracking errors in degrees for baseline tracking algorithm on standard test set.
Figure 7.38. Test stage specific eye gaze tracking errors in degrees, mesh tracking errors in pixels and iris localization errors in pixels for baseline tracking algorithm on standard test set. Left axis refers to degree errors and right to pixel errors.
Figure 7.39. Sequence specific eye gaze tracking errors in degrees for baseline tracking algorithm on standard test set.
Figure 7.40. Average eye gaze tracking errors in degrees for three head pose tracking algorithms on three test sets.
Figure 7.41. Test stage specific eye gaze tracking errors in degrees for baseline head pose tracking algorithm on three test sets.
Figure 7.42. Average eye gaze tracking errors in degrees for several head pose tracking algorithms using depth on standard test set.
Figure 7.43. Test stage specific eye gaze tracking errors in degrees for baseline tracking algorithm using depth for initialization on standard test set.
Figure 7.44. Test stage specific eye gaze tracking errors in degrees, mesh tracking errors in pixels and iris localization errors in pixels for baseline tracking algorithm using depth for initialization on standard test set. Left axis refers to degree errors and right to pixel errors.
Figure 7.45. Sequence specific eye gaze tracking errors in degrees for baseline tracking algorithm using depth during initialization on standard test set.
Figure 7.46. Comparison of sequence specific average eye gaze tracking errors in degrees for baseline tracking algorithm on standard test set when using and when not using depth during initialization.
Figure 7.47. Average eye gaze tracking errors in degrees for four head pose tracking algorithms using depth on three test sets.
Figure 7.48. Test stage specific eye gaze tracking errors in degrees for baseline tracking algorithm using depth for initialization on three test sets.
Figure 7.49. Relation of eye gaze error in degrees and iris localization error in pixels for various configurations of the iris localization algorithm tested on the standard test set.
Figure 7.50. Sequence specific eye gaze tracking errors in degrees for baseline tracking algorithm without depth on test set with initial head rotation.
Figure 7.51. Sequence specific eye gaze tracking errors in degrees for baseline tracking algorithm with depth used for initialization on test set with initial head rotation.
Figure 7.52. Sequence specific eye gaze tracking errors in degrees for baseline tracking algorithm with and without depth used for initialization on two test sequences with lateral illumination.
Figure 7.53. Average eye gaze tracking errors in degrees for baseline tracking algorithm without depth used for initialization on two test sequences with lateral illumination; for three illumination compensation algorithms.
Figure 7.54. Average eye gaze tracking errors in degrees for baseline tracking algorithm with depth used for initialization on two test sequences with lateral illumination; for three illumination compensation algorithms.
Figure 7.55. Sequence specific eye gaze tracking errors in degrees for baseline tracking algorithm with depth used for initialization on standard test set with and without illumination compensation.


List of used abbreviations

Abbreviation   Meaning
AAM            Active appearance model
ASM            Active shape model
BRISK          Binary robust invariant scalable keypoints
CLM            Constrained local model
DCT            Discrete cosine transform
FPS            Frames per second
HOG            Histogram of oriented gradients
ICP            Iterative closest point
IRLS           Iteratively reweighted least squares
PCA            Principal component analysis
POG            Point of gaze
POR            Point of regard (same as POG)
RANSAC         Random sample consensus
RGB            Red, green, blue – refers to typical 3 channel color image
RGBD           Red, green, blue, depth – refers to 3 channel color image with depth
RST            Radial symmetry transform
SfM            Structure from motion
SIFT           Scale invariant feature transform
STAR           Abbreviation for center surround extrema keypoint detector [1]
SVD            Singular value decomposition


1 Introduction

1.1 Problem definition

The eyes are a very important part of the human body. They play a major role in everyday life, in expressing a person’s desires, in interpersonal relations and in all kinds of cognitive processes. The eyes are a medium through which humans communicate with the outer visual world. Because of this, the ability to track a person’s gaze in a non-intrusive way seems crucial for designing efficient human-computer interfaces and for a better understanding of human perception. This is the main research goal of this dissertation.

It is also important to distinguish between two often confused terms: eye tracking and eye gaze tracking. The first term refers to finding the location of a person’s eyes in an image or video. The second term refers to estimating the exact point in the world where the user is looking, called the point of regard (POR) or gaze point. This dissertation focuses primarily on the second problem, although solving the first problem is also necessary in the process. Another term used interchangeably with eye gaze tracking is eye gaze estimation. The latter is actually a broader term, as it can refer not only to processing continuous data with time dependencies, but also to static images, in order to determine the person’s gaze.

When discussing eye gaze tracking in the rest of this thesis, three different terms will be used to describe the quality of the system. For clarity, these are briefly explained below:

• accuracy – meaning the average error in gaze angle or gaze point estimation that the system produces under given conditions (a short worked conversion between angular and on-screen error follows this list)

• robustness – meaning how well the system tolerates difficult environmental conditions (such as illumination), image noise and extreme head poses while maintaining good accuracy

• stability – meaning how much input variations impact the output values; a system with good stability will demonstrate small changes in output values when the input changes are small
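To make these accuracy figures tangible, the following minimal sketch converts an angular gaze error into the corresponding on-screen distance. It assumes the gaze ray is roughly perpendicular to the display; the 60 cm viewing distance is a hypothetical example value, not a parameter of the proposed system.

    import math

    def angular_to_screen_error(error_deg, viewing_distance_cm=60.0):
        # On-screen error (in cm) corresponding to an angular gaze error, assuming
        # the gaze ray is roughly perpendicular to the display plane.
        return viewing_distance_cm * math.tan(math.radians(error_deg))

    # A 1 degree error at a 60 cm viewing distance is roughly 1 cm on the screen.
    print(round(angular_to_screen_error(1.0), 2))  # -> 1.05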

Given the importance of the eyes for human beings, it is little wonder that the topic of eye gaze tracking has been actively researched for many decades. As a result, a multitude of different approaches to solving the problem have been developed, ranging from highly intrusive ones, which use probes to measure electrical muscle activity, to non-intrusive ones based on optical sensors. An overview of the possible approaches and their classification is shown in Figure 1.1. This work is focused on passive methods, which encompass all remote, non-intrusive systems using only image registration techniques for the purpose of eye gaze estimation. As described in a recent survey on vision based eye gaze tracking [2], even when considering this narrower problem, there are numerous approaches. An overview of them is presented in Chapter 2. The research presented in this work is related to passive geometric eye gaze tracking methods. The other kinds of methods shown in Figure 1.1 are deemed out of the scope of this work and will not be described further.

Figure 1.1. Classification of eye gaze tracking methods.

1.2 Human eye model

Many of the proposed eye gaze tracking methods, both in prior work and in this dissertation, rely heavily on the eye model used. It will therefore be helpful to introduce a relatively detailed anatomical human eye model at this point for future reference. It is shown in Figure 1.2, with an anatomical model of the whole eyeball on the left and a frontal view of the eye on the right.

Figure 1.2. Human eye anatomy [3]. Left: eye model cross section. Right: eye frontal view.

The eyeball has the shape of a sphere, but it is not perfectly symmetrical [3, 4, 5]. The vertical diameter is slightly smaller than the horizontal diameter, with both being close to 24 mm. The transparent front part of the eye is modelled as a smaller protruding spherical structure – the cornea – with a radius of around 8 mm. However, like the eyeball, the cornea is in fact an ellipsoid. Behind the cornea are the pupil and lens, which are a core part of the optical system that focuses light rays on the retina so that humans can see.

The core visible elements shown in the frontal view are:

• Sclera – opaque, fibrous, protective, outer layer of the eye containing collagen and elastic fiber, visible as white from the front.

• Iris – thin, circular structure in the eye, responsible for controlling the diameter and size of the pupil and thus the amount of light reaching the retina. The iris is visible as a colorful disc surrounding the pupil.

• Pupil – the hole located in the center of the iris that allows light to strike the retina. It is always black in appearance; its size differs depending on lighting conditions. The pupil has an associated entrance pupil and exit pupil – points typically defined for optical systems.


• Limbus – the boundary between the cornea and the sclera. In practice it is also the border of the iris.

• Eye corners – the points where upper and lower eyelids meet.

In short, human vision works as follows: light rays enter the eye through the pupil and are focused by the lens on the retina. From the standpoint of eye gaze tracking it is crucial to understand what determines a person’s gaze focus point and how the gaze vector is formed in the eye. This is presented schematically in Figure 1.3, based on recent studies of eyeball anatomy [4, 5]. It is very important in the context of all model-based eye gaze tracking systems, where a somewhat simplified eye model is often assumed.

Figure 1.3. Gaze formation in the eyeball.

The eye has several axes that can be geometrically defined. These are:

• The optical axis – is defined as the axis passing through the eyeball rotation center and corneal center (C). It can be viewed as the geometrical symmetry axis of the eyeball.

• The pupillary axis – is defined as a normal line to the corneal surface passing through the center of the entrance pupil (EP). It is very close to the optical axis of the eye.


• The line of sight – is defined as the ray that goes from the point of gaze (POG) to the fovea. It also passes through the center of the entrance pupil (EP) and the center of the exit pupil (EP’).

• The visual axis – is defined as the ray that goes from the observed object to the fovea through both nodal points of the eye (N and N’). The visual axis and line of sight are very close to each other.

The important conclusion is that the eye geometry is not as straightforward as it might seem and there are several different axes; in particular, the optical axis and the visual axis are not the same. It is reported in the literature that the horizontal angle between the visual and the optical axes of the eye ranges from 3.5° to 7.5° (average of 5.5°) and the vertical angle from 0.25° to 3.0° (average of 1.0°) [6]. Many eye gaze tracking systems, however, assume a simplified spherical eyeball shape and approximate the visual axis with the optical axis [2].
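As a rough illustration of how a model-based system can account for this offset, the sketch below rotates an optical axis direction by the average horizontal and vertical angles quoted above to approximate the visual axis. The coordinate frame, rotation order and sign conventions are illustrative assumptions, not the definitions used later in this thesis.

    import numpy as np

    def visual_axis_from_optical(optical_axis, alpha_deg=5.5, beta_deg=1.0):
        # Approximate the visual axis by rotating the optical axis by the average
        # horizontal (alpha) and vertical (beta) offsets. Rotation order and sign
        # conventions here are illustrative assumptions.
        a, b = np.radians(alpha_deg), np.radians(beta_deg)
        rot_y = np.array([[np.cos(a), 0.0, np.sin(a)],
                          [0.0, 1.0, 0.0],
                          [-np.sin(a), 0.0, np.cos(a)]])
        rot_x = np.array([[1.0, 0.0, 0.0],
                          [0.0, np.cos(b), -np.sin(b)],
                          [0.0, np.sin(b), np.cos(b)]])
        v = rot_x @ rot_y @ np.asarray(optical_axis, dtype=float)
        return v / np.linalg.norm(v)

    # Optical axis pointing straight ahead along the camera z axis.
    print(visual_axis_from_optical([0.0, 0.0, 1.0]))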

1.3 Motivation

Although a lot of research has been carried out in the area of eye gaze tracking, the problem still cannot be considered solved. While numerous commercial eye gaze tracking systems exist and demonstrate good accuracy, they require a sophisticated and expensive environment setup [7, 8]. On the other hand, with the emergence of smartphones and tablets, the number of electronic devices equipped with simple cameras has grown significantly. The systems that can be deployed to such devices without any hardware modifications have limited accuracy [9, 10, 11, 12, 13, 14].

The community would certainly benefit from a more accurate and easier-to-use eye gaze tracking system than what has been presented in the literature so far. Such a system should be easily accessible to everyone, regardless of the type of device they are using. It should work in real time and provide better accuracy than previous approaches. Apart from this, the system should not require any tiresome calibration, as this seriously limits usability. The aim of this work is to present a system meeting all these criteria and to demonstrate its superiority in the proposed scenarios.

1.4 Dissertation Structure

This dissertation is structured as follows:


• Chapter 2 summarizes all the previous work in the field of passive eye gaze tracking and presents the current state of the art

• Chapter 3 describes the first core component algorithm of the proposed eye gaze tracking system – head pose estimation

• Chapter 4 describes the second core component algorithm of the proposed eye gaze tracking system – iris localization

• Chapter 5 describes the proposed geometric eye gaze tracking system in detail and aims to justify the used assumptions

• Chapter 6 presents an evaluation of theoretical errors resulting from inaccuracies of the assumed model parameters used in the eye gaze tracking system

• Chapter 7 focuses on a broad set of experiments aimed at measuring the performance of the proposed system and component algorithms

• Chapter 8 concludes the research and experimental results, and points towards promising areas for future research

1.5 Theses

This doctoral dissertation aims to verify the following theses:

1. The proposed eye and gaze model, requiring only single-point initialization, allows estimating the eye gaze direction with accuracy comparable to state-of-the-art model-based systems that use multiple-point calibration.

2. The proposed system works best when the environment parameters (such as the distance between the user’s eyes) are equal to those assumed in the model. However, the system is robust to small variations of these parameters, i.e. small deviations of these parameters from the model parameters do not degrade the system performance significantly.

3. The proposed modifications of previously published component algorithms of the eye gaze tracking system improve the accuracy and robustness of the system. These are most importantly:

a. New hybrid head pose tracking algorithm
b. New model initialization procedure for head pose tracking
c. New model reinitialization procedure for head pose tracking
d. New aggregate template feature based head pose estimation method
e. Novel methods of using depth measurements from a depth sensor for model-based head pose tracking
f. New adaptive iris localization algorithm selecting iris arcs depending on gaze and head pose estimates
g. Modified refinement step of iris localization using ellipses
h. New more complex eye model considering the deviation between the optical and visual axis


2 Previous work

Previous work covers several areas: the type of approach to eye gaze tracking and the specific component algorithms used for its purpose. Each of these areas will be discussed in this chapter.

2.1 Eye gaze tracking

Although remote passive eye gaze tracking methods are only a subgroup of all the possible approaches to eye gaze tracking, the amount of completed research is extensive. There are several surveys on the existing methods [2, 15, 16], where the methods of interest are often referred to as natural light methods, which stresses that they do not use IR glints to estimate the gaze.

In order to have a better understanding of how all existing methods are related, it is helpful to group them further into two categories:

• Appearance based methods – use neural networks, linear regression or other forms of learning to establish a mapping between the image of the face and the user’s gaze point

• Model based methods – use a 3D geometric model of the eye to estimate the gaze vector and determine the gaze point as the intersection of this vector with the screen (a minimal sketch of this intersection follows this list)
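The intersection mentioned for model based methods is a simple ray-plane computation. The following minimal sketch assumes that the eye position, gaze direction and screen plane are all expressed in one common (e.g. camera) coordinate frame; the function name and example values are hypothetical.

    import numpy as np

    def gaze_point_on_screen(eye_center, gaze_dir, screen_point, screen_normal):
        # Intersect the gaze ray eye_center + t * gaze_dir with the display plane
        # defined by any point on the screen and the screen normal.
        eye_center = np.asarray(eye_center, dtype=float)
        gaze_dir = np.asarray(gaze_dir, dtype=float)
        denom = np.dot(screen_normal, gaze_dir)
        if abs(denom) < 1e-9:
            return None  # gaze ray is (nearly) parallel to the screen plane
        t = np.dot(screen_normal, np.asarray(screen_point) - eye_center) / denom
        return eye_center + t * gaze_dir

    # Eye 60 cm in front of a screen lying in the z = 0 plane; the gaze ray hits
    # the screen at about (3.0, -1.2, 0.0).
    print(gaze_point_on_screen([0, 0, 60], [0.05, -0.02, -1.0], [0, 0, 0], [0, 0, 1]))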

Each group of methods will be reviewed separately. Whenever it is helpful to better present the research context, active methods using IR illumination will also be mentioned.

2.1.1 Appearance based approaches

Appearance based methods create a mapping between images of the user’s face and gaze focus points. They implicitly model the function that estimates the point of regard from relevant features and personal variation. Therefore, they do not require scene geometry or camera calibration.

An early work [17] presents an eye gaze tracking system based on a multilayer neural network. Extensive calibration is required in order to gather the training data – the user needs to move the mouse and follow it with their gaze for a certain period of time. The reported results claim 1.5 degree accuracy of gaze angle estimation, but it is not described how exactly the testing procedure was performed. Also, a specular reflection resulting from an external light source seems to be a crucial element of the system, which disqualifies this approach from passive usage. Furthermore, the system can only be used on the user that it is trained on. Although the paper states that head movements are possible, it is unclear how big their influence is on the reported accuracy results. Judging by the fact that many later publications in the field of appearance based gaze estimation report lower accuracy despite requiring a still head, the capability of the system to deal with significant head pose changes is doubtful.

A significant advance is a system using a morphable model to identify the eye region and a neural network to map the model parameters to the viewed screen coordinates [18]. This system is partly user independent and does not rely on any external illumination source. On the other hand, the reported accuracy is quite low and the system is far from real time. Tan et al. [19] later proposed to use an appearance manifold with linear interpolation to estimate the gaze. This essentially performs a nearest neighbor search among the training images for each given input image. The claimed mean angular error of 0.46 degrees is impressive, but the system requires a large number of training samples (several hundred), which is impractical. Furthermore, the measurement methodology used for the accuracy evaluation is far from real-world scenarios. The evaluation was performed on a set of manually tagged ground truth images of a person’s eyes with known points of regard, where one image from the whole set was selected as the test image and mapped to the appearance manifold created from the other images. Unfortunately, it is not possible to further evaluate this system, as the software is not publicly available.

A slightly different approach was proposed by Morency et al. [20], who use eigenspaces and approximate the gaze direction by selecting the most similar sample from the training set. This approach reports an average accuracy of only 9 degrees, despite the fact that head movements are tracked using an Active Appearance Model (AAM) framework. Around the same time, Ono et al. [21] presented an eye gaze tracking system using low resolution images and N-mode Singular Value Decomposition (SVD). They focused on solving the problem of inaccurate eye region positioning, which causes significant degradation of appearance based methods and is particularly significant for low resolution images. The proposed solution was twofold: incorporating training images of eye regions with artificially added positioning errors, and separating the factor of gaze variation from that of positioning error, thus formulating gaze estimation as a bilinear problem. This problem is solved by alternately minimizing a bilinear cost function with respect to gaze direction and position of the eye region. The reported accuracy results of below 3° are somewhat impressive considering the low quality of the input images. However, the system did not consider different users or head pose changes, which seriously limits its usability.

The next significant work in the field is that of Williams et al. [22], who use a sparse, semi-supervised Gaussian process interpolation method on filtered visible spectrum images. As a result, images of an eye are mapped to 2D screen coordinates. It is notable that the presented system works in real time and claims to have an accuracy nearly as good as that of Tan et al. [19]. However, the testing procedure is not described in detail and head movements are not considered.

The first publicly available piece of software for webcam eye gaze tracking was OpenGazer [12]. This software was developed by Piotr Zieliński, who is known to have worked together with Oliver Williams, the author of [22]. It is therefore highly probable that OpenGazer is in fact an implementation of the system described in [22]. According to the description on the website, OpenGazer learns a Gaussian mapping between eye images and predefined calibration points. After this, the user’s face is tracked in order to extract new eye images for classification. The face tracking is performed by optical flow tracking of several preselected points. This means that only two-dimensional tracking is performed – the head pose remains unknown. As a consequence, any change of head pose is sure to have a negative influence on the system.
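For reference, two-dimensional point tracking of this kind is commonly implemented with pyramidal Lucas-Kanade optical flow. The OpenCV-based sketch below illustrates the general idea only; it is not taken from OpenGazer, and the window size and pyramid depth are arbitrary example parameters.

    import cv2
    import numpy as np

    def track_points(prev_gray, curr_gray, prev_pts):
        # Track 2D points between consecutive grayscale frames using pyramidal
        # Lucas-Kanade optical flow; return only the successfully tracked points.
        prev_pts = prev_pts.reshape(-1, 1, 2).astype(np.float32)
        curr_pts, status, _err = cv2.calcOpticalFlowPyrLK(
            prev_gray, curr_gray, prev_pts, None, winSize=(21, 21), maxLevel=3)
        good = status.ravel() == 1
        return curr_pts[good].reshape(-1, 2), prev_pts[good].reshape(-1, 2)

    # Toy usage: a textured frame shifted 3 pixels to the right; the tracked point
    # should move by roughly the same amount.
    rng = np.random.default_rng(0)
    frame_a = rng.integers(0, 255, (120, 160), dtype=np.uint8)
    frame_b = np.roll(frame_a, 3, axis=1)
    print(track_points(frame_a, frame_b, np.array([[80.0, 60.0]]))[0])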

The software was tested by the author with a Microsoft LifeCam Studio webcam. When the calibration was completed correctly and no head movements were performed, the system achieved a 3x3 gaze tracking resolution on the screen, which roughly corresponds to an angular accuracy of 11 degrees. In the case of head movements the system failed completely. This is a big difference compared to the accuracy of 0.83 degrees reported in [22], but sometimes the difference between laboratory tests and real-world scenarios is indeed huge.

Another system that requires 9-point calibration is that of Lu et al. [23]. This is a linear regression based approach, with considerable improvement over the work of [19]. Instead of requiring densely sampled training data to create the mapping from a high-dimensional feature space to a low-dimensional gaze point space, an adaptive linear regression strategy requiring sparse training data is proposed. The goal is to adaptively find the subset of training samples in which the test samples are most linearly representable. This drastically reduces the required number of calibration points – from several hundred to just a few – while maintaining high accuracy (around 1°). This actually makes the system practical, with the one major remaining drawback that head movements are not handled.
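The general idea of such an adaptive linear regression can be sketched as follows: a test eye-appearance vector is reconstructed as a sparse, non-negative combination of the calibration appearance vectors, and the same weights are then applied to the corresponding calibration gaze points. The l1-regularized solver and all parameter values below are illustrative assumptions, not the optimization actually used in [23].

    import numpy as np
    from sklearn.linear_model import Lasso

    def estimate_gaze_alr(x_test, X_train, G_train, alpha=0.01):
        # Find sparse, non-negative weights that reconstruct the test appearance
        # vector from the training appearance vectors (columns of X_train.T), then
        # apply the same weights to the calibration gaze points.
        model = Lasso(alpha=alpha, positive=True, fit_intercept=False, max_iter=10000)
        model.fit(X_train.T, x_test)
        w = model.coef_
        if w.sum() == 0:
            return G_train.mean(axis=0)   # degenerate fallback
        return (w @ G_train) / w.sum()    # weighted combination of gaze points

    # Toy example: 5 calibration samples with 8-dimensional appearance features.
    rng = np.random.default_rng(0)
    X = rng.random((5, 8))
    G = rng.random((5, 2)) * [1920, 1080]
    print(estimate_gaze_alr(X[2], X, G))  # close to the gaze point of sample 2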

More recently, the work of Sugano et al. [24] presents a method where saliency maps extracted from videos are treated as probability distributions for gaze points. This allows a mapping between the eye images and gaze points to be created automatically, without any calibration. Gaussian process regression is used to learn the mapping between the images of the eyes and the gaze points, similarly to OpenGazer [12]. The paper reports an accuracy of 3.5 degrees, but assumes that the head is completely still and even states explicitly that the experiments were conducted with a chinrest, in a fixed testing environment. In the conclusions the authors mention the issue of head movements and admit that it still remains an unsolved problem.

Alnajar et al. [25] also propose to solve the problem of automatic user calibration. They assume that different users follow similar gaze patterns when presented with certain stimuli. Therefore, gaze patterns of calibrated individuals can be used for new viewers without active calibration. The authors compared several mapping methods and selected the best one, which uses a 2D manifold [23] with K-closest point fitting. As in [24], only a simple webcam is used to obtain eye images. However, the head motion is not constrained by a chinrest and much shorter attention analysis is required to map the user's gaze – approximately 3 s instead of the at least 150 s required in [24]. The reported accuracy of 4.3° is certainly a big achievement considering all these conditions.

An even more impressive work is the system proposed by Lu et al. [26]. It also uses Gaussian process regression to map eye appearances to gaze points, however it directly handles head motion – something that previous systems failed to address. The key idea is to decompose the eye gaze tracking problem into two sub-problems: an initial fixed head pose problem and subsequent compensations to correct the initial estimation biases. The reported accuracy considering head movements is 3 degrees, which is comparable to state-of-the-art model-based approaches. The drawbacks of the system are the need for complex calibration and the fact that no real-time implementation has been demonstrated. The biggest doubt, however, concerns the accuracy and is related to the testing methodology. The leave-one-out experiments, which consist of selecting each sample as a test sample and using the rest to train the regression model, do not represent real-world usage.

In the last few years, several publications have tackled the problem of appearance based eye gaze estimation in real-world scenarios. Very recently, Zhang et al. [27] performed a very large-scale study using convolutional networks, which also compares several other recently published methods [23, 28, 29, 30]. They recorded a dataset of over 200,000 images from real-life environments, where people periodically looked at points displayed on their laptops. The diversity of this test data exceeds anything gathered previously. The proposed gaze estimation algorithm works as follows. After the face is detected, six facial landmarks are located using the approach of [31]. A generic 3D model is fitted to these landmarks in order to estimate the head pose. Both the head pose and normalized eye regions are inputs to a multimodal convolutional neural network [32].
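
A minimal sketch of such a multimodal network is given below (in PyTorch). It is not the architecture of [27, 32]; the eye-patch size, layer sizes and the two-angle head pose input are illustrative assumptions, intended only to show how an eye image and a head pose vector can feed a single gaze regressor.

```python
import torch
import torch.nn as nn

class MultimodalGazeNet(nn.Module):
    """Toy multimodal regressor: normalized eye image + head pose -> 2D gaze angle."""

    def __init__(self):
        super().__init__()
        # Small convolutional branch for a 36x60 grayscale eye patch (hypothetical size).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        with torch.no_grad():
            feat_dim = self.conv(torch.zeros(1, 1, 36, 60)).shape[1]
        # Fully connected head that also receives the 2D head pose angles.
        self.fc = nn.Sequential(
            nn.Linear(feat_dim + 2, 128), nn.ReLU(),
            nn.Linear(128, 2),           # yaw and pitch of the gaze direction
        )

    def forward(self, eye_img, head_pose):
        feats = self.conv(eye_img)
        return self.fc(torch.cat([feats, head_pose], dim=1))

net = MultimodalGazeNet()
out = net(torch.rand(4, 1, 36, 60), torch.rand(4, 2))
print(out.shape)  # torch.Size([4, 2])
```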

Zhang et al. evaluated their method in two scenarios: cross-dataset validation using their own dataset, [28] and [33], and within-dataset validation on their own dataset only. The measured accuracy was 13.9° in the cross-dataset evaluation and 6.3° in the within-dataset leave-one-person-out evaluation. While these accuracies might seem low compared to previous publications, they are calculated in completely unconstrained conditions under various head poses and illumination conditions, something that was never done before. As part of their research the authors compared their approach to several other recently pursued appearance based methods. They claim to achieve better performance than all those methods, among which were Random Forests [28], k-Nearest Neighbors [28], Adaptive Linear Regression [23, 29] and Support Vector Regression [30]. No other such extensive evaluation of appearance based eye gaze tracking methods has been performed to date.

To recapitulate, high sensitivity to head movements and illumination changes is a common shortcoming of all appearance based gaze estimation methods. The appearance of the eye region may look the same for different poses and gaze directions. At the same time, illumination changes will cause different appearance under the same pose and gaze direction. Another common problem of appearance based methods is accurate eye region positioning. Even a small positioning error of the cropped eye region can significantly degrade the gaze tracking accuracy. All the mentioned problems limit the accuracy in real-world scenarios. A further drawback of appearance based eye gaze estimation is the usual need for calibration of each new user, although recent work has demonstrated that this might not be necessary in the future [27, 28].

2.1.2 Model based approaches

Model based methods explicitly model the eye and 3D scene geometry in order to use geometrical properties for solving the eye gaze estimation problem. Generally, they require scene geometry and camera calibration. The number of approaches is vast, with most work related to active eye gaze tracking using corneal reflections from infrared illuminators. However, as the primary focus of this work is on passive eye gaze tracking, only related work – a small subset of all model based approaches – will be covered in this section.

Model based methods can be divided into two groups. The first is based on the property that the iris observed from an angle has an elliptical shape. The elliptical distortion of the circular iris depends on the angle at which it is rotated relative to the camera. The second group of approaches models the eyeball in 3D and calculates the gaze from a gaze vector derived from anatomical eye parameters such as the eyeball center, iris center or center of the cornea.

2.1.2.1 Ellipse fitting approaches

Wang and Sung [34] were the first to actively pursue eye gaze tracking by analyzing only the shape of the observed irises. In their method a simple eyeball model is assumed where the eye is a sphere and the visual axis is approximated by the optical axis. Therefore, the gaze vector is determined as the vector from the eyeball center to the iris center. Each of these points is found independently in every image, so no head pose tracking algorithm is necessary other than coarse facial feature tracking to know where the eye regions are. The method assumes that irises are circles in the real world. The perspective projection of a circle onto the image plane yields an ellipse. This ellipse is found by least squares fitting of a quadratic curve to the detected iris edges. Once the equations of two ellipses are known, the exact pose of the eyeballs (three rotations and three translations) can be estimated analytically using the two-circle algorithm. A similar approach to that of [34] was later presented by Wu et al. [35].
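
The ellipse-fitting step can be illustrated with a simple algebraic least-squares fit to the detected iris edge points; the sketch below is one basic variant (an unconstrained conic fit via SVD), not the exact procedure of [34], and the two-circle pose recovery itself is not reproduced. Under a weak-perspective assumption the tilt of the iris plane can then be approximated from the ratio of the recovered semi-axes.

```python
import numpy as np

def fit_ellipse(points):
    """Algebraic least-squares fit of a conic A x^2 + B xy + C y^2 + D x + E y + F = 0
    to 2D edge points; returns (center, semi_axes). Assumes the fit is an ellipse."""
    x, y = points[:, 0], points[:, 1]
    design = np.column_stack([x * x, x * y, y * y, x, y, np.ones_like(x)])
    # The conic coefficients are the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(design)
    A, B, C, D, E, F = vt[-1]
    Q = np.array([[A, B / 2], [B / 2, C]])
    center = np.linalg.solve(Q, [-D / 2, -E / 2])
    # The conic value at the center determines the scale of the axes.
    val = center @ Q @ center + D * center[0] + E * center[1] + F
    eigvals = np.linalg.eigvalsh(Q)
    semi_axes = np.sort(np.sqrt(-val / eigvals))[::-1]   # (major, minor)
    return center, semi_axes

# Toy usage: a circular iris of radius 6 px viewed at a 40 deg tilt appears as an ellipse.
t = np.linspace(0, 2 * np.pi, 100)
pts = np.column_stack([6 * np.cos(t), 6 * np.cos(np.radians(40)) * np.sin(t)])
_, (a, b) = fit_ellipse(pts + 0.05 * np.random.randn(*pts.shape))
print("estimated tilt:", np.degrees(np.arccos(b / a)))   # roughly 40 degrees
```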

The biggest benefit of the described system is that no user-specific calibration is required. However, several parameters such as the eyeball radius or iris radius are predetermined as fixed, which will lead to different performance depending on the user. One important shortcoming of the ellipse fitting approach in general is that relatively high resolution images of the irises are required. This is solved in [34] by using a zoom-in camera. A further drawback of the ellipse fitting approach is the need to see the eye boundaries in good detail. If the eyelids obstruct the view of the camera and only a small part of the iris boundaries is visible, the system will fail. The reported average accuracy of around 5 degrees is not bad, but due to its design the system provides the worst accuracy when the user is looking towards the camera, which is usually near the observed screen center. The authors do not provide a more detailed accuracy evaluation, so the true accuracy achievable in real-world cases remains unknown.

In later work, Wang et al. proposed a one-circle algorithm [36]. It is in principle similar to the two-circle algorithm – the detected elliptical iris contour is used to calculate the normal vector of the supporting plane of the circular iris boundary. The main difference compared to [34] is that the ambiguity of multiple analytical solutions is resolved differently. In [34] two iris ellipses were required and it was assumed that the directions in which they are oriented are parallel. In the new approach, Wang et al. propose to use domain knowledge about the eye region to resolve the ambiguity. They also use a simple calibration procedure to establish the central gaze vector. The distances between the eye center and eye corners are assumed to be equal. The eye corners are found by a separate dedicated algorithm [37]. Because the one-circle algorithm only requires the image of one eye, it can zoom in further than the two-circle algorithm. Consequently, the system accuracy is improved and reported as 1 degree. This seems very impressive, but it should be remembered that a pan-tilt-zoom (PTZ) camera is essential. Also, the reported experiments are quite limited and it is not clear how much impact head movements and user appearance variations have on the presented system.

Kohlbecher and Poitschke [38] extended the work of [36] to a form that does not require personal calibration. They proposed to use a stereo camera setup to reconstruct the pupil in 3D space using a closed mathematical framework. The gaze is then estimated as the pupil normal. If the position of the camera rig is known, this approach works without the need for user calibration, because the position and orientation of the pupil ellipse is known explicitly. Furthermore, there is no need to deal with ambiguities and elliptical iris shapes are also correctly accounted for. On the other hand, a stereo camera setup is necessary, which is much less practical than using a monocular camera. The reported accuracy is also lower, at best around 2 degrees. While difficult to compare because of the lack of a common test database, it is clear that improvement is necessary to prove the practical value of the proposed methodology.

Quite recently, the ellipse fitting approach to eye gaze tracking was demonstrated on a tablet [14]. The so-called Eyetab system follows similar reasoning to the work of Wang [36], with certain modifications. Most importantly, the front camera of the tablet is used for obtaining eye images instead of the remote PTZ camera. Furthermore, different algorithms are used for iris boundary detection and ellipse fitting itself, to better suit the type of images obtained from the tablet front camera. A random sample consensus (RANSAC) method is used instead of direct least squares ellipse fitting. Also, a simplified trigonometric procedure is used to obtain the 3D gaze vector orientation from the ellipse shape. The system's accuracy is reported as 6.88°, although it is mentioned that for some people the tests fail completely. Considering that the users had to keep their head 20 cm away from the tablet screen, there is much room for improvement. Nonetheless, Eyetab is one of the few real-time eye gaze tracking systems that use only the RGB camera and can be effectively used. It should also be noted that the source code of Eyetab has been made publicly available for download on GitHub [39].

2.1.2.2 Eyeball model approaches

There are a significant number of model-based approaches to eye gaze tracking. They vary in how the eye is modelled and what is measured. In this section both approaches that explicitly model the eye geometry and those that only make certain assumptions and interpolate calibration measurements will be discussed.

One of the first eye gaze models proposed using only the face orientation as the gaze direction [40]. In this system the mouth and eye corners are used to define a facial plane. Based on the tracked location of these facial features and assuming weak projection, the face orientation is calculated. While providing useful information about the general direction in which the face is rotated, this system cannot achieve any measurable eye gaze tracking accuracy.

The work of Heinzmann and Zelinsky [41] extended the idea proposed in [40]. Apart from tracking several facial features to infer the face orientation, the relative position of the irises and eye corners is used to determine the relative angle between the facial normal and the gaze direction. This angle is later combined with the head pose to obtain the gaze vector in camera coordinates. The authors claim that the system is real-time, which makes it one of the first real-time head pose and eye gaze tracking systems ever created. Unfortunately, the exact eye model used for calculating the relative eye orientations is not described. Also, no gaze estimation accuracy analysis was performed.

Matsumoto and Zelinsky [42] proposed to measure the person's gaze in a stereo vision setup, with a very similar system described in [43]. This was the first work in which the used 3D eye model was clearly described and the gaze vector estimation methodology fully explained. The system works in three stages: initialization, face tracking and gaze detection. In the first stage, the face is located in the image and the tracking algorithm is initialized. In the second stage the head pose is tracked based on the location of six facial features matched by template matching: inner eye corners, outer eye corners and mouth corners. Because stereo cameras are used, the feature locations in 3D space can be calculated directly. A generic face model is fitted to the estimated six features in order to estimate the final head pose. In the third stage, a simple eye model is used to calculate the gaze vector. This eye model is shown schematically in Figure 2.1.

Figure 2.1. Gaze vector calculation method in [42]

As can be seen, the eye is assumed to be a sphere with three parameters:

• The eye radius
• The iris radius
• The relative position of the eyeball center with respect to the head pose

The authors assume that the center of the eyeball is shifted by an offset vector from the midpoint between the two eye corners. The eye radius and iris radius are manually adjusted based on test sequences with known gaze points, with average values estimated as 13 mm and 7 mm respectively. Once the eye parameters are set and the 3D eye corner positions are known, the gaze vector from the eyeball center to the iris center can be calculated. The intersection of this vector with the screen yields the gaze focus point.
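
This geometric model reduces to a few lines of vector arithmetic. The sketch below assumes that the 3D eye corner positions, the 3D iris center and the per-user offset vector are already available in camera coordinates (all numbers are placeholders), and intersects the resulting gaze ray with a screen approximated as the plane z = 0.

```python
import numpy as np

def gaze_point_on_screen(corner_inner, corner_outer, iris_center_3d,
                         offset, screen_plane_z=0.0):
    """Gaze point as the intersection of the eyeball-center -> iris-center ray
    with a screen plane z = screen_plane_z (camera coordinates, millimetres).

    offset : per-user displacement of the eyeball center from the eye-corner midpoint.
    """
    eyeball_center = (corner_inner + corner_outer) / 2.0 + offset
    gaze_dir = iris_center_3d - eyeball_center
    gaze_dir = gaze_dir / np.linalg.norm(gaze_dir)
    # Ray-plane intersection: eyeball_center + t * gaze_dir, with z = screen_plane_z.
    t = (screen_plane_z - eyeball_center[2]) / gaze_dir[2]
    return eyeball_center + t * gaze_dir

# Placeholder values, roughly in millimetres in front of the camera.
inner = np.array([15.0, 0.0, 500.0])
outer = np.array([45.0, 0.0, 500.0])
iris = np.array([28.0, 1.0, 488.0])
offset = np.array([0.0, 0.0, 13.0])    # eyeball center lies behind the corner midpoint
print(gaze_point_on_screen(inner, outer, iris, offset))
```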

The proposed system is real-time and provides both head pose and eye gaze estimates with good accuracy. It has the ability to reinitialize after tracking is lost and it can also deal with partial occlusions of the facial features. The reported worst case gaze vector estimation accuracy is about 3° in [42] and 3.5° in [43], which is impressive. However, the accuracy measurement does not state that head movements were performed, which suggests that the reported results are for a fixed head pose scenario. In fact, the work of [43] states that the head pose tracking accuracy is only 10°, which implies that eye gaze tracking accuracy under head movements cannot be better. It is also questionable how accurately the eye corners can be localized using the template matching approach on different people. Despite these reservations, the proposed systems were the first real-time eye gaze tracking systems capable of estimating a more or less accurate gaze vector using only two ordinary RGB cameras.

Simon Baker et al. [44] proposed an eye gaze tracking system using a single RGB camera. They use an Active Appearance Model (AAM) with an underlying 3D model to estimate the head pose and facial feature locations [45]. Compared to previous local feature tracking based on template matching, this has the advantage that more information about the face is used, as the face is treated as a single object.

The general eye model that is used conforms to that of [42]. The gaze vector estimation relies on locating the eye corners, finding the eye center and projecting a vector from the eye center to the iris center. An important difference is the usage of the AAM both to estimate the head pose and to localize the facial features in 3D. Smaller modifications include a more sophisticated procedure of eye parameter estimation – the anatomical parameters are actually trained offline on a set of training samples. Also, the eye center is no longer constrained to be exactly in between the eye corners – something that is generally not true for different poses. The approach of [44] also describes a two-stage iris localization procedure. In the first stage template matching is used to estimate a coarse iris position. In the second stage an edge-based iris refinement is performed, where an ellipse is fit to the detected edges using least squares.

The AAM-based system reports very high gaze tracking accuracy of 3.2° under significant head rotations. Although it is not clear exactly how the accuracy evaluation was performed and how important the offline training was to estimate eye model parameters, no previous research claimed such good results – and all of this considering only a single RGB camera.

A different kind of approach was proposed by Zhu and Yang [46]. Fundamentally, they use a similar gaze model to the one shown in Figure 2.1 – the gaze direction is a vector starting in the eyeball center and passing through the iris center. They noticed that if the rotation angle of the eyeball is simplified as the projection of the movement of the iris center, very small errors are introduced for typical usage (0.17° for ±15° gaze directions), as illustrated in Figure 2.2. This makes it possible to use a simple calibration procedure where the user gazes at several known points to create a linear 2D mapping between the relative eye corner and iris center positions and the gaze angle. The proposed system also uses subpixel filters to localize the eye corners and the iris center. It is claimed to achieve an exceptionally high accuracy of 1.4° using an ordinary web camera and running in real time. The only limitation of this system is the requirement to keep the head motionless.
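
The calibration mapping itself reduces to an ordinary least-squares problem. A minimal sketch, assuming the iris center positions relative to a fixed eye corner and the corresponding known screen targets are available from a hypothetical 3x3 calibration grid:

```python
import numpy as np

def fit_linear_gaze_map(iris_offsets, screen_points):
    """Fit an affine 2D mapping  screen = [dx, dy, 1] @ A  by least squares.

    iris_offsets  : (N, 2) iris center relative to a fixed eye corner, in pixels
    screen_points : (N, 2) known calibration points on the screen
    """
    X = np.column_stack([iris_offsets, np.ones(len(iris_offsets))])
    A, *_ = np.linalg.lstsq(X, screen_points, rcond=None)
    return A                                     # shape (3, 2)

def apply_gaze_map(A, iris_offset):
    return np.append(iris_offset, 1.0) @ A

# Toy calibration with a hypothetical 3x3 grid of on-screen targets.
offsets = np.array([[dx, dy] for dy in (-3, 0, 3) for dx in (-5, 0, 5)], float)
targets = np.array([[x, y] for y in (100, 500, 900) for x in (160, 960, 1760)], float)
A = fit_linear_gaze_map(offsets + 0.1 * np.random.randn(9, 2), targets)
print(apply_gaze_map(A, np.array([2.0, -1.0])))
```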

Figure 2.2. Eyeball rotation treated as 2D projection in [46].

A little later, Hansen and Pece proposed a novel algorithm for iris detection and eye gaze estimation [47]. As far as gaze estimation is concerned, the key novelty was showing that tracking facial features other than the irises is not necessary. A 4-point calibration based scheme was proposed. The authors proved that, assuming a spherical eyeball shape and a fixed head pose, 4 calibration points are the minimum that allows the gaze vector to be estimated when only the iris centers are located. Unfortunately, the reported accuracy of only 4° and the requirement to keep the head completely still make the proposed system's usefulness quite limited.

The calibration-based approach of [47] was combined with facial feature tracking by Valenti et al. [48]. They proposed a system that performs eye gaze tracking by analyzing the relative position of the pupil and the eye corners, using a novel eye corner detector. The method is based on 9-point calibration and on interpolation of the locations from calibration during tracking. No head movements are considered and the method strongly relies on the initial calibration. A significant improvement on this concept was presented by the same authors in [11]. This time they assumed that the head pose determines a specific field of view, whereas the eye orientations can influence which part of this field of view is observed (Figure 2.3). This way, an initial 9-point calibration with a frontal face pose can be used in any other pose as well. A cylindrical head model and optical flow were used for head pose estimation following the algorithm in [49]. Furthermore, a hybrid framework was presented where the eye center detection can be used to refine the head pose estimation by so-called eye location cues, while the eye locations can in turn be refined by so-called head pose cues.

Figure 2.3. Gaze estimation in [11].

The work presented in [11] reports an accuracy between 2° and 5° depending on the usage scenario and seems to be the best performing purely webcam based eye gaze tracking system demonstrated so far. It also works under limited head movements. Most notably, a ready implementation can be found on the website of a company founded by Roberto Valenti [10]. The system has a number of drawbacks, however. Firstly, it depends on initial multiple point calibration, which is an inconvenience. Secondly, the calibration data is used for interpolation within the deduced field of view. Although a model is used to transform the field of view according to the head pose, no eye model is used for gaze vector calculation. This in itself limits the maximum achievable accuracy, as the point of regard does not result from a true geometric model.

An interesting proposal is that of Yamazoe et al. [50]. They proposed an approach that does not require any calibration from the user. Again, a simple eye model is assumed, where the visual axis is approximated by the optical axis. The proposed system works in two stages. In the first stage, a set of N images is used to estimate the parameters of the eye model – the eye radius, the iris radius and the eyeball center positions for both eyes. These parameters are estimated through a nonlinear optimization process where segmented eye regions are compared with projections of the model. For an accurate model, the observed eye regions should be similar to the projected model appearances. The image data used for eye model estimation does not require the user to look at any specific points, and so can be recorded at any time when the user is in front of the system. Once the eye model parameters are estimated, the eyes are tracked by following a set of facial features using the Lucas-Kanade method.

The work of Yamazoe et al. [50] is strongly related to the system presented in this dissertation. It works in real time and requires only a simple web camera. Unfortunately, the accuracy it reports is quite low – 5° horizontally and 7° vertically. Two things can clearly be improved in the proposed approach. The first is using a more complex eye model. The second is tracking the head pose more accurately – the authors only do it in 2D and do not evaluate its exact accuracy.

Here it should be noted that all methods mentioned so far assume a simplified eye model where the visual axis is approximated by the optical axis of the eye. A number of papers considering a more complicated eye model with separate optical and visual axes have also been published. Most of these papers assume IR illumination and corneal reflections with a calibration procedure [51, 52]. Several papers have even proposed to automatically measure the angles between the optical and visual axes of the eye without user-specific calibration [6, 53, 54]. All of them require corneal glints, however, and so cannot be applied directly to the passive eye gaze tracking scenario.

One work, though, proposes to use the more complex eye model with just one simple webcam [55]. It is an extension of the approach proposed by Matsumoto and Zelinsky [42]. It relies on tracking several rigid facial features and estimating the relative eye positions in each frame. Instead of using a stereo camera setup for facial feature tracking, a monocular approach is used [56]. The facial feature tracking algorithm also provides the head pose. The authors assume that the relative position of the eyeball center and the center point between the eye corners does not change for a given person. They propose to find this relation, given by a so-called displacement vector, during an initial 9-point calibration, when the user looks at predefined points while keeping the head still. The authors further assume that the distance between the eyeball center and the corneal center, as well as the distance between the eyeball center and the pupil center, are both fixed and known beforehand. As the visual axis is considered separate from the optical axis, the deviation angle γ, composed of the horizontal and vertical angles α and β, is also calculated together with the displacement vector during the calibration procedure. The used eye model is shown in Figure 2.4.
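
The role of the deviation angles can be sketched as follows: given the optical axis in head coordinates, the visual axis is obtained by rotating it by the horizontal angle α and the vertical angle β recovered during calibration. The rotation convention below is one simple choice, not necessarily the one used in [55].

```python
import numpy as np

def visual_axis_from_optical(optical_axis, alpha_deg, beta_deg):
    """Rotate the optical axis by the per-user deviation angles (alpha, beta)
    to obtain the visual axis. Angles are applied about the head's vertical
    and horizontal axes respectively (one possible convention)."""
    a, b = np.radians([alpha_deg, beta_deg])
    # Rotation about the y axis (horizontal deviation alpha).
    Ry = np.array([[np.cos(a), 0, np.sin(a)],
                   [0, 1, 0],
                   [-np.sin(a), 0, np.cos(a)]])
    # Rotation about the x axis (vertical deviation beta).
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(b), -np.sin(b)],
                   [0, np.sin(b), np.cos(b)]])
    v = Rx @ Ry @ np.asarray(optical_axis, float)
    return v / np.linalg.norm(v)

# The anatomical deviation is typically around 5 deg horizontally and 1-2 deg vertically.
optical = np.array([0.0, 0.0, -1.0])     # optical axis pointing out of the eye
print(visual_axis_from_optical(optical, alpha_deg=5.0, beta_deg=1.5))
```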

Figure 2.4. Eye model used in [55].

While very high accuracy is reported – around 2° without head movements and 2.5° with head movements – the experiments have only been performed on one person. It is doubtful whether the assumed fixed parameters would work equally well on other people. Furthermore, it is not stated clearly what kind of head movements were performed in the experiment (how large the translations and rotations were). This is a crucial factor affecting the eye gaze estimation accuracy and without more detailed information the real-world value of the proposed system remains unknown. Nonetheless, the proposed methodology seems sound and might indeed be one of the best systems proposed within the webcam eye gaze tracking community to date.

It should be noted that the previous work of the author [57] also falls into the category of eyeball modelling approaches. Although a simplified eye model is used, the proposed method reports state-of-the-art accuracy and does not require any calibration from the user. The only requirement is to look at a known point during a single-frame initialization procedure. The method uses this to estimate the centers of the eyeballs and from then on tracks these points together with the head pose. As this thesis will explain the algorithms from [57] in detail and propose further extensions, this work will not be discussed further in this chapter.

2.1.3 Methods using a depth camera

While eye gaze tracking using a depth camera also relies on either model-based calculations or appearance mapping, these approaches will be covered in a separate section as they are all relatively new.

The work of Li et al. [58] proposes a system with an HD webcam attached to a Microsoft Kinect device in order to obtain high quality RGB images along with the depth information. While the idea is somewhat novel, the proposed eye gaze tracking system based on basic composition of interpolated eye movements and head translations seems too simple to provide notable results.

Mansanet et al. [59] performed a comparative study of three non-linear regression algorithms, comparing their performance on images from an RGB and an RGBD camera. Three state-of-the-art algorithms were compared: k-Nearest Neighbor Regression (kNNR) [60], Support Vector Regression (SVR) [61] and Random Forest Regression (RFR) [62]. The compared cameras were the Logitech webcam HD C525 working in HD resolution and Microsoft Kinect v1. The data from the sensors was first reduced in dimensionality by PCA calculated on normalized pixel intensities of the eye image. The aim of the study was to compare visible spectrum camera data and depth data in terms of their usefulness in eye gaze tracking with statistical approaches.
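
The processing pipeline of such a study can be sketched in a few lines of scikit-learn. The snippet below only illustrates the PCA-plus-regressor structure (here with support vector regression); the data are random placeholders and the setup omits the controlled data collection described in [59].

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

# Placeholders for normalized eye-image intensities and on-screen gaze targets.
X = np.random.rand(200, 30 * 60)        # flattened eye patches (hypothetical size)
y = np.random.rand(200, 2)              # 2D gaze points

# PCA for dimensionality reduction, then one SVR per output coordinate.
model = make_pipeline(PCA(n_components=40),
                      MultiOutputRegressor(SVR(kernel='rbf', C=10.0)))
model.fit(X[:150], y[:150])
pred = model.predict(X[150:])
print(pred.shape)                       # (50, 2)
```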

The comparative study [59] reports that kNNR performed slightly worse than the other two methods – SVR and RFR. Also, the visible spectrum Logitech camera allowed better results to be achieved. However, all reported accuracies for the three algorithms and two sensor types are within a small range of 4°-5°. This raises questions about the test methodology. Throughout all experiments a chinrest was used to ensure head pose invariance. If the head pose did not change at all, it is difficult to imagine what benefit the depth camera could provide over the RGB camera, as the eyeball rotation does not change the eye regions' depth profiles. All in all, it is difficult to draw a definitive conclusion from this comparative evaluation.

Funes and Odobez [63] point out that all appearance based methods are unable to properly deal with the eye region variation due to head pose changes. They propose to use the Microsoft Kinect sensor to estimate the head pose and use it to map the eye regions to a frontal pose. The head pose estimation step works as follows. First, a person-specific 3D Morphable Model [64] is created in an offline stage using depth images in different poses. The landmarks are placed manually for each image, although this could be improved to an automatic process. In the online stage the head pose is estimated by registration of the personalized template and the Kinect depth data using the Iterative Closest Point (ICP) algorithm with point-to-plane constraints [65]. Once the head pose is known, the eye regions can be rendered in a frontal pose. In order to calculate the gaze vector from the normalized eye regions the authors implemented the approach of [23]. After the gaze vector is estimated for the normalized eye region, the head-relative gaze is transformed into world coordinates using the estimated head pose.

The presented method [63] is a novel way of combining the appearance based approach (gaze from the normalized eye region) and the model based approach (using the estimated head pose for gaze vector compensation). An evaluation was performed on 3 people, where the scene illumination was changed and the head pose was changed including extreme poses (>40° rotation). The reported average accuracy of gaze tracking is around 9.9°. This is very far from the accuracy reported as state of the art in the literature. However, the evaluation is performed in a much less constrained setting than most other research. This shows just how much difference there is between experiments in a strictly controlled environment and the real world. Apart from the relatively low accuracy (the paper only reports around 7° for a frontal head pose), a major drawback of the proposed approach is the complicated offline stage necessary to obtain a personalized face model, which needs to be performed for each participant individually.

In their later work Funes and Odobez extended the system described above to be largely independent of a person's specific appearance [66]. Instead of using a different trained appearance model for every person individually, the new approach makes it possible to estimate the gaze also for people who have not performed calibration. Based on the similarities of eye region appearances, a weighted interpolated model is formed from all existing models by an automatic algorithm for sparse target model reconstruction. This significantly improves the usability of the system, as new users no longer have to undergo a calibration process. Another improvement proposed in [66] is to use gaze coupling constraints between both eyes. By assuming that both eyes have the same gaze elevation and that both eyes observe the same point, the linear regression process can be optimized. The demonstrated results, however, show very similar accuracy to the one reported in [63] – around 7-14° depending on head movements. This is still insufficient to enable most real-world eye gaze tracking use cases.

A completely new approach was proposed by Jafari and Ziou [67]. In their system, similarly to the work of Funes and Odobez [63, 66], the Kinect camera was used to estimate the head pose, while a PTZ camera was used to capture the appearance of the eye region. The head pose measurement algorithm uses data from the Kinect device classified by random regression forests [68]. The gaze mapping function is learnt by variational Bayesian multinomial logistic regression using the head pose and iris center displacements as input. The authors claim that the proposed mapping technique outperforms other well-established discrimination methods. The proposed eye gaze tracking system allows large head movement without any need for personal calibration. However, the reported accuracy does not seem to be very high. The authors measured a 92% success rate in discriminating between tiles in a 3x3 grid on the screen.

A fully model-based system is described by Sun et al. [69]. This system uses a Kinect camera to capture the eye region and estimate the gaze vector. The gaze vector estimation process follows the work of [42], where the eye center is calculated relative to the eye corners. This time, however, the eye corners can be localized in 3D camera coordinates thanks to the readings from the depth camera. The proposed system works in several stages. First, the eye regions are located using visual context boosting [70]. Next, the iris and eye corners are localized using template fitting. The authors justify this by claiming that template fitting works best for the low resolution eye images that are obtainable from the Kinect RGB camera. In the final step, once the eye corners and irises are located, the gaze vector is calculated as a vector from the corneal center to the iris center. The used eye model consists of two spheres with different radii as the eyeball and the cornea, and also models the deviation of the visual axis from the optical axis. The necessary parameters – the eyeball radius, the eye center displacement relative to the eye corners and the visual axis displacement – are all calculated in a lengthy, person-specific calibration procedure during which each person gazes at 5 points on the screen under 5 different head poses. The authors describe this calibration procedure in detail, as well as a separate calibration of the camera with the screen.

The reported accuracy of this system is 1.4°-2.7° considering head movements. However, only head translations and head roll are included in these movements. In their experiments on 8 subjects Sun et al. noticed that out-of-plane rotations of the head – pitch and yaw – cause significant degradation of the eye gaze tracking accuracy. This is not surprising, as the proposed system lacks an accurate six degree of freedom head pose tracker. Thus, when the head rotates, it is impossible to accurately determine the unseen eyeball center. What is more, the sensor data quality from Microsoft Kinect is not very high, as the device was designed for a different purpose. To sum up, the presented system [69] is an efficient eye gaze tracking implementation relying solely on a geometric model. It offers relatively high accuracy, but severely limits the allowed head movements and requires lengthy calibration for each user.

Xiong et al. [71] propose another fully model-based approach, based on a spherical eyeball model that considers the deviation between the optical and visual axes. However, unlike in the system of Sun et al. [69], the head pose is fully tracked in 6 degrees of freedom. The head pose is tracked based on rigid facial landmarks detected using a supervised descent method [72]. The depth information for the landmarks is obtained from Kinect and the head pose is estimated using a personalized 3D face model created offline. Alternatively, the authors implemented head pose estimation without depth data available – in this case the POSIT algorithm [73] is used. The iris is found using a modified version of the Starburst algorithm [74]. It is generally assumed that the eyeball center is fixed in the head pose coordinate system. In order to find the initial eyeball center position and the deviation angle between the optical and visual axes, a 9-point calibration is performed for each user.

The reported accuracy is 4.4° assuming that Microsoft Kinect is used to obtain depth estimates. While this is lower than the accuracy stated in [69], it is a good result considering that free head movement is allowed. A further contribution of [71] is estimating a lower bound on the accuracy of the used model-based approach using simulated data. The determined best possible accuracy – assuming a perfect user calibration – is 2° if the iris and all landmark localizations have an accuracy of half a pixel. This shows just how challenging eye gaze estimation is with commodity hardware.

Jianfeng and Shigang [75] propose a similar approach to that of Xiong et al. [71], with a similar eye model and 9-point calibration procedure. The differences lie in the head pose and iris localization algorithms. Jianfeng and Shigang propose to estimate the head pose using the Kinect head pose tracking module directly. They also use a different technique for iris localization based on gradient direction intersections [76], which seems to be very robust but not necessarily of very high precision. Not surprisingly, the accuracy reported in [75] is relatively low, at only around 5 degrees. This is most certainly caused by the lower accuracy of the component algorithms, as the general logic of the proposed system is sound.

The system proposed in this work has been influenced by several prior-art ideas. Apart from the author’s own work [57, 77] which is a base for this thesis, the most similar systems to the one described here are [11], [50], [71] and [75], although the last two approaches were developed at a similar time as the one described in this dissertation. The proposed eye gaze tracking algorithm will be described in detail in Chapter 5.

2.2 Head pose estimation

Head pose estimation is in itself a very actively studied topic in computer vision. As estimating the head pose is a slightly easier task than estimating the eye gaze direction, it is closer to deployment in real-world scenarios, and thus even more researched. A number of surveys exist on the work done in this field [78, 79]. A common consensus is that head pose estimation techniques can be divided into five groups:

1. Appearance-based methods
2. Flexible model methods
3. Geometric feature-based methods
4. Methods based on tracking
5. Hybrid methods

As this dissertation is focused on model-based eye gaze tracking, high accuracy of head pose estimation is absolutely crucial. Therefore, this section will focus primarily on those methods that are capable of achieving very high accuracy, and will mention other methods only briefly.

2.2.1 Appearance-based methods

The simplest statistical approach uses appearance templates and assigns the pose of the most similar template to the query. Approaches using normalized cross correlation [80], mean square error over a sliding window [81] or more complex feature-extracting functions [82, 83] have been proposed to estimate the image similarity. The general problems with these methods are low accuracy that depends on the training database, complexity proportional to the size of the training database and an unclear relationship between feature space and pose space similarity. Detector arrays are similar to appearance templates, with the advantage that face detection and localization are inherently accounted for in the process [84]. However, they also suffer from low accuracy and cumbersome training.

A more popular approach is nonlinear regression. Such approaches aim to find a mapping from the image space to the head pose and train a model from labelled training data. The most popular methods use support vector regressors [85, 86] or neural networks [87, 88]. Recently, random forests have been proposed to estimate the head pose using low-quality depth data from Microsoft Kinect [68] with a reported accuracy of around 8°. Unlike templates and detector arrays, nonlinear regression approaches can be fairly accurate and fast. The main disadvantages are difficult training and high sensitivity to the initial head localization.
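
As an illustration of the nonlinear regression idea, the sketch below trains a random forest to map flattened depth patches of the face region to three head rotation angles. The data are random placeholders and the setup is far simpler than the patch-based voting scheme of [68].

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder training data: flattened depth patches of the face region
# and the corresponding head rotation angles (yaw, pitch, roll) in degrees.
depth_patches = np.random.rand(500, 32 * 32)
angles = np.random.uniform(-40, 40, size=(500, 3))

forest = RandomForestRegressor(n_estimators=100, max_depth=12, n_jobs=-1)
forest.fit(depth_patches, angles)                 # multi-output regression
print(forest.predict(depth_patches[:1]))          # [[yaw, pitch, roll]]
```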

Another group of appearance based techniques for head pose estimation are manifold embedding methods. The aim of these methods is to reduce the dimensionality of the input face image so that it can be placed on a low-dimensional manifold constrained only by allowable head poses. The placement position then defines the head pose. Several techniques have been tried for reducing dimensionality: PCA and LDA [89], Support Vector Machines [90], Isometric Feature Mapping [91], Locally Linear Embedding [92] or Laplacian Eigenmaps [93], to name a few. While demonstrating good robustness, techniques using manifold embedding tend to capture appearance variations in addition to pose variations.

As stated in [79], appearance based head pose estimation methods may provide good robustness, but rely heavily on training data and have limited accuracy. Even recent appearance based eye gaze tracking methods [26, 27] use other approaches for head pose estimation. Therefore, the author concludes that appearance based head pose estimation is not the best direction when high accuracy is the paramount goal.

2.2.2 Geometric feature based methods

Geometric feature-based methods derive the head pose from the geometric configuration of several facial salient points (facial features). One of the first approaches, which was also proposed for coarse eye gaze tracking [40], relies on tracking the outer eye and mouth corners to infer a facial plane. Later work also considered the inner eye corners [94] and other facial characteristic points tracked as triplets [41]. All these approaches assumed a rigid 3D facial model that is fitted in every image using the 2D locations of detected features, thus providing a pose estimate. A number of algorithms can be used to achieve this: POSIT [73], EPnP [95] or various iterative approaches [96]. A different line of research proposed to track the facial landmarks using a stereo camera setup [42, 43]. In this case the 3D location of each landmark could be measured by the sensor and the head pose could be calculated directly. A more recent method proposes improved facial landmark detection using Gabor wavelets and multi-state local shape models [56]. This is used in eye gaze tracking systems that report high overall accuracy, better than 3° [55].
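
The rigid-model fitting underlying these geometric methods is readily available in OpenCV. A minimal sketch, assuming a generic 3D landmark model (the coordinates below are rough placeholder values in millimetres), 2D landmark detections from some external detector, and an assumed pinhole camera:

```python
import numpy as np
import cv2

# Rough generic 3D model of six facial landmarks (mm, placeholder values):
# outer eye corners, inner eye corners, mouth corners.
model_3d = np.array([
    [-45.0,  35.0, -25.0], [45.0,  35.0, -25.0],   # outer eye corners
    [-20.0,  35.0, -20.0], [20.0,  35.0, -20.0],   # inner eye corners
    [-25.0, -30.0, -20.0], [25.0, -30.0, -20.0],   # mouth corners
], dtype=np.float64)

# 2D detections of the same landmarks in the image (placeholder values, pixels).
image_2d = np.array([
    [260.0, 210.0], [380.0, 205.0],
    [295.0, 212.0], [345.0, 208.0],
    [290.0, 320.0], [355.0, 318.0],
], dtype=np.float64)

# Pinhole camera intrinsics for a 640x480 image with an assumed focal length.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0,   0.0,   1.0]])

ok, rvec, tvec = cv2.solvePnP(model_3d, image_2d, K, None,
                              flags=cv2.SOLVEPNP_ITERATIVE)
R, _ = cv2.Rodrigues(rvec)          # 3x3 head rotation matrix
print(ok, tvec.ravel())
```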

A recent paper describes the implementation of head pose estimation from six facial salient points on modern hardware using a cascade of detectors for feature localization [97]. This implementation tracks the eye corners, mouth corners and nose tip. An evaluation performed on the Boston University Dataset [98] reports quite impressive accuracy for such a simple method: 3° to 5°.

Another recent method proposes to use the parallelism of lines projected through the eye corners and lip corners [99]. A so-called vanishing point can be calculated as the intersection of the eye line and the lip line, which leads to establishing the yaw of the head. Moreover, the ratio of line lengths can be used to infer the pitch of the head. Gaussian mixture models and an EM algorithm are proposed to account for variations among different people. While [99] proposes an interesting and novel algorithm, its accuracy is low, especially for a completely frontal head, as in this case the lines are nearly parallel. It therefore has little value in the eye gaze tracking scenario.

A number of methods have also been proposed for head pose estimation from depth measurements. One of the first studies in this field [100] proposes to fit a 3D line to the nose ridge and use this for pose estimation together with the location of the detected face. Despite utilizing only depth information, an average accuracy better than 2.5° is reported. It is, however, slightly questionable whether such accuracy can be achieved in real-world scenarios under unconstrained head movements.

Recently, a different method to estimate the head pose from range data was presented [101]. The method does not rely on any set of facial features, but instead uses all available information (pixels) to perform a global particle swarm optimization. The reported accuracy of around 3° is impressive considering that only depth data (and no RGB) is used.

Apart from the last mentioned algorithm, geometric face pose estimation approaches rely on facial salient points and their accuracy depends directly on the accuracy of localizing such points. Even today few feature detectors provide subpixel accuracy, especially when the image quality is low. A feature detection error of 1-2 pixels leads to relatively large head pose estimation errors – especially in terms of rotation. A further limitation of most geometric methods is the need to have all the processed facial features well visible if reasonable accuracy is to be obtained.

2.2.3 Flexible model methods

Flexible models can be seen as a further development of geometric feature-based head pose estimation. The problem with feature-based methods is that the feature detection is a local process and does not use the available global information about the facial appearance. Flexible model methods add a model to account for this and provide better robustness. They usually aim to fit a non-rigid model to certain facial features so as to uncover the facial structure in each case. There are many types of flexible model approaches and their popularity has changed over time. An early approach proposed to use Elastic Bunch Graphs [102]. In this approach deformable graphs of local feature points such as the eyes, nose or lips are first matched to labelled training data to establish a reference pose set. In the second stage the same graph is matched to a query image and the relative feature locations provide means to select a best fit from the training data. As such graph matching relies on facial features, it bears a stronger link to pose similarity than appearance based approaches. However, this approach only allows differentiating between a discrete set of poses and is problematic with large amounts of training data.

A breakthrough in the field of flexible models was made by Cootes et al. [103]. They proposed the Active Shape Model (ASM), which is a specific shape fitting technique. In an offline training phase all variability in a set of training shapes is captured to create an eigenspace of shapes representing the given class of objects. In the online phase an iterative refinement is performed, where the fitted shape is alternately deformed to better match the detected contours and remodeled to allow only deformations that appear in training images. This effectively combines greedy local fitting with the constraints of a model. This approach works very well on objects with distinctive contours. In the case of faces, the internal appearance – texture – is equally important. That is why such shape models combined with appearance, known as Active Appearance Models (AAMs), give better results for faces [104]. The fitted shape or appearance model is represented by a set of points, so once the fitting process is complete the head pose can be inferred from the relative positions of these points. This can be done in a simple way by using linear regression [105], but better accuracy is obtained when an inherent 3D model is used to constrain the 2D point fitting process – this is known as combined 2D and 3D AAMs [45]. This approach was used with good results in one previously mentioned eye gaze tracking system intended for car drivers [44]. It is important to note that both ASMs and AAMs require an initial coarse location of the face in the image. This is not a difficulty, though, as many fast methods have been developed to solve this problem [106].

Some researchers have taken this group of methods even further and proposed to use observed model point locations from many images to perform structure from motion (SFM) [107] and infer 3D point locations directly [108]. Performing such structure from motion allows dealing with non-rigid motion and does not require any 3D models beforehand. However, the robustness and exact accuracy that this method can yield has not been thoroughly evaluated and remains somewhat doubtful. Alternatively, the depth can be measured by a depth sensor directly and used to refine the AAM [109]. The authors report an accuracy of about 3-6 mm in real-world scenarios when using Microsoft Kinect, which is a decent result. It is worth noting that this algorithm is used in the Microsoft Kinect SDK.

An extension of Active Appearance Models was proposed as Constrained Local Models (CLMs) [110]. CLMs combine global shape constraints and local appearance similarly to AAMs. The main difference compared to AAMs is using the trained appearance model to generate feature templates instead of trying to approximate the image pixels directly. The model is fitted to the face in two stages. In the first stage, a local and exhaustive search for each feature is performed using trained feature detectors. In the second stage the model parameters are jointly optimized over all the detector responses. While AAMs are a generative approach, CLMs are discriminative and therefore generalize much better to previously unseen data. Recent research in the area confirms the superiority of CLMs in terms of accuracy and robustness over other approaches [111, 112] and even attempts to accommodate depth camera measurements to achieve an average head pose estimation error smaller than 3° [113].

Another recent work proposes to estimate the head pose using a combination of tree models and a shared pool of parts [114]. The method combines the tasks of face detection and landmark localization by localizing various areas of the face separately using trained part models and later choosing the best global tree model for the estimated part configuration. The authors claim that their approach works better than AAMs and CLMs and demonstrate this on a new face database containing real-world scenes with extreme head poses – Annotated Faces in the Wild. While the proposed method does seem to work more robustly, especially in terms of face detection and landmark localization, its accuracy is low – an error is assumed only when the manually annotated pose differs by more than 15°. As a result, it has little value for the system presented in this thesis.

A new and very promising line of research in face alignment is cascaded shape regression. The main principle of methods in this category is to estimate the face shape by iterative refinement of an initial shape using learned regression functions. In each iteration a set of features is extracted at the locations of landmarks in the current shape estimate and a learned projection function is used to refine the landmark location estimates (and thus the fitted shape). Such an explicit shape regression technique circumvents an important problem of parametric model based approaches (AAM, CLM), which is caused by the fact that minimizing the parameter error is suboptimal – it is not necessarily the same as minimizing the alignment error. Various types of features have been used to regress the shape increment, among others SIFT features [72], boosted ferns [115] and local binary features [116]. The last publication claims to achieve 3000 fps on a modern computer while demonstrating very high accuracy even in the case of extreme head poses and varying facial expressions. This is much faster than typical AAM and CLM implementations. Currently, approaches using regression focus on the landmark alignment task and are able to achieve average accuracies of around 3-4 pixels on challenging datasets. It has not been exhaustively measured how this relates to head pose estimation, but it can be expected that these approaches might provide better head pose estimation accuracy than other flexible models in the near future.
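
The core iteration of cascaded shape regression can be written down in a few lines. The sketch below is purely schematic: the shape-indexed feature extractor and the per-stage regression matrices are hypothetical placeholders that in a real system (e.g. [72, 116]) would be learned from annotated training data.

```python
import numpy as np

def extract_features(image, shape):
    """Placeholder shape-indexed features: pixel intensities sampled at the
    current landmark estimates (real systems use SIFT, ferns or binary features)."""
    h, w = image.shape
    xs = np.clip(shape[:, 0].astype(int), 0, w - 1)
    ys = np.clip(shape[:, 1].astype(int), 0, h - 1)
    return image[ys, xs]

def cascaded_alignment(image, init_shape, regressors):
    """Refine an initial shape S by S <- S + R_t(features(I, S)) for each stage t."""
    shape = init_shape.copy()
    for R in regressors:                       # R maps features to a shape increment
        feats = extract_features(image, shape)
        shape = shape + (R @ feats).reshape(-1, 2)
    return shape

# Toy usage: 68 landmarks, 3 stages with random (untrained) regression matrices.
n_landmarks, n_stages = 68, 3
image = np.random.rand(480, 640)
init = np.column_stack([np.full(n_landmarks, 320.0), np.full(n_landmarks, 240.0)])
regressors = [0.01 * np.random.randn(2 * n_landmarks, n_landmarks) for _ in range(n_stages)]
print(cascaded_alignment(image, init, regressors).shape)   # (68, 2)
```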

In general, the approaches using flexible models are becoming more and more robust and already demonstrate impressive accuracy while running in real time. They are a promising way of estimating the head pose and should be considered in the design of a real-time eye gaze tracking system.

2.2.4 Methods based on tracking

Tracking methods have one significant advantage over all other head pose estimation methods: they accumulate relative estimations between consecutive images. This means that the pose changes are usually small and can thus be calculated very accurately [79]. However, tracking approaches also have two important drawbacks. The first is the need for initialization after start-up and after tracking gets lost. Usually an independent face detection algorithm is used to perform this. The second shortcoming is drift accumulation after longer tracking – drift reduction techniques are necessary in real-world scenarios.

Simple low-level tracking of 2D features can be performed to track any of the facial features described in Section 2.2.2. This is not very popular, however, as independent detection of these features in every video frame provides better accuracy. The benefit of the tracking approach is clear when an object model is used. Apart from providing better robustness than tracking every feature independently, a model makes it possible to determine motion in 3D even if only 2D image data is available. The 2D translations of each tracked model point can be unprojected from the image coordinates into the 3D scene coordinates and account for true 3D motion understood as six degrees of freedom: three translations and three rotations [117].

Various approaches to model-based head pose tracking have been proposed. The simplest planar models of the face [118] did not perform well in recovering out-of-plane rotations. More complicated cylindrical head models were used by Cascia et al. [98] and Xiao et al. [49], with the first of these tracking the head pose by texture mapping and the second performing model-based optical flow tracking on a mesh of evenly distributed points. The second algorithm works particularly well [11, 79]. In this algorithm each mesh point is tracked using the Lucas-Kanade method [119] and its translation contributes to the transformation of the entire model. An Iteratively Reweighted Least Squares (IRLS) algorithm is used to determine the final transformation of the model between two frames. This allows the impact of badly tracked or occluded points to be minimized. An initialization and reinitialization procedure has also been suggested in the form of tracking to a facial template registered at start-up. An evaluation on the BU dataset [98] determined the average accuracy of head roll, pitch, and yaw to be 1.4°, 3.2°, and 3.8° respectively. This is a very good result considering the head movements and illumination changes that were performed. Later, extensions of this approach to non-calibrated cameras [120] and non-rigid face deformations [121] were proposed. On the other hand, using the Kalman filter for head pose prediction has been shown to significantly reduce the computational complexity of this approach [122]. One notable eye gaze tracking system that relies on this kind of head pose tracking is [11].

More complicated, personalized models have also been proposed for head pose tracking [123]. A comparative study [124], which analyzed how much the tracking accuracy depends on the type of model, points out that a more accurate model does achieve better tracking accuracy, but only when it is well aligned with the tracked subject. Once the alignment degrades (for example due to drift error), the tracking accuracy degrades dramatically, much more than in the case of simple face models. The conclusion is that if a highly detailed model is used, care should be taken to ensure that the model is well aligned to the tracked face at all times.

Apart from purely luminance-based tracking of face models mentioned so far, more complicated approaches have also been suggested. Jang and Kanade proposed to use feature points unevenly distributed on the face to obtain correspondences [125, 126].


Instead of assuming that each mesh point can be matched to a corresponding mesh point between two images, the authors propose to find and match SIFT feature points [127] to obtain correspondences. Additionally, the head pose estimation system in [125] tracks the face both to the previous frame and to a set of stored keyframes, and merges the results using a Kalman filter. The 3D model transformation is computed using an IRLS approach, similarly as in [49]. While the measured head pose estimation accuracy is not better than in [49], the robustness and error recovery abilities of the system are remarkable.

While frame-to-frame tracking typically performs very well, the gradual error accumulation as well as the inability to recover from large tracking errors require a reinitialization mechanism. This has been proposed as tracking to a single template [49] or to a set of stored templates [125], periodically or even constantly. More advanced reinitialization techniques will be described in Section 2.2.5.

A slightly different tracking approach is used when dealing with depth data. The typical algorithm for point cloud alignment known as Iterative Closest Point (ICP) [65] can be used. This algorithm aims to iteratively transform two point clouds so that they fit each other better. This was demonstrated for instance in [63, 128]. In practice, this kind of alignment works well when the two point clouds are identical. Unfortunately, when a rotated face is observed, the two depth maps seen by the camera contain different parts of the face, and so the two resulting point clouds differ. This poses difficulties for ICP alignment, and in practice depth data alone yields worse head pose estimation accuracy than RGB data alone. One good direction might be to use RGB and depth data together in a holistic approach [129].

2.2.5 Hybrid methods

Various head pose tracking algorithms provide good accuracy in different conditions: tracking approaches are most accurate in the short term once the current pose is known, while flexible model and geometric feature-based approaches provide good accuracy regardless of the sequence length and prior information. Combining various algorithms to create an optimal head pose tracker is therefore a natural line of thought.

Liao et al. [130] proposed to combine intensity based tracking [49] and a modified feature based tracking [123] with SIFT points in a non-linear weighting scheme. They argue that in wide-baseline scenarios features outperform intensity-based tracking, while the intensity tracker can be used for refining the feature-based estimation or for obtaining high accuracy in favorable tracking conditions. The proposed tracking system can be initialized in a frontal pose only, but an Active Shape Model (ASM) [131] is used to better fit the initial model to the face. While a comparison presented in [79] does not show improvement over the original approach of [49] as far as peak accuracy is concerned, the aim of the proposed system is to maximize both accuracy and robustness regardless of the tracking conditions. The benefit of this is visible mostly in sequences with illumination changes, occlusions and extreme poses.

Morency et al. [132] proposed a user-independent and initial pose independent hybrid tracker called Generalized Adaptive View-based Appearance Model (GAVAM). The system combines three sources of head pose estimation:

- a static user-independent head pose estimator using an AAM
- a frame-to-frame tracker that tracks the pose change compared to a previous frame based on the Normal Flow Constraint [133], which is similar to the method of [49]
- a keyframe based tracker that tracks the relative pose of the current frame to the best stored keyframe using the same method as the frame-to-frame tracker; the keyframes are updated online during tracking

All of the used tracking estimates have an associated uncertainty and are combined using the Kalman filter. An evaluation of this approach on the BU dataset [79] does not show better accuracy than the original cylinder tracking method [49]. However, this new approach does not require the user to have a frontal pose at the beginning and it is more robust in terms of error recovery and extreme pose estimation.

The work of Jang and Kanade mentioned in the previous section in fact also describes a hybrid tracking system. The initial estimate of the face pose is obtained by aligning a Bayesian Tangent Shape Model (BTSM) [134]. Subsequent tracking relies on merging frame-to-frame tracking and keyframe-based tracking with a Kalman filter. However, both differential tracking and keyframe tracking use SIFT feature points. This does not improve the peak accuracy of previous approaches [132], but ensures even better robustness.

From an extensive analysis of previous work on head pose estimation, the author concludes that flexible model approaches and hybrid tracking approaches are the most promising for creating a high-accuracy eye gaze tracking system. The implemented methods and the final chosen configuration are described in detail in the next chapters.

2.3 Iris localization

Locating the point through which visual rays enter the eyeball is a crucial task in eye gaze tracking. As was described in Section 1.2, this point is the pupil center. Unfortunately, in images from visible light cameras the eye pupil is often completely indiscernible. On the other hand, the iris is usually clearly visible. Because the iris and the pupil share a common center, locating the iris center is equivalent to locating the pupil center.

Accurate iris localization is a well-known problem, as it stems from the field of biometric iris recognition. In the case of remote eye gaze tracking systems the images tend to have a much lower resolution than those used in biometric recognition. However, many algorithms can be used in both cases. Early approaches to eye gaze tracking proposed to use voting techniques relying on the iris shape. Kim and Ramakrishna [135] proposed to analyze the intensity profile of the iris neighborhood to determine the iris center. They proposed a Longest Line Scanning algorithm and a Circular Edge Matching algorithm, both of which assigned scores to candidate points based on analysis of the surrounding areas' intensity profiles. Perez et al. [136] proposed to analyze diagonal lines crossing the iris boundaries and select the center as an equidistant point. Both these approaches rely on thresholds and characteristic intensity profiles. Therefore, they are applicable only to constrained settings and might fail in uncharacteristic cases.

A more advanced voting technique is the circular Hough transform [137, 42, 138]. The Hough transform is a voting technique used to detect simple shapes described by several parameters. The voting takes place in a parameter space, from which object candidates are obtained as local maxima. In the case of circle detection, if the radius is assumed to be known, the circle edges vote for the circle center. This is depicted in Figure 2.5. The circle center has two parameters: the x and y coordinates. Therefore, the voting space is two dimensional and can be illustrated as a rectangular grid. Each detected circle point – marked black – can be part of any circle which has a center placed on the surrounding blue circle. Each cell covered by a blue circle gains a vote. This is repeated for every blue circle. In the end, the cell with the maximum value is found in the whole grid (marked red). This represents the model which satisfies the largest amount of observed data.


Figure 2.5. Circular Hough transform voting.

The Hough transform can be applied after the iris edges are detected by an edge detector such as the Sobel filter. If irises are treated as circles, the described algorithm can be applied directly to find one or, if necessary, several candidate center locations. If an elliptical iris shape is assumed, the voting space has to be of a higher dimension – apart from the ellipse center, the data will vote for the ellipse axes lengths. It is important to note that the Hough transform becomes computationally inefficient if the number of voting dimensions is too large. In early eye gaze tracking studies the Hough transform performed well for iris detection, and it is still used today either as the main iris localization step or as a supporting step in a more sophisticated algorithm [139].
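As a concrete illustration of the voting described above, the following NumPy sketch accumulates votes for a single known radius. It is purely illustrative – the function name, the choice of 90 angular samples and the edge-map input are assumptions made here, and a practical system might instead rely on an optimized routine such as OpenCV's HoughCircles.

```python
import numpy as np

def hough_circle_center(edge_map, radius, n_angles=90):
    """Vote for circle centers in a 2D accumulator, assuming a known radius.
    edge_map: binary image of detected iris edge points (e.g. from a Sobel/Canny step)."""
    h, w = edge_map.shape
    accumulator = np.zeros((h, w), dtype=np.int32)
    thetas = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    ys, xs = np.nonzero(edge_map)
    for y, x in zip(ys, xs):
        # every center consistent with this edge point lies on a circle around it
        cx = np.rint(x - radius * np.cos(thetas)).astype(int)
        cy = np.rint(y - radius * np.sin(thetas)).astype(int)
        inside = (cx >= 0) & (cx < w) & (cy >= 0) & (cy < h)
        np.add.at(accumulator, (cy[inside], cx[inside]), 1)
    # the accumulator cell with the most votes is the most likely iris center
    cy_best, cx_best = np.unravel_index(np.argmax(accumulator), accumulator.shape)
    return cx_best, cy_best
```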

A different voting based technique using isophote curvature was proposed by Valenti and Gevers [140]. Isophotes are curves in an image that connect points of equal intensity. Invariance to rotation and to linear lighting changes makes them useful image features. By analyzing the curvature of an isophote it is possible to determine the so-called curvedness – the reciprocal of the radius. Thus, all points of an isophote can vote for the center of the curve in an accumulator space, similarly as in the case of the Hough transform. The exact point which is voted for is determined by a displacement vector derived from the curve point's gradients. The strength of the gradient near the eye limbus can also be used to weight the votes, providing better accuracy.

If the search region is sufficiently constrained, this voting procedure can be successfully used to locate the iris center. In practice, the image needs to be upscaled using a smoothing kernel for the isophote curves to provide useful information. Also, the accuracy of voting techniques is inherently limited by the voting space resolution. The authors of [140] report an accuracy of 84.1% on the BioID database [141] using the normalized error metric [142] with a threshold of 0.05. This means that for 84.1% of the tested images the iris localization error was smaller than 5% of the distance between the eyes. While this is better than what many other methods report, it is not a very high accuracy in principle. If the distance between the eyes in the observed image is 100 pixels, for instance, then nearly 16% of test cases will produce an error larger than 5 pixels.

Hansen and Pece proposed a novel algorithm for iris detection and eye gaze estimation [47]. The main focus of the paper was two-stage iris localization. The authors proposed to use a particle filter to track the coarse position of the iris and a multiscale expectation-maximization (EM) contour algorithm to obtain a precise iris location. The described experiments show that this approach is highly robust in real-world scenarios. A crucial benefit of the particle filter is that it allows multiple hypotheses to be maintained. Thus, in situations where various factors temporarily degrade the eye image quality, the tracking may continue and recover when better input is available. While the localization accuracy may not be state of the art in typical scenarios, the aim was to provide strong robustness to occlusions, deformations and extreme head poses. Hence the title of the work – Eye tracking in the wild. The images shown in the published paper confirm that this was achieved.

Zhu and Jang [46] proposed a different, two-stage approach for iris localization. In the first step they find iris boundary pixels using subpixel Sobel filtering with 1-dimensional cubic interpolation. This provides a set of subpixel iris boundary points. In the next step, instead of using the Hough transform to fit these points, a least squares ellipse fitting step is performed [143]. Least squares ellipse fitting was also used in [44]. The approach has the following advantages over the Hough transform:

• An ellipse is often a much better approximation of the iris than a circle
• No assumption about the size of the iris is required
• Accurate results can be obtained with very few points (as few as six)

While no direct evaluation of the iris localization accuracy is performed, the combined eye gaze tracking system is reported to have an average error of 1.4°. Considering the camera resolution and distance constraints, an error of 1 pixel in the iris center location would cause a 3° error in gaze estimation. This indicates that the iris localization in the presented system was indeed of subpixel accuracy.


Li et al. [74] introduced a more sophisticated ellipse fitting algorithm for iris localization called Starburst. While primarily designed to work with high resolution infrared images of the pupil, it was later demonstrated to also work well when localizing the iris in visible spectrum images [144]. Starting with a coarse eye center from the previous frame, the algorithm follows several steps:

1. Calculate potential limbus feature points by projecting rays from the coarse center point and analyzing the gradient changes
2. Filter out points that are too far from or too close to the coarse eye center based on the average distance
3. Fit an ellipse to a subset of the remaining points using RANSAC
4. Optimize the ellipse shape using a Nelder-Mead simplex search, with the aim to maximize the underlying gradients

The Starburst approach is similar to active shape models [145], with the provision that it allows several features to be used along each normal. The main disadvantages are the inability to account for eye features such as the eyelids, as well as relatively high computational complexity.

While voting techniques can be used for both coarse and precise iris localization depending on the parameters, John Daugman [146] proposed an algorithm perfectly suited for high accuracy iris localization, primarily designed for high resolution images and the field of biometrics. He defined a circular integro-differential operator based on the fact that the summation of intensity differences along the circular iris boundary should be maximal. The proposed operator finds the optimum solution in terms of the available gradient information, with computational complexity being the main drawback. The iris can also be defined as an ellipse and the approach will still work. Since its formulation, Daugman's algorithm has become a standard in iris segmentation, gradually gaining significance also in the low resolution scenario. It is still widely used today.
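A coarse discrete sketch of the idea behind the operator is given below: over a grid of candidate centers and radii, it looks for the circle whose boundary shows the largest radial drop in mean intensity. This is a simplification, not Daugman's original continuous formulation (which smooths the radial derivative with a Gaussian); the sampling density and box smoothing are arbitrary choices made here.

```python
import numpy as np

def integro_differential(gray, candidate_centers, radii, n_samples=64):
    """Coarse discrete variant of the integro-differential idea (illustrative only).
    Returns the (center, radius) maximizing the radial derivative of the mean
    intensity along the circular boundary."""
    thetas = np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False)
    cos_t, sin_t = np.cos(thetas), np.sin(thetas)
    h, w = gray.shape
    best_center, best_radius, best_score = None, None, -np.inf
    for cx, cy in candidate_centers:
        # mean intensity along circles of increasing radius
        profile = []
        for r in radii:
            xs = np.clip(np.rint(cx + r * cos_t).astype(int), 0, w - 1)
            ys = np.clip(np.rint(cy + r * sin_t).astype(int), 0, h - 1)
            profile.append(gray[ys, xs].mean())
        # smoothed derivative of the circular mean with respect to the radius
        smoothed = np.convolve(profile, np.ones(3) / 3.0, mode="same")
        drop = np.abs(np.diff(smoothed))
        idx = int(np.argmax(drop))
        if drop[idx] > best_score:
            best_center, best_radius, best_score = (cx, cy), radii[idx], drop[idx]
    return best_center, best_radius
```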

Zhang et al. [147] proposed to use Daugman's algorithm as a refinement step, but formulated a novel coarse iris localization method based on the radial symmetry transform. They proposed to utilize a fast version of the transform [148] to make use of the fact that the iris usually appears as a dark circular blob. Within a neighborhood, corresponding gradients at pairs of points symmetrically arranged around the central pixel are used as evidence of radial symmetry and contribute to the symmetry measure of the central point. Previous methods of coarse iris localization analyzing gradients in the eye region failed to utilize the symmetry characteristic of the iris. Zhang et al. demonstrated that their method was indeed more accurate and faster than many of the existing coarse iris localization approaches, especially in the case of low quality images. This is in line with the tendency for algorithms designed for high resolution, high quality eye images from dedicated eye gaze tracking systems to be gradually adapted to webcam eye gaze tracking, where the eye image quality is much lower. A major drawback of the proposed approach, however, is the need for well selected thresholds and region constraints in order to find the true iris position and not other features with radial characteristics.

In later work, Zhou et al. [149] proposed an improved two-stage iris localization algorithm using radial symmetry and the circular integro-differential operator. Compared to [147], the new work proposes various improvements to minimize the negative impact of eyelids, eyelashes and reflections. The coarse stage only considers central pixels that are sufficiently dark, using the knowledge that the pupil is darker than all other parts of the eye. The refinement stage is modified in two ways. First, only pixels with a sufficiently large gradient contribute to the gradient sum calculated along the iris edge; eyelashes and other unwanted features generally have smaller gradients than the boundary between the iris and the sclera. Second, domain knowledge is used to restrict the search circle to two arcs, constituting the left and right parts of the iris. Thus, eyelids and eyelashes are largely eliminated when searching for the maximum gradient sum.

Owing to the two-stage approach, the method of Zhou et al. [149] demonstrates both good execution speed and high precision. While the authors evaluated their approach on high resolution images only, the method appears capable of performing well in other conditions too, and demonstrates state-of-the-art accuracy.


3 Head pose estimation

The system proposed in this work relies on two core component algorithms: head pose estimation and iris localization. These are just as important as the system design itself, as the accuracy and performance of the whole system depend directly on the accuracy of the component algorithms. It will be explained in detail in Chapter 5 how these algorithms are used. For a better understanding of this explanation, the component algorithms are presented first. This chapter describes head pose estimation, while the next chapter describes iris localization.

A number of head pose estimation algorithms are possible. As described in Chapter 2, the most accurate head pose estimation algorithms are based on tracking. This is the reason why the proposed eye gaze tracking system should use a head pose estimation method from this group.

Head pose tracking can be performed in a number of ways, but the core concept of any tracking algorithm is that it is performed on consecutive images from a video sequence. At the beginning, the head pose is unknown, so some form of initialization is necessary. Two important aspects of head pose estimation algorithms will therefore be described: initial pose estimation and pose tracking. Each of them will be described in detail, with an indication of what has been re-used from original algorithms and what has been changed or added by the author as a result of research. A separate section is dedicated to the proposed hybrid head pose estimation algorithm.

The description will follow the author's previous work [57] to a certain extent. However, more details are provided, as well as new aspects previously unaccounted for (for example, the use of a depth camera).

3.1 Algorithm initialization

The head pose estimation algorithm is initialized by aligning a 3D face model in camera coordinates based on the projected face image. This means that the exact 3D location of several facial features needs to be known in order to map them to the model. The most effective algorithms that can provide such information about features are flexible models [104, 110]. They are chosen to provide the initial location of the face and a set of feature points that the proposed head pose estimation concept requires.


The face model is initialized using the iris centers – two points that can be localized with good accuracy and which have known 3D coordinates – as will be explained in the following chapters. However, two points are insufficient to accurately orient a model in 3D space. Therefore, the initial head rotations provided by the flexible model are used to initialize the fitted model. Furthermore, points on the face boundary – such as those on the edge of the chin and cheeks – are used to warp a generic face model to the individual face. The warping is performed as linear interpolation. This makes it possible to obtain a well-fitted model for further tracking using single-frame initialization. The generic model that is used as input for the warping procedure is an average face built from depth images of 10 different people registered using a Microsoft Kinect. All these images were captured in a frontal pose and aligned so that the eye centers were fixed. The generic model is shown in Figure 3.1.

Figure 3.1. Generic face model used for head pose estimation initialization.

The results of the warping function for 3 different users are shown in Figure 3.2. As can be seen, different users can have substantially different head shapes and sizes. The warping procedure allows a good initial model fit to be obtained regardless of these appearance variations. It should be noted that the warping procedure will only work for a near-frontal head pose of the user, as otherwise various length differences caused by head rotations can be misinterpreted as appearance differences. However, a frontal head pose is natural when gazing at the screen center, and small inaccuracies do not cause failure of the system.

Figure 3.2. Warped mesh after initialization for 3 different users.

The core element of the initialization procedure is the flexible model algorithm itself. A proprietary implementation of an Active Appearance Model (AAM) was used. The AAM algorithm aims to fit a flexible model, consisting of shape and appearance parameters, to the observed face. The flexible model has to be previously trained on labelled face images in order to capture the shape and appearance variations. Principal Component Analysis (PCA) is typically used to capture the variations of the labelled training data – both in terms of shape and appearance.

An illustration of the AAM points on faces in three poses is shown in Figure 3.3.

Figure 3.3. Illustration of AAM results for three head poses.


Fitting the flexible model to an input image consists of minimizing the error between the image and the closest model appearance, which is a nonlinear optimization problem. Let the face shape $s$ be described as a base shape $s_0$ plus a linear combination of $n$ shape vectors $s_i$

$$ s = s_0 + \sum_{i=1}^{n} p_i s_i \qquad (3.1) $$

where the coefficients $p_i$ are shape parameters. Let $D_0$ denote the image area confined by $s_0$. Let the face appearance $A(x)$, defined over the set of pixels $x \in D_0$, be described as a base appearance $A_0(x)$ plus a linear combination of $m$ appearance images $A_i(x)$

$$ A(x) = A_0(x) + \sum_{i=1}^{m} \lambda_i A_i(x), \quad \forall x \in D_0 \qquad (3.2) $$

where the coefficients $\lambda_i$ are appearance parameters. The function that needs to be minimized can then be formulated as

$$ \sum_{x \in D_0} \left[ A_0(x) + \sum_{i=1}^{m} \lambda_i A_i(x) - I\big(\boldsymbol{W}(x, \boldsymbol{p})\big) \right]^2 \qquad (3.3) $$

where $I(\boldsymbol{W}(x, \boldsymbol{p}))$ is the intensity at pixel $x$ of the input image piecewise affinely warped according to the shape parameters $\boldsymbol{p} = (p_1, p_2, \ldots, p_n)^T$. Various techniques are used to minimize the function in equation (3.3): gradient descent [150], linear regression [151] and compositional algorithms [104] are just a few. In the proposed eye gaze tracking system the author has used a proprietary AAM implementation whose details cannot be disclosed.

Alternatively, a recent flexible model algorithm based on linear regression with local binary features [116] has been tested for comparison. The tested implementation relies on iterative refinement of an initial face shape estimate $S_0$. At each iteration $t$ an increment that refines the previous shape estimate $S_{t-1}$ is found by regressing a set of features

$$ S_t = S_{t-1} + R_t \Phi_t(I, S_{t-1}) \qquad (3.4) $$

where $R_t$ is the regression matrix at iteration $t$ and $\Phi_t(I, S_{t-1})$ is a vector of features extracted from image $I$ at the landmarks $S_{t-1}$.

The matrix $R_t$ is learned during the training phase based on manually labelled training images. The features $\Phi_t(I, S_{t-1})$ are sparse binary features generated by regression forests. Each of the 68 landmarks that constitute the face shape has a corresponding regression forest that was trained to minimize the error of that landmark on the training set. Each forest consists of 3 trees: one minimizing the x-axis error, one minimizing the y-axis error, and one minimizing the errors for both axes simultaneously. The trees are constructed based on HOG [152] features extracted at each of the landmarks. The results of the tested regression face alignment algorithm are shown in Figure 3.4.
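Once the per-stage regression matrices and the feature mapping are available, the update rule (3.4) reduces to a short loop. The sketch below only illustrates this cascade structure; `extract_features` (standing in for $\Phi_t$) and the list of trained matrices `regressors` are hypothetical inputs, and the training procedure is not shown.

```python
def cascaded_alignment(image, initial_shape, regressors, extract_features):
    """Iteratively refine a face shape estimate, one regression stage per iteration.
    initial_shape: NumPy array of landmarks (e.g. 68 x 2); regressors: list of learned
    matrices R_t; extract_features: callable returning Phi_t(I, S_{t-1})."""
    shape = initial_shape.copy()
    for R_t in regressors:
        phi = extract_features(image, shape)              # sparse binary features
        shape = shape + (R_t @ phi).reshape(shape.shape)  # eq. (3.4)
    return shape
```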

Figure 3.4. Illustration of face alignment by cascaded regression.

The performance of the eye gaze tracking system using either algorithm was comparable. While the newer flexible model algorithm might be superior, the importance of the flexible model algorithm for the whole eye gaze tracking system is limited to the initialization stage, so it does not directly influence eye gaze tracking accuracy.

3.2 Head pose tracking

Head pose tracking can be performed using data from two sensors: an RGB camera or an RGBD camera. This section presents two state-of-the-art head pose tracking methods utilizing data from the RGB sensor only. Both have been implemented from scratch, tested and improved in various ways. Their evaluation is presented in Chapter 7. The usage of depth sensor data is discussed in Section 3.4.

One important assumption used throughout this work is a pinhole camera model, typically used in computer vision. In this model, the scene view is formed by projecting 3D points into the image plane using a perspective transformation

$$ \begin{bmatrix} u \\ v \\ w \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} \qquad (3.5) $$

where

$(X, Y, Z)$ are the coordinates of a 3D point in the world coordinate space,
$(u, v, w)$ are the coordinates of the image point,
$(c_x, c_y)$ is the camera principal point, usually at the image center,
$(f_x, f_y)$ are the camera focal lengths expressed in pixels.

The matrix containing the focal lengths and principal point coordinates is commonly referred to as the matrix of camera intrinsic parameters. The assumed camera model is somewhat simplified – axis skew, radial distortion and tangential distortion are not considered. The impact of these distortion factors on the eye gaze tracking system has been found to be small compared to other error sources.

The pinhole camera model given by equation (3.5) results in the following perspective projection equation for the image pixel $p$ of a world point $P$:

$$ p = \pi(P) = \left( \frac{P_x f_x}{P_z} + c_x,\ \frac{P_y f_y}{P_z} + c_y \right)^T \qquad (3.6) $$

Similarly, the inverse projection function is defined as

$$ P \equiv \left( \frac{(p_x - c_x) P_z}{f_x},\ \frac{(p_y - c_y) P_z}{f_y},\ P_z,\ 1 \right)^T \qquad (3.7) $$

where $P_z$ is the depth measurement at the location of image pixel $p$.

The values $f_x$ and $f_y$ are normalized focal lengths expressed in pixels, defined as the quotient of the focal length in world units and the pixel size in world units: $f_x = f / s_x$ and $f_y = f / s_y$. It is from here on assumed that the focal lengths in the horizontal and vertical plane are equal (experiments with the used cameras have shown very small differences) and, for notational simplicity, that $f = f_x = f_y$.
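Equations (3.6) and (3.7) translate directly into code. The sketch below is a minimal transcription under the stated assumption of a single focal length f; the function names are illustrative.

```python
import numpy as np

def project(P, f, cx, cy):
    """Perspective projection of a 3D point P = (X, Y, Z), eq. (3.6)."""
    X, Y, Z = P
    return np.array([X * f / Z + cx, Y * f / Z + cy])

def unproject(p, depth, f, cx, cy):
    """Inverse projection of image pixel p = (px, py) with known depth P_z, eq. (3.7)."""
    px, py = p
    return np.array([(px - cx) * depth / f, (py - cy) * depth / f, depth])
```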

Page 62: home.elka.pw.edu.plhome.elka.pw.edu.pl/.../2016_commodity_camera_egt_phd.pdf · 2016-04-06 · This thesis presents a complete eye gaze tracking system intended for use with a com-modity

3 Head pose estimation

62

3.2.1 Lucas-Kanade intensity based tracking

The proposed intensity based tracker stems from the work of [49]. The fundamental assumptions of this tracking approach are a pinhole camera model and a rigid 3D head model, which can be approximated as a cylinder forming a mesh of evenly distributed points. The 3D model provides the spatial locations of all observed facial pixels in the image. It is further assumed that the pose of the 3D model is also the pose of the head.

In the pinhole camera model the perspective projection formula for a 3D point $P$ and its location $p$ in the camera image was previously given in equation (3.6). On the other hand, the position of the point $P$ relative to its position in the previous frame can be written as

$$ P_t = T \cdot P_{t-1} \qquad (3.8) $$

where $T$ is a transformation matrix describing the motion of this point, given by

$$ T = \begin{bmatrix} 1 & -\omega_z & \omega_y & t_x \\ \omega_z & 1 & -\omega_x & t_y \\ -\omega_y & \omega_x & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (3.9) $$

The transformation matrix can be associated with an instant motion vector $\mu$ in twist representation

$$ \mu = \left( \omega_x, \omega_y, \omega_z, t_x, t_y, t_z \right) \qquad (3.10) $$

where $t_x, t_y, t_z$ denote translations relative to the camera and $\omega_x, \omega_y, \omega_z$ represent rotations relative to the camera. In later derivations the following notation will be used to link the motion vector $\mu$ with the transformation matrix $T$:

$$ [\mu] = T \qquad (3.11) $$

The benefit of using the twist motion vector is that all six degrees of freedom are described with only six parameters; a representation using a full rotation matrix is redundant.

Plugging (3.9) into (3.8) results in

$$ P_t = T \cdot P_{t-1} = \begin{bmatrix} X_{t-1} - Y_{t-1}\omega_z + Z_{t-1}\omega_y + t_x \\ X_{t-1}\omega_z + Y_{t-1} - Z_{t-1}\omega_x + t_y \\ -X_{t-1}\omega_y + Y_{t-1}\omega_x + Z_{t-1} + t_z \\ 1 \end{bmatrix} \qquad (3.12) $$

where $X, Y, Z$ represent the world coordinates of point $P$.
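For illustration, the small-angle parameterization (3.9) and its application to a point (3.12) can be written as a few lines of NumPy; this is only a sketch of the parameterization, with illustrative names, not the full tracker.

```python
import numpy as np

def twist_to_matrix(mu):
    """Build the transformation matrix T of eq. (3.9) from mu = (wx, wy, wz, tx, ty, tz)."""
    wx, wy, wz, tx, ty, tz = mu
    return np.array([
        [1.0, -wz,  wy, tx],
        [ wz, 1.0, -wx, ty],
        [-wy,  wx, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ])

def transform_point(mu, P):
    """Apply the rigid motion to a 3D point in homogeneous form, as in eq. (3.12)."""
    P_h = np.append(np.asarray(P, dtype=float), 1.0)
    return twist_to_matrix(mu) @ P_h
```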

If the matrix $T$ describes the motion of a rigid model, the above equation holds for all points that form this rigid model. Using the pinhole camera model projection (3.6), and omitting the image center translations for clarity, the projection of $P_t$ can be expressed using the previous position $P_{t-1}$ and the motion parameterized by the vector $\mu = (\omega_x, \omega_y, \omega_z, t_x, t_y, t_z)$

$$ p_t' = \pi(T \cdot P_{t-1}) = \begin{bmatrix} X_t \\ Y_t \end{bmatrix} \frac{f}{Z_t} = \begin{bmatrix} X_{t-1} - Y_{t-1}\omega_z + Z_{t-1}\omega_y + t_x \\ X_{t-1}\omega_z + Y_{t-1} - Z_{t-1}\omega_x + t_y \end{bmatrix} \frac{f}{-X_{t-1}\omega_y + Y_{t-1}\omega_x + Z_{t-1} + t_z} \qquad (3.13) $$

Let the intensity of the image at point $p$ and time $t$ be denoted as $I(p, t)$. Let $F(p, \mu)$ be a function that maps point $p$ to a new location $p'$ using the vector $\mu$, according to the motion model given by (3.13). Let the region containing all considered face pixels be denoted as $\Omega$. Computing the motion vector between two frames based on the luminance constancy principle can then be expressed as the minimization of the sum of squared luminance differences between the face image from the previous frame and the current face image transformed by the mapping function $F$

$$ \min_{\mu} \sum_{p \in \Omega} \big( I(F(p, \mu), t) - I(p, t-1) \big)^2 \qquad (3.14) $$

Following the original derivation from [49], the motion vector $\mu$ can be computed using the Lucas-Kanade method

$$ \mu = \left( \sum_{\Omega} w \, (I_p F_\mu)^T (I_p F_\mu) \right)^{-1} \sum_{\Omega} w \, \big( I_t \, (I_p F_\mu)^T \big) \qquad (3.15) $$

where $I_t$ and $I_p$ are the temporal and spatial image gradients, while $w$ is a weight assigned to each point. $F_\mu$ denotes the partial derivative of $F$ with respect to $\mu$ at $\mu = 0$

$$ F_\mu = \begin{bmatrix} -XY & X^2 + Z^2 & -YZ & Z & 0 & -X \\ -(Y^2 + Z^2) & XY & XZ & 0 & Z & -Y \end{bmatrix} \frac{f}{Z^2} \qquad (3.16) $$

Motion can be computed iteratively. In each iteration the 3D face model is transformed using the computed motion vector. With each such transformation the weights $w$ are updated for all points, depending on the strength of the image gradient, the density of the projected model points and the luminance difference before and after the model transformation. This iterative methodology makes it possible to efficiently eliminate outliers. To handle large movements without loss of accuracy the tracking is performed on a Gaussian image pyramid.
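A minimal sketch of a single Gauss-Newton iteration consistent with (3.14)–(3.16) is given below. The weight update, pyramid levels and model re-projection of the actual tracker are omitted; the argument names are illustrative, and the sign convention of the solved increment follows directly from minimizing (3.14).

```python
import numpy as np

def lk_motion_step(I_prev, I_curr, model_points, pixels, weights, f):
    """One Gauss-Newton step minimizing eq. (3.14) (illustrative sketch).
    model_points: Nx3 array of 3D model points (X, Y, Z) in camera coordinates,
    pixels: Nx2 integer pixel locations of those points, weights: per-point weights w."""
    gy, gx = np.gradient(I_curr.astype(np.float64))
    H = np.zeros((6, 6))
    b = np.zeros(6)
    for (X, Y, Z), (px, py), w in zip(model_points, pixels, weights):
        Ip = np.array([gx[py, px], gy[py, px]])             # spatial gradient
        It = float(I_curr[py, px]) - float(I_prev[py, px])  # temporal difference
        # Jacobian of the projected rigid motion, eq. (3.16)
        Fmu = (f / Z**2) * np.array([
            [-X * Y, X**2 + Z**2, -Y * Z, Z, 0.0, -X],
            [-(Y**2 + Z**2), X * Y, X * Z, 0.0, Z, -Y],
        ])
        J = Ip @ Fmu                                        # 1x6 image-motion Jacobian
        H += w * np.outer(J, J)
        b += -w * It * J
    return np.linalg.solve(H, b)                            # motion vector mu
```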

Key modifications of the original algorithm

The most important modification compared to the originally proposed algorithm is different 3D model initialization, as described in Section 3.1. A more accurate model that is well aligned can greatly improve the 3D tracking accuracy, as its geometry will be a much better approximation of the true face geometry than when using a simple model such as a cylinder.

In the original article each mesh point was weighted by a combination of three weights depending on the strength of the image gradient, the density of the projected model points and the luminance difference before and after the model transformation. In extensive experiments the author has found that the first two do not improve tracking accuracy in a consistent manner. Therefore, only the third weighting method was retained, which decreases the impact of points that are not consistent with the estimated model motion in an exponential fashion. This reduces the impact of inaccurate alignment of the model, non-rigid motion, illumination changes and occlusions.

Another important difference is the reinitialization method. Originally, it was proposed to save a luminance template of the first frame and, for every incoming frame, attempt to track both to the previous frame and to this template frame. If tracking to the template frame gave a sufficiently small error, reinitialization was performed. In practice this reinitialization did not always perform well and often introduced large random errors. The author proposes to use feature-based templates instead. As feature points are largely invariant to illumination, scale and rotation, such templates provide much better reinitialization robustness. This is described further in Section 3.2.2.


To sum up, the most important modifications to the original algorithm are as follows:

1. More accurate model shape than the generic cylinder
2. Better alignment of the model during initialization using facial features
3. More efficient model point weighting mechanism
4. More accurate and robust reinitialization procedure

Unsuccessful modifications of the original algorithm

A different promising improvement that has been tested was a flock-of-trackers approach based on the work of Matas and Vojir [153]. In this scenario the mesh points were not distributed evenly in a grid as proposed in [49]. Instead, each point was assigned a certain rectangle and was moved within this rectangle to the location that gave the best corner detector response. The assumption is that corner points are more characteristic and should be easier to track. Surprisingly, this modification failed to give any improvement; rather, it slightly worsened the 3D model tracking accuracy. The most probable conclusion is that when dealing with a rigid structure of many points, gradual intensity changes (evenly distributed points) can be more helpful than distinctive corner points (flock-of-trackers). In the case of rigid model tracking each point shift contributes to the estimated motion of the whole model. This is a completely different situation than when every point is tracked separately.

3.2.2 Feature based tracking

The proposed feature-based head pose tracking is a modification of the approach by Jang and Kanade [126]. Assume that two independent sets of feature points are detected in two facial images and that a subset of these points is matched to form pairs $(p_{t-1}, p_t)$. The points from the first frame are related to a known head pose, so their 3D coordinates $P_{t-1}$ are known; the depth is taken from the 3D model, similarly as in the case of the optical flow tracking described previously. Therefore, two forms of point coordinates are available in the second image. The first are the observed locations of the detected points $p_t$. The second are the projections $p'_t$ of the points $P'_t$, obtained by estimating the motion of the previous locations $P_{t-1}$ based on the 3D model and the motion vector $\mu$, as given in equation (3.13). Assuming $N$ such point pairs have been collected, the goal is to compute a motion vector $\mu$ which minimizes the sum of distances between the observed points $p_{i,t}$ and the estimated points $p'_{i,t}$

$$ \sum_{i} \left\| p_{i,t} - p'_{i,t} \right\|^2 \qquad (3.17) $$

This can be achieved by solving the following equation set:

$$ \begin{cases} p_{1,t} - p'_{1,t} = 0 \\ \quad \vdots \\ p_{N,t} - p'_{N,t} = 0 \end{cases} \qquad (3.18) $$

The equation set can be solved using the weighted linear least squares method. The general form of a weighted equation set in matrix notation for the discussed case is

$$ W A x = W b \qquad (3.19) $$

where $W$ contains the weights for each correspondence and $x$ is the sought motion vector. The matrices $A$ and $b$ can be found by plugging equation (3.13) into (3.18). For a single correspondence this gives

$$ \begin{bmatrix} X_{t-1} - Y_{t-1}\omega_z + Z_{t-1}\omega_y + t_x \\ X_{t-1}\omega_z + Y_{t-1} - Z_{t-1}\omega_x + t_y \end{bmatrix} \frac{f}{-X_{t-1}\omega_y + Y_{t-1}\omega_x + Z_{t-1} + t_z} = \begin{bmatrix} x_t \\ y_t \end{bmatrix} \qquad (3.20) $$

This can be transformed into the form $A x = b$

$$ \begin{bmatrix} -x_t Y_{t-1} & f Z_{t-1} + x_t X_{t-1} & -f Y_{t-1} & f & 0 & -x_t \\ -f Z_{t-1} - y_t Y_{t-1} & y_t X_{t-1} & f X_{t-1} & 0 & f & -y_t \end{bmatrix} \begin{bmatrix} \omega_x \\ \omega_y \\ \omega_z \\ t_x \\ t_y \\ t_z \end{bmatrix} = \begin{bmatrix} x_t Z_{t-1} - f X_{t-1} \\ y_t Z_{t-1} - f Y_{t-1} \end{bmatrix} \qquad (3.21) $$

The equation set in (3.18) can be solved by building a full system for all correspondences, where each correspondence is related to the motion vector $\mu$ as given by equation (3.21). If each correspondence has an assigned weight $w_i$, the iterative procedure modifies these weights in each iteration to down-weight outliers. The initial weights are based on feature point match quality.
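Assembling and solving the weighted system (3.19)–(3.21) could look as follows. This is a sketch: the iterative re-weighting is reduced to a single weighted solve, the function name is illustrative, and the image coordinates x_t, y_t are assumed to already have the principal point subtracted, as in the derivation above.

```python
import numpy as np

def solve_motion_from_matches(points_prev_3d, points_curr_2d, weights, f):
    """Stack two rows per correspondence according to eq. (3.21) and solve
    the weighted least squares system W A x = W b for the motion vector mu."""
    A_rows, b_rows, w_rows = [], [], []
    for (X, Y, Z), (xt, yt), w in zip(points_prev_3d, points_curr_2d, weights):
        A_rows.append([-xt * Y, f * Z + xt * X, -f * Y, f, 0.0, -xt])
        b_rows.append(xt * Z - f * X)
        A_rows.append([-f * Z - yt * Y, yt * X, f * X, 0.0, f, -yt])
        b_rows.append(yt * Z - f * Y)
        w_rows.extend([w, w])
    A = np.asarray(A_rows)
    b = np.asarray(b_rows)
    # weighted least squares: scale both sides by the square roots of the weights
    sw = np.sqrt(np.asarray(w_rows))[:, None]
    mu, *_ = np.linalg.lstsq(sw * A, sw[:, 0] * b, rcond=None)
    return mu                                   # (wx, wy, wz, tx, ty, tz)
```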

In the original publication [126] the described feature tracking algorithm was applied for each incoming frame twice:


• Tracking to the previous frame
• Tracking to a template frame

Both of these tracking results were merged using a Kalman filter.

Key modifications of the original algorithm

The first important modification is using different feature points than those originally proposed by Jang and Kanade. The nature of face images is that they do not contain many corner-type points. SIFT points were designed for a different purpose and so do not work very well on faces. It has been found that a better feature point detector in the case of face images is the STAR [1] detector.

The finally selected feature descriptor is BRISK [154], because it is lightweight, robust and free to use, with open source implementations available. Figure 3.5 shows a comparison of two feature point detectors on a face image. As can be seen, many more points are detected when using the STAR detector with a low threshold than when using the SIFT detector. What is more, experiments have shown that the detected STAR points are also more stable and allow much better model tracking.

Figure 3.5. Left image shows SIFT features, right image shows STAR features.


Typical matching of feature points is done by comparing the feature distance ratio between the best and second best match. If this is more than a certain threshold, the match is deemed valid. An additional constraint added by the author is geometrical consistency verification. Using the transformation from the previous frame, the feature points detected in the current frame are mapped to the reference keyframe. A keyframe can be any stored template frame with indexed feature points. The maximum distance between the mapped features cannot be larger than 1/3 of the distance between the eyes. This filtering is more important in the case of face images than in images of textured objects, as the face contains many skin regions where very similar feature points can be detected and falsely matched. The geometrical consistency filtering has been found to significantly reduce the number of false matches and improve the pose estimation quality.
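The consistency check itself is a simple distance test, as sketched below. `map_to_keyframe` is a hypothetical helper standing in for the composition of the previous-frame pose with the projection into the keyframe; the other names are illustrative as well.

```python
import numpy as np

def filter_matches_geometrically(matches, curr_points, key_points,
                                 map_to_keyframe, eye_distance):
    """Keep only matches whose current-frame point, mapped to the keyframe using the
    previous pose estimate, lands within 1/3 of the inter-eye distance of its match.
    matches: list of (curr_idx, key_idx) index pairs."""
    max_dist = eye_distance / 3.0
    kept = []
    for curr_idx, key_idx in matches:
        mapped = map_to_keyframe(curr_points[curr_idx])   # 2D location in keyframe coords
        if np.linalg.norm(np.asarray(mapped) - np.asarray(key_points[key_idx])) <= max_dist:
            kept.append((curr_idx, key_idx))
    return kept
```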

Continuous feature based pose estimation has been found to be very prone to drift – in fact much more so than optical flow methods. On the other hand, features can be detected independently in each frame, which means that there is no drift error between the matched features themselves. This is an important advantage when considering the reinitialization concept. Therefore, feature based head pose estimation is better suited for tracking to template frames (keyframes) than the optical flow originally proposed in [49]. It is a core component of the proposed hybrid head pose tracking system described in the following section.

3.3 Proposed hybrid algorithm

It has been previously stated that the most accurate methods of head pose estimation rely on tracking [79]. Of course, any tracking algorithm requires initialization and possibly reinitialization if tracking gets lost. The proposed hybrid algorithm improves on the initialization and reinitialization approaches proposed so far.

The hybrid head pose tracking approaches described in Chapter 2 [125, 130, 132] present different ways to combine frame-to-frame tracking, keyframe tracking and static pose estimation. Jang and Kanade [125] proposed to combine frame-to-frame feature based tracking with keyframe feature based tracking using a Kalman filter. Liao et al. [130] proposed to weight frame-to-frame intensity tracking against frame-to-frame feature based tracking depending on each method's quality. Morency et al. [132] proposed to merge static pose estimation using an AAM, frame-to-frame optical flow tracking and feature-based keyframe tracking using a Kalman filter. None of the mentioned methods showed an improvement of tracking accuracy in favorable conditions over previous work, but they did demonstrate an improvement in robustness and stability.

The author of this work claims that in favorable tracking conditions, which means ambient illumination and relatively slow movements, the best accuracy is achieved by optical flow intensity based tracking. This claim is largely supported by literature. It can be explained by the fact that features are always detected independently in each frame. While two identical views will result in the same detected features and ideal alignment when using them, two different views might contain different feature locations, even if these features are later matched as corresponding. These discrepancies in feature locations lead to tracking drift, which can be much larger than when performing intensity tracking using optical flow.

The proposed hybrid head pose tracking algorithm can be described as follows. Two core tracking algorithms are run for every new frame: an intensity-based frame-to-frame tracker as described in Section 3.2.1 and a feature-based keyframe tracker as described in Section 3.2.2. Whichever tracker yields the smaller error is considered more confident, and its tracking result is used. This has been found to perform better than alternative forms of averaging. If the frame-to-frame tracker returns a smaller error, its head pose estimate is used and this estimate can be referred to as tracking – it depends on the previous frame only. If the keyframe tracker returns a smaller error, the performed tracking can be referred to as reinitialization, as the head pose estimate depends on a reference frame. The method of measuring tracker confidence is described in Section 3.3.1, while the keyframe initialization is described in Section 3.3.2.
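At the level of control flow, the per-frame decision can be summarized by the sketch below. The tracker objects, their methods and the returned error values are placeholders for the components of Sections 3.2.1, 3.2.2 and 3.3.1; this is not the actual implementation.

```python
def estimate_pose_for_frame(frame, state):
    """Hybrid per-frame pose update (sketch): pick whichever tracker is more confident."""
    pose_i, error_i = state.intensity_tracker.track_frame_to_frame(frame)
    pose_f, error_f = state.feature_tracker.track_to_keyframes(frame)
    if error_i <= error_f:
        state.pose = pose_i        # frame-to-frame tracking result
    else:
        state.pose = pose_f        # keyframe-based reinitialization
    return state.pose
```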

There is one more situation that has not been mentioned so far but needs to be considered: the case when tracking gets lost completely due to occlusion, extreme rotation or other factors. In this case there is no model that can be tracked to previous frames or keyframes, which requires a different form of reinitialization. To handle this case, the flexible model algorithm is used. The flexible model algorithm can be run as a background process due to its relatively low computational requirements. This means that in every frame, regardless of the model-based tracking state, a coarse estimate of the facial features and head pose is available. This is used to determine the face boundary and calculate feature points. These feature points are matched to keyframes that have been stored earlier. If a sufficiently good match can be found (judging by the number of matched correspondences), the head pose can be calculated and the 3D model reinitialized. To improve the accuracy of such reinitialization, it is limited to the case when the face has a near-frontal pose.

3.3.1 Tracker error measurement

One of the major issues is how to compare the quality of the intensity-based frame-to-frame tracker and the feature-based keyframe tracker. The first operates on pixel intensities while the second deals with feature points. These are different domains, so a fair comparison cannot easily be made.

The proposed way to solve this problem is to measure the accuracy of both trackers in the pixel intensity domain. The intensity-based tracker in fact directly minimizes the luminance difference between the face image from the previous frame and the face image from the current frame transformed by the mapping function $F$ – as given in equation (3.14). The calculation of the tracking error is done in a similar manner

$$ E_i = \sum_{p \in \Omega} \big( I(F(p, \mu), t) - I(p, t-1) \big)^2 \qquad (3.22) $$

with $p$ being a pixel of the considered face region. However, the luminance difference between two consecutive frames cannot be reliably compared to the difference between the current frame and a keyframe. Both values need to be relative to a frame that was used for initialization. Therefore, the luminance difference needs to be calculated between the current frame and the last reference frame. The formula is therefore

$$ E_i = \sum_{p \in \Omega} \big( I(F(p, \mu_{ref}), t) - I(p, t_{ref}) \big)^2 \qquad (3.23) $$

where $\mu_{ref}$ is the motion vector representing the composed transformation between the current frame and the last reference keyframe

$$ [\mu_{ref}] = [\mu_{k-1}] \cdot \ldots \cdot [\mu_{k-n}] \qquad (3.24) $$

The relative frame-to-frame transformations update this composed transformation for each new frame, or reset it if reinitialization is performed.

The feature-based tracker minimizes the distance between the locations of the feature points in the current frame and the transformed corresponding feature points in the keyframe, resulting in a transformation $[\mu]$ associated with a motion vector $\mu$. Once the motion vector is found, it can be used to transform the 3D face model from the reference keyframe to match the current frame. After this, the pixel intensities of model points from the keyframe and the current frame can be compared. Eventually, equation (3.23) is used once again but with a different motion vector $\mu_{key}$

$$ E_f = \sum_{p \in \Omega} \big( I(F(p, \mu_{key}), t) - I(p, t_{key}) \big)^2 \qquad (3.25) $$

This time the motion vector is given by the feature-based tracker and refers to the transformation between the current frame and the selected keyframe (which may not be the same frame that was last used for reinitialization).

Depending on whether $E_f$ or $E_i$ is smaller, the corresponding estimate is chosen as the current head pose.
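Both error terms amount to warping the model points with the corresponding composed motion and summing squared intensity differences against the stored reference, as sketched below. `warp_pixel` is a hypothetical helper standing in for the mapping F(p, μ).

```python
def intensity_error(I_curr, I_ref, face_pixels, mu, warp_pixel):
    """Tracking error of eqs. (3.23)/(3.25): sum of squared intensity differences
    between warped current-frame pixels and the stored reference frame."""
    error = 0.0
    for p in face_pixels:
        px, py = warp_pixel(p, mu)            # F(p, mu), rounded to the pixel grid
        qx, qy = p
        diff = float(I_curr[int(py), int(px)]) - float(I_ref[int(qy), int(qx)])
        error += diff * diff
    return error
```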

3.3.2 Keyframe initialization

The problem of keyframe selection is crucial for the performance of the feature tracker. Several strategies have been evaluated, most importantly:

• Using a single keyframe saved during system initialization
• Using multiple keyframes collected for various head poses when the tracking error is below a certain threshold
• Using a composite keyframe containing features from many frames aligned to the initial pose and forming a single 3D model

The first method can be considered as the standard approach. Saving a template in the initialization frame means that the face pose and relative positions of eyeball rotation centers are known. Using this single keyframe allows straightforward relative pose and eye center estimation without drift errors for any new frame, as long as a minimum number of features can be correctly matched.

The second method is similar to the approach of Jang and Kanade [125] and Morency et al. [132]. Templates collected online significantly improve the robustness of the tracking system, as reference frames for various head poses become available. However, reference templates collected online inherently contain drift error, and tracking to reference frames containing drift significantly limits the accuracy. The performed experiments show that although such an approach achieves a visually pleasing effect as far as head pose tracking is concerned, it deteriorates the accuracy of the eye gaze tracking system. That is why the results reported in Chapter 7 refer to the case when such templates are collected in the initial, frontal head pose only.

The third proposed method is the most original. It grows the initial keyframe, containing features from the initialization frame, with new features from the following frames. As the head pose is constantly tracked, the feature locations in any frame can be mapped back to the initial keyframe and the new feature points can be aggregated with previously detected ones. This approach is partly similar to the second strategy, but instead of keeping multiple independent templates, only one composite template is stored. This increases the number of features, which increases the number of matches. Because the Iteratively Reweighted Least Squares scheme eliminates outliers, a larger number of matches usually means a larger number of correct matches. A larger number of correct matches in turn leads to a reduction of the pose estimation error compared to the case when independent templates are used.

The above is most significant when features are aggregated over several frames in one pose. Due to camera sensor noise, even frames registered with the same head pose contain up to 25% different feature points (this is caused by the low detection threshold used in order to find sufficiently many feature points on skin regions). Experiments have shown that when the composite single template was aggregated over various head poses, a drift problem similar to that of multiple templates was significant. However, when the initial template was grown only while the head pose remained near-frontal, the drift problem was negligible, while the amount of additional feature points was considerable. The best strategy has therefore been determined as growing the composite model beginning from the initialization frame and continuing as long as the head pose remains close to frontal (very small head rotations).
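The composite-keyframe growth can be sketched as follows. The pose object with a `rotation_deg` attribute, the `map_to_initial_frame` helper and the 5° threshold are all hypothetical stand-ins for the pose composition and thresholds used in the actual system.

```python
import numpy as np

def grow_composite_keyframe(keyframe_points, keyframe_descs,
                            new_points, new_descs, pose, map_to_initial_frame,
                            max_rotation_deg=5.0):
    """Aggregate newly detected features into the single composite keyframe,
    but only while the head pose is still near-frontal (to limit drift)."""
    if np.max(np.abs(pose.rotation_deg)) > max_rotation_deg:
        return keyframe_points, keyframe_descs          # stop growing once the head turns
    mapped = [map_to_initial_frame(p, pose) for p in new_points]
    keyframe_points = np.vstack([keyframe_points, np.asarray(mapped)])
    keyframe_descs = np.vstack([keyframe_descs, new_descs])
    return keyframe_points, keyframe_descs
```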

The experimental results for each of the described keyframe storing strategies are presented in Chapter 7. The last method introduced by the author will be shown to achieve the best average accuracy.

3.3.3 Illumination compensation

Illumination plays an important role in all vision-based tracking algorithms. Usually, the most favorable lighting conditions are ambient lighting with no shadows. Directed lighting causes the observed object to change appearance during movements, as different parts get various amounts of light. In case of head pose tracking, uneven illumination can significantly degrade the accuracy of head pose estimation.

In order to overcome lighting problems, several illumination compensation algorithms have been analyzed. One of the most widely used methods for illumination compensation is multiscale retinex [155]. It combines dynamic range compression, color consistency and tonal rendition along three or more scales in order to achieve satisfactory compensation. It seems to work well, although it does tend to remove some texture. The results of a simplified retinex method implementation without color restoration are shown in Figure 3.6.

Figure 3.6. Result of simplified multiscale retinex.

Based on the simplified implementation and images available in the original publication, several negative consequences of using the algorithm are visible, most importantly texture removal and additive Gaussian noise. As the multiscale retinex method relies on multiple convolutions of the image with the Gaussian kernel, it is a slow method. In fact, the typical Gaussian kernels that should be used are large, spanning even hundreds of pixels. This makes the method an offline method. The low speed and image texture degradation were the decisive factors that led to abandoning further testing of this method in favor of other ones.

Another popular illumination compensation technique, called the self quotient image [156], is most widely applied in face recognition. This is because it extracts and enhances facial features very well in an illumination invariant way. However, because of large texture transformations it does not seem to be ideal for gradient-based mesh tracking. The method is also based on multiple convolutions of the image with various Gaussian kernels. It seems to be similarly 'heavyweight' as the multiscale retinex method, and also removes texture. Therefore, its usability for pose tracking is doubtful.

A better option for the case of tracking seems to be the simultaneous dynamic range compression and local contrast enhancement (SDRCLE) algorithm [157]. The results of a simplified implementation of this algorithm are shown in Figure 3.7. In initial tests the local contrast enhancement part of the algorithm contributed little to the image appearance. Because of this, and due to the already high computational demands, it was decided that this part would not be included in the algorithm tested with the eye gaze tracking system.

Figure 3.7. Result of dynamic range compression.

The dynamic range compression consists of calculating a convolution once for the whole image and later scaling the result using the hyperbolic tangent function. The convolution itself cannot be sped up without using the GPU. The hyperbolic tangent, however, is a function that can be implemented as a pre-calculated look-up table. Due to the computational requirements and the necessity to run the illumination compensation algorithm "on top" of other algorithms, the look-up table approach was found to be helpful.
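A rough sketch of this idea is given below; it is not the exact SDRCLE formula from [157], only an illustration of how a single Gaussian convolution can be combined with a tanh look-up table, with all parameter values assumed.

```python
import cv2
import numpy as np

LUT_SIZE = 1024
TANH_LUT = np.tanh(np.linspace(0.0, 4.0, LUT_SIZE)).astype(np.float32)

def lut_tanh(x):
    """Look up tanh(x) for x in [0, 4] using the precomputed table."""
    idx = np.clip((x / 4.0) * (LUT_SIZE - 1), 0, LUT_SIZE - 1).astype(np.int32)
    return TANH_LUT[idx]

def compress_dynamic_range(gray, sigma=15.0, gain=2.0):
    """gray: uint8 single-channel image; returns a float image in [0, 1]."""
    img = gray.astype(np.float32) / 255.0
    local_mean = cv2.GaussianBlur(img, (0, 0), sigma)   # single convolution
    # Darker neighbourhoods get a stronger boost, bright ones are compressed.
    out = lut_tanh(gain * img / (local_mean + 0.1))
    return cv2.normalize(out, None, 0.0, 1.0, cv2.NORM_MINMAX)
```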

Another algorithm that seems promising for head pose tracking is based on the cosine transform and discarding certain DCT coefficients [158]. The general idea is to transform the image into the frequency domain, remove chosen coefficients and then transform it back to the pixel intensity domain. In this approach the DCT coefficients are calculated for the whole image (not for blocks). As only several low-frequency components need to be removed to compensate uneven illumination on some part of the face, only these components need to be calculated. In this case the pixel intensity image obtained from these coefficients may be subtracted from the original image. An example output for this method when discarding a set of coefficients up to the 3rd order is shown in Figure 3.8.

Figure 3.8. Result of discarding DCT coefficients.

A further improvement of this method as described in [158] is operating on a logarithmic version of the original image when discarding DCT coefficients.
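The following sketch illustrates this logarithmic variant under stated assumptions (the parameter names and the final normalization step are illustrative); it reconstructs the low-frequency lighting component from the DCT coefficients of the logarithm image and subtracts it.

```python
import cv2
import numpy as np

def compensate_illumination_dct(gray, order=3):
    """gray: uint8 single-channel face image; returns a uint8 image."""
    h, w = gray.shape
    gray = gray[: h - h % 2, : w - w % 2]       # cv2.dct requires even sizes
    log_img = np.log1p(gray.astype(np.float32))
    coeffs = cv2.dct(log_img)

    # Keep only low-frequency coefficients with i + j <= order, excluding DC,
    # so that the reconstructed image models the smooth, uneven lighting.
    low = np.zeros_like(coeffs)
    for i in range(order + 1):
        for j in range(order + 1 - i):
            low[i, j] = coeffs[i, j]
    low[0, 0] = 0.0

    uneven = cv2.idct(low)                      # smooth lighting estimate
    corrected = np.expm1(log_img - uneven)      # subtraction in the log domain
    return cv2.normalize(corrected, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```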

The dynamic range compression algorithm and the second variant of DCT coefficient removal have been tested on sequences with lateral illumination. The results are presented and discussed in Chapter 7.

This concludes the topic of head pose estimation using an RGB sensor. The proposed framework is evaluated in Chapter 7. The author believes that the obtained results prove that the proposed algorithms are in the state-of-the-art category, with significant advantages over previously published work.

3.4 Usage of a depth camera

A depth sensor can be used in addition to the RGB camera. Using a depth sensor alone is impractical for eye gaze tracking, as the irises cannot be distinguished in the depth image. However, most systems that contain a depth camera also contain an RGB camera – this is the case with Microsoft Kinect or the newest laptops and tablets equipped with Intel's RealSense depth sensors.

The research related to eye gaze tracking using a depth camera in addition to the RGB camera [63, 66, 71, 75] demonstrates improved robustness, but fails to offer improved accuracy over previous state-of-the-art RGB-only methods. This is somewhat surprising, as a depth sensor is an additional and very informative source of data. It should be remembered, though, that the noise levels for depth measurements are much higher than for the RGB camera. Similarly, the resolution is much lower and depth sensors are still an emerging market. It can be expected that better eye gaze tracking results with depth sensor utilization are a matter of time.

This work will be one of the first to present results proving that usage of a depth sensor can significantly increase the accuracy of commodity camera eye gaze tracking when adopting a model-based approach. The rest of this section describes the proposed algorithms for depth data utilization, while an experimental evaluation is presented in Chapter 7. As the depth sensor does not provide any data that can be useful for iris localization, all the described improvements are related to head pose estimation.

3.4.1 Improved head pose initialization

A big advantage of using a depth sensor is direct information about the z-axis coordinates of points in the camera image, assuming that a calibrated setup is available. Instead of calculating the depth of various features based on real-world lengths and their appearance in the 2D image, the depth can be measured directly by the sensor. This allows a more accurate 3D head model to be built in the initialization phase, as there is no need to rely on the assumed distance between the user's eyes and on inaccurate head rotations estimated by the flexible model.

However, a more accurate measurement of facial feature locations is not the only advantage of using depth data. It is actually possible to completely abandon the generic face model and its warping – the face model can be created directly from depth measurements. Once depth sensor data is available for the face area constrained by a flexible model, it can be used to create a true depth mesh for each person, without the need to use a generic model or warp it. The better accuracy of such a model should improve both feature based and optical flow based head pose tracking, as both methods inherently rely on the model during pose estimation.


The depth mask which is transformed into a mesh and used as the tracking model is constructed as follows. First, an RGB image is captured and the face is found using the flexible model. This information is then aligned with the depth image. The relevant part of the depth image is segmented and used to create a mesh model. The mesh model consists of points evenly distributed along rows and columns, which are not necessarily in the same locations as the depth readings. The depth of each mesh point is calculated by linear interpolation of the nearest available depth readings from the depth sensor. This is necessary because 1) the RGB image usually has a higher resolution than the depth image and 2) the depth image transformed into the coordinates of the RGB camera has holes.
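A minimal sketch of such mesh construction is given below, assuming the depth map has already been registered to the RGB frame; the grid spacing, intrinsics and interpolation choices are illustrative assumptions rather than the thesis implementation.

```python
import numpy as np
from scipy.interpolate import griddata

def build_face_mesh(depth, face_rect, step=4, fx=570.0, fy=570.0, cx=320.0, cy=240.0):
    """depth: (H, W) depth map in millimetres, 0 where no reading is available.
    face_rect: (x0, y0, x1, y1) face region from the flexible model.
    Returns an (N, 3) array of mesh vertices in camera coordinates."""
    x0, y0, x1, y1 = face_rect
    roi = depth[y0:y1, x0:x1].astype(np.float32)

    # Valid depth readings inside the face region
    vy, vx = np.nonzero(roi > 0)
    vz = roi[vy, vx]

    # Regular grid of mesh vertices, not necessarily at depth-pixel locations
    gy, gx = np.mgrid[0:roi.shape[0]:step, 0:roi.shape[1]:step]
    gz = griddata((vy, vx), vz, (gy, gx), method='linear')
    nearest = griddata((vy, vx), vz, (gy, gx), method='nearest')
    gz = np.where(np.isnan(gz), nearest, gz)    # fill remaining holes

    # Back-project the grid to 3D using the pinhole model
    u = gx.ravel() + x0
    v = gy.ravel() + y0
    z = gz.ravel()
    X = (u - cx) * z / fx
    Y = (v - cy) * z / fy
    return np.column_stack([X, Y, z])
```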

The face model initialization accuracy has been found to significantly influence the tracking accuracy. Both can therefore be improved by using a personalized model created from depth sensor measurements. A comparison of results is provided in Chapter 7.

3.4.2 Improved head pose tracking

Apart from more accurate initialization, depth can be used in all the registered frames to improve the head pose tracking accuracy. The typical approach used for point cloud alignment – ICP – has also been proposed to track the head pose, for example by Li et al. [128]. While fast, such a system is not very accurate compared to RGB-only approaches. This is no surprise considering the nature of depth data from commodity depth sensors, which is much noisier than that of RGB cameras. True benefits in terms of tracking accuracy are possible only when combining depth data and RGB data.

The amount of work published on head pose estimation methods using a depth and RGB sensor simultaneously is not very large, as it is quite a recent topic. The most accurate algorithms will be mentioned. Wang et al. [159] proposed an accurate method to track non-rigid faces using harmonic maps, but it requires very high resolution input data that is beyond the capabilities of current commodity depth cameras. A different framework focused on flexible models was later presented without the requirement of such high resolution input data [160]. The accuracy of the system seems good, with reported errors of around 3 pixels relative to manually tagged facial landmarks in images. More recently, a new approach using Active Appearance Models has been proposed [161]. It is an improvement of AAM tracking with linear 3D morphable model constraints. It seems to be among the current state-of-the-art AAM based tracking methods, with a reported accuracy of 3-6 mm for each point. Even more recently, it was proposed to use a state-of-the-art camera pose estimation technique based on total variation to estimate the head pose from RGBD input data [129]. This approach involves very popular visual odometry algorithms developed at the Technical University of Munich (TUM) and is therefore a very good reference for evaluation purposes.

This thesis proposes two algorithms for head pose tracking improvement using a depth camera:

1. The approach using visual odometry algorithms already published by the author in [129].

2. An extension of the algorithm described in Section 3.3.

3.4.2.1 Total variation approach

The first approach has already been described in detail in [129], and so it will only be summarized here. Let an observed image pixel p have a related depth Z(p). Given a transformation matrix T and a camera projection function p = π(P), a warping function τ that computes the location of a pixel from the first image in the second image under a rigid body motion can be defined as

$$p' = \tau(p, T) = \pi\big(T\,\pi^{-1}(p, Z_1(p))\big) \qquad (3.26)$$

Using this, the photometric error of a pixel p can be defined as

$$r_I = I_2\big(\tau(p, T)\big) - I_1(p) \qquad (3.27)$$

Similarly, the depth error is given by

$$r_Z = Z_2\big(\tau(p, T)\big) - \big[T\,\pi^{-1}(p, Z_1(p))\big]_Z \qquad (3.28)$$

where [·]_Z returns the Z component of a point, for example [P]_Z = Z. The second term in equation (3.28) is the depth of the transformed point, which was reconstructed from the first depth image. Kerl et al. [162] propose to minimize the combined photometric and depth error given by equations (3.27) and (3.28) by modelling them as a bivariate random variable r = (r_I, r_Z)^T, which follows a bivariate t-distribution, and solving the resulting non-linear least squares problem. A big advantage of the proposed approach is that it uses the intensity and depth measurements from both frames, and considers the fact that they depend on each other.
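For illustration, the residuals of equations (3.27) and (3.28) can be computed as sketched below; this is a simplified version (nearest-pixel sampling, assumed pinhole intrinsics), not the robust solver of Kerl et al. [162].

```python
import numpy as np

def rgbd_residuals(I1, Z1, I2, Z2, T, fx, fy, cx, cy):
    """I1, I2: float intensity images; Z1, Z2: depth images (0 = no reading);
    T: 4x4 rigid-body transform from frame 1 to frame 2."""
    h, w = I1.shape
    v, u = np.mgrid[0:h, 0:w]
    valid = Z1 > 0
    u, v, z = u[valid].astype(np.float64), v[valid].astype(np.float64), Z1[valid]
    i1 = I1[valid]

    # pi^{-1}: back-project valid pixels of the first frame to 3D points
    X = (u - cx) * z / fx
    Y = (v - cy) * z / fy
    P = np.column_stack([X, Y, z, np.ones_like(z)])

    # Rigid-body motion and projection into the second frame (the warp tau)
    Pt = (T @ P.T).T
    u2 = np.round(fx * Pt[:, 0] / Pt[:, 2] + cx).astype(int)
    v2 = np.round(fy * Pt[:, 1] / Pt[:, 2] + cy).astype(int)
    inside = (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h)

    # r_I: intensity difference (3.27); r_Z: measured depth in the second frame
    # minus the depth of the transformed point (3.28)
    r_I = I2[v2[inside], u2[inside]] - i1[inside]
    r_Z = Z2[v2[inside], u2[inside]] - Pt[inside, 2]
    return r_I, r_Z
```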


3.4.2.2 Lucas-Kanade approach

It is also possible to use depth information to improve the approach proposed in Section 3.2.1. It has been observed that the main problem with local optical flow tracking is the small drift accumulated in each frame during tracking. This drift is most significant in the case of out-of-plane rotations, when the depth of all facial points changes but the intensity differences perceived by the camera are small. In this case head rotation estimation can often be imprecise. The proposed improvement is adding a second tracking stage utilizing the depth values measured by the depth camera. The first stage is standard luminance-only tracking using optical flow or feature matching, as described earlier. The result of this stage is an estimate of the new pose, usually associated with a certain drift. The second tracking stage aims to reduce this drift.

In the second tracking stage the estimated mesh from the first stage is first aligned with the depth image for the new frame. The z-axis values of this mesh are then swapped with those measured by the depth sensor. This yields a new mesh which perfectly fits the current pose of the head, as all depths are equal to the sensor measurements. This can be called the temporary mesh. The problem is that this temporary mesh does not preserve the original mesh geometry (depth values can be replaced in an inconsistent manner) and no current pose transformation is given. The goal is to obtain a transformed version of the original, rigid mesh that will be aligned with the temporary mesh.

Let the points P_i be the 3D points of the mesh from the previous frame and let P_i′ denote the 3D mesh points of the temporary mesh. It is assumed that both meshes have similar shapes and that they can be aligned by matching points with the same mesh indices. This provides a set of matched point pairs. Following equations (3.8) and (3.12), and assuming the meshes can be perfectly aligned, the coordinates P_i′ can be expressed as

$$\begin{aligned}
X_i' &= X_i - Y_i\,\omega_z + Z_i\,\omega_y + t_x \\
Y_i' &= X_i\,\omega_z + Y_i - Z_i\,\omega_x + t_y \\
Z_i' &= -X_i\,\omega_y + Y_i\,\omega_x + Z_i + t_z
\end{aligned} \qquad (3.29)$$

This gives rise to the following linear least squares system for all matches:

$$\begin{aligned}
P_1' - T P_1 &= 0 \\
&\;\;\vdots \\
P_N' - T P_N &= 0
\end{aligned} \qquad (3.30)$$


This system can be solved with a typical IRLS algorithm, similarly as was explained in Section 3.2.2. This time the 3D coordinates are known directly, so the projection equations do not have to be used. Using (3.29), one correspondence provides the following relation constraining the motion vector μ = [ω_x, ω_y, ω_z, t_x, t_y, t_z]:

$$\begin{bmatrix} 0 & Z_i & -Y_i & 1 & 0 & 0 \\ -Z_i & 0 & X_i & 0 & 1 & 0 \\ Y_i & -X_i & 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \omega_x \\ \omega_y \\ \omega_z \\ t_x \\ t_y \\ t_z \end{bmatrix}
= \begin{bmatrix} X_i' - X_i \\ Y_i' - Y_i \\ Z_i' - Z_i \end{bmatrix} \qquad (3.31)$$

As before, the final motion vector is found by iteratively solving the full equation set with variable weights for all correspondences.

The described refinement stage can be intuitively explained as pulling the old mesh towards the new mesh to achieve an optimal alignment. It can be thought of as a variant of minimizing the geometric misalignment in an Iterative Closest Point fashion, but with correspondences fixed throughout the process. The comparison of this tracking method with other techniques is presented in Chapter 7.
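A minimal sketch of this refinement stage, without the IRLS reweighting, is shown below; it simply builds the linear system of equation (3.31) for all correspondences and solves it in the least squares sense.

```python
import numpy as np

def align_meshes_small_angle(P, P_prime):
    """P, P_prime: (N, 3) arrays of corresponding mesh points (previous mesh
    and temporary, depth-corrected mesh). Returns [wx, wy, wz, tx, ty, tz]."""
    N = P.shape[0]
    A = np.zeros((3 * N, 6))
    b = (P_prime - P).reshape(-1)

    X, Y, Z = P[:, 0], P[:, 1], P[:, 2]
    # Row for X':  Z*wy - Y*wz + tx
    A[0::3, 1], A[0::3, 2], A[0::3, 3] = Z, -Y, 1.0
    # Row for Y': -Z*wx + X*wz + ty
    A[1::3, 0], A[1::3, 2], A[1::3, 4] = -Z, X, 1.0
    # Row for Z':  Y*wx - X*wy + tz
    A[2::3, 0], A[2::3, 1], A[2::3, 5] = Y, -X, 1.0

    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu
```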


4 Iris localization

Accurate iris localization is, besides head pose estimation, the second core component of the proposed eye gaze tracking system. This chapter describes the optimal configuration of the algorithm which has been found after extensive testing of various possibilities. A partial description of this chapter can be found in the author’s previous work [57].

Similarly to what has been suggested in the literature, the proposed iris localization algorithm is two-stage. The coarse iris localization is performed using a modified Radial Symmetry Transform (RST) algorithm [147, 149]. This algorithm has been found to provide good accuracy and speed while allowing straightforward formulation of region constraints. Most importantly, however, it performs well on low-resolution eye images – much better than the Starburst algorithm [144] or ellipse fitting on detected edges [46]. It is also faster than the Circular Hough Transform and does not require explicit detection of iris edges. Therefore, no fixed thresholds for edges need to be specified beforehand.

The refinement step relies on a modified version of Daugman's Integro-Differential Operator [146]. It has been found to give better and more stable results than direct ellipse fitting to a set of points or edges [44, 46]. Along with the refinement step it will be described how to adaptively select the potential iris boundary to increase the localization robustness and accuracy.

4.1 Coarse iris localization

The first, coarse iris center localization step is actually performed by the head pose tracking algorithm. Based on the initial eyeball rotation center positions and the eye corners detected by the flexible model in the first frame, the eye regions can be mapped to the 3D face model during initialization. From then on, these regions are tracked under perspective projection and their boundaries in image coordinates are known in every frame. Such region extraction has been found to be more stable than using the flexible model directly, as the flexible model gets lost much more easily than the hybrid head pose tracking algorithm presented in Chapter 3. The typical appearance of the extracted eye regions when using a high quality webcam (Logitech C920) is shown in Figure 4.1.


Figure 4.1. Extracted eye regions using a high quality webcam, magnified 300%, user distance = 50 cm. Left: illumination from top of head, right: ideal illumination.

Once the eye regions are known, the proper coarse iris localization algorithm can be performed. This will now be explained in detail.

The Radial Symmetry Transform is a voting technique. It transforms an input image 𝐼𝐼 into an output image 𝑆𝑆 while capturing the radial symmetry characteristics of the region. It can be perceived as a modified Circular Hough Transform, with each point voting for a single circle center based on local gradient, instead of blindly voting for all points that are a certain distance away as potential circle centers. This is illustrated in Figure 4.2. The three points marked with black dots vote for the circle center based on the gradient direction at their location. The radius is assumed to be known.

Figure 4.2. Illustration of radial symmetry transform voting.

In the rest of this chapter the symbol I will refer to both the image domain and the image intensity function. Let each pixel in the input image be denoted by p = (x, y). Each such pixel has an associated gradient at its location. This gradient has a certain magnitude M and orientation O. The gradient orientation can be understood as the orientation of a unit vector directed in the same direction as the gradient. In case of irises the sought circle is always darker than the surrounding sclera, therefore the gradient orientation is defined as opposite to the normal gradient direction: it is directed towards decreasing image intensity. Assuming a known circle radius r, the orientation vector is scaled to have exactly this radius length. It is then used to determine the point which is upvoted as the circle center. In the original articles [147, 149] the gradient strength M directly determines how much the candidate center point is upvoted.

Let L_r(p) denote the center location for which an image point p votes, assuming radius r. The transformed image S_r can then be defined, for each candidate center location q, as

$$S_r(q) = \sum_{\substack{p \in I \\ L_r(p) = q}} M(p) \qquad (4.1)$$

Typically, the voting output is smoothed with a 2D Gaussian kernel A_r dependent on the radius. This leads to

$$S_r(q) = A_r * \sum_{\substack{p \in I \\ L_r(p) = q}} M(p) \qquad (4.2)$$

The output image S_r needs to be searched for the maximum to determine the most likely candidate center point. In practice, the radius is never known precisely beforehand. A certain range of radii is considered. If formula (4.2) was used to determine several transformed images with various radii, the results would be biased, as larger radii allow more points to vote. Therefore, the radius needs to be considered for result scaling. As the number of voting points is proportional to the circumference and 2πr ∝ r, it is sufficient to divide the sum by the current radius to remove the bias

$$S_r(q) = A_r * \frac{1}{r} \sum_{\substack{p \in I \\ L_r(p) = q}} M(p) \qquad (4.3)$$

Such transformed images can be objectively compared to find the best candidate across various radii. The mentioned publications [147, 149] actually proposed to sum all the transformed images to obtain a single combined center point likelihood map


$$S = \frac{1}{n_2 - n_1} \sum_{r=n_1}^{n_2} S_r \qquad (4.4)$$

In the tested implementation such aggregation has been found to provide very similar, but slightly inferior accuracy compared to calculating and comparing each S_r maximum separately.

The work of Zhou et al. [149] additionally suggested weighting the final results by the intensity of the candidate location. This uses the knowledge that pupil centers are usually darker than the rest of the eye region; candidate points lying in dark places should therefore be promoted. This is a sensible approach and is also used in the implementation of the proposed iris localization module.

4.1.1 Key modifications of the original algorithm

The algorithm described so far has been implemented as part of the proposed eye gaze tracking system. As a result of extensive testing, however, several improvements building on prior work have been made.

First of all, it is known that large parts of the iris are usually occluded by the eyelids. These are typically the top and bottom parts. Based on observations, it has been decided to use for voting only those pixels p of the input image I which meet a certain criterion related to the gradient

$$O_x > \kappa \cdot O_y \qquad (4.5)$$

This simply means that only points lying on vertical edges are considered, with κ being a factor determining the required 'verticality'. A comparison of the original and modified voting algorithm is shown in Figure 4.3. While one can argue that points lying on the eyelids and eyelashes do not result in any meaningful votes anyway, their elimination reduces random noise of the voting process and thus improves accuracy. Additionally, gradients with a magnitude below a certain threshold are also omitted from voting. This has a similar purpose as constraining the allowed orientation.

Figure 4.3. Comparison of original (left) and modified (right) voting algorithm. White curves indicate points that take part in voting.

Another proposed improvement is to consider not only the gradient magnitude of the voting points, but also their number. In fact, it has been empirically found that eye corners often have a very strong horizontal gradient that leads to incorrect iris center voting. Requiring a large number of votes in addition to a large cumulative gradient magnitude eliminates cases where a small number of noisy points could degrade the algorithm results. It is proposed to modify the final score of each candidate center point by transforming its total initial score T_i in the following way:

$$T_i \leftarrow T_i \cdot U_i^{3/2}, \qquad i \in S_r \qquad (4.6)$$

where U_i represents the number of points that have voted for location i to be the circle center. An illustration of the proposed coarse iris center localization algorithm for several radii is shown in Figure 4.4.

Figure 4.4. Results of voting after smoothing for various radii.
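The following is a minimal sketch of the modified voting for a single radius, with assumed parameter values; it is meant only to illustrate how the verticality criterion, the magnitude threshold and the vote-count weighting fit together.

```python
import cv2
import numpy as np

def rst_votes(eye_gray, r, kappa=1.5, min_mag=10.0):
    """Modified radial symmetry voting for one radius r (sketch)."""
    gx = cv2.Sobel(eye_gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(eye_gray, cv2.CV_32F, 0, 1, ksize=3)
    mag = np.hypot(gx, gy)

    # Verticality and magnitude constraints on the voting points
    mask = (np.abs(gx) > kappa * np.abs(gy)) & (mag > min_mag)
    ys, xs = np.nonzero(mask)

    # Orientation towards decreasing intensity (the iris is darker than sclera)
    ox = -gx[ys, xs] / mag[ys, xs]
    oy = -gy[ys, xs] / mag[ys, xs]
    cx = np.round(xs + r * ox).astype(int)
    cy = np.round(ys + r * oy).astype(int)

    h, w = eye_gray.shape
    inside = (cx >= 0) & (cx < w) & (cy >= 0) & (cy < h)
    S = np.zeros((h, w), np.float32)   # accumulated gradient magnitude
    U = np.zeros((h, w), np.float32)   # number of votes per candidate
    np.add.at(S, (cy[inside], cx[inside]), mag[ys, xs][inside])
    np.add.at(U, (cy[inside], cx[inside]), 1.0)

    S = cv2.GaussianBlur(S, (0, 0), max(1.0, r / 4.0)) / r   # A_r * (1/r) sum
    return S * U ** 1.5                                      # equation (4.6)
```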

The final proposed improvement addresses the temporal coherency of incoming camera frames. While the initial iris radius has to be coarsely estimated from the distance between the eyes, the radius in the next image is quite well constrained by the iris radius in the previous image. Even in the case of abrupt head movements the iris radius does not change by more than 20% between frames, assuming 30 FPS. This observation can be used to additionally constrain the number of tested radii, thus reducing the computational complexity.

4.2 Iris location refinement

The refinement step of the iris localization algorithm also follows the algorithm presented by Zhou et al. [149], which is an extension of the original algorithm of John Daugman [146]. The circular integro-differential operator, which was already mentioned in Chapter 2, is defined as follows

$$\operatorname*{argmax}_{(r,\, x_c,\, y_c)}\; G_\sigma(r) * \frac{\partial}{\partial r} \oint_{r,\, x_c,\, y_c} \frac{I(x, y)}{2\pi r}\, ds \qquad (4.7)$$

where I(x, y) is the gray level of the image and G_σ(r) is a Gaussian smoothing filter. The above operator searches over the iris image domain (x, y) for the maximum change in pixel values with respect to increasing radius r, along a circular arc ds of radius r and center coordinates (x_c, y_c). This is illustrated in Figure 4.5. Circles with different radii are fitted to the iris boundaries and gradients directed towards the circle center are found. The gradients are added to form the total score of a given circle. After processing all the considered combinations of r, x_c, y_c, the circle with the highest score is selected. It is important to notice that this algorithm is computationally expensive, so an initial coarse iris center estimate is necessary to limit the number of processed possibilities.
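A simplified sketch of this operator for a single candidate center is given below; the sampling density, smoothing kernel and angle range are assumptions, and the adaptive arc selection described next is omitted.

```python
import numpy as np

def integro_differential(gray, xc, yc, radii, angles_deg=np.arange(-60, 61, 5)):
    """Return the radius (and its score) with the largest radial intensity
    change around the candidate centre (xc, yc)."""
    img = gray.astype(np.float32)
    h, w = img.shape
    theta = np.deg2rad(angles_deg)
    means = []
    for r in radii:
        xs = np.clip(np.round(xc + r * np.cos(theta)).astype(int), 0, w - 1)
        ys = np.clip(np.round(yc + r * np.sin(theta)).astype(int), 0, h - 1)
        means.append(img[ys, xs].mean())     # normalized line integral analogue

    # Derivative with respect to the radius, lightly smoothed (G_sigma * d/dr)
    deriv = np.gradient(np.array(means))
    deriv = np.convolve(deriv, np.array([0.25, 0.5, 0.25]), mode='same')
    best = int(np.argmax(deriv))
    return radii[best], deriv[best]
```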

As stated in [149], only arcs are considered instead of full circles in order to minimize the impact of eyelashes and eyelids, but these arcs are selected in an adaptive way, as will be described later. The other modification proposed in [149] – weighting the score by the number of points having a gradient above a threshold – has not been found to provide improvement when using the adaptive iris boundary selection algorithm. It should be pointed out that previous work utilizing this type of algorithm [147, 149] dealt with high quality images from infra-red cameras, where the iris and pupil were very clear. This is a different type of input data from what this thesis focuses on, so it is understandable that the algorithms behave in a slightly different way.


Figure 4.5. Illustration of iris localization refinement (Δx, Δy, Δr, ds are not to scale).

4.2.1 Adaptive arc selection

Adaptive arc selection is certainly the biggest novelty of the proposed iris localization algorithm. It is an extension of the arc selection mechanism for the coarse stage illustrated in Figure 4.3. Instead of always using a fixed characteristic of the arc, the arc is adjusted based on head pose and gaze angles. The reason for this is simple: depending on where the user is looking and how their head is rotated, a different part of the iris is visible. While the visible parts are generally two arcs in the ranges of [−80°~65°, 115°~260°], as shown in Figure 4.6, this may change significantly (Figure 4.7).

Figure 4.6. Typically visible iris boundaries.


Figure 4.7. Changes in visible iris boundaries. Left: frontal pose, frontal gaze. Middle: head pose top-right, gaze frontal. Right: frontal head pose, gaze top-right. Images obtained using frontal lighting placed before the eyes.

In order to account for the fact that eye and head movements cause different boundaries to be visible, the following methodology is proposed. Four separate angles θ_lt, θ_ll, θ_rt, θ_rl are established, as shown in Figure 4.6. These angles are chosen to determine the visible top and bottom part of the left and right iris boundary. It is important to consider each side of the iris differently, as the iris boundary that is closer to the eye corner is usually occluded by the eyelids to a much larger extent than the opposite iris boundary.

In order to calculate the angles θ_lt, θ_ll, θ_rt, θ_rl it is helpful to notice that they depend on the relative head pose direction and gaze direction. If the gaze is directed in the same direction as the head, the iris placement in the eye region will be the same regardless of the head pose. The relative head and gaze angles can be defined as follows:

$$\begin{aligned}
r_x &= g_x - h_x \\
r_y &= g_y - h_y
\end{aligned} \qquad (4.8)$$

where h_x, h_y are head pose orientation angles and g_x, g_y are gaze angles in the horizontal and vertical planes. These parameters are taken from the previous frame, as the gaze direction is still unknown in the current frame at this point. Because of temporal continuity this technique works very well in almost all situations.

Having defined r_x and r_y, the four iris boundary angles can be calculated using the following formulas:

$$\begin{cases}
\theta_{lt} = 65° + \alpha_l \cdot \min(r_x, 0) + \beta_t \cdot r_y \\
\theta_{ll} = 80° + \alpha_l \cdot \min(r_x, 0) + \beta_l \cdot r_y \\
\theta_{rt} = 65° + \alpha_r \cdot \min(r_x, 0) + \beta_t \cdot r_y \\
\theta_{rl} = 80° + \alpha_r \cdot \min(r_x, 0) + \beta_l \cdot r_y
\end{cases} \qquad (4.9)$$

The values of α_l, α_r, β_t, β_l have been found experimentally based on the available test sequences. The min() operator reflects the fact that horizontal movements of the eyes can only cause one side of the iris to be occluded, without significantly impacting the other side. On the other hand, vertical eye movements can cause the top and bottom of the iris to both become occluded or visible to a larger extent, depending on the eye movement relative to the head. While equation (4.9) is a linear simplification of the real geometry, a more exact calculation is problematic because the eyelids and eyelashes move together with the eyes.
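A minimal sketch of equations (4.8) and (4.9) is shown below; the coefficient values are placeholders, since the thesis determines them experimentally.

```python
# Assumed placeholder coefficients; the thesis finds alpha_l, alpha_r,
# beta_t, beta_l experimentally on the available test sequences.
ALPHA_L, ALPHA_R = 0.5, -0.5
BETA_T, BETA_L = 0.4, 0.4

def adaptive_arc_angles(gaze_deg, head_deg):
    """gaze_deg, head_deg: (horizontal, vertical) angles, taken from the
    previous frame as described in the text."""
    rx = gaze_deg[0] - head_deg[0]        # equation (4.8)
    ry = gaze_deg[1] - head_deg[1]

    theta_lt = 65.0 + ALPHA_L * min(rx, 0.0) + BETA_T * ry   # equation (4.9)
    theta_ll = 80.0 + ALPHA_L * min(rx, 0.0) + BETA_L * ry
    theta_rt = 65.0 + ALPHA_R * min(rx, 0.0) + BETA_T * ry
    theta_rl = 80.0 + ALPHA_R * min(rx, 0.0) + BETA_L * ry
    return theta_lt, theta_ll, theta_rt, theta_rl
```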

The visual results of the proposed adaptive algorithm are shown in Figure 4.8 and Figure 4.9. The first figure shows the estimated iris boundaries in the case when the head pose is frontal and the user gazes at the screen center and four screen corners. The second figure shows the case when the user continues to gaze at the screen center and rotates the head both horizontally and vertically. The provided images prove that the iris boundaries are highly dependent on the gaze and head direction, and also demonstrate that the proposed algorithm can adjust the analyzed arcs of the iris efficiently. A qualitative evaluation of the accuracy of the proposed iris center localization algorithm is performed in Chapter 7.

To further reduce the influence of eyelashes, eyelids and other skin parts when calculating the sum in equation (4.7), only a subset of the gradients is used for the calculations. To be specific, 10% of the points with the smallest gradients in the direction of the center of the analyzed circle are discarded. Together with the adaptive arc selection algorithm, this virtually ensures that in most cases only boundary points of the iris impact the calculated gradient sum.


Figure 4.8. Illustration of adaptive iris boundary selection. The head pose is frontal, gaze directions vary.

Figure 4.9. Illustration of adaptive iris boundary selection. The head pose varies, gaze direction is kept frontal.


4.2.2 Other key modifications of the original algorithm

Similarly to the coarse iris localization algorithm, temporal constraints are used to limit the number of analyzed arcs. The iris radius is assumed to be fairly constant – it cannot change by more than 20% between consecutive frames. This constraint is combined with another one: the iris radius determined in the coarse localization stage and in the refinement stage cannot differ by more than 20%. These two conditions significantly limit the number of radii that are analyzed by the refinement algorithm.

To achieve an even bigger speed-up, the algorithm is performed in a coarse-to-fine fashion: the steps Δr, Δx, Δy are larger in the first stage and smaller in later stages. Altogether three stages are used, which provides high efficiency with the desired accuracy. Apart from using different step sizes, the author proposes to use different arc shapes. Different arc shapes effectively allow not only circles but also ellipses to be fitted to the iris boundaries. The iris often has an elliptical shape in the registered camera image, so fitting ellipses allows potentially better accuracy.

The ellipse shape can be defined by the ratio between the minor and major axis. At most 9 ellipses are used by the iris localization algorithm. These are ellipses with ratios between the vertical and horizontal axis equal to: 0.90, 0.925, 0.95, 0.975, 1.00, 1.025, 1.05, 1.075, 1.10. An overview of these elliptical shapes is shown in Figure 4.10. To ensure that the considered ellipses are of similar size, the sum of the minor and major axis is kept constant and equal to four times the current radius length.

Figure 4.10. Elliptical shapes used in refinement stage.

The structure of the refinement algorithm stages is shown in Table 4.1. The index 1 in r_1, x_c1, y_c1 indicates that these are results of the coarse algorithm, while higher indexes refer to results of the refinement algorithm stages.


Stage | Δr  | Δx  | Δy  | r_min     | r_max     | x_c,min    | x_c,max    | y_c,min    | y_c,max    | ellipses
1     | 1   | 1   | 1   | ⌊0.8·r_1⌋ | ⌊1.2·r_1⌋ | x_c1 − 3   | x_c1 + 3   | y_c1 − 3   | y_c1 + 3   | 1
2     | 1/2 | 1/2 | 1/2 | r_2 − 1/2 | r_2 + 1/2 | x_c2 − 1   | x_c2 + 1   | y_c2 − 1   | y_c2 + 1   | 3
3     | 1/2 | 1/4 | 1/4 | r_3 − 1/2 | r_3 + 1/2 | x_c3 − 1/4 | x_c3 + 1/4 | y_c3 − 1/4 | y_c3 + 1/4 | 9

Table 4.1. Iris localization refinement algorithm stage structure, units are image pixels.

Such structuring has been found to combine good accuracy with relatively low computational cost. The final result r_4, x_c4, y_c4 is considered to be the true center of the iris.
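The stage structure of Table 4.1 can be sketched as a nested search, as shown below; the scoring function ido_score and the subset of ellipse ratios used per stage are assumptions introduced only for illustration.

```python
import numpy as np

STAGES = [
    # (dr, dx, dy, r_range, c_range, n_ellipses)
    (1.0, 1.0, 1.0, None, 3.0, 1),    # stage 1: r in [0.8*r1, 1.2*r1]
    (0.5, 0.5, 0.5, 0.5, 1.0, 3),     # stage 2
    (0.5, 0.25, 0.25, 0.5, 0.25, 9),  # stage 3
]

def refine_iris(eye_img, r1, xc1, yc1, ido_score):
    """ido_score(img, r, xc, yc, ratio) is assumed to return the arc-based
    integro-differential score of one candidate ellipse."""
    r, xc, yc = float(r1), float(xc1), float(yc1)
    for dr, dx, dy, r_rng, c_rng, n_ell in STAGES:
        if r_rng is None:
            radii = np.arange(0.8 * r, 1.2 * r + 1e-6, dr)
        else:
            radii = np.arange(r - r_rng, r + r_rng + 1e-6, dr)
        # Illustrative subset of the axis ratios listed above
        ratios = np.linspace(0.9, 1.1, n_ell) if n_ell > 1 else np.array([1.0])
        best = (-np.inf, r, xc, yc)
        for rr in radii:
            for x in np.arange(xc - c_rng, xc + c_rng + 1e-6, dx):
                for y in np.arange(yc - c_rng, yc + c_rng + 1e-6, dy):
                    for q in ratios:
                        s = ido_score(eye_img, rr, x, y, q)
                        if s > best[0]:
                            best = (s, rr, x, y)
        _, r, xc, yc = best
    return r, xc, yc
```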


5 Proposed system

The aim of this work is to propose the most accurate eye gaze tracking system possible under the assumption that only passive vision sensors are used (an RGB and optionally a depth camera). An additional requirement for the proposed system is to allow unconstrained head movements of the user, which is a necessity if the system is to be useful in real world scenarios.

Based on the existing literature it is clear that all appearance-based methods, despite recent advances [26], are unable to fully compensate for head movements. Therefore, an approach based on a geometrical model is certainly more promising. Furthermore, high accuracy of eye gaze tracking requires high accuracy of component feature tracking. Sesma et al. [163] performed a study to evaluate the accuracy of estimating the pupil center based on eye corners. They claim that the eye corners move both with eye and head movements, and that the best possible accuracy that can be achieved when relying on eye corner locations to estimate the gaze is around 3°. If it is additionally considered how difficult eye corner localization can be for low quality webcam images (Figure 5.1), it becomes clear that relying on eye corners is not ideal in a high-accuracy eye gaze tracking system.

Figure 5.1. Comparison of high quality eye region from a head mounted camera (left) and a Sony LifeCam Studio HD webcam at 70cm distance (right).

As will be shown in Chapter 7, iris localization can be done much more accurately than eye corner localization. In fact, the iris is the one feature in the entire face that can be localized with subpixel accuracy in almost every frame of even low quality and low resolution images. This is the reason why the proposed system relies only on the iris being localized in the facial image. It will be shown that using high-precision iris localization and head pose tracking, a high-accuracy real time eye gaze tracking system can be created.

5.1 Proposed eye model

A relatively detailed eye model showing the gaze formation mechanisms was shown in Chapter 1 in Figure 1.3. In a typical eye model used in eye gaze tracking several approximations are usually made [51]. For simplicity, only the optical and visual axes are considered. The nodal points are assumed to be located in the same place as the center of corneal curvature. This point is also approximated as the point where the optical and visual axes intersect each other. The entrance pupil center and exit pupil center are assumed to be one point – the pupil center. Also, both the eyeball and the cornea are assumed to be spheres, despite being ellipsoids [164]. Such a simplified typical eyeball model is shown in Figure 5.2. Throughout the rest of this work torsional movements of the eyeball will not be considered. These movements can be described as in-plane eyeball rotations which cause the angles between the optical and visual axis to change. Few methods consider such movements [54], and to the best of the author's knowledge no commodity camera based eye gaze tracking system has considered them to date.

Figure 5.2. Typical eye and gaze model.


It should be noted that most webcam based eye gaze tracking methods assume an even more simplified model where the visual and optical axes are the same [42, 44, 50], as does the author’s own previous work [57]. In fact, apart from methods using a depth sensor, the work of Chen et al. [55] might be the only eye gaze tracking system not using IR illumination that uses this more complicated model. In order to find the deviation between the optical and visual axis, as well as several other eye parameters, Chen et al. propose to use a 9-point calibration procedure.

This dissertation proposes to use the same eye model as shown in Figure 5.2. In the author's opinion, such an eye model is sufficiently accurate so as not to limit the eye gaze tracking system accuracy due to methodological error (state-of-the-art IR systems using this eye model achieve average accuracy better than 0.5° [2]). At the same time, the proposed eye model captures a significant phenomenon that is ignored in the typical webcam eye model – the fact that the eyeball rotation center does not lie on the visual axis. In case of head movements, apart from tracking the head pose itself, correct estimation of the eyeball rotation center is absolutely crucial to estimate the gaze in model-based approaches. Without it the information about head pose changes cannot be used correctly.

For further analysis the proposed eye model is shown again in Figure 5.3 with labelled parameters. They are as follows:

• Radius of eyeball – R_r
• Radius of corneal curvature – R_c
• Distance between the center of the eyeball and the center of corneal curvature – d_EC
• Distance between the center of the pupil and the center of corneal curvature – d_CP
• Distance between the center of the eyeball and the apex of the cornea – d_EA
• Total angle between visual and optical axis – γ
• Horizontal angle between visual and optical axis – α
• Vertical angle between visual and optical axis – β


Figure 5.3. Proposed eye gaze model with labelled parameters. Used abbreviations: F – fovea, E – eyeball center, C – corneal center of curvature, P – pupil center, A – apex of cornea, O – observed object.

This dissertation presents a model-based eye gaze tracking system relying on the described eye model. Furthermore, the described system does not rely on complex user calibration. Only a single-point initialization procedure is assumed. This effectively means that the eye model, all of its parameters and all 3D positions of eye structures have to be known from a single image of the user's face and eyes, with the assumption that the observed point is known.

Under these conditions, assuming a single camera and no IR glints, it is impossible to calculate the unknown eye parameters, as they cannot be measured directly in the image. Therefore, the author proposes to approximate the values of these parameters by typical values known from statistics. Such typical values are listed in Table 5.1 based on well-known literature [6, 51]. Additionally, an average value of the interpupillary distance is listed based on the statistical study in [165]. The interpupillary distance is roughly the same as the distance between the eyeball rotation centers and it changes very little for a given person regardless of eye and head movements. It has been found to follow a roughly Gaussian distribution, with over 90% of the measured samples between 60 mm and 68 mm. A value of 63 mm or 64 mm is likely to be quite close to the true distance of any random person.


Symbol | Name                                             | Average value
R_r    | Radius of eyeball                                | 12.0 mm
R_c    | Radius of corneal curvature                      | 7.8 mm
d_EC   | Distance between E and C                         | 5.3 mm
d_CP   | Distance between C and P                         | 4.2 mm
d_EA   | Distance between E and A                         | 13.1 mm
α      | Horizontal angle between visual and optical axis | 5.5°
β      | Vertical angle between visual and optical axis   | 1.0°
d_IPD  | Interpupillary distance                          | 63 mm

Table 5.1. Average eye parameter values found in literature. Please refer to Figure 5.3 for symbol explanation.
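For reference, the fixed model parameters of Table 5.1 can be gathered into a simple structure, as in the sketch below (field names mirror the symbols used in the text).

```python
from dataclasses import dataclass

@dataclass
class EyeModel:
    R_r: float = 12.0     # eyeball radius [mm]
    R_c: float = 7.8      # radius of corneal curvature [mm]
    d_EC: float = 5.3     # eyeball centre to corneal curvature centre [mm]
    d_CP: float = 4.2     # corneal curvature centre to pupil centre [mm]
    d_EA: float = 13.1    # eyeball centre to corneal apex [mm]
    alpha: float = 5.5    # horizontal visual/optical axis offset [deg]
    beta: float = 1.0     # vertical visual/optical axis offset [deg]
    d_IPD: float = 63.0   # interpupillary distance [mm]
```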

The deviation of the user's eye parameters from the assumed model parameters will degrade the accuracy of the proposed eye gaze tracking system. However, as will be shown in the next chapters, this degradation is gradual. This means that for small deviations of the parameters from the optimal values the system accuracy degradation will also be small. At the same time, the fact that eye parameters have a rather small spread among humans justifies the usage of fixed model parameters.

If the eye model parameters are known, the 3D location and orientation of the eye in the real world can be found. This is described in detail in Section 5.2.

5.2 Geometric model initialization

The proposed eye gaze tracking system assumes that a single-point initialization procedure is performed. This requires the user to look at a predetermined point for a short while. When the user is looking at this point, one of the video frames is used to initialize the system. There are two stages of initialization:

1. Determining the location and pose of the user’s head, as well as the exact locations of the user’s pupils in the real world.

2. Determining the exact locations of the user’s eyeball rotation centers.

Each of these two stages will now be described in detail.


5.2.1 Initialization of head and pupils

To begin with, several core assumptions need to be explained. It is assumed that the user begins using the system when sitting in front of the screen and camera, facing towards the screen center. Because humans have two eyes spread apart horizontally, they can estimate the horizontal rotation of their head – the yaw – very well. It is rare for people to look at an object in front of them with a horizontally rotated head. Following this, the yaw of the head is assumed to be zero in the initialization phase. It is further assumed that the user is sitting so that their head is exactly in front of the middle of the screen when considering the horizontal plane. This is a natural position and people subconsciously tend to take up such a position when facing electronic displays.

Figure 5.4. Natural head pose when looking at a display.

The camera is fully calibrated – the internal camera parameters used in the projection model (equation (3.5)) are known and the relative position of the camera and the display is also known. The exact camera calibration procedure is discussed in Chapter 7. The importance of using a calibrated camera is that if one dimension of an observed object is known, the other two dimensions can be calculated based on the observed projection image.

The head and pupil initialization procedure works as follows. First, the face is located in the image using a flexible model type of algorithm (AAM or CLM), as was described in Chapter 3. The flexible model allows precise image-coordinate localization of the face and of the main facial features – eyes, nose and mouth. It also allows a coarse head pose to be estimated, based on the relative positions of the model points. As has already been mentioned, the horizontal head pose is assumed to be frontal, while the other head rotations – pitch and roll – can be estimated quite well by the flexible model.

One of the core concepts of the proposed eye gaze tracking system is estimating the initial head location relative to the screen and camera based on the distance between the eyes. The illustration in Figure 5.5 shows the moment of system initialization when the user is looking at a predefined point with both eyes. It has already been stated that the interpupillary distance is assumed to be known (Table 5.1). If the user is in front of the camera in the horizontal plane, and the head yaw is zero, the camera will observe the full length of the interpupillary distance d_IPD regardless of the other head rotations (pitch, roll) and of the head position in the vertical plane.

Figure 5.5. Illustration of binocular vision in system initialization phase.

The interpupillary distance can be measured by localizing the irises of both eyes, as the pupils and irises are concentric. One important thing to consider is how the distance between observed pupils changes depending on how far the gaze is focused. In order to measure the significance of this phenomenon, typical use cases will be analyzed.


For the purpose of this calculation the horizontal plane is approximated by the visual plane formed by the eyes and point of gaze. This is an acceptable simplification given the purpose of the calculation.

As is shown in Figure 5.5, the actual gaze direction, determined by the visual axis of each eye, deviates from the optical axes. At this stage it is assumed that this deviation in the horizontal plane is α = 5.5° towards the center (Table 5.1). The pupils, which lie on the optical axes, are placed more outward than the gaze direction by this angle. If the distance between the user's head (point M) and the observed point on the display is d and the interpupillary distance is d_IPD, the gaze angle in the horizontal plane is given by

$$\theta_1 = \operatorname{arctg}\left(\frac{2d}{d_{IPD}}\right) \qquad (5.1)$$

The actual optical axis horizontal angle, which determines the pupil center location, is given by

$$\theta_2 = \operatorname{arctg}\left(\frac{2d}{d_{IPD}}\right) + 5.5° \qquad (5.2)$$

Finally, the change in the position of the pupil compared to the situation where the optical axes of both eyes are parallel to the z axis is given by θ_3 = θ_2 − 90° and is shown in Figure 5.6.

Figure 5.6. Analysis of pupil deviation when looking at initialization point (top view).


The triangle formed by the z axis, the optical axis and the pupil shift is isosceles. Using simple trigonometry, the unknown edge length can be found as

$$\Delta d_{IPD} = 2L \cdot \sin\left(\frac{\theta_3}{2}\right) = 2L \cdot \sin\left(\frac{\operatorname{arctg}\left(\frac{2d}{d_{IPD}}\right) - 84.5°}{2}\right) \qquad (5.3)$$

To determine whether this interpupillary distance change is significant, it has been calculated for two extreme head pose distances from the screen, assuming the average eye parameters from Table 5.1. The results are shown in Table 5.2.

Distance of user's head from display | Δd_IPD
40 cm | 0.21 mm
80 cm | 0.73 mm

Table 5.2. Interpupillary distance deviation depending on head distance from display.

The deviation observed by the camera is different from the actual Δd_IPD, but as sin α ≈ α for small angles, this can be neglected. The calculated range of Δd_IPD values is small – only around 0.5 mm for one eye. This results in a total error of 1 mm at maximum. As such values are smaller than the standard deviation of the interpupillary distance among humans, they can be neglected without causing any substantial degradation of the eye gaze tracking system accuracy.

In further derivations it will be assumed that the distance between the pupils observed by the camera is equal to the interpupillary distance given in Table 5.1. In order to demonstrate how the initial head and pupil locations are determined under this assumption, a side view of the initialization phase is shown in Figure 5.7. The same situation is also shown in a perspective view for better reference – Figure 5.8.

The world coordinates of the pupils are calculated as follows. First, the position of the center point between the eyes, M, is found. This is done using simple trigonometry and the camera projection formulas. As has been stated earlier, in the initialization frame the camera can observe the full length of the user's interpupillary distance. Let the observed pupil center pixels be denoted as p_1, p_2 and the real world pupil positions as P_1, P_2. Using equation (3.6), the distance between the eyes in the image plane is related to the distance between the eyes in the real world by

$$d_{p_1 p_2} = \frac{f}{M_z}\, d_{IPD} \qquad (5.4)$$

Figure 5.7. Head pose and pupil initialization illustration (side view).

Figure 5.8. Head pose and pupil initialization illustration (perspective view).

This leads to the following formula for the z-axis world coordinate of the center point between the pupils, M:

$$M_z = f\, \frac{d_{IPD}}{d_{p_1 p_2}} \qquad (5.5)$$

The real world distance d_IPD is the interpupillary distance and it is known from the model parameters, while the observed distance d_p1p2 is a simple calculation:

$$d_{p_1 p_2} = \sqrt{(p_{1x} - p_{2x})^2 + (p_{1y} - p_{2y})^2} \qquad (5.6)$$

The z-axis position of the face center M_z is assumed to be equal to the z-axis position of each pupil P_z. This holds under the assumed condition that the head yaw and the horizontal translation relative to the camera are both zero. Alternatively, if an accurate head yaw estimation module is used, the values of P_z for each pupil can be calculated more accurately – this is however beyond the scope of this thesis. Once the coordinates P_z are known, the remaining coordinates P_x, P_y of each pupil can be calculated from the inverse perspective projection equation (3.7) using the observed image coordinates

$$\begin{bmatrix} P_x \\ P_y \end{bmatrix} = \frac{P_z}{f} \begin{bmatrix} p_x - c_x \\ p_y - c_y \end{bmatrix} \qquad (5.7)$$

Similarly, the coordinates 𝑀𝑀𝑥𝑥,𝑀𝑀𝑦𝑦 can be found. As the camera is calibrated both with regard to internal and external parameters, world coordinates of the point of gaze 𝑂𝑂 are known. Once the pupil positions are found, the vertical angle 𝜑𝜑 at which the user is looking at 𝑂𝑂 can be calculated as

$\varphi = \mathrm{arctg}\frac{\pm|OO'|}{M_z} = \mathrm{arctg}\frac{O_y - M_y}{M_z}$    (5.8)

The angle $\varphi$ is crucial for determining the eyeball rotation centers. This process is described in detail in Section 5.2.2.
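For illustration, the initialization computations of (5.5)-(5.8) can be summarized in a short Python sketch. This is not the thesis implementation: the function name and data layout are invented, a pinhole camera with focal length f in pixels and principal point c is assumed, both pupils are placed at the same depth as M, and all world coordinates are in millimeters.

import math

def init_pupils_and_gaze_angle(p1, p2, O_y, f, c, d_ipd=63.0):
    # p1, p2: observed pupil centers in pixels; O_y: y world coordinate of the known gaze point O
    # f: focal length in pixels; c: principal point (c_x, c_y); d_ipd: assumed interpupillary distance in mm
    d_p1p2 = math.hypot(p1[0] - p2[0], p1[1] - p2[1])        # eq. (5.6)
    M_z = f * d_ipd / d_p1p2                                 # eq. (5.5)
    pupils = []
    for p in (p1, p2):
        P_x = M_z / f * (p[0] - c[0])                        # eq. (5.7), with P_z assumed equal to M_z
        P_y = M_z / f * (p[1] - c[1])
        pupils.append((P_x, P_y, M_z))
    M_y = 0.5 * (pupils[0][1] + pupils[1][1])
    phi = math.atan2(O_y - M_y, M_z)                         # eq. (5.8); atan2 handles the sign
    return pupils, phi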

Once the exact 3D locations of the pupils in world coordinates are known, this information is used together with the flexible model based coarse head pose estimation to initialize a 3D mesh for head pose tracking. This has been described in Chapter 3. The result of this process is a 3D face model with known world coordinates. This model can be used for constraining the eyeball rotation centers with the head pose.

5.2.2 Initialization of eyeball rotation centers

Determining the eyeball rotation centers is the second initialization phase, performed after finding the pupil world coordinates and initializing the head pose tracking algorithm. As the process is analogous for both eyes, only the case of one eye will be discussed.

The 3D pupil center coordinates determined as described in Section 5.2.1 are denoted by $P$. For each eye, apart from this point, the visual axis direction is known, because the user is looking at a known point. However, the visual axis angle determined in the previous section ($\varphi$) does not correspond to the true visual axis, because it was calculated by projecting a line through points $O$ and $P$, instead of $O$ and $P'$, as shown in Figure 5.9. This means that the found visual axis and the optical axis of the eye form an angle $\gamma'$ and not $\gamma$. In order to find the true eye center $E$, the real visual axis needs to be established, as only the angle $\gamma$ is known from statistical research.

Figure 5.9. Eyeball rotation center calculation during initialization (cross-sectional view).

It is possible to determine the deviation between the measured visual axis and the true visual axis using known eyeball parameters from Table 5.1. For this purpose the triangle $OCP$ is shown in Figure 5.10. For better clarity, the side lengths are depicted disproportionally. The goal is to determine the deviation $\Delta\gamma$. The angle $\gamma$ and the distance between the center of corneal curvature and the center of the pupil, $d_{CP}$, are known from statistics. Also, based on Figure 5.8 the distance $d_{PO}$ ($PO$ meaning $P_1O$ or $P_2O$) can be found as

$d_{PO} = \sqrt{d_{PM}^2 + d_{MO}^2}$    (5.9)

where each of the points $P$, $M$ and $O$ has known world coordinates.

Figure 5.10. Triangle OCP showing the deviation between measured and true visual axis.

Because $CPP'$ is a right-angled triangle, the distance between $P$ and $P'$ can be determined as

$h = d_{CP} \cdot \mathrm{tg}\,\gamma$    (5.10)

Using the fact that the angles $\angle CP'P$ and $\angle PP'O$ are supplementary, the law of sines for triangle $OPP'$ allows us to write

$\frac{h}{\sin(\Delta\gamma)} = \frac{d_{PO}}{\sin(90° + \gamma)}$    (5.11)

Hence

$\Delta\gamma = \arcsin\left(\frac{h}{d_{PO}}\cos\gamma\right)$    (5.12)

From equations (5.9), (5.10) and (5.12) the final formula for $\Delta\gamma$ is

$\Delta\gamma = \arcsin\left(\frac{d_{CP} \cdot \mathrm{tg}\,\gamma}{\sqrt{d_{PM}^2 + d_{MO}^2}}\cos\gamma\right) = \arcsin\left(\frac{d_{CP} \cdot \sin\gamma}{\sqrt{d_{PM}^2 + d_{MO}^2}}\right)$    (5.13)
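As a quick numeric sanity check of (5.13), the deviation can be evaluated in a few lines of Python. The parameter values below are illustrative assumptions only (in the thesis $d_{CP}$ and $\gamma$ come from Table 5.1, which is not reproduced here); they merely show that $\Delta\gamma$ comes out on the order of hundredths of a degree.

import math

def delta_gamma(d_cp, gamma, d_pm, d_mo):
    # eq. (5.13): deviation between the observed and the true visual axis
    return math.asin(d_cp * math.sin(gamma) / math.hypot(d_pm, d_mo))

# Assumed example values: d_CP ~ 4.2 mm, gamma ~ 5 deg, eye roughly 60 cm from the gaze point
print(math.degrees(delta_gamma(4.2, math.radians(5.0), 600.0, 100.0)))   # about 0.03-0.04 deg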

This formula can also be obtained by applying the law of sines to triangle $OCP$ directly. Substituting $d_{CP}$ and $\gamma$ with values from Table 5.1 and $d_{PO}$ with a typical distance of 60 cm gives $\Delta\gamma = 0.038°$ in the horizontal plane (where the deviation between the optical and visual axis is largest). This shows that the deviation between the observed visual axis and the true visual axis is two orders of magnitude smaller than the deviation between the true visual and optical axes approximated from statistics. Therefore, this discrepancy can be neglected without visibly influencing the accuracy of the system. In the rest of this work it will be assumed that the observed visual angle $\varphi$ is equal to the true visual angle.

While the deviation between the observed and true visual axis angles is negligible, the translation between the pupil positions $P$ and $P'$ needs to be accounted for. It cannot be assumed that the observed pupil center is $P'$, as that would lead to large errors in eyeball center estimation. It will now be explained how to determine the eyeball rotation center $E$ using the relations from Figure 5.9. For better explanation, a perspective view of the same situation is shown in Figure 5.11. For clarity only one eye is shown, with the optical axis and the observed visual axis. As the observed visual axis goes through the pupil center, both the optical and the visual axis are drawn through the pupil center.

It was mentioned earlier that the angle $\varphi$ is important for determining the eyeball centers. This was in practice the vertical gaze angle of both eyes, calculated relative to the face center (between the eyes). The horizontal gaze direction was assumed to be parallel to the z axis. When considering one eye only, the visual axis (gaze direction) of this eye has two gaze angles $\varphi_v$ and $\varphi_h$ in the vertical and horizontal plane. These angles can be calculated for each eyeball similarly to how $\varphi$ was calculated for the resultant gaze in (5.8):

$\varphi_v = \mathrm{arctg}\frac{O_y - P_y}{P_z}, \qquad \varphi_h = \mathrm{arctg}\frac{|O_x - P_x|}{P_z}$    (5.14)

It should be stressed that the angles $\varphi_v$ and $\varphi_h$ are not equal to the angles $\varphi_v'$ and $\varphi_h'$ given by

$\varphi_v' = \mathrm{arctg}\frac{O_y - P_y}{d_{PO'}}, \qquad \varphi_h' = \mathrm{arctg}\frac{|O_x - P_x|}{d_{PO''}}$    (5.15)

Figure 5.11. Eyeball rotation center calculation during initialization (perspective view).

The position of the point $E$ is equal to the position of $P$ translated by the vector $\vec{PE}$. The length of the vector $\vec{PE}$ is known from Table 5.1, as a sum of two known distances along the optical axis. The angle of this vector relative to the observed visual axis is also known and, as explained earlier, is equal to $\gamma$. The horizontal and vertical components of $\gamma$ are denoted as $\alpha$ and $\beta$. The relations between the angles $\alpha, \beta$ and $\varphi_v, \varphi_h$ are shown in Figure 5.11. The axis drawn through points $A$ and $P$ is parallel to the z axis of the camera and display. The aim is to determine the translations $\Delta x, \Delta y, \Delta z$ of $E$ relative to $P$ in world coordinates. To begin with, the angle $\delta$ will be found. The law of sines for triangles $EE'P$ and $AA'P$ states that

$\frac{|PE|}{\sin 90°} = \frac{|EE'|}{\sin\delta}, \qquad \frac{|PA'|}{\sin 90°} = \frac{|AA'|}{\sin(\varphi_v - \beta)}$    (5.16)

Hence, as $|EE'| = |AA'|$, dividing the equations by each other gives

$\frac{|PE|}{|PA'|} = \frac{\sin(\varphi_v - \beta)}{\sin\delta}$    (5.17)

From the properties of triangle $EE'P$ the edge length $|PE'|$ can be expressed as

$|PE'| = |PE| \cdot \cos\delta$    (5.18)

Similarly, from the properties of triangle $AE'P$

$|PA| = |PE'| \cdot \cos(\alpha - \varphi_h)$    (5.19)

Also, from the properties of triangle $AA'P$

$|PA'| = \frac{|PA|}{\cos(\varphi_v - \beta)}$    (5.20)

Using (5.18) and (5.19) in (5.20) gives

|𝑃𝑃𝐴𝐴′| =

|𝑃𝑃𝐸𝐸| ∙ 𝑐𝑐𝑐𝑐𝑠𝑠𝛿𝛿 ∙ 𝑐𝑐𝑐𝑐𝑠𝑠(𝛼𝛼 − 𝜑𝜑ℎ)𝑐𝑐𝑐𝑐𝑠𝑠(𝜑𝜑𝑣𝑣 − 𝛽𝛽) (5.21)

Using this result in equation (5.17) leads to

$\frac{|PE| \cdot \cos(\varphi_v - \beta)}{|PE| \cdot \cos\delta \cdot \cos(\alpha - \varphi_h)} = \frac{\sin(\varphi_v - \beta)}{\sin\delta}$    (5.22)

After simplification this gives

$\mathrm{tg}\,\delta = \mathrm{tg}(\varphi_v - \beta) \cdot \cos(\alpha - \varphi_h) = \mathrm{tg}(\varphi_v - \beta) \cdot \cos(\varphi_h - \alpha)$    (5.23)

And

$\delta = \mathrm{arctg}\left[\mathrm{tg}(\varphi_v - \beta) \cdot \cos(\varphi_h - \alpha)\right]$    (5.24)

Once the angle $\delta$ is known, the length of the edge $|PE'|$ is given by (5.18). Using this, the translations of $E$ relative to $P$ can be found using simple trigonometry

$\Delta x = |AE'| = |PE| \cdot \cos\delta \cdot \sin(\alpha - \varphi_h)$
$\Delta y = |EE'| = |PE| \cdot \sin\delta$
$\Delta z = |PA| = |PE| \cdot \cos\delta \cdot \cos(\alpha - \varphi_h)$    (5.25)

These translations make it possible to uniquely determine the position of the eyeball rotation center in world coordinates. The described calculations should be performed twice, once for each eye.
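To make the initialization procedure concrete, the sketch below implements equations (5.14), (5.24) and (5.25) for a single eye. It is an illustrative Python fragment, not the thesis implementation: the function name is invented, angles are in radians, coordinates are in millimeters, and the signs of the offsets follow the conventions of Figure 5.11 (in practice they depend on which eye is processed).

import math

def eyeball_center(P, O, alpha, beta, d_pe):
    # P: pupil center in world coordinates; O: known gaze point on the screen
    # alpha, beta: horizontal/vertical components of the visual-optical axis angle
    # d_pe: |PE|, distance from the pupil center to the eyeball rotation center
    phi_v = math.atan((O[1] - P[1]) / P[2])                               # eq. (5.14)
    phi_h = math.atan(abs(O[0] - P[0]) / P[2])
    delta = math.atan(math.tan(phi_v - beta) * math.cos(phi_h - alpha))   # eq. (5.24)
    dx = d_pe * math.cos(delta) * math.sin(alpha - phi_h)                 # eq. (5.25)
    dy = d_pe * math.sin(delta)
    dz = d_pe * math.cos(delta) * math.cos(alpha - phi_h)
    # The eyeball center lies behind the pupil along the optical axis (sign handling simplified)
    return (P[0] + dx, P[1] + dy, P[2] + dz)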

5.3 Geometric eye gaze tracking

Geometric eye gaze tracking requires tracking of the eyeball centers and pupil centers. This section explains how this is done, and also how the gaze point is calculated using this information.

5.3.1 Eyeball center tracking

After the eyeball rotation centers are found, their positions relative to the head pose model are calculated. This is straightforward, as the head pose model used for head pose tracking is initialized using the pupil positions as reference. As explained in Section 5.2, the eyeball rotation centers relative to the pupils are known.

In each consecutive frame the head pose is tracked using one of the algorithms described in Chapter 3. The tracking assumes that every point in the 3D face model is updated according to a transformation $T$

$P_t = T \cdot P_{t-1} = \begin{bmatrix} X_{t-1} - Y_{t-1}\omega_z + Z_{t-1}\omega_y + t_x \\ X_{t-1}\omega_z + Y_{t-1} - Z_{t-1}\omega_x + t_y \\ -X_{t-1}\omega_y + Y_{t-1}\omega_x + Z_{t-1} + t_z \\ 1 \end{bmatrix}$    (5.26)

where $X, Y, Z$ represent the world coordinates of the point $P$, while $t_x, t_y, t_z$ and $\omega_x, \omega_y, \omega_z$ denote translations and rotations relative to the camera. Both the head model points and the eyeball rotation centers are tracked in each consecutive frame using (5.26), starting from the positions determined in the initialization frame. The position of any given point $P_i$ at time $t$ is given by the initial position of this point $P_{i,0}$ and a composition of the transformations $T_0 \ldots T_{t-1}$.
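A minimal sketch of this per-frame update, assuming the small-angle parameterization of (5.26), is shown below; the function name and the plain-tuple representation are illustrative choices, not part of the proposed system.

def track_point(P, omega, t):
    # Small-angle rigid motion update of eq. (5.26); omega = (wx, wy, wz), t = (tx, ty, tz)
    X, Y, Z = P
    wx, wy, wz = omega
    tx, ty, tz = t
    return (X - Y * wz + Z * wy + tx,
            X * wz + Y - Z * wx + ty,
            -X * wy + Y * wx + Z + tz)

# Both the face model vertices and the eyeball rotation centers are pushed through
# the same transform in every frame, starting from the initialization frame.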

5.3.2 Pupil center tracking

The coordinates of the point of regard on the screen can be found by determining the intersection of the gaze vector and the display. The position of the display is known and fixed relative to the camera. Only the gaze vector changes, and the key problem is estimating this gaze vector as accurately as possible. It has been explained in Section 5.2 how the eyeball rotation centers are initialized. Based on this information and the updates in each frame using (5.26), the world-coordinate positions of the eyeball centers are known in every frame. Apart from the eyeball centers, the pupil centers are also found in every frame. This is done in three steps:

1. Localizing the 2D pupil centers in the camera image (described in Chapter 4).
2. Finding the z-coordinate of these points; while the head model can be used for this purpose, a more accurate method is recommended.
3. Calculating the X and Y world coordinates of the pupil centers using the projection formula (5.7).

The first of these steps has already been described in Chapter 4. It is worth noting that the pupils are detected independently in each image. Although the position from the previous frame is used to constrain the new position, the concept of tracking is absent. Tracking would not serve much purpose, as it would not make it possible to achieve subpixel location accuracy, and the pupil positions are constrained by the eye location anyway.

Finding the z coordinates of the pupils is of key importance. Small variations of this value relative to the eyeball center can cause large errors in the estimated gaze point, as the length of the gaze vector projected from the eyeball to the pupil center is less than 10 mm. This leads to the conclusion that relying on the mesh from the face model to provide this z value is suboptimal. While the face model may be tracked very accurately, it is sparse and it is not guaranteed that all of its surface is adjacent to the face and eyes. Even in the case of good alignment, errors can be caused by the mere need for interpolation, as it cannot be expected that a mesh point will always appear exactly at the location of the pupil.

That is why a different method is proposed for pupil 3D position estimation. It is assumed that the pupil trajectory covers an ellipsoid. This is a very reasonable assumption, in accordance with the eye model described in Section 5.1. Under this assumption, given a known ellipsoid size and the observed pupil location in camera coordinates, the 3D pupil coordinates can be calculated. The problem that needs to be solved is shown in Figure 5.12.

Figure 5.12. Eyeball optical and visual axes (perspective view)

The 3D eyeball center position E is known, as is the shape of the eyeball. Assuming that the trajectory of the pupil covers an ellipsoid, the following is true:

$\frac{(P_x - E_x)^2}{A^2} + \frac{(P_y - E_y)^2}{B^2} + \frac{(P_z - E_z)^2}{C^2} = R^2$    (5.27)

At the same time, the pupil center world coordinates $P_x$ and $P_y$ can be expressed using the inverse projection formula (5.7). This leads to

$\frac{\left(P_z\frac{p_x - c_x}{f} - E_x\right)^2}{A^2} + \frac{\left(P_z\frac{p_y - c_y}{f} - E_y\right)^2}{B^2} + \frac{(P_z - E_z)^2}{C^2} = R^2$    (5.28)

Equation (5.28) is quadratic with respect to $P_z$, but does not have any other unknowns. Therefore, it can be solved in the typical way. A reformulated form of this equation, meeting the canonical quadratic form $ax^2 + bx + c = 0$, is

$P_z^2\left[B^2C^2\left(\frac{p_x - c_x}{f}\right)^2 + A^2C^2\left(\frac{p_y - c_y}{f}\right)^2 + A^2B^2\right] + P_z \cdot (-2)\left[B^2C^2\frac{p_x - c_x}{f}E_x + A^2C^2\frac{p_y - c_y}{f}E_y + A^2B^2E_z\right] + \left(B^2C^2E_x^2 + A^2C^2E_y^2 + A^2B^2E_z^2 - A^2B^2C^2R^2\right) = 0$    (5.29)

The two solutions are given by

$P_{z1} = \frac{-b - \sqrt{b^2 - 4ac}}{2a}, \qquad P_{z2} = \frac{-b + \sqrt{b^2 - 4ac}}{2a}$    (5.30)

The two solutions correspond to two potential places on the surface of the eye – one is between the eyeball center and the display, while the other is on the other side of the eyeball center. Of course the correct solution is the point closer to the display and camera – and this is always taken.

Once the z coordinate of the pupil center is known, the other coordinates can be calculated using the observed image position and the inverse projection formula (5.7). This makes it possible to find the pupil center with very high accuracy relative to the eyeball center.

In a simple case, if the trajectory of the pupil is approximated by a sphere ($A = B = C = 1$), formula (5.29) simplifies to

$P_z^2\left[\left(\frac{p_x - c_x}{f}\right)^2 + \left(\frac{p_y - c_y}{f}\right)^2 + 1\right] + P_z \cdot (-2)\left[\frac{p_x - c_x}{f}E_x + \frac{p_y - c_y}{f}E_y + E_z\right] + \left(E_x^2 + E_y^2 + E_z^2 - R^2\right) = 0$    (5.31)

where $R$ is the radius of pupil movement, i.e. the distance between the pupil and the eyeball center. Based on Table 5.1, this is 9.5 mm. A comparison of using the spherical and the elliptical eyeball shape is performed in Chapter 7.
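For illustration, the spherical case of (5.31) can be solved as below. This is a hedged sketch rather than the thesis code: the function name is invented, a pinhole camera with focal length f in pixels and principal point c is assumed, and all world coordinates are in millimeters.

import math

def pupil_world_position(p, E, f, c, R=9.5):
    # p: observed pupil center in pixels; E: eyeball rotation center in world coordinates
    u = (p[0] - c[0]) / f
    v = (p[1] - c[1]) / f
    # Quadratic a*Pz^2 + b*Pz + cc = 0 from eq. (5.31) (spherical eyeball, A = B = C = 1)
    a = u * u + v * v + 1.0
    b = -2.0 * (u * E[0] + v * E[1] + E[2])
    cc = E[0] ** 2 + E[1] ** 2 + E[2] ** 2 - R * R
    disc = b * b - 4.0 * a * cc
    if disc < 0.0:
        return None                                   # the pixel ray misses the assumed eyeball sphere
    P_z = (-b - math.sqrt(disc)) / (2.0 * a)          # smaller root: the side of the eyeball facing the camera
    return (P_z * u, P_z * v, P_z)                    # eq. (5.7) gives the remaining coordinates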

5.3.3 Gaze point estimation

This section will explain how to determine the screen coordinates of the gaze point, given known eyeball rotation centers and pupil centers in world coordinates. This can be posed as a task dual to finding the eyeball rotation center in the initialization stage. Instead of having a known visual axis and an unknown optical axis, the visual axis has to be found using the optical axis. It will be helpful to go back to the axes of the eyeball previously shown in Figure 5.11. A simplified picture of this is given in Figure 5.13. In the current task the points $P$ and $E$ have known world coordinates. Finding the gaze vector is equivalent to finding the unknown gaze angles $\varphi_v, \varphi_h$.

Figure 5.13. Eyeball optical and visual axes (perspective view).

To begin with, it should be noticed that the coordinates of point $A$ are known and equal to $[P_x, P_y, E_z]$. Similarly, the coordinates of point $A'$ are known and equal to $[P_x, E_y, E_z]$. Hence, as the triangle $AA'P$ is right-angled,

$\frac{|AA'|}{|AP|} = \mathrm{tg}(\varphi_v - \beta)$    (5.32)

which leads to

$\varphi_v = \mathrm{arctg}\left(\pm\frac{|AA'|}{|AP|}\right) + \beta = \mathrm{arctg}\frac{E_y - P_y}{|E_z - P_z|} + \beta$    (5.33)

The coordinates of point $E'$ are also known and given by $[E_x, P_y, E_z]$. The properties of triangle $AE'P$ give

$\frac{|AE'|}{|AP|} = \mathrm{tg}(\alpha - \varphi_h)$    (5.34)

which leads to

$\varphi_h = \mathrm{arctg}\left(\pm\frac{|AE'|}{|AP|}\right) + \alpha = \mathrm{arctg}\frac{|E_x - P_x|}{|E_z - P_z|} + \alpha$    (5.35)

Now that the gaze angles are found, the world coordinates of the point of gaze $O$ can be calculated. Using basic trigonometry, the translations of $O$ relative to the orthographic projection of the pupil, $I$, are given by

$\Delta x = |PI| \cdot \mathrm{tg}\,\varphi_h = P_z \cdot \mathrm{tg}\,\varphi_h, \qquad \Delta y = |PI| \cdot \mathrm{tg}\,\varphi_v = P_z \cdot \mathrm{tg}\,\varphi_v$    (5.36)

This results in the following formulas for the coordinates of the gaze point 𝑂𝑂:

$O_x = I_x + \Delta x = P_x + P_z \cdot \mathrm{tg}\left(\mathrm{arctg}\frac{|E_x - P_x|}{|E_z - P_z|} + \alpha\right)$
$O_y = I_y + \Delta y = P_y + P_z \cdot \mathrm{tg}\left(\mathrm{arctg}\frac{E_y - P_y}{|E_z - P_z|} + \beta\right)$    (5.37)

The relative position and orientation of the display and camera is calibrated, and it is also assumed that the physical dimensions and pixel resolution of the display are known and fixed. Therefore, having the world coordinates of the gaze point it is straightforward to find the pixel coordinates observed by the user. The final gaze point is calculated as the average of component gaze points for each eye.

Equation (5.37) shows that the point of gaze can be found using the proposed eye and gaze model, provided that the eyeball center and pupil center positions are known. The above derivation assumes that the display is placed parallel to the x and y axes of the camera. Although such a scene arrangement is logical and used most often, a slightly different orientation of the camera introduces negligible errors in the gaze point measurement, as will be justified in Chapter 6.
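A compact sketch of this final step, implementing (5.33), (5.35) and (5.37) for one eye, is given below. It is illustrative only: the function name is invented, angles are in radians, and the per-eye sign handling hidden behind the absolute values in the derivation is not reproduced.

import math

def gaze_point(P, E, alpha, beta):
    # P: pupil center, E: eyeball rotation center, both in world coordinates (mm)
    # alpha, beta: horizontal/vertical angles between the optical and visual axis (radians)
    phi_h = math.atan(abs(E[0] - P[0]) / abs(E[2] - P[2])) + alpha   # eq. (5.35)
    phi_v = math.atan((E[1] - P[1]) / abs(E[2] - P[2])) + beta       # eq. (5.33)
    O_x = P[0] + P[2] * math.tan(phi_h)                              # eq. (5.37)
    O_y = P[1] + P[2] * math.tan(phi_v)
    return (O_x, O_y)

# The final gaze point reported by the system is the average of the per-eye results.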

5.4 Block diagram of proposed system

A block diagram of the proposed eye gaze tracking system is shown in Figure 5.14. This provides a good overview of what modules compose the system, how data flows between them and how they interact.

The author also wishes to point out that when measurements from the depth sensor are used, the initial distance of the head from the camera is not calculated by using formula (5.5), but measured directly. Thus, there is no need to assume any fixed value of the user’s interpupillary distance. This parameter is used solely in the RGB-only eye gaze tracking system.

This concludes the general description of the proposed eye gaze tracking system. A numerical analysis of the assumed initialization conditions and parameters is performed in Chapter 6. It will be shown that small deviations from the assumed model parameters and conditions do not degrade the accuracy of the proposed eye gaze tracking system significantly. An important feature of the proposed system is that it relies strongly on accurate head pose estimation and iris localization. That is why so much focus was given to finding state-of-the-art algorithms in terms of accuracy.

Figure 5.14. Block diagram of proposed system.

6 Model parameter inaccuracy analysis

Several key assumptions and statistical parameters are used in the eye gaze tracking model presented in Chapter 5. This chapter will analyze the impact of small deviations from the used assumptions and parameters. It will be shown that the impact of these deviations on eye gaze tracking accuracy is small compared to other error sources.

For easier reference, the most important formulas from the derivation in Chapter 5 will be repeated. The initial z-coordinates of the pupil centers were assumed to be the same for both pupils and equal to

$P_{z1} = P_{z2} = M_z = f\frac{d_{IPD}}{d_{p_1 p_2}}$    (6.1)

where $d_{IPD}$ is the interpupillary distance and $d_{p_1 p_2}$ is the observed distance between the pupils, not dependent on any assumed parameters. Based on this depth, the initial pupil positions were given by

$\begin{bmatrix} P_x \\ P_y \end{bmatrix} = \frac{P_z}{f}\begin{bmatrix} p_x - c_x \\ p_y - c_y \end{bmatrix}$    (6.2)

Based on the initial pupil positions and the known gaze point on the screen, the initial visual axis angles $\varphi_v, \varphi_h$ can be defined as

$\varphi_v = \mathrm{arctg}\frac{O_y - P_y}{P_z}, \qquad \varphi_h = \mathrm{arctg}\frac{|O_x - P_x|}{P_z}$    (6.3)

Using the initial pupil positions and the known initial visual axis angles, the eye centers are given by translations relative to the pupil centers

$\Delta x = |AE'| = |PE| \cdot \cos\delta \cdot \sin(\alpha - \varphi_h)$
$\Delta y = |EE'| = |PE| \cdot \sin\delta$
$\Delta z = |PA| = |PE| \cdot \cos\delta \cdot \cos(\alpha - \varphi_h)$    (6.4)

where $|PE|$ is the fixed distance between the eyeball and pupil center, and the angle $\delta$ is given by

$\delta = \mathrm{arctg}\left[\mathrm{tg}(\varphi_v - \beta) \cdot \cos(\varphi_h - \alpha)\right]$    (6.5)

These initial eye center positions are then tracked together with the head pose under perspective projection. The initial location of the eyeball centers does not influence the head pose tracking accuracy. The final gaze point is given as

$O_x = I_x + \Delta x = P_x + P_z \cdot \mathrm{tg}\left(\mathrm{arctg}\frac{|E_x - P_x|}{|E_z - P_z|} + \alpha\right)$
$O_y = I_y + \Delta y = P_y + P_z \cdot \mathrm{tg}\left(\mathrm{arctg}\frac{E_y - P_y}{|E_z - P_z|} + \beta\right)$    (6.6)

where the pupil center coordinates are calculated by solving a quadratic equation for $P_z$ and using the inverse projection equation (6.2) to find $P_x$ and $P_y$. A simplified version of this equation, assuming a spherical eyeball shape, is

$P_z^2\left[\left(\frac{p_x - c_x}{f}\right)^2 + \left(\frac{p_y - c_y}{f}\right)^2 + 1\right] + P_z \cdot (-2)\left[\frac{p_x - c_x}{f}E_x + \frac{p_y - c_y}{f}E_y + E_z\right] + \left(E_x^2 + E_y^2 + E_z^2 - R^2\right) = 0$    (6.7)

6.1 Theoretical error resulting from incorrect interpupillary distance

The distance between the pupils $d_{IPD}$ given in Table 5.1 is one of the core assumptions of the proposed eye gaze tracking system when using RGB input only. In order to reliably assess the impact of this parameter on eye gaze tracking accuracy, it will be discussed how deviations of this parameter impact the final gaze point, equation (6.6). For better clarity the matter will be discussed considering one eye only.

6.1.1 Impact on initial eyeball center position

The initial position of the eyeball center is found relative to the pupil center position, based on the gaze angles $\varphi_v, \varphi_h$. The calculation procedure first finds the pupil center and then finds the relative translations $\Delta x, \Delta y, \Delta z$. This means that the potential inaccuracy of the pupil position influences the eyeball center calculation directly. However, the gaze vector calculation method, which projects the gaze vector from one of these points to the other, will not be influenced by errors that are common to both points (other than translations of the vector beginning). The only thing that can influence the orientation of the gaze vector is the relative translations $\Delta x, \Delta y, \Delta z$. These depend on fixed parameters and on the initial gaze angles $\varphi_v, \varphi_h$.

The key issue is to establish how the initial gaze angles $\varphi_v, \varphi_h$ influence the initial eyeball center position. Using equations (6.4) and (6.5) provides

$\Delta x = |PE| \cdot \cos\left(\mathrm{arctg}\left[\mathrm{tg}(\varphi_v - \beta) \cdot \cos(\alpha - \varphi_h)\right]\right) \cdot \sin(\alpha - \varphi_h)$
$\Delta y = |PE| \cdot \sin\left(\mathrm{arctg}\left[\mathrm{tg}(\varphi_v - \beta) \cdot \cos(\alpha - \varphi_h)\right]\right)$
$\Delta z = |PE| \cdot \cos\left(\mathrm{arctg}\left[\mathrm{tg}(\varphi_v - \beta) \cdot \cos(\alpha - \varphi_h)\right]\right) \cdot \cos(\alpha - \varphi_h)$    (6.8)

Trigonometric identities state that

$\sin(\mathrm{arctg}\,x) = \frac{x}{\sqrt{1 + x^2}}, \qquad \cos(\mathrm{arctg}\,x) = \frac{1}{\sqrt{1 + x^2}}$    (6.9)

Using this in (6.8) gives

$\Delta x = |PE| \cdot \frac{\sin(\alpha - \varphi_h)}{\sqrt{1 + \mathrm{tg}^2(\varphi_v - \beta)\cos^2(\alpha - \varphi_h)}}$
$\Delta y = |PE| \cdot \frac{\mathrm{tg}(\varphi_v - \beta)\cos(\alpha - \varphi_h)}{\sqrt{1 + \mathrm{tg}^2(\varphi_v - \beta)\cos^2(\alpha - \varphi_h)}}$
$\Delta z = |PE| \cdot \frac{\cos(\alpha - \varphi_h)}{\sqrt{1 + \mathrm{tg}^2(\varphi_v - \beta)\cos^2(\alpha - \varphi_h)}}$    (6.10)

For a frontal pose the angles $\varphi_v - \beta$ and $\alpha - \varphi_h$ should be small. This makes it possible to use the following approximations:

$\sin(x) \approx \mathrm{tg}(x) \approx x, \qquad \cos(x) \approx 1$    (6.11)

which leads to

$\Delta x \approx |PE| \cdot \frac{\alpha - \varphi_h}{\sqrt{1 + (\varphi_v - \beta)^2}} \approx |PE| \cdot (\alpha - \varphi_h)$
$\Delta y \approx |PE| \cdot \frac{\varphi_v - \beta}{\sqrt{1 + (\varphi_v - \beta)^2}} \approx |PE| \cdot (\varphi_v - \beta)$
$\Delta z \approx |PE| \cdot \frac{1}{\sqrt{1 + (\varphi_v - \beta)^2}} \approx |PE|$    (6.12)

This means that for small angles, the translations $\Delta x, \Delta y$ of the eyeball center relative to the pupil center are linearly dependent on the gaze angles $\varphi_v, \varphi_h$. These angles are in turn dependent on the interpupillary distance $d_{IPD}$ in an $\mathrm{arctg}()$ fashion, as given by equation (6.3). In the typical eye gaze tracking initialization scenario with a frontal pose, $P_z$ is much larger than both $|O_x - P_x|$ and $|O_y - P_y|$.

According to [165], human interpupillary distances form an approximately Gaussian distribution with an average around 63 mm. As over 90% of the population has an interpupillary distance within $\pm 4$ mm of the average value, the borderline case that can be analyzed to give some insight into potential errors is a deviation of 4 mm. This corresponds to a relative interpupillary distance error of 6.35%. Based on formula (6.1), the same relative error will be observed for the z coordinates of the pupil centers. If $P_z$ is around ten times larger than both $|O_x - P_x|$ and $|O_y - P_y|$, which is a reasonable assumption for the proposed initialization setup, the errors of the initial gaze angles $\varphi_v, \varphi_h$ will be of around the same relative value, based on formula (6.3). Using formula (6.12), a similar error will be present for the initial eyeball center position relative to the pupil center. However, the resulting gaze point error will be compensated completely when looking at the initialization point in the initial head pose, and will be compensated to a large extent in other scenarios. The maximal estimated error will rarely materialize.
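The relative error quoted above follows directly from the numbers; a two-line, purely illustrative check (the 600 mm working distance is an assumed example value):

# A 4 mm deviation from the 63 mm average interpupillary distance:
print(round(100 * 4.0 / 63.0, 2))    # ~6.35 %, which by eq. (6.1) is also the relative error of the initial Pz
print(round(600 * 4.0 / 63.0, 1))    # ~38.1 mm depth error at an assumed 600 mm working distance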

6.1.2 Impact on gaze point estimation

By reformulating equation (6.6) using (6.1) and (6.2), the coordinates of the final gaze point can be expressed as

$O_x = \frac{d_{IPD}}{d_{p_1 p_2}}\left[(p_x - c_x) + f \cdot \mathrm{tg}\left(\mathrm{arctg}\frac{|E_x - P_x|}{|E_z - P_z|} + \alpha\right)\right]$
$O_y = \frac{d_{IPD}}{d_{p_1 p_2}}\left[(p_y - c_y) + f \cdot \mathrm{tg}\left(\mathrm{arctg}\frac{E_y - P_y}{|E_z - P_z|} + \beta\right)\right]$    (6.13)

For each coordinate the interpupillary distance $d_{IPD}$ is a multiplier of the whole expression inside the bracket, and therefore contributes linearly to the final result. Additionally, the expressions $\frac{|E_x - P_x|}{|E_z - P_z|}$ and $\frac{E_y - P_y}{|E_z - P_z|}$ depend on $d_{IPD}$. As has been derived in Section 6.1.1, the eyeball center location error relative to the pupil can be up to 6% in the borderline case. It can be assumed that the magnitude of this error does not change during head movements, when the eye position is tracked together with the head. Therefore, the final gaze point is influenced by the interpupillary distance error in two ways:

1) Linear dependency due to wrongly estimated eye depth.
2) Near-linear dependency, of the form $\mathrm{tg}(\mathrm{arctg}(\varepsilon) + x)$, due to wrongly estimated relative eyeball center and pupil center positions.

It is very important, however, that for the initial gaze point both of these errors fully compensate each other. If the estimated pupil depth is smaller than in reality, this will result in the relative eyeball center translations $\Delta x, \Delta y$ being larger. Similarly, a too large initial pupil depth estimate will result in smaller eyeball center translations $\Delta x, \Delta y$. These errors compensate each other completely for the initial head pose and gaze point. Depending on the later head pose and gaze point, they compensate each other to a different extent. In the most unfortunate case, if the errors fail to compensate each other at all, there will be an error equal to 6% of the gaze point translation relative to the orthographic projection of the pupil on the camera plane $z = 0$. This is very rare though, and can happen only for head poses and gaze directions significantly different from those during initialization.

As a final remark it should be noted that the system error due to an incorrect interpupillary distance can be eliminated either by using a depth camera or by asking the user to measure this distance with a ruler. The latter requires the assistance of another person, but only needs to be done once in a lifetime and easily provides 1 mm precision, which limits the maximum gaze estimation error to a very small value of 0.75% of the screen size in typical scenarios.

6.2 Theoretical error resulting from the user having a non-frontal pose in the initialization frame

The consequences of the user having a non-frontal pose in the initialization frame can be twofold: this can lead to incorrect eyeball center initialization and also to incorrect face model initialization in terms of depth and out of plane rotations.

6.2.1 Impact on eyeball location

The issue of incorrect eyeball center initialization will be discussed first. As the pupils are spread apart in the horizontal plane, head pitch does not impact the perceived interpupillary distance. Even in the case of a certain head roll, the vertical distances between the eyes are very small in normal user interaction. Because of this, the author assumes that only head yaw can significantly alter the perceived interpupillary distance.

The impact of head yaw on the distance between the pupils perceived by the camera is shown in Figure 6.1. The full interpupillary distance is observed when the face is perpendicular to the camera z axis. When the head is rotated by an angle $\theta$, the perceived interpupillary distance $d_{IPD}'$ is

$d_{IPD}' = d_{IPD}\cos\theta$    (6.14)

This means that even if the user has an initial head yaw of 10°, the error in interpupillary distance caused by this will be 1.5%. This is much less than was considered in Section 6.1 (over 6%), and is therefore negligible.
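This figure is easy to verify; a one-line, illustrative check of (6.14):

import math
# Relative shrinkage of the perceived interpupillary distance for a 10 degree head yaw
print(round(100 * (1 - math.cos(math.radians(10))), 2))   # ~1.52 %, well below the 6.35 % considered in Section 6.1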

Figure 6.1. Head yaw during initialization (top view)

As has been explained in Chapter 5, people tend to control the horizontal head rotation when looking at things. It can be expected that when gazing at the screen center it is very unlikely for a person to keep a significantly rotated head, so a yaw of 10° should be considered large.

6.2.2 Impact on head pose tracking

The impact of initial head yaw can be more significant in case of head pose tracking. Even though the warping procedure ensures that the initial face model fits the 2D face image, when the head is rotated the model topology will not be consistent with the face topology. This is shown in Figure 6.2.

Figure 6.2. Schematic presentation of inconsistent face model topology and face topology caused by initial head yaw (top view).

It is very difficult to analytically assess the impact of this phenomenon. Therefore, an evaluation of this is presented in Chapter 7. The head pose is tracked for different initial mesh rotations (from 0 to 10° in the horizontal plane). It is analyzed how initial head rotations influence later head pose tracking accuracy in the proposed system.

It should also be noted that, similarly to the case of the interpupillary distance, the issue of initial head rotations can be eliminated by using a depth camera.

6.3 Theoretical error resulting from incorrect angles between optical and visual axis

The impact of incorrect gaze angles between the optical and visual axis is similar to the impact of an incorrect interpupillary distance. Based on the derivation in Section 6.1.1 and equation (6.12), the error of the horizontal angle between the optical and visual axis, $\alpha$, has an approximately linear relation to the initial horizontal translation $\Delta x$ of the eyeball center relative to the pupil center

$\Delta x \approx |PE| \cdot \frac{\alpha - \varphi_h}{\sqrt{1 + (\varphi_v - \beta)^2}} \approx |PE| \cdot (\alpha - \varphi_h)$    (6.15)

Similarly, the vertical angle between the optical and visual axis, $\beta$, has an approximately linear relation to the initial vertical translation $\Delta y$ of the eyeball center relative to the pupil center

$\Delta y \approx |PE| \cdot \frac{\varphi_v - \beta}{\sqrt{1 + (\varphi_v - \beta)^2}} \approx |PE| \cdot (\varphi_v - \beta)$    (6.16)

The errors of the angles $\alpha, \beta$ also influence the final gaze point coordinates given by equation (6.6) in a near-linear fashion.

As with the interpupillary distance, the error in the angles $\alpha, \beta$ is compensated by the estimated initial eyeball center positions. Therefore, when the head pose changes, errors in the angles $\alpha, \beta$ will cause inaccurate eyeball center estimation. It is reasonable to assume that an angular error of $\varepsilon$ for each of these angles will lead to a maximum gaze vector orientation error of $\varepsilon$ in the case of extreme head rotations, when the eyeball position offset completely fails to compensate the error $\varepsilon$. For typical use cases, however, the total error introduced by the inaccuracy of $\alpha$ and $\beta$ should be much smaller than $\varepsilon$ itself.

6.4 Theoretical error resulting from incorrect eye dimensions

Incorrect eye dimensions lead to estimating the eyeball center in the wrong place. The distance between the eyeball center and pupil center, $|PE|$, is a scaling factor in equation (6.4) describing the eyeball center displacement relative to the pupil center. This distance determines how far one has to move along the visual axis, starting from the pupil center, in order to get to the eyeball center.

While certain conclusions can be drawn directly from equation (6.6), the impact of changes in $|PE|$ on the calculated gaze angle will be analyzed first. The gaze direction is determined by projecting a vector from the invisible eyeball center towards the pupil center and rotating it by the deviation angles $\alpha$ and $\beta$. When the eye rotates and the pupil moves, the gaze direction changes. Depending on how far the eyeball center is from the pupil center, these gaze changes will be different for the same observed pupil displacement. This holds for both the vertical and the horizontal dimension and is illustrated in Figure 6.3.

The pupil position changes from $P_0$ in the initialization frame to $P_1$ when the user looks at a new location. The assumed eyeball rotation center $E$ allows this gaze angle change to be measured as $\varphi$. Let the true eyeball center position be $E'$. An error in the assumed distance $|PE|$ will lead to an angular error of $\theta$. Using the law of sines for triangle $PEE'$ gives

Figure 6.3. Impact of eyeball size on estimated gaze direction (one dimension).

$\frac{\Delta|PE|}{\sin\theta} = \frac{|PE'|}{\sin\varphi}$    (6.17)

which leads to

$\theta = \arcsin\frac{\sin\varphi \cdot \Delta|PE|}{|PE'|} = \arcsin\frac{\sin\varphi \cdot \Delta|PE|}{|PE| - \Delta|PE|}$    (6.18)

This means that the angular error of the estimated optical axis direction depends on the error in |𝑃𝑃𝐸𝐸| distance and on the true optical axis angle. As the optical axis is rotated relative to the visual axis by a fixed angle, the errors in optical axis estimation will be proportional to the errors in visual axis (gaze axis) estimation.

The relation given by equation (6.18) is close to linear for small angles $\varphi$ and errors $\Delta|PE|$. For a better demonstration of what it means, a set of error angles has been calculated for various $\varphi$ and $\Delta|PE|$ values in Table 6.1. These calculations and equation (6.18) assume that the true distance $|PE'|$ is shorter than $|PE|$, but the angular error values will be similar if it is the other way round.

It should be noted that the length $|PE|$ is denoted as $R$ in Section 5.3.2, where the method of calculating the pupil center is described. Despite having to solve a quadratic equation (6.7) to obtain the pupil center coordinates, the error of the distance between the pupil and eyeball center is not in a quadratic relation to the pupil position.

The relative pupil and eyeball position will be found to exactly satisfy equation (6.7) and place the pupil center at a distance of $R$ from the eye center.

$\Delta|PE| / |PE|$    $\varphi$ = 0°    $\varphi$ = 10°    $\varphi$ = 20°    $\varphi$ = 30°
1%                     0°               0.10°              0.20°              0.29°
2%                     0°               0.20°              0.40°              0.58°
5%                     0°               0.52°              1.03°              1.51°
10%                    0°               1.11°              2.18°              3.18°
Table 6.1. Angular errors of gaze direction caused by eye size errors.
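The entries of Table 6.1 follow directly from (6.18) with $|PE'| = |PE| - \Delta|PE|$; a short, illustrative check:

import math

def gaze_angle_error(rel_err, phi):
    # eq. (6.18) with |PE'| = |PE| - d|PE|; both angles in radians, rel_err = d|PE| / |PE|
    return math.asin(math.sin(phi) * rel_err / (1.0 - rel_err))

for rel_err in (0.01, 0.02, 0.05, 0.10):
    row = [math.degrees(gaze_angle_error(rel_err, math.radians(phi))) for phi in (0, 10, 20, 30)]
    print(rel_err, [round(v, 2) for v in row])   # reproduces the rows of Table 6.1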

As can be seen, the error values in Table 6.1 are rather small, and in practice significant only for gaze directions far from frontal combined with large eyeball size deviations. Such deviations concern only a small part of the population, as according to various sources that can be found on the internet it is rare for the size of the eyeball to differ by more than 10% from the average value. While the angular error $\theta$ can result in different pixel errors on the screen depending on the location and orientation of the user's head, as given by equation (6.6), the angular error values listed in Table 6.1 are sufficient to give an idea about the level of expected inaccuracies. As these do not exceed 1° in the great majority of cases, there are grounds to neglect them. However, an experimental evaluation to confirm this would be helpful, and is presented in Chapter 7.

6.5 Theoretical error resulting from incorrect camera external calibration relative to display

The camera and display relative positions and orientations are assumed to be known for the described eye gaze tracking system to work. This makes it possible to map the calculated gaze direction relative to the camera onto a position on the display. If the screen resolution is also provided, the exact observed pixel can be determined.

Two camera positions are optimal and most often used in prior systems. One is with the camera above the display, exactly in the horizontal center. The other is with the camera beneath the display, also in the horizontal center. Due to ease of assembly, the first configuration was used for the experiments described in Chapter 7.

Measuring the camera translations relative to the display is fairly easy and can be done accurately even with a simple ruler. Ensuring precise orientation is much more difficult, as the camera is a relatively small object. While the z axis of the camera could be pointed in various fixed directions relative to the normal vector of the display, for simplicity the proposed system assumes these axes are parallel. This means that both the horizontal and the vertical angle between the camera z axis and the display normal are assumed to be zero. This section analyzes small deviations from this assumption.

The possible errors in camera and display orientation are shown in Figure 6.4 and Figure 6.5. In the correct setup the user is looking at the point $O$ and this point is estimated by the eye gaze tracking system. When the screen orientation is inconsistent with the assumptions, the point $O$ is thought to be observed, while the point $O'$ is determined as the screen position, instead of $O''$. This is because $|CO| = |CO'|$. The distance between the points $O'$ and $O''$ is the error caused by this inconsistency. It depends on the angular orientation error $\theta$ and on the gaze angle.

Using the law of sines for triangle $OCO''$ in Figure 6.5 gives

$\frac{|OC|}{\sin\left(90° - (\theta + \varphi_h)\right)} = \frac{|CO''|}{\sin(90° + \varphi_h)}$    (6.19)

which is equivalent to

$\frac{|OC|}{\cos(\theta + \varphi_h)} = \frac{|OC| + \Delta x}{\cos(\varphi_h)}$    (6.20)

and hence

$\Delta x = |OC|\left(\frac{\cos(\varphi_h)}{\cos(\theta + \varphi_h)} - 1\right)$    (6.21)

Similarly, when considering rotations in the vertical plane

$\Delta y = |OC|\left(\frac{\cos(\varphi_v)}{\cos(\theta + \varphi_v)} - 1\right)$    (6.22)

Figure 6.4. Relative camera and display orientation errors – horizontal plane.

Figure 6.5. Relative camera and display orientation errors - vertical plane.

The errors depend on three things: the observed point on the screen, the viewing angle and the display orientation error. For easier presentation, a set of relative errors $\frac{\Delta x}{|OC|}, \frac{\Delta y}{|OC|}$ for several $\theta$ and $\varphi$ values is listed in Table 6.2.

$\theta$    $\varphi$ = 0°    $\varphi$ = 10°    $\varphi$ = 20°    $\varphi$ = 30°
1°          0.02%             0.32%              0.65%              1.03%
2°          0.06%             0.68%              1.35%              2.12%
5°          0.38%             1.95%              3.68%              5.72%
10°         1.54%             4.80%              8.51%              13.05%
Table 6.2. Relative errors resulting from display orientation inaccuracy.

The author argues that if care is taken, a rotation error of 10° should not happen; at most half of this value seems feasible. In such a case significant relative gaze point errors appear only for large gaze angles such as 20° or 30°. This happens when the user is looking at screen corners or near the boundaries. Assuming this, if the gaze point is 20 cm away from the camera in a given plane, the largest errors from Table 6.2 for a 5° display orientation error translate into errors of 0.74 cm and 1.14 cm on the screen for 20° and 30° gaze angles, respectively. Considering that these are the largest errors, possible only in extreme cases, they have been categorized as negligible compared to other error sources.
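Both the table and the centimetre figures above can be reproduced from (6.21); a short, illustrative check:

import math

def relative_gaze_error(theta, phi):
    # eq. (6.21)/(6.22) divided by |OC|: relative error due to a display orientation error theta
    return math.cos(phi) / math.cos(theta + phi) - 1.0

for theta_deg in (1, 2, 5, 10):
    row = [100 * relative_gaze_error(math.radians(theta_deg), math.radians(phi)) for phi in (0, 10, 20, 30)]
    print(theta_deg, [round(v, 2) for v in row])   # reproduces the rows of Table 6.2

# Example from the text: gaze point 20 cm from the camera, 5 degree orientation error
for phi in (20, 30):
    print(round(20.0 * relative_gaze_error(math.radians(5), math.radians(phi)), 2))   # 0.74 cm and 1.14 cm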

The errors analyzed in this section apply equally if a depth camera is used along with the RGB camera.

This concludes the chapter about theoretical errors resulting from system setup and assumed model parameters. The main error sources influencing the estimated gaze point have been analyzed. Each error source was shown to be in principle insignificant on its own. However, the errors generally influence each other and appear simultaneously. This is difficult to model analytically. For more justified conclusions an empirical evaluation of the most crucial error sources is presented in Chapter 7.

7 Experimental validation

This chapter contains an extensive empirical evaluation of the proposed system and its different configurations.

7.1 Environment setup

The test environment consists of a computer screen, an RGB camera, a depth camera and the user. The placement of the hardware is shown in Figure 7.1. The relative placement of the user is shown in Figure 7.2. The relative orientation and position of the depth and RGB cameras have been calibrated using the algorithm of Zhengyou Zhang [166] and OpenCV functions [167]. For this purpose a set of calibration chessboard images was captured simultaneously with the depth and RGB cameras.

The relative orientation of the cameras and the display was not calibrated mathematically, but measured using a ruler. As has been demonstrated in Section 6.5, small inaccuracies of the relative camera and display rotations have an insignificant impact on the eye gaze tracking system accuracy. As a result of the calibration and measurement procedure, the relative positions and orientations of the cameras and the screen were all known during test sequence recording.

Figure 7.1. Relative hardware placement (frontal view).

Figure 7.2. User placement in test environment.

The physical width and height of the 23" display used have been measured as 51 cm and 29.5 cm, respectively. The depth and RGB cameras have both been placed above the screen center, 4 cm and 9 cm above the top of the display, respectively. The user's head was placed in front of the display and slightly below the cameras.

Using the described setup, a set of test sequences has been recorded. The test sequences have been recorded for 10 users, with both RGB and depth sensor data captured in each sequence. The sequences have been recorded for two distances of the user from the screen: 60 cm and 80 cm. These distances correspond to typical usage of computers. The sequences have also been recorded in different illumination conditions: in daylight with ambient illumination and in the evening with office lighting from the ceiling being the only light source. For special evaluation purposes several sequences have also been recorded with lateral illumination.

Each sequence was recorded by displaying certain points on the screen and asking the user to look at them, as shown in Figure 7.3. The recording scenario consisted of an initialization stage and five test stages:

0. Initialization stage: a point was displayed in the screen center and the user was asked to look at it with a frontal head pose.

1. Static head stage: after the initialization was complete, points appeared near four screen corners and the user was asked to observe these points without moving their head.

2. Dynamic head stage – frontal gaze: points appeared near screen corners as before, but the user was asked to look at them by turning their head while keeping a frontal gaze. Thus, the gaze direction change was caused approximately by the head rotation only. In total four points were displayed in this stage, displaced from the screen corners by 30% of the screen size in each direction, as shown in Figure 7.3. In order to prevent tracking error accumulation, the user was asked to bring the head back to a frontal pose before looking at each corner point. This allowed the head pose tracking algorithm to reinitialize and remove the accumulated tracking error.

3. Dynamic head stage – gaze at screen center: after each point appeared near a screen corner in stage 2, a second point later appeared in the screen center. The user was asked to look at it while keeping the head rotation from stage 2. Thus, the gaze accuracy could be measured in relation to an approximately known head rotation.

4. Dynamic head stage – frontal gaze: same as stage 2 but the points appeared further away from the screen center, displaced from the screen corners by 10% of the screen size in each direction, as shown in Figure 7.3. The head rotations were therefore larger than in stage 2.

5. Dynamic head stage – gaze at screen center: same as stage 3 but for larger head rotations. The same rotations as performed in stage 4 are maintained in this stage.

Figure 7.3. Points displayed during test sequence recording. Black: initialization point. Blue: used for static and dynamic head stages. Red: used for dynamic head stages only.

To support the recording process, the points intended for observation were animated so that the user could follow them smoothly. Additionally, one point was displayed to indicate the gaze point, while another, differently colored point was displayed to indicate the direction of the head pose. Thus, apart from receiving instructions before the recording procedure, the user received hints during the process.

The gathered test sequences make it possible to evaluate head pose tracking accuracy, iris tracking accuracy and eye gaze tracking accuracy, as will be described in the following sections. The baseline sequence set for testing has been chosen to be the sequences with the face 60 cm away and ambient lighting. These are favorable conditions, but it is reasonable to use such conditions for drawing basic conclusions about the performance of the proposed algorithms. This set of 10 sequences is used for most experiments, and will be referred to as the standard test set in the rest of this chapter. Two more test sets have been recorded:

• 10 sequences with ambient lighting and the user 80cm away from the screen, referred to as the far test set

• 10 sequences with office lighting (from top only) and the user 60 cm away from the screen, referred to as the top lighting test set

These two test sets are used for comparisons to justify that the proposed eye gaze tracking system can perform well in various conditions. Additionally, specific limited test sets with lateral illumination and significant initial head rotation have been recorded to verify certain characteristics of the proposed algorithms. These will be described when evaluations using them are presented.

7.2 Head pose estimation

The evaluation of head pose tracking accuracy is important to understand how the head pose tracking algorithm influences eye gaze tracking accuracy. Recording the ground truth of a person's head pose is difficult. It requires specialized equipment such as magnetic trackers and is limited by the accuracy of the equipment used. Due to technical difficulties, the head pose tracking accuracy evaluation does not compare the estimated head orientations to ground truth, but measures head orientation accuracy indirectly, as described in the following section.

7.2.1 Evaluation framework

The head pose estimation accuracy has been evaluated by comparing the positions of a set of facial landmarks tagged in the current frame with the positions of landmarks tagged in an initial frame and transformed by the head pose estimate. Although the previous work of the author [129] proposes to use a flexible model algorithm to obtain current landmark positions in every frame and use them for automatic assessment of head pose tracking accuracy, such an approach has limited precision. Contemporary state-of-the-art facial landmark detection algorithms [168] still report errors of several pixels relative to ground-truth tagging. It is better to avoid such bias when evaluating algorithms. Despite its drawbacks, manual tagging of facial features is still more accurate.

The facial features chosen for tagging are the two inner eye corners and the two mouth corners. These are the most distinctive facial features, visible even in poor lighting and low quality images. Unfortunately, the mouth corners proved to be indiscernible in several test sequences. To reduce tagging errors to a minimum, it was decided to measure only the errors for the inner eye corners.

In order to accurately tag facial features in the recorded test sequences, a dedicated tagging tool has been developed. It makes it possible to mark each tagged facial feature on a magnified image and to move points after marking them in order to optimize their position if necessary. The same tool was also used for tagging iris centers, as will be described in Section 7.3.

The described test sequence recording methodology serves a specific purpose. By displaying certain points on the screen and asking the user to look at them with a frontal eye gaze direction, the approximate head pose orientation can be determined. This makes it possible to associate the measured eye gaze tracking and head pose tracking errors with the magnitude of head rotations. In the case of head pose tracking evaluation, it is sufficient to manually tag frames from each sequence for four moderate (stages 2, 3) and four extreme (stages 4, 5) head rotations in the dynamic head pose stage. The detailed results in the next section show tracking errors associated with approximate head rotations, based on the positions of the displayed gaze points and the head distance from the display.

Several different algorithms are compared. When considering RGB input only, optical flow based tracking described in Section 3.2.1, as well as feature based tracking described in Section 3.2.2, are evaluated. Apart from this, a hybrid tracker as proposed in Section 3.3 is compared in three versions: with a single template, with multiple templates and with a multiple-frame aggregate template as described in Section 3.3.2. When considering RGBD input, all algorithms developed for RGB-only tracking are compared, with depth used in the first frames to support accurate mesh initialization. Additionally, two new algorithms are tested. The first is an improvement obtained by using depth data for tracking in every frame, described in Section 3.4.2.2. The second is a third-party total variation based tracker described in Section 3.4.2.1.

7.2.2 Results – RGB input

The first experiment compares the accuracy of five head pose tracking algorithms described in Sections 3.2 and 3.3. These are listed below, beginning with the abbreviations used for labelling the graphs:

1. Optical flow - head pose tracking using optical flow mesh tracking.
2. Features only - head pose tracking using feature matching to a single template.
3. OF + FT - hybrid head pose tracking using algorithms 1 and 2 with a single template.
4. OF + FT aggregate template - hybrid head pose tracking using algorithms 1 and 2 with an aggregate template captured in the initialization stage (frontal head pose).
5. OF + FT multiple templates - hybrid head pose tracking using algorithms 1 and 2 with multiple templates captured in the initialization stage (frontal head pose).

A graphical illustration of the measured accuracies for these algorithms is shown in Figure 7.4. Along with the average errors, the standard deviations are marked on the chart in the form of black bars. The standard deviations are calculated over the errors between all tagged and tracked frames in one sequence, and then averaged among the ten sequences in the standard test set.

The first method runs without re-initialization, so errors accumulate over time. The second method, on the other hand, performs reinitialization in every frame, but as there is only one template, large head rotations cannot be tracked with this method. Because of these two reasons, the first two methods give large tracking errors. The last three methods, in contrast, give smaller and very similar errors. Using these errors alone it is difficult to say which is more accurate. Here it should be noted that the measured pixel distances between tracked and labelled points are strongly correlated with the translational tracking error, but much less with the rotational tracking error. In fact, as the inner eye corners lie close to the vertical symmetry axis of the face, incorrect rotations of the rigid model will result in relatively small translational errors. Unfortunately, points lying near face edges cannot be reliably annotated.

Figure 7.4. Average landmark displacement in pixels for various head pose tracking algorithms on standard test set. [Values, left to right: ≈ 1.72, 1.99, 1.61, 1.61, 1.61 px]

As has been mentioned earlier, the accuracy of head pose tracking algorithms can be compared separately for various head rotation angles (stages in the test sequence). The angles can be calculated based on the assumed distance of the user's face from the display and the location of the gaze point. This is possible because the users were asked to keep a frontal gaze while turning their heads. Despite being imprecise, this method of measuring the head rotations is sufficient to provide information about how the head rotation magnitude impacts head pose tracking accuracy. For a more meaningful comparison and better clarity, at this stage detailed results will be presented for one selected head pose tracking algorithm. As will be shown at the end of this chapter, the most accurate head pose tracking algorithm is the proposed hybrid head pose tracker using optical flow and feature tracking with an aggregate template captured in the initialization stage (OF + FT aggregate template). Because of this, the author has chosen to perform the more detailed analyses using this algorithm – both in the case when using RGB input only and when additionally using a depth sensor. The test stage specific results for this tracking algorithm are shown in Figure 7.5. In addition to the combined landmark displacement errors, the x and y errors with corresponding standard deviations are shown separately.

Figure 7.5. Test stage specific landmark displacement in pixels for OF + FT aggregate template algorithm on standard test set.

In the first stage the head does not move, so the rotation angle can be estimated as close to 0°. In the second and third stage the head is rotated towards the inner quadrangle corners displaced from the screen center by 20% of the screen size in each dimension. If the user is seated 60cm away from the display, this gives a head rotation of 11° in total. Because of the screen shape the horizontal and vertical angles differ – the first is approximately 9.5°, while the second is around 6°. In the fourth and fifth stage the head is rotated toward the outer quadrangle corners displaced from the screen center by 40% of the screen size in each dimension. Assuming the same conditions, this is associated with a total head rotation of 21°, composed of an 18.7° horizontal and 11° vertical rotation. For better clarity, the graphs are labelled only with the number of the stage.
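The quoted angles can be verified with simple trigonometry. The short calculation below assumes a 24-inch, 16:10 display of roughly 51.8 x 32.4 cm (the exact screen dimensions are not restated here), so it only approximately reproduces the 9.5°/6° and 18.7°/11° values.

    import math

    def head_rotation_deg(frac_x, frac_y, screen_w_m, screen_h_m, distance_m):
        """Approximate rotation needed to face a point displaced from the screen center
        by the given fractions of the screen size, seen from distance_m."""
        dx = frac_x * screen_w_m
        dy = frac_y * screen_h_m
        yaw = math.degrees(math.atan2(dx, distance_m))
        pitch = math.degrees(math.atan2(dy, distance_m))
        total = math.degrees(math.atan2(math.hypot(dx, dy), distance_m))
        return yaw, pitch, total

    print(head_rotation_deg(0.2, 0.2, 0.518, 0.324, 0.60))  # inner corners (stages 2-3)
    print(head_rotation_deg(0.4, 0.4, 0.518, 0.324, 0.60))  # outer corners (stages 4-5)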

[Figure 7.5 data, combined (xy) displacement: Stage 1 ≈ 0.89; Stage 2 ≈ 1.58; Stage 3 ≈ 1.60; Stage 4 ≈ 2.11; Stage 5 ≈ 2.37; average ≈ 1.61 px]

As can be seen in Figure 7.5, the tracking error increases with the magnitude of head rotation. This is quite understandable. In fact, the error in the first stage is caused mostly by noise – partly tracking and partly tagging noise. The graph for optical flow only tracking has very similar error values in the first stage. This suggests that the observed error of slightly less than one pixel is caused mostly by tagging noise and camera noise.

The second conclusion that can be made from Figure 7.5 is that the horizontal and vertical errors are not far apart. This refers to translational mesh tracking errors, though, and not necessarily rotational errors, which are crucial in the eye gaze tracking scenario. The third conclusion that can be made is that the measured tracking errors are small. Even if the tagging error noise is as low as 0.5 pixel, the measured landmark displacements are below 2 pixels even for large 20° head rotations. This is much more accurate than the results reported by the author in [129], although a different method for error calculation is used.

A different comparison of tracking accuracy can be presented by comparing the average results over all test stages for each of the ten test sequences. This is shown in Figure 7.6. What stands out most of all are the big disparities between the errors measured for different sequences. The largest error is more than twice the smallest one. This can be caused by two things. First of all, tagging errors can be much larger on some people than on others, as depending on facial appearance and the performed head rotations the eye corners can be clear or obscure, in some cases even occluded by the nose. Secondly, the mesh tracking accuracy in terms of translational error is different for each person. As currently RGB only algorithms are compared, a fixed generic face model is used to track the face pose. This model can match the different faces with different accuracy, which in turn results in varying model tracking accuracy.

In order to demonstrate that the proposed tracking algorithms work well in various conditions, the tracking performance on the standard test set was compared to the tracking performance on the top lighting test set and the far test set. The results for the three best tracking algorithms are shown in Figure 7.7. All three of these algorithms contain novel elements proposed by the author, but the hybrid tracking algorithm with an aggregate template is the most innovative. Additionally, test-stage specific results are shown in Figure 7.8.

It is important to notice that average tracking accuracy is similar on all three test sets – the tested algorithms work well in various conditions. Slightly larger errors than those for the standard test set are observed for the test set with top lighting. This is caused by less favorable illumination: when the head moves lighting directed from the ceiling causes shadows and changing appearance. The slightly lower accuracy is therefore entirely understandable.


Figure 7.6. Sequence specific landmark displacement in pixels for OF + FT aggregate template algorithm on standard test set. [Values, Seq 1–10: ≈ 1.42, 2.41, 1.58, 1.52, 1.37, 2.09, 0.98, 1.57, 1.66, 1.53 px]

Figure 7.7. Average landmark displacement in pixels for three head pose tracking algorithms on three test sets.

Figure 7.8. Test stage specific landmark displacement in pixels for OF + FT aggregate template algorithm on three test sets.

What is somewhat interesting is the seemingly better accuracy for the far test set. While the measured pixel distances are smallest, this is misleading. When the head is further away from the camera its image is smaller, and the same pixel error corresponds to a larger real-world error. In fact, the 33% increase in distance from 60cm to 80cm is significant, and considering the pixel scaling it turns out that the results on the far set are the least accurate when compared in world units. This is also understandable, as when the user is further away, the face resolution is smaller and more noise is present. The error distribution for the various test stages is similar for all test sets – as is expected.
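The scaling argument can be made concrete with the simple pinhole relation below (an assumed focal length of 1000 px is used purely for illustration; the real camera parameters differ).

    def pixel_error_to_mm(error_px, distance_mm, focal_px):
        """Real-world displacement corresponding to a pixel error at a given depth."""
        return error_px * distance_mm / focal_px

    print(pixel_error_to_mm(1.0, 600.0, 1000.0))  # 1 px at 60cm -> 0.6 mm
    print(pixel_error_to_mm(1.0, 800.0, 1000.0))  # 1 px at 80cm -> 0.8 mm, 33% larger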

7.2.3 Results – RGBD input

So far only head pose tracking without using the depth sensor has been analyzed. When using depth as well, the number of tracking algorithms that can be tested is larger. The methods chosen for tests are listed beginning with abbreviations used for labelling graphs:

1. Optical flow DI - Head pose tracking using optical flow mesh tracking with initialization of the mesh model using depth measurements in the first frame.

2. Features only DI - Head pose tracking using feature matching with initialization of the mesh model using depth measurements in the first frame.

0,89

1,58 1,602,11

2,37

1,61

0,0

0,5

1,0

1,5

2,0

2,5

3,0

Stage 1 Stage 2 Stage 3 Stage 4 Stage 5 Average

60cm, top lighting60cm, ambient lighting80cm, ambient lighting

Page 141: home.elka.pw.edu.plhome.elka.pw.edu.pl/.../2016_commodity_camera_egt_phd.pdf · 2016-04-06 · This thesis presents a complete eye gaze tracking system intended for use with a com-modity

7 Experimental validation

141

3. OF + FT DI - Hybrid head pose tracking using algorithms 1 and 2 with a single template, with initialization of the mesh model using depth measurements in the first frame.

4. OF + FT aggregate template DI - Hybrid head pose tracking using algorithms 1 and 2 with an aggregate template captured in the initialization stage (frontal head pose), using depth measurements for all aggregated templates.

5. OF + FT multiple templates DI - Hybrid head pose tracking using algorithms 1 and 2 with multiple templates captured in the initialization stage (frontal head pose), using depth measurements for all collected templates.

6. OF + FT with 3D refinement - Hybrid head pose tracking using algorithms 1 and 2 with depth used in every frame for refinement of optical flow tracking through a second stage, as described in Section 3.4.2.2.

7. OF + FT with 3D refinement AT - Hybrid head pose tracking using algorithms 1 and 2 with an aggregate template and depth used in every frame for refinement of optical flow tracking through a second stage, as described in Section 3.4.2.2.

8. VOF + FT - Hybrid head pose tracking using algorithms 1 and 2 with depth used in every frame to track in a frame-to-frame fashion using the total variation approach described in Section 3.4.2.1.

9. VOF + FT AT - Hybrid head pose tracking using algorithms 1 and 2 with an aggregate template and depth used in every frame to track in a frame-to-frame fashion using the total variation approach described in Section 3.4.2.1.

The first five listed algorithms correspond to those discussed in the previous section, only this time using depth sensor measurements for better initialization. The next four algorithms are new and are created from two base algorithms using RGB and depth input in every frame. Each of these two base algorithms is compared in the standard version and in the version with an aggregate template – as the approach with an aggregate template outperforms the other approaches (as will be shown). In the rest of this work the hybrid head pose tracking algorithm will often be referred to as the baseline algorithm.

The first experiment compares the average accuracy of all listed head pose tracking algorithms using depth measurements. The results are presented in Figure 7.9. The first two algorithms are either unable to reduce accumulated tracking error (optical flow) or are unable to track correctly in case of larger head rotations (single template feature matching). Their errors are clearly larger than those of the hybrid methods, so in later analyses only hybrid methods will be considered. Also, for a more concise comparison the methods described in Section 3.4.2.1 and Section 3.4.2.2 that use depth in every frame will only be compared in variants using aggregate templates from here on.

The pixel errors indicate that the best performing methods are the three hybrid methods previously analyzed for the RGB only scenario. The pixel errors, however, do not say much about the rotational errors of the mesh model. It may happen that bigger pixel errors are caused mostly by wrong translation, with the rotational error being smaller. As the rotational error is decisive for the eye gaze tracking system, the magnitude of the measured mesh pixel error does not tell the whole story, and the eye gaze tracking errors need to be measured to determine the rotational accuracy of head pose tracking.

Figure 7.9. Landmark displacement in pixels for various head pose tracking algorithms using depth on standard test set. [Values, left to right: Optical flow DI ≈ 1.98; Features only DI ≈ 1.89; OF + FT DI ≈ 1.63; OF + FT aggregate template DI ≈ 1.67; OF + FT multiple templates DI ≈ 1.63; OF + FT with 3D refinement ≈ 1.76; OF + FT with 3D refinement AT ≈ 1.84; VOF + FT ≈ 2.08; VOF + FT AT ≈ 1.98 px]

The dependency of mesh tracking errors on the test sequence stage is shown in Figure 7.10 for the hybrid tracking algorithm with an aggregate template and depth used for initialization. It is similar to that observed when not using depth measurements at all. The errors in the first three test stages are slightly smaller, while those in the last two stages are larger than in the case without any depth. More conclusions will be drawn when these mesh tracking errors are compared with eye gaze tracking errors in Section 7.4.

An interesting comparison can be made by juxtaposing the tracking errors obtained with and without depth on each test sequence in the test set. This is shown in Figure 7.11. While the average result is similar, the individual errors differ greatly between using and not using depth. This is caused by a simple fact: when not using depth the generic mesh can fit very well for some people, but poorly for others. When depth is used, however, the mesh model contains measurement noise but has a more equal fit quality for every tested individual. This is confirmed by the variance of the average mesh tracking errors computed for each test sequence separately. This variance is 0.08 for the case when depth is used and 0.15 for the case when depth is not used. The graph in Figure 7.11 supports this observation – bars for tracking without depth-based initialization are more scattered.
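Both variance values can be reproduced from the per-sequence averages read off Figures 7.6 and 7.11 (sample variance; the lists below simply restate the chart values).

    import numpy as np

    # Per-sequence average landmark displacement in pixels (Seq 1-10).
    no_depth = [1.42, 2.41, 1.58, 1.52, 1.37, 2.09, 0.98, 1.57, 1.66, 1.53]    # Figure 7.6
    with_depth = [1.40, 1.69, 2.00, 1.89, 1.35, 1.83, 1.51, 1.43, 2.12, 1.46]  # Figure 7.11

    print(round(float(np.var(no_depth, ddof=1)), 2))    # 0.15 - without depth
    print(round(float(np.var(with_depth, ddof=1)), 2))  # 0.08 - with depth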

Figure 7.10. Test stage specific landmark displacement in pixels for baseline tracking algorithm with depth used for initialization on standard test set.

[Figure 7.10 data, combined (xy) displacement: Stage 1 ≈ 0.88; Stage 2 ≈ 1.48; Stage 3 ≈ 1.54; Stage 4 ≈ 2.49; Stage 5 ≈ 2.58; average ≈ 1.67 px]

Similarly as in the case when not using depth sensor measurements, in order to demonstrate that the proposed tracking algorithms work well in various conditions, the tracking performance on the standard test set was compared to the tracking performance on the top lighting test set and the far test set. The results for the five baseline head pose tracking algorithms are shown in Figure 7.12. Additionally, test-stage specific results are shown in Figure 7.13.

The results shown in both Figure 7.12 and Figure 7.13 demonstrate the same regularities as in the case when depth is not used. The tracking errors are similar for all three test sets. The largest errors are measured for the sequence with top lighting and the smallest errors are measured for the far sequence. However, as has been explained earlier in Section 7.2.2, if the pixel error were normalized relative to the user distance from the camera, the best accuracy would be achieved on the 60cm sequence with ambient lighting. All of these results are in line with what should be expected, so they will not be commented on further.

Figure 7.11. Sequence specific landmark displacement in pixels for baseline tracking algorithm on standard test set with and without depth used for initialization. [Recoverable series (with depth), Seq 1–10: ≈ 1.40, 1.69, 2.00, 1.89, 1.35, 1.83, 1.51, 1.43, 2.12, 1.46 px]

Figure 7.12. Average landmark displacement in pixels for five head pose tracking algorithms using depth on three test sets.

Figure 7.13. Test stage specific landmark displacement in pixels for baseline tracking algorithm with depth used for initialization on three test sets.

7.2.4 Analysis of non-frontal initialization

In order to evaluate the impact of a non-frontal head pose in the initialization frame, special test sequences have been recorded for two people (number 1 and 5 in the standard test set), where the initial horizontal head rotation was 5° and 10°. The accuracy of the baseline head pose tracking algorithm without depth is presented in Figure 7.14, while the same results when using depth are shown in Figure 7.15.

The results show that despite the non-frontal rotation in the initialization stage, the tracking algorithms have small errors. On the standard test set the errors for the same two people are around 1.40 and 1.35 pixels. On the rotated test sequences they are even smaller. This is a little unexpected, but is probably caused by the fact that the recorded users were highly aware of the difficulty of the task and made slightly smaller head movements than when recording for the standard test set. This is most probably also the reason why the test results for the larger initial rotation of 10° are better than for the initial rotation of 5°. Of course, statistical error also plays a part, as only two recordings are analyzed for each case. Additionally, as has been mentioned earlier, the measured 2D displacements of tracked model points say little about the rotational error of model tracking. Gaze errors will provide more information about that.

One thing that is worth pointing out is the significantly smaller errors when depth is used for initialization (Figure 7.15) compared to no depth usage (Figure 7.14).


Figure 7.14. Sequence specific landmark displacement in pixels for baseline tracking algorithm without depth on test set with initial head rotation.

Figure 7.15. Sequence specific landmark displacement in pixels for baseline tracking algorithm with depth used for initialization on test set with initial head rotation.

[Figure 7.14 data: Person 1 – 5° ≈ 1.48; Person 2 – 5° ≈ 1.87; Person 1 – 10° ≈ 0.96; Person 2 – 10° ≈ 1.24 px]

[Figure 7.15 data: Person 1 – 5° ≈ 1.01; Person 2 – 5° ≈ 1.28; Person 1 – 10° ≈ 0.86; Person 2 – 10° ≈ 0.93 px]

The gain is very clear on all four recorded test sequences – something that was not observed on the standard test set. The reason for this can be explained quite easily. In case of initial rotation the observed face topology is much different than in the frontal case. Therefore, even when using a warping procedure, the fit quality of the generic model will be significantly lower than when the face is frontal. When depth is used it does not matter how much the face is rotated during initialization – a well fitted mesh model will always be created based on depth measurements.

7.2.5 Analysis of lateral illumination

In order to evaluate the illumination compensation module, head pose tracking has been performed on two test sequences with lateral illumination, recorded for the same two people as in the experiment with non-frontal initialization (person 1 and person 5 from the standard test set). The sequences with lateral illumination were recorded by setting up a jib lamp near the user’s face. A comparison of ambient lighting and lateral lighting is shown in Figure 7.16.

Figure 7.16. Comparison of face images with ambient and lateral lighting.

To begin with, an evaluation of how the baseline tracking algorithm with and without depth performs on two lateral illumination sequences is shown in Figure 7.17.

As can be seen, the average tracking errors when not using depth are much larger than on other test sequences. This is because the flexible model is much less accurate when lateral illumination is present. When the flexible model is inaccurate, the warping of the generic mesh is also inaccurate and the tracking initialization is poor. This later leads to large errors. As the illumination is uneven horizontally, the horizontal tracking errors dominate in the first two cases in Figure 7.17.


Figure 7.17. Sequence specific landmark displacement in pixels for OF + FT aggregate template algorithm with and without depth used for initialization on two test sequences with lateral illumination.

In contrast, when depth is used the inaccuracy of the flexible model in the initialization frame has much less impact. This is the reason why tracking errors are reduced in this case.

Regardless of how large the errors are for the baseline head pose tracking algorithm, the test sequences with lateral illumination are ideal to test illumination compensation algorithms described in Section 3.3.3. Two algorithms have been chosen for implementation: dynamic range compression and DCT coefficient removal. The DCT coefficient removal procedure was tested in two variants: with the removal of the first 3 coefficients and with the removal of the first 5 coefficients. The comparison of mesh tracking accuracy using various illumination compensation techniques is shown in Figure 7.18 when using RGB input and in Figure 7.19 when using RGBD input.
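For illustration, common forms of both techniques are sketched below using OpenCV; the exact variants from Section 3.3.3 may differ in details such as the use of a logarithmic transform, the zig-zag ordering of the removed coefficients, or the compression curve.

    import cv2
    import numpy as np

    def remove_dct_coefficients(gray, n=3):
        """Suppress low-frequency illumination by zeroing a small block of low-frequency
        DCT coefficients (the DC term is kept so overall brightness is preserved)."""
        h, w = gray.shape
        g = np.float32(gray[:h - h % 2, :w - w % 2]) / 255.0  # cv2.dct needs even sizes
        f = cv2.dct(g)
        dc = f[0, 0]
        f[:n, :n] = 0.0
        f[0, 0] = dc
        out = cv2.idct(f)
        return np.uint8(np.clip(out * 255.0, 0, 255))

    def dynamic_range_compression(gray, gamma=0.5):
        """Simple gamma-based dynamic range compression of a grayscale face image."""
        norm = np.float32(gray) / 255.0
        return np.uint8(np.clip((norm ** gamma) * 255.0, 0, 255))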

It is difficult to draw any decisive conclusions from the results. The average tracking errors change inconsistently. In the case of tracking without depth information the measured displacements are similar for all of the used illumination compensation modules. In the case of tracking with depth, the dynamic range compression algorithm reduces the measured error. It is clear, however, that measuring the eye gaze tracking errors directly will be necessary to determine the rotational model tracking accuracy in each case.


Figure 7.18. Average landmark displacement in pixels for OF + FT aggregate template algorithm without depth used for initialization on two test sequences with lateral illumination; for three illumination compensation algorithms.

Figure 7.19. Average landmark displacement in pixels for OF + FT aggregate template algorithm with depth used for initialization on two test sequences with lateral illumination; for three illumination compensation algorithms.

[Figure 7.18 data, Original / DCT 3 / DCT 5 / DRC: Person 1 ≈ 2.57, 2.75, 2.50, 2.32 px; Person 2 ≈ 2.38, 2.24, 2.72, 2.59 px]

[Figure 7.19 data, Original / DCT 3 / DCT 5 / DRC: Person 1 ≈ 2.40, 2.32, 2.13, 1.60 px; Person 2 ≈ 1.99, 2.05, 2.30, 1.92 px]

As the dynamic range compression algorithm shows promising results when used with depth-based initialization, and the recorded test set with lateral illumination is small, this case has been additionally tested on the standard test set for comparison. The results are shown in Figure 7.20. The landmark displacement error changes are again inconclusive, and measuring the gaze errors directly seems crucial to assess the actual impact of illumination compensation on head pose tracking accuracy.

Figure 7.20. Sequence specific landmark displacement in pixels for OF + FT aggregate template algorithm with depth used for initialization on standard test set with and without illumination compensation.

7.3 Iris localization

Iris localization is the second component task that, apart from head pose estimation, directly influences the accuracy of eye gaze tracking. As for head pose estimation, the iris localization accuracy evaluation has been performed by comparing the localization algorithm results to manually tagged iris centers.

7.3.1 Evaluation framework

The framework for iris localization evaluation is analogous to that for head pose estimation accuracy evaluation. In the case of irises, however, it is easier to accurately label the sought center points than for other facial features. The previously mentioned dedicated tool developed for the purpose of facial feature and iris tagging is shown in Figure 7.21.


The tool allows the user not only to select the iris center, but also to observe the complete circle that is being fitted to a magnified image of the iris, and thus allows the center to be labelled with subpixel accuracy. The typical way to tag the desired point on a magnified image is clicking, but the point can later be moved with fine accuracy using arrows to correct the location. The author estimates that this tagging process has an accuracy of around 1/8 of a pixel for the iris centers.

Figure 7.21. Tool developed for the purpose of iris tagging.

7.3.2 Results

As for the head pose tracking accuracy analysis, the iris localization accuracy is compared for various head poses, but also for various gaze directions. To begin with, the complete iris localization algorithm as described in Chapter 4 is tested on the standard test set for various head pose tracking algorithms, as shown in Figure 7.22 (without depth) and in Figure 7.23 (with depth). As can be seen, the iris localization accuracy is reproducible regardless of the chosen head pose tracking method. Because of this, all further iris evaluations will be performed using the hybrid aggregate template head pose tracking algorithm with depth used for initialization.

The variations of iris localization accuracy among individual test sequences are shown in Figure 7.24. The error spread is smaller than in case of mesh tracking, but still significant. This is understandable, as each person has a different color of the iris, a slightly different face geometry and performs slightly different head movements during the test sequence. As a result, the processed iris images have different quality for each sequence and this shows in the results.


Figure 7.22. Iris localization algorithm average errors in pixels on standard test set – head pose tracking algorithms without depth.

Figure 7.23. Iris localization algorithm average errors in pixels on standard test set – head pose tracking algorithms with depth.

[Figure 7.22 data: Optical flow ≈ 0.86; Features only ≈ 0.85; OF + FT ≈ 0.86; OF + FT aggregate template ≈ 0.87; OF + FT multiple templates ≈ 0.86 px]

[Figure 7.23 data: Optical flow ≈ 0.86; Features only ≈ 0.87; OF + FT ≈ 0.85; OF + FT aggregate template ≈ 0.85; OF + FT multiple templates ≈ 0.85; OF + FT with 3D refinement ≈ 0.86; OF + FT aggregate template with 3D refinement ≈ 0.85; VOF + FT ≈ 0.85; VOF + FT aggregate template ≈ 0.84 px]

Figure 7.24. Sequence specific iris localization algorithm average errors in pixels on standard test set.

[Figure 7.24 data, combined (xy) error per sequence: Seq 1 ≈ 0.71; Seq 2 ≈ 0.76; Seq 3 ≈ 0.83; Seq 4 ≈ 0.91; Seq 5 ≈ 1.11; Seq 6 ≈ 0.74; Seq 7 ≈ 0.64; Seq 8 ≈ 0.87; Seq 9 ≈ 1.11; Seq 10 ≈ 0.81 px]

To learn more, the iris accuracy is evaluated for various stages of the test sequences. This is shown in Figure 7.25. Not surprisingly, the highest accuracy is achieved in Stage 3 and Stage 5. In these two cases the person is looking at the screen center, so the iris is well visible from the camera. The same level of accuracy can also be observed for tagged frames in Stage 2. This is also understandable – for small head rotations and a frontal gaze the observed iris images continue to be clear and distinctive. Significantly larger errors can be observed for Stages 1 and 4. In the first stage the testers had to look at screen corners without moving their head. This caused extremely non-frontal gaze directions related to the head pose. The iris often ended up occluded by the eyelids and eyelashes. With such occlusions the localization algorithm understandably performs worse. Similarly in Stage 4, the testers gazed at screen corners, and while head rotation made it easier and more comfortable for them, it only slightly reduced the iris occlusions and deformations observed by the camera. The author concedes that when looking at screen corners the iris localization algorithm has a very difficult task and naturally produces larger errors.

Figure 7.25. Test stage specific iris localization algorithm average errors in pixels on standard test set.

There is one more thing that should be pointed out. The horizontal localization accuracy is much higher than the vertical accuracy. The average difference is more than 0.1 pixel. This is because whenever a part of the iris is occluded, it is the top part that is occluded by the eyelid and eyelashes, or the bottom part by the eyelid. The left and right edges of the iris are nearly always visible. In fact, the iris localization algorithm is tuned to use only the side iris edges for ellipse fitting. It is therefore no wonder that the horizontal localization is more accurate.

A comparison of the average iris localization algorithm accuracy across the various test sets is shown in Figure 7.26. The results demonstrate similar properties as those for the mesh. The test set with top lighting, unsurprisingly, gives slightly larger errors than the test set with ambient lighting. The far test set produces the smallest errors, but if normalized by the user distance from the camera it would turn out to be the least accurate of the three. This is also understandable, as when the user is further away the irises are smaller, so there is less information in the image.

[Figure 7.25 data, combined (xy) error: Stage 1 ≈ 0.97; Stage 2 ≈ 0.77; Stage 3 ≈ 0.76; Stage 4 ≈ 0.95; Stage 5 ≈ 0.70; average ≈ 0.85 px]

Difficult lighting conditions may lower the iris localization accuracy. A verification of this was performed on the two test sequences with lateral illumination; the results are shown in Figure 7.27. The average stage errors follow the same pattern as for the three basic test sets. The one notable difference is that on the two sequences with lateral illumination the observed horizontal localization error is larger than the vertical error. This is most probably caused by the fact that light shining from one side impacts the vertical (side) iris edges much more unevenly than the horizontal ones.

Figure 7.26. Test stage specific iris localization algorithm average errors in pixels on three test sets.

Figure 7.27. Test stage specific iris localization algorithm average errors in pixels on test set with lateral illumination.

[Figure 7.27 data, combined (xy) error: Stage 1 ≈ 1.20; Stage 2 ≈ 0.93; Stage 3 ≈ 0.79; Stage 4 ≈ 1.06; Stage 5 ≈ 0.93; average ≈ 0.99 px]

To conclude this section the author would like to present a comparison of several variants of the iris localization algorithm to demonstrate the value of the proposed modifications described in Chapter 4. Four variants are compared:

1. Coarse localization only – Only radial symmetry voting was performed without any refinement stage. According to literature, this approach should provide suboptimal results.

2. Coarse and standard fine localization – In this case both stages of the iris localization algorithm described by Zhou et al. [149] are performed, with some minor modifications of the coarse voting stage by the author as described in Section 4.1.1.

3. Coarse and fine localization with adaptive arc selection – In this case the algorithm described in point 2 was used with the addition of the adaptive arc selection algorithm as described in Section 4.2.1.

4. Coarse and fine localization with adaptive arc selection and ellipses – In this case the algorithm described in point 3 was used with the addition of ellipses used for fitting, as described in Section 4.2.2 (a brief sketch of this fitting step is given after this list).
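The sketch below illustrates only the final fitting step; it assumes that candidate boundary points on the left and right iris arcs have already been selected by the earlier stages (cv2.fitEllipse requires at least five points), and it is an illustration of the idea rather than the thesis implementation.

    import cv2
    import numpy as np

    def fit_iris_center(side_arc_points):
        """Fit an ellipse to the selected side-arc edge points and return its center."""
        pts = np.asarray(side_arc_points, dtype=np.float32).reshape(-1, 1, 2)
        (cx, cy), (major_axis, minor_axis), angle = cv2.fitEllipse(pts)
        return cx, cy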

The results of this comparison are shown in Figure 7.28. It is clearly visible that the modifications proposed by the author improve iris localization accuracy. In Section 7.4 it will also be shown how the iris localization errors relate to gaze angle estimation errors. The rightmost algorithm is the final configuration used in the eye gaze tracking system.

Figure 7.28. Various configurations of the iris localization algorithm tested on the standard test set – errors in pixels. [Values: Coarse stage only ≈ 1.33; Fine stage standard (no modifications) ≈ 0.99; Fine stage with adaptive arcs, no ellipses ≈ 0.97; Fine stage with adaptive arcs and ellipses ≈ 0.85 px]

It is noteworthy that the results presented so far are quite similar – in terms of average error – to those reported in the author’s previous work [57] for the case of low quality data. The high quality data scenario in [57] was recorded around 40cm from the camera and in special lighting conditions, so those errors are hard to achieve in real-world use cases.

The authors of the previous work on two-stage iris localization used in this thesis [149] consider a completely different scenario where the eye is visible from a very close distance. There is therefore not much point in comparing their results. On the other hand, the results demonstrated in this section can be compared to the isophote curvature approach of Valenti and Gevers [140]. The isophote curvature approach, published several years ago, was considered to be among the state of the art for low resolution webcam eye gaze tracking. The accuracy reported in [140] is an 84.1% success rate of localizing the iris centers with an error smaller than 5% of the distance between the eyes. If the distance between the eyes is 100 pixels, 5% of this is 5 pixels. This means that the approach of Valenti and Gevers achieves an error below 5 pixels 84% of the time. Despite average pixel accuracy not being given, this is clearly significantly lower accuracy than that of the algorithm proposed by the author. This justifies the claim that the algorithm presented in this thesis is among the current state of the art.

To conclude, the author wishes to point out that the presented iris localization results consider all tagged images – no cases were classified as completely lost. In situations when preselected frames contained closed eyes due to blinking, different frames were selected for tagging. This allowed the iris localization algorithm to be evaluated on proper test data.

7.4 Eye gaze tracking

Eye gaze tracking is the main focus of this thesis. While the accuracies of the component algorithms described in Sections 7.2 and 7.3 are important, the eye gaze tracking accuracy is not linearly related to them. Therefore, the empirical evaluation presented in this section is of key importance to assess the accuracy of the proposed eye gaze tracking system.


7.4.1 Evaluation framework

To measure the eye gaze tracking error there is no need for point labelling. The location of the displayed gaze point is known beforehand and is stored during test sequence recording. When the test sequence is run, this data is read for every frame and can be compared with the actual gaze point calculated by the algorithm. The x- and y- pixel errors can be converted to distance errors using the known screen resolution and physical dimensions. Because the distance of the user from the display is also estimated for every frame, the distance errors can be converted to approximated angular errors. These errors are provided in this thesis, as such errors are most commonly used for measuring accuracies of eye gaze tracking systems in literature.
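The conversion can be summarized as follows (the screen resolution and physical size below are placeholders; the per-frame estimated user distance is used in the actual evaluation).

    import math

    def gaze_error_deg(err_x_px, err_y_px, distance_m,
                       screen_px=(1920, 1080), screen_m=(0.52, 0.29)):
        """Convert an on-screen gaze point error in pixels to an approximate angular error."""
        ex = err_x_px * screen_m[0] / screen_px[0]   # horizontal error in metres
        ey = err_y_px * screen_m[1] / screen_px[1]   # vertical error in metres
        return math.degrees(math.atan2(math.hypot(ex, ey), distance_m))

    print(round(gaze_error_deg(100, 50, 0.60), 2))   # e.g. a 100 x 50 px error at 60cm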

7.4.2 Model parameter selection

The first task is to select optimal parameters for the proposed eye model described in Chapter 5. This can be done experimentally by measuring the average eye gaze tracking accuracy for different values of the parameters, starting off with the statistical parameters listed in Table 5.1. As a full run of the algorithm in one configuration on the standard test set takes around 30 minutes, testing all possible combinations of the parameters is infeasible. The author has decided to adopt a coarse-to-fine approach by optimizing the parameter values consecutively, beginning from the most important one (a sketch of this sequential search is given after the list below). The parameter selection process will be performed in the following order:

1. Eyeball radius selection, or, to be more specific, the distance between the eyeball center and the observed pupil center

2. Eyeball ellipsoidal shape selection – the ratio of the eyeball vertical and horizontal size

3. Visual axis and optical axis relative angle selection – first vertical (β), then horizontal (α)

4. Interpupillary distance selection
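The procedure amounts to a greedy, one-parameter-at-a-time grid search, as sketched below. The `evaluate` callback, which would run the whole system on the standard test set and return the average angular error, and all names are hypothetical; the candidate values mirror those tested in the following figures.

    def sequential_search(evaluate, params, candidates):
        """Optimize parameters one at a time, in the given order, keeping earlier choices."""
        for name, values in candidates.items():
            best = min(values, key=lambda v: evaluate({**params, name: v}))
            params[name] = best                      # freeze the best value found so far
        return params

    start = {"r_obs_mm": 9.5, "vertical_to_horizontal": 1.0,
             "beta_deg": 0.0, "alpha_deg": 0.0, "ipd_mm": 63.0}
    candidates = {"r_obs_mm": [9.0, 9.5, 10.0, 10.5, 11.0, 11.5],
                  "vertical_to_horizontal": [0.95, 1.0, 1.05],
                  "beta_deg": [0.0, 1.0, 1.5, 2.0, 2.5, 3.0],
                  "alpha_deg": [-1.5, 0.0, 1.5, 3.5, 5.5]}
    # best = sequential_search(evaluate, start, candidates)  # requires an evaluate() callback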

Judging by numerous experiments, the hybrid head pose tracking method with an aggregate template and depth used for initialization performs best. This method was therefore used for finding the first three parameters. As the interpupillary distance is used only when tracking with RGB input (to calculate the distance of the user from the camera), this parameter was optimized using the hybrid head pose tracking method with an aggregate template without depth. The average from all stages of all sequences in the standard test set was used for this purpose.

The first and decisively most important parameter to be selected is the distance between the eyeball center and the observed pupil center. It will from here on be referred to as the observed eyeball radius. This parameter is used to estimate the 3D pupil position, which is in turn used directly to calculate the gaze vector as described in Section 5.3.2. While the observed eyeball radius is optimized, the eyeball is assumed to be a perfect sphere and a simple eye model is used where the visual and optical axes are the same. These simplifications are intended to prevent any bias of the obtained results.
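The role of this parameter can be illustrated with the simple, spherical model: the detected iris center defines a viewing ray, the 3D pupil is taken as the point on that ray lying at the observed eyeball radius from the eyeball center, and the gaze direction is the vector from the eyeball center to that point. The sketch below makes the usual assumptions (intrinsics K, eyeball center already expressed in camera coordinates by the head pose tracker) and is an illustration rather than the exact procedure of Section 5.3.2.

    import numpy as np

    def gaze_direction(K, iris_center_px, eye_center_mm, r_obs_mm=11.0):
        """Gaze (optical axis) direction for the simple spherical eye model."""
        u, v = iris_center_px
        d = np.linalg.inv(K) @ np.array([u, v, 1.0])
        d /= np.linalg.norm(d)                        # viewing ray through the iris center
        b = float(d @ eye_center_mm)
        disc = b * b - float(eye_center_mm @ eye_center_mm) + r_obs_mm ** 2
        s = b - np.sqrt(max(disc, 0.0))               # nearer ray-sphere intersection
        pupil_3d = s * d                              # observed pupil position in 3D
        g = pupil_3d - eye_center_mm
        return g / np.linalg.norm(g)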

A comparison of the obtained gaze tracking errors for various observed eyeball radii is shown in Figure 7.29. The observed eyeball radii were tested with a 0.5 mm step. One might argue that a smaller step would help to find a more optimal parameter value, but it should be remembered that only 10 test sequences are used for testing. Therefore, tuning parameters to obtain the smallest possible error might not help much in terms of the system’s general performance, as parameters will become over-fitted to the training data.

The best average gaze accuracy equal to 2.50° was measured for the observed eye radius of 11.0 mm. This is a little unexpected, as literature suggests that the average distance between the eyeball center and pupil center is 9.5 mm. However, it is not clear whether the observed center of the pupil really lies in its physical center. The actual radius of the eye is statistically 12 mm, so it is quite probable that in terms of depth the iris observed by the camera lies somewhere in between the true iris and the edge of the cornea. The result of 11.0 mm is well within that area. Furthermore, the tested group of people – ten – is not large enough to ensure that statistical values are optimal. Perhaps this is another reason why the experimentally determined observed eyeball radius is slightly larger than expected.

The obtained results are also in accordance with what the theoretical analysis of eyeball size error has shown in Section 6.4. That analysis suggested that if the eyeball radius was inaccurate by 10%, this would cause an error below 1.11° when gazing 10° away from the screen center and an error below 2.18° when gazing 20° away from the screen center (near screen corners when seated 60cm away from the screen). The experimental evaluation in Figure 7.29 shows an average error of only about 0.3° for the 10% inaccuracy of the observed eyeball radius (10.0 mm vs 11.0 mm). This means that the maximum error for even 20° gaze angles is certainly less than 2.18°.

Page 160: home.elka.pw.edu.plhome.elka.pw.edu.pl/.../2016_commodity_camera_egt_phd.pdf · 2016-04-06 · This thesis presents a complete eye gaze tracking system intended for use with a com-modity

7 Experimental validation

160

Figure 7.29. Average eye gaze tracking errors in degrees on standard test set depending on observed eyeball radius, baseline head pose tracking algorithm is used.

It is important to note one more thing. The value of 11.0 mm for the observed eyeball radius minimizes not only the total gaze angle error, but both the horizontal and vertical gaze error. This is convenient, as it indicates the eye shape is close to a perfect sphere.

To determine whether a perfect sphere is really the best eyeball model is the goal of the next experiment. Figure 7.30 shows a comparison where two parameters of the eyeball model described in Section 5.3.2 are optimized. Ellipsoids with both shorter and larger vertical axis compared to the transverse horizontal axis are tested. The sagittal axis length (close to z axis of the camera) is kept constant. The results confirm that any deviation from a perfect sphere worsens accuracy. This is quite convenient, as it means the ellipsoidal eye model from Section 5.3.2 can be significantly simplified into a sphere.

[Figure 7.29 data: 9.0 mm ≈ 3.50°; 9.5 mm ≈ 3.13°; 10.0 mm ≈ 2.78°; 10.5 mm ≈ 2.63°; 11.0 mm ≈ 2.50°; 11.5 mm ≈ 2.60°]

The next experiment aims to verify whether a more complex eye model, in which the visual and optical axes are rotated relative to each other, can further improve the eye gaze tracking accuracy. To begin with, the vertical angle between the optical and visual axis is optimized. The results are presented in Figure 7.31. It is known from literature that the visual axis is inferior to the optical axis by between 0.25° and 3.0°, on average about 1.0° [6]. The measured results, however, indicate that the optimal β angle for the test data is 2.0°. While this is significantly different from the average for all humans, this result may be caused either by the small number of test subjects or may result from head pose tracking errors. Without a better method to measure the real angles for the tested people, using the experimental value of 2.0° is the only solution.

Figure 7.30. Average eye gaze tracking errors in degrees on standard test set depending on observed eyeball ellipsoid shape, baseline head pose tracking algorithm is used. [Values: Rh 11.0 / Rv 11.0 mm ≈ 2.50°; Rh 11.0 / Rv 10.5 mm ≈ 2.60°; Rh 11.0 / Rv 11.5 mm ≈ 2.57°; Rh 10.5 / Rv 11.0 mm ≈ 2.61°; Rh 11.5 / Rv 11.0 mm ≈ 2.60°]

Figure 7.31. Average eye gaze tracking errors in degrees on standard test set depending on angle β of the visual axis, baseline head pose tracking algorithm is used. [Values: β = 0.0° ≈ 2.498°; 1.0° ≈ 2.468°; 1.5° ≈ 2.459°; 2.0° ≈ 2.450°; 2.5° ≈ 2.463°; 3.0° ≈ 2.472°]

A similar optimization was performed for the horizontal angle between the visual and optical axis. The reason why the angle α was chosen as less significant and optimized second is that each eye has its visual axis rotated towards the nose – so if the gaze point is the average of both eyes, any errors in the horizontal inclination of the visual axes relative to the optical axes will be to a large extent compensated. This is not so with the vertical angle β – which is inclined in the same direction for both eyes.

Figure 7.32. Average eye gaze tracking errors in degrees on standard test set depending on angle α of the visual axis, baseline head pose tracking algorithm is used.

[Figure 7.32 data: α = -1.5° ≈ 2.466°; 0.0° ≈ 2.450°; 1.5° ≈ 2.477°; 3.5° ≈ 2.508°; 5.5° ≈ 2.533°]

The results of the α angle optimization shown in Figure 7.32 are quite unexpected. It seems that α = 0° provides the best results. At the same time it is stated in literature that the angle α varies for humans between 3.5° and 7.5°. This is contradictory. Because of this, the author has decided to investigate one more possibility. In principle, a non-zero angle α causes the estimated gaze angles to be slightly increased when the user looks at screen corners. It could happen that a wrong perceived eye radius was selected at the beginning, when a complex eye model was not yet considered. Once this eye radius was selected for the simple eye model, changing the eye model cannot improve eye gaze tracking accuracy. To verify this, the gaze errors were measured once again assuming a complex eye model with angles α = 5.5° (the theoretical average) and β = 2.0°, for various eyeball radii and shapes. The results are presented in Figure 7.33. As can be seen, no configuration of eye radii allows a smaller error to be achieved than when using the previously fixed parameters and α = 0°. This leaves no alternative but to accept that the horizontal deviation of the optical and visual axis does not improve eye gaze tracking accuracy when modelled in the proposed system.

Figure 7.33. Average eye gaze tracking errors in degrees on standard test set for various eye radii and eyeball shapes assuming a complex eye model with angles α = 5.5° and β = 2.0°, baseline head pose tracking algorithm is used.

One possible explanation why setting a non-zero angle α worsens the eye gaze tracking accuracy can be that the eyes compensate each other's errors. The final gaze point used for angular accuracy estimation is the average of the vectors estimated for both eyes. If this was the only reason for the results shown in Figure 7.32, a non-zero angle α should limit the individual errors of each eye. The author has verified this – with results shown in Figure 7.34.

The measured accuracies again indicate that the optimal α angle is 0°. This means that the angle α set to non-zero values not only deteriorates the combined eye gaze tracking accuracy, but deteriorates the average accuracy of tracking the gaze of each eye as well.

[Figure 7.33 data: h 11.0 / v 11.0 ≈ 2.628°; h 11.5 / v 11.5 ≈ 2.533°; h 12.0 / v 12.0 ≈ 2.530°; h 12.5 / v 12.5 ≈ 2.604°; h 12.0 / v 11.5 ≈ 2.530°; h 12.0 / v 12.5 ≈ 2.556°; h 11.5 / v 12.0 ≈ 2.539°; h 12.5 / v 12.0 ≈ 2.583° (all radii in mm)]

Figure 7.34. Average eye gaze tracking errors for the left and right eye in degrees on standard test set depending on angle α of the visual axis, baseline head pose tracking algorithm is used.

The reason for the observed results is not clear. The author suspects that such results are caused by head pose tracking errors, which are considerably larger than the errors resulting from incorrect angles between the optical and visual axis. It should be remembered that the eye model parameters are selected when using the head pose and iris tracking algorithms described in Chapters 3 and 4. In the investigated case it seems that, for a limited amount of test data, the accuracy of the used component algorithms is not high enough to allow observing the correct behavior when changing the parameters of the complex eye model. The author suspects that a much larger test dataset with thousands of sequences would lead to a different observation – the tracking errors would be more evenly distributed among test sequences and a more detailed eye model would, on average, lead to smaller errors.

As far as the parameters of the proposed eye model are concerned, the value of α = 0.0° is used from here on for all performed experiments. The eye model proposed in this work aims to minimize the errors of the proposed system for the available test data, so it is somewhat justified to use parameter values not fully congruent with biological measurements if this improves the obtained results.

[Figure 7.34 data: α = -1.5°, 0.0°, 1.5°, 3.5°, 5.5° – left eye ≈ 3.36°, 3.33°, 3.31°, 3.34°, 3.38°; right eye ≈ 3.35°, 3.32°, 3.38°, 3.44°, 3.51°]

The earlier presented theoretical analysis of the impact the angles α and β have on the accuracy of the whole eye gaze tracking system is correct. An error of ε in either of the angles α and β results in a much smaller error of the eye gaze tracking system. In fact, a change of α by 5.5° changes the average eye gaze tracking error by 0.08°, while a change of β by 2.0° changes the average eye gaze tracking error by 0.05°. The last value is also the total improvement of the system that is gained by using a complex eye model with separate optical and visual axes. While it is a gain, it constitutes only around 2% of the total eye gaze tracking error. Other error sources are more important and decisive. This largely justifies the choice made by most researchers of commodity camera eye gaze tracking to use a simple eye model.

The last parameter to optimize is the interpupillary distance used in RGB only tracking to determine the initial distance of the user from the camera. As mentioned before, this parameter is optimized using the hybrid head pose tracking method with an aggregate template without depth. The results of this optimization are shown in Figure 7.35.
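The underlying relation is the standard pinhole proportion sketched below (the focal length and measured pixel distance are example values; the system uses the calibrated camera parameters).

    def distance_from_ipd(focal_px, ipd_px, ipd_mm=64.0):
        """Approximate user-camera distance, in millimetres, from the pupil distance in pixels."""
        return focal_px * ipd_mm / ipd_px

    print(round(distance_from_ipd(focal_px=1000.0, ipd_px=107.0)))  # ~598 mm, roughly 60cm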

Figure 7.35. Average eye gaze tracking errors in degrees on standard test set depending on interpupillary distance, baseline head pose tracking algorithm without depth is

used.

The value of 63 mm reported as the average in literature is very close to the 64 mm indicated by the experimental optimization. After some thought the author realized that all test recordings were performed on men, who have a slightly larger interpupillary distance than women, with an average between 64 mm and 65 mm. The experimentally determined value of 64 mm is therefore consistent with literature surveys. As a means of verification the author manually measured the interpupillary distance of all the tested people. The average interpupillary distance is very close to 64 mm.
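For reference, the pinhole-camera relation that links the assumed physical interpupillary distance to the initial user distance in the RGB-only case can be sketched as follows; the function and parameter names are hypothetical, and the formula assumes a roughly frontal face, so it ignores the foreshortening caused by head rotation.

```python
def distance_from_ipd(ipd_px, focal_px, ipd_mm=64.0):
    """Estimate the user-to-camera distance in millimetres from the perceived
    interpupillary distance in pixels (pinhole model, frontal face assumed)."""
    return focal_px * ipd_mm / ipd_px

# Example: with a focal length of 600 px, a 64 px pupil-to-pupil distance
# corresponds to roughly 600 mm, i.e. the 60 cm standard test set distance.
print(distance_from_ipd(ipd_px=64.0, focal_px=600.0))
```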

To summarize, the selected optimal eye and gaze model parameters are listed in Table 7.1. They are used in all experiments described in the rest of this chapter.

Symbol    Name                                                Optimized value
R_obs     Observed radius of eyeball                          11.0 mm
A, B, C   Eyeball ellipsoid shape                             1, 1, 1 – perfect sphere
α         Horizontal angle between visual and optical axis    0.0°
β         Vertical angle between visual and optical axis      2.0°
d_IPD     Interpupillary distance                             64 mm

Table 7.1. Optimized eye parameter values used in the proposed eye gaze tracking system. Please refer to Figure 5.3 for symbol explanation.
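Purely as an illustration of how these values fit together, the parameters of Table 7.1 can be grouped into a single configuration object; the class and field names below are hypothetical and only mirror the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EyeModelParams:
    """Optimized eye and gaze model parameters (see Table 7.1)."""
    r_obs_mm: float = 11.0                   # observed eyeball radius
    ellipsoid_axes: tuple = (1.0, 1.0, 1.0)  # relative axes: perfect sphere
    alpha_deg: float = 0.0                   # horizontal visual/optical axis offset
    beta_deg: float = 2.0                    # vertical visual/optical axis offset
    ipd_mm: float = 64.0                     # interpupillary distance

DEFAULT_EYE_MODEL = EyeModelParams()
```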

7.4.3 Results – RGB input

As was done for the head pose and iris errors, the angular gaze errors are compared across head pose tracking algorithms, test sets, test sequences and test stages of a test set. Most comparisons are performed on the standard test set.
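The angular errors reported below can be understood as the angle subtended at the eye by the displacement between the estimated and the true gaze point on the screen. The sketch below shows one plausible way of computing the horizontal (x), vertical (y) and combined (xy) components; it is an assumption about the error definition rather than the exact evaluation code used in this chapter.

```python
import numpy as np

def gaze_error_deg(est_xy_mm, true_xy_mm, eye_to_screen_mm):
    """Angular gaze error components (x, y, xy) in degrees for an estimated
    and a ground-truth on-screen gaze point, given the eye-screen distance."""
    dx = est_xy_mm[0] - true_xy_mm[0]
    dy = est_xy_mm[1] - true_xy_mm[1]
    err_x = np.degrees(np.arctan2(abs(dx), eye_to_screen_mm))
    err_y = np.degrees(np.arctan2(abs(dy), eye_to_screen_mm))
    err_xy = np.degrees(np.arctan2(np.hypot(dx, dy), eye_to_screen_mm))
    return err_x, err_y, err_xy

# Example: a 30 mm horizontal miss at 600 mm corresponds to about 2.9 degrees.
print(gaze_error_deg((30.0, 0.0), (0.0, 0.0), 600.0))
```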

The first evaluation measures the average eye gaze tracking accuracy of the five head pose tracking algorithms described in Section 7.2.2. The results are shown in Figure 7.36. They are highly correlated with the accuracy of mesh tracking shown in Figure 7.4. As before, the hybrid head pose tracking method with an aggregate template is used as the baseline algorithm for more detailed comparisons.

The test stage specific eye gaze tracking accuracy is presented in Figure 7.37. These results are correlated with both the measured test stage specific mesh tracking accuracy shown in Figure 7.5 and iris localization accuracy shown in Figure 7.25. To better visualize this correlation, all three accuracies are shown in Figure 7.38.

In the first stage there is no head movement, so the head pose estimation error is very low, but the iris is not localized very accurately due to occlusions resulting from extreme eyeball rotations. In the second and fourth stages the iris is localized with slightly better accuracy, but head pose estimation errors appear. In the third and fifth stages the iris image quality is best and the gaze error results almost solely from head pose estimation inaccuracies. In fact, as the iris localization error is fairly constant and low in the third and fifth test stages, the measured eye gaze error can be treated as a good approximation of the head pose estimation error.

Figure 7.36. Average eye gaze tracking errors in degrees for several head pose tracking algorithms on standard test set.

Figure 7.37. Test stage specific eye gaze tracking errors in degrees for baseline tracking algorithm on standard test set.


Figure 7.38. Test stage specific eye gaze tracking errors in degrees, mesh tracking errors in pixels and iris localization errors in pixels for baseline tracking algorithm on standard test set. Left axis refers to degree errors and right to pixel errors.

The distribution of eye gaze tracking errors among test sequences is shown in Figure 7.39. The gaze errors range from 2.12° to 3.87°. This is quite a large spread of results, but it should be remembered that they depend on many factors: the person's appearance, the person's individual eye and face parameters, the accuracy of initialization and the extent of the performed head movements. With iris localization accuracy relatively repeatable, there is interestingly little correlation between the eye gaze tracking errors and the head pose tracking errors when comparing individual test sequences. This may be confusing, but it actually provides important information: the measured translational errors of head pose tracking are only loosely correlated with the rotational errors. If the measured translational error is significantly larger for a tracking method, the rotational error will most probably also be larger. This can be observed as the correlation between Figure 7.4 and Figure 7.36, or the correlation shown in Figure 7.38. In the case of small differences among the observed translational errors, however, the translational error may be largely independent of the rotational error. This is why the measured eye gaze tracking accuracies have little correlation with the measured head pose tracking accuracies when considered for one chosen algorithm.
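The "little correlation" observation can be quantified, for example, with a per-sequence Pearson correlation coefficient between the two error types; the arrays below are placeholders, not the measured values of this chapter.

```python
import numpy as np

# Placeholder per-sequence errors for ten sequences (illustrative values only).
gaze_err_deg = np.array([2.4, 3.1, 2.8, 3.3, 2.6, 2.3, 2.7, 3.6, 3.0, 2.5])
mesh_err_px  = np.array([1.9, 1.6, 2.2, 1.8, 2.1, 1.7, 2.0, 1.9, 2.3, 1.8])

r = np.corrcoef(gaze_err_deg, mesh_err_px)[0, 1]
print(f"Pearson correlation between gaze and mesh errors: {r:.2f}")
```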


Figure 7.39. Sequence specific eye gaze tracking errors in degrees for baseline tracking algorithm on standard test set.

The next experiment compares the accuracy of eye gaze tracking on several test sets. The result of this comparison for the three best RGB-only head pose tracking algorithms is shown in Figure 7.40. Based on this comparison, the hybrid head pose tracking algorithm is the most accurate, but the differences are not very large. On the standard test set the improvement in accuracy when using an aggregate template compared to using a single standard template is around 0.12°. Also, just as the measurements of head pose tracking accuracy suggested, the eye gaze tracking accuracy is best on the standard test set and worst on the far test set. The test set with top lighting is somewhere in between. The differences in accuracies are up to 35%, or around 1.1°. Based on this limited evaluation it can be suspected that the accuracy of the eye gaze tracking system is correlated with the distance of the user from the screen. However, it will be shown that when depth input is used the disparities are much smaller than the 35% observed in Figure 7.40.

Figure 7.40. Average eye gaze tracking errors in degrees for three head pose tracking algorithms on three test sets.

An additional test stage specific comparison using the baseline head pose tracking algorithm is shown in Figure 7.41. It is generally consistent with the corresponding results for head pose tracking accuracy and iris localization accuracy. The best results are obtained on the standard test set. The far test set clearly demonstrates the largest errors for three test stages, and is a close second for stages one and four, where the iris images are of lowest quality (extreme eye rotations). In these two test stages the test set with difficult top lighting produces the largest errors. This is because in the far test set, where the user is further away from the screen, the eye rotations are less extreme.

Figure 7.41. Test stage specific eye gaze tracking errors in degrees for baseline head pose tracking algorithm on three test sets.


7.4.4 Results – RGBD input

The first experiment using RGBD input measures the average eye gaze tracking accuracy of the nine head pose tracking algorithms described in Section 7.2.3. The results are shown in Figure 7.42. Similarly to the case of RGB input, they are highly correlated with the accuracy of mesh tracking shown in Figure 7.9. The best performing methods are the hybrid methods. In later analyses four tracking methods are compared: the three methods analyzed in the previous section, this time with depth used during initialization, and the hybrid head pose tracking algorithm using optical flow and template tracking with an aggregate template, which uses depth in every frame to refine the optical flow tracking in a second stage, as described in Section 3.4.2.2. It is worth noting that the last method achieves a 0.12° smaller tracking error in the vertical plane. While this does not sufficiently compensate for the increase in horizontal tracking error, it indicates that using depth in every frame to refine mesh alignment might be a good way to reduce the errors in vertical head movement tracking – a common weakness of all model-based head pose tracking algorithms.

One more notable observation is the significantly superior performance of all the hybrid methods proposed by the author compared to the variational tracker from the Technical University of Munich [162], considered to be among the state of the art for camera pose estimation. This says a lot about how accurate the proposed head pose tracking methods really are, and also underlines the fact that head pose tracking is a very specific task.

For test stage specific and sequence specific evaluations the best performing baseline head pose tracking algorithm is used – as before, it is the hybrid head pose tracking method with an aggregate template. The test stage specific eye gaze tracking accuracy is presented in Figure 7.43. Similarly to the RGB-only case, these results are correlated with both the measured test stage mesh tracking accuracy shown in Figure 7.10 and the iris localization accuracy shown in Figure 7.25. As in the previous section, all three accuracies are shown in Figure 7.44 for better visualization of this correlation.

The distribution of gaze errors among the test stages is similar to what has been observed for the RGB-only case. The errors in stages with head rotations are larger when the rotation is larger (stages 4 and 5 vs stages 2 and 3). Also, the errors are smaller when the user is looking at the screen center rather than the corners, as in this case the visible part of the iris is larger (stages 3 and 5 vs stages 2 and 4). When using depth, the test stages produce on average much smaller gaze angle errors than in the RGB-only case.


Figure 7.42. Average eye gaze tracking errors in degrees for several head pose tracking algorithms using depth on standard test set.

Figure 7.43. Test stage specific eye gaze tracking errors in degrees for baseline tracking algorithm using depth for initialization on standard test set.


Figure 7.44. Test stage specific eye gaze tracking errors in degrees, mesh tracking errors in pixels and iris localization errors in pixels for baseline tracking algorithm using depth for initialization on standard test set. Left axis refers to degree errors and right to pixel errors.

The distribution of eye gaze tracking errors among test sequences is shown in Figure 7.45. The gaze errors range from 1.87° to 3.38°. The spread of results is large, but smaller than for RGB input only. As before, there is little correlation between the eye gaze tracking errors and the head pose tracking errors when comparing individual test sequences. However, when depth is used for initialization the average gaze tracking errors decrease quite consistently – for eight out of ten test sequences. The standard deviation of the measured errors is also smaller. This is shown in Figure 7.46 and demonstrates that using depth is helpful in the eye gaze tracking task.

It is interesting to observe that while the eye gaze tracking errors decrease significantly when depth is used and sequences are compared with each other, the mesh tracking errors do not decrease, as was shown in Figure 7.11. This is further evidence that the translational errors that can be measured to assess head pose tracking accuracy are only loosely related to the rotational accuracy, which is decisive in the eye gaze tracking scenario. In the author's opinion the accuracy of gaze angle estimation is probably a better indicator of head pose tracking accuracy than the measured landmark displacements.


Figure 7.45. Sequence specific eye gaze tracking errors in degrees for baseline tracking algorithm using depth during initialization on standard test set.

Figure 7.46. Comparison of sequence specific average eye gaze tracking errors in degrees for baseline tracking algorithm on standard test set when using and when not using depth during initialization.

The next experiment compares the accuracy of eye gaze tracking using depth input on several test sets. The result of this comparison for the four best RGBD head pose tracking algorithms is shown in Figure 7.47. Similarly to the RGB-only case, the hybrid head pose tracking algorithm with an aggregate template is the most accurate. This confirms that the aggregate template approach is a valuable improvement over the single-template hybrid tracking algorithm, even if the gain is not very large – around 0.12° on the standard test set. The relations between the errors on the standard and other test sets are also similar. However, the absolute differences in accuracies among the three test sets are smaller when depth is used. For the best head pose tracking algorithm the difference between the standard and far test set is around 0.6° (25%) instead of 1.1° in the case without depth.

Figure 7.47. Average eye gaze tracking errors in degrees for four head pose tracking algorithms using depth on three test sets.

An additional test stage specific comparison using the best performing head pose tracking algorithm is shown in Figure 7.48. It is generally consistent with the corresponding results for head pose tracking accuracy and iris localization accuracy, and with the test stage specific gaze accuracies measured when depth wasn't used. The far test set produces the largest errors and the standard test set produces the smallest errors in all stages except the first, where head pose tracking is less important than correct mesh model initialization and precise iris localization. The test set with the user 80 cm away requires less extreme eye rotations in the first stage, as the user is further away from the screen corners at which they are looking. This is why the errors measured for the far test set are smallest in the first stage.


The final experiment that the author wishes to show is the relation of iris localization errors and eye gaze tracking errors when using the four variants of the iris localization algorithm previously shown in Figure 7.28. The comparison is performed on the standard test set using the baseline head pose tracking algorithm with depth. It is shown in Figure 7.49. The results confirm what was expected – the modifications of the iris localization algorithm proposed by the author not only reduce the iris displacement from ground truth, but also improve the eye gaze tracking accuracy.

Figure 7.48. Test stage specific eye gaze tracking errors in degrees for baseline tracking algorithm using depth for initialization on three test sets.

Figure 7.49. Relation of eye gaze error in degrees and iris localization error in pixels for various configurations of the iris localization algorithm tested on the standard test set.


7.4.5 Analysis of non-frontal initialization

The accuracy of eye gaze tracking using the baseline head pose tracking algorithm without depth is presented in Figure 7.50, while the same results when using depth are shown in Figure 7.51.

The results show that a non-frontal rotation at initialization slightly decreases eye gaze tracking accuracy – more than the mesh tracking errors presented in Figure 7.14 and Figure 7.15 would indicate. The errors on the standard test set for the same two people are low, on average around 2.0°, while on the test sequences with non-frontal initialization they are 2.4°–2.8°. This still shows, however, that the proposed eye gaze tracking system is quite robust to non-frontal initialization, as the system doesn't fail completely but only slightly increases the average gaze angle error when a relatively large initial head rotation of 10° is present.

Figure 7.50. Sequence specific eye gaze tracking errors in degrees for baseline tracking algorithm without depth on test set with initial head rotation.

The test set is a little too small to draw definite conclusions, but it seems that using depth also provides better robustness to initial head rotation. First, it allows the distance of the user from the camera to be estimated correctly regardless of the perceived interpupillary distance in pixels. Second, it allows the mesh model to be fitted more accurately, as the face topology is known explicitly no matter how the face is rotated. The gaze error measurements in Figure 7.50 and Figure 7.51 comply with this reasoning and suggest that when depth isn't used, the gaze tracking accuracy depends on the initial head rotation (it is lower for larger rotations), while when depth is used it depends on it much less (similar gaze tracking accuracy for different initial head rotation angles).
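The first of these two points can be illustrated with a short sketch: with a depth sensor the initial user distance can be read directly from the depth map over the detected face region, so the foreshortened pixel interpupillary distance no longer matters. The function name, the face-rectangle input and the median statistic are assumptions made for illustration only.

```python
import numpy as np

def distance_from_depth(depth_map_mm, face_box):
    """Initial user distance as the median valid depth (mm) over the detected
    face rectangle; independent of head rotation, unlike the pixel IPD route.

    depth_map_mm: 2D array of per-pixel depth in millimetres (0 = invalid).
    face_box: (x, y, w, h) face rectangle in depth-image coordinates.
    """
    x, y, w, h = face_box
    roi = depth_map_mm[y:y + h, x:x + w]
    valid = roi[roi > 0]
    return float(np.median(valid)) if valid.size > 0 else None
```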

Figure 7.51. Sequence specific eye gaze tracking errors in degrees for baseline tracking algorithm with depth used for initialization on test set with initial head rotation.

7.4.6 Analysis of lateral illumination

To begin with, an evaluation of how the baseline tracking algorithm with and without depth performs on two lateral illumination sequences in terms of average eye gaze tracking error is shown in Figure 7.52.

As can be seen, the average tracking errors on test sequences with lateral illumination are larger than on other test sequences. This is partly because the flexible model is much less accurate when lateral illumination is present – as has been explained in Section 7.2.5. The other reason is that lateral illumination changes the visible gradients on the face and thus makes gradient-based optical flow tracking more difficult.

When depth is used the gaze errors are significantly smaller on the same sequences. This is probably caused mainly by more accurate mesh model initialization, even if the flexible model produces a large error. One more thing that should be noted is the exceptionally large standard deviation of the measured gaze errors when depth input isn't used. It means that in some frames the tracking fails and there is a very large error, while in most frames the error is relatively small. This is quite understandable – tracking failures are more probable in difficult lighting conditions.

Figure 7.52. Sequence specific eye gaze tracking errors in degrees for baseline tracking algorithm with and without depth used for initialization on two test sequences with lateral illumination.

Figure 7.53. Average eye gaze tracking errors in degrees for baseline tracking algorithm without depth used for initialization on two test sequences with lateral illumination; for three illumination compensation algorithms.


The comparison of eye gaze tracking accuracy using the three illumination compensation techniques described in Section 7.2.5 is shown in Figure 7.53 when using RGB input and in Figure 7.54 when using RGBD input.

Figure 7.54. Average eye gaze tracking errors in degrees for baseline tracking algorithm with depth used for initialization on two test sequences with lateral illumination; for three illumination compensation algorithms.

While the mesh translational error analysis was inconclusive, the eye gaze tracking results prove that the illumination compensation algorithms do not provide the desired improvement. The texture and gradients that they remove make it more difficult to track using optical flow or keypoints. While the compensated images clearly have more uniform brightness, as can be seen in Figure 3.7 and Figure 3.8, the removed information is crucial for tracking purposes.
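For context, the general idea behind the DCT-based compensation evaluated here (the "DCT 3" and "DCT 5" variants) is to zero a few low-frequency DCT coefficients of the log-intensity image, which removes the slowly varying illumination component. The sketch below follows this general recipe but does not claim to reproduce the exact coefficient-selection scheme used in the thesis.

```python
import numpy as np
import cv2

def dct_illumination_compensation(gray, n_coeffs=3):
    """Zero the n_coeffs lowest-frequency DCT coefficients (excluding the DC
    term) of the log-intensity image to suppress slowly varying illumination.
    A sketch of the general technique; the thesis variant may differ."""
    # cv2.dct requires even-sized arrays, so crop by one row/column if needed.
    h, w = (gray.shape[0] // 2) * 2, (gray.shape[1] // 2) * 2
    img = np.log1p(gray[:h, :w].astype(np.float32))
    coeffs = cv2.dct(img)
    zeroed = 0
    for s in range(1, h + w - 1):                 # walk the anti-diagonals
        for u in range(max(0, s - w + 1), min(s + 1, h)):
            coeffs[u, s - u] = 0.0                # drop a low-frequency term
            zeroed += 1
            if zeroed >= n_coeffs:
                break
        if zeroed >= n_coeffs:
            break
    out = np.expm1(cv2.idct(coeffs))
    return cv2.normalize(out, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```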

While removing DCT coefficients leads to very significant tracking degradation, the dynamic range compression algorithm seems to degrade accuracy only a little. It is therefore interesting how this algorithm would perform on the much less challenging standard test set. The results are shown in Figure 7.55.

It turns out that on the standard test set the dynamic range compression algorithm degrades the eye gaze tracking accuracy by around 20%. This concludes the topic of illumination compensation. None of the tested algorithms allows more accurate head pose and eye gaze tracking. Evidently, the illumination compensation methods used in other areas of image processing apply poorly to tracking – as evaluated in this thesis. Different, new approaches are necessary to improve tracking accuracy.

Figure 7.55. Sequence specific eye gaze tracking errors in degrees for baseline tracking algorithm with depth used for initialization on standard test set with and without illumination compensation.


8 Conclusions

The results described in Chapter 7 prove that the proposed eye gaze tracking system can work well in a variety of conditions, both when using only an RGB sensor and when using an additional depth camera. This work further proves that using input from a depth sensor in addition to RGB input can improve the eye gaze tracking accuracy. The measured average accuracies for the best system configuration – 2.87° without a depth sensor and 2.45° with a depth sensor – are impressive and belong in the state-of-the-art category. The work of [55] reports a similar level of accuracy among model-based methods, but it was measured on one person only, so it is not very reliable. The work of Valenti et al. [11] reports an accuracy between 2° and 5° depending on the usage scenario, which is also close to what has been measured for the proposed system. However, both [11] and [55] require multiple point calibration to work. In contrast, the system presented by the author requires only single point initialization.

The author has further proved that the proposed modifications of the head pose tracking and iris localization algorithms increase their accuracy. While measuring manually tagged landmark displacements for the purpose of head pose tracking evaluation is somewhat inconclusive, the measured eye gaze errors provide valuable information. The head pose tracking error cannot be larger than the measured eye gaze tracking error, as it is only one component of this error in the proposed system. This leads to the conclusion that, with eye gaze tracking errors between 2° and 3°, the head pose tracking errors in terms of pitch and yaw are at most within the same range. This is better than the average errors of 3.2° and 3.8° for pitch and yaw reported in [49] – despite the testing methodology being different.

Most importantly, the author has presented a novel geometric approach to eye gaze tracking that, unlike most systems presented in the literature, does not require calibration. It has been proved in Chapter 7 that such a system can work well even if the model parameters differ from the optimal configuration.

It should be noted that the accuracy presented in this thesis is slightly lower than that reported in the author's previous work [57]. However, this is somewhat misleading, as the experiments were performed differently in each case. The stage without head movements was similar, but the stage with head movements in [57] measured eye gaze accuracy when the observed point was in the center of the display – the same point as observed during system initialization. As has been justified in Chapter 6, this case is favorable, as all model-resulting errors compensate each other. The experiments in this thesis consider the case when head movements are combined with gazing towards screen corners. This is a real-world scenario where model errors do not compensate each other, and because of this the measured accuracy of the system should be expected to be lower.

The extensive experimental evaluation performed in Chapter 7 is valuable, as it proves that the reported accuracy can be obtained in real-world scenarios. Most publications on eye gaze tracking measure accuracy in a very specific way which is completely infeasible in everyday use. In contrast, the system presented in this thesis is feasible for application even on mobile devices, on which it has also been tested. As the quality of available RGB and depth sensors improves, the commercial application of a system similar to the one described here can become reality.

One of the main shortcomings of the presented system is the need to initialize it each time before usage. However, the initial template can be stored offline and used again later. This has been tested and proved to work well as long as the lighting conditions did not change. In the real world lighting conditions differ, so this issue would need to be solved for more convenient usage.

Another problem with the proposed system is that the tracking accuracy is significantly lower for vertical head movements than for horizontal head movements. This is a common problem of all systems that involve head pose tracking and is caused by face topology – small depth differences in the vertical direction. There is no obvious solution to this problem, although more extensive depth sensor usage is one possible direction of research.

8.1 Justification of theses

The first thesis, claiming that the system achieves comparable accuracy to systems that require calibration, is proved to be correct. The results presented in Chapter 7 show that the proposed eye gaze tracking system can perform with similar or even higher accuracy than most model-based systems described in the literature.

The second thesis, claiming that small deviations of the system parameters from optimal values do not degrade the system's performance significantly, also seems to be justified. The experiments presented in Section 7.4.2 prove that changing the eye radius by 10% from the optimal value increases the eye gaze tracking error only by 10%. Similarly, changing the interpupillary distance by 5% from the optimal value increases the tracking error by 5% for the RGB-only method. Changing the deviation angle between the visual and optical axis has an even smaller impact on the accuracy of the presented system.

Finally, the third thesis has also been largely justified in Chapter 7. It has been experimentally proved that the modified head pose tracking algorithms proposed by the author increase tracking accuracy and outperform other methods from the literature. It has also been demonstrated that the proposed improvements in the iris localization algorithm reduce the error measured against manually tagged center points, as well as the measured gaze error. One element of the third thesis that cannot be fully justified is the proposed method using depth in every frame, described in Section 3.4.2.2. The results show that while this method is more accurate for head pose tracking than the current state of the art for camera pose estimation [162], it shows no improvement (in terms of average errors) over the hybrid head pose tracker that uses depth for initialization of the head model only. Furthermore, the method using depth for initialization only is at least two times faster. This shows that a highly accurate eye gaze tracking system can be constructed using depth sensor input only during start-up. Further research is necessary to optimally utilize depth sensor measurements in every frame.

Another element of the third thesis that cannot be fully justified is the use of a complex eye model in which the optical and visual axes are different. The proposed eye gaze tracking system configuration has too large errors resulting from head pose tracking to allow observing any improvement from the more complex eye model – at least for the limited amount of test data that has been recorded for the evaluations in this work. It also seems that the potential improvement resulting from using a more complex eye model is at most 3-5% of the total eye gaze tracking angular error – such are the changes in eye gaze tracking errors when the model is used. It therefore seems justified to use a simple eye model with a common optical and visual axis – as the author has proposed in [57] – until the head pose tracking accuracy is significantly improved.

8.2 Future work

Despite demonstrating impressive accuracy, the system presented in this thesis can be improved in many ways. The first and most natural improvement is making the initialization conditions less restrictive. More accurate flexible model algorithms could allow initializing the eye gaze tracking system in any non-frontal pose, with accurate rotation of any generic face models used for later tracking. The flexible models used in this work did not make this possible, but there is some promising new work in this area [169], so this might be a good direction for future improvement.

Another direction of research can aim to provide better usability of the proposed system. Initialization performed each time the system is used can be tiresome. Saving the acquired template frames offline and retrieving them when the system is next used can be appealing for the end user. This approach was actually tested by the author with some success. However, slightly different illumination conditions often caused old templates to fail. Further work is required to overcome this problem with satisfactory results.

Another promising idea is to infer the 3D facial geometry when using an RGB camera only, through Structure from Motion [170]. This could allow creating an optimal mesh model for head pose estimation without using a depth sensor. The author has performed some initial experiments in this direction, but was unable to achieve sufficient accuracy of the resulting 3D model to improve on the warped generic one.
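As a starting point for such experiments, the basic two-view building block of Structure from Motion can be written with standard OpenCV calls; this is a minimal sketch assuming that matched facial keypoints and the camera intrinsics are already available, and it is not the code used for the author's initial experiments.

```python
import numpy as np
import cv2

def two_view_reconstruction(pts1, pts2, K):
    """Recover the relative camera pose and sparse 3D points from two views.

    pts1, pts2: Nx2 float arrays of matched keypoints in the two images.
    K: 3x3 camera intrinsic matrix. The translation is known only up to scale.
    """
    E, inlier_mask = cv2.findEssentialMat(pts1, pts2, K,
                                          method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inlier_mask)
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])   # first camera at origin
    P2 = K @ np.hstack([R, t])                          # second camera pose
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    pts3d = (pts4d[:3] / pts4d[3]).T                    # homogeneous -> 3D
    return R, t, pts3d
```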

As has been mentioned earlier, new ways to reduce the head pose tracking errors during vertical head movements would be helpful. Currently this scenario is the weakest element of this and many other commodity camera eye gaze tracking systems. Some initial exploration of this area has been done by the author as described in Section 3.4.2.2, but more work is necessary.

The author believes that the ideas presented in this work are a significant contribution in the field of eye gaze tracking. As efficient human-computer interfaces become a more and more important aspect of everyone's life, eye gaze tracking research will surely continue. The author hopes that this work can be of some value in this respect and that future work will lead to further improvement of the presented system.


Author’s published work

Conference papers

[1] B. Czupryński and A. Strupczewski, “High Accuracy Head Pose Tracking Survey”, in Active Media Technology (AMT), Warsaw, Poland, 2014

[2] A. Strupczewski and B. Czupryński, “3D reconstruction software comparison for short sequences”, in Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments, SPIE Proceedings, vol. 9290, 2014

[3] A. Strupczewski, B. Czupryński, W. Skarbek, M. Kowalski, J. Naruniec, “Head Pose Tracking from RGBD Sensor Based on Direct Motion Estimation”, in Pattern Recognition and Machine Intelligence (PReMI), Warsaw, Poland, 2015

[4] B. Czupryński and A. Strupczewski, “Real-time RGBD SLAM system”, in Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments, SPIE Proceedings, vol. 9662, 2015

[5] A. Strupczewski, B. Czupryński, J. Naruniec and K. Mucha, “Geometric Eye Gaze Tracking”, in International Conference on Computer Vision Theory and Applica-tions (VISAPP), Rome, Italy, 2016

Patents

[1] D. Shirron, T. Chaim Lev, A. Porat, A. Lieber, J. Marciniak, A. Strupczewski “A System and method for video recognition based on visual image matching”, U.S. Patent US8805123 B2, 12 August 2014.

[2] A. Jakubiak, P. Zborowski, A. Strupczewski, M. Talarek, “Electronic device including projector and method for controlling the electronic device”, U.S. Patent Application US20140298271 A1, 2 October 2014.

[3] A. Jakubiak, A. Strupczewski, G. Grzesiak, J. Bieniusiewicz, K. Nowicki, M. Talarek, P. Simiński, T. Toczyski, “Portable device and method for providing non-contact interface”, U.S. Patent Application US20140300542 A1, 9 October 2014.

[4] A. Strupczewski, J. Naruniec, K. Mucha, B. Czupryński, “Eye gaze tracking method and apparatus and computer readable recording medium”, U.S. Patent US9182819 B2, 10 November 2015.


References

[1] M. Agrawal, K. Konolige and M. R. Blas, "CenSurE: Center Surround Extremas for Realtime Feature Detection and Matching," in European Conference on Computer Vision (ECCV), Marseille, France, 2008.

[2] D. W. Hansen and Q. Ji, "In the Eye of the Beholder: A Survey of Models for Eyes and Gaze," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 3, pp. 478-500, 2010.

[3] I. Wikimedia Foundation, "Human eye - Wikipedia," Wikimedia Foundation, Inc., 05 10 2015. [Online]. Available: https://en.wikipedia.org/wiki/Human_eye. [Accessed 07 10 2015].

[4] S. A. Mosquera, S. Verma and C. McAlinden, "Centration axis in refractive surgery," Eye and Vision, vol. 2, no. 1, pp. 1-16, 2015.

[5] M. Nowakowski, M. Sheehan, D. Neal and A. V. Goncharov, "Investigation of the isoplanatic patch and wavefront aberration along the pupillary axis compared to the line of sight in the eye," Biomedical Optics Express, vol. 3, no. 2, pp. 240-258, 2012.

[6] T. Nagamatsu, K. Junzo and N. Tanaka, "Calibration-free gaze tracking using a binocular 3D eye model," in CHI 2009, Boston, USA, 2009.

[7] SMI SensoMotoric Instruments, "SMI Gaze & Eye Tracking Systems," 25 09 2015. [Online]. Available: http://www.smivision.com/. [Accessed 25 09 2015].

[8] Tobii, "Tobii - The world leader in eye tracking," 25 09 2015. [Online]. Available: http://www.tobii.com/. [Accessed 25 09 2015].

[9] S. Deja, "GazePointer - Webcam EyeTracker," 21 05 2015. [Online]. Available: http://sourceforge.net/projects/gazepointer/. [Accessed 26 09 2015].

[10] Sightcorp, "Insight SDK," 2015. [Online]. Available: http://sightcorp.com/ insight/. [Accessed 26 09 2015].


[11] R. Valenti, N. Sebe and T. Gevers, "Combining Head Pose and Eye Location Information for Gaze Estimation," Image Processing, vol. 21, no. 2, pp. 802-815, 2012.

[12] P. Zieliński, "OpenGazer," 30 11 2009. [Online]. Available: http://www.inference.phy.cam.ac.uk/opengazer/. [Accessed 2015 09 26].

[13] xLabs Pty Ltd., "xlabs - Eye gaze tracking," 2015. [Online]. Available: http://xlabsgaze.com/. [Accessed 26 09 2015].

[14] E. Wood and A. Bulling, "EyeTab: model-based gaze estimation on unmodified tablet computers," in Symposium on Eye Tracking Research and Applications (ETRA), Safety Harbor, Florida, USA, 2014.

[15] H. R. Chennamma and Y. Xiaohui, "A Survey on Eye-Gaze Tracking Techniques," CoRR, 2013.

[16] A. Sharma and P. Abrol, "Eye Gaze Techniques for Human Computer Interaction: A Research Survey," International Journal of Computer Applications, vol. 71, no. 9, pp. 18-25, 2013.

[17] S. Baluja and D. Pomerleau, "Non-Intrusive Gaze Tracking Using Artificial Neural Networks," Carnegie Mellon University, Pittsburgh, PA, USA, 1994.

[18] T. D. Rikert and M. J. Jones, "Gaze Estimation using Morphable Models," in IEEE Conference on Automatic Face and Gesture Recognition, Nara, Japan, 1998.

[19] K.-H. Tan, D. J. Kriegman and N. Ahuja, "Appearance-based Eye Gaze Estimation," in WACV, 2002.

[20] L.-P. Morency, C. M. Christoudias and T. Darrell, "Recognizing Gaze Aversion Gestures in Embodied Conversational Discourse," in International conference on Multimodal interfaces (ICMI), Banff, Canada, 2006.

[21] Y. Ono, T. Okabe and Y. Sato, "Gaze Estimation from Low Resolution Images," in First Pacific Rim Symposium on Image and Video Technology (PSIVT), Hsinchu, Taiwan, 2006.


[22] O. Williams, A. Blake and R. Cipolla, "Sparse and Semi-supervised Visual Mapping with the S3GP," in Computer Vision and Pattern Recognition (CVPR), New York, USA, 2006.

[23] F. Lu, Y. Sugano, T. Okabe and Y. Sato, "Inferring Human Gaze from Appearance via Adaptive Linear Regression," in International Conference on Computer Vision (ICCV), Barcelona, Spain, 2011.

[24] Y. Sugano, Y. Matsushita and Y. Sato, "Appearance-based gaze estimation using visual saliency," TPAMI, vol. 35, no. 2, pp. 329-41, 2013.

[25] F. Alnajar, T. Gevers, R. Valenti and S. Ghebreab, "Calibration-Free Gaze Estimation Using Human Gaze Patterns," in International Conference on Computer Vision (ICCV), Sydney, Australia, 2013.

[26] F. Lu, T. Okabe, Y. Sugano and Y. Sato, "Learning gaze biases with head motion for head pose-free gaze estimation," Image and Vision Computing, vol. 32, no. 3, pp. 159-169, 2014.

[27] X. Zhang, Y. Sugano, M. Fritz and A. Bulling, "Appearance-Based Gaze Estimation in the Wild," in Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 2015.

[28] Y. Sugano, Y. Matsushita and Y. Sato, "Learning-by-Synthesis for Appearance-Based 3D Gaze Estimation," in Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 2014.

[29] F. Lu, Y. Sugano, T. Okabe and Y. Sato, "Adaptive Linear Regression for Appearance-Based Gaze Estimation," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 36, no. 10, pp. 2033-2046, 2014.

[30] T. Schneider, B. Schauerte and R. Stiefelhagen, "Manifold Alignment for Person Independent Appearance-Based Gaze Estimation," in International Conference on Pattern Recognition (ICPR), Stockholm, Sweden, 2014.

[31] T. Baltrusaitis, P. Robinson and L.-P. Morency, "Continuous Conditional Neural Fields for Structured Regression," in European Conference on Computer Vision (ECCV), Zurich, Switzerland, 2014.


[32] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee and A. Y. Ng, "Multimodal Deep Learning," in International Conference on Machine Learning (ICML), Bellevue, WA, USA, 2011.

[33] K. A. Funes Mora, F. Monay and J.-M. Odobez, "EYEDIAP: A Database for the Development and Evaluation of Gaze Estimation Algorithms from RGB and RGB-D Cameras," in Eye Tracking Research and Applications (ETRA), Safety Harbor, FL, USA, 2014.

[34] J.-G. Wang and E. Sung, "Gaze determination via images of irises," Image and Vision Computing, vol. 19, no. 12, pp. 891-911, 2001.

[35] H. Wu, Q. Chen and T. Wada, "Conic-based algorithm for visual line estimation from one image," in Automatic Face and Gesture Recognition (FGR), Seoul, Korea, 2004.

[36] J.-G. Wang, E. Sung and R. Venkateswarlu, "Eye Gaze Estimation from a Single Image of One Eye," in International Conference on Computer Vision (ICCV), Nice, France, 2003.

[37] J.-G. Wang, Head-pose and eye-gaze determination for Human-Machine Interaction, PhD Thesis, Singapore: School of Electrical and Electronic Engineering, Nanyang Technological University, 2001.

[38] S. Kohlbecher and T. Poitschke, "Calibration-free eye tracking by reconstruction of the pupil ellipse in 3D space," in Eye tracking research & applications (ETRA), Savannah, GA, USA, 2008.

[39] E. Wood, "GitHub - EyeTab," 07 04 2014. [Online]. Available: https://github.com/errollw/EyeTab/. [Accessed 04 10 2015].

[40] A. Gee and R. Cipolla, "Non-Intrusive Gaze Tracking for Human-Computer Interaction," in International Conference on Mechatronics and Machine Vision in Practice, 1994.

[41] J. Heinzmann and A. Zelinsky, "3-D Facial Pose and Gaze Point Estimation using a Robust Real-Time Tracking Paradigm," in Automatic Face and Gesture Recognition (FGR), Nara, Japan, 1998.


[42] Y. Matsumoto and A. Zelinsky, "An Algorithm for Real-time Stereo Vision Implementation of Head Pose and Gaze Direction Measurement," in Automatic Face and Gesture Recognition (FGR), Grenoble, France, 2000.

[43] R. Newman, Y. Matsumoto, S. Rougeaux and A. Zelinsky, "Real-time stereo tracking for head pose and gaze estimation," in Automatic Face and Gesture Recognition (FGR), Grenoble, France, 2000.

[44] S. Baker, I. Matthews and T. Kanade, "Passive Driver Gaze Tracking with Active Appearance Models," in World Congress on Intelligent Transportation Systems, Washington, USA, 2004.

[45] J. Xiao, S. Baker, I. Matthews and T. Kanade, "Real-Time Combined 2D+3D Active Appearance Models," in Conference on Computer Vision and Pattern Recognition (CVPR), Washington, USA, 2004.

[46] J. Zhu and J. Yang, "Subpixel eye gaze tracking," in Automatic Face and Gesture Recognition (FGR), Washington, USA, 2002.

[47] D. W. Hansen and A. E. Pece, "Eye tracking in the wild," Computer Vision and Image Understanding, vol. 98, pp. 155-181, 2005.

[48] R. Valenti, J. Staiano, N. Sebe and T. Gevers, "Webcam-based Visual Gaze Estimation," in International Conference on Image Analysis and Processing (ICIAP), Vietri sul Mare, Italy, 2009.

[49] J. Xiao, T. Moriyama, T. Kanade and J. Cohn, "Robust Full-Motion Recovery of Head by Dynamic Templates and Re-registration Techniques," International Journal of Imaging Systems and Technology, vol. 13, pp. 85-94, 2003.

[50] H. Yamazoe, A. Utsumi, T. Yonezawa and S. Abe, "Remote Gaze Estimation with a Single Camera Based on Facial-Feature Tracking without Special Calibration Actions," in Symposium on Eye tracking research & applications (ETRA), Savannah, Georgia, USA, 2008.

[51] E. D. Guestrin and M. Eizenman, "General theory of remote gaze estimation using the pupil center and corneal reflections," Biomedical Engineering, vol. 53, no. 6, pp. 1124-1133, 2006.


[52] Z. Zhu and Q. Ji, "Novel Eye Gaze Tracking Techniques Under Natural Head Movement," IEEE Transactions on Biomedical Engineering, vol. 44, no. 12, pp. 2246-2260, 2007.

[53] T. Nagamatsu, R. Sugano, Y. Iwamoto, J. Kamahara and N. Tanaka, "User-calibration-free Gaze Tracking with Estimation of the Horizontal Angles between the Visual and the Optical Axes of Both Eyes," in Symposium on Eye-Tracking Research & Applications (ETRA), Austin, USA, 2010.

[54] T. Nagamatsu, J. Kamahara and N. Tanaka, "3D Gaze Tracking with Easy Calibration Using Stereo Cameras for," in 17th IEEE International Symposium on Robot and Human Interactive Communication, Munich, Germany, 2008.

[55] J. Chen and Q. Ji, "3D Gaze Estimation with a Single Camera without IR Illumination," in International Conference on Pattern Recognition (ICPR), Tampa, FL, USA, 2008.

[56] Y. Tong, Y. Wang, Z. Zhu and Q. Ji, "Robust facial feature tracking under varying face pose and facial expression," Pattern Recognition, vol. 40, no. 11, p. 3195–3208, 2007.

[57] A. Strupczewski, B. Czupryński, J. Naruniec and K. Mucha, "Geometric Eye Gaze Tracking," in International Conference on Computer Vision Theory and Applications, Rome, Italy, 2016.

[58] Y. Li, D. S. Monaghan and N. E. O'Connor, "Real-Time Gaze Estimation using a Kinect and," in International Conference on MultiMedia Modeling (MMM), Dublin, Ireland, 2014.

[59] J. Mansanet, A. Albiol, R. Paredes, J. M. Mossi and A. Albiol, "Estimating Point of Regard with a Consumer Camera at a Distance," in Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA), Madeira, Portugal, 2013.

[60] N. S. Altman, "An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression," The American Statistician, vol. 46, no. 3, pp. 175-185, 1992.

[61] H. Drucker, C. J. C. Burges, L. Kaufman, A. Smola and V. Vapnik, "Support Vector Regression Machines," in Advances in Neural Information Processing Systems (NIPS), Denver, CO, USA, 1996.


[62] L. Breiman, "Random Forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.

[63] K. Funes Mora and J. Odobez, "Gaze estimation from multimodal Kinect data," in Computer Vision and Pattern Recognition Workshops (CVPRW), Providence, RI, USA, 2012.

[64] P. Paysan, R. Knothe, B. Amberg, S. Romdhani and T. Vetter, "A 3D Face Model for Pose and Illumination Invariant Face Recognition," in Advanced Video and Signal Based Surveillance (AVSS), Genova, Italy, 2009.

[65] S. Rusinkiewicz and M. Levoy, "Efficient Variants of the ICP Algorithm," in 3-D Digital Imaging and Modeling, Quebec City, Canada, 2001.

[66] K. A. Funes Mora and J.-M. Odobez, "Person independent 3D gaze estimation from remote RGB-D cameras," in International Conference on Image Processing (ICIP), Melbourne, Australia, 2013.

[67] R. Jafari and D. Ziou, "Gaze Estimation Using Kinect/PTZ Camera," in International Symposium on Robotic and Sensors Environments (ROSE), Magdeburg, Germany, 2012.

[68] G. Fanelli, T. Weise, J. Gall and L. V. Gool, "Real Time Head Pose Estimation from Consumer Depth Cameras," in DAGM Symposium, Frankfurt/Main, Germany, 2011.

[69] L. Sun, Z. Liu and M.-T. Sun, "Real time gaze estimation with a consumer depth camera," Information Sciences, vol. 320, pp. 346-360, 2015.

[70] M. Song, D. Tao, Z. Sun and X. Li, "Visual-Context Boosting for Eye Detection," IEEE Transactions on Systems, Man, and Cybernetics, vol. 40, no. 6, pp. 1460-1467, 2010.

[71] X. Xiong, Q. Cai, Z. Liu and Z. Zhang, "Eye Gaze Tracking Using an RGBD Camera: A Comparison with an RGB Solution," in ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication, Seattle, USA, 2014.

[72] X. Xiong and F. De la Torre, "Supervised Descent Method and Its Applications to Face Alignment," in Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 2013.


[73] D. F. Dementhon and L. S. Davis, "Model-based object pose in 25 lines of code," International Journal of Computer Vision, vol. 15, no. 1, pp. 123-141, 1995.

[74] D. Li, D. Winfield and D. J. Parkhurst, "Starburst: A hybrid algorithm for video-based eye tracking combining feature-based and model-based approaches," in Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), San Diego, CA, USA, 2005.

[75] L. Jianfeng and L. Shigang, "Eye-Model-Based Gaze Estimation by RGB-D Camera," in Computer Vision and Pattern Recognition Workshops (CVPRW), Columbus, OH, USA, 2014.

[76] F. Timm and E. Barth, "Accurate eye center localization by means of gradients," in International Conference on Computer Vision Theory and Applications (VISAPP), Vilamoura, Portugal, 2011.

[77] A. Strupczewski, J. Naruniec, K. Mucha and B. Czupryński, "Eye gaze tracking method and apparatus and computer-readable recording medium". USA Patent US9182819 B2, 10 November 2015.

[78] E. Murphy-Chutorian and M. M. Trivedi, "Head Pose Estimation in Computer Vision: A Survey," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 31, no. 4, pp. 607-626, 2008.

[79] B. Czupryński and A. Strupczewski, "High Accuracy Head Pose Tracking Survey," in Active Media Technology (AMT), Warsaw, Poland, 2014.

[80] D. Beymer, "Face recognition under varying pose," in Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 1994.

[81] S. Niyogi and W. Freeman, "Example-based head tracking," in Automatic Face and Gesture Recognition (FGR), Killington, VT, USA, 1996.

[82] J. Sherrah, S. Gong and E. Ong, "Face distributions in similarity space under varying head pose," Image and Vision Computing, vol. 19, no. 12, pp. 807-819, 2001.

[83] J. Ren, M. Rahman, N. Kehtarnavaz and L. Estevez, "Real-Time Head Pose Estimation on Mobile Platforms," Journal of Systemics, Cybernetics & Informatics, vol. 8, no. 3, p. 56, 2010.


[84] M. Viola, M. J. Jones and P. Viola, "Fast multi-view face detection," in Computer Vision and Pattern Recognition (CVPR), Madison, WI, USA, 2003.

[85] Y. Li, S. Gong, J. Sherrah and H. Liddell, "Support vector machine based multi-view face detection and recognition," Image and Vision Computing, vol. 22, pp. 413-427, 2004.

[86] Y. Ma, Y. Konishi, K. Kinoshita, S. Lao and M. Kawade, "Sparse Bayesian Regression for Head Pose Estimation," in International Conference on Pattern Recognition (ICPR), Hong Kong, 2006.

[87] M. Voit, K. Nickel and R. Stiefelhagen, "A Bayesian Approach for Multi-view Head Pose Estimation," in Multisensor Fusion and Integration for Intelligent Systems, Heidelberg, Germany, 2006.

[88] M. Zhang, K. Li and Y. Liu, "Head Pose Estimation from Low-Resolution Image with Hough Forest," in Chinese Conference on Pattern Recognition (CCPR), Chongqing, China, 2010.

[89] S. Srinivasan and K. Boyer, "Head pose estimation using view based eigenspaces," in International Conference on Pattern Recognition (ICPR), Quebec, Canada, 2002.

[90] B. Ma, W. Zhang, S. Shan, X. Chen and W. Gao, "Robust Head Pose Estimation Using LGBP," in International Conference on Pattern Recognition (ICPR), Hong Kong, 2006.

[91] B. Raytchev, I. Yoda and K. Sakaue, "Head pose estimation by nonlinear manifold learning," in International Conference on Pattern Recognition (ICPR), Cambridge, UK, 2004.

[92] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, pp. 2323-2326, 2000.

[93] M. Belkin and P. Niyogi, "Laplacian Eigenmaps for dimensionality reduction and data representation," Neural Computation, vol. 15, no. 6, pp. 1373-1396, 2003.

[94] T. Horprasert, Y. Yacoob and L. S. Davis, "Computing 3-D Head Orientation from a Monocular Image," in Automatic Face and Gesture Recognition (FGR), Killington, VT, USA, 1996.

[95] V. Lepetit, F. Moreno-Noguer and P. Fua, "EPnP: An Accurate O(n) Solution to the PnP Problem," International Journal of Computer Vision, vol. 81, no. 2, pp. 155-166, 2009.

[96] C.-P. Lu, G. Hager and E. Mjolsness, "Fast and globally convergent pose estimation from video images," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 22, no. 6, pp. 610-622, 2000.

[97] M. Sapienza and K. P. Camilleri, "Fasthpe: A recipe for quick head pose estimation," Department of Systems and Control Engineering, University of Malta, Msida, Malta, 2011.

[98] M. La Cascia, S. Sclaroff and V. Athitsos, "Fast, Reliable Head Tracking under Varying Illumination: An Approach Based on Registration of Texture-Mapped 3D Models," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 22, no. 4, pp. 322-336, 2000.

[99] J.-G. Wang and E. Sung, "EM enhancement of 3D head pose estimated by point at infinity," Image and Vision Computing, vol. 25, no. 12, pp. 1864-1874, 2007.

[100] S. Malassiotis and M. G. Strintzis, "Robust real-time 3D head pose estimation from range data," Pattern Recognition, vol. 38, no. 8, pp. 1153-1165, 2005.

[101] P. Padeleris, X. Zabulis and A. Argyros, "Head pose estimation on depth data based on Particle Swarm Optimization," in Computer Vision and Pattern Recognition Workshops (CVPRW), Providence, RI, USA, 2012.

[102] N. Krüger, M. Pötzsch and C. von der Malsburg, "Determination of Face Position and Pose With a Learned Representation Based on Labelled Graphs," Image and Vision Computing, vol. 15, no. 8, pp. 665-673, 1997.

[103] T. Cootes, C. Taylor, D. Cooper and J. Graham, "Active Shape Models-Their Training and Application," Computer Vision and Image Understanding, vol. 61, no. 1, pp. 38-59, 1995.

[104] I. Matthews and S. Baker, "Active Appearance Models Revisited," International Journal of Computer Vision, vol. 60, no. 2, pp. 135-164, 2004.

[105] T. Cootes, K. Walker and C. Taylor, "View-Based Active Appearance Models," in Automatic Face and Gesture Recognition (FGR), Grenoble, France, 2000.

[106] P. Viola and M. J. Jones, "Robust Real-Time Face Detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.

[107] C. Tomasi and T. Kanade, "Shape and Motion from Image Streams under Orthography: a Factorization Method," International Journal of Computer Vision, vol. 9, no. 2, pp. 137-154, 1992.

[108] Z. Gui and C. Zhang, "3D Head Pose Estimation Using Non-rigid Structure-from-motion and Point Correspondence," in IEEE Region 10 Conference (TENCON), Hong Kong, 2006.

[109] N. Smolyanskiy, C. Huitema, L. Liang and S. E. Anderson, "Real-time 3D face tracking based on active appearance model constrained by depth data," Image and Vision Computing, vol. 32, no. 11, pp. 860-869, 2014.

[110] D. Cristinacce and T. Cootes, "Feature Detection and Tracking with Constrained Local Models," in British Machine Vision Conference (BMVC), Edinburgh, UK, 2006.

[111] P. Martins, R. Caseiro and J. Batista, "Non-Parametric Bayesian Constrained Local Models," in Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 2014.

[112] L. Zamuner, K. Bailly and E. Bigorgne, "A Pose-Adaptive Constrained Local Model for Accurate Head Pose Tracking," in International Conference on Pattern Recognition (ICPR), Stockholm, Sweden, 2014.

[113] L.-P. Morency, J. Whitehill and J. Movellan, "3D Constrained Local Model for Rigid and Non-Rigid Facial Tracking," in Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 2012.

[114] X. Zhu and D. Ramanan, "Face Detection, Pose Estimation, and Landmark Localization in the Wild," in Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 2012.

[115] X. Cao, Y. Wei, F. Wen and J. Sun, "Face Alignment by Explicit Shape Regression," International Journal of Computer Vision (IJCV), vol. 107, no. 2, pp. 177-190, 2014.

[116] S. Ren, X. Cao, Y. Wei and J. Sun, "Face Alignment at 3000 FPS via Regressing Local Binary Features," in Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 2014.

[117] M. Malciu and F. Preteux, "A robust model-based approach for 3D head tracking in video sequences," in Automatic Face and Gesture Recognition (FGR), Grenoble, France, 2000.

[118] Z. Zhu and Q. Ji, "Real time 3D face pose tracking from an uncalibrated camera," in Computer Vision and Pattern Recognition Workshops (CVPRW), Washington, DC, USA, 2004.

[119] B. D. Lucas and T. Kanade, "An Iterative Image Registration Technique with an Application to Stereo Vision," in International Joint Conference on Artificial Intelligence (IJCAI), Vancouver, Canada, 1981.

[120] G. Aggarwal, A. Veeraraghavan and R. Chellappa, "3D Facial Pose Tracking in Uncalibrated Videos," in Pattern Recognition and Machine Intelligence (PReMI), Kolkata, India, 2005.

[121] L. Lu, X.-T. Dai and G. Hager, "A Particle Filter without Dynamics for Robust 3D Face Tracking," in Computer Vision and Pattern Recognition Workshop (CVPRW), Washington, DC, USA, 2004.

[122] W. Yu and L. Gang, "Head Pose Estimation Based on Head Tracking and the Kalman filter," in International Conference on Physics Science and Technology (ICPST), Dubai, United Arab Emirates, 2011.

[123] L. Vacchetti, V. Lepetit and P. Fua, "Stable real-time 3D tracking using online and offline information," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 26, no. 10, pp. 1385-1391, 2004.

[124] D. Fidaleo, G. Medioni, P. Fua and V. Lepetit, "An Investigation of Model Bias in 3D Face Tracking," in Analysis and Modelling of Faces and Gestures (AMFG), Beijing, China, 2005.

[125] J.-S. Jang and T. Kanade, "Robust 3D Head Tracking by Online Feature Registration," in Automatic Face and Gesture Recognition (FGR), Amsterdam, Netherlands, 2008.

[126] J.-S. Jang and T. Kanade, "Robust 3D Head Tracking by View-based Feature Point Registration," Carnegie Mellon University, Pittsburgh, PA, USA, 2010.

[127] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, vol. 60, pp. 91-110, 2004.

[128] S. Li, K. N. Ngan and L. Sheng, "A Head Pose Tracking System using RGB-D Camera," in International Conference on Computer Vision Systems (ICVS), St. Petersburg, Russia, 2013.

[129] A. Strupczewski, B. Czupryński, W. Skarbek, M. Kowalski and J. Naruniec, "Head Pose Tracking from RGBD Sensor Based on Direct Motion Estimation," in Pattern Recognition and Machine Intelligence (PReMI), Warsaw, Poland, 2015.

[130] W.-K. Liao, D. Fidaleo and G. Medioni, "Robust, real-time 3D face tracking from a monocular view," EURASIP Journal on Image and Video Processing, vol. 2010, 2010.

[131] L. Zhang, H. Ai and S. Lao, "Robust face alignment based on hierarchical classifier network," in ECCV Workshop on Human-Computer Interaction, Graz, Austria, 2006.

[132] L.-P. Morency, J. Whitehill and J. Movellan, "Generalized adaptive view-based appearance model: Integrated framework for monocular head pose estimation," in Automatic Face and Gesture Recognition (FGR), Amsterdam, Netherlands, 2008.

[133] S. Vedula, P. Rander, R. Collins and T. Kanade, "Three-dimensional scene flow," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 27, no. 3, pp. 475-480, 2005.

[134] Y. Zhou, L. Gu and H.-J. Zhang, "Bayesian Tangent Shape Model: Estimating Shape and Pose Parameters via Bayesian Inference," in Computer Vision and Pattern Recognition (CVPR), Madison, WI, USA, 2003.

[135] K.-N. Kim and R. S. Ramakrishna, "Vision-Based Eye-Gaze Tracking for Human Computer Interface," in International Conference on Systems, Man, and Cybernetics, Tokyo, Japan, 1999.

[136] A. Pérez, M. L. Córdoba, A. García, R. Méndez, M. L. Muñoz, J. L. Pedraza and F. Sánchez, "A Precise Eye-Gaze Detection and Tracking System," in International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG), Plzen-Bory, Czech Republic, 2003.

[137] D. Young, H. Tunley and R. Samuels, "Specialised Hough Transform and Active Contour Methods for Real-Time Eye Tracking," School of Cognitive and Computing Sciences, University of Sussex, Sussex, UK, 1995.

[138] K. Toennies, F. Behrens and M. Aurnhammer, "Feasibility of Hough-transform-based iris localisation for real-time-application," in International Conference on Pattern Recognition (ICPR), Quebec, Canada, 2002.

[139] P. Li and X. Liu, "An Incremental Method for Accurate Iris Segmentation," in International Conference on Pattern Recognition (ICPR), Tampa, FL, USA, 2008.

[140] R. Valenti and T. Gevers, "Accurate Eye Center Location and Tracking Using Isophote Curvature," in Computer Vision and Pattern Recognition (CVPR), Anchorage, AK, USA, 2008.

[141] BioID Technology Research, "The BioID Face Database," 2001. [Online]. Available: https://www.bioid.com/About/BioID-Face-Database. [Accessed 09 10 2015].

[142] O. Jesorsky, K. J. Kirchberg and R. Frischholz, "Robust Face Detection Using the Hausdorff Distance," in International Conference on Audio- and Video-Based Biometric Person Authentication, Halmstad, Sweden, 2001.

[143] M. Pilu, A. W. Fitzgibbon and R. B. Fisher, "Ellipse-specific direct least-square fitting," in International Conference on Image Processing (ICIP), Lausanne, Switzerland, 1996.

[144] D. Li, Low-cost eye-tracking for human computer interaction, Master's thesis, Iowa State University, Ames, IA, USA, 2005.

[145] T. F. Cootes and C. J. Taylor, "Active Shape Models - 'Smart Snakes'," in British Machine Vision Conference (BMVC), Leeds, UK, 1992.

[146] J. Daugman, "How Iris Recognition Works," in International Conference on Image Processing, Rochester, NY, USA, 2002.

[147] W. Zhang, B. Li, X. Ye and Z. Zhuang, "A Robust Algorithm for Iris Localization Based on Radial Symmetry," in International Conference on Computational Intelligence and Security Workshops, Harbin, China, 2007.

[148] G. Loy and A. Zelinsky, "A Fast Radial Symmetry Transform for Detecting Points of Interest," in European Conference on Computer Vision (ECCV), Copenhagen, Denmark, 2002.

[149] Z. Zhou, P. Yao, Z. Zhuang and J. Li, "A robust algorithm for iris localization based on radial symmetry and circular integro-differential operator," in IEEE Conference on Industrial Electronics and Applications (ICIEA), Beijing, China, 2011.

[150] S. Sclaroff and J. Isidoro, "Active blobs," in International Conference on Computer Vision (ICCV), Mumbai, India, 1998.

[151] T. F. Cootes, G. J. Edwards and C. J. Taylor, "Active appearance models," in European Conference on Computer Vision (ECCV), Freiburg, Germany, 1998.

[152] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 2005.

[153] J. Matas and T. Vojir, "Robustifying the Flock of Trackers," in 16th Computer Vision Winter Workshop, Mitterberg, Austria, 2011.

[154] S. Leutenegger, M. Chli and R. Y. Siegwart, "BRISK: Binary Robust Invariant Scalable Keypoints," in International Conference on Computer Vision (ICCV), Barcelona, Spain, 2011.

[155] D. Jobson, Z.-u. Rahman and G. Woodell, "A multiscale retinex for bridging the gap between color images and the human observation of scenes," IEEE Transactions on Image Processing, vol. 6, no. 7, pp. 965-976, 1997.

[156] O. Arandjelovic, "Making the most of the self-quotient image in face recognition," in Automatic Face and Gesture Recognition (FGR), Shanghai, China, 2013.

[157] C.-Y. Tsai and C.-H. Chou, "A novel simultaneous dynamic range compression and local contrast enhancement algorithm for digital video cameras," EURASIP Journal on Image and Video Processing, vol. 2011, no. 1, 2011.

[158] W. Chen, M. J. Er and S. Wu, "Illumination compensation and normalization for robust face recognition using discrete cosine transform in logarithm domain," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 36, no. 2, pp. 458-466, 2006.

[159] Y. Wang, M. Gupta, S. Zhang, S. Wang, X. Gu, D. Samaras and P. Huang, "High Resolution Tracking of Non-Rigid Motion of Densely Sampled 3D Data Using Harmonic Maps," International Journal of Computer Vision (IJCV), vol. 76, no. 3, pp. 283-300, 2008.

[160] Q. Cai, D. Gallup, C. Zhang and Z. Zhang, "3D deformable face tracking with a commodity depth camera," in European Conference on Computer Vision (ECCV), Heraklion, Greece, 2010.

[161] N. Smolyanskiy, C. Huitema, L. Liang and S. E. Anderson, "Real-time 3D face tracking based on active appearance model constrained by depth data," Image and Vision Computing, vol. 32, no. 11, pp. 860-869, 2014.

[162] C. Kerl, J. Sturm and D. Cremers, "Dense Visual SLAM for RGB-D Cameras," in Intelligent Robots and Systems (IROS), Tokyo, Japan, 2013.

[163] L. Sesma, A. Villanueva and R. Cabeza, "Evaluation of Pupil Center-Eye Corner Vector for Gaze Estimation Using a Web Cam," in Eye Tracking Research and Applications (ETRA), Santa Barbara, CA, USA, 2012.

[164] D. Beymer and M. Flickner, "Eye gaze tracking using an active stereo head," in Computer Vision and Pattern Recognition (CVPR), Madison, WI, USA, 2003.

[165] H. Gross, F. Blechinger and B. Achtner, "Human Eye," in Handbook of Optical Systems: Vol. 4 Survey of Optical Instruments, Weinheim, Wiley-VCH, 2008, pp. 1-87.

[166] Z. Zhang, "A flexible new technique for camera calibration," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 22, no. 11, pp. 1330-1334, 2000.

[167] Willow Garage, "Camera Calibration and 3D Reconstruction," OpenCV.org, 11 11 2015. [Online]. Available: http://docs.opencv.org/2.4/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html. [Accessed 11 11 2015].

[168] D. Lee, H. Park and C. Yoo, "Face alignment using cascade Gaussian process regression trees," in Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 2015.

[169] N. Wang, X. Gao, D. Tao and X. Li, "Facial Feature Point Detection: A Comprehensive Survey," 10 2014. [Online]. Available: http://arxiv.org/pdf/1410.1037v1.pdf. [Accessed 23 01 2016].

[170] D. Fidaleo and G. Medioni, "Model-assisted 3D face reconstruction from video," in Analysis and Modeling of Faces and Gestures (AMFG), Rio de Janeiro, Brazil, 2007.