
Eye Gaze Tracking with Google Cardboard Using Purkinje Images

Scott W. Greenwald, Luke Loreti, Markus Funk, Ronen Zilberman, Pattie Maes
MIT Media Lab, Cambridge, MA

Abstract

Mobile phone-based Virtual Reality (VR) is rapidly growing as a platform for stereoscopic 3D and non-3D digital content and applications. The ability to track eye gaze in these devices would be a tremendous opportunity on two fronts: firstly, as an interaction technique, where interaction is currently awkward and limited, and secondly, for studying human visual behavior. We propose a method to add eye gaze tracking to these existing devices using their on-board display and camera hardware, with a minor modification to the headset enclosure. We present a proof-of-concept implementation of the technique and show results demonstrating its feasibility. The software we have developed will be made available as open source to benefit the research community.

CR Categories: H.5.1 [Multimedia Information Systems]: Artificial, augmented, and virtual realities; H.5.2 [User Interfaces]: Input devices and strategies; I.3.1 [Hardware Architecture]: Input devices

Keywords: eye tracking, virtual reality, low cost, Purkinje images

1 Introduction and Background

With the recent growth of VR, an increasing number of head-mounted displays (HMDs) for VR are now commercially available – in both mobile (e.g. Gear VR and Google Cardboard) and stationary or room-scale (e.g. Oculus Rift and HTC Vive) variants. We make two observations related to eye gaze tracking in the context of this recent trend: firstly, that the use of eye gaze as an input method could improve device usability, and secondly, that introducing eye gaze tracking on VR headsets could advance research on human visual behavior. The impact of both of these would be particularly great if the solution could scale to existing mobile VR devices – which are the most numerous by far – without the use of additional hardware.

Traditionally, input for these VR systems is done using either a hand-held controller or touch-based input mounted on the HMD. Input methods for mobile VR systems are currently very limited – Google Cardboard, for example, uses a combination of head orientation and single-button input, and some 3DoF IMU-based input devices are becoming available, such as those from Nod Labs, and the upcoming controller for the Google Daydream platform. As an alternative to hand-based input, research projects have suggested using eye tracking as an interaction technique for VR environments [Duchowski et al. 2000; Kangas et al. 2016; Ohshima et al. 1996; Tanriverdi and Jacob 2000].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. © 2016 ACM.
VRST '16, November 02-04, 2016, Garching bei München, Germany
ISBN: 978-1-4503-4491-3/16/11
DOI: http://dx.doi.org/10.1145/2993369.2993407

Figure 1: We modified a Google Cardboard with a (1) spacer that holds the Samsung Note 5 in the correct position. Further, we added a (2) mirror to reflect timecodes for synchronizing content and gaze.

It has been shown that in VR, eye tracking is perceived as less tiring than hand-based interaction [Tanriverdi and Jacob 2000], and eye tracking has been used for training users in airplane inspection tasks [Duchowski et al. 2000]. Also, eye tracking has traditionally been expensive; research has therefore also focused on building low-cost alternatives using off-the-shelf components [Li et al. 2006].

For many years, eye tracking was only possible in stationary lab environments due to the bulkiness of the equipment. Several years ago, dedicated eye tracking glasses became available for eye tracking in a mobile setting, and these remain very costly. Most recently, the hardware has become small and modular enough to be included in mobile devices and commercial products. For example, there are eye tracking extensions available for many VR HMDs, such as those for the Gear VR and the Oculus Rift by SMI and EyeTribe. However, these approaches still require additional hardware and are costly. Introducing eye tracking to head-mounted VR displays without additional hardware would be a huge opportunity for both of the aforementioned problems: improving system usability, and scaling the study of visual behavior.

Google Cardboard provides a VR experience as a $15 add-on to mobile phones that hundreds of millions of people already own, and as such it is the cheapest VR HMD currently on the market for these people. According to the LA Times, as of January 2016¹, 5 million Google Cardboard systems had been sold worldwide. This makes the Google Cardboard a suitable platform to test new technology on a large scale with a large number of users. In this paper, our primary contribution is a novel low-cost approach for enabling synchronized eye-tracking in mobile VR using computer vision algorithms and a Google Cardboard. We demonstrate the feasibility of the technique through user testing. We make a second contribution by providing our software to the research community as an open-source project, enabling others to advance the technique towards practical deployability.

2 Concept for HMD eye tracking

In this section, we describe our basic concept for eye tracking with Google Cardboard, including how it differs from the most common commercial eye tracking technique. We then enumerate the specific capabilities that we do and do not demonstrate with our prototype in the scope of this paper.

¹ http://phys.org/news/2016-01-google-cardboard-shipped-million-virtual.html



Figure 2: An overview of the maximum eye positions taken with the front-facing camera using our system: (a) furthest right, (b) furthest left, (c) furthest up, and (d) furthest down. (1) shows the region of the image where the screen content is reflected most clearly, which is used as input for our algorithm. (2) The Google Cardboard internal layout causes the content to be reflected from the lenses directly. This reflection is treated as noise and is ignored by our algorithm. (3) The mirror reflecting parts of the screen. (4) A timecode used for synchronizing content and eye position.

We propose enabling eye gaze tracking for the Cardboard using the mobile phone's front-facing camera. The technique leverages the corneal reflection of on-screen content in images captured by the camera. As the eye moves, the location and features of this reflection change, and these changes are used to estimate the user's gaze position. The most common technique for tracking gaze on a computer display uses an assembly consisting of a camera and infrared LEDs. The changing positions of the glints created by the reflection of these LEDs in the images captured by the camera are used to infer the gaze point of the user [Hansen and Pece 2005]. Just as the positions of glints in the corneal reflection used in this IR-based eye tracking technique move when the eye moves, so do the positions of reflected on-screen image features in the visible spectrum, seen through the front-facing camera on the Google Cardboard device. As in [Nakazawa and Nitschke 2012], the features of this image can be resolved and detected in the reflection. Two requirements for IR-based remote eye-tracking systems that are challenging to meet are (1) they must be robust to the presence of other sources of IR light in the environment (e.g. sunlight), as noted in [Li et al. 2006], and (2) they must account for the movement of both the user's head and the user's eyes. These challenges are part of the reason such systems are still so expensive. Both of these requirements are much easier to meet in our case: the Cardboard device blocks most ambient light from entering the space between the camera and the eye, and the user's head stays comparatively still with respect to the device display.

The goal for our proof-of-concept implementation is to show empirically, with the front-facing camera of a Google Cardboard device, that there is sufficient information in the Purkinje image to track gaze position with a useful degree of precision and accuracy. That is, given the performance specifications of existing devices, the attainable quality of feature resolution and position estimation is adequate for this purpose. Data collection posed some unexpected challenges, which we needed to tackle as part of this practical exploration. In particular, in this paper: (i) we present a practical workflow for acquiring high-quality data, which means assigning reliable timestamps to camera images, and allowing these to be labeled with known gaze positions, and (ii) we present two algorithms for estimating gaze position: one which performs discrete classification and another which performs continuous gaze position estimation. The most notable limitations of our implementation are that (1) the algorithm is not implemented on the mobile device itself, (2) it tracks gaze on calibrated images only, and (3) it does not work in real time.

3 Data Acquisition

In this section, we describe some practical problems associated with capturing the data necessary for eye tracking on a Google Cardboard device, along with our corresponding solutions. To begin, data acquisition in this context means acquiring pairs of images (these will be referred to as data points) – one a screen capture, and the other a frame from the device's front-facing camera – and labeling them with a timestamp. From the camera side, this requires knowing when a photo was taken, with a precision of about 30 ms. Front-facing mobile phone cameras are designed primarily to support streaming video chat and point-and-shoot selfies, which do not require this kind of precision. Consequently, this trivial-sounding task turns out to be unexpectedly difficult. With our Samsung Note 5, the time from issuing a camera API call until the photo is taken is non-deterministic, typically in the range of 200 to 300 ms. Another 200 to 300 ms then elapses before the frame is presented back to our app by the Android OS. For comparison, the eye can move with a velocity of 300 deg/sec during a large saccade – at which rate it would traverse the entire 90-degree field of view of the Google Cardboard in 300 ms. Hence the basic tools at our disposal, used naïvely, are an order of magnitude less precise than this application requires. In the paragraphs that follow, we first present a technique for gathering a small number of points with maximal precision for the training set, and a different technique for attaining best-effort precision at run-time.

Our training set acquisition technique uses a traditional grid-based calibration, requiring one data point for each grid point. The application we designed shows the user where to look using a small red dot which is minimally visible in the reflection. The user is prompted to press a button to indicate that she is looking at the dot. In response, the application triggers capture and advances the dot to the next location. Two problems arise when implementing this basic design: (i) there is a delay from triggering until image capture, which means that the user may no longer be looking at the desired point when the image is captured, and (ii) pressing the built-in Cardboard button can cause the device to move slightly with respect to the face, decreasing the consistency of the device position throughout the calibration. Addressing the first of these, the application gives the user feedback once an image has been captured: once she has pressed the button to indicate that she is looking, the photo is taken, and the screen flashes red to indicate that the data point has been captured. This gives the user an explicit indication of when it is time for her eye to move again, so she can ensure that the eye is in the desired position when the image is captured. To address the second problem – that pressing the button causes the device to move – we have the user press a key on a laptop keyboard instead, which sends a trigger to the mobile application over the network using TCP.


Figure 3: (a) The coarse-grained calibration routine using a 3×4 grid. The targets are labeled with characters to make it easier for users to remember the last target. (b) We gather test data using a pursuit approach. A red dot moves across the screen, and the user is asked to follow its movement with their eyes. The collection of red dots represents the path used for calibration.

This leads to a slightly greater delay between the button press and the subsequent capture, but the visual feedback mentioned above helps ensure that the user fixates long enough anyway.
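As a concrete illustration of this trigger path, the following is a minimal sketch of the laptop side, assuming the phone application listens for a one-line message on a known TCP port; the address, port, and message content are placeholders rather than the exact protocol of our implementation.

    import socket

    # Hypothetical address of the phone's TCP listener inside the headset.
    PHONE_ADDR = ("192.168.1.42", 5005)

    def send_capture_trigger():
        # Open a short-lived TCP connection and send a one-line trigger message.
        with socket.create_connection(PHONE_ADDR, timeout=1.0) as sock:
            sock.sendall(b"CAPTURE\n")

    if __name__ == "__main__":
        # Each Enter press on the laptop triggers one calibration capture,
        # so the user never has to touch (and thereby move) the headset itself.
        while True:
            input("Press Enter to trigger capture (Ctrl-C to quit) ")
            send_capture_trigger()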

For test- and run-time data acquisition, the camera must be operated in a frame-streaming or video capture mode, since gaze position is intended to be captured continuously. Two problems arise when attempting to correlate streamed frames or video frames with their corresponding on-screen content: the first is the unknown delay from triggering image streaming or video capture until the operation starts, and the second is the variable time between frames in either of these modes of operation. To address the first problem, we devised an "external" means for achieving wall-clock synchronization between camera frames and display frames. In particular, as shown in Figure 1, we mount a mirror next to the Cardboard lens such that a portion of the device display is visible to the camera. Then, during data acquisition, millisecond timecodes are displayed in that visible portion of the screen, which is located far off-center so as to be minimally noticeable to the user. In our case these codes were read manually by humans, but it would be trivial to use machine-readable codes to automate the registration process. If every frame could be scanned for its timecode, then this solution would be sufficient for capturing quality data using any method of frame-streaming. Instead of implementing this, we observed that frames captured using video recording, although they are spaced irregularly with variable durations, have their relative time offsets properly recorded in the video file. Once the start time of the video is determined (by inspecting the displayed timecodes), all the frames can be timestamped automatically. Note that in the case of our Samsung Note 5, the frame timestamps were unreliable (sometimes even non-monotonic in sequence), while the frame durations were reliable.
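The bookkeeping that turns a recorded video plus a single observed timecode into per-frame wall-clock timestamps is straightforward. The sketch below assumes the per-frame durations have already been extracted from the video container (e.g. with a tool such as ffprobe); the function and variable names are illustrative, not taken from our implementation.

    def absolute_frame_timestamps(video_start_ms, frame_durations_ms):
        # video_start_ms: display timecode read via the mirror in the first frame.
        # frame_durations_ms: reliable per-frame durations from the video container.
        timestamps = []
        t = video_start_ms
        for duration in frame_durations_ms:
            timestamps.append(t)  # timestamp of the current frame
            t += duration         # advance by the (variable) frame duration
        return timestamps

    # Example: irregular frame spacing, start time read from the on-screen timecode.
    print(absolute_frame_timestamps(1234500, [33, 34, 66, 33]))
    # -> [1234500, 1234533, 1234567, 1234633]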

4 Algorithms for Eye Gaze Tracking

Our gaze tracking technique leverages matching SURF features in the Purkinje image – that is, features in the reflection of the on-screen content, as shown in Figure 4a. Two different techniques, for discrete classification and continuous gaze point estimation, leverage the quantity and offset of matched features, respectively. For the former, the intuition is that, from the camera's point of view, as the eye moves, there are changes in both the position of the screen's reflection and the portion of the on-screen content that is visible in the reflection. If the eye is near the same position, many of the same features will be visible, but slightly offset and distorted; whereas if the eye is in a very different position, very few features will match – since they are either not visible, or distorted beyond recognition.

If the offset threshold at which reflected SURF features no longer match is small enough, just counting the number of matched features might support discrete classification of an unknown image in the neighborhood of a specific grid point. We tested this method with six participants using two different grid sizes (coarse 3×4, fine 4×7). For each grid size, each participant repeated the procedure three times, and the first of the three was used as the training set. This naïve method yielded 86.11% accuracy for the coarse grid, and 68.45% for the fine grid. Inspecting the "best" matches in cases where the classification chooses incorrectly reveals the expected failure mode: because SURF is robust to changes in size and perspective, the number and quality of feature matches in the neighborhood around a given point differ very little. Errors occur when this threshold radius is close to or greater than the distance between calibration points, and this had a greater effect in the case of the finer grid.
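As a concrete sketch of this naïve classifier, the code below counts ratio-test-filtered SURF matches between an unknown eye image and each calibration image, and returns the grid point whose calibration image accumulates the most matches. It assumes an OpenCV build that includes the contrib xfeatures2d module (where SURF lives), and the parameter values are illustrative rather than the ones used in our experiments.

    import cv2

    # SURF is part of opencv-contrib (xfeatures2d); the Hessian threshold is illustrative.
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    matcher = cv2.BFMatcher(cv2.NORM_L2)

    def count_matches(img_a, img_b, ratio=0.75):
        # Count SURF correspondences between two grayscale eye images,
        # filtered with Lowe's ratio test to suppress ambiguous matches.
        _, desc_a = surf.detectAndCompute(img_a, None)
        _, desc_b = surf.detectAndCompute(img_b, None)
        if desc_a is None or desc_b is None:
            return 0
        good = 0
        for pair in matcher.knnMatch(desc_a, desc_b, k=2):
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
                good += 1
        return good

    def classify_gaze(unknown_img, calibration_imgs):
        # calibration_imgs: dict mapping a grid label to its calibration eye image.
        # The predicted grid point is the one sharing the most features.
        scores = {label: count_matches(unknown_img, img)
                  for label, img in calibration_imgs.items()}
        return max(scores, key=scores.get)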

The second technique we propose turns the shortcoming of the first into a strength – because features are recognizable within a fairly large radius, their exact offset gives information about the relative offset between the two gaze points in question. Figure 4a shows an example, where it is clear that all of the matched features are offset by the same vector. Because these offsets are only relative, it is necessary to establish a single coordinate system for the training set grid points, against which to compute the feature offsets of a given run-time image. This can be done by first computing all neighbor offsets (which contain small errors, presumably due to movement of the device with respect to the face), and then running an optimization to find a global geometric layout that is maximally consistent with all of these pairwise offsets. An example of such a layout is shown in Figure 4b. Using such a layout, a set of estimated offsets to different training set grid points can be computed. An offset estimate was considered valid if it matched a minimum number of features (n = 8), making a false match highly improbable. We averaged all valid estimates together to arrive at an estimated gaze point in the eye image space. Finally, a continuous mapping between the camera image space and the display image space is needed to map this estimate to a gaze point in the on-screen content space. We used a single homography fitted to map the calibration points from the eye image space to the grid. It is likely that better results could be obtained using multiple piecewise homographies or polynomial approximations. Mapped segments of the gaze were accurate enough to be highly recognizable, as shown in Figure 4c.
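The last two steps of this pipeline – averaging the valid per-grid-point estimates into a single gaze point in eye-image space, and mapping that point to screen coordinates with a fitted homography – can be sketched as follows, assuming the per-grid-point offset estimates and the calibration correspondences have already been computed. The names and the RANSAC fitting option are illustrative; only the single-homography mapping itself reflects the method described above.

    import numpy as np
    import cv2

    MIN_MATCHES = 8  # an offset estimate is valid only if enough features support it

    def gaze_in_eye_space(offset_estimates):
        # offset_estimates: list of (estimated_xy, num_matched_features) pairs,
        # one per training grid point that produced feature matches.
        valid = [xy for xy, n in offset_estimates if n >= MIN_MATCHES]
        if not valid:
            return None
        return np.mean(np.float32(valid), axis=0)  # average of all valid estimates

    def fit_eye_to_screen(calib_eye_pts, calib_screen_pts):
        # Fit a single homography mapping calibration points from eye-image
        # space to on-screen content space.
        H, _ = cv2.findHomography(np.float32(calib_eye_pts),
                                  np.float32(calib_screen_pts), cv2.RANSAC)
        return H

    def gaze_on_screen(eye_xy, H):
        # Map one eye-image gaze estimate to on-screen pixel coordinates.
        pt = np.float32([[eye_xy]])  # shape (1, 1, 2), as OpenCV expects
        return cv2.perspectiveTransform(pt, H)[0, 0]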

We tried the technique on four participants, each capturing roughly 1400 images divided into two trials. The average error was 70.5 screen pixels, corresponding to 3.4 mm on screen, or 5.0 degrees in the field of view, with a standard deviation of 29.0 screen pixels. While this is significantly less precise than commercial trackers (which typically achieve accuracy within 0.5 to 1 degree), the fact that it has been achieved with no additional hardware makes it worthy of note.

5 Discussion

In this section, we discuss the current strengths and limitations of the technique, and correspondingly, what could be done now leveraging the strengths, and what would be required to overcome the limitations. One strength of our results is that it would be possible to implement dialogs or menus in real time, using the discrete classification technique. This would be done by performing the necessary feature-based image matching off-board in the cloud, easily achieving an adequate latency of 500 ms or less. This application architecture is directly supported by the popular OpenCV and SimpleCV frameworks.
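As an illustration of this off-board architecture, the following is a minimal sketch of a hypothetical cloud endpoint: the phone would POST an encoded front-camera frame and receive back the classified grid cell. The Flask service, module name, route, and per-user calibration store are all assumptions; classify_gaze refers to the discrete classifier sketched in Section 4.

    # Hypothetical off-board classification service (names are illustrative).
    import cv2
    import numpy as np
    from flask import Flask, request, jsonify

    # Assumed module containing the discrete classifier sketched in Section 4.
    from gaze_classifier import classify_gaze

    app = Flask(__name__)
    CALIBRATION_IMGS = {}  # grid label -> grayscale calibration image, per user

    @app.route("/classify", methods=["POST"])
    def classify():
        # Decode the JPEG/PNG bytes sent by the phone into a grayscale image.
        frame = cv2.imdecode(np.frombuffer(request.data, np.uint8),
                             cv2.IMREAD_GRAYSCALE)
        label = classify_gaze(frame, CALIBRATION_IMGS)
        return jsonify({"grid_cell": label})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)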


Figure 4: (a) An example showing matched feature offsets between two overlaid images. (b) Example grid layout generated with a linear program. Left: content space; right: eye image space. (c) A curve segment mapped using our technique.

One limitation is that the continuous gaze point estimation technique, as it stands, cannot run in real time, although parallelization would speed it up substantially. Even so, an offline method is acceptable for studying visual behavior. Next, a significant limitation of our implementation is that it only works on calibrated images. In order to track gaze during a game, or while watching a video, it would be necessary to "transfer" a calibration from a known image to a new, previously unseen image. This could be done by generating a simulated training set using computer graphics rendering (e.g. a parametric, ray-traced model of the eye, along with point spread functions of the lenses). Our preliminary explorations indicate that this is a promising direction, and it might be possible to omit the lenses from the model to simplify the implementation. Another limitation is the field of view. When the eye is pointed too far away from the camera, the screen reflection is sometimes not visible or is occluded by the eyelid or eyelashes. One possible approach to address this would be to use a sequence of mirrors to split the front-facing camera image into two or more different views of the eye. This could be done with the same type of mirror that we currently use for synchronization, albeit with a more sophisticated setup.

6 Conclusion

We have demonstrated that the geometry and hardware capabilities of a Google Cardboard device allow it to be used for eye tracking. The implementation described here has significant limitations, but it would provide a practical way to support the visual activation of a small number of buttons or menu items using trivial extensions. With further development, it could be used as a tool for more sophisticated visual interactions, and for studying visual behavior. We have described the next steps that would be required to make the technique more practical and versatile for both use cases. These steps require significant additional effort, but their fundamental feasibility is clear. Recent related work used a crowd-sourced, neural-net-based approach to coarse-grained remote eye tracking for mobile phones and tablets [Krafka et al. 2016]. Having shown here that the information contained in front-facing camera images is sufficient for the mobile VR case, we conclude that it should be feasible to use such a "big data" approach as an alternative to the explicit hand-crafted model we presented here. In order to allow others to build on our work, and to let it serve as an instructive practical application of computer vision, our tools will be made available online under an open source license².

7 Acknowledgments

We wish to thank Boris Smus and Chris Schmandt for their valuable feedback, and acknowledge support from Google through the Faculty Research Award Program.

² https://github.com/ResVR/resvr-cardboard-eye-track

References

DUCHOWSKI, A. T., SHIVASHANKARAIAH, V., RAWLS, T., GRAMOPADHYE, A. K., MELLOY, B. J., AND KANKI, B. 2000. Binocular eye tracking in virtual reality for inspection training. In Proc. ETRA '00, 89–96.

HANSEN, D. W., AND PECE, A. E. 2005. Eye tracking in the wild. Computer Vision and Image Understanding 98, 1 (Apr.), 155–181.

KANGAS, J., SPAKOV, O., ISOKOSKI, P., AKKIL, D., RANTALA, J., AND RAISAMO, R. 2016. Feedback for smooth pursuit gaze tracking based control. In Proc. Augmented Human Intl Conf '16, ACM, New York, NY, USA, AH '16, 6:1–6:8.

KRAFKA, K., KHOSLA, A., KELLNHOFER, P., KANNAN, H., BHANDARKAR, S., MATUSIK, W., AND TORRALBA, A. 2016. Eye tracking for everyone. In Proc. IEEE CVPR '16.

LI, D., BABCOCK, J., AND PARKHURST, D. J. 2006. openEyes. In Proc. ETRA '06, ACM Press, 95.

NAKAZAWA, A., AND NITSCHKE, C. 2012. Point of gaze estimation through corneal surface reflection in an active illumination environment. In ECCV 2012, A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, Eds., LNCS, Springer Berlin Heidelberg, 159–172.

OHSHIMA, T., YAMAMOTO, H., AND TAMURA, H. 1996. Gaze-directed adaptive rendering for interacting with virtual space. In Proc. IEEE VR '96, IEEE, 103–110.

TANRIVERDI, V., AND JACOB, R. J. K. 2000. Interacting with eye movements in virtual environments. In Proc. CHI '00, ACM Press, 265–272.
