
EyeSwipe: Text Entry Using Gaze Paths

Andrew Toshiaki Nakayama Kurauchi

Thesis presented to the
Institute of Mathematics and Statistics of the
University of São Paulo
for obtainment of the title of
Doctor in Science

Program: Graduate Program in Computer Science
Co-advisors: Prof. Carlos Hitoshi Morimoto
             Prof. Margrit Betke

During the development of this work the author received financial aid
from São Paulo Research Foundation (FAPESP), grants:

2012/10552-8, 2013/06791-0, 2014/12048-0

São Paulo, January 2018


EyeSwipe: Text Entry Using Gaze Paths

This version of the thesis contains the corrections and enhancements suggested by the PhD Thesis Committee in the defense of the original version of the thesis, on January 30, 2018. A copy of the original version is available at the Institute of Mathematics and Statistics of the University of São Paulo.

PhD Thesis Committee:

• Prof. Carlos Hitoshi Morimoto (advisor) - Universidade de São Paulo

• Prof. Margrit Betke (advisor) - Boston University

• Prof. Päivi Hannele Majaranta - University of Tampere

• Prof. Vânia Paula de Almeida Neris - Universidade Federal de São Carlos

• Prof. Eduardo Furtado de Mendonça Monçores Velloso - University of Melbourne


Acknowledgments

Countless people contributed directly or indirectly to this thesis. First and foremost I thank God, creator and sustainer of life. To him be the glory.

I would like to thank my advisors Carlos Morimoto and Margrit Betke for their guidance and enlightening conversations over the past years. Both have given me far more amazing opportunities than I expected to have when I started my academic life. Carlos continuously teaches me about the importance of asking the right question and seeking the true motivations to solve problems. Margrit helped me find such motivation, with the drive to seek real impact in people's lives. I am very grateful for their teachings and patience.

I also thank FAPESP (grants 2012/10552-8, 2013/06791-0, and 2014/12048-0) for supporting this work. Because of their support I was exposed to several opportunities, from attending conferences to spending a year abroad at Boston University (BU) with the Image and Video Computing Group, under the supervision of Prof. Margrit Betke.

Many people helped me in my first steps into the eye tracking world. Among them, I would like to especially thank Antonio Diaz Tula, Diako Mardanbegi, and Flavio Coutinho. I am also thankful to all other friends from the Laboratory of Technologies for Interaction (LaTIn) and the Computer Science Department at IME-USP. Alex Carneiro, Carlos Eduardo Elmadjian, Henrique Stagni, Igor Montagner, José Tula Leyva, Vinícius Daros, and so many other friends helped me with both insightful discussions and moments of laughter and random conversations.

My experience abroad at the Image and Video Computing Group at BU would not have been nearly as full of learning and growth without the people I met there. I am forever thankful for the warm welcome and so many good conversations. I would like to thank Ajjen Das Joshi, Jianming Zhang, and Wenxin Feng, not only for the fruitful collaborations but especially for their friendship.

All friends from TBT, Felipe Takeda, Jorge Osiro, Joyce Osiro, Lucas Machado, Otavio Izumi, Ricardo Borges, Vivian Yamaguchi, and many other friends gave me emotional support, companionship, and especially shared moments of good laughter and good food.

Rebecca Teoi has supported and motivated me throughout this whole journey. I am forever thankful for her kind words of encouragement when I needed them the most and for inspiring me to be and do my best.

Finally, I thank my family for everything they mean and gave to me. I would not be here without their full support. I thank my brother, Martim, for constantly challenging me to improve and grow. My parents always inspire and encourage me to find purpose and do my best in what I do, even when they do not understand what it is.

So many other people not named here played important roles in this journey. A general acknowledgment would not be enough. I hope to be able to thank each one in person.


Abstract

KURAUCHI, A. T. N. EyeSwipe: Text Entry Using Gaze Paths. 2018. Thesis (Doctorate) - Institute of Mathematics and Statistics, University of São Paulo, São Paulo, 2018.

People with severe motor disabilities may communicate using their eye movements aided by a virtual keyboard and an eye tracker. Text entry by gaze may also benefit users immersed in virtual or augmented realities, when they do not have access to a physical keyboard or touchscreen. Thus, both users with and without disabilities may take advantage of the ability to enter text by gaze. However, methods for text entry by gaze are typically slow and uncomfortable. In this thesis we propose EyeSwipe as a step further towards fast and comfortable text entry by gaze.

EyeSwipe maps gaze paths into words, similarly to how finger traces are used on swipe-based methods for touchscreen devices. A gaze path differs from the finger trace in that it does not have clear start and end positions. To segment the gaze path from the user's continuous gaze data stream, EyeSwipe requires the user to explicitly indicate its beginning and end. The user can quickly glance at the vicinity of the other characters that compose the word. Candidate words are sorted based on the gaze path and presented to the user.

We discuss two versions of EyeSwipe. EyeSwipe 1 uses a deterministic gaze gesture called Reverse Crossing to select both the first and last letters of the word. Considering the lessons learned during the development and test of EyeSwipe 1 we proposed EyeSwipe 2. The user emits commands to the interface by switching the focus between regions.

In a text entry experiment comparing EyeSwipe 2 to EyeSwipe 1, 11 participants achieved an average text entry rate of 12.58 words per minute (wpm) with EyeSwipe 1 and 14.59 wpm with EyeSwipe 2 after using each method for 75 minutes. The maximum entry rates achieved with EyeSwipe 1 and EyeSwipe 2 were, respectively, 21.27 wpm and 32.96 wpm. Participants considered EyeSwipe 2 to be more comfortable and faster, while less accurate than EyeSwipe 1. Additionally, with EyeSwipe 2 we proposed the use of gaze path data to dynamically adjust the gaze estimation. Using data from the experiment we show that gaze paths can be used to dynamically improve gaze estimation during the interaction.

Keywords: EyeSwipe, text entry, gaze path, eye tracking, swipe.


Resumo

KURAUCHI, A. T. N. EyeSwipe: Entrada de Texto Usando Gestos do Olhar. 2018. Tese (Doutorado) - Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo, 2018.

Pessoas com deficiências motoras severas podem se comunicar usando movimentos do olhar com o auxílio de um teclado virtual e um rastreador de olhar. A entrada de texto usando o olhar também beneficia usuários imersos em realidade virtual ou realidade aumentada, quando não possuem acesso a um teclado físico ou tela sensível ao toque. Assim, tanto usuários com quanto sem deficiência podem se beneficiar da possibilidade de entrar texto usando o olhar. Entretanto, métodos para entrada de texto com o olhar são tipicamente lentos e desconfortáveis. Nesta tese propomos o EyeSwipe como mais um passo em direção à entrada rápida e confortável de texto com o olhar.

O EyeSwipe mapeia gestos do olhar em palavras, de maneira similar a como os movimentos do dedo em uma tela sensível ao toque são utilizados em métodos baseados em gestos (swipe). Um gesto do olhar difere de um gesto com os dedos por não possuir posições de início e fim claramente definidas. Para segmentar o gesto do olhar a partir do fluxo contínuo de dados do olhar, o EyeSwipe requer que o usuário indique explicitamente seu início e fim. O usuário pode olhar rapidamente a vizinhança dos outros caracteres que compõem a palavra. Palavras candidatas são ordenadas com base no gesto do olhar e apresentadas ao usuário.

Discutimos duas versões do EyeSwipe. O EyeSwipe 1 usa um gesto do olhar determinístico chamado Cruzamento Reverso para selecionar tanto a primeira quanto a última letra da palavra. Levando em consideração os aprendizados obtidos durante o desenvolvimento e teste do EyeSwipe 1, nós propusemos o EyeSwipe 2. O usuário emite comandos para a interface ao trocar o foco entre as regiões do teclado.

Em um experimento de entrada de texto comparando o EyeSwipe 2 com o EyeSwipe 1, 11 participantes atingiram uma taxa de entrada média de 12.58 palavras por minuto (ppm) usando o EyeSwipe 1 e 14.59 ppm com o EyeSwipe 2, após utilizar cada método por 75 minutos. As taxas de entrada de texto máximas alcançadas com o EyeSwipe 1 e o EyeSwipe 2 foram, respectivamente, 21.27 ppm e 32.96 ppm. Os participantes consideraram o EyeSwipe 2 mais confortável e rápido, mas menos preciso do que o EyeSwipe 1. Além disso, com o EyeSwipe 2 nós propusemos o uso dos dados dos gestos do olhar para ajustar a estimação do olhar dinamicamente. Utilizando dados obtidos no experimento, mostramos que os gestos do olhar podem ser usados para melhorar a estimação dinamicamente durante a interação.

Palavras-chave: EyeSwipe, entrada de texto, gestos do olhar, rastreamento do olhar, swipe.


Contents

List of Abbreviations

List of Figures

List of Tables

1 Introduction
  1.1 Motivation
  1.2 Objective and Challenges
  1.3 EyeSwipe
  1.4 Contributions
  1.5 Organization

2 Concepts
  2.1 The Eyes and Eye Movements
    2.1.1 Fixations
    2.1.2 Saccades
    2.1.3 Smooth Pursuits
  2.2 Eye Tracker and the Gaze Position
  2.3 Midas Touch Problem
  2.4 Metrics for Text Entry
    2.4.1 Text Entry Rate
    2.4.2 Error Rate
    2.4.3 Key Strokes per Character in Eye Typing

3 Literature Review
  3.1 Fixation-Based Text Entry
    3.1.1 Effect of Feedback on Speed and Accuracy
    3.1.2 User-Adjusted Dwell-Time
    3.1.3 Automatically-Adjusted Dwell-Time
    3.1.4 Cascading Dwell-Time
    3.1.5 AugKey
    3.1.6 Potential of Dwell-Free Text Entry
    3.1.7 Inferring Intent With Process Models
    3.1.8 Filteryedping
  3.2 Saccade-Based Text Entry
    3.2.1 Saccade-Based Text Entry with Static GDO's
    3.2.2 Saccade-Based Text Entry with No GDO's
    3.2.3 Saccade-Based Text Entry with Dynamic GDO's
  3.3 Smooth-Pursuit-Based Text Entry
    3.3.1 Smooth-Pursuit-Based Text Entry with Static GDO's
    3.3.2 Smooth-Pursuit-Based Text Entry with Dynamic GDO's
  3.4 Literature Review Results Summary

4 EyeSwipe 1
  4.1 Using Gaze Paths to Enter Words
  4.2 Selection of Gaze Path Limits
    4.2.1 Reverse Crossing
    4.2.2 Action Button
  4.3 Interface Description
  4.4 Gesture Classification
    4.4.1 Dynamic Time Warping
    4.4.2 DTW Score
    4.4.3 Prefix Tree
  4.5 EyeSwipe 1 Evaluation
    4.5.1 Participants
    4.5.2 Apparatus
    4.5.3 Procedure
    4.5.4 Design
  4.6 Experimental Results
    4.6.1 Text Entry Rate
    4.6.2 Accuracy
    4.6.3 Subjective Feedback
  4.7 Discussion

5 EyeSwipe 2
  5.1 Improvement Challenges from EyeSwipe 1
    5.1.1 Reverse Crossing and Pop-Up Action Buttons
    5.1.2 Gaze Estimation Errors
  5.2 Interface Description
    5.2.1 Feedback
    5.2.2 Context Switching
  5.3 Gesture Classification
    5.3.1 Discrete Fréchet Distance
    5.3.2 Word Probability
  5.4 Dynamic Calibration
    5.4.1 Related Work
    5.4.2 Dynamic Calibration with Gaze Gestures
  5.5 EyeSwipe 2 Evaluation
    5.5.1 Participants
    5.5.2 Apparatus
    5.5.3 Procedure
    5.5.4 Design
  5.6 Experimental Results
    5.6.1 Text Entry Rate
    5.6.2 Accuracy
    5.6.3 QWERTY Layout Expertise Results
    5.6.4 Subjective Feedback
    5.6.5 Dynamic Calibration
  5.7 Discussion

6 Conclusions
  6.1 Limitations
  6.2 EyeSwipe 1
  6.3 EyeSwipe 2
  6.4 Design Implications
  6.5 Dynamic Calibration
  6.6 Future Research

A Publications

Bibliography


List of Abbreviations

DFD   Discrete Fréchet Distance
DTW   Dynamic Time Warping
GDO   Graphic Display Object
KSPC  Key Strokes Per Character
MSD   Minimum String Distance


List of Figures

3.1 AugKey augments the keys with contextual information. In this example, prefix and possible completions are displayed around the letter “L” in the key. Adapted from [9].

3.2 Context Switching uses duplicated contexts for dwell-free selection. The last selected key is entered when the user performs a saccade to the opposite context. Retrieved from [37].

3.3 Interfaces of (a) EyeWrite and (b) EyeK.

3.4 In pEYEwrite, letters are grouped in sections of a pie menu. When the user fixates the border of a section, another pie menu is displayed, so the user can select letters individually. A third-level pie menu is used to enter pairs of letters. In this example the letters “A”, “I”, “D”, “E”, and “S” can be entered alongside “N”. Adapted from [57].

3.5 In Quikwriting adapted for gaze, letters are grouped around a circle. When the user’s gaze exits the circle, the letters in the group in the exit region are displayed in the external regions. The letter in the last focused region before the user’s gaze reenters the central area is typed. Adapted from [4].

3.6 Letters are displayed vertically in alphabetical order on the right side of Dasher’s interface. When the user looks to the right, the letters start moving towards the center, zooming in the letters closer to the user’s gaze. Letters are entered as they cross the central line. Screenshot of Dasher software by Ward et al. [61].

3.7 In StarGazer, characters are distributed around two concentric circles. The interface pans and zooms in the direction of the user’s gaze. Adapted from [15].

3.8 In SMOOVS, selection of a character occurs in 3 steps using smooth pursuits. (1) Clusters of characters start moving outwards. (2) Characters in the selected cluster move apart from each other. (3) The character followed by the user’s gaze is selected. Adapted from [31].

4.1 Selection of a key is performed with a reverse crossing. The red dots indicate the estimated gaze position at each point in time. As the user looks at the target key (t1), an action button pops up (t2). To perform the action, the user looks at the action button (t3) and then looks back at the key (t4).

4.2 Binary options in the NeoVisus library can be selected in 4 steps. When the user looks at the option, an icon appears to its right. If the user looks at the icon the option is toggled. Adapted from [55].


4.3 The interface of EyeSwipe 1 is composed of a text box, word candidates, a cancel/delete button, and a virtual keyboard. The reference text to be transcribed is shown above the text box (“gaze”) in the experiment. The word can be entered by completing the following actions: (1) Fixate on the first letter of the desired word. (2) A pop-up action button appears above the fixated key. Select the key using reverse crossing to initiate a gaze path. (3) Glance through the intermediate letters. The resulting gaze path is used to compute the candidate words. (4) When fixating on the last letter, the most probable candidate word will be displayed on the pop-up action button. As the user selects it, the word is entered. (5) The top 5 candidates are displayed when a word is entered. The user can change the entered word by selecting another candidate button. (6) Select the cancel/delete key to cancel a gaze path (in the middle of a gesture) or delete a word.

4.4 Example of prefix tree with the words “a”, “an”, “and”, “gaze”, “gate”, and “end”. Though the suffixes of “and” and “end” are the same, they are stored in different nodes.

4.5 Boxplots for the mean text entry rates for each session and interface. Overall means are shown with a star symbol, outliers are represented by a black diamond.

4.6 Boxplots for the maximum text entry rates for each session and interface. Overall means are shown with a star symbol, outliers are represented by a black diamond.

4.7 Boxplots for the mean clean text entry rates for each session and interface. Overall means are shown with a star symbol, outliers are represented by a black diamond.

4.8 Boxplots for the maximum clean text entry rates for each session and interface. Overall means are shown with a star symbol, outliers are represented by a black diamond.

4.9 Mean and standard deviation (error bars) of the text entry rate in characters per minute (cpm) per entered word length. The dashed line shows the mean length of the expected words.

4.10 Mean and standard deviation (error bars) of the ratio of reverse crossing duration to gaze path duration per word length. The dashed line shows the mean length of the expected words.

4.11 Boxplots for the mean error rates for each session and interface. Overall means are shown with a star symbol, outliers are represented by a black diamond.

4.12 Top-n accuracy for the gesture classification with the (blue) full lexicon and the (orange) filtered lexicon.

4.13 Participants’ perception of performance, learnability, and user experience on a 7-point Likert scale.

4.14 Average responses for all participants to the 7-point Likert scale questions.

5.1 The pop-up action buttons in EyeSwipe 1 may overlap some keys. The user may not be able to select a key partially occluded by the action button, such as “A” or “S”, if that was their actual intention, depending on the gaze estimation error.

5.2 It is not clear in which cases it is safe to assume the user was looking at the center of the characters while performing a gesture. Two example strategies are shown: [red] the user fixated at the center of each letter in the word “case”; [blue] the user fixated approximately at “C”, then in between “A” and “S”, and finally close to “E”.


5.3 The interface of EyeSwipe 2 is composed of three regions (from top to bottom): text, action, and gesture. The action buttons (in green) on the action region depend on the actions of the user in the previously focused region. The word can be entered by completing the following actions: (1) Look at the vicinity of the first letter of the desired word to indicate the beginning. (2) Glance through the intermediate letters. (3) When the user switches the focus from the gesture region to the action region, the word candidates are shown. (4) When the user moves from the action region to either the text or the gesture region, the word is entered. (5) The user can change or delete the word by fixating on the backspace key in the text region, looking at the desired word/delete word button, and moving to either the text or the gesture region. (6) The user can also select “Cancel” to ignore the current action buttons.

5.4 Commands emitted in each region transition. The user’s selections on the previous region are shown in brackets, followed by the emitted command. For example, if the user transitions from the gesture region after performing a gaze gesture, the command to display word candidates is emitted.

5.5 Cost function (Equation 5.5) evaluated for k_i in the interval [0, 10] using data from the last session from Participant A01.

5.6 Cost function (Equation 5.5) evaluated for k_i in the interval [0, 10] using data from the last session from Participant A07.

5.7 The key distance to target is the length of the shortest path between two keys in a graph connecting neighbor keys. The key distance between W and V, for example, is 4. Two of the shortest paths between W and V are shown in red and blue.

5.8 Boxplots for the mean text entry rates for each session and interface. Overall means are shown with a star symbol, outliers are represented by a black diamond.

5.9 Boxplots for the maximum text entry rates for each session and interface. Overall means are shown with a star symbol, outliers are represented by a black diamond.

5.10 Boxplots for the mean clean text entry rates for each session and interface. The clean text entry rate includes only transcribed sentences with no uncorrected errors. Overall means are shown with a star symbol, outliers are represented by a black diamond.

5.11 Boxplots for the maximum clean text entry rates for each session and interface. The clean text entry rate includes only transcribed sentences with no uncorrected errors. Overall means are shown with a star symbol, outliers are represented by a black diamond.

5.12 Gesture entry rate in characters per minute. The dashed line indicates the average entered word length (4.466 characters).

5.13 Boxplots for the mean error rates for each session and interface. Overall means are shown with a star symbol, outliers are represented by a black diamond.

5.14 Top-n scores for all gestures using the Fréchet and DTW scores.

5.15 Boxplots for the mean correction rates for each session and interface. Overall means are shown with a star symbol, outliers are represented by a black diamond.

5.16 Boxplots for the mean cancel rates for each session and interface. Overall means are shown with a star symbol, outliers are represented by a black diamond.


5.17 Boxplots for the mean replace rates for each session and interface. Overall means are shown with a star symbol, outliers are represented by a black diamond.

5.18 A significant linear regression was found between the mean reaction time (layout experiment) in milliseconds and the mean text entry rate (text entry sessions) in wpm.

5.19 Participants’ perception of performance, learnability, and user experience on a 7-point Likert scale.

5.20 Average responses for all participants to the 7-point Likert scale questions.

5.21 Left: Boxplots for the errors in degrees of visual angle using each correction function. Right: Boxplots for the improvement over the baseline using each correction function (in degrees). Negative values mean that the correction function reduced the mean error.


List of Tables

3.1 Summary of results from the literature on text entry by gaze. Some results were inferred from figures or other data available in the original papers. Results inferred by us are indicated by a ∗ symbol. The † symbol indicates that the average entry rate of the fastest participant on their best session is shown. The column “Max. WPM” shows the maximum text entry rate achieved by a single participant (unless stated otherwise). The column “Last WPM” shows the average entry rate for the last session of the experiment for all participants. The last row shows the results from Kristensson and Vertanen’s study [26] with their ideal virtual keyboard.

5.1 Strategy adopted by each participant. LEFT: the participant looked for the leftmost action button, independently of the location of the last letter of the word. UP: the participant looked at the action button right above the last letter of the word.


Chapter 1

Introduction

Text entry by gaze is typically realized by a virtual keyboard and a technique called dwell-time. With dwell-time, keys on a virtual keyboard are typed by fixating the gaze on the key for longer than a predefined duration. Gaze location on the screen is estimated by an eye tracker and the user is able to enter text using their eye movements alone. However, why should someone enter text by gaze to start with?

Manual typing on physical or virtual keyboards is the predominant method to enter text on a computer. Nevertheless, doing so may not be possible in some situations or for some people. People with severe motor disabilities, such as those with amyotrophic lateral sclerosis (ALS) or people with locked-in syndrome (LIS), may not be able to use traditional input devices such as mouse and keyboard. The use of eye movements for communication allows people with severe motor disabilities to communicate, even if they are not able to vocalize words or move their limbs. Users immersed in virtual reality (VR) wearing a VR headset may also be restricted in terms of traditional input devices: either because they are moving or standing far from them, or because they are unable to see the relative position between their hands and the keyboard. Other situations, such as interacting with an augmented reality headset with the hands occupied (e.g. while cooking), may also restrict the use of the hands to enter text. Despite the very different reasons, users may benefit from being able to enter text with their gaze.

At first, using gaze to enter text may seem a promising alternative to manual typing. The eyes are able to move very rapidly, reaching up to 700 degrees per second [2]. Thus “eye typing” should be faster than typing manually. However, gaze control makes typing not only slower, but also less natural and comfortable.

Though finger movements are slower than eye movements, users typically type with multiple fingers. As a finger is typing a key, another finger is already preparing to type the next one. Gaze can only select keys sequentially, which results in a lower text entry rate. Compared to typing with multiple fingers, typing with a stylus is more similar to entering text with gaze, as it also depends on a single sequential input. Soukoreff and MacKenzie [53] developed a theoretical model to estimate upper and lower bounds on typing speed using a stylus and soft keyboard. According to their model, a typing rate of 8.9 words per minute is expected for novice users and 30.1 words per minute for experts. Currently, to our knowledge, no method for text entry by gaze has achieved rates close to 30 words per minute. The fastest methods often achieve about 20 words per minute.

Besides being slower than regular typing with 10 fingers, using gaze for control is an unnatural task. The eyes are primarily used for visual information acquisition, in other words, for input. Human beings are not accustomed to using the eyes to control the outer world, as output. Even if the user practices using their gaze for control, they will still need their eyes for visual information acquisition. This dichotomy between input and output was discussed by Jacob [22], who coined the term Midas Touch Problem.

The Midas Touch problem recalls the Greek myth about King Midas, who turned everything he touched into gold. One of the main challenges in gaze-based interaction is how to avoid the Midas Touch problem. In other words, how to allow the user to interact only with what they want, and not with everything they look at.

Dwell-time is an attempt to reduce the impact of the Midas Touch problem. Requiring the user to hold their gaze on the desired target allows the user to look at anything they want, provided that they do not fixate on it for longer than the dwell-time. Two direct consequences of using dwell-time are that (1) the user must be careful not to look at selectable targets for too long if they do not intend to select them, and (2) if the user wants to select a target they must wait for the whole dwell-time. To decide the duration of the dwell-time these two factors must be considered. If it is too long the user will need to wait longer to enter each key. If it is too short the text entry rate will be higher, but so will the error rate, as the Midas Touch problem may occur more frequently.
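To make the mechanism concrete, the sketch below implements a minimal dwell-time selector. It is not the implementation used in this thesis; the 600 ms threshold and the assumption that each gaze sample has already been mapped to the key under it are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

DWELL_TIME = 0.6  # seconds; illustrative threshold, not a value used in this thesis


@dataclass
class DwellState:
    key: Optional[str] = None  # key currently under the gaze, if any
    start: float = 0.0         # time at which the gaze entered that key


def update_dwell(state: DwellState, key_under_gaze: Optional[str],
                 timestamp: float) -> Optional[str]:
    """Feed one gaze sample; return a key when its dwell-time completes."""
    if key_under_gaze != state.key:
        # The gaze moved to a different key (or off the keyboard): restart the timer.
        state.key = key_under_gaze
        state.start = timestamp
        return None
    if state.key is not None and timestamp - state.start >= DWELL_TIME:
        selected = state.key
        # Reset so the same key is not entered again until the gaze leaves and returns.
        state.key = None
        return selected
    return None
```

The two consequences discussed above show up directly in the sketch: the timer restarts whenever the gaze leaves a key, and no key is returned before DWELL_TIME has elapsed.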

Špakov and Miniotas [59], Majaranta et al. [34], and Mott et al. [38] showed how dynamic and adjustable dwell-times can be used to improve user performance. Also, Diaz Tula and Morimoto [9] proposed augmenting the key with contextual information, e.g. previously entered text and possible word completions, while the user is waiting for the dwell mechanism to complete. By showing the additional information on the key, the interface assists the user in planning the next action while waiting for the current key to be selected.

Dwell-free alternatives have also been suggested. Ward et al. and Hansen et al. proposed zooming interfaces, with Dasher [61] and StarGazer [15]. The target character is only selected once its region is large enough, so that its size is a few times larger than the expected error on the estimated gaze position. Zooming interfaces explore dynamic visual elements on the interface to enable more robust selection of the targets.

Huckauf and Urbina [19], Bee and André [4], and Wobbrock et al. [64] suggested the use of gaze gestures to enter individual characters. A specific sequence of eye movements is required for each character. Alternatively, Lutz et al. [31] proposed using pursuit eye movements instead of gaze gestures to enter characters. Duplicated keyboard regions were proposed by Morimoto and Amir [37] with Context Switching. Every time the user switches contexts, the last focused character is entered. These are different approaches to dealing with the Midas Touch problem that do not require dwell-time. In general, they assume that the required eye movement or eye movement sequence is unlikely to be performed without the intention to enter the character.

Lexicons and language models have been employed across several different methods to accelerate the text entry task. Given the previously entered letters, possible word completions are presented to the user, who can enter the rest of the word all at once [8] or letter-by-letter [61] more efficiently.

In this thesis we present EyeSwipe, a text entry method based on gaze paths. EyeSwipe maps gaze paths into words analogously to how finger traces are used in swipe-based interfaces on mobile devices.


1.1 Motivation

Dwell-time-based interfaces are easy to learn. It is often enough to tell the user to look at the key they want to select and wait until it is entered. Getting used to using gaze for control is usually harder than understanding how the interface works. However, with dwell-time the system imposes the waiting time on the user. This not only sets a low upper bound on the text entry rate, but also affects user experience, as the user has less control over when a key will be entered.

Extensions to dwell-time interfaces [9, 34, 38, 59] alleviate these limitations, but do not completely solve them. In a literature review, Majaranta and Räihä [36] observed that the typical text entry rate of gaze-based methods was about 10 words per minute or less. In a longitudinal experiment by Majaranta et al. [34], participants reached entry rates of almost 20 words per minute. Based on the human performance model by Kristensson and Vertanen for dwell-based interfaces [26], these entry rates are likely close to the upper bound without word completion.

Kristensson and Vertanen [26] showed the potential speed gain of dwell-free text entry methods. In their pilot experiment they demonstrated that if the virtual keyboard is able to correctly guess which keys the user wanted to select, without dwell-time, users would be able to enter text significantly faster with their gaze. With the ideal dwell-free keyboard, participants were able to reach a mean entry rate of 46 words per minute. Moreover, a dwell-free method would allow the user to enter text without the waiting time imposed by the system.

Dasher, proposed by Ward et al. [60], can be classified as a dwell-free interface and is considered one of the fastest methods for text entry by gaze. In an early experiment, Ward and MacKay [61] reported text entry rates between 20 and 25 words per minute for two expert users and between 10 and 15 words per minute for two novice users after using the method for one hour.

The interface of Dasher displays the characters vertically in separate rectangles on the right-hand side of the window. As the user looks at the characters, they start moving towards the center and the rectangles closer to the user’s gaze increase in size. A character is entered when it crosses the center of the keyboard window. The speed of the horizontal movement of the character columns is defined by the gaze location. When the user is looking at the center of the window the interface ceases to move. As the user looks to the right, the characters move faster towards the center. If the user looks to the left-hand side of the window the characters move to the right and are deleted as they cross the center of the window.

Dasher is a very flexible text entry method that can be used with several two-dimensional pointing devices besides gaze. When the pointing method is decoupled from the user’s gaze, the user can use their gaze exclusively to search and plan the next pointing action. Though it is potentially faster to have both the searching and pointing actions assigned to the user’s gaze, it may become distracting and stressful. If the user wants to point to the current gaze location, the method will be fast. However, if the user is searching for a specific character it may be distracting that the interface continuously resizes, reacting to the user’s gaze. The only place the user can look to make the interface stop moving is at the center. However, due to noise and natural eye jitter, the pointer is never exactly at the center, so the interface is never really static, unless paused. The interface is constantly reacting to the user’s gaze, even when they do not want it to.

Another approach to dwell-free interfaces is the use of gaze gestures. Wobbrock et al. [64] proposed EyeWrite, a method based on EdgeWrite, by Wobbrock et al. [63]. With EyeWrite the user enters characters by looking at the corners of a square window in specific sequences, as if connecting them to form the letter. Other approaches, such as Quikwriting adapted for gaze by Bee and André [4], also use mostly static interfaces that allow the user to enter characters with sequences of eye movements. Gesture-based interfaces for character-level text entry typically do not force the user to wait for the selection and are mostly static. Thus the user can enter text at their own pace in a less visually overwhelming interface. However, as multiple eye movements are required for each character, such methods are often slow, achieving less than 10 words per minute [4, 64].

Inspired by Kristensson and Vertanen’s results [26] and by swipe-based methods available for mobile devices, we devised EyeSwipe. With EyeSwipe our goal was to develop a text entry method that is easy to learn, comfortable, and fast, and that reduces the time the user needs to wait for the interface.

1.2 Objective and Challenges

The main objective of this thesis is to take one more step towards efficient text entry by gaze with better user experience. Efficiency refers to quantitative measures such as text entry rate and error rate. Better user experience refers to how the user perceives and feels when interacting with the interface. These qualitative measures include comfort, learnability, and perceived performance (e.g. speed and accuracy).

Though related, improving speed in terms of efficiency and user experience are two different objectives. A method can be perceived as faster than another even if in fact the opposite is true [64]. Perceptually, instead of feeling stuck or having to wait for a time imposed by the interface, the user should feel they can enter text as fast as they can move their eyes.

Regarding comfort, the interface should provide only the necessary visual stimuli. Unnecessary visual stimuli may cause discomfort and distract the user while they are focusing on the typing task. Furthermore, the user should be able to look at any part of the interface they want, without necessarily causing an action from the interface (e.g. entering or deleting text).

The more specific objective of this thesis is to propose and investigate a text entry method using gaze gestures in terms of efficiency (text entry rate and error rate) and user experience (comfort, perceived speed, learnability, and accuracy).

A critical difference between input by touch and input by gaze is that the former has clear start and end positions, while the latter does not. For a gaze-based method this means that there is no clear definition of which part of the gaze path was meant to be the gesture. Hence, a specific challenge in applying gaze gestures is determining the temporal location of the beginning and end of the gestures.

1.3 EyeSwipe

In this thesis we propose EyeSwipe, a text entry method that uses gaze gestures to input whole words. Gaze gestures are performed on a virtual keyboard and are translated into words, analogously to finger traces in swipe-based methods. As the gaze gesture does not inherently have clear start and end positions, we proposed different mechanisms for the user to explicitly indicate the beginning and end of the gaze gesture. The letters in the middle of the word can be swiped through: the user just has to quickly look at the vicinity of the keys.


Gaze estimation error is handled differently than it is by zooming interfaces. Zooming interfaces increase the size of the targets to a point at which the magnitude of the error is expected to be a few times smaller than the target size. EyeSwipe handles this uncertainty by using the overall gaze path and its shape instead of individual key selections. The use of the gaze path enables EyeSwipe to use a static keyboard layout while keeping the size of the keys not much larger than the expected gaze estimation error. For the current implementation, we use the QWERTY layout due to the familiarity of the experiment participants with this layout.
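As a rough illustration of this shape-matching idea (the scoring actually used in EyeSwipe 1 and EyeSwipe 2 is detailed in Chapters 4 and 5), the sketch below ranks candidate words by comparing the recorded gaze path with the ideal path formed by the key centers of each word, using dynamic time warping; the key-center coordinates and the lexicon passed to it are placeholders.

```python
import math


def dtw_distance(path_a, path_b):
    """Dynamic time warping distance between two 2D point sequences."""
    n, m = len(path_a), len(path_b)
    d = [[float("inf")] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(path_a[i - 1], path_b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def ideal_path(word, key_centers):
    """Sequence of key-center coordinates for the letters of a word."""
    return [key_centers[c] for c in word]


def rank_candidates(gaze_path, lexicon, key_centers):
    """Sort candidate words by how well their ideal path matches the gaze path."""
    return sorted(lexicon,
                  key=lambda w: dtw_distance(gaze_path, ideal_path(w, key_centers)))
```

EyeSwipe 1 builds its score on dynamic time warping (Chapter 4), while EyeSwipe 2 replaces it with one based on the discrete Fréchet distance (Chapter 5).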

In this thesis we present two versions of EyeSwipe. EyeSwipe 1 uses a deterministic gesture called Reverse Crossing to select the first and last letters of the word. When the user looks at a key, an action button pops up above it. To select the starting or ending key the user looks at the action button and then looks back at the key.

We proposed EyeSwipe 2 as an enhancement of EyeSwipe 1. The interaction in EyeSwipe 2 is based on regions. Commands are given to the interface by moving the gaze from one region to another. To indicate the first letter of a word the user looks at it and waits until it is highlighted. Because of the gaze estimation error, all keys within a given distance of the user’s gaze point are considered candidates for the first letter. The end of the gaze gesture is implicitly indicated when the user’s gaze leaves the region of the keys.

With EyeSwipe 2 we also proposed the use of the gaze gestures performed by the user to dynamically adjust the gaze position estimated by the eye tracker. The gestures are compared to their ideal paths, formed by the centers of the keys in the word. The pairs of points in the gestures and in the ideal paths are used to compute the adjustment function.
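The sketch below shows one way such an adjustment function could be computed from those point pairs: a least-squares affine correction fitted to matched gaze points and key centers. The affine form, and the assumption that the pairs come from the same alignment used to score the gesture, are illustrative; the procedure actually used is described in Chapter 5.

```python
import numpy as np


def fit_affine_correction(gaze_points, ideal_points):
    """Fit x' = A x + b mapping estimated gaze points onto matched key centers.

    gaze_points, ideal_points: (n, 2) arrays of matched points, e.g. paired by
    the same alignment used when scoring the gaze gesture.
    """
    g = np.asarray(gaze_points, dtype=float)
    t = np.asarray(ideal_points, dtype=float)
    # Augment with a constant column so the bias b is estimated together with A.
    g_aug = np.hstack([g, np.ones((len(g), 1))])
    params, *_ = np.linalg.lstsq(g_aug, t, rcond=None)  # shape (3, 2)
    return params


def correct(gaze_point, params):
    """Apply the fitted correction to a newly estimated gaze point."""
    x, y = gaze_point
    return np.array([x, y, 1.0]) @ params
```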

Though EyeSwipe 2 may require a hidden dwell-time to indicate the beginning of the gaze gesture, it can become dwell-free with some adaptations. The required adaptations are the primary future work of this thesis.

1.4 Contributions

The main contributions of this thesis are the proposal and investigation of EyeSwipe, a text entry method based on gaze gestures, and the use of gaze gesture data to dynamically adjust the estimated gaze position. We provide experimental evidence indicating that EyeSwipe is more comfortable and faster than the baseline dwell-based keyboard, both as perceived by the 10 participants and as measured by the text entry rate.

1.5 Organization

The remainder of this thesis is organized as follows. In Chapter 2 we introduce basic eye movement taxonomy and other concepts used in gaze-based interaction and text entry by gaze research. Chapter 3 presents a literature review on methods for text entry by gaze. The two iterations of EyeSwipe are discussed in Chapters 4 and 5. Finally, in Chapter 6 we present a final discussion on the contributions of this thesis and introduce some future work.


Chapter 2

Concepts

The concepts presented in this chapter will be used in the rest of the thesis. The taxonomy of eye movements (Section 2.1) is used for the classification of text entry methods presented in Chapter 3. Understanding the basics of how the eyes move is also important to comprehend some details about the presented text entry method. The remaining sections of this chapter introduce concepts used in text entry by gaze research.

2.1 The Eyes and Eye Movements

For visual information to reach our brains, the photoreceptors on the internal surface of the eyes (the retina) must be stimulated. Light from the environment is refracted by a sequence of different media until it reaches the photoreceptor cells in the retina. A small region of the retina, called the fovea, presents a higher density of cones (photoreceptors associated with the perception of color and fine details). The fovea is the region with the highest visual acuity and is responsible for central vision.

In order to acquire visual information, the eye is maintained relatively still in a fixation. To move from one fixation to another the eye performs very rapid movements. Eye movements can be classified into five basic groups [48]: saccades, smooth pursuits, vergence, vestibular, and physiological nystagmus (extremely small eye movements performed during a fixation). Three out of these five eye movements have been explored for gaze-based interaction according to a taxonomy by Møllenbach et al. [40]: fixations (comprising all physiological nystagmus), saccades, and smooth pursuits.

2.1.1 Fixations

A fixation is the act of keeping the eye relatively still. The eye continues making short movements during a fixation (physiological nystagmus), however the overall variation is typically very small. For the purpose of gaze-based interaction, such small movements are often treated as noise when processing gaze as a signal [10]. During a fixation the visual information is acquired and processed by the brain.
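One common way to treat this jitter, shown as a minimal sketch below, is a dispersion-based fixation detector in the style of I-DT: consecutive samples are grouped into a fixation while the spatial dispersion of the group stays under a threshold. The thresholds are illustrative and not values used in this thesis.

```python
def detect_fixations(samples, max_dispersion=30.0, min_duration=0.1):
    """Group gaze samples (t, x, y) into fixations, I-DT style.

    max_dispersion: dispersion threshold in pixels (bounding-box width + height).
    min_duration: minimum fixation duration in seconds.
    Returns a list of (start_time, end_time, centroid_x, centroid_y).
    """
    fixations, window = [], []
    for t, x, y in samples:
        window.append((t, x, y))
        xs = [p[1] for p in window]
        ys = [p[2] for p in window]
        if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
            # Dispersion exceeded: close the previous window if it lasted long enough.
            closed = window[:-1]
            if closed and closed[-1][0] - closed[0][0] >= min_duration:
                cx = sum(p[1] for p in closed) / len(closed)
                cy = sum(p[2] for p in closed) / len(closed)
                fixations.append((closed[0][0], closed[-1][0], cx, cy))
            window = [(t, x, y)]
    # A window still open at the end of the stream is discarded in this sketch.
    return fixations
```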

The word “fixation” has two meanings in the gaze-based interaction literature. One is the spatial location fixated by a person at a given time. For example, a person is acquiring visual information about a dog in a painting. To do so they must make fixational eye movements while keeping the dog centered in the fovea. In this case one could say that there was a fixation on the painted dog.

The other meaning is the eye movement itself, during the act of fixating on something. The term “fixation detection” is an example of the word used with this meaning: only information about the eye movements is required to detect a fixation; no information about the external world is needed.

The word “fixation” is used for both meanings throughout this thesis, with disambiguation whenever required.

2.1.2 Saccades

To move from one fixation to another the eye performs a ballistic movement called a saccade. During a saccade, the eye can reach a peak angular velocity of up to 700 degrees per second for large amplitudes [2]. A perceptual phenomenon occurs in which the person is unable to perceive visual displacements that occurred during the saccade [6]. This saccadic suppression is explored by some saccade-based methods, such as Dynamic Context Switching (see Chapter 3).

Maybe because of the saccadic suppression gaze-based interaction often does not apply gaze datafrom saccades. However, the detection of saccades may be useful to isolate fixations, for fixationdetection [41]. Saccadic gaze data is often excluded from the final data used by the interactivemethods.

Similarly to the word “fixation”, “saccade” can be understood as either the position, or trajectory,in space and time, or the eye movement itself.

2.1.3 Smooth Pursuits

A smooth pursuit is performed to stabilize the gaze on a moving target. Typically, smooth pursuits require a moving target to be tracked by the person’s gaze [30] and can be interpreted as a fixation-in-motion [40].

Smooth pursuits differ from saccades in the nature of the movement. A saccade starts with a high acceleration, reaches a peak velocity, and ends with a high deceleration, typically within 100 or 200 ms [2]. A smooth pursuit maintains a more constant speed, which should not exceed 100 to 200 degrees per second [40], and often lasts longer, as the eye is visually tracking a target.

As smooth pursuits are unlikely to occur without a moving visual target, they have been used for robust selection of elements in gaze-based interfaces [58].

2.2 Eye Tracker and the Gaze Position

As discussed above, to look at something the eyes move to position the image of the target within the fovea, not necessarily perfectly centered. For this reason the location of a person’s gaze can only be determined up to a region, not a point.

The instrument used to estimate the location of a person’s gaze is called an eye tracker¹. Eye trackers estimate the gaze location on a computer monitor, on a video of the person’s field of view, or in 3D space. In any case the output is a point, either 2D or 3D. For this reason, in the eye tracking literature, the gaze location is referred to as a point.

¹ Other terms, such as video-oculography or electro-oculography, are used to refer to specific eye tracking technologies.

To estimate the gaze position, most eye trackers require a calibration procedure. During the calibration the user must look at a set of known positions so that the eye tracker can compute a model for that particular user. Most models used by eye trackers are sensitive to the position of the user with respect to the screen, so the gaze estimation often degrades if the user moves too much. Drifts in the estimation also tend to appear over time; possible causes include movements of the eye tracker with respect to the screen or the user, changes in lighting conditions, and user fatigue. For these reasons the user is frequently required to recalibrate the eye tracker from time to time.

The gaze position estimated by an eye tracker is affected by different sources of error, such as measurement errors or limitations of the model [3]. Even if the other sources of error were minimized, the size of the fovea limits the accuracy of the estimated gaze location to about 0.5 degrees of visual angle [22]. This limitation explains why, for instance, it may not be possible to select a target on a computer with pixel-level precision using the gaze. Gaze-based interaction must take into account the magnitude of the estimation error.
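
To relate such an angular error bound to interface dimensions, a visual angle can be converted into an on-screen distance given the viewing distance and the pixel density. The snippet below is a minimal sketch of that conversion; the viewing distance and pixel density in the example are hypothetical values, not measurements from this thesis.

```python
import math

# Sketch: convert a visual-angle error into an on-screen distance in pixels.
# The viewing distance and pixel density below are hypothetical example values.
def visual_angle_to_pixels(angle_deg, distance_cm, pixels_per_cm):
    """Length on the screen (in pixels) subtended by angle_deg at distance_cm."""
    size_cm = 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)
    return size_cm * pixels_per_cm

# Example: a 0.5 degree error at 60 cm on a screen with ~38 px/cm (~96 dpi)
print(f"{visual_angle_to_pixels(0.5, 60.0, 38.0):.0f} px")  # roughly 20 px
```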

2.3 Midas Touch Problem

Assume the user’s gaze position is known and we want to use it as input for the computer. One of the first ideas would be to directly replace the mouse pointer with the user’s gaze position. At first this may sound like a good idea: pointing with the eyes is considerably faster than pointing with the mouse [39], and the user probably already looks at the target to move the mouse pointer in its direction, so it would save actions from the user. However, it soon becomes clear that having the mouse pointer follow the user’s gaze is only desirable when the user wants it. It is not desirable that the mouse pointer follows the user’s gaze when they simply want to visually explore the interface. This becomes even worse if the clicking action is also associated with the gaze.

The term Midas Touch Problem, coined by Jacob [22], is used to describe this effect. The term is a reference to the myth about king Midas, who turned everything he touched into gold. The Midas Touch Problem occurs because the eyes are primarily sensory organs, used to acquire visual information. In other words, the eyes are analogous to input methods of the human body, while using the eyes to send commands, or to control things external to the body, is analogous to an output method of the human body. As the eyes move it may be impossible to know the intention of the movement, whether it is for input or output. The Midas Touch Problem makes explicit that the user will not be able to look at whatever they want completely freely if their gaze is also associated with an external control task.

Some techniques have been proposed to address the Midas Touch Problem, such as dwell-time [22], smooth pursuits [58], or gaze gestures [64]. However, at some level, the problem is still present in all techniques and it may not be possible to find a general solution that completely solves it using gaze alone [22].

2.4 Metrics for Text Entry

Text entry research aims at understanding both subjective and objective aspects of text entry methods. In this section we describe some of the metrics used for objective evaluation of text entry methods that have been applied in the studies presented in this thesis.


2.4.1 Text Entry Rate

The text entry rate, or typing rate, is most commonly measured in words per minute (wpm). A word is defined as any sequence of 5 characters², including spaces. In the literature, the text entry rate is often computed after a whole phrase has been entered and can be computed as [62]:

WPM := ((|T| − 1) · 60) / (S · 5)    (2.1)

where T is the full transcribed text, including spaces and punctuation marks, and S is the time in seconds from the entry of the first character to the entry of the last character. As the time is measured from the entry of the first character, the time to enter it is not counted, hence the “−1” in the numerator.
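
As a minimal sketch, Equation 2.1 can be transcribed directly into code; the function name and the example phrase below are mine, used only to illustrate the computation.

```python
# Direct transcription of Equation 2.1; names and example values are illustrative.
def words_per_minute(transcribed_text: str, seconds: float) -> float:
    """seconds: time from the entry of the first character to the entry of the last."""
    return (len(transcribed_text) - 1) * 60.0 / (seconds * 5.0)

# A 30-character phrase entered in 60 s: (29 * 60) / (60 * 5) = 5.8 wpm
print(words_per_minute("the quick brown fox jumps over", 60.0))
```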

Corrections made before the phrase is completely entered are not counted in the final word count. However, they affect the overall time; thus, if many corrections are made, the wpm rate will be lower. On the other hand, if the user enters many incorrect characters very rapidly, the wpm rate will be higher. For this reason other complementary metrics are required.

2.4.2 Error Rate

The error rate is a measure of how many mistakes were made. The text input stream is composed of input text and editing commands. This input stream can be divided into four classes based on how they affect the error rate [54]: (C) correct text and other commands, (INF) incorrect and not fixed text, (F) fixing commands such as backspaces or replacements, and (IF) incorrect and fixed text. The final text is composed of the C and INF elements of the input stream.

When evaluating text entry methods it is common to ask participants to transcribe a set of prerecorded phrases. The final phrase produced by the participant with the text entry method is compared to the target phrase to obtain the error rate metrics.

The minimum string distance (MSD) is given by the minimum number of editing operations (deletions, insertions, and substitutions) required to transform one string into another. The MSD error rate is given by:

MSD error rate := MSD(p, t) / max(|p|, |t|)

where p is the phrase produced by the participant, t is the target phrase, and | · | is the string length operator.

The MSD error rate is also known as the uncorrected error rate, as it only considers uncorrected errors (corrected errors are not present in the final text).
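
A minimal sketch of the MSD error rate is shown below, using the standard dynamic-programming (Levenshtein) computation of the minimum string distance; the function names are mine.

```python
# Illustrative implementation of the MSD (Levenshtein distance) and the MSD error rate.
def msd(a, b):
    """Minimum number of deletions, insertions and substitutions to turn a into b."""
    dist = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        prev_diag, dist[0] = dist[0], i
        for j, cb in enumerate(b, start=1):
            prev_diag, dist[j] = dist[j], min(
                dist[j] + 1,              # deletion
                dist[j - 1] + 1,          # insertion
                prev_diag + (ca != cb),   # substitution (free if characters match)
            )
    return dist[len(b)]

def msd_error_rate(produced, target):
    return msd(produced, target) / max(len(produced), len(target))

print(msd_error_rate("helo world", "hello world"))  # 1 / 11, about 0.09
```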

2.4.3 Key Strokes per Character in Eye Typing

Key strokes per character (KSPC) reflects efficiency in terms of key strokes. If the amount of IF text is high, the MSD error rate will not be affected, but it will be reflected in a higher KSPC. Key strokes include the characters, the editing keys, and any other keys used by the text entry method.

² The average English word length is approximately 5 characters (http://www.wolframalpha.com/input/?i=average+english+word+length).


The concept of key strokes is tightly related to typing on a keyboard. The term eye typing is used with slightly different meanings throughout the literature on text entry by gaze. In this thesis, eye typing refers exclusively to methods that mimic the typing activity on a physical keyboard (e.g. with dwell-time). Eye typing systems often use virtual keyboards with traditional keyboard layouts, such as QWERTY.

The text entry method presented in this thesis is not an eye typing system and the concept of key strokes does not apply to it. For this reason, KSPC is not reported for any of the experiments.


Chapter 3

Literature Review

In this chapter, we present a literature review on methods for text entry by gaze. We use the taxonomy proposed by Møllenbach et al. [40], which classifies gaze-based selection methods according to eye movements and the nature of the graphic display objects (GDO). Møllenbach et al. use three classes of GDO: no GDO (no visual element on the screen is used for the selection mechanism), static GDO (e.g. buttons that remain in the same position and are used to direct gaze and show selection feedback), and dynamic GDO (visual elements that dynamically resize and/or move to guide gaze direction).

The present literature review only covers methods based on fixations, saccades, and smooth pursuits. Other eye movements have been explored for gaze-based interaction, such as nystagmus [23] and vergence [42]. However, to our knowledge, there are currently no methods for text entry by gaze exploring such eye movements.

3.1 Fixation-Based Text Entry

Dwell-time is the most common approach to the Midas Touch problem in fixation-based interaction with static GDO’s. To our knowledge, no fixation-based text entry method has explored the use of no GDO or dynamic GDO’s. Typical dwell-times vary between 200 and 1000 ms [34, 36]. Dwell-based methods are among the most well-studied methods for text entry by gaze, and several enhancements have been proposed to improve their speed and user experience.

3.1.1 Effect of Feedback on Speed and Accuracy

The effect of feedback on text entry rate and accuracy was investigated by Majaranta et al. [35]. Combinations of visual and auditory feedback were tested with both short (450 ms) and long (900 ms) dwell-times in a user study.

Short non-speech auditory feedback (a “click” sound) produced the best results and was preferred by the participants. According to the authors, when users are typing as fast as possible, they rely on the typing rhythm inherent to dwell-time typing. The short audible “click” not only confirmed the key selection, but also supported the typing rhythm.


3.1.2 User-Adjusted Dwell-Time

Majaranta et al. [34] conducted a longitudinal experiment to evaluate the impact of allowing the user to adjust the dwell-time of a text entry interface. Ten participants gaze-typed with a dwell-based virtual keyboard for a total of two and a half hours, divided into ten sessions on ten different days.

The participants increased their average text entry rate from 6.9 wpm in the first session to 19.9 wpm in the tenth session. The dwell-time was initially set to 1000 ms and participants were allowed to adjust it between sentences. The average dwell-time in the last session was 282 ms; it decreased most rapidly in the first three sessions. The MSD error rate decreased from 1.28% in the first session to 0.36% in the last session.

In a longitudinal study by Räihä and Ovaska [46], participants achieved entry rates of 20–24 wpm with an adjustable dwell-time after typing for 4 hours and 45 minutes. Though some participants set the dwell-time to almost 200 ms, they returned it to closer to 300 ms before finishing the session. The lower dwell-times caused more errors, most of which were corrected, so participants decided not to type as fast as they could in order to reduce the error rate.

3.1.3 Automatically-Adjusted Dwell-Time

Špakov and Miniotas [59] proposed an automatically adjusted dwell-time. Their method uses information about the average exit time, where the exit time is measured as the time between the selection of a key and the moment the gaze leaves the key. Exit time data is collected from the past 10 selections and used to compute the new dwell-time.
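
The sketch below illustrates the general idea of maintaining a dwell-time from the exit times of the last 10 selections. The update rule and the clamping range are assumptions for illustration; the exact rule used by Špakov and Miniotas is not reproduced here.

```python
from collections import deque

# Illustrative sketch only: keep a dwell-time updated from the exit times of the
# last 10 selections. The update rule (follow the average exit time, clamped to
# 200-1000 ms) is an assumption, not the rule published by Špakov and Miniotas.
class AdaptiveDwell:
    def __init__(self, initial_dwell_ms=600.0):
        self.dwell_ms = initial_dwell_ms
        self.exit_times = deque(maxlen=10)  # exit times of the past 10 selections

    def record_exit(self, exit_time_ms):
        """exit_time_ms: time between a key selection and the gaze leaving that key."""
        self.exit_times.append(exit_time_ms)
        avg_exit = sum(self.exit_times) / len(self.exit_times)
        self.dwell_ms = min(max(avg_exit, 200.0), 1000.0)
        return self.dwell_ms
```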

They conducted an experiment to validate their technique. Nine participants typed 21 phrases in a single session of about twenty minutes. Participants achieved an average entry rate of 12.1 wpm and the final adjusted dwell-times ranged from 450 to 600 ms, with a grand mean of 533 ms. The average MSD error rate was 2.3%.

3.1.4 Cascading Dwell-Time

Dwell-based text entry creates a cadence as the user gets accustomed to the interface, as discussed by Majaranta et al. [35]. This rhythm helps users predict how long they have to wait on each key. To avoid disrupting the typing rhythm, Mott et al. [38] proposed a cascading dwell-time.

They use the probabilities of the next character from a language model to update the dwell-time for each key. If the character has a high probability, its dwell-time is slightly reduced to make it easier to select. If a character is very unlikely to be the next one, its dwell-time is increased to avoid wrong selections. The variation in the dwell duration is kept small so the user does not perceive it, thus not disrupting the rhythm.
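
A minimal sketch of this idea is given below: each key’s dwell is scaled according to the language-model probability of its character being typed next. The linear mapping and the ±20% bound are assumptions made for illustration; the actual adjustment rule of Mott et al. differs in its details.

```python
# Sketch of the cascading idea: nudge each key's dwell-time up or down according
# to the language-model probability of its character being typed next. The linear
# mapping and the ±20% bound are assumptions, not the rule used by Mott et al.
def cascading_dwell(base_dwell_ms, next_char_probs, max_change=0.2):
    """next_char_probs: dict mapping each character to its predicted probability."""
    dwell_per_key = {}
    for char, prob in next_char_probs.items():
        # Probable characters get a slightly shorter dwell, unlikely ones a longer
        # one; the variation stays within ±max_change so the rhythm is preserved.
        adjustment = max_change * (0.5 - prob) * 2.0
        dwell_per_key[char] = base_dwell_ms * (1.0 + adjustment)
    return dwell_per_key

print(cascading_dwell(330.0, {"e": 0.40, "a": 0.30, "z": 0.01}))
```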

Participants achieved an average rate of 12.4 wpm using the cascading dwell-time, compared to 10.6 wpm with a static dwell-time. The baseline dwell-time could be adjusted at the beginning of each trial for both methods. The experiment consisted of 8 sessions; the first and last sessions lasted 1 hour and the others 30 minutes. The mean baseline dwell-time was 334 ms for the cascading dwell-time and 327 ms for the static dwell-time. The average MSD error rate was 1.45% for the cascading dwell-time and 1.95% for the static dwell-time.


3.1.5 AugKey

Word prediction is a common solution to improve the typing task. However, in a gaze-based method, verifying whether the desired word is among the candidates predicted by a language model requires shifting the gaze to a different location.

The shifting required to look for information is related to the dichotomy between the use of gaze for input and output: gaze was being used to control the interface, but now it needs to be used for visual information acquisition. Using gaze to control the interface reduces visual throughput compared to when the eyes are free to look at the interface. Diaz-Tula and Morimoto proposed AugKey [9] to augment the keys of a virtual keyboard with context information, thus increasing visual throughput.

Prefix data (previously typed text) is shown to the left of the letter inside the key and possible word completions to the right (Figure 3.1). As the information is contained inside the key, the user is not required to move their gaze much. For this reason, the user can obtain this contextual information while waiting for the dwell-time to complete.

Figure 3.1: AugKey augments the keys with contextual information. In this example, the prefix and possible completions are displayed around the letter “L” in the key. Adapted from [9].

If one of the possible word completions is the word the user wants to type, they can select it on the right-hand side of the interface by dwell-time. If the word is not among the possible completions, the user does not have to search for it among the candidates.

Diaz-Tula and Morimoto conducted an experiment to evaluate the impact of AugKey compared to a dwell-based virtual keyboard with and without word prediction. Participants reached an average text entry rate of about 15.5 wpm after typing for approximately 1 hour (9 sessions of at least 6 minutes) using AugKey. The final text entry rate for AugKey was approximately 28% faster than the dwell-based keyboard without word prediction and 20% faster than the keyboard with traditional word prediction. The MSD error rate remained below 0.8% across all sessions.

3.1.6 Potential of Dwell-Free Text Entry

If the dwell-time is set to zero, a fixation of any duration on a key selects it. Kristensson and Vertanen [26] showed the potential speed gain of an ideal dwell-free virtual keyboard. They present a human performance model that decomposes the text entry rate into dwell time and overhead time. The overhead time is defined as the time needed to transition between keys and to perform any necessary error corrections. Based on the proposed model, the text entry rate TER in words per minute can be computed as:

TER = 12,000 / (DWELL + OVERHEAD)    (3.1)

where DWELL and OVERHEAD are the dwell and overhead times in milliseconds. Kristensson and Vertanen estimated the overhead time, using the results from [34], to be 318 ms for participants with 2.5 hours of eye typing experience. A direct conclusion is that, to reach an entry rate of 25 wpm with a 300 ms overhead, the dwell-time must be set to 180 ms, using the following equation derived from Equation 3.1:

DWELL = 12,000 / TER − OVERHEAD    (3.2)

In both the study by Majaranta et al. [34] and the study by Mott et al. [38], in which participants could adjust the dwell-time, the minimum dwell-time was never lower than 250 ms. Thus an entry rate of 25 wpm with dwell-time only (without word completion) is unlikely to be achieved and maintained.

Kristensson and Vertanen [26] conducted an experiment with eight participants using an ideal dwell-free virtual keyboard. The interface knew what the next character should be. As soon as the participant’s gaze was within a distance of 1.5 times the size of a key, the character was entered and the participant could look at the next character. Also, the space between words was added automatically, so the participants only had to select the letters in the phrase. After 40 minutes of practice, participants achieved a mean entry rate of 46 wpm. Applying Equation 3.2 with DWELL = 0, we find that the mean OVERHEAD was about 261 ms. As the interface was an ideal keyboard, it made no mistakes, so participants did not have to correct errors. Even with an overhead of 261 ms, the dwell-time would have to be 219 ms for users to achieve a 25 wpm rate with a dwell-based keyboard.
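
The figures quoted above can be checked by plugging them into Equations 3.1 and 3.2; the short snippet below does exactly that (the function names are mine).

```python
# Sanity check of the figures quoted above using Equations 3.1 and 3.2.
def ter(dwell_ms, overhead_ms):            # Equation 3.1
    return 12_000 / (dwell_ms + overhead_ms)

def required_dwell(ter_wpm, overhead_ms):  # Equation 3.2
    return 12_000 / ter_wpm - overhead_ms

print(ter(0, 261))              # ideal dwell-free keyboard: ~46 wpm
print(required_dwell(25, 300))  # 25 wpm with a 300 ms overhead: 180 ms
print(required_dwell(25, 261))  # 25 wpm with a 261 ms overhead: 219 ms
```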

3.1.7 Inferring Intent With Process Models

Salvucci [50] proposed one of the earliest dwell-free text entry methods. Fixations are mapped into letters using hidden Markov models. The method tries to classify fixations as incidental or intended. Intended fixations are then mapped into letters based on a predefined grammar of how letters follow each other according to words in a dictionary. Unfortunately, the intent inference method was not incorporated into an actual interface because it could not be executed in real time with the computers available at that time.

3.1.8 Filteryedping

Pedrosa et al. [43] proposed a dwell-free eye typing method called Filteryedping. The user selects the keys on the keyboard without dwell-time. After selecting all the characters in the word, the user looks at the bottom part of the interface, where word candidates are displayed. The word is typed by looking back at the keyboard or at the text area. If the typed word is not among the word candidates, the user can navigate to the next block of word candidates by looking at an arrow key. If the last selected button is the arrow key, no word is entered.

A key on the top-right part of the interface allows for deleting words, backspacing, and adding a new line. As the user looks at the key, the possible options are displayed vertically below it. Analogously, a key on the top-left part of the interface allows for key modifications (e.g. shift or caps-lock) and for switching to a number and punctuation layout.

The word candidates are obtained from a lexicon (augmented with frequent misspellings) and filtered to those that can be generated by the sequence of selected keys. The word candidates are then sorted according to their score:


scoreF(W) = log10(freq(W)) + w · |W|    (3.3)

where W is the word candidate, freq(W) is the frequency of W in a previously trained corpus, |W| is the length of W, and w is a weighting factor, empirically set to 1.08.
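
As a minimal sketch, Equation 3.3 can be used directly to rank candidates; the toy frequency table below is purely illustrative.

```python
import math

# Equation 3.3 used to rank word candidates; the frequency table is a toy example.
def score(word, freq, w=1.08):
    return math.log10(freq[word]) + w * len(word)

frequencies = {"the": 1_000_000, "then": 50_000, "them": 48_000}
ranked = sorted(frequencies, key=lambda word: score(word, frequencies), reverse=True)
print(ranked)  # ['the', 'then', 'them']
```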

The character keys are displayed in small circles separated by 115 pixels (approximately 2 degrees of visual angle, computed based on the monitor dimensions), but the sensitive area, which recognizes the gaze, is larger. The sensitive areas of neighboring keys overlap in a small region. When the estimated gaze position falls in the intersection, both keys are highlighted and considered for the word candidates.

Words not in the lexicon (e.g. passwords or less common names) can be entered by dwell-time (the default value is 1000 ms). When the user looks at the bottom of the interface, the first candidate is the word composed of the letters selected by dwell-time.

Pedrosa et al. also implemented a shape-based text entry method inspired by the work of Kristensson and Zhai [27]. The whole sequence of eye movements on the keyboard is used to compute the word candidate list. The initial and final gaze samples are removed up to a threshold.

The authors performed two experiments. The first compared the shape-based method with Filteryedping. Twelve participants used each method for three 15-minute sessions. In the last session, participants achieved an average of almost 13 wpm (with an MSD error rate close to 5%) with Filteryedping and almost 7 wpm (with an MSD error rate of almost 15%) with the shape-based method. The average position of the selected words in the candidate list was also computed. Ideally, the position of the selected word should always be 1. The average position of the selected words for Filteryedping in the last session was approximately 4, while for the shape-based method it was above 10.

Given the poor performance of the shape-based method, the second experiment compared only Filteryedping to the adjustable dwell-keyboard by Majaranta et al. [34]. The second experiment was divided into two phases: the first with six participants without disabilities, and the second with six participants with disabilities (four with Amyotrophic Lateral Sclerosis and two with Duchenne Muscular Dystrophy). Participants without disabilities achieved mean entry rates of 15.95 wpm with Filteryedping and 11.71 wpm with the dwell-keyboard after typing with each method for 100 min. The MSD error rate was below 1% for both methods in the last session. Participants with disabilities reached mean entry rates of 7.60 wpm with Filteryedping and 6.36 wpm with the dwell-keyboard after typing with each method for 60 min. The MSD error rate was close to 10% for Filteryedping and close to 5% for the dwell-keyboard.

An improvement over Filteryedping’s filtering method was proposed by Liu et al. [28] with GazeTry. Instead of filtering the words that can be generated with the sequence of letters selected by the user’s gaze, they proposed a modified string matching algorithm that assigns different costs to replacing letters with neighboring keys on the keyboard or with other keys.

3.2 Saccade-Based Text Entry

Saccade-based methods may use single or multiple saccades to perform actions. In the literature, single saccades have been used to trigger selections, while multiple saccades, also called eye or gaze gestures, can represent different actions.

Antisaccades, voluntary saccades in the direction opposite to the onset of a visual stimulus, have been explored for selection [20]. However, to our knowledge, no work in the literature has used antisaccades for text entry.

3.2.1 Saccade-Based Text Entry with Static GDO’s

Context Switching

Context Switching, by Morimoto and Amir [37], presents duplicated keyboard regions, or contexts, to achieve dwell-free selection (Figure 3.2). A key is only selected by switching contexts, so the user can freely explore the keyboard before deciding to enter a character. As the user switches contexts with a saccade, the last focused key is entered. Key focus is activated by a short dwell time.

Figure 3.2: Context Switching uses duplicated contexts for dwell-free selection. The last selected key is entered when the user performs a saccade to the opposite context. Retrieved from [37].

A text region separates the two contexts and displays the entered text. If the user performs a saccade from a context to the text region, no character is entered. The space between the two contexts that contains the text region is called the bridge [8]. The bridge is used to avoid accidental selections and to make the selection more robust to noise in gaze estimation. To select a target in one context the user has to perform a saccade to the other context, crossing the bridge entirely.

In an experiment, 6 participants achieved about 12 wpm, and the fastest participant was able to enter text at above 20 wpm with an error rate below 2%. Both the results and feedback from the participants indicate that higher rates may be achievable with practice.

Context Switching comes at the expense of screen real estate. To reduce the impact of the size of the duplicated contexts, Diaz-Tula et al. [8] proposed Dynamic Context Switching. The context containing the user’s gaze is displayed in full size, while the duplicated context is minimized. During the saccade from one context to the other, the full-size context is minimized and the minimized context is restored to full size.

Diaz-Tula et al. conducted an experiment to evaluate Dynamic Context Switching with a button selection task. Context resizing was not considered disorienting by any participant. A possible explanation is the saccadic suppression phenomenon: as the context resizing occurs during the saccade, the user may not perceive it.

Gesture-Based Character-Level Text Entry

Isokoski [21] proposed the use of gaze gestures with large off-screen targets to save screen real estate. The proposed method, Minimal Device Independent Text Input Method (MDITIM), uses five off-screen targets: four main targets at the top, bottom, left, and right of the screen, and an additional target elsewhere. Letters are entered by performing gestures on the four main targets, and the additional target is used to modify the characters (e.g. upper case). No quantitative experimental results are presented in the paper.

EyeWrite [64], introduced by Wobbrock et al., is an adaptation of EdgeWrite [63], a text entry method for handheld devices. The interface of EyeWrite is composed of the four corners and the center of a square window (Figure 3.3a). A character is entered by “writing” it as a sequence of fixations on the corners of the window. The authors use the term “writing” to refer to the gaze gesture, as it resembles the process of writing a character. The gesture is finished by fixating the center of the window, and the character is entered. The letter s, for example, is written by fixating the top right corner, the top left corner, the bottom right corner, the bottom left corner, and finally the center of the window. If the user looks away from the window the gesture is canceled. The fixation on the center of the window requires a brief dwell-time to avoid finishing a gesture early due to gaze estimation errors or noise.

(a) With EyeWrite, letters are entered by connecting the dots in the four corners of a square window. This example shows the letter “S”. Adapted from [64].

(b) With EyeK, a key is selected by fixating it and performing a quick gesture looking outside and back inside. Adapted from [51].

Figure 3.3: Interfaces of (a) EyeWrite and (b) EyeK.

Wobbrock et al. conducted an experiment comparing EyeWrite to a dwell-based keyboard. Participants achieved an average text entry rate of 4.87 wpm with EyeWrite, with some proficient users reaching 8 wpm, compared to 7.03 wpm with a 330 ms dwell-based keyboard. Though the entry rate was lower than the baseline, EyeWrite was perceived as faster in the participants’ subjective feedback. Participants also left significantly fewer uncorrected errors with EyeWrite. Another advantage is that, due to the simplicity of its interface, EyeWrite uses considerably less screen real estate than a virtual keyboard with the whole Roman alphabet.

Sarcar et al. [51] proposed another dwell-free text entry method called EyeK. To select a key with EyeK, the user must fixate the key and perform a quick gesture looking at the outside region above the key and back inside it (Figure 3.3b). In an experiment, participants achieved a mean entry rate of approximately 6 wpm using EyeK.

Gesture-Based Word-Level Text Entry

In a position paper, Hoppe et al. [17] introduced Eype, their initial approach to entering words with gaze gestures. They argue that the beginning of the gaze gesture must either be indicated by an explicit method, e.g. dwell-time, or be considered just as unreliable as the rest of the gesture. In the latter case, alternatives to the methods used in swipe-based methods should be investigated.

In their initial approach, gaze samples are mapped to the closest key and the resulting character sequence is compared to words in a lexicon. This is similar to the approach of Filteryedping and GazeTry.

Pedrosa et al. [43] compared an implementation of a shape-based method to Filteryedping. However, as Filteryedping outperformed the shape-based method in terms of selecting the correct word, the shape-based method was not tested in further experiments.

Though both the character-filtering and the shape-based methods may be applied to the same data and interface, the interface design may favor one of the methods. To enter a word with Filteryedping, for instance, the user is instructed to look at each letter of the word. The keys change their color to indicate a selection by gaze. This conveys the idea of fixating each key individually, which is a hard requirement of the filtering method. A shape-based method may still work even if some characters are not selected.

Both Eype and the method proposed in this thesis, EyeSwipe, are based on gaze gestures, or gaze paths. This implies that the shape, or path, is more important than the absolute positions for these methods. In theory the fixations are not required to be exactly on the characters that form the word. For this reason, though similar, from the user’s perspective the interfaces convey different ideas. Thus we classify Filteryedping as a fixation-based method, and Eype and EyeSwipe as saccade-based methods.

3.2.2 Saccade-Based Text Entry with No GDO’s

A similar approach to EyeWrite was proposed by Porta and Turina with Eye-S [45]. Characters are also entered by fixating a sequence of regions. Instead of a window with regions in the 4 corners, Eye-S uses 9 semitransparent regions that occupy the whole screen in a 3 × 3 grid. To start a gaze gesture the user must select the first region with a dwell-based selection. Though Møllenbach et al. classified Eye-S as a saccade-based method with static GDO’s, the authors argue that fully transparent regions could be used, so Eye-S can be considered a saccade-based method with no GDO’s.

For an evaluation of Eye-S, 8 participants entered the sentence “hello! I am writing with my eyes.” after 10 minutes of practice with the interface. The average time to complete the task was 188.5 s, which is equivalent to an entry rate of approximately 2.0 wpm. In a separate, longer experiment, two experienced participants were able to achieve 6.8 wpm.

3.2.3 Saccade-Based Text Entry with Dynamic GDO’s

Saccade-based methods with dynamic GDO’s often employ pop-up keys or regions, or dynamically adjust the contents of regions depending on the current state of the interface.


pEYEwrite

Urbina and Huckauf proposed pEYEwrite [19], which is based on pie menus. Letters are grouped in sections of the pie menu. As the user’s gaze enters the border region of a section, called the selection border, a second-level pie menu is displayed with each letter in a different section. Letters are entered by crossing the respective border on the second-level menu. No quantitative results were reported in the paper.

Urbina and Huckauf [57] extended pEYEwrite by allowing users to enter pairs of characters instead of single characters. A third-level pie menu is used (Figure 3.4), showing possible characters to be entered alongside the character selected on the second-level menu. Besides allowing bigrams instead of single characters, the possibility of selecting letters by dwell-time was also added. The user can choose between selecting a character by dwell-time or by crossing the selection border.

Figure 3.4: In pEYEwrite, letters are grouped in sections of a pie menu. When the user fixates the border of a section, another pie menu is displayed, so the user can select letters individually. A third-level pie menu is used to enter pairs of letters. In this example the letters “A”, “I”, “D”, “E”, and “S” can be entered alongside “N”. Adapted from [57].

Urbina and Huckauf [57] conducted an experiment comparing the original pEYEwrite with three versions of the extension, which differ in the third-level menu: one with vowels, one with the most probable letters, and another with the most probable letters based on the previous two letters. They also compared the three extended versions with and without word prediction, totaling 7 methods (no word prediction was used with the original pEYEwrite). Up to three word candidates were displayed above the text field and could be entered by dwell-time.

Nine participants were divided into three groups. Each group tested one of the three versions of the bigram extension, with and without word prediction, and the original pEYEwrite. Participants performed 20 sessions with each method, each session consisting of 3 phrases. A mean entry rate of 7.34 wpm was achieved with the original pEYEwrite; the maximum entry rate was 10.38 wpm and the mean MSD error rate was 1.41%. The highest mean entry rate with the bigram extension without word prediction was 10.64 wpm (the maximum was 12.85 wpm and the MSD error rate was 0.96%), obtained with the dynamically computed most probable letters. The same method was also the fastest with word completion: the mean entry rate was 13.47 wpm (the maximum was 17.26 wpm) and the MSD error rate was 0.01%.


Quikwriting

Bee and André [4] adapted the stylus-based interface of Quikwriting, introduced by Perlin [44], for gaze input. The central area of the interface, called the resting area, shows groups of up to five characters. As the user moves their gaze away from the resting area through a group of characters, the characters in that group are displayed in up to five regions around the resting area. The character in the last region the user looked at before returning to the resting area is entered.

Figure 3.5: In Quikwriting adapted for gaze, letters are grouped around a circle. When the user’s gaze exits the circle, the letters in the group in the exit region are displayed in the external regions. The letter in the last focused region before the user’s gaze reenters the central area is typed. Adapted from [4].

In an experiment with 3 participants, the mean text entry rate was 5.0 wpm with Quikwriting and 7.8 wpm with a virtual keyboard with a fixed dwell-time of 750 ms. The authors state that the error rate for Quikwriting was much higher than for the dwell-keyboard, but the numbers are not provided.

3.3 Smooth-Pursuit-Based Text Entry

Smooth pursuit eye movements have only recently started to be explicitly explored in gaze-based interaction research [25, 58]. However, earlier methods, such as Dasher, by Ward et al. [61], also prompted the user to perform smooth pursuits, though not explicitly.

3.3.1 Smooth-Pursuit-Based Text Entry with Static GDO’s

In most cases, smooth pursuit eye movements require moving targets to be initiated. Lorenceau [30] designed a flickering visual display that can be used to sustain smooth pursuits in arbitrary directions. The method explores a perceptual phenomenon that induces the perception of motion using a static random-dot field undergoing counterphase flickering.

With some training, smooth pursuits can be performed in any direction using the flickering display. Lorenceau proposed using such voluntary smooth pursuits to write numbers, letters, and words. In a pilot experiment Lorenceau estimated that users could write approximately 20–30 characters per minute (∼4–6 wpm).

3.3.2 Smooth-Pursuit-Based Text Entry with Dynamic GDO’s

Smooth pursuits have been implicitly explored by zooming interfaces [15, 61]. Zooming interfaces handle inaccuracies in gaze estimation by continuously zooming in on, or increasing the size of, the GDO’s closest to the estimated gaze position. As the zooming is commonly also associated with some movement, the user may end up performing smooth pursuits to keep the GDO under focus.

Dasher

Ward et al. [60] proposed Dasher, a text entry method that can be used with any pointing device, such as a mouse, head-movement-based mouse-replacement interfaces, or eye trackers [61]. It is one of the fastest methods for text entry by gaze to date. Characters are displayed vertically in separate regions on the right-hand side of the interface. As the user moves the pointer to the right, the column moves faster towards the center. At the same time, the character regions closer to the pointer’s vertical coordinate increase in size. Upon reaching the center of the interface, a single character region is visible and the character is entered.

Figure 3.6: Letters are displayed vertically in alphabetical order on the right side of Dasher’s interface. When the user looks to the right, the letters start moving towards the center, zooming in on the letters closest to the user’s gaze. Letters are entered as they cross the central line. Screenshot of the Dasher software by Ward et al. [61].

If the user looks to the left-hand side of the interface, the entered characters start moving back to the right. As a character reaches the center of the interface it is deleted. Language models can be used to alter the size of the character regions and accelerate the text entry.

When the pointer movement is decoupled from the user’s gaze, such as with a mouse or a head-movement-based interface, the user can perform short vertical movements to select the desired character while visually searching for the next character in the following column. However, when gaze is used as the pointing method, visual search and target selection are coupled.

It may be distracting for the user to search for the next character while the interface keeps moving according to the new gaze position. Once the desired character is found, it can be easily selected by keeping it under focus until it reaches the center of the interface.

Tuisku et al. [56] performed a longitudinal experiment to evaluate the speed of Dasher controlled by gaze. Twelve participants transcribed Finnish text with Dasher for 2 hours and 30 minutes. The mean text entry rate increased from 2.5 wpm in the first session to 17.3 wpm in the tenth (last) session.

StarGazer

Hansen et al. [15] proposed the pan-and-zoom interface StarGazer. Characters are distributed in two concentric circles in the center of the interface. As the user saccades towards a character, the interface starts zooming in and panning to position the character at the center of the interface. The user follows the character with a smooth pursuit movement. A character is selected once it is sufficiently zoomed and centered.

Figure 3.7: In StarGazer, characters are distributed around two concentric circles. The interface pans and zooms in the direction of the user’s gaze. Adapted from [15].

The authors conducted an experiment to evaluate StarGazer. Experimental results indicate that the method is robust to noise artificially added to the gaze estimation. Seven participants achieved an average text entry rate of 8.16 wpm while leaving no uncorrected errors.

Calibration-Free Text Entry

More recently, smooth pursuits have been explicitly used for calibration-free gaze-based interaction [58]. Khamis et al. [25] presented a method to enter predefined phrases or words with smooth pursuits. They give the example of a voting application: a question is presented and the possible textual answers are displayed moving along predefined trajectories. As the user reads the moving text, the smooth pursuit target is detected using a correlation function.
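
Pursuit-based selection of this kind is often implemented by correlating the gaze trajectory with each target’s trajectory and picking the target whose motion matches best. The sketch below illustrates that idea with a per-axis Pearson correlation; the 0.8 threshold and the data layout are assumptions for illustration, not details taken from Khamis et al.

```python
# Sketch of pursuit detection by trajectory correlation, in the spirit of
# Pursuits-style techniques: the moving target whose path best correlates with
# the gaze path is taken as the one being followed.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def followed_target(gaze, targets, threshold=0.8):
    """gaze: list of (x, y); targets: dict name -> list of (x, y) of equal length."""
    best_name, best_corr = None, threshold
    for name, path in targets.items():
        # Correlate the horizontal and vertical components separately and keep
        # the weaker of the two, so both axes must match the target's motion.
        corr = min(pearson([g[0] for g in gaze], [p[0] for p in path]),
                   pearson([g[1] for g in gaze], [p[1] for p in path]))
        if corr > best_corr:
            best_name, best_corr = name, corr
    return best_name
```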


Lutz et al. [31] introduced SMOOVS, a text entry method that uses smooth pursuits. The method requires only coarse gaze estimation to enter moving characters distributed over the interface. Clusters of at most 6 characters are distributed in a hexagonal layout around the center of the interface. The method uses a 1-point calibration to determine when the user is looking at the center of the interface, where the entered text is displayed. When gaze is located at the center, the character clusters do not move.

Character selection is performed in two steps. In the first step the character clusters move apart from each other. A cluster is selected if its movement correlates with the gaze movement according to an angle criterion. In the second step only the selected cluster is visible and the characters in the cluster move apart from each other. A character is entered if its movement correlates with the gaze movement. If the user looks back at the center at any moment, the interface resets to the original state and no character is entered.

Figure 3.8: In SMOOVS, selection of a character occurs in 3 steps using smooth pursuits. (1) Clusters of characters start moving outwards. (2) Characters in the selected cluster move apart from each other. (3) The character followed by the user’s gaze is selected. Adapted from [31].

3.4 Literature Review Results Summary

In this section we summarize the results from the works presented in the literature review (Table 3.1). Some of the results presented in the table have been estimated based on figures and other results available in the original papers when the actual numbers were not provided by the authors.

From the literature review we observe that solutions to text entry by gaze are currently limited to peak entry rates below 30 wpm. Based on the results by Kristensson and Vertanen we know that higher rates are possible.

Text entry rate is only one part of the user experience. An uncomfortable method is undesirable even if it is fast, and even the perception of speed may differ from the actual entry rate. In the work presented in this thesis we aimed to develop a text entry method that is both comfortable and fast. We also intended to improve the perception of continuity of the interaction, avoiding interruptions such as those imposed by selecting individual characters.


Table 3.1: Summary of results from the literature on text entry by gaze. Some results were inferred from figures or other data available in the original papers. Results inferred by us are indicated by a ∗ symbol. The † symbol indicates that the average entry rate of the fastest participant in their best session is shown. The column “Max. WPM” shows the maximum text entry rate achieved by a single participant (unless stated otherwise). The column “Last WPM” shows the average entry rate in the last session of the experiment for all participants. The last row shows the results from Kristensson and Vertanen’s study [26] with their ideal virtual keyboard.

Authors | Method | Max. WPM | Last WPM | Participants | Practice time
Porta and Turina [45] | Eye-S | 6.8 (2 experts) | 2 | 8 | 1 phrase
Lutz et al. [31] | SMOOVS | N/A | 3.34 | 24 | 1 phrase
Lorenceau [30] | Flickering | N/A | ∼5∗ | 7 | 90 min
Bee and André [4] | Quikwriting | N/A | 5 | 3 | 10 phrases
Wobbrock et al. [64] | EyeWrite | 8 | 5 | 8 | 112 phrases
Sarcar et al. [51] | EyeK | N/A | 6 | 5 | ∼45 min (90 phrases)
Hansen et al. [15] | StarGazer | N/A | 8.16 | 7 | 1 word (own name)
Morimoto and Amir [37] | Context Switching | 20† | 12 | 7 | N/A
Špakov and Miniotas [59] | Dwell (automatically adjusted) | N/A | 12.1 | 9 | ∼20 min (21 phrases)
Mott et al. [38] | Cascading Dwell | 13.79 (Max. Avg.) | 12.39 | 17 | 150 min
Urbina and Huckauf [57] | pEYEwrite | 17.26 (word completion) | 13.47 (word completion) | 3 | 60 phrases
Pedrosa et al. [43] | Filteryedping | 19.28 | 15.95 | 12 | 100 min
Diaz-Tula and Morimoto [9] | AugKey | N/A | ∼17∗ | 7 | 72 min
Tuisku et al. [56] | Dasher (longitudinal exp.) | ∼24† | 17.3 | 12 | 150 min
Majaranta et al. [34] | Dwell (user adjusted) | ∼24∗ (Max. Avg.) | 19.9 | 10 | 150 min (10 days)
Räihä and Ovaska [46] | Dwell (user adjusted) | ∼24∗ (Max. Avg.) | ∼20∗ | 10 | 285 min
Ward and MacKay [61] | Dasher | 25† (expert) | ∼22∗ (expert), ∼12∗ (novice) | 4 (2 experts) | 60 min
Kristensson and Vertanen [26] | Dwell-free (ideal) | 54† | 46 | 8 | 50 min


Chapter 4

EyeSwipe 1

Inspired by the promising results of dwell-free text entry methods [26] and by swipe-based methods available for smartphones, we propose EyeSwipe. With EyeSwipe, the user’s gaze path is mapped into words, similarly to how finger traces are mapped into words in swipe-based methods.

A critical difference between the finger trace and the gaze path is that the former has clear start and end positions, while the latter does not. EyeSwipe provides mechanisms to explicitly indicate the start and end positions of the gaze path. In this thesis we use the term “gaze path” to refer only to the sequence of gaze points between the start and end positions indicated by the user.

EyeSwipe can be categorized as a gesture-based method with some dynamic GDO’s. In EyeSwipe 1, the dynamic GDO’s are called pop-up action buttons. The pop-up action buttons are used for the explicit selection of the limits of the gaze path (the start and end positions).

4.1 Using Gaze Paths to Enter Words

The use of gaze paths has potential advantages over mapping each fixation to a character. The gaze estimation error may cause the wrong set of characters to be selected for a given fixation. However, the overall shape of the gaze path may not be altered significantly by an error of the same magnitude.

When glancing through the letters of the word, the user may not reach some of its letters. For example, to enter the word “map” the user may look at “M”, undershoot the saccade and reach the letter “F”, and finish at “P”. Even if the user does not undershoot, the eye tracker may fail to estimate the correct fixation at “A” and estimate it at “F” instead. In these cases, the gaze path might still be matched to the correct word.

Regarding speed, the user can perform the gaze path as fast as they can move their eyes, provided that they have enough knowledge of the keyboard layout. Also, if the mechanism for indicating the beginning and end of the gaze path is not based on time (as dwell-time is), the user can wait on each letter for as long as they want before continuing the gaze path, as long as they do not deviate much from the path while waiting.

Spelling errors may also be handled automatically. If the rest of the gaze path is still similar to the correct path, missing or wrong letters may not affect the final result. Methods that map fixations into characters, such as Filteryedping, need to handle misspelled words directly. Pedrosa et al. [43], for example, proposed the inclusion of commonly misspelled words in the lexicon used by Filteryedping, so the correct word can be found even with some spelling errors. Though not required, the solution proposed by Pedrosa et al. can still benefit path-based methods, further improving the results.
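
To make the idea of matching a gaze path against words concrete, the sketch below compares a gaze path with a word’s “ideal” path through the key centers after resampling both to the same number of points. This is a generic illustration of shape-based matching under assumed inputs (a key_centers dictionary and pixel-coordinate paths); it is not the candidate-ranking algorithm used by EyeSwipe, which is described later in the thesis.

```python
# Generic illustration of shape-based matching under assumed inputs (pixel paths
# and a key_centers dict); this is NOT the candidate ranking used by EyeSwipe.
def resample(path, n=32):
    """Return n points evenly spaced along the polyline given by path."""
    if len(path) < 2:
        return [path[0]] * n
    dists = [0.0]
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        dists.append(dists[-1] + ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5)
    total = dists[-1] or 1.0
    points, j = [], 0
    for i in range(n):
        target = total * i / (n - 1)
        while j < len(path) - 2 and dists[j + 1] < target:
            j += 1
        span = (dists[j + 1] - dists[j]) or 1.0
        t = (target - dists[j]) / span
        points.append((path[j][0] + t * (path[j + 1][0] - path[j][0]),
                       path[j][1] + t * (path[j + 1][1] - path[j][1])))
    return points

def path_distance(gaze_path, word, key_centers, n=32):
    """Average point-to-point distance between the gaze path and the word's ideal path."""
    ideal = [key_centers[c] for c in word]
    a, b = resample(gaze_path, n), resample(ideal, n)
    return sum(((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
               for (ax, ay), (bx, by) in zip(a, b)) / n
```

In such a scheme, candidates with the smallest distance would rank highest; a practical system would also weigh word frequency, in the spirit of Equation 3.3.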

4.2 Selection of Gaze Path Limits

One of the most straightforward approaches to explicitly selecting the limits of the gaze path is dwell-time: the user would simply dwell on the first letter of the word, swipe through the letters in the middle of the word, and then dwell again on the last letter of the word.

We tested this approach in a pilot experiment and found limitations that motivated us to search for alternative methods. The main issue is unintended selections by dwell-time. Each selection switches the state of the interface between performing a gaze path or not. When the user makes an unintended selection, they must realize that the state of the interface has changed and delete possibly wrong words or restart the gesture with another selection. If this realization takes longer than the dwell time and the user does not change the fixation point, another selection may occur, switching the state of the interface once more.

Selection by dwell-time is simple to perform and consequently may be performed unintentionally. As the meaning of a key selection keeps alternating (start or end a gaze path), it may become cognitively demanding for the user to keep track of the current state of the interface.

EyeSwipe 1 uses Reverse Crossing, a gaze-based selection method adapted from Target Reverse Crossing [13], to indicate the gaze path limits.

4.2.1 Reverse Crossing

Feng et al. [13] introduced Target Reverse Crossing, a selection mechanism developed for situations in which the user can control the mouse pointer position but cannot perform a click, such as with video-based mouse-replacement interfaces [5]. A target is selected when the mouse pointer leaves the target area through the same edge it entered. For example, if the mouse pointer entered the area of a button from the left, it must also leave the button from the left to select it; otherwise nothing happens.

Target Reverse Crossing was initially proposed for mouse-replacement interfaces that use movements of body parts, such as the head. In such cases the user may be able to perform short continuous movements to move the mouse pointer inside and outside of the selection target. Eye movements, however, roughly consist of a sequence of fixations and can be understood, in terms of movement, as more discrete, jumping from one fixation point to the next. Also, visual targets help the user plan the location of the next fixation.

We adapted Target Reverse Crossing to handle this sequence of discrete eye movements. Instead of using the edge between the outer and inner regions of the button for the selection, we introduced an additional pop-up action button. Once the user looks at a selection target, the action button pops up close to it, with a space between the two buttons. To perform the selection the user must look at the pop-up button and then look back at the target (Figure 4.1).

Figure 4.1: Selection of a key is performed with a reverse crossing. The red dots indicate the estimated gaze position at each point in time. As the user looks at the target key (t1), an action button pops up (t2). To perform the action, the user looks at the action button (t3) and then looks back at the key (t4).

The action button is intended to provide a visual target for the fixation outside of the selection target. So, instead of having the user alternate between the inside of the selection target and the region outside of it, we provide a visual target to guide the user’s gaze. Also, the space between the target and action buttons makes the selection more robust to noise in the gaze estimation.
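
As a minimal sketch, the reverse crossing logic can be described as a small state machine: looking at a key shows its action button, looking at the action button arms the selection, and looking back at the key completes it. The class below is an illustration under assumed interface hooks (is_key, action_button, show/hide methods); it is not the implementation used in EyeSwipe.

```python
# Sketch of reverse crossing as a state machine; the interface hooks
# (is_key, action_button, show/hide methods) are assumptions for illustration.
IDLE, KEY_FOCUSED, ACTION_FOCUSED = "idle", "key_focused", "action_focused"

class ReverseCrossing:
    """Tracks the reverse-crossing gesture for one key at a time."""

    def __init__(self):
        self.state, self.key = IDLE, None

    def cancel(self):
        if self.key is not None:
            self.key.hide_action_button()
        self.state, self.key = IDLE, None

    def on_gaze(self, element):
        """element: the key or action button currently under the gaze, or None.
        Returns the selected key when a reverse crossing is completed."""
        if self.key is not None and element is self.key.action_button:
            self.state = ACTION_FOCUSED
        elif self.key is not None and element is self.key:
            if self.state == ACTION_FOCUSED:
                selected = self.key
                self.cancel()          # hide the button and reset the state
                return selected        # reverse crossing completed
        else:
            self.cancel()
            if element is not None and element.is_key:
                self.state, self.key = KEY_FOCUSED, element
                element.show_action_button()
        return None
```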

A similar selection method has been suggested by Tall [55] in the NeoVisus interface componentslibrary for gaze interaction (Figure 4.2). To toggle a binary option, the user looks at the option andan icon appears to its side. If the user performs a saccade towards the icon, the option is toggled.Reverse Crossing can also be interpreted as an extension to the Tall’s binary option selectionmethod that requires an additional saccade back to the option. This additional saccade is neededto disambiguate some gaze movements. As the keys are surrounded by other keys, a selection couldbe performed by accident. For example, assume the user fixated the letter “A”, but actually wantsto select the letter “W”. The action button would pop-up above “A”. As the user looks at “W” thefixation could accidentally fall inside the action button, and “A” would be selected. Forcing the userto look back at “A” helps avoiding this issue.

Figure 4.2: Binary options in the NeoVisus library can be selected in 4 steps. When the user looks at the option, an icon appears to its right. If the user looks at the icon, the option is toggled. Adapted from [55].

The reverse crossing to select the action button above the key also shares similarities with EyeK, by Sarcar et al. [51]. In EyeK, the user selects a key by looking above it and then back inside of it. In EyeSwipe, the action button not only provides a visual target to guide the gesture, but is also used both to increase the number of options for a selection (e.g., the punctuation key can be used to enter multiple punctuation marks) and to provide additional visual feedback (e.g., the word to be entered).

4.2.2 Action Button

The Reverse Crossing with action buttons can also be interpreted as a variation of the pie menus used by Urbina and Huckauf in pEYEwrite [57]. As the user selects a key, the second-level menu (the action buttons) shows actions related to that key.

In EyeSwipe 1, there is only one available action per letter key: either start or end a gaze path. For consistency, the action is always displayed above the key.


However, other actions could easily be added by distributing pop-up action buttons around the key: for example, entering a single letter (to type a word letter by letter), or alternative accented letters such as "Ü" or "Ú".

In the current implementation of EyeSwipe 1, this option is explored by the punctuation key. There is a single key that can be used to enter multiple punctuation marks. Each punctuation mark is displayed in a different action button around the punctuation key. To choose between the punctuation marks, the user performs a Reverse Crossing on the desired action button.

4.3 Interface Description

The interface of EyeSwipe 1 (Figure 4.3) is composed of four main types of keys: letter keys, candidate keys, the backspace key, and the punctuation key. To enter a word with EyeSwipe 1, the user initially selects the first letter by looking at it and performing a reverse crossing on the start action button. Next, the user glances through the vicinity of the letters that form the middle of the word, stopping at the last letter. An action button then pops up showing the word that will be entered if the gaze path is ended at that letter. This word is the most probable one in a candidate word list computed from the gaze path starting at the first letter and ending at the current letter. As the user selects the action button by reverse crossing, the word is entered. Every selection by reverse crossing is also indicated by a short "click" sound.

While the user is performing the gaze path, the interface shows the current gaze position with a small blue dot and the points from the last few seconds of the gaze path with a semi-transparent blue line.

As the word is entered, the top 5 word candidates are displayed to the user in the candidate keys, sorted from most probable on the left to least probable on the right. The candidate key containing the currently entered word is disabled to indicate that it is already selected. The words in the other 4 candidate keys can replace the current word. To do so, the user selects the candidate key by reverse crossing. The disabled key is enabled, the newly selected candidate key is disabled, and the word is replaced.

If the user is not in the middle of a gaze path, they can delete a word by selecting the backspace key, indicated by an arrow pointing to the left, with reverse crossing. If in the middle of a gaze path, this same key can be used to cancel the current gaze path (e.g., when the wrong first letter was selected).

Finally, the user can enter a punctuation mark by selecting it among the action buttons displayed for the punctuation key, as described in the previous section.

The text that has been entered previously is displayed in a text field in the top part of the interface. During the experiment, the sentence to be transcribed was displayed above the text field.

4.4 Gesture Classification

EyeSwipe 1 computes the candidate word list based on the gaze path between the first and last letters explicitly selected by the user. The lexicon is filtered using the first and last letters as criteria, so all remaining words start and end with the letters indicated by the user. The resulting word list is sorted according to a score that combines the likelihood of occurrence of the word and its similarity to the user's gaze path.
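
A minimal sketch of this filtering step, assuming the lexicon is available as a plain list of lowercase words; the function name is illustrative.

# Sketch of the lexicon filtering step: keep only words that start and end
# with the letters explicitly selected by the user.
def filter_lexicon(lexicon, first_letter, last_letter):
    first, last = first_letter.lower(), last_letter.lower()
    return [w for w in lexicon if w[0] == first and w[-1] == last]

lexicon = ["gaze", "gave", "gate", "give", "grade", "get", "case"]
print(filter_lexicon(lexicon, "G", "E"))
# ['gaze', 'gave', 'gate', 'give', 'grade']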


Figure 4.3: The interface of EyeSwipe 1 is composed of a text box, word candidates, a cancel/delete button, and a virtual keyboard. The reference text to be transcribed ("gaze") is shown above the text box in the experiment. A word can be entered by completing the following actions: (1) Fixate on the first letter of the desired word. (2) A pop-up action button appears above the fixated key. Select the key using reverse crossing to initiate a gaze path. (3) Glance through the intermediate letters. The resulting gaze path is used to compute the candidate words. (4) When fixating on the last letter, the most probable candidate word is displayed on the pop-up action button. As the user selects it, the word is entered. (5) The top 5 candidates are displayed when a word is entered. The user can change the entered word by selecting another candidate button. (6) Select the cancel/delete key to cancel a gaze path (in the middle of a gesture) or delete a word.


The similarity to the user's gaze path is based on the concept of an "ideal path." The ideal path of a word is the sequence of the centers of the keys that form the word. Repeated letters, such as "o" in "book", are included just once. So the ideal path of the word "book" is the sequence of the centers of the keys "B", "O", and "K".
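
A minimal sketch of the ideal-path computation, assuming a dictionary mapping each character to the center of its key; the coordinates below are placeholders.

# Sketch of the "ideal path" of a word: the sequence of key centers, with
# consecutive repeated letters collapsed (e.g. "book" -> B, O, K).
def ideal_path(word, key_centers):
    path = []
    previous = None
    for ch in word.upper():
        if ch != previous:                 # repeated letters are included just once
            path.append(key_centers[ch])
        previous = ch
    return path

key_centers = {"B": (290, 250), "O": (430, 150), "K": (370, 200)}   # placeholder centers
print(ideal_path("book", key_centers))     # [(290, 250), (430, 150), (370, 200)]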

4.4.1 Dynamic Time Warping

EyeSwipe 1 computes the similarity of the ideal path of each word to the user's gaze path using Dynamic Time Warping (DTW) [49], a technique commonly employed to compare two time sequences that has also been used for swipe-based text entry on touch devices [65]. To compute DTW, a dynamic programming algorithm can be used (Algorithm 1).

Algorithm 1 Dynamic Time Warping [49]

Precondition: P = (p_1, ..., p_n) and Q = (q_1, ..., q_m) are sequences of points

1: function DTW(P, Q)
2:     d ← array with dimensions [1..n+1, 1..m+1] initialized with ∞
3:     d[1, 1] ← 0
4:     for i ← 2 to n+1 do
5:         for j ← 2 to m+1 do
6:             d[i, j] ← distance(p_{i−1}, q_{j−1}) + min(d[i−1, j−1], d[i, j−1], d[i−1, j])    ▷ match, deletion, insertion
7:     return d[n+1, m+1]
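
For reference, a direct Python transcription of Algorithm 1; the use of Euclidean distance as the point metric is an assumption.

# Python transcription of Algorithm 1 (unoptimized DTW between two point sequences).
import math

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])   # Euclidean distance (assumed metric)

def dtw(P, Q):
    n, m = len(P), len(Q)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # match, deletion, insertion
            d[i][j] = distance(P[i - 1], Q[j - 1]) + min(d[i - 1][j - 1], d[i][j - 1], d[i - 1][j])
    return d[n][m]

gaze_path = [(0, 0), (1, 0), (2, 1), (3, 1)]   # example raw gaze samples
ideal = [(0, 0), (3, 1)]                        # example ideal path
print(dtw(gaze_path, ideal))                    # total warping cost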

The complexity of the non-optimized version of DTW is O(MN), where M and N are the lengths of the two time sequences. Ratanamahatana and Keogh [47] showed that the amortized cost of the optimized DTW is linear. However, for EyeSwipe we needed to compute DTW multiple times. Our solution, as explained in Section 4.4.3, was to compute it over the nodes of a prefix tree. In the prefix tree it was simpler to implement the non-optimized version, so we opted to initially use the O(MN) solution. The upper limit for the length of the ideal path is the length of the word (it may be less if there are double letters). For an English language lexicon, this means that the mean length of the ideal path is approximately 5. On the other hand, the length of the raw gaze path may be on the order of a few hundred samples, depending on the sampling rate of the eye tracker. The length of the gaze path may be reduced by filtering fixations.

4.4.2 DTW Score

The word candidates retrieved from the lexicon are initially sorted according to their path score:

\[
\text{Path-Score}(g, p) = \frac{1}{1 + \text{DTW}(g, p)} \tag{4.1}
\]

where g is the gaze path, p is the ideal path of a word, and DTW(g, p) is the output distance from the DTW algorithm. The top n words are then sorted by their score:

\[
\text{DTW-Score}(g, w) = \alpha\,\frac{\text{Path-Score}(g, \text{Ideal}(w))}{\sum_{v \in L'} \text{Path-Score}(g, \text{Ideal}(v))} + (1 - \alpha)\,\frac{\text{Occurrences}(w)}{\sum_{v \in L'} \text{Occurrences}(v)} \tag{4.2}
\]


where α is a weighting factor, Ideal(w) is the ideal path of the word w, Occurrences(w) is the number of occurrences of the word w in a text corpus, and L′ is the set containing the n words with the highest Path-Score in the lexicon L. Only the top n candidates are considered because some words, such as "get", occur a few orders of magnitude more often than others. These exceptions are only possible candidates if their DTW match is high enough for them to be among the top n words.

We empirically set n = 10 and α = 0.95. The text corpus was extracted from Wikipedia [29].
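
A sketch of the two-stage ranking defined by Equations 4.1 and 4.2, assuming the dtw() and ideal_path() helpers sketched earlier and a lexicon given as a dictionary from word to occurrence count; identifiers are illustrative.

# Sketch of the two-stage ranking (Equations 4.1 and 4.2).
def path_score(gaze_path, ideal):
    return 1.0 / (1.0 + dtw(gaze_path, ideal))

def rank_candidates(gaze_path, lexicon, key_centers, n=10, alpha=0.95):
    # lexicon: {word: occurrence count}; key_centers: {character: key center}
    scored = [(w, path_score(gaze_path, ideal_path(w, key_centers))) for w in lexicon]
    top = sorted(scored, key=lambda ws: ws[1], reverse=True)[:n]   # L': top-n by path score
    path_total = sum(s for _, s in top)
    occ_total = sum(lexicon[w] for w, _ in top)
    dtw_score = {w: alpha * s / path_total + (1 - alpha) * lexicon[w] / occ_total
                 for w, s in top}
    return sorted(dtw_score, key=dtw_score.get, reverse=True)      # most probable first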

4.4.3 Prefix Tree

Every time a key is selected in the middle of a gaze path, an action button must pop up showing the most probable word for the current path. Because of this requirement, the candidate list must be recomputed for every new key selection.

EyeSwipe optimizes the process of recomputing the candidate list by computing the pairing between each gaze sample and each word prefix only once. The lexicon is stored in a prefix tree data structure (Figure 4.4), also known as a trie or radix tree. In a prefix tree, the concatenation of the characters stored by the ancestors of each node represents its prefix. For example, the words "gaze" and "gate" have the first three ancestors in common (the root, "G", and "A"), and the last two nodes are different ("Z" and "E" for "gaze", and "T" and "E" for "gate").

Figure 4.4: Example of a prefix tree with the words "a", "an", "and", "gaze", "gate", and "end". Though the suffixes of "and" and "end" are the same, they are stored in different nodes.

Each step of the algorithm that computes DTW depends on values computed for the previous points in both time sequences. One of these sequences is the ideal path of the word. If two words share the same prefix, the values obtained for the common prefix will be the same. Thus DTW is computed following the prefix tree structure to avoid recomputing the same values more than once for each prefix.

The other sequence used to compute the score is the gaze path. Once DTW has been computed for all words up to a given gaze sample, it does not need to be recomputed up to that sample again. If all the values for the previous gaze sample are stored, then when a new sample arrives only one iteration of the DTW algorithm must be performed for each point in the ideal path.


To combine these two observations it is necessary to store the latest DTW value for each prefix in the prefix tree. Once a new gaze sample, or fixation, is received, the DTW values can be updated by visiting each node of the tree once (Algorithm 2) in a depth-first traversal. Thus the update process takes O(T), where T is the number of nodes in the prefix tree.

Algorithm 2 Dynamic Time Warping update function

Precondition: Current is a node from a prefix tree; gaze is the new gaze sample
Precondition: Node.Previous-DTW and Node.DTW are initialized with ∞ for all nodes in the first iteration
Precondition: Root.Previous-DTW is initialized with 0 in the first iteration and ∞ in subsequent iterations

1: function DTW-Update(Current, gaze)
2:     Previous ← Current.DTW
3:     D ← distance(Current.Key-Center, gaze)
4:     P_{0,0} ← Current.Parent.Previous-DTW
5:     P_{0,1} ← Current.Parent.DTW
6:     P_{1,0} ← Current.Previous-DTW
7:     Current.DTW ← D + min(P_{0,0}, P_{0,1}, P_{1,0})
8:     Current.Previous-DTW ← Previous
9:     return Current.DTW

This process is further optimized by updating only the subtrees that start with the first letter selected by the user via Reverse Crossing.
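
A minimal sketch of this update over a simple prefix-tree node, following Algorithm 2; the class and field names are illustrative, not the actual EyeSwipe data structures.

# Sketch of the incremental DTW update over a prefix tree (Algorithm 2).
# Each node stores the DTW value of its prefix against the gaze path so far;
# when a new gaze sample arrives, one depth-first pass updates every node.
import math

class PrefixNode:
    def __init__(self, key_center=None, parent=None):
        self.key_center = key_center          # center of the key this node represents
        self.parent = parent
        self.children = {}
        self.word = None                      # set if a word ends at this node
        self.dtw = math.inf                   # DTW value after the current gaze sample
        self.previous_dtw = math.inf          # DTW value after the previous gaze sample

def dtw_update(node, gaze):
    previous = node.dtw
    d = math.hypot(node.key_center[0] - gaze[0], node.key_center[1] - gaze[1])
    node.dtw = d + min(node.parent.previous_dtw, node.parent.dtw, node.previous_dtw)
    node.previous_dtw = previous
    for child in node.children.values():      # depth-first traversal: O(T) per sample
        dtw_update(child, gaze)
    return node.dtw

# The root's previous_dtw is set to 0 before the first sample and to infinity
# afterwards (see the preconditions above); dtw_update is then called on each
# child of the root, or only on the subtree of the selected first letter.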

4.5 EyeSwipe 1 Evaluation

We conducted an experiment to evaluate EyeSwipe 1 in terms of performance and user experience. We compared EyeSwipe 1 to a 600-ms fixed dwell-time keyboard as baseline. The dwell-time was selected following the recommendation by Hansen et al. [16] that it should be longer than 500 ms for novice users.

4.5.1 Participants

Ten undergraduate students (5 females, 5 males; ages 18 to 21, average 19.6) without disabilities from Boston University were recruited to participate in the experiment. All participants had little or no experience with eye tracking systems and were proficient in English (8 native speakers).

About half the participants wore corrective lenses (5 with glasses, 1 with contact lenses). All participants were familiar with typing on a QWERTY keyboard.

4.5.2 Apparatus

A Tobii EyeX eye tracker (https://tobiigaming.com/product/tobii-eyex/) was used to collect the gaze information. The keyboard was displayed in a full-screen window on a 19-inch LCD monitor (1024 × 768 pixels resolution) connected to a laptop (2.30 GHz CPU, 4 GB RAM) running Windows 7. The side length of the square keys (e.g., the character keys in EyeSwipe 1) was approximately 3 degrees (90 pixels), and the distance between keys was approximately 1 degree (25 pixels).


The key sizes were selected considering the expected estimation error from the eye tracker: the keys should be large enough that the estimated gaze position can reliably select them.

For EyeSwipe 1 we used Kaufman's lexicon [24] augmented with contractions and the words in MacKenzie and Soukoreff's phrase dataset [33], resulting in 10,219 words. The word occurrences were extracted from a Wikipedia corpus [29].

The dwell-time interface used a shrinking gray box on the fixated key as visual feedback for the elapsed dwell time. A fixed dwell time of 600 ms, following the recommendation by Hansen et al. [16] for novice users, was used during the whole experiment. A short "click" sound was emitted when a key was entered.

4.5.3 Procedure

Each participant visited the lab on two different days (48–78 hours apart) and performed two text entry sessions with each interface per day. The participants received a compensation of US$25 for participating in the experiment, and an additional US$20 was awarded to the participant with the best performance (considering both accuracy and speed) as motivation.

The experimenter began by introducing the eye tracker and the two text entry methods. Before starting the formal sessions, the participant practiced entering two sentences with each method.

During the sessions, phrases were sampled from MacKenzie and Soukoreff's dataset [33] and presented to the participant, one at a time, at the top of the experiment window. The participants were encouraged to memorize each phrase and to enter as many phrases as they could, as quickly and accurately as possible. If the session timed out in the middle of a sentence, the participant had to finish entering it to end the session.

At the end of the experiment, the participants completed a questionnaire about their subjective feedback on the two text entry methods and their basic demographic information.

4.5.4 Design

We used a within-subjects design with dependent variables text entry rate and accuracy, and independent variables session and interface. Each participant performed eight 10-minute text entry sessions, four using EyeSwipe 1 and four using the dwell-time keyboard, totaling 40 minutes with each interface, aside from a short practice session. The interfaces were counterbalanced using a Latin square.

4.6 Experimental Results

4.6.1 Text Entry Rate

Participants were, on average, able to enter text faster using EyeSwipe 1 than using the dwell-keyboard (Figure 4.5). In the last session, after 30 minutes using each method, participants were able to enter text significantly faster (t(9) = 2.73, p < 0.05, Cohen's d = 0.99) with EyeSwipe 1 (11.5 wpm) than with the dwell-keyboard (9.5 wpm). There was a significant learning effect for both methods (EyeSwipe 1: F(3, 27) = 8.75, p < 0.0005; dwell-keyboard: F(3, 27) = 3.01, p < 0.05). The average text entry rate increased from 9.20 wpm in the first session to 11.48 wpm in the fourth session using EyeSwipe 1.


Figure 4.5: Boxplots for the mean text entry rates for each session and interface. Overall means are shown with a star symbol; outliers are represented by a black diamond.

Besides the mean entry rate, we also calculated the maximum text entry rate per session. The maximum entry rate with a dwell-time keyboard depends on the duration of the dwell-time and can be estimated using Equation 3.1. However, it is not clear how the maximum entry rate behaves with EyeSwipe 1.

The maximum entry rates in each session for each participant are shown in Figure 4.6. Significant main effects were found for both session and interface (F(3, 27) = 3.487, p < 0.05 and F(1, 9) = 26.810, p < 0.001, respectively). While the maximum entry rates for the dwell-keyboard are all below 15 wpm, most of the maximum entry rates for EyeSwipe 1 are above 15 wpm, with some above 20 wpm.

The overall increase in the maximum entry rates for EyeSwipe 1 over the four sessions indicates that higher rates may be achieved with practice. Based on Kristensson and Vertanen's model, for a user to achieve a typing rate of 15 wpm with a dwell-time of 600 ms, the overhead must be 200 ms. With an overhead of 100 ms, the user would be typing at about 17 wpm, and in the ideal case of zero overhead, the user would type at most at 20 wpm. We therefore conclude that some participants came close to the theoretical maximum typing rate for the keyboard with fixed dwell-time.

Some participants were able to enter text at more than 20 wpm with EyeSwipe 1. To achieve a similar entry rate of 20 wpm with a dwell-keyboard, according to Equation 3.2, the sum of the dwell-time and the overhead time must be equal to 600 ms. So, for example, it would require a dwell-time of 400 ms with 200 ms overhead, or a dwell-time and overhead of 300 ms each. Such combinations are possible and have been reported as some of the highest entry rates achieved with dwell-keyboards in the study by Majaranta et al. on adjustable dwell-times [34].
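
These figures can be checked with a back-of-the-envelope calculation, assuming the usual convention of five characters (including the space) per word; Equation 3.1 itself is given in Chapter 3.

# Back-of-the-envelope check of the dwell-time ceiling, assuming 5 characters per word.
def max_wpm(dwell_ms, overhead_ms, chars_per_word=5):
    ms_per_word = (dwell_ms + overhead_ms) * chars_per_word
    return 60000.0 / ms_per_word

print(max_wpm(600, 200))   # 15.0 wpm
print(max_wpm(600, 100))   # ~17.1 wpm
print(max_wpm(600, 0))     # 20.0 wpm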


Figure 4.6: Boxplots for the maximum text entry rates for each session and interface. Overall means are shown with a star symbol; outliers are represented by a black diamond.

Clean Text Entry Rate

Users can accidentally enter long words with very short gestures using EyeSwipe. For example, if the user selects the letter Y with reverse crossing twice (to start and immediately end a gaze path), one of the candidates will likely be the word "yesterday", because there are not many words starting and ending with Y. If the user enters the word "yesterday" by mistake with this procedure, the final text entry rate will be higher, but it will not represent well the entry rate for producing the text the user actually intended. For this reason we computed the clean text entry rate, to observe the impact of errors in the sentences. The clean text entry rate only considers sentences with no uncorrected errors. The mean and maximum clean text entry rates are shown in Figures 4.7 and 4.8.

We did not find a significant difference between the mean clean entry rates and the entry rates considering all sentences (EyeSwipe 1: F(1, 9) = 2.084, p > 0.05; dwell-keyboard: F(1, 9) = 2.253, p > 0.05). The maximum clean entry rates for EyeSwipe 1 are also not significantly different from the entry rates considering all sentences (F(1, 9) = 2.904, p > 0.05). The maximum clean entry rates for the dwell-keyboard were significantly lower than the maximum entry rates considering all sentences (F(1, 9) = 9.431, p < 0.05).

The overall maximum entry rate obtained for the dwell-keyboard remained the same (14.779 wpm). However, the maximum entry rate for EyeSwipe 1 was initially 21.878 wpm, achieved by Participant A09 with the sentence "prescribed presidents drive expensive cars."; the extra word at the beginning of the sentence inflated the entry rate. The overall clean maximum entry rate was achieved by Participant A02 (20.839 wpm in the sentence "protect your environment"). The maximum entry rate with dwell-time (14.779 wpm) was achieved by Participant A01 in the sentence "our life expectancy has increased."


Figure 4.7: Boxplots for the mean clean text entry rates for each session and interface. Overall means are shown with a star symbol; outliers are represented by a black diamond.


Gesture Entry Rate

The necessity of dwelling on each individual key compromises the text entry rate on dwell-based keyboards. EyeSwipe 1 counteracts this disadvantage by requiring that only the first and last letters be selected explicitly. Also, the first and last selections are dwell-free, because of reverse crossing, so the user can perform them as quickly or as slowly as they want.

The average duration of the reverse crossing gesture was 1,058.072 ms. The duration decreased significantly (F(1, 9) = 10.925, p < 0.01) over the sessions, starting at 1,119.582 ms in the first session and ending at 1,019.982 ms in the last session.

A direct consequence of this approach is that longer words can be entered at higher entry rates with EyeSwipe 1 than by using dwell-time. We measured the gesture entry rate in characters per minute (cpm) from the start of the gesture until the final word candidate is selected for that gesture. In EyeSwipe 1 the final word candidate may be the one selected in the action button, if it is the first candidate, or the word candidate that replaced it afterwards. For the dwell-keyboard, each word (delimited by spaces) is considered a single gesture for comparison. The duration of a dwell-keyboard word is measured from the selection of its first key until its last letter is entered. The gesture entry rate by word length is shown in Figure 4.9. The dashed line indicates the average entered word length (4.458 characters).

EyeSwipe 1 was found to be faster than the dwell-keyboard for words longer than 3 characters, and the difference grows with the length of the word. Words of length 1 were expected to be slower with EyeSwipe 1, because entering them requires two reverse-crossing gestures: one to indicate the first letter and the other to indicate the last letter, which, in this case, are the same.


Figure 4.8: Boxplots for the maximum clean text entry rates for each session and interface. Overall means are shown with a star symbol; outliers are represented by a black diamond.

The average reverse crossing duration of approximately 1,000 ms explains why EyeSwipe 1 surpasses the 600-ms dwell-keyboard for words longer than two characters: three characters require at least 1,800 ms plus overhead with dwell-time, which is similar to the roughly 2,000 ms of reverse crossings plus the gesture duration with EyeSwipe 1.

A within-subjects repeated-measures two-way ANOVA revealed significant main effects of the two factors, interface (F(1, 9) = 94.706, p < 0.001) and session (F(1, 9) = 282.598, p < 0.001), and of the interaction (F(1, 9) = 184.938, p < 0.001). This result is consistent with the results for the maximum entry rates. The highest entry rate was achieved with the sentence "protect your environment." The average word length in this sentence is 7.333 characters, which is almost 50% higher than the average English word length of approximately 5 characters.

The percentage of the gaze path duration used for the start and end reverse crossings is shown in Figure 4.10. The reverse crossing duration rate decreases as the word length increases. Thus the duration of the reverse crossings has less impact on the gesture entry rate of longer words. The average reverse crossing duration rate was 59.412%, meaning that almost 60% of the average gaze gesture was spent performing reverse crossings.

4.6.2 Accuracy

Error Rate

A low mean MSD error rate, below 2 for sessions 2, 3, and 4, was observed for both interfaces (Figure 4.11). This indicates that participants were careful when entering the phrases with both interfaces.


Figure 4.9: Mean and standard deviation (error bars) of the text entry rate in characters per minute (cpm) per entered word length. The dashed line shows the mean length of the expected words.

Figure 4.10: Mean and standard deviation (error bars) of the ratio of reverse crossing duration to gaze path duration per word length. The dashed line shows the mean length of the expected words.

Correction Rate

The correction rate is the ratio between deleted words and total entered words. EyeSwipe does not use the concept of keystrokes, so it does not make sense to report keystrokes per character (KSPC) results. One of the factors that increase the KSPC in keystroke-based text entry methods is a high deletion rate. The correction rate is reported as an alternative metric to show how many deletions were performed.

The average correction rate for EyeSwipe 1 was 10.23%. An improvement in the correction rate was observed over the sessions (F(1, 9) = 6.27, p < 0.001), starting at 12.55% in the first session and ending at 7.75% in the fourth session. This result suggests that increasing familiarity with EyeSwipe 1 reduces instances of deletion.

Impact of Filtering the Lexicon

The explicit selection of the first and last letters imposes hard constraints on the word candidates.


Figure 4.11: Boxplots for the mean error rates for each session and interface. Overall means are shown with a star symbol; outliers are represented by a black diamond.

By pruning candidates that do not meet these criteria before computing the scores, we improve both the computational cost and the accuracy of gesture classification. The computational cost is reduced because significantly fewer score computations are performed.

The top-n accuracy is the rate at which the correct word was among the first n candidates. We computed the top-n accuracy for the gesture classification with the whole lexicon and with the pruned lexicon (which included only the words starting and ending with the selected letters). This result includes only the gestures for which both the first and last letters were selected correctly. The results are shown in Figure 4.12.
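
A minimal sketch of the top-n accuracy computation, with hypothetical data; the names are illustrative.

# Sketch of the top-n accuracy: the fraction of gestures for which the intended
# word appears among the first n ranked candidates.
def top_n_accuracy(results, n):
    """results: list of (intended_word, ranked_candidates) pairs."""
    hits = sum(1 for intended, candidates in results if intended in candidates[:n])
    return hits / len(results)

results = [("gaze", ["gaze", "gave", "gate"]),
           ("case", ["cause", "case", "cease"]),
           ("word", ["ward", "wood", "world"])]
print(top_n_accuracy(results, 1))   # 1 of 3 gestures
print(top_n_accuracy(results, 2))   # 2 of 3 gestures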

Figure 4.12: Top-n accuracy for the gesture classification with the (blue) full lexicon and the (orange) filtered lexicon.

For all tested values of n (1 through 7), we found that the top-n accuracy was always at least 15 percentage points higher for the filtered lexicon than for the full lexicon.


This result may partially explain the poor performance reported for the shape-based method compared to Filteryedping in the study by Pedrosa et al. [43]. The shape-based method tested by the authors used the whole lexicon; under the same condition, EyeSwipe 1 would likely achieve a similarly poor performance.

4.6.3 Subjective Feedback

At the end of the experiment, participants answered a questionnaire about their subjective feedback. Participants indicated their perception of performance, learnability, and user experience on a 7-point Likert scale.

The answers to each question are summarized in Figure 4.13. The figure has three columns: the leftmost column shows the question (e.g., speed), the middle column shows the interpretation of the two extreme points of the Likert scale, and the rightmost column shows the percentage of participants who selected each point of the scale for that question. For example, regarding accuracy, 20% of the participants gave EyeSwipe 1 a score of 4, 50% a score of 5, and 30% a score of 6 on a scale from 1 to 7 (from very inaccurate to very accurate). For the same question, 20% of the participants gave the dwell-time keyboard a score of 5, 60% a score of 6, and 20% a score of 7.

Figure 4.13: Participants' perception of performance, learnability, and user experience on a 7-point Likert scale.

The mean scores for each question are shown in Figure 4.14. Each circle in the radar chart represents one point on the 7-point Likert scale.


Figure 4.14: Average responses for all participants to the 7-point Likert scale questions.

We conducted Wilcoxon signed-rank tests for each question independently. EyeSwipe 1 was considered faster (Z = 0.0, p < 0.01), with better overall performance (Z = 0.0, p < 0.05), and was better liked (Z = 2.5, p < 0.05) by the participants. Dwell-time was considered easier to learn (Z = 2.0, p < 0.01) and more accurate (Z = 3.0, p < 0.05) than EyeSwipe 1. The differences for eye fatigue (Z = 18.0, p > 0.05), neck fatigue (Z = 5.0, p > 0.05), and comfort (Z = 13.5, p > 0.05) were not statistically significant.

The participants were also asked to comment on the two interfaces.

Feedback on Dwell-Keyboard

Participants in general considered the dwell-keyboard (Participant A05) "intuitive and simple", and (Participant A06) "easy to learn." Participant A03 reported that she felt "more in control" of the interface. A possible explanation was given by Participant A07: "I was assured of its accuracy by the clicking and I did not feel the need to check what I had typed to see if it was correct."

Most participants considered the dwell-keyboard accurate. However, some participants reported having problems with the accuracy of the eye tracker: (Participant A02) "the eye tracker had the most trouble detecting the edge letters (a and s)."

Participant A01 added that compared to EyeSwipe 1, the dwell-keyboard is "a lot easier for shorter words" and "never very frustrating." However, regarding longer words some participants added: (Participant A02) "it was much more tedious to type longer words using this method" and (Participant A03) "it is more time consuming especially if you want to type a longer word."

Also regarding speed, participants said: (Participant A06) "it is very slow. Often I would move my eyes to the next letter slightly too fast and then have to go back." (Participant A04) "Took a lot longer, but [had] better accuracy. Using space bar is time consuming." (Participant A07) "It was painfully slow and I felt easy to get distracted or zone out while I was spelling each word. My brain was thinking faster than it was clicking, which sometimes led to mistakes." (Participant A09) "This typing method is very accurate but it takes a lot of time to write each individual words."


Adjustable dwell-times, spell checking, and word prediction were suggested by some participants as possible solutions to some of the accuracy and speed issues.

Feedback on EyeSwipe 1

As indicated in the comments about the dwell-keyboard, EyeSwipe 1 was considered overall faster: (Participant A03) "it is much faster than the [dwell-keyboard] because you are not selecting each letter individually." Participant A08 considered EyeSwipe 1 "faster and more natural."

Most participants, however, indicated the need for practice before being able to use EyeSwipe 1: (Participant A06) "[EyeSwipe 1] takes a lot of practice to get used to. I think it would be overall faster to someone who got used to it." (Participant A07) "At first this method was difficult to get used to, but once I did, I was pretty accurate and fast at typing." (Participant A09) "The gaze-gesture typing is extremely fast and accurate but it has a high learning curve." (Participant A01) "I can see this method being the preferred method, especially after long periods of use."

Participant A07 added that it was not only faster, but also that "even if [she] missed a few of the letters, the suggestions would come up."

The lower entry rates for shorter words were also noticed by some participants: (Participant A01) "very annoying to type small words, particularly one-letter words. The improved speed for longer words is very noticeable though, and it probably made up for all the time lost when mistyping small words." (Participant A02) "Short words were more of a hassle to type, but were still manageable. It was much easier to type long words using this method."

Participant A07 reported that the gaze estimation error affected the interaction when selecting the last letter of the word: "once I was finished typing a word and I looked up at the [action button] to select it, it often thought I was looking at the letter above it to add to the word instead of finish typing the word."

Participant A03 added that "it is frustrating when you do not get the word you wanted. It just took time to get used to how it works and once I did the frustration diminished." She also said that "unless you master [EyeSwipe 1] the comfort level is low." The frequent reverse crossing gestures also caused discomfort to Participant A06: "the main problem is that I feel like I should be moving my body and that causes more neck fatigue. The looking and up and down so frequently is hard to get used to and can be confusing with many short words in a row." Participant A04 had a different perception, saying that EyeSwipe 1 caused "less eye fatigue."

Only Participant A06 made a comment about the visual feedback for the gaze path on EyeSwipe 1: "the blue dot was helpful in some ways but harmful when I was trying to end a word because I would accidentally look at the dot rather than the letter and so it would take longer for me to focus and end a word."

Participants suggested allowing the user to enter short words more easily. One example is entering articles along with the following word, e.g.: (Participant A01) "user selects A, then looks at the letters F, O, and X. The program then separates that out as 'A fox'."

4.7 Discussion

Experimental results suggest that users can enter words approximately 20% faster with EyeSwipe 1 than with the keyboard with a fixed 600-ms dwell-time.


Some participants were able to reach peak entry rates of approximately 20 wpm using EyeSwipe 1, compared to approximately 15 wpm with the dwell-keyboard. In the subjective questionnaire, participants also indicated that they perceived EyeSwipe 1 as faster than the dwell-keyboard.

The difference in text entry rate between the methods can be partially explained by the observation that the gesture entry rate in cpm for EyeSwipe 1 increases with the word length, while that of the dwell-keyboard remains relatively constant. EyeSwipe 1 is faster on average for words longer than two characters. Considering the average English word length of approximately 5 characters, EyeSwipe 1 is therefore expected to be faster than the fixed dwell-keyboard.

Participants maintained a low MSD error rate (below 2 for sessions 2–4) with both interfaces. With EyeSwipe 1, the correction rate improved from 12.55% to 7.75% in the last session, indicating that increasing familiarity with EyeSwipe 1 reduces instances of deletion.

When learning how to use EyeSwipe 1, it was not clear to some participants how other candidate words could be entered. The experimenter had to explain that even if the action button to select the end of a gaze path was showing the wrong word, they should select it and then search for other candidates.

The first and last letters selected by reverse crossing can be used to prune words from the lexicon. Filtering the lexicon prior to gesture classification produced top-n results at least 15 percentage points higher than using the whole lexicon.

In the subjective questionnaire, participants considered EyeSwipe 1 faster, with better overall performance, and liked it more, while the dwell-keyboard was considered easier to learn and more accurate.

The main criticism of the dwell-keyboard was its speed. This could have been alleviated by using an adjustable dwell-time. In the longitudinal study by Majaranta et al. [34] with adjustable dwell-time, participants achieved approximately 20 wpm. Though an experiment directly comparing the two methods should be performed, experimental results indicate that the maximum entry rates achieved with EyeSwipe 1 may be comparable to those using adjustable dwell-time.

EyeSwipe 1 was considered faster by the participants, especially for longer words. However, most participants indicated that it requires practice. Some participants had difficulties selecting the action buttons with reverse crossing, mainly because of gaze estimation errors. For future iterations, the size of the buttons should be automatically adjusted by estimating the magnitude of the error, as proposed by Feit et al. [12]. The frequent reverse crossings were also mentioned as possible causes of fatigue and confusion.

Overall, participants did not like entering short words with EyeSwipe 1. For short words, the reverse crossings to start and end the gaze path occur temporally very close to each other and are responsible for 91.844% of 1-character word gesture durations and 75.335% of 2-character word gesture durations.

Some participants had difficulties selecting the buttons to perform the reverse crossing gesture because of gaze estimation errors. Though EyeSwipe 1 does not require the characters in the middle of the word to be selected accurately, the estimated gaze position must be accurate enough to select the first and last characters. Participants struggled to select some keys when the estimation error was larger in certain locations, such as the corners of the window.


Chapter 5

EyeSwipe 2

We propose EyeSwipe 2 based on the lessons learned from developing and evaluating EyeSwipe 1. The design process of EyeSwipe 2 was initiated by the decision to remove the pop-up action buttons. We extended reverse crossing to be performed between two regions instead of buttons. We later recognized similarities between this mechanism and Context Switching (Section 5.2.2). EyeSwipe 2 has three regions with different purposes. The user emits commands to the interface by switching their focus between the regions.

5.1 Improvement Challenges from EyeSwipe 1

5.1.1 Reverse Crossing and Pop-Up Action Buttons

Using reverse crossing to explicitly indicate both the beginning and the end of the gesture has an impact on text entry rate and user experience. As shown in Figure 4.10, the time to perform both reverse crossings represents approximately 60% of an average gaze gesture. An expert user may not need to plan the gaze gesture for each word. As soon as they finish entering a word, they might move their gaze directly to the first letter of the next word. However, instead of being able to immediately start the next gesture, the user will need to perform a reverse crossing to start it. For this reason, when entering a phrase, the reverse crossings will likely limit the entry rate of expert users. In such cases it would be faster to use the same action to finish the previous gesture and start the new one.

With EyeSwipe 1 the number of reverse crossings performed by the user is at least twice the number of entered words (some gaze gestures may be canceled, for instance, so the proportion can be higher). The frequent reverse crossings were perceived as annoying or confusing by some participants. Another problem emerges when the first word candidate, presented on the action button, is not the one the user wants to enter. In such cases, the user must select the wrong word to finish the gesture, search for the desired word in the candidate list and, if present, select it. If the desired word is not among the candidates, the user must delete the previous word. The concept of selecting the wrong word and then replacing it was not clear to some participants in the beginning, and sometimes it was necessary to explain it more than once during the training session.

The pop-up action buttons also caused a technical problem that may have an impact on the user experience. The action button often partially occludes keys on the keyboard when it appears (Figure 5.1). The occlusion caused two main problems.


First, in some cases, due to gaze estimation error, the key below the action button would be selected instead of the action button. In these cases, as reported by a participant, the user would be required to look at the correct key again and restart the process. We also observed the opposite problem, though it was not reported by any participant: the action button would appear over the letter the user wanted to select. This happened, for instance, when a saccade overshot to a letter in the bottom row of the keyboard, such as Z, when the user actually wanted to select a key in the middle row, such as S. In some of these cases, because of the estimation error, gaze would be estimated over the action button instead of the key below it, and the user was not able to select the key.

Figure 5.1: The pop-up action buttons in EyeSwipe 1 may overlap some keys. The user may not be able to select a key partially occluded by the action button, such as "A" or "S", if that was their actual intention, depending on the gaze estimation error.

Also, the action button is a dynamic GDO. It may be distracting if the action button appears in the wrong place or at the wrong time (e.g., in the middle of the gesture).

The use of reverse crossing with pop-up action buttons enabled explicit delimitation of the gaze gesture, while also providing deterministic selection of the first and last letters of the word. However, considering the feedback from some of the participants, we decided to investigate an alternative to reverse crossing with the pop-up action buttons that does not depend on selecting the wrong word before selecting the correct candidate.

5.1.2 Gaze Estimation Errors

The use of gestures helps to partially handle gaze estimation errors. If the user fixates a letter but the eye tracker estimates their gaze on a neighboring key, the desired word may still be selected among the top-scoring candidates, unless this happens on either the first or the last letter of the word. In that case the desired word will not be in the candidate list, because it will have been filtered out of the possible candidates. The shape classification could be made more flexible by also considering neighboring keys for the first and last letters. However, the resulting interaction would possibly feel inconsistent: it might give the impression that the interface is ignoring the user's command if they explicitly selected the first and last letters of the word and the interface displayed candidates starting or ending with different characters.

Throughout the interaction, drifts in gaze estimation may occur. Possible causes include relative movement between the eye tracker and the user, and fatigue. For this reason, eye tracker users are often required to recalibrate after short periods of use. Also, the initial calibration may not be accurate, yet still sufficient for the interaction.


It may be frustrating for the user to look at an element of the interface and realize that the interface is not responding accordingly. More experienced users may identify this as a calibration problem and then recalibrate the eye tracker. However, even some of these users may not be able to start the calibration process by themselves if gaze is their only input method. In any case, it would be better if the system could prevent the increase in estimation error in the first place.

While the user is interacting with the interface, the system collects hundreds of gaze samples every minute. Depending on the task and the context, it may be reasonable to assume that the interface knows where the user is looking. For example, if the user is selecting a key by dwell-time, the center of the key may be a good estimate of the gaze position. This information is available to the interface and could be used to dynamically adjust the gaze estimation, compensating for drifts and improving the initial calibration, which considered just a few points.

Huang et al. [18] proposed using interaction cues, such as mouse clicks or key presses, to automatically calibrate a webcam-based eye tracker. Khamis et al. [25] used smooth pursuit data to correct the estimated gaze position while the user is reading text moving on the interface. However, it is not clear how gaze gesture data can be used to dynamically adjust the calibration. Users may adopt different strategies to perform the gestures. For example, when entering the word "case", the user may (1) look at the center of each key, or (2) look at C, then between A and S, and finally at E, or use some different strategy (Figure 5.2).

Figure 5.2: It is not clear in which cases it is safe to assume the user was looking at the center of the characters while performing a gesture. Two example strategies are shown: [red] the user fixated at the center of each letter in the word "case"; [blue] the user fixated approximately at "C", then in between "A" and "S", and finally close to "E".

5.2 Interface Description

The interface of EyeSwipe 2 (Figure 5.3) is composed of three main regions: text, action, and gesture. The text region contains the entered text and a backspace key. The action region is composed of a row of action buttons. The contents of the action buttons depend on the user's gaze behavior in the previous region. Examples of contents are candidate words to be entered, punctuation marks, or actions such as deleting or replacing a word. The gesture region is where the gaze gesture is performed and contains all the characters: letters and the punctuation key.

The letters on the keyboard are displayed according to the QWERTY layout with no visual border between them. This design decision was made to convey the idea of freedom of movement between letters.


Figure 5.3: The interface of EyeSwipe 2 is composed of three regions (from top to bottom): text, action, and gesture. The action buttons (in green) in the action region depend on the actions of the user in the previously focused region. A word can be entered by completing the following actions: (1) Look at the vicinity of the first letter of the desired word to indicate the beginning. (2) Glance through the intermediate letters. (3) When the user switches the focus from the gesture region to the action region, the word candidates are shown. (4) When the user moves from the action region to either the text or the gesture region, the word is entered. (5) The user can change or delete the word by fixating on the backspace key in the text region, looking at the desired word/delete word button, and moving to either the text or the gesture region. (6) The user can also select "Cancel" to ignore the current action buttons.


Figure 5.4: Commands emitted in each region transition. The user's selection in the previous region is shown in brackets, followed by the emitted command. For example, if the user transitions from the gesture region after performing a gaze gesture, the command to display word candidates is emitted.

Showing the letters with borders around them is related to the paradigm of pressing keys; EyeSwipe 1 still used the key-press paradigm for the selection of the limits of the gaze gestures.

The user emits commands to the interface by switching the focus from one region to another. The commands depend on the origin and destination regions and on the user's gaze behavior in the origin region (Figure 5.4). When the user moves from the gesture region to the action region, the interface shows candidates to be entered in the action buttons. If the user focused on the punctuation key before going to the action region, the action buttons will contain punctuation marks. Otherwise, candidate words based on the user's gaze path in the gesture region are displayed on the action buttons. When the user moves from the text region to the action region and the user last focused on the backspace key in the text region, the action buttons show candidates to replace the last word or the option to delete it (always displayed in the rightmost action button). Finally, when the user moves from the action region to either the text region or the gesture region, the interface performs the action in the last focused action button. The action region also has two cancel buttons, one at each end, so if the last focused button is either of them when the user leaves the action region, no action is performed. The user's gaze behavior in the region is ignored in all other region switches and nothing happens.

The action region presents 6 action buttons horizontally. When the user focuses on the action region, the first fixation on the region is stored as a reference point. The candidate words are displayed in the action buttons according to their position in the candidate list. The action buttons closest to the reference point are assigned the word candidates with higher scores. This choice was made considering two possible scenarios. In the first, the user always looks at the leftmost action button to start searching for the candidate word. In the second, the user makes a vertical saccade from the last letter to the action region; in this case, depending on the last letter, the first fixation on the action region will occur in different locations. We decided not to favor either strategy. Instead, the most probable candidates are always displayed close to the user's first fixation on the action region. The most probable candidate is displayed on the action button closest to the user's first fixation, with the second most probable to its right and the third most probable to its left, and so forth. Consequently, if the user looks at the leftmost action button, the candidates are displayed from most probable to least probable from left to right. Once the candidates are assigned to an action button, they remain in the same position until the user's gaze leaves the action region.
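
A minimal sketch of this assignment rule; the function name and candidate words are illustrative, not the actual EyeSwipe 2 implementation.

# Sketch of how the ranked candidates could be laid out on the 6 action buttons:
# the button nearest the user's first fixation gets the top candidate, and the
# remaining candidates alternate right/left around it.
def assign_candidates(candidates, reference_index, n_buttons=6):
    order = [reference_index]
    for offset in range(1, n_buttons):
        for idx in (reference_index + offset, reference_index - offset):
            if 0 <= idx < n_buttons and idx not in order:
                order.append(idx)
    buttons = [None] * n_buttons
    for word, idx in zip(candidates, order):
        buttons[idx] = word
    return buttons

candidates = ["gaze", "gate", "grade", "fate", "gas", "grader"]   # most to least probable
print(assign_candidates(candidates, reference_index=0))
# ['gaze', 'gate', 'grade', 'fate', 'gas', 'grader']
print(assign_candidates(candidates, reference_index=3))
# ['grader', 'gas', 'grade', 'gaze', 'gate', 'fate']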

EyeSwipe 2 provides two ways to select the beginning of the gaze gesture in the gesture region. The letters within a certain distance of the first fixation in the gesture region are used as first-letter candidates.


In this case, all gaze data in the gesture region is used as the gaze gesture. Alternatively, whenever the user holds a fixation for longer than a hidden dwell-time (currently set to 700 ms), the gaze gesture is restarted at that fixation, and new first-letter candidates are selected based on the distance to the new reference fixation point. In both cases a short "click" sound is emitted and the first-letter candidates are highlighted in a different color for a brief moment (approximately 500 ms). The first-letter candidates may include more than one letter; in these cases all of them are highlighted.

Note that, from the user's perspective, there is no difference between the two cases. The instruction in both cases is to start the gesture once the first letter of the word is highlighted. In pilot experiments we explained the two cases separately and participants were often confused about the difference between them. Once we started using the unified explanation, participants could use both starting mechanisms seamlessly. We also added to the explanation that whenever they saw letters being highlighted, it meant the gaze gesture had restarted; if that was not their intention, they should look at the first letter and restart the process.

Last letter candidates are determined based on the distance to the last fixation on the gesture region before switching to the action region. The gaze gesture is then used for gesture classification to produce the word candidate list.

Because of the hidden dwell-time, the user cannot stop for too long while performing a gaze gesture, otherwise it will be restarted. When the user is not performing a gesture, they can freely look at any part of the gesture region. Thus, EyeSwipe 2 is not a dwell-free method, but may be used as a dwell-free method by expert users if their first fixation on the gesture region selects the desired first character.

5.2.1 Feedback

EyeSwipe 2 gives feedback on the region currently under focus by showing a thicker black border around it and slightly changing its background color. Focus on the buttons is also indicated by darkening their background color.

Inspired by AugKey [9], the action buttons always display the last entered word in their upper region with a smaller font size and color. If the user is selecting a word or punctuation mark to be entered, they can see what the previous word was without moving their gaze to a different location. If the user is replacing or deleting a word, the second to last word is displayed in the upper region of the action buttons. The same word is displayed in all action buttons, so the user has quick access to that contextual information, independently of the button they are currently looking at. The word "keep" is shown as an example in Figure 5.3 in all action buttons while the user is entering the word "this".

The blue path used for visual feedback in EyeSwipe 1 distracted some users and caused them to slow down as they tried to make the dot reach each individual letter. In EyeSwipe 2 we opted for more subtle feedback, not giving specific feedback on the current gaze path. The user gets feedback on the first letter candidates (highlighted) and can tell that the interface is detecting their gaze inside the gesture region by the thicker black border displayed around the currently focused region.

5.2.2 Context Switching

The interaction in EyeSwipe 2 was initially proposed as an extension to reverse crossing. With reverse crossing, the user emits a command related to their current target by switching to the action button that represents the action they want the interface to perform, and then looking back at the target. In EyeSwipe 2 the buttons involved in reverse crossing were replaced by whole regions, but the mechanism remains analogous: the user emits a command related to the current region by switching to the action button in the action region that represents the action they want the interface to perform, and then looking back at another region. Regions are more robust to gaze than single buttons because of their size.

This extension to reverse crossing is similar to Context Switching. With Context Switching the user also emits commands by switching contexts. However, the contexts present duplicated content. Thus the interaction proposed for EyeSwipe 2 can also be interpreted as an extension to Context Switching. Instead of duplicated content, EyeSwipe 2 presents unique content in each context (region). Similarly to Context Switching, this extension also uses bridges between the regions to avoid accidental selections and make the selection more robust to noise in gaze estimation. We refer to the selection mechanism in EyeSwipe 2 as Reverse Region Crossing.

Showing the action buttons in a separate region, instead of around the current letter, solves the problem of occlusion. Also, as the action region is always visible instead of popping up, the GDO for displaying the actions is no longer dynamic. The contents still change based on the previous region and user's actions, but the location of the action region is now static.

5.3 Gesture Classification

EyeSwipe 2 classifies gaze paths into word candidates with a similar approach to EyeSwipe 1. The gaze path is compared to the ideal path of the words and the result is combined with the probability of the word according to a word unigram model. EyeSwipe 2 uses a score based on Bayes' theorem and the Discrete Fréchet distance [14].

5.3.1 Discrete Fréchet Distance

The Fréchet distance [14] is a measure of the distance between two curves that respects the temporal sequence of the points. Intuitively, the problem of finding the Fréchet distance is that of a man traversing one of the curves while walking his dog, which traverses the other curve on a leash. Both the man and the dog can vary their speeds, but only walk forward. The Fréchet distance is the length of the shortest leash such that both the man and the dog can traverse their separate paths. The Fréchet distance compares two continuous curves instead of sequences of samples from the curves. For this reason, using the Fréchet distance to compare the gaze gesture to the ideal path seemed promising, as the difference between the number of points in each path would not affect the result.

To compute the Fréchet distance the curves are typically approximated by polygonal curves. A polygonal curve is a curve P : [0, n] → V, where n is a positive integer and P(i + λ) = (1 − λ)P(i) + λP(i + 1) for λ ∈ [0, 1] and all i ∈ {0, ..., n − 1}. Alt and Godau [1] introduced an algorithm that computes the Fréchet distance between two polygonal curves P : [0, p] → V and Q : [0, q] → V in O(pq log₂(pq)), where p and q are the number of segments in P and Q. Eiter and Mannila [11] proposed a discrete variation of the Fréchet distance that can be computed in O(pq). They also show that the discrete Fréchet distance is an upper bound to the Fréchet distance. The algorithm proposed by Eiter and Mannila is presented in Algorithm 3.


Algorithm 3 Discrete Fréchet distance proposed by Eiter and Mannila [11]

Precondition: P = (u1, ..., up) and Q = (v1, ..., vq) are polygonal curves

 1: function DFD(P, Q)
 2:   ca ← array with dimensions [1..p, 1..q] initialized with −1
 3:   function C(i, j)
 4:     if ca(i, j) > −1 then return ca(i, j)
 5:     else if i = 1 and j = 1 then
 6:       ca(i, j) ← distance(u1, v1)
 7:     else if i > 1 and j = 1 then
 8:       ca(i, j) ← max(C(i − 1, 1), distance(ui, v1))
 9:     else if i = 1 and j > 1 then
10:       ca(i, j) ← max(C(1, j − 1), distance(u1, vj))
11:     else if i > 1 and j > 1 then
12:       ca(i, j) ← max(min(C(i − 1, j), C(i − 1, j − 1), C(i, j − 1)), distance(ui, vj))
13:     else
14:       ca(i, j) ← ∞
15:     return ca(i, j)
16:   return C(p, q)

To compare the gaze gesture and the ideal path of a word, the two paths are subsampled. For every pair of consecutive points (a, b) in the sequence, the preprocessing step generates points between a and b with a fixed step size. After preprocessing, the number of points in each path has the same order of magnitude.
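A possible implementation of this preprocessing step, assuming paths as lists of (x, y) tuples and a fixed step size in pixels (the 10-pixel default is an arbitrary example):

from math import dist

def subsample(path, step=10.0):
    """Insert equally spaced intermediate points between every pair of
    consecutive points, so that two paths end up with comparable sampling."""
    if len(path) < 2:
        return list(path)
    out = []
    for a, b in zip(path, path[1:]):
        out.append(a)
        d = dist(a, b)
        for k in range(1, int(d // step) + 1):
            t = k * step / d
            if t < 1.0:
                out.append((a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1])))
    out.append(path[-1])
    return out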

To compute the Fréchet distance, EyeSwipe 2 follows a method similar to Algorithm 2 on the prefix tree. A difference between the two methods is that the discrete Fréchet distance requires the paths (both the gaze and ideal paths) to be subsampled. The prefix tree is traversed applying the update function shown in Algorithm 4 for each subsampled gaze position.

The complexity of the update method for the Fréchet distance (Algorithm 4) is O(SM), where S is the number of subsamples from the gaze path and M is the number of subsampled points in the prefix tree (each node contains points subsampled between its parent's center and its own center).

5.3.2 Word Probability

Similarly to EyeSwipe 1, EyeSwipe 2 also uses information from two sources to classify the gesture into words: the shape similarity and a language model. These two sources are, however, combined differently. EyeSwipe 1 used a score based on a weighted sum of the gesture score with the word probability. EyeSwipe 2 calculates a probability for each word using Bayes' theorem. Let W and G be random variables that represent, respectively, the word entered by the user and the shape of the gaze gesture. We want to find the word wi in the lexicon L that maximizes Pr(W = wi | G = g), where g is the gesture made by the user. Applying Bayes' theorem, we have:

Pr(W = wi | G = g) = [Pr(G = g | W = wi) · Pr(W = wi)] / Pr(G = g)    (5.1)

As Pr(G = g) is independent of wi, we can search for the wi that maximizes Pr(G = g | W = wi) · Pr(W = wi), so Pr(G = g) does not need to be calculated. The word probability Pr(W = wi) is calculated in the same way as in EyeSwipe 1:


Algorithm 4 Discrete Fréchet distance update function

Precondition: Current is a node from a prefix tree, gaze is the new gaze sample
Precondition: Node.Samples is an array of points subsampled between Node and its parent
Precondition: Node.K is the size of Node.Samples
Precondition: Node.DFD is an array of size Node.K initialized with ∞ in the first iteration for all nodes
Precondition: Node.Previous-DFD is initialized with ∞ in the first iteration for all nodes
Precondition: Root.Previous-DFD is initialized with 0 in the first iteration

 1: function DFD-Update(Current, gaze)
 2:   Previous_end ← Current.DFD[end]            ▷ end = last position of the array
 3:   P0 ← Current.Parent.Previous-DFD
 4:   P1 ← Current.Parent.DFD[end]
 5:   for i ← 1 to Current.K do
 6:     D ← distance(Current.Samples[i], gaze)
 7:     Previous_i ← Current.DFD[i]
 8:     Current.DFD[i] ← max(min(P0, P1, Previous_i), D)
 9:     P0 ← Previous_i
10:     P1 ← Current.DFD[i]
11:   Current.Previous-DFD ← Previous_end
12:   return Current.DFD[end]

Pr(W = wi) = OCC(wi, C) / Σ_{wj ∈ L} OCC(wj, C)    (5.2)

where OCC(v, C) is the number of occurrences of the word v in a text corpus C. The probability Pr(G = g | W = wi) is based on the gesture score S(g, wi):

S(g, wi) = |IP(wi)| / (1 + DFD(g, IP(wi))^k)    (5.3)

where IP(wi) is the ideal path of the word wi, DFD(a, b) is the discrete Fréchet distance between a and b, | · | is the length operator, and k is a positive constant. The gesture score is directly proportional to |IP(wi)| to favor longer words and inversely proportional to the distance DFD(g, IP(wi)). A scaling constant k is used to adjust the distribution of the distances. The probability Pr(G = g | W = wi) is defined as:

Pr(G = g | W = wi) = S(g, wi) / Σ_{wj ∈ L} S(g, wj)    (5.4)
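Putting Equations 5.2–5.4 together, the sketch below ranks lexicon words by the resulting posterior. It reuses discrete_frechet from the earlier sketch, treats |IP(wi)| as the polygonal length of the ideal path, and assumes the constant k enters as an exponent on the Fréchet distance, as in Equation 5.3; the data structures and function names are illustrative.

from math import dist

def path_length(path):
    """Polygonal length of a path (the |IP(w)| term)."""
    return sum(dist(a, b) for a, b in zip(path, path[1:]))

def rank_candidates(gesture, lexicon, occurrences, ideal_paths, k=6.6):
    """Sort lexicon words by Pr(W = w | G = g) ∝ Pr(G = g | W = w) Pr(W = w)."""
    total_occ = sum(occurrences[w] for w in lexicon)
    scores = {w: path_length(ideal_paths[w]) /
                 (1.0 + discrete_frechet(gesture, ideal_paths[w]) ** k)
              for w in lexicon}                        # Equation 5.3
    total_score = sum(scores.values())
    posterior = {w: (scores[w] / total_score) *        # Equation 5.4
                    (occurrences[w] / total_occ)       # Equation 5.2
                 for w in lexicon}
    return sorted(lexicon, key=posterior.get, reverse=True)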

Computing the Constant k

We used gesture data from the experiment comparing EyeSwipe 1 to the dwell-keyboard to compute k. The optimal value k* is the value ki that minimizes the following cost function:

C(ki) = Σ_{p ∈ P} [ max_{kj ∈ [0, 10]} top(kj, p) − top(ki, p) ],    (5.5)


where P is a set with all participants of the experiment and top(ki, p) is the percentage of gestures made by participant p with the correct word in the first position of the candidate list using k = ki. To compute top(ki, p), the candidate words are sorted according to Equation 5.4. The intuition behind minimizing Equation 5.5 is to search for the value ki that achieves values of top(ki, p) that are as high as possible for all participants.

Word occurrence data extracted from the Corpus of Contemporary American English (COCA), available at http://www.ngrams.info/, was used to compute the probability Pr(W = wi) (Equation 5.2).

As the amount of data is finite, the image set of top(ki, p) is discrete. We tested a range of values for ki in the interval [0, 10] with a step size of 0.1. Using the gestures from the first experiment, we computed the value k* = 6.6. Figures 5.5 and 5.6 show Equation 5.5 evaluated using data from the last session of participants A01 and A07.
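The grid search itself can be written compactly. In the sketch below, gestures_by_participant maps each participant to (gesture, target word) pairs and rank_fn(gesture, k) returns the candidate list sorted by Equation 5.4; both names are assumptions made for illustration.

def best_k(gestures_by_participant, rank_fn):
    """Grid search for k* (Equation 5.5) over [0, 10] with a 0.1 step."""
    k_values = [round(0.1 * i, 1) for i in range(101)]
    top = {}   # (k, participant) -> fraction of gestures with the target ranked first
    for p, gestures in gestures_by_participant.items():
        for k in k_values:
            hits = sum(rank_fn(gesture, k)[0] == target for gesture, target in gestures)
            top[(k, p)] = hits / len(gestures)

    def cost(k):
        return sum(max(top[(kj, p)] for kj in k_values) - top[(k, p)]
                   for p in gestures_by_participant)

    return min(k_values, key=cost)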

Figure 5.5: Cost function (Equation 5.5) evaluated for ki in the interval [0, 10] using data from the last session of Participant A01.

5.4 Dynamic Calibration

As users enter words with EyeSwipe 2 they produce several gaze gestures. The majority of the words that have not been deleted are likely the ones the user intended to enter. Thus, after the user has entered a few words, the system has a dataset of pairs of gaze gestures and words. Such pairs of paths may be used to dynamically adjust the calibration.

5.4.1 Related Work

In Dasher's website [32], MacKay suggested using interaction data to dynamically calibrate the eye tracker. This feature has been implemented in Dasher version 3; unfortunately, details about the implementation are not documented. On the website, MacKay gives the example of a slight vertical error in gaze estimation. Assuming that the user is looking at the desired character from the beginning of the zooming (as the characters move to the left in Dasher's interface), they will still be able to select it despite the estimation error. Once the character is entered, the system computes an updated calibration based on the initial gaze position estimated by the eye tracker and the initial location of the character.

Figure 5.6: Cost function (Equation 5.5) evaluated for ki in the interval [0, 10] using data from the last session of Participant A07.

Khamis et al. [25] proposed an automatic calibration scheme called Read2Calibrate. Text is sequentially displayed following a predefined path. The user performs smooth pursuits to read the text while following it. Once the correlation between the user's eye movements and the text path is above a certain threshold, Read2Calibrate uses the data to calibrate the eye tracker.

5.4.2 Dynamic Calibration with Gaze Gestures

To use the data from the gaze gestures and words produced by the user, EyeSwipe 2 searches for pairs of points in the gaze path and in the word's ideal path. Given the gesture's fixation data and the word's ideal path, EyeSwipe 2 uses a modified DTW algorithm to find pairs of points from fixations to centers of letters (Algorithm 5).

The output of Algorithm 5 is a matching gaze point for each point in the ideal path of the word. These pairs of points are accumulated and then used to compute the adjustment function for subsequent gaze positions. Because the eye tracker used for the experiments does not provide access to lower level eye features, such as the location of the pupil center or the eye orientation, we opted to apply the adjustment function to the estimated gaze position. Alternatively, one could recompute the mapping function from the eye features to the gaze position using the new data.
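As a minimal illustration of such an adjustment function, the sketch below fits a constant offset from the accumulated (gaze point, ideal point) pairs; the thesis does not prescribe this particular form, and richer corrections (e.g. affine) could be fitted from the same pairs.

def offset_adjustment(matched_pairs):
    """Estimate a constant-offset correction from (gaze, ideal) point pairs
    and return a function that adjusts subsequent gaze estimates."""
    n = len(matched_pairs)
    dx = sum(ideal[0] - gaze[0] for gaze, ideal in matched_pairs) / n
    dy = sum(ideal[1] - gaze[1] for gaze, ideal in matched_pairs) / n
    return lambda gaze: (gaze[0] + dx, gaze[1] + dy)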

This dynamic calibration scheme assumes that the user was looking at the center of the letters while performing the gaze gesture. Though the hypothesis may not hold for all points, we were interested in investigating whether the differences would be averaged out when computing the adjustment. More robust adjustment methods should further investigate the nature of the gaze gestures: when is it more common to overshoot or undershoot a saccade? Under which conditions is it safe to assume the user was looking at the center of the letter?


Algorithm 5 Modified DTW that returns matched points

Precondition: P = (p1, ..., pn) and Q = (q1, ..., qm) are sequences of points

 1: function DTW-Matches(P, Q)
 2:   d ← array with dimensions [1..n+1, 1..m+1] initialized with ∞
 3:   d[1, 1] ← 0
 4:   path ← array with dimensions [1..n+1, 1..m+1] initialized with −1
 5:   for i ← 2 to n + 1 do
 6:     for j ← 2 to m + 1 do
 7:       options ← {d[i−1, j−1], d[i, j−1], d[i−1, j]}       ▷ Match, deletion, insertion
 8:       d[i, j] ← distance(pi−1, qj−1) + min(options)
 9:       path[i, j] ← argmin(options)
10:   r ← min(n, m)
11:   pairs ← array with dimension [1..r]
12:   i, j ← n + 1, m + 1
13:   while i > 1 and j > 1 do
14:     if path[i, j] = 1 then
15:       pairs[r] ← (pi−1, qj−1)
16:       r, i, j ← r − 1, i − 1, j − 1
17:     else if path[i, j] = 2 then
18:       j ← j − 1
19:     else if path[i, j] = 3 then
20:       i ← i − 1
21:   return pairs
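A Python rendering of Algorithm 5 is sketched below; it follows the pseudocode above with zero-based arrays and returns the matched pairs in path order.

from math import dist, inf

def dtw_matches(P, Q):
    """Modified DTW: returns the (P point, Q point) pairs that lie on the
    match steps of the optimal warping path between P and Q."""
    n, m = len(P), len(Q)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    step = [[-1] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [d[i - 1][j - 1], d[i][j - 1], d[i - 1][j]]  # match, deletion, insertion
            best = min(range(3), key=lambda o: options[o])
            d[i][j] = dist(P[i - 1], Q[j - 1]) + options[best]
            step[i][j] = best
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if step[i][j] == 0:            # match: record the pair, advance both
            pairs.append((P[i - 1], Q[j - 1]))
            i, j = i - 1, j - 1
        elif step[i][j] == 1:          # deletion: advance along Q only
            j -= 1
        else:                          # insertion: advance along P only
            i -= 1
    return list(reversed(pairs))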

5.5 EyeSwipe 2 Evaluation

We conducted an experiment to evaluate EyeSwipe 2 in terms of performance and user experience. We decided to compare EyeSwipe 2 to its predecessor, EyeSwipe 1, to measure the impact of the proposed enhancements.

As typing with the eyes is different from manual typing, even if the user is an expert at manual typing with the QWERTY layout it is not clear how, or whether, that experience transfers to eye typing. We were also interested in investigating how the expertise of the user in finding the letters with their gaze correlates with their ability to enter text with EyeSwipe and how it evolved throughout the experiment.

With this in mind we designed a two-step experiment. In the first step, the participant entered text using one of the interfaces. In the second step, the participant searched for letters on the QWERTY layout. The two steps were alternated during the whole experiment. The second step was used to collect both data about the participant's expertise in finding letters on the QWERTY layout and ground truth data to evaluate the dynamic calibration.

5.5.1 Participants

Eleven undergraduate students (4 females, 7 males; ages 19 to 28, average 21) without disabilities from Boston University were recruited to participate in the experiment. None of them participated in the previous experiment. All participants had no or little experience with eye tracking systems and were proficient in English.

About half the participants wore corrective lenses (3 with glasses, 2 with contact lenses). All participants were familiar with typing on a QWERTY keyboard.

5.5.2 Apparatus

A Tobii EyeX eye tracker was used to collect the gaze information. The keyboard was displayed on a full screen window on a 22-inch LCD monitor (1920 × 1080 pixels resolution) connected to a laptop (2.30 GHz CPU, 4 GB RAM) running Windows 7. The side length of the square keys (e.g. character keys in EyeSwipe 1) was approximately 2 degrees (100 pixels), with keys separated by approximately 0.5 degrees (25 pixels). The characters in the gesture region in EyeSwipe 2 were positioned in the same place as in EyeSwipe 1, but with no border around them.

We used the same 10,219-word lexicon from the previous experiment for both methods. The word occurrences were extracted from the Corpus of Contemporary American English (COCA) [7], available at http://www.ngrams.info/. We used a different corpus because COCA also provides bigram and trigram data. We plan to use this additional information in future versions of EyeSwipe.

5.5.3 Procedure

Each participant visited the lab on four different days (no more than 48 hours apart) and performed four text entry sessions with each interface per day. The participants received a compensation of US$15 for each day they participated in the experiment and an additional US$20 upon completion of the experiment.

The experimenter began by introducing the eye tracker and the experiment. The experiment consisted of two alternating steps: (1) a text entry session, always followed by (2) a QWERTY layout expertise session.

Each text entry method was explained right before the participant performed the first session with it. Before starting the formal session, the participant practiced entering two sentences with the method. The experimenter always referred to the methods as RED (EyeSwipe 1) and GREEN (EyeSwipe 2), the predominant colors in the buttons of each interface, to avoid biasing the participant's preference towards EyeSwipe 2 because of the version number.

The participant was asked to sit comfortably at approximately 70 cm from the screen. They were instructed to avoid moving their heads, but no physical restriction (e.g. chin rest) was imposed. Each participant started each day by calibrating the eye tracker. The eye tracker was recalibrated as needed during the day.

The procedure for the text entry sessions was similar to the previous experiment: phrases were sampled from MacKenzie and Soukoreff's dataset [33] and presented one at a time at the top of the experiment window. The participants were encouraged to memorize the text and enter as many phrases as they could, as fast and accurately as possible. However, the text to be transcribed was always visible at the top of the window in case they forgot it. Participants were instructed to correct errors only if detected right away; otherwise they should ignore the error. If the session timed out in the middle of a sentence, the participant had to finish entering it to end the session.

The version of EyeSwipe 2 used in the experiment did not apply the dynamic calibration, as we intended to use the data from the experiment to test different calibration functions. The experiment therefore tested the combined effect of the enhancements proposed for EyeSwipe 2, except for the dynamic calibration.

The QWERTY layout sessions tested two alternating modes. In both modes a letter was displayed in the top part of the window and the letter keys of the QWERTY layout were displayed at the bottom (in the same position as in the interfaces of EyeSwipe 1 and 2), but the letters were not visible. When the participant selected the letter at the top, the letter disappeared and the participant had to select the same letter on the keyboard layout as soon as possible by dwell-time. A 600 ms dwell-time was used for all selections. Once the participant selected the key, another letter was displayed. This process was repeated for 13 of the 26 letters, selected randomly. The other 13 letters were used in the following session of the same mode.

The two modes are hidden layout and visible layout. In the hidden layout, the letters on the keyboard were never displayed, so the participant had to remember which key represented the letter they were searching for. Participants were instructed to guess the approximate location if they did not remember it, instead of trying to count the keys. In the visible layout, the letters on the keyboard became visible as soon as the target letter disappeared. Once the next target letter appeared, the keyboard letters became invisible again.

At the end of the experiment the participants completed a questionnaire about their subjective feedback on the two text entry methods and their basic demographic information.

5.5.4 Design

For the text entry sessions we used a within-subjects design with dependent variables text entry rate and accuracy (measured by MSD error rate, correction rate, cancel rate, and replace rate), and independent variables session and method. Each participant performed thirty-two 5-minute text entry sessions, sixteen using EyeSwipe 1 and sixteen using EyeSwipe 2, totaling 80 minutes with each interface, aside from a short practice session. Participants performed eight sessions per day, four with each method. Half the participants started with EyeSwipe 1 and the other half with EyeSwipe 2.

On each day the participant started with a different method than on the previous day. The first four sessions were performed with the first method and the last four sessions with the second. After each text entry session, the participant performed a QWERTY layout expertise session. Every layout session used a different mode from the previous one. Thus, each participant performed thirty-two layout sessions, sixteen with each mode. The layout sessions were also counterbalanced, so half the participants who started with each text entry method started with the visible layout and the other half started with the hidden layout.

Metrics for the QWERTY Layout Expertise Sessions

For the QWERTY layout expertise sessions we measured the following dependent variables: selection count, key distance to target, completion time, and reaction time.

The selection count is the number of keys highlighted (i.e., focused for longer than 100 ms) before the final selection by dwell-time (600 ms).

The key distance to target is the shortest path between the selected key and the target key in a graph connecting neighbor keys. For example, if the target key was the letter W and the user selected the letter V, the key distance is 4 (Figure 5.7). Some of the shortest paths in this example are V-F-R-E-W and V-C-D-S-W.

Figure 5.7: The key distance to target is the length of the shortest path between two keys in a graph connecting neighbor keys. The key distance between W and V, for example, is 4. Two of the shortest paths between W and V are shown in red and blue.
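A small sketch of this metric, using breadth-first search over a neighbor graph of the letter keys; the exact adjacency used in the experiment is not listed in the text, so the NEIGHBORS map below is a plausible reconstruction for illustration.

from collections import deque

# Neighbor graph of the QWERTY letter keys (keys that touch on the layout);
# this adjacency is an assumption, not taken from the thesis.
NEIGHBORS = {
    "Q": "WA", "W": "QESA", "E": "WRDS", "R": "ETFD", "T": "RYGF",
    "Y": "TUHG", "U": "YIJH", "I": "UOKJ", "O": "IPLK", "P": "OL",
    "A": "QWSZ", "S": "AWEDZX", "D": "SERFXC", "F": "DRTGCV",
    "G": "FTYHVB", "H": "GYUJBN", "J": "HUIKNM", "K": "JIOLM", "L": "KOP",
    "Z": "ASX", "X": "ZSDC", "C": "XDFV", "V": "CFGB", "B": "VGHN",
    "N": "BHJM", "M": "NJK",
}

def key_distance(selected, target):
    """Length of the shortest path between two keys in the neighbor graph
    (breadth-first search)."""
    queue = deque([(selected, 0)])
    visited = {selected}
    while queue:
        key, depth = queue.popleft()
        if key == target:
            return depth
        for neighbor in NEIGHBORS[key]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return None

print(key_distance("V", "W"))  # 4, matching the example in Figure 5.7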

The completion time is the time measured from the offset of the target letter to the selection of a key by dwell-time.

Finally, the reaction time is the time between the offset of the target letter and the occurrence of the first gaze sample outside a region with radius of 1 degree around the target letter.

5.6 Experimental Results

Participant B09 spent most of the first session trying to enter the word "presidential". The correct word was never listed among the candidate words. When we analyzed the data after the experiment we observed that Participant B09 was not looking at all the letters. In fact, in most of the cases he would skip almost half of the letters and would finish the word before looking at the last letter, e.g. "presnt" or "prsntia". The experimenter asked Participant B09 to just add a period and proceed to the next sentence. As the sentence was long and Participant B09 spent most of the session time trying to enter it, the resulting error rate was high (33.759). We did not consider Participant B09's first session, because it did not produce enough data for the analysis. His first session was replaced by the average of the data produced by the other participants in the first session.

5.6.1 Text Entry Rate

Participants were able to enter text, on average, faster using EyeSwipe 2 compared to EyeSwipe 1 (Figure 5.8). In particular, participants were faster with EyeSwipe 2 in almost all sessions when comparing EyeSwipe 1 to EyeSwipe 2 in sessions with the same ordinal number (first session with EyeSwipe 1 to first session with EyeSwipe 2, etc.). The highest mean entry rate achieved by the participants was 14.78 wpm in session 15 using EyeSwipe 2 (14.36 wpm in the last session). The highest mean entry rate with EyeSwipe 1 was 12.29 wpm in session 12 (11.93 wpm in the last session).

We conducted a within-subjects repeated measures ANOVA on the text entry rate with the independent variables method (EyeSwipe 1 and EyeSwipe 2) and session (1–16). There were significant main effects of method (F1,10 = 161.476, p < 0.01) and session (F15,150 = 12.072, p < 0.01). The interaction between method and session was not significant (F15,150 = 0.807, p > 0.05).


Figure 5.8: Boxplots for the mean text entry rates for each session and interface. Overall means are shown with a star symbol, outliers are represented by a black diamond.

The maximum entry rates for each session and participant are shown in Figure 5.9. There were significant main effects of method (F1,10 = 265.949, p < 0.01) and session (F15,150 = 9.154, p < 0.01). The interaction between the factors was not significant (F15,150 = 1.328, p > 0.05). Participant B02 entered the sentence "construction makes traveling difficult" at 32.958 wpm in session 12 (last session of the third day) using EyeSwipe 2. In the same session, Participant B02 entered the sentences "presidents drive expensive cars" and "the chamber makes important decisions" at 24.479 wpm and 25.398 wpm, respectively. The MSD error rate for the three sentences was zero. The average word length in these specific sentences is 7.45 characters, which is almost 50% longer than the average English word length of about 5 characters.

The maximum entry rate with EyeSwipe 1 was achieved by Participant B04, also in session 12. He achieved 21.272 wpm in the sentence "construction makes traveling difficult". In the last session (16) he also achieved 21.057 wpm in the sentence "vanilla flavored ice cream". As a comparison, his highest entry rate was 28.151 wpm on the sentence "round robin scheduling" in the 15th session using EyeSwipe 2. All participants reached more than 20 wpm in at least one sentence using EyeSwipe 2. The maximum entry rate per participant with EyeSwipe 1 ranged between 15.822 wpm and 21.272 wpm.

Clean Text Entry Rate

Both versions of EyeSwipe allow entering long words with very short gestures. For this reason, similarly to the previous experiment, we computed the clean text entry rate to observe the impact of errors on the text entry rates. The mean and maximum clean text entry rates are shown in Figures 5.10 and 5.11.

Figure 5.9: Boxplots for the maximum text entry rates for each session and interface. Overall means are shown with a star symbol, outliers are represented by a black diamond.

The mean clean entry rates are not significantly different from the entry rates considering all sentences (EyeSwipe 1: F1,10 = 3.002, p > 0.05; EyeSwipe 2: F1,10 = 3.262, p > 0.05). The maximum clean entry rates for EyeSwipe 1 are also not significantly different from the entry rates considering all sentences (F1,10 = 2.318, p > 0.05). The maximum clean entry rates for EyeSwipe 2 were significantly lower than the maximum entry rates considering all sentences (F1,10 = 7.652, p < 0.05). However, the values for the highest maximum entry rates reported previously remain the same, as all of them had no uncorrected errors.

Gesture Entry Rate

The gesture entry rate in characters per minute (cpm) by word length is shown in Figure 5.12. The dashed line indicates the average entered word length (4.466 characters).

The gesture entry rate was higher with EyeSwipe 2. A within-subjects repeated measures ANOVA revealed significant main effects of both method (F1,10 = 151.474, p < 0.01) and word length (F1,10 = 892.366, p < 0.01) and an interaction between the two factors (F1,10 = 23.528, p < 0.01). Overall, the longer the word is, the higher the average gesture entry rate will be. This result is consistent with the maximum entry rates obtained for both methods. The average word length for the fastest sentences is higher than the overall average word length. In other words, both methods are faster at entering sentences with longer words.


Figure 5.10: Boxplots for the mean clean text entry rates for each session and interface. The clean text entry rate includes only transcribed sentences with no uncorrected errors. Overall means are shown with a star symbol, outliers are represented by a black diamond.

Reverse Crossing vs. Hidden Dwell-Time

In EyeSwipe 2 we removed the pop-up action buttons and the reverse crossing to mark the beginning of the gesture. The grand mean for the duration of the reverse crossing gesture to start a gesture was 1,020.64 ms. The average duration of the starting reverse crossing decreased significantly (F1,10 = 32.789, p < 0.001) over the sessions, starting from 1,176.71 ms in the first session and ending at 958.980 ms in the last session. However, even in the last session, the average reverse crossing took longer than the hidden dwell-time required to restart a gesture.

Besides the shorter time to start a gesture by hidden dwell-time, gestures in EyeSwipe 2 can be started immediately from the first fixation on the gesture region. In such cases, the time required to start a gesture is zero. The restart rate is the rate of gestures that used the hidden dwell-time to restart the gesture over all gestures. The grand mean for the restart rate was 41.537%. There was a significant decrease (F1,10 = 9.993, p < 0.05) in the restart rate over the sessions, starting from 51.329% in the first session and ending at 34.017% in the last session.

Therefore, we identified two complementary explanations for the higher gesture entry rate observed for EyeSwipe 2. The majority of the gestures in EyeSwipe 2 were performed starting from the first fixation in the gesture region, so no dwell-time was needed. When the hidden dwell-time was used to restart the gesture, it was still faster than the average reverse crossing.


Figure 5.11: Boxplots for the maximum clean text entry rates for each session and interface. The clean text entry rate includes only transcribed sentences with no uncorrected errors. Overall means are shown with a star symbol, outliers are represented by a black diamond.

Candidate Selection Strategy

Participants did not receive any specific instructions on where to look on the action region after finishing a gesture on the gesture region in EyeSwipe 2. When explaining how to use EyeSwipe 2, the experimenter told the participant that after looking at the last letter of the word they should look at the action region to select a word candidate.

Though we did not say that the most probable candidate would be displayed at the first position they looked at, some participants noticed this feature and commented on it between sessions. We intentionally left this explanation vague, so we could investigate how the participants would use this feature.

Upon analysis of the gaze data we observed two strategies to select the word candidate. The first strategy (LEFT) is to always look at the leftmost action button. In this case, if the target word is not the first candidate, the user scans the other candidates from left to right. The second strategy (UP) was to look at the action button right above the last letter of the word. For example, if the last letter was P, the participant would look at the rightmost action button; if it was T, the participant would look at the action button in the middle of the action region. The search strategies varied if the first candidate was not the target word.

Figure 5.12: Gesture entry rate in characters per minute. The dashed line indicates the average entered word length (4.466 characters).

To determine the strategy chosen by each participant we selected the data from the words ending in the 7 letters in the middle of the keyboard (T, Y, F, G, H, V, and B). We then computed the angle between the line segment from the last fixation on the gesture region to the first fixation on the action region and a vertical line passing through the last fixation on the gesture region. The angles were classified as looking left (below −30°), up (between −30° and 30°), and right (above 30°). If more than 80% of the gestures were finished by looking up, the participant was assigned the UP strategy. Otherwise, if more than 50% of the gestures were finished by looking left, the LEFT strategy was assigned. In all other cases an N/A strategy was assigned (no participant was assigned this strategy). The strategies used by each participant are shown in Table 5.1.

Table 5.1: Strategy adopted by each participant. LEFT: the participant looked for the leftmost action button, independently of the location of the last letter of the word. UP: the participant looked at the action button right above the last letter of the word.

Participant  B01   B02  B03  B04   B05  B06   B07  B08   B09   B10   B11
Strategy     LEFT  UP   UP   LEFT  UP   LEFT  UP   LEFT  LEFT  LEFT  LEFT
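The angle classification and strategy assignment described above can be expressed compactly as in the sketch below; the function names and the screen-coordinate convention (y grows downwards) are assumptions made for illustration, while the −30°/30° and 80%/50% thresholds come from the text.

from math import atan2, degrees

def saccade_direction(last_gesture_fixation, first_action_fixation):
    """Classify the transition saccade as 'left', 'up', or 'right' from the
    angle between the saccade segment and a vertical line through the last
    gesture-region fixation."""
    dx = first_action_fixation[0] - last_gesture_fixation[0]
    dy = last_gesture_fixation[1] - first_action_fixation[1]   # upward component
    angle = degrees(atan2(dx, dy))    # 0 degrees = straight up, positive = to the right
    if angle < -30:
        return "left"
    if angle > 30:
        return "right"
    return "up"

def assign_strategy(directions):
    """UP if more than 80% of gestures ended looking up; otherwise LEFT if
    more than 50% ended looking left; otherwise N/A."""
    up = directions.count("up") / len(directions)
    left = directions.count("left") / len(directions)
    if up > 0.8:
        return "UP"
    if left > 0.5:
        return "LEFT"
    return "N/A"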

We compared the text entry rate in the last session between the participants grouped according to the strategies used. With a Welch's t test, no significant difference was found between the strategies (t(7.753) = −0.363, p > 0.05).

5.6.2 Accuracy

A total of 7,792 gestures were performed using EyeSwipe 1 and 11,945 using EyeSwipe 2. The absolute number of completed (that remained in the text), deleted, and canceled gestures was higher for EyeSwipe 2: 515 gestures were deleted, 170 canceled, and 7,107 completed with EyeSwipe 1, and 537 deleted, 1,671 canceled, and 9,737 completed with EyeSwipe 2.

Error Rate

A mean MSD error rate below 2.0 was observed for most of the sessions for both methods (Figure 5.13). The exception was the first session with EyeSwipe 2, with a mean error rate of 2.337. The error rate in the first session with EyeSwipe 2 was mostly influenced by the high error rate of Participant B01 (13.074), who faced difficulties with the interface at first.

Participant B01 entered wrong words before starting some sentences because she repeatedly switched the focus between the gesture region and the action region, accidentally entering some words. We told her not to worry about the initial wrong words and continue from that point. Later in the first session she understood how to delete words and improved the error rate. Another indication that she understood how to enter and delete words with EyeSwipe 2 is that her error rate in the second session dropped to 1.339.

Figure 5.13: Boxplots for the mean error rates for each session and interface. Overall means are shown with a star symbol, outliers are represented by a black diamond.

The low mean error rates, below 1.5 for all sessions after session 2 of each method, indicate that participants were careful entering the sentences. Seven participants finished at least 10 sessions with no uncorrected errors (error rate equal to zero). In particular, Participant B03 and Participant B10 finished all 16 sessions with EyeSwipe 2 and EyeSwipe 1, respectively, with no uncorrected errors.

The effect of the interface on the error rate was not significant (F1,10 = 2.849, p > 0.05). The effect of session was significant (F15,150 = 3.184, p < 0.001), indicating that participants produced fewer uncorrected errors with practice.

Fréchet Score vs. Dynamic Time Warping Score

Based on pilot experiments we decided to use the Fréchet distance instead of Dynamic Time Warping as the gesture-to-ideal-path similarity measure. The same gesture classification method was used for both EyeSwipe 1 and EyeSwipe 2 in this experiment, so the only difference between them was the interface.

We will refer to the candidate probability function used in this experiment (Equation 5.1) as the Fréchet score. We collected the gesture data from all participants and computed offline both the Fréchet and DTW scores (Equation 4.2). The top-n accuracy for both scores and interfaces is shown in Figure 5.14.

Note that this analysis is different than the one presented for the first experiment. In the first experiment only the completed gestures with correct first and last letter candidates were analyzed.


Figure 5.14: Top-n scores for all gestures using the Fréchet and DTW scores.

For this experiment we wanted to compare both the two score functions (DTW and Fréchet) and the two versions of EyeSwipe. As EyeSwipe 2 selects multiple first and last letter candidates, we decided to report the top-n accuracy considering all gestures, independently of whether the first and last letter candidates were correct.

As expected, the top-n accuracy is better for EyeSwipe 1 compared to EyeSwipe 2. As EyeSwipe 2 uses multiple first and last letter candidates, the search space increases, making the gesture classification problem harder.

For each gesture we calculated the index of the target word (determined by the sentence to be transcribed) in the candidate list using both scores. The index was equal using the two scores for 70.309% of the gestures. The index was lower (lower is better) using the DTW score for 13.766% of the gestures, and lower using the Fréchet score for the remaining 15.925%.

These results indicate that the Fréchet score was better overall for classifying the gestures into words. The Fréchet score may be more sensitive to points outside of the ideal path. For example, if the user needs to search for a letter, points outside of the ideal path will be added. Recalling the informal definition of the Fréchet distance, it is equivalent to the shortest leash required for both the man and the dog to traverse the two separate paths going strictly forward. If there is a single gaze point that deviates greatly from the ideal path, this point will define the whole Fréchet distance, ignoring the other points. Dynamic Time Warping, on the other hand, accumulates the distances between the points, so the large distance added by the point outside the ideal path will impact the DTW score, but the other points in the path will still maintain their contribution to the score.

Considering the behavior of the two score functions, the better top-n scores achieved by the Fréchet score indicate that in general participants did not deviate much from the ideal path. In other words, they did not perform much searching during the gaze gestures.

The behavior of the Fréchet distance is desirable when the user does not deviate much from the ideal path. In this case, when comparing the gaze path to the ideal path of the target word and to another word with a similar path, the difference between the scores is expected to be higher using the Fréchet score. Because the Fréchet distance is more sensitive to points outside of the ideal path, even if there is just one, the distance to the target word will be smaller.


The lower top-n scores obtained with EyeSwipe 2 with both score functions are partially explained by the implicit selection of the last letter candidates. The correct first letter was selected for 92.178% of the gestures with EyeSwipe 1, and the correct last letter for 99.960% of the gestures. With EyeSwipe 2 the accuracies of selecting the first and last letters were 99.925% and 88.641%, respectively. Thus, in approximately 10% of the gestures, the correct last letter was not among the last letter candidates, so the word was not present in the candidate search space.

Correction Rate

The correction rate (rate of deleted words over all entered words) is shown in Figure 5.15. The average correction rate was 6.978% with EyeSwipe 1 and 4.845% with EyeSwipe 2. Significant main effects were observed for method (F1,10 = 11.321, p < 0.01) and session (F15,150 = 2.346, p < 0.005) on the mean correction rate. The interaction between the factors (F15,150 = 0.538, p > 0.05) was not significant. The mean correction rate improved from 12.715% in the first session to 5.254% in the last session for EyeSwipe 1 and from 7.131% to 3.175% for EyeSwipe 2, indicating that users produce fewer wrong words with practice.

Figure 5.15: Boxplots for the mean correction rates for each session and interface. Overall means are shown with a star symbol, outliers are represented by a black diamond.

Though the correction rate was lower with EyeSwipe 2 in general, it does not consider the gestures that have been canceled. EyeSwipe 1 requires the first candidate to be entered so that the user can search for other candidates. In this case, the correction rate will be higher if the target word is also not among the candidates. EyeSwipe 2 shows all candidates at once, so the user can cancel the gesture before entering any word. In this case, the correction rate will be lower, but the cancel rate will be higher.


Cancel Rate

The cancel rate (rate of canceled gestures over all started gestures) is shown in Figure 5.16. The average cancel rate was 2.551% with EyeSwipe 1 and 14.148% with EyeSwipe 2. There were significant main effects for both method (F1,10 = 146.453, p < 0.001) and session (F15,150 = 5.268, p < 0.001). The interaction between the factors was not significant (F15,150 = 1.282, p > 0.05). The mean cancel rate improved from 10.941% in the first session to 0.725% in the last session for EyeSwipe 1 and from 19.433% to 12.454% for EyeSwipe 2.

Figure 5.16: Boxplots for the mean cancel rates for each session and interface. Overall means are shown with a star symbol, outliers are represented by a black diamond.

The cancel rate decreased until session 7 (10.506%) for EyeSwipe 2. After session 7 it started increasing again, reaching 15.707% in session 12. A possible explanation is that the reasons to cancel a gesture may differ between EyeSwipe 1 and EyeSwipe 2.

Gestures may have been canceled for several reasons. The correct first letter may not have been selected, the user may have been searching for some letters during the gesture and decided to restart it, or the user might have forgotten what word they were entering.

EyeSwipe 2 has additional reasons to cancel a gesture. Users may accidentally perform a saccade from the gesture region to the action region. In this case, their fixations on the gesture region will be used to compute word candidates. However, as this was activated by accident, the user will cancel the current gesture.

Another possible reason is that EyeSwipe 2 presents all word candidates at once on the action region. The user can choose to enter a word or cancel the gesture. If the target word is not among the candidates it is more efficient to cancel the gesture instead of selecting some word and then deleting it. We noticed that participants took the latter, less efficient, route more often in the initial sessions.


This may be the main reason for the difference between the cancel rates obtained with EyeSwipe 1 and EyeSwipe 2. The top-6 accuracy using the Fréchet score was approximately 80% for EyeSwipe 2. Most of the misclassified gestures have been canceled by the participants. The remaining misclassified gestures were either deleted or left as uncorrected errors.

With EyeSwipe 1 it was not possible to cancel a gesture because the word was not among the candidates: the first candidate needs to be entered for the user to be able to search for the other candidates. Only then, if the word is not among the candidates, do they delete the word. Consequently, the correction rate increases as the cancel rate decreases.

Replace Rate

The efficiency of EyeSwipe 1 depends on its ability to select the correct first candidate. EyeSwipe 2 is also affected by the first candidate selection, as it is displayed at the action button the user first looks at in the action region. However, the other candidates are just one saccade away, so it is simpler for the user to select a different word. When EyeSwipe 1 misses the first candidate, the user has to select it with a reverse crossing and then replace it with another reverse crossing (or delete it if the word is not among the candidates).

The replace rate (rate of replaced words over all entered words) is shown in Figure 5.17. The first selection of word candidates in EyeSwipe 2 is not considered a replacement, because no word was previously entered for that given gesture. A word is considered replaced independently of the number of times it has been replaced, so it makes no difference whether a word has been replaced only once or multiple times. The mean replace rate remained relatively stable for both methods: 15.727% for EyeSwipe 1 and 1.721% for EyeSwipe 2. For most of the sessions, the mean replace rate was below 2% with EyeSwipe 2, indicating that the participants usually did not replace the word once it was selected for the first time.

The higher replace rate with EyeSwipe 1 compared to EyeSwipe 2 may be interpreted as a consequence of the reverse crossing that ends the gesture always entering the first candidate. Whenever the first candidate is not the target word, the user needs to replace it.

There was a significant main effect of method (F1,10 = 168.372, p < 0.001) on the mean replace rate. The main effect of session (F15,150 = 1.094, p > 0.05) and the interaction between the factors (F15,150 = 1.059, p > 0.05) were not significant.

5.6.3 QWERTY Layout Expertise Results

As expected, the two modes (visible and hidden) produced significantly different results in the layout sessions. The visible condition, when the letters in the keys were visible to the participant, produced significantly lower mean values for all four metrics: selection count (F1,10 = 5.191, p < 0.05), key distance to target (F1,10 = 19.869, p < 0.001), completion time (F1,10 = 28.104, p < 0.001), and reaction time (F1,10 = 14.670, p < 0.005).

Session had a significant main effect on key distance to target (F15,150 = 4.892, p < 0.001), completion time (F15,150 = 1.973, p < 0.05), and reaction time (F15,150 = 1.821, p < 0.05). The effect of session was not significant for the selection count (F15,150 = 1.203, p > 0.05).

Figure 5.17: Boxplots for the mean replace rates for each session and interface. Overall means are shown with a star symbol, outliers are represented by a black diamond.

We computed the correlation between the results for each metric and the mean text entry rate on the last day for each participant with each interface. The correlations between the entry rate and the selection count (EyeSwipe 1: 0.291, EyeSwipe 2: -0.0845) and key distance (EyeSwipe 1: -0.169, EyeSwipe 2: -0.160) were low. There was a moderate positive correlation between the entry rate and the completion times (EyeSwipe 1: 0.380, EyeSwipe 2: 0.331), and a high positive correlation between text entry rate and reaction time (EyeSwipe 1: 0.775, EyeSwipe 2: 0.822).

Considering the correlations between the metrics and the text entry rate, we calculated simple linear regressions to predict the mean text entry rate based on each metric independently. No significant regression equations were found for selection count (EyeSwipe 1: F1,9 = 0.831, p > 0.05; EyeSwipe 2: F1,9 = 0.065, p > 0.05), key distance (EyeSwipe 1: F1,9 = 0.261, p > 0.05; EyeSwipe 2: F1,9 = 0.237, p > 0.05), and completion time (EyeSwipe 1: F1,9 = 1.519, p > 0.05; EyeSwipe 2: F1,9 = 1.104, p > 0.05).

A significant linear regression equation to predict the text entry rate based on reaction time was found (EyeSwipe 1: F1,9 = 13.490, p < 0.01; EyeSwipe 2: F1,9 = 18.77, p < 0.005), with R² of 0.600 and 0.676, respectively (Figure 5.18). For EyeSwipe 1, the text entry rate is equal to 2.256 + 0.024 × (reaction time) wpm, with the reaction time measured in milliseconds. For EyeSwipe 2, the text entry rate is equal to 2.096 + 0.032 × (reaction time). These results indicate that participants with higher mean reaction times were able to enter text at higher rates with both interfaces.
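To illustrate the fitted equations with a hypothetical value, a participant with a mean reaction time of 400 ms would be predicted to reach approximately 2.256 + 0.024 × 400 ≈ 11.9 wpm with EyeSwipe 1 and 2.096 + 0.032 × 400 ≈ 14.9 wpm with EyeSwipe 2.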

5.6.4 Subjective Feedback

At the end of the experiment participants answered a questionnaire about their subjective feedback. Participants indicated their perception of performance, learnability, and user experience on a 7-point Likert scale.

The answers to each question are shown in Figure 5.19. The answers, in the rightmost column, are grouped by points on the Likert scale (middle column) for each question (leftmost column).

The mean responses to each question are summarized in Figure 5.20. Each circle in the radar plot represents a score on the 7-point Likert scale.

Figure 5.18: A significant linear regression was found between the mean reaction time (layout experiment) in milliseconds and the mean text entry rate (text entry sessions) in wpm.

We conducted Wilcoxon Signed-Ranks tests for each question independently. EyeSwipe 2 was considered more comfortable (Z = 3.5, p < 0.01), faster (Z = 3.0, p < 0.01), with better overall performance (Z = 8.0, p < 0.05), and was more liked (Z = 1.0, p < 0.01) by the participants. The differences for learnability (Z = 12.0, p > 0.05), eye effort (Z = 11.5, p > 0.05), cognitive load (Z = 4.5, p > 0.05), and accuracy (Z = 14.0, p > 0.05) were not statistically significant.

Both methods were considered easy to learn by most of the participants. However, the cognitive load and eye effort were both considered high or very high for both methods. A possible explanation for this result is that it is unnatural to use the eyes for controlling anything. Using eye movements to control the interface takes time to get used to, so during the process it is likely to be cognitively and physically demanding.

All participants, except Participant B01, preferred EyeSwipe 2 over EyeSwipe 1. Participant B01 pointed out that "[EyeSwipe 2] is faster but [EyeSwipe 1] is more accurate."

The participants were also asked to comment on each interface individually and then compare the two methods.

Feedback on EyeSwipe 1

The frequent reverse crossing gesture was explicitly disliked by 6 out of the 11 participants, who considered it (Participant B02) "difficult", (Participant B01) "[taking] a lot of effort", (Participant B03) "frustrating", (Participant B03) "hard-on-the-eyes", and (Participant B04) "very repetitive". Participant B05 also said that it was "too slow with picking up my eyes for the first letter of the word", indicating that it was faster to select the first letter of the word with EyeSwipe 2. Participant B04 pointed out a difficulty we observed for several participants when learning the reverse crossing in both experiments: "I often found myself wanting to start the word after looking up at the start button, not looking back down again."

Figure 5.19: Participants' perception of performance, learnability, and user experience on a 7-point Likert scale.

The interface of EyeSwipe 1 gives feedback on the latest gaze samples with a blue line and indicates the current gaze point as a small blue circle. The opinion of the participants about the gaze path feedback was divided. Participant B06 explicitly said that "I did not like the blue path following your eyes as it lagged and got distracting". The eye tracker used for this experiment estimates the gaze location at approximately 60 Hz. Furthermore, once a new gaze position is received it is filtered by a mean filter. This delay is, as pointed out by Participant B06, perceivable and may become distracting. Participant B09 also considered that "the blue dot was distracting." However, Participant B08 said that "I felt that the blue dot sometimes helped me". Participants B10 and B11 had mixed opinions about the gaze feedback: "I liked that I could see how it was tracking my eyes as I went between letters, but that also slowed me down"; "the fact that there is a little indicator for your eyes is nice, but I feel as though I kept trying to correctly match the icon over to the right letters." These problems may be worsened if the gaze estimation is not accurate.

The accuracy problems of the eye tracker were mentioned by Participant B07: "if it was not perfectly calibrated, it was difficult to exactly get the right letter." She also pointed out that it was "annoying that you had to select both the first and last letter to get the word right, [it] took longer." Both issues have been addressed in EyeSwipe 2. By using multiple first and last letter candidates, the word classification may still work even if the calibration is not good. Also, the user can still start a gesture with the correct first letter even if the gaze estimation is not accurate enough to select a key. Regarding the second problem indicated by Participant B07, EyeSwipe 2 improves over EyeSwipe 1 by not requiring the user to explicitly select the last letter, which is inferred from the last fixations on the gesture region.


Figure 5.20: Average responses for all participants to the 7-point Likert scale questions.

Feedback on EyeSwipe 2

Participant B06 said she liked that she "only had to trace the word then select it and look at each button once." This preference may be related to the need for frequent reverse crossing gestures in EyeSwipe 1. In practice, EyeSwipe 2 needs a reverse-crossing-like movement to select the word to be entered, the reverse region crossing: the user looks at the last letter of the word, searches for the candidate word, and looks back at the gesture region. This back and forth eye movement may not have been a problem for Participant B06 because the movement occurs between regions instead of buttons.

Interestingly, Participant B02 liked the flow of the interaction: "it had a nice flow: spelling on the bottom, choosing the word in the middle, then review at the top of the screen." His choice of looking at the text region to review the text required at least one extra saccade to enter the next word. However, it seems not to have been a problem for this participant.

Regarding the usability and learnability, Participant B06 said: "it was pretty easy to use and very easy to learn." Participants B09 and B08 noted that they needed some time to learn and get used to entering words with EyeSwipe 2, but enjoyed the experience after that: "[it was] tough to get a handle on it at first, but after the first couple sessions, it became much easier. It was very quick and accurate." Participant B08 commented that he "felt pretty good using [EyeSwipe 2] when [he] got used to it." Participant B04 added that EyeSwipe 2 "got faster with experience."

Others also commented that EyeSwipe 2 was "good at predicting the words" (Participant B05), though the same word classification method was used for both methods and, in fact, EyeSwipe 1 was more accurate. Participant B09 said that it "did not put too much strain on [his] eyes."

Participants were told to wait for the first letter of the word to be highlighted before they started the gesture. They were not told that the letters would be highlighted under two different conditions: on the first fixation on the gesture region and upon completion of the hidden dwell-time. Participant B08 noticed this difference and added: "I had issues selecting the correct start keys right off the bat each time so it usually made me have to wait a second before every word," indicating that he had to rely on the hidden dwell-time in most cases.


EyeSwipe 2 does not consider the duration of the gesture. For this reason, long words such as "yesterday" may be entered when there are few words that meet the first and last letter criteria. In the example of the word "yesterday", if the user keeps alternating between the letter "Y" and the action buttons in the action region, the word will be entered multiple times, possibly by accident.

The effects of this issue were perceived by some participants, who said that (Participant B07) "sometimes, it would randomly get many words in a row that I did not select and I had to delete them." To avoid this problem, the duration of the gesture, or the length of its fixation sequence, could also be used as a criterion to prune unlikely word candidates.

Participant B11 found the cancel function to be frustrating at times. He said that he accidentally looked at the word candidates before leaving the action region, thus entering words instead of canceling. During the experiment we noticed that this issue, combined with the previous one, was responsible for part of the extra words that were entered.

Some participants had significant calibration problems near the corners of the monitor. For this reason it was often hard to select the backspace key. Participants B07 and B10 commented that it was particularly difficult to delete words because of this problem. Changing the location of the backspace button or increasing its size are possible solutions. Also, if the dynamic calibration is able to improve the gaze estimation over time, this problem may be alleviated.

Only Participant B01 commented on the minimal feedback in the gesture region: "it might be more accurate if every letter of a word lits up." She did not suggest using the same feedback as EyeSwipe 1, but rather extending the first-letter feedback to the middle letters. This may be an interesting feedback method to explore in future iterations.

EyeSwipe 1 vs. EyeSwipe 2

We asked the participants to compare the two interfaces. Regarding his preference, Participant B03 said: "I like [EyeSwipe 2] much more than [EyeSwipe 1]. It is more intuitive and requires less work for typing the words."

Participant B01, who showed a slight preference for EyeSwipe 1 over EyeSwipe 2, said that "[EyeSwipe 2] is faster but [EyeSwipe 1] is more accurate." Both observations agree with her data: she was faster (average of 34.750 words per session with EyeSwipe 1 and 54.625 with EyeSwipe 2) and less accurate (average of 0.813 canceled words and 2.625 deleted words with EyeSwipe 1, and 7.938 and 3.188 with EyeSwipe 2) using EyeSwipe 2. Her feedback indicates that failing to present the target word among the candidates may lead to frustration, even if the overall entry rate is higher.

Participant B04 considered the interaction with EyeSwipe 2 "much more fluid and faster". Participant B10 added: "[EyeSwipe 2] was better because I was able to think about the word overall more easily than I was with [EyeSwipe 1], where I felt like I had to focus more on going letter by letter." Participant B09 noted that EyeSwipe 2 was faster because it did not require the reverse crossing gesture, and added: "[EyeSwipe 2] was also more accurate to me, and I was able to type a lot faster." Participant B03 also said that the requirement for "less precise and less frequent eye movements" made EyeSwipe 2 superior.

Regarding gesture classification accuracy, Participant B08 said that he felt EyeSwipe 1 was more accurate because it forced him to slow down a little. Participant B10 said: "[EyeSwipe 2] was better at suggesting the word I wanted when it did not get it right the first time."


When commenting about comfort and the required eye movements, some participants mentioned that EyeSwipe 2 was more comfortable and easier because it did not require the eyes to move much, (Participant B02) "just glance over the letters." Participant B07 also considered EyeSwipe 2 "most similar to other swipe keyboards, while [EyeSwipe 1] seemed a bit trickier." Participant B11 added: "[EyeSwipe 2] has a simpler maneuver than [EyeSwipe 1]. I felt as though [EyeSwipe 1] required too many little movements to be comfortable."

The opinions were divided about the visual feedback. Participant B08 said that the gaze path and current gaze position shown in EyeSwipe 1 "can be helpful for longer words." Others liked that EyeSwipe 2 did not show this information because it could (Participant B05) "be distracting" and (Participant B04) "helped [her] worry less about making sure [she] saw a letter fully." Participant B09 was annoyed to have the "blue dot tracking every little movement of my eyes."

An additional piece of feedback present in EyeSwipe 2 was the previously entered word shown above the candidate word in each action button. Most participants did not notice it, even when asked after the end of the experiment. Participant B10, however, said: "I liked being able to see the previous word when selecting the one I had just typed, and I think it was easier to check where I was in the sentence without messing up than it was on [EyeSwipe 1]."

5.6.5 Dynamic Calibration

We conducted a preliminary study on dynamically adjusting the estimated gaze position using data from the experiment. The completed gesture data (gestures that have not been deleted or canceled) from each text entry session were used to compute a correction function. The correction function was tested on the estimated gaze data from the subsequent QWERTY layout expertise session. Test data were collected from the fixations on the keys in the layout sessions. For each selected key we found the closest fixation to its center. The center of the key is the ground truth, and the correction function is applied to the fixation estimated by the eye tracker. We assume that the participants were looking at the center of a key during its selection. To improve the chances of this assumption being correct, the experimenter specifically instructed the participants to fixate the center of the key when selecting it during the layout sessions.

To test the dynamic calibration, the same calibration must have been used for both the text entry session and the layout session. Participants B01 and B08 had to recalibrate the eye tracker during multiple layout sessions, so their data could not be used for this analysis.

Algorithm 5 was used to find pairs of points in the gesture data with their respective character centers. As discussed in Section 5.4, it is not clear which pairs of points are reliable. For these preliminary tests we assumed the estimation error to be small, so fixations estimated closer to the center of the keys are considered more reliable. We applied a simple filter that removed the pairs of points with distances greater than the average distance plus half a standard deviation. The remaining pairs of points were used to compute the correction function. We tested a constant displacement and two polynomial models for the correction function. The constant displacement is calculated as the mean displacement between the character centers and their respective closest gesture points. The polynomial models are a linear combination (Equation 5.6) and a second degree polynomial (Equation 5.7):

a_1 x + a_2 y + a_3    (5.6)

b_1 x^2 + b_2 y^2 + b_3 x y + b_4 x + b_5 y + b_6    (5.7)
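As an illustration of how these correction models can be obtained, the sketch below (a hypothetical helper, not the implementation used in this work) computes the constant displacement and least-squares fits of Equations 5.6 and 5.7 from already-filtered pairs of gesture points and key centers.

    import numpy as np

    def fit_corrections(estimated, targets):
        """Fit the three correction models from paired points.

        estimated: (N, 2) gaze points from the gestures (closest points to the ideal path).
        targets:   (N, 2) corresponding key centers (assumed ground truth).
        Returns three callables mapping a raw gaze point to a corrected point.
        """
        estimated = np.asarray(estimated, dtype=float)
        targets = np.asarray(targets, dtype=float)
        x, y = estimated[:, 0], estimated[:, 1]

        # Constant displacement: mean offset between key centers and gesture points.
        offset = (targets - estimated).mean(axis=0)
        constant = lambda p: np.asarray(p, dtype=float) + offset

        # Design matrices for Equations 5.6 and 5.7; each output coordinate is fit separately.
        A_lin = np.column_stack([x, y, np.ones_like(x)])
        A_quad = np.column_stack([x**2, y**2, x * y, x, y, np.ones_like(x)])
        lin_coef, _, _, _ = np.linalg.lstsq(A_lin, targets, rcond=None)
        quad_coef, _, _, _ = np.linalg.lstsq(A_quad, targets, rcond=None)

        def linear(p):
            px, py = np.asarray(p, dtype=float)
            return np.array([px, py, 1.0]) @ lin_coef

        def quadratic(p):
            px, py = np.asarray(p, dtype=float)
            return np.array([px**2, py**2, px * py, px, py, 1.0]) @ quad_coef

        return constant, linear, quadratic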

The boxplots for the errors using each correction function are shown in Figure 5.21. The baseline is the estimation error from the layout sessions. We also computed the difference in the average error for each participant using each correction function. The error differences are shown on the right-hand side of Figure 5.21. Negative values mean that the correction function reduced the mean error, while positive values mean that it increased the mean error.

Figure 5.21: Left: Boxplots for the errors in degrees of visual angle using each correction function. Right: Boxplots for the improvement over the baseline using each correction function (in degrees). Negative values mean that the correction function reduced the mean error.

The constant displacement function improved the results on average. The estimation error is significantly lower using the constant displacement compared to the baseline (t(7) = −6.521, p < 0.001). The two polynomial functions produced worse results than the baseline in general. A possible explanation is that unreliable points from the gestures were used for the calibration and the parameters of the polynomial function were affected by such points. The constant displacement is likely more robust because it uses the average from all points and is not affected by the location of the estimated point. However, it is also less likely to correct local errors. To use more flexible correction functions, such as the polynomial functions, or homographies, it is necessary to further study which fixations during the gesture are more reliable.

5.7 Discussion

Experimental results suggest that users can enter words approximately 20% faster with EyeSwipe 2 compared to EyeSwipe 1 on average. Regarding peak rates, a participant was able to enter a sentence with long words at almost 33 wpm with EyeSwipe 2, compared to 21 wpm with EyeSwipe 1. In the subjective questionnaire participants also indicated that they perceived EyeSwipe 2 as a faster text entry method.


The differences in text entry rate can be partially explained by the removal of the reverse crossing on the action buttons. EyeSwipe 2 allows the user to start a gesture from the first fixation on the gesture region or to restart the gesture from any location with a hidden dwell-time (set to 700 ms in the experiment). In both cases it was faster than performing a reverse crossing gesture (approximately 1 s on average). Throughout the experiment approximately half of the gestures with EyeSwipe 2 were started from the first fixation on the gesture region, in which case there is no wait time to start the gesture.

The participants maintained a low MSD error rate (below 1.5) during most of the experiment, indicating that they were careful to correct errors when they occurred. The exception was the first session, during which some participants were still learning how to use the interfaces. The correction rate (rate of deleted gestures) was similar for both methods. However, the cancel rate for EyeSwipe 2 was significantly higher (12.454% compared to 0.725%). Even with the considerably higher cancel rate, participants achieved higher text entry rates with EyeSwipe 2. The higher cancel rate was indicated as a problem by some participants, who considered EyeSwipe 2 less accurate in classifying the gestures.

The Fréchet score used in this experiment, on both EyeSwipe 1 and 2, performed better on average than the DTW score from the previous experiment in an offline analysis. This result indicates that participants did not perform much searching during the gaze gestures, so they did not deviate much from the ideal path. Even with the Fréchet score, however, the top-n results considering all gestures (completed, deleted, and canceled) are still low (below 80%) for EyeSwipe 2. The low top-n accuracy partially explains the high cancel rates observed in EyeSwipe 2. Overall, improving the accuracy of the gesture classification is a promising approach to improving user experience. With better classification the user will be required to cancel fewer gestures. To achieve better gesture classification, more accurate last letter candidate selection is required. Alternatively, using only a first-letter criterion (considering all possible last letters) could also be tested.
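For reference, the discrete Fréchet distance underlying such a score can be computed with the dynamic-programming formulation of Eiter and Mannila [11]; the sketch below is illustrative (the function name and use of NumPy are assumptions) rather than the exact scoring code used in the experiments.

    import numpy as np

    def discrete_frechet(P, Q):
        """Discrete Fréchet distance between two point sequences, e.g. a fixation
        sequence and the ideal path of a word through its key centers."""
        P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
        n, m = len(P), len(Q)
        ca = np.zeros((n, m))  # ca[i, j]: coupling distance for prefixes P[:i+1], Q[:j+1]
        for i in range(n):
            for j in range(m):
                d = np.linalg.norm(P[i] - Q[j])
                if i == 0 and j == 0:
                    ca[i, j] = d
                elif i == 0:
                    ca[i, j] = max(ca[0, j - 1], d)
                elif j == 0:
                    ca[i, j] = max(ca[i - 1, 0], d)
                else:
                    ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), d)
        return ca[n - 1, m - 1]

    # A lower distance between the observed gaze path and a candidate word's ideal
    # path (the polyline through its key centers) indicates a better match.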

EyeSwipe 2 shows all available word candidates at once in the action buttons inside the action region. As the user concludes the gesture they can select the candidate or cancel immediately if the target word is not among them. EyeSwipe 1 always shows only the first candidate in the action button to conclude the gesture. To select a different candidate, the first candidate must always be entered first. This is reflected in the higher replace rate with EyeSwipe 1 (15.713%) compared to EyeSwipe 2 (1.728%).

A significant linear regression equation was found between the reaction time in the QWERTY layout experiment and the text entry rate. The mean text entry rate increases with the reaction time in the layout experiment.

In the subjective questionnaire, most of the participants chose EyeSwipe 2 as the preferred method. According to the Likert scale results, EyeSwipe 2 was considered more comfortable and faster, and as having better overall performance, than EyeSwipe 1.

The main complaint about EyeSwipe 1 was the frequent short eye movements required by reverse crossing, which caused it to be slower and less comfortable. Some participants considered EyeSwipe 2 to be a "fluid" text entry method. However, other participants indicated that EyeSwipe 2 takes practice to get used to and that the process may be frustrating in the beginning.

Considering the similarities between EyeSwipe 2 and Context Switching, a possible extension would be to place the text region in between the gesture and action regions to act as the bridge in Context Switching. Context Switching is activated by saccades: characters are only entered when the user performs a saccade from one context to another. If they stop at the bridge, where the text is displayed, the previous action is canceled. A similar approach could be tested for EyeSwipe 2: the candidates for a gesture would only be displayed if a direct saccade from the gesture region to the action region was performed. Similarly, a word candidate would only be entered if the user performed a saccade directly to the gesture region. Thus a word would only be entered if the reverse region crossing was performed with saccades between the gesture and action regions.

Opinions about the minimal feedback used in EyeSwipe 2 (the current gaze position is not shown) compared to EyeSwipe 1 (a blue path showing the previous gaze positions and a blue dot on the current position) were conflicting. One participant (B01) preferred the explicit feedback of EyeSwipe 1, while others liked that it was not present in EyeSwipe 2. A possible investigation, suggested by Participant B01, is to use a feedback scheme similar to that of the first letter candidates in EyeSwipe 2. The letters within a given distance of the current gaze position can be highlighted (in a different color than the first letter candidates) to show the approximate gaze location estimated by the eye tracker, while not committing to a specific position. This visual feedback has been explored by Pedrosa et al. in Filteryedping [43]. The radius around the gaze position could also be adjusted depending on a confidence estimate derived from the data acquired for the dynamic calibration.
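A minimal sketch of this kind of proximity-based highlighting is shown below; the key-center map and the radius value are hypothetical inputs that the interface would supply.

    import math

    def keys_to_highlight(gaze_point, key_centers, radius_px):
        """Return the keys whose centers lie within radius_px of the estimated gaze point.

        gaze_point:  (x, y) current gaze estimate in pixels.
        key_centers: dict mapping a key label (e.g. 'q') to its (x, y) center.
        radius_px:   highlight radius; could be tied to a confidence estimate of the
                     gaze error rather than a fixed value.
        """
        gx, gy = gaze_point
        return [
            key
            for key, (kx, ky) in key_centers.items()
            if math.hypot(kx - gx, ky - gy) <= radius_px
        ]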

We conducted preliminary tests on the dynamic calibration using gaze gesture data from this experiment. The preliminary results indicate that even using all closest points to the ideal path indiscriminately can still improve the gaze estimation if a very simple correction function is used; in our case, the constant displacement. Slightly more complex functions, such as the polynomials, may not work without further processing. The gaze estimation error is often higher as the gaze deviates from the center of the screen, for example. More complex correction functions with local correction are desirable. To this end, further studies are required to identify which gaze points in the gesture are reliable for the dynamic calibration.


Chapter 6

Conclusions

Text entry by gaze is a vital communication tool for people with severe motor disabilities. If the user is able to control their eye movements, they are able to express their needs and thoughts through words using a gaze-based interface.

With the increasing interest in virtual reality and augmented reality, users without disabilities may also benefit from text entry by gaze. In virtual reality, users are often unable to see the environment around them. This makes it more difficult for them to use physical devices present in the environment, such as a physical keyboard. Virtual keyboards may be a solution for text entry in virtual reality. However, it is not clear how the user can interact with the virtual keyboard. One of the possible solutions may be the use of gaze.

Methods for text entry by gaze, however, typically deliver a poor user experience. People who have gaze-based interfaces as their only option for communication may continue to use them despite the problems with user experience. People who have other options might choose them over gaze. If the user experience is improved to a level at which people without disabilities would consider using text entry by gaze, users with severe motor disabilities would also benefit greatly. In this thesis we take one more step towards this goal.

Dwell-time is one of the most common approaches to selection by gaze. Virtual keyboards using dwell-time are easy to learn and, with some additional features [9, 34, 38], may be able to achieve state-of-the-art entry rates. However, the method forces the user to wait for the keys to be selected. This forced wait not only limits speed [26], but may also cause the perception that the interface is slow: the user is inactive, only waiting for the key to be selected. Also, the dwell-time period is usually longer than a typical fixation, thus causing discomfort by forcing the user to hold their fixation for a longer period of time.
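To make the mechanism concrete, the sketch below implements a basic dwell-time selection loop; the class, threshold, and update interface are illustrative assumptions, not the dwell-keyboard evaluated in this thesis.

    import time

    class DwellSelector:
        """Minimal dwell-time key selection sketch: a key is selected once the gaze
        has stayed on it for at least dwell_s seconds (illustrative only)."""

        def __init__(self, dwell_s=0.6):
            self.dwell_s = dwell_s
            self.current_key = None   # key currently under the gaze
            self.enter_time = None    # when the gaze entered that key
            self.fired = False        # whether this key was already selected

        def update(self, key_under_gaze):
            """Call for every new gaze sample; returns the selected key or None."""
            now = time.monotonic()
            if key_under_gaze != self.current_key:
                # The gaze moved to another key (or off the keyboard): restart the timer.
                self.current_key = key_under_gaze
                self.enter_time = now
                self.fired = False
                return None
            if (key_under_gaze is not None and not self.fired
                    and now - self.enter_time >= self.dwell_s):
                self.fired = True  # require leaving the key before it can be selected again
                return key_under_gaze
            return None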

Gesture-based interfaces such as EyeWrite [64] and Quikwriting [4] require the user to perform a sequence of saccades to enter a single letter. These dwell-free approaches to text entry keep the user active throughout the whole process of entering a letter and can be performed as fast as the user is able to perform the gesture.

Virtual keyboards based on keys must consider the magnitude of the gaze estimation error to determine the size of the keys. Typically this implies that both the size of the keys and the space between them must be about 1 to 2 degrees of visual angle. Zooming interfaces with dynamic graphic display objects, such as Dasher [61] and StarGazer [15], are more robust to gaze estimation errors. The user can adjust the selected character as the interface zooms in on it. The character is only selected after it occupies a large portion of the keyboard window, so such interfaces tolerate relatively large errors in gaze estimation. The dynamism of the interface, however, can be distracting to the user. Using gaze for control is already an unnatural task. Ignoring moving visual stimuli competing for the user's attention while trying to focus on the control task with the gaze may be even harder and more stressful.

In this thesis we presented EyeSwipe, a method for text entry by gaze. EyeSwipe overcomes some of the limitations presented above. Users can enter a word by looking at the sequence of characters that form it without stopping on each letter. The method uses the gesture formed by the sequence of fixations to assign probabilities to candidate words. The two iterations of EyeSwipe presented in this thesis differ in the methods to indicate the limits of the gestures and to select the candidate word.

The gaze gesture is composed of a sequence of fixations on the letters. Expert users may have one or two fixations per letter. Thus, instead of using a gesture to enter a single letter, the user enters a whole word with a single gesture. Also, the overall gaze path is used for the gesture classification, so the fixations do not need to be accurately estimated on the keys.

Overall, the interface of EyeSwipe is static. Elements in the interface may react to the user's gaze by changing colors to give visual feedback, but the elements themselves do not move or change their size. The exception is the action button used by EyeSwipe 1, which pops up above the key selected by the user.

For the implementations tested in the experiments we used the QWERTY keyboard layout. Despite not being the most efficient layout for gesture typing [52], it is the most familiar to most desktop or mobile computer users. It was therefore simple to explain to the participants of the experiments how to perform the gaze gesture: they simply had to quickly look at the letters that form the word. The methods to indicate the first and last letters of the word and to select word candidates or punctuation marks required further explanation, as they were unfamiliar to the participants, who had no prior experience with gaze-based interfaces. The familiarity of the participants with QWERTY keyboards may explain why both methods were considered easy to learn, but not as easy as dwell-time.

6.1 Limitations

In both experiments presented in this thesis, none of the participants had motor disabilities. Also, most participants were experienced QWERTY keyboard users. Thus, though the results show that text entry rates of 20 or 30 words per minute are achievable with EyeSwipe, it is still unclear how they would differ for people with severe motor disabilities.

In the experiment comparing EyeSwipe 1 to the dwell-keyboard, a fixed 600 ms dwell-time was used. The use of an adjustable dwell-time [34] would likely produce higher entry rates for the dwell-keyboard and result in a better perception of speed and comfort in the subjective feedback. EyeSwipe uses a lexicon to compute word candidates. The dwell-keyboard would also benefit from such an ability to correct and predict the most likely words. If the lexicon were applied to the dwell-keyboard, the error rate, entry rate, and user experience would also likely be different.

Finally, even at the end of the experiment comparing the two versions of EyeSwipe, participants were still improving. An extended study is needed to measure expert performance with EyeSwipe.


6.2 EyeSwipe 1

EyeSwipe 1 uses reverse crossing to mark the limits of a gesture. When selecting the last letter, the word to be entered is displayed on the action button. The top of the interface displays other word candidates, so the user can replace the entered word by reverse crossing. In the experiment comparing EyeSwipe 1 to a fixed dwell-time-based method we found EyeSwipe 1 to be faster, both as measured by the words per minute (wpm) rate and as perceived by the participants. Participants achieved an average text entry rate of 11.48 wpm with EyeSwipe 1 and 9.47 wpm with the dwell-keyboard in the last session, after 30 minutes using each method.

EyeSwipe 1 allows the user to enter text at their own pace, provided that they do not diverge much from the ideal path of the word. The interface does not limit how fast the user selects the individual letters. If the gesture classification includes the desired word among the most probable word candidates, the entry rate depends mainly on the speed of the user's gesture. For this reason we expected a larger variability in the entry rates within and between participants. Considering this hypothesis, we also reported the maximum entry rates achieved by participants when entering a whole sentence.

As predicted by Kristensson and Vertanen's model, using Equation 3.1 with an overhead of 200 ms, the maximum entry rates achieved with dwell-time plateaued at almost 15 wpm. The maximum entry rate achieved by a participant with EyeSwipe 1 in a whole sentence was 21.88 wpm. The average maximum entry rate per session was still increasing in the last session for EyeSwipe 1, indicating that it could increase with more practice.

In the subjective feedback questionnaire, participants considered EyeSwipe 1 to be more comfortable and faster than the dwell-based keyboard. When commenting about EyeSwipe 1, a participant said: "I was typing quickly and even if I missed a few of the letters the suggestions would come up for me. At first this method was difficult to get used to, but once I did, I was pretty accurate and fast at typing." Another participant added that EyeSwipe 1 "feels faster and more natural" than the dwell-based keyboard.

The blue path used to give visual feedback on the current gaze path was considered distracting by some participants. When gaze estimation was not accurate, participants often tried to correct its position. For this reason, we decided to test a more subtle visual feedback on EyeSwipe 2.

The entry rate in characters per minute increased with the length of the word. This observation is directly related to the requirement to select both the first and last letters with reverse crossing. The average duration of a reverse crossing was about 1,000 ms. The time spent on each character of the word during a gaze gesture was considerably shorter, so the longer the word, the less the reverse crossing time affected the overall result. The impact of the two reverse crossings was felt especially when entering short 1- or 2-letter words, in which case it was faster to use dwell-time, on average. This limitation was also reported by some participants, who suggested having a different mechanism for short words.

During the experiment the dwell-time was fixed at 600 ms, but with practice it would likely be set to a lower value had it been adjustable, as suggested by Majaranta et al. [34]. The 600 ms dwell-time was shorter than the average reverse crossing duration. Had a shorter dwell-time been used, the impact of the duration of the reverse crossing on the entry rate would be even more noticeable. The duration of the reverse crossing was not only longer than the dwell-time, but participants also commented that it was uncomfortable to perform frequent reverse crossings.


Another issue with using reverse crossing is the pop-up action button. When an action button popped up it often occluded some keys, making it difficult to select them. Besides, the action button is a dynamic graphic display object that, as discussed above, may distract and overwhelm the user's visual channel.

The eye tracker produced less accurate gaze estimations for some participants. In some cases the accuracy was still enough for them to be able to interact with the interfaces, but it eventually caused problems selecting specific keys. Though the gaze gestures themselves still worked even with the gaze estimation errors, the reverse crossing gesture requires accurate selection of a key. EyeSwipe 1 assumed the accuracy of the gaze estimation was sufficient to select a key to perform reverse crossing, which turned out not to be true in all cases.

6.3 EyeSwipe 2

EyeSwipe 2 is an enhancement over EyeSwipe 1 that addresses some of its limitations. The reverse crossing was modified to be performed by alternating between whole keyboard regions instead of buttons. By using regions instead of buttons it becomes more tolerant to gaze estimation errors. Gestures in EyeSwipe 2 are initiated and terminated by switching the gaze between regions, with Reverse Region Crossing, in a similar manner to Context Switching methods. The region switching event causes a response from the interface based on the last actions of the user on the previous region.

First letter candidates are selected either according to the first fixation on the gesture region, or by a hidden dwell-time. This choice was made to remove the pop-up action buttons, while maintaining a method to explicitly determine a start position. Whenever the user fixates a character for longer than the dwell-time the whole gesture is restarted. In Section 6.6 we discuss how this limitation can be overcome in future iterations, so EyeSwipe can become a dwell-free method.

All letters within a given distance of the user's gaze point at the end of the hidden dwell-time are considered as first letter candidates. The last letter is implicitly indicated when the user switches the gaze to the action region that will show the word candidates. The letters within a given distance of the last fixations in the gesture region are considered as last letter candidates. By using more than a single letter candidate we relax the accuracy requirement on the gaze estimation.
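A minimal sketch of how such candidate sets can prune the lexicon before the gesture is scored is given below; the function, word list, and candidate sets are illustrative assumptions rather than the thesis implementation.

    def prune_lexicon(lexicon, first_candidates, last_candidates):
        """Keep only the words compatible with the first- and last-letter candidate sets.

        lexicon:          iterable of words (e.g. a frequency-ranked word list).
        first_candidates: set of letters near the gaze when the gesture started.
        last_candidates:  set of letters near the last fixations in the gesture region.
        """
        return [
            word
            for word in lexicon
            if word and word[0] in first_candidates and word[-1] in last_candidates
        ]

    # Example: with a noisy start near 'h'/'j'/'n' and an end near 'o'/'p'/'l',
    # a word such as "hello" survives the pruning and is then ranked by its
    # gesture (path) score and language-model frequency.
    words = ["hello", "help", "jelly", "nacho", "no", "apple"]
    print(prune_lexicon(words, {"h", "j", "n"}, {"o", "p", "l"}))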

In an experiment comparing EyeSwipe 2 to EyeSwipe 1, participants achieved an average text entry rate of 14.59 wpm with EyeSwipe 2 and 12.58 wpm with EyeSwipe 1 after 75 minutes entering words with each method. The maximum entry rate achieved by a participant was 21.27 wpm with EyeSwipe 1 and 32.96 wpm with EyeSwipe 2. The average maximum entry rate per session was still increasing in the last session for both methods, indicating that it could increase with more practice.

Though the delete rate was relatively low using EyeSwipe 2 (approximately 3% in the last session), the cancel rate was considerably high (approximately 12% in the last session). The high cancel rate is partially explained by the implicit last letter selection. EyeSwipe 2 failed to select the correct last letter for approximately 30% of the gestures. Future iterations should look for alternatives to improve last letter selection, or consider all possible last letters so the target word is included in the gesture classification search space.

In the subjective feedback questionnaire, participants considered EyeSwipe 2 to be faster and more comfortable, but less accurate than its predecessor. The lower accuracy at classifying gestures was expected, as instead of single start and end characters, EyeSwipe 2 uses multiple candidates for start and end letters. The frequent eye movements to perform reverse crossings were reported as uncomfortable and difficult by some participants, who praised EyeSwipe 2 for requiring less frequent movements. A participant added: "[EyeSwipe 2] seems much more fluid and faster."

6.4 Design Implications

Throughout the interaction there are different factors that affect the perception of continuity of the text entry method. The time to filter and process gaze samples and fixations may be perceived by the user if it is too long. In this case, it will likely impact the user experience negatively. Predicting fixations before the end of saccades would reduce the impact of such processing times and make time for the interface to be ready for the next fixation. Also, there are method-specific factors that force the user to wait for the system, interrupting the continuity of the interaction.

Dwell-time is a waiting time that is intentionally used to avoid the Midas Touch Problem. However, the other text entry methods by gaze also require the user to wait or impact their entry rate in other ways. Interface animations, such as those used by Dasher [61] and StarGazer [15], also require the user to wait. Word candidate selection in EyeSwipe may increase the cognitive load, thus affecting the perception of continuity. Distracting feedback, such as the blue dot used in EyeSwipe 1, also breaks continuity. Because of the distraction, the user shifts their attention to a task other than entering text, for example, correcting the blue dot position.

In EyeSwipe, another factor that might affect continuity is planning the gaze gesture. Users may, for example, want to plan the whole gaze path before starting it, increasing the time needed to enter a word and breaking the "flow". Also, unintended actions cause interruptions and may be a source of frustration. In EyeSwipe 2, we observed a high cancel rate. Possible causes are problems in gesture classification or unintended activations of the action region during a gaze gesture. Another waiting time in EyeSwipe 2 is the hidden dwell-time, needed to restart a gesture.

When designing a text entry interface, the different factors that affect continuity must be considered. When pursuing so-called dwell-free text entry, other implicit waiting times may be unintentionally created. In some cases, the alternatives can take longer than the dwell-time itself. Reverse crossing is an example of a dwell-free alternative that takes longer than the average dwell-time. Speed, however, is not the only dimension to be optimized. Even if the dwell-free alternative takes longer than the dwell-time, it may have interesting properties that improve the overall user experience.

The factors that affect the continuity of the method must be identified during interaction design. The designer should then decide whether they are worth the discontinuity for the specific application. For example, producing short pieces of text, such as passwords or tweets, may require more accurate input, and slower text entry methods may be acceptable. When composing longer e-mails or writing a report, if the delay is too long the overall flow of the interaction may be affected.

6.5 Dynamic Calibration

Preliminary results showed evidence that word gesture data can be used to dynamically adjust the estimated gaze point. A constant linear displacement is computed using the pairs of points in the gesture and the ideal path. By applying this displacement to the estimated gaze position, the mean estimation error was reduced by approximately 0.1° on average.

Polynomial functions were not adequate to model the displacement, as they resulted in deteriorated gaze position estimates. Though the use of more complex functions may reduce errors, the gesture data needs to be better understood. In which cases is it safe to assume the user was looking at the characters? In which cases is the user expected to undershoot or overshoot a saccade? How can other pieces of information, such as the fixation durations, be used? Are longer fixations more reliable?

This dynamic calibration scheme can be extended to other methods that are not based on gestures. As the user types a word with a dwell-based keyboard, for example, the fixations on the keys can be used to update the model used for gaze estimation.

6.6 Future Research

A direct consequence of the hidden dwell time used to select the first letter in EyeSwipe 2 is that users can perform the gaze gesture as fast as they want, but not as slowly as they want. The fixations within the gesture must all be shorter than the hidden dwell time; otherwise the gesture is reset.

The need for the hidden dwell time also hampers the flow of the interaction. The user must stop and wait for the first letter to be selected. A possible extension is to remove this requirement by allowing the user to freely look at the gesture region before starting the gesture. The beginning of the gesture may be inferred based on features such as fixation duration or pupil dilation. If we are able to infer the starting position of the gesture, the user will be able to enter text at their own pace, either slowly looking at each letter, rapidly swiping through the letters, or a combination of both.

A similar inference method may be used for the last letter selection. Improving selection of the first and last letters, or other approaches that improve the gesture classification, may have a positive impact on user experience. Though gesture classification accuracy was considerably worse for EyeSwipe 2 compared to EyeSwipe 1, participants still liked it more. Better gesture classification can not only improve the text entry rate, by reducing instances of deletion and canceling, but also provide a more positive experience to the user.

By loosening the requirement for accurate gaze estimation it may be possible to use tighter keyboard layouts. Another future research direction is to use physical keyboards as the gesture region. This would reduce the required screen real estate to the action and text regions.

Finally, text entry rates are important; however, there may be other factors that affect the user's perception of speed. We want to investigate how the waiting required by dwell-time typing can alter the user's perception of speed, even if the final text entry rate is the same.


Appendix A

Publications

In this appendix we present a list of selected publications obtained during the development of this thesis.

We presented preliminary results on how the gap and overlap effects can be applied to improve gaze-based interaction in [I]. The gap and overlap effects are results from psychophysics that indicate that the latency to initiate a saccade is affected by the brightness of the targets. The results from this study influenced the choice for visual feedback on the buttons in EyeSwipe. Instead of increasing the brightness, we chose to dim the background of the buttons to indicate selections.

In [II] we discussed how gaze information can be applied to facilitate accessibility and digital inclusion. Head Movement and Gaze Input Cascaded pointing (HMAGIC), presented in [III], is one such example. The gaze position estimated by an eye tracker is used as a rough estimation for the mouse pointer position. After the pointer is moved to the gaze location, the user adjusts its position with head movements. A possible extension of EyeSwipe is to incorporate ideas from HMAGIC and combine gaze and head movements in the interaction. One example is to use head movements to indicate the limits of the gaze gesture.

We proposed Heatmap Explorer, an interactive gaze data visualization tool, in [IV]. Heatmap Explorer can be used for visualization of gaze data in both static images and videos of interfaces. The video from the interface is extracted from the video recorded by the front facing (scene) camera of a head-mounted eye tracker.

The results from the first experiment presented in this thesis, comparing EyeSwipe 1 to the fixed dwell-time keyboard, were presented in [V]. The final results from this thesis will be summarized in a journal paper to be submitted to ACM Transactions on Computer-Human Interaction (TOCHI).

List of Publications

I. Antonio Diaz Tula, Andrew T. N. Kurauchi, and Carlos H. Morimoto. 2013. "Facilitating gaze interaction using the gap and overlap effects." In CHI '13 Extended Abstracts on Human Factors in Computing Systems (CHI EA '13). ACM, New York, NY, USA, 91-96;

II. Andrew T. N. Kurauchi, Antonio Diaz Tula, and Carlos H. Morimoto. 2014. "Facilitating accessibility and digital inclusion using gaze-aware wearable computing." In Proceedings of the 13th Brazilian Symposium on Human Factors in Computing Systems (IHC '14). Sociedade Brasileira de Computação, Porto Alegre, Brazil, 417-420;


III. Andrew Kurauchi, Wenxin Feng, Carlos Morimoto, and Margrit Betke. 2015. "HMAGIC: Head Movement and Gaze Input Cascaded Pointing". In Proceedings of the 8th ACM International Conference on PErvasive Technologies Related to Assistive Environments (PETRA '15). ACM, New York, NY, USA, Article 47, 4 pages;

IV. Antonio Diaz Tula, Andrew Kurauchi, Flávio Coutinho, and Carlos Morimoto. 2016. "Heatmap Explorer: an interactive gaze data visualization tool for the evaluation of computer interfaces." In Proceedings of the 15th Brazilian Symposium on Human Factors in Computer Systems (IHC '16). ACM, New York, NY, USA, Article 24, 9 pages;

V. Andrew Kurauchi, Wenxin Feng, Ajjen Joshi, Carlos Morimoto, and Margrit Betke. 2016. "EyeSwipe: Dwell-free Text Entry Using Gaze Paths." In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI '16). ACM, New York, NY, USA, 1952-1956.


Bibliography

[1] Helmut Alt and Michael Godau. Measuring the resemblance of polygonal curves. In Proceedings of the Eighth Annual Symposium on Computational Geometry, SCG '92, pages 102–109, New York, NY, USA, 1992. ACM. 53

[2] Robert W Baloh, Andrew W Sills, Warren E Kumley, and Vicente Honrubia. Quantitative measurement of saccade amplitude, duration, and velocity. Neurology, 25(11):1065–1065, 1975. 1, 8

[3] Michael Barz, Florian Daiber, and Andreas Bulling. Prediction of gaze estimation error for error-aware gaze-based interfaces. In Proceedings of the Ninth Biennial ACM Symposium on Eye Tracking Research & Applications, ETRA '16, pages 275–278, New York, NY, USA, 2016. ACM. 9

[4] Nikolaus Bee and Elisabeth André. Writing with your eye: A dwell time free writing system adapted to the nature of human eye gaze. In Proceedings of the 4th IEEE tutorial and research workshop on Perception and Interactive Technologies for Speech-Based Systems: Perception in Multimodal Dialogue Systems, PIT '08, pages 111–122, Berlin, Heidelberg, 2008. Springer-Verlag. xiii, 2, 4, 22, 26, 81

[5] Margrit Betke, James Gips, and Peter Fleming. The camera mouse: visual tracking of body features to provide computer access for people with severe disabilities. IEEE Trans Neural Syst Rehabil Eng, 10(1):1–10, Mar 2002. 28

[6] Bruce Bridgeman, Derek Hendry, and Lawrence Stark. Failure to detect displacement of the visual world during saccadic eye movements. Vision Research, 15(6):719–722, 1975. 8

[7] Mark Davies. The Corpus of Contemporary American English (COCA): 520 million words, 1990-present. https://corpus.byu.edu/coca/, 2008. Accessed: 2017-11-22. 59

[8] Antonio Diaz-Tula, Filipe M. S. de Campos, and Carlos H. Morimoto. Dynamic context switching for gaze based interaction. In Proceedings of the Symposium on Eye Tracking Research and Applications, ETRA '12, pages 353–356, New York, NY, USA, 2012. ACM. 2, 18

[9] Antonio Diaz-Tula and Carlos H. Morimoto. Augkey: Increasing foveal throughput in eye typing with augmented keys. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI '16, pages 3533–3544, New York, NY, USA, 2016. ACM. xiii, 2, 3, 15, 26, 52, 81

[10] Andrew T. Duchowski. Gaze-Contingent Visual Communication. PhD thesis, Texas A&M University, 08 1997. 7

[11] Thomas Eiter and Heikki Mannila. Computing Discrete Fréchet Distance. Technical report, 05 1994. 53, 54

[12] Anna Maria Feit, Shane Williams, Arturo Toledo, Ann Paradiso, Harish Kulkarni, Shaun Kane, and Meredith Ringel Morris. Toward everyday gaze input: Accuracy and precision of eye tracking and implications for design. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, CHI '17, pages 1118–1130, New York, NY, USA, 2017. ACM. 45

[13] Wenxin Feng, Ming Chen, and Margrit Betke. Target reverse crossing: A selection method for camera-based mouse-replacement systems. In Proceedings of the 7th International Conference on PErvasive Technologies Related to Assistive Environments, PETRA '14, pages 39:1–39:4, New York, NY, USA, 2014. ACM. 28

[14] Maurice Fréchet. Sur quelques points de calcul fonctionnel. PhD thesis, Rendiconti del Circolo Matematico di Palermo, 1906. 22:1–74. 53

[15] Dan Witzner Hansen, Henrik H. T. Skovsgaard, John Paulin Hansen, and Emilie Møllenbach. Noise tolerant selection by gaze-controlled pan and zoom in 3d. In Proceedings of the 2008 Symposium on Eye Tracking Research & Applications, ETRA '08, pages 205–212, New York, NY, USA, 2008. ACM. xiii, 2, 23, 24, 26, 81, 85

[16] John Paulin Hansen, Anders Sewerin Johansen, Dan Witzner Hansen, Kenji Itoh, and Satoru Mashino. Command without a click: Dwell time typing by mouse and gaze selections. In M. Rauterberg et al., editor, Human-Computer Interaction, INTERACT '03, pages 121–128. IOS Press, 2003. 34, 35

[17] Sabrina Hoppe, Florian Daiber, and Markus Löchtefeld. Eype - using eye-traces for eye-typing. In Workshop on Grand Challenges in Text Entry. ACM International Conference on Human Factors in Computing Systems (CHI 13). ACM, 2013. 20

[18] Michael Xuelin Huang, Tiffany C.K. Kwok, Grace Ngai, Stephen C.F. Chan, and Hong Va Leong. Building a personalized, auto-calibrating eye tracker from user interactions. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI '16, pages 5169–5179, New York, NY, USA, 2016. ACM. 49

[19] Anke Huckauf and Mario Urbina. Gazing with peye: New concepts in eye typing. In Proceedings of the 4th Symposium on Applied Perception in Graphics and Visualization, APGV '07, pages 141–141, New York, NY, USA, 2007. ACM. 2, 21

[20] Anke Huckauf and Mario H. Urbina. Object selection in gaze controlled systems: What you don't look at is what you get. ACM Trans. Appl. Percept., 8(2):13:1–13:14, Feb 2011. 18

[21] Poika Isokoski. Text input methods for eye trackers using off-screen targets. In Proceedings of the 2000 Symposium on Eye Tracking Research & Applications, ETRA '00, pages 15–21, New York, NY, USA, 2000. ACM. 19

[22] Robert J. K. Jacob. What you look at is what you get: Eye movement-based interaction techniques. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '90, pages 11–18, New York, NY, USA, 1990. ACM. 2, 9

[23] Shahram Jalaliniya and Diako Mardanbegi. Eyegrip: Detecting targets in a series of unidirectional moving objects using optokinetic nystagmus eye movements. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI '16, pages 5801–5811, New York, NY, USA, 2016. ACM. 13

[24] Josh Kaufman. Google 10000 english. https://github.com/first20hours/google-10000-english, 2015. Accessed: 2015-09-22. 35

[25] Mohamed Khamis, Ozan Saltuk, Alina Hang, Katharina Stolz, Andreas Bulling, and Florian Alt. Textpursuits: Using text for pursuits-based interaction and calibration on public displays. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp '16, pages 274–285, New York, NY, USA, 2016. ACM. 22, 24, 49, 57


[26] Per Ola Kristensson and Keith Vertanen. The potential of dwell-free eye-typing for fast as-sistive gaze communication. In Proceedings of the Symposium on Eye Tracking Research andApplications, ETRA ’12, pages 241–244, New York, NY, USA, 2012. ACM. xvii, 3, 4, 15, 16,26, 27, 81

[27] Per-Ola Kristensson and Shumin Zhai. Shark2: A large vocabulary shorthand writing systemfor pen-based computers. In Proceedings of the 17th Annual ACM Symposium on User InterfaceSoftware and Technology, UIST ’04, pages 43–52, New York, NY, USA, 2004. ACM. 17

[28] Yi Liu, Chi Zhang, Chonho Lee, Bu-Sung Lee, and Alex Qiang Chen. Gazetry: Swipe texttyping using gaze. In Proceedings of the Annual Meeting of the Australian Special InterestGroup for Computer Human Interaction, OzCHI ’15, pages 192–196, New York, NY, USA,2015. ACM. 17

[29] Winwaed Software Technology LLC. Calculating word and n-gram statis-tics from a wikipedia corpora. http://www.monlp.com/2012/04/16/calculating-word-and-n-gram-statistics-from-a-wikipedia-corpora/, 2012. Accessed: 2015-09-22. 33, 35

[30] Jean Lorenceau. Cursive writing with smooth pursuit eye movements. Current Biology,22(16):1506 – 1509, 2012. 8, 22, 26

[31] Otto Hans-Martin Lutz, Antje Christine Venjakob, and Stefan Ruff. Smoovs: Towardscalibration-free text entry by gaze using smooth pursuit movements. Journal of Eye MovementResearch, 8(1), 2015. xiii, 2, 25, 26

[32] David MacKay. Automatic pointer calibration with dasher. http://www.inference.org.uk/dasher/development/Calibration.html, 2003. Accessed: 2017-11-22. 56

[33] I. Scott MacKenzie and R. William Soukoreff. Phrase sets for evaluating text entry techniques.In CHI ’03 Extended Abstracts on Human Factors in Computing Systems, CHI EA ’03, pages754–755, New York, NY, USA, 2003. ACM. 35, 59

[34] Päivi Majaranta, Ulla-Kaija Ahola, and Oleg Špakov. Fast gaze typing with an adjustable dwelltime. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI’09, pages 357–360, New York, NY, USA, 2009. ACM. 2, 3, 13, 14, 16, 17, 26, 36, 45, 81, 82,83

[35] Päivi Majaranta, Scott MacKenzie, Anne Aula, and Kari-Jouko Räihä. Effects of feedback anddwell time on eye typing speed and accuracy. Univers. Access Inf. Soc., 5(2):199–208, July2006. 13, 14

[36] Päivi Majaranta and Kari-Jouko Räihä. Twenty years of eye typing: Systems and design issues.In Proceedings of the 2002 Symposium on Eye Tracking Research & Applications, ETRA ’02,pages 15–22, New York, NY, USA, 2002. ACM. 3, 13

[37] Carlos H. Morimoto and Arnon Amir. Context switching for fast key selection in text entryapplications. In Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications,ETRA ’10, pages 271–274, New York, NY, USA, 2010. ACM. xiii, 2, 18, 26

[38] Martez E. Mott, Shane Williams, Jacob O. Wobbrock, and Meredith Ringel Morris. Improvingdwell-based gaze typing with dynamic, cascading dwell times. In Proceedings of the 2017 CHIConference on Human Factors in Computing Systems, CHI ’17, pages 2558–2570, New York,NY, USA, 2017. ACM. 2, 3, 14, 16, 26, 81

[39] Atsuo Murata. Eye-gaze input versus mouse: Cursor control as a function of age. International Journal of Human-Computer Interaction, 21(1):1–14, 2006. 9

[40] Emilie Møllenbach, John Paulin Hansen, and Martin Lillholm. Eye movements in gaze interaction. Journal of Eye Movement Research, 6(2), 2013. 7, 8, 13

[41] Marcus Nyström and Kenneth Holmqvist. An adaptive algorithm for fixation, saccade, and glissade detection in eyetracking data. Behavior Research Methods, 42(1):188–204, Feb 2010. 8

[42] Yun Suen Pai, Benjamin Outram, Noriyasu Vontin, and Kai Kunze. Transparent reality: Using eye gaze focus depth as interaction modality. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology, UIST '16 Adjunct, pages 171–172, New York, NY, USA, 2016. ACM. 13

[43] Diogo Pedrosa, Maria Da Graça Pimentel, Amy Wright, and Khai N. Truong. Filteryedping: Design challenges and user performance of dwell-free eye typing. ACM Trans. Access. Comput., 6(1):3:1–3:37, March 2015. 16, 20, 26, 27, 42, 80

[44] Ken Perlin. Quikwriting: Continuous stylus-based text entry. In Proceedings of the 11th Annual ACM Symposium on User Interface Software and Technology, UIST '98, pages 215–216, New York, NY, USA, 1998. ACM. 22

[45] Marco Porta and Matteo Turina. Eye-S: A full-screen input modality for pure eye-based communication. In Proceedings of the 2008 Symposium on Eye Tracking Research & Applications, ETRA '08, pages 27–34, New York, NY, USA, 2008. ACM. 20, 26

[46] Kari-Jouko Räihä and Saila Ovaska. An exploratory study of eye typing fundamentals: Dwell time, text entry rate, errors, and workload. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '12, pages 3001–3010, New York, NY, USA, 2012. ACM. 14, 26

[47] Chotirat Ratanamahatana and Eamonn J. Keogh. Everything you know about dynamic time warping is wrong. In Third Workshop on Mining Temporal and Sequential Data, 2004. 32

[48] David A. Robinson. The oculomotor control system: A review. Proceedings of the IEEE, 56(6):1032–1049, June 1968. 7

[49] Hiroaki Sakoe and Seibi Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, Feb 1978. 32

[50] Dario D. Salvucci. Inferring intent in eye-based interfaces: Tracing eye movements with process models. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '99, pages 254–261, New York, NY, USA, 1999. ACM. 16

[51] Sayan Sarcar, Prateek Panwar, and Tuhin Chakraborty. EyeK: An efficient dwell-free eye gaze-based text entry system. In Proceedings of the 11th Asia Pacific Conference on Computer Human Interaction, APCHI '13, pages 215–220, New York, NY, USA, 2013. ACM. 19, 26, 29

[52] Brian A. Smith, Xiaojun Bi, and Shumin Zhai. Optimizing touchscreen keyboards for gesture typing. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, CHI '15, pages 3365–3374, New York, NY, USA, 2015. ACM. 82

[53] R. William Soukoreff and I. Scott MacKenzie. Theoretical upper and lower bounds on typing speed using a stylus and soft keyboard. Behaviour & Information Technology, 14(6):370–379, 1995. 1

[54] R. William Soukoreff and I. Scott MacKenzie. Metrics for text entry research: An evaluation of MSD and KSPC, and a new unified error metric. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '03, pages 113–120, New York, NY, USA, 2003. ACM. 10

[55] M. Tall. NeoVisus: Gaze driven interface components. In Proceedings of the 4th Conference on Communication by Gaze Interaction (COGAIN 2008), pages 47–51, 2008. xiii, 29

[56] Outi Tuisku, Päivi Majaranta, Poika Isokoski, and Kari-Jouko Räihä. Now Dasher! Dash Away! Longitudinal Study of Fast Text Entry by Eye Gaze. In Proceedings of the 2008 Symposium on Eye Tracking Research & Applications, ETRA '08, pages 19–26, New York, NY, USA, 2008. ACM. 24, 26

[57] Mario H. Urbina and Anke Huckauf. Alternatives to single character entry and dwell time selection on eye typing. In Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications, ETRA '10, pages 315–322, New York, NY, USA, 2010. ACM. xiii, 21, 26, 29

[58] Mélodie Vidal, Andreas Bulling, and Hans Gellersen. Pursuits: Spontaneous interaction with displays based on smooth pursuit eye movement and moving targets. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp '13, pages 439–448, New York, NY, USA, 2013. ACM. 8, 9, 22, 24

[59] Oleg Špakov and Darius Miniotas. On-line adjustment of dwell time for target selection by gaze. In Proceedings of the Third Nordic Conference on Human-computer Interaction, NordiCHI '04, pages 203–206, New York, NY, USA, 2004. ACM. 2, 3, 14, 26

[60] David J. Ward, Alan F. Blackwell, and David J. C. MacKay. Dasher - a data entry interface using continuous gestures and language models. In Proceedings of the 13th Annual ACM Symposium on User Interface Software and Technology, UIST '00, pages 129–137, New York, NY, USA, 2000. ACM. 3, 23

[61] David J. Ward and David J. C. MacKay. Fast hands-free writing by gaze direction. Nature, 418(6900):838, 2002. xiii, 2, 3, 22, 23, 26, 81, 85

[62] Jacob O. Wobbrock. Measures of Text Entry Performance, pages 47–74. Elsevier Inc., 2007. 10

[63] Jacob O. Wobbrock, Brad A. Myers, and John A. Kembel. EdgeWrite: A stylus-based text entry method designed for high accuracy and stability of motion. In Proceedings of the 16th Annual ACM Symposium on User Interface Software and Technology, UIST '03, pages 61–70, New York, NY, USA, 2003. ACM. 3, 19

[64] Jacob O. Wobbrock, James Rubinstein, Michael W. Sawyer, and Andrew T. Duchowski. Longitudinal evaluation of discrete consecutive gaze gestures for text entry. In Proceedings of the 2008 Symposium on Eye Tracking Research & Applications, ETRA '08, pages 11–18, New York, NY, USA, 2008. ACM. 2, 3, 4, 9, 19, 26, 81

[65] Shumin Zhai and Per Ola Kristensson. The word-gesture keyboard: Reimagining keyboard interaction. Commun. ACM, 55(9):91–101, September 2012. 32