
Recognizing input for swipe based keyboards

Rémi de Zoeten
6308694

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
Jelle Zuidema
Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

June 28th, 2013


ABSTRACT

In this document we describe methods that can be used to implement the Swype keyboard. We developed various techniques for swipe trajectory simulation and techniques for discriminative recognition of an empirical swipe trajectory. We used a language model to improve recognition by observing the context of the swipe.


CONTENTS

1 introduction
1.1 Qualities of a swipe based text input program
1.1.1 Enhancing user experience
1.1.2 Challenges
1.2 Research question
1.3 Data
1.4 Thesis outline
2 useful literature for swipe recognition
3 path simulation
3.1 Naive path simulation (method 1)
3.2 Filled path simulation (method 2)
3.3 Center re-estimation simulation (method 3)
3.4 Smoothed path simulation (method 4)
4 comparing simulated and empirical paths
4.1 Dynamic time warping
4.2 Subsequence grouping
4.3 Greedy asymmetric dynamic time warping
4.4 Efficiency
4.4.1 Reducing input size
4.4.2 Indexing
4.5 Results
4.6 Future work
5 next word prediction
5.1 N-grams
5.1.1 Theory
5.1.2 Application
5.2 Results
5.3 Future work
6 conclusions
6.1 Future work
a example recognitions
a.1 Example one
a.2 Example two
bibliography


1 INTRODUCTION

There are several text input methods for touch screen devices. Of course there is the ‘classic’ qwerty style keyboard where a user taps a separate key for each character he or she wishes to type, but there are also newer alternatives. Some of these alternative keyboards have a different key alignment [6] or method of gesture recognition [7]. The prevalent alternative at this time is the ‘Swype’ keyboard by Swype Inc [1]. This input method features a qwerty keyboard, but instead of having to tap every character to enter a word the user ‘swipes’ a finger or stylus across the characters of the word. The method also permits the user to be careless in some cases and not swipe precisely across every intended letter, since a specific letter can be derived from the entire swipe instead of a single tap.

The following images are example swipes of the words serious and stupid.

Figure 1: Example swipe of the word serious

Figure 2: Example swipe of the word stupid

When imagining what the swipe paths of the words serious and stupid could be, the paths are probably very similar. When swiping a word, the finger crosses many letters that were not intended, and also crosses intended letters, sometimes in a straight line. In the word ‘stupid’ the letter u (which is intended) is crossed in a straight line, but so is the letter y, which is not an intended letter. The main differences between the two paths are around the letters p and d. Those are the parts of the paths that might be used to discriminate between the two words.

The current implementation of Swype is not open source and to date there seems to be no literature describing how such a keyboard could be implemented. Furthermore, the current implementation of Swype performs badly at recognizing double letters in words such as ‘letters’ and is (almost) not context sensitive, resulting in constructions such as ‘an dog’.

This document presents our research on how the Swype input method could be implemented.

1.1 qualities of a swipe based text input program

1.1.1 Enhancing user experience

Swype has been successfully used to set the world record for texting speed [2]. This means that swiping instead of typing can be a very fast input method. When swiping it is not always necessary to swipe across all the letters of the word for successful word recognition. If the imprecise swipe behavior of users falls into certain patterns, it can be used to enhance both the accuracy of the program and the ease with which the user swipes. In our research we evaluate some such techniques.

Some keyboard solutions [7] learn from previously typed messages, such as emails, that the user has produced in the past. The program analyzes the linguistic behavior of the specific user and assumes similar behavior in the future.

1.1.2 Challenges

There are several challenges that need to be considered when implementing a swipe based input program.

1. First of all, the program has to perform at a certain speed in order to be useful. This can be challenging when targeting mobile devices because computational resources are limited in this setting.

2. Another challenge is correct recognition of the user input. In the official Swype implementation the following three recognitions are possible: the best option (case 1) is that the program recognizes the correct word. In this case, the user continues to swipe the next word. It could also be that the program is uncertain about the word that was swiped, in which case it presents four options. If the intended word is among these options (case 2) the user selects the word. In that case, although the program did not function perfectly, the mistake was nevertheless corrected. If the intended word is not among the four options (case 3), then the program has failed and the user has to try again. This means that the program does not simply recognize correctly or incorrectly; there is also a middle ground of sorts.

1.2 research question

Our main research question is:

How to implement a swipe based text input method?

This question will be answered using the following two questions:

1. How to select a set of candidate words provided an empirical path?

2. How to select the correct word from this set provided the previously entered words?

1.3 data

A discriminative approach is used to select a set of candidate words. This means that an empirical path is compared to a set of reference paths. If two paths are ‘similar’ by some measure then the word represented by the reference path is considered a candidate word.

In order to generate empirical paths that function as test data, a Swype interface replica is used. Sander Latour has built an open source keyboard [8] that registers a swipe over the keyboard and outputs the sequence of coordinates that the user has swiped across. A series of small texts is swiped on the keyboard and the sequence of coordinates is saved along with an image of the swipe pattern. The test data covers untidy swipes (bent lines, missing letters) as well as more accurate swipe inputs. In the process of developing the methods described in this document, more ‘fresh’ data was generated to perform new tests.

The following sequences are coordinates from the empirical paths depicted in figures 1 and 2.

#serious
339,201 341,201 346,200 352,198 362,194 ... 396,231 394,231 393,231 393,232 391,232

#stupid
307,215 311,209 315,204 321,199 326,193 ... 335,243 333,243 331,244 330,244 329,244

1.4 thesis outline

In chapter two, ‘useful literature for swipe recognition’, some literature that could be relevant for the task is reviewed. In chapter three, ‘path simulation’, we evaluate different strategies to generate a simulated path for a word. Chapter four, ‘comparing simulated and empirical paths’, describes different methods to compare the simulated paths with an empirical path that the user has produced. This is then used to find a set of words that are likely to have been intended by the user. In the fifth chapter, ‘next word prediction’, we integrate a language model to disambiguate among likely candidates. Chapter six, ‘conclusions’, sums up our findings and describes future work.

Page 8: Recognizing input for swipe based keyboards · 4.6 Future work 21 5 next word prediction22 5.1 N-grams 22 5.1.1 Theory 22 5.1.2 Application 23 ... well as more accurate swipe inputs

2 USEFUL LITERATURE FOR SWIPE RECOGNITION

Text input on touch screens is a recent phenomenon and there is still very little literature about this specific subject. Pirkl has documented a method for touch screen input [12] which is different from the Swype input method. Pirkl proposes a different keyboard layout which he calls a vector keyboard. The technical difference between a vector keyboard and a qwerty keyboard is that with the vector keyboard fewer unintended letters are passed over when swiping a word, yielding less noise in the input. The keyboard is implemented by representing the swipe gesture as a small set of vectors. These vectors are identified by length and angle and then represented by a point, which is then used to recognize which letter the user intended, because a vector ‘points’ somewhere. Pirkl’s method consists of identifying a select number of words that the user is most likely intending to employ, and of using a bigram language model to disambiguate. He also compares the performance of the vector keyboard to other keyboards.

A swipe over the keyboard could be considered an extreme example of a misspelled word. Peter Norvig has a page dedicated to spell correctors [4]. He classifies various misspellings in 5 categories and shows how each of these misspellings can be corrected by a program. Spell correction and Swype both have to perform in real time. Norvig shows how pre-calculations for every category of misspelling can be made on a reference set of words.

Comparing a user produced, empirical path with a simulated path is an example of comparing two sequences. A classic, generic algorithm used for this is known as dynamic time warping, which is used when comparing DNA sequences, in speech recognition and in other fields of science, as explained by Müller [3]. Dynamic time warping has a complexity of O(n × m) for input lengths n and m, which makes it a computationally intensive algorithm. Müller starts by providing a formal definition of the algorithm and follows with an explanation of the mathematical representations. He also addresses performance optimizations by including ‘slope constraints’, which are constraints on the number of delete and insert operations. Another method that is explained is the pre-computing and pre-ordering of reference sequences. This can be used to iterate over a subset of these sequences when performing a comparison, as opposed to the complete set, which saves computations at runtime.

Stolcke [5] uses Probabilistic Context Free Grammars (PCFGs) for next word prediction, which the Swype application also has to do. He describes how PCFGs can be used to calculate prefix probabilities. For most PCFGs this is not a trivial problem because the grammars can produce sentences of infinite length; a prefix could therefore be the prefix of an infinite number of sentences. Nonetheless, Stolcke shows how it is possible to calculate prefix probabilities.

3 PATH SIMULATION

The first step in interpreting an empirical path is to try to identify a small set of words that are likely to be the intended word given a user generated swipe. The size of this set will be a minimum of one word and a parameterized maximum, to be determined empirically depending on an error rate. A language model will then use the context of the empirical path to calculate a likelihood for every word in the set. The details of this model are explained in chapter five.

To select a set of candidate words, the user generated swipe path will be compared to expected swipe paths of a set of reference words. We have implemented several algorithms that generate a predicted path for a given word and used these predicted paths to compare with empirical paths using different comparison algorithms.

In predicting a swipe path we have treated double letters of a word as a position the user would only swipe over once. This is different from the official implementation of Swype, which prefers the user to pass the ‘l’ in ‘hello’ more than once. Omitting double letters could lead to faster user input and a better user experience because the user needs to swipe fewer letters. The result is that the expected paths of the words ‘tool’ and ‘toll’ are the same and would rank the same in any of the used comparison algorithms. This means that there might be more noise in the set of candidate words, but this is not certain. If the user were expected to circle around a letter to indicate that the letter is double (as in the official Swype implementation), then the path would be longer and possibly more prone to other errors.

Our reference set contains 16486 words, of which 18.77 percent have double letters. Let removeDoubles be a function that maps a string to another string with all double letters removed. The word ‘hello’ would be mapped to ‘helo’. The following table shows the probabilities of two words being mapped to the same string by the removeDoubles function. Here n stands for the number of words getting mapped to the same string, p for the probability that n words map to the same string and p_normalized for the probability p, normalized for word frequency.

n    p         p_normalized
1    0.98483   0.95875
2    0.01480   0.04087
3    0.00036   0.00036
4+   0.00000   0.00000

According to the calculated probabilities, about one in every twenty predicted swipe paths is ambiguous when the removeDoubles function is used. Furthermore, if a predicted path is ambiguous then in 99 percent of the cases there are only two words that are represented by the predicted path, and never more than three. When a predicted path is ambiguous a language model should be used to disambiguate.
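A minimal sketch of such a removeDoubles function, assuming it simply collapses consecutive repeated letters:

def remove_doubles(word):
    # Collapse consecutive repeated letters: 'hello' -> 'helo'.
    result = []
    for letter in word:
        if not result or result[-1] != letter:
            result.append(letter)
    return "".join(result)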

3.1 naive path simulation (method 1)

There could be many different methods to simulate a path for comparison with an empirical path. Some simulations might be more discrete in their representation of the path. Other path simulations might anticipate some behavior that users exhibit.

The naive, and arguably the simplest, path simulation is defined as the sequence of coordinates of the centers of the letters of a word. The following image is an abstract representation of a naive path. The path starts at the green dot and ends at the blue dot.

Figure 3: Example naive path

3.2 filled path simulation (method 2)

In the ‘filled path simulation’ it is assumed that the points between the centers of the expected letters hold relevant information. They should of course not be too far from a straight line between two letters.

In generating a ‘filled path simulation’ we start with a naively simulated path. When the distance between two consecutive coordinates of the path is greater than a certain threshold, the average of both coordinates is calculated and placed between these two coordinates. This process is repeated until no distance between two consecutive coordinates is greater than the threshold.

This is a parameterized path simulation. The appropriate value of the threshold may depend on the comparison algorithm and affects the performance of the program. The smaller this threshold is, the more ‘fine grained’ the simulation, but this also requires more computations to do a comparison.

If a naive path is represented by the following sequence of coordinates:

{(1, 2), (9, 8)}

and the threshold value is set to 4, then the ‘filled path’ would look like this:

{(1, 2), (5, 5), (9, 8)}

A threshold value of 3 would yield the following ‘filled path’:

{(1, 2), (3, 3), (5, 5), (7, 6), (9, 8)}
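A minimal sketch of this subdivision, assuming Euclidean distance and real-valued midpoints (with a different distance measure or with rounding, the exact coordinates may differ slightly from the worked example above):

import math

def fill_path(path, threshold):
    # Insert midpoints between consecutive coordinates until no gap
    # is larger than the threshold (the threshold must be positive).
    def fill_segment(a, b):
        if math.dist(a, b) <= threshold:
            return [b]
        mid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
        return fill_segment(a, mid) + fill_segment(mid, b)

    filled = [path[0]]
    for a, b in zip(path, path[1:]):
        filled.extend(fill_segment(a, b))
    return filled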

The following images show filled paths for the naive path depicted in figure 3.

Figure 4: Example filled path

Figure 5: More precise filled path

3.3 center re-estimation simulation (method 3)

This simulation is intended to be a variation of the previous simulation. The ‘center re-estimation’ simulation assumes that when a user swipes he will tend to cut corners when possible, as opposed to ‘correctly’ swiping over the center. This reduces the length of the swipe. An example of corner cutting can be observed in the image below.

Figure 6: Swipe path of the word ‘hello’

In this swipe path the centers of the letter ‘e’ and the letter ‘l’ are not reached, as the user’s finger cuts the corner and heads for the next letter before reaching the center.

The ‘center re-estimation’ simulation leverages this ‘sloppy’ swipe behavior that users exhibit. It does this by not connecting the true centers of the letters but by re-estimating the centers that the user will try to swipe towards, given the previous and the next letter. This is done as follows: the program first assumes a perfect, straight swipe path from one center to the other. Then the program calculates at what coordinates the swipe enters and leaves the intended letter surface. Now the program has a triplet of coordinates: (enter, center, leave). Instead of the original center a weighted average of these three coordinates is taken. A sequence of these averages forms a path from letter to letter. Then, just as in the ‘filled path’ method, the gaps between coordinates are filled until a certain threshold is met. The average of the three coordinates can be calculated with different weights for each coordinate. With empirical data it was discovered that the weight distribution of (enter, center, leave) should be about (1, 10, 3). This suggests that users do indeed cut corners, but that on average the effect is not very large.
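A minimal sketch of the re-estimation step for a single key, assuming the enter and leave coordinates have already been computed by intersecting the straight center-to-center segments with the key surface:

def reestimate_center(enter, center, leave, weights=(1, 10, 3)):
    # Weighted average of the (enter, center, leave) triplet; the weights
    # (1, 10, 3) are the values reported above.
    points = (enter, center, leave)
    total = sum(weights)
    x = sum(w * p[0] for w, p in zip(weights, points)) / total
    y = sum(w * p[1] for w, p in zip(weights, points)) / total
    return (x, y)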

The following image depicts the surface of a letter and the triplet (enter, center, leave) as (green, black, blue). The red dot represents the re-estimated center of the letter for the particular context that the letter is in.

Figure 7: Key surface with re-estimated center

Notice that this model does not change the center when there is a straight line through a letter, for example the ‘u’ in ‘your’, because enter and leave are opposite, and that the effect of the averaging is somewhat amplified in the ‘o’ in ‘your’ because enter and leave are the same (this example assumes a qwerty keyboard). The result of the center re-estimation is also not limited to the area near the letters, but can affect the entire path, even when letters are far apart, because the path between two centers is different when the centers are re-estimated.

3.4 smoothed path simulation (method 4)

In line with the ‘center re-estimation’ simulation, this simulation is intended to leverage the tendency of users to cut corners on the intended letters. This simulation uses a ‘filled path’ simulation and smooths it, resulting in a path similar to the ‘center re-estimation’ simulation.

The smoothing of the ‘filled path’ is done by averaging every point on the filled path with its neighbor points. The smoothed value of a point at index i is the average of the points at i - 1 and i + 1. Because these neighbors do not exist for the first and last points, the first and last elements of the original path are copied unchanged.

The most basic implementation would be the following procedure:

def average(a, b):
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)

filledPath = [(2, 3), (6, 7), (6, 15), (12, 18)]
smoothedPath = [filledPath[0]]                    # first point is copied unchanged
for i in range(1, len(filledPath) - 1):
    averagePoint = average(filledPath[i - 1], filledPath[i + 1])
    smoothedPath.append(averagePoint)
smoothedPath.append(filledPath[-1])               # last point is copied unchanged

The contents of the ‘smoothedPath’ variable are now:

[ (2, 3), (4, 9), (9, 12.5), (12, 18) ]

In this basic example only two neighbors are used. It is also possible to weigh in more neighbors and to weight each neighbor coordinate differently depending on its distance. This could potentially become a complex scheme.

The scheme that gave the best results was the following: take n neighbors to the left and m neighbors to the right. The neighbors have a distance d to the entry that is being smoothed, so that the closest neighbors have d = 1 and the furthest d = n and d = m respectively. Their weights are n - d + 1 and m - d + 1 respectively. This means that the closer a neighbor is, the more it weighs in. It is possible to square or square root the weights of the neighbors, but this did not improve results in our experiments. We found that a weight distribution of (n, m) = (2, 6) performs well on our data. This means that coordinates further on in the sequence need to weigh more than coordinates earlier in the sequence, which is in line with our findings from the ‘center re-estimation’ simulation.
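A sketch of this weighting scheme; how the first and last points are handled (here they simply use whatever neighbors exist) is an assumption:

def smooth_path(path, n=2, m=6):
    # Smooth each point with up to n left and m right neighbors, weighted
    # by n - d + 1 and m - d + 1 for a neighbor at distance d.
    smoothed = []
    for i in range(len(path)):
        sx = sy = total = 0.0
        for d in range(1, n + 1):
            if i - d >= 0:
                w = n - d + 1
                sx, sy, total = sx + w * path[i - d][0], sy + w * path[i - d][1], total + w
        for d in range(1, m + 1):
            if i + d < len(path):
                w = m - d + 1
                sx, sy, total = sx + w * path[i + d][0], sy + w * path[i + d][1], total + w
        if total == 0:
            smoothed.append(path[i])              # single-point path
        else:
            smoothed.append((sx / total, sy / total))
    return smoothed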

The following image is an abstract representation of a filled path in black and its corresponding smoothed path in red.

Figure 8: Filled and smoothed paths

4 COMPARING SIMULATED AND EMPIRICAL PATHS

A comparison method calculates an error or distance between a simulated path and an empirical (user generated) path. With this distance measure it is possible to estimate a likelihood that a given word is the intended word.

A set of reference words W is used to build a list of tuples (word, predicted path). Let D(w_i, E) be a function that calculates the distance between word number i and an empirical path E. The chance that a word w is represented by an empirical path E is calculated as follows:

P(w \mid E) = \frac{D(w, E)}{\sum_{w_i \in W} D(w_i, E)}

To determine the intended word provided an empirical path, all reference words w are ordered by their chance P(w|E). The top n words are then considered likely candidates and are evaluated by a language model. The value of n is to be determined empirically.

4.1 dynamic time warping

Dynamic time warping is a generic method that can be used to compare two signals and measure their difference. This method is used in speech recognition, in the comparison of DNA sequences, and for other kinds of signals [3].

Dynamic time warping can be used to compare two sequences n and m. Entries from n and m can be deleted or additional entries can be inserted. Entries from n and m can also be compared with a cost function that needs to be provided to the dynamic time warping algorithm. There is a ‘cost’ associated with the delete and insert operations, which are parameters to the dynamic time warping algorithm, and the result of the cost function is also considered a cost. The algorithm finds the way to combine n and m using the insert, delete and compare operations that minimizes the summed cost of the operations. In our case, the compare operation is implemented by a cost function that calculates the distance between two entries (coordinates) from sequences n and m. Distance measures of interest for our research include the Manhattan distance, squared Manhattan distance, Euclidean distance and squared Euclidean distance between two coordinates. Empirical tests show that the Manhattan distance performs best on our test data.
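A sketch of the classic dynamic programming formulation, with the Manhattan distance as the cost function and a single gap penalty standing in for the insert and delete costs mentioned above:

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def dtw_distance(a, b, cost=manhattan, gap_penalty=0.0):
    # Classic O(len(a) * len(b)) dynamic time warping between two
    # coordinate sequences a and b.
    INF = float("inf")
    dp = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    dp[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            c = cost(a[i - 1], b[j - 1])
            dp[i][j] = c + min(dp[i - 1][j - 1],               # match
                               dp[i - 1][j] + gap_penalty,     # delete
                               dp[i][j - 1] + gap_penalty)     # insert
    return dp[len(a)][len(b)]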

Dynamic time warping was tested in combination with all path simulation methods. The naive path simulation performed inaccurately and the other path simulations performed more accurately. Testing was limited because DTW is a computationally intensive algorithm. It has a time and a space complexity of O(n × m) for input lengths n and m. The complexity of this algorithm is especially a problem in resource constrained settings, such as mobile devices. To reduce the number of computations the number of coordinates in the simulated path is reduced, which is controlled by a threshold in all path simulations except the naive path simulation. This often reduces the accuracy of the path simulation, which means that a less accurate comparison can be made. Another method to reduce computations is to reduce the number of coordinates of the empirical path (user input path). More on this in the section ‘Efficiency’. Although limiting the size of the inputs to the DTW algorithm reduces the strain on computational resources, the accuracy of the comparison suffers from this speed up. If the input sizes are reduced too much then the algorithm becomes almost entirely inaccurate. DTW could be a good method, but it is computationally too intensive.

4.2 subsequence grouping

For this method the naive path simulation is used. The points from the empirical path are grouped so as to belong to a certain point in the simulated path. This is done in a single, simultaneous pass over the empirical and the simulated path. Outliers are removed from the groups until the groups are of a certain size, and the average of the remaining group is compared with the coordinate from the simulated path that the group belongs to. The sum of the distances between the coordinates from the simulation and the groups assigned to them is the error distance between the empirical and the simulated paths.

The following image shows a naive path represented by black dots and an empirical path that has been grouped. The red dots do not correspond to an entry in the simulated path and the green dots belong to the nearest black dot.

Figure 9: Example of a clustered empirical path

Removing outliers from a group is done by finding the center of gravity of the group and then removing the element of the group that is furthest from it. Example pseudocode:

function removeOutliers(group, thresholdSize):
    while size(group) > thresholdSize:
        center = averageCoordinate(group)
        maxIndex = -1
        maxValue = -∞
        for i in range(length(group)):
            d = distance(center, group[i])
            if d > maxValue:
                maxIndex = i
                maxValue = d
        group.pop(maxIndex)
    return group

As a rule, if any of the groups is empty, the error is set to ∞. This prevents the swipe pattern of ‘charm’ from being recognized as ‘charmed’, because the last two groups (representing ‘e’ and ‘d’) would be empty.

The complexity of the subsequence grouping is better than that of dynamic time warping. For an empirical path of length n and a threshold t, a naive implementation of the algorithm has a worst case runtime of O(n × (n − t)). Because the presence of empty groups can be recognized quickly and results in an error of ∞, the program can often stop the analysis before it finishes. This algorithm runs much faster than dynamic time warping.
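A sketch of the grouping comparison as a whole; the exact simultaneous pass over the two paths is an assumption (here each empirical point is assigned to the current or the next simulated point, never moving backwards):

import math

def subsequence_grouping_error(empirical_path, simulated_path, group_size):
    def mean(points):
        return (sum(p[0] for p in points) / len(points),
                sum(p[1] for p in points) / len(points))

    # Assign every empirical point to a simulated point in one forward pass.
    groups = [[] for _ in simulated_path]
    j = 0
    for point in empirical_path:
        while (j + 1 < len(simulated_path) and
               math.dist(point, simulated_path[j + 1]) < math.dist(point, simulated_path[j])):
            j += 1
        groups[j].append(point)

    error = 0.0
    for center, group in zip(simulated_path, groups):
        if not group:
            return float("inf")                       # empty-group rule
        while len(group) > group_size:                # remove outliers
            c = mean(group)
            group.remove(max(group, key=lambda p: math.dist(c, p)))
        error += math.dist(center, mean(group))
    return error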

Using the subsequence grouping method to compare two signals means throwing away most of the user input, keeping only the most important coordinates in the respective groups to measure distance. Although this is unfortunate, the algorithm performs better than dynamic time warping, because using dynamic time warping requires the input signals to be shortened. The subsequence grouping method focuses on the letters of the word and on whether the user input reaches those letters, but focuses less on the path as a whole. As a result the trajectory of the word ‘spread’ could be interpreted as ‘sped’, because the trajectory of ‘spread’ covers all letters of the word ‘sped’ (but not vice versa). This effect is reduced by letting the size of the coordinate groups be larger. This would mean that when comparing user input of ‘spread’ with the expected path of ‘sped’, more coordinates around the letter ‘a’ will be weighed in with the groups of ‘e’ and ‘d’, pulling the centers of those groups away from the centers of those letters. The adverse effect of increasing the cluster sizes is that more noise is added because coordinates further away from the reference coordinate are weighed in. It might be useful to change the size of the clusters dynamically depending on the context, but in the absence of a sensible method to achieve this, a fixed cluster size is determined empirically.

Figure 10: Example swipe of ‘sped’

Figure 11: Example swipe of ‘spread’

4.3 greedy asymmetric dynamic time warping

The subsequence grouping algorithm had the advantage of using less runtime and space compared to dynamic time warping. Unfortunately the subsequence grouping algorithm discards some information in the grouping process. When using dynamic time warping, some information is also left unused due to the performance limitations of that algorithm.

The greedy asymmetric dynamic time warping (from now on ‘GAD’ as in Greedy Asymmetric DTW) algorithm can work with all path simulations except the naive path simulation. The idea is to use a fine grained path simulation, one with more points than the user input, and to map every user input coordinate to one coordinate in the simulated path. The error is the sum of the distances between the pairs of coordinates that are mapped to each other. This makes the algorithm somewhat similar to dynamic time warping, but there are some significant differences. ‘Delete’ and ‘insert’ operations are limited and their cost cannot be set in the same way as in many dynamic time warping implementations. A user input sequence N of length n can only be compared with a simulated path M of length m if n < m. There will be m − n deletions from M and there will be no other delete or insert operations performed on N or M. These constraints are a significant limitation compared to the dynamic time warping algorithm. In return, the algorithm has a runtime that is linear in the lengths of the input paths and uses O(1) space.

The algorithm can be constructed using two functions, nextBestMatch and gadDistance. nextBestMatch takes three arguments: a coordinate from the user input, a simulated path, and an index that points to an entry in the simulated path. The function uses hill climbing to find and return the index of the next coordinate in the simulated path with the smallest distance to the coordinate from the user input that it is trying to match. See the following pseudocode:

function nextBestMatch(coordinate, simulatedPath, index):
    bestDistance = distance(simulatedPath[index], coordinate)
    while index < length(simulatedPath) - 1:
        nextDistance = distance(simulatedPath[index + 1], coordinate)
        if nextDistance > bestDistance:
            break
        index += 1
        bestDistance = nextDistance
    return bestDistance, index

The gadDistance function uses nextBestMatch to sum the best match distances.

function gadDistance(empiricalPath, simulatedPath):
    sumDistance = distance(empiricalPath[0], simulatedPath[0])
    j = 0
    for i in range(1, length(empiricalPath)):
        d, j = nextBestMatch(empiricalPath[i], simulatedPath, j)
        sumDistance += d
    # charge the unmatched tail of the simulated path against the last input point
    for i in range(j + 1, length(simulatedPath)):
        sumDistance += distance(empiricalPath[length(empiricalPath) - 1], simulatedPath[i])
    return sumDistance

Notice the second for loop in the gadDistance function. This extra loop prevents a swipe input for the word ‘less’ from being matched with the simulated path of the word ‘lessons’, because the distances of ‘ons’ to ‘s’ will be added to the sum distance. Similarly, the sumDistance is initialized as the distance between the two first items of the sequences.

4.4 efficiency

The program has to be efficient enough to perform in real time on a mobile device. An easy speed up is to simulate the paths of the reference words once and keep the simulated paths in memory. Efficiency can also be gained by reducing the length of the input sequences and by pre-ordering the reference words so that only a portion of the reference words is eligible for comparison with the empirical path.

4.4.1 Reducing input size

Reducing the length of the simulated paths can be done by setting a parameter that specifies the precision of the simulation. A greater precision yields a more accurate comparison, but also increases the number of computations. The size of an empirical path can be reduced by iterating over the empirical path coordinates and determining the distance between two consecutive coordinates. If the distance is lower than a certain threshold, the next coordinate is left out. The following function returns the reduced empirical path.

function reduceEmpiricalPath(empiricalPath, distanceThreshold):
    last = empiricalPath[0]        # the first entry is always kept
    reducedPath = [ last ]
    for coordinate in empiricalPath:
        # keep a coordinate only when it is far enough from the last kept one
        if distance(last, coordinate) >= distanceThreshold:
            reducedPath.append(coordinate)
            last = coordinate
    return reducedPath

The idea behind this is that if two consecutive points are very close to each other, then one will hold almost as much information as both. The reduction in computations that is gained by reducing the input lengths depends on the comparison algorithm that is used. Dynamic time warping has a complexity of O(n × m), which means that it requires a quadratic amount of computations. As an example, if the input lengths (n, m) are (9, 9) and can be reduced to (3, 3), then the different algorithms have different speed ups: the dynamic time warping algorithm has a 9 times speed up, and the subsequence grouping and GAD algorithms have a 3 times speed up. This also works the other way around; if the precision of the simulated and empirical paths is increased 3 times, then dynamic time warping requires 9 times as many computations and the subsequence grouping and GAD algorithms only 3 times as many.

4.4.2 Indexing

In a naive implementation an empirical path is compared with all the simulated paths. But with a well chosen data structure this can be avoided. We have developed a data structure that distributes the reference words over various ordered sets depending on what letters the words begin and end with. If the user starts an empirical path near the letter n and finishes near the letter e, then the path will not match the word ‘electron’. The start and end of an empirical path are assumed to be in the proximity of the start and end of the intended word. The reference words can first be divided into sets depending on their first letter. These sets contain other sets depending on the last letter of the word. In this setting the reference words are divided over 26 sets, each containing another 26 sets. Let a set of words that start with the letter A and end with B be called ‘set(A)(B)’. There exist 26 × 26 = 676 such sets. If the start of an empirical path is near the letters e, r and the end near the letters b, n, then the empirical path need only be compared with reference words in the following sets: set(e)(b), set(e)(n), set(r)(b), set(r)(n), which are only 4 out of the 676 possible sets. The assessment that the start point of an empirical path is near a letter depends on a threshold that can be set. In our research we chose that a start point is near a letter if it is no more distant than 1.1 times the width of a letter. End letters can be distanced up to 1.2 times the width of a letter. The values 1.1 and 1.2 were the smallest values for our test set that did not affect the accuracy of the program for the various path simulations and methods of comparison. The smaller the values are, the faster the program runs, because fewer sets and thus fewer words are considered. However, it should be noted that these sets can vary in size and that, for example, set(x)(x) is an empty set.

Another property that can be useful is the distance the path covers. Let the walking distance of a path be the sum of the distances between consecutive coordinates within the path. For example, the word electron could have a walking distance of 30 and the word earn a walking distance of 9. Notice that both electron and earn are in set(e)(n). If an empirical path has a walking distance of 28 it is not necessary to compare it to the word earn, because their walking distances are very distinct, but the word electron with a distance of 30 is still a candidate. To make use of this, all walking distances are pre-calculated and stored with the simulated paths. The walking distance of a simulated path must be no smaller than x times the walking distance of the empirical path and no larger than y times it. For our test set the values of (x, y) were set to (0.7, 2.0), because these values posed the greatest restriction on walking distances without affecting the accuracy of the program for the various path simulation and path comparison algorithms. For an efficient implementation, the entries in sets like set(A)(B) are ordered by walking distance so that a nearest match can be found in O(log(n)) time without reviewing the entire set.
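A sketch of this indexing structure, assuming a walking_distance helper and using binary search to restrict each bucket to the allowed range of walking distances:

from bisect import bisect_left, bisect_right
from collections import defaultdict

def build_index(simulated_paths, walking_distance):
    # Bucket reference words by (first letter, last letter); each bucket is
    # kept sorted by walking distance. simulated_paths maps word -> path.
    index = defaultdict(list)
    for word, path in simulated_paths.items():
        index[(word[0], word[-1])].append((walking_distance(path), word))
    for bucket in index.values():
        bucket.sort()
    return index

def candidate_words(index, start_letters, end_letters, empirical_distance, x=0.7, y=2.0):
    # Collect words from the matching buckets whose walking distance lies
    # within [x, y] times the walking distance of the empirical path.
    candidates = []
    for s in start_letters:
        for e in end_letters:
            bucket = index.get((s, e), [])
            lo = bisect_left(bucket, (x * empirical_distance, ""))
            hi = bisect_right(bucket, (y * empirical_distance, chr(0x10FFFF)))
            candidates.extend(word for _, word in bucket[lo:hi])
    return candidates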

4.5 results

The path simulation methods described in chapter 3 were combined with the comparison methods described in this chapter. Every empirical path in our test set is compared with each word w_i in our set of reference words R that falls in the relevant sets as described in the section ‘Efficiency’ (4.4). When comparing an empirical path e with a word w_i in R using distance measure D, the probability that word w_i is the word that is intended by the user can be expressed as follows:

P(w_i \mid e) = \frac{D(e, w_i)}{\sum_{w \in R} D(e, w)}

Using this formula a candidate set C_d of size n is constructed, containing the n most likely candidates, with probabilities normalized so that the sum probability is 1. This set can be evaluated directly or can be reviewed again by a language model. When evaluating the set C_d there are three categories that are considered. The category First indicates that the candidate with the highest estimated probability is the correct word. First four indicates that the first answer was not the correct word, but that the correct word is among the four words with the highest probability. This means that the program did not function correctly, but that the mistake could potentially be corrected by user intervention. The Wrong category indicates that the program did not correctly recognize the empirical path and that the mistake could not be corrected. The following tables show how the various path simulation and path comparison methods score. The test set contains 263 words.

Path type      First %   First four %   Wrong %   Time (seconds)
filled         15.9      18.6           65.3      150
re-estimated   19.0      17.1           63.8      120
smoothed       7.2       14.4           78.3      180

Table 1: DTW

Path type      First %   First four %   Wrong %   Time (seconds)
naive          18.2      19.3           62.3      3.0
filled         18.2      19.3           62.3      3.0
re-estimated   17.4      20.9           61.5      3.0
smoothed       6.0       14.8           79.0      3.0

Table 2: Subsequence grouping

Path type      First %   First four %   Wrong %   Time (seconds)
filled         70.7      23.1           6.0       2.4
re-estimated   68.4      24.7           6.8       2.6
smoothed       71.1      21.2           7.6       2.4

Table 3: GAD
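The three evaluation categories can be computed from ranked candidate lists as in the following sketch (the input format is an assumption):

def evaluate(recognitions):
    # recognitions: list of (intended_word, ranked_candidate_words) pairs.
    counts = {"First": 0, "First four": 0, "Wrong": 0}
    for intended, candidates in recognitions:
        if candidates and candidates[0] == intended:
            counts["First"] += 1
        elif intended in candidates[:4]:
            counts["First four"] += 1
        else:
            counts["Wrong"] += 1
    total = len(recognitions)
    return {category: 100.0 * n / total for category, n in counts.items()}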

4.6 future work

Our research has focused on discriminative approaches, meaning that the empirical path is used to discriminate among many words in a reference set. After sufficient discrimination only a few words remain. It might also be possible to use a generative approach. Such an approach analyzes the empirical path and tries to generate possible words from it. If a generated word is in a set of reference words, then the generated word is valid. Pirkl [12] uses a generative approach for his vector keyboard. Although his approach leverages an alternative keyboard layout, it might also be applicable to more mainstream keyboards.

When using the official Swype software, the program appears to be sensitive to the directionality and corners of the swipe pattern. Directionality could be used in the set based indexing described in the section ‘Efficiency’ (4.4). When a corner is observed in a swipe pattern it is certain that there is an intended letter very close to the corner, otherwise the pattern was not drawn properly. Our program does not make any use of this yet and there are potentially many different ways in which corners could be used. There is an example case in the appendix (figure 25) where corners could have been used for more accurate results. Corners might be used in both generative and discriminative approaches.

5 NEXT WORD PREDICTION

This chapter describes how a language model can be used to re-order the set of candidate words C_d that was obtained in the previous chapter. C_d is the set of likely candidate tuples (word, probability) based on their distance to an empirical path. In this chapter we will construct two similar sets, also containing candidate tuples (word, probability). The first is the set C_l that defines probabilities for each of the words in C_d based on a language model. The second is the set C* that combines the chances defined in the sets C_d and C_l. Notice that the set C* determines what words are presented to the user and that C_d and C_l are only constructed in order to generate C*. Furthermore, it is assumed that the intended word is in these sets and therefore the sum chance of each set is 1.

5.1 n-grams

In an n-gram model it is assumed that a word early in a sentence is somehow predictive of a word later in the same sentence. By observing many sentences from a training set it is possible to derive statistics about what words follow other words. These statistics could, for example, indicate that the word sequence an animal is more likely than the sequence an dog. A training set is a large corpus of sentences from the target language.

5.1.1 Theory

The n in n-gram models indicates the number of words that is observed when applying statistical knowledge. In a 2-gram, or bi-gram, model the scope of the model covers two words. The model estimates the chance P(w_2|w_1) that word w_2 follows word w_1. This estimation is calculated as follows:

P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n)}{\sum_{w} C(w_{n-1}, w)}

where C(A, B) is a count function that counts the frequency with which B follows A in a training set.

When training a bi-gram language model by accumulating statistics of many sentences, all sentences are given the prefix word <s>. This prefix indicates the start of a sentence and is used to calculate the probability that word w_1 is the first word of any given sentence.

The bigram chance of a (partial) sentence w_1^n of length n is defined as follows:

P(w_1^n) = P(w_1 \mid \langle s \rangle) \times P(w_2 \mid w_1) \times \dots \times P(w_n \mid w_{n-1})

which can be written as

P(w_1^n) = P(w_1 \mid \langle s \rangle) \times \prod_{k=2}^{n} P(w_k \mid w_{k-1})

A tri-gram model can be defined similarly.

It is possible that the count function C returns a count of zero. In the probability function P(w_n \mid w_{n-1}) = C(w_{n-1}, w_n) / \sum_w C(w_{n-1}, w) this can be problematic: if the numerator is zero, the estimated probability P(w_n \mid w_{n-1}) becomes 0, which might not always be a sensible estimation. A more severe problem occurs when \sum_w C(w_{n-1}, w) is zero, because this results in a division by zero. These problems occur because the training data typically does not cover all possible word sequences, leaving certain sequences unobserved. To prevent these problems a smoothing technique is used to redistribute some of the probability mass from the observed sequences to the unobserved sequences. A very basic smoothing method is plus-one smoothing, by which 1 is added to every result of the count function. There are many more advanced smoothing methods, such as Good-Turing smoothing [9] and Katz smoothing [10].
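A minimal sketch of the bi-gram counting and the plus-one smoothed estimate (the corpus handling is simplified):

from collections import Counter

def train_bigram_counts(sentences):
    # Count unigrams and bigrams; every sentence is prefixed with '<s>'.
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_probability(prev, word, unigrams, bigrams, vocab_size):
    # Plus-one smoothed estimate of P(word | prev).
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)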

5.1.2 Application

A bi-gram language model was trained on the ‘Open American National Corpus’ [11]. The frequency counts of every word and bi-gram were pre-computed and stored in a dictionary. The total size of these two dictionaries is 22 megabytes, which is not a lot for modern mobile devices. Because the frequency counts are pre-calculated and stored in a dictionary, the language model has an average runtime of O(1). The bi-gram model was used to construct C_l by calculating the bi-gram chances (with plus-one smoothing) of the previously swiped word w_1 combined with all words w in the set C_d.

Let P(w_n \mid w_{n-1}) be a bi-gram probability and P_{C_l}(w_n \mid w_{n-1}) the probability P(w_n \mid w_{n-1}) normalized over the set C_l, so that:

P_{C_l}(w_n \mid w_{n-1}) = \frac{P(w_n \mid w_{n-1})}{\sum_{w \in C_d} P(w \mid w_{n-1})}

Entries in the set C_l are again tuples (word, probability), with probability defined as P_{C_l}(w_n \mid w_{n-1}) and a sum probability of 1.

Now that C_d and C_l are constructed it is possible to combine them and define C*. Entries in the set C* contain the same words as the other two sets, but combine the probabilities. This can be done by multiplying the probabilities for each word w from the two sets as follows:

P^*(w) = P_d(w) \times P_l(w)

Using this measure the probabilities in the sets C_d and C_l are regarded as equally informative and correctly distributed within their respective sets. However, the probabilities of one set might be more informative than the other, or the probabilities might not sufficiently discriminate among words in a particular set. Therefore it might be useful to re-estimate the probabilities in one set to make them more or less discriminative. This re-estimation can be done by raising every probability in one set to a power and then normalizing the probabilities again. If the power is larger than 1, the newly estimated probabilities are more discriminative and that set will weigh in more in the set C*, and vice versa. For our test data, the best results were obtained by raising the probabilities in the set C_d to the power of two. This indicates that the probabilities derived from the distance measure are more informative than the probabilities derived from the language model.
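A sketch of the combination step; raising the distance-based probabilities to a power (two gave the best results above) and renormalizing once after the multiplication is, up to normalization, equivalent to the procedure described here:

def combine_candidates(distance_probs, lm_probs, alpha=2.0):
    # distance_probs and lm_probs map each candidate word to its probability
    # in C_d and C_l respectively; the result corresponds to C*.
    combined = {}
    for word, p_d in distance_probs.items():
        combined[word] = (p_d ** alpha) * lm_probs.get(word, 0.0)
    total = sum(combined.values())
    if total > 0:
        combined = {word: p / total for word, p in combined.items()}
    return combined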

5.2 results

The following tables present the results of applying a language model to the results obtained in the previous chapter. The size of our test set is 263 words.

Path type      First %   First four %   Wrong %   Time (seconds)
filled         53.6      17.1           29.2      150
re-estimated   45.2      15.5           39.1      210
smoothed       54.7      13.6           31.5      180

Table 4: DTW with language model (bi-gram)

Path type      First %   First four %   Wrong %   Time (seconds)
naive          57.4      15.2           27.3      3.1
filled         57.4      15.2           27.3      3.1
re-estimated   57.7      15.5           26.6      3.1
smoothed       35.7      12.1           52.0      3.1

Table 5: Subsequence grouping with language model (bi-gram)

Path type      First %   First four %   Wrong %   Time (seconds)
filled         80.6      13.6           5.7       2.4
re-estimated   82.9      10.6           6.3       2.6
smoothed       81.3      11.7           6.8       2.4

Table 6: GAD with language model (bi-gram)

5.3 future work

We have used a bi-gram language model with plus-one smoothing. A more advanced smoothing method [9], [10] might result in better estimations. A feature that is present in at least one keyboard implementation [7] is learning from messages that have already been written by the user. This can be used to alter the pre-calculated n-gram statistics to be more descriptive of the language usage of a specific user.

Another language model that might be useful is a probabilistic context free grammar. This language model can also be used to calculate prefix probabilities [5]. Because this language model also incorporates grammar, it is sensitive to the structure of the entire sentence, and not just to the n − 1 preceding words (as n-grams are). The disadvantage is that it requires more effort to calculate statistics for a PCFG, because unlike the data for n-grams, data for a PCFG needs to be annotated.

6 CONCLUSIONS

In this research we have shown how a discriminative approach can be used for the Swype input method. In order to recognize what word is intended, an empirical path is compared to a simulated path. We have developed the following methods for path simulation:

1. Naive path simulation [3.1]

2. Filled path simulation [3.2]

3. Center re-estimation simulation [3.3]

4. Smoothed path simulation [3.4]

The official Swype implementation prefers users to swipe twice across double letters. We showed that there are advantages to omitting double letters and that doing so introduces ambiguous swipe patterns for at most 1 in 20 words [3].

If the simulated and empirical paths are more distinct then the word associated with the simulated path is less likely to be intended, and vice versa. We used dynamic time warping to compare the empirical and simulated paths. In addition to this we developed two other algorithms that provided better recognition while using less computational resources [5.2]. These algorithms are:

1. Subsequence grouping [4.2]

2. GAD (Greedy Asymmetric DTW) [4.3]

Keyboards for touch screens are often used on mobile devices with few computational resources. Also, if the keyboard program is not responsive enough the keyboard becomes useless. This means that the performance of the program is very important for it to be useful. Therefore we developed several methods to enhance the efficiency of the program, the most important of which is a data structure [4.4.2] to store the reference words and their simulated paths.

To enhance the accuracy of the recognition we used a bi-gram language model. This language model can use the context of the swipe and the linguistic behavior of a specific user to further enhance the swipe recognition. Use of the language model requires constant time and has a memory footprint of just 22 megabytes.

On our test set of 263 words, the best recognition rate was achieved using the center re-estimation path simulation together with the GAD comparison algorithm. We achieved the following results:

First %   First four %   Wrong %
82.9      10.6           6.3

Our test data covers untidy swipes (bent lines, missing letters) as well as more accurate swipe inputs. For example swipes and recognitions see appendix A.

6.1 future work

Our test set contains 263 example swipes. To better analyze the accuracy of the program it would be useful to evaluate even more swipes. Furthermore, our data set was produced by two individuals, which might not provide sufficient swipe diversity to accurately analyze the swipe recognition of an average user. For a better analysis it would be useful to have more data from many different users.

Our research has focused on a discriminative approach. It might also be possible to use a generative approach. Such an approach analyzes the empirical path and tries to generate possible words from it. Section 4.6 describes more future work for path recognition.

We used a bi-gram language model with plus-one smoothing. There are better smoothing techniques for n-grams. Alternatively, a probabilistic context free grammar can be used to estimate prefix probabilities. A PCFG analyzes the entire structure of a sentence while an n-gram language model only analyzes the last n − 1 words. Section 5.3 describes more future work on language models.

A EXAMPLE RECOGNITIONS

Optimal performance of the program is obtained by using the center re-estimation simulation with the GAD comparison algorithm. The following images depict swipe paths and the results of our program in its best configuration.

a.1 example one

"There is no freedom or democracywithout transparent governmentand an educated public."

Correct First four Wrong

10 3 0

Figure 12: there
Ranking   Word    P
1         there   0.98
2         ride    0.01
3         rude    0.00
4         rider   0.00

Figure 13: is
Ranking   Word   P
1         is     0.99
2         us     0.00
3         iss    0.00
4         its    0.00

Figure 14: no
Ranking   Word   P
1         no     0.99
2         jo     0.00
3         ni     0.00
4         nu     0.00


Figure 15: freedom
Ranking   Word      P
1         from      0.66
2         freedom   0.33
3         freon     0.12
4         devon     0.09

Figure 16: or
Ranking   Word   P
1         or     0.44
2         our    0.22
3         out    0.18
4         ie     0.15

Figure 17: democracy
Ranking   Word        P
1         democrat    0.38
2         reluctant   0.27
3         democracy   0.20
4         railway     0.15

Figure 18: without
Ranking   Word      P
1         without   0.30
2         editor    0.23
3         worthy    0.23
4         worry     0.23

Figure 19: transparent
Ranking   Word          P
1         transparent   0.29
2         transplant    0.28
3         transformer   0.25
4         tt            0.18


Figure 20: government
Ranking   Word         P
1         government   0.29
2         holocaust    0.25
3         garment      0.24
4         gibraltar    0.22

Figure 21: and
Ranking   Word    P
1         and     0.96
2         sand    0.01
3         wand    0.01
4         sands   0.01

Figure 22: an
Ranking   Word   P
1         an     0.95
2         san    0.04
3         ann    0.01
4         ab     0.00

Figure 23: educated
Ranking   Word        P
1         estimated   0.85
2         educated    0.07
3         engaged     0.05
4         estimates   0.03

Figure 24: public
Ranking   Word      P
1         public    0.27
2         olympic   0.27
3         politic   0.26
4         prozac    0.19


a.2 example two

"Sorry, I am too drunk to comehome now. Take care of kids."

Correct First four Wrong

9 1 3

Figure 25: sorry
Ranking   Word    P
1         shot    0.53
2         scott   0.23
3         shoot   0.20
4         shout   0.04

Figure 26: i
Ranking   Word   P
1         i      0.91
2         u      0.07
3         o      0.02

Figure 27: am
Ranking   Word   P
1         am     0.99
2         aim    0.00
3         sam    0.00
4         arm    0.00

Figure 28: too
Ranking   Word   P
1         to     0.61
2         too    0.26
3         tho    0.07
4         till   0.06


Figure 29: drunk
Ranking   Word   P
1         dim    0.34
2         dunn   0.23
3         fun    0.21
4         dum    0.21

Figure 30: to
Ranking   Word   P
1         to     0.32
2         ti     0.24
3         tu     0.23
4         thi    0.21

Figure 31: come
Ranking   Word     P
1         come     0.98
2         chile    0.01
3         colder   0.00
4         filmer   0.00

Figure 32: home
Ranking   Word     P
1         home     0.82
2         hole     0.06
3         honour   0.06
4         hour     0.05

Figure 33: now
Ranking   Word   P
1         now    0.55
2         note   0.15
3         node   0.15
4         nose   0.14


Figure 34: take
Ranking   Word    P
1         take    0.86
2         range   0.08
3         fame    0.03
4         frame   0.03

Figure 35: care
Ranking   Word    P
1         care    0.95
2         cadre   0.02
3         car     0.01
4         case    0.01

Figure 36: of
Ranking   Word   P
1         of     0.84
2         if     0.13
3         out    0.01
4         off    0.01

Figure 37: kids
Ranking   Word   P
1         its    0.60
2         us     0.21
3         is     0.18
4         iss    0.00

BIBLIOGRAPHY

[1] Swype Inc. Swype Home Page. Retrieved on 15 June 2013.

[2] TechCrunch. Swype user sets Guinness World Record for texting speed. Retrieved on 15 June 2013.

[3] Meinard Müller. Information Retrieval for Music and Motion. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2007.

[4] Peter Norvig. How to write a spell corrector. Retrieved on 15 June 2013.

[5] Andreas Stolcke. An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Computational Linguistics, Volume 21, Issue 2, June 1995, pages 165-201. MIT Press, Cambridge, MA, USA.

[6] 8pen. 8pen Home Page. Retrieved on 15 June 2013.

[7] SwiftKey. SwiftKey Home Page. Retrieved on 15 June 2013.

[8] Sander Latour. Swype-js GitHub repository. Retrieved on 15 June 2013, commit signature 09665715d59bb2e810998adcfd65d501344927a6.

[9] Good, I.J. The population frequencies of species and the estimation of population parameters. Biometrika 40 (3-4): 237-264 (1953).

[10] Katz, S. M. Estimation of probabilities from sparse data for the language model component of a speech recogniser. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3), 400-401 (1987).

[11] American National Corpus. ANC Home Page. Retrieved on 17 June 2013.

[12] Martin Pirkl. Vector keyboard for mobile devices based on Android platform. Thesis, 2011. Retrieved on 20 June 2013.
