


Human Assisted Positioning Using Textual Signs

Bo Han∗, Feng Qian∗, Moo-Ryong Ra∗
AT&T Labs – Research, Bedminster, New Jersey, USA

{bohan,fengqian,mra}@research.att.com

ABSTRACT

Location information is one of the key enablers of context-aware systems and applications for mobile devices. However, most existing location sensing techniques do not work, or are significantly slowed down, without infrastructure support, which limits their applicability in several cases. In this paper, we propose a localization system that works for both indoor and outdoor environments in a completely offline manner. Our system leverages human users' perception of nearby textual signs, without using GPS, Wi-Fi, cellular, or the Internet. It enables several important use cases, such as offline localization on wearable devices. Based on real data collected from Google Street View and OpenStreetMap, we examine the feasibility of our approach. The preliminary results are encouraging: our system achieves higher than 90% accuracy with only 4 iterations even when the speech recognition accuracy is 70%, requires very little storage space, and consumes 44% less instantaneous power than GPS.

1. INTRODUCTION

Location is one of the key pieces of context information and has enabled numerous context-aware systems and applications. To determine the location, today's mobile devices use various sensors and related techniques such as GPS, Wi-Fi fingerprinting, and cellular triangulation. All these approaches let the device figure out the location automatically, and they complement one another with distinct characteristics of availability, accuracy, infrastructure requirements, and resource consumption. However, most existing techniques rely upon infrastructure support and/or heavy training for their proper use. For instance, on many smartphones, A-GPS (Assisted GPS) is typically used to get an initial location fix quickly. In this case, nearby cell towers, or equipment directly connected to them, originate and deliver hints to smartphones so they can narrow down the search space. Without this help, it would take a long time to bootstrap the localization system. Some other localization techniques also require heavy training in the region of interest. Radio-signal-based fingerprinting techniques, such as ultrasound, Wi-Fi, and FM radio, are representative examples.

In this paper, we propose an alternative positioning system by leveraging a human's perception of her surroundings.

∗ All authors made equal contributions.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
HotMobile'15, February 12–13, 2015, Santa Fe, New Mexico, USA.
Copyright © 2015 ACM 978-1-4503-3391-7/15/02 ...$15.00.
http://dx.doi.org/10.1145/2699343.2699347

Device name      Release   Has GPS   Has Cellular   Has Wi-Fi
Google Glass     2013/03   No∗       No             Yes
Samsung Gear 2   2014/04   No        No             No
Apple Watch      2015      No        No             Partial∗∗

∗ The sensor might exist but is not yet accessible through the API.
∗∗ Can only communicate with its paired phone.

Table 1: Localization sensors on popular wearable devices.

Figure 1: Examples of textual signs in outdoor environments.

The location is computed based on what the user sees and tells the system. One notable difference from prior art is that the acquisition of location information can be performed based on only local information. This is particularly useful for wearable devices. As shown in Table 1, many such devices, including Google Glass and Apple Watch, do not have GPS, so the only way for them to obtain location information is to ask their paired devices, typically a smartphone. Moreover, the capability of completely local computation enables several unique use cases, which are articulated in §2.

Our localization methodology leverages textual signs we see every day. They include street signs, local business logos, public transportation signs, house numbers, decorative logos, etc. These textual signs offer several properties that make them ideal as location signatures. First, they are prevalent as long as the area is reasonably populated (§5.1). Second, they are easily recognizable and readable with little ambiguity. Third, users can easily identify spatially and temporally stable signs (Figure 1) and ignore temporary signs that are not suitable for fingerprinting the location, such as text on a vehicle or on an advertisement banner. In contrast, automated approaches such as Optical Character Recognition (OCR) may have difficulty detecting many texts due to their diverse styles (examples in Figure 1), and OCR cannot tell which signs are stable. Another benefit of using text is that the resultant signature database is small: as shown in Table 2, each entry corresponding to a sign takes only a few bytes. Therefore, the database of a large regional area can be stored locally. By integrating the database with offline maps, the entire positioning system can run locally without requiring GPS or Internet connectivity. The database can be constructed and updated by crowdsourcing without any domain knowledge or specialized equipment (e.g., by leveraging Google Street View, see §4), making it much easier than collecting data for traditional maps [17]. For positioning, the user can, for example, read out several (e.g., 3 to 5) nearby signs that she sees.


ID  Text                         Location               Tag
1   Greene St.                   40.729256,-73.995753   Road
2   West 4 St.                   40.729256,-73.995753   Road
3   Rocco Restaurant             40.727938,-74.000157   Restaurant
4   23 St. Station               40.745528,-73.998623   Transportation
5   Downtown Brooklyn (subway)
6   Century 21 dept. store       40.774097,-73.982095   Shopping
7   Imagine                      40.775921,-73.975867   Landmark
8   Regal                        40.757065,-73.989077   Cinema
9   405                          40.774135,-73.950709   House number

Table 2: Sign database for signs in Figure 1.

The text is recognized and then matched against the database to localize the user. Our scheme works in both outdoor and indoor environments.

In this paper, we make the following contributions. First, we collect real data from Google Street View and OpenStreetMap [5] and examine the feasibility of our idea. Second, we design robust algorithms to tolerate various inaccuracies and ambiguities due to human perception, the user input procedure, and the signs themselves (§3), and then perform a simulation study to demonstrate the feasibility of the proposed algorithm (§5.2). Third, in our system's regime, users can use different means to enter the text of the signs. Among others, we conducted a case study using speech recognition, which is convenient for wearable devices such as Google Glass, and the result was very encouraging (§5.3). Unlike general-purpose natural language interfaces [1], our speech recognition module uses a simple language model and usually a small vocabulary consisting of words on nearby signs. Thus, the entire processing chain is able to run locally.

Note that our goal is to build a human-assisted positioning system that achieves outdoor and indoor localization in a completely offline manner, without using GPS, Wi-Fi, cellular, or Internet connectivity. We do not expect our approach to replace traditional localization methods. Instead, it provides a useful alternative when traditional approaches do not work for various reasons.

2. MOTIVATION: USE CASES

We motivate our approach by discussing several use cases.

"Offline" Localization. Bob's grandmother is visiting him. Realizing that she is lost in the unfamiliar city, she calls her grandson. Using our system, Bob can quickly figure out where she is by asking her to describe her surroundings. In some cases, the same goal can be achieved by checking the street signs. However, street signs are not always nearby and not always available (e.g., in a theme park or shopping mall).

Facilitating Location Search and Recollection. Alice barely remembers the place where she met her friend last week. It was near a Starbucks and a Thai restaurant. Our system provides a way to effectively search for this location based on the limited information.

Offline Localization on Wearable Devices. Many wearable devices such as Google Glass and Apple Watch do not have GPS (Table 1). Consequently, the only way for them to obtain location information is to ask their paired smartphones via Bluetooth. Therefore, if the user does not bring her smartphone, or if the phone runs out of battery, the wearable device cannot know its location. Our proposal fills this gap by requiring neither location sensors (GPS, Wi-Fi, cellular, etc.) nor Internet connectivity. It works in both outdoor and indoor environments.

Infrastructure-free Indoor Localization. Localization in indoor environments usually requires hardware beacons and/or heavy training. In our approach, location can be determined by leveraging existing textual signs such as room names and numbers. The sign database can be easily constructed by walking in the building and/or marking the locations of textual signs on the floor map.

3. LOCALIZATION SYSTEM DESIGN

Our proposed system consists of three components: the sign database, the user input module, and the localization algorithm. When a user enters a nearby textual sign, it is matched against the sign database to get a set of candidate locations. These locations are then fed into the localization algorithm to refine the user's location. The above steps are repeated until the user's actual location is pinpointed. We detail the three components below.

Challenges. At first glance, the proposed scheme looks straightforward. One key challenge, however, is to handle the inaccuracy that may appear in each step. First, the user may misread some signs. Second, the user may only enter part of a sign based on common conventions. For example, "King Harbor Seafood Restaurant" may be entered as "King Harbor" or "King Harbor Seafood". Third, the user input module (e.g., the speech recognition engine) may make mistakes by translating the text into something different. Fourth, the same sign (e.g., "Bank of America") may appear in multiple places, causing ambiguities. Last but not least, a sign recognized by the user may not be in the database. We need robust algorithms to tolerate such inaccuracies. Another challenge stems from the fact that our system runs completely locally on mobile devices with limited storage, computation capability, and battery life. Therefore, it needs to be as lightweight as possible to minimize resource consumption, without compromising accuracy and user experience.

3.1 The Sign Database

The sign database contains the precise locations of textual signs. It can optionally contain tags for other value-added services. Based on the original database, our system generates an auxiliary data structure called the token table by splitting each textual sign into separate words (i.e., tokens) and constructing mappings from each token to the set of signs in which the token appears. For example, from Table 2, we have "rocco"→{3} and "regal"→{8}, where "rocco" and "regal" are two tokens and the numbers are sign IDs. The reason for introducing tokens is to tolerate errors. As will be described in §3.2, user input is processed on the basis of each token instead of each sign as a whole. Therefore, if the user only enters part of a sign, or if there are mistakes in some words of the sign, the sign may still be identified because some tokens partially match.
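To make the token table concrete, the following is a minimal Python sketch of the construction described above, using two entries from Table 2. The data layout (a dict keyed by sign ID) and the function names are illustrative assumptions, not part of our system; tokenization here is simple whitespace splitting.

```python
from collections import defaultdict

# Illustrative sign database entries from Table 2: sign ID -> (text, lat, lon, tag).
SIGNS = {
    3: ("Rocco Restaurant", 40.727938, -74.000157, "Restaurant"),
    8: ("Regal", 40.757065, -73.989077, "Cinema"),
}

def tokenize(text):
    """Split a sign's text into lowercase word tokens."""
    return [w.strip(".,") for w in text.lower().split()]

def build_token_table(signs):
    """Map each token to the set of sign IDs whose text contains that token."""
    table = defaultdict(set)
    for sign_id, (text, _lat, _lon, _tag) in signs.items():
        for token in tokenize(text):
            table[token].add(sign_id)
    return table

token_table = build_token_table(SIGNS)
# token_table["rocco"] -> {3}, token_table["regal"] -> {8}, matching the example above.
```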

3.2 User Input Interface

Our system can accommodate any input interface, as long as the method lets users conveniently enter the tokens. For instance, speech recognition can be used; or, if an on-screen keyboard is available, one can simply type. We mainly focus on speech recognition (SR) as our primary input interface, because the microphone is one of the most prevalent sensors on mobile devices and SR is particularly suitable for wearable devices. When the user reads out a nearby sign, the SR engine translates her speech back into text, which is then processed by the localization algorithm. SR performs a series of computations to find a match between a raw voice signal and words (or sentences) in the database. A typical SR processing chain requires three models: an acoustic model, a phonetic dictionary, and a language model. The acoustic model translates the continuous raw voice signal to phones. The phonetic dictionary maps the phones to words.¹ The language model decides which sentences are legitimate, e.g., word A cannot appear after word B.

In our initial design, we keep our language model simple in order to make the size of the local database for SR as small as possible. As a result, user input is recognized on the basis of individual word tokens; our SR is thus unaware of the structure of the text on signs, as it only translates the user's speech into individual tokens.

¹ "Words" and "tokens" are used interchangeably in this paper.


Figure 2: An example for illustrating the localization algorithm. (The figure plots candidate locations pi,j for four entered signs: S1 "Starbucks", S2 "Lakeview Street", S3 "Chase Bank", and S4 "Skytower Building".)

Here is a high-level description of the token recognition procedure. Let the input be a sequence of tokens {ti}. Their matched signs {si} can be quickly identified by looking up each ti in the token table (§3.1). A token can match zero, one, or more signs in the token table. For each token ti, we define its frequency f(ti) as the number of its unique matches (i.e., signs containing ti) divided by the total number of unique signs in the database. Tokens with high f(ti) are ignored based on a predefined threshold (e.g., 0.001, based on Figure 4) because they are unlikely to have enough discrimination power.
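The frequency-based filtering can be sketched as follows. It reuses the token table layout from the §3.1 sketch; the function names are illustrative, and 0.001 is the example threshold mentioned above.

```python
def token_frequency(token, token_table, total_signs):
    """f(t): number of unique signs containing t, divided by the total number of unique signs."""
    return len(token_table.get(token, set())) / total_signs

def match_tokens(recognized_tokens, token_table, total_signs, threshold=0.001):
    """Look up each recognized token and drop tokens too common to be discriminative."""
    matches = {}
    for t in recognized_tokens:
        f = token_frequency(t, token_table, total_signs)
        if 0 < f <= threshold:            # skip unknown tokens (f == 0) and overly frequent ones
            matches[t] = token_table[t]   # candidate sign IDs for this token
    return matches
```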

Also, note that the phonetic dictionary and language model can be changed dynamically in our case. Specifically, in the incremental update phase (§3.3), the model only contains a small subset of all tokens in the database, depending on where the user is. These operations can be done locally on most mobile devices, given that multi-core CPUs are now common on mobile platforms and we have a relatively small number of tokens, e.g., several thousand per city (§5.3). All the factors discussed so far, i.e., the simple language model and dynamically generated model files, reduce the complexity of the SR engine, making it feasible to run on today's mobile devices with reasonable accuracy, as demonstrated in §5.3.

We admit that, as a completely offline approach, our SR accuracy can be inferior to that of cloud-based SR systems backed by large databases. Also, tokens with the same or similar pronunciations are inherently difficult to distinguish. To mitigate this issue, we include a feature allowing the user to correct the SR result on a per-token basis. The system displays its best guess, as well as k other candidate tokens and a "not listed" option, so that the user can correct the result within a time window using either voice commands or physical buttons.

3.3 The Localization Algorithm

The localization algorithm works in a continuous manner. It takes as input a stream of signs S = {si}. The output is either the estimated location or unknown, indicating that the location cannot be accurately determined due to a lack of input.

For each si, we find its entries pi,j (1 ≤ j ≤ ki) in the database, where ki is the number of entries of si in the database. Assume we have m signs s1, ..., sm from the user. Some signs might be incorrect. Also, some may correspond to multiple locations (i.e., ki > 1). In order to automatically filter out incorrect signs and locations, we leverage a key assumption: the majority of signs should have (at least) one entry appearing near the user's true location, because the user can see them. That is, assuming m0 ≤ m of the signs are correct, for at least m0 signs there exists j' such that pi,j' is close to the true location. Consider the example in Figure 2. The user enters four signs s1, s2, s4, and something else that is mistakenly interpreted as s3 (e.g., due to inaccurate speech recognition). s1, s2, and s3 appear at multiple locations in the database.

Algorithm 1: Localization Algorithm

S = {si}, t = 2, stop = FALSE
while (stop != TRUE and t <= T) do
    found = 0
    for (every S' ⊆ S with |S'| = t) do
        if (distance(si, sj) < D for all si, sj in S') then
            found++; r = centroid of S'
        end if
    end for
    t++
    if (found == 0) then
        stop = TRUE; output: unknown location
    else if (found == 1) then
        stop = TRUE; output: location is r
    end if
end while
if (found != 1) then
    output: unknown location
end if

Despite this noise, p1,3, p2,2, and p4,1 form the largest cluster, which with high probability corresponds to the user's real location.

Following the intuition from Figure 2, we design a localization algorithm whose high-level idea is to search for the largest cluster of sign locations by gradually increasing the size of the set of correct sign locations. As shown in Algorithm 1, for every subset of 2 candidate locations, we measure their distance and count the number of subsets whose distance is smaller than D, a predefined threshold. If we do not find any such subset, the location is unknown. If such a subset is unique, the estimated location is the centroid of this subset of locations. If there are multiple such subsets, we repeat the above procedure for every subset of 3 candidate locations, then 4, and so on, until the subset size reaches a threshold T ≤ |S|. We empirically found this algorithm to be more robust than off-the-shelf clustering algorithms such as k-means for our problem. We evaluate its accuracy in §5.2.
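The sketch below mirrors Algorithm 1 in Python, under two assumptions the pseudocode leaves implicit: a cluster is formed by picking at most one candidate location per input sign, and coordinates are planar (meters) so Euclidean distance suffices. The function name and signature are ours, not from the paper.

```python
from itertools import combinations, product
from math import dist  # Euclidean distance (Python 3.8+); assumes planar coordinates in meters

def localize(candidates, D=200.0, T=None):
    """
    candidates: one entry per input sign, each a list of (x, y) locations that
    the sign matches in the database. Returns the estimated (x, y), or None for "unknown".
    """
    m = len(candidates)
    T = m if T is None else min(T, m)
    for t in range(2, T + 1):
        found, result = 0, None
        # Every subset of t distinct signs, and every choice of one candidate location per sign.
        for sign_subset in combinations(range(m), t):
            for locs in product(*(candidates[i] for i in sign_subset)):
                if all(dist(a, b) < D for a, b in combinations(locs, 2)):
                    found += 1
                    result = (sum(x for x, _ in locs) / t,
                              sum(y for _, y in locs) / t)  # centroid of this cluster
        if found == 0:
            return None        # no cluster of size t at all: give up
        if found == 1:
            return result      # a unique cluster: report its centroid
        # found > 1: ambiguous, so grow the required cluster size and try again
    return None
```

For the Figure 2 example, `candidates` would hold the pi,j lists of the four entered signs, and the unique cluster {p1,3, p2,2, p4,1} would determine the output.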

Bootstrapping and incremental update are two scenarios in which the above algorithm can be applied. In the bootstrapping phase (e.g., when the device is turned on), our system has no knowledge of the user's previous location. In this case, the search space in the sign database can be large (e.g., at the city level). Afterwards, in the incremental update phase, the system can use the user's last known location (if it is "fresh" enough) to predict the user's current coarse-grained location, which can be leveraged to significantly reduce the textual sign search space. This leads to higher accuracy for both speech recognition (as the vocabulary becomes smaller) and the localization algorithm (because a sign corresponds to fewer locations). One way to estimate the user's coarse-grained location is as follows. Let (x0, y0) be the user's previous location measured ΔT minutes ago. Let V and D be her maximum walking speed and her range of visibility, respectively. Assuming the user only walks, the location (x, y) of any sign she currently sees must satisfy (x − x0)² + (y − y0)² ≤ (V·ΔT + D)². Clearly this is the simplest and most conservative estimation; it can be significantly improved by considering walking speed [11], direction [23], and other coarse-grained localization techniques leveraging various sensors. The bootstrapping phase can be triggered manually or automatically, for example, after sensing that the user has taken a vehicle or a train [12]. We demonstrate the benefits of incremental update in §5.
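A sketch of this incremental-update pruning step follows. It assumes the previous fix and sign locations are latitude/longitude pairs (as in the §3.1 sketch's (text, lat, lon, tag) layout), so a haversine distance implements the (V·ΔT + D) radius check; the parameter defaults and names are our own illustrative choices.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def prune_search_space(signs, last_fix, minutes_ago, walk_speed_mps=1.5, visibility_m=100.0):
    """
    Keep only signs that could currently be visible: within V * dT + D meters of the
    user's previous fix, where V is the max walking speed and D the visibility range.
    `signs` uses the (text, lat, lon, tag) layout from the earlier sketch.
    """
    radius = walk_speed_mps * minutes_ago * 60 + visibility_m
    lat0, lon0 = last_fix
    return {sid: rec for sid, rec in signs.items()
            if haversine_m(lat0, lon0, rec[1], rec[2]) <= radius}
```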

3.4 Discussions

In this subsection, we discuss some limitations of our system and a potential user interface suitable for it.


Figure 3: (a) Traditional GPS-based map. The thick blue line is the computed route. (b) Map using sign-based localization. Entered signs are highlighted. Some other signs not yet entered are also displayed to help the user identify them and enter them by selection or speech.

Comparison to Fully-Automated Approaches. An inherent limitation of our approach is that localization usually takes longer, and it is difficult to derive a bound for the accuracy, because we rely on human users instead of location sensors. Therefore, our system will not replace automated localization methods. Instead, it provides a useful alternative when traditional approaches cannot work for various reasons (e.g., in the use cases described in §2). Notably, our proposal does have one crucial advantage: by asking users to actively participate in the localization process, it can improve users' spatial awareness and their recollection of the environment [10].

Map User Interface. In traditional GPS-based mobile maps, the app directly shows the user's location, as in Figure 3(a). However, our approach uses neither the distance nor the direction information of the signs, so it is difficult to leverage triangulation to pinpoint the user's exact location. Instead, we propose to directly highlight the textual signs entered by the user so that she can (probably more easily) figure out her location by using the signs as references. The display is updated when the user enters the next token. Between consecutive updates, the map can move slowly based on the user's walking speed and direction, in order to make the location update smoother. Signs not yet entered can also be displayed (but not highlighted) to help the user identify them and enter them by simply selecting (or reading) them. This makes sign-based navigation very convenient, as shown in Figure 3(b).

Comparison to Offline Maps. Several vendors (e.g., Garmin and Baidu) provide offline maps that users can access without Internet connectivity. However, users still need GPS or Wi-Fi fingerprinting to locate themselves on the map. Our work attempts to bridge this gap by making localization completely offline. Also, Baidu Map [2] provides a "search in view" feature allowing users to search POI (Point of Interest) names within the currently displayed area. But it only searches for one POI (e.g., KFC) at a time and shows all its instances, making it unsuitable for localization.

Using General Landmarks. Potentially, the landmarks can be extended from textual signs to more general objects (e.g., a particular statue on a street). However, compared to textual signs, these general landmarks provide a much vaguer description of the location, and using them as location signatures may cause more ambiguities. We will consider this direction in future work.

4. DATA COLLECTION

To demonstrate the feasibility of our proposal, we collected data from two sources: Google Street View and OpenStreetMap [5]. Google Street View is a feature of Google Maps that provides panoramic views from positions along many streets in the world. We randomly picked 19 outdoor locations in the U.S., Canada, and the U.K., and two indoor locations. For the AT&T Building, we conducted a real walk and collected the signs.

Figure 4: Ranked token frequency (both axes in log scale). (Curves: CA (POI), London (POI), CA (Roads), London (Roads).)

Figure 5: Number of signs within a confined area (London), for radii from 100 m to 6.4 km.

For all other 20 locations, we randomly selected spots with street view, "walked" along the street for a short distance (∼0.1 mile), and manually recorded the observed textual signs. The results are shown in Table 3. The street view approach allows us to construct a high-quality signature database without physically going to the site, and is ideal for crowdsourcing. For example, an interface can easily be added for users to tag the signs in street view images. Some sign images can also be used as CAPTCHAs so that all Internet users can contribute. We do notice a limitation of using street view images from a single repository: some signs are blocked by obstacles, so Table 3 only reports a subset of signs.

OpenStreetMap (OSM). We also crawled OSM to obtain the names of POIs (points of interest) in much larger areas. OSM [5] is a crowdsourcing-based world mapping project involving 200,000 contributors as of 2011 [17]. The OSM raw data [6] employs three main data structures to represent map elements: node, way, and area. We built a custom tool that analyzes these data structures and extracts their associated names (if they exist). We assume each POI has a sign corresponding to its name on the map. For a node, the sign's location is the same as the node's. For an area, we pick a random node on its boundary as the sign's location. For ways, we extracted their names, but the road signs' locations are difficult to estimate. One limitation of this synthetic approach is that it only produces a subset of signs, whose contents and locations may also deviate from those of the real ones. However, we believe the dataset is sufficient for our feasibility study in §5, in which we study three areas: the London metropolitan area (55K signs), New York City (∼21K signs), and the state of California (139K signs).
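Our OSM tool is not described in detail here; the following is a minimal illustrative sketch (not the actual tool) of extracting named nodes from an OSM XML extract using only the Python standard library. Ways and areas require an extra pass to resolve member-node coordinates and are omitted from this sketch.

```python
import xml.etree.ElementTree as ET

def extract_named_nodes(osm_xml_path):
    """Yield (name, lat, lon) for every OSM node that carries a 'name' tag."""
    for _event, elem in ET.iterparse(osm_xml_path, events=("end",)):
        if elem.tag == "node":
            name = None
            for tag in elem.findall("tag"):
                if tag.get("k") == "name":
                    name = tag.get("v")
                    break
            if name:
                yield name, float(elem.get("lat")), float(elem.get("lon"))
            elem.clear()  # keep memory bounded on large extracts

# Example (hypothetical file name): signs = list(extract_named_nodes("london.osm"))
```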

5. PRELIMINARY RESULTS

Using the two data sources described in §4, we show preliminary results to demonstrate the feasibility of our proposal.

5.1 Characterizing Textual Signs

Sign Popularity. Table 3 summarizes the Google Street View results. For the vast majority of the 21 locations, covering diverse environments, we observed a non-trivial number of textual signs. Generally, the signs are denser downtown than in suburban and residential areas. However, we did not find a strong correlation between sign density and city size (except for very large cities). In indoor environments, signs mostly consist of room names and numbers. We believe our system is applicable in most places. Note that obviously temporary signs (e.g., ad banners) and signs that are too general (e.g., traffic signs) are not included in Table 3.

Token Distribution. Next, we characterize tokens in two locations: the state of California and the London metropolitan area. For each location, we examine tokens in POI signs and tokens in road signs from OSM. This results in four token sets. Figure 4 shows a log-log plot of the ranked frequencies of the four sets. The distributions are heavy-tailed. Although a tiny fraction of tokens, which will be ignored by the localization algorithm, are very general, the vast majority of tokens belong to very few signs (most to only one). They are therefore ideal for quickly pinpointing the user's location.


Table 3: Signs collected from Google Street View.

Location Type | City Name | City Population | Approximate Location | Walked Dist. | Num. of Signs
Big city, downtown | New York City, NY | 8.4 M∗ | 9th Ave. / 54th St. | 0.1 mile | 24
Big city, residential | Queens, NY | 2.2 M | 37th Dr. / 108th St. | 0.1 mile | 12
Big city, downtown | Chicago, IL | 2.7 M | State St. / Madison St. | 0.1 mile | 15
Big city, suburb, residential | Oak Park, IL∗∗ | 51 K | S East Ave. / Randolph St. | 0.2 miles | 8
Big city, suburb, residential | Portland, OR | 584 K | SE Mill St. / SE 11th Ave. | 0.1 mile | 11
Big city, suburb | St. Louis, MO | 318 K | S Broadway / Stansbury St. | 0.2 miles | 11
Mid-sized city, downtown | Chattanooga, TN | 173 K | E 8th St. / Market St. | 0.1 mile | 11
Mid-sized city, downtown | Salinas, CA | 164 K | E Gabilan St. / Monterey St. | 0.1 mile | 11
Mid-sized city, suburb | Loveland, CO | 70 K | N Lincoln Ave. / E 29th St. | 0.1 mile | 9
Mid-sized city, suburb | Aberdeen, SD | 26 K | S 5th St. / SW 12th Ave. | 0.2 miles | 9
Mid-sized city, suburb | Fairbanks, AK | 32 K | N Cushman St. / 1st Ave. | 0.1 mile | 9
Mid-sized city, suburb, residential | Prince Albert, SK, Canada | 35 K | Bennett Dr. / 4th Ave. West | 0.2 miles | 12
Small city, tourism | Key West, FL | 25 K | Southard St. / Duval St. | 0.1 mile | 22
Small city | Riverton, WY | 11 K | E Park Ave. / N Broadway Ave. | 0.2 miles | 10
Small town, residential | Fairfield, IA | 9 K | E Adams Ave. / S Main St. | 0.2 miles | 8
Small town | Rushville, IL | 3 K | S Congress St. / W Lafayette St. | 0.1 mile | 13
Small town | Windermere, Cumbria, UK | 8 K | Near town centre | 0.1 mile | 9
University campus | Ann Arbor, MI | | U of Michigan north campus | 0.1 mile | 6
Theme park | San Diego, CA | | The Seaworld theme park | 0.1 mile | 6
Indoor, museum | Chicago, IL | | Art Institute of Chicago | 0.1 mile | 10+
Indoor, office building | Bedminster, NJ | | AT&T Building∗∗∗ | 0.1 mile | 10+

Notes: ∗ Includes all five boroughs of New York City. ∗∗ Near Chicago. ∗∗∗ Our office building; conducted a real walk.

Figure 6: The accuracy of the localization algorithm: (a) NYC, (b) London. (Each panel plots localization accuracy against speech recognition accuracy for n = 3 to 7 input signs.)

Signs in a small area. We show that if the search is confined within a small area, the number of signs decreases significantly. We randomly pick 50,000 locations in London. For each location (xi, yi), we count the number of signs within (x − xi)² + (y − yi)² ≤ R², where the radius R ranges from 100 m to 6.4 km. Figure 5 shows that the number of signs decreases accordingly as the search area becomes smaller. Note that in the incremental update phase, which is the more common usage scenario compared to the bootstrapping phase (§3.3), the search area is usually small.

5.2 Localization Algorithm

We evaluate the performance of the proposed localization algorithm for different levels of speech recognition accuracy (70%, 80%, and 90%) and different numbers of input signs (from 3 to 7). We show the results for the New York City and London OSM data sets in Figure 6. We randomly select 5,000 user locations in London and 500 user locations in New York City (the London OSM data set is denser than the NYC one). Based on Table 3, we only consider user locations with more than 10 signs within 100 meters, and we thus set D in Algorithm 1 to be slightly larger than 200 meters. The estimated location is considered accurate if the distance between the user location and the centroid of the selected signs is less than D/2. Note that this does not imply the localization accuracy is D/2; rather, it intuitively means the selected signs are indeed what the user sees. We run the simulation 5 times and report the average and standard deviation for T = 4. There are two major observations from Figure 6. First, better speech recognition improves the performance of localization. Second, more input signs lead to better localization accuracy. Even when the speech recognition accuracy is 70%, the proposed algorithm works for more than 90% of locations with 4 signs and more than 99% of locations with 7 signs.
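For concreteness, one trial of this simulation could look like the sketch below, which reuses the `localize` sketch from §3.3. The way misrecognitions are injected (replacing a sign with a random one from the whole database) and the helper names (`lookup_locations`) are our assumptions; the paper does not spell out these sampling details.

```python
import random

def simulate_trial(user_loc, nearby_signs, all_signs, lookup_locations, localize,
                   n_signs=4, sr_accuracy=0.7, D=200.0):
    """
    One simulated attempt: the user reads n_signs nearby signs; each is recognized
    correctly with probability sr_accuracy, otherwise replaced by a random database sign.
    Returns True if the estimate is within D/2 of the true location (the criterion above).
    """
    chosen = random.sample(nearby_signs, n_signs)
    observed = [s if random.random() < sr_accuracy else random.choice(all_signs)
                for s in chosen]
    candidates = [lookup_locations(s) for s in observed]  # candidate (x, y) lists per sign
    estimate = localize(candidates, D=D)
    if estimate is None:
        return False
    (ux, uy), (ex, ey) = user_loc, estimate
    return ((ex - ux) ** 2 + (ey - uy) ** 2) ** 0.5 < D / 2
```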

Figure 7: SR performance: (a) recognition rate vs. number of attempts, and (b) rank distribution of the correct candidate, for token sets with radii from 100 m to 3200 m. The results are potentially the worst case since the participant was a non-native speaker and we used the Sphinx software out of the box, i.e., without optimizing the acoustic model, dictionary, and language model files.

5.3 Speech Recognition (SR)

Methodology. We adapt a popular implementation of an SR chain called Sphinx [3]. We use its Android version (PocketSphinx) and use its acoustic model as is. In the experiments, we first randomly pick a set of locations in London and NYC. Then, for each location, we collect token sets from the OSM data with different radii (100 m to 6400 m). Each token set has 67 to 4,186 tokens for London and 12 to 1,038 tokens for NYC. We take each token set as input and generate a dictionary and language model, which are then used by one experimental trial. In each experiment, the participant randomly takes a word from the phone-word dictionary and pronounces it, one word at a time. The software presents 10 potential candidates. If the correct answer is among them, we treat it as a successful recognition. Otherwise, the participant repeats the word. Upon success, we record the rank of the correct candidate suggested by the recognition system.

Results. We first look at the storage overhead imposed by the phonetic dictionary and language model and verify that it is very small.


Specifically, for the NYC dataset, the dictionary size ranges from 203 bytes (100 m radius) to 8 KB (6400 m radius), and the language model size from 703 bytes to 15 KB. For the London dataset, the dictionary size ranges from 701 bytes to 31 KB and the language model size from 1.6 KB to 63 KB.

We next examine the recognition performance of SR, shown in Figure 7. It is worth mentioning that the results can be regarded as worst-case performance, since the participant was a non-native English speaker and we used the recognition software out of the box, i.e., we did not tune the default acoustic model or the dictionary/language model generated by PocketSphinx. Overall, we think the results are encouraging. Up to a 400 m radius, repeating once more (i.e., 2 trials in total) gives a 95% recognition rate. Even for all other cases, 4 trials cover 90% accuracy. We note that this is a very preliminary result, and we plan to do a more comprehensive evaluation with a diverse set of participants and more optimized software settings, e.g., using better-trained acoustic and dictionary/language models, as well as improved acoustic feature extraction techniques [18].

We also measured the energy consumption of the SR component and compared it with that of the GPS sensor. We used a Monsoon Power Monitor [4] and a Samsung Galaxy S4 device for the measurement. The SR component (including both the microphone and SR processing) consumes ∼44% less instantaneous power than GPS (208.48 mW vs. 369.22 mW), which further supports our use of SR.

6. RELATED WORK

GPS has been used as a primary source of location information and provides accurate data in outdoor environments. But not all mobile devices are equipped with GPS (Table 1), due to concerns such as energy consumption. Further, GPS does not always provide a precise location in urban environments: it is highly susceptible to obstructions of the sky and can have errors of more than 100 meters [16, 21]. Recently, a technique for acquiring GPS signals in indoor environments has been proposed [19]. However, it is not yet applicable to off-the-shelf devices due to its use of specialized radio components and the complexity of the system.

In the last decade, numerous research efforts have proposed localization techniques based on radio-signal fingerprinting. These include ones based on GSM signals [20], Wi-Fi signals [7], FM signals [8], etc. Unfortunately, some of them rely on information that is in general not available on off-the-shelf mobile devices [20] or require heavy training [7]. In contrast, our system uses the microphone (or an on-screen keyboard) as input, which is available on most of today's mobile devices including wearables, and our sign database is relatively easy to construct compared to the above-mentioned techniques. A number of works aim to strategically combine multiple approaches [21, 15, 25, 14, 22]. They trade off location accuracy against other system resources, such as energy, and thereby differ from our approach, whose focus is to provide a purely local localization system based on human perception.

When it comes to dealing with textual signs, one intuitive approach is to use image-based positioning [24], i.e., let the device see what a human sees. Although this approach is complementary to our system, it requires multi-stage computer vision algorithms and large databases (e.g., 300K images for 50K locations in NYC [24]), making the cloud the best place for performing such tasks. It thus requires Internet connectivity and brings additional concerns of delay [13], energy consumption, and cellular data usage. Lighting and viewing angle can also significantly reduce the image matching accuracy.

Although explored in a slightly different context, the idea of leveraging human assistance to determine a location has been explored before and inspired our work. In the GUIDE tourist guide system [9], a visitor is shown several thumbnail images of local attractions and asked if any of them matches the visitor's surroundings. This approach is not directly applicable to a general-purpose localization system like ours due to scalability issues.

7. CONCLUSION AND FUTURE PLAN

As discussed in §6, there have been a large number of indoor and outdoor localization proposals using various sensing techniques. In contrast, we provide a solution for offline localization, without using GPS, Wi-Fi, cellular, or Internet connectivity, by exploiting human users' perception of nearby textual signs. It enables several important use cases, such as offline localization on wearable devices, in both outdoor and indoor environments. We are currently working on (i) prototyping on Google Glass using speech recognition as the user interface, (ii) building a city-level sign database by tagging Google Street View images using crowdsourcing, and (iii) improving speech recognition accuracy, before conducting field trials.

8. REFERENCES

[1] Apple Siri on iOS 7. https://www.apple.com/ios/siri/.
[2] Baidu Map. http://map.baidu.com/.
[3] CMU Sphinx: Open source toolkit for speech recognition. http://cmusphinx.sourceforge.net/.
[4] Monsoon Power Monitor. http://www.msoon.com/LabEquipment/PowerMonitor/.
[5] OpenStreetMap. http://www.openstreetmap.org.
[6] OpenStreetMap Data. http://wiki.openstreetmap.org/wiki/Planet.osm.
[7] P. Bahl and V. N. Padmanabhan. RADAR: An in-building RF-based user location and tracking system. In IEEE INFOCOM, 2000.
[8] Y. Chen, D. Lymberopoulos, J. Liu, and B. Priyantha. FM-based indoor localization. In ACM MobiSys, pages 169–182, 2012.
[9] K. Cheverst, N. Davies, K. Mitchell, and A. Friday. Experiences of developing and deploying a context-aware tourist guide: the GUIDE project. In MobiCom, 2000.
[10] J. Chung. Mindful Navigation with Guiding Light: Design Consideration of Projector based Indoor Navigation Assistance System. PhD thesis, MIT, 2012.
[11] J.-g. Park, A. Patel, D. Curtis, S. Teller, and J. Ledlie. Online Pose Classification and Walking Speed Estimation using Handheld Devices. In ACM UbiComp, 2012.
[12] S. Hemminki, P. Nurmi, and S. Tarkoma. Accelerometer-Based Transportation Mode Detection on Smartphones. In SenSys, 2013.
[13] J. J. Hull, X. Liu, B. Erol, J. Graham, and J. Moraleda. Mobile Image Recognition: Architectures and Tradeoffs. In HotMobile, 2010.
[14] M. B. Kjærgaard, J. Langdal, T. Godsk, and T. Toftkjær. EnTracked: Energy-Efficient Robust Position Tracking for Mobile Devices. In ACM MobiSys, 2009.
[15] K. Lin, A. Kansal, D. Lymberopoulos, and F. Zhao. Energy-Accuracy Trade-off for Continuous Mobile Device Location. In ACM MobiSys, 2010.
[16] M. Modsching, R. Kramer, and K. ten Hagen. Field trial on GPS Accuracy in a medium size city: The influence of built-up. In 3rd Workshop on Positioning, Navigation and Communication (WPNC), 2006.
[17] P. Neis and A. Zipf. Analyzing the contributor activity of a volunteered geographic information project – the case of OpenStreetMap. ISPRS International Journal of Geo-Information, 1(2):146–165, 2012.
[18] S. Nirjon, R. Dickerson, J. Stankovic, G. Shen, and X. Jiang. sMFCC: exploiting sparseness in speech for fast acoustic feature extraction on mobile devices – a feasibility study. In HotMobile, 2013.
[19] S. Nirjon, J. Liu, G. DeJean, B. Priyantha, Y. Jin, and T. Hart. COIN-GPS: indoor localization from direct GPS receiving. In ACM MobiSys, 2014.
[20] V. Otsason, A. Varshavsky, A. LaMarca, and E. de Lara. Accurate GSM indoor localization. In UbiComp, pages 141–158. Springer, 2005.
[21] J. Paek, J. Kim, and R. Govindan. Energy-Efficient Rate-Adaptive GPS-based Positioning for Smartphones. In MobiSys, 2010.
[22] J. Paek, K.-H. Kim, J. P. Singh, and R. Govindan. Energy-Efficient Positioning for Smartphones using Cell-ID Sequence Matching. In MobiSys, 2011.
[23] N. Roy, H. Wang, and R. R. Choudhury. I am a Smartphone and I can Tell my User's Walking Direction. In ACM MobiSys, 2014.
[24] F. X. Yu, R. Ji, and S.-F. Chang. Active Query Sensing for Mobile Location Search. In ACM MM, 2011.
[25] Z. Zhuang, K.-H. Kim, and J. P. Singh. Improving Energy Efficiency of Location Sensing on Smartphones. In MobiSys, 2010.