AD-HOC GEOREFERENCING OF WEB-PAGES USING STREET-NAME PREFIX TREES
Andrei Tabarcea, Ville Hautamäki, Pasi Fränti
University of Eastern Finland
http://cs.joensuu.fi/mopsi/
INTRODUCTION
• Our goal is to find services and points of interest close to the user’s location
• We call this “location-based search”
• We try to find location information in web-pages
AD-HOC GEOREFERENCING
• The problem is how to extract and validate location data from free-form text
• Most web pages don’t contain explicit georeferencing (e.g. geo-tags)
• The postal address is the most common location data found
• Our goal is to give geographical coordinates to services mentioned in web-pages
• We call this method ad-hoc georeferencing
<HTML>
<HEAD profile="http://geotags.com/geo">
<META name="geo.position" content="62.35;29.44">
<META name="geo.region" content="FI">
<META name="geo.placename" content="Joensuu">
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" href="http://www.joensuu.fi/tkt/sivutyyli.css" type="text/css">
<TITLE>Pages of Pasi Fränti</TITLE>
</HEAD>
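For the minority of pages that do carry explicit geo-tags like the header above, the coordinates can be read directly from the `geo.position` meta tag. The following is a minimal sketch using only Python's standard `html.parser`; the class and function names (`GeoTagParser`, `extract_position`) are hypothetical and not part of MOPSI.

```python
from html.parser import HTMLParser


class GeoTagParser(HTMLParser):
    """Collects <meta name="geo.*" content="..."> tags from a page."""

    def __init__(self):
        super().__init__()
        self.geo = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name.startswith("geo."):
            self.geo[name] = attributes.get("content")


def extract_position(html):
    """Return (latitude, longitude) from geo.position, or None if absent."""
    parser = GeoTagParser()
    parser.feed(html)
    position = parser.geo.get("geo.position")
    if not position:
        return None
    latitude, longitude = (float(part) for part in position.split(";"))
    return latitude, longitude


page = '<HTML><HEAD><META name="geo.position" content="62.35;29.44"></HEAD></HTML>'
print(extract_position(page))
```

Pages without such tags return `None`, which is the case ad-hoc georeferencing is meant to cover.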
MOPSI LOCATION-BASED SEARCH
MOPSI = Mobiilit paikkatietosovellukset ja Internet (Mobile location-based applications and Internet)
Available at http://cs.joensuu.fi/mopsi/
Main focus areas:
• Mobile search engine
• How to collect & present location-based data
• Other location-related topics
MOBILE SEARCH ENGINE
– How can you find services:
    – Asking for directions
    – Advertisements
    – Wandering around
    – Yellow pages
    – Internet
– Query consists of:
    – Keyword
    – Location
MOBILE SEARCH ENGINE STRUCTURE
[Figure: search engine structure. Boxes: mobile application, web user interface, core server software, geocoded street-name database. Arrow labels: keyword, coordinates, address, search results.]
The search engine consists of:
• User interface
• Core server software
• Geocoded street-name database
CORE SERVER SOFTWARE
[Figure: core server software. Boxes: relevant municipalities detector, page parser, address and description detector, address validator, georeferencing module, geocoded database. Arrow labels: keyword, coordinates, municipalities list, <keyword, municipality> query, result links, word list, addresses, results list, sorted results list.]
OUR SOLUTION
• A rule-based solution that detects address-based locations using a gazetteer and street-name prefix trees created from the gazetteer
• We compare this approach against:
    – a method that doesn’t require a gazetteer (a heuristic method that assumes that the street name has a certain structure)
    – a method that also uses data structures created from the gazetteer, in the form of street-name arrays
StreetNameDetection(words) {
    i = 0;
    WHILE i < count(words) DO {
        IF words[i] = street name THEN {
            Search for street number, postal code and other address elements near words[i].
            IF address elements found THEN {
                Create address block
                Get coordinates using Geocoded Database
                IF coordinates found THEN
                    Add address block to address list
            }
        }
        i = i + 1;
    }
}
STREET-ADDRESS DETECTION
• We use a rule-based pattern matching algorithm
• The detection of street names is the starting point of the algorithm
• An address-block candidate is constructed by detecting typical address elements (street names, numbers, postal codes, telephone numbers and municipality names)
• Address-block candidates are validated using the gazetteer
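The detection loop can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the MOPSI implementation: the street-name set and the geocoded lookup table are tiny hypothetical stand-ins for the gazetteer (the coordinates are made up), the three-word search window is an arbitrary choice, and only street numbers and five-digit postal codes are recognised as address elements.

```python
import re

# Hypothetical stand-ins for the gazetteer and the geocoded database.
STREET_NAMES = {"kauppakatu"}
GEOCODED = {("kauppakatu", "29"): (62.60, 29.76)}  # made-up coordinates

NUMBER = re.compile(r"^\d+[a-zA-Z]?$")   # e.g. "29" or "29B"
POSTAL_CODE = re.compile(r"^\d{5}$")     # Finnish postal codes have five digits


def detect_addresses(words, window=3):
    """Return validated address blocks found in a list of words."""
    addresses = []
    for i, word in enumerate(words):
        street = word.lower().strip(",.")
        if street not in STREET_NAMES:
            continue  # street-name detection is the starting point
        # Look for other address elements in the words that follow.
        nearby = [w.strip(",.") for w in words[i + 1 : i + 1 + window]]
        postal_code = next((w for w in nearby if POSTAL_CODE.match(w)), None)
        number = next(
            (w for w in nearby if NUMBER.match(w) and not POSTAL_CODE.match(w)),
            None,
        )
        if number is None:
            continue
        # Validate the candidate against the geocoded database.
        coordinates = GEOCODED.get((street, number))
        if coordinates is not None:
            addresses.append(
                {
                    "street": street,
                    "number": number,
                    "postal_code": postal_code,
                    "coordinates": coordinates,
                }
            )
    return addresses


found = detect_addresses("Ravintola Kauppakatu 29, 80100 Joensuu".split())
print(found)
```

Candidates whose street and number are not in the lookup table are dropped, which mirrors the validation step in the last bullet.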
STREET-NAME DETECTION
• Street-name detection is the starting point of the address detection
• Heuristic and brute-force methods are compared against our prefix tree solution
• Our application uses a commercial gazetteer for Finland and, for Singapore, street data from the free map project OpenStreetMap

Gazetteer Statistics                  Finland    Singapore
Number of municipalities              410        1
Total number of street names          92 572     573
Number of streets per municipality    474        573
Average street name length            11.6       6.1
Total size (MB)                       2 982      0.18
PREFIX TREES
• Invented by Fredkin (1960)
• The prefix tree (or trie) is a fast ordered tree data structure used for retrieval
• The root is associated with the empty string
• All the descendants of a node share, as a common prefix, the string associated with that node
• Some nodes can have associated values (usually they mark the end of a word)
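The properties listed above map directly onto a small data structure. The sketch below is a generic trie in Python, not code from the presentation; the names `TrieNode`, `Trie`, `insert`, `find` and `has_prefix` are chosen here for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> child node
        self.value = None    # set only on nodes that end a stored word


class Trie:
    def __init__(self):
        self.root = TrieNode()  # the root represents the empty string

    def insert(self, word, value=True):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.value = value

    def _walk(self, text):
        """Follow text from the root; return the node reached or None."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def find(self, word):
        """Return the stored value if word was inserted, else None."""
        node = self._walk(word)
        return node.value if node is not None else None

    def has_prefix(self, prefix):
        """True if at least one stored word starts with prefix."""
        return self._walk(prefix) is not None


trie = Trie()
trie.insert("kauppakatu")
trie.insert("kauppatie")
print(trie.find("kauppakatu"), trie.find("kauppa"), trie.has_prefix("kauppa"))
```

A lookup touches at most one node per character of the query word, independent of how many words are stored.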
STREET-NAME PREFIX TREES
• Our solution is to detect street names using prefix trees constructed from the gazetteer
• A street-name prefix tree is built for each municipality used in the search
• The user’s location and area of interest are known, so the prefix trees can be limited to the relevant municipalities

Prefix Tree Statistics                Finland    Singapore
Maximum tree depth                    34         14
Average tree depth                    12.7       7.4
Average tree width                    105        167
Average number of nodes per tree      2338       2335
Total size (MB)                       74.4       0.18
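One way to picture the per-municipality trees is a dictionary that maps each municipality to its own prefix tree, so that a search only consults the trees for the municipalities relevant to the user. The sketch below assumes single-word street names and uses a small hypothetical gazetteer extract; `build_tree`, `lookup` and `detect_street_names` are illustrative names, not MOPSI functions.

```python
END = object()  # sentinel key marking "a street name ends at this node"


def build_tree(street_names):
    """Build a nested-dict prefix tree from a list of street names."""
    root = {}
    for name in street_names:
        node = root
        for ch in name.lower():
            node = node.setdefault(ch, {})
        node[END] = name
    return root


def lookup(tree, word):
    """Return the stored street name if word is in the tree, else None."""
    node = tree
    for ch in word.lower():
        node = node.get(ch)
        if node is None:
            return None
    return node.get(END)


# Hypothetical gazetteer extract: municipality -> street names.
GAZETTEER = {
    "Joensuu": ["Kauppakatu", "Torikatu", "Länsikatu"],
    "Kuopio": ["Kauppakatu", "Tulliportinkatu"],
}
TREES = {m: build_tree(names) for m, names in GAZETTEER.items()}


def detect_street_names(words, municipalities):
    """Return (word index, street name, municipality) for every hit."""
    hits = []
    for index, word in enumerate(words):
        token = word.strip(",.;:")
        for municipality in municipalities:
            tree = TREES.get(municipality)
            street = lookup(tree, token) if tree is not None else None
            if street:
                hits.append((index, street, municipality))
    return hits


words = "Hotelli Torikatu 20, Joensuu".split()
print(detect_street_names(words, ["Joensuu"]))
```

Restricting the search to the listed municipalities keeps each tree small and avoids matching street names that only exist elsewhere.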
OTHER SOLUTIONS
• Heuristic solution
    – Relies on regular expression matching
    – Street names usually have similar endings or similar prefixes
    – A gazetteer is not needed (except for validation)
    – Can be fast but not precise
• Brute-force solution
    – Every word is checked to see whether it exists in the gazetteer
    – An optimized solution is used (the gazetteer is locally limited and preloaded into arrays)
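The difference between the two baselines can be illustrated with a short Python sketch. The suffix list is an assumption made for this example (a few common Finnish street-name endings), not the rule set used in the experiments, and the brute-force version uses a set where the slides mention arrays.

```python
import re

# Assumed, illustrative list of common Finnish street-name endings.
STREET_SUFFIXES = ("katu", "tie", "kuja", "polku")
HEURISTIC = re.compile(
    r"\b\w+(?:" + "|".join(STREET_SUFFIXES) + r")\b", re.IGNORECASE
)


def heuristic_candidates(text):
    """Words that look like street names; no gazetteer needed."""
    return HEURISTIC.findall(text)


def brute_force_candidates(text, street_names):
    """Check every word of the text against the known street names."""
    known = {name.lower() for name in street_names}
    return [w for w in re.findall(r"\w+", text) if w.lower() in known]


text = "Kauppakatu 29 and Siltakatu 12"
print(heuristic_candidates(text))
print(brute_force_candidates(text, ["Kauppakatu"]))
```

The heuristic returns every word with a street-like ending, whether or not it is a real street, which is why its candidates still need validation; the brute-force check only returns names that are actually in the gazetteer but has to test every word.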
EXPERIMENTS
• 10 urban locations (blue) and 10 rural locations (orange) were used for testing
• Testing was done using the MOPSI prototype for Finland and Singapore
• Both commercial and non-commercial keywords were used:

Commercial        hotel, restaurant, pizzeria, cinema, car repair
Non-commercial    hospital, museum, police station, swimming hall, church
RESULTS
• Average processing times for every solution were calculated
• The prefix tree solution proved to be on average 57% faster and 10% more accurate than the heuristic solution and 10 times faster than the brute-force solution
• The resulting solution improves the speed and quality of web-page georeferencing

Method          Time (s)    Standard deviation    Validated addresses
Rural municipalities
Brute-Force     3.01        2.43                  3.7
Heuristic       1.54        1.15                  2.5
Prefix Tree     0.51        0.35                  3.7
Urban municipalities
Brute-Force     10.18       7.11                  19.8
Heuristic       1.70        1.24                  18.6
Prefix Tree     0.87        0.85                  19.8
Total
Brute-Force     6.59        6.40                  11.8
Heuristic       1.62        1.20                  10.5
Prefix Tree     0.69        0.68                  11.8
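The speed figures quoted in the bullets can be reproduced from the Total rows of the table, assuming that "57% faster" means a 57% reduction in average processing time and that "10 times faster" is a rounded ratio of average times. The short Python check below only restates that arithmetic.

```python
# Average processing times (s) from the Total rows of the results table.
total_time = {"brute_force": 6.59, "heuristic": 1.62, "prefix_tree": 0.69}

# Reduction in time relative to the heuristic solution.
reduction = 1 - total_time["prefix_tree"] / total_time["heuristic"]

# Ratio of brute-force time to prefix tree time.
speedup = total_time["brute_force"] / total_time["prefix_tree"]

print(f"{reduction:.0%}")   # 57%
print(f"{speedup:.1f}x")    # 9.6x
```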
OPEN PROBLEMS
• Support approximate matching to avoid problems caused by misspellings
• Improve the flexibility of the address detection algorithm
• Implement a way to learn rules automatically from a hand-tagged example corpus