A confidence-based framework for disambiguating geographic terms
Erik Rauch, Michael Bukatin, and Kenneth Baker
MetaCarta, Inc.
‘wine’ in Europe
Al Hamra
(= ‘red’ in Arabic)
Local and non-local information
Madison
Wisconsin
Milwaukee’s downtown
More non-local information -> too many states to get probabilities
Candidate places
• 38° 01'10.5"N 121° 44'48.8"W
• four miles south of Lusaka – (22.10 S 15.51 E)
• Deir az Zor – (32.10 N 41.11 E), 0.325
– (25.03 N 31.44 E), 0.151
– (….)
confidence
Local context
resident of Madison
Minister Ishihara
Ishihara, Japan (32.36 N 147.21 E)
Madison, WI; Madison, ID; Madison, CT; Madison, KY…
Context affects confidence
• Increase or decrease c(p,n) based on strength of context words
– “by Madison” vs. “President Madison”
– can be added manually or automatically
• and/or use HMM
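A minimal sketch of the context-word adjustment, assuming a multiplicative weight per context phrase. The phrase table, the weights, and the function name `adjust_confidence` are illustrative assumptions; the slides do not say how c(p,n) is actually updated.

```python
# Hypothetical context weights: factors above 1 raise c(p, n), factors below 1
# lower it. The phrases and numbers are illustrative, not taken from the talk.
CONTEXT_WEIGHTS = {
    "resident of": 1.5,  # "resident of Madison" suggests a place
    "by": 1.3,           # "by Madison" weakly suggests a place
    "president": 0.2,    # "President Madison" suggests a person
    "minister": 0.2,     # "Minister Ishihara" suggests a person
}


def adjust_confidence(base_conf, preceding_text):
    """Scale base_conf by the weight of the context phrase that ends preceding_text."""
    text = preceding_text.lower().strip()
    conf = base_conf
    for phrase, weight in CONTEXT_WEIGHTS.items():
        # Match whole words only, so "nearby" does not trigger the "by" rule.
        if text == phrase or text.endswith(" " + phrase):
            conf *= weight
            break
    # Keep the result a valid confidence in [0, 1].
    return max(0.0, min(1.0, conf))
```

For example, `adjust_confidence(0.4, "He was a resident of")` raises the confidence, while `adjust_confidence(0.4, "President")` lowers it.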
Local context problems
Madison family attractions
Madison, WI; Madison, ID; Madison, CT; Madison, KY…
Milwaukee
Using spatial patterns of geographic references
Madison
Milwaukee
Wisconsin
Increase c(p,n) based on number of other references:
Enclosing regions or nearby points
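One way to sketch this boost, assuming each candidate carries a point, a region label, and a confidence, and that every supporting reference raises the confidence by a fixed fraction. The 200 km radius, the 0.1 step, the dictionary layout, and the function names are assumptions made for illustration only.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def spatial_boost(candidates, other_points, enclosing_regions=(),
                  radius_km=200.0, step=0.1):
    """Raise each candidate's confidence by the number of supporting references.

    candidates: list of {"point": (lat, lon), "region": str, "conf": float}
    other_points: (lat, lon) of other places mentioned in the document
    enclosing_regions: region labels mentioned in the document
    """
    boosted = []
    for cand in candidates:
        nearby = sum(1 for p in other_points
                     if haversine_km(cand["point"], p) <= radius_km)
        enclosing = sum(1 for r in enclosing_regions if r == cand["region"])
        conf = min(1.0, cand["conf"] * (1 + step * (nearby + enclosing)))
        boosted.append({**cand, "conf": conf})
    return boosted


# Approximate coordinates, used only to illustrate the Madison example.
madisons = [
    {"point": (43.07, -89.40), "region": "WI", "conf": 0.3},
    {"point": (41.28, -72.60), "region": "CT", "conf": 0.3},
]
milwaukee = (43.04, -87.91)
result = spatial_boost(madisons, [milwaukee], enclosing_regions=["WI"])
```

With Milwaukee and Wisconsin both mentioned in the document, the Madison, WI candidate ends up with a higher confidence than Madison, CT, which receives no support.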
Pitfalls
Ishihara, Japan (32.36 N 147.21 E)
Ishihara, Japan’s leading epidemiologist,
Training
• “Philadelphia” is usually geographic; “Bend” usually isn’t
• If name n often refers to point p in documents, give (n,p) high confidence to start with
• Use average confidence in a large corpus
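A rough sketch of how starting confidences might be estimated from a tagged corpus, assuming the corpus is available as (name, place) pairs, with `None` marking a non-geographic use. Relative frequency is used here as a simple stand-in for the "average confidence" the slide mentions; the function name and input format are hypothetical.

```python
from collections import Counter, defaultdict


def initial_confidences(tagged_mentions):
    """Estimate a starting confidence for each (name, place) pair.

    tagged_mentions: iterable of (name, place_or_None) pairs from a tagged corpus.
    Returns {name: {place: fraction of mentions of name tagged as that place}}.
    """
    totals = Counter()
    by_place = defaultdict(Counter)
    for name, place in tagged_mentions:
        totals[name] += 1
        if place is not None:  # None means the mention was not geographic
            by_place[name][place] += 1
    return {
        name: {place: count / totals[name]
               for place, count in by_place[name].items()}
        for name in totals
    }


# Toy corpus: "Philadelphia" is usually geographic, "Bend" usually is not.
corpus = ([("Philadelphia", "Philadelphia, PA")] * 9 + [("Philadelphia", None)]
          + [("Bend", "Bend, OR")] + [("Bend", None)] * 9)
start = initial_confidences(corpus)
```

In this toy corpus "Philadelphia" starts with a high confidence for its one tagged place and "Bend" with a low one, matching the intuition in the first bullet.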
Training cont’d
• Extract local linguistic contexts that often occur with geographic names in tagged corpora
• Or train HMM
Relevance
• Several dimensions to relevance:
– Traditional textual relevance of query terms
– Georelevance
Query: “cheese” in France
Georelevance
• Aim: combination reflects user’s preferred balance between recall and correctness of the geographic reference
• e.g. Georelevance = query term relevance * geoconfidence
• Depends on:
– Attributes of the geotext, e.g. document frequency, font size, position
– Geoconfidence
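The product form given in the example bullet can be sketched as a ranking function. The `min_geoconfidence` threshold is an added assumption, one possible way to let the user trade recall against correctness of the geographic reference; the tuple layout and function names are also hypothetical.

```python
def georelevance(term_relevance, geoconfidence):
    """Georelevance as the product of query term relevance and geoconfidence."""
    return term_relevance * geoconfidence


def rank(hits, min_geoconfidence=0.0):
    """Rank hits by georelevance, highest first.

    hits: list of (doc_id, term_relevance, geoconfidence) tuples.
    min_geoconfidence: assumed cutoff; raising it favors correctness over recall.
    """
    kept = [h for h in hits if h[2] >= min_geoconfidence]
    return sorted(kept, key=lambda h: georelevance(h[1], h[2]), reverse=True)


hits = [("a", 0.9, 0.2), ("b", 0.6, 0.8), ("c", 0.5, 0.5)]
ranked_all = [doc_id for doc_id, _, _ in rank(hits)]
ranked_strict = [doc_id for doc_id, _, _ in rank(hits, min_geoconfidence=0.4)]
```

Document "a" matches the query terms best, but its weak geographic reference pushes it to the bottom of the ranking, and a stricter threshold drops it entirely.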
Conclusion
• Ambiguity problem much worse with large gazetteers
• Can use probabilistic methods where feasible (local information), combine with confidence-based heuristics