A confidence-based framework for disambiguating geographic terms

Erik Rauch, Michael Bukatin, and Kenneth Baker

MetaCarta, Inc.

‘wine’ in Europe

Al Hamra

(= ‘red’ in Arabic)

Local and non-local information

Madison

Wisconsin

Milwaukee’s downtown

More non-local information -> too many states to get probabilities

Candidate places

• 38 01'10.5"N 121 44'48.8"W

• four miles south of Lusaka – (22.10 S 15.51 E)

• Deir az Zor
– (32.10 N 41.11 E), confidence 0.325
– (25.03 N 31.44 E), confidence 0.151
– (….)

Local context

resident of Madison

Minister Ishihara

Ishihara, Japan (32.36 N 147.21 E)

Madison, WI; Madison, ID; Madison, CT; Madison, KY…

Context affects confidence

• Increase or decrease c(p,n) based on strength of context words
– “by Madison” vs. “President Madison”
– can be added manually or automatically

• and/or use HMM
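The context adjustment above can be sketched in a few lines. This is only an illustration under stated assumptions: the table `CONTEXT_STRENGTH`, its numeric weights, and the function `adjust_confidence` are hypothetical and do not come from the slides; only the example cues ("by", "resident of", "President", "Minister") do.

```python
# Hypothetical cue strengths: a value above 1 raises c(p, n), below 1 lowers it.
# The cue words follow the slide examples; the numbers are invented.
CONTEXT_STRENGTH = {
    "by": 1.2,
    "resident of": 1.5,
    "president": 0.2,
    "minister": 0.2,
}


def adjust_confidence(confidence, preceding_words):
    """Scale a confidence c(p, n) by the cue that ends the preceding text."""
    text = preceding_words.lower().strip()
    for cue, strength in CONTEXT_STRENGTH.items():
        # A cue matches when it is the last word(s) before the name.
        if text == cue or text.endswith(" " + cue):
            confidence *= strength
    # Keep the result a valid confidence value.
    return min(confidence, 1.0)
```

With these made-up weights, "President Madison" lowers the confidence that "Madison" is a place, while "a resident of Madison" raises it. The table could be filled in by hand or learned from tagged text, as the slide suggests.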

Local context problems

Madison family attractions

Madison, WI; Madison, ID; Madison, CT; Madison, KY…

Milwaukee

Using spatial patterns of geographic references

Madison

Milwaukee

Wisconsin

Increase c(p,n) based on number of other references:

Enclosing regions or nearby points
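One possible reading of the "nearby points" rule is sketched below. It is an assumption-laden illustration rather than the authors' method: the additive `boost`, the `radius_km` threshold, and both function names are hypothetical, and the enclosing-region case is not covered.

```python
import math


def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def boost_by_neighbors(candidates, other_points, radius_km=200.0, boost=0.1):
    """Raise c(p, n) once for every other reference within radius_km of p.

    candidates: list of ((lat, lon), confidence) pairs for one name.
    other_points: (lat, lon) points of other references in the same document.
    radius_km and boost are hypothetical tuning values.
    """
    result = []
    for point, conf in candidates:
        support = sum(1 for q in other_points if distance_km(point, q) <= radius_km)
        result.append((point, min(1.0, conf + boost * support)))
    return result
```

For example, with approximate coordinates for Madison, WI (43.07, -89.40), Madison, CT (41.28, -72.60), and Milwaukee (43.04, -87.91), a mention of Milwaukee in the same document raises the Wisconsin candidate and leaves the Connecticut one unchanged.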

Pitfalls

Ishihara, Japan (32.36 N 147.21 E)

Ishihara, Japan’s leading epidemiologist,

Training

• “Philadelphia” is usually geographic; “Bend” usually isn’t

• If name n often refers to point p in documents, give (n,p) high confidence to start with

• Use average confidence in a large corpus
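A minimal sketch of the starting-confidence idea follows. It assumes a corpus in which each mention is already tagged with the place it refers to, or with `None` when it is not geographic. It uses plain relative frequency as a stand-in for the "average confidence" on the slide, and the function name and input format are hypothetical.

```python
from collections import Counter, defaultdict


def initial_confidences(tagged_mentions):
    """Estimate a starting c(p, n) for each (name, place) pair.

    tagged_mentions: iterable of (name, place) pairs; place is None when
    the mention was tagged as non-geographic.
    """
    counts = defaultdict(Counter)
    for name, place in tagged_mentions:
        counts[name][place] += 1

    confidences = {}
    for name, by_place in counts.items():
        total = sum(by_place.values())  # includes non-geographic uses
        for place, k in by_place.items():
            if place is not None:
                confidences[(name, place)] = k / total
    return confidences
```

Because non-geographic uses count toward the total, a name like "Bend" that is rarely a place in the corpus starts with a low confidence, while "Philadelphia" starts high.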

Training cont’d

• Extract local linguistic contexts that often occur with geographic names in tagged corpora

• Or train HMM
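Extracting local contexts could start as simply as counting the word that precedes each tagged geographic name. The sketch below assumes a corpus given as `(token, is_geographic)` pairs. Both that format and the function name are hypothetical, and it does not attempt the HMM alternative.

```python
from collections import Counter


def preceding_word_counts(tagged_tokens):
    """Count the word immediately before each token tagged as geographic.

    tagged_tokens: list of (token, is_geographic) pairs in document order.
    """
    counts = Counter()
    for i, (token, is_geo) in enumerate(tagged_tokens):
        if is_geo and i > 0:
            previous_word = tagged_tokens[i - 1][0].lower()
            counts[previous_word] += 1
    return counts
```

Words with high counts here (and low counts before non-geographic uses) are candidates for the context cues that raise c(p,n).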

Relevance

• Several dimensions to relevance:
– Traditional textual relevance of query terms
– Georelevance

Query: “cheese” in France

Georelevance

• Aim: combination reflects user’s preferred balance between recall and correctness of the geographic reference

• e.g. Georelevance = query term relevance * geoconfidence

• Depends on:
– Attributes of the geotext, e.g. document frequency, font size, position
– Geoconfidence
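The product form given as an example above (georelevance = query term relevance * geoconfidence) can be sketched as a ranking function. The document tuples and both function names are hypothetical, and the geotext attributes (document frequency, font size, position) are left out.

```python
def georelevance(term_relevance, geoconfidence):
    """Combine textual relevance with confidence in the geographic reference."""
    return term_relevance * geoconfidence


def rank_documents(documents):
    """Sort documents by georelevance, highest first.

    documents: list of (doc_id, term_relevance, geoconfidence) tuples,
    with both scores assumed to lie in [0, 1].
    """
    return sorted(documents, key=lambda d: georelevance(d[1], d[2]), reverse=True)
```

Under this combination, a document that matches “cheese” strongly but whose reference to France is doubtful can rank below a weaker textual match with a confident geographic reference.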

Conclusion

• Ambiguity problem much worse with large gazetteers

• Can use probabilistic methods where feasible (local information) and combine them with confidence-based heuristics
