
Gathering Alternative Surface Forms for DBpedia Entities

Volha Bryl, University of Mannheim, Germany / Springer Nature

Christian Bizer, Heiko Paulheim, University of Mannheim, Germany

NLP & DBpedia @ ISWC 2015, Bethlehem, USA, October 11, 2015


Why you need Surface Forms

• A surface form (SF) of an entity is a collection of strings by which the entity can be referred to: synonyms, alternative names, etc.

• Used to support many NLP tasks: co-reference resolution, entity linking, disambiguation

2 Surface Forms for DBpedia Entities, Bryl, Bizer, Paulheim


“Billionaire Elon Musk has spelled out how he plans to create temporary suns over Mars in order to heat the Red Planet. Dismissing earlier comments that he intended to nuke the planet’s surface, he says he wants to create aerial explosions to heat it up.”

To link these three entities, a machine needs to know that “Red Planet” is an alternative name for Mars, and that Mars can be referred to simply by its “type”: planet


Surface Forms from Wiki(DB)pedia

• Some of Wikipedia’s (and hence DBpedia’s) crowd-sourced content looks a lot like surface forms:

• Page titles

• Redirects
  • Account for alternative names, word forms (e.g. plurals), closely related words, abbreviations, alternative spellings, likely misspellings, subtopics

• Disambiguation pages
  • There are 10+ Bethlehems in the US, according to https://en.wikipedia.org/wiki/Bethlehem_(disambiguation)

• Anchor texts of links between wiki pages
  • Rendered: Named after the Roman god of war, it is often referred to as the “Red Planet”...
  • Source: Named after the [[Mars (mythology)|Roman god of war]], it is often referred to as the “Red Planet”...

• …additionally, we use anchor texts of links from external pages to Wikipedia
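The anchor-text source above can be sketched as a small extractor. This is an illustrative regex-based version, not the actual DBpedia extractor: it collects, for each link target, the set of anchor texts used to point at it (`[[Target]]` links contribute the title itself; `[[Target|anchor]]` links contribute the anchor).

```python
import re
from collections import defaultdict

# Matches [[Target]] and [[Target|anchor text]] internal wiki links.
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def anchor_surface_forms(wikitext):
    """Map each link target to the set of anchor texts pointing at it."""
    forms = defaultdict(set)
    for target, anchor in WIKI_LINK.findall(wikitext):
        # findall yields '' for a missing piped anchor; fall back to the title.
        forms[target.strip()].add((anchor or target).strip())
    return forms

source = ('Named after the [[Mars (mythology)|Roman god of war]], '
          'it is often referred to as the "[[Mars|Red Planet]]".')
# "Roman god of war" becomes a candidate SF for Mars (mythology),
# "Red Planet" a candidate SF for Mars.
print(anchor_surface_forms(source))
```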


Surface Forms from Wiki(DB)pedia

• Not a new idea

• BabelNet, DBpedia Spotlight, … [see our paper for more links]


[Screenshot: Mars in BabelNet]


• Problem: Quality

• …not only is quality a problem, it has also never been assessed or addressed
  • Reason 1: the good quality of Wikipedia content is taken for granted

  • Reason 2: the hope is that NLP algorithms won’t be influenced by noise



• Problem: Quality – Why?

• By adding a redirect or the anchor text of an internal Wikipedia link, a Wikipedia editor might mean not only “same as” or “also known as”, but also “related to”, “contains”, etc.

• Both variants serve the purpose of pointing to the correct wiki page


Solution: Focus on Quality

• Step 1: Extract

• We extract SFs from Wikipedia labels, redirects, disambiguations, and anchor texts of internal wiki-links

• Step 2: Evaluate

• We create a gold standard to evaluate the SFs quality

• Step 3: Filter

• We implement three filters to improve SFs quality

• Bonus: More SFs

• We extract SFs from anchor texts of Wiki links found in the Common Crawl 2014 corpus

• All datasets are available at

http://data.dws.informatik.uni-mannheim.de/dbpedia/nlp2014/


SFs Dataset Statistics

• LRD = Labels, Redirects, Disambiguations

• Extracted from DBpedia dumps

• WAT = Wikipedia Anchor Texts

• Extracted by a new DBpedia extractor (based on PageLinksExtractor)


Gold Standard

• Manual annotation, 1 annotator, 2 subsets

• Popular subset: 34 manually selected popular entities of different types
  • Denmark, Berlin, Apple Inc., Animal Farm, Michael Jackson, Star Wars, Diego Maradona, Mars, etc.

  • ~82 SFs per entity, linked from other Wiki pages 813,736 times

• Random subset: 81 randomly selected entities, each having at least 5 SFs
  • Andy_Zaltzman, Bell AH-1 SuperCobra, Biarritz, Castellum, Firefox (film), Kipchak languages, ParisTech, Psychokinesis, etc.

  • ~13 SFs per entity, linked from other Wiki pages 14,760 times

Available at http://data.dws.informatik.uni-mannheim.de/dbpedia/nlp2014/gold/


Gold Standard

• Types of annotations

  • correct (“the eternal city” for Rome)

  • contained (“Google Japan” for Google), contains (“Turkey” for Istanbul)

  • type of (“the city” for Rome)

  • partial (“Diego” for Diego Maradona)

  • related (“Google Blog” for Google)

  • wrong (“during World War I” for United States)


Evaluation: How many correct SFs?

• SFs extracted from labels, redirects, disambiguations

• correct, popular subset: 66.8%

• correct, random subset: 86.6%

• SFs extracted from Wikipedia anchor texts

• correct, popular subset: 38.5%

• correct, random subset: 70.7%

• Combined dataset

• correct, popular subset: 45.7%

• correct, random subset: 75%
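The percentages above can be read as precision against the gold standard: the share of extracted SFs annotated as correct. A minimal sketch (the label list below is illustrative, not taken from the real gold standard):

```python
# Precision of a set of surface forms, given gold-standard annotations
# drawn from {correct, contained, contains, type of, partial, related, wrong}.
def sf_precision(annotations):
    """Fraction of surface forms annotated as correct."""
    return sum(1 for label in annotations if label == "correct") / len(annotations)

labels = ["correct", "correct", "related", "wrong", "correct"]
print(sf_precision(labels))  # 0.6
```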


(1) Filtering: String Patterns

• Data analysis shows there are patterns that wrong SFs follow:

• URLs: contain .com or .net (“Berlin-china.net” for Berlin)

• of-phrases, with the exceptions for city of, state of, and the like (“Issues of Toronto” for Toronto)

• in-phrases (“Historical sites in Berlin” for Berlin)

• and-phrases (“Tom Cruise and Katie Holmes” for Tom Cruise)

• list-of (“List of Toronto MPs and MPPs” for Toronto)

• Increase in precision

• popular subset: 1.33%

• popular subset, LRD only: 3.75%

• random subset: less than 1%
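The pattern rules above can be sketched as a simple filter. This is a re-implementation sketch: the exact matching rules and the full exception list ("city of, state of, and the like") are assumptions based on the bullets, not the paper's code.

```python
# Assumed exception list for of-phrases; the paper's full list may differ.
OF_EXCEPTIONS = ("city of", "state of")

def keep_surface_form(sf):
    """Return False for surface forms matching a known wrong-SF pattern."""
    s = sf.lower()
    if ".com" in s or ".net" in s:    # URLs, e.g. "Berlin-china.net"
        return False
    if s.startswith("list of"):       # list-of pages
        return False
    if " of " in s and not s.startswith(OF_EXCEPTIONS):  # of-phrases
        return False
    if " in " in s:                   # in-phrases
        return False
    if " and " in s:                  # and-phrases
        return False
    return True

for sf in ["Red Planet", "City of Toronto", "Issues of Toronto",
           "Historical sites in Berlin", "Tom Cruise and Katie Holmes"]:
    print(sf, "->", keep_surface_form(sf))
```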


(2) Filtering: Wikidata

• Observation: some SFs are entities on their own in other languages

• E.g. “Neckarau”, a city district of Mannheim, redirects to Mannheim in the English Wikipedia, but has its own page in the German Wikipedia

• Implementation: use the DBpedia–Wikidata dumps released in May 2015

• Check whether an SF exactly matches, or is close to (by Levenshtein distance), a label of a Wikidata entity that has no English Wikipedia page but has pages in other languages

• Increase in precision

• 0.5% compared to pattern-based filtering

• 1.5% for SF extracted only from LRD
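The Wikidata check above can be sketched as follows. The label set and the distance threshold are assumptions for illustration (the slide does not state the threshold); the real implementation reads the DBpedia–Wikidata dumps.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Labels of Wikidata entities that have a non-English but no English
# Wikipedia page (illustrative stand-in for the real dump).
NON_EN_ONLY_LABELS = {"Neckarau"}

def is_own_entity(sf, max_dist=1):
    """True if the SF (nearly) matches such a label and should be filtered
    out as a surface form of another entity. max_dist=1 is an assumption."""
    return any(levenshtein(sf, label) <= max_dist for label in NON_EN_ONLY_LABELS)

print(is_own_entity("Neckarau"))  # True -> drop as SF of Mannheim
print(is_own_entity("Mannheim"))  # False -> keep
```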


(3) Filtering: Frequency Scores

• For SFs extracted from anchor texts, link frequencies are available, from which TF-IDF scores are computed

• Determining the threshold: values from 1.0 to 8.0 with a step of 0.2 were evaluated

• Two thresholds with the highest F1 values were selected: 1.8 and 2.6

• Threshold 0 (no filtering) was used as a baseline

• Increase in precision

  • 20% for the popular subset, 10% for the random subset

* Filtering done on the dataset to which pattern- and Wikidata-based filters are already applied
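One way the frequency scoring could work, sketched under assumptions: the slide does not give the exact TF-IDF formulation, so this uses a standard variant where tf is how often an SF links to an entity and idf penalizes SFs that link to many different entities. The counts are toy numbers.

```python
import math
from collections import defaultdict

# (surface form, entity) -> link frequency; illustrative toy numbers.
link_counts = {
    ("Red Planet", "Mars"): 120,
    ("Mars", "Mars"): 900,
    ("Mars", "Mars (chocolate bar)"): 40,
}

def tf_idf_scores(counts):
    """Score each (SF, entity) pair: link frequency times log of the
    number of entities over the number of entities the SF links to."""
    entities = {e for _, e in counts}
    sf_entities = defaultdict(set)
    for sf, e in counts:
        sf_entities[sf].add(e)
    n = len(entities)
    return {(sf, e): tf * math.log(n / len(sf_entities[sf]))
            for (sf, e), tf in counts.items()}

scores = tf_idf_scores(link_counts)
# The unambiguous "Red Planet" outscores the ambiguous "Mars", whose idf is 0.
print(scores)
```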


SFs from Common Crawl

• Common Crawl (CC) is the largest publicly available web corpus

• Extraction done on Winter 2014 CC Corpus, in the context of the Web Data Commons project

• http://webdatacommons.org/ – a project extracting various types of structured data from CC and providing them for public download

• Data required a lot of cleaning

• 3M SFs added to our LRD&WAT corpus

• No annotated gold standard: left for future work

• Available at

http://data.dws.informatik.uni-mannheim.de/dbpedia/nlp2014/lrd-cc/


Conclusion and Future Work

• Main message

• the quality of Wikipedia-based surface forms is often overlooked!

• Contributions

• Gold standard SFs, made available

• 3 filtering strategies: precision improved by > 20% for popular Wikipedia entities and by > 10% for random entities

• Extracted SFs from Common Crawl corpus

• All data publicly available

• Future work directions

• Task-based evaluation of the resource, further work on the gold standard
