Interlinking BBC CIS concepts with DBpedia. Learn more about our work at http://mes-semantics.com
Christian Becker: DBpedia Linker
London, September 4, 2008
Christian Becker
Chris Bizer
Georgi Kobilarov
Freie Universität Berlin
DBpedia Linker
Interlinking CIS concepts with DBpedia
Hello
Name: Christian Becker
Job: Partner, MES (Consulting on media-centric solutions); PhD Student at Freie Universität Berlin
Semantic Web Projects
- DBpedia’s Geo and Homepage Extractors
- DBpedia Mobile and Marbles Browser
- flickr™ wrappr
DBpedia/Wikipedia as a Common Vocabulary
Better link between BBC properties
Better link externally
Better find and integrate BBC content elsewhere; leverage BBC metadata
Better link between BBC properties
Better link externally
BBC properties can be enriched with information from Wikipedia articles as well as content connected to them
Better find and integrate BBC content elsewhere
[Diagram: DBpedia shown together with the BBC areas Programmes, Music, Topics, Users, Events, News, Food and Gardening]
[Diagram: BBC Programmes, BBC Topics, BBC Music (✔), BBC News etc., DBpedia and MusicBrainz, with connections labelled "TODAY!" and "FUTURE" (twice)]
BBC Topics: CIS Taxonomy
Core datasets
- 6,630 brands
- 55,943 locations
- 73,442 names
- 11,231 subjects
Preferred and alternative labels
Tree hierarchy expressed in SKOS
Implicit hierarchy in parentheses texts, e.g. Jane Seymour (actor)
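A minimal sketch of how a parenthesised qualifier could be separated from a label so that it can be treated as an extra class. The function name `split_label` and the regular expression are illustrative assumptions, not taken from the linker itself.

```python
import re

# Matches "Base label (qualifier)" where the qualifier is the last
# parenthesised group at the end of the label.
_QUALIFIER = re.compile(r"^(?P<base>.*?)\s*\((?P<qualifier>[^()]*)\)\s*$")


def split_label(label):
    """Return (base_label, qualifier); qualifier is None if there is none."""
    match = _QUALIFIER.match(label)
    if match is None:
        return label.strip(), None
    return match.group("base").strip(), match.group("qualifier").strip()


base, qualifier = split_label("Jane Seymour (actor)")
# base is "Jane Seymour", qualifier is "actor"
```

Labels without parentheses, such as "Agricultural Statistics", come back unchanged with a qualifier of `None`.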
Results
Dataset    Total    Linked         Precision*  Recall*
Brand      6,630    1,267 (19%)    86%         41%
Location   55,943   11,316 (20%)   99%         77%
Name       73,442   22,341 (30%)   92%         67%
Subject    11,231   6,822 (61%)    92%         75%
* Against test set of 600 resources. Updated to reflect only cases where links are possible.
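The figures above can be related to each other with a short sketch. It assumes that precision is the share of produced links that are correct, and that recall is measured only over test resources for which a link is possible at all (one reading of the footnote). The function names and the toy test set are hypothetical.

```python
def linked_share(linked, total):
    """Percentage of concepts that received a link, rounded as on the slide."""
    return round(100 * linked / total)


def precision_recall(predicted, gold):
    """predicted: concept -> URI chosen by the linker (linked concepts only).
    gold: concept -> correct URI, or None when no article exists."""
    correct = sum(1 for concept, uri in predicted.items()
                  if gold.get(concept) == uri)
    linkable = sum(1 for uri in gold.values() if uri is not None)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / linkable if linkable else 0.0
    return precision, recall


# The "Linked" percentages follow from the totals in the table.
shares = [linked_share(1267, 6630), linked_share(11316, 55943),
          linked_share(22341, 73442), linked_share(6822, 11231)]

# Toy test set (made-up identifiers, not the 600-resource test set).
gold = {"a": "dbpedia:A", "b": "dbpedia:B", "c": None, "d": "dbpedia:D"}
predicted = {"a": "dbpedia:A", "b": "dbpedia:X"}
precision, recall = precision_recall(predicted, gold)
```

In the toy data one of two produced links is correct and three concepts are linkable, so precision is 1/2 and recall is 1/3.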
Why so few links...?
Many concepts simply don’t have their own Wikipedia articles
Brands
- “Mind the baby, Mr Bean” is in Wikipedia’s “List of Mr. Bean episodes”
- “Face to face (BBC Radio Gloucestershire community programme)” (not the BBC TV Series!)
Locations
- “West Woods (Wiltshire)”
- “Hobhole Drain” (notable mention in “List of rivers of England”)
- “Hinchingbrooke Country Park”
Names
- “The Jolly Anker (pub, Northampton)”
- “Moulton Players (drama group)”
- “Halliwell, Jo (BBC Leeds volunteer for Fat Nation)”
Subjects
- “Agricultural Statistics”
We think that important concepts are largely linked!
Linking Approach
Automated linking: Tradeoff between quality and quantity
We wanted high-quality links
Limited input - only labels and hierarchy
Problems
No correspondences
Differing labels
- Word stemming
- Determining term nearness using Lucene’s scorer
- Integrating Wikipedia redirects to find alias labels
Ambiguities
- Sorting by number of inter-wiki references
- DBpedia class restrictions
- Class Equivalence
- Require exact matches
Poor man’s PageRank
[Diagram: “Bill Clinton” (30000) is referenced by articles such as “Democratic Party”, “Hillary Clinton”, “United States”, “List of United States Presidents”, ...]
Lucene boost factor = number of articles that reference an article
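A minimal sketch of the counting behind this boost, assuming the link structure is available as (source article, target article) pairs. How the count is fed into Lucene's scoring is not described on the slide, so the sketch only sorts candidates by it. The function names and the sample links are illustrative.

```python
from collections import Counter


def inbound_counts(links):
    """Count, per article, how many distinct articles reference it."""
    distinct = set(links)  # a source counts once per target
    return Counter(target for _source, target in distinct)


def rank_candidates(candidates, counts):
    """Order candidate articles by number of referencing articles."""
    return sorted(candidates, key=lambda title: counts[title], reverse=True)


links = [
    ("Democratic Party", "Bill Clinton"),
    ("Hillary Clinton", "Bill Clinton"),
    ("United States", "Bill Clinton"),
    ("List of United States Presidents", "Bill Clinton"),
    ("Hillary Clinton", "Bill Clinton"),  # repeated link, counted once
    ("Hillary Clinton", "United States"),
]
counts = inbound_counts(links)
ranking = rank_candidates(["United States", "Bill Clinton"], counts)
```

Counting distinct referencing articles rather than raw links matches the wording "number of articles"; counting every link occurrence would be an equally simple variant.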
Integrating Redirects
[Diagram: “Bill Clinton” (30000) with the redirects “William Blythe III” (200), “Buddy (Clinton's dog)” (5) and “Putting People First” (100)]
Redirects serve as alias labels. Their references count towards the redirection target.
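The rule can be sketched as follows, using the numbers from the diagram. It assumes the three redirects all point at “Bill Clinton”; the function names and the dictionary layout are hypothetical.

```python
def resolve(title, redirects):
    """Follow redirects to the final target, guarding against cycles."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)
        title = redirects[title]
    return title


def aggregate(reference_counts, redirects):
    """Add each redirect's reference count to its target and collect aliases."""
    totals, aliases = {}, {}
    for title, count in reference_counts.items():
        target = resolve(title, redirects)
        totals[target] = totals.get(target, 0) + count
        if target != title:
            aliases.setdefault(target, []).append(title)
    return totals, aliases


reference_counts = {
    "Bill Clinton": 30000,
    "William Blythe III": 200,
    "Buddy (Clinton's dog)": 5,
    "Putting People First": 100,
}
# Assumption: all three titles redirect to "Bill Clinton".
redirects = {
    "William Blythe III": "Bill Clinton",
    "Buddy (Clinton's dog)": "Bill Clinton",
    "Putting People First": "Bill Clinton",
}
totals, aliases = aggregate(reference_counts, redirects)
# totals["Bill Clinton"] is 30000 + 200 + 5 + 100 = 30305
```

The alias list for an article can then be indexed as additional labels alongside the article's own title.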
“Brand” category set
Class Restrictions
[Diagram: “Mary (1985 sitcom)” = ? Candidates with counts: “Mary (Holy Mother)” 50000, “Something about Mary” 5000, “The Mary Tyler Moore Show” 1000, “Mary (1985 series)” 500. Classes shown: imdb_title, Infobox album, Infobox television, Black and white films, ...]
✘
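A sketch of how a class restriction might filter such candidates before ranking them by reference count. The allowed class set reuses the classes named on the slide; the classes assigned to each candidate are invented for illustration and are not taken from DBpedia or from the talk.

```python
def restrict(candidates, allowed_classes):
    """Keep candidates sharing at least one class with the allowed set,
    ordered by reference count (highest first)."""
    kept = [c for c in candidates if allowed_classes & c["classes"]]
    return sorted(kept, key=lambda c: c["references"], reverse=True)


# Assumption: these classes form the allowed set for the "Brand" dataset.
allowed = {"imdb_title", "Infobox album", "Infobox television",
           "Black and white films"}

# Reference counts are from the slide; the class assignments are hypothetical.
candidates = [
    {"title": "Mary (Holy Mother)", "references": 50000,
     "classes": {"Infobox saint"}},
    {"title": "Something about Mary", "references": 5000,
     "classes": {"imdb_title"}},
    {"title": "The Mary Tyler Moore Show", "references": 1000,
     "classes": {"Infobox television"}},
    {"title": "Mary (1985 series)", "references": 500,
     "classes": {"Infobox television"}},
]
remaining = [c["title"] for c in restrict(candidates, allowed)]
```

With these assumed classes the most referenced candidate is dropped, but several candidates of a plausible type remain, so a restriction of this kind narrows the choice without resolving it.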
Class Equivalence
[Diagram with columns “BBC CIS” and “DBpedia”, showing “Mary (1985 sitcom)”, “1985”, “tv brand”, “sitcom”, “Infobox television”, “1980s American television series” and “(15 more)”, and the candidates “Something about Mary”, “The Mary Tyler Moore Show” and “Mary (1985 series)”]
Lucene query: ((+mary 1985 sitcom)) AND ((categories:Category\:1985_television_series_debuts))
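A query of this shape could be assembled as in the sketch below. It assumes the label is lowercased and tokenised, that the first token is marked as required with `+`, and that several categories would be joined with OR; none of these rules is stated on the slide. The escaping follows Lucene's classic query syntax, where special characters are prefixed with a backslash. The function names are hypothetical.

```python
import re

# Characters with special meaning in Lucene's classic query syntax.
LUCENE_SPECIAL = set('+-&|!(){}[]^"~*?:\\/')


def escape(term):
    """Backslash-escape Lucene special characters in a single term."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in term)


def build_query(label, categories):
    """Combine label tokens with a restriction on the 'categories' field."""
    tokens = re.findall(r"\w+", label.lower())
    if not tokens:
        raise ValueError("label contains no searchable tokens")
    label_part = " ".join(["+" + escape(tokens[0])] +
                          [escape(t) for t in tokens[1:]])
    category_part = " OR ".join("categories:" + escape(c) for c in categories)
    return "((%s)) AND ((%s))" % (label_part, category_part)


query = build_query("Mary (1985 sitcom)",
                    ["Category:1985_television_series_debuts"])
```

The colon inside the category name has to be escaped because an unescaped colon separates a field name from its value.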
Class Equivalence
About 5% boost in precision and recall (after class restrictions and exact matching)
Algorithm
- Enrich class hierarchy using parentheses texts
- Perform label-based lookup on all items in the dataset and memorize result candidates
- Rank CIS classes against DBpedia classes
- Perform label lookup restricting results to top 5, 10, 15% class equivalences, excluding the overall top 20% classes
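The ranking step can be sketched under stated assumptions. The slide does not say how CIS classes are ranked against DBpedia classes; the sketch uses plain co-occurrence counts between an item's CIS classes and the DBpedia classes of its memorized candidates. The function name, the data layout and the sample classes (including "Articles with infobox") are hypothetical.

```python
from collections import Counter, defaultdict
from math import ceil


def rank_equivalences(items, top_fraction=0.15, exclude_fraction=0.20):
    """items: list of (cis_classes, candidates); each candidate is the set
    of DBpedia classes of one memorized lookup result."""
    pair_counts = defaultdict(Counter)
    overall = Counter()
    for cis_classes, candidates in items:
        for dbp_classes in candidates:
            overall.update(dbp_classes)
            for cis in cis_classes:
                pair_counts[cis].update(dbp_classes)

    def by_count(counter):
        # Highest count first; ties broken by name for determinism.
        return sorted(counter, key=lambda cls: (-counter[cls], cls))

    # The overall most frequent DBpedia classes are treated as too generic.
    generic = set(by_count(overall)[:int(len(overall) * exclude_fraction)])

    result = {}
    for cis, counts in pair_counts.items():
        ranked = [cls for cls in by_count(counts) if cls not in generic]
        keep = max(1, ceil(len(ranked) * top_fraction))
        result[cis] = ranked[:keep]
    return result


items = [
    ({"sitcom"}, [{"Articles with infobox", "Infobox television",
                   "1980s American television series"},
                  {"Articles with infobox", "Infobox film"}]),
    ({"sitcom"}, [{"Articles with infobox", "Infobox television"}]),
    ({"album"}, [{"Articles with infobox", "Infobox album"}]),
    ({"album"}, [{"Articles with infobox", "Infobox album"}]),
]
equivalences = rank_equivalences(items)
```

The resulting classes per CIS class could then be used to restrict a second label lookup, for example through a field restriction in the Lucene query. Running the function with `top_fraction` set to 0.05, 0.10 and 0.15 in turn would mirror the "top 5, 10, 15%" wording.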
Class Equivalence
[Diagram repeated from the first Class Equivalence slide, with a ✘ mark added]
The Linkage tool
Written in Java, uses Lucene indexes prepared in C#
Command line interface with link and benchmark modes
Components
- Apache Lucene search
- OpenRDF Sesame (native storage)
Dataset-specific algorithm choice and parameters
Next step: General Linking Interface
Future Directions
Improve quality / quantity
- Text-level comparison of content relating to the CIS concepts with Wikipedia articles
- Manual review based on confidence score
General Interlinking Framework
- Describe input data
- Select algorithms
- Link!
Add non-existent resources to DBpedia
- Wikipedia requires high-quality content according to Wikipedia guidelines
- Idea: A “Minipedia” that serves as an additional source to DBpedia
Thanks!
Questions?