2
Why Mine Wikipedia?
• How can we automatically extract the unstructured content from Wikipedia …
• … to create a structured database of information …
• … that can be leveraged by users in applications and data loads
4
Problem is …
• Wikipedia is written by humans, for humans.
- Great if you need to look up a fact, or learn about something
• But you can’t …
- Ask questions: “What movies by George Lucas has Harrison Ford starred in?”
- Search effectively: “Find me all companies that build personal computers.”
- Build applications: “Let’s make a social app that ranks consumer goods listed in Wikipedia.”
11
Searching for Structure: Properties
What are the highest buildings in the world?
{
  "query" : [ {
    "type" : "/architecture/structure",
    "name" : null,
    "height_meters" : null,
    "sort" : "-height_meters",
    "limit" : 10
  } ]
}
12
Searching for Structure: Properties
What are all the countries that speak English?
{
  "query" : [ {
    "type" : "/location/country",
    "name" : null,
    "official_language" : "English",
    "limit" : 100
  } ]
}
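Because these query envelopes are plain JSON, they can also be assembled programmatically. The following is a minimal Python sketch, not part of the original slides: the helper name build_mql_query and its parameters are hypothetical, and it only builds and serializes the envelope without contacting any service.

```python
import json

def build_mql_query(type_id, limit, sort=None, **properties):
    """Build a query envelope shaped like the two examples above.

    Hypothetical helper: a property set to None asks for its value
    (JSON null), while any other value acts as a constraint.
    """
    clause = {"type": type_id, "name": None}
    clause.update(properties)
    if sort is not None:
        clause["sort"] = sort
    clause["limit"] = limit
    return {"query": [clause]}

# "What are the highest buildings in the world?"
tallest = build_mql_query(
    "/architecture/structure", 10,
    sort="-height_meters", height_meters=None,
)

# "What are all the countries that speak English?"
english = build_mql_query(
    "/location/country", 100, official_language="English",
)

# json.dumps writes None as null and never emits trailing commas,
# so the result is always valid JSON.
print(json.dumps(tallest, indent=2))
print(json.dumps(english, indent=2))
```

Generating the envelope from a dict avoids the hand-typing mistakes (missing commas, curly quotes) that make a query invalid.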
13
A Treasure Trove Waiting To Be Opened
• 2,150,000 articles (ie, topics)
• 7,100,000 category refs (i.e., typings)
- Found within 280,000 categories
• 42,000,000 template values (i.e., properties)
- Found within 10,000 templates and 56,000 template keys
• All growing at ~2% every two weeks
• Available information doubles every year!
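The two growth figures can be related with a quick compounding check. This is a small Python sketch added for illustration, assuming 26 two-week periods per year and steady compound growth (neither assumption is stated on the slide).

```python
PERIODS_PER_YEAR = 26  # assumption: 52 weeks / 2

def annual_factor(rate_per_period):
    """Growth factor after one year of compounding at the given biweekly rate."""
    return (1 + rate_per_period) ** PERIODS_PER_YEAR

def rate_for_annual_factor(factor):
    """Biweekly rate that compounds to the given yearly growth factor."""
    return factor ** (1 / PERIODS_PER_YEAR) - 1

print(round(annual_factor(0.02), 2))                 # 1.67
print(round(rate_for_annual_factor(2.0) * 100, 1))   # 2.7
```

Under these assumptions, 2% every two weeks compounds to roughly 1.67x per year, and doubling within a year corresponds to about 2.7% per two weeks, so the two figures agree only as rounded estimates.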
16
Similar, but different …
• Many pages in Wikipedia are not topics
- Disambiguation pages, lists, categories, images, docs, talk …
• Only store a 1200-character blurb
- We’re not Wikipedia, after all
• Don’t need to add “(suffix)” to names
- “Python (genus)” vs “Python (programming language)”
- Freebase types disambiguate without names
• Cities should be specified without state suffix
- “San Francisco” vs “San Francisco, California”
- Cleanup in progress, some exceptions remain
• “Exclusionist” vs “Inclusionist”
- Exclusionists appear to be winning in Wikilandia
- Freebase is inherently more inclusionist
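The naming rules above (no parenthetical suffix, no state suffix on cities) can be sketched as a small normalization step. This Python example is illustrative only: the function name topic_name and the is_city flag are hypothetical, and the slides do not describe how the actual cleanup is implemented.

```python
import re

# Matches a trailing parenthetical such as " (genus)".
_PAREN_SUFFIX = re.compile(r"\s*\([^()]*\)\s*$")

def topic_name(article_title, is_city=False):
    """Derive a topic name from a Wikipedia article title.

    Assumption: the caller already knows whether the topic is typed
    as a city; only then is the ", State" suffix dropped.
    """
    name = _PAREN_SUFFIX.sub("", article_title)
    if is_city:
        name = name.split(",", 1)[0].strip()
    return name

print(topic_name("Python (genus)"))                           # Python
print(topic_name("Python (programming language)"))            # Python
print(topic_name("San Francisco, California", is_city=True))  # San Francisco
```

Both Python articles collapse to the same name, which is acceptable here because the topic's types, not its name, carry the disambiguation.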
17
You Can’t Read The Same Wikipedia Twice
Every 2 weeks …
- 65,000 new pages
- 30,000 new topics
- 80,000 new aliases
- 10,000 merges
- 8,000 deletes
- 5,000 name changes
- 1,000 page ID changes
- 1,000 splits
… change in Wikipedia
18
Keeping track of changes …
• Store reference information within Freebase
- Page_ids, article titles and redirects
- Page_id (WPID) is stored in /wikipedia/en_id
- Article titles and redirects are stored in /wikipedia/en
- “mwcl_wikipedia_en”, “mw_infobot” user
• None of these IDs are stable in wiki-land …
19
Determining actions by comparing keys
• Because we are more inclusionist than Wikipedia, we usually do not delete topics.
• Topic renames only occur on “untouched” topics.
• Merges occur automatically on “untouched” topics
- Otherwise, flagged for review in “pipeline”
case          action
new topic     create a new topic
name change   add new name as en key; if "untouched", rename the topic
id change     change the en_id to the new value
merge         move the en key to the new topic; if "untouched", merge the topics
split         create new topic, move en key from old topic to new topic
delete        keep topic, but delete en_id and en keys from topic
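The case/action table above can be read as a set of small update rules. The following Python sketch is one possible rendering under stated assumptions: the Topic class, its field names, and review_queue are hypothetical stand-ins for the real store and "pipeline", and detecting which case applies is left out.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Topic:
    name: str
    en_id: Optional[int] = None                 # key in /wikipedia/en_id
    en_keys: set = field(default_factory=set)   # keys in /wikipedia/en
    untouched: bool = True                      # assumption: no user edits yet
    merged_into: Optional["Topic"] = None

review_queue = []  # stand-in for the review "pipeline"

def new_topic(title, page_id):
    return Topic(name=title, en_id=page_id, en_keys={title})

def name_change(topic, new_title):
    topic.en_keys.add(new_title)      # always add the new name as an en key
    if topic.untouched:
        topic.name = new_title        # rename only untouched topics

def id_change(topic, new_id):
    topic.en_id = new_id

def merge(source, target, title):
    source.en_keys.discard(title)     # move the en key to the surviving topic
    target.en_keys.add(title)
    if source.untouched:
        source.merged_into = target   # automatic merge
    else:
        review_queue.append(("merge", source.name, target.name))

def split(old, title, new_id):
    old.en_keys.discard(title)        # en key moves to the new topic
    return Topic(name=title, en_id=new_id, en_keys={title})

def delete(topic):
    topic.en_id = None                # the topic itself is kept
    topic.en_keys.clear()
```

Note that delete never removes the topic, which reflects the more inclusionist policy: only the links back to Wikipedia are dropped.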
21
Map Template Fields To Properties
{{infobox Aircraft
 |subtemplate={{Infobox Boeing Aircraft}}
 |name = Boeing 777
 |manufacturer = [[Boeing Commercial Airplanes]]
 |first flight = [[June 12]] [[1994]]
 |introduction = [[June 7]] [[1995]] with [[United]]
 |primary user = [[Singapore Airlines]]
 |more users = [[Air France-KLM]]
 |produced = 1993 - Present
 |number built = 723 as of March 2008
 |unit cost = US$187.5-253 million
}}
[Figure: MediaWiki template rendering]
“manufacturer” --> /aviation/aircraft_model/manufacturer
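The field-to-property mapping can be illustrated with a short Python sketch. It is a simplified illustration, not the extraction pipeline itself: the parser handles only the flat key=value form shown in the infobox, the function names are hypothetical, and the only mapping entry taken from the slide is "manufacturer".

```python
import re

# Only the "manufacturer" entry comes from the slide; a real table would be larger.
FIELD_TO_PROPERTY = {
    "manufacturer": "/aviation/aircraft_model/manufacturer",
}

SAMPLE = ("{{infobox Aircraft |subtemplate={{Infobox Boeing Aircraft}} "
          "|name =Boeing 777 |manufacturer =[[Boeing Commercial Airplanes]] "
          "|number built = 723 as of March 2008}}")

def split_top_level(body):
    """Split on '|' characters that are not nested inside {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts

def parse_infobox(wikitext):
    """Return the template's fields as a {key: raw value} dict."""
    text = wikitext.strip()
    if not (text.startswith("{{") and text.endswith("}}")):
        raise ValueError("not a template call")
    fields = {}
    for part in split_top_level(text[2:-2])[1:]:  # part 0 is the template name
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def map_fields(fields):
    """Keep mapped fields only, replacing [[links]] with their target titles."""
    mapped = {}
    for key, value in fields.items():
        prop = FIELD_TO_PROPERTY.get(key)
        if prop:
            mapped[prop] = re.sub(r"\[\[([^\]|]*)(?:\|[^\]]*)?\]\]", r"\1", value)
    return mapped

print(map_fields(parse_infobox(SAMPLE)))
```

Fields without a mapping entry (such as "number built") are simply skipped in this sketch.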
23
Just the Starting Point …
• Extracted to date from Wikipedia:
- 2,365,000 topics
- 2,895,000 typings
- 5,638,000 properties
• A complement to user-entered data
- User data always takes precedence, won’t be overwritten
• Processes are being automated to keep in sync