Wikipedia Knowledge Extraction
Pronoun Resolution module
Infobox extraction
SRL parsing
Improved refinement
Clustering
Hadoop compatibility
“His mother wanted him to get a good education so she sent him to live with his grandparents in Honolulu, HI” (Barack Obama)
Current solution: replace pronouns with article title (very primitive)
Target solution:
◦ Pronoun resolution is still an open problem in general
◦ Use an existing system that is usually correct?
◦ Simple rules for common patterns?
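To make the weakness of the current approach concrete, here is a minimal sketch of title substitution. The pronoun sets and the function name `replace_pronouns` are assumptions for illustration, not the project's actual code; "her" is left out because it can be either possessive or object.

```python
import re

# Hypothetical pronoun tables (illustrative, not exhaustive).
SUBJECT_OBJECT = {"he", "she", "him", "they", "them"}
POSSESSIVE = {"his", "their"}


def replace_pronouns(sentence: str, title: str) -> str:
    """Replace every listed pronoun with the article title."""

    def swap(match: re.Match) -> str:
        word = match.group(0)
        lower = word.lower()
        if lower in POSSESSIVE:
            return title + "'s"
        if lower in SUBJECT_OBJECT:
            return title
        return word  # not a pronoun we know about: keep as-is

    # Match whole alphabetic words so "the" is not mistaken for "he".
    return re.sub(r"[A-Za-z]+", swap, sentence)


sentence = (
    "His mother wanted him to get a good education so she sent him "
    "to live with his grandparents in Honolulu, HI"
)
resolved = replace_pronouns(sentence, "Barack Obama")
print(resolved)
```

On the example sentence, "she" is also replaced with "Barack Obama", although it refers to his mother. That is the kind of error a rule set or an existing resolution system would need to avoid.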
Convert information into simple sentences:
◦ Joe Biden is Barack Obama’s Vice President
◦ Barack Obama is preceded by George W. Bush
Use type of phrase (Noun Phrase, Verb Phrase) to determine sentence to form.
Read papers from Turing Center (University of Washington)
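The infobox-to-sentence idea can be sketched with two templates, one per phrase type. The field names, the `FIELD_PHRASES` table, and the hard-coded phrase types below are assumptions; a real system would presumably derive the phrase type with a chunker or part-of-speech tagger.

```python
# Hypothetical mapping: infobox field -> (phrase type, surface phrase).
FIELD_PHRASES = {
    "vicepresident": ("NP", "Vice President"),
    "predecessor": ("VP", "preceded by"),
}


def infobox_to_sentences(title: str, infobox: dict) -> list:
    """Turn known infobox fields into simple sentences."""
    sentences = []
    for field, value in infobox.items():
        if field not in FIELD_PHRASES:
            continue  # no template for this field yet
        phrase_type, phrase = FIELD_PHRASES[field]
        if phrase_type == "NP":
            # Noun phrase: "<value> is <title>'s <phrase>"
            sentences.append(f"{value} is {title}'s {phrase}")
        else:
            # Verb phrase: "<title> is <phrase> <value>"
            sentences.append(f"{title} is {phrase} {value}")
    return sentences


sentences = infobox_to_sentences(
    "Barack Obama",
    {"vicepresident": "Joe Biden", "predecessor": "George W. Bush"},
)
print(sentences)
```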
Performs a deep analysis on each sentence.
E.g. “Yoshi has a long tongue which he uses to grab enemies and eat them.”
◦ has (A0: Yoshi, A1: long tongue)
◦ use (A0: Yoshi, A1: long tongue, A2: grab enemies and eat them)
Use SRL parsing to improve quality and representation of knowledge.
Problem: speed and complexity
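One possible way to hold SRL output and relate it to simple tuples is sketched below. The `Frame` class and `frame_to_tuple` are hypothetical names, and the frames are typed in by hand from the example above rather than produced by a parser.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """One predicate with its labeled arguments (A0, A1, A2, ...)."""
    predicate: str
    args: dict  # role label -> argument text


def frame_to_tuple(frame: Frame):
    """Project a frame onto (A0, predicate, A1); None if a role is missing."""
    if "A0" not in frame.args or "A1" not in frame.args:
        return None
    return (frame.args["A0"], frame.predicate, frame.args["A1"])


# Frames copied from the Yoshi example, not generated by an SRL parser.
frames = [
    Frame("has", {"A0": "Yoshi", "A1": "long tongue"}),
    Frame("use", {"A0": "Yoshi", "A1": "long tongue",
                  "A2": "grab enemies and eat them"}),
]
tuples = [frame_to_tuple(f) for f in frames]
print(tuples)
```

The projection drops A2, which shows what a plain three-part tuple loses compared with the full frame.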
Current system has Subject, Object, Verb tuples
Problem: hard to define what words to incorporate in each phrase
E.g. “'The dog ( Canis lupus familiaris )' 'is' 'a mammal from the family Canidae'”
◦ The dog? dog? The dog ( Canis lupus familiaris )?
◦ a mammal? a mammal from the family Canidae?
Possible solutions:
◦ Different levels of information?
◦ Simple rules based on part of speech tags?
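The "different levels of information" idea could look roughly like this: keep several versions of each phrase, from most to least detailed. The function name `phrase_levels`, the determiner list, and the regex rules are assumptions for illustration only.

```python
import re

DETERMINERS = {"the", "a", "an"}  # illustrative, not exhaustive


def phrase_levels(phrase: str) -> list:
    """Return versions of a phrase from most to least detailed."""
    full = phrase.strip()
    # Level 2: drop parenthetical material such as "( Canis lupus familiaris )".
    no_paren = re.sub(r"\s*\([^)]*\)", "", full).strip()
    # Level 3: additionally drop a leading determiner.
    words = no_paren.split()
    if words and words[0].lower() in DETERMINERS:
        words = words[1:]
    bare = " ".join(words)

    levels = []
    for candidate in (full, no_paren, bare):
        if candidate and candidate not in levels:
            levels.append(candidate)  # keep order, skip duplicates
    return levels


print(phrase_levels("The dog ( Canis lupus familiaris )"))
print(phrase_levels("a mammal from the family Canidae"))
```

This sketch does not cut "a mammal from the family Canidae" down to "a mammal"; trimming the prepositional phrase is where part-of-speech tags would likely be needed.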
Idea: Determine whether two separate mentions point to the same concept
◦ ‘The dog’, ‘a dog’, ‘dogs’
◦ ‘Cats’, ‘C.A.T.S’, ‘CAT Scan’
◦ ‘President Obama’, ‘President Barack Obama’
Possible solutions:
◦ Feature-based classification
◦ Self-organizing map
◦ Terms associated
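For the feature-based option, a first step might be computing a few features per pair of mentions, which a classifier could then consume. The features, the crude plural stripping, and the names `normalize` and `pair_features` are all assumptions; no classifier is trained here.

```python
DETERMINERS = {"the", "a", "an"}  # illustrative, not exhaustive


def normalize(mention: str) -> list:
    """Lowercase, drop determiners, and strip a trailing plural 's' (crudely)."""
    tokens = [t for t in mention.lower().split() if t not in DETERMINERS]
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def pair_features(a: str, b: str) -> dict:
    """Features describing how similar two mentions look on the surface."""
    ta, tb = set(normalize(a)), set(normalize(b))
    union = ta | tb
    return {
        "same_tokens": ta == tb,
        "subset": ta <= tb or tb <= ta,
        "jaccard": len(ta & tb) / len(union) if union else 0.0,
    }


print(pair_features("The dog", "dogs"))
print(pair_features("President Obama", "President Barack Obama"))
print(pair_features("Cats", "CAT Scan"))
```

Note that "Cats" and "CAT Scan" also come out as a subset pair, so surface features alone would link them wrongly; context features such as associated terms would be needed to separate them.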
Need to ensure scaling is possible for move to regular Wikipedia
Hadoop is an open-source implementation of the MapReduce programming model
MapReduce parallelizes a computation by splitting the input over several machines (map) and then combining the partial results by key (reduce)
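The map, shuffle, and reduce steps can be sketched in a single Python process. This does not use the Hadoop API; the function names are hypothetical, and word counting stands in for whatever extraction step would actually be distributed. On a cluster, each `map_phase` call could run on a different machine.

```python
from collections import defaultdict


def map_phase(sentence: str) -> list:
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in sentence.split()]


def shuffle(pairs: list) -> dict:
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list) -> tuple:
    """Combine all values for one key into a single result."""
    return key, sum(values)


def run_job(sentences: list) -> dict:
    mapped = [pair for s in sentences for pair in map_phase(s)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = run_job(["the dog is a mammal", "The dog barks"])
print(counts)
```

Because each record is mapped independently and each key is reduced independently, the same structure should carry over when moving from a small test corpus to the full Wikipedia.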