Wikipedia Knowledge Extraction

Pronoun Resolution module
Infobox extraction
SRL parsing
Improved refinement
Clustering
Hadoop compatibility

Pronoun Resolution module

“His mother wanted him to get a good education so she sent him to live with his grandparents in Honolulu, HI” (Barack Obama)

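Current solution: replace every pronoun with the article title (very primitive; in the example above, “she” would wrongly become Barack Obama)

Target solution:
◦ Pronoun resolution is still an open research problem; no system solves it reliably
◦ Use an existing system that is usually correct?
◦ Simple rules for common patterns?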
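
The current solution can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the pronoun lists and the function name `replace_pronouns` are assumptions made for the example.

```python
import re

# Assumed, incomplete pronoun lists for illustration only.
PRONOUNS = {"he", "him", "she", "her", "it", "they", "them"}
POSSESSIVES = {"his", "its", "their"}


def replace_pronouns(sentence: str, title: str) -> str:
    """Replace every pronoun with the article title, regardless of referent."""

    def swap(match: re.Match) -> str:
        word = match.group(0).lower()
        if word in POSSESSIVES:
            return title + "'s"
        if word in PRONOUNS:
            return title
        return match.group(0)  # not a pronoun: leave the word unchanged

    # Match whole alphabetic words so "it" inside "with" is not touched.
    return re.sub(r"[A-Za-z]+", swap, sentence)


sentence = (
    "His mother wanted him to get a good education so she sent him "
    "to live with his grandparents in Honolulu, HI"
)
result = replace_pronouns(sentence, "Barack Obama")
print(result)
```

Running this on the example sentence also rewrites “she” (the mother) as the article title, which is exactly the weakness that makes the approach primitive.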

Infobox extraction

Convert information into simple sentences:
◦ Joe Biden is Barack Obama’s Vice President
◦ Barack Obama is preceded by George W. Bush

Use the type of phrase (Noun Phrase, Verb Phrase) to determine which sentence to form.

Read papers from Turing Center (University of Washington)
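
The conversion of infobox fields into simple sentences might look like the sketch below. It assumes the phrase type of each field name ("NP" or "VP") is already known, for example from a parser; the function name `infobox_to_sentence` and the two templates are hypothetical and simply mirror the two example sentences above.

```python
def infobox_to_sentence(title: str, key: str, value: str, phrase_type: str) -> str:
    """Form a simple sentence from one infobox field.

    phrase_type is assumed to be supplied by an upstream parser:
    "NP" for a noun-phrase field name, "VP" for a verb-phrase field name.
    """
    if phrase_type == "NP":
        # Noun phrase: the value fills the role named by the field.
        return f"{value} is {title}'s {key}"
    if phrase_type == "VP":
        # Verb phrase: the field name acts as the predicate of the sentence.
        return f"{title} is {key} {value}"
    raise ValueError(f"unsupported phrase type: {phrase_type}")


np_sentence = infobox_to_sentence("Barack Obama", "Vice President", "Joe Biden", "NP")
vp_sentence = infobox_to_sentence("Barack Obama", "preceded by", "George W. Bush", "VP")
print(np_sentence)
print(vp_sentence)
```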

SRL parsing

SRL (semantic role labeling) performs a deep analysis of each sentence.

E.g. “Yoshi has a long tongue which he uses to grab enemies and eat them.”
◦ has (A0: Yoshi, A1: long tongue)
◦ use (A0: Yoshi, A1: long tongue, A2: grab enemies and eat them)

Use SRL parsing to improve the quality and representation of knowledge.

Problem: speed and complexity
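
One way the SRL output above could be represented and reduced to subject, verb, object tuples is sketched here. The `Frame` class and `to_triple` function are hypothetical names for illustration; the frames are typed in by hand rather than produced by an SRL parser.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """One predicate with its labeled arguments (A0, A1, A2, ...)."""
    predicate: str
    roles: dict = field(default_factory=dict)


def to_triple(frame: Frame):
    """Return (A0, predicate, A1), or None if either argument is missing."""
    a0 = frame.roles.get("A0")
    a1 = frame.roles.get("A1")
    if a0 is None or a1 is None:
        return None
    return (a0, frame.predicate, a1)


# Frames copied from the Yoshi example; a real parser would generate these.
frames = [
    Frame("has", {"A0": "Yoshi", "A1": "long tongue"}),
    Frame("use", {"A0": "Yoshi", "A1": "long tongue",
                  "A2": "grab enemies and eat them"}),
]
triples = [to_triple(f) for f in frames]
print(triples)
```

Note that the pronoun “he” is already resolved to Yoshi in the frames, and that the A2 argument is dropped by the triple form, so a richer representation would be needed to keep the purpose of the action.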

Improved refinement

Current system produces subject, verb, object tuples

Problem: it is hard to decide which words to include in each phrase

E.g. ‘The dog ( Canis lupus familiaris )’ ‘is’ ‘a mammal from the family Canidae’
◦ The dog? dog? The dog ( Canis lupus familiaris )?
◦ a mammal? a mammal from the family Canidae?

Possible solutions:
◦ Different levels of information?
◦ Simple rules based on part-of-speech tags?
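
The two ideas above (levels of information, and rules over part-of-speech tags) could be combined along these lines. This is only a sketch: the tags are Penn Treebank style and are supplied by hand, and the three level names and the function `phrase_levels` are assumptions, not part of the current system.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def phrase_levels(tagged):
    """Return three versions of a phrase given (word, tag) pairs.

    head:  the first run of consecutive nouns
    short: everything up to the end of that run
    full:  the whole phrase
    """
    words = [word for word, _ in tagged]
    full = " ".join(words)
    start = next((i for i, (_, tag) in enumerate(tagged) if tag in NOUN_TAGS), None)
    if start is None:
        # No noun found: fall back to the full phrase at every level.
        return {"head": full, "short": full, "full": full}
    end = start
    while end < len(tagged) and tagged[end][1] in NOUN_TAGS:
        end += 1
    return {
        "head": " ".join(words[start:end]),
        "short": " ".join(words[:end]),
        "full": full,
    }


subject = [("The", "DT"), ("dog", "NN"), ("(", "-LRB-"), ("Canis", "NNP"),
           ("lupus", "NNP"), ("familiaris", "NNP"), (")", "-RRB-")]
obj = [("a", "DT"), ("mammal", "NN"), ("from", "IN"), ("the", "DT"),
       ("family", "NN"), ("Canidae", "NNP")]
subject_levels = phrase_levels(subject)
object_levels = phrase_levels(obj)
print(subject_levels)
print(object_levels)
```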

Clustering

Idea: Determine whether two separate mentions refer to the same concept
◦ ‘The dog’, ‘a dog’, ‘dogs’
◦ ‘Cats’, ‘C.A.T.S’, ‘CAT Scan’
◦ ‘President Obama’, ‘President Barack Obama’

Possible solutions:
◦ Feature-based classification
◦ Self-organizing map
◦ Terms associated
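
As a rough illustration of the feature-based option, the sketch below computes a single token-overlap feature for a pair of mentions and applies a fixed cutoff. The normalization rules, the Jaccard feature, and the 0.6 threshold are all assumptions chosen for the example; a real classifier would be trained on labeled pairs and use more features.

```python
ARTICLES = {"the", "a", "an"}


def normalize(mention: str) -> frozenset:
    """Lowercase, drop articles, and strip a trailing plural "s" (naive)."""
    tokens = set()
    for token in mention.lower().split():
        if token in ARTICLES:
            continue
        if token.isalpha() and len(token) > 3 and token.endswith("s"):
            token = token[:-1]
        tokens.add(token)
    return frozenset(tokens)


def jaccard(a: str, b: str) -> float:
    """Token overlap between two mentions after normalization."""
    ta, tb = normalize(a), normalize(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def same_concept(a: str, b: str, threshold: float = 0.6) -> bool:
    """Hand-set decision rule standing in for a trained classifier."""
    return jaccard(a, b) >= threshold


pairs = [("The dog", "dogs"), ("Cats", "CAT Scan"),
         ("President Obama", "President Barack Obama")]
for a, b in pairs:
    print(a, "|", b, "->", same_concept(a, b))
```

With these assumed rules, the dog mentions and the two Obama mentions are grouped together, while ‘Cats’ and ‘CAT Scan’ are kept apart.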

Hadoop compatibility

Need to ensure the system can scale for the move to regular Wikipedia

Hadoop is an open-source framework that implements the MapReduce programming model

MapReduce parallelizes a job by splitting its input across several machines: a map step processes each piece independently, and a reduce step combines the partial results
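
The map and reduce steps can be illustrated with a small single-machine sketch in plain Python. It does not use Hadoop's API; the function names and the word-count task are assumptions chosen because they are easy to follow. On a cluster, each call to the map function could run on a different machine.

```python
from collections import defaultdict


def map_article(article):
    """Map step: emit (word, 1) for every word in one article."""
    _title, sentences = article
    for sentence in sentences:
        for word in sentence.lower().split():
            yield (word, 1)


def reduce_counts(_key, values):
    """Reduce step: combine all values emitted for one key."""
    return sum(values)


def run_job(articles):
    """Run map, group the output by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for article in articles:  # independent per article, so parallelizable
        for key, value in map_article(article):
            groups[key].append(value)
    return {key: reduce_counts(key, values) for key, values in groups.items()}


articles = [
    ("Dog", ["the dog is a mammal"]),
    ("Cat", ["the cat is a mammal"]),
]
counts = run_job(articles)
print(counts)
```

Because the map step only looks at one article at a time, the same code structure carries over when the article collection grows.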
