Relational Inference for Wikification Xiao Cheng and Dan Roth University of Illinois at...

Preview:

Citation preview

1

Relational Inference for Wikification

Xiao Cheng and Dan RothUniversity of Illinois at Urbana-Champaign

2

Wikification

Blumenthal (D) is a candidate for the U.S. Senate seat now held by Christopher Dodd (D), and he has held a commanding lead in the race since he entered it. But the Times report has the potential to fundamentally reshape the contest in the Nutmeg State.

Blumenthal (D) is a candidate for the U.S. Senate seat now held by Christopher Dodd (D), and he has held a commanding lead in the race since he entered it. But the Times report has the potential to fundamentally reshape the contest in the Nutmeg State.

3

Applications

Knowledge Acquisition via Grounding Coreference Resolution

Learning-based multi-sieve co-reference resolution with knowledge (Ratinov et al. 2012)

Information Extraction Unsupervised relation discovery with sense disambiguation (Yao et al.

2012) Automatic Event Extraction with Structured Preference Modeling (Lu

and Roth, 2012 ) Text Classification

Gabrilovich and Markovitch, 2007; Chang et al., 2008 Entity Linking

4

Ambiguity

Concepts outside of Wikipedia (NIL) Blumenthal ?

Variability

Scale Millions of labels

Challenges

Blumenthal (D) is a candidate for the U.S. Senate seat now held by Christopher Dodd (D), and he has held a commanding lead in the race since he entered it. But the Times report has the potential to fundamentally reshape the contest in the Nutmeg State.

ConnecticutCT

The Nutmeg StateTimesThe New York TimesThe Times

5

State-of-the-art systems (Ratinov et al. 2011) can achieve the above with local and global statistical features Reaches bottleneck around 70%~ 85% F1 on non-wiki datasets What is missing?

Challenges

Blumenthal (D) is a candidate for the U.S. Senate seat now held by Christopher Dodd (D), and he has held a commanding lead in the race since he entered it. But the Times report has the potential to fundamentally reshape the contest in the Nutmeg State.

6

, the of deposed , …

Motivating Example

Mubarak wife Egyptian President Hosni Mubarak

What are we missing with Bag of Words (BOW) models? Who is Mubarak?

Constraining interaction between concepts (Mubarak, wife, Hosni Mubarak)

Mubarak, the wife of deposed Egyptian President Hosni Mubarak, …

7

Relational Inference for Wikification

Our contribution Identify key textual relations for Wikification A global inference framework to incorporate relational knowledge

Significant improvement over state-of-the-art systems

Mubarak, the wife of deposed Egyptian President Hosni Mubarak, …

(Mubarak, wife, Hosni Mubarak)

8

Talk Outline

Why Wikification? Introduction

Motivation Approach

Wikification Pipeline Formulation Relational Analysis

Evaluation Result Entity Linking

9

Mention Segmentation

Candidate Generation

Candidate Ranking

NIL Linking

Wikification

Mention Segmentation

Candidate Generation

Candidate Ranking

Determine NILs

10

Wikification Pipeline 1 - Mention Segmentation

...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party…

sub-NP (Noun Phrase) chunks NER Regular expressions

11

Wikification Pipeline 1 - Mention Segmentation

...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party…

12

Wikification Pipeline 2 - Candidate Generation

...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party…

k ek4

1 Socialist_Party_(France)

2 Socialist_Party_(Portugal)

3 Socialist_Party_of_America

4 Socialist_Party_(Argentina)

k ek3

1 Slobodan_Milošević

2 Milošević_(surname)

3 Boki_Milošević

4 Alexander_Milošević

13

Wikification Pipeline 3 - Candidate Ranking

...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party…

k ek4 sk

4

1 Socialist_Party_(France) 0.23

2 Socialist_Party_(Portugal) 0.16

3 Socialist_Party_of_America 0.07

4 Socialist_Party_(Argentina) 0.06

k ek3 sk

3

1 Slobodan_Milošević 0.7

2 Milošević_(surname) 0.1

3 Boki_Milošević 0.1

4 Alexander_Milošević 0.05

Local and global statistical features

14

Wikification Pipeline 4 – Determine NILs

...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party…

k ek4 sk

4

1 Socialist_Party_(France) 0.23

k ek3 sk

3

1 Slobodan_Milošević 0.7

Is the top candidate really what the text referred to?

15

Talk Outline

Why Wikification? Introduction

Motivation Approach

Wikification Pipeline Formulation Relational Analysis

Evaluation Result Entity Linking

16

Formulation (0)

Intuition Promote pairs of concepts coherent with textual relations

Mubarak, the wife of deposed Egyptian President Hosni Mubarak, …

(Mubarak, wife, Hosni Mubarak)

17

Formulate as an Integer Linear Program (ILP):

If no relation exists, collapse to the non-structured decision

Formulation (1)

Whether a relation exists between and

weight of a relation

Whether to output th candidate of the th mentionweight to output

18

Formulation (2)

...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party…

k ek4 sk

4

1 Socialist_Party_(France) 0.23

2 Socialist_Party_(Portugal) 0.16

3 Socialist_Party_of_America 0.07

4 Socialist_Party_(Argentina) 0.06

k ek3 sk

3

1 Slobodan_Milošević 0.7

2 Milošević_(surname) 0.1

3 Boki_Milošević 0.1

4 Alexander_Milošević 0.05

r(1,2)34

eki: whether a concept is chosen

ski : score of a concept

r(k,l)ij: whether a relation is present

w(k,l)ij : score of a relation

r(4,3)34

19

Overall Approach

Wikification

Candidate

Generation

Relation Analysis

Relation Identification

20

Relation Identification

ACE style in-document coreference Extract named entity-only coreference relations with high precision

Syntactico-Semantic relations (Chan & Roth ‘10)

Easy to extract with high precision Aim for high recall, as false-positives will be verified and discarded Sparse, but covers ~80% relation instances in ACE2004

Type Example

Premodifier Iranian Ministry of Defense

Possessive NYC’s stock exchange

Formulaic Chicago, Illinois

Preposition President of the US

21

Relation Identification

...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party…

Argument 1 Relation Type Argument 2

Yugoslav President apposition Slobodan Milošević

Slobodan Milošević coreference Milošević

Milošević possessive Socialist Party

22

Overall Approach

Wikification

Candidate

Generation

Relation Analysis

Relation Identification

23

Relation Retrieval

Current approach Collect known mappings from Wikipedia page titles, hyperlinks… Limit to top-K candidates based on frequency of links (Ratinov et al.

2011) What concepts can “Socialist Party” refer to?

...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party…

24

Uninformative Mentions

25

Relation Retrieval

What concepts can “Socialist Party” refer to? More robust candidate generation

Identified relations are verified against a knowledge base (DBPedia) Retrieve relation arguments matching “(Milošević ,?,Socialist Party)”

as our new candidates

...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party…

26

Query Pruning Only 2 queries per pair necessary due to strong baseline.

Relation Retrieval

q1=(Socialist Party of France,?, *Milošević*)q2=(Slobodan Milošević,?,*Socialist Party*)

k ek4 sk

4

1 Socialist_Party_(France) 0.23

k ek3 sk

3

1 Slobodan_Milošević 0.7

...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party…

27

Relation Retrieval

Argument 1 Relation Type Argument 2

Milošević possessive Socialist Party

28

Relation Retrieval

...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party…

k ek4 sk

4

1 Socialist_Party_(France) 0.23

2 Socialist_Party_(Portugal) 0.16

3 Socialist_Party_of_America 0.07

4 Socialist_Party_(Argentina) 0.06

21 Socialist_Party_of_Serbia 0.0

k ek3 sk

3

1 Slobodan_Milošević 0.7

2 Milošević_(surname) 0.1

3 Boki_Milošević 0.1

4 Alexander_Milošević 0.05

𝑟34(1,21)=1

29

Overall Approach

Wikification

Candidate

Generation

Relation Analysis

Relation Identification

30

Relational Inference

...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party…

k ek4 sk

4

1 Socialist_Party_(France) 0.23

2 Socialist_Party_(Portugal) 0.16

3 Socialist_Party_of_America 0.07

4 Socialist_Party_(Argentina) 0.06

21 Socialist_Party_of_Serbia 0.0

k ek3 sk

3

1 Slobodan_Milošević 0.7

2 Milošević_(surname) 0.1

3 Boki_Milošević 0.1

4 Alexander_Milošević 0.05

𝑟34(1,21)=1

𝑤34(1,21)=?

31

Relation scoring

Query scoring as a tie-breaker between multiple relations Explicit relations are stronger than a hyperlink relations Normalize score for each pair of mention to

𝑤=1𝑍𝑓 (𝑞 ,𝜎 )

Retrieved relation tupleRelation query

32

Relational Inference

...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party…

k ek4 sk

4

1 Socialist_Party_(France) 0.23

2 Socialist_Party_(Portugal) 0.16

3 Socialist_Party_of_America 0.07

4 Socialist_Party_(Argentina) 0.06

21 Socialist_Party_of_Serbia 0.0

k ek3 sk

3

1 Slobodan_Milošević 0.7

2 Milošević_(surname) 0.1

3 Boki_Milošević 0.1

4 Alexander_Milošević 0.05

𝑟34(1,21)=1

1

33

Relational Inference - coreference

...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party…

k ek2 sk

2

1 Slobodan_Milošević 1.0

k ek3 sk

3

1 Slobodan_Milošević 0.7

2 Milošević_(surname) 0.1

3 Boki_Milošević 0.1

4 Alexander_Milošević 0.05

𝑟23(1,1)=1

34

Overall Approach

Wikification

Candidate

Generation

Relation Analysis

Relation Identification

35

Determine unknown concepts (NILs)

How to capture the fact: “Dorothy Byrne” does not refer to any concept in Wikipedia

Identify coreferent nominal mention relations Generate better features for NIL classifier

Dorothy Byrne, a state coordinator for the Florida Green Party,…

k ek2 sk

2

1 Green_Party_of_Florida 1.0

k ek1 sk

1

1 Dorothy_Byrne_(British_Journalist)

0.6

2 Dorothy_Byrne_(mezzo-soprano)

0.4

nominal mention

36

Determine unknown concepts (NILs)

Create NIL candidate for propagation

Dorothy Byrne, a state coordinator for the Florida Green Party,…

k ek2 sk

2

1 Green_Party_of_Florida 1.0

k ek1 sk

1

0 NIL 1.0

1 Dorothy_Byrne_(British_Journalist)

0.6

2 Dorothy_Byrne_(mezzo-soprano)

0.4

nominal mention

37

Talk Outline

Why Wikification? Introduction

Motivation Approach

Wikification Pipeline Formulation Relational Analysis

Evaluation Result Entity Linking

38

Wikification Performance Result

ACE MSNBC AQUAINT Wikipedia60

65

70

75

80

85

90

95

F1 Performance on Wikification datasets

Milne&WittenRatinov&RothRelational Inference

39

Evaluation – TAC KBP Entity Linking

Wikipedia Articles

0

500000

1000000

1500000

2000000

2500000

3000000

3500000

4000000

4500000

5000000

TAC KBNon-TAC KB

Task Definition Links a named entity in a document to

either a TAC Knowledge Base (TAC KB) node or NIL

Cluster NIL entities Relevant Tasks

Wikification Cross-document coreference

40

Evaluation – TAC KBP Entity Linking

Run Relational Inference (RI) Wikifier “as-is”: No retraining using TAC data

68

72

76

80

84

88

RI

CogCompLC

C

MS_MLI

NUShime

*Median

TAC KBP 2011 Entity Linking Performance

Micro AverageB³F1

System Names*Median of top 14 systems

41

Conclusion

Importance of linguistic and world knowledge Identification of relational information benefits Wikification Introduced an inference framework to leverage better

language understandings Future work

Accumulate what we know about NIL concepts Joint entity typing, coreference and disambiguation Incorporate more relations

*Demo will be updated in a week at: http://cogcomp.cs.illinois.edu/demo/wikify Download at: http://cogcomp.cs.illinois.edu/page/download_view/Wikifier

Thank you!

42

BACK UP SLIDESBack up slides

43

Massive Textual Information

How can we know more from large volumes of possibly unfamiliar raw texts?

44

Massive Textual Information

We naturally “look up” concepts and accumulate knowledge.

45

Challenges

It’s a version of Chicago – the standard classic Macintosh menu font…

Tuesday marks the 142nd anniversary of an event that forever altered the course of Chicago's development as a then-young American city

By the time the New Orleans Saints kicked off at Soldier Field on Sunday afternoon, their woeful history in the Windy City was fully understood.

?

Ambiguity

Variety

Concepts outside of Wikipedia (NIL)

Challenges

46

It’s a version of Chicago – the standard classic Macintosh menu font…

Tuesday marks the 142nd anniversary of an event that forever altered the course of Chicago's development as a then-young American city

By the time the New Orleans Saints kicked off at Soldier Field on Sunday afternoon, their woeful history in the Windy City was fully understood.

?Chicago Chicago

Chicago font

ChicagoChicago

The Windy City

47

Organizing knowledge

Wikipedia as a source of “common sense” knowledge Naturally bridges rich structured knowledge and text data Comprehensive for most purposes

~4.3 million English articles as of today Cross-lingual

Estimated size of a printed version of Wikipedia as of August 2010. *Picture courtesy of Wikipedia

48

Motivating Example

Relatively sparse, but act as hard constraints

Intuitively, we need to “de-coreference” this pair of mention Opens a new dimension in text understanding

helps all stages of Wikification

Mubarak, the wife of deposed Egyptian President Hosni Mubarak, …

Mubarak

Hosni Mubarak

Egyptian President

wife

49

Wikification Approach

Mention Segmentation Use Shallow Parsing, NER and regular expression to generate likely

mentions of concepts. Match nested mentions using dictionary. Discard unknown mentions.

...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party…

50

Ranking

Features Local features from Bag of Words (BOW) representation, such as

various TFIDF windows. Global features from Bag of Concepts (BOC) representation.

What are we losing?

Mubarak

Hosni Mubarak

Egyptian President

wife

51

Relation Extraction

In document coreference Extract entity-only coreference relations with high precision

Syntactico-Semantic relations (Chan & Roth ‘10)

Easy to extract with high precision Deep parsing too slow

Premodifier Iranian Ministry of Defense

Possessive NYC’s stock exchange

Formulaic Chicago, Illinois

Preposition President of the US

52

Relational Inference

...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party

Coreference

Possessive

53

Relational Inference

...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party

How to look up knowledge and use these relations?

54

Wikification Approach

Recall concepts Rank concepts Determine unknown concepts (NILs)

55

Determine NILs

Dorothy Bryne, a state coordinator for the Florida Green Party

Leverage nominal mention relations to construct limited context

56

Relational Inference Example

...like Chicago O'Hare and [La Guardia] in [New York],...

Wikipedia Page Ski

LaGuardia_Airport 0.36La_Guardia 0.22Fiorello_La_Guardia 0.18La_Guardia,_Spain 0.11

Wikipedia Page Ski

New_York 0.53New_York_City 0.44New_York_(magazine) 0.01New_York_Knicks 0.01

57

Relational Inference Scenario 1

...like Chicago O'Hare and [La Guardia] in [New York],...

Wikipedia Page ScoreLaGuardia_Airport 0.36La_Guardia 0.22Fiorello_La_Guardia 0.18La_Guardia,_Spain 0.11

Wikipedia Page ScoreNew_York 0.53New_York_City 0.44New_York_(magazine) 0.01New_York_Knicks 0.01

Γ = 0.18+0.53 = 0.71

58

Relational Inference Scenario 2

...like Chicago O'Hare and [La Guardia] in [New York],...

Wikipedia Page ScoreLaGuardia_Airport 0.36La_Guardia 0.22Fiorello_La_Guardia 0.18La_Guardia,_Spain 0.11

Wikipedia Page ScoreNew_York 0.53New_York_City 0.44New_York_(magazine) 0.01New_York_Knicks 0.01

Γ = 0.36+0.44 = 0.8

59

Relational Inference Conclusion

...like Chicago O'Hare and [La Guardia] in [New York],...

Pr(Γ= ) > Pr(Γ= )

60

Constraint scaling

Lexical similarity as a tie-breaker between multiple relations Explicit relations are stronger than a hyperlink relations Normalize score for each pair of mention to [0,1]

Relation scaling factor

Query lexical similarityNormalization factor

: query: a retrieved relation triple

61

Wikification Pipeline 1 - Mention Segmentation

...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party…

Sub noun phrase chunks and NER segments (Ratinov et al. 2011)

Regular expression capturing longer composite mentions President of the United States of America