44
Collective Annotation of Wikipedia Entities in Web Text Sayali Kulkarni Amit Singh Ganesh Ramakrishnan Soumen Chakrabarti IIT Bombay

Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Collective Annotation of Wikipedia

Entities in Web Text

Sayali Kulkarni Amit Singh Ganesh RamakrishnanSoumen Chakrabarti

IIT Bombay

Page 2: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Introduction

Our aimAggressive open domain annotation of unstructured Web textwith uniquely identified entities in a social media (Wikipedia)

The incentiveUse the annotations for search and mining tasks

Page 3: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Outline for today

I Terminologies

I About entity disambiguation

I Our contributions

I Evaluation and results

I Conclusion

Page 4: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Terminologies

Figure: A plain page from unstructured data source

Page 5: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Terminologies (2)

Spots

Figure: A spot on a page

Spot is an occurrence of text on a page that can be possiblylinked to a Wikipedia articleRelated notations:S0 All candidate spots in a Web pages ∈ S0 One spot, including surrounding context

Page 6: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Terminologies (3)

Figure: Possible attachments for a spot

Attachments are Wikipedia entities that can be possibly linkedto a spotRelated notations:Γs Set of candidate entity labels for spot s on a pageΓ0

⋃s∈S0

Γs , set of all candidate labels for the page

Page 7: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Entity disambiguation

On first getting into the 2009 JaguarJaguarJaguarJaguar XF, it seems like the ultimate in automotive techautomotive techautomotive techautomotive tech. A red backlightbacklightbacklightbacklight on the engineengineengineengine start button pulses with a heartbeat cadence.

Jaguar, the carJaguar, the carJaguar, the carJaguar, the carOr

Jaguar, the animal?

Figure: Clues from local contexthelp in disambiguation

Document

Spots

s

s’

Spo

t-to

-labe

l co

mpa

tibili

ty

Candidate labels

γγγγ

ΓΓΓΓs

γγγγ’

ΓΓΓΓs’

Figure: Disambiguation based oncompatibility between spot andlabel

Related work: SemTag and Seeker[2]

Page 8: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Collective entity disambiguation

... Michael JordanMichael JordanMichael JordanMichael Jordan is also noted for his product endorsements. He fueled the success of Nike's Air JordanAir JordanAir JordanAir Jordan sneakers. ...The Chicago Bulls Chicago Bulls Chicago Bulls Chicago Bulls selected Jordan with the third overall pick, ...

Basketball player

American actor

American researcher

Jordan Airlines

Nike shoesAmerican basketball team

Figure: Other spots on page helpin disambiguation

Document

Spots

s

s’

Spo

t-to

-labe

l co

mpa

tibili

ty

Inter-label topical coherence

Candidate labels

γγγγ

ΓΓΓΓs

γγγγ’

ΓΓΓΓs’

g(γ(γ(γ(γ))))

g(γ(γ(γ(γ’ ))))

Figure: Disambiguation based onlocal compatibility and coherencebetween labels

Related work: Cucerzan[1] and Milne et al.[3]

Page 9: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Relatedness information from entity catalog

I How related are two entities γ, γ′ in Wikipedia?

I Embed γ in some space using g : Γ → Rc

I Define relatedness r(γ, γ′) = g(γ) · g(γ′) or related

I Cucerzan’s proposal: relatedness between entity based oncosine measure

I Milne et al. proposal: c = number of Wikipedia pages;g(γ)[p] = 1 if page p links to page γ, 0 otherwise

r(γ, γ′) =log |g(γ) ∩ g(γ′)| − log max{|g(γ)|, |g(γ′)|}

log c − log min{|g(γ)|, |g(γ′)|}

Page 10: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Relatedness information from entity catalog

I How related are two entities γ, γ′ in Wikipedia?

I Embed γ in some space using g : Γ → Rc

I Define relatedness r(γ, γ′) = g(γ) · g(γ′) or related

I Cucerzan’s proposal: relatedness between entity based oncosine measure

I Milne et al. proposal: c = number of Wikipedia pages;g(γ)[p] = 1 if page p links to page γ, 0 otherwise

r(γ, γ′) =log |g(γ) ∩ g(γ′)| − log max{|g(γ)|, |g(γ′)|}

log c − log min{|g(γ)|, |g(γ′)|}

Page 11: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Relatedness information from entity catalog

I How related are two entities γ, γ′ in Wikipedia?

I Embed γ in some space using g : Γ → Rc

I Define relatedness r(γ, γ′) = g(γ) · g(γ′) or related

I Cucerzan’s proposal: relatedness between entity based oncosine measure

I Milne et al. proposal: c = number of Wikipedia pages;g(γ)[p] = 1 if page p links to page γ, 0 otherwise

r(γ, γ′) =log |g(γ) ∩ g(γ′)| − log max{|g(γ)|, |g(γ′)|}

log c − log min{|g(γ)|, |g(γ′)|}

Page 12: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Our contributions

I Posing entity disambiguation as an optimization problem

I Single optimization objectiveI Using integer linear programs (NP Hard)I Heuristics for approximate solutions

I Rich node features with systematic learning

I Back off strategy for controlled annotations

Page 13: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Modeling local compatibility

I Feature vector fs(γ) ∈ Rd expresses local textualcompatibility between (context of) spot s and candidatelabel γ

I Components of fs(γ)

Spot featureSpot featureSpot featureSpot featureContext of spot

WikipediaWikipediaWikipediaWikipedia featuresfeaturesfeaturesfeaturesSnippetFull textAnchor textAnchor text with context

Similarity measuresSimilarity measuresSimilarity measuresSimilarity measuresDot productCosine similarityJaccard similarity

1 X 4 X 3 = 12 features1 X 4 X 3 = 12 features1 X 4 X 3 = 12 features1 X 4 X 3 = 12 features

1 4

3

I Sense probability prior: probability that a Wikipedia entitycan be associated with a spot (Pr(γ|s))

Page 14: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Components of our objective

Node score

I Node scoring model w ∈ Rd

I Node score defined as w>fs(γ)

I w is learned using a linear adaptation of rankSVM

Clique Score

I Use relatedness measure (r) as described by Milne et. al.

Total objective

1

|S0|∑

s

w>fs(ys)︸ ︷︷ ︸Node score

+1(|S0|2

) ∑s 6=s′

r(ys , ys′)︸ ︷︷ ︸Clique score

y is the final set of assignments on a page

Page 15: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Components of our objective

Node score

I Node scoring model w ∈ Rd

I Node score defined as w>fs(γ)

I w is learned using a linear adaptation of rankSVM

Clique Score

I Use relatedness measure (r) as described by Milne et. al.

Total objective

1

|S0|∑

s

w>fs(ys)︸ ︷︷ ︸Node score

+1(|S0|2

) ∑s 6=s′

r(ys , ys′)︸ ︷︷ ︸Clique score

y is the final set of assignments on a page

Page 16: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Backoff strategy

I Not all spots may be tagged. Allow backoff from taggingI Assign a special label “na” to mark a “no attachment”I Reward a spot for attaching to na – RNAI Spots marked na do not contribute to clique potentialI Smaller the value of RNA, more aggressive is the tagging

Modified ObjectiveN0 ⊆ S0 : spots assigned naA0 = S0 \ N0 : remaining spots

maxy

1

|S0|

(∑s∈N0

ρna +∑s∈A0

w>fs(ys)

)(Node Score)

+1(|S0|2

) ∑s 6=s′∈A0

r(ys , ys′) (Clique Score)

Page 17: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Backoff strategy

I Not all spots may be tagged. Allow backoff from taggingI Assign a special label “na” to mark a “no attachment”I Reward a spot for attaching to na – RNAI Spots marked na do not contribute to clique potentialI Smaller the value of RNA, more aggressive is the tagging

Modified ObjectiveN0 ⊆ S0 : spots assigned naA0 = S0 \ N0 : remaining spots

maxy

1

|S0|

(∑s∈N0

ρna +∑s∈A0

w>fs(ys)

)(Node Score)

+1(|S0|2

) ∑s 6=s′∈A0

r(ys , ys′) (Clique Score)

Page 18: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Methodologies for solving the objective

Integer linear program (ILP) based formulation

I Casting as 0/1 integer linear program

I Using up to |Γ0|+ |Γ0|2 variables

I Relaxing it to an LP

Simpler heuristics

I Hill climbing for optimization

Page 19: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Evaluation of the annotation system

Evaluation measures:

PrecisionNumber of spots tagged correctly out of total number of spotstagged

RecallNumber of spots tagged correctly out of total number of spotsin ground truth

F12×Recall×Precision(Recall+Precision)

Page 20: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Datasets for evaluation

I Documents(IITB) crawled from popular sites

I Publicly available data from Cucerzan’s experiments (CZ)

IITB CZ

Number of documents 107 19Total number of spots 17,200 288Spot per 100 tokens 30 4.48Average ambiguity per spot 5.3 18

Figure: Corpus statistics.

Page 21: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Effect of learning in node score calculation

50

55

60

65

70

75

Wiki fullpage,cosine

Anchortext,

cosine

Anchortext

context,cosine

Wiki fullpage,

Jaccard

Allfeatures(learn w)

F1,

%

20

40

60

80

100

0 20 40 60 80Recall, %

Pre

cisi

on, %

LocalLocal+PriorM&WCucerzan

I Using w isbetter thanusing individualnode features inisolation

I Enough tooutperformother baselinesystems

Page 22: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Effect of learning in node score calculation

50

55

60

65

70

75

Wiki fullpage,cosine

Anchortext,

cosine

Anchortext

context,cosine

Wiki fullpage,

Jaccard

Allfeatures(learn w)

F1,

%

20

40

60

80

100

0 20 40 60 80Recall, %

Pre

cisi

on, %

LocalLocal+PriorM&WCucerzan

I Using w isbetter thanusing individualnode features inisolation

I Enough tooutperformother baselinesystems

Page 23: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Benefits of collective annotation

40

50

60

70

80

40 50 60 70 80Recall, %

Pre

cisi

on, %

LocalHill1LP1

Figure: Recall/precision on IITB data

0

20

40

60

80

100

20 30 40 50 60 70Recall, %

Pre

cisi

on, %

Hill1

LP1

Milne

Cucerzan

F1=63%

F1=69%

Figure: Recall/precision on CZ data

I Adding collective inferenceadds to the accuracy of theannotations

Page 24: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Results summary

I Selection of features for defining the node score is important

I Collective inference improves accuracy further

I Able to gain high recall without sacrificing much on precision

Evaluation:

Our system Cucerzan Milne et al.

Recall 70.7% 31.43% 66.1%Precision 68.7% 53.41% 19.35%F1 69.69% 39.57% 29.94%

Page 25: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Future work

I Extending collective inference beyond page-level boundaries

I Associating confidence with annotations

I Reducing cognitive load during the process manualannotations

I Building an entity search system over annotations

KDD Demo: 30 June ’09, 17:30 onwards

Page 26: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Future work

I Extending collective inference beyond page-level boundaries

I Associating confidence with annotations

I Reducing cognitive load during the process manualannotations

I Building an entity search system over annotations

KDD Demo: 30 June ’09, 17:30 onwards

Page 27: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Questions?

Page 28: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Additional slides

I Multitopic models

I Belief about the objective

I Tuning RNA

I More about data sets

I Human Supervision

I ILP in detail

I Hill climbing algorithm

I Timing graphs

I References

Page 29: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Dendrogram with multitopic model

Page 30: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Multi-topic model

I Current clique potentials encourages a single cluster model

I The single cluster hypothesis is not always true

I Partition the set of possible attachments as C = Γ1, . . . , ΓK

I Refined clique potential for supporting multitopic model

1

|C |∑Γk∈C

1(Γk

2

) ∑s,s′: ys ,ys′∈Γk

r(ys , ys′). (CPK)

I Using(Γk

2

)instead of

(S0

2

)to reward smaller coherent

clusters

I Node score is not disturbed

Page 31: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Is our belief about the objective correct?

20

30

40

50

0.75 0.8 0.85 0.9 0.95 1Objective (normalized)

F1,

%

doc1 doc2doc3 doc4doc5 doc6

Figure: F1 versus Objective

I As the objectivevalue increases,the F1 increases

I Validates ourbelief about theobjective

Page 32: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Effect of tuning RNA

15

25

35

45

55

65

1 2 3 4 5 6 7 8rhoNA

F1,

%

Local Hill1LP1

Figure: F1 for Local, Hill and LP for differentRNA values

I Best RNA forLocal is lesserthan the bestRNA for Hill1and LP1

Page 33: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Effect of tuning RNA (2)

40

50

60

70

80

90

1 2 3 4 5 6 7 8RNA

Pre

cisi

on, %

Local Hill1LP1

Figure: Precision for different RNA values

102030405060708090

1 2 3 4 5 6 7 8RNA

Rec

all,

%

Local Hill1LP1

Figure: Recall for different RNA values

I Smaller the value ofRNA, more aggressiveis the tagging

I Precision increaseswith increase in RNAvalue

I Recall decreases withincrease in RNA value

Page 34: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

More about data sets

More on IITB dataset

I Collected a total of about 19,000 annotations

I Done by by 6 volunteers

I About 50 man-hours spent in collecting the annotations

I Exhaustive tagging by volunteers

I Spots labeled as na was about 40%

#Spots tagged by more than one person 1390#na among these spots 524#Spots with disagreement 278#Spots with disagreement involving na 218

Figure: Inter-annotator agreement.

Page 35: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Human Supervision

I System identifies spots and mentions

I Shows pull-down list of (subset of) Γs for each s

I User selects γ∗ ∈ Γs ∪ na

Page 36: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Integer linear program (ILP) based formulation

Variables:

zsγ = [[spot s is assigned label γ ∈ Γs ]]

uγγ′ = [[both γ, γ′ assigned to spots]]

Document

Spots

s

s’S

pot-

to-la

bel

com

patib

ility

Candidate labels

γγγγ

ΓΓΓΓs

γγγγ’

ΓΓΓΓs’

zzzzssssγγγγuuuuγγγγ γ γ γ γ ’’’’

Figure: Defining the variables for ILP

Page 37: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Integer linear program (ILP) based formulation

Objective:

max{zsγ ,uγγ′} (NP′) + (CP1′)

Node potential:

1

|S0|∑s∈S0

∑γ∈Γs

zsγw>fs(γ) (NP′)

Clique potential:

1(|S0|2

) ∑s 6=s′∈S0

∑γ∈Γs ,γ′∈Γs′

uγγ′r(γ, γ′) (CP1′)

Subject to constraints:

∀s, γ : zsγ ∈ {0, 1}, ∀γ, γ′ : uγγ′ ∈ {0, 1} (1)

∀s, γ, γ′ : uγγ′ ≤ zsγ and uγγ′ ≤ zsγ′ (2)

∀s :∑

γ zsγ = 1. (3)

Page 38: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Integer linear program (ILP) based formulation

Objective:

max{zsγ ,uγγ′} (NP′) + (CP1′)

Node potential:

1

|S0|∑s∈S0

∑γ∈Γs

zsγw>fs(γ) (NP′)

Clique potential:

1(|S0|2

) ∑s 6=s′∈S0

∑γ∈Γs ,γ′∈Γs′

uγγ′r(γ, γ′) (CP1′)

Subject to constraints:

∀s, γ : zsγ ∈ {0, 1}, ∀γ, γ′ : uγγ′ ∈ {0, 1} (1)

∀s, γ, γ′ : uγγ′ ≤ zsγ and uγγ′ ≤ zsγ′ (2)

∀s :∑

γ zsγ = 1. (3)

Page 39: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Integer linear program (ILP) based formulation

Objective:

max{zsγ ,uγγ′} (NP′) + (CP1′)

Node potential:

1

|S0|∑s∈S0

∑γ∈Γs

zsγw>fs(γ) (NP′)

Clique potential:

1(|S0|2

) ∑s 6=s′∈S0

∑γ∈Γs ,γ′∈Γs′

uγγ′r(γ, γ′) (CP1′)

Subject to constraints:

∀s, γ : zsγ ∈ {0, 1}, ∀γ, γ′ : uγγ′ ∈ {0, 1} (1)

∀s, γ, γ′ : uγγ′ ≤ zsγ and uγγ′ ≤ zsγ′ (2)

∀s :∑

γ zsγ = 1. (3)

Page 40: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Integer linear program (ILP) based formulation

Objective:

max{zsγ ,uγγ′} (NP′) + (CP1′)

Node potential:

1

|S0|∑s∈S0

∑γ∈Γs

zsγw>fs(γ) (NP′)

Clique potential:

1(|S0|2

) ∑s 6=s′∈S0

∑γ∈Γs ,γ′∈Γs′

uγγ′r(γ, γ′) (CP1′)

Subject to constraints:

∀s, γ : zsγ ∈ {0, 1}, ∀γ, γ′ : uγγ′ ∈ {0, 1} (1)

∀s, γ, γ′ : uγγ′ ≤ zsγ and uγγ′ ≤ zsγ′ (2)

∀s :∑

γ zsγ = 1. (3)

Page 41: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

LP relaxation for the ILP formulation

I Relax the constraints in the formulation as :

∀s, γ : 0 ≤ zsγ ≤ 1, ∀γ, γ′ : 0 ≤ uγγ′ ≤ 1

∀s, γ, γ′ : uγγ′ ≤ zsγ and uγγ′ ≤ zsγ′

∀s :∑

γ zsγ = 1.

I Margin between objective of relaxed LP and the rounded LPis quite thin

700

800

900

1000

1 2 3 4 5 6 7 8Tuning parameter

Tot

al O

bjec

tive

LP1-rounded

LP1-relaxed

Page 42: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Hill climbing algorithm

1: initialize some assignment y (0)

2: for k = 1, 2, . . . do3: select a small spot set S∆

4: for each s ∈ S∆ do5: find new γ that improves objective6: change y

(k−1)s to y

(k)s = γ greedily

7: if objective could not be improved then8: return latest solution y (k)

Figure: Outline for hill-climbing algorithm

Page 43: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

Scaling and performance measurement

0

2

4

6

8

10

0 50 100 150 200 250# of spots

Tim

e, s

Hill1

LP1

Figure: Scaling the annotation process withnumber of spots being annotated

I Scaling is mildlyquadraticallywrt |S0|

I Hill1 takesabout 2–3seconds

I LP1 takesaround 4–6seconds

Page 44: Collective Annotation of Wikipedia Entities in Web Textsoumen/doc/sigkdd2009c/SIGKDDTalk.pdf · More about data sets More on IITB dataset I Collected a total of about 19,000 annotations

References[1] S. Cucerzan.

Large-scale named entity disambiguation based on Wikipedia data.In EMNLP Conference, pages 708–716, 2007.

[2] S. Dill et al.SemTag and Seeker: Bootstrapping the semantic Web via automated semantic annotation.In WWW Conference, 2003.

[3] D. Milne and I. H. Witten.Learning to link with Wikipedia.In CIKM, 2008.