Research: Entity identification on microblogs

Entity Identification on Microblogs by CRF Model with Adaptive Dependency

Dept. of Social Informatics,

Kyoto University, Japan

Jun-Li Lu Makoto P. Kato Takehiro Yamamoto Katsumi Tanaka

@2015 IEEE/WIC/ACM International Conference on Web Intelligence (WI2015)


Outline

• Entity identification

• How an entity is mentioned

• Method

• Feature

• Conditional Random Field (CRF) model

• Adaptive dependency

• Experiment results & conclusion


Problem definition: Entity identification on microblogs

Example microblog: “Jacoby is leaving for the rival and betrays Red Sox; Yankees seems aiming for championship.”

Task: given a mention in a microblog, find the entity it maps to.

How an entity is mentioned?

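[Figure: the microblog “Our baseball team is the rival to Yankees” with its mentions linked to entity articles: “Boston Red Sox” (“… is a professional baseball team”) and “New York Yankees” (“… is a professional…”). The links are labeled attribute, relationship, and name.]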

Direct-reference

• Name: the mention is a partial or full name of an entity

Indirect-reference

• Attribute: the mention describes an entity

• Relationship: the mention expresses a relationship between two entities

• Metaphor: the mention contains another entity’s name but maps to a different entity

Related work

• Two sub-tasks: NER (Named Entity Recognition), NED (Named Entity Disambiguation)

• NER and NED jointly considered [TKDE2015, WWW2014]

• Mining additional context for NED, in addition to KB [KDD2013]

• Well-written documents vs. short, noisy microblogs [WWW2014]

• Efficient prediction algorithm [WSDM2015]

=> Past work focused on direct-reference

Our contribution

• Survey of indirect-reference

  • indirect-reference was not infrequent in microblogs

• Novel features for indirect-reference

  • topic-specific translation, “entity-known-as” pattern, …

• An efficient model that considers dependency between entities

  • predicting entities together with a CRF model

  • finding proper dependency among entities

Presenting flow

• Introduction to entity identification

• Feature: how to measure entities?

• CRF model with adaptive dependency: how to predict entities?

• Experiment results

Previous-work features

[Figure: candidate entities (e.g., “New York Yankees”, “Boston Red Sox”) are scored against the microblog “the Yankees is the rival to …” using the entity’s article text, linked text such as “[New York Yankees|yankees]”, and the writer’s recent microblogs. Measures labeled in the figure: Jaccard-index, bag-of-words similarity, prob.(yankees), # of found documents, # of found cases, match.]

1. Keyword

2. Context similarity

3. Entities’ correlation

4. Mention entity’s name

5. Occurrence frequency

6. User interest
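The slide names Jaccard-index and bag-of-words similarity as measures but does not say how they are computed. A minimal sketch, assuming whitespace tokenization and cosine similarity over term counts (both are assumptions, and the function names are hypothetical):

```python
from collections import Counter
import math


def tokenize(text):
    # Assumption: lowercase, whitespace-separated tokens.
    return text.lower().split()


def jaccard_index(text_a, text_b):
    """Size of the shared vocabulary divided by the size of the joint vocabulary."""
    set_a, set_b = set(tokenize(text_a)), set(tokenize(text_b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def bow_cosine(text_a, text_b):
    """Cosine similarity between the term-count vectors of two texts."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Either function could score a microblog against a candidate entity's article text; which measure feeds which of the six features is not stated on the slide.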


For indirect-reference: topic-specific translation

• Interprets the microblog’s meaning using topic knowledge; effective when the microblog is abstract

• Procedure:

[Figure: pipeline microblog → topic → translation, for the microblog “…the player is leaving for…”. Topics such as “baseball” and “soccer” are represented by the top terms of related Wikipedia documents, so “player” may mean “outfielder” or “goalkeeper”. Microblog-related data (news, responses, or the writer’s past microblogs) supplies the top proper nouns, e.g., “New York Yankees”, “Jacoby Ellsbury”. Among topic terms such as “pitcher”, “outfielder”, “shortstop”, the translation is chosen by semantic similarity: “player” => “outfielder”.]

For indirect-reference: pattern

• Effective when the mention is a common noun, i.e., it gives no hint of the entity’s name

• Pattern 1: entity-known-as

  • mention + known-as phrase, e.g., “pinstripes” + “known as” matches “New York Yankees … known as … pinstripes”

• Pattern 2: entity-performing-action

  • action, e.g., “hit” matches “Jacoby Ellsbury … hit”
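The two patterns can be sketched as ordered phrase matches over an entity-related text. This is an illustration under assumptions, not the authors' matcher: the helper names, the 80-character gap limit, and the case-insensitive matching are all hypothetical choices.

```python
import re


def _ordered_phrases(text, phrases, max_gap=80):
    """True if the phrases occur in order, at most max_gap characters apart."""
    gap = r".{0,%d}?" % max_gap
    pattern = gap.join(r"\b" + re.escape(p) + r"\b" for p in phrases)
    return re.search(pattern, text, re.IGNORECASE | re.DOTALL) is not None


def entity_known_as(text, entity_name, mention):
    # Pattern 1: "<entity> ... known as ... <mention>"
    return _ordered_phrases(text, [entity_name, "known as", mention])


def entity_performing_action(text, entity_name, action):
    # Pattern 2: "<entity> ... <action>"
    return _ordered_phrases(text, [entity_name, action])
```

For example, `entity_known_as("The New York Yankees are also known as the Pinstripes.", "New York Yankees", "pinstripes")` matches, so “pinstripes” would count as evidence for that entity.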


Conditional Random Field (CRF) model

• Predicts multiple entities together, using proper dependency among them

• Linear + non-sequential CRF:

  • linear: keeps prediction tractable, with time $O(nc^2)$, linear in $n$

  • with cycles in the CRF, time is exponential, $O(c^n)$

  • non-sequential: allows proper dependency among entities

$n$: # of mentions per microblog; $c$: # of candidates per mention

[Figure: probability of $Y_2$ with and without dependency on $Y_1$.]
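The $O(nc^2)$ bound on this slide holds when the connections among entities contain no cycle. A minimal max-sum sketch of joint prediction on such a forest follows; it is an illustration, not the authors' implementation, and `map_on_forest`, `unary`, and `pairwise` are hypothetical names. `unary[i][a]` is assumed to hold the summed single-entity score of candidate `a` for mention `i`, and `pairwise[(i, j)][a][b]` the score of choosing `a` for `i` together with `b` for `j`.

```python
def map_on_forest(unary, edges, pairwise):
    """Return the highest-scoring candidate index per mention.

    Assumes `edges` forms a forest (no cycles); each edge costs O(c^2).
    """
    n = len(unary)
    neighbors = {i: [] for i in range(n)}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    def edge_score(i, a, j, b):
        # Each table is stored once, keyed in the direction given in `edges`.
        if (i, j) in pairwise:
            return pairwise[(i, j)][a][b]
        return pairwise[(j, i)][b][a]

    # Depth-first visit so that every parent precedes its children in `order`.
    parent = [None] * n
    visited = [False] * n
    order = []
    for root in range(n):
        if visited[root]:
            continue
        visited[root] = True
        stack = [root]
        while stack:
            u = stack.pop()
            order.append(u)
            for v in neighbors[u]:
                if not visited[v]:
                    visited[v] = True
                    parent[v] = u
                    stack.append(v)

    # Upward pass: children are processed before their parents.
    best = [None] * n   # best[u][a]: best score of u's subtree when y_u = a
    choice = {}         # choice[(child, a)]: best child label given parent label a
    for u in reversed(order):
        best[u] = list(unary[u])
        for v in neighbors[u]:
            if parent[v] != u:
                continue
            for a in range(len(unary[u])):
                scores = [best[v][b] + edge_score(u, a, v, b)
                          for b in range(len(unary[v]))]
                b_star = max(range(len(scores)), key=scores.__getitem__)
                best[u][a] += scores[b_star]
                choice[(v, a)] = b_star

    # Downward pass: fix each root, then read off the children's choices.
    labels = [None] * n
    for u in order:
        if parent[u] is None:
            labels[u] = max(range(len(best[u])), key=best[u].__getitem__)
        else:
            labels[u] = choice[(u, labels[parent[u]])]
    return labels
```

With no edges the function reduces to an independent argmax per mention; adding an edge lets a strongly correlated pair override a slightly better individual candidate.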


Adaptive dependency

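• Goal: set up proper dependency among entities

• Based on entities’ correlation

[Figure: CRF model of adaptive dependency over mentions $X_1, \ldots, X_5$ and entities $Y_1, \ldots, Y_5$. Connections are chosen by repeatedly picking the pair $(i, j)$ with the maximum adaptive dependency that does not create a cycle. Figure labels: “baseball”, “singing”; “to make high”, “to make low”; $c$ candidates per entity.]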
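The selection rule on this slide (pick the pair with the maximum adaptive dependency that does not create a cycle) resembles building a maximum spanning forest. A minimal sketch using a union-find structure, assuming the dependency value of every candidate pair is already known; the function name and the input format are hypothetical, and the slide does not say whether the authors stop before all entities are connected.

```python
def build_dependency_forest(n, dependency):
    """Greedily connect mentions 0..n-1 by descending dependency, avoiding cycles.

    dependency: dict mapping a pair (i, j) to its adaptive dependency value.
    Returns the chosen connections in the order they were picked.
    """
    parent = list(range(n))

    def find(x):
        # Root of x's component, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    connections = []
    ranked = sorted(dependency.items(), key=lambda item: item[1], reverse=True)
    for (i, j), _value in ranked:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:          # joining two components cannot form a cycle
            parent[root_i] = root_j
            connections.append((i, j))
    return connections
```

Because the result is cycle-free, joint prediction over these connections stays within the linear-time regime described for the CRF model.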


Prediction probability

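• CRF model

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{z(\mathbf{x})} \exp\Big( \sum_{i} \Big[ \sum_{f \in F_\alpha} w_f f(y_i) + \sum_{f \in F_\beta} w_f f(x_i, y_i) \Big] + \sum_{(i,j) \in L} \sum_{f \in F_\gamma} w_f f(y_i, y_j) \Big)$$

• CRF model with adaptive dependency

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{z(\mathbf{x})} \exp\Big( \sum_{i} \Big[ \sum_{f \in F_\alpha} w_f f(y_i) + \sum_{f \in F_\beta} w_f f(x_i, y_i) \Big] + \sum_{l=(i,j) \in L} \sum_{f \in F_\gamma} \delta(l)\, w_f f(y_i, y_j) \Big)$$

$\mathbf{y} = (y_1, \ldots, y_n)$: a set of entities; $\mathbf{x} = (x_1, \ldots, x_n)$: a set of mentions; $z(\mathbf{x})$: normalization; $w_f$: weight of feature $f$

$F_\alpha$: a set of features of an entity

$F_\beta$: a set of features of an entity and a mention

$F_\gamma$: a set of features of two entities

$L$: a set of connections between $Y_i$, $1 \le i \le n$

$\delta(l)$, adaptive dependency: top-k value of $\sum_{f \in F_\beta} w_f f(x_i, y_i) + \sum_{f \in F_\beta} w_f f(x_j, y_j) + \sum_{f \in F_\gamma} w_f f(y_i, y_j)$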
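To make the two probability formulas concrete, the sketch below evaluates them on a tiny example by brute force. It is illustrative only: the names are hypothetical, `unary[i][a]` is assumed to already hold the summed $F_\alpha$ and $F_\beta$ terms for candidate `a` of mention `i`, `pairwise[(i, j)][a][b]` the summed $F_\gamma$ terms, and `delta[(i, j)]` a given value of $\delta(l)$ (its top-k computation is not reproduced here). The normalization enumerates all $c^n$ assignments, so it is only usable for toy sizes.

```python
import itertools
import math


def log_score(labels, unary, pairwise, delta=None):
    """Exponent of the CRF formula for one assignment of candidates."""
    score = sum(unary[i][a] for i, a in enumerate(labels))
    for (i, j), table in pairwise.items():
        # delta=None gives the plain CRF; otherwise each connection is scaled.
        weight = 1.0 if delta is None else delta[(i, j)]
        score += weight * table[labels[i]][labels[j]]
    return score


def probability(labels, unary, pairwise, delta=None):
    """p(y | x), normalized by enumerating every assignment (toy sizes only)."""
    candidate_ranges = [range(len(scores)) for scores in unary]
    z = sum(math.exp(log_score(y, unary, pairwise, delta))
            for y in itertools.product(*candidate_ranges))
    return math.exp(log_score(labels, unary, pairwise, delta)) / z
```

Setting every `delta` value to 0 removes the pairwise terms, which corresponds to predicting each entity without dependency.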


Experiment outline

• Microblog annotation

• Candidate entity generation

• Performance

• Overall: features + model

• Feature comparison

• CRF model with adaptive dependency


Microblog annotation

• Credible ground-truth: 3 annotators on 500 random tweets from Twitter (2014/10)

• Annotation result:

=> Multiple mentions in a microblog (2.61 per tweet)

=> Indirect-reference was not infrequent (indirect:direct≈2:3)

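| Twitter tag | # tweets | # mentions | # direct-ref. | # indirect-ref. |
|---|---|---|---|---|
| #Yankees | 86 | 228 | 153 | 108 |
| #Obama | 92 | 227 | 167 | 87 |
| #Ebola | 97 | 241 | 151 | 156 |
| #Nobel | 94 | 287 | 228 | 124 |
| #Islam | 92 | 219 | 151 | 95 |
| Mean per tweet | | 2.61 | 1.84 | 1.24 |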
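The “mean per tweet” row can be reproduced from the per-tag counts, assuming each mean is the column total divided by the total number of tweets in the table (an assumption that matches the reported values). The variable names are illustrative.

```python
# Per-tag counts copied from the annotation table:
# (tweets, mentions, direct-references, indirect-references)
rows = {
    "#Yankees": (86, 228, 153, 108),
    "#Obama": (92, 227, 167, 87),
    "#Ebola": (97, 241, 151, 156),
    "#Nobel": (94, 287, 228, 124),
    "#Islam": (92, 219, 151, 95),
}

# Column totals over all tags.
tweets, mentions, direct, indirect = (sum(col) for col in zip(*rows.values()))

# Means per tweet, rounded as in the table.
means = tuple(round(total / tweets, 2) for total in (mentions, direct, indirect))

# Ratio behind the "indirect:direct ≈ 2:3" remark.
indirect_to_direct = indirect / direct
```

The totals are 1202 mentions, 850 direct-references, and 570 indirect-references over 461 tweets, giving 2.61, 1.84, and 1.24 per tweet and an indirect-to-direct ratio of about 0.67.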


Candidate entity generation

• Direct reference: mention is partial or full name of entity

• Indirect reference: mention is included in entity’s main page in Wikipedia
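The two candidate rules can be sketched over a toy knowledge base that maps an entity name to the text of its main page. The sketch assumes plain case-insensitive substring matching, which is a simplification chosen for illustration; `generate_candidates` and `toy_kb` are hypothetical, and the page texts are placeholders rather than real Wikipedia content.

```python
def generate_candidates(mention, knowledge_base):
    """Return (direct, indirect) candidate entity names for a mention.

    direct:   the mention is a partial or full name of the entity.
    indirect: the mention appears in the entity's main page text.
    """
    needle = mention.lower()
    direct = [name for name in knowledge_base if needle in name.lower()]
    indirect = [name for name, page in knowledge_base.items()
                if needle in page.lower() and name not in direct]
    return direct, indirect


# Toy knowledge base (placeholder page texts).
toy_kb = {
    "New York Yankees": "The New York Yankees are a professional baseball team based in New York.",
    "Boston Red Sox": "The Boston Red Sox are a professional baseball team based in Boston.",
}
```

A name-like mention such as “Yankees” yields one direct candidate, while a descriptive mention such as “baseball team” yields every entity whose page contains it, which hints at why candidate sets for indirect-reference grow large.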

[Figure: percentage of ground-truth entities included in the candidates (y-axis, 30 to 90%) versus the size of the top candidate entities (x-axis, 10 to 2000), plotted for direct reference and for indirect reference.]

=> Candidate generation is weak for indirect-reference

Baseline method

• Baseline-model: sequence-rank, one entity at a time

  • $\arg\max_{e \in C} p(y_i = e \mid y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n)$

  • $C$: candidates for $y_i$

[Figure: entities $Y_1, \ldots, Y_5$ predicted one at a time, in ranking order 1st to 5th.]

• Baseline-features: Context similarity, Entities’ correlation, Mention entity’s name, User interest, Occurrence frequency, Keyword

• Our features: Topic-specific translation, Pattern, Writing behavior

Overall performance

• Our CRF model always outperformed the baseline-model, and all features together always outperformed the baseline-features

=> CRF model works regardless of features

=> Multiple features are required

$\mathrm{MRR} = \frac{1}{q} \sum_{i=1}^{q} \frac{1}{\mathrm{rank}_i}$, where $q$: # of tests; $\mathrm{rank}_i$: rank position of the ground-truth entity in test $i$
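A small sketch of the MRR computation defined above, assuming the rank of the ground-truth entity in each test is already known and is 1-based; the function name is illustrative.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all tests; ranks are 1-based positions."""
    if not ranks:
        raise ValueError("at least one test is required")
    if any(rank < 1 for rank in ranks):
        raise ValueError("ranks must be 1 or greater")
    return sum(1.0 / rank for rank in ranks) / len(ranks)
```

For instance, ground-truth entities ranked 1st, 2nd, and 4th in three tests give (1 + 0.5 + 0.25) / 3, roughly 0.58.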

[Figure: bar chart of MRR (+SEM), y-axis 0 to 0.7. Labels: Our CRF, Baseline-model, All-feature (including ours), Baseline-feature; two comparisons are marked **.]

Feature comparison

• Our features were effective for indirect-reference

[Figure: two bar charts of MRR (+SEM) per feature, with our features and baseline-features distinguished in the legend.

MRR for indirect-reference (y-axis 0 to 0.5), features in x-axis order: Topic-specific translation (Eq. 1a-b), Occurrence frequency, Entities’ correlation, Topic-specific translation (Eq. 1c-f), Keyword, Context similarity, Pattern, Writing behavior, Mention entity’s name, User interest.

MRR for direct-reference (y-axis 0 to 0.8), features in x-axis order: Occurrence frequency, Mention entity’s name, Topic-specific translation (Eq. 1a-b), Entities’ correlation, Topic-specific translation (Eq. 1c-f), Pattern, Context similarity, Writing behavior, Keyword, User interest.]

Effect of CRF model with adaptive dependency

• Our adaptive dependency was slightly worse than the best setting

  • but note that its complexity is linear in $n$

[Figure: bar chart of MRR (+SEM), y-axis 0 to 0.7, by dependency structure: Fully connected, Adaptive, Occurrence order (appearing order), Random, No dependency. Complexity labels shown with the bars: one $O(c^n)$, three $O(nc^2)$, one $O(nc)$.]

Conclusion

• Contribution:

  • Surveyed indirect-reference on microblogs

  • Effective features for indirect-reference

  • Accurate and efficient: CRF model with adaptive dependency

• Findings:

  • Candidate generation does not work well for indirect-reference

  • Some novel features showed limited performance

  • Multiple features were required when direct and indirect references are mixed

• Thank you for listening
