KDIR2015-Entity Linking and Knowledge Discovery in Microblogs-Presentation

PIKAKSHI MANCHANDA

DISCo, University of Milano-Bicocca, Milan, Italy

@pikakshi787

Manchanda et al., Leveraging Entity Linking for Entity Recognition in Microposts

KDIR 2015

KDIR 2015, Lisbon, 12th November 2015

People increasingly communicate and share information through social media platforms

Fresh information emerges in real time, primarily on social media platforms:

New entities (newly emerging, newly relevant/popular)

New relationships

Factual information

Events


SOCIAL MEDIA: ENTITIES-EMOJIS-EVENTS



WHY INFORMATION EXTRACTION??


Existing entities: Apple (Company), IBM OS2 (Product)

New entity (Product Launch): Apple Watch (Product)

New Relations

WHY SOCIAL MEDIA PLATFORMS??

Fresh

Real-time info

Incomplete KBs

Unstructured Web



MOTIVATION


Bridging the gap between Unstructured Web and Web of Data

• Intrinsic incompleteness in KBs

Information Extraction from social media streams (microposts, ...)

• Named Entity Recognition (NER)

• Named Entity Classification

• Named Entity Linking (NEL)

Knowledge Base (KB) enrichment

• Identify new knowledge

• Improve NER

• Lexically enriching knowledge bases for existing & new entities

INFORMATION EXTRACTION

Named Entity Recognition: Task of identifying named entities in a piece of text

Named Entities: text fragments that refer to entities in the real world (e.g., proper nouns)

Named Entity Classification: Classifying recognized named entities into entity types such as PERSON, LOCATION, ORGANIZATION…

Named Entity Linking: Linking the identified named entities to resources in a knowledge base (such as Wikipedia, DBpedia)
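To make the three tasks concrete, the sketch below shows how their outputs could be stored for a single mention. The `Mention` class and its field names are illustrative assumptions, not part of the presented system; the DBpedia URI is the one used for "Ben" in the example tweet of this talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mention:
    """One named-entity mention and the output of each IE task (hypothetical structure)."""
    surface_form: str                  # NER: text fragment identified as a named entity
    start: int                         # NER: character offset of the fragment in the text
    entity_type: Optional[str] = None  # Classification: e.g. PERSON, LOCATION, ORGANIZATION
    kb_uri: Optional[str] = None       # NEL: linked KB resource, if one exists


tweet = "And don't worry Ben, we already forgave you for Gigli."

# Step 1, recognition: locate the surface form in the text.
ben = Mention(surface_form="Ben", start=tweet.index("Ben"))
# Step 2, classification: assign an entity type.
ben.entity_type = "PERSON"
# Step 3, linking: attach the knowledge base resource.
ben.kb_uri = "http://dbpedia.org/page/Ben_Affleck"

print(ben)
```

Keeping `entity_type` and `kb_uri` optional reflects that a mention may be recognized without being classified, and that it may have no resource in the KB at all.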



The Town might be one of the best movies I have seen all year. So, so good. And don't worry Ben, we already forgave you for Gigli. Really.

The Town: http://es.dbpedia.org/page/The_Town (dbpedia-owl:Film, schema.org/Movie)

Ben: http://dbpedia.org/page/Ben_Affleck (foaf:Person, yago:AmericanFilmActors)

Gigli: http://dbpedia.org/page/Gigli (dbo:Film, yago:AmericanFilms)



Named Entity Linking



The Town might be one of the best movies I have seen all year. So, so good. And don't worry Ben, we already forgave you for Gigli. Really.

The Town: http://live.dbpedia.org/page/The_Town_(2012_TV_series) (dbo:TelevisionShow, http://schema.org/CreativeWork)

Ben: http://dbpedia.org/page/Ben_Affleck (foaf:Person, yago:AmericanFilmActors)

Gigli: http://dbpedia.org/page/Gigli (dbo:Film, yago:AmericanFilms)




Named Entity Linking

Entity recognition and linking in microposts have been reported to be quite challenging:

1. Short and noisy text: typographic errors, shortened words, ambiguity, polysemy (Liu et al. 2013, Ritter et al. 2011, Meij et al. 2012)

2. Out Of Vocabulary (OOV) entity mention identification problem, e.g., The Big Bang Theory being referred to as TBBT

3. Out of Knowledge Base (OOKB) entity problem, e.g., a new, upcoming company such as Widro


CHALLENGES



Systems/Tools | Approach | Domain | Entity Types/Classes | Taxonomy
ANNIE | Gazetteers & FSM | Newswire | 7 (adapted) | MUC
Stanford NER | CRF | Newswire | 4, 3 or 7 | CoNLL, ACE
Alchemy API | Machine Learning | Unspecified | 324 | Alchemy
NERD-ML | KNN & Naïve Bayes | Twitter | 4 | NERD
TextRazor | Machine Learning | Unspecified | 1779 | DBpedia, Freebase
Ritter et al., 2011 | CRF | Twitter | 3 or 10 | CoNLL, ACE
Liu et al., 2011 | KNN & CRF | Twitter | 4 | CoNLL, ACE
Kalina et al., 2013 | Gazetteers & FSM | Twitter | 3 or 10 | CoNLL
Derczynski et al., 2015 | Structured Learning (CRF) | Twitter | 10 | Freebase

ENTITY RECOGNITION



Tools | Taxonomy | Approach/Features used | Domain
DBpedia Spotlight (Mendes et al., 2011) | DBpedia, Freebase, Schema.org | Gazetteers and similarity metrics | Unspecified
TAGME (Ferragina and Scaiella, 2010) | Wikipedia | Wikipedia anchor texts and the pages linked to those anchor texts | Short texts
YODIE (Damljanovic and Bontcheva, 2012) | DBpedia | Similarity metrics and URI frequency | Twitter
Babelfy (Moro et al., 2014) | BabelNet semantic network | Graph-based approach, semantic signatures | Short text
Meij et al., 2012 | Wikipedia | n-gram features, concept features, and tweet features | Twitter
S-MART (Yang et al., 2015) | Wikipedia | Structural Learning (Tree-based) | Twitter
Weasel (Tristram et al., 2015) | DBpedia | Machine Learning (using SVM) | Newspaper Articles
Guo et al., 2013 | Wikipedia | Structural SVM | Twitter
Yamada et al., 2015 | Wikipedia | Supervised (String matching, n-grams) | Twitter

Mention detection & disambiguation system: Pipeline

Use NEL to learn how to perform NER: pipeline



ENTITY LINKING

THE PROPOSED SYSTEM

An end-to-end IE framework for microblogs to orchestrate NER and NEL

• Entity Recognition and Classification

• Candidate match retrieval for identified entities

• Entity linking

• Leverage entity linking to improve named entity classification
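The candidate match retrieval step can be sketched as a top-k lookup of a surface form against an index of KB labels. This is only an illustration: the miniature `LABEL_INDEX`, the `top_k_labels` helper, and the use of `difflib` string similarity are assumptions made for the example, whereas the presented system searches an index built from DBpedia `rdfs:label` values.

```python
import difflib

# Hypothetical miniature label index: rdfs:label -> resource URI.
LABEL_INDEX = {
    "Ben Affleck": "http://dbpedia.org/page/Ben_Affleck",
    "Gigli": "http://dbpedia.org/page/Gigli",
    "The Town": "http://es.dbpedia.org/page/The_Town",
    "The Town (2012 TV series)": "http://live.dbpedia.org/page/The_Town_(2012_TV_series)",
}


def top_k_labels(surface_form, k=3, cutoff=0.3):
    """Return up to k (label, URI) candidates, best string match first."""
    labels = list(LABEL_INDEX)
    matches = difflib.get_close_matches(surface_form, labels, n=k, cutoff=cutoff)
    return [(label, LABEL_INDEX[label]) for label in matches]


candidates = top_k_labels("The Town")
for label, uri in candidates:
    print(label, "->", uri)
```

The surface form "The Town" retrieves both the film and the TV series as candidates, which is why a separate disambiguation step is needed before linking.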

Gold-standard corpus of ~2400 tweets (Ritter et al., 2011)

Ground Truth: Manually curated set of 1616 named entities identified with entity types

Use of DBpedia as an external KB



FRAMEWORK



Framework pipeline (diagram): Tweet → Named Entity Recognition → surface forms of named entities → Entity Search over the index (rdfs:label) → top-k labels for each surface form → resource and resource description for each label, from the KB → Entity Disambiguation (surface form, entity type & context) → resource for surface form (Entity Linking) → Improvement of NER


ENTITY RECOGNITION



T-NER is grounded on Conditional Random Fields (Sutton and McCallum, 2006)

It classifies each entity e into one or more entity types/classes c with a probability score $P_{CRF}(e, c)$

Experimental Analysis: Entity Recognition

NER Systems: T-NER (Ritter et al. 2011)


Identification Errors

“@vogueglamGIRL Ah I know! She is simply the best in The Sept Issue. My boyfriend’s aunt worked for Anna Wintor in NY”

Classification Errors

“Cant wait for the ravens game tomorrow....go ray rice!!!!!!!”

Entity types shown in the callouts: O, Geo-Loc, Band, Sportsteam



$$P_{CRF}(e, c) = \exp\left(\sum_{k=1}^{K} w_k f_k(e, c)\right)$$
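As a worked illustration of the score above, the snippet below evaluates exp(Σ w_k f_k(e, c)) for each candidate type and then normalizes over the types so that the scores sum to one. The feature names, the weights, the toy gazetteer and the normalization step are all assumptions made for this example; they are not T-NER's actual features or parameters, and a real CRF scores whole token sequences rather than isolated entities.

```python
import math

# Hypothetical weights w_k, keyed by (feature name, entity type).
WEIGHTS = {
    ("capitalized", "Person"): 0.8,
    ("capitalized", "Geo-Location"): 0.6,
    ("in_city_list", "Geo-Location"): 1.5,
    ("in_city_list", "Person"): -0.5,
}
CITY_LIST = {"Chicago", "Wisconsin"}  # toy gazetteer, for illustration only


def active_features(entity):
    """Names of the binary features f_k that fire for this entity."""
    feats = []
    if entity[:1].isupper():
        feats.append("capitalized")
    if entity in CITY_LIST:
        feats.append("in_city_list")
    return feats


def unnormalized_score(entity, entity_type):
    """exp of the weighted sum of the active features for (entity, type)."""
    total = sum(WEIGHTS.get((name, entity_type), 0.0)
                for name in active_features(entity))
    return math.exp(total)


def p_crf(entity, types):
    """Scores normalized over the candidate types (assumed normalization)."""
    scores = {c: unnormalized_score(entity, c) for c in types}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}


probs = p_crf("Chicago", ["Person", "Geo-Location"])
print(probs)
```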

Text Phrase | Classification Level | Example Entity | Example Entity Type | Classification (%)
Entities (1496) | Correctly Classified | Justin Bieber | Person | 61.57
Entities (1496) | Incorrectly Classified | Chicago | Person | 37.96
Entities (1496) | Segmentation Error | Alpha, Omega (Alpha-Omega) | Geo-Location, Band | 0.47
Non-Entities (44k) | Correctly Classified | It | Outside (O) | 99.8
Non-Entities (44k) | Incorrectly Classified | justthen | Person | 0.2

T-NER Classification Performance


T-NER identifies 1496 named entities in the gold standard (GS), in contrast to the 1616 entities in the ground truth.

About 8% of entities are not recognized at all and are thus classified as non-entities (among the other 44k tokens).

Entity Type | Error (%)
Band | 73.83
Company | 21.9
Facility | 54.79
Geo-Location | 19.75
Movie | 75.83
Other | 46.29
Person | 28.18
Product | 39.70
Sportsteam | 48.27
TVshow | 48.71

Classification Error Rate: T-NER



ENTITY LINKING



DBpedia Titles files and NLP resources available at: http://wiki.dbpedia.org/Downloads2015-04


Classifiable Named Entity | Linking Level | Example Entity | Example DBpedia Type | Linking (%)
Linkable | Correctly Linked | Wisconsin | Geo-Location | 63.11
Linkable | Incorrectly Linked | America | Movie | 3.05
Linkable | Uninformative | N.J. | Thing | 16.15
Non-Linkable | Uninformative | Secrets | Thing | 11.85
Non-Linkable | Generic | Whitney | Other | 5.83

A total of 1442 entities out of 1496 entities are disambiguated with ~4k candidate KB resources

Entity Linking: Performance Analysis

Matching function, $P_{KB}(e, r_c)$, to detect the resource in the KB for the surface form of a named entity, if it exists:

1. Lexical Similarity, $lex(e, l_{r_c})$

2. Coherence, $coh(e^{+}, d_{r_c})$



Experimental Analysis: Entity Linking

$$\Rightarrow\; P_{KB}(e, r_c) = \alpha \cdot lex(e, l_{r_c}) + (1 - \alpha) \cdot coh(e^{+}, d_{r_c})$$

($\alpha$ currently set to 0.5)
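The weighted combination above can be sketched in a few lines. Several things here are assumptions for illustration rather than the system's actual choices: the weight is named `alpha`, `lex` is approximated with a `difflib` string ratio between the surface form and the resource label, `coh` is approximated with token overlap (Jaccard) between the tweet context and the resource description, and the two resource descriptions are toy strings invented for the example.

```python
import difflib


def lex(surface_form, label):
    """Lexical similarity in [0, 1] (assumed: difflib ratio, case-insensitive)."""
    return difflib.SequenceMatcher(None, surface_form.lower(), label.lower()).ratio()


def coh(context, description):
    """Coherence in [0, 1] (assumed: Jaccard overlap of lower-cased tokens)."""
    a = set(context.lower().split())
    b = set(description.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def p_kb(surface_form, context, label, description, alpha=0.5):
    """Weighted sum of lexical similarity and coherence."""
    return alpha * lex(surface_form, label) + (1 - alpha) * coh(context, description)


context = "The Town might be one of the best movies I have seen all year"

# Toy descriptions, invented for the example.
film = p_kb("The Town", context, "The Town", "crime film among the best movies")
series = p_kb("The Town", context, "The Town (2012 TV series)", "television drama series")

print(round(film, 3), round(series, 3))
```

With these toy inputs the film candidate scores higher than the TV series, because its label matches the surface form exactly and its description shares tokens with the tweet.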


ENTITY RECOGNITION ENHANCEMENT



Framework stage: Improvement of NER (T-NER+)

$$c^{*}_{e} = \arg\max_{c} \left\{ P_{CRF}(e, c) \cdot P_{KB}(e, r_c) \right\}$$
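The re-classification rule picks the type that maximizes the product of the two scores. The sketch below applies it to the mention "Canada", which T-NER labels Person and T-NER+ labels Geo-Location in this talk; the numeric scores and the `reclassify` helper are hypothetical, chosen only to show how a strong KB match can override a weak CRF preference.

```python
def reclassify(p_crf, p_kb):
    """Return the type c that maximizes P_CRF(e, c) * P_KB(e, r_c)."""
    return max(p_crf, key=lambda c: p_crf[c] * p_kb.get(c, 0.0))


# Hypothetical scores for the mention "Canada".
p_crf = {"Person": 0.55, "Geo-Location": 0.45}  # CRF slightly prefers Person
p_kb = {"Person": 0.10, "Geo-Location": 0.90}   # KB match strongly prefers Geo-Location

original_type = max(p_crf, key=p_crf.get)
new_type = reclassify(p_crf, p_kb)

print(original_type, "->", new_type)
```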

Entity Type | T-NER Precision | T-NER Recall | T-NER F1 | T-NER+ Precision | T-NER+ Recall | T-NER+ F1
Band | 0.26 | 0.88 | 0.40 | 0.39 | 0.90 | 0.54
Company | 0.78 | 0.90 | 0.84 | 0.81 | 0.90 | 0.85
Facility | 0.45 | 0.72 | 0.55 | 0.50 | 0.72 | 0.59
Geo-Location | 0.80 | 0.95 | 0.87 | 0.80 | 0.95 | 0.87
Movie | 0.24 | 0.88 | 0.38 | 0.34 | 0.88 | 0.49
Other | 0.57 | 0.70 | 0.63 | 0.56 | 0.76 | 0.64
Person | 0.72 | 0.92 | 0.81 | 0.77 | 0.92 | 0.84
Product | 0.60 | 0.69 | 0.65 | 0.63 | 0.71 | 0.67
Sportsteam | 0.52 | 0.83 | 0.64 | 0.63 | 0.85 | 0.72
TVshow | 0.51 | 0.91 | 0.66 | 0.45 | 0.89 | 0.59
Overall | 0.62 | 0.87 | 0.73 | 0.66 | 0.88 | 0.76

Comparative Analysis: T-NER and T-NER+



Experimental Analysis: Entity Recognition Enhancement

Entity | Ground-Truth | T-NER | T-NER+
30stm | Band | Product | Band
Yahoo | Company | Band | Company
Southgate House | Facility | Band | Facility
Canada | Geo-Location | Person | Geo-Location
Camp rock 2 | Movie | Person | Movie
Thanksgiving | Other | Person | Other
John Acuff | Person | Facility | Person
iphone | Product | Company | Product
Lions | Sportsteam | Person | Sportsteam
TMZ | TVshow | Band | TVshow

Example: Re-classification of entities

$$\text{Precision } (P) = \frac{|\{cor.cl\} \cap \{cl\}|}{|\{cl\}|}$$

$$\text{Recall } (R) = \frac{|\{cor.cl\} \cap \{cl\}|}{|\{cor.cl\}|}$$

$$F1\ \text{Measure} = \frac{2 \times P \times R}{P + R}$$

cor.cl denotes correctly classified entities, while cl denotes classified entities.
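A short sketch of these measures follows. It assumes that, for a given type, cor.cl can be read as the set of entities whose ground-truth type is that type and cl as the set of entities the system assigned to it; the helper names are illustrative. The sets are taken from the ten re-classification examples in this talk for the type Band, so the resulting values describe those ten examples only, not the full corpus. The last two lines check the reported Band F1 values against their precision and recall.

```python
def precision_recall(gold, predicted):
    """Precision and recall from two sets of entities for one entity type."""
    hits = len(gold & predicted)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Type "Band" over the ten re-classification examples.
gold_band = {"30stm"}
tner_band = {"Yahoo", "Southgate House", "TMZ"}  # entities T-NER labelled Band
tner_plus_band = {"30stm"}                       # entities T-NER+ labelled Band

tner_pr = precision_recall(gold_band, tner_band)
tner_plus_pr = precision_recall(gold_band, tner_plus_band)
print(tner_pr, tner_plus_pr)

# Consistency check on the reported Band scores.
band_f1_tner = round(f1(0.26, 0.88), 2)
band_f1_tner_plus = round(f1(0.39, 0.90), 2)
print(band_f1_tner, band_f1_tner_plus)
```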


New knowledge emerges constantly on social media streams

It is important to identify new knowledge in order to bridge the gap between the Unstructured Web and the Web of Data

An end-to-end entity linking pipeline might be helpful for detecting new knowledge

Entity linking can be used to improve the classification performance of an entity recognition system

Improving entity recognition is crucial for identifying new entities



Presented an end-to-end entity linking pipeline for short textual formats (microposts)

Presented an approach for improving entity recognition through re-classification

Marginal improvements observed in re-classification using linked entities

There is definite scope for improving the current system

New knowledge has been identified, though not dealt with currently

Quality assessment, trustworthy factors…

Relation extraction from microposts to improve identification of new knowledge

Experimenting with more recent datasets

