Harnessing Linked Knowledge Sources for Topic Classification in Social Media

A. Elizabeth Cano (Knowledge Media Institute, The Open University, Milton Keynes), Andrea Varga (University of Sheffield, Sheffield), Matthew Rowe (Lancaster University, Lancaster), Fabio Ciravegna (University of Sheffield, Sheffield), and Yulan He (Aston University, Birmingham). UK, 2013.

DESCRIPTION

Presented at Hypertext'13. Topic classification (TC) of short text messages offers an effective and fast way to reveal events happening around the world, ranging from those related to Disaster (e.g. Sandy hurricane) to those related to Violence (e.g. Egypt revolution). Previous approaches to TC have mostly focused on exploiting individual knowledge sources (KS) (e.g. DBpedia or Freebase) without considering the graph structures that surround concepts present in KSs when detecting the topics of Tweets. In this paper we introduce a novel approach for harnessing such graph structures from multiple linked KSs, by: (i) building a conceptual representation of the KSs, (ii) leveraging contextual information about concepts by exploiting semantic concept graphs, and (iii) providing a principled way for the combination of KSs. Experiments evaluating our TC classifier in the context of Violence detection (VD) and Emergency Responses (ER) show promising results that significantly outperform various baseline models including an approach using a single KS without linked data and an approach using only Tweets.

INTRODUCTION Social Media Streams - Risk in violent and criminal activities

INTRODUCTION Research Questions:

o  Can semantic features help in topic classification (TC)?

o  Which knowledge source (KS) data and KS taxonomies provide useful information for improving the TC of tweets?

OUTLINE

•  Introduction
   - Topic Classification (TC) of Microposts
   - Related Work
   - State of the art limitations
•  Proposed Approach
•  Experiments
•  Findings
•  Conclusions

INTRODUCTION

Difficulties of Topic Classification of microposts
o  Restricted number of characters
o  Irregular and ill-formed words
   •  Mixing upper and lowercase letters
      §  Makes it difficult to detect proper nouns and other part-of-speech tags.
   •  Wide variety of language
      §  E.g., “see u soon”
o  Event-dependent emerging jargon
   •  Volatile jargon relevant to particular events
      §  E.g., “Jan.25” (used during the Egyptian revolution)
o  High Topical Diversity
o  Sparse data

INTRODUCTION Social Knowledge Sources (KS)

             DBpedia*       Yago2          Freebase
Resources    2.35 million   447 million    3.6 million
Classes      359            562,312        1,450
Properties   1,820          253,213,842    7,000

*Using the DBpedia ontology

o  Structured Semantic Web representation of data
   •  Maintained by thousands of editors
      §  E.g. DBpedia, derived from Wikipedia
      §  Freebase
   •  Evolves and adapts as knowledge changes [Syed et al, 2008]
o  Cover a broad range of topics
o  Characterise topics with a large number of resources

INTRODUCTION Local and External Metadata of a Tweet

[Figure: example tweet annotated with named entities and their linked resources]

NER:Person    <http://dbpedia.org/resource/Barack_Obama>
NER:Country   <http://dbpedia.org/resource/Egypt>
NER:Person    <http://dbpedia.org/resource/Hosni_Mubarak>

PROPOSED APPROACH

o  State of the art limitations
   §  Use of single knowledge sources
   §  Entities’ metadata is constrained by the NER service used (e.g. OpenCalais, Alchemy).
o  Our approach
   §  Exploits multiple knowledge sources.
   §  Enhances the entity metadata by deriving semantic graphs.
   §  Leverages the graph structures surrounding entities present in a KS for the TC task.

Exploiting Knowledge Sources for the Topic Classification of Microposts

OUTLINE

•  Introduction
•  Proposed Approach
   •  Semantic Meta-graphs
   •  Weighting Schemas
   •  Enhancing TC with Semantic Features
•  Experiments
•  Findings
•  Conclusions

PROPOSED APPROACH Rationale…

[Figure: two numbered examples, 1 and 2]

o  Could be more indicative of War and Conflict
o  Not necessarily a good indicator of War and Conflict (example 2)

Can the graph structure of existing knowledge sources provide an abstraction of the use of these entity types for representing a topic?

PROPOSED APPROACH Framework for Topic Classification of Tweets

[Figure: framework diagram with the components Retrieve Articles (DB, FB, DB-FB), Retrieve Tweets (TW), Concept Enrichment, Annotate Tweets, Derive Semantic Features, and Build Cross-Source Topic Classifier]

1  Datasets Collection
   SPARQL query for all resources from a given Topic (e.g. War)
2  Datasets Enrichment
   From tweets and articles’ abstracts, extract entities and link them to resources in DBpedia and Freebase.
3  Semantic Features Derivation
4  Build a Topic Classifier based on Features Derived from Crossed-Sources

PROPOSED APPROACH Deriving Semantic Meta-Graphs

<dbpedia:Barack_Obama, rdf:type, yago:PresidentOfTheUnitedStates>
<dbpedia:Barack_Obama, dbo:birthPlace, dbpedia:Hawaii>

PROPOSED APPROACH Definition 1 - Resource Meta-graph

Is a sequence of tuples G := (R, P, C, Y) where
•  R, P, C are finite sets whose elements are resources, properties and classes;
•  Y is a ternary relation $Y \subseteq R \times P \times C$ representing a hypergraph with ternary edges.
•  The hypergraph of Y is a tripartite graph $H(Y) = \langle V, D \rangle$ where the vertices are $V = R \cup P \cup C$ and the edges are $D = \{\{r, p, c\} \mid (r, p, c) \in Y\}$.

PROPOSED APPROACH Resource Meta-graph

The meta-graph of entity e is the aggregation of all resources, properties and classes related to this entity.

[Figure: meta-graph of the entity Obama]
Projecting on Properties: birthPlace, author, spouse
Projecting on Classes: LivingPeople, PresidentOfTheUnitedStates, Person, Author

How can we weight these graphs to reveal the semantic features that characterise Obama in the context of Violence?
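
The aggregation and the two projections can be illustrated with a minimal Python sketch. It assumes the ternary relation Y is stored as (resource, property, class) tuples; the toy tuples and the helper names `meta_graph`, `project_properties` and `project_classes` are hypothetical and are not taken from the authors' implementation.

```python
from typing import List, Set, Tuple

# One element of the ternary relation Y: (resource, property, class).
Edge = Tuple[str, str, str]

# Hypothetical toy relation; the names echo the Obama example on the slide.
Y: List[Edge] = [
    ("Obama", "birthPlace", "Person"),
    ("Obama", "spouse", "LivingPeople"),
    ("Obama", "author", "Author"),
    ("Obama", "birthPlace", "PresidentOfTheUnitedStates"),
    ("Egypt", "capital", "Country"),
]


def meta_graph(entity: str, relation: List[Edge]) -> List[Edge]:
    """Aggregate every (resource, property, class) edge whose resource is the entity."""
    return [edge for edge in relation if edge[0] == entity]


def project_properties(edges: List[Edge]) -> Set[str]:
    """Projection of a meta-graph on its property vertices."""
    return {p for _, p, _ in edges}


def project_classes(edges: List[Edge]) -> Set[str]:
    """Projection of a meta-graph on its class vertices."""
    return {c for _, _, c in edges}


obama = meta_graph("Obama", Y)
print(sorted(project_properties(obama)))
print(sorted(project_classes(obama)))
```

The projections are plain sets here, so they carry no weights yet; weighting is a separate step.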

PROPOSED APPROACH Weighting Semantic Features

Specificity
Measures the relative importance of a property $p \in G(e)$ to a given class $c \in G(e)$ in a KS graph $G_{KS}$:

$$\mathrm{specificity}_{KS}(p, c) = \frac{N_p(R(c))}{N(R(c))}$$

PROPOSED APPROACH Weighting Semantic Features

Generality
Captures the specialisation of a property p to a given class c, by computing the property’s frequency among other semantically related classes R’(c):

$$\mathrm{generality}_{KS}(p, c) = \frac{N(R'(c))}{N_p(R'(c))}$$

where N(R’(c)) is the number of resources whose type is either c or a specialisation of c’s parent classes.

PROPOSED APPROACH Weighting Semantic Features

$$SG(p, c) = \mathrm{specificity}_{KS}(p, c) \times \mathrm{generality}_{KS}(p, c)$$
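
A small worked example may help to see how the combined weight behaves. The Python sketch below takes specificity as N_p(R(c)) / N(R(c)), generality as N(R'(c)) / N_p(R'(c)), and SG as their product. It assumes that N_p(S) counts the resources in S that use property p, and that R'(c) contains the resources typed with c or with any class sharing c's direct parent. The toy knowledge source and all function names are hypothetical.

```python
from typing import Dict, Set

# Hypothetical toy knowledge source (illustration data only).
TYPE_OF: Dict[str, str] = {
    "Barack_Obama": "President",
    "Hosni_Mubarak": "President",
    "Angela_Merkel": "Chancellor",
    "Egypt": "Country",
}
PROPS: Dict[str, Set[str]] = {
    "Barack_Obama": {"birthPlace", "spouse", "commander"},
    "Hosni_Mubarak": {"birthPlace", "commander"},
    "Angela_Merkel": {"birthPlace", "spouse"},
    "Egypt": {"capital"},
}
PARENT: Dict[str, str] = {
    "President": "Politician",
    "Chancellor": "Politician",
    "Country": "Place",
}


def resources_of(c: str) -> Set[str]:
    """R(c): resources whose type is c."""
    return {r for r, t in TYPE_OF.items() if t == c}


def related_resources(c: str) -> Set[str]:
    """R'(c): resources typed with c or with a class sharing c's parent (assumption)."""
    parent = PARENT.get(c)
    return {r for r, t in TYPE_OF.items() if t == c or PARENT.get(t) == parent}


def count_with_property(p: str, resources: Set[str]) -> int:
    """N_p(S): number of resources in S that use property p (assumption)."""
    return sum(1 for r in resources if p in PROPS.get(r, set()))


def specificity(p: str, c: str) -> float:
    rc = resources_of(c)
    return count_with_property(p, rc) / len(rc) if rc else 0.0


def generality(p: str, c: str) -> float:
    related = related_resources(c)
    n_p = count_with_property(p, related)
    return len(related) / n_p if n_p else 0.0


def sg(p: str, c: str) -> float:
    return specificity(p, c) * generality(p, c)


for prop in ("commander", "birthPlace", "spouse"):
    print(prop, sg(prop, "President"))
```

In this toy data `commander` receives the highest weight for `President`, because every president uses it while it is rarer among the sibling classes; `birthPlace` is used by everyone and therefore gains nothing from the generality term.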

PROPOSED APPROACH Enhancing Feature Space with Semantic Features

Semantic Augmentation (A1)

Class Features:             $F'_{A1+C} = F + F_C$
Property Features:          $F'_{A1+P} = F + F_P$
Class + Property Features:  $F'_{A1+C+P} = F + F_C + F_P$

Example:
F:        president, obama, televised, statement, hosni, mubarak, resignation, cnn, says, egypt
F_{A1+P}: dbpedia:birth, dbpedia:state, …., dbpedia-owl:PopulatedPlace/populationDensity….
F_{A1+C}: PopulatedPlace, Office_holder, PresidentOfTheUnitedStates, Politician…
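
As an illustration of the A1 augmentation, the sketch below appends class and property features to a bag of tokens. The lookup tables, the `C:` and `P:` prefixes (used only to keep the feature namespaces apart) and the function name `augment_a1` are assumptions made for this example, not part of the original system.

```python
from typing import Dict, List, Set

# Hypothetical lookup tables standing in for the KS meta-graph of each entity.
ENTITY_CLASSES: Dict[str, Set[str]] = {
    "obama": {"PresidentOfTheUnitedStates", "Politician"},
    "egypt": {"PopulatedPlace"},
}
ENTITY_PROPERTIES: Dict[str, Set[str]] = {
    "obama": {"dbpedia:birth"},
    "egypt": {"dbpedia:state"},
}


def augment_a1(tokens: List[str], use_classes: bool, use_properties: bool) -> List[str]:
    """Return F' = F (+ F_C) (+ F_P) as one flat feature list."""
    class_feats: Set[str] = set()
    prop_feats: Set[str] = set()
    for token in tokens:
        class_feats |= ENTITY_CLASSES.get(token, set())
        prop_feats |= ENTITY_PROPERTIES.get(token, set())

    features = list(tokens)  # F: the original lexical features
    if use_classes:
        features += sorted("C:" + c for c in class_feats)  # F_C
    if use_properties:
        features += sorted("P:" + p for p in prop_feats)  # F_P
    return features


tweet = ["president", "obama", "egypt"]
print(augment_a1(tweet, use_classes=True, use_properties=True))
```

Switching the two flags gives the three A1 variants: classes only, properties only, or both.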

PROPOSED APPROACH Enhancing Feature Space with Semantic Features

Semantic Augmentation with Generalisation (A2)
This augmentation exploits the subsumption relation among classes within the DBpedia or Freebase ontologies. In this case we consider the set of parent classes of c.

Parent(c) Features:             $F'_{A2+C} = F + F_{parent(c)}$
Parent(c) + Property Features:  $F'_{A2+C+P} = F + F_P + F_{parent(c)}$

Example:
F:                president, obama, televised, statement, hosni, mubarak, resignation, cnn, says, egypt
F_{A2+parent(c)}: Place, Office_holder, President, Politician…
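
The generalisation step of A2 can be sketched in the same spirit: each class is replaced by its parent before being added to the feature list. The hierarchy fragment below is hypothetical, and keeping a class unchanged when no parent is known is an assumption of this sketch rather than documented behaviour.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical fragment of a class hierarchy (child -> parent).
PARENT_CLASS: Dict[str, str] = {
    "PresidentOfTheUnitedStates": "President",
    "PopulatedPlace": "Place",
}


def parent_features(classes: Iterable[str]) -> Set[str]:
    """Map each class to its parent; classes without a known parent are kept as-is."""
    return {PARENT_CLASS.get(c, c) for c in classes}


def augment_a2(tokens: List[str], classes: Iterable[str],
               properties: Iterable[str] = ()) -> List[str]:
    """Return F' = F + F_parent(c), plus F_P when properties are supplied."""
    features = list(tokens)
    features += sorted(parent_features(classes))
    features += sorted(properties)
    return features


print(augment_a2(
    ["obama", "egypt"],
    ["PresidentOfTheUnitedStates", "PopulatedPlace", "Politician"],
))
```

Compared with A1, the class features become coarser (`President` instead of `PresidentOfTheUnitedStates`), which trades precision for better coverage across entities.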

OUTLINE

•  Introduction
•  Proposed Approach
•  Experiments
   •  Dataset
   •  Baseline Features
   •  Results
•  Findings
•  Conclusions

PROPOSED APPROACH Datasets

o  Twitter Dataset [Abel et al., 2011] (TW)
   §  Collected over two months starting in November 2010.
   §  Topically annotated.
   §  Using tweets labelled as “War & Conflict” (War), “Law & Crime” (Cri), “Disaster & Accident” (DisAcc).
   §  Multilabelled dataset comprising 10,189 Tweets.
o  DBpedia (DB) and Freebase (FB) Dataset
   §  Queried SPARQL endpoints for all resources from categories and subcategories of skos:concept of War, Cri, DisAcc.
      •  DBpedia – 9,465 articles
      •  Freebase – 16,915 articles
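
One plausible shape for such a category query is sketched below in Python, which only builds the query string and does not contact any endpoint. The use of `dct:subject` and a `skos:broader*` property path over DBpedia category URIs is an assumption about how the collection could be done; the slides do not show the actual queries, and `build_topic_query` is a hypothetical helper.

```python
def build_topic_query(category: str, limit: int = 100) -> str:
    """Build a SPARQL query for resources filed under a category or its subcategories.

    Assumes DBpedia-style category URIs and the dct:subject / skos:broader vocabulary.
    """
    return f"""
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?resource WHERE {{
  ?resource dct:subject ?category .
  ?category skos:broader* <http://dbpedia.org/resource/Category:{category}> .
}} LIMIT {limit}
""".strip()


print(build_topic_query("War", limit=10))
```

An unbounded `skos:broader*` path can be expensive on a public endpoint, so a real collection run would probably need a depth limit or paging.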

PROPOSED APPROACH Experimental Setup A

1.  Use annotated Tweets for training (TW)
    -  Baseline: Bag of Words (BoW), Bag of Entities (BoE), and Part of Speech tags (PoS).
    -  Enhance Features using the DBpedia and Freebase graphs.
2.  Train an SVM classifier based on the TW corpus. Trained/Tested on 80%-20% over five independent runs.
3.  Compute Precision, Recall, and F-measure.
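
The evaluation protocol (repeated 80%-20% splits followed by precision, recall and F-measure) can be sketched with the standard library alone. This is not the authors' pipeline: the SVM is replaced by a pluggable `train_fn` callback, labels are simplified to one binary topic, and every function name is hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Predictor = Callable[[str], int]


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Binary precision, recall and F-measure for the positive label 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def split_80_20(n_items: int, seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle indices with a fixed seed and cut them 80% / 20%."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    cut = (n_items * 4) // 5
    return indices[:cut], indices[cut:]


def evaluate(texts: Sequence[str], labels: Sequence[int],
             train_fn: Callable[[List[str], List[int]], Predictor],
             runs: int = 5) -> Tuple[float, float, float]:
    """Average precision, recall and F-measure over independent 80/20 runs."""
    scores = []
    for seed in range(runs):
        train_idx, test_idx = split_80_20(len(texts), seed)
        predict = train_fn([texts[i] for i in train_idx], [labels[i] for i in train_idx])
        y_true = [labels[i] for i in test_idx]
        y_pred = [predict(texts[i]) for i in test_idx]
        scores.append(precision_recall_f1(y_true, y_pred))
    return (
        sum(s[0] for s in scores) / runs,
        sum(s[1] for s in scores) / runs,
        sum(s[2] for s in scores) / runs,
    )
```

Any classifier can be plugged in through `train_fn`, as long as it returns a function mapping one text to a 0/1 label.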

PROPOSED APPROACH Results for TW dataset

PROPOSED APPROACH Experimental Setup B

1.  Use labelled articles from DBpedia (DB) and Freebase (FB) for training
    -  Baseline: Bag of Words (BoW), Bag of Entities (BoE), and Part of Speech tags (PoS).
    -  Enhance Features using the DBpedia and Freebase graphs.
2.  Train an SVM classifier based on the DB, FB, DB+FB, DB+FB+TW training corpus and test on TW. Trained/Tested on 80%-20% over five independent runs.
3.  Compute Precision, Recall, and F-measure.

PROPOSED APPROACH Results for Training on KS articles, and Testing on TW

PROPOSED APPROACH Factors contributing to the performance of a KS graph for TC

1.  Topic-Class Entropy
2.  Entity-Class Entropy
3.  Topic-Class-Property Entropy

PROPOSED APPROACH Correlating Entropy metrics with the performance of the cross-source TC classifiers.

Indicates that the higher the number of ambiguous entities in a topic within a KS graph, the lower the performance of the TC.

FINDINGS

1.  KSs combined with Twitter data provide complementary information for TC of Tweets, outperforming the KS approaches and the approach using Tweets only.

2.  A KS’s performance on TC depends on the coverage of the entities within that KS.

3.  When entities have low coverage in a KS, exploiting the mapping between corresponding KSs’ ontologies is beneficial.

CONCLUSIONS

•  Explored the task of topic classification of tweets
•  Exploited information in KSs (e.g. DBpedia, Freebase) using semantic graphs for concepts and properties surrounding an entity.
•  Presented the importance of considering graph structures in KSs for the supervised classification of tweets, by achieving significant improvement over various state-of-the-art approaches using both single KSs and Tweets only.

CONTACT US

A.  Elizabeth Cano •  http://people.kmi.open.ac.uk/cano/

B.  Andrea Varga •  http://sites.google.com/site/missandreavarga/

C.  Matthew Rowe •  http://lancs.ac.uk/staff/rowem/

D.  Fabio Ciravegna •  http://staffwww.dcs.shef.ac.uk/people/F.Ciravegna

E.  Yulan He •  http://www1.aston.ac.uk/eas/staff/dr-yulan-he