DESCRIPTION
Presented at Hypertext'13. Topic classification (TC) of short text messages offers an effective and fast way to reveal events happening around the world, ranging from those related to Disaster (e.g. Sandy hurricane) to those related to Violence (e.g. Egypt revolution). Previous approaches to TC have mostly focused on exploiting individual knowledge sources (KS) (e.g. DBpedia or Freebase) without considering the graph structures that surround concepts present in KSs when detecting the topics of Tweets. In this paper we introduce a novel approach for harnessing such graph structures from multiple linked KSs, by: (i) building a conceptual representation of the KSs, (ii) leveraging contextual information about concepts by exploiting semantic concept graphs, and (iii) providing a principled way for the combination of KSs. Experiments evaluating our TC classifier in the context of Violence detection (VD) and Emergency Responses (ER) show promising results that significantly outperform various baseline models including an approach using a single KS without linked data and an approach using only Tweets.
A. Elizabeth Cano, Andrea Varga†, Matthew Rowe‡, Fabio Ciravegna†, and Yulan He°
Knowledge Media Institute, The Open University, Milton Keynes † University of Sheffield, Sheffield ‡ Lancaster University, Lancaster ° Aston University, Birmingham
UK. 2013
Harnessing Linked Knowledge Sources for Topic Classification in Social Media
INTRODUCTION Social Media Streams - Risk in violent and criminal activities
INTRODUCTION Research Questions:
o Can semantic features help in topic classification (TC)?
o Which knowledge source (KS) data and KS taxonomies provide useful information for improving the TC of tweets?
OUTLINE
• Introduction - Topic Classification (TC) of Microposts - Related Work - State of the art limitations
• Proposed Approach • Experiments • Findings • Conclusions
INTRODUCTION
Difficulties of Topic Classification of microposts
o Restricted number of characters
o Irregular and ill-formed words
  • Mixing upper and lowercase letters
    § Makes it difficult to detect proper nouns and other part-of-speech tags.
  • Wide variety of language
    § E.g., “see u soon”
o Event-dependent emerging jargon
  • Volatile jargon relevant to particular events
    § E.g., “Jan.25” (used during the Egyptian revolution)
o High topical diversity
o Sparse data
INTRODUCTION Social Knowledge Sources (KS)
             DBpedia*        Yago2          Freebase
Resources    2.35 million    447 million    3.6 million
Classes      359             562,312        1,450
Properties   1,820           253,213,842    7,000
*Using the DBpedia ontology
o Structured Semantic Web representation of data
  • Maintained by thousands of editors
    § E.g., DBpedia, derived from Wikipedia
    § Freebase
  • Evolves and adapts as knowledge changes [Syed et al, 2008]
o Cover a broad range of topics
o Characterise topics with a large number of resources
INTRODUCTION Local and External Metadata of a Tweet
NER:Country NER:Person
NER:Person
<http://dbpedia.org/resource/Barack_Obama>
<http://dbpedia.org/resource/Egypt>
<http://dbpedia.org/resource/Hosni_Mubarak>
PROPOSED APPROACH
o State of the art limitations
  § Use of single knowledge sources
  § Entities’ metadata is constrained by the NER service used (e.g., OpenCalais, Alchemy).
o Our approach
  § Exploits multiple knowledge sources.
  § Enhances the entity metadata by deriving semantic graphs.
  § Leverages the graph structures surrounding entities present in a KS for the TC task.
Exploiting Knowledge Sources for the Topic Classification of Microposts
OUTLINE
• Introduction • Proposed Approach • Semantic Meta-graphs • Weighting Schemas • Enhancing TC with Semantic Features
• Experiments • Findings • Conclusions
PROPOSED APPROACH Rationale…
[Figure: examples 1 and 2]
Could be more indicative of War and Conflict
Not necessarily a good indicator of War and Conflict
Can the graph structure of existing knowledge sources provide an abstraction of the use of these entity types for representing a topic?
PROPOSED APPROACH Framework for Topic Classification of Tweets
[Diagram boxes: DB, FB, DB-FB (Retrieve Articles); TW (Retrieve Tweets); Concept Enrichment; Annotate Tweets; Derive Semantic Features; Build Cross-Source Topic Classifier]
1 Datasets Collection
SPARQL query for all resources from a given Topic (e.g. War)
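The collection step could be sketched as follows. This is a minimal illustration rather than the query used in the work: it assumes that resources are attached to categories with `dcterms:subject`, that subcategories point to their parent with `skos:broader`, and the function name `build_topic_query` and the example category URI are hypothetical. The sketch only builds the query string and does not contact any endpoint.

```python
def build_topic_query(category_uri: str, limit: int = 1000) -> str:
    """Build a SPARQL query selecting resources filed under a category
    or any of its subcategories.

    Assumed modelling (not stated in the slides): dcterms:subject links a
    resource to a category, skos:broader links a subcategory to its parent.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT DISTINCT ?resource WHERE {\n"
        "  ?resource dcterms:subject ?category .\n"
        # skos:broader* follows zero or more broader links (SPARQL 1.1 path)
        f"  ?category skos:broader* <{category_uri}> .\n"
        "}\n"
        f"LIMIT {limit}"
    )


# Hypothetical category URI standing in for the topic "War".
query = build_topic_query("http://dbpedia.org/resource/Category:War", limit=50)
print(query)
```

The unbounded `skos:broader*` path can be expensive on a large category graph, so a real collection run would probably cap the depth or page through results.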
2 Datasets Enrichment
From tweets and articles’ abstracts, extract entities and link them to resources in DBpedia and Freebase.
3 Semantic Features Derivation
4 Build a Topic Classifier based on Features Derived from Crossed-Sources
PROPOSED APPROACH Deriving Semantic Meta-Graphs
<dbpedia:Barack_Obama, rdf:type, yago:PresidentOfTheUnitedStates>
<dbpedia:Barack_Obama, dbo:birthPlace, dbpedia:Hawaii>
PROPOSED APPROACH Definition 1 - Resource Meta-graph
A resource meta-graph is a sequence of tuples $G := (R, P, C, Y)$ where
• $R$, $P$, $C$ are finite sets whose elements are resources, properties and classes;
• $Y \subseteq R \times P \times C$ is a ternary relation representing a hypergraph with ternary edges;
• the hypergraph of $Y$ is a tripartite graph $H(Y) = \langle V, D \rangle$, where the vertices are $V = R \cup P \cup C$ and the edges are $D = \{\{r, p, c\} \mid (r, p, c) \in Y\}$.
PROPOSED APPROACH Resource Meta-graph
The meta-graph of entity e is the aggregation of all resources, properties and classes related to this entity.
Projecting on Properties (Obama): birthPlace, author, spouse
Projecting on Classes (Obama): LivingPeople, PresidentOfTheUnitedStates, Person, Author
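As a rough illustration of how such a meta-graph could be assembled, the sketch below aggregates the properties, classes and linked resources of one entity from a list of triples. The function name `meta_graph`, the dictionary layout, and the third toy triple are assumptions made for the example; only the two Barack Obama triples come from the slides.

```python
RDF_TYPE = "rdf:type"


def meta_graph(entity, triples):
    """Aggregate the resources, properties and classes attached to `entity`.

    `triples` is an iterable of (subject, predicate, object) strings.
    rdf:type objects are treated as classes; every other predicate is
    treated as a property whose object is a related resource.
    """
    resources, properties, classes = set(), set(), set()
    for subject, predicate, obj in triples:
        if subject != entity:
            continue  # only triples about the requested entity contribute
        if predicate == RDF_TYPE:
            classes.add(obj)
        else:
            properties.add(predicate)
            resources.add(obj)
    return {"resources": resources, "properties": properties, "classes": classes}


triples = [
    ("dbpedia:Barack_Obama", "rdf:type", "yago:PresidentOfTheUnitedStates"),
    ("dbpedia:Barack_Obama", "dbo:birthPlace", "dbpedia:Hawaii"),
    ("dbpedia:Hosni_Mubarak", "rdf:type", "dbo:Person"),  # toy triple, not from the slides
]

g = meta_graph("dbpedia:Barack_Obama", triples)
print(sorted(g["properties"]))  # projection on properties
print(sorted(g["classes"]))     # projection on classes
```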
How can we weight these graphs to reveal the semantic features that characterise Obama in the context of Violence?
PROPOSED APPROACH Weighting Semantic Features
Specificity
Measures the relative importance of a property $p \in G(e)$ to a given class $c \in G(e)$ in a KS graph $G_{KS}$:
$\mathrm{specificity}_{KS}(p, c) = \frac{N_p(R(c))}{N(R(c))}$
PROPOSED APPROACH Weighting Semantic Features
Generality
Captures the specialisation of a property $p$ to a given class $c$ by computing the property’s frequency among other semantically related classes $R'(c)$:
$\mathrm{generality}_{KS}(p, c) = \frac{N(R'(c))}{N_p(R'(c))}$
where $N(R'(c))$ is the number of resources whose type is either $c$ or a specialisation of $c$’s parent classes.
PROPOSED APPROACH Weighting Semantic Features
$SG(p, c) = \mathrm{specificity}_{KS}(p, c) \times \mathrm{generality}_{KS}(p, c)$
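The weighting can be illustrated on a toy knowledge source. The sketch below is one possible reading, not the authors' implementation: it assumes that $N_p(S)$ counts the resources in $S$ that carry property $p$, that $R'(c)$ contains resources typed with $c$ or with any class sharing a parent with $c$, and that both measures are 0 when a denominator is empty. All resource names, class names and the `parents` map are made up for the example.

```python
def specificity(p, c, resources):
    """Share of the resources of type c that carry property p."""
    of_type = [r for r in resources.values() if c in r["types"]]
    if not of_type:
        return 0.0
    return sum(p in r["props"] for r in of_type) / len(of_type)


def generality(p, c, resources, parents):
    """N(R'(c)) / N_p(R'(c)) over classes related to c through a shared parent."""
    c_parents = parents.get(c, set())
    related = {c} | {k for k, ps in parents.items() if ps & c_parents}
    in_related = [r for r in resources.values() if r["types"] & related]
    with_p = sum(p in r["props"] for r in in_related)
    if with_p == 0:
        return 0.0
    return len(in_related) / with_p


def sg(p, c, resources, parents):
    """Combined weight: specificity times generality."""
    return specificity(p, c, resources) * generality(p, c, resources, parents)


# Toy knowledge source (hypothetical data).
resources = {
    "Obama": {"types": {"President"}, "props": {"birthPlace", "spouse"}},
    "Mubarak": {"types": {"President"}, "props": {"birthPlace"}},
    "Kerry": {"types": {"Senator"}, "props": {"birthPlace"}},
    "Cairo": {"types": {"City"}, "props": {"populationDensity"}},
}
parents = {"President": {"Politician"}, "Senator": {"Politician"}, "City": {"Place"}}

print(sg("spouse", "President", resources, parents))
print(sg("birthPlace", "President", resources, parents))
```

In this toy data `spouse` receives a higher weight than `birthPlace` for `President`, because `birthPlace` appears on every related resource and so carries little discriminating information.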
PROPOSED APPROACH Enhancing Feature Space with Semantic Features
Semantic Augmentation (A1)
Class Features: $F'_{A1+C} = F + F_C$
Property Features: $F'_{A1+P} = F + F_P$
Class + Property Features: $F'_{A1+C+P} = F + F_C + F_P$
Example:
$F$: president, obama, televised, statement, hosni, mubarak, resignation, cnn, says, egypt
$F_{A1+P}$: dbpedia:birth, dbpedia:state, …, dbpedia-owl:PopulatedPlace/populationDensity, …
$F_{A1+C}$: PopulatedPlace, Office_holder, PresidentOfTheUnitedStates, Politician, …
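A minimal sketch of the A1 augmentation, assuming the feature spaces are bags of strings that are simply merged. The use of `collections.Counter` and the function name `augment_a1` are choices made for this illustration; the token, class and property names are taken from the example above.

```python
from collections import Counter


def augment_a1(tokens, classes=(), properties=()):
    """Return F' = F + F_C + F_P as a bag (feature -> count)."""
    return Counter(tokens) + Counter(classes) + Counter(properties)


tokens = ["president", "obama", "televised", "statement", "hosni",
          "mubarak", "resignation", "cnn", "says", "egypt"]
classes = ["PopulatedPlace", "Office_holder",
           "PresidentOfTheUnitedStates", "Politician"]
properties = ["dbpedia:birth", "dbpedia:state"]

f_a1_c = augment_a1(tokens, classes=classes)        # class features only
f_a1_p = augment_a1(tokens, properties=properties)  # property features only
f_a1_cp = augment_a1(tokens, classes, properties)   # class + property features
print(len(f_a1_c), len(f_a1_p), len(f_a1_cp))
```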
PROPOSED APPROACH Enhancing Feature Space with Semantic Features
Semantic Augmentation with Generalisation (A2)
This augmentation exploits the subsumption relation among classes within the DBpedia or Freebase ontologies. In this case we consider the set of parent classes of c.
Parent(c) Features: $F'_{A2+C} = F + F_{parent(c)}$
Parent(c) + Property Features: $F'_{A2+C+P} = F + F_P + F_{parent(c)}$
Example:
$F$: president, obama, televised, statement, hosni, mubarak, resignation, cnn, says, egypt
$F_{A2+parent(c)}$: Place, Office_holder, President, Politician, …
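The generalisation step can be sketched as a lookup from each class to its parent classes. The `parents_of` map below is an assumption inferred from the example (PopulatedPlace to Place, PresidentOfTheUnitedStates to President), and keeping a class unchanged when no parent is known is a choice made for this sketch, not something the slides state.

```python
from collections import Counter


def parent_features(classes, parents_of):
    """Replace each class by its parent classes; keep it as-is if none is known."""
    bag = Counter()
    for c in classes:
        for parent in parents_of.get(c, [c]):
            bag[parent] += 1
    return bag


def augment_a2(tokens, classes, parents_of, properties=()):
    """Return F' = F + F_P + F_parent(c) as a bag (feature -> count)."""
    return Counter(tokens) + Counter(properties) + parent_features(classes, parents_of)


# Hypothetical fragment of a class hierarchy.
parents_of = {
    "PopulatedPlace": ["Place"],
    "PresidentOfTheUnitedStates": ["President"],
}
classes = ["PopulatedPlace", "Office_holder",
           "PresidentOfTheUnitedStates", "Politician"]

generalised = parent_features(classes, parents_of)
features = augment_a2(["obama", "egypt"], classes, parents_of, ["dbpedia:birth"])
print(sorted(generalised))
```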
OUTLINE
• Introduction • Proposed Approach • Experiments • Dataset • Baseline Features • Results
• Findings • Conclusions
PROPOSED APPROACH Datasets
o Twitter Dataset [Abel et al., 2011] (TW)
  § Collected over two months starting in Nov 2010.
  § Topically annotated
  § Using tweets labelled as “War & Conflict” (War), “Law & Crime” (Cri), “Disaster & Accident” (DisAcc).
  § Multilabelled dataset comprising 10,189 Tweets.
o DBpedia (DB) and Freebase (FB) Dataset
  § SPARQL queried endpoints for all resources from categories and subcategories of skos:concept of War, Cri, DisAcc.
    • DBpedia – 9,465 articles
    • Freebase – 16,915 articles
PROPOSED APPROACH Experimental Setup A
1. Use annotated Tweets for training (TW)
   - Baseline: Bag of Words (BoW), Bag of Entities (BoE), and Part of Speech tags (PoS).
   - Enhance features using the DBpedia and Freebase graphs.
2. Train an SVM classifier based on the TW corpus. Trained/tested on an 80%-20% split over five independent runs.
3. Compute Precision, Recall, and F-measure.
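The evaluation protocol could look roughly like the sketch below, which covers only the split and the metrics and leaves out the SVM itself. It assumes a one-vs-rest reading per topic and takes F-measure to be the balanced F1; the helper names, the seeds and the toy labels are illustrative.

```python
import random


def split_80_20(items, seed):
    """Shuffle with a fixed seed and return (train, test) as an 80%/20% split."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for one topic treated as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Five independent runs, each with its own shuffle of ten toy items.
runs = [split_80_20(range(10), seed) for seed in range(5)]

# Toy gold and predicted labels for a single run.
gold = ["War", "War", "Cri", "War", "Cri"]
predicted = ["War", "Cri", "Cri", "War", "War"]
p, r, f = precision_recall_f1(gold, predicted, positive="War")
print(round(p, 3), round(r, 3), round(f, 3))
```

Scores from the five runs would then be averaged per topic.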
PROPOSED APPROACH Results for TW dataset
PROPOSED APPROACH Experimental Setup B
1. Use labelled articles from DBpedia (DB) and Freebase (FB) for training
   - Baseline: Bag of Words (BoW), Bag of Entities (BoE), and Part of Speech tags (PoS).
   - Enhance features using the DBpedia and Freebase graphs.
2. Train an SVM classifier based on the DB, FB, DB+FB, DB+FB+TW training corpus and test on TW. Trained/tested on an 80%-20% split over five independent runs.
3. Compute Precision, Recall, and F-measure.
PROPOSED APPROACH Results for Training on KS articles, and Testing on TW
PROPOSED APPROACH Factors contributing to the performance of a KS graph for TC
1. Topic-Class Entropy 2. Entity-Class Entropy 3. Topic-Class-Property Entropy
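The slides name these metrics without defining them. As one possible reading, stated here as an assumption, topic-class entropy could be the Shannon entropy of the distribution of classes among the entities of a topic, so that a topic whose entities spread over many classes scores higher. The sketch below shows only that generic entropy computation on made-up class labels.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum of -p * log2(p) over the observed labels.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical classes of the entities found in two topics.
focused_topic = ["Person", "Person", "Person", "Person"]
mixed_topic = ["Person", "Person", "Place", "Place"]

print(shannon_entropy(focused_topic))
print(shannon_entropy(mixed_topic))
```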
PROPOSED APPROACH Correlating Entropy metrics with the performance of the cross-source TC classifiers.
Indicates that the higher the number of ambiguous entities in a topic within a KS graph, the lower the performance of the TC.
FINDINGS
1. KSs combined with Twitter data provide complementary information for TC of Tweets, outperforming the KS approaches and the approach using Tweets only.
2. A KS’s performance on TC depends on the coverage of the entities within that KS.
3. When entities have low coverage in a KS, exploiting the mapping between corresponding KSs’ ontologies is beneficial.
CONCLUSIONS
• Explored the task of topic classification of tweets
• Exploited information in KSs (e.g. DBpedia, Freebase) using semantic graphs for concepts and properties surrounding an entity.
• Presented the importance of considering graph structures in KSs for the supervised classification of tweets, by achieving significant improvement over various state-of-the-art approaches using both single KSs and Tweets only.
CONTACT US
A. Elizabeth Cano • http://people.kmi.open.ac.uk/cano/
B. Andrea Varga • http://sites.google.com/site/missandreavarga/
C. Matthew Rowe • http://lancs.ac.uk/staff/rowem/
D. Fabio Ciravegna • http://staffwww.dcs.shef.ac.uk/people/F.Ciravegna
E. Yulan He • http://www1.aston.ac.uk/eas/staff/dr-yulan-he