Personalized and Adaptive Semantic Information Filtering for Social Media Pavan Kapanipathi, PhD Candidate Kno.e.sis Center, Wright State University Committee: Drs. Amit Sheth (Advisor), Krishnaprasad Thirunarayan, Derek Doran, and Prateek Jain Ohio Center of Excellence in Knowledge-Enabled Computing


Page 1: Personalized and Adaptive Semantic Information Filtering for Social Media

Personalized and Adaptive Semantic Information Filtering for Social Media

Pavan Kapanipathi, PhD Candidate

Kno.e.sis Center, Wright State University

Committee: Drs. Amit Sheth (Advisor), Krishnaprasad Thirunarayan, Derek Doran, and Prateek Jain

Ohio Center of Excellence in Knowledge-Enabled Computing

Page 2: Personalized and Adaptive Semantic Information Filtering for Social Media

Social Media

2

Introduction

Page 6: Personalized and Adaptive Semantic Information Filtering for Social Media

Information Consumption on Social Media

• Updates of Friends and Acquaintances
• News [1]
– 86% of Twitter users surveyed
• Medical Information [2]
– 1 in 3 use social media
• Disaster Management [3]
– 20 million tweets on Hurricane Sandy
– Most crisis management agencies monitor social media

6 Introduction

Page 7: Personalized and Adaptive Semantic Information Filtering for Social Media

Information Overload on Social Media

• Users often complain of getting overwhelmed with the information on social media

• 5 billion posts per day

– Real-time information

• 1000+ in my social network

7

“...a wealth of information creates a poverty of attention...” Herbert A. Simon

Introduction

Page 8: Personalized and Adaptive Semantic Information Filtering for Social Media

Need for Information Filtering

• Scenario

– Address information overload

– Enormous data stream has to be filtered

• Information Filtering Systems

– Emails, News, and Blogs

– Functionality

• Understand user interests

• Deliver relevant information

8

Introduction

Page 10: Personalized and Adaptive Semantic Information Filtering for Social Media

Traditional Information Filtering

[Figure: traditional information filtering architecture. Components: User Generated Content, User Interest Identification/User Modeling, Streaming Data, Filtering Module, Filtered Data. Example interest labels: NBA, Basketball, Sports; Relevance: 0.9.]

Hanani, Uri, Bracha Shapira, and Peretz Shoval. "Information filtering: Overview of issues, research and systems." User Modeling and User-Adapted Interaction 11.3 (2001): 203-259.

Introduction

Page 11: Personalized and Adaptive Semantic Information Filtering for Social Media

Challenges 1. Lack of Context

• Lack of context for processing short-text
– Short-text: the average length of social media posts (Facebook, Twitter, Google+, etc.) is 100-160 characters
• Identifying topics from short-text is important
– We can infer the author’s interest and deliver the tweet to users interested in the topic
– Traditional techniques have been shown not to perform well on social media [Sriram 2010, Derczynski 2013]

Example tweet: "Great day for Chicago sports as well as Cubs beat the Reds, Sox beat the Mariners with Humber’s perfect game."

Page 12: Personalized and Adaptive Semantic Information Filtering for Social Media

Introduction

Challenges 2. Continuously Changing Vocabulary

• Social media is a real-time platform with information about the latest activities in the real world
• Hurricane Sandy
– Mitigation, preparedness, recovery, and response phases
– #Frankenstorm and #Sandy at the start, #StaySafe and #RedCross during the disaster, and #ThanksSandy and #RestoreTheShore after the hurricane
• Indian Elections
– The announcement of prime ministerial candidates, issues regarding corruption, and polls in different states
– #modikisarkar, #NaMo, #VoteForRG, and #CongBJPQuitIndia

[Figure labels: Civil Unrest, Election, Natural Disaster]

Page 13: Personalized and Adaptive Semantic Information Filtering for Social Media

Challenges 3. Scalability

• Practical aspects of the filtering system

• Popularity of social media is increasing

– Facebook has more than 1 billion users

– Twitter has more than 500 million users

• Disseminate information to a huge set of users

– Centralized dissemination systems overload either the server (push model) or the client (pull model)

13 Introduction

Page 14: Personalized and Adaptive Semantic Information Filtering for Social Media

Introduction

Knowledge Bases

• A common theme across the methodologies developed is the use of background knowledge and Semantic Web technologies.
• Knowledge bases supply the background knowledge needed to process short-text

“If a program is to perform a complex task well, it must know a great deal about the world in which it operates.” Lenat & Feigenbaum

Example tweet: "Great day for Chicago sports as well as Cubs beat the Reds, Sox beat the Mariners with Humber’s perfect game."

[Figure: knowledge-base concepts related to the tweet: Baseball, Jason Heyward, Kris Bryant, Chicago Cubs, Sports]

Page 15: Personalized and Adaptive Semantic Information Filtering for Social Media

Wikipedia as a Knowledge Base

• Requirements for a Knowledge base to be used for filtering social data

– Diversity and Comprehensiveness: Large set of diverse users on social media such as Twitter and Facebook

– Real-time updates: Social media is a real-time platform that discusses dynamic topics

• Wikipedia as the Knowledge base

– Semi structured – Extract the structure

– Diverse: Collaborative effort of 80,000 users with 5 million articles

– Near real-time updates with unbiased views on topics [Ferron 2011]

15 Introduction

Page 16: Personalized and Adaptive Semantic Information Filtering for Social Media

Thesis Statement

16

To build an effective information filtering system, background knowledge and Semantic Web technologies can be used to address the lack of context, dynamically changing vocabulary, and scalability challenges introduced by social media’s short-text and real-time nature.

Introduction

Page 17: Personalized and Adaptive Semantic Information Filtering for Social Media

Outline

• Short-Text: Lack of context for processing
– Hierarchical Interest Graphs
– Built a hierarchical context for tweets leveraging the Wikipedia category structure. This hierarchical context is utilized for user modeling and recommendations.
– Publications [ESWC 2014, WWWCOMP 2014, TR-JRNL 2016]
• Real-time and dynamic nature: Continuously changing vocabulary
– A novel methodology that utilizes the evolving Wikipedia hyperlink structure to detect topic-relevant hashtags for continuous filtering
– Publications [TR-CNF 2016, ESWC 2015]
• Popularity: Scalability
– Scalable distributed dissemination system that utilizes Semantic Web technologies.
– Publications [ISWC 2011, SPIM 2011, ISWCDEM 2011]

17 Introduction


Page 19: Personalized and Adaptive Semantic Information Filtering for Social Media

Processing Short-text for User Interest Identification

• User generated content is processed to understand user interests and to filter content
– Tweets are used for these experiments
• Wikipedia category structure comprises taxonomical information that can be leveraged
– Build context for short text for user interest identification

Example tweet (figure label: Baseball): "Great day for Chicago sports as well as Cubs beat the Reds, Sox beat the Mariners with Humber’s perfect game."

“You are what you share” Charles W. Leadbeater

Lack of context

ESWC 2014

Page 20: Personalized and Adaptive Semantic Information Filtering for Social Media

Content Based User Interests Identification from Social Data

[Figure: approaches arranged along a Semantics axis: Term Frequency Based Techniques, Lower Dim Space as latent semantics, Entity Based Techniques, Knowledge Enabled Approaches. Citations shown: [Tao 2012], [Ramage 2010], [Yan 2012].]

Example tweets:
"Great day for Chicago sports as well as Cubs beat the Reds, Sox beat the Mariners with Humber’s perfect game."
"Not sure who the Reds will look too replace Dusty.some very interesting jobs open (Cubs, Mariners, Reds, poss Yanks) Girardi the domino sports"

Example profiles built from the tweets:

Term     Freq
great    1
day      1
sports   2
cubs     2

Dim      Dist
1dim     0.3
2dim     0.2
3dim     0.2
4dim     0.1
5dim     0.4

Wiki-Entities   Freq
Chicago Cubs    2
Cinci Reds      2
White Sox       1
NY Yankees      1

Lack of context

ESWC 2014

Page 21: Personalized and Adaptive Semantic Information Filtering for Social Media

Implicit Information from Social Data

Example tweets:
"Great day for Chicago sports as well as Cubs beat the Reds, Sox beat the Mariners with Humber’s perfect game."
"Not sure who the Reds will look too replace Dusty.some very interesting jobs open (Cubs, Mariners, Reds, poss Yanks) Girardi the domino"

[Figure: Broader Related Interests. Nodes shown: Major League Baseball, Major League Baseball Teams, Baseball, Baseball Organizations, San Francisco Giants, Oakland Athletics.]

Lack of context

ESWC 2014

Page 22: Personalized and Adaptive Semantic Information Filtering for Social Media

Methodology: Structured Hierarchical Knowledge

[Figure: Broader Related Interests from Wikipedia Category Structure. Input: the two example baseball tweets. Entity nodes: Seattle Mariners, White Sox, Cincinnati Reds, Chicago Cubs (entity scores shown: 0.6, 1.0, 0.3, 0.3). Category nodes: Major League Baseball Teams, Major League Baseball, Baseball.]

Transformed Wikipedia Category Structure to a Wikipedia Hierarchy

Lack of context

ESWC 2014

Page 23: Personalized and Adaptive Semantic Information Filtering for Social Media

Methodology: Scoring the Inferred Hierarchical Knowledge

[Figure: Spreading Activation. Input: the two example baseball tweets. Entity nodes: Seattle Mariners, White Sox, Cincinnati Reds, Chicago Cubs (entity scores shown: 0.6, 1.0, 0.3, 0.3). Category nodes: Major League Baseball Teams, Major League Baseball, Baseball (category scores shown: 0.5, 0.4, 0.1).]

Lack of context

ESWC 2014

Page 24: Personalized and Adaptive Semantic Information Filtering for Social Media

Designing an Activation Function

• Design parameters to adapt to the structure of the Wikipedia Hierarchy
– Uneven distribution of nodes in the hierarchy
• 16 hierarchical levels; most categories lie between hierarchical levels 5 and 9
• Raw Normalization: $F_i = \frac{1}{nodes(i+1)}$
• Log Normalization: $FL_i = \frac{1}{\log_{10} nodes(i+1)}$
– Many-to-many category-subcategory relationships
• Boston Red Sox – Major League Baseball Teams, 1901 Establishments in Massachusetts
• Preferential Path Constraint: $P_{ji} = \frac{1}{priority_{ji}}$
– Boosting common ancestors
• The more entities that activate a concept, the greater its importance
• Intersect Booster: $B_i = \frac{Ne_i}{Ne_{ic_{max}}}$

24 Lack of context

ESWC 2014

Page 25: Personalized and Adaptive Semantic Information Filtering for Social Media

Activation Functions

• Bell (Raw Normalization)

$A_j = \sum_{i=0}^{n} A_i \times F_j$

• Bell Log (Log Normalization)

$A_j = \sum_{i=0}^{n} A_i \times FL_j$

• Priority Intersect (Log Normalization, Preferential Path, Intersect Booster)

$A_j = \sum_{i=0}^{n} A_i \times FL_j \times P_{ji} \times B_j$

where i is a child node, j is the category, and $A_i$ is the activated value of i.
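The Bell and Bell Log functions can be illustrated with a small sketch. It is not the author's implementation: the hierarchy fragment, the `NODE_COUNT` values, and the helper names (`raw_factor`, `log_factor`, `spread`) are assumptions made for illustration, and the Preferential Path and Intersect Booster terms are left out.

```python
import math

# Toy fragment of a category hierarchy (child -> parent), named after the slide's example.
PARENT = {
    "Chicago Cubs": "Major League Baseball Teams",
    "Cincinnati Reds": "Major League Baseball Teams",
    "Major League Baseball Teams": "Major League Baseball",
    "Major League Baseball": "Baseball",
}
# Categories listed bottom-up so children are activated before their parents.
BOTTOM_UP = ["Major League Baseball Teams", "Major League Baseball", "Baseball"]
# Hypothetical node counts that feed each category's normalization factor.
NODE_COUNT = {
    "Major League Baseball Teams": 100,
    "Major League Baseball": 50,
    "Baseball": 10,
}


def raw_factor(category):
    """Raw normalization (Bell): factor = 1 / nodes."""
    return 1.0 / NODE_COUNT[category]


def log_factor(category):
    """Log normalization (Bell Log): factor = 1 / log10(nodes)."""
    return 1.0 / math.log10(NODE_COUNT[category])


def spread(entity_scores, factor):
    """Propagate scores upward: A_j = (sum of child activations A_i) * factor(j)."""
    activation = dict(entity_scores)
    for category in BOTTOM_UP:
        children = [c for c, p in PARENT.items() if p == category]
        total = sum(activation.get(c, 0.0) for c in children)
        activation[category] = total * factor(category)
    return activation


entity_scores = {"Chicago Cubs": 1.0, "Cincinnati Reds": 0.6}
bell = spread(entity_scores, raw_factor)
bell_log = spread(entity_scores, log_factor)
```

With these toy counts, raw normalization shrinks scores much faster than log normalization as activation moves up the hierarchy, which is the kind of difference the two normalizations are meant to control.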

Lack of context

ESWC 2014

Page 26: Personalized and Adaptive Semantic Information Filtering for Social Media

Hierarchical Interest Graph

[Figure: Activation Functions (BELL, BELL LOG, PRIORITY INTERSECT) applied to the example hierarchy. Input: the two example baseball tweets. Entity nodes: Seattle Mariners, White Sox, Cincinnati Reds, Chicago Cubs (entity scores shown: 0.6, 1.0, 0.3, 0.3). Category nodes: Major League Baseball Teams, Major League Baseball, Baseball (category scores shown: 0.5, 0.4, 0.1).]

Lack of context

ESWC 2014

Page 27: Personalized and Adaptive Semantic Information Filtering for Social Media

Hierarchical Interest Graph Evaluation – User Study

Users   Tweets   Entities   Distinct Entities   Categories in HIG
37      31,927   29,146     13,150              111,535

[Figure: Users Tweets Distribution]

Lack of context

ESWC 2014

Page 28: Personalized and Adaptive Semantic Information Filtering for Social Media

Evaluation Results of Hierarchical Interests

Graded Precision

k    Relevant                            Irrelevant                          Maybe
     Bell  Bell Log  Priority Intersect  Bell  Bell Log  Priority Intersect  Bell  Bell Log  Priority Intersect
10   0.53  0.67      0.76                0.34  0.23      0.16                0.13  0.10      0.08
20   0.54  0.66      0.72                0.34  0.22      0.19                0.12  0.12      0.09
30   0.53  0.64      0.69                0.34  0.24      0.21                0.13  0.12      0.10
40   0.52  0.61      0.68                0.35  0.26      0.22                0.13  0.13      0.10
50   0.52  0.61      0.67                0.36  0.28      0.24                0.12  0.11      0.09

Mean Average Precision

k    Bell  Bell Log  Priority Intersect
10   0.64  0.72      0.88
20   0.61  0.70      0.82
30   0.59  0.69      0.79
40   0.58  0.68      0.77
50   0.57  0.67      0.75

(In the original slide, numbers in bold portray better performance.)

Lack of context

ESWC 2014

Page 29: Personalized and Adaptive Semantic Information Filtering for Social Media

Implicit Interests Evaluation

• Implicit interests are categories of interest that were not explicitly mentioned in tweets but inferred from the knowledge-base

Category: Major League Baseball
Explicit: "On this day in 1934, Major League Baseball announced it would host its first night games"
Implicit: "Great day for Chicago sports as well as Cubs beat the Reds, Sox beat the Mariners with Humber’s perfect game, Bulls win and Hawks stay alive"

Lack of context

ESWC 2014

Page 30: Personalized and Adaptive Semantic Information Filtering for Social Media

Summary Hierarchical Interest Graphs

• Addressed the “Lack of Context” challenge in tweets using a hierarchical knowledge base.
– More than 70% of hierarchical interests are implicit.
• A new way to represent Twitter user interests
– Hierarchical Interest Graph with interest scores at each node
– Activation functions (models) to determine interest scores

What’s the use?

30 Lack of context

ESWC 2014

Page 31: Personalized and Adaptive Semantic Information Filtering for Social Media

HIG-based Tweet Recommendation Approach

[Figure: an incoming tweet and a user profile are each represented as category scores and compared using Pearson Correlation to decide: Recommend Y/N?]

Incoming Tweet: Semantic Web: 0.2, World Wide Web: 0.09, Ontology: 0.7, Technology: 0.01, Semantic Search: 0.3
User profile (HIG): World Wide Web: 0.9, Technology: 0.7, Sports: 0.6, Baseball: 0.4, India: 0.2, United States: 0.2, Semantic Web: 0.2
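The comparison step can be sketched as follows. This is an illustrative reading rather than the author's code: treating categories missing from one vector as 0.0 and reading the first score list as the tweet and the second as the user profile are assumptions, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length score lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        return 0.0
    return cov / (dev_x * dev_y)


def tweet_user_score(tweet, user):
    """Align both category->score maps on the union of categories, then correlate."""
    categories = sorted(set(tweet) | set(user))
    return pearson(
        [tweet.get(c, 0.0) for c in categories],  # assumption: missing category scores 0.0
        [user.get(c, 0.0) for c in categories],
    )


# Score lists as shown on the slide.
tweet = {"Semantic Web": 0.2, "World Wide Web": 0.09, "Ontology": 0.7,
         "Technology": 0.01, "Semantic Search": 0.3}
user = {"World Wide Web": 0.9, "Technology": 0.7, "Sports": 0.6, "Baseball": 0.4,
        "India": 0.2, "United States": 0.2, "Semantic Web": 0.2}
score = tweet_user_score(tweet, user)
```

A recommend/skip decision would then compare `score` against a threshold or rank it among other candidate tweets; the slides do not state a threshold.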

Lack of context

TR-JRNL 2016

Page 32: Personalized and Adaptive Semantic Information Filtering for Social Media

Content-based Tweet Recommendation Approaches

• Term Frequency based approaches
– User profiles: Built on scoring important terms
– TF, TF-IDF
• Entity Frequency [Tao 2012]
– User profiles: Built on scoring important entities
– Wikipedia Entities
– Extracted using Zemanta
• Support Vector Machines (SVMrank) [Duan 2010]
– User models built using content and tweet based features
– Tweet content features: similarity to the user’s tweets, similarity of hashtags, tweet length, mention of URLs, mention of hashtags.
• Latent Dirichlet Allocation [Ramage 2010]
– User profiles: Distribution of 5 latent topics.

32 Lack of context

TR-JRNL 2016

Page 33: Personalized and Adaptive Semantic Information Filtering for Social Media

Experimental Setup

• Utilized the same dataset from the user study

Users   Tweets   Entities
37      31,927   29,146

• Training and testing datasets using two assumptions
– Tweets that users share are interesting to them and can be recommended (UGC Assumption)
• 80% to create user profiles
• 20% (~6,000) to test recommendation
– Retweets of users are interesting to them and can be recommended (Retweet Assumption, which is more popular in the literature)
• 30% (~9,000) were retweets, hence used to test recommendation
• 70% to create user profiles

Lack of context

TR-JRNL 2016

Page 34: Personalized and Adaptive Semantic Information Filtering for Social Media

Evaluation Methodology

• Transformed to a top-N recommendation evaluation
– Popular top-N evaluation methodology by Cremonesi et al. [Cremonesi 2010] for Precision/Recall
• Methodology
– For every test tweet, pick 1000 random tweets not tweeted/retweeted by the author of the test tweet
• Random tweets are considered to be irrelevant to the user
– Score and rank the test tweet with the 1000 random tweets using the recommendation algorithm
• TF, TFIDF, Entity-based, SVMrank, LDA, and HIG
– If the test tweet is within the top-N, it is considered a hit; otherwise it is not (T is the total number of test tweets)

$recall = \frac{hits}{T}$
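The hit counting behind this recall figure can be sketched in a few lines. The sketch assumes that scores have already been computed by some recommender, and that ties are resolved in favor of the test tweet; neither detail is specified in the slides, and the function names are hypothetical.

```python
def is_hit(test_score, random_scores, n):
    """A test tweet is a hit if it ranks within the top-n among the random tweets."""
    # Rank 1 means no random tweet scored strictly higher (ties favor the test tweet).
    rank = 1 + sum(1 for s in random_scores if s > test_score)
    return rank <= n


def recall_at_n(cases, n):
    """recall = hits / T, where T is the number of test tweets."""
    hits = sum(1 for test_score, random_scores in cases if is_hit(test_score, random_scores, n))
    return hits / len(cases)


# Each case pairs a test tweet's score with the scores of its random tweets
# (only three random tweets here instead of 1000, to keep the example small).
cases = [
    (0.9, [0.1, 0.5, 0.95]),  # rank 2
    (0.2, [0.3, 0.4, 0.5]),   # rank 4
    (0.8, [0.1, 0.2, 0.3]),   # rank 1
]
recall_top2 = recall_at_n(cases, 2)
recall_top1 = recall_at_n(cases, 1)
```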

34

Lack of context

TR-JRNL 2016

Page 35: Personalized and Adaptive Semantic Information Filtering for Social Media

Retweet Assumption Evaluation Results

• Term frequency performs the best for recommending retweets [Ramage et al 2010]

35 Lack of context

TR-JRNL 2016

Page 36: Personalized and Adaptive Semantic Information Filtering for Social Media

UGC Assumption Evaluation Results

• HIG performed better for most top-N but at Top-20 TF-based approaches performed better.

36 Lack of context

TR-JRNL 2016

Page 37: Personalized and Adaptive Semantic Information Filtering for Social Media

Lack of context

Content + Knowledge based Approach

• TF performed the best among the content based approaches
• Merged TF and HIG, which augments content with knowledge bases, and recommended using Pearson Correlation

TF: world: 3, great: 10, cricket: 24, slim: 13, good: 40, united: 34, states: 30
HIG: World Wide Web: 0.4, Technology: 0.007, Sports: 0.06, Baseball: 0.34, India: 0.102, United States: 0.2, Semantic Web: 0.2

TF (normalized): world: 0.075, great: 0.25, cricket: 0.6, slim: 0.325, good: 1, united: 0.85, states: 0.75
HIG (normalized): World Wide Web: 1, Technology: 0.017, Sports: 0.15, Baseball: 0.85, India: 0.25, United States: 0.5, Semantic Web: 0.5

Merged: world: 0.075, great: 0.25, cricket: 0.6, slim: 0.325, good: 1, united: 0.85, states: 0.75, World Wide Web: 1, Technology: 0.017, Sports: 0.15, Baseball: 0.85, India: 0.25, United States: 0.5, Semantic Web: 0.5
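The normalized values on this slide are consistent with dividing each profile by its maximum score before merging. A minimal sketch under that assumption, with hypothetical function names:

```python
def max_normalize(profile):
    """Scale scores so the largest value in the profile becomes 1."""
    top = max(profile.values())
    return {key: value / top for key, value in profile.items()}


def merge_profiles(tf_profile, hig_profile):
    """Normalize each profile separately, then combine them into one vector."""
    merged = max_normalize(tf_profile)
    # Assumption: term keys and category keys do not collide; a collision would be overwritten.
    merged.update(max_normalize(hig_profile))
    return merged


# Raw scores as shown on the slide.
tf = {"world": 3, "great": 10, "cricket": 24, "slim": 13,
      "good": 40, "united": 34, "states": 30}
hig = {"World Wide Web": 0.4, "Technology": 0.007, "Sports": 0.06, "Baseball": 0.34,
       "India": 0.102, "United States": 0.2, "Semantic Web": 0.2}
merged = merge_profiles(tf, hig)
```

Normalizing each profile on its own keeps raw term counts (up to 40 here) from dominating category scores that are all below 1.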

TR-JRNL 2016

Page 38: Personalized and Adaptive Semantic Information Filtering for Social Media

Retweet Assumption Evaluation Results

• TF + HIG performs the best and provides an improvement of more than 40% at top-20

38 Lack of context

TR-JRNL 2016

Page 39: Personalized and Adaptive Semantic Information Filtering for Social Media

UGC Assumption Evaluation Results

• TF + HIG performs the best and provides an improvement of more than 20% at top-20

39 Lack of context

TR-JRNL 2016

Page 40: Personalized and Adaptive Semantic Information Filtering for Social Media

Summary Hierarchical Interest Graphs

• A new way to represent Twitter user Interests

– Hierarchical Interest Graphs

• Addressed the “Lack of Context” challenge in tweets using hierarchical knowledge base.

• HIG (knowledge base) augments content to provide superior performance for tweet recommendation.

40 Lack of context

TR-JRNL 2016

Page 41: Personalized and Adaptive Semantic Information Filtering for Social Media

Outline

• Short-Text: Lack of context for processing
– Augmented content with hierarchical knowledge from Wikipedia
• 70% of the top-50 interests were implicit (not mentioned in users’ tweets)
• Improved content based tweet recommendation by more than 40%.
• Real-time and Dynamic Nature: Continuously Changing Vocabulary
– A novel methodology that utilizes the evolving Wikipedia hyperlink structure to update filters for streaming topic-relevant information
• Popularity: Scalability
– Scalable distributed dissemination system that utilizes Semantic Web technologies.

41 Lack of context


Page 43: Personalized and Adaptive Semantic Information Filtering for Social Media

Social media: Real-time and Dynamic Platform

• Dynamic topics of interest that continuously evolve over time
– Indian Elections
• The announcement of prime ministerial candidates, issues regarding corruption, and polls in different states
– Hurricane Sandy
• Mitigation, preparedness, recovery, and response phases

[Figure panels: Indian Election, Hurricane Sandy]

Dynamic vocabulary

TR-CNF 2016

Page 44: Personalized and Adaptive Semantic Information Filtering for Social Media

Filtering Dynamic Topics on Social Media

• Keyword-based filtering
– Twitter streaming API
• Keywords change dynamically based on happenings in the real world
– It is necessary to track these keywords to stay up to date on the topic of interest

Example hashtags:
– Indian Election: #indianelection, #modikisarkar, #NaMo, #VoteForRG, and #CongBJPQuitIndia
– Hurricane Sandy: #sandy, #Frankenstorm, #Sandy, #RedCross, #RestoreTheShore

Dynamic vocabulary

TR-CNF 2016

Page 45: Personalized and Adaptive Semantic Information Filtering for Social Media

Hindsight Analysis of Topic-relevant Hashtags

• Topic-relevant hashtags that can be used to crawl all the tweets co-occur with each other
• Analysis with over 6 million tweets: (1) Colorado Shooting (2) Occupy Wall Street
• <1% of the topic-relevant hashtags can crawl up to 85% of the tweets

Dynamic vocabulary

TR-CNF 2016

Page 46: Personalized and Adaptive Semantic Information Filtering for Social Media

Approach for Detecting Topic-Relevant Hashtags

[Figure: pipeline for detecting topic-relevant hashtags. Recoverable elements:]
• Manually started filter: #indianelection2014
• Co-occurring hashtags (Threshold δ): #modikisarkar
• Entity Extraction and Scoring over the latest K (200, 500) tweets, with Normalized Frequency Scoring
– Example scores: Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2
• Dynamically Updated Background Knowledge: Wikipedia page "Indian General Election, 2014"
– Extract and periodically update the hyperlink structure
– One hop from the topic page
– Entity scoring based on relevance to the event
– Example scores: Indian General Elec: 1.0, India: 0.9, Elections: 0.7, UPA: 0.6, BJP: 0.3, NDA: 0.3, Narendra Modi: 0.3
• Similarity Check
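The similarity check compares two scored entity sets. The sketch below shows cosine and weighted Jaccard in their standard forms, plus one possible asymmetric overlap. The asymmetric function is an illustrative assumption and not necessarily the subsumption measure evaluated in the thesis; assigning the two score lists to "hashtag" and "topic" is likewise an interpretation of the slide.

```python
import math


def cosine(a, b):
    """Symmetric cosine similarity between two entity->score maps."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def weighted_jaccard(a, b):
    """Symmetric weighted Jaccard: sum of minima over sum of maxima."""
    keys = a.keys() | b.keys()
    low = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    high = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return low / high if high else 0.0


def asymmetric_overlap(a, b):
    """Hypothetical asymmetric measure: share of a's total weight that b also covers."""
    shared = sum(min(a[k], b[k]) for k in a.keys() & b.keys())
    total = sum(a.values())
    return shared / total if total else 0.0


# Example scores as shown on the slide.
hashtag = {"Narendra Modi": 0.9, "BJP": 0.7, "NDA": 0.6, "India": 0.4,
           "Elections": 0.2, "Rahul Gandhi": 0.2, "Congress": 0.2}
topic = {"Indian General Elec": 1.0, "India": 0.9, "Elections": 0.7, "UPA": 0.6,
         "BJP": 0.3, "NDA": 0.3, "Narendra Modi": 0.3}

hashtag_in_topic = asymmetric_overlap(hashtag, topic)
topic_in_hashtag = asymmetric_overlap(topic, hashtag)
```

An asymmetric measure gives different answers depending on direction, which matters when a narrow hashtag is compared with a broad topic page.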

Dynamic vocabulary

TR-CNF 2016

Page 47: Personalized and Adaptive Semantic Information Filtering for Social Media

Evaluation

• Dataset – 2 dynamic topics
– 2012 U.S. Presidential Elections
– Hurricane Sandy
• δ – Top 25 co-occurring hashtags
– Manual annotation for relevance

Event              Tags            Tweets   Co-occ Tags (Distinct)   Wiki Entities
US Elections 2012  #election2012   4,855    12,361 (1,460)           614
Hurricane Sandy    #sandy          4,818    6,592 (837)              419

Event              Tags   Tweets (Distinct)   Relevant Tweets   Irrelevant Tweets   Entities
US Elections 2012  25     11,504 (10,084)     7,086             2,998               27,558 (4,255)
Hurricane Sandy    25     4,905 (4,850)       2,691             2,159               10,719 (2,359)
Total              50     15,409 (14,934)     9,777                                 38,219

Dynamic vocabulary

TR-CNF 2016

Page 48: Personalized and Adaptive Semantic Information Filtering for Social Media

Evaluation Results

          Hurricane Sandy                               2012 U.S. Presidential Elections
          Subsumption  Cosine  Jaccard  Co-occurrence   Subsumption  Cosine  Jaccard  Co-occurrence
NDCG@10   0.93         0.86    0.85     0.65            0.91         0.85    0.89     0.83
NDCG@20   0.97         0.93    0.92     0.89            0.98         0.95    0.97     0.94

[Figure panels: NDCG, MAP]
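For readers unfamiliar with NDCG, a minimal sketch of one common formulation follows. The slides do not say which gain or discount variant was used, so the linear gain and log2 discount here are assumptions, and the relevance lists are made-up examples.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering of the same list."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance judgments for ranked hashtags (1 = relevant, 0 = irrelevant).
perfect_ranking = ndcg([1, 1, 0], 3)
relevant_item_second = ndcg([0, 1], 2)
```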

Dynamic vocabulary

TR-CNF 2016

Page 49: Personalized and Adaptive Semantic Information Filtering for Social Media

Dynamic Hashtag Filter

• Hashtag analysis
– Co-occurrence technique can be used to detect event-relevant hashtags
– More popular hashtags are easier to detect via co-occurrence
• Continuously changing vocabulary for dynamic topics and coverage
– Wikipedia as a dynamic knowledge-base for events
– Determining relevant hashtags using an asymmetric similarity measure
– More hashtags in turn increase the coverage of tweets for events
• Content-based location prediction of Twitter users (ESWC 2015)
– A similar relevancy-detection framework was used for location prediction

49 Dynamic vocabulary

TR-CNF 2016

Page 50: Personalized and Adaptive Semantic Information Filtering for Social Media

Outline

• Short-Text: Lack of context for processing
– Augmented content with hierarchical knowledge from Wikipedia
• 70% of the top-50 interests were implicit (not mentioned in users’ tweets)
• Improved content based tweet recommendation by more than 40%.
• Real-time and Dynamic Nature: Continuously Changing Vocabulary
– Hindsight analysis insight: co-occurrence can be used as a starting point
– Utilized Wikipedia as an evolving knowledge base for dynamic topics
• The top-5 detected hashtags increased the coverage by more than 3,500 tweets instantly, with a mean average precision of 0.92
• Popularity: Scalability
– Scalable distributed dissemination system that utilizes Semantic Web technologies.

50 Dynamic vocabulary


Page 52: Personalized and Adaptive Semantic Information Filtering for Social Media

Content Dissemination

• Centralized content dissemination suffers from scalability issues
– Either the server (publisher) or the client (subscriber) is overwhelmed
– Server for Push and Client for Pull
• Distributed dissemination protocol
– PubSubHubbub
• Introduced by Google in 2009
• 117 million users and 5.5 billion posts broadcast by 2011

52 Scalability

ISWC 2011

Page 53: Personalized and Adaptive Semantic Information Filtering for Social Media

• PubSubHubbub
– Simple, open, web-hook based pubsub protocol
– Extension to RSS, Atom.

[Figure: PubSubHubbub message flow between Publisher, Hub, and Subscribers]
1. Publisher to Hub: "I have new content for feed X"
2. Hub to Publisher: "Give me the latest content for feed X"
3. Publisher to Hub: "Here it is"
4. Hub to each Subscriber: "Here is the latest content for feed X"

Scalability

ISWC 2011

Page 54: Personalized and Adaptive Semantic Information Filtering for Social Media

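PubSubHubbub Protocol Extension

[Figure: message flow between the publisher (Pub), the Semantic Hub, and subscribers Sub - A, Sub - B, Sub - C, Sub - D]
1. Pub to Semantic Hub: "Hey I have new content for feed" (with topics/preference)
2. Semantic Hub to Pub: "Give me the new content"
3. Pub to Semantic Hub: "Here it is"
4. Semantic Hub to Social Graph and User Profiles: "Get the subscribers of Pub whose profile matches topic/preference"
5. Semantic Hub to matching subscribers: "Here is the new content of feed X"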
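The filtering idea behind the extended hub can be sketched as an in-memory toy. This is not the Semantic Hub implementation: the class, its methods, and the set-overlap matching rule are hypothetical, and the ping/fetch exchange with the publisher is collapsed into a single `publish` call.

```python
class ToySemanticHub:
    """In-memory stand-in for a hub that delivers only to subscribers whose profile matches."""

    def __init__(self):
        self.subscribers = {}  # publisher -> set of subscriber names
        self.interests = {}    # subscriber -> set of topics of interest
        self.inbox = {}        # subscriber -> list of delivered items

    def subscribe(self, subscriber, publisher, interests):
        self.subscribers.setdefault(publisher, set()).add(subscriber)
        self.interests.setdefault(subscriber, set()).update(interests)
        self.inbox.setdefault(subscriber, [])

    def publish(self, publisher, content, topics):
        """Deliver content to the publisher's subscribers that share at least one topic."""
        topics = set(topics)
        matched = sorted(
            s for s in self.subscribers.get(publisher, set())
            if self.interests[s] & topics  # assumption: any shared topic counts as a match
        )
        for subscriber in matched:
            self.inbox[subscriber].append(content)
        return matched


hub = ToySemanticHub()
hub.subscribe("Sub-A", "Pub", {"Chicago_Cubs"})
hub.subscribe("Sub-B", "Pub", {"India"})
delivered_to = hub.publish("Pub", "Cubs beat the Reds", {"Chicago", "Chicago_Cubs"})
```

Only Sub-A receives the post here, so the filtering work sits in the hub rather than with the publisher or the subscribers.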

Scalability

ISWC 2011

Page 55: Personalized and Adaptive Semantic Information Filtering for Social Media

Publisher – Social Data Annotation

• Preliminary processing of text for filtering
– Information extraction (entities, hashtags, urls, etc.)
• Representing as RDF using the vocabulary used by SMOB
– Comprises
• SPARQL queries representing the subset of subscribers from the Social Graph in the hub

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sioc: <http://rdfs.org/sioc/ns#> .
@prefix sioct: <http://rdfs.org/sioc/types#> .
@prefix moat: <http://moat-project.org/ns#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .

<http://twitter.com/rob/statuses/123456789>
    rdf:type sioct:MicroblogPost ;
    sioc:content "Great day for Chicago sports as well as Cubs beat the Reds, Sox beat the Mariners with Humber’s perfect game #chicago" ;
    sioc:has_creator <http://example.com/rob> ;
    moat:taggedWith dbpedia:Chicago ;
    moat:taggedWith dbpedia:Chicago_Cubs ;
    moat:taggedWith dbpedia:Cincinnati_Reds ;
    sioc:topic <http://example.com/tags/chicago> .

ISWC 2011

Page 56: Personalized and Adaptive Semantic Information Filtering for Social Media

Semantic Hub

• Performs the matching of processed posts to user profiles
– Flexible to different matching techniques
• Pearson correlation or other similarity measures
• Delivers information to relevant subscribers.

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbpedia: <http://dbpedia.org/resource/>

SELECT ?user WHERE {
  { ?user foaf:interest dbpedia:Chicago } UNION
  { ?user foaf:interest dbpedia:Chicago_Cubs } UNION
  { ?user foaf:interest dbpedia:Cincinnati_Reds }
}
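A query of this shape could be generated from the entities annotated on a post. The helper below is a hypothetical sketch, not part of the Semantic Hub code: it only assembles the query text, assumes the `foaf:` and `dbpedia:` prefixes are declared, and does not escape entity names.

```python
def build_subscriber_query(entities):
    """Build a SPARQL query selecting users interested in any of the given DBpedia entities."""
    if not entities:
        raise ValueError("at least one entity is required")
    # One group graph pattern per entity, joined with UNION.
    # Assumption: each name is already a valid local name for the dbpedia: prefix.
    patterns = " UNION\n".join(
        "  { ?user foaf:interest dbpedia:%s }" % entity for entity in entities
    )
    return (
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        "PREFIX dbpedia: <http://dbpedia.org/resource/>\n\n"
        "SELECT ?user WHERE {\n" + patterns + "\n}"
    )


query = build_subscriber_query(["Chicago", "Chicago_Cubs", "Cincinnati_Reds"])
```

The resulting string can be handed to any SPARQL endpoint or library that holds the subscribers' `foaf:interest` triples.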

ISWC 2011

Page 57: Personalized and Adaptive Semantic Information Filtering for Social Media

Semantic Hub: Conclusion

• Framework for distributed dissemination of content using PubSubHubbub
– Hub takes the load of the filtering module and dissemination of content
• PubSubHubbub
– 117 million subscriptions by 2011
– 5.5 billion unique feeds by 2011

• Semantic Hub – Privacy-aware dissemination for distributed social networks

– Real-time filtering

57 Scalability

ISWC 2011

Page 58: Personalized and Adaptive Semantic Information Filtering for Social Media

Thesis Conclusion

• To build an effective information filtering system, background knowledge and Semantic Web technologies can be used to address the lack of context, dynamically changing vocabulary, and scalability challenges introduced by social media’s short-text and real-time nature.
– Augmented content with hierarchical knowledge from Wikipedia to improve context of short-text
• 70% of the top-50 interests were implicit (not mentioned in users’ tweets)
• Improved content based tweet recommendation by more than 40%.
– Utilized Wikipedia as an evolving knowledge base for dynamic topics to detect topic-descriptors for filtering
• Hindsight analysis insight: co-occurrence can be used as a starting point
• The top-5 detected hashtags increased the coverage by more than 3,500 tweets instantly, with a mean average precision of 0.92
– Extended PubSubHubbub, a distributed content dissemination protocol, with Semantic Web technologies for filtering and dissemination

Page 59: Personalized and Adaptive Semantic Information Filtering for Social Media

Graduate Journey

• Hierarchical Interest Graphs

– Internship work – IBM TJ Watson Research Center 2013

• Location Prediction of Twitter users

– Alleviates the dependence on training data

• Determining Twitter User Hobbies

– Internship work – Samsung Research America 2014 (Patent Pending)

• Tweet Filtering and Recommendation

– Addressing the problem of dynamic topic drift.

59

Conclusion

Page 60: Personalized and Adaptive Semantic Information Filtering for Social Media

Conclusion

Graduate Journey

• Research Internships

– 2011 DERI, Ireland (ISWC 2011, SPIM 2011, WebSci 2011)

– 2013 IBM TJ Watson Research Center (WWWCOMP 2014, ESWC 2014)

– 2014 Samsung Research America (Patent Pending)

• Invited talks

– IBM TJ Watson Research Center, Frontiers of Cloud Computing and Big Data Workshop

– EMC CTO Office, Bangalore, Invited Speaker Series

– WSU Advisory Board

• Proposals and Projects

– Twitris – NSF Commercialization

– Ohio State University – NSF Hazards SEES ($2M)

– CITAR (Epidemiology) – NIH EdrugTrends ($1.6M)

• Development of Research Systems

– Twarql – A semantic tweet filtering system.

• Winner of Triplification Challenge (ISem2010)

– Scalable content dissemination on distributed social networks. (ISWC2011)

– Twitris – A social semantic web for analyzing events.

60

COLLABORATIONS

CITAR

Page 61: Personalized and Adaptive Semantic Information Filtering for Social Media

Publications

• [NOISE 2015] Raghava Mutharaju and Pavan Kapanipathi. Are We Really Standing on the Shoulders of Giants? 1st Workshop on Negative or Inconclusive Results in Semantic Web, ESWC, 2015.

• [KNOW 2015] Siva Kumar Cheekula, Pavan Kapanipathi, Derek Doran, Amit Sheth. Entity Recommendations Using Hierarchical Knowledge Bases. 4th International Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data, 2015.

• [ESWC 2015] Pavan Kapanipathi, Revathy Krishnamurthy (Joint first author), Amit Sheth, Krishnaprasad Thirunarayan. Knowledge Enabled Approach to Predict the Location of Twitter Users. In Extended Semantic Web Conference, 2015. (acceptance rate 23%).

• [ESWC 2014] Pavan Kapanipathi, Prateek Jain, Chitra Venkataramani, Amit Sheth. User Interests Identification on Twitter Using a Hierarchical Knowledge Base. In Extended Semantic Web Conference 2014, Crete Greece. (acceptance rate 23%)

• [WWWComp 2014] Pavan Kapanipathi, Prateek Jain, Chitra Venkataramani, Amit Sheth. Hierarchical Interest Graph from Twitter. 23rd International conference on World Wide Web companion 2014 (WWW companion 2014), Seoul, South Korea.

• [WI 2013] Fabrizio Orlandi, Pavan Kapanipathi, Alexandre Passant, Amit Sheth. Characterising concepts of interest leveraging Linked Data and the Social Web. The 2013 IEEE/WIC/ACM International Conference on Web Intelligence, Atlanta, USA, 2013.

• [SPIM 2011] Pavan Kapanipathi, Fabrizio Orlandi, Amit Sheth, Alexandre Passant. Personalized Filtering of the Twitter Stream. 2nd workshop on Semantic Personalized Information Management at ISWC 2011, September 2011.

• [ISWC 2011] Pavan Kapanipathi, Julia Anaya, Amit Sheth, Brett Slatkin, Alexandre Passant. Privacy-Aware and Scalable Content Dissemination in Distributed Social Networks. 10th International Semantic Web Conference 2011, Bonn, Germany, September 2011. (acceptance rate 22%)

61 Conclusion

Page 62: Personalized and Adaptive Semantic Information Filtering for Social Media

Conclusion

Publications

• [ISWCDEM 2011] Pavan Kapanipathi, Julia Anaya, Alexandre Passant. SemPuSH: Privacy-Aware and Scalable Broadcasting for Semantic Microblogging. 10th International Semantic Web Conference 2011.

• [FSWE 2011] Pavan Kapanipathi. SMOB: The Best of Both Worlds. Federated Social Web Europe Conference, Berlin, June 3-5, 2011.

• [WEBSCI 2011] Alexandre Passant, Owen Sacco, Julia Anaya, Pavan Kapanipathi. Privacy-By-Design in Federated Social Web Applications, Websci 2011, Koblenz, Germany June 14-17, 2011.

• [ISEM 2010] Pablo Mendes, Pavan Kapanipathi, Alexandre Passant. Twarql: Tapping into the Wisdom of the Crowd. Triplification Challenge 2010 at 6th International Conference on Semantic Systems (I-SEMANTICS).

• [WI 2010] Pablo Mendes, Alexandre Passant, Pavan Kapanipathi, Amit Sheth. Linked Open Social Signals. IEEE/WIC/ACM International Conference on Web Intelligence (WI-10).

• [WEBSCI 2010] Pablo Mendes, Pavan Kapanipathi, Delroy Cameron, Amit Sheth. Dynamic Associative Relationships on the Linked Open Data Web. In Proceedings of the WebSci10: Extending the Frontiers of Society On-Line.

• [TR-CNF 2016] Pavan Kapanipathi, Krishnaprasad Thirunarayan, Fabrizio Orlandi, Amit Sheth, Pascal Hitzler. A Real-Time #approach for Continuous Crawling of Events on Twitter by Leveraging Wikipedia. Technical Report.

• [TR-JRNL 2016] Pavan Kapanipathi, Siva Kumar, Derek Doran, Prateek Jain, Chitra Venkataramani, Amit Sheth. Hierarchical Knowledge Base enabled Twitter User Modeling and Recommendation. (Journal).

• [TR-CNFC 2016] Siva Kumar, Pavan Kapanipathi, Derek Doran, Prateek Jain, Amit Sheth. Exploring Taxonomical Interests for Entity Recommendations. Technical report, 2015.

• [TR-CNFC 2016] Sarasi Sarangi, Pavan Kapanipathi, Amit Sheth. Domain-specific Sub graph Generation. Technical report, 2015.

62

Page 63: Personalized and Adaptive Semantic Information Filtering for Social Media

Conclusion

References

• [1] How Do People Use Social Media for Business/Finance News? http://blog.marketwired.com/2013/11/12/how-do-people-use-social-media-for-businessfinance-news/

• [2] What is the role of social media in healthcare? http://worldofdtcmarketing.com/role-social-media-healthcare/social-media-and-healthcare/

• [3] Social media use during disaster management http://www.emergency-management-degree.org/crisis/

• [Tao 2012] Tao, K., Abel, F., Gao, Q., and Houben, G.-J. (2012a). Tums: Twitter-based user modeling service.

• [Ramage 2010] Ramage, D., Dumais, S., and Liebling, D. (2010). Characterizing microblogs with topic models. AAAI’ 10.

• [Yan 2012] Yan, R., Lapata, M., and Li, X. (2012). Tweet recommendation with graph co-ranking. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics.

• [Duan 2010] Duan, Y., Jiang, L., Qin, T., Zhou, M., and Shum, H.-Y. (2010). An empirical study on learning to rank of tweets. COLING ’10

• [Cremonesi 2010] Cremonesi, P., Koren, Y., and Turrin, R. (2010). Performance of recommender algorithms on top-n recommendation tasks. RecSys 2010

• [Sriram 2010] Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H., and Demirbas, M. (2010). Short text classification in twitter to improve information filtering. SIGIR ’10

• [Derczynski 2013] Derczynski, L., Maynard, D., Aswani, N., and Bontcheva, K. (2013). Microblog-genre noise and impact on semantic annotation accuracy. HT ’13

• [Ferron 2011] Ferron, M. and Massa, P. (2011). Collective memory building in wikipedia: the case of north african uprisings. WikiSys2011

63

Page 64: Personalized and Adaptive Semantic Information Filtering for Social Media

Acknowledgements

64 Conclusion