User Interests Identification From Twitter using Hierarchical Knowledge Base
Pavan Kapanipathi*, Prateek Jain^, Chitra Venkataramani^, Amit Sheth*
*Kno.e.sis Center, Wright State University; ^IBM TJ Watson Research Center
#eswc2014Kapanipathi


DESCRIPTION

Twitter, due to its massive growth as a social networking platform, has been in focus for the analysis of its user generated content for personalization and recommendation tasks. A common challenge across these tasks is identifying user interests from tweets. Semantic enrichment of Twitter posts, to determine user interests, has been an active area of research in the recent past. These approaches typically use available public knowledge-bases (such as Wikipedia) to spot entities and create entity-based user profiles. However, exploitation of such knowledge-bases to create richer user profiles is yet to be explored. In this work, we leverage hierarchical relationships present in knowledge-bases to infer user interests expressed as a Hierarchical Interest Graph. We argue that the hierarchical semantics of concepts can enhance existing systems to personalize or recommend items based on a varied level of conceptual abstractness. We demonstrate the effectiveness of our approach through a user study which shows an average of approximately eight of the top ten weighted hierarchical interests in the graph being relevant to a user's interests.


Page 1: User Interests Identification From Twitter using Hierarchical Knowledge Base

1

User Interests Identification From Twitter using Hierarchical Knowledge Base

Pavan Kapanipathi*, Prateek Jain^, Chitra Venkataramani^, Amit Sheth*

*Kno.e.sis Center, Wright State University; ^IBM TJ Watson Research Center

#eswc2014Kapanipathi

Page 2: User Interests Identification From Twitter using Hierarchical Knowledge Base

Agenda

Motivation
Background
Approach
Evaluation
Conclusion & Future Work

Page 3: User Interests Identification From Twitter using Hierarchical Knowledge Base

Agenda

Motivation
Approach
Evaluation
Conclusion & Future Work

Page 4: User Interests Identification From Twitter using Hierarchical Knowledge Base

Twitter

Tapping into social networks to identify interests is not new (2006+). It works!
◦ Google, Bing, Samsung TV, etc.

Twitter content
◦ 500M+ users generating 500M+ tweets per day.
◦ Public and useful for research.

Page 5: User Interests Identification From Twitter using Hierarchical Knowledge Base

What’s there?

Interests with little or no semantics
◦ Bag of Words [1]
◦ Bag of Concepts

Some semantics
◦ Bag of Linked Entities, with the intention of using Knowledge Bases [2, 3]

1. Alan Mislove, Bimal Viswanath, Krishna P. Gummadi, and Peter Druschel. You Are Who You Know: Inferring User Profiles in Online Social Networks. WSDM ’10.

2. Fabian Abel, Qi Gao, Geert-Jan Houben, and Ke Tao. Analyzing User Modeling on Twitter for Personalized News Recommendations. UMAP ’11.

3. Fabrizio Orlandi, John Breslin, and Alexandre Passant. Aggregated, Interoperable and Multi-domain User Profiles for the Social Web. I-SEMANTICS ’12.

Page 6: User Interests Identification From Twitter using Hierarchical Knowledge Base

6

Page 7: User Interests Identification From Twitter using Hierarchical Knowledge Base

What’s new?

How can semantics/Knowledge Bases be utilized to infer interests?
◦ Extensive use of Knowledge Bases to infer user interests from tweets is yet to be explored.

We started by utilizing hierarchical relationships.

Page 8: User Interests Identification From Twitter using Hierarchical Knowledge Base

[Figure: entities linked to a hierarchy of categories, with nodes Internet, Semantic Search, Linked Data, Metadata, Technology, World Wide Web, Semantic Web, and Structured Information.]

Page 9: User Interests Identification From Twitter using Hierarchical Knowledge Base

Advantages of Hierarchical Interests

Addressing the data sparsity problem
◦ Infer more interests of a user from less data.

Flexibility for recommendations
◦ Recommend about Sports or Football: the KB knows that Football is a sub-category of Sports.
◦ Resource Description Framework and Semantic Web: RDF has less data online to recommend.

Page 10: User Interests Identification From Twitter using Hierarchical Knowledge Base

Agenda

Motivation
Approach
Evaluation
Conclusion & Future Work

Page 11: User Interests Identification From Twitter using Hierarchical Knowledge Base

11

Approach

Tweets

Interest Hierarchy


Page 13: User Interests Identification From Twitter using Hierarchical Knowledge Base

Hierarchy Preprocessing

Selecting an ontology
◦ Available: Wikipedia, Dmoz, OpenCyc, Freebase
◦ Our framework can adapt to any ontology

Wikipedia
◦ Diverse domains & coverage
◦ Resemblance to a taxonomy
◦ Extracted structured Wikipedia: DBpedia
◦ Existing entity recognition techniques (explained further)

Page 14: User Interests Identification From Twitter using Hierarchical Knowledge Base

Wikipedia Category Graph

4.2 million articles
0.8 million Wikipedia categories
2.0 million category-subcategory relationships

Challenges
◦ Crowd-sourced, hence noisy
◦ Not a hierarchy/taxonomy: it is a graph, and it has cycles

Page 15: User Interests Identification From Twitter using Hierarchical Knowledge Base

Wikipedia Hierarchy

Clean up: removed Wiki admin categories

The Hierarchical Interest Graph needs a base hierarchy
◦ Shortest path from the root node

Root node: Category:Main Topic Classifications

Assumption: the number of hops to the root node determines the level of abstraction of the category.
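The shortest-path assumption above can be sketched with a breadth-first search from the root. This is a minimal illustration, not the authors' implementation: the function name `hierarchical_levels` and the toy edges are hypothetical, and the root is numbered Level 1 to match the numbering used in the slides.

```python
from collections import deque


def hierarchical_levels(subcategories, root):
    """Map each reachable category to its level: shortest-path hops from the root, plus 1."""
    levels = {root: 1}  # the slides label the root as Level 1
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for child in subcategories.get(parent, ()):
            if child not in levels:  # the first BFS visit is the shortest path
                levels[child] = levels[parent] + 1
                queue.append(child)
    return levels


# Toy excerpt (hypothetical edges); Scientists -> Science forms a cycle.
toy_graph = {
    "Main topic classifications": ["Science", "Sports", "Health"],
    "Science": ["Scientists", "Science Education"],
    "Health": ["Health Care", "Health Economics"],
    "Scientists": ["Science"],
}
levels = hierarchical_levels(toy_graph, "Main topic classifications")
print(levels["Science"], levels["Scientists"])
```

Because BFS assigns a level only on the first visit, the cycle in the toy graph does not change the level of Science.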

Page 16: User Interests Identification From Twitter using Hierarchical Knowledge Base

Determining the Hierarchical Level

[Figure: excerpt of the category hierarchy rooted at Main topic classifications, with nodes Agriculture, Science, Sports, Health, Science Education, Scientists, Health Care, and Health Economics arranged over Levels 1 to 3.]

Page 17: User Interests Identification From Twitter using Hierarchical Knowledge Base

Links Conforming to a Hierarchy

Removing links that do not conform to a hierarchy
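One way to read this step, given per-category levels, is to keep a category-to-subcategory link only when it leads to a strictly deeper level. The sketch below assumes that rule; the exact criterion used in the talk is not stated on the slide, and `prune_non_hierarchical` is a hypothetical name.

```python
def prune_non_hierarchical(edges, levels):
    """Keep only category -> subcategory links that lead to a strictly deeper level."""
    return [
        (parent, child)
        for parent, child in edges
        if parent in levels and child in levels and levels[parent] < levels[child]
    ]


# Hypothetical levels and edges for illustration.
levels = {"Main topic classifications": 1, "Science": 2, "Scientists": 3}
edges = [
    ("Main topic classifications", "Science"),
    ("Science", "Scientists"),
    ("Scientists", "Science"),  # points back up the hierarchy: dropped
]
kept = prune_non_hierarchical(edges, levels)
print(kept)
```

Dropping every link that does not descend also removes all cycles, since a cycle would need at least one link going back up or sideways.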

Page 18: User Interests Identification From Twitter using Hierarchical Knowledge Base

18

Approach

Tweets

Interest Hierarchy

Page 19: User Interests Identification From Twitter using Hierarchical Knowledge Base

User Interests Generator

Extracting Wikipedia concepts from tweets

Interests scoring

http://en.wikipedia.org/wiki/Semantic_search

http://en.wikipedia.org/wiki/Ontology

Page 20: User Interests Identification From Twitter using Hierarchical Knowledge Base

Entity Extraction on Tweets*

◦ Issues relevant to entity extraction are handled by the web services: stop-word removal, URLs, disambiguation, etc.

Service      Precision  Recall  F-measure  Usability    Rate Limit / License
Text Razor   64.6       26.9    38.0       Web Service  500/day
Zemanta      57.7       31.8    41.0       Web Service  10000/day

*L. Derczynski, D. Maynard, N. Aswani, and K. Bontcheva. Microblog-genre noise and impact on semantic annotation accuracy. In Proceedings of the 24th ACM Conference on Hypertext and Social Media, HT ’13.

Page 21: User Interests Identification From Twitter using Hierarchical Knowledge Base

Scoring Interests

Scoring Wikipedia concepts
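As a rough illustration of concept scoring, the sketch below assumes a normalized mention frequency (each concept's count divided by the count of the most-mentioned concept). This is an illustrative choice and not necessarily the formula used in the talk; `score_concepts` and the sample mentions are hypothetical.

```python
from collections import Counter


def score_concepts(mentions):
    """Score each concept by its mention count relative to the most-mentioned concept."""
    counts = Counter(mentions)
    if not counts:
        return {}
    top = max(counts.values())
    return {concept: n / top for concept, n in counts.items()}


# Hypothetical concepts spotted in one user's tweets.
scores = score_concepts(
    ["Semantic search", "Ontology", "Semantic search", "Linked data"]
)
print(scores)
```

Scores produced this way fall in the interval (0, 1], so they can serve as starting weights for the entities in the hierarchy.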

 

 

Page 22: User Interests Identification From Twitter using Hierarchical Knowledge Base

[Figure: user interests linked to the category hierarchy (Internet, Semantic Search, Linked Data, Metadata, Technology, World Wide Web, Semantic Web, Structured Information), with scores for interests of 0.8, 0.2, and 0.6.]

Page 23: User Interests Identification From Twitter using Hierarchical Knowledge Base

23

Approach

Tweets

Interest Hierarchy

Page 24: User Interests Identification From Twitter using Hierarchical Knowledge Base

Naïve Strategy – Inferring Every Hierarchical Interest

Result (challenges)
◦ Infers more categories without context
◦ Equal weights regardless of interest score
◦ Cannot rank categories of interest for a user
◦ We use Spreading Activation

[Figure: entities M S Dhoni, Virat Kohli, and Sachin Tendulkar linked upward to Indian Cricketers, Indian Cricket, Cricket, and Sports, and to Honorary Members of the Order of Australia, Order of Australia, Awards, and Culture.]
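The naïve strategy can be sketched as collecting every category reachable upward from a user's entities. The category chains below are an assumed reading of the slide's figure, and `naive_interests` is a hypothetical name.

```python
def naive_interests(entities, entity_categories, parents):
    """Return every category reachable upward from the given entities, unweighted."""
    inferred = set()
    stack = [cat for e in entities for cat in entity_categories.get(e, ())]
    while stack:
        cat = stack.pop()
        if cat not in inferred:
            inferred.add(cat)
            stack.extend(parents.get(cat, ()))
    return inferred


# Assumed toy structure based on the figure.
entity_categories = {
    "M S Dhoni": ["Indian Cricketers"],
    "Virat Kohli": ["Indian Cricketers"],
    "Sachin Tendulkar": [
        "Indian Cricketers",
        "Honorary Members of the Order of Australia",
    ],
}
parents = {
    "Indian Cricketers": ["Indian Cricket"],
    "Indian Cricket": ["Cricket"],
    "Cricket": ["Sports"],
    "Honorary Members of the Order of Australia": ["Order of Australia"],
    "Order of Australia": ["Awards"],
    "Awards": ["Culture"],
}
inferred = naive_interests(list(entity_categories), entity_categories, parents)
print(sorted(inferred))
```

In this toy data, Culture is inferred with the same standing as Sports even though only one entity leads to it, which is the lack of context and ranking that the slide lists as challenges.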

Page 25: User Interests Identification From Twitter using Hierarchical Knowledge Base

Spreading Activation

Graph algorithm to find contextual nodes
◦ Cognitive Sciences
◦ Neural Networks
◦ Information Retrieval: associative and semantic networks
◦ Semantic Web: context generation

Page 26: User Interests Identification From Twitter using Hierarchical Knowledge Base

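Spreading Activation Example

[Figure: entities M S Dhoni, Virat Kohli, and Sachin Tendulkar, with entity scores 0.8, 0.2, and 0.6, activate Indian Cricketers, Indian Cricket, Cricket, and Sports; the activation values shown along the hierarchy are 0.5, 0.4, 0.25, and 0.1.]

Activation function: determines the extent of spreading.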
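A minimal sketch of spreading activation over such a hierarchy follows. It assumes a simple activation function in which each hop multiplies the incoming value by a constant `decay` and a category sums what it receives from each entity. This is a generic textbook form, not the activation function from the talk, and it does not reproduce the values in the figure.

```python
from collections import deque


def spread_activation(entity_scores, entity_categories, parents, decay=0.5):
    """Propagate entity scores upward, multiplying by `decay` at every hop."""
    activation = {}
    for entity, score in entity_scores.items():
        queue = deque(
            (cat, score * decay) for cat in entity_categories.get(entity, ())
        )
        seen = set()
        while queue:
            cat, value = queue.popleft()
            if cat in seen:  # each entity activates a category at most once
                continue
            seen.add(cat)
            activation[cat] = activation.get(cat, 0.0) + value
            queue.extend((p, value * decay) for p in parents.get(cat, ()))
    return activation


# Assumed toy data based on the figure.
entity_scores = {"M S Dhoni": 0.8, "Virat Kohli": 0.2, "Sachin Tendulkar": 0.6}
entity_categories = {name: ["Indian Cricketers"] for name in entity_scores}
parents = {
    "Indian Cricketers": ["Indian Cricket"],
    "Indian Cricket": ["Cricket"],
    "Cricket": ["Sports"],
}
activation = spread_activation(entity_scores, entity_categories, parents)
ranked = sorted(activation, key=activation.get, reverse=True)
print(ranked)
```

Unlike the naïve strategy, every category now carries a weight, so a user's categories can be ranked.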

Page 27: User Interests Identification From Twitter using Hierarchical Knowledge Base

Activation Function

27

Page 28: User Interests Identification From Twitter using Hierarchical Knowledge Base

Initial Experiments with Decay & Weighted Edge

No decay, no weighted edge
• Result: most generic categories ranked higher

Decay over the hops of the activation
• Decay values: 0.4, 0.6, 0.8
• Result: same as above

Page 29: User Interests Identification From Twitter using Hierarchical Knowledge Base

Results: Constant Decay

[Figure: excerpt of the category hierarchy rooted at Main topic classifications (Level 1), with nodes Agriculture, Science, Sports, Health, Science Education, Scientists, Health Care, and Health Economics.]

Resulting categories (category – level):
Main Topic Classification – 1
Technology – 2
Science – 2
Sports – 2
Business – 2
……
Technology Companies – 3
Scientists – 3

Page 30: User Interests Identification From Twitter using Hierarchical Knowledge Base

Wikipedia Challenges to Find Relevant Nodes in the Hierarchy

Uneven distribution of nodes in the hierarchy

Many-to-many category-subcategory relationships

Page 31: User Interests Identification From Twitter using Hierarchical Knowledge Base

Wikipedia Challenges to Find Relevant Nodes in the Hierarchy

Uneven distribution of nodes in the hierarchy

Many-to-many category-subcategory relationships

[Chart: number of nodes (0 to 300,000) per hierarchical level (1 to 16).]


Page 34: User Interests Identification From Twitter using Hierarchical Knowledge Base

 

Addressing Uneven Nodes Distribution

[Chart: number of nodes (0 to 300,000) per hierarchical level (1 to 16).]

Page 35: User Interests Identification From Twitter using Hierarchical Knowledge Base

Many to Many – Preferential Path

[Figure: diagram with labels 1, 2, 3, 4.]

Page 36: User Interests Identification From Twitter using Hierarchical Knowledge Base

Boost Intersecting Nodes

Nodes that intersect domains/subcategories activated by diverse entities

Page 37: User Interests Identification From Twitter using Hierarchical Knowledge Base

Boost Intersecting Nodes

[Figure: M S Dhoni, Virat Kohli, and Sachin Tendulkar activate Indian Cricketers (3) and Indian Cricket (3); Michael Clarke and Shane Watson activate Australian Cricketers (2) and Australian Cricket (2); both branches meet at Cricket (5) and Sports (5).]
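The numbers in this example can be read as the count of distinct entities that activate each category. The sketch below computes that count under an assumed reading of the figure; how the count is turned into a boost is not shown here, and `activating_entity_counts` is a hypothetical name.

```python
from collections import defaultdict


def activating_entity_counts(entity_categories, parents):
    """Count, for each category, how many distinct entities reach it going upward."""
    reached = defaultdict(set)
    for entity, cats in entity_categories.items():
        stack = list(cats)
        while stack:
            cat = stack.pop()
            if entity not in reached[cat]:
                reached[cat].add(entity)
                stack.extend(parents.get(cat, ()))
    return {cat: len(entities) for cat, entities in reached.items()}


# Assumed toy structure based on the figure.
entity_categories = {
    "M S Dhoni": ["Indian Cricketers"],
    "Virat Kohli": ["Indian Cricketers"],
    "Sachin Tendulkar": ["Indian Cricketers"],
    "Michael Clarke": ["Australian Cricketers"],
    "Shane Watson": ["Australian Cricketers"],
}
parents = {
    "Indian Cricketers": ["Indian Cricket"],
    "Indian Cricket": ["Cricket"],
    "Australian Cricketers": ["Australian Cricket"],
    "Australian Cricket": ["Cricket"],
    "Cricket": ["Sports"],
}
counts = activating_entity_counts(entity_categories, parents)
print(counts)
```

Cricket is the most specific category reached by all five entities, which is the kind of intersecting node the slide proposes to boost.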

Page 38: User Interests Identification From Twitter using Hierarchical Knowledge Base

 

Boost Intersecting Nodes

Page 39: User Interests Identification From Twitter using Hierarchical Knowledge Base

39

Activation Functions

Page 40: User Interests Identification From Twitter using Hierarchical Knowledge Base

Agenda

Motivation
Approach
Evaluation
Conclusion & Future Work

Page 41: User Interests Identification From Twitter using Hierarchical Knowledge Base

User Study

User study data
◦ 37 users
◦ 31,927 tweets

• Hierarchical Interest Graph
– 111,535 category interests
– 3000 categories/user
– Ranking evaluation: top-50 categories

Page 42: User Interests Identification From Twitter using Hierarchical Knowledge Base

Evaluation Metrics

How many relevant/irrelevant Hierarchical Interests are retrieved at top-k ranks?
◦ Graded Precision

How well are the retrieved relevant Hierarchical Interests ranked at top-k?
◦ Mean Average Precision

How early in the ranked Hierarchical Interests can we find a relevant result?
◦ Mean Reciprocal Recall
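The three questions above correspond to standard ranking measures, sketched here for a single user with binary relevance judgments. This is a simplification: the study's graded judgments are not modeled, the reciprocal measure is computed as the usual reciprocal of the first relevant rank, and average precision is normalized by the number of relevant items retrieved in the top-k. The function names are hypothetical; the "mean" variants average these values over users.

```python
def precision_at_k(relevant_flags, k):
    """Fraction of the top-k ranked interests judged relevant."""
    return sum(relevant_flags[:k]) / k


def average_precision(relevant_flags, k):
    """Mean of precision values taken at each relevant rank within the top-k."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(relevant_flags[:k], start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(relevant_flags):
    """1 / rank of the first relevant interest, or 0.0 if none is relevant."""
    for rank, relevant in enumerate(relevant_flags, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical judgments for one user's ranked interests (True = relevant).
flags = [True, False, True, True]
p_at_4 = precision_at_k(flags, 4)
ap_at_4 = average_precision(flags, 4)
rr = reciprocal_rank(flags)
print(p_at_4, round(ap_at_4, 3), rr)
```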

Page 43: User Interests Identification From Twitter using Hierarchical Knowledge Base

43

Evaluation Results

Priority Intersect works the best with

• 76% Mean Average Precision

• 98% Mean Reciprocal Recall

Page 44: User Interests Identification From Twitter using Hierarchical Knowledge Base

Implicit Interests – Syntactic

How many of the categories inferred by the system were not explicitly mentioned by the user in tweets? (Semantic Web and Category:Semantic Web)

Priority Intersect at top-10
• 52% of categories were not mentioned in tweets by the user
• 65% of which were marked relevant
• 10% were marked May-be

Page 45: User Interests Identification From Twitter using Hierarchical Knowledge Base

Hierarchy Evaluation

Mapped (string match) categories of Wikipedia to Dmoz.
◦ ~141K categories mapped

Compared all the category-subcategory relationships of the mapped categories in the hierarchy to the manually created Dmoz.
◦ 87% precise (relationships in the hierarchy that were also found in Dmoz)
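The comparison described above can be sketched as follows, assuming that "string match" means a case-insensitive name match and that a relationship counts as found only when Dmoz has the same direct parent-child pair. Both are assumptions, and `normalize`, `edge_precision`, and the sample edges are hypothetical.

```python
def normalize(name):
    """Normalize a category name for a simple string match."""
    return name.strip().lower().replace("_", " ")


def edge_precision(wiki_edges, dmoz_edges):
    """Share of Wikipedia edges between mapped categories that also appear in Dmoz."""
    dmoz_set = {(normalize(p), normalize(c)) for p, c in dmoz_edges}
    dmoz_cats = {name for edge in dmoz_set for name in edge}
    mapped = [
        (normalize(p), normalize(c))
        for p, c in wiki_edges
        if normalize(p) in dmoz_cats and normalize(c) in dmoz_cats
    ]
    if not mapped:
        return 0.0
    return sum(edge in dmoz_set for edge in mapped) / len(mapped)


# Hypothetical edges for illustration.
wiki_edges = [
    ("Sports", "Cricket"),
    ("Sports", "Football"),
    ("Science", "Cricket"),   # both names map, but Dmoz lacks this link
    ("Sports", "Quidditch"),  # Quidditch has no Dmoz match: not evaluated
]
dmoz_edges = [("Sports", "Cricket"), ("Sports", "Football"), ("Science", "Physics")]
precision = edge_precision(wiki_edges, dmoz_edges)
print(round(precision, 3))
```

Only relationships whose two categories both map to Dmoz enter the denominator, which mirrors the slide's restriction to mapped categories.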

Page 46: User Interests Identification From Twitter using Hierarchical Knowledge Base

Agenda

Motivation
Approach
Evaluation
Conclusion & Future Work

Page 47: User Interests Identification From Twitter using Hierarchical Knowledge Base

Conclusion

Hierarchical Interest Graph (hierarchical representation of user interests)
◦ Each interest carries a hierarchical level, giving flexibility to personalize and recommend based on its abstractness.

We semantically enhanced user profiles of interests from Twitter using knowledge bases.
◦ Inferred abstract/hierarchical interests of Twitter users using Wikipedia
◦ This can help reduce the data sparsity problem by inferring relevant interests.

The top-1 hierarchical interest generated by the system was correct for 36 out of 37 user-study participants.
◦ Mean Average Precision at top-10 is 0.76

Page 48: User Interests Identification From Twitter using Hierarchical Knowledge Base

Future Work

Measuring the impact of Hierarchical Interest Graphs on recommendation of movies/music
◦ Datasets: Movielens, Lastfm

Tuning the system to utilize the hierarchical levels of interests for personalization and recommendation
◦ Sports (most abstract interest)
◦ Baseball (specific interest)

Page 49: User Interests Identification From Twitter using Hierarchical Knowledge Base

Thanks

Contact: Pavan Kapanipathi
Twitter: @pavankaps
Email: [email protected]
Info: Knoesis Wiki – Hierarchical Interest Graph