Temporal and semantic analysis of richly typed social networks from user-generated content sites on the web

Zide Meng, Supervisor: Fabien Gandon, Catherine Faron Zucker

Anatomy of a Q&A page: question title, question content, question tags, question user, question votes, question comments, answer content, answer user, answer votes

Some facts about Q&A sites

• Traffic statistics in Oct. 2016 from Quantcast.com
– 49.9M unique devices visit StackOverflow
– 52.9M unique devices visit Answer.com
– 3.9M unique devices visit Yahoo Answers

• to compare:
– 211.8M unique devices visit Youtube.com (StackOverflow reaches about 1/4 of that)
– 147.6M unique devices visit Facebook.com (StackOverflow reaches about 1/3 of that)

Site info of StackOverflow.com

• total questions: 12.7M
• unanswered: 3.5M
• total answers: 20.1M
• total users: 6.2M
• questions/min: 2.93
• answers/min: 4.66

https://api.stackexchange.com/docs/info#filter=default&site=stackoverflow&run=true (accessed 2 Nov 2016)

What is a Q&A site?

Detect topics and activities of users

how to export jar?

Topic detection

Temporal analysis


Detect topics and activities of groups

temporal dynamics

community detection


Toward the functionalities

• Expert identification
• Question routing
• Community evolution
• Burst topic detection
• Event detection
• etc.

Research Question (RQ1)
• How can we formalize user-generated content?

Research Question (RQ2)
• How can we identify the common topics binding users together?

Research Question (RQ3)
• How can we detect topic-based overlapping communities?

Research Question (RQ4)
• How can we generate a semantic label for topics? (e.g. “Java Development”, “Database”)

Research Question (RQ5)
• How can we extract topic-based expertise and temporal dynamics?

Agenda
• Backgrounds & Motivation
• RQ1: Formalize user-generated content
• RQ2: An efficient topic modeling method
• RQ3: Overlapping community detection
• RQ4: From a BOW to semantic labels
• RQ5: Temporal Topic Expertise Activity
• Conclusions and Perspectives

Overview

• Data Preparation: Q&A Data, Open Data, Other Data
• Data Integration: Schema Mapping, Data Enrichment, Data Inter-Linking, Integrated DataSet
• Data Analysis: Community detection, Topic Extraction, Temporal Analysis
• Application: Applications

RQ1: how to formalize UGC?


massive unstructured Q&A content


Formalize UGC with semantic schema

• From unstructured to structured
• Explicit information (questions, answers…)
• Implicit information (interest, expertise…)

Figure: original Q&A data; Information Extraction; Q&A triples (SIOC & FOAF, existing work); Data Mining; user interest and user expertise (QASM)

QASM vocabulary

22

Formalize distribution

RQ1 Discussion

We propose the QASM vocabulary to formalize both the explicit and the implicit information in user-generated content (Q&A sites).
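As a rough illustration of going from unstructured Q&A records to structured triples, the sketch below emits subject-predicate-object tuples for one question. The SIOC and Dublin Core terms used are existing vocabulary terms; the `EX` namespace, the record layout, and the `interestedIn` property are hypothetical stand-ins, since the actual QASM terms are not shown on the slides.

```python
# Sketch: turn one Q&A record into subject-predicate-object triples.
SIOC = "http://rdfs.org/sioc/ns#"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # placeholder namespace, NOT the real QASM namespace


def question_to_triples(q):
    """Explicit information: the question, its author, tags and answers."""
    q_uri = EX + "question/" + str(q["id"])
    triples = [
        (q_uri, RDF_TYPE, SIOC + "Post"),
        (q_uri, DCTERMS + "title", q["title"]),
        (q_uri, SIOC + "content", q["content"]),
        (q_uri, SIOC + "has_creator", EX + "user/" + q["user"]),
    ]
    for tag in q["tags"]:
        triples.append((q_uri, SIOC + "topic", EX + "tag/" + tag))
    for a in q["answers"]:
        a_uri = EX + "answer/" + str(a["id"])
        triples.append((q_uri, SIOC + "has_reply", a_uri))
        triples.append((a_uri, SIOC + "has_creator", EX + "user/" + a["user"]))
    return triples


def interest_triples(user, topics):
    """Implicit information, using a hypothetical 'interestedIn' property."""
    user_uri = EX + "user/" + user
    return [(user_uri, EX + "interestedIn", EX + "topic/" + t) for t in topics]


question = {
    "id": 1,
    "title": "how to export jar?",
    "content": "How do I export a runnable jar from eclipse?",
    "user": "alice",
    "tags": ["java", "eclipse"],
    "answers": [{"id": 10, "user": "bob"}],
}
explicit = question_to_triples(question)
implicit = interest_triples("alice", ["java-dev"])
```

The explicit triples come directly from the record, while the implicit ones would be produced by a later mining step such as topic extraction.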

How to extract implicit information from the original explicit information?

RQ2: How to find topics?


What kind of topics are they talking about?

What is a Topic?
• e.g. given two documents on the topics “music” and “cooking”
• “guitar” and “singer” are more likely to appear in a document about “music”
• “recipe” and “pizza” are more likely to appear in a document about “cooking”
• “the” and “a” will appear equally in both
• score (0 ~ 1) -> relevance (weak ~ strong)
• Topic “music”:

“guitar”   “singer”   “recipe”   “pizza”   “the”   “a”
0.4        0.3        0.1        0.1       0.05    0.05

Latent Dirichlet Allocation (LDA)


Replace Document with User

• Original LDA: Document-Topic-Word
• In our problem: User-Topic-Tag

How to get topic assignment?
• User, Tag are observed information
• Topic is hidden information

P(topic|user)

P(tag|topic)


how to get the distributions?

• Gibbs sampling

Sample a new Zi

Update distributions

User-Topic distribution

Topic-Tag distribution

θuk

θkw
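The two steps named above (sample a new Zi, update distributions) can be sketched as a small collapsed Gibbs sampler for the User-Topic-Tag setting. This is a generic textbook-style sketch, not the implementation used in the thesis; the hyperparameters `alpha` and `beta`, the iteration count, and the toy data are assumptions.

```python
import random


def gibbs_lda(user_tags, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA with users as documents and tags as words."""
    rng = random.Random(seed)
    vocab = sorted({t for tags in user_tags.values() for t in tags})
    V = len(vocab)
    n_uk = {u: [0] * num_topics for u in user_tags}          # user-topic counts
    n_kw = [{w: 0 for w in vocab} for _ in range(num_topics)]  # topic-tag counts
    n_k = [0] * num_topics                                   # tags per topic
    z = {}
    # Random initial topic assignment for every tag occurrence.
    for u, tags in user_tags.items():
        z[u] = []
        for w in tags:
            k = rng.randrange(num_topics)
            z[u].append(k)
            n_uk[u][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    for _ in range(iters):
        for u, tags in user_tags.items():
            for i, w in enumerate(tags):
                # Remove the current assignment from the counts.
                k = z[u][i]
                n_uk[u][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Sample a new Zi proportional to P(topic|user) * P(tag|topic).
                weights = [
                    (n_uk[u][j] + alpha) * (n_kw[j][w] + beta) / (n_k[j] + V * beta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Update distributions (through the counts).
                z[u][i] = k
                n_uk[u][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    theta_uk = {
        u: [(n_uk[u][k] + alpha) / (len(tags) + num_topics * alpha)
            for k in range(num_topics)]
        for u, tags in user_tags.items()
    }
    theta_kw = [
        {w: (n_kw[k][w] + beta) / (n_k[k] + V * beta) for w in vocab}
        for k in range(num_topics)
    ]
    return theta_uk, theta_kw


# Toy user tag lists (hypothetical data).
users = {
    "u1": ["html", "css", "layout", "html", "css"],
    "u2": ["java", "tomcat", "mysql", "java", "eclipse"],
    "u3": ["html", "css", "java", "mysql"],
}
theta_uk, theta_kw = gibbs_lda(users, num_topics=2)
```

Each row of `theta_uk` is a user-topic distribution and each entry of `theta_kw` is a topic-tag distribution, so every row sums to 1.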


Intuition behind LDA

• How to create a user tag list


Topic-Tag distribution

User-Topic distribution

Loop: choose a topic, choose a tag

html, css, eclipse, mysql, layout, ……


Output of LDA (User-Topic-Tag)

User-Topic Distribution

Web Development   Java Development   Database
0.3               0.6                0.1

Topic-Tag Distribution

topic              java   mysql   tomcat   html
Web Development    0.1    0.1     0.2      0.6
Java Development   0.6    0.1     0.2      0.1
Database           0.1    0.6     0.2      0.1

Short summary

• Goal: We want to find topics, overlapping communities, user expertise, user activities…

• Method: LDA may solve the problems
• But LDA has problems, e.g. it is slow and complex

• Find an efficient and simple topic modeling method

Some empirical findings

-> Find topics based on tags

Highly frequent tags are more general

The first tag normally indicates the domain

Each question has 1~5 tags indicating the key points

Solution: Prefix Tree structure

Q1: html css element
Q2: html layout float
Q3: html layout css-layout
Q4: html forms select input
Q5: html forms autocomplete

html
    css
        element
    layout
        float
        css-layout
    forms
        select
            input
        autocomplete

1. Root tag can be used to represent the children tags
2. Tags in a tree belong to the same topic
3. The order is maintained in the tree structure
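A minimal sketch of building such a prefix tree from the five example questions follows. The nested-dictionary layout and the per-node `count` field are illustrative choices, not necessarily the data structure used in the thesis.

```python
def build_prefix_tree(questions):
    """Insert each question's tag list, in order, as a path in a prefix tree."""
    root = {}
    for tags in questions:
        node = root
        for tag in tags:
            entry = node.setdefault(tag, {"count": 0, "children": {}})
            entry["count"] += 1  # how many questions pass through this node
            node = entry["children"]
    return root


def print_tree(node, depth=0):
    """Print the tree with one level of indentation per tag position."""
    for tag, entry in node.items():
        print("    " * depth + f"{tag}:{entry['count']}")
        print_tree(entry["children"], depth + 1)


questions = [
    ["html", "css", "element"],
    ["html", "layout", "float"],
    ["html", "layout", "css-layout"],
    ["html", "forms", "select", "input"],
    ["html", "forms", "autocomplete"],
]
tree = build_prefix_tree(questions)
print_tree(tree)
```

Because tags are inserted in the order they appear in each question, all five questions share the single root `html`, which matches the observation that the first tag normally indicates the domain.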

Figure: HTML prefix tree for StackOverflow

Combine Prefix Trees
• why: some trees should be in the same topic
• how: compute root tag similarity matrix
• output: combine trees to get topics

Example trees (root: children):
html: layout, mysql, forms
java: tomcat, mysql, jvm
javascript: css, jquery, json

Root tag similarity matrix:

similarity   html   javascript   java
html         1      0.89         0.35
javascript   0.89   1            0.29
java         0.35   0.29         1
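One simple way to turn such a similarity matrix into topics is to merge every pair of roots whose similarity reaches a threshold. The sketch below does this with a small union-find; the threshold value 0.5 and the merging rule itself are assumptions for illustration, since the slides do not state how the trees are combined.

```python
def combine_roots(similarity, threshold):
    """Group root tags whose pairwise similarity is at least `threshold`."""
    roots = list(similarity)
    parent = {r: r for r in roots}

    def find(r):
        # Follow parent links to the representative, compressing the path.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a in roots:
        for b in roots:
            if a < b and similarity[a][b] >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for r in roots:
        groups.setdefault(find(r), []).append(r)
    return sorted(sorted(g) for g in groups.values())


similarity = {
    "html": {"html": 1.0, "javascript": 0.89, "java": 0.35},
    "javascript": {"html": 0.89, "javascript": 1.0, "java": 0.29},
    "java": {"html": 0.35, "javascript": 0.29, "java": 1.0},
}
topics = combine_roots(similarity, threshold=0.5)  # 0.5 is an assumed cut-off
```

With these values the `html` and `javascript` trees end up in one topic and the `java` tree in another.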


How to get the topic-tag distribution?

Topic1: Web-dev (total count: 100)

tag          count   probability
html         50      0.5
javascript   50      0.5
mysql        20      0.2
forms        20      0.2
layout       10      0.1

• MLE (Maximum Likelihood Estimation)
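The maximum likelihood estimate here amounts to dividing each tag count by the topic total. A minimal sketch with the counts from the slide follows; the function name is hypothetical.

```python
def topic_tag_distribution(tag_counts, topic_total):
    """MLE: probability of a tag in a topic = tag count / topic total."""
    return {tag: count / topic_total for tag, count in tag_counts.items()}


# Counts for "Topic1: Web-dev" as shown on the slide.
web_dev_counts = {"html": 50, "javascript": 50, "mysql": 20, "forms": 20, "layout": 10}
web_dev = topic_tag_distribution(web_dev_counts, topic_total=100)
```

Note that the counts are nested (the root counts already include their children), so these values do not sum to 1 across all tags of the tree.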


Topic-tag distribution

probability

tags

topics

sql database

highly related

eclipse

not related


Topic Extraction experiment setup

• Dataset
– StackOverflow (2008/08 to 2009/09)
– 103K users
– 242K questions and 870K answers

• Baseline Algorithms
– LDA (Latent Dirichlet Allocation)

Topic extraction evaluation metric

• Metric: Perplexity
– how likely a model would generate the test dataset

• Example (models estimated from the training set):

model     html   width   css
model 1   0.9    0.05    0.05   (less likely to generate the test cases)
model 2   0.4    0.4     0.2    (more likely to generate the test cases)

test case 1: html width css
test case 2: html width
test case 3: html width
test case 4: html width css

Higher probability of generating the test dataset -> lower perplexity score -> better performance
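The example can be checked numerically. The sketch below uses the standard definition of perplexity, the exponential of the negative average log-likelihood per tag, and treats each model as a single tag distribution; that simplification is an assumption made for this toy example.

```python
import math


def perplexity(model, test_cases):
    """exp(-average log-probability per tag) over all test cases."""
    log_likelihood = 0.0
    n_tags = 0
    for tags in test_cases:
        for tag in tags:
            log_likelihood += math.log(model[tag])
            n_tags += 1
    return math.exp(-log_likelihood / n_tags)


model_1 = {"html": 0.9, "width": 0.05, "css": 0.05}
model_2 = {"html": 0.4, "width": 0.4, "css": 0.2}
test_cases = [
    ["html", "width", "css"],
    ["html", "width"],
    ["html", "width"],
    ["html", "width", "css"],
]
ppl_1 = perplexity(model_1, test_cases)
ppl_2 = perplexity(model_2, test_cases)
print(f"model 1: {ppl_1:.2f}, model 2: {ppl_2:.2f}")
```

Model 2 gets the lower perplexity because it spreads probability over `width` and `css`, which occur often in the test cases.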

43

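Perplexity Score compared with LDA (lower is better)

Figure: perplexity scores of LDA and of our method (marked WE)

Scalability compared with LDA

Figure: scalability comparison (series label: LDA)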

RQ2 Discussion
• If two tags co-occur many times, they should be in the same topic (in the same topic tree in our method)
• The probability of a tag given a topic can be approximated by its frequency in that topic if the observed data is large enough
• A question's tag list is short (3~5 tags), which makes it hard for LDA to get very good results
• Tested on a Flickr dataset, the method also generates meaningful topics
• TTD: a simple and efficient topic modeling method that preserves topic quality

RQ3: How to detect communities?

Which users are interested in the same topic?


Existing community detection approaches

We focus on a simple, efficient, topic-based, overlapping community detection method


how to get user-topic distribution?

• topic-tag distribution (from the last step) + user tag list

User tag list: java *22, jsp *15, tomcat *11, html *9, ……

topic      java   jsp    tomcat   html   score
Web-Dev    0.10   0.20   0.30     0.30   0.10*22 + 0.20*15 + 0.30*11 + 0.30*9 = 11.2
Java-Dev   0.50   0.20   0.25     0.05   0.50*22 + 0.20*15 + 0.25*11 + 0.05*9 = 17.2
C#-Dev     0.05   0.05   0.15     0.10   0.05*22 + 0.05*15 + 0.15*11 + 0.10*9 = 4.4
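The computation above is a weighted sum of topic-tag probabilities over the user's tag counts. A minimal sketch reproducing the slide's numbers follows; the function and variable names are illustrative.

```python
def user_topic_scores(user_tags, topic_tag):
    """Score each topic as sum over tags of P(tag|topic) * user's tag count."""
    return {
        topic: sum(dist.get(tag, 0.0) * count for tag, count in user_tags.items())
        for topic, dist in topic_tag.items()
    }


topic_tag = {
    "Web-Dev": {"java": 0.10, "jsp": 0.20, "tomcat": 0.30, "html": 0.30},
    "Java-Dev": {"java": 0.50, "jsp": 0.20, "tomcat": 0.25, "html": 0.05},
    "C#-Dev": {"java": 0.05, "jsp": 0.05, "tomcat": 0.15, "html": 0.10},
}
user_tags = {"java": 22, "jsp": 15, "tomcat": 11, "html": 9}
scores = user_topic_scores(user_tags, topic_tag)
```

The scores are not normalized, so a high score for one topic does not force a low score for another.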

50

User-topic distribution

Figure: users by topics, shaded from low interest to high interest (e.g. user12960 on web-dev and java-dev)

How to find overlapping communities
• use the user-topic distribution
• each topic represents one community

0.75

0.32

0.42

0.15

0.78

0.23

Web-Dev

Java-Dev

C#-Dev

0.15

0.78

0.82

For example, threshold : 0.3

Web-Dev Java-Dev C#-Dev
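Thresholding the user-topic scores can be sketched as follows. The user names and the assignment of scores to users are hypothetical; only the threshold of 0.3 comes from the slide.

```python
def overlapping_communities(user_topic, threshold):
    """Put a user into every topic community where the score reaches the threshold."""
    communities = {}
    for user, scores in user_topic.items():
        for topic, score in scores.items():
            if score >= threshold:
                communities.setdefault(topic, set()).add(user)
    return communities


# Hypothetical per-user interest scores (no sum-to-1 constraint).
user_topic = {
    "user_a": {"Web-Dev": 0.75, "Java-Dev": 0.32, "C#-Dev": 0.15},
    "user_b": {"Web-Dev": 0.15, "Java-Dev": 0.78, "C#-Dev": 0.23},
    "user_c": {"Web-Dev": 0.15, "Java-Dev": 0.42, "C#-Dev": 0.82},
}
communities = overlapping_communities(user_topic, threshold=0.3)
```

A user whose scores pass the threshold for several topics belongs to several communities, which is what makes the communities overlap.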

52

Overlapping Community Detection experiment setup

• Dataset
– StackOverflow (2008/08 to 2009/09)
– 103K users
– 242K questions and 870K answers

• Compared Algorithms
– SLPA (Speaker-listener Label Propagation Algorithm)
– LDA (Latent Dirichlet Allocation)
– Ward (hierarchical clustering algorithm)

Evaluation metrics

• metric: Jaccard Similarity & Cosine Similarity
– avg_inner: users in a community are closer to each other
– avg_rand: users in a community are far away from users outside
– avg_center: users in a community are closer to the center
– nmi = avg_inner/avg_rand

the larger the better
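As a rough sketch of how such metrics might be computed with Jaccard similarity over users' tag sets: the exact definitions are not given on the slide, so the choices below (average pairwise similarity inside the community for avg_inner, average member-to-outsider similarity for avg_rand) are assumptions, and the data is hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two tag sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def avg_inner(members, tags_of):
    """Assumed definition: mean pairwise similarity between community members."""
    pairs = list(combinations(members, 2))
    return sum(jaccard(tags_of[u], tags_of[v]) for u, v in pairs) / len(pairs)


def avg_rand(members, outsiders, tags_of):
    """Assumed definition: mean similarity between members and outside users."""
    pairs = [(u, v) for u in members for v in outsiders]
    return sum(jaccard(tags_of[u], tags_of[v]) for u, v in pairs) / len(pairs)


tags_of = {
    "u1": {"html", "css", "layout"},
    "u2": {"html", "css", "forms"},
    "u3": {"java", "tomcat", "html"},
}
inner = avg_inner(["u1", "u2"], tags_of)
rand = avg_rand(["u1", "u2"], ["u3"], tags_of)
ratio = inner / rand  # corresponds to avg_inner/avg_rand on the slide
```

A ratio above 1 means members are, on average, more similar to each other than to users outside the community.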


Jaccard Similarity

Figure: four panels (avg_inner, avg_rand, avg_center, nmi); all four are marked WE

Cosine Similarity

Figure: four panels (avg_inner, avg_rand, avg_center, nmi); two are marked WE and two are marked LDA

LDA wins, but LDA has the “sum to 1” restriction


RQ3 Discussion

• Both our method and LDA can detect topic-based communities; graph-based and clustering methods cannot.

• Compared with LDA, our method does not have a “sum-to-1” restriction (a high interest in one community does not necessarily lower the interest in another community)

• Our method is simple and efficient compared with LDA

RQ4: How to label topics?


Bag of words is hard to view and manage

Topic 1

Topic 2


Existing topic labeling approaches


Using External Knowledge: Link to DBpedia

Java

http://dbpedia.org/resource/Java_(programming_language)

For example

http://dbpedia.org/resource/Java

Disambiguation


Disambiguation

Content cosine similarity between the “Java” tag description in StackOverflow and candidate descriptions:
– Java (Place) description: 0.31
– Java (P.L.) description: 0.58

• Method 1: Babelfy (Moro et al. 2014)
• Method 2: DBpedia Lookup service
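The content-similarity idea can be sketched with a plain bag-of-words cosine similarity: compare the tag description with each candidate description and keep the best match. The three descriptions below are made-up stand-ins, not the real StackOverflow or DBpedia texts, and this sketch does not call Babelfy or the DBpedia Lookup service.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical descriptions, for illustration only.
tag_description = "java is a general purpose object oriented programming language"
candidates = {
    "http://dbpedia.org/resource/Java": "java is an island of indonesia in the indian ocean",
    "http://dbpedia.org/resource/Java_(programming_language)": (
        "java is an object oriented programming language designed to have few dependencies"
    ),
}
best = max(candidates, key=lambda uri: cosine(tag_description, candidates[uri]))
```

A real pipeline would probably remove stop words or use TF-IDF weights first, since words such as "is" inflate the similarity of unrelated texts.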


Find missing links and resources

http://dbpedia.org/resource/Apache_Tomcat
http://dbpedia.org/resource/Java_p.l.
http://dbpedia.org/resource/eclipse
http://dbpedia.org/resource/j2ee

SPARQL queries to DBpedia


Find the center

Java Development

Algorithms to find central node

the central node


Algorithms to find the center

• InDegree (ID)
• Betweenness Centrality (BC)
• Degree Centrality (DC)
• PageRank (Page 1999) (PR)
• Random
• Top (sorted topic-tag distribution)
• Most (most selected by above algorithms)
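Two of the listed measures, in-degree and PageRank, can be sketched on a small directed graph of linked resources. The edges below are hypothetical, and the PageRank is a plain power iteration with an assumed damping factor of 0.85, not necessarily the variant used in the experiments.

```python
def in_degree(nodes, edges):
    """Number of incoming links per node."""
    degree = {n: 0 for n in nodes}
    for _, dst in edges:
        degree[dst] += 1
    return degree


def pagerank(nodes, edges, damping=0.85, iters=50):
    """Power-iteration PageRank; dangling nodes spread their rank uniformly."""
    n = len(nodes)
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out[v]:
                new[w] += damping * rank[v] / len(out[v])
        rank = new
    return rank


# Hypothetical link graph between the resources of one topic.
nodes = ["Java", "Apache_Tomcat", "Eclipse", "J2EE"]
edges = [
    ("Apache_Tomcat", "Java"),
    ("Eclipse", "Java"),
    ("J2EE", "Java"),
    ("J2EE", "Apache_Tomcat"),
]
degrees = in_degree(nodes, edges)
ranks = pagerank(nodes, edges)
center_by_degree = max(degrees, key=degrees.get)
center_by_pagerank = max(ranks, key=ranks.get)
```

On this toy graph both measures pick the same central node, which would then serve as the label candidate for the topic.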


Topic Labeling experiment: Survey


Evaluation: NDCG
• Normalized Discounted Cumulative Gain
– used to compare two ranked lists
– Perfect match: NDCG=1.0
– Completely wrong: NDCG=0.0

Ground Truth: Survey Result
HTML 10
Firefox 8
Web-Development 7
CSS 1
Browser 0

Algorithm output
HTML 0.81
Web-Development 0.52
Firefox 0.31
CSS 0.02
Browser 0.01
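For the example above, NDCG can be computed by ranking the labels by the algorithm's scores and reading the gains from the survey result. The sketch uses linear gains and a log2 position discount, which is one common formulation; the slides do not say which variant was used.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(ground_truth, scores):
    """Rank labels by `scores`, take gains from `ground_truth`, normalize by the ideal DCG."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    gains = [ground_truth[label] for label in ranked]
    ideal = dcg(sorted(ground_truth.values(), reverse=True))
    return dcg(gains) / ideal if ideal else 0.0


survey = {"HTML": 10, "Firefox": 8, "Web-Development": 7, "CSS": 1, "Browser": 0}
algorithm = {
    "HTML": 0.81,
    "Web-Development": 0.52,
    "Firefox": 0.31,
    "CSS": 0.02,
    "Browser": 0.01,
}
score = ndcg(survey, algorithm)
print(f"NDCG = {score:.3f}")
```

The algorithm only swaps the second and third labels, so the score stays close to, but below, a perfect match.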


Experiment:NDCG

NDCG=1 : perfect match


RQ4 Discussion

• Many words cannot be linked to DBpedia
• One label is not enough; 2 or 3 labels are much better
• By using the external knowledge base DBpedia, we propose a method to automatically generate semantic labels for a bag of words.

RQ5: How to model temporal dynamics?

Who is active now?
On which topic is he active?
In which topic does he have expertise?


Related Work


LDA -> TTEA: Temporal, Expertise, Activity

TTEA Model details

How to get topic assignment?

User-Topic distribution

Topic-Word distribution

Topic-Tag distribution

Topic-Time distribution

Topic-Expertise distribution


How to get the distributions?

• Gibbs sampling

Sample a new Zi

Update distributions

Intuition behind TTEA (Temporal)

Topic-Tag distribution

User-Topic distribution

Topic-Time distribution

html, css, eclipse, mysql, layout, ……

June 2016

Intuition behind TTEA (Expertise)

Topic-Tag distribution

User-Topic distribution

Topic-Exp distribution

html, css, eclipse, mysql, layout, ……

June 2016 52


Experiments and Evaluations

• StackOverflow dataset (07/2008-11/2013)


Topic Extraction experiment setup

• Baseline Algorithms
– TEM (Yang 2013b): topic, expertise
– UQA (Guo 2008b): topic, categories
– GrosToT (Hu 2014): topic, temporal
– TTEA (ours): topic, expertise, activity, temporal


Question Routing Task

• For each question in the test dataset, we recommend 5, 10, 20 or 30 users

• MSC: the number of successful predictions

Figure: msc@5, msc@10, msc@20 and msc@30 (y-axis from 0 to 0.45) for TTEA-ACT, TEM, UQA, GROSTOT and RANDOM; the WE marker denotes our method
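One plausible reading of msc@k is sketched below: a prediction counts as successful when at least one of the top k recommended users actually answered the question, and the result is reported as a fraction of the test questions. This interpretation, the function name, and the data are assumptions, since the slide only gives a one-line definition.

```python
def msc_at_k(recommendations, answerers, k):
    """Fraction of questions with at least one true answerer in the top-k list."""
    hits = sum(
        1
        for question, ranked_users in recommendations.items()
        if set(ranked_users[:k]) & answerers.get(question, set())
    )
    return hits / len(recommendations)


# Hypothetical ranked recommendations and true answerers.
recommendations = {
    "q1": ["u1", "u2", "u3"],
    "q2": ["u4", "u5", "u6"],
}
answerers = {"q1": {"u3"}, "q2": {"u9"}}
msc_2 = msc_at_k(recommendations, answerers, k=2)
msc_3 = msc_at_k(recommendations, answerers, k=3)
```

Larger values of k can only keep or raise the score, which is consistent with reporting it at several cut-offs.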


Best answer prediction

• When we recommend 100 users (out of 6.2M users) for each test question, in around 44% of cases one of them not only answers the question but also receives the highest vote.

Temporal illustrations

Figure: activity by month, day and hour, at the global level and at the user level; users in the same topic behave differently; one user in different topics


RQ5 Discussion

• TTEA: an extended LDA model to extract expertise, activity, and temporal dynamics.

• The extracted information could benefit question routing and expert detection tasks.

Agenda
• Backgrounds & Motivation
• RQ1: Formalize user-generated content
• Apply LDA on user-generated content
• RQ2: An efficient topic modeling method
• RQ3: Overlapping community detection
• RQ4: From a BOW to semantic labels
• RQ5: Temporal Topic Expertise Activity
• Conclusions and Perspectives

Overview of contributions

• temporal and semantic analysis of richly typed social networks from user-generated-content sites on the web

• key points:
– temporal analysis
– semantic analysis
– social networks
– user-generated content

Community/topic evolution

Topic Extraction

Community Detection

Question Answer site


Detailed answers to questions

• RQ1: QASM: formalize implicit and explicit content

• RQ2: TTD: a simple and fast topic modeling method

• RQ3: a TTD based overlapping community detection method

• RQ4: a DBpedia-based topic labeling method

• RQ5: TTEA: a joint model of Topic, Temporal, Expertise and Activity

Limitations & Perspectives

• RQ1: How to formalize UGC?
– formalize a single platform vs. cross-platform

• RQ2: How to detect topics?
– automatically generate tags from content

• RQ3: How to find overlapping communities?
– combine with graph-based community detection

• RQ4: How to generate labels for BOW?
– use extra knowledge bases or create links

• RQ5: How to extract temporal dynamics and expertise?
– use all the extracted information to provide more functionality

Publications

• Zide Meng, Fabien L. Gandon, Catherine Faron-Zucker, Ge Song. Detecting topics and overlapping communities in question and answer sites. Journal of Social Network Analysis and Mining. 2015

• Zide Meng, Fabien L. Gandon, Catherine Faron-Zucker: Overlapping Community Detection and Temporal Analysis on Q&A Sites. Journal of Web Intelligence and Agent Systems 2016. (to appear)

• Zide Meng, Fabien L. Gandon, Catherine Faron-Zucker: Joint model of topics, expertises, activities and trends for question answering web applications. IEEE/WIC/ACM 2016 (to appear)

• Zide Meng, Fabien L. Gandon, Catherine Faron-Zucker: Simplified detection and labeling of overlapping communities of interest in Q&A sites. IEEE/WIC/ACM Web Intelligence 2015

• Zide Meng, Fabien L. Gandon, Catherine Faron-Zucker. QASM: a Q&A Social Media system based on social semantics. ISWC 2014.

• Zide Meng, Fabien L. Gandon, Catherine Faron-Zucker, Ge Song: Empirical study on overlapping community detection in question and answer sites. ASONAM 2014: 344-348

• Jean-Michel Dalle, Catherine Faron-Zucker, Fabien L. Gandon, Mathieu Lacage, Zide Meng: Online Knowledge Triage: Searching, Detecting, Labelling and Orienting User Generated Content. WWW (Companion Volume) 2016


Thank you! zide.meng@inria.fr


• check when you use TAG and when you use WORD

• read reports of your reviewers and prepare answers : get ready

• documents of defense

Evaluations (1/4): Topic -> Perplexity

Figure: perplexity score against number of topics; our method is marked WE

93

Topic

Topic over tag distribution

0.75

0.23

0.02

Html

css

eclipse

very related

not related

webdevelopment

94

Temporal

Topic over time distribution

0.75

0.23

0.02

May

June

July

very popular

not popular

webdevelopment

95

Expertise

User’s Expertise over topic distribution

0.75

0.23

0.02

Web-Dev

Java-Dev

C#-Dev

has high expertise

has low expertise

96

Activity

Topic over user distribution

0.75

0.23

0.02

very active

not active

webdevelopment
