Abstract: In social networking services (SNSs), persistent topics are extremely rare and valuable. In this paper, we propose an algorithm for the detection of persistent topics in SNSs based on the Topic Graph. A topic graph is a subgraph of the ordinary social network graph that consists of the users who shared a certain topic up to some time point. Based on the assumption that the time-evolutions of the topic graphs associated with persistent and non-persistent topics are different, we propose to detect persistent topics by performing anomaly detection on the feature values extracted from the time-evolution of the topic graph. For anomaly detection, we use principal component analysis to capture the subspace spanned by normal (non-persistent) topics. We demonstrate our technique on a real data set we gathered from Twitter and show that it performs significantly better than a baseline method based on power law curve fitting and the linear influence model.
These are the slides I used when I presented the following paper: Shota Saito, Ryota Tomioka, and Kenji Yamanishi. Early Detection of Persistent Topics in Social Networks. In Proceedings of Advances in Social Networks Analysis and Mining, pp. xx-xx, 2014. The author copy of this paper is available from my website: sites.google.com/site/ssaito1989
Early Detection of Persistent Topics in Social Networks
Shota Saito (1), Ryota Tomioka (2), Kenji Yamanishi (1, 3)
(1) The University of Tokyo, (2) Toyota Technological Institute at Chicago, (3) JST, CREST
Agenda
Early Detection of Persistent Topics in Social Networks
1. Background and Related Work
2. Proposed Method
   1. Approach of the Proposed Method
   2. Mathematical Modelling
3. Experimental Results on Twitter Data
   1. Comparison with Existing Methods
   2. Effect of Feature Combination
4. Conclusion
Motivation
Long-term persistent topics have a long tail in the number of sharers
• Our goal: predict as early as possible whether or not a topic will be persistently shared
  Persistent topic: a topic shared persistently over a long term
• What is a "long-term persistent topic"?
  😕 Judge from the last shared date
  [Figure: # of sharers per unit time vs. time, with marks at 3 days, 10 days, and the present]
  It is not appropriate to regard this topic as a persistent one
• What is a "long-term persistent topic"?
  😃 Judge from the long tail of the number of sharers
  [Figure: # of sharers per unit time vs. time, with marks at 3 days, 10 days, and the present; the long tail is marked]
  But the long tail can be known only after a certain time has elapsed
  -> we would like to know by looking only at the early period
Motivation
Long-term persistent topics have a long tail in the number of sharers
• 698 topics retweeted over 500 times
• Plot the amplification factor ap, defined as
  ap = (# of RTs within 50 days) / (# of RTs within 10 days),
  against the # of RTs
  ap > 1.1 -> persistent (marked blue)
  ap < 1.1 -> non-persistent (marked red)
• The # of RTs doesn't matter
• Note that we can draw this picture only after 50 days have elapsed
  -> we would like to know as soon as possible
[Figure: ap against # of RTs; points labelled Persistent and Non-persistent]
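The labelling rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the input format (a list of retweet times, in days since the original post) are assumptions of this sketch, not something the slides specify.

```python
def amplification_factor(retweet_days, early=10, late=50):
    """ap = (# of RTs within `late` days) / (# of RTs within `early` days).

    `retweet_days` holds, for each retweet, the number of days elapsed
    since the original post (hypothetical input format).
    """
    n_early = sum(1 for d in retweet_days if d <= early)
    n_late = sum(1 for d in retweet_days if d <= late)
    if n_early == 0:
        raise ValueError("no retweets in the early window")
    return n_late / n_early


def is_persistent(retweet_days, threshold=1.1):
    """Label a topic persistent when ap exceeds the threshold (1.1 on the slide)."""
    return amplification_factor(retweet_days) > threshold


# Toy data: a burst that dies within 10 days, and the same burst with a long tail.
bursty = [0.1 * i for i in range(100)]
long_tail = bursty + [12.0 + i for i in range(20)]

print(amplification_factor(bursty), is_persistent(bursty))
print(amplification_factor(long_tail), is_persistent(long_tail))
```

Because the numerator counts retweets up to day 50, this label is only available after 50 days, which is why the talk looks for an early predictor.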
Motivation
Find "valuable topics" in social networks as early as possible
• Social networking services (SNSs) have been growing recently
  -> more and more "non-valuable" topics in SNSs
• What is a "valuable" topic in SNSs?
  😕 Topics shared by many people
  😕 Topics shared by influencers or authorised accounts
Motivation
Non-valuable topic example: posted by an influencer
[Figure: example post]
Indeed he is an influencer, but…
Got attention, but…
• What is a "valuable" topic in SNSs?
  😃 Topics shared for a long time: they survive persistently
  -> provide insights for predicting fashions or trends
  -> predict how a marketing campaign will go: success or not?
Motivation
Valuable topic example: not only in English, but also in other languages
Persistent topics are insightful
• Dropbox marketing campaign
• Emerging opinion leader or topic
  "Before I worked at Apple, I thought innovation was 'making something new.' But that is wrong: innovation is actually 'making something that will be ordinary in the future.' It takes time to understand the difference between the two."
• We want to know persistent topics as early as possible
  -> able to foresee the trends
Related Work
No existing work has focused on predicting persistent topics
• Analysis of topics that get a lot of attention: who contributes?
  Social friendship network in SNSs
  • Influencers [Cha+ 10]
  • Weak ties [Bakshy+ 12]
• Problem: these mainly focus on attention-getting topics
  -> they give barely any insight into persistent ones
Related Work
No existing work has focused on predicting persistent topics
• Topic Detection and Tracking (TDT)
  Find a topic from sequential documents [Kleinberg 02]
  Problem: mainly uses natural language techniques -> in SNSs, many languages are used
  Find a topic in Twitter from anomalous mention behaviour [Takahashi 11]
  Problem: finds bursting topics, not persistent ones
Approach to Our Proposed Method
A persistent topic has an anomalous time sequence of topic graphs
• Approach to the proposed method
  Previous: language
  Proposed: network
• In particular:
  Previous: the friendship network fixed in the SNS
  Proposed: a graph consisting of the users who share the topic: the Topic Graph
Approach to Our Proposed Method
Comparison between the graph of the whole SNS and a topic graph
• Existing work focuses on the graph made of users and their friendships over the whole SNS
• We focus on a topic graph, consisting of the users who post or share the topic and their friendships
• Note that a topic graph is a subgraph of the graph of the whole SNS
Assumption: a persistent topic has a different time-evolution of topic graphs from non-persistent topics
[Figure: time-evolution of topic graphs]
We focus on the time-evolution of topic graphs
Approach to Our Proposed Method
A persistent topic has an anomalous time sequence of topic graphs: an actual example
[Figure: time-evolution of topic graphs for actual topics]
The time-evolution of a persistent topic's topic graphs might be different from the others
Approach to Our Proposed Method
A persistent topic has an anomalous time sequence of topic graphs
Assumption: a persistent topic has a different time-evolution of topic graphs from the others
Our proposal:
• A method for picking out what differs from the majority: anomaly detection
  -> apply an anomaly detection method to the time-evolution of the topic graph
• To evaluate a topic graph: various feature values of complex networks
  -> use the time sequences of various feature values of the topic graph
Overview of Our Proposed Method
A persistent topic has an anomalous time sequence of topic graphs
Assumption: a persistent topic has a different time-evolution of topic graphs from the others
Our proposal:
1. Introduce feature values of complex networks to the topic graph
2. Utilise the time sequences of various feature values
3. Apply anomaly detection via PCA
Topic Graph
Define a topic graph as the graph consisting of the users who post or share the topic
• Let G be the topic graph of a topic at a given time
  Nodes: users who post or share the topic
  Edges: their friendships
[Figure: example topic graph; the user who posts is marked]
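The definition above amounts to taking an induced subgraph of the friendship graph. The following is a minimal sketch, assuming friendships are undirected pairs and share events are `(user, time)` tuples; both formats and all function names are hypothetical choices for illustration, and a real Twitter follow graph would be directed.

```python
def topic_graph(friendships, sharers):
    """Induced subgraph of the SNS friendship graph on the users who posted or shared."""
    nodes = set(sharers)
    edges = {
        frozenset((u, v))
        for u, v in friendships
        if u in nodes and v in nodes and u != v
    }
    return nodes, edges


def topic_graph_at(friendships, share_events, t):
    """Topic graph built from every user who posted or shared up to time t."""
    sharers = {user for user, time in share_events if time <= t}
    return topic_graph(friendships, sharers)


# Toy SNS: five users; "a" posts at time 0, "b" shares at 1, "d" shares at 5.
friendships = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("d", "e")]
share_events = [("a", 0), ("b", 1), ("d", 5)]

nodes_1, edges_1 = topic_graph_at(friendships, share_events, t=1)
nodes_5, edges_5 = topic_graph_at(friendships, share_events, t=5)
print(sorted(nodes_1), len(edges_1))
print(sorted(nodes_5), len(edges_5))
```

Calling `topic_graph_at` on a grid of time points gives the sequence of graphs whose evolution the method tracks.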
Topic Graph
Feature values we use: from global feature values to local feature values
• # of sharers
• # of communities
• Maximum distance from the origin (the user who posts)
• Eigenvalues of the graph Laplacian L, built from the adjacency matrix and the degree matrix
[Figure: example topic graphs illustrating each feature; the user who posts is marked]
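Three of these features can be computed as in the sketch below. It assumes the unnormalised Laplacian L = D − A (the slide only names the degree and adjacency matrices, so the exact form is an assumption), an adjacency-dict input, and hypothetical function names. The # of communities is left out because the slides do not say which community detection method is used.

```python
from collections import deque

import numpy as np


def max_distance_from_origin(adj, origin):
    """Largest shortest-path distance from the posting user, by breadth-first search."""
    dist = {origin: 0}
    queue = deque([origin])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def laplacian_eigenvalues(adj):
    """Eigenvalues (ascending) of L = D - A for an undirected graph (assumed form)."""
    nodes = sorted(adj)
    index = {u: i for i, u in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)))
    for u in nodes:
        for v in adj[u]:
            A[index[u], index[v]] = 1.0
    D = np.diag(A.sum(axis=1))
    return np.linalg.eigvalsh(D - A)


# Toy topic graph: the path b - a - d - e, where "a" is the user who posts.
adj = {"a": {"b", "d"}, "b": {"a"}, "d": {"a", "e"}, "e": {"d"}}

n_sharers = len(adj)
max_dist = max_distance_from_origin(adj, "a")
eig = laplacian_eigenvalues(adj)
largest, second_largest = eig[-1], eig[-2]
print(n_sharers, max_dist, largest, second_largest)
```

A dense eigendecomposition costs on the order of n³ for n sharers, which is consistent with the computation-cost concern raised in the future work slide.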
Track Evolution of Topic Graph
Set all the feature values as one vector
• To track the time-evolution of topic graphs -> set all the feature values as one vector
  Data on one topic: y = (# of sharers, # of communities, maximum distance, second largest GL eigenvalue, largest GL eigenvalue)^T
  where each entry is that feature's time series over time (h), and GL is the graph Laplacian
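Stacking the per-feature time series into one vector per topic might look like the following sketch. The dictionary keys, the function name, and the toy numbers are illustrative assumptions only; the feature order simply follows the slide.

```python
FEATURE_ORDER = [
    "n_sharers",
    "n_communities",
    "max_distance",
    "second_largest_eig",
    "largest_eig",
]


def feature_vector(series_by_feature, feature_order=FEATURE_ORDER):
    """Concatenate equally sampled feature time series into a single vector y."""
    lengths = {len(series_by_feature[name]) for name in feature_order}
    if len(lengths) != 1:
        raise ValueError("all features must be sampled at the same time points")
    y = []
    for name in feature_order:
        y.extend(float(x) for x in series_by_feature[name])
    return y


# Toy topic observed at three time points (made-up values).
series = {
    "n_sharers": [1, 4, 9],
    "n_communities": [1, 1, 2],
    "max_distance": [0, 1, 2],
    "second_largest_eig": [0.0, 1.0, 3.0],
    "largest_eig": [0.0, 4.0, 5.0],
}
y = feature_vector(series)
print(len(y), y[:3])
```

With T time points and five features, every topic becomes a point in a 5T-dimensional space, which is the input to the PCA step.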
Image of Anomaly Detection via PCA
Use the anomaly detection method via PCA proposed by Lakhina+
• Change the basis from the original axes to the principal components (PCs) [Pearson 1901]
  PC1: normal
  PC2: anomalous
• Judge from the norm of the projection onto the anomalous space [Lakhina+ 04]
  Data 1: not an anomalous input
  Data 2: an anomalous input
[Figure: Data 1 and Data 2 plotted against PC1 and PC2]
Principal Component Analysis (PCA)
Change the basis so as to "describe" the data well and not to "miss" the data
• Let $Y$ be the matrix made of the non-persistent topics' $y$s: $Y = (y_1, y_2, \ldots)^\top$
• Let $v_1$ be the first principal component of $Y$
• Repeat this procedure to obtain the $k$-th principal component $v_k$
• Compose the normal subspace $S$ by picking up principal components
  Use cumulative contribution values
• Compose the anomalous subspace $\tilde{S}$ from the principal components that are not picked up
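The formulas for the principal components did not survive in this transcript. For reference, the standard definitions, in the form used by Lakhina+ 04, are given below under the assumption that the columns of $Y$ are centred. The eigenvalue symbol $\lambda_k$ and the cut-off $c$ for the cumulative contribution are names introduced here, not taken from the slides.

```latex
% First principal component: the unit direction of maximum variance
v_1 = \arg\max_{\|v\| = 1} \| Y v \|

% k-th principal component: maximum variance left after removing v_1, ..., v_{k-1}
v_k = \arg\max_{\|v\| = 1} \left\| \left( Y - \sum_{i=1}^{k-1} Y v_i v_i^\top \right) v \right\|

% Cumulative contribution: keep the smallest r with
\frac{\sum_{k=1}^{r} \lambda_k}{\sum_{k} \lambda_k} \ge c ,
\qquad \lambda_k = \| Y v_k \|^2

% Normal subspace spanned by the kept components
S = \operatorname{span}\{ v_1, \ldots, v_r \}
```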
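Anomaly Detection via PCA
Judge from the projection of the data onto the anomalous space
• Decompose the input data $y$ into
  $y = \hat{y} + \tilde{y}$, $\hat{y} \in S$, $\tilde{y} \in \tilde{S}$
• To obtain this, project $y$ onto each subspace; $\tilde{C}$ denotes the projection onto the anomalous space, so that $\tilde{y} = \tilde{C} y$
• Judge a topic as not anomalous if $y$ is not projected strongly enough onto the anomalous space; if
  $\|\tilde{y}\| = \|\tilde{C} y\| > \delta_{\mathrm{PCA}}$
  for a threshold $\delta_{\mathrm{PCA}}$, then the topic $y$ is anomalous, i.e. persistent [Lakhina+ 04]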
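The subspace method can be sketched with NumPy as follows. This is not the authors' implementation: the use of an SVD, the centring step, the 0.9 cumulative-contribution cut-off, and the function names are all assumptions made for illustration.

```python
import numpy as np


def fit_normal_subspace(Y, contribution=0.9):
    """Fit the normal subspace from training rows (non-persistent topics).

    Keeps the smallest number of principal components whose cumulative
    contribution reaches `contribution` (the cut-off value is an assumption).
    """
    mean = Y.mean(axis=0)
    _, s, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    r = int(np.searchsorted(ratio, contribution)) + 1
    P = Vt[:r].T  # columns are the principal components spanning S
    return mean, P


def anomaly_score(y, mean, P):
    """Norm of the component of y lying outside the normal subspace."""
    centered = y - mean
    residual = centered - P @ (P.T @ centered)
    return float(np.linalg.norm(residual))


# Toy training data: all normal topics vary along the direction (1, 1, 0).
Y = np.array([[t, t, 0.0] for t in range(10)])
mean, P = fit_normal_subspace(Y)

score_normal = anomaly_score(np.array([20.0, 20.0, 0.0]), mean, P)
score_anomalous = anomaly_score(np.array([5.5, 3.5, 0.0]), mean, P)
print(score_normal, score_anomalous)
```

A topic would be flagged as persistent when its score exceeds the threshold. Since the experiments evaluate by AUC, the raw score can be used for ranking without fixing a threshold.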
Experiment 1: Evaluation of the Proposed Method
Predict whether a topic is persistent or not
• Experiment using Twitter data
  Topics: tweets retweeted by over 500 users for which 50 days have passed
  698 tweets retweeted by 1.6M users
  Amplification factor: (# of RTs within 50 days) / (# of RTs within 10 days) > 1.1 -> persistent topic
• Goal: predict whether a topic is persistent or not, looking only at the early period of the topic
  [Figure: timeline with marks at the 1st, 3rd, 5th and 10th day (want to know in the early period) and after the 50th day (able to know the answer)]
• Evaluate our method and the comparison methods by AUC
Experiment 1: Evaluation of the Proposed Method
Evaluation criterion: AUC
• AUC: a kind of accuracy measure for a classifier
  AUC is the area under a curve. For a parameter of the classifier, plot
  Vertical axis: true positive rate
  Horizontal axis: false positive rate
  and draw the curve by moving the parameter from -∞ to +∞
• Characteristics
  • The larger the AUC, the better the performance
  • AUC is 0.5 if you classify randomly
  • AUC is 1.0 for the best possible classifier
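The area under this curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. That equivalence gives a short way to compute AUC, sketched below; the function name and the quadratic pairwise loop are choices made for clarity, not for speed.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    `labels` are truthy for persistent topics and falsy otherwise.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


print(auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # perfect ranking
print(auc([0.9, 0.3, 0.5, 0.1], [1, 1, 0, 0]))  # one of four pairs misordered
```

Because only the ordering of scores matters, no decision threshold has to be fixed to compare methods this way.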
Experiment 1: Evaluation of the Proposed Method
Experiment on our method
• Divided the 698 topics into training data and test data
  Test data: the 109 persistent topics and 109 topics picked randomly from the 589 non-persistent topics
  Training data: the remaining 480 non-persistent topics
  [Figure: split of the 589 non-persistent and 109 persistent topics into training data and test data]
• Tried several sampling intervals: 1 hr, 3 hrs, 6 hrs, 12 hrs
• Use the anomaly scores and compute AUC
• Repeat 200 times
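One repetition of this split can be sketched as below. The topic identifiers, the seeded generator, and the function name are hypothetical; the slides state only the sizes of the sets and the number of repetitions.

```python
import random


def split_topics(persistent, non_persistent, rng):
    """One random split: a balanced test set, and training on non-persistent topics only."""
    test_negatives = rng.sample(non_persistent, len(persistent))
    chosen = set(test_negatives)
    train = [t for t in non_persistent if t not in chosen]
    test = [(t, 1) for t in persistent] + [(t, 0) for t in test_negatives]
    return train, test


# Placeholder topic ids with the sizes from the slide: 109 persistent, 589 non-persistent.
persistent_ids = list(range(109))
non_persistent_ids = list(range(1000, 1589))

rng = random.Random(0)
splits = [split_topics(persistent_ids, non_persistent_ids, rng) for _ in range(200)]
train, test = splits[0]
print(len(train), len(test))
```

Training on non-persistent topics only matches the anomaly-detection framing: the normal subspace is learnt from the majority class, and persistent topics appear only at test time.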
Experiment 1: Comparison Method 1
Experiment on two comparison methods: power law curve fitting
• Comparison method 1: power law curve fitting
  Ground truth of the long tail: fit a power law curve to the difference sequence of the # of retweets
  Estimate $\alpha$ and $\beta$ of
  $n_t = \beta t^{\alpha}$
  where $n_t$ is the # of sharers per unit time at time $t$
• Fit to the 218 test data
• Use $\alpha$ and compute AUC
• Repeat 200 times
Experiment 1: Comparison Method 1
Experiment on two comparison methods: power law curve fitting: an actual example
[Figure: an example of power law curve fitting to a persistent topic]
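A power law of the form n_t = β t^α, with exponent α and scale coefficient β (β is this sketch's name for the coefficient), becomes a straight line in log-log space, so one simple way to estimate the exponent is ordinary least squares on the logarithms. The slides do not say which fitting procedure was used, so the method below, the time index starting at t = 1, and the function name are assumptions.

```python
import math


def fit_power_law(counts):
    """Least-squares fit of log n_t = log(beta) + alpha * log(t), for t = 1, 2, ...

    `counts[t - 1]` is the # of sharers per unit time at time t.
    Time points with a zero count are skipped because log(0) is undefined.
    """
    points = [
        (math.log(t), math.log(n))
        for t, n in enumerate(counts, start=1)
        if n > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two non-zero counts")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    alpha = sxy / sxx
    beta = math.exp(mean_y - alpha * mean_x)
    return alpha, beta


# Synthetic decay n_t = 100 * t^(-1.5): the fit should recover the parameters.
counts = [100.0 * t ** -1.5 for t in range(1, 21)]
alpha, beta = fit_power_law(counts)
print(alpha, beta)
```

A slowly decaying tail corresponds to an exponent closer to zero, which is why the fitted exponent can serve as a persistence score for the AUC comparison.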
Experiment 1: Comparison Method 2
Experiment on two comparison methods: Linear Influence Model
• Comparison method 2: Linear Influence Model (LIM) [Yang and Leskovec 10]
  Able to predict the # of future retweets
  Predicts it as a superposition of users' strengths of influence learnt from the past
• Predict the # of retweets within 50 days and the # of retweets within 10 days
• Use (# of retweets within 50 days) / (# of retweets within 10 days) and compute AUC
Experiment 1: Result
Our method outperforms the two comparison methods in most cases
[Figure: results for the 1-hour sampling interval and for the best sampling interval; an arrow marks the "Better" direction]
The proposed method outperforms the comparison methods if we have enough data points
Experiment 2: Effect of Feature Combination
Use only one feature value to compose the normal subspace
• Which feature value contributes to detecting persistent topics?
  Same procedure as Experiment 1 with the proposed method, but using only one of the features
• Evaluate by AUC
  If the AUC is high -> that feature contributes to the accuracy of prediction
  If the AUC is low -> that feature doesn't contribute to the accuracy of prediction
Experiment 2: Result
Our strategy might perform better if we incorporate as many features as possible
[Figure: results per feature for the 1-hour sampling interval; an arrow marks the "Better" direction]
• Some features, like # of communities, perform strongly
• Some features, like maximum distance, perform poorly
• Overall, using all the features performs better
Conclusion
• Problem: how to predict whether a topic is persistent or not as soon as possible
• Proposed a method based on the time-evolution of the Topic Graph
• Showed good performance on Twitter data
Future Work
• Set a threshold to judge whether a topic is persistent or not
• Feature values of the Topic Graph
  How each feature value contributes to a topic becoming persistent
  Implications for the connection between the graph and the real world
  Feature value selection
  • Eigenvalues of the graph Laplacian are not a realistic feature value from the perspective of computation cost
• Another method on the Topic Graph
  A supervised method