User Behavior Modeling on Financial Message Boards
Pritha D.N
Sahaj Biyani
December 9, 2015
Introduction
Investors Hub
Objective
• To identify the roles users assume in these message board forums.
• Validate the “90-9-1 Rule for Participation Inequality” in the message boards community.
Dataset
• Free US listed stocks message boards
• Time Period: January, 2001 - June, 2015
• Total Message Boards: 6,278
• Total Users: 52,558
• Total Posts: 5,624,024
Dataset Analysis
• Percentage of initiated posts: 30%
• 19% of users did not initiate any post.
• 80% of users initiated fewer than 20 posts.
Dataset Analysis
• Number of boards user participated in:
• 56% of users are active only on 1 board.
• 90% of users are active on fewer than 20 boards.
Dataset Analysis
• Average response time of replies a user makes:
Dataset Analysis
• Number of posts across boards:
• 80% of posts were made on fewer than 200 boards.
• 1,000 of the 6,278 boards account for 90% of posts made.
Features
1. Number of threads a user initiated over time
2. Number of replies a user made over time
3. Number of users a user replies to
4. Number of users who reply to a user
5. Number of boards a user is active on
6. Number of followers
7. Replier share, AVG[proportion of replies a user gets on a board]
8. Reply share, AVG[proportion of replies a user makes on a board]
9. Average response time
10. Volume of content the user posted
11. Number of links the user posted
Feature groups: Activity of User, User Network Structure, Content Related
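Features 7 and 8 are per-board proportions averaged over the boards a user is active on. A minimal sketch of feature 8 (reply share) follows, assuming replies are available as (board, replier) pairs. The function name, the input layout, and the toy data are hypothetical and not taken from the study.

```python
from collections import defaultdict

def reply_share(replies):
    """Average, over a user's boards, of the fraction of that board's replies the user made.

    `replies` is an iterable of (board, replier) pairs, one pair per reply post
    (an assumed layout for illustration).
    """
    replies_on_board = defaultdict(int)
    replies_by_user_on_board = defaultdict(int)
    for board, user in replies:
        replies_on_board[board] += 1
        replies_by_user_on_board[(board, user)] += 1

    # Collect one proportion per (user, board) pair, then average per user.
    proportions = defaultdict(list)
    for (board, user), count in replies_by_user_on_board.items():
        proportions[user].append(count / replies_on_board[board])
    return {user: sum(p) / len(p) for user, p in proportions.items()}

# Toy data: two boards with four replies each.
toy_replies = [
    ("AAA", "u1"), ("AAA", "u1"), ("AAA", "u2"), ("AAA", "u2"),
    ("BBB", "u1"), ("BBB", "u3"), ("BBB", "u3"), ("BBB", "u3"),
]
shares = reply_share(toy_replies)
```

Replier share (feature 7) can be computed the same way by pairing each reply with the user being replied to instead of the replier.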
Methodology
• Data Preprocessing
• Feature Selection/Extraction
• Clustering
• Role Inference
Data Preprocessing
• We use Min-Max Normalization
• Normalize data to the range [0, 1]
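Min-max normalization maps each feature x to (x - min) / (max - min). A minimal sketch on one feature column follows. The handling of a constant column (mapped to 0) is an assumption, and the sample values are made up.

```python
def min_max_normalize(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant feature carries no information, so map it to 0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

# Hypothetical values of one feature (e.g. threads initiated) for four users.
threads_initiated = [0, 5, 20, 100]
scaled = min_max_normalize(threads_initiated)
```

Each of the 11 features would be scaled independently so that heavy-tailed counts do not dominate the distance computations used later in clustering.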
Feature Selection
• Step 1 – Feature Extraction
  • Do Principal Component Analysis
  • Do K-means on the projected data and extract feature labels
• Step 2 – Feature importance using Random Forest classifier
Feature Extraction using PCA
Principal Component   % Variance   Cumulative % Variance
 1                    62.16         62.16
 2                    15.07         77.23
 3                     7.95         85.18
 4                     5.74         90.92
 5                     3.57         94.49
 6                     1.67         96.16
 7                     1.48         97.64
 8                     0.68         98.32
 9                     0.59         98.91
10                     0.55         99.46
11                     0.54        100.00
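The cumulative column is the running sum of the per-component variance percentages. The sketch below recomputes it from the values in the table and finds how many components are needed to reach a given threshold. The 90% threshold is an illustrative assumption; the slides do not state how many components were retained.

```python
from itertools import accumulate

# Per-component % variance, taken from the PCA table.
variance_pct = [62.16, 15.07, 7.95, 5.74, 3.57, 1.67, 1.48, 0.68, 0.59, 0.55, 0.54]

# Running sum reproduces the cumulative column.
cumulative_pct = list(accumulate(variance_pct))

def components_needed(variance_pct, threshold):
    """Smallest number of leading components whose variance sums to at least `threshold`."""
    total = 0.0
    for k, value in enumerate(variance_pct, start=1):
        total += value
        if total >= threshold:
            return k
    return len(variance_pct)

n_for_90 = components_needed(variance_pct, 90.0)  # 90.0 is an assumed cutoff
```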
Scree Plot
Choosing the number of clusters
Elbow Plot
• Plot the Within Group Sum of Squares versus K, and look at the “elbow-point” in the plot.
• The first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph.
• Choose the number after the last big drop.
• This "elbow" cannot always be unambiguously identified.
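One way to make "the number after the last big drop" concrete is to call a drop big when it is at least some fraction of the largest drop. The sketch below does that. The 25% fraction is an arbitrary assumption, and the within-group sums of squares are made-up values, not results from the study.

```python
def elbow_k(wcss_by_k, big_fraction=0.25):
    """Return the K reached by the last 'big' drop in within-group sum of squares.

    A drop counts as big if it is at least `big_fraction` of the largest drop
    (a heuristic threshold chosen for illustration).
    """
    ks = sorted(wcss_by_k)
    # drops[k] is the decrease in WCSS when moving from the previous K to k.
    drops = {k: wcss_by_k[prev] - wcss_by_k[k] for prev, k in zip(ks, ks[1:])}
    largest = max(drops.values())
    big_ks = [k for k, drop in drops.items() if drop >= big_fraction * largest]
    return max(big_ks)

# Made-up WCSS values for K = 1..7 with a visible bend at K = 4.
toy_wcss = {1: 1000.0, 2: 520.0, 3: 300.0, 4: 150.0, 5: 130.0, 6: 118.0, 7: 110.0}
chosen_k = elbow_k(toy_wcss)
```

Because the threshold is arbitrary, a different `big_fraction` can move the answer, which mirrors the point that the elbow is not always unambiguous.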
Silhouette Coefficient
a(i) is the average dissimilarity of point i to all other data within the same cluster.
b(i) is the lowest average dissimilarity of point i to any other cluster, of which i is not a member.
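With these two quantities, the standard silhouette coefficient of a point is s(i) = (b(i) - a(i)) / max(a(i), b(i)), which lies between -1 and 1. A minimal sketch follows, assuming Euclidean distance as the dissimilarity and assigning 0 to points in singleton clusters. Both choices, and the toy points, are assumptions.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean dissimilarity)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: s(i) = 0 by convention
            continue
        # a(i): mean distance to the other members of i's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(toy_points, [0, 0, 1, 1])  # nearby points grouped together
bad = silhouette_score(toy_points, [0, 1, 0, 1])   # distant points grouped together
```

A higher mean score indicates tighter, better separated clusters, so the score can be compared across candidate values of K.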
Feature Selection
• Train a Random Forest classifier using all the features and labels assigned by K-means.
• Feature importance is defined as the total decrease in node impurity (weighted by the probability of reaching that node, which is approximated by the proportion of samples reaching that node) averaged over all trees of the ensemble.
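The weighted impurity decrease can be illustrated for a single split. The sketch below assumes Gini impurity (the slides do not name the impurity measure) and uses made-up labels. Under the definition above, a feature's importance in one tree would be the sum of this quantity over the nodes that split on that feature, then averaged across trees.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of samples reaching the node."""
    n_node = len(parent)
    decrease = (
        gini(parent)
        - (len(left) / n_node) * gini(left)
        - (len(right) / n_node) * gini(right)
    )
    return (n_node / n_total) * decrease

# Root node: all 4 samples reach it and the split separates the classes perfectly.
root_gain = weighted_impurity_decrease([0, 0, 1, 1], [0, 0], [1, 1], n_total=4)

# Deeper node: only 2 of the 4 samples reach it, so the same decrease counts half.
deep_gain = weighted_impurity_decrease([0, 1], [0], [1], n_total=4)
```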
Clustering Users
• Applied K-Means clustering with K=4.
• Run 10 times with different seeds.
• 300 iterations in a single run.
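The setup above (K=4, 10 seeded runs, up to 300 iterations per run) can be sketched in plain Python as follows. This is an illustrative implementation, not the authors' code. Picking the run with the lowest within-cluster sum of squares, initializing centers from randomly sampled points, and the toy points are all assumptions.

```python
import math
import random

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c])) for p in points]

def kmeans_once(points, k, max_iter, rng):
    """One K-means run; returns (within-cluster sum of squares, labels, centers)."""
    centers = rng.sample(points, k)  # assumed initialization: k distinct data points
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    labels = assign(points, centers)
    inertia = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return inertia, labels, centers

def kmeans(points, k=4, n_runs=10, max_iter=300, seed=0):
    """Repeat K-means with different seeds and keep the lowest-inertia run."""
    best = None
    for run in range(n_runs):
        result = kmeans_once(points, k, max_iter, random.Random(seed + run))
        if best is None or result[0] < best[0]:
            best = result
    return best

# Toy 2-D points forming four loose groups (made up for illustration).
toy_points = [
    (0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0),
    (0, 10), (0, 11), (1, 10), (10, 10), (10, 11), (11, 10),
]
best_inertia, best_labels, best_centers = kmeans(toy_points, k=4, n_runs=10, max_iter=300, seed=0)
```

Repeating the run with different seeds matters because a single K-means run can settle in a poor local optimum depending on its starting centers.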
Clusters    User Count   % of Users
Cluster 1   47295        91.7
Cluster 2   360          0.73
Cluster 3   3322         6.44
Cluster 4   581          1.13
Cluster Analysis
Initiation of Posts by users of each cluster
Clusters    Post Initiation Share   Initiation Per User
Cluster 1   30%                     10.9
Cluster 2   22%                     1066.6
Cluster 3   44%                     228.2
Cluster 4   3%                      98.3
Cluster Analysis
Replies by users of each cluster
Clusters    Reply Share   Reply Per User
Cluster 1   22%           17.5
Cluster 2   28%           2946.3
Cluster 3   47%           534.9
Cluster 4   4%            255.9
Cluster Analysis
[Figure: inter-cluster reply percentages for Cluster 1 through Cluster 4]
Cluster Analysis
Feature 3: Number of users a user replies to
Cluster Analysis
Feature 4: Number of users who reply to a user
Role Inference
• Cluster 1: Lurkers
  • Posts initiated per user and replies made per user are both very low.
• Cluster 2: Super Users
  • Very active. Contribute most to the boards. Engage with a lot of users.
• Cluster 3: Contributors
  • Account for 45% of total post initiations and 46% of total replies made. Respond very fast (low average response time). Backbone of the forum.
• Cluster 4: Taciturns
  • Keep to themselves. Initiate very few posts but reply often, mostly to users in their own cluster.
Participation Inequality
Role           % of Users   Content Contributed (%)
Lurkers        91.73        24
Super-Users    0.73         26
Contributors   6.44         46
Taciturns      1.13         4
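Participation inequality can be summarized by dividing each role's share of content by its share of users. The sketch below uses the values read from the chart above; the ratio itself is an illustrative measure and does not appear in the slides.

```python
# (share of users in %, share of content in %) per role, as read from the chart.
roles = {
    "Lurkers": (91.73, 24),
    "Super-Users": (0.73, 26),
    "Contributors": (6.44, 46),
    "Taciturns": (1.13, 4),
}

def contribution_ratio(user_pct, content_pct):
    """Content share divided by user share; 1.0 would mean proportional participation."""
    return content_pct / user_pct

ratios = {role: contribution_ratio(users, content) for role, (users, content) in roles.items()}

# Roles ordered from most to least over-represented in content.
ranking = sorted(ratios, key=ratios.get, reverse=True)
```

A ratio far above 1 marks a role that produces much more content than its head count suggests, and a ratio below 1 marks the opposite.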
Conclusion
• Users take on different roles in online communities, and clusters of users can be identified by their behavioral patterns.
• Participation Inequality exists on financial message boards.
Thank You!