Final Project: Analyzing Reddit Data to Determine Popularity
Project Background: The Problem

Problem: Predict post popularity, where the target/label is based on a transformed score metric.

Algorithms/Models Applied:
• SVC
• Random Forests
• Logistic Regression
Project Background: The Data

Data: The top 1,000 posts from the top 2,500 subreddits, 2.5 million posts in total. The top subreddits were determined by subscriber count. The data was pulled during August 2013 and broken out into 2,500 .csv files, one per subreddit (see the loading sketch below).

Data Structure (22 Columns):
• created_utc - Float
• score - Integer
• domain - Text
• id - Integer
• title - Text
• author - Text
• ups - Integer
• downs - Integer
• num_comments - Integer
• permalink (aka the reddit link) - Text
• self_text (aka body copy) - Text
• link_flair_text - Text
• over_18 - Boolean
• thumbnail - Text
• subreddit_id - Integer
• edited - Boolean
• link_flair_css_class - Text
• author_flair_css_class - Text
• is_self - Boolean
• name - Text
• url - Text
• distinguished - Text
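A minimal sketch of loading and combining the per-subreddit files, assuming pandas and a hypothetical data/ directory holding the 2,500 CSVs:

import glob
import os
import pandas as pd

frames = []
for path in glob.glob("data/*.csv"):  # one file per subreddit
    df = pd.read_csv(path)
    # tag each row with its source subreddit (taken from the file name)
    df["subreddit"] = os.path.splitext(os.path.basename(path))[0]
    frames.append(df)

posts = pd.concat(frames, ignore_index=True)  # ~2.5 million rows, 22 columns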
Project Background: The Data - Removed

[This slide repeats the 22-column list above with the columns removed during cleaning struck out; the strikethrough formatting does not survive in this text version.]
Reviewing the Data: Subreddit Topics
AnimalsWithoutNecks
BirdsBeingDicks
CemeteryPorn
CoffeeWithJesus
datasets
dataisbeautiful
FortPorn
learnpython
MachineLearning
misleadingthumbnails
Otters
PenmanshipPorn
PowerWashingPorn
ShowerBeer
StonerPhilosophy
talesfromtechsupport
TreesSuckingAtThings
Reviewing the Data: Top Domains
Top 5 domains by post count:
• imgur.com: 773,969
• youtube.com: 188,526
• reddit.com: 25,445
• flickr.com: 17,854
• soundcloud.com: 10,397

Rounding out the top 15: quickmeme.com, i.minus.com, twitter.com, amazon.com, qkme.com, vimeo.com, wikipedia.org, nytimes.com, guardian.co.uk, bbc.co.uk
Reviewing the Data: Most Have No Body Text

Posts rely primarily on the title plus related media content from the aforementioned domains: a link, gif, image, video, etc.

Over 1.6 million posts, approximately 74% of the data set, had no body copy/text and appeared as NaN values.
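A quick way to verify the missing-body figure, assuming the combined posts frame from the loading sketch:

no_body = posts["self_text"].isna()
print(no_body.sum())   # ~1.6 million posts with no body text
print(no_body.mean())  # ~0.74 of the data set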
Reviewing the Data: Time-Based Data

[Bar chart: post count by month, January through December; y-axis 0-300,000]

Winter Months Saw a Dip; Fall Could Be Underrepresented Given the Data Was Pulled in August
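The month, weekday, and hour breakdowns on these slides can be derived from the created_utc epoch timestamp; a minimal sketch, assuming the posts frame from earlier:

import pandas as pd

created = pd.to_datetime(posts["created_utc"], unit="s")  # epoch float -> datetime
posts["month"] = created.dt.month_name()
posts["weekday"] = created.dt.day_name()
posts["hour"] = created.dt.hour
print(posts["month"].value_counts())  # post volume by month, as charted above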
Reviewing the Data: Time-Based Data

[Bar chart: post count by day of week, Monday through Sunday; y-axis 0-400,000]

Tuesday Is Slightly the Favorite Day to Post, While the Weekend Sees a Dip
Reviewing the Data: Time-Based Data

[Bar chart: post count by hour of day, 12am through 11pm; y-axis 0-160,000]

Reddit While You Work: Post Volume Picks Up Around 9-10am, Peaking at 12pm, Then Dropping Off Throughout the Afternoon
0"20000"40000"60000"80000"
100000"120000"140000"160000"180000"200000"
50)99"
100)199"
200)299"
300)399"
400)499"
500)999"
1000)4999"
5000)9999"
10000+"
Score&Counts&
11
Reviewing the Data: Determining Popularity
~15% of posts
Note - Only about half the data because iPython was unable to run a histogram so needed to export and conduct in excel
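The deck doesn't spell out the score transform behind the label; a minimal sketch of one plausible version, a quantile cutoff chosen so roughly 15% of posts count as popular (the cutoff is an assumption):

# Hypothetical labeling: top ~15% of posts by score are "popular"
cutoff = posts["score"].quantile(0.85)
posts["popular"] = (posts["score"] >= cutoff).astype(int)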
Analyzing the Data: Issues

Issue: The initial data set was large (2.5 million rows), and transformation (CountVectorizer and TF-IDF) expanded it to almost 100,000 columns, which caused problems processing the data locally on my machine. In the end I was only able to run about 1% of the data through the algorithms.
• Even with this smaller subset, processing could take anywhere from 30 minutes to several hours, making experimenting with the data extremely hard.

Future: Explore platforms that are better at handling large data sets, such as PySpark. I tried to process the data with PySpark but ran into technical issues that I couldn't address in time.
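A minimal sketch of the CountVectorizer + TF-IDF expansion described above, assuming the titles are the text being vectorized (the deck doesn't say which fields were used):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

counts = CountVectorizer().fit_transform(posts["title"].fillna(""))
X = TfidfTransformer().fit_transform(counts)  # sparse matrix, ~100,000 columns on the full set
y = posts["popular"]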
Analyzing the Data: SVC

[Bar chart: accuracy by kernel - Linear, Poly, Sigmoid, RBF; y-axis 0.88-0.94]
[Bar chart: accuracy with linear kernel by C value - 0.001, 0.01, 0.1; y-axis 0.92-0.938]

Linear kernel = .9368. C value of .1 = .9363.
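A minimal sketch of the kernel and C sweeps charted above, assuming the X and y from the vectorization sketch and a holdout split; the parameter grids match the chart labels:

from sklearn import svm
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Compare kernels at the default C
for kernel in ["linear", "poly", "sigmoid", "rbf"]:
    clf = svm.SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))

# Sweep C with the linear kernel
for C in [0.001, 0.01, 0.1]:
    clf = svm.SVC(kernel="linear", C=C).fit(X_train, y_train)
    print(C, clf.score(X_test, y_test))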
Analyzing the Data: Random Forests

[Line chart: accuracy by n_estimators - 5, 10, 20, 50, 100, 125, 150; y-axis 0.885-0.925]
[Line chart: accuracy by max_depth - 5, 40, 100, 150, 200, 250, 300; y-axis 0.885-0.93]

N estimators of 125 = .922. Max depth of 250 = .924.
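A minimal sketch of the two sweeps above, assuming the same split; the grids match the chart labels:

from sklearn import ensemble

# Sweep n_estimators
for n in [5, 10, 20, 50, 100, 125, 150]:
    clf = ensemble.RandomForestClassifier(n_estimators=n).fit(X_train, y_train)
    print(n, clf.score(X_test, y_test))

# Sweep max_depth at the best n_estimators
for depth in [5, 40, 100, 150, 200, 250, 300]:
    clf = ensemble.RandomForestClassifier(n_estimators=125, max_depth=depth).fit(X_train, y_train)
    print(depth, clf.score(X_test, y_test))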
Analyzing the Data: Logistic

[Line chart: accuracy by C value - 0.001, 0.01, 0.1, 1, 10, 50; y-axis 0.925-0.95]
[Bar chart: accuracy by penalty - L1 vs. L2; y-axis 0.9469-0.9474]

C value of 1 = .9471. Penalties: L1 = .947733, L2 = .947066.
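A minimal sketch of the C sweep and penalty comparison, assuming the same split; the liblinear solver is an assumption, chosen because it supports both penalties:

from sklearn import linear_model

# Sweep C
for C in [0.001, 0.01, 0.1, 1, 10, 50]:
    clf = linear_model.LogisticRegression(C=C, solver="liblinear").fit(X_train, y_train)
    print(C, clf.score(X_test, y_test))

# Compare penalties at the best C
for penalty in ["l1", "l2"]:
    clf = linear_model.LogisticRegression(C=1, penalty=penalty, solver="liblinear").fit(X_train, y_train)
    print(penalty, clf.score(X_test, y_test))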
Totally Crushing It!
Analyzing the Data: Classification Report

[Classification report screenshots: Random Forests, Logistic Regression, SVC]
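The reports were presumably generated along these lines; a minimal sketch with sklearn.metrics, assuming a fitted clf from one of the sweeps above:

from sklearn.metrics import classification_report

# Per-class precision/recall/F1 -- the view that exposes problems accuracy hides
print(classification_report(y_test, clf.predict(X_test)))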
Soooo Not Crushing It
Feature Reduction: Accuracy

All features:
• Random Forests: 92.4%
• SVC: 93.63%
• Logistic Regression: 94.71%

Reduced features (across the three models): 94.3%, 94.5%, 95.2%
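The deck doesn't name the reduction method; one common choice for sparse, non-negative text features is chi-squared selection, sketched hypothetically here (k=1000 is an arbitrary assumption):

from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical: keep the 1,000 features most associated with the label
selector = SelectKBest(chi2, k=1000).fit(X_train, y_train)
X_train_red = selector.transform(X_train)
X_test_red = selector.transform(X_test)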
Feature Reduction: Classification Report

[Classification report screenshots for all six runs: Random Forests, Logistic Regression, and SVC, each with all features and with reduced features]
Next Steps

Dealing with the processing issues:
• Learn and try out PySpark

Answer some additional questions:
• Reevaluate how I handle the domains
  • I originally bucketed domains by their frequency of occurrence in the data set. However, given that the originating domain and the title make up the majority of the "post," and the top 15 domains account for the vast majority of posts, I want to focus on posts from those ~15 domains to get a better picture of how they explicitly affect popularity.
• Run the data with varying n-gram levels (see the sketch below)
  • I tried them, but they expanded the columns into the hundreds of thousands, which just seemed to freeze processing; hopefully something like PySpark will help.
• Predict subreddit/category questions:
  • Can I predict the category of a post?
  • Do certain subreddits produce more popular content overall than others? Bears With Beaks vs. ggggg (whatever the hell that is)
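A minimal sketch of the n-gram idea from the list above, assuming the same title corpus; note how quickly the column count grows:

from sklearn.feature_extraction.text import CountVectorizer

# Unigrams + bigrams; columns grow into the hundreds of thousands
bigram_counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(posts["title"].fillna(""))
print(bigram_counts.shape)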
APPENDIX
0"20000"40000"60000"80000"
100000"120000"140000"160000"180000"200000"
50)99"
100)199"
200)299"
300)399"
400)499"
500)999"
1000)4999"
5000)9999"
10000+"
Score&Counts&
Reviewing the Data: Reevaluate Popularity

Under stricter thresholds, approximately 12% and approximately 8% of posts qualify as popular.

Note: the chart covers only about half the data; iPython was unable to render the histogram, so the data was exported and the chart built in Excel.
Analyzing the Data: SVC

[Line chart: accuracy score by C value - 0.001, 0.01, 0.1, 1, 10, 50; y-axis 0.688-0.71]

C value of .1 = 0.7077

[Confusion matrix]
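The confusion matrices in the appendix were presumably produced along these lines; a minimal sketch, assuming a fitted clf:

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, clf.predict(X_test)))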
Analyzing the Data: Random Forest

[Line chart: accuracy score by n_estimators - 5, 10, 20, 50, 100, 125; y-axis 0.77-0.83]
[Line chart: accuracy score by max_depth - 40, 100, 150, 200, 250; y-axis 0.775-0.83]

N estimators of 100 = 0.8218. Max depth of 200 = 0.8247.

[Confusion matrix]
Analyzing the Data: Logistic

[Line chart: accuracy by C value - 0.001, 0.01, 0.1, 1, 10, 50; y-axis 0.81-0.85]

C = 1, penalty = L2: C of 1 = .8453

[Confusion matrix]