Final Project: Analyzing Reddit Data to Determine Popularity


Page 1

Final Project: Analyzing Reddit Data to Determine Popularity

Page 2

Project Background: The Problem

Problem: Predict post popularity where the target/label is based on a transformed score metric

Algorithms / Models Applied:
• SVC
• Random Forests
• Logistic Regression

Page 3

Project Background: The Data

Data: The top 1,000 posts from each of the top 2,500 subreddits, 2.5 million posts in total. The top subreddits were determined by subscriber count. The data was pulled during August 2013 and broken out into 2,500 .csv files, one per subreddit.

Data Structure (22 Columns):
• created_utc - Float
• score - Integer
• domain - Text
• id - Integer
• title - Text
• author - Text
• ups - Integer
• downs - Integer
• num_comments - Integer
• permalink (aka the reddit link) - Text
• self_text (aka body copy) - Text
• link_flair_text - Text
• over_18 - Boolean
• thumbnail - Text
• subreddit_id - Integer
• edited - Boolean
• link_flair_css_class - Text
• author_flair_css_class - Text
• is_self - Boolean
• name - Text
• url - Text
• distinguished - Text

Page 4

Project Background: The Data - Removed

Data: The top 1,000 posts from each of the top 2,500 subreddits, 2.5 million posts in total. The top subreddits were determined by subscriber count. The data was pulled during August 2013 and broken out into 2,500 .csv files, one per subreddit.

Data Structure:
• created_utc - Float
• score - Integer
• domain - Text
• id - Integer
• title - Text
• author - Text
• ups - Integer
• downs - Integer
• num_comments - Integer
• permalink (aka the reddit link) - Text
• self_text (aka body copy) - Text
• link_flair_text - Text
• over_18 - Boolean
• thumbnail - Text
• subreddit_id - Integer
• edited - Boolean
• link_flair_css_class - Text
• author_flair_css_class - Text
• is_self - Boolean
• name - Text
• url - Text
• distinguished - Text

Page 5

Reviewing the Data: Subreddit Topics

AnimalsWithoutNecks

BirdsBeingDicks

CemeteryPorn

CoffeeWithJesus

datasets

dataisbeautiful

FortPorn

learnpython

MachineLearning

misleadingthumbnails

Otters

PenmanshipPorn

PowerWashingPorn

ShowerBeer

StonerPhilosophy

talesfromtechsupport

TreesSuckingAtThings

Page 6

Reviewing the Data: Top Domains

[Bar chart: post count by domain (imgur.com, youtube.com, reddit.com, flickr.com, soundcloud.com, quickmeme.com, i.minus.com, twitter.com, amazon.com, qkme.com, vimeo.com, wikipedia.org, nytimes.com, guardian.co.uk, bbc.co.uk)]

Imgur: 773,969
YouTube: 188,526
Reddit: 25,445
Flickr: 17,854
Soundcloud: 10,397

Page 7

Reviewing the Data: Most Have No Body Text

Posts rely primarily on the title and related media content from the aforementioned domains: a link, GIF, image, video, etc.

Over 1.6 million posts, approximately 74% of all posts, had no body copy/text (a NaN value in the self_text field).
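For reference, a minimal pandas sketch of how this share could be computed, assuming the 2,500 per-subreddit .csv files sit in a local data/ directory and use the column names from the data-structure slide (both assumptions):

import glob
import pandas as pd

# Assumed layout: one .csv per subreddit in data/, each with a self_text column.
total_posts = 0
missing_body = 0
for path in glob.glob("data/*.csv"):
    df = pd.read_csv(path, usecols=["self_text"])
    total_posts += len(df)
    missing_body += df["self_text"].isna().sum()

print(f"{missing_body:,} of {total_posts:,} posts "
      f"({missing_body / total_posts:.0%}) have no body text")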

Page 8

Reviewing the Data: Time Based Data

0"

50000"

100000"

150000"

200000"

250000"

300000"

January"

February"

March"

April"

May"

June"

July"

August"

September"

October"

November"

December"

Winter Months Saw a Dip, Fall Could be Underrepresented Given Data Pulled in August

Page 9

Reviewing the Data: Time Based Data

Tuesday is Slightly the Favorite Day to Post, While the Weekend Sees a Dip

0"

50000"

100000"

150000"

200000"

250000"

300000"

350000"

400000"

Monday"

Tuesday"

Wednesday"

Thursday"

Friday"

Saturday"

Sunday"

Page 10

Reviewing the Data: Time Based Data

Reddit While You Work: Post Volume Picks Up Around 9-10am, Peaking at 12pm, Then Dropping Off Through the Afternoon

0"

20000"

40000"

60000"

80000"

100000"

120000"

140000"

160000"

12am" 1am" 2am" 3am" 4am" 5am" 6am" 7am" 8am" 9am" 10am" 11am" 12pm" 1pm" 2pm" 3pm" 4pm" 5pm" 6pm" 7pm" 8pm" 9pm" 10pm" 11pm"

Page 11

Reviewing the Data: Determining Popularity

[Histogram: post counts by score bucket (50-99, 100-199, 200-299, 300-399, 400-499, 500-999, 1000-4999, 5000-9999, 10000+). Annotation: ~15% of posts]

Note: Only about half the data is shown here; IPython could not render the histogram, so the counts were exported and the chart was built in Excel.
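As a rough illustration of how the score buckets and the popularity label could be built, a pandas sketch follows; the DataFrame name (posts) and the exact popularity cutoff are assumptions, since the slide only marks ~15% of posts:

import numpy as np
import pandas as pd

# Assumed: a DataFrame `posts` with the integer score column.
bins = [50, 100, 200, 300, 400, 500, 1000, 5000, 10000, np.inf]
labels = ["50-99", "100-199", "200-299", "300-399", "400-499",
          "500-999", "1000-4999", "5000-9999", "10000+"]
posts["score_bucket"] = pd.cut(posts["score"], bins=bins, labels=labels, right=False)
print(posts["score_bucket"].value_counts().sort_index())

# The popularity cutoff is a placeholder, not the value used on the slide.
popular_cutoff = 1000
posts["popular"] = (posts["score"] >= popular_cutoff).astype(int)
print(posts["popular"].mean())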

Page 12

Analyzing the Data: Issues

Issue: The initial data set (2.5 million rows) expanded upon transformation (CountVectorizer and TF-IDF, sketched below) to almost 100,000 columns, which caused problems processing the data locally on my machine. In the end I was only able to run about 1% of the data through the algorithms.
• Even with this smaller subset, runs could take anywhere from 30 minutes to several hours, making experimenting with the data extremely hard.

Future: Explore platforms that are better at handling large data sets, such as PySpark. I tried to process the data with PySpark but ran into technical issues that I couldn't address in time.
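A minimal sketch of the transformation and subsampling step described above, assuming a posts DataFrame with a title column and the popular label from the earlier sketch:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

# Assumed: the `posts` DataFrame and `popular` label from the earlier sketches.
sample = posts.sample(frac=0.01, random_state=42)           # roughly 1% of the rows

counts = CountVectorizer(stop_words="english").fit_transform(sample["title"])
X = TfidfTransformer().fit_transform(counts)                # sparse title features
y = sample["popular"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)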

Page 13

Analyzing the Data: SVC

[Charts: accuracy by kernel (Linear, Poly, Sigmoid, RBF) and accuracy with the linear kernel by C value (0.001, 0.01, 0.1)]

Linear kernel = 0.9368; C value of 0.1 = 0.9363

from sklearn import svm
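A sketch of the kernel and C sweeps behind these charts; the train/test variables (X_train, X_test, y_train, y_test) are assumed to come from the TF-IDF features and labels described earlier:

from sklearn import svm
from sklearn.metrics import accuracy_score

# Assumed: X_train / X_test / y_train / y_test from the TF-IDF sketch earlier.
for kernel in ["linear", "poly", "sigmoid", "rbf"]:
    clf = svm.SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, accuracy_score(y_test, clf.predict(X_test)))

for C in [0.001, 0.01, 0.1]:
    clf = svm.SVC(kernel="linear", C=C).fit(X_train, y_train)
    print("C =", C, accuracy_score(y_test, clf.predict(X_test)))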

Page 14

Analyzing the Data: Regression Trees

from sklearn import ensemble

[Charts: accuracy by n_estimators (5, 10, 20, 50, 100, 125, 150) and by max_depth (5, 40, 100, 150, 200, 250, 300)]

n_estimators of 125 = 0.922
max_depth of 250 = 0.924
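A sketch of the n_estimators and max_depth sweeps plotted above, under the same assumed train/test split; a RandomForestClassifier is used here since the results are reported as accuracy:

from sklearn import ensemble
from sklearn.metrics import accuracy_score

# Assumed: the same train/test split as on the SVC slide.
for n in [5, 10, 20, 50, 100, 125, 150]:
    clf = ensemble.RandomForestClassifier(n_estimators=n, random_state=42)
    clf.fit(X_train, y_train)
    print("n_estimators =", n, accuracy_score(y_test, clf.predict(X_test)))

for depth in [5, 40, 100, 150, 200, 250, 300]:
    clf = ensemble.RandomForestClassifier(n_estimators=125, max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    print("max_depth =", depth, accuracy_score(y_test, clf.predict(X_test)))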

Page 15

Analyzing the Data: Logistic

0.925&

0.93&

0.935&

0.94&

0.945&

0.95&

0.001& 0.01& 0.1& 1& 10& 50&

C Value of 1 = .9471 L1 = .947733 L2 = .947066

0.9469&

0.94695&

0.947&

0.94705&

0.9471&

0.94715&

0.9472&

0.94725&

0.9473&

0.94735&

0.9474&

L1& L2&

from sklearn import linear_model
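A sketch of the C and penalty sweeps plotted above, under the same assumed split; the liblinear solver is chosen because it supports both L1 and L2 penalties:

from sklearn import linear_model
from sklearn.metrics import accuracy_score

# Assumed: the same train/test split as on the previous slides.
for C in [0.001, 0.01, 0.1, 1, 10, 50]:
    clf = linear_model.LogisticRegression(C=C, solver="liblinear").fit(X_train, y_train)
    print("C =", C, accuracy_score(y_test, clf.predict(X_test)))

for penalty in ["l1", "l2"]:
    clf = linear_model.LogisticRegression(C=1, penalty=penalty, solver="liblinear").fit(X_train, y_train)
    print(penalty, accuracy_score(y_test, clf.predict(X_test)))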

Page 16

Totally Crushing It!

Page 17

Analyzing the Data: Classification Report

[Classification reports shown as images: Random Forests, Logistic Regression, SVC]
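A sketch of how these per-model classification reports could be generated, assuming a models dict of already-fitted classifiers and the same held-out split:

from sklearn.metrics import classification_report

# Assumed: `models` maps names to already-fitted classifiers from the slides above.
for name, clf in models.items():
    print(name)
    print(classification_report(y_test, clf.predict(X_test)))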

Page 18

Soooo Not Crushing It

Page 19

Feature Reduction: Accuracy

[Bar chart: accuracy by model, all features vs. reduced features]

All features: Random Forests 92.4%, SVC 93.63%, Logistic Regression 94.71%

Reduced features: 94.3%, 94.5%, 95.2%

Page 20

Feature Reduction: Classification Report

[Classification reports shown as images: Random Forests, Logistic Regression, and SVC, with all features vs. reduced features]

Page 21

Next Steps

Dealing with the processing issues:
• Learn and try out PySpark

Answer some additional questions:
• Reevaluate how I handle the domains
  • I originally bucketed domains by their frequency of occurrence in the data set. However, given that the originating domain of the content and the title make up the majority of the "post," and the top 15 domains account for the vast majority of posts, I want to focus on posts from those ~15 domains to get a better picture of how they explicitly affect popularity.
• Run the data with varying n-gram levels (see the sketch after this list)
  • I tried them, but they expanded the columns to hundreds of thousands, which just seemed to freeze everything; hopefully something like PySpark will help with the processing.
• Predict subreddit/category questions:
  • Can I predict the category of a post?
  • Do certain subreddits produce more overall popular content than others? Bears With Beaks vs. ggggg (whatever the hell that is)
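A sketch of the n-gram experiment mentioned above: widening ngram_range quickly multiplies the number of CountVectorizer columns. The sample of post titles is the assumed 1% subsample from earlier:

from sklearn.feature_extraction.text import CountVectorizer

# Assumed: the same 1% `sample` of post titles used earlier.
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    X_ngrams = CountVectorizer(ngram_range=ngram_range).fit_transform(sample["title"])
    print(ngram_range, "->", X_ngrams.shape[1], "columns")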

Page 22

APPENDIX


Page 23

Reviewing the Data: Reevaluate Popularity

[Histogram: post counts by score bucket (50-99, 100-199, 200-299, 300-399, 400-499, 500-999, 1000-4999, 5000-9999, 10000+). Annotations: ~12% of posts and ~8% of posts]

Note: Only about half the data is shown here; IPython could not render the histogram, so the counts were exported and the chart was built in Excel.

Page 24

Analyzing the Data: SVC

[Chart: accuracy by C value (0.001, 0.01, 0.1, 1, 10, 50)]

C value of 0.1 = 0.7077 accuracy score

[Confusion matrix shown on the original slide]
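A sketch of the accuracy score and confusion matrix reported on these appendix slides, assuming a fitted classifier clf and the same held-out split:

from sklearn.metrics import accuracy_score, confusion_matrix

# Assumed: a fitted classifier `clf` and the same held-out split.
predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))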

Page 25

Analyzing the Data: Random Forest

[Charts: accuracy by n_estimators (5, 10, 20, 50, 100, 125) and by max_depth (40, 100, 150, 200, 250)]

n_estimators of 100 = 0.8218
max_depth of 200 = 0.8247

[Confusion matrix shown on the original slide]

Page 26

Analyzing the Data: Logistic

C = 1, Penalty = L2

[Chart: accuracy by C value (0.001, 0.01, 0.1, 1, 10, 50)]

C of 1 = 0.8453

[Confusion matrix shown on the original slide]