Twitter Sentiment Prediction.pptx

INST 737 – Twitter Sentiment Prediction on #Windows10 Release

Anuj Sharma, Krishnesh Pujari and Rajesh Gnanasekaran

12/03/15

Objective

• Twitter has recently come to rival other social media platforms such as Facebook, Google+ and Myspace in generating waves of sentiment on issues around the world.

• To perform Twitter sentiment analysis and sentiment prediction on Microsoft’s Windows 10 release, which took place on July 29, 2015.

• To follow a semi-supervised learning technique to create the target variable and use it in the classification models.

• To analyze and interpret the results and provide recommendations to Microsoft.

About the Data

• Imported using NodeXL from the Twitter Search Network

• The original dataset had 9000+ observations on the hashtag ‘#Windows10’ for the period from July 28, 2015 to August 5, 2015

• After cleaning (missing values, duplicates, other languages), we ended up with 4646 observations with 28 original factors and 19 derived features

• Performed feature engineering to arrive at these additional features, as we felt they might better predict the target factor, i.e., “Polarity”

• Types of variables: categorical, continuous

Sentiment Analysis

● Tweet text cleaning: removed filler words and ignored words that are not in English

● Used customized R code for text mining, which parsed tweets and classified the words into +ve, -ve or neutral polarities

● The code compared the words in the tweets with a dictionary and mapped the resulting polarity to the tweet.

● Cross-checked that the code worked correctly by creating about 100 tweets and manually checking their polarity
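
The dictionary-based scoring described on this slide could look roughly like the sketch below. It is written in Python rather than the R used in the project, and the word lists, the stop-word list and the function name `tweet_polarity` are all hypothetical stand-ins, since the slides do not give the actual dictionary or code.

```python
import re

# Hypothetical mini-lexicon; the project's real dictionary is not shown in the slides.
POSITIVE = {"love", "great", "fast", "smooth"}
NEGATIVE = {"hate", "slow", "crash", "broken"}
# Hypothetical filler (stop) words removed during cleaning.
STOPWORDS = {"the", "a", "is", "to", "and", "my", "on", "it"}


def tweet_polarity(text):
    """Return 'positive', 'negative' or 'neutral' for a single tweet."""
    # Lowercase, keep alphabetic tokens only, then drop filler words.
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    # Net score: positive dictionary hits minus negative dictionary hits.
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for tweet in ["I love the new #Windows10, great and fast!",
                  "Upgrade made my laptop slow and it crash",
                  "Installing Windows 10 today"]:
        print(tweet, "->", tweet_polarity(tweet))
```

A manual check like the one on the slide amounts to comparing this function's output against hand-assigned labels for a small set of test tweets.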

Exploring the Data

● Created histograms and box plots to identify any unusual behavior between the variables. Found some interesting patterns

Continued...

● Tested the variables using Pearson’s correlation; found significant correlation between factors such as Tweets and Followed. Made sure that we did not include both of these variables together in the logistic regression.

● Momentum of tweets shifted from +ve/neutral to -ve toward the end of the sample period; almost 80% of -ve tweets on 08/05
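
As an illustration of the correlation screening, the sketch below computes Pearson's correlation coefficient by hand and lists pairs of variables above a cutoff. The 0.8 cutoff and the helper names are assumptions made for the example; the slides do not state what threshold was used.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for pairs with |r| >= threshold.

    `columns` maps a variable name to its list of values. The default
    threshold is an assumed value, not one taken from the slides.
    """
    pairs = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs
```

Variables flagged this way would then be entered into the regression one at a time rather than together.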

Feature Engineering

● Tweet timestamp was broken into Tweet date and Tweet time

● Current Date

● Days difference = Tweet date minus upgrade date

● Number of weeks since joined Twitter

● Number of months since joined Twitter

● Log of number of months since joined Twitter

● Log of number of followers

● Log of number of people followed by the user

● Log of number of favorites

● Log of number of tweets

● Length of Tweet
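
A few of these derived features can be sketched as follows. The function and key names are hypothetical, and using `log1p` (log of 1 + x) so that zero counts do not fail is an assumption; the slides do not say how zeros were handled or which log base was used.

```python
import math
from datetime import date, datetime

RELEASE_DATE = date(2015, 7, 29)  # Windows 10 release date stated in the slides


def derive_features(tweet_timestamp, joined_date, followers, tweet_text, today):
    """Build a dict of derived features for one tweet (illustrative names)."""
    tweet_date = tweet_timestamp.date()
    days_on_twitter = (today - joined_date).days
    # Whole calendar months between the join date and the current date.
    months_on_twitter = ((today.year - joined_date.year) * 12
                         + (today.month - joined_date.month))
    return {
        "tweet_date": tweet_date,
        "tweet_time": tweet_timestamp.time(),
        "days_difference": (tweet_date - RELEASE_DATE).days,
        "weeks_on_twitter": days_on_twitter // 7,
        "months_on_twitter": months_on_twitter,
        # log1p is an assumed choice to keep zero counts well defined.
        "log_months_on_twitter": math.log1p(months_on_twitter),
        "log_followers": math.log1p(followers),
        "tweet_length": len(tweet_text),
    }


if __name__ == "__main__":
    features = derive_features(
        datetime(2015, 8, 5, 14, 30), date(2010, 8, 5), 250,
        "Hello #Windows10", today=date(2015, 12, 3))
    print(features)
```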

Multinomial Logistic Regression and Interpretation

● Multinomial over Binomial - Target variable has more than two values.

● To check which factors are affecting the tweet polarity in any manner.

● Interpret using Log of odds to see the variation.

● Variables of importance: Relationship, No. of followers, Tweet length, No. of weeks since joined Twitter
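
For reference, the log-odds reading of a multinomial logistic model can be written out as below. Treating "neutral" as the reference class is an assumption made only for illustration, as are the symbols: $x_j$ stands for the $j$-th predictor and $\beta_{k,j}$ for its coefficient in the equation for class $k$.

```latex
\ln \frac{P(\text{Polarity} = k \mid \mathbf{x})}
         {P(\text{Polarity} = \text{neutral} \mid \mathbf{x})}
  = \beta_{k,0} + \sum_{j=1}^{p} \beta_{k,j}\, x_j,
  \qquad k \in \{\text{positive}, \text{negative}\}
```

Under this form, a one-unit increase in $x_j$ multiplies the odds of class $k$ relative to the reference class by $e^{\beta_{k,j}}$, holding the other predictors fixed.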

Results

Decision Trees Classification and Interpretation

● Decision trees are an alternative to logistic regression

● The CART (Classification and Regression Trees) method is used to recursively partition the data and classify the target variable

● Variables of importance: Tweet date, Days difference and length of the tweet
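
To make the recursive partitioning idea concrete, the sketch below shows the single step that a CART-style classifier repeats at every node: choosing the threshold on one numeric variable that minimizes the weighted Gini impurity of the two resulting groups. The function names and the toy data in the usage example are hypothetical and are not taken from the project.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Find the split `value <= t` with the lowest weighted Gini impurity.

    Returns (threshold, weighted_impurity); threshold is None when no
    split improves on the impurity of the unsplit group.
    """
    n = len(values)
    best_t, best_score = None, gini(labels)
    # The largest value is skipped because it would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


if __name__ == "__main__":
    # Toy example: days since release vs. polarity (made-up values).
    days = [0, 1, 2, 6, 7]
    polarity = ["positive", "positive", "positive", "negative", "negative"]
    print(best_threshold(days, polarity))
```

A full tree applies the same search to each resulting group in turn until a stopping rule is met.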

Results

Random Forest Classification and Interpretation

● Random Forest is an ensemble of decision trees, which helps predict polarity better

● Implemented 501 decision trees to identify important predictors of polarity

● Variables of importance: Tweet date, Days difference and length of tweet
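
The ensemble idea can be sketched with its two aggregation steps: drawing a bootstrap sample for each tree and taking a majority vote over the trees' predictions. Only the tree count of 501 comes from the slides; the function names are hypothetical, and each "tree" here is just any callable that maps a row to a label, standing in for a fitted decision tree.

```python
import random
from collections import Counter

N_TREES = 501  # number of trees reported in the slides


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as done once per tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the most common label among the individual tree predictions."""
    return Counter(predictions).most_common(1)[0][0]


def forest_predict(trees, row):
    """Predict a label for one row by majority vote over all trees.

    `trees` is a list of callables (row -> label); in a real forest each
    one would be a decision tree fitted on its own bootstrap sample.
    """
    return majority_vote([tree(row) for tree in trees])
```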

Results

Limitations

● The dataset covered a short span of time, between 07/28/15 and 08/05/15; with a bigger sample, results may differ

● We limited the scope of this project to tweets in English only.

● We were not able to take advantage of the geo-spatial coordinates, as most of the records had n/a values.

Recommendations

● As negative sentiment starts to prevail post-release in the latter half of the week, Microsoft should not stop its positive branding even after the release.

● As a tweet coming from a seasoned Twitter user is more likely to be negative, Microsoft should target those influential accounts to spread a positive word.

Thank You !
