Predictive Analysis of the U.S. Presidential Election using Machine Learning

PREDICTIVE ANALYSIS OF UNITED STATES PRESIDENTIAL ELECTIONS USING MACHINE LEARNING

A Project by Harindu Kodituwakku

Submitted as a final-year project towards completion of the BEng (Honours) in Computing Science.

Problem Analysis

Project Objectives

• Objective
  A desktop application that analyzes and visualizes the predicted results of the 2016 U.S. presidential election using Twitter data.

• Assumptions and Constraints
  - Only tweets related to the Democratic and Republican presidential candidates will be analyzed.
  - Only English-language tweets will be considered.
  - The predictive analysis will be calculated using the tweets collected during 2016/10/11 – 2016/11/07.

• The Scope – Machine Learning, Sentiment Analysis, Big Data

Research Approach

[Diagram: Resource Analysis → Similar Approach Analysis → Extract Concept → Solution Concept]

• Sentiment140.com
• The Predictive Power of Social Media: On the Predictability of U.S. Presidential Elections using Twitter – 2012
• Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment – 2008

Literature Review

• Twitter vs Facebook
• Levels of Sentiment Analysis
• Feature Extraction Methodologies
• POS Tagging
• Negation handling
• Supervised Machine Learning vs Unsupervised Machine Learning Techniques

Solution Concept

• Data collection through Twitter.
• Supervised Machine Learning will be selected as the classification technique.
• Naïve Bayes Algorithm will be selected as the machine learning algorithm.
• MongoDB No-SQL Database will be used to store tweets regarding the candidates.

Design

• 1st Iteration – Primitive and informal design.
• 2nd Iteration – Integration of the No-SQL database.
  - Proper object-oriented UML design.
  - The behavioral and structural relationships between the Python classes were showcased.
• 3rd Iteration – Integration with Matplotlib graphs for data representation.

Implementation

Core Components
• Implementation of No-SQL Database
• Live Tweets Streamer
• Pre-processing of Tweets
• Classification Algorithm
• Implementation of Word Cloud
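The slides do not show how the tweets were pre-processed. As a minimal sketch of what such a step commonly looks like, the function below lowercases a tweet, drops URLs, @mentions and the retweet marker, keeps the word part of hashtags, and splits the rest into tokens. The name `clean_tweet` and the regular expressions are illustrative assumptions, not the project's code.

```python
import re

# Illustrative patterns (assumptions, not taken from the project).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"\brt\b")
NON_LETTER_RE = re.compile(r"[^a-z\s']")


def clean_tweet(text):
    """Return a list of lowercase word tokens from a raw tweet."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # drop links
    text = MENTION_RE.sub(" ", text)     # drop @user mentions
    text = RETWEET_RE.sub(" ", text)     # drop the "RT" marker
    text = text.replace("#", " ")        # keep the hashtag word, drop '#'
    text = NON_LETTER_RE.sub(" ", text)  # drop digits and punctuation
    return text.split()
```

For example, `clean_tweet("RT @user: Trump is GREAT!! http://t.co/abc #Election2016")` yields `["trump", "is", "great", "election"]`.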

Naïve Bayes Algorithm

$$\hat{c} = \operatorname*{argmax}_{c} P(c \mid d) = \operatorname*{argmax}_{c} P(d \mid c)\, P(c)$$
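To make the decision rule concrete, here is a small from-scratch sketch in which c is a sentiment class and d is a tweet given as a list of tokens. It assumes the usual bag-of-words factorization P(d | c) = ∏ P(w | c), add-one smoothing, and log-probabilities to avoid underflow; none of these details are stated in the slides, and the functions `train` and `predict` are illustrative rather than the project's classifier. The labels 0 and 1 are assumed to mean negative and positive.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: iterable of (tokens, label). Returns the counts used by predict."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, tokens):
    """Return argmax over c of log P(c) + sum of log P(w | c)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log P(c)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothed estimate of P(w | c).
            p = (word_counts[label][tok] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

A tiny usage example: after `model = train([(["sad", "lonely"], 0), (["great", "day"], 1)])`, the call `predict(model, ["sad"])` returns `0`.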

Word Cloud
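A word cloud is driven by word frequencies: the more often a word occurs, the larger it is drawn. The slides do not say how the cloud was produced, so the sketch below only covers the counting step, with a deliberately tiny stop-word list. `STOPWORDS` and `word_frequencies` are hypothetical names chosen for illustration.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "is", "and", "to", "of", "in"}


def word_frequencies(token_lists, top_n=50):
    """Count words across tokenized tweets and return the top_n (word, count) pairs."""
    counts = Counter()
    for tokens in token_lists:
        # Skip stop words and very short tokens, which carry little meaning.
        counts.update(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return counts.most_common(top_n)
```

The resulting pairs can then be handed to whichever rendering library draws the cloud.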

Testing

• 1st Iteration – Unit Testing
  - The cleaning and pre-processing algorithms were unit tested.
  - PyUnit was used for unit testing.
• 2nd Iteration – Integration Testing
  - The main components of the system were integrated during this iteration.
  - Many test cases failed during this iteration.
• 3rd Iteration – Accuracy Testing
  - This iteration was divided into 3 phases.
  - In each phase the classifier was trained using a different amount of training data.
  - This improved the accuracy of the system drastically.
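The accuracy-testing phases can be pictured as retraining on growing slices of the labelled data and scoring each model on the same held-out set. The sketch below assumes a `train` callable that takes labelled examples and returns a prediction function; that interface, and the names `accuracy` and `accuracy_by_training_size`, are assumptions made for illustration rather than the project's test harness.

```python
def accuracy(predict, labelled):
    """Fraction of (tokens, label) pairs for which predict(tokens) equals label."""
    correct = sum(1 for tokens, label in labelled if predict(tokens) == label)
    return correct / len(labelled)


def accuracy_by_training_size(train, data, test_set, sizes):
    """Train on the first n examples for each n in sizes and score on test_set."""
    results = {}
    for n in sizes:
        predict = train(data[:n])  # assumed: train returns a predict function
        results[n] = accuracy(predict, test_set)
    return results
```

Keeping the test set fixed across phases is what makes the accuracies comparable from one training size to the next.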

Testing

Most Informative Features

contains(bummed) = True          0 : 1 = 39.7 : 1.0
contains(lonely) = True          0 : 1 = 25.9 : 1.0
contains(followfriday) = True    1 : 0 = 23.7 : 1.0
contains(tummy) = True           0 : 1 = 20.6 : 1.0
contains(infection) = True       0 : 1 = 17.0 : 1.0
contains(ankle) = True           0 : 1 = 16.3 : 1.0
contains(cancelled) = True       0 : 1 = 15.0 : 1.0
contains(heyy) = True            1 : 0 = 15.0 : 1.0
contains(boom) = True            1 : 0 = 15.0 : 1.0
contains(hurts) = True           0 : 1 = 14.9 : 1.0
contains(sad) = True             0 : 1 = 14.8 : 1.0
contains(depressed) = True       0 : 1 = 14.6 : 1.0
contains(hating) = True          0 : 1 = 14.3 : 1.0
contains(worst) = True           0 : 1 = 13.9 : 1.0

Performance Evaluation

[Figure: Evaluating the Accuracy of the Classifier. Horizontal axis: Number of Tweets in the Training Dataset.]

[Screenshots: At 9.00 am – 2016/11/09; At 11.00 am – 2016/11/09]

Contribution

• Implementation of No-SQL Database
• Use of Python Object Serialization
• Removal of Neutral Dataset from the Training
• Implementation of the Word Cloud
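Python object serialization lets a trained classifier be saved once and reloaded later instead of being retrained on every start. A minimal sketch with the standard `pickle` module follows; the helper names `save_model` and `load_model` are illustrative assumptions, and the slides do not say which serialization module the project used.

```python
import pickle


def save_model(model, path):
    """Write a trained model object to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Read a model back from disk. Only unpickle files from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Typical use would be `save_model(classifier, "classifier.pickle")` after training and `classifier = load_model("classifier.pickle")` at start-up, where the file name is only an example.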

Predictive Analysis

Source: https://shift.newco.co/what-i-discovered-about-trump-and-clinton-from-analyzing-4-million-facebook-posts-922a4381fd2f#.44nju339g


System Evaluation

• The application is unable to clearly detect the true sentiment of sarcastic tweets.
• When the names of both candidates are mentioned in a single tweet, the polarity for each candidate cannot be detected, e.g. "Trump is a racist but Hillary is a humanitarian."
• In real-time sentiment analysis, only 100 tweets can be classified at a time.
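One way to work around the two-candidate limitation is to flag such tweets before classification so they can be counted separately or skipped. The sketch below is only an illustration of that idea, not something the project implemented; the keyword sets in `CANDIDATE_KEYWORDS` and the function `mentioned_candidates` are assumptions.

```python
# Hypothetical keyword sets per candidate (assumptions for illustration).
CANDIDATE_KEYWORDS = {
    "trump": {"trump", "donald"},
    "clinton": {"clinton", "hillary"},
}


def mentioned_candidates(tokens):
    """Return the set of candidates whose keywords appear in the tokens."""
    words = set(tokens)
    return {name for name, keywords in CANDIDATE_KEYWORDS.items() if words & keywords}
```

A tweet for which `len(mentioned_candidates(tokens)) > 1` mentions both candidates, so a single tweet-level polarity cannot be attributed to either of them.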

Further Improvements

• The accuracy can be further improved by using a larger training data set.
• The classification engine can be hosted in a web service in order to obtain real-time classification without any delay in the process.
• By gathering a larger amount of testing data (tweets), a more accurate classification can be obtained.
• More sophisticated natural language processing techniques should be implemented to detect sarcasm and slang in tweets.
