PREDICTIVE ANALYSIS OF UNITED STATES PRESIDENTIAL ELECTIONS USING MACHINE LEARNING
A Project by Harindu Kodituwakku
Submitted as final year project towards completion of BEng (Honours) in Computing Science.
Problem Analysis
Project Objectives
• Objective: A desktop application that analyzes and visualizes the predictive results of the 2016 U.S. presidential election using Twitter data.
• Assumptions and Constraints
  • Only tweets related to the Democratic and Republican presidential candidates will be analyzed.
  • Only English-language tweets will be considered.
  • The predictive analysis will be calculated using tweets collected from 2016/10/11 to 2016/11/07.
• Scope: Machine Learning, Sentiment Analysis, Big Data
Research Approach
[Diagram: Resource Analysis, Similar Approach Analysis, Extract Concepts, Solution Concept]
• Sentiment140.com
• The Predictive Power of Social Media: On the Predictability of U.S. Presidential Elections using Twitter – 2012
• Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment – 2008
Literature Review
• Twitter vs Facebook
• Levels of Sentiment Analysis
• Feature Extraction Methodologies
• POS Tagging
• Negation Handling
• Supervised vs Unsupervised Machine Learning Techniques
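The slides list negation handling without showing how it was done. One common approach, sketched here purely as an illustration, prefixes every token that follows a negation word with `NOT_` until the next punctuation mark, so that "like" and "NOT_like" become separate features. The names `tokenize`, `mark_negation`, and the `NEGATIONS` word list are assumptions made for this sketch and are not taken from the project.

```python
import re

# Hypothetical negation vocabulary; a real system would likely use a longer list.
NEGATIONS = {"not", "no", "never", "cannot"}
CLAUSE_END = re.compile(r"^[.,;:!?]$")


def tokenize(text):
    """Split text into word tokens (keeping apostrophes) and punctuation marks."""
    return re.findall(r"[A-Za-z']+|[.,;:!?]", text)


def mark_negation(tokens):
    """Prefix tokens after a negation word with NOT_ until the next punctuation mark."""
    marked = []
    negating = False
    for token in tokens:
        lower = token.lower()
        if CLAUSE_END.match(token):
            negating = False          # punctuation closes the negation scope
            marked.append(token)
        elif lower in NEGATIONS or lower.endswith("n't"):
            negating = True           # open a negation scope
            marked.append(token)
        elif negating:
            marked.append("NOT_" + token)
        else:
            marked.append(token)
    return marked


tokens = tokenize("I don't like this debate, but the speech was good.")
marked_tokens = mark_negation(tokens)
```

In this example only "like", "this", and "debate" receive the `NOT_` prefix, because the comma ends the negation scope before "but the speech was good".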
Solution Concept
• Data collection through Twitter.
• Supervised machine learning is selected as the classification technique.
• The Naïve Bayes algorithm is selected as the machine learning algorithm.
• A MongoDB NoSQL database is used to store tweets regarding the candidates.
Design
• 1st Iteration – Primitive and informal design.
• 2nd Iteration – Integration of the NoSQL database. Proper object-oriented UML design. The behavioral and structural relationships among the Python classes were showcased.
• 3rd Iteration – Integration with Matplotlib graphs for data representation.
Implementation
Core Components
• Implementation of NoSQL Database
• Live Tweets Streamer
• Pre-processing of Tweets
• Classification Algorithm
• Implementation of Word Cloud
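The pre-processing component is named but not described. A minimal sketch of what tweet cleaning often involves is given below, assuming the usual steps of lowercasing, removing URLs and @mentions, keeping hashtag words, and collapsing elongated words. The function `clean_tweet` and its regular expressions are illustrative assumptions and not the project's actual code.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{2,}")      # three or more of the same character
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text):
    """Return a list of lowercase word tokens from a raw tweet (hypothetical helper)."""
    text = text.lower()
    text = URL_RE.sub(" ", text)           # drop links
    text = MENTION_RE.sub(" ", text)       # drop @user mentions
    text = text.replace("#", " ")          # keep the hashtag word, drop the symbol
    text = REPEAT_RE.sub(r"\1\1", text)    # "sooooo" becomes "soo"
    text = NON_LETTER_RE.sub(" ", text)    # drop digits and punctuation
    return [token for token in text.split() if token != "rt"]  # drop retweet marker


cleaned = clean_tweet("RT @user: Sooooo excited for #Election2016!!! https://t.co/abc")
```

The resulting token list can be passed directly to a bag-of-words classifier.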
Naïve Bayes Algorithm
$$\hat{c} = \arg\max_{c} P(c \mid d) = \arg\max_{c} P(d \mid c)\, P(c)$$
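The decision rule above picks the class c that maximizes P(d | c) P(c) for a tweet d; the evidence P(d) is dropped because it is the same for every class. The sketch below is one way to realize that rule as a multinomial Naïve Bayes classifier with add-one smoothing, computed in log space to avoid underflow. The smoothing choice, the class name `NaiveBayes`, and the `"pos"`/`"neg"` labels are assumptions for illustration; the slides do not state which variant the project used.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # documents seen per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocabulary = set()

    def train(self, labelled_docs):
        for tokens, label in labelled_docs:
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def log_posteriors(self, tokens):
        """Return log P(c) + sum of log P(w | c) for every class c."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total_docs)                 # log P(c)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                if token not in self.vocabulary:
                    continue                                     # ignore unseen words
                numerator = self.word_counts[label][token] + 1   # add-one smoothing
                score += math.log(numerator / (total_words + vocab_size))
            scores[label] = score
        return scores

    def classify(self, tokens):
        scores = self.log_posteriors(tokens)
        return max(scores, key=scores.get)                       # the argmax over classes


classifier = NaiveBayes()
classifier.train([
    (["great", "rally"], "pos"),
    (["love", "this", "speech"], "pos"),
    (["worst", "debate"], "neg"),
    (["sad", "and", "worst", "policy"], "neg"),
])
prediction = classifier.classify(["great", "speech"])
```

With this toy training set, "great" and "speech" occur only in positive examples, so the positive class receives the higher score.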
Word Cloud
Testing
• 1st Iteration – Unit Testing: The cleaning and pre-processing algorithms were unit tested using PyUnit.
• 2nd Iteration – Integration Testing: The main components of the system were integrated during this iteration. Many test cases failed during this iteration.
• 3rd Iteration – Accuracy Testing: This iteration was divided into 3 phases. In each phase the classifier was trained with a different amount of training data, which improved the accuracy of the system drastically.
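PyUnit is the `unittest` module in Python's standard library. A unit test for a cleaning step might look like the sketch below. The function under test, `strip_urls_and_mentions`, is a hypothetical stand-in, since the project's real cleaning functions are not shown in the slides.

```python
import re
import unittest


def strip_urls_and_mentions(text):
    """Hypothetical cleaning step: drop URLs and @mentions, lowercase the rest."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    return " ".join(text.lower().split())


class CleaningTests(unittest.TestCase):
    def test_removes_url(self):
        self.assertEqual(strip_urls_and_mentions("Vote now https://t.co/xyz"), "vote now")

    def test_removes_mention(self):
        self.assertEqual(strip_urls_and_mentions("@user Great debate"), "great debate")

    def test_empty_input(self):
        self.assertEqual(strip_urls_and_mentions(""), "")


def run_tests():
    """Run the test case programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleaningTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests()` executes the three cases; from the command line the same tests can be discovered with `python -m unittest`.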
Testing
Most Informative Features
contains(bummed) = True          0 : 1 = 39.7 : 1.0
contains(lonely) = True          0 : 1 = 25.9 : 1.0
contains(followfriday) = True    1 : 0 = 23.7 : 1.0
contains(tummy) = True           0 : 1 = 20.6 : 1.0
contains(infection) = True       0 : 1 = 17.0 : 1.0
contains(ankle) = True           0 : 1 = 16.3 : 1.0
contains(cancelled) = True       0 : 1 = 15.0 : 1.0
contains(heyy) = True            1 : 0 = 15.0 : 1.0
contains(boom) = True            1 : 0 = 15.0 : 1.0
contains(hurts) = True           0 : 1 = 14.9 : 1.0
contains(sad) = True             0 : 1 = 14.8 : 1.0
contains(depressed) = True       0 : 1 = 14.6 : 1.0
contains(hating) = True          0 : 1 = 14.3 : 1.0
contains(worst) = True           0 : 1 = 13.9 : 1.0
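Each row above reads as a likelihood ratio: a tweet containing "bummed" is about 39.7 times more likely under label 0 than under label 1. The layout resembles the output of NLTK's `show_most_informative_features`, though the slides do not name the tool, so that is an inference. The sketch below shows how such ratios can be computed from labelled tweets; the additive smoothing constant of 0.5 and the function name `likelihood_ratios` are assumptions.

```python
from collections import Counter


def likelihood_ratios(labelled_docs, smoothing=0.5):
    """Rank words by the ratio P(word present | label A) / P(word present | label B)."""
    doc_counts = Counter()
    feature_counts = {}
    for tokens, label in labelled_docs:
        doc_counts[label] += 1
        # set(tokens): count each word at most once per document (presence feature)
        feature_counts.setdefault(label, Counter()).update(set(tokens))

    labels = sorted(doc_counts)
    vocabulary = set().union(*(feature_counts[label] for label in labels))

    ratios = []
    for word in vocabulary:
        probs = {
            label: (feature_counts[label][word] + smoothing)
            / (doc_counts[label] + 2 * smoothing)
            for label in labels
        }
        high = max(labels, key=probs.get)
        low = min(labels, key=probs.get)
        ratios.append((word, high, low, probs[high] / probs[low]))

    ratios.sort(key=lambda row: row[3], reverse=True)
    return ratios


toy_docs = [
    (["sad", "day"], 0), (["sad", "news"], 0), (["sad"], 0),
    (["happy", "day"], 1), (["happy"], 1), (["great", "day"], 1),
]
top_word, favoured, other, ratio = likelihood_ratios(toy_docs)[0]
```

On the toy data, "sad" appears in all three label-0 documents and in none of the label-1 documents, so it ranks first with a smoothed ratio of 7.0 in favour of label 0.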
Performance Evaluation
[Figure: Evaluating the Accuracy of the Classifier. x-axis: Number of Tweets in the Training Dataset (0 to 60,000); y-axis: Accuracy (%) (0 to 90).]
Performance Evaluation
[Figure panels: At 9.00 am – 2016/11/09; At 11.00 am – 2016/11/09]
Contribution
• Implementation of NoSQL Database
• Use of Python Object Serialization
• Removal of Neutral Dataset from the Training
• Implementation of the Word Cloud
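Python object serialization usually means the standard-library `pickle` module, which lets a trained classifier be saved once and reloaded later without retraining. The sketch below assumes that is what the project did; the helper names, the file name `classifier.pickle`, and the dictionary standing in for a trained model are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize a trained model to disk."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Load a previously saved model. Only unpickle files from a trusted source."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in for a trained classifier; any picklable Python object works the same way.
model = {"labels": [0, 1], "word_counts": {"sad": [30, 2], "boom": [1, 15]}}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "classifier.pickle"
    save_model(model, model_path)
    restored = load_model(model_path)
```

Loading a pickled model at start-up avoids repeating the training step each time the application runs.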
Predictive Analysis
Source: https://shift.newco.co/what-i-discovered-about-trump-and-clinton-from-analyzing-4-million-facebook-posts-922a4381fd2f#.44nju339g
System Evaluation
• The application cannot reliably detect the true sentiment of sarcastic tweets.
• When both candidates are named in a single tweet, the polarity toward each candidate cannot be detected, e.g. "Trump is a racist but Hillary is a humanitarian."
• In real-time sentiment analysis, only 100 tweets can be classified at a time.
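The two-candidate limitation can at least be detected before classification. The sketch below flags tweets that mention both candidates so they can be set aside instead of being scored for a single candidate. It is an illustration of the problem, not a feature of the described system, and the keyword sets and function names are assumptions.

```python
import re

# Hypothetical keyword sets; handles such as @realDonaldTrump would need extra rules.
CANDIDATE_KEYWORDS = {
    "Clinton": {"hillary", "clinton"},
    "Trump": {"donald", "trump"},
}


def mentioned_candidates(text):
    """Return the sorted names of the candidates whose keywords appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        name for name, keywords in CANDIDATE_KEYWORDS.items() if words & keywords
    )


def route_tweet(text):
    """Assign a tweet to one candidate, or flag it as 'ambiguous' or 'unrelated'."""
    found = mentioned_candidates(text)
    if len(found) == 1:
        return found[0]
    if len(found) > 1:
        return "ambiguous"   # whole-tweet polarity cannot be attributed to one candidate
    return "unrelated"


label = route_tweet("Trump is a racist but Hillary is a humanitarian.")
```

Tweets routed as "ambiguous" would need clause-level or aspect-based sentiment analysis to recover a polarity per candidate.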
Further Improvements
• Accuracy can be further improved by using a larger training dataset.
• The classification engine can be hosted as a web service to obtain real-time classification without delay.
• Gathering a larger number of test tweets would yield a more accurate classification.
• More sophisticated natural language processing techniques should be implemented to handle sarcasm and slang in tweets.
Q&A