2016 Democratic Primary: Prediction of Results for New York Counties
Lavneet Sidhu | Nikita Bali | Sanjita Jain | Subhasree Chatterjee
OBJECTIVE
Predict the number of New York counties won by each candidate in the 2016 Democratic presidential primary, based on demographic data from all other US counties.
Find out whether there is any pattern in how people vote based on their demographic information.
DATA
Sources: Primary Results, County Facts
29 states, 1928 counties, 33 demographic variables
Training set: 1542 counties (80 percent)
Testing set: 386 counties (20 percent)
Variables: population %, female %, % of different ethnicities, educational background, income, number of votes
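The 80/20 split can be sketched as below. This is only an illustration: the slides do not say how the split was drawn, so the random shuffle, the seed, the function name `split_counties`, and the placeholder county ids are all assumptions.

```python
import random


def split_counties(county_ids, train_fraction=0.8, seed=0):
    """Shuffle county ids and split them into training and testing lists."""
    ids = list(county_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = int(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# Placeholder ids standing in for the 1928 counties (not real identifiers).
county_ids = range(1928)
train_ids, test_ids = split_counties(county_ids)
print(len(train_ids), len(test_ids))
```

With 1928 counties and a training fraction of 0.8, the split sizes match the 1542 and 386 counties listed above.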
Exploratory Data Analysis
Distribution of votes between the Democratic candidates by demographic information
The exploratory data analysis was done using Python.
Correlation Matrix:
1. The highest correlation is between the percentage of the population that speaks a language other than English at home and the percentage that is Hispanic or Latino.
2. The percentage of the population born outside the US is also very highly correlated with both the percentage that speaks a language other than English at home and the percentage that is Hispanic or Latino.
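A minimal sketch of how the most strongly correlated pair of variables can be found from pairwise Pearson correlations. The column names and the five rows of values are made up for illustration and are not taken from the County Facts data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up percentages for five hypothetical counties (illustration only).
columns = {
    "pct_hispanic_or_latino": [2.0, 10.0, 25.0, 40.0, 5.0],
    "pct_non_english_at_home": [4.0, 12.0, 27.0, 43.0, 6.0],
    "pct_over_65": [18.0, 14.0, 12.0, 11.0, 20.0],
}

# Correlation for every pair of columns, then the pair with the largest |r|.
pairs = {
    (a, b): pearson(columns[a], columns[b])
    for a, b in combinations(columns, 2)
}
strongest = max(pairs, key=lambda pair: abs(pairs[pair]))
print(strongest, round(pairs[strongest], 3))
```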
Exploratory Data Analysis
[Figures: fraction of votes plotted against % African American, % non-English speaking, % of females, and % over 65]
Logistic Regression Model
Response Variable: Winner (Clinton = 0, Sanders = 1)
Variable Selection: Age, % females, % whites, % Afro-American, % native Indians, % Hispanic Latino, % foreign born, % education, % veterans, home ownership rate, median value of house, persons per household, per capita income, % of Asian owned firms, etc.
Model Selection:
                  Full Model   Step AIC   Step BIC
AIC               1086.87      1068.88    1076.96
BIC               1268.46      1181.05    1157.08
AUC (training)    0.918        0.917      0.913
AUC (testing)     0.794        0.760      0.771
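As a rough consistency check on the AIC and BIC values reported for the three logistic models, the sketch below assumes the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, and assumes that n is the 1542-county training set. Under those assumptions the gap between BIC and AIC implies the number of estimated parameters k in each model. The function name is illustrative, and the slides do not state how the criteria were computed.

```python
from math import log


def implied_parameter_count(aic, bic, n_obs):
    """Solve BIC - AIC = k * (ln(n) - 2) for k."""
    return (bic - aic) / (log(n_obs) - 2)


N_TRAIN = 1542  # assumption: criteria were computed on the training set

# (AIC, BIC) pairs as reported on the slide.
models = {
    "Full Model": (1086.87, 1268.46),
    "Step AIC": (1068.88, 1181.05),
    "Step BIC": (1076.96, 1157.08),
}

implied_k = {
    name: implied_parameter_count(aic, bic, N_TRAIN)
    for name, (aic, bic) in models.items()
}
for name, k in implied_k.items():
    print(f"{name}: about {k:.1f} parameters")
```

Under these assumptions the full model works out to roughly 34 parameters, which would be consistent with 33 demographic variables plus an intercept, while the stepwise models come out at roughly 21 and 15.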
ROC and Misclassification Rate
[Figures: Training ROC, Testing ROC]
Training confusion matrix:
            Clinton   Sanders
Clinton     920       131
Sanders     124       367
Misclassification Rate: 0.165
Testing confusion matrix:
            Clinton   Sanders
Clinton     226       36
Sanders     35        89
Misclassification Rate: 0.184
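The misclassification rates follow directly from the confusion matrices: off-diagonal counts divided by the total. A small sketch using the counts from the slide (the function name is illustrative):

```python
def misclassification_rate(matrix):
    """Share of counties that fall off the diagonal of a confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total


# Counts as reported on the slide (Clinton first, Sanders second).
training = [[920, 131], [124, 367]]
testing = [[226, 36], [35, 89]]

train_rate = misclassification_rate(training)
test_rate = misclassification_rate(testing)
print(round(train_rate, 3), round(test_rate, 3))
```

The result does not depend on which axis holds the actual winners, since only the diagonal and the total are used.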
Random Forest
Type: Classification
1000 trees
5 variables tried at each split
OOB estimate of error rate: 17.3%
Confusion Matrix
            Clinton   Sanders   Class.Error
Clinton     1169      136       0.104
Sanders     197       426       0.316
[Figure: Importance]
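The Class.Error column and the OOB error estimate can be reproduced from the random forest confusion matrix. The sketch assumes that rows are actual winners and columns are predicted winners, which the slide does not state explicitly. The function names are illustrative.

```python
def class_errors(matrix):
    """Per-row error: share of each row's counts that are off the diagonal."""
    return [1 - row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_error(matrix):
    """Off-diagonal counts divided by the total count."""
    total = sum(sum(row) for row in matrix)
    correct = sum(row[i] for i, row in enumerate(matrix))
    return (total - correct) / total


# Counts from the slide; assumed layout: rows = actual, columns = predicted.
forest_matrix = [[1169, 136], [197, 426]]

clinton_error, sanders_error = class_errors(forest_matrix)
oob_error = overall_error(forest_matrix)
print(round(clinton_error, 3), round(sanders_error, 3), round(oob_error, 3))
```

Under that row and column assumption, the Sanders counties are misclassified about three times as often as the Clinton counties.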
Principal Component Analysis Regression
The demographic data is standardized before principal component analysis is performed on it. This gives 33 uncorrelated components. We consider 8 of the 33 components for further analysis, as they explain 80% of the variance in the data.
[Figures: Importance of components, Testing ROC]
Testing AUC = 86%
Testing confusion matrix:
            Clinton   Sanders
Clinton     1129      176
Sanders     256       367
Testing Error = 22%
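The component selection described in this section (standardize, extract uncorrelated components, keep enough of them to explain 80% of the variance) can be sketched with NumPy as follows. The data here is synthetic and only stands in for the 33 demographic variables; the function name and the SVD-based approach are assumptions, not the authors' code.

```python
import numpy as np


def components_for_variance(X, target=0.80):
    """Standardize X, run PCA via SVD, and pick the smallest number of
    components whose cumulative explained variance reaches `target`."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)           # standardize columns
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)   # PCA through SVD
    explained = s**2 / np.sum(s**2)                     # variance share per component
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, target)) + 1    # first index reaching target
    scores = Z @ Vt.T                                   # uncorrelated component scores
    return k, cumulative, scores


# Synthetic stand-in: 200 rows, 33 correlated columns (illustration only).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 6))
X = latent @ rng.normal(size=(6, 33)) + 0.5 * rng.normal(size=(200, 33))

k, cumulative, scores = components_for_variance(X)
print(k, round(float(cumulative[k - 1]), 3))
```

The first `k` columns of `scores` would then serve as predictors in the regression step, in place of the original variables.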
Model Validation
Counties won by Clinton (C) and Sanders (S), actual vs. predicted (C / S):
State (counties)    Actual    Logistic Model   Random Forest   PCA Regression
Washington (39)     0 / 39    5 / 34           4 / 35          2 / 37
Hawaii (5)          0 / 5     2 / 3            0 / 5           0 / 5
Alaska (29)         0 / 29    2 / 27           1 / 28          0 / 29
Factor Analysis
The purpose of factor analysis is to find unobserved variables (factors) that are fewer in number than the observed variables and uncorrelated with one another.
Using those factors, we should be able to differentiate the voting pattern for the Democratic candidates based on the demographic data of the county.
We tried factor analysis at the following levels:
1. County demographic data
2. State demographic data
3. Winner-wise demographic data
Factor Analysis (Cont’d)
We got 2 factors for the state and county demographic data:
• The 1st factor describes ethnicity information.
• The 2nd factor is based on population and industrial exposure.
State: All states seem to exhibit similar behavior except Hawaii, Alaska & District of Columbia.
County: All counties seem to exhibit similar behavior.
Factor Analysis (Cont’d)
We got 3 factors for the winner-based demographic data:
• Factor 1 concentrates on the population and the median income of the county.
• Factor 2 can be interpreted as the Hispanic and non-native American population.
• Factor 3 can be interpreted as economic prosperity and the white/black population of the county.
Clinton gets the majority of the votes in counties where median income is higher and there are more non-native and Hispanic Americans.
NEW YORK RESULTS
New York: 62 counties
Counties won by Clinton (C) and Sanders (S):
                  C    S
Actual            13   49
Logistic Model    25   37
Random Forest     27   35
PCA Regression    6    56
CONCLUSION
Hillary Clinton seems to be favored in counties where:
• Median income is higher
• The percentage of Hispanic and African American population is higher
People who vote for Sanders are mostly white.
Similar results were obtained from the different modeling techniques.
Thank You