
Modeling Social Data, Lecture 2: Introduction to Counting


Page 1: Modeling Social Data, Lecture 2: Introduction to Counting

Introduction to CountingAPAM E4990

Modeling Social Data

Jake Hofman

Columbia University

January 27, 2017

Jake Hofman (Columbia University) Intro to Counting January 27, 2017 1 / 27

Page 2: Modeling Social Data, Lecture 2: Introduction to Counting

Why counting?

http://bit.ly/august2016poll

p( y | x ),  where y = support and x = age


Page 3: Modeling Social Data, Lecture 2: Introduction to Counting

Why counting?

http://bit.ly/ageracepoll2016

p( y | x1, x2 ),  where y = support and x1, x2 = age, race


Page 4: Modeling Social Data, Lecture 2: Introduction to Counting

Why counting?

Can we estimate p( y | x1, x2, x3, … ), where y = support and x1, x2, x3, … = age, sex, race, party?


Page 5: Modeling Social Data, Lecture 2: Introduction to Counting

Why counting?

Problem:

Traditionally difficult to obtain reliable estimates due to small sample sizes or sparsity

(e.g., ∼ 100 age × 2 sex × 5 race × 3 party = 3,000 groups, but typical surveys collect ∼ 1,000s of responses)


Page 6: Modeling Social Data, Lecture 2: Introduction to Counting

Why counting?

Potential solution:

Sacrifice granularity for precision, by binning observations into larger, but fewer, groups

(e.g., bin age into a few groups: 18-29, 30-49, 50-64, 65+)


Page 7: Modeling Social Data, Lecture 2: Introduction to Counting

Why counting?

Potential solution:

Develop more sophisticated methods that generalize well from small samples

(e.g., fit a model: support ∼ β0 + β1·age + β2·age² + …)


Page 8: Modeling Social Data, Lecture 2: Introduction to Counting

Why counting?

(Partial) solution:

Obtain larger samples through other means, so we can just count and divide to make estimates via relative frequencies

(e.g., with ∼ 1M responses, we have 100s per group and can estimate support within a few percentage points)


Page 9: Modeling Social Data, Lecture 2: Introduction to Counting

Why counting?

Wang, W., Rothschild, D., Goel, S., & Gelman, A. (2015). Forecasting elections with non-representative polls. International Journal of Forecasting, 31, 980–991.

[Slide reproduces the first pages of the paper. In brief: with proper statistical adjustment, non-representative polls can produce accurate election forecasts, often faster and more cheaply than traditional representative polling. The authors demonstrate this with a highly non-representative survey — daily voter-intention polls conducted on the Xbox gaming platform during the 45 days before the 2012 US presidential election (750,148 interviews from 345,858 unique respondents, heavily skewed toward young men) — which, after adjustment via multilevel regression and poststratification (MRP), yields estimates in line with forecasts from leading poll aggregators.]

http://bit.ly/nonreppoll


Page 10: Modeling Social Data, Lecture 2: Introduction to Counting

Why counting?

The good:

Shift away from sophisticated statistical methods on small samples to simpler methods on large samples


Page 11: Modeling Social Data, Lecture 2: Introduction to Counting

Why counting?

The bad:

Even simple methods (e.g., counting) are computationally challenging at large scales

(1M is easy, 1B a bit less so, 1T gets interesting)


Page 12: Modeling Social Data, Lecture 2: Introduction to Counting

Why counting?

Claim:

Solving the counting problem at scale enables you to investigate many interesting questions in the social sciences


Page 13: Modeling Social Data, Lecture 2: Introduction to Counting

Learning to count

This week:

Counting at small/medium scales on a single machine

Following weeks:

Counting at large scales in parallel



Page 15: Modeling Social Data, Lecture 2: Introduction to Counting

Counting, the easy way

Split / Apply / Combine [1]

• Load dataset into memory

• Split: Arrange observations into groups of interest

• Apply: Compute distributions and statistics within each group

• Combine: Collect results across groups

[1] http://bit.ly/splitapplycombine

Page 16: Modeling Social Data, Lecture 2: Introduction to Counting

The generic group-by operation

Split / Apply / Combine

for each observation as (group, value):

place value in bucket for corresponding group

for each group:

apply a function over values in bucket

output group and result

Useful for computing arbitrary within-group statistics when we have the required memory

(e.g., conditional distribution, median, etc.)
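As a concrete sketch, the split/apply/combine loop above might look like this in Python (function and variable names here are illustrative, not from the course):

```python
from collections import defaultdict
from statistics import median

def group_by(observations, func):
    # Split: arrange (group, value) observations into buckets by group
    buckets = defaultdict(list)
    for group, value in observations:
        buckets[group].append(value)
    # Apply / Combine: compute func within each bucket, collect results
    return {group: func(values) for group, values in buckets.items()}

ratings = [("movie_a", 5), ("movie_a", 3), ("movie_b", 4)]
print(group_by(ratings, median))  # {'movie_a': 4.0, 'movie_b': 4}
```

Note that this stores every value in memory, which is exactly what restricts the approach to datasets that fit in RAM.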



Page 18: Modeling Social Data, Lecture 2: Introduction to Counting

Why counting?


Page 19: Modeling Social Data, Lecture 2: Introduction to Counting

Example: Anatomy of the long tail

Dataset     Users   Items   Rating levels   Observations
Movielens   100K    10K     10              10M
Netflix     500K    20K     5               100M



Page 21: Modeling Social Data, Lecture 2: Introduction to Counting

Example: Movielens

How many ratings are there at each star level?

[Bar chart: number of ratings (0 to 3,000,000) at each rating level, 1–5]


Page 22: Modeling Social Data, Lecture 2: Introduction to Counting

Example: Movielens

[Bar chart: number of ratings at each rating level, as on the previous slide]

group by rating value

for each group:

count # ratings
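Counting ratings at each star level is a one-liner with a counter; a sketch assuming plain integer ratings (illustrative data, not the actual Movielens file):

```python
from collections import Counter

ratings = [5, 3, 4, 5, 1, 4, 5]        # stand-in for the ratings column
counts = Counter(ratings)              # group by rating value, count per group
for rating in sorted(counts):
    print(rating, counts[rating])
```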


Page 23: Modeling Social Data, Lecture 2: Introduction to Counting

Example: Movielens

What is the distribution of average ratings by movie?

[Density plot: distribution of mean rating by movie (1–5)]


Page 24: Modeling Social Data, Lecture 2: Introduction to Counting

Example: Movielens

group by movie id

for each group:

compute average rating

[Density plot: mean rating by movie, as on the previous slide]


Page 25: Modeling Social Data, Lecture 2: Introduction to Counting

Example: Movielens

What fraction of ratings are given to the most popular movies?

[CDF plot: cumulative share of ratings (0%–100%) vs. movie rank (0–9,000)]


Page 26: Modeling Social Data, Lecture 2: Introduction to Counting

Example: Movielens

[CDF plot: cumulative share of ratings vs. movie rank, as on the previous slide]

group by movie id

for each group:

count # ratings

sort by group size

cumulatively sum group sizes
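The rank/CDF computation above, sketched in Python over (movie_id, rating) pairs (names are illustrative):

```python
from collections import Counter

def rating_cdf(observations):
    # group by movie id, count ratings per movie
    counts = Counter(movie for movie, _ in observations)
    # sort group sizes from most- to least-rated
    sizes = sorted(counts.values(), reverse=True)
    total = sum(sizes)
    # cumulatively sum group sizes, as a fraction of all ratings
    cdf, running = [], 0
    for size in sizes:
        running += size
        cdf.append(running / total)
    return cdf

obs = [("a", 5), ("a", 4), ("a", 3), ("b", 2), ("c", 5)]
print(rating_cdf(obs))  # [0.6, 0.8, 1.0]
```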


Page 27: Modeling Social Data, Lecture 2: Introduction to Counting

Example: Movielens

What is the median rank of each user’s rated movies?

[Histogram: number of users (0–8,000) vs. user eccentricity (log scale, 100–10,000)]


Page 28: Modeling Social Data, Lecture 2: Introduction to Counting

Example: Movielens

join movie ranks to ratings

group by user id

for each group:

compute median movie rank
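A sketch of the join-then-group-by in Python, assuming a precomputed dict of movie popularity ranks (illustrative names, not the course's code):

```python
from collections import defaultdict
from statistics import median

def user_eccentricity(observations, movie_rank):
    # join: look up each rated movie's popularity rank
    ranks_by_user = defaultdict(list)
    for user, movie in observations:        # group by user id
        ranks_by_user[user].append(movie_rank[movie])
    # per group: compute median movie rank
    return {user: median(ranks) for user, ranks in ranks_by_user.items()}

movie_rank = {"a": 1, "b": 2, "c": 3}       # 1 = most-rated movie
obs = [("u1", "a"), ("u1", "c"), ("u2", "b")]
print(user_eccentricity(obs, movie_rank))   # {'u1': 2.0, 'u2': 2}
```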

[Histogram: number of users vs. user eccentricity, as on the previous slide]


Page 29: Modeling Social Data, Lecture 2: Introduction to Counting

Example: Anatomy of the long tail

Dataset     Users   Items   Rating levels   Observations
Movielens   100K    10K     10              10M
Netflix     500K    20K     5               100M

What do we do when the full dataset exceeds available memory?


Page 30: Modeling Social Data, Lecture 2: Introduction to Counting

Example: Anatomy of the long tail


What do we do when the full dataset exceeds available memory?

Sampling?
Unreliable estimates for rare groups


Page 31: Modeling Social Data, Lecture 2: Introduction to Counting

Example: Anatomy of the long tail


What do we do when the full dataset exceeds available memory?

Random access from disk?
1000x more storage, but 1000x slower [2]

[2] Numbers every programmer should know

Page 32: Modeling Social Data, Lecture 2: Introduction to Counting

Example: Anatomy of the long tail


What do we do when the full dataset exceeds available memory?

Streaming
Read data one observation at a time, storing only needed state


Page 33: Modeling Social Data, Lecture 2: Introduction to Counting

The combinable group-by operation

Streaming

for each observation as (group, value):

if new group:

initialize result

update result for corresponding group as function of

existing result and current value

for each group:

output group and result

Useful for computing a subset of within-group statistics with a limited memory footprint

(e.g., min, mean, max, variance, etc.)
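For a combinable statistic like the mean, the streaming loop above needs only a (count, total) pair per group; a Python sketch with illustrative names:

```python
def streaming_mean(stream):
    # one pass over (group, value) pairs; memory is O(#groups),
    # not O(#observations)
    state = {}
    for group, value in stream:
        count, total = state.get(group, (0, 0.0))   # initialize new groups
        state[group] = (count + 1, total + value)   # update existing result
    # for each group: output group and result
    return {group: total / count
            for group, (count, total) in state.items()}

print(streaming_mean([("a", 4), ("a", 2), ("b", 5)]))  # {'a': 3.0, 'b': 5.0}
```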



Page 35: Modeling Social Data, Lecture 2: Introduction to Counting

Example: Movielens

[Bar chart: number of ratings at each rating level, as on Page 21]

for each rating:

counts[rating]++


Page 36: Modeling Social Data, Lecture 2: Introduction to Counting

Example: Movielens

for each rating:

totals[movie id] += rating

counts[movie id]++

for each group:

output totals[movie id] / counts[movie id]

[Density plot: mean rating by movie, as on Page 23]


Page 37: Modeling Social Data, Lecture 2: Introduction to Counting

Yet another group-by operation

Per-group histograms

for each observation as (group, value):

histogram[group][value]++

for each group:

compute result as a function of histogram

output group and result

We can recover arbitrary statistics if we can afford to store counts of all distinct values within each group
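When values take few distinct levels (e.g., star ratings), per-group histograms recover even non-combinable statistics like the median; a sketch computing the lower median in Python (illustrative names, not the course's code):

```python
from collections import defaultdict, Counter

def median_from_histograms(observations):
    # histogram[group][value]++ -- memory is O(V*G), not O(N)
    histograms = defaultdict(Counter)
    for group, value in observations:
        histograms[group][value] += 1
    medians = {}
    for group, histogram in histograms.items():
        n = sum(histogram.values())
        seen = 0
        for value in sorted(histogram):     # walk values in sorted order
            seen += histogram[value]
            if 2 * seen >= n:               # reached the midpoint
                medians[group] = value
                break
    return medians

print(median_from_histograms([("a", 1), ("a", 5), ("a", 5), ("b", 3)]))
# {'a': 5, 'b': 3}
```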



Page 39: Modeling Social Data, Lecture 2: Introduction to Counting

The group-by operation

For arbitrary input data:

Memory   Scenario              Distributions   Statistics
N        Small dataset         Yes             General
V*G      Small distributions   Yes             General
G        Small # groups        No              Combinable
V        Small # outcomes      No              No
1        Large # both          No              No

N = total number of observations
G = number of distinct groups
V = largest number of distinct values within a group


Page 40: Modeling Social Data, Lecture 2: Introduction to Counting

Examples (w/ 8GB RAM)

Median rating by movie for Netflix

N ∼ 100M ratings
G ∼ 20K movies
V ∼ 10 half-star values

V*G ∼ 200K, store per-group histograms for arbitrary statistics

(scales to arbitrary N, if you’re patient)
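A back-of-envelope check of why the histograms fit comfortably in 8GB (rough numbers, assuming a generous per-counter overhead):

```python
movies, levels = 20_000, 10        # G groups, V distinct rating values
cells = movies * levels            # V*G ~ 200K histogram counters
approx_bytes = cells * 100         # ~100 bytes per counter, being generous
print(cells, approx_bytes / 1e6, "MB")  # 200000 20.0 MB
```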


Page 41: Modeling Social Data, Lecture 2: Introduction to Counting

Examples (w/ 8GB RAM)

Median rating by video for YouTube

N ∼ 10B ratings
G ∼ 1B videos
V ∼ 10 half-star values

V*G ∼ 10B, fails because per-group histograms are too large to store in memory

G ∼ 1B, but no (exact) calculation for streaming median


Page 42: Modeling Social Data, Lecture 2: Introduction to Counting

Examples (w/ 8GB RAM)

Mean rating by video for YouTube

N ∼ 10B ratings
G ∼ 1B videos
V ∼ 10 half-star values

G ∼ 1B, use streaming to compute combinable statistics


Page 43: Modeling Social Data, Lecture 2: Introduction to Counting

The group-by operation

For pre-grouped input data:

Memory   Scenario              Distributions   Statistics
N        Small dataset         Yes             General
V*G      Small distributions   Yes             General
G        Small # groups        No              Combinable
V        Small # outcomes      Yes             General
1        Large # both          No              Combinable

N = total number of observations
G = number of distinct groups
V = largest number of distinct values within a group
