Inference with big data: SCECR 2012 Presentation

Inference with Big Data: A Superpower Approach

Galit Shmuéli, Indian School of Business

Mohit Dayal, Lalita Reddi, Bhim Pochiraju

Mingfeng Lin, Hank Lucas

Big data studies (in information systems) increasingly common

Number of IS papers with n > 10,000 (2004-2010)

Large-study IS papers: How Big?

“over 10,000 publicly available feedback text comments… in eBay”
The Nature and Role of Feedback Text Comments in Online Marketplaces (Pavlou & Dimoka, ISR 2006)

“we use… 3.7 million records, encompassing transactions for the Federal Supply Service (FSS) of the U.S. Federal government in fiscal year 2000”
Using Transaction Prices to Re-Examine Price Dispersion in Electronic Markets (Ghose & Yao, ISR 2011)

“51,062 rare coin auctions that took place… on eBay”
The Sound of Silence in Online Feedback (Dellarocas & Wood, Management Science 2006)

“We collected data on … [175,714] reviews from Amazon”
Examining the Relationship Between Reviews and Sales (Forman et al., ISR 2008)

108,333 used vehicles offered in the wholesale automotive market
Electronic vs. Physical Market Mechanisms (Overby & Jap, Management Science 2009)

“For our analysis, we have … 784,882 [portal visits]”
Household-Specific Regressions Using Clickstream Data (Goldfarb & Lu, Statistical Science 2006)

Apply small sample approach to Big Data studies?

It’s all about Power

Magnify effects

Separate signal from noise

Artwork: “Running the Numbers” by Chris Jordan (www.chrisjordan.com): 426,000 cell phones retired in the US every day

Power = Prob(detect H1 effect) = f(sample size, effect size, α, noise)
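As a rough illustration of power as a function of sample size, the sketch below computes the power of a one-sided z-test. The test choice, the function name `power_one_sided_z`, and the effect size of 0.01 noise standard deviations are assumptions for illustration, not taken from the slides.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(n, effect_size, alpha=0.05):
    """Power of a one-sided z-test.

    effect_size is the true mean shift in units of the noise SD (assumed known).
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # Under H1 the z statistic is centered at effect_size * sqrt(n)
    return 1 - NormalDist().cdf(z_crit - effect_size * sqrt(n))

# A tiny effect goes from nearly undetectable to almost surely detected as n grows
for n in (100, 10_000, 341_136):
    print(n, round(power_one_sided_z(n, effect_size=0.01), 3))
```

With everything else held fixed, power rises toward 1 as n grows, which is the sense in which a very large sample acts as a magnifier.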

Small & complex effects

Stronger validity

Rare events

The Promise

Hypotheses · Data Exploration · Models · Model Validation · Inference

Statistical Technology

DATA VIZ: “BIG DATA” CHARTS

Chapter 1: With Mohit Dayal & Lalita Reddi (ISB)

Scaling Up Data Visualization

Missing values

Big Data Scatter plot

Visualization: Big Data Boxplot

BIG DATA (SUPERPOWER) APPROACH:
• Charts based on aggregation
• Interactive viz (zoom & pan, filter, etc.)
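One way to read "charts based on aggregation" is to summarize the data before drawing it. The sketch below bins a scatter plot into grid-cell counts and reduces a boxplot to its five-number summary. The helper names, the bin width, and the simulated data are illustrative assumptions.

```python
import random
from collections import Counter
from statistics import quantiles

def bin_counts(xs, ys, bin_width):
    """Count points per square grid cell; the counts can drive a heatmap
    instead of one marker per observation."""
    return Counter((int(x // bin_width), int(y // bin_width)) for x, y in zip(xs, ys))

def boxplot_summary(values):
    """Five-number summary from which a boxplot can be drawn without raw points."""
    q1, median, q3 = quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

random.seed(0)  # simulated data, for illustration only
xs = [random.gauss(0, 1) for _ in range(100_000)]
ys = [x + random.gauss(0, 1) for x in xs]

cells = bin_counts(xs, ys, bin_width=0.5)
print(len(cells), "cells summarize", sum(cells.values()), "points")
print(boxplot_summary(ys))
```

The plotting layer then only has to render a few hundred cells, whatever the number of rows.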

BIG DATA AND SMALL-SAMPLE INFERENCE

Simple hypotheses: H1: β1 > 0

$f(y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \cdots + \beta_k X_k + \varepsilon$

Few control variables

Assumptions?

Which model?

$f(y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \cdots + \beta_k X_k$

Few hypotheses

What data?

Sign + statistical significance

Model fit

Robustness

What doesn’t scale up? (wrong conclusions)
What are missed opportunities?

TOO BIG TO FAIL: LARGE SAMPLES AND FALSE DISCOVERIES

Chapter 2: With Hank Lucas (UMD) & Mingfeng Lin (UoA)

Small p-values (***) are not interesting

Large sample result: Deflated p-values

p-value ~ proximity of sample to H0 = f(effect size, sample size, noise)
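To make the dependence on sample size concrete, the sketch below holds a small observed effect and the noise level fixed and varies only n. It assumes a two-sided z-test of H0: effect = 0; the function name and the numbers are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(effect, noise_sd, n):
    """p-value of a two-sided z-test of H0: effect = 0."""
    z = effect / (noise_sd / sqrt(n))  # standard error shrinks like 1/sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same effect, same noise: only the sample size changes
for n in (100, 10_000, 341_136):
    print(n, two_sided_p(effect=0.01, noise_sd=1.0, n=n))
```

The effect is equally slight in every row; only the p-value changes.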

H0: β = 0

$\ln(\text{Price}) = \beta_0 + \beta_1 \ln(\text{minimumBid}) + \beta_2\,\text{reserve} + \beta_3 \ln(\text{sellerFeedback}) + \beta_4(\text{Duration}) + \beta_5(\text{controls})$

H1: Higher minimum bids lead to higher final prices (β1 > 0)
H2: Auctions with reserve price will sell for higher prices (β2 > 0)
H3: Duration affects price (β4 ≠ 0)
H4: The higher the seller feedback, the higher the price (β3 > 0)

Auctions for digital cameras, Aug ’07 - Jan ’08 [thanks to Wolfgang Jank for the data!]

n=341,136

“In a large sample, we can obtain very large t statistics with low p-values for our predictors, when, in fact, their effect on Y is very slight”

Applied Statistics in Business & Economics, Doane & Seward

BIG DATA (SUPERPOWER) APPROACH:
• Focus on effect size (ignore p-values)
• Subsamples for robustness: “results quantitatively similar”
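A minimal sketch of this approach, assuming a single-predictor OLS model and simulated data with a true slope of 0.05 (both are illustrative choices, not the auction model): report the coefficient with a confidence interval, then check that non-overlapping random subsamples give similar magnitudes.

```python
import random
from math import sqrt

def slope_with_ci(xs, ys, z=1.96):
    """OLS slope of y on x with an approximate 95% confidence interval."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = sqrt(rss / (n - 2) / sxx)
    return slope, slope - z * se, slope + z * se

random.seed(1)  # simulated data; the true slope of 0.05 is an assumption
n = 50_000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [0.05 * x + random.gauss(0, 1) for x in xs]

print("full sample:", slope_with_ci(xs, ys))

idx = list(range(n))
random.shuffle(idx)
subsample_slopes = []
for k in range(5):  # five non-overlapping random subsamples of 10,000 rows
    part = idx[k * 10_000:(k + 1) * 10_000]
    est = slope_with_ci([xs[i] for i in part], [ys[i] for i in part])
    subsample_slopes.append(est[0])
    print("subsample", k, est)
```

The interval conveys how large the effect plausibly is, which a p-value alone does not.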

MODEL ASSUMPTIONS, DIAGNOSTICS AND ADJUSTMENT

Chapter 3:

With big data, we’re in the realm of asymptotic behaviour

$n \to \infty$

Assumption violation | Coefficient bias | Standard errors | Redundant diagnostic tests
Under-specification | all | bias | -
Endogeneity* | all | 2SLS is worse | Instrument strength (Sargan)
E(ε) = 0 | β0 | bias | Lack-of-fit
Non-normality | - | - | Anderson-Darling
Heteroscedasticity | - | bias | Breusch-Pagan
Over-specification | - | - | -
Serial dependence | - | bias | Durbin-Watson
Multicollinearity | - | increase | Significant correlations
Influential outliers | - | - | Leverage (multiple testing)

Violated assumptions: less tinkering

*IV estimators only have desirable asymptotic, not finite sample, properties

BIG DATA (SUPERPOWER) APPROACH:
• Focus on bias-related assumptions
• Avoid statistical tests (p-value challenge)
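The p-value challenge applies to diagnostic tests too. As a hedged sketch, consider an LM-type test whose statistic is n times the R² of an auxiliary regression (the form used by the studentized Breusch-Pagan test), with one auxiliary regressor so that the statistic is chi-square with 1 degree of freedom. The auxiliary R² of 0.001 is an assumed value standing in for a practically negligible violation.

```python
from math import sqrt
from statistics import NormalDist

def lm_test_p_value(n, aux_r2):
    """p-value of an n * R^2 LM test with one auxiliary regressor.

    A chi-square variable with 1 df is a squared standard normal, so
    P(chi2 > lm) = 2 * (1 - Phi(sqrt(lm))).
    """
    lm = n * aux_r2
    return 2 * (1 - NormalDist().cdf(sqrt(lm)))

# The same negligible violation is "not significant" at n = 1,000
# and overwhelmingly "significant" at n = 341,136
for n in (1_000, 30_000, 341_136):
    print(n, lm_test_p_value(n, aux_r2=0.001))
```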

COMPLEX EFFECTS & HETEROGENEITY

Chapter 4: With Bhimsankaram Pochiraju & Mohit Dayal (ISB)

With Big Data:

Detect small (but important) effects

Detect rare events (in rare minorities)

H1: β3 > 2

$f(y) = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 X_3 + \cdots + \beta_k X_k + \varepsilon$

control variables

Fewer assumptions

Fixed effects

Which model?

$f(y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \cdots + \beta_k X_k$

Complex hypotheses

What data?

Magnitude, model fit, robustness, predictive

Heterogeneous: clustering/mixtures, sub-samples, propensity scores, 2SLS

Test complex hypotheses

• Moderators
• Nonlinear relationships
• Multiple categories
• Control variables

Quantify subtle effects
Specific measures (20 eBay categories)
“Moderator variables are difficult to detect” (Aguinis, 1994)

Low R², yet non-zero coefficients

The rovers have a magnifying camera… that scientists can use to carefully look at the fine structure of a rock

Stepwise Selection

OLS with stepwise selection (AIC measure)
Logistic with variable selection (RELR)

Candidate terms:
• All independent variables
• All control variables
• Quadratic terms of continuous variables
• 2-way interactions

Choose software carefully (R: “out of memory”)
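A minimal sketch of forward stepwise OLS selection by AIC over main effects, quadratic terms, and 2-way interactions. Everything here is an assumption for illustration: the simulated data, the term names, the forward-only search, and the pure-Python solver. It is not the procedure or software used in the study.

```python
import random
from itertools import combinations
from math import log

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def aic(columns, y):
    """AIC (up to a constant) of an OLS fit of y on an intercept plus columns."""
    n = len(y)
    design = [[1.0] * n] + columns
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in design] for ci in design]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in design]
    beta = solve(xtx, xty)  # normal equations
    rss = sum((yi - sum(b * col[i] for b, col in zip(beta, design))) ** 2
              for i, yi in enumerate(y))
    return n * log(rss / n) + 2 * len(design)

def forward_stepwise(candidates, y):
    """Greedily add the term that lowers AIC most; stop when none improves it."""
    chosen, best = [], aic([], y)
    while True:
        scores = {name: aic([candidates[c] for c in chosen + [name]], y)
                  for name in candidates if name not in chosen}
        if not scores or min(scores.values()) >= best:
            return chosen
        name = min(scores, key=scores.get)
        chosen.append(name)
        best = scores[name]

random.seed(2)  # simulated data with a known main effect and interaction
n = 1_000
base = {f"x{j}": [random.gauss(0, 1) for _ in range(n)] for j in (1, 2, 3)}
candidates = dict(base)
for name, col in base.items():            # quadratic terms
    candidates[name + "^2"] = [v * v for v in col]
for a, b in combinations(base, 2):        # 2-way interactions
    candidates[a + "*" + b] = [u * v for u, v in zip(base[a], base[b])]
y = [1 + 2 * x1 + 0.5 * x1 * x2 + random.gauss(0, 1)
     for x1, x2 in zip(base["x1"], base["x2"])]

print(forward_stepwise(candidates, y))
```

The candidate set grows quadratically with the number of variables, which is one reason memory and software choice become a concern at scale.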

Heterogeneity: CART

• Identify non-linearities and interactions

• Does not identify different models
• Challenge: independent variables vs. control variables
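To show how a tree picks up a non-linearity, here is a sketch of the basic CART building block: an exhaustive search for the single split on one variable that minimizes squared error. The function name and the simulated step at x = 6 are illustrative assumptions; a full tree applies this search recursively over all variables.

```python
import random

def best_split(xs, ys):
    """Threshold on x minimizing the summed squared error around two leaf means."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    total_sq = sum(y * y for _, y in pairs)
    best_sse, best_threshold = float("inf"), None
    left_sum = left_sq = 0.0
    for i in range(1, n):
        y = pairs[i - 1][1]
        left_sum += y
        left_sq += y * y
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical x values
        right_sum, right_sq = total - left_sum, total_sq - left_sq
        # SSE of a leaf = sum of squares - (sum)^2 / count
        sse = (left_sq - left_sum ** 2 / i) + (right_sq - right_sum ** 2 / (n - i))
        if sse < best_sse:
            best_sse = sse
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_threshold

random.seed(3)  # simulated data: the response jumps at x = 6 (an assumption)
xs = [random.uniform(0, 10) for _ in range(5_000)]
ys = [(3.0 if x > 6 else 0.0) + random.gauss(0, 1) for x in xs]
print(best_split(xs, ys))
```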

Clustering

1. Cluster observations on all independent and control variables

2. Fit a separate regression model to each cluster

• Popular in risk analytics
• Fast, easy
• Does not guarantee distinct relationships
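The two steps can be sketched as follows, assuming plain k-means with k = 2 and a simple regression of y on x in each cluster. The simulated segments, the variable names, and the first-k-points initialization are illustrative simplifications.

```python
import random

def kmeans(points, k, iters=20):
    """Plain k-means; starting centers are the first k points (a simplification)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def fit_line(xs, ys):
    """Simple OLS of y on x; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

random.seed(4)  # simulated data: two segments with opposite slopes (an assumption)
rows = []
for i in range(2_000):
    segment = i % 2  # alternate segments so the two starting centers differ
    x = random.gauss(0, 1)
    control = random.gauss(-3 if segment == 0 else 3, 1)
    y = (1.0 if segment == 0 else -1.0) * x + random.gauss(0, 1)
    rows.append((x, control, y))

# Step 1: cluster on the independent and control variables (not on y)
labels = kmeans([(x, c) for x, c, _ in rows], k=2)

# Step 2: fit a separate regression within each cluster
slopes = {}
for cluster in (0, 1):
    cx = [x for (x, _, _), lab in zip(rows, labels) if lab == cluster]
    cy = [y for (_, _, y), lab in zip(rows, labels) if lab == cluster]
    slopes[cluster] = fit_line(cx, cy)[1]
print(slopes)
```

The toy data is built so that the clusters coincide with the two regression regimes. In general nothing forces that, which is the caveat in the last bullet.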

Finite Mixture Regression

Search for k separate regressions
Convergence issues on the entire dataset
Of 10 subsamples (n = 30K), the search converged for seven
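For orientation, a minimal EM sketch for a mixture of k = 2 simple linear regressions on simulated data. The starting values, the fixed iteration count, and all function names are assumptions. Real data is far less forgiving, which is consistent with the convergence issues noted above.

```python
import random
from math import exp, pi, sqrt

def weighted_line(xs, ys, ws):
    """Weighted least squares of y on x; returns (intercept, slope, residual SD)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    slope = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys)) / sxx
    intercept = my - slope * mx
    var = sum(w * (y - intercept - slope * x) ** 2
              for w, x, y in zip(ws, xs, ys)) / sw
    return intercept, slope, sqrt(var)

def normal_pdf(r, sd):
    return exp(-0.5 * (r / sd) ** 2) / (sd * sqrt(2 * pi))

def em_two_regressions(xs, ys, init, iters=50):
    """EM for a two-component mixture of linear regressions.

    init is a list of two (intercept, slope, sd) starting values.
    """
    params, mix = list(init), [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of component 0 for each observation
        resp = []
        for x, y in zip(xs, ys):
            d = [m * normal_pdf(y - a - b * x, s) for m, (a, b, s) in zip(mix, params)]
            resp.append(d[0] / (d[0] + d[1] + 1e-300))  # guard against 0/0
        # M-step: one weighted regression per component, then mixing weights
        params = [weighted_line(xs, ys, resp),
                  weighted_line(xs, ys, [1 - r for r in resp])]
        share = sum(resp) / len(resp)
        mix = [share, 1 - share]
    return params, mix

random.seed(6)  # simulated data: slopes +2 and -2 are assumptions
xs = [random.uniform(-3, 3) for _ in range(2_000)]
ys = [(2.0 if random.random() < 0.5 else -2.0) * x + random.gauss(0, 0.5) for x in xs]

params, mix = em_two_regressions(xs, ys, init=[(0.0, 1.0, 1.0), (0.0, -1.0, 1.0)])
print(params, mix)
```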

MODEL VALIDATION

Chapter 5:

Improve model validation, comparison, and generalization

Internal & external validity
Robustness across subsamples: non-random; random (overlapping / non-overlapping)

Improve predictive validation

Training set

Holdout set
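A minimal sketch of predictive validation with a training set and a holdout set. The 80/20 split, the RMSE metric, and the simulated data are illustrative assumptions.

```python
import random
from math import sqrt

def fit_line(xs, ys):
    """Simple OLS of y on x; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def rmse(model, xs, ys):
    """Root mean squared prediction error of a fitted line."""
    intercept, slope = model
    return sqrt(sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs))

random.seed(5)  # simulated data, for illustration only
data = []
for _ in range(20_000):
    x = random.gauss(0, 1)
    data.append((x, 0.5 * x + random.gauss(0, 1)))

random.shuffle(data)
cut = int(0.8 * len(data))            # 80% training, 20% holdout
train, holdout = data[:cut], data[cut:]

model = fit_line([x for x, _ in train], [y for _, y in train])
train_rmse = rmse(model, [x for x, _ in train], [y for _, y in train])
holdout_rmse = rmse(model, [x for x, _ in holdout], [y for _, y in holdout])
print("train RMSE:", train_rmse, "holdout RMSE:", holdout_rmse)
```

With a large sample, the holdout set can itself be large, so the out-of-sample error estimate is stable.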

SMALL SAMPLE MODELING APPROACH

Clark Kent ≤ Superman