
Inference with big data: SCECR 2012 Presentation


Page 1: Inference with big data: SCECR 2012 Presentation

SCECR

Inference with Big Data

A Superpower Approach

Galit Shmuéli, Indian School of Business

Mohit Dayal, Lalita Reddi, Bhim Pochiraju

Mingfeng Lin, Hank Lucas

Page 2: Inference with big data: SCECR 2012 Presentation

Big data studies (in information systems) increasingly common

Chart: number of IS papers with n > 10,000 (2004-2010)

Page 3: Inference with big data: SCECR 2012 Presentation

Large-study IS papers: How Big?

“over 10,000 publicly available feedback text comments… in eBay”
The Nature and Role of Feedback Text Comments in Online Marketplaces, Pavlou & Dimoka, ISR 2006

“we use… 3.7 million records, encompassing transactions for the Federal Supply Service (FSS) of the U.S. Federal government in fiscal year 2000”
Using Transaction Prices to Re-Examine Price Dispersion in Electronic Markets, Ghose & Yao, ISR 2011

“51,062 rare coin auctions that took place… on eBay”
The Sound of Silence in Online Feedback, Dellarocas & Wood, Management Science 2006

“We collected data on … [175,714] reviews from Amazon”
Examining the Relationship Between Reviews and Sales, Forman et al., ISR 2008

108,333 used vehicles offered in the wholesale automotive market
Electronic vs. Physical Market Mechanisms, Overby & Jap, Management Science 2009

“For our analysis, we have … 784,882 [portal visits]”
Household-Specific Regressions Using Clickstream Data, Goldfarb & Lu, Statistical Science 2006

Page 4: Inference with big data: SCECR 2012 Presentation
Page 5: Inference with big data: SCECR 2012 Presentation

Apply small sample approach to Big Data studies?

Page 6: Inference with big data: SCECR 2012 Presentation

It’s all about Power

Page 7: Inference with big data: SCECR 2012 Presentation

Magnify effects

Separate signal from noise

Page 8: Inference with big data: SCECR 2012 Presentation

Artwork: “Running the Numbers” by Chris Jordan (www.chrisjordan.com): 426,000 cell phones retired in the US every day

Page 9: Inference with big data: SCECR 2012 Presentation

Power = Prob(detect H1 effect)

= f(sample size, effect size, α, noise)
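The dependence of power on sample size can be illustrated with a minimal sketch. It assumes a one-sided z-test of a mean with known noise standard deviation, which is a simplification chosen for illustration and not a model from the slides; the function name and the effect value of 0.01 are hypothetical.

```python
from statistics import NormalDist

def power_one_sided_z(effect, noise_sd, n, alpha=0.05):
    """Approximate power of a one-sided z-test of H0: mean = 0 when the true mean is `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)      # rejection threshold under H0
    shift = effect / (noise_sd / n ** 0.5)      # true effect measured in standard errors
    return 1 - std_normal.cdf(z_crit - shift)

# Hypothetical tiny effect (0.01 noise standard deviations) at growing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(n, round(power_one_sided_z(effect=0.01, noise_sd=1.0, n=n), 3))
```

With the effect, α, and noise held fixed, power rises toward 1 as n grows, which is the sense in which a very large sample acts as a magnifier.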

Page 10: Inference with big data: SCECR 2012 Presentation

Small & complex effects

Stronger validity

Rare events

The Promise

Page 11: Inference with big data: SCECR 2012 Presentation

Hypotheses Data Exploration Models Model Validation Inference

Statistical Technology

Page 12: Inference with big data: SCECR 2012 Presentation

DATA VIZ: “BIG DATA” CHARTS

Chapter 1: With Mohit Dayal & Lalita Reddi (ISB)

Page 13: Inference with big data: SCECR 2012 Presentation

Scaling Up Data Visualization

Missing values

Page 14: Inference with big data: SCECR 2012 Presentation

Big Data Scatter plot

Page 15: Inference with big data: SCECR 2012 Presentation

Visualization: Big Data Boxplot

Page 16: Inference with big data: SCECR 2012 Presentation

BIG DATA (SUPERPOWER) APPROACH:

Charts based on aggregation

Interactive viz (zoom & pan, filter, etc.)
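One way to read "charts based on aggregation" is to bin the raw points and plot cell counts instead of individual markers. The sketch below is an assumption about what such aggregation could look like, not the charts shown in the talk; `bin_2d` and the cell widths are hypothetical names and values.

```python
import random
from collections import Counter

def bin_2d(points, x_width, y_width):
    """Count points per rectangular cell; keys are (column, row) cell indices."""
    counts = Counter()
    for x, y in points:
        counts[(int(x // x_width), int(y // y_width))] += 1
    return counts

# Synthetic data: 100,000 points collapse into a much smaller grid of counts.
rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100_000)]
cells = bin_2d(points, x_width=0.5, y_width=0.5)
print(len(points), "points reduced to", len(cells), "cells")
```

The counts can then be drawn as a heat map, so that overplotting no longer hides where the data are dense.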

Page 17: Inference with big data: SCECR 2012 Presentation

BIG DATA AND SMALL-SAMPLE INFERENCE

Page 18: Inference with big data: SCECR 2012 Presentation

Simple hypotheses, H1: β1 > 0

$f(y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \cdots + \beta_k X_k + \varepsilon$

Few control variables

Assumptions?

Which model?

$f(y) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_1 X_2 + \cdots + \hat{\beta}_k X_k$

Few hypotheses

What data?

Sign + statistical significance

Model fit

Robustness

Page 19: Inference with big data: SCECR 2012 Presentation

What doesn’t scale up? (wrong conclusions)

What are missed opportunities?

Page 20: Inference with big data: SCECR 2012 Presentation

TOO BIG TO FAIL: LARGE SAMPLES AND FALSE DISCOVERIES

Chapter 2: With Hank Lucas (UMD) & Mingfeng Lin (UoA)

Page 21: Inference with big data: SCECR 2012 Presentation

small p-values*** are not interesting

Page 22: Inference with big data: SCECR 2012 Presentation

Large sample result: Deflated p-values

p-value ~ proximity of sample to H0

= f(effect size, sample size, noise)

H0: β = 0
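The deflation of p-values with sample size can be sketched numerically. The example assumes a two-sided z-test with known noise standard deviation and a fixed, hypothetical effect estimate of 0.01; it is an illustration, not the analysis behind the slides.

```python
from statistics import NormalDist

def two_sided_p_value(estimate, noise_sd, n):
    """Two-sided z-test p-value for H0: effect = 0, given a sample estimate of the effect."""
    standard_error = noise_sd / n ** 0.5
    z = abs(estimate) / standard_error
    return 2 * (1 - NormalDist().cdf(z))

# Same estimated effect and noise; only the sample size changes.
for n in (100, 10_000, 341_136):
    print(n, two_sided_p_value(estimate=0.01, noise_sd=1.0, n=n))
```

The estimate never changes, yet the p-value shrinks as n grows, so a small p-value in a very large sample says little about whether the effect is large enough to matter.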

Page 23: Inference with big data: SCECR 2012 Presentation

$\ln(\text{Price}) = \beta_0 + \beta_1 \ln(\text{minimumBid}) + \beta_2\,\text{reserve} + \beta_3 \ln(\text{sellerFeedback}) + \beta_4\,\text{Duration} + \beta_5\,(\text{controls})$

H1: Higher minimum bids lead to higher final prices (β1 > 0)

H2: Auctions with reserve price will sell for higher prices (β2 > 0)

H3: Duration affects price (β4 ≠ 0)

H4: The higher the seller feedback, the higher the price (β3 > 0)

auctions for digital cameras, Aug ’07 to Jan ’08 [thanks to Wolfgang Jank for the data!]

n=341,136
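Because the price model on this slide is in logs, the coefficients have a direct magnitude reading, which is what matters once p-values stop being informative. The derivation below is a standard one; it assumes that reserve is a 0/1 indicator, which the slide does not state explicitly.

```latex
\begin{align*}
\frac{\partial \ln(\text{Price})}{\partial \ln(\text{minimumBid})} = \beta_1
  &\;\Longrightarrow\;
  \%\Delta\,\text{Price} \approx \beta_1 \cdot \%\Delta\,\text{minimumBid} \\
\frac{\text{Price} \mid \text{reserve} = 1}{\text{Price} \mid \text{reserve} = 0} = e^{\beta_2}
  &\;\Longrightarrow\;
  \%\Delta\,\text{Price} = 100\,\bigl(e^{\beta_2} - 1\bigr)
\end{align*}
```

So the coefficient on ln(minimumBid) (H1) is an elasticity, and the coefficient on reserve (H2) converts to a percentage price difference between auctions with and without a reserve, holding the other terms fixed.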

Page 24: Inference with big data: SCECR 2012 Presentation

n=341,136

Page 25: Inference with big data: SCECR 2012 Presentation
Page 26: Inference with big data: SCECR 2012 Presentation

“In a large sample, we can obtain very large t statistics with low p-values for our predictors, when, in fact, their effect on Y is very slight”

Applied Statistics in Business & Economics

Doane & Seward

Page 27: Inference with big data: SCECR 2012 Presentation

BIG DATA (SUPERPOWER) APPROACH:

Focus on size (ignore p-values)

Subsamples for robustness: “results quantitatively similar”

Page 28: Inference with big data: SCECR 2012 Presentation

MODEL ASSUMPTIONS, DIAGNOSTICS AND ADJUSTMENT

Chapter 3:

Page 29: Inference with big data: SCECR 2012 Presentation

With big data, we’re in the realm of asymptotic behaviour

𝑛 → ∞

Page 30: Inference with big data: SCECR 2012 Presentation

Assumption violation | Coefficient bias | Standard errors | Redundant diagnostic tests

Under-specification | all | bias | -

Endogeneity* | all | 2SLS is worse | Instrument strength (Sargan)

E(ε) = 0 | β̂0 | bias | Lack-of-fit

Non-normality | - | - | Anderson-Darling

Heteroscedasticity | - | bias | Breusch-Pagan

Over-specification | - | - | -

Serial dependence | - | bias | Durbin-Watson

Multicollinearity | - | increase | Significant correlations

Influential outliers | - | - | Leverage (multiple testing)

Violated assumptions: less tinkering

*IV estimators only have desirable asymptotic, not finite sample, properties

Page 31: Inference with big data: SCECR 2012 Presentation

BIG DATA (SUPERPOWER) APPROACH:

Focus on bias-related assumptions

Avoid statistical tests (p-value challenge)

Page 32: Inference with big data: SCECR 2012 Presentation

COMPLEX EFFECTS & HETEROGENEITY

Chapter 4: With Bhimsankaram Pochiraju & Mohit Dayal (ISB)

Page 33: Inference with big data: SCECR 2012 Presentation

With Big Data:

Detect small (but important) effects

Detect rare events (in rare minorities)

Page 34: Inference with big data: SCECR 2012 Presentation

H1: β3 > 2

$f(y) = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 X_3 + \cdots + \beta_k X_k + \varepsilon$

control variables

Fewer assumptions

Fixed effects

Which model?

$f(y) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_1 X_2 + \cdots + \hat{\beta}_k X_k$

Complex hypotheses

What data?

Magnitude

Model fit

Robustness

Predictive

Heterogeneous

Clustering/Mixtures

Sub-samples

Propensity Scores, 2SLS

Page 35: Inference with big data: SCECR 2012 Presentation

Test complex hypotheses

Moderators

Nonlinear relationships

Multiple categories

Control variables

Quantify subtle effects

Specific measures (20 eBay categories)

“Moderator variables are difficult to detect” - Aguinis, 1994

Low R2, yet non-zero coefficients

The rovers have a magnifying camera… that scientists can use to carefully look at the fine structure of a rock

Page 36: Inference with big data: SCECR 2012 Presentation
Page 37: Inference with big data: SCECR 2012 Presentation

Stepwise Selection

OLS with Stepwise (AIC measure)

Logistic with variable selection (RELR)

All independent variables

All control variables

Quadratic terms of continuous variables

2-way interactions

Choose software carefully (R: “out of memory”)
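The candidate set handed to a stepwise procedure grows quickly once quadratic terms and 2-way interactions are included, which is one plausible reason memory becomes a constraint. The sketch below only enumerates candidate term labels; the variable names and the `candidate_terms` helper are hypothetical, and no selection is performed.

```python
from itertools import combinations

def candidate_terms(continuous, categorical=()):
    """List main effects, squares of continuous variables, and all 2-way interactions."""
    variables = list(continuous) + list(categorical)
    terms = list(variables)                                          # main effects
    terms += [f"{v}^2" for v in continuous]                          # quadratic terms
    terms += [f"{a}:{b}" for a, b in combinations(variables, 2)]     # 2-way interactions
    return terms

# Hypothetical example: 30 continuous predictors.
terms = candidate_terms([f"x{i}" for i in range(30)])
print(len(terms), "candidate terms")
```

With p variables there are p(p-1)/2 interaction terms, so the design matrix for a few hundred thousand rows can become very wide before any selection step runs.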

Page 38: Inference with big data: SCECR 2012 Presentation

Heterogeneity: CART

• Identify non-linearities and interactions

• Does not identify different models

• Challenge: independent variables vs. control variables

Page 39: Inference with big data: SCECR 2012 Presentation

Clustering

1. Cluster all independent and control variables

2. Fit separate regression models to each cluster

• Popular in risk analytics

• Fast, easy

• Does not guarantee distinct relationships

Page 40: Inference with big data: SCECR 2012 Presentation

Finite Mixture Regression

Search for k separate regressions

Convergence issues on entire dataset

Of 10 subsamples (n = 30K each), estimation converged for seven

Page 41: Inference with big data: SCECR 2012 Presentation

MODEL VALIDATION

Chapter 5:

Page 42: Inference with big data: SCECR 2012 Presentation

Improve model validation, comparison, and generalization

Internal & external validity

Robustness across subsamples: non-random; random (overlapping/non-overlapping)

Page 43: Inference with big data: SCECR 2012 Presentation

Improve predictive validation

Training set

Holdout set

Page 44: Inference with big data: SCECR 2012 Presentation

SMALL SAMPLE MODELING APPROACH

Page 45: Inference with big data: SCECR 2012 Presentation

Clark Kent ≤ Superman