CECR
Inference with Big Data
A Superpower
Approach
Galit Shmuéli Indian School of Business
Mohit Dayal Lalita Reddi Bhim Pochiraju
Mingfeng Lin Hank Lucas
Big data studies (in information systems) increasingly common
# IS papers with n>10,000 (2004-2010)
Large-study IS papers: How Big?
“over 10,000 publicly available feedback text comments… in eBay”
The Nature and Role of Feedback Text Comments in Online Marketplaces, Pavlou & Dimoka, ISR 2006
“we use… 3.7 million records, encompassing transactions for the Federal Supply Service (FSS) of the U.S. Federal government in fiscal year 2000”
Using Transaction Prices to Re-Examine Price Dispersion in Electronic Markets, Ghose & Yao, ISR 2011
“51,062 rare coin auctions that took place… on eBay”
The Sound of Silence in Online Feedback, Dellarocas & Wood, Management Science 2006
“We collected data on … [175,714] reviews from Amazon”
Examining the Relationship Between Reviews and Sales, Forman et al., ISR 2008
108,333 used vehicles offered in the wholesale automotive market
Electronic vs. Physical Market Mechanisms, Overby & Jap, Management Science 2009
For our analysis, we have … 784,882 [portal visits]
Household-Specific Regressions Using Clickstream Data, Goldfarb & Lu, Statistical Science 2006
Apply small sample approach to Big Data studies?
It’s all about Power
Magnify effects
Separate signal from noise
Artwork: Running the numbers by Chris Jordan (www.chrisjordan.com) 426,000 cell phones retired in the US every day
Power = Prob(detect H1 effect)
= f(sample size, effect size, α, noise)
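A minimal sketch of how power grows with sample size, assuming a one-sided z-test of a mean with known noise. The function name `power_one_sided_z` and the effect size of 0.01 standard deviations are illustrative choices, not taken from the slides.

```python
from statistics import NormalDist

def power_one_sided_z(effect_size, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mean = 0 against H1: mean > 0.

    effect_size is the true mean divided by the noise standard deviation
    (an assumed, standardized effect).
    """
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)      # rejection threshold under H0
    shift = effect_size * n ** 0.5       # mean of the test statistic under H1
    return 1 - z.cdf(critical - shift)

# The same tiny effect at three sample sizes.
for n in (100, 10_000, 1_000_000):
    print(n, round(power_one_sided_z(0.01, n), 3))
```

With the effect size, α and noise held fixed, only the sample size changes across the three rows, so any difference in power comes from n alone.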
Small & complex effects
Stronger validity
Rare events
The Promise
Hypotheses Data Exploration Models Model Validation Inference
Statistical Technology
DATA VIZ: “BIG DATA” CHARTS
Chapter 1: With Mohit Dayal & Lalita Reddi (ISB)
Scaling Up Data Visualization
Missing values
Big Data Scatter plot
Visualization: Big Data Boxplot
BIG DATA (SUPERPOWER) APPROACH:
• Charts based on aggregation
• Interactive viz (zoom & pan, filter, etc.)
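One way to read "charts based on aggregation" is to replace individual points with counts per grid cell before plotting. The sketch below assumes a square grid; the helper `bin_counts` and the synthetic data are hypothetical and only illustrate the idea.

```python
import random
from collections import Counter

def bin_counts(xs, ys, bin_width):
    """Count points per square grid cell; the counts can drive a heatmap."""
    counts = Counter()
    for x, y in zip(xs, ys):
        cell = (int(x // bin_width), int(y // bin_width))
        counts[cell] += 1
    return counts

# Synthetic data standing in for a large scatter plot.
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(200_000)]
ys = [x + random.gauss(0, 1) for x in xs]

cells = bin_counts(xs, ys, bin_width=0.5)
print(len(xs), "points summarized by", len(cells), "cells")
```

The plotting layer then draws one mark per cell, colored by count, instead of one mark per record.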
BIG DATA AND SMALL-SAMPLE INFERENCE
Simple Hypotheses H1: b1>0
$f(y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \cdots + \beta_k X_k + \varepsilon$
Few control variables
Assumptions?
Which model?
$f(y) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_1 X_2 + \cdots + \hat{\beta}_k X_k$
Few hypotheses
What data? Sign + Statistical significance
Model fit Robustness
What doesn’t scale up? (wrong conclusions) What are missed opportunities?
TOO BIG TO FAIL: LARGE SAMPLES AND FALSE DISCOVERIES
Chapter 2: With Hank Lucas (UMD) & Mingfeng Lin (UoA)
small p-values*** are not interesting
Large sample result: Deflated p-values
p-value ~ proximity of sample to H0
= f(effect size, sample size, noise)
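A small sketch of how the p-value shrinks as n grows while the effect stays the same, assuming a large-sample two-sided z-test on a single estimate. The function `two_sided_p_value` and the values 0.01 and 1.0 are illustrative assumptions.

```python
from statistics import NormalDist

def two_sided_p_value(coef, noise_sd, n):
    """Two-sided p-value for an estimate `coef` whose standard error is noise_sd / sqrt(n)."""
    z_stat = coef / (noise_sd / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))

# Identical effect size and noise; only the sample size changes.
for n in (100, 10_000, 341_136):
    print(n, two_sided_p_value(coef=0.01, noise_sd=1.0, n=n))
```

The estimate itself never changes in this loop, which is why the p-value alone says little about practical importance in a large sample.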
H0: b=0
$\ln(\text{Price}) = b_0 + b_1 \ln(\text{minimumBid}) + b_2\,\text{reserve} + b_3 \ln(\text{sellerFeedback}) + b_4\,(\text{Duration}) + b_5\,(\text{controls})$
H1: Higher minimum bids lead to higher final prices (b1>0)
H2: Auctions with reserve price will sell for higher prices (b2>0)
H3: Duration affects price (b4≠0)
H4: The higher the seller feedback, the higher the price (b3>0)
auctions for digital cameras Aug ’07- Jan ‘08 [thanks to Wolfgang Jank for the data!]
n=341,136
“In a large sample, we can obtain very large t statistics with low p-values for our predictors, when, in fact, their effect on Y is very slight”
Applied Statistics in Business & Economics
Doane & Seward
BIG DATA (SUPERPOWER) APPROACH:
• Focus on effect size (ignore p-values)
• Subsamples for robustness: “results quantitatively similar”
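A hedged sketch of the subsample robustness check: re-estimate the same coefficient on several random subsamples and compare magnitudes. It assumes a single-predictor OLS on synthetic data; `ols_slope`, `subsample_slopes` and the true slope of 0.05 are hypothetical.

```python
import random

def ols_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def subsample_slopes(xs, ys, n_subsamples, size, seed=0):
    """Slope estimates from random subsamples drawn without replacement."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_subsamples):
        chosen = rng.sample(range(len(xs)), size)
        slopes.append(ols_slope([xs[i] for i in chosen], [ys[i] for i in chosen]))
    return slopes

# Synthetic data with a small true slope (an assumption for illustration).
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(100_000)]
ys = [0.05 * x + rng.gauss(0, 1) for x in xs]

slopes = subsample_slopes(xs, ys, n_subsamples=10, size=20_000)
print([round(s, 3) for s in slopes])
```

If the coefficient magnitudes stay close across subsamples, the effect size is stable; each subsample is drawn independently here, so subsamples may overlap.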
MODEL ASSUMPTIONS, DIAGNOSTICS AND ADJUSTMENT
Chapter 3:
With big data, we’re in the realm of asymptotic behaviour
𝑛 → ∞
Assumption violation | Coefficient bias | Standard errors | Redundant diagnostic tests
Under-specification | all | bias | -
Endogeneity* | all | 2SLS is worse | Instrument strength (Sargan)
E(e)=0 | β0 | bias | Lack-of-fit
Non-normality | - | - | Anderson-Darling
Heteroscedasticity | - | bias | Breusch-Pagan
Over-specification | - | - | -
Serial dependence | - | bias | Durbin-Watson
Multicollinearity | - | increase | Significant correlations
Influential outliers | - | - | Leverage (multiple testing)
Violated assumptions: less tinkering
*IV estimators only have desirable asymptotic, not finite sample, properties
BIG DATA (SUPERPOWER) APPROACH:
• Focus on bias-related assumptions
• Avoid statistical tests (p-value challenge)
COMPLEX EFFECTS & HETEROGENEITY
Chapter 4: With Bhimsankaram Pochiraju & Mohit Dayal (ISB)
With Big Data:
Detect small (but important) effects
Detect rare events (in rare minorities)
H1: b3>2
$f(y) = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 X_3 + \cdots + \beta_k X_k + \varepsilon$
control variables
Fewer assumptions
Fixed effects
Which model?
$f(y) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_1 X_2 + \cdots + \hat{\beta}_k X_k$
Complex hypotheses
What data?
magnitude Model fit Robustness Predictive
Heterogeneous Clustering/Mixtures Sub-samples Propensity Scores, 2SLS
Test complex hypotheses
Moderators Nonlinear relationships Multiple categories Control variables
Quantify subtle effects
Specific measures (20 eBay categories)
“Moderator variables are difficult to detect” (Aguinis, 1994)
Low R2, yet non-zero coefficients
The rovers have a magnifying camera… that scientists can use to carefully look at the fine structure of a rock
Stepwise Selection
OLS with stepwise selection (AIC measure)
Logistic with variable selection (RELR)
• All independent variables
• All control variables
• Quadratic terms of continuous variables
• 2-way interactions
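To give a sense of how large the candidate pool for stepwise selection becomes, the sketch below enumerates main effects, quadratic terms of continuous variables, and all 2-way interactions. The helper `candidate_terms` is hypothetical, and treating `reserve` as non-continuous is an assumption made for illustration.

```python
from itertools import combinations

def candidate_terms(continuous, other=()):
    """List main effects, squares of continuous variables, and 2-way interactions."""
    variables = list(continuous) + list(other)
    terms = list(variables)                                        # main effects
    terms += [f"{v}^2" for v in continuous]                        # quadratic terms
    terms += [f"{a}*{b}" for a, b in combinations(variables, 2)]   # 2-way interactions
    return terms

terms = candidate_terms(["minimumBid", "sellerFeedback", "Duration"], other=["reserve"])
print(len(terms), "candidate terms")
for term in terms:
    print(term)
```

With p variables the interactions alone contribute p(p-1)/2 terms, which is part of why memory and software choice matter at this scale.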
Choose software carefully (R: “out of memory”)
Heterogeneity: CART
• Identify non-linearities and interactions
• Does not identify different models
• Challenge: independent variables vs. control variables
Clustering
1. Cluster all independent and control variables
2. Fit separate regression models to each cluster
• Popular in risk analytics
• Fast, easy
• Does not guarantee distinct relationships
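The two steps above can be sketched in a toy setting: cluster the records, then fit a separate regression within each cluster. This is a simplified illustration that assumes a single predictor and two clusters; `two_means_1d` is a bare-bones stand-in for a real clustering routine, and the data are synthetic and noise-free.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def two_means_1d(values, iters=20):
    """Two-cluster k-means on one variable; returns a 0/1 label per value."""
    low, high = min(values), max(values)          # initial cluster centers
    labels = []
    for _ in range(iters):
        labels = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        low, high = sum(group0) / len(group0), sum(group1) / len(group1)
    return labels

# Two synthetic segments with different x-y relationships.
xs = [i / 100 for i in range(100)] + [10 + i / 100 for i in range(100)]
ys = [2 * x for x in xs[:100]] + [-3 * x for x in xs[100:]]

labels = two_means_1d(xs)                         # step 1: cluster
cluster_slopes = {}
for k in (0, 1):                                  # step 2: one regression per cluster
    cx = [x for x, lab in zip(xs, labels) if lab == k]
    cy = [y for y, lab in zip(ys, labels) if lab == k]
    cluster_slopes[k] = ols_slope(cx, cy)
print(cluster_slopes)
```

In this toy case the clusters happen to line up with the two relationships; as the slide notes, clustering on the predictors does not guarantee that in general.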
Finite Mixture Regression
Search for k separate regressions
Convergence issues on the entire dataset
Of 10 subsamples (n=30K), the fit converged for seven
MODEL VALIDATION
Chapter 5:
Improve model validation, comparison, and generalization
Internal & external validity Robustness across subsamples
Subsamples: non-random; random (overlapping/non-overlapping)
Improve predictive validation
Training set
Holdout set
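A minimal sketch of predictive validation with a training set and a holdout set: fit on one part, measure prediction error on the other. The 70/30 split, the helper names `fit_line` and `rmse`, and the synthetic data are assumptions for illustration.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(xs, ys, intercept, slope):
    """Root mean squared prediction error on the given records."""
    errors = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return (sum(errors) / len(errors)) ** 0.5

# Synthetic records: y = 1 + 0.5*x + noise.
rng = random.Random(1)
data = []
for _ in range(20_000):
    x = rng.gauss(0, 1)
    data.append((x, 1.0 + 0.5 * x + rng.gauss(0, 1)))

# Random 70/30 split into training and holdout sets.
rng.shuffle(data)
cut = int(0.7 * len(data))
train, holdout = data[:cut], data[cut:]

intercept, slope = fit_line([x for x, _ in train], [y for _, y in train])
train_rmse = rmse([x for x, _ in train], [y for _, y in train], intercept, slope)
holdout_rmse = rmse([x for x, _ in holdout], [y for _, y in holdout], intercept, slope)
print(round(train_rmse, 3), round(holdout_rmse, 3))
```

With large samples the holdout set can itself be large, so the holdout error is a stable estimate of out-of-sample performance.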
SMALL SAMPLE MODELING APPROACH
Clark Kent ≤ Superman