Data Integration for Big Data Analysis for Finite Population Inference
Jae-kwang Kim
ISU
January 23, 2018
What is big data?
Data do not speak for themselves
[Diagram: hierarchy of Data, Information, and Knowledge, annotated with Interpretation and Reproducibility]
Population and Sample
[Diagram: Population (Parameter) and Sample (Estimator), linked by Generalization and Inference]
Survey Sampling
Survey: Measurement
Sampling: Representation

Table: Survey Methodology and Sampling Statistics

Survey Methodology               Sampling Statistics
Psychology, Cognitive Science    Statistics
Studies nonsampling error        Studies sampling error
Questionnaire design             Sampling design, estimation
Two wings of survey data
Big Data
Big Data era: Freeconomics
Survey sample data vs Big Data
Table: Features

Feature              Survey sample data          Big Data
Cost function        $C = C_0 + C_1 \times n$    $C$ is not linear in $n$
Representativeness   Yes                         No
Bias                 Bias $= 0$                  Bias $\neq 0$
Variance             Variance $= K/n$            Variance $\cong 0$
Selection Bias
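Finite population: $U = \{1, \ldots, N\}$.
Parameter of interest: $\bar{Y}_N = N^{-1} \sum_{i=1}^{N} y_i$.
Big data sample: $B \subset U$.
$$\delta_i = \begin{cases} 1 & \text{if } i \in B \\ 0 & \text{otherwise.} \end{cases}$$
Estimator: $\bar{y}_B = N_B^{-1} \sum_{i=1}^{N} \delta_i y_i$, where $N_B = \sum_{i=1}^{N} \delta_i$ is the big data sample size ($N_B < N$).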
MSE of Big Data Estimator
MSE Formula
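$$E_\delta(\bar{y}_B - \bar{Y}_N)^2 = E_\delta(\rho_{\delta,Y}^2) \times \sigma^2 \times \frac{1 - f_B}{f_B}$$
where $\rho_{\delta,Y} = \mathrm{Corr}(\delta, Y)$, $\sigma^2 = \mathrm{Var}(Y)$, $f_B = N_B/N$, and $E_\delta(\cdot)$ is the expectation with respect to the big data sampling mechanism, which is generally unknown.
If $E_\delta(\rho_{\delta,Y}) = 0$, then $E_\delta(\rho_{\delta,Y}^2) = O(N_B^{-1})$ and the MSE is of order $1/N_B$.
If $E_\delta(\rho_{\delta,Y}) \neq 0$, then $E_\delta(\rho_{\delta,Y}^2) = O(1)$ and the MSE is of order $1/f_B - 1$.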
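The MSE formula rests on an algebraic identity that holds for any single realized sample: $\bar{y}_B - \bar{Y}_N = \rho_{\delta,Y} \, \sigma \sqrt{(1 - f_B)/f_B}$, with $\rho_{\delta,Y}$ and $\sigma$ computed over the finite population. The sketch below checks the identity numerically. The population, the selection mechanism, and the function name are illustrative assumptions, not part of the slides.

```python
import numpy as np

def selection_identity_terms(y, delta):
    """Return (ybar_B - Ybar_N, rho * sigma * sqrt((1 - f_B) / f_B))."""
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=float)
    f_b = delta.mean()                        # f_B = N_B / N
    error = y[delta == 1].mean() - y.mean()   # realized estimation error
    rho = np.corrcoef(delta, y)[0, 1]         # finite-population Corr(delta, Y)
    sigma = y.std()                           # population SD (divisor N)
    return error, rho * sigma * np.sqrt((1 - f_b) / f_b)

rng = np.random.default_rng(0)                # arbitrary seed
y = rng.normal(size=10_000)                   # hypothetical population values
# Hypothetical selection mechanism: units with larger y are more likely to be in B.
delta = rng.random(10_000) < 1 / (1 + np.exp(-y))
error, product = selection_identity_terms(y, delta)
```

Squaring both sides and taking the expectation over $\delta$ gives the MSE formula stated above.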
Effective sample size
$$n_{\mathrm{eff}} = \frac{f_B}{1 - f_B} \times \frac{1}{E_\delta(\rho_{\delta,Y}^2)}.$$
If $\rho_{\delta,Y} = 0.05$ and $f_B = 1/2$, then $n_{\mathrm{eff}} = 400$.
For example, suppose that the population size is $N = 10{,}000{,}000$ and that 50% of the population is collected in the big data. If $\rho_{\delta,Y} = 0.05$, then the MSE of the big data sample mean is equal to that of an SRS mean with size $n = 400$.
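The effective sample size is simple arithmetic, and a short helper makes it easy to try other values of $f_B$ and $\rho_{\delta,Y}$. This is a minimal sketch; the function name and argument names are hypothetical.

```python
def effective_sample_size(f_b, expected_rho_sq):
    """n_eff = f_B / (1 - f_B) * 1 / E(rho^2)."""
    if not 0 < f_b < 1:
        raise ValueError("f_b must lie strictly between 0 and 1")
    if expected_rho_sq <= 0:
        raise ValueError("expected_rho_sq must be positive")
    return f_b / (1 - f_b) / expected_rho_sq

# The case from the slide: half the population observed, rho = 0.05.
n_eff = effective_sample_size(f_b=0.5, expected_rho_sq=0.05 ** 2)
```

The result is approximately 400 regardless of $N$, which is the point of the example: the population size does not enter the formula once $f_B$ is fixed.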
Paradox of Big data (Meng 2018)
Confidence interval using the big data sample (ignoring the selection bias):
$$\mathrm{CI} = \left( \bar{y}_B - 1.96 \sqrt{(1 - f_B) S^2 / N_B},\ \bar{y}_B + 1.96 \sqrt{(1 - f_B) S^2 / N_B} \right)$$
As $N_B \to \infty$, we have $\Pr(\bar{Y}_N \in \mathrm{CI}) \to 0$.
Paradox: If one ignores the bias and applies the standard method of estimation, then the bigger the dataset, the more misleading it is for valid statistical inference.
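A small Monte Carlo experiment illustrates the paradox. It is only a sketch under assumed conditions: the standard normal population, the logistic selection mechanism (much stronger than $\rho_{\delta,Y} = 0.05$), the population sizes, and the number of replicates are all hypothetical choices.

```python
import numpy as np

def naive_ci_covers(n_pop, rng):
    """Draw one population and one biased big data sample; check CI coverage."""
    y = rng.normal(size=n_pop)
    # Hypothetical selection mechanism depending on y itself.
    delta = rng.random(n_pop) < 1 / (1 + np.exp(-y))
    y_b = y[delta]
    n_b = y_b.size
    f_b = n_b / n_pop
    half_width = 1.96 * np.sqrt((1 - f_b) * y_b.var(ddof=1) / n_b)
    return abs(y_b.mean() - y.mean()) <= half_width

rng = np.random.default_rng(1)  # arbitrary seed
coverage = {
    n_pop: np.mean([naive_ci_covers(n_pop, rng) for _ in range(200)])
    for n_pop in (20, 200, 20_000)
}
```

Because the selection bias stays roughly constant while the interval half-width shrinks with $N_B$, the estimated coverage in `coverage` should fall toward zero as the population (and hence the big data sample) grows.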
Data Integration
Salvation of Big Data
Data integration: Basic Idea
Two data sets: big data and survey data.
Big data may be subject to selection bias.
For simplicity, assume a binary Y variable.

          δ = 1      δ = 0      Total
Y = 1     $N_{B1}$   $N_{C1}$   $N_1$
Y = 0     $N_{B0}$   $N_{C0}$   $N_0$
Total     $N_B$      $N_C$      $N$

where $\delta_i = 1$ if unit $i$ belongs to the big data sample and $\delta_i = 0$ otherwise.
Parameter of interest: $P = P(Y = 1)$.
Data integration: Basic Idea (Cont’d)
In addition, we have survey data of size $n$ obtained by SRS, with the following sample-level observations:

          δ = 1      δ = 0      Total
Y = 1     $n_{B1}$   $n_{C1}$   $n_1$
Y = 0     $n_{B0}$   $n_{C0}$   $n_0$
Total                           $n$
How to combine two data sources?
Combined estimation
Note that
$$P(Y = 1) = P(Y = 1 \mid \delta = 1) P(\delta = 1) + P(Y = 1 \mid \delta = 0) P(\delta = 0).$$
Three components
1. $P(\delta = 1)$: big data proportion (known).
2. $P(Y = 1 \mid \delta = 1) = N_{B1}/N_B$: obtained from the big data.
3. $P(Y = 1 \mid \delta = 0)$: estimated by $n_{C1}/(n_{C0} + n_{C1})$ from the survey data.
Final estimator
$$\hat{P} = P_B W_B + \hat{P}_C (1 - W_B) \qquad (1)$$
where $W_B = N_B/N$, $P_B = N_{B1}/N_B$, and $\hat{P}_C = n_{C1}/(n_{C0} + n_{C1})$.
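The combined estimator (1) can be sketched in a few lines. The function name, argument names, and counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def combined_estimate(n_big_y1, n_big, n_pop, n_c1, n_c0):
    """Combine big data counts with survey counts from the non-big-data part."""
    w_b = n_big / n_pop               # W_B = N_B / N, known
    p_b = n_big_y1 / n_big            # P_B = N_B1 / N_B, from the big data
    p_c_hat = n_c1 / (n_c0 + n_c1)    # P_C estimated from survey units with delta = 0
    return p_b * w_b + p_c_hat * (1 - w_b)

# Hypothetical counts: half the population is in the big data.
p_hat = combined_estimate(
    n_big_y1=300_000, n_big=500_000, n_pop=1_000_000, n_c1=50, n_c0=200
)
```

With these made-up counts, $P_B = 0.6$, $W_B = 0.5$, and $\hat{P}_C = 0.2$, so the estimate is $0.6 \times 0.5 + 0.2 \times 0.5 = 0.4$.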
Remark 1
Variance
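$$V(\hat{P}) = (1 - W_B)^2 V(\hat{P}_C) \doteq (1 - W_B) \frac{1}{n} P_C (1 - P_C).$$
If $W_B$ is close to one, then the above variance is very small.
Instead of using $\hat{P}_C = n_{C1}/(n_{C0} + n_{C1})$, we can construct a ratio estimator of $P_C$ to improve the efficiency. That is, use
$$\hat{P}_{C,r} = \frac{1}{1 + \hat{\theta}_C}$$
where
$$\hat{\theta}_C = \frac{N_{B0}/N_{B1}}{n_{B0}/n_{B1}} \times (n_{C0}/n_{C1}).$$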
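The approximate variance and the ratio estimator can be written out directly from the formulas above. This is a minimal sketch; the function names and the counts in the usage lines are hypothetical.

```python
def approx_variance(w_b, n, p_c):
    """Approximate V(P_hat) = (1 - W_B) * P_C * (1 - P_C) / n."""
    return (1 - w_b) * p_c * (1 - p_c) / n

def ratio_estimate_pc(N_B0, N_B1, n_B0, n_B1, n_C0, n_C1):
    """Ratio estimator P_C,r = 1 / (1 + theta_C)."""
    # The sample odds n_C0 / n_C1 are adjusted by how far the sample odds in
    # the big data part (n_B0 / n_B1) are from the known odds (N_B0 / N_B1).
    theta_c = (N_B0 / N_B1) / (n_B0 / n_B1) * (n_C0 / n_C1)
    return 1 / (1 + theta_c)

# Hypothetical counts in which the sample odds match the known odds exactly,
# so the ratio adjustment has no effect.
p_c_ratio = ratio_estimate_pc(200_000, 300_000, 100, 150, 200, 50)
var_hat = approx_variance(w_b=0.5, n=500, p_c=0.2)
```

When the sample happens to reproduce the known big data odds, the ratio estimator coincides with $n_{C1}/(n_{C0} + n_{C1})$; otherwise it shifts the estimate in the direction suggested by the known counts.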
Remark 2
The combined estimator is essentially a post-stratified estimator using δ as a post-stratification variable.
The post-stratification idea is directly applicable to a continuous Y variable.
Practical Issues
δ may be obtained inaccurately (due to imperfect matching).
We may have measurement errors in y in the big data.
The survey sample may not observe y at all.
Two setups (A: survey sample data, B: Big data)
Parameter of interest: $\theta = \sum_{i \in U} y_i$
Table: Setup One
Data   X     Y     Representative?
A      Yes   No    Yes
B      Yes   Yes   No
Probability sample does not observe the study variable
Table: Setup Two
Data X Y
A X X
B X X
Probability sample does observe the study variable
Data Integration for Setup One
Rivers (2007) idea
1. Use X to create a nearest neighbor imputation for each unit $i \in A$.
2. Compute
$$\hat{\theta} = \sum_{i \in A} w_i y_i^*$$
where $y_i^*$ is the imputed value of $y_i$ for $i \in A$.
Based on the MAR (missing at random) assumption
$$f(y \mid x, \delta = 1) = f(y \mid x).$$
Bias may not be negligible if the dimension of x is high (due to the curse of dimensionality).
The naive variance estimator works well. (The estimation error is asymptotically negligible.)
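The two steps of the nearest neighbor approach can be sketched as follows. This is an illustrative brute-force version, assuming Euclidean distance on x and a single nearest donor; the function name and the toy data are hypothetical, and a real big data file would need a spatial index rather than a full distance matrix.

```python
import numpy as np

def nn_mass_imputation_total(x_a, w_a, x_b, y_b):
    """Impute y for each unit in A from its nearest neighbor in B, then weight."""
    x_a = np.asarray(x_a, dtype=float).reshape(len(w_a), -1)
    x_b = np.asarray(x_b, dtype=float).reshape(len(y_b), -1)
    # Squared Euclidean distance between every unit in A and every unit in B.
    dist = ((x_a[:, None, :] - x_b[None, :, :]) ** 2).sum(axis=2)
    donor = dist.argmin(axis=1)                   # index of the nearest unit in B
    y_star = np.asarray(y_b, dtype=float)[donor]  # imputed values y_i^*
    return float(np.dot(w_a, y_star))             # sum over A of w_i * y_i^*

# Hypothetical toy data: two survey units, three big data units.
theta_hat = nn_mass_imputation_total(
    x_a=[0.1, 2.9], w_a=[10.0, 10.0], x_b=[0.0, 1.0, 3.0], y_b=[1.0, 2.0, 5.0]
)
```

Here the survey units borrow y = 1 and y = 5 from their nearest big data neighbors, giving an estimated total of 10 × 1 + 10 × 5 = 60.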
Data Integration for Setup One
Proposed method 1
1. Obtain $\delta_i$ from A, by matching or by asking about membership in the big data.
2. Fit a model for $P(\delta = 1 \mid x)$ using sample A.
3. Use
$$\hat{\theta} = \sum_{i \in B} \hat{\pi}_i^{-1} y_i$$
where $\hat{\pi}_i = \hat{P}(\delta_i = 1 \mid x_i)$, adjusted to satisfy $\sum_{i \in B} \hat{\pi}_i^{-1} = N$.
Based on the MAR assumption.
Requires correct specification of the model for $\pi(x) = P(\delta = 1 \mid x)$.
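Step 3 of this method, including the adjustment that makes the inverse propensities sum to N, can be sketched as below. The propensity estimates are assumed to come from a model already fitted on sample A; that fitting step is not shown. A simple proportional rescaling is used for the adjustment, which is one possible choice and not necessarily the one intended in the slides. All names and numbers are hypothetical.

```python
import numpy as np

def ipw_total(y_b, pi_hat_b, n_pop):
    """Propensity-score-weighted total over the big data sample B."""
    y_b = np.asarray(y_b, dtype=float)
    inv_pi = 1.0 / np.asarray(pi_hat_b, dtype=float)
    # Rescale so that the adjusted inverse propensities sum to N.
    inv_pi_adj = inv_pi * n_pop / inv_pi.sum()
    return float(np.dot(inv_pi_adj, y_b)), inv_pi_adj

# Hypothetical toy values for three big data units.
theta_hat, weights = ipw_total(
    y_b=[1.0, 2.0, 3.0], pi_hat_b=[0.5, 0.25, 0.25], n_pop=20
)
```

In this toy case the raw inverse propensities (2, 4, 4) sum to 10, so they are doubled to (4, 8, 8) and the estimated total is 4 + 16 + 24 = 44.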
Data Integration for Setup One
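Proposed method 2: Doubly robust (DR) estimation
1. Fit a "working" model for $E(Y \mid x)$ to get $\hat{y}_i = \hat{E}(Y_i \mid x_i)$ for each $i \in A$ and $i \in B$.
2. Fit a "working" model for $P(\delta = 1 \mid x)$ to get $\hat{\pi}_i = \hat{P}(\delta_i = 1 \mid x_i)$ for each $i \in B$.
3. Use
$$\hat{\theta}_{DR} = \sum_{i \in A} w_i \hat{y}_i + \sum_{i \in B} \hat{\pi}_i^{-1} (y_i - \hat{y}_i)$$
where $\hat{\pi}_i = \hat{P}(\delta_i = 1 \mid x_i)$.
Based on the MAR assumption.
Requires that one of the two models be correctly specified.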
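Once the two working models have produced predictions and propensities, the DR estimator is a sum of two terms. The sketch below assumes those fitted values are already available as arrays; the function name and toy numbers are hypothetical.

```python
import numpy as np

def dr_total(w_a, y_hat_a, y_b, y_hat_b, pi_hat_b):
    """Doubly robust total: prediction term on A plus weighted residuals on B."""
    w_a, y_hat_a, y_b, y_hat_b, pi_hat_b = (
        np.asarray(v, dtype=float) for v in (w_a, y_hat_a, y_b, y_hat_b, pi_hat_b)
    )
    prediction_term = np.dot(w_a, y_hat_a)              # sum over A of w_i * yhat_i
    residual_term = np.sum((y_b - y_hat_b) / pi_hat_b)  # sum over B of (y_i - yhat_i) / pihat_i
    return float(prediction_term + residual_term)

# Hypothetical fitted values for two survey units and two big data units.
theta_dr = dr_total(
    w_a=[5.0, 5.0], y_hat_a=[1.0, 3.0],
    y_b=[2.0, 4.0], y_hat_b=[1.5, 4.5], pi_hat_b=[0.5, 0.25],
)
```

The prediction term is 5 × 1 + 5 × 3 = 20 and the residual correction is 0.5/0.5 − 0.5/0.25 = −1, so the toy estimate is 19. If the outcome model fitted the big data perfectly, the correction would vanish and only the prediction term would remain.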
Justification for DR estimation
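Let $\hat{\theta}_{HT} = \sum_{i \in A} w_i y_i$ be the Horvitz-Thompson estimator that could be used if $y_i$ were observed in sample A.
Note that
$$\hat{\theta}_{DR} - \hat{\theta}_{HT} = -\sum_{i \in A} w_i \hat{e}_i + \sum_{i \in B} \hat{\pi}_i^{-1} \hat{e}_i$$
where $\hat{e}_i = y_i - \hat{y}_i$.
Double Robustness
1. If the model for $P(\delta = 1 \mid x)$ is correctly specified, then
$$E_\delta\{\hat{\theta}_{DR} - \hat{\theta}_{HT}\} \cong -\sum_{i \in A} w_i \hat{e}_i + \sum_{i \in U} \hat{e}_i$$
which is design-unbiased for zero.
2. If the model for $E(Y \mid x)$ is correctly specified, then $E(\hat{e}_i) \cong 0$ under MAR.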
Data Integration for Setup Two
Table: Setup Two
Data X Y
A X X
B X
We are interested in estimating $\theta = \sum_{i \in U} y_i$ from the two data sources.
Data Integration for Setup Two
Note that we can compute $\hat{\theta}_A = \sum_{i \in A} w_i y_i$ from sample A.
Thus, unlike Setup One, the goal of data integration is to improve the efficiency (i.e., reduce the variance), not to reduce the selection bias.
How to incorporate the partial auxiliary information in data B?
1. If $B = U$, then it is an easy problem: calibration weighting.
2. For $B \subset U$, we can treat B as a sub-population and apply the same calibration weighting for $A \cap B$.
Calibration weighting in survey sampling
Initial (design) weight: $w_i$
Final weight: $w_i^*$ satisfying
$$\sum_{i \in A} w_i^* (1, x_i) = \sum_{i \in U} (1, x_i). \qquad (2)$$
Calibration weighting problem: Find $w_i^*$ that minimize
$$D(w, w^*) = \sum_{i \in A} w_i \left( \frac{w_i^*}{w_i} - 1 \right)^2$$
subject to (2).
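For this chi-square type distance the calibration problem has a closed-form solution: with $z_i = (1, x_i)$ and population totals $T = \sum_{i \in U} z_i$, the minimizer is $w_i^* = w_i (1 + z_i' \lambda)$, where $\lambda$ solves $(\sum_{i \in A} w_i z_i z_i') \lambda = T - \sum_{i \in A} w_i z_i$. The sketch below implements that solution; the function name and the toy data are hypothetical.

```python
import numpy as np

def chi_square_calibrate(w, z, totals):
    """Weights closest to w in chi-square distance that reproduce the totals."""
    w = np.asarray(w, dtype=float)
    z = np.asarray(z, dtype=float)
    totals = np.asarray(totals, dtype=float)
    m = (z * w[:, None]).T @ z                   # sum over A of w_i z_i z_i'
    lam = np.linalg.solve(m, totals - z.T @ w)   # Lagrange multiplier
    return w * (1.0 + z @ lam)

# Hypothetical SRS of n = 4 from N = 100, with known population total of x = 260.
x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.full(4, 25.0)
z = np.column_stack([np.ones(4), x])
w_star = chi_square_calibrate(w, z, totals=[100.0, 260.0])
```

The design weights give an estimated x total of 250, so the calibrated weights shift slightly toward units with larger x until the totals (100, 260) are matched exactly. Note that this distance function does not prevent negative weights in less favourable data.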
Calibration weighting for big data integration
Auxiliary variables $x_i$ are observed only when $\delta_i = 1$.
The calibration equation is changed to
$$\sum_{i \in A} w_i^* (1 - \delta_i, \delta_i, \delta_i x_i) = \sum_{i \in U} (1 - \delta_i, \delta_i, \delta_i x_i). \qquad (3)$$
If $y_i = x_i$, it reduces to the post-stratification estimator in (1).
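The claim that calibration on $(1 - \delta_i, \delta_i, \delta_i y_i)$ reproduces the post-stratified estimator for a binary y can be checked numerically. The sketch below uses the closed-form chi-square calibration solution and a made-up SRS of size 8; all counts and names are hypothetical.

```python
import numpy as np

def chi_square_calibrate(w, z, totals):
    """Closed-form chi-square distance calibration: w* = w (1 + z' lambda)."""
    w = np.asarray(w, dtype=float)
    z = np.asarray(z, dtype=float)
    totals = np.asarray(totals, dtype=float)
    m = (z * w[:, None]).T @ z
    lam = np.linalg.solve(m, totals - z.T @ w)
    return w * (1.0 + z @ lam)

# Hypothetical survey sample A (SRS, n = 8) from a population of N = 1000.
delta = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # membership in the big data
y = np.array([1, 1, 1, 0, 1, 0, 0, 0])       # binary study variable
n_pop, n_big, n_big_y1 = 1000, 600, 420      # N, N_B, N_B1 (known from big data)

w = np.full(8, n_pop / 8)
z = np.column_stack([1 - delta, delta, delta * y])
totals = [n_pop - n_big, n_big, n_big_y1]    # population totals of z
w_star = chi_square_calibrate(w, z, totals)

p_calibrated = np.dot(w_star, y) / n_pop
w_b = n_big / n_pop
p_post_stratified = (n_big_y1 / n_big) * w_b + y[delta == 0].mean() * (1 - w_b)
```

Both quantities equal 0.52 in this toy example: the calibrated weights reproduce $N_{B1}$ exactly inside the big data part and scale the remaining sample units uniformly up to $N_C$.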
Simulation Study: Setup One
Goal: compare four estimators
1. Naive estimator: mean of sample B
2. Rivers estimator
3. Proposed estimator 1 (PS estimator), using propensity score weighting
4. Proposed estimator 2 (DR estimator), using a working model for $E(Y \mid x)$ and a working model for $P(\delta = 1 \mid x)$
Three scenarios for the simulation study
1. Both models are correct.
2. Only the model $E(Y \mid x)$ is correct (i.e., the true distribution for $P(\delta = 1 \mid x)$ is different from the working model).
3. Only the model $P(\delta = 1 \mid x)$ is correct.
Simulation study one: Setup
Outcome regression model
1. Linear model. That is,
$$y_i = 1 + x_{1,i} + x_{2,i} + \varepsilon_i$$
for $i = 1, \ldots, N$, where $x_{1,i} \sim N(1, 1)$, $x_{2,i} \sim \mathrm{Exp}(1)$, $\varepsilon_i \sim N(0, 1)$, $N = 1{,}000{,}000$, and $(x_{1,i}, x_{2,i}, \varepsilon_i)$ are pairwise independent.
2. Nonlinear model. That is,
$$y_i = 0.5 (x_{1,i} - 1.5)^2 + x_{2,i} + \varepsilon_i,$$
where $(x_{1,i}, x_{2,i}, \varepsilon_i)$ are the same as those in the linear model.
Big data sampling mechanism
1. Linear logistic model.
$$\delta_i \mid p_i \sim \mathrm{Ber}(p_i)$$
for $i = 1, \ldots, N$, where $\mathrm{logit}(p_i) = x_{2,i}$.
2. Nonlinear logistic model.
$$\delta_i \mid p_i \sim \mathrm{Ber}(p_i)$$
for $i = 1, \ldots, N$, where $\mathrm{logit}(p_i) = -0.5 + 0.5 (x_{2,i} - 2)^2$.
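The population and the big data sample for this setup can be generated as sketched below. Two assumptions are made: Ex(1) is read as the exponential distribution with mean 1, and a smaller population (N = 200,000) is used to keep the example light. The function name, flags, and seed are hypothetical.

```python
import numpy as np

def generate_population(n_pop, rng, linear_outcome=True, linear_selection=True):
    """Generate (x1, x2, y, delta) following the simulation setup."""
    x1 = rng.normal(1.0, 1.0, n_pop)
    x2 = rng.exponential(1.0, n_pop)   # assumption: Ex(1) has mean 1
    eps = rng.normal(0.0, 1.0, n_pop)
    if linear_outcome:
        y = 1 + x1 + x2 + eps
    else:
        y = 0.5 * (x1 - 1.5) ** 2 + x2 + eps
    if linear_selection:
        logit_p = x2
    else:
        logit_p = -0.5 + 0.5 * (x2 - 2) ** 2
    p = 1 / (1 + np.exp(-logit_p))
    delta = rng.random(n_pop) < p      # delta_i | p_i ~ Ber(p_i)
    return x1, x2, y, delta

rng = np.random.default_rng(2018)      # arbitrary seed
x1, x2, y, delta = generate_population(200_000, rng)
naive_bias = y[delta].mean() - y.mean()
```

With the linear outcome model and the linear logistic selection model, the bias of the naive big data mean should come out close to the value of about 0.19 reported for the naive estimator, because units with large $x_{2,i}$ are both more likely to be selected and have larger $y_i$.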
Simulation Result
Scenario  Method   n = 500                     n = 1000
                   Bias     S.E.    C.R.       Bias     S.E.    C.R.
I         Naive     0.187   0.001   0.000       0.187   0.001   0.000
          Rivers    0.000   0.077   0.950      -0.002   0.054   0.954
          PS       -0.001   0.023   0.950       0.000   0.016   0.946
          DR       -0.002   0.063   0.950      -0.002   0.044   0.950
II        Naive    -0.097   0.001   0.000      -0.097   0.001   0.000
          Rivers   -0.003   0.077   0.955      -0.001   0.055   0.945
          PS        0.110   0.183   0.986       0.084   0.085   0.996
          DR       -0.001   0.063   0.947       0.000   0.046   0.946
III       Naive     0.187   0.001   0.000       0.187   0.001   0.000
          Rivers    0.000   0.074   0.944       0.000   0.053   0.948
          PS       -0.001   0.022   0.946      -0.001   0.016   0.947
          DR       -0.001   0.050   0.950       0.001   0.035   0.950
Simulation Study: Setup Two
Finite population of size $N = 1{,}000{,}000$.
$$x_i \sim N(2, 1)$$
$$y_i = 3 + 0.7 (x_i - 2) + e_i$$
$$y_i^* = 2 + 0.9 (y_i - 3) + u_i$$
where $e_i \sim N(0, 0.51)$ and $u_i \sim N(0, 0.5^2)$. Note that $y_i^*$ is an inaccurate measurement of $y_i$.
Sampling mechanism for A: SRS of size $n = 500$.
Big data sampling mechanism: stratified random sampling
1. Create two strata using $x_i \le 2$ and $x_i > 2$.
2. Within each stratum, we select $n_h$ elements by SRS independently, where $n_1 = 300{,}000$ and $n_2 = 200{,}000$.
3. The stratum information is not available to the data analyst.
Simulation Study: Setup Two
In sample A, we observe $y_i$.
Two scenarios for sample B.
1. Observe $y_i$: big data is subject to selection bias.
2. Observe $y_i^*$: big data is subject to selection bias and measurement error.
We can identify the elements in $A \cap B$.
Three estimators for $\theta = E(Y)$
1. Mean of sample A (Mean A)
2. Mean of sample B (Mean B)
3. Proposed data integration (DI) method using calibration weighting: in scenario one, we calibrate using $(1 - \delta_i, \delta_i y_i)$; in scenario two, we calibrate using $(1 - \delta_i, \delta_i y_i^*)$.
Simulation Result
Table: Monte Carlo results of the mean, variance, and MSE of the three estimators (true mean = 3.00156)

Scenario  Method        Mean   Variance (×10^4)   MSE (×10^4)
1         Mean A        3.00   18.6               19
          Mean B        2.89   0.0                121
          Proposed DI   3.00   8.8                9
2         Mean A        3.00   18.6               19
          Mean B        1.90   0.0                12,130
          Proposed DI   3.00   11.4               11
Discussion
Big data should not be analyzed naively. (Big data paradox!)
Data integration is a useful tool for harnessing big data for finite population inference.
Two setups are considered.
In Setup One, both Rivers' method and the DR method are promising.
In Setup Two, the calibration weighting method is useful.
In Setup One, the MAR assumption is used. In Setup Two, we do not need the MAR assumption.
Promising area of research.