
Data Integration for Big Data Analysis for finite population inference

Jae-kwang Kim

ISU

January 23, 2018


What is big data?


Data do not speak for themselves

Knowledge

Information

Data

Interpretation

Reproducibility


Population and Sample

[Diagram relating Population (Parameter) and Sample (Estimator) through Generalization and Inference]


Survey Sampling

Survey: Measurement
Sampling: Representation

Table: Survey Methodology and Sampling Statistics

Survey Methodology               Sampling Statistics
-------------------------------  ---------------------------
Psychology, Cognitive Science    Statistics
Studies nonsampling error        Studies sampling error
Questionnaire design             Sampling design, estimation


Two wings of survey data


Big Data

Big Data era: Freeconomics


Survey sample data vs Big Data

Table: Features

                     Survey sample data    Big Data
-------------------  --------------------  --------------------
Cost function        C = C0 + C1 × n       C is not linear in n
Representativeness   Yes                   No
Bias                 Bias = 0              Bias ≠ 0
Variance             Variance = K/n        Variance ≅ 0


Selection Bias

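Finite population: $U = \{1, \ldots, N\}$.

Parameter of interest: $\bar{Y}_N = N^{-1} \sum_{i=1}^{N} y_i$.

Big data sample: $B \subset U$, with inclusion indicator

$$\delta_i = \begin{cases} 1 & \text{if } i \in B \\ 0 & \text{otherwise.} \end{cases}$$

Estimator: $\bar{y}_B = N_B^{-1} \sum_{i=1}^{N} \delta_i y_i$, where $N_B = \sum_{i=1}^{N} \delta_i$ is the big data sample size ($N_B < N$).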


MSE of Big Data Estimator

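MSE Formula

$$E_\delta(\bar{y}_B - \bar{Y}_N)^2 = E_\delta(\rho_{\delta,Y}^2) \times \sigma^2 \times \frac{1 - f_B}{f_B}$$

where $\rho_{\delta,Y} = \mathrm{Corr}(\delta, Y)$, $\sigma^2 = \mathrm{Var}(Y)$, $f_B = N_B/N$, and $E_\delta(\cdot)$ is the expectation with respect to the big data sampling mechanism, which is generally unknown.

If $E_\delta(\rho_{\delta,Y}) = 0$, then $E_\delta(\rho_{\delta,Y}^2) = O(N_B^{-1})$ and the MSE is of order $1/N_B$.

If $E_\delta(\rho_{\delta,Y}) \neq 0$, then $E_\delta(\rho_{\delta,Y}^2) = O(1)$ and the MSE is of order $1/f_B - 1$.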
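The MSE formula rests on an algebraic identity for the error of a single realised sample: the difference between the big data mean and the population mean equals the correlation between δ and Y, times σ, times the square root of (1 − fB)/fB, with all moments taken over the finite population. A minimal Python sketch that checks this numerically; the population, the selection rule, and all names are illustrative assumptions rather than part of the slides.

```python
import math
import random

def big_data_error_decomposition(y, delta):
    """Return (actual error, rho * sigma * sqrt((1 - f_B) / f_B)).

    y     : list of population values y_1..y_N
    delta : list of 0/1 inclusion indicators for the big data sample B
    Moments are finite-population moments (divide by N).
    """
    N = len(y)
    N_B = sum(delta)
    f_B = N_B / N
    ybar_N = sum(y) / N
    ybar_B = sum(d * v for d, v in zip(delta, y)) / N_B
    sigma = math.sqrt(sum((v - ybar_N) ** 2 for v in y) / N)
    cov = sum((d - f_B) * (v - ybar_N) for d, v in zip(delta, y)) / N
    rho = cov / (math.sqrt(f_B * (1 - f_B)) * sigma)
    return ybar_B - ybar_N, rho * sigma * math.sqrt((1 - f_B) / f_B)

# Hypothetical population; selection favours large y (an assumption).
rng = random.Random(1)
y = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
delta = [1 if rng.random() < 1 / (1 + math.exp(-v)) else 0 for v in y]
error, decomposition = big_data_error_decomposition(y, delta)
```

The two returned numbers agree up to floating-point rounding, so the size of the error is driven entirely by the correlation, the spread of Y, and the sampling fraction.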


Effective sample size

$$n_{\text{eff}} = \frac{f_B}{1 - f_B} \times \frac{1}{E_\delta(\rho_{\delta,Y}^2)}$$

If $\rho_{\delta,Y} = 0.05$ and $f_B = 1/2$, then $n_{\text{eff}} = 400$.

For example, suppose that the population size is $N = 10{,}000{,}000$ and 50% of the population is collected in the big data. If $\rho_{\delta,Y} = 0.05$, then the MSE of the big data sample mean equals that of an SRS mean with sample size $n = 400$.
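The effective sample size calculation can be reproduced in a few lines. This is a small sketch in Python; the function name and argument names are assumptions made for illustration.

```python
def effective_sample_size(f_B, rho_sq):
    """n_eff = f_B / (1 - f_B) * 1 / E(rho^2), following the slide's formula.

    f_B    : big data sampling fraction N_B / N, strictly between 0 and 1
    rho_sq : expected squared correlation between delta and Y
    """
    if not 0 < f_B < 1:
        raise ValueError("f_B must lie strictly between 0 and 1")
    if rho_sq <= 0:
        raise ValueError("rho_sq must be positive")
    return f_B / (1 - f_B) / rho_sq

# The slide's example: half the population observed, correlation 0.05.
n_eff = effective_sample_size(f_B=0.5, rho_sq=0.05 ** 2)
```

With these inputs the result is 400 up to floating-point rounding, regardless of N. Raising the coverage to fB = 0.9 with the same correlation gives about 3,600.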


Paradox of Big data (Meng 2018)

Confidence interval using the big data sample (ignoring the selection bias):

$$CI = \left( \bar{y}_B - 1.96\sqrt{(1 - f_B) S^2 / N_B},\ \bar{y}_B + 1.96\sqrt{(1 - f_B) S^2 / N_B} \right)$$

As $N_B \to \infty$, we have $\Pr(\bar{Y}_N \in CI) \to 0$.

Paradox: If one ignores the bias and applies the standard method of estimation, then the bigger the dataset, the more misleading it is for valid statistical inference.
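The paradox can be seen in a small Monte Carlo experiment. The sketch below, in Python, uses a hypothetical logistic selection mechanism that depends on y (an assumption chosen only to create selection bias) and counts how often the naive interval covers the population mean for a small and a large population.

```python
import math
import random

def naive_ci_coverage(N, reps=50, slope=0.3, seed=0):
    """Share of replications in which the naive 95% CI covers the population mean.

    Selection follows P(delta_i = 1 | y_i) = 1 / (1 + exp(-slope * y_i)),
    a hypothetical mechanism, not one taken from the slides.
    """
    rng = random.Random(seed)
    covered = 0
    for _ in range(reps):
        y = [rng.gauss(0.0, 1.0) for _ in range(N)]
        sample = [v for v in y if rng.random() < 1 / (1 + math.exp(-slope * v))]
        N_B = len(sample)
        f_B = N_B / N
        ybar_N = sum(y) / N
        ybar_B = sum(sample) / N_B
        S2 = sum((v - ybar_B) ** 2 for v in sample) / (N_B - 1)
        half_width = 1.96 * math.sqrt((1 - f_B) * S2 / N_B)
        covered += abs(ybar_B - ybar_N) <= half_width
    return covered / reps

coverage_small = naive_ci_coverage(N=200)
coverage_large = naive_ci_coverage(N=20_000)
```

The bias stays roughly constant as N grows while the interval shrinks, so the coverage for the large population falls far below that for the small one.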


Data Integration

Salvation of Big Data


Data integration: Basic Idea

Two data sets: big data and survey data.
Big data may be subject to selection bias.
For simplicity, assume a binary Y variable.

         δ = 1       δ = 0       Total
-------  ----------  ----------  -------
Y = 1    $N_{B1}$    $N_{C1}$    $N_1$
Y = 0    $N_{B0}$    $N_{C0}$    $N_0$
Total    $N_B$       $N_C$       $N$

where $\delta_i = 1$ if unit $i$ belongs to the big data sample and $\delta_i = 0$ otherwise.

Parameter of interest: $P = P(Y = 1)$.


Data integration: Basic Idea (Cont’d)

In addition, we have survey data of size $n$, obtained by SRS, with the following sample-level observations:

         δ = 1       δ = 0       Total
-------  ----------  ----------  -------
Y = 1    $n_{B1}$    $n_{C1}$    $n_1$
Y = 0    $n_{B0}$    $n_{C0}$    $n_0$
Total                            $n$

How to combine two data sources?


Combined estimation

Note that

P (Y = 1) = P (Y = 1 | δ = 1)P (δ = 1) + P (Y = 1 | δ = 0)P (δ = 0).

Three components:
1. $P(\delta = 1)$: big data proportion (known).
2. $P(Y = 1 \mid \delta = 1) = N_{B1}/N_B$: obtained from the big data.
3. $P(Y = 1 \mid \delta = 0)$: estimated by $n_{C1}/(n_{C0} + n_{C1})$ from the survey data.

Final estimator:

$$\hat{P} = P_B W_B + \hat{P}_C (1 - W_B) \qquad (1)$$

where $W_B = N_B/N$, $P_B = N_{B1}/N_B$, and $\hat{P}_C = n_{C1}/(n_{C0} + n_{C1})$.
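A minimal Python sketch of estimator (1), using the cell counts from the two tables as arguments. The numeric counts in the example are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def combined_estimator(N, N_B, N_B1, n_C0, n_C1):
    """Post-stratified estimate of P(Y = 1) following estimator (1)."""
    W_B = N_B / N                   # known big data proportion
    P_B = N_B1 / N_B                # proportion with Y = 1 inside the big data
    P_C = n_C1 / (n_C0 + n_C1)      # survey estimate for units outside the big data
    return P_B * W_B + P_C * (1 - W_B)

# Hypothetical counts (assumptions, not taken from the slides).
P_hat = combined_estimator(N=1_000_000, N_B=600_000, N_B1=360_000, n_C0=140, n_C1=60)
```

In this example the big data alone would suggest a proportion of 0.6, while the units outside the big data have an estimated proportion of 0.3, so the combined estimate is 0.6 × 0.6 + 0.3 × 0.4 = 0.48.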


Remark 1

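Variance:

$$V(\hat{P}) = (1 - W_B)^2 V(\hat{P}_C) \doteq (1 - W_B) \frac{1}{n} P_C (1 - P_C).$$

If $W_B$ is close to one, then the above variance is very small.

Instead of using $\hat{P}_C = n_{C1}/(n_{C0} + n_{C1})$, we can construct a ratio estimator of $P_C$ to improve the efficiency. That is, use

$$\hat{P}_{C,r} = \frac{1}{1 + \hat{\theta}_C}, \qquad \text{where} \quad \hat{\theta}_C = \frac{N_{B0}/N_{B1}}{n_{B0}/n_{B1}} \times (n_{C0}/n_{C1}).$$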
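The ratio estimator rescales the survey odds outside the big data by how far the survey odds inside the big data are from their known population value. A short Python sketch, with hypothetical counts, illustrates this behaviour.

```python
def ratio_estimator_P_C(N_B0, N_B1, n_B0, n_B1, n_C0, n_C1):
    """Ratio estimator of P(Y = 1 | delta = 0) as defined in Remark 1."""
    theta_C = (N_B0 / N_B1) / (n_B0 / n_B1) * (n_C0 / n_C1)
    return 1 / (1 + theta_C)

# Hypothetical counts. When the survey odds inside B match the population
# odds exactly, the ratio estimator equals the plain estimate n_C1 / (n_C0 + n_C1).
matched = ratio_estimator_P_C(240_000, 360_000, 120, 180, 140, 60)

# Here the survey over-represents Y = 0 inside B, so the odds of Y = 0
# outside B are scaled down and the estimate of P_C moves up.
adjusted = ratio_estimator_P_C(240_000, 360_000, 130, 170, 140, 60)
```

The first call returns the unadjusted value 0.3, and the second returns a larger value because the survey is known to overstate the odds of Y = 0 among big data units.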


Remark 2

The combined estimator is essentially a post-stratified estimator that uses δ as the post-stratification variable.

The post-stratification idea is directly applicable to a continuous Y variable.

Practical issues:
- δ can be obtained inaccurately (due to imperfect matching).
- We may have measurement errors in y in the big data.
- The survey sample may not observe y at all.


Two setups (A: survey sample data, B: Big data)

Parameter of interest: $\theta = \sum_{i \in U} y_i$

Table: Setup One

Data   X    Y    Represent?
-----  ---  ---  ----------
A      ✓         ✓
B      ✓    ✓

Probability sample does not observe the study variable.

Table: Setup Two

Data   X    Y
-----  ---  ---
A      ✓    ✓
B      ✓    ✓

Probability sample does observe the study variable.


Data Integration for Setup One

Rivers (2007) idea:
1. Use X to create a nearest neighbor imputation for each unit $i \in A$.
2. Compute

$$\hat{\theta} = \sum_{i \in A} w_i y_i^*$$

where $y_i^*$ is the imputed value of $y_i$ for $i \in A$.

Based on the MAR (missing at random) assumption:

$$f(y \mid x, \delta = 1) = f(y \mid x)$$

Bias may not be negligible if the dimension of x is high (due to the curse of dimensionality).

The naive variance estimator works well. (Estimation error is asymptotically negligible.)
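The two steps of the Rivers idea can be sketched in Python as follows. The sketch assumes a scalar covariate, absolute distance, and donors taken from B; these are simplifying assumptions, and the function name and toy data are hypothetical.

```python
def nearest_neighbor_estimate(x_A, w_A, x_B, y_B):
    """Nearest neighbor imputation estimate of the population total of y.

    x_A, w_A : covariate and design weight for each unit in survey sample A
    x_B, y_B : covariate and observed outcome for each unit in big data B
    Each unit in A receives the y of the unit in B whose x is closest.
    """
    total = 0.0
    for x_i, w_i in zip(x_A, w_A):
        # Index of the donor in B with the smallest distance in x.
        donor = min(range(len(x_B)), key=lambda j: abs(x_B[j] - x_i))
        total += w_i * y_B[donor]
    return total

# Toy data: three survey units, each representing 10 population units.
theta_hat = nearest_neighbor_estimate(
    x_A=[0.1, 0.9, 2.1], w_A=[10.0, 10.0, 10.0],
    x_B=[0.0, 1.0, 2.0, 3.0], y_B=[1.0, 3.0, 5.0, 7.0],
)
```

The three survey units are matched to the donors with y equal to 1, 3, and 5, so the estimated total is 10 × (1 + 3 + 5) = 90. With a high-dimensional x the nearest donor can be far away, which is the source of the bias noted above.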



Proposed method 1:
1. Obtain $\delta_i$ from A, by matching or by asking about membership in the big data.
2. Fit a model for $P(\delta = 1 \mid x)$ using sample A.
3. Use

$$\hat{\theta} = \sum_{i \in B} \hat{\pi}_i^{-1} y_i$$

where $\hat{\pi}_i = \hat{P}(\delta_i = 1 \mid x_i)$, adjusted to satisfy $\sum_{i \in B} \hat{\pi}_i^{-1} = N$.

Based on the MAR assumption.

Requires correct specification of the model for $\pi(x) = P(\delta = 1 \mid x)$.
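Step 3 of this method can be sketched in Python once fitted propensities are available. The sketch takes the propensities as given (fitting the model on sample A is outside its scope) and uses a simple multiplicative rescaling to make the weights sum to N, which is one possible adjustment and an assumption here.

```python
def propensity_weighted_total(y_B, pi_B, N):
    """Inverse propensity weighted estimate of the total of y from big data B.

    y_B  : outcomes observed in B
    pi_B : fitted propensities P(delta_i = 1 | x_i), assumed to come from a
           model fitted on sample A
    N    : known population size; weights are rescaled to sum to N
    """
    raw = [1.0 / p for p in pi_B]
    scale = N / sum(raw)
    weights = [scale * r for r in raw]
    total = sum(w * v for w, v in zip(weights, y_B))
    return total, weights

# Hypothetical toy values.
theta_hat, weights = propensity_weighted_total(
    y_B=[2.0, 4.0, 6.0], pi_B=[0.5, 0.25, 0.25], N=12
)
```

Units with a low propensity of entering the big data receive larger weights, which is how the estimator compensates for selection when the propensity model is right.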



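Proposed method 2: Doubly robust (DR) estimation
1. Fit a "working" model for $E(Y \mid x)$ to get $\hat{y}_i = \hat{E}(Y_i \mid x_i)$ for each $i \in A$ and $i \in B$.
2. Fit a "working" model for $P(\delta = 1 \mid x)$ to get $\hat{\pi}_i = \hat{P}(\delta_i = 1 \mid x_i)$ for each $i \in B$.
3. Use

$$\hat{\theta}_{DR} = \sum_{i \in A} w_i \hat{y}_i + \sum_{i \in B} \hat{\pi}_i^{-1} (y_i - \hat{y}_i)$$

where $\hat{\pi}_i = \hat{P}(\delta_i = 1 \mid x_i)$.

Based on the MAR assumption.

Requires that at least one of the two models be correctly specified.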
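Given fitted values from the two working models, the DR estimator is a prediction term from sample A plus a propensity-weighted residual correction from B. A minimal Python sketch with hypothetical inputs (the model fitting itself is not shown):

```python
def doubly_robust_total(w_A, yhat_A, pi_B, y_B, yhat_B):
    """Doubly robust estimate of the population total of y.

    w_A, yhat_A      : design weights and predicted outcomes for units in A
    pi_B, y_B, yhat_B: fitted propensities, observed outcomes, and predicted
                       outcomes for units in B
    """
    prediction_part = sum(w * m for w, m in zip(w_A, yhat_A))
    correction_part = sum((v - m) / p for v, m, p in zip(y_B, yhat_B, pi_B))
    return prediction_part + correction_part

# Hypothetical toy values.
theta_DR = doubly_robust_total(
    w_A=[5.0, 5.0], yhat_A=[1.0, 2.0],
    pi_B=[0.5, 0.5], y_B=[1.5, 2.5], yhat_B=[1.0, 2.0],
)
```

Here the prediction term is 15 and each residual of 0.5 is weighted by 1/0.5, adding 2, for a total of 17. If the outcome model predicted perfectly, the correction term would vanish.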


Justification for DR estimation

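Let $\hat{\theta}_{HT} = \sum_{i \in A} w_i y_i$ be the Horvitz-Thompson estimator that could be used if $y_i$ were observed in sample A.

Note that

$$\hat{\theta}_{DR} - \hat{\theta}_{HT} = -\sum_{i \in A} w_i e_i + \sum_{i \in B} \hat{\pi}_i^{-1} e_i$$

where $e_i = y_i - \hat{y}_i$.

Double robustness:
1. If the model for $P(\delta = 1 \mid x)$ is correctly specified, then

$$E_\delta\{\hat{\theta}_{DR} - \hat{\theta}_{HT}\} \cong -\sum_{i \in A} w_i e_i + \sum_{i \in U} e_i$$

which has design expectation zero, because $\sum_{i \in A} w_i e_i$ is design-unbiased for $\sum_{i \in U} e_i$.
2. If the model for $E(Y \mid x)$ is correctly specified, then $E(e_i) \cong 0$ under MAR.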


Data Integration for Setup Two

Table: Setup Two

Data   X    Y
-----  ---  ---
A      ✓    ✓
B      ✓

We are interested in estimating $\theta = \sum_{i \in U} y_i$ from the two data sources.


Data Integration for Setup Two

Note that we can compute $\hat{\theta}_A = \sum_{i \in A} w_i y_i$ from sample A.

Thus, unlike Setup One, the goal of data integration is to improve the efficiency (i.e., reduce the variance), not to reduce the selection bias.

How to incorporate the partial auxiliary information in data B?
1. If $B = U$, then it is an easy problem: calibration weighting.
2. For $B \subset U$, we can treat B as a sub-population and apply the same calibration weighting for $A \cap B$.


Calibration weighting in survey sampling

Initial (design) weight: $w_i$

Final weight: $w_i^*$ satisfying

$$\sum_{i \in A} w_i^* (1, x_i) = \sum_{i \in U} (1, x_i). \qquad (2)$$

Calibration weighting problem: Find $w_i^*$ that minimize

$$D(w, w^*) = \sum_{i \in A} w_i \left( \frac{w_i^*}{w_i} - 1 \right)^2$$

subject to (2).
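For this distance function the minimiser has a closed form: a Lagrange multiplier argument gives final weights equal to the design weight times (1 + λ0 + λ1 x), where (λ0, λ1) solves a 2 × 2 linear system built from the calibration constraints. The Python sketch below implements that solution for a scalar x; the function name and the toy numbers are assumptions for illustration.

```python
def calibrate_weights(w, x, N, x_total):
    """Calibrated weights minimising sum_i w_i * (w*_i / w_i - 1)^2
    subject to sum_i w*_i = N and sum_i w*_i * x_i = x_total.

    w       : design weights for sample A
    x       : scalar auxiliary variable for sample A
    N       : known population size
    x_total : known population total of x
    """
    s0 = sum(w)
    s1 = sum(wi * xi for wi, xi in zip(w, x))
    s2 = sum(wi * xi * xi for wi, xi in zip(w, x))
    r0, r1 = N - s0, x_total - s1          # shortfalls of the design weights
    det = s0 * s2 - s1 * s1
    if det == 0:
        raise ValueError("x must take at least two distinct values in the sample")
    lam0 = (r0 * s2 - r1 * s1) / det       # solve the 2 x 2 system by Cramer's rule
    lam1 = (s0 * r1 - s1 * r0) / det
    return [wi * (1 + lam0 + lam1 * xi) for wi, xi in zip(w, x)]

# Toy example: equal design weights, known population total of x is 260.
w_star = calibrate_weights(w=[25.0] * 4, x=[1.0, 2.0, 3.0, 4.0], N=100, x_total=260)
```

In the toy example the design-weighted total of x is 250, below the known 260, so the calibrated weights (22, 24, 26, 28) shift mass toward larger x while still summing to 100.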


Calibration weighting for big data integration

The auxiliary variable $x_i$ is observed only when $\delta_i = 1$.

The calibration equation is changed to

$$\sum_{i \in A} w_i^* (1 - \delta_i, \delta_i, \delta_i x_i) = \sum_{i \in U} (1 - \delta_i, \delta_i, \delta_i x_i). \qquad (3)$$

If $y_i = x_i$, it reduces to the post-stratification estimator in (1).


Simulation Study: Setup One

Goal: compare four estimators.
1. Naive estimator: mean of sample B.
2. Rivers estimator.
3. Proposed estimator 1 (PS estimator), using propensity score weighting.
4. Proposed estimator 2 (DR estimator), using a working model for $E(Y \mid x)$ and a working model for $P(\delta = 1 \mid x)$.

Three scenarios for the simulation study:
1. Both models are correct.
2. Only the model for $E(Y \mid x)$ is correct (i.e., the true distribution for $P(\delta = 1 \mid x)$ differs from the working model).
3. Only the model for $P(\delta = 1 \mid x)$ is correct.


Simulation study one: Setup

Outcome regression model:
1. Linear model. That is,

$$y_i = 1 + x_{1,i} + x_{2,i} + \varepsilon_i$$

for $i = 1, \ldots, N$, where $x_{1,i} \sim N(1, 1)$, $x_{2,i} \sim \mathrm{Exp}(1)$, $\varepsilon_i \sim N(0, 1)$, $N = 1{,}000{,}000$, and $(x_{1,i}, x_{2,i}, \varepsilon_i)$ are pairwise independent.

2. Nonlinear model. That is,

$$y_i = 0.5 (x_{1,i} - 1.5)^2 + x_{2,i} + \varepsilon_i,$$

where $(x_{1,i}, x_{2,i}, \varepsilon_i)$ are the same as in the linear model.

Big data sampling mechanism:
1. Linear logistic model.

$$\delta_i \mid p_i \sim \mathrm{Ber}(p_i)$$

for $i = 1, \ldots, N$, where $\mathrm{logit}(p_i) = x_{2,i}$.

2. Nonlinear logistic model.

$$\delta_i \mid p_i \sim \mathrm{Ber}(p_i)$$

for $i = 1, \ldots, N$, where $\mathrm{logit}(p_i) = -0.5 + 0.5 (x_{2,i} - 2)^2$.
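The data-generating setup above can be sketched in Python with the standard library. The population is scaled down to N = 100,000 to keep the run short, and the function and argument names are assumptions; only the model equations come from the slide.

```python
import math
import random

def generate_population(N, outcome="linear", selection="linear", seed=0):
    """Generate (x1, x2, y, delta) tuples following the simulation setup."""
    rng = random.Random(seed)
    population = []
    for _ in range(N):
        x1 = rng.gauss(1.0, 1.0)        # x1 ~ N(1, 1)
        x2 = rng.expovariate(1.0)       # x2 ~ Exp(1)
        eps = rng.gauss(0.0, 1.0)       # eps ~ N(0, 1)
        if outcome == "linear":
            y = 1 + x1 + x2 + eps
        else:
            y = 0.5 * (x1 - 1.5) ** 2 + x2 + eps
        if selection == "linear":
            logit = x2
        else:
            logit = -0.5 + 0.5 * (x2 - 2) ** 2
        p = 1 / (1 + math.exp(-logit))  # inverse logit
        delta = 1 if rng.random() < p else 0
        population.append((x1, x2, y, delta))
    return population

# Scaled-down population (assumption: N = 100,000 instead of 1,000,000).
pop = generate_population(N=100_000)
ybar_N = sum(u[2] for u in pop) / len(pop)
big_data = [u[2] for u in pop if u[3] == 1]
naive_bias = sum(big_data) / len(big_data) - ybar_N
```

Under the linear outcome and linear logistic selection, units with large x2 are more likely to enter the big data, so the naive mean of B should overshoot the population mean by an amount close to the 0.187 reported for the naive estimator in the simulation results.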


Simulation Result

                    n = 500                   n = 1000
Scenario  Method    Bias     S.E.    C.R.     Bias     S.E.    C.R.
--------  ------    ------   -----   -----    ------   -----   -----
I         Naive      0.187   0.001   0.000     0.187   0.001   0.000
          Rivers     0.000   0.077   0.950    -0.002   0.054   0.954
          PS        -0.001   0.023   0.950     0.000   0.016   0.946
          DR        -0.002   0.063   0.950    -0.002   0.044   0.950
II        Naive     -0.097   0.001   0.000    -0.097   0.001   0.000
          Rivers    -0.003   0.077   0.955    -0.001   0.055   0.945
          PS         0.110   0.183   0.986     0.084   0.085   0.996
          DR        -0.001   0.063   0.947     0.000   0.046   0.946
III       Naive      0.187   0.001   0.000     0.187   0.001   0.000
          Rivers     0.000   0.074   0.944     0.000   0.053   0.948
          PS        -0.001   0.022   0.946    -0.001   0.016   0.947
          DR        -0.001   0.050   0.950     0.001   0.035   0.950


Simulation Study: Setup Two

Finite population of size $N = 1{,}000{,}000$.

$$x_i \sim N(2, 1)$$

$$y_i = 3 + 0.7 (x_i - 2) + e_i$$

$$y_i^* = 2 + 0.9 (y_i - 3) + u_i$$

where $e_i \sim N(0, 0.51)$ and $u_i \sim N(0, 0.5^2)$. Note that $y_i^*$ is an inaccurate measurement of $y_i$.

Sampling mechanism for A: SRS of size $n = 500$.

Big data sampling mechanism: stratified random sampling.
1. Create two strata using $x_i \le 2$ and $x_i > 2$.
2. Within each stratum, select $n_h$ elements by SRS independently, where $n_1 = 300{,}000$ and $n_2 = 200{,}000$.
3. The stratum information is not available to the data analyst.
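The selection bias created by this stratified design can be checked with a short Python sketch. It scales the population down to N = 100,000 while keeping the stratum sampling at 30% and 20% of N, and it reads N(0, 0.51) as a variance of 0.51; both are assumptions, as are the function and variable names.

```python
import math
import random

def simulate_setup_two(N=100_000, seed=0):
    """Return (population mean of y, mean of y in big data sample B)."""
    rng = random.Random(seed)
    x = [rng.gauss(2.0, 1.0) for _ in range(N)]
    # e_i is taken to have variance 0.51, so its standard deviation is sqrt(0.51).
    y = [3 + 0.7 * (xi - 2) + rng.gauss(0.0, math.sqrt(0.51)) for xi in x]
    low = [i for i in range(N) if x[i] <= 2]     # stratum 1
    high = [i for i in range(N) if x[i] > 2]     # stratum 2
    n1, n2 = 3 * N // 10, 2 * N // 10            # same proportions as the slide
    B = rng.sample(low, n1) + rng.sample(high, n2)
    mean_pop = sum(y) / N
    mean_B = sum(y[i] for i in B) / len(B)
    return mean_pop, mean_B

mean_pop, mean_B = simulate_setup_two()
```

Because the stratum with small x is oversampled and y increases with x, the mean of B falls below the population mean of about 3.00; it should land near the value of 2.89 reported for Mean B in the results table.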


Simulation Study: Setup Two

In sample A, we observe $y_i$.

Two scenarios for sample B:
1. Observe $y_i$: big data is subject to selection bias.
2. Observe $y_i^*$: big data is subject to selection bias and measurement error.

We can identify the elements in $A \cap B$.

Three estimators for $\theta = E(Y)$:
1. Mean of sample A (Mean A).
2. Mean of sample B (Mean B).
3. Proposed data integration (DI) method using calibration weighting: in scenario one, we calibrate using $(1 - \delta_i, \delta_i y_i)$; in scenario two, we calibrate using $(1 - \delta_i, \delta_i y_i^*)$.


Simulation Result

Table: Monte Carlo results of the mean, variance, and MSE of the three estimators (true mean = 3.00156)

Scenario  Method        Mean   Variance (×10^4)   MSE (×10^4)
--------  -----------   ----   ----------------   -----------
1         Mean A        3.00   18.6               19
          Mean B        2.89   0.0                121
          Proposed DI   3.00   8.8                9
2         Mean A        3.00   18.6               19
          Mean B        1.90   0.0                12,130
          Proposed DI   3.00   11.4               11


Discussion

- Big data should not be analyzed naively. (Big data paradox!)
- Data integration is a useful tool for harnessing big data for finite population inference.
- Two setups are considered.
  - In Setup One, both Rivers' method and the DR method are promising.
  - In Setup Two, the calibration weighting method is useful.
- In Setup One, the MAR assumption is used. In Setup Two, we do not need the MAR assumption.
- Promising area of research.
