Data Integration for Big Data Analysis for finite population inference Jae-kwang Kim ISU January 23, 2018 1 / 36


Data Integration for Big Data Analysis for finite population inference

Jae-kwang Kim

ISU

January 23, 2018


What is big data?


Data do not speak for themselves

Knowledge

Information

Data

Interpretation

Reproducibility


Population and Sample

[Diagram labels: Population, Sample, Parameter, Estimator, Generalization, Inference]


Survey Sampling

Survey: Measurement
Sampling: Representation

Table: Survey Methodology and Sampling Statistics

Survey Methodology               Sampling Statistics
Psychology, Cognitive Science    Statistics
Studies nonsampling error        Studies sampling error
Questionnaire design             Sampling design, estimation


Two wings of survey data


Big Data

Big Data era: Freeconomics


Survey sample data vs Big Data

Table: Features

                      Survey sample data     Big Data
Cost function         C = C0 + C1 × n        C is not linear in n
Representativeness    ✓
Bias                  Bias = 0               Bias ≠ 0
Variance              Variance = K/n         Variance ≅ 0


Selection Bias

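Finite population: $U = \{1, \dots, N\}$.

Parameter of interest: $\bar{Y}_N = N^{-1} \sum_{i=1}^{N} y_i$.

Big data sample: $B \subset U$, with inclusion indicator

$$\delta_i = \begin{cases} 1 & \text{if } i \in B \\ 0 & \text{otherwise.} \end{cases}$$

Estimator: $\bar{y}_B = N_B^{-1} \sum_{i=1}^{N} \delta_i y_i$, where $N_B = \sum_{i=1}^{N} \delta_i$ is the big data sample size ($N_B < N$).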


MSE of Big Data Estimator

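MSE Formula

$$E_\delta(\bar{y}_B - \bar{Y}_N)^2 = E_\delta(\rho_{\delta,Y}^2) \times \sigma^2 \times \frac{1 - f_B}{f_B}$$

where $\rho_{\delta,Y} = \mathrm{Corr}(\delta, Y)$, $\sigma^2 = \mathrm{Var}(Y)$, $f_B = N_B/N$, and $E_\delta(\cdot)$ is the expectation with respect to the big data sampling mechanism, which is generally unknown.

If $E_\delta(\rho_{\delta,Y}) = 0$, then $E_\delta(\rho_{\delta,Y}^2) = O(N_B^{-1})$ and the MSE is of order $1/N_B$.

If $E_\delta(\rho_{\delta,Y}) \neq 0$, then $E_\delta(\rho_{\delta,Y}^2) = O(1)$ and the MSE is of order $1/f_B - 1$.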
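The MSE formula is the expectation of the square of an algebraic identity that holds exactly for any single realized big data sample: the error of the big data mean equals the correlation times the standard deviation times the square root of (1 - f_B)/f_B. The following is a minimal numerical check of that identity. The population (standard normal y) and the selection mechanism (logistic in y) are assumptions made only for illustration, and all moments are finite-population (divide-by-N) moments.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def error_and_decomposition(y, delta):
    """Return (y_bar_B - Y_bar_N, rho * sigma * sqrt((1 - f_B) / f_B))."""
    N = len(y)
    f_B = sum(delta) / N                               # f_B = N_B / N
    y_bar_N = mean(y)                                  # population mean
    y_bar_B = sum(d * v for d, v in zip(delta, y)) / sum(delta)
    # Finite-population (divide-by-N) standard deviation, covariance, correlation
    sigma = math.sqrt(mean([(v - y_bar_N) ** 2 for v in y]))
    cov = mean([(d - f_B) * (v - y_bar_N) for d, v in zip(delta, y)])
    rho = cov / (math.sqrt(f_B * (1 - f_B)) * sigma)
    return y_bar_B - y_bar_N, rho * sigma * math.sqrt((1 - f_B) / f_B)

rng = random.Random(2018)
y = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
# Assumed selection mechanism: units with larger y are more likely to be in B
delta = [1 if rng.random() < 1 / (1 + math.exp(-v)) else 0 for v in y]

error, decomposition = error_and_decomposition(y, delta)
print(error, decomposition)   # the two numbers agree up to rounding error
```

Because the identity is exact, the size of the error is driven by the correlation between inclusion and the study variable, not by the number of records in B.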


Effective sample size

$$n_{\mathrm{eff}} = \frac{f_B}{1 - f_B} \times \frac{1}{E_\delta(\rho_{\delta,Y}^2)}.$$

If $\rho_{\delta,Y} = 0.05$ and $f_B = 1/2$, then $n_{\mathrm{eff}} = 400$.

For example, suppose that the population size is $N = 10{,}000{,}000$ and 50% of the population is collected in the big data. If $\rho_{\delta,Y} = 0.05$, then the MSE of the big data sample mean equals that of an SRS mean with size $n = 400$.
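The arithmetic behind the effective sample size can be reproduced in a few lines. This is a small sketch; the function name `effective_sample_size` is hypothetical, and the second argument stands for $E_\delta(\rho_{\delta,Y}^2)$, here replaced by $0.05^2$ as in the example.

```python
def effective_sample_size(f_B, expected_rho_sq):
    """n_eff = f_B / (1 - f_B) * 1 / E(rho^2)."""
    return f_B / (1.0 - f_B) / expected_rho_sq

N = 10_000_000                  # population size from the example
N_B = N // 2                    # 50% of the population is in the big data
n_eff = effective_sample_size(f_B=N_B / N, expected_rho_sq=0.05 ** 2)
print(N_B, round(n_eff))        # 5000000 400
```

Five million records carry about as much information about the population mean as 400 units selected by SRS.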


Paradox of Big data (Meng 2018)

Confidence interval using the big data sample (ignoring the selection bias):

$$CI = \left(\bar{y}_B - 1.96\sqrt{(1 - f_B) S^2 / N_B},\ \bar{y}_B + 1.96\sqrt{(1 - f_B) S^2 / N_B}\right)$$

As $N_B \to \infty$, we have $\Pr(\bar{Y}_N \in CI) \to 0$.

Paradox: If one ignores the bias and applies the standard method of estimation, then the bigger the dataset, the more misleading it is for valid statistical inference.
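The paradox can be illustrated with a small Monte Carlo experiment. The sketch below is not from the slides: the population model (standard normal y), the selection mechanism (logistic in y with an assumed slope of 0.1), the population sizes, and the number of replications are all illustrative choices, and $S^2$ is taken to be the sample variance within B.

```python
import math
import random

def naive_ci_covers(N, slope, rng):
    """Draw one population and one big data sample; check the naive 95% CI."""
    y = [rng.gauss(0.0, 1.0) for _ in range(N)]
    # Assumed selection mechanism: P(delta = 1 | y) = expit(slope * y)
    y_B = [v for v in y if rng.random() < 1 / (1 + math.exp(-slope * v))]
    N_B = len(y_B)
    f_B = N_B / N
    y_bar_N = sum(y) / N
    y_bar_B = sum(y_B) / N_B
    S2 = sum((v - y_bar_B) ** 2 for v in y_B) / (N_B - 1)
    half_width = 1.96 * math.sqrt((1 - f_B) * S2 / N_B)
    return abs(y_bar_B - y_bar_N) <= half_width

def coverage(N, slope=0.1, reps=50, seed=1):
    """Fraction of replications in which the naive CI contains the true mean."""
    rng = random.Random(seed)
    return sum(naive_ci_covers(N, slope, rng) for _ in range(reps)) / reps

rates = {N: coverage(N) for N in (200, 2_000, 20_000)}
print(rates)
```

With f_B held near one half, the selection bias stays roughly constant while the interval shrinks, so the estimated coverage falls toward zero as N (and hence N_B) grows.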


Data Integration

Salvation of Big Data


Data integration: Basic Idea

Two data sets: big data and survey data.
Big data may be subject to selection bias.
For simplicity, assume a binary Y variable.

          δ = 1    δ = 0    Total
Y = 1     N_B1     N_C1     N_1
Y = 0     N_B0     N_C0     N_0
Total     N_B      N_C      N

where $\delta_i = 1$ if unit $i$ belongs to the big data sample and $\delta_i = 0$ otherwise.
Parameter of interest: $P = P(Y = 1)$.


Data integration: Basic Idea (Cont’d)

In addition, we have survey data of size n, selected by SRS, with the following sample-level counts:

          δ = 1    δ = 0    Total
Y = 1     n_B1     n_C1     n_1
Y = 0     n_B0     n_C0     n_0
Total                       n

How to combine two data sources?


Combined estimation

Note that

$$P(Y = 1) = P(Y = 1 \mid \delta = 1) P(\delta = 1) + P(Y = 1 \mid \delta = 0) P(\delta = 0).$$

Three components
1. $P(\delta = 1)$: big data proportion (known).
2. $P(Y = 1 \mid \delta = 1) = N_{B1}/N_B$: obtained from the big data.
3. $P(Y = 1 \mid \delta = 0)$: estimated by $n_{C1}/(n_{C0} + n_{C1})$ from the survey data.

Final estimator

$$\hat{P} = P_B W_B + \hat{P}_C (1 - W_B) \qquad (1)$$

where $W_B = N_B/N$, $P_B = N_{B1}/N_B$, and $\hat{P}_C = n_{C1}/(n_{C0} + n_{C1})$.
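The combined estimator in (1) is simple enough to compute directly from the two tables of counts. The sketch below uses made-up counts; the function name `combined_estimate` and all numbers are hypothetical.

```python
def combined_estimate(N, N_B1, N_B0, n_C1, n_C0):
    """P_hat = P_B * W_B + P_C_hat * (1 - W_B), as in equation (1)."""
    N_B = N_B1 + N_B0
    W_B = N_B / N                      # big data proportion (known)
    P_B = N_B1 / N_B                   # P(Y = 1 | delta = 1), from the big data
    P_C_hat = n_C1 / (n_C1 + n_C0)     # P(Y = 1 | delta = 0), from the survey
    return P_B * W_B + P_C_hat * (1 - W_B)

# Hypothetical counts: big data covers 600 of N = 1000 units; the survey
# finds 40 units outside the big data, 10 of them with Y = 1.
p_hat = combined_estimate(N=1000, N_B1=360, N_B0=240, n_C1=10, n_C0=30)
print(round(p_hat, 4))                 # 0.46
```

In this example the naive big data proportion is 0.6, while the combined estimate is pulled down to 0.46 by the survey units outside the big data.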


Remark 1

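Variance

$$V(\hat{P}) = (1 - W_B)^2 V(\hat{P}_C) \doteq (1 - W_B) \frac{1}{n} P_C (1 - P_C).$$

If $W_B$ is close to one, then the above variance is very small.

Instead of using $\hat{P}_C = n_{C1}/(n_{C0} + n_{C1})$, we can construct a ratio estimator of $P_C$ to improve the efficiency. That is, use

$$\hat{P}_{C,r} = \frac{1}{1 + \hat{\theta}_C}, \qquad \text{where} \quad \hat{\theta}_C = \frac{N_{B0}/N_{B1}}{n_{B0}/n_{B1}} \times (n_{C0}/n_{C1}).$$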
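A short sketch of the ratio estimator and the approximate variance, using hypothetical counts. The function names are illustrative. The second call shows a sanity check: when the survey odds within the big data stratum already match the known big data odds, the ratio adjustment has no effect and the estimator equals $n_{C1}/(n_{C0} + n_{C1})$.

```python
def ratio_estimate_PC(N_B1, N_B0, n_B1, n_B0, n_C1, n_C0):
    """P_C,r = 1 / (1 + theta_C), theta_C = (N_B0/N_B1) / (n_B0/n_B1) * (n_C0/n_C1)."""
    theta_C = (N_B0 / N_B1) / (n_B0 / n_B1) * (n_C0 / n_C1)
    return 1.0 / (1.0 + theta_C)

def approx_variance(W_B, P_C, n):
    """Approximate variance (1 - W_B) * P_C * (1 - P_C) / n of the combined estimator."""
    return (1 - W_B) * P_C * (1 - P_C) / n

# Hypothetical counts: the survey odds of Y = 0 inside B (20/40) differ from
# the known big data odds (240/360), so the survey odds outside B are rescaled.
p_ratio = ratio_estimate_PC(N_B1=360, N_B0=240, n_B1=40, n_B0=20, n_C1=10, n_C0=30)

# Survey odds inside B equal the big data odds: no adjustment takes place.
p_same = ratio_estimate_PC(N_B1=360, N_B0=240, n_B1=30, n_B0=20, n_C1=10, n_C0=30)

print(round(p_ratio, 4), round(p_same, 4))          # 0.2 0.25
print(approx_variance(W_B=0.6, P_C=0.25, n=100))
```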


Remark 2

The combined estimator is essentially a post-stratified estimator using δ as a post-stratification variable.
The post-stratification idea is directly applicable to a continuous Y variable.

Practical Issues
  • δ may be obtained inaccurately (due to imperfect matching).
  • We may have measurement errors in y in the big data.
  • The survey sample may not observe y at all.


Two setups (A: survey sample data, B: Big data)

Parameter of interest: $\theta = \sum_{i \in U} y_i$

Table: Setup One

Data    X    Y    Represent?
A       ✓         ✓
B       ✓    ✓

Probability sample does not observe the study variable

Table: Setup Two

Data    X    Y
A       ✓    ✓
B       ✓    ✓

Probability sample does observe the study variable


Data Integration for Setup One

Rivers (2007) idea
1. Use X to create nearest neighbor imputation for each unit $i \in A$.
2. Compute
$$\hat{\theta} = \sum_{i \in A} w_i y_i^*$$
where $y_i^*$ is the imputed value of $y_i$ for $i \in A$.

Based on the MAR (missing at random) assumption

$$f(y \mid x, \delta = 1) = f(y \mid x)$$

Bias may not be negligible if the dimension of x is high (due to the curse of dimensionality).
The naive variance estimator works well. (Estimation error is asymptotically negligible.)
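The two steps of the nearest neighbor idea can be sketched as follows. This is a minimal illustration that assumes a single scalar covariate and absolute distance; the function name and the toy data are hypothetical, and ties are broken in favor of the smaller x.

```python
import bisect

def nearest_neighbor_total(x_A, w_A, x_B, y_B):
    """Impute y for each unit in A from its nearest x-neighbor in B, then weight up."""
    donors = sorted(zip(x_B, y_B))            # sort donors by x for binary search
    donor_x = [x for x, _ in donors]
    total = 0.0
    for x, w in zip(x_A, w_A):
        pos = bisect.bisect_left(donor_x, x)
        # Only the donors just below and just above x can be the nearest one
        candidates = [j for j in (pos - 1, pos) if 0 <= j < len(donors)]
        j_star = min(candidates, key=lambda j: abs(donor_x[j] - x))
        total += w * donors[j_star][1]        # w_i * y_i^*
    return total

theta_hat = nearest_neighbor_total(
    x_A=[0.1, 1.9], w_A=[10.0, 10.0],         # survey sample A: x and design weights
    x_B=[0.0, 1.0, 2.0], y_B=[1.0, 3.0, 5.0], # big data B: x and y
)
print(theta_hat)   # 60.0
```

With a vector-valued x the same idea needs a multivariate distance, which is where the curse of dimensionality enters.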


Data Integration for Setup One

Proposed method 1
1. Obtain $\delta_i$ from A, by matching or by asking about membership in the big data.
2. Fit a model for $P(\delta = 1 \mid x)$ using sample A.
3. Use
$$\hat{\theta} = \sum_{i \in B} \hat{\pi}_i^{-1} y_i$$
where $\hat{\pi}_i = \hat{P}(\delta_i = 1 \mid x_i)$, adjusted to satisfy $\sum_{i \in B} \hat{\pi}_i^{-1} = N$.

Based on the MAR assumption.
Requires correct specification of the model for $\pi(x) = P(\delta = 1 \mid x)$.
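The three steps can be sketched end to end on simulated data. Everything specific here is an assumption for illustration: the population model, a single scalar covariate, an SRS for sample A (so an unweighted logistic fit is reasonable), and a hand-written Newton-Raphson fit in place of a statistics library.

```python
import math
import random

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(x, d, iters=25):
    """Newton-Raphson fit of logit P(d = 1 | x) = a + b * x for a scalar x."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, di in zip(x, d):
            p = expit(a + b * xi)
            r, v = di - p, p * (1.0 - p)
            g0 += r
            g1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det      # solve the 2x2 Newton system
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def ps_weights(x_B, a, b, N):
    """Inverse-propensity weights for units in B, rescaled to sum to N."""
    inv = [1.0 / expit(a + b * xi) for xi in x_B]
    scale = N / sum(inv)
    return [scale * v for v in inv]

# Hypothetical population: y depends on x, and so does inclusion in B
rng = random.Random(7)
N, n = 20_000, 500
x = [rng.gauss(0.0, 1.0) for _ in range(N)]
y = [1.0 + xi + rng.gauss(0.0, 1.0) for xi in x]
delta = [1 if rng.random() < expit(xi) else 0 for xi in x]

A = rng.sample(range(N), n)                   # step 1: SRS with delta_i observed
a_hat, b_hat = fit_logistic([x[i] for i in A], [delta[i] for i in A])  # step 2

B = [i for i in range(N) if delta[i] == 1]
w_B = ps_weights([x[i] for i in B], a_hat, b_hat, N)
theta_hat = sum(w * y[i] for w, i in zip(w_B, B))                      # step 3

true_mean = sum(y) / N
naive_mean = sum(y[i] for i in B) / len(B)
ps_mean = theta_hat / N
print(true_mean, naive_mean, ps_mean)
```

Sample A contributes only the pairs (x, δ); the study variable y is taken from B alone, which is what Setup One requires.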


Data Integration for Setup One

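Proposed method 2: Doubly robust (DR) estimation
1. Fit a "working" model for $E(Y \mid x)$ to get $\hat{y}_i = \hat{E}(Y_i \mid x_i)$ for each $i \in A$ and $i \in B$.
2. Fit a "working" model for $P(\delta = 1 \mid x)$ to get $\hat{\pi}_i = \hat{P}(\delta_i = 1 \mid x_i)$ for each $i \in B$.
3. Use
$$\hat{\theta}_{DR} = \sum_{i \in A} w_i \hat{y}_i + \sum_{i \in B} \hat{\pi}_i^{-1} (y_i - \hat{y}_i)$$
where $\hat{\pi}_i = \hat{P}(\delta_i = 1 \mid x_i)$.

Based on the MAR assumption.
Requires that one of the two models be correctly specified.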
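Once the two working models have been fitted, the DR estimator is a sum of a prediction term over A and a propensity-weighted residual term over B. The sketch below takes the fitted values as given inputs; the function name and the toy numbers are hypothetical.

```python
def dr_total(w_A, yhat_A, pi_B, y_B, yhat_B):
    """theta_DR = sum_A w_i * yhat_i + sum_B (y_i - yhat_i) / pi_i."""
    prediction_part = sum(w * yh for w, yh in zip(w_A, yhat_A))
    residual_part = sum((y - yh) / p for p, y, yh in zip(pi_B, y_B, yhat_B))
    return prediction_part + residual_part

# Hypothetical fitted values: two units in A, two units in B
theta_dr = dr_total(
    w_A=[50.0, 50.0], yhat_A=[2.0, 4.0],              # design weights, predictions in A
    pi_B=[0.5, 0.25], y_B=[3.0, 5.0], yhat_B=[2.5, 5.5],
)
print(theta_dr)   # 299.0 = 300 + (0.5 / 0.5 - 0.5 / 0.25)
```

The residual term is a bias correction: it vanishes when the outcome model predicts the units in B perfectly, and it is the only place where the observed y values enter.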


Justification for DR estimation

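Let $\hat{\theta}_{HT} = \sum_{i \in A} w_i y_i$ be the Horvitz-Thompson estimator that could be used if $y_i$ were observed in sample A.

Note that

$$\hat{\theta}_{DR} - \hat{\theta}_{HT} = -\sum_{i \in A} w_i e_i + \sum_{i \in B} \hat{\pi}_i^{-1} e_i$$

where $e_i = y_i - \hat{y}_i$.

Double Robustness
1. If the model for $P(\delta = 1 \mid x)$ is correctly specified, then
$$E_\delta\{\hat{\theta}_{DR} - \hat{\theta}_{HT}\} \cong -\sum_{i \in A} w_i e_i + \sum_{i \in U} e_i$$
which is design-unbiased for zero.
2. If the model for $E(Y \mid x)$ is correctly specified, then $E(e_i) \cong 0$ under MAR.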


Data Integration for Setup Two

Table: Setup Two

Data    X    Y
A       ✓    ✓
B       ✓

We are interested in estimating $\theta = \sum_{i \in U} y_i$ from the two data sources.


Data Integration for Setup Two

Note that we can compute $\hat{\theta}_A = \sum_{i \in A} w_i y_i$ from sample A.

Thus, unlike Setup One, the goal of data integration is to improve the efficiency (i.e., reduce the variance), not to reduce the selection bias.

How to incorporate the partial auxiliary information in data B?
1. If $B = U$, then it is an easy problem: calibration weighting.
2. For $B \subset U$, we can treat B as a sub-population and apply the same calibration weighting to $A \cap B$.


Calibration weighting in survey sampling

Initial (design) weight: $w_i$

Final weight: $w_i^*$ satisfying

$$\sum_{i \in A} w_i^* (1, x_i) = \sum_{i \in U} (1, x_i). \qquad (2)$$

Calibration weighting problem: Find $w_i^*$ that minimize

$$D(w, w^*) = \sum_{i \in A} w_i \left( \frac{w_i^*}{w_i} - 1 \right)^2$$

subject to (2).
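For this chi-square distance the minimization has a closed form obtained from the Lagrangian: $w_i^* = w_i (1 + z_i^\top \lambda)$ with $z_i = (1, x_i)$, where $\lambda$ solves $\sum_{i \in A} w_i z_i z_i^\top \lambda = \sum_{i \in U} z_i - \sum_{i \in A} w_i z_i$. The sketch below implements this for a scalar x by solving the 2x2 system by hand; the function name and the toy numbers are hypothetical.

```python
def calibrate(w, x, total_count, total_x):
    """Chi-square-distance calibration to the totals (N, sum of x) for z_i = (1, x_i)."""
    s0 = sum(w)
    s1 = sum(wi * xi for wi, xi in zip(w, x))
    s2 = sum(wi * xi * xi for wi, xi in zip(w, x))
    # Solve [[s0, s1], [s1, s2]] (l0, l1)' = (total_count - s0, total_x - s1)'
    r0, r1 = total_count - s0, total_x - s1
    det = s0 * s2 - s1 * s1
    l0 = (s2 * r0 - s1 * r1) / det
    l1 = (s0 * r1 - s1 * r0) / det
    return [wi * (1.0 + l0 + l1 * xi) for wi, xi in zip(w, x)]

# Hypothetical sample of 4 units with design weight 25 each (N = 100);
# the known population total of x is 260, but the design weights give 250.
w = [25.0, 25.0, 25.0, 25.0]
x = [1.0, 2.0, 3.0, 4.0]
w_star = calibrate(w, x, total_count=100, total_x=260)
print([round(v, 6) for v in w_star])   # [22.0, 24.0, 26.0, 28.0]
```

The calibrated weights reproduce both population totals exactly while staying as close as possible to the design weights in the chosen distance.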


Calibration weighting for big data integration

Auxiliary variables $x_i$ are observed only when $\delta_i = 1$.

The calibration equation is changed to

$$\sum_{i \in A} w_i^* (1 - \delta_i, \delta_i, \delta_i x_i) = \sum_{i \in U} (1 - \delta_i, \delta_i, \delta_i x_i). \qquad (3)$$

If $y_i = x_i$, it reduces to the post-stratification estimator in (1).
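Because $(1 - \delta_i)\delta_i = 0$, the calibration problem under (3) with the chi-square distance separates into two pieces: units with $\delta_i = 0$ receive a simple ratio adjustment to $N - N_B$, and units in $A \cap B$ are calibrated to $(N_B, \sum_{i \in B} x_i)$. The sketch below uses this block structure and checks the reduction to (1) for a binary variable with $y_i = x_i$. The function names and all counts are hypothetical, and a scalar x is assumed.

```python
def calibrate(w, x, total_count, total_x):
    """Chi-square calibration of weights w to totals of z_i = (1, x_i)."""
    s0 = sum(w)
    s1 = sum(wi * xi for wi, xi in zip(w, x))
    s2 = sum(wi * xi * xi for wi, xi in zip(w, x))
    r0, r1 = total_count - s0, total_x - s1
    det = s0 * s2 - s1 * s1
    l0 = (s2 * r0 - s1 * r1) / det
    l1 = (s0 * r1 - s1 * r0) / det
    return [wi * (1.0 + l0 + l1 * xi) for wi, xi in zip(w, x)]

def integrate_weights(w, delta, x, N, N_B, T_xB):
    """Weights satisfying (3); x[i] is used only where delta[i] == 1."""
    in_B = [i for i, d in enumerate(delta) if d == 1]
    out_B = [i for i, d in enumerate(delta) if d == 0]
    w_star = [0.0] * len(w)
    scale = (N - N_B) / sum(w[i] for i in out_B)      # ratio adjustment outside B
    for i in out_B:
        w_star[i] = w[i] * scale
    cal = calibrate([w[i] for i in in_B], [x[i] for i in in_B], N_B, T_xB)
    for i, wi in zip(in_B, cal):
        w_star[i] = wi
    return w_star

# Hypothetical binary example: N = 1000, N_B = 600, N_B1 = 360, SRS of n = 100
N, N_B, N_B1 = 1000, 600, 360
delta = [1] * 55 + [0] * 45
y = [1] * 35 + [0] * 20 + [1] * 15 + [0] * 30
x = [yi if d == 1 else None for yi, d in zip(y, delta)]   # x observed only in B
w = [10.0] * 100

w_star = integrate_weights(w, delta, x, N, N_B, T_xB=N_B1)
p_cal = sum(ws * yi for ws, yi in zip(w_star, y)) / N
p_post = (N_B1 / N_B) * (N_B / N) + (15 / 45) * (1 - N_B / N)   # equation (1)
print(round(p_cal, 6), round(p_post, 6))
```

The calibrated estimate and the post-stratified estimate agree, since calibrating on $\delta_i x_i$ with $x_i = y_i$ forces the weighted total of y within B to match the known big data total.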


Simulation Study: Setup One

Goal: compare four estimators
1. Naive estimator: mean of sample B.
2. Rivers estimator.
3. Proposed estimator 1 (PS estimator), using propensity score weighting.
4. Proposed estimator 2 (DR estimator), using a working model for $E(Y \mid x)$ and a working model for $P(\delta = 1 \mid x)$.

Three scenarios for the simulation study
1. Both models are correct.
2. Only the model $E(Y \mid x)$ is correct (i.e., the true distribution for $P(\delta = 1 \mid x)$ is different from the working model).
3. Only the model $P(\delta = 1 \mid x)$ is correct.


Simulation study one: Setup

Outcome regression model
1. Linear model. That is,
$$y_i = 1 + x_{1,i} + x_{2,i} + \varepsilon_i$$
for $i = 1, \dots, N$, where $x_{1,i} \sim N(1, 1)$, $x_{2,i} \sim Ex(1)$, $\varepsilon_i \sim N(0, 1)$, $N = 1{,}000{,}000$, and $(x_{1,i}, x_{2,i}, \varepsilon_i)$ are pairwise independent.
2. Nonlinear model. That is,
$$y_i = 0.5 (x_{1,i} - 1.5)^2 + x_{2,i} + \varepsilon_i,$$
where $(x_{1,i}, x_{2,i}, \varepsilon_i)$ are the same as those in the linear model.

Big data sampling mechanism
1. Linear logistic model.
$$\delta_i \mid p_i \sim \mathrm{Ber}(p_i)$$
for $i = 1, \dots, N$, where $\mathrm{logit}(p_i) = x_{2,i}$.
2. Nonlinear logistic model.
$$\delta_i \mid p_i \sim \mathrm{Ber}(p_i)$$
for $i = 1, \dots, N$, where $\mathrm{logit}(p_i) = -0.5 + 0.5 (x_{2,i} - 2)^2$.
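The data-generating step for the linear outcome model combined with the linear logistic selection model can be sketched as follows. Two assumptions are made here: Ex(1) is read as the exponential distribution with mean 1, and the population size is reduced from 1,000,000 to 100,000 to keep the pure-Python loop fast.

```python
import math
import random

rng = random.Random(0)
N = 100_000   # reduced from N = 1,000,000 for speed (an assumption of this sketch)

y, delta = [], []
for _ in range(N):
    x1 = rng.gauss(1.0, 1.0)                   # x1 ~ N(1, 1)
    x2 = rng.expovariate(1.0)                  # x2 ~ Ex(1), read as mean 1
    eps = rng.gauss(0.0, 1.0)                  # eps ~ N(0, 1)
    y.append(1.0 + x1 + x2 + eps)              # linear outcome model
    p = 1.0 / (1.0 + math.exp(-x2))            # linear logistic model: logit(p) = x2
    delta.append(1 if rng.random() < p else 0)

y_bar_N = sum(y) / N
N_B = sum(delta)
y_bar_B = sum(v for v, d in zip(y, delta) if d == 1) / N_B
naive_bias = y_bar_B - y_bar_N
print(N_B / N, naive_bias)
```

Under these two models the selection bias of the naive mean is $E(x_2 \mid \delta = 1) - E(x_2) = \pi^2/(12 \ln 2) - 1 \approx 0.187$, so the printed bias should be close to the naive bias of 0.187 reported for this mechanism in the simulation results.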


Simulation Result

Scenario  Method     n = 500                     n = 1000
                     Bias     S.E.    C.R.       Bias     S.E.    C.R.
I         Naive      0.187    0.001   0.000      0.187    0.001   0.000
          Rivers     0.000    0.077   0.950     -0.002    0.054   0.954
          PS        -0.001    0.023   0.950      0.000    0.016   0.946
          DR        -0.002    0.063   0.950     -0.002    0.044   0.950
II        Naive     -0.097    0.001   0.000     -0.097    0.001   0.000
          Rivers    -0.003    0.077   0.955     -0.001    0.055   0.945
          PS         0.110    0.183   0.986      0.084    0.085   0.996
          DR        -0.001    0.063   0.947      0.000    0.046   0.946
III       Naive      0.187    0.001   0.000      0.187    0.001   0.000
          Rivers     0.000    0.074   0.944      0.000    0.053   0.948
          PS        -0.001    0.022   0.946     -0.001    0.016   0.947
          DR        -0.001    0.050   0.950      0.001    0.035   0.950


Simulation Study: Setup Two

Finite population of size $N = 1{,}000{,}000$.

$$x_i \sim N(2, 1)$$
$$y_i = 3 + 0.7 (x_i - 2) + e_i$$
$$y_i^* = 2 + 0.9 (y_i - 3) + u_i$$

where $e_i \sim N(0, 0.51)$ and $u_i \sim N(0, 0.52)$. Note that $y_i^*$ is an inaccurate measurement of $y_i$.

Sampling mechanism for A: SRS of size $n = 500$.

Big data sampling mechanism: stratified random sampling
1. Create two strata using $x_i \le 2$ and $x_i > 2$.
2. Within each stratum, we select $n_h$ elements by SRS independently, where $n_1 = 300{,}000$ and $n_2 = 200{,}000$.
3. The stratum information is not available to the data analyst.
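The stratified big data mechanism can be sketched as follows. To keep the example fast, the population and the stratum sample sizes are scaled down by a factor of ten (N = 100,000 with 30,000 and 20,000 selected), which is an assumption of this sketch. The second parameter of N(0, 0.51) is read as a variance, which is also an assumption; it does not affect the selection bias.

```python
import math
import random

rng = random.Random(0)
N = 100_000                  # scaled down from 1,000,000 (assumption)
n_1, n_2 = 30_000, 20_000    # scaled down from 300,000 and 200,000

x = [rng.gauss(2.0, 1.0) for _ in range(N)]
# e_i ~ N(0, 0.51), with 0.51 read as the variance (assumption)
y = [3.0 + 0.7 * (xi - 2.0) + rng.gauss(0.0, math.sqrt(0.51)) for xi in x]

# Two strata defined by x <= 2 and x > 2; SRS within each stratum
stratum_1 = [i for i in range(N) if x[i] <= 2.0]
stratum_2 = [i for i in range(N) if x[i] > 2.0]
B = rng.sample(stratum_1, n_1) + rng.sample(stratum_2, n_2)

mean_N = sum(y) / N
mean_B = sum(y[i] for i in B) / len(B)
print(mean_N, mean_B, mean_B - mean_N)
```

Oversampling the low-x stratum pulls the mean of B below the population mean by roughly 0.11, which is in line with the "Mean B" value of 2.89 against a true mean of about 3.00 in the simulation results.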


Simulation Study: Setup Two

In sample A, we observe $y_i$.

Two scenarios for sample B
1. Observe $y_i$: big data is subject to selection bias.
2. Observe $y_i^*$: big data is subject to selection bias and measurement error.

We can identify the elements in $A \cap B$.

Three estimators for $\theta = E(Y)$
1. Mean of sample A (Mean A).
2. Mean of sample B (Mean B).
3. Proposed data integration (DI) method using calibration weighting: in scenario one, we calibrate on $(1 - \delta_i, \delta_i y_i)$; in scenario two, we calibrate on $(1 - \delta_i, \delta_i y_i^*)$.


Simulation Result

Table: Monte Carlo results of mean, variance, and the MSE of the three estimators (True mean = 3.00156)

Scenario  Method        Mean   Variance (×10^4)   MSE (×10^4)
1         Mean A        3.00   18.6               19
          Mean B        2.89    0.0               121
          Proposed DI   3.00    8.8               9
2         Mean A        3.00   18.6               19
          Mean B        1.90    0.0               12,130
          Proposed DI   3.00   11.4               11


Discussion

  • Big data should not be analyzed naively. (Big data paradox!)
  • Data integration is a useful tool for harnessing big data for finite population inference.
  • Two setups are considered.
      • In Setup One, both Rivers' method and the DR method are promising.
      • In Setup Two, the calibration weighting method is useful.
  • In Setup One, the MAR assumption is used. In Setup Two, we do not need the MAR assumption.
  • Promising area of research.
