Nuoo-Ting (Jassy) Molitor (1), Chris Jackson (2), with Nicky Best and Sylvia Richardson (1)

(1) Department of Epidemiology and Public Health, Imperial College, London

(2) MRC Biostatistics Unit

http://www.bias-project.org.uk

Bayesian graphical models for combining mismatched administrative and survey data: application to low birth weight and water disinfection by-products

Outline

Motivation for combining different data sources

Case study: Chlorination Study

Data sources

Statistical modelling

Simulation and real data analysis

Observational studies

Subject to many sources of uncertainty other than random error

Missing values

Unobserved confounders

Measurement error

Selection bias

Random errors

Uncertainties are hard to identify within a single data set

Combining multiple data sources

Research questions are complex, and a single data set may not be able to provide a sufficient answer.

Example: Puzzle

Case study

Combining birth register, survey and census data

to study effects of water disinfection by-products

on risk of low birth weight

Low birth weight (LBW)

(birth weight < 2.5 kg)

Environmental exposure: chlorine by-products (THMs)

Outcome: low birth weight (LBW)

LBW and pre-term (LBWP)

LBW and full-term (LBWF)

LBW: the baby's birth weight is less than 2.5 kg

LBWP: LBW baby born at less than 37 weeks' gestation

LBWF: LBW baby born at 37 weeks' gestation or later
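The three definitions above amount to a simple classification rule. The sketch below is illustrative only: the function name is hypothetical, and the numeric codes 1, 2, 3 follow the outcome coding used later in the disease model slides.

```python
def classify_birth(weight_kg: float, gestation_weeks: float) -> int:
    """Return 1 (normal), 2 (LBWP) or 3 (LBWF) using the cut-offs on the slide.

    Hypothetical helper: the name and signature are not from the slides.
    """
    if weight_kg >= 2.5:
        return 1  # not low birth weight
    if gestation_weeks < 37:
        return 2  # low birth weight and pre-term (LBWP)
    return 3      # low birth weight and full-term (LBWF)


# Example: a 2.3 kg baby born at 39 weeks is classed as LBWF.
print(classify_birth(2.3, 39))  # prints 3
```

Note that the split into categories 2 and 3 needs gestation age, which is why the sub-type is unavailable whenever gestation age is not recorded.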

Covariates: mother's race/ethnicity, baby's sex, mother's smoking status, mother's age during the pregnancy

Example of combining different data sources – Chlorination Study

Chlorine reacts with natural organic matter and/or a chemical compound (bromide) to form organic and inorganic by-products:

• bromate

• chlorite

• haloacetic acids (HAA5)

• total trihalomethanes (THMs)

Gestation age

Available data sources related to the Chlorination Study: why do we need them?

Administrative data (NBR)

• Deals with the small % of LBW in the population

• Deals with the inconclusive link between LBW and THMs

Aggregate data

• Used for imputing missing covariates

Survey data (MCS)

• Adjusts for important subject-level covariates

• Allows different types of LBW to be examined

Summary of data sources

NBR (national birth registry): administrative data (large); power, no selection bias

• Observed: postcode

• Missing: smoking and race/ethnicity, baby's gestation age

Aggregate data (UK)

• Observed: postcode

• Census 2001: region-level race/ethnicity composition

• Consumer survey (CACI): region-level tobacco expenditure

MCS (Millennium Cohort Study): survey data (subset of NBR); low power, selection bias

• Observed: postcode

• Observed: smoking and race/ethnicity, baby's gestation age

Disease sub-model for MCS

m: subject index for MCS; r: region index

[Diagram: graphical model in which the outcome y_rm (normal, LBWP or LBWF) depends on THM_rm, C_rm and the disease model parameters; nodes are marked as known or unknown.]

y: birth weight indicator (1: normal, 2: LBWP, 3: LBWF)

THM: THM (chlorine by-product) exposure

C: covariates such as race/ethnicity and smoking, which are missing in the NBR and only observed in the MCS

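Multinomial logistic regression for MCS

$$y_{rm} \sim \text{Multinomial}(p_{rm,1:3},\, 1)$$

$$\log\left(\frac{p_{rm,2}}{p_{rm,1}}\right) = b_{10} + b_{11}\,\text{THM}_{rm} + b_{12}\,C_{rm}$$

$$\log\left(\frac{p_{rm,3}}{p_{rm,1}}\right) = b_{20} + b_{21}\,\text{THM}_{rm} + b_{22}\,C_{rm}$$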
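To make the link between the two log-odds equations and the three category probabilities concrete, here is a minimal sketch with category 1 (normal) as the reference. The function name and the coefficient values in the example are hypothetical and chosen only for illustration; they are not estimates from the study, and C is treated as a single binary covariate for simplicity.

```python
import math


def category_probs(thm, c, b1, b2):
    """Return (p1, p2, p3) for one subject.

    b1 = (b10, b11, b12) drives log(p2 / p1); b2 = (b20, b21, b22) drives
    log(p3 / p1). Category 1 is the reference, so its linear predictor is 0.
    """
    eta2 = b1[0] + b1[1] * thm + b1[2] * c
    eta3 = b2[0] + b2[1] * thm + b2[2] * c
    denom = 1.0 + math.exp(eta2) + math.exp(eta3)
    return (1.0 / denom, math.exp(eta2) / denom, math.exp(eta3) / denom)


# Illustrative (made-up) coefficients: high THM exposure, covariate present.
p1, p2, p3 = category_probs(thm=1, c=1, b1=(-3.0, 0.2, 0.5), b2=(-3.5, 0.5, 0.9))
print(round(p1 + p2 + p3, 10))  # the three probabilities sum to 1
```

Because the same coefficients appear in the MCS and NBR equations, both data sets inform the same disease model parameters.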

Building the sub-model

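Disease sub-model for NBR

n: subject index for NBR; r: region index

[Diagram: graphical model in which the outcome y_rn depends on THM_rn, C_rn and the disease model parameters; nodes are marked as known or unknown. The normal category is observed, but the LBWP and LBWF categories are missing.]

The LBWP and LBWF categories are missing because gestation age is missing.

C: missing covariates such as race/ethnicity and smoking (missing in the NBR, but observed in the MCS)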


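Multinomial logistic regression for NBR

$$y_{rn} \sim \text{Multinomial}(p_{rn,1:3},\, 1)$$

$$\log\left(\frac{p_{rn,2}}{p_{rn,1}}\right) = b_{10} + b_{11}\,\text{THM}_{rn} + b_{12}\,C_{rn}$$

$$\log\left(\frac{p_{rn,3}}{p_{rn,1}}\right) = b_{20} + b_{21}\,\text{THM}_{rn} + b_{22}\,C_{rn}$$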

Missing outcome model: impute LBWP and LBWF for NBR

G-age: gestation age

[Diagram: the NBR and MCS disease sub-models side by side. In each, birth weight (BW) is split into normal and LBW, and LBW is split into LBWP and LBWF; the outcome depends on THM, C and the disease model parameters. In the MCS the LBWP/LBWF split is known; in the NBR it is unknown because G-age is missing.]

Missing covariate model: impute C_rn in terms of the aggregate data and the MCS data

[Diagram: C_rn (NBR) and C_rm (MCS) both depend on the aggregate data A_r and on the missing covariate model parameters; nodes are marked as known or unknown.]


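Multivariate probit model (Chib & Greenberg, 1998)

Since our missing covariates, race and smoking, are binary variables, we use a multivariate probit model to account for their correlation.

Race: 1 = non-white (Asian, Black, Others), 0 = white

Smoke: 1 = yes, 0 = no

Define underlying continuous variables (Smoke*, Race*) such that Smoke = I(Smoke* > 0) and Race = I(Race* > 0):

$$\begin{pmatrix} \text{Smoke}^{*} \\ \text{Race}^{*} \end{pmatrix} \sim \text{MVN}\left( \begin{pmatrix} u_{1,r} \\ u_{2,r} \end{pmatrix},\ \Sigma \right)$$

$$u_{1,r} = \delta_{01,s} + \delta_{11}^{T} A_r$$

$$u_{2,r} = \delta_{02,s} + \delta_{22}^{T} A_r, \quad s = 1, 2, 3$$

Correlation:

$$\Sigma = \begin{pmatrix} 1 & b \\ b & 1 \end{pmatrix}, \quad -1 \le b \le 1$$

s: sampling stratum (adjusts for selection bias)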
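The latent-variable construction of the bivariate probit can be illustrated by simulating one (Smoke, Race) pair. This is only a sketch of the data-generating mechanism, not of the model fitting: the function name is hypothetical, the means u_smoke and u_race stand in for the region-level linear predictors, and b is the latent correlation.

```python
import math
import random


def draw_smoke_race(u_smoke, u_race, b, rng):
    """Draw one (smoke, race) pair of 0/1 indicators from a bivariate probit.

    The latent pair has unit variances and correlation b, built from two
    independent standard normals.
    """
    z1 = rng.gauss(0.0, 1.0)
    z2 = b * z1 + math.sqrt(1.0 - b * b) * rng.gauss(0.0, 1.0)
    smoke_star = u_smoke + z1   # latent continuous smoking variable
    race_star = u_race + z2     # latent continuous race variable
    return int(smoke_star > 0), int(race_star > 0)


rng = random.Random(0)
pairs = [draw_smoke_race(-0.5, -1.0, 0.3, rng) for _ in range(5)]
print(pairs)  # five simulated (smoke, race) pairs
```

Fixing the latent variances at 1 reflects the identifiability constraint of a probit model: only the sign of each latent variable is observed, so its scale cannot be estimated.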

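Unified model

[Diagram: the full graphical model. The NBR disease sub-model (y_rn, THM_rn, C_rn) and the MCS disease sub-model (y_rm, THM_rm, C_rm) share the disease model parameters, with outcome categories normal, LBWP and LBWF. They are linked to the missing outcome model and to the missing covariate sub-model, in which C_rn and C_rm depend on the aggregate data A_r and the missing covariate model parameters. Nodes are marked as known or unknown.]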

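1. Disease model (y ∈ {1, 2, 3})

$$y^{obs}_{ri} \sim \text{Multinomial}(p_{ri,1:3},\, 1), \quad i \notin N_m$$

$$\log\left(\frac{p_{ri,u}}{p_{ri,1}}\right) = \beta_{0,u} + \beta_{1,u}\,\text{THM}_{ri} + \beta_{2,u}^{T} X_{ri} + \beta_{3,u}\,\text{Smoke}_{ri} + \beta_{4,u}\,\text{Race}_{ri}, \quad u = 2, 3$$

2. Missing outcome model

$$y^{miss}_{ri} \sim \text{Multinomial}(p^{*}_{ri,1:3},\, 1), \quad i \in N_m$$

$$p^{*}_{ri,1} = 0$$

$$p^{*}_{ri,u} = \frac{p_{ri,u}}{p_{ri,2} + p_{ri,3}}, \quad u = 2, 3$$

3. Missing covariates model (multivariate probit)

$$\text{Smoke}_{ri} = I(\text{Smoke}^{*}_{ri} > 0), \quad \text{Race}_{ri} = I(\text{Race}^{*}_{ri} > 0)$$

$$\begin{pmatrix} \text{Smoke}^{*}_{ri} \\ \text{Race}^{*}_{ri} \end{pmatrix} \sim \text{Multivariate Normal}\left( \begin{pmatrix} u_{1,r} \\ u_{2,r} \end{pmatrix},\ \Sigma \right)$$

$$u_{1,r} = \delta_{01,s} + \delta_{11}^{T} A_r$$

$$u_{2,r} = \delta_{02,s} + \delta_{22}^{T} A_r, \quad s = 1, 2, 3$$

$$\Sigma = \begin{pmatrix} 1 & b \\ b & 1 \end{pmatrix}, \quad -1 \le b \le 1$$

Notation:

i: subject index

N_m: group of subjects who had a missing outcome (y^miss)

r: region

u: index for the category of the outcome

y^obs: observed outcome

X: observed covariates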
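The missing outcome model can be read as a renormalisation: a subject in N_m is known to be low birth weight, so category 1 gets probability zero and the remaining mass is shared between categories 2 and 3 in proportion to the disease model probabilities. A minimal sketch follows (the function name is hypothetical and the probabilities in the example are made up):

```python
def missing_outcome_probs(p):
    """Map disease-model probabilities (p1, p2, p3) to (p1*, p2*, p3*).

    p1* = 0, and pu* = pu / (p2 + p3) for u = 2, 3.
    """
    _, p2, p3 = p
    total = p2 + p3
    return (0.0, p2 / total, p3 / total)


# With p = (0.90, 0.06, 0.04) the result is approximately (0, 0.6, 0.4).
print(missing_outcome_probs((0.90, 0.06, 0.04)))
```

In a sampler, the imputed sub-type for such a subject would be drawn from a multinomial with these renormalised probabilities.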

[Diagram: A (aggregate data), C (0/1) and Y (1, 2, 3), annotated with the missing covariate model and the missing outcome model.]

Investigating the performance of the unified model

Good performance of the model depends on:

1. How well the aggregate data can inform C (the covariates)

2. How strongly C and Y are linked

We can examine the following 4 data scenarios:

1. Strong (A → C), strong (C → Y)

2. Strong (A → C), weak (C → Y)

3. Weak (A → C), strong (C → Y)

4. Weak (A → C), weak (C → Y)

Simulation study

Step 1: Create data (N = 1333) under each of the scenarios.

Step 2: Assign missingness:

- randomly choose 80% of subjects and treat their C as missing

- only 10% of individuals with outcomes in categories 2 or 3 are assigned a missing outcome

Repeat step 2 to generate 20 replicate samples.

Step 3: Compare the predictions from an analysis using fully observed data (no imputation) with those from an analysis using partially observed data (imputation).

Note: the partially observed data were analysed under various models:

1. Covariate sub-model (examining A → C)

2. Outcome sub-model (examining C → Y)

3. Unified model (examining A → C and C → Y)

4. Unified model with cut
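The missingness assignment in step 2 can be sketched as follows. This is one possible reading of the slide, not the authors' code: it assumes the 10% is taken among subjects whose outcome is in category 2 or 3, and the function name, the made-up outcome counts and the use of a different seed per replicate are all hypothetical.

```python
import random


def assign_missing(outcomes, rng, frac_c=0.8, frac_y=0.1):
    """Return (c_missing, y_missing): sets of subject indices.

    c_missing: a random 80% of all subjects, whose covariates C are hidden.
    y_missing: a random 10% of the subjects with outcome 2 or 3, whose
    outcome sub-type is hidden (assumed interpretation of the slide).
    """
    idx = list(range(len(outcomes)))
    c_missing = set(rng.sample(idx, round(frac_c * len(idx))))
    lbw = [i for i in idx if outcomes[i] in (2, 3)]
    y_missing = set(rng.sample(lbw, round(frac_y * len(lbw))))
    return c_missing, y_missing


# Made-up outcomes for N = 1333 subjects; 20 replicates as on the slide.
outcomes = [1] * 1200 + [2] * 70 + [3] * 63
replicates = [assign_missing(outcomes, random.Random(seed)) for seed in range(20)]
print(len(replicates), len(replicates[0][0]), len(replicates[0][1]))
```

Each replicate keeps the simulated data fixed and only changes which values are hidden, so variation across replicates reflects the missingness pattern rather than new data.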

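Examining the imputation of the missing covariates: one level (A → C)

[Figure: panels for the strong A → C and weak A → C scenarios.]

A good imputation model assigns a higher probability of a covariate pattern to subjects whose true covariates correspond to that pattern than to subjects whose true pattern is different.

The ability to discriminate the true covariate pattern decreases from the strong A → C scenario to the weak A → C scenario.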
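The slides do not say which numerical measure of discrimination, if any, was used. One standard option, shown here purely as an assumption, is the concordance probability (equivalent to the area under the ROC curve): the chance that a randomly chosen subject who truly has the pattern receives a higher predicted probability than a randomly chosen subject who does not.

```python
def concordance(probs, has_pattern):
    """Concordance probability of predicted pattern probabilities.

    probs: predicted probability of the covariate pattern for each subject.
    has_pattern: True where the subject's true covariates match the pattern.
    Ties count as half a concordant pair.
    """
    pos = [p for p, t in zip(probs, has_pattern) if t]
    neg = [p for p, t in zip(probs, has_pattern) if not t]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Perfect separation of the two groups gives a concordance of 1.0.
print(concordance([0.9, 0.8, 0.3, 0.2], [True, True, False, False]))  # prints 1.0
```

A value near 0.5 would indicate that the predicted probabilities carry almost no information about the true pattern.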

Examining the imputation of the missing covariates: two levels (A → C and C → Y)

Feedback from the outcome model is beneficial to covariate imputation.

The predicted probabilities of covariate pattern (C = 0,0) are better able to discriminate between subjects whose true covariates are C = 0,0 and those whose true covariates are not, particularly in the weak C scenarios.

Examining the impact of the imputation model on the Y-C association

Scenario  Parameter      EST    Outcome model only  Unified model  Unified model w/ cut
                                Est (MSE)           Est (MSE)      Est (MSE)
SYSC      beta.smoke[3]  0.9    0.91 (0.01)         1.07 (0.27)    0.25 (0.43)
SYSC      beta.race[3]   1.79   1.83 (0.01)         2.22 (0.25)    1.12 (0.47)
SYWC      beta.smoke[3]  0.99   0.97 (0.00)         0.97 (0.51)    0.15 (0.71)
SYWC      beta.race[3]   2.56   2.57 (0.01)         2.71 (0.49)    0.67 (3.63)
WYSC      beta.smoke[3]  -0.02  0.05 (0.01)         0.57 (1.34)    0.06 (0.07)
WYSC      beta.race[3]   0.32   0.41 (0.03)         0.61 (0.41)    0.18 (0.09)
WYWC      beta.smoke[3]  0.35   0.34 (0.03)         0.91 (0.89)    0.09 (0.11)
WYWC      beta.race[3]   1      1.06 (0.04)         1.23 (1.32)    0.18 (0.84)

• Outcome model vs. unified model: the unified model has a higher MSE than the outcome model (more missing values need to be imputed).

• Unified vs. unified with cut: a strong Y-C association helps reduce the MSE, but a weak Y-C association does not.
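The slides do not define how the "Est (MSE)" entries were computed. A plausible reading, stated here as an assumption, is that Est is the average of the estimates over the 20 replicate samples and MSE is their mean squared difference from a reference value such as the fully observed estimate. The sketch below uses a hypothetical function name and made-up replicate estimates.

```python
def summarise_replicates(estimates, reference):
    """Return (mean estimate, mean squared error against the reference)."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    mse = sum((e - reference) ** 2 for e in estimates) / n
    return mean_est, mse


# Made-up replicate estimates of one coefficient, with reference value 1.0.
mean_est, mse = summarise_replicates([1.0, 1.2, 0.8], reference=1.0)
print(round(mean_est, 3), round(mse, 4))
```

Under this definition a method can have an average estimate close to the reference and still a large MSE if its estimates vary a lot between replicates, which is the pattern in several unified-model rows of the table.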

Real data analysis – a water company in Northern England

[Map of water company regions, with scale bar in kilometres: United Utilities, South West Water, Southern Water, Severn Trent Water, Essex and Suffolk Water, Anglian Water, Yorkshire Water, Northumbrian Water, Welsh Water, Thames Water, Bristol Water, Three Valleys Water. © Imperial College London]

Data:

Restricted to: singleton births

Period: Sep 2000 – Aug 2001

Subjects: MCS 1333 + NBR 7945 = Total 9278

MCS: completely observed information

NBR: missing race, missing smoking, and missing outcome at levels 2 (LBWP) and 3 (LBWF)

Missing % in race and smoking: ~85%

Missing % in outcome: ~7%

Real data analysis – a water company in northern England

Exposure variable: THMs

• Dichotomized into 2 groups:

• low-medium exposure group (<= 60 µg/l): 57.35%

• high exposure group (> 60 µg/l): 42.65%

• Estimated in a separate model for MCS and NBR (Whitaker et al., 2005)

In addition to race and smoking, we also adjust for the baby's sex and the mother's age, which are observed in both MCS and NBR.

Models for real data analysis: no imputation vs. imputation

a. Multinomial logistic regression model for MCS data (Bayesian): no imputation

b. Bayesian multiple bias model for combined NBR, MCS and aggregate data: imputes missing outcome and covariates

Results for the real data analysis (low birth weight full-term vs. normal)

OR (95% CI)*

Data            Model                            Outcome  THMs            Smoke           Non-white
MCS (1333)      Multinomial logistic (Bayesian)  LBWF     1.64 (0.8-3.1)  2.65 (1.2-5.2)  5.92 (2.2-12.9)
MCS+NBR (9278)  Bayesian multiple bias           LBWF     2.4 (1.1-4.5)   2.5 (1.1-4.7)   5.6 (2.6-10.8)

* 95% Bayesian credible interval

All parameter estimates are adjusted for the baby's sex and the mother's age.
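An odds ratio with a 95% credible interval is typically obtained from posterior draws of the corresponding log-odds coefficient by exponentiating and taking quantiles. The sketch below shows that generic calculation; it is not the authors' code, the slides do not say whether the reported point estimate is a posterior mean or median (the median is used here), and the simulated draws are made up.

```python
import math
import random


def odds_ratio_summary(beta_draws):
    """Return (median OR, 2.5% quantile, 97.5% quantile) from log-odds draws."""
    ors = sorted(math.exp(b) for b in beta_draws)
    n = len(ors)

    def quantile(prob):
        # Nearest-rank style quantile on the sorted draws.
        return ors[min(n - 1, max(0, int(round(prob * (n - 1)))))]

    return quantile(0.5), quantile(0.025), quantile(0.975)


# Made-up posterior draws for one coefficient (not from the real analysis).
rng = random.Random(42)
draws = [rng.gauss(0.5, 0.2) for _ in range(10000)]
median_or, lower, upper = odds_ratio_summary(draws)
print(round(median_or, 2), round(lower, 2), round(upper, 2))
```

An interval whose lower limit lies above 1, as for THMs in the combined MCS+NBR row, indicates that most of the posterior mass supports a positive association.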

Conclusion

There is evidence for an association between THM exposure and full-term low birth weight.

Combining the datasets can:

• increase the statistical power of the survey data

• alleviate bias due to confounding in the administrative data

The selection mechanism of the survey must be allowed for when combining data.

THANKS

Mireille Toledano, Mark Nieuwenhuijsen, James Bennett, Peter Hambly, Daniela Fecht, John Molitor
