Bayesian graphical models for combining mismatched administrative and survey data:
application to low birth weight and water disinfection by-products
Nuoo-Ting (Jassy) Molitor (1)
Chris Jackson (2)
With Nicky Best, Sylvia Richardson (1)
(1) Department of Epidemiology and Public Health, Imperial College, London
(2) MRC Biostatistics Unit, [email protected]@mrc-bsu.cam.ac.uk
http://www.bias-project.org.uk
Outline
• Motivation for combining different data sources
• Case study: Chlorination Study
• Data sources
• Statistical modeling
• Simulation and real data analysis
Observational studies
In addition to random errors, they are subject to many other uncertainties:
• Missing values
• Unobserved confounders
• Measurement errors
• Selection bias
Uncertainties are hard to identify within a single data set
Combining multiple data sources
Research questions are complex by nature, and a single data set may not be able to provide a sufficient answer.
Example: Puzzle
Case study
Combining birth register, survey and census data to study the effects of water disinfection by-products on the risk of low birth weight
Environmental exposure: chlorine by-products (THMs)
Outcome: low birth weight (LBW)
• LBW: baby's birth weight is less than 2.5 kg
• LBWP (LBW and pre-term): LBW babies born at less than 37 weeks
• LBWF (LBW and full-term): LBW babies born at 37 weeks or later
Covariates: mother's race/ethnicity, baby's sex, mother's smoking status, mother's age during the pregnancy
Example of combining different data sources: Chlorination Study
Chlorine reacts with natural organic matter and/or the chemical compound bromide to form organic and inorganic by-products:
• bromate
• chlorite
• haloacetic acids (HAA5)
• total trihalomethanes (THMs)
Available data sources related to the Chlorination Study: why do we need them?
Administrative data (NBR) deal with:
• The small % of LBW in the population
• The inconclusive link between LBW and THMs
Aggregate data:
• Imputing missing covariates
Survey data (MCS):
• Adjust for important subject-level covariates
• Allow different types of LBW to be examined
Summary of data sources
Administrative data: NBR (national birth registry); large, so good power and no selection bias
• Observed: postcode
• Missing: smoking and race/ethnicity
• Missing: baby's gestation age
Aggregate data (UK)
• Observed: postcode
• Census 2001: region-level race/ethnicity composition
• Consumer survey (CACI): region-level tobacco expenditure
Survey data: MCS (Millennium Cohort Study); a subset of NBR, so low power and selection bias
• Observed: postcode
• Observed: smoking and race/ethnicity
• Observed: baby's gestation age
Building the sub-model: disease sub-model for MCS
m: subject index for MCS; r: region index
Graph nodes: $y_{rm}$ (normal, LBWP, LBWF), $\text{THM}_{rm}$, $C_{rm}$ and the disease model parameters; each node is marked as known or unknown.
y: birth weight indicator (1: normal, 2: LBWP, 3: LBWF)
THM: THM (chlorine by-product) exposure
C: covariates such as race/ethnicity and smoking, which are missing in the NBR and only observed in the MCS
Multinomial logistic regression for MCS
$$y_{rm} \sim \text{Multinomial}(p_{rm,1:3}, 1)$$
$$\log(p_{rm,2} / p_{rm,1}) = b_{10} + b_{11}\,\text{THM}_{rm} + b_{12}\,C_{rm}$$
$$\log(p_{rm,3} / p_{rm,1}) = b_{20} + b_{21}\,\text{THM}_{rm} + b_{22}\,C_{rm}$$
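As a rough illustration of the multinomial logistic regression for the MCS, the sketch below turns the two log-odds equations into the three category probabilities, with category 1 (normal) as the baseline. The function name and the coefficient values are hypothetical and chosen only for illustration; they are not estimates from the study.

```python
import math

def multinomial_logit_probs(thm, c, b1, b2):
    """Return (p1, p2, p3) for normal, LBWP, LBWF.

    b1 = (b10, b11, b12) and b2 = (b20, b21, b22) are the coefficients
    of the two log-odds equations, with category 1 (normal) as baseline.
    """
    eta2 = b1[0] + b1[1] * thm + b1[2] * c  # log(p2 / p1)
    eta3 = b2[0] + b2[1] * thm + b2[2] * c  # log(p3 / p1)
    denom = 1.0 + math.exp(eta2) + math.exp(eta3)
    return (1.0 / denom, math.exp(eta2) / denom, math.exp(eta3) / denom)

# Hypothetical coefficients, chosen only for illustration.
p = multinomial_logit_probs(thm=1, c=1, b1=(-3.0, 0.2, 0.5), b2=(-3.5, 0.4, 0.9))
print(p, sum(p))
```

The three probabilities sum to 1 by construction, and the log of p2/p1 recovers the linear predictor of the first equation.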
Building the sub-model: disease sub-model for NBR
n: subject index for NBR; r: region index
Graph nodes: $y_{rn}$ (normal, LBWP, LBWF), $\text{THM}_{rn}$, $C_{rn}$ and the disease model parameters; each node is marked as known or unknown.
The LBWP and LBWF categories are missing because gestation age is missing.
C: covariates such as race/ethnicity and smoking (missing in the NBR, but observed in the MCS)
Multinomial logistic regression for NBR
$$y_{rn} \sim \text{Multinomial}(p_{rn,1:3}, 1)$$
$$\log(p_{rn,2} / p_{rn,1}) = b_{10} + b_{11}\,\text{THM}_{rn} + b_{12}\,C_{rn}$$
$$\log(p_{rn,3} / p_{rn,1}) = b_{20} + b_{21}\,\text{THM}_{rn} + b_{22}\,C_{rn}$$
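Missing outcome model: impute LBWP and LBWF for NBR
G-age: gestation age
NBR: birth weight (BW) is observed as normal or LBW; because G-age is missing, whether an LBW baby is LBWP or LBWF is unknown. Graph nodes: $y_{rn}$, $\text{THM}_{rn}$, $C_{rn}$ and the disease model parameters.
MCS: birth weight (BW) is observed as normal, LBWP or LBWF (known). Graph nodes: $y_{rm}$, $\text{THM}_{rm}$, $C_{rm}$ and the disease model parameters.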
Building the sub-model: missing covariate model
Impute $C_{rn}$ in terms of aggregate data and MCS data
Graph nodes: $C_{rn}$ (NBR), $C_{rm}$ (MCS), the aggregate data $A_r$ and the missing covariate model parameters; each node is marked as known or unknown.
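Multivariate Probit Model (Chib & Greenberg, 1998)
Since our missing covariates, such as race and smoke, are binary variables, we use a multivariate probit model to account for their correlation.
Race: 1 = non-white (Asian, Black, Others), 0 = white
Smoke: 1 = yes, 0 = no
Define underlying continuous variables (smoke*, race*), with Smoke = I(smoke* > 0) and Race = I(race* > 0):
$$\begin{pmatrix} \text{smoke}^* \\ \text{race}^* \end{pmatrix} \sim \text{MVN}\left( \begin{pmatrix} u_{1,r} \\ u_{2,r} \end{pmatrix}, \Sigma \right)$$
$$u_{1,r} = \delta_{01,s} + \delta_{11}^T A_r$$
$$u_{2,r} = \delta_{02,s} + \delta_{22}^T A_r, \quad s = 1, 2, 3$$
Correlation:
$$\Sigma = \begin{pmatrix} 1 & b \\ b & 1 \end{pmatrix}, \quad -1 \le b \le 1$$
s: sampling stratum (adjusts for selection bias)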
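A minimal sketch of how the bivariate probit generates a pair of binary covariates may help here. It assumes two region-level aggregates and made-up coefficient values; the helper names `latent_means` and `draw_smoke_race` are hypothetical and are not part of the authors' implementation.

```python
import math
import random

def latent_means(delta0, delta1, a_r):
    """u = delta0 + delta1^T A_r for one covariate (delta0 is the stratum intercept)."""
    return delta0 + sum(d * a for d, a in zip(delta1, a_r))

def draw_smoke_race(u_smoke, u_race, b, rng):
    """Draw (smoke, race) from a bivariate probit with latent correlation b."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    smoke_star = u_smoke + z1
    # Mixing z1 and z2 gives unit variance and correlation b with smoke_star.
    race_star = u_race + b * z1 + math.sqrt(1.0 - b * b) * z2
    return int(smoke_star > 0), int(race_star > 0)

rng = random.Random(0)
a_r = [0.3, 0.1]  # hypothetical region-level aggregates for one region
u1 = latent_means(-0.5, [1.0, 0.0], a_r)  # hypothetical coefficients
u2 = latent_means(-1.0, [0.0, 2.0], a_r)  # hypothetical coefficients
draws = [draw_smoke_race(u1, u2, b=0.4, rng=rng) for _ in range(5)]
print(draws)
```

Thresholding the latent variables at zero reproduces Smoke = I(smoke* > 0) and Race = I(race* > 0), while b carries the dependence between the two covariates.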
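Unified model
The unified model links the sub-models in a single graph:
• NBR disease sub-model: $y_{rn}$ (normal, LBWP, LBWF), $\text{THM}_{rn}$, $C_{rn}$, disease model parameters
• MCS disease sub-model: $y_{rm}$ (normal, LBWP, LBWF), $\text{THM}_{rm}$, $C_{rm}$, disease model parameters
• Missing covariate sub-model: $C_{rn}$, $C_{rm}$, aggregate data $A_r$, missing covariate model parameters
• Missing outcome model
Each node is marked as known or unknown.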
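1. Disease Model (y = {1, 2, 3})
$$y^{obs}_{ri} \sim \text{Multinomial}(p_{ri,1:3}, 1), \quad i \notin N^m$$
$$\log\frac{p_{ri,u}}{p_{ri,1}} = \beta_{0,u} + \beta_{1,u}\,\text{THM}_{ri} + \beta_{2,u}^T X_{ri} + \beta_{3,u}\,\text{Smoke}_{ri} + \beta_{4,u}\,\text{Race}_{ri}, \quad u = 2, 3$$
2. Missing Outcome Model
$$y^{miss}_{ri} \sim \text{Multinomial}(p^*_{ri,1:3}, 1), \quad i \in N^m$$
$$p^*_{ri,1} = 0$$
$$p^*_{ri,u} = \frac{p_{ri,u}}{p_{ri,2} + p_{ri,3}}, \quad u = 2, 3$$
3. Missing Covariates Model (Multivariate Probit)
$$\text{smoke}_{ri} = I(\text{smoke}^*_{ri} > 0), \quad \text{race}_{ri} = I(\text{race}^*_{ri} > 0)$$
$$\begin{pmatrix} \text{smoke}^*_{ri} \\ \text{race}^*_{ri} \end{pmatrix} \sim \text{Multivariate Normal}\left( \begin{pmatrix} u_{1,r} \\ u_{2,r} \end{pmatrix}, \Sigma \right)$$
$$u_{1,r} = \delta_{01,s} + \delta_{11}^T A_r$$
$$u_{2,r} = \delta_{02,s} + \delta_{22}^T A_r, \quad s = 1, 2, 3$$
$$\Sigma = \begin{pmatrix} 1 & b \\ b & 1 \end{pmatrix}, \quad -1 \le b \le 1$$
Notation:
i: subject index
$N^m$: group of subjects who had a missing outcome ($y^{miss}$)
r: region
u: index for the category of outcome
$y^{obs}$: observed outcome
X: observed covariates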
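The missing outcome model amounts to renormalising the disease-model probabilities over the two LBW categories. The sketch below illustrates that step for one subject; the function name and the input probabilities are hypothetical.

```python
def missing_outcome_probs(p):
    """Given (p1, p2, p3), return (p1*, p2*, p3*) for a subject known to be LBW.

    Category 1 (normal) is ruled out, so its probability is set to 0 and
    the LBWP / LBWF probabilities are renormalised to sum to 1.
    """
    p1, p2, p3 = p
    total = p2 + p3
    return (0.0, p2 / total, p3 / total)

# Hypothetical disease-model probabilities for one subject.
p_star = missing_outcome_probs((0.90, 0.06, 0.04))
print(p_star)
```

In a sampler, the missing category for a subject in the missing-outcome group would then be drawn from these renormalised probabilities.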
Investigating the performance of the unified model
Model structure: the aggregate data A inform the covariate C (0/1) through the missing covariate model, and C is linked to the outcome Y (1, 2, 3), whose missing categories are handled by the missing outcome model.
Good performance of the model depends on:
1. How well the aggregate data can inform C (covariate)
2. How strongly C and Y are linked
We can examine the following 4 data scenarios:
1. Strong (A→C), strong (C→Y)
2. Strong (A→C), weak (C→Y)
3. Weak (A→C), strong (C→Y)
4. Weak (A→C), weak (C→Y)
Simulation Study
Step 1: Create data (N=1333) under the scenarios.
Step 2: Missing assignment:
- randomly choose 80% of subjects and treat their C as missing
- assign a missing outcome to only 10% of the individuals with outcomes in categories 2 or 3
Repeat step 2 to generate 20 replicate samples.
Step 3: Compare the prediction based on an analysis using fully observed data (no imputation) with an analysis using partially observed data (imputation).
Note: partially observed data were analyzed under various models:
1. Covariate sub-model (examining A→C)
2. Outcome sub-model (examining C→Y)
3. Unified model (examining A→C and C→Y)
4. Unified model with cut
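The missing assignment in step 2 could be sketched as follows. This assumes the 10% applies to subjects whose outcome is in category 2 or 3, whose category is then treated as missing; the function name, the random seed and the outcome shares used to generate the example data are all hypothetical.

```python
import random

def assign_missing(outcomes, rng, p_cov=0.8, p_out=0.1):
    """Return (covariate_missing, outcome_missing) index sets for one replicate."""
    n = len(outcomes)
    # 80% of all subjects have their covariates C treated as missing.
    cov_missing = set(rng.sample(range(n), round(p_cov * n)))
    # 10% of subjects in outcome categories 2 or 3 have their outcome treated as missing.
    lbw = [i for i, y in enumerate(outcomes) if y in (2, 3)]
    out_missing = set(rng.sample(lbw, round(p_out * len(lbw))))
    return cov_missing, out_missing

rng = random.Random(1)
# Hypothetical outcomes for N=1333 subjects; category shares are illustrative only.
outcomes = rng.choices([1, 2, 3], weights=[0.90, 0.06, 0.04], k=1333)
replicates = [assign_missing(outcomes, rng) for _ in range(20)]
print(len(replicates), len(replicates[0][0]))
```

Each of the 20 replicates reuses the same simulated data and only redraws which values are hidden, which matches the "repeat step 2" instruction.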
Examining the imputation of missing covariates: one level (A→C)
Panels: strong A→C and weak A→C
The model assigns a higher probability of a covariate pattern to subjects whose true covariates correspond to that pattern than to those whose true pattern is different.
The ability to discriminate the true covariate pattern decreases from the strong A→C scenario to the weak A→C scenario.
Examining the imputation of missing covariates: two levels (A→C and C→Y)
Feedback from the outcome model is beneficial to covariate imputation.
The predicted probabilities of covariate pattern (C=0,0) are better able to discriminate between subjects whose true covariates are C=0,0 and those whose are not.
This is particularly so in the weak C scenarios.
Examining the impact of the imputation model on the Y-C association
Scenario | Parameter | EST | Outcome model only, Est (MSE) | Unified model, Est (MSE) | Unified model w/ cut, Est (MSE)
SYSC | beta.smoke[3] | 0.9 | 0.91 (0.01) | 1.07 (0.27) | 0.25 (0.43)
SYSC | beta.race[3] | 1.79 | 1.83 (0.01) | 2.22 (0.25) | 1.12 (0.47)
SYWC | beta.smoke[3] | 0.99 | 0.97 (0.00) | 0.97 (0.51) | 0.15 (0.71)
SYWC | beta.race[3] | 2.56 | 2.57 (0.01) | 2.71 (0.49) | 0.67 (3.63)
WYSC | beta.smoke[3] | -0.02 | 0.05 (0.01) | 0.57 (1.34) | 0.06 (0.07)
WYSC | beta.race[3] | 0.32 | 0.41 (0.03) | 0.61 (0.41) | 0.18 (0.09)
WYWC | beta.smoke[3] | 0.35 | 0.34 (0.03) | 0.91 (0.89) | 0.09 (0.11)
WYWC | beta.race[3] | 1 | 1.06 (0.04) | 1.23 (1.32) | 0.18 (0.84)
• Outcome vs. unified model: the unified model has a higher MSE than the outcome model (more missing values need to be imputed)
• Unified vs. unified with cut: a strong Y-C association helps reduce the MSE, but a weak Y-C association does not
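For readers unfamiliar with the Est (MSE) summaries, the sketch below shows one common way to compute them across replicate samples. It assumes that Est is the average of the replicate estimates and that the MSE is measured against a reference value such as the fully observed estimate; the slides do not state this explicitly, and the numbers used are hypothetical.

```python
def mean_and_mse(estimates, reference):
    """Average estimate and mean squared error against a reference value."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    mse = sum((e - reference) ** 2 for e in estimates) / n
    return mean_est, mse

# Hypothetical replicate estimates of one coefficient; not taken from the study.
est, mse = mean_and_mse([0.8, 1.0, 1.1, 0.7], reference=0.9)
print(round(est, 2), round(mse, 3))
```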
Real data analysis: a water company in northern England
[Map of water company regions, with a scale bar in kilometers: United Utilities, South West Water, Southern Water, Severn Trent Water, Essex and Suffolk Water, Anglian Water, Yorkshire Water, Northumbrian Water, Welsh Water, Thames Water, Bristol Water, Three Valleys Water. © Imperial College London]
Real data analysis: a water company in northern England
Data:
Restricted to: singleton births
Period: Sep 2000 – Aug 2001
Subjects: MCS 1333 + NBR 7945 = Total 9278
MCS: completely observed information
NBR: missing race, missing smoke, missing outcome at levels 2 (LBWP) and 3 (LBWF)
Missing % in race and smoke: ~85%
Missing % in outcome: ~7%
Exposure variable: THMs
• It was dichotomized into 2 groups:
• low-medium exposure group (<= 60 µg/l): 57.35%
• high exposure group (> 60 µg/l): 42.65%
• Estimated in a separate model for MCS and NBR (Whitaker et al., 2005)
In addition to race and smoke, we also adjust for baby's sex and mother's age, which are observed in both MCS and NBR.
Models for real data analysis
No imputation vs. imputation
a. Multinomial logistic regression model for MCS data (Bayesian): no imputation
b. Bayesian multiple bias model for combined NBR, MCS and aggregate data: imputes missing outcome and covariates
Results for the real data analysis (low birth weight full-term vs. normal)
OR (95% CI)*
Data | Model | Outcome | THMs | Smoke | Non-white
MCS (1333) | Multinomial logistic (Bayesian) | LBWF | 1.64 (0.8-3.1) | 2.65 (1.2-5.2) | 5.92 (2.2-12.9)
MCS+NBR (9278) | Bayesian multiple bias | LBWF | 2.4 (1.1-4.5) | 2.5 (1.1-4.7) | 5.6 (2.6-10.8)
* 95% Bayesian credible interval
All parameter estimates are adjusted for baby's sex and mother's age.
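As a sketch of how an odds ratio and its 95% credible interval can be read off posterior samples of a log odds ratio, the code below exponentiates the draws and takes order-statistic quantiles. Whether the reported ORs are posterior means or medians is not stated in the slides, so the use of the median here is an assumption, and the simulated draws are hypothetical rather than the study's MCMC output.

```python
import math
import random

def odds_ratio_summary(log_or_draws, level=0.95):
    """Posterior median OR and equal-tailed credible interval from log-OR draws."""
    ors = sorted(math.exp(d) for d in log_or_draws)
    n = len(ors)
    lo = ors[int((1 - level) / 2 * (n - 1))]  # lower order-statistic quantile
    hi = ors[int((1 + level) / 2 * (n - 1))]  # upper order-statistic quantile
    med = ors[(n - 1) // 2]
    return med, lo, hi

rng = random.Random(2)
# Hypothetical posterior draws of a log odds ratio; not the study's MCMC output.
draws = [rng.gauss(0.8, 0.3) for _ in range(4000)]
summary = odds_ratio_summary(draws)
print(summary)
```

An interval whose lower bound is above 1 corresponds to the kind of evidence reported for THMs in the combined analysis.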
Conclusion
There is evidence for an association of THM exposure with full-term low birth weight.
Combining the datasets can:
• increase the statistical power of the survey data
• alleviate bias due to confounding in the administrative data
We must allow for the selection mechanism of the survey when combining data.
Thanks
Mireille Toledano, Mark Nieuwenhuijsen, James Bennett, Peter Hambly, Daniela Fecht, John Molitor