Simulation of “forwards-backwards” multiple imputation technique in a longitudinal, clinical dataset

Catherine Welch¹, Irene Petersen¹, James Carpenter²

¹ Department of Primary Care and Population Health, UCL
² Department of Medical Statistics, LSHTM



Acknowledgements

• Steering Group:
  – Irwin Nazareth (UCL)
  – Kate Walters (UCL)
  – Ian White (MRC Biostatistics, Cambridge)
  – Richard Morris (UCL)
  – Louise Marston (UCL)

• This study was funded by the MRC


Overview

• Summary of motivation

• “Forwards-backwards” algorithm

• Issues that we have encountered


Introduction

• Most missing-data techniques were designed for cross-sectional data

• “Forwards-backwards” multiple imputation (MI) algorithm has been developed to impute missing values in longitudinal databases

• We are in the process of applying this technique to The Health Improvement Network (THIN) primary care database

• Impute variables associated with incidence of cardiovascular disease (CVD)


Clinical databases

• Offer many opportunities to study questions that would be difficult and expensive to address using standard study designs

• Designed for patient management


The Health Improvement Network (THIN)

• Primary care database

• Longitudinal records of patients’ consultations with a General Practitioner (GP) or nurse

• Data collected since the early 1990s

• 7 million patients in over 400 practices

• Over 40 million person-years of follow-up

• Systematically structured coding (Read codes)


Cardiovascular disease

• Clinical databases are a powerful data source for research, e.g. on cardiovascular disease

• New risk prediction models have caused much debate

• NICE recommends further research is required to validate models

• Important to have good measures of risk factors and consider missing data


Aims of this project…

• Explore the extent of missing data on health indicators (height, weight, blood pressure, cholesterol, smoking status, deprivation, alcohol consumption and ethnicity)

• Develop models for imputation of missing data


Survival models

1. Baseline – at practice registration

2. Age specific – extract data recorded at a specific age

3. Non-age specific – risk is constant across all ages

4. Time varying effect – risk varies across ages

[Diagram: timeline with markers for registration, 1 year following registration, age 50 and age 60]


Substantive model

• Include same variables as Framingham score plus deprivation (Townsend deprivation quintile) and BMI

• Poisson model to predict risk of coronary heart disease (CHD)

• Explanatory variables without missing data: age, sex, left ventricular hypertrophy (LVH), Type II diabetes

• With missing data: deprivation, weight, height, total serum cholesterol, high density lipoprotein (HDL) cholesterol, systolic blood pressure and smoking status


Imputation one year following registration

• Keep patients who registered between 2005 and 2008 and stayed with the practice for at least one year

• Exclude patients who have coronary heart disease within the first year

• Average of all recorded measurements during the first year included in the analysis

• Select the 50 practices with the least missing data for systolic blood pressure and weight per person

• First step: understand structure and extent of missing data
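
As a rough illustration of this first step, the sketch below tabulates the percentage of patients with a missing measurement in each age group. The age bands follow the figure axis; the record layout (`age`, `weight`, `sbp`) and the four example patients are hypothetical and are not THIN data.

```python
from collections import defaultdict

# Age bands as shown on the slides' figure axis; 95+ is handled separately.
AGE_BANDS = [(16, 24), (25, 34), (35, 44), (45, 54), (55, 64),
             (65, 74), (75, 84), (85, 94)]

def age_band(age):
    """Return the age-group label for an age in years, or None if under 16."""
    for lo, hi in AGE_BANDS:
        if lo <= age <= hi:
            return f"{lo}-{hi}"
    return "95+" if age >= 95 else None

def percent_missing(records, variable):
    """Percentage of records in each age band where `variable` is None."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for rec in records:
        band = age_band(rec["age"])
        if band is None:
            continue  # outside the bands shown in the figure
        total[band] += 1
        if rec[variable] is None:
            missing[band] += 1
    return {band: 100.0 * missing[band] / total[band] for band in total}

# Hypothetical records; None marks a missing measurement.
patients = [
    {"age": 30, "weight": 70.0, "sbp": None},
    {"age": 32, "weight": None, "sbp": None},
    {"age": 67, "weight": 82.5, "sbp": 140.0},
    {"age": 70, "weight": None, "sbp": 135.0},
]
print(percent_missing(patients, "weight"))
```

The same function can be called once per health indicator to build the series shown in a missingness-by-age plot.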


Missing health indicator variables by age

[Figure: percentage missing (0 to 100) by age group (16-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84, 85-94, 95+ years) for Townsend score, systolic blood pressure, weight, height, total cholesterol, HDL cholesterol and smoking status. 72,759 patients registered to 50 practices between 2005 and 2008.]


Missing health indicator variables by gender

[Figure: percentage missing (0 to 100) for females and males, for Townsend score, systolic blood pressure, weight, height, total cholesterol, HDL cholesterol and smoking status. 72,759 patients registered to 50 practices between 2005 and 2008.]


Problems with ‘ad-hoc’ imputation

• ‘Ad hoc’ imputation methods (e.g. complete case analysis, last observation carried forward (LOCF)) result in biased results and potentially incorrect conclusions

• Multiple imputation is now established as an alternative method to deal with missing data


Multiple imputation

• Assume Missing At Random

• Use the relationship between the variables to impute a valid estimate for a missing value

• Multiple estimates are combined using Rubin’s rules to produce unbiased estimates of coefficients and standard errors

• This takes account of uncertainty and variation in the data
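
The combining step can be sketched as follows. This is a minimal illustration of Rubin's rules for a single coefficient: the pooled estimate is the mean across imputations, and the total variance adds the between-imputation variance (inflated by 1 + 1/m) to the average within-imputation variance. The function name `pool_rubin` and the five estimates and standard errors are made up for illustration and are not results from this study.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)            # pooled point estimate
    w_bar = mean(variances)            # average within-imputation variance
    b = variance(estimates)            # between-imputation variance (divisor m - 1)
    total = w_bar + (1 + 1 / m) * b    # total variance of the pooled estimate
    return q_bar, total ** 0.5

# Hypothetical coefficient estimates and standard errors from m = 5 imputed datasets.
est = [0.50, 0.54, 0.47, 0.52, 0.49]
se = [0.10, 0.11, 0.10, 0.12, 0.10]
pooled, pooled_se = pool_rubin(est, [s ** 2 for s in se])
print(round(pooled, 3), round(pooled_se, 3))
```

The pooled standard error is larger than the typical single-imputation standard error, which is how the extra uncertainty from the missing values enters the analysis.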


Multiple imputation model

• All variables in substantive model included in imputation model

• Exponential survival model, so the imputation model includes an indicator for CHD and a variable for time to event or censoring

• MI applied 5 times and results combined
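
One way to see why an exponential survival model sits naturally alongside a Poisson substantive model is to compare the two log-likelihoods. The notation here is introduced for illustration and does not come from the slides: patient n has event indicator d_n (1 if CHD, 0 if censored), follow-up time t_n, covariates x_n and a constant hazard λ_n.

```latex
\begin{align*}
\ell_{\text{exp}}(\beta) &= \sum_{n} \bigl[ d_n \log \lambda_n - \lambda_n t_n \bigr],
  \qquad \lambda_n = \exp(x_n^{\top}\beta), \\
\ell_{\text{Pois}}(\beta) &= \sum_{n} \bigl[ d_n \log(\lambda_n t_n) - \lambda_n t_n - \log d_n! \bigr]
  = \ell_{\text{exp}}(\beta) + \sum_{n} d_n \log t_n .
\end{align*}
```

Because d_n is 0 or 1, the term log d_n! vanishes, and the remaining difference does not involve β. Under these assumptions, a Poisson model for d_n with log t_n as an offset gives the same coefficient estimates as the exponential survival model.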


Results for health indicators at baseline

                                                Complete case     Imputed data
Townsend score quintile, %
  1 (least deprived)                            13.72             13.65
  2                                             14.05             13.95
  3                                             24.77             24.84
  4                                             30.46             30.59
  5 (most deprived)                             17.00             16.98
Height (m), mean (SE)                           1.70 (0.00041)    1.70 (0.00041)
Weight (kg), mean (SE)                          72.6 (0.06644)    72.8 (0.06583)
Systolic blood pressure (mmHg), mean (SE)       123.8 (0.06707)   123.8 (0.05866)
Total serum cholesterol (mmol/l), mean (SE)     5.16 (0.01024)    5.05 (0.00882)
HDL cholesterol (mmol/l), mean (SE)             1.40 (0.00401)    1.43 (0.00545)
Smoking status, %
  Smoker                                        30.29             30.32
  Non-smoker                                    69.71             69.68


Survival models

1. Baseline – at practice registration

2. Age specific – extract data recorded at a specific age

3. Non-age specific – risk is constant across all ages

4. Time varying effect – risk varies across ages

[Diagram: timeline with markers for registration, 1 year following registration, age 50 and age 60]


Considerations when applying MI to longitudinal clinical data

• Longitudinal and dynamic structure of the data

• Imputing cross-sectionally is not appropriate

• Imputations need to produce a logical sequence of values over time

• Introduction of new quality measures which have improved data recording


Example of THIN data

Practice  ID  Sex  Age (years)  Cholesterol (mmol/l)  Weight (kg)
1         1   M    65           5.2                   80
1         1   M    66           ?.?                   86
1         1   M    67           6.0                   89
1         1   M    68           6.0                   95
1         2   F    65           3.4                   60
1         2   F    66           3.6                   60
1         2   F    67           3.6                   ??
1         2   F    68           4.0                   70


“Forwards-backwards” technique

• Based on the fully conditional specification method of MI

• Takes into account the dynamic, longitudinal structure of the data

• Does not require measurements at equally spaced time points

Nevalainen et al. Missing values in longitudinal dietary data: A multiple imputation approach based on a fully conditional specification. Statist. Med. 2009; 28:3657–3669


Fully conditional specification (FCS)

• Based on a flexible selection of univariate imputation distributions

• Impute one variable at a time using a distribution conditional on all the other variables

• Procedure iterates over the variables in cycles until assumed convergence

• Appropriate for non-normal distributions
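
The cycling idea can be sketched for two continuous variables. This is a deliberately simplified illustration, not the software used in the study: each variable is regressed on the other with ordinary least squares and missing values are drawn from the fitted line plus normal noise. A proper FCS implementation would also draw the regression parameters from their posterior and would choose a model suited to each variable type. The functions `fit_line` and `fcs_impute` are hypothetical names, and the rows reuse the cholesterol and weight values from the example table, treated here as independent records purely for illustration.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for y ~ x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / max(n - 2, 1)) ** 0.5

def fcs_impute(rows, cycles=10, seed=1):
    """Chained-equations imputation for two continuous columns (0 and 1)."""
    rng = random.Random(seed)
    data = [list(r) for r in rows]
    miss = [[v is None for v in r] for r in rows]
    # Initialise missing cells with the observed mean of their column.
    for j in (0, 1):
        obs = [r[j] for r in rows if r[j] is not None]
        for r in data:
            if r[j] is None:
                r[j] = sum(obs) / len(obs)
    for _ in range(cycles):
        for j in (0, 1):                     # impute one variable at a time
            k = 1 - j                        # condition on the other variable
            xs = [r[k] for r, m in zip(data, miss) if not m[j]]
            ys = [r[j] for r, m in zip(data, miss) if not m[j]]
            a, b, sd = fit_line(xs, ys)
            for r, m in zip(data, miss):
                if m[j]:
                    r[j] = a + b * r[k] + rng.gauss(0, sd)
    return data

# (cholesterol, weight) pairs; None marks a missing value.
rows = [(5.2, 80), (None, 86), (6.0, 89), (6.0, 95),
        (3.4, 60), (3.6, 60), (3.6, None), (4.0, 70)]
completed = fcs_impute(rows)
print(completed[1], completed[6])
```

Running the function with several different seeds would give the multiple completed datasets that are later analysed and pooled.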


A graphical illustration of the “forwards-backwards” FCS procedure

[Figure: diagram showing the within-time iteration and the among-time iteration]

$f(X_{ij}^{mis} \mid X_{i-1}, X_{i,-j}, X_{i+1}, Y)$


Example

Practice  ID  Sex  Age (years)  Cholesterol (mmol/l)  Weight (kg)
1         1   M    65           5.2                   80
1         1   M    66           ?.?                   86
1         1   M    67           6.0                   89
1         1   M    68           6.0                   95
1         2   F    65           3.4                   60
1         2   F    66           3.6                   60
1         2   F    67           3.6                   ??
1         2   F    68           4.0                   70


Example

Prac  ID  Sex  Age      Cholesterol (mmol/l)      Weight (kg)
               (years)  age 66  age 65  age 67    age 66  age 65  age 67
1     1   M    66       ?.?     5.2     6.0       86      80      89
1     2   F    66       3.6     3.4     3.6       60      60      ??
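
The reshaping shown in this table can be sketched as below: for each patient at a target age, the values of each variable at the previous and next age are placed alongside the current values, so that they can act as predictors in the imputation model. The function `to_wide` and the dictionary keys are hypothetical; the records are the example rows from the slides, with `None` standing for the missing entries.

```python
def to_wide(records, age, variables):
    """One row per patient at `age`, with each variable at age - 1, age and age + 1."""
    by_key = {(r["practice"], r["id"], r["age"]): r for r in records}
    wide = []
    for r in records:
        if r["age"] != age:
            continue
        row = {"practice": r["practice"], "id": r["id"], "sex": r["sex"], "age": age}
        for offset in (-1, 0, 1):
            other = by_key.get((r["practice"], r["id"], age + offset))
            for v in variables:
                # None if the patient has no record, or no value, at that age.
                row[f"{v}_{age + offset}"] = other[v] if other else None
        wide.append(row)
    return wide

def rec(practice, pid, sex, age, chol, weight):
    """Helper to build one long-format record."""
    return {"practice": practice, "id": pid, "sex": sex, "age": age,
            "chol": chol, "weight": weight}

records = [
    rec(1, 1, "M", 65, 5.2, 80), rec(1, 1, "M", 66, None, 86),
    rec(1, 1, "M", 67, 6.0, 89), rec(1, 1, "M", 68, 6.0, 95),
    rec(1, 2, "F", 65, 3.4, 60), rec(1, 2, "F", 66, 3.6, 60),
    rec(1, 2, "F", 67, 3.6, None), rec(1, 2, "F", 68, 4.0, 70),
]
wide = to_wide(records, 66, ["chol", "weight"])
for row in wide:
    print(row)
```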


Example

Practice  ID  Sex  Age (years)  Cholesterol (mmol/l)  Weight (kg)
1         1   M    65           5.2                   80
1         1   M    66           5.8                   86
1         1   M    67           6.0                   89
1         1   M    68           6.0                   95
1         2   F    65           3.4                   60
1         2   F    66           3.6                   60
1         2   F    67           3.6                   ??
1         2   F    68           4.0                   70


Apply “forwards-backwards” algorithm to THIN

• Select patients registered to 50 THIN practices from 2005 to 2008

• Apply algorithm at all ages

• Extract imputations for 11,614 patients aged 60 years


Preliminary results

                                                Complete case      Imputed data
Townsend score quintile, %
  1                                             30.05              28.67
  2                                             24.56              24.76
  3                                             18.69              18.71
  4                                             14.86              15.75
  5                                             11.83              12.11
Height (m), mean (SE)                           1.68 (0.00130)     1.67 (0.00091)
Weight (kg), mean (SE)                          80.25 (0.23961)    79.39 (0.15976)
Systolic blood pressure (mmHg), mean (SE)       136.18 (0.18086)   135.86 (0.21134)
Total serum cholesterol (mmol/l), mean (SE)     5.26 (0.01616)     5.40 (0.01482)
HDL cholesterol (mmol/l), mean (SE)             1.44 (0.00667)     1.47 (0.00738)
Smoking status, %
  Smoker                                        29.13              27.92
  Non-smoker                                    70.87              72.08

11,614 patients aged 60 years registered to 50 practices between 2005 and 2008


Discussion

• Potential to develop this method further

• Validation:
  – using simulations
  – investigate distributions of longitudinal values
  – external information

• What would be the best way to include outcome in the “forwards-backwards” imputation model?

• Interactions


FCS using longitudinal data

• Y – fully observed outcome variable

• $X = (X_1, \ldots, X_q)$, where $X_i = (X_{i1}, \ldots, X_{ip})$: q repeated measures of p explanatory variables intended to be collected

• $X^{obs}$ and $X^{mis}$ denote the observed and the missing elements in X

• Need to specify a suitable imputation model $f(X^{mis} \mid X^{obs}, Y, \theta)$

• In the FCS of the imputation model, imputations are made one variable at a time using a series (j = 1, ..., p) of conditional densities

$f(X_{ij}^{mis} \mid X_{i1}, \ldots, X_{i(j-1)}, X_{i(j+1)}, \ldots, X_{ip}, Y)$

• denoted as $f(X_{ij}^{mis} \mid X_{i,-j}, Y)$

• $X_{i1}, \ldots, X_{i(j-1)}$ have been imputed k+1 times

• $X_{i(j+1)}, \ldots, X_{ip}$ have been imputed k times.


FCS using longitudinal data

• At time i, impute $X_i^{mis}$ conditional on $X_{i-1}^{obs}, X_{i,-j}^{obs}, X_{i+1}^{obs}$ and the outcome Y.

• Rather than condition only on the observed data, we generate appropriate values for $X_i^{mis}$, one variable $X_{ij}^{mis}$ at a time, from the fully conditional imputation model

$f(X_{ij}^{mis} \mid X_{i-1}, X_{i,-j}, X_{i+1}, Y)$

• One iteration (within-time iteration) runs over the variables j = 1, . . . , p.

• Because the inter-correlation among repeatedly measured variables is also of importance, we have a second imputation iteration over the index i (among times).
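
The two nested iterations can be sketched as a visiting schedule. This is only an outline under an assumption the slides do not spell out: that the among-time iteration sweeps through the time points forwards and then backwards, as the name of the technique suggests. The function `forwards_backwards` and the callback `impute_one` are hypothetical; in a real run `impute_one(i, j)` would draw the missing values of variable j at time i from the conditional model above, and here it simply records the order of the visits.

```python
def forwards_backwards(times, variables, impute_one, among_cycles=1, within_cycles=1):
    """Call impute_one(i, j) for every time i and variable j, forwards then backwards."""
    for _ in range(among_cycles):
        for sweep in (times, list(reversed(times))):   # among-time iteration
            for i in sweep:
                for _ in range(within_cycles):          # within-time iteration
                    for j in variables:
                        impute_one(i, j)

# Record the visiting order for three ages and two variables.
visits = []
forwards_backwards([65, 66, 67], ["cholesterol", "weight"],
                   lambda i, j: visits.append((i, j)))
print(visits)
```

Raising `within_cycles` repeats the loop over variables at each time point before moving on, and raising `among_cycles` repeats the whole forwards and backwards pass.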


FCS using longitudinal data