Missing data and imputation
Philip Anner
CeMSIIS - Section for Artificial Intelligence and Decision Support
Medizinische Universität Wien · 2017-11-23

Missing data

• Why?
  • administrative reasons
  • equipment failure
  • human errors
  • patient dropout
  • study design
• Prevention is better than statistical “cures”
• However, some missing values are unavoidable
• Identify the underlying cause of missing values

Missing data patterns

van Buuren, S.: Multiple imputation of discrete and continuous data by fully conditional specification. Stat. Methods Med. Res. 16, 219–242 (2007).

Mechanisms of missing data

• Missing completely at random (MCAR)
  • Missingness depends only on unknown parameters, not on observed or unobserved data values
  • Test: are there significant differences in observed values between patients with all variables observed and patients with missing values?

Example: missing blood pressure measurements in a study due to the breakdown of a medical device.
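The test described above can be sketched in a few lines. This is only an illustration under stated assumptions: the records are made up, a missing blood pressure is coded as `None`, and Welch's t statistic is one possible choice of test (the slides do not name one). The helper name `welch_t` is hypothetical.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    var_a = statistics.variance(a) / len(a)
    var_b = statistics.variance(b) / len(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a + var_b)

# Hypothetical records: (age, systolic blood pressure or None if missing)
records = [
    (25, None), (31, 118), (38, None), (42, 125), (47, 130),
    (53, None), (58, 135), (64, 141), (69, None), (75, 150),
]

# Compare a fully observed variable (age) between complete and incomplete cases.
age_complete = [age for age, bp in records if bp is not None]
age_missing = [age for age, bp in records if bp is None]

t = welch_t(age_complete, age_missing)
print(f"Welch t for age, complete vs. incomplete cases: {t:.2f}")
```

A small absolute t value gives no evidence against MCAR, but it does not prove MCAR either: the comparison can only use variables that were actually observed.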

• Missing at random (MAR):
  • less restrictive assumption than MCAR
  • systematic differences between observed and missing values can be explained by observed data

Example: missing blood pressure measurements of young patients in a study. Young patients may forget to measure their blood pressure more often than older patients. Young patients usually have lower blood pressure than older people. The missing values can be explained by age, provided that age itself has no missing values.

• Missing not at random (MNAR):
  • Missing data do not occur at random and depend on other missing factors
  • Significant differences between observed and missing data cannot be explained by observed data
  • Cannot be excluded! … can only be explained by other missing factors

Example: missing blood pressure measurements of patients suffering from hypertension. These patients have a higher incidence of headache. Therefore, they skip the examination in the clinic and prefer to stay at home.

Missing value imputation

Why missing value imputation?

• Complete case analysis
  • the “easy way”
  • drop study items or variables containing missing values
  • suitable only for small amounts of missing data
• Problems:
  • MCAR:
    • reduced power (lower n!)
    • a potentially existing effect cannot be shown
  • MAR/MNAR:
    • severe bias
    • possibly wrong conclusions
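A small simulation can illustrate these problems. It is a sketch on made-up data: the linear relation between age and blood pressure, the missingness probabilities and the helper `p_missing` are all assumptions chosen for the example. It compares the complete-case mean under MCAR and under MAR with the mean of the full data.

```python
import random
import statistics

random.seed(0)

# Hypothetical population: blood pressure rises with age.
ages = [random.uniform(20, 80) for _ in range(20000)]
bps = [100 + 0.5 * age + random.gauss(0, 10) for age in ages]
true_mean = statistics.mean(bps)

# MCAR: every measurement is missing with the same probability (30%).
mcar_observed = [bp for bp in bps if random.random() > 0.3]

# MAR: younger patients skip the measurement more often (depends on age only).
def p_missing(age):
    return 0.6 if age < 50 else 0.1

mar_observed = [bp for age, bp in zip(ages, bps)
                if random.random() > p_missing(age)]

mcar_mean = statistics.mean(mcar_observed)
mar_mean = statistics.mean(mar_observed)

print(f"full data mean:          {true_mean:.1f}")
print(f"complete cases (MCAR):   {mcar_mean:.1f}  n={len(mcar_observed)}")
print(f"complete cases (MAR):    {mar_mean:.1f}  n={len(mar_observed)}")
```

Under MCAR the complete-case mean stays close to the full-data mean and only the sample size shrinks. Under MAR the older, higher-pressure patients are over-represented among the complete cases, so the mean is shifted upwards.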

Weighting Procedures

• “Weighted complete case analysis” vs. “weighted imputation”
• Weight respondents by their inverse probability of response
• Improved representation of the population containing missing values
• Pros:
  • Easy implementation and interpretation
  • Usually good results under MAR with one variable containing missing values
• Cons:
  • Not sufficient in multiple or complex nonresponse situations
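The weighting idea can be sketched for the simplest case: one variable with missing values and nonresponse that depends on a fully observed grouping. Everything in this example is assumed for illustration: the simulated data, the two age groups and the response probabilities. The response probability is estimated per group, and each respondent is weighted by its inverse.

```python
import random
import statistics

random.seed(1)

# Hypothetical data: blood pressure rises with age.
ages = [random.uniform(20, 80) for _ in range(20000)]
bps = [100 + 0.5 * age + random.gauss(0, 10) for age in ages]

def group(age):
    return "young" if age < 50 else "old"

# MAR nonresponse driven by the fully observed age group.
p_respond = {"young": 0.4, "old": 0.9}
responded = [random.random() < p_respond[group(a)] for a in ages]

# Estimate the response probability within each group from the data.
p_hat = {}
for g in ("young", "old"):
    flags = [r for a, r in zip(ages, responded) if group(a) == g]
    p_hat[g] = sum(flags) / len(flags)

# Weighted complete-case mean: each respondent gets weight 1 / p_hat.
num = sum(bp / p_hat[group(a)] for a, bp, r in zip(ages, bps, responded) if r)
den = sum(1 / p_hat[group(a)] for a, r in zip(ages, responded) if r)
ipw_mean = num / den

naive_mean = statistics.mean(bp for bp, r in zip(bps, responded) if r)
true_mean = statistics.mean(bps)

print(f"full data mean:              {true_mean:.1f}")
print(f"unweighted complete cases:   {naive_mean:.1f}")
print(f"inverse probability weights: {ipw_mean:.1f}")
```

The weights restore the share of the under-represented young patients, so the weighted mean moves back towards the full-data mean. With several incomplete variables or more complex nonresponse, a single set of weights is no longer enough.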

Single Imputation Methods

Imputation-Based Procedures

• Use available data to estimate missing values
  • Mean imputation
  • Hot deck imputation
    • copying available values from cases that are similar in observed variables
  • Last value carried forward
    • replacing missing entries by the last measured value
  • Regression imputation
• Cons:
  • Biased parameter estimates
  • Biased standard errors
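Two of the single imputation methods listed above are simple enough to sketch directly. The function names `mean_impute` and `locf` and the example series are hypothetical; missing entries are assumed to be coded as `None`. The last lines show one reason for the biased standard errors: after mean imputation the data look less variable than the observed values actually are.

```python
import statistics

def mean_impute(values):
    """Replace each None by the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

def locf(values):
    """Last value carried forward: replace None by the last observed value.
    Leading None entries stay None because there is nothing to carry forward."""
    result, last = [], None
    for v in values:
        if v is not None:
            last = v
        result.append(last)
    return result

series = [120, None, 135, None, None, 150]
filled_mean = mean_impute(series)
filled_locf = locf(series)

# Spread before and after mean imputation.
observed = [v for v in series if v is not None]
sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(filled_mean)

print("mean imputation:", filled_mean)
print("LOCF:           ", filled_locf)
print(f"sd of observed values: {sd_observed:.2f}, sd after mean imputation: {sd_imputed:.2f}")
```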

Model-Based Procedures

• Bayesian approach
• Estimate a posterior distribution or likelihood for the imputation of missing variables
• Methods:
  • Maximum likelihood
  • Multiple linear regression models
  • Multiple Imputation

Multiple Imputation

• Take uncertainties of missing values into account
  • Each missing value has a distribution of likely values
  • The distribution reflects the uncertainty about what the value may have been

• Joint modeling (JM)
  • Partition observations into groups of identical missing data patterns
  • Impute each pattern according to a joint model
  • reduced flexibility
• Fully conditional specification (FCS)
  • Multivariate imputation by chained equations (MICE)
  • Generate a model for each variable containing missing values
  • Include: and other variables

Multiple Imputation by Chained Equations (MICE)

• An imputation model for each variable
  • including other variables (fully observed or partially missing)
• Iterative estimation of missing values
  1. Single-value imputation (e.g. mean imputation)
  2. Estimate one variable, using all others as independent variables
  3. Infer values by drawing from a Gibbs sampler
  4. Repeat steps 2 and 3 for each variable
  • Repeat multiple times, until the distribution of drawn values converges
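The iteration above can be sketched for two numeric variables. This is a deliberately simplified illustration, not the algorithm of the mice package: it uses ordinary least squares and adds residual noise, whereas a proper implementation also draws the regression parameters from their posterior. The function names, the simulated data and the fixed number of iterations are assumptions of this example.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual sd)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return a, b, sd

def chained_imputation(data, n_iter=10, rng=random):
    """data: list of [x, y] rows with None for missing entries.
    Returns one completed copy of the data."""
    cols = range(2)
    missing = [[row[j] is None for row in data] for j in cols]
    filled = [list(row) for row in data]
    # Step 1: initialise every missing entry with the column mean.
    for j in cols:
        mean_j = statistics.mean(row[j] for row in data if row[j] is not None)
        for i, row in enumerate(filled):
            if missing[j][i]:
                row[j] = mean_j
    for _ in range(n_iter):
        for j in cols:   # Steps 2 to 4: cycle through the variables.
            k = 1 - j    # the other variable acts as predictor
            obs = [i for i in range(len(filled)) if not missing[j][i]]
            a, b, sd = fit_line([filled[i][k] for i in obs],
                                [filled[i][j] for i in obs])
            for i in range(len(filled)):
                if missing[j][i]:
                    # Draw from the fitted conditional distribution.
                    filled[i][j] = a + b * filled[i][k] + rng.gauss(0, sd)
    return filled

# Hypothetical data: two correlated variables, each with some missing entries.
random.seed(2)
rows = []
for _ in range(200):
    x = random.gauss(50, 10)
    rows.append([x, 100 + 0.5 * x + random.gauss(0, 5)])
for row in rows:
    u = random.random()
    if u < 0.15:
        row[0] = None
    elif u < 0.35:
        row[1] = None

# Repeating the whole procedure yields several completed data sets.
completed = [chained_imputation(rows) for _ in range(5)]
print("mean of y in each completed data set:",
      [round(statistics.mean(r[1] for r in d), 2) for d in completed])
```

Observed entries are never changed; only the missing entries are redrawn in each cycle, which is why the completed data sets differ from one another.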

Multiple Imputation

• Predictor selection for imputation
  • Use correlated (partially) observed variables
  • The more, the better

Elementary Imputation Methods

• Bayesian linear regression: numeric variables
• Predictive mean matching (PMM): numeric variables
  • Non-parametric approach (donor pool)
  • Imputed values cannot be outside of a variable’s observed range
• Logistic regression: 2 categories
• Polytomous logistic regression: >= 2 categories
• Linear discriminant analysis: >= 2 categories

Predictive mean matching

[Figure: predictive mean matching; axes X and Y; points marked “X and Y observed” and “Only X observed”; label “Mahalanobis distance”]
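The donor idea behind predictive mean matching can be sketched for one predictor X and one incomplete variable Y. The sketch makes several simplifying assumptions that are not taken from the slides: closeness is measured as the absolute difference between predicted means (rather than the Mahalanobis distance named in the figure), the regression uses plain least squares without a posterior draw, and the donor pool size `k` and the function name `pmm_impute` are hypothetical.

```python
import random
import statistics

def pmm_impute(xs, ys, k=3, rng=random):
    """Impute None entries of ys by predictive mean matching on xs."""
    obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    ox = [x for x, _ in obs]
    oy = [y for _, y in obs]
    # Least-squares line fitted on the cases where both X and Y are observed.
    mx, my = statistics.mean(ox), statistics.mean(oy)
    b = sum((x - mx) * (y - my) for x, y in obs) / sum((x - mx) ** 2 for x in ox)
    a = my - b * mx
    donor_pred = [a + b * x for x in ox]
    result = []
    for x, y in zip(xs, ys):
        if y is not None:
            result.append(y)
            continue
        target = a + b * x
        # Donor pool: the k observed cases whose predicted means are closest.
        nearest = sorted(range(len(oy)),
                         key=lambda i: abs(donor_pred[i] - target))[:k]
        # The imputed value is the observed Y of a randomly chosen donor.
        result.append(oy[rng.choice(nearest)])
    return result

random.seed(3)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, None, 8.2, 9.8, None, 14.1, 16.0]
imputed = pmm_impute(xs, ys)
print(imputed)
```

Because every imputed value is copied from a donor, it is always a value that was actually observed, which is why imputations cannot fall outside the observed range of the variable.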

Multiple Imputation

• Assessing convergence of the Gibbs sampler (distribution)

• Example of non-convergence

• Validity of imputed values

Pooling estimates - Rubin's rules

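• Combine m results into a single estimate, where $\hat{Q}_i$ is the estimate and $U_i$ its variance from the i-th imputed data set
• Point estimates
  • Average: $\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} \hat{Q}_i$
• Variance
  • Total variance: $T = \bar{U} + \left(1 + \frac{1}{m}\right) B$
  • Within-imputation variance: $\bar{U} = \frac{1}{m} \sum_{i=1}^{m} U_i$
  • Between-imputation variance: $B = \frac{1}{m-1} \sum_{i=1}^{m} (\hat{Q}_i - \bar{Q})^2$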
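Pooling by Rubin's rules is short enough to write out. The sketch below assumes the standard form of the rules: the pooled estimate is the average of the m estimates, the within-imputation variance is the average of the m squared standard errors, the between-imputation variance is the sample variance of the m estimates, and the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name `pool_rubin` and the input numbers are made up for illustration.

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances (squared standard errors)."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)        # pooled point estimate
    within = statistics.mean(variances)       # within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (divides by m - 1)
    total = within + (1 + 1 / m) * between    # total variance
    return q_bar, within, between, total

# Hypothetical results from m = 3 imputed data sets.
q_bar, within, between, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(f"estimate={q_bar}, within={within}, between={between}, total={total:.4f}")
```

The factor (1 + 1/m) accounts for the fact that only a finite number of imputations was used.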

Questions?

Thank you for your attention!

Literature:

Rubin, D.B.: Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons (1987).

Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. John Wiley & Sons (2002).

Horton, N.J., Kleinman, K.P.: Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. Am. Stat. 61, 79–90 (2007).

mice: https://cran.r-project.org/web/packages/mice/index.html