Bridging the Gaps: Dealing with Major Survey Changes in Data Set Harmonization

Joint Statistical Meetings

Minneapolis, MN

August 9, 2005

Presented by:

Michael Davern, Ph.D.

Assistant Professor, Research Director

SHADAC, Health Services Research and Policy

University of Minnesota

Supported by a grant from The Robert Wood Johnson Foundation

Co-authors

This work is coauthored with:

• Miriam King, Ph.D., Research Associate

We both are with the Minnesota Population Center at the University of Minnesota

Data set harmonization

• The goal is to simplify access to all available years of a data set for analysis of trends over time.

• This goal has many difficulties associated with it.

• We focus on the issues involved with handling major sources of survey error over time.

Survey changes present challenges to harmonization

• Sample design

– How people and records are drawn into a data set changes and affects how variance estimation is done.

• Nonresponse

– How surveys account for unit, supplement, person and item nonresponse changes over time.

• Survey questions and measurement

– Changes to question wording and question universes.

• Survey processing/editing

– Changes to processing and data editing.

Decennial census sample designs

• Decennial census sampling

– Involves both sampling of people/households to receive the “long form” and sampling of long form records to release (1% and 5%).

– Both the household/person selection and the process used to select the public use microdata samples change over time.

– Data users need access to the sample design information to calculate appropriate variances/standard errors.

• Although appropriate estimates can be obtained with replicate weights, at the moment most users do not use them.

– We are testing sample design variables to add to the IPUMS for Taylor Series estimation.

• Will include a stratification variable, a cluster variable and a weighting variable (when available) so analysts can simply program in SAS, Stata, SUDAAN, etc.

• Our approach will make the changes in sample design seem seamless to the data user and will increase the use of more appropriate estimation methods.
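
To illustrate what Taylor Series (linearization) estimation does with a stratification variable, a cluster variable and a weight, the sketch below computes a weighted mean and its standard error in plain Python. The function name `linearized_mean_se` and the record layout are assumptions made for this example and are not IPUMS variable names. The sketch uses the common with-replacement approximation and ignores finite population corrections. In practice analysts would rely on the survey procedures in SAS, Stata or SUDAAN.

```python
from collections import defaultdict
from math import sqrt


def linearized_mean_se(records):
    """Weighted mean and linearized standard error.

    records: iterable of (stratum, cluster, weight, y) tuples.
    The layout is hypothetical and chosen only for this illustration.
    """
    records = list(records)
    total_weight = sum(w for _, _, w, _ in records)
    mean = sum(w * y for _, _, w, y in records) / total_weight

    # Linearized score of each record, summed to cluster totals within strata.
    cluster_totals = defaultdict(lambda: defaultdict(float))
    for stratum, cluster, w, y in records:
        cluster_totals[stratum][cluster] += w * (y - mean) / total_weight

    variance = 0.0
    for clusters in cluster_totals.values():
        n_h = len(clusters)
        if n_h < 2:
            raise ValueError("each stratum needs at least two clusters")
        z_bar = sum(clusters.values()) / n_h
        # Between-cluster variation within the stratum.
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in clusters.values())
    return mean, sqrt(variance)
```

With one stratum and two single-record clusters of equal weight, this reduces to the familiar simple random sample formula for the standard error of a mean.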

Survey sample designs

• The NHIS and CPS change sample designs over time.

– Non-self-representing PSUs are shuffled, so some are not included in both designs.

– Self-representing PSUs (MSAs) can also change (boundaries annex/lose counties).

– Pooling data between two sample designs is a major challenge.

• Data users often like to pool data to get larger samples of people with rare characteristics (e.g., those with SSI income).

• When working with data from years with two sample designs it’s best to average the estimates and the standard errors from single years.

– Also, some surveys (e.g., NHIS) release sample design information that can be used for Taylor Series estimates, whereas others do not (e.g., CPS).
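
The averaging advice for years that span two sample designs can be sketched as follows. The function name `pool_years` is hypothetical. The first two return values follow the slide literally (simple averages of the single-year estimates and standard errors). The third is an alternative shown only for comparison, and it rests on the added assumption that the single-year estimates are independent, which the slides do not state.

```python
from math import sqrt


def pool_years(estimates, std_errors):
    """Combine single-year estimates from years with different sample designs."""
    k = len(estimates)
    pooled_estimate = sum(estimates) / k
    # As described on the slide: average the single-year standard errors.
    averaged_se = sum(std_errors) / k
    # Assumption for comparison only: years are independent samples.
    independent_se = sqrt(sum(se ** 2 for se in std_errors)) / k
    return pooled_estimate, averaged_se, independent_se
```

The averaged standard error is never smaller than the independence-based one, so it is the more cautious of the two choices.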

Nonresponse

• There are several types of survey nonresponse.

– Unit, person, supplement and item.

• Nonresponse is also handled differently by the various surveys and can cause problems for data users.

• Unit nonresponse is generally handled by adjusting survey weights of responders to account for nonresponders.

– Heterogeneity among the weights makes it important to use appropriate statistical routines for variance estimation.
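
One common textbook way to adjust responder weights is a weighting-class adjustment, sketched below. This is an illustration of the general idea only and is not claimed to be the procedure used by the census, CPS or NHIS. The function name and the `cell`, `weight` and `responded` keys are hypothetical.

```python
from collections import defaultdict


def adjust_for_unit_nonresponse(units):
    """Inflate responder weights so each cell keeps its full weighted total.

    units: list of dicts with hypothetical keys 'cell', 'weight', 'responded'.
    Returns only the responders, with adjusted weights.
    """
    total = defaultdict(float)
    responding = defaultdict(float)
    for u in units:
        total[u["cell"]] += u["weight"]
        if u["responded"]:
            responding[u["cell"]] += u["weight"]

    adjusted = []
    for u in units:
        if u["responded"]:
            factor = total[u["cell"]] / responding[u["cell"]]
            adjusted.append({**u, "weight": u["weight"] * factor})
    return adjusted
```

Because the adjustment factor differs by cell, the resulting weights become more variable, which is one reason design-based variance routines matter.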

Person and supplement nonresponse

• Person and supplement nonresponse can be more difficult to deal with.

– NHIS, for example, contains information on a household, but if they refused the supplement there is no supplement data for them.

• This makes the data structure uneven.

– The CPS, on the other hand, fully imputes the missing ASEC (i.e., March) supplement nonresponders (currently about 10% of the cases).

• This evens out the data structure, making it easier for data users to work with.

• Although this can be problematic as the CPS full supplement imputation process can lead to rather large biases in estimates (e.g., health insurance coverage).

– We are investigating ways of evening out portions of the NHIS data structure to make it easier to work with and disseminate.

Item nonresponse

• Item nonresponse is also a challenge.

– Decennial census and CPS are fully imputed for item nonresponse.

– Makes it much easier for data users.

• Although it can simplify things too much.

– The NHIS, on the other hand, does not impute missing values.

• This is a major problem for people who want to work with the income series on the NHIS (recently they released separate imputed income files).

• We are experimenting with imputing the income information on the NHIS files using CPS income data.
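
The slides do not say which imputation method the census, the CPS or the NHIS experiments use. Purely to illustrate what filling in item nonresponse looks like, here is a minimal within-cell hot deck: a missing value is replaced by a randomly chosen reported value from the same cell. The function name, the keys and the `imputed` flag are hypothetical.

```python
import random


def hot_deck_impute(records, cell_key, value_key, seed=0):
    """Fill missing values (None) with a random donor value from the same cell.

    Raises KeyError if a cell has missing values but no donors.
    """
    rng = random.Random(seed)
    donors = {}
    for r in records:
        if r[value_key] is not None:
            donors.setdefault(r[cell_key], []).append(r[value_key])

    completed = []
    for r in records:
        imputed = r[value_key] is None
        value = rng.choice(donors[r[cell_key]]) if imputed else r[value_key]
        # Keep a flag so users can tell reported values from imputed ones.
        completed.append({**r, value_key: value, "imputed": imputed})
    return completed
```

Keeping an imputation flag alongside the filled value lets users drop or separately analyze imputed cases when full imputation would simplify things too much.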

Question wording and measurement

• Question wording changes take many forms.

– Change in the basic question

– The inclusion of examples

– The placement of the question in the survey

– Changes in the type of response allowed (e.g., can income amounts be reported in smaller than yearly intervals?)

• Providing facsimiles of question wording, and highlighting wording changes in variable documentation, allows users to decide whether comparability is possible for their analyses.

Changes to question universes

• Changes in universe definitions affect multiple variables (e.g., the age limit for “adults” answering work and income questions).

• Other changes affect single variables.

• Providing universe definitions in variable documentation tells users how to restrict their data to achieve comparability.

• Testing variable universes reveals when data cleaning is needed before the data are released to users.
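
A universe test of the kind described above can be sketched as a simple consistency check: records outside the universe should carry a "not in universe" code, and records inside it should not. The function name, the age limit of 15 and the NIU code of -1 in the usage comment are hypothetical values chosen for illustration.

```python
def universe_violations(records, in_universe, variable, niu_code):
    """Return indexes of records whose value disagrees with the stated universe.

    in_universe: function taking a record and returning True or False.
    niu_code: the code that marks "not in universe" for this variable.
    """
    violations = []
    for index, record in enumerate(records):
        has_value = record[variable] != niu_code
        if in_universe(record) != has_value:
            violations.append(index)
    return violations


# Hypothetical usage: income asked only of persons age 15 and older.
# universe_violations(records, lambda r: r["age"] >= 15, "income", -1)
```

Any index returned points to a record that needs cleaning, or to a documented universe that does not match the data.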

Changes in response categories

• Many data harmonization projects lose detail by adopting a “least common denominator” approach.

• IPUMS projects adopt the joint goals of:

– Losing no information

– Providing comparability over time

• IPUMS projects achieve these goals through composite coding schemes.

– The first digit(s) provide detail available across all years

– Trailing digits provide additional detail available in only limited years
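
A composite code can be collapsed to its comparable part by dropping the trailing digits. The sketch below uses a made-up two-digit marital status scheme. The codes and labels are hypothetical and are not taken from any IPUMS codebook.

```python
def general_code(composite_code, detail_digits=1):
    """Drop trailing detail digits, keeping the part comparable across all years."""
    return composite_code // 10 ** detail_digits


# Hypothetical composite codes: first digit is general, second digit adds detail.
LABELS = {
    10: "Married (no further detail available)",
    11: "Married, spouse present",
    12: "Married, spouse absent",
    20: "Never married",
}

# Codes 10, 11 and 12 all collapse to the same general category.
comparable = {code: general_code(code) for code in LABELS}
```

Users who need comparability across all years analyze the general code, while users working only with detailed years keep the full composite code.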

Other strategies for handling changes in response categories

• Creating “bridging” variables is another means of achieving comparability over time.

• When responses are given in intervalled form in some years, and in full detail in other years, IPUMS projects provide both detailed and intervalled variables.

• Recoding data using a common standard (e.g., the 1950 occupation and industry codes), together with providing the original, unrecoded data, is a third strategy employed by IPUMS projects.

• When response changes are too great to achieve comparability (e.g., the shift from 4 to 5 categories for health status in NHIS), the data are provided in separate variables and the issue is discussed in the documentation.
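
For the case where some years report full detail and others report intervals, an intervalled version can be derived from the detailed values so that both forms exist for every year with detail. The sketch below is one way to do this. The cut points are hypothetical and the function name is an assumption for this example.

```python
from bisect import bisect_right

# Hypothetical lower bounds of the second and later intervals, in dollars.
CUT_POINTS = [5000, 10000, 25000]


def to_interval(amount, cut_points=CUT_POINTS):
    """Return the interval index for a detailed amount.

    0 means below the first cut point, and len(cut_points) means the top interval.
    """
    return bisect_right(cut_points, amount)
```

The detailed variable is kept unchanged alongside the derived one, so no information is lost.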

Changes in data processing

• Variable documentation also helps users by pointing out subtle changes in data processing by the agency releasing the non-harmonized public use data.

Conclusions

• The goal of simplifying data dissemination and harmonization is difficult, and demographic survey design and processing play a major role in making it difficult.

– Sample design

– Survey nonresponse

– Survey questions and items

– Survey processing/editing
