The Minnesota Data Harmonization Projects Bill & Melinda Gates Foundation Seattle, Washington May 21, 2014 Elizabeth Boyle, Miriam King, Matthew Sobek Minnesota Population Center, University of Minnesota [email protected]


Page 1: Minnesota Data Harmonization Projects

The Minnesota Data Harmonization Projects

Bill & Melinda Gates Foundation

Seattle, Washington
May 21, 2014

Elizabeth Boyle, Miriam King, Matthew Sobek
Minnesota Population Center, University of Minnesota

[email protected]

Page 2: Minnesota Data Harmonization Projects
Page 3: Minnesota Data Harmonization Projects

Integrated Public Use Microdata Series

Page 4: Minnesota Data Harmonization Projects

We build data infrastructure for the research community and specialize in data harmonization.

World’s largest collection of individual population and health data, across 9 projects.

50,000 registered users from over 100 countries.

Free

Minnesota Population Center

Page 5: Minnesota Data Harmonization Projects

MPC Data Dissemination, 1993-2012

Gigabytes per week

Page 6: Minnesota Data Harmonization Projects

MPC Data Projects

Page 7: Minnesota Data Harmonization Projects

The Problem

1. Combining data from multiple sources is time consuming
• Discovery
• Data management

2. It’s error prone
• Recoding data
• Overlooked documentation

3. Hard to replicate results

4. Discourages comparative research

Page 8: Minnesota Data Harmonization Projects

Outline

Harmonization methods

Dissemination system

International projects
• Integrated DHS
• Terra Populus
• IPUMS-International

Page 9: Minnesota Data Harmonization Projects

Terminology

Harmonization:

Combining datasets collected at different times or places into a single, consistent data series.

“Integration”

Metadata:

Data about data. Documentation in broadest sense.

Page 10: Minnesota Data Harmonization Projects

Relation to head

Marital status Education Occupation

Microdata

Page 11: Minnesota Data Harmonization Projects

Summary Data

Page 12: Minnesota Data Harmonization Projects

Harmonization Methods

Metadata

Data

Dissemination

Page 13: Minnesota Data Harmonization Projects

Systematize Metadata (record layout file, pdf)

Page 14: Minnesota Data Harmonization Projects

MPC Data Dictionary

Variable  Start  Width  Value    Label (variable or value)             Frequency  Universe
SMOKE100  57     1               Ever smoked 100 cigarettes                       All persons
                        1        Yes                                   54,189
                        2        No                                    59,501
                        7        Don't know/Not sure                   205
                        9        Refused                               39
SMOKENOW  58     1               Smoke cigarettes now                             Persons who ever smoked
                        1        Yes                                   25,644
                        2        No                                    28,535
                        7        Don't know/Not sure                   0
                        9        Refused                               10
                        Blank    [no label]                            59,745
SMOKE30   59     2               Number of days smoked in the last 30             Persons who currently smoke
                        1 to 30  Number of days                        25,290
                        77       Don't know/Not sure                   293
                        88       None                                  49
                        99       Refused                               12
                        Blank    [no label]                            88,290
SMOKENUM  61     2               Number of cigarettes smoked per day              Persons who currently smoke
                        0 to 76  Number of cigarettes                  22,292
                        77       Don't know/Not sure                   248
                        99       Refused                               43
                        Blank    [no label]                            91,351
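A record layout like the one in this data dictionary is enough to read a raw file mechanically. The sketch below is only an illustration, assuming that Start is a 1-based column position and Width a character count. The sample record and the helper name `parse_record` are hypothetical and not part of any MPC tool.

```python
# Layout copied from the data dictionary: variable -> (start, width).
# Assumption: "Start" is a 1-based column position in a fixed-width record.
LAYOUT = {
    "SMOKE100": (57, 1),
    "SMOKENOW": (58, 1),
    "SMOKE30": (59, 2),
    "SMOKENUM": (61, 2),
}


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width record into a dict of raw string values."""
    values = {}
    for name, (start, width) in layout.items():
        raw = line[start - 1:start - 1 + width]
        # A blank field is returned as None (e.g. person outside the universe).
        values[name] = raw.strip() or None
    return values


# Hypothetical record: 56 filler columns, then the four smoking fields.
sample = " " * 56 + "1" + "1" + "30" + "20"
print(parse_record(sample))
```

Keeping the layout as data rather than as hand-written slicing code is what lets the same reader serve every sample once its dictionary has been systematized.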

Page 15: Minnesota Data Harmonization Projects

WaterAccess

Convert Questionnaires to Metadata (Mexico 2000)

Page 16: Minnesota Data Harmonization Projects

5. Number of Rooms

How many rooms are used for sleeping without counting hallways? _____ Write the number

Without counting the hallways or bathrooms how many total rooms are in this dwelling? Count the kitchen

_____Write the number

6. Access to water

Read all of the options until you get an affirmative answer. Circle only one answer

1 Running water inside the dwelling
2 Running water outside the dwelling but on the land
3 Running water from a public faucet or hydrant
4 Running water that is carried from another dwelling
5 Tanked in by truck
6 Water from a well, river, lake, stream or other

Answers 3, 4, 5, 6 continue with number 8

7. Water supply

How many days of the week is water available? Circle only one answer

1 Daily
2 Every third day
3 Twice a week
4 Once a week
5 Occasionally

Metadata: Questionnaire Text

Page 17: Minnesota Data Harmonization Projects

Water access

Bedrooms

Rooms

XML-Tagged Questionnaire Text
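One way to picture the tagging step: each block of questionnaire text is wrapped in an element that names the variable it documents, so the text can later be retrieved for that variable. The sketch below uses Python's standard `xml.etree.ElementTree`. The element and attribute names (`questionnaire`, `question`, `var`) and the variable names are invented for illustration and are not the MPC's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagging of the Mexico 2000 questionnaire text.
# Tag, attribute, and variable names are assumptions for illustration only.
TAGGED = """
<questionnaire sample="Mexico 2000">
  <question var="BEDROOMS">How many rooms are used for sleeping without counting hallways?</question>
  <question var="ROOMS">Without counting the hallways or bathrooms how many total rooms are in this dwelling? Count the kitchen</question>
  <question var="WATER">Access to water. Read all of the options until you get an affirmative answer.</question>
</questionnaire>
"""


def text_for_variable(xml_text, var):
    """Return every block of questionnaire text tagged with the variable."""
    root = ET.fromstring(xml_text.strip())
    return [q.text.strip() for q in root.iter("question") if q.get("var") == var]


print(text_for_variable(TAGGED, "ROOMS"))
```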

Page 18: Minnesota Data Harmonization Projects

Data: Variable Harmonization

Marital Status: IPUMS-International

Bangladesh 2011

1 = Unmarried

2 = Married

3 = Widowed

4 = Divorced/separated

Mexico 1970

1 = Married, civil & relig

2 = Married, civil

3 = Married, religious

4 = Consensual union

5 = Widowed

6 = Divorced

7 = Separated

8 = Single

Kenya 1999

1 = Never married

2 = Monogamous

3 = Polygamous

4 = Widowed

5 = Divorced

6 = Separated

Page 19: Minnesota Data Harmonization Projects

Translation Table

Input

Bangladesh 2011

1 = Unmarried

2 = Married

3 = Widowed

4 = Divrc or separated

Mexico 1970

1 = Married, civil & relig

2 = Married, civil

3 = Married, religious

4 = Consensual union

5 = Widowed

6 = Divorced

7 = Separated

8 = Single

Kenya 1999

1 = Never married

2 = Monogamous

3 = Polygamous

4 = Widowed

5 = Divorced

6 = Separated

Page 20: Minnesota Data Harmonization Projects

Translation Table

Harmonized                         Input
Code   Label                       Mexico 1970                 Bangladesh 2011         Kenya 1999
100    Single                      8 = Single                  1 = Unmarried           1 = Never married
200    Married or in union                                     2 = Married
210      Married, formally
211        Civil                   2 = Married, civil
212        Religious               3 = Married, religious
213        Civil and religious     1 = Married, civil & relig
214        Monogamous                                                                  2 = Monogamous
215        Polygamous                                                                  3 = Polygamous
220      Consensual union          4 = Consensual union
300    Divorced or separated                                   4 = Divrc or separated
310      Separated                 7 = Separated                                       6 = Separated
320      Divorced                  6 = Divorced                                        5 = Divorced
400    Widowed                     5 = Widowed                 3 = Widowed             4 = Widowed
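The translation table can be read as a lookup from (sample, input code) to a three-digit harmonized code. The sketch below is a minimal illustration: the pairings are inferred by matching the input labels to the harmonized labels on the slide, and the reading that the first digit carries the general category while later digits add detail is an assumption. The function names are hypothetical.

```python
# (sample, input code) -> harmonized code.
# Pairings inferred from matching labels on the slide; illustrative only.
TRANSLATION = {
    ("Bangladesh 2011", 1): 100,  # Unmarried -> Single
    ("Bangladesh 2011", 2): 200,  # Married -> Married or in union
    ("Bangladesh 2011", 3): 400,  # Widowed
    ("Bangladesh 2011", 4): 300,  # Divorced or separated
    ("Mexico 1970", 1): 213,      # Married, civil and religious
    ("Mexico 1970", 2): 211,      # Married, civil
    ("Mexico 1970", 3): 212,      # Married, religious
    ("Mexico 1970", 4): 220,      # Consensual union
    ("Mexico 1970", 5): 400,      # Widowed
    ("Mexico 1970", 6): 320,      # Divorced
    ("Mexico 1970", 7): 310,      # Separated
    ("Mexico 1970", 8): 100,      # Single
    ("Kenya 1999", 1): 100,       # Never married -> Single
    ("Kenya 1999", 2): 214,       # Monogamous
    ("Kenya 1999", 3): 215,       # Polygamous
    ("Kenya 1999", 4): 400,       # Widowed
    ("Kenya 1999", 5): 320,       # Divorced
    ("Kenya 1999", 6): 310,       # Separated
}


def harmonize(sample, code):
    """Translate a sample-specific code into the harmonized code."""
    return TRANSLATION[(sample, code)]


def general(harmonized_code):
    """Collapse a detailed code to its general (first-digit) category."""
    return harmonized_code // 100 * 100


print(harmonize("Kenya 1999", 3), general(harmonize("Kenya 1999", 3)))
```

Under this reading, a researcher comparing all three samples would work with the general codes, where Bangladesh's combined "divorced or separated" category lines up with the separate categories from Mexico and Kenya, while someone studying Kenya alone keeps the monogamous/polygamous detail.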

Page 21: Minnesota Data Harmonization Projects


Page 22: Minnesota Data Harmonization Projects

Data Dissemination System

Page 23: Minnesota Data Harmonization Projects

Data Dissemination System

Page 24: Minnesota Data Harmonization Projects

Variables Page

Page 25: Minnesota Data Harmonization Projects

Variables Page

238 censuses

Page 26: Minnesota Data Harmonization Projects

Sample Filtering

Page 27: Minnesota Data Harmonization Projects

Variables Page – Filtered

Page 28: Minnesota Data Harmonization Projects

Variable Page: Marital Status

Page 29: Minnesota Data Harmonization Projects

Variable Codes (Marital status)

Page 30: Minnesota Data Harmonization Projects

Variable Codes (Marital status)

Page 31: Minnesota Data Harmonization Projects

Variable Codes (Marital status)

Page 32: Minnesota Data Harmonization Projects

Variable Page: Marital Status

Page 33: Minnesota Data Harmonization Projects

Variable Comparability Discussion (Marital status)

Page 34: Minnesota Data Harmonization Projects

Variable Page: Documentation

Page 35: Minnesota Data Harmonization Projects

Questionnaire Text

Page 36: Minnesota Data Harmonization Projects

Questionnaire Text (Marital status, Cambodia)

Page 37: Minnesota Data Harmonization Projects

Variables Page

Page 38: Minnesota Data Harmonization Projects

Extract Summary

Page 39: Minnesota Data Harmonization Projects

Case Selection

Page 40: Minnesota Data Harmonization Projects

Age of spouse

Employment status of father

Occupation of father

Attached Characteristics

Page 41: Minnesota Data Harmonization Projects

Extract Summary

Page 42: Minnesota Data Harmonization Projects

Download or Revise Extract

Page 43: Minnesota Data Harmonization Projects

On-line Analysis

Page 44: Minnesota Data Harmonization Projects

The International Projects

Page 45: Minnesota Data Harmonization Projects

Integrated DHS

Page 46: Minnesota Data Harmonization Projects

Foremost source of health information for the developing world

Funded by USAID

Since the 1980s: over 300 surveys in 90 countries

Topics: fertility, nutrition, HIV, malaria, maternal and child health, etc.

Demographic and Health Surveys

Page 47: Minnesota Data Harmonization Projects

5-year NIH grant (end of year 2)

Focus on Africa, with India

Partnership with ICF-International and USAID

IDHS Project

Page 48: Minnesota Data Harmonization Projects

Motivation: DHS is incredibly valuable, but it’s hard to capitalize on its full potential.

Problem:

Data discovery

Dispersed documentation

Data management

Variable changes over time

Not unique to DHS: endemic to any survey that’s persisted over decades.

Why an Integrated DHS?

Page 49: Minnesota Data Harmonization Projects

DHS Research Process Example: Find data on female genital cutting

Survey Search Tool

Page 50: Minnesota Data Harmonization Projects
Page 51: Minnesota Data Harmonization Projects
Page 52: Minnesota Data Harmonization Projects

Recode notes

Data dictionary

Just the woman file – for one survey. 61 to go.

Still need Report (377 page pdf)

• Contains questionnaire and sample design information

• Errata file

Page 53: Minnesota Data Harmonization Projects

DHS “Recode Variables” make it more harmonized than most surveys
• Consistent variable names
• Each DHS phase has a shared model questionnaire

But:

6 phases over 25+ years

Country control over final wording of surveys

Country-specific variables

The recode variables can be a two-edged sword

At least the DHS variables are already harmonized, right?

Page 54: Minnesota Data Harmonization Projects

Harmonization: Religion

Source variable V130 in each sample: Ghana 1993, Ghana 2008, India 1992, India 2005

Harmonized code and label, followed by the input codes mapped to it:

100 Muslim/Islam: 4 = Muslim; 7 = Moslem; 1 = Muslim; 2 = Muslim
200 Christian: 2 = Christian; 3 = Christian
201 Catholic: 2 = Catholic; 1 = Catholic
202 Protestant: 1 = Protestant
203 Anglican: 2 = Anglican
204 Methodist: 3 = Methodist
205 Presbyterian: 4 = Presbyterian
206 Pentecostal: 5 = Pentecostal
208 Other Christian: 3 = Other Christian; 6 = Other Christian
300 Other
301 Hindu: 0 = Hindu; 1 = Hindu
302 Sikh: 3 = Sikh; 4 = Sikh
303 Buddhist: 5 = Buddhist
304 Jain: 6 = Jain
305 Jewish: 7 = Jewish
306 Parsi/Zoroastrian: 8 = Parsi/Zoroastrian
307 Donyi-Polo: 10 = Donyi polo
400 Traditional/spiritual: 8 = Trad/spiritualist
401 Traditional: 5 = Traditional
402 Spiritual
403 Animist
500 No religion: 0 = No religion; 9 = No religion; 9 = No religion
600 Other: 96 = Other; 4 = Other; 96 = Other

Page 55: Minnesota Data Harmonization Projects

Sample         Variable  Label
Egypt 1995     S802      Ever circumcised
Egypt 2005     S801      Respondent circumcised
Egypt 2008     G102      Respondent circumcised
Ethiopia 2000  FG103     Circumcised
Ethiopia 2005  FG103     Circumcised
Ghana 2003     S821      Circumcised
Kenya 1998     S1002     Respondent circumcised
Kenya 2003     S821      Circumcised
Kenya 2008     G102      Respondent circumcised
Mali 1995      S551      Circumcised
Mali 2001      FG103     Circumcised?
Mali 2006      G102      Respondent circumcised
Nigeria 1999   S521      Type of circumcision
Nigeria 2003   FG103     Circumcised
Nigeria 2008   G102      Respondent circumcised

Harmonization: Female Circumcision

Ever Circumcised
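The same question sits under eight different variable names across these fifteen samples, so a first harmonization step is a crosswalk from sample to source variable. The sketch below copies the names from the slide; the target name `EVERCIRC` and the function `harmonize_record` are hypothetical, chosen only for illustration.

```python
# (country, year) -> source variable name, copied from the slide.
FGC_SOURCE_VARS = {
    ("Egypt", 1995): "S802",
    ("Egypt", 2005): "S801",
    ("Egypt", 2008): "G102",
    ("Ethiopia", 2000): "FG103",
    ("Ethiopia", 2005): "FG103",
    ("Ghana", 2003): "S821",
    ("Kenya", 1998): "S1002",
    ("Kenya", 2003): "S821",
    ("Kenya", 2008): "G102",
    ("Mali", 1995): "S551",
    ("Mali", 2001): "FG103",
    ("Mali", 2006): "G102",
    ("Nigeria", 1999): "S521",
    ("Nigeria", 2003): "FG103",
    ("Nigeria", 2008): "G102",
}


def harmonize_record(record, country, year, target="EVERCIRC"):
    """Copy the sample-specific variable into one common (hypothetical) name."""
    source = FGC_SOURCE_VARS[(country, year)]
    out = dict(record)  # leave the caller's record untouched
    out[target] = out.pop(source)
    return out


print(sorted(set(FGC_SOURCE_VARS.values())))
print(harmonize_record({"S821": 1, "V012": 30}, "Kenya", 2003))
```

Renaming is only part of the work: a sample such as Nigeria 1999, where the source variable is labeled "Type of circumcision", would presumably also need its values recoded before it is comparable.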

Page 56: Minnesota Data Harmonization Projects

Timeline: 2014 (current)

9 countries, 39 samples

Much of woman files

Women of childbearing age as unit of analysis

Page 57: Minnesota Data Harmonization Projects

Timeline: 2015

15 countries, 69 samples

Complete the woman files

Children & birth files

Page 58: Minnesota Data Harmonization Projects

Timeline: 2017

21 countries, 94 samples

Men and couples files

Page 59: Minnesota Data Harmonization Projects

Timeline: Next grant

41 African countries, 130+ samples

11 Asian countries, 32+ samples

Page 60: Minnesota Data Harmonization Projects

Beta

Page 61: Minnesota Data Harmonization Projects

Lower barriers to conducting research on population and the environment.

Motivation:

The data from different domains have incompatible formats, and few researchers have the skills to combine them

Terra Populus Goal

Page 62: Minnesota Data Harmonization Projects

5-year NSF grant

At mid-point: year 3

TerraPop

Page 63: Minnesota Data Harmonization Projects

6 countries:

Argentina

Brazil

Malawi

Spain

United States

Vietnam

Population Microdata

Page 64: Minnesota Data Harmonization Projects

Tabulations of census data for administrative units

Area-level Data

Page 65: Minnesota Data Harmonization Projects

Land cover from satellite images (Global Land Cover 2000)

Agricultural use from satellites and government records (Global Landscapes Initiative)

Climate from weather stations (WorldClim)

Environmental Data: Rasters (Grid Cells)

Page 66: Minnesota Data Harmonization Projects

Microdata

Area-level data

Rasters

Mix and match variables originating in any of the data structures

Obtain output in the data structure most useful to you

Location-Based Integration

Page 67: Minnesota Data Harmonization Projects

Individuals and households with their environmental and social context

Microdata

Area-level data

Rasters

Location-Based Integration

Page 68: Minnesota Data Harmonization Projects

Summarized environmental and population characteristics for administrative districts

Microdata

Area-level data

Rasters

County ID
G17003100001
G17003100002
G17003100003
G17003100004
G17003100005
G17003100006
G17003100007

County ID     Mean Ann. Temp.  Max. Ann. Precip.
G17003100001  21.2             768
G17003100002  23.4             589
G17003100003  24.3             867
G17003100004  21.5             943
G17003100005  24.1             867
G17003100006  24.4             697
G17003100007  25.6             701

County ID     Mean Ann. Temp.  Max. Ann. Precip.  Rent, Rural  Rent, Urban  Own, Rural  Own, Urban
G17003100001  21.2             768                3129         1063         637         365
G17003100002  23.4             589                2949         1075         1469        717
G17003100003  24.3             867                3418         1589         1108        617
G17003100004  21.5             943                1882         425          202         142
G17003100005  24.1             867                2416         572          426         197
G17003100006  24.4             697                2560         934          950         563
G17003100007  25.6             701                2126         653          321         215

Location-Based Integration
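The county tables on this slide grow by attaching columns to a shared County ID, which amounts to a keyed join. The sketch below uses the first three counties from the slide. The split into two separate input tables and the function name `join_by_county` are assumptions made for illustration.

```python
# County ID -> (mean annual temperature, maximum annual precipitation).
ENVIRONMENT = {
    "G17003100001": (21.2, 768),
    "G17003100002": (23.4, 589),
    "G17003100003": (24.3, 867),
}

# County ID -> (rent rural, rent urban, own rural, own urban).
TENURE = {
    "G17003100001": (3129, 1063, 637, 365),
    "G17003100002": (2949, 1075, 1469, 717),
    "G17003100003": (3418, 1589, 1108, 617),
}


def join_by_county(env, tenure):
    """Combine two area-level tables on their shared County ID."""
    rows = []
    # Only counties present in both tables are kept (an inner join).
    for county_id in sorted(env.keys() & tenure.keys()):
        rows.append((county_id, *env[county_id], *tenure[county_id]))
    return rows


for row in join_by_county(ENVIRONMENT, TENURE):
    print(row)
```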

Page 69: Minnesota Data Harmonization Projects

Rasters of population and environment data

Microdata

Area-level data

Rasters

Location-Based Integration

Page 70: Minnesota Data Harmonization Projects

Rasterization of Area-Level Data
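In its simplest form, rasterizing area-level data means giving every grid cell the value of the administrative unit it falls in. The sketch below assumes a grid of unit IDs is already available. The grid, the values, and the function name `rasterize` are hypothetical, and the slide does not say which rasterization method the project actually uses (counts, for instance, are often spread across cells rather than repeated).

```python
def rasterize(zone_grid, area_values, nodata=None):
    """Build a value grid by looking up each cell's administrative unit."""
    return [[area_values.get(zone, nodata) for zone in row] for row in zone_grid]


# Hypothetical 2 x 3 grid of administrative unit IDs and one value per unit.
zones = [["A", "A", "B"],
         ["A", "B", "C"]]
literacy_rate = {"A": 0.82, "B": 0.67}

print(rasterize(zones, literacy_rate))
```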

Page 71: Minnesota Data Harmonization Projects

Area-Level Summary of Raster Data
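Going the other way, a raster can be summarized to administrative units by aggregating the cells that fall inside each unit, often called a zonal statistic. The sketch below computes a zonal mean in plain Python. The grids and the function name `zonal_mean` are hypothetical, and other summaries (maximum, sum) would follow the same pattern.

```python
from collections import defaultdict


def zonal_mean(zone_grid, value_grid):
    """Average the raster values of all cells belonging to each zone."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for zone_row, value_row in zip(zone_grid, value_grid):
        for zone, value in zip(zone_row, value_row):
            if zone is None or value is None:
                continue  # skip cells outside any unit or without data
            totals[zone] += value
            counts[zone] += 1
    return {zone: totals[zone] / counts[zone] for zone in totals}


# Hypothetical grids: unit ID per cell and a temperature value per cell.
zones = [["A", "A", "B"],
         ["A", "B", "B"]]
temperature = [[20.0, 22.0, 30.0],
               [24.0, 28.0, 32.0]]

print(zonal_mean(zones, temperature))
```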

Page 72: Minnesota Data Harmonization Projects

Linkages across data formats rely on administrative unit boundaries

Particular needs
• Lower level boundaries
• Historical boundaries

Boundaries are Key

Page 73: Minnesota Data Harmonization Projects

Geographic Harmonization

Page 74: Minnesota Data Harmonization Projects

Geographic Harmonization

Page 75: Minnesota Data Harmonization Projects

Geographic Harmonization

Page 76: Minnesota Data Harmonization Projects

Web interface will change significantly in fall 2014

Fast microdata tabulator needed

Beta Version

Page 77: Minnesota Data Harmonization Projects

IPUMS-International

Page 78: Minnesota Data Harmonization Projects

IPUMS-International

Census microdata from around the world

Funded by NSF and NIH

Motivation:

Provide data access

Preservation

Page 79: Minnesota Data Harmonization Projects

Khartoum, CBS-Sudan

Page 80: Minnesota Data Harmonization Projects

Dhaka, Bangladesh Bureau of Statistics

Page 81: Minnesota Data Harmonization Projects

IPUMS-International

Participating

Disseminating

Page 82: Minnesota Data Harmonization Projects

IPUMS Censuses Per Country

Page 83: Minnesota Data Harmonization Projects

IPUMS Censuses Per Country

Page 84: Minnesota Data Harmonization Projects

Variables Included in Extracts

Page 85: Minnesota Data Harmonization Projects

Top Institutional Users

Rank  Country    Institution
1     USA        University of Minnesota
2     USA        Harvard University
3     USA        University of Michigan Ann Arbor
4     USA        Columbia University
5     Spain      Autonomous University Barcelona
6     USA        Arizona State University
7     Singapore  National University of Singapore
8     IADB       Inter American Development Bank
9     WB         World Bank Group
10    USA        University of California Berkeley
11    USA        Vanderbilt University
12    USA        University of Chicago
13    Australia  University of Queensland Australia
14    USA        University of California Los Angeles
15    USA        Dartmouth College
16    Brazil     Universidade Federal de Minas Gerais
17    Mexico     El Colegio de México
18    USA        Yale University
19    China      University of Hong Kong
20    USA        University of Washington
21    UK         London School Economics
22    UK         University of Stirling
23    France     Université de Bordeaux 4
24    Austria    University of Vienna
25    Malaysia   National University of Malaysia
26    Austria    Vienna Institute of Demography
27    USA        Pew Research Center
28    Colombia   Universidad del Valle
29    USA        University of Delaware
30    USA        Brown University

Page 86: Minnesota Data Harmonization Projects

Millennium Development Goals

Ratio of literate women to men, 15-24 years old

1990 Census round

Source: Cuesta and Lovatón (2014)

Page 87: Minnesota Data Harmonization Projects

Millennium Development Goals

Source: Cuesta and Lovatón (2014)

Data Source: IPUMS-International, Minnesota Population Center

Census 1993 Census 2005

Colombia: Adolescent Birth Rate

Page 88: Minnesota Data Harmonization Projects

Data acquisition

Outreach: developing countries

Virtual data enclave

IPUMSI Future

Page 89: Minnesota Data Harmonization Projects

Thank you!

[email protected]