Entrusting census microdata and metadata for timely integration and dissemination via the IPUMS-EurAsia and IECM initiatives, 2010-2014
* * *
Robert McCaa, Albert Esteve and Patricia Kelly-Hall
Minnesota Population Center and Centre d’Estudis Demogràfics
[email protected]; [email protected]
www.ipums.org/international
www.iecm-project.org


Outline: Entrusting census microdata and metadata for timely integration and dissemination via the IPUMS-EurAsia and IECM initiatives, 2010-2014

(no. of slides in parentheses)

1. IPUMS-International: “Best practice” (3)
2. The IECM Project: a European Flavor (5)
3. Census output needs: (4)
   a. Form “A”: succinct descriptions of both census and microdata
   b. Metadata: questionnaires, instructions, dictionaries, codebooks as images, .txt, .doc, .xls, .pdf, XML, SDMX, CSPro, IMPS, DDI, etc.
   c. Microdata: to prepare, choose 1 of 4 modalities; entrust as encrypted, executable files (email or fax password)
4. Conclusion (2)

What is IPUMS-International?

“…best practice for a data repository of international statistical data”
--Dennis Trewin, chair, UNECE task force on Statistical Confidentiality & Microdata Access

IPUMS-International:

» Begun in 1999, IPUMS-International is the world’s largest integrated demographic database:
  » 130 integrated, anonymized census samples (44 countries)
  » 279 million person records; 3,000+ approved researchers
» Database is likely to double over the next five years, by the addition of:
  » 2010 round samples of 17 current partners: Austria, Belarus, Canada, France, Greece, Hungary, Israel, Italy, Kyrgyzstan, Netherlands, Portugal, Romania, Slovenia, Spain, Switzerland, UK, USA, etc.
  » Samples for 5 countries currently in development: Belgium, Czech Republic, Ireland, Germany, Turkey
  » Future partners? Albania? Bulgaria? Croatia? Estonia? Finland? Kazakhstan? Latvia? Lithuania? Poland? Russian Federation? Serbia? Slovakia? Ukraine? FYR Macedonia? Others?

IPUMS-International [world map, Mollweide projection]

dark green = integrated and disseminating (44 countries, 130 censuses, 279 million person records)
green = to be integrated (35 countries, 90 censuses, 150 mill.)

Map legend (microdata): Integrated into IPUMS | Entrusted to IPUMS | None entrusted | None inventoried

IPUMS-EurAsia [map; legend (microdata): Integrated into IPUMS | Entrusted to IPUMS | None entrusted | None inventoried]

2010-11: Germany, Indonesia, Ireland, Nepal, Pakistan, Switzerland, Thailand
2012-4: why not yours?

The IPUMS-International team, May 14, 2009, with NSF oversight board

(Not present: computer gurus, some researchers, research assistants, civil service employees, and others who were absent from the National Science Foundation Board meeting)

Steven Ruggles, inventor of IPUMS, Professor of History, and Director of the Minnesota Population Center

Constructing the IPUMS-International integrated metadata and microdata system

» IPUMS-International NEVER disseminates source microdata!
» 5 step process of integration--2+ years invested in integrating metadata and microdata:
  1. *Confirm the integrity and validity of source microdata and metadata
  2. *Draw and anonymize high precision samples
  3. Integrate microdata sample
  4. Integrate metadata
  5. Confirm the integrity and validity of the integrated microdata sample and metadata
» *Steps 1 & 2 conducted by commissioned senior staff
» Original source microdata never disseminated
» Violation of confidentiality: subject to civil fine ($250,000) and/or criminal prosecution

5 step process of integration in the IPUMS system

3. Integrate microdata
• Composite coding scheme to
  1) preserve every significant detail and
  2) harmonize every code
• Example: marital status
  • …
  • 200 = married
  • 210 = married, formal
  • 211 = married, civil
  • 212 = married, religious
  • …
  • 220 = married, informal (consensual)
  • …
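The marital status example suggests a hierarchical code in which leading digits carry the harmonized category and trailing digits carry country-specific detail. A minimal sketch of that reading follows. The codes and labels are copied from the slide; the digit-by-digit interpretation and the helper names `general_code` and `intermediate_code` are assumptions made for illustration, not part of the IPUMS software.

```python
# Codes and labels as listed on the slide (the slide elides the rest of the scheme).
MARITAL_STATUS = {
    200: "married",
    210: "married, formal",
    211: "married, civil",
    212: "married, religious",
    220: "married, informal (consensual)",
}


def general_code(code: int) -> int:
    """Collapse a detailed code to its first digit (assumed: the fully harmonized category)."""
    return (code // 100) * 100


def intermediate_code(code: int) -> int:
    """Collapse a detailed code to its first two digits (assumed: an intermediate level of detail)."""
    return (code // 10) * 10


if __name__ == "__main__":
    for detailed in (211, 212, 220):
        mid = intermediate_code(detailed)
        top = general_code(detailed)
        print(f"{detailed} ({MARITAL_STATUS[detailed]}) -> {mid} -> {top} ({MARITAL_STATUS[top]})")
```

Under this reading, a researcher who needs only comparable categories across countries truncates the code, while one who needs national detail keeps all three digits.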

5 step process of integration in the IPUMS system

4. Integrate metadata (XML): Document every census, sample, variable and code:
• Source documents (pdf) in official language and English
• Dynamic metadata system: compare any combination of countries and samples:
  • wording of any census question and instructions to field workers
  • Characteristics of each census and sample
• Describe each variable: “universe”, definition, comparability, etc.

5 step process of integration in the IPUMS system

5. Confirm integrity and validity of each sample
• Before launch, each sample is scrupulously checked
• Test each integrated variable against non-harmonized
• Each integration decision may be checked by any researcher using integrated vs. non-harmonized
• External evaluation by INDEC-Argentina (commissioned by IPUMS), 4 censuses (1970-2001)
  • Compared each variable, code and metadata against original source data and documentation
  • Tens of thousands of words, codes, and frequencies tested: only a handful of errors, misinterpretations or misunderstandings.
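One way to picture the test of an integrated variable against its non-harmonized source is a frequency comparison: every source code should map to a harmonized code, and no records should be lost in the recode. The sketch below is illustrative only. The function name `check_harmonization`, the source codes 1, 2, 3 and 9, and the recode table are hypothetical and are not taken from any actual census or from the IPUMS checking procedures.

```python
from collections import Counter


def check_harmonization(source_values, recode):
    """Return (unmapped source codes, frequency table of harmonized codes).

    source_values: iterable of codes as they appear in the source microdata.
    recode: dict mapping each source code to its harmonized code.
    """
    source_freq = Counter(source_values)
    # Source codes with no harmonized equivalent point to a gap in the recode table.
    unmapped = sorted(code for code in source_freq if code not in recode)
    harmonized_freq = Counter()
    for code, count in source_freq.items():
        if code in recode:
            harmonized_freq[recode[code]] += count
    return unmapped, harmonized_freq


if __name__ == "__main__":
    # Hypothetical source codes mapped to the marital status codes shown earlier.
    recode = {1: 211, 2: 212, 3: 220}
    source = [1, 1, 2, 3, 3, 3, 9]
    unmapped, harmonized = check_harmonization(source, recode)
    print("unmapped source codes:", unmapped)
    print("harmonized frequencies:", dict(harmonized))
    print("records accounted for:", sum(harmonized.values()), "of", len(source))
```

Any unmapped code, or any shortfall between the harmonized total and the source total, flags an integration decision that needs review.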

The IECM project
Integrated European Census Microdata

www.iecm-project.org

PROJECT OVERVIEW | COORDINATION | HARMONIZATION | DISSEMINATION

Disseminating: Austria, Belarus, France, Greece, Hungary, Italy, Netherlands, Portugal, Romania, Spain, Slovenia, United Kingdom

Harmonizing: Czech Republic, Germany, Ireland, Switzerland (next release), Turkey

Negotiating: Belgium, Bulgaria, Latvia, Poland, Russia, Ukraine

Contacted: Finland, Iceland, Lithuania, Moldova, Norway, Slovak Republic

Variables Included in Extracts [chart]

Under-represented: geography, migration, ethnicity

Harmonization increases usability and accessibility

Users statistics July – Dec 2008

Samples extracted:
634 France
537 Greece
441 Spain
408 Austria
404 Hungary
340 Portugal
185 United Kingdom
179 Netherlands
85 Belarus

Extracts by user’s country of residence:
164 Spain
105 Italy
102 France
90 Germany
81 United Kingdom
45 Greece
37 Netherlands
21 Belgium
18 Czech Republic
17 Denmark
17 Switzerland
16 Austria
12 Ireland
6 Romania
6 Portugal
2 Poland


Integrated European Census Microdata

Coordination Harmonization Dissemination

Meetings:

Barcelona 2005

Paris 2006

Lisbon 2007

Barcelona 2008

Integrated Documentation

Intra-European classifications

Mirror site

Additional documentation

Data Browser /Online Tabulator

The IECM project: addendum. New tools for data analysis
Prototype of on-line tabulator of integrated variables


How are we currently disseminating the IECM census microdata?

- Through an extraction system where users can create custom tailored microdata samples

Why a data browser?

- Fast and convenient tool to explore the contents of the database before making an extract

- It prevents users from downloading microdata (if only basic figures are needed)

Some caveats

- We are not providing official statistics

- Frequencies are not based on 100% population counts

- Sampling errors must be calculated

- Compared to microdata, cross-tabulated data have less analytical power
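Because the tabulator's frequencies come from samples rather than 100% counts, each tabulated proportion carries a sampling error. The sketch below shows one textbook way to approximate it. It assumes simple random sampling with a finite population correction, which is only an approximation for the systematic household samples in the database. The function names and the figures in the usage lines are illustrative assumptions, not part of the tabulator.

```python
import math


def proportion_se(p: float, n: int, sampling_fraction: float) -> float:
    """Approximate standard error of a sample proportion.

    p: proportion observed in the sample (0..1)
    n: number of sample records behind the proportion
    sampling_fraction: share of the population in the sample (e.g. 0.10)
    Assumes simple random sampling; (1 - sampling_fraction) is the
    finite population correction.
    """
    fpc = 1.0 - sampling_fraction
    return math.sqrt(fpc * p * (1.0 - p) / n)


def confidence_interval(p: float, n: int, sampling_fraction: float, z: float = 1.96):
    """Normal-approximation interval, clipped to the valid range [0, 1]."""
    se = proportion_se(p, n, sampling_fraction)
    return max(0.0, p - z * se), min(1.0, p + z * se)


if __name__ == "__main__":
    # Illustrative figures: 12% of 50,000 sampled persons in a 10 percent sample.
    low, high = confidence_interval(0.12, 50_000, 0.10)
    print(f"approximate 95% interval: {low:.4f} to {high:.4f}")
```

Household clustering usually makes true sampling errors for person-level estimates larger than this approximation suggests, so the result is best read as a lower bound.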

The online tabulator is based on Redatam

CENSUS MICRODATA FANS…

Census Output Needs:
1. Succinct description of census and microdata (Form “A”)
2. Comprehensive metadata: questionnaires, instructions, codebooks
3. Encrypted microdata

Ship FEDEX prepaid (email for account #) to:
Prof. Robert McCaa
Minnesota Population Center
50 Willey Hall, 225 19th Ave. S.
Minneapolis MN 55455
Tel. 1+612.624.5818, [email protected]

1. Need for succinct, authoritative documentation of census and microdata: Form “A”

» Efficient processing of metadata & microdata
» Form “A”:
  » See Appendix A for details
  » Appendix B is the completed form for Spain--censuses of 1981, 1991, 2001
  » https://international.ipums.org/international/samples.shtml (click the name of a country to view samples)
» Describe the census: name, population universe, reference date, field work period, etc.
» Describe the microdata: source, sample design, sample unit, sample fraction, size, weights, etc.
» Define units in the microdata: private household, collective dwelling, included/excluded populations, etc.

2. Metadata needs (see paragraphs 15-23 for additional details)

» Documents in any form: .pdf, .txt, .doc, .xls, XML, SDMX, DDI, CSPro, IMPS, etc.
» Copies in official language and English:
Essential:
  1. Questionnaires
  2. Instructions to interviewers
  3. Codebooks, data dictionaries
Helpful:
  4. Correspondence tables (e.g., occupation with ISCO08/88)
  5. Summary official results
  6. Technical, methodological reports
  7. Sample design: preferred, every tenth private household; for collective dwellings (e.g., hospitals), every tenth person.
  8. Boundary files for administrative geography coded in microdata

3. Microdata needs (see paragraphs 24-30 for additional details)

» 2 goals:
  1. Permanently archive source microdata against loss (copies provided exclusively to the National Statistical Agency owner)
  2. Integrate high precision, anonymized household samples into database
» We prefer 100% microdata, particularly from developing countries where microdata are at risk of loss
  » Note: some European statistical offices can no longer locate census microdata for 1960s, 1970s, 1980s and even 1990s!
  » Or even where they can locate it, are unable to make the data useable
» 4 modalities for entrusting microdata:
  1. 100% microdata to MPC: 38 countries
  2. Samples provided by National Statistical Office: 25
  3. Multi-use samples also entrusted to MPC: 12
  4. Samples constructed by Research Institute upon request of NSO: 6
» License fee: US$5,000 for dataset of 1 million plus records

3. Microdata needs (see paragraphs 24-30 for additional details)

» High precision, household samples
  » 10 percent: 70 of 130 samples currently available
  » 5 percent: 28
  » <5 percent: 32 (8 constitute all that survives)
» Systematic random samples:
  » every nth private household after a random start
  » Collective dwellings: every nth person
  » extremely fine geographic stratification with proportional weighting
  » NUTS-2, NUTS-3
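The systematic design described above can be sketched in a few lines: pick a random start within the first n units, then take every nth unit. The sketch assumes the household list is already sorted by geography, so that the fixed interval spreads the sample proportionally across areas. The function name `systematic_sample` and the toy list of 1,000 households are hypothetical and do not describe MPC's or any statistical office's actual procedure.

```python
import random


def systematic_sample(units, n, rng=None):
    """Select every nth unit after a random start in the first n positions.

    units: list of sampling units (private households, or persons in
           collective dwellings), assumed pre-sorted by geography.
    n: sampling interval (n = 10 yields a 10 percent sample).
    rng: optional random.Random instance, for reproducible draws.
    """
    rng = rng or random.Random()
    start = rng.randrange(n)  # random start: 0 <= start < n
    return units[start::n]


if __name__ == "__main__":
    households = list(range(1000))  # toy frame: 1,000 household identifiers
    sample = systematic_sample(households, 10, random.Random(42))
    weight = 10  # each sampled household stands for n households
    print(len(sample), "households sampled; estimated total:", len(sample) * weight)
```

With a constant interval every selected unit carries the same weight, n, which is one reason a systematic design is simple to document on Form “A”.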

» Anonymization, performed by NSO or MPC
In addition to sampling, 6 layers of technical protections:
  1. Suppress small places of residence, work, school, etc.
  2. Suppress codes of social categories with small counts
  3. Top and bottom coding of continuous variables
  4. Suppress sensitive variables
  5. Swap small % of households into different place of residence
  6. Randomly order all households
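Three of the six protections lend themselves to a short illustration: suppressing categories with small counts, top and bottom coding, and random ordering of households. The sketch below is a simplified reading of those layers. The thresholds, the "other" code and the function names are hypothetical; the slides do not state the values actually applied by the NSO or MPC.

```python
import random
from collections import Counter


def suppress_small_categories(values, min_count, other_code):
    """Replace any category with fewer than min_count cases by other_code."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_code for v in values]


def top_bottom_code(value, bottom, top):
    """Clamp a continuous value into [bottom, top] so extreme cases are not identifiable."""
    return max(bottom, min(value, top))


def shuffle_households(households, rng):
    """Return households in random order, removing any trace of enumeration sequence."""
    shuffled = list(households)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    ages = [3, 47, 102, 88]
    print([top_bottom_code(a, 0, 95) for a in ages])  # hypothetical top code of 95

    places = ["A", "A", "A", "B"]
    print(suppress_small_categories(places, min_count=2, other_code="other"))

    print(shuffle_households(["hh1", "hh2", "hh3"], random.Random(7)))
```

Swapping households between places of residence (layer 5) is omitted here because it depends on matching rules that the slides do not describe.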

Conclusion

» Thanks to:
  » National Statistical Offices for trust and cooperation
  » International organizations for support and encouragement
  » Researchers for using IPUMS integrated datasets
» Invitation to:
  » National Statistical Office partners to entrust 2010 round microdata and metadata with Form “A”
  » National Statistical Offices that are not yet cooperating to participate and integrate pre-2010 census microdata
  » And…

…to the 58th Session ISI: Dublin, Aug 21-26, 2011
http://www.isi2001.ie
  » IPUMS Workshop, Aug 19-20
  » Microdata sessions
  » IPUMS funding for delegates from developing countries
  » IPUMS booth

Thank you!!

[email protected]; [email protected]
www.ipums.org/international
www.iecm-project.org