This is Part II of a workshop presented by ICPSR at IASSIST 2011. This section focuses on data sharing of publicly available data.
ICPSR AT 50: Facilitating Research and Data Sharing
Part II: Data Sharing
IASSIST, Vancouver, BC
May 31, 2011
“Public” Data Sharing begins at 10:45
ICPSR’s Public Data
Sharing Public Data - Agenda
• 2010 US Census
• ICPSR’s “Public” Archives
DISSEMINATION OF DATA - MICRODATA
From the Office of Management and Budget (OMB) Policy Directive published in the Federal Register, Vol. 73, No. 46, Friday, March 7, 2008, Notices, pp. 12622-12626:
“When appropriate to facilitate in-depth research, and feasible in the presence of resource constraints, statistical agencies should provide public access to microdata files with secure safeguards to protect the confidentiality of individually-identifiable responses and with readily accessible documentation, metadata, or other means to facilitate user access to and manipulation of the data.”
U.S. CENSUS DATA – 2010: KEY DATES
• National Census Day: 1 April 2010
• April - July 2010: Census takers visit households that did not return a form by mail
• December 2010: By law, the Census Bureau delivers population information to the President for apportionment
• March 2011: By law, the Census Bureau completes delivery of redistricting data to states
U.S. CENSUS DATA – 2010: DISSEMINATION OF RESULTS
American FactFinder (AFF) is an online source for population, housing, economic and geographic data that presents the results from four key data programs:
• Decennial Census of Housing and Population - 1990 and 2000
• Economic Census - 1997, 2002, 2007
• American Community Survey - 1-Year Estimates and 3-Year Estimates
• Population Estimates Program - July 1, 2006 to July 1, 2009
Results from each of these data programs are provided in the form of data sets, tables, thematic maps, and reference maps.
U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA
• Direct File Access through Download FTP Center at Census Bureau
• Free Access to all PUBLIC-USE DATA FILES
• First Release of Data (February – March 2011)
– 2010 Census Redistricting Data Summary File (P.L. 94-171):
  • State and sub-state population counts to the block level for the total population and the population 18 years and over for 63 race groups; and not Hispanic or Latino origin by 63 race groups
  • State and sub-state housing unit counts down to the block level by occupancy status (occupied units, vacant units)
• Quickly followed by (April 2011):
– National Summary File of Redistricting Data: Contains the same data tables as the state files, but the geographic levels include the U.S., regions, divisions, other areas that cross state boundaries, and a small subset of the geographic areas shown in the state files.
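The "63 race groups" figure can be reproduced by counting every way of reporting one or more of six race categories. The sketch below assumes the six basic categories named on the SF 1 slide (the five listed races plus Some Other Race); the function name is illustrative only.

```python
from itertools import combinations

# Six basic race categories (assumed here; they match the "alone"
# categories listed on the SF 1 slide).
RACE_CATEGORIES = [
    "White",
    "Black or African American",
    "American Indian and Alaska Native",
    "Asian",
    "Native Hawaiian and Other Pacific Islander",
    "Some Other Race",
]


def race_groups(categories):
    """Return every non-empty combination of the given categories."""
    groups = []
    for size in range(1, len(categories) + 1):
        groups.extend(combinations(categories, size))
    return groups


groups = race_groups(RACE_CATEGORIES)
print(len(groups))  # 2**6 - 1 = 63
```

Six of the 63 groups are single-race ("alone") groups; the remaining 57 are combinations of two or more races.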
U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA
SUMMARY FILE 1 (SF 1):
This file shows detailed tables on age, sex, households, families, relationship to householder, housing units, detailed race and Hispanic or Latino origin groups, and group quarters. Most tables are shown down to the block or census tract level. Some tables are repeated for nine race/Hispanic or Latino origin groups. The nine groups are (1) White alone, (2) Black or African American alone, (3) American Indian and Alaska Native alone, (4) Asian alone, (5) Native Hawaiian and Other Pacific Islander alone, (6) Some Other Race alone, (7) Two or More Races, (8) Hispanic or Latino, and (9) White alone, Not Hispanic or Latino. (Release: June-August 2011)
U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA
SUMMARY FILE 1 (SF 1 CONTINUED):
• The SF 1 National Update File contains the same data tables as the state files, but the geographic levels include the U.S., regions, divisions, and other areas that cross state boundaries. (Release: November 2011)
• The SF 1 Urban/Rural Update File provides users with urban/rural population and housing unit counts (down to block) and characteristics for urbanized areas and urban clusters. (Release: October 2012)
• The SF 1 Redefined Core Based Statistical Areas Update File contains the same data tables as the state files for redefined CBSAs as defined by OMB following the 2010 Census. (Release: August 2013)
U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA
SUMMARY FILE 2 (SF 2):
This file shows detailed tables on age, sex, households, families, relationship to householder, housing units, and group quarters. Most tables are shown down to the census tract level. Tables are repeated by 141 race groups, 98 American Indian and Alaska Native tribes/tribal groupings, and 39 Hispanic or Latino origin groups. In order for any of the tables for a specific group to be shown in SF 2, the data must meet a minimum population threshold. The tables in SF 2 will be repeated for each group if there are at least 100 people of that specific group in a particular geographic area. (Release: December 2011-April 2012)
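The population threshold can be read as a simple filter applied per geographic area. The sketch below is only an illustration of that rule: the group names and counts are made up, and `groups_shown` is a hypothetical helper, not part of any Census Bureau tool.

```python
SF2_MIN_POPULATION = 100  # threshold stated on the slide


def groups_shown(area_counts, threshold=SF2_MIN_POPULATION):
    """Return the groups whose count in one geographic area meets the threshold."""
    return sorted(
        group for group, count in area_counts.items() if count >= threshold
    )


# Hypothetical counts for a single census tract (made-up numbers).
tract_counts = {"Group A": 250, "Group B": 100, "Group C": 99}

print(groups_shown(tract_counts))  # ['Group A', 'Group B']
```

Group C falls one person short, so under this reading its tables would not be repeated for that tract even though they might appear for a larger area.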
U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA
SUMMARY FILE 2 (SF 2 CONTINUED):
• The SF 2 National Update File contains the same data tables as the state files, but the geographic levels include the U.S., regions, divisions, and other areas that cross state boundaries. (Release: May 2012)
• The SF 2 Urban/Rural Update File provides users with urban/rural population and housing unit counts (down to census tract) and characteristics for urbanized areas and urban clusters. (Release: January 2013)
U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA
• Congressional District Summary File – This file is a re-tabulation of Summary File 1 for newly redistricted Congressional Districts for the 113th Congress. State-based files will be released in January 2013 and every 2 years thereafter for states where congressional redistricting occurs.
• State Legislative District Summary File – This file is a re-tabulation of Summary File 1 for State Legislative Districts drawn following the 2010 Census. State-based files will be released in June 2013 and every 2 years thereafter for states where legislative redistricting occurs.
U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA
American Indian and Alaska Native (AIAN) Summary File – This is a national-level file showing the same content as Summary File 2. Tables are repeated for the total population, the total AIAN population, the total American Indian population, the total Alaska Native population, and for numerous American Indian and Alaska Native tribes. In order for any of the tables for a specific group to be shown, the data must meet a minimum population threshold of at least 100 people of that specific group in a particular geographic area. (Release: April 2013)
U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA
• Public Use Microdata Sample (PUMS) Files – The PUMS files contain state-level 2010 Census data: individual records of characteristics for a 10 percent sample of people and housing units. Data will be included for age, sex, race, Hispanic or Latino origin, household type and relationship, and tenure, with identifying information removed, for PUMAs of 100,000 or more population. (Release: TBD)
• Of lesser importance than 2000?
Decennial Census
• In Census 2000, the census used 2 forms
– “short” form – asked for basic demographic and housing information, such as age, sex, race, how many people lived in the housing unit, and if the housing unit was owned or rented by the resident
– “long” form – collected the same information as the short form but also collected more in-depth information such as income, education, and language spoken at home
• Only a small portion of the population, called a sample, received the long form.
2010 Census and American Community Survey
• 2010 Census will focus on counting the U.S. population
• The sample data are now collected in the ACS
• Puerto Rico is the only U.S. territory where the ACS is conducted
• 2010 Census will have a long form for U.S. territories such as Guam and U.S. Virgin Islands
• Same “short form” questions on the ACS
American Community Survey 2008 Content Changes
• Three new questions
– Health Insurance Coverage
– Veteran’s Service-connected Disability
– Marital History
• Deletion of one question
– Time and main reason for staying at the address
• Changes in some wording and format
American Community Survey Methodology
• Sample includes about 3 million addresses each year
• Three modes of data collection
– mail
– phone
– personal visit
• Data are collected continuously throughout the year
American Community Survey Target Population
• Resident population of the United States and Puerto Rico
– Living in housing units and group quarters
• Current residents at the selected address
– “Two month” rule
American Community Survey Group Quarters
• Place where people live or stay that is normally owned or managed by an entity or organization providing housing or services for the residents.
• 2 categories of group quarters:
– Institutional
– Non-institutional
American Community Survey Period Estimates
• ACS estimates are period estimates, describing the average characteristics over a specified period
• Contrast with point-in-time estimates that describe the characteristics of an area on a specific date
• 1-year, 3-year, and 5-year estimates will be released for geographic areas that meet specific population thresholds
American Community Survey Data Products Release Schedule
Each cell gives the years in which the data were collected.

Data released in   1-Year Estimates     3-Year Estimates     5-Year Estimates
                   (areas of 65,000+)   (areas of 20,000+)   (all areas*)
2006               2005                 -                    -
2007               2006                 -                    -
2008               2007                 2005-2007            -
2009               2008                 2006-2008            -
2010               2009                 2007-2009            2005-2009
2011               2010                 2008-2010            2006-2010
2012               2011                 2009-2011            2007-2011
2013               2012                 2010-2012            2008-2012

* Five-year estimates will be available for areas as small as census tracts and block groups.
Source: US Census Bureau
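The release schedule follows two simple rules: population thresholds decide which products an area receives, and each product released in a given year covers the period ending the previous calendar year. A minimal sketch, assuming the 65,000 and 20,000 thresholds are inclusive; both function names are hypothetical, and the sketch ignores the first release year of each product.

```python
def acs_products(population):
    """List the ACS estimate series published for an area of this size."""
    products = []
    if population >= 65_000:
        products.append("1-year")
    if population >= 20_000:
        products.append("3-year")
    products.append("5-year")  # published for all areas
    return products


def collection_period(release_year, span_years):
    """Return (first, last) collection year for a product released in release_year."""
    last = release_year - 1
    first = last - span_years + 1
    return first, last


print(acs_products(30_000))        # ['3-year', '5-year']
print(collection_period(2010, 5))  # (2005, 2009)
```

For example, an area of 30,000 people receives 3-year and 5-year estimates only, and the 5-year product released in 2010 covers data collected in 2005-2009.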
American Community Survey Data Products
• Profiles
– Data Profiles
– Narrative Profiles
– Comparison Profiles
– Selected Population Profiles
• Tables
– Detailed Tables
– Subject Tables
– Ranking Tables
– Geographic Comparison Tables
• Thematic Maps
• Public Use Microdata Sample (PUMS) Files
American Community Survey Similarities with Census 2000
• Same questions and many of the same basic statistics
• 5-year estimates will be produced for same broad set of geographic areas including census tracts and block groups
American Community Survey Key Differences from Census 2000
• Beginning in 2010, data for small geographic areas will be produced every year versus once every 10 years
• Data for larger areas are available now and data for mid-sized areas will be available in December 2008
• Census 2000 data described the population and housing as of April 1, 2000 while ACS data describe a period of time and require data for 12 months, 36 months, or 60 months
American Community Survey Key Differences from Census 2000
• The goal of ACS is to produce data comparable to the Census 2000 long form data
• These estimates will cover the same small areas as Census 2000 but with smaller sample sizes
• Smaller sample sizes for 5-year ACS estimates result in reductions in the reliability of estimates
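The link between sample size and reliability can be illustrated with the textbook margin of error for a proportion. This sketch assumes simple random sampling, which is a simplification and not the Census Bureau's actual variance estimation method; the sample sizes are made up. ACS margins of error are published at the 90 percent confidence level, hence the 1.645 multiplier.

```python
import math

Z_90 = 1.645  # z-value for a 90 percent confidence level


def margin_of_error(proportion, sample_size, z=Z_90):
    """Margin of error for a proportion under simple random sampling (assumed)."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


# Hypothetical sample sizes for the same small area.
larger_sample = margin_of_error(0.5, 400)
smaller_sample = margin_of_error(0.5, 100)

print(f"n=400: +/- {larger_sample:.3f}")
print(f"n=100: +/- {smaller_sample:.3f}")
```

Under this assumption the margin of error grows with the inverse square root of the sample size, so a sample one quarter the size gives a margin of error twice as wide.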
Cooperative Agreements
• Close collaboration with the Bureau over the years in making data available to the academic research community.
• Since the 1980’s ICPSR has sought outside funding to deal with Census data and entered into joint statistical agreements with the Bureau to facilitate its distribution and use.
• Importance in 1990: High cost of raw data ($175 per reel of tape; the entire Census comprised about 2,000 tapes, or c. $350,000).
Cooperative Agreements
• Data available at no cost to member institutions, without any rights to redistribute or resell.
• Joint annual summer workshops to offer training on the new Census data products.
– One-week training sessions held in 1991-1994 and 2001-2004
– Census Bureau staff participated extensively in these courses
– Attracted both researchers and ICPSR Official Representatives who attended to learn how to provide assistance to faculty and students on their campuses
The Decennial In(di)gestion
• Census Data: Collected regularly since the 1960s.
• The number of files and bytes has grown exponentially with every new Census.
• Main reason for the rapid growth in the numbers of data files archived and disseminated by ICPSR.
• How much and how rapid?
The Decennial In(di)gestion
Census Year   Number of Files   Number of Kilobytes
1960                       83               586,848
1970                      502             9,622,368
1980                    1,134            62,947,672
1990                    4,880           312,893,930
2000                  412,051           556,240,394
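One way to answer "how much and how rapid?" is to compute the growth factor from each Census to the next. The sketch below uses only the figures from the table above; the variable and function names are illustrative.

```python
# (number of files, number of kilobytes) per Census year, from the table above.
HOLDINGS = {
    1960: (83, 586_848),
    1970: (502, 9_622_368),
    1980: (1_134, 62_947_672),
    1990: (4_880, 312_893_930),
    2000: (412_051, 556_240_394),
}


def growth_factors(holdings):
    """Return (previous year, year, file ratio, kilobyte ratio) per decade."""
    years = sorted(holdings)
    rows = []
    for prev, curr in zip(years, years[1:]):
        files_ratio = holdings[curr][0] / holdings[prev][0]
        kb_ratio = holdings[curr][1] / holdings[prev][1]
        rows.append((prev, curr, files_ratio, kb_ratio))
    return rows


rows = growth_factors(HOLDINGS)
for prev, curr, files_ratio, kb_ratio in rows:
    print(f"{prev}->{curr}: files x{files_ratio:.1f}, kilobytes x{kb_ratio:.1f}")
```

The two ratios need not move together: a Census can add far more files than bytes when its products are split into many smaller files.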
U.S. CENSUS DATA: DISSEMINATION OF DATA AND ICPSR
• Another access point, focused on the social science research community, to Census data and documentation
• Original Census data available from the 1960s onward as well as special samples created for earlier years
• TIGER Line Files
• American Community Survey
• Many of the newer files are available in a variety of formats:
– SAS
– SPSS
– Stata
– ASCII text files
– Tab-delimited
Special Census Subsets
These files report population and housing data for national and specific sub-national geographical entities, for example:
• The entire nation
• Each individual state
• Counties
• Metropolitan Areas
• Places
• Census Tracts
Contextual File
• Based largely on Census data
• Provides information at the ‘county’ level in the U.S. (subunits of states numbering more than 3,100 in all)
• Contains data from other government and private sources at the same geographic level
• Under certain circumstances, can be merged with survey data
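Merging works when each survey record carries the same county identifier as the contextual file. The sketch below is a minimal illustration under that assumption: the field names (`county_fips`, `unemployment_rate`), the county codes, and all values are hypothetical and not taken from the actual file.

```python
def attach_context(respondents, county_context):
    """Add county-level variables to each respondent record by county code."""
    merged = []
    for person in respondents:
        # Respondents in counties missing from the context file keep
        # only their own survey fields.
        context = county_context.get(person["county_fips"], {})
        merged.append({**person, **context})
    return merged


# Hypothetical survey records and county-level context (made-up values).
respondents = [
    {"id": 1, "county_fips": "26161", "age": 34},
    {"id": 2, "county_fips": "27053", "age": 51},
    {"id": 3, "county_fips": "99999", "age": 45},
]
county_context = {
    "26161": {"unemployment_rate": 5.1},
    "27053": {"unemployment_rate": 4.3},
}

merged = attach_context(respondents, county_context)
print(merged[0])
```

The "certain circumstances" caveat matters in practice: the merge is only possible if the survey releases county identifiers, which many public-use files suppress to protect confidentiality.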
Contextual File - 2
• Population by age, sex, race, and Hispanic origin
• Labor force size and unemployment
• Personal income
• Earnings and employment by industry
• Land surface form topography
• Climate
• Government revenue and expenditures
• Crimes reported to police
• Presidential election results
• Housing authorized by building permits
• Medicare enrollment
• Health profession shortage areas
Preservation
• ICPSR provides another location to preserve data and documentation files produced by the Census Bureau
• ICPSR keeps multiple copies of these files both at its home location at the University of Michigan and at other sites in the United States
• Copies are continually checked and updated when necessary
• Considerable interest in historical Census data by demographers, historians, and economists.
Current Happenings with ACS and Plans for Census 2010
Consulted with Collection Development Committee of ICPSR Council:
• Advised to continue ICPSR precedent of acquiring Census 2010 since the membership and the research community in general have traditionally come to ICPSR for their Census data needs.
• Suggestion that the data files need not be archived right away since all public-use data will be available directly from the Census Bureau.
• Emphasis should be on archiving the most important Census data products once it can be determined that final versions have been created.
• The Committee also suggested that ICPSR consider holding training workshops on Census data once again as they did during the last decade and decide how best to finance them within the context of the Summer Program.
Current Happenings with ACS and Plans for Census 2010
• Suggestion to study possibility that SDA functionality might work to produce subsets for Census data instead of creating specific data products to do so.
• Emphasis placed on partnerships, for example working with the University of Minnesota Population Center and its National Historical Geographic Information System (NHGIS), which is expected to be able to produce subsets of 2010 Census data.
• Determine in general from membership and user community what value-added features might make sense for academic researchers as greater amounts of Census 2010 data become available.
Current Happenings with ACS and Plans for Census 2010
Select files archived at ICPSR beginning with 1996 ACS:
• Emphasis on PUMS files at first
• Greater interest in Summary Files as more data is released and, in particular, with the recent appearance of the first 5-year Estimates File covering calendar years 2005-2009
Current Happenings with ACS and Plans for Census 2010
TIGER files (Topologically Integrated Geographic Encoding and Referencing System)
• 2010 extracts containing geographic and cartographic information from the Census Bureau's MAF/TIGER® (Master Address File/Topologically Integrated Geographic Encoding and Referencing) database.
• These files support the 2010 Census Redistricting Data (P. L. 94-171) and the National Summary File of Redistricting Data/Summary File 1 releases.
• The files provide the digital map base for a Geographic Information System or mapping software. The files do not contain any mapping software.
Current Happenings with ACS and Plans for Census 2010
TIGER files (Topologically Integrated Geographic Encoding and Referencing System)
• All legal boundaries and names are as of January 1, 2010. The boundaries shown are for Census Bureau statistical data collection and tabulation purposes only; their depiction and designation for statistical purposes does not constitute a determination of jurisdictional authority or rights of ownership or entitlement.
• The geographic entity codes needed to link the Census Bureau's demographic data to the geography are included in the files. The TIGER/Line Shapefiles do not contain any demographic or economic data; data can be downloaded separately using American FactFinder.
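Linking works because both the shapefile records and the demographic tables carry the same geographic codes. A minimal sketch of that idea: it builds the 15-character census block identifier from state (2 digits), county (3), tract (6), and block (4) codes, then attaches a count by that key. The function names, record fields, and population value are hypothetical.

```python
def block_geoid(state, county, tract, block):
    """Concatenate zero-padded codes into a 15-character block identifier."""
    return (
        str(state).zfill(2)
        + str(county).zfill(3)
        + str(tract).zfill(6)
        + str(block).zfill(4)
    )


def join_counts(shapes, counts):
    """Attach a population count to each shape record by its identifier."""
    return [{**shape, "population": counts.get(shape["geoid"])} for shape in shapes]


geoid = block_geoid("26", "161", "400100", "1000")
print(geoid)  # 261614001001000

# Hypothetical shape record and demographic count (made-up values).
shapes = [{"geoid": geoid, "geometry": "..."}]
counts = {geoid: 87}
joined = join_counts(shapes, counts)
```

Keeping the codes as zero-padded strings rather than integers avoids losing leading zeros, which would break the join for states and counties with low-numbered codes.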
Current Happenings with ACS and Plans for Census 2010
TIGER files (Topologically Integrated Geographic Encoding and Referencing System)
• Differences between shape files and line files
• Data stored at ICPSR through designated Web site
• Maintain archival copies as older versions of TIGER files cease to be distributed by Census Bureau
http://www.icpsr.umich.edu/TIGER/index.html
ICPSR’s Public Archives
Three Differentiating Characteristics of a “Public Archive”
• Funding Sources
• Access
• Search
Funding Sources & Long Term Access
ICPSR’s public archives are funded by entities including:
• Government agencies
• Foundations
• Other Organizations
And if the funding ceases:
• ICPSR commitment to support access
• Access generally reverts to membership-only after some time period
Why are Funders using ICPSR? An Archive’s Reasons for Being
• Dissemination Infrastructure
– Systems & Search = technology, security, & metadata
– Data Community Base (700 immediate members to share with)
– Community Outreach/engagement expertise
• Preservation
• Fulfillment of Data Management Plan (Grant) Requirements
• Ability to Measure & Report Dissemination Statistics
Data Search within our Public Archives
A search for data/documents from within a public archive defaults to searches of materials (data) within that archive
• A strategy to help one narrow their scope
• All materials are publicly available
The Relationship Visual
[Diagram: ICPSR as the hub, connected to NACDA, SAMHDA, NCAA, NACJD, HMCA, DSDR, Research Connections, and NAHDAP]
A common hub, yet each unique
NACJD: National Archive of Criminal Justice Data
• Study topic: criminal justice
• Funders: BJS, OJJDP, NIJ
• Unique attribute: staff routinely assist non-researchers (police departments) in data use
DSDR: Data Sharing for Demographic Research
• Study topic: demography
• Partnership of several institutions
• Unique attribute: as much a resource for data producers as a mechanism for dissemination
NACDA: National Archive of Computerized Data on Aging
• Study topic: Aging – gerontological research
• Funder: National Institute on Aging
• Unique attribute: largest library of electronic data on aging in the US
Research Connections: Child Care and Early Education
• Study topic: early education
• Funder: US Dept. of Health & Human Services
• Unique attribute: goal is more than data – to be the destination for child care & early education research
NCAA Student-Athlete Experiences Data Archive
• Study topic: intercollegiate athletics and higher education
• Funder: NCAA
• Unique attribute: to assist in the development of national athletics policies
Health and Mental Health Collections
• Enhanced sensitivity in the area of disclosure risk
• From ingest of data to storage of data to analysis of data
• Has driven ICPSR, as the hub, to heighten its computing and data sharing environments
• Increasing demand has led to a need to automate – in a secured manner
Center for Population Research in LGBT Health
• Partner: Fenway Institute
• Unique attribute: data is processed offsite – ICPSR acts as the host
SAMHDA: Substance Abuse & Mental Health Data Archive
• Funder: SAMHSA
• Unique attribute: driving our online services and virtual analysis capabilities
NAHDAP: National Addiction & HIV Data Archive Program
• Funder: NIDA
• Unique attribute: driving restricted contract system
IFSS: Integrated Fertility Survey Series
• Funder: NICHD
• Unique attribute: data harmonization
Let’s Take a Break
Return at 11:45