The MCDC Data Archive John Blodgett Office of Social & Economic Data Analysis University of Missouri Rev. May 2007

http://mcdc.missouri.edu/tutorials/mcdc_data_archive.ppt

A Brief History of the Archive

Started by the Urban Information Center (UIC) at UM St. Louis (UMSL), circa 1981.

Accessing census data files (“STF”s – huge sequential summary files on tape) was very tedious and error-prone.

Idea was to standardize the data and make it easier, cheaper and more reliable to access.

SAS® software package was becoming the tool for accessing the data.

Brief History (cont)

Idea was to create an organized collection of datasets with certain standardization.

E.g. A FIPS county code field would always be converted to a SAS variable named County and would be stored as a 3-character field (NOT a numeric) with leading 0’s.

STF’s with thousands of records would be partitioned into smaller datasets based on geographic summary units (counties, tracts, places, etc.)

Brief History (cont)

Very informal “database” concept.

Users were 3 SAS programmers at UIC using MVS (IBM mainframe).

No web access and no end-user access to worry about. A database designed for easy and efficient analysis and ad-hoc queries.

The data was almost entirely (decennial) Census data.

We developed SCADS – SAS Census Access and Display System. Sold 8 copies.

Only ran on IBM mainframe systems (MVS) with SAS.

Brief History: 1988

In 1988 the UIC and OSEDA (UM-Extension at Columbia) team up to become data support for the Missouri Census Data Center.

OSEDA has a wider variety of data that is to be added to the collection (archive).

OSEDA has data analysts who are not SAS programmers. Lotus 1-2-3 is very big.

Storing metadata (documentation) in a Pendaflex-based system is no longer as viable as when it was just “us guys”.

Brief History: 1991-1992

The 1990 Census results are flowing. The UIC is converting all the files to SAS datasets, mostly on tape. Data on disk is very expensive on the MVS system.

The Census Bureau is releasing the data on CD’s along with some extraction software. These are the DOS ages.

To access an STF3 table for Poplar Bluff requires mounting a tape and reading it sequentially to find the relevant data, paying for tape I/O’s required to get there. Slow, expensive and hard to estimate the cost of a query.

Brief History: 1993

Breakthrough year. COIN (Columbia Online Information Network) and Gopher become important elements of the MSCDC.

The UIC’s standard extract reports based on STF3 are turned into very simple but very popular 1 or 2-page demographic profile reports. Delivered via the Internet using the Gopher protocol.

This required copying the report files to a Unix system at OSEDA. But the data and most processing are still on MVS mainframe.

Brief History: 1994-1996

Transition years. (Most) archive data are copied to an AIX (IBM Unix) system. This was the Great Leap Forward for the archive.

The web takes off. Windows 95 appears. Suddenly it seems like everybody has MS-Office with Excel.

First version of Uexplore debuts in 1996 with “sub-applications” xtract, hypercon and tabrgen. It allows users to explore the data archive and do extractions. Targeted for use by the state data center core group & affiliates.

Brief History: 2001-2003

Archive moves to new hardware system with storage and processing speed to handle 2k decennial census.

Dexter replaces old xtract modules. Hypercon & tabrgen are retired.

Metadata system based on “datasets dataset” developed, with Datasets.html index pages.

Enhancements designed to make archive more “self service” oriented.

Relevance of History to DA

It was not until the mid-90’s that the data archive was made end-user-accessible via the web. Even then it was for a more sophisticated user, not a casual 1-time user.

The advent of the WWW resulted in much more emphasis on making datasets easier to use and on creating metadata.

The widespread use of Excel led us to concentrate on creating extracts that could be easily loaded into spreadsheets.

There are still “filetypes” in the archive that pre-date the web and these are generally not as accessible as those created after we started worrying about web-access issues.

What Is the Data Archive?

A loosely organized collection of data files (data sets, data tables, SAS data sets -- these are all terms for the same thing).

Related supporting files in html, pdf, csv, xls and other standard web formats. Such files may contain metadata, extracts, raw input data, reports, etc.

A reasonably rigorous set of naming and organizational conventions that make accessing the data easier.

A network of MCDC people who will assist you with accessing the data.

Data Archive Directories

The archive is really just a very large Unix directory. It is named /pub/data .

The 1st level subdirectories represent data categories that we call “filetypes”.

All filetypes have a subdirectory named Tools where we keep the SAS programs that created the data sets in the filetype directory.

Occasionally we have subdirectories of filetype directories that contain data files. We do this to avoid having too many data sets in 1 directory.

Uexplore and Directories

The Uexplore navigation utility displays the contents of a single directory. It lists subdirectories, data files and other files.

Subdirectories (identified via folder icons) are listed before most files (special files like Datasets.html & Readme.html are the only ones that appear before subdirectories).

Clicking on a subdirectory invokes Uexplore to display the contents of that subdirectory.

Files and Data Files

The directories are simply containers for organizing the content of the DA, which is comprised of files.

“Data Files” is the term we use to reference the special files that can be accessed via the Dexter extraction utility. AKA “data sets” & “SAS data sets”.

Uexplore displays a listing of all the files within a directory in alphabetical order, with the filenames serving as hyperlinks.

In Unix, case matters and uppercase letters sort before lowercase.
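The uppercase-first ordering can be illustrated with a short Python sketch. Python's default string sort compares character codes, which matches the plain ASCII ordering described here. The file names are examples in the style of the archive, and how Uexplore itself sorts is not shown in these slides, so treat this only as an illustration of the rule.

```python
# Example names in the style of the archive (illustrative, not a real listing).
names = [
    "uscom05.sas7bdat",
    "Tools",
    "mocom06.sas7bdat",
    "Datasets.html",
    "Readme.html",
]

# Default sort compares character codes, so "D", "R", "T" come before "m", "u".
case_sensitive = sorted(names)

# A case-insensitive sort, for contrast, interleaves the capitalized names.
case_insensitive = sorted(names, key=str.lower)

print(case_sensitive)
print(case_insensitive)
```

This is why capitalized names such as Datasets.html, Readme.html and Tools bubble to the top of a listing.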

File Naming Conventions

File extensions determine what happens when you select (click on) a file on the uexplore-generated web page.

Extensions sas7bdat and sas7bvew indicate data files. Clicking invokes Dexter to extract from that data set.

Extension sas indicates a SAS code file. It will display as a text file in your browser.

Most other extensions (html, pdf, csv, txt, etc) will be displayed as usual by your browser. E.g. for most users clicking on a file with a “.csv” extension will cause Excel to be invoked.
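The extension rules above amount to a small lookup. Here is a minimal Python sketch of that logic; the function name `click_action` and the returned labels are made up for illustration and are not part of Uexplore.

```python
import os

# Extensions the slides identify as Dexter-accessible data files.
DATA_EXTENSIONS = {".sas7bdat", ".sas7bvew"}


def click_action(filename):
    """Return a label for what clicking the file would do (hypothetical labels)."""
    extension = os.path.splitext(filename)[1]
    if extension in DATA_EXTENSIONS:
        return "dexter"    # data file: hand off to the extraction utility
    if extension == ".sas":
        return "text"      # SAS code: shown as plain text
    return "browser"       # html, pdf, csv, txt, ...: browser default handling


print(click_action("mocom06.sas7bdat"))
print(click_action("Datasets.html"))
```

The comparison is deliberately case-sensitive, since case matters on the Unix system that hosts the archive.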

File Naming Conventions

Many data sets pertain to a specific geographic universe. In these cases we commonly use a filename that identifies this universe such as “mo” (for Missouri) or “us” (for United States).

A file name that ends with 2 digits usually indicates data pertaining to a year. So file mocom06.sas7bdat contains data for 2006.

File Naming Conventions (cont)

We sometimes use geographic levels as part of file names to indicate the level(s) of geography being summarized on the set.

E.g. mostcnty is a file containing summaries for Missouri state and counties.

uszips04 would indicate ZIP code level summaries for the entire U.S. for 2004.
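As a rough illustration of these conventions, the sketch below splits a file name into a universe prefix, a middle part and an optional year. The pattern, the function name `parse_dataset_name` and the century pivot of 50 are all assumptions made for this example; the conventions are applied "commonly" and "usually", so real archive names will not always fit.

```python
import re

# Assumed shape: 2-letter universe, letters for the geographic part,
# optional 2-digit year. Names that do not fit return None.
NAME_PATTERN = re.compile(r"([a-z]{2})([a-z]*)(\d{2})?")


def parse_dataset_name(filename, pivot=50):
    """Split e.g. 'mocom06.sas7bdat' into universe, body and 4-digit year."""
    stem = filename.split(".", 1)[0]           # drop the extension, if any
    match = NAME_PATTERN.fullmatch(stem)
    if match is None:
        return None
    universe, body, yy = match.groups()
    year = None
    if yy is not None:
        number = int(yy)
        # Assumption: 2-digit years below the pivot belong to the 2000s.
        year = (2000 if number < pivot else 1900) + number
    return {"universe": universe, "body": body, "year": year}


print(parse_dataset_name("mocom06.sas7bdat"))
print(parse_dataset_name("mostcnty"))
print(parse_dataset_name("uszips04"))
```

The middle part ("stcnty", "zips") is left uninterpreted because the slides give only a few examples of geographic-level abbreviations.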

Datasets.html

This is a special file that occurs in most (but not yet all) filetype directories.

Uexplore displays it at the top of the page in bold and uses the Description field to tell you to “Use this custom data directory page to access the database files (only) with greatly enhanced descriptions and metadata.”

The MCDC goes to considerable trouble to create these files in order to make it easier to access our data. Take advantage of them.

SeeAlso.html

This filename is used in several of our filetype directories and we hope to create them for many more.

They provide links to other web sites with related data or information regarding this data directory.

They are usually very short pages with no fancy formatting.

Tools and Queries

These are two specially-named subdirectories.

Tools we have already discussed: it’s where we store the code for creating the data files, as well as (sometimes) examples of SAS programs for accessing them.

Queries contains saved Dexter queries. We have not fully implemented these yet, but the idea is that users can select these saved queries and re-run them just by clicking on the .txt files in these special subdirectories.

Structure of Data Files

The Data Files in the archive are stored as SAS data sets. (If you do not know or want to know anything about SAS that is OK. Dexter lets you access these without need to know anything about SAS.)

They are rectangular data tables with rows and columns – aka observations and variables.

The rows represent the entities being described or summarized. The columns contain the attributes or the statistics summarizing the entity.

Finding Out About Data Files

The key to using the data archive is understanding what kinds of information about what kinds of entities are stored in the data files.

Within a filetype directory the best place to start trying to figure out what we have is using a Datasets.html page (if available).

Each row of the table displayed on a Datasets.html page tells you about a data file. Not all about, but some basic stuff.

The Uexplore/Dexter Home Page

The Archive Directory (on the Uexplore/Dexter home page)

The teal box contains links to 9 major data categories (2000 Census thru Compendia)

The rest of the page consists mostly of descriptions of, and hyperlinks to, the archive’s data categories (which we refer to as filetypes.)

Filetypes within the major categories are in order of what we think will be user interest.

Sf32000x has been our most popular filetype. Popests and acs2005 are gaining.

What’s In the Archive?

Over 20,000 data tables (“datasets”) organized into 60+ major categories. Heavy emphasis on U.S. census data.

Not all filetypes are created equal. We spend 90% of our resources on maybe 10% of our data directories.

Filetypes in bold on the directory page are the MCDC “house specialties”.

Uexplore & Dexter

Uexplore is the web tool that lets you browse the archive, displaying the contents of one directory at a time.

When Uexplore displays a special data table file it makes the name of the file a hyperlink to invoke Dexter for that table.

Dexter (which is really 2 modules) allows the user to do custom extractions from the data table files.

Facts Worth Repeating

The data tables (the things Dexter accesses) are in the same directories with other related files (SeeAlso.html’s, spreadsheets, csv files, Readme files, etc.)

Each filetype directory has a special Tools subdirectory where we keep program code and other tool modules related to the data.

Subdirectories & files starting with capital letters are listed first and are usually worth looking at.

Dexter-accessible table files (“SAS datasets”) have extensions of sas7bdat or sas7bvew.

Exercise

The Bureau of Economic Analysis disseminates its REIS data with key economic indicators for US geography down to the county level.

On the Uexplore home page locate the filetype corresponding to this data collection (what’s the major category?) and navigate to the directory page.

Uexplore Page for beareis (cropped)

What you see when you click on the beareis link on the Uexplore home page. It displays a list of files within the directory. The “File” column entries are hyperlinks. With a few exceptions the files are displayed in alphabetical order.

Datasets.html is a special file providing enhanced navigation of the data files in this dir. It displays just the data-table files, but in a more logical order and with additional metadata.

Datasets.html page

Datasets.html Columns

The Name column is also a link to uex2dex / dexter.

Label is a short description of the dataset.

#Rows (# of observations) and #Cols (# of columns/variables) are taken from the datasets metadata set. As are the Geographic Universe and Units.

Details link provides access to more detailed metadata.

Universe and Units

The majority of datasets in the archive contain summary data for geographic areas. For example, a dataset in the popests directory might contain the latest estimates for all counties in the state of Missouri. The geographic universe is Missouri, and the units are counties.

When we have many datasets in a directory it’s usually because we have many different combinations of universe and units.

Common Universes

Missouri (the state of) is by far the most common universe for the MCDC archive.

United States is second – we have quite a number of national datasets.

Illinois and Kansas are also very common since we routinely download and convert census files for these key neighbor states.

A common sort order for files on Datasets.html pages is Missouri files first, then US, then IL/KS and then other states.

Rows & Columns

The rows of the data tables typically represent (i.e. contain data about) geographic entities: states, counties, cities (places), etc.

Most of the columns in the data tables are summary stats for the entity: e.g. the 2000 pop count, the latest estimated pop, the change and percent change, etc.

Other columns (“variables”) are identifiers with names such as sumlev, geocode and areaname.

A Details Metadata Page

We get here by clicking on the Details link on Datasets.html page.

Lots of info here – but varies

Key variables is often very useful when doing filters.

Note the direct link to Dexter under Access the dataset near the bottom.

Increase Text Size to Read Fine Print

Exercise – Navigate to Dataset

The filetype mig2000 has data regarding migration from 1995 to 2000 as captured in the 2000 census.

Go to the Uexplore home page and navigate to this filetype.

Use the Datasets.html page to display the datasets within the directory.

Find the row for the usccflows data table and click on the Details link for this table.

From the Details page click on the keyvals link for the variable State.

Key Variables Report: State

Tells you that the variable State has a value of 01 (for “Alabama”) in 22137 rows of this dataset.

This can be very helpful when doing a data filter in Dexter.
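A key variables report is essentially a frequency count of one identifier column. The Python sketch below shows the idea on a tiny made-up table; the function name `key_values` and the sample rows are illustrative only, and the real report is produced by the archive's own metadata tools.

```python
from collections import Counter


def key_values(rows, variable):
    """Count how many rows carry each distinct value of one variable."""
    counts = Counter(row[variable] for row in rows)
    return sorted(counts.items())   # sorted by value, like a report listing


# Three made-up rows standing in for a data table.
rows = [
    {"State": "01", "County": "01001"},
    {"State": "01", "County": "01003"},
    {"State": "29", "County": "29019"},
]

for value, count in key_values(rows, "State"):
    print(value, count)
```

Knowing the exact stored values (for instance "01" rather than 1) is what makes such a report useful when writing a filter.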

General Information About Archive Data Sets and Data Set Variables (Columns)

Dataset Naming Conventions

All filetype names are 8 characters or less.

Dataset names were limited to 8 characters by the software until recently.

The first characters of the dataset name often correspond to the universe – e.g. “mo”, “il”, “us”.

The geo units are often part of the ds-name – e.g. “motracts”, “uszips”.

For time series data the name usually ends with a time indicator – e.g. “uscom05” contains data thru 2005.

The names are cryptic on purpose.

Variable Naming Conventions

Not as rigorously applied as we might like, esp. for older datasets (conventions used for 1980 datasets differ a little from 2K and 1990 sets, for example)

Certain names appear on many datasets and are consistent. These are mostly identifier variables, the ones used in creating filters and as keys for merging data from different files.
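To show why consistent identifier names matter for merging, here is a minimal Python sketch that joins two small tables on a shared key. The helper `merge_on_key`, the column name `pop` and all the numbers are invented for the example; `geocode` and `medhhinc` are names that appear in these slides.

```python
def merge_on_key(left_rows, right_rows, key):
    """Inner-join two lists of row dicts on a shared identifier column."""
    right_by_key = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            merged.append({**row, **match})   # combine columns from both rows
    return merged


# Made-up values; only the shared 'geocode' column name matters here.
population = [
    {"geocode": "29019", "pop": 100},
    {"geocode": "29510", "pop": 200},
]
income = [
    {"geocode": "29019", "medhhinc": 50},
]

print(merge_on_key(population, income, "geocode"))
```

The join only works because both tables use the same name and the same character format for the key.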

Why Variable Names are Short

Why do we call it medhhinc instead of Median_Household_Income?

Because we are SAS programmers, not COBOL.

Until the late 90’s SAS variable names were limited to 8 characters. We learned to live with this and even to like it.

Numeric vs. Character Variables

SAS® stores data as character strings or as numerics.

We store all identifiers (geographic codes, etc) as character strings even if they are made up of numeric digits.

The value of the state code for California is “06”, not 6. The leading “0” matters.

(Unfortunately, Excel ignores the distinction when importing csv files.)
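One way to avoid losing leading zeros outside of SAS is to read identifier columns as text and convert only the remaining columns to numbers. The Python sketch below assumes a CSV with one header row; the helper names, the `Value` column and the sample value are made up, while the State and County widths come from these slides.

```python
import csv
import io

# Identifier columns to keep as text, with their fixed widths.
ID_WIDTHS = {"State": 2, "County": 5}


def restore_code(value, width):
    """Re-pad a code whose leading zeros were already lost, e.g. 6 -> '06'."""
    return str(value).zfill(width)


def read_rows(text):
    """Parse CSV text, keeping identifiers as strings and other columns as numbers."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {}
        for name, value in raw.items():
            row[name] = value if name in ID_WIDTHS else float(value)
        rows.append(row)
    return rows


sample = "State,County,Value\n06,06001,1.5\n"   # 'Value' is a placeholder column
rows = read_rows(sample)
print(rows[0]["State"], restore_code(6, ID_WIDTHS["State"]))
```

`restore_code` is a fallback for data that has already been through a spreadsheet; keeping the codes as text from the start is the safer route.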

Some Common Identifiers

SumLev: Geographic summary level codes as used in 2K census. (3-char)

State: 2-char state FIPS code.

County: 5-char county FIPS code, incl. the state.

Geocode: A composite code to id a geographic area. E.g. the value for a census tract might be “29019-0010.00”.

AreaName: Name of the area.

Common ID Variables (cont)

Tract: census tract in tttt.ss format, always 7 characters with leading 0s and 00 suffixes. E.g. “0012.00”.

Esriid: Similar to geocode but intended to use as a key for linking to shape files from ESRI (the ArcInfo people). When geocode=“29019-0010.00” the value of esriid=“29019001000”.
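
The Python sketch below shows one way these identifiers could be assembled for a census tract. The rule used for esriid (drop the hyphen and the period from geocode) is inferred from the single example above and is an assumption, as are the function names.

```python
def format_tract(basic, suffix=0):
    """Build a 7-character tract code in tttt.ss form, e.g. 12 -> '0012.00'."""
    return f"{basic:04d}.{suffix:02d}"


def make_geocode(county, tract):
    """Join a 5-character county code and a tract code with a hyphen."""
    return f"{county}-{tract}"


def make_esriid(geocode):
    """Assumed rule: strip the punctuation from geocode to get esriid."""
    return geocode.replace("-", "").replace(".", "")


geocode = make_geocode("29019", format_tract(10))
esriid = make_esriid(geocode)
```

With these inputs geocode is "29019-0010.00" and esriid is "29019001000", matching the values on the slide.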

Consistency With Census Bureau Data Dictionary Names

The Bureau often distributes data dictionary files with their data that include suggested names for the fields.

Their name for the field (variable) containing the name of the geographic area being summarized is ANPSADPI. We decided to go with AreaName instead.

But in most cases we try to use the same name as in the data dictionary.

Variable Labels

While variable names tend to be very cryptic we can (and often do) associate descriptive labels to better describe the meaning of the variable.

You see these labels as part of the drop-down variable-select lists in Dexter.

They also occupy the 2nd row (variable names occupy the 1st) of csv files generated by Dexter.
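
A program reading such a csv file has to treat the first two rows differently from the data rows. The Python sketch below does that for a small made-up sample laid out as described (names in row 1, labels in row 2); the label text and data values are invented for the example.

```python
import csv
import io

# Made-up sample in the described layout: names, then labels, then data.
sample = (
    "County,medhhinc\n"
    '"County FIPS code","Median household income"\n'
    "29019,30000\n"
)

reader = csv.reader(io.StringIO(sample))
names = next(reader)   # 1st row: variable names
labels = next(reader)  # 2nd row: variable labels

# Map each variable name to its descriptive label.
label_for = dict(zip(names, labels))

# The remaining rows are data; values stay as strings.
records = [dict(zip(names, row)) for row in reader]
```

Skipping the label row before loading the data keeps it from being mistaken for an observation.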

Formats

Some variables are codes that have custom formats associated with them. The format causes them to display a value label instead of the stored code value.

E.g. the variable State may have a stored code value of “29” but displays as “Missouri” using the $state format. By default, all Dexter output has the “formatted” value labels.

SAS dataset output is an exception. (See notes). Click the “View qmeta Metadata report” option at the end of Section II on the Dexter form to see which variables have formats associated with them.
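
Conceptually a format is a lookup table from stored codes to value labels. The Python sketch below imitates that idea with a dictionary; it is an analogy, not how SAS implements formats, and the two-entry table and the apply_format name are assumptions.

```python
# A tiny stand-in for a state format: stored code -> value label.
STATE_FORMAT = {"06": "California", "29": "Missouri"}


def apply_format(code, fmt):
    """Return the value label for a code, or the code itself if unmapped."""
    return fmt.get(code, code)


displayed = apply_format("29", STATE_FORMAT)
```

The stored value remains "29" throughout; only what is displayed changes.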

Tables Within Data Tables

Conventions for Storing SF’s and STF’s

What’s an S(T)F?

STF=Summary Tape File (pre-2000)

SF=Summary File (2000)

The Census Bureau’s terminology for large data files consisting of a large number of tabulated summary tables.

A table might have several “dimensions”

To See Examples

Go to American FactFinder on the Census Bureau web site and use the Data Sets option.

Choose “Detailed Tables”.

Complete the query and you’ll see all the hundreds of tables you have to choose from.

Table Cells as Variables

The SF tables have codes consisting of a table-type code, a table number and (sometimes) a table suffix.

E.g. on SF3, 2000 census, there are tables named P5, H11, PCT23, HCT28, and HCT29A. Each table is comprised of multiple cells or “items”.

Each item is a column (variable) in an archive data set in the sf32000 filetype directory.

Table Item Variables

The variables in a sf32000 data table corresponding to table P5 are named p5i1, p5i2, …, p5i7.

P5i1 is the Total Population, p5i2 is the total Urban population, etc.

The variable name is the table ID followed by the letter “i” and the item number within the table.
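
The naming rule is mechanical, so it can be sketched in a few lines of Python. The helper names below are hypothetical, and lowercasing the table ID is an assumption based on the p5i1 example.

```python
import re


def item_names(table_id, n_items):
    """List the variable names for a table, e.g. P5 with 7 items."""
    return [f"{table_id.lower()}i{n}" for n in range(1, n_items + 1)]


# Table ID, then the letter "i", then the item number at the end.
_ITEM_PATTERN = re.compile(r"^(.+)i(\d+)$")


def split_item_name(name):
    """Split a variable name such as 'p5i2' into ('p5', 2)."""
    match = _ITEM_PATTERN.match(name.lower())
    if match is None:
        raise ValueError(f"not a table item variable name: {name!r}")
    return match.group(1), int(match.group(2))


p5_variables = item_names("P5", 7)
```

Going the other way, split_item_name("p5i2") recovers the table ID "p5" and the item number 2.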

Dexter Recognizes Tables

Certain filetype/dataset name combinations are recognized by Dexter as having a table-based structure.

When these are recognized, Dexter section III, where you normally select numeric variables, is modified to let you select entire tables instead.

Regular Data Sets vs. Views

There are 2 kinds of SAS data files used in the archive.

“Regular SAS data sets” are standard data sets created with the default format (“engine”).

A “view” is a pseudo or virtual data set. It does not consist of actual data but instead is a small (usually) program that gets invoked and generates data on the fly.
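
The difference between stored data and a view can be imitated in Python with a list and a generator function. This is only an analogy for the idea of data generated on the fly, not a description of how SAS views work internally; the rows and values are made up.

```python
# A "regular data set": the rows are actually stored.
stored_rows = [
    {"County": "29019", "TotPop": 1000},
    {"County": "06037", "TotPop": 2000},
]


def missouri_view():
    """A "view": no rows of its own, just a rule run each time it is read."""
    for row in stored_rows:
        if row["County"].startswith("29"):
            yield row


first_read = list(missouri_view())

# Changing the underlying data changes what the view produces next time.
stored_rows.append({"County": "29189", "TotPop": 3000})
second_read = list(missouri_view())
```

The view takes almost no space, but its rows are recomputed on every read.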

Why SAS Data Sets?

Because we want to use SAS.

As compared to a true database (Oracle, et al) they are much easier to create and access, occupy less space, are very good for very large collections, and have decent metadata tools. Good Oracle DBA’s are way too expensive.

Have you noticed how it takes a while for the Census Bureau to add new data sets to AFF?

Why Not More Excel Files?

Creating xls files on a Unix platform is tricky because Excel does not run on Unix.

It is a proprietary format that we do not have much experience with manipulating (to convert from/to other formats).

SAS data sets can be converted to csv files which then load into Excel.

255 and 65,xxx limits (cols and rows)

Why Not More Excel Files?

One SAS data set with 100,000 rows and 2000 variables can be rather easily deconstructed (using Dexter) into whatever xls file you need.

Better to let the user decide what rows and columns are of interest than have us decide ahead of time and offer only the results of our guessing at what is wanted.
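
The kind of extraction described here (pick the rows, pick the columns, write a csv) can be sketched in Python as follows. This is not Dexter's code; the extract function, the sample rows and their values are hypothetical.

```python
import csv
import io


def extract(rows, columns, keep_row):
    """Write the chosen columns of the rows that pass the filter as csv text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)  # header row with the variable names
    for row in rows:
        if keep_row(row):
            writer.writerow([row[name] for name in columns])
    return out.getvalue()


# Made-up rows standing in for a wide archive data set.
rows = [
    {"State": "29", "County": "29019", "p5i1": 1000, "p5i2": 800},
    {"State": "06", "County": "06037", "p5i1": 2000, "p5i2": 1900},
]

# The user, not the archive, decides which rows and columns to keep.
result = extract(rows, ["County", "p5i1"], lambda r: r["State"] == "29")
```

The same large data set can serve many different requests just by changing the column list and the filter.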

Thank You

Questions, comments, suggestions to:

[email protected]@missouri.edu