GEOGRAPHY and WEIGHTS IN THE NLS By Randall Olsen




The Plan for this Module

• Basics of geographic data

• Geography and sampling

• Level of detail available in the NLS and GIS data

• How you get geographic data

• Weighting

• Correct standard errors and geo-variables

• Using GIS and family-level data to enhance your analysis plan


Basics of Geographic Data

• Four Census Regions – this is the finest level of geography available for original cohorts without going to a Census Research Data Center.

• State and Counties

• Census Tracts – size varies, thousands of persons

• Block Groups – neighborhoods (almost)


Counties

• Finest level routinely available in NLSY79 and NLSY97

• About 3,100 in U.S. – e.g., Texas (254), Delaware (3), Georgia (159); designated by FIPS codes

• Extensive socio-economic & demographic data available at county level (nee City-County Data Book) – you merge them in using FIPS codes


Census Tracts

• In 2000 all U.S. partitioned into tracts

• Size and population vary, but can contain several thousand people in urban areas or a few hundred in rural areas

• Using these data requires clearance from BLS


Block Group

• In Urban areas, consist of groups of blocks (did you guess?)

• This is the finest level of aggregation for Census geography (building blocks for reapportionment)

• What non-rural folk think of as a neighborhood, except for those near boundary of the block group


Census Regions


Census Tracts NYC Metro Area


Census Tracts in Wyoming


Sampling - Original Cohorts

• Original Cohorts drawn from experimental CPS sample frame in 1960’s; Title 13 confidentiality restrictions prohibit release of geographic data below Census Region

• We recently geocoded these data (latitude and longitude) and one may use them at a Census Research Data Center

• The exact sampling structure was kept secret a la Raiders of the Lost Ark – details may exist in a musty file in Suitland, MD, although the latitude and longitude data allow one to reverse engineer the sampling


Sampling for the NLSYs: Multiple Stages

• U.S. divided into Primary Sampling Units (PSUs) – major metropolitan areas, counties or groups of counties (rural areas)

• Selection probability proportional to population of interest; large cities always chosen

• Dividing PSUs into groups can ensure the correct fraction of rural & suburban areas is chosen – this can reduce the sampling variance relative to a simple random sample (SRS)


Next Stages

• Select Tracts or Block groups; list in order by income or ethnic composition to pick every nth one – ensures even distribution over the ordered characteristics. Again, this can reduce sampling variance relative to an SRS. Segments of streets selected within block groups.

• List all addresses in selected segments; randomly select units to do a screening interview to identify eligible persons

• This process generates area “clusters” of nearly contiguous respondents
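The "list in order, pick every nth one" step above is a systematic sample. A minimal Python sketch, assuming hypothetical block-group records with a made-up median-income field; the function name, the data and the fixed random seed are illustrative and are not the NLSY's actual selection procedure:

```python
import random

def systematic_sample(units, n_select, sort_key, seed=0):
    """Sort units by sort_key, then take every k-th unit from a random start."""
    ordered = sorted(units, key=sort_key)
    step = len(ordered) / n_select           # sampling interval (may be fractional)
    rng = random.Random(seed)
    start = rng.random() * step              # random start inside the first interval
    return [ordered[int(start + i * step)] for i in range(n_select)]

# Hypothetical block groups: (id, median income in dollars), all incomes distinct
block_groups = [(i, 20000 + 1500 * ((i * 37) % 40)) for i in range(40)]
sample = systematic_sample(block_groups, n_select=8, sort_key=lambda bg: bg[1])
```

Because the list is sorted by income before selection, the eight picks are spread evenly from the poorest to the richest block groups, which is the variance-reducing property the slide describes.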


Examples of PSU Clustering

• NLSY97: 100 PSUs in cross sectional sample. 100 PSUs in minority oversample.

• NLSY79: 102 PSUs in cross sectional sample. 100 PSUs in oversample. 38 PSUs in Military oversample.

• We average about 50 respondents per PSU; the effect of clustering on statistical properties increases with the size of the cluster and degree to which variables are correlated within cluster.

• Correlations within clusters increase the sampling variance and usually overcome the advantages of stratification.


PSUs in NLSY79 initial screening (done in 1978) – over yield

PSUs in NLSY97 initial screening (done in 1997, screen and go) – under yield


Geographic Detail in NLS

• States and counties available in Geocode release

• Zipcode data kept at BLS and CHRR

• Census tracts and block group identifiers at CHRR and BLS

• Latitude and longitude (accurate to about 50 feet) at CHRR


Geocode – How you get it

• You need to apply to BLS (see Web site)

• Describe how you plan to use the data

• If BLS approves you, CHRR sends you a CD

• You need to return the CD when finished and you are subject to audit and legal liabilities if you violate terms of agreement with BLS. BLS performs many audits – keep yourself in compliance.


Geocode – How you use it

• You use the state and county codes to merge in the data you need

• Use standard FIPS codes

• There is a variable indicating when R is in a central city (this was done using zipcodes; before 1998, missing values mark zips that are not unambiguously central or non-central)

• Data merge is a do-it-yourself project
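The do-it-yourself merge usually comes down to building a common county key and looking it up. A minimal Python sketch, assuming the state and county codes arrive as separate integers and that the outside county file is keyed by the 5-digit FIPS string (2-digit state plus 3-digit county); the record layout, field names and income values are hypothetical:

```python
def fips5(state_code, county_code):
    """Build the 5-digit county FIPS key: 2-digit state + 3-digit county."""
    return f"{int(state_code):02d}{int(county_code):03d}"

# Hypothetical respondent records carrying state and county codes
respondents = [
    {"id": 1, "state": 48, "county": 201},
    {"id": 2, "state": 10, "county": 3},
    {"id": 3, "state": 13, "county": 121},
]

# Hypothetical county-level table keyed by 5-digit FIPS (values are made up)
county_data = {
    "48201": {"median_income": 50000},
    "10003": {"median_income": 55000},
}

def merge_county(respondents, county_data):
    """Left-join county attributes onto respondents, flagging non-matches."""
    merged = []
    for r in respondents:
        key = fips5(r["state"], r["county"])
        extra = county_data.get(key, {})      # empty dict when no county match
        merged.append({**r, "fips": key, "matched": key in county_data, **extra})
    return merged

merged = merge_county(respondents, county_data)
```

Zero-padding matters: a county stored as 3 must become "003", or the key will not match the outside file. Keeping a `matched` flag makes it easy to count respondents whose county was not found.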


Zipcode Data

• CD is at BLS or CHRR and is not released

• The CD has Zipcodes, but matching and merging in the data you need is a do-it-yourself project

• You can have CHRR create a variable you need, with BLS approval

• Zipcode centroid can be used as rough location of respondent for simple distance calculations
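A simple distance calculation from a zipcode centroid can be sketched with the haversine (great-circle) formula. The coordinates below are hypothetical, and treating the centroid as the respondent's location is only the rough approximation the slide mentions:

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# Hypothetical coordinates: a zipcode centroid and a hospital 0.1 degree north of it
centroid = (40.00, -83.00)
hospital = (40.10, -83.00)
distance = haversine_miles(*centroid, *hospital)   # roughly 6.9 miles
```

The same function works with the finer latitude and longitude data, where the respondent is placed within about 50 feet rather than at a centroid.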


Fine Level Location

• Modern Geographic Information Systems data use latitude and longitude as the basis for linking data

• We geocode respondent addresses with latitude and longitude; sometimes with GPS units (all years except 1980)

• We place R within about 50 feet

• Opportunities to extend analysis abound


Distance from R to:

• Fast food restaurants

• Employers

• Doctors’ offices

• Hospitals

• Freeways

• Schools, public & private

• Post offices

• Banks

• Bus stops

• Train stations

• State licensed day care centers

• Drug seizures & prices

• Air quality measures

• Toxic waste sites


Data at Tract and Block Group Level

• Based on Decennial Census Long Form or American Community Survey (recent years)

• Ethnicity and Color of people in area

• Average income, poverty rate, dispersion in income, housing attributes

• Population density, education, employment rates


Other Sensitive Data for Analysis

• CHRR maintains the names of employers for each respondent in each round

• With BLS approval we can identify persons working for a particular sort of employer or match in employer characteristics

• The guiding principle is that these specialized extracts must not give you the ability to re-identify the respondent


Ideas using detailed geography

• Does proximity to fast-food restaurants now and in the past correlate with BMI?

• Does current and past air quality have a relationship to the incidence of asthma?

• Does proximity to health care correlate with health outcomes?

• Is local income inequality related to health?


Respondent location is generally chosen by respondent; this problem of “endogenous” location may be attenuated or “solved” using locational attributes at either screening or age 15 – locations reflecting primarily parental choice, not respondent choice.

These past locational attributes can be used as either regressors or instrumental variables (IV). IV creates a variable that “stands in” for a regressor that is correlated with the error term.


Some respondent choices may be endogenous to an outcome, such as smoking and birth weight of one’s infant. One could use the incidence of smoking by one’s peers in the original PSU (or by one’s siblings) as an instrumental variable. Peer smoking reflects shared socio-economic forces, but the weight of R’s baby is unlikely to have an effect on the smoking behavior of R’s peers. Avoid weak instruments, that is, instruments that do not explain much of the variation in the variable they stand in for.
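How an instrument "stands in" for an endogenous regressor can be illustrated with simulated data. The sketch below is an assumption-laden toy, not NLS data: `z` plays the role of a past locational attribute, `u` is an unobserved factor that drives both the regressor and the outcome, and the coefficients 0.8, 2.0 and 3.0 are invented for the simulation. With one instrument and one regressor, the IV estimator reduces to a ratio of covariances:

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

rng = random.Random(42)
n = 20000
true_beta = 2.0

z, x, y = [], [], []
for _ in range(n):
    zi = rng.gauss(0, 1)             # instrument, e.g. a past locational attribute
    ui = rng.gauss(0, 1)             # unobserved factor driving both x and y
    xi = 0.8 * zi + ui + rng.gauss(0, 1)
    yi = true_beta * xi + 3.0 * ui + rng.gauss(0, 1)
    z.append(zi)
    x.append(xi)
    y.append(yi)

ols_slope = cov(x, y) / cov(x, x)    # biased: x is correlated with the error (3u + noise)
iv_slope = cov(z, y) / cov(z, x)     # IV estimator with a single instrument
first_stage = cov(z, x) / cov(z, z)  # instrument strength: slope of x on z
```

In this setup OLS overstates the effect because `u` moves `x` and `y` together, while the IV ratio recovers a value near the true slope. A `first_stage` slope close to zero would signal a weak instrument, in which case the IV ratio becomes unstable.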


Using Fine-level Geography

• Make application to BLS

• CHRR can often create the variable for you if it does not threaten re-identification

• Rounding data reduces precision and reduces threat of re-identification of tract, block group or zipcode

• Do the analysis at BLS or CHRR


DIFFUSION OF THE SAMPLES:

PSU Clusters in original NLSY97 Sample

But this clustering has broken down over time. Here is where people live as of Round 6 in NLSY97


PSU Clusters in original NLSY79 Sample 12,000

By Round 20 in NLSY79 Sample there is even more geographic dispersion. 9,000


Example of Segment Clustering

• In the NLSY97 a cluster of respondents was picked from the Lower East Side of Manhattan and a cluster from around Yankee Stadium.


Implications of Sample Design for Routine Data Use

• All NLS samples contain oversamples of Blacks, and the NLSYs also oversample Hispanics. The NLSY79 oversamples of poor whites and military members have been discontinued.

• NLSY looks different from a Simple Random Sample

• Clusters of R’s may share unobservable characteristics


Weighting

• Weight summary statistics to describe population

• For regressions – Gauss Markov rules

– OLS is BLUE under standard conditions, including correct specification of the model

– Model heterogeneity does not call for weighted regression but rather weighting the various regression coefficients


Weighting – Horn of the Dilemma

[Figure: Locus of Coefficients with Weighted Regression; axes B1 and B2]


Using Weights

• Weights for the NLSY97 Round 7 range from a high of 1,785,202 to a low of 90,060 – two implied decimal places

– One respondent represents from 900 to 17,852 people, average is about 2,500

– Zero weights indicate person not interviewed

• NLSY97 and NLSY79 have single round weights representing population in 1997 and 1978 – not immigrants since screening

• NLSY97 has weights for cross section (no oversamples) as well as “panel” weights


An NLSY79 Example From 1994

• Blacks and Hispanics on average have lower wages than whites (see WeightingWageData.Sas).

• Unweighted

– Mean Wage $12.50 per hour

– Median Wage $10.15 per hour

• Weighted with 1994 sample weight (R50804.00) to correct for oversampling

– Mean Wage $13.60 per hour

– Median Wage $11.10 per hour

• Weighting increases average wage by roughly $1.00 per hour
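The weighted summary statistics above can be sketched in a few lines of Python. The wages and weights below are invented for illustration (they are not the NLSY79 values), and the only fact taken from the slides is that raw weights carry two implied decimal places, so they are divided by 100 before use:

```python
def weighted_mean(values, weights):
    """Mean of values, each counted in proportion to its weight."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

def weighted_median(values, weights):
    """Smallest value at which cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value

# Hypothetical wages ($/hour) and raw weights stored with two implied decimals
wages = [8.00, 10.00, 12.00, 20.00]
raw_weights = [90060, 90060, 250000, 1785202]
weights = [w / 100 for w in raw_weights]    # 90060 -> 900.60 people represented

unweighted = sum(wages) / len(wages)
weighted = weighted_mean(wages, weights)
median_w = weighted_median(wages, weights)
```

In this toy data the high earner carries the largest weight, so the weighted mean sits well above the unweighted one. Respondents with a zero weight (not interviewed) contribute nothing to either weighted statistic.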


How Do I Weight Multiple Years?

NLS has a custom weighting program that provides users with the ability to go beyond weighting just a single round

– Web Version: http://www.nlsinfo.org/web-investigator. Allows you to weight a set of survey rounds.

– PC-SAS Version: Allows you to use the code that runs the web version on your own PC. Enables you to weight any set of respondent ids. This allows you to take into account event history data and item non-response. This is a powerful tool.


Web Version


PC-SAS Custom Weight Program

• Contact NLS User Services. They will send you a pair of PC-SAS programs, a set of data files and an input file. Jay Zagorsky at CHRR will help you.

• You must be comfortable making minor modifications to SAS programs and must have SAS installed on your computer.

• Program takes as input a sorted list of ids, one id per line. Program produces same output as web version

• This program allows you to weight data from an event history or other complex designs
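Preparing the input for the PC-SAS program means producing a sorted list of ids, one per line. A minimal Python sketch of that step; the slides specify only "sorted, one id per line", so the de-duplication, the integer conversion and the file name in the comment are assumptions:

```python
def id_list_text(respondent_ids):
    """Return unique ids in ascending order, one per line."""
    unique_sorted = sorted(set(int(i) for i in respondent_ids))
    return "\n".join(str(i) for i in unique_sorted) + "\n"

# Hypothetical ids for the respondents in your analysis sample
text = id_list_text([305, 12, 4077, 12])

# To save it (the file name is an assumption, not a requirement of the program):
# with open("ids.txt", "w") as f:
#     f.write(text)
```

The ids would typically be those respondents who meet your own inclusion rule, for example everyone with a complete event history over the rounds you analyze.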


Clustering & Standard Errors

• NLS has numerous clusters of respondents who are alike: same person in different rounds, siblings, people in same neighborhood

• Clustering means not all observations are independent (not i.i.d.) – heterogeneity across persons and families plus spatial correlation

• PSU clustering is more of a problem than family clustering for variances. The design effect is d.e. = 1 + p(k-1), where p is the intra-cluster correlation and k is the cluster size; adjust standard errors by sqrt(d.e.). Large k produces problems – PSU clusters are larger than families. But same person in different rounds means large p.
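The design effect formula is easy to evaluate directly. In the Python sketch below, the cluster size of 50 comes from the "about 50 respondents per PSU" figure earlier in the deck, while the correlation values 0.05 and 0.40 are assumed purely for illustration:

```python
import math

def design_effect(p, k):
    """d.e. = 1 + p(k - 1): p is intra-cluster correlation, k is cluster size."""
    return 1 + p * (k - 1)

def adjusted_se(naive_se, p, k):
    """Inflate an i.i.d. standard error by sqrt(d.e.)."""
    return naive_se * math.sqrt(design_effect(p, k))

def effective_n(n, p, k):
    """Number of independent observations the clustered sample is worth."""
    return n / design_effect(p, k)

# PSU-sized cluster (about 50 respondents) with a modest assumed correlation
de_psu = design_effect(p=0.05, k=50)       # 1 + 0.05 * 49 = 3.45
# Family-sized cluster with a much higher assumed correlation
de_family = design_effect(p=0.40, k=2)     # 1 + 0.40 * 1 = 1.40
```

Even a small correlation inflates variances substantially when clusters are large, which is why PSU clustering tends to matter more than family clustering. In the limiting case p = 1, `effective_n(9000, 1.0, 50)` is 180, the number of clusters.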


Clustering & Standard Errors (cont.)

• If intra-cluster correlations are high, number of effective observations = number of clusters, not number of observations

• OLS is still consistent and unbiased – must use GLS for correct standard errors

• Design effects in regressions are perhaps better described as misspecification effects as the intracluster correlation is due to unobserved variables affecting the cluster


Household Clustering

• NLSY97: 4,027 respondents came from homes that had multiple respondents.

– There were six homes that each provided five respondents.

• NLSY79: 5,914 respondents came from homes that had multiple respondents.

– There were four homes that each provided six respondents.

• Data on siblings allows us to separate effects of household versus individual characteristics

• For original cohorts, refer to multiple respondent file to detect parents and children across cohorts and siblings both within and across cohorts


Effect of Clustering on Std Errors

• NLSY79 to explain log of male hourly wage

• Regress log hourly wages on race, age, education, AFQT score and marital status. Details are in WageData.sas

$Y = XB + u_i + v_{ij} + w_{ijk} + z_{ijkt}$

$u_i$ is the error component for PSU $i$, $v_{ij}$ is the component for PSU $i$ and family $j$, $w_{ijk}$ is the component for PSU $i$, family $j$ and person $k$, and $z_{ijkt}$ is the idiosyncratic error for person $k$ in round $t$.


OLS Results From SAS

• Results using OLS with SAS. Note high T-values.

Variable      Coefficient   T Value
Constant      5.21          133
Black         -0.16         18
Hispanic      -0.04         4.4
Age           0.02          18
High Grade    0.06          38
AFQT          0.001         16
Married       0.19          26


How To Fix Problem

• There are at least two statistical packages designed to fix the clustering problem.

– Sudaan (www.rti.org/sudaan) is a special purpose package designed to fix clustering issues. Integrates with SAS.

– Stata (www.stata.com) is a general purpose statistical program. To adjust for clustering for means use the “Svyset” command; for regression use “robust cluster” (Huber-White).

• No clustering data available for Original Cohorts
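What a Huber-White cluster correction does can be shown without either package. The Python sketch below is a toy under stated assumptions: simulated data, a single regressor that is constant within each PSU, and no finite-sample correction. It is not the Sudaan or Stata implementation, and the function name is hypothetical:

```python
import math
import random

def slope_with_ses(x, y, cluster):
    """OLS slope of y on x (with intercept), plus naive and cluster-robust SEs."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    xd = [xi - xbar for xi in x]
    sxx = sum(d * d for d in xd)
    b = sum(d * (yi - ybar) for d, yi in zip(xd, y)) / sxx
    a = ybar - b * xbar
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    # Naive (i.i.d.) standard error
    s2 = sum(e * e for e in resid) / (n - 2)
    naive = math.sqrt(s2 / sxx)
    # Cluster-robust (sandwich) SE: add up score contributions within each cluster first
    score = {}
    for d, e, g in zip(xd, resid, cluster):
        score[g] = score.get(g, 0.0) + d * e
    robust = math.sqrt(sum(s * s for s in score.values())) / sxx
    return b, naive, robust

rng = random.Random(1)
x, y, cluster = [], [], []
for g in range(100):                       # 100 hypothetical PSUs
    x_g = rng.gauss(0, 1)                  # regressor shared by everyone in the PSU
    u_g = rng.gauss(0, 1)                  # unobserved PSU-level error component
    for _ in range(50):                    # about 50 respondents per PSU
        x.append(x_g)
        y.append(1.0 + 0.5 * x_g + u_g + rng.gauss(0, 2))
        cluster.append(g)

b, naive_se, robust_se = slope_with_ses(x, y, cluster)
```

The slope estimate is the same either way; only the standard error changes. Because the regressor and part of the error are both shared within a PSU, the cluster-robust standard error comes out several times larger than the naive one, so the t-value falls accordingly.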


OLS Results From Sudaan

Here we correct for the survey’s clustering on PSU (not on person or family)

Variable      Coefficient   SAS’s T Value   Sudaan’s T Value
Constant      5.21          133             83
Black         -0.16         18              8.8
Hispanic      -0.04         4.4             1.5
Age           0.02          18              13.4
High Grade    0.06          38              18.7
AFQT          0.001         16              8.3
Married       0.19          26              15.8


What Happened?

• Adjusting for clustering using Sudaan resulted in most of the T-values falling by half. Most are still highly significant.

• The Hispanic variable, which was highly significant in the SAS results (Pr < 0.0001), is no longer statistically significant (Pr < 0.15) at most commonly used levels. (The problem is more severe for characteristics that are themselves clustered.)
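Since the coefficients are identical in the two runs, the ratio of t-values equals the ratio of standard errors, and its square gives a rough implied design effect for each variable. A short Python sketch using the t-values from the two result tables; the values are rounded as printed, so the results are approximate:

```python
# (SAS t-value, Sudaan t-value) pairs copied from the two result tables
t_values = {
    "Constant": (133, 83),
    "Black": (18, 8.8),
    "Hispanic": (4.4, 1.5),
    "Age": (18, 13.4),
    "High Grade": (38, 18.7),
    "AFQT": (16, 8.3),
    "Married": (26, 15.8),
}

def implied_design_effect(t_naive, t_clustered):
    """Same coefficient, so the t ratio is the SE ratio; squaring gives a variance ratio."""
    return (t_naive / t_clustered) ** 2

implied = {name: implied_design_effect(*ts) for name, ts in t_values.items()}
largest = max(implied, key=implied.get)    # variable hit hardest by clustering
```

The Hispanic coefficient shows the largest implied variance inflation, consistent with the remark that the problem is worse for characteristics that are themselves geographically clustered.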


What Steps Are Needed To Adjust?

• First, get geocode clearance. You need this clearance to access replicate and PSU data.

• Second, extract all variables for your research plus the replicate and PSU values.

– NLSY79: The PSU variable is R02191.45, titled “Stratum Number For Primary Sampling Units” and the replicate variable is R02191.46, titled “Within Stratum Replicate Of Primary Sampling Unit.” PSU=10*R02191.45+R02191.46

– NLSY97: The PSU variable is R13082.00, titled “PRIMARY SAMPLING UNIT (CODED).” The replicate variable is not released. Set replicate=1 in your work.
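The NLSY79 PSU construction and the sort that comes before running Sudaan can be sketched as follows. The formula is the one given above; the extract rows, field names and function name are hypothetical:

```python
def nlsy79_psu(stratum, replicate):
    """PSU = 10 * R02191.45 + R02191.46, as given on the slide."""
    return 10 * stratum + replicate

# Hypothetical extract rows: (respondent id, stratum R02191.45, replicate R02191.46)
rows = [(3, 12, 2), (1, 7, 1), (2, 12, 1)]

records = [
    {"id": rid, "replicate": rep, "psu": nlsy79_psu(strat, rep)}
    for rid, strat, rep in rows
]
# Sort by replicate, then PSU, before handing the file to Sudaan
records.sort(key=lambda r: (r["replicate"], r["psu"]))
```

For the NLSY97 the same sort applies with the replicate set to 1 for every record, since the replicate variable is not released.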


What Steps Are Needed To Adjust?

• Third, sort your data set by replicate and PSU.

• Fourth, run Sudaan. We used the following command.

    Proc Regress data="C:\Documents and Settings\All Users\Desktop\ClusteringandWeighting\WageData.dbs"
        filetype=ascii design=wr DEFT1 est_no=24000;
    weight _ONE_;
    nest REPLICAT PSU / MISSUNIT;
    Model Ln_Pay = Black Hispanic Age HGC AFQT Marry;


Small Extension

• The SAS file we used to create the previous example is called WageData.sas.

• What happens when we add one more explanatory variable, “height in inches”?

• Adding this variable investigates if taller people earn higher wages.

• The created variable “height” is already part of the SAS data set.


Extension Results

• Using SAS the OLS regression results show height’s coefficient is 0.004 and the t-value is 3.34.

• In simple language this means each extra inch of height is associated with a 0.4% increase in hourly wages. The 3.34 T-value shows the coefficient is significant at the 99.9% level, suggesting height and wages are definitely related.

• Using Sudaan to take into account clustering lowers the T-value to 2.0. Sudaan puts the significance of the height coefficient at the 95% level. Hence, adjusting for clustering means we no longer have almost complete statistical certainty in the relationship.


What If You Do Not Have Sudaan (or Stata)?

• One method of getting roughly similar results is to add extra geographic variables which track each PSU’s characteristics to the regression.

• Using just SAS we reran the wage function and included for each respondent’s 1979 location: percent black, percent Hispanic, median income, whether the respondent resided in an SMSA of 2+ million people and dummies for USA regions (see the file named WageDataPlusGeoVariables.sas).

• Note we get results much like Sudaan just using location characteristics that are 20 years old.


Result of Adding Geographic Variables

• The left 3 columns are the original wage equation. The right 2 columns are the results after adding the geographic variables. Like Sudaan regressions, adding the extra geographic indicators dramatically lowers the T-statistics.

Variable      Original Coefficient   Orig. T Value   New Coefficient   New T Value
Constant      5.21                   133             4.42              39.7
Black         -0.16                  18              -0.22             10.4
Hispanic      -0.04                  4.4             -0.065            3.1
Age           0.02                   18              0.02              9.5
High Grade    0.06                   38              0.06              18.2
I.Q.          0.001                  16              0.0015            9.2
Married       0.19                   26              0.22              15.1


Children of NLSY & YAG

• Children had already diffused by their teen years relative to their mothers

• Clustering is more subtle, includes kin networks

• Appropriateness of Sudaan & other routines more problematic

• The two alternatives are a complex random effects model or using geographic descriptors to explain the error components responsible for the design effect problem


Bottom Line

• GIS systems have become essential to social scientists. The NLSYs have a lot of data on location, but use is restricted.

• The oversamples and clustering of the NLSYs require you to think carefully about the impact of heterogeneity, weighting and clustering on your analysis. Weighting is not usually correct except when estimating univariate population moments.

• Using geographic descriptors as regressors attenuates design effects.

• For original cohorts, geography is very limited.


Q&A on Core Material

Audrey Light, Amanda McClain and Steve McClaskie