38
1 U.S. BUREAU OF LABOR STATISTICS bls.gov Exploring the Application of Machine Learning Techniques to Construct R-indicators Arcenis Rojas & Lucilla Tan Division of Consumer Expenditure Surveys 2019 AAPOR Conference May 17, 2019

Exploring the Application of Machine Learning Techniques

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Exploring the Application of Machine Learning Techniques

1 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Exploring the Application of Machine Learning Techniques

to Construct R-indicators

Arcenis Rojas & Lucilla Tan

Division of Consumer Expenditure Surveys2019 AAPOR Conference

May 17, 2019

Page 2: Exploring the Application of Machine Learning Techniques

2 — U.S. BUREAU OF LABOR STATISTICS • bls.gov2 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Consumer Expenditure Survey (CE)

Page 3: Exploring the Application of Machine Learning Techniques

3 — U.S. BUREAU OF LABOR STATISTICS • bls.gov3 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Motivation Problem: Non-response bias violates many

assumptions that are made during the sampling procedure and can lead to biased survey estimates.

Potential Solution: Develop an indicator of representativeness of the respondent pool while data collection is still on-going.

This can inform the allocation of recruitment resources of under-represented groups.

Page 4: Exploring the Application of Machine Learning Techniques

4 — U.S. BUREAU OF LABOR STATISTICS • bls.gov4 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Motivation Develop a representativeness indicator for a

specific expenditure category (food) to monitor the representativeness of the respondent pool with respect to select characteristics over the course of data collection during the survey year.

We’re looking for variables that are associated with both food expenditures and survey participation.

Page 5: Exploring the Application of Machine Learning Techniques

5 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Background

Representativeness Indicator (R-Indicator): Measures the risk of potential non-response bias based on weighted, estimated propensities of response.

R(x) = 1-2* S( 𝜌𝑖 )

S( 𝜌𝑖 ) = 1

𝑁−1 𝑖=1

𝑁 𝑑𝑖 ( 𝜌𝑖 − 𝜌 )2

di : design weight for sample unit i

𝜌 ∶ weighted sample average of

response propensities

𝜌𝑖 : the estimated response

propensity for unit i

Page 6: Exploring the Application of Machine Learning Techniques

6 — U.S. BUREAU OF LABOR STATISTICS • bls.gov6 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Background

Sample Design

Data Collection

Production

Data Quality

R- Indicator

Page 7: Exploring the Application of Machine Learning Techniques

7 — U.S. BUREAU OF LABOR STATISTICS • bls.gov7 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Food Model

x1 + x2 + x3...

- - -- - -- - -- - -

PDB [TEST]

Food ($) Predictions

$$$$

$$$$

Food ($) [TEST]

Accuracy Check

Food Variables

Survey Response Model

x1 + x2 + x3...

- - -- - -- - -- - -

CHI + CEQ + PDB{FOOD} [TEST]

Response [TEST]

1010

CHI + CEQ + PDB{FOOD} [TRAIN]

- - - 1- - - 0- - - 1- - - 0

PDB [TRAIN]

- - - $- - - $- - - $- - - $

1010

Predicted Class

%%%%

Response Propensity

Accuracy Check

Survey Response Variables

The Pipeline

Page 8: Exploring the Application of Machine Learning Techniques

8 — U.S. BUREAU OF LABOR STATISTICS • bls.gov8 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Data Description 2015 Consumer Expenditure Interview Survey

(CEQ)

CE Interview Survey Contact History Instrument (CHI)

The 2015 Census Planning Database (PDB) Geographic aggregation: tract-level (2010 Census boundaries)

Incorporates the 2009-2013 American Community Survey (ACS) five-year estimates

Latest data available at the time study started

Page 9: Exploring the Application of Machine Learning Techniques

9 — U.S. BUREAU OF LABOR STATISTICS • bls.gov9 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Data Description Initial sample size: 36,226

After data cleaning: 32,255

21,546 (66.8%) were survey participants

Initial number of PDB variables: 114

After eliminating highly correlated and near-zero variance PDB variables: 54

CHI variables: 2

CEQ variables: 6

Page 10: Exploring the Application of Machine Learning Techniques

10 — U.S. BUREAU OF LABOR STATISTICS • bls.gov10 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Data Description Data covered 5 periods over 2015

Feb

Feb-Mar

Feb-Jun

Feb-Sept

Feb-Dec

We could compute an R-Indicator over time

Continuous monitoring is the motivation to build a repeatable process

Page 11: Exploring the Application of Machine Learning Techniques

11 — U.S. BUREAU OF LABOR STATISTICS • bls.gov11 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Selecting Algorithms

Desired model characteristics:

High prediction accuracy

Dimension reduction

Interpretability

Smooth propensity distribution

Page 12: Exploring the Application of Machine Learning Techniques

12 — U.S. BUREAU OF LABOR STATISTICS • bls.gov12 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Selecting Algorithms

Classification Tree

Random Forest

Logistic Regression LASSO

Prediction Accuracy

Dimension Reduction

Interpretability

Smooth Propensity Distribution

Page 13: Exploring the Application of Machine Learning Techniques

13 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Classification Tree

Page 14: Exploring the Application of Machine Learning Techniques

14 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Classification Tree

Easy to interpret

Good dimension reduction

Propensities are “chunky”

Cannot always handle missing values

Sensitive to tuning parameter specification

Pros: Cons:

Page 15: Exploring the Application of Machine Learning Techniques

15 — U.S. BUREAU OF LABOR STATISTICS • bls.gov15 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Selecting Algorithms

Classification Tree

Random Forest

Logistic Regression LASSO

Prediction Accuracy

Dimension Reduction

Interpretability

Smooth Propensity Distribution

Page 16: Exploring the Application of Machine Learning Techniques

16 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Random Forest

Page 17: Exploring the Application of Machine Learning Techniques

17 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Random Forest

High accuracy

Great for dimension reduction

Interpretation not as clear as other models

Easily biased if not properly tuned

Cannot handle missing values

Pros: Cons:

Page 18: Exploring the Application of Machine Learning Techniques

18 — U.S. BUREAU OF LABOR STATISTICS • bls.gov18 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Selecting Algorithms

Classification Tree

Random Forest

Logistic Regression LASSO

Prediction Accuracy

Dimension Reduction

Interpretability

Smooth Propensity Distribution

Page 19: Exploring the Application of Machine Learning Techniques

19 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Logistic Regression

Easy to interpret

Propensities are smooth

Good for explaining variance

No dimension reduction

Does not necessarily predict well

Pros: Cons:

Page 20: Exploring the Application of Machine Learning Techniques

20 — U.S. BUREAU OF LABOR STATISTICS • bls.gov20 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Selecting Algorithms

Classification Tree

Random Forest

Logistic Regression LASSO

Prediction Accuracy

Dimension Reduction

Interpretability

Smooth Propensity Distribution

Page 21: Exploring the Application of Machine Learning Techniques

21 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

LASSO

Min Lambda Lambda 1SE

Page 22: Exploring the Application of Machine Learning Techniques

22 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Least Absolute Shrinkage and Selection Operator (LASSO)

High prediction accuracy

Easy interpretation

Great for dimension reduction

Coefficients do not necessarily indicate the magnitude of an effect.

Narrative may not be intuitive

Pros: Cons:

Page 23: Exploring the Application of Machine Learning Techniques

23 — U.S. BUREAU OF LABOR STATISTICS • bls.gov23 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Selecting Algorithms

Classification Tree

Random Forest

Logistic Regression LASSO

Prediction Accuracy

Dimension Reduction

Interpretability

Smooth Propensity Distribution

Page 24: Exploring the Application of Machine Learning Techniques

24 — U.S. BUREAU OF LABOR STATISTICS • bls.gov24 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Food Model

x1 + x2 + x3...

- - -- - -- - -- - -

PDB [TEST]

Food ($) Predictions

$$$$

$$$$

Food ($) [TEST]

Accuracy Check

Food Variables

Survey Response Model

x1 + x2 + x3...

- - -- - -- - -- - -

CHI + CEQ + PDB{FOOD} [TEST]

Response [TEST]

1010

CHI + CEQ + PDB{FOOD} [TRAIN]

- - - 1- - - 0- - - 1- - - 0

PDB [TRAIN]

- - - $- - - $- - - $- - - $

1010

Predicted Class

%%%%

Response Propensity

Accuracy Check

Survey Response Variables

The Pipeline

Page 25: Exploring the Application of Machine Learning Techniques

25 — U.S. BUREAU OF LABOR STATISTICS • bls.gov25 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

LASSO ExplainedThe lasso regression coefficient estimates are obtained by solving the optimization problem that can be generally characterized as:

𝛽𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒 𝑖=1

𝑛 𝑦𝑖 − 𝛽0 − 𝑗=1𝑝

𝛽𝑗𝑥𝑖𝑗

2+ 𝜆 𝑗=1

𝑝|𝛽𝑗|

where ≥ 0 is the shrinkage parameter that controls the relative impact of the two terms. The effect of the penalty is to get a more generalized (than a strict) fit to the data to minimize over-fitting.

Page 26: Exploring the Application of Machine Learning Techniques

26 — U.S. BUREAU OF LABOR STATISTICS • bls.gov26 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Training the LASSO Split the data 50-50 into a training set and a

testing set

Used 10-fold cross validation to find the best shrinkage parameter () and used a mixing parameter of = 1

Selected the largest lambda within one standard error of the minimum cross-validation mean-standard error, which we call “Lambda 1SE”

Ran LASSO

Page 27: Exploring the Application of Machine Learning Techniques

27 — U.S. BUREAU OF LABOR STATISTICS • bls.gov27 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Predictors of Food Expenditure Covariates for Food Expenditure:

Started with 54 variables from the PDB

After running LASSO with “Lambda 1SE” we were left with 2 variables with non-zero coefficients:

– Average Household Income (PDB)

– Average House Value (PDB)

Page 28: Exploring the Application of Machine Learning Techniques

28 — U.S. BUREAU OF LABOR STATISTICS • bls.gov28 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Predicting Survey Participation Inputs:

– Average Household Income (PDB)

– Average House value (PDB)

– Census Region (CEQ)

– Dwelling Unit Structure Type (CEQ)

– Household Size (CEQ)

– Homeowner / Renter (CEQ)

– Urbanicity (CEQ)

– Survey Wave (CEQ)

– Number of Contact Attempts (CHI)

– Ever Changed Interviewer (CHI)

Page 29: Exploring the Application of Machine Learning Techniques

29 — U.S. BUREAU OF LABOR STATISTICS • bls.gov29 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Predicting Survey Participation After running LASSO with “Lambda 1SE” we

were left with 6 variables with non-zero coefficients:

– Average Household Income (PDB)

– Household Size (CEQ)

– Homeowner / Renter (CEQ)

– Urbanicity (CEQ)

– Number of Contact Attempts (CHI)

– Ever Changed Interviewer (CHI)

Page 30: Exploring the Application of Machine Learning Techniques

30 — U.S. BUREAU OF LABOR STATISTICS • bls.gov30 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Model Accuracy Baseline was the unregularized GLM Logistic

modelUnit response model: model performance comparison using Test subsample with model

parameters estimated on Train subsample

Model predictors Regression Proportion prediction

accuracy (cut-off value prob

>0.5)*

Area under the

ROC

1SE- regularized (6

predictors)

GLM logistic 0.7261 0.699

Unregularized (10

predictors)

GLM logistic 0.7263 0.703

* Units with predicted probabilities >0.5 were classified as respondents.

Page 31: Exploring the Application of Machine Learning Techniques

31 — U.S. BUREAU OF LABOR STATISTICS • bls.gov31 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Period 5 Final Model

Predictors Coeff SE p-value

(Intercept) 1.414 0.041 0.000

Household income -2.79E-06 3.49E-07 0.000

No. contact attempts -0.089 0.004 0.000

HH size - one -0.088 0.031 0.005

HH size - three 0.369 0.037 0.000

HH size – 4+ 0.214 0.040 0.000

Changed interviewer -1.508 0.034 0.000

Renter 0.179 0.028 0.000

Rural area 0.504 0.051 0.000

* Prediction Accuracy = 72.8%

Page 32: Exploring the Application of Machine Learning Techniques

32 — U.S. BUREAU OF LABOR STATISTICS • bls.gov32 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Period 5 Model Propensity Distribution

Page 33: Exploring the Application of Machine Learning Techniques

33 — U.S. BUREAU OF LABOR STATISTICS • bls.gov33 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

R-Indicators by Period and Model

Period Months

of Data

Classification

Tree

Logistic

Regression

LASSO

Regression

1 1 0.730 0.668 0.734

2 2 0.702 0.658 0.729

3 5 0.690 0.632 0.682

4 8 0.671 0.617 0.653

5 11 0.678 0.622 0.657

Page 34: Exploring the Application of Machine Learning Techniques

34 — U.S. BUREAU OF LABOR STATISTICS • bls.gov34 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

R-Indicators

Page 35: Exploring the Application of Machine Learning Techniques

35 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Lessons Learned

Distribution of estimated unit response propensities with variable Language included as a predictor (Accuracy > 98%)

Page 36: Exploring the Application of Machine Learning Techniques

36 — U.S. BUREAU OF LABOR STATISTICS • bls.gov36 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Lessons Learned No. sample units with missing value for

LANGUAGE: 10,536

No. of survey non-respondents = 10,393

=> 98.7% of sample units with missing value for LANGUAGE were non-respondents

KNOW YOUR DATA!!!

Page 37: Exploring the Application of Machine Learning Techniques

39 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Page 38: Exploring the Application of Machine Learning Techniques

Contact Information

40 — U.S. BUREAU OF LABOR STATISTICS • bls.gov

Division of Consumer Expenditure Surveys

www.bls.gov/cex

Arcenis Rojas

Economist

202-691-6884

[email protected]

Lucilla Tan

Senior Economist

202-691-5128

[email protected]