Data Editing United Nations Statistics Division (UNSD) 16 March 2011 Santiago, Chile


Data Editing

United Nations Statistics Division (UNSD)

16 March 2011, Santiago, Chile


Editing and Imputation Defined

• Data editing: Identification and flagging of missing, invalid, inconsistent or anomalous entries

• Imputation: Resolves problems identified in editing


Editing and Imputation Process Flow

[Figure: three-step process flow diagram]


A General Editing and Imputation Process

1. Identify and treat initial errors
  • At the data capture stage
  • At the data entry stage
    • Ex: Data entered into a table is shifted by a row

2. Identify and treat errors
  a. Interactively/manually treat influential errors
  b. Automatically treat non-influential errors

3. Check the aggregated output



Editing Errors

• Two categories of errors
  – Systematic – reported consistently by some of the respondents
    • Ex: Gross values are reported instead of net values
    • Ex: Units are reported in thousands
  – Random – non-systematic or caused by accident
    • Ex: An extra digit is accidentally typed in the response

• Manifestations of errors can be systematic or random
  – Missing
    • Ex: A variable is left blank because the respondent does not know the answer to the question, does not want to answer the question or does not understand the question
  – Outliers – values that deviate from a model
    • Ex: Unanticipated large values as compared to historic trend
  – Violation of logical or consistency rules
    • Ex: A total value is larger than the sum of its components

• Edit rules are used to detect errors and often define how they should be treated


Systematic Errors

• Errors that are reported consistently over time
  – Unit error
    • Ex: $x_{t-1} / x_t \le 300$
  – Sign error
  – Bugs in the collection vehicle
  – Misunderstanding a question or skip rules
    • Ex: systematic missing values

• Detection
  – High failure rates of edits
  – Outlier detection (e.g. for unit errors)
  – Knowledge of the survey and the raw data processing
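The unit-error example above can be sketched as a small check. This is only an illustration: the function name is hypothetical, the bound of 300 is taken from the slide, and applying the bound in both directions (so that values reported a thousand times too large are caught as well as values a thousand times too small) is an assumption, not something the slide states.

```python
def flag_unit_error(previous, current, max_ratio=300.0):
    """Return True when previous/current suggests a possible unit error.

    Assumption: the ratio is tested in both directions, i.e. the value
    passes only if 1/max_ratio <= previous/current <= max_ratio.
    """
    if previous <= 0 or current <= 0:
        # The ratio is undefined or meaningless; leave it to other edits.
        return False
    ratio = previous / current
    return ratio > max_ratio or ratio < 1.0 / max_ratio


# Last period reported in units, this period apparently in thousands.
suspect = flag_unit_error(previous=1_200_000, current=1_250)
# An ordinary period-to-period movement.
normal = flag_unit_error(previous=1_200, current=1_250)
```

A high share of flagged records for one variable would point to a systematic problem in the questionnaire rather than to isolated mistakes.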


Systematic Errors (2)

Suggestions
• Improvements in the survey or processing procedures should be made
• When systematic errors are identified, they should be turned into edit rules
• Detecting and correcting is cost effective
• Should be treated before random errors


Missing Values

• Stem from questions a respondent did not answer
• Detection is usually simple

Suggestions
• Do not ignore missing values (→ bias and loss of estimate precision)
  – Missing values may not be missing at random
• Do not replace with zeros (→ inaccurate results)
• Nonresponse indicators should be compiled and analyzed because missing values may be systematic
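One possible way to compile a simple nonresponse indicator is an item nonresponse rate per variable. The sketch below assumes records are held as dictionaries and that an unanswered item is stored as `None` (never as zero); the function name and the sample data are hypothetical.

```python
def item_nonresponse_rates(records, variables):
    """Share of records in which each variable is missing (None)."""
    rates = {}
    for var in variables:
        missing = sum(1 for record in records if record.get(var) is None)
        rates[var] = missing / len(records)
    return rates


# Hypothetical responses; None marks an unanswered item.
records = [
    {"employees": 12, "wages": 340.0},
    {"employees": None, "wages": 120.0},
    {"employees": 40, "wages": None},
    {"employees": None, "wages": 95.0},
]
rates = item_nonresponse_rates(records, ["employees", "wages"])
```

Comparing such rates across variables, respondent groups or collection modes is one way to see whether the missing values look systematic.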


Outliers

• Observations that do not fit well to a model
  – Ex: Median - k*IQR < value < Median + k*IQR
  – Ex: Month-on-month change <= 50%

• May be defined by one variable (univariate) or a set of variables (multivariate)

• Two types
  – Representative: correct, with similar units in the population
  – Non-representative: either incorrect, or correct but unique
    • Ex: correct – isolated labor strike at a plant
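The median and IQR rule above can be sketched with the Python standard library. The function name, the choice k = 3 and the sample values are assumptions for illustration; the quartiles use the default method of `statistics.quantiles`, which is one of several common quartile definitions.

```python
import statistics


def iqr_outliers(values, k=3.0):
    """Return values outside Median - k*IQR < value < Median + k*IQR."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = median - k * iqr, median + k * iqr
    return [v for v in values if not (low < v < high)]


# Hypothetical monthly values for one unit; 95 is far from the rest.
flagged = iqr_outliers([10, 12, 11, 13, 12, 11, 95], k=3.0)
```

The rule only flags a value; deciding whether it is a representative or non-representative outlier still needs subject matter review.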


Outliers (2)

• Detection
  – Univariate
  – Multivariate
  – Periodic data (e.g. Hidiroglou-Berthelot)
  – Regression models or tree-models


Edit Rules

• Edit rules are used to determine whether a value is consistent or may be erroneous
  – Surveys are often created to allow these rules

• Edit rules flag data in two ways
  – Fatal edit – indicates a value that is (almost) certainly in error
  – Query edit – indicates values that may be in error
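The distinction between fatal and query edits can be carried as a label on each rule. The sketch below is one possible arrangement, not a prescribed design: the `EditRule` class, the two rules and the bound of 500 for turnover per employee are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EditRule:
    name: str
    passes: Callable[[dict], bool]  # True when the record satisfies the rule
    severity: str                   # "fatal" or "query"


# Hypothetical rules; the bound of 500 is illustrative only.
RULES = [
    EditRule("employees_non_negative",
             lambda r: r["employees"] >= 0, "fatal"),
    EditRule("turnover_per_employee_plausible",
             lambda r: r["employees"] == 0
             or r["turnover"] / r["employees"] <= 500, "query"),
]


def run_edits(record, rules=RULES):
    """Return (rule name, severity) for every rule the record fails."""
    return [(rule.name, rule.severity)
            for rule in rules if not rule.passes(record)]


fatal_case = run_edits({"employees": -3, "turnover": 100})
query_case = run_edits({"employees": 2, "turnover": 5000})
```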


Types of Edit Rules

• Validation edits – often in the form of if-then statements
  – Ex: if total hours worked > 0 then employees > 0
  – Ex: if Σ production quantity > 0 then Σ production value > 0
  – Ex: if revenue from manufacturing plant > 0 then
    1. hours worked by machinery technicians > 0
    2. plant capacity utilization > 0
    3. Σ production volume > 0
    4. Σ production value > 0

• Balance edits – detail items must add to total
  – Ex: total employee remuneration = wages + salaries + employer contributions to social security + welfare benefits + profits distributed to workers
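The first validation edit and the balance edit above might be coded as follows. The field names and the rounding tolerance of 0.5 are assumptions made for the sketch; a real survey would use its own variable names and tolerance.

```python
import math

# Hypothetical field names for the remuneration components.
COMPONENTS = ["wages", "salaries", "social_security", "welfare", "profit_share"]


def validation_edit(record):
    """if total hours worked > 0 then employees > 0."""
    if record["hours_worked"] > 0:
        return record["employees"] > 0
    return True  # the condition does not apply, so the edit passes


def balance_edit(record, components=COMPONENTS,
                 total="total_remuneration", tol=0.5):
    """Detail items must add to the total, up to a rounding tolerance."""
    detail_sum = sum(record[c] for c in components)
    return math.isclose(detail_sum, record[total], rel_tol=0.0, abs_tol=tol)


balanced = {"wages": 60, "salaries": 30, "social_security": 8,
            "welfare": 1, "profit_share": 1, "total_remuneration": 100}
unbalanced = dict(balanced, total_remuneration=120)
```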


Types of Edit Rules (2)

• Ratio edits – the ratio of two data items is bounded by lower and upper bounds. The pairs should be correlated.
  – Ex: total hours/employee/day is between 6 and 10 (very correlated)
  – Ex: plant capacity utilization <= 20% change from previous month
  – Ex: the change in wages (W) should be within 10 percentage points of the change in total employment (E):
    $(E_t/E_{t-1} - 1) - 0.1 \le W_t/W_{t-1} - 1 \le (E_t/E_{t-1} - 1) + 0.1$
  – Ex: Σ product value / Σ product quantity <= 10% change from previous month
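The wages and employment example can be written directly from its inequality. The function and argument names are hypothetical; the tolerance of 0.10 comes from the slide.

```python
def wage_employment_edit(w_prev, w_curr, e_prev, e_curr, tolerance=0.10):
    """Pass when the wage change is within `tolerance` of the employment change.

    Implements (E_t/E_{t-1} - 1) - tol <= W_t/W_{t-1} - 1 <= (E_t/E_{t-1} - 1) + tol.
    Previous-period values are assumed to be positive.
    """
    wage_change = w_curr / w_prev - 1
    employment_change = e_curr / e_prev - 1
    return (employment_change - tolerance
            <= wage_change
            <= employment_change + tolerance)


# Wages +5%, employment +2%: consistent.
consistent = wage_employment_edit(1000, 1050, 100, 102)
# Wages +30%, employment +2%: flagged.
inconsistent = wage_employment_edit(1000, 1300, 100, 102)
```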


Types of Edit Rules (3)

• Hidiroglou-Berthelot is a particular type of ratio edit
  – Ex: Employee month-on-month change
    • <= 100 employees: <= 50% change from previous month
    • 100 < employees <= 200: <= 20% change from previous month
    • > 200 employees: <= 10% change from previous month

[Figure: % change from previous month (y-axis, -0.6 to 0.6) plotted against # employees (x-axis, 0 to 1000)]
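The size-dependent bounds in the example can be sketched as a step function. Note that this codes only the three size bands from the slide, not the full Hidiroglou-Berthelot transformation; using the previous month's employment to pick the band is an assumption, and the function names are hypothetical.

```python
def max_allowed_change(employees):
    """Allowed absolute month-on-month change, by size band (from the slide)."""
    if employees <= 100:
        return 0.50
    if employees <= 200:
        return 0.20
    return 0.10


def size_dependent_edit(prev_employees, curr_employees):
    """Pass when the relative change stays within the band for the unit's size.

    Assumption: the size band is chosen from the previous month's value.
    """
    if prev_employees <= 0:
        return False  # the change is undefined; route to review
    change = abs(curr_employees / prev_employees - 1)
    return change <= max_allowed_change(prev_employees)


small_ok = size_dependent_edit(50, 70)       # +40% for a small unit
medium_flag = size_dependent_edit(150, 195)  # +30% for a medium unit
large_flag = size_dependent_edit(500, 600)   # +20% for a large unit
```

Smaller units are allowed larger relative swings because a change of a few employees moves their ratio much more than it would for a large unit.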


Editing & Imputation Process

• Interactive/Manual – a record with flagged data is manually reviewed, preferably by a subject matter expert

• Automatic – a record with flagged data is automatically reviewed and corrected by a computer

• Selective – designed to route edits/imputations into interactive or automatic streams
  – Based on influential vs. non-influential errors

• Macro-editing



Selective Editing

• Distinguishes between errors in values that have a significant influence on the survey estimate and those that are insignificant to the estimate

• Selective editing splits raw data into two streams:
  – critical stream: records that most likely contain influential errors, and large companies
  – non-critical stream: records that are unlikely to contain influential errors

• A score function determines which responses go into which stream


Selective Editing (2)

• Local score function = influence × risk

• For example:
  Influence = $w_i \tilde{y}_i$
  Risk = $|y_i - \tilde{y}_i| / \tilde{y}_i$

  where
  $y_i$ = raw value
  $\tilde{y}_i$ = anticipated value
  $w_i$ = sampling weight
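A minimal sketch of a local score, assuming the example on this slide reads influence = sampling weight × anticipated value and risk = |raw value − anticipated value| / anticipated value. The function and argument names are hypothetical, and the anticipated value is assumed to be positive (for instance a previous-period or model-based value).

```python
def local_score(raw, anticipated, weight):
    """Local score = influence * risk for one variable of one record.

    Assumed forms: influence = weight * anticipated,
    risk = |raw - anticipated| / anticipated (anticipated > 0).
    """
    influence = weight * anticipated
    risk = abs(raw - anticipated) / anticipated
    return influence * risk


# Raw value 50% above the anticipated value, sampling weight 20.
score = local_score(raw=1500, anticipated=1000, weight=20)
# A raw value equal to the anticipated value carries no risk.
no_risk = local_score(raw=1000, anticipated=1000, weight=20)
```

Under these assumed forms the product reduces to the weighted absolute deviation, so a large unit with a modest relative error can outrank a small unit with a large one.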


Selective Editing (3)

• Local score functions are aggregated into global score functions for each record
  – First local scores are scaled, e.g. by dividing observed values by mean values
  – Scaled local scores are combined into a global score. For example: Minkowski metric (a common approach)

    $GS_r = \left( \sum_{i=1}^{n} LS_{r,i}^{\alpha} \right)^{1/\alpha}$

  – The influence of large local scores increases with α
    • α = 1: simple sum of local scores
    • α = 2: Euclidean metric
    • α → ∞: max local score
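The Minkowski combination of local scores can be sketched as below. The function name is hypothetical, and the local scores passed in are assumed to be non-negative and already scaled.

```python
import math


def global_score(local_scores, alpha=2.0):
    """Minkowski metric: (sum of LS_i ** alpha) ** (1 / alpha).

    alpha = 1 gives the simple sum, alpha = 2 the Euclidean metric,
    and alpha = infinity is handled as the maximum local score.
    """
    if math.isinf(alpha):
        return max(local_scores)
    return sum(s ** alpha for s in local_scores) ** (1.0 / alpha)


scores = [3.0, 4.0]
gs_sum = global_score(scores, alpha=1)
gs_euclid = global_score(scores, alpha=2)
gs_max = global_score(scores, alpha=math.inf)
```

With larger α the global score is dominated by the single worst variable in the record, which suits cases where one badly wrong value is enough to justify review.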


Selective Editing (4)

• GS cut-off threshold must be determined
  – All records above the cut-off are selected for interactive editing
  – A simulation can be performed on previous data to determine a threshold
    • Raw unedited values and corresponding edited values are used
    • The first p% of records are edited and the resultant estimate is compared with the fully edited estimate
    • Trial and error over p leads to the point where the two estimates agree, which gives the corresponding cut-off value

• Alternatively, a threshold doesn’t need to be used
  – Records can be edited in priority order until time or budget constraints tell one to stop
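The simulation on previous data might look like the sketch below. Several things are assumptions: the estimate is a weighted total, "the same" is read as agreement within a relative tolerance (1% here), records are ranked by global score, and the cut-off is taken as the score of the last record that had to be edited. All names and data are hypothetical.

```python
def find_cutoff(records, tolerance=0.01):
    """Find how many top-scored records must be edited, and the cut-off score.

    Each record is a dict with keys "raw", "edited", "weight" and "score",
    taken from a previous, fully edited survey round.
    """
    ranked = sorted(records, key=lambda r: r["score"], reverse=True)
    full = sum(r["weight"] * r["edited"] for r in ranked)
    for n_edited in range(len(ranked) + 1):
        # Use edited values for the first n_edited records, raw for the rest.
        estimate = sum(
            r["weight"] * (r["edited"] if i < n_edited else r["raw"])
            for i, r in enumerate(ranked)
        )
        if abs(estimate - full) <= tolerance * abs(full):
            # Records scoring at or above this value go to interactive editing.
            cutoff = ranked[n_edited - 1]["score"] if n_edited else None
            return n_edited, cutoff


# Hypothetical previous-round data.
history = [
    {"raw": 1000, "edited": 100, "weight": 10, "score": 9000},
    {"raw": 52, "edited": 50, "weight": 10, "score": 20},
    {"raw": 30, "edited": 30, "weight": 10, "score": 0},
    {"raw": 81, "edited": 80, "weight": 10, "score": 10},
]
n_needed, cutoff = find_cutoff(history, tolerance=0.01)
```

In this made-up round, editing the two highest-scored records is enough to bring the estimate within 1% of the fully edited one.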


Selective Editing (5)

• A score function can be augmented in many ways
  – E.g. size criteria where large enterprises are always selected for the critical stream (influence irrespective of risk)

• Selective editing improves efficiency


Macro-Editing

• Macro-editing techniques account for the distribution of variables and for the plausibility of estimates

• Two forms of macro-editing
  – Aggregation method
  – Distribution method


Macro-Editing - Aggregation

• Verify whether the figures to be published seem plausible

• Compare estimates with
  – Previous estimate values
  – Values from other related sources
  – Related estimates (such as electricity production and consumption)
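The comparison with previous estimates can be sketched as a simple plausibility check on aggregates. The function name, the series names and the 15% limit are hypothetical; a suitable limit depends on how volatile each series normally is.

```python
def implausible_aggregates(current, previous, max_rel_change=0.15):
    """Names of aggregates whose change from the previous estimate is large.

    `current` and `previous` map an aggregate name to its estimate.
    """
    flagged = []
    for name, value in current.items():
        prev = previous.get(name)
        if prev is None or prev == 0:
            continue  # nothing to compare against
        if abs(value / prev - 1) > max_rel_change:
            flagged.append(name)
    return flagged


# Hypothetical published-level estimates for two periods.
previous = {"electricity_production": 100.0, "cement": 100.0}
current = {"electricity_production": 130.0, "cement": 102.0}
to_review = implausible_aggregates(current, previous)
```

A flagged aggregate is a prompt to drill down into the contributing records, not evidence of an error in itself.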


Macro-Editing - Distribution

• Available data used to characterize distribution of variables

• Individual values are compared with this distribution

• Records that contain uncommon values may require further inspection and possibly editing


Macro-Editing Example: Graphical Editing

• Univariate plot

• Bivariate scatter plot
