Deliverable 2.8: Outliers Gary Brown Office for National Statistics UK


Outliers = Outlier detection and treatment aspects of combining data (survey/administrative) including options for various hierarchies

Overview

• Introduction
• Definitions
• Identification
• Treatment
• Recommendations

Introduction

• Deliverable 2.8 led by UK
– UK leader worked in methodology over 14 years
– Expert in Sample Design and Estimation for Business Surveys
– ... also expert in Small Area Estimation, Quality, Editing and Imputation, Time Series Analysis

• QA by Italy

Definitions

• Outliers
• Errors
• Outliers in survey data
• Outliers in administrative data
• Outliers in modelling
• ... two glossaries considered: ONS and OECD

Definitions – outliers

• OECD
“A data value that lies in the tail of the statistical distribution of a set of data values”
• ONS
“A correct response, usually an extreme value isolated from the bulk of the responses, or has a large sample weight that would have an undue influence on the estimate”

• Question 1: extreme (1) influential (2) both (3)

Definitions – errors

• Errors are incorrect values identified by edit rules
• OECD
“A logical condition or a restriction which must be met if the data is to be considered correct”
• ONS
“A rule designed to detect specific errors in data for potential subsequent correction”
• Errors are corrected before outliers are considered

• Question 2: outliers = errors (1) outliers ≠ errors (2)

Definitions – survey outliers

• In the survey context, an outlier is an unrepresentative value (influential)
• A unit sampled with probability 1/n is assumed to represent n-1 unsampled units in the population
• If the unit is unique, the assumption is invalid
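
The representation assumption can be illustrated with a minimal sketch. It assumes a simple expansion estimator (each value multiplied by its weight n), which the slides do not name, and the sample values are made up for illustration only.

```python
# Hypothetical sample of (value, weight) pairs; a unit sampled with
# probability 1/n carries weight n. All numbers are illustrative only.
sample = [(10.0, 10), (12.0, 10), (11.0, 10), (500.0, 10)]

def expansion_estimate(units):
    """Estimate a population total as the sum of value * weight."""
    return sum(value * weight for value, weight in units)

def contribution_shares(units):
    """Share of the estimated total contributed by each sampled unit."""
    total = expansion_estimate(units)
    return [value * weight / total for value, weight in units]

estimated_total = expansion_estimate(sample)
shares = contribution_shares(sample)

# The last unit is treated as if 9 unsampled units looked just like it.
# If it is in fact unique, most of the estimated total rests on an
# invalid assumption.
print(estimated_total, [round(s, 3) for s in shares])
```

In this sketch the single large unit supplies over 90% of the estimated total, which is the sense in which an unrepresentative value is influential.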

Definitions – administrative outliers

• In the administrative context, an outlier is an atypical value (extreme)
• Administrative data represent a census, so each unit is treated as unique
• No assumptions

Definitions – modelling outliers

• In the modelling context, an outlier is an influential value (influential)
• ONS
“The amount of effect a particular point has on the parameters of a regression equation”
• Influence on processing and statistical modelling
• Processing – editing
“fail if > 60% of maximum over past 5 years”
• Processing – imputation
“uplift last return by average growth in domain”
• Statistical modelling
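
The two processing examples can be sketched to show how a single extreme value influences them. The rules are only quoted in brief on the slide, so this sketch makes assumptions: the edit rule is read literally (fail if the value exceeds 60% of the maximum of the last five values), "average growth" is taken as the mean of unit-level growth ratios, and all function names and numbers are hypothetical.

```python
def edit_fails(value, history, ratio=0.6, window=5):
    """Literal reading of the quoted rule (an assumption):
    fail if value > ratio * maximum of the last `window` values."""
    recent = history[-window:]
    return value > ratio * max(recent)

def impute_uplift(last_return, domain_previous, domain_current):
    """Uplift the last return by the mean of unit-level growth ratios
    in the domain (one possible reading of 'average growth')."""
    growths = [cur / prev for prev, cur in zip(domain_previous, domain_current)]
    return last_return * sum(growths) / len(growths)

# Editing: one extreme past value raises the maximum, so the same new
# value stops failing the edit.
clean_history = [100, 110, 105, 120, 115]
outlier_history = [100, 110, 105, 1000, 115]
fails_clean = edit_fails(90, clean_history)      # 90 > 0.6 * 120
fails_outlier = edit_fails(90, outlier_history)  # 90 > 0.6 * 1000 is false

# Imputation: one extreme current value inflates the average growth.
previous = [50, 80, 100]
imputed_clean = impute_uplift(200, previous, [55, 88, 110])
imputed_outlier = impute_uplift(200, previous, [55, 88, 500])
```

Under these assumptions the extreme value changes the outcome of both processes without itself being in error, which is what makes it influential in the modelling sense.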













Identification – units

• A data warehouse stores data once for repeated use
• Each unit will have multiple values (variables/time periods), and whether any value is
– extreme depends on which other data are used
– influential depends on what process/model is estimated
• Given repeated use, it is impossible to know how data domains will be defined or which models will be fitted

every unit in a data warehouse is a potential outlier

• Question 3: yes (1) no (2) unsure (3)

Identification – uses

• Assuming all units are potential outliers
– identification becomes use dependent
– outliers are recorded as part of the metadata of an output
– outliers are not otherwise recorded in the data warehouse
• Expected data uses and examples of identification methods
– processing, eg comparing observed and expected edit failures
– updating the business register, eg comparing different sources
– survey (estimating variables & calibration weights), eg winsorisation & setting acceptable ranges
– survey/admin (modelling relationship & estimating survey), eg Cook’s distance & winsorisation
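
Of the identification methods listed, Cook's distance lends itself to a short sketch. The code below computes it for a simple linear regression of a survey variable on an administrative variable. The data are made up, and the 4/n flagging threshold is a common rule of thumb assumed here, not something the slides specify.

```python
def cooks_distances(x, y):
    """Cook's distance for each point in the regression y = a + b * x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance
    # Leverage of each point in a simple linear regression.
    leverages = [1 / n + (xi - x_mean) ** 2 / sxx for xi in x]
    return [
        (e ** 2 / (p * s2)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]

# Hypothetical administrative (x) and survey (y) values for eight units.
admin = [1, 2, 3, 4, 5, 6, 7, 8]
survey = [1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.0, 20.0]

distances = cooks_distances(admin, survey)
threshold = 4 / len(admin)  # assumed rule of thumb
flagged = [i for i, d in enumerate(distances) if d > threshold]
```

The flagged units would be recorded in the metadata of this particular model output. A different model fitted to the same warehouse data could flag different units.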

Treatment – units in uses

• Identified outliers need to be treated during use
– to prevent distortion
– by adjusting the weight of the unit to 0 < P < 100%
– balancing reduced variance against increased bias (ie MSE)
• Expected data uses and examples of treatment methods
– processing, eg use medians rather than means
– updating the business register, eg delete one source
– survey (estimating variables & calibration weights), eg winsorisation & restrict to acceptable ranges
– survey/admin (modelling relationship & estimating survey), eg delete from modelling process & winsorisation
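
Winsorisation as a treatment can be sketched as follows. The slides do not give a formula, so this assumes a commonly described one-sided form in which a value above a cutoff K is replaced by y/w + (1 - 1/w)K, where w is the unit's weight. The cutoff, weights and values are hypothetical.

```python
def winsorise(value, weight, cutoff):
    """One-sided winsorisation (assumed form): values above the cutoff
    are pulled back towards it; the unit keeps its own excess over the
    cutoff but no longer projects it onto the units it represents."""
    if value <= cutoff:
        return value
    return value / weight + (1 - 1 / weight) * cutoff

def implied_weight(value, weight, cutoff):
    """Weight that gives the same contribution when applied to the
    original, untreated value."""
    return weight * winsorise(value, weight, cutoff) / value

value, weight, cutoff = 500.0, 4.0, 100.0
treated = winsorise(value, weight, cutoff)       # 125 + 0.75 * 100
new_weight = implied_weight(value, weight, cutoff)

# Contribution to the estimated total before and after treatment.
before = weight * value
after = weight * treated
```

Treating the value is equivalent to reducing the unit's weight to a fraction of its original size. The estimate becomes less variable at the cost of some bias, and the choice of cutoff governs that trade-off.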

Recommendations

1. Neither data units nor their entries in a data warehouse should be labelled as outliers
2. Identification and treatment of outliers should be specific to each instance in which the data are used
3. Metadata on outliers should only be included in a data warehouse alongside outputs

• Question 4: agree (1) disagree (2) discuss! (3)
