Instituto Nacional de Estadística
Using Selective Editing Combined with an Automatic System in the FSS of Spain
Dolores Lorca
National Statistical Institute of Spain
Summary
• An integrated editing process that combines selective editing and the generalized edit and imputation system Banff
• We use Banff to detect the suspicious units and a score function for selective editing
• Spanish FSS: the different types of data (crop, livestock, employment) contribute to the complexity of this process
• Some results obtained from the traditional microediting approach and from selective editing are compared
Traditional microediting approach
• The subject matter expert specifies the edits
• The processing department writes tailor-made programs for each survey to detect the edit failures
• The edit failures are manually reviewed
New integrated edit and imputation process
1) Initial editing prior to selective editing
2) Selective editing procedure
3) Automatic system process (BANFF)
1) Initial editing prior to selective editing
Consistency checks are established in the data collection phase, which is carried out by interviewers
2) Selective editing procedure
Score functions are built to identify and prioritize the suspect survey units that must be reviewed manually because of their significant weight on the final estimates
3) Automatic system process (BANFF).
The automatic system process is carried out using the generalized system Banff, developed by Statistics Canada
Case study: Farm Structure Survey (FSS)
The Spanish FSS collects different types of data such as:
• Utilised agricultural land
• Cultivated Land by kind of crop
• Types of livestock
• The structure and the amount of farm employment
• Machinery and equipment
The main characteristics of the FSS
• The FSS is carried out every 2 years
• It consists of a farm panel drawn from the last Agrarian Census
• The sample design is a single-stage design with stratification of the farms according to geographical area, type of farming (TF) and size
• Data collection is carried out by interviewers
FSS Estimators: total estimate of the jth variable in stratum h
$$\hat{X}_{hj} = F_h \sum_{i=1}^{n_h} X_{hji} \qquad (1)$$
where
• $F_h$ is the sample weight for stratum h
• $n_h$ is the sample size in stratum h
• $X_{hji}$ denotes the value of the jth variable for sampled unit i in stratum h
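As a minimal sketch of the stratum total estimator, assuming the estimate is the stratum weight multiplied by the sum of the sampled values, the calculation can be written as follows. The function name and the example figures are hypothetical and are not taken from the FSS.

```python
def stratum_total(weight_h, values):
    """Estimate the stratum total of one variable.

    weight_h: sample weight F_h shared by all units of stratum h.
    values:   values X_hji reported by the n_h sampled units.
    """
    return weight_h * sum(values)


# Hypothetical stratum: three sampled farms, each representing 12.5 farms.
example_total = stratum_total(12.5, [10.0, 20.0, 30.0])
```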
Initial editing prior to selective editing
• Initial editing is carried out by interviewers in the NSI’s provincial offices
• In this phase, all fatal errors are corrected. Most of these fatal errors come from balance edits
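The slides do not spell out the balance edits used in the FSS. In general, a balance edit requires that a set of components add up to a reported total, so a check of that kind might look like the sketch below. The function name, the tolerance and the example areas are assumptions made for illustration.

```python
def passes_balance_edit(components, total, tol=1e-6):
    """Return True when the components add up to the reported total.

    tol is a hypothetical tolerance that absorbs rounding in the data.
    """
    return abs(sum(components) - total) <= tol


# Hypothetical record: crop areas that should add up to the cultivated land.
ok = passes_balance_edit([12.0, 5.5, 2.5], 20.0)
fatal = not passes_balance_edit([12.0, 5.5, 2.5], 21.0)
```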
Selective editing procedure
• The goal: to select the survey units with suspicious values that may have a significant effect on survey estimates
• Key variables chosen: Utilised Agricultural Land (UAL), Cultivated Land (CL), Woody Crops (WCs), Olive Grove (OG), Vineyard (VY), Animal Units (AU), Annual Labour Units (ALU)
Selective editing: crop variables
• Relative stability over time
• Anomalous variations, from the previous year to the current one, can be a sign of data errors
• We determine the units with anomalous and significant variations of the selected crop variables
Steps of selective editing procedure: crop variables
1) In each stratum, we obtain the units with anomalous variations of the analyzed variables with respect to the previous period, using the Hidiroglou-Berthelot (1986) outlier detection method (PROC OUTLIER of the Banff system)
2) Among the outliers identified in the previous step, the units for manual editing are those with a significant weight on the population total estimates, selected using a score function
Step (1): Hidiroglou-Berthelot method
PROC OUTLIER of the Banff system
$$r_{hji} = \frac{X_{hji}^{t}}{X_{hji}^{t-1}}$$
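Step (1): Hidiroglou-Berthelot method
$$s_{hji} = \begin{cases} 1 - \dfrac{r_{hjm}}{r_{hji}}, & 0 < r_{hji} < r_{hjm} \\[6pt] \dfrac{r_{hji}}{r_{hjm}} - 1, & r_{hji} \ge r_{hjm} \end{cases}$$
where $r_{hjm}$ is the median of the ratios $r_{hji}$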
Step (1): Hidiroglou-Berthelot method
Effect $e_{hji}$ for each unit i:
$$e_{hji} = s_{hji}\,\bigl(\max\bigl(F_h^{t-1} X_{hji}^{t},\; F_h^{t-1} X_{hji}^{t-1}\bigr)\bigr)^{\text{exp}}$$
with exponent exp = 1
Step (1): Hidiroglou-Berthelot method
M, Q1 and Q3 are the median, the first quartile and the third quartile of the transformed values $e_{hji}$ of the variable being processed
$$d_{Q1} = \max(M - Q_1,\ |A \cdot M|)$$
$$d_{Q3} = \max(Q_3 - M,\ |A \cdot M|)$$
Step (1): Hidiroglou-Berthelot method
Units whose effect $e_{hji}$ falls outside the following interval are identified as outliers:
$$(M - C\,d_{Q1},\ M + C\,d_{Q3}), \qquad C = 5$$
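The Hidiroglou-Berthelot step can be sketched in a few lines of Python. This is an illustrative reading of the method, not the Banff PROC OUTLIER implementation: the function name is hypothetical, the default A = 0.05 is an assumption (the slides give no value for A), the quartiles use linear interpolation, and all values are assumed to be strictly positive.

```python
import statistics


def hb_outliers(x_prev, x_curr, weight=1.0, a=0.05, c=5.0, exponent=1.0):
    """Flag units whose period-to-period change is anomalous.

    x_prev, x_curr: strictly positive values of one variable in one stratum.
    a is an assumed default; c=5 and exponent=1 follow the slides.
    Returns a list of booleans, True where the unit is an outlier.
    """
    ratios = [cur / prev for prev, cur in zip(x_prev, x_curr)]
    r_med = statistics.median(ratios)

    effects = []
    for prev, cur, r in zip(x_prev, x_curr, ratios):
        # Transformation that treats increases and decreases symmetrically.
        s = 1.0 - r_med / r if r < r_med else r / r_med - 1.0
        # Size effect: the same relative change matters more for large units.
        effects.append(s * max(weight * cur, weight * prev) ** exponent)

    q1, m, q3 = statistics.quantiles(effects, n=4, method="inclusive")
    d_q1 = max(m - q1, abs(a * m))
    d_q3 = max(q3 - m, abs(a * m))
    lower, upper = m - c * d_q1, m + c * d_q3
    return [e < lower or e > upper for e in effects]


# Hypothetical stratum of eight farms; the seventh triples its area.
flags = hb_outliers(
    [100, 120, 80, 150, 200, 90, 110, 130],
    [102, 118, 82, 149, 205, 91, 330, 131],
)
```

In this made-up stratum only the seventh farm lies outside the acceptance interval, so it is the only unit passed on to the scoring step.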
Step (2): scaled local score function (Latouche and Berthelot 1992):
$$\Delta_{hji} = \frac{F_h^{t-1}\,\bigl|X_{hji}^{t} - X_{hji}^{t-1}\bigr|}{\hat{X}_{hj}^{t-1}} \qquad (3)$$
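A minimal sketch of the scaled local score, assuming it is the weighted absolute change of the unit between the two periods divided by the previous total estimate of the stratum. The function and argument names and the example figures are hypothetical.

```python
def local_score(weight_prev, x_curr, x_prev, total_prev):
    """Scaled local score of one unit for one variable.

    weight_prev: sample weight of the stratum in the previous period.
    x_curr, x_prev: values reported by the unit in t and t-1.
    total_prev: previous total estimate of the variable in the stratum.
    """
    return weight_prev * abs(x_curr - x_prev) / total_prev


# Hypothetical farm: area grows from 100 to 150 in a stratum of 50,000 units of area.
example_score = local_score(10.0, 150.0, 100.0, 50_000.0)
```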
Setting the threshold value (Lawrence and McKenzie 2000):
$$a_{hj} = SE(\hat{X}_{hj})\,\sqrt{\frac{3k}{n_h}} \qquad (4)$$
$a_{hj}$ is the threshold value of the jth variable in stratum h, $SE(\hat{X}_{hj})$ is the standard error of the estimate of the jth variable in stratum h, $n_h$ is the sample size in stratum h and k is a value such that:
$$E\bigl(\mathrm{bias}(\hat{X}_{hj})\bigr)^{2} \le k\,\mathrm{Var}(\hat{X}_{hj})$$
Using the Lawrence and McKenzie formula ensures that the bias due to not editing some of the survey units is less than k% of the variance of the estimate. The value of k is set to 10%
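The threshold calculation can be sketched as below. The square-root form SE * sqrt(3k / n) is an assumption: it is the form usually attributed to Lawrence and McKenzie (2000), and the formula on the slide is not fully legible. The function name and the example figures are hypothetical, and the result comes out in the units of the estimate.

```python
import math


def lm_threshold(std_error, n_h, k=0.10):
    """Threshold for one variable in one stratum.

    std_error: standard error of the stratum estimate.
    n_h: sample size of the stratum.
    k: tolerated ratio of squared bias to variance (10% in the slides).
    Assumed form: std_error * sqrt(3 * k / n_h).
    """
    return std_error * math.sqrt(3.0 * k / n_h)


# Hypothetical stratum: standard error 1200 with 30 sampled farms.
example_threshold = lm_threshold(1200.0, 30)
```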
• Within each stratum, the values $\Delta_{hji}$ are sorted in descending order
• Then, the outliers with score $\Delta_{hji} > a_{hj}$ are selected for manual editing
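The ranking and selection step could look like the following sketch. It assumes the scores of the outliers in one stratum and the threshold are already expressed on the same scale. The function name and the farm identifiers are hypothetical.

```python
def select_for_manual_editing(scores, threshold):
    """Rank units by score and keep those above the threshold.

    scores: mapping from unit identifier to its score in one stratum.
    Returns the selected identifiers, highest score first.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [unit for unit, score in ranked if score > threshold]


# Hypothetical outliers of one stratum and a threshold of 0.005.
selected = select_for_manual_editing(
    {"farm_1": 0.002, "farm_2": 0.05, "farm_3": 0.01}, 0.005
)
```

Sorting first means the reviewers receive the units in priority order, which helps if the manual editing budget runs out before the list does.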
Selective editing: employment variable
• ALU variable: one ALU is equivalent to the work carried out by one person on a full-time basis over one year
• Auxiliary information is used to estimate the expected amended value: the ratio between the number of people employed in agriculture in t and in t-2, obtained from the Labour Force Survey (LFS)
Selective editing: score function
$$\Delta_{hji} = \frac{F_h^{t-1}\,\bigl|X_{hji}^{t} - R\,X_{hji}^{t-1}\bigr|}{R\,\hat{X}_{hj}^{t-1}} \qquad (5)$$
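A sketch of the employment score, assuming that the expected amended value of a unit is its previous ALU value multiplied by the auxiliary employment ratio R, and that the previous total estimate is rescaled by the same ratio. The function and argument names and the example figures are hypothetical.

```python
def employment_score(weight_prev, alu_curr, alu_prev, ratio, total_prev):
    """Score of one unit for the ALU variable.

    ratio: employment in agriculture in t divided by employment in t-2,
           taken from the auxiliary labour survey.
    """
    expected = ratio * alu_prev  # expected amended value of the unit
    return weight_prev * abs(alu_curr - expected) / (ratio * total_prev)


# Hypothetical farm: 2.0 ALU before, 3.0 ALU now, employment ratio 0.95.
example_alu_score = employment_score(10.0, 3.0, 2.0, 0.95, 4000.0)
```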
Selective Editing: livestock variables
– The FSS collects the livestock present on the farm on the day of the interview
– A farm can show a strong livestock variation depending on the interview date
The selective editing procedure for livestock differs from that used for the other variables
Animal Units (AU)
Livestock data are expressed in AUs which
are obtained by applying a coefficient to
each species and type in order to group
different species in one common unit
Steps of Selective Editing procedure: livestock
1) Units that fail some of the edits, which are specified in the traditional microediting approach, are selected as suspicious units
2) For each suspicious unit or edit failure, an estimate of the expected amended response of AU variable is calculated
3) We determine, among the suspicious units detected at the previous step, those units with a significant weight on the total estimate of the AU variable
Edits specified in the traditional microediting approach
$$y_{hji} < c_{hj}$$
• $y_{hji}$ is the jth variable (type of livestock) for unit i in stratum h
• $c_{hj}$ is a constant determined from the historical empirical distributions
• Estimate of the expected amended response of the AU variable: $c_{hj}$ expressed in AU, i.e. $x'_{hji}$
• Magnitude of failure for the suspicious unit i:
$$e_{hji} = x_{hji} - x'_{hji}$$
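The livestock edit and the magnitude of failure can be sketched as follows. The sketch assumes that both the reported head count and the edit constant are converted to AU with the same coefficient. The function name, the coefficient and the example figures are hypothetical.

```python
def livestock_failure(count, limit, au_coefficient):
    """Magnitude of failure in AU, or None when the edit is satisfied.

    count: reported head count for one livestock type.
    limit: edit constant for that type and stratum.
    au_coefficient: hypothetical AU per head for that livestock type.
    """
    if count < limit:  # the edit holds, so the unit is not suspicious
        return None
    observed_au = au_coefficient * count
    expected_au = au_coefficient * limit  # expected amended response
    return observed_au - expected_au


# Hypothetical farm reporting 500 head against an edit constant of 300.
example_failure = livestock_failure(500, 300, 0.5)
```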
• The threshold is calculated using the Lawrence and McKenzie formula, as in the previous cases
• Within each stratum, the values $\Delta_{hji}$ are sorted in descending order
• Then, the edit failures with score $\Delta_{hji} > a_{hj}$ are selected for manual editing
Macroediting and selective editing approach
• First, the strata with the largest variation of the analysed variables with respect to the previous period are selected
• Then, the steps of the selective editing procedure are applied only to the farms in the selected strata
Macro-editing approach:
$$\delta_{hj} = \frac{\bigl|\hat{X}_{hj}^{t} - \hat{X}_{hj}^{t-1}\bigr|}{\sum_{h'=1}^{p} \bigl|\hat{X}_{h'j}^{t} - \hat{X}_{h'j}^{t-1}\bigr|}$$
Threshold value for the strata
• In each region, the $\delta_{hj}$ values are sorted in descending order
• We determine a threshold value $\delta_j^{*}$, and the strata with $\delta_{hj} > \delta_j^{*}$ are selected
• This threshold value is set to 3%
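The stratum selection can be sketched as below. It assumes that the variation of a stratum is measured as its share of the total absolute change of the estimate across the strata of the region, which is one reading of the macro-editing formula. The function name and the example totals are hypothetical.

```python
def select_strata(totals_curr, totals_prev, threshold=0.03):
    """Select the strata whose share of the total absolute change exceeds the threshold.

    totals_curr, totals_prev: mappings from stratum id to the total estimate
    of one variable in t and t-1, for the strata of one region.
    Returns the selected stratum ids, largest share first.
    """
    changes = {h: abs(totals_curr[h] - totals_prev[h]) for h in totals_curr}
    total_change = sum(changes.values())
    if total_change == 0:
        return []  # nothing changed, so no stratum needs review
    shares = {h: change / total_change for h, change in changes.items()}
    ranked = sorted(shares, key=shares.get, reverse=True)
    return [h for h in ranked if shares[h] > threshold]


# Hypothetical region with three strata; only s2 changes noticeably.
selected_strata = select_strata(
    {"s1": 1010.0, "s2": 2400.0, "s3": 1490.0},
    {"s1": 1000.0, "s2": 2000.0, "s3": 1500.0},
)
```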
Results
– Number of farms: 3690
– We compare the results obtained for the following editing procedures:
• (A) Traditional microediting approach
• (B) Selective editing procedure
• (C) Macroediting and selective editing approach
Table 1
Procedure                      A      B      C
Rate of edited units (%)       21.5   9.0    4.8
Rate of corrected units (%)    3.9    7.2    9.1
Table 2: Change rates of the total estimate for the CL variable (%)
Change rate of (B) over (A)    Change rate of (C) over (A)
0.8                            1.1
Table 3
95% confidence interval    CL (B)      CL (C)
(72657.2; 86031.5)         78770.72    78471.56
Further research
– Banff will be applied to the rest of units that have not been edited in the selective editing procedure
– Different methods of imputation will be tested
Final remarks
• Combining the PROC OUTLIER procedure of Banff, to detect suspicious units, with a score function, to select units for manual editing, has been useful in the Spanish FSS
• This approach would reduce cost and processing time
• Response burden is reduced because fewer recontacts are carried out