Antonio Bernardi - Fulvia Cerroni - Viviana De Giorgi (Istat)
An application to the Tax Authority Source (Sector Studies)
Session: Administrative data
10 July 2008
A methodological process for assessing variables coming from administrative sources
Agenda
Part 1 - Scheme for assessing administrative sources for statistical use
Part 2 - The process for assessing variables: the theory
Part 3 - An application to the Tax Authority Source - Sector Studies (SS)
Background and motivations
• use of administrative archives in place of statistical surveys
• much more information on small and medium enterprises
• reducing the statistical burden
• development of a general scheme for validating administrative data as statistical data
• focus on the process of assessing quantitative variables that have a benchmark
• Sector Studies (SS) compared with the statistical survey on SMEs as the benchmark source
Part 1 - Scheme for assessing administrative sources for statistical use
Scheme for assessing administrative sources 1/2
GENERAL ANALYSIS OF THE SOURCE
SELECTION OF THE ARCHIVE’S PARTS AND/OR VARIABLES TO BE ASSESSED
ANALYSIS OF VARIABLES
OUTPUT: SOURCE’S ASSESSMENT FOR STATISTICAL USE
Scheme for assessing administrative sources 2/2
Preliminary judgement on an administrative archive
Is it possible to identify a well defined universe? yes/no
Reference population for coverage: yes (specify)
Mean coverage level: (specify percentage)
Coverage level (by existing disaggregation): between … and … (specify)
Are there any benchmark variables? yes (specify)/no
Can data be imported in a SAS format? yes/no
Data delivery timeliness: (specify)
Is a formal request needed for data release? yes/no
Variables' classifications: specify existing problems
Judgement: we can / cannot go on processing the source
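The coverage items of this checklist can be obtained by matching the archive's units against the reference population. The following is only a minimal sketch, assuming every unit has an identifier and a class label used for the disaggregation; the function name, the class labels and the toy data are hypothetical and do not come from the presentation.

```python
from collections import defaultdict

def coverage_by_class(register, archive_ids):
    """register: dict unit_id -> class label (the reference population).
    archive_ids: set of unit ids found in the administrative archive.
    Returns (overall coverage in %, {class label: coverage in %})."""
    totals = defaultdict(int)
    covered = defaultdict(int)
    for unit_id, cls in register.items():
        totals[cls] += 1
        if unit_id in archive_ids:
            covered[cls] += 1
    by_class = {c: 100.0 * covered[c] / totals[c] for c in totals}
    overall = 100.0 * sum(covered.values()) / len(register)
    return overall, by_class

# Toy data (hypothetical): six register units in two classes
register = {1: "A", 2: "A", 3: "A", 4: "A", 5: "B", 6: "B"}
archive_ids = {1, 2, 3, 5, 99}   # unit 99 is not in the register and is ignored
overall, by_class = coverage_by_class(register, archive_ids)
```

Here the overall figure is computed on all register units; whether the "mean coverage level" of the checklist is this figure or an average of the per-class levels is not stated in the slides.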
Part 2 - The process for assessing variables: the theory
Scheme for assessing quantitative variables having a benchmark
[Flowchart: the general scheme of Part 1 (general analysis of the source; selection of the archive's parts and/or variables to be assessed; analysis of variables; output: source's assessment for statistical use), with the process for assessing a variable detailed by the following elements]
INPUT: DATA (ARCHIVES)
QUALITATIVE ASSESSMENT
• Is comparison of definitions effective? (Y/N)
• Forcing the definition
QUANTITATIVE ASSESSMENT
• Outlier detection
• Standard validation
• Practical validation: frequency validation, by group average validation, microdata validation
• Is numerical evaluation effective? (Y/N)
• Looking for a new benchmark
• END
OUTPUT: VARIABLE'S ASSESSMENT FOR STATISTICAL USE
Qualitative and quantitative assessment of a variable 1/2
1. Outlier detection: irregular values and outliers
Irregular values:
• legal and economic constraints are taken into account
• no systematic scheme exists for them
Outliers: 2 out of 3 criteria should be satisfied
i. statistical/probabilistic (Bienaymé–Chebyshev inequality)
ii. computational/explorative (k-means clustering method)
iii. deterministic (relative differences within the threshold values of 5%, 2% or 1%)
• no systematic scheme exists for them
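The "2 out of 3" rule for outliers can be sketched as a simple vote among the three criteria. What follows is an illustration under stated assumptions, not the authors' procedure: the criteria are applied to the deviations between source and benchmark, the Chebyshev bound uses a tail probability `alpha`, the clustering is a plain two-cluster k-means in one dimension that flags clusters holding less than `max_share` of the records, and the deterministic rule flags relative differences above the threshold. All function names and parameter values are hypothetical.

```python
import statistics

def chebyshev_flags(dev, alpha=0.05):
    # Bienaymé-Chebyshev: P(|X - mean| >= k*std) <= 1/k**2, hence k = alpha**-0.5
    k = alpha ** -0.5
    mean, std = statistics.fmean(dev), statistics.pstdev(dev)
    return [abs(d - mean) > k * std for d in dev]

def kmeans_flags(dev, max_share=0.05, n_iter=50):
    # 1-D two-cluster k-means (Lloyd's algorithm), started at the min and the max;
    # members of a cluster holding less than max_share of the records are flagged
    centers = [min(dev), max(dev)]
    labels = [0] * len(dev)
    for _ in range(n_iter):
        labels = [0 if abs(d - centers[0]) <= abs(d - centers[1]) else 1 for d in dev]
        for j in (0, 1):
            members = [d for d, lab in zip(dev, labels) if lab == j]
            if members:
                centers[j] = statistics.fmean(members)
    shares = [labels.count(j) / len(dev) for j in (0, 1)]
    return [shares[lab] < max_share for lab in labels]

def relative_flags(source, benchmark, threshold=0.05):
    # Deterministic rule (assumed reading): relative difference above the threshold
    return [abs(s - b) > threshold * abs(b) for s, b in zip(source, benchmark)]

def outliers(source, benchmark, threshold=0.05):
    # A record is an outlier when at least 2 of the 3 criteria flag it
    dev = [s - b for s, b in zip(source, benchmark)]
    votes = zip(chebyshev_flags(dev), kmeans_flags(dev),
                relative_flags(source, benchmark, threshold))
    return [sum(v) >= 2 for v in votes]

# Toy data (hypothetical)
benchmark = [200.0] * 100
source = list(benchmark)
source[7] = 300.0   # far from its benchmark: flagged by all three criteria
source[8] = 212.0   # 6% off: flagged by the deterministic rule only
flags = outliers(source, benchmark)
```

With these toy values only record 7 collects at least two votes, so the small discrepancy of record 8 is not treated as an outlier.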
Quantitative assessment of a variable 2/2
2. Standard validation: for both the source variable and its benchmark
• calculate the main descriptive statistics (mean, standard deviation, median, asymmetry, kurtosis) and check whether the distance between the two variables decreases from the raw to the trimmed distribution
• through the kernel histogram, check whether the two series have the same graphical shape and whether the distribution of the deviations is symmetric, leptokurtic and centred on zero
3. Practical validation: useful for specific surveys and studies, to check the level of concordance between the variable and its benchmark
• Frequency validation: concordance by class frequencies, simple index of dissimilarity, Cohen coefficient, relative weights of the frequencies on the main diagonal, verification of correspondence through the goodness-of-fit test of a log-linear model
• By group validation: per group concordance, checked through the linearity of the groups' means
• Micro-data validation: robust point-to-point correspondence through regression techniques
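Some of the frequency-validation measures can be illustrated on a square table that cross-classifies the units by class of the source variable (rows) and class of the benchmark (columns). This is a minimal sketch with a hypothetical 3x3 table; the index of dissimilarity is taken here in its usual form (half the sum of the absolute differences between the two marginal distributions), which is an assumption, and the log-linear models are not covered.

```python
def concordance(table):
    """table[i][j]: number of units in class i for the source and class j for the benchmark."""
    n = sum(sum(row) for row in table)
    k = len(table)
    rows = [sum(table[i]) / n for i in range(k)]                       # source marginals
    cols = [sum(table[i][j] for i in range(k)) / n for j in range(k)]  # benchmark marginals
    diagonal = sum(table[i][i] for i in range(k)) / n                  # share on the main diagonal
    expected = sum(r * c for r, c in zip(rows, cols))                  # agreement expected under independence
    kappa = (diagonal - expected) / (1 - expected)                     # Cohen's coefficient
    dissimilarity = 0.5 * sum(abs(r - c) for r, c in zip(rows, cols))  # simple index of dissimilarity
    return diagonal, kappa, dissimilarity

# Toy table (hypothetical counts)
table = [[40, 5, 0],
         [5, 30, 5],
         [0, 5, 10]]
diagonal, kappa, dissimilarity = concordance(table)
```

A high share on the main diagonal together with a Cohen coefficient well above zero indicates that the two sources classify the units consistently.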
Part 3 - An application to the Tax Authority Source - Sector Studies (SS)
Assessing the source: the accounting table of Sector Studies
Preliminary judgement on the accounting table of Sector Studies
Is it possible to identify a well defined universe? yes
Reference population for coverage: Italian Business Register (ASIA)
Mean coverage level: 79.4%
Coverage level (by existing disaggregation): between 65% and 90%
Are there any benchmark variables? yes (SME survey)
Can data be imported in a SAS format? yes
Data delivery timeliness: 15-month time lag
Is a formal request needed for data release? yes
Variables' classifications: some differences exist but they can be overcome
Judgement: the accounting table can be processed through the procedure for assessing variables
Assessing the variable: the total cost 1/5
Qualitative assessment
First hypothesis: assess each cost variable of Sector Studies against its own SME survey benchmark
Results: comparison of definitions is not effective for each variable. Even after forcing the definition, the numerical evaluation is not effective: an appropriate combination of variables, with a new benchmark, should be taken into account
Second hypothesis: assess the total cost of Sector Studies against the total cost of the SME survey
Total cost of SS = Total cost of SME survey
Assessing the variable: the total cost 2/5
Quantitative assessment
Outlier detection and standard validation

Variable             Min   1°   5°  10°  25°  Median  75°    90°    95°    99°    Max
Total cost of SS       1    2    6   10   36     192  957  2,538  3,473  4,700  6,566
Total cost of SMEs     1    2    5   10   36     192  960  2,532  3,488  4,699  6,660
(1° to 99°: percentiles)

Variable                          N of records   Mean   St. dev.  Coeff. of var.  Asymmetry  Kurtosis
Total cost of SS (raw data)             29,464  762.0   1,153.01          151.3        1.99      3.88
Total cost of SS (trimmed data)         29,123  749.1   1,132.77          151.22       1.94      3.12
Total cost of SME survey                29,123  750.6   1,135.54          151.28       1.94      3.13
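The figures of the descriptive table can be used to reproduce the standard-validation check, namely that the distance between source and benchmark shrinks when moving from the raw to the trimmed data. The snippet below is only a worked check on the published means and standard deviations; the helper names and the choice of the relative gap between means as the "distance" are assumptions.

```python
# Means and standard deviations copied from the table above
ss_raw     = {"mean": 762.0, "std": 1153.01}   # Total cost of SS, raw data
ss_trimmed = {"mean": 749.1, "std": 1132.77}   # Total cost of SS, trimmed data
sme        = {"mean": 750.6, "std": 1135.54}   # Total cost of SME survey

def cv(stats):
    # coefficient of variation, in percent
    return 100.0 * stats["std"] / stats["mean"]

def rel_gap(a, b):
    # relative distance of a from b, in percent
    return 100.0 * abs(a - b) / b

gap_raw = rel_gap(ss_raw["mean"], sme["mean"])          # roughly 1.5%
gap_trimmed = rel_gap(ss_trimmed["mean"], sme["mean"])  # roughly 0.2%
```

The coefficients of variation recomputed this way agree with the table, and the gap between the means drops from about 1.5% on the raw data to about 0.2% on the trimmed data.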
Assessing the variable: the total cost 3/5
Fig. 1. Distribution of the deviations of SS from SME survey values
Assessing the variable: the total cost 4/5
Practical validation
Frequency validation: the two sources are not independent: the percentage of frequencies on the main diagonal (79.8%), plus the percentage on the lines contiguous to it, reaches 95.8%
By group validation
Assessing the variable: the total cost 5/5
Micro-data validation
Correlation coefficient (Pearson): 0.99837
Linear regression: TC(SS) = α + β × TC(SMEs), with α ≈ 0, β ≈ 1, R² = 0.9967
Point-to-point correspondence (through the robust regression method): 87.8%
Conclusion
Judgement on the total cost: the variable is reliable at an individual level
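The first two micro-data measures, the Pearson coefficient and the linear regression of the SS total cost on the SME total cost, can be sketched with ordinary least squares. This is an illustration on hypothetical toy values, not on the survey microdata, and it does not reproduce the robust regression behind the point-to-point correspondence figure; the function name is an assumption.

```python
def pearson_and_ols(x, y):
    """x: benchmark values, y: source values.
    Returns (r, alpha, beta, r2) for the fitted line y = alpha + beta * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = sxy / (sxx * syy) ** 0.5     # Pearson correlation coefficient
    beta = sxy / sxx                 # least-squares slope
    alpha = my - beta * mx           # least-squares intercept
    return r, alpha, beta, r * r     # in simple regression R^2 equals r^2

# Toy values (hypothetical)
sme = [10.0, 36.0, 192.0, 960.0, 2532.0]
ss = [10.0, 36.0, 192.0, 957.0, 2538.0]
r, alpha, beta, r2 = pearson_and_ols(sme, ss)
```

The same identity links the two figures reported for the total cost: 0.99837 squared is about 0.9967, the R² of the regression.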
Summary of the overall process
Pre-analysis
Analysis variable: Total cost SS
Benchmark variable: Total cost SMEs

Qualitative validation (definitions): Yes

Standard validation

                   N. of records compared   % of total
Original data                      29,149        100.0
Extreme data                          201          0.7
Trimmed data                       28,948         99.3

Characteristic values   Analysis variable   Benchmark variable   % change
Mean                                785.6                787.9       -0.3
Median                              207.0                207.0          -
Std                               1,187.5              1,191.4       -0.3
CV                                  151.2                151.2          -
1° Pctl                               1.0                  1.0          -
5° Pctl                               5.0                  5.0          -
10° Pctl                             10.0                 10.0          -
25° Pctl                             38.0                 38.0          -
75° Pctl                          1,023.0              1,026.0       -0.3
90° Pctl                          2,609.0              2,615.0       -0.2
95° Pctl                          3,595.0              3,599.0       -0.1
99° Pctl                          4,734.0              4,765.0       -0.7

Distributional aspects   Analysis variable    Benchmark variable
Asymmetry                Positive (2.3)       Positive (2.3)
Kurtosis                 Leptokurtic (8.8)    Leptokurtic (8.7)

Judgement on the standard validation: the validation process of the variable can continue

Practical validation

Validation on data summarised in frequencies
% of data on the main diagonal: 82.0
% of data off the main diagonal (lower and upper triangular parts together): 18.0 (10.4 + 7.6)
Base model IM, deviance/df: 193.5
Model QIM, deviance/df: 27.7

Validation on data summarised in group means
Linear trend of the means: Yes

Validation at point level
Correlation index (Pearson): 0.99871
% of robust data: 88.3
% of non-robust data: 11.7

Overall judgement on the variable
The variable can be considered very reliable from a statistical point of view.
Thank you for your attention
For further information:
Antonio Bernardi: [email protected]
Fulvia Cerroni: [email protected]
Viviana De Giorgi: [email protected]