Specific Grant Agreement (SGA)
Harmonised protection of census data in the ESS
Contract N° 11112.2016.005-2016.367
under
FPA N° 11112.2014.005-2014.533
Date
23/06/2017
Work Package 3
Development and testing of recommendations; identification of best practices
Deliverable D3.2
Results of the tests on census hypercube and grid data and information loss analysis
Authors
Maël-Luc Buron, Annu Cabrera, Junoš Lukan
Sensitivity
Available to NSIs
1. Introduction

This project deliverable provides the results of the testing of the selected protection methods, record swapping and random noise, for census data. Three countries from the project team participated in the testing: France, Slovenia and Finland. Each country used its own 2011 census microdata.
The project team selected the following two hypercube sets foreseen for the Census 2021 for the test:1

1. First hypercube set: Group 9, i.e. hypercubes 9.1, 9.2, 9.3 and 9.4.
2. Second hypercube set: Group 11, i.e. hypercubes 11.1 and 11.2.
These selected test hypercubes include variables for geographical area, sex, age, country of citizenship, place of birth and year of arrival in the country since 1980. A more detailed description of the test data can be found in project deliverable D3.1 part II.
The project team selected two SDC methods for testing: record swapping and random noise. These methods are outlined in the project deliverable D3.1 part I. Independently of this project, the UK Office for National Statistics (ONS) has developed SAS codes for implementing record swapping and random noise. ONS agreed to collaborate with this project and kindly offered their SAS codes to be used for the testing in the project. The SAS codes were modified and enhanced by the project members to better suit the test settings. For example, Germany produced the necessary noise distribution tables ("ptable") for the random noise macro. The project deliverable D3.1 part II outlines in more detail how the SAS codes can be used to protect hypercubes. Even though D3.1 part II focuses on the tests for hypercubes, it was shown during the testing that the necessary extension of the codes to grid data was quite straightforward.
The aim was for each method to be tested on each hypercube and on grid data using the provided SAS codes. In practice, each tester chose the methods, variants of methods and parameter values that were most suitable for them and tested only those. All three countries tested the methods on both hypercubes and grid data.
2. Software issues

There were some software issues encountered during testing. This is why the project team saw it necessary to prepare a note to help the countries outside the project to test the software. This note (see Annex A of this document) provides practical information on how to get started with the testing and on which kinds of issues have already been encountered (and solved). The note was made available, in addition to the software and Deliverable 3.1, to all countries interested in testing.
1 Details on the content of the hypercubes can be found in the Census 2021 draft implementing regulation that was approved at the 30th Meeting of the European Statistical System Committee on 28 September 2016 (item 2 of the agenda).
The code for record swapping was very extensive, comprising several different files. Since the code was made by ONS, there naturally was a strong connection to the UK's census data. The code had to be modified in order to make it run with other countries' data as well. However, within this project it was not possible to modify everything due to the huge amount of code. Some elements remain in the record swapping code that cannot be parameterised to better fit other countries' data. These elements are listed in Annex A.
Protection by random noise was tested by implementing the ONS code for the cell key method. Even though the code was not as complex as the one for record swapping, some adjustments were needed. Germany provided some new noise distribution tables ("ptable") that the project team considered more suitable than the original ONS ptable. One important adjustment was to modify the ONS perturbation code so that zero cells were not perturbed (see D3.1 part II, pp. 8-9). During the testing, some issues occurred when running the cell key method code on tables that did not have (enough) zero cells. To fix this, the code has been augmented and some error checking has been included. It should now be possible either to perturb tables without zeros and small counts or to receive an error message in the SAS log. Finally, the additivity module (see D3.1 part II, pp. 10-12) was added to the code.
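To illustrate the mechanics of the cell key method, the following Python sketch (a simplified illustration, not the ONS SAS implementation; the record keys, key modulus and ptable values are invented for the example) gives each record a fixed random key, sums the keys per cell into a cell key, and looks the noise up in the ptable. Zero cells never appear in the cell table and are therefore never perturbed, mirroring the adjustment described above.

```python
import random

def build_cell_key_table(microdata, cell_of, key_mod=100):
    """Aggregate record keys into per-cell (count, cell key) pairs.

    microdata: list of records; cell_of maps a record to its cell.
    Each record carries a fixed random key in [0, key_mod).
    """
    cells = {}
    for rec in microdata:
        cell = cell_of(rec)
        count, cell_key = cells.get(cell, (0, 0))
        cells[cell] = (count + 1, (cell_key + rec["key"]) % key_mod)
    return cells

def perturb(cells, ptable, key_mod=100):
    """Add the noise value selected by the cell key's band in the ptable."""
    out = {}
    for cell, (count, cell_key) in cells.items():
        noise = ptable[cell_key * len(ptable) // key_mod]  # band lookup
        out[cell] = max(count + noise, 0)  # keep counts non-negative
    return out

# Illustrative ptable (assumption): noise in {-1, 0, +1} per key band.
PTABLE = [-1, 0, 0, 0, 1]

random.seed(1)
data = [{"region": r, "key": random.randrange(100)}
        for r in ["A"] * 7 + ["B"] * 3]
cells = build_cell_key_table(data, cell_of=lambda rec: rec["region"])
print(perturb(cells, PTABLE))
```

Because the record keys are fixed, the same cell always receives the same noise, which keeps repeated or overlapping tables consistent with each other.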
Considering execution times, testing record swapping on French data revealed some time-consuming steps. The execution times depend on the number of geographical areas and, of course, on the number of households in the data. Some examples of record swapping execution times can be found in Annex A. While testing the cell key method, no runtime issues occurred.
3. Test results and information loss analysis

Slovenia, Finland and France tested the chosen methods with their own data. Even though there were only two selected methods, record swapping and random noise (cell key method), many different variants of these methods were available. For example, the choice of parameter values for record swapping, and the different options of ptable and ways to deal with additivity (cf. section 4.2 in D3.1 part II) for the cell key method, generate several different variants of these methods. Countries participating in the testing could choose which variants of the methods they wanted to test.
As a result of the testing, the countries produced information loss measures that include

- simple descriptive statistics for absolute (AD) and relative absolute (RAD) distances and the distances of the square roots (D_R) between the cell/grid square values in the original data and the protected data, and
- aggregate-level summary statistics for AD, RAD and D_R.
More detailed descriptions and formulas for these information loss measures can be found in section 5 of deliverable D3.1 part I and section 5 of D3.1 part II. In addition to the above-mentioned measures, the countries also produced the empirical variance of perturbation for each hypercube and two measures to evaluate the treatment of zero cells in the protection procedure. These two measures are
- the percentage of "false zero" cells, i.e. the number of cells with an observed value other than zero perturbed to zero, compared to the number of cells with observed value zero, and
- the percentage of "false positive" cells, i.e. the number of cells with observed value zero perturbed to a value other than zero, compared to the number of cells with observed value zero.
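The measures above can be summarised in a few lines. As an illustrative sketch (in Python, rather than the SAS of Annex B), for paired original and perturbed cell values:

```python
import math

def information_loss(original, perturbed):
    """Per-cell AD, RAD and D_R plus the zero-related shares defined above.

    original, perturbed: equal-length lists of non-negative cell counts.
    RAD is undefined (None) for cells whose original value is zero.
    """
    ad = [abs(p - o) for o, p in zip(original, perturbed)]
    rad = [a / o if o > 0 else None for o, a in zip(original, ad)]
    d_r = [abs(math.sqrt(p) - math.sqrt(o)) for o, p in zip(original, perturbed)]
    zeros = sum(1 for o in original if o == 0)
    false_zero = sum(1 for o, p in zip(original, perturbed) if o > 0 and p == 0)
    false_pos = sum(1 for o, p in zip(original, perturbed) if o == 0 and p > 0)
    # Both shares are expressed relative to the number of observed zero cells,
    # following the definitions in the text.
    pct = lambda n: 100 * n / zeros if zeros else None
    return {"ad": ad, "rad": rad, "d_r": d_r,
            "pct_false_zero": pct(false_zero),
            "pct_false_positive": pct(false_pos)}

loss = information_loss([10, 0, 4, 0], [11, 0, 0, 1])
print(loss["ad"], loss["pct_false_positive"])  # [1, 0, 4, 1] 50.0
```

The aggregate-level summaries (means, percentiles, Hellinger's distance) are then simple statistics over these per-cell values, as computed in the Annex B SAS code.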
Annex B of this document contains the SAS code for the calculation of the information loss measures for hypercube 9.1. The SAS code needs to be adjusted when applied to other hypercubes or to grid data.
3.1. Chosen variants by countries

The different test settings (variants and combinations of record swapping and the cell key method) chosen by the countries are described below.
Slovenia
Slovenia tested both record swapping and the cell key method on its 2011 census data set, more specifically on hypercubes 9.2 and 9.4. Only one variant of each method was applied, and the information loss measures were calculated by comparing the original data with data to which both record swapping and the cell key method had been applied.
Slovenia also tested the cell key method and record swapping on grid data, where only one-dimensional "tables" with total population, sex, age, current activity status or place of birth (as described in the foreseen Commission implementing regulation on 1 km² grid data for the 2021 census round2) on the lowest geographical level (i.e. 1 km² grids) were considered.
Information loss measures were calculated by comparing three different sets of data:

1. original data compared to swapped and perturbed data
2. original data compared to swapped data
3. swapped data compared to swapped and perturbed data.

The parameter values for record swapping, a description of the cell key method used and the numerical values of the information loss measures for the Slovenian data can be found in Annex C11 (hypercubes 9.2 and 9.4) and Annex C12 (grid data).
Finland
Finland tested only the cell key method, on hypercubes from its 2011 census data. Three different variants of the cell key method with two different ways to deal with additivity were tested.
2 https://circabc.europa.eu/sd/a/88b3101e-f98b-4c1d-b97f-e7af25033e48/TFFC%28April%202017%296.3%20Geo-coding%20of%202021%20census%20data%20to%20a%201km%25c2%25b2%20grid%20-%20Presentation%20of%20the%20draft%20regulation%20and%20proposed%20way%20forward.pdf
Please note that this paper is stored in the CIRCABC folder of the Census Task Force with restricted access. However, all members of the Census Working Group have access (or will be granted access as soon as they request it), so please contact your census colleagues in order to obtain the document.
The descriptions of the variants of the cell key method and the numerical values of the information loss measures comparing original and perturbed data can be found in Annex C21.
Like Slovenia, Finland also tested the cell key method on grid data. One-dimensional tables with total population, sex, age, current activity status, place of birth and usual residence 12 months before, on the level of 1 km² grids, were considered. The test grid data included only populated grid squares, i.e. there were no empty grid squares in the data before perturbation. Annex C22 contains the information loss measures for the Finnish grid data.
France
France tested the cell key method on hypercubes from its 2011 census data, and both record swapping and the cell key method on grid data.

Four different variants of the cell key method were tested on hypercubes, and information loss measures were calculated for all of these variants. The descriptions of the variants and the numerical values of the information loss measures can be found in Annex C31.
For the grid data, the record swapping code was applied first, followed by one variant of the cell key method. Information loss measures were calculated to compare four different sets of data:

1. original data compared to perturbed data (O_C)
2. original data compared to swapped data (O_S)
3. swapped data compared to swapped and perturbed data (S_SC)
4. original data compared to swapped and perturbed data (O_SC)
The information loss was calculated first using total population counts (tot) and then using 8 other counts (other): male, female, age under 15, age between 15 and 64, age 65 and older, place of birth in the reporting country, place of birth in another EU country, and place of birth outside the EU. Annex C32 contains the results for the French grid data.
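The grid "tables" used in these tests are simply per-square counts. A minimal Python sketch of how such one-dimensional tables per 1 km² grid square could be built from point-located person records (the coordinates are assumed to be planar metres, and the count definitions, which mirror a subset of the counts listed above, are illustrative):

```python
from collections import defaultdict

def grid_square(x, y, size=1000):
    """South-west corner of the grid square (in metres) containing (x, y)."""
    return (int(x // size) * size, int(y // size) * size)

def grid_counts(persons, count_defs, size=1000):
    """One count per definition, per populated grid square."""
    tables = defaultdict(lambda: [0] * len(count_defs))
    for p in persons:
        square = grid_square(p["x"], p["y"], size)
        for i, pred in enumerate(count_defs):
            if pred(p):
                tables[square][i] += 1
    return dict(tables)

# Total population, male, female and age under 15.
COUNTS = [lambda p: True,
          lambda p: p["sex"] == "M",
          lambda p: p["sex"] == "F",
          lambda p: p["age"] < 15]

persons = [{"x": 650120.0, "y": 6860310.0, "sex": "M", "age": 40},
           {"x": 650900.0, "y": 6860020.0, "sex": "F", "age": 12},
           {"x": 651500.0, "y": 6860500.0, "sex": "F", "age": 70}]
print(grid_counts(persons, COUNTS))
# {(650000, 6860000): [2, 1, 1, 1], (651000, 6860000): [1, 0, 1, 0]}
```

Because only populated squares appear in the resulting dictionary, such test data contain no empty grid squares, which is why the false positive shares discussed below had to be interpreted with care.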
3.2. General evaluation of information loss

Since all three countries that participated in the testing chose different variants and combinations of protection methods, the comparison of results between countries is somewhat challenging. However, it is possible to do some comparison between countries at a very general level, or to compare results between different variants within one country.
All cell key method variants tested on the French hypercubes produced cell values with a maximum absolute deviation of 3 (or even 1 or 2) from the original cell values. Since in all variants the higher-level aggregates within the hypercubes were calculated before perturbation (option 1 for dealing with additivity, D3.1 part II, p. 11), the maximum perturbation was exactly as defined by the ptable. Finland and Slovenia tested cell key method variants that perturbed the most detailed cross-combination table of each hypercube and calculated the higher-level aggregates afterwards (option 2 for dealing with additivity). This led to maximum absolute deviations of several hundred. The empirical variance of perturbation was also on a completely different level in option 2 than in option 1. It is clear that doing aggregation before perturbation leads to smaller information loss than doing aggregation after perturbation.
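This can be illustrated numerically: if every detailed cell receives independent noise, the noise on an aggregate of n cells perturbed before aggregation has variance n times that of a single cell, whereas perturbing the aggregate itself keeps the deviation within the range defined by the ptable. A small simulation (the noise distribution, uniform on {-1, 0, +1}, is an assumption chosen purely for illustration):

```python
import random

random.seed(0)
N_CELLS = 400   # detailed cells contributing to one aggregate
N_RUNS = 2000   # simulated perturbation runs

def noise():
    # Illustrative noise distribution (assumption): uniform on {-1, 0, +1}.
    return random.choice([-1, 0, 1])

# Option 1: perturb the aggregate itself; deviation bounded by the ptable range.
dev_option1 = [abs(noise()) for _ in range(N_RUNS)]

# Option 2: perturb each detailed cell, then aggregate; deviations accumulate.
dev_option2 = [abs(sum(noise() for _ in range(N_CELLS))) for _ in range(N_RUNS)]

print("max deviation, option 1:", max(dev_option1))
print("max deviation, option 2:", max(dev_option2))
```

Even in this toy setting, with a few hundred contributing cells the option 2 deviations reach the order of tens, which is consistent with the much larger maximum deviations observed for the option 2 variants above.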
Another interesting aspect of information loss, especially for Eurostat in the context of Article 6 clause 2(b) of the aforementioned draft regulation on grid data2, is the ability of the chosen method/variant to preserve zero cells. For the cell key method, a version of the perturbation was available that should prevent the perturbation of zero cells into non-zero cells by design. Finland and France tested this version of the perturbation and it generated, as expected, hypercubes where the number of cells with observed value zero perturbed to non-zero was 0. For the Slovenian hypercubes, the share of "false positive" cells relative to the number of observed zero cells was at most 3.6 %. For the French, Slovenian and Finnish grid data, 0 % false positive grid squares was achieved when using only total population counts in the calculation.
This is due to the fact that the data used for the testing did not include empty (unpopulated) grid squares, i.e. there were no zero grid squares available that could be perturbed to non-zero. France and Slovenia compared original grid data to data where only record swapping (and not the cell key method) had been applied. With this comparison, too, the share of false positives was 0.
Cell perturbation and record swapping generated some "false zero" cells and grid squares in the data, i.e. cells/grid squares with an original value other than zero perturbed or swapped to zero. Like the ability to preserve zero cells, the share of false zeros is also interesting in the context of Article 6 clause 2(c) of the aforementioned draft regulation. For total population counts, Slovenia and France reported a 0 % share of false zeros (number of false zeros compared to the number of original zeros). For Slovenia this is due to the test scenario used: for the cell key method, Slovenia used a parameter specifying that grid squares subject to record swapping are not eligible for noise. In the Slovenian data, all small-count grid squares had been subject to record swapping, and the cell key method therefore did not change them further. As the record swapping setting used by Slovenia only changes the properties of the records (e.g. people), but not the total number of records/people in a given grid square, the result is 0 % false zeros. The French and Finnish grid data had a certain number of false zero grid squares after perturbation, but since the data used for the testing did not include empty grid squares, determining the share of false zeros was not straightforward. However, for both countries an estimate of the number of unpopulated grid squares was obtained from an external data source, and based on this the approximate share of false zeros was 1.62 % for Finland and 1.8 % for France, considering only total population counts. Even though Germany did not test the cell key method on real data, they provided a rough estimate of the share of false zeros that would result for their data. With the protection strategy Germany used in the Census 2011, they rounded all 1-inhabitant grids down to 0, thus increasing the number of (true) zero grids by more than 2 %. The effect of the methodology currently foreseen in Germany to protect Census 2021 data would be even stronger: it would tend to turn about 70 % of the 1's and 35 % of the 2's into 0's. This affects even more cells, increasing the number of zero cells by more than 3 %.
For more detailed results on information loss, Annexes C11, C12, C21, C22, C31 and C32 are
available in Excel format.
Annex A
Practical information to get started with the Record Swapping and
Random Noise codes
While deliverable 3.1 part II describes the methods, the SAS codes and the data used for the testing, this short note aims to provide some practical information on getting started with the SAS codes. We felt this was necessary especially for the swapping code, as there are a lot of different files, which can be overwhelming.
Swapping
Section 3 of deliverable 3.1 part II refers to record swapping and describes the different parts of the SAS codes.
The code was made by the ONS and therefore depends strongly on English data. When we modified it in order to run it on another country's census, we tried to change it as little as possible and kept the original global architecture. That is why we rename the variables in a program sdc_data_transform that we run at the beginning of sdc_control.
There are elements that cannot be parameterised in the current state of the programs:

- the number of embedded geographies must be 4: one must therefore find a counterpart of the structure ward - oa - msoa - lad3
- there are 7 age * sex categories (where only the age limits can be changed: <20 years old, male 20-39, female 20-39, etc.)
- the number of variables used to define the risk must be between 4 and 6; we define these in sdc_data_transform
- the code is designed for a structure in which each OA comprises a minimum number of observations
In the data, individuals need to be linked to households, and two separate input tables need to be provided: one for individuals and the other for households, with the geography variables in one of them only.
The main result of the method is the output table named sdcresults_hh_swapped_person. The geography variables are swapped: a "keyend" suffix is added to the variables ward, oa, msoa and lad, e.g. oakeyend. These therefore contain the new geography values.
The different tables produced at the end of the code in sdc_diagnostics_results are kept in the work library and provide interesting statistics, namely the rates of people and households swapped at each geography level. We added sdc_diag_suppl to construct a spreadsheet with some of the statistics computed in sdc_diagnostics_results.
3 Output Areas (OA) were specifically created for the Census, with a minimum size to prevent disclosure, typically 125 households. These are grouped into Middle Layer Super Output Areas (MSOA), which are nested in the Local Authority Districts (LAD). See https://www.ons.gov.uk/methodology/geography/ukgeographies/censusgeography for detailed information about the ONS census geography.
The first tests we did on French data are summarised below. We are currently running other simulations to test the sensitivity to the specifications. We expect to obtain better results by first grouping grid squares to define a geography in which each small area has at least a few hundred individuals.
First tests with the ONS Targeted Record Swapping programs
Data
We worked on the 2011 census, keeping dwellings with coordinates (x; y) filled in and at least one individual.
We have worked on several versions of these data:
- V1 (for debugging): only geolocated data in large municipalities, subsampled at 1/100th
- V2: exhaustive geolocated data on large municipalities (3.3 million households and 7.8 million individuals in these households)
- V3: bigger data including small communes (coordinates for 82 % of the households, i.e. 16.7 million households and 39.5 million individuals). For this to run, it was necessary for us to split these tables into 22 regional tables.
Settings
We needed to find geographical variables corresponding to the structure LAD > MSOA > OA > WARD. It is necessary to use hierarchical geographies; that is why we crossed these with the upper level4.
We chose to test5 different nested structures:
- STR1: NUTS2 > NUTS3 > 1 km² grid squares crossed with NUTS3 (and we chose ward = oa)
- STR2: NUTS3 > (10 km)² grid squares crossed with NUTS3 > 1 km² grid squares crossed with NUTS3 (and we chose ward = oa)
In the end, the numbers of modalities at each level of the structure are the following:
- V2 / STR1: 22 / 96 / 17 541
- V3 (for the region Ile de France only) / STR2: 8 / 194 / 8459
Swapping rate: 10%
4 variables to define the risk: country of birth with 12 modalities, nationality with 13 modalities, looking for a job with 4 modalities, and age in 7 age groups.
Detailed profile: ageg1 || ageg2 || ageg3 || ageg4 || ageg5 || ageg6 || ageg7 || ethc
Reduced profile: ageg1 || agec1 || ageg4 || agec2 || ageg7 || ethc
where agec1 = ageg2 + ageg3 and agec2 = ageg5 + ageg6
4 In the structure STR1 with version V2, the number of populated grid squares goes from 17 395 to 17 541 when crossing with NUTS3. In the structure STR2 with version V3 for the region Ile de France only, the number of large squares goes from 151 to 194 when crossed with NUTS3, and the number of small squares goes from 8253 to 8459.
5 The grids are not all inhabited, whereas the code is designed for OAs with a few hundred individuals. That is why we think it is in fact better to group grids together in order to construct specific OAs: these first nested structures we tested are not optimised for the swapping code shared by the ONS.
Very reduced profile: person || ethc
We kept the definitions of ageg1 to ageg7, ethc and person:
Ageg1 = number of people under 20 years old
Ageg2 = number of men aged 20 to 39
Ageg3 = number of men aged 40-59
Ageg4 = number of men aged 60 and over
Ageg5 = number of women aged 20 to 39
Ageg6 = number of women aged 40-59
Ageg7 = number of women aged 60 and over
Ethc = number of people not born in France
Person = number of individuals in the household
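A hypothetical Python sketch of how such profile strings could be derived from a household's members (the tuple layout and function names are illustrative, not those of the ONS code):

```python
def profiles(household):
    """Build the detailed, reduced and very reduced risk profiles.

    household: list of (age, sex, born_in_country) tuples.
    Returns '||'-joined count strings, mirroring the ageg1..ageg7,
    agec1/agec2, ethc and person definitions in the text.
    """
    ageg = [0] * 8  # ageg[1]..ageg[7] used
    for age, sex, _ in household:
        if age < 20:
            ageg[1] += 1
        else:
            band = 0 if age <= 39 else (1 if age <= 59 else 2)
            ageg[(2 if sex == "M" else 5) + band] += 1
    ethc = sum(1 for _, _, born_here in household if not born_here)
    agec1 = ageg[2] + ageg[3]  # men 20-59
    agec2 = ageg[5] + ageg[6]  # women 20-59
    detailed = "||".join(str(x) for x in ageg[1:8] + [ethc])
    reduced = "||".join(str(x) for x in [ageg[1], agec1, ageg[4],
                                         agec2, ageg[7], ethc])
    very_reduced = "||".join(str(x) for x in [len(household), ethc])
    return detailed, reduced, very_reduced

# A man of 45 born in France, a woman of 42 born abroad, a girl of 10.
hh = [(45, "M", True), (42, "F", False), (10, "F", True)]
print(profiles(hh))
# ('1||0||1||0||0||1||0||1', '1||1||0||1||0||1', '3||1')
```

Households with identical profile strings within an area are indistinguishable on the risk variables, which is what the swapping code exploits when selecting swap partners.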
Calculation time
We identified 3 time-consuming steps:
- sdc_data_transform: linked to our filters (approx. 10 min)
- sdc_sample: calculation time linear in the number of OAs
- sdc_matching: 22 calls to a loop of 16 iterations, each comprising several proc sort and merge steps (of the order of 20 seconds per iteration on V2).
On a subsampled table of about 500,000 households and with the same parameterisation as above, the total execution time was around 15 min. On the table with 3.3 million households, it goes up to 5 hours, but this depends largely on the number of OAs.
If the V3 table is not split by region, it is impossible to run the program without saturating the server. It must therefore be executed on the regional tables, adapting the structure of nested geographies (we changed from STR1 to STR2). In fact, ONS also split the country into delivery groups (there are just over 100 Delivery Groups covering England and Wales) before running the program. For the region Ile de France, which is the biggest region, the execution time is about 2 hours. For an average region, it is around 30 min.
Random noise
Section 4 of deliverable 3.1 part II refers to random noise and describes how to use the SAS
codes implementing the cell key method.
To gain some first experience, we suggest running the provided example first. Open the master program "ckm_hc_9_2.sas", which is an example using a synthetic version of "hypercube 9.2". Testers only have to change the settings of the master program (e.g. path names). Here is a short overview of the four steps within the master program:
Step 1 Initializing:
Define your paths, include the macros and set the parameters
Step 2 Hypercube Preparation:
Specify and import the synthetic hypercube or one of your Census 2011
hypercubes as outlined in section 2
Step 3 Blocking zero cells:
Implement edit rules of the hypercube to determine zero cells that should
not be perturbed (only necessary if you are going to perturb zero cells into
non-zero cells)
Step 4 Perturbation:
You will have the choice between four different versions of perturbation tables (ptables). Execute ONE of the four perturbations. Version 01 is recommended for the testing in the project since it avoids the complexity, for example, of step 3.
There should not be any runtime issues (for example, the code took around two and a half minutes on a single computer for a table with 9 million individuals for hypercube 9.1).
Annex B
***********************************
*** Information loss measures ***
***********************************;
* Annu Cabrera;
* 1.6.2017;
* Hypercube: 9.1
* Different variants of protection method:
- Description of variant 1 (e.g. Cell key method, ptable_d3_v200, Option 1 in D3.1 part
II, p. 11) --> Variant1
- Description of variant 2 --> Variant2
- ...
;
* Read in Variant1, Variant2, ...;
libname v1 "...";
data work.hc91_v1;
set v1.hc91;
run;
libname v2 "...";
data work.hc91_v2;
set v2.hc91;
run;
* Number of variants;
%let vari = 2;
options mprint;
***********************************************************
** 5.1. Simple descriptive statistics (D3.1 part II, p.12);
* Just a reminder
rs_cv = original cell value
cp_cv = perturbed cell value;
* Definitions (D3.1 part I, p.11):
Absolute difference of rs_cv and cp_cv --> ad
Relative absolute difference of rs_cv and cp_cv --> rad
Distance of square roots --> d_r ;
%macro def;
%do i=1 %to &vari.;
data hc91_v&i.a;
set hc91_v&i.;
ad = abs( cp_cv - rs_cv );
if rs_cv = 0 then rad = .;
else rad = ad / rs_cv;
d_r = abs( sqrt(cp_cv) - sqrt(rs_cv) );
/* cumulative distribution function for ad */
if ad < 15 then cdf_ad = ad;
else cdf_ad = 15;
/* cumulative distribution function for rad */
if rad < 0.02 then cdf_rad = 0.019;
else if 0.02 <= rad < 0.05 then cdf_rad = 0.049;
else if 0.05 <= rad < 0.1 then cdf_rad = 0.099;
else if 0.1 <= rad < 0.2 then cdf_rad = 0.199;
else if 0.2 <= rad < 0.3 then cdf_rad = 0.299;
else if 0.3 <= rad < 0.4 then cdf_rad = 0.399;
else if 0.4 <= rad < 0.5 then cdf_rad = 0.499;
else if 0.5 <= rad < 1 then cdf_rad = 0.999;
else cdf_rad = 1;
/* false zeros = cells with original value not zero perturbed to zero */
if cp_cv = 0 and rs_cv > 0 then false_zero = 1;
else false_zero = .;
/* false positive = cells with original value zero perturbed to positive value */
if rs_cv = 0 and cp_cv > 0 then false_positive = 1;
else false_positive = .;
/* indicator for cells with original (observed) value zero */
if rs_cv = 0 then zero = 1;
else zero = .;
run;
%end;
%mend def;
%def
* For ad simple descriptive statistics across all cells of a hypercube:
Max, Mean, Median, Percentiles p60, p70, p80, p90, p95 and p99
Cumulative df F_ad(d), d = 1,..,15;
%macro ad_ds;
%do i=1 %to &vari.;
proc means data=hc91_v&i.a max mean median p60 p70 p80 p90 p95 p99 var;
var ad;
output out=hc91_v&i._ad max=max_ad mean=mean_ad median=median_ad p60=p60_ad p70=p70_ad
p80=p80_ad p90=p90_ad p95=p95_ad p99=p99_ad
var=var_ad;
run;
proc freq data=hc91_v&i.a;
tables cdf_ad / out=hc91_v&i._cdf_ad outcum;
run;
data hc91_v&i._ad;
retain variant;
set hc91_v&i._ad;
variant="v&i.";
drop _type_ _freq_;
run;
data hc91_v&i._cdf_ad;
retain variant;
set hc91_v&i._cdf_ad;
variant="v&i.";
drop count percent cum_freq;
run;
proc means data=hc91_v&i.a sum;
var false_zero false_positive zero;
output out=hc91_v&i._zero1 sum(zero)=zero sum(false_zero)=false_zero
sum(false_positive)=false_positive;
run;
data hc91_v&i._zero2;
retain variant;
set hc91_v&i._zero1;
variant="v&i.";
if zero=. then pct_false_zero=-1;
else if false_zero=. then pct_false_zero=0;
else pct_false_zero = false_zero / zero * 100;
if zero=. then pct_false_positive=-1;
else if false_positive=. then pct_false_positive=0;
else pct_false_positive = false_positive / zero * 100;
if zero=. and false_zero=. then count_false_zero=0;
else if zero=. then count_false_zero=false_zero;
else count_false_zero=.;
if zero=. and false_positive=. then count_false_positive=0;
else if zero=. then count_false_positive=false_positive;
else count_false_positive=.;
drop _freq_ _type_ zero false_zero false_positive; run;
%end;
%mend ad_ds;
%ad_ds
%macro variants_ad;
%do i=1 %to &vari.;
hc91_v&i._ad
%end;
%mend variants_ad;
data hc91_ad_ds;
set %variants_ad;
run;
%macro variants_ad_cdf;
%do i=1 %to &vari.;
hc91_v&i._cdf_ad
%end;
%mend variants_ad_cdf;
data hc91_ad_cdf;
set %variants_ad_cdf;
run;
%macro variants_zero;
%do i=1 %to &vari.;
hc91_v&i._zero2
%end;
%mend variants_zero;
data hc91_zero;
set %variants_zero;
run;
* All 3-dim sub-hypercubes of hc91:
SEX x AGE.M x COC.L --> hc91s1
SEX x AGE.M x POB.H --> hc91s2
SEX x COC.L x POB.H --> hc91s3
AGE.M x COC.L x POB.H --> hc91s4;
%macro sub;
%do i=1 %to &vari.;
data hc91s1_v&i.a;
set hc91_v&i.a;
where POB_H="0.";
run;
%end;
%mend sub;
%sub
* For rad simple descriptive statistics across all cells of a 3-dim sub-hypercube:
Max, Mean, Median, Percentiles p60, p70, p80, p90, p95 and p99
Cumulative df F_rad(r), r = 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 1;
%macro rad_ds;
%do i=1 %to &vari.;
proc means data=hc91s1_v&i.a max mean median p60 p70 p80 p90 p95 p99;
var rad;
output out=hc91s1_v&i._rad max=max_rad mean=mean_rad median=median_rad p60=p60_rad
p70=p70_rad p80=p80_rad p90=p90_rad p95=p95_rad
p99=p99_rad;
run;
proc freq data=hc91s1_v&i.a;
tables cdf_rad / out=hc91s1_v&i._cdf_rad outcum;
run;
data hc91_v&i._rad;
retain variant;
set hc91s1_v&i._rad;
variant="v&i.";
drop _type_ _freq_;
run;
data hc91_v&i._cdf_rad;
retain variant;
set hc91s1_v&i._cdf_rad;
variant="v&i.";
drop count percent cum_freq;
run;
%end;
%mend rad_ds;
%rad_ds
%macro variants_rad;
%do i=1 %to &vari.;
hc91_v&i._rad
%end;
%mend variants_rad;
data hc91_rad_ds;
set %variants_rad;
run;
%macro variants_rad_cdf;
%do i=1 %to &vari.;
hc91_v&i._cdf_rad
%end;
%mend variants_rad_cdf;
data hc91_rad_cdf;
set %variants_rad_cdf;
run;
***************************************************************
** 5.2. Summary statistics for aggregates (D3.1 part II, p.13);
* Aggregates for hc91:
SEX x COC.L --> 3 x 7 = 21 aggregates;
* Cells contributing to these aggregates
1) cross-tabulations of AGE groups 1,2,3,4,5,6 and POB groups 1,2,3,4 --> l
2) cross-tabulations of AGE groups 1.1, 1.2, ... , 6.4 and POB groups 2.1.01, ... ,
2.2.6.21 --> h;
%macro aggr;
%do i=1 %to &vari.;
data hc91_v&i.b_l;
length aggregate $ 4;
set hc91_v&i.a;
where AGE_M in ('1.' '2.' '3.' '4.' '5.' '6.') and POB_H in ('1.' '2.' '2.1.' '2.2.'
'3.' '4.');
aggregate=substr(SEX,1,1)||"_"||substr(COC_L,1,1)||substr(COC_L,3,1);
run;
data hc91_v&i.b_h;
length aggregate $ 4;
set hc91_v&i.a;
where AGE_M not in ('0.' '1.' '2.' '3.' '4.' '5.' '6.') and POB_H not in ('0.' '2.'
'2.1.' '2.2.' '2.2.1.' '2.2.2.' '2.2.3.'
'2.2.4.' '2.2.5.' '2.2.6.');
aggregate=substr(SEX,1,1)||"_"||substr(COC_L,1,1)||substr(COC_L,3,1);
run;
%end;
%mend aggr;
%aggr
* Mean and sum;
%macro summary(level);
%do i=1 %to &vari.;
proc means data=hc91_v&i.b_&level.;
var ad rad;
class aggregate;
output out=hc91_v&i._meansum_&level. mean(ad)=mean_ad_&level. sum(rad)=sum_rad_&level.;
run;
%end;
%mend summary;
%summary(h)
%summary(l)
* Hellinger's distance;
%macro hd(level);
%do i=1 %to &vari.;
proc sql;
create table hc91_v&i._hd_&level. as
Select aggregate, sqrt(0.5*sum(d_r**2)) as hd_&level.
from hc91_v&i.b_&level.
group by aggregate
;
quit;
%end;
%mend hd;
%hd(h)
%hd(l)
* Merge mean, sum and hd into one dataset;
%macro sum1;
%do i=1 %to &vari.;
data hc91_v&i._summary1 (drop=_type_ _freq_);
merge hc91_v&i._meansum_l hc91_v&i._meansum_h hc91_v&i._hd_l hc91_v&i._hd_h;
by aggregate;
where aggregate ne ' ';
run;
%end;
%mend sum1;
%sum1
* Means of summary statistics;
%macro sum2;
%do i=1 %to &vari.;
proc means data=hc91_v&i._summary1;
var mean_ad_h sum_rad_h hd_h mean_ad_l sum_rad_l hd_l;
output out=hc91_v&i._summary2 mean(mean_ad_h)=ad_h mean(sum_rad_h)=rad_h
mean(hd_h)=hd_h mean(mean_ad_l)=ad_l
mean(sum_rad_l)=rad_l mean(hd_l)=hd_l;
run;
data hc91_v&i._summary3;
retain variant;
set hc91_v&i._summary2;
variant="v&i.";
drop _type_ _freq_;
run;
%end;
%mend sum2;
%sum2
* All six information loss indicators;
%macro variants_sum;
%do i=1 %to &vari.;
hc91_v&i._summary3
%end;
%mend variants_sum;
data hc91_summary;
set %variants_sum;
run;
* Clean up work library;
%macro variants_orig;
%do i=1 %to &vari.;
hc91_v&i.
%end;
%mend variants_orig;
proc datasets library=work nolist;
save hc91_zero hc91_ad_ds hc91_ad_cdf hc91_rad_ds hc91_rad_cdf hc91_summary %variants_orig;
run;
quit;