
Specific Grant Agreement (SGA)

Harmonised protection of census data in the ESS

Contract N° 11112.2016.005-2016.367

under

FPA N° 11112.2014.005-2014.533

Date

23/06/2017

Work Package 3

Development and testing of recommendations; identification of best practices

Deliverable D3.2

Results of the tests on census hypercube and grid data and information loss analysis

Authors

Maël-Luc Buron, Annu Cabrera, Junoš Lukan

Sensitivity

Available to NSIs


1. Introduction

This project deliverable presents the results of testing the selected protection methods, record swapping and random noise, on census data. Three countries from the project team participated in the testing: France, Slovenia and Finland. Each country used its own 2011 census microdata.

The project team selected the following two hypercube sets foreseen for the Census 2021 for the test (see footnote 1):

1. First hypercube set: Group 9, e.g. 9.1, 9.2, 9.3 and 9.4.

2. Second hypercube set: Group 11, e.g. 11.1 and 11.2.

These test hypercubes include variables for geographical area, sex, age, country of citizenship, place of birth and year of arrival in the country since 1980. A more detailed description of the test data can be found in project deliverable D3.1 part II.

The project team selected two SDC methods for testing: record swapping and random noise. These methods are outlined in project deliverable D3.1 part I. Independently of this project, the UK Office for National Statistics (ONS) has developed SAS codes implementing record swapping and random noise. ONS agreed to collaborate with this project and kindly offered their SAS codes for use in the testing. Project members modified and enhanced the SAS codes to suit the test settings better. For example, Germany produced the necessary noise distribution tables (“ptable”) for the random noise macro. Project deliverable D3.1 part II describes in more detail how the SAS codes can be used to protect hypercubes. Although D3.1 part II focuses on hypercubes, the testing showed that extending the codes to grid data was quite straightforward.

The aim was for each method to be tested on each hypercube and on grid data using the provided SAS codes. In practice, each tester chose the methods, variants and parameter values most suitable for them and tested only those. All three countries ran tests on both hypercubes and grid data.

2. Software issues

Some software issues were encountered during testing. The project team therefore prepared a note to help countries outside the project test the software. This note (see Annex A of this document) provides practical information on how to get started with the testing and on the issues that have already been encountered (and solved). The note was made available, in addition to the software and Deliverable 3.1, to all countries interested in testing.

1 Details on the content of the hypercubes can be found in the Census 2021 draft implementing regulation that was approved at the 30th Meeting of the European Statistical System Committee on 28 September 2016 (item 2 of the agenda).


The code for record swapping is extensive and spread over several files. Since the code was written by ONS, it is naturally closely tied to UK census data and had to be modified to run with other countries’ data. However, given the sheer amount of code, it was not possible to modify everything within this project. Some elements of the record swapping code still cannot be parameterised to fit other countries’ data better. These elements are listed in Annex A.

Protection by random noise was tested using the ONS code for the cell key method. Although this code is less complex than the record swapping code, some adjustments were needed. Germany provided new noise distribution tables (“ptable”) that the project team considered more suitable than the original ONS ptable. One important adjustment was to modify the ONS perturbation code so that zero cells are not perturbed (see D3.1 part II, pp. 8-9). During the testing, problems occurred when running the cell key method code on tables that did not have (enough) zero cells. To fix this, the code was extended and some error checking was added. It should now be possible either to perturb tables without zeros and small counts or to receive an error message in the SAS log. Finally, the additivity module (see D3.1 part II, pp. 10-12) was added to the code.
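The role of the ptable and of the rule that zero cells stay untouched can be illustrated with a small Python sketch of the general cell key idea. It is not the ONS SAS macro. The toy `PTABLE`, the record keys in [0, 1) and the function names `cell_key` and `perturb` are assumptions made for illustration only; the real noise tables and key handling are described in D3.1 part II.

```python
# Toy illustration of the cell key idea (not the ONS SAS macro).
# Assumption: every record carries a fixed random key in [0, 1).
# Assumption: PTABLE is a made-up noise table; real ptables differ.

# For each original count (counts above 3 reuse the row for 3), a list of
# (upper bound of the cell-key interval, noise to add).
PTABLE = {
    1: [(0.25, -1), (0.75, 0), (1.0, 1)],
    2: [(0.25, -1), (0.75, 0), (1.0, 1)],
    3: [(0.25, -1), (0.75, 0), (1.0, 1)],
}


def cell_key(record_keys):
    """Cell key = fractional part of the sum of the record keys in the cell."""
    return sum(record_keys) % 1.0


def perturb(count, record_keys):
    """Return the perturbed count; zero cells are never perturbed."""
    if count == 0:
        return 0
    key = cell_key(record_keys)
    for upper, noise in PTABLE[min(count, 3)]:
        if key < upper:
            return count + noise
    return count  # only reached if the key equals the last upper bound
```

Because the cell key depends only on the records that fall into the cell, the same set of records receives the same noise in every table in which it appears, whatever the order of the records.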

Regarding execution times, testing record swapping on French data revealed some time-consuming steps. The execution time depends on the number of geographical areas and, of course, on the number of households in the data. Some examples of record swapping execution times can be found in Annex A. No runtime issues occurred while testing the cell key method.

3. Test results and information loss analysis

Slovenia, Finland and France tested the chosen methods on their own data. Although only two methods were selected, record swapping and random noise (cell key method), many variants of these methods were available. For example, the choice of parameter values for record swapping, and the different ptable options and ways of dealing with additivity (cf. section 4.2 in D3.1 part II) for the cell key method, generate several variants. Countries participating in the testing could choose which variants they wanted to test.

As a result of the testing, countries produced information loss measures that include

- simple descriptive statistics for the absolute distances (AD), the relative absolute distances (RAD) and the distances of the square roots (D_R) between the cell/grid square values in the original data and the protected data, and

- aggregate level summary statistics for AD, RAD and D_R.

More detailed descriptions and formulas of these information loss measures can be found in section 5 of deliverable D3.1 part I and section 5 of D3.1 part II. In addition to the above-mentioned measures, countries also produced the empirical variance of perturbation for each hypercube and two measures to evaluate the use of zero cells in the protection procedure. These two measures are

- the percentage of “false zero” cells, i.e. the number of cells with a non-zero observed value perturbed to zero, compared to the number of cells with observed value zero, and

- the percentage of “false positive” cells, i.e. the number of cells with observed value zero perturbed to a non-zero value, compared to the number of cells with observed value zero.

Annex B of this document contains the SAS code for calculating the information loss measures for hypercube 9.1. The SAS code needs to be adjusted when applied to other hypercubes or to grid data.
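As a compact illustration of the measures listed above, the following Python sketch computes AD, RAD and D_R for one cell and the two zero-cell percentages for a list of cells. The formulas follow the definitions used in the Annex B SAS code; the function names and the choice to return `None` when a measure is undefined are assumptions of this sketch, not part of the project code.

```python
import math


def loss_measures(original, perturbed):
    """Return (AD, RAD, D_R) for one cell; RAD is None when original is 0."""
    ad = abs(perturbed - original)
    rad = ad / original if original != 0 else None
    d_r = abs(math.sqrt(perturbed) - math.sqrt(original))
    return ad, rad, d_r


def zero_cell_shares(pairs):
    """pairs: list of (original, perturbed) cell values.

    Returns (pct_false_zero, pct_false_positive), both relative to the
    number of cells whose original value is zero, or (None, None) when
    there are no original zero cells.
    """
    zeros = sum(1 for o, _ in pairs if o == 0)
    false_zero = sum(1 for o, p in pairs if o > 0 and p == 0)
    false_positive = sum(1 for o, p in pairs if o == 0 and p > 0)
    if zeros == 0:
        return None, None
    return 100 * false_zero / zeros, 100 * false_positive / zeros
```

For a cell with original value 4 and perturbed value 9, the sketch gives AD = 5, RAD = 1.25 and D_R = 1.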

3.1. Chosen variants by countries

The test settings (variants and combinations of record swapping and cell key method) chosen by the countries are described below.

Slovenia

Slovenia tested both record swapping and the cell key method on its 2011 census data set, more specifically on hypercubes 9.2 and 9.4. Only one variant of each method was applied, and the information loss measures were calculated by comparing the original data with data to which both record swapping and the cell key method had been applied.

Slovenia also tested the cell key method and record swapping on grid data, considering only one-dimensional “tables” with total population, sex, age, current activity status or place of birth (as described in the foreseen Commission implementing Regulation on 1 km² grid data for the 2021 census round, see footnote 2) at the lowest geographical level (i.e. 1 km² grids).

Information loss measures were calculated comparing 3 different sets of data obtained:

1. original data compared to swapped and perturbed data

2. original data compared to swapped data

3. swapped data compared to swapped and perturbed data.

The parameter values for record swapping, a description of the cell key method used and the numerical values of the information loss measures for Slovenian data can be found in Annex C11 (hypercubes 9.2 and 9.4) and Annex C12 (grid data).

Finland

Finland tested only the cell key method on hypercubes from 2011 census data. Three variants of the cell key method, with two different ways of dealing with additivity, were tested.

2 https://circabc.europa.eu/sd/a/88b3101e-f98b-4c1d-b97f-e7af25033e48/TFFC%28April%202017%296.3%20Geo-coding%20of%202021%20census%20data%20to%20a%201km%25c2%25b2%20grid%20-%20Presentation%20of%20the%20draft%20regulation%20and%20proposed%20way%20forward.pdf Please note that this paper is stored in the CIRCABC folder of the Census Task Force, which has restricted access. However, all members of the Census Working Group have access (or will be granted access as soon as they request it), so please contact your census colleagues to obtain the document.


The description of the variants of cell key method and the numerical values for information

loss measures comparing original and perturbed data can be found in Annex C21.

Like Slovenia, Finland also tested the cell key method on grid data. One-dimensional tables with total population, sex, age, current activity status, place of birth and usual residence 12 months before, at the level of 1 km² grids, were considered. The test grid data included only populated grid squares, i.e. there were no empty grid squares in the data before perturbation. Annex C22 contains the information loss measures for Finnish grid data.

France

France tested the cell key method on hypercubes from 2011 census data and both record

swapping and cell key method on grid data.

Four variants of the cell key method were tested on hypercubes, and information loss measures were calculated for all of them. The description of the variants and the numerical values of the information loss measures can be found in Annex C31.

For the grid data, the record swapping code was applied first, followed by one variant of the cell key method. Information loss measures were calculated to compare the 4 different sets of data obtained:

1. original data compared to perturbed data (O_C)

2. original data compared to swapped data (O_S)

3. swapped data compared to swapped and perturbed data (S_SC)

4. original data compared to swapped and perturbed data (O_SC)

The information loss was calculated first using total population counts (tot) and then using 8

other counts (other): male, female, age under 15, age between 15-64, age 65 and older,

place of birth in the reporting country, place of birth in another EU country, place of birth

outside EU. Annex C32 contains the results for French grid data.
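The four comparisons O_C, O_S, S_SC and O_SC can be organised as in the following Python sketch, which computes the mean absolute distance for each pair of data sets. The function names and the use of the mean AD as the only statistic are assumptions of this sketch; the counts passed in are expected to be aligned, one value per grid square.

```python
def mean_ad(a, b):
    """Mean absolute distance between two aligned lists of counts."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def compare_sets(original, swapped, perturbed, swapped_perturbed):
    """Return the mean AD for the four comparisons described in the text."""
    return {
        "O_C": mean_ad(original, perturbed),
        "O_S": mean_ad(original, swapped),
        "S_SC": mean_ad(swapped, swapped_perturbed),
        "O_SC": mean_ad(original, swapped_perturbed),
    }
```

Keeping S_SC next to O_SC makes it possible to separate the loss caused by the cell key method alone from the combined loss of swapping and perturbation.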

3.2. General evaluation of information loss

Since the three countries that participated in the testing chose different variants and combinations of protection methods, comparing results between countries is somewhat challenging. However, it is possible to compare countries at a very general level, or to compare the results of different variants within one country.

All cell key method variants tested on French hypercubes produced cell values with a maximum absolute deviation of 3 (or even 1 or 2) from the original cell values. Since in all these variants the higher level aggregates within the hypercubes were calculated before perturbation (option 1 for dealing with additivity, D3.1 part II, p. 11), the maximum perturbation was exactly the one defined by the ptable. Finland and Slovenia tested cell key method variants that perturbed the most detailed cross-combination table of each hypercube and calculated the higher level aggregates afterwards (option 2 for dealing with additivity). This led to maximum absolute deviations of several hundred. The empirical variance of perturbation was also at a completely different level under option 2 than under option 1. It is clear that aggregating before perturbation leads to a smaller information loss than aggregating after perturbation.
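The difference between the two additivity options can be made concrete with a small Python sketch. It assumes, purely for illustration, integer noise drawn uniformly from -max_noise to +max_noise and ignores the ptable and the non-negativity of counts; the function name `compare_options` is hypothetical. Under option 1 the aggregate receives one noise term, so its deviation is bounded by max_noise. Under option 2 the aggregate inherits the noise of every contributing cell, so its deviation can reach the number of cells times max_noise.

```python
import random


def compare_options(cells, max_noise, seed=0):
    """Return (deviation under option 1, deviation under option 2).

    Option 1: compute the aggregate first, then perturb it once.
    Option 2: perturb every detailed cell, then sum the perturbed cells.
    Noise is uniform on the integers -max_noise..max_noise (an assumption).
    """
    rng = random.Random(seed)
    true_total = sum(cells)
    option1_total = true_total + rng.randint(-max_noise, max_noise)
    option2_total = sum(c + rng.randint(-max_noise, max_noise) for c in cells)
    return abs(option1_total - true_total), abs(option2_total - true_total)
```

If the cell noises were independent with a common variance, the variance of the option 2 aggregate would also grow in proportion to the number of contributing cells, which is in line with the much larger empirical variance reported for option 2.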

Another interesting aspect of information loss, especially for Eurostat in the context of Article 6 clause 2(b) of the aforementioned draft regulation on grid data (see footnote 2), is the ability of the chosen method or variant to preserve zero cells. For the cell key method, a version of perturbation was available that by design should prevent zero cells from being perturbed into non-zero cells. Finland and France tested this version and, as expected, it generated hypercubes in which the number of cells with observed value zero perturbed to non-zero was 0. For Slovenian hypercubes, the number of “false positive” cells was at most 3.6 % of the number of observed zero cells. For French, Slovenian and Finnish grid data, 0 % false positive grid squares was achieved when only total population counts were used in the calculation. This is because the data used for testing did not include empty (unpopulated) grid squares, i.e. there were no zero grid squares that could be perturbed to non-zero. France and Slovenia also compared the original grid data to data to which only record swapping (and not the cell key method) had been applied. In this comparison too, the number of false positives was 0.

Cell perturbation and record swapping generated some “false zero” cells and grid squares in the data, i.e. cells/grid squares with a non-zero original value that were perturbed or swapped to zero. Like the ability to preserve zero cells, the number of false zeros is also of interest in the context of Article 6 clause 2(c) of the aforementioned draft regulation. For total population counts, Slovenia and France reported a 0 % share of false zeros (number of false zeros compared to the number of original zeros). For Slovenia this is due to the test scenario used. For the cell key method, Slovenia used a parameter specifying that grid squares subject to record swapping are not eligible for noise. In the Slovenian data, all small-count grid squares had been subject to record swapping, so the cell key method did not change them any further. Since the record swapping setting used by Slovenia only changes the properties of the records (e.g. people), but not the total number of records/people in a given grid square, the result is 0 % false zeros.

French and Finnish grid data had a certain number of false zero grid squares after perturbation, but since the data used for testing did not include empty grid squares, determining the share of false zeros was not straightforward. However, for both countries an estimate of the number of unpopulated grid squares was obtained from an external data source, and on this basis the approximate share of false zeros was 1.62 % for Finland and 1.8 % for France, considering only total population counts.

Although Germany did not test the cell key method on real data, it provided a rough estimate of the share of false zeros that would result for its data. With the protection strategy Germany used in the Census 2011, all 1-inhabitant grid squares were rounded down to 0, which increased the number of (true) zero grid squares by more than 2 %. The effect of the methodology currently foreseen in Germany to protect Census 2021 data would be even stronger: it would tend to turn about 70 % of the 1's and 35 % of the 2's into 0's. This affects even more cells and would increase the number of zero cells by more than 3 %.
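The arithmetic behind such estimates can be sketched in Python as follows. The two function names and all counts used in the usage example are hypothetical; only the rates of 70 % and 35 % are taken from the text above. The first function gives the share of false zeros relative to an externally estimated number of unpopulated grid squares, the second the expected relative increase in zero cells when given fractions of the 1's and 2's are turned into 0's.

```python
def share_of_false_zeros(false_zeros, original_zeros):
    """False zeros as a percentage of the (estimated) original zeros."""
    return 100 * false_zeros / original_zeros


def expected_zero_increase(n_zero, n_one, n_two, p_one=0.70, p_two=0.35):
    """Expected increase in zero cells, in percent of the original zeros.

    p_one and p_two are the fractions of 1's and 2's turned into 0's.
    """
    expected_new_zeros = p_one * n_one + p_two * n_two
    return 100 * expected_new_zeros / n_zero


# Hypothetical counts: 1000 empty squares, 20 squares with one inhabitant
# and 40 squares with two inhabitants.
increase = expected_zero_increase(n_zero=1000, n_one=20, n_two=40)
```

With these made-up counts, about 28 additional zero squares are expected, i.e. an increase of roughly 2.8 %.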


For more detailed results on information loss, Annexes C11, C12, C21, C22, C31 and C32 are

available in Excel format.

Annex A

Practical information to get started with Record Swapping and Random Noise codes

While deliverable 3.1 part II describes the methods, the SAS codes and the data used for the testing, this short note aims to provide some practical information for getting started with the SAS codes. We felt this was necessary especially for the swapping code, as the large number of files could be overwhelming.

Swapping

Section 3 of deliverable 3.1 part II refers to random swapping and describes the different parts

of the SAS codes.

The code was written by the ONS and is therefore closely tied to English data. When we modified it to run on another country’s census, we tried to change it as little as possible and kept the original overall architecture. That is why we rename the variables in a program, sdc_data_transform, that is run at the beginning of sdc_control.

There are elements that cannot be parameterized in the current state of the programs:

- the number of nested geographies must be 4: one must therefore find a counterpart of the structure ward - oa - msoa - lad (see footnote 3)

- there are 7 age * sex categories, of which only the age limits can be changed (<20 years old, male 20-39, female 20-39, etc.)

- the number of variables used to define the risk must be between 4 and 6. We define these in sdc_data_transform

- the code is designed for a structure in which each OA comprises a minimum number of observations

In the data, individuals need to be linked to households, and two separate input tables need to be provided: one for individuals and one for households, with the geography variables in only one of them.

The main result of the method is the output table named sdcresults_hh_swapped_person. The geography variables are swapped: a “keyend” suffix is added to the variables ward, oa, msoa and lad, e.g. oakeyend. These suffixed variables hold the new geography values.

The tables built at the end of the code in sdc_diagnostics_results are kept in the work library and provide useful statistics, namely the rate of people and households swapped at each geography level. We added sdc_diag_suppl to construct a spreadsheet with some of the statistics computed in sdc_diagnostics_results.
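A swap rate of this kind can be recomputed from the output table along the lines of the following Python sketch. It assumes that each record carries both the original geography variable (e.g. oa) and its “keyend” counterpart (e.g. oakeyend); whether both are present in sdcresults_hh_swapped_person is an assumption of this sketch, and the function name `swap_rate` is hypothetical.

```python
def swap_rate(records, level="oa"):
    """Share of records whose geography at `level` changed after swapping.

    records: list of dicts holding the original variable (e.g. "oa") and
    the swapped one with the "keyend" suffix (e.g. "oakeyend").
    """
    moved = sum(1 for r in records if r[level + "keyend"] != r[level])
    return moved / len(records)


# Hypothetical person records: one of four has been moved to another OA.
people = [
    {"oa": "A", "oakeyend": "B"},
    {"oa": "A", "oakeyend": "A"},
    {"oa": "C", "oakeyend": "C"},
    {"oa": "D", "oakeyend": "D"},
]
rate = swap_rate(people)
```

The same function applied with level set to "ward", "msoa" or "lad" gives the rate at the other geography levels, provided the corresponding variables exist in the records.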

3 Output Areas (OA) were created specifically for the Census, with a minimum size to prevent disclosure, typically 125 households. These are grouped into Middle Layer Super Output Areas (MSOA), which are nested in the Local Authority Districts (LAD). See https://www.ons.gov.uk/methodology/geography/ukgeographies/censusgeography for detailed information about the ONS census geography.


We summarize below the first tests we ran on French data. We are currently running other simulations to test sensitivity to the specifications. We expect to obtain better results by first grouping grid squares to define a geography in which each small area has at least a few hundred individuals.

First tests with the ONS Targeted Record Swapping programs

Data

We worked on 2011 census data, keeping dwellings with coordinates (x; y) filled in and at least one individual.

We have worked on several versions of this data:

- V1 (for debugging): only geolocated data in large municipalities, subsampled at 1/100

- V2: exhaustive geolocated data on large municipalities (3.3 million households and 7.8 million individuals in these households)

- V3: larger data set including small communes (coordinates for 82% of the households, i.e. 16.7 million households and 39.5 million individuals). For this to run, we had to split these tables into 22 regional tables.

Settings

We need to find geographical variables corresponding to the structure LAD > MSOA > OA > WARD. The geographies must be hierarchical, which is why we crossed them with the upper level (see footnote 4).

We chose to test different nested structures (see footnote 5):

- STR1: NUTS2 > NUTS3 > 1 km² grid squares crossed with NUTS3 (and we chose ward = oa)

- STR2: NUTS3 > (10 km)² grid squares crossed with NUTS3 > 1 km² grid squares crossed with NUTS3 (and we chose ward = oa)

In the end, the number of modalities at each level of the structure is the following:

- V2 / STR1: 22 / 96 / 17 541

- V3 (for region Ile de France only) / STR2: 8 / 194 / 8459
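Crossing grid squares with the upper geography can be sketched in Python as follows. A grid square that straddles two NUTS3 areas belongs to two parents, so it is split into one crossed unit per NUTS3 area, which makes the geography strictly nested and slightly increases the number of units. The function names, the underscore separator and the sample identifiers are assumptions made for illustration.

```python
def cross_with_parent(records):
    """records: iterable of (nuts3_code, grid_id) pairs.

    Returns (nuts3_code, crossed_id) pairs in which the crossed id combines
    the parent code and the grid square id.
    """
    return [(nuts3, f"{nuts3}_{grid_id}") for nuts3, grid_id in records]


def is_nested(pairs):
    """True if every child unit belongs to exactly one parent unit."""
    parent_of = {}
    for parent, child in pairs:
        if parent_of.setdefault(child, parent) != parent:
            return False
    return True


# Hypothetical sample: grid square G1 straddles two NUTS3 areas.
raw = [("FR101", "G1"), ("FR102", "G1"), ("FR101", "G2")]
crossed = cross_with_parent(raw)
```

In this sample the two raw grid squares become three crossed units, the same kind of increase as reported for the test structures.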

Swapping rate: 10%

4 variables to define the risk: country of birth with 12 modalities, nationality with 13 modalities, looking for a job with 4 modalities, age in 7 age groups

Detailed profile: ageg1 || ageg2 || ageg3 || ageg4 || ageg5 || ageg6 || ageg7 || ethc

Reduced profile: ageg1 || agec1 || ageg4 || agec2 || ageg7 || ethc

4 In structure STR1 with version V2, the number of populated grid squares goes from 17 395 to 17 541 when crossed with NUTS3. In structure STR2 with version V3, for the region Ile de France only, the number of large squares goes from 151 to 194 when crossed with NUTS3, and the number of small squares goes from 8253 to 8459.

5 The grid squares are not all inhabited, whereas the code is designed for OAs with a few hundred individuals. That is why we think it is in fact better to group grid squares together in order to construct specific OAs: the first nested structures we tested are not optimized for the swapping code shared by the ONS.


where agec1=ageg2+ageg3 and agec2=ageg5+ageg6

Very reduced profile: person || ethc

We kept the definition of ageg1 to ageg7, ethc and person:

Ageg1 = number of people under 20 years old

Ageg2 = number of men aged 20 to 39

Ageg3 = number of men aged 40-59

Ageg4 = number of men aged 60 and over

Ageg5 = number of women aged 20 to 39

Ageg6 = number of women aged 40-59

Ageg7 = number of women aged 60 and over

Ethc = number of people not born in France

Person = number of individuals in the household
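The three household profiles defined above can be built from the household counts as in the following Python sketch. The dictionary layout, the function name `profiles` and the "|" separator between values are assumptions of this sketch; the SAS code concatenates the values with the || operator, and its exact formatting is not reproduced here.

```python
def profiles(hh):
    """hh: dict with the counts ageg1..ageg7, ethc and person.

    Returns the (detailed, reduced, very reduced) profile strings.
    """
    agec1 = hh["ageg2"] + hh["ageg3"]  # men aged 20 to 59
    agec2 = hh["ageg5"] + hh["ageg6"]  # women aged 20 to 59
    detailed = [hh[f"ageg{k}"] for k in range(1, 8)] + [hh["ethc"]]
    reduced = [hh["ageg1"], agec1, hh["ageg4"], agec2, hh["ageg7"], hh["ethc"]]
    very_reduced = [hh["person"], hh["ethc"]]
    return tuple("|".join(str(v) for v in p)
                 for p in (detailed, reduced, very_reduced))


# Hypothetical household: two children, one man and one woman aged 20-39,
# one person not born in France.
household = {"ageg1": 2, "ageg2": 1, "ageg3": 0, "ageg4": 0, "ageg5": 1,
             "ageg6": 0, "ageg7": 0, "ethc": 1, "person": 4}
detailed, reduced, very_reduced = profiles(household)
```

Each profile is coarser than the previous one, so two households that share a detailed profile also share the reduced and very reduced profiles.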

Calculation time

We identified 3 time-consuming steps:

- sdc_data_transform: linked to our filters (approx. 10 min)

- sdc_sample: calculation time is linear in the number of OAs

- sdc_matching: 22 calls to a loop of 16 iterations, each comprising several proc sort and merge steps (of the order of 20 seconds per iteration on V2).

On a subsampled table of about 500,000 households, with the same parameterization as above, the total execution time was around 15 min.

On the table with 3.3 million households, it goes up to 5 hours, but this depends largely on the number of OAs.

If the V3 table is not split by region, it is impossible to run the program without saturating the server. It must therefore be executed on the regional tables, adapting the structure of nested geographies (we changed from STR1 to STR2). In fact, ONS also splits the country into delivery groups (there are just over 100 Delivery Groups covering England and Wales) before running the program. For Ile de France, the largest region, the execution time is about 2 hours. For an average region, it is around 30 min.

Random noise

Section 4 of deliverable 3.1 part II refers to random noise and describes how to use the SAS

codes implementing the cell key method.

To gain some first experience, we suggest running the example provided. Open the master program “ckm_hc_9_2.sas”, which is an example using a synthetic version of "hypercube 9.2". Testers only have to change the settings of the master program (e.g. path names). Here is a short overview of the four steps within the master program:

Step 1 Initializing:

Define your paths, include the macros and set the parameters

Step 2 Hypercube Preparation:

Specify and import the synthetic hypercube or one of your Census 2011 hypercubes as outlined in section 2

Step 3 Blocking zero cells:

Implement edit rules of the hypercube to determine zero cells that should not be perturbed (only necessary if you are going to perturb zero cells into non-zero cells)

Step 4 Perturbation:

You will have the choice between four different versions of perturbation tables (ptables). Execute ONE of the four perturbations. Version 01 is recommended for the testing in the project since it avoids the complexity of, for example, step 3.

There should not be any runtime issues: for example, the code took around two and a half minutes on a single computer for hypercube 9.1 and a table with 9 million individuals.

Annex B

***********************************

*** Information loss measures ***

***********************************;

* Annu Cabrera;

* 1.6.2017;

* Hypercube: 9.1

* Different variants of protection method:

- Description of variant 1 (e.g. Cell key method, ptable_d3_v200, Option 1 in D3.1 part

II, p. 11) --> Variant1

- Description of variant 2 --> Variant2

- ...

;

* Read in Variant1, Variant2, ...;

libname v1 "...";

data work.hc91_v1;

set v1.hc91;

run;

libname v2 "...";

data work.hc91_v2;

set v2.hc91;

run;

* Number of variants;

%let vari = 2;

options mprint;

***********************************************************

** 5.1. Simple descriptive statistics (D3.1 part II, p.12);

* Just a reminder

rs_cv = original cell value

cp_cv = perturbed cell value;

* Definitions (D3.1 part I, p.11):

Absolute difference of rs_cv and cp_cv --> ad

Relative absolute difference of rs_cv and cp_cv --> rad

Distance of square roots --> d_r ;

%macro def;

%do i=1 %to &vari.;

data hc91_v&i.a;

set hc91_v&i.;

ad = abs( cp_cv - rs_cv );

if rs_cv = 0 then rad = .;

else rad = ad / rs_cv;

d_r = abs( sqrt(cp_cv) - sqrt(rs_cv) );

/* cumulative distribution function for ad */

if ad < 15 then cdf_ad = ad;

else cdf_ad = 15;

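/* cumulative distribution function for rad */

/* rad is missing when rs_cv = 0; SAS treats a missing value as smaller than any number, so test for it first */

if rad = . then cdf_rad = .;

else if rad < 0.02 then cdf_rad = 0.019;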

else if 0.02 <= rad < 0.05 then cdf_rad = 0.049;

else if 0.05 <= rad < 0.1 then cdf_rad = 0.099;

else if 0.1 <= rad < 0.2 then cdf_rad = 0.199;

else if 0.2 <= rad < 0.3 then cdf_rad = 0.299;

else if 0.3 <= rad < 0.4 then cdf_rad = 0.399;

else if 0.4 <= rad < 0.5 then cdf_rad = 0.499;

else if 0.5 <= rad < 1 then cdf_rad = 0.999;

else cdf_rad = 1;

/* false zeros = cells with original value not zero perturbed to zero */

if cp_cv = 0 and rs_cv > 0 then false_zero = 1;

else false_zero = .;

/* false positive = cells with original value zero perturbed to positive value

*/


if rs_cv = 0 and cp_cv > 0 then false_positive = 1;

else false_positive = .;

/* indicator for cells with original (observed) value zero */

if rs_cv = 0 then zero = 1;

else zero = .;

run;

%end;

%mend def;

%def

* For ad simple descriptive statistics across all cells of a hypercube:

Max, Mean, Median, Percentiles p60, p70, p80, p90, p95 and p99

Cumulative df F_ad(d), d = 1,..,15;

%macro ad_ds;

%do i=1 %to &vari.;

proc means data=hc91_v&i.a max mean median p60 p70 p80 p90 p95 p99 var;

var ad;

output out=hc91_v&i._ad max=max_ad mean=mean_ad median=median_ad p60=p60_ad p70=p70_ad

p80=p80_ad p90=p90_ad p95=p95_ad p99=p99_ad

var=var_ad;

run;

proc freq data=hc91_v&i.a;

tables cdf_ad / out=hc91_v&i._cdf_ad outcum;

run;

data hc91_v&i._ad;

retain variant;

set hc91_v&i._ad;

variant="v&i.";

drop _type_ _freq_;

run;

data hc91_v&i._cdf_ad;

retain variant;

set hc91_v&i._cdf_ad;

variant="v&i.";

drop count percent cum_freq;

run;

proc means data=hc91_v&i.a sum;

var false_zero false_positive zero;

output out=hc91_v&i._zero1 sum(zero)=zero sum(false_zero)=false_zero

sum(false_positive)=false_positive;

run;

data hc91_v&i._zero2;

retain variant;

set hc91_v&i._zero1;

variant="v&i.";

if zero=. then pct_false_zero=-1;

else if false_zero=. then pct_false_zero=0;

else pct_false_zero = false_zero / zero * 100;

if zero=. then pct_false_positive=-1;

else if false_positive=. then pct_false_positive=0;

else pct_false_positive = false_positive / zero * 100;

if zero=. and false_zero=. then count_false_zero=0;

else if zero=. then count_false_zero=false_zero;

else count_false_zero=.;

if zero=. and false_positive=. then count_false_positive=0;

else if zero=. then count_false_positive=false_positive;

else count_false_positive=.;

drop _freq_ _type_ zero false_zero false_positive; run;

%end;

%mend ad_ds;

%ad_ds

%macro variants_ad;

%do i=1 %to &vari.;

hc91_v&i._ad

%end;

%mend variants_ad;

data hc91_ad_ds;

set %variants_ad;

run;

%macro variants_ad_cdf;


%do i=1 %to &vari.;

hc91_v&i._cdf_ad

%end;

%mend variants_ad_cdf;

data hc91_ad_cdf;

set %variants_ad_cdf;

run;

%macro variants_zero;

%do i=1 %to &vari.;

hc91_v&i._zero2

%end;

%mend variants_zero;

data hc91_zero;

set %variants_zero;

run;

* All 3-dim sub-hypercubes of hc91:

SEX x AGE.M x COC.L --> hc91s1

SEX x AGE.M x POB.H --> hc91s2

SEX x COC.L x POB.H --> hc91s3

AGE.M x COC.L x POB.H --> hc91s4;

%macro sub;

%do i=1 %to &vari.;

data hc91s1_v&i.a;

set hc91_v&i.a;

where POB_H="0.";

run;

%end;

%mend sub;

%sub

* For rad, simple descriptive statistics across all cells of a 3-dim sub-hypercube:

Max, Mean, Median, Percentiles p60, p70, p80, p90, p95 and p99

Cumulative df F_rad(r), r = 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 1;

%macro rad_ds;

%do i=1 %to &vari.;

proc means data=hc91s1_v&i.a max mean median p60 p70 p80 p90 p95 p99;

var rad;

output out=hc91s1_v&i._rad max=max_rad mean=mean_rad median=median_rad p60=p60_rad

p70=p70_rad p80=p80_rad p90=p90_rad p95=p95_rad

p99=p99_rad;

run;

proc freq data=hc91s1_v&i.a;

tables cdf_rad / out=hc91s1_v&i._cdf_rad outcum;

run;

data hc91_v&i._rad;

length variant $ 8; /* fixed length so that "v10" and higher are not truncated when the variants are stacked */

set hc91s1_v&i._rad;

variant="v&i.";

drop _type_ _freq_;

run;

data hc91_v&i._cdf_rad;

length variant $ 8; /* fixed length so that "v10" and higher are not truncated when the variants are stacked */

set hc91s1_v&i._cdf_rad;

variant="v&i.";

drop count percent cum_freq;

run;

%end;

%mend rad_ds;

%rad_ds

%macro variants_rad;

%do i=1 %to &vari.;

hc91_v&i._rad

%end;

%mend variants_rad;

data hc91_rad_ds;

set %variants_rad;

run;

%macro variants_rad_cdf;

%do i=1 %to &vari.;

hc91_v&i._cdf_rad

%end;

%mend variants_rad_cdf;

data hc91_rad_cdf;

set %variants_rad_cdf;

run;

***************************************************************

** 5.2. Summary statistics for aggregates (D3.1 part II, p.13);

* Aggregates for hc91:

SEX x COC.L --> 3 x 7 = 21 aggregates;

* Cells contributing to these aggregates

1) cross-tabulations of AGE groups 1,2,3,4,5,6 and POB groups 1,2,3,4 --> l

2) cross-tabulations of AGE groups 1.1, 1.2, ... , 6.4 and POB groups 2.1.01, ... ,

2.2.6.21 --> h;

%macro aggr;

%do i=1 %to &vari.;

data hc91_v&i.b_l;

length aggregate $ 4;

set hc91_v&i.a;

where AGE_M in ('1.' '2.' '3.' '4.' '5.' '6.') and POB_H in ('1.' '2.' '2.1.' '2.2.'

'3.' '4.');

aggregate=substr(SEX,1,1)||"_"||substr(COC_L,1,1)||substr(COC_L,3,1);

run;

data hc91_v&i.b_h;

length aggregate $ 4;

set hc91_v&i.a;

where AGE_M not in ('0.' '1.' '2.' '3.' '4.' '5.' '6.') and POB_H not in ('0.' '2.'

'2.1.' '2.2.' '2.2.1.' '2.2.2.' '2.2.3.'

'2.2.4.' '2.2.5.' '2.2.6.');

aggregate=substr(SEX,1,1)||"_"||substr(COC_L,1,1)||substr(COC_L,3,1);

run;

%end;

%mend aggr;

%aggr

* Mean and sum;

%macro summary(level);

%do i=1 %to &vari.;

proc means data=hc91_v&i.b_&level.;

var ad rad;

class aggregate;

output out=hc91_v&i._meansum_&level. mean(ad)=mean_ad_&level. sum(rad)=sum_rad_&level.;

run;

%end;

%mend summary;

%summary(h)

%summary(l)

/* Hellinger's distance (block comment: an unmatched apostrophe is not allowed in a statement-style comment) */

%macro hd(level);

%do i=1 %to &vari.;

proc sql;

create table hc91_v&i._hd_&level. as

Select aggregate, sqrt(0.5*sum(d_r**2)) as hd_&level.

from hc91_v&i.b_&level.

group by aggregate

;

quit;

%end;

%mend hd;

%hd(h)

%hd(l)
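
/* Illustrative sketch only, not part of the original program.
   The variable d_r is computed earlier in the program; this sketch ASSUMES
   that it is the difference of the square roots of the original and the
   perturbed cell shares within an aggregate, which gives the usual form
     HD = sqrt( 0.5 * sum( (sqrt(p_orig) - sqrt(p_pert))**2 ) ).
   The names _hd_demo, _hd_demo_out, orig, pert, tot_o and tot_p are
   hypothetical, and the counts are made up. */

data _hd_demo;
  input aggregate $ orig pert;
  datalines;
A 10 12
A 30 28
A 60 60
B 40 40
B 60 60
;
run;

proc sql;
  create table _hd_demo_out as
  select aggregate,
         sqrt(0.5*sum((sqrt(orig/tot_o) - sqrt(pert/tot_p))**2)) as hd
  from (select aggregate, orig, pert,
               sum(orig) as tot_o, sum(pert) as tot_p
        from _hd_demo
        group by aggregate) as t
  group by aggregate
  ;
quit;

/* hd is 0 for aggregate B, where both distributions coincide,
   and a small positive value for aggregate A. */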

* Merge mean, sum and hd into one dataset;

%macro sum1;

%do i=1 %to &vari.;

data hc91_v&i._summary1 (drop=_type_ _freq_);

merge hc91_v&i._meansum_l hc91_v&i._meansum_h hc91_v&i._hd_l hc91_v&i._hd_h;

by aggregate;

where aggregate ne ' ';

run;

%end;

%mend sum1;

%sum1

* Means of summary statistics;

%macro sum2;

%do i=1 %to &vari.;

proc means data=hc91_v&i._summary1;

var mean_ad_h sum_rad_h hd_h mean_ad_l sum_rad_l hd_l;

output out=hc91_v&i._summary2 mean(mean_ad_h)=ad_h mean(sum_rad_h)=rad_h

mean(hd_h)=hd_h mean(mean_ad_l)=ad_l

mean(sum_rad_l)=rad_l mean(hd_l)=hd_l;

run;

data hc91_v&i._summary3;

length variant $ 8; /* fixed length so that "v10" and higher are not truncated when the variants are stacked */

set hc91_v&i._summary2;

variant="v&i.";

drop _type_ _freq_;

run;

%end;

%mend sum2;

%sum2

* All six information loss indicators;

%macro variants_sum;

%do i=1 %to &vari.;

hc91_v&i._summary3

%end;

%mend variants_sum;

data hc91_summary;

set %variants_sum;

run;

* Clean up work library;

%macro variants_orig;

%do i=1 %to &vari.;

hc91_v&i.

%end;

%mend variants_orig;

proc datasets library=work nolist;

save hc91_zero hc91_ad_ds hc91_ad_cdf hc91_rad_ds hc91_rad_cdf hc91_summary %variants_orig;

run;

quit;