Statistical Disclosure Risk: Separating Potential and Harm

Chris Skinner

Department of Statistics, London School of Economics and Political Science, Houghton Street, London WC2A 2AE, UK.

Email: [email protected]

Summary

Statistical agencies are keen to devise ways to provide research access to data while protecting confidentiality. Although methods of statistical disclosure risk assessment are now well established in the statistical science literature, the integration of these methods by agencies into a general scientific basis for their practice still proves difficult. This paper seeks to review and clarify the role of statistical science in the conceptual foundations of disclosure risk assessment in an agency's decision making. Disclosure risk is broken down into disclosure potential, a measure of the ability to achieve true disclosure, and disclosure harm. It is argued that statistical science is most suited to assessing the former. A framework for this assessment is presented. The paper argues that the intruder's decision making and behaviour may be separated from this framework, provided appropriate account is taken of the nature of potential intruder attacks in the definition of disclosure potential.

Key words: Confidentiality; disclosure limitation; identification; intruder.


1 Introduction

The challenge of devising ways to provide researchers with access to microdata and other statistical outputs while protecting confidentiality continues to be the subject of intense interest and development around the world. Given the wide variety of contexts in which this challenge arises, it seems unlikely that these developments will converge on any single solution. National Research Council (2005, p. 68) recommends that “data produced or funded by government agencies should continue to be made available for research through a variety of modes, including various modes of restricted access to confidential data and unrestricted access to public-use data altered in a variety of ways to maintain confidentiality”.

The prototypical set-up is depicted in Figure 1. An agency undertakes a survey (or similar exercise) in which data are collected from data subjects. The agency has a remit to use these data to produce outputs which will serve the statistical needs of the agency's users. The agency faces the potential problem of statistical disclosure, however: one of the users may be able to use the outputs to disclose information about one of the data subjects. A (hypothetical) user who 'misuses' the outputs in this way is called an intruder. To address this problem, the agency needs to decide what mode of access to employ and, given this mode of access, what statistical disclosure limitation (SDL) methods to employ in the specification of the statistical outputs. We include in the scope of SDL not only methods which modify data by restriction of detail, perturbation (Willenborg and De Waal, 2001, Ch. 4, 5) or conversion into synthetic data (Reiter, 2009), but also methods employed to restrict the outputs which a researcher can obtain from a remote access analysis server (Gomatam et al., 2005a) or can take out from a research data centre (National Research Council, 2005, pp. 29-31).

Experience with different modes of access can lead to consensus on 'best practice' for some features of these approaches. Nevertheless, it is widely recognized that questions about when confidentiality is adequately protected by an SDL method, and how to come to that judgement, remain hard ones to answer across most modes of access. The decision problem is often posed in a 'risk-utility' framework (Duncan et al., 2011, Ch. 6). The extent to which confidentiality is protected is assessed by measure(s) of what is called statistical disclosure risk. The extent to which the outputs satisfy the statistical purposes for which they are produced is assessed by measure(s) of utility. An agency has to trade off these two criteria. The fact that attempts to reduce disclosure risk often also lead to a decrease in utility is described by Doyle et al. (2001, p. 1) as “a fundamental tension at the heart of every statistical agency's mission”.

[Figure 1: The Different Parties in the Prototypical Set-up where Statistical Disclosure may arise. Data flow from the Data Subjects to the Agency, which releases Outputs to the Users, among whom may be an Intruder.]

Research into the assessment of disclosure risk has a long history in statistical science, going back at least to the 1970s, e.g. Fellegi (1972) and Dalenius (1977). It also remains an active current area of research, e.g. at the interface with computer science (Dwork et al., 2006; Abowd et al., 2009; Wasserman and Zhou, 2010). Why then does it remain difficult to use this body of work to support decisions about statistical disclosure in practice? Cox et al. (2011) consider that the risk-utility framework does provide a common language for thinking about confidentiality but they argue, through examples, that it is much less useful in practice and conclude that “today there is not a science of data confidentiality”.

In this paper it will be argued that there is a place for statistical science in disclosure risk assessment, but that it is important to separate out what can be achieved by statistical science and what aspects of decision making require other inputs, such as policy judgements. To help obtain this separation, we propose to break down the notion of disclosure risk. The term 'risk' has many meanings across scientific disciplines and we take as our starting point its use in risk management (Vaughan, 1997) as an uncertain event with potentially adverse consequences. The risk of statistical disclosure then has two dimensions: the probability of statistical disclosure and a measure of the harmful consequences which may ensue if statistical disclosure occurs. Consideration of both dimensions may be found in the statistical literature. Much attention is devoted to the probability of disclosure (e.g. Reiter, 2005). A number of papers (see end of Section 2.2) suggest, however, that disclosure risk should also embrace the notion of harm. A difficulty in formulating a definition for this two-dimensional representation of disclosure risk, however, is to decide whether statistical disclosure should refer to true disclosure or, more broadly, to any kind of attack and claimed disclosure, whether true or not. We argue in Section 2.2 that, in the context of many official confidentiality statements, it is appropriate that the probability of disclosure refers to true disclosure. On the other hand, we consider that when an agency wishes to assess and manage the potential harm from threats to confidentiality it is appropriate to adopt the broader definition. In summary, we propose to break down the notion of disclosure risk into two separate notions: a measure summarising the uncertainty about whether true disclosure occurs, which we shall refer to as disclosure potential (or identifiability when disclosure is defined as identification), and a measure of the adverse consequences of potential attempts at disclosure (whether successful or not), which we shall refer to as disclosure harm. We shall argue that statistical science is primarily suited to assessing disclosure potential but not disclosure harm.

A further source of complexity in developing methodology for decision making about disclosure is the existence of the multiple parties in Figure 1, each of which may make decisions. We shall take the agency's decisions as the central ones which statistical science is needed to support, but we shall seek to clarify the link between these decisions and the intruder's perspective. Fienberg et al. (1997, p. 75) argue that “a data collection agency must consider disclosure from the perspective of an intruder in order to efficiently evaluate data disclosure limitation procedures”. We shall review this idea.

The aim of the paper is then to review, explore and clarify the conceptual basis and logic of decisions by a statistical agency relating to disclosure risk. In particular, we seek to clarify the role of statistical science in the agency's decision making and, by separation of tasks, to limit the amount of non-statistical assumptions (about matters such as intruder judgements and behaviour) required for disclosure risk assessment.

The paper is intended to support the work of statistical agencies in two ways: first, by offering alternative ways of providing transparent and defensible bases for SDL decisions and, second, by facilitating the division of labour between statistical methodologists and others, such as policy analysts, within a statistical agency regarding the preparation of evidence to support SDL decisions.

Given the breadth of the field, we first set some limits to the scope of the paper:

• we recognize, as discussed by National Research Council (2005, pp. 55-59), that confidentiality breaches may occur for reasons of carelessness, illegal intrusions, law enforcement and national security as well as statistical disclosure, but we restrict attention to the latter;

• we focus on broader conceptual issues and not the detail of specific SDL techniques;

• we shall not discuss utility in any detail;

• while decision theoretic ideas underlie parts of the development, we shall not attempt any detailed formalization;

• we do not seek to provide a comprehensive review of the literature; rather we aim to identify and discuss what are judged to be the key ideas;

• we refer to an agency, with government statistical agencies in mind, since these have been the main drivers of developments in SDL methodology, but our discussion should be relevant to other kinds of organisation with responsibilities both to disseminate statistical data products to multiple users and to protect the confidentiality of the sources of the data;

• we restrict attention to the prototypical set-up for a single agency, outlined in this section, which provides a basis for discussing key conceptual issues in SDL methodology, and do not consider e.g. secure multi-party computation (Karr et al., 2005; Fienberg, 2006) for distributed databases across several agencies;

• we use the term survey to denote the source of the data, but most of what we say will also be relevant to other sources such as a census or an administrative source.

The paper is organised as follows. We contrast the agency's and the intruder's perspectives in Sections 2 and 3, starting by considering the agency as a decision maker, which is our principal interest. We then proceed to consider the notion of disclosure risk in more detail in Section 4. We reflect further on the role of harm in Section 5 and provide some conclusions in Sections 6 and 7.

2 The Agency as Decision Maker

In deciding what actions to take to protect against statistical disclosure, the agency needs to consider:

• the nature of the alternative actions;

• the loss to the agency arising from the consequences of its actions.

Zaslavsky and Horton (1998) provide an example, where the agency needs to decide between two actions, whether to publish an output or not, and the loss to the agency as a result of these actions is defined in terms of the threat of disclosure, if the agency decides to publish, or the loss of information, if the agency takes the alternative action.


We discuss the nature of the actions and the loss in Sections 2.1 and 2.2, respectively. We shall return to outlining the decision framework in general terms in Section 2.3.

2.1 The Agency’s Actions

We distinguish two kinds of actions which an agency may take when confronted by potential threats to confidentiality:

• the use of SDL methods as part of the process of determining the statistical outputs: examples include techniques which transform or perturb outputs, such as recoding, sampling or creation of missing values, adding noise, swapping values, or replacing microdata by synthetic data;

• additional disclosure management actions to protect against and discourage disclosure attempts and the misuse of outputs: examples include access controls, where users must sign up to certain conditions in a licence agreement, which may include penalties for misuse, or training of users in why confidentiality needs to be protected.

See e.g. Willenborg and De Waal (2001) and Doyle et al. (2001) for references to the statistical literature on SDL methods and National Research Council (2005, Ch. 2, 5) for broader issues of disclosure management.

2.2 The Loss to the Agency

In order to decide which action to take in the light of possible threats, the agency needs to be able to evaluate the potential consequences of its decisions. In this section, we discuss the nature of the principal criteria relevant to this evaluation and refer to these as loss criteria. The need for multiple criteria makes the decision a complex value problem in the terminology of Keeney and Raiffa (1976, Sect. 1.4.1). We address the question of how to accommodate these multiple criteria when making decisions in Section 2.3.

It is natural for the agency's obligations to protect confidentiality to provide the basis of key loss criteria. These obligations might be legal, professional or ethical. For example, it is stated in the United States Code Title 13 that the Census Bureau is prohibited from producing outputs “whereby the data furnished by any particular establishment or individual under this title can be identified”. Such obligations are frequently expressed in terms of the ability of anyone with access to the outputs to use them to achieve disclosure or breach confidentiality. We conceive of this ability in terms of the potential to infer the value of a target y, embracing the possibilities of:

• predictive disclosure: where y is the value of a survey variable for a particular known individual or other unit;

• identity disclosure (also called identification): where y is the value of a binary variable indicating whether a particular element of the output (e.g. a record in a microdata file) belongs to a particular known individual or other unit.

The Title 13 example above refers to identity disclosure, and the ability to infer this may be called identifiability. This term is used by Government Statistical Service (2009, p. 12) to explain the requirements for confidentiality protection in the UK Code of Practice for Official Statistics: an output “will not usually directly reveal the identity of an individual, but more usually the risk is that the statistic may make an individual identifiable - the individual has not yet been identified, but it may be possible to do so”.

We shall represent inference about y by a predictive probability distribution p(y|O, D, attack method), which depends on O, the statistical output, and D, the auxiliary data available to an intruder who may seek to learn about y using a particular attack method. There will generally be a set of such distributions for different targets y (corresponding to different data subjects and variables) and different possible sources of auxiliary data, D_k, and attack methods k (k = 1, ..., K).

We take the agency's first loss criterion to be a summary of these predictive distributions in what we call the disclosure potential (or identifiability in the case of identity disclosure), expressed as the function H(O; D_k, attack method k; k = 1, ..., K). The agency is assumed here to have integrated out the uncertainty in the predictive distributions for the various possible targets y. The form of H will depend on what the agency judges to be its obligations. If, for example, these relate to identification and the agency focuses on the worst case of identification, H could be the maximum value of p(y|O, D, attack method) across all identity indicators y for different output elements and data subjects. Alternatively, if the agency focuses on how often this probability exceeds 0.5 or some other threshold, H might count the number of data subjects for which this is the case for some element of the output.
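To fix ideas, the following minimal sketch (an illustration, not from the paper; the function names and probabilities are hypothetical) computes the two example forms of the summary H just described, given a set of predictive probabilities for identity indicators under one scenario.

```python
# Sketch of two summaries H of disclosure potential, as described above.
# `probs` holds p(y | O, D_k, attack method k) for each identity indicator y
# (one per data subject / output element); all names are hypothetical.

def h_worst_case(probs):
    """Worst-case identifiability: the maximum predictive probability."""
    return max(probs)

def h_threshold_count(probs, threshold=0.5):
    """Count of data subjects whose identification probability exceeds
    a tolerance threshold (0.5 in the text's example)."""
    return sum(1 for p in probs if p > threshold)

# Hypothetical predictive probabilities for five data subjects:
probs = [0.02, 0.10, 0.65, 0.30, 0.55]
print(h_worst_case(probs))        # 0.65
print(h_threshold_count(probs))   # 2 subjects exceed 0.5
```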

We shall discuss the notion of predictive probability distribution further in Section 4. We draw on the use of this distribution by Duncan and Lambert (1986) as a unifying representation of disclosure risk, but do not follow their interpretation of this distribution in terms of the intruder's perspective, as will be discussed in Section 3. Note also that there is a significant part of the disclosure limitation literature where disclosure relates to deductive (mathematical) inference, e.g. the notions implicit in the p% rule (Cox, 2001), and not probabilistic inference as supposed here. In principle, it would seem that such measures could also be combined into a disclosure potential function, since this is not required to take a probabilistic form, but we shall not explore this possibility further here.


Although agencies' legal and professional obligations to protect confidentiality appear most often to be expressed in terms of inferential criteria with disclosure viewed as the potential outcome of concern, there are also ethical and other reasons why agencies will wish to take account of the possible undesirable consequences of possible disclosure or, more generally, of any actions by intruders designed to breach confidentiality. We refer to such consequences as disclosure harm (cf. Lambert, 1993). Such consequences may refer to:

• respondents: for example, the impact of whatever actions an intruder might take following an attempt at disclosure, whether the resulting disclosure is correct or false; such harm might be 'legal, reputational, financial or psychological' (National Research Council, 2007, p. 14); and

• the agency: for example, the impact of publicity about reported disclosures on the agency's reputation and on response rates to surveys run by the agency (Couper et al., 2010).

Disclosure harm is our second loss criterion. We discuss its definition further in Section 5. We note that the assessment of harm may need to take account of considerable uncertainty about potential intruder actions, but we treat this uncertainty as quite separate from the inferential uncertainty arising in the assessment of disclosure potential.

Both disclosure potential and disclosure harm are criteria which capture the possible consequences of the agency's actions for respondents and for the agency. It is also, of course, essential to consider the consequences for the genuine users of the outputs. This impact is represented by the utility of the output, which refers to how far it meets the analytical needs of users. Many SDL methods can impact negatively on these needs either through loss of information, e.g. loss of geographical detail, or through reductions of data quality, e.g. by affecting the biases and variances of users' estimators. These impacts are of crucial importance when evaluating SDL methods but, as noted earlier, are not the main focus of this paper. For further discussion see Karr et al. (2006) and Woo et al. (2009).

In summary, we have argued that the loss to the agency should be based upon three principal criteria:

• disclosure potential;

• disclosure harm;

• utility of output to users.

The distinction between disclosure potential and harm is fundamental in this paper, in contrast to some of the literature where the notion of harm is subsumed within that of risk. For example, Duncan et al. (2001b) view disclosure risk, in its most general form, as including the consequences of intrusion both for the agency and for the intruder, and this enables risk-utility analyses to have more general applicability. Dobra et al. (2003) also subsume harm within risk by referring to disclosure risk as “the extent to which [the output's] release can harm the agency or the data providers”. To reiterate, our distinction between disclosure potential and harm is that the former refers to the ability of the intruder to disclose (i.e. infer) information, whereas the latter refers to consequences of intruder actions (following a deliberate or inadvertent attempt at disclosure). A related distinction between disclosure risk and harm is presented by Lambert (1993). She restricts disclosure risk to considerations of identity disclosure, whereas it is assumed here that disclosure potential may also refer to predictive disclosure. She states that she considers the latter form of disclosure only to the extent that it may lead to disclosure harm. In this paper, we suppose that identity disclosure could also lead to harm. Our distinction is similar to that in National Research Council (2007, pp. 13-15) and to that between disclosure harm and 'discredit harm' in Trottini (2001).

2.3 The General Decision Problem

The elements of the decision framework outlined in the two previous sections are brought together in Figure 2. The two components of the agency's actions (SDL methods and disclosure management) each impact directly on what analyses the users can conduct and on the environment in which these analyses are undertaken, and hence each component has direct consequences for utility. Our primary focus here, however, is on the other loss criteria.

Disclosure potential is defined in terms of the capacity of an intruder to make inferences. The nature of these inferences depends upon the nature of the statistical outputs (and hence the SDL method) and the information additionally available to the intruder. The extent to which disclosure potential may depend additionally upon the way in which an intruder might attempt disclosure and, indirectly, upon the agency's management approach to discourage such an attempt is discussed in Section 4.

The actions of the intruder, as represented in Figure 2, are taken to refer to any actions resulting from an attempt at disclosure. Examples include a journalist publishing a claim that they have identified a known individual in an output and possibly disclosing information about that person, or a commercial intruder linking information from the output into a credit referencing database (Paass, 1988). It is supposed that the nature of the actions will depend upon the intruder's capacity to make inferences (but not on the outputs themselves other than via these inferences) as well as upon other motivations of the intruder. It is assumed that the actions will also be influenced by the agency's disclosure management approach, for example by any penalties for misusing the outputs. The disclosure harm is taken to be purely a function of these actions.

[Figure 2: Framework for Agency's Decision Making. The agency's two kinds of action, disclosure limitation methods and disclosure management, determine utility and shape the intruder's inference and actions; the intruder's inference determines disclosure potential and the intruder's actions determine disclosure harm.]

Given the three distinct loss criteria, the general decision problem faced by the agency is a multiple objective one (Keeney and Raiffa, 1976). In fact, the multiplicity of objectives is even greater since each of the loss criteria may be multidimensional. The measurement of utility will usually require a trade-off between the needs of different users; for example, reducing the geographic detail in an output may reduce the utility of the output far more to a user in local government (in a small municipality) than to a user in national government. Measures of the potential for predictive disclosure are variable-specific and thus typically multiple. Measuring harm also requires consideration of multiple dimensions, for example the harm to respondents vs. the harm to the agency.

There is an extensive literature on multiple objective decision making (e.g. Keeney and Raiffa, 1976; French, 1988). A classical approach would build on an assumption that the agency has preferences between any pair of different consequences to demonstrate the existence of a real-valued loss function of the different loss criteria and actions (DeGroot, 1970, Ch. 7). The agency's optimal decision would then be to choose that action which minimized the expected loss (DeGroot, 1970, Ch. 8). We do not seek to develop the application of such an approach in this paper, however. We note that any such approach would need to address the trade-off between the three loss criteria (Doyle et al., 2001, p. 1). See e.g. Keeney and Raiffa (1976) for some discussion of trade-offs under uncertainty including, for example, the notion of efficient frontiers. Gomatam et al. (2005b) provide some discussion of such notions in the context of the two criteria of disclosure risk and utility. Such ideas could be extended to the trade-off between utility, harm and disclosure potential. Thus, for example, an agency might seek to maximize utility subject to upper bound constraints on each of disclosure potential and harm. A partial order on release options might be defined with respect to each of decreasing potential, decreasing harm and increasing utility, and an associated potential-harm-utility frontier defined, as sketched below.
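The following sketch (a hypothetical illustration, not a method prescribed in the paper) shows how such a partial order could be used to screen candidate releases: an option is retained on the frontier if no other option is at least as good on all three criteria and strictly better on at least one.

```python
# Illustrative sketch of the partial order on release options described
# above: option A dominates B if A has no higher disclosure potential, no
# higher harm, and no lower utility, with at least one strict improvement.
# The non-dominated options form the potential-harm-utility frontier.
# All numbers below are hypothetical.

def dominates(a, b):
    (pa, ha, ua), (pb, hb, ub) = a, b
    return pa <= pb and ha <= hb and ua >= ub and (pa, ha, ua) != (pb, hb, ub)

def frontier(options):
    """Return the options not dominated by any other option."""
    return [o for o in options
            if not any(dominates(p, o) for p in options if p is not o)]

# Hypothetical release options as (disclosure potential, harm, utility):
options = [(0.30, 0.2, 0.9), (0.10, 0.1, 0.6), (0.30, 0.3, 0.8), (0.05, 0.1, 0.4)]
print(frontier(options))  # the third option is dominated by the first
```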

Specifying upper bound constraints for disclosure potential and harm will typically represent a key challenge for an agency. The separate assessment of these two criteria proposed in this paper should enable an agency to place greater reliance on methods of statistical science to set an upper bound for disclosure potential. Nevertheless, difficulties in establishing upper bounds for harm may in practice imply that judgements about tolerance levels for the two criteria will still require joint consideration. Thus, a higher level of disclosure potential might be tolerable if it is judged inconceivable that any serious harm could result from a proposed release, whereas a lower tolerance level might be specified if this is not the case. Such judgements about potential harm might be based upon the sensitivity of the data product to be released. Reports of illegal drug use or detailed information about financial assets are examples of highly sensitive data given by National Research Council (2005, p. 71), where a lower tolerance level for disclosure potential might be judged prudent.

3 The Intruder as Decision Maker

3.1 Decision Theory

We now turn to considering the intruder's perspective and explore his/her potential role as a decision maker. The steps which an intruder might take are set out in Figure 3. The first step is to attempt disclosure. This involves the intruder gaining access to the outputs, which may require overcoming obstacles, such as the completion of a licensing agreement. The intruder may attempt disclosure in a number of ways. For example, if the output consists of a microdata file then the intruder may attempt to use record linkage software to link a microdata record to a record on an external database of known individuals. We may even include the possibility that a user discovers the opportunity for disclosure inadvertently by observing an unusual feature of the output and hence only becomes interested in disclosing information after gaining access to the output, despite having no mischievous intentions originally. 'Methods of attack' are discussed further in Section 4.1.

[Figure 3: Potential Behaviour of an Intruder. Attempt at Disclosure → Inference → Actions.]

The second and third steps in Figure 3 have already been distinguished in the previous section. If the intruder takes the inference step then he/she is able to compute predictive distributions. If the action step is taken then these distributions are used in some way. We now focus on these two steps since they have received particular attention in the literature. In particular, Duncan and Lambert (1986, 1989) introduced the use of decision theory (from the intruder's perspective) to represent these two steps. Their framework is now outlined.

Let y denote a value which the intruder wishes to disclose. As discussed in Section 2.2, y is the (target) value of a variable (which is not publicly known) of a data subject in the case of predictive disclosure, or the identity of an element of the output in the case of identity disclosure. At the inference step, suppose that a standard Bayesian approach is adopted, whereby the intruder's prior distribution for y is updated using the output, to obtain a posterior predictive distribution p_I(y). The subscript I is to emphasize the dependence on the intruder and to contrast it with p(y) in Section 2.2. Now consider the action step and let a denote a consequent action which the intruder might take, for example to claim publicly that the target value is a specified value or to claim that the identity of an element of the output is a specified person (or other unit). Suppose that the intruder specifies a loss function L_I(y, a), representing his/her loss incurred from action a when the true value is y. Then, the Bayesian optimal choice of a will be to minimise the expected value of this loss, i.e. to minimise ∑_y L_I(y, a) p_I(y) with respect to a. Duncan and Lambert (1989) argue that it is natural for the agency to suppose that the intruder adopts this optimal strategy since it represents a conservative assumption.
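A minimal sketch of this calculation, with hypothetical posterior probabilities and losses, is given below. Note how the chosen action depends on the loss function and not only on the posterior, which anticipates the argument of Section 3.2 that the intruder's actions bear on harm rather than on disclosure potential.

```python
# Sketch of the Duncan-Lambert intruder calculation described above: choose
# the action a minimising the expected loss sum_y L_I(y, a) p_I(y).
# All values below are hypothetical illustrations.

def optimal_action(actions, targets, posterior, loss):
    """Return the action minimising expected posterior loss."""
    return min(actions,
               key=lambda a: sum(loss(y, a) * posterior[y] for y in targets))

# Identity-disclosure toy example: y = 1 if a microdata record belongs to the
# known target individual. Actions: claim the identification, or abstain.
posterior = {1: 0.8, 0: 0.2}             # intruder's p_I(y)

def loss(y, a):
    if a == "claim":
        return 0 if y == 1 else 10       # a false public claim is costly
    return 1                             # abstaining forgoes a gain

# Expected losses: claim = 10 * 0.2 = 2; abstain = 1, so the intruder
# abstains despite a posterior identification probability of 0.8.
print(optimal_action(["claim", "abstain"], [0, 1], posterior, loss))
```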

3.2 Relation to Disclosure Potential and Harm

How does this decision theoretic formulation for the intruder relate to disclosure potential and harm as conceived in Section 2?

We contend first that the introduction of the intruder's loss function and the consideration of expected loss is not relevant to the consideration of disclosure potential, but only to disclosure harm. The intruder's loss function refers to actions which, as represented in Figure 2, only affect disclosure harm.

We distinguish the distribution p(y), which may be formulated in a publicly defensible way by the agency, and the intruder's distribution p_I(y). The latter might be referred to as the risk of perceived disclosure (cf. Lambert, 1993), emphasizing its dependence on the intruder's perspective. Lambert (1993) states that “the disclosure limitation model of Duncan and Lambert ... does not separate true and false disclosures, since what matters is what the intruder believes has been disclosed” (p. 315) and that “disclosure is limited only to the extent that the intruder is discouraged from making any inferences, correct or incorrect, about a particular target respondent” (p. 316). We conclude that the decision theoretic formulation for the intruder may be relevant to those elements of the agency's disclosure management approach (or indeed choice of SDL methods) designed to discourage disclosure attempts, and it may be relevant to the agency's consideration of disclosure harm, in particular since false disclosures may be harmful, but it is not relevant to the assessment of 'true' disclosure potential.

The decision theoretic framework of Duncan and Lambert (1986, 1989) is extended by Dobra et al. (2003). They set out a very general framework where each of the agency, the intruder and the genuine user face decisions in relation to their own actions and loss functions. They conceive of the intruder as operating in a similar way to Duncan and Lambert (1986, 1989), i.e. adopting an optimal strategy based upon expected loss calculations. They then set up a framework in which the agency takes decisions in the light of the potential intruder actions. They define an agency loss function which “quantifies, from the agency's perspective, the harm that the intruder's action produces to the agency and the data providers” for a given 'state of the world'. They then define the 'disclosure risk' as the expected value of this agency loss function, where the expectation is with respect to the agency's posterior probability distribution about the actions of the intruder and the state of the world. They assume that the agency knows the intruder's target, prior and loss function. Their definition refers, however, to what we have called disclosure harm, not disclosure potential. Some further extensions of the decision theoretic framework of Duncan and Lambert (1986, 1989) are given by Trottini (2001, 2003) and Trottini and Fienberg (2002).

A related application of decision theory to the actions of each of the agency, the intruder and the genuine user is presented by Keller-McNulty et al. (2005). They view these actions, as in a game theory perspective, as ones where the intruder is an adversary of the agency and the genuine user. By focussing on actions rather than inference, their approach may also be viewed as of relevance to disclosure harm but not potential. Their implicit definition of disclosure harm is the intruder's expected utility, which is analogous (after multiplying by -1) to the expected loss in the Duncan and Lambert (1986) framework, although they adopt a particular approach to defining the loss function in terms of Shannon's information entropy. This definition of disclosure harm is analogous to a special case of the definition of Dobra et al. (2003), where the agency assesses harm purely in terms of the intruder's perspective, with anything that the intruder views as a gain being viewed by the agency as a loss.

4 Measuring Disclosure Potential in terms of Predictive Distributions

We now return to the agency's perspective and consider the conceptual basis of disclosure potential. We build on the notion of a predictive distribution p(y) for a target y introduced in Section 2.2. Recall that this embraces not only the notion of predictive disclosure but also that of identity disclosure. Although, as discussed in Section 3, we assume that disclosure potential does not depend on any actions of an intruder which might follow an attempt at disclosure, we still need to consider the possible dependence of p(y) on the way in which the intruder attempts disclosure; this is discussed in Section 4.1. The move from the intruder's to the agency's perspective also raises the question of how to take account of this change of perspective; this is discussed in Section 4.2. Further issues in implementing definitions of disclosure potential are presented in Section 4.3.

4.1 Dependence on Nature of Attack

The predictive distribution p(y) may depend on the agency's statistical outputs, denoted O, the auxiliary data which the intruder has available, denoted D, as well as on the way in which the intruder obtained these data, which we refer to as the method of attack. Suppose that the agency is able to enumerate possible scenarios k = 1, ..., K, each of which might correspond to a different attack method k, and/or a different set of auxiliary data D_k, and/or a different intruder. The predictive distribution for scenario k is then denoted p(y|O, D_k, attack method k). If the agency could attach a probability P(D_k, attack method k|O) to each scenario k and if these scenarios were mutually exclusive, the agency could, in principle, compute an unconditional predictive distribution

p(y|O) = ∑_k p(y|O, D_k, attack method k) P(D_k, attack method k|O).
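Numerically, and anticipating the example in the next paragraph, the following sketch (with hypothetical values) computes this mixture and shows how a scenario under which disclosure is almost certain can be masked by a small scenario probability.

```python
# Sketch (hypothetical numbers) of the unconditional predictive probability
# p(y|O) as a mixture over scenarios k, versus the scenario-conditional
# probabilities which the text argues should be assessed directly.

scenario_probs = [0.99, 0.01]       # p(y | O, D_k, attack method k)
scenario_weights = [0.001, 0.999]   # P(D_k, attack method k | O)

p_unconditional = sum(p * w for p, w in zip(scenario_probs, scenario_weights))
print(p_unconditional)  # about 0.011: low overall, even though disclosure is
                        # near-certain under scenario 1 if it is ever attempted
```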

We argue, however, that it is more natural to define disclosure risk conditional on the scenario, since this seems to correspond better to the kinds of obligations, like that for Title 13 in Section 2.2, which refer to whether disclosure can be achieved. Suppose, for example, that only one attack method is feasible, that identity disclosure can be achieved with probability 0.99 if this method is used, but that there is reason to suppose that the probability that this attack method will be used is 0.001. Then we suggest that this form of output does not meet the confidentiality requirements of Title 13 (see Section 2.2), since the probability that identity disclosure can be achieved (if an attempt is made) is high, even though the probability that a disclosure will take place might be judged low.

An advantage of conditioning on the scenario is that it avoids the need to make probability judgements about whether the scenario would occur. Such judgements are very difficult, given the hypothetical nature of the intruder, and seem rather different from the kinds of probability judgements required for assessing the potential for inference. The former judgements thus seem better considered alongside the actions related to disclosure harm. See Marsh et al. (1991) and National Research Council (2005, p. 70) for further discussion of the probability of an attack.

A related advantage of conditioning is that it removes the direct connection between disclosure potential and what was called disclosure management in Section 2.1, enabling the tasks of the agency to be separated into assessments of:

(i) disclosure potential, under different scenarios, and how this depends on the choice of SDL method;

(ii) attempts at disclosure and disclosure harm, and how these depend upon disclosure management and (indirectly) on SDL methods;

(iii) utility and its dependence on SDL methods and disclosure management.

Of course, specification of the scenarios of attack in (i) requires some speculation about possible intruder behaviour, but it suffices to focus such speculation on the identification of potential auxiliary data sources (for which disclosure potential needs to be assessed).

A consequence of conditioning on the scenario is that there may be a multitude of measures of disclosure potential, given the large number of possible scenarios. The assessment exercise may thus be viewed as a sensitivity analysis. Frank (1986, p. 22) considers that this represents a “fundamental difficulty” and that the need to specify “predictive distributions for all conceivable users” could be “intractable”. Some research has been undertaken on possible scenarios (e.g. Duncan et al., 2011, Sect. 2.1.3) and such research provides a basis for specifying different scenarios in assessing disclosure potential (e.g. Paass, 1988; Willenborg and De Waal, 2001, Sect. 2.3). In practice, it is common for an agency to specify a set of such scenarios against which it wishes to protect and to update these in the light of new research on external data sources. Such research is important, given the strong dependence of disclosure potential on the nature of the auxiliary data available.

Some reduction in the task of investigating all possible scenarios may, in principle, be achieved by restricting attention to the worst case(s) (e.g. Duncan et al., 2001b). Alternatively, there may be grounds for considering only 'reasonable' scenarios and not the most extreme ones. In its guidance on the interpretation of the UK Code of Practice for Official Statistics, Government Statistical Service (2009) states that “account should be taken of the 'means likely reasonably to be used' to identify an individual. Thus the hypothetical but remote possibility of identification is not something that automatically makes a statistic disclosive. The design and selection of intruder scenarios should be informed by the means likely reasonably to be used to identify an individual in the statistic” (p. 11). A related example is the use of 'de facto anonymisation' of business microdata in Germany, for which scenarios are excluded from consideration if the intruder's “costs of trying to reidentify records in the dataset” are deemed to be “higher than the benefit gained by the disclosed information” (Brandt et al., 2008).

In our formulation above, we have expressed the predictive distribution, p(y|O, D_k, attack method k), as dependent not only on the outputs and the data available to the intruder but also on the attack method. It is more conventional to specify dependence upon only the output O and the data D; see e.g. the definition of identification risk in Reiter (2005, equation 1). Although this may be reasonable in practice in many situations, Skinner (2007) shows that, in general, there may be an additional dependence on the attack method, i.e. the attack method may be 'non-ignorable' in the sense that

p(y|O, D_1 = D, attack method 1) ≠ p(y|O, D_2 = D, attack method 2)

for two different attack methods, even though the auxiliary data D observed by the intruder as a result of each method of attack may be the same. An example presented by Skinner (2007) is a comparison between:

• a directed search, where the intruder selects a known individual from the population and seeks a match in the output, vs.

• a fishing expedition, where the intruder selects an unusual element of the output and then seeks a match in the population.

Skinner (2007) suggests that in such cases it may be reasonable to identify and assume a realistic worst case for the attack method, given D.

4.2 Dependence on Intruder Perspective via a Subjective Bayesian Approach

As noted in the previous section, our definition of p(y|O, D_k, attack method k) already requires consideration of the intruder's perspective via the auxiliary data D_k available to the intruder, and possibly via the intruder's attack method. There remains the question of whether the agency should specify any further dependence of p(y|O, D_k, attack method k) on the intruder's perspective. Many forms of prior information, available to the intruder from their personal knowledge and experience, may be incorporated into the auxiliary data D_k within our framework. We focus instead in this section on the possible use of a Bayesian approach to represent the intruder's pre-existing information or beliefs as a prior distribution for some parameters in the model upon which p(y|O, D_k, attack method k) is based. Or, more generally, the model itself may be interpreted in a subjective Bayesian way as reflecting the intruder's uncertainty about y. See Fienberg et al. (1997) and Reiter (2005) for illustration in the case of identity disclosure. Such Bayesian approaches may be contrasted with comparable model-based frequentist approaches, as in Skinner and Shlomo (2008), which also base estimates of p(y|O, D_k, attack method k) on a model, but do not seek to employ prior distributions to reflect the intruder perspective, nor do they view the model as representing the intruder's subjective perspective, but rather as an 'objective' model which may be specified by the agency using data-based techniques of statistical modelling. For example, Skinner and Shlomo (2008) propose a data-based technique for selecting a log-linear model which 'optimises' certain prediction properties of the model and does not attempt to incorporate prior information.

A basic question for a subjective Bayesian approach is: what criteria should the prior distribution be required to meet for the resulting predictive distribution of y to reflect an appropriate notion of disclosure potential? If the prior distribution is allowed to represent any subjective beliefs of an intruder then, as Lambert (1993) discusses, it seems more appropriate to view the predictive distribution as reflecting perceived risk, which may embrace incorrect disclosure as well as correct disclosure. As discussed in Section 3.2, such perceived risk may be relevant to assessments of disclosure harm. However, in our view, the kinds of obligations discussed at the beginning of Section 2.2 require disclosure potential to be defined in terms of correct disclosure and for the predictive distribution to have an inferential basis which is defensible under public scrutiny (in the same way that any outputs of official statistics need to be publicly defensible). We do not see that these requirements can be guaranteed if priors are allowed to be any plausible subjective distribution for any intruder. This could include, for example, the case of an intruder who is over-confident that an observed match is correct on the grounds that a combination of matching variables is unique in the population, failing to appreciate the potential for non-uniqueness or for the match to have arisen because of measurement errors or other reasons. It thus seems inappropriate for the definition of disclosure potential to be dependent upon intruders' unjustified prior perceptions. To answer the question at the beginning of this paragraph, we consider that it should be possible to defend any prior distribution used in a Bayesian approach by justifying how it leads to a predictive distribution which reflects a valid probability of correct disclosure. More specifically, we consider that the construction of the prior distribution should be defensible from the agency's perspective (and thus in an inter-subjective way) on the basis of plausible assumptions about the prior information available to the intruder.

In principle, one could imagine that an agency might itself seek to elicit these priors. In practice, however, the task of identifying plausible sources of auxiliary information, D_k, is already so challenging that it seems understandable that agencies might place such elicitation outside the bounds of their standard procedures. This appears to have been the usual case in practice so far. In Reiter (2005), perhaps the most substantial practical application of Bayesian methods to identification risk assessment to date, intruders' prior distributions have very little prominence. The main value of Bayesian methodology in current disclosure risk assessment practice seems to be more of a technical one: it provides a clear conceptual way of integrating out uncertainty about parameters in the predictive distribution p(y|O, D_k, attack method k). Empirical evidence is needed to assess whether there are non-negligible practical differences between the resulting Bayesian measures and comparable frequentist model-based measures, as developed by Skinner and Shlomo (2008).

4.3 Implementation of Measures of Disclosure Potential

Having addressed conceptual aspects of the predictive distributions in the previous two sections, we now turn to consider some ways in which an agency may implement measures of disclosure potential based on these distributions. This might be viewed as the problem of 'estimating' these distributions. We focus on the case of identity disclosure, where y is binary, in the context of the release of microdata. For some discussion of attribute disclosure (where y is the value of a variable for a target unit) see Duncan and Lambert (1986).

We consider a method of attack where the intruder seeks to link a record in the microdata file to some external data source of known units using values of some variables which are included in both the microdata and the external source. These variables are often called key variables and their values in the external data source define the auxiliary data D (Bethlehem et al., 1990). A basic problem with estimating p(y|O, D, attack method) in this context is that D is hypothetical and thus unknown. There are two established approaches to specifying D and estimating the predictive distribution:

• an empirical matching experiment: construct a surrogate file D, for which the true correspondence between the records in D and O is known by the agency, mimic the behaviour of the intruder by using a record linkage algorithm to match D and O, and record the proportion of correct matches;

• a modelling approach: make assumptions about the nature of D (and the attack method) within a modelling framework which enables p(y|O, D, attack method) to be specified and then make inference about this distribution, given the data available to the agency.

We consider these two approaches in turn.

The empirical matching experiment cannot be used to estimate the probability of a correct match for a specific pair of records, since all that is observed at this level is binary: either match or non-match. Hence, such an experiment will not provide a 'record-level' measure of disclosure potential. Instead, the proportion of correct matches across a set of records, possibly the whole file, provides an estimated probability, which may be treated as a 'file-level' or 'subfile-level' measure. For the estimate to be reliable, the number of records in the set will need to be sufficiently large. However, as a result of 'smoothing' across all these records, this approach may fail to identify the most 'risky' records. An advantage of the empirical matching approach is that it can accommodate any matching algorithm which an intruder might use, for example a deterministic record linkage approach (Spruill, 1982), and any SDL method and, in this sense, can avoid modelling assumptions. In particular, the approach does not depend upon assumptions about intruder perceptions and Lambert (1993) thus terms the empirical proportion the risk of true identification. A key practical challenge in an empirical matching experiment is how to construct a realistic surrogate intruder dataset, which allows for the disclosure protection arising from sampling and measurement differences between sources and for which there is some overlap of units with the microdata and the nature of this overlap is known. Sometimes there may be a suitable alternative data source (e.g. Blien et al., 1992) or a different survey undertaken by the agency, although agencies often control sampling to avoid such overlap. Even if there is overlap, determining which units are in common may be resource intensive, discouraging routine use of this approach. In the absence of another dataset, the agency may consider a 're-identification' experiment, in which the microdata file is matched against itself, normally after the application of some SDL method (Paass and Wauschkuhn, 1985; Paass, 1988; Winkler, 2004).
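As an illustration, the following sketch shows the shape of such an experiment. The data and field names are hypothetical, and a naive exact-matching rule stands in for the record linkage algorithm a real experiment would use; the 'uid' field represents the true correspondence known to the agency but not visible to an intruder.

```python
# Minimal sketch of an empirical matching experiment as described above.
# The agency holds a surrogate intruder file and the output file with a
# known true correspondence (here a 'uid' field). A naive linkage rule
# matches on exact agreement of the key variables; all names hypothetical.

KEYS = ("age", "sex", "region")

def link(d_rec, output):
    """Return the output records agreeing exactly on the key variables."""
    return [o for o in output if all(o[k] == d_rec[k] for k in KEYS)]

def match_rate(surrogate, output):
    """File-level measure: proportion of declared matches that are correct."""
    attempts = correct = 0
    for d_rec in surrogate:
        candidates = link(d_rec, output)
        if len(candidates) == 1:                    # intruder declares a match
            attempts += 1
            correct += candidates[0]["uid"] == d_rec["uid"]
    return correct / attempts if attempts else 0.0

surrogate = [{"uid": 1, "age": 42, "sex": "F", "region": "N"}]
output = [{"uid": 1, "age": 42, "sex": "F", "region": "N"},
          {"uid": 2, "age": 42, "sex": "M", "region": "N"}]
print(match_rate(surrogate, output))  # 1.0: the single declared match is correct
```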

The modelling approach may be formulated in the same conceptual framework as the empirical matching experiment, but seeks to obtain expressions for the predictive distributions via theoretical arguments, under assumptions about the nature of D and the attack method. Measures of disclosure potential associated with these expressions may then be estimated from the microdata. A practical disadvantage of this approach, compared to the empirical matching approach, is that it may not be theoretically straightforward to accommodate any specific matching algorithm which an intruder might use. Instead, approximating assumptions might be made. These may have the benefit of providing simpler or more interpretable measures of disclosure potential.

An advantage of the modelling approach is that it permits the estimation of record-level measures of identifiability. A concern with file-level measures is that the principles governing confidentiality protection often seek to avoid the identification of any individual, that is, they require the probability to be below a threshold for each record, and such aims may not adequately be addressed by file-level measures. In contrast, record-level measures, which take different values for each microdata record, may help identify those parts of the sample where disclosure potential is high and more protection is needed, and may be aggregated to a file-level measure in different ways if desired (Lambert, 1993).

Model-based expressions for predictive distributions have tended to be studied separately for continuous and categorical key variables. We shall illustrate some points about predictive distributions in the categorical case. For continuous key variables, where random noise is added to the values of the key variables appearing in O, see e.g. Fuller (1983), who derives expressions for the relevant predictive distributions and discusses their estimation. Paass and Wauschkuhn (1985) and Paass (1988) discuss a general approach where identification is viewed as a classification problem and discriminant analysis techniques are used.

Turning to the case when the key variables are categorical, suppose initially that they are recorded in the same way in D and O and that no SDL method is applied. In this case, a simple expression for the probability that an observed exact match between records in the two sources is correct is 1/F, where F is the number of units in the population which share the same key variable values as the two matching records (Duncan and Lambert, 1989; Skinner and Holmes, 1998). Skinner (2008) derives this expression under the assumption that the intruder employs a probabilistic record linkage method, of the type considered by Fellegi and Sunter (1969). The expression 1/F assumes that there is a unique matching record in the microdata and that certain exchangeability assumptions hold for the F matching units. More importantly, it assumes that F is part of D and thus included in the conditioning set in p(y|O, D, attack method). We contend that this conditioning set should consist of the information assumed to be available to the intruder (not the information available to the agency) and, thus, whether F should be part of the conditioning set depends on whether it is reasonable to suppose that the intruder knows F. In many realistic settings, it may be assumed that F is unknown to the intruder. In this case, p(y|O, D, attack method) may be expressed as E(1/F|O, D, attack method), where F is ‘integrated out’ using its conditional distribution given what the intruder observes. Skinner and Holmes (1998) and Elamir and Skinner (2006) provide expressions for this probability assuming the key variables obey certain log-linear models and discuss how this probability may be estimated given only the sample microdata. Skinner and Shlomo (2008) discuss the specification of these models. The estimated probabilities may be viewed as record-level measures of disclosure potential. A simpler approach is obtained by assuming that the match observed by the intruder is obtained randomly (with equal probability) from all units in the population which match a microdata record with a unique combination of key variable values in the sample. The resulting probability can be expressed as 1/F̄, where F̄ is the mean value of F across all combinations of key variable values which are unique in the sample. This measure may be estimated simply from sample microdata (Skinner and Elliot, 2002; Skinner and Carter, 2003) and may be treated as a file-level measure. Some related file-level measures, such as the proportion of ‘sample uniques’ that are ‘population unique’, are discussed by Bethlehem et al. (1990) and Fienberg and Makov (1998).
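
As an illustration of this line of reasoning, the following sketch computes two standard record-level measures for sample-unique cells under a Poisson model, taking as given fitted cell rates from a log-linear model (the fitting itself is not shown) and Bernoulli sampling with known fraction pi; the function and variable names are ours, not from the cited papers.

```python
# Record-level measures for sample uniques (cells with sample count f_k = 1)
# under a Poisson model: the unobserved count F_k - f_k is Poisson with rate
# mu = (1 - pi) * lambda_k, giving closed forms for P(F_k = 1 | f_k = 1) and
# E(1/F_k | f_k = 1). lambda_hat would come from a fitted log-linear model.
import numpy as np

def risk_for_sample_uniques(lambda_hat, pi):
    mu = (1.0 - pi) * np.asarray(lambda_hat, dtype=float)
    p_pop_unique = np.exp(-mu)           # P(no other population unit in the cell)
    # For X ~ Poisson(mu), E[1/(1 + X)] = (1 - exp(-mu)) / mu, with limit 1 at mu = 0.
    safe_mu = np.where(mu > 0, mu, 1.0)  # avoid 0/0; the limiting value is 1
    e_inv_F = np.where(mu > 0, (1.0 - np.exp(-mu)) / safe_mu, 1.0)
    return p_pop_unique, e_inv_F

# Illustrative call: three sample-unique cells under a 1% sampling fraction.
p_unique, e_inv = risk_for_sample_uniques([1.2, 5.0, 0.3], pi=0.01)
```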

Model-based assessment can become more complex when the output O has been subject to SDL methods. Reiter (2005) demonstrates how measures of identifiability can be obtained for a variety of SDL methods, including recoding, data swapping and adding random noise. Shlomo and Skinner (2010) consider the use of misclassification-based SDL methods.

5 Further Consideration of Disclosure Harm and Intruder Behaviour

We have argued for separating the assessment of disclosure potential from the assessment of disclosure harm, with a view to focussing the role of statistical science on the former task. Now we make some remarks about the latter task. From the agency’s decision-making perspective, the key question is how to reduce disclosure harm through disclosure management approaches together with SDL methods.

We conceive of disclosure harm as the expected value of the loss incurred from the consequences of intruder behaviour. Assessing disclosure harm thus requires the assessment of three components, combined schematically in the expression after the list:

• the nature of potential intruder behaviour and its consequences;

• the uncertainty about intruder behaviour and the consequences;

• the loss incurred from these consequences.
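
In notation of our own choosing (not taken from the sources cited), the three components may be combined as an expected loss, with the first component fixing the possible behaviours and consequences, the second the probabilities, and the third the valuation:

```latex
% Schematic decomposition of disclosure harm as an expected loss:
% B denotes intruder behaviour, C its consequences, L(c) the loss valuation.
\[
  \text{Harm} \;=\; \mathrm{E}\bigl[L(C)\bigr]
  \;=\; \sum_{b} \Pr(B = b) \sum_{c} \Pr(C = c \mid B = b)\, L(c).
\]
```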

Although it is clearly feasible for Bayesian decision theory to have a place in modelling intruder behaviour, as discussed in Section 3.2, we suggest that the scientific assessment of potential intruder behaviour in the face of alternative disclosure management (and SDL) approaches is more a matter for social than for statistical science. Assessing which kinds of people might attempt disclosure, and the effects of approaches such as user training on the probability of a user attempting disclosure, seem to be social science questions and may be addressed by empirical experiments. For example, O’Hara et al. (2011) describe an experiment where intruder behaviour was mimicked by recruiting postgraduate students who, like hackers in the real world, lacked knowledge of large-scale data handling and the SDL literature but had good computing skills and were driven by the aim of identifying subjects or disclosing further information about them in an anonymized microdata source. The experiment provided the agency with a better understanding of the kinds of attacks which intruders might employ and the kinds of threats arising from such behaviour.

The assessment of uncertainty about potential intruder attacks and behaviour seems a very different task to the use of statistical inference to assess disclosure potential. Systematic modelling or assessment of such uncertainty seems more related to social science and risk management.

Valuing the loss which would be incurred from specified consequences of intruder behaviour seems more a matter for policy judgement. Social science may inform this judgement, for example through research into respondents’ perspectives and concerns about confidentiality. Different respondents and associated stakeholder groups, such as privacy organisations, do not all share a common perspective, and dealing with these differences is a policy issue. Handling uncertainty about potentially harmful outcomes is unlikely to be simply a matter of considering expected outcomes. Most agencies will also wish to take account of public perceptions of potential harm, in particular since these may adversely affect participation in surveys run by the agency (Singer, 2004; Couper et al., 2008, 2010). There are difficult challenges too in taking account of potential changes in public perceptions over time, in particular since intruder behaviour and its consequences may occur well after the agency makes its release decision. The potential for sudden changes in public concerns about confidentiality was illustrated by the intense media coverage of losses of data discs by government in the United Kingdom (Hand, 2008). Issues in the management of public perceptions of the agency may also arise, but these are not ones of technical statistical science.


6 Implications for Agency Practice

The conceptual framework in this paper may be used by an agency to structure its evaluation of SDL and disclosure management options. This framework takes account of the different kinds of expertise which are needed by staff undertaking the evaluation or are obtained through consultation with individuals and bodies outside the agency. The restriction of a broader notion of disclosure risk to the narrower one of disclosure potential is designed to enable it to be assessed by agency statisticians alone, using methods of statistical science, by excluding consideration of matters such as intruder behaviour, public perceptions of disclosure risk and false disclosure, which are relevant to the separate criterion of disclosure harm. The assessment of this latter criterion needs more multidisciplinary input, from social scientists, stakeholders representing respondents and policy makers (see Section 5). Assessment of utility, our third criterion, requires input from the users of the outputs. Overall decision making requires further policy input, such as through a microdata review panel, to take account of trade-offs between the three criteria.

The evaluation of disclosure potential might be divided into three kinds of tasks. First, there is a need for ongoing assessment and updating of plausible scenarios and associated sources of auxiliary information which intruders might use. This is necessary background information for any assessment of disclosure potential.

Second, evaluation will be required for routine decisions about release. In this case, it is appealing in practice for rules to be simple, transparent and objective. Some discussion of two types of approach to assessing disclosure potential was given in Section 4.3 for the case of identity disclosure in microdata: (i) empirical matching experiments have the advantage that they can handle a wide range of SDL methods, although they may be somewhat elaborate for routine use; (ii) modelling approaches can provide the rationale for simpler measures, provided the SDL method can be handled tractably. Thresholds for levels of disclosure potential will need to be established from broader policy considerations. Harm considerations may lead to different thresholds, as well as different disclosure management procedures, for different kinds of survey variable. For example, health variables might be subject to more stringent rules.

Third, detailed assessments may be undertaken as part of occasional in-depth exercises designed to choose between alternative major types of SDL approach or to validate simple disclosure protection rules. We proposed in Section 4 that disclosure potential be measured in terms of probabilistic prediction, whether of identity or of unknown attributes, given the observable statistical outputs and hypothetical auxiliary data. Simple measures of disclosure potential might therefore be validated through simulation studies calibrating the probabilistic measures against their empirical performance, as in the sketch below.
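
For example, a minimal calibration sketch along these lines (with a purely synthetic categorical population; all names and the generating model are illustrative) compares the model-based probability 1/F for sample uniques with the empirical proportion of correct matches under a simulated attack:

```python
# Calibration check: for each sample-unique record, the intruder's target is
# drawn at random from the F population units sharing its key values, so the
# claimed match is correct with model probability 1/F. The simulation compares
# the mean of 1/F with the empirical proportion of correct matches.
import numpy as np

rng = np.random.default_rng(0)
N, n_cells, n_sample = 100_000, 5_000, 1_000

cells = rng.integers(0, n_cells, size=N)              # key combination of each population unit
sample = rng.choice(N, size=n_sample, replace=False)  # simple random sample

pop_counts = np.bincount(cells, minlength=n_cells)
samp_counts = np.bincount(cells[sample], minlength=n_cells)

uniques = sample[samp_counts[cells[sample]] == 1]     # sample-unique records
F = pop_counts[cells[uniques]]                        # population cell sizes (all >= 1)

correct = rng.integers(0, F) == 0                     # target happens to be the sampled unit
print("mean model-based probability:", (1.0 / F).mean())
print("empirical proportion correct:", correct.mean())
```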


7 Conclusions

Statistical disclosure limitation is well established in statistical science as a body of theory and methods and remains the subject of active research. Although there is no shortage of SDL methods which have found application by government statistical agencies, a common scientific methodology for assessing disclosure risk and making decisions based upon these assessments has found less ready adoption in agency practice. This paper has addressed the foundations of this topic with the aim of clarifying what statistical science can contribute to such decision making and what it cannot.

We have argued for the assessment of disclosure risk to be separated into the assessments of disclosure potential and disclosure harm, enabling the use of methods of statistical science to be focussed on the former task. Whilst we have recognized a role for statistical decision theory, we have sought to remove intruder behaviour from its ambit, viewing that as of more relevance to what we have called disclosure management. Nevertheless, in our more detailed consideration of disclosure potential, we have discussed how it may depend on the nature of potential intruder attacks.

Our discussion of how to assess disclosure potential in practice has been limited to a prototypical set-up and there is certainly much scope for further research, as recommended by National Research Council (2005, p.72), taking account of the different kinds of SDL methods finding application under evolving modes of access.

Acknowledgements

I am grateful to Natalie Shlomo and two reviewers for comments. Research was supported by the Economic and Social Research Council.

References

Abowd, J.M., Nissim, K. & Skinner, C. (2009). First issue editorial. J. Privacy Confident., 1, 1-6.

Bethlehem, J.G., Keller, W.J. & Pannekoek, J. (1990). Disclosure control for microdata. J. Amer. Statist. Assoc., 85, 38-45.

Blien, U., Wirth, H. & Muller, M. (1992). Disclosure risk for microdata stemming from official statistics. Statist. Neerland., 46, 69-82.


Brandt, M., Lenz, R. & Rosemann, M. (2008). Anonymisation of panel enterprise microdata - survey of a German project. In Privacy In Statistical Databases, Lecture Notes In Computer Science 5262, Eds. J. Domingo-Ferrer & Y. Saygin, pp. 139-151. Berlin: Springer.

Couper, M.P., Singer, E., Conrad, F.G. & Groves, R.M. (2008). Risk of disclosure, perceptions of risk and concerns about privacy and confidentiality as factors in survey participation. J. Official Statist., 24, 255-275.

Couper, M.P., Singer, E., Conrad, F.G. & Groves, R.M. (2010). Experimental studies of disclosure risk, disclosure harm, topic sensitivity, and survey participation. J. Official Statist., 26, 287-300.

Cox, L.H. (2001). Disclosure risk for tabular economic data. In Confidentiality, Disclosure And Data Access: Theory And Practical Applications For Statistical Agencies, Eds. P. Doyle, J.I. Lane, J.J.M. Theeuwes & L.V. Zayatz, pp. 167-183. Amsterdam: Elsevier.

Cox, L.H., Karr, A.F. & Kinney, S.K. (2011). Risk-utility paradigms for statistical disclosure limitation: how to think, but not how to act (with discussion). Int. Statist. Rev., 79, 160-183.

Dalenius, T. (1977). Towards a methodology for statistical disclosure control. Statistisk Tidskrift, 5, 429-444.

DeGroot, M.H. (1970). Optimal Statistical Decisions. New York: Wiley.

Dobra, A., Fienberg, S.E. & Trottini, M. (2003). Assessing the risk of disclosure of confidential categorical data. In Bayesian Statistics 7, Proceedings Of The Seventh Valencia International Meeting On Bayesian Statistics, Eds. J.M. Bernardo, M.J. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M. Smith & M. West, pp. 125-144. Oxford University Press.

Doyle, P., Lane, J.I., Theeuwes, J.J.M. & Zayatz, L.V. (2001). Confidentiality, Disclosure And Data Access: Theory And Practical Applications For Statistical Agencies. Amsterdam: Elsevier.

Duncan, G.T., Elliot, M. & Salazar-Gonzalez, J.-J. (2011). Statistical Confidentiality. New York: Springer.

Duncan, G.T., Keller-McNulty, S.A. & Stokes, S.L. (2001b). Disclosure risk vs. data utility: the R-U confidentiality map. Technical Report No. 121, National Institute Of Statistical Sciences, North Carolina.

Duncan, G. & Lambert, D. (1986). Disclosure-limited data dissemination (with discussion). J. Amer. Statist. Assoc., 81, 10-28.


Duncan, G. & Lambert, D. (1989). The risk of disclosure for microdata. J. Bus. Econ. Statist., 7, 207-217.

Dwork, C., McSherry, F., Nissim, K. & Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In Proceedings Of The Third Theory Of Cryptography Conference (TCC 2006), Lecture Notes In Computer Science 3876, pp. 265-284. Berlin: Springer.

Elamir, E.A.H. & Skinner, C.J. (2006). Record level measures of disclosure risk for survey microdata. J. Official Statist., 22, 525-539.

Elliot, M.J. & Dale, A. (1999). Scenarios of attack: the data intruder’s perspective on statistical disclosure risk. Netherlands Official Statist., 14, 6-10.

Fellegi, I.P. (1972). On the question of statistical confidentiality. J. Amer. Statist. Assoc., 67, 7-18.

Fellegi, I.P. & Sunter, A.B. (1969). A theory for record linkage. J. Amer. Statist. Assoc., 64, 1183-1210.

Fienberg, S.E. (2006). Privacy and confidentiality in an e-commerce world: data mining, data warehousing, matching and disclosure limitation. Statist. Sci., 21, 143-154.

Fienberg, S.E. & Makov, U.E. (1998). Confidentiality, uniqueness and disclosure limitation for categorical data. J. Official Statist., 14, 385-397.

Fienberg, S.E., Makov, U.E. & Sanil, A.P. (1997). A Bayesian approach to data disclosure: optimal intruder behavior for continuous data. J. Official Statist., 13, 75-89.

Frank, O. (1986). Comment on “Disclosure-limited data dissemination” by G. Duncan & D. Lambert. J. Amer. Statist. Assoc., 81, 21-22.

French, S. (1988). Decision Theory: an Introduction to the Mathematics of Rationality. Chichester: Ellis Horwood.

Fuller, W. (1993). Masking procedures for microdata disclosure limitation. J. Official Statist., 9, 383-406.

Gomatam, S., Karr, A.F., Reiter, J.P. & Sanil, A.P. (2005a). Data dissemination and disclosure limitation in a world without microdata: a risk-utility framework for remote access analysis servers. Statist. Sci., 20, 163-177.

Gomatam, S., Karr, A.F. & Sanil, A.P. (2005b). Data swapping as a decision problem. J. Official Statist., 21, 635-655.


Government Statistical Service (2009). National Statistician’s Guidance: Confidentiality Of Official Statistics. Office For National Statistics, UK.

Hand, D. J. (2008). Privacy, data discs and realistic risk. Significance, 5, 11-14.

Karr, A.F., Kohnen, C.N., Oganian, A., Reiter, J.P. & Sanil, A.P. (2006). A framework for evaluating the utility of data altered to protect confidentiality. Amer. Statistician, 60, 224-232.

Karr, A.F., Lin, X., Sanil, A.P. & Reiter, J.P. (2005). Secure regression on distributed databases. J. Comput. Graphical Statist., 14, 263-279.

Keeney, R.L. & Raiffa, H. (1976). Decisions with Multiple Objectives: Preferences and Value Tradeoffs. New York: Wiley.

Keller-McNulty, S., Nakhleh, C.W. & Singpurwalla, N.D. (2005). A paradigm for masking (camouflaging) information. Int. Statist. Rev., 73, 331-349.

Lambert, D. (1993). Measures of disclosure risk and harm. J. Official Statist., 9, 313-331.

Marsh, C., Skinner, C., Arber, S., Penhale, B., Openshaw, S., Hobcraft, J., Lievesley, D. & Walford, N. (1991). The case for samples of anonymized records from the 1991 census. J. Roy. Statist. Soc. Ser. A, 154, 305-340.

National Research Council (2005). Expanding Access To Research Data: Reconciling Risks And Opportunities. Panel On Data Access For Research Purposes, Committee On National Statistics. Washington DC: The National Academies Press.

National Research Council (2007). Putting People On The Map: Protecting Confidentiality With Linked Social-Spatial Data. Panel On Confidentiality Issues Arising From The Integration Of Remotely Sensed And Self-Identifying Data, Eds. M.P. Gutmann & P.C. Stern. Washington DC: The National Academies Press.

O’Hara, K., Whitley, E. & Whittall, P. (2011). Avoiding the jigsaw effect: experiences with Ministry of Justice reoffending data. Technical Report, Electronics and Computer Science, University of Southampton.

Paass, G. (1988). Disclosure risk and disclosure avoidance for microdata. J. Bus. Econ. Statist., 6, 487-500.

Paass, G. & Wauschkuhn, U. (1985). Datenzugang, Datenschutz und Anonymisierung - Analysepotential und Identifizierbarkeit von Anonymisierten Individualdaten. Munich: Oldenbourg Verlag.


Reiter, J. (2005). Estimating risks of identification disclosure in microdata. J. Amer. Statist. Assoc., 100, 1103-1112.

Reiter, J. (2009). Using multiple imputation to integrate and disseminate confidential microdata. Int. Statist. Rev., 77, 179-195.

Shlomo, N. & Skinner, C. (2010). Assessing the protection provided by misclassification-based disclosure limitation methods for survey microdata. Ann. Applied Statist., 4, 1291-1310.

Singer, E. (2004). Principles and practices related to scientific integrity. In R.M. Groves, F.J. Fowler, M.P. Couper, J.M. Lepkowski & E. Singer, Survey Methodology. New York: Wiley.

Skinner, C.J. (2007). The probability of identification: applying ideas from forensic science to disclosure risk assessment. J. Roy. Statist. Soc. Ser. A, 170, 195-212.

Skinner, C.J. (2008). Assessing disclosure risk for record linkage. In Privacy In Statistical Databases, Lecture Notes In Computer Science 5262, Eds. J. Domingo-Ferrer & Y. Saygin, pp. 166-176. Berlin: Springer.

Skinner, C.J. & Carter, R.G. (2003). Estimation of a measure of disclosure risk for survey microdata under unequal probability sampling. Survey Methodology, 29, 177-180.

Skinner, C.J. & Elliot, M.J. (2002). A measure of disclosure risk for microdata. J. Roy. Statist. Soc. Ser. B, 64, 855-867.

Skinner, C.J. & Holmes, D.J. (1998). Estimating the re-identification risk per record in microdata. J. Official Statist., 14, 361-372.

Skinner, C.J. & Shlomo, N. (2008). Assessing disclosure risk in survey microdata using log-linear models. J. Amer. Statist. Assoc., 103, 989-1001.

Spruill, N.L. (1982). Measures of confidentiality. Proc. Surv. Res. Sect. Amer. Statist. Ass., 260-265.

Trottini, M. (2001). A decision-theoretic approach to data disclosure problems. Res. Official Statist., 4, 7-22.

Trottini, M. (2003). Decision Models For Data Disclosure Limitation. PhD Thesis, Department Of Statistics, Carnegie-Mellon University.


Trottini, M. & Fienberg, S.E. (2002). Modelling user uncertainty for disclosure risk and data utility. Intern. J. Uncertainty, Fuzziness and Knowledge-Based Systems, 10, 511-527.

Vaughan, E.J. (1997). Risk Management. New York: Wiley.

Wasserman, L. & Zhou, S. (2010). A statistical framework for differential privacy. J. Amer. Statist. Assoc., 105, 375-389.

Willenborg, L. & De Waal, T. (2001). Elements Of Statistical Disclosure Control. New York: Springer.

Winkler, W.E. (2004). Masking and re-identification methods for public use microdata: overview and research problems. In Privacy In Statistical Databases, Lecture Notes In Computer Science 3050, Eds. J. Domingo-Ferrer & V. Torra, pp. 231-246. Berlin: Springer.

Woo, M., Reiter, J.P., Oganian, A. & Karr, A.F. (2009). Global measures of data utility for microdata masked for disclosure limitation. J. Privacy Confident., 1, 111-124.

Zaslavsky, A.M. & Horton, N.J. (1998). Balancing disclosure risk against the loss of nonpublication. J. Official Statist., 14, 411-419.
