Anonymizing Data with Quasi-Sensitive Attribute Values


Pu Shi¹, Li Xiong¹, Benjamin C. M. Fung²

¹ Department of Mathematics and Computer Science, Emory University, Atlanta, GA, USA
² CIISE, Concordia University, Montreal, QC, Canada

Problem Statement

We study the problem of anonymizing microdata with quasi-sensitive (QS) attributes, which are not sensitive by themselves but can be linked to external knowledge to indirectly reveal sensitive information about an individual.

(a) Original microdata with quasi-sensitive attribute symptoms

(b) External knowledge that maps symptoms to disease

(c) A generalized table that cannot prevent indirect disclosure of disease through symptoms

Figure 1. Anonymizing data with QS attributes

Preliminary Results

With Mondrian generalization and our suppression algorithm implemented in C++, we conducted experiments with: 1) a dataset of 3000 tuples augmented from the Adult dataset, with 8 QI attributes and 9 synthesized QS terms per tuple, and 2) an external table of 3000 knowledge labels linked to random QS terms following a Poisson distribution.

Definitions

The external knowledge table E has one pair $(L_i, S_i)$ per row, $i = 1, 2, \ldots, |E|$, where $L_i$ is a sensitive label and $S_i$ is the corresponding set of QS values. The set of all sensitive labels that can be linked to the $d$ tuples in a QI group G with quasi-identifying (QI) vector q is $\bigcup_{i=1}^{d} K(tp_i)$, the sensitive label set of G. The attacker’s prior belief α(q,L) and posterior belief β(q,L) are the probabilities that a target tp with QI-vector q is linked to a label L before and after the data release.
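The sensitive label set of a group can be illustrated with a minimal Python sketch. The poster does not spell out when a label counts as linked to a tuple, so the rule used here (a label $L_i$ is linked when every value of $S_i$ appears among the tuple's QS values) is an assumption, as are the function names and the toy data.

```python
def linked_labels(qs_values, knowledge):
    """K(tp): labels linked to one tuple.

    Assumed rule (not stated in the source): label L_i is linked
    when its whole QS value set S_i is contained in the tuple.
    """
    return {label for label, items in knowledge if items <= qs_values}


def group_label_set(group, knowledge):
    """Union of K(tp_i) over the tuples of one QI group."""
    labels = set()
    for qs_values in group:
        labels |= linked_labels(qs_values, knowledge)
    return labels


# Hypothetical external knowledge table E: rows are (L_i, S_i).
E = [
    ("flu", frozenset({"fever", "cough"})),
    ("migraine", frozenset({"headache", "nausea"})),
    ("cold", frozenset({"cough"})),
]

# Hypothetical QI group: one set of QS values (symptoms) per tuple.
G = [
    {"fever", "cough"},
    {"headache"},
]

print(sorted(group_label_set(G, E)))  # ['cold', 'flu']
```

Under this containment rule the second tuple contributes no label, because "migraine" also requires "nausea".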

Definition (QS (c,l)-diversity). A group G satisfies QS (c,l)-diversity if and only if $p_1 \le c\,\bigl(p_l + p_{l+1} + \cdots + p_{|\bigcup_{i=1}^{d} K(tp_i)|}\bigr)$, where $p_1, p_2, \ldots, p_{|\bigcup_{i=1}^{d} K(tp_i)|}$ are the values of $\beta(q, L_i)$ in decreasing order. A table D∗ satisfies QS (c,l)-diversity if every group satisfies QS (c,l)-diversity.
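The inequality in this definition can be checked directly once the posterior beliefs of a group are known. The sketch below is illustrative only: it takes the beliefs as given input (the poster does not say how β is computed), and the function name and the treatment of a group with no linked labels are assumptions.

```python
def satisfies_qs_cl_diversity(beliefs, c, l):
    """Check p_1 <= c * (p_l + p_{l+1} + ... + p_m) for one group.

    beliefs : posterior beliefs beta(q, L_i), one per linked label
    c, l    : parameters of QS (c,l)-diversity (l is 1-based)
    """
    p = sorted(beliefs, reverse=True)  # p_1 >= p_2 >= ... >= p_m
    if not p:
        # Assumption: a group with no linked label discloses nothing.
        return True
    tail = sum(p[l - 1:])  # p_l + ... + p_m; zero when l > m
    return p[0] <= c * tail


# Hypothetical beliefs for three labels in one group.
print(satisfies_qs_cl_diversity([0.5, 0.3, 0.2], c=2, l=2))    # True
print(satisfies_qs_cl_diversity([0.5, 0.3, 0.2], c=0.5, l=3))  # False
```

When a group has fewer than l linked labels the tail sum is zero, so the check fails for any positive top belief.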

Definition (QS t-closeness). A group G satisfies QS t-closeness if and only if the distance between α(q,L) and β(q,L) is no more than a threshold t. A table D∗ satisfies QS t-closeness if every group satisfies QS t-closeness.
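A per-group check for this definition might look like the following sketch. The poster does not name the distance measure, so the variational distance used here (half the L1 distance between the two belief vectors) is an assumption, and the distance is passed as a parameter so another measure can be substituted. All names and numbers are hypothetical.

```python
def variational_distance(alpha, beta):
    """Half the L1 distance between two belief maps (assumed measure)."""
    labels = set(alpha) | set(beta)
    return 0.5 * sum(abs(alpha.get(L, 0.0) - beta.get(L, 0.0)) for L in labels)


def satisfies_qs_t_closeness(alpha, beta, t, distance=variational_distance):
    """True when prior and posterior beliefs differ by at most t."""
    return distance(alpha, beta) <= t


# Hypothetical prior and posterior beliefs for one QI group.
alpha = {"flu": 0.25, "cold": 0.25}
beta = {"flu": 0.5, "cold": 0.25}

print(variational_distance(alpha, beta))                 # 0.125
print(satisfies_qs_t_closeness(alpha, beta, t=0.2))      # True
print(satisfies_qs_t_closeness(alpha, beta, t=0.1))      # False
```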

Figure 5. Two-phase algorithm for QS t-closeness, showing the trade-off between better privacy and smaller removal cost, and the benefit of the two-phase algorithm over a generalization-only approach.

Figure 2. Disclosure risks with QS attributes

Contributions

• Formal notions of QS l-diversity and QS t-closeness that extend l-diversity and t-closeness to prevent indirect attribute disclosure due to QS attribute values.

• A two-phase algorithm that combines generalization and value suppression to achieve QS l-diversity and QS t-closeness.

Algorithm

Phase 1 (QI generalization). Given D, an intermediate dataset Dg is obtained that satisfies k-anonymity.

Phase 2 (QS suppression). Given Dg, a suppression algorithm removes suitable QS values (items) until every QI group satisfies QS (c,l)-diversity or QS t-closeness.

• Greedy search heuristics with dynamic reordering of tailsets, which contain the values that may be removed in the next step, to return a first result quickly.

• Dynamic updates when a solution with a lower cost is found to enable continuous improvement of the result within a bounded time period.
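The two search features listed above can be sketched as an anytime depth-first search over sets of removed values. This is a minimal illustration, not the authors' C++ implementation: the function names, the cost model (one unit per removed value), the toy safety predicate, and the heuristic score are all assumptions made for the example.

```python
import time


def remove(group, cand):
    """Return a copy of the group with value v removed from tuple i."""
    i, v = cand
    return [values - {v} if j == i else values for j, values in enumerate(group)]


def suppress(group, is_safe, score, time_limit=1.0):
    """Search for a small list of (tuple_index, value) removals.

    Returns the cheapest list found before the deadline, or None.
    """
    deadline = time.monotonic() + time_limit
    best = None  # cheapest solution found so far

    def search(current, removed, tail):
        nonlocal best
        if best is not None and len(removed) >= len(best):
            return  # this branch cannot beat the current best
        if is_safe(current):
            best = list(removed)  # dynamic update: cheaper solution found
            return
        if time.monotonic() > deadline:
            return  # bounded time: keep whatever was found
        # Dynamic reordering: try the most promising removal first.
        ordered = sorted(tail, key=lambda c: score(remove(current, c)))
        for pos, cand in enumerate(ordered):
            search(remove(current, cand), removed + [cand], ordered[pos + 1:])

    tail = [(i, v) for i, values in enumerate(group) for v in values]
    search(group, [], tail)
    return best


# Hypothetical knowledge table and toy safety rule (not from the source):
# a group is "safe" when no label is linked to more than one tuple.
E = [("flu", {"fever", "cough"}), ("cold", {"cough"})]


def link_counts(group):
    return [sum(items <= values for values in group) for _, items in E]


def is_safe(group):
    return max(link_counts(group)) <= 1


def score(group):
    return sum(max(0, n - 1) for n in link_counts(group))


G = [{"fever", "cough"}, {"fever", "cough"}, {"cough"}]
removed = suppress(G, is_safe, score)
print(len(removed))  # 2 removals suffice for this toy group
```

In practice `is_safe` would be a QS (c,l)-diversity or QS t-closeness check on the group, and the search would run on each QI group produced by Phase 1.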

Figure 3. QS suppression search tree and algorithm features

Figure 4. QS suppression for QS (c,l)-diversity, showing that adaptive QS suppression significantly outperforms the baseline DFS search.