
Protecting Statistical Databases Against Snoopers

Comparison of two methods

Disclosure vs. Anonymity

Information disclosure is necessary for planning and statistical measurement

Anonymity is necessary to protect the individual and to preserve the public’s trust in these systems

Medical Data

Necessary for:

Measuring effectiveness of current treatments

Finding sources of common medical mistakes

Tracking contagious disease

Government spending planning

Health insurance companies

Anonymity: Not as Easy as it Looks

Race

Profession

Sex

Zip code

Birth date

Complete Identification Without Uniquely Identifying Information

Outside Factors Affecting Privacy

Snooper’s supplementary knowledge

Public data sources

Rarity

Comparing Two Methods of Protection

What are the privacy guarantees?

Can useful information be gained?

Sensitivity-based Noise-adding Algorithm

Proposed by Dwork, McSherry, Nissim and Smith

Adds noise to each answer based on the sensitivity of the series of queries

Amount of privacy based on ε, a coefficient in the noise-generating formula

Sensitivity

How much could changing one row change an answer?

Examples: MEAN, COUNT, HISTOGRAMS

The sensitivity of a series of queries is the sum of the sensitivities of the queries
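As a concrete illustration, here is a minimal Python sketch of the noise-adding idea, assuming Laplace noise with scale (total sensitivity) / ε as in Dwork et al.; the function names and example values are illustrative, not taken from the paper.

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise by inverse-transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_answers(true_answers, sensitivities, epsilon):
    """Answer a series of queries with noise calibrated to the series.

    The sensitivity of the series is the sum of the per-query
    sensitivities, and each answer receives Laplace noise of scale
    total_sensitivity / epsilon: smaller epsilon means more privacy
    and more noise.
    """
    scale = sum(sensitivities) / epsilon
    return [answer + laplace_noise(scale) for answer in true_answers]

# Two COUNT queries, each of sensitivity 1 (changing one row changes
# a count by at most 1), so the series has sensitivity 2.
print(noisy_answers([42, 17], sensitivities=[1, 1], epsilon=0.5))
```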

Coin-flip Algorithm

Proposed by Mishra and Sandler

A way for individuals to publish their own personal data

Amount of privacy based on ε, the bias in the coin-flip

Implementing the Coin-flip Algorithm

The k possible answers to a query are ordered and numbered

If an individual’s answer to the query is the ith answer, the profile is a string of k bits in which the ith bit is one and the others are zero

To sanitize, each bit is flipped with probability ½ + ε/2

All sanitized profiles resemble a random string of ones and zeros

Example: HIV status

Ordered possible responses: “POSITIVE, NEGATIVE, UNKNOWN”

The original profile of an HIV+ individual: “1, 0, 0”

Results of coin-flips: “STAY, FLIP, STAY”

Resulting sanitized profile: “1, 1, 0”

What do we know about the individual from the sanitized profile?
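Here is a minimal Python sketch of the sanitization step exactly as described above; the function name and parameters are hypothetical, chosen for illustration rather than taken from Mishra and Sandler’s paper.

```python
import random

def sanitize_profile(answer_index, k, epsilon):
    """Build the one-hot profile for the i-th of k ordered answers,
    then flip each bit independently with probability 1/2 + epsilon/2,
    as described above.
    """
    flip_prob = 0.5 + epsilon / 2.0
    profile = [1 if j == answer_index else 0 for j in range(k)]
    return [(1 - bit) if random.random() < flip_prob else bit
            for bit in profile]

# HIV example: ordered answers are POSITIVE, NEGATIVE, UNKNOWN, so an
# HIV+ individual's true profile is [1, 0, 0].
print(sanitize_profile(answer_index=0, k=3, epsilon=0.1))
```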

My Research

Compare the total error each algorithm generates on histogram / frequency queries

Hypothesis: The noise-adding algorithm will generate less error for few queries and the coin-flip algorithm will generate less error for many queries

Research question: Where is the “sweet spot” where the error lines cross on a graph?
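The slides do not specify the data set or the error metric, but a toy reconstruction of this comparison might look like the Python sketch below. The uniform synthetic data, the parameter values, the assumption that every frequency query is a count of sensitivity 1, and the choice to cycle through k fixed count queries are all assumptions made here for illustration, so the crossover it finds will not match the numbers reported on the following slides.

```python
import random

def expected_noise_error_pct(q, epsilon, n):
    """Expected total error for the noise-adding algorithm over a series
    of q count queries, as a percent of n. Each count query has
    sensitivity 1, so the series has sensitivity q, each answer gets
    Laplace noise of scale q / epsilon, and the expected absolute value
    of Laplace noise equals its scale.
    """
    return q * (q / epsilon) / n * 100.0

def coin_flip_count_errors(data, k, epsilon):
    """Sanitize every individual's one-hot profile (each bit flipped
    with probability 1/2 + epsilon/2, as on the earlier slide) and
    return the absolute error of a debiased count estimate for each of
    the k possible answers.
    """
    n = len(data)
    p = 0.5 + epsilon / 2.0
    ones = [0] * k
    for answer in data:
        for j in range(k):
            bit = 1 if j == answer else 0
            if random.random() < p:
                bit = 1 - bit
            ones[j] += bit
    # E[ones_j] = c_j * (1 - p) + (n - c_j) * p, so an unbiased estimate
    # of the true count c_j is (n * p - ones_j) / epsilon.
    return [abs((n * p - ones[j]) / epsilon - data.count(j))
            for j in range(k)]

def first_sweet_spot(n=1000, k=4, epsilon=0.1, max_queries=400):
    """Smallest series length at which the coin-flip total error drops
    below the (expected) noise-adding total error, on synthetic data."""
    data = [random.randrange(k) for _ in range(n)]
    per_query_errors = coin_flip_count_errors(data, k, epsilon)
    coin_total = 0.0
    for q in range(1, max_queries + 1):
        coin_total += per_query_errors[q % k] / n * 100.0
        if coin_total < expected_noise_error_pct(q, epsilon, n):
            return q
    return None

print(first_sweet_spot())
```

The intuition matches the hypothesis: the noise-adding error grows with the square of the series length, because each extra query raises the sensitivity, and so the noise, of every answer, while the coin-flip error grows only linearly, since the sanitized profiles are published once and reused for every query.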

[Chart: Sum of Error — sum of error as a percent of n (0% to 4000%) vs. number of frequency queries (1 to 381), for the coin-flip and noise-adding algorithms.]

The “sweet spot” first occurs at 101 queries.

With the smallest histograms first, the first “sweet spot” occurs at 32 queries.

[Chart: Sum of Error — same axes as above, coin-flip vs. noise addition.]

With the largest histograms first, the first “sweet spot” occurs at 189 queries.

[Chart: Sum of Error — same axes as above, coin-flip vs. noise addition.]

A Second Look

Range of sensitivity: 2 to 136

Unordered histograms: at the first “sweet spot”, sensitivity = 30.

Smallest histograms first: at the first “sweet spot”, sensitivity = 32.

Largest histograms first: at the first “sweet spot”, sensitivity = 34.

[Charts: the three Sum of Error plots, shown again.]

[Chart: Difference in Error — difference in percent error (-200% to 1600%) vs. sensitivity (2 to 92).]

Conclusions

For histogram / frequency queries, “sweet spots” occur between sensitivity = 30 and sensitivity = 40, so for least error:

If sensitivity < 30, use the NOISE-ADDING algorithm

If sensitivity > 40, use the COIN-FLIP algorithm

Quick Bibliography

Survey: N. R. Adam and J. C. Wortmann. Security-control methods for statistical databases: a comparative study. ACM Computing Surveys, 21(4), December 1989.

Noise-adding algorithm: C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. 3rd Theory of Cryptography Conference (TCC), 2006.

Coin-flip algorithm: N. Mishra and M. Sandler. Privacy via pseudorandom sketches. ACM Symposium on Principles of Database Systems (PODS), 2006.

Professor Alf Weaver, PhD

Professor Nina Mishra, PhD

REU program at UVa, sponsored by the National Science Foundation
