Statistical database, problems and mitigation

Paper Discussion: Techniques for Securing Statistical Databases
JASON M. SCHATZ, University of California at Davis

Arbin Maharjan, Bikrant Gautam
Saint Cloud State University
IA643-MSIA-Fall

Paper Objective

Statistical Database: Database used for statistical Analysis [1]

Inference: Data Mining technique to analyze data to gain knowledge about the subject [2]

❖ Define a statistical database (SDB).
❖ Explain the standard notation for statistical queries.
❖ Define what it means for an SDB to be compromised, and discuss the most common taxonomy used to classify techniques for preventing compromise.
❖ Describe a number of specific inference control systems, then conclude with a discussion of future directions for research.

[1]: https://en.wikipedia.org/wiki/Statistical_database

[2]: http://goo.gl/nTOav7


Statistical Database
❖ A database used for statistical analysis.
❖ Researchers have access to statistics, but not to the records inside.
❖ Access control and limited queries keep prying eyes away from sensitive data.
❖ Basic operations are limited to aggregates such as Count, Sum, and Mean.

The problem? Inference!
❖ Despite access control and limited queries, it is often easy to infer the contents of specific records from statistical data.
❖ The conflict between providing statistics and securing individual records gives rise to inference control.

Formalizing Statistical Query
❖ R is the flat table, with N records, obtained by joining all the tables.
❖ A statistical query computes an aggregate property of a subset of the relation R.
❖ Example: “How many car accidents involving red cars occurred before 6:00 AM?”, asked of a DB maintaining crash records.

Formalizing Statistical Query Cont.

Example table R

Table 1


❖ A characteristic formula C selects a subset of R; it is a combination of conditions joined by logical operations such as AND, OR, and NOT.
❖ X_C denotes the set of records that satisfy the characteristic formula C.
❖ Although C itself is not a set, for convenience we write C1 ⊆ C2 to denote X_C1 ⊆ X_C2, and |C| to denote |X_C|.
❖ D is the characteristic formula that is true for all records; therefore C ⊆ D holds for every characteristic formula C.


❖ C = (red color) AND (time < 6am)
❖ X_C = the records of Table 1 that satisfy C

Now applying Count and Sum to the formula C:
❖ Count(C) = 2
❖ Sum(Age, C) = 76
Average(Ai, C) can be written as:
❖ Sum(Ai, C) / Count(C), where Ai is a particular attribute of the tuple.
❖ Avg(Age, C) = 76 / 2 = 38
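The notation above can be sketched in a few lines of Python. This is only an illustration: the records are hypothetical stand-ins for Table 1 (which is not reproduced here), chosen so that the results match the numbers on the slide, and the helper names `query_set`, `count`, `total` and `average` are made up for this sketch.

```python
# Hypothetical stand-in for Table 1 (the original table is not reproduced here).
# "hour" is the hour of day of the crash, 0-23.
RECORDS = [
    {"color": "red",    "hour": 4,  "age": 31},
    {"color": "red",    "hour": 5,  "age": 45},
    {"color": "red",    "hour": 14, "age": 22},
    {"color": "blue",   "hour": 3,  "age": 58},
    {"color": "yellow", "hour": 9,  "age": 40},
]


def query_set(records, formula):
    """X_C: the records that satisfy the characteristic formula C."""
    return [r for r in records if formula(r)]


def count(records, formula):
    """Count(C) = |X_C|."""
    return len(query_set(records, formula))


def total(records, attr, formula):
    """Sum(attr, C): sum of one attribute over the query set."""
    return sum(r[attr] for r in query_set(records, formula))


def average(records, attr, formula):
    """Avg(attr, C) = Sum(attr, C) / Count(C); None for an empty query set."""
    n = count(records, formula)
    return total(records, attr, formula) / n if n else None


# C = (red color) AND (time < 6am)
def C(r):
    return r["color"] == "red" and r["hour"] < 6


print(count(RECORDS, C), total(RECORDS, "age", C), average(RECORDS, "age", C))
# prints: 2 76 38.0
```

A characteristic formula is modeled here as a plain predicate on a record, so AND, OR and NOT map directly onto Python's boolean operators.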

Define Compromise
❖ An SDB is compromised when a user amasses information that can be used to infer the value of a specific field.
➢ Exact compromise: a malicious user can infer the value 1 in a Boolean field, or the exact value of a numerical field.
➢ Partial compromise: a user can infer the value 0 for a Boolean field, or can estimate the value of a numerical field to within a given precision. If the exact value of a field is A, and the estimated value is A’, then the SDB has been partially compromised if |A’/A| < k*k for a given k.

Compromise Example
For example, if a user knows that Patrick Lindstrom was in an accident involving his yellow car and wants to know whether Pat was at fault, he could form the formula:
❖ C = (Name = P. Lindstrom) AND (Color = Yellow)
❖ Count(C) = 1
❖ Sum(At Fault, C) = 1
❖ Sum(DUI, C) = 0
This is a simple attack: the characteristic formula C yields a query set of size 1. It can be prevented by setting a minimum query set size k, so that any query with |C| < k is rejected.
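A minimal sketch of such a query set size restriction follows. The records and the names `guarded_count` and `QueryRefused` are hypothetical. The sketch also refuses query sets larger than N - k, an addition beyond the minimum size described above, since otherwise a small set could be recovered by subtracting the answer for its complement from the total.

```python
class QueryRefused(Exception):
    """Raised when a query set size falls outside the allowed range."""


def guarded_count(records, formula, k):
    """Answer Count(C) only if k <= |C| <= N - k; otherwise refuse."""
    n = len(records)
    size = sum(1 for r in records if formula(r))
    if size < k or size > n - k:
        raise QueryRefused(f"query set size must be between {k} and {n - k}")
    return size


# Hypothetical records: (name, color)
RECORDS = [
    ("P. Lindstrom", "yellow"),
    ("driver 2", "red"),
    ("driver 3", "red"),
    ("driver 4", "blue"),
    ("driver 5", "yellow"),
    ("driver 6", "blue"),
]


def is_target(r):
    return r[0] == "P. Lindstrom" and r[1] == "yellow"


def is_red(r):
    return r[1] == "red"


print(guarded_count(RECORDS, is_red, k=2))  # two red cars: allowed
try:
    guarded_count(RECORDS, is_target, k=2)  # query set of size one
except QueryRefused as err:
    print("refused:", err)
```

The refusal message reports only the allowed range, not the actual query set size, so the refusal itself does not leak the count.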

Effective Method for Compromising SDBs
Tracker method: a researcher can calculate the value of any query, including queries with a query set of size one, so long as the SDB’s minimum query size is set no higher than k = N/6.
❖ To use the tracker method, a user must find a characteristic formula with a legal query set size. This formula is referred to as the tracker, denoted T, where 2k <= |T| <= N - 2k.

Tracker Method Example
❖ From Table 1 (number of records N = 12):
minimum query size, k = N/6 = 12/6 = 2
maximum query size, N - k = 12 - 2 = 10
Find T such that 2k <= |T| <= N - 2k, i.e. 4 <= |T| <= 8.
❖ The user has to guess a characteristic formula T that falls within this range. Try T = (Age < 25): Count(T) = 5, which satisfies 4 <= 5 <= 8.

Tracker Method Example Continued
❖ The user could use the tracker to check whether C = (Name = P. Lindstrom) AND (Color = Yellow) has a size of one.
❖ The user could use the method again to determine whether or not the driver was at fault, using the earlier condition:
SUM(At Fault, C) = SUM(At Fault, C OR (Age < 25)) + SUM(At Fault, C OR (Age >= 25)) - SUM(At Fault, D)
= 3 + 3 - 5 = 1

❖ The user has partially compromised the SDB by inferring that Mr. Lindstrom was At Fault in his accident.

Note that the record corresponding to Lindstrom’s crash in his yellow car is selected by both (C OR T) and (C OR Tc), where Tc is the complement of T, while records not selected by C are counted only once. Subtracting the query value for all records, Q(D), leaves only the value of the records that were double counted, i.e. Q(C).
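The tracker computation can be sketched as follows. The 12 records are hypothetical (Table 1 is not reproduced here) and are chosen so that the intermediate sums match the 3 + 3 - 5 = 1 on the slide; `guarded_sum` and `tracker_sum` are illustrative names. Because the guard in this sketch also refuses query sets larger than N - k, Q(D) is obtained as Q(T) + Q(NOT T), which are both legal queries.

```python
# Hypothetical 12-record table: (name, color, age, at_fault).
ROWS = [
    ("P. Lindstrom", "yellow", 40, 1),
    ("r01", "red", 19, 1),
    ("r02", "blue", 22, 0),
    ("r03", "red", 24, 1),
    ("r04", "green", 18, 0),
    ("r05", "white", 21, 0),
    ("r06", "blue", 33, 1),
    ("r07", "red", 45, 0),
    ("r08", "black", 52, 1),
    ("r09", "white", 29, 0),
    ("r10", "yellow", 61, 0),
    ("r11", "green", 37, 0),
]
N, K = len(ROWS), 2  # N = 12, minimum query size k = 2
AT_FAULT = 3         # column index of the At Fault field


class QueryRefused(Exception):
    pass


def guarded_sum(column, formula):
    """SUM over the query set, refused unless K <= |C| <= N - K."""
    selected = [row for row in ROWS if formula(row)]
    if not K <= len(selected) <= N - K:
        raise QueryRefused()
    return sum(row[column] for row in selected)


def C(row):  # the target: a query set of size one
    return row[0] == "P. Lindstrom" and row[1] == "yellow"


def T(row):  # the tracker: 5 records, within 2k <= |T| <= N - 2k
    return row[2] < 25


def tracker_sum(column, target, tracker):
    """Q(C) = Q(C OR T) + Q(C OR NOT T) - Q(D), with Q(D) = Q(T) + Q(NOT T)."""
    def negated(row):
        return not tracker(row)

    q_all = guarded_sum(column, tracker) + guarded_sum(column, negated)
    q_c_or_t = guarded_sum(column, lambda row: target(row) or tracker(row))
    q_c_or_not_t = guarded_sum(column, lambda row: target(row) or negated(row))
    return q_c_or_t + q_c_or_not_t - q_all


print(tracker_sum(AT_FAULT, C, T))  # 3 + 3 - 5 = 1
```

Every query issued by `tracker_sum` has a legal size, yet the result is the At Fault value of a single record, which a direct `guarded_sum(AT_FAULT, C)` would refuse.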

Taxonomy of Inference Control Techniques
❖ Many inference control techniques have been proposed.
❖ A common classification scheme is supplied by Adam and Wortmann:
➢ Query Set Restriction
➢ Data Perturbation
➢ Output Perturbation
➢ Conceptual Approach

Evaluating Effectiveness of Inference Control Techniques
❖ Security
❖ Robustness
❖ Bias
❖ Precision
❖ Consistency
❖ Cost

Inference Control Techniques
❖ Approximate Data Swapping
❖ Random Sample Queries
❖ Fixed Perturbation
❖ Query based Perturbation
❖ Rounding

Approximate Data Swapping
❖ Creates a new database from the existing SDB by swapping attribute values between records without altering the values of any t-order statistics, for some number t.
❖ Based on the notion that a significant amount of the valuable information in an SDB is captured in its statistics of order up to t.
❖ From Table 1, COUNT(Color = Red ∧ Make = Honda) is a two-order statistic.
❖ Statistics of order higher than t are not necessarily equal to those of the original database.
❖ Increasing t decreases bias.
❖ Increasing t also increases a snooper's ability to compromise the database.
❖ Not suitable for SDBs that are dynamically updated.
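As a rough illustration of the idea behind data swapping, the sketch below covers only the simplest case, t = 1: permuting the values of one column across records leaves every one-order count unchanged, while two-order counts may change. The rows and the helper `swap_column` are hypothetical, and this is not the approximate data swapping algorithm itself, which has to preserve statistics up to a chosen order t.

```python
import random
from collections import Counter


def swap_column(rows, column, rng):
    """Return a copy of rows with one column's values permuted across records."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [row[:column] + (v,) + row[column + 1:] for row, v in zip(rows, values)]


# Hypothetical records: (color, make)
ROWS = [
    ("red", "Honda"),
    ("red", "Honda"),
    ("red", "Ford"),
    ("blue", "Honda"),
    ("blue", "Ford"),
    ("yellow", "Ford"),
]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
SWAPPED = swap_column(ROWS, 1, rng)

# One-order statistics (counts per single attribute value) are preserved.
print(Counter(r[0] for r in ROWS) == Counter(r[0] for r in SWAPPED))
print(Counter(r[1] for r in ROWS) == Counter(r[1] for r in SWAPPED))

# Two-order statistics such as COUNT(Color = red AND Make = Honda) may differ.
print(Counter(ROWS)[("red", "Honda")], Counter(SWAPPED)[("red", "Honda")])
```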

Random Sample Queries
❖ The US Census Bureau uses this technique.
❖ It prevents inference in their tabulated SDBs.
❖ Each published query is based on a randomly selected query set.
❖ There is a fundamental difference between statically published queries and dynamically calculated queries.
❖ The Census Bureau produces a single random sample per query.
❖ A new sample must be selected for each query.
❖ By repeating the same query many times and averaging the results, a snooper can eliminate the bias introduced through sampling.
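The averaging attack on dynamically sampled queries can be sketched as below. The sampling scheme is an assumption made for illustration: each record of the query set is kept with probability `p`, and the count is scaled by 1/p. The data and the name `sampled_count` are hypothetical.

```python
import random


def sampled_count(records, formula, p, rng):
    """Estimate Count(C) from a fresh random sample of the query set.

    Each matching record is kept with probability p (an assumed scheme),
    and the sample count is scaled by 1/p.
    """
    kept = sum(1 for r in records if formula(r) and rng.random() < p)
    return kept / p


# Hypothetical data: 500 driver ages between 18 and 67.
AGES = [18 + (i % 50) for i in range(500)]


def under_25(age):
    return age < 25


true_count = sum(1 for a in AGES if under_25(a))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
single = sampled_count(AGES, under_25, 0.8, rng)

# The snooper repeats the same query and averages the answers.
REPEATS = 2000
averaged = sum(sampled_count(AGES, under_25, 0.8, rng) for _ in range(REPEATS)) / REPEATS

print("true:", true_count, "one answer:", single, "averaged:", round(averaged, 2))
```

A single answer can be several units away from the true count, but the average of many repetitions settles close to it, which is why a fresh sample per repeated query does not protect a dynamic SDB on its own.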

Fixed Perturbation
❖ Creates an alternate database from the original data.
❖ Only applicable to numerical data.
❖ The alternate database is created by altering the value of each field by a randomly generated perturbation value.
❖ Perturbation is done only once, so repeated queries return the same results; there is no averaging attack.
❖ Perturbation values are normally distributed with mean μ = 0 and variance σ².
❖ For a query set and the given attributes, the sum of the perturbations is zero.
❖ Bias is high, and additional bias is introduced because the attribute values used in the characteristic formula are also altered.
❖ Bias might be as high as 50%.
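A minimal sketch of fixed perturbation, under stated assumptions: the values, the noise level `sigma` and the seed are hypothetical, and records are selected by index rather than by a characteristic formula evaluated on perturbed attributes, which simplifies away the extra bias mentioned above.

```python
import random


def perturb_once(values, sigma, seed):
    """Build the alternate database: add zero-mean normal noise to every value."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]


AGES = [31, 45, 22, 58, 40, 19]  # hypothetical original values

# The perturbation is generated a single time and then stored.
PROXY = perturb_once(AGES, sigma=2.0, seed=42)


def proxy_sum(indices):
    """Answer a Sum query from the stored alternate database only."""
    return sum(PROXY[i] for i in indices)


first = proxy_sum([0, 1, 4])
second = proxy_sum([0, 1, 4])
print(first == second)  # True: repeating the query yields no new information
```

Because every query reads the same stored noise, averaging repeated answers cannot cancel it, in contrast to per-query sampling.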

Query based Perturbation
❖ A technique that does not require the creation of a proxy database.
❖ For each query, a perturbation function is applied to all the attributes that affect the resulting value.
❖ The bias introduced is not stored in the database and differs between individual queries.
❖ For count queries, the bias function is applied to all attributes, so the altered query set has a biased count.
❖ For other aggregate queries, the query set is selected using unbiased data.
❖ After the correct records are selected, the biasing function is applied to the attributes involved in the aggregate function.
❖ Easy to implement but vulnerable to attacks.
❖ It allows inconsistencies and paradoxes in query results.

Rounding
❖ A perturbation technique applied to query results.
❖ Queries are calculated on unbiased data, then the results are altered before being returned to the user.
❖ Does not involve random bias generation; rather, results are rounded to the nearest multiple of a given value.
❖ Not vulnerable to the averaging attack.
❖ There are a number of complicated techniques which allow exact compromise.
❖ Must be used in concert with other techniques to provide security.
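Rounding is simple enough to sketch directly. The base of 5 and the sample values are hypothetical, and ties are rounded up here as an arbitrary choice.

```python
import math


def round_to_multiple(value, base):
    """Round value to the nearest multiple of base (ties round up)."""
    return base * math.floor(value / base + 0.5)


def rounded_sum(values, base):
    """Compute the true sum on unbiased data, then round the result."""
    return round_to_multiple(sum(values), base)


ages = [31, 45]              # hypothetical query set
print(rounded_sum(ages, 5))  # true sum 76 is reported as 75
```

The rounding is deterministic, so repeating a query always returns the same answer and averaging gains nothing.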

Conclusion
❖ There is no effective solution to the inference control problem that can be applied to a wide range of SDBs.
❖ No generally applicable solution provides both a high level of security and unbiased statistics for a rich set of queries.
❖ Some measure of success has been found by relaxing the goals of inference control in two ways:
➢ by using a definition of security that is less demanding than fully avoiding exact and partial compromise
➢ by tailoring inference control solutions to specific categories of SDBs.

Thanks. Any questions?
