Auditing Compliance with a Hippocratic Database

Rakesh Agrawal, Roberto Bayardo, Christos Faloutsos, Jerry Kiernan, Ralf Rantzau, Ramakrishnan Srikant
Intelligent Information Systems Research
IBM Almaden Research Center


Outline

Introduction and motivation
Problem statement
Foundations
System organization and algorithms
Performance
Summary

Motivation

Hippocratic databases advocate policy-directed data management for privacy-sensitive data
– Need reinforced by legislation and regulations:
  Health Insurance Portability & Accountability Act
  Gramm-Leach-Bliley Act – Consumer Privacy Rule
Goal
– Build a system to assist with auditing compliance with the stated policy
  Event driven - privacy complaint
  Periodic - monitor exposure to privacy violation

Audit Scenario

Jane has not been feeling well and decides to consult her doctor

The doctor uncovers that Jane’s blood sugar level is high and suspects diabetes

Sometime later, Jane receives promotional literature from a pharmaceutical company, proposing over-the-counter diabetes tests

Jane complains to the Department of Health and Human Services, saying that she had opted out of the doctor sharing her medical information with pharmaceutical companies for marketing purposes

The doctor must now review disclosures of Jane’s information in order to understand the circumstances of the disclosure, and take appropriate action

Audit Expression

audit T.disease

from Customer C, Treatment T

where C.cid=T.pcid and C.name = ‘Jane’

Who has accessed Jane’s disease information?


Problem Statement

Given
– A log of queries executed over a database
– An audit expression specifying sensitive data
Precisely identify
– Those queries that accessed the data specified by the audit expression

“Suspicious” Queries

Customer table
cid | name | address | zip | …
1 | Jane | 1234 … | 95120 | …
…

A query Qi has accessed information contained in the Customer table

The audit expression A specifies the data to be audited

If query Qi accesses all the cells specified by the audit expression A for any row, Qi is suspicious

Issues

Convenient language
– Audit expression (essentially SPJ query)
Fast and precise on audits
Non disruptive
– Minimal performance impact on normal database operation
Fine grained

Assumptions

Disclosures stemming from multiple query executions are not considered
No use of outside knowledge to deduce information without detection
Queries considered include
– Joins and aggregation, but not nested subqueries
  Note that existential subqueries can be converted into joins [SIGMOD92]


Informal Definitions

“Candidate” query
– Logged query that accesses all columns specified by the audit expression
“Indispensable” tuple (for a query)
– A tuple whose omission makes a difference to the result of a query
“Suspicious” query
– A candidate query that shares an indispensable tuple with the audit expression

Indispensable Tuple

The SPJ query Q and the audit expression A are of the form:

$$Q = \pi_{C_{OQ}}(\sigma_{P_Q}(T \times R))$$

$$A = \pi_{C_{OA}}(\sigma_{P_A}(T \times S))$$

Definition 1 - A virtual tuple v ∈ T is indispensable for an SPJ query Q if the result of Q changes when we delete v:

$$ind(v, Q) \Leftrightarrow \pi_{C_Q}(\sigma_{P_Q}(T \times R)) \neq \pi_{C_Q}(\sigma_{P_Q}((T - \{v\}) \times R))$$

Notation
– $P_Q$: predicates in Q
– $C_Q$: columns appearing anywhere in Q
– $C_{OQ}$: output columns in Q
– $\pi$: duplicate preserving projection operator
– $T$: tables common to Q and A

“Candidate” Query

Definition 6 - Q is a candidate query with respect to A if:

$$C_Q \supseteq C_{OA}$$

Only candidate queries can be suspicious queries
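The candidate test in Definition 6 is a set-containment check on column names, so it can be sketched without touching any data. The following is a minimal illustration, not the system's implementation: the function name `is_candidate` and the qualified column names are assumptions made for the example, and extracting the column sets from a query string is left out.

```python
def is_candidate(query_columns, audit_columns):
    """Definition 6: Q is a candidate w.r.t. A when C_Q contains C_OA."""
    return set(audit_columns) <= set(query_columns)


# Column sets written out by hand (hypothetical; a real system would
# obtain them by parsing the logged query and the audit expression).
audit_cols = {"Treatment.disease"}

# Logged query 1: name, address, zip of customers with diabetes.
q1_cols = {
    "Customer.name", "Customer.address", "Customer.zip",
    "Customer.cid", "Treatment.pcid", "Treatment.disease",
}

# Logged query 2: name, address of customers in one zip code.
q2_cols = {"Customer.name", "Customer.address", "Customer.zip"}

q1_is_candidate = is_candidate(q1_cols, audit_cols)
q2_is_candidate = is_candidate(q2_cols, audit_cols)
```

Query 1 mentions `Treatment.disease` in its predicate, so it stays in the candidate set even though the column is not in its output; query 2 never touches the column and is filtered out.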

“Suspicious” Query

Definition 5 - Maximal virtual tuple (MVT): A tuple v is an MVT for queries Q1 and Q2 if it belongs to the cross product of common tables in their from clauses

Definition 7 - Q is suspicious with respect to A if they share an indispensable MVT v:

$$susp(Q, A) \Leftrightarrow \exists v \in T \text{ s.t. } ind(v, Q) \wedge ind(v, A)$$

For example,
Query Q: Addresses of people with diabetes
Audit A: Jane’s diagnosis

Jane’s tuple is indispensable for both; hence query Q is “suspicious” with respect to A
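Definitions 1, 5 and 7 can be checked by brute force on a toy instance: delete each maximal virtual tuple in turn and see whether both results change. This is only a sketch of the definitions, not the audit algorithm of the system (which avoids re-running queries). The rows, the second query `q_flu`, and all function names are made up for the illustration.

```python
from itertools import product

# Toy instance (made-up rows): Customer(cid, name, address), Treatment(pcid, disease)
customer = [(1, "Jane", "addr-1"), (2, "Alice", "addr-2")]
treatment = [(1, "diabetes"), (2, "flu")]

# Definition 5: MVTs are the cross product of the tables common to Q and A.
T = list(product(customer, treatment))


def q_diabetes(vts):
    """Q: addresses of people with diabetes (duplicate-preserving)."""
    return sorted(c[2] for c, t in vts if c[0] == t[0] and t[1] == "diabetes")


def q_flu(vts):
    """Hypothetical second query: addresses of people with flu."""
    return sorted(c[2] for c, t in vts if c[0] == t[0] and t[1] == "flu")


def audit(vts):
    """A: Jane's diagnosis."""
    return sorted(t[1] for c, t in vts if c[0] == t[0] and c[1] == "Jane")


def ind(v, query, vts):
    """Definition 1: v is indispensable if deleting it changes the result."""
    remaining = [x for x in vts if x != v]
    return query(vts) != query(remaining)


def susp(query, audit_expr, vts):
    """Definition 7: some MVT is indispensable for both Q and A."""
    return any(ind(v, query, vts) and ind(v, audit_expr, vts) for v in vts)


diabetes_is_suspicious = susp(q_diabetes, audit, T)
flu_is_suspicious = susp(q_flu, audit, T)
```

On this instance the only tuple indispensable for the audit is Jane's customer row paired with her treatment row; it is also indispensable for the diabetes query but not for the flu query.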


System Overview

[Figure: system architecture. Labels: Data Tables, Backlog, Query Log, Database Layer, Audit. Inputs to the database layer: query with purpose, recipient; updates, inserts, delete. Audit path: audit expression, static analysis, generate audit query, audit query. Output: IDs of log queries having accessed data specified by the audit query]

Database triggers track updates to base tables

Query Log
ID | Timestamp | Query | User | Purpose | Recipient
1 | 2004-02… | Select … | James | Current | Ours
2 | 2004-02… | Select … | John | Telemarketing | public
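As a rough illustration of how triggers can populate a backlog table, the sketch below uses SQLite through Python's `sqlite3` module. This is a stand-in chosen so the example runs anywhere, not the database used in the talk. The table layout, trigger names, the `I`/`U`/`D` operation codes, and the logical timestamp (one more than the largest `ts` so far) are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cid INTEGER PRIMARY KEY, name TEXT, address TEXT);

-- Backlog: the source columns plus the operation (opr) and a timestamp (ts).
CREATE TABLE customer_backlog (
    cid INTEGER, name TEXT, address TEXT, opr TEXT, ts INTEGER
);

-- Hypothetical logical clock: next ts is one more than the largest so far.
CREATE TRIGGER customer_ins AFTER INSERT ON customer BEGIN
    INSERT INTO customer_backlog VALUES (NEW.cid, NEW.name, NEW.address, 'I',
        (SELECT COALESCE(MAX(ts), 0) + 1 FROM customer_backlog));
END;

CREATE TRIGGER customer_upd AFTER UPDATE ON customer BEGIN
    INSERT INTO customer_backlog VALUES (NEW.cid, NEW.name, NEW.address, 'U',
        (SELECT COALESCE(MAX(ts), 0) + 1 FROM customer_backlog));
END;

CREATE TRIGGER customer_del AFTER DELETE ON customer BEGIN
    INSERT INTO customer_backlog VALUES (OLD.cid, OLD.name, OLD.address, 'D',
        (SELECT COALESCE(MAX(ts), 0) + 1 FROM customer_backlog));
END;
""")

# Made-up data: one insert, one update, one delete on the base table.
conn.execute("INSERT INTO customer VALUES (1, 'Jane', 'addr-1')")
conn.execute("UPDATE customer SET address = 'addr-2' WHERE cid = 1")
conn.execute("DELETE FROM customer WHERE cid = 1")

backlog = conn.execute(
    "SELECT name, address, opr, ts FROM customer_backlog ORDER BY ts"
).fetchall()
```

Every change to the base table leaves one row in the backlog, so earlier versions of a tuple remain available after the base table has moved on.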

Static Analysis

[Figure: the query log and the audit expression feed a “Filter Queries” step, which outputs the candidate queries]

Query Log
ID | Timestamp | Query | User | Purpose | Recipient
1 | 2004-02… | Select … | James | Current | Ours
2 | 2004-02… | Select … | John | Telemarketing | public

Eliminates queries that could not possibly have violated the audit expression

Accomplished by examining only the queries themselves (i.e., without running the queries)

Ensures that $C_Q \supseteq C_{OA}$

Audit Query Generation

Goal
– Build a query which, when run, returns the IDs of suspicious queries with respect to an audit expression A

Generating the Audit Query

[Figure: query graph (QGM) with boxes Candidate Query 1, Candidate Query 2, Audit Expression and Union, over tables T1 and T2]

Combine individual candidate queries and the audit expression into a single query graph

Combine the audit expression with individual candidate queries to identify suspicious queries

Replace each table with its backlog to restore the version of the table to the time of each query

QGM is a graphical representation of a query
– Boxes represent operators, such as select
– Lines represent input/output relationships between operators
– Boxes with no inputs are tables

Suspicious SPJ Query

The candidate SPJ query Q and the audit expression A are of the form:

$$Q = \pi_{C_{OQ}}(\sigma_{P_Q}(T \times R))$$

$$A = \pi_{C_{OA}}(\sigma_{P_A}(T \times S))$$

Theorem 2 - A candidate SPJ query Q is suspicious with respect to an audit expression A if and only if:

$$\sigma_{P_A}(\sigma_{P_Q}(T \times R \times S)) \neq \emptyset$$

QGM rewrites, shown in previous slide, transform Q and A into:

$$\pi_{\text{“}Q_i\text{”}}(\sigma_{P_A}(\sigma_{P_Q}(T \times R \times S)))$$

Proof of correctness is based upon Definition 7 (suspicious query) and given in the paper
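The emptiness test of Theorem 2 can be tried out on a small instance by conjoining the predicate of a candidate query with the predicate of the audit expression. The sketch below uses SQLite via `sqlite3` purely for illustration; the rows, the extra logged query `P_Q3`, and the helper `is_suspicious` are assumptions. Here both Q and A range over Customer and Treatment, so T is their cross product and R and S are empty. Predicates are spliced into the SQL text only because they are trusted constants in this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up rows; only the table and column names follow the Jane example.
conn.executescript("""
CREATE TABLE Customer (cid INTEGER, name TEXT, address TEXT, zip TEXT);
CREATE TABLE Treatment (pcid INTEGER, disease TEXT);
INSERT INTO Customer VALUES (1, 'Jane', 'addr-1', '95120');
INSERT INTO Customer VALUES (2, 'Alice', 'addr-2', '95112');
INSERT INTO Treatment VALUES (1, 'diabetes');
INSERT INTO Treatment VALUES (2, 'flu');
""")

P_A = "C.cid = T.pcid AND C.name = 'Jane'"           # audit expression predicate
P_Q1 = "T.disease = 'diabetes' AND C.cid = T.pcid"   # logged query 1
P_Q3 = "T.disease = 'flu' AND C.cid = T.pcid"        # hypothetical logged query


def is_suspicious(conn, p_q, p_a):
    """Theorem 2: Q is suspicious iff sigma_PA(sigma_PQ(T x R x S)) is non-empty."""
    sql = (
        "SELECT 1 FROM Customer C, Treatment T "
        f"WHERE ({p_q}) AND ({p_a}) LIMIT 1"
    )
    return conn.execute(sql).fetchone() is not None


q1_suspicious = is_suspicious(conn, P_Q1, P_A)
q3_suspicious = is_suspicious(conn, P_Q3, P_A)
```

The diabetes query selects the Jane row that the audit expression also selects, so the combined selection is non-empty; the flu query does not.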

Suspicious Aggregate Query (Including Having)

Solution in the paper

Example

Jane’s audit

Audit Expression

audit T.disease

from Customer C, Treatment T

where C.cid=T.pcid and C.name = ‘Jane’

Who has accessed Jane’s disease information?

Query Log

ID | Query | TS | User | Purpose | Recipient
1 | select name, address, zip from Customer, Treatment where disease = ‘diabetes’ and cid=pcid | T3 | james | marketing | others
2 | select name, address from Customer where zip=‘95112’ | T3 | john | contact | others

Query 1 was executed at time T3

Backlog Table (Time Stamp)

Name | Address | … | OPR | TS
Jane | 1234… | … | I | T2
Jane | 1234… | … | U | T4
Alice | … | … | I | T1

Name, Address, …: attributes also in the source table
OPR, TS: attributes only in the backlog table
– OPR: operation on a tuple among Insert, Update and Delete
– TS: timestamp of the operation

Jane’s record was inserted at time T2 and updated at time T4. The backlog table records both versions of her information

C. S. Jensen, L. Mark, and N. Roussopoulos [TKDE 1991]

Merge Logged Queries and Audit Expression

Merge logged queries and audit expression into a single query graph

[Figure: query graph over Customer (c, n, …, t), quantifier C, and Treatment (p, r, …, t), quantifier T. Logged query box: Select := T.s=‘diabetes’ and T.p=C.c, output C.n, C.a, C.z. Audit expression box: audit expression := T.p=C.c and C.n= ‘Jane’, output T.s]

Transform Query Graph into an Audit Query

[Figure: query graph over Customer (c, n, …, t), quantifier C, and Treatment (p, r, ..., t), quantifier T. Logged query box: Select := T.s=‘diabetes’ and C.c=T.p, output C.n. Audit expression box over quantifier X: audit expression := X.n= ‘Jane’, output ‘Q1’]

View of Customer (Treatment) is a temporal view at the time the query was executed

The audit expression now ranges over the logged query. If the logged query is suspicious, the audit query will output the id of the logged query

Scenario Outcome

The audit uncovers that Query 1 in the query log accessed Jane’s information


Empirical Evaluation: Goals

Cost of maintaining backlog tables
– Understand the impact of maintaining backlog tables on ongoing database operations
Cost of running audits
– Understand whether audits can run in reasonable time

Experimental Setup

IBM M Pro 6868 Intellistation
– 800 MHz Pentium III processor
– 512 MB of memory
– 16.9 GB disk drive
Windows 2000 Version 5, SP 4
DB2 v7 with default settings
TPC-H database
– Supplier table
  100,000 tuples

System Structures

Indexing
– Eager indexing
  Maintain an index over the backlog table
  Maintained during ongoing database operations
– Lazy indexing
  No index over the backlog table
  Create indices at the time of audit
Choice of index
– Simple index
  Primary key of source table
– Composite index
  Primary key of source table
  Time stamp
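The two index choices can be written down as ordinary index definitions over the backlog table. The sketch below uses SQLite through `sqlite3` as a stand-in for the DB2 setup in the experiments; the backlog layout and the index names are assumptions made for the example. Under eager indexing these statements would run before updates begin; under lazy indexing they would run when an audit is requested.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical backlog of the Supplier table: source key, payload, opr, ts.
CREATE TABLE supplier_backlog (skey INTEGER, name TEXT, opr TEXT, ts INTEGER);

-- Simple index: primary key of the source table only.
CREATE INDEX backlog_simple ON supplier_backlog (skey);

-- Composite index: primary key of the source table plus the time stamp.
CREATE INDEX backlog_composite ON supplier_backlog (skey, ts);
""")

# Read the definitions back from the catalog.
index_names = sorted(
    row[1] for row in conn.execute("PRAGMA index_list('supplier_backlog')")
)
composite_columns = [
    row[2] for row in conn.execute("PRAGMA index_info('backlog_composite')")
]
simple_columns = [
    row[2] for row in conn.execute("PRAGMA index_info('backlog_simple')")
]
```

In practice only one of the two indexes would be kept; both are created here just to show the difference in their column lists.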

Impact on Ongoing Operations

Queries
– Additionally log the query string
  Already performed in many application environments
Updates
– For each updated tuple,
  Insert a tuple to the backlog table
– Inserts and deletes are handled similarly
In a majority of environments, queries are much more frequent than updates

Update Performance

100,000 tuples in Supplier table
Update statement updates all tuples
Each update statement fires triggers that insert an additional 100,000 tuples in backlog
Evaluate impact of multiple versions on performance

Overhead on Updates

[Figure: update time in minutes (y-axis, 0 to 250) versus number of versions of each tuple in the Supplier backlog table (x-axis: 5, 20, 35, 50); series: Composite, Simple, No Index, No Triggers]

Simple wins over Composite

7x if all tuples are updated

3x if a single tuple is updated

Eager indexing doesn’t add much cost

Audit Query Performance

Audit query:

select ‘Q’ from Supplier where skey = k

Experiment:

Evaluate the impact of the number of versions of tuples in the backlog table on performance

Audit Query Execution Time

[Figure: audit query execution time in msec. (y-axis, ticks at 1, 10, 100, 1000) versus # versions per tuple (x-axis: 1, 10, 20, 30, 40, 50); series: Simple-I, Simple-C, Composite-I, Composite-C]

Composite wins over simple if initial version is selected

Simple wins over composite if the current version is selected

Takeaways

The composite index
– Enhances the performance of audits, but
– Additionally burdens updates when using eager indexing
The system supports
– Efficient auditing
– Without substantially burdening normal query processing

Related Work

Oracle Privacy Security Auditing
– Facility for logging queries with timestamp
– Flash-back queries
  Restores the version of the data at the time of the query
– No support for automated auditing
  User manually selects queries from the log and runs them
  The user has to decide if the query is suspicious
G. Miklau, D. Suciu [SIGMOD 2004]
– Formal analysis of information disclosure in data exchange
  Is information about a secret query S revealed by views V1,…,Vn
  Considers all possible instances of a database schema
  Assumes tuple independence
– We’re interested in given instances (temporal versions)
– Nonetheless, it will be interesting to explore the connection between the two works
Active enforcement of policies by limiting disclosure [VLDB’04]
Literature on multi-query optimization

Summary

In light of new privacy legislation
– The problem of auditing usage of information represents an important opportunity for database research
Formalized the problem through the fundamental concepts of indispensable tuple and suspicious queries
Achieved our design goals:

Design Goals

Convenient language
Fast and precise on audits
Non disruptive
– Minimal performance impact on normal database operation
Fine grained

Backup

Multiple Candidate Queries

[Figure: two audit expression boxes, each with audit expression := C.n= ‘Jane’, one with output ‘Q1’ and one with output ‘Q2’, feeding a Union operator]
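One way to picture the union of per-candidate audit branches is to generate a single SQL statement with one branch per candidate query, each emitting its own id. The sketch below is an illustration under assumptions, not the QGM rewrite itself: `build_audit_query`, the rows, and the logged queries `Q3` and `Q4` are hypothetical, and predicates are spliced into the SQL text only because they are trusted constants here.

```python
import sqlite3


def build_audit_query(candidates, audit_predicate, from_clause):
    """One branch per candidate query, emitting its id; branches joined by UNION."""
    branches = [
        f"SELECT '{qid}' AS id FROM {from_clause} "
        f"WHERE ({pred}) AND ({audit_predicate})"
        for qid, pred in candidates.items()
    ]
    return " UNION ".join(branches) + " ORDER BY id"


conn = sqlite3.connect(":memory:")
# Made-up rows for the Customer / Treatment example.
conn.executescript("""
CREATE TABLE Customer (cid INTEGER, name TEXT, address TEXT, zip TEXT);
CREATE TABLE Treatment (pcid INTEGER, disease TEXT);
INSERT INTO Customer VALUES (1, 'Jane', 'addr-1', '95120');
INSERT INTO Customer VALUES (2, 'Alice', 'addr-2', '95112');
INSERT INTO Treatment VALUES (1, 'diabetes');
INSERT INTO Treatment VALUES (2, 'flu');
""")

candidates = {
    "Q1": "T.disease = 'diabetes' AND C.cid = T.pcid",
    "Q3": "T.disease = 'flu' AND C.cid = T.pcid",   # hypothetical
    "Q4": "C.zip = '95120' AND C.cid = T.pcid",     # hypothetical
}
audit_sql = build_audit_query(
    candidates,
    audit_predicate="C.cid = T.pcid AND C.name = 'Jane'",
    from_clause="Customer C, Treatment T",
)
suspicious_ids = [row[0] for row in conn.execute(audit_sql)]
```

Running the combined statement once returns the ids of all suspicious candidates; `UNION` also removes the duplicate ids a branch would otherwise emit when several rows match.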

Aggregate Queries with Having

[Figure: query graph for an aggregate query with a having clause. Boxes labeled Qs, Qg and Qh. Select box: select := …, output c1, …, ci. Group box: group := c1, …, ci, output c1, …, ci, agg1, …, aggn. Two audit expression boxes, each with audit expression := …, output c1, …, ck. Joining select box over quantifier q1: select := q1.c1=q2.c1 and … and q1.ci=q2.ci, output ‘Q1’]

The join on aggregate columns ensures that the group being tracked by the audit has not been eliminated by the having clause

Dynamic Temporal Views

View of Customer table at the time stamp of the logged query

[Figure: query graph over Customer_backlog (c, n, a, h, z, o, t, ts, op) with quantifiers C1, C3, C4, C5. Select := ts <= (time stamp of the logged query) and op <> ‘delete’ and not(C5), output c, n, a, h, z, o, t. Exists := C4.ts <= (time stamp of the logged query) and C3.c = C4.c and C4.ts > C3.ts, output *]

Column legend: c = id, n = name, a = address, h = phone, z = zip, o = contact, t = marketing, ts = ts, op = opr
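The temporal view can be approximated in plain SQL: keep a backlog row if it was written no later than the logged query's time stamp, is not a delete, and no newer row for the same key exists up to that time stamp. The sketch below runs this on SQLite via `sqlite3` with a reduced column list; the rows, the integer time stamps, the operation strings, and the helper `customer_as_of` are assumptions for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up backlog rows: Alice inserted at 1, Jane inserted at 2 and updated at 4.
conn.executescript("""
CREATE TABLE customer_backlog (c INTEGER, n TEXT, a TEXT, ts INTEGER, op TEXT);
INSERT INTO customer_backlog VALUES (2, 'Alice', 'addr-A', 1, 'insert');
INSERT INTO customer_backlog VALUES (1, 'Jane', 'addr-old', 2, 'insert');
INSERT INTO customer_backlog VALUES (1, 'Jane', 'addr-new', 4, 'update');
""")


def customer_as_of(conn, t):
    """Rows of Customer as they stood at time t, rebuilt from the backlog."""
    sql = """
        SELECT C3.c, C3.n, C3.a
        FROM customer_backlog C3
        WHERE C3.ts <= :t
          AND C3.op <> 'delete'
          AND NOT EXISTS (            -- no newer version up to time t
              SELECT * FROM customer_backlog C4
              WHERE C4.ts <= :t AND C4.c = C3.c AND C4.ts > C3.ts
          )
        ORDER BY C3.c
    """
    return conn.execute(sql, {"t": t}).fetchall()


view_at_1 = customer_as_of(conn, 1)
view_at_3 = customer_as_of(conn, 3)
view_at_5 = customer_as_of(conn, 5)
```

A query logged at time 3 would therefore be audited against Jane's original address, even though the base table has since been updated.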

Cost of Building Indices over Backlog Tables

[Figure: index build time in minutes (y-axis, 0 to 14) versus # versions per tuple; series: TS-Composite, TS-Simple]