Auditing Compliance with a Hippocratic
Database
Rakesh Agrawal, Roberto Bayardo, Christos Faloutsos, Jerry Kiernan, Ralf Rantzau, Ramakrishnan Srikant
Intelligent Information Systems Research
IBM Almaden Research Center
Outline
Introduction and motivation
Problem statement
Foundations
System organization and algorithms
Performance
Summary
Motivation
Hippocratic databases advocate policy-directed data management for privacy-sensitive data
– Need reinforced by legislation and regulations:
  Health Insurance Portability & Accountability Act
  Gramm-Leach-Bliley Act – Consumer Privacy Rule
Goal
– Build a system to assist with auditing compliance with the stated policy
  Event driven - privacy complaint
  Periodic - monitor exposure to privacy violation
Audit Scenario
Jane has not been feeling well and decides to consult her doctor
The doctor uncovers that Jane’s blood sugar level is high and suspects diabetes
Sometime later, Jane receives promotional literature from a pharmaceutical company, proposing over-the-counter diabetes tests
Jane complains to the Department of Health and Human Services, saying that she had opted out of the doctor sharing her medical information with pharmaceutical companies for marketing purposes
The doctor must now review disclosures of Jane’s information in order to understand the circumstances of the disclosure, and take appropriate action
Audit Expression
audit T.disease
from Customer C, Treatment T
where C.cid=T.pcid and C.name = 'Jane'
Who has accessed Jane’s disease information?
Problem Statement
Given
– A log of queries executed over a database
– An audit expression specifying sensitive data
Precisely identify
– Those queries that accessed the data specified by the audit expression
“Suspicious” Queries
Customer table
cid  name  address  zip    …
1    Jane  1234 …   95120  …
…
A query Qi has accessed information contained in the Customer table
The audit expression A specifies the data to be audited
If query Qi accesses all the cells specified by the audit expression A for any row, Qi is suspicious
Issues
Convenient language
– Audit expression (essentially SPJ query)
Fast and precise on audits
Non disruptive
– Minimal performance impact on normal database operation
Fine grained
Assumptions
Disclosures stemming from multiple query executions are not considered
No use of outside knowledge to deduce information without detection
Queries considered include
– Joins and aggregation, but not nested subqueries
  Note that existential subqueries can be converted into joins [SIGMOD92]
Informal Definitions
“Candidate” query
– Logged query that accesses all columns specified by the audit expression
“Indispensable” tuple (for a query)
– A tuple whose omission makes a difference to the result of a query
“Suspicious” query
– A candidate query that shares an indispensable tuple with the audit expression
Indispensable Tuple
The SPJ query Q and the audit expression A are of the form:
$$Q = \pi_{C_{OQ}}(\sigma_{P_Q}(T \times R))$$
$$A = \pi_{C_{OA}}(\sigma_{P_A}(T \times S))$$
Definition 1 - A virtual tuple $v \in T$ is indispensable for an SPJ query Q if the result of Q changes when we delete v:
$$\mathrm{ind}(v, Q) \Leftrightarrow \pi_{C_Q}(\sigma_{P_Q}(T \times R)) \neq \pi_{C_Q}(\sigma_{P_Q}((T - \{v\}) \times R))$$
Notation:
– $P_Q$: predicates in Q
– $C_Q$: columns appearing anywhere in Q
– $\pi$: duplicate preserving projection operator
– $T$: tables common to Q and A
– $C_{OQ}$: output columns in Q
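The definition of an indispensable tuple can be tried out on small in-memory relations. The sketch below is illustrative only: the function names `spj` and `indispensable`, the sample rows, and the placeholder addresses are assumptions, and T is simplified to the Customer rows with R the Treatment rows. A `Counter` stands in for the duplicate preserving projection.

```python
from collections import Counter
from itertools import product

def spj(T, R, predicate, columns):
    """Selection over T x R followed by a duplicate-preserving projection (a multiset)."""
    result = Counter()
    for t, r in product(T, R):
        row = {**t, **r}
        if predicate(row):
            result[tuple(row[c] for c in columns)] += 1
    return result

def indispensable(v, T, R, predicate, columns):
    """True if deleting v from T changes the result of the SPJ query."""
    T_without_v = [t for t in T if t != v]
    return spj(T, R, predicate, columns) != spj(T_without_v, R, predicate, columns)

# Hypothetical sample data (addresses are placeholders).
jane = {"cid": 1, "name": "Jane", "address": "addr-1"}
alice = {"cid": 2, "name": "Alice", "address": "addr-2"}
customers = [jane, alice]
treatments = [{"pcid": 1, "disease": "diabetes"}, {"pcid": 2, "disease": "flu"}]

# Q: addresses of people with diabetes. C_Q lists every column Q touches.
def p_q(row):
    return row["cid"] == row["pcid"] and row["disease"] == "diabetes"

c_q = ["name", "address", "cid", "pcid", "disease"]

jane_ind = indispensable(jane, customers, treatments, p_q, c_q)
alice_ind = indispensable(alice, customers, treatments, p_q, c_q)
```

With this data, removing Jane's row changes the result of Q while removing Alice's row does not, so only Jane's tuple is indispensable for Q.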
“Candidate” Query
Definition 6 - Q is a candidate query with respect to A if:
$$C_Q \supseteq C_{OA}$$
Only candidate queries can be suspicious queries
“Suspicious” Query
Definition 5 - Maximal virtual tuple (MVT):
A tuple v is an MVT for queries Q1 and Q2 if it belongs to the cross product of common tables in their from clauses
Definition 7 - Q is suspicious with respect to A if they share an indispensable MVT v:
$$\mathrm{susp}(Q, A) \Leftrightarrow \exists\, v \in T \ \text{s.t.}\ \mathrm{ind}(v, Q) \wedge \mathrm{ind}(v, A)$$
For example,
Query Q: Addresses of people with diabetes
Audit A: Jane’s diagnosis
Jane’s tuple is indispensable for both; hence query Q is “suspicious” with respect to A
System Overview
Architecture (figure):
– Queries, annotated with purpose and recipient, pass through the database layer to the data tables and are recorded in the query log
– Updates, inserts, and deletes go to the data tables; database triggers track updates to base tables in the backlog
– For an audit, the audit expression and the query log go through static analysis, an audit query is generated, and the audit query is run through the database layer
– The result is the IDs of log queries having accessed data specified by the audit query
Query Log
ID  Timestamp  Query     User   Purpose        Recipient
1   2004-02…   Select …  James  Current        Ours
2   2004-02…   Select …  John   Telemarketing  public
Static Analysis
Flow (figure): the query log and the audit expression feed a Filter Queries step, which outputs the candidate queries
Eliminates queries that could not possibly have violated the audit expression
Ensures that $C_Q \supseteq C_{OA}$
Accomplished by examining only the queries themselves (i.e., without running the queries)
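The candidate filter reduces to a column-set containment test. The sketch below assumes the columns each logged query touches have already been extracted from the query string; that extraction step, the dictionary layout, and the name `is_candidate` are hypothetical and not taken from the slides. The two logged queries and the audit expression are the ones used in the worked example of this talk.

```python
def is_candidate(query_columns, audit_columns):
    """Static test C_Q ⊇ C_OA: the query touches every audited column."""
    return set(audit_columns) <= set(query_columns)

# Columns appearing anywhere in each logged query (assumed to be pre-extracted).
query_log_columns = {
    1: {"Customer.name", "Customer.address", "Customer.zip",
        "Customer.cid", "Treatment.pcid", "Treatment.disease"},
    2: {"Customer.name", "Customer.address", "Customer.zip"},
}

# Audit list of "audit T.disease from Customer C, Treatment T where ...".
audit_columns = {"Treatment.disease"}

candidates = [
    qid for qid, cols in sorted(query_log_columns.items())
    if is_candidate(cols, audit_columns)
]
```

Only query 1 survives the filter, because query 2 never reads the disease column; no query has to be executed to decide this.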
Audit Query Generation
Goal
– Build a query which, when run, returns the ids of suspicious queries with respect to an audit expression A
Generating the Audit Query
Query graph (figure): Candidate Query 1, Candidate Query 2, and the audit expression over tables T1 and T2, combined by a Union
Combine individual candidate queries and the audit expression into a single query graph
Combine the audit expression with individual candidate queries to identify suspicious queries
Replace each table with its backlog to restore the version of the table to the time of each query
QGM is a graphical representation of a query
– Boxes represent operators, such as select
– Lines represent input/output relationships between operators
– Boxes with no inputs are tables
Suspicious SPJ Query
The candidate SPJ query Q and the audit expression A are of the form:
$$Q = \pi_{C_{OQ}}(\sigma_{P_Q}(T \times R))$$
$$A = \pi_{C_{OA}}(\sigma_{P_A}(T \times S))$$
Theorem 2 - A candidate SPJ query Q is suspicious with respect to an audit expression A if and only if:
$$\sigma_{P_A}(\sigma_{P_Q}(T \times R \times S)) \neq \emptyset$$
QGM rewrites, shown in previous slide, transform Q and A into:
$$\pi_{``Q_i''}(\sigma_{P_A}(\sigma_{P_Q}(T \times R \times S)))$$
Proof of correctness is based upon Definition 7 (suspicious query) and given in the paper
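The non-emptiness test of Theorem 2 maps directly onto an SQL `EXISTS` check. The following is a minimal sketch using SQLite through Python's `sqlite3` module rather than the system described in the talk; the table contents (including Alice's row and the placeholder addresses) and the helper name `is_suspicious` are assumptions. Here both Customer and Treatment are common to Q and A, so T is their cross product and R and S are empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (cid INTEGER, name TEXT, address TEXT, zip TEXT);
    CREATE TABLE Treatment (pcid INTEGER, disease TEXT);
    -- Hypothetical sample rows.
    INSERT INTO Customer VALUES (1, 'Jane', '1234 ...', '95120');
    INSERT INTO Customer VALUES (2, 'Alice', '...', '95112');
    INSERT INTO Treatment VALUES (1, 'diabetes');
    INSERT INTO Treatment VALUES (2, 'flu');
""")

# P_Q: predicates of the logged query; P_A: predicates of the audit expression.
P_Q = "T.disease = 'diabetes' AND C.cid = T.pcid"
P_A = "C.cid = T.pcid AND C.name = 'Jane'"

def is_suspicious(conn, p_q, p_a):
    """Theorem 2 test: is sigma_{P_A}(sigma_{P_Q}(Customer x Treatment)) non-empty?"""
    sql = (
        "SELECT EXISTS (SELECT 1 FROM Customer C, Treatment T "
        f"WHERE ({p_q}) AND ({p_a}))"
    )
    return bool(conn.execute(sql).fetchone()[0])

diabetes_query_suspicious = is_suspicious(conn, P_Q, P_A)
flu_query_suspicious = is_suspicious(conn, "T.disease = 'flu' AND C.cid = T.pcid", P_A)
```

The diabetes query is flagged because Jane's row satisfies both predicate sets. A query restricted to flu patients is not flagged, since no row satisfying it also satisfies the audit predicates.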
Suspicious Aggregate Query (Including Having)
Solution in the paper
Audit Expression
audit T.disease
from Customer C, Treatment T
where C.cid=T.pcid and C.name = 'Jane'
Who has accessed Jane’s disease information?
Query Log
ID  Query                                                                                       TS  User   Purpose    Recipient
1   select name, address, zip from Customer, Treatment where disease = 'diabetes' and cid=pcid  T3  james  marketing  others
2   select name, address from Customer where zip='95112'                                        T3  john   contact    others
Query 1 was executed at time T3
Backlog Table (Time Stamp)
Name   Address  …  OPR  TS
Jane   1234…    …  I    T2
Jane   1234…    …  U    T4
Alice  …        …  I    T1
Name, Address, …: attributes also in the source table
OPR, TS: attributes only in the backlog table
– OPR: operation on a tuple among Insert, Update and Delete
– TS: timestamp of the operation
Jane’s record was inserted at time T2 and updated at time T4. The backlog table records both versions of her information
C. S. Jensen, L. Mark, and N. Roussopoulos [TKDE 1991]
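A backlog table of this shape can be maintained with row-level triggers. The sketch below uses SQLite through Python's `sqlite3` module as a stand-in for the database used in the talk. The trigger names, the `clock` table that supplies logical timestamps, the helper `run_at`, and the updated address value are all assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (cid INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE Customer_backlog (cid INTEGER, name TEXT, address TEXT,
                                   opr TEXT, ts INTEGER);
    -- Hypothetical logical clock so that timestamps are deterministic.
    CREATE TABLE clock (now INTEGER);
    INSERT INTO clock VALUES (0);

    CREATE TRIGGER customer_ins AFTER INSERT ON Customer BEGIN
        INSERT INTO Customer_backlog
        VALUES (NEW.cid, NEW.name, NEW.address, 'I', (SELECT now FROM clock));
    END;
    CREATE TRIGGER customer_upd AFTER UPDATE ON Customer BEGIN
        INSERT INTO Customer_backlog
        VALUES (NEW.cid, NEW.name, NEW.address, 'U', (SELECT now FROM clock));
    END;
    CREATE TRIGGER customer_del AFTER DELETE ON Customer BEGIN
        INSERT INTO Customer_backlog
        VALUES (OLD.cid, OLD.name, OLD.address, 'D', (SELECT now FROM clock));
    END;
""")

def run_at(t, sql, params=()):
    """Advance the logical clock to t, then run one statement on the source table."""
    conn.execute("UPDATE clock SET now = ?", (t,))
    conn.execute(sql, params)

run_at(1, "INSERT INTO Customer VALUES (2, 'Alice', '...')")
run_at(2, "INSERT INTO Customer VALUES (1, 'Jane', '1234...')")
run_at(4, "UPDATE Customer SET address = ? WHERE name = 'Jane'", ("new address",))

backlog = conn.execute(
    "SELECT name, opr, ts FROM Customer_backlog ORDER BY ts"
).fetchall()
```

After these three statements the backlog holds one row for Alice (inserted at time 1) and two for Jane (inserted at time 2, updated at time 4), matching the table on the slide.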
Merge Logged Queries and Audit Expression
Merge logged queries and audit expression into a single query graph
Query graph (figure):
– Tables: Customer (c, n, …, t), ranged over by C; Treatment (p, r, …, t), ranged over by T
– Logged query box: Select := T.s='diabetes' and T.p=C.c; output C.n, C.a, C.z
– Audit expression box: audit expression := T.p=C.c and C.n='Jane'; output T.s
Transform Query Graph into an Audit Query
Query graph (figure):
– Tables: Customer (c, n, …, t), ranged over by C; Treatment (p, r, ..., t), ranged over by T
– Logged query box: Select := T.s='diabetes' and C.c=T.p; output C.n
– Audit expression box: audit expression := X.n='Jane', with X ranging over the logged query box; output 'Q1'
The view of Customer (Treatment) is a temporal view as of the time the query was executed
The audit expression now ranges over the logged query. If the logged query is suspicious, the audit query will output the id of the logged query
Scenario Outcome
The audit uncovers that Query 1 in the query log accessed Jane’s information
Empirical Evaluation: Goals
Cost of maintaining backlog tables
– Understand the impact of maintaining backlog tables on ongoing database operations
Cost of running audits
– Understand whether audits can run in reasonable time
Experimental Setup
IBM M Pro 6868 Intellistation
– 800 MHz Pentium III processor
– 512 MB of memory
– 16.9 GB disk drive
Windows 2000 Version 5, SP 4
DB2 v7 with default settings
TPC-H database
– Supplier table
  100,000 tuples
System Structures
Indexing
– Eager indexing
  Maintain an index over the backlog table
  Maintained during ongoing database operations
– Lazy indexing
  No index over the backlog table
  Create indices at the time of audit
Choice of index
– Simple index
  Primary key of source table
– Composite index
  Primary key of source table
  Time stamp
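The two index choices differ only in their key columns. The sketch below shows them as SQLite DDL issued from Python; the experiments in the talk used DB2, so this is an illustration and not the original setup. The table name `Supplier_backlog`, the index names, and both helper functions are assumptions. Under eager indexing the index would be created before normal operation begins; under lazy indexing the same call would be made when an audit starts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Supplier_backlog (skey INTEGER, name TEXT, opr TEXT, ts INTEGER)"
)

def create_backlog_index(conn, kind):
    """Create a simple (primary key) or composite (primary key, time stamp) index."""
    if kind == "simple":
        conn.execute("CREATE INDEX backlog_simple ON Supplier_backlog (skey)")
    elif kind == "composite":
        conn.execute("CREATE INDEX backlog_composite ON Supplier_backlog (skey, ts)")
    else:
        raise ValueError(f"unknown index kind: {kind}")

def index_columns(conn, index_name):
    """Return the key columns of an index, in key order."""
    return [row[2] for row in conn.execute(f"PRAGMA index_info({index_name})")]

create_backlog_index(conn, "simple")
create_backlog_index(conn, "composite")

simple_cols = index_columns(conn, "backlog_simple")
composite_cols = index_columns(conn, "backlog_composite")
```

Both indexes are created on the same table here only to show their key columns side by side; a real deployment would pick one.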
Impact on Ongoing Operations
Queries
– Additionally log the query string
  Already performed in many application environments
Updates
– For each updated tuple,
  Insert a tuple to the backlog table
– Inserts and deletes are handled similarly
In a majority of environments, queries are much more frequent than updates
Update Performance
100,000 tuples in Supplier table
Update statement updates all tuples
Each update statement fires triggers which insert an additional 100,000 tuples in backlog
Evaluate impact of multiple versions on performance
Overhead on Updates
[Chart: update time in minutes (axis from 0 to 250) versus number of versions of each tuple in the Supplier backlog table (5, 20, 35, 50), with series Composite, Simple, No Index, and No Triggers]
Simple wins over Composite
7x if all tuples are updated
3x if a single tuple is updated
Eager indexing doesn’t add much cost
Audit Query Performance
Audit query:
select 'Q' from Supplier where skey = k
Experiment:
Evaluate the impact of the number of versions of tuples in the backlog table on performance
Audit Query Execution Time
[Chart: audit query time in msec. (logarithmic axis from 1 to 1000) versus number of versions per tuple (1, 10, 20, 30, 40, 50), with series Simple-I, Simple-C, Composite-I, and Composite-C]
Composite wins over simple if initial version is selected
Simple wins over composite if the current version is selected
Takeaways
The composite index
– Enhances the performance of audits, but
– Additionally burdens updates when using eager indexing
The system supports
– Efficient auditing
– Without substantially burdening normal query processing
Related Work
Oracle Privacy Security Auditing
– Facility for logging queries with timestamp
– Flash-back queries
  Restores the version of the data at the time of the query
– No support for automated auditing
  User manually selects queries from the log and runs them
  The user must decide if the query is suspicious
G. Miklau, D. Suciu [SIGMOD 2004]
– Formal analysis of information disclosure in data exchange
  Is information about a secret query S revealed by views V1,…,Vn?
  Considers all possible instances of a database schema
  Assumes tuple independence
– We’re interested in given instances (temporal versions)
– Nonetheless, it will be interesting to explore the connection between the two works
Active enforcement of policies by limiting disclosure [VLDB’04]
Literature on multi-query optimization
Summary
In light of new privacy legislation
– The problem of auditing usage of information represents an important opportunity for database research
Formalized the problem through the fundamental concepts of indispensable tuple and suspicious queries
Achieved our design goals:
Design Goals
Convenient language
Fast and precise on audits
Non disruptive
– Minimal performance impact on normal database operation
Fine grained
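Multiple Candidate Queries
Query graph (figure):
– Audit expression box over the first candidate query: audit expression := C.n='Jane'; output 'Q1'
– Audit expression box over the second candidate query: audit expression := C.n='Jane'; output 'Q2'
– Union of the two boxes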
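The union construction can be sketched as one SQL statement with a branch per candidate query, each branch emitting the id of its query. The example below is an illustration in SQLite through Python. It assumes a hypothetical audit expression over Customer alone (`audit C.address from Customer C where C.name = 'Jane'`), so that both logged queries from the worked example are candidates. It also runs against current tables and leaves out the temporal views over the backlog. The function `build_audit_query` and the sample rows are likewise assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (cid INTEGER, name TEXT, address TEXT, zip TEXT);
    CREATE TABLE Treatment (pcid INTEGER, disease TEXT);
    -- Hypothetical sample rows.
    INSERT INTO Customer VALUES (1, 'Jane', '1234 ...', '95120');
    INSERT INTO Customer VALUES (2, 'Alice', '...', '95112');
    INSERT INTO Treatment VALUES (1, 'diabetes');
    INSERT INTO Treatment VALUES (2, 'flu');
""")

# Predicate of the assumed audit expression.
AUDIT_PREDICATE = "C.name = 'Jane'"

# Candidate queries: id -> (from clause, predicates of the logged query).
candidate_queries = {
    "Q1": ("Customer C, Treatment T", "T.disease = 'diabetes' AND C.cid = T.pcid"),
    "Q2": ("Customer C", "C.zip = '95112'"),
}

def build_audit_query(candidates, audit_predicate):
    """One branch per candidate; a branch returns its query id if it is suspicious."""
    branches = [
        f"SELECT '{qid}' FROM {tables} WHERE ({pred}) AND ({audit_predicate})"
        for qid, (tables, pred) in candidates.items()
    ]
    return " UNION ".join(branches)

audit_sql = build_audit_query(candidate_queries, AUDIT_PREDICATE)
suspicious_ids = [row[0] for row in conn.execute(audit_sql)]
```

Only 'Q1' is returned: Jane has diabetes, so her row is selected by Q1, while her zip code is not 95112, so Q2 never touched her row.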
Aggregate Queries with Having
Query graph (figure), with box labels as they appear on the slide:
– Logged query boxes Qs, Qg, and Qh: select := …; group := c1, …, ci with output c1, …, ci, agg1, …, aggn; output c1, …, ci
– Two audit expression boxes: audit expression := …; output c1, …, ck
– Top box: select := q1.c1=q2.c1 and … and q1.ci=q2.ci; output 'Q1'
The join on aggregate columns ensures that the group being tracked by the audit has not been eliminated by the having clause
Dynamic Temporal Views
View of the Customer table at the time stamp of the logged query, computed from Customer_backlog (figure, with quantifiers C1, C3, C4, and C5):
– Customer_backlog columns: c, n, a, h, z, o, t, ts, op
– Select := ts <= (time stamp of the logged query) and op <> 'delete' and not(C5); output c, n, a, h, z, o, t
– Exists := C4.ts <= (time stamp of the logged query) and C3.c = C4.c and C4.ts > C3.ts; output *
Column abbreviations:
c = id
n = name
a = address
h = phone
z = zip
o = contact
t = marketing
ts = ts
op = opr
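The temporal view can be written as a single SQL query over the backlog: keep the latest version of each tuple at or before the time stamp of the logged query, unless that version is a delete. The sketch below uses SQLite through Python; the reduced column set, the integer time stamps, the sample rows, and the function name `customer_as_of` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer_backlog (cid INTEGER, name TEXT, address TEXT,
                                   ts INTEGER, opr TEXT);
    -- Hypothetical history: Alice inserted at 1 and deleted at 5,
    -- Jane inserted at 2 and updated at 4.
    INSERT INTO Customer_backlog VALUES (2, 'Alice', 'addr-a', 1, 'insert');
    INSERT INTO Customer_backlog VALUES (1, 'Jane', 'old address', 2, 'insert');
    INSERT INTO Customer_backlog VALUES (1, 'Jane', 'new address', 4, 'update');
    INSERT INTO Customer_backlog VALUES (2, 'Alice', 'addr-a', 5, 'delete');
""")

def customer_as_of(conn, t):
    """Rows of Customer as they stood at time t, reconstructed from the backlog."""
    return conn.execute(
        """
        SELECT C3.cid, C3.name, C3.address
        FROM Customer_backlog C3
        WHERE C3.ts <= :t
          AND C3.opr <> 'delete'
          AND NOT EXISTS (
              -- A later version of the same tuple that is still no later than t.
              SELECT * FROM Customer_backlog C4
              WHERE C4.ts <= :t AND C4.cid = C3.cid AND C4.ts > C3.ts)
        ORDER BY C3.cid
        """,
        {"t": t},
    ).fetchall()

view_at_3 = customer_as_of(conn, 3)
view_at_4 = customer_as_of(conn, 4)
view_at_5 = customer_as_of(conn, 5)
```

A query logged at time 3 sees Jane's original address, one logged at time 4 sees the updated address, and one logged at time 5 no longer sees Alice at all.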