Searching and Integrating Information on the Web
Seminar 4: Ranking Queries and Data Privacy
Professor Chen Li
UC Irvine
Seminar 3 2
Outline and readings
• Ranking queries
  – Fagin, R., Combining Fuzzy Information from Multiple Systems, PODS 1996.
  – Fagin et al., Optimal Aggregation Algorithms for Middleware, PODS 2001.
• Data privacy
  – Database-as-service: Executing SQL over Encrypted Data in the Database-Service-Provider Model. Hakan Hacigumus, Bala Iyer, Chen Li, and Sharad Mehrotra. SIGMOD 2002.
  – XML data publishing: Secure XML Publishing without Information Leakage in the Presence of Data Inference. Xiaochun Yang and Chen Li. To appear in VLDB'04.
Top-k queries
1. Finding multi-attribute tuples with the top-k highest scores.
2. Scoring function: aggregates the scores on the attributes, e.g., w1*A1 + … + wn*An, where wi is the weight for attribute Ai.
3. Monotone aggregation functions: if tuple A has a grade at least as high as tuple B on each attribute, then A’s overall grade is at least as high as B’s.

Car ID  Mileage  Year  Price
1       10000    1997  20000
2       20000    2000  11000
3       17000    1998  12000
4       15000    1990  8000
5       5000     1990  12000
6       15000    1990  5000
7       12000    1985  5000
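As a rough illustration of a weighted-sum scoring function over the car table, the sketch below grades every car and keeps the k best by brute force. The slides do not say how raw attribute values become grades, so the min-max normalization, the "lower mileage and price are better, newer year is better" convention, and the names `normalize`, `grade_table`, `weighted_sum`, and `top_k` are all assumptions made for this example.

```python
import heapq

# Car table from the slide: car id -> (mileage, year, price).
CARS = {
    1: (10000, 1997, 20000),
    2: (20000, 2000, 11000),
    3: (17000, 1998, 12000),
    4: (15000, 1990, 8000),
    5: (5000, 1990, 12000),
    6: (15000, 1990, 5000),
    7: (12000, 1985, 5000),
}


def normalize(values, higher_is_better):
    """Min-max scale raw values into grades in [0, 1] (assumed convention)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if higher_is_better:
        return [(v - lo) / span for v in values]
    return [(hi - v) / span for v in values]


def grade_table(cars):
    """Turn raw attributes into per-attribute grades, where 1.0 is best."""
    ids = list(cars)
    mileage = normalize([cars[i][0] for i in ids], higher_is_better=False)
    year = normalize([cars[i][1] for i in ids], higher_is_better=True)
    price = normalize([cars[i][2] for i in ids], higher_is_better=False)
    return {i: (mileage[n], year[n], price[n]) for n, i in enumerate(ids)}


def weighted_sum(grades, weights):
    """Scoring function w1*A1 + ... + wn*An; monotone for non-negative weights."""
    return sum(w * g for w, g in zip(weights, grades))


def top_k(cars, weights, k):
    """Brute-force top-k: grade every tuple and keep the k best ids."""
    grades = grade_table(cars)
    return heapq.nlargest(k, grades, key=lambda i: weighted_sum(grades[i], weights))
```

Calling `top_k(CARS, (1/3, 1/3, 1/3), 2)` ranks the cars with equal weights. This brute-force version touches every tuple, which is exactly the cost that FA and TA try to avoid.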
Modes of Data Access (Fagin)
The underlying middleware (e.g., search engines, Garlic, QBIC) supports 2 modes:
1. Sorted access:
   – Attribute Ai (a column) forms a list Li sorted by the score of Ai.
   – The list is output one object at a time.
2. Random access:
   – Ask the system for the grade of any given object.
Goal: minimize the total cost to get the top-k results.
[Figure: sorted lists for price, mileage, and year; the visible list prefixes are "ace...", "bef...", and "ade..."]
FA: Fagin’s algorithm [PODS96]
1. Do sorted access in parallel to each of the m sorted lists Li. Wait until there is a set H of at least k objects such that each of these objects has been seen in each of the m lists.
2. For each object R that has been seen, do random access as needed to each of the lists Li to find the i-th field xi of R.
3. Compute the aggregate grade of each seen object and return the k objects with the highest grades.
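The three steps of FA can be sketched in a few lines. This is a minimal in-memory illustration, not the authors' implementation: the grades in `LISTS` are hypothetical integers chosen so the list prefixes read "ace", "bef", and "ade", random access is simulated with a per-list dictionary, and every list is assumed to contain every object.

```python
# Hypothetical grades (0-100); each list is sorted by grade, best first.
LISTS = [
    [("a", 90), ("c", 80), ("e", 70), ("b", 40), ("d", 30), ("f", 20)],
    [("b", 90), ("e", 80), ("f", 75), ("a", 70), ("c", 20), ("d", 10)],
    [("a", 90), ("d", 80), ("e", 70), ("c", 30), ("b", 20), ("f", 10)],
]


def fagin_fa(lists, k, agg=sum):
    """Return (top-k as (grade, object) pairs, sorted-access depth reached)."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    seen_in = {}      # object -> set of list indexes where it was seen
    fully_seen = 0    # number of objects seen in all m lists
    depth = 0

    # Step 1: parallel sorted access until k objects were seen in every list.
    while fully_seen < k and depth < len(lists[0]):
        for i, lst in enumerate(lists):
            obj, _grade = lst[depth]
            where = seen_in.setdefault(obj, set())
            where.add(i)
            if len(where) == m:
                fully_seen += 1
        depth += 1

    # Steps 2 and 3: random access for every seen object, then aggregate.
    scored = [(agg(lookup[i][obj] for i in range(m)), obj) for obj in seen_in]
    scored.sort(reverse=True)
    return scored[:k], depth
```

With this data, `fagin_fa(LISTS, 1)` stops only once 'e' has been seen in all three lists, yet the returned object is 'a', which matches the observation that the first fully seen tuple need not be the winner.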
Example:
1. Suppose k = 1. Given the three partial lists retrieved so far, ‘e’ appears in all of them. We can say that the top-1 tuple must be in {a, b, c, d, e, f}.
2. Reason: since the function is monotone, tuple ‘e’ “blocks” all tuples below the cut-off line, since they cannot have a higher overall grade than ‘e’.
3. The algorithm does random access for the other 5 tuples to get their missing grades, and picks the top 1.
4. Notice that we cannot say ‘e’ must be the top 1, since other tuples (e.g., ‘a’) may still have a higher overall score.
5. Minor point: one possible improvement is to skip ‘f’, which can never be better than ‘e’.
[Figure: the price, mileage, and year lists with a cut-off line drawn below the prefixes "ace...", "bef...", and "ade..."]
General case
1. Once k tuples have appeared in all the partial lists, halt.
2. Reason: these k tuples block all the tuples below, which cannot be better than these k tuples.
3. Do random access for the retrieved tuples to get their overall grades, and find the top-k.
[Figure: the price, mileage, and year lists with a cut-off line below the point where k tuples have been seen in every list]
FA’s Properties
1. Correctly finds the top-k results for monotone aggregation functions.
2. Cost for a database with N objects and m lists: O(N^((m-1)/m) * k^(1/m)), with arbitrarily high probability.
FA’s Drawbacks
• The number of sorted accesses is still large.
• Since all seen tuples must be buffered, the required buffer size is unbounded.
• Does not exploit the bound given by the aggregation function to determine when to stop sorted access.
TA: Threshold Algorithm [PODS2001]
1. Do sorted access in parallel to each of the m sorted lists. As an object R is seen under sorted access in some list, do random access to the other lists to find the grade xi of R in each of them. Then compute the aggregate grade t(R) of R. If it is one of the k highest grades seen so far, keep R in the buffer; otherwise discard it.
2. For each list Li, let xi be the grade of the last object seen under sorted access. Define the threshold value T to be t(x1, …, xm). As soon as at least k objects have been seen whose grade is at least equal to T, halt.
3. Return the k objects that have been seen with the highest grades.
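The steps of TA can be sketched as follows. As with any small illustration, several things here are assumptions: the integer grades in `LISTS` are hypothetical, random access is simulated with dictionaries, every list is assumed to contain every object, and the halting test is applied once per round of parallel sorted access.

```python
import heapq

# Hypothetical grades (0-100); each list is sorted by grade, best first.
LISTS = [
    [("a", 90), ("c", 80), ("e", 70), ("b", 40), ("d", 30), ("f", 20)],
    [("b", 90), ("e", 80), ("f", 75), ("a", 70), ("c", 20), ("d", 10)],
    [("a", 90), ("d", 80), ("e", 70), ("c", 30), ("b", 20), ("f", 10)],
]


def threshold_algorithm(lists, k, agg=sum):
    """Return (top-k as (grade, object) pairs, sorted-access depth reached)."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    buffer = []  # min-heap of (grade, object), never more than k entries

    for depth in range(len(lists[0])):
        last_grades = []
        for lst in lists:
            obj, grade_here = lst[depth]
            last_grades.append(grade_here)
            if any(o == obj for _, o in buffer):
                continue  # already one of the current top-k
            # Random access to all lists, then aggregate.
            grade = agg(lookup[i][obj] for i in range(m))
            if len(buffer) < k:
                heapq.heappush(buffer, (grade, obj))
            elif grade > buffer[0][0]:
                heapq.heapreplace(buffer, (grade, obj))

        # Threshold T = t(x1, ..., xm) over the last grades seen in each list.
        threshold = agg(last_grades)
        if len(buffer) == k and buffer[0][0] >= threshold:
            return sorted(buffer, reverse=True), depth + 1

    return sorted(buffer, reverse=True), len(lists[0])
```

On this data `threshold_algorithm(LISTS, 1)` halts after two rounds of sorted access, because the grade of 'a' already reaches the threshold, and the buffer never holds more than k entries.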
Example:
1. A buffer keeps the top-k tuples that have been found so far.
2. For any tuple in a sorted list, do a random access to get its overall grade. Compare it with the tuples in the buffer, and decide whether to insert it or discard it.
3. The threshold window (covering the last record seen in each of the m lists) represents the best grade any unseen tuple could have, assuming we could combine the best values from different tuples.
4. Notice that this window may not be “horizontal” if we access different lists at different speeds.
5. This window helps us decide when to stop: once we find k tuples whose grades are at least equal to that of the window tuple, we halt.
[Figure: the price, mileage, and year lists with a threshold window across them and a separate buffer for the top-k tuples]
TA’s Properties
1. TA is instance optimal for all monotone functions and over every database.
2. Compared to FA, TA requires only a small, constant-size buffer.
3. TA allows early stopping.
   – Can show TA never stops later than FA. (Why?)
4. There are times when the user is satisfied with an approximate top-k list. TA can be modified to give such an approximation.
5. TA can be modified for the case where random access is impossible.
Instance Optimality
1. Algorithm b is instance optimal over an algorithm set A and a database instance set D if b is in A, and for every algorithm a in A and every instance d in D, we have: cost(b, d) = O(cost(a, d)).
2. Similar to “competitive ratio”.
3. Essentially: b is the best algorithm in A, up to a constant factor, on every instance.
4. Stronger than worst-case optimality.
5. TA is instance optimal over all “correct algorithms” (nondeterministic algorithms).
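Spelled out, the O(·) in the definition hides two constants that must work for every algorithm and every instance at once. The symbols c and c' below are labels introduced here for illustration; the constant c is what Fagin et al. call the optimality ratio.

```latex
\[
\exists\, c, c' > 0 \;\; \forall a \in \mathbf{A} \;\; \forall d \in \mathbf{D} : \quad
\mathrm{cost}(b, d) \;\le\; c \cdot \mathrm{cost}(a, d) + c'
\]
```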
Variations of TA
• NRA: when no random access is possible
  – Example: Web search engines, which typically do not allow you to enter a URL and get its ranking
• TAZ: when no sorted access is possible for some predicates
  – Example: find good restaurants near location x (sorted and random access for restaurant ratings, random access only for distances from a mapping site)
• CA: when the relative costs of random and sorted accesses matter
• TAθ: when only approximate answers are needed
  – Example: Web search, with lots of good-quality answers
Motivation
• Privacy in publishing XML data
• Applications:
  – Web publishing
  – Data sharing and exchange, e.g., in P2P systems
Example: Hospital XML data
[Figure: XML tree rooted at hospital. Physician nodes (phname Smith, Walker) have treat children naming the patients Alice, Betty, and Cathy. Patient nodes have pname, disease, and ward children: Alice, Betty, and Cathy each have disease leukemia and ward W305; a fourth patient, Tom, has disease cancer and ward W403.]
Goal: hide Alice’s disease.
Common knowledge: patients in the same ward have the same disease.
Problem
Given:
• An XML document to be published
• Sensitive data in the document
• Common knowledge that public users can use to do data inference
Find:
• A partial document to be released so that users cannot infer the sensitive data
Research challenges
• How to model data inference using common knowledge?
• How to compute all possible inferred data?
• How to compute a partial document to be published without leaking sensitive information?
Roadmap
• Information leakage
  – Defining sensitive data
  – Describing common knowledge
  – Computing inferred documents
• Preventing information leakage
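Defining sensitive data
• Using an XQuery, called a “regulating query”
• A special node marked “*” indicates the sensitive data
[Figure: two regulating queries. A1: a patient whose pname is Alice, with the disease node marked “*”. A2: a hospital pattern with a pname node marked “*” and the value Cathy.]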
Example 1
• Map the query to the XML tree.
• For each mapping, the target of the * node is sensitive.
[Figure: regulating query A1 (patient with pname Alice, disease marked “*”) mapped onto the hospital tree with patients Alice, Betty, and Cathy, each with disease leukemia and ward W305.]
Example 2
[Figure: regulating query A2 (pname node marked “*”, value Cathy) mapped onto the same hospital tree.]
Common Knowledge
• Represented as XML constraints
• Could be obtained in various ways, e.g.,
  – a possible schema
  – analysis of the published data
Common Constraints
• Child constraints: //p → //p/c
  – Example: //patient → //patient/pname (every patient has a pname child)
• Descendant constraints: //p → //p//d
  – Example: //patient → //patient//disease
• Functional dependencies: //p/a → //p/b
  – Example: //patient/ward → //patient/disease. For two patients with wards w1, w2 and diseases d1, d2: if w1 = w2, then d1 = d2 (value equal).
Modify partial document using constraints
C1: //patient → //patient/pname
C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease
[Figure: partial document P, a hospital tree with two patient nodes. One patient has disease leukemia and ward W305; the other has ward W305 but no disease. A pname child appears under one of the patients.]
Apply C1 on document P
C1: //patient → //patient/pname
[Figure: C1(P), which is P with a pname child added to the patient that lacked one.]
Apply C2 on document P
C2: //patient → //patient//disease
• Floating branch: exact location unknown
[Figure: C2(P), which is P with a disease node attached by a floating branch to the patient that has no disease.]
Apply C3 on document P
C3: //patient/ward → //patient/disease
[Figure: C3(P), which is P with a disease child of value leukemia added to the second patient, because both patients are in ward W305.]
Apply a sequence of constraints: <C2,C3>
C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease
[Figure: result of <C2,C3> on P. The second patient has both a floating disease branch (from C2) and a disease child with value leukemia (from C3).]
Another user applies a different sequence of constraints: <C3,C2>
C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease
• After applying C3, we cannot use C2 to expand the tree.
• No more floating branch!
[Figure: result of <C3,C2> on P. The second patient has a single disease child with value leukemia.]
They look different!
• P1 is “m-contained” in P2:
  – There is a mapping from P1 to P2.
  – A floating branch can be mapped to a path.
  – The m-containing document P2 has more information.
• P2 is also “m-contained” in P1.
• Thus they are “m-equivalent”!
[Figure: P1, the result of <C2,C3>, and P2, the result of <C3,C2>, shown side by side.]
What documents can users infer?
• Different users can use different sequences of constraints to do inference.
• Thus they can infer different documents.
• Questions:
  – Can an inference process terminate?
  – What inferred document should we consider to prevent leakage of sensitive data?
Theorem
• Given a partial document P of an XML document D and a set of constraints C = {C1, …, Ck}, there is a document M that can be inferred from P using a sequence of constraints, such that:
  – for any sequence of constraints, its resulting document is m-contained in M.
• M can be computed using a greedy approach.
• Such a document is unique under m-equivalence.
Information leakage
• For a partial document P, if there exists a regulating query A such that the maximal inferred document M can produce a non-empty answer to A, then we say “P causes information leakage.”
[Figure: partial document P is expanded by inference, and regulating query A is evaluated against the result.]
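The leakage test can be tried on a heavily simplified model of the running example. In the sketch below each patient node is a flat dictionary rather than an XML subtree, the only constraint is the functional dependency //patient/ward → //patient/disease, and the regulating query asks for the disease of the patient named Alice. The function names and the dictionary encoding are assumptions made for illustration; the paper works on XML trees with m-containment.

```python
# Simplified stand-in for the XML document: one dict per patient node.
D = [
    {"pname": "Alice", "disease": "leukemia", "ward": "W305"},
    {"pname": "Betty", "disease": "leukemia", "ward": "W305"},
    {"pname": "Cathy", "disease": "leukemia", "ward": "W305"},
]


def without(doc, index, field):
    """Return a copy of doc with one field of one patient removed."""
    return [
        {k: v for k, v in p.items() if not (n == index and k == field)}
        for n, p in enumerate(doc)
    ]


def maximal_inferred(partial):
    """Apply //patient/ward -> //patient/disease until nothing changes."""
    doc = [dict(p) for p in partial]
    changed = True
    while changed:
        changed = False
        known = {p["ward"]: p["disease"]
                 for p in doc if "ward" in p and "disease" in p}
        for p in doc:
            if "disease" not in p and p.get("ward") in known:
                p["disease"] = known[p["ward"]]
                changed = True
    return doc


def regulating_query(doc, pname="Alice"):
    """Answer of the regulating query: diseases of the named patient."""
    return [p["disease"] for p in doc
            if p.get("pname") == pname and "disease" in p]


def causes_leakage(partial):
    """P leaks if the query has a non-empty answer on the inferred document."""
    return bool(regulating_query(maximal_inferred(partial)))
```

For example, `causes_leakage(without(D, 0, "disease"))` is true because Alice's disease can be inferred back from her ward, while additionally removing her ward with `without(..., 0, "ward")` yields a partial document that this simplified test accepts.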
Formal Problem
• Given an XML document D, a regulating query A, and common knowledge represented as constraints C1, …, Ck:
  – How to find a partial document P without information leakage?
  – Such a P is called a valid partial document.
• The empty document is a trivial one.
• We want the published document to have as much data as possible.
An algorithm
• We develop an algorithm for solving this problem
• We use the running example to illustrate the algorithm
Example
Regulating query A: a patient whose pname is Alice, with the disease node marked “*”.
Functional dependency: //patient/ward → //patient/disease
[Figure: document D, the hospital tree with patients Alice, Betty, and Cathy, each with disease leukemia and ward W305, shown next to regulating query A.]
Remove sensitive data A(D)
Remaining document: D - A(D)
[Figure: the hospital tree with Alice’s leukemia node, the answer A(D), detached from the tree.]
Compute the maximal inferred document M of D-A(D)
Maximal inferred document: M
[Figure: M, in which Alice’s disease leukemia is inferred back from the other patients in ward W305.]
Testing Information Leakage
There is a mapping from A to the maximal inferred document M of P, so information is leaked.
[Figure: regulating query A mapped onto the inferred hospital tree, with the “*” node landing on Alice’s inferred leukemia node.]
Computing a valid partial document
• How to break the mappings?
• How to chase back the inference steps?
[Figure: starting from D - A(D), each round of inference may create a mapping from regulating query A to sensitive data S; the mapping is broken and the inference steps are chased back, repeatedly.]
AND/OR Graphs
• A structure representing how a goal can be reached by solving subproblems.
• We use such graphs to formulate the process of finding a valid partial document.
• Consider mapping images of the leaf nodes in A.
• An “OR” connector shows that solving any of the subproblems can solve the parent problem.
[Figure: the hospital tree and regulating query A, with an AND/OR graph whose START node has an OR connector to the nodes leukemia(1) and Alice(1).]
• Multiple ways to infer the sensitive data.
• An “AND” connector shows that solving ALL the subproblems can solve the parent problem.
[Figure: the AND/OR graph expanded below leukemia(1): an AND connector leads to OR connectors over nodes such as leukemia(2), leukemia(3), W305(1), and W305(3).]
• Continue expanding the AND/OR graph.
[Figure: the same AND/OR graph with further AND and OR connectors added below the newly introduced nodes.]
AND/OR Graphs (cont)
• A special START node represents the goal of computing a valid partial document.
• The graph has nodes corresponding to nodes in the maximal inferred document M.
• Such a node represents the subproblem of hiding its corresponding node n in M:
  – This node n should be removed from M.
  – It cannot be inferred using the constraints and other nodes in M.
Solution graphs
• A connected subgraph (of M) including the START node
• For each node in the subgraph, its successor connectors are also in the subgraph.
• If it contains an OR connector, it must also contain one of the connector's successors.
• If it contains an AND connector, it must also contain all the successors of the connector.
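The OR and AND rules above can be turned into a small enumeration of solution graphs. The graph below is a hypothetical, trimmed version of the running example (the exact connectors in the slides' figure are not recoverable), encoded as a dictionary from a node to its connector; a connector is a tuple `("OR", [...])` or `("AND", [...])` whose members are node names or nested connectors. The sketch assumes the graph is acyclic and returns, for each solution graph, the set of document nodes it contains.

```python
from itertools import product

# Hypothetical AND/OR graph: hide Alice's disease by removing her name,
# or by removing leukemia(1) and blocking its inference from both other patients.
GRAPH = {
    "START": ("OR", ["Alice(1)", "leukemia(1)"]),
    "leukemia(1)": ("AND", [
        ("OR", ["leukemia(2)", "W305(1)"]),
        ("OR", ["leukemia(3)", "W305(1)"]),
    ]),
}


def solutions(graph, item):
    """Return every set of document nodes that solves `item`."""
    if isinstance(item, tuple):
        kind, parts = item
        options = [solutions(graph, part) for part in parts]
        if kind == "OR":
            # Any one successor is enough.
            return [s for option in options for s in option]
        # AND: pick one solution per successor and merge them.
        return [frozenset().union(*combo) for combo in product(*options)]

    # A document node: it is removed, and its own connector (if any) must be solved.
    connector = graph.get(item)
    if connector is None:
        return [frozenset([item])]
    return [frozenset([item]) | s for s in solutions(graph, connector)]


ALL_SOLUTIONS = solutions(GRAPH, GRAPH["START"])
SMALLEST = min(ALL_SOLUTIONS, key=len)
```

Each set in `ALL_SOLUTIONS` lists the nodes to remove from M. Picking the smallest set is only one possible way to keep as much data as possible; the slides leave the choice of search strategy open.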
Computing a valid partial document using a solution graph
• For a solution graph G, for each node in G, we remove the corresponding node in M to get a valid partial document.
[Figure: two solution graphs for the running example, shown next to the hospital tree. One contains START, an OR connector, and Alice(1). The other contains START, leukemia(1), an AND connector, OR connectors, and W305(1).]
Constructing an AND/OR Graph
• We give an algorithm for computing an AND/OR graph.
• It considers the inference steps of the different constraints.
• Many algorithms have been proposed for finding a solution graph, and they are applicable here.
• No need to construct the entire AND/OR graph: search for a solution graph “on the fly.”
Related work
Different scenarios of database security based on trust domains:
A. Single-user DBMS
B. C/S access control
C. Database as a service
D. Data publishing (our work)
[Figure: for each scenario, a trust boundary drawn around some of the three components: data, query, and execution.]
Summary of 2nd paper
• Formulated the problem of publishing an XML document without information leakage due to data inference
• Showed the effect of constraints on inference
• Gave an algorithm for finding a valid partial document of a given document