Searching and Integrating Information on the Web

Seminar 4: Ranking Queries and Data Privacy

Professor Chen Li

UC Irvine

Seminar 3 2

Outline and readings

• Ranking Queries
  – Fagin, R., Combining Fuzzy Information from Multiple Systems, PODS 1996.
  – Fagin et al., Optimal Aggregation Algorithms for Middleware, PODS 2001.

• Data privacy:
  – Database-as-service: Executing SQL over Encrypted Data in the Database-Service-Provider Model. Hakan Hacigumus, Bala Iyer, Chen Li, and Sharad Mehrotra. SIGMOD 2002.
  – XML data publishing: Secure XML Publishing without Information Leakage in the Presence of Data Inference. Xiaochun Yang and Chen Li. To appear in VLDB'04.

Outline

• Ranking Queries

• Data privacy:
  – XML Data publishing
  – Database-as-service

Top-k queries

1. Finding multi-attribute tuples with the top-k highest scores.

2. Scoring function: aggregates the scores on attributes, e.g., w1*A1 + … + wn*An, where wi is the weight for attribute Ai.

3. Monotone aggregation functions: if tuple A has a grade at least as high as tuple B on each attribute, then A’s overall grade is at least as high as B’s.

Car ID   Mileage   Year   Price
1        10000     1997   20000
2        20000     2000   11000
3        17000     1998   12000
4        15000     1990    8000
5         5000     1990   12000
6        15000     1990    5000
7        12000     1985    5000
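As a baseline for the algorithms that follow, a top-k query over the car table can be answered naively by scoring every tuple. The sketch below assumes a particular way of turning raw attributes into grades in [0, 1] (lower mileage, newer year, and lower price are better); that scaling, the weights, and the function names are illustrative assumptions, not part of the slides.

```python
import heapq

# Car table from the slide: (car_id, mileage, year, price).
CARS = [
    (1, 10000, 1997, 20000),
    (2, 20000, 2000, 11000),
    (3, 17000, 1998, 12000),
    (4, 15000, 1990, 8000),
    (5, 5000, 1990, 12000),
    (6, 15000, 1990, 5000),
    (7, 12000, 1985, 5000),
]


def grades(car):
    """Map raw attributes to grades in [0, 1]; higher is better (assumed scaling)."""
    _, mileage, year, price = car
    return (
        1 - mileage / 20000,  # lower mileage is better
        (year - 1985) / 15,   # newer is better
        1 - price / 20000,    # cheaper is better
    )


def score(car, weights):
    """Weighted sum w1*A1 + ... + wn*An; monotone when all weights are >= 0."""
    return sum(w * g for w, g in zip(weights, grades(car)))


def top_k(cars, weights, k):
    """Naive top-k: score every tuple, then keep the k best."""
    return heapq.nlargest(k, cars, key=lambda car: score(car, weights))
```

This scans all N tuples, which is exactly the cost that the sorted-access and random-access algorithms try to avoid.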


Applications

• Multimedia databases
• Web search queries:
  – Restaurants
  – Houses
  – Cars
  – …

Modes of Data Access (Fagin)

The underlying middleware (e.g., search engines, Garlic, QBIC) supports two modes of access:

1. Sorted access: each attribute Ai (column) forms a list Li sorted by the score of Ai, highest first. The list is output one object at a time.

2. Random access: ask the system for the grade of any given object.

Goal: minimize the total cost of getting the top-k results.

[Figure: sorted lists (labels: mileage, year) whose first entries are a, c, e, ...; b, e, f, ...; and a, d, e, ....]

FA: Fagin’s algorithm [PODS96]

1. Do sorted access in parallel to each of the m sorted lists Li. Wait until there is a set H of at least k objects such that each of these objects has been seen in each of the m lists.

2. For each object R that has been seen, do random access as needed to each of the lists Li to find the i-th field xi of R.

3. Compute the aggregate grade of every seen object and output the k objects with the highest grades.
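The three steps of FA can be sketched as follows. This is a minimal illustration, not the paper's code: each list is assumed to be an in-memory dict from object to grade (so a dict lookup stands in for random access), every object is assumed to appear in every list, and the grades in `LISTS` are made up so that the list prefixes match the figure (a, c, e; b, e, f; a, d, e).

```python
# Hypothetical grades (higher is better) for objects a..f in m = 3 lists.
LISTS = [
    {"a": 9, "c": 8, "e": 7, "b": 3, "d": 2, "f": 1},
    {"b": 9, "e": 8, "f": 7, "a": 6, "c": 2, "d": 1},
    {"a": 9, "d": 8, "e": 7, "b": 4, "c": 3, "f": 2},
]


def fagin_algorithm(lists, k, agg=sum):
    """Return (top-k list of (grade, object), sorted-access depth reached)."""
    m = len(lists)
    # Sorted access order of each list: highest grade first.
    ordered = [sorted(lst, key=lst.get, reverse=True) for lst in lists]
    seen_in = {}  # object -> indices of the lists in which it has been seen
    depth = 0
    # Step 1: parallel sorted access until k objects are seen in all m lists.
    while depth < len(ordered[0]):
        for i, order in enumerate(ordered):
            seen_in.setdefault(order[depth], set()).add(i)
        depth += 1
        if sum(len(s) == m for s in seen_in.values()) >= k:
            break
    # Step 2: random access supplies the missing grades of every seen object.
    # Step 3: aggregate and keep the k best.
    graded = [(agg(lst[obj] for lst in lists), obj) for obj in seen_in]
    graded.sort(key=lambda pair: (-pair[0], pair[1]))
    return graded[:k], depth
```

With these grades and k = 1, sorted access stops at depth 3 when ‘e’ has been seen in all lists, yet the answer is ‘a’, which was seen in only two lists before random access filled in its third grade.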


Example:

1. Suppose k = 1. Given the three partial lists retrieved so far, ‘e’ appears in all of them. We can say that the top-1 tuple must be in {a, b, c, d, e, f}.

2. Reason: since the function is monotone, tuple ‘e’ “blocks” all tuples below the cut-off line, since they cannot have a higher overall grade than ‘e’.

3. The algorithm does random access for the other 5 tuples to get their missing grades (all grades of ‘e’ are already known from sorted access), and picks the top 1.

4. Notice that we cannot say ‘e’ must be the top 1, since other tuples (e.g., ‘a’) may still have a higher overall score.

5. Minor point: one possible improvement is that ‘f’ can never be better than ‘e’, since it ranks below ‘e’ in every list.

[Figure: sorted lists (labels: mileage, year) with prefixes a, c, e, ...; b, e, f, ...; and a, d, e, ...; a cut-off line is drawn below the retrieved prefixes.]

General case

1. Once k tuples have appeared in all the partial lists, halt.

2. Reason: these k tuples block all the tuples below, which cannot be better than these k tuples.

3. Do random access for the retrieved tuples to get their overall grades, and find the top-k.

[Figure: sorted lists (labels: mileage, year) with a cut-off line.]

FA’s Properties

1. Can correctly find top-k results for monotone aggregation functions

2. Cost on a database with N objects and m sorted lists: O(N^((m-1)/m) * k^(1/m)) with arbitrarily high probability, assuming the orderings in the sorted lists are independent.

FA’s Drawbacks

• The number of sorted accesses is still large.
• Since all seen tuples should be buffered, the required buffer size is unbounded.
• Does not exploit the bound given by the aggregation function to determine when to stop sorted access.

TA: Threshold Algorithm [PODS2001]

1. Do sorted access in parallel to each of the m sorted lists. As an object R is seen under sorted access in some list, do random access to the other lists to find the grade xi of R in every list Li. Then compute the aggregate grade t(x1, …, xm) of R. If this grade is one of the k highest seen so far, remember R and its grade; otherwise discard it.

2. For each list Li, let xi be the grade of the last object seen under sorted access. Define the threshold value T to be t(x1, …, xm). As soon as at least k objects have been seen whose grade is at least equal to T, halt.

3. Return the k objects that have been seen with the highest grades.
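The steps of TA can be sketched as below. As an illustration only, each list is assumed to be an in-memory dict from object to grade (a dict lookup stands in for random access), every object is assumed to appear in every list, the halting test is applied once per round of parallel sorted access, and the grades in `LISTS` are hypothetical.

```python
import heapq

# Hypothetical grades (higher is better) for objects a..f in m = 3 lists.
LISTS = [
    {"a": 9, "c": 8, "e": 7, "b": 3, "d": 2, "f": 1},
    {"b": 9, "e": 8, "f": 7, "a": 6, "c": 2, "d": 1},
    {"a": 9, "d": 8, "e": 7, "b": 4, "c": 3, "f": 2},
]


def threshold_algorithm(lists, k, agg=sum):
    """Return (top-k list of (grade, object), rounds of sorted access used)."""
    # Sorted access order of each list: highest grade first.
    ordered = [sorted(lst, key=lst.get, reverse=True) for lst in lists]
    buffer = []      # min-heap of (grade, object), at most k entries
    members = set()  # objects currently in the buffer
    rounds = 0
    for depth in range(len(ordered[0])):
        rounds = depth + 1
        for order in ordered:
            obj = order[depth]
            if obj in members:
                continue
            # Random access: fetch the grade of obj in every list.
            grade = agg(lst[obj] for lst in lists)
            if len(buffer) < k:
                heapq.heappush(buffer, (grade, obj))
                members.add(obj)
            elif grade > buffer[0][0]:
                _, evicted = heapq.heappushpop(buffer, (grade, obj))
                members.discard(evicted)
                members.add(obj)
        # Threshold T: aggregate of the last grade seen in each list.
        threshold = agg(lst[order[depth]] for lst, order in zip(lists, ordered))
        if len(buffer) == k and buffer[0][0] >= threshold:
            break
    return sorted(buffer, key=lambda pair: (-pair[0], pair[1])), rounds
```

On this data with k = 1, the threshold drops to 24 after two rounds, which equals the grade of ‘a’, so TA halts one round earlier than FA would and only ever buffers k objects.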


Example:

1. A buffer keeps the top-k tuples that have been found so far.

2. For any tuple in a sorted list, do random access to get its overall grade. Compare it with the tuples in the buffer, and decide whether to insert it or discard it.

3. The threshold window (the m most recently read records, one per list) represents the best overall grade any unseen tuple could have, assuming we can combine the best values from different tuples.

4. Notice that this window may not be “horizontal” if we use different speeds to access different lists.

5. This window helps us decide when to stop: once we find k tuples whose grades are at least equal to that of the window tuple, we halt.

[Figure: sorted lists (labels: mileage, year) with prefixes a, c, e, ...; b, e, f, ...; and a, d, e, ...; shown with a buffer for the top-k and a threshold window.]

TA’s Properties

1. TA is optimal for all monotone functions and over every database.
2. Compared to FA, TA requires a small, constant-size buffer.
3. TA allows early stopping.
   – Can show TA never stops later than FA. (Why?)
4. There are times when the user is satisfied with an approximate top-k list. TA can be modified to give such an approximation.
5. TA can be modified to handle the case where random access is impossible.

Instance Optimality

1. Algorithm b is instance optimal over an algorithm set A and a database instance set D if b is in A and, for every algorithm a in A and every instance d in D, we have cost(b, d) = O(cost(a, d)).

2. Similar to “competitive ratio”.
3. Essentially: b is the best algorithm in A.
4. Stronger than “optimality in the worst case”.
5. TA is instance optimal among all “correct algorithms” (nondeterministic algorithms).

Variations of TA

• NRA: When no random access is possible
  – Example: Web search engines, which typically do not allow you to enter a URL and get its ranking

• TAZ: When no sorted access is possible for some predicates
  – Example: Find good restaurants near location x (sorted and random access for restaurant ratings, random access only for distances from a mapping site)

• CA: When the relative costs of random and sorted accesses matter.

• TAθ: When only approximate answers are needed
  – Example: Web search, with lots of good quality answers

Outline

• Ranking Queries

• Data privacy:
  – XML Data publishing
  – Database-as-service

Motivation

• Privacy in publishing XML data
• Applications:
  – Web publishing
  – Data sharing and exchange, e.g., in P2P systems

Example: Hospital XML data

[Figure: XML tree rooted at hospital. physician subtrees carry phname (e.g., Walker) and treat nodes. patient(1): pname Alice, disease leukemia, ward W305. patient(2): pname Betty, disease leukemia, ward W305. patient(3): pname Cathy, disease leukemia, ward W305. A further patient has disease cancer and ward W403.]

Goal: hide Alice’s disease.

Common knowledge: patients in the same ward have the same disease.

Problem

Given:
• An XML document to be published
• Sensitive data in the document
• Common knowledge that public users can use to do data inference

Find:
• A partial document to be released so that users cannot infer the sensitive data

Research challenges

• How to model data inference using common knowledge?

• How to compute all possible inferred data?

• How to compute a partial document to be published without leaking sensitive information?


Roadmap

• Information Leakage
  – Defining sensitive data
  – Describing common knowledge
  – Computing inferred documents

• Prevent information leakage

Defining sensitive data

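• Sensitive data is specified using an XQuery, called a “regulating query”.
• A special node marked “*” indicates the sensitive data.

[Figure: a regulating query (labeled A2) drawn as a tree pattern with nodes hospital, patient, pname*, and disease, and the value Cathy.]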

Example 1

• Map the query to the XML tree.
• For each mapping, the target of the * node is sensitive.

[Figure: a regulating query with nodes patient and disease mapped onto the hospital XML tree (patients Alice, Betty, and Cathy, each with disease leukemia and ward W305).]

Example 2

[Figure: regulating query A2 (nodes hospital and pname*, value Cathy) mapped onto the same hospital XML tree.]

Common Knowledge

• Represented as XML constraints

• Could be obtained in various ways, e.g.,
  – possible schema
  – analysis from the published data

Common Constraints

• Child constraints: //p → //p/c
  Example: //patient → //patient/pname (every patient has a pname child)

• Descendant constraints: //p → //p//d
  Example: //patient → //patient//disease (every patient has a disease descendant)

• Functional dependencies: //p/a → //p/b
  Example: //patient/ward → //patient/disease
  For two patients with wards w1, w2 and diseases d1, d2: if w1 = w2, then d1 = d2 (value equal).
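To make the functional dependency concrete, the sketch below shows how a user could recover a hidden disease value from published data. It assumes, purely for illustration, that each patient subtree has been flattened into a dict with `pname`, `ward`, and `disease` keys and that a withheld value is represented as `None`; the function name and record layout are hypothetical.

```python
def infer_with_fd(records, lhs="ward", rhs="disease"):
    """Fill in missing rhs values implied by the dependency lhs -> rhs."""
    # Every record that publishes both values tells us the rhs for its lhs.
    known = {
        r[lhs]: r[rhs]
        for r in records
        if r.get(lhs) is not None and r.get(rhs) is not None
    }
    inferred = []
    for r in records:
        r = dict(r)  # copy, so the published records stay unchanged
        if r.get(rhs) is None and r.get(lhs) in known:
            r[rhs] = known[r[lhs]]
        inferred.append(r)
    return inferred


# Running example with Alice's disease withheld (None).
PUBLISHED = [
    {"pname": "Alice", "ward": "W305", "disease": None},
    {"pname": "Betty", "ward": "W305", "disease": "leukemia"},
    {"pname": "Cathy", "ward": "W305", "disease": "leukemia"},
]
```

Calling `infer_with_fd(PUBLISHED)` fills in leukemia for Alice, because Betty and Cathy share her ward. Withholding only the sensitive value is therefore not enough when the constraint is common knowledge.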


Modify partial document using constraints

C1: //patient → //patient/pname
C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease

[Figure: partial document P, rooted at hospital with two patient subtrees; both wards are W305 and one patient's disease is leukemia.]

Apply C1 on document P

C1: //patient → //patient/pname

[Figure: partial document P with C1 applied.]

C2: //patient → //patient//disease

[Figure: partial document P with C2 applied; a disease node is attached by a floating branch.]

• Floating branch: exact location unknown

C3: //patient/ward → //patient/disease

[Figure: partial document P with C3 applied; a disease node with value leukemia is added.]

Apply a sequence of constraints: <C2,C3>

C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease

[Figure: result of applying <C2,C3> to P: a floating disease branch plus a disease node with value leukemia.]

Another user applies a different sequence of constraints: <C3,C2>

C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease

After applying C3, we cannot use C2 to expand the tree. No more floating branch!

[Figure: result of applying <C3,C2> to P: a disease node with value leukemia and no floating branch.]

They look different!

• P1 is “m-contained” in P2:– There is a mapping from P1 to P2.

– A floating branch can be mapped to a path.

– The m-containing document P2 has more information

• P2 is also “m-contained” in P1.

• Thus they are “m-equivalent”!

[Figure: P1, the result of <C2,C3>, and P2, the result of <C3,C2>, shown side by side.]

What documents can users infer?

• Different users can use different sequences of constraints to do inference

• Thus they can infer different documents
• Questions:
  – Can an inference process terminate?
  – What inferred document should we consider to prevent leakage of sensitive data?

Theorem

• Given a partial document P of an XML document D and a set of constraints C = {C1, …, Ck}, there is a document M that can be inferred from P using a sequence of constraints, such that:
  – for any sequence of constraints, its resulting document is m-contained in M.
• M can be computed using a greedy approach.
• Such a document is unique under m-equivalence.

Information leakage

• For a partial document P, if there exists a regulating query A such that the maximal inferred document M can produce a non-empty answer to the query A, then we say “P causes information leakage.”

[Figure: partial document P, inference, regulating query A.]

Roadmap

• Information Leakage
• Prevent information leakage

Formal Problem

• Given an XML document D, a regulating query A, and common knowledge represented as constraints C1, …, Ck:
  – How to find a partial document P without information leakage?
  – Such a P is called a valid partial document.
• The empty document is a trivial one.
• We want the published document to have as much data as possible.

An algorithm

• We develop an algorithm for solving this problem

• We use the running example to illustrate the algorithm


Example

[Figure: the hospital XML tree D (patients Alice, Betty, and Cathy, each with disease leukemia and ward W305), and regulating query A over the nodes patient and disease.]

Functional dependency: //patient/ward → //patient/disease

Remove sensitive data A(D)

[Figure: remaining document D - A(D), i.e., the hospital XML tree with the sensitive data A(D) removed.]

Compute the maximal inferred document M of D-A(D)

[Figure: maximal inferred document M of D - A(D).]

Testing Information Leakage

[Figure: regulating query A mapped onto the inferred document.]

There is a mapping from A to P, so information is leaked.

Computing a valid partial document

[Figure: pipeline starting from D - A(D) that alternates inference, testing regulating query A, breaking mappings, and chasing back.]

• How to break the mappings?
• How to chase back the inference steps?

AND/OR Graphs

• A structure representing how a goal can be reached by solving subproblems.
• We use such graphs to formulate the process of finding a valid partial document.

[Figure: hospital XML tree and regulating query A; the AND/OR graph starts with the nodes Alice(1) and leukemia(1).]

• Consider mapping images of the leaf nodes in A.
• An “OR” connector shows that solving any of the subproblems can solve the parent problem.

[Figure: AND/OR graph expanded below leukemia(1), with nodes including leukemia(2), leukemia(3), W305(1), and W305(3).]

• Multiple ways to infer the sensitive data.
• An “AND” connector shows that solving ALL the subproblems can solve the parent problem.

[Figure: the same AND/OR graph, expanded further.]

• Continue expanding the AND/OR graph.

AND/OR Graphs (cont)

• A special START node representing the goal of computing a valid partial document.
• The graph has nodes corresponding to nodes in the maximal inferred document M.
• Such a node represents the subproblem of hiding its corresponding node n in M:
  – This node n should be removed from M.
  – It cannot be inferred using the constraints and other nodes in M.

Solution graphs

• A connected subgraph (of M) including the START node.
• For each node in the subgraph, its successor connectors are also in the subgraph.
• If it contains an OR connector, it must also contain one of the connector's successors.
• If it contains an AND connector, it must also contain all the successors of the connector.
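The solution-graph conditions above can be sketched as a small enumeration. Everything about the encoding is an assumption made for illustration: the graph is a dict from node name to a list of connectors, each connector is a `("OR" | "AND", successors)` pair, the graph is acyclic, and `GRAPH` is a hypothetical graph loosely modeled on the running example rather than the one in the figures.

```python
from itertools import product


def solutions(graph, node):
    """Yield the node set of every solution graph rooted at `node`."""
    per_connector = []
    for kind, successors in graph.get(node, []):
        if kind == "OR":
            # An OR connector needs one successor: any solution of any successor.
            options = [s for succ in successors for s in solutions(graph, succ)]
        else:
            # An AND connector needs all successors: one solution from each.
            choices = [list(solutions(graph, succ)) for succ in successors]
            options = [frozenset().union(*combo) for combo in product(*choices)]
        per_connector.append(options)
    # Every connector of an included node must itself be included.
    for combo in product(*per_connector):
        yield frozenset({node}).union(*combo)


# Hypothetical graph: hide Alice's disease by hiding her name or the value
# leukemia(1); hiding leukemia(1) also requires blocking each inference path.
GRAPH = {
    "START": [("OR", ["Alice(1)", "leukemia(1)"])],
    "leukemia(1)": [
        ("OR", ["leukemia(2)", "W305(1)"]),
        ("OR", ["leukemia(3)", "W305(1)"]),
    ],
}
```

Each yielded set, minus START, is a set of nodes to remove from M. For example, `min(solutions(GRAPH, "START"), key=len)` picks the solution that removes only Alice(1), while another solution removes leukemia(1) and W305(1). Exhaustive enumeration is exponential in general, so this only illustrates the definition.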


Example solution graphs

[Figure: example solution graphs rooted at START, over the nodes Alice(1), leukemia, and W305(1).]

Computing a valid partial document using a solution graph

• For a solution graph G, for each node in G, we remove the corresponding node in M to get a valid partial document.

[Figure: solution graph nodes Alice(1), leukemia, and W305(1) shown next to the hospital XML tree.]

Constructing an AND/OR Graph

• Give an algorithm for computing an AND/OR graph

• Consider inference steps of different constraints

• Many algorithms have been proposed for finding a solution graph, and they are applicable here.

• No need to construct the entire AND/OR graph. Search for a solution graph “on the fly.”

Related work

Different scenarios of database security based on trust domains:

A. Single-user DBMS
B. C/S access control
C. Database as a service
D. Data publishing (our work)

[Figure: for each scenario, the data, the query, and the execution are grouped into trust domains.]

Summary of 2nd paper

• Formulated problem of publishing XML document without information leakage due to data inference

• Showed the effect of constraints on inference

• Algorithm for finding a valid partial document of a given document


Outline

• Ranking Queries

• Data privacy:
  – XML Data publishing
  – Database-as-service (DAS) model
