Sampling Attacks Against Hidden Databases

LEVERAGING COUNT INFORMATION IN SAMPLING HIDDEN DATABASES

Presenter: Nan Zhang, George Washington Univ.

Joint work with Arjun Dasgupta and Gautam Das, The University of Texas at Arlington

OUTLINE

• Introduction
• Baseline Algorithm
• COUNT-DECISION-TREE
• ALERT-HYBRID
• Experimental Results
• Related Work
• Conclusion

THE DEEP WEB

Deep Web vs. Surface Web
• Dynamic contents, unlinked pages, private web, contextual web, etc.
• Estimated size [1]: 91,850 vs. 167 terabytes

[1] SIMS, UC Berkeley, How much information? 2003. http://www.sims.berkeley.edu/research/projects/how-much-info-2003/

HIDDEN DATABASES

Form-like interface

Return top-k tuples


SAMPLING HIDDEN DATABASES THROUGH PUBLIC INTERFACES

Problem definition
• Given such restricted query interfaces, how can one efficiently obtain a uniform random sample of the backend database by accessing the database only via the public front-end interface?

Applications
• In which geographic area, and for which industry, do MSN Careers’ job sources have especially low presence?
• Which flight on which date is more likely to be relatively empty?
• What is the real size of the hidden database?

PERFORMANCE MEASURES OF SAMPLING HIDDEN DATABASES

Sample bias
• Over- or under-representing a portion of the population
• Objective: minimize sample bias

Efficiency (query cost)
• The number of queries issued to the web interface of a hidden database
• Note: many hidden databases charge for each issued query or have limits on the number of queries one can issue per day.
• Objective: minimize query cost

TWO TYPES OF HIDDEN DATABASE INTERFACE

Depending on how overflowing query results are displayed:

TOP-k-ALERT (display overflowing flag only)
Example: "Showing 481-500 of more than 500 results"

TOP-k-COUNT (display real COUNT)
Example: "Showing 481-500 of 15,167 results"
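The two interface types can be mimicked with a few lines of Python. This is only a sketch under stated assumptions: the names `Answer` and `top_k_query` are hypothetical, the data is a four-tuple Boolean toy table, and a real site would apply its own ranking before truncating to k results.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Row = Tuple[int, ...]


@dataclass
class Answer:
    rows: List[Row]        # at most k tuples are ever returned
    overflow: bool         # True when more than k tuples match the query
    count: Optional[int]   # real COUNT for Top-k-COUNT, None for Top-k-ALERT


def top_k_query(db: List[Row], predicate: Dict[int, int], k: int,
                show_count: bool) -> Answer:
    """Answer a conjunctive query {attribute index: value} through a top-k interface."""
    matches = [row for row in db
               if all(row[a] == v for a, v in predicate.items())]
    return Answer(rows=matches[:k],
                  overflow=len(matches) > k,
                  count=len(matches) if show_count else None)


# Toy table over Boolean attributes A1, A2, A3 (indices 0, 1, 2); assumed data.
DB = [(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 0)]

alert = top_k_query(DB, {0: 0}, k=1, show_count=False)       # A1 = 0 overflows
counted = top_k_query(DB, {0: 0}, k=1, show_count=True)      # same query, COUNT shown
empty = top_k_query(DB, {0: 1, 1: 0}, k=1, show_count=True)  # A1 = 1 & A2 = 0 underflows
```

With k = 1, the query A1 = 0 overflows in both modes; only the Top-k-COUNT answer reveals that three tuples match.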

OUTLINE OF TECHNICAL RESULTS

Existing work
• HIDDEN-DB-SAMPLER [DDM07]

Our results
• COUNT-DECISION-TREE: an efficient unbiased sampling algorithm for Top-k-COUNT interfaces
• ALERT-HYBRID: an efficient sampling algorithm with slight bias for Top-k-ALERT interfaces


A RUNNING EXAMPLE

     A1  A2  A3
t1    0   0   1
t2    0   1   0
t3    0   1   1
t4    1   1   0

Leaves of the query tree (values of A1 A2 A3): 000 001 010 011 100 101 110 111
Tuples at the leaves: t1 = 001, t2 = 010, t3 = 011, t4 = 110

BASELINE ALGORITHM: COUNT-ORDER

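[Figure: query tree for the running example, branching on A1, then A2, then A3. Example queries on the tree: A1 = 0; A1 = 1; A1 = 0 & A2 = 0; A1 = 0 & A2 = 1; A1 = 0 & A2 = 0 & A3 = 0; A1 = 0 & A2 = 1 & A3 = 1. Each query is marked as valid, underflow, or overflow.]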

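[Figure: COUNT-ORDER walk over the query tree with leaves 000 to 111. Branch COUNTs: 3 and 1 at A1, 1 and 2 at A2, 1 and 1 at A3. Following each branch with probability proportional to its COUNT, a three-level path is reached with probability 3/4 * 2/3 * 1/2 = 1/4.]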


[Figure: the same tree, with branch COUNTs 3 and 1 at A1 and 1 and 2 at A2. A two-level path is reached with probability 3/4 * 1/3 = 1/4.]
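The arithmetic on the COUNT-ORDER slides can be replayed in code. The sketch below is an illustration, not the authors' implementation: it assumes k = 1, a simulated Top-k-COUNT interface over the running example, and hypothetical helper names `count` and `count_order_walk`. Each branch is followed with probability proportional to its COUNT, and the exact path probability is tracked with fractions.

```python
import random
from fractions import Fraction

DB = [(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 0)]  # t1..t4 over A1, A2, A3
K = 1  # top-k limit of the simulated interface (assumed)


def count(predicate):
    """COUNT that a Top-k-COUNT interface would report for a conjunctive query."""
    return sum(all(row[a] == v for a, v in predicate.items()) for row in DB)


def count_order_walk(rng):
    """One drill-down in the fixed order A1, A2, A3.

    Returns the sampled tuple and the exact probability of reaching it.
    """
    predicate, prob = {}, Fraction(1)
    for attr in range(3):
        parent = count(predicate)
        if parent <= K:          # valid query: stop drilling down
            break
        # Branch with probability proportional to each child's COUNT.
        weights = [count({**predicate, attr: v}) for v in (0, 1)]
        value = rng.choices((0, 1), weights=weights)[0]
        prob *= Fraction(weights[value], parent)
        predicate[attr] = value
    matches = [row for row in DB
               if all(row[a] == v for a, v in predicate.items())]
    return rng.choice(matches), prob / len(matches)


rng = random.Random(0)
walks = [count_order_walk(rng) for _ in range(200)]
```

Every walk reports probability 1/4 regardless of the leaf it ends in (for example 3/4 * 2/3 * 1/2 for t3 and 3/4 * 1/3 for t1), which is why count-weighted branching needs no rejection step to stay unbiased in this example.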


COUNT-DECISION-TREE

First result of the paper

Two main ideas for improving the efficiency of sampling
• Utilizing query history
• Attribute order tree → Decision tree

UTILIZING QUERY HISTORY: BASIC IDEA

[Figure: query tree over A1, A2, A3 with leaves 000 to 111.]

UTILIZING QUERY HISTORY: SAVING

• Saving from query history is significant
• When minimum domain size is b, saving for collecting s samples is:
• Expected saving: 5,000 samples from a 100,000-tuple i.i.d. Boolean database with uniform distribution: 49.90% (83,048 → 41,610)
• b = 5: 62.62% (143,067 → 53,317)
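One way to picture the saving from query history is a cache in front of the interface. The sketch below is an assumption-laden illustration: `CountInterface` and `walk` are hypothetical names, the data is the running example with k = 1, and the sibling COUNT of a Boolean attribute is inferred by subtraction rather than queried, which is one plausible reading of "utilizing query history".

```python
import random

DB = [(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 0)]  # running example


class CountInterface:
    """Simulated Top-k-COUNT interface that remembers every answered query."""

    def __init__(self, db):
        self.db = db
        self.history = {}    # frozenset of (attribute, value) pairs -> COUNT
        self.requested = 0   # COUNT lookups made by the sampler
        self.issued = 0      # lookups that actually reached the interface

    def count(self, predicate):
        key = frozenset(predicate.items())
        self.requested += 1
        if key not in self.history:
            self.issued += 1
            self.history[key] = sum(
                all(row[a] == v for a, v in predicate.items())
                for row in self.db)
        return self.history[key]


def walk(iface, rng, k=1, n_attrs=3):
    """Count-weighted drill-down over Boolean attributes in a fixed order."""
    predicate = {}
    parent = iface.count(predicate)
    for attr in range(n_attrs):
        if parent <= k:
            break
        zero = iface.count({**predicate, attr: 0})
        weights = (zero, parent - zero)   # sibling COUNT inferred, not queried
        value = rng.choices((0, 1), weights=weights)[0]
        predicate[attr] = value
        parent = weights[value]
    return predicate


iface = CountInterface(DB)
rng = random.Random(0)
for _ in range(1000):
    walk(iface, rng)
```

After 1,000 walks, `iface.requested` is in the thousands while `iface.issued` never exceeds 4, because only four distinct queries exist along the paths of this tiny tree. The percentages quoted on the slide come from the authors' experiments, not from this toy setup.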

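ATTRIBUTE ORDER TREE → DECISION TREE: BASIC IDEA

[Figure: two query trees over the running example (t1 = 001, t2 = 010, t3 = 011, t4 = 110). Left: attribute order tree that branches on A1, then A2, then A3, with leaves 000 to 111. Right: decision tree that branches on A3 at the root, then on A1 under A3 = 0 (leaves t2, t4) and on A2 under A3 = 1 (leaves t1, t3).]

Note: Not to be confused with a decision tree for classification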

CONSTRUCTING AN OPTIMAL DECISION TREE: TWO MAIN CHALLENGES

• Problem is hard even if one has access to the entire database
  – When k = 1, to collect s = 1 sample, the construction of an optimal decision tree over a Boolean DB is equivalent to the well-known NP-hard problem of constructing an optimal decision tree for entity identification in the database.
• Furthermore, without knowledge of the database, construction must be done on-the-fly
• Note: Tree construction costs queries too!

INTUITION OF A HEURISTIC ALGORITHM

The decision-tree construction algorithm must consider query history

Consider the number of unique queries required to acquire an infinite number of samples.

Observation I: m – 1 = 7 queries are required for Trees A, B, and C (m = 8 is the number of tuples)
Observation II: Empty leaves in Tree D lead to more required queries.
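The two observations can be checked on small Boolean tables. The accounting used here is an assumption, not taken from the slides: one query is charged per overflowing node (COUNT > k), the root COUNT is treated as known, and the sibling COUNT is inferred by subtraction. The function name `unique_queries` and the table `SKEWED` are hypothetical, and only fixed attribute orders are covered.

```python
def unique_queries(db, order, k=1):
    """Count overflowing nodes (COUNT > k) in a fixed-order Boolean query tree.

    Assumed accounting: each overflowing node costs one query, because the
    COUNT of one child is queried and the other is inferred by subtraction.
    """
    def visit(rows, depth):
        if len(rows) <= k or depth == len(order):
            return 0
        attr = order[depth]
        zeros = [row for row in rows if row[attr] == 0]
        ones = [row for row in rows if row[attr] == 1]
        return 1 + visit(zeros, depth + 1) + visit(ones, depth + 1)

    return visit(list(db), 0)


RUNNING = [(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 0)]  # running example
SKEWED = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]   # hypothetical data

no_empty = unique_queries(RUNNING, order=(0, 1, 2))   # no empty branches
with_empty = unique_queries(SKEWED, order=(0, 1, 2))  # A2 splits nothing
reordered = unique_queries(SKEWED, order=(0, 2, 1))   # branch on A3 before A2
```

With m = 4 tuples, `no_empty` is 3 = m – 1. On the skewed table the order A1, A2, A3 hits empty branches and needs 5 queries, while branching on A3 before A2 brings the cost back to 3.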

DECISION TREES FOR COLLECTING A FINITE NUMBER OF TUPLES

• Loss: empty leaves
• Saving: since the number of samples to be collected is finite, not all m – 1 queries need to be issued.

Constructing an optimal decision tree
• Given the number of samples to be collected, maximize (Saving – Loss)

[Figure labels: Possible saving, Possible loss]

A GREEDY HEURISTIC ALGORITHM

Saving:

Loss:

Expense:

Net Saving Per Expense:

if [...] then total cost <= [...]

COMPUTATION OF SER

• How to compute branch COUNTs (i.e., |uj|) for all candidates of a node?
  – Exact computation defeats the entire purpose of minimizing cost
  – Fortunately, in many cases a rough estimate is enough, e.g., fanout of 2 vs. 10
  – Unfortunately, a uniform assumption does not suffice
  – Proposed solution: issue a small number of marginal queries first, then estimate under a conditional independence assumption
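One plausible reading of the proposed solution is sketched below. The formula is an assumption, not taken from the slides: under conditional independence, the share of a value inside the current node is taken to equal its share in the whole database, so |uj| is estimated as COUNT(node) * COUNT(attribute = value) / COUNT(database). The function name `estimate_branch_counts` is hypothetical.

```python
def estimate_branch_counts(node_count, marginal_counts, total):
    """Estimate |u_j| for every value of one candidate attribute.

    node_count:      COUNT of the current node u (already known)
    marginal_counts: {value: COUNT of the marginal query "attribute = value"}
    total:           COUNT of the whole database
    Assumes the attribute is independent of the predicates that define u.
    """
    return {value: node_count * c / total
            for value, c in marginal_counts.items()}


# Running example: node A1 = 0 has COUNT 3; marginal COUNTs of A2 are 1 and 3.
estimates = estimate_branch_counts(node_count=3,
                                   marginal_counts={0: 1, 1: 3},
                                   total=4)
```

The estimates are 0.75 and 2.25 against true branch COUNTs of 1 and 2. They are rough, but enough to rank candidate attributes, and the marginal queries are issued once and reused for every node.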

ALGORITHM COUNT-DECISION-TREE

[Figure: drill-down over the attributes [Transmission] (Automatic, Manual), [Make] (Honda, Toyota, Ford, VW, Nissan), and [Price Segment] (High, Medium, Low). At each node, select the attribute that maximizes SER; the walk stops when a sample is found.]

[Figure: the same tree over [Transmission], [Make], and [Price Segment] after a sample is found. Go back to the root and start another walk!]

[Figure: a later walk. Below [Transmission] (Automatic, Manual), the attribute that maximizes SER is selected again, here [Price Segment] (High, Medium, Low) and [Make] (Honda, Toyota, Ford, VW, Nissan). Queries already answered in earlier walks are reused (saving from history), and a sample is found.]

[Figure: the resulting decision tree rooted at [Transmission] (Automatic, Manual), in which different branches drill down through [Price Segment] and [Make] in different orders.]


TWO MAIN IDEAS OF ALERT-HYBRID

• Use a small number of pilot samples to estimate COUNT
  – Motivation: COUNT eliminates bias for sampling, but is unavailable for Top-k-ALERT interfaces.
• On-the-fly switch from COUNT-DECISION-TREE to ALERT-ORDER during the drill-down process for collecting the remaining samples
  – Motivation: an inaccurate COUNT may introduce additional bias, so switch when confidence is low
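The switching rule can be sketched as follows. Everything here is an illustrative assumption rather than the authors' procedure: the function name `pilot_branch_weights`, the integer coding of attribute values, and the use of "number of matching pilot samples below a threshold" as the low-confidence signal.

```python
def pilot_branch_weights(pilot, predicate, attr, domain, threshold):
    """Estimate branch weights for `attr` below `predicate` from pilot samples.

    Returns None when fewer than `threshold` pilot samples match the current
    node; the caller would then leave COUNT-weighted drill-down and continue
    with ALERT-ORDER (assumed switching criterion).
    """
    matching = [row for row in pilot
                if all(row[a] == v for a, v in predicate.items())]
    if len(matching) < threshold:
        return None
    return [sum(row[attr] == v for row in matching) for v in domain]


# Hypothetical pilot samples over (Transmission, Price Segment), coded as ints.
PILOT = [(0, 0), (0, 1), (0, 1), (0, 2), (1, 2)]

at_root = pilot_branch_weights(PILOT, {}, attr=0, domain=(0, 1), threshold=3)
automatic = pilot_branch_weights(PILOT, {0: 0}, attr=1, domain=(0, 1, 2), threshold=3)
manual = pilot_branch_weights(PILOT, {0: 1}, attr=1, domain=(0, 1, 2), threshold=3)
```

Here `at_root` is [4, 1] and `automatic` is [1, 2, 1], so the walk can keep branching in proportion to the estimated COUNTs. `manual` is None because only one pilot sample matches, which is the point at which the sampler would switch.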


EXPERIMENTAL SETUP

• Synthetic Boolean-iid:
  – 200,000 tuples, 80 attributes, p = 0.25
• Synthetic Boolean-mixed:
  – 200,000 tuples, 40 independent attributes
  – 5 with uniform distribution, the others have p from 1/160 to 35/160 with step of 1/160
• Yahoo! Auto: http://autos.yahoo.com
  – 15,211 tuples, 32 Boolean, 6 categorical attributes
  – Domain size ranges from 5 to 447
• Census: UCI Data Mining Archive
  – 1990 census data; all attributes with domain size > 100 are removed, leaving 12 attributes and 32,561 tuples
  – Domain size from Boolean to 92

EFFICIENCY OF COUNT-DECISION-TREE VS. COUNT-ORDER


EFFICIENCY AND BIAS OF ALERT-HYBRID VS. ALERT-ORDER


ILLUSTRATION OF ALERT-HYBRID



RELATED WORK

Crawling hidden text databases [BB98, AIG03, NZC05]

Extracting data from hidden structured databases [RG01, LES+02, ARP+07]

Sampling search engine’s index using a public interface [BB98, BJ04, BG06, BG07]

Sampling hidden structured databases [DDM07]


CONCLUSION

Main Technical Contribution 1: COUNT-DECISION-TREE
• Unbiased sampling algorithm for Top-k-COUNT interfaces
• Orders of magnitude more efficient than the existing algorithms

Main Technical Contribution 2: ALERT-HYBRID
• Sampling algorithm with slight bias for Top-k-ALERT interfaces
• Orders of magnitude more efficient and has smaller bias than the existing algorithms.

CONCLUSION

• Our studies unveil powerful techniques to perform data analytics over hidden databases
• Hidden database owners may be extremely concerned about the privacy of aggregates over their hidden databases.
• How to reveal individual tuples truthfully and efficiently, but hide aggregated views of the data?
  – A. Dasgupta, N. Zhang, G. Das, and S. Chaudhuri, Privacy Preservation of Aggregates in Hidden Databases: Why and How? SIGMOD 2009.

THANK YOU

BACKUP SLIDES

EFFICIENCY AND BIAS OF COUNT-DECISION-TREE VS. ALERT-ORDER



• Top-k-ALERT
  – MSN Stock Screener (k = 24)
    • http://moneycentral.msn.com/investor/finder/customstocks.asp
  – Microsoft Solution Finder (k = 500)
    • https://solutionfinder.microsoft.com/Solutions/SolutionsDirectory.aspx?mode=searchproblem
• Top-k-COUNT
  – MSN Careers (k = 4,000)
    • http://msn.careerbuilder.com/JobSeeker/Jobs/JobFindAdv.aspx

SETTINGS OF S1 AND CS


COUNT-DECISION-TREE VS. ALERT-RANDOM


IMPROVEMENT BY CONSIDERING HISTORY

Note: Almost unrelated to k

IMPROVEMENT BY DECISION TREE


ALGORITHM ALERT-HYBRID

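[Figure: ALERT-HYBRID drill-down. The walk starts as COUNT-DECISION-TREE, selecting the attribute that maximizes SER over estimated COUNTs: [Transmission] (|Automatic|, |Manual|), [Price Segment] (|High|, |Medium|, |Low|), [Make] (VW). When the Count falls below THRESHOLD, ALERT-ORDER starts on [State] (TX, VA, NY, CA).]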

ALGORITHM ALERT-HYBRID

[Figure: another ALERT-HYBRID walk. COUNT-DECISION-TREE covers [Transmission] (|Automatic|, |Manual|), [Make] (|Honda|, |Toyota|, |VW|, |Nissan|), and [Price Segment] (|High|, |Medium|, |Low|). Once the Count is below THRESHOLD, ALERT-ORDER starts and the walk continues on [State] (TX, VA, NY, CA).]
