LEVERAGING COUNT INFORMATION IN SAMPLING HIDDEN DATABASES
Presenter: Nan Zhang, George Washington Univ.
Joint work with Arjun Dasgupta and Gautam Das, The University of Texas at Arlington

Sampling Attacks Against Hidden Databases


OUTLINE

• Introduction
• Baseline Algorithm
• COUNT-DECISION-TREE
• ALERT-HYBRID
• Experimental Results
• Related Work
• Conclusion


THE DEEP WEB

• Deep Web vs. Surface Web
  – Dynamic content, unlinked pages, private web, contextual web, etc.
  – Estimated size [1]: 91,850 vs. 167 terabytes

[1] SIMS, UC Berkeley, How much information? 2003. http://www.sims.berkeley.edu/research/projects/how-much-info-2003/


HIDDEN DATABASES

Form-like interface

Return top-k tuples


SAMPLING HIDDEN DATABASES THROUGH PUBLIC INTERFACES

• Problem definition
  – Given such restricted query interfaces, how can one efficiently obtain a uniform random sample of the backend database by accessing it only via the public front-end interface?
• Applications
  – In which geographic area, and for which industry, do MSN Careers' job sources have especially low presence?
  – Which flight on which date is more likely to be relatively empty?
  – What is the real size of the hidden database?


PERFORMANCE MEASURES OF SAMPLING HIDDEN DATABASES

• Sample bias
  – Over- or under-representing a portion of the population
  – Objective: minimize sample bias
• Efficiency (query cost)
  – The number of queries issued to the web interface of a hidden database
  – Note: many hidden databases charge for each issued query or limit the number of queries one can issue per day.
  – Objective: minimize query cost


TWO TYPES OF HIDDEN DATABASE INTERFACE

Depending on how overflowing query results are displayed:

TOP-k-ALERT (display overflowing flag only)

<Prev> 1 … 20 21 22 23 24 25
Showing 481-500 of more than 500 results


TWO TYPES OF HIDDEN DATABASE INTERFACE

Depending on how overflowing query results are displayed:

TOP-k-COUNT (display real COUNT)

<Prev> 1 … 20 21 22 23 24 25
Showing 481-500 of 15,167 results
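The difference between the two interface types can be sketched with a toy in-memory stand-in. This is a minimal sketch, not a real service API: the class `HiddenDatabase`, the `QueryResult` fields, and the four Boolean tuples are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Row = Tuple[int, ...]


@dataclass
class QueryResult:
    rows: List[Row]        # at most k matching tuples
    overflow: bool         # True when more than k tuples matched
    count: Optional[int]   # exact match count, only for Top-k-COUNT


class HiddenDatabase:
    """Hypothetical form-like front end over a hidden table."""

    def __init__(self, rows: List[Row], k: int, reveal_count: bool):
        self._rows = rows
        self.k = k
        self.reveal_count = reveal_count

    def query(self, predicate: Dict[int, int]) -> QueryResult:
        # predicate maps attribute index -> required value (conjunctive query)
        matches = [row for row in self._rows
                   if all(row[a] == v for a, v in predicate.items())]
        return QueryResult(
            rows=matches[: self.k],
            overflow=len(matches) > self.k,
            count=len(matches) if self.reveal_count else None,
        )


# Four illustrative Boolean tuples over attributes A1, A2, A3.
ROWS = [(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 0)]

alert_db = HiddenDatabase(ROWS, k=1, reveal_count=False)  # Top-k-ALERT
count_db = HiddenDatabase(ROWS, k=1, reveal_count=True)   # Top-k-COUNT

alert_result = alert_db.query({0: 0})  # A1 = 0: overflow flag only
count_result = count_db.query({0: 0})  # A1 = 0: overflow flag plus COUNT
```

Both interfaces return the same top-k rows; only the COUNT variant tells the sampler how many tuples sit behind an overflowing query.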


OUTLINE OF TECHNICAL RESULTS

• Existing work
  – HIDDEN-DB-SAMPLER [DDM07]
• Our results
  – COUNT-DECISION-TREE: an efficient unbiased sampling algorithm for Top-k-COUNT interfaces
  – ALERT-HYBRID: an efficient sampling algorithm with slight bias for Top-k-ALERT interfaces


A RUNNING EXAMPLE

      A1  A2  A3
t1     0   0   1
t2     0   1   0
t3     0   1   1
t4     1   1   0

Leaves (A1 A2 A3): 000 001 010 011 100 101 110 111
Tuples at leaves: t1 at 001, t2 at 010, t3 at 011, t4 at 110


BASELINE ALGORITHM: COUNT-ORDER

[Figure: query tree with one level per attribute (A1, A2, A3). Example queries: A1 = 0; A1 = 1; A1 = 0 & A2 = 0; A1 = 0 & A2 = 1; A1 = 0 & A2 = 0 & A3 = 0; A1 = 0 & A2 = 1 & A3 = 1. Legend: valid, underflow, overflow.]


BASELINE ALGORITHM: COUNT-ORDER

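[Figure: query tree over A1, A2, A3 with leaves 000 to 111; each branch is annotated with its COUNT.]

Branch COUNTs: A1 = 0 has Count = 3 and A1 = 1 has Count = 1 (of 4); under A1 = 0, A2 = 0 has Count = 1 and A2 = 1 has Count = 2 (of 3); under A1 = 0 & A2 = 1, each A3 branch has Count = 1.

Probability of reaching a tuple under A1 = 0 & A2 = 1: 3/4 * 2/3 * 1/2 = 1/4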


BASELINE ALGORITHM: COUNT-ORDER

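[Figure: query tree over A1, A2, A3 with leaves 000 to 111; each branch is annotated with its COUNT.]

Branch COUNTs: A1 = 0 has Count = 3 and A1 = 1 has Count = 1 (of 4); under A1 = 0, A2 = 0 has Count = 1 and A2 = 1 has Count = 2 (of 3).

Probability of reaching the tuple under A1 = 0 & A2 = 0: 3/4 * 1/3 = 1/4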
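The arithmetic on the COUNT-ORDER slides can be checked with a short sketch. It does not draw random samples; it enumerates every walk and accumulates exact probabilities, assuming each branch is chosen with probability proportional to its COUNT and the walk stops once at most k tuples match. The function names and the stopping rule as coded here are assumptions made for illustration.

```python
from fractions import Fraction
from typing import Dict, List, Tuple

# Running example: four tuples over Boolean attributes A1, A2, A3.
ROWS: Dict[str, Tuple[int, ...]] = {
    "t1": (0, 0, 1),
    "t2": (0, 1, 0),
    "t3": (0, 1, 1),
    "t4": (1, 1, 0),
}
WIDTH = 3  # number of attributes


def matching(prefix: Tuple[int, ...]) -> List[str]:
    """Names of tuples whose first len(prefix) attributes equal prefix."""
    return [name for name, row in ROWS.items() if row[: len(prefix)] == prefix]


def reach_probabilities(k: int = 1) -> Dict[str, Fraction]:
    """Exact probability that a COUNT-weighted walk returns each tuple."""
    probs = {name: Fraction(0) for name in ROWS}

    def walk(prefix: Tuple[int, ...], p: Fraction) -> None:
        names = matching(prefix)
        # Valid query (at most k results) or fully specified: pick uniformly.
        if len(names) <= k or len(prefix) == WIDTH:
            for name in names:
                probs[name] += p / len(names)
            return
        # Overflow: descend into each non-empty branch, weighted by COUNT.
        for value in (0, 1):
            child = prefix + (value,)
            child_count = len(matching(child))
            if child_count:
                walk(child, p * Fraction(child_count, len(names)))

    walk((), Fraction(1))
    return probs


probabilities = reach_probabilities(k=1)
```

Under these assumptions every tuple in the running example is reached with probability 1/4, which matches the products 3/4 * 2/3 * 1/2 and 3/4 * 1/3 shown on the slides.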


COUNT-DECISION-TREE

• First result of the paper
• Two main ideas for improving the efficiency of sampling
  – Utilizing query history
  – Attribute order tree → decision tree


UTILIZING QUERY HISTORY: BASIC IDEA

[Figure: query tree over A1, A2, A3 with leaves 000 to 111.]


UTILIZING QUERY HISTORY: SAVING

• Saving from query history is significant
  – When the minimum domain size is b, the saving for collecting s samples is: [equation not preserved]
• Expected saving
  – 5,000 samples from a 100,000-tuple i.i.d. Boolean database with uniform distribution: 49.90% (83,048 → 41,610)
  – b = 5: 62.62% (143,067 → 53,317)
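The idea of saving queries through history can be illustrated with a small cache in front of a COUNT interface. This is a minimal sketch under stated assumptions: `CountInterface`, `WithHistory`, and `sample_once` are hypothetical names, the database is the four-tuple running example, and the walk queries only the 0-branch because the 1-branch COUNT follows from the parent COUNT.

```python
import random
from typing import Dict, List, Tuple

Prefix = Tuple[int, ...]
ROWS: List[Prefix] = [(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 0)]


class CountInterface:
    """Toy COUNT interface that tallies every query it answers."""

    def __init__(self, rows: List[Prefix]):
        self._rows = rows
        self.issued = 0

    def count(self, prefix: Prefix) -> int:
        self.issued += 1
        return sum(1 for row in self._rows if row[: len(prefix)] == prefix)


class WithHistory:
    """Answers repeated queries from history instead of re-issuing them."""

    def __init__(self, backend: CountInterface):
        self._backend = backend
        self._history: Dict[Prefix, int] = {}

    def count(self, prefix: Prefix) -> int:
        if prefix not in self._history:
            self._history[prefix] = self._backend.count(prefix)
        return self._history[prefix]


def sample_once(interface, rng: random.Random, k: int = 1, width: int = 3) -> Prefix:
    """One COUNT-weighted drill-down; returns the prefix of a valid query."""
    prefix: Prefix = ()
    total = interface.count(prefix)
    while total > k and len(prefix) < width:
        zeros = interface.count(prefix + (0,))
        # The 1-branch COUNT is total - zeros, so it costs no extra query.
        bit = 0 if rng.random() * total < zeros else 1
        prefix += (bit,)
        total = zeros if bit == 0 else total - zeros
    return prefix


def queries_for(samples: int, use_history: bool) -> int:
    backend = CountInterface(ROWS)
    interface = WithHistory(backend) if use_history else backend
    rng = random.Random(0)
    for _ in range(samples):
        sample_once(interface, rng)
    return backend.issued


without_history = queries_for(100, use_history=False)
with_history = queries_for(100, use_history=True)
```

On this tiny example the cached variant can never issue more than four distinct queries, while the uncached variant pays for at least two queries per walk. The percentages on the slide come from much larger databases and are not reproduced by this sketch.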


ATTRIBUTE ORDER TREE → DECISION TREE: BASIC IDEA

      A1  A2  A3
t1     0   0   1
t2     0   1   0
t3     0   1   1
t4     1   1   0

Attribute order tree: the same order A1, A2, A3 on every path; leaves 000 to 111, with t1 at 001, t2 at 010, t3 at 011, t4 at 110.

Decision tree: A3 at the root; under A3 = 0, split on A1 (0: t2, 1: t4); under A3 = 1, split on A2 (0: t1, 1: t3).

Note: Not to be confused with a decision tree for classification.
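The two trees on this slide can be compared by how deep a walk must go to isolate each tuple. The sketch below encodes both trees by hand from the running example and computes the average leaf depth; the nested-tuple encoding and the function names are assumptions made for illustration, and average depth is used only as a rough proxy for query cost.

```python
from fractions import Fraction
from typing import Dict, Tuple, Union

ROWS: Dict[str, Tuple[int, int, int]] = {
    "t1": (0, 0, 1),
    "t2": (0, 1, 0),
    "t3": (0, 1, 1),
    "t4": (1, 1, 0),
}

# Internal node: (attribute index, {value: subtree}); leaf: tuple name.
Tree = Union[str, tuple]

# Fixed attribute order A1, A2, A3 (indices 0, 1, 2).
ORDER_TREE: Tree = (0, {0: (1, {0: "t1", 1: (2, {0: "t2", 1: "t3"})}), 1: "t4"})

# Decision tree from the slide: A3 at the root, then A1 or A2.
DECISION_TREE: Tree = (2, {0: (0, {0: "t2", 1: "t4"}), 1: (1, {0: "t1", 1: "t3"})})


def route(tree: Tree, row: Tuple[int, int, int]) -> str:
    """Follow a tuple's attribute values down to its leaf."""
    while not isinstance(tree, str):
        attribute, children = tree
        tree = children[row[attribute]]
    return tree


def leaf_depths(tree: Tree, depth: int = 0) -> Dict[str, int]:
    """Depth of every leaf, counted in drill-down steps from the root."""
    if isinstance(tree, str):
        return {tree: depth}
    _, children = tree
    depths: Dict[str, int] = {}
    for subtree in children.values():
        depths.update(leaf_depths(subtree, depth + 1))
    return depths


def average_depth(tree: Tree) -> Fraction:
    depths = leaf_depths(tree)
    return Fraction(sum(depths.values()), len(depths))


order_depth = average_depth(ORDER_TREE)
decision_depth = average_depth(DECISION_TREE)
```

For the running example the fixed-order tree needs 9/4 steps per tuple on average, while the decision tree needs exactly 2, since choosing a different attribute in each subtree keeps the tree balanced.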


CONSTRUCTING AN OPTIMAL DECISION TREE: TWO MAIN CHALLENGES

• The problem is hard even if one has access to the entire database
  – When k = 1, to collect s = 1 sample, the construction of an optimal decision tree over a Boolean DB is equivalent to a well-known NP-hard problem of constructing an optimal decision tree for entity identification in the database.
• Furthermore, without knowledge of the database, construction must be done on the fly
• Note: Tree construction costs queries too!


INTUITION OF A HEURISTIC ALGORITHM

The decision-tree construction algorithm must consider query history.

Consider the number of unique queries required to acquire an infinite number of samples.

Observation I: m – 1 = 7 queries are required for Trees A, B, and C (m = 8 is the number of tuples).
Observation II: Empty leaves in Tree D lead to more required queries.


DECISION TREES FOR COLLECTING A FINITE NUMBER OF TUPLES

• Loss: empty leaves
• Saving: since the number of samples to be collected is finite, not all m – 1 queries need to be issued.
• Constructing an optimal decision tree
  – Given the number of samples to be collected, maximize (Saving – Loss)

[Figure labels: possible saving, possible loss]


A GREEDY HEURISTIC ALGORITHM

[Equations not preserved: definitions of Saving, Loss, Expense, and Net Saving Per Expense, followed by a bound on total cost stated as an if/then condition.]


COMPUTATION OF SER

• How to compute branch COUNTs (i.e., |uj|) for all candidates of a node?
  – Exact computation defeats the purpose of minimizing cost
  – Fortunately, in many cases a rough estimate is enough, e.g., a fanout of 2 vs. 10
  – Unfortunately, a uniform assumption does not suffice
  – Proposed solution: issue a small number of marginal queries first, then apply a conditional independence assumption
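One plausible reading of the proposed solution is sketched below: marginal COUNTs for a candidate attribute are obtained once, and the branch COUNTs at any node are then estimated by scaling the node's COUNT by each value's marginal share. The function name and the numbers in the usage lines are hypothetical, and the slides do not spell out the exact estimator, so this is an assumption rather than the authors' formula.

```python
from typing import Dict


def estimate_branch_counts(node_count: int,
                           marginal_counts: Dict[str, int]) -> Dict[str, float]:
    """Estimate |u_j| for each value of a candidate attribute.

    Assumes the attribute is independent of the predicates already on the
    path, so its value distribution at the node matches its marginal
    distribution over the whole database.
    """
    total = sum(marginal_counts.values())
    return {value: node_count * count / total
            for value, count in marginal_counts.items()}


# Hypothetical marginal COUNTs for [Transmission], from two marginal queries.
transmission_marginals = {"Automatic": 12000, "Manual": 3000}

# Estimated branch COUNTs at a node that is known to hold 500 tuples.
estimates = estimate_branch_counts(500, transmission_marginals)
```

The estimates cost no additional queries at the node itself, which is the point of the approach: only the marginal queries are paid for, and they are shared by every node that considers the attribute.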


ALGORITHM COUNT-DECISION-TREE

[Figure sequence, four slides: random walks over a decision tree built on the fly, with attributes [Transmission] (Automatic, Manual), [Make] (Honda, Toyota, Ford, VW, Nissan), and [Price Segment] (High, Medium, Low).]

• At each node along a walk, select the attribute that maximizes SER.
• Drill down until a sample is found.
• Go back to the root and start another walk.
• Later walks reuse earlier query answers (saving from history).


TWO MAIN IDEAS OF ALERT-HYBRID

• Use a small number of pilot samples to estimate COUNT
  – Motivation: COUNT eliminates bias for sampling, but is unavailable for Top-k-ALERT interfaces.
• On-the-fly switch from COUNT-DECISION-TREE to ALERT-ORDER during the drill-down process for collecting the remaining samples
  – Motivation: an inaccurate COUNT may introduce additional bias, so switch when confidence is low
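The two ideas can be combined in a small sketch: pilot samples stand in for the missing COUNT, and the sampler falls back when too few pilot samples support the estimate. The function name, the pilot data, and the rule "switch when fewer than `threshold` pilot samples match" are assumptions made for illustration; the slides only state that the switch happens when confidence in the estimated COUNT is low.

```python
from typing import Dict, List, Optional

Row = Dict[str, str]


def pilot_branch_weights(pilot: List[Row],
                         predicate: Dict[str, str],
                         attribute: str,
                         threshold: int) -> Optional[Dict[str, float]]:
    """Estimate branch proportions for `attribute` under `predicate`.

    Returns None when fewer than `threshold` pilot samples match, which the
    caller treats as the signal to switch to ALERT-ORDER.
    """
    matching = [row for row in pilot
                if all(row[a] == v for a, v in predicate.items())]
    if len(matching) < threshold:
        return None
    tallies: Dict[str, int] = {}
    for row in matching:
        tallies[row[attribute]] = tallies.get(row[attribute], 0) + 1
    return {value: n / len(matching) for value, n in tallies.items()}


# Hypothetical pilot samples.
PILOT: List[Row] = [
    {"Transmission": "Automatic", "Make": "Honda"},
    {"Transmission": "Automatic", "Make": "Honda"},
    {"Transmission": "Automatic", "Make": "Toyota"},
    {"Transmission": "Automatic", "Make": "VW"},
    {"Transmission": "Manual", "Make": "VW"},
]

automatic = pilot_branch_weights(PILOT, {"Transmission": "Automatic"}, "Make", 3)
manual = pilot_branch_weights(PILOT, {"Transmission": "Manual"}, "Make", 3)
```

With four matching pilot samples the Automatic branch yields usable weights; with only one, the Manual branch returns `None`, so a caller would stop trusting estimated COUNTs there and continue the drill-down with ALERT-ORDER.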


EXPERIMENTAL SETUP

• Synthetic Boolean-iid:
  – 200,000 tuples, 80 attributes, p = 0.25
• Synthetic Boolean-mixed:
  – 200,000 tuples, 40 independent attributes
  – 5 with uniform distribution; the others have p from 1/160 to 35/160 in steps of 1/160
• Yahoo! Auto: http://autos.yahoo.com
  – 15,211 tuples, 32 Boolean and 6 categorical attributes
  – Domain size ranges from 5 to 447
• Census: UCI Data Mining Archive
  – 1990 census data; after removing all attributes with domain size > 100: 12 attributes and 32,561 tuples
  – Domain size ranges from Boolean to 92
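The two synthetic datasets can be regenerated along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, p is taken to be the probability that an attribute equals 1, and "uniform distribution" for a Boolean attribute is read as p = 0.5. The usage line generates 1,000 tuples to keep the example light; the setup above uses 200,000.

```python
import random
from typing import List, Tuple


def boolean_rows(n: int, probabilities: List[float],
                 seed: int = 0) -> List[Tuple[int, ...]]:
    """Draw n tuples; attribute i is 1 with probability probabilities[i]."""
    rng = random.Random(seed)
    return [tuple(1 if rng.random() < p else 0 for p in probabilities)
            for _ in range(n)]


def iid_probabilities() -> List[float]:
    """Boolean-iid: 80 attributes, each with p = 0.25."""
    return [0.25] * 80


def mixed_probabilities() -> List[float]:
    """Boolean-mixed: 5 uniform attributes, then p = 1/160 ... 35/160."""
    return [0.5] * 5 + [i / 160 for i in range(1, 36)]


iid_sample = boolean_rows(1000, iid_probabilities())
mixed_sample = boolean_rows(1000, mixed_probabilities())
```

The mixed setting yields 5 + 35 = 40 attributes, which agrees with the attribute count given in the setup.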


EFFICIENCY OF COUNT-DECISION-TREE VS. COUNT-ORDER


EFFICIENCY AND BIAS OF ALERT-HYBRID VS. ALERT-ORDER


ILLUSTRATION OF ALERT-HYBRID


RELATED WORK

Crawling hidden text databases [BB98, AIG03, NZC05]

Extracting data from hidden structured databases [RG01, LES+02, ARP+07]

Sampling search engine’s index using a public interface [BB98, BJ04, BG06, BG07]

Sampling hidden structured databases [DDM07]


CONCLUSION

• Main Technical Contribution 1: COUNT-DECISION-TREE
  – Unbiased sampling algorithm for Top-k-COUNT interfaces
  – Orders of magnitude more efficient than the existing algorithms
• Main Technical Contribution 2: ALERT-HYBRID
  – Sampling algorithm with slight bias for Top-k-ALERT interfaces
  – Orders of magnitude more efficient, with smaller bias, than the existing algorithms


CONCLUSION

• Our studies unveil powerful techniques for performing data analytics over hidden databases
  – Owners of hidden databases may be extremely concerned about the privacy of aggregates over their databases.
• How to reveal individual tuples truthfully and efficiently while hiding aggregated views of the data?
  – A. Dasgupta, N. Zhang, G. Das, and S. Chaudhuri, Privacy Preservation of Aggregates in Hidden Databases: Why and How? SIGMOD 2009.


THANK YOU


BACKUP SLIDES


EFFICIENCY AND BIAS OF COUNT-DECISION-TREE VS. ALERT-ORDER


TWO TYPES OF HIDDEN DATABASE INTERFACE

• Top-k-ALERT
  – MSN Stock Screener (k = 24)
    • http://moneycentral.msn.com/investor/finder/customstocks.asp
  – Microsoft Solution Finder (k = 500)
    • https://solutionfinder.microsoft.com/Solutions/SolutionsDirectory.aspx?mode=searchproblem
• Top-k-COUNT
  – MSN Careers (k = 4,000)
    • http://msn.careerbuilder.com/JobSeeker/Jobs/JobFindAdv.aspx


SETTINGS OF S1 AND CS


COUNT-DECISION-TREE VS. ALERT-RANDOM


IMPROVEMENT BY CONSIDERING HISTORY

Note: Almost unrelated to k


IMPROVEMENT BY DECISION TREE


ALGORITHM ALERT-HYBRID

[Figure sequence, two slides: a walk over a tree with attributes [Transmission] (|Automatic|, |Manual|), [Price Segment] (|High|, |Medium|, |Low|), [Make] (|Honda|, |Toyota|, |VW|, |Nissan|), and [State] (TX, VA, NY, CA).]

• Start with COUNT-DECISION-TREE: at each node, select the attribute that maximizes SER.
• When the count falls below THRESHOLD, start ALERT-ORDER and continue the drill-down on [State].
• Go back to the root and start another walk.
