LEVERAGING COUNT INFORMATION IN SAMPLING HIDDEN DATABASES
Presenter: Nan Zhang, George Washington Univ.
Joint work with Arjun Dasgupta and Gautam Das, The University of Texas at Arlington
OUTLINE
• Introduction
• Baseline Algorithm
• COUNT-DECISION-TREE
• ALERT-HYBRID
• Experimental Results
• Related Work
• Conclusion
THE DEEP WEB
• Deep Web vs. Surface Web
– Dynamic content, unlinked pages, private web, contextual web, etc.
– Estimated size [1]: 91,850 vs. 167 terabytes
[1] SIMS, UC Berkeley, How Much Information? 2003. http://www.sims.berkeley.edu/research/projects/how-much-info-2003/
HIDDEN DATABASES
• Form-like interface
• Returns the top-k tuples
SAMPLING HIDDEN DATABASES THROUGH PUBLIC INTERFACES
• Problem definition
– Given such restricted query interfaces, how can one efficiently obtain a uniform random sample of the backend database while accessing it only through the public front-end interface?
• Applications
– In which geographic area, and for which industry, do MSN Careers’ job sources have especially low presence?
– Which flight on which date is more likely to be relatively empty?
– What is the real size of the hidden database?
PERFORMANCE MEASURES OF SAMPLING HIDDEN DATABASES
• Sample bias
– Over- or under-representing a portion of the population
– Objective: minimize sample bias
• Efficiency (query cost)
– The number of queries issued to the web interface of a hidden database
– Note: many hidden databases charge for each issued query or limit the number of queries one can issue per day.
– Objective: minimize query cost
TWO TYPES OF HIDDEN DATABASE INTERFACE
Depending on how overflowing query results are displayed:
• TOP-k-ALERT (displays an overflow flag only)
– Example result page: "Showing 481-500 of more than 500 results"
TWO TYPES OF HIDDEN DATABASE INTERFACE
Depending on how overflowing query results are displayed:
• TOP-k-COUNT (displays the real COUNT)
– Example result page: "Showing 481-500 of 15,167 results"
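The difference between the two interface types can be sketched with a small in-memory simulation. This is only an illustration: the `Result` type, the `query` function, and the `show_count` switch are hypothetical names, and the four tuples are the ones from the running example used later in the slides.

```python
from typing import Dict, List, NamedTuple, Optional

Row = Dict[str, int]


class Result(NamedTuple):
    rows: List[Row]        # at most k tuples are ever returned
    overflow: bool         # True when more than k tuples match
    count: Optional[int]   # real COUNT (Top-k-COUNT only), otherwise None


def query(db: List[Row], conds: Row, k: int, show_count: bool) -> Result:
    """Answer a conjunctive query the way a top-k form interface might (assumed model)."""
    matches = [r for r in db if all(r[a] == v for a, v in conds.items())]
    return Result(
        rows=matches[:k],
        overflow=len(matches) > k,
        count=len(matches) if show_count else None,
    )


# Tuples t1..t4 of the running example.
DB = [
    {"A1": 0, "A2": 0, "A3": 1},
    {"A1": 0, "A2": 1, "A3": 0},
    {"A1": 0, "A2": 1, "A3": 1},
    {"A1": 1, "A2": 1, "A3": 0},
]

alert = query(DB, {"A1": 0}, k=2, show_count=False)  # Top-k-ALERT behaviour
count = query(DB, {"A1": 0}, k=2, show_count=True)   # Top-k-COUNT behaviour
print(alert.overflow, alert.count)
print(count.overflow, count.count)
```

Both calls return the same two tuples and the same overflow flag; only the Top-k-COUNT variant also reveals how many tuples actually match.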
OUTLINE OF TECHNICAL RESULTS
• Existing work
– HIDDEN-DB-SAMPLER [DDM07]
• Our results
– COUNT-DECISION-TREE: an efficient unbiased sampling algorithm for Top-k-COUNT interfaces
– ALERT-HYBRID: an efficient sampling algorithm with slight bias for Top-k-ALERT interfaces
A RUNNING EXAMPLE
      A1  A2  A3
t1    0   0   1
t2    0   1   0
t3    0   1   1
t4    1   1   0

Leaves of the query tree (values of A1 A2 A3): 000, 001 (t1), 010 (t2), 011 (t3), 100, 101, 110 (t4), 111
BASELINE ALGORITHM: COUNT-ORDER
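[Figure: query tree that drills down in the order A1, A2, A3. Example node queries: A1 = 0; A1 = 1; A1 = 0 & A2 = 0; A1 = 0 & A2 = 1; A1 = 0 & A2 = 0 & A3 = 0; A1 = 0 & A2 = 1 & A3 = 1. Legend: each node is valid, underflow, or overflow.]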
BASELINE ALGORITHM: COUNT-ORDER
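[Figure: query tree over leaves 000 ... 111 with attribute order A1, A2, A3 and the COUNT of each branch.]
• A1: Count(A1 = 0) = 3, Count(A1 = 1) = 1; choose A1 = 0 with probability 3/4
• A2: Count(A1 = 0 & A2 = 0) = 1, Count(A1 = 0 & A2 = 1) = 2; choose A2 = 1 with probability 2/3
• A3: each branch under A1 = 0 & A2 = 1 has Count = 1; choose either with probability 1/2
• Probability of reaching the tuple: 3/4 * 2/3 * 1/2 = 1/4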
BASELINE ALGORITHM: COUNT-ORDER
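[Figure: the same query tree, with a walk that stops before reaching A3.]
• A1: Count(A1 = 0) = 3, Count(A1 = 1) = 1; choose A1 = 0 with probability 3/4
• A2: Count(A1 = 0 & A2 = 0) = 1, Count(A1 = 0 & A2 = 1) = 2; choose A2 = 0 with probability 1/3
• Probability of reaching the tuple: 3/4 * 1/3 = 1/4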
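The count-weighted drill-down on the running example can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `count` helper stands in for the COUNT reported by the interface, k = 1 is assumed, tuples are assumed distinct, and picking uniformly among the tuples of a valid query is an assumption of this sketch.

```python
import random
from fractions import Fraction
from typing import List, Tuple

Row = Tuple[int, ...]
DB: List[Row] = [(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 0)]  # t1..t4


def count(prefix: Row) -> int:
    """Stand-in for the COUNT the interface reports for A1..Ai = prefix."""
    return sum(1 for row in DB if row[:len(prefix)] == prefix)


def count_order_walk(k: int = 1, rng=random) -> Row:
    """One drill-down in the fixed order A1, A2, A3."""
    prefix: Row = ()
    # Keep drilling while the query overflows and attributes remain.
    while count(prefix) > k and len(prefix) < len(DB[0]):
        branches = [prefix + (v,) for v in (0, 1)]
        weights = [count(b) for b in branches]  # branch chosen with prob COUNT/parent COUNT
        prefix = rng.choices(branches, weights=weights)[0]
    matches = [row for row in DB if row[:len(prefix)] == prefix]
    return rng.choice(matches)  # valid query: take one of the returned tuples


def selection_probability(target: Row, k: int = 1) -> Fraction:
    """Exact probability that one walk returns `target`."""
    prefix: Row = ()
    prob = Fraction(1)
    while count(prefix) > k and len(prefix) < len(target):
        nxt = target[:len(prefix) + 1]
        prob *= Fraction(count(nxt), count(prefix))
        prefix = nxt
    return prob / count(prefix)


probabilities = {row: selection_probability(row) for row in DB}
sample = count_order_walk()
print(probabilities)
```

The exact probabilities reproduce the products on the slides: every tuple is reached with probability 1/4, whether the walk stops after one, two, or three attributes.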
COUNT-DECISION-TREE
• First result of the paper
• Two main ideas for improving the efficiency of sampling
– Utilizing query history
– Attribute order tree → decision tree
UTILIZING QUERY HISTORY: BASIC IDEA
[Figure: query tree over leaves 000 ... 111 with attribute order A1, A2, A3.]
UTILIZING QUERY HISTORY: SAVING
• Saving from query history is significant
– When the minimum domain size is b, the saving for collecting s samples is: [equation not preserved]
• Expected saving
– 5,000 samples from a 100,000-tuple i.i.d. Boolean database with uniform distribution: 49.90% (83,048 → 41,610)
– b = 5: 62.62% (143,067 → 53,317)
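A minimal sketch of how query history can save cost: every COUNT already obtained is cached, so later walks pay only for nodes they visit for the first time. The `CountingInterface` class and `draw_samples` function are hypothetical names, the database is the four-tuple running example, and k = 1 is assumed.

```python
import random
from typing import Dict, List, Tuple

Row = Tuple[int, ...]
DB: List[Row] = [(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 0)]  # t1..t4


class CountingInterface:
    """Simulated Top-k-COUNT interface that remembers every answer it has given."""

    def __init__(self, db: List[Row]) -> None:
        self.db = db
        self.lookups = 0                      # COUNTs the sampler asked for
        self.issued = 0                       # queries actually sent to the interface
        self.history: Dict[Row, int] = {}     # query (prefix) -> COUNT

    def count(self, prefix: Row) -> int:
        self.lookups += 1
        if prefix not in self.history:        # only unseen queries cost anything
            self.issued += 1
            self.history[prefix] = sum(
                1 for row in self.db if row[:len(prefix)] == prefix
            )
        return self.history[prefix]


def draw_samples(iface: CountingInterface, s: int, k: int = 1, seed: int = 0) -> List[Row]:
    """Collect s samples by repeated count-weighted drill-downs."""
    rng = random.Random(seed)
    width = len(iface.db[0])
    samples = []
    for _ in range(s):
        prefix: Row = ()
        while iface.count(prefix) > k and len(prefix) < width:
            branches = [prefix + (v,) for v in (0, 1)]
            weights = [iface.count(b) for b in branches]
            prefix = rng.choices(branches, weights=weights)[0]
        matches = [row for row in iface.db if row[:len(prefix)] == prefix]
        samples.append(rng.choice(matches))
    return samples


iface = CountingInterface(DB)
samples = draw_samples(iface, s=100)
print(iface.lookups, iface.issued)
```

In this toy setting the number of issued queries is bounded by the number of distinct tree nodes, however many samples are drawn, while the number of lookups keeps growing with s.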
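ATTRIBUTE ORDER TREE → DECISION TREE: BASIC IDEA
[Figure: left, the attribute-order tree (A1, A2, A3) over leaves 000 ... 111, which hold t1, t2, t3, and t4. Right, a decision tree rooted at A3: under A3 = 0, split on A1 to reach t2 (A1 = 0) and t4 (A1 = 1); under A3 = 1, split on A2 to reach t1 (A2 = 0) and t3 (A2 = 1).]

      A1  A2  A3
t1    0   0   1
t2    0   1   0
t3    0   1   1
t4    1   1   0

Note: Not to be confused with a decision tree for classification.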
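The mechanics of the two trees can be compared with a short sketch. It is illustrative only: `walk_stats`, `fixed_order`, and `decision_tree` are hypothetical names, k = 1 is assumed, and the per-node attribute choices are hard-coded from the figure rather than derived by any heuristic.

```python
from fractions import Fraction
from typing import Callable, Dict, List, Optional, Tuple

Conds = Dict[str, int]
DB: Dict[str, Conds] = {
    "t1": {"A1": 0, "A2": 0, "A3": 1},
    "t2": {"A1": 0, "A2": 1, "A3": 0},
    "t3": {"A1": 0, "A2": 1, "A3": 1},
    "t4": {"A1": 1, "A2": 1, "A3": 0},
}


def matching(conds: Conds) -> List[str]:
    """Names of the tuples that satisfy every condition in conds."""
    return [n for n, row in DB.items() if all(row[a] == v for a, v in conds.items())]


def walk_stats(
    choose_attr: Callable[[Conds], str],
    conds: Optional[Conds] = None,
    prob: Fraction = Fraction(1),
    k: int = 1,
) -> Dict[str, Tuple[int, Fraction]]:
    """Map each tuple to (depth, selection probability) under a count-weighted drill-down."""
    conds = conds or {}
    here = matching(conds)
    if len(here) <= k:  # valid query: stop and pick one returned tuple uniformly
        return {name: (len(conds), prob / len(here)) for name in here}
    stats: Dict[str, Tuple[int, Fraction]] = {}
    attr = choose_attr(conds)
    for value in (0, 1):
        child = {**conds, attr: value}
        n = len(matching(child))
        if n:  # empty branches are never followed
            stats.update(walk_stats(choose_attr, child, prob * Fraction(n, len(here)), k))
    return stats


def fixed_order(conds: Conds) -> str:
    """Attribute-order tree: always A1, then A2, then A3."""
    return ("A1", "A2", "A3")[len(conds)]


def decision_tree(conds: Conds) -> str:
    """Decision tree from the figure: A3 at the root, then A1 or A2 depending on A3."""
    if "A3" not in conds:
        return "A3"
    return "A1" if conds["A3"] == 0 else "A2"


order_stats = walk_stats(fixed_order)
tree_stats = walk_stats(decision_tree)
print(order_stats)
print(tree_stats)
```

Both trees reach every tuple with probability 1/4. In this toy example the decision tree reaches every tuple at depth 2, whereas the fixed order needs depths 1 to 3; the example is too small to show a difference in the number of distinct queries.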
CONSTRUCTING AN OPTIMAL DECISION TREE: TWO MAIN CHALLENGES
• The problem is hard even if one has access to the entire database
– When k = 1, to collect s = 1 sample, constructing an optimal decision tree over a Boolean DB is equivalent to the well-known NP-hard problem of constructing an optimal decision tree for entity identification in the database.
• Furthermore, without knowledge of the database, construction must be done on the fly
• Note: Tree construction costs queries too!
INTUITION OF A HEURISTIC ALGORITHM
• The decision-tree construction algorithm must consider query history
• Consider the number of unique queries required to acquire an infinite number of samples.
– Observation I: m – 1 = 7 queries are required for Trees A, B, and C (m = 8 is the number of tuples)
– Observation II: Empty leaves in Tree D lead to more required queries.
DECISION TREES FOR COLLECTING A FINITE NUMBER OF TUPLES
• Loss: empty leaves
• Saving: since the number of samples to be collected is finite, not all m – 1 queries need to be issued.
• Constructing an optimal decision tree
– Given the number of samples to be collected, maximize (Saving – Loss)
[Figure: possible saving vs. possible loss]
A GREEDY HEURISTIC ALGORITHM
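• Saving: [formula not preserved]
• Loss: [formula not preserved]
• Expense: [formula not preserved]
• Net Saving Per Expense: if [condition not preserved] then total cost <= [bound not preserved]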
COMPUTATION OF SER
• How to compute branch COUNTs (i.e., |uj|) for all candidates of a node?
– Exact computation defeats the entire purpose of minimizing cost
– Fortunately, in many cases a rough estimation is enough, e.g., fanout of 2 vs. 10
– Unfortunately, a uniform assumption does not suffice
– Proposed solution: issue a small number of marginal queries first, then apply a conditional independence assumption
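One plausible reading of the proposed solution is sketched below: the marginal COUNT of each value of a candidate attribute is queried once, and the branch COUNTs under any node are then estimated by scaling the node's COUNT with the marginal proportions. The function name, the exact estimator, and the numbers are assumptions made for illustration; the slides do not spell out the formula.

```python
from typing import Dict


def estimate_branch_counts(node_count: int, marginal_counts: Dict[str, int]) -> Dict[str, float]:
    """Estimate |u_j| for each branch of a candidate attribute.

    Assumes the attribute is independent of the conditions already fixed at the
    node, so each branch receives the node's COUNT scaled by the marginal share.
    """
    total = sum(marginal_counts.values())
    return {value: node_count * c / total for value, c in marginal_counts.items()}


# Hypothetical marginal COUNTs for one attribute, obtained by one query per value.
make_marginals = {"Honda": 500, "Toyota": 300, "Ford": 200}

# Hypothetical node whose own COUNT is already known from the drill-down.
estimates = estimate_branch_counts(node_count=120, marginal_counts=make_marginals)
print(estimates)
```

The marginal queries are paid for once and reused at every node, which is what keeps the estimation cheap compared with querying each candidate branch exactly.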
ALGORITHM COUNT-DECISION-TREE
[Figure sequence: drill-down over a car database with attributes [Transmission] (Automatic, Manual), [Make] (Honda, Toyota, Ford, VW, Nissan), and [Price Segment] (High, Medium, Low).]
• At each node, select the attribute that maximizes SER.
• Drill down until a sample is found.
• Go back to the root and start another walk!
• Later walks reuse the nodes already queried (saving from history).
TWO MAIN IDEAS OF ALERT-HYBRID
• Use a small number of pilot samples to estimate COUNT
– Motivation: COUNT eliminates bias for sampling, but is unavailable for Top-k-ALERT interfaces.
• Switch on the fly from COUNT-DECISION-TREE to ALERT-ORDER during the drill-down process when collecting the remaining samples
– Motivation: an inaccurate COUNT may introduce additional bias, so switch when confidence is low
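The two ideas can be sketched as follows, under loud assumptions: the number of pilot samples matching a query is used as a proxy for its COUNT, and the walk switches modes once that number drops below a threshold. The function names, the pilot tuples, and the threshold value are all hypothetical; the slides do not specify how the estimate or the threshold is computed, and ALERT-ORDER itself is not implemented here.

```python
from typing import Dict, List

Row = Dict[str, str]


def pilot_support(pilot: List[Row], conds: Row) -> int:
    """Number of pilot samples that satisfy the query (proxy for its COUNT)."""
    return sum(1 for r in pilot if all(r[a] == v for a, v in conds.items()))


def choose_mode(pilot: List[Row], conds: Row, threshold: int) -> str:
    """Keep using estimated COUNTs while they are well supported; otherwise switch."""
    if pilot_support(pilot, conds) >= threshold:
        return "COUNT-DECISION-TREE"
    return "ALERT-ORDER"


def branch_weights(pilot: List[Row], conds: Row, attr: str, values: List[str]) -> Dict[str, int]:
    """Estimated relative COUNT of each branch, taken from the pilot samples."""
    return {v: pilot_support(pilot, {**conds, attr: v}) for v in values}


# Hypothetical pilot samples.
pilot = [
    {"Transmission": "Automatic", "Make": "Honda"},
    {"Transmission": "Automatic", "Make": "Toyota"},
    {"Transmission": "Automatic", "Make": "Honda"},
    {"Transmission": "Manual", "Make": "VW"},
]

mode_auto = choose_mode(pilot, {"Transmission": "Automatic"}, threshold=2)
mode_manual = choose_mode(pilot, {"Transmission": "Manual"}, threshold=2)
weights = branch_weights(pilot, {"Transmission": "Automatic"}, "Make", ["Honda", "Toyota", "VW"])
print(mode_auto, mode_manual, weights)
```

Deeper nodes match fewer pilot samples, so their estimates become less reliable; a threshold on the support is one simple way to express "switch when confidence is low".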
EXPERIMENTAL SETUP
• Synthetic Boolean-iid:
– 200,000 tuples, 80 attributes, p = 0.25
• Synthetic Boolean-mixed:
– 200,000 tuples, 40 independent attributes
– 5 with uniform distribution; the others have p from 1/160 to 35/160 in steps of 1/160
• Yahoo! Auto: http://autos.yahoo.com
– 15,211 tuples, 32 Boolean, 6 categorical attributes
– Domain size ranges from 5 to 447
• Census: UCI Data Mining Archive
– 1990 census data; we remove all attributes with domain size > 100, leaving 12 attributes and 32,561 tuples
– Domain size ranges from Boolean to 92
EFFICIENCY OF COUNT-DECISION-TREE VS. COUNT-ORDER
EFFICIENCY AND BIAS OF ALERT-HYBRID VS. ALERT-ORDER
ILLUSTRATION OF ALERT-HYBRID
RELATED WORK
• Crawling hidden text databases [BB98, AIG03, NZC05]
• Extracting data from hidden structured databases [RG01, LES+02, ARP+07]
• Sampling a search engine’s index using a public interface [BB98, BJ04, BG06, BG07]
• Sampling hidden structured databases [DDM07]
CONCLUSION
• Main Technical Contribution 1: COUNT-DECISION-TREE
– Unbiased sampling algorithm for Top-k-COUNT interfaces
– Orders of magnitude more efficient than the existing algorithms
• Main Technical Contribution 2: ALERT-HYBRID
– Sampling algorithm with slight bias for Top-k-ALERT interfaces
– Orders of magnitude more efficient, with smaller bias, than the existing algorithms
CONCLUSION
• Our studies unveil powerful techniques to perform data analytics over hidden databases
– Hidden database owners may be extremely concerned about the privacy of aggregates over their hidden databases.
– How to reveal individual tuples truthfully and efficiently, but hide aggregated views of the data?
• A. Dasgupta, N. Zhang, G. Das, and S. Chaudhuri, Privacy Preservation of Aggregates in Hidden Databases: Why and How? SIGMOD 2009.
THANK YOU
BACKUP SLIDES
EFFICIENCY AND BIAS OF COUNT-DECISION-TREE VS. ALERT-ORDER
TWO TYPES OF HIDDEN DATABASE INTERFACE
• Top-k-ALERT
– MSN Stock Screener (k = 24)
• http://moneycentral.msn.com/investor/finder/customstocks.asp
– Microsoft Solution Finder (k = 500)
• https://solutionfinder.microsoft.com/Solutions/SolutionsDirectory.aspx?mode=searchproblem
• Top-k-COUNT
– MSN Careers (k = 4,000)
• http://msn.careerbuilder.com/JobSeeker/Jobs/JobFindAdv.aspx
SETTINGS OF S1 AND CS
COUNT-DECISION-TREE VS. ALERT-RANDOM
IMPROVEMENT BY CONSIDERING HISTORY
Note: Almost unrelated to k
IMPROVEMENT BY DECISION TREE
ALGORITHM ALERT-HYBRID
[Figure sequence: drill-down over a car database with attributes [Transmission] (|Automatic|, |Manual|), [Price Segment] (|High|, |Medium|, |Low|), [Make] (|Honda|, |Toyota|, |VW|, |Nissan|), and [State] (TX, VA, NY, CA).]
• Run COUNT-DECISION-TREE, selecting at each node the attribute that maximizes SER.
• When the count falls below THRESHOLD, start ALERT-ORDER and continue the drill-down on [State].
• Go back to the root and start another walk!