Deep web
Jianguo Lu
What is the deep web?
Also called the hidden/invisible web (or hidden databases), in contrast to the surface web, which is interconnected by hyperlinks.
Content is dynamically generated from a search interface by sending queries. The search interface can be an HTML form, a web service, …
Almost every surface web site has a companion deep web search box, e.g., Wikipedia, Amazon, nytimes, ibm.com, …
Deep web (pictures from http://www.osti.gov/fedsearch)
Deep web crawling
Crawl and index the deep web so that hidden data can be surfaced. Unlike the surface web, there are no hyperlinks to follow.
Two tasks:
- Find deep web data sources, i.e., HTML forms and web services. E.g., "Accessing the deep web: a survey", B. He et al., CACM, 2007.
- Given a data source, download the data from it. E.g., Google, Bing, UCLA papers.
Why deep web crawling and estimation?
Most web sites have a companion searchable, richer deep site; we need to profile (#docs, #words, distribution) and crawl these sites.
The population size of a deep web site is an important metric, for both general interest and competitive analysis:
- Which search engine is the largest? (Lawrence and Giles, Science, Apr 1998)
- Which bookstore/library has the most books?
- Which university hosts the most web pages?
- How many users does Facebook/Twitter/Weibo have?
- Which newspaper has more online articles? …
Applications: crawling and indexing by general-purpose search engines (Madhavan et al. (Google), VLDB 2008; Wu et al. (Bing), ICDE 2006), data integration, meta-crawlers, focused topic search engines, shop-bots, business intelligence, archiving, …
Graph model of deep web
A blue dot represents a document; a red dot represents a query. Documents can be retrieved only through queries.
Both the document degree and the query degree follow a power law.
The size of a blue dot is proportional to the reciprocal of its degree; the size of a red dot is proportional to the sum of the sizes of the blue dots it connects to.
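The degree and dot-size rules above can be sketched in a few lines. A minimal Python sketch; the toy query-to-document incidence table is an illustrative assumption, not data from the deck:

```python
# Bipartite graph model: documents (blue dots) vs. queries (red dots).
# Hypothetical query -> matched-documents table.
incidence = {
    "q1": {"d1", "d2", "d3"},
    "q2": {"d2", "d4"},
    "q3": {"d4"},
}

# Document degree = number of queries that match the document.
doc_degree = {}
for docs in incidence.values():
    for d in docs:
        doc_degree[d] = doc_degree.get(d, 0) + 1

# Blue-dot size: reciprocal of the document's degree.
doc_size = {d: 1.0 / k for d, k in doc_degree.items()}

# Red-dot size: sum of the sizes of the blue dots the query matches.
query_size = {q: sum(doc_size[d] for d in docs) for q, docs in incidence.items()}
```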
How difficult is it to download?
Important to both data providers and consumers. Sub-questions:
- Is it always possible to download all the documents?
- How many queries are needed to download all the documents? (popular queries, rare queries, volume of queries)
- Can we reduce the cost of downloading? What is the cost if we download part of the data, say 50%?
This is the same as the estimation problem: given the overlapping information, we can estimate the percentage of the data downloaded, and hence the population size.
Q1: Can we download all the documents?
We can if the graph is connected: use any graph traversal algorithm.
When is the graph connected? By the Erdos-Renyi random graph model, when the mean degree > ln(n).
We may not be able to download all the docs if there are many small documents, or if we use mainly rare queries.
Implications (detailed experiments needed): it is easy to download large docs such as books and articles; we may not be able to download all the micro-blogs, or a database of paper titles.
What if the graph is not connected? It is hard to guess the queries in the other components. The topology of a disconnected graph usually contains one large component and many small components.
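The ln(n) connectivity threshold can be checked with a small simulation; the graph size and seed below are arbitrary choices for illustration:

```python
# Erdos-Renyi G(n, p): a random graph becomes connected (w.h.p.) once the
# mean degree c = p*(n-1) exceeds ln(n). Components counted via union-find.
import math
import random

def num_components(n, mean_degree, seed=0):
    rng = random.Random(seed)
    p = mean_degree / (n - 1)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})

n = 300                                                  # ln(300) ~ 5.7
sparse = num_components(n, mean_degree=2.0)              # below the threshold
dense = num_components(n, mean_degree=3 * math.log(n))   # well above it
```

With these parameters the sparse graph fragments into many components (so a query-based crawl cannot reach everything), while the dense one collapses into a single component.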
Disconnected graph
Document graph constructed using all the dis legomena (words that occur exactly twice) in newsgroups20 (20k documents); many smaller islands are omitted.
There is only one large component, and many islands. It is a challenge to harvest all the data islands.
This is believed to be a universal graph topology, e.g., in social networks. The semantic web?
Q2: How many queries are needed?
It depends on the volume of the queries. The expected volume can be calculated using Zipf's law and Heaps' law.
# queries to cover all the documents
The number of queries increases with database size and decreases with document size. The top and bottom queries increase at different speeds.
Rule of thumb: hundreds of top queries.
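This rule of thumb can be illustrated with Zipf's law (the exponent of 1 and the counts below are assumptions, not the deck's measurements): the top-ranked queries carry a disproportionate share of the total result volume.

```python
# Share of total query volume contributed by the top-m queries when document
# frequencies follow Zipf's law df(r) ~ C / r (the constant C cancels out).
def zipf_share(m, n_terms):
    top = sum(1.0 / r for r in range(1, m + 1))
    total = sum(1.0 / r for r in range(1, n_terms + 1))
    return top / total

share = zipf_share(100, 10_000)   # top 100 of 10,000 queries
# Roughly half of the total volume comes from the top 1% of queries,
# which is why a few hundred top queries go a long way.
```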
Q3: How to reduce the crawling cost?
We obtain documents by sending queries, but we do not need to send all of them. Goal: select the queries that cover all the documents while minimizing the cost. This is an optimization problem: the set covering problem.
Set covering is NP-hard. There are many heuristic and approximation algorithms: greedy, Lagrangian relaxation, randomized greedy, genetic algorithms, … None of them studies the dependency on the data distribution.
The greedy algorithm selects the next best query, giving each document the same weight. But since document degree follows a power law, the document degree distribution is heavily skewed.
Weighted greedy: a document's weight is its probability of being visited in a Markov random walk. This is more scalable than integer programming products such as CPLEX.
Set covering problem: given a universe U and a family of subsets S = {S1, S2, …, Sn} of U, a cover is a subfamily of S whose union is U. Let J = {1, 2, …, n}; a subfamily indexed by J* ⊆ J is a cover if ∪_{j ∈ J*} Sj = U.
Set covering decision problem: the input is a pair (S, U) and an integer k; the question is whether there is a set covering of size k or less.
Set covering optimization problem: the input is a pair (S, U), and the task is to find a set covering that uses the fewest sets.
The decision version of set covering is NP-complete, and the optimization version is NP-hard.
Set covering
Set covering example
[Figure: terms t1, t2, t3 and the documents d1, d2, d3 they cover]
Set covering can be represented by a matrix or a hyper-graph. In the matrix representation, each row represents a term and each column represents a document; if A(i, j) is 1, term i can retrieve (covers) document j.
Greedy algorithm
At each step, select the set that covers the largest number of new elements.
The greedy algorithm is not optimal. In the example there can be two solutions:
- If the first set selected is t1, the solution is {t1, t2}, with cost 4.
- If the first selection is t2, the solution is {t2, t3}, with cost 3.
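Since the original figure for this example did not survive extraction, the instance below is a hypothetical reconstruction chosen so that it reproduces the two costs (4 and 3); the greedy score is new-documents/df, as in the traces later in the deck:

```python
# Hypothetical reconstruction of the t1/t2/t3 example (the original figure
# is lost); chosen so the two greedy runs cost 4 and 3 as on the slide.
sets = {"t1": {"d1", "d2"}, "t2": {"d2", "d3"}, "t3": {"d1"}}
universe = {"d1", "d2", "d3"}

def greedy_cover(first_choice_bias):
    """Greedy set cover scoring new/df; `first_choice_bias` breaks ties."""
    uncovered, chosen, cost = set(universe), [], 0
    while uncovered:
        best = max(sets, key=lambda t: (len(sets[t] & uncovered) / len(sets[t]),
                                        t == first_choice_bias))
        chosen.append(best)
        cost += len(sets[best])   # cost = df, documents retrieved by the query
        uncovered -= sets[best]
    return chosen, cost

print(greedy_cover("t1"))   # (['t1', 't2'], 4)
print(greedy_cover("t2"))   # (['t2', 't3'], 3)
```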
Weighted greedy algorithm
[Figure: a bipartite graph of queries q1–q5 (red) and documents d1–d9 (blue)]
     d1 d2 d3 d4 d5 d6 d7 d8 d9
q1    0  0  1  0  1  1  0  1  0
q2    0  0  0  0  0  1  0  1  0
q3    1  1  1  1  0  0  0  0  1
q4    0  1  0  0  0  1  1  0  1
q5    0  0  1  1  1  0  0  1  1
One solution obtained by greedy algorithm
[Figure: three snapshots of the graph as q5, then q4, then q3 are selected]
Step 1 (nothing covered yet):
     d1 d2 d3 d4 d5 d6 d7 d8 d9  df  new  new/df
q1    0  0  1  0  1  1  0  1  0   4   4   1
q2    0  0  0  0  0  1  0  1  0   2   2   1
q3    1  1  1  1  0  0  0  0  1   5   5   1
q4    0  1  0  0  0  1  1  0  1   4   4   1
q5    0  0  1  1  1  0  0  1  1   5   5   1

Step 2 (after q5 is selected):
     d1 d2 d3 d4 d5 d6 d7 d8 d9  df  new  new/df
q1    0  0  0  0  0  1  0  0  0   4   1   0.25
q2    0  0  0  0  0  1  0  0  0   2   1   0.5
q3    1  1  0  0  0  0  0  0  0   5   2   0.4
q4    0  1  0  0  0  1  1  0  0   4   3   0.75
q5    0  0  0  0  0  0  0  0  0   5   0   0

Step 3 (after q4 is selected):
     d1 d2 d3 d4 d5 d6 d7 d8 d9  df  new  new/df
q1    0  0  0  0  0  0  0  0  0   4   0   0
q2    0  0  0  0  0  0  0  0  0   2   0   0
q3    1  0  0  0  0  0  0  0  0   5   1   0.2
q4    0  0  0  0  0  0  0  0  0   4   0   0
q5    0  0  0  0  0  0  0  0  0   5   0   0
Total cost is 5+4+5=14
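The trace above can be reproduced with a few lines of Python; ties on new/df are broken toward the later query to match the deck's selection order (this tie-breaking rule is an assumption):

```python
# Greedy query selection on the deck's q1-q5 / d1-d9 matrix; reproduces the
# trace above: q5, then q4, then q3, total cost 5 + 4 + 5 = 14.
matrix = {
    "q1": "001011010",
    "q2": "000001010",
    "q3": "111100001",
    "q4": "010001101",
    "q5": "001110011",
}
sets = {q: {i for i, bit in enumerate(row) if bit == "1"}
        for q, row in matrix.items()}
universe = set(range(9))            # d1..d9 as 0..8

uncovered, chosen, cost = set(universe), [], 0
while uncovered:
    best, best_ratio = None, -1.0
    for q, docs in sets.items():
        ratio = len(docs & uncovered) / len(docs)   # new / df
        if ratio >= best_ratio:     # ">=": the later query wins ties (assumed)
            best, best_ratio = q, ratio
    chosen.append(best)
    cost += len(sets[best])         # cost = df of the selected query
    uncovered -= sets[best]
# chosen == ['q5', 'q4', 'q3'], cost == 14
```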
Solution obtained by weighted greedy algorithm
[Figure: three snapshots of the graph as q4, then q3, then q1 are selected]
Step 1 (each document's weight is the reciprocal of its degree):
     d1   d2   d3   d4   d5   d6   d7   d8   d9   weight  w/df
q1    0    0   0.3   0   0.5  0.3   0   0.3   0   1.5     0.375
q2    0    0    0    0    0   0.3   0   0.3   0   0.667   0.3333
q3    1   0.5  0.3  0.5   0    0    0    0   0.3  2.667   0.5333
q4    0   0.5   0    0    0   0.3   1    0   0.3  2.167   0.5417
q5    0    0   0.3  0.5  0.5   0    0   0.3  0.3  2       0.4

Step 2 (after q4 is selected):
     d1   d2   d3   d4   d5   d6   d7   d8   d9   weight  w/df
q1    0    0   0.3   0   0.5   0    0   0.3   0   1.167   0.2917
q2    0    0    0    0    0    0    0   0.3   0   0.333   0.1667
q3    1    0   0.3  0.5   0    0    0    0    0   1.833   0.3667
q4    0    0    0    0    0    0    0    0    0   0       0
q5    0    0   0.3  0.5  0.5   0    0   0.3   0   1.667   0.3333

Step 3 (after q3 is selected):
     d1   d2   d3   d4   d5   d6   d7   d8   d9   weight  w/df
q1    0    0    0    0   0.5   0    0   0.3   0   0.833   0.2083
q2    0    0    0    0    0    0    0   0.3   0   0.333   0.1667
q3    0    0    0    0    0    0    0    0    0   0       0
q4    0    0    0    0    0    0    0    0    0   0       0
q5    0    0    0    0   0.5   0    0   0.3   0   0.833   0.1667
Total cost is 4+5+4=13
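A sketch reproducing this weighted-greedy trace. Here each document's weight is taken to be the reciprocal of its degree, which matches the numbers in the tables above; the deck derives weights from a Markov random walk, so this is a simplifying assumption:

```python
# Weighted greedy on the same q1-q5 / d1-d9 matrix. Document weight =
# 1/degree (matches the tables above). Trace: q4, q3, q1; cost 4+5+4 = 13.
matrix = {
    "q1": "001011010",
    "q2": "000001010",
    "q3": "111100001",
    "q4": "010001101",
    "q5": "001110011",
}
sets = {q: {i for i, bit in enumerate(row) if bit == "1"}
        for q, row in matrix.items()}
degree = {d: sum(d in docs for docs in sets.values()) for d in range(9)}
weight = {d: 1.0 / degree[d] for d in degree}

uncovered, chosen, cost = set(range(9)), [], 0
while uncovered:
    # score = (total weight of newly covered documents) / df
    best = max(sets, key=lambda q: sum(weight[d] for d in sets[q] & uncovered)
                                   / len(sets[q]))
    chosen.append(best)
    cost += len(sets[best])
    uncovered -= sets[best]
# chosen == ['q4', 'q3', 'q1'], cost == 13 (vs. 14 for plain greedy)
```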
Return limit and ranking
Newsgroups 20: #doc = 20k; 190 popular words; df = 1000~5000; k = 200, 500, ∞.
The percentage that can be downloaded ≈ k / min(df).
Jianguo Lu, Ranking Bias in Deep Web Size Estimation Using Capture Recapture Method, Data and Knowledge Engineering, Elsevier, 69(8): 866-879 (2010).
Models of deep web
Different sites require different methods of crawling and estimation. In the diagram, a model in a lower layer is more difficult to crawl and estimate.
Other dimensions:
- Doc extension: anchor texts are indexed
- Doc restriction: only the first part of a long text is indexed
- Docs and sites evolve over time
- Doc size and distribution
- Query size and distribution
Every combination calls for its own solution.

Model  All matched documents returned?  Equal probability of being matched?
M0     yes                              yes
Mr     no                               yes
Mh     yes                              no
Mrh    no                               no
Random queries
Model M0r
Assumptions:
- Only the top k documents are returned
- Each document has an equal probability of being matched
- Documents have a static ranking
Estimate the size
Estimate the number of documents in a deep web site by only sending queries and observing the returns. This is based on the capture-recapture method developed in ecology: documents are captured by queries.
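The simplest instance is the two-sample (Lincoln-Petersen) estimator from ecology, sketched below; the sample sizes are illustrative assumptions, not numbers from the deck:

```python
# Two-sample capture-recapture (Lincoln-Petersen), transplanted from
# ecology: documents are "captured" by queries instead of traps.
def lincoln_petersen(n1, n2, overlap):
    """n1, n2: number of documents returned by two query batches;
    overlap: documents appearing in both. Estimates N = n1*n2/overlap."""
    if overlap == 0:
        raise ValueError("no recaptures; the population cannot be estimated")
    return n1 * n2 / overlap

# Illustrative numbers: two query batches return 400 and 500 distinct
# documents, 100 of which appear in both batches.
estimated_size = lincoln_petersen(400, 500, 100)   # -> 2000.0
```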
Model M0
Assumptions:
- All the matched documents are returned
- Each document has an equal probability of being matched
Result: P = 1 - OR^(-2.1), where P is the fraction of documents downloaded and OR is the overlapping rate (total retrieved documents, counting duplicates, divided by unique documents).
Jianguo Lu, Dingding Li, Estimating Deep Web Data Source Size by Capture-Recapture Method, Information Retrieval, Springer, 2010.
The more accurate formula for the relationship between P and OR is
    OR = -ln(1 - P) / P
Conclusion: in model M0, it is not difficult to crawl a data source at all. In most cases OR will be higher than what this formula gives, because M0 is the simplest model.
P OR
0.1 1.053605
0.2 1.115718
0.3 1.188916
0.4 1.277064
0.5 1.386294
0.6 1.527151
0.7 1.719961
0.8 2.011797
0.9 2.558428
0.95 3.153402
0.99 4.651687
0.999 6.91467
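The table can be checked directly against the M0 formula:

```python
# Checks OR = -ln(1 - P) / P for model M0 against the tabulated values.
import math

def overlap_rate(p):
    """Expected overlapping rate after downloading a fraction p under M0."""
    return -math.log(1 - p) / p

for p, expected in [(0.1, 1.053605), (0.5, 1.386294), (0.99, 4.651687)]:
    assert abs(overlap_rate(p) - expected) < 1e-5
```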
Model M0 vs Mh
The blue line is drawn using the equation P = 1 - OR^(-2.1). Several real data sets show a different trend. Why?
Model Mh
Assumptions:
- Each document has an unequal probability of being matched by a query
- All matched documents are returned
The 'h' stands for heterogeneity in capture probability. The method was originally developed in ecology to estimate populations of wild animals. Process: capture a group of animals, mark and release them; capture another group of animals, mark and release them again; …
Mh was first proposed in the capture-recapture literature.
Capture frequency of newsgroups documents by queries
(A) is the scatter plot when documents are selected by queries; in total 13,600 documents are retrieved. (B) shows the first 100 captures of (A). (C) is the histogram of (A). (D) is the log-log plot of (C).
Model Mh

P        OR
0 1
0.533484 2
0.701347 3
0.782362 4
0.829732 5
0.860674 6
0.882404 7
0.898468 8
0.910806 9
The empirical result, obtained by linear regression, is
    P = 1 - OR^(-1.1)
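As with M0, the empirical Mh fit can be verified against the table above:

```python
# Checks the empirical Mh fit P = 1 - OR^(-1.1) against the tabulated values.
def downloaded_fraction(overlap_rate, alpha=1.1):
    return 1 - overlap_rate ** (-alpha)

for expected_p, overlap in [(0.533484, 2), (0.782362, 4), (0.910806, 9)]:
    assert abs(downloaded_fraction(overlap) - expected_p) < 1e-4
```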
File size distributions
Measuring heterogeneity: Coefficient of Variation (CV)
Assume that the documents in the data source have different but fixed probabilities of being captured, i.e., p = {p1, p2, …, pn}, where Σ pj = 1.
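A minimal sketch of the CV computation; the standard definition γ = sqrt((1/n) Σ (pj − p̄)²) / p̄, with p̄ = 1/n, is assumed here since the deck's formula did not survive extraction:

```python
# Coefficient of variation of the capture probabilities: gamma = 0 means a
# homogeneous source (model M0); larger gamma means more heterogeneity (Mh).
import math

def cv(probs):
    n = len(probs)
    mean = sum(probs) / n          # equals 1/n when the probs sum to 1
    variance = sum((p - mean) ** 2 for p in probs) / n
    return math.sqrt(variance) / mean

uniform = [0.25, 0.25, 0.25, 0.25]   # every document equally catchable
skewed = [0.7, 0.1, 0.1, 0.1]        # one document dominates the captures
# cv(uniform) == 0.0, while cv(skewed) exceeds 1
```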
Sampling based approach
Scatter plots for various CVs: 200 random numbers within the range of 1 to 20,000 are generated from a Pareto distribution.
In general, P = 1 - OR^(-α), where the exponent α depends on the heterogeneity.
[Figure: relationship between CV (γ) and α]
Model M0r
When k and m are fixed for every query, we have
    P = (k/m)(1 - OR^(-2.1))
This is not a practical assumption.
Model Mhr
Assumptions:
- Only the top k documents are returned
- Documents have an unequal probability of being matched
- Documents have a static ranking
When k and m are fixed, we have
    P = (k/m)(1 - OR^(-1.1))
Evolution of the models
Comparison of models M0, Mh, M0r, and Mhr: 1000 documents are sorted by file size in decreasing order; 600 documents are selected under each of the four models, including duplicates; k = 10, m = 20.
- Subplot M0 shows that all the documents are retrieved uniformly.
- Subplot Mh shows that large documents are preferred, but most of the documents can eventually be sampled.
- Subplot M0r exhibits a clear cut around the 500th document; beyond this line almost no documents are retrieved.
- Mhr is the compound of M0r and Mh.
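The M0r cut-off can be reproduced with a short simulation; k = 10 and m = 20 follow the figure, while the seed and query count are arbitrary choices:

```python
# Model M0r: each query matches m random documents, but only the top k by a
# static ranking are returned, so low-ranked documents are rarely surfaced.
import random

rng = random.Random(42)
n, k, m, num_queries = 1000, 10, 20, 60   # 60 queries * k = 600 retrievals
retrieved = set()
for _ in range(num_queries):
    matched = rng.sample(range(n), m)      # ranks 0 (top) .. n-1 (bottom)
    retrieved.update(sorted(matched)[:k])  # only the k best-ranked returned
# Retrieved documents concentrate above rank ~ n*k/m = 500: the "clear cut".
bottom_half = sum(1 for d in retrieved if d >= n // 2)
```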