Deep web
Jianguo Lu
What is the deep web?
Also called the hidden/invisible web (or hidden databases), in contrast to the surface web, which is interconnected by hyperlinks.
Content is dynamically generated from a search interface by sending queries. The search interface can be an HTML form, a web service, …
Almost every surface web site has a companion deep web search box, e.g., Wikipedia, Amazon, nytimes, ibm.com, …
Deep web (pictures from http://www.osti.gov/fedsearch)
Deep web crawling
Crawl and index the deep web so that hidden data can be surfaced. Unlike the surface web, there are no hyperlinks to follow.
Two tasks:
- Find deep web data sources, i.e., HTML forms and web services. E.g., "Accessing the deep web: a survey", B. He et al., CACM, 2007.
- Given a data source, download the data from it. E.g., Google, Bing, UCLA papers.
Why deep web crawling and estimation?
Most web sites have a companion searchable, richer deep site; we need to profile (#docs, #words, distribution) and crawl these sites.
The population size of a deep web site is an important metric, for both general interest and competitive analysis:
- Which search engine is the largest? (Lawrence and Giles, Science, Apr 1998)
- Which bookstore/library has the most books?
- Which university hosts the most web pages?
- How many users does Facebook/Twitter/Weibo have?
- Which newspaper has more online articles? …
Applications: crawling and indexing by general-purpose search engines (Madhavan et al. (Google), VLDB 2008; Wu et al. (Bing), ICDE 2006), data integration, meta-crawlers, focused topic search engines, shop-bots, business intelligence, archiving, …
Graph model of deep web
A blue dot represents a document; a red dot represents a query. Documents can be retrieved only through queries.
Both the document degree and the query degree follow a power law.
The size of a blue dot is proportional to the reciprocal of its degree; the size of a red dot is proportional to the sum of the sizes of the blue dots it connects to.
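The degree and dot-size rules above can be sketched in a few lines. A minimal Python sketch; the toy query-to-document incidence table is an illustrative assumption, not data from the deck:

```python
# Bipartite graph model: documents (blue dots) vs. queries (red dots).
# Hypothetical query -> matched-documents table.
incidence = {
    "q1": {"d1", "d2", "d3"},
    "q2": {"d2", "d4"},
    "q3": {"d4"},
}

# Document degree = number of queries that match the document.
doc_degree = {}
for docs in incidence.values():
    for d in docs:
        doc_degree[d] = doc_degree.get(d, 0) + 1

# Blue-dot size: reciprocal of the document's degree.
doc_size = {d: 1.0 / k for d, k in doc_degree.items()}

# Red-dot size: sum of the sizes of the blue dots the query matches.
query_size = {q: sum(doc_size[d] for d in docs) for q, docs in incidence.items()}
```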
How difficult is it to download?
Important to both data providers and consumers. Sub-questions:
- Is it always possible to download all the documents?
- How many queries are needed to download all the documents? (popular queries, rare queries, volume of queries)
- Can we reduce the cost of downloading? What is the cost if we download part of the data, say 50%?
This is the same as the estimation problem: given the overlapping information, we can estimate the percentage of the data downloaded, and hence the population size.
Q1: Can we download all the documents?
We can if the graph is connected: use any graph traversal algorithm.
When is the graph connected? By the Erdos-Renyi random graph model, when the mean degree > ln(n).
We may not be able to download all the docs if there are many small documents, or if we use mainly rare queries.
Implications (detailed experiments needed): it is easy to download large docs such as books and articles; we may not be able to download all the micro-blogs, or a database of paper titles.
What if the graph is not connected? It is hard to guess the queries in the other components. The topology of a disconnected graph usually contains one large component and many small components.
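The ln(n) connectivity threshold can be checked with a small simulation; the graph size and seed below are arbitrary choices for illustration:

```python
# Erdos-Renyi G(n, p): a random graph becomes connected (w.h.p.) once the
# mean degree c = p*(n-1) exceeds ln(n). Components counted via union-find.
import math
import random

def num_components(n, mean_degree, seed=0):
    rng = random.Random(seed)
    p = mean_degree / (n - 1)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})

n = 300                                                  # ln(300) ~ 5.7
sparse = num_components(n, mean_degree=2.0)              # below the threshold
dense = num_components(n, mean_degree=3 * math.log(n))   # well above it
```

With these parameters the sparse graph fragments into many components (so a query-based crawl cannot reach everything), while the dense one collapses into a single component.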
Disconnected graph
Document graph constructed using all the dis legomena (words that occur exactly twice) in newsgroups20 (20k documents); many smaller islands are omitted.
There is only one large component, and many islands. It is a challenge to harvest all the data islands.
This is believed to be a universal graph topology, e.g., in social networks. The semantic web?
Q2: How many queries are needed?
It depends on the volume of the queries. The expected volume can be calculated using Zipf's law and Heaps' law.
# queries to cover all the documents
The number of queries increases with database size and decreases with document size. The top and bottom queries increase at different speeds.
Rule of thumb: hundreds of top queries.
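This rule of thumb can be illustrated with Zipf's law (the exponent of 1 and the counts below are assumptions, not the deck's measurements): the top-ranked queries carry a disproportionate share of the total result volume.

```python
# Share of total query volume contributed by the top-m queries when document
# frequencies follow Zipf's law df(r) ~ C / r (the constant C cancels out).
def zipf_share(m, n_terms):
    top = sum(1.0 / r for r in range(1, m + 1))
    total = sum(1.0 / r for r in range(1, n_terms + 1))
    return top / total

share = zipf_share(100, 10_000)   # top 100 of 10,000 queries
# Roughly half of the total volume comes from the top 1% of queries,
# which is why a few hundred top queries go a long way.
```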
Q3: How to reduce the crawling cost?
We obtain documents by sending queries, but we do not need to send all of them. Goal: select the queries that cover all the documents while minimizing the cost. This is an optimization problem: the set covering problem.
Set covering is NP-hard. There are many heuristic and approximation algorithms: greedy, Lagrangian relaxation, randomized greedy, genetic algorithms, … None of them studies the dependency on the data distribution.
The greedy algorithm selects the next best query, giving each document the same weight. But since document degree follows a power law, the document degree distribution is heavily skewed.
Weighted greedy: a document's weight is its probability of being visited in a Markov random walk. This is more scalable than integer programming products such as CPLEX.
Set covering problem: given a universe U and a family of subsets S = {S1, S2, …, Sn} of U, a cover is a subfamily of S whose union is U. Let J = {1, 2, …, n}; a subfamily indexed by J* ⊆ J is a cover if ∪_{j ∈ J*} Sj = U.
Set covering decision problem: the input is a pair (S, U) and an integer k; the question is whether there is a set covering of size k or less.
Set covering optimization problem: the input is a pair (S, U), and the task is to find a set covering that uses the fewest sets.
The decision version of set covering is NP-complete, and the optimization version is NP-hard.
Set covering
Set covering example
[Figure: terms t1, t2, t3 and the documents d1, d2, d3 they cover]
Set covering can be represented by a matrix or a hyper-graph. In the matrix representation, each row represents a term and each column represents a document; if A(i, j) is 1, term i can retrieve (covers) document j.
Greedy algorithm
At each step, select the set that covers the largest number of new elements.
The greedy algorithm is not optimal. In the example there can be two solutions:
- If the first set selected is t1, the solution is {t1, t2}, with cost 4.
- If the first selection is t2, the solution is {t2, t3}, with cost 3.
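Since the original figure for this example did not survive extraction, the instance below is a hypothetical reconstruction chosen so that it reproduces the two costs (4 and 3); the greedy score is new-documents/df, as in the traces later in the deck:

```python
# Hypothetical reconstruction of the t1/t2/t3 example (the original figure
# is lost); chosen so the two greedy runs cost 4 and 3 as on the slide.
sets = {"t1": {"d1", "d2"}, "t2": {"d2", "d3"}, "t3": {"d1"}}
universe = {"d1", "d2", "d3"}

def greedy_cover(first_choice_bias):
    """Greedy set cover scoring new/df; `first_choice_bias` breaks ties."""
    uncovered, chosen, cost = set(universe), [], 0
    while uncovered:
        best = max(sets, key=lambda t: (len(sets[t] & uncovered) / len(sets[t]),
                                        t == first_choice_bias))
        chosen.append(best)
        cost += len(sets[best])   # cost = df, documents retrieved by the query
        uncovered -= sets[best]
    return chosen, cost

print(greedy_cover("t1"))   # (['t1', 't2'], 4)
print(greedy_cover("t2"))   # (['t2', 't3'], 3)
```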
Weighted greedy algorithm
[Figure: a bipartite graph of queries q1–q5 (red) and documents d1–d9 (blue)]
     d1 d2 d3 d4 d5 d6 d7 d8 d9
q1    0  0  1  0  1  1  0  1  0
q2    0  0  0  0  0  1  0  1  0
q3    1  1  1  1  0  0  0  0  1
q4    0  1  0  0  0  1  1  0  1
q5    0  0  1  1  1  0  0  1  1
One solution obtained by greedy algorithm
[Figure: three snapshots of the graph as q5, then q4, then q3 are selected]
Step 1 (nothing covered yet):
     d1 d2 d3 d4 d5 d6 d7 d8 d9  df  new  new/df
q1    0  0  1  0  1  1  0  1  0   4   4   1
q2    0  0  0  0  0  1  0  1  0   2   2   1
q3    1  1  1  1  0  0  0  0  1   5   5   1
q4    0  1  0  0  0  1  1  0  1   4   4   1
q5    0  0  1  1  1  0  0  1  1   5   5   1

Step 2 (after q5 is selected):
     d1 d2 d3 d4 d5 d6 d7 d8 d9  df  new  new/df
q1    0  0  0  0  0  1  0  0  0   4   1   0.25
q2    0  0  0  0  0  1  0  0  0   2   1   0.5
q3    1  1  0  0  0  0  0  0  0   5   2   0.4
q4    0  1  0  0  0  1  1  0  0   4   3   0.75
q5    0  0  0  0  0  0  0  0  0   5   0   0

Step 3 (after q4 is selected):
     d1 d2 d3 d4 d5 d6 d7 d8 d9  df  new  new/df
q1    0  0  0  0  0  0  0  0  0   4   0   0
q2    0  0  0  0  0  0  0  0  0   2   0   0
q3    1  0  0  0  0  0  0  0  0   5   1   0.2
q4    0  0  0  0  0  0  0  0  0   4   0   0
q5    0  0  0  0  0  0  0  0  0   5   0   0
Total cost is 5+4+5=14
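The trace above can be reproduced with a few lines of Python; ties on new/df are broken toward the later query to match the deck's selection order (this tie-breaking rule is an assumption):

```python
# Greedy query selection on the deck's q1-q5 / d1-d9 matrix; reproduces the
# trace above: q5, then q4, then q3, total cost 5 + 4 + 5 = 14.
matrix = {
    "q1": "001011010",
    "q2": "000001010",
    "q3": "111100001",
    "q4": "010001101",
    "q5": "001110011",
}
sets = {q: {i for i, bit in enumerate(row) if bit == "1"}
        for q, row in matrix.items()}
universe = set(range(9))            # d1..d9 as 0..8

uncovered, chosen, cost = set(universe), [], 0
while uncovered:
    best, best_ratio = None, -1.0
    for q, docs in sets.items():
        ratio = len(docs & uncovered) / len(docs)   # new / df
        if ratio >= best_ratio:     # ">=": the later query wins ties (assumed)
            best, best_ratio = q, ratio
    chosen.append(best)
    cost += len(sets[best])         # cost = df of the selected query
    uncovered -= sets[best]
# chosen == ['q5', 'q4', 'q3'], cost == 14
```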
Solution obtained by weighted greedy algorithm
[Figure: three snapshots of the graph as q4, then q3, then q1 are selected]
Step 1 (each document's weight is the reciprocal of its degree):
     d1   d2   d3   d4   d5   d6   d7   d8   d9   weight  w/df
q1    0    0   0.3   0   0.5  0.3   0   0.3   0   1.5     0.375
q2    0    0    0    0    0   0.3   0   0.3   0   0.667   0.3333
q3    1   0.5  0.3  0.5   0    0    0    0   0.3  2.667   0.5333
q4    0   0.5   0    0    0   0.3   1    0   0.3  2.167   0.5417
q5    0    0   0.3  0.5  0.5   0    0   0.3  0.3  2       0.4

Step 2 (after q4 is selected):
     d1   d2   d3   d4   d5   d6   d7   d8   d9   weight  w/df
q1    0    0   0.3   0   0.5   0    0   0.3   0   1.167   0.2917
q2    0    0    0    0    0    0    0   0.3   0   0.333   0.1667
q3    1    0   0.3  0.5   0    0    0    0    0   1.833   0.3667
q4    0    0    0    0    0    0    0    0    0   0       0
q5    0    0   0.3  0.5  0.5   0    0   0.3   0   1.667   0.3333

Step 3 (after q3 is selected):
     d1   d2   d3   d4   d5   d6   d7   d8   d9   weight  w/df
q1    0    0    0    0   0.5   0    0   0.3   0   0.833   0.2083
q2    0    0    0    0    0    0    0   0.3   0   0.333   0.1667
q3    0    0    0    0    0    0    0    0    0   0       0
q4    0    0    0    0    0    0    0    0    0   0       0
q5    0    0    0    0   0.5   0    0   0.3   0   0.833   0.1667
Total cost is 4+5+4=13
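A sketch reproducing this weighted-greedy trace. Here each document's weight is taken to be the reciprocal of its degree, which matches the numbers in the tables above; the deck derives weights from a Markov random walk, so this is a simplifying assumption:

```python
# Weighted greedy on the same q1-q5 / d1-d9 matrix. Document weight =
# 1/degree (matches the tables above). Trace: q4, q3, q1; cost 4+5+4 = 13.
matrix = {
    "q1": "001011010",
    "q2": "000001010",
    "q3": "111100001",
    "q4": "010001101",
    "q5": "001110011",
}
sets = {q: {i for i, bit in enumerate(row) if bit == "1"}
        for q, row in matrix.items()}
degree = {d: sum(d in docs for docs in sets.values()) for d in range(9)}
weight = {d: 1.0 / degree[d] for d in degree}

uncovered, chosen, cost = set(range(9)), [], 0
while uncovered:
    # score = (total weight of newly covered documents) / df
    best = max(sets, key=lambda q: sum(weight[d] for d in sets[q] & uncovered)
                                   / len(sets[q]))
    chosen.append(best)
    cost += len(sets[best])
    uncovered -= sets[best]
# chosen == ['q4', 'q3', 'q1'], cost == 13 (vs. 14 for plain greedy)
```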
Return limit and ranking
Newsgroups 20: #doc = 20k; 190 popular words; df = 1000~5000; k = 200, 500, ∞.
The percentage that can be downloaded ≈ k / min(df).
Jianguo Lu, Ranking Bias in Deep Web Size Estimation Using Capture Recapture Method, Data and Knowledge Engineering, Elsevier, 69(8): 866-879 (2010).
Models of deep web
Different sites require different methods of crawling and estimation. In the diagram, a model in a lower layer is more difficult to crawl and estimate.
Other dimensions:
- Doc extension: anchor texts are indexed
- Doc restriction: only the first part of a long text is indexed
- Docs and sites evolve over time
- Doc size and distribution
- Query size and distribution
Every combination calls for its own solution.

Model  All matched documents returned?  Equal probability of being matched?
M0     yes                              yes
Mr     no                               yes
Mh     yes                              no
Mrh    no                               no
Random queries
Model M0r
Assumptions:
- Only the top k documents are returned
- Each document has an equal probability of being matched
- Documents have a static ranking
Estimate the size
Estimate the number of documents in a deep web site by only sending queries and observing the returns. This is based on the capture-recapture method developed in ecology: documents are captured by queries.
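The simplest instance is the two-sample (Lincoln-Petersen) estimator from ecology, sketched below; the sample sizes are illustrative assumptions, not numbers from the deck:

```python
# Two-sample capture-recapture (Lincoln-Petersen), transplanted from
# ecology: documents are "captured" by queries instead of traps.
def lincoln_petersen(n1, n2, overlap):
    """n1, n2: number of documents returned by two query batches;
    overlap: documents appearing in both. Estimates N = n1*n2/overlap."""
    if overlap == 0:
        raise ValueError("no recaptures; the population cannot be estimated")
    return n1 * n2 / overlap

# Illustrative numbers: two query batches return 400 and 500 distinct
# documents, 100 of which appear in both batches.
estimated_size = lincoln_petersen(400, 500, 100)   # -> 2000.0
```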
Model M0
Assumptions:
- All the matched documents are returned
- Each document has an equal probability of being matched
Result: P = 1 - OR^(-2.1), where P is the fraction of documents downloaded and OR is the overlapping rate (total retrieved documents, counting duplicates, divided by unique documents).
Jianguo Lu, Dingding Li, Estimating Deep Web Data Source Size by Capture-Recapture Method, Information Retrieval, Springer, 2010.
The more accurate formula for the relationship between P and OR is
    OR = -ln(1 - P) / P
Conclusion: in model M0, it is not difficult to crawl a data source at all. In most cases OR will be higher than what this formula gives, because M0 is the simplest model.
P OR
0.1 1.053605
0.2 1.115718
0.3 1.188916
0.4 1.277064
0.5 1.386294
0.6 1.527151
0.7 1.719961
0.8 2.011797
0.9 2.558428
0.95 3.153402
0.99 4.651687
0.999 6.91467
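The table can be checked directly against the M0 formula:

```python
# Checks OR = -ln(1 - P) / P for model M0 against the tabulated values.
import math

def overlap_rate(p):
    """Expected overlapping rate after downloading a fraction p under M0."""
    return -math.log(1 - p) / p

for p, expected in [(0.1, 1.053605), (0.5, 1.386294), (0.99, 4.651687)]:
    assert abs(overlap_rate(p) - expected) < 1e-5
```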
Model M0 vs Mh
The blue line is drawn using the equation P = 1 - OR^(-2.1). Several real data sets show a different trend. Why?
Model Mh
Assumptions:
- Each document has an unequal probability of being matched by a query
- All matched documents are returned
The 'h' stands for heterogeneity in capture probability. The method was originally developed in ecology to estimate populations of wild animals. Process: capture a group of animals, mark and release them; capture another group of animals, mark and release them again; …
Mh was first proposed in the capture-recapture literature.
Capture frequency of newsgroups documents by queries
(A) is the scatter plot when documents are selected by queries; in total 13,600 documents are retrieved. (B) shows the first 100 captures of (A). (C) is the histogram of (A). (D) is the log-log plot of (C).
Model Mh

P        OR
0 1
0.533484 2
0.701347 3
0.782362 4
0.829732 5
0.860674 6
0.882404 7
0.898468 8
0.910806 9
The empirical result, obtained by linear regression, is
    P = 1 - OR^(-1.1)
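As with M0, the empirical Mh fit can be verified against the table above:

```python
# Checks the empirical Mh fit P = 1 - OR^(-1.1) against the tabulated values.
def downloaded_fraction(overlap_rate, alpha=1.1):
    return 1 - overlap_rate ** (-alpha)

for expected_p, overlap in [(0.533484, 2), (0.782362, 4), (0.910806, 9)]:
    assert abs(downloaded_fraction(overlap) - expected_p) < 1e-4
```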
File size distributions
Measuring heterogeneity: Coefficient of Variation (CV)
Assume that the documents in the data source have different but fixed probabilities of being captured, i.e., p = {p1, p2, …, pn}, where Σ pj = 1.
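A minimal sketch of the CV computation; the standard definition γ = sqrt((1/n) Σ (pj − p̄)²) / p̄, with p̄ = 1/n, is assumed here since the deck's formula did not survive extraction:

```python
# Coefficient of variation of the capture probabilities: gamma = 0 means a
# homogeneous source (model M0); larger gamma means more heterogeneity (Mh).
import math

def cv(probs):
    n = len(probs)
    mean = sum(probs) / n          # equals 1/n when the probs sum to 1
    variance = sum((p - mean) ** 2 for p in probs) / n
    return math.sqrt(variance) / mean

uniform = [0.25, 0.25, 0.25, 0.25]   # every document equally catchable
skewed = [0.7, 0.1, 0.1, 0.1]        # one document dominates the captures
# cv(uniform) == 0.0, while cv(skewed) exceeds 1
```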
Sampling based approach
Scatter plots for various CVs: 200 random numbers within the range of 1 to 20,000 are generated from a Pareto distribution.
In general, P = 1 - OR^(-α), where the exponent α depends on the heterogeneity.
[Figure: relationship between CV (γ) and α]
Model M0r
When k and m are fixed for every query, we have
    P = (k/m)(1 - OR^(-2.1))
This is not a practical assumption.
Model Mhr
Assumptions:
- Only the top k documents are returned
- Documents have an unequal probability of being matched
- Documents have a static ranking
When k and m are fixed, we have
    P = (k/m)(1 - OR^(-1.1))
Evolution of the models
Comparison of models M0, Mh, M0r, and Mhr: 1000 documents are sorted by file size in decreasing order; 600 documents are selected under each of the four models, including duplicates; k = 10, m = 20.
- Subplot M0 shows that all the documents are retrieved uniformly.
- Subplot Mh shows that large documents are preferred, but most of the documents can eventually be sampled.
- Subplot M0r exhibits a clear cut around the 500th document; beyond this line almost no documents are retrieved.
- Mhr is the compound of M0r and Mh.
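The M0r cut-off can be reproduced with a short simulation; k = 10 and m = 20 follow the figure, while the seed and query count are arbitrary choices:

```python
# Model M0r: each query matches m random documents, but only the top k by a
# static ranking are returned, so low-ranked documents are rarely surfaced.
import random

rng = random.Random(42)
n, k, m, num_queries = 1000, 10, 20, 60   # 60 queries * k = 600 retrievals
retrieved = set()
for _ in range(num_queries):
    matched = rng.sample(range(n), m)      # ranks 0 (top) .. n-1 (bottom)
    retrieved.update(sorted(matched)[:k])  # only the k best-ranked returned
# Retrieved documents concentrate above rank ~ n*k/m = 500: the "clear cut".
bottom_half = sum(1 for d in retrieved if d >= n // 2)
```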