CS246
Web Characteristics
Junghoo "John" Cho (UCLA Computer Science)
Web Characteristics
What is the Web like? Any questions about the characteristics and/or properties of the Web?
Web Characteristics
- Size of the Web
- Search engine coverage
- Link structure of the Web
How Many Web Sites?
Polling every IP: 2^32 ≈ 4B addresses, 10 sec/IP, 1,000 simultaneous connections:
2^32 * 10 / (1000 * 24 * 60 * 60) ≈ 460 days
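A quick back-of-the-envelope check of this estimate (a minimal sketch; the 10 sec/IP and 1,000-connection figures are the slide's own assumptions):

```python
# Back-of-the-envelope: time to poll every IPv4 address.
ADDRESSES = 2 ** 32          # ~4.3 billion IPs
SECONDS_PER_IP = 10          # assumed probe/timeout time per IP
PARALLEL_CONNECTIONS = 1000  # simultaneous probes

total_seconds = ADDRESSES * SECONDS_PER_IP / PARALLEL_CONNECTIONS
print(total_seconds / (24 * 60 * 60))  # ~497 days (~460 if 2^32 is rounded to 4B)
```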
How Many Web Sites?
Sampling-based
T: all IPs
S: sampled IPs
V: sampled IPs with a valid reply
Estimate: (number of sites) ≈ |T| × |V| / |S|
How Many Web Sites?
1. Select |S| random IPs
2. Send HTTP requests to port 80 at the selected IPs
3. Count valid replies: “HTTP 200 OK” = |V|
4. |T| = 2^32
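A minimal sketch of steps 1-4 (hypothetical sample size; a single-threaded probe with a plain HTTP/1.0 request, omitting the concurrency a real measurement would need):

```python
import random
import socket

TOTAL_IPS = 2 ** 32      # |T|
SAMPLE_SIZE = 10_000     # |S|, hypothetical

def random_ip() -> str:
    return socket.inet_ntoa(random.getrandbits(32).to_bytes(4, "big"))

valid = 0                # |V|
for _ in range(SAMPLE_SIZE):
    try:
        with socket.create_connection((random_ip(), 80), timeout=10) as s:
            s.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
            status_line = s.recv(64).split(b"\r\n", 1)[0]
            if b" 200 " in status_line:      # "HTTP ... 200 OK"
                valid += 1
    except OSError:
        pass             # no reachable web server at this IP

print("estimated sites:", TOTAL_IPS * valid / SAMPLE_SIZE)
```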
How Many Web Sites?
OCLC (Online Computer Library Center) results: http://wcp.oclc.org
Total number of available IPs: 2^32 ≈ 4.2 billion
Growth (in terms of sites) has slowed down:

Year   1998       1999       2000       2001       2002
Sites  2,636,000  4,662,000  7,128,000  8,443,000  8,712,000
Issues
Multi-hosted servers: cnn.com → 207.25.71.5, 207.25.71.20, …
Select the lowest IP address. For each sampled IP:
- Look up the domain name
- Resolve the name back to its IPs
- Is our sampled IP the lowest?
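A minimal sketch of that de-duplication check, assuming forward and reverse DNS both resolve; the example addresses are the ones from the slide:

```python
import socket

def counts_as_site(sampled_ip: str) -> bool:
    """Count an IP only if it is the lowest address serving its hostname."""
    try:
        hostname, _, _ = socket.gethostbyaddr(sampled_ip)      # reverse DNS
        _, _, ip_list = socket.gethostbyname_ex(hostname)      # forward DNS
    except OSError:
        return True  # no DNS info: treat the IP as its own site
    lowest = min(ip_list, key=socket.inet_aton)
    return sampled_ip == lowest

# e.g. counts_as_site("207.25.71.20") would be False if 207.25.71.5
# is also listed for the same hostname.
```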
Issues
Virtual hosting: multiple sites on the same IP
- Find the average number of hosted sites per IP
- 7.4M sites on 3.4M IPs by polling all available site names [Netcraft, 2000]
Other ports? Temporarily unavailable sites?
Where Are They Located?
What Language?
(Based on Web sites)
Questions?
How Many Web Pages?
Infinite number of URLs
Sampling-based?
T: all URLs
S: sampled URLs
V: sampled URLs with a valid reply
Estimate |T| × |V| / |S|? Problem: |T| is effectively infinite, so we cannot sample URLs uniformly
How Many Web Pages?
Solution 1: estimate the average number of pages per site
(average no. of pages per site) × (total no. of sites)
Algorithm:
- For each site with a valid reply, download all of its pages
- Take the average
Result [LG99]: 289 pages per site × 2.8M sites ≈ 800M pages
Issues
- A small number of sites with TONS of pages
- Very likely to miss these sites
- Lots of samples necessary
[Figure: no. of pages vs. no. of sites, annotated "99.99% of the sites"]
How Many Pages?
Solution 2: sampling-based
T: all pages
B: base set (e.g., pages indexed by a search engine)
S: random sample of pages
|B ∩ S| / |S| ≈ |B| / |T|,  so  |T| ≈ |B| × |S| / |B ∩ S|
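A minimal sketch of this estimator (sometimes called mark-and-recapture); the toy index and sample sets below are hypothetical:

```python
def estimate_total(base_set: set, random_sample: set) -> float:
    """Estimate |T| as |B| * |S| / |B ∩ S|."""
    overlap = len(base_set & random_sample)
    if overlap == 0:
        raise ValueError("no overlap: sample too small to estimate |T|")
    return len(base_set) * len(random_sample) / overlap

# Toy example: a "search-engine index" of 250 pages and a random sample
# of 100 pages, 35 of which happen to be indexed -> |T| ~= 714.
index = {f"page{i}" for i in range(250)}
sample = {f"page{i}" for i in range(215, 315)}   # 35 of these are in the index
print(estimate_total(index, sample))             # ~714.3
```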
Related Question
How many deer in Yosemite National Park?
Random Page?
Idea: random walk
- Start from the Yahoo home page
- Follow random links, say 10,000 times
- Select the page you end on
Problem: biased toward "popular" pages, e.g., Microsoft, Google
Random Page?
A random walk on a regular, undirected graph yields a uniform random sample
- Regular graph: an equal number of edges at every node
- After O((1/ε) · log N) steps
  - ε: depends on the graph structure
  - N: number of nodes
Idea: transform the Web graph into a regular, undirected graph, then perform a random walk
Ideal Random Walk
Generate the regular, undirected graph:
- Make edges undirected
- Decide d, the maximum # of edges per page: say, d = 300,000
- If degree(n) < 300,000, add self-loops until degree(n) = d
Perform the random walk on this graph
- ε ≈ 10^-5 for the 1996 Web, N ≈ 10^9
- ≈ 3,000,000 steps, but mostly self-loops: only ~100 actual (non-self-loop) moves
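A minimal sketch of one step of this walk, assuming the graph is stored as a hypothetical adjacency-list dict and padded to degree d = 300,000 with implicit self-loops:

```python
import random

D = 300_000  # target (regular) degree; self-loops make up the difference

def random_walk_step(graph: dict, node: str) -> str:
    """With probability degree(node)/D follow a real edge, else self-loop."""
    neighbors = graph[node]
    if random.random() < len(neighbors) / D:
        return random.choice(neighbors)   # actual move (rare: deg/D ~ 1e-5)
    return node                           # self-loop: stay put
```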
Different Interpretation
A random walk on the irregular Web graph has a high chance of being at a "popular" node at any particular time
Self-loops increase the chance of being at an "unpopular" node by making the walk stay there longer
[Figure: popular node vs. unpopular nodes]
Issues
How to get edges to/from node n?
- Edges discovered so far
- From search engines, like AltaVista, HotBot
- Still limited incoming links
WebWalker [BBCF00]
Our graph does not have to be the same as the real Web
Construct the regular undirected graph while performing the random walk:
- Add a new node n when the walk first visits n
- Find edges for node n at that time:
  1. Edges discovered so far
  2. From search engines
- Add self-loops as necessary
- Ignore any further edges to n discovered later
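A minimal sketch of this edge-freezing idea; `discover_edges` is a hypothetical stand-in for "edges seen so far plus search-engine lookups":

```python
import random

D = 300_000  # target degree; unfilled slots act as self-loops

def webwalker_step(node, frozen_edges, discover_edges):
    """One step: fix node's edge list on its first visit, then walk."""
    if node not in frozen_edges:
        # First visit: collect edges now; never add more to this node later.
        frozen_edges[node] = discover_edges(node)[:D]
    neighbors = frozen_edges[node]
    if random.random() < len(neighbors) / D:
        return random.choice(neighbors)   # real move
    return node                           # self-loop

# Usage (hypothetical):
# node, edges = "http://example.com/", {}
# for _ in range(3_000_000):
#     node = webwalker_step(node, edges, discover_edges)
```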
WebWalker
[Figure: example WebWalker graph with d = 5]
WebWalker
Why ignore "new incoming" edges? To keep the graph regular:
- The "discovered part" of the graph does not change
- The "uniformity theorem" still holds
Can we reach "all reachable" pages? Yes: we ignore only edges to already-visited nodes
Can we use the same ε? No
WebWalker results
Size of the Web:
- AltaVista base set: |B| = 250M
- |B ∩ S| / |S| = 35%
- |T| ≈ |B| × |S| / |B ∩ S| = 250M / 0.35 ≈ 720M
Avg page size: 12KB
Avg no. of out-links: 10
WebWalker results
Pages by domain: .com 49%, .edu 8%, .org 7%, .net 6%, .de 4%, .jp 3%, .uk 3%, .gov 2%
What About Other Web Pages?
Pages that are:
- Available only within a corporate intranet
- Protected by authentication
- Not reachable by following links, e.g., pages within e-commerce sites
Deep Web vs. Hidden Web: information reachable only through a search interface
- What if a page is reachable both through links and through a search interface?
Size of Deep Web?
Estimation: (avg no. of records per site) × (total no. of Deep Web sites)
How to estimate? By sampling
Size of Deep Web?
Total # of Deep Web sites: estimated from |B ∩ S| / |S|, as before
Avg no. of records per site:
- Contact the site directly
- Use a query like "NOT zzxxyyxx" (which matches essentially every record), if the site reports the number of matches
Size of Deep Web
BrightPlanet report:
- Avg no. of records per site: 5 million
- Total no. of Deep Web sites: 200,000
- Avg size of a record: 14KB
- Size of the Deep Web: 200,000 × 5M × 14KB ≈ 10^16 bytes (10 petabytes), about 1,000× larger than the "Surface Web"
How to access it?
Web Characteristics
- Size of the Web
- Search engines
- Link structure of the Web
Search Engines
- Coverage
- Overlap
- Dead links
- Indexing delay
Coverage?
Q: How do we estimate coverage?
A: Create a random sample of pages and measure how many of them are indexed by a search engine
In 1999:
- Estimated Web size: 800M
- Reported indexed pages: 128M (Northern Light)
- Coverage: 128M / 800M ≈ 16%
No reliable Web size estimate at this point; search engines often claim a ~20B-page index
Overlap?
How many pages are commonly indexed?
Method 1: create a random sample and measure how many pages are indexed only by A, only by B, and by both A and B
Method 2: send common queries, compare the returned pages, and measure the overlap
Result from Method 2: little overlap
- E.g., Infoseek and AltaVista: 20% overlap [Bharat and Broder 1997]
Is it still true? Results seem to converge
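A minimal sketch of Method 1, where `indexed_by_a` and `indexed_by_b` are hypothetical predicates that test whether an engine's index contains a given URL:

```python
def overlap_fractions(sample, indexed_by_a, indexed_by_b):
    """Estimate index overlap from a random sample of pages."""
    in_a = {url for url in sample if indexed_by_a(url)}
    in_b = {url for url in sample if indexed_by_b(url)}
    return {
        "only A": len(in_a - in_b) / len(sample),
        "only B": len(in_b - in_a) / len(sample),
        "both": len(in_a & in_b) / len(sample),
        # One possible pairwise measure: overlap relative to A's index.
        "both / A": len(in_a & in_b) / len(in_a) if in_a else 0.0,
    }
```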
Dead Links?
Q: How can we measure what fraction of the pages in a search engine's index are dead?
A: Issue random queries and check whether the returned pages are dead
Result in Feb 2000:
- AltaVista: 13.7%
- Excite: 8.7%
- Google: 4.3%
Search engines have since gotten much better thanks to better recrawling algorithms (a topic for later study)
How Early Are Pages Indexed?
Method 1:
- Create pages at random locations
- Check when they become available in search engines
- Cons: difficult to create pages at random locations
Method 2:
- Repeatedly issue the same queries over time
- When a new page appears in the results, record its "last modified date"
- Cons: the last modified date is only a "lower bound"
How Early Are Pages Indexed?
Mean time [Lawrence and Giles 2000]:
- Northern Light: 141 days
- AltaVista: 166 days
- HotBot: 192 days
How Stable Are the Sites?
Monitor a set of random sites; percentage of Web servers still available:

Year       1998   1999   2000   2001   2002
Available  100%   56%    35%    25%    13%

(Similar results for other years)
Web Characteristics
- Size of the Web
- Search engines
- Link structure of the Web
Web As A Graph
- Page: node
- Link: edge
Link Degree
How many links? In-degree:
(no. of pages) ∝ 1 / (no. of in-links)^2.1
Power law: why consistently 2.1?
Link Degree
Out-degree:
(no. of pages) ∝ 1 / (no. of out-links)^2.7
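A minimal sketch of how such an exponent can be estimated from degree data (assumes numpy; `in_degrees` is a hypothetical list of per-page in-degree counts, and a log-log least-squares fit is only a rough estimator):

```python
import numpy as np

def power_law_exponent(degrees):
    """Fit count(k) ~ k^(-alpha) by a least-squares line in log-log space."""
    degrees = np.asarray(degrees)
    ks, counts = np.unique(degrees[degrees > 0], return_counts=True)
    slope, _ = np.polyfit(np.log(ks), np.log(counts), 1)
    return -slope   # ~2.1 for Web in-degrees, ~2.7 for out-degrees

# in_degrees = [...]  # one entry per page
# print(power_law_exponent(in_degrees))
```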
Large-Scale Structure?
Study by AltaVista & IBM, 1999
- Based on 200M pages downloaded by the AltaVista crawler
- "Bow-tie" result based on two experiments
Experiment 1: Strongly Connected Components
Strongly connected component (SCC): C is a strongly connected component if
for all a, b ∈ C, there are paths from a to b and from b to a
[Figure: two small example graphs on nodes a, b, c; one is an SCC, the other is not]
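A minimal sketch of finding SCCs in a link graph, assuming the networkx library is available; the tiny example graph is hypothetical:

```python
import networkx as nx

# Toy Web graph: a <-> b <-> c form a cycle, d only points in.
g = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")])

sccs = sorted(nx.strongly_connected_components(g), key=len, reverse=True)
print(sccs)          # [{'a', 'b', 'c'}, {'d'}] -- the giant SCC comes first
print(len(sccs[0]))  # size of the largest SCC (50M pages in the 1999 study)
```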
Result 1: SCC
Identified all SCCs among the 200M pages
- Biggest SCC: 50M pages (25%)
- Other SCCs are small: second largest ≈ 150K, mostly fewer than 1,000 nodes
Experiment 2: Reachability
How many pages can we reach starting from a random page?
Experiment:
- Pick 500 random pages
- Follow links breadth-first until no more links
- Repeat the same experiment following links in the "reverse direction"
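A minimal sketch of the forward-direction reachability measurement, assuming a hypothetical adjacency-list dict; the reverse direction would run the same search on the transposed graph:

```python
from collections import deque

def reachable_count(graph: dict, start) -> int:
    """Breadth-first search: how many pages are reachable from `start`?"""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in graph.get(page, ()):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return len(seen)

# In the 1999 study, ~50% of start pages reached ~100M pages and
# ~50% reached fewer than 1,000.
```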
Result 2: Reachability
Out-links (forward direction):
- 50% of start pages reach ~100M pages
- 50% reach fewer than 1,000
Result 2: Reachability
In-links (reverse direction):
- 50% of start pages reach ~100M pages
- 50% reach fewer than 1,000
What Can We Conclude?
A 50M-page (25%) SCC at the core
[Figure: SCC (50M, 25%)]
What Can We Conclude?
How many nodes would we reach from the SCC? Clearly not the "fewer than 1,000" case, so ~100M
→ 50M more pages are reachable from the SCC (no way back, though): the OUT set
[Figure: SCC (50M, 25%) with OUT (50M, 25%) attached]
What Can We Conclude?
Similar result for "in-links" when we followed links backwards
→ 50M more pages reachable by following in-links: the IN set
[Figure: IN (50M, 25%), SCC (50M, 25%), OUT (50M, 25%)]
What Can We Conclude?
The remaining 25% (≈50M pages): miscellaneous
[Figure: bow-tie with IN (50M, 25%), SCC (50M, 25%), OUT (50M, 25%), Misc (50M, 25%)]
Questions
How did they "crawl" the 50M IN and 50M Misc nodes in the first place?
- There may be many more IN and Misc nodes that were never crawled (the 25% figures are lower bounds)
Only 25% in the SCC is surprising (will be explained)
SCC
If there are only two links, A → B and B → A, then A and B become one SCC.
[Figure: two nodes A and B linked in both directions]
Links between In, SCC and Out
- Not a single link from SCC to IN
- Not a single link from OUT to SCC
- At least 50% of the Web is "unknown" to the core SCC
Diameter of SCC
On average, 16 links between two nodes in SCC
The “maximum distance” (diameter) is at least 28
Questions?
More Sources For Web Characteristics
- OCLC (Online Computer Library Center): http://wcp.oclc.org
- Netcraft Survey: http://www.netcraft.com/survey/
- NEC Web Analysis: http://www.webmetrics.com
How To Sample?
Method 1: take the last page of each walk as one sample, then repeat the walk
- Many "wasted" visits
How To Sample?
Method 2: Take last k pages
Are they random samples?
How To Sample?
Theorem: if k is large enough, they are approximately random pages
Intuition: if we visit many pages, the visited pages are all different
How To Sample?
Goal: estimate A/N by m/k, i.e., make m/k ≈ A/N with high probability
- N: total number of pages; A: number of pages with the property of interest
- k: number of sampled pages (the last k pages of the walk); m: number of sampled pages with the property

Pr[ |m/k − A/N| ≥ ε₁ · (A/N) ] ≤ δ   when   k = O( (1/ε₂) · (N/A) · (1/ε₁²) · log(1/δ) )
How To Sample?
Assuming A is 20% of the Web:
- ε₁ = 0.1: less than 10% error
- δ = 0.01: 99% confidence
- ε₂ = 10^-5: the value from the 1996 Web crawl
k ≈ 350,000,000 walk steps, of which only ~12,000 are non-self-loop
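A rough check of these numbers under the assumptions above (the big-O hides constants, so only the order of magnitude is meaningful):

```python
import math

A_over_N = 0.20      # property of interest covers ~20% of the Web
eps1 = 0.1           # allowed relative error
delta = 0.01         # failure probability (99% confidence)
eps2 = 1e-5          # graph-structure parameter from the 1996 Web crawl
d = 300_000          # padded degree; real pages have ~10 out-links

k = (1 / eps2) * (1 / A_over_N) * (1 / eps1 ** 2) * math.log(1 / delta)
print(f"{k:.2e} walk steps")              # ~2.3e8, same order as 350M
print(f"{k * 10 / d:.0f} non-self-loop")  # ~8,000, same order as 12,000
```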