Web Characterization Week 9 LBSC 690 Information Technology

Web Characterization

Week 9

LBSC 690

Information Technology

Outline

• What is the Web?

• What’s on the Web?

• What is the nature of the Web?

• Preserving the Web

Defining the Web

• HTTP, HTML, or URL?

• Static, dynamic or streaming?

• Public, protected, or internal?

Economics of the Web in 1995

• Affordable storage

– 300,000 words/$

• Adequate backbone capacity

– 25,000 simultaneous transfers

• Adequate “last mile” bandwidth

– 1 second/screen

• Display capability

– 10% of US population

• Effective search capabilities

– Lycos (now Google), Yahoo

Nature of the Web

• Over one billion pages by 1999

– Growing at 25% per month!

– Google indexed about 3 billion pages in 2003

• Unstable

– Changing at 1% per week

• Redundant

– 30-40% (near) duplicates

– e.g., Unix man page tree

Source: Michael Lesk, How Much Information is there in the World?
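A quick arithmetic sketch of what "25% per month" implies if the rate is compounded (an assumption; the slide gives only the monthly figure). The function names are illustrative, not from the source. Compounded, the rate works out to roughly a 14.6-fold increase per year, with the page count doubling about every 3.1 months.

```python
import math


def annual_growth_factor(monthly_rate):
    """Factor by which the page count grows in 12 months of compounding."""
    return (1 + monthly_rate) ** 12


def doubling_time_months(monthly_rate):
    """Months needed for the page count to double at a steady monthly rate."""
    return math.log(2) / math.log(1 + monthly_rate)


if __name__ == "__main__":
    rate = 0.25  # 25% per month, the figure quoted on the slide
    print(f"Growth per year: x{annual_growth_factor(rate):.1f}")
    print(f"Doubling time: {doubling_time_months(rate):.1f} months")
```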

Number of Web Sites

Web Sites by Country, 2002

What’s a Web “Site”?

• OCLC counts any server at port 80

– Misses many servers at other ports

• Some servers host unrelated content

– Geocities

• Some content requires specialized servers

– rtsp

World Trade in 2001

Exporters

Rank | Exporter | Value | Share | Change
1 | United States | 730.8 | 11.9 | -6
2 | Germany | 570.8 | 9.3 | 3
3 | Japan | 403.5 | 6.6 | -16
4 | France | 321.8 | 5.2 | -1
5 | United Kingdom | 273.1 | 4.4 | -4
6 | China | 266.2 | 4.3 | 7
7 | Canada | 259.9 | 4.2 | -6
8 | Italy | 241.1 | 3.9 | 0
9 | Netherlands | 229.5 | 3.7 | -2
10 | Hong Kong, China | 191.1 | 3.1 | -6
 | domestic exports | 20.3 | 0.3 | -14
 | re-exports | 170.8 | 2.8 | -5

Importers

Rank | Importer | Value | Share | Change
1 | United States | 1180.2 | 18.3 | -6
2 | Germany | 492.8 | 7.7 | -1
3 | Japan | 349.1 | 5.4 | -8
4 | United Kingdom | 331.8 | 5.2 | -3
5 | France | 325.8 | 5.1 | -2
6 | China | 243.6 | 3.8 | 8
7 | Italy | 232.9 | 3.6 | -2
8 | Canada | 227.2 | 3.5 | -7
9 | Netherlands | 207.3 | 3.2 | -5
10 | Hong Kong, China | 202.0 | 3.1 | -6
 | retained imports a | 31.2 | 0.5 | -11

Source: World Trade Organization

Global Internet User Population

[Figure: Internet users by language, 2000 and 2005; labeled segments include English and Chinese]

Source: Global Reach

Widely Spoken Languages

[Figure: speakers (millions), axis from 0 to 800, for Chinese, English, Hindi-Urdu, Spanish, Portuguese, Bengali, Russian, Arabic, and Japanese]

Source: http://www.g11n.com/faq.html

Source: James Crawford, http://ourworld.compuserve.com/homepages/JWCRAWFORD/can-pop.htm

Web Page Languages

[Figure: Web pages by language. Legend: English, Japanese, German, French, Chinese, Spanish, Italian, Swedish, Malay, Korean, Portuguese, Dutch, Danish, Czech, Finnish, Russian, Polish, Hungarian, Norwegian, Estonian, Greek, Bulgarian, Croatian, Basque, Thai, Turkish, Arabic, Albanian, Others & Unknown]

Source: Jack Xu, Excite@Home, 1999

European Web Size: Exponential Growth

[Figure: billions of words (axis ticks 0, 1, 10, 100, 1,000, 10,000) by date, October 1996 to October 2005; series: English, Other European]

Source: Extrapolated from Grefenstette and Nioche, RIAO 2000

European Web Content

Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997

Live Streams

• Almost 2000 Internet-accessible radio and television stations

[Figure: stations by language, English vs. other languages; data labels: 529 and 1367]

Source: www.real.com, Feb 2000

Streaming Media

• SingingFish indexes 35 million streams

• 60% of queries are for music

– Then movies

– Then sports

– Then news

Crawling the Web

Web Crawl Challenges

• Temporary server interruptions

• Discovering “islands” and “peninsulas”

• Duplicate and near-duplicate content

• Dynamic content

• Link rot

• Server and network loads

• Have I seen this page before?
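One common way to answer "Have I seen this page before?" at the URL level is to normalize each URL and keep the normalized forms in a set. The sketch below is a minimal illustration, not a description of any particular crawler; `canonicalize` and `SeenSet` are hypothetical names, and the normalization rules (lowercase host, drop default port, drop fragment) are assumptions chosen for the example.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the resource named
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment names a position inside a page, so it is discarded
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class SeenSet:
    """Remembers which canonical URLs a crawl has already visited."""

    def __init__(self):
        self._seen = set()

    def first_visit(self, url):
        """Return True the first time a URL is offered, False afterwards."""
        key = canonicalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

This catches only URL-level repeats; the same content served under unrelated URLs needs content-based duplicate detection.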

Duplicate Detection

• Structural

– Identical directory structure (e.g., mirrors, aliases)

• Syntactic

– Identical bytes

– Identical markup (HTML, XML, …)

• Semantic

– Identical content

– Similar content (e.g., with a different banner ad)

– Related content (e.g., translated)
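Two of these cases can be sketched in a few lines. Identical bytes can be detected by comparing a hash of the content, and similar content (such as a page with a different banner ad) can be estimated with word shingles and Jaccard overlap. Shingling is one well-known technique, swapped in here for illustration; the slides do not name a method. The function names and the shingle size `k=3` are assumptions.

```python
import hashlib
import re


def byte_fingerprint(content: bytes) -> str:
    """Hash of the raw bytes; equal fingerprints indicate identical bytes."""
    return hashlib.sha256(content).hexdigest()


def shingles(text, k=3):
    """Set of overlapping k-word sequences found in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Shared shingles divided by all distinct shingles (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(text_a, text_b, k=3):
    """Shingle-based similarity between two page texts."""
    return jaccard(shingles(text_a, k), shingles(text_b, k))
```

Two pages that differ only in a short banner share most of their shingles and score high; a threshold for calling them near-duplicates would have to be tuned on real data.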

Robots Exclusion Protocol

• Based on voluntary compliance by crawlers

• Exclusion by site

– Create a robots.txt file at the server’s top level

– Indicate which directories not to crawl

• Exclusion by document (in HTML head)

– Not implemented by all crawlers

<meta name="robots" content="noindex,nofollow">
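A minimal sketch of exclusion by site, using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleBot` user agent, and the example.com URLs are made up for illustration. Because compliance is voluntary, a crawler has to run a check like this itself before each fetch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt placed at the server's top level
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved crawler asks before fetching each URL
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/report.html")
```

In a live crawl, `RobotFileParser.set_url()` followed by `read()` would fetch the file from the server instead of parsing a string.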

Link Structure of the Web

The Deep Web

• Dynamic pages, generated from databases

• Not easily discovered using crawling

• Perhaps 400-500 times larger than surface Web

• Fastest growing source of new information

Content of the Deep Web

Deep Web

• 60 Deep Sites Exceed Surface Web by 40 Times

Name | Type | URL | Web Size (GBs)
National Climatic Data Center (NOAA) | Public | http://www.ncdc.noaa.gov/ol/satellite/satelliteresources.html | 366,000
NASA EOSDIS | Public | http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html | 219,600
National Oceanographic (combined with Geophysical) Data Center (NOAA) | Public/Fee | http://www.nodc.noaa.gov/, http://www.ngdc.noaa.gov/ | 32,940
Alexa | Public (partial) | http://www.alexa.com/ | 15,860
Right-to-Know Network (RTK Net) | Public | http://www.rtk.net/ | 14,640
MP3.com | Public | http://www.mp3.com/ |

Hands on: The Wayback Machine

• Internet Archive

– Stored Alexa.com Web crawls since 1997

– http://archive.org

• Check out Maryland’s Web site in 1997

• Check out the history of your favorite site

http://archive.org/
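For the hands-on exercise, archived pages can also be reached directly by URL. The sketch below assumes the Wayback Machine's address form `web.archive.org/web/<timestamp>/<page URL>`, where the timestamp is a prefix of YYYYMMDDhhmmss and the archive redirects to the nearest stored capture. The helper name `wayback_url` is hypothetical, and `http://www.umd.edu/` is only a guess at which "Maryland" site the exercise means.

```python
def wayback_url(page_url, timestamp):
    """Build a Wayback Machine address for a page near a given time.

    timestamp is a digit string that prefixes YYYYMMDDhhmmss, e.g. "1997"
    or "19970615".
    """
    if not (timestamp.isdigit() and 4 <= len(timestamp) <= 14):
        raise ValueError("timestamp must be 4 to 14 digits (YYYY[MMDDhhmmss])")
    return f"http://web.archive.org/web/{timestamp}/{page_url}"


if __name__ == "__main__":
    # Assumed target: the University of Maryland home page, some time in 1997
    print(wayback_url("http://www.umd.edu/", "1997"))
```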

Discussion Point

• Can we save everything?

• Should we?

• Do people have a right to remove things?
