Web Characterization Week 9 LBSC 690 Information Technology

Page 1: Web Characterization

Web Characterization

Week 9

LBSC 690

Information Technology

Page 2: Web Characterization

Outline

• What is the Web?

• What’s on the Web?

• What is the nature of the Web?

• Preserving the Web

Page 3: Web Characterization

Defining the Web

• HTTP, HTML, or URL?

• Static, dynamic or streaming?

• Public, protected, or internal?

Page 4: Web Characterization

Economics of the Web in 1995

• Affordable storage
  – 300,000 words/$

• Adequate backbone capacity
  – 25,000 simultaneous transfers

• Adequate “last mile” bandwidth
  – 1 second/screen

• Display capability
  – 10% of US population

• Effective search capabilities
  – Lycos (now Google), Yahoo

Page 5: Web Characterization

Nature of the Web

• Over one billion pages by 1999
  – Growing at 25% per month!
  – Google indexed about 3 billion pages in 2003

• Unstable
  – Changing at 1% per week

• Redundant
  – 30-40% (near) duplicates
  – e.g., Unix man page tree
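The rates on this slide compound, which is easy to underestimate. The sketch below works through the arithmetic. The function names are illustrative, and the last function assumes that the same 1% weekly change rate applies independently every week, which the slide does not state.

```python
import math


def annual_factor(monthly_rate):
    """Growth factor over 12 months of compounding at monthly_rate."""
    return (1 + monthly_rate) ** 12


def doubling_time_months(monthly_rate):
    """Months needed for the page count to double at monthly_rate."""
    return math.log(2) / math.log(1 + monthly_rate)


def fraction_unchanged(weekly_change_rate, weeks):
    """Share of pages still unchanged after `weeks`.

    Assumption: each week an independent weekly_change_rate share of the
    remaining unchanged pages is modified.
    """
    return (1 - weekly_change_rate) ** weeks


if __name__ == "__main__":
    print(f"Yearly growth factor at 25%/month: {annual_factor(0.25):.1f}x")
    print(f"Doubling time: {doubling_time_months(0.25):.1f} months")
    print(f"Unchanged after one year at 1%/week: {fraction_unchanged(0.01, 52):.0%}")
```

At 25% per month the Web would grow roughly 14.6-fold in a year and double about every three months, which is why crawl snapshots go stale so quickly.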

Page 6: Web Characterization

Source: Michael Lesk, How Much Information is there in the World?

Page 7: Web Characterization

Number of Web Sites

Page 8: Web Characterization

Web Sites by Country, 2002

Page 9: Web Characterization

What’s a Web “Site”?

• OCLC counts any server at port 80
  – Misses many servers at other ports

• Some servers host unrelated content
  – Geocities

• Some content requires specialized servers
  – rtsp

Page 10: Web Characterization

World Trade in 2001

Rank  Exporters           Value  Share  Change
1     United States       730.8   11.9      -6
2     Germany             570.8    9.3       3
3     Japan               403.5    6.6     -16
4     France              321.8    5.2      -1
5     United Kingdom      273.1    4.4      -4
6     China               266.2    4.3       7
7     Canada              259.9    4.2      -6
8     Italy               241.1    3.9       0
9     Netherlands         229.5    3.7      -2
10    Hong Kong, China    191.1    3.1      -6
        domestic exports   20.3    0.3     -14
        re-exports        170.8    2.8      -5

Rank  Importers           Value  Share  Change
1     United States      1180.2   18.3      -6
2     Germany             492.8    7.7      -1
3     Japan               349.1    5.4      -8
4     United Kingdom      331.8    5.2      -3
5     France              325.8    5.1      -2
6     China               243.6    3.8       8
7     Italy               232.9    3.6      -2
8     Canada              227.2    3.5      -7
9     Netherlands         207.3    3.2      -5
10    Hong Kong, China    202.0    3.1      -6
        retained imports   31.2    0.5     -11

Source: World Trade Organization

Page 11: Web Characterization

Global Internet User Population

[Charts by language for 2000 and 2005; labeled segments: English, Chinese]

Source: Global Reach

Page 12: Web Characterization

Widely Spoken Languages

[Bar chart: Speakers (Millions), axis from 0 to 800, for Chinese, English, Hindi-Urdu, Spanish, Portuguese, Bengali, Russian, Arabic, Japanese]

Source: http://www.g11n.com/faq.html

Page 13: Web Characterization

Source: James Crawford, http://ourworld.compuserve.com/homepages/JWCRAWFORD/can-pop.htm

Page 14: Web Characterization

Web Page Languages

[Chart legend: English, Japanese, German, French, Chinese, Spanish, Italian, Swedish, Malay, Korean, Portuguese, Dutch, Danish, Czech, Finnish, Russian, Polish, Hungarian, Norwegian, Estonian, Greek, Bulgarian, Croatian, Basque, Thai, Turkish, Arabic, Albanian, Others & Unknown]

Source: Jack Xu, Excite@Home, 1999

Page 15: Web Characterization

European Web Size: Exponential Growth

[Chart: Billions of Words (axis ticks 0, 1, 10, 100, 1,000, 10,000) by year, Oct-96 through Oct-05; series: English, Other European]

Source: Extrapolated from Grefenstette and Nioche, RIAO 2000

Page 16: Web Characterization

European Web Content

Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997

Page 17: Web Characterization

Live Streams

Almost 2000 Internet-accessible Radio and Television Stations

[Chart: English vs. Other Languages; values shown: 529 and 1367]

Source: www.real.com, Feb 2000

Page 18: Web Characterization

Streaming Media

• SingingFish indexes 35 million streams

• 60% of queries are for music
  – Then movies
  – Then sports
  – Then news

Page 19: Web Characterization

Crawling the Web

Page 20: Web Characterization

Web Crawl Challenges

• Temporary server interruptions

• Discovering “islands” and “peninsulas”

• Duplicate and near-duplicate content

• Dynamic content

• Link rot

• Server and network loads

• Have I seen this page before?
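The last question, "Have I seen this page before?", is usually answered at the URL level first. The sketch below is one minimal way to do that, not a description of any particular crawler: it normalizes each URL and keeps a set of the normalized forms. The `canonicalize` rules and the `Frontier` class are illustrative assumptions; real crawlers apply many more normalization rules.

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the resource.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Reduce trivially different spellings of a URL to one form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the scheme's default.
    if parts.port and DEFAULT_PORTS.get(scheme) != parts.port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # The fragment (#...) never reaches the server, so it is discarded.
    return urlunsplit((scheme, host, path, parts.query, ""))


class Frontier:
    """Queue of URLs to fetch that silently ignores repeats."""

    def __init__(self):
        self.seen = set()
        self.queue = deque()

    def add(self, url):
        """Queue the URL; return False if an equivalent one was seen."""
        key = canonicalize(url)
        if key in self.seen:
            return False
        self.seen.add(key)
        self.queue.append(key)
        return True


if __name__ == "__main__":
    frontier = Frontier()
    print(frontier.add("HTTP://Example.com:80/a#top"))  # first sighting
    print(frontier.add("http://example.com/a"))         # same page again
```

A URL-level check only catches repeats that share an address; mirrors and near-duplicates at different URLs need content-level comparison.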

Page 21: Web Characterization

Duplicate Detection

• Structural
  – Identical directory structure (e.g., mirrors, aliases)

• Syntactic
  – Identical bytes
  – Identical markup (HTML, XML, …)

• Semantic
  – Identical content
  – Similar content (e.g., with a different banner ad)
  – Related content (e.g., translated)
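Two of these levels can be sketched briefly. "Identical bytes" can be tested by comparing a hash of the raw content, and "similar content" can be approximated by comparing sets of overlapping word sequences (shingles) with the Jaccard coefficient. Shingling is a technique swapped in here for illustration; the slide does not name a method, and the function names, the shingle length of 3, and the sample pages are all assumptions.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Hash of the raw bytes; equal fingerprints mean identical bytes
    (barring hash collisions)."""
    return hashlib.sha256(data).hexdigest()


def shingles(text, k=3):
    """Set of all k-word sequences in the text, lowercased."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(text_a, text_b, k=3):
    """Shared shingles divided by all shingles; 1.0 means the same set."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    body = "the quick brown fox jumps over the lazy dog"
    page_a = "buy shoes now " + body
    page_b = "visit our sponsor " + body  # same page, different banner ad
    print(fingerprint(page_a.encode()) == fingerprint(page_b.encode()))
    print(round(jaccard(page_a, page_b), 2))
```

The byte-level hash treats the two sample pages as different, while the shingle overlap stays high because only the banner text changed. Related content such as translations would need a different approach entirely.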

Page 22: Web Characterization

Robots Exclusion Protocol

• Based on voluntary compliance by crawlers

• Exclusion by site
  – Create a robots.txt file at the server’s top level
  – Indicate which directories not to crawl

• Exclusion by document (in HTML head)
  – Not implemented by all crawlers

<meta name="robots" content="noindex,nofollow">
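Both exclusion mechanisms can be checked with Python's standard library. The sketch below parses a robots.txt body with `urllib.robotparser` and reads the robots meta tag with `html.parser`. The robots.txt contents, the `ExampleBot` user agent, the example.com URLs, and the `RobotsMetaParser` class are made up for illustration. Because compliance is voluntary, a crawler has to run checks like these itself.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def build_parser(robots_txt):
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="...">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def meta_directives(html):
    """Return the set of robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    print(rules.can_fetch("ExampleBot", "http://www.example.com/private/a.html"))
    print(rules.can_fetch("ExampleBot", "http://www.example.com/index.html"))
    page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
    print(sorted(meta_directives(page)))
```

Site-level rules are checked before a page is fetched; the meta tag can only be honored after the page has been downloaded and parsed.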

Page 23: Web Characterization

Link Structure of the Web

Page 24: Web Characterization

The Deep Web

• Dynamic pages, generated from databases

• Not easily discovered using crawling

• Perhaps 400-500 times larger than surface Web

• Fastest growing source of new information

Page 25: Web Characterization

Content of the Deep Web

Page 26: Web Characterization

Deep Web

• 60 Deep Sites Exceed Surface Web by 40 Times

Name: National Climatic Data Center (NOAA)
Type: Public
URL: http://www.ncdc.noaa.gov/ol/satellite/satelliteresources.html
Web Size (GBs): 366,000

Name: NASA EOSDIS
Type: Public
URL: http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html
Web Size (GBs): 219,600

Name: National Oceanographic (combined with Geophysical) Data Center (NOAA)
Type: Public/Fee
URL: http://www.nodc.noaa.gov/, http://www.ngdc.noaa.gov/
Web Size (GBs): 32,940

Name: Alexa
Type: Public (partial)
URL: http://www.alexa.com/
Web Size (GBs): 15,860

Name: Right-to-Know Network (RTK Net)
Type: Public
URL: http://www.rtk.net/
Web Size (GBs): 14,640

Name: MP3.com
Type: Public
URL: http://www.mp3.com/

Page 27: Web Characterization

Hands on: The Wayback Machine

• Internet Archive
  – Stored Alexa.com Web crawls since 1997
  – http://archive.org

• Check out Maryland’s Web site in 1997

• Check out the history of your favorite site
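For the hands-on exercise, archived copies can also be addressed directly. The sketch below only builds URLs and fetches nothing. The two URL patterns reflect the Internet Archive's public addressing scheme as commonly documented and should be treated as assumptions; `www.umd.edu` is an assumed hostname for "Maryland’s Web site", and the function names are illustrative.

```python
from urllib.parse import urlencode


def wayback_url(url, timestamp):
    """URL of the archived copy closest to `timestamp`.

    Assumption: timestamps are digits in YYYYMMDDhhmmss order and may be
    truncated (for example "1997" for some capture in that year).
    """
    return f"https://web.archive.org/web/{timestamp}/{url}"


def availability_query(url, timestamp):
    """Query URL asking whether a capture exists near `timestamp`.

    Assumption: the endpoint answers with JSON describing the closest
    capture, if any.
    """
    query = urlencode({"url": url, "timestamp": timestamp})
    return f"https://archive.org/wayback/available?{query}"


if __name__ == "__main__":
    # www.umd.edu is an assumed address for the exercise above.
    print(wayback_url("www.umd.edu", "1997"))
    print(availability_query("www.umd.edu", "1997"))
```

Pasting the first printed address into a browser should land on or near a 1997 capture if one exists; swapping in another site and year covers the second exercise.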

Page 28: Web Characterization

Discussion Point

• Can we save everything?

• Should we?

• Do people have a right to remove things?