A Characterization of the Portuguese Web Daniel Gomes and Mário J. Silva University of Lisbon http://xldb.fc.ul.pt

Page 1: A Characterization of the Portuguese Web

A Characterization of the Portuguese Web

Daniel Gomes and Mário J. Silva

University of Lisbon

http://xldb.fc.ul.pt

Page 2: A Characterization of the Portuguese Web

Presentation

• Introduction

• Setup

• Statistics

• Conclusions

• Future Work

Page 3: A Characterization of the Portuguese Web

Terminology

• Document: file resulting from a successful HTTP download

• Publisher: entity responsible for publishing the document on the Web

• Web site: collection of documents referenced by URLs that share the same host name
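The "Web site" definition above can be illustrated with a minimal sketch that groups URLs by host name. The function name `group_by_site` and the sample URLs are illustrative assumptions, not part of the original presentation.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_site(urls):
    """Map each host name to the list of URLs that share it."""
    sites = defaultdict(list)
    for url in urls:
        # urlsplit(...).hostname is the lower-cased host without the port.
        host = urlsplit(url).hostname
        if host:
            sites[host].append(url)
    return dict(sites)


# Illustrative input: two URLs on one host, one URL on another.
example = [
    "http://xldb.fc.ul.pt/",
    "http://xldb.fc.ul.pt/index.html",
    "http://www.tumba.pt/",
]
sites = group_by_site(example)
```

Under this definition, `www.example.pt` and `example.pt` would count as two different sites, because their host names differ.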

Page 4: A Characterization of the Portuguese Web

Why Characterize?

• Extraction of cultural, commercial and social aspects:

– Presence of natural languages

– Most popular web servers

• Adequate design and tuning of web applications:

– The web is described through its characterization.

– Parameters of the Web graph

- How many nodes compose the graph

- Types of these nodes

Page 5: A Characterization of the Portuguese Web

Characterizing the WWW vs. Community Webs

WWW:

• Huge

• Sampling is a “must”

• WWW is not uniform

• Small partitions are ignored

Community Webs:

+ Relevant to a certain community

+ Fewer resources needed

+ A complete scan is possible, no sampling!

– Difficult to establish boundaries

Page 6: A Characterization of the Portuguese Web

WWW.TUMBA.PT

Publicly available:

• Characterize

• Search

Almost:

• Archive

» The Portuguese Web

Page 7: A Characterization of the Portuguese Web

Main objectives:

• Estimate the resources needed to create a web archive of the Portuguese Web;

• Validate crawls;

• Gather guidelines to improve the systems (crawling, repository, index).

Page 8: A Characterization of the Portuguese Web

Characterization Setup

• Viúva Negra Crawlers: gather information from the Web and insert it into Versus.

• Versus: keeps documents in files and meta-data in relations.

• Web statistics are produced by issuing SQL queries to the Versus Repository.

[Figure: several VN Crawlers feed the Versus Repository; Web statistics are obtained from the repository through SQL.]
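As a rough illustration of producing statistics with SQL queries over crawl meta-data, the sketch below uses an in-memory SQLite database. The `document` table, its columns, and the sample rows are hypothetical; the presentation does not describe the actual Versus schema.

```python
import sqlite3

# Hypothetical meta-data table; the real Versus schema is not given in the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document ("
    "url TEXT, site TEXT, http_status INTEGER, size_bytes INTEGER)"
)
rows = [
    ("http://a.pt/", "a.pt", 200, 1200),
    ("http://a.pt/x", "a.pt", 404, 0),
    ("http://b.com/", "b.com", 200, 5300),
]
conn.executemany("INSERT INTO document VALUES (?, ?, ?, ?)", rows)

# Number of URLs per HTTP status code.
status_counts = dict(
    conn.execute(
        "SELECT http_status, COUNT(*) FROM document GROUP BY http_status"
    )
)

# Number of successfully downloaded documents per site.
docs_per_site = dict(
    conn.execute(
        "SELECT site, COUNT(*) FROM document "
        "WHERE http_status = 200 GROUP BY site"
    )
)
```

Keeping meta-data in relations means each new statistic is one more query over the same tables, with no need to re-read the stored documents.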

Page 9: A Characterization of the Portuguese Web

What is the Portuguese Web?

• Set of documents of cultural and sociological interest to the Portuguese people.

• Language

– Brazilian/Portuguese community web sites

– Both written in Portuguese

• TLDs

– Many sites hosted in gTLDs.

Page 10: A Characterization of the Portuguese Web

Crawler configuration

• Influences statistics

– The depth of the crawl influences the number of documents gathered

– Replication

• Mirrors

• URL Aliases

• Crawl as many documents as possible

• Maintain robustness against pathological situations
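Replication through mirrors and URL aliases means the same content is reachable under several URLs. One simple way to spot it is to group documents by a digest of their content, as in the sketch below. This is an assumption for illustration only: the presentation does not say how replication was detected, and exact-content hashing will miss near-duplicates.

```python
import hashlib
from collections import defaultdict


def find_replicas(documents):
    """Return groups of URLs whose content is byte-for-byte identical.

    `documents` is an iterable of (url, content_bytes) pairs.
    """
    by_digest = defaultdict(list)
    for url, content in documents:
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(url)
    # Keep only digests shared by more than one URL.
    return [urls for urls in by_digest.values() if len(urls) > 1]


# Illustrative input: the first two URLs are aliases of the same page.
docs = [
    ("http://www.example.pt/", b"<html>home</html>"),
    ("http://example.pt/index.html", b"<html>home</html>"),
    ("http://example.pt/about", b"<html>about</html>"),
]
replicas = find_replicas(docs)
```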

Page 11: A Characterization of the Portuguese Web

VN Configuration Parameters

– Text documents (list selected MIME types)

– Hosted under “.PT”

– Hosted under “.COM”, “.NET”, “.ORG”, “.TV”:

• Written in Portuguese

• Host site had at least one incoming link originating under “.PT”

– Download timeout = 60 s

– Max size = 2 MB

– Avoid traps:

• Max docs per site = 8000

• Crawl the same document at most 50 times
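The parameters above amount to a membership test for the Portuguese Web. The sketch below is one possible reading of them, not the Viúva Negra implementation: it assumes that both conditions for generic TLDs must hold together, that 2 MB means 2 × 1024 × 1024 bytes, and that language detection and link information come from elsewhere. All names (`Candidate`, `in_portuguese_web`, `within_limits`) are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

TIMEOUT_SECONDS = 60               # download timeout from the slide
MAX_SIZE_BYTES = 2 * 1024 * 1024   # "2 MB", assumed to be binary megabytes
MAX_DOCS_PER_SITE = 8000           # trap avoidance limit from the slide
GENERIC_TLDS = {"com", "net", "org", "tv"}


@dataclass
class Candidate:
    url: str
    language: str                  # result of a hypothetical language detector
    linked_from_pt: bool           # host has an incoming link from a .pt site
    docs_already_from_site: int = 0


def in_portuguese_web(c):
    """Decide whether a candidate document falls inside the crawl policy."""
    host = urlsplit(c.url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    if c.docs_already_from_site >= MAX_DOCS_PER_SITE:
        return False               # per-site limit reached
    if tld == "pt":
        return True
    if tld in GENERIC_TLDS:
        # Assumption: both conditions are required, not just one of them.
        return c.language == "pt" and c.linked_from_pt
    return False


def within_limits(size_bytes, download_seconds):
    """Check the size and timeout limits for a single download."""
    return size_bytes <= MAX_SIZE_BYTES and download_seconds <= TIMEOUT_SECONDS
```

For example, a Portuguese-language page on a `.com` host is accepted only if that host is also linked from some `.pt` site, while any page under `.pt` is accepted regardless of language.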

Page 12: A Characterization of the Portuguese Web

Collected Statistics

• 4 million URLs and 78 GB.

• 83% successfully downloaded (200)

• 3.4% not found (404)

• 1.2% took more than 1 minute to download

• 0.5% bigger than 2 MB
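To get a feel for the absolute numbers behind these percentages, the short calculation below converts each share into an approximate URL count. It assumes that every percentage is relative to the rounded total of 4 million URLs, which the slide does not state explicitly.

```python
TOTAL_URLS = 4_000_000  # "4 million URLs", a rounded figure from the slide

# Shares reported on the slide, assumed to be fractions of TOTAL_URLS.
shares = {
    "downloaded (200)": 0.83,
    "not found (404)": 0.034,
    "timeout (> 1 minute)": 0.012,
    "larger than 2 MB": 0.005,
}

# Approximate number of URLs in each category.
counts = {label: round(TOTAL_URLS * share) for label, share in shares.items()}

# Fraction of URLs not covered by the four categories listed on the slide.
unaccounted = 1 - sum(shares.values())

for label, count in counts.items():
    print(f"{label}: about {count:,} URLs")
print(f"not covered by the slide: {unaccounted:.1%}")
```

Under that assumption, roughly 3.3 million URLs were downloaded successfully, and about 12% of the URLs fall into outcomes the slide does not list.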

Page 13: A Characterization of the Portuguese Web

Site statistics

Sites per TLD:

• PT: 85%

• COM: 12%

• NET: 2%

• ORG: 1%

• TV: 0%

Documents per site:

• 1: 38%

• 1-10: 34%

• 10-100: 21%

• 100-1000: 6%

• >1000: 1%

Page 14: A Characterization of the Portuguese Web

Language Distribution (.pt only)

Portuguese 73%

English 17%

German 3%

Spanish 1%

others 1%

unknown 4%

French 1%

Page 15: A Characterization of the Portuguese Web

Size Distribution

[Figure: histogram of the number of documents (y-axis, 0 to 1,000,000) by document size in KB (x-axis, 0 to 2048, doubling at each tick).]

Page 16: A Characterization of the Portuguese Web

Other Statistics

• Average length of a URL is 62 chars

• Unknown Last-Modified date: 53%

• HTML: 95%

• 78 GB of data produced 8.7 GB of text

• Meta-tags are scarce (description 17%, keywords 18%)

• 15.5% Replication

Page 17: A Characterization of the Portuguese Web

http://wealth.com.sapo.pt/gui/flat.swf?exbackground=993333&makenavfield0=HitHarvester&makenavfield10=ClickSilo&makenavfield11=BraStart&makenavfield12=AskMiky&makenavfield13=TrafficG&makenavfield14=Click4u&makenavfield1=YesMoreHits&makenavfield2=ClickityCash&makenavfield3=StartFrenzy&makenavfield4=NoMoreHits&makenavfield5=ILoveClicks&makenavfield6=ClixSwap&makenavfield7=EZHits4U&makenavfield8=HitSense&makenavfield9=Clickthru&makenavurl0=http://www.hitharvester.com/referral.asp?ref=kurtz53&makenavurl10=http://www.clicksilo.com/referrals/info.asp?Agent=kurtz53&makenavurl11=http://www.brastart.com/cgi-bin/join.cgi?r=kurtz53&makenavurl12=http://www.askmiky.com/home/signup.php?ref=kurtz53&makenavurl13=http://www.trafficg.com/home.php?member=kurtz53&makenavurl14=http://www.clicks4u.com/X92433/&makenavurl1=http://www.yesmorehits.com/cgi-bin/join.cgi?r=kurtz53&makenavurl2=http://www.clickitycash.com/cgi-bin/join.cgi?refer=52786&makenavurl3=http://www.startfrenzy.com/default.asp?userid=kurtz53&makenavurl4=http://www.nomorehits.com/cgi-bin/start.cgi?referrer=kurtz53&makenavurl5=http://www.iloveclicks.com/signup.asp?referrer=22014&makenavurl6=http://www.clixswap.com/?ref=csa12481&makenavurl7=http://www.ezhits4u.com/index.asp?ref=kurtz53&makenavurl8=http://www.hitsense.com/refer.php?ref=kurtz53&makenavurl9=http://www.clickthru.com/referral?ref=280693&tarframe=_blank

Page 19: A Characterization of the Portuguese Web

Conclusions

• Defined the Portuguese Web as a crawling policy.

• Characterization cannot be dissociated from crawling technology.

• A search engine repository is a source of interesting statistics.

• Statistics are an important tool for validating and designing web applications

Page 20: A Characterization of the Portuguese Web

Future Work

• Study the linkage structure

• Crawl other document types, such as PostScript

• Improve the algorithm used to find Portuguese web sites outside the .PT domain

• Study the evolution of the Portuguese Web

Page 21: A Characterization of the Portuguese Web

Thank you for your attention.

http://xldb.fc.ul.pt

http://www.tumba.pt