32
Outline WIRE Project Web Crawler Conclusions WIRE: an Open Source Web Information Retrieval Environment Carlos Castillo and Ricardo Baeza-Yates Center for Web Research http://www.cwr.cl/ CS Dept., University of Chile OSWIR 2005 Compiegne, France September 19, 2005 Carlos Castillo and Ricardo Baeza-Yates Center for Web Research WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Embed Size (px)

Citation preview

Page 1: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

WIRE: an Open Source Web Information

Retrieval Environment

Carlos Castillo and Ricardo Baeza-Yates

Center for Web Researchhttp://www.cwr.cl/

CS Dept., University of Chile

OSWIR 2005Compiegne, FranceSeptember 19, 2005

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 2: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

1 WIRE Project

2 Web Crawler

3 Conclusions

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 3: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Motivation

Study subsets of the Web (1-50 million pages)

V We want high performance

V We want to keep as much data as possible

V We want to study scheduling algorithms

X wget is not enough

X Large-scale crawlers were not publicly available

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 4: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Motivation

Study subsets of the Web (1-50 million pages)

V We want high performance

V We want to keep as much data as possible

V We want to study scheduling algorithms

X wget is not enough

X Large-scale crawlers were not publicly available

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 5: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Motivation

Study subsets of the Web (1-50 million pages)

V We want high performance

V We want to keep as much data as possible

V We want to study scheduling algorithms

X wget is not enough

X Large-scale crawlers were not publicly available

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 6: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Motivation

Study subsets of the Web (1-50 million pages)

V We want high performance

V We want to keep as much data as possible

V We want to study scheduling algorithms

X wget is not enough

X Large-scale crawlers were not publicly available

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 7: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Motivation

Study subsets of the Web (1-50 million pages)

V We want high performance

V We want to keep as much data as possible

V We want to study scheduling algorithms

X wget is not enough

X Large-scale crawlers were not publicly available

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 8: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Motivation

Study subsets of the Web (1-50 million pages)

V We want high performance

V We want to keep as much data as possible

V We want to study scheduling algorithms

X wget is not enough

X Large-scale crawlers were not publicly available

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 9: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

General Architecture

Crawling Collection

Text Index

Importing

Statistics

Text Search

Clustering

XML Index XML Search

Classification

Focused Crawling

Extracting

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 10: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Characteristics

b Roughly 25,000 lines of open-source C/C++ code

L Asynchronous DNS and HTTP requests, small memoryand processing requirements (except during the analysis)

V Highly configurable: rate of download, parser parameters,scheduling policy, etc.

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 11: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Characteristics

b Roughly 25,000 lines of open-source C/C++ code

L Asynchronous DNS and HTTP requests, small memoryand processing requirements (except during the analysis)

V Highly configurable: rate of download, parser parameters,scheduling policy, etc.

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 12: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Characteristics

b Roughly 25,000 lines of open-source C/C++ code

L Asynchronous DNS and HTTP requests, small memoryand processing requirements (except during the analysis)

V Highly configurable: rate of download, parser parameters,scheduling policy, etc.

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 13: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Web Crawler

GathererParsing

Link extraction

HarvesterShort-term scheduling

Network transfers

SeederLink resolving

Robots exclusions

ManagerPage score calculationsLong-term scheduling

Collection

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 14: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Scheduling

P1

quality 0.4freshness 0.1visited? 1

P2

quality 0.7freshness 0.9visited? 1

P3

quality 0.6freshness -visited? 0

0.7

0.6

0.4

0.63

0

0.04 Profit: 0.36

Profit: 0.07

Profit: 0.6

=

=

=

}}}

FutureValue

CurrentValue

Profit=

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 15: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Downloading pages

World Wide Web

S1 S2 S3 S4 S5 S6 S7

P1,1

P1,2

P1,3

P1,4

P2,1

P2,2

P2,3

P2,4

P2,5

P2,6

P3,1

P3,2

P4,1

P4,2

P4,3

P4,4

P4,5

P5,1

P5,2

P5,3

P5,4

P6,1

P6,2

P7,1

P7,2

P7,3

P7,4

P7,5

P6,2

Web sites

Web pages

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 16: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Storing contents

Document

hash( )

Content seen?

Free space list

Disk Storage

1

2

3

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 17: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

URL parsing

host.domain.com 235

235 path/file.html 9421

h1('host.domain.com')

h2('235 dir/file.html')

SITE-ID = 235; DOC-ID = 9421

1

2

3

4

http://host.domain.com/dir/file.html

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 18: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Practical problems

Z The devil is in the details

§ Varying quality of service

§ Wrong DNS records, temporary DNS failures

§ HTTP responses without headers, with wrong headers,dates

§ HTML parsing has to be very tolerant

§ Duplicate pages, session-ids, etc.

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 19: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Practical problems

Z The devil is in the details

§ Varying quality of service

§ Wrong DNS records, temporary DNS failures

§ HTTP responses without headers, with wrong headers,dates

§ HTML parsing has to be very tolerant

§ Duplicate pages, session-ids, etc.

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 20: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Practical problems

Z The devil is in the details

§ Varying quality of service

§ Wrong DNS records, temporary DNS failures

§ HTTP responses without headers, with wrong headers,dates

§ HTML parsing has to be very tolerant

§ Duplicate pages, session-ids, etc.

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 21: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Practical problems

Z The devil is in the details

§ Varying quality of service

§ Wrong DNS records, temporary DNS failures

§ HTTP responses without headers, with wrong headers,dates

§ HTML parsing has to be very tolerant

§ Duplicate pages, session-ids, etc.

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 22: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Practical problems

Z The devil is in the details

§ Varying quality of service

§ Wrong DNS records, temporary DNS failures

§ HTTP responses without headers, with wrong headers,dates

§ HTML parsing has to be very tolerant

§ Duplicate pages, session-ids, etc.

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 23: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Practical problems

Z The devil is in the details

§ Varying quality of service

§ Wrong DNS records, temporary DNS failures

§ HTTP responses without headers, with wrong headers,dates

§ HTML parsing has to be very tolerant

§ Duplicate pages, session-ids, etc.

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 24: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Data analysis

b Includes link analysis and extraction of statistics (data isexported as .csv files)

b Reports are generated using LATEXand gnuplot

b Report about documents: histograms of size, in- andout-degree, link scores, page depth, HTTP responses,age, media types, etc.

b Report about sites: degree distribution in the hostgraph,maximum depth, pages per site, link structure, etc.

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 25: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Data analysis

b Includes link analysis and extraction of statistics (data isexported as .csv files)

b Reports are generated using LATEXand gnuplot

b Report about documents: histograms of size, in- andout-degree, link scores, page depth, HTTP responses,age, media types, etc.

b Report about sites: degree distribution in the hostgraph,maximum depth, pages per site, link structure, etc.

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 26: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Data analysis

b Includes link analysis and extraction of statistics (data isexported as .csv files)

b Reports are generated using LATEXand gnuplot

b Report about documents: histograms of size, in- andout-degree, link scores, page depth, HTTP responses,age, media types, etc.

b Report about sites: degree distribution in the hostgraph,maximum depth, pages per site, link structure, etc.

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 27: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Data analysis

b Includes link analysis and extraction of statistics (data isexported as .csv files)

b Reports are generated using LATEXand gnuplot

b Report about documents: histograms of size, in- andout-degree, link scores, page depth, HTTP responses,age, media types, etc.

b Report about sites: degree distribution in the hostgraph,maximum depth, pages per site, link structure, etc.

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 28: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Conclusions

V A tool for Web characterization studies

V Can be extended for other purposes

V Code and documentation available athttp://www.cwr.cl/projects/

Thank you.

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 29: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Conclusions

V A tool for Web characterization studies

V Can be extended for other purposes

V Code and documentation available athttp://www.cwr.cl/projects/

Thank you.

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 30: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Conclusions

V A tool for Web characterization studies

V Can be extended for other purposes

V Code and documentation available athttp://www.cwr.cl/projects/

Thank you.

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 31: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Conclusions

V A tool for Web characterization studies

V Can be extended for other purposes

V Code and documentation available athttp://www.cwr.cl/projects/

Thank you.

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/

Page 32: WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Outline WIRE Project Web Crawler Conclusions

Conclusions

V A tool for Web characterization studies

V Can be extended for other purposes

V Code and documentation available athttp://www.cwr.cl/projects/

Thank you.

Carlos Castillo and Ricardo Baeza-Yates Center for Web Research

WIRE: an Open Source Web Information Retrieval Environment http://www.cwr.cl/