Using the Web Efficiently: Mobile Crawlers

August 7, 1999

Joachim Hammer
Database Center, University of Florida
jhammer@cise.ufl.edu

17th Annual AoM/IAoM

Presentation Outline

- Problem Statement and Goal
- Web Crawling Techniques
  - Traditional Web Crawling
  - Mobile Web Crawling
- Mobile Crawling Architecture
  - Distributed Runtime Environment
  - Application Framework
- Performance Evaluation
- Summary and Future Work


What’s Wrong with the Web?

- Web represents a large distributed hypertext system
  - 1.6 million Web sites
  - 320 million Web documents
  - 40% of the Web content changes within a month
  - Exponential growth rate
- Lacks structure (i.e., no strict hierarchy)


Web Indices and Search Engines

- Search engine statistics:
  - Index size: 30-110 million pages (approx. 700GB)
  - Web coverage: 10%-35% and decreasing!
  - Daily crawl: 3-10 million pages (approx. 60GB)
- Year 2000 estimates:
  - Index size: 880 million pages (approx. 5.6TB)
  - Daily crawl: 80 million pages (approx. 480GB)
- Traditional Web crawling will experience severe scaling problems in the near future
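As a quick sanity check on these figures, dividing each storage estimate by its page count gives the average page size the estimates imply. The sketch below assumes decimal units (1 GB = 10^9 bytes) and pairs each 1999 storage figure with the upper bound of its page range; the slides state neither.

```python
# Implied average page size behind the estimates on this slide.
# Assumptions (not stated in the slides): decimal units, and the 1999
# storage figures correspond to the upper end of each page range.
GB = 1e9
TB = 1e12

estimates = {
    "1999 index (110M pages)": (110e6, 700 * GB),
    "1999 daily crawl (10M pages)": (10e6, 60 * GB),
    "2000 index (880M pages)": (880e6, 5.6 * TB),
    "2000 daily crawl (80M pages)": (80e6, 480 * GB),
}

def bytes_per_page(pages, total_bytes):
    """Average number of bytes stored or transferred per page."""
    return total_bytes / pages

implied = {name: bytes_per_page(p, b) for name, (p, b) in estimates.items()}
for name, size in implied.items():
    print(f"{name}: about {size / 1000:.1f} KB per page")
```

Under these assumptions all four figures come out at roughly 6 KB per page, so the year 2000 estimates scale the page counts while keeping the per-page cost about the same.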


Goals

- Long-term
  - Overlay the distributed Web structure with a centralized information system which allows efficient resource discovery
  - Turn Web into an effectively organized and cataloged “digital library”
  - Topic-specific search engines, e.g., self-health care, consumer electronics, etc.
- Project
  - Find an alternative to the current “brute-force” approach to Web crawling/indexing


Traditional Web Crawling Approach

[Figure: traditional crawling architecture. Components shown: Google domain, LAN, Web, Repository, URL Server, Indexer, Anchors, URL Resolver, four Crawlers, HTTP, Store Server.]

Based on the Google search engine (www.google.com)


Traditional Web Crawling

- Characteristics of traditional Web crawling:
  - Remote data access
  - Focus on rapid data retrieval
  - Centralized, database oriented architecture
  - Resource intensive
- Traditional Web crawling techniques do not exploit information about the pages being crawled
- “Download first, process later” approach


(Our) Mobile Crawling Approach

[Figure: mobile crawling approach. Components shown: Search Engine (with Index and Crawler Manager), Web, and three Remote Hosts, each with an HTTP Server.]


Mobile Web Crawling

- Crawler code migrates to host sites where pages are located
- Mobility allows a Web crawler to analyze Web pages before investing Web resources for their transmission
- Characteristics:
  - Focus on effective data retrieval
  - Distributed, data source oriented architecture: access data where it is stored
  - Intelligent downloading of only relevant Web content
  - Resource preserving approach
- “Process first, download later” approach
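The difference between the two approaches can be illustrated with a toy comparison. In this sketch the "remote site" is an in-memory dictionary, and the page contents, keyword test, and function names are all made up for illustration. It counts only page bytes and ignores the cost of shipping the crawler itself.

```python
# Toy comparison of the two approaches on an in-memory "remote site".
# The pages and the keyword-based relevance test are illustrative only.
REMOTE_SITE = {
    "/index.html": "welcome to the department of databases",
    "/crawler.html": "mobile crawlers analyze pages at the data source",
    "/sports.html": "football scores and schedules",
    "/agents.html": "software agents and mobile code migration",
}

def is_relevant(text, keywords):
    """Return True if the page mentions any of the keywords."""
    return any(word in text for word in keywords)

def traditional_crawl(site, keywords):
    """Download first, process later: every page crosses the network."""
    transferred = sum(len(text) for text in site.values())
    kept = [url for url, text in site.items() if is_relevant(text, keywords)]
    return kept, transferred

def mobile_crawl(site, keywords):
    """Process first, download later: only relevant pages cross the network."""
    kept = [url for url, text in site.items() if is_relevant(text, keywords)]
    transferred = sum(len(site[url]) for url in kept)
    return kept, transferred

keywords = ["mobile", "crawler"]
kept_t, bytes_t = traditional_crawl(REMOTE_SITE, keywords)
kept_m, bytes_m = mobile_crawl(REMOTE_SITE, keywords)
print(kept_t == kept_m, bytes_t, bytes_m)
```

Both crawlers end up with the same pages; they differ only in how many bytes had to be transmitted to get there.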


State-of-the-Art in Mobile Crawling

- Many search engines, little information
  - e.g., WWW Worm, WebCrawler, Lycos, Altavista, Infoseek, Excite, HotBot, Google
- Distributed search engines
  - e.g., Harvest
- Mobile code
  - Simplest form: downloadable Java applets
  - Code migration, software agents: very active research area
  - e.g., IBM Aglet infrastructure
- Crawling algorithms


Mobile Crawling Architecture

[Figure: mobile crawling architecture in two parts. Application Framework Architecture: Database Command Manager, DB, Connection Manager, SQL, Crawler Manager, Crawler Spec, Query Engine, Archive Manager, Communication Subsystem with Outbox and Inbox. Distributed Crawler Runtime Environment: Net and four hosts, each with a Communication Subsystem, Virtual Machine, and HTTP Server.]


Architecture Highlights

- Distributed Crawler Runtime Environment
  - Platform independent execution environment for crawlers
  - Virtual machine for remote crawler execution
  - Communication layer to provide crawler transport service
- Application Framework
  - Communication layer to provide crawler transport service
  - Crawler manager to support crawler creation and configuration, controls crawler migration
    - Web site selection
  - Query engine as crawler/application (database) interface
  - Archive manager as database connectivity framework


A Word About Mobile Crawlers

- Crawler is a user-defined set of rules that executes on a virtual machine and collects facts (about Web pages)
  - Use CLIPS to represent crawler data (e.g., page-facts) and user-defined crawling strategies (as rules)
  - CLIPS: C Language Integrated Production System
- Advantages of rule-based approach
  - Easier to specify crawling rules than to devise a crawling algorithm
  - No need to model control flow
  - Rule-based programs have simple runtime states
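To make the rules-and-facts idea concrete, here is a minimal forward-chaining loop. It is a Python stand-in written for illustration, not CLIPS syntax or semantics; the fact layout (`page`, `link`, `relevant`, `visit`) and the rule names are assumptions.

```python
# Minimal forward-chaining loop: rules fire on facts and may add new facts.
# Illustrative Python stand-in; this is not how CLIPS represents rules.

def rule_mark_relevant(facts):
    """If a page fact mentions the keyword, assert a 'relevant' fact."""
    new = set()
    for fact in facts:
        if fact[0] == "page" and "crawler" in fact[2]:
            new.add(("relevant", fact[1]))
    return new

def rule_follow_links(facts):
    """For every link on a relevant page, assert a 'visit' fact."""
    relevant = {f[1] for f in facts if f[0] == "relevant"}
    return {("visit", f[2]) for f in facts if f[0] == "link" and f[1] in relevant}

def run(rules, facts):
    """Apply all rules repeatedly until no rule adds a new fact."""
    facts = set(facts)
    while True:
        derived = set()
        for rule in rules:
            derived |= rule(facts)
        if derived <= facts:
            return facts
        facts |= derived

facts = {
    ("page", "/a.html", "notes on mobile crawler design"),
    ("page", "/b.html", "cafeteria menu"),
    ("link", "/a.html", "/c.html"),
    ("link", "/b.html", "/d.html"),
}
result = run([rule_mark_relevant, rule_follow_links], facts)
print(sorted(f for f in result if f[0] in ("relevant", "visit")))
```

Note that neither rule says when it should run. The loop keeps applying rules until the fact base stops changing, which is the sense in which no control flow needs to be modeled.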


Crawling Strategies

- General-purpose search engines use simple strategies
  - Crawl and index all pages, e.g., depth-first
- For subject-specific crawling, strategy is important
  - Find as many of the important pages as possible while crawling the fewest pages overall
- Page importance [Cho et al., Stanford University]
  - Keyword frequency and location
  - Backlink count
  - PageRank
- Figure out when to return to crawler manager
  - Memory management issue
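As one possible reading of importance-driven crawling, the sketch below always visits the known page with the most backlinks seen so far and stops after a fixed page budget. The link graph, the tie-breaking rule, and the function name are hypothetical; this is not the ordering scheme from the cited work.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C", "D"],
    "B": ["D"],
    "C": [],
    "D": ["C"],
}

def crawl_by_backlinks(links, seed, budget):
    """Visit up to `budget` pages, always picking the known page with
    the most backlinks seen so far (ties broken alphabetically)."""
    backlinks = Counter()
    visited = []
    frontier = {seed}
    while frontier and len(visited) < budget:
        page = min(frontier, key=lambda p: (-backlinks[p], p))
        frontier.remove(page)
        visited.append(page)
        for target in links.get(page, []):
            backlinks[target] += 1
            if target not in visited:
                frontier.add(target)
    return visited

print(crawl_by_backlinks(LINKS, "A", 3))
```

With a budget of three pages the crawl reaches D before C, because D has collected two backlinks by the time the choice is made; a plain breadth-first crawl would have spent that slot on C.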


Crawler Virtual Machine

- How to execute a rule-based crawler specification?
  - Crawler execution = rule application upon fact base
  - Use inference engine (JESS) for the rule application process
  - JESS is platform independent and extensible

1. Initialization
   - Insert rules and facts into inference engine
2. Rule application
   - Start rule application process within inference engine
3. Finalization
   - Extract rules and facts once the rule application has stopped
   - Store back into crawler
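The three phases can be sketched as a round trip of crawler state through an engine. `ToyEngine` below is a hypothetical stand-in with made-up `load`, `run`, and `dump` methods; it does not reproduce the JESS API.

```python
from dataclasses import dataclass, field

@dataclass
class Crawler:
    """A mobile crawler carries its own rules and facts between hosts."""
    rules: list
    facts: set = field(default_factory=set)

class ToyEngine:
    """Hypothetical stand-in for an inference engine; not the JESS API."""
    def __init__(self):
        self.rules, self.facts = [], set()

    def load(self, rules, facts):
        self.rules, self.facts = list(rules), set(facts)

    def run(self):
        # Fire rules until a full pass adds nothing new.
        changed = True
        while changed:
            before = len(self.facts)
            for rule in self.rules:
                self.facts |= rule(self.facts)
            changed = len(self.facts) > before

    def dump(self):
        return list(self.rules), set(self.facts)

def execute(crawler, engine):
    engine.load(crawler.rules, crawler.facts)      # 1. Initialization
    engine.run()                                   # 2. Rule application
    crawler.rules, crawler.facts = engine.dump()   # 3. Finalization
    return crawler

def count_pages(facts):
    """Example rule: derive one fact holding the number of page facts."""
    n = sum(1 for f in facts if f[0] == "page")
    return {("page-count", n)}

crawler = Crawler(rules=[count_pages], facts={("page", "/a"), ("page", "/b")})
execute(crawler, ToyEngine())
print(("page-count", 2) in crawler.facts)
```

Because the crawler's whole state is just rules plus facts, storing them back after finalization is enough for the crawler to move to another host and resume there.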


Crawler Virtual Machine

[Figure: virtual machine components: Communication Layer, Scheduling, and four Execution Threads, each paired with an Inference Engine.]


Crawler Manager

- Centralized control unit for mobile crawlers
  - Create new crawlers based on user-defined crawler specification
  - Provide initial destination using seed URLs
- Determine subsequent itinerary for crawlers
  - Requires knowledge about mobile-crawler-enabled sites
  - Estimate the importance of Web sites, e.g., backlink count, hot page count, ...
- Coordinate many crawlers running in parallel


Crawler Query Engine

- Used to access crawler contents after returning
  - “Hot pages”
- Characteristics
  - Provide a query facility to query the crawler fact base
  - Implement an SQL subset as query language
  - Represent query result as data tuples, not as facts
  - Allows the user to reason about crawling results
  - Query engine implementation uses inference engine
- Query engine serves as the primary interface between the user application and the mobile crawler
  - SQL database which holds Web index
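The sketch below shows what querying a fact base and getting plain tuples back might look like. It filters a Python list directly instead of going through an inference engine as the slide describes, and the fact schema and the `select` helper are made up for illustration.

```python
# Facts as (relation, {attribute: value}) pairs; the schema is hypothetical.
FACTS = [
    ("page", {"url": "/a.html", "size": 5120, "hits": 3}),
    ("page", {"url": "/b.html", "size": 900, "hits": 0}),
    ("link", {"src": "/a.html", "dst": "/b.html"}),
]

def select(facts, columns, relation, where=lambda row: True):
    """Rough analogue of SELECT columns FROM relation WHERE predicate.
    Returns plain tuples rather than facts."""
    return [
        tuple(row[c] for c in columns)
        for name, row in facts
        if name == relation and where(row)
    ]

# Roughly: SELECT url, size FROM page WHERE hits > 0
hot_pages = select(FACTS, ["url", "size"], "page", where=lambda r: r["hits"] > 0)
print(hot_pages)
```

Returning tuples rather than facts means the calling application can pass results on to an SQL database without knowing anything about the crawler's internal representation.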


Crawler Query Engine

[Figure: query engine data flow. Components shown: Crawler Object (Crawler Facts, Crawler Rules), Query Engine, User Query, Query Compiler, Query Rule, Inference Engine, Crawler Facts, Result Tuples.]


Mobile Crawling Advantages

- Remote page selection
  - Determine significance of a page prior to transmission
  - Applicable for topic-specific search engines
- Remote page filtering
  - Control the granularity of the retrieved data
  - Applicable for non-fulltext search engines
- Remote page compression
  - Compress page data prior to transmission
  - Applicable for all search engines
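A minimal sketch of the three techniques applied in sequence on the remote side, using only the Python standard library. The sample pages, the keyword test, and the regex-based tag stripping are simplifications chosen for illustration, and the highly repetitive sample text compresses far better than real pages would.

```python
import re
import zlib

# Illustrative pages; the repeated paragraph makes compression look
# better than it would on real content.
PAGES = {
    "/crawler.html": "<html><body><h1>Mobile crawlers</h1>"
                     + "<p>crawler text</p>" * 50
                     + "</body></html>",
    "/menu.html": "<html><body><p>cafeteria menu</p></body></html>",
}

def select_pages(pages, keyword):
    """Remote page selection: keep only pages that mention the keyword."""
    return {url: html for url, html in pages.items() if keyword in html}

def filter_page(html):
    """Remote page filtering: drop markup, keep the text (crude regex)."""
    return re.sub(r"<[^>]+>", " ", html)

def compress_page(text):
    """Remote page compression: shrink the payload before transmission."""
    return zlib.compress(text.encode("utf-8"))

selected = select_pages(PAGES, "crawler")
payload = {url: compress_page(filter_page(html)) for url, html in selected.items()}

raw_bytes = sum(len(html.encode("utf-8")) for html in PAGES.values())
sent_bytes = sum(len(blob) for blob in payload.values())
print(raw_bytes, sent_bytes)
```

On the receiving side, `zlib.decompress` restores the filtered text, so compression costs CPU time at both ends but loses nothing, whereas selection and filtering deliberately discard data.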


Performance Evaluation Setup

- Two virtual machines (local and remote) plus crawler management system
- Set up for mobile as well as traditional Web crawling

[Figure: evaluation setup with a REMOTE and a LOCAL side. Components shown: Crawler Manager, Crawler Spec, two Virtual Machines, three Communication Subsystems, HTML, HTTP Server.]


Performance Evaluation

- Focus on proving viability of mobile crawling approach
  - Not focusing on analyzing crawling strategies [Cho97]
- Controlled environment setup
  - Static HTML data set with known properties: subset of University of Florida intranet
  - Apache HTTP server, unshared communication channel
  - Breadth-first crawling strategy: predictable crawler behavior
- Measurements
  1. Network load for traditional (stationary) crawler
  2. Network load for mobile crawler without page compression
  3. Network load for mobile crawler with page compression


Benefit of Remote Page Selection

[Figure: bar chart of total load (KB, axis 0 to 450) for crawlers S1, M1, M2, M3, M4; series: uncompressed, compressed.]

Traditional crawler (S1) versus mobile crawlers (M1-M4) with different keyword sets for page selection


Benefit of Remote Page Filtering

[Figure: network load (axis 0% to 120%) versus filter degree (90% down to 10%); series: load uncompressed, load compressed.]

Mobile crawler (M1) with a decreasing degree of page filtering (10%-90% page data preserved)


Benefit of Page Compression

[Figure: total load (in KB, axis 0 to 900) versus retrieved pages (1, 10, 22, 51, 82, 158); series: stationary, mobile uncompressed, mobile compressed.]

Traditional crawler (S1) and mobile crawler (M1) with an increasing number of crawled pages


Cost Benefit Analysis

- Overhead
  - Overhead due to crawler migration (<5K)
  - Overhead due to fact-based data representation (6%)
- Benefits without page compression
  - As soon as less than 85% per page needs to be preserved
  - As long as less than 90% of all pages are transmitted
- Benefits with page compression
  - Reduction in network load by a factor of 4.5


Summary and Conclusion

- Mobile crawling advantages:
  - Natural fit for distributed Web environment
  - Well suited for topic-specific search engines
  - Small network overhead due to crawler mobility
- Solves scaling problems of traditional crawling approach by allowing filtering operations to be performed remotely
- Approach provides a base for smart Web crawling
  - Currently improving crawler memory management
  - Completing more realistic testbed consisting of ~10 mobile crawler-enabled Web sites within UF intranet


Ongoing/Future Work

- Security
  - Crawler identification based on digital signatures
  - Restrict crawler execution to positively identified crawlers
  - Implement virtual machine as a secure sandbox
- Crawler mobility support
  - Integrate virtual machine into Web servers
  - Comparison with other infrastructures, e.g., IBM Aglet infrastructure (currently ongoing)
- Mobile crawling algorithms (currently ongoing)
  - Optimize crawling algorithms, site relocation algorithms
  - Carry out analysis