Using the Web Efficiently: Mobile Crawlers

August 7, 1999

Joachim Hammer
Database Center, University of Florida
[email protected]

Joachim Hammer - UF, 17th Annual AoM/IAoM

Presentation Outline

- Problem Statement and Goal
- Web Crawling Techniques
  - Traditional Web Crawling
  - Mobile Web Crawling
- Mobile Crawling Architecture
  - Distributed Runtime Environment
  - Application Framework
- Performance Evaluation
- Summary and Future Work


What’s Wrong with the Web?

- Web represents a large distributed hypertext system
  - 1.6 million Web sites
  - 320 million Web documents
  - 40% of the Web content changes within a month
  - Exponential growth rate
- Lacks structure (i.e., no strict hierarchy)


Web Indices and Search Engines

- Search engine statistics:
  - Index size: 30-110 million pages (approx. 700GB)
  - Web coverage: 10%-35% and decreasing!
  - Daily crawl: 3-10 million pages (approx. 60GB)
- Year 2000 estimates:
  - Index size: 880 million pages (approx. 5.6TB)
  - Daily crawl: 80 million pages (approx. 480GB)
- Traditional Web crawling will experience severe scaling problems in the near future
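A quick sanity check on these figures is the average page size they imply. The sketch below is only arithmetic on the numbers quoted on this slide; it assumes decimal units (1GB = 10^9 bytes) and pairs each size with the upper end of its page-count range, both of which are assumptions rather than statements from the talk.

```python
# Implied average page size from the slide's figures.
# Assumptions (not from the talk): decimal units, upper end of each range.
GB = 10**9
TB = 10**12


def avg_page_kb(total_bytes: float, pages: float) -> float:
    """Average bytes per page, expressed in KB (1 KB = 1000 bytes)."""
    return total_bytes / pages / 1000


estimates = {
    "1999 index": avg_page_kb(700 * GB, 110e6),
    "1999 daily crawl": avg_page_kb(60 * GB, 10e6),
    "2000 index": avg_page_kb(5.6 * TB, 880e6),
    "2000 daily crawl": avg_page_kb(480 * GB, 80e6),
}

for label, kb in estimates.items():
    print(f"{label}: about {kb:.1f} KB per page")
```

Under these assumptions every line comes out at roughly 6 KB per page, so the year 2000 estimates grow the page counts, not the page size.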


Goals

- Long-term
  - Overlay the distributed Web structure with a centralized information system which allows efficient resource discovery
  - Turn Web into an effectively organized and cataloged “digital library”
  - Topic-specific search engines, e.g., self-health care, consumer electronics, etc.
- Project
  - Find an alternative to the current “brute-force” approach to Web crawling/indexing


Traditional Web Crawling Approach

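[Figure: traditional crawling architecture. Inside the Google domain, several Crawler processes fetch pages from the Web over HTTP; a LAN connects them to the URL Server, Store Server, Repository, Indexer, Anchors, and URL Resolver.]

Based on the Google search engine (www.google.com)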


Traditional Web Crawling

- Characteristics of traditional Web crawling:
  - Remote data access
  - Focus on rapid data retrieval
  - Centralized, database oriented architecture
  - Resource intensive
- Traditional Web crawling techniques do not exploit information about the pages being crawled
- “Download first–process later” approach


(Our) Mobile Crawling Approach

[Figure: mobile crawling approach. The search engine (Crawler Manager and Index) reaches remote hosts across the Web; each remote host runs an HTTP server.]


Mobile Web Crawling

- Crawler code migrates to host sites where pages are located
- Mobility allows a Web crawler to analyze Web pages before investing Web resources for their transmission
- Characteristics:
  - Focus on effective data retrieval
  - Distributed, data source oriented architecture: access data where it is stored
  - Intelligent downloading of only relevant Web content
  - Resource preserving approach
- “Process first–download later” approach
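To make the contrast with "download first" concrete, here is a minimal sketch that counts the bytes each approach would send over the network. Everything in it is illustrative: the in-memory `REMOTE_PAGES` dictionary, the keyword test, and the use of `zlib` are assumptions, not the project's implementation, and crawler migration overhead is ignored.

```python
import zlib

# Hypothetical in-memory stand-in for pages stored on a remote host.
REMOTE_PAGES = {
    "/health.html": "<html><title>Health care</title>" + "health advice " * 200 + "</html>",
    "/cars.html": "<html><title>Cars</title>" + "engine specs " * 200 + "</html>",
    "/tv.html": "<html><title>TV</title>" + "consumer electronics " * 200 + "</html>",
}


def traditional_crawl(pages: dict[str, str]) -> int:
    """Download first, process later: every page crosses the network in full."""
    return sum(len(body.encode("utf-8")) for body in pages.values())


def mobile_crawl(pages: dict[str, str], keyword: str, compress: bool = True) -> int:
    """Process first, download later: filter at the host, ship only matches."""
    relevant = [body for body in pages.values() if keyword in body.lower()]
    payload = "\n".join(relevant).encode("utf-8")
    if compress:
        payload = zlib.compress(payload)
    return len(payload)


stationary_bytes = traditional_crawl(REMOTE_PAGES)
mobile_bytes = mobile_crawl(REMOTE_PAGES, "health", compress=False)
mobile_compressed_bytes = mobile_crawl(REMOTE_PAGES, "health", compress=True)

print("traditional:", stationary_bytes)
print("mobile, uncompressed:", mobile_bytes)
print("mobile, compressed:", mobile_compressed_bytes)
```

The toy pages are highly repetitive, so the compression gain here is far larger than anything real HTML would show; only the ordering of the three numbers is meaningful.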


State-of-the-Art in Mobile Crawling

- Many search engines, little information
  - e.g., WWW Worm, WebCrawler, Lycos, AltaVista, Infoseek, Excite, HotBot, Google
- Distributed search engines
  - e.g., Harvest
- Mobile code
  - Simplest form: downloadable Java applets
  - Code migration, software agents: very active research area
  - e.g., IBM Aglet infrastructure
- Crawling algorithms


Mobile Crawling Architecture

[Figure: two layers connected over the network. Application Framework Architecture, with the labels Crawler Manager, Crawler Spec, Query Engine, Archive Manager, Database Command Manager, Connection Manager, DB (accessed via SQL), and Communication Subsystem (Inbox, Outbox). Distributed Crawler Runtime Environment, with four hosts, each pairing a Virtual Machine with an HTTP Server.]


Architecture Highlights

- Distributed Crawler Runtime Environment
  - Platform independent execution environment for crawlers
  - Virtual machine for remote crawler execution
  - Communication layer to provide crawler transport service
- Application Framework
  - Communication layer to provide crawler transport service
  - Crawler manager to support crawler creation and configuration, controls crawler migration
  - Web site selection
  - Query engine as crawler/application (database) interface
  - Archive manager as database connectivity framework


A Word About Mobile Crawlers

- Crawler is a user-defined set of rules that executes on a virtual machine and collects facts (about Web pages)
  - Use CLIPS to represent crawler data (e.g., page-facts) and user-defined crawling strategies (as rules)
  - CLIPS - C Language Integrated Production System
- Advantages of rule-based approach
  - Easier to specify crawling rules than to devise a crawling algorithm
  - No need to model control flow
  - Rule-based programs have simple runtime states
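The idea of a crawler as rules plus facts can be illustrated with a very small forward-chaining loop. This is a Python sketch, not CLIPS: the `Rule` class, the tuple layout of facts, and the "hot page" rule are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable

Fact = tuple  # e.g. ("page", url, text) or ("hot", url)


@dataclass
class Rule:
    name: str
    condition: Callable[[Fact], bool]  # does this fact trigger the rule?
    action: Callable[[Fact], Fact]     # fact to assert when it does


def run_rules(rules: list[Rule], facts: set[Fact]) -> set[Fact]:
    """Apply rules until no rule adds a new fact (a fixpoint)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for fact in list(facts):
                if rule.condition(fact):
                    new_fact = rule.action(fact)
                    if new_fact not in facts:
                        facts.add(new_fact)
                        changed = True
    return facts


rules = [
    Rule(
        name="mark-hot-page",
        condition=lambda f: f[0] == "page" and "crawler" in f[2].lower(),
        action=lambda f: ("hot", f[1]),
    ),
]
facts = {
    ("page", "http://example.org/a", "Mobile crawler design"),
    ("page", "http://example.org/b", "Campus cafeteria menu"),
}

result = run_rules(rules, facts)
hot_urls = sorted(f[1] for f in result if f[0] == "hot")
print(hot_urls)
```

Note that the rule author only states when a page counts as hot; the loop, not the rule, decides the order of evaluation, which is the "no need to model control flow" point.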


Crawling Strategies

- General-purpose search engines use simple strategies
  - Crawl and index all pages, e.g., depth-first
- For subject-specific crawling, strategy is important
  - Find as many of the important pages as possible while crawling the fewest pages overall
- Page importance [Cho et al., Stanford University]
  - Keyword frequency and location
  - Backlink count
  - PageRank
- Figure out when to return to crawler manager
  - Memory management issue
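As a rough illustration of two of the importance signals listed here (keyword frequency and location, and backlink count), the sketch below ranks a few pages. The scoring formula and its weights are assumptions made up for this example, not the metrics of Cho et al., and PageRank is left out.

```python
import re


def importance(title: str, body: str, backlinks: int,
               keywords: list[str],
               title_weight: float = 3.0,
               backlink_weight: float = 0.5) -> float:
    """Keyword hits (title hits weighted higher) plus weighted backlink count."""
    def hits(text: str) -> int:
        words = re.findall(r"[a-z0-9]+", text.lower())
        return sum(words.count(k.lower()) for k in keywords)

    return title_weight * hits(title) + hits(body) + backlink_weight * backlinks


# Hypothetical pages with made-up backlink counts.
pages = [
    {"url": "/a", "title": "Mobile crawlers", "body": "crawlers migrate to hosts", "backlinks": 4},
    {"url": "/b", "title": "Lunch menu", "body": "soup and salad", "backlinks": 10},
    {"url": "/c", "title": "Agents", "body": "crawlers and agents", "backlinks": 0},
]

ranked = sorted(
    pages,
    key=lambda p: importance(p["title"], p["body"], p["backlinks"], ["crawlers"]),
    reverse=True,
)
print([p["url"] for p in ranked])
```

With these weights a heavily linked but off-topic page (`/b`) still outranks a weakly on-topic one (`/c`), which shows why the choice of weights matters for subject-specific crawling.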


Crawler Virtual Machine

- How to execute a rule-based crawler specification?
  - Crawler execution = rule application upon fact base
  - Use inference engine (JESS) for the rule application process
  - JESS is platform independent and extensible

1. Initialization
   - Insert rules and facts into inference engine
2. Rule application
   - Start rule application process within inference engine
3. Finalization
   - Extract rules and facts once rule application has stopped
   - Store back into crawler
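The three phases can be sketched as follows. This is a Python toy, not JESS: `ToyInferenceEngine`, `Crawler`, `execute`, and the `follow_links` rule are hypothetical names, and rules are modeled simply as functions from the current facts to facts that should be added.

```python
from dataclasses import dataclass, field


@dataclass
class Crawler:
    """Mobile crawler state: the rules and facts it carries between hosts."""
    rules: list = field(default_factory=list)
    facts: set = field(default_factory=set)


class ToyInferenceEngine:
    """Hypothetical stand-in for a real inference engine."""

    def __init__(self) -> None:
        self.rules: list = []
        self.facts: set = set()

    def run(self) -> None:
        # Fire every rule repeatedly until no new facts appear.
        while True:
            new = set()
            for rule in self.rules:
                new |= rule(self.facts) - self.facts
            if not new:
                return
            self.facts |= new


def execute(crawler: Crawler) -> Crawler:
    engine = ToyInferenceEngine()
    # 1. Initialization: insert rules and facts into the engine.
    engine.rules = list(crawler.rules)
    engine.facts = set(crawler.facts)
    # 2. Rule application: run until the engine stops.
    engine.run()
    # 3. Finalization: extract rules and facts, store back into the crawler.
    crawler.rules = list(engine.rules)
    crawler.facts = set(engine.facts)
    return crawler


def follow_links(facts: set) -> set:
    """Example rule: every link target of a visited page becomes visited."""
    visited = {f[1] for f in facts if f[0] == "visited"}
    return {("visited", f[2]) for f in facts if f[0] == "link" and f[1] in visited}


crawler = Crawler(
    rules=[follow_links],
    facts={("visited", "/"), ("link", "/", "/a"), ("link", "/a", "/b")},
)
execute(crawler)
visited = sorted(f[1] for f in crawler.facts if f[0] == "visited")
print(visited)
```

Because the crawler's whole state is just its rules and facts, copying them out of the engine in step 3 is all that is needed before the crawler moves on.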


Crawler Virtual Machine

[Figure: structure of the virtual machine. A communication layer and a scheduling component sit above four execution threads, each with its own inference engine.]


Crawler Manager

- Centralized control unit for mobile crawlers
  - Create new crawlers based on user-defined crawler specification
  - Provide initial destination using seed URLs
- Determine subsequent itinerary for crawlers
  - Requires knowledge about mobile-crawler-enabled sites
  - Estimate the importance of Web sites, e.g., backlink count, hot page count, ...
- Coordinate many crawlers running in parallel
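One way to picture the itinerary step is a ranking over known sites that keeps only those able to host a crawler. The sketch below is an assumption-laden illustration: the `Site` fields, the host names, and the importance estimate (a plain sum of the two counts) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Site:
    host: str
    crawler_enabled: bool  # does the host run a crawler virtual machine?
    backlink_count: int
    hot_page_count: int


def plan_itinerary(sites: list[Site], max_stops: int) -> list[str]:
    """Return the most important crawler-enabled hosts, best first."""
    candidates = [s for s in sites if s.crawler_enabled]
    # Hypothetical importance estimate: sum of the two counts.
    candidates.sort(key=lambda s: s.backlink_count + s.hot_page_count, reverse=True)
    return [s.host for s in candidates[:max_stops]]


sites = [
    Site("a.example.edu", True, backlink_count=12, hot_page_count=3),
    Site("b.example.edu", False, backlink_count=90, hot_page_count=40),
    Site("c.example.edu", True, backlink_count=30, hot_page_count=1),
    Site("d.example.edu", True, backlink_count=2, hot_page_count=0),
]
itinerary = plan_itinerary(sites, max_stops=2)
print(itinerary)
```

The highest-scoring site (`b.example.edu`) is skipped because it is not crawler-enabled, which is why the manager needs to know which sites can accept crawlers.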


Crawler Query Engine

- Used to access crawler contents after returning
  - “Hot pages”
- Characteristics
  - Provide a query facility to query the crawler fact base
  - Implement an SQL subset as query language
  - Represent query result as data tuples, not as facts
  - Allows the user to reason about crawling results
  - Query engine implementation uses inference engine
- Query engine serves as the primary interface between the user application and the mobile crawler
  - SQL database which holds Web index
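To show what "SQL subset in, data tuples out" looks like from the application's side, the sketch below swaps in Python's built-in `sqlite3` module for the query engine. This is a deliberate substitution: the talk's engine compiles queries into rules for the inference engine, whereas here an in-memory SQLite table does the work. The fact layout (`url`, `title`, `keyword_hits`) is hypothetical.

```python
import sqlite3

# Hypothetical page facts a crawler might bring back: (url, title, keyword_hits)
crawler_facts = [
    ("http://example.edu/a", "Mobile crawlers", 7),
    ("http://example.edu/b", "Lunch menu", 0),
    ("http://example.edu/c", "Software agents", 3),
]


def query_facts(facts, sql: str, params: tuple = ()) -> list[tuple]:
    """Load facts into a throwaway in-memory table and return result tuples."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE page (url TEXT, title TEXT, keyword_hits INTEGER)")
        conn.executemany("INSERT INTO page VALUES (?, ?, ?)", facts)
        return conn.execute(sql, params).fetchall()
    finally:
        conn.close()


hot_pages = query_facts(
    crawler_facts,
    "SELECT url, keyword_hits FROM page WHERE keyword_hits >= ? ORDER BY keyword_hits DESC",
    (3,),
)
print(hot_pages)
```

The result is a list of plain tuples, ready to be inserted into the index database without knowing how the crawler represents its facts internally.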


Crawler Query Engine

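[Figure: query processing. Within the query engine, the query compiler turns a user query into a query rule; the inference engine applies it to the crawler facts taken from the crawler object (which holds crawler facts and crawler rules) and produces result tuples.]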


Mobile Crawling Advantages

- Remote page selection
  - Determine significance of a page prior to transmission
  - Applicable for topic-specific search engines
- Remote page filtering
  - Control the granularity of the retrieved data
  - Applicable for non-fulltext search engines
- Remote page compression
  - Compress page data prior to transmission
  - Applicable for all search engines


Performance Evaluation Setup

- Two virtual machines (local and remote) plus crawler management system
- Set up for mobile as well as traditional Web crawling

[Figure: test setup split into a LOCAL and a REMOTE side. Labels: Crawler Manager, Crawler Spec, Communication Subsystem, Virtual Machine (two instances), HTTP Server, HTML.]



Performance Evaluation

- Focus on proving viability of mobile crawling approach
  - Not focusing on analyzing crawling strategies [Cho97]
- Controlled environment setup
  - Static HTML data set with known properties - subset of University of Florida intranet
  - Apache HTTP server, unshared communication channel
  - Breadth-first crawling strategy - predictable crawler behavior
- Measurements
  1. Network load for traditional (stationary) crawler
  2. Network load for mobile crawler without page compression
  3. Network load for mobile crawler with page compression
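A breadth-first strategy gives predictable behavior because the visiting order depends only on the link structure and the seed. The sketch below shows the traversal on a small, made-up link graph; `LINKS` and the page names are hypothetical and stand in for the static test data set.

```python
from collections import deque

# Hypothetical static link graph standing in for the test data set.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c", "/d"],
    "/c": [],
    "/d": ["/"],
}


def breadth_first_crawl(seed: str, links: dict[str, list[str]], limit: int) -> list[str]:
    """Visit pages level by level from the seed, at most `limit` pages."""
    order: list[str] = []
    seen = {seed}
    queue = deque([seed])
    while queue and len(order) < limit:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


crawl_order = breadth_first_crawl("/", LINKS, limit=10)
print(crawl_order)
```

Running the same crawl twice yields the same order, so differences in measured network load can be attributed to the crawling approach and not to the traversal.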


Benefit of Remote Page Selection

[Bar chart: total load (KB, axis from 0 to 450) for S1, M1, M2, M3, and M4; series: uncompressed, compressed.]

Traditional crawler (S1) versus mobile crawlers (M1-M4) with different keyword sets for page selection


Benefit of Remote Page Filtering

[Chart: network load (axis from 0% to 120%) against filter degree (90% down to 10%); series: load uncompressed, load compressed.]

Mobile crawler (M1) with a decreasing degree of page filtering (10%-90% page data preserved)


Benefit of Page Compression

[Chart: total load (in KB, axis from 0 to 900) against retrieved pages (1, 10, 22, 51, 82, 158); series: stationary, mobile uncompressed, mobile compressed.]

Traditional crawler (S1) and mobile crawler (M1) with an increasing number of crawled pages


Cost Benefit Analysis

- Overhead
  - Overhead due to crawler migration (<5K)
  - Overhead due to fact-based data representation (6%)
- Benefits without page compression
  - As soon as less than 85% per page needs to be preserved
  - As long as less than 90% of all pages are transmitted
- Benefits with page compression
  - Reduction in network load by a factor of 4.5


Summary and Conclusion

- Mobile crawling advantages:
  - Natural fit for distributed web environment
  - Well suited for topic-specific search engines
  - Small network overhead due to crawler mobility
- Solves scaling problems of traditional crawling approach by allowing filtering operations to be performed remotely
- Approach provides a base for smart Web crawling
- Currently improving crawler memory management
- Completing more realistic testbed consisting of ~10 mobile crawler-enabled Web sites within UF intranet


Ongoing/Future Work

- Security
  - Crawler identification based on digital signatures
  - Restrict crawler execution to positively identified crawlers
  - Implement virtual machine as a secure sandbox
- Crawler mobility support
  - Integrate virtual machine into web servers
  - Comparison with other infrastructures, e.g., IBM Aglet infrastructure (currently ongoing)
- Mobile crawling algorithms (currently ongoing)
  - Optimize crawling algorithms, site relocation algorithms
  - Carry out analysis
