Using the Web Efficiently: Mobile Crawlers
August 7, 1999
Joachim Hammer
Database Center, University of Florida
Joachim Hammer - UF, 17th Annual AoM/IAoM
Presentation Outline
Problem Statement and Goal
Web Crawling Techniques
  Traditional Web Crawling
  Mobile Web Crawling
Mobile Crawling Architecture
  Distributed Runtime Environment
  Application Framework
Performance Evaluation
Summary and Future Work
What’s Wrong with the Web?
Web represents a large distributed hypertext system
  1.6 million Web sites
  320 million Web documents
  40% of the Web content changes within a month
  Exponential growth rate
Lacks structure (i.e., no strict hierarchy)
Web Indices and Search Engines
Search engine statistics:
  Index size: 30-110 million pages (approx. 700GB)
  Web coverage: 10%-35% and decreasing!
  Daily crawl: 3-10 million pages (approx. 60GB)
Year 2000 estimates:
  Index size: 880 million pages (approx. 5.6TB)
  Daily crawl: 80 million pages (approx. 480GB)
Traditional Web crawling will experience severe scaling problems in the near future
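As a rough consistency check of the figures above, dividing each quoted volume by its page count gives the implied average page size. This is only a sketch: it assumes decimal units (1 GB = 10^9 bytes) and assumes that the 1999 volumes belong to the upper ends of the quoted page ranges, neither of which the slides state.

```python
# Implied average page size from the figures quoted above.
# Assumptions (not stated in the slides): decimal units, and the 1999
# volumes correspond to the upper end of each quoted page range.
GB = 1e9
TB = 1e12

figures = {
    "1999 index (110M pages)": (700 * GB, 110e6),
    "1999 daily crawl (10M pages)": (60 * GB, 10e6),
    "2000 index estimate": (5.6 * TB, 880e6),
    "2000 daily crawl estimate": (480 * GB, 80e6),
}


def bytes_per_page(total_bytes, pages):
    """Average number of bytes per page."""
    return total_bytes / pages


implied = {name: bytes_per_page(b, p) for name, (b, p) in figures.items()}
for name, size in implied.items():
    print(f"{name}: about {size / 1000:.1f} KB per page")
```

Under these assumptions all four figures imply roughly 6 KB per page, so the year 2000 estimates scale the page count rather than the page size.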
Goals
Long-term
  Overlay the distributed Web structure with a centralized information system which allows efficient resource discovery
  Turn Web into an effectively organized and cataloged “digital library”
  Topic-specific search engines, e.g., self-health care, consumer electronics, etc.
Project
  Find an alternative to the current “brute-force” approach to Web crawling/indexing
Traditional Web Crawling Approach
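[Figure: traditional crawling architecture. Components shown inside the Google domain, connected by a LAN: URL Server, four Crawlers, Store Server, Repository, Indexer, Anchors, URL Resolver. The crawlers reach the Web over HTTP.]
Based on the Google search engine (www.google.com)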
Traditional Web Crawling
Characteristics of traditional Web crawling:
  Remote data access
  Focus on rapid data retrieval
  Centralized, database oriented architecture
  Resource intensive
Traditional Web crawling techniques do not exploit information about the pages being crawled
“Download first–process later” approach
(Our) Mobile Crawling Approach
[Figure: mobile crawling approach. Components shown: a Search Engine containing the Index and the Crawler Manager, the Web, and three Remote Hosts, each running an HTTP Server.]
Mobile Web Crawling
Crawler code migrates to host sites where pages are located
Mobility allows a Web crawler to analyze Web pages before investing Web resources for their transmission
Characteristics:
  Focus on effective data retrieval
  Distributed, data source oriented architecture: access data where it is stored
  Intelligent downloading of only relevant Web content
  Resource preserving approach
“Process first–download later” approach
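The difference between the two approaches can be sketched with a toy comparison. Everything here is an assumption made for illustration: the in-memory `SITE` dictionary stands in for a remote host, byte counts stand in for network load, and the keyword test is a placeholder for a real relevance check. Crawler migration overhead is ignored.

```python
# Hypothetical in-memory "remote host": URL -> page text.
# No network access happens; byte counts stand in for network load.
SITE = {
    "/index.html": "Welcome to the department home page. " * 20,
    "/crawler.html": "Mobile crawler research: code migrates to the data. " * 20,
    "/menu.html": "Cafeteria menu for this week. " * 20,
}


def is_relevant(text, keywords):
    """Toy relevance test: the page mentions at least one keyword."""
    lowered = text.lower()
    return any(k in lowered for k in keywords)


def download_first(site, keywords):
    """Traditional crawling: transfer every page, then filter centrally."""
    transferred = sum(len(text.encode("utf-8")) for text in site.values())
    kept = [url for url, text in site.items() if is_relevant(text, keywords)]
    return kept, transferred


def process_first(site, keywords):
    """Mobile crawling: filter at the host, then transfer only the hits."""
    kept = [url for url, text in site.items() if is_relevant(text, keywords)]
    transferred = sum(len(site[url].encode("utf-8")) for url in kept)
    return kept, transferred


trad_pages, trad_bytes = download_first(SITE, ["crawler"])
mobile_pages, mobile_bytes = process_first(SITE, ["crawler"])
print(trad_pages == mobile_pages, trad_bytes, mobile_bytes)
```

Both functions select the same pages; only the number of bytes crossing the network differs.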
State-of-the-Art in Mobile Crawling
Many search engines, little information
  e.g., WWW Worm, WebCrawler, Lycos, Altavista, Infoseek, Excite, HotBot, Google
Distributed search engines
  e.g., Harvest
Mobile code
  Simplest form: downloadable Java applets
  Code migration, software agents: very active research area
  e.g., IBM Aglet infrastructure
Crawling algorithms
Mobile Crawling Architecture
[Figure: mobile crawling architecture, in two parts. Application Framework Architecture, with the labels: Crawler Manager, Crawler Spec, Query Engine, Archive Manager, Database Command Manager, Connection Manager, SQL, DB, and a Communication Subsystem with Inbox and Outbox. Distributed Crawler Runtime Environment, reached over the Net: four hosts, each with a Communication Subsystem, a Virtual Machine, and an HTTP Server.]
Architecture Highlights
Distributed Crawler Runtime Environment
  Platform independent execution environment for crawlers
  Virtual machine for remote crawler execution
  Communication layer to provide crawler transport service
Application Framework
  Communication layer to provide crawler transport service
  Crawler manager to support crawler creation and configuration, controls crawler migration, Web site selection
  Query engine as crawler/application (database) interface
  Archive manager as database connectivity framework
A Word About Mobile Crawlers
A crawler is a user-defined set of rules that executes on a virtual machine and collects facts (about Web pages)
  Use CLIPS to represent crawler data (e.g., page-facts) and user-defined crawling strategies (as rules)
  CLIPS - C Language Integrated Production System
Advantages of rule-based approach
  Easier to specify crawling rules than to devise a crawling algorithm
  No need to model control flow
  Rule-based programs have simple runtime states
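The crawlers in the talk are written in CLIPS; the Python sketch below only imitates the idea of facts plus rules. The fact shapes (`page`, `link`, `hot-page`, `to-visit`) and both rules are hypothetical and chosen for illustration.

```python
# Facts are plain tuples; a rule inspects the fact base and returns new facts.
# All fact and rule names here are made up for this illustration.

def rule_mark_hot(facts):
    """If a page fact mentions the keyword, assert a hot-page fact."""
    new = set()
    for fact in facts:
        if fact[0] == "page" and "crawler" in fact[2]:
            new.add(("hot-page", fact[1]))
    return new


def rule_queue_links(facts):
    """Queue the targets of links found on hot pages."""
    hot = {f[1] for f in facts if f[0] == "hot-page"}
    new = set()
    for fact in facts:
        if fact[0] == "link" and fact[1] in hot:
            new.add(("to-visit", fact[2]))
    return new


def run(rules, facts):
    """Apply rules until no rule adds a new fact (a fixpoint)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            derived = rule(facts) - facts
            if derived:
                facts |= derived
                changed = True
    return facts


initial = {
    ("page", "/a.html", "notes on mobile crawler design"),
    ("page", "/b.html", "cafeteria menu"),
    ("link", "/a.html", "/c.html"),
    ("link", "/b.html", "/d.html"),
}
result = run([rule_queue_links, rule_mark_hot], initial)
print(sorted(f for f in result if f[0] in ("hot-page", "to-visit")))
```

The rules are listed in the "wrong" order on purpose: the result does not depend on it, which is the sense in which no control flow has to be modeled.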
Crawling Strategies
General-purpose search engines use simple strategies
  Crawl and index all pages, e.g., depth-first
For subject-specific crawling, strategy is important
  Find as many of the important pages as possible while crawling the fewest pages overall
Page importance [Cho et al., Stanford University]
  Keyword frequency and location
  Backlink count
  PageRank
Figure out when to return to crawler manager
  Memory management issue
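Two of the importance measures listed above, keyword frequency and location and backlink count, can be sketched as follows. The weighting of title hits (`title_weight=3`) and the function names are assumptions for illustration, not values from the talk or from Cho et al.

```python
import re
from collections import Counter


def keyword_score(title, body, keyword, title_weight=3):
    """Keyword frequency, with title hits weighted higher than body hits.

    The weight is an arbitrary assumption chosen for this sketch.
    """
    pattern = re.compile(r"\b" + re.escape(keyword.lower()) + r"\b")
    in_title = len(pattern.findall(title.lower()))
    in_body = len(pattern.findall(body.lower()))
    return title_weight * in_title + in_body


def backlink_counts(links):
    """Count distinct incoming links per target.

    links: iterable of (source_url, target_url) pairs seen so far.
    """
    return Counter(target for _source, target in set(links))


links = [("/a", "/c"), ("/b", "/c"), ("/a", "/b"), ("/a", "/c")]
counts = backlink_counts(links)
score = keyword_score(
    "Mobile crawler",
    "A crawler visits pages. The crawler returns.",
    "crawler",
)
print(counts["/c"], counts["/b"], score)
```

A crawler can only count the backlinks it has seen so far, so these counts are estimates that improve as the crawl proceeds.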
Crawler Virtual Machine
How to execute a rule-based crawler specification?
  Crawler execution = rule application upon fact base
  Use inference engine (JESS) for the rule application process
  JESS is platform independent and extensible
1. Initialization
  Insert rules and facts into inference engine
2. Rule application
  Start rule application process within inference engine
3. Finalization
  Extract rules and facts once the rule application has stopped
  Store back into crawler
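The three phases can be sketched in Python as below. This is not JESS: `InferenceEngine` is a toy stand-in, and the `Crawler` class, the `execute` function, and the `count_pages` rule are hypothetical names introduced for the example.

```python
class Crawler:
    """Carries its rules and facts between hosts (hypothetical structure)."""

    def __init__(self, rules, facts):
        self.rules = list(rules)
        self.facts = set(facts)


class InferenceEngine:
    """Toy stand-in for JESS: a rule maps a fact set to a set of new facts."""

    def __init__(self):
        self.rules = []
        self.facts = set()

    def run(self):
        """Apply rules until no rule derives a new fact."""
        changed = True
        while changed:
            changed = False
            for rule in self.rules:
                derived = rule(self.facts) - self.facts
                if derived:
                    self.facts |= derived
                    changed = True


def execute(crawler):
    engine = InferenceEngine()
    # 1. Initialization: insert rules and facts into the engine.
    engine.rules = list(crawler.rules)
    engine.facts = set(crawler.facts)
    # 2. Rule application: run until the engine stops.
    engine.run()
    # 3. Finalization: extract rules and facts and store them in the crawler.
    crawler.rules = list(engine.rules)
    crawler.facts = set(engine.facts)
    return crawler


def count_pages(facts):
    """Example rule: assert how many page facts are present."""
    n = sum(1 for f in facts if f[0] == "page")
    return {("page-count", n)}


crawler = execute(Crawler([count_pages], {("page", "/a"), ("page", "/b")}))
print(sorted(crawler.facts))
```

Because the crawler's whole state is its rules and facts, storing them back in step 3 is all that is needed before the crawler moves on.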
Crawler Virtual Machine
[Figure: crawler virtual machine. Components shown: Communication Layer, Scheduling, and four Execution Threads, each paired with an Inference Engine.]
Crawler Manager
Centralized control unit for mobile crawlers
Create new crawlers based on user-defined crawler specification
  Provide initial destination using seed URLs
Determine subsequent itinerary for crawlers
  Requires knowledge about mobile-crawler-enabled sites
  Estimate the importance of Web sites, e.g., backlink count, hot page count, ...
Coordinate many crawlers running in parallel
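Itinerary planning can be sketched as picking the most important sites among those known to accept mobile crawlers. The function name, the `importance` scores, and the `.example` host names are all hypothetical; how the crawler manager in the talk actually ranks sites is not specified beyond the measures listed above.

```python
import heapq


def plan_itinerary(known_sites, enabled_sites, importance, limit):
    """Return up to `limit` crawler-enabled sites, most important first.

    known_sites:   sites discovered so far (duplicates allowed)
    enabled_sites: sites known to run a crawler virtual machine
    importance:    site -> estimated importance (e.g., a backlink count)
    """
    candidates = [s for s in dict.fromkeys(known_sites) if s in enabled_sites]
    return heapq.nlargest(limit, candidates, key=lambda s: importance.get(s, 0))


enabled = {"a.example", "b.example", "c.example"}
importance = {"a.example": 3, "b.example": 10, "c.example": 7, "d.example": 99}
itinerary = plan_itinerary(
    ["a.example", "b.example", "c.example", "d.example", "b.example"],
    enabled,
    importance,
    limit=2,
)
print(itinerary)
```

Note that `d.example` is skipped despite its high score because it is not crawler-enabled, which is why the manager needs knowledge about enabled sites.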
Crawler Query Engine
Used to access crawler contents after returning
  “Hot pages”
Characteristics
  Provide a query facility to query the crawler fact base
  Implement an SQL subset as query language
  Represent query result as data tuples, not as facts
  Allows the user to reason about crawling results
  Query engine implementation uses inference engine
Query engine serves as the primary interface between the user application and the mobile crawler
  SQL Database which holds Web index
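To show what querying a fact base with SQL and getting tuples back looks like, the sketch below loads hypothetical page facts into an in-memory SQLite table. This swaps in SQLite for convenience; the query engine in the talk compiles queries into rules for the inference engine instead. The fact layout `(url, title, keyword_hits)` is an assumption.

```python
import sqlite3

# Hypothetical page facts brought back by a crawler: (url, title, keyword_hits).
page_facts = [
    ("/a.html", "Mobile crawlers", 7),
    ("/b.html", "Cafeteria menu", 0),
    ("/c.html", "Crawler manager", 3),
]


def query_facts(facts, sql, params=()):
    """Load facts into an in-memory table and return result rows as tuples."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE page (url TEXT, title TEXT, keyword_hits INTEGER)"
        )
        conn.executemany("INSERT INTO page VALUES (?, ?, ?)", facts)
        return conn.execute(sql, params).fetchall()
    finally:
        conn.close()


hot_pages = query_facts(
    page_facts,
    "SELECT url, keyword_hits FROM page "
    "WHERE keyword_hits >= ? ORDER BY keyword_hits DESC",
    (3,),
)
print(hot_pages)
```

The result is a list of plain tuples, which an application can insert into its own database without knowing anything about facts or rules.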
Crawler Query Engine
[Figure: crawler query engine. Components shown: Crawler Object (Crawler Facts, Crawler Rules), Query Engine (Query Compiler, Query Rule, Inference Engine), with User Query as input and Result Tuples as output.]
Mobile Crawling Advantages
Remote page selection
  Determine significance of a page prior to transmission
  Applicable for topic-specific search engines
Remote page filtering
  Control the granularity of the retrieved data
  Applicable for non-fulltext search engines
Remote page compression
  Compress page data prior to transmission
  Applicable for all search engines
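Remote filtering and compression can be sketched with the Python standard library. Keeping only the page title is just one possible filter granularity, and zlib is an assumed choice of compressor; the talk does not say which filter or compression scheme the crawlers use.

```python
import zlib
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def filter_page(html):
    """Remote page filtering: keep only the title (one possible granularity)."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def compress_page(html):
    """Remote page compression: shrink the payload before transmission."""
    return zlib.compress(html.encode("utf-8"))


page = (
    "<html><head><title>Mobile Crawlers</title></head><body>"
    + "<p>crawler text</p>" * 50
    + "</body></html>"
)
filtered = filter_page(page)
packed = compress_page(page)
print(filtered, len(page.encode("utf-8")), len(packed))
```

Filtering is lossy and only suits engines that do not index full text, whereas compression is lossless and the receiver recovers the page with `zlib.decompress`.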
Performance Evaluation Setup
Two virtual machines (local and remote) plus crawler management system
Set up for mobile as well as traditional Web crawling
[Figure: evaluation setup. Components shown across the REMOTE and LOCAL sides: Crawler Manager with Crawler Spec, two Virtual Machines, three Communication Subsystems, and an HTTP Server serving HTML.]
Performance Evaluation
Focus on proving viability of mobile crawling approach
  Not focusing on analyzing crawling strategies [Cho97]
Controlled environment setup
  Static HTML data set with known properties - subset of University of Florida intranet
  Apache HTTP server, unshared communication channel
  Breadth-first crawling strategy - predictable crawler behavior
Measurements
  1. Network load for traditional (stationary) crawler
  2. Network load for mobile crawler without page compression
  3. Network load for mobile crawler with page compression
Benefit of Remote Page Selection
[Figure: bar chart of total load (KB, axis 0-450) for crawlers S1, M1, M2, M3, M4; series: uncompressed, compressed.]
Traditional crawler (S1) versus mobile crawlers (M1-M4) with different keyword sets for page selection
Benefit of Remote Page Filtering
[Figure: network load (axis 0%-120%) versus filter degree (90% down to 10%); series: load uncompressed, load compressed.]
Mobile crawler (M1) with a decreasing degree of page filtering (10%-90% page data preserved)
Benefit of Page Compression
[Figure: total load (in KB, axis 0-900) versus retrieved pages (1, 10, 22, 51, 82, 158); series: stationary, mobile uncompressed, mobile compressed.]
Traditional crawler (S1) and mobile crawler (M1) with an increasing number of crawled pages
Cost Benefit Analysis
Overhead
  Overhead due to crawler migration (<5K)
  Overhead due to fact-based data representation (6%)
Benefits without page compression
  As soon as less than 85% of each page needs to be preserved
  As long as less than 90% of all pages are transmitted
Benefits with page compression
  Reduction in network load by a factor of 4.5
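The trade-off above can be illustrated with a simplified load model. It is an idealized sketch, not the model behind the measurements: it assumes "<5K" means under 5 KB and uses that as a fixed migration cost, applies the 6% representation overhead to everything transmitted, and treats `compression_factor` as a hypothetical parameter. It is not fitted to the data and will not reproduce the 85% and 90% thresholds exactly.

```python
def stationary_load(total_page_bytes):
    """A stationary crawler transfers every page in full."""
    return total_page_bytes


def mobile_load(total_page_bytes, kept_fraction,
                migration_overhead=5 * 1024, fact_overhead=0.06,
                compression_factor=1.0):
    """Bytes transferred by a mobile crawler under the simplified model.

    kept_fraction:      share of the page data that is sent back (0..1)
    migration_overhead: fixed cost of moving the crawler (assumed 5 KB)
    fact_overhead:      relative cost of the fact-based representation
    compression_factor: hypothetical payload reduction from compression
    """
    payload = total_page_bytes * kept_fraction * (1 + fact_overhead)
    return migration_overhead + payload / compression_factor


def break_even_fraction(total_page_bytes,
                        migration_overhead=5 * 1024, fact_overhead=0.06):
    """Kept fraction at which mobile and stationary loads are equal."""
    return (total_page_bytes - migration_overhead) / (
        total_page_bytes * (1 + fact_overhead)
    )


total = 1_000_000
print(mobile_load(total, 0.5) < stationary_load(total))   # mobile wins
print(mobile_load(total, 1.0) < stationary_load(total))   # overhead dominates
print(round(break_even_fraction(total), 3))
```

The model shows the qualitative point of the slide: without compression the mobile crawler only pays off if it sends back noticeably less than everything, while with compression it can pay off even when all page data is kept.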
Summary and Conclusion
Mobile crawling advantages:
  Natural fit for distributed Web environment
  Well suited for topic-specific search engines
  Small network overhead due to crawler mobility
Solves scaling problems of traditional crawling approach by allowing filtering operations to be performed remotely
Approach provides a base for smart Web crawling
Currently improving crawler memory management
Completing more realistic testbed consisting of ~10 mobile crawler-enabled Web sites within UF intranet
Ongoing/Future Work
Security
  Crawler identification based on digital signatures
  Restrict crawler execution to positively identified crawlers
  Implement virtual machine as a secure sandbox
Crawler mobility support
  Integrate virtual machine into Web servers
  Comparison with other infrastructures, e.g., IBM Aglet infrastructure (currently ongoing)
Mobile crawling algorithms (currently ongoing)
  Optimize crawling algorithms, site relocation algorithms
  Carry out analysis