
(Web) Crawlers Domain
Crawlers - March 2008
Presented by: Or Shoham, Amit Yaniv, Guy Kroupp, Saar Kohanovitch




Crawlers
- Introduction
- Crawler Basics
- Domain Terminology
- In-Depth Domain Elaboration
- Application Examples
- UM Domain Analysis
- CM Domain Analysis
- Lessons Learned
- Conclusion


Introduction
A little bit about search engines
How do search engines work?
Why are crawlers needed?
Many names, same meaning: crawler, spider, robot, bot, Grub, spy
The Google phenomenon: founded by Larry Page and Sergey Brin, September 1998


Crawler Basics
What is a crawler?
How do crawlers work?
Crawling web pages:
- Which pages should the crawler download?
- How should the crawler refresh pages?
- How should the load on the visited web sites be minimized?
How do crawlers index web pages?
- Link indexing
- Text indexing
How do crawlers save data?
- Scalability: distribute the repository across a cluster of computers and disks
- Large bulk updates: the repository needs to handle a high rate of modifications
- Obsolete pages: there must be a mechanism for detecting and removing obsolete pages
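The download and refresh questions above can be sketched as a minimal crawl loop. This is an illustrative sketch, not code from the slides: the "web" is a hypothetical in-memory link graph, and the refresh policy is simplified to visiting each page exactly once.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> list of outgoing links.
PAGES = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}

def crawl(seeds):
    """Breadth-first crawl: take a URL from the frontier, 'download' the page,
    extract its links, and queue the ones not yet visited."""
    frontier = deque(seeds)   # the URL frontier (queue of URLs to visit)
    visited = set()
    order = []
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue          # visit-once policy stands in for a real refresh policy
        visited.add(url)
        order.append(url)
        for link in PAGES.get(url, []):   # the "parser" step: link extraction
            if link not in visited:
                frontier.append(link)
    return order

print(crawl(["http://a.example/"]))
# ['http://a.example/', 'http://b.example/', 'http://c.example/']
```

A real crawler would replace the `PAGES` lookup with an HTTP fetch and the visit-once rule with a page-refresh schedule.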


Domain Terminology
Link: an HTML element which redirects the user to a different web page
URL: Uniform Resource Locator, a World Wide Web address
Seeds: the set of URLs which form the crawler's starting point
Parser: the element responsible for extracting links from pages
Thread: a dependent instance of the same process, with its own stack
Queue: the element which holds the retrieved URLs
Politeness policy: a common set of rules intended to protect sites from being overloaded while they are crawled
Repository: the resource which stores all the crawler's retrieved data
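As an illustration of the politeness-policy term, here is a hedged sketch of a per-host minimum-delay rule; the class name `PolitenessPolicy` and the two-second delay are assumptions for the example, not part of the slides.

```python
from urllib.parse import urlparse

class PolitenessPolicy:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.last_hit = {}   # host -> timestamp of the last request to it

    def wait_time(self, url, now):
        """Seconds the crawler should still wait before hitting this URL's host."""
        host = urlparse(url).netloc
        last = self.last_hit.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.delay - (now - last))

    def record(self, url, now):
        """Remember that we just requested something from this URL's host."""
        self.last_hit[urlparse(url).netloc] = now

policy = PolitenessPolicy(delay_seconds=2.0)
policy.record("http://site.example/page1", now=100.0)
print(policy.wait_time("http://site.example/page2", now=101.0))  # 1.0
```

Real crawlers also honor robots.txt as part of politeness; that side is sketched later in the Code Modeling section's class list.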


Domain Elaboration
Rules which apply to the domain:
- All crawlers have a URL fetcher
- All crawlers have a parser (extractor)
- Crawlers are multi-threaded processes
- All crawlers have a crawler manager
The domain is strongly related to the search engine domain.
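The four domain rules can be sketched together in one skeleton. This is a hypothetical illustration, not code from any of the surveyed crawlers: the fetcher reads a canned dictionary instead of the network, pages are just space-separated link lists, and a worker thread shuts down once the frontier stays empty briefly (a demo-only termination policy).

```python
import queue
import threading

class URLFetcher:
    """Rule 1: every crawler has a URL fetcher. This one reads a canned dict."""
    def __init__(self, pages):
        self.pages = pages

    def fetch(self, url):
        return self.pages.get(url, "")

class Parser:
    """Rule 2: every crawler has a parser (extractor). A page here is a
    space-separated list of links."""
    def extract_links(self, content):
        return content.split()

class CrawlerManager:
    """Rules 3 and 4: a manager coordinating worker threads over a shared frontier."""
    def __init__(self, fetcher, parser, seeds):
        self.fetcher = fetcher
        self.parser = parser
        self.frontier = queue.Queue()
        self.visited = set()
        self.lock = threading.Lock()
        for seed in seeds:
            self.frontier.put(seed)

    def _worker(self):
        while True:
            try:
                # Demo-only shutdown: exit once the frontier stays empty briefly.
                url = self.frontier.get(timeout=0.2)
            except queue.Empty:
                return
            with self.lock:
                if url in self.visited:
                    continue
                self.visited.add(url)
            for link in self.parser.extract_links(self.fetcher.fetch(url)):
                self.frontier.put(link)

    def run(self, n_threads=2):
        threads = [threading.Thread(target=self._worker) for _ in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return self.visited

pages = {"a": "b c", "b": "c", "c": ""}
manager = CrawlerManager(URLFetcher(pages), Parser(), seeds=["a"])
print(sorted(manager.run()))  # ['a', 'b', 'c']
```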


Application Examples
Many different crawlers doing different things:
- WebCrawler
- Google Crawler
- Heritrix
- Mirroring applications


User Modeling


User Modeling: Class Diagram
Main classes:
Spider: the base component of the crawler. While each spider has its own unique way of performing, most spiders share the same basic features:


User Modeling: Class Diagram
Features:
- Run/Kill: activation and deactivation of the spider
- Update: updating running parameters
To obtain the requested URLs, the spider uses:


User Modeling: Class Diagram
To obtain the requested URLs, the spider uses:
URLFetchNow: the basic class that actually fetches the URLs.
Basic features:
- URLFetchNow: activation of the class
- Get/Fetch URL: gets the URL


User Modeling: Class Diagram
To configure the spider's parameters:
SpiderConfig: the basic class that sets the spider's configuration and lets the spider update itself.
Features:
- Set/Get Configuration


User Modeling: Class Diagram
To sort results we need some kind of data structure; the most common is a queue:
URLQueueHandler: a class containing a queue, or any other data structure, which sorts results.
Features:
- Queue/Dequeue


User Modeling: Class Diagram
To make search and result handling more efficient, we use an:
Indexer: a class that sets the most effective index and lets the spider use and set it.
Features:
- Set/Get Index()


User Modeling: Class Diagram
To control the spider, some entity needs access to create and kill it; this entity is updated by the queue or the scheduler. For this we use a:
CrawlerManager: a class that makes the call whether the spider is created or killed.
Features:
- Update by Scheduler/Queue: enables the queue/scheduler to inform the manager about ongoing activity


User Modeling: Class Diagram
In most cases we use a database to store our results, so we need a class that communicates with the DB:
StorageManager: a class that writes the crawl results to the DB.
Features:
- Sort Info(): the manager sorts the info before writing it to the DB
- Write To DB(): writes crawl results to the DB
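A minimal sketch of such a storage manager, assuming sqlite as the database; the table schema, class name, and in-memory connection are illustrative choices, not taken from the slides.

```python
import sqlite3

class StorageManager:
    """Sorts crawl results and writes them to a database (sqlite, in-memory here)."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, content TEXT)")

    def sort_info(self, results):
        """The slide's Sort Info(): order results before writing them out."""
        return sorted(results)

    def write_to_db(self, results):
        """The slide's Write To DB(): persist (url, content) pairs."""
        self.conn.executemany(
            "INSERT OR REPLACE INTO pages (url, content) VALUES (?, ?)",
            self.sort_info(results))
        self.conn.commit()

store = StorageManager()
store.write_to_db([("http://b.example/", "bbb"), ("http://a.example/", "aaa")])
print(store.conn.execute("SELECT url FROM pages ORDER BY url").fetchall())
# [('http://a.example/',), ('http://b.example/',)]
```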


User Modeling: Sequence (1)
Getting schedule: the Crawler Manager gets the next schedule from the Scheduler.


User Modeling: Sequence (2)
Creating a new spider: the Crawler Manager creates a new Spider.


User Modeling: Sequence (3)
Creating a new search: the Crawler Manager tells the Spider to start searching.


User Modeling: Sequence (4)
Getting an index: the Spider gets the index for the next crawl from the Crawler Manager.


User Modeling: Sequence (5)
Fetching URLs: the Spider activates URL fetching via URLFetchNow.


User Modeling: Sequence (6)
Queuing results: the Spider sends results to the queue (URLQueueHandler).


User Modeling: Sequence (7)
Dequeuing results: the Spider dequeues sorted results from the URLQueueHandler.


User Modeling: Sequence (8)
Writing to DB: the Spider sends sorted results to the DB via the Storage Handler.


User Modeling: Sequence (9)
Update Scheduler: the Queue Handler updates the Scheduler.


User Modeling: Sequence (10)
Update Manager: the Scheduler updates the Crawler Manager.


User Modeling: Sequence (11)
Kill Spider: the Crawler Manager kills the Spider at the end of the process.


Domain Patterns
How can a crawler cope with new page standards and conventions?
- Fetch new-standard pages
- Index new-standard pages
The Factory design pattern


Domain Patterns (2)
The Parser class as a Factory design: parses different page types (HTML, PDF, Word, etc.)
The URL Fetcher class as a Factory design: fetches pages over different protocols and conventions (UDP, TCP/IP, FTP, IPv6)
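A sketch of the Factory idea for the parser case; the class names, content types, and `parse` behavior are illustrative assumptions, not taken from any of the surveyed crawlers.

```python
class HTMLParser:
    """Handles HTML pages (stub: just reports what it was given)."""
    def parse(self, data):
        return f"html:{len(data)} bytes"

class PDFParser:
    """Handles PDF pages (stub)."""
    def parse(self, data):
        return f"pdf:{len(data)} bytes"

class ParserFactory:
    """Factory: map a content type to the parser class that handles it.
    Supporting a new page standard means registering one more entry."""
    _registry = {
        "text/html": HTMLParser,
        "application/pdf": PDFParser,
    }

    @classmethod
    def create(cls, content_type):
        try:
            return cls._registry[content_type]()
        except KeyError:
            raise ValueError(f"no parser registered for {content_type}")

parser = ParserFactory.create("text/html")
print(parser.parse(b"<html></html>"))  # html:13 bytes
```

The same shape applies to the fetcher: a registry keyed by protocol instead of content type.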

How do we ensure there is only one Crawler Manager, Queue, and Repository?
The Singleton design pattern
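A common Python realization of the Singleton, shown here for a hypothetical CrawlerManager; overriding `__new__` is one of several ways to do this, and the `active_spiders` attribute is just for illustration.

```python
class CrawlerManager:
    """Singleton: __new__ always hands back the one shared instance,
    so every part of the crawler talks to the same manager."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.active_spiders = []   # initialized exactly once
        return cls._instance

a = CrawlerManager()
b = CrawlerManager()
print(a is b)  # True
```

The same pattern would cover the single Queue and Repository the slide asks about.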


User Modeling: Lessons
A problem: too little info or too much info?
Scoping: where does a crawler begin and where does it end?
What is a general feature and what is a specific feature?
Code varies more than the domain.
Automatic reverse engineering or manual?



Code Modeling


Code Modeling – Reverse Engineering – Applications (1)
Applications which were reverse engineered:
- Arale, WebEater: basic web crawlers for file downloading (for offline viewing)
- JoBo: advanced web crawler for file downloading (for offline viewing)
- Heritrix: advanced distributed crawler for file downloading (to archives)
- HyperSpider: basic crawler for displaying hyperlink trees


Code Modeling – Reverse Engineering – Applications (2)
- Nutch (Lucene): advanced distributed crawler / search engine for indexing
- WebSphinx: crawler framework for mirroring and hyperlink tree display
- Aperture: advanced crawler able to read HTTP, FTP, and local files, for indexing


Code Modeling – Reverse Engineering – CASE Tool
Reverse engineering was done using Visual Paradigm for UML
Used only for class diagrams; use case and sequence diagrams were modeled by hand based on classes, usage, and documentation
Good results for small applications, poor results for large applications (too much noise made the signal hard to find)


Application class: A single class containing the main application elements, starts the crawling sequence based on parameters

Page Manager (Page): Class holding all data relevant to a web (or local) page, may save entire page or only summary / relevant parts

Parameters: Class holding parameters required for the application to run

Robots: Class containing information on pages the crawler may not visit

Queue: Class containing a list of links (pages) the crawler should visit

Thread: Class containing information required for each crawler thread

Listener: Class responsible for receiving pages from the internet

Extractor: Class responsible for parsing pages and extracting links for queue

Filters: Classes responsible for deciding if a link should be queued or visited

Helpers: Classes responsible for helping the crawler deal with forms, cookies, etc.

DB / Merger / External DB: Classes required for saving data into databases for local / distributed applications with DBs
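The Robots class above can lean on Python's standard library, which ships a robots.txt parser. A hedged sketch follows; the robots.txt content and user-agent name are made-up examples, and a real crawler would fetch the file from the site rather than hard-coding it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from http://site.example/robots.txt before crawling the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The crawler consults the parser before queuing each link.
print(rp.can_fetch("MyCrawler", "http://site.example/public/page.html"))   # True
print(rp.can_fetch("MyCrawler", "http://site.example/private/page.html"))  # False
```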


Code Modeling – Sequence (1)


Code Modeling – Sequence (2)


Code Modeling – Sequence (3)


Code Modeling – Sequence (4)


Code Modeling – Results Example


Code Modeling - Conclusions
Very difficult to reach domain-level abstraction based on code modeling
VP not very helpful in dealing with large applications (clutter)
Difficult to understand sequences and use cases correctly (no R.E. at all)
Documentation was often the most helpful tool for code modeling, rather than R.E.


Domain Modeling with ADOM
ADOM was helpful in establishing domain requirements
Difficult to model when many optional entities exist, some of which heavily impact class relations and sequences
ADOM was not very helpful with abstraction, but that may be a function of the domain itself (functional)
End results difficult to read, but seem to provide a good domain framework for applications


Domain Problems and Issues
The crawler domain contains many functional entities which do not necessarily store information (difficult to model)
Many optional controller / manager entities (clutter with relations)
Vast difference in application scale
Entity / function containment


Future Work (1)
Merging Code Modeling and User Modeling will be difficult:
User modeling focused mostly on large-scale crawlers (research focuses on these)
- Mostly from a search engine perspective
- Schedule-oriented
- High level of abstraction


Future Work (2)
Code modeling focused mostly on smaller applications (easier to model, available)
- Focus mostly on archival / mirroring
- User-oriented
- Medium level of abstraction


Future Work (3)
Merged product entities should be closer to User Modeling than Code Modeling (higher level of abstraction)
- User vs. schedule
- Indexing vs. archiving
- Importance of optional entities


Web Crawlers Domain
Thank you. Any questions?