
(Web) Crawlers Domain
Crawlers - March 2008
Presented by: Or Shoham, Amit Yaniv, Guy Kroupp, Saar Kohanovitch




Crawlers
- Introduction
- Crawler Basics
- Domain Terminology
- In-Depth Domain Elaboration
- Application Examples
- UM Domain Analysis
- CM Domain Analysis
- Lessons Learned
- Conclusion


Introduction
A little bit about search engines
How do search engines work?
Why are crawlers needed?
Many names, same meaning: crawler, spider, robot, bot, Grub, spy
The Google phenomenon: founded by Larry Page and Sergey Brin, September 1998


Crawler Basics
What is a crawler?
How do crawlers work?
Crawling web pages:
- Which pages should the crawler download?
- How should the crawler refresh pages?
- How should the load on the visited web sites be minimized?
How do crawlers index web pages?
- Link indexing
- Text indexing
How do crawlers save data?
- Scalability: distribute the repository across a cluster of computers and disks
- Large bulk updates: the repository needs to handle a high rate of modifications
- Obsolete pages: there must be a mechanism for detecting and removing obsolete pages
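The download and refresh questions above can be sketched as a minimal crawl loop. This is an illustrative sketch, not code from the slides: the "web" is a hypothetical in-memory link graph, and the refresh policy is simplified to visiting each page exactly once.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> list of outgoing links.
PAGES = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}

def crawl(seeds):
    """Breadth-first crawl: take a URL from the frontier, 'download' the page,
    extract its links, and queue the ones not yet visited."""
    frontier = deque(seeds)   # the URL frontier (queue of URLs to visit)
    visited = set()
    order = []
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue          # visit-once policy stands in for a real refresh policy
        visited.add(url)
        order.append(url)
        for link in PAGES.get(url, []):   # the "parser" step: link extraction
            if link not in visited:
                frontier.append(link)
    return order

print(crawl(["http://a.example/"]))
# ['http://a.example/', 'http://b.example/', 'http://c.example/']
```

A real crawler would replace the `PAGES` lookup with an HTTP fetch and the visit-once rule with a page-refresh schedule.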


Domain Terminology
Link: an HTML element which redirects the user to a different web page
URL: Uniform Resource Locator, a World Wide Web address
Seeds: the set of URLs which form the crawler's starting point
Parser: the element responsible for extracting links from pages
Thread: a dependent instance of the same process, with its own stack
Queue: the element which holds the retrieved URLs
Politeness policy: a common set of rules intended to protect sites from being overloaded while they are crawled
Repository: the resource which stores all the crawler's retrieved data
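As an illustration of the politeness-policy term, here is a hedged sketch of a per-host minimum-delay rule; the class name `PolitenessPolicy` and the two-second delay are assumptions for the example, not part of the slides.

```python
from urllib.parse import urlparse

class PolitenessPolicy:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.last_hit = {}   # host -> timestamp of the last request to it

    def wait_time(self, url, now):
        """Seconds the crawler should still wait before hitting this URL's host."""
        host = urlparse(url).netloc
        last = self.last_hit.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.delay - (now - last))

    def record(self, url, now):
        """Remember that we just requested something from this URL's host."""
        self.last_hit[urlparse(url).netloc] = now

policy = PolitenessPolicy(delay_seconds=2.0)
policy.record("http://site.example/page1", now=100.0)
print(policy.wait_time("http://site.example/page2", now=101.0))  # 1.0
```

Real crawlers also honor robots.txt as part of politeness; that side is sketched later in the Code Modeling section's class list.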


Domain Elaboration
Rules which apply to the domain:
- All crawlers have a URL fetcher
- All crawlers have a parser (extractor)
- Crawlers are multi-threaded processes
- All crawlers have a crawler manager
The domain is strongly related to the search engine domain.
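The four domain rules can be sketched together in one skeleton. This is a hypothetical illustration, not code from any of the surveyed crawlers: the fetcher reads a canned dictionary instead of the network, pages are just space-separated link lists, and a worker thread shuts down once the frontier stays empty briefly (a demo-only termination policy).

```python
import queue
import threading

class URLFetcher:
    """Rule 1: every crawler has a URL fetcher. This one reads a canned dict."""
    def __init__(self, pages):
        self.pages = pages

    def fetch(self, url):
        return self.pages.get(url, "")

class Parser:
    """Rule 2: every crawler has a parser (extractor). A page here is a
    space-separated list of links."""
    def extract_links(self, content):
        return content.split()

class CrawlerManager:
    """Rules 3 and 4: a manager coordinating worker threads over a shared frontier."""
    def __init__(self, fetcher, parser, seeds):
        self.fetcher = fetcher
        self.parser = parser
        self.frontier = queue.Queue()
        self.visited = set()
        self.lock = threading.Lock()
        for seed in seeds:
            self.frontier.put(seed)

    def _worker(self):
        while True:
            try:
                # Demo-only shutdown: exit once the frontier stays empty briefly.
                url = self.frontier.get(timeout=0.2)
            except queue.Empty:
                return
            with self.lock:
                if url in self.visited:
                    continue
                self.visited.add(url)
            for link in self.parser.extract_links(self.fetcher.fetch(url)):
                self.frontier.put(link)

    def run(self, n_threads=2):
        threads = [threading.Thread(target=self._worker) for _ in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return self.visited

pages = {"a": "b c", "b": "c", "c": ""}
manager = CrawlerManager(URLFetcher(pages), Parser(), seeds=["a"])
print(sorted(manager.run()))  # ['a', 'b', 'c']
```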


Application Examples
Many different crawlers doing different things:
- WebCrawler
- Google Crawler
- Heritrix
- Mirroring applications


User Modeling


User Modeling: Class Diagram
Main classes:
Spider: the base component of the crawler. While each spider has its own unique way of performing, most spiders share the same basic features:


User Modeling: Class Diagram
Features:
- Run/Kill: activation and deactivation of the spider
- Update: updating running parameters
To obtain the requested URLs, the spider uses:


User Modeling: Class Diagram
To obtain the requested URLs, the spider uses:
URLFetchNow: the basic class that actually fetches the URLs.
Basic features:
- URLFetchNow: activation of the class
- Get/Fetch URL: gets the URL


User Modeling: Class Diagram
To configure the spider's parameters:
SpiderConfig: the basic class that sets the spider's configuration and lets the spider update itself.
Features:
- Set/Get Configuration


User Modeling: Class Diagram
To sort results we need some kind of data structure; the most common is a queue:
URLQueueHandler: a class containing a queue, or any other data structure, which sorts results.
Features:
- Queue/Dequeue


User Modeling: Class Diagram
To make search and result handling more efficient, we use an:
Indexer: a class that sets the most effective index and lets the spider use and set it.
Features:
- Set/Get Index()


User Modeling: Class Diagram
To control the spider, some entity needs access to create and kill it; this entity is updated by the queue or the scheduler. For this we use a:
CrawlerManager: a class that makes the call whether the spider is created or killed.
Features:
- Update by Scheduler/Queue: enables the queue/scheduler to inform the manager about ongoing activity


User Modeling: Class Diagram
In most cases we use a database to store our results, so we need a class that communicates with the DB:
StorageManager: a class that writes the crawl results to the DB.
Features:
- Sort Info(): the manager sorts the info before writing it to the DB
- Write To DB(): writes crawl results to the DB
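A minimal sketch of such a storage manager, assuming sqlite as the database; the table schema, class name, and in-memory connection are illustrative choices, not taken from the slides.

```python
import sqlite3

class StorageManager:
    """Sorts crawl results and writes them to a database (sqlite, in-memory here)."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, content TEXT)")

    def sort_info(self, results):
        """The slide's Sort Info(): order results before writing them out."""
        return sorted(results)

    def write_to_db(self, results):
        """The slide's Write To DB(): persist (url, content) pairs."""
        self.conn.executemany(
            "INSERT OR REPLACE INTO pages (url, content) VALUES (?, ?)",
            self.sort_info(results))
        self.conn.commit()

store = StorageManager()
store.write_to_db([("http://b.example/", "bbb"), ("http://a.example/", "aaa")])
print(store.conn.execute("SELECT url FROM pages ORDER BY url").fetchall())
# [('http://a.example/',), ('http://b.example/',)]
```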


User Modeling: Sequence (1)
Getting schedule: the Crawler Manager gets the next schedule from the Scheduler.


User Modeling: Sequence (2)
Creating a new spider: the Crawler Manager creates a new Spider.


User Modeling: Sequence (3)
Creating a new search: the Crawler Manager tells the Spider to start searching.


User Modeling: Sequence (4)
Getting an index: the Spider gets the index for the next crawl from the Crawler Manager.


User Modeling: Sequence (5)
Fetching URLs: the Spider activates URL fetching via URLFetchNow.


User Modeling: Sequence (6)
Queuing results: the Spider sends results to the queue (URLQueueHandler).


User Modeling: Sequence (7)
Dequeuing results: the Spider dequeues sorted results from the URLQueueHandler.


User Modeling: Sequence (8)
Writing to DB: the Spider sends sorted results to the DB via the Storage Handler.


User Modeling: Sequence (9)
Update Scheduler: the Queue Handler updates the Scheduler.


User Modeling: Sequence (10)
Update Manager: the Scheduler updates the Crawler Manager.


User Modeling: Sequence (11)
Kill Spider: the Crawler Manager kills the Spider at the end of the process.


Domain Patterns
How can a crawler cope with new page standards and conventions?
- Fetch new-standard pages
- Index new-standard pages
The Factory design pattern


Domain Patterns (2)
The Parser class as a Factory design: parses different page types (HTML, PDF, Word, etc.)
The URL Fetcher class as a Factory design: fetches pages over different protocols and conventions (UDP, TCP/IP, FTP, IPv6)
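A sketch of the Factory idea for the parser case; the class names, content types, and `parse` behavior are illustrative assumptions, not taken from any of the surveyed crawlers.

```python
class HTMLParser:
    """Handles HTML pages (stub: just reports what it was given)."""
    def parse(self, data):
        return f"html:{len(data)} bytes"

class PDFParser:
    """Handles PDF pages (stub)."""
    def parse(self, data):
        return f"pdf:{len(data)} bytes"

class ParserFactory:
    """Factory: map a content type to the parser class that handles it.
    Supporting a new page standard means registering one more entry."""
    _registry = {
        "text/html": HTMLParser,
        "application/pdf": PDFParser,
    }

    @classmethod
    def create(cls, content_type):
        try:
            return cls._registry[content_type]()
        except KeyError:
            raise ValueError(f"no parser registered for {content_type}")

parser = ParserFactory.create("text/html")
print(parser.parse(b"<html></html>"))  # html:13 bytes
```

The same shape applies to the fetcher: a registry keyed by protocol instead of content type.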

How do we ensure there is only one Crawler Manager, Queue, and Repository?
The Singleton design pattern
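A common Python realization of the Singleton, shown here for a hypothetical CrawlerManager; overriding `__new__` is one of several ways to do this, and the `active_spiders` attribute is just for illustration.

```python
class CrawlerManager:
    """Singleton: __new__ always hands back the one shared instance,
    so every part of the crawler talks to the same manager."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.active_spiders = []   # initialized exactly once
        return cls._instance

a = CrawlerManager()
b = CrawlerManager()
print(a is b)  # True
```

The same pattern would cover the single Queue and Repository the slide asks about.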


User Modeling: Lessons
A problem: too little info or too much info?
Scoping: where does a crawler begin and where does it end?
What is a general feature and what is a specific feature?
Code varies more than the domain.
Automatic reverse engineering or manual?



Code Modeling


Code Modeling – Reverse Engineering – Applications (1)
Applications which were reverse engineered:
- Arale, WebEater: basic web crawlers for file downloading (for offline viewing)
- JoBo: advanced web crawler for file downloading (for offline viewing)
- Heritrix: advanced distributed crawler for file downloading (to archives)
- HyperSpider: basic crawler for displaying hyperlink trees


Code Modeling – Reverse Engineering – Applications (2)
- Nutch (Lucene): advanced distributed crawler / search engine for indexing
- WebSphinx: crawler framework for mirroring and hyperlink tree display
- Aperture: advanced crawler able to read HTTP, FTP, and local files, for indexing


Code Modeling – Reverse Engineering – CASE Tool
Reverse engineering was done using Visual Paradigm for UML
Used only for class diagrams; use case and sequence diagrams were modeled by hand based on classes, usage, and documentation
Good results for small applications, poor results for large applications (too much noise made the signal hard to find)


Application class: A single class containing the main application elements, starts the crawling sequence based on parameters

Page Manager (Page): Class holding all data relevant to a web (or local) page, may save entire page or only summary / relevant parts

Parameters: Class holding parameters required for the application to run

Robots: Class containing information on pages the crawler may not visit

Queue: Class containing a list of links (pages) the crawler should visit

Thread: Class containing information required for each crawler thread

Listener: Class responsible for receiving pages from the internet

Extractor: Class responsible for parsing pages and extracting links for queue

Filters: Classes responsible for deciding if a link should be queued or visited

Helpers: Classes responsible for helping the crawler deal with forms, cookies, etc.

DB / Merger / External DB: Classes required for saving data into databases for local / distributed applications with DBs
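The Robots class above can lean on Python's standard library, which ships a robots.txt parser. A hedged sketch follows; the robots.txt content and user-agent name are made-up examples, and a real crawler would fetch the file from the site rather than hard-coding it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from http://site.example/robots.txt before crawling the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The crawler consults the parser before queuing each link.
print(rp.can_fetch("MyCrawler", "http://site.example/public/page.html"))   # True
print(rp.can_fetch("MyCrawler", "http://site.example/private/page.html"))  # False
```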


Code Modeling – Sequence (1)


Code Modeling – Sequence (2)


Code Modeling – Sequence (3)


Code Modeling – Sequence (4)


Code Modeling – Results Example


Code Modeling - Conclusions
Very difficult to reach domain-level abstraction based on code modeling
VP not very helpful in dealing with large applications (clutter)
Difficult to understand sequences and use cases correctly (no R.E. at all)
Documentation was often the most helpful tool for code modeling, rather than R.E.


Domain Modeling with ADOM
ADOM was helpful in establishing domain requirements
Difficult to model when many optional entities exist, some of which heavily impact class relations and sequences
ADOM was not very helpful with abstraction, but that may be a function of the domain itself (functional)
End results difficult to read, but seem to provide a good domain framework for applications


Domain Problems and Issues
The crawler domain contains many functional entities which do not necessarily store information (difficult to model)
Many optional controller / manager entities (clutter with relations)
Vast difference in application scale
Entity / function containment


Future Work (1)
Merging Code Modeling and User Modeling will be difficult:
User modeling focused mostly on large-scale crawlers (research focuses on these)
- Mostly from a search engine perspective
- Schedule-oriented
- High level of abstraction


Future Work (2)
Code modeling focused mostly on smaller applications (easier to model, available)
- Focus mostly on archival / mirroring
- User-oriented
- Medium level of abstraction


Future Work (3)
Merged product entities should be closer to User Modeling than Code Modeling (higher level of abstraction)
- User vs. schedule
- Indexing vs. archiving
- Importance of optional entities


Web Crawlers Domain
Thank you. Any questions?