
[IEEE 17th International Conference on Database and Expert Systems Applications (DEXA'06) - Krakow, Poland (04-08 Sept. 2006)] 17th International Conference on Database and Expert


Automating Change detection and Notification of Web Pages∗

(Invited paper)

Sharma Chakravarthy and Subramanian C Hari Hara

Information Technology Laboratory and Computer Science & Engineering Department
The University of Texas at Arlington
{sharma, harihara}@cse.uta.edu

Abstract

Search engines provide an invaluable capability to search the web, which is growing at a fast pace. In addition to growth, the content of the pages on the web is also changing continuously. Periodic retrieval of pages for understanding changes is both inefficient and time consuming. Search engines do not help in this aspect of information retrieval at all. In this paper, we present an overview of WebVigiL, a system that automates the change detection and timely notification of HTML/XML pages based on user-specified changes of interest. User interest, specified as a sentinel/profile, is automatically monitored by the system using a combination of learning-based and event-driven techniques. The system is currently available for use.

1. Introduction

The World Wide Web has become an indispensable source of information. There is data for everyone on the web, and this data is increasing at a rapid pace. This has greatly affected the way in which information is accessed, searched, and delivered. Users, at present, are interested not only in the new information available on web pages but also in learning, in a timely manner, when that information changes. For example, researchers might want to know which conference has extended the deadline for paper submission, and the new deadline. They might also want to track whether any new papers have been added to faculty web sites. As another example, students might want to track their course web site for updates on homework, projects, or assignments. In general, the ability to specify and monitor changes to arbitrary web pages and to be notified in a timely manner is necessary. This is a more effective and efficient alternative to visiting web pages periodically to monitor them for changes.

∗This work was supported, in part, by NSF grants (IIS-0123730, EIA-0216500 and IIS 0534611).

WebVigiL has been developed with the above problem in mind and is a general-purpose system for monitoring changes over a distributed repository. WebVigiL provides a powerful way to disseminate change information efficiently without sending unnecessary or irrelevant information. Traditionally, retrieval of information has been done using the pull paradigm [1], where users explicitly query pages of interest on a regular basis and analyze them for changes of interest. In contrast, in the push paradigm the system is responsible for detecting changes in the best possible way and notifying the user. WebVigiL uses a combination of push and intelligent pull paradigms to automate change detection and notification. For the timely retrieval of pages, the detection of changes, and the notification of results, WebVigiL uses active capability in the form of ECA (event-condition-action) rules.

We can summarize WebVigiL as a general-purpose, active-capability-based information monitoring and notification system that retrieves information from remote sources, detects changes, and notifies users of the changes of interest to them in a timely manner. The larger goal of WebVigiL is to handle the specification, management, and evaluation of sentinels (requests/profiles given by the user to specify the changes they are interested in). The focus of change detection in WebVigiL is to detect selective changes based on user intent in the context of Hyper Text Markup Language (HTML) and eXtensible Markup Language (XML), which constitute a major portion of the documents on the World Wide Web. Meaningful presentation of changes is also an important aspect of WebVigiL.

2. WebVigiL Architecture

Figure 1 illustrates the architecture of the WebVigiL system. Below, we briefly discuss the modules of WebVigiL.

Proceedings of the 17th International Conference on Database and Expert Systems Applications (DEXA'06), 0-7695-2641-1/06 $20.00 © 2006

Figure 1. WebVigiL Architecture

Sentinel: A sentinel is the medium through which a user can specify his/her request for monitoring. Users specify their interests, which capture the information required for change detection and notification. Users specify the following: a) the page to be monitored; b) the type of content (any change or specific keywords/phrases/links/images) to be monitored; c) a single page or pages up to a certain depth; d) how (with a given frequency or when the page changes) the page needs to be monitored; e) the start and end time for the monitoring request; f) the medium and how (periodically or immediately) notification of changes should be sent; and g) the relative version used for comparison. The specification language also supports complex queries via the AND, OR, and NOT operators. A web-based interface is provided for specifying these requirements and for submitting profiles to the WebVigiL system.

Verification Module: Sentinels are processed for syntactic and semantic correctness. Valid sentinels are stored in the knowledge base, and a notification is sent to the change detection module for further processing of the sentinel.

Knowledge Base Module: The knowledge base is maintained as a set of relational tables (Oracle database). It maintains the state of all the sentinels in the system. It is also used for recovering the server after a failure.

Change Detection Module: Changes are detected by extracting appropriate objects from the pages based on the change type specified by the user. The objects are then compared for changes with the previous version of the page. This module uses ECA rules (to activate/deactivate sentinels, to generate fetch rules for retrieving pages, to detect events of interest, and to generate time-based notification of changes), change detection graphs (to reduce the amount of information stored when more than one user requests different types of changes on the same page), and the CH-Diff (HTML pages) [2, 3] and CX-Diff (XML pages) [4, 5] algorithms.

Fetch Module: A page has to be fetched in order to monitor the change requested by a sentinel. The fetch module fetches the pages of interest. The properties of a page, such as the Last Modified Time (LMT) or the size of the page (Checksum), are checked to determine the freshness of a page. Only if a change is detected in the properties does the module fetch the actual page and send it to the version controller for storage. The fetch module employs two rules for fetching a page. If a user specifies the frequency for fetching the page, WebVigiL uses a Fixed-Interval Rule. If a user wants the system to determine the frequency, then a Best Effort Rule [6] is used.

Version Management: WebVigiL requires various versions of a page at different times. Moving and every change types require older pages. Every page fetched needs to be archived and managed efficiently until it is no longer needed for change detection. Also, if the page properties have not changed, the page need not be fetched again. This module manages all the pages retrieved and supplies appropriate versions of a page to the requesting modules at runtime. Deletion of pages is also handled by this module.

Presentation and Notification Module: Changes detected in a page need to be presented to the user in an intuitive and meaningful manner, and at the same time the device characteristics need to be taken into account. The timeliness of presentation is also important. This module is responsible for notifying the user of detected changes and presenting these changes in an elegant manner.
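The property check described for the Fetch Module can be illustrated with a short sketch. This is not WebVigiL's code: the `PageProperties` record, the `needs_fetch` function, and the rule of preferring the LMT and falling back to the checksum are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageProperties:
    """Cheap-to-obtain properties of a page (hypothetical record)."""
    last_modified: Optional[str]  # Last Modified Time (LMT), if the server reports one
    checksum: Optional[str]       # checksum of the page, if available


def needs_fetch(previous: Optional[PageProperties], current: PageProperties) -> bool:
    """Return True when the full page should be fetched and archived."""
    if previous is None:
        return True  # no stored version yet
    if previous.last_modified is not None and current.last_modified is not None:
        return previous.last_modified != current.last_modified
    # Assumed fallback: compare checksums when the LMT is unavailable.
    return previous.checksum != current.checksum
```

With this rule, a page whose properties are unchanged is never downloaded a second time, which matches the behaviour described under Version Management.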

3. Knowledge Base

User information, such as the sentinel creation date, sentinel start/end time, change type, notification method, and the page versions used for change detection, needs to be stored not only to allow users to keep track of their sentinels but also to provide details about a sentinel to several modules at runtime. Hence, there is a need to store information about a sentinel in a persistent and recoverable manner. WebVigiL stores the sentinel information in a knowledge base (KB). Each monitoring request is checked for correctness before being persisted in the KB. The validation module performs three types of validation: syntax, semantics, and database. Syntax validation checks the correctness of a sentinel in terms of the defined grammar of the change specification language. The accepted sentinel should be semantically correct for the policies to be meaningful and usable by the different modules of WebVigiL. Database validation makes sure that the attributes to be stored follow the format specified by the database.

4. Event-Based Intelligent Fetching

Users can specify a sentinel for fetching with either the On Change option or the Interval-Based option. Based on this, an event is generated with a Best-Effort Rule or an Interval-Based Rule; the two differ in the way “t” (the fetch interval) of the periodic event is handled.


Interval-Based Rule: The user can explicitly specify a fetch frequency. For example, if the user is aware that a page changes every 4 hrs, he/she can specify a sentinel to start monitoring the page with a fetch interval of 4 hrs. For this, a periodic event whose periodicity (interval t) is equal to the given interval is created [7].

Best Effort (BE) Rule: In the best-effort algorithm, the next fetch interval for a page is computed using its change history. When the next fetch interval is determined, the BE Rule changes the interval “t” of the periodic event. Clearly, the effectiveness of the algorithm depends on the accurate estimation of the fetch interval.

The current learning-based algorithm [8], which is an enhancement of a previous best-effort algorithm [6], can be summarized as follows.

Initial Tuning: The learning algorithm starts with an initial time period (t minutes). This is the minimum time interval between any two fetches. To tune to the actual page change frequency, taking an aggressive approach, the time period is doubled until the first change is detected. To prevent the possibility of non-detection of intermediate changes, the frequency is increased by a factor of 1.5 from that point.

No Change Detected: There will be many instances when the currently estimated frequency of page change differs from the actual frequency of page change. These situations lead to unsuccessful fetches, wherein a page is fetched but no change is detected. Since the goal of the algorithm is to minimize the number of unsuccessful fetches, the fetch interval is increased by 20% after every unsuccessful fetch. This lowers the fetch frequency gradually rather than changing it abruptly as done by the current algorithm.
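The two interval adjustments described above (doubling during initial tuning, and a 20% increase after an unsuccessful fetch) can be sketched as follows. The function names are hypothetical, and the factor-of-1.5 adjustment is left out because the text does not say precisely how it is applied.

```python
def tune_initial_interval(interval_min: float, change_seen: bool) -> float:
    """Initial tuning: double the interval until the first change is detected."""
    return interval_min if change_seen else interval_min * 2.0


def after_unsuccessful_fetch(interval_min: float) -> float:
    """No change detected: lengthen the fetch interval by 20%."""
    return interval_min * 1.2
```

For example, starting from a 10-minute interval, three fetches without a change during initial tuning give intervals of 20, 40, and 80 minutes.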

Collecting required history: When a page change is detected, the algorithm predicts the next change time based on the observed change history of the page. A certain minimum history size is required before the actual prediction can be done. In the current algorithm, until the required history size (= window size) is collected, the exponential weighted mean method is used to predict the next change. The equation to calculate the exponential mean is given below:

$T_{next} = \alpha \times T_{cur} + (1 - \alpha) \times T_{avg}$

The value of $\alpha$ in this equation determines the amount of weight given to the current interval and to the values present in the history. A value of $\alpha = 0.35$ gave the best results.
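As a worked illustration of the exponential weighted mean, the sketch below evaluates the equation for a small history. It assumes that T_avg is the arithmetic mean of the intervals in the history, which the text does not state explicitly; the function name is hypothetical.

```python
from statistics import mean

ALPHA = 0.35  # value reported in the text as giving the best results


def predict_next_interval(current: float, history: list[float],
                          alpha: float = ALPHA) -> float:
    """T_next = alpha * T_cur + (1 - alpha) * T_avg.

    T_avg is taken to be the arithmetic mean of the history (an assumption).
    """
    t_avg = mean(history)
    return alpha * current + (1.0 - alpha) * t_avg
```

With a current interval of 60 minutes and a history of 120, 100, and 80 minutes (mean 100), the prediction is 0.35 × 60 + 0.65 × 100 = 86 minutes.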

Change Detected: If the page is not changing with constant frequency, the current algorithm takes a pessimistic approach in predicting the next value. It calculates the average of all the values in the history and predicts the next change interval as 60% of that average value. To make a better prediction, the new algorithm uses a weighted average to predict the next value. The weights associated with different values are calculated by observing the changing behavior of the standard deviation of the series.

5. Change Detection Algorithms

In WebVigiL, change detection is performed between two different versions of a given page. When the comparison of two versions of the same page results in the detection of a change, all the sentinels interested in that page are notified of the changes. The change detection algorithms CH-DIFF and CX-DIFF have been developed to detect changes occurring in HTML and XML documents, respectively.

CH-DIFF [2, 3] detects changes to various components such as links, images, keywords, phrases, and anychange. By anychange we mean a change to any word in the page and/or changes to links/images. For detecting changes to any word we could use the Longest Common Subsequence (LCS) [9] algorithm. Existing tools use LCS (with several speed optimizations) to compare HTML pages at the page level. But for scenarios where the user is only interested in a change to a particular object in a page, using LCS will be computationally expensive. We describe our approach to change detection for HTML pages briefly here. Let $t$ be the object type of interest in a page, $S(A)$ the set of objects of type $t$ extracted from version $V_i$, and $S(B)$ the set of objects extracted from version $V_{i+1}$. Then $S(A) - S(B)$ gives the objects that are absent from $S(B)$, indicating deletion of those objects between versions $V_i$ and $V_{i+1}$. Similarly, $S(B) - S(A)$ gives the new objects that have been inserted into version $V_{i+1}$. The algorithm improves upon this idea by introducing the concept of window-based change detection. We define a set $\{(o_1, c_1), (o_2, c_2), (o_3, c_3), \ldots, (o_n, c_n)\}$, where $o_1, o_2, o_3, \ldots, o_n$ are objects of type $t$ and $c_1, c_2, \ldots, c_n$ are the corresponding numbers of occurrences of each object in a version $V_i$. To detect changes to objects of type $t$ in version $V_i$, we compare the set obtained from $V_i$ with the old set obtained from version $V_{i-1}$. An increase or decrease in the number of instances of an object is taken as an insert or delete, respectively.
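The count-based comparison can be sketched with Python's `collections.Counter`, which plays the role of the set of (object, count) pairs. This is only an illustration of the idea, not the CH-DIFF implementation; `diff_objects` is a hypothetical name, and object extraction is assumed to have been done already.

```python
from collections import Counter


def diff_objects(old_objects: list[str],
                 new_objects: list[str]) -> tuple[Counter, Counter]:
    """Compare occurrence counts of extracted objects between two versions.

    Returns (inserted, deleted); each maps an object to the size of the
    increase or decrease in its number of occurrences.
    """
    old_counts = Counter(old_objects)
    new_counts = Counter(new_objects)
    inserted = new_counts - old_counts  # keeps only counts that increased
    deleted = old_counts - new_counts   # keeps only counts that decreased
    return inserted, deleted
```

For links `["a.html", "b.html", "b.html"]` in the old version and `["b.html", "c.html"]` in the new one, `c.html` is reported as inserted once, and `a.html` and one occurrence of `b.html` are reported as deleted.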

CX-DIFF [4, 5] detects customized changes on XML documents. According to the definition of XML in [10], the text nodes are ordered but the attributes of an element are considered unordered. The changes detected are word-based changes (including phrases) on the content of the page. In an XML tree, the leaf nodes represent the content; hence, changes to the leaf nodes are of interest. The changes are detected by identifying the change operations that transform a tree $T_1$ into a tree $T_2$. For customized change detection based on user intent, extraction of the objects of interest, such as keywords and phrases, is necessary to detect changes to a page. A signature is computed for each extracted leaf node. To detect change operations between given trees $T_1$ and $T_2$, the unique inserts/deletes are filtered, and matching nodes and signatures are extracted. The common ordered subsequence of the extracted matching nodes is used to detect moves and inserts/deletes of duplicate nodes. The algorithm consists of several steps: i) object extraction and signature computation, ii) filtering of unique inserts/deletes, and iii) finding the common ordered subsequence between the leaf nodes of the given trees. An optimization is also proposed to reduce the computational time for detecting changes.
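The three steps can be illustrated on flat lists of leaf-node text. The sketch below is a simplification, not CX-DIFF itself: it ignores the tree structure, uses SHA-1 as the signature (an assumption), and uses the textbook dynamic program for the longest common subsequence. All function names are hypothetical.

```python
import hashlib


def signature(text: str) -> str:
    """Step i: signature of a leaf node's content (SHA-1 is an assumption)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def lcs(a: list[str], b: list[str]) -> list[str]:
    """Longest common subsequence via the standard dynamic program."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def leaf_changes(old_leaves: list[str], new_leaves: list[str]) -> dict[str, list[str]]:
    old_sigs = {signature(t) for t in old_leaves}
    new_sigs = {signature(t) for t in new_leaves}
    # Step ii: leaves whose signature occurs in only one of the two versions.
    deleted = [t for t in old_leaves if signature(t) not in new_sigs]
    inserted = [t for t in new_leaves if signature(t) not in old_sigs]
    # Step iii: order-preserving match among the remaining (matching) leaves.
    old_common = [t for t in old_leaves if signature(t) in new_sigs]
    new_common = [t for t in new_leaves if signature(t) in old_sigs]
    in_order = lcs(old_common, new_common)
    # Matching leaves left out of the subsequence are candidates for a move
    # or for an insert/delete of a duplicate node.
    out_of_order = list(new_common)
    for t in in_order:
        out_of_order.remove(t)
    return {"inserted": inserted, "deleted": deleted, "out_of_order": out_of_order}
```

Filtering the unique inserts and deletes first keeps the sequences passed to the quadratic LCS step short, which is the motivation for ordering the steps this way.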

6. Types of Pages Handled

Monitoring of multiple web pages [8] for changes can be classified into:

Assorted Monitoring: Users may want to monitor multiple distinct web pages for changes using a single request. For example, a user might be interested in knowing about changes to links on www.cnn.com and changes to the keyword “rangers” on www.espn.com. The user might want to be notified only when both pages are updated with the corresponding changes. These requests correspond to monitoring multiple web pages for different (or the same) change types.

Linked Monitoring: Information present on a web page can be categorized into different sections, and each section can be placed on a different page. In such cases, all the different pages are linked from one main page, and users might be interested in monitoring the actual web pages pointed to by these links. For example, for an online web site such as http://javaforums.com, there will be many different sections on different web pages, but all the sections will be linked from the home page. A user might be interested in monitoring posts made in different sections of such a web site. In this case, the base page can be assumed to be at depth 0 and all the pages linked from it to be at depth 1. Web pages containing frames are a special case of this, where the content is usually located at depth 1. Therefore, requests on web pages with frames are converted to requests for linked monitoring with a depth of 1, so that the base page and all the pages pointed to by the frames can be monitored. Depth is a user-provided parameter and is typically limited to a small value in order not to monitor an entire web site. The change detection graph (CDG) for these requests can be constructed using either of two approaches: a) binary OR nodes or b) extended OR (EOR) nodes. Although an EOR can be simulated using multiple binary OR operators, the efficiency will suffer. Hence an EOR operator, which is an n-ary OR operator [8], has been introduced and implemented in WebVigiL.
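The depth-limited expansion used in linked monitoring can be sketched as a breadth-first traversal. In this illustration the `links` dictionary is a hypothetical in-memory stand-in for the links extracted from fetched pages, and `pages_to_monitor` is not a WebVigiL function.

```python
from collections import deque


def pages_to_monitor(base_url: str, links: dict[str, list[str]],
                     max_depth: int) -> dict[str, int]:
    """Return each page to monitor with its depth; the base page is at depth 0."""
    depth = {base_url: 0}
    queue = deque([base_url])
    while queue:
        url = queue.popleft()
        if depth[url] == max_depth:
            continue  # do not follow links beyond the requested depth
        for target in links.get(url, []):
            if target not in depth:
                depth[target] = depth[url] + 1
                queue.append(target)
    return depth
```

With `max_depth=1` this yields the base page and the pages it links to directly, which corresponds to the treatment of framed pages described above.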

7. Version Management

Deletion or purging [11] of pages that are no longer required is extremely important. When the number of versions for a particular URL exceeds a specified number, deletion is triggered. The number mainly depends on the Compare Option (indicating the versions used for comparison by several sentinels on the same page) and the number of changes that need to be kept for an interactive notification.

If the number of versions for a URL exceeds the product of Deletion Count (the number of versions of the page that might be needed in the future, considering the compare options) and either Change History (the number of changes that the user wants the system to persist) or Max Change History (the maximum number of changes associated with the page, regardless of any sentinel, that have to be stored), there is a possibility that some versions are no longer required and can be purged. Deletion is triggered at n × Deletion Count × (Max Change History or Change History), where n is a positive integer. We choose Max Change History for a Best Effort sentinel and Change History for a Fixed Interval sentinel.
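One possible reading of the trigger condition is sketched below: deletion is attempted whenever the version count reaches a positive multiple of the product. The function names and the use of a modulo test are assumptions for illustration; the text does not specify how the check is implemented.

```python
def deletion_threshold(deletion_count: int, change_history: int,
                       max_change_history: int, best_effort: bool) -> int:
    """Product that the version count of a URL is compared against."""
    # Max Change History for a Best Effort sentinel, Change History otherwise.
    history = max_change_history if best_effort else change_history
    return deletion_count * history


def deletion_triggered(version_count: int, threshold: int) -> bool:
    """True when the version count equals n * threshold for a positive integer n."""
    return threshold > 0 and version_count > 0 and version_count % threshold == 0
```

For a Fixed Interval sentinel with a Deletion Count of 3 and a Change History of 4, the threshold is 12, so under this reading deletion is attempted at 12, 24, 36, ... stored versions.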

In addition, storage of pages needs to be optimized as well. A diff-based storage technique is used to reduce the amount of space used by pages that are required by active sentinels [12]. The deletion algorithm becomes complicated with diff-based storage, as one needs to keep track of the ability to reconstruct pages from differences. WebVigiL uses the GNU diff utility and has customized it for change detection.

8. Buffer Management

Change detection requires that pages be converted into main memory objects. Reuse of main memory objects requires a buffer management policy that is appropriate for WebVigiL's needs. Our analysis indicates that traditional replacement policies such as MRU, LRU, LFU, MFU, Clock, and FIFO are not appropriate [12] for change detection. A replacement strategy that estimates the utility of a version in memory requires maintaining two factors. The utility factor is the primary factor considered in deciding the usefulness of a version in the system, and the frequency factor is the secondary factor, which is considered for objects having the same utility factor. The runtime object is stored in the cache only if its utility factor is greater than 0. If the cache is full, then the new object is stored only if it has a better utility than all the objects that are already in the version cache. But there can be situations where more than one object in the cache has the same utility factor. In such cases, the frequency factor resolves the relative utility of the objects with the same utility factor.
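The admission rule can be sketched as a small cache class. The sketch follows the text literally for admission (utility greater than 0, and better than every cached object when the cache is full). The choice of victim (lowest utility, ties broken by lowest frequency) and the way the two factors are computed are not given in the text and are assumptions here; `CachedVersion` and `VersionCache` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CachedVersion:
    key: str        # e.g. URL plus version number (hypothetical identifier)
    utility: int    # primary factor
    frequency: int  # secondary factor, used to break utility ties


class VersionCache:
    """Utility-based cache; capacity is assumed to be at least 1."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: dict[str, CachedVersion] = {}

    def offer(self, item: CachedVersion) -> bool:
        """Try to cache item; return True if it was stored."""
        if item.utility <= 0:
            return False  # only objects with utility greater than 0 are cached
        if len(self.items) < self.capacity:
            self.items[item.key] = item
            return True
        # Cache full: admit only if the new object beats every cached object.
        if any(item.utility <= cached.utility for cached in self.items.values()):
            return False
        # Assumed victim choice: lowest utility, ties broken by lowest frequency.
        victim = min(self.items.values(), key=lambda v: (v.utility, v.frequency))
        del self.items[victim.key]
        self.items[item.key] = item
        return True
```

In a cache of capacity 2 holding two versions of utility 2, a third version of utility 2 is rejected, while a version of utility 3 replaces the cached version with the lower frequency.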

9. Presentation and Notification

The presentation contains the following basic information: a) the URL for which the changes have been detected, b) the type of change (e.g., keywords, phrases), c) the notification time, and d) a URL for viewing the type of presentation (change-only or dual-frame). Two schemes are used for presenting changes in HTML and XML pages.

Change-Only Approach: In this approach only the changes are presented. The traditional method is a tabular structure with the types of changes (insert/delete/move) as different columns of the table.

Dual-Frame Approach: In this approach we display both documents side by side to highlight the changes. Since XML does not contain presentation tags, we can use customized style sheets, or embed the XML in HTML as data islands, in order to highlight or strike out the changed contents.

9.1 Notification of changes

Users have to be notified of the changes detected by the system on the pages of interest. One of the most important criteria for notification is the frequency with which users want to be notified. The options are: Immediate: the user is notified as soon as changes occur on pages of interest; Best-effort: the user is notified as soon as the page changes; Interval-based: the user is notified with a user-specified periodicity; Interactive: notification is a navigational style of retrieval, where users visit the WebVigiL dashboard to retrieve the detected changes at their convenience. As notifications need to be delivered asynchronously with the detection of changes, a triggering mechanism based on the ECA paradigm is used in WebVigiL.

10. Related Work

AIDE (AT&T Internet Difference Engine) [13] uses HTMLdiff, which uses the weighted LCS algorithm [9]. This approach may be computationally expensive, as each sentence may need to be compared with all sentences in the document even if the user is interested in a change to a particular phrase. ChangeDetect [14] detects all the changes and does not differentiate between specific changes such as links, keywords, and phrases. It also lacks support for specifying composite changes on a page. WebCQ [15] detects changes between the last two versions of an HTML page, whereas WebVigiL provides the option of comparing the page with relatively older versions. WYSIGOT [16] is a commercial application that detects changes between HTML pages. This system has to be installed on the local machine, and the granularity of change detection is at the page level.

RSS [17] stands for ‘Rich Site Summary’; it is an XML-based specification format that is used to syndicate news sites, personal web logs, etc. The RSS file is updated whenever there is a change to the published information on that page. To keep track of these changes, a reader application is required that periodically checks the published RSS file and notifies the user of changes. RSS provides only a basic framework for change presentation, and the complexity and usability of the published changes depend on the RSS reader [18], whereas WebVigiL presents the changes as per user specification.

11. Conclusions and Future Work

WebVigiL is a change monitoring system for the web that supports monitoring of single or multiple HTML/XML pages for different change types. It provides mechanisms to efficiently fetch required pages with either a ‘user-specified’ time interval or an interval predicted by a learning-based algorithm. Customized changes to single or multiple web pages (including frames) are supported. The first version of the WebVigiL system incorporating the features discussed in this paper can be accessed from http://berlin.uta.edu:8081/webvigil. The system is being extended to distribute sentinels over several servers for load balancing.

12. Acknowledgements

The authors would like to acknowledge the contributions of J. Jacob, N. Pandrangi, A. Sanka, A. Sachde, A. Eppili, and S. Chamakura, who worked on the WebVigiL system.

References

[1] P. Deolasee, et al., “Adaptive Push-Pull: Disseminating Dynamic Web Data,” in Proc. of WWW, Hong Kong, China, 2001, pp. 265–274.

[2] N. Pandrangi, “WebVigiL: Adaptive Fetching and User-profile based Change Detection of HTML Pages,” Master’s thesis, The University of Texas at Arlington, 2003. [Online]. Available: http://itlab.uta.edu/ITLABWEB/Students/sharma/theses/naveen.pdf

[3] N. Pandrangi, et al., “WebVigiL: User-Profile Based Change Detection for HTML/XML Documents,” in Proc. of BNCOD, UK, 2003, pp. 38–55.

[4] J. Jacob, “WebVigiL: Sentinel Specification and User-Intent Based Change Detection for XML,” Master’s thesis, The University of Texas at Arlington, 2003. [Online]. Available: http://itlab.uta.edu/ITLABWEB/Students/sharma/theses/Jyoti.pdf

[5] J. Jacob, A. Sachde, and S. Chakravarthy, “CX-DIFF: A Change Detection Algorithm for XML Content and Change Presentation Issues For WebVigiL,” DKE, vol. 25, no. 52, pp. 209–230, 2005.

[6] S. Chakravarthy, et al., “A Learning-Based Approach for Fetching Pages in WebVigiL,” in Proc. of ACM SAC, Mar. 2004, pp. 1725–1731.

[7] S. Chakravarthy and D. Mishra, “Snoop: An Expressive Event Specification Language for Active Databases,” DKE, vol. 14, no. 10, pp. 1–26, Oct. 1994.

[8] S. Chamakura, “Improvements to Change Detection and Fetching to Handle Multiple URLs in WebVigiL,” Master’s thesis, The University of Texas at Arlington, 2004. [Online]. Available: http://itlab.uta.edu/ITLABWEB/Students/sharma/theses/Shravan.pdf

[9] D. Hirschberg, “Algorithms for the longest common subsequence problem,” Journal of the ACM, vol. 24, no. 4, pp. 664–675, 1977.

[10] B. Nguyen, et al., “Monitoring XML data on the Web,” in Proc. of SIGMOD, 2001, pp. 437–448.

[11] A. Sachde, “Persistence, Notification, and Presentation of Changes in WebVigiL,” Master’s thesis, The University of Texas at Arlington, 2004. [Online]. Available: http://itlab.uta.edu/ITLABWEB/Students/sharma/theses/sachde.pdf

[12] A. Eppili, “Approaches to Improve the Performance of Storage And Processing Subsystems in WebVigiL,” Master’s thesis, The University of Texas at Arlington, 2004. [Online]. Available: http://itlab.uta.edu/ITLABWEB/Students/sharma/theses/ajay.pdf

[13] F. Douglis, et al., “The AT&T Internet Difference Engine: Tracking and Viewing Changes on the Web,” WWW, vol. 1, no. 1, pp. 27–44, Jan. 1998.

[14] ChangeDetect, “http://changedetect.com/.”

[15] L. Liu, C. Pu, and T. Wei, “WebCQ: Detecting and Delivering Information Changes on the Web,” in Proc. of CIKM, Washington D.C., 2000, pp. 512–519.

[16] Wysigot, “http://www.wysigot.com.”

[17] RSS, “http://www.xml.com/pub/a/2002/12/18/dive-into-xml.html.”

[18] RSS-Reader, “http://www.bradsoft.com/feeddemon/.”
