

A Fragment-Based Approach for Efficiently Creating Dynamic Web Content

JIM CHALLENGER, PAUL DANTZIG, ARUN IYENGAR, and KAREN WITTING
IBM Research

This article presents a publishing system for efficiently creating dynamic Web content. Complex Web pages are constructed from simpler fragments. Fragments may recursively embed other fragments. Relationships between Web pages and fragments are represented by object dependence graphs. We present algorithms for efficiently detecting and updating Web pages affected after one or more fragments change. We also present algorithms for publishing sets of Web pages consistently; different algorithms are used depending upon the consistency requirements.

Our publishing system provides an easy method for Web site designers to specify and modify inclusion relationships among Web pages and fragments. Users can update content on multiple Web pages by modifying a template. The system then automatically updates all Web pages affected by the change. Our system accommodates both content that must be proofread before publication, which typically comes from humans, and content that has to be published immediately, which typically comes from automated feeds.

We discuss some of our experiences with real deployments of our system as well as its performance. We also quantitatively present characteristics of fragments used at a major deployment of our publishing system including fragment sizes, update frequencies, and inclusion relationships.

Categories and Subject Descriptors: H.4.0 [Information Systems Applications]: General

General Terms: Design, Performance

Additional Key Words and Phrases: Caching, dynamic content, fragments, publishing, Web, Web performance

1. INTRODUCTION

Many Web sites need to provide dynamic content. Examples include sport sites, stock market sites, and virtual stores or auction sites where information on available products is constantly changing.

Based on: A Publishing System for Efficiently Creating Dynamic Web Content, by Jim Challenger, Arun Iyengar, Karen Witting, Cameron Ferstat, and Paul Reed, which appeared in Proceedings of INFOCOM 2000, © 2000 IEEE.
Authors' address: IBM T. J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598; email: {challngr,PaulDantzig,aruni,witting}@us.ibm.com.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or [email protected].
© 2005 ACM 1533-5399/05/0500-0359 $5.00

ACM Transactions on Internet Technology, Vol. 5, No. 2, May 2005, Pages 359–389.


There are several problems with providing dynamic data to clients efficiently and consistently. A key problem with dynamic data is that it can be expensive to create; a typical dynamic page may require several orders of magnitude more CPU time to serve than a typical static page of comparable size. The overhead for dynamic data is a major problem for Web sites that receive substantial request volumes. Significant hardware may be needed for such sites.

A key requirement for many Web sites providing dynamic data is to completely and consistently update pages that have changed. In other words, if a change to underlying data affects multiple pages, all such pages should be correctly updated. In addition, a bundle of several changed pages may have to be made visible to clients at the same time. For example, publishing pages in bundles instead of individually may prevent situations where a client views a first page, clicks on a hypertext link to view a second page, and sees information on the second page that is older and not consistent with the information on the first page.

Depending upon the way in which dynamic data are being served, achieving complete and consistent updates can be difficult or inefficient. Many Web sites cache dynamic data in memory or a file system in order to reduce the overhead of recalculating Web pages every time they are requested [Iyengar and Challenger 1997]. In these systems, it is often difficult to identify which cached pages are affected by a change to underlying data that modifies several dynamic Web pages. In making sure that all obsolete data are invalidated, deleting some current data from cache may be unavoidable. Consequently, cache miss rates after an update may be high, adversely affecting performance. In addition, multiple cache invalidations from a single update must be made consistently.

This article presents a system for efficiently and consistently publishing dynamic Web content. In order to reduce the overhead of generating dynamic pages from scratch, our system composes dynamic pages from simpler entities known as fragments. Fragments typically represent parts of Web pages that change together; when a change to underlying data occurs that affects several Web pages, the fragments affected by the change can easily be identified. It is possible for a fragment to recursively embed another fragment.

Our system provides a user-friendly method for managing complex Web pages composed of fragments. Users specify how Web pages are composed from fragments by creating templates in a markup language. Templates are parsed to determine inclusion relationships among fragments and Web pages. These inclusion relationships are represented by a graph known as an object dependence graph (ODG). Graph traversal algorithms are applied to ODGs in order to determine how changes should be propagated throughout the Web site after one or more fragments change.

Our system allows multiple independent authors to provide content as well as multiple independent proofreaders to approve some pages for publication and reject others. Publication may proceed in multiple stages in which a set of pages must be approved in one stage before it is passed to the next. Our system can also include a link checker, which verifies that a Web page has no broken hypertext links at the time the page is published. It is also scalable to handle high request rates.


In addition to describing the architecture of our fragment-based Web publishing system in detail, this article also presents the first empirical study of a large scale real deployment of a fragment-based Web publishing system that we are aware of. We present object size distributions, update frequency distributions, and characteristics of the fragment inclusion relationships. These statistics can be used to architect efficient Web publishing systems. For example, we have used fragment size distributions to optimize the manner in which fragments are stored on disk [Iyengar et al. 2001].

The remainder of the article is organized as follows. Section 2 discusses our methodology for constructing Web pages from fragments. Section 3 describes the architecture of our system in detail. Section 4 describes the performance of our system and presents statistics obtained from deploying our system at real Web sites. Section 5 discusses related work. Finally, Section 6 summarizes our main results and conclusions.

2. CONSTRUCTING WEB PAGES FROM FRAGMENTS

2.1 Overview

A key feature of our system is that it composes complex Web pages from simpler fragments (Figure 1). A page is a complete entity that may be served to a client. We say that a fragment or page is atomic if it doesn't include any other fragments and complex if it includes other fragments. An object is either a page or a fragment.

Our approach is efficient because the overhead for composing an object from simpler fragments is usually minor. By contrast, the overhead for constructing the object from scratch as an atomic fragment is generally much higher. Using the fragment approach, it is possible to achieve significant performance improvements without caching dynamic pages and dealing with the difficulties of keeping caches consistent. For optimal performance, our system has the ability to cache dynamic pages. Caching capabilities are integrated with fragment management.

The fragment-based approach for generating Web pages makes it easier to design Web sites in addition to improving performance. It is easy to design a set of Web pages with a common look and feel. It is also easy to embed common information into several Web pages. Sets of Web pages containing similar information can be managed together. For example, it is easy to update common information represented by a single fragment but embedded within multiple pages; in order to update the common information everywhere, only the fragment needs to be changed.

By contrast, if the Web pages are stored statically in a file system, identifying and updating all pages affected by a change can be difficult. Once all changed pages have been identified, care must be taken to update all changed pages in order to preserve consistency.

Dynamic Web pages that embed fragments are implicitly updated any time an embedded fragment changes, so consistency is automatically achieved. Consistency becomes an issue with the fragment-based approach when the pages are being published to a cache or file system. Our system provides several different methods for consistently publishing Web pages in these situations; each method provides a different level of consistency.

Fragments also provide a mechanism by which remote caches can store some parts of dynamic and personalized pages. The remote cache stores the static parts of a page. When a page is requested, the cache requests the dynamic or personalized fragments of the page from the server. Typically, this would only constitute a small fraction of the page. The cache then composes and serves the composite page.

HTML contains an OBJECT tag which allows an arbitrary program object to be embedded in an HTML page. While the OBJECT tag has the potential to achieve fragment composition at the client, a major drawback is that major browsers don't support the tag properly. Therefore, we couldn't rely on the OBJECT tag for our Web sites.

One of the key things our publishing system enables is separation of the creative process from the mechanical process of building a Web site. Previously, the content, look, and feel of large sites we were involved with had to be carefully planned well in advance of the creation of the first page. Changes to the original plans were quite difficult to execute, even in the best of circumstances. Last-minute changes tended to be impossible, resulting in a choice between delayed or flawed site publication.

With our publishing system, the entire look and feel of a site can be changed and republished within minutes. Aside from the cost savings, this has allowed tremendous creativity on the part of designers. Entire site designs can be created, experimented with, changed, discarded, and replaced several times a day during the construction of the site. This can take place in parallel with and independently of the creation of site content.

A specific example of this was demonstrated just before a new site look for the 2000 Sydney Olympic Games Web site was made public. One day before the site was to go live before the public, it was decided that the search facility was not working sufficiently well and must be removed. This change affected thousands of pages, and would previously have delayed publication of the site by as much as several days. Using our system, the site authors simply removed the search button from the appropriate fragment and republished the fragment. Ten minutes later, the change was complete, every page had been rebuilt, and the site went live on schedule.

2.2 Object Dependence Graphs

When pages are constructed from fragments, it is important to construct a fragment f1 before any object containing f1 is constructed. In order to construct objects in an efficient order, our system represents relationships between fragments and Web pages by graphs known as object dependence graphs (ODGs) (Figures 1 and 2).

Object dependence graphs may have several different edge types. An inclusion edge indicates that an object embeds a fragment. A link edge indicates that an object contains a hypertext link to another object.


Fig. 1. A set of Web pages containing fragments.

Fig. 2. The object dependence graph (ODG) corresponding to Figure 1.

In the ODG in Figure 2, all but one of the edges are inclusion edges. For example, the edge from f4 to P1 indicates that P1 contains f4; thus, when f4 changes, f4 should be updated before P1 is updated. The graph resulting from only inclusion edges is a directed acyclic graph.

The edge from P3 to P2 is a link edge, which indicates that P2 contains a hypertext link to P3. A key reason for maintaining link edges is to prevent dangling or inconsistent hypertext links. In this example, the link edge from P3 to P2 indicates that publishing P2 before P3 will result in a broken hypertext link. Similarly, when both P2 and P3 change, publishing a current version of P2 before publishing a current version of P3 could present inconsistent information to clients who view an updated version of P2, click on the hypertext link to an outdated version of P3, and then see information that is obsolete relative to the referring page. Link edges can form cycles within an ODG. This would occur, for example, if two pages both contain hypertext links to each other.
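The edge conventions above can be made concrete with a small data structure. The following is a minimal sketch, not the system's actual representation: the class name ODG, its methods, and the edge-type constants are hypothetical names chosen for illustration. Edges follow the direction used in the text, from an embedded fragment to its container and from a link target to the page that links to it.

```python
from collections import defaultdict

# Hypothetical edge-type labels; the text names these two edge types.
INCLUSION = "inclusion"
LINK = "link"


class ODG:
    """Illustrative object dependence graph with typed, directed edges."""

    def __init__(self):
        # edges[edge_type][src] -> set of destination vertices
        self.edges = defaultdict(lambda: defaultdict(set))
        self.vertices = set()

    def add_edge(self, src, dst, edge_type):
        """Record an edge src -> dst of the given type."""
        self.vertices.update((src, dst))
        self.edges[edge_type][src].add(dst)

    def successors(self, vertex, edge_type):
        """Return the vertices reached from `vertex` by one edge of `edge_type`."""
        return set(self.edges[edge_type].get(vertex, ()))


odg = ODG()
odg.add_edge("f4", "P1", INCLUSION)  # P1 embeds f4, so f4 is built first
odg.add_edge("P3", "P2", LINK)       # P2 links to P3, so P3 is published first
```

Keeping one adjacency map per edge type lets a traversal follow only inclusion edges or only link edges, which is how the algorithms in this section use the graph.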

There are two methods for creating and modifying ODGs. Using one approach, users specify how Web pages are composed from fragments by creating templates in a markup language. Templates are parsed to determine inclusion relationships among fragments and Web pages. Using the second approach, a program may directly manipulate edges and vertices of an ODG by using an API. We have deployed systems using both approaches. The first approach is easier for Web page designers to use but incurs higher overheads.

Our system allows an arbitrary number of edge types to exist in ODGs. So far, we have only found practical use for inclusion and link edges. We suspect that there may be other types of important relationships that can be represented by other edge types.

When our system becomes aware of changes to a set S of one or more objects, it does a depth-first graph traversal using topological sort [Cormen et al. 1990] to determine all vertices reachable from S by following inclusion edges. The topological sort orders vertices such that whenever there is an edge from a vertex v to another vertex u, v appears before u in the topological sort. For example, a valid topological sort of the graph in Figure 2 after P3, f4, and f2 change would be P3, f4, f2, f5, P2, f1, f3, and P1. This topological sort ignores link edges.

Objects are updated in an order consistent with the topological sort. Our system updates objects in parallel when possible. In the previous example, P3, f4, and f2 can be updated in parallel. After f2 is updated, f1 and f5 may be updated in parallel. A number of other objects may be constructed in parallel in a manner consistent with the inclusion edges of the ODG.
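The update ordering can be sketched as follows. This is an illustration under stated assumptions rather than the system's code: it uses Kahn's algorithm (repeatedly removing vertices with no unprocessed predecessors) instead of the depth-first formulation cited above, because it groups objects into levels that can be rebuilt in parallel. The function name and the example graph are hypothetical; the graph is not a reproduction of Figure 2.

```python
def affected_in_topological_order(inclusion, changed):
    """Return objects reachable from `changed` via inclusion edges, grouped
    into levels; objects in the same level can be rebuilt in parallel.

    inclusion[f] lists the objects that directly embed f.
    """
    # 1. Collect everything reachable from the changed set.
    reachable, stack = set(changed), list(changed)
    while stack:
        v = stack.pop()
        for u in inclusion.get(v, ()):
            if u not in reachable:
                reachable.add(u)
                stack.append(u)
    # 2. Count incoming inclusion edges within the affected subgraph.
    indegree = {v: 0 for v in reachable}
    for v in reachable:
        for u in inclusion.get(v, ()):
            indegree[u] += 1
    # 3. Kahn's algorithm, peeling off one level at a time.
    levels = []
    ready = sorted(v for v in reachable if indegree[v] == 0)
    while ready:
        levels.append(ready)
        nxt = []
        for v in ready:
            for u in inclusion.get(v, ()):
                indegree[u] -= 1
                if indegree[u] == 0:
                    nxt.append(u)
        ready = sorted(nxt)
    return levels


# Hypothetical inclusion edges (fragment -> objects embedding it).
example_inclusion = {
    "f2": ["f1", "f5"],
    "f1": ["f3"],
    "f3": ["P1"],
    "f4": ["P1"],
    "f5": ["P2"],
}
example_levels = affected_in_topological_order(example_inclusion, {"P3", "f4", "f2"})
```

Concatenating the levels gives one valid topological order; any object appears only after every changed fragment it embeds.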

After a set of pages, U, has been updated (or generated for the first time), the pages in U are published so that they can be viewed by clients. In some cases, the pages are published to file systems. In other cases, they are published to caches. Pages may be published either locally on the system generating them or to a remote system. It is often a requirement for a set of multiple pages to be published consistently. Consistency can be guaranteed by publishing all changed (or newly generated) pages in a single atomic action. One potential drawback to this method of publication is that the publication process may be relatively long. For example, pages may have to be proofread before publication. If everything is published together in a single atomic action, there can be considerable delay before any information is made available.

Therefore, incremental publication, wherein information is published in stages instead of together, is often desirable. The disadvantage to incremental publication is that consistency guarantees are not as strong. Our system provides three different methods for incremental publication, each providing different levels of consistency.

The first incremental publishing method guarantees that a freshly published page will not contain a hypertext link to either an obsolete or unpublished page. This consistency guarantee applies to pages reached by following several hypertext links. More specifically, if P1 and P2 are two pages in U, if a client views an updated version of P1 and follows one or more hypertext links to view P2, then the client is guaranteed to see a version of P2 that is not obsolete with respect to the version of P1 that the client viewed (a version of P2 is obsolete with respect to a version of P1 if the version of P2 was outdated at the time the version of P1 became current, regardless of whether P1 or P2 have any fragments in common).

For example, consider the Web pages in Figure 3. A client can access P3 by starting at P1, following a hypertext link to P2 and then following a second hypertext link to P3. Suppose that both P1 and P3 change. The first incremental publishing method guarantees that the new version of P1 will not be published before the new version of P3, regardless of whether P2 has changed.


Fig. 3. A set of Web pages connected by hypertext links.

Fig. 4. A set of Web pages containing common fragments.

This incremental publishing method is implemented by first determining the set R of all pages that can be reached by following hypertext links from a page in U. R includes all pages of U; it may also include previously published pages that haven't changed. R is determined by traversing link edges in reverse order starting from pages in U.

Let K be the subgraph of the ODG consisting of all nodes in R and link edges in the ODG connecting nodes in R. The strongly connected components of K are determined. A strongly connected component of a directed graph is a maximal subset of vertices S such that every vertex in S has a directed path to every other vertex in S. A good algorithm for finding strongly connected components in directed graphs is contained in Cormen et al. [1990].

A graph Kscc is constructed in which each vertex represents a different strongly connected component of K and there is an edge between two nodes (S, T) of Kscc if and only if there exist nodes s ∈ S and t ∈ T such that (s, t) is an edge in K.

Kscc is then topologically sorted and examined in an order consistent with the topological sort to locate pages in U. Each time a page in U for which the updated version hasn't been published yet is examined, the page is published together with all other pages in U belonging to the same strongly connected component. Each set of pages that are published together in an atomic action is known as a bundle.
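The procedure above can be sketched compactly. The sketch below is an assumption-laden illustration, not the deployed implementation: it uses Tarjan's algorithm, which emits strongly connected components in reverse topological order of the component graph, so the component and ordering steps are combined rather than building Kscc explicitly. The input maps each page to the pages it links to, which is the reverse of the ODG link-edge direction; starting the search from pages in U restricts it to R. The function name is hypothetical.

```python
def link_consistent_bundles(links, updated):
    """Return bundles of updated pages in a safe publication order.

    links[p] lists the pages that p contains hypertext links to.
    updated is the set U of changed or newly generated pages.
    """
    index, low, on_stack, stack = {}, {}, set(), []
    bundles = []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in links.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of a strongly connected component: pop it.
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            # Only pages in U are published; unchanged pages are skipped.
            bundle = sorted(p for p in component if p in updated)
            if bundle:
                bundles.append(bundle)

    for page in sorted(updated):
        if page not in index:
            visit(page)
    return bundles


# The chain described for Figure 3: P1 links to P2, which links to P3.
chain_bundles = link_consistent_bundles({"P1": ["P2"], "P2": ["P3"]}, {"P1", "P3"})
# Two pages linking to each other must be published together.
cycle_bundles = link_consistent_bundles({"A": ["B"], "B": ["A"]}, {"A", "B"})
```

Because a component is emitted only after every component it links to, each bundle is published after the bundles containing its link targets. The recursive formulation is adequate for small examples; a large site would need an iterative version.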

The second incremental publishing method guarantees that any two pages in U that both contain a common changed fragment are published in the same bundle. For example, consider the Web pages in Figure 4. Suppose that both f1 and f2 change. Since P1 and P3 both embed f1, their updated versions must be published together. Since P2 and P3 both embed f2, their updated versions must also be published together. Thus, updated versions of all three Web pages must be published together. Note that updated versions of P1 and P2 must be published together, even though the two pages don't embed a common fragment.

In order to implement this approach, the set of all changed fragments contained within each changed object d1 is determined. We call this set the changed fragment set for d1 and denote it by C(d1). All changed objects are constructed in topological sorting order. When a changed object d1 is constructed, C(d1) is calculated as the union of f2 and C(f2) for each changed fragment f2 such that an inclusion edge (f2, d1) exists in the ODG.


Fig. 5. Another set of related Web pages.

After all changed fragment sets have been determined, an undirected bipartite graph D is constructed in which the vertices of D are pages in U on one side and changed fragments contained within pages of U on the other side. For each page P ∈ U, an undirected edge (f, P) is created for each f ∈ C(P). D is examined to determine its connected components (two vertices are part of the same connected component if and only if there is a path between the vertices in the graph). All pages in U belonging to the same connected component are published in the same bundle.
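The two steps of the second method can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the changed fragment sets are computed from a map of each object to the fragments it directly embeds, and the connected components of D are found with a union-find structure, which is one standard way to do so and not necessarily what the system uses.

```python
def changed_fragment_sets(topo_order, embeds, changed_fragments):
    """Compute C(d) for each changed object d.

    topo_order lists changed objects with fragments before their containers.
    embeds[d] lists the fragments directly embedded in d.
    """
    C = {}
    for d in topo_order:
        C[d] = set()
        for f in embeds.get(d, ()):
            if f in changed_fragments:
                C[d] |= {f} | C.get(f, set())
    return C


def fragment_consistent_bundles(page_sets):
    """Group pages whose changed fragment sets are connected in D.

    page_sets[P] is C(P) for each page P in U.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for page, fragments in page_sets.items():
        find(("page", page))
        for f in fragments:
            # Edge (f, P) of the bipartite graph D: merge the two components.
            parent[find(("page", page))] = find(("frag", f))

    groups = {}
    for page in page_sets:
        groups.setdefault(find(("page", page)), []).append(page)
    return sorted(sorted(g) for g in groups.values())


# The situation described for Figure 4: P1 and P3 embed f1, P2 and P3 embed f2.
fig4_embeds = {"P1": ["f1"], "P2": ["f2"], "P3": ["f1", "f2"]}
fig4_C = changed_fragment_sets(["f1", "f2", "P1", "P2", "P3"], fig4_embeds, {"f1", "f2"})
fig4_bundles = fragment_consistent_bundles({p: fig4_C[p] for p in ("P1", "P2", "P3")})
```

In the example, P1 and P2 share no fragment but both are connected to P3 through D, so all three pages land in one bundle.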

The third incremental publishing method satisfies the consistency guarantees of both the first and second methods. In other words,

(1) A freshly published page will not contain a hypertext link to either an obsolete or unpublished page. More specifically, if P1 and P2 are two pages in U, if a client views an updated version of P1 and follows one or more hypertext links to view P2, then the client is guaranteed to see a version of P2 that is not obsolete with respect to the version of P1 that the client viewed.

(2) Any two changed pages that both contain a common changed fragment are published together.

This method generally results in publishing fewer bundles but of larger sizes than the first two approaches.

For example, consider the Web pages in Figure 5. Suppose that both P1 and f1 change. Updated versions of P2 and P3 must be published together because they both embed f1. Since P1 contains a hypertext link to P3, the updated version of P1 cannot be published before the bundle containing updated versions of P2 and P3.

If, instead, the first incremental publishing method were used to publish the Web pages in Figure 5, the updated version of P1 could not be published before the updated version of P3. However, the updated version of P2 would not have to be published in the same bundle as the updated version of P3. If the second incremental publishing method were used, updated versions of both P2 and P3 would have to be published together in the same bundle. However, publication of the updated version of P1 would be allowed to precede publication of the bundle containing updated versions of P2 and P3.

The third incremental publishing method is implemented by constructing K as in the first method and D as in the second. Additional edges are then added to K between pages in U to ensure that all Web pages belonging to the same connected component of D belong to the same strongly connected component of K. The same procedure as in the first method is then applied to K to publish pages in bundles.
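The text does not say which edges are added to K. One simple choice, assumed here purely for illustration, is to thread a directed cycle through the pages of each connected component of D, since any directed cycle forces its vertices into a single strongly connected component. The function name and input layout are hypothetical.

```python
def add_bundle_cycles(links, fragment_groups):
    """Return a copy of `links` with a directed cycle added through each group.

    links[p] is the set of pages p points to in K.
    fragment_groups lists the page sets from D's connected components.
    """
    augmented = {page: set(targets) for page, targets in links.items()}
    for group in fragment_groups:
        pages = sorted(group)
        if len(pages) < 2:
            continue  # a single page is already its own component
        # Pair each page with the next one, wrapping around to close the cycle.
        for a, b in zip(pages, pages[1:] + pages[:1]):
            augmented.setdefault(a, set()).add(b)
    return augmented


# The situation described for Figure 5: P1 links to P3; P2 and P3 embed f1.
fig5_links = {"P1": {"P3"}}
fig5_augmented = add_bundle_cycles(fig5_links, [["P2", "P3"]])
```

The augmented graph can then be passed to whatever strongly connected component procedure implements the first method. The added edges are artificial ordering constraints, not real hypertext links.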

Incremental publishing methods can be designed for other consistency requirements as well. For example, consider Figure 3. Suppose that both P1 and P3 change. It may be desirable to publish updated versions of P1 and P3 in the same bundle. This would avoid the following situation, which could occur using the first method.

A client views an old version of P1. After following hypertext links, the client arrives at a new version of P3. The browser's cache is then used to go to the old version of P1. The client reloads P1 in order to obtain a version consistent with P3 but still sees the old version because the new version of P1 has not yet been published.

It is straightforward to implement an incremental publishing method which would publish P1 and P3 in the same bundle using techniques similar to the ones just described.

2.3 Combined Content Pages

Many Web sites contain information that is fed from multiple sources. Some of the information, such as the latest scores from a sporting event, is generated automatically by a computer. Other information, such as news stories, is generated by humans. Both types of information are subject to change. A page containing both human and computer-generated information is known as a combined content page.

A key problem with serving combined content pages is the different rates at which sources produce content. Computer-generated content tends to be produced at a relatively high rate, often as fast as the most sophisticated timing technology permits. Human-generated content is produced at a much lower rate. Thus, it is difficult for humans to keep pace with automated feeds. By the time an editor has finished with a page, the actual results on the page may have changed. If the editor takes time to update the page, the results may have changed yet again.

A requirement for many of the Web sites we have helped design is that computer-generated content should not be delayed by humans. Computer-generated results, such as the latest results from a sporting event, are often extremely important and should be published as soon as possible. If computer-generated results are combined with human-edited content using conventional Web publishing systems, publication of the computer-generated results can be significantly delayed. What is needed is a scheme to combine data feeds of differing speeds so that information arriving at high rates is not unnecessarily delayed.

In order to provide combined content pages, our system divides fragments into two categories. Immediate fragments are fragments that contain vital information, which should be published quickly with minimal proofreading. On the sports Web sites that our system is being used for, the latest results in a sporting event would be published as an immediate fragment. Quality controlled fragments are fragments that don't have to be published as quickly as immediate fragments but have content that must be examined in order to determine whether the fragments are suitable to be published. Background stories on athletes are typically published as quality controlled fragments at the sports sites that use our system. Combined content Web pages consist of a mixture of immediate and quality controlled fragments.

When one or more immediate fragments change, the Web pages affected by the changes are updated and published with minimal proofreading. If both immediate and quality controlled fragments change, the system first performs updates resulting from the immediate fragments and publishes the updated Web pages with only minor proofreading to ensure that the immediate fragments are properly placed in the Web pages; the immediate fragments themselves are not proofread. The system subsequently performs updates resulting from quality controlled fragments and only publishes these updated Web pages after they have been proofread. Multiple versions of a combined content page may be published using this approach. The first version would be the page before any updates. The second version might contain updates to all immediate fragments but not to any quality controlled fragments. The third version might contain updates to all fragments.

It is possible for an update to an immediate fragment f1 to be published before an update to a quality controlled fragment f2 even though f2 changed before f1. This might occur if the changes to f2 are delayed in publication due to proofreading.
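The ordering rule described above can be illustrated with a minimal Python sketch. The names `Change` and `plan_publication` and the pass labels are hypothetical, introduced only for illustration; the sketch simply splits one batch of changes into an immediate pass followed by a quality controlled pass, and is not the system's actual scheduling code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    fragment: str      # name of the fragment that changed
    immediate: bool    # True: immediate fragment, False: quality controlled


def plan_publication(changes):
    """Split a batch of changes into ordered publication passes.

    Immediate fragments always go first, whatever order they arrived in.
    """
    immediate = [c.fragment for c in changes if c.immediate]
    controlled = [c.fragment for c in changes if not c.immediate]
    passes = []
    if immediate:
        # Pages are only checked for correct placement of the fragments.
        passes.append(("placement check only", immediate))
    if controlled:
        # Pages are held back until a human has proofread them.
        passes.append(("full proofreading", controlled))
    return passes


# f2 (quality controlled) changed before f1 (immediate), yet f1 is published first.
batch = [Change("f2", immediate=False), Change("f1", immediate=True)]
for label, fragments in plan_publication(batch):
    print(label, fragments)
```

With both kinds of change in the batch, the two passes correspond to the second and third page versions described above.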

3. SYSTEM ARCHITECTURE

Web pages produced by our system typically consist of multiple fragments. Each fragment may originate from a different source and may be produced at a different rate than other fragments. Fragments may be nested, permitting the construction of complex and sophisticated pages. Completed pages are written to sinks, which may be file systems, caches, or even other HTTP servers.

The Trigger Monitor is the software that takes objects from one or more sources, constructs pages, and writes the constructed pages to one or more sinks (Figure 6). Relationships among fragments are maintained in a persistent ODG, which preserves state information in the event of a system crash.

Whenever the Trigger Monitor is notified of a modification or addition of one or more objects, it fetches new copies of the changed objects from the appropriate source. The ODG is updated by parsing new and changed objects. The graph traversal algorithms described in Section 2.2 are then applied to the latest version of the ODG to determine all Web pages that need to be updated and an efficient order for updating them. Finally, bundles of published pages are written to the sinks.
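As an illustration of the traversal step, the following Python sketch finds every object affected by a set of changed objects and orders them so that embedded objects are rebuilt before the objects that embed them. It is a generic reachability search followed by a topological sort, not necessarily the algorithm of Section 2.2. The mapping `embedded_in`, the function name, and the page name `graf.html` are assumptions made for this example, and the ODG is assumed to be acyclic.

```python
from collections import deque


def affected_in_build_order(embedded_in, changed):
    """Return changed objects plus everything that embeds them, in build order.

    embedded_in maps an object name to the objects that directly embed it.
    """
    # Step 1: collect everything reachable from the changed objects.
    todo = set(changed)
    stack = list(changed)
    while stack:
        node = stack.pop()
        for parent in embedded_in.get(node, ()):
            if parent not in todo:
                todo.add(parent)
                stack.append(parent)

    # Step 2: topological sort (Kahn's algorithm) restricted to that set.
    indegree = {n: 0 for n in todo}
    for n in todo:
        for parent in embedded_in.get(n, ()):
            indegree[parent] += 1  # every parent of a node in todo is in todo
    ready = deque(sorted(n for n in todo if indegree[n] == 0))
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for parent in sorted(embedded_in.get(n, ())):
            indegree[parent] -= 1
            if indegree[parent] == 0:
                ready.append(parent)
    return order


odg = {
    "final.frg": ["graf_score.frg"],
    "graf_score.frg": ["graf.html"],
    "graf_bio.frg": ["graf.html"],
}
print(affected_in_build_order(odg, ["final.frg"]))
```

Objects that do not depend on the change, such as `graf_bio.frg` here, are never visited, which is what keeps the cost of an update proportional to the affected part of the graph.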

Since the Trigger Monitor is aware of all fragments and pages, synchronization is possible to prevent corruption of the pages. The ODG is used as the synchronization object to keep the fragment space consistent. Many "trigger handlers," each with their own sources and sinks, may be configured to use a common ODG. This design permits, for example, a slow-moving, carefully edited human-generated set of pages and fragments to be integrated with a high-speed, automated, database-driven content source. Because the ODG is aware of the entire fragment space and the interrelationship of the objects within that space, synchronization points can be chosen to ensure that multiple, differently-sourced, differently-paced content streams remain consistent.

Fig. 6. Schematic of the publish process.

Multiple Trigger Monitor instances may be chained, the sinks of earlier instances becoming the sources for later ones. This allows publication to take place in multiple stages. We have typically used the following stages in real deployments (Figure 7):

Development is the first step in the process. Fragments that appear on many Web pages (such as generic headers and footers) are created here, and the overall site design is established. The output of development may be structurally complete but lacking in content.

Staging takes as its input, or source, the output, or sink, of Development. Editors polish pages and combine content from various sources. Finished pages are the result.

Quality Assurance takes as its source the sink of Staging. Pages are examined here for correctness and appropriateness. The quality assurance people examine whole pages, which include content from the automated results feed. In determining whether to reject a page, however, the quality assurance people will not scrutinize fragments corresponding to automated results.

Automated Results are produced when a database trigger is generated as the result of an update. The trigger causes programs to be executed that extract current results and compose relevant updated pages and fragments. Unlike the previous stages, no human intervention occurs in this stage.

Fig. 7. Schematic of the publish process.

Production is where pages are served from. Its source is the sink of QA, and its sinks are the serving directories and caches.

Note how one stage can use the sink of another stage as its source. The automated feed updates sources at the same time, but independently of the human-driven stages. This achieves the dual goals of keeping the entire site consistent while immediately publishing content from automated feeds. Stages can be added and deleted easily. Data sources can be added and deleted with little or no disruption to the flow.
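The chaining of stages can be sketched in a few lines of Python. This is only a toy model under stated assumptions: sources and sinks are plain in-memory dictionaries, and the `Stage` class, the `chain` helper, and the `edit` callback are hypothetical names that do not come from the Trigger Monitor itself.

```python
class Stage:
    """One publication stage: reads from a source mapping, writes to its own sink."""

    def __init__(self, name, source, edit=None):
        self.name = name
        self.source = source    # mapping: object name -> content
        self.sink = {}          # becomes the source of the next stage
        self.edit = edit or (lambda obj_name, content: content)

    def publish(self, names):
        for obj_name in names:
            self.sink[obj_name] = self.edit(obj_name, self.source[obj_name])
        return list(names)      # names the next stage should pick up


def chain(specs, first_source):
    """Build stages so that each stage's sink is the next stage's source."""
    stages, source = [], first_source
    for name, edit in specs:
        stage = Stage(name, source, edit)
        stages.append(stage)
        source = stage.sink
    return stages


authoring = {"page.html": "draft"}
stages = chain(
    [("development", None),
     ("staging", lambda n, c: c + " +edited"),
     ("qa", None)],
    authoring,
)
names = ["page.html"]
for stage in stages:
    names = stage.publish(names)
print(stages[-1].sink)
```

Adding or removing a stage only changes the list passed to `chain`, which mirrors the observation that stages can be added and deleted easily.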

3.1 Actual Deployments

We now describe how our publishing system is typically deployed. Approximately three dozen different object types are produced by various data sources. These objects include entities such as images, PDF files, style sheets, movies, and HTML fragments. Objects are categorized into four primary classes based on how they participate in page assembly, and two secondary classes based on whether they are distributed as servable pages. Table I describes the four primary object types.


Table I. The Four Primary Object Types

                                  Embeds Other Fragments    Doesn't Embed Other Fragments
Embedded in other Fragment        Intermediate              Leaf
Not Embedded in other Fragment    Top-Level                 Binary

Table II. Secondary Object Types

Input to Assembly    Generated by Assembly
Intermediate         Partial
Top-Level            Servable Html

Binary objects such as images, sound clips, and movies are embedded in pages by virtue of HTML tags and therefore do not affect the page assembly process. The publishing system passes such objects directly from their source location (e.g., authors, web-cams) to the sink, which distributes them to the origin server's serving directory.
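The two-by-two classification of Table I can be written as a small decision function. The sketch below is in Python and assumes two hypothetical mappings, `embeds` (object to the fragments it embeds) and `embedded_in` (object to the objects that embed it); how the real system records these relationships is not specified here.

```python
def primary_type(name, embeds, embedded_in):
    """Classify an object according to the two questions of Table I."""
    has_children = bool(embeds.get(name))      # does it embed other fragments?
    has_parents = bool(embedded_in.get(name))  # is it embedded in another fragment?
    if has_parents and has_children:
        return "intermediate"
    if has_parents:
        return "leaf"
    if has_children:
        return "top-level"
    return "binary"


embeds = {
    "graf.html": {"footer.frg"},
    "footer.frg": {"factoid.frg", "copyr.frg"},
}
embedded_in = {
    "footer.frg": {"graf.html"},
    "factoid.frg": {"footer.frg"},
    "copyr.frg": {"footer.frg"},
}
for obj in ("graf.html", "footer.frg", "copyr.frg", "logo.gif"):
    print(obj, primary_type(obj, embeds, embedded_in))
```

Here `graf.html` and `logo.gif` are made-up object names; the image falls into the binary class because it takes no part in fragment inclusion, in line with the description of binary objects above.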

The page assembly process produces two secondary classes of objects. A top-level object is transformed into a final, servable HTML page after page assembly. This page is sent to the sink for distribution to the servers. An intermediate object is transformed into a partially assembled object by embedding those objects it refers to. The partials are written to a persistent cache for potential reuse in subsequent page assemblies. Table II summarizes the two secondary object types.

Objects sent to the publishing system generally go through four steps:

(1) Read object from source,
(2) Update object dependence graph,
(3) Assemble pages affected by the object, and
(4) Save input object and assembled objects to a persistent cache and to sink for distribution.

Figure 8 shows the flow of data through the publishing system for each type of object. The two repositories, "source" and "assembled," are disk-based caches. The source repository caches new objects for potential reuse in subsequent publish operations. The assembled repository caches partial objects for potential reuse. Finally, servable pages are written to sinks for distribution to the servers.

The act of publishing a page consists of the authoring system sending a message to the publishing system containing the names of the objects that have changed and that have been validated as ready for distribution. The first step taken by the publishing system is to fetch each object from the authoring system. If the object is a fragment of any sort, that is, a leaf, intermediate, or top-level object, it is then saved into the source repository. Binary objects (which do not participate in assembly) do not need to be cached.

The fragments are then analyzed for dependencies, and the ODG is updated to reflect the new state of the system. After ensuring that the dependencies in the ODG are consistent with the new objects received, the ODG update task issues a request that returns a list of objects needing reassembly.


Fig. 8. Dataflow through the publishing system.

Page assembly now occurs. The page assembler will pull needed objects from the cache. If a partial object is being embedded in an object requiring reassembly, then the page assembler will pull the embedded object from the assembled repository (rather than the source repository). This extra cache eliminates the need to reassemble embedded partial objects, which might be costly.

Finally, the output of the assembly process is written to the appropriate place. If the output of assembly is a partial object, it is written to the assembled repository for possible use in a future assembly. If the output of assembly is a servable page, then it is written to the sinks for distribution to the content servers. Triggered binary objects are also written to sinks.
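The interplay between the source repository and the assembled repository during assembly can be sketched as follows. This Python example is a simplified model, not the system's assembler: both repositories are plain dictionaries, the directive syntax is taken from the example in Section 3.2, and the function name `assemble`, the `dirty` set (objects the ODG analysis reported as needing reassembly), and the `top_level` flag are assumptions of this sketch.

```python
import re

# Matches <!-- %include(name) --> and <!-- %fragment(name) --> directives.
DIRECTIVE = re.compile(r"<!--\s*%(?:include|fragment)\(([^)]+)\)\s*-->")


def assemble(name, source_repo, assembled_repo, dirty, top_level=True):
    """Expand the directives in `name`, reusing cached partials that are clean."""
    text = source_repo[name]
    if not DIRECTIVE.search(text):
        return text  # leaf fragment: nothing to assemble
    if not top_level and name not in dirty and name in assembled_repo:
        return assembled_repo[name]  # reuse the cached partial as-is
    result = DIRECTIVE.sub(
        lambda m: assemble(m.group(1), source_repo, assembled_repo, dirty, False),
        text,
    )
    if not top_level:
        assembled_repo[name] = result  # partial saved for future assemblies
    return result  # for a top-level object this is the servable page


source = {
    "page.html": "<html><!-- %include(footer.frg) --></html>",
    "footer.frg": "<!-- %fragment(factoid.frg) --><!-- %fragment(copyr.frg) -->",
    "factoid.frg": "fact",
    "copyr.frg": "(c)",
}
assembled = {}
page = assemble("page.html", source, assembled, dirty={"page.html", "footer.frg"})
print(page)
print(assembled)
```

In this model the servable page is returned to the caller (standing in for the write to the sinks), while only the intermediate object `footer.frg` ends up in the assembled repository.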

There are three major disk-based caches used in the publishing process:

—The "source" repository. All page fragments are placed into this repository upon entering the publishing system and retrieved during page assembly. The "publish" message plays the role of a cache-invalidation message for the source repository.

—The "assembled" repository. During page assembly, intermediate fragments (Table II) are built into "partial" fragments. These "partial" fragments are saved for reuse during page assembly. Objects in this cache are invalidated when ODG analysis determines that at least one of the fragments making up the partial fragment has changed.

—The Object Dependence Graph (ODG). This is, in effect, a cache, because all of the information in it also resides in the database containing the fragments (which is, however, prohibitively expensive to search).


All three of these caches can be rebuilt in the event of a failure by republishing all the fragments in the authoring system's database. It is critical that these caches be persistent. The time to rebuild all three caches after total failure can be several hours for a major Web site.

The caches are implemented as disk-backed hash tables [Iyengar et al. 2001]. The cache interfaces are implemented as Java(2) "Map" interfaces, compatible with the Java(2) HashMap, making it trivial to exchange a memory-based hash table with a disk-backed hash table. We exploit the disk-based cache in several ways:

—Rapid start-up after shutdown or failure. Restarting the entire publishing system is about three orders of magnitude faster with a warm cache than without one.

—Space consumed by the publishing system can easily overflow main memory. The disk-based caches provide sufficient storage for situations where main memory would be insufficient.

—Replication of the publishing system. It is often desirable to have several instances of the publishing system installed for purposes such as development, test, recovery, and so on. Because each cache is implemented as a single file, it is easy to replicate the state of the publishing system for use elsewhere.
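The benefit of hiding the cache behind a common map interface can be shown with an analogous Python sketch. It is not the Java implementation described above: Python's built-in `dict` stands in for the memory-based hash table and the standard-library `shelve` module stands in for the disk-backed one (it is not the disk-backed hash table of [Iyengar et al. 2001]). The function names `open_repository` and `store_fragment` are hypothetical.

```python
import os
import shelve
import tempfile


def open_repository(path=None):
    """Return a mapping: in-memory when no path is given, disk-backed otherwise."""
    return {} if path is None else shelve.open(path)


def store_fragment(repo, name, body):
    # Works unchanged for both kinds of repository, since both are mappings.
    repo[name] = body


memory_repo = open_repository()
store_fragment(memory_repo, "copyr.frg", "(c)")

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "source_repo")
    disk_repo = open_repository(path)
    store_fragment(disk_repo, "copyr.frg", "(c)")
    disk_repo.close()

    # Reopening models a restart with a warm cache: the content is still there.
    disk_repo = open_repository(path)
    print(disk_repo["copyr.frg"])
    disk_repo.close()
```

Because the calling code only uses the mapping operations, switching between the two backends is a one-line change at the point where the repository is opened.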

3.2 Examples

To demonstrate how a site might be built from fragments, we present an example from a Web site for a French Open Tennis Tournament. A site architect views the player page for Steffi Graf (shown in Figure 9) as consisting of a standard header, sidebar, and footer, with biographical information and recent results thrown in. The site architect composes HTML similar to the following, establishing a general layout for the site:

    <html>
    <!-- %include(header.frg) -->
    <table>
    <tr>
      <td><!-- %include(sidebr.frg) --></td>
      <td><table>
        <tr><!-- %fragment(graf_bio.frg) --></tr>
        <tr><!-- %fragment(graf_score.frg) --></tr>
      </table></td>
    </tr>
    </table>
    <!-- %include(footer.frg) -->
    </html>

where "footer.frg" consists of

    <!-- %fragment(factoid.frg) -->
    <!-- %fragment(copyr.frg) -->

The form just shown for specifying fragments is concise and efficient. Since the fragments are specified by HTML comments, browsers and other entities that don't recognize fragments will simply ignore directives related to fragments. Our system also allows fragments to be specified using the ESI language [Tsimelzon et al. www.esi.org], a standard that has been developed for specifying fragments. When the first version of our publishing system was developed, ESI was not yet in existence.

Fig. 9. Sample screen shot demonstrating the use of fragments.
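Because the directives are ordinary HTML comments, the embedded fragments of a template can be recovered with a simple pattern match. The Python sketch below assumes the comment syntax of the example above; the regular expression and the function name `embedded_fragments` are our own illustration, not the parser used by the publishing system.

```python
import re

# Captures the directive kind (include or fragment) and the fragment name.
DIRECTIVE = re.compile(r"<!--\s*%(include|fragment)\(([^)]+)\)\s*-->")


def embedded_fragments(template):
    """Return the names of the fragments a template refers to, in document order."""
    return [match.group(2) for match in DIRECTIVE.finditer(template)]


page = """<html>
<!-- %include(header.frg) -->
<table><tr>
  <td><!-- %include(sidebr.frg) --></td>
  <td><table>
    <tr><!-- %fragment(graf_bio.frg) --></tr>
    <tr><!-- %fragment(graf_score.frg) --></tr>
  </table></td>
</tr></table>
<!-- %include(footer.frg) -->
</html>"""

print(embedded_fragments(page))
```

Each returned name corresponds to one inclusion edge between the named fragment and the page in the ODG.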

Prior to the beginning of play, the contents of "graf_score.frg" will be empty, since no matches have commenced. This means the part of the page outlined by the dashed box in Figure 9 will, at first, be empty. The first publication of this fragment will result in the ODG seen to the right of Steffi Graf's player page in Figure 9. Again, the objects and edges within the dashed box will not yet be within the ODG, since no match play has yet occurred.

Using fragments in this way permits many architects, editors, and even automated systems to modify the page simultaneously. Our system ensures that all changes are properly included in the final page that is seen by the user. An architect updating the structure of the page does not need to know anything about copyrights, trademarks, the size of the sponsor's logos, the look-and-feel of the site, or any of the data that will be included on the page. Similarly, an editor wishing to change the look-and-feel of a site does not need to understand the structure of any particular page.

Certain types of major site changes are greatly facilitated by our publishing system. For example, changing the sidebar to reflect the end of a long event is as simple as updating "sidebr.frg". To change the look-and-feel of a site, an editor only needs to change "header.frg" and "footer.frg". For both of these kinds of changes, the system will use the ODG from Figure 9 to determine that Steffi Graf's page must be rebuilt (along with many others). Once all pages have been rebuilt, they will be republished. The user will see the changes on every page, although the vast majority of underlying fragments will not have changed.

More static information, like player biographies, can be kept up-to-date in one place but used on many pages. For example, "graf_bio.frg" is used on our example page, but may also be used in many other places. To include a new photo or update the information included in the biography, the editors need only concern themselves with updating "graf_bio.frg". The system ensures that all pages that include "graf_bio.frg" will automatically be rebuilt.

Since scoring information will change frequently once a tennis match is in progress, updating that aspect of a page can be handled by an automated process. As a match begins, "graf_score.frg" is updated to include the match in progress. This means that once the final has begun, the "graf_score.frg" page will consist of HTML similar to

    <!-- %fragment(final.frg) -->
    <!-- %fragment(semi.frg) -->

When the updated "graf_score.frg" is published, the system will detect that it now includes "final.frg" and "semi.frg" and will update the ODG as shown in the dashed box within Figure 9. Now, as the final match progresses, only "final.frg" needs to be updated and published through our system. As part of the publication process, the system will detect that "final.frg" is included in "graf_score.frg", causing "graf_score.frg" to be rebuilt using the updated score. Likewise, the system will detect that Steffi Graf's page must be rebuilt as well, and a new page will be built including the updated scoring information. Eventually, when the match completes, the complete page shown in the example is produced.
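The ODG update triggered by republishing "graf_score.frg" amounts to replacing the set of fragments that the object embeds. A minimal Python sketch follows, assuming the graph is held as two hypothetical dictionaries of sets (`embeds` and `embedded_in`) and that the new list of embedded fragments has already been obtained by parsing the object; the page name `graf.html` is also made up for the example.

```python
def update_odg(embeds, embedded_in, name, new_children):
    """Replace the inclusion edges of `name`; return (added, removed) fragments."""
    old = embeds.get(name, set())
    new = set(new_children)
    for child in old - new:
        embedded_in[child].discard(name)              # edge no longer exists
    for child in new - old:
        embedded_in.setdefault(child, set()).add(name)  # newly embedded fragment
    embeds[name] = new
    return new - old, old - new


# Before play starts, graf_score.frg is empty and embeds nothing.
embeds = {"graf.html": {"graf_bio.frg", "graf_score.frg"}, "graf_score.frg": set()}
embedded_in = {"graf_bio.frg": {"graf.html"}, "graf_score.frg": {"graf.html"}}

added, removed = update_odg(
    embeds, embedded_in, "graf_score.frg", ["final.frg", "semi.frg"]
)
print(sorted(added), sorted(removed))
print(embedded_in["final.frg"])
```

After this update, a later change to `final.frg` can be traced through `embedded_in` to `graf_score.frg` and from there to the player page.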

The score for the final match will be displayed on many pages other than Steffi Graf's player page. For instance, Martina Hingis's player page will also include these results, as will the scoreboard page while the match is in progress. A page listing match-ups between different players will also contain the score. To update all of these pages, the automated system only updates one fragment. This keeps the automated system independent of the site design.

A more complex example of a Web page with fragments is shown in Figure 10, which depicts the Athletics Home Page from the 2000 Olympic Games Web Site on October 1, 2000. Both the header and footer are in separate frames. This reduces the size of pages and the amount of information that needs to be loaded when navigating between pages. It also allows clients to access the top and bottom navigation elements at any time, since these elements do not move when the user scrolls through a page.

The page contains a total of 46 fragments, a typical number for the Web site. The header contains 1 top-level fragment and 12 embedded fragments. The footer contains 1 top-level fragment and 3 embedded fragments. Neither the header nor the footer was changed during the games. The Athletics Home Page frame contains 9 top-level fragments and 20 embedded fragments. This page was updated frequently, and fragments were an essential component in reducing the overhead for updates.

Fig. 10. Another example of a Web page composed of fragments.

Fig. 11. The CPU time in milliseconds required to construct and publish bundles of various sizes.

4. SYSTEM PERFORMANCE AND DEPLOYMENT STATISTICS

This section describes the performance of a Java implementation of our system running on an IBM Intellistation containing a 333 MHz Pentium II processor with 256 Mbytes of memory and the Windows NT (version 4.0) operating system. The distribution of Web page sizes is similar to the one for the 1998 Olympic Games Web site [Iyengar et al. 1999] as well as more recent Web sites deploying our system; the average Web page size is around 10 Kbytes. Fragment sizes are typically several hundred bytes but usually less than 1 Kbyte. The distribution of fragment sizes is also representative of real Web sites deploying our system.

Figure 11 shows the CPU time in milliseconds required for constructing and publishing bundles of various sizes. Times are averaged over 100 runs. All 100 runs were submitted simultaneously, so the times in the figure reflect the ability for the runs to be executed in parallel. The solid curve depicts times when all objects that need to be constructed are explicitly triggered. The dotted line depicts times when a single fragment that is included in multiple pages is triggered; the pages that need to be built as a result of the change to the fragment are determined from the ODG. Graph traversal algorithms applied to the ODG have relatively low overhead. By contrast, each object that is triggered has to be read from disk and parsed; these operations consume considerable CPU overhead. As the graph indicates, it is more desirable to trigger a few objects that are included in multiple pages than to trigger all objects that need to be constructed.

Our implementation allows multiple complex objects to be constructed in parallel. As a result, we are able to achieve near 100% CPU utilization, even when construction of an object is blocked due to I/O, by concurrently constructing other objects.


Fig. 12. The breakdown in CPU time required to construct and publish a typical complex Web page.

The breakdown as to where CPU time is consumed is shown in Figure 12. CPU time is divided into the following categories:

—Retrieve, parse: time to read all triggered objects from disk and parse them for determining included fragments.

—ODG update: time for updating the ODG based on the information obtained from parsing objects and for analyzing the ODG to determine all objects that need to be updated and an efficient order for updating the objects.

—Assembly: time to update all objects.

—Save data: time to save all updated objects on disk.

—Send ack: time to send an acknowledgment message via HTTP that publication is complete.

In the bars marked 1 to 100, one fragment included in 100 others was triggered. The 100 pages that needed to be constructed were determined from the ODG. In the bars marked 100 to 100, the 100 pages that needed to be constructed were all triggered. The times shown in Figure 12 are the average times for a single page. The total average time for constructing and publishing a page in the 1 to 100 case is 25.86 milliseconds (represented by the aggregate of all bars); the corresponding time for the 100 to 100 case is 44.51 milliseconds.

The retrieve and parse time is significantly higher for the 100 to 100 case because the system is reading and parsing 100 objects compared with 1 in the 1 to 100 case. Since the source for every object that is triggered must be saved, the time it takes to save the data is somewhat longer when 100 objects are triggered than when only one object is triggered.
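As a rough worked comparison, using only the two per-page averages reported for Figure 12 (the rounding is ours):

```latex
\[
\frac{44.51\ \text{ms}}{25.86\ \text{ms}} \approx 1.72,
\qquad
44.51\ \text{ms} - 25.86\ \text{ms} = 18.65\ \text{ms},
\qquad
\frac{18.65}{44.51} \approx 0.42 .
\]
```

In other words, triggering a single shared fragment costs roughly 42% less CPU time per page than triggering each page explicitly, which for a bundle of 100 pages is a difference of about 100 × 18.65 ms ≈ 1.9 seconds.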

Figure 13 shows how the construction and publication time averaged over 100 runs varies with the number of embedded fragments within a Web page. While the time grows with the number of fragments, the overhead for assembling a Web page with several embedded fragments is still relatively low. A Web page constructed from multiple database accesses may require over a second of CPU time to construct from scratch without using fragments. The benefits of reducing database accesses by assembling Web pages from pre-existing fragments can thus be significant.

Fig. 13. The average CPU time in milliseconds required to construct and publish a complex Web page as a function of the number of embedded fragments. In each case, one fragment in the page was triggered.

Fig. 14. The average CPU time in milliseconds required to construct and publish a complex Web page as a function of the number of fragments triggered.

Figure 14 shows how the construction and publication time averaged over 100 runs varies with the number of fragments that are triggered for a Web page containing 20 fragments. Since each fragment that is triggered has to be read from disk and parsed, overhead increases with the number of fragments triggered.

4.1 Deployment Statistics

We now present statistics we collected from a deployment of our publishing system at a major Web site (the 2000 Olympic Games Web site). The statistics from this section can be used to design efficient publishing systems. This deployment did not make use of link edges or incremental publication as described in Section 2.2. Standard third-party tools were used to check hypertext links for correctness. Pages were extensively tested to make sure that they worked as designed.

Fig. 15. The distribution of object sizes in the source repository. Each bar represents the number of objects contained in the size range whose upper limit is shown on the X-axis.

Fig. 16. The distribution of object sizes in the assembled repository. Each bar represents the number of objects contained in the size range whose upper limit is shown on the X-axis.

The publication system used two stages. The first stage consisted of development, staging, and quality assurance. The second stage consisted of production. Real-time results were fed directly to the production server, and quality assurance for such pages was handled after the pages were published. As the site grew in size over the course of the event, such changes were only done at night.

Figures 15 and 16 characterize the object size distributions within two of the disk-based caches used in the publishing system.

Figure 17 shows the distribution of the number of incoming edges for ODG nodes. This graph represents the number of top-level fragments embedded in the object corresponding to the node. The total number of fragments embedded in the object may be higher due to the fact that a fragment may recursively embed another fragment. Figure 18 shows the distribution of the number of outgoing edges for ODG nodes. This graph represents the number of objects that embed, at top level, the fragment corresponding to the node. The total number of objects containing the fragment may be higher due to recursive embedding of fragments within other fragments. Some fragments, such as header and footer fragments, are embedded in a large number of Web pages. Both Figures 17 and 18 show that the fragments are used extensively and that it is common for a Web page to contain several fragments.

Fig. 17. The distribution of the number of incoming edges for nodes of the ODG.

Fig. 18. The distribution of the number of outgoing edges for nodes of the ODG.
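Distributions like those in Figures 17 and 18 can be derived directly from the edge list of an ODG. The Python sketch below assumes each edge is a hypothetical `(fragment, embedder)` pair, so that incoming edges count the fragments an object embeds directly and outgoing edges count the objects that directly embed a fragment, as described above. The function name and the sample edges are ours.

```python
from collections import Counter


def degree_histograms(edges):
    """Return (in_hist, out_hist): how many nodes have each in/out-degree."""
    indegree, outdegree = Counter(), Counter()
    nodes = set()
    for fragment, embedder in edges:
        outdegree[fragment] += 1   # fragment is embedded by one more object
        indegree[embedder] += 1    # embedder directly embeds one more fragment
        nodes.update((fragment, embedder))
    in_hist = Counter(indegree[n] for n in nodes)    # missing keys count as 0
    out_hist = Counter(outdegree[n] for n in nodes)
    return in_hist, out_hist


edges = [
    ("header.frg", "a.html"),
    ("header.frg", "b.html"),
    ("bio.frg", "a.html"),
]
in_hist, out_hist = degree_histograms(edges)
print(sorted(in_hist.items()))
print(sorted(out_hist.items()))
```

A widely shared fragment such as a header shows up as a node with a large out-degree, which matches the observation about header and footer fragments.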

Finally, Figure 19 shows the distribution of maximum levels at which objects are embedded. The embed depth of an object is the maximum length of any path following inclusion edges originating from the object. In this system there is no limit on the embedding level. We see that just over two-thirds of fragments are only embedded at top level and not recursively embedded. It is rare for a Web page to be embedded more than four levels deep. Web pages were never embedded beyond five levels deep.
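The embed depth defined above is a longest-path computation on the ODG. A minimal Python sketch, assuming an acyclic graph stored as a hypothetical `embedded_in` mapping from each object to the objects that directly embed it (the function name and the page name `graf.html` are also ours):

```python
def embed_depths(embedded_in):
    """Return the longest inclusion path (in edges) starting at each object."""
    memo = {}

    def depth(node):
        if node not in memo:
            parents = embedded_in.get(node, ())
            # An object that nothing embeds has depth 0; otherwise one more
            # than the deepest object that embeds it.
            memo[node] = 1 + max(depth(p) for p in parents) if parents else 0
        return memo[node]

    nodes = set(embedded_in) | {p for ps in embedded_in.values() for p in ps}
    return {node: depth(node) for node in nodes}


odg = {
    "final.frg": ["graf_score.frg"],
    "graf_score.frg": ["graf.html"],
    "graf_bio.frg": ["graf.html"],
}
for name, d in sorted(embed_depths(odg).items()):
    print(name, d)
```

In this small example `graf_bio.frg` is embedded only at top level (depth 1), while `final.frg` is recursively embedded (depth 2). Memoizing the depth of each node keeps the computation linear in the size of the graph.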


Fig. 19. The distribution of the degree to which objects are embedded.

Fig. 20. Number of updates each day, broken down by originator. Multiple objects are typically changed by each update.

Figure 20 shows the number of updates that were processed each day. Each update typically results in several objects being changed. Updates originated from one of the following:

—Scoring: Updated information from the scoring system, which contained sports scores, start lists, athlete information and event scheduling.

—News: Editorial changes to news stories.

—Static: Data updating the presentation of the site, such as headers, trailers, templates, images, logos, and sound files, as well as other mostly static content created manually.

—Netcam: Graphics and images taken from live cameras.

—Reapers: Several small applications that would grab data from outside sources, like weather and time, and pump it into appropriate Web pages.


Fig. 21. Total number of objects input and changed each day. The difference between the bars for each day shows the number of dependent objects for that day.

Fig. 22. Number of objects updated by type.

The event started on Day 1. Days −3 through 0 correspond to the four days before the event started. The number of updates during this period was not as high as during the games themselves. The number of updates tended to decrease towards the end of the event.

Each update identified one or more objects that had been changed by the originator. This is the list of the input objects. ODG analysis identifies dependent objects that have also changed because one of their underlying fragments had changed. The sum of the two lists, input objects and dependent objects, is the actual list of changed objects resulting from the update. Figure 21 shows the number of objects input and changed for each day. Figure 22 breaks down the number of objects changed by type. Since only top-level and intermediate type objects have underlying fragments that might have changed, all dependent objects are of one of those types. The chart shows the count of each type of object, also indicating whether objects were input or dependent.

Fig. 23. Distribution of object changes for each hour during two days.

Fig. 24. Number of updates to the ODG by type.
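The split between input objects and dependent objects can be illustrated with a small graph traversal. The following is only a minimal sketch, not the system's implementation: it assumes the ODG is given as (fragment, container) pairs, and the function name `changed_objects` and all object names are hypothetical.

```python
from collections import defaultdict, deque


def changed_objects(edges, input_objects):
    """Return (inputs, dependents) for one update.

    edges: iterable of (fragment, container) pairs, meaning that
    `container` embeds `fragment`, so a change to `fragment`
    propagates to `container`. This edge format is an assumption.
    """
    embedded_in = defaultdict(set)
    for fragment, container in edges:
        embedded_in[fragment].add(container)

    inputs = set(input_objects)
    dependents = set()
    queue = deque(inputs)
    while queue:
        obj = queue.popleft()
        for container in embedded_in[obj]:
            # An object named by the originator counts as input, not dependent.
            if container not in inputs and container not in dependents:
                dependents.add(container)
                queue.append(container)
    return inputs, dependents


# Hypothetical object names, for illustration only.
edges = [
    ("score", "medal_table"),
    ("medal_table", "home_page"),
    ("medal_table", "country_page"),
    ("logo", "home_page"),
]
inputs, dependents = changed_objects(edges, ["score"])
changed = inputs | dependents  # the full list of changed objects
```

Here the single input object `score` yields three dependent objects, so the update changes four objects in total, which is the kind of difference that the paired bars in Figure 21 report.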

Figure 23 shows how object changes were distributed across two typical days. Each hourly count of object changes is broken down into the number of objects changed because they were modified by the originating source, and the number of objects that were changed because they were dependent on an underlying object that was changed. The number of updated objects decreases late at night and in the early morning hours. The exceptions are the peaks around 1:00 and 2:00 AM, which correspond to times at which updates were made to prepare the Web site for the next day.

Figure 24 shows how many updates were made to the ODG over the course of the event. It indicates that the structure of the ODG was quite dynamic. This has implications for systems that remotely cache fragments and perform remote assembly of the fragments. Since the ODG is constantly changing, remote caches may have to frequently update the dependency information they contain in order to remain consistent. Note also that most of the changes in the ODG tend to occur in the early part of the event.

Fig. 25. Average elapsed time consumed per batched update. Multiple objects were typically updated in each batch.

Figure 25 shows the elapsed time per batched update, averaged over all updates for each day. Earlier in this section, we described how the publishing process consists of four steps. In the bar graph, read corresponds to reading objects from source, odg update corresponds to updating the object dependence graphs, assemble corresponds to assembling pages affected by the changed objects, and write corresponds to writing the assembled objects to sink for distribution or to a persistent cache. For the read and write phases, most of the time was spent waiting for I/O. For the ODG phase, most of the time was spent waiting to acquire a lock. The assemble phase was the only phase that was CPU intensive.
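A per-phase breakdown of this kind can be collected by timing each step of a batch with a wall-clock timer, which also counts time spent waiting for I/O or locks. The sketch below is an illustration under stated assumptions, not the instrumentation actually used: the helper names `publish_batch` and `average_per_batch` are hypothetical, and the four step functions are trivial stand-ins.

```python
import time
from collections import defaultdict

PHASES = ("read", "odg update", "assemble", "write")


def publish_batch(batch, steps, totals):
    """Run the four phases in order on one batch and add the
    elapsed wall-clock time of each phase to `totals`."""
    data = batch
    for phase in PHASES:
        start = time.perf_counter()
        data = steps[phase](data)
        totals[phase] += time.perf_counter() - start
    return data


def average_per_batch(totals, n_batches):
    """Average elapsed seconds per batched update, by phase."""
    return {phase: totals[phase] / n_batches for phase in PHASES}


# Stand-in step functions (assumptions, for illustration only).
steps = {
    "read": lambda names: {n: "<" + n + ">" for n in names},
    "odg update": lambda objs: objs,
    "assemble": lambda objs: {n: body.upper() for n, body in objs.items()},
    "write": lambda pages: sorted(pages),
}

totals = defaultdict(float)
batches = [["a", "b"], ["c"]]
written = [publish_batch(b, steps, totals) for b in batches]
averages = average_per_batch(totals, len(batches))
```

Using `time.perf_counter` rather than a CPU-time clock matters here, since three of the four phases are described as dominated by waiting rather than computation.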

The growth of the disk caches is shown in Figure 26. Both the source and assembled repositories grow significantly faster than the ODG. From the third day and all days afterward, the source repository consumed the most disk space. Efficient disk storage was a crucial component for our publishing system. We built a customized disk storage allocator, which was optimized for the object size distributions and which outperformed both file systems and databases [Iyengar et al. 2001, 2003].

5. RELATED WORK

There are several products commercially available for managing Web content such as ColdFusion and Dreamweaver by Macromedia [Macromedia http://www.macromedia.com], FrontPage and Visual Studio from Microsoft [Microsoft http://www.microsoft.com], and NetObjects Fusion [NetObjects http://www.netobjects.com]. None of them have the capabilities for generating Web pages via fragments that our system provides.


Fig. 26. Growth of disk caches.

The ESI proposal [Tsimelzon et al. http://www.esi.org] is an attempt to develop a standard protocol for caching fragments of Web pages remotely and assembling the pages at the cache in response to a request. Our system is compatible with ESI. Datta and others have developed a proxy cache, which stores fragments remotely and assembles them in response to client requests [Datta et al. 2002]. There has also been research in performing fragment assembly at clients; in the Client-Side Includes approach, fragments are assembled in browsers [Rabinovich et al. 2003].

An earlier conference paper of ours [Challenger et al. 2000] described a preliminary version of our Web publishing system before it had been successfully deployed on a large scale. The current article presents more information on how our system was actually implemented. It is also the first published article we are aware of that presents an empirical study of a major real deployment of a fragment-based Web publishing system. Mohapatra and Chen [2001] have proposed a system for constructing Web pages from fragments using graphs to represent inclusion relationships between fragments. However, they have not developed a production-quality publishing system deployed at highly accessed Web sites as we have.

Our article has described techniques for efficiently creating dynamic content. A number of other researchers have examined the related problem of how to allow at least some parts of dynamic Web content to be cached remotely. HPP [Douglis et al. 1997] is an extension to HTML, which allows Web resources to be separated into static and dynamic parts. Static portions can be cached, while the dynamic portions are obtained on each access. Delta encoding [Mogul et al. 1997] is a method for updating cached objects by transferring only the difference between a cached object and the current value. Mikhailov and Wills [2001, 2002] have developed a technique for managing Web objects in which objects at a site are classified based on their change characteristics. Servers analyze relationships between objects in conjunction with object change characteristics and compile them into content control commands. Caches and servers then use these commands to manage objects. Frequently changing pages are constructed from individual components. There are also some conceptual similarities between remotely caching dynamic Web pages comprised of fragments and caching of materialized views for databases [Abiteboul et al. 1998; Zhuge and Garcia-Molina 1998].
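The idea behind delta encoding, sending only what differs from the cached copy, can be shown in a few lines. This is a toy sketch built on Python's standard `difflib` module; it is not the encoding evaluated by Mogul et al., and the delta format (copy ranges plus literal text) and the names `make_delta` and `apply_delta` are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def make_delta(cached, current):
    """Describe `current` as copy ranges from `cached` plus literal text."""
    delta = []
    matcher = SequenceMatcher(None, cached, current, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # The receiver already holds this range in its cached copy.
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Only new text has to be transferred.
            delta.append(("data", current[j1:j2]))
        # "delete" needs no entry: the range is simply not copied.
    return delta


def apply_delta(cached, delta):
    """Rebuild the current value from the cached copy and a delta."""
    parts = []
    for op in delta:
        if op[0] == "copy":
            parts.append(cached[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


# Hypothetical page contents.
cached = "<html><body>Gold: 3 Silver: 5</body></html>"
current = "<html><body>Gold: 4 Silver: 5</body></html>"
delta = make_delta(cached, current)
rebuilt = apply_delta(cached, delta)
literal_chars = sum(len(op[1]) for op in delta if op[0] == "data")
```

When a page changes only slightly between versions, the literal text carried in the delta is much shorter than the page itself.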

There has been past work breaking up Web pages into constituent components. The Document Object Model (DOM) is an application programming interface for HTML and XML documents [W3C http://www.w3.org/DOM]. It defines the logical structure of a document and the way a document is accessed and manipulated. While fragments may not neatly correspond to the structure of a Web page as defined by DOM, we have found decomposing Web pages using DOM to be a critically important step in techniques for analyzing Web pages to automatically detect fragments [Ramaswamy et al. 2004].

There have been a number of systems that use concepts from database management to solve problems in Web site design complementary to the ones that our system solves. Strudel [Fernandez et al. 1998] supports declarative specification of a Web site’s content and structure and automatically generates a browsable Web site from a specification. The ARANEUS Web-Base Management System manages Web data in a manner similar to databases [Atzeni et al. 1997]. A survey of database techniques and how they have been applied to Web-based systems is contained in Florescu et al. [1998].

6. SUMMARY AND CONCLUSIONS

We have presented a publishing system for efficiently creating dynamic Web content. Our publishing system constructs complex objects from fragments that may recursively embed other fragments. Relationships between Web pages and fragments are represented by object dependence graphs. We presented algorithms for efficiently detecting and updating all affected Web pages after one or more fragments change.
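Recursive embedding of fragments can be illustrated with a short assembly routine. This is a minimal sketch under stated assumptions, not the system's page assembler: the `{{include name}}` marker is a made-up syntax, and the `assemble` function and fragment names are hypothetical.

```python
import re

# Hypothetical include marker; the real system's syntax is not shown here.
INCLUDE = re.compile(r"\{\{include (\w+)\}\}")


def assemble(name, fragments, _stack=()):
    """Expand `name` by recursively inlining the fragments it embeds."""
    if name in _stack:
        raise ValueError("cyclic inclusion: " + " -> ".join(_stack + (name,)))

    def expand(match):
        return assemble(match.group(1), fragments, _stack + (name,))

    return INCLUDE.sub(expand, fragments[name])


# Hypothetical fragments: a page embeds a table, which embeds a score row.
fragments = {
    "home_page": "<h1>Results</h1>{{include medal_table}}",
    "medal_table": "<table>{{include score}}</table>",
    "score": "<tr><td>Gold</td><td>3</td></tr>",
}
page = assemble("home_page", fragments)

# Changing one low-level fragment changes every page that embeds it.
fragments["score"] = "<tr><td>Gold</td><td>4</td></tr>"
updated_page = assemble("home_page", fragments)
```

The cycle check reflects the fact that inclusion relationships must form an acyclic graph for assembly to terminate.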

After a set of multiple Web pages change or are created for the first time, the Web pages must be published to an audience. Publishing all changed Web pages in a single atomic action avoids consistency problems but may cause delays in publication, particularly if the newly constructed pages must be proofread before publication. Incremental publication can provide information faster but may also result in inconsistencies across published Web pages. We presented three algorithms for incremental publication designed to handle different consistency requirements.

Our publishing system provides an easy method for Web site designers to specify and modify inclusion relationships among Web pages and fragments. Users can update content on multiple Web pages by modifying a template. The system then automatically updates all Web pages affected by the change. It is easy to change the look and feel of an entire Web site as well as to consistently update common information on many Web pages.


Our system accommodates both quality-controlled fragments, which must be proofread before publication and are typically from humans, and immediate fragments, which have to be published immediately and are typically from automated feeds. A Web page can combine both quality-controlled and immediate fragments and still be updated in a timely fashion. Our publishing system has been implemented in Java. We discussed some of our experiences with real deployments of our system as well as its performance.

ACKNOWLEDGMENTS

Several people have contributed to this work including Peter Davis, Daniel Dias, Glenn Druce, Sara Elo, Grant Emery, Cameron Ferstat, Peter Fiorese, Kip Hansen, Brenden O’Sullivan, Kent Rankin, Paul Reed, and Jerry Spivak.

REFERENCES

ABITEBOUL, S., MCHUGH, J., RYS, M., VASSALOS, V., AND WEINER, J. 1998. Incremental maintenance for materialized views over semistructured data. In Proceedings of VLDB ’98.

ATZENI, P., MECCA, G., AND MERIALDO, P. 1997. To weave the web. In Proceedings of VLDB ’97.

CHALLENGER, J., IYENGAR, A., WITTING, K., FERSTAT, C., AND REED, P. 2000. A publishing system for efficiently creating dynamic Web content. In Proceedings of IEEE INFOCOM.

CORMEN, T., LEISERSON, C., AND RIVEST, R. 1990. Introduction to Algorithms. MIT Press, Cambridge, MA.

DATTA, A., DUTTA, K., THOMAS, H., VANDERMEER, D., SURESHA, AND RAMAMRITHAM, K. 2002. Proxy-Based Acceleration of Dynamically Generated Content on the World Wide Web: An Approach and Implementation. In Proceedings of ACM SIGMOD 2002.

DOUGLIS, F., HARO, A., AND RABINOVICH, M. 1997. HPP: HTML Macro-Preprocessing to Support Dynamic Document Caching. In Proceedings of the USENIX Symposium on Internet Technologies and Systems.

FERNANDEZ, M., FLORESCU, D., KANG, J., LEVY, A., AND SUCIU, D. 1998. Catching the Boat with Strudel: Experiences with a Web-Site Management System. In Proceedings of ACM SIGMOD.

FLORESCU, D., LEVY, A., AND MENDELZON, A. 1998. Database Techniques for the World-Wide Web: A Survey. ACM SIGMOD Record 27, 3.

IYENGAR, A. AND CHALLENGER, J. 1997. Improving Web server performance by caching dynamic data. In Proceedings of the USENIX Symposium on Internet Technologies and Systems.

IYENGAR, A., JIN, S., AND CHALLENGER, J. 2001. Efficient algorithms for persistent storage allocation. In Proceedings of the 18th IEEE Symposium on Mass Storage Systems.

IYENGAR, A., JIN, S., AND CHALLENGER, J. 2003. Techniques for efficiently allocating persistent storage. J. Sys. Soft. 68, 2 (Nov.), 85–102.

IYENGAR, A., SQUILLANTE, M., AND ZHANG, L. 1999. Analysis and characterization of large-scale Web server access patterns and performance. World Wide Web 2, 1,2 (June), 85–100.

MACROMEDIA. Macromedia Web content management products. http://www.macromedia.com/.

MICROSOFT. Microsoft’s FrontPage and Visual Studio. http://www.microsoft.com/.

MIKHAILOV, M. AND WILLS, C. 2001. Change and relationship-driven content caching, distribution and assembly. Tech. Rep. WPI-CS-TR-01-03, Computer Science Department, Worcester Polytechnic Institute. March.

MIKHAILOV, M. AND WILLS, C. 2002. Exploiting object relationships for deterministic web object management. In Proceedings of the 7th International Workshop on Web Content Caching and Distribution.

MOGUL, J., DOUGLIS, F., FELDMANN, A., AND KRISHNAMURTHY, B. 1997. Potential benefits of delta encoding and data compression for HTTP. In Proceedings of SIGCOMM ’97.

MOHAPATRA, P. AND CHEN, H. 2001. A Framework for Managing QoS and Improving Performance of Dynamic Web Content. In Proceedings of IEEE GLOBECOM 2001.

NETOBJECTS. NetObjects Fusion. http://www.netobjects.com/.

RABINOVICH, M., XIAO, Z., DOUGLIS, F., AND KALMANEK, C. R. 2003. Moving Edge-Side Includes to the Real Edge—the Clients. In Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems.

RAMASWAMY, L., IYENGAR, A., LIU, L., AND DOUGLIS, F. 2004. Automatic Detection of Fragments in Dynamically Generated Web Pages. In Proceedings of the 13th International World Wide Web Conference.

TSIMELZON, M., WEIHL, B., AND JACOBS, L. ESI Language Specification 1.0. http://www.esi.org/language_spec_1-0.html.

W3C. Document Object Model—W3C Recommendation. http://www.w3.org/DOM.

ZHUGE, Y. AND GARCIA-MOLINA, H. 1998. Graph structured views and their incremental maintenance. In Proceedings of IEEE ICDE ’98.

Received July 2002; revised August 2003; accepted November 2003
