XWRAPComposer: A Multi-Page Data Extraction Service for Bio-Computing Applications

Ling Liu, Jianjun Zhang, Wei Han, Calton Pu, James Caverlee, Sungkeun Park
College of Computing, Georgia Institute of Technology
{lingliu, zhangjj, weihan, calton, caverlee, mungooni}@cc.gatech.edu

Terence Critchlow, Matthew Coleman, David Buttler
Lawrence Livermore National Laboratory, California, USA
{critchlow1, coleman16, buttler1}@llnl.gov

Author Responsible for Correspondence:
Jianjun Zhang, College of Computing, Georgia Institute of Technology, Atlanta, GA, 30318. Phone: 404-385-2027



    Abstract

Bio-computing applications usually require gathering information from multiple bioinformatics services, such as protein databases and various BLAST services. Although Web service technology such as WSDL, SOAP, and UDDI provides a standardized remote-invocation interface, other types of heterogeneity remain, in terms of query capability, content structure, and content-delivery logic, due to the inherent diversity of these services. A popular approach to handling such heterogeneity is to use wrappers as mediators that automate the collection and extraction of data from multiple diverse data providers.

This paper presents a service-oriented framework for the development of wrapper code generators, including a methodology for designing an effective wrapper-program construction facility and a concrete implementation called XWRAPComposer. Three unique features distinguish XWRAPComposer from existing wrapper development approaches. First, XWRAPComposer is designed to enable multi-stage and multi-page data extraction. Second, XWRAPComposer is the only wrapper generation system that separates information extraction logic from query-answer control logic, allowing a higher level of robustness against changes in the service provider's web site design or infrastructure. Third, XWRAPComposer provides a user-friendly plug-and-play interface, allowing seamless incorporation of external services and continuously changing service interfaces and data formats.

    1 Introduction

With the wide deployment of Web service technology, the Internet and the World Wide Web (Web) have become one of the most popular means for disseminating scientific data from a variety of disciplines. A vast and growing amount of life-sciences data resides today in specialized bioinformatics data sources, many of which are accessible online with specialized query processing capabilities. The Molecular Biology Database Collection, for instance, currently holds over 500 data sources [1], not even including the many tools that analyze the information contained therein. Bioinformatics data sources on the Internet have a wide range of query processing capabilities. Typically, many Web-based sources allow only limited types of selection queries. To compound the problem, data from one source often must be combined with data from other sources to provide scientists with the information they need.

The extraordinary growth of service-oriented computing has been fueled by the enhanced ability to make a growing amount of information available through the Web. This brings good news and bad news. The good news is that Web services provide a standard invocation interface for remote service calls, and the bulk of useful and valuable information is designed and published in a human-browsing format (HTML or XML). The bad news is that these "human-oriented" Web pages returned by Web services make it difficult for programs to capture and extract information of interest automatically, and to fuse and integrate data from multiple autonomous and yet heterogeneous data-producer services. In addition, different web service providers use different and evolving custom data formats.

A popular approach to handling this problem is to write data wrappers that encapsulate access to Web sources and automate information extraction tasks on behalf of humans. A wrapper is a software program specialized to a single data source or single Web service (e.g., a web site), which converts the source documents and queries from the source data model to another, usually more structured, data model [15]. Several projects have implemented hand-coded wrappers for a variety of sources [11, 3, 14, 13]. However, manually writing such a wrapper and making it robust is costly due to the irregularity, heterogeneity, and frequent updates of Web sites and the data presentation formats they use. Hand-coding wrappers becomes a major burden when data integration applications need to integrate new data sources or frequently changing Web sources. We observe that, with a good design methodology, only a relatively small part of the wrapper code deals with source-specific details; the rest of the code is either common among wrappers or can be expressed in a higher-level, more structured fashion. There are a number of challenging issues in automating the wrapper code generation process.

• First, most Web pages are HTML or XML documents, which are semi-structured text files annotated with various HTML presentation tags. Due to the frequent changes in presentation style of HTML documents, the lack of semantic description of their information content, and the difficulty of making all applications in one domain use the same XML schema, it is hard to identify the content of interest using common pattern-recognition technology such as the string regular-expression specifications used in LEX and YACC.

• Second, wrappers for Web sources should be robust and adaptive in the presence of changes in both the presentation style and the information content of Web pages. Wrappers generated by wrapper generation systems are expected to have lower maintenance overhead than handcrafted wrappers when unexpected changes occur.

• Third, wrappers often serve as interface programs, passing the extracted Web data to application-specific information broker agents or information integration mediators for more sophisticated data analysis and manipulation. Thus it is desirable to provide a wrapper interface language that is simple, self-describing, and yet powerful enough to extract and capture information from most Web pages.

In scientific computing domains such as bioinformatics and bioengineering, information extraction over multiple different pages imposes additional challenges for wrapper code generation systems due to the varying correlation of the pages involved. The correlation can be either horizontal, when grouping data from homogeneous documents (such as multiple result pages from a single search), or vertical, when joining data from heterogeneous but related documents (a series of pages containing information about a specific topic). Furthermore, the correlation can be extended into a graph of workflows, as we will describe in Figure 2. Therefore there is an increasing demand for automated wrapper code generation systems to incorporate a multi-page information extraction service. A multi-page wrapper not only enriches the capability of wrappers to extract information of interest but also increases the sophistication of wrapper code generation.

Surprisingly, almost all existing wrappers generated by application code generators [8, 19, 2] are single-page wrappers, in the sense that the wrapper program responds to a keyword query by analyzing only the page immediately returned. Most wrappers cannot follow the links within this page to continue information extraction from other linked pages, unless separate queries are issued to locate those pages.

Bearing all these issues in mind, we develop a code generation framework for building a semi-automated wrapper code generation system that can generate wrappers capable of extracting information from multiple inter-linked Web documents, and we implement this framework in XWRAPComposer, a toolkit for semi-automatically generating Java wrapper programs that collect and extract data from multiple inter-linked pages automatically. XWRAPComposer has three unique features with regard to supporting multi-page data extraction.

• First, we introduce an interface specification, an outerface specification, and a composer script for each wrapper program we generate. By encoding wrapper developers' knowledge in the Interface Specification, Outerface Specification, and Composer Script, XWRAPComposer integrates single-page wrapper programs into a composite wrapper capable of extracting information across multiple inter-linked pages from one service provider.

• Second, XWRAPComposer transforms the multi-page information extraction problem into a problem of integrating multiple single-page data extraction results, and uses the composer script to interconnect a sequence of single-page data extraction results, offering flexible execution choices to address the diverse needs of different users. It generates platform-independent Java code that can be executed locally on a user's machine. It also provides a WSDL plugin module that allows users to produce WSDL-enabled wrappers as Web services [24].

• Third but not least, XWRAPComposer supports micro-workflow management, such as intermediate information flow and result auditing. We demonstrate this capability by integrating XWRAPComposer and its generated wrappers with process modeling tools such as Ptolemy [4], allowing users to interactively manage the different components of a wrapper and the interactions between them.

2 The Design Framework

Multi-page wrapper code generation is a complex process, and it is not reasonable, from either a logical or an implementation point of view, to treat the construction process as a single step. For this reason, we partition the wrapper construction process into a series of sub-processes called phases, as shown in Figure 1. A phase is a logically cohesive operation that takes as input one representation of the source document and produces as output another representation. XWrapComposer wrapper generation goes through six phases to construct and release a Java wrapper. Tasks within a phase run concurrently using a synchronized queue; each runs in its own thread. For example, we chose to run the task of fetching a remote document and the task of repairing the bad formatting of the fetched document in two concurrent, synchronized threads in a single pass over the source document. The task of generating a syntactic-token parse tree from an HTML document requires the entire document as input; thus, it cannot be done in the same pass as the remote document fetching and the syntax repair. Similar analysis applies to the other tasks, such as code generation, testing, and packaging.

The interaction and information exchange between any two phases is performed through communication with the bookkeeping and error handling routines. The bookkeeping routine of the wrapper generator collects information about all the data objects that appear in the retrieved source document, keeps track of the names used by the program, and records essential information about each. For example, a wrapper needs to know how many arguments a tag expects and whether an element represents a string or an integer. The data structure used to record this information is called a symbol table. The error handler is designed for detecting and reporting errors in the fetched source document. The error messages should allow a wrapper developer to determine exactly where the errors occurred. Errors can be encountered at virtually all phases of a wrapper. Whenever a phase of the wrapper discovers an error, it must report the error to the error handler, which issues an appropriate diagnostic message. Once the error has been noted, the wrapper must modify the input to the phase detecting the error, so that the latter can continue processing its input, looking for subsequent errors. Good error handling is difficult because certain errors can mask subsequent errors, and other errors, if not properly handled, can spawn an avalanche of spurious errors. Techniques for error recovery are beyond the scope of this paper.
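As a rough illustration of the bookkeeping routine described above, the symbol table can be a simple map from names to recorded properties. The sketch below is ours, not XWRAPComposer code; the class and field names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal symbol-table sketch for the bookkeeping routine: it maps each
// element or tag name to what the wrapper learned about it.
class SymbolTable {
    // Hypothetical per-name record: argument count and value type.
    static final class Entry {
        final int argCount;      // how many arguments the tag expects
        final String valueType;  // e.g. "string" or "integer"
        Entry(int argCount, String valueType) {
            this.argCount = argCount;
            this.valueType = valueType;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    void record(String name, int argCount, String valueType) {
        entries.put(name, new Entry(argCount, valueType));
    }

    String typeOf(String name) {
        Entry e = entries.get(name);
        return e == null ? "unknown" : e.valueType;
    }

    public static void main(String[] args) {
        SymbolTable table = new SymbolTable();
        table.record("td", 0, "string");           // a table cell holds a string
        table.record("request_id", 1, "integer");  // a numeric field
        System.out.println(table.typeOf("request_id")); // integer
        System.out.println(table.typeOf("unseen"));     // unknown
    }
}
```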

Figure 1 presents an architecture sketch of the XWRAPComposer system. The system architecture of XWRAPComposer consists of four major components: (1) Remote Connection and Source-specific Parser; (2) Multi-page Data Extraction; (3) Code Generation and Packaging; and (4) Debugging and Release. Other components include the GUI, bookkeeping, and error handling. The GUI allows wrapper developers to interactively specify the workflow of the multi-page data extraction, the request-respond flow control rules, and the cross-page data extraction rules.

Remote Connection and Source-specific Parser is the first component; it prepares and sets up the environment for the information extraction process by performing the following three tasks. First, it accepts a URL selected and entered by the XWrapComposer user, issues an HTTP request to the remote service provider identified by the given URL, and fetches the corresponding web document (or so-called page object). During this process, XWRAPComposer learns the search interface and the remote service invocation procedure in the background and generates a set of rules that describe the list of interface functions and parameters as well as how they are used to fetch a remote document from a given web source. The list of interface functions includes the declarations of the standard library routines for establishing the network connection, issuing an HTTP request to the remote web server through an HTTP Get or HTTP Post method, and fetching the corresponding web page. Other desirable functions include building the correct URL to access the given service and passing the correct parameters, and handling redirection, failures, or authorization if necessary. Second, it cleans up bad HTML tags and syntactical errors using an XWRAPComposer plugin such as HTML TIDY [18, 22]. Third, it transforms the retrieved page object into a parse tree, or so-called syntactic-token tree. This page object will be used as a sample for XWRAPComposer to interact with the user to learn and derive the important information extraction rules, and the list of linked pages the user is interested in extracting information from in conjunction with this page. In addition, all wrappers generated by XWrap use the streaming mode instead of the blocking mode (recall Section 2); namely, the wrapper reads the web page one block¹ at a time. An interface specification is created in this phase.

Figure 1: XWRAPComposer System Architecture (diagram: the Remote Connection and Source-specific Parser, Multi-page Data Extraction, Code Generation and Packaging, and Debugging and Release components; the search and remote invocation rules, request-respond flow control rules, information extraction rules, interface and outerface specifications, and extraction script flowing between them; and the wrapper repository with Web service and Ptolemy wrapper-actor extensions)
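The remote-connection tasks above can be sketched in plain Java. The fragment below illustrates only the "build the correct URL and pass the correct parameters" step; the class name and the sample parameter names are our own, not generated wrapper code (the actual fetch would then be an HTTP GET on the resulting string):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the URL-construction step of the Remote Connection component.
// Each parameter is percent-encoded and appended to the base service URL.
class RequestBuilder {
    static String buildUrl(String base, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(base);
        char sep = base.contains("?") ? '&' : '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            sb.append(sep)
              .append(URLEncoder.encode(p.getKey(), StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(p.getValue(), StandardCharsets.UTF_8));
            sep = '&';
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("QUERY", "AA045112"); // gene id from the running example
        // Hypothetical second parameter, for illustration only.
        params.put("CMD", "Put");
        System.out.println(buildUrl("http://www.ncbi.nlm.nih.gov/blast/Blast.cgi", params));
    }
}
```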

Multi-page Data Extraction is the second component; it is responsible for deriving the information flow control logic and the multi-page extraction logic, both represented in the form of rules. The former describes the flow control logic of the targeted service in responding to a service request, and the latter describes how to extract the information content of interest from the answer page and the linked pages of interest. XWRAPComposer performs the multi-page information extraction task in four steps: (1) specify the structure of the retrieved document (page object) in a declarative extraction rule language; (2) identify the interesting regions of the main page object and generate information extraction rules for this page; (3) identify the list of URLs referenced in the extracted regions of the main page; and (4) generate information extraction rules for each of the pages linked from the interesting regions of the main page object. We perform the single-page data extraction process using the XWRAPElite [8] toolkit, a single-page data extraction service developed by the XWRAP team at Georgia Tech. At the end of this phase, XWRAPComposer produces two specifications: an outerface specification that describes the output format of the extraction result, and a composer script that describes both the information flow control patterns and the multi-page data extraction patterns.

¹A block here refers to a line of 256 characters or a transfer unit defined implicitly by the HTTP protocol.

Code Generation and Packaging is the third component; it generates the wrapper program code by applying three sets of rules about the target service produced in the first two steps: (1) the search and remote invocation rules, (2) the request-respond flow control rules, and (3) the information extraction rules. A key technique in our implementation is the smart encoding of these three types of semantic knowledge in the form of an active XML-template format (see Section 4 for details). The code generator interprets the XML-template rules by linking each executable component with the corresponding rule sets. The code generator also produces the XML representation of the retrieved sample page object as a byproduct.
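The idea of interpreting XML-template rules by linking each executable component with its rule set can be sketched as follows. The rule vocabulary (`<rule component='...'>`) is invented for illustration; the paper does not publish the actual template format:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

// Sketch of interpreting an "active XML-template" rule set: each <rule>
// element names the executable component it binds to, and the interpreter
// emits the invocation plan the code generator would wire together.
class TemplateInterpreter {
    static String describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList rules = doc.getElementsByTagName("rule");
            StringBuilder plan = new StringBuilder();
            for (int i = 0; i < rules.getLength(); i++) {
                Element rule = (Element) rules.item(i);
                // Link each rule to its executable component by name.
                plan.append(rule.getAttribute("component"))
                    .append("(").append(rule.getTextContent().trim()).append(");");
            }
            return plan.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<rules>"
                + "<rule component='FetchDocument'>blastSummaryPage</rule>"
                + "<rule component='ExtractContent'>recordid</rule>"
                + "</rules>";
        System.out.println(describe(xml));
        // FetchDocument(blastSummaryPage);ExtractContent(recordid);
    }
}
```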

Debugging and Release is the fourth component and the final phase of the multi-page wrapping process. It allows the user to enter a set of alternative service requests to the same service provider to debug the generated wrapper program by running XWRAPComposer's code debugging module. For each page object obtained, the debugging module automatically goes through the syntactic structure normalization to rule out syntactic errors, and through the flow control and information extraction steps to check whether new or updated flow control rules or data extraction rules should be included. In addition, the debug-monitoring window pops up to allow the user to browse the debug report. Whenever an update to any of the three sets of rules occurs, the debugging module runs the code generator to create a new version of the wrapper program. Once the user is satisfied with the test results, he or she may invoke the release step to obtain the release version of the wrapper program; this includes assigning the version release number and packaging the wrapper program, application plug-ins, and user manual into a compressed tar file.

The XWRAPComposer wrapper generator takes three inputs: an interface specification, an outerface specification, and a composer script, and compiles them into a Java wrapper program, which can be further extended into either a multi-page data extraction Web service (with a WSDL specification) or a Ptolemy wrapper actor that can be used for large-scale data integration.

In the subsequent sections, we first provide a walkthrough example to illustrate the multi-page extraction process. Next, we focus our discussion primarily on the multi-page data extraction component of XWrapComposer, and give a brief description of the wrapping interface and remote invocation component as the necessary preprocessing step for information extraction, together with a short summary of code generation as the postprocessing step for multi-page extraction.

3 Example Walkthrough

Before describing the detailed techniques used in designing multi-page data extraction services, we first present a walkthrough of XWRAPComposer using the motivating example introduced in Figure 2, in which a biologist first uses a program called Clusfavor to cluster genes that changed significantly in a micro-array analysis experiment. After extracting all gene ids from the Clusfavor result, he feeds them into the NCBI BLAST service, which searches all related sequences over a variety of data sources. The returned sequences are further examined to find promoter sequences. Let us focus on the NCBI BLAST service. Figure 2 shows the workflow of how a BLAST service request to NCBI is served. It consists of four steps: the BLAST Response step presents the user with a request ID; the BLAST Delay step presents the user with the time delay for the result; BLAST Summary presents the user with an overview of all gene ids that match well with the given gene sequence id; finally, BLAST Detail shows, for each gene id listed in the summary page, the full sequence detail. The goal is to extract approximately 1000-5000 bases of the DNA sequence around the alignment to capture the promoter regulatory elements, the region of a gene where RNA polymerase can bind and begin transcription to create the proteins that regulate cell function.
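The four-step BLAST workflow above is, in essence, request-respond flow control logic: classify each page the service returns and decide the next action. Below is a minimal sketch, assuming hypothetical textual markers for the page types (only the request-ID marker also appears in the paper's own script fragment):

```java
// Sketch of the request-respond flow-control logic for the NCBI BLAST
// example: classify each returned page and decide the next action.
// The DELAY and SUMMARY markers are invented for illustration.
class BlastFlowControl {
    enum Page { RESPONSE, DELAY, SUMMARY, ERROR }

    static Page classify(String pageText) {
        if (pageText.contains("The request ID is")) return Page.RESPONSE;
        if (pageText.contains("results are not yet ready")) return Page.DELAY;        // hypothetical marker
        if (pageText.contains("Sequences producing significant alignments")) return Page.SUMMARY; // hypothetical marker
        return Page.ERROR;
    }

    static String nextAction(Page page) {
        switch (page) {
            case RESPONSE: return "poll with request id";
            case DELAY:    return "wait and poll again";
            case SUMMARY:  return "extract summary rows, then fetch each detail URL";
            default:       return "report error to the caller";
        }
    }

    public static void main(String[] args) {
        Page p = classify("... The request ID is 1047 ...");
        System.out.println(p + " -> " + nextAction(p));
    }
}
```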

Figure 2: A Scientific Data Integration Example Scenario (workflow diagram: microarray analysis yields genes that changed significantly; CLUSFAVOR performs statistical clustering of genes and produces gene ids such as AA045112; NCBI BLAST searches a variety of data sources for common promoter elements to link new candidate genes, proceeding through BLAST Response (request ID), BLAST Delay, BLAST Summary (summary URL), and BLAST Detail (detail URL) to return all related sequences; data integration then produces promoter sequences)

Figure 3 illustrates a typical BLAST query using the NCBI service [17]. A BLAST query involves four steps. The first step is to feed a gene sequence into the text entry of the query interface. Due to the time complexity of a BLAST search, the NCBI service provider typically returns a response page with a request ID and a first estimate of the waiting time for the BLAST search. The biologist may later ask NCBI for the BLAST results using the request ID (Step 2); the NCBI service presents a delay page if the BLAST search is not completed and results are not yet ready to display (Step 3). Once the BLAST results are delivered, they are displayed in a BLAST summary page, which contains a summary of all genes matching the search query condition. Each of the matching genes provides a link to the NCBI BLAST Detail page (Step 4). If the gene id used for the BLAST query is incorrect or NCBI does not provide BLAST service for the given gene id, an error page is displayed. If the summary page does not include the detailed information that the biologist is interested in, he has to visit each detail page (Step 5) through the URLs embedded in the summary page.

A critical challenge in providing system-level support for scientists to achieve such complex data integration tasks is the problem of locating, accessing, and fusing information from a rapidly growing, heterogeneous, and distributed collection of data sources available on the Web. This is a complex search problem for two reasons. First, as the example in Figure 2 shows, scientists today have much more complex data collection requirements than ordinary surfers on the Web. They often want to collect a set of data from a sequence of searches over a large selection of heterogeneous data sources, and the data selected from one search step often forms the filter condition for the next search step, turning a keyword-based query into a sophisticated search and information extraction workflow. Second, such complex workflows are performed manually every day by scientists or data collection lab researchers (computer science specialists). Automating such complex search and data collection workflows presents three major challenges.

• Different service providers use different request-respond flow control logic to present the answer pages to search queries.

• Cross-page data extraction has more complex extraction logic than a single-page extraction system. In addition, different applications require different sets of data to be extracted by the cross-page data extraction engine. Typically, only portions of one page and the links that lead the extraction to the next page need to be extracted.

• Data items extracted from multiple inter-linked pages must be associated with a semantically meaningful naming convention. Thus, mechanisms that can incorporate the knowledge of the domain scientists who issue such cross-page extraction jobs are critical.

There are several ways to design an NCBI BLAST wrapper. First, we can develop two wrappers, one for the NCBI BLAST Summary and one for the NCBI BLAST Detail. The NCBI BLAST Summary wrapper can be integrated with the NCBI BLAST Detail wrapper by service composition. In this approach, we need to capture the request-respond flow control through flow control logic in the composer script of the NCBI Summary wrapper. The outerface specification of the NCBI Summary wrapper consists of the general overview of the given gene id and the list of gene ids that are relevant to it. The NCBI BLAST Detail wrapper needs to extract approximately 1000-5000 bases of the DNA sequence around the alignment. The composite NCBI BLAST wrapper is composed of the NCBI Summary wrapper and a list of executions of the NCBI BLAST Detail wrapper. In the next section we describe the XWRAPComposer design using this example.
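The composition described above, a summary wrapper whose extracted links drive repeated executions of a detail wrapper, can be sketched as follows. The functional interfaces are our own simplification of the generated wrapper API, with stub wrappers standing in for the real NCBI ones:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of composing a summary wrapper with a detail wrapper, as in the
// NCBI BLAST design: the summary wrapper yields detail URLs, and the
// composite applies the detail wrapper to each of them.
class CompositeWrapper {
    static List<String> compose(Function<String, List<String>> summaryWrapper,
                                Function<String, String> detailWrapper,
                                String geneId) {
        List<String> sequences = new ArrayList<>();
        for (String detailUrl : summaryWrapper.apply(geneId)) {
            sequences.add(detailWrapper.apply(detailUrl)); // one detail extraction per link
        }
        return sequences;
    }

    public static void main(String[] args) {
        // Stub wrappers standing in for the generated NCBI wrappers.
        Function<String, List<String>> summary = id -> List.of(id + "/hit1", id + "/hit2");
        Function<String, String> detail = url -> "sequence-from-" + url;
        System.out.println(compose(summary, detail, "AA045112"));
    }
}
```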


Figure 3: Multi-page query with an NCBI web site (screenshots of Steps 1 through 5 of the BLAST query)

    4 Multi-Page Data Extraction Service

We have developed a methodology and a framework for extracting information from multiple pages connected via web page links. The main idea is to separate what to extract from how to extract it, and to distinguish information extraction logic from request-respond flow control logic. The control logic describes the different ways in which a service request (query) can be answered by a given service provider. The data extraction logic describes the cross-page extraction steps, including what information is important to extract on each page and how such information is used as a complex filter in the next search and extraction step.

We use the interface description to specify the necessary input objects for wrapping the target service, and the outerface description to describe what should be extracted and presented as the final result by the wrapper program. We design and develop an XWRAPComposer script language (a set of functional constructs) to describe the request-respond flow control logic and the multi-page data extraction logic, and to implement the output alignment and tagging of extracted data items based on the outerface specification.

The compilation process of XWRAPComposer includes generating code based on three sets of rules: (1) the remote connection and interface rules; (2) the request-respond flow control logic and multi-page extraction logic outlined in the composer script; and (3) the correct output alignment and semantically meaningful tagging based on the outerface specification.

    4.1 Interface and Outerface Specification

The interface specification describes the schema of the data that the wrapper takes as input. It defines the source location and the service request (query) interface for the wrapper to be generated. The outerface specification describes the schema of the result that the wrapper outputs. It defines the type and structure of the objects extracted. The composer script consists of two sets of rule-based scripts. The request-respond flow control script describes the alternative ways that the target service can respond to a remote service request, including result not found, multiple results found, single result found, or server errors. The multi-page data extraction script describes (1) the extraction rules for the main page, (2) the extraction rules for each of the interesting pages linked from the main page, and (3) the rules on how to glue single-page data extraction components together. XWRAPComposer's scripting language has domain-specific plugins to facilitate the incorporation of domain-dependent correlations between the fragments of information extracted and the domain-specific tagging scheme. Each wrapper generated by XWRAPComposer is associated with an interface specification, an outerface description, and a composer script.

The design of the XWRAPComposer interface and outerface specification serves two important objectives. First and foremost, it eases the use of XWRAP wrappers as external services in any data integration application. Second, it facilitates the generation of Java code by the XWRAPComposer wrapper code generation system. Therefore, some components of the specification may not be directly useful to the users of these wrappers. In the first release of the XWRAPComposer implementation, we describe the input and output schema of a multi-page (composite) wrapper in XML Schema and use the two XML schemas as the interface and outerface specification. Concretely, the interface specification describes the wrapper name and which data provider's service is to be wrapped, by giving the source URL and other related information. The outerface specification describes what data items should be extracted and produced by the wrapper and the semantically meaningful names to be used to tag those data items. Figure 4 shows a fragment of the interface and outerface description of an example NCBI BLAST summary wrapper [21].
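As an illustration of the idea (a hypothetical fragment of our own, not the paper's actual Figure 4), an outerface specification in XML Schema for the NCBI summary wrapper might declare the request ID and the list of matching gene ids:

```xml
<!-- Hypothetical outerface sketch: the extracted items and the
     semantically meaningful tags used to label them. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="ncbisummary">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="requestId" type="xsd:string"/>
        <xsd:element name="matchingGeneId" type="xsd:string"
                     minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```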

    /* Start constructing wrapper ncbisummary. */
    WrapperName "ncbisummary";

    /* Construct the URL for the NCBI BLAST search */
    Generate blastSummaryPage :: ConstructHttpQuery (input) {
        Set inputSource {
            Set url {"http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?QUERY=$$&..."};
            Set queryString { };
            Set method {"get"};
            Set variable { [text()] };
        }
    }
    Generate blastSummaryData :: FetchDocument (blastSummaryPage) { }
    Generate recordid :: ExtractContent (blastSummaryData) {
        GrabSubstring {
            Set BeginMatch {"The request ID is

Figure 9: Ptolemy Wrapper Actor Result Example – NCBI BLAST Detail

    4.4 WSDL-enabled Wrappers

Figure 10: Web-service Enabled Wrappers (diagram: XwrapComposer generates the wrapper Java program from the interface and outerface specifications; a WSDL generator and servlet configuration answer WSDL requests with WSDL responses; on the Web service server, a SOAP handler extracts the input from each SOAP request, feeds it to the wrapper, and packages the output into the SOAP response)

XWRAPComposer is developed with two objectives in mind. First, we want to generate wrapper programs that can be used from the command line or embedded in an application system as a wrapper procedure. Second, we want XWRAPComposer to be able to generate WSDL-enabled wrappers that allow each wrapper program to be used as a Web service [23]. Our discussion so far has focused on the first objective. In this section we briefly describe how to generate WSDL-enabled wrappers.

In order to enable XWRAPComposer to generate WSDL-enabled wrapper services, we add two extensions to the XWRAPComposer wrapper generation system. First, we encapsulate an XWRAPComposer wrapper in a general Web service servlet. The servlet automatically extracts the input from a SOAP request, feeds it to the wrapper, and inserts the wrapping results in a SOAP envelope before sending the response back to the user. Second, we incorporate a WSDL generator that automatically produces the Web service description by binding the wrapper's interface and outerface to the servlet configuration. Figure 10 shows the extensions added to XWRAPComposer to produce wrappers as WSDL Web services.
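The servlet's two SOAP-facing steps, extracting the wrapper input from the request envelope and packaging the wrapping result into the response envelope, can be sketched with the JDK's DOM parser. This is a minimal illustration, not the actual XWRAPComposer servlet: the element names `query`, `wrapRequest`, and `wrapResponse` are assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class SoapHandlerSketch {
    static final String SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/";

    // Pull the wrapper input out of the SOAP request body.
    // "query" is an illustrative element name, not XWRAPComposer's schema.
    public static String extractInput(String soapRequest) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(soapRequest)));
        return doc.getElementsByTagName("query").item(0).getTextContent();
    }

    // Insert the wrapping result into a SOAP envelope for the response.
    public static String packageOutput(String wrapperResult) {
        return "<soap:Envelope xmlns:soap='" + SOAP_NS + "'>"
             + "<soap:Body><wrapResponse>" + wrapperResult
             + "</wrapResponse></soap:Body></soap:Envelope>";
    }

    public static void main(String[] args) throws Exception {
        String request =
            "<soap:Envelope xmlns:soap='" + SOAP_NS + "'>"
          + "<soap:Body><wrapRequest><query>ATGGCC</query></wrapRequest>"
          + "</soap:Body></soap:Envelope>";
        String input = extractInput(request);          // the wrapper's input
        String response = packageOutput("recordid=123");
        System.out.println(input + " | " + response.contains("recordid=123"));
    }
}
```

In the real system the wrapper call sits between these two steps; everything else in the servlet is service-independent boilerplate, which is why it can be generated once and reused for every wrapper.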

5 Related Work and Conclusion

The very nature of scientific research and discovery leads to the continuous creation of information that is new in content, representation, or both. Despite efforts to fit molecular biology information into standard formats and repositories such as the PDB (Protein Data Bank) and NCBI, the number of databases and their content have been growing, pushing the envelope of standardization efforts such as mmCIF [26]. Providing integrated and uniform access to these databases has been a serious research challenge. Several efforts [6, 7, 10, 12, 16, 20] have sought to alleviate the interoperability issue by translating queries from a uniform query language into the native query capabilities supported by the individual data sources. Typically, these previous efforts address the interoperability problem from a digital library point of view, i.e., they treat individual databases as well-known sources of existing information. While they provide a valuable service, due to the growing rate of scientific discovery, an increasing amount of new information (the kind of hot-off-the-bench information that scientists would be most interested in) falls outside the capability of these previous interoperability systems and services.

Wrappers have been developed either manually or with software assistance, and are used as components of agent-based systems, sophisticated query tools, and general mediator-based information integration systems [27].

XWRAPComposer differs from those systems in three respects. First, we explicitly separate the tasks of building wrappers that are specific to a Web service from the tasks that are repetitive across services, so that the repetitive code can be generated as wrapper library components and reused automatically by the wrapper generator. Second, we use inductive learning algorithms that derive information flow and data extraction patterns by reasoning about sample pages or sample specifications. Most importantly, we design a declarative rule-based script language for multi-page information extraction, encouraging a clean separation of the information extraction semantics from the information flow control and execution logic of wrapper programs.

Three unique features distinguish XWRAPComposer from existing wrapper development approaches. First, XWRAPComposer is designed to enable multi-stage, multi-page data extraction. Second, XWRAPComposer is the only wrapper generation system that promotes the separation of information extraction logic from request-response flow control logic, allowing a higher level of robustness against changes in the service provider's web site design or infrastructure. Third, XWRAPComposer provides a user-friendly plug-and-play interface, allowing seamless incorporation of external services and continuously changing service interfaces and data formats.

References

[1] DBCAT, the public catalog of databases. http://www.infobiogen.fr/services/dbcat.

[2] R. Baumgartner, S. Flesca, and G. Gottlob. Visual web information extraction with Lixto. Proceedings of VLDB, 2001.

[3] R. Bayardo, W. Bohrer, et al. Semantic integration of information in open and dynamic environments. In Proceedings of ACM SIGMOD Conference, 1997.

[4] Berkeley. Ptolemy group in EECS. http://ptolemy.eecs.berkeley.edu/, 2003.

[5] D. Buttler, L. Liu, and C. Pu. A fully automated object extraction system for the World Wide Web. Proceedings of IEEE ICDCS, April 2001.

[6] T. Critchlow, K. Fidelis, M. Ganesh, R. Musick, and T. Slezak. DataFoundry: Information management for scientific data. IEEE Transactions on Information Technology in Biomedicine, 4(1):52-57, March 2000.

[7] S. Davidson, O. Buneman, J. Crabtree, V. Tannen, G. Overton, and L. Wong. BioKleisli: Integrating biomedical data and analysis packages. Bioinformatics: Databases and Systems, S. Letovsky, Editor, Kluwer Academic Publishers, Norwell, MA:201-211, 1999.

[8] DISL Group, Georgia Institute of Technology. XWRAP Elite Project, 2000.

[9] DISL Group, Georgia Institute of Technology. XWRAPComposer. http://www.cc.gatech.edu/projects/disl/XWRAPComposer/, 2003.

[10] C. A. Goble, R. Stevens, G. Ng, S. Bechhofer, N. Paton, P. G. Baker, M. Peim, and A. Brass. Transparent access to multiple bioinformatics information sources. IBM Systems Journal, 40(2):532-551, 2001.

[11] L. Haas, D. Kossmann, E. Wimmers, and J. Yan. Optimizing queries across diverse data sources. In Proceedings of VLDB, 1997.

[12] L. Haas, P. Schwarz, P. Kodali, E. Kotlar, J. Rice, and W. Swope. DiscoveryLink: A system for integrated access to life sciences data sources. IBM Systems Journal, 40(2):489-511, 2001.

[13] C. A. Knoblock, S. Minton, J. L. Ambite, N. Ashish, P. J. Modi, I. Muslea, A. Philpot, and S. Tejada. Modeling web sources for information integration. In Proceedings of AAAI Conference, 1998.

[14] C. Li, R. Yerneni, V. Vassalos, H. Garcia-Molina, Y. Papakonstantinou, J. Ullman, and M. Valiveti. Capability based mediation in TSIMMIS. In Proceedings of ACM SIGMOD Conference, 1997.

[15] L. Liu, C. Pu, and W. Han. XWrap: An Extensible Wrapper Construction System for Internet Information Sources. Technical Report, OGI/CSE, Feb. 1999.

[16] S. McGinnis (GenBank User Services, National Center for Biotechnology Information (NCBI), National Library of Medicine, US National Institutes of Health). Personal Communication, Jan.

[17] NCBI. National Center for Biotechnology Information – BLAST databases. http://www.ncbi.nlm.nih.gov/BLAST/, 2003.

[18] D. Raggett. Clean up your web pages with HTML TIDY. http://www.w3.org/People/Raggett/tidy/, 1999.

[19] A. Sahuguet and F. Azavant. WysiWyg Web Wrapper Factory (W4F). Proceedings of WWW Conference, 1999.

[20] A. C. Siepel, A. N. Tolopko, A. D. Farmer, P. A. Steadman, F. D. Schilkey, B. Perry, and W. D. Beavis. An integration platform for heterogeneous bioinformatics software components. IBM Systems Journal, 40(2):570-591, 2001.

[21] L. Team. LDRD Project, 2004.

[22] W3C. Reformulating HTML in XML. http://www.w3.org/TR/WD-html-in-xml/, 1999.

[23] W3C. Web services. http://www.w3c.org/2002/ws/, 2002.

[24] W3C. Web Services Description Language (WSDL) version 1.2 part 1: Core language. http://www.w3c.org/TR/wsdl12/, 2003.

[25] H. Wei. Wrapper Application Generation for Semantic Web: An XWRAP Approach. PhD thesis, Georgia Institute of Technology, 2003.

[26] J. Westbrook and P. Bourne. STAR/mmCIF: An extensive ontology for macromolecular structure and beyond. Bioinformatics, 16(2):159-168, 2000.

[27] G. Wiederhold. Mediators in the architecture of future information systems. IEEE Computer, 1992.
