
SERVICE-ORIENTED ARCHITECTURE FOR INTEGRATION OF

BIOINFORMATIC DATA AND APPLICATIONS

A Dissertation

Submitted to the Graduate School

of the University of Notre Dame

in Partial Fulfillment of the Requirements

for the Degree of

Doctor of Philosophy

by

Xiaorong Xiang, B.S., M.S.

Gregory R. Madey, Director

Graduate Program in Computer Science and Engineering

Notre Dame, Indiana

April 2007

© Copyright by

Xiaorong Xiang

2007

All Rights Reserved

SERVICE-ORIENTED ARCHITECTURE FOR INTEGRATION OF

BIOINFORMATIC DATA AND APPLICATIONS

Abstract

by

Xiaorong Xiang

Service-oriented architecture (SOA) is a new paradigm for distributed computing that originated in industry. It is recognized as a promising architecture for application integration inside and across organizations. Since their introduction, semantic web and web services technologies have attracted increasing interest for the implementation of e-Science infrastructures. In this dissertation, we survey current research trends and challenges in adopting SOA in general. We present a practical experiment in building a service-oriented system for data integration and analysis using current web services technologies and bioinformatics middleware. The system is enhanced with an ontological model for the semantic annotation of services and data. It demonstrates that adopting SOA in the e-Science field can accelerate the scientific research process. A new methodology and an enhanced system design are proposed to facilitate the reuse of workflows and verified knowledge.

DEDICATION

To my parents, my husband, and my son


CONTENTS

FIGURES
TABLES
ACKNOWLEDGMENTS

CHAPTER 1: INTRODUCTION
    1.1 Main contributions of the dissertation
    1.2 Organization of the dissertation

CHAPTER 2: RESEARCH ISSUES AND CHALLENGES IN SERVICE-ORIENTED COMPUTING
    2.1 Introduction
    2.2 Overview of related concepts and technologies
        2.2.1 Web services
        2.2.2 Semantic web
        2.2.3 Grid computing
        2.2.4 Peer-to-peer computing
    2.3 Issues in the service-oriented computing
        2.3.1 Service description
        2.3.2 Service discovery
        2.3.3 Service composition
        2.3.4 Service execution
    2.4 Service-oriented computing in e-Science
    2.5 Conclusion

CHAPTER 3: A SERVICE-ORIENTED DATA INTEGRATION AND ANALYSIS ENVIRONMENT FOR BIOINFORMATICS RESEARCH
    3.1 Introduction
    3.2 Related work
    3.3 Motivation
        3.3.1 Use case
        3.3.2 Operational barriers
    3.4 System architecture
        3.4.1 Data storage and access service
        3.4.2 Service and workflow registry
        3.4.3 Indexing and querying metadata
        3.4.4 Service and workflow enactment
    3.5 Implementation
        3.5.1 Development and deployment tools
        3.5.2 Services provision
        3.5.3 Workflow engine
        3.5.4 Building workflows
        3.5.5 Web interface
    3.6 Discussion
        3.6.1 Issues with the first prototype
        3.6.2 Extension of the system
    3.7 Conclusion

CHAPTER 4: EXPLORING THE DEEP PHYLOGENY OF THE PLASTIDS WITH THE MOGSERV
    4.1 Introduction
    4.2 System and methods
        4.2.1 Data model
        4.2.2 Services
        4.2.3 Data collection
        4.2.4 Local query
        4.2.5 Set management
        4.2.6 ClustalW
        4.2.7 Blast
        4.2.8 Phylip and Paup
        4.2.9 Data conversion
    4.3 Results of case studies
        4.3.1 Case study: the rediscovery of Erythrobacter litoralis
    4.4 Summary

CHAPTER 5: ONTOLOGICAL REPRESENTATION MODEL
    5.1 The MoG life sciences project and biomedical application
    5.2 Ontological representation model
        5.2.1 RDF, OWL, and DIG reasoner
        5.2.2 Generic service description ontology
        5.2.3 Service domain ontology
        5.2.4 MoG application domain ontology
    5.3 Implementation
    5.4 Conclusion

CHAPTER 6: IMPROVING THE REUSE OF THE SCIENTIFIC WORKFLOW
    6.1 Introduction
    6.2 A hierarchical workflow structure
    6.3 An enhanced workflow system
        6.3.1 Knowledge management
        6.3.2 Knowledge discovery
    6.4 Translation process
        6.4.1 Service discovery and matchmaking process
        6.4.2 Knowledge reuse
        6.4.3 Implementation and evaluation
    6.5 Workflow reuse
    6.6 Related work
    6.7 Conclusion and future work

CHAPTER 7: SUMMARY AND FUTURE WORKS
    7.1 Summary
    7.2 Limitations and future work

APPENDIX A: GLOSSARY
    A.1 Pictures

APPENDIX B: MOGSERV MANUAL
    B.1 Main
    B.2 Retrieve genome and gene data from NCBI database
    B.3 Query local database
    B.4 Set management
    B.5 Data analysis services
    B.6 Job management

APPENDIX C: DEVELOPMENT AND DEPLOYMENT TOOLKITS

APPENDIX D: SUPPLEMENTARY MATERIAL FOR CHAPTER 3 AND CHAPTER 4
    D.1 Complete genome sequence in XML
    D.2 Example of an ATP synthase subunit B sequence
    D.3 Protein name
    D.4 Syntax of search local database
    D.5 Workflow of retrieve sequence
    D.6 ClustalW input
    D.7 Blast
    D.8 PAUP

APPENDIX E: SUPPLEMENTARY MATERIAL FOR CHAPTER 5 AND CHAPTER 6

BIBLIOGRAPHY

FIGURES

1.1 The evolution of the Web: yesterday’s web is a repository for text and images; today’s web is a platform to publish and access dynamically changing new types of contents provided by a variety of services. [8]

2.1 Two basic components in a simple service-oriented architecture. A service requester at the right sends a service request message to a service provider at the left. The service provider returns a response message to the service requester.

2.2 Web services standards stack includes multiple layered and interrelated open standards.

2.3 Venn diagram representation of the integration of web service, grid computing, semantic web, and peer-to-peer technologies into the realization of a service-oriented architecture

2.4 A common service lifecycle in a service-oriented architecture includes service publication, service discovery, and service invocation processes.

2.5 Broker-based service discovery mechanism. A service discovery broker accepts requests from service requesters, translates requests into appropriate formats, and sends them to multiple registries. The returned results may be unified and distilled based on requesters’ needs.

2.6 P2P-based discovery mechanism containing a data layer, a communication layer, and peers that control registries or service providers.

2.7 Summary of existing service discovery systems with different discovery mechanisms mapped relative to three characteristics: degree of decentralization, richness of service descriptions, and static or dynamic.

3.1 A manual phylogenetic data collection and data analysis process

3.2 MoGServ system architecture includes a services access client, MoGServ middle layer, and other data and services providers

3.3 Asynchronized services and workflow invocation model

3.4 A workflow built using Taverna workbench to get complete genome sequences and specific gene sequences

3.5 A workflow for querying two subset sequences from local database, filtering out sequences coming from same organism, and doing sequence alignment analysis

3.6 Abstraction of user defined workflows

4.1 The growth of sequence databases (NCBI GenBank and EBI Swiss-Prot) and annotations. This figure is from Folker Meyer [57]

4.2 Entity relationship diagram of the data model in MoGServ created by SQL::Translator

5.1 An RDF graph model to represent some information for describing the MoG project web site

5.2 Main concepts and partial relationships defined in the MoG application domain ontology

5.3 The software components implementation of annotation and querying meta data

6.1 A four level hierarchical workflow structure representation and transformation of scientific processes

6.2 An example illustrates the user-oriented workflow definition with different levels of knowledge

6.3 An enhanced workflow system with two added components, knowledge management and knowledge discovery

6.4 The mismatching problem may be introduced due to the inaccurate annotation, incomplete semantic annotation, and inaccurate ontological reasoning during the translation process.

6.5 The creation process of connectivity graph when a new service is added in the registry, the connectivity is refined and updated during the workflow translation process.

6.6 The graph representation of a workflow for describing a scientific process

A.1 Time line for the origin of life and major invasions giving rise to mitochondria and plastids. [27]

A.2 Gene transfer to the nucleus. [27]

A.3 Symbioses process [69]

A.4 ATP Synthase: the wheel that powers life. It is a candidate for ascertainment of deep phylogeny.

B.1 The main menu of the MoGServ

B.2 A web interface provides users a way to define data with interests.

B.3 Input the query term from this interface and choose gene or genome database

B.4 The results from querying local database

B.5 Users may copy and paste particular sequences and upload them to the local database

B.6 Set information

B.7 The set filter service is used to find intersection of organisms among multiple sets.

B.8 tblastn interface in MoGServ

B.9 ClustalW interface in MoGServ

B.10 Job management interface shows the status, input link, output link of a job

B.11 An example input of a clustalW analysis, set id is a hot link, users can view sequence information in this set.

B.12 An example output of a clustalW analysis, users can download, convert, view the results.

D.1 Phylogenetic tree generated from the PAUP

D.2 Phylogenetic tree file generated from the PAUP can be viewed by other program

E.1 This is the WSDL description of QueryLocal service hosted in the MoGServ, which provides an operation to create a set in the local database. This operation accepts two parameters and returns the set id.

E.2 One example of using Taverna workbench to create, test, and run workflow. This workflow accepts users input, searches the local database, creates a set, aligns the set using ClustalW, and converts the ClustalW result to NEXUS format, which can be fed to PAUP.

E.3 XScufl workflow format represents the workflow created using the Taverna workbench.

E.4 Annotation of job and set information using ontological model defined. The sample rdf file is displayed using RDF Gravity.

E.5 Annotation of a service using ontological model defined. The sample rdf file is displayed using RDF Gravity.

TABLES

2.1 SYNTACTIC AND SEMANTIC DESCRIPTION METHODS FOR WEB SERVICES

2.2 EXISTING DEPLOYMENT AND EXECUTION ENGINES FOR ATOMIC AND COMPOSITE SERVICES

2.3 LIFE SCIENCES RESOURCES AVAILABLE AS WEB SERVICES

3.1 ATTRIBUTES FOR SERVICES AND WORKFLOWS DESCRIPTION

6.1 PERFORMANCE EVALUATION OF MATCH DETECTION PROCESS

6.2 PERFORMANCE EVALUATION OF PATH SEARCHING PROCESS

C.1 OPEN SOURCE SOFTWARE PACKAGES USED FOR DEVELOPMENT AND DEPLOYMENT

D.1 NAME OF ATP SYNTHASE

D.2 SYNTAX OF SEARCHING LOCAL DATABASE

D.3 INDEXING FIELD OF LOCAL DATABASE

ACKNOWLEDGMENTS

I would like to thank Dr. Gregory Madey for his encouragement and guidance on my research. I thank him for always saying “Life is short” and for his kindness, patience, and confidence in me. I appreciate that he gives his students as much freedom as possible in selecting research topics that best suit our interests, and that he seeks out collaborative opportunities to help us fulfill our goals. His spirit of never ceasing to learn new material and never being afraid to explore new research areas encouraged me on the way to finishing this dissertation and will continue to encourage me in my future work. I thank him for his many efforts to educate us as independent researchers.

Many thanks go to Dr. Jeanne Romero-Severson for providing me with use cases and training in the biological field, and for her prompt feedback on my work. I would like to thank Dr. Amitabh Chaudhary for answering my questions about algorithms and for discussing my research topics with me.

I would like to thank my committee members Dr. Patrick J. Flynn, Dr. Aaron

Striegel, and Dr. Jeanne Romero-Severson for their valuable contributions.

I would also like to thank my son for trying hard not to bother me too much while I was busy working and for giving me excuses to relax. Many thanks go to my husband, my parents, and my friends for their constant emotional support, no matter how much frustration I had.

This research work is partially supported by the Indiana Center for Insect Genomics (ICIG) with funding from the Indiana 21st Century fund.

CHAPTER 1

INTRODUCTION

The first generation of the World Wide Web (the Web), which appeared in 1990, mainly served as a repository for text and images presented in HTML format. Nowadays, the Web is evolving into a platform to publish and access dynamically changing new types of content provided by a variety of services that are realized with web-accessible programs, databases, and physical devices. Tim Berners-Lee et al. [8] present the evolution of the Web (see Figure 1.1); the authors emphasize the importance of “understanding the current, evolving, and potential Web” in the article “Creating a Science of the Web”.

The Web has been used in e-commerce and Business-to-Business (B2B) applications to deliver information and provide services to customers and business partners. For example, a travel agency provides services for travelers to view and compare airfares and to book tickets and hotels on-line. As the number of service transactions between businesses increases, so does the demand for interoperability between these applications, and the service-oriented architecture (SOA) has been proposed as an underlying architecture to enhance this capability. Although it has many definitions and no standard one, SOA is commonly accepted as a new architectural style that enables the combination of, and communication among, loosely coupled services. These services are described with a standard interface definition that hides the implementation language and platform of each service. A service can be called to perform a task without the service having prior knowledge of the calling application, and without the application having or needing knowledge of how the service actually performs its tasks.

Figure 1.1. The evolution of the Web: yesterday’s web is a repository for text and images; today’s web is a platform to publish and access dynamically changing new types of contents provided by a variety of services. [8]

[Figure: two panels, Yesterday and Today, with layer labels Browser; HTML; HTTP; URL; Browser, blog, wiki, data integration; Interaction; Semantic Web; Web Services; Multimodal; XML; Privacy, security, accessibility, mobility; HTTP, SOAP, …; URI. This picture is adapted from the article “Creating a Science of the Web” by Tim Berners-Lee et al.]

The realization of a service-oriented architecture is not tied to any specific technology or protocol. The web service standards, including SOAP, WSDL, and UDDI, have been widely accepted as a realization of SOA and are supported by a number of tools. As a result, a service-oriented architecture is often defined as a set of services exposed using this web service protocol stack, and a SOA based system can be understood as a system developed using these technologies. Building a SOA based system can help businesses respond more quickly and cost-effectively to changing market conditions. It promotes the reuse of existing legacy applications as services and simplifies the interconnection of distributed business processes inside organizations or across organizational boundaries.
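To make the message layer of the web service protocol stack more concrete, the following minimal sketch builds and parses a SOAP 1.1 request envelope using only the Python standard library. It is an illustration under stated assumptions rather than code from the system described in this dissertation: the service namespace, the operation name, and the parameter names passed in by the caller are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace (defined by the SOAP 1.1 specification).
SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical namespace for an example service; not taken from the dissertation.
SERVICE_NS = "urn:example:query-service"


def build_soap_request(operation, params):
    """Return a SOAP 1.1 envelope (as text) that calls `operation` with `params`."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        # Each parameter becomes an unqualified child element of the call.
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def parse_soap_request(xml_text):
    """Recover (operation, params) from an envelope built by build_soap_request."""
    envelope = ET.fromstring(xml_text)
    body = envelope.find(f"{{{SOAP_ENV}}}Body")
    call = body[0]  # the first child of Body is the operation element
    operation = call.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
    return operation, {child.tag: child.text for child in call}
```

The requester and the provider share only the message format: neither side needs to know the language or platform of the other, which is the loose coupling that the standards stack is meant to provide.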

As stated in the article [8], the Web “has changed the ways scientists communicate, collaborate, and educate”. The evolution of Web use in the e-Science field is similar to that in the e-business domain. The effort to build e-Science infrastructure started with the development of gateways or portals that provide access to integrated databases and computing resources behind a web-based user interface in multiple scientific fields. Examples of such fields include social simulation, physics, environmental sciences, and bioinformatics. This infrastructure has been used to solve problems such as distributed analysis of physical or astronomical data and remote access to information sources and simulations. It facilitates the use of computational resources located at different physical sites, thereby allowing users at different locations to easily share information and communicate with each other. More recently, the service-oriented architecture, in combination with semantic web, peer-to-peer (P2P) computing, and grid computing technologies, has been identified as a promising way to build such infrastructures. Together they support e-Science by providing access to heterogeneous computational resources and by integrating distributed scientific and engineering applications developed by individual scientists and groups [91] [93].

Along with the promise of adopting the service-oriented architecture in e-Science and e-business, a number of challenges arise: integrating independently developed data systems without requiring global agreement on terms and concepts, allocating computational resources efficiently, and addressing the security and privacy issues of accessing shared data resources. These challenges attract researchers from diverse research areas such as information retrieval, database systems, artificial intelligence, software engineering, and distributed computing.

1.1 Main contributions of the dissertation

Our research starts from an investigation of current research trends and challenges in the SOA area. To discover the best practices for building SOA based systems, we present the design and implementation of a SOA based system that supports scientific research and increases productivity. It serves as a prototype for our future research in this field as well as an in-silico investigation platform for scientists. The prototype is applied to a particular scientific domain: the study of the deep phylogeny of the plant chloroplast. This application shows that a SOA based system can help scientists achieve a research goal that is difficult, and almost impossible, to reach without such a system. We conduct our research from both practical and theoretical perspectives. We propose a hierarchical workflow structure that integrates semantic web technology to improve the reuse of workflows. To address the security and resource allocation issues, we propose integrating the current system with an existing grid computing platform.

The main contributions of this dissertation are:

A survey and analysis of current trends and research challenges in the service-oriented architecture: Grid computing, peer-to-peer (P2P) computing, and semantic web technologies are related to SOA. A recently proposed grid standard, the Open Grid Services Architecture (OGSA), is built upon the service-oriented architecture and demonstrates the convergence of grid computing with SOA. Semantic web technology is used in grid services and SOA to enhance the automation of scientific and engineering computational workflows. Applying P2P technology in SOA makes service discovery and enactment more scalable than centralized approaches. Much research has explored the convergence of these technologies in order to make this new distributed computing paradigm successful. We present our investigation of the research issues and challenges in SOA. Our discussion of open issues and future research trends focuses on several critical aspects of SOA: service discovery, service composition, and service enactment.

A service-oriented data integration and analysis environment for in silico experiments and bioinformatics research: As more public data providers begin to offer their data as web services in order to facilitate better data integration in the bioinformatics community, we designed and implemented a service-oriented architecture that integrates data and services to support a deep phylogenetic study. This software environment focuses on representing both data access and data analysis as web services. We believe that this common interface will make it easy for other researchers who are interested in deep phylogenetic analysis to integrate our data and services into their applications. Based on a first prototype, we discuss several issues in the implementation and indicate how integration with semantic web and grid computing technologies could address these limitations. We present a practical experiment in building a service-oriented system upon current web services technologies and bioinformatics middleware. The system allows scientists to extract data from heterogeneous data sources and to generate phylogenetic comparisons automatically. This is difficult to accomplish with manual search tools, since sequence data is accumulating rapidly and the manual process is long and tedious.

An application for exploring the deep phylogeny of the plastids with the SOA based system: To serve as an example and proof of concept that the service-oriented architecture can help scientists increase their productivity and solve more complex problems than is possible with traditional approaches, we apply several use cases to the system. We detail the services provided in this environment and present results which demonstrate that the environment can help support scientific analysis and lead to new discoveries.

A methodology and a novel approach to facilitate the reuse of workflows and the composition of services: Most current practical methodologies for creating workflows rely heavily on users having complete knowledge and understanding of individual services at a low level of description. Using semantic web technology, services can be described with rich semantics. Recent research has focused on supporting users in the discovery and composition of services by using rich service annotations. Users can choose to include a service in a workflow to achieve particular goals, based on the conceptual service definition, in a semi-automatic or automatic way. Most practical methodologies in this line of work take a semi-automatic approach that allows users to discover and select appropriate services to include in a workflow based on the semantic and conceptual service definition. This effort relieves bioinformatics researchers of the need for detailed knowledge and understanding of each tool, service, and data type. Instead, more complex middleware is used to assist with the composition process and to resolve incompatibilities between two given services. Few approaches consider the potential for complete or partial reuse of existing workflows. We present a hierarchical workflow structure with a four-level representation of a workflow: abstract workflow, concrete workflow, optimal workflow, and workflow instance. This four-level representation provides more flexibility for the reuse of existing workflows. We believe that reusing complete or partial workflows takes advantage of the verified knowledge learned in practice and can increase the soundness of the composed workflow. We propose an ontological representation model of data and services, as well as an approach that uses a graph matching algorithm to find similar workflows with semantic annotation.
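The idea of finding similar workflows through their semantic annotations can be illustrated with a small sketch. It is not the graph matching algorithm developed in Chapter 6: as a deliberately simple stand-in, it treats a workflow as a directed graph whose steps carry concept labels and scores two workflows by the Jaccard overlap of their concept-labelled edges. The class name, function names, and concept labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """A workflow as a directed graph; each step carries a semantic concept label."""
    name: str
    concepts: dict = field(default_factory=dict)  # step id -> concept label
    links: set = field(default_factory=set)       # (from step id, to step id)

    def annotated_edges(self):
        """Edges expressed with concept labels instead of local step ids."""
        return {(self.concepts[a], self.concepts[b]) for a, b in self.links}


def similarity(w1, w2):
    """Jaccard overlap of concept-labelled edges, a value between 0.0 and 1.0."""
    e1, e2 = w1.annotated_edges(), w2.annotated_edges()
    if not e1 and not e2:
        return 1.0  # two empty workflows are trivially identical
    return len(e1 & e2) / len(e1 | e2)


def most_similar(query, library):
    """Return the stored workflow whose annotated edges best overlap the query."""
    return max(library, key=lambda w: similarity(query, w))
```

Because the comparison uses concept labels rather than step identifiers, two workflows built by different users from differently named steps can still be recognized as sharing a sub-process, which is what makes partial reuse possible.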

1.2 Organization of the dissertation

The rest of this dissertation is organized as follows. Chapter 2 introduces several concepts and technologies related to SOA and discusses related research issues and challenges. Chapter 3 presents the design and implementation of a SOA based system for supporting bioinformatics research. Chapter 4 demonstrates a particular application that uses this system to discover new phylogenetic knowledge. Chapter 5 presents an ontological model for annotating services and data; this semantically enriched data makes it easier to reuse and share data and to conduct experiments that involve search. Chapter 6 proposes a methodology and a novel approach that facilitate the reuse of workflows and the composition of services. Chapter 7 summarizes the dissertation and identifies potential future work.


CHAPTER 2

RESEARCH ISSUES AND CHALLENGES IN SERVICE-ORIENTED

COMPUTING

2.1 Introduction

The evolution of computing systems has progressed from monolithic, through client-server and 3-tier, to N-tier architectures. The N-tier architecture layers request and response calls among applications that may reside on multiple sites. Service-oriented computing (SOC), a term frequently used interchangeably with service-oriented architecture (SOA), involves the service layers, functionality, and roles described by SOA [70]. SOA can be considered as a conceptual description of a concrete implementation of a service-oriented computing infrastructure. It is an emerging paradigm for distributed computing intended to enable systematic application-to-application interaction. Services are the basic units of a service-oriented computing platform. They are autonomous, platform-independent software components that can be described, published, discovered, invoked, and composed using standard protocols within and across organizational boundaries. A service is a piece of work done by a service provider in order to provide desired results for a service requester. Service providers and requesters are roles played by software agents on behalf of their owners. The goal of this new distributed computing architecture is to enable interaction among loosely coupled software agents in a flexible and effective way.

SOC has been adopted in portal design, e-commerce, e-Science, legacy system

integration, and grid computing. One example is the integration of engineering

design processes, such as automobile and aircraft design, which typically involve

several partners at different locations. These partners may be both coop-

erative and competitive. Successful engineering design requires well-coordinated

interactions between individuals or teams in specialized knowledge domains, as well

as the exchange and integration of information and models, to achieve an optimal goal. However,

a significant part of the design models and tools may contain propri-

etary information that cannot be disclosed. Also, these models and tools are

normally written in a variety of programming languages and run on different plat-

forms. With service-oriented computing technologies, these models and tools can

be treated as black boxes and run at their original locations [5] [43].

Reusability, interoperability, security, and easy maintenance are major poten-

tial benefits of SOC.

• Reusability – Services provide a higher-level standard abstraction that allows

the reuse of existing software.

• Interoperability – The standard abstraction of services enables the interop-

eration of software produced by different programmers and improves pro-

ductivity.

• Security – With the standard abstraction of services, software can be viewed

as a black box. The internal implementations or algorithms are not accessi-

ble to competitive partners.

• Maintenance – With the standard abstraction of services, changes to the

underlying implementation will not adversely impact the use of the services.

While the potential benefits of SOC are compelling, successful service-oriented

implementation requires solving several issues and challenges arising from these

promising features. These issues and challenges include service discovery, ser-

vice composition, and service invocation; monitoring the execution of services;

methodologies supporting service development, evaluation, and life-cycle man-

agement; and approaches to guarantee quality, security, and reliability of services.

These challenges attract researchers from diverse research areas such as informa-

tion retrieval, database systems, artificial intelligence, software engineering, and

distributed computing.

In this chapter¹, we introduce several concepts and technologies related to

SOC and discuss related research issues and challenges.

2.2 Overview of related concepts and technologies

Several definitions of SOA are available; the W3C defines SOA as a form of

distributed systems architecture with the following properties [105]:

• The service is an abstracted, logical view of actual programs, databases, and

business processes.

• A service or a function is described using a description language.

• Services tend to use a small number of operations with relatively large and

complex messages.

• Services tend to be oriented toward use over a network.

1 Portions of this chapter appear in “A semantic web services enabled web portal architecture”, International Conference of Web Services (ICWS2004) [108].

• Messages are sent in a platform-neutral, standardized format, such as XML,

through the interface. XML is the most obvious format.

• The service is implemented as a software agent. The service is formally

defined in terms of the messages exchanged between provider agents and

requester agents, and not the properties of the agents themselves. By avoid-

ing any knowledge of the internal structure of an agent, one can incorporate

any software component or application that can be “wrapped” in message

handling code that allows it to adhere to the formal service definition.

There are two fundamental components in a basic service-oriented architec-

ture as shown in Figure 2.1. A service requester at the right sends a service

request message to a service provider at the left. The service provider returns a

response message to the service requester. The request and subsequent response

connections are defined in some way that is understandable to both the service

requester and service provider.

2.2.1 Web services

Although there is no standard definition of “web services”, a web service is

generally considered as one type of realization of SOA. Among various definitions,

we refer to the definition from W3C:

“A Web service is a software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact with the Web service in a manner prescribed by its description using SOAP messages, typically conveyed using HTTP with an XML serialization in conjunction with other Web-related standards”².

2http://www.w3.org/TR/ws-arch/

[Figure 2.1 diagram: a Service Requester and a Service Provider connected through the Internet. The requester sends a request in XML format; the provider returns results in XML format. The provider's software agent implements the service. The requester's software agent has knowledge of the service in terms of the service description, not the implementation.]

Figure 2.1. Two basic components in a simple service-oriented architecture. A service requester at the right sends a service request message to a service provider at the left. The service provider returns a response message to the service requester.

Concrete software agents that implement an abstract service interface can be

written in different programming languages and can run on different platforms.

Since these concrete agents implement the same function defined in the abstract

interface, any change of underlying implementation will not affect the use of

the service. A web service architecture is based upon many layered and inter-

related open standards and web technologies as shown in Figure 2.2. The Web

Service Description Language (WSDL) defines the abstract interface of services.

The Simple Object Access Protocol (SOAP) is a protocol for exchanging messages

among requesters and providers. Universal Description, Discovery and Integra-

tion (UDDI) provides a standard registry for publishing, discovery, and reuse of

web services. WSDL, SOAP, and UDDI are core standards based on fundamental

web technologies including XML, TCP/IP, and FTP. There are also emerg-

ing standards proposed for defining business, scientific, or engineering processes,

transactions, and security, e.g., BPEL4WS, WS-I. Two main styles of web services

are available: SOAP web services and REST (Representational State Transfer)³

web services. In this dissertation, we use the term “web services” to mean SOAP

style web services.
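
As an illustration of SOAP-style messaging, the following Python sketch builds and reads a SOAP 1.1 envelope using only the standard library. The service namespace, the operation name getSequence, and the accession parameter are hypothetical and are not taken from any system described in this dissertation; a real deployment would normally generate such messages from a WSDL description with a web services toolkit.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/sequence"  # hypothetical service namespace


def soap_envelope(operation, arguments):
    """Requester side: wrap an operation call in a SOAP Envelope/Body."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in arguments.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def read_operation(envelope_xml):
    """Provider side: recover the operation name and arguments from the Body."""
    body = ET.fromstring(envelope_xml).find(f"{{{SOAP_NS}}}Body")
    call = body[0]  # first child of Body is the operation element
    name = call.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
    args = {child.tag.split("}", 1)[1]: child.text for child in call}
    return name, args


message = soap_envelope("getSequence", {"accession": "NM_000546"})
operation, arguments = read_operation(message)
```

Because both sides agree only on the message format, the provider's implementation language and platform remain invisible to the requester.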

[Figure 2.2 diagram: stacked layers labeled Network Transport Protocols (TCP/IP, HTTP, SMTP, FTP, etc.); Meta Language (XML); Services Communication (SOAP); Service Publishing & Discovery (UDDI); Services Description (WSDL); Business Process Execution (BPEL4WS, WFML, WSFL, BizTalk, …); and Additional WS* Standards (Transactions, Management, Security).]

Figure 2.2. Web services standards stack includes multiple layered and interrelated open standards.

2.2.2 Semantic web

The vision of the semantic web is to represent units of web-based information

with well-defined and machine-understandable semantics so that intelligent soft-

ware agents can autonomously process them [7]. This information, including these

3http://www.xfront.com/REST-Web-Services.html

abstract descriptions of services, must be defined and linked in such a way that

it can be used for automation, sharing, integration, and reuse even when these

software agents are designed, developed, and owned by different groups or indi-

viduals. SOA, more specifically web services, becomes a key component to realize

the vision of the semantic web since most web sites on today’s web do not merely

provide static information but allow users to interact and generate dynamic infor-

mation through services. To make use of a web service, a software agent needs a

computer-interpretable description of the service.

Adding “meaningful” descriptions to the interface using semantic web technol-

ogy can avoid ambiguous interpretations of information and service descriptions

and increase the soundness of the results provided by service providers. The com-

bination of these two technologies results in the emergence of a new generation of

web services called semantic web services [54]. The proposed standards for knowl-

edge sharing and reuse in the semantic web range from the Resource Description

Framework (RDF) to the Web Ontology Language (OWL) [67]. These two stan-

dards have become W3C recommendations. The appearance of open source tools

that support creation, parsing, and reasoning using these standards makes the

addition of semantic web technology into SOC feasible.
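
To illustrate what machine-understandable semantics add, the sketch below stores RDF-like (subject, predicate, object) triples in plain Python and infers class membership through a transitive subclass relation. The class names and the blastp_input resource are hypothetical, and the hand-written closure only stands in for what an RDF/OWL toolkit and reasoner would provide.

```python
# Hypothetical vocabulary; "subClassOf" and "type" mimic the RDFS/RDF properties.
TRIPLES = {
    ("ProteinSequence", "subClassOf", "Sequence"),
    ("DNASequence", "subClassOf", "Sequence"),
    ("Sequence", "subClassOf", "BiologicalData"),
    ("blastp_input", "type", "ProteinSequence"),
}


def superclasses(cls, triples):
    """Return every class reachable from cls via subClassOf (transitive closure)."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == "subClassOf" and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def is_instance_of(resource, cls, triples):
    """True if a declared type of resource equals cls or is a subclass of cls."""
    declared = {o for s, p, o in triples if s == resource and p == "type"}
    return any(d == cls or cls in superclasses(d, triples) for d in declared)
```

With these triples, a software agent can conclude that blastp_input is BiologicalData even though no triple states this directly, which is the kind of inference a purely syntactic description cannot support.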

2.2.3 Grid computing

Grid computing [32] is a computing platform that is intended to integrate

resources (both data and computational resources) from different organizations,

called virtual organizations, in a shared, coordinated and collaborative way to

solve large-scale science and engineering problems. The Globus toolkit [97] is

one implementation of the specifications for grid computing. It has become the

standard for grid middleware. The Open Grid Services Architecture (OGSA),

developed within the Global Grid Forum (GGF), describes an architecture for a

service-oriented grid computing environment for business and scientific use.

OGSA is based on several Web service technologies, notably WSDL and SOAP. It is a

distributed interaction and computing architecture based around services, assuring

interoperability on heterogeneous systems so that different types of resources can

communicate and share information.

The major goal of the grid computing platform is to provide an easy-to-use and

flexible computing infrastructure for supporting e-Science. The goal of e-Science

is to offer scientists and engineers an effective way to generate, analyze, and share

their experiments, data, instruments, computational tools, and results. Seamless

automation of the scientific process remains a major gap between the vision and

reality. Grid computing shares some problems and technical challenges with

service-oriented computing in general. Incorporating semantic web technologies

into grid computing brings us a new concept, the semantic grid [21], which intends

to minimize this gap and solve the problem of achieving seamless integration and

automation of scientific and engineering workflows.

2.2.4 Peer-to-peer computing

Peer-to-Peer (P2P) computing has received significant attention due to the

popularity of P2P file sharing systems such as Napster, Gnutella, Freenet, Mor-

pheus, BitTorrent, and KaZaa. Peers are autonomous agents and exchange in-

formation in a completely decentralized manner. P2P architecture does not have

a single point of failure. Since nodes contact each other directly, the in-

formation they receive is up-to-date. The P2P model can provide an alternative

for dynamic service discovery without relying on centralized registries. The

P2P model also provides an alternative for interaction between web services. We

discuss the research done in this direction in the following sections.

Semantic web technology enhances the capability of automation in SOA and

grid computing. Building grid computing upon SOA increases flexibility. The P2P

computing model increases scalability and reliability. Figure 2.3 gives

an overview of current research trends that intend to use these technologies to-

gether.

2.3 Issues in the service-oriented computing

Figure 2.4 shows service publication, service discovery, and service invocation

stages in the life cycle of a service. This process involves three roles in the SOA:

service provider, service requester, and service discovery system. Service providers

create services and provide platforms to execute these services. Service requesters

query the service discovery system to find appropriate services. To enable ser-

vice requesters to find services, service providers need to publish their service

interfaces in a publicly available location. Specifying the capability and quality

of services, and finding a matched service based on these descriptions are usually

done as two separate activities. The more information that is given for describing

services, the more accurate are the matched results that are returned. Services

can be categorized into simple services (atomic services) and complex services

(composite services). Generating and executing a composite service to solve

a complicated problem is an important feature leading to the adoption of SOA.

Figure 2.3. Venn diagram representation of the integration of web service, grid computing, semantic web, and peer-to-peer technologies into the realization of service-oriented architecture.

In the following sections, we discuss several active research issues in SOA: service

description, service discovery, service composition, and service execution.

2.3.1 Service description

One requirement of the service-oriented architecture is to provide meaningful

descriptions for services so that software agents can understand their features and

learn how to interact with them. A service description gives a formal representa-

tion for properties of a service. These properties can be classified into functional

and non-functional properties.

Functional properties contain the details of a service interface and service

[Figure 2.4 diagram: a Service Consumer, a Service Broker, and a Service Provider connected by Publish, Discovery, and Invoke interactions, labeled with steps 1 to 5.]

Figure 2.4. A common service lifecycle in a service-oriented architecture includes service publication, service discovery, and service invocation processes.

behavior including data types, operations, transport protocol information, and

binding address. WSDL is the first W3C standard that is widely used for service

descriptions.

There may be multiple service providers who offer the same functionalities

defined in a service interface. Determining and choosing the best service becomes

important for service requesters. The information in WSDL descriptions is not

sufficient for ranking services. Non-functional properties, including specifica-

tions of the cost, performance, security, and trustworthiness of a service, are introduced

for measuring the Quality of Service (QoS). There are many aspects of QoS that

can be organized into categories with a set of quantifiable parameters [75]. The

“best” service may have different meanings for different requesters. One may

prefer security over cost while the other may prefer lower cost over performance.

Measurements of these non-functional properties can be achieved using statistical

analysis, data mining, and text mining technologies. Measurement is normally done by a

third party through the collection of subjective evaluations from requesters. This

information changes dynamically over time.
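
The observation that the “best” service depends on the requester can be sketched as a weighted ranking over QoS scores. The following Python example is only an assumed scoring scheme (a normalized weighted sum), not a method proposed in this dissertation; the service names, score values, and weights are made up for illustration, and every score is assumed to be normalized to [0, 1] with higher values being better.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ServiceQoS:
    name: str
    scores: Dict[str, float]  # assumed normalized to [0, 1], higher is better


def rank_services(candidates: List[ServiceQoS], weights: Dict[str, float]) -> List[str]:
    """Order functionally equivalent services by a requester-specific weighted sum."""
    total = sum(weights.values())

    def utility(service: ServiceQoS) -> float:
        # Missing QoS parameters contribute 0 to the utility.
        return sum(w * service.scores.get(k, 0.0) for k, w in weights.items()) / total

    return [s.name for s in sorted(candidates, key=utility, reverse=True)]


candidates = [
    ServiceQoS("provider_a", {"security": 0.9, "cost": 0.3, "performance": 0.6}),
    ServiceQoS("provider_b", {"security": 0.5, "cost": 0.9, "performance": 0.5}),
]
security_first = rank_services(candidates, {"security": 0.7, "cost": 0.3})
cost_first = rank_services(candidates, {"cost": 0.7, "performance": 0.3})
```

The same two candidates are ranked in opposite orders by the two requesters, which is why QoS information has to be combined with requester preferences rather than stored as a single ranking.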

Pure syntactic descriptions of services require requesters to fully understand

the capability of a service before using it. The selection of a web service among

several ones with similar WSDL descriptions requires more information than what

WSDL actually defines. The semantic web, supported by the use of an ontology, is

likely to provide better qualitative and scalable solutions to overcome these issues.

There are two directions to enhance the semantics in the web service descrip-

tion (see Table 2.1). (1) Enhance the WSDL description. The Semantic Annota-

tions for Web Services Description Language Working Group [81] has the objective

to develop a mechanism to enable annotation of Web services descriptions. This

mechanism will take advantage of the existing WSDL standards (WSDL 2.0) to

build simple and generic support for semantics in Web services. Some systems

[54] [55] define an ontology for web services using emerging languages, such as

DAML+OIL and OWL. (2) Describe services with OWL-S. The W3C recently proposed OWL-S to provide

the ontology description of web services using OWL. OWL-S enables description of

not only the functional properties of a service, but also the non-functional proper-

ties. This domain-independent service ontology is augmented by domain-specific

ontologies in real applications.

Enhancing service descriptions with ontological representations increases the

cost and complexity of services annotation from several aspects.

Creation of domain ontology. Use of ontologies is considered to be the most

promising basis for defining the semantics of objects and allowing meaning-

ful information exchange among machines and humans. A commonly used

definition of ontology is “a specification of a conceptualization” [40]. An

ontology is intended to give a concise, uniform, and declarative description

of information and knowledge that is interesting and useful to a community

TABLE 2.1

SYNTACTIC AND SEMANTIC DESCRIPTION METHODS FOR WEB SERVICES

Description method | Representation | Challenges

syntactic | WSDL | No representation of non-functional properties; not sufficient for representing meaningful descriptions; no representation for process; supports keyword search only

semantic | domain ontology + WSDL | No representation for process; complexity of services annotation

semantic | domain ontology + OWL-S | Complexity of services annotation

of users, using a common vocabulary and language.

Construction of a knowledge base involves investigating a particular domain,

determining important concepts in that domain, and creating a formal rep-

resentation (ontology) of the objects and relations in the domain. A general

ontology represents a broad selection of objects and relations at a higher

level of abstraction [79]. Miller et al. [59] investigate ontologies for simu-

lation modeling. Christley et al. [15] present an ontology for agent-based

modeling and simulation.

An ontology is normally defined and revised (if needed) by an authority.

Usually the authority needs to collaborate with the real experts in the do-

main before or during the process of creating formal representations. Large-

scale ontologies can be constructed by publishing a prototype ontology for

the research community. The Gene Ontology (GO) Consortium produces

a controlled vocabulary for classifying gene product attributes, molecular

functions, cellular components, and biological processes [35] in the biological

sciences field. It consisted of 17838 terms as of September 27, 2004, and

22742 terms as of March 11, 2007.

Integration of ontologies. Vast amounts of information may come from many

different ontologies. For this reason and because many heterogeneous data

repositories are developed by different research groups and reside at different

research institutes and organizations, it is impossible to process this infor-

mation and data without the knowledge of the semantic mapping between

them. Much research has been done to explore the mapping and matching

of concepts, and integrating different ontologies using sophisticated algo-

rithms and AI techniques, such as machine learning [25][62]. There are

two approaches for ontology integration. One approach involves integra-

tion of different ontologies that are developed by different groups for data

representation into a common global ontology. While this approach makes

the information correlation in the query processing easier, it increases the

complexity of integrating the ontologies and maintaining consistency among

concepts. The other approach is interoperation across different ontologies

via “terminological relationships” between terms instead of integration of

ontologies into a global one [56] [66]. Interontological relationships are spec-

ified using description logics in an interontological relationships manager to

handle vocabulary heterogeneity between ontologies. Although the adoption

of this approach increases the scalability, extensibility, and maintainability,

it shifts the burden to the interoperation mechanisms.

Annotation of services. The annotation of services using ontologies is gener-

ally done manually. It is a complex process since there may be multiple

ontologies related to a single service. These ontologies may be developed

by different groups. Different groups may represent the same concept using

different vocabulary, or different concepts using the same vo-

cabulary. Some systems, such as MWSAF (Meteor-S web service annotation

framework) [71], provide graphical tools that enable users to annotate exist-

ing web service descriptions with ontologies in a semi-automatic way using

AI technologies such as machine learning. The IBM ETTK [30] technology

provides a set of toolkits including a graphical editor for annotating services

in a manner compatible with WSDL-S.

2.3.2 Service discovery

Without prior knowledge of a service, service requesters may not know the

location or even the existence of services they desire. A goal of the service

discovery process is to find services that are best suited for the requirement of the

requester.

A basic service discovery process can be described as follows.

1. Service providers provide descriptions of their services and advertise these

services in a service registry. A service registry is a service discovery system

that consists of mechanisms for efficiently searching for appropriate

services and physical space for storing the characteristics of services. UDDI is

a registry standard.

2. Service requesters request desired services using keywords or complicated

query languages.

3. A service discovery system accepts requests from requesters. It searches

service descriptions in its database and tries to find services that match

requests. This process is also called matchmaking.
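
The three steps above can be sketched as a toy in-memory registry with keyword matchmaking. The ServiceRegistry class, its method names, and the example services are hypothetical; this is not the UDDI API, only a minimal illustration of publish, request, and match.

```python
class ServiceRegistry:
    """Toy registry: stores keyword sets and matches requests against them."""

    def __init__(self):
        self._descriptions = {}  # service name -> set of lowercase keywords

    def publish(self, name, description):
        # Step 1: a provider advertises a textual description of its service.
        self._descriptions[name] = set(description.lower().split())

    def find(self, query):
        # Steps 2 and 3: a requester sends keywords; the registry returns
        # every service whose description contains all of them.
        wanted = set(query.lower().split())
        return sorted(n for n, words in self._descriptions.items() if wanted <= words)


registry = ServiceRegistry()
registry.publish("blast", "protein sequence similarity search")
registry.publish("clustal", "multiple sequence alignment")
matches = registry.find("sequence alignment")
```

Matching here is purely syntactic: a request for "homology search" would miss the blast entry even though the capability is the same, which motivates the semantic approaches discussed later in this section.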

As the number of web services grows, new registries appear as needed. A ser-

vice may be registered in several registries. A service discovery broker accepts

requests from service requesters, translates requests into appropriate formats, and

sends them to multiple registries. The returned results may be unified and dis-

tilled based on requesters’ needs (See Figure 2.5). In this mechanism, the broker

may issue the request to multiple registries in parallel; however, the broker

remains a communication bottleneck and a single point of failure.
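
A minimal sketch of the broker role, assuming each registry exposes a search function and expects queries in its own format. The make_registry helper, the registry names, and the upper-case and lower-case "formats" are invented for illustration; the loop is sequential, whereas a real broker could query the registries in parallel.

```python
from typing import Callable, Dict, List

Registry = Callable[[str], List[str]]


def make_registry(entries: Dict[str, str], translate: Callable[[str], str]) -> Registry:
    """Build a hypothetical registry that expects queries in its own native format."""

    def search(query: str) -> List[str]:
        native = translate(query)  # broker-side translation into the registry format
        return [name for name, text in entries.items() if native in text]

    return search


def broker_search(registries: Dict[str, Registry], query: str) -> Dict[str, List[str]]:
    """Send one request to every registry and unify the results per service."""
    merged: Dict[str, List[str]] = {}
    for registry_name in sorted(registries):
        for service in registries[registry_name](query):
            merged.setdefault(service, []).append(registry_name)
    return merged


registry_a = make_registry({"blast": "SEQUENCE SEARCH", "clustal": "SEQUENCE ALIGNMENT"}, str.upper)
registry_b = make_registry({"blast": "sequence search"}, str.lower)
unified = broker_search({"A": registry_a, "B": registry_b}, "Search")
```

A service registered in several registries appears once in the unified result, together with the registries that returned it.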

An alternative to the centralized discovery mechanism is the P2P-based dis-

covery mechanism. In this approach, each service provider acts as a peer in the

P2P network. Each provider has its own way to store information about other

service providers, called neighbors, and provides the resources to relay or pass

information through. A network like a social network is eventually formed. During

the discovery process, a requester queries its neighbors for a desired

service; the query propagates through the network until a suitable service is found

or the query terminates [105]. This approach provides higher reliability than a centralized

approach. It avoids the single point of failure and the latency of providing up-to-

date descriptions for updated services. However, since each service provider is a

peer, a huge peer community may result in inefficient search. Instead of treating

every provider as a peer, each registry can act as a peer in the network to overcome

this problem.
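
Query propagation through neighbors can be sketched as a breadth-first search with a hop limit. The termination rule used here (a time-to-live counter) is an assumption, since the text does not specify how a query terminates; the peer names and the neighbor graph are hypothetical.

```python
from collections import deque


def p2p_discover(neighbors, offers, start, wanted, ttl):
    """Propagate a query from start through neighbor links.

    Returns (peer, hops) for the first peer offering wanted,
    or None if no such peer is reachable within ttl hops.
    """
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        peer, hops = queue.popleft()
        if wanted in offers.get(peer, ()):
            return peer, hops
        if hops == ttl:
            continue  # hop limit reached: do not forward the query further
        for nxt in neighbors.get(peer, ()):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, hops + 1))
    return None


neighbors = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
offers = {"d": {"forecast"}}
found = p2p_discover(neighbors, offers, "a", "forecast", ttl=2)
missed = p2p_discover(neighbors, offers, "a", "forecast", ttl=1)
```

The hop limit shows the trade-off mentioned above: a small limit keeps search cheap in a large peer community but can miss services that exist further away in the network.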

Much research has been done for realizing a P2P discovery mechanism. Schmidt

and Parashar present a P2P based keyword web service discovery system on the

[Figure 2.5 diagram: service providers publish services into one or multiple service registries. Service requesters send a request for inquiring services using the syntax required by the broker and receive the results from the broker. The service discovery broker handles queries from requesters, translates queries into the appropriate formats needed by each registry, communicates with each registry, and unifies and distills the results returned from registries.]

Figure 2.5. Broker-based service discovery mechanism. A service discovery broker accepts requests from service requesters, translates requests into appropriate formats, and sends them to multiple registries. The returned results may be unified and distilled based on requesters’ needs.

Chord overlay network [82]⁴. In this system, a set of keywords is extracted from

the web services descriptions. These web services descriptions are indexed using

these keywords. The index is stored at the peers in the P2P system. Each web

service description is mapped to the index space. The underlying node joins, de-

partures, failures, and data lookup are built upon the Chord network’s lookup

protocol. Speed-R [88] is a JXTA based P2P network system supporting semantic

publication and discovery of web services. In this system, each service registry is

controlled by a peer. Dogac et al. [26] describe a way to expose the semantics

of web service registries and connect the service registries through a P2P network

for the travel industry. A general P2P discovery system (See Figure 2.6) contains

4http://en.wikipedia.org/wiki/Chord project

a data layer, a communication layer, and peers that control registries or service

providers. The data layer can be formed by registries or service providers. Com-

munication layers are implementations of P2P networks, such as JXTA and Chord.

Semantically enriched services and registries make the automation of the service

discovery and the discovery of service registries possible.
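
The idea of storing a keyword index at peers can be sketched with a hash ring in which each keyword is assigned to the first peer whose identifier is not smaller than the keyword's hash. This is a deliberately simplified, Chord-inspired illustration and not the system of [82]: it keeps a global sorted list of peers instead of Chord's finger tables, ignores joins and failures, and uses hypothetical peer and service names.

```python
import hashlib
from bisect import bisect_left

RING_BITS = 16  # small identifier space, chosen only for this sketch


def ring_id(text):
    """Map a peer name or keyword onto the identifier ring."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << RING_BITS)


class KeywordRing:
    def __init__(self, peer_names):
        self._ids = sorted((ring_id(p), p) for p in peer_names)
        self._index = {p: {} for p in peer_names}  # peer -> keyword -> services

    def responsible_peer(self, keyword):
        # Successor rule: first peer id >= keyword id, wrapping around the ring.
        ids = [i for i, _ in self._ids]
        pos = bisect_left(ids, ring_id(keyword)) % len(ids)
        return self._ids[pos][1]

    def publish(self, service, keywords):
        for kw in keywords:
            peer = self.responsible_peer(kw)
            self._index[peer].setdefault(kw, set()).add(service)

    def lookup(self, keyword):
        peer = self.responsible_peer(keyword)
        return peer, sorted(self._index[peer].get(keyword, set()))


ring = KeywordRing(["peer1", "peer2", "peer3"])
ring.publish("blast", ["sequence", "similarity"])
ring.publish("clustal", ["sequence", "alignment"])
peer, services = ring.lookup("sequence")
```

Because publishers and requesters hash a keyword to the same peer, a lookup contacts only the peer responsible for that keyword rather than flooding the network.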

[Figure 2.6 diagram: a data layer (Registry A, Registry B, …, Registry N; semantically enriched services or registries, described using an ontology), a communication layer (JXTA, …, Chord), and peers (Peer A, Peer B, …, Peer N).]

Figure 2.6. P2P-based discovery mechanism containing a data layer, a communication layer, and peers that control registries or service providers.

The traditional service discovery method, static discovery or manual discovery,

relies on human intervention, using a discovery system to locate and select a

service description that meets the desired criteria at design time. A dynamically

changing service environment requires that service discovery also be possible

for a software agent at run time. The realization of the dynamic discovery

mechanism needs machine-processable semantics to describe services.

The implementation and performance of a service discovery system depend

on the available information in service descriptions. The more information the

system can gather, the more accurate results the system can give back to the

requester. The implementation also depends on the kind of query that can be

given by the requester. Two examples are: “give a forecast service” and “give

a forecast service which has the fastest response time.” For the first query,

a simple keyword-based discovery system is sufficient. For the second query, the

discovery system needs to gather information on quality of service, find several

forecast services in registries and rank them based on response time.

The service discovery problem is related to the information retrieval problem.

Two key quality measurements in information retrieval are also applicable when

evaluating the performance of service discovery systems [45]. Recall is the number

of relevant items retrieved, divided by the total number of relevant items in the

collection. Precision is the number of relevant items retrieved, divided by the

total number of items retrieved.
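
The two measures can be computed directly from the set of retrieved services and the set of relevant services. The following sketch uses made-up service identifiers; returning 0.0 when a denominator is empty is a convention chosen here, not part of the definitions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one discovery request."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four services returned, two of which are among the three relevant ones.
precision, recall = precision_recall(["s1", "s2", "s3", "s4"], ["s1", "s2", "s5"])
```

In this example precision is 2/4 and recall is 2/3: returning more services can raise recall while lowering precision, so a discovery system has to be evaluated on both.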

The discovery mechanism in the traditional UDDI standard that only supports

the static service discovery has been recognized as insufficient. This discovery

mechanism often gives no result at all or gives many irrelevant results because

keywords are a poor method to capture the semantics of a request. Synonyms

(syntactically different words that have the same meaning) and homonyms (the same

word with different meanings in different domains) cannot be distinguished

in keyword-based retrieval. Also, relationships between different keywords in a

request cannot be captured. This mechanism offers low retrieval precision and

recall.

WordNet [102] has been used to handle synonyms, in combination with an information

retrieval model in the service retrieval process [99], so as to improve precision

and recall. WordNet is a lexical reference system developed by the Cognitive Sci-

ence Laboratory at Princeton University. English nouns, verbs, adjectives, and

adverbs are organized into synonym sets, each representing one underlying lexical

concept, and the synonym sets are linked with different relations. WordNet is

distributed as a data set. However, WordNet only supports the query of com-

mon words; vocabularies for a particular domain are most likely not included in

WordNet.
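
The effect of synonym sets on keyword retrieval can be sketched with a small hand-written table. The SYNSETS list below is a hypothetical stand-in for WordNet data and does not use the WordNet distribution or any WordNet API; the query and description strings are also invented.

```python
# Hypothetical synonym sets standing in for WordNet synsets.
SYNSETS = [
    {"forecast", "prediction", "prognosis"},
    {"car", "automobile"},
]


def expand(term):
    """Return the synonym set containing term, or just the term itself."""
    term = term.lower()
    for synset in SYNSETS:
        if term in synset:
            return set(synset)
    return {term}


def keyword_match(query, description, use_synonyms):
    """True if every query term (or one of its synonyms) occurs in the description."""
    words = set(description.lower().split())
    for term in query.lower().split():
        candidates = expand(term) if use_synonyms else {term}
        if not (candidates & words):
            return False
    return True


plain = keyword_match("weather forecast", "weather prediction service", use_synonyms=False)
expanded = keyword_match("weather forecast", "weather prediction service", use_synonyms=True)
```

Synonym expansion recovers the match that plain keyword comparison misses, but a domain term absent from the synonym table gains nothing, which mirrors the limitation noted above for domain-specific vocabularies.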

With rich formal semantic descriptions added to web services, a service discov-

ery system can provide more accurate results with high precision and recall. It also

reduces human intervention in the discovery process and makes dynamic dis-

covery possible. Therefore, semantic web technologies become a solution for this

matchmaking process [47] [26] [28]. Meanwhile, quality of service becomes

an interesting topic for selecting optimal services from a subset of services that

have the same functionality the requester asked for [10] [53] [48] [19]. Two types of

semantic descriptions result in two types of semantic discovery systems: (1) systems

that add semantics to current web services standards (UDDI and WSDL) [85]; and (2) systems

that use DAML-S and OWL-S to represent both functional and non-functional properties

of web services, which enables software agents or search engines to automatically find ap-

propriate web services via ontologies and reasoning-algorithm enriched methods.

However, the high cost of formally defining heavy and complicated services makes

adoption of this improvement unlikely at the current stage.

Figure 2.7 shows, in three dimensions, existing service discovery mechanisms

currently used in implementations of service discovery systems. A denotes keyword-

based systems, such as traditional UDDI. B denotes semantically enriched UDDI systems

[85]. C denotes keyword-based P2P systems [82]. D denotes semantic-based systems with

DAML-S or OWL-S [47] [26] [28]. E denotes semantic-based systems on P2P networks

[88] [26].

[Figure 2.7: systems A to E plotted along three axes: keyword-based vs. semantic-based, static vs. dynamic, and centralized vs. decentralized (P2P).]

Figure 2.7. Summary of existing service discovery systems with different discovery mechanisms mapped relative to three characteristics: degree of decentralization, richness of service descriptions, and static or dynamic.

The research challenges in the service discovery process suggest integrating semantic and P2P technologies to build a discovery system. Such a system should allow automatic service discovery and provide high precision and recall at the same time; however, the cost of implementing it makes adoption difficult at this time.

2.3.3 Service composition

One of the most attractive features of service-oriented computing is that atomic

services can be combined into a large application to solve complicated problems.

The orchestration of a set of services to accomplish a larger, more sophisticated goal is called a workflow. In the business world, a workflow is referred to as a business process. In the scientific domain, a workflow is sometimes referred to as a scientific process.

Several different approaches and platforms are being developed to achieve the common goal of web service composition. These approaches range from the adoption of industry standards to the adoption of semantic web technology, and from manual or static composition to automatic, dynamic composition [90]. Since there is no standard service composition specification, each approach and platform defines its own way of composing services, provides its own specifications and languages, and executes the workflow on a specific workflow execution engine.

Current solutions for web service composition include the adoption of industrial standards, semantic web technologies [86] [29] [41], web components [111], Petri nets [112], and so on. In the long term, a successful composition mechanism should meet several requirements: connectivity, quality of service, correctness, and scalability [58].

Adoption of industrial standards and adoption of the semantic web technolo-

gies are two active research areas among current service composition mechanisms.

Both of these mechanisms support complex process activities, such as sequences,

branching, etc.

Current industrial standards include WSDL, UDDI, SOAP and a set of workflow specification languages (BPEL4WS, WSFL, BPML, WSCI, and XLANG) used to support the data flow and control flow representations [98]. Among these specifications, BPEL4WS is the most mature and the most widely supported by industry and the research community. Service compositions described in the BPEL4WS format may be deployed on execution engines such as BPWS4J [11] and the Collaxa BPEL server [17].

The other approach is based on semantic web technologies and AI planning techniques [84] [13]. In this model, services are semantically annotated with

RDF/RDF Schema, DAML-S, or OWL-S. The objective is to enable automation

of web service discovery, invocation, composition and execution. However, there

is limited implementation and product support for generating service descriptions

automatically at the current research stage.

Most service composition models require application developers to possess complete knowledge of the available services and the exact process logic. It is up to developers to choose a particular service at each step. Adoption of semantic web technologies makes automation of the composition process possible. There are two types of automation, semi-automatic and automatic, and both require a domain ontology. A typical system using the semi-automatic method [84] maintains a knowledge base that contains an ontology of services, such as DAML-S or OWL-S. A matchmaker is used to find services with the required functionality. At each step, all candidate services that meet the requirement are presented to the user, ranked by quality. The user makes a choice and continues the process. A typical system using the automatic method often incorporates AI planning technology [13]. The composition process starts from an explicitly defined goal. The workflow composition engine lets the service requester provide the input and output information, which is fed into an AI planner. The planner returns one plan, multiple plans, or no result to the end user for a further decision. Although the service composition problem is closely related to the AI planning problem, current planning technologies cannot be applied directly [90].
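The automatic method described above can be illustrated with a deliberately naive sketch: given the data types the requester can provide and the data types requested as the goal, chain services forward until the goal is reachable. This is a greedy forward-chaining search written for illustration, not one of the AI planners cited here; the `Service` type and all service and type names are assumptions.

```python
from typing import NamedTuple


class Service(NamedTuple):
    name: str
    inputs: frozenset   # data types the service consumes
    outputs: frozenset  # data types the service produces


def plan(services, provided, goal):
    """Greedy forward chaining. Return a list of service names that derives
    every goal type from the provided types, or None if no plan is found."""
    known = set(provided)
    remaining = list(services)
    steps = []
    while not set(goal) <= known:
        # A service is runnable if its inputs are known and it adds something new.
        runnable = [s for s in remaining
                    if s.inputs <= known and not s.outputs <= known]
        if not runnable:
            return None
        chosen = runnable[0]  # naive choice; a real planner would rank or search
        steps.append(chosen.name)
        known |= chosen.outputs
        remaining.remove(chosen)
    return steps


# Hypothetical services for a small sequence-analysis domain.
services = [
    Service("fetch_sequence", frozenset({"accession_id"}), frozenset({"sequence"})),
    Service("align_sequences", frozenset({"sequence"}), frozenset({"alignment"})),
    Service("build_tree", frozenset({"alignment"}), frozenset({"phylogenetic_tree"})),
]

found = plan(services, {"accession_id"}, {"phylogenetic_tree"})
not_found = plan(services, {"protein_id"}, {"phylogenetic_tree"})
print(found)
print(not_found)
```

Because the choice at each step is greedy, the returned plan may contain steps that are not strictly needed for the goal, and the sketch ignores preconditions, effects, and quality of service, which is part of why planning techniques cannot be applied to real services without extension.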

Services change dynamically and may fail at execution time. A composed workflow that does the desired work at one time may not work at another time. Preventing run-time failure at design time is therefore important.

An issue in the automatic composition of web services is defining the compatibility [55] or connectivity [58] of services. Checking whether the services to be composed can actually interact with each other can be a time-consuming process. For example, the output of one service must satisfy the required input of the subsequent service in a workflow. Automatic composition also requires a way to verify the soundness and correctness of the composite service. Much research has explored the use of AI planning techniques to automate the composition process. Whether current planning techniques can be used or extended for service composition and for the modeling of services is still an open research problem. The application most often used to motivate research in automatic service composition is a virtual travel agent; the motivating examples typically lack a real-world application. At present, the automatic approach may be practical in domains with well-defined ontologies and a small number of available services. We believe the semi-automatic approach is more practical when a large number of services exist in the domain.
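The connectivity requirement discussed above (each service's required inputs must be produced by the service before it) can be checked at design time. The following is a minimal sketch under the simplifying assumption that data types are plain strings compared for equality; a semantic approach would instead compare ontology concepts. The workflow and its type names are hypothetical.

```python
def check_chain(workflow):
    """Return (producer, consumer, missing_types) for every adjacent pair of
    services in a linear workflow whose data types do not line up."""
    problems = []
    for producer, consumer in zip(workflow, workflow[1:]):
        missing = consumer["inputs"] - producer["outputs"]
        if missing:
            problems.append((producer["name"], consumer["name"], missing))
    return problems


# Hypothetical linear workflow; the last step expects a different format.
workflow = [
    {"name": "fetch_sequence", "inputs": {"accession_id"}, "outputs": {"fasta"}},
    {"name": "align", "inputs": {"fasta"}, "outputs": {"clustal_alignment"}},
    {"name": "build_tree", "inputs": {"phylip_alignment"}, "outputs": {"tree"}},
]

problems = check_chain(workflow)
print(problems)
```

The mismatch reported for the last pair is the kind of problem a format adapter or mediator service would have to resolve before the workflow can run.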

2.3.4 Service execution

Service execution is a process in which an atomic service or a composite service

is invoked and results are returned to requesters.

Atomic web services can be created with different languages and deployed on

various platforms. Two major platforms are J2EE and .NET. Since execution of

atomic services does not require results from other services, the technologies to

support atomic services are relatively mature (See Table 2.2).

Service execution for composite services depends on the composition model and on the execution engine support that exists. A model based on industrial standards can be translated into a particular workflow specification, such as BPEL4WS, and executed on a workflow engine. A model based on the semantic web can be represented using the DAML-S specification and executed on a DAML-S Virtual Machine [84] or an OWL-S execution engine. Since there is no standard service composition specification, each composition approach and platform provides its own specifications and languages for composite services and executes the workflow on a specific workflow execution engine. There are also composition toolkits that convert a visual graph composition of services into a language-specific workflow. Several issues exist in the service execution process.

TABLE 2.2

EXISTING DEPLOYMENT AND EXECUTION ENGINES FOR ATOMIC AND COMPOSITE SERVICES

Service type | Specification | Execution engine
Atomic service | WSDL | Implemented using Java, C++, Perl, Python on .NET, J2EE, gSoap, SoapLite
Atomic service | OWL-S | OWL-S execution engine
Composite service | BPEL | BPWS4J
Composite service | OWL-S | OWL-S execution engine
Composite service | DAML-S | DAML-S virtual machine
Composite service | XScufl | Freefluo

Synchronous vs. asynchronous communication. Web service technology is message-passing oriented, so the architecture should be able to support different message passing methods. Most service-oriented frameworks, such as Axis [3], only support synchronous invocation, which blocks the calling process until the response from the service provider arrives. The loosely coupled nature of web services requires a more flexible invocation method: the requester should not be blocked while it waits for the response from providers. Various research efforts support this asynchronous communication method [107] [113].

Centralized vs. decentralized execution of composite web services. Although most composite service execution engines invoke individual atomic services on distributed service providers, the engine acts as a centralized coordinator for all interactions among these atomic services. Decentralized execution allows independent sub-workflows to interact with each other without any centralized control, which can reduce the amount of network traffic. Mangala Gowri Nanda et al. [60] present an algorithm that partitions a composite service written in BPEL into independent sub-processes. Each service provider must host a BPEL engine. Their experimental results show that decentralized execution can increase throughput substantially. Roger Weber et al. [100] present a peer-to-peer based execution system. In this system, when a node finishes its part of the work, the data is migrated to nodes offering a suitable service for one of the next steps in the process. Boualem Benatallah et al. [6] present an environment in which a composite service can be executed in a decentralized way within a dynamic environment.

Monitoring service and workflow execution. One issue in service execution is that a service selected in the workflow may be unavailable or temporarily off-line. The execution engine then invokes an alternative service if one was defined in the workflow at the service composition stage. Service execution often takes a long time to complete, so monitoring the execution status is another important issue: service requesters may require a monitoring service so that they can query the status of their requested services. Experience from grid computing research may be adopted in SOA to build a reliable infrastructure for service execution.
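Two of the execution issues listed above, non-blocking invocation and falling back to an alternative service while exposing a status that can be queried, can be sketched together. This is an illustration using Python's standard `asyncio` module, not the mechanism of Axis or of any engine cited here; `call_service` is a stand-in that performs no network I/O, and the service names are hypothetical.

```python
import asyncio


async def call_service(name, delay, fail=False):
    """Stand-in for a remote invocation (hypothetical; no network I/O)."""
    await asyncio.sleep(delay)
    if fail:
        raise ConnectionError(f"{name} is unavailable")
    return f"result from {name}"


async def invoke_with_fallback(primary, alternative, status):
    """Call the primary service; if it is unavailable, call the alternative.
    `status` is a dict that a monitoring component could poll."""
    status["state"] = "running primary"
    try:
        result = await primary()
    except ConnectionError:
        status["state"] = "running alternative"
        result = await alternative()
    status["state"] = "done"
    return result


async def main():
    status = {"state": "pending"}
    task = asyncio.create_task(invoke_with_fallback(
        lambda: call_service("primary_blast", 0.01, fail=True),
        lambda: call_service("mirror_blast", 0.01),
        status,
    ))
    # The requester is not blocked: it can do other work or poll the status.
    observed = [status["state"]]
    while not task.done():
        await asyncio.sleep(0.005)
        observed.append(status["state"])
    return await task, observed


result, observed = asyncio.run(main())
print(result)
print(observed)
```

The shared `status` dictionary plays the role of the monitoring service in this sketch; in a distributed setting the status would have to be exposed through its own service interface.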

2.4 Service-oriented computing in e-Science

An individual life sciences researcher or research group starts a scientific project

by developing hypotheses, designing experiments to test those hypotheses, collect-

ing observational data, and publishing results. The published data allows other

researchers to build upon or verify the results. With the assistance of computer software, users can import the raw data, click on buttons, and retrieve the results. The analysis process, however, requires certain knowledge of how to use these toolkits and how to access these data from different locations. Even for users who possess this knowledge, this manual analysis process is a bottleneck when large data sets are involved. As the World Wide Web becomes a platform for scientific study (e-Science), research data can be published on the web to be shared with other researchers. These data can be distributed in various formats (such as RDBMS tables, text files, or XML documents) depending on the preferences and needs of research groups. Manually accessing these data files becomes difficult as these

data may come from different institutes, different research groups, and in different

formats. There is a need for a methodology that frees users from having to locate

the data sources, interact with each data source, and manually combine data in

multiple formats from multiple sources. Applying semantic web and web services

technologies to support life sciences research becomes a promising solution to this

difficulty.

As the adoption of web services in the life sciences field grows, many large public resource sites are publishing web service interfaces in WSDL format to make their data and analysis tools accessible to the research community (see Table 2.3).

TABLE 2.3

LIFE SCIENCES RESOURCES AVAILABLE AS WEB SERVICES

Service Provider | Description | Resources URL
NCBI (the National Center for Biotechnology Information) | Provides a variety of E-Utility web services to allow data retrieval against the NCBI database using WSDL and SOAP | http://www.ncbi.nlm.nih.gov/entrez/query/static/esoaphelp.html
EMBL-EBI (the European Bioinformatics Institute) | Provides a number of web services for data retrieval, data analysis tools, and ontology lookup using WSDL and SOAP | http://www.ebi.ac.uk/Tools/webservices/
DDBJ (the DNA Data Bank of Japan) | Provides web services for data retrieval and data analysis against the DDBJ database using WSDL and SOAP | http://xml.nig.ac.jp/index.html
KEGG (the Kyoto Encyclopedia of Genes and Genomes) | Provides web services for data retrieval and data analysis against the KEGG database | http://www.genome.ad.jp/kegg/soap/
SeqHound | Provides web services for data retrieval from the sequence and structure database | http://www.blueprint.org/seqhound/seqhounddocumentation.html

In e-Science, a number of legacy data analysis tools are designed as command-line applications. Soaplab 5, developed by EBI, is a SOAP-based web service utility used to wrap such command-line applications as web services. Recently, service-oriented computing middleware capable of supporting life science experiments has been developed. We believe that an “ideal” service-oriented architecture should allow service and data providers to publish their information in registries with semantically defined properties using domain ontologies; allow not only experts but also end users to define their workflows at a high level of abstraction using vocabulary provided in the domain ontology; allow execution of the workflow and monitoring of the execution process; and allow full or partial reuse of existing workflows and support data provenance management. Several workflow management systems have been developed to meet this goal.

Discovery Net 6 is a service-oriented computing system for knowledge discovery, based on an open architecture that re-uses common protocols and common infrastructures such as the Globus Toolkit. It is a multidisciplinary project serving application scientists from various fields including biology, combinatorial chemistry, renewable energy research and geology. The system allows service providers to publish data mining and data analysis software components and make them available as services. It allows data owners to provide interfaces and access to scientific databases, data stores, sensors and experimental results as services. It also allows users (scientists) to plan, manage, share and execute complex knowledge discovery and data analysis procedures. Besides re-using common protocols and infrastructure, Discovery Net defines its own protocol, the Discovery Process Markup Language (DPML), for constructing and managing knowledge discovery procedures, as well as recording their history. The defined data analysis task (scientific workflow) can be executed on distributed resources, stored, shared, and re-executed.

5 http://www.ebi.ac.uk/Tools/webservices/soaplab/overview
6 http://www.discovery-on-the.net/

Pegasus 7 [34] [23] [2] is a framework that enables the mapping of complex

scientific workflows onto the Grid. In the Pegasus system, an abstract work-

flow is a workflow in which the workflow activities (software components) are

independent of the Grid resources used to execute the activities. The abstract

workflow depicts the main steps in the scientific analysis including the data used

and generated, but does not include information about the resource needed for

execution. The abstract workflows can be constructed by using Chimera – VDS

(the GriPhyN Virtual Data System) 8 – or can be written by users from a workflow

template.

7 http://pegasus.isi.edu/
8 http://www.ci.uchicago.edu/wiki/bin/view/VDS/VDSWeb/WebMain

The concrete workflow represents an executable workflow that includes details of the execution environment. It also includes the data movement necessary to stage data in and out of the computations. Other nodes in the concrete workflow may include data publication activities, in which newly derived data products are published into the Grid environment. A major focus of research on mapping abstract workflows to concrete workflows in the Grid computing environment is how to find an appropriate, currently registered resource for each step. Extra service components, such as data transfer and data registration in the grid environment, may have to be encapsulated in the workflow. This mapping process may be automated with algorithms and AI planning technologies if the resources are semantically well described. During the mapping process, the workflow may be restructured, reordered, and refined to improve overall performance and to adapt to dynamically changing execution environments. The concrete workflow can be given to Condor’s DAGMan 9 for execution.
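The mapping from an abstract to a concrete workflow described above can be sketched as follows. This is not Pegasus's algorithm or API; it is a minimal illustration that assumes a catalog listing which sites offer each component and a record of where each data item currently resides, and it simply picks the first registered site. All component, site, and data names are hypothetical.

```python
def to_concrete(abstract_steps, resource_catalog, data_locations):
    """Bind each abstract step to a site that offers its component and insert
    a transfer node for every input that is stored at a different site."""
    concrete = []
    locations = dict(data_locations)  # copy so the caller's record is untouched
    for step in abstract_steps:
        sites = resource_catalog.get(step["component"])
        if not sites:
            raise LookupError(f"no resource registered for {step['component']}")
        site = sites[0]  # naive choice: first registered site
        for item in step["inputs"]:
            if locations[item] != site:
                concrete.append(("transfer", item, locations[item], site))
                locations[item] = site
        concrete.append(("run", step["component"], site))
        for item in step["outputs"]:
            locations[item] = site  # outputs are produced where the step ran
    return concrete


# Hypothetical abstract workflow: no sites or transfers are mentioned.
abstract = [
    {"component": "align", "inputs": ["sequences"], "outputs": ["alignment"]},
    {"component": "build_tree", "inputs": ["alignment"], "outputs": ["tree"]},
]
catalog = {"align": ["site_a"], "build_tree": ["site_b"]}
data = {"sequences": "archive"}

concrete = to_concrete(abstract, catalog, data)
for node in concrete:
    print(node)
```

The two transfer nodes in the result correspond to the data movement that the concrete workflow adds; a real mapper would also choose among sites and could restructure the workflow for performance.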

myGrid 10 is service-oriented computing middleware that supports life sciences researchers in the construction, execution, and sharing of scientific workflows using the Taverna 11 workbench. Researchers can use the graphical workbench to drag and drop service components into the model explorer. Recent myGrid developments focus on supporting users in the discovery and composition of services by using rich service annotations to make workflow design more accessible to non-expert users. With semantic web technology incorporated, services and workflows can be described using domain-specific ontologies. This is a valuable capability in a system that potentially searches over thousands of services. Instead of locating available Grid resources, the semantic web enabled service annotation and discovery in myGrid is used to locate software components or data that are exposed as web services. The executable workflow is written in the XScufl language and executed in the Freefluo workflow engine. Life sciences researchers can monitor the execution status through the Taverna workbench. In the myGrid system, the Feta data model is used to represent the semantic description of available services [50]. Web services are annotated with terms from an OWL-based myGrid domain ontology [103] using a GUI-based interface, Pedro [33]. This approach is more lightweight than the OWL-S and WSMO ontologies, although it is less expressive of details that could better support automation. Although the description methods adopted in myGrid have limited expressivity, they are sufficient for describing most services, and their simplicity makes them more practical for describing a large number of services.

9 http://www.cs.wisc.edu/condor/dagman/
10 http://www.mygrid.org.uk
11 http://taverna.sf.net

The IRIS project [74] is another project that targets the discovery, composition, and interoperability of services required within in silico life science experiments. The IRIS project uses a semi-automatic procedure for identifying and placing customizable adapters (mediators) into workflows built by service composition. In IRIS, the capabilities of a mediator are described using the Mediator Profile Language (MPL). MPL is developed as a top-level ontology using the Web Ontology Language (OWL).

BioMoby 12 is an open source research project which aims to generate an architecture for the discovery and distribution of biological data through web services [101]. Decentralized data and services are registered at a centralized registry called MOBY Central. The BioMoby project focuses on service description, discovery, transaction, and simple input/output object type definitions. This foundational set of functions allows client programs to expand on the specification to include additional new features. There are two development tracks with different architectures, MOBY-Services (MOBY-S) and Semantic-MOBY (S-MOBY). The BioMoby project recently integrated access to many BioMoby features into the Taverna workbench interface using a Taverna plug-in. Users are guided through the construction of syntactically and semantically correct workflows from the graphical interface [44].
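The idea of decentralized services registered at a central registry and discovered by their input and output object types can be sketched in a few lines. This is a toy model, not the MOBY Central API; the class, method names, endpoints, and type names are all assumptions made for illustration.

```python
class CentralRegistry:
    """Toy central registry: providers register, clients search by object type."""

    def __init__(self):
        self._services = []

    def register(self, name, endpoint, input_type, output_type):
        """Record a service hosted elsewhere; only its description is stored."""
        self._services.append({"name": name, "endpoint": endpoint,
                               "input": input_type, "output": output_type})

    def find(self, input_type=None, output_type=None):
        """Return names of services matching the given types, in registration order."""
        return [s["name"] for s in self._services
                if (input_type is None or s["input"] == input_type)
                and (output_type is None or s["output"] == output_type)]


registry = CentralRegistry()
# Hypothetical providers and endpoints.
registry.register("get_fasta", "http://example.org/a", "accession_id", "fasta_sequence")
registry.register("get_genbank", "http://example.org/b", "accession_id", "genbank_record")
registry.register("run_blast", "http://example.net/c", "fasta_sequence", "blast_report")

by_input = registry.find(input_type="accession_id")
by_both = registry.find(input_type="fasta_sequence", output_type="blast_report")
print(by_input)
print(by_both)
```

Exact string matching stands in for the object type definitions mentioned above; a registry with a type hierarchy could also return services that accept a more general type.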

Open Middleware Infrastructure Institute UK (OMII-UK) 13 is a project that aims to provide software and support to enable collaboration in building infrastructure for the UK e-Science community and its international collaborators. The OMII environment integrates other open-source software components to provide users with a secure web services hosting and execution environment. Users can deploy web services at two levels in the OMII server architecture: as a normal Axis web service or as a secure web service with WS-Security support. GridSAM provides a web service for submitting and monitoring jobs managed by a variety of Distributed Resource Managers (DRM). Its modular design allows third parties to provide submission and file-transfer plug-ins to GridSAM. The OMII environment also integrates GRIMOIRES, a registry service, to provide descriptions of services and workflows. The GRIMOIRES implementation extends the UDDI specification and provides not only syntactic but also semantic descriptions. The OGSA-DAI middleware provides data integration and a secure infrastructure for exposing data resources as web services in a grid or any other context. WSRF::Lite, the follow-on to OGSI::Lite, is a Perl implementation of the Web Services Resource Framework (WSRF), which was inspired by and supersedes the Open Grid Services Infrastructure (OGSI). WSRF::Lite supports the following web service specifications: WS-Addressing, WS-ResourceProperties, WS-ResourceLifetime, WS-BaseFaults, WS-ServiceGroup, and WS-Security.

12 http://biomoby.open-bio.org/
13 http://www.omii.ac.uk/

2.5 Conclusion

In this chapter, we introduce several concepts related to SOA and discuss the integration of these technologies to solve some open issues in SOA research. Semantic web technology is applied with the intent of automating the web service discovery and composition process with little or no human guidance. The challenges are: 1) defining a high-quality domain ontology; 2) interoperability of ontologies across different domains; 3) correct annotation of large numbers of web services and data using the ontology; and 4) an agreed-on definition of service composability, soundness, and scalability. The use of AI planning technologies in the service composition process has been studied largely at the theoretical level and is often demonstrated with a well-defined, small domain, such as a travel agency, instead of large real-world applications.

Services provided in the Grid architecture, in particular by the Globus Toolkit, can be exposed through a web services interface and composed into a workflow. Combining SOA with Grid computing technology allows the creation of virtual organizations and groups, provides a service-oriented architecture that is more efficient and flexible in resource allocation and data transfer (for example, with the gftp tool), and enables an increased level of privacy inside and between virtual organizations. As Grid computing and service-oriented architecture converge, many standards and specifications are constantly being expanded, updated, and refined, and they become obsolete rather rapidly, so it is hard to keep up with them. For example, the Open Grid Services Infrastructure (OGSI) was published by the Global Grid Forum (GGF) as a proposed recommendation in June 2003. It was intended to provide an infrastructure layer for the Open Grid Services Architecture (OGSA). OGSI is now obsolete and has been superseded by the Web Services Resource Framework (WSRF). With the release of GT4, the open source toolkit is migrating back to a pure web services implementation (rather than OGSI) via integration of the WSRF.

Applying peer-to-peer technology can help to avoid central failure and increase

the scalability during the service discovery and workflow execution process.

Service-oriented computing is a new research area, with many in-progress frameworks and middleware, workflow specifications, WS-* standards, and ontological representations that have been presented without complete tool support. Many areas of research still need to be addressed in order to build a complete, reliable, and ideal service-oriented architecture.

CHAPTER 3

A SERVICE-ORIENTED DATA INTEGRATION AND ANALYSIS

ENVIRONMENT FOR BIOINFORMATICS RESEARCH

In this chapter 1, we present a practical experiment of building a service-

oriented system upon current web services technologies and bioinformatics mid-

dleware. The system allows scientists to extract data from heterogeneous data

sources and generate phylogenetic comparisons automatically. This can be diffi-

cult to accomplish using manual search tools since sequence data is rapidly ac-

cumulating and those manual tools will need to be repeatedly invoked as that

new data becomes available. A web-based environment enables scientists to more

effectively define a task, perform the task at a desired time, monitor the execu-

tion status, and view the results. The first prototype of this system is evaluated

on a phylogenetic research application, Mother of Green (MoG). Our evaluation

demonstrates that a service-oriented architecture can accelerate scientific research,

increase research productivity, and provide a new approach to doing science. We

also discuss issues in design and implementation of the system and identify our

future research directions to enhance the system.

1 Portions of this chapter appear in the 40th Annual Hawaii International Conference on System Sciences, HICSS40, Hawaii, 2007 [110]

3.1 Introduction

As biological research is becoming increasingly data driven, scientists are con-

ducting experiments using the cyberinfrastructure (in silico experiments) to gather

information in public online databases and to test their hypotheses. These hetero-

geneous, independently developed data sources make traditional approaches in-

sufficient for this type of research and experimentation. Complex queries against

several of these databases may provide valuable new insights, but interoperability

problems make this difficult. The researcher must often manually cut and paste

data from one database resource to another and repeatedly use multiple tools to

format and analyze the data, a process that may take days or weeks. In many

investigations, the process stops once the scientist requires a workflow that is not

feasible using manual retrieval and analysis.

There is a demand for a methodology that frees users from having to locate

the data sources, interact with each data source, and manually combine data

in multiple formats from multiple sources. A promising solution to achieve the

seamless interoperability among these data sources and analysis tools relies on the

emerging technology of service-oriented architecture (SOA). SOA has been recog-

nized during the past few years as an approach to achieve interoperability among

multiple data sources [91] [92]. Many large bioinformatics database providers,

such as NCBI, EMBL, and DDBJ, already make their databases available via a SOA. Emerging toolkits and platforms, such as Soaplab [87], enable many data analysis

tools to be wrapped as web services. These existing services permit software engi-

neers to build unified interfaces for scientists to access heterogeneous data sources.

The platform independent feature of SOA makes it a feasible solution to integrate

increasingly available data analysis tools.

While there are protocols, toolkits, and middleware that are increasingly avail-

able to address the majority of the technical issues in building a data integration

and data analysis environment, the question of how real world problems can be

solved successfully using these technologies needs to be answered through practical

implementations in a real world context. In this chapter, we describe the design

and implementation of a web-based data integration and analysis environment.

The underlying infrastructure is built upon current web service technologies and

bioinformatics middleware to enable biologists to better utilize heterogeneous ge-

nomic data. The first prototype of the system is used in a phylogenetic research

application, the Mother of Green (MoG). MoG is a collaborative research project

on plastid phylogenetic analysis involving information technologists and biologists.

Genomic sequence data is accumulating faster than scientists can find and ana-

lyze it using manual search tools. The SOA-based platform allows scientists to

extract data and analyze phylogenetic comparisons automatically. The web-based

environment enables scientists to more effectively define a task, perform the task

at a desired time, monitor the execution status, and view the results. The over-

all aim of this project is to provide an easy-to-use environment for biologists to

research the puzzle of plastid phylogeny and to answer an open question on the

phylogenetic history of the plastid genome.

In the rest of this chapter, we briefly review web service technologies and

related work followed by an overview of the MoG project and a description of the

overall system architecture. We then describe a prototype implementation of the

system, related issues, and extensions of the system.

3.2 Related work

The service-oriented architecture (SOA) was proposed initially as an emerging

paradigm for business process integration inside or across organization boundaries.

It is gaining significant attention from the scientific research community for use

in building e-science infrastructures. The proposed standard in grid computing,

Open Grid Service Architecture (OGSA) [63], is built upon service-oriented ar-

chitecture and demonstrates the convergence of the Grid with SOA. Three basic

standards in SOA, Simple Object Access Protocol (SOAP), Web Services De-

scription Language (WSDL), and Universal Description, Discovery and Integra-

tion (UDDI), are sufficient for providing simple atomic services. However, single

atomic services are not adequate for developing complex applications. One of the

most important features of SOA is that services developed in different groups can

be combined as a workflow to solve complicated problems. This feature leads to

several research issues and challenges, including service discovery, service composition, and service enactment. Semantic web technology [54] [7] and peer-to-peer technology are used in SOA to automate the service discovery process and make service enactment more reliable.

BioMOBY is an open source research project which aims to generate an archi-

tecture for the discovery and distribution of biological data through web services

[101]. Decentralized data and services are registered at a centralized registry called

MOBY Central. The BioMOBY project focuses on service description, discovery, transaction, and simple input/output object type definitions. This foundational set of functions allows client programs to expand on the specification to include additional new features. There are two development tracks with different architectures, MOBY-Services (MOBY-S) and Semantic-MOBY (S-MOBY).

REMORA [14] is a web server implementation based on the BioMOBY service

specification. It provides life science researchers with an easy-to-use workflow

generator and launcher, a repository of predefined workflows and a survey system.

Another project, myGrid, provides e-Science application developers a toolkit

based upon a high-level middleware layer. It builds on and extends the Grid framework of distributed computing through a SOA. It provides not only a semantics-based service discovery system but also the Taverna workbench [65], personalized data repositories, provenance, and update notification. The direct users of myGrid are developers who build applications using the myGrid toolkit [94]. Compared

to the BioMOBY project, myGrid has more ambitious goals. Bioinformaticians,

tool builders and service providers can collectively or selectively employ these mid-

dleware services to produce applications that support research in the biological

and life sciences [36].

The IRIS [74] project is another active project that targets the service discov-

ery, composition, and interoperability of services required within in silico exper-

iments. The IRIS project handles this problem through a semi-automatic pro-

cedure for identifying and placing customizable adapters into workflows built by

service composition.

Web Service for Bioinformatic Analysis Workflow (WsBAW) [106] and Bioinformatic Workflow Builder Interface (BioWBI) [9] are two applications provided by IBM alphaWorks. WsBAW is an application that automates bioinformatic workflows by deploying a web service. It consists of a client application through which users are able to send batch requests to a specific bioinformatic workflow execution engine, such as BioWBI, by using a web service. BioWBI is an easy-to-use, web-based working environment in which a life sciences researcher or a research community can build and execute bioinformatic workflows and share their analysis processes.

3.3 Motivation

The motivating application is the phylogenomics of the plastid. Named the

Mother of Green (MoG) project by a multidisciplinary team of computer scientists and biologists, MoG aims to identify the most recent common ancestor of all

plastids. While many biologists support the view that all plastids are descended

from a single endosymbiont ancestor, the data are not conclusive due to missing

information and inefficient use of existing information. Using the nucleotide and amino acid sequences of expressed genes to infer ancient ancestral relationships, MoG

investigators hope to identify which of the ancestral plastid genes have traveled

into the host nucleus and why some genes are more likely to be transferred than

others. The rate of data accumulation, the rapid development of new phyloge-

netic analysis tools, and the refinement of existing tools simply overwhelm the


researchers. The biologists need a better approach than manual or ad-hoc script-

ing to accumulate and analyze enough relevant data to rigorously test the single

ancestor hypothesis.

3.3.1 Use case

A typical phylogenetic analysis process consisting of multiple manual data

collection and data analysis steps is described below and shown in Figure 3.1.

[Figure 3.1: data-driven research workflow of a typical in silico investigation. A: Query complete genome sequences given a taxon. B: Query protein coding genes for each genome sequence. C: Eliminate vector sequences. D: Sequence alignment. E: Phylogenetic analysis.]

Figure 3.1. A manual phylogenetic data collection and data analysis process

A) Biologists send a query to a data provider, NCBI for example, through a

web-based interface to retrieve the whole genome sequence of a specified taxon.

After recording the query terms and results, the investigator must examine the list

of sequences, delete inappropriate entries and then add new entries based on their

knowledge of plastid phylogenomics or from sequences generated in their own lab.


B) For each whole genome sequence, biologists need to find specific protein

coding genes, or the specific subunits of protein coding genes, or specific active

sites within a specific gene or subunit. This is an iterative process for each entry

in the list.

C) Each nucleotide sequence must be checked for vector sequences, a com-

mon contaminant of nucleotide sequences in unvetted public databases, and any

detected vector contaminants removed.

D) Biologists then choose a subset of these genes and use a sequence alignment program (e.g., ClustalW) to align the sequences. After viewing the results,

biologists may decide to choose another subset for sequence alignment analysis or

continue the comparison using phylogenetic tree building tools.

E) Once the initial sequence alignment results prove satisfactory, biologists

convert the alignment output to the appropriate data format required by the

phylogenetic analysis programs, such as PAUP or Phylip.

3.3.2 Operational barriers

The data retrieval and data analysis processes need to be repeated multiple

times, as different hypotheses are evaluated and new data pours into the public

databases. From an operational perspective, this repetition makes the research

process time consuming or even impossible using manual approaches. Other bar-

riers also make this particular scientific research process even more difficult.

• Data collection The capabilities offered by a data retrieval system can-

not always meet the requirements of scientists. Entrez [61] is a web-based

data retrieval system available from NCBI that provides integrated access

to multiple databases covering a variety of data domains, including com-


plete genomes, nucleotide and protein sequences, gene sequences, three-

dimensional molecular structures, literature, and more. However, sometimes

scientists are not able to get desired information with a simple query. For

instance, “find all of the subunits for the plastid ATP synthase” requires

that the investigator first identify the official protein names of all subunits

of which there are many (atp alpha, atp beta, atp gamma, atp delta, atp epsilon, and so on) for the plastid-specific ATP synthase. The next process is

to retrieve these sequences for each new genome and to merge these data

with the data previously retrieved.

• Analysis tool usage Each data analysis program may have different require-

ments for input data formats even for programs providing similar function-

alities. Correct use of these programs and correct implementation of this

workflow relies heavily on the researcher having detailed knowledge and un-

derstanding of each tool. A typical work unit might be: “find all of the

sequences for atp synthase alpha subunit that are most similar to the atp

alpha synthase sequence found in Prochlorococcus, align the sequence using

ClustalW, save that output, then reformat the data and submit the sequences

to Phylip for phylogenetic analysis”. The output from one data analysis pro-

gram needs to be fed into the next program as its input with appropriate

conversion to the required data format. The rapid development of new data

analysis tools and the refinement of existing tools make the manual data

conversion process even more difficult.

• Experimental record keeping Accurate recording of an in silico investigation,

including materials, methods, and results is as important as accurate record-

ing of bench top or field experiments. Keeping the provenance data, includ-


ing the input, output, and intermediate data sets is also critical. Manual

organization of these metadata quickly approaches impossibility for anything

but the most trivial of queries.

An easy-to-use environment is essential to support the automation of deep phylogenetic analysis. For many years the data were sparse. Now

mountains of data exist but our limited 20th Century tools do not properly equip

us to mine for the gems within them. Automation has become necessary.

3.4 System architecture

The whole system, MoGServ, includes an underlying infrastructure, the MoGServ middle layer, and a web-based environment that provides an easy-to-use interface for scientists to access functions provided by the middle layer. The system acts

as both service consumer and service provider in the context of SOA. While it

consumes and aggregates services provided by other service providers, the system

also provides services that can be used and integrated by other applications.

There are two roles in the design and implementation of the system, end-

users and software developers. End-users are biologists who focus on the study

of what information needs to be gathered and what data analysis needs to be

performed. The software developers are responsible for several tasks based on end-user requirements: collecting and annotating available services; creating services

to implement functions in the specific application; building workflows to automate

a variety of tasks required by end-users; providing a flexible, high performance,

fault-tolerant infrastructure to execute the workflows; providing a mechanism for

end-users to keep track of the origin of the data (data provenance); and providing

end-users a web interface to configure a task, monitor the execution status, and


view results. An overview of the MoGServ system architecture is given in Figure

3.2.

[Figure 3.2: layered diagram. Services access client: Web Interface, Applications. MoGServ middle layer: Application Server, Data Access Services, Data Analysis Services, Job Manager, Job Launcher, Service/Workflow Registry, Metadata Search, Local Data Storage, Workflow/SOAP Engines, Services. Data/services providers: NCBI, DDBJ, EMBL, Others.]

Figure 3.2. MoGServ system architecture includes a services access client, MoGServ middle layer, and other data and services providers

3.4.1 Data storage and access service

Data collection from multiple distributed data resources is one of the first steps

of a bioinformatics research project. In the MoG project, an in silico experiment

involves the collection of large data sets, a computationally and memory-intensive process that involves daily checking for new information and quality control for each new sequence detected. Some data service providers limit the number of connections to their data server for performance reasons. The refresh rate of the data in a data source is much lower than the rate of end-user requests for the data. Therefore, local data storage is required to store biological data collected from remote data providers, to avoid repeated vetting of the same data, and to ensure access to the data for time-sensitive projects. The biological data from remote data

sources is gathered, aggregated, and integrated into the local database through a

set of data access services.

An in silico experiment also requires the integration of results from numerous

data analysis tools. Recording the intermediate data in the local database allows

MoGServ to preserve the data provenance and provides opportunity for end-users

to keep track of where a piece of data has come from. The information stored in

the local database can be accessed through a set of data access services.

3.4.2 Service and workflow registry

A service and workflow registry provides a repository to store descriptions of

services and workflows that may be used in a phylogenetic study. These services

and workflows include both locally constructed and preexisting services. The reg-

istry also provides functions to allow inquiries about services or workflows. In the

first prototype, neither a UDDI-based registry nor semantic-based descriptions are employed. A UDDI-type registry is more business-oriented and may not be a perfect fit for this application, while a semantic-based description requires more time because a commonly used ontology must first be defined. The current registry is a simple table

with focus on capturing both functional and non-functional properties of services

and workflows to support service selections, service and workflow enactment, and

provenance data representation. Semantic-based description and inquiry provides the attractive capability of automating service discovery and will be used in the next version of MoGServ. The description of a service or workflow includes attributes as shown in Table 3.1.

TABLE 3.1

ATTRIBUTES FOR SERVICES AND WORKFLOWS DESCRIPTION

Attribute           Description
id                  a unique sequence number assigned to the service/workflow during the registration process
name                the name of the service or workflow
text description    description of the functions provided by the service or workflow
location            the URL of the definition of the workflow or WSDL location of the service
input/output        description of input/output parameters
provider            the name of the service or workflow provider
version             the version of the service or workflow implementation
algorithm           the algorithm used in the service or workflow implementation
invocation method   the method used to execute the service or workflow

When end-users view results from their experiments, they may ask a question such as “which algorithm was used to generate the data and what is the source

of the data?” Service consumers may prefer a service or workflow based on their

preference for a particular algorithm or provider. For example, a sequence align-

ment service can be implemented using the Sequence Alignment and Modeling

System (SAM) or ClustalW.
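As an illustration of how such a table-based registry could support selection by algorithm or provider, the following is a minimal Python sketch. It is not the MoGServ implementation: the field names simply mirror the attributes of Table 3.1, and the entries, names, and example.org URLs are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryEntry:
    """One registry row; fields mirror the attributes listed in Table 3.1."""
    id: int
    name: str
    description: str
    location: str            # WSDL location or workflow definition URL
    inputs_outputs: str
    provider: str
    version: str
    algorithm: str
    invocation_method: str   # e.g. "synchronous" or "asynchronous"


def select(entries, **criteria):
    """Return the entries whose attributes equal every given criterion."""
    return [e for e in entries
            if all(getattr(e, key) == value for key, value in criteria.items())]


# Hypothetical entries; the URLs are placeholders, not real endpoints.
REGISTRY = [
    RegistryEntry(1, "align_clustalw", "multiple sequence alignment",
                  "http://example.org/services/clustalw?wsdl",
                  "FASTA in, alignment out", "MoGServ", "1.0",
                  "ClustalW", "asynchronous"),
    RegistryEntry(2, "align_sam", "multiple sequence alignment",
                  "http://example.org/services/sam?wsdl",
                  "FASTA in, alignment out", "MoGServ", "1.0",
                  "SAM", "asynchronous"),
]

clustalw_services = select(REGISTRY, algorithm="ClustalW")
```

Because every attribute is stored with the entry, the same lookup can answer the provenance question above: given the service id recorded with a result, the registry returns the algorithm and provider that produced it.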

3.4.3 Indexing and querying metadata

The data is best managed with a relational database; however, for searching

purposes, an indexer is more efficient. We identify and extract metadata about the sequences, experiments, services, and workflow descriptions stored in the local database. For example, the metadata of a gene sequence includes the gid, accession number, name of the sequence, source organism, and taxonomy. An experiment can generate results that lead to new or more detailed information

requirements and a new series of experiments. End-users may need to know the

origin of a piece of data: “which query was used to get this subset of sequences,

when was the data generated, what process was used to generate the results”. This

may lead to new experiments using different data sets or even different methods.

These metadata are extracted and indexed by a metadata indexing service.

This service is triggered when new data is added into the database. A metadata

searching service provides functions to query an index.
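The idea of a fielded metadata index can be sketched as follows. This is a toy inverted index in Python, used here only as a stand-in for the indexer; it does not reproduce any Lucene API, and the record values (gid, accession number) are made up for illustration.

```python
from collections import defaultdict


class MetadataIndex:
    """Toy fielded inverted index: (field, token) -> set of record ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, record_id, metadata):
        """Index every whitespace-separated token of every metadata field."""
        for field, value in metadata.items():
            for token in str(value).lower().split():
                self._postings[(field, token)].add(record_id)

    def search(self, field, text):
        """Return ids of records whose field contains all query tokens."""
        tokens = text.lower().split()
        if not tokens:
            return set()
        hits = [self._postings.get((field, token), set()) for token in tokens]
        return set.intersection(*hits)


index = MetadataIndex()
# Made-up example records; the identifiers are not real database entries.
index.add("1001", {"accession": "XX000001", "name": "atpA gene",
                   "organism": "Prochlorococcus marinus"})
index.add("1002", {"accession": "XX000002", "name": "atpB gene",
                   "organism": "Synechococcus elongatus"})

matches = index.search("organism", "prochlorococcus")
```

In the described design, the `add` step corresponds to the indexing service that runs when new data enters the database, and `search` corresponds to the metadata searching service.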

3.4.4 Service and workflow enactment

The system supports both synchronous and asynchronous invocation methods. Synchronous invocation is mostly used for invocation of services

or workflows with short running times, e.g. querying sequence data or job infor-

mation in a local database.

Asynchronous invocation is used for executing long-running services and workflows. As shown in Figure 3.3, the job manager accepts the input parameters of the service/workflow, the service/workflow id, and a timer. The definition of the service or workflow is found in the registry. A job definition, including the service or workflow URL, input parameters, timer, and other job metadata (such as when and by whom the job was submitted), is stored in the database, and a job id that identifies the job is generated. The job launcher periodically checks the database to retrieve the services and workflows that are due for execution at that point in time.

Multiple workflow engines are deployed on different nodes to prevent single

engine failure and achieve higher performance. A similar mechanism is used for

deploying long running services to prevent service failure. Each node hosts a

service that is responsible for returning the current load information of the node.

This information is used by the job launcher to dispatch a job to an optimal node.

With the SOA, it is easy to distribute and invoke workflows and services remotely.

The execution status of the workflow or service is recorded into the database as

an attribute of a job description. This information can be used for implementing

failure recovery functions, such as restart. The job information accessed through

data access services allows end-users to monitor the execution and view the results.
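The interplay of job manager and job launcher described in this section can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the MoGServ code: the database is replaced by an in-memory dictionary, node load is a plain number, and all names (`JobStore`, `launch_due_jobs`, the status strings) are hypothetical.

```python
import itertools


class JobStore:
    """In-memory stand-in for the job table kept in the local database."""

    def __init__(self):
        self.jobs = {}
        self._ids = itertools.count(1)

    def submit(self, task_url, params, run_at):
        """Job manager role: store a job definition and return its job id."""
        job_id = next(self._ids)
        self.jobs[job_id] = {"url": task_url, "params": params,
                             "run_at": run_at, "status": "pending",
                             "node": None}
        return job_id

    def due(self, now):
        """Ids of pending jobs whose timer has expired."""
        return [job_id for job_id, job in self.jobs.items()
                if job["status"] == "pending" and job["run_at"] <= now]


def launch_due_jobs(store, node_loads, now):
    """Job launcher role: send each due job to the least loaded node."""
    for job_id in store.due(now):
        node = min(node_loads, key=node_loads.get)
        job = store.jobs[job_id]
        job["node"], job["status"] = node, "running"
        node_loads[node] += 1  # crude load update so that jobs spread out


store = JobStore()
first = store.submit("http://example.org/workflows/genomes", {"taxon": "x"}, run_at=10)
later = store.submit("http://example.org/workflows/align", {}, run_at=50)
loads = {"node-a": 2, "node-b": 0}
launch_due_jobs(store, loads, now=20)
```

Keeping the status in the job record is what makes monitoring and restart possible: a separate process can scan for jobs that stayed in the running state too long and resubmit them.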


[Figure 3.3: service and workflow enactment. Components shown: Input (Parameters, Task Name, Timer); Job Manager (find the service/workflow definition using the task name, form a job description); Output (Job ID); Job Information; Job Launcher; Instances of Workflow/Service Engines.]

Figure 3.3. Asynchronized services and workflow invocation model


3.5 Implementation

3.5.1 Development and deployment tools

Among a large number of programming platforms for web services develop-

ment and deployment, Microsoft’s .NET and Sun’s J2EE typically are two main

choices for applications and middleware developers. With consideration of future

extensions of the system as well as our previous experience with Java, the J2EE

based platform appeared more suitable for MoGServ. In particular, Apache’s open source tools, Tomcat (5.0.18) and Axis (1.2RC2), are used.

Tomcat/Axis are active projects with support from the open source commu-

nity. Another open source software tool, Eclipse, is used to develop the web

interface for the system.

There are more than a dozen proposed languages to coordinate messaging and

transactions among independent web services. The business process execution

language for web services (BPEL4WS) is a promising workflow language since it

has wide support from IBM, Microsoft, and BEA. Several workflow enactment

engines, such as BPWS4J, Collaxa, and ActiveBPEL, are already in place to support

the execution of workflow. While a business-oriented workflow language and cor-

responding execution engine can be used in the scientific domain [20], the Taverna

[65] project possesses more attractive features and naturally fits the development

of our system. The Taverna project is open source and a part of the myGrid

project developed in the e-Science community to support data-intensive in sil-

ico bioinformatics experiments. The Taverna workbench provides a graphical

tool for building, editing, and browsing workflows and generates an XML-based Simple Conceptual Unified Flow Language (Scufl) document. The embedded workflow execution engine, Freefluo, facilitates testing during the development process.


Freefluo, a Java workflow enactment engine, which supports the Scufl specifica-

tion, coordinates execution of the parallel and sequential activities in the workflow

and supports data iteration and nested workflows. The enactor can invoke arbi-

trary WSDL type service operations as well as more specific bioinformatics service

operations such as Soaplab and BioMOBY.

Apache Lucene [51] is used in our system for building a search engine to sup-

port full-text search on sequence data, intermediate data results, and job infor-

mation stored in the local database. Since Lucene is a search engine library

written entirely in Java instead of a command line toolkit, it provides flexibility

to write a variety of applications with rich search capabilities. These capabili-

ties include ranked searching, phrase queries, wildcard queries, proximity queries,

fielded searching, and so on.

PostgreSQL (8.0) is used to store all the intermediate data results, job infor-

mation, sequence data, and services/workflow descriptions.

3.5.2 Services provision

We create web services using the RPC style due to its easy implementation

with full support from most tools. As most bioinformatics applications take a

number of input parameters and produce a number of outputs, we use an XML

document to represent the input/output of a service for which a large number

of parameters are needed. The XML document is provided as a single input

parameter to the service or workflow and the output results are produced as a

single XML document. Using this method, the service consumers themselves

create a valid and accurate XML document for input while service providers parse

the XML and extract the input parameters.
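The single-XML-document convention can be illustrated with a short Python sketch using the standard library. The element and attribute names (`request`, `param`, `name`) are assumptions made for this example; the source does not specify the schema used in MoGServ.

```python
import xml.etree.ElementTree as ET


def build_request(params):
    """Consumer side: pack all parameters into one XML document string."""
    root = ET.Element("request")
    for name, value in params.items():
        ET.SubElement(root, "param", name=name).text = str(value)
    return ET.tostring(root, encoding="unicode")


def parse_request(document):
    """Provider side: parse the document and extract the parameters."""
    root = ET.fromstring(document)
    return {param.get("name"): param.text for param in root.findall("param")}


document = build_request({"taxon": "Cyanobacteria", "gene": "atpA"})
extracted = parse_request(document)
```

The service signature stays fixed at one string in and one string out, so adding a parameter changes only the document contents, not the WSDL interface.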


Multiple services are created and deployed on the Tomcat/Axis server using

the Java2WSDL and WSDL2Java toolkits. Individual services can be invoked

statically or dynamically through a client side application. They can also be

used as a building block in the workflow creation process. We separated services

provided in the first prototype into the following categories.

Data collection The original data source is NCBI. NCBI’s Entrez Programming

Utilities (eUtils) provide access to Entrez data outside of the regular web query interface and help retrieve search results for future use in other environments.

With the eUtils SOAP interface, we create services to get data, such as complete

genome sequences and specific genes of interest.
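For readers unfamiliar with eUtils, the following Python sketch shows how a search request and a FASTA retrieval request can be expressed. Note the assumption: it uses the URL-based E-utilities (`esearch.fcgi` and `efetch.fcgi`) rather than the SOAP interface used in the prototype, and it only builds request URLs without contacting NCBI. The query term is an arbitrary example.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nuccore", retmax=20):
    """URL that searches an Entrez database and returns matching ids."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{query}"


def efetch_fasta_url(ids, db="nuccore"):
    """URL that retrieves the given records as FASTA text."""
    query = urlencode({"db": db, "id": ",".join(ids),
                       "rettype": "fasta", "retmode": "text"})
    return f"{EUTILS}/efetch.fcgi?{query}"


search = esearch_url("chloroplast complete genome")
fetch = efetch_fasta_url(["1", "2"])
```

A data collection service would issue the first request, parse the returned id list, and pass the ids to the second request, which matches the two-step pattern of querying genomes and then retrieving sequences.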

Query local database All the intermediate data and job information are stored

in the local database to help biologists keep track of the data provenance and mon-

itor the job execution. Also in this particular application, biologists are interested

in selecting sequence subsets from the local database and using sequence alignment services to do preliminary comparisons. A set of services is implemented

to query desired information.

Indexing and querying metadata The creation and update of each of these

indices is done by a service operation. The index service is triggered whenever

new data is stored in the database. The query service accepts a query string and

an index name to search the index and return output.

Data format services Each particular data analysis tool used in a bioinformat-

ics study requires a specific data format as input. A set of data format services in

the system is implemented to convert data into an appropriate format. This type

of service can be used in a workflow creation process or used explicitly.
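As one example of what a data format service does, the sketch below converts an aligned FASTA file into sequential PHYLIP format in Python. It is an illustrative conversion, not the MoGServ service: it assumes the strict PHYLIP layout (a header line with the number of sequences and the alignment length, then names padded or truncated to 10 characters), and the function names are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Use the first word of the header line as the sequence name.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_phylip(records):
    """Format aligned (name, sequence) pairs as sequential PHYLIP text."""
    lengths = {len(sequence) for _, sequence in records}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    lines = [f" {len(records)} {lengths.pop()}"]
    for name, sequence in records:
        lines.append(f"{name[:10]:<10}{sequence}")
    return "\n".join(lines) + "\n"


aligned_fasta = ">seqA first\nAC-G\n>seqB second\nACTG\n"
phylip_text = to_phylip(parse_fasta(aligned_fasta))
```

The length check matters in a workflow setting: it rejects unaligned input early, before a long-running phylogenetic analysis is started on malformed data.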

Data analysis services Many existing data analysis tools in bioinformatics re-


search are available as command line applications. The creation of a data analysis

service is a process to wrap these toolkits as web services. JLaunch [42] is a light-

weight Java library for launching command line applications from Java programs.

With the JLaunch library, we can write Java programs to execute any type of

command line programs.
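The wrapping pattern can be sketched in Python with the standard `subprocess` module. This is an analogue of the approach, not the JLaunch API; the function name `run_tool` is hypothetical, and the Python interpreter itself stands in for an analysis tool so that the example runs anywhere.

```python
import subprocess
import sys


def run_tool(argv, stdin_text=None, timeout=3600):
    """Run a command line tool and return its standard output as text."""
    completed = subprocess.run(argv, input=stdin_text, capture_output=True,
                               text=True, timeout=timeout)
    if completed.returncode != 0:
        raise RuntimeError(
            f"{argv[0]} exited with {completed.returncode}: "
            f"{completed.stderr.strip()}")
    return completed.stdout


# Stand-in "tool": reads sequences on stdin and writes them upper-cased.
echoed = run_tool(
    [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"],
    stdin_text="acgt")
```

A web service operation would call such a function with the tool's real command line, turning a non-zero exit status into a service fault instead of returning partial output.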

3.5.3 Workflow engine

The Freefluo workflow engine is deployed on an application server. The invocation of the workflow engine is done by generating a local stub specific to the Freefluo web services API. The local stub is implemented as part of the job launcher in our system.

The execution of a workflow on the Freefluo engine proceeds as follows: 1) obtain a proxy to the remote Freefluo server; 2) create a Scufl model; 3) pass an XScufl workflow to the Scufl model and form the input using the Baclava data model, a representation of Taverna data types (see footnote 2); 4) compile the XScufl workflow as a workflow instance; 5) execute the workflow instance and obtain an ID from the server; 6) poll the Freefluo engine until the execution has completed; 7) retrieve a list of outputs from the server; 8) extract the required output from the Baclava data model; and 9) destroy the workflow instance.
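The control flow of these steps (compile, execute, poll, collect, destroy) can be sketched as follows in Python. The engine interface here (`compile`, `execute`, `status`, `outputs`, `destroy`) and the status strings are hypothetical and do not reproduce the Freefluo web services API; `FakeEngine` is a test double that lets the sketch run without a server.

```python
import time


def run_workflow(engine, xscufl, inputs, poll_seconds=1.0, max_polls=1000):
    """Compile, execute, and poll a workflow, then return its outputs."""
    instance_id = engine.compile(xscufl, inputs)        # steps 2-4
    try:
        engine.execute(instance_id)                     # step 5
        for _ in range(max_polls):                      # step 6
            status = engine.status(instance_id)
            if status in ("COMPLETE", "FAILED"):
                break
            time.sleep(poll_seconds)
        else:
            raise TimeoutError(f"workflow {instance_id} did not finish")
        if status == "FAILED":
            raise RuntimeError(f"workflow {instance_id} failed")
        return engine.outputs(instance_id)              # steps 7-8
    finally:
        engine.destroy(instance_id)                     # step 9


class FakeEngine:
    """Test double that reports completion on the second status check."""

    def __init__(self):
        self.polls = 0
        self.destroyed = False
        self.inputs = None

    def compile(self, xscufl, inputs):
        self.inputs = inputs
        return "wf-1"

    def execute(self, instance_id):
        pass

    def status(self, instance_id):
        self.polls += 1
        return "COMPLETE" if self.polls >= 2 else "RUNNING"

    def outputs(self, instance_id):
        return {"echo": self.inputs}

    def destroy(self, instance_id):
        self.destroyed = True


engine = FakeEngine()
result = run_workflow(engine, "<scufl/>", {"term": "x"}, poll_seconds=0)
```

Placing the destroy call in a `finally` block ensures the workflow instance is released on the server even when the run fails or times out.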

3.5.4 Building workflows

A Scufl workflow represents a procedure as a set of processes and the rela-

tionships between these processes. Our workflow design uses available services

as building blocks whenever possible and creates new ones when necessary. The

(Footnote 2: http://taverna.sourceforge.net/index.php?doc=usingbaclava.html)

Taverna workbench provides a graphical tool to build and test workflow as well

as a number of integrated bioinformatics services. The Scufl language has some

useful features such as implicit iteration and conditional branching that are most

important for building workflows in this application. During the construction of workflows, we often encounter the case that the output of one service does not exactly match the input of the next chosen service. One approach we take is to create a new service, such as the type of data format service described above, and expose it in the same way as other services. An alternative approach provided in the Taverna workbench is to use Beanshell scripts [4] to convert the output to the appropriate input. We create a number of workflows using the Taverna

workbench to support the research. One example is shown in Figure 3.4. It is

a workflow used to retrieve a complete genome sequence and particular gene se-

quences from the NCBI site. The workflow accepts two inputs, the query term and

the particular gene group. The service genome gids by terms returns a String

of gids and a Beanshell script converts the String to a list of gids. The service

Get Nucleotide Fasta, a third party service, accepts a gid and returns a sequence

in FASTA format. The implicit iteration method in the XScufl workflow enables iteration over all the gids in the list. With the service-oriented architecture, the same

services can be used for different workflows, minimizing the need to create new

services.
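The two mechanisms used in this workflow, a small adapter script and implicit iteration, can be mimicked in Python as below. This is an analogy only: the real adapter is a Beanshell script inside Taverna, the delimiter of the gid string is an assumption (commas or whitespace), and `fake_get_fasta` is a hypothetical stand-in for the third-party retrieval service.

```python
def split_gids(gid_string):
    """Adapter: turn a delimited string of gids into a list of gids."""
    return gid_string.replace(",", " ").split()


def iterate(service, items):
    """Implicit iteration: apply a single-item service to every list item."""
    return [service(item) for item in items]


def fake_get_fasta(gid):
    """Stand-in for a retrieval service; returns a dummy FASTA record."""
    return f">gi|{gid}\nNNNN"


gids = split_gids("11, 22,33")
sequences = iterate(fake_get_fasta, gids)
```

The service itself accepts only one gid; the iteration is supplied by the workflow engine, which is why the same single-item service can be reused unchanged in other workflows.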

3.5.5 Web interface

The web interface provides scientists a convenient interface to configure their

tasks, monitor the job execution status and view results. It is implemented

with a number of server side JSPs (Java Server Pages). The returned results


are transformed with appropriate XSLT to HTML pages. The service-oriented architecture provides the flexibility to build the front-end web application in a different language, e.g. Perl, and to deploy it on a different web service engine, e.g. Apache/SOAP::Lite.

3.6 Discussion

Although current development and deployment tools have not implemented all the features claimed in the service-oriented architecture specification, they are actively evolving toward that goal. In particular, Apache Tomcat/Axis, the

Taverna workbench, and Freefluo engine enabled the implementation of our first

prototype.

In general, SOA offers considerable benefits for building the system: 1) the loosely coupled nature of SOA facilitates the distribution of computationally intensive processes across multiple nodes; 2) the platform independence of SOA facilitates the integration of data from heterogeneous data resources through distributed web services; 3) the composition of services allows reuse of a service in multiple workflows, minimizing the need to create new services; and 4) SOA also provides the flexibility to build the front-end web application in a different language, e.g. Perl, and to deploy it on a different web service engine, e.g. Apache/SOAP::Lite.

While we believe a simple SOA is appropriate in the design and

implementation of our system, there are various aspects of the system that need

to be improved. We summarize issues and the directions to enhance the system

in this section.


3.6.1 Issues with the first prototype

Security Although security was not our major concern during the first proto-

type implementation, it is an important component in the next implementation.

Services and workflows provided in the system allow users to access the compu-

tational and data resources in the system with no restrictions. A certain level

of security is required to prevent abuse of the system and to protect sensitive

data and analysis results. An authorization component should be built in the sys-

tem to enable users to access the permitted services and to personalize their own

workspace. A web portal will be built to enable users to create an account and to log in and log out with a username and password. The user account information including

the access level will be stored in a database. The GridSphere portal framework

[39], an open-source portlet based web portal, is one of the candidates.

Service and workflow description and selection In the first prototype implementation, the same development group acts in both the service provider role (services/workflow creation) and the service consumer role (building the web-based application using these services and workflows). Because there is no demand for supporting the selection of appropriate services/workflows, the major capability of the index-based services/workflows registry is to keep track of data provenance and to provide the definitions needed to execute services/workflows.

However, the index-based syntactic descriptions of services/workflows provide limited flexibility for third party service consumers to choose appropriate services/workflows provided in the system and to integrate them into their applications without prior knowledge.

Failure tolerance and recovery The workflow or service execution may fail

at some point due to failure of the enactment engine, failure of the service, or failure of the network fabric [64]. Our system handles these failures during

the static workflow design stage and services or workflows invocation stage.

Multiple workflow engines and long running services are deployed on different

physical locations. This allows a submitted task to be invoked on the least loaded

site to achieve higher performance. More importantly, this approach can prevent

dispatching services/workflows to the engine with a physical failure. Recording

execution status of long running services/workflows in the database allows us to

add policies for determining if a failed service/workflow should be restarted. The

Taverna workbench and Xscufl provide a capability that allows users to specify

an alternate service and to configure basic fault tolerance mechanisms during the

workflow design stage, which can prevent the failure of services to a certain degree.
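A restart policy combined with a statically specified alternate service can be sketched generically in Python. This is not Taverna's configuration mechanism nor the MoGServ code; the function name and the retry count are hypothetical, and services are modeled as plain callables.

```python
def invoke_with_fallback(primary, alternates=(), retries=1):
    """Call the primary service, retrying it, then try each alternate."""
    errors = []
    for service in (primary, *alternates):
        for _ in range(retries + 1):
            try:
                return service()
            except Exception as exc:  # a real policy would filter error types
                errors.append(exc)
    raise RuntimeError(f"all services failed after {len(errors)} attempts")


attempts = []


def flaky_service():
    """Stand-in for a service on a failed node; always raises."""
    attempts.append("primary")
    raise ConnectionError("node unreachable")


def alternate_service():
    """Stand-in for an equivalent service deployed elsewhere."""
    attempts.append("alternate")
    return "alignment result"


outcome = invoke_with_fallback(flaky_service, [alternate_service], retries=1)
```

Because the alternates are fixed in advance, this corresponds to fault tolerance configured at design time; choosing an alternate dynamically at execution time would require the richer semantic descriptions discussed next.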

Another more promising, yet more complicated approach for failure recovery

is to support the dynamic selection of alternate services during execution time.

However, the implementation of this feature requires services to be described in

rich semantic formats using a widely accepted ontology.

Data provenance In the system, the metadata descriptions of sequences, job information, and services/workflows are stored in the database. A set of indexing and querying services allows end-users to trace the origin of the data, which is a desired feature for scientists. Also, the workflow engine and XScufl provide

mechanisms to record more detailed information including the type of processor,

status, start and end time, and a description of the service operation. A sys-

tems administrator may be interested in using this information to investigate how

results, in particular erroneous or unexpected ones, were produced by workflow

processes.

3.6.2 Extension of the system

Although the first prototype of the system focuses on design and implemen-

tation based on relatively mature technologies in service-oriented architecture,

we are extending the system to address some issues described above with grid

computing and semantic web technologies.

Grid technologies specify the mechanisms for distributed resource manage-

ment, coordinated fail-over, and security. As the Grid technologies, and Grid

framework Globus toolkit [97] in particular, are evolving towards the OGSA stan-

dard, integration of the Grid technologies into the system can help address some

issues discussed above. The convergence of service-oriented architecture and Grid

technology allows us to enhance the system through the integration of existing

components.

In a scientific domain, the process used to generate the output of a service and

workflow is often as important as the result. As is the case with bench scientists,

in silico investigators will decide for themselves which methods and which data

will be used for their study as well as what kind of outputs they are expecting.

In the first prototype implementation, this requirement is satisfied through close

collaboration among team members.

As this system will be used by a phylogenomics research community that

spans multiple disciplines, different investigators will have their own methods for

approaching problems of common interest. A mechanism that allows end-users

to define the workflow at a higher level of abstraction is required. Instead of

choosing specific services to form a workflow, scientists would rather define a

workflow by specifying functions that a service should provide. Different levels of

training and experience also require different levels of abstraction. For example, a

graduate student in a particular research domain may have limited knowledge of

the methods available to perform an experiment, while an experienced investigator

may know ahead of time which building blocks are required and which approach

is most efficient for the scientific hypothesis to be tested. We represent different

abstraction levels in Figure 3.6. End-users may need to define the workflow at

any one of these four stages based on their knowledge of provided services.

A concrete workflow, which can be sent to a workflow engine, is represented

at the fourth phase. The conversion from the third phase to the fourth phase is

related to choosing an instance of a service with Quality of Service (QoS) metrics.

One service interface may have multiple implementations provided by different

service providers. These implementations have different quality properties such

as trustworthiness, cost, execution time, and so on. An optimal service should be

chosen during this conversion process. The conversion from the second phase to

the third phase requires mapping a particular task to a service, or a sequence of

multiple services.
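
The conversion from the third phase to the fourth phase can be illustrated with a small sketch. It assumes that each implementation of a service interface advertises three QoS values (trustworthiness, cost, and execution time) and that a weighted sum is an acceptable ranking. The ServiceInstance record, the weights, and the normalization are hypothetical choices made for this example, not a method prescribed by the system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceInstance:
    """One implementation of a service interface (illustrative fields)."""
    provider: str
    trust: float      # 0..1, higher is better
    cost: float       # arbitrary units, lower is better
    exec_time: float  # seconds, lower is better


def select_instance(instances: List[ServiceInstance],
                    w_trust: float = 0.5,
                    w_cost: float = 0.25,
                    w_time: float = 0.25) -> ServiceInstance:
    """Pick the instance with the best weighted QoS score.

    Cost and time are divided by the largest value among the candidates
    so that the three metrics are comparable. Raises ValueError on an
    empty candidate list.
    """
    max_cost = max(s.cost for s in instances) or 1.0
    max_time = max(s.exec_time for s in instances) or 1.0

    def score(s: ServiceInstance) -> float:
        return (w_trust * s.trust
                - w_cost * s.cost / max_cost
                - w_time * s.exec_time / max_time)

    return max(instances, key=score)
```

Changing the weights changes which provider is chosen, which is one simple way to express that different users value the quality properties differently.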

This mapping process can be accomplished manually by software developers

in an ad-hoc way, like the approach we took in the implementation of the first

prototype. This approach relies heavily on developers’ knowledge of services and

logical ordering in the workflow.

Preferably, this process should be performed partially or wholly automatically.

To support this semi-automatic or automatic process, a complete representation

of knowledge should be in place to allow software agents to take over the work

of the human. Using semantic web technology, in particular OWL and OWL-S, to

represent domain knowledge ontologically and to describe services semantically

is a promising approach. Semantic web technology offers promising features

for supporting bioinformatics research [12]. Some

bioinformatics middleware, such as the myGrid and BioMoby projects, have their

own approaches to support automated discovery and composition of services using

semantic web technology [49]. Much research has been done exploring AI planning

techniques for automation of the composition process. In the long term, a

successful composition mechanism should meet several requirements: connectivity,

quality of service, correctness, and scalability [58].

Although there are still practical difficulties in developing semantic web services,

we believe that the appearance of tools for creating ontologies and annotating

services [89], together with the development of widely accepted domain ontologies,

allows us to add semantics into our system and support the automation of the

mapping process.

3.7 Conclusion

As both data and tool providers begin to present their resources with web

services interfaces, and as open source tools and middleware for supporting web

services, workflow generation, and enactment become more available, biologists

will begin to use those available services, as well as begin to provide service access

to their databases and programs for sharing within the bioinformatic community

[65]. Our system is a demonstration of progress toward this goal.

In summary, current SOA standards and toolkits are sufficient to build the first

prototype of MoGServ. MoGServ is in its early stage of development with limited

services and workflows available. The basic implemented functionalities enable the

user to collect data and do preliminary data analysis as well as metadata searching.

By using the system, scientists were able to gain some scientific insight into the

alpha subunit of ATP synthase; the analysis indicates that it retains the signal of a very

ancient line of descent while having enough polymorphism to infer phylogenetic

relationships [78].

Building the system upon the SOA gives us the flexibility to integrate services,

to build a variety of workflows, and to build a web portal for scientists to access

the system via a web interface. New features and services are continuously being

added to the system in response to scientists’ feedback and requirements. The

future direction of our research will be to focus on enhancing the system using

semantic web and grid computing technologies.

Figure 3.4. A workflow built using Taverna workbench to get complete genome sequences and specific gene sequences

Figure 3.5. A workflow for querying two subset sequences from local database, filtering out sequences coming from same organism, and doing sequence alignment analysis

Figure 3.6. Abstraction of user defined workflows

CHAPTER 4

EXPLORING THE DEEP PHYLOGENY OF THE PLASTIDS WITH THE

MOGSERV

In this chapter, we illustrate a research application that uses the MoGServ to

investigate the deep phylogeny of the plastids and attempts to answer an open

question on phylogenetic history of the plastid genome.

4.1 Introduction

Plastids are important organelles found only in plants and algae. Chloroplasts

are the photosynthetic form of a plastid. Like mitochondria, plastids

have their own DNA and are involved in energy metabolism. Other forms of a

plastid may be responsible for storage of products like starch and for the synthesis

of many classes of molecules such as fatty acids which are needed as cellular

building blocks and/or for the functioning of the plant.

Phylogenetics is the study of the evolutionary relationship among various

groups of organisms. The origin and evolution of a group of organisms is called

phylogeny or phylogenesis.

The endosymbiont hypothesis suggests that mitochondria were free-living bacteria

that were engulfed and subsequently enslaved by a primitive ancestor of

all living eukaryotes [27] [69]. Between 1.2 and 1.5 Ga (billion years ago), one

or more of these early eukaryotic cell lineages captured a cyanobacterium and

produced three primary plastid lineages: green plant lineage (chlorophytes), red

algal lineage (rhodophytes), and glaucophyte lineage (a group of freshwater algae)

[69]. Surviving endosymbionts include the green algal and red algal photosynthetic

chloroplasts and the cyanelle, the endosymbiont in the glaucophytes that retains

more of the character of a cyanobacterial progenitor. Plastids have also spread

by secondary endosymbiosis, in which a cell engulfs a cell already containing an

endosymbiont. In secondary endosymbiosis, the nuclear genome of the engulfed

cell usually disappears. Seven lineages are produced from green algae and red algae

in secondary endosymbiosis [69]; see Figure A.1. The evolution of these secondary

plastids involves the reduction of their genomes by gene transfer into the nucleus

[77]. The red algal lineage also includes organisms that have lost the capacity

to photosynthesize but still retain a degenerate plastid. Apicomplexans produced

from the red algal lineage are non-photosynthetic intracellular parasites whose

members include Toxoplasma gondii and Plasmodium falciparum.

Plasmodium falciparum (P. falciparum) is a protozoan parasite, a type of

apicomplexan, that causes malaria in humans. P. falciparum has three genomes:

nuclear, mitochondrial, and plastid (apicoplast). Phylogenetic analysis of plastid

genes provides a new way for targeted antiparasitic drug design [31].

Organisms generally inherit genes from their parents (Vertical Gene Trans-

fer), or receive genes from other organisms through Horizontal Gene Transfer

(HGT) and Lateral Gene Transfer (LGT). Most plastid genomes are circular, do

not recombine and are inherited through only one parent. The highly conservative

character of the plastid genome makes phylogenetic analysis possible. However,

the HGT and the LGT in multiple endosymbiotic events complicate the phylogeny

of plastids. Another complication is gene duplication and loss within the plastid

itself. While there is a broad consensus that all plastids are descended from a sin-

gle endosymbiont ancestor, some researchers also suggest an alternative hypothesis

of multiple origins that is “at least equally consistent in most cases” [95]. Plastid

phylogenetic analysis must account for multiple endosymbiotic events, superim-

posed upon a process of LGT that occurs throughout the process of converting a

free-living cell to an endosymbiont. Accumulating and analyzing enough data to

rigorously test the single ancestor hypothesis is a promising research direction to

take.

The development of advanced sequencing techniques makes a large amount of

DNA and amino acid sequences available for phylogenetic analysis. The commonly

used methods for inferring phylogenies include parsimony, maximum likelihood,

and Bayesian inference. The rate of data accumulation, the rapid development

of new phylogenetic analysis tools, and the refinement of existing tools, however,

make manually collecting and analyzing these sequences difficult. For example,

the number of cyanobacteria sequences in the NCBI database increased from 42 to 57

within about 6 months (June 2006 - December 2006). Figure 4.1 shows the growth

of sequence databases in the last few years [57].

In this chapter, we describe a scientific application that uses the cyberinfras-

tructure to collect and analyze data to gain biological meaning from this data

analysis. The use of a web-based system, MoGServ, shows the ability to signifi-

cantly increase a scientist’s productivity over using a manual process.

Figure 4.1. The growth of sequence databases (NCBI GenBank and EBI Swiss-Prot) and annotations. This figure is from Folker Meyer [57]

4.2 System and methods

MoGServ is a service-oriented environment described in detail in Chapter 3.

It facilitates scientific research and discovery in several ways:

• Easy and rapid extraction of DNA and protein sequence from public databases

to a local database which saves scientists months of repetitive searching,

downloading, and data management.

• Painless reformatting of the extracted data for commonly used analytical

tools.

• Preliminary data inspection and analysis using these tools within the web-

services environment which permits inspection of many conserved gene can-

didates, enabling the investigator to rapidly determine the suitability of the

chosen gene for deep phylogenetic analysis.

• User-specified additions to the local database which allows the upload of

sequences into the local database.

• User-specified additions to the automated queries which provides a free-text

searching interface for constructing data sets of interest.

Deep phylogenetic analyses are highly context-dependent. The addition of a

single new cyanobacterial or algal genome would fundamentally change the result.

MoGServ permits an investigator to address these hypotheses using the most

current data available and rapidly reanalyze data as more genomes and genes are

sequenced. This enables rapid hypothesis testing and creates an environment in

which genuine discovery is possible. The most exciting form of discovery is the

surprise result, the result that leads to an entirely new hypothesis.

4.2.1 Data model

As web services technologies have been used by several large data source

providers, such as DDBJ, EMBL, and NCBI, to leverage their data and computational

services, accessing up-to-date sequences becomes more flexible and feasible.

Because the data sources are accessed via the Internet, however, the requirement

of efficient and reliable on-the-fly data retrieval cannot be fulfilled easily.

Additional requirements of data manipulation and information management

cannot be met without local database support. Also, incomplete annotation and

misannotation in biological data require biologists’ expertise to ensure

accuracy before the data is used for analysis.

In order to provide the capability of storing, integrating, and accessing se-

quences from diverse data sources, a data model needs to be developed to meet

the following requirements:

• Store sequences from distributed data sources and provide general annota-

tion to facilitate querying.

• Ensure the integrity of sequences during the data collection process and the

efficiency of updating the database periodically.

• Provide an easy way for scientists to manipulate their data sets and manage

their scientific experiment records and data provenance.

The custom data model of MoGServ consists of four modules: sequence module,

set module, user module, and job module. Figure 4.2 shows the entity-relationship

(ER) diagram. An alternative data model, the Chado database schema, one of

the components of GMOD 1, is the foundation of interoperability of GMOD

applications. We did not use the Chado database schema because it contains

a large number of modules and tables that are not necessary for our system; also it

does not model some of the information we are trying to capture in this system.

A sequence in the system is a biological sequence that comes either from a

public data resource or from laboratory experiments uploaded by scientists.

A sequence can be a nucleotide sequence or protein sequence. It can also

be a complete genome sequence or a gene product sequence. Each sequence

can be classified using a taxonomy defined by the public database and terms

that scientists used to find this sequence.

A set is a group of sequences that scientists put together to support their research

interests and usually is used for subsequent data analysis. The properties

1 Generic Model Organism Database http://www.gmod.org/, an open source project to develop a set of software for creating and administering a model organism database

of a set not only contain the sequences in the set but also the provenance

of the set. For example, a set may be created by users querying the local

database, or generated from the previous data analysis.

A job is defined as a task involving data collection or data analysis. It contains

input, output, execution status, and other properties.
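
To make the sequence, set, and job entities concrete, the following sketch expresses them as plain record types. The field names are assumptions chosen to mirror the prose above; they are not taken from the actual MoGServ schema in Figure 4.2, and the user module is reduced to a single owner field.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sequence:
    """A biological sequence from a public source or a user upload."""
    accession: str
    kind: str                  # "nucleotide" or "protein"
    taxonomy: str              # classification from the public database
    query_terms: List[str] = field(default_factory=list)


@dataclass
class SequenceSet:
    """A group of sequences plus the provenance of the grouping."""
    name: str
    owner: str                 # stands in for the user module
    provenance: str            # e.g. "local query" or "analysis output"
    members: List[Sequence] = field(default_factory=list)


@dataclass
class Job:
    """A data collection or data analysis task."""
    task: str                  # "collection" or "analysis"
    input_set: str
    output: str = ""
    status: str = "PENDING"
```

Keeping provenance on the set rather than on each sequence reflects the description above, in which a set records whether it came from a local query or from a previous analysis.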

4.2.2 Services

MoGServ provides a number of services to support deep phylogenetic research.

These services are integrated in the system and accessible from a web interface.

They can also be used as components to be integrated into a workflow.

4.2.3 Data collection

Data collection is a suite of services used to retrieve desired data from public

data providers into the local database. The data collection service updates the

local database periodically. Users can define the query term using any NCBI

query format from the web interface provided in the MoGServ system. Users can

also define a particular gene name or gene products as shown in Figure B.2. The

retrieved data is indexed using the Lucene indexer and search engine [51], which

supports free text search. The search syntax is shown in Figure D.4.

The data collection suite consists of 5 components: retrieve genome sequence,

retrieve gene sequence, convert genome sequence to file, convert name sequences,

and index sequences. When combined with the database model, these services

ensure data integrity, data accuracy, data consistency, and exception handling.

Data integrity: Since users are allowed to use any appropriate query term to

search the public database (NCBI), it is highly likely that the same sequence

will be retrieved with different query terms. The classification

and taxonomy of sequences in the public data sources also produce duplicated

sequences. For example, the query terms “chloroplast” and “cyanobacteria” both

get the sequence

>gi|72381840|ref|NC_007335.1|Prochlorococcus marinus str. NATL2A.

“chloroplast”, “cyanobacteria”, and “plastid” get the sequence

>gi|42592260|ref|NC_003070.5|Arabidopsis thaliana chromosome 1.

“apicoplast”, “chloroplast”, and “plastid” get the sequence

>gi|31442363|ref|NC_004823.1|Eimeria tenella chloroplast.

The design of the data model ensures that the same sequence cannot be inserted

into the table twice. However, the query term used to get the sequence

should be recorded in the query by term field. This information may help

scientists better understand the relationships and discover new insights.
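
One way to obtain this behavior is to key the sequence table on the accession and to accumulate query terms for an existing row instead of inserting a duplicate. The sketch below uses an in-memory SQLite database purely for illustration. The table layout, the column names, and the comma-separated encoding of terms are assumptions, not the MoGServ schema.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical sequence table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequence ("
        " accession TEXT PRIMARY KEY,"   # prevents a second insert
        " description TEXT,"
        " query_terms TEXT NOT NULL)"    # comma-separated list of terms
    )
    return conn


def record_hit(conn, accession, description, term):
    """Insert a new sequence, or add the term to an existing one."""
    row = conn.execute(
        "SELECT query_terms FROM sequence WHERE accession = ?",
        (accession,),
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO sequence VALUES (?, ?, ?)",
            (accession, description, term),
        )
    else:
        terms = row[0].split(",")
        if term not in terms:
            terms.append(term)
            conn.execute(
                "UPDATE sequence SET query_terms = ? WHERE accession = ?",
                (",".join(terms), accession),
            )
    conn.commit()
```

Retrieving the same accession under “chloroplast” and then under “cyanobacteria” leaves a single row whose term list contains both queries.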

Data accuracy: Since scientists are interested in particular gene sequences that

reside in a range of a complete genome sequence, instead of searching the

gene database of NCBI, we choose to parse the XML file of a complete

genome sequence to get particular gene sequences and gene products. In

this way, the accuracy of the data may be guaranteed. The NCBI service

provides the search result in XML format; an example is shown in Fig-

ure D.1. The data collection service provided in the MoGServ parses the

XML file to find the INSDFeature_key tag for each INSDFeature and then

checks whether it is a CDS (CoDing Sequence, i.e., a region of nucleotides that corresponds

to the sequence of amino acids in the predicted protein). The next

step is to find the INSDQualifier_name and INSDQualifier_value pair for

the gene name (e.g., atpD) or gene product description (e.g., ATP synthase

subunit B). Gene names and gene product descriptions of scientific interest

are defined by users through the web interface. In most cases, gene

names are enough to get the desired CDS. However, due to the incomplete

annotation of the CDS, gene names may not be available for a particular

CDS in the nucleotide sequence. The gene product description then becomes another

criterion for getting an accurate CDS; an example of a gene sequence in

fasta and TinySeq XML format is shown in Appendix D.2.
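
The parsing steps above can be sketched with the Python standard library. The element names (INSDFeature, INSDFeature_key, INSDFeature_location, INSDQualifier_name, INSDQualifier_value) follow the NCBI INSDSeq XML format. The function name, its arguments, and the case-insensitive substring match on the product description are assumptions made for this example and do not reproduce the MoGServ Java code.

```python
import xml.etree.ElementTree as ET


def find_cds(xml_text, gene_names=(), product_keywords=()):
    """Return (location, gene, product) for each matching CDS feature.

    A CDS matches when its gene qualifier is one of gene_names, or when
    its product qualifier contains one of product_keywords. The second
    test covers CDS entries whose gene name is missing.
    """
    root = ET.fromstring(xml_text)
    hits = []
    for feature in root.iter("INSDFeature"):
        if feature.findtext("INSDFeature_key") != "CDS":
            continue
        # Collect the name/value qualifier pairs of this feature.
        quals = {}
        for q in feature.iter("INSDQualifier"):
            name = q.findtext("INSDQualifier_name")
            quals[name] = q.findtext("INSDQualifier_value") or ""
        gene = quals.get("gene", "")
        product = quals.get("product", "")
        by_gene = gene in gene_names
        by_product = any(k.lower() in product.lower()
                         for k in product_keywords)
        if by_gene or by_product:
            location = feature.findtext("INSDFeature_location")
            hits.append((location, gene, product))
    return hits
```

A call such as find_cds(xml_text, gene_names=("atpD",), product_keywords=("ATP synthase subunit B",)) would return the CDS annotated with the gene name as well as one that carries only the product description.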

Exception handling: Since the data is retrieved from a remote data source, NCBI,

using web service interfaces, failures may occur because of the network,

hardware, or services themselves. Recording the execution status of a data

collection service is important for detecting a failure and recovering when one

occurs. Since the data collection service normally runs periodically as a

batch job, we record the status in a log file on the file system. In order to

reduce the repetitive work when a failure occurs, we treat retrieving a single

sequence as a transaction. In other words, we sacrifice the database I/O performance

that batched transactions could provide.
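
The trade-off can be illustrated with a short sketch in which every sequence is committed in its own transaction and each outcome is logged. It assumes a SQLite connection with a hypothetical table sequence(accession, residues) and a caller-supplied fetch function standing in for the remote NCBI call. None of these names come from the MoGServ code.

```python
import logging
import sqlite3

log = logging.getLogger("collector")


def collect(conn, accessions, fetch):
    """Store each sequence in its own transaction.

    fetch is a hypothetical callable that returns the sequence data for
    an accession and may raise on network or service failure. A failure
    affects only the current sequence, and earlier ones stay committed.
    """
    stored, failed = [], []
    for acc in accessions:
        try:
            data = fetch(acc)
            with conn:  # commits on success, rolls back on exception
                conn.execute(
                    "INSERT INTO sequence (accession, residues) "
                    "VALUES (?, ?)",
                    (acc, data),
                )
            stored.append(acc)
            log.info("stored %s", acc)
        except Exception as exc:
            failed.append(acc)
            log.error("failed %s: %s", acc, exc)
    return stored, failed
```

After a failed run, only the accessions in the failed list need to be retried, at the cost of one commit per sequence instead of one per batch.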

Data consistency: Data analysis is an important component provided by the

MoGServ. Different data analysis tools require different data formats of

a set for their input. There are two ways to provide the desired data format,

converting the data on-the-fly or preparing the data and storing it in the

database during the data collection process. The first approach is flexible;

however, it may result in inconsistent naming problems for sequences. For

example, the same sequence in set A may have a different name in set B.

Therefore, we use an algorithm to map sequence names during the data collection

process. Each sequence has a fixed name for each format. Duplicate

names are distinguished by appending a number to the end of the name.
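
A minimal sketch of such a naming step is given below. The exact rules used in MoGServ are not specified here, so the choices in this example (replacing whitespace with underscores and appending _2, _3, and so on to repeated names) are assumptions.

```python
def assign_names(labels):
    """Map labels to unique, stable names in input order.

    The first occurrence keeps the base name, and later duplicates get a
    numeric suffix. A set of used names guards against a suffixed name
    colliding with another label.
    """
    used = set()
    names = []
    for label in labels:
        base = "_".join(label.split())
        name, n = base, 1
        while name in used:
            n += 1
            name = f"{base}_{n}"
        used.add(name)
        names.append(name)
    return names
```

Because the names are assigned once at collection time and stored, a sequence keeps the same name in every set that later includes it.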

4.2.4 Local query

After desired sequences are stored in the local database, users need a way to

find a subset of interest among these sequences in order to perform further data analysis.

The system provides an interface for users to query the local database using

free text searching. The underlying search engine is built with the Lucene search

library. The content in the index includes metadata that are used to describe a

sequence, such as taxonomy, term, and name. For example, users can use

a query “atp synthase AND B AND plastid” to get a number of sequences (see

Figure B.4). Users can manipulate these returned sequences and group these

sequences as a set. Users can also download these sequences in a variety of formats.

4.2.5 Set management

In order to help scientists prepare the data set for subsequent data analysis,

MoGServ provides set management services to:

Create set: With an appropriate query to the local database, users can look into

the list of sequences returned from the query and delete undesired sequences.

Users can create a new set using these sequences. These sequences can also

be added into an existing set.

Upload set: Users can upload a set of sequences in fasta format into the local

database. These sequences can be from users’ own lab experiments, which

may not be ready to submit to the public database. They can also be a

small number of sequences not in the local database at that time. These

sequences are annotated using the appropriate metadata description.

Show set: Users can query the information of a set as shown in Figure B.6, such

as the creation date, the origin of the set, etc.

Download set: Users can download a set in a variety of formats, such as fasta

format and NEXUS format.

Set filter: This service finds the intersection of all

the organisms (species) across a number of sets that contain gene or protein

sequences from different species. The purpose of this service is to help scientists

prepare data to determine whether the gene genealogies for the subunits are

different. For example, scientists may be interested in determining whether gene

genealogies for the subunits α, β, γ, δ, ε of ATP synthase CF1 are

different. The first step is to form 5 sets using queries such as “ATP AND

synthase AND delta AND CF1.” The next step is to use the set filter service to find

all organisms (species) that contain all of the gene or protein sequence

types. These sequence sets will be used in the subsequent data analysis, such

as using ClustalW to construct phylogenetic trees. While constructing a

phylogenetic tree based on the analysis on a single gene or protein taken

from a group of organisms (species) can be problematic, the analysis based

on multiple unrelated gene or protein sequences may increase the soundness

of the results.
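
The set filter operation amounts to intersecting the organisms of several sets and keeping only the shared ones in each set. The following sketch assumes each set is represented as a mapping from organism name to sequence. This representation and the function name are illustrative only.

```python
def common_organisms(sets):
    """Restrict every set to the organisms present in all sets.

    sets maps a set name (e.g. one per ATP synthase subunit) to a
    mapping of organism name -> sequence.
    """
    organism_groups = [set(members) for members in sets.values()]
    if not organism_groups:
        return {}
    shared = set.intersection(*organism_groups)
    return {
        name: {org: members[org] for org in sorted(shared)}
        for name, members in sets.items()
    }
```

With one input set per subunit, the filtered sets contain exactly the organisms for which all subunit sequences are available, so the per-subunit trees can be compared over the same taxa.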

4.2.6 ClustalW

Multiple alignments of sequences provide information to identify the conserved

sequence regions. ClustalW is a tool for global multiple alignment (across their

entire length) of DNA and protein sequences. EMBL-EBI provides a SOAP-based

web service that allows programmatic access to the data analysis tool [72]. Two

other services, T-Coffee and Muscle, are implemented using newer algorithms to

improve the accuracy and achieve higher performance. Based on users’ preference,

we integrated the ClustalW service into the MoGServ.

The integration of a service in the system is done by creating a new Java-based

program using a web service interface to invoke the remote service. Instead of

copying, pasting, or uploading a sequence file, users can set up the parameters

from a web interface as shown in Figure B.9. These parameters are accepted and

combined into an XML file that is sent to the new program as input; an example

file is shown in D.6. The input and output are stored into the database, so the

information can be queried later and displayed with XSLT.
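
As an illustration of combining web-form parameters into an XML job description, consider the sketch below. The actual input format is the one shown in D.6. The element names used here (clustalwJob, inputSet, parameters, param) and the example parameter names are hypothetical and chosen only to show the idea.

```python
import xml.etree.ElementTree as ET


def build_job_xml(set_name, params):
    """Serialize a set name and a parameter mapping to an XML string."""
    root = ET.Element("clustalwJob")            # hypothetical root element
    ET.SubElement(root, "inputSet").text = set_name
    params_el = ET.SubElement(root, "parameters")
    for key, value in sorted(params.items()):   # sorted for stable output
        ET.SubElement(params_el, "param", name=key).text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Storing this document together with the job output is what makes it possible to query and redisplay the exact settings of an earlier analysis.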

The input and output information are delivered with XML/XSLT. The output

from ClustalW includes phylogram tree, cladogram tree, distance, and ph file

based on the parameter setting. The binary results can be viewed using a Java-

based multiple alignment editor, Jalview [16].

4.2.7 Blast

The Basic Local Alignment Search Tool (BLAST), as implemented

at NCBI [1], is one of the most widely used bioinformatics programs. It

compares nucleotide or protein sequences to sequence databases and calculates

the statistical significance of matches. With well designed queries and alignments,

the results of BLAST can be used to infer functional and evolutionary relationships

between sequences and may provide important clues to the function of uncharacterized

sequences. There are several alternative implementations, WU-BLAST

2, FSA-BLAST 3, and parallel blast 4, available for better performance with minimum

loss of sensitivity. EBI and NCBI provide web-based WU-BLAST and/or

NCBI-BLAST. However, these could not meet the requirements of this particular application

in two respects: 1) a large number of sequences needs to be downloaded, then

copied and pasted into the interface; 2) sequences can only be compared

against databases at EBI or NCBI, so users cannot define their own datasets

to conduct comparisons.

BLAST requires two inputs: a query sequence (also called the

target sequence) and a sequence database. BLAST will find subsequences in the

query that are similar to subsequences in the database.

Hosting a service on the MoGServ eliminates these two limitations. Users

can define the compare set and database sets. The result is stored in the local

database. The job information is accessible any time when needed. The service

has two execution methods, synchronous and asynchronous, which are the same

for every data analysis service provided in MoGServ. Similar to the ClustalW

service, the Blast service accepts input in XML format, as shown in D.7. A

tblastn web interface is shown in Figure B.8.
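
The two execution methods can be sketched with a thread pool: a synchronous call blocks until the result is available, while an asynchronous call returns a job identifier that can be polled later. The JobRunner class, its method names, and the status strings are assumptions made for this illustration. MoGServ stores job state in its database rather than in memory.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class JobRunner:
    """Run analysis tasks either synchronously or asynchronously."""

    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs = {}  # job id -> Future

    def run_sync(self, task, *args):
        """Block until the task finishes and return its result."""
        return task(*args)

    def submit_async(self, task, *args):
        """Start the task in the background and return a job id."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(task, *args)
        return job_id

    def status(self, job_id):
        """Report RUNNING, DONE, or FAILED for a submitted job."""
        future = self._jobs[job_id]
        if not future.done():
            return "RUNNING"
        return "FAILED" if future.exception() else "DONE"

    def result(self, job_id):
        """Wait for the job and return its result (or re-raise its error)."""
        return self._jobs[job_id].result()
```

The asynchronous path suits long running analyses, since the user can leave the web interface and look up the job information later.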

2 http://blast.wustl.edu/

3 http://www.fsa-blast.org/

4 http://www-users.cs.umn.edu/ rangwala/final bglBLAST.pdf

4.2.8 Phylip and Paup

PAUP* 5 is a program for phylogenetic analysis using parsimony, maximum

likelihood, and distance methods. The program features an extensive selection of

analysis options and model choices, and accommodates DNA, RNA, protein and

general data types. Among the many strengths of the program is the rich array

of options for dealing with phylogenetic trees including importing, combining,

comparing, constraining, rooting and testing hypotheses.

PAUP* uses the NEXUS file format, which is a modular format used by sev-

eral programs. All versions require data and commands to be present in the

NEXUS format (with the exception that commands can additionally be executed

interactively from the command prompt).

PHYLIP 6 is a set of modular programs for performing numerous types of

phylogenetic analysis. Individual programs are broadly grouped into several cat-

egories: molecular sequence methods; distance matrix methods; analyses of gene

frequencies and continuous characters; discrete characters methods; and tree drawing,

consensus, tree editing, and tree distances. Together the programs accommodate

a broad range of data types including DNA, RNA, protein, restriction sites,

and general data types. The programs encompass a broad variety of analysis types

including parsimony, compatibility, distance, invariants and maximum likelihood,

and also include both jackknife and bootstrap re-sampling methods. Therefore

for a typical analysis the user makes choices regarding each aspect of an analysis

and chooses specific programs accordingly. Programs are run interactively via a

text-based interface that provides a list of choices and prompts users for input.

5 http://paup.csit.fsu.edu/

6 http://evolution.genetics.washington.edu/phylip.html

Phylogenetic trees generated from these phylogenetic analysis tools can be

viewed using TreeView 7 [68]. TreeView is a simple program that displays a phy-

logenetic tree of up to a certain number of taxa. Phylogenies may be displayed

either as slanted or rectangular cladograms. TreeView provides a way to view the

contents of NEXUS, PHYLIP, ClustalW or ClustalX, or other format tree files.

4.2.9 Data conversion

In a typical workflow, one program’s output may be used as the next program’s

input in the workflow. A data conversion step is needed in order to

make the output suitable as the input for the next program. MoGServ provides

a number of services to convert fasta format to ClustalW format, fasta format to

NEXUS format, and so on.

The program readseq 8, developed by D. Gilbert, is a reformatting program

used to reformat DNA or protein sequence data. It allows the input of single

or multiple sequences in 18 different formats and converts to a specified format.

MoGServ integrates the readseq program as a service to convert the output from

ClustalW to the NEXUS format.
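
To show what such a conversion involves, the sketch below turns aligned sequences in fasta format into a minimal NEXUS DATA block. It is a hand-written stand-in for illustration and is not how readseq works internally. The function names and the default datatype are assumptions, and the input must already be aligned so that all sequences have equal length.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from fasta text.

    The name is the first whitespace-delimited token of the header line.
    """
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else "unnamed"), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_nexus(records, datatype="protein"):
    """Format aligned (name, sequence) pairs as a NEXUS DATA block."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("NEXUS matrix needs aligned sequences of equal length")
    nchar = lengths.pop()
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(records)} NCHAR={nchar};",
        f"  FORMAT DATATYPE={datatype} MISSING=? GAP=-;",
        "  MATRIX",
    ]
    lines += [f"    {name} {seq}" for name, seq in records]
    lines += ["  ;", "END;"]
    return "\n".join(lines)
```

The length check reflects the main reason a conversion step can fail in a workflow: an unaligned set cannot be expressed as a character matrix.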

4.3 Results of case studies

The evolution of ATP synthase is considered severely constrained; the structure

of ATP synthase is shown in Figure A.1. It can be a candidate for ascertainment

of deep phylogeny. The procedure that we use to test the hypothesis is first to identify

individual subunit genealogies, then to merge and reanalyze the data.

7 http://www.molecularevolution.org/software/treeview/

8 http://iubio.bio.indiana.edu/soft/molbio/readseq/java/

4.3.1 Case study: the rediscovery of Erythrobacter litoralis

The MoGServ local database includes whole genome sequences from chloroplasts,

cyanobacteria, plastids, and apicoplasts. The biological investigator hypothesized

that the amino acid sequence of the chloroplast subunits of ATP synthase

would be a good choice for a deep phylogenetic analysis, a departure from established

procedures. DNA sequences from ribosomal genes, the protein synthesizing

machinery, are the traditional choices for deep phylogenetic analysis. Preliminary

analyses on 33 taxa revealed that the α and β subunits of this enzyme have a

stunningly high degree of amino acid sequence conservation across cyanobacterial

genomes and chloroplast genomes from a wide array of algal taxa and green plants.

As the nuclear genomes of the algal taxa are more phylogenetically distinct from

one another than humans are from fungi, this result indicated that chloroplast

ATP synthase was a suitable candidate enzyme and provided support for the single ancestor

hypothesis.

The problem now was one of excessive conservation in the α and β subunits.

Comparison of the most conserved region of the α subunit against all sequences

at NCBI revealed that this region is so conserved that it matches that of an ATP

synthase subunit in the mitochondrial genome. Phylogenetic evidence clearly

indicates that mitochondria descend from a single bacterial ancestor and that this

ancestor was related to the alpha proteobacteria, a group closely related to the

cyanobacteria. MoGServ enabled the investigator to add to the already convincing

evidence that the mitochondrial and chloroplast genomes are related. This was

not the hypothesis of interest but led the investigator to try a different approach.

The investigator then examined the amino acid sequence of the ε subunit of

ATP synthase for the same 33 taxa examined previously. Sequence conservation


was evident but somewhat less than that seen in the α and β subunits. The local

database query was relaxed to permit inclusion of the ATP synthase ε subunits of

both cyanobacteria and alpha proteobacteria. More than a dozen proteobacteria

were identified, all of which except one are nonphotosynthetic.

The surprise bacterium was Erythrobacter litoralis. This organism is a facultative

photoheterotroph, able to photosynthesize in the light and catabolize organic

sources in the dark. It was found in the Sargasso Sea in 1994 and sequenced in

2005. This discovery suggests that Mother of Green may not be a cyanobacterium

but an α-proteobacterium.

4.4 Summary

In this chapter, we detail the data and services integrated in the MoGServ

system to support deep phylogenetic investigations. We describe one

case study for a phylogenetic investigation 9. This case study shows that the

investigator is able to gather data, perform more advanced data analysis, and

discover new knowledge using the web-based environment and services

provided in the MoGServ system.

9 The case study on the use of MoGServ for a phylogenetic investigation was conducted in collaboration with Professor Jeanne Romero-Severson [78], Department of Biological Sciences, University of Notre Dame, and partially supported by the Indiana Center for Insect Genomics (ICIG) with funding from the Indiana 21st Century fund.


Figure 4.2. Entity relationship diagram of the data model in MoGServ, created by SQL::Translator


CHAPTER 5

ONTOLOGICAL REPRESENTATION MODEL

MoG (Mother of Green), a project involving deep phylogeny of plastids, in-

cludes the development of a system (MoGServ) to enable life scientists to easily

aggregate heterogeneous data and conduct data analysis using the growing array

of web-based scientific databases and analysis tools. MoGServ, an SOA-based data

integration environment, is built using current web service technology and existing

middleware for life sciences research. Based on the successful design and imple-

mentation of this prototype, in this chapter, we present an enhanced system with

semantic annotation of services and data. The enhancement aims at allowing life

science researchers to define their experiments at different levels based on their

knowledge of the tools, data, and the system. The semantically enriched data

allow easier reuse and sharing, and make it easier to conduct experiments that involve search.

While the service-oriented architecture is used in the implementation of e-Science

infrastructure, semantic web technology is attracting increasing interest

as a means of annotating life science and medical information [12]. For example, the

UniProt RDF 1 project provides all UniProt protein sequence and annotation

data in RDF. These efforts make the vision of the semantic web [7] more

1http://dev.isb-sib.ch/projects/uniprot-rdf/


practical. Other open source projects, such as HayStack 2 and SIMILE 3, aim at

delivering these semantically annotated data to web browsers. The appearance of

open source tools that support the semantic web and service-oriented computing

encourages the life science community to provide its data and analysis tools, and

to share scientific experiments, using these technologies.

5.1 The MoG life sciences project and biomedical application

As part of the Mother-of-Green (MoG) project 4 we are developing scientific

workflow tools (MoGServ) that enable end-user composed semantic web-services

to increase the interoperability of the growing array of web-based life science

databases and analysis tools. These workflow tools are built from available and

emerging open-source, open-standards technology.

The prototype problem domain that guides this project, the phylogenomics of

the plastid, includes genomic, transcriptomic, and proteomic data. Plastids are

hypothesised to be descendants of cyanobacterial ancestors captured by eukaryote

hosts. As more cyanobacterial and plastid genomes are sequenced, information

accumulates that could shed light on plastid genomics and phylogeny. One of the

major plagues of humankind, malaria, is caused by a parasite containing a plastid:

Plasmodium falciparum. A new pharmaceutical drug that disrupts the function of

this plastid (the apicoplast) might be harmless to humans, who, like all animals,

have no plastids.

Examination of the genes, the linear order of the genes, the proteins, and the

temporal order of protein expression of related organisms can suggest possible

2 http://haystack.lcs.mit.edu/
3 http://simile.mit.edu/
4 http://www.nd.edu/~mog/


apicoplast functions. The problem is the accurate identification of relatives or

even closely related plastid genes of known function. At present, the phylogeny of

the apicoplast is not clear. A phylogenomics approach requires the extraction and

analysis of genomic information from diverse scientific disciplines: plant, algal and

cyanobacterial systematics, plant biochemistry, animal parasitology, genetics and

cell biology. This phylogenomics investigation provides software design use-cases,

testing, and an opportunity for the evaluation of scientific workflow composition

tools and technology.

5.2 Ontological representation model

Metadata about services, sequences, and users’ experimental results are captured

in MoGServ to support queries from application

developers searching for appropriate services and from end-users keeping track

of their in-silico experiments. The inquiry system in the prototype is initially

based on keyword search for ease of implementation. With the

prospect of hosting MoGServ at multiple sites in the phylogenetic research com-

munity, applying the semantic web approach for representing the metadata allows

for much more focused and structured queries and the possibility to answer ques-

tions based on logical inference rather than text associations. An ontology that

describes the concepts relevant to a given domain along with properties character-

izing these concepts can meet these requirements. By relying on shared ontologies

and agreements on the definition of common concepts, data and information can

be annotated using the shared vocabularies in these ontologies.

Since most semantic web services standards are relatively mature and stable,

we build an application-specific ontology using a distributed and modularized


ontology structure and reuse some cross-domain ontologies such as the Dublin

Core 5 and other well-defined bioinformatics ontologies. The use of well-defined

ontologies could potentially increase interoperability when information is

published on the web.

There are three ontology sets that are clearly differentiated in the system: MoG

application domain ontology, which is used to represent concepts and information

unique to MoGServ system, such as jobs, sequences collections, etc; generic service

description ontology, such as OWL-S, which is used to specify generic web service

concepts such as service inputs, outputs, preconditions, and effects; and the service

domain ontology, which is designed and used for the semantic description of web

services in the bioinformatics domain.

5.2.1 RDF, OWL, and DIG reasoner

The Resource Description Framework (RDF) 6 has been proposed as a W3C

standard to enable distributed knowledge representation on the Semantic Web.

It is a graph model of the statements that encode the metadata description of

web resources, people, places, and other concepts. RDF is based on the idea of

identifying things using Uniform Resource Identifiers (URIs), and describing re-

sources in terms of simple properties and property values. This enables RDF to

represent simple statements about resources as a graph of nodes and arcs repre-

senting the resources, and their properties and values. An RDF graph is a set of

triples. Each triple consists of a subject (start node), a predicate (edge), and an

object (end node). A fact is expressed as a subject-predicate-object triple,

also known as a statement. A triple can be written as P(S, O), that is, a subject

5 http://dublincore.org
6 http://www.w3.org/TR/rdf-primer/


S has P (predicate or property) with value O. RDF/XML and Notation 3 (N3)

are two formats for representing RDF models. Figure 5.1 is an RDF graph model

that represents some information describing the MoG project web site. Facts

are expressed as subject-predicate-object triples:

<http://www.nd.edu/~mog> <#hasCreator> <#gmadey>
<#gmadey> <#hasFullName> "Gregory Madey"
Note: # stands for some URI.

The RDF/XML representation:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://someexample.org#">
  <rdf:Description rdf:about="http://www.nd.edu/~mog">
    <ex:hasCreator rdf:resource="http://someexample.org#gmadey" />
  </rdf:Description>
  <rdf:Description rdf:about="http://someexample.org#gmadey">
    <ex:hasFullName>Gregory Madey</ex:hasFullName>
  </rdf:Description>
</rdf:RDF>
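The correspondence between the RDF/XML serialization and the P(S, O) statements can be illustrated with a short sketch. It uses only the Python standard library and handles only flat rdf:Description elements, so it is not a general RDF/XML parser. The embedded document is a copy of the two statements about the MoG web site in which the ex: names are spelled out as full URIs, which is an assumption made for this sketch.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Copy of the two statements about the MoG web site; the ex: names are
# written as full URIs here (an assumption made for this sketch).
RDF_XML = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://someexample.org#">
  <rdf:Description rdf:about="http://www.nd.edu/~mog">
    <ex:hasCreator rdf:resource="http://someexample.org#gmadey"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://someexample.org#gmadey">
    <ex:hasFullName>Gregory Madey</ex:hasFullName>
  </rdf:Description>
</rdf:RDF>
"""


def extract_triples(rdf_xml):
    """Return the (subject, predicate, object) triples of a flat RDF/XML document.

    Only rdf:Description elements with rdf:about and one level of property
    elements are handled.
    """
    root = ET.fromstring(rdf_xml)
    triples = set()
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree spells a qualified tag as "{namespace}local".
            namespace, local = prop.tag[1:].split("}", 1)
            predicate = namespace + local
            resource = prop.get(f"{{{RDF_NS}}}resource")
            # A property either points at a resource or carries a literal.
            obj = resource if resource is not None else (prop.text or "").strip()
            triples.add((subject, predicate, obj))
    return triples


for s, p, o in sorted(extract_triples(RDF_XML)):
    print(f"{p}({s}, {o})")
```

Each printed line is one statement in the P(S, O) form used above.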

[Figure 5.1 graph. Nodes: http://www.nd.edu/~mog, #gmadey, "Gregory Madey", #professor, http://www.nd.edu/~gmadey, "MoG is a … project", #bioinformatics, #foundation. Edges: #hasCreator, #hasFullName, #hasTitle, #hasPersonalSite, #hasTextDescription, #hasResearchTopic, #hasFundedBy. Legend: literal, resource; # stands for the URI that provides the definition of these vocabularies.]

Figure 5.1. An RDF graph model representing some information that describes the MoG project web site


RDF Schema is a mechanism that allows developers to define a particular vocabulary

for specifying the kinds of objects to which predicates can be applied.

Predefined terms such as Class, subClassOf, and Property establish

an agreement on the semantics of specified terms and the interpretation of given

statements. The Web Ontology Language (OWL) 7 is an ontology language

for describing semantic web information that is more complex

and powerful than RDF Schema. It is built on top of the RDF graph model

with better capabilities for describing the relationships among resources and their

properties 8. The OWL language is divided into three sublanguages: OWL Lite,

OWL DL, and OWL Full. Classes (concepts), properties (roles, relationships),

and individuals (instances) are the three components of the OWL language.

Consider the interpretation of domain knowledge using a function $I$ over a domain $D^{I}$.

Domain knowledge is represented with a number of concepts, $C^{I} \subseteq D^{I}$. Each

concept may contain a number of individuals, and one individual may belong to

different concepts, $I^{I} \in D^{I}$. The relationship between two individuals is

represented as $R^{I} \subseteq D^{I} \times D^{I}$. Web data can be related by using the

definitions of these concepts.
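The set-theoretic reading of concepts, individuals, and relationships can be made concrete with a toy interpretation. The sketch below is only an illustration: the domain elements and the concept and role names are hypothetical and are not taken from the MoG ontology files.

```python
# Hypothetical toy domain D^I.
domain = {"seq1", "seq2", "set1", "job1"}

# Concept interpretations: each concept maps to a subset of the domain.
concepts = {
    "Sequence": {"seq1", "seq2"},
    "SequenceSet": {"set1"},
    "Job": {"job1"},
    # One individual may belong to several concepts.
    "Data": {"seq1", "seq2", "set1"},
}

# Role interpretations: each role maps to a set of pairs over the domain.
roles = {
    "hasSequence": {("set1", "seq1"), ("set1", "seq2")},
}


def is_valid_interpretation(domain, concepts, roles):
    """Check C^I ⊆ D^I for every concept and R^I ⊆ D^I × D^I for every role."""
    concepts_ok = all(extension <= domain for extension in concepts.values())
    roles_ok = all(
        a in domain and b in domain
        for pairs in roles.values()
        for a, b in pairs
    )
    return concepts_ok and roles_ok


print(is_valid_interpretation(domain, concepts, roles))
```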

Jena 9 is a semantic web framework for creation of RDF and OWL models

as well as a common interface for parsing and reasoning. Protege 10 is a free,

open source ontology editor and knowledge-base framework that supports two

main approaches to modeling ontologies via the Protege-Frames and Protege-

OWL editors. The OWL DL ontology can be translated into a description logic

7 http://www.w3.org/TR/owl-ref/
8 http://jena.sourceforge.net/ontology/index.html
9 http://jena.sourceforge.net

10http://protege.stanford.edu/


representation; description logics are decidable fragments of First Order Logic (FOL) 11. A

Description Logic reasoner can perform automated reasoning over an ontology,

such as computing the inferred superclasses of a class, determining whether

or not a class is consistent, and deciding whether or not one class is subsumed by

another (subsumption reasoning). Pellet 12, FaCT/FaCT++ 13, Racer/RacerPro 14, and

KAON2 15 are four popular DL reasoners. The DIG 16 interface

specifies a common interface for DL reasoners. A DIG-compliant reasoner

is a DL reasoner that provides a standard access interface (the DIG interface), which

enables the reasoner to be accessed over HTTP using the DIG language. Jena

and Protege-OWL provide APIs that can be used to interact with any external

DIG-compliant reasoner without requiring developers to have detailed knowledge

of the reasoner.
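One of the reasoning tasks named above, computing the inferred superclasses of a class, can be sketched in a heavily simplified form. The code below only takes the transitive closure of explicitly asserted subclass links. A real DL reasoner such as those listed also handles property restrictions and consistency checking, which this sketch does not attempt. The class names are hypothetical.

```python
# Asserted ("told") direct superclasses; the class names are hypothetical.
TOLD_SUPERCLASSES = {
    "ProteinSequence": {"Sequence"},
    "Sequence": {"BiologicalData"},
    "SequenceSet": {"BiologicalData"},
    "BiologicalData": {"Thing"},
}


def inferred_superclasses(cls, told):
    """Return every class reachable from cls through asserted subclass links."""
    seen = set()
    stack = list(told.get(cls, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(told.get(current, ()))
    return seen


def is_subsumed_by(sub, sup, told):
    """Subsumption test restricted to the asserted class hierarchy."""
    return sub == sup or sup in inferred_superclasses(sub, told)


print(sorted(inferred_superclasses("ProteinSequence", TOLD_SUPERCLASSES)))
```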

5.2.2 Generic service description ontology

OWL-S 17 is an OWL based ontology for semantic representation of services.

It is a complex and rich model that includes the representation of both atomic

services and composite services as well as complicated control flow and data flow.

Most of the current open-source APIs, editors, and annotation tools at this stage

only partially support the OWL-S service model, focusing primarily on the

11 Logics are decidable if computations or algorithms based on the logic will terminate in a finite time.

12 http://pellet.owldl.com/
13 http://owl.man.ac.uk/factplusplus/
14 http://www.racer-systems.com/
15 http://kaon2.semanticweb.org/
16 http://dig.sourceforge.net/
17 http://www.w3.org/Submission/OWL-S/


OWL-S service profile and service grounding. Annotating a service with the OWL-S

model is a nontrivial task even with support from annotation tools, such as the

SRI OWL-S editor 18.

The Feta [50] data model is used for semantic description of services in the myGrid

project. Web services can be annotated using terms in an OWL-based myGrid

domain ontology [103] with a GUI-based interface, Pedro [33]. This approach

is more lightweight than the OWL-S approach. Although OWL-S provides more

support for automation, especially since its definition of preconditions

and effects allows the possible application of AI planning technologies,

it is difficult to utilize its full functionality. The Feta data model has limited

expressivity, but it is sufficient for describing most services, and its simplicity makes it

more practical for describing a large number of services.

We believe it is more practical to use the Feta model for service and workflow

description at this stage. Since the semantic representation model in the system

is modularized, it is easy to convert to an OWL-S representation when the tools

and APIs that support OWL-S become more stable and mature.

5.2.3 Service domain ontology

The service domain ontology should be generic enough to provide the concepts

needed by any web service in a certain domain, and rich enough to represent

the available knowledge for performing complex reasoning. The service domain

ontology plays an important role for the automation of service discovery. However,

building such a high-quality domain ontology is a challenging task. Sabou et al. [80]

present an automatic method that learns a domain ontology for the purpose of

18http://owlseditor.semwebcentral.org/


web service description from natural language documentation of web services. It

provides a guideline and tool for domain experts to inspect a large number of web

services in a certain domain in order to build a high quality generic ontology.

BioMOBY’s object ontology, MOBY-S 19, contains concepts that are related

to data formats and data types usually used in bioinformatics. There are no

restrictions on complex relationship definitions in the ontology. It serves as a

common vocabulary collection that can be used to define services that accept

a particular type of data as their input/output in a certain format. The myGrid

ontology 20 describes the bioinformatics research domain and the dimensions with

which a service can be characterised from the perspective of the scientist. The

scope of the ontology is limited to supporting service discovery. Descriptions

of services are constructed to present their properties such as “what the service

does”, “what data sources it accesses”, and “what domain specific methods the

analysis involves”. Each hierarchy contains abstract concepts to describe the

bioinformatics domain at a high level of abstraction. By describing the domain

of interest in this way, users should be able to find appropriate services for their

experiments from a high level view of the biological processes they wish to perform

on their data.

5.2.4 MoG application domain ontology

The MoG application domain ontology augments the two ontology sets described

above, representing concepts that exist only in the MoGServ system, including

jobs, collections of sequences, etc. The ontology definition provides vocabulary

to annotate services that use data types and data formats not available

19 http://biomoby.org/RESOURCES/MOBY-S/Objects
20 http://www.mygrid.org.uk/ontology


elsewhere. It also allows the annotation of experimental data permitting users to

keep track of their data. The MoG application domain ontology also represents

the interactions between end-users and the system.

Sequence, SequenceSet, and Job are the three main concepts in the MoG application.

The MoGServ system contains a local database that stores integrated sequences

of scientific interest from multiple public databases along with private data from

the life scientists’ own laboratory experiments. One activity a scientist may need

to do often is to query the local MoGServ database to get a collection of sequences

supporting a particular research investigation, and use this collection for subsequent

data analysis. We also define the concepts User, Input, Output, and

Privacy to annotate the access permissions for data sets. For example, if the

data are uploaded from a scientist’s laboratory experiment and are not yet intended

for publication, their use should be restricted to authorized

persons only.

The ontology is defined with OWL using Protege. Each concept consists of

two main types of properties: object properties and datatype properties. An

object property represents the relationship between two individuals in the domain.

A datatype property links an individual to an XML Schema data type or an RDF

literal. Figure 5.2 demonstrates the main concepts and relationships defined in

the MoG application domain ontology.

The Sequence class has multiple properties: 1) hasSequenceID, a unique identifier

of the sequence, which may be in the Life Science Identifier (LSID)

format; 2) hasSequenceName, a datatype property with an XML string value; and

3) hasTaxonomy, a datatype property with an XML string value. Each individual of the

Sequence class may either be retrieved from a public database or uploaded by


Figure 5.2: Main concepts and partial relationships defined in the MoG application domain ontology

scientists from their own laboratory experiments.

The SequenceSet class has the property isChildOf, a functional property, which

means that an individual can be related to at most one individual via the

property. A sequence set can be a child of only one sequence set, no matter how

the sequence set is created. A sequence set can have multiple child sequence sets. A

sequence set can be a sibling of another sequence set only when the other sequence set

is also generated by the setfilter service. The property isSiblingOf is a symmetric

property. The existential restriction $\exists\, hasSequence.Sequence$ states a

necessary condition for an individual to belong to the class SequenceSet.
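The three constraints on sequence sets can be read as simple integrity checks over stored property assertions. The sketch below is a closed-world check written for illustration only. It is not how an OWL reasoner treats these axioms, since OWL uses open-world semantics and would, for example, infer that two parents of the same child denote the same individual. The function name and the sample identifiers are hypothetical.

```python
def check_sequence_sets(sequence_sets, is_child_of, is_sibling_of, has_sequence):
    """Return a list of human-readable constraint violations.

    sequence_sets: set of SequenceSet identifiers
    is_child_of:   set of (child, parent) pairs
    is_sibling_of: set of (a, b) pairs
    has_sequence:  set of (sequence_set, sequence) pairs
    """
    problems = []

    # isChildOf is functional: at most one parent per sequence set.
    parents = {}
    for child, parent in is_child_of:
        parents.setdefault(child, set()).add(parent)
    for child in sorted(parents):
        if len(parents[child]) > 1:
            problems.append(f"{child} has more than one parent")

    # isSiblingOf is symmetric: (a, b) requires (b, a).
    for a, b in sorted(is_sibling_of):
        if (b, a) not in is_sibling_of:
            problems.append(f"isSiblingOf({a}, {b}) lacks its symmetric pair")

    # Existential restriction: every sequence set has at least one sequence.
    with_sequences = {s for s, _ in has_sequence}
    for s in sorted(sequence_sets):
        if s not in with_sequences:
            problems.append(f"{s} has no sequence")

    return problems


print(check_sequence_sets(
    {"set1", "set2"},
    {("set2", "set1")},
    set(),
    {("set1", "seqA"), ("set2", "seqA")},
))
```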

The Job class has execution-time properties such as submittedAt, startedAt, and

finishedAt. This information provides data for measuring Quality of Service (QoS).

It also provides information for end-users to monitor their job execution.
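As an illustration of how the three timestamps could feed QoS measurements, the sketch below derives queue wait, run time, and turnaround time. The ISO 8601 timestamp format, the function name, and the metric names are assumptions of this sketch, not part of the MoGServ implementation.

```python
from datetime import datetime


def job_timings(submitted_at, started_at, finished_at):
    """Derive simple QoS figures (in seconds) from a job's three timestamps.

    The timestamps are assumed to be ISO 8601 strings.
    """
    submitted = datetime.fromisoformat(submitted_at)
    started = datetime.fromisoformat(started_at)
    finished = datetime.fromisoformat(finished_at)
    return {
        "queue_wait_s": (started - submitted).total_seconds(),
        "run_time_s": (finished - started).total_seconds(),
        "turnaround_s": (finished - submitted).total_seconds(),
    }


# Hypothetical job record.
print(job_timings("2007-01-15T10:00:00", "2007-01-15T10:00:30", "2007-01-15T10:05:30"))
```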


5.3 Implementation

Given a well-defined domain ontology, associated services, workflows and the

data products generated from these, the services and workflows can be annotated

using a common vocabulary. The metadata with semantic annotation is stored

in an RDF repository. From a number of RDF storage packages, we chose Sesame

1.2.6 21 as the repository. Sesame is an open source Java framework for storing,

querying and reasoning with RDF and RDF Schema. Using RDF as the main

storage and exchange method makes knowledge in the field portable to other

applications and readable by machines as well as by humans.

The annotation in RDF/XML format of one service provided in the MoGServ

system is shown below and displayed in Figure E.5. The service accepts

a sequence set id and sequence type as input parameters, executes a ClustalW

sequence analysis, and returns the result.

<rdf:RDF
    xmlns:mygrid="http://www.mygrid.org.uk/ontology#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:mog="http://almond.cse.nd.edu:10000/mog#">
  <mygrid:service>
    <mygrid:hasOperation>
      <mygrid:operation>
        <mygrid:isFunctionOf>
          <mygrid:operationApplication>
            <rdf:type rdf:resource="http://www.mygrid.org.uk/ontology#aligning"/>
          </mygrid:operationApplication>
        </mygrid:isFunctionOf>
        <mygrid:outputParameter>
          <mygrid:sequence_alignment_report>
            <mygrid:mygInstance rdf:resource="http://www.mygrid.org.uk/ontology#sequence_alignment_report"/>
            <mygrid:hasParameterDescriptionText>ClustalW alignment file</mygrid:hasParameterDescriptionText>
            <mygrid:hasParameterNameText>filename</mygrid:hasParameterNameText>
            <rdf:type rdf:resource="http://www.mygrid.org.uk/ontology#parameter"/>
          </mygrid:sequence_alignment_report>
        </mygrid:outputParameter>
        <mygrid:usesResource>
          <mygrid:operationResource>
            <rdf:type rdf:resource="http://www.mygrid.org.uk/ontology#sequence_database"/>
          </mygrid:operationResource>
        </mygrid:usesResource>
        <mygrid:performsTask>
          <mygrid:aligning>
            <rdf:type rdf:resource="http://www.mygrid.org.uk/ontology#operationTask"/>
          </mygrid:aligning>
        </mygrid:performsTask>
        <mygrid:hasOperationNameText>runClustalWdf</mygrid:hasOperationNameText>
        <mygrid:inputParameter>
          <mog:set>
            <mygrid:mygInstance rdf:resource="http://almond.cse.nd.edu:10000/mog#set"/>
            <rdf:type rdf:resource="http://www.mygrid.org.uk/ontology#parameter"/>
            <mygrid:hasParameterNameText>setid</mygrid:hasParameterNameText>
          </mog:set>
        </mygrid:inputParameter>
        <mygrid:inputParameter>
          <mygrid:parameter>
            <mygrid:mygInstance rdf:resource="http://www.mygrid.org.uk/ontology#biological_sequence"/>
            <rdf:type rdf:resource="http://www.mygrid.org.uk/ontology#biological_sequence"/>
            <mygrid:hasParameterNameText>sequenceType</mygrid:hasParameterNameText>
          </mygrid:parameter>
        </mygrid:inputParameter>
      </mygrid:operation>
    </mygrid:hasOperation>
    <mygrid:hasServiceNameText>mog:service:ClustalW</mygrid:hasServiceNameText>
    <mygrid:locationURI rdf:resource="http://almond.cse.nd.edu:10000/axis/services/ClustalW?wsdl"/>
    <mygrid:hasServiceType>WSDL</mygrid:hasServiceType>
    <mygrid:publishedBy>
      <mygrid:organisation>
        <mygrid:publishedBy>
          <mygrid:organisation>
            <mygrid:hasOrganisationNameText>MoG</mygrid:hasOrganisationNameText>
            <mygrid:hasOrganisationDescriptionText>MoG</mygrid:hasOrganisationDescriptionText>
          </mygrid:organisation>
        </mygrid:publishedBy>
      </mygrid:organisation>
    </mygrid:publishedBy>
    <mygrid:hasServiceDescriptionText>This is a service accepts setid,sequecenType as parameters and return the name of the alignment report stored in the local database</mygrid:hasServiceDescriptionText>
    <mygrid:hasServiceDescriptionLocation>http://almond.cse.nd.edu:10000/axis/services/ClustalW?wsdl</mygrid:hasServiceDescriptionLocation>
  </mygrid:service>
</rdf:RDF>

21 http://www.openrdf.org/
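To show how such an annotation can be consumed by a program, the following sketch reads an abridged copy of the ClustalW service annotation with the Python standard library and extracts the service name, operation name, and parameter names. It treats the annotation as plain XML rather than as an RDF graph, which is a simplification chosen for this illustration. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

MYGRID = "http://www.mygrid.org.uk/ontology#"

# Abridged copy of the ClustalW service annotation (names only).
ANNOTATION = """\
<rdf:RDF xmlns:mygrid="http://www.mygrid.org.uk/ontology#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:mog="http://almond.cse.nd.edu:10000/mog#">
  <mygrid:service>
    <mygrid:hasOperation>
      <mygrid:operation>
        <mygrid:hasOperationNameText>runClustalWdf</mygrid:hasOperationNameText>
        <mygrid:inputParameter>
          <mog:set>
            <mygrid:hasParameterNameText>setid</mygrid:hasParameterNameText>
          </mog:set>
        </mygrid:inputParameter>
        <mygrid:inputParameter>
          <mygrid:parameter>
            <mygrid:hasParameterNameText>sequenceType</mygrid:hasParameterNameText>
          </mygrid:parameter>
        </mygrid:inputParameter>
        <mygrid:outputParameter>
          <mygrid:sequence_alignment_report>
            <mygrid:hasParameterNameText>filename</mygrid:hasParameterNameText>
          </mygrid:sequence_alignment_report>
        </mygrid:outputParameter>
      </mygrid:operation>
    </mygrid:hasOperation>
    <mygrid:hasServiceNameText>mog:service:ClustalW</mygrid:hasServiceNameText>
  </mygrid:service>
</rdf:RDF>
"""


def tag(local):
    """Build the ElementTree spelling of a name in the myGrid namespace."""
    return f"{{{MYGRID}}}{local}"


def parameter_names(root, kind):
    """Collect parameter names under every element of the given kind."""
    names = []
    for param in root.iter(tag(kind)):
        name = param.find(f".//{tag('hasParameterNameText')}")
        if name is not None:
            names.append(name.text)
    return names


def summarize(annotation_xml):
    root = ET.fromstring(annotation_xml)
    return {
        "service": root.findtext(f".//{tag('hasServiceNameText')}"),
        "operation": root.findtext(f".//{tag('hasOperationNameText')}"),
        "inputs": parameter_names(root, "inputParameter"),
        "outputs": parameter_names(root, "outputParameter"),
    }


print(summarize(ANNOTATION))
```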

All the data sets stored in the local database are generated by a service or a

workflow. The annotation of experimental data is through services provided in the

MoGServ system. These services are invoked automatically when an individual is

created. Each sequence, set of sequences, and job is identified with an LSID.

A Life Science Identifier (LSID) 22 is a special kind of Uniform Resource Name

(URN) for biological entities. The LSID concept defines an approach for naming

and identifying data resources stored in multiple, distributed data stores. Since

adoption of the LSID in the life sciences is increasing, using it as an identifier for

experimental data makes it easier to extend our system to publish those data.
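An LSID has the general form urn:lsid:authority:namespace:object, optionally followed by a revision. A minimal parsing sketch is given below. The example identifier and its authority name are hypothetical and do not reflect identifiers actually issued by MoGServ.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text):
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(not part for part in parts[2:]):
        raise ValueError(f"empty LSID component: {text!r}")
    return LSID(*parts[2:])


# Hypothetical identifier for a sequence stored in a local database.
print(parse_lsid("urn:lsid:mog.example.org:sequence:42"))
```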

We implement a number of software components to annotate and query meta

data including job information, services/workflows descriptions (See Figure 5.3).

The query components embed queries in the SeRQL (Sesame RDF Query Lan-

guage) format supported by Sesame.
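The kind of structured query that the query components issue can be approximated, for illustration, by pattern matching over triples held in memory. This sketch is a stand-in for a query language, not SeRQL itself, and it does not use the Sesame API. The property names such as mog:submittedBy and the sample data are hypothetical.

```python
# Hypothetical metadata triples about jobs.
TRIPLES = {
    ("mog:job1", "rdf:type", "mog:Job"),
    ("mog:job1", "mog:submittedBy", "mog:userA"),
    ("mog:job1", "mog:finishedAt", "2007-01-15T10:05:30"),
    ("mog:job2", "rdf:type", "mog:Job"),
    ("mog:job2", "mog:submittedBy", "mog:userB"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def jobs_submitted_by(triples, user):
    """Join two patterns: resources typed as Job and submitted by the user."""
    jobs = {s for s, _, _ in match(triples, p="rdf:type", o="mog:Job")}
    submitted = match(triples, p="mog:submittedBy", o=user)
    return sorted(s for s, _, _ in submitted if s in jobs)


print(jobs_submitted_by(TRIPLES, "mog:userA"))
```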

5.4 Conclusion

In this chapter, we present an ontological model that is used to semantically an-

notate data and services in the MoGServ system. This ontological model contains

three ontology sets: MoG application domain ontology, generic service description

22http://lsid.sourceforge.net/



[Figure 5.3 components: annotation templates (data); annotation templates (services, workflows); query templates and results; annotation components; query components; RDF store; ontological modules: generic service description ontology (myGrid/Feta model), service domain ontology (myGrid), MoGServ application domain ontology (MoGServ). The vocabularies defined in these ontological modules are used to annotate and query.]

Figure 5.3. The software components implementation of annotation and querying meta data

ontology, and service domain ontology. Using a distributed and modularized ontology

structure and reusing well-defined ontologies could potentially increase

interoperability when the data generated from MoGServ are shared with other

researchers. At this stage, the MoG application domain ontology simply

serves as a common vocabulary definition to capture the relationships among

data sets, sequences, jobs, and other properties related to these three concepts.

The Feta data model is used to annotate services in MoGServ. Compared to

the table- and index-based metadata search method, the semantically annotated

experimental data provide a better, more flexible approach for users to search and

share their experiments. However, annotating the metadata accurately and

efficiently becomes the major difficulty in applying the ontological model.


CHAPTER 6

IMPROVING THE REUSE OF THE SCIENTIFIC WORKFLOW

Most current practical methodologies and workflow systems for service com-

position and workflow creation in e-Science pursue a semi-automatic way to allow

users to discover and select appropriate services to include in a workflow based on

semantic and conceptual service definitions. This relieves users of the need

to have detailed knowledge and understanding of each tool, service, and data

type. However, few of these approaches consider the potential for reuse: sharing

the knowledge gained during the service composition process and reusing existing

workflows in whole or in part. We believe that providing a capability

for reuse of this knowledge and workflows could be an important component in a

workflow system. In this chapter [109], we present a methodology and an enhanced

system design to facilitate the reuse of knowledge and workflows. It contains 1)

a hierarchical workflow structure representation, 2) knowledge management and

knowledge discovery components to capture and manage the reusable knowledge

in a system, and 3) an approach for using a graph matching algorithm to discover

similar workflows.

6.1 Introduction

As more data, analysis tools, and other resources are delivered as services on

the web, the major benefit of adopting service-oriented architecture in e-Science


is that of allowing scientists to describe and enact their experimental processes by

orchestrating distributed and local services into a workflow. Service orchestration,

also called service composition, is a difficult and complex task. It often involves

choosing a set of appropriate services based on the functional and non-functional

properties of services, ordering them in sequence, resolving connectivity between

the services, and converting the complex process into a target workflow language

that can be deployed and invoked on a platform.
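The ordering and connectivity steps of service composition can be illustrated with a small sketch that orders services by matching the data types they produce and consume. The service descriptions are hypothetical and loosely modeled on the services mentioned earlier. The sketch assumes that each data type has exactly one producer and uses graphlib from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical service descriptions: consumed and produced data types.
SERVICES = {
    "readseq": {"inputs": {"alignment"}, "outputs": {"nexus"}},
    "query": {"inputs": set(), "outputs": {"sequence_set"}},
    "clustalw": {"inputs": {"sequence_set"}, "outputs": {"alignment"}},
}


def order_services(services):
    """Order services so that each runs after the producers of its inputs."""
    # Map each data type to the service that produces it
    # (assumes a single producer per type).
    producers = {}
    for name, spec in services.items():
        for data_type in spec["outputs"]:
            producers[data_type] = name
    # Each service depends on the producers of the types it consumes.
    graph = {
        name: {producers[d] for d in spec["inputs"] if d in producers}
        for name, spec in services.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(order_services(SERVICES))
```

Choosing among alternative services and converting the result into a target workflow language are not covered by this sketch.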

Over the past several years, much research has been done on approaches for

service discovery and composition in order to achieve the goal of seamless web

service composition [58]. These approaches range from adoption of industry

standards to adoption of semantic web technology, and from manual or static

composition to automatic dynamic composition [90]. A significant portion of the

work aims at automating discovery and composition by combining ontological

annotation of services and AI planning technology. In the literature, these

approaches are largely demonstrated on virtual travel agencies or small

well-defined domains. Applying these approaches to larger, more complex and

less-defined applications can be difficult, especially before a complete strong on-

tological agreement is established in the application domain or across multiple

domains.

Most current practical methodologies for service composition or workflow cre-

ation employ a semi-automatic design that allows users to discover and select

appropriate services to include in a workflow based on semantic and conceptual

service definitions. This partially relieves users of the need for detailed

knowledge and understanding of each tool, service, and data type. At the same

time, it increases the complexity of building the middleware that supports workflow


creation at a higher level of abstraction. Mediator, shim, and adaptor technologies

[74] are applied to resolve the connectivity between the services. Several workflow

management systems and service-oriented middleware, such as Pegasus [34],

myGrid/Taverna [65], Kepler [52], and Triana [96], have been developed with the

intent to streamline workflow design, execution, monitoring, and

re-running.

Most of these systems and approaches provide users with an environment for composing

services from scratch, helping them choose appropriate services more accurately

by considering semantic matching and quality of service (QoS). Fewer

of them consider the potential for reusing and sharing the knowledge gained during

the service composition process, or for reusing existing workflows in whole or in

part. We believe that providing a capability to reuse the knowledge and workflows

is an important component in such a system. This reusability will lead to a more

efficient and more structured composition process that will accelerate rapid appli-

cation development. It will provide more valuable guidelines to assist users with

their workflow creation using knowledge that has been gained and verified by oth-

ers. Reuse of the verified knowledge will potentially increase the correctness of

composed workflows and reduce the errors that may be caused by misannotation,

inaccurate annotation, and incomplete annotation of services. The requirement of

complete information about the world makes it challenging to apply traditional AI

planning technologies to the service composition process, since it is not feasible

to collect all the information needed to form a complete initial state of the

world [46, 58]. The knowledge gradually gathered in the system during the service

composition process may help accumulate more complete information for an AI

planner.


In this chapter, we present a methodology and an enhanced system design

to facilitate the reuse of knowledge and workflows. It contains a hierarchical

workflow structure, knowledge management and knowledge discovery components

to capture and manage the reusable knowledge in a workflow system, and an

approach for using a graph matching algorithm to discover similar workflows.

The proposed methodology is being used in the design and implementation of a

service-oriented system for supporting bioinformatics research.

6.2 A hierarchical workflow structure

We define a hierarchical workflow structure that contains four levels of repre-

sentation (see Figure 6.1): abstract workflow, concrete workflow, optimal work-

flow, and workflow instance.

[Figure 6.1 omitted. It shows the four levels and the transformations between them. Abstract workflow (Task A, Task B): encode and convert the high-level definition to a low-level executable. Concrete workflow (Services A, B, C, D): replace individual services with their optimal alternatives. Optimal workflow (Services A, B, C', D): invoke the workflow with specific input data and record the data provenance and the performance of services and workflows. Workflow instance (Services A, B, C', D with input and output).]

Figure 6.1. A four-level hierarchical workflow structure: representation and transformation of scientific processes

Abstract workflow is a definition of a scientific process with emphasis on the

analytical operations or functions to be performed rather than the mecha-

nisms for performing these operations.

Concrete workflow is a definition of a number of tasks represented as actual ex-

ecutable services. A concrete workflow can be converted to a specific workflow

language and sent to a workflow engine to be executed.

Optimal workflow is a concrete workflow in which individual executable services

are replaced by their highest-quality alternatives.

Workflow instance is an actual run of a concrete workflow or optimal workflow

with input data and generated output data.

Users can use a GUI-based interface to define an abstract workflow by drag-

ging and dropping high level abstracted components provided in the system. An

alternative way is to define an abstract workflow using the standardized syntax,

vocabularies, and semantics developed in their scientific communities. Users define

each task logically, in terms of the function they want the task to accomplish.

The translation of an abstract workflow into a concrete workflow is a process

of discovering suitable services that implement these functions and resolving the

connectivity between services.

The optimization of a concrete workflow into an optimal workflow is a process

of ranking services based on a set of metrics and selecting an optimal service to

replace each service in the workflow.
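
The optimization step can be illustrated with a minimal sketch. It assumes that each service already has a single numeric quality score and a recorded list of alternatives; the function name `optimize`, the service names, and the scores are hypothetical and are not part of the system described here.

```python
from typing import Dict, List


def optimize(concrete: List[str],
             alternatives: Dict[str, List[str]],
             score: Dict[str, float]) -> List[str]:
    """Return an optimal workflow: each service is swapped for the
    highest-scoring candidate among itself and its known alternatives."""
    optimal = []
    for service in concrete:
        candidates = [service] + alternatives.get(service, [])
        # Services without a recorded score default to 0.0 (an assumption).
        best = max(candidates, key=lambda s: score.get(s, 0.0))
        optimal.append(best)
    return optimal


# Hypothetical data mirroring Figure 6.1, where Service C is replaced by C'.
concrete_workflow = ["ServiceA", "ServiceB", "ServiceC", "ServiceD"]
alternatives = {"ServiceC": ["ServiceC_prime"]}
score = {"ServiceA": 0.9, "ServiceB": 0.8, "ServiceC": 0.4,
         "ServiceC_prime": 0.7, "ServiceD": 0.6}

optimal_workflow = optimize(concrete_workflow, alternatives, score)
print(optimal_workflow)
```

Because the ranking is applied per service and after translation, the scores can change between runs without touching the concrete workflow itself.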

A concrete workflow can be invoked repeatedly with different input parame-

ters. Since a scientific process is a process for discovering new knowledge, keeping

track of the source of a workflow result can be as important as the result itself.

Data provenance is metadata recording the process of experiment workflows, an-

notations, and notes about experiments. It provides significant added value in

such data-intensive e-Science [83]. Many data provenance systems in e-Science

have focused on recording the data from which a data product evolved and the

process that transformed these data, i.e., input data, output data, and process.

Provenance records may also include the running time and failure rate of each

running instance of a workflow; these provide measurements for profiling the

quality of services and workflows. This information can be used to assist the

workflow optimization process.
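
As an illustration of how recorded runs could feed the optimization process, the following sketch derives a mean running time and a failure rate per service from a list of run records. The record layout (`RunRecord`) and the choice to average over successful runs only are assumptions made for this example, not a description of an existing provenance store.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunRecord:
    """One recorded invocation of a service within a workflow instance."""
    service: str
    seconds: float   # running time of this invocation
    failed: bool     # True if the invocation did not complete


def profile(records: List[RunRecord], service: str) -> dict:
    """Summarize running time and failure rate for one service."""
    runs = [r for r in records if r.service == service]
    if not runs:
        return {"runs": 0, "mean_seconds": None, "failure_rate": None}
    ok = [r.seconds for r in runs if not r.failed]
    return {
        "runs": len(runs),
        # Mean is taken over successful runs only (a design assumption).
        "mean_seconds": sum(ok) / len(ok) if ok else None,
        "failure_rate": sum(r.failed for r in runs) / len(runs),
    }


# Hypothetical run history.
records = [
    RunRecord("clustalW", 4.0, False),
    RunRecord("clustalW", 6.0, False),
    RunRecord("clustalW", 1.0, True),
    RunRecord("queryGene", 0.5, False),
]
clustalw_profile = profile(records, "clustalW")
print(clustalw_profile)
```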

Several benefits are provided with this hierarchical workflow structure defini-

tion:

Allows users to define workflows at different levels of abstraction. Less

experienced users may define a workflow in terms of the functions they want each

task to perform. Intermediate users may define a workflow with more detailed

properties of each task, such as the algorithm and data source they want to use.

Expert users may define a workflow in an ad hoc manner by choosing appropriate

executable services and connecting them with appropriate logic. An example is

shown in Figure 6.2. Suppose users would like to conduct an experiment to

determine “if gene genealogies for ATP subunit α, β, γ are different”. A less

experienced user may define a workflow with two tasks, retrieving and aligning.

An intermediate user may know of two particular services (queryGene, clustalW)

that should be used in the workflow to perform these tasks. An expert

bioinformatician may know that, to get more accurate results, it is necessary to

add a service (setFilter) that computes the intersection of all the organisms

in the sequence sets.


[Figure 6.2 omitted. It shows three user-defined workflows, from different views, for the question “are gene genealogies for ATP subunit α, β, γ different?”. Workflow A, defined by a less experienced user using the functional definition of services: Retrieving, Aligning. Workflow B, defined by an intermediate user with executable services: queryGene, clustalW. Workflow C, defined by an expert user with two extra executable services to ensure the accurate output of the biological process: queryGene, setIds, setFilter, clustalW.]

Figure 6.2. An example illustrating user-oriented workflow definition with different levels of knowledge

Allows the transformation of workflows in semi-automatic or auto-

matic ways. The transformation from abstract workflow to the concrete workflow

can be completed by an expert bioinformatician with assistance from a service

discovery agent provided in the system. The myGrid/Taverna [65] workbench

provides users with a visual workflow building tool and also supports the annota-

tion and discovery of services using an ontology. IRIS [74] provides an approach

to create, discover, and manage adapters (mediators) that are intended to glue

two bioinformatics services together with appropriate data transformation, iden-

tifier mapping, and so forth. The BioMoby [44] project exposes many of its

features in the Taverna interface in the form of a Taverna plug-in. Users are

guided through the construction of syntactically and semantically correct

workflows through plug-in calls to the Moby Central registry.

The transformation from the concrete workflow to the optimal workflow can

be completed automatically by ranking services and choosing optimal ones. Most

previous work binds quality-of-service information into the translation

process, which results in more sophisticated and complex composition

methods. Since most measurements of the quality of services change

dynamically, this tightly coupled representation and composition method is not

easily adapted to these changes. Separating the optimal workflow from the concrete

workflow allows the easy integration of Grid computing technology to address the

resource allocation and security issues of data and computation resources.

Allows the full or partial reuse of workflows defined at different

levels. Reuse of a workflow may occur when users need to replicate their data

sets or rerun the same workflow using different input data. For example, consider

a scientist who is interested in a data set generated from a given workflow. Using

the recorded data provenance, the corresponding concrete workflow that was used

to generate this data set can be discovered. The concrete workflow can be re-

optimized and invoked with different input data.

The reuse of a workflow may also occur during workflow design. For example,

a scientist may have a high-level or partial representation of a workflow;

searching the workflow repository may return a number of similar workflows

at the abstract level and/or the concrete level. The scientist may choose a

candidate to reuse, or modify the workflow to meet the goal.

6.3 An enhanced workflow system

A general workflow system contains most of the components illustrated in

Figure 6.3 to support the semi-automated workflow composition process.

• Ontologies serve as a common vocabulary for semantic annotation of services

and data in the system.

• Semantics enabled service registry is responsible for storing the semantic

and syntactic information of services as well as answering queries. The

semantic information can be provided by service providers or by third-party

annotation.

• Workflow composer discovers appropriate services and resolves the connec-

tivity between services. It is also responsible for converting the workflow

into a workflow language that can be executed on a workflow engine.

• Data provenance management keeps track of the origin of the data

products.

Few workflow systems have the capability for the reuse of the knowledge gained

during the service discovery, service composition, and service invocation process.

We add two components – knowledge discovery and knowledge management – to

the workflow system and discuss how this knowledge can be used over time to

provide more accurate guidelines to users.

As most current semantic web services standards are relatively mature and

stable, the ontology model used in a system is built upon a distributed and modu-

larized ontology structure and reuses some cross-domain ontologies such as Dublin

Core (http://dublincore.org). The use of a well-defined ontology could potentially

increase the interoperability of information published on the web. The ontology

model used in a system normally contains two modules: a generic service descrip-

tion ontology, such as OWL-S, is an ontology module used to specify generic

web service concepts including service inputs, outputs, preconditions, and effects;

[Figure 6.3 omitted. Labels in the figure: User (create abstract workflow using ontology); Service Annotator (annotate services using ontology); Abstract workflow; DL reasoner; Ontology; Semantics enabled service registry; Semantics enabled service discovery; Service matchmaking; Workflow composer (software agent/experienced users), find appropriate service; concrete workflow; Workflow execution engine; Data provenance management (collect and manage information about data origination); Knowledge base management; Knowledge discovery.]

Figure 6.3. An enhanced workflow system with two added components, knowledge management and knowledge discovery

a service domain ontology is an ontology module designed and used for the semantic

description of web services in a particular domain, normally represented in

OWL Lite or OWL DL.

We give a definition of service in our system as a tuple with several important

attributes:

$\mathit{service}_i(\mathit{description}_i, \mathit{operation}_i, \ldots)$ – a service contains text descriptions of

its features, a set of operations (must not be $\emptyset$), and other attributes;

$\mathit{operation}_{ij}(\mathit{description}_{ij}, \mathit{input}_{ij}, \mathit{output}_{ij}, \mathit{quality}_{ij}, \mathit{performtask}_{ij}, \ldots)$ – an

operation in a service contains text descriptions of its features, a set of input

parameters (may be $\emptyset$), a set of output parameters (may be $\emptyset$), a set of quality

metrics, a semantic description of the features using vocabulary from the service

domain ontology, and others;

$\mathit{parameter}_k(\mathit{semantic}_k, \mathit{datatype}_k)$ – a parameter contains a semantic descrip-

tion using vocabulary from the service domain ontology and the data type.

The semantic annotation of services and workflows can be represented as an

RDF model and stored in an RDF repository.
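
The service, operation, and parameter tuples defined above can be sketched as plain data classes. This is only one possible in-memory representation, assumed for illustration; the field names follow the tuple attributes, and the ontology terms in the example (`#sequence_set`, `#alignment_report`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Parameter:
    semantic: str   # concept from the service domain ontology
    datatype: str   # syntactic data type


@dataclass
class Operation:
    description: str
    inputs: List[Parameter] = field(default_factory=list)    # may be empty
    outputs: List[Parameter] = field(default_factory=list)   # may be empty
    quality: Dict[str, float] = field(default_factory=dict)
    perform_task: str = ""   # semantic description of the task performed


@dataclass
class Service:
    description: str
    operations: List[Operation]

    def __post_init__(self) -> None:
        # The definition requires a non-empty set of operations.
        if not self.operations:
            raise ValueError("a service must have at least one operation")


# Hypothetical example service with one operation.
align = Operation(
    description="multiple sequence alignment",
    inputs=[Parameter("#sequence_set", "string")],
    outputs=[Parameter("#alignment_report", "string")],
    perform_task="#aligning",
)
clustalw = Service("ClustalW alignment service", [align])
```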

6.3.1 Knowledge management

The knowledge management component is responsible for collecting, analyz-

ing, and handling inquiries on the knowledge base. The knowledge base holds

information gathered incrementally during workflow translation and service com-

position processes. This information provides increasingly accurate guidelines for

users over time. The information is classified into four types:

- Connectivity of services. A concrete workflow can be viewed as a graph

with a number of linked services in a certain order and logic. Each node

in the workflow is an operation of an executable service. In a simple case,

two nodes are connected if an output parameter of one operation matches an

input parameter of another operation based on their syntactic and semantic

descriptions.

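Rule 1: $\mathit{operation}_{ij} \rightarrow \mathit{operation}_{mn}$

if $\exists\, \mathit{parameter}_k \in \mathit{output}_{ij}$ and $\exists\, \mathit{parameter}_o \in \mathit{input}_{mn}$ such that

$\mathit{datatype}(\mathit{parameter}_o) = \mathit{datatype}(\mathit{parameter}_k)$ and

$\mathit{semantics}(\mathit{parameter}_o) = \mathit{semantics}(\mathit{parameter}_k)$

Rule 2: if $\mathit{operation}_{ij} \rightarrow \mathit{operation}_{mn}$ then $\mathit{service}_i \rightarrow \mathit{service}_m$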

While the composability of services can be determined by the simple

rules above, it can also be identified using more complex models [55].

The connectivity between two services can be identified automatically based

on the rules defined above when a new service is added to the system. This

is a computationally intensive process when the number of services in the

system and the number of parameters for each operation are large. Also,

incorrect identification of the connectivity between two services is most likely

caused by the misannotation of services or an incomplete ontological

model. Therefore, during the translation process, the connectivity struc-

ture should be refined and updated based on human judgment. After a

concrete workflow is created and verified, the connectivity of services in the

workflow can be added into the system. As time goes by, the connectivity

of services in a system forms a graph of the knowledge space. A vertex

in the graph can be represented as ($\mathit{service}_i$, $\mathit{operation}_{ij}$, $\mathit{parameter}_{ijk}$), or as

($\mathit{service}_i$, $\mathit{operation}_{ij}$) if the operation does not have parameters, and an

edge represents the connectivity of two vertices.

- Alternativity of services. In the context of our research, we define

$\mathit{service}_i$ as an alternative of $\mathit{service}_m$ if, $\forall\, \mathit{operation}_{ij} \in \mathit{service}_i$ and

$\mathit{operation}_{mn} \in \mathit{service}_m$, their syntactic and semantic descriptions are the

same except for the quality properties. For example, two services that imple-

ment the same WSDL interface are alternatives for each other. These two

services may implement the WSDL interface using different underlying tech-

nologies, charge different fees, and have different performance.

The execution of workflows and services takes place in a distributed comput-

ing environment. The execution may fail at some point due to the failure of

the workflow engine, failure of the service, or failure of the network fabric

[64]. The capability to dynamically select alternative services enables

recovery from service failure. The myGrid/Taverna project provides users

a way to encapsulate alternative services into the workflow at design

time. Another approach is to find an alternative service at run time

using general semantic service discovery technologies. We believe that iden-

tifying and storing the alternatives of a service ahead of time can increase

performance by eliminating this semantic service discovery process. The

method can also improve the correctness of finding alternative services. The

alternativity of services can be automatically identified when a new service

is added to the system and refined during the workflow translation process.

The alternatives of a $\mathit{service}_i$ can be represented as a named property of

a service – alternativeOf. The alternativeOf property is a transitive

property, which means that if $\mathit{service}_i$ is an alternative of $\mathit{service}_m$ and

$\mathit{service}_m$ is an alternative of $\mathit{service}_x$, then $\mathit{service}_i$ is an alternative of

$\mathit{service}_x$.

- Quality profile of services. As more services with similar functionali-

ties are published, it is important to define qualitative metrics that help

the selection of the optimal services. Modeling the quality of service and

approaches for choosing optimal services has been well studied for several

years [10]. While there are a number of quality criteria that can be used for

ranking services, different systems choose different sets of metrics and qual-

ity models for computing the overall quality of service. We define quality

with four attributes, Quality(cost, trustness, executiontime, failurerate):

– cost is the fee needed to execute an $\mathit{operation}_{ij}$, and it is provided by the

service provider;

– trustness defines users' preference for using the $\mathit{operation}_{ij}$ based on their

experiences, and it is annotated by users;

– executiontime and failurerate define the performance of an $\mathit{operation}_{ij}$,

and they are collected and calculated from each run of a workflow or service.

Other QoS properties, such as security, may also be added when needed. The

overall quality of each service can be computed periodically or during the

optimization process using a QoS computation model similar to the one

defined in [48].

- Mapping between abstract workflow and concrete workflow. The

construction of the abstract workflow represents the knowledge that sci-

entists have about their domain and the services/tools provided in the

system. The abstract workflow and the semantic annotation of the con-

crete workflow are represented using the ontology. The concrete workflow

is also represented using a particular workflow specification language that

can be invoked on the workflow engine. Recording the mapping between an

abstract workflow and a concrete workflow enables finding similar workflows

in the system given a workflow in a different representation format.
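
Rule 1 and Rule 2 from the connectivity item above can be sketched as follows. The tuple types `Param` and `Op`, the example operations, and the ontology terms are hypothetical. Excluding an operation from matching itself is an assumption of this sketch, and the pairwise comparison over all operations reflects why the process becomes expensive as the registry grows.

```python
from typing import List, NamedTuple, Set, Tuple


class Param(NamedTuple):
    semantic: str
    datatype: str


class Op(NamedTuple):
    service: str
    name: str
    inputs: Tuple[Param, ...]
    outputs: Tuple[Param, ...]


def op_connects(a: Op, b: Op) -> bool:
    """Rule 1: some output of a equals some input of b in both
    data type and semantic annotation."""
    return any(out == inp for out in a.outputs for inp in b.inputs)


def service_edges(ops: List[Op]) -> Set[Tuple[str, str]]:
    """Rule 2: lift operation-level connectivity to service level."""
    return {(a.service, b.service)
            for a in ops for b in ops
            # Self-connections are skipped (an assumption of this sketch).
            if a is not b and op_connects(a, b)}


# Hypothetical operations of two services.
query = Op("queryGene", "query",
           (Param("#query_term", "string"),),
           (Param("#sequence_set", "fasta"),))
align = Op("clustalW", "align",
           (Param("#sequence_set", "fasta"),),
           (Param("#alignment_report", "text"),))

edges = service_edges([query, align])
print(edges)
```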

The knowledge about the connectivity of services, alternativity of services,

quality of services, and workflow representations is typically stored in tables.
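
As one way to precompute the alternativeOf relation before it is stored, the sketch below groups services with a union-find structure. It assumes the property is symmetric as well as transitive, which follows the statement that two services implementing the same interface are alternatives for each other; the function name and the service names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def alternative_groups(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group services connected by alternativeOf, treating the property as
    symmetric and transitive, and return each service's alternatives."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        parent[root_a] = root_b

    groups: Dict[str, List[str]] = {}
    for s in list(parent):
        groups.setdefault(find(s), []).append(s)
    return {s: sorted(m for m in members if m != s)
            for members in groups.values() for s in members}


# Hypothetical asserted pairs; the third relation is inferred.
alts = alternative_groups([("blastA", "blastB"), ("blastB", "blastC")])
print(alts["blastA"])
```

With the groups stored ahead of time, finding a replacement for a failed service becomes a lookup rather than a semantic discovery query.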

6.3.2 Knowledge discovery

The knowledge discovery component resides in the workflow composer. It is

responsible for communication between the workflow composer and the knowledge

management component during the workflow translation process, in order to find

appropriate knowledge in the system. It is also responsible for selecting and

replacing services with their optimal alternatives during the optimization process

and for finding a replacement at run time. The knowledge discovery component

accepts requests and sends them to the knowledge management component.

6.4 Translation process

The process of translating an abstract workflow into a concrete workflow involves

the discovery of appropriate services and resolving the connectivity between ser-

vices in order to accomplish tasks defined in the abstract workflow.

6.4.1 Service discovery and matchmaking process

During the translation process, the workflow composer issues a query to find

appropriate services that can be used to accomplish the defined task. For exam-

ple, the composer is interested in finding an operation which performs the task

“aligning”. We assume that one property of an operation is annotated using

#performTask, a vocabulary term defined in the OWL-based bioinfor-

matics ontology of the myGrid project (http://www.mygrid.org.uk/ontology).

A general query returns all services whose #performTask property equals #aligning.

More sophisticated discovery processes use reasoning capabilities to infer a sub-

sumption relationship between the requested service and the services described us-

ing the ontology. For example, suppose an operation has been annotated with

the property #performTask using the vocabulary term #pairwise local aligning.

In the ontology definition, the class #pairwise local aligning is not an asserted

subclass but an inferred subclass of #aligning. With subsumption reasoning,

not only services annotated with #aligning are returned, but also services

annotated with #pairwise local aligning.
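
The difference between an exact-match query and a query with subsumption can be sketched with a small, hand-written class hierarchy. The hierarchy below is a hypothetical fragment standing in for the output of a DL reasoner; the intermediate class `#pairwise_aligning`, the underscore spelling of the terms, and the operation names are assumptions made for the example.

```python
from typing import Dict, List, Set

# Hypothetical fragment of a task hierarchy: child -> direct parents,
# as it might look after a reasoner has classified the ontology.
SUBCLASS_OF: Dict[str, Set[str]] = {
    "#pairwise_local_aligning": {"#pairwise_aligning"},
    "#pairwise_aligning": {"#aligning"},
    "#multiple_aligning": {"#aligning"},
    "#retrieving": set(),
    "#aligning": set(),
}


def is_subsumed_by(term: str, ancestor: str) -> bool:
    """True if term equals ancestor or is a (transitive) subclass of it."""
    stack, seen = [term], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, ()))
    return False


def discover(annotations: Dict[str, str], task: str) -> List[str]:
    """Return operations whose #performTask term is subsumed by task."""
    return sorted(op for op, term in annotations.items()
                  if is_subsumed_by(term, task))


# Hypothetical #performTask annotations of three operations.
annotations = {
    "clustalW.align": "#multiple_aligning",
    "water.align": "#pairwise_local_aligning",
    "queryGene.query": "#retrieving",
}
exact = sorted(op for op, term in annotations.items() if term == "#aligning")
with_subsumption = discover(annotations, "#aligning")
print(exact, with_subsumption)
```

In this toy data the exact-match query finds nothing, while the subsumption query returns both aligning operations.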

The general translation from an abstract workflow to a concrete workflow

requires resolving the connectivity between two executable services whose output

and input are mismatched or inappropriate. The mismatching problem may be

introduced by inaccurate semantic annotations, incomplete semantic annotations,

and inaccurate ontological reasoning (see Figure 6.4). One false positive example

is a DDBJ-XML service whose output is semantically annotated as Sequence Data

Record but which actually returns a document in a self-defined format. The NCBI

blast service, whose input is semantically annotated as Sequence Data

Record, requires FASTA-formatted sequence data. The connectivity of these two

services is identified as positive but in fact is not. This type of error can be

detected by an expert at design time or after the composed workflow runs and

returns incorrect results.

The true negative case can be detected automatically at translation

time. Adaptor, shim, or mediator [74] technologies are used to align or modify

poorly typed input and output of consecutive services in a workflow. These media-

tors are stored in mediator pools, and discovering such a mediator is achieved with

ontologies and machine reasoning, in the same way as the discovery of normal

services. Most research has focused on how to discover these mediators using

semantic web technology and machine reasoning. A general mediation process results

in methods for translating the output of one web service into the input for the next.

[Figure 6.4 omitted. It is a 2×2 matrix of match detection output (yes/no) against real match (yes/no), with cells TP, FP, FN, and TN. Recoverable labels: accurate annotation; inaccurate annotation, lack of or incomplete semantic annotation, and inaccurate ontological reasoning; an example pair GenBankService (out: GenBank record) and Blastp (in: protein sequence), with a mediator, adaptor, or shim, marked “can be detected automatically”; and an example pair DDBJ-XML (out: sequence data record, self-defined format) and NCBI blast (in: sequence data record, FASTA format), marked “may be detected by expertise at design time or after run”.]

Figure 6.4. The mismatching problem may be introduced by inaccurate annotation, incomplete semantic annotation, and inaccurate ontological reasoning during the translation process.

6.4.2 Knowledge reuse

With the incrementally added information in the knowledge base, resolving con-

nectivity can be done completely at the syntactic level without the need to consult

the domain ontology. As time goes by, converting the abstract workflow to the

concrete workflow may be achieved by finding a mediator between two services in

the knowledge base. Thus, ontologies will be needed only for those parts of

the workflow that were never used before. The manual translation process will be

required just once for every new component in a workflow

and when a new service is added to the registry. The problem of resolving the con-

nectivity between two services can be converted to the problem of finding a path

between two nodes in a connectivity graph.

During the translation process, instead of resolving the connectivity from

scratch using semantic reasoning technology, the composer can reuse stored knowl-

edge to support the semi-automatic and automatic composition.

1. Given a service or operation, all services or operations connected to the

current service or operation can be found by table lookup and presented

to the users. Users can choose one based on their expertise. Since the

connectivity stored in the table was verified during a previous workflow

creation process, we expect finding an accurate one to be more likely

and faster than using semantic reasoning techniques from scratch.

2. Given two services or operations, find one or a sequence of services or op-

erations between them (mediators) that can connect these two services or

operations together. This problem can be converted to the problem of finding

a path from service or operation A to service or operation B.

Since the connectivity structure of services or operations in the knowledge

base is a graph, a shortest path algorithm (Dijkstra) is applicable to this

problem.

3. This concept can be extended to a wider use case in which users know the

exact input they can provide and the output they are trying to get. A general

planning technology tries to find a service or operation that accepts

this input and a service or operation that generates this output. Using the

connectivity structure, the path between the input and output can be found,

if there is one.
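
The path search in items 2 and 3 can be sketched with a standard implementation of Dijkstra's algorithm over an adjacency map. The graph below is a hypothetical connectivity graph with unit edge costs; the node `formatConverter` and the layout of the graph are assumptions for illustration only.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Graph = Dict[str, Dict[str, float]]   # node -> {neighbor: edge cost}


def shortest_path(graph: Graph, start: str, goal: str) -> Optional[List[str]]:
    """Dijkstra's algorithm; returns the node sequence or None if unreachable."""
    queue: List[Tuple[float, str, List[str]]] = [(0.0, start, [start])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in settled:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return None


# Hypothetical verified connectivity between services.
graph: Graph = {
    "queryGene": {"setIds": 1, "formatConverter": 1},
    "setIds": {"setFilter": 1},
    "setFilter": {"clustalW": 1},
    "formatConverter": {"clustalW": 1},
    "clustalW": {},
}

path = shortest_path(graph, "queryGene", "clustalW")
mediators = path[1:-1] if path else []
print(path, mediators)
```

The nodes strictly between the two endpoints play the role of mediators; a result of `None` corresponds to the case where no path exists.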

6.4.3 Implementation and evaluation

The connectivity between two services is identified automatically when a new

service is registered in the semantics enabled registry, using the matching rules

defined in Section 6.3. As more services are registered in the registry, the con-

nectivity graph is formed. Since the automatic identification process may intro-

duce some mismatches, the mismatched cases can be corrected during

the workflow translation process with knowledge from experts (see Figure 6.5).

During the workflow translation process, the knowledge discovery component can

find the path between two services/operations and suggest the next available ser-

vices/operations by searching the knowledge base at the syntactic level. The search

function is implemented with the Dijkstra algorithm.


[Figure 6.5 omitted. Labels in the figure: Registration process; registry; Automatically identify the connectivity; Knowledge base; Store the connectivity; Workflow translation / Service composition process; Refine, update, decompose the workflow.]

Figure 6.5. The creation process of the connectivity graph when a new service is added to the registry; the connectivity is refined and updated during the workflow translation process.

TABLE 6.1

PERFORMANCE EVALUATION OF MATCH DETECTION PROCESS

Number of   Number of       Load RDF repository   Average time of match detection
services    matched pairs   (milliseconds)        per single service (milliseconds)

200         10              1547                  12.02
400         34              2346                  13.01
600         84              2600                  12.31
800         138             3015                  12.35
1000        225             3325                  12.51

The connectivity graph approach is evaluated on a Dell laptop with a 1.5 GHz

Pentium M CPU and 512 MB of RAM. Service descriptions are randomly generated

using 418 concepts from the domain ontologies (myGrid and MoGServ) for the semantic

type and 10 defined concepts for the data type. Each service contains 1 operation.

Each operation has 1 input and 1 output. The measured performance of the

match detection process during the service registration process is reported in Table

6.1. The number of matched pairs is the number of identified pairs of services in

which one service’s output can be fed as input to the other service. Although the

time to load the RDF repository (Sesame) increases as the number of generated

services increases, loading is typically done once. The average time of the matching

process when a new service is registered in the repository is about 12 to 13

milliseconds.

The search function, based on the shortest path algorithm, is evaluated using the

connectivity graph created from 1000 randomly generated semantic web services. The

graph is formed from the matched pairs and the input and output of each service in

a matched pair. The measured performance of the path finding process is reported in

Table 6.2. Loading the connectivity graph is typically done once. The average path

search time is less than 1 millisecond. The longest path between two nodes has

9 additional nodes.

TABLE 6.2

PERFORMANCE EVALUATION OF PATH SEARCHING PROCESS

Number of   Number of   Average path search   Connectivity graph load
nodes       arcs        time (milliseconds)   time (milliseconds)

724         587         Less than 1           220

The preliminary results indicate that the performance of our implementation

is acceptable. Further testing with real services and workflows

is needed to fully evaluate our approach.

6.5 Workflow reuse

Both abstract workflows and concrete workflows can be viewed as graphs. With

this type of graph representation, graph matching techniques can be applied to find

similar workflows in the system. Although in-depth graph theoretic research is

not the main focus of this investigation, we are interested in applying an efficient

algorithm to find similar workflows in the system given the graph representation

of abstract workflow or concrete workflow.

SUBDUE (available at http://cygnus.uta.edu/subdue/) is a graph-based knowl-

edge discovery system that finds structural and relational patterns in data rep-

resenting entities and relationships. SUBDUE represents data using a labeled,

directed graph in which entities are represented by labeled vertices or subgraphs,

and relationships are represented by labeled edges between the entities. The SUB-

DUE graph match utility [18] is a part of the SUBDUE data mining system. The

graph match utility can perform exact and inexact graph matches on directed or

undirected graphs with labeled vertices and edges. It solves the graph isomor-

phism problem, which is defined as follows: given two graphs G1 and G2, is it possible

to permute (or relabel) the vertices of one graph so that it is equivalent to the other?

For example, a scientist may have a scientific process in her mind such as:

“I’d like to get all ATP alpha units of plastids in my MoG investigation and do

multiple sequence alignments and get an alignment report with a format that I am

able to feed into my local PAUP program.” A possible abstract workflow she may

define is similar to the one in Figure 6.6.

The given workflow is converted to the graph representation that can be fed

into the match algorithm. The match algorithm computes the similarity of the

given workflow against all the workflows stored in the knowledge base. The re-

turned match cost from the SUBDUE algorithm is the measurement that we use

to rank the similarity of the workflows. If two graphs are identical, the match

cost is 0. The costs of the various graph match transformations affect the results.

The costs can be changed based on the importance of each transformation. For

example, we might define the cost of substituting a vertex label or edge

label to be higher than the cost of deleting the vertex or edge. With the costs

specified in this way, the algorithm can return more suitable results.
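
The conversion of a workflow into the graph representation can be sketched as a small serializer. It reproduces only the line layout shown in Figure 6.6 (`v <id> <label>` for vertices, `e <source> <target> <label>` for edges); other details of the SUBDUE input format are not covered, and the function name `to_subdue` is hypothetical.

```python
from typing import List, Tuple


def to_subdue(vertices: List[str], edges: List[Tuple[int, int, str]]) -> str:
    """Serialize a labeled directed graph as 'v <id> <label>' and
    'e <source> <target> <label>' lines, with 1-based vertex ids."""
    lines = [f"v {i} {label}" for i, label in enumerate(vertices, start=1)]
    lines += [f"e {src} {dst} {label}" for src, dst, label in edges]
    return "\n".join(lines)


# Vertex and edge labels taken from the example in Figure 6.6.
vertices = ["input", "output", "task", "task",
            "query_term", "retrieving", "aligning", "multiple_aligning_report"]
edges = [(3, 4, "hasNext"), (3, 1, "hasInput"), (4, 2, "hasOutput"),
         (3, 6, "performTask"), (4, 7, "performTask"),
         (1, 5, "hasParameter"), (2, 8, "hasParameter")]

subdue_text = to_subdue(vertices, edges)
print(subdue_text)
```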

The threshold for returned workflows is defined based on the match cost. One or

more of the most similar workflows are returned and presented to users. Users

may decide to use these workflows as templates for editing their workflow

definition at the abstract level or the concrete level. Alternatively, users

may decide to use a returned workflow directly to conduct their experiments.

[Figure 6.6 omitted. It shows the graph view of the workflow next to its SUBDUE input format. The SUBDUE input format reads:]

    v 1 input
    v 2 output
    v 3 task
    v 4 task
    v 5 query_term
    v 6 retrieving
    v 7 aligning
    v 8 multiple_aligning_report
    e 3 4 hasNext
    e 3 1 hasInput
    e 4 2 hasOutput
    e 3 6 performTask
    e 4 7 performTask
    e 1 5 hasParameter
    e 2 8 hasParameter

Figure 6.6. The graph representation of a workflow for describing a scientific process

6.6 Related work

Abstract and concrete workflows have been introduced in various scientific

workflow literature and systems [22, 24, 73]. These two representations create a

view of certain aspects of a workflow that meets the interests of users with differ-

ent levels of knowledge of the services and of a particular domain. However, in

these systems and literature, the notions of concrete workflow and optimal workflow

are combined and are often not distinguished as two separate representations.

We believe that separating these two workflow representations provides flexibility

of dynamic binding, the ability to select optimal services, and easier integration

of Grid resource management services.

The translation of an abstract workflow into a concrete workflow is a process

of service discovery and service composition. It normally uses an ontology to

annotate services and applies reasoning and matchmaking technologies to form a

workflow. A number of research investigations focus on automating this process

and assume that the ontological model is well defined and services are correctly

annotated, which is not always the case. Rao et al. [76] present an approach that

addresses the reality of incomplete annotation. Their framework helps users become

better at annotating composable functionality over time. The enhanced system and

methodology proposed in this chapter are intended to reuse the knowledge that has

been verified by others. They provide users with more accurate guidance for service

discovery using that stored knowledge.

The importance of reuse and repurposing of workflows has been reported in
[37, 104]. Goderis et al. [38] present a graph-based approach to finding
similar concrete workflows on the web. Their approach is similar to ours but
uses a different graph matching algorithm and different graph representations.

6.7 Conclusion and future work

In this chapter, we present the importance of workflow and
knowledge reuse. To support that reuse, we propose and describe a
methodology and an enhanced workflow system. It includes a hierarchical workflow
structure consisting of four levels that allow users to specify workflows at
different levels of abstraction, based on their knowledge and experience. Two
components are added to the workflow system to collect and analyze the reusable
by-products of the workflow translation process when new services are added to
the system.

The proposed methodology is being used in the design and implementation
of a service-oriented system for supporting bioinformatics research. Building
on the successful design and implementation of the system (MoGServ) [110], we
developed an ontological model for annotating data and services in the system.
At the current stage, the number of services, operations, and workflows in the system
is relatively small, but it is expected to grow with usage. The future MoGServ is
intended to support genomic research and provide a workbench for biologists in
the Indiana Center for Insect Genomics (ICIG)1, a research center composed of
three academic institutional partners. Users can define a genomic research workflow
for a particular application through a web interface. This may result in higher
productivity for genomics researchers and in synergy from the transparent
integration of data and analysis tools from multiple locations. We believe that the
enhanced workflow system with the knowledge reuse capability can provide more
accurate guidance during the workflow creation process and make the process
more efficient. A systematic evaluation is being conducted.

1 http://ctdrt.bio.nd.edu/index.php?content=projectinfo.php&projectno=4

CHAPTER 7

SUMMARY AND FUTURE WORK

7.1 Summary

In this dissertation, we present a practical experiment in building a service-oriented
system on current web services technologies and bioinformatics middleware.
The first prototype of this system integrates data and services from
other service providers. It is being evaluated on a phylogenetic research application,
Mother of Green (MoG). Our evaluation demonstrates that a service-oriented
architecture can accelerate scientific research, increase research productivity, and
provide a new approach to doing science.

Based on the successful design and implementation of this prototype, we
present an enhanced system with semantic annotation of services and data. The
enhancement aims to allow life science researchers to define their experiments
at different levels based on their knowledge of the tools, data, and the system. The
semantically enriched data makes reuse, sharing, and search-based experiments
easier to conduct.

Few current practical methodologies and workflow systems for service composition
and workflow creation in e-Science consider the potential for reuse:
sharing the knowledge gained during the service composition process and reusing
all or part of existing workflows. We believe that a capability
for reusing this knowledge and these workflows could be an important component of a
workflow system. We propose a methodology and an enhanced system design to
facilitate the reuse of knowledge and workflows. It contains a hierarchical workflow
structure representation, knowledge management and knowledge discovery
components to capture and manage the reusable knowledge in a system, and an
approach that uses a graph matching algorithm to discover similar workflows.

7.2 Limitations and future work

The future MoGServ is intended to support genomic research and provide a
workbench for biologists in the Indiana Center for Insect Genomics (ICIG). The
ICIG includes three partners: the University of Notre Dame, Purdue University, and
Indiana University. The future MoGServ can help a user at an ICIG site discover
data or computational web services that are available at that site, at other ICIG
partners’ locations, or elsewhere on the web. The initial MoG implementation
discussed in Chapter 3 has several limitations that must be addressed before the
system can be used across multiple sites. These limitations concern security,
resource management, and end-user oriented workflow creation. Several improvements
and theoretical approaches are described in Chapter 5 and Chapter 6; however, more
work remains to be done. Future work may proceed along several directions.

• Integration of GridSAM. We will explore a way to integrate the MoG system
with a grid computing architecture so that security, resource allocation,
and resource management can be shifted to existing grid
computing technologies. The MoGServ implementation has a simple
resource management mechanism implemented by two components, the job
manager and the job launcher. A more sophisticated mechanism could be
integrated into the MoGServ system. The GridSAM1 Web Service is a
WS-I compliant Web Service implementation of the GridSAM service interface
as well as the upcoming Global Grid Forum Basic Execution Service
interface. It integrates with the GridSAM Core Engine to provide remote
job launching and file staging capability as described in a Job Submission
Description Language document. As a new feature introduced in GridSAM
2.0.0, a sophisticated authorization mechanism provides a powerful capability
to control incoming service requests on a user/group basis.

• Enhancement of user interface. In the current MoGServ, the data model
includes a table that stores personalized information for individual users. An
authorization component should be built into the system to enable users to
access the permitted services and to personalize their own workspace. A web
portal will be built to enable users to create an account and to log in and log out
with a username and password. The user account information, including the
access level, will be stored in a database. The GridSphere portal framework
[39], an open-source portlet-based web portal, is one of the candidates.

• Enhancement of data annotation and ontological model. The current ontological
model captures the metadata that users need to query their data
provenance. As the system is used more, the ontological model
may need to be updated with new properties and concepts.

• Integration of the presented new functionalities into the system. In this
dissertation, we present a new approach to improve the reuse of
workflows and their by-products, as well as a hierarchical workflow structure.
Future work includes adding these functionalities to the system by
developing an easy-to-use interface that lets users define workflows at multiple
levels, choose similar workflows, and manipulate them as
desired.

1 http://gridsam.sourceforge.net/2.0.0/index.html

APPENDIX A

GLOSSARY

BPEL4WS – Business Process Execution Language for Web Services provides a language for the formal specification of business processes and business interaction protocols.

BLAST – The Basic Local Alignment Search Tool compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.

ClustalW – a tool for global multiple alignment of DNA and protein sequences.

FASTA – a common sequence format that begins with a single-line description followed by lines of sequence data.

HGT – Horizontal gene transfer is a process in which an organism transfers genetic material to another cell that is not its offspring. HGT occurs outside of the mechanisms of Mendelian genetics, crossing species, order, and family reproductive barriers.

J2EE – Java 2 Platform, Enterprise Edition defines the standard for developing component-based multitier enterprise applications.

JXTA – an open source peer-to-peer platform created by Sun Microsystems in 2001.

LGT – Lateral gene transfer is a process in which an organism transfers genetic material to another cell that is not its offspring. LGT occurs within the cell, from endosymbiont genomes to the host cell nucleus.

MoG – Mother of Green is a collaborative research project on plastid phylogenetic analysis involving information technologists and biologists.

MoGServ – a service-oriented system for data integration and data analysis for phylogenetic analysis.


NEXUS – The NEXUS format was designed by David Maddison, Wayne Maddison, and David Swofford to facilitate the interchange of input files between programs used in phylogeny and classification.

OGSA – Open Grid Services Architecture

OWL – Web Ontology Language

OWL-S – the ontology description of web service using OWL.

PAUP – a program for phylogenetic analysis using parsimony, maximum likelihood, and distance methods.

Phylip – a set of modular programs for performing numerous types of phylogenetic analysis.

Phylogeny – also called phylogenesis, the origin and evolution of a group of organisms.

Phylogenetics – the study of the evolutionary relationships among various groups of organisms.

RDF – Resource Description Framework is the basic standard for knowledge sharing and reuse in the semantic web.

REST – Representational State Transfer is a term coined by Roy Fielding to describe an architectural style of networked systems.

SAM – Sequence Alignment and Modeling System.

SOA – Service-oriented architecture

SOAP – Simple Object Access Protocol is a protocol for exchanging messages among requesters and providers.

SOC – Service-oriented computing

UDDI – Universal Description, Discovery and Integration provides a standard registry for publishing, discovery and reuse of web services.

WSDL – Web Services Description Language defines the abstract interface of services.

WS-I – an open industry organization chartered to promote Web services interoperability; it creates, promotes, and supports generic protocols for the interoperable exchange of messages between Web services.

WSRF – Web Services Resource Framework


XML – Extensible Markup Language

XSLT – XSL Transformations is a language for transforming XML documents into other XML documents. XSL specifies the styling of an XML document by using XSLT to describe how the document is transformed into another XML document that uses the formatting vocabulary.

A.1 Pictures

Figure A.1. Time line for the origin of life and major invasions giving rise to mitochondria and plastids. [27]

Figure A.2. Gene transfer to the nucleus. [27]

Figure A.3. Symbioses process [69]

Figure A.4. ATP Synthase: the wheel that powers life. It is a candidate for ascertainment of deep phylogeny.

APPENDIX B

MOGSERV MANUAL

B.1 Main

MoGServ is accessible through the URL http://almond.cse.nd.edu:10000/bioinfor1.

If you are inside the ND network, you may also access another MoGServ host at

http://biocomp.science.nd.edu:8080/mog. (See Figure B.1)

B.2 Retrieve genome and gene data from NCBI database

The data collection service retrieves complete genome sequences and gene sequences
using terms defined by users. Retrieved sequences are stored in the local database.
The service runs weekly over the weekend or daily at night to update
the database. See Figure B.2.

B.3 Query local database

This service allows users to create gene sequence sets or genome sequence sets
by querying the local database. The metadata of these sequences is indexed
using the Lucene index and search engine. Valid queries, which follow the Lucene
syntax, include “chlo*”, “ATP and atp”, and so on. Users enter their query
and choose either “gene” or “complete genome” sequences. A set of sequences is
returned. Users can examine the set and delete sequences from it. Then
users can choose either “create new set” or “add to an existing set”. “create
new set” groups these sequences into a new set for sequence alignment, and a
set id is returned to users for further reference. “add to an existing set” adds these
sequences to an existing set (whose id is entered by users). See Figures B.3 and B.4. Users
can also download these sets in different formats.

Figure B.1. The main menu of the MoGServ

B.4 Set management

Users can upload a set of sequences in FASTA format to the local database.
These sequences can come from users’ own lab experiments and may not yet be ready
for submission to the public database. They can also be a small number of sequences
that are not in the local database at that time. These sequences are annotated using the
appropriate metadata description. See Figure B.5.

Figure B.2. A web interface provides users a way to define data of interest.

Users can query the information of a set as shown in Figure B.6, such as the
creation date, the origin of the set, etc.

Users can also use the set filter service to find the intersection of organisms

among multiple sets. See Figure B.7.

B.5 Data analysis services

The MoGServ system provides seven data analysis services: blastn, blastp, blastx,
tblastn, tblastx, MegaBLAST, and ClustalW.

To use blast or megablast for sequence alignment, users need to
input two sequence sets: a base set and a compare set. A base set is a set of sequences
that plays the role of the “database” field on the NCBI blast search website. A compare
set is a set of sequences that is compared against a base set; it plays the role of the
“search” field on the NCBI blast search website. Base sets and compare sets must be
created using the “Query Local” or “Set management” services. Users can define a few
parameters, such as the e-value, window size, and so on. A job id is returned and
shown in the browser. Users should record this id for further reference.
When the task is executed, the required sequences are retrieved from the local database
and passed to the blast (megablast) program. Comparison results are stored in the
local file system for downloading. Figure B.8 shows the tblastn service.

Figure B.3. Input the query term from this interface and choose gene or genome database

To use the ClustalW service, users need to define the set id and the sequence
type. See Figure B.9. The job id is returned for further reference.

B.6 Job management

This service allows users to query job information and monitor the execution
status of their submitted jobs. There are three execution statuses: “submit”,
“start”, and “finish”. “output” becomes a hot link when the execution status turns to
“finish”. Users can follow the link to view the input and output of each data analysis
job. See Figures B.10, B.11, and B.12.

Figure B.4. The results from querying local database

Figure B.5. Users may copy and paste particular sequences and upload them to the local database

Figure B.6. Set information

Figure B.7. The set filter service is used to find the intersection of organisms among multiple sets.

Figure B.8. tblastn interface in MoGServ

Figure B.9. ClustalW Interface in MoGServ

Figure B.10. Job management interface shows the status, input link, and output link of a job

Figure B.11. An example input of a ClustalW analysis. The set id is a hot link; users can view sequence information in this set.

Figure B.12. An example output of a ClustalW analysis. Users can download, convert, and view the results.

APPENDIX C

DEVELOPMENT AND DEPLOYMENT TOOLKITS

Some development and deployment toolkits we used for the implementation
are listed in Table C.1. All the software packages are open source and can be
downloaded from the URLs listed.

TABLE C.1

OPEN SOURCE SOFTWARE PACKAGES USED FOR DEVELOPMENT AND DEPLOYMENT

Packages | Version | Descriptions | URL
Apache Axis 1 | axis-1_2RC2 | a SOAP engine for developing and hosting web services | http://ws.apache.org/axis/
Tomcat | jakarta-tomcat-5.0.18 | J2EE compliant servlet container | http://tomcat.apache.org/
Taverna | 1.4 | a GUI based workbench for creating, executing, and monitoring workflows | http://taverna.sourceforge.net/
Apache Lucene | 1.4.3 | a high-performance, full-featured text search engine library written in Java | http://lucene.apache.org/java/docs
PostgreSQL | 8.0.3 | a relational database system | http://www.postgresql.org/
Protege | 3.2 | an ontology editor and knowledge-base framework with OWL support | http://protege.stanford.edu/
Pellet | 1.3-beta2 | an open-source Java based OWL DL reasoner | http://pellet.owldl.com/
sesame | 1.2.6 | an open source RDF framework with support for RDF Schema inferencing and querying | http://www.openrdf.org/about.jsp
subdue | 5.1.4 | a graph-based data mining system | http://cygnus.uta.edu/subdue/

APPENDIX D

SUPPLEMENTARY MATERIAL FOR CHAPTER 3 AND CHAPTER 4

D.1 Complete genome sequence in XML

<?xml version="1.0"?>
<!DOCTYPE INSDSeq PUBLIC "-//NCBI//INSD INSDSeq/EN"
  "http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd">
<INSDSeq>
  <INSDSeq_locus>NC_005042</INSDSeq_locus>
  <INSDSeq_length>1751080</INSDSeq_length>
  <INSDSeq_strandedness>double</INSDSeq_strandedness>
  <INSDSeq_moltype>DNA</INSDSeq_moltype>
  <INSDSeq_topology>circular</INSDSeq_topology>
  <INSDSeq_division>BCT</INSDSeq_division>
  <INSDSeq_update-date>24-JUL-2006</INSDSeq_update-date>
  <INSDSeq_create-date>25-JUL-2003</INSDSeq_create-date>
  <INSDSeq_definition>Prochlorococcus marinus subsp. marinus str. CCMP1375,
    complete genome</INSDSeq_definition>
  <INSDSeq_primary-accession>NC_005042</INSDSeq_primary-accession>
  <INSDSeq_accession-version>NC_005042.1</INSDSeq_accession-version>
  <INSDSeq_other-seqids>
    <INSDSeqid>ref|NC_005042.1|</INSDSeqid>
    <INSDSeqid>gnl|NCBI_GENOMES|310</INSDSeqid>
    <INSDSeqid>gi|33239452</INSDSeqid>
  </INSDSeq_other-seqids>
  <INSDSeq_project>419</INSDSeq_project>
  <INSDSeq_source>Prochlorococcus marinus subsp. marinus str. CCMP1375
    (Prochlorococcus marinus SS120)</INSDSeq_source>
  <INSDSeq_organism>Prochlorococcus marinus subsp. marinus str. CCMP1375
  </INSDSeq_organism>
  <INSDSeq_taxonomy>Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae;
    Prochlorococcus</INSDSeq_taxonomy>
  ....
  <INSDSeq_feature-table>
    ....
    <INSDFeature>
      <INSDFeature_key>CDS</INSDFeature_key>
      <INSDFeature_location>1447640..1449106</INSDFeature_location>

      <INSDFeature_intervals>
        <INSDInterval>
          <INSDInterval_from>1447640</INSDInterval_from>
          <INSDInterval_to>1449106</INSDInterval_to>
          <INSDInterval_accession>NC_005042.1</INSDInterval_accession>
        </INSDInterval>
      </INSDFeature_intervals>
      <INSDFeature_quals>
        <INSDQualifier>
          <INSDQualifier_name>gene</INSDQualifier_name>
          <INSDQualifier_value>atpD</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>locus_tag</INSDQualifier_name>
          <INSDQualifier_value>Pro1591</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>note</INSDQualifier_name>
          <INSDQualifier_value>Produces ATP from ADP in the presence of a proton
            gradient across the membrane. The beta chain is a regulatory subunit</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>codon_start</INSDQualifier_name>
          <INSDQualifier_value>1</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>transl_table</INSDQualifier_name>
          <INSDQualifier_value>11</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>product</INSDQualifier_name>
          <INSDQualifier_value>ATP synthase subunit B</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>protein_id</INSDQualifier_name>
          <INSDQualifier_value>NP_875982.1</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>db_xref</INSDQualifier_name>
          <INSDQualifier_value>GI:33241040</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>db_xref</INSDQualifier_name>
          <INSDQualifier_value>GeneID:1462973</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>translation</INSDQualifier_name>
          <INSDQualifier_value>MAAAATASTGTKGVVRQVIGPVLDVEFPAGKLPKILNALRIEGK

NPAGQDVALTAEVQQLLGDHRVRAVAMSGTDGLVRGMEAIDTGSAISVPVGEATLGRIF


NVLGEPVDEQGPVKTKTTSPIHREAPKLTDLETKPKVFETGIKVIDLLAPYRQGGKVGLFGGAGVGKTVLIQELINNIAKEHGGVSVFGGVGERTREGNDLYEEFKESGVINADDLTQSKVALCFGQMNEPPGARMRVGLSALTMAEHFRDVNKQDVLLFVDNIFRFVQAGSEVSALLGRMPSAVGYQPTLGTDVGELQERITSTLEGSITSIQAVYVPADDLTDPAPATTFAHLDATTVLARALAAKGIYPAVDPLDSTSTMLQPSVVGDEHYRTARAVQSTLQRYKELQDIIAILGLDELSEDDRRTVDRARKIEKFLSQPFFVAEIFTGMSGKYVKLEDTIAGFNMILSGELDDLPEQAFYLVGNITEVKEKAQKISADAKK</INSDQualifier_value>

        </INSDQualifier>
      </INSDFeature_quals>
    </INSDFeature>
    ....
  </INSDSeq_feature-table>
  <INSDSeq_sequence> ....... </INSDSeq_sequence>
</INSDSeq>

The size of this example XML file is about 7.7 MB, whereas the size of the complete genome
sequence in FASTA format is about 1.7 MB. The actual length of this sequence is 1751080
nt.
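Because records of this kind are large, it can help to read them as a stream rather than loading the whole document. The following is a minimal sketch, assuming a record with the INSDSeq element names shown above; the SAMPLE string is a trimmed, hypothetical excerpt, and iter_cds is an illustrative helper that is not part of MoGServ. It uses the standard-library iterparse function and clears each feature after it has been processed.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed, hypothetical excerpt modeled on the INSDSeq record above.
SAMPLE = """<INSDSeq>
  <INSDSeq_locus>NC_005042</INSDSeq_locus>
  <INSDSeq_feature-table>
    <INSDFeature>
      <INSDFeature_key>CDS</INSDFeature_key>
      <INSDFeature_location>1447640..1449106</INSDFeature_location>
      <INSDFeature_quals>
        <INSDQualifier>
          <INSDQualifier_name>gene</INSDQualifier_name>
          <INSDQualifier_value>atpD</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>protein_id</INSDQualifier_name>
          <INSDQualifier_value>NP_875982.1</INSDQualifier_value>
        </INSDQualifier>
      </INSDFeature_quals>
    </INSDFeature>
  </INSDSeq_feature-table>
</INSDSeq>"""


def iter_cds(stream):
    """Yield (location, qualifiers) for each CDS feature in the stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "INSDFeature":
            continue
        if elem.findtext("INSDFeature_key") == "CDS":
            quals = {}
            for qualifier in elem.iter("INSDQualifier"):
                # A repeated name (for example db_xref) keeps only its last value.
                name = qualifier.findtext("INSDQualifier_name")
                quals[name] = qualifier.findtext("INSDQualifier_value")
            yield elem.findtext("INSDFeature_location"), quals
        elem.clear()  # drop the children of a finished feature


records = list(iter_cds(io.BytesIO(SAMPLE.encode("utf-8"))))
for location, quals in records:
    print(quals.get("gene"), quals.get("protein_id"), location)
```

The protein_id value collected this way corresponds to the accession used to fetch the protein sequence in FASTA format.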

D.2 Example of an ATP synthase subunit B sequence

Fasta format:

>gi|33241040|ref|NP_875982.1| ATP synthase subunit B [Prochlorococcus marinus subsp. marinus str. CCMP1375]

MAAAATASTGTKGVVRQVIGPVLDVEFPAGKLPKILNALRIEGKNPAGQDVALTAEVQQLLGDHRVRAVAMSGTDGLVRGMEAIDTGSAISVPVGEATLGRIFNVLGEPVDEQGPVKTKTTSPIHREAPKLTDLETKPKVFETGIKVIDLLAPYRQGGKVGLFGGAGVGKTVLIQELINNIAKEHGGVSVFGGVGERTREGNDLYEEFKESGVINADDLTQSKVALCFGQMNEPPGARMRVGLSALTMAEHFRDVNKQDVLLFVDNIFRFVQAGSEVSALLGRMPSAVGYQPTLGTDVGELQERITSTLEGSITSIQAVYVPADDLTDPAPATTFAHLDATTVLARALAAKGIYPAVDPLDSTSTMLQPSVVGDEHYRTARAVQSTLQRYKELQDIIAILGLDELSEDDRRTVDRARKIEKFLSQPFFVAEIFTGMSGKYVKLEDTIAGFNMILSGELDDLPEQAFYLVGNITEVKEKAQKISADAKK

TinySeq XML:

<?xml version="1.0"?>
<!DOCTYPE TSeq PUBLIC "-//NCBI//NCBI TSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
<TSeq>
  <TSeq_seqtype value="protein"/>
  <TSeq_gi>33241040</TSeq_gi>
  <TSeq_accver>NP_875982.1</TSeq_accver>
  <TSeq_sid>gnl|REF_uproscoff|Pro1591</TSeq_sid>
  <TSeq_taxid>167539</TSeq_taxid>
  <TSeq_orgname>Prochlorococcus marinus subsp. marinus str. CCMP1375

</TSeq_orgname><TSeq_defline>ATP synthase subunit B [Prochlorococcus marinus subsp.marinus str. CCMP1375]</TSeq_defline><TSeq_length>488</TSeq_length><TSeq_sequence>MAAAATASTGTKGVVRQVIGPVLDVEFPAGKLPKILNALRIEGKNPAGQDVALTAEVQQLLGDHRVRAVAMSGTDGLVRGMEAIDTGSAISVPVGEATLGRIFNVLGEPVDEQGPVKTKTTSPIHREAPKLTDLETKPKVFETGIKVIDLLAPYRQGGKVGLFGGAGVGKTVLIQELINNIAKEHGGVSVFGGVGERTREGNDLYEEFKESGVINADDLTQSKVALCFGQMNEPPGARMRVGLSALTMAEHFRDVNKQDVLLFVDNIFRFVQAGSEVSALLGRMPSAVGYQPTLGTDVGELQERITSTLEGSITSIQAVYVPADDLTDPAPATTFAHLDATTVLARALAAKGIYPAVDPLDSTSTMLQPSVVGDEHYRTARAVQSTLQRYKELQDIIAILGLDELSEDDRRTVDRARKIEKFLSQPFFVAEIFTGMSGKYVKLEDTIAGFNMILSGELDDLPEQAFYLVGNITEVKEKAQKISADAKK</TSeq_sequence>

</TSeq>

There are a total of 182 complete genome sequences and 878 ATP gene
sequences in the database.
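As an illustration of the FASTA layout used for the sequence above, the following sketch splits FASTA text into header and sequence pairs and extracts the gi and accession fields from an NCBI-style header. The function names are illustrative, and the sample sequence is deliberately truncated to the first residues of the record; this is not the parser used in MoGServ.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_ncbi_header(header):
    """Split a 'gi|<gi>|ref|<accver>| description' header into fields."""
    ids, _, description = header.partition(" ")
    fields = ids.split("|")
    return {"gi": fields[1], "accver": fields[3], "description": description.strip()}


# Truncated sample: only the first residues of the sequence are shown.
SAMPLE_FASTA = """>gi|33241040|ref|NP_875982.1| ATP synthase subunit B [Prochlorococcus marinus subsp. marinus str. CCMP1375]
MAAAATASTGTKGVVRQVIGPV
LDVEFPAGKLPKILNALRIEGK
"""

fasta_records = parse_fasta(SAMPLE_FASTA)
info = split_ncbi_header(fasta_records[0][0])
print(info["gi"], info["accver"], len(fasta_records[0][1]))
```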

D.3 Protein name

For each whole genome sequence, we find all of the proteins that make up ATP
synthase (see Table D.1).

D.4 Syntax of search local database

The local database is indexed using the Lucene search engine. Refer to the Lucene
query parser syntax documentation1 for a complete description of the syntax. Two tables
store the complete genome sequences and the gene sequences, respectively. Table D.2
lists the syntax and examples for searching these tables. Table D.3 summarizes the
fields used when creating the index.

D.5 Workflow of sequence retrieval

Because we use web services provided by NCBI to retrieve the sequences, failures
may occur during the data collection process. Recording the status of each data retrieval
in the database enables us to examine the integrity of the data. Parsing the XML
file requires a large amount of memory. The workflow deals with redundant sequences but
still records the query term. The database is updated weekly or daily.

1 http://lucene.apache.org/java/docs/queryparsersyntax.html

TABLE D.1

NAME OF ATP SYNTHASE

Protein name | Description
atpC | gamma chain
atp1 | protein 1
atpI | chain a
atpH | subunit c
atpG | chain b’
atpF | chain b
atpD | delta chain
atpA | alpha chain
atpB | beta subunit
atpE | epsilon subunit
N/A | “ATP synthase”
ch1M | Mg-protoporphyrin IX methyl transferase
ftrC | ferredoxin-thioredoxin reductase, catalytic chain

TABLE D.2

SYNTAX OF SEARCHING LOCAL DATABASE

Query type | Example
single words | cyanobacteria
phrase | “ATP synthase”
field | name:ATP AND gamma AND plastid
boolean | atpa NOT bacteria
grouping | atpa AND (plastid or cyanobacteria)

TABLE D.3

INDEXING FIELD OF LOCAL DATABASE

Field | Comments
gi | gi number of the sequence
accver | accver number of the sequence
name | name of the sequence
term | query defined by users and used to get this sequence from NCBI
taxonomy | taxonomy of the sequence provided by NCBI
cds | name of the protein that makes up ATP synthase (only in gene table)
nucleotide gi | gi number of the corresponding nucleotide sequence, which is also the gi of the complete genome (only in gene table)
nucleotide name | name of the corresponding nucleotide sequence (only in gene table)
default | the default field contains all the information described above and is searched when no field name is specified

Pseudo code for retrieving complete genome sequence:

    get search term from ncbi_retrieve table
    for each term
        get sequence in fasta format
        set retrieve_gene_status as 'ready'

Pseudo code for retrieving gene sequence:

    get acceid from ncbi_genomes table where retrieve_gene_status is 'ready'
    for each acceid
        update retrieve_gene_status as 'start' in ncbi_genome table
        get sequence in GB XML format
        parse the XML to get particular protein sequence acceid
        use acceid to get protein sequence in fasta format
        compute the corresponding nucleotide sequence
        get taxonomy of the sequence
        update retrieve_gene_status as 'start' in ncbi_genome table
        update the taxonomy for the sequence in ncbi_genome table
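The first loop of the pseudo code, together with the idea of recording a retrieval status so that failed requests can be identified later, can be sketched as follows. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the PostgreSQL database, the column layout of the two tables is hypothetical, fetch_fasta is a stub that simulates the NCBI web service call, and the 'failed' status is our own addition that does not appear in the pseudo code.

```python
import sqlite3


def fetch_fasta(term):
    """Hypothetical stand-in for the NCBI web service call."""
    if term == "unreachable":
        raise IOError("simulated service failure")
    return ">%s\nACGT" % term


def retrieve_genomes(conn, fetch=fetch_fasta):
    """Fetch one sequence per search term and record the retrieval status."""
    terms = [row[0] for row in conn.execute("SELECT term FROM ncbi_retrieve")]
    for term in terms:
        try:
            fasta = fetch(term)
        except IOError:
            # Assumed status value; lets a later run find and retry failures.
            status, fasta = "failed", None
        else:
            status = "ready"
        conn.execute(
            "INSERT INTO ncbi_genome (term, fasta, retrieve_gene_status) "
            "VALUES (?, ?, ?)",
            (term, fasta, status),
        )
    conn.commit()


# Hypothetical schema; the real MoGServ tables are not shown in the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ncbi_retrieve (term TEXT)")
conn.execute(
    "CREATE TABLE ncbi_genome (term TEXT, fasta TEXT, retrieve_gene_status TEXT)"
)
conn.executemany(
    "INSERT INTO ncbi_retrieve VALUES (?)", [("Prochlorococcus",), ("unreachable",)]
)
retrieve_genomes(conn)
statuses = dict(conn.execute("SELECT term, retrieve_gene_status FROM ncbi_genome"))
print(statuses)
```

Storing the status next to each record means an interrupted weekly or daily update can resume from the rows that are not yet marked as ready.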

D.6 ClustalW input

An example of the ClustalW input file:

<?xml version='1.0' encoding='utf-8'?>
<inputparams>
  <setid>142</setid>
  <sequencetype>nucleotid</sequencetype>
  <title>Sequence</title>
  <topdiags></topdiags>
  <alignment>full</alignment>
  <window></window>
  <gapext></gapext>
  <outputtree></outputtree>
  <output>aln1</output>
  <tossgaps>true</tossgaps>
  <ktup></ktup>
  <kimura>true</kimura>
  <matrix>blosum</matrix>
  <scores>percent</scores>
  <outorder>aligned</outorder>
  <gapopen></gapopen>
  <gapclose></gapclose>
  <gapdist></gapdist>
  <pairgap></pairgap>
</inputparams>

An example of the ClustalW output file:

<?xml version='1.0' encoding='utf-8'?>
<output>
  <title>Sequence</title>
  <ebiid>clustalw-20060925-04170320</ebiid>
  <file>clustalw-20060925-04170320.txt</file>
  <file>clustalw-20060925-04170320.aln</file>
  <file>clustalw-20060925-04170320.dnd</file>
</output>

D.7 Blast

An example blastn input:

<?xml version='1.0' encoding='utf-8'?>
<inputparams>
  <expect>10</expect>
  <wordsize>11</wordsize>
  <matrix></matrix>
  <opengap></opengap>
  <extendgap></extendgap>
  <searchSetId>130</searchSetId>
  <searchSetType>gene</searchSetType>
  <searchSeqType>nucleotide</searchSeqType>
  <dbSetId>130</dbSetId>
  <dbSetType>gene</dbSetType>
  <dbSeqType>nucleotide</dbSeqType>
</inputparams>

D.8 PAUP

The result generated by the ClustalW program is converted to NEXUS format
from the web interface (see Figure B.12). The data conversion is done by a
service provided in the system. Here is a portion of the NEXUS file for all
ATP beta subunit sequences.

#NEXUS

BEGIN DATA;
DIMENSIONS NTAX=27 NCHAR=1503;
FORMAT DATATYPE=DNA INTERLEAVE MISSING=-;

[Name: Saccharum1 Len: 1503 Check: 0]
[Name: Saccharum2 Len: 1503 Check: 0]
[Name: Zea_mays Len: 1503 Check: 0]
[Name: Triticum_a Len: 1503 Check: 0]
...
MATRIX
Saccharum1 ATGAGAACCAATCCTACTAC TTCGCGTCCCGGGGTTTCCACAATTGAAGAAAAAA----- -GCGTAGGGCGTATTGATCA AATTATTGGACCCGTGCTGG
Saccharum2 ATGAGAACCAATCCTACTAC TTCGCGTCCCGGGGTTTCCACAATTGAAGAAAAAA----- -GCGTAGGGCGTATTGATCA AATTATTGGACCCGTGCTGG
Zea_mays ATGAGAACCAATCCTACTAC TTCGCGTCCCGGGATTTCCACAATTGAAGAAAAAA----- -GCGTAGGGCGTATTGATCA AATTATTGGACCCGTGCTGG
...
Calycanthu TGA
Pinus_kora ---
Pinus_thun ---
Marchantia ---
Physcomitr ---
Anthoceros ---
Huperzia_l ---
;
END;
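The conversion service itself is not listed in this appendix. As a rough sketch of what such a conversion involves, the function below writes a set of aligned sequences as a non-interleaved NEXUS DATA block with the same DIMENSIONS and FORMAT commands as the excerpt above. The function name to_nexus and the two short sequences are hypothetical and serve only as an illustration.

```python
def to_nexus(alignment, datatype="DNA", missing="-"):
    """Render {name: aligned_sequence} as a non-interleaved NEXUS DATA block."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    nchar = lengths.pop()
    width = max(len(name) for name in alignment)
    lines = [
        "#NEXUS",
        "",
        "BEGIN DATA;",
        "DIMENSIONS NTAX=%d NCHAR=%d;" % (len(alignment), nchar),
        "FORMAT DATATYPE=%s MISSING=%s;" % (datatype, missing),
        "MATRIX",
    ]
    for name, seq in alignment.items():
        # Pad taxon names so the sequence columns line up.
        lines.append("%s %s" % (name.ljust(width), seq))
    lines += [";", "END;"]
    return "\n".join(lines)


# Hypothetical, very short alignment used only to exercise the function.
nexus_text = to_nexus({"Saccharum1": "ATGAGAACCA", "Zea_mays": "ATGAGA-CCA"})
print(nexus_text)
```

Checking that all sequences share one length before writing the block guards against passing an unaligned set to PAUP.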

Here is the configuration file used for PAUP to generate a phylogenetic tree using the NEXUS file:

#NEXUS
begin paup;
set autoclose=yes warntree=no warnreset=no;
log start file=thisfile.log replace;
execute atpb_27.nex;
Set criterion=distance;
dset dist = hky85;
showdist;
nj;
nj breakties = random;
bootstrap nreps=100 brlens=yes keepall=yes search=heuristic;
savetrees from=1 to=1 savebootp=both maxdecimals=0;
contree all/strict=no file=thisfilename.tre replace showtree=yes;
end;

Figures D.1 and D.2 show the generated trees.

Figure D.1. Phylogenetic tree generated by PAUP

Figure D.2. The phylogenetic tree file generated by PAUP can be viewed by other programs

APPENDIX E

SUPPLEMENTARY MATERIAL FOR CHAPTER 5 AND CHAPTER 6

This is a sample output from comparing two workflows using SUBDUE. The
inexact graph match program computes the cost of transforming the larger of the
input graphs into the smaller according to predefined transformation
costs. The program returns this cost and the mapping of vertices in the larger
graph to vertices in the smaller graph. A smaller match cost indicates a
higher structural similarity between two workflows.

// Costs of various graph match transformations
#define INSERT_VERTEX_COST 1.0 // insert vertex
#define DELETE_VERTEX_COST 1.0 // delete vertex
#define SUBSTITUTE_VERTEX_LABEL_COST 1.0 // substitute vertex label
#define INSERT_EDGE_COST 1.0 // insert edge
#define INSERT_EDGE_WITH_VERTEX_COST 1.0 // insert edge with vertex
#define DELETE_EDGE_COST 1.0 // delete edge
#define DELETE_EDGE_WITH_VERTEX_COST 1.0 // delete edge with vertex
#define SUBSTITUTE_EDGE_LABEL_COST 1.0 // substitute edge label
#define SUBSTITUTE_EDGE_DIRECTION_COST 1.0 // change directedness of edge
#define REVERSE_EDGE_DIRECTION_COST 1.0 // change direction of directed edge

[xxiang1@localhost subdue-5.1.4]$ bin/gm graphs/graph1.g graphs/mytest1.g
Match Cost = 15.000000
Mapping (vertices of larger graph to smaller):
1 -> deleted
2 -> 3
3 -> 1
4 -> 2
5 -> deleted
6 -> deleted
7 -> 4
8 -> deleted
[xxiang1@localhost subdue-5.1.4]$
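To make the relation between a vertex mapping and a match cost more concrete, the sketch below scores a given mapping under unit costs. It is a simplified illustration and does not reproduce the SUBDUE implementation: SUBDUE searches for a low-cost mapping itself and distinguishes more transformation types (for example, deleting an edge together with a vertex), whereas this function only scores a mapping that is supplied to it and assumes at most one edge per ordered vertex pair. The two small graphs are hypothetical.

```python
# Unit costs, named after a subset of the SUBDUE constants listed above.
INSERT_VERTEX_COST = 1.0
DELETE_VERTEX_COST = 1.0
SUBSTITUTE_VERTEX_LABEL_COST = 1.0
INSERT_EDGE_COST = 1.0
DELETE_EDGE_COST = 1.0
SUBSTITUTE_EDGE_LABEL_COST = 1.0


def unit_match_cost(larger, smaller, mapping):
    """Score a vertex mapping from the larger graph to the smaller one.

    Each graph is (vertices, edges) with vertices as {id: label} and edges
    as a list of (source, target, label). A mapping value of None marks a
    deleted vertex.
    """
    l_vertices, l_edges = larger
    s_vertices, s_edges = smaller
    cost = 0.0
    for vertex, label in l_vertices.items():
        target = mapping.get(vertex)
        if target is None:
            cost += DELETE_VERTEX_COST
        elif s_vertices[target] != label:
            cost += SUBSTITUTE_VERTEX_LABEL_COST
    covered = {t for t in mapping.values() if t is not None}
    cost += INSERT_VERTEX_COST * len(set(s_vertices) - covered)

    s_edge_labels = {(s, t): label for s, t, label in s_edges}
    matched = set()
    for s, t, label in l_edges:
        key = (mapping.get(s), mapping.get(t))
        if key not in s_edge_labels:
            cost += DELETE_EDGE_COST  # endpoint deleted or no counterpart
        else:
            matched.add(key)
            if s_edge_labels[key] != label:
                cost += SUBSTITUTE_EDGE_LABEL_COST
    cost += INSERT_EDGE_COST * len(set(s_edge_labels) - matched)
    return cost


# Hypothetical graphs: a two-task workflow and a one-task workflow.
larger = (
    {1: "task", 2: "task", 3: "retrieving", 4: "aligning"},
    [(1, 2, "hasNext"), (1, 3, "performTask"), (2, 4, "performTask")],
)
smaller = ({1: "task", 2: "retrieving"}, [(1, 2, "performTask")])
cost = unit_match_cost(larger, smaller, {1: 1, 2: None, 3: 2, 4: None})
print(cost)
```

In this example two vertices and two edges of the larger graph have no counterpart in the smaller graph, so the mapping is scored with four unit deletions.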


An example of the WSDL description of a service provided in MoGServ is shown in
Figure E.1.

Figure E.2 shows a workflow created with the Taverna workbench, and Figure E.3
shows it in XScufl format. Figure E.4 shows a sample data annotation in RDF format
displayed with RDF Gravity1. Figure E.5 shows a sample service annotation in RDF
format, also displayed with RDF Gravity.

1 http://semweb.salzburgresearch.at/apps/rdf-gravity/download.html

169

Figure E.1. This is the WSDL description of QueryLocal service hostedin the MoGServ, which provides an operation to create a set in the localdatabase. This operation accepts two parameters and return the set id.

170

Figure E.2. One example of using Taverna workbench to create, test,and run workflow. This workflow accepts users input, search the localdatabase, create set, align set using ClustalW, convert the ClustalW

result to NEXUS format, which can be fed to PAUP.

171

Figure E.3. XScufl workflow format represents the workflow createdusing the Taverna workbench.

172

Figure E.4. Annotation of job and set information using ontologicalmodel defined. The sample rdf file is displayed using RDF Gravity.

Figure E.5. Annotation of a service using the defined ontological model. The sample RDF file is displayed using RDF Gravity.

BIBLIOGRAPHY

1. S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. Basic localalignment search tool. J. Mol. Biol., 215(3):403–10, 1990.

2. K. Amin, G. von Laszewski, M. Hategan, N. J. Zaluzec, S. Hampton, andA. Rossi. Gridant: A client-controllable grid workflow system. In Proceedingsof the 37th Hawaii International Conference on System Science, 2004.

3. Axis. Apache axis, apache software foundation. URL http://ws.apache.

org/axis.

4. BEANSHELL. Light weight scripts for java. URL http://www.beanshell.

org/.

5. K. A. Beiter and K. Ishii. Integration producibility and product performancetools within a web-service environment. In ASME 2003 design engineeringtechnical conferences and computers and information in engineering confer-ence, 2003.

6. B. Benatallah, M. Dumas, Q. Z. Sheng, and A. H. Ngu. Declarative compo-sition and peer-to-peer provisioning of dynamic web services. In Proceedingsof the 18th International Conference on Data Engineering (ICDE’02), 2002.

7. T. Berners-Lee, J. Hedler, and O. Lassila. The semantic web. ScientificAmerican, May 2001.

8. T. Berners-Lee, W. Hall, J. Hendler, N. Shadbolt, and D. J. Weitzner. Cre-ating a science of the web. Science, 313(5788):769–771, August 2006.

9. BIOWBI. Bioinformatic workflow builder interface (biowbi). URL http:

//www.alphaworks.ibm.com/tech/biowbi.

10. P. A. Bonatti and P. Festa. On optimal service selection. In Proceedings ofthe 14th international conference on World Wide Web, 2005.

11. BPWS4J. The ibm business process execution language for web service javarun time. URL http://www.alphaworks.ibm.com/tech/bpws4j.

175

http://ws.apache.org/axis

http://ws.apache.org/axis

http://www.beanshell.org/

http://www.beanshell.org/

http://www.alphaworks.ibm.com/tech/biowbi

http://www.alphaworks.ibm.com/tech/biowbi

http://www.alphaworks.ibm.com/tech/bpws4j

12. D. Buttler, M. Coleman, T. Critchlow, R. Fileto, W. Han, C. Pu, D. Rocco,and L. Xiong. Querying multiple bioinformatics information sources: Cansemantic web research help? SIGMOD Record, 31(4):59–64, 2002.

13. M. Carman, L. Serafini, and P. Traverso. Web service composition as plan-ning. In ICAPS 2003 Workshop on Planning for Web Services, 2003.

14. S. Carrere and J. Gouzy. Remora: a pilot in the ocean of biomoby web-services. Bioinformatics, 22(7), 2006.

15. S. Chirstley, X. Xiang, and G. Madey. An ontology for agent-based modelingand simulation. In Agent 2004 Conference, 2004.

16. M. Clamp, J. Cuff, S. M. Searle, and G. J. Barton. The jalview java alignmenteditor. Bioinformatics, 20(3):426–7, 2004.

17. Collaxa. Collaxa BPEL server. URL http://www.collaxa.com/.

18. D. J. Cook and L. B. Holder. Graph-based data mining. IEEE IntelligentSystems, 15(2):32–41, 2000.

19. J. Day and R. Deters. Selecting the best web service. In Proceedings of the2004 conference of the Centre for Advanced Studies on Collaborative research,pages 293–307, 2004.

20. R. de Knikker, Y. Guo, J. long Li, A. K. Kwan, K. Y. Yip, D. W. Cheung,and K.-H. Cheung. A web services choreography scenario for interoperatingbioinformatics applications. BMC Bioinformatics, 5(25), 2004.

21. D. de Roure, N. R. Jennings, and N. Shadbolt. The semantic grid: Past,present and future. Proc. of the IEEE, 93(3), March 2005.

22. E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, K. Black-burn, A. Lazzarini, A. Arbree, R. Cavanaugh, and S. Koranda. Mappingabstract complex workflows onto grid environments. Journal of Grid Com-puting, 1(1), 2003.

23. E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, S. Patil, M.-H. Su1,K. Vahi, and M. Livny. Grid Computing, volume 3165/2004 of Lecture Notesin Computer Science, chapter Pegasus: Mapping Scientific Workflows ontothe Grid, pages 11–20. Springer Berlin / Heidelberg, 2004.

24. L. A. Digiampietri, C. B. Medeiros, and J. C. Setubal. A framework based onweb service orchestration for bioinformatics workflow management. Geneticsand Molecular Research, 4(3):535–542, 2005.

176

http://www.collaxa.com/

25. A. Doan, J. Madhavan, R. Dhamankar, P. Domingos, and A. Halevy. Learn-ing to match ontologies on the semantic web. The VLDB Journal (Theinternational Journal on Very large Data bases, 12, 2003.

26. A. Dogac, Y. Kabak, G. Laleci, S. Sinir, A. Yildiz, S. Kirbas, and Y. Gurcan.Semantically enriched web services for the travel industry. SIGMOD Record,33(3), 2004.

27. S. D. Dyall, M. T. Brown, and P. J. Johnson. Ancient invasions: Fromendosymbionts to organelles. SCIENCE, 304(9), April 2004.

28. I. Elgedawy, Z. Tari, and M. Winikoff. Exact functional context matchingfor web services. In ICSOC’04, 2004.

29. V. Ermolayev, N. Keberle, O. Kononenko, S. Plaksin, and V. Terziyan. To-wards a framework for agent-based semantic web service composition. Inter-national Journal of Web Service Research, 2004.

30. ETTK. Emerging technologies toolkit. URL http://www.alphaworks.ibm.

com/tech/wssem.

31. N. M. Fast, J. C. Kissinger, D. S. Roos, and P. J. Keeling. Nuclear-encoded,plastid-targeted genes suggest a single common origin for apicomplexan anddinoflagellate plastids. Mol. Biol. Evol., 18(3):418–426, 2001.

32. I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the grid: Enablingscalable virtual organizations. Lecture Notes in Computer Science, 2150,2001.

33. K. Garwood, P. Lord, H. Parkinson, N. Paton, and C. Goble. Pedro ontologyservices: A framework for rapid ontology markup. In Proc of 2nd EuropeanSemantic Web Conference, pages 578–591. Springer Verlag, 2005.

34. Y. Gil, E. Deelman, J. Blythe, C. Kesselman, and H. Tangmunarunkit. Arti-ficial intelligence and grids: Workflow planning and beyond. IEEE IntelligentSystems, special issue on E-Science, Jan/Feb 2004.

35. GO. Gene ontology consortium. URL http://www.geneontology.org/.

36. C. Goble, C. Wroe, R. Stevens, and the myGrid consortium. The mygridproject: services, architecture and demonstrator. In UK e-Science AHM,September 2003.

37. A. Goderis, U. Sattler, P. Lord, and C. Goble. Seven bottlenecks to workflowreuse and repurposing. In Fourth International Semantic Web Conference(ISWC 2005), volume 3792, pages 323–337, Galway, Ireland, 2005.

177

http://www.alphaworks.ibm.com/tech/wssem

http://www.alphaworks.ibm.com/tech/wssem

http://www.geneontology.org/

38. A. Goderis, P. Li, and C. Goble. Workflow discovery: the problem, a casestudy from e-science and a graph-based solution. In IEEE InternationalConference on Web Services (ICWS’06), 2006.

39. GridSphere. Gridsphere portal framework. URL http://www.gridsphere.

org/gridsphere/gridsphere?cid=2.

40. T. Gruber. What is an ontology. http://www-ksl.stanford.edu/kst/what-is-an-ontology.html.

41. A. Gmez-Prez, R. Gonzlez-Cabero, and M. Lama. A framework for designand composition of semantic web services. American Association for Artifi-cial Intelligence, 2004.

42. JLaunch. JLaunch from Duke bioinformatics shared resource. URL http:

//dbsr.duke.edu/.

43. B. Johansson and P. Krus. A web service approach for model integration incomputational design. In ASME 2003 design engineering technical confer-ences and computers and information in engineering conference, 2003.

44. E. Kawas, M. Senger, and M. D. Wilkinson. Biomoby extensions to the tav-erna workflow management and enactment software. BMC Bioinformatics,Nov. 2006.

45. M. Klein and A. Bernstein. Toward high-precision service retrieval. InternetComputing, 8(1):30–36, January/February 2004.

46. U. Kuter, E. Sirin, D. Nau, B. Parsia, and J. Hendler. Information gather-ing during planning for web service composition. In The third internatonalsemantic web conference (ISWC2004), Hiroshima, Japan, 2004.

47. L. Li and I. Horrocks. A software framework for matchmaking based onsemantic web technology. In Proceedings of the 12th international conferenceon World Wide Web, 2003.

48. Y. Liu, A. H. Ngu, and L. Zeng. Qos computation and policing in dynamicweb service selection. In WWW2004, 2004.

49. P. Lord, S. Bechhofer, M. Wilkinson, G. Schiltz, D. Gessler, D. Hull,C. Goble, and L. Stein. Applying semantic web services to bioinformat-ics: Experiences gained, lessons learnt. In Third International SemanticsWeb Conference (ISWC2004), 2004.

178

http://www.gridsphere.org/gridsphere/gridsphere?cid=2

http://www.gridsphere.org/gridsphere/gridsphere?cid=2

http://dbsr.duke.edu/

http://dbsr.duke.edu/

50. P. Lord, P. Alper, C. Wroe, and C. Goble. Feta: A light-weight architecturefor user oriented semantic service discovery. In In Proceedings of SecondEuropean Semantic Web Conference, ESWC 2005, pages 17–31. Springer-Verlag LNCS 3532 2005, May-June 2005.

51. Lucene. Apache lucene. URL http://lucene.apache.org/java/docs/

index.html.

52. B. Ludascher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E. A.Lee, J. Tao, and Y. Zhao. Scientific workflow management and the keplersystem. Concurrency and Computation: Practice and Experience, 18(10):1039 – 1065, Dec 2005.

53. E. M. Maximilien and M. P. Singh. Toward autonomic web services trustand selection. In ICSOC’04, 2004.

54. S. A. McIlraith, T. C. Son, and H. Zeng. Semantic web services. IEEEIntelligent Systems, pages 46–53, March/April 2001.

55. B. Medjahed, A. Bouguettaya, and A. K. Elmagarmid. Composing webservices on the semantic web. The VLDB Journal, 2003.

56. E. Mena, V. Kashyap, A. Sheth, and A. Illarramendi. Observer: An approachfor query processing in global information systems based on interoperationacross pre-existing ontologies. In Intl. Conf. on Cooperative InformationSystems (CoopIS 96), 1996.

57. F. Meyer. Genome sequencing vs. moore’s law: Cyber challenges for the nextdecade. CTWatch Quarterly, 2(3), August 2006.

58. N. Milanovic and M. Malek. Current solutions for web service composition.IEEE Internet Computing, 8(6):51–59, November/December 2004.

59. J. A. Miller and P. A. Fishwick. Investigating ontologies for simulation mod-eling. In The 37th Annual Simulation Symposium, April 2004.

60. M. G. Nanda, S. Chandra, and V. Sarkar. Decentralizing execution of com-posite web services. In OOPSLA’04, 2004.

61. NCBI. Entrez: Making use of its power. Briefings in bioinformatics, 4(2),June 2003. URL http://www.ncbi.nih.gov/.

62. N. F. Noy and M. A. Musen. Prompt: Algorithm and tool for automated on-tology merging and alignment. In The proceedings of the National conferenceon artificial intelligence (AAAI), 2000.

179

http://lucene.apache.org/java/docs/index.html

http://lucene.apache.org/java/docs/index.html

http://www.ncbi.nih.gov/

63. OGSA. Links to open grid service architecture. URL http://www.globus.

org/ogsa/.

64. T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Greenwood, T. Carver,A. Wipat, and P. Li. Taverna, lessons in creating a workflow environmentfor the life sciences. In GGF workflow workshop, 2004.

65. T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood,T. Carver, K. Glover, M. R. Pocock, A. Wipat, and P. Li. Taverna: a tool forthe composition and enactment of bioinformatics workflows. Bioinformatics,20(17), 2004.

66. M. Ouzzani, B. Benatallah, and A. Bouguettaya. Ontological approachfor information discovery in internet databases. Distributed and ParallelDatabases, 8(3), 2000.

67. OWL. W3c OWL web ontology language overview. URL http://www.w3.

org/TR/owl-features/.

68. R. D. M. Page. Treeview: An application to display phylogenetic trees onpersonal computers. Computer Applications in the Biosciences, 12:357–358,1996.

69. J. D. Palmer. The symbiotic birth and spread of plastids: How many timesand whodunit? J. Phycol., 39, 2003.

70. M. P. Papazoglou and D. Georgakopoulos. Service-oriented computing. Com-munications of the ACM, 46(10), 2003.

71. A. Patil, S. Oundhakar, A. Sheth, and K. Verma. Meteor-s web serviceannotation framework. In Proceeding of the World Wide Web Conference,July 2004.

72. S. Pillai, V. Silventoinen, K. Kallio, M. Senger, S. Sobhany, J. Tate, S. Ve-lankar, A. Golovin, K. Henrick, P. Rice, P. Stoehr, and R. Lopez. Soap-basedservices provided by the European Bioinformatics Institute (EBI). NucleicAcids Res, 33(1):W25–W28, 2005. URL http://www.ebi.ac.uk/Tools/

webservices/WSClustalW.html.

73. U. Radetzki and A. B. Cremers. Iris: A framework for mediator-based com-position of service-oriented software. In 2004 IEEE International Conferenceon Web Services (ICWS 2004), July 2004.

74. U. Radetzki, U. Leser, S. Schulze-Rauschenbach, J. Zimmermann, J. Lussem,T. Bode, and A. Cremers. Adapters, shims, and glue–service interoperabilityfor in silico experiments. Bioinformatics, 22(9):1137–1143, 2006.

180

http://www.globus.org/ogsa/

http://www.globus.org/ogsa/

http://www.w3.org/TR/owl-features/

http://www.w3.org/TR/owl-features/

http://www.ebi.ac.uk/Tools/webservices/WSClustalW.html

http://www.ebi.ac.uk/Tools/webservices/WSClustalW.html

75. S. Ran. A model for web services discovery with QoS. ACM SIGecomExchanges, 4(1), 2003.

76. J. Rao, D. Dimitrov, P. Hofmann, and N. Sadeh. A mixed initiative approachto semantic web service discovery and composition: Sap’s guided proceduresframework. In Proceedings of the IEEE International Conference on WebServices (ICWS’06), pages 401 – 410, 2006.

77. J. A. Raven and J. F. Allen. Genomics and chloroplast evolution: what didcyanobacteria do for plants? Genome Biology, 4, 2003.

78. J. Romero-Severson. Use case: How mog web services enable scientific dis-covery. Technical report, University of Notre Dame, August 2006.

79. S. Russell and P. Norvig. Artificial Intelligence A Mordern Approach. Pren-tice Hall, 1995.

80. M. Sabou, C. Wroe, C. Goble, and H. Stuckenschmidt. Learn-ing domain ontologies for semantic web service descriptions.Journal of Web Semantics, 3(4), 2005. Accessible from:http://www.websemanticsjournal.org/ps/pub/2005-28.

81. SAWSDL. Semantic annotations for web services description language work-ing group. URL http://www.w3.org/2002/ws/sawsdl/.

82. C. Schmidt and M. Parashar. A peer-to-peer approach to web service dis-covery. World Wide Web, 7(2), 2004.

83. Y. L. Simmhan, B. Plale, and D. Gannon. A survey of data provenance ine-science. SIGMOD Record, 34(5), Sept 2005.

84. E. Sirin, J. Hendler, and B. Parsia. Semi-automatic composition of webservices using semantic descriptions. In Web Services: Modeling, Architectureand Infrastructure” workshop in conjunction with ICEIS2003, 2003.

85. K. Sivashanmugam, K. Verma, A. Sheth, and J. Miller. Adding semantics toweb services standards. In In Proceedings of the 1st International Conferenceon Web Services (ICWS’03), 2003.

86. K. Sivashanmugam, J. Miller, A. Sheth, and K. Verma. Framework forsemantic web process composition. Special Issue of the International Journalof Electronic Commerce (IJEC), 2004.

87. SoapLab. Soap-based analysis web service developed in the EuropeanBioinformatics Institute (EBI). URL http://www.ebi.ac.uk/soaplab/.

181

http://www.w3.org/2002/ws/sawsdl/

http://www.ebi.ac.uk/soaplab/

88. SpeedR. URL http://lsdis.cs.uga.edu/proj/meteor/mwsdi.html.

89. N. Srinivasan, M. Paolucci, and K. Sycara. Semantic web service discoveryin the owl-s ide. In Proceeding of the 39th Hawaii International Conferenceon System Sciences, 2006.

90. B. Srivastava and J. Koehler. Web service composition current solutions andopen problems. In ICAPS2003, 2003.

91. L. Stein. Creating a bioinformatics nation. Nature, 417(9), 2002.

92. L. D. Stein. Integrating biological databases. Nature Reviews genetics, 4,2003.

93. R. Stevens. Trends in cyberinfrastructure for bioinformatics and com-putational biology. CTWatch Quarterly, 2(3), August 2006. URLAvailableon-lineathttp://www.ctwatch.org/quarterly/.

94. R. Stevens, K. Glover, C. Greenhalgh, C. Jennings, S. Pearce, P. Li,M. Radenkovic, and A. Wipat. Performing in silico experiments on the grid:A users perspective. In Proc UK e-Science programme All Hands Conference,2003.

95. J. W. Stiller and D. C. Reel. A single origin of plastids revisited: Convergentevolution in organellar genome content. J. Phycol, 39, 2003.

96. I. Taylor, M. Shields, I. Wang, and A. Harrison. Visual Grid Work-flow in Triana. Journal of Grid Computing, 3(3-4):153–169, Septem-ber 2005. URL http://www.springerlink.com/openurl.asp?genre=

article&issn=1570-7873&volume=3&issue=3&spage=153.

97. The Globus Project. The globus project. URL http://www.globus.org.

98. W. van der Aalst. Don’t go with the flow: Web services composition stan-dards exposed. IEEE Intelligent Systems, Jan/Feb 2003.

99. Y. Wang and E. Stroulia. Semantic structure matching for assessing web-service similarity. In M. E. Orlowska, S. Weerawarana, M. P. Papazoglou,and J. Yang, editors, Service-Oriented Computing - ICSOC 2003, 2003.

100. R. Weber, C. Schuler, P. Neukomm, H. Schuldt, and H.-J. Schek. Webservice composition with o’grape and osiris. In Proceeding of the 29th VLDBConference, 2003.

101. M. D. Wilkinson and M. Links. Biomoby: An open source biological webservice proposal. Briefings in bioinformatics, 3(4), 2002.

182

http://lsdis.cs.uga.edu/proj/meteor/mwsdi.html

Available on-line at http://www.ctwatch.org/quarterly/

http://www.springerlink.com/openurl.asp?genre=article&issn=1570-7873&volume=3&issue=3&spage=153

http://www.springerlink.com/openurl.asp?genre=article&issn=1570-7873&volume=3&issue=3&spage=153

http://www.globus.org

102. WordNet. Wordnet: A large lexical database of english, developed under thedirection of george a. miller. URL http://wordnet.princeton.edu/.

103. C. Wroe, R. Stevens, C. Goble, A. Roberts, and M. Greenwood. A suite ofdaml+oil onotlogies to describe bioinformatics web services and data. Inter-national Journal of Cooperative Information Systems, 12(4):197–224, June2003.

104. C. Wroe, C. Goble, A. Goderis, P. Lord, S. Miles, J. Papay, P. Alper, andL. Moreau. Recycling workflows and services through discovery and reuse.Concurrency and Computation: Practice and Experience, 2007.

105. WS. Web services architecture. URL http://www.w3.org/TR/ws-arch/

#service_oriented_architecture. W3C Working Group Note 11 Febru-ary 2004.

106. WsBAW. Bioinformatic analysis workflow (WsBAW). URL http://www.

alphaworks.ibm.com/tech/wsbaw.

107. WSIF. Web services invocation framework (wsif), apache software founda-tion. URL http://ws.apache.org/wsif/.

108. X. Xiang and G. Madey. A semantic web services enabled web portal archi-tecture. In International Conference on Web Services (ICWS2004), 2004.

109. X. Xiang and G. Madey. Improving the reuse of scientific workflows andtheir by-products. URL http://www.nd.edu/~mog/Papers/papers.html.Working paper, 2007.

110. X. Xiang, G. Madey, and J. Romero-Severson. A service-oriented data inte-gration and analysis environment for in-silico experiments and bioinformaticsresearch. In Proceedings of the 40th Annual Hawaii International Conferenceon System Sciences (CD-ROM), January 2007.

111. J. Yang. Web service componentization. Communication of the ACM, Oc-tober 2003.

112. X. Yi and K. J. Kochut. Process composition of web services with complexconversation protocols: A colored petri nets based approach. In Design,Analysis and Simulation of Distributed System DASD2004, 2004.

113. U. Zdun, M. Voelter, and M. Kircher. Design and implementation of anasynchronous invocation framework for web services. In The InternationalConference on Web Services - Europe 2003 (ICWS-Europe’03), 2003.

This document was prepared & typeset with pdfLaTeX, and formatted with the nddiss2ε classfile (v1.0 [2004/06/15]) provided by Sameer Vijay.
