
COMPUTING SCIENCE

Measuring and Dealing with the Uncertainty of SOA Solutions

Yuhui Chen, Anatoliy Gorbenko, Vyacheslav Kharchenko and Alexander Romanovsky

TECHNICAL REPORT SERIES

No. CS-TR-1225 November 2010


TECHNICAL REPORT SERIES No. CS-TR-1225 November, 2010

Measuring and Dealing with the Uncertainty of SOA Solutions

Y. Chen, A. Gorbenko, V. Kharchenko and A. Romanovsky

Abstract

The paper investigates the uncertainty of Web Services performance and the instability of their communication medium (the Internet), and shows the influence of these two factors on the overall dependability of SOA. We present our practical experience in benchmarking and measuring the behaviour of a number of existing Web Services used in e-science and bio-informatics, provide the results of statistical data analysis and discuss the probability distribution of delays contributing to the Web Services response time. The ratio between delay standard deviation and its average value is introduced to measure the performance uncertainty of a Web Service. Finally, we present the results of error and fault injection into Web Services. We summarise our experiments with SOA-specific exception handling features provided by two web service development kits and analyse exception propagation and performance as the major factors affecting fault tolerance (in particular, error handling and fault diagnosis) in Web Services.

© 2010 University of Newcastle upon Tyne.
Printed and published by the University of Newcastle upon Tyne, Computing Science, Claremont Tower, Claremont Road, Newcastle upon Tyne, NE1 7RU, England.


Bibliographical details

CHEN, Y., GORBENKO, A., KHARCHENKO, V., ROMANOVSKY, A.
Measuring and Dealing with the Uncertainty of SOA Solutions
[By] Y. Chen, A. Gorbenko, V. Kharchenko, A. Romanovsky
Newcastle upon Tyne: University of Newcastle upon Tyne: Computing Science, 2010.
(University of Newcastle upon Tyne, Computing Science, Technical Report Series, No. CS-TR-1225)

Added entries

UNIVERSITY OF NEWCASTLE UPON TYNE
Computing Science. Technical Report Series. CS-TR-1225

Abstract

The paper investigates the uncertainty of Web Services performance and the instability of their communication medium (the Internet), and shows the influence of these two factors on the overall dependability of SOA. We present our practical experience in benchmarking and measuring the behaviour of a number of existing Web Services used in e-science and bio-informatics, provide the results of statistical data analysis and discuss the probability distribution of delays contributing to the Web Services response time. The ratio between delay standard deviation and its average value is introduced to measure the performance uncertainty of a Web Service. Finally, we present the results of error and fault injection into Web Services. We summarise our experiments with SOA-specific exception handling features provided by two web service development kits and analyse exception propagation and performance as the major factors affecting fault tolerance (in particular, error handling and fault diagnosis) in Web Services.

About the authors

Yuhui Chen completed his PhD study at Newcastle University (UK). He received an MSc in Computing Science in 2003 from Newcastle University. He started his PhD in April 2004 under the supervision of Prof. Alexander Romanovsky. His research focuses on the dependability of Service-Oriented Architecture.

Anatoliy Gorbenko graduated in computer science in 2000 and received the PhD degree from the National Aerospace University, Kharkiv, Ukraine in 2005. He is an Associate Professor at the Department of Computer Systems and Networks of the National Aerospace University in Kharkiv (Ukraine). There he co-coordinates the DESSERT (Dependable Systems, Services and Technologies) research group. His work focuses on ensuring dependability and fault tolerance in service-oriented architectures; on investigating system diversity, dependability assessment and exception handling; and on applying these results in real industrial applications. Dr. Gorbenko is a member of EASST (European Association of Software Science and Technology).

Vyacheslav Kharchenko (M’01) received his PhD in Technical Science at the Military Academy named after Dzerzhinsky (Moscow, Russia) in 1981 and the Doctor of Technical Science degree at the Kharkiv Military University (Ukraine) in 1995. He is a Professor and heads the Computer Systems and Networks Department and the DESSERT research group at the National Aerospace University, Ukraine. He is also a senior research investigator in the field of safety-related software at the State Science-Technical Center of Nuclear and Radiation Safety (Ukraine). He has published nearly 200 scientific papers, reports and book chapters, more than 500 inventions, and is the coauthor or editor of 28 books. He has been the principal investigator and consultant on a succession of research projects in the safety and dependability of NPP I&C and aerospace systems. He was the head of the DESSERT International Conference (http://www.stc-dessert.com) in 2006-2009. His research interests include critical computing, dependable and safety-related I&C systems, multi-version design technologies, software and FPGA-based systems verification and expert analysis.

Alexander (Sascha) Romanovsky is a Professor in the Centre for Software Reliability, Newcastle University. His main research interests are system dependability, fault tolerance, software architectures, exception handling, error recovery, system structuring and verification of fault tolerance. He received an MSc degree in Applied Mathematics from Moscow State University and a PhD degree in Computer Science from St. Petersburg State Technical University. He was with this University from 1984 until 1996, doing research and teaching. In 1991 he worked as a visiting researcher at the ABB Ltd Computer Architecture Lab Research Center, Switzerland. In 1993 he was a visiting fellow at the Istituto di Elaborazione della Informazione, CNR, Pisa, Italy. In 1993-94 he was a post-doctoral fellow with the Department of Computing Science, the University of Newcastle upon Tyne. In 1992-1998 he was involved in the Predictably Dependable Computing Systems (PDCS) ESPRIT Basic Research Action and the Design for Validation (DeVa) ESPRIT Basic Project. In 1998-2000 he worked on the Diversity in Safety Critical Software (DISCS) EPSRC/UK Project. Prof Romanovsky was a co-author of the Diversity with Off-The-Shelf Components (DOTS) EPSRC/UK Project and was involved in this project in 2001-2004. In 2000-2003 he was on the executive board of the Dependable Systems of Systems (DSoS) IST Project. He was the Coordinator of the Rigorous Open Development Environment for Complex Systems (RODIN) IST Project (2004-2007). He is now the Coordinator of the major FP7 DEPLOY Integrated Project (2008-2012) on Industrial Deployment of System Engineering Methods Providing High Dependability and Productivity.

Suggested keywords

WEB SERVICES, SOA, BENCHMARKING, RESPONSE TIME


Measuring and Dealing with the Uncertainty of SOA Solutions

Yuhui Chen 1, Anatoliy Gorbenko 2, Vyacheslav Kharchenko 2, Alexander Romanovsky 3

1 Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
2 Department of Computer Systems and Networks, National Aerospace University, Kharkiv, Ukraine
3 School of Computing Science, Newcastle University, Newcastle upon Tyne, UK

ABSTRACT

The chapter investigates the uncertainty of Web Services performance and the instability of their communication medium (the Internet), and shows the influence of these two factors on the overall dependability of SOA. We present our practical experience in benchmarking and measuring the behaviour of a number of existing Web Services used in e-science and bio-informatics, provide the results of statistical data analysis and discuss the probability distribution of delays contributing to the Web Services response time. The ratio between delay standard deviation and its average value is introduced to measure the performance uncertainty of a Web Service. Finally, we present the results of error and fault injection into Web Services. We summarise our experiments with SOA-specific exception handling features provided by two web service development kits and analyse exception propagation and performance as the major factors affecting fault tolerance (in particular, error handling and fault diagnosis) in Web Services.

INTRODUCTION

The paradigm of Service-Oriented Architecture (SOA) is a further step in the evolution of the well-known component-based system development with Off-the-Shelf components. SOA and Web Services (WSs) were introduced to ensure effective interaction of complex distributed applications. They are now evolving within critical infrastructures (e.g. air traffic control systems), holding various business systems and services together (for example, banking, e-health, etc.). Their ability to compose and implement business workflows provides crucial support for developing globally distributed large-scale computing systems, which are becoming integral to society and the economy.

Unlike common software applications, however, Web Services work in an unstable environment as part of globally-distributed and loosely-coupled SOAs, communicating with a number of other services deployed by third parties (e.g. in different administration domains), typically with unknown dependability characteristics. When complex service-oriented systems are dynamically built or when their components are dynamically replaced by new ones with the same (or similar) functionality but unknown dependability and performance characteristics, ensuring and assessing their dependability becomes genuinely complicated. It is this fact that is the main motivation for this work.

By their very nature Web Services are black boxes, as neither their source code, nor their complete specification, nor information about their deployment environments are available; the only known information about them is their interfaces. Moreover, their dependability is not completely known and they may not provide sufficient Quality of Service (QoS); it is often safer to treat them as “dirty” boxes, assuming that they always have bugs, do not fit well enough, and have poor specification and documentation. Web Services are heterogeneous, as they might be developed following different standards, fault assumptions and different conventions, and may use different technologies. Finally, Service-Oriented Systems are built as overlay networks over the Internet and their construction and composition are complicated by the fact that the Internet is a poor communication medium (e.g. it has low quality and is not predictable).


Therefore, users cannot be confident of their availability, trustworthiness, reasonable response time and other dependability characteristics (Avizienis, Laprie, Randell, & Landwehr, 2004), as these can vary over wide ranges in a very random and unpredictable manner. In this work we use the general synthetic term uncertainty to refer to the unknown, unstable, unpredictable, changeable characteristics and behaviour of Web Services and SOA, exacerbated by running these services over the Internet. Dealing with such uncertainty, which is in the very nature of SOA, is one of the main challenges that researchers are facing.

To become ubiquitous, Service-Oriented Systems should be capable of tolerating faults and potentially-harmful events caused by a variety of reasons, including low or varying (decreasing) quality of components (services), shifting characteristics of the network media, component mismatches, permanent or temporary faults of individual services, composition mistakes, service disconnection, and changes in the environment and in the policies.

The dependability and QoS of SOA have recently been the aim of significant research effort. A number of studies (Zheng, & Lyu, 2009; Maamar, Sheng, & Benslimane, 2008; Fang, Liang, F. Lin, & C.-C. Lin, 2007) have introduced several approaches to incorporating resilience techniques (including voting, backward and forward error recovery mechanisms and replication techniques) into WS architectures. There has been work on benchmarking and experimental measurement of dependability (Laranjeiro, Vieira, & Madeira, 2007; Duraes, Vieira, & Madeira, 2004; Looker, Munro, & Xu, 2004) as well as on dependability and performance evaluation (Zheng, Zhang, & Lyu, 2010). But even though the existing proposals offer useful means for improving SOA dependability by enhancing particular WS technologies, most of them do not address the uncertainty challenge which exacerbates the lack of dependability and varying quality.

The uncertainty of Web Services has two main consequences. First, it makes it difficult to assess the dependability and performance of services, and hence to choose between them and gain confidence in their dependability. Secondly, it becomes difficult to apply fault tolerance mechanisms efficiently, as too much of the data which is necessary to make choices is missing.

The purpose of the chapter is to investigate the dependability and uncertainty of SOA and the instability of the communication medium through large-scale benchmarking of a number of existing Web Services. Benchmarking is an essential and very popular approach to web service dependability measurement. Apart from the papers (Laranjeiro et al., 2007; Duraes et al., 2004) we should mention such recent and ongoing European research projects as AMBER (http://www.amber-project.eu/) and WS-Diamond (http://wsdiamond.di.unito.it/). Mostly relying on stress-testing and failure injection techniques, these works analyse service robustness and behaviour in the presence of failures or under stressed load, and compare the effectiveness of the technologies used to implement web services. Hardly any of the studies, however, address the web service instability issue or offer a strong mathematical foundation or proofs - mostly because, we believe, there is no general theory to capture the uncertainties inherent to SOA.

In this chapter we present our practical experience in benchmarking and measuring a number of existing WSs used in e-science and bio-informatics (Blast and Fasta, providing an API for bioinformatics and genetic engineering, and available at http://xml.nig.ac.jp, and BASIS, the Biology of Ageing E-Science Integration and Simulation System, available at http://www.basis.ncl.ac.uk/WebServices.html). This chapter summarises our recent work in the area (for more information, the readers are referred to Gorbenko, Mikhaylichenko, Kharchenko, and Romanovsky (2007), Gorbenko, Kharchenko, Tarasyuk, Chen, and Romanovsky (2008), and Chen et al. (2009)).

In the first section we describe the experimental techniques used, investigate the performance instability of the Blast, Fasta and BASIS WSs and analyse the delays induced by the communication medium. We also show the results of statistical data analysis (i.e. minimal, maximal and average values of the delays and their standard deviations) and present probability distribution series.

The second section analyses the instability involved in the delays that make up the web service response time. In this section we report the latest results of advanced BASIS web service measurements, capable of distinguishing between the network round trip time (RTT) and the request processing time (RPT) on the service side. The section also provides the results of checking hypotheses about the distribution law of the web service response time and its component values RPT and RTT.

The uncertainty discovered in web service operations affects the dependability of SOA and requires additional specific resilience techniques. Exception handling is one of the means widely used for attaining dependability and supporting recovery in SOA applications. The third section presents the results of error and fault injection into web services. We summarise our experiments with SOA-specific exception handling features provided by two tool kits for developing web services: the Sun Microsystems JAX-RPC and the IBM WebSphere Software Developer Kit. In this section we examine the ability of built-in exception handling mechanisms to eliminate certain causes of errors and analyse exception propagation and performance as the major factors affecting fault tolerance (in particular, error handling and fault diagnosis) in web services.

1. MEASURING DEPENDABILITY AND PERFORMANCE UNCERTAINTY OF SYSTEM BIOLOGY APPLICATIONS

1.1. Measuring Uncertainty of Blast and Fasta Web Services

In our experiments we dealt with the DNA Data Bank of Japan (DDBJ), which provides an API for bioinformatics and genetic engineering (Miyazaki, & Sugawara, 2000). We benchmarked the Fasta and Blast web services provided by DDBJ, which implement algorithms commonly used in in silico experiments in bioinformatics to search for gene and protein sequences that are similar to a given input query sequence.

1.1.1. Experimental Technique

A Java client was developed to invoke the Fasta and Blast WSs at DDBJ over five days, from 04 June 2008 to 08 June 2008. In particular, we invoked the getSupportDatabaseList operation supported by both the Fasta and Blast WSs. The size of the SOAP request for the Fasta and Blast WSs is 616 bytes, whereas the SOAP responses are 2128 and 2171 bytes respectively.

The services were invoked simultaneously, using separate threads, every 10 minutes (in total, more than 650 times during the five days). At the same time, the DDBJ server was pinged to assess the network RTT (round trip time) and to take into account the Internet effects on the web service invocation delay. The total number of ICMP Echo requests sent to the DDBJ server was more than 200,000 (one every two seconds).
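The report does not include the client's source code; the following minimal Java sketch only illustrates the measurement loop described above. The class and method names (WsBenchmark, timedInvocation) and the empty placeholders standing in for the generated SOAP stub calls are assumptions for illustration, not the actual DDBJ client, and the reachability probe is only an approximation of a raw ICMP ping.

    import java.net.InetAddress;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Sketch of the measurement loop: invoke both WSs every 10 minutes in separate
    // threads and record the client-side response delay; probe the server every 2 seconds.
    public class WsBenchmark {

        // Times a single invocation; a real client would call getSupportDatabaseList
        // through a stub generated from the Fasta or Blast WSDL.
        static long timedInvocation(Runnable wsCall) {
            long start = System.currentTimeMillis();
            wsCall.run();                               // SOAP request/response
            return System.currentTimeMillis() - start;  // response time, ms
        }

        public static void main(String[] args) {
            ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(3);

            // Invoke Fasta and Blast simultaneously every 10 minutes and log the delays.
            scheduler.scheduleAtFixedRate(() -> {
                long fastaRt = timedInvocation(() -> { /* generated Fasta stub call goes here */ });
                long blastRt = timedInvocation(() -> { /* generated Blast stub call goes here */ });
                System.out.println(System.currentTimeMillis() + "," + fastaRt + "," + blastRt);
            }, 0, 10, TimeUnit.MINUTES);

            // Reachability probe of the DDBJ server every 2 seconds
            // (isReachable may fall back to a TCP echo if raw ICMP is not permitted).
            scheduler.scheduleAtFixedRate(() -> {
                try {
                    long start = System.currentTimeMillis();
                    boolean reachable = InetAddress.getByName("xml.nig.ac.jp").isReachable(5000);
                    System.out.println("ping," + reachable + "," + (System.currentTimeMillis() - start));
                } catch (Exception e) {
                    System.out.println("ping,error," + e.getMessage());
                }
            }, 0, 2, TimeUnit.SECONDS);
        }
    }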

1.1.2. Performance Trends Analysis

Figure 1 shows the response delays of the Fasta (a) and Blast (b) WSs. In spite of the fact that we invoked similar operations of these two services, with similar sizes of SOAP responses, simultaneously, they had different response times. Moreover, the response time of Blast was more unstable (see the probability distribution series of Fasta (a) and Blast (b) in Figure 2).

This difference can be explained by internal reasons (such as a varying level of CPU utilization and memory usage while processing the request, some differences in implementations, etc.). Besides, we noted a period of time, Time_slot_2 (starting on June 05 at 23:23:48 and lasting for 3 hours and 8 minutes), during which the average response time increased significantly for both Fasta and Blast (see Figure 1).

Table 1 presents the results of statistical data analysis of the response times for the Fasta and Blast WSs for the stable network route period, Time_slot_1, and for the period when the network route changed, Time_slot_2.


Figure 1. Response delay trends: (a) Fasta web service; (b) Blast web service

Figure 2. Response time probability distribution series: (a) Fasta web service; (b) Blast web service

The standard deviation of the response time for Fasta is about 16% of its average value, whereas for the Blast web service it equals 27% and 45% for Time_slot_1 and Time_slot_2 respectively. We believe this shows that a significant time uncertainty exists in Service-Oriented Systems.
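As a minimal illustration of this uncertainty measure, the following Java sketch computes the ratio of the sample standard deviation to the sample average (the coefficient of variation used throughout the chapter). The input values in main are made up for the example and are not the measured data from Table 1.

    // Computes the uncertainty measure used in the chapter: std. dev. / average, in %.
    public class UncertaintyMeasure {

        static double coefficientOfVariation(double[] delaysMs) {
            double sum = 0;
            for (double d : delaysMs) sum += d;
            double avg = sum / delaysMs.length;

            double sq = 0;
            for (double d : delaysMs) sq += (d - avg) * (d - avg);
            double stdDev = Math.sqrt(sq / (delaysMs.length - 1)); // sample standard deviation

            return 100.0 * stdDev / avg; // coefficient of variation, %
        }

        public static void main(String[] args) {
            // Illustrative values only.
            double[] responseTimes = {996, 1010, 1120, 980, 1953, 1005};
            System.out.printf("cv = %.1f%%%n", coefficientOfVariation(responseTimes));
        }
    }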

Further investigation of the ping delays confirmed that this was a period during which the network route between the client computer at Newcastle University (UK) and the DDBJ server in Japan changed. Moreover, during the third time slot we observed 6 packets lost in 20 minutes. Together with the high RTT deviation, this indicates that significant network congestion occurred.

Table 1. Response time statistics summary

                     Invocation response time (RT), ms
  Time slot          min.     max.     avg.       std. dev.
  Fasta WS
    Time slot 1      937      1953     996.91     163.28
    Time slot 2      937      4703     1087.28    171.12
  Blast WS
    Time slot 1      1000     1750     1071.17    291.57
    Time slot 2      1015     3453     1265.72    572.70


1.1.3. PINGing Delay Analysis

Through monitoring the network using ICMP Echo requests, we discovered that the overall testing interval can be split into three time slots, each with its own particular characteristics of the communication medium, as shown in Figure 3 and Table 2.

Figure 3. PINGing time slots

Table 2. PINGing statistics summary

                     PING round trip time (RTT), ms
  Time slot          min.     max.     avg.       std. dev.
  PINGing from Newcastle University LAN (UK)
    Time slot 1      309      422      309.21     1.40
    Time slot 2      332      699      332.72     3.48
    Time slot 3      309      735      312.94     12.73
  PINGing from KhAI University LAN (Kharkiv, Ukraine)
    -                341      994      396.27     62.14

Time_slot_1 is a long period of time characterized by a highly stable network condition (see Figure 4-a), with an average Round Trip Time (RTT) of 309.21 ms. This was observed over most of the testing period. According to the TTL parameter returned in the ICMP Echo reply from the DDBJ server, the network route contained 17 intermediate hosts (routers) between the Newcastle University Campus LAN and the DDBJ server.

Time_slot_2 began on June 05 at 23:23:48, ending on June 06 at 02:31:30. This was a sufficiently stable period with an average Round Trip Time (RTT) of 332.72 ms (see Figure 4-b). The ratio of the standard deviation of the delay to the average value (referred to as the coefficient of variation), used in the chapter as a measure of uncertainty, was about 1% for this period. The higher average RTT is accounted for by the fact that during this time slot the network route changed: the number of intermediate hosts (routers) grew from 17 to 20. This also affected the average response time of the Fasta and Blast WSs.

Time_slot_3 is a short period (of about 20 minutes) characterized by high RTT instability (a higher value of standard deviation than in time slots 1 and 2) (see Figure 4-c and Table 2 for more details). It was too short, however, to analyse its impact on the Fasta and Blast response times.

Packet losses occurred during all of the time slots, on average once every two hours (the total number of losses was 44, 8 of which were double losses). Sometimes the RTT increased significantly over a short period. This indicates that transient network congestion occurred periodically throughout the testing period.


Figure 4. PING probability distribution series of network round trip time: (a) Time_slot_1; (b) Time_slot_2; (c) Time_slot_3; (d) pinging from KhAI University LAN

At the same time, we were surprised by the high stability of the network connection during long periods. We had expected a greater instability of the round trip time due to the use of the Internet as a medium and the long distance between the client and the Web Services. To understand this, the DDBJ server was pinged from the KhAI University LAN (Kharkiv, Ukraine) for another two days. As a result, we observed a significant instability of the RTT (see Figure 4-d). The standard deviation of the RTT is about 16% of its average value. Besides, packet losses occurred, on average, once in every 100 ICMP Echo requests.

We used the tracert command to discover the reason for such instability and found that the route from Ukraine to Japan includes 26 intermediate hosts and goes through the UK (the host's name is ae-2.ebr2.London1.Level3.net, IP address "4.69.132.133"), but the main instability takes place at the side of the local Internet Service Provider (ISP). The RTT standard deviation for the first five intermediate hosts (all located in Ukraine) was extremely high (about 100% of its average value).

As a consequence, the standard deviation of the response time for the requests sent to the Fasta and Blast WSs from the KhAI University LAN increased dramatically compared to the ones sent from Newcastle University. This came as a result of the superposition of high network instability and the observed performance uncertainty inherent to the Fasta and especially the Blast WSs.

1.2. Measuring Uncertainty of BASIS System Biology Web Service

In this section, we present a set of new experiments conducted with an instance of the System Biology Web Service (BASIS WS) to continue our research on measuring the performance and dependability of Web Services used in e-science experiments. In the study reported in the previous section we found evident performance instability existing in SOA that affects the dependability of web services and their clients.

The Fasta and Blast WSs we experimented with were part of the DNA Data Bank of Japan (Miyazaki, & Sugawara, 2000), which was out of our control. Thus, we were unable to capture the exact causes of performance instability. The main difference between that work and our experiments with the BASIS web service, hosted by the Institute for Ageing and Health (Newcastle University), is the fact that this WS is under our local administration. Thus we are able to look inside its internal architecture and to perform error and time logging for every external request. Moreover, we have used several clients from which the BASIS WS was benchmarked to give us a more objective view and to allow us to see whether the instability affects all clients in the same way or not.

The aims of the work are as follows: (i) to conduct a series of experiments similar to those reported in the previous section, but with access to inside information, to obtain a better understanding of the sources of exceptions and performance instability; (ii) to conduct a wider range of experiments by using several clients from different locations over the Internet; (iii) to gain an inside understanding of the bottlenecks of an existing system biology application to help improve it in the future.

1.2.1. BASIS System Biology Applications

Our experiments were conducted in collaboration with a Systems Biology project called BASIS (Biology of Ageing E-Science Integration and Simulation System) (Kirkwood et al., 2003). The BASIS application is a typical, representative example of a number of SOA solutions found in e-science and grid computing. BASIS, based at the Institute for Ageing and Health at Newcastle University, is one of the twenty pilot projects funded under the UK e-science initiative for the development of UK grid applications; it aims at developing web-based services that help the biology-of-ageing research community to study the biology of ageing quantitatively by integrating data and hypotheses from diverse biological sources. With the support and expertise of the UK National e-Science Centre in building Grid applications, the project has successfully built a system that integrates various components, such as model design, simulators and databases, and exposes their functionalities as Web Services (Institute for Ageing and Health, 2009). The architecture of the BASIS Web Service (basis1.ncl.ac.uk) is shown in Figure 5.

Figure 5. The architecture of BASIS system

The system is composed of a BASIS Server (2x2.4GHz Xeon CPU, 2GB DDR RAM, 73GB 10,000 rpm U160 SCSI RAID), including a database (PostgreSQL v8.1.3) and the Condor v6.8.0 Grid Computing Engine; a sixteen-computer cluster; an internal 1Gbit network; and a web service interface deployed on the Sun Glassfish v2 Application Server with the JAX-WS + JAXB web service development pack.

BASIS offers four main services to the community:

– BASIS Users Service allows users to manage their accounts;

– BASIS Simulation Service allows users to run simulations from ageing research;

– BASIS SBML Service allows users to create, use and modify SBML models. The Systems Biology Markup Language (SBML) is a machine-readable language, based on XML, for representing models of biochemical reaction networks. SBML can represent metabolic networks, cell-signalling pathways, regulatory networks, and other kinds of systems studied in systems biology;

– BASIS Model Service allows users to manage their models.

The most common BASIS usage scenario is: (i) to upload an SBML simulation model onto the BASIS server; (ii) to run the uploaded SBML model with the biological statistics from the BASIS database; (iii) to download the simulation results. The size of the SBML models and simulation results uploaded to and downloaded from the BASIS server can vary over a wide range and can be very large (up to tens and even hundreds of megabytes). This can be a real problem for remote clients, especially for those with low-speed or low-quality Internet connections.
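The following Java sketch is an illustration only of the three-step scenario above; the port interface and its operation names (uploadModel, runSimulation, downloadResults) are hypothetical placeholders and do not correspond to the actual BASIS WSDL, which is not listed in this report. The port type is assumed to be generated from the service WSDL with a JAX-WS tool such as wsimport.

    // Illustrative-only sketch of the common BASIS usage scenario.
    public class BasisScenario {

        interface BasisSbmlPort {                  // assumed generated port interface
            String uploadModel(byte[] sbmlModel);  // (i) upload an SBML model, get its id
            String runSimulation(String modelId);  // (ii) run it against the BASIS database
            byte[] downloadResults(String jobId);  // (iii) download the simulation results
        }

        static void runScenario(BasisSbmlPort port, byte[] sbmlModel) {
            String modelId = port.uploadModel(sbmlModel);
            String jobId = port.runSimulation(modelId);
            byte[] results = port.downloadResults(jobId);
            System.out.println("Downloaded " + results.length + " bytes of results");
        }
    }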

1.2.2. Experimental Technique

To provide a comprehensive assessment we used five clients deployed in different places over the Internet: Frankfurt (Germany), Moscow (Russia), Los Angeles (USA) and two clients in Simferopol (Ukraine) that use different Internet service providers. Figure 6, created by tracing the routes between the clients and the BASIS WS, shows the different numbers of intermediate routers between the BASIS WS and each of the clients. Note that there are parts of the routes common to different clients.

Our plan was to perform prolonged WS testing to capture the long-term performance trend and to disclose performance instabilities and possible failures. The GetSMBL method, returning an SBML simulation result of 100 KB, was invoked simultaneously from all clients every 10 minutes for five days, starting from Dec 23, 2008 (more than 600 times in total). At the same time the BASIS SBML Web Service was pinged to assess the network round trip time (RTT) and to take into account the Internet effects on the WS invocation delay. The total number of ICMP Echo requests sent to the BASIS Server was more than 10,000. In addition to that, we traced the network routes between the clients and the web service to find the exact points of network instability.

The experiment was run over the Christmas week for the following reasons. The University's internal network activity was minimal during this week. At the same time, the overall Internet activity typically grows during this period, as social networks (e.g. Facebook, YouTube, etc.) and forums experience a sharp growth during the holidays (Goad, 2008).

A Java-based application called the Web Services Dependability Assessment Tool (WSsDAT), which is aimed at evaluating the dependability of Web Services (Li, Chen, & Romanovsky, 2006), was used to test the BASIS SBML web service from remote hosts. The tool supports various methods of dependability testing by acting as a client invoking the Web Services under investigation. It enables users to monitor Web Services by collecting the following characteristics: (i) availability; (ii) performance; (iii) faults and exceptions. During our experimentation we faced several organizational and technical problems. Thus, the test from Los Angeles started 16 hours late. The Moscow client was suddenly terminated after the first thirty requests and restarted only five days later, when the first step of the experiment had already finished.

1.2.3. Performance Trends Analysis

Figure 7 shows the response time statistics for the different clients. A summary of the statistical data describing the response time and the client instability ranks is also presented in Table 3. The average request processing time by the BASIS WS was about 163 ms. Thus, the network delay makes the major contribution to the response time. To analyse the performance instability for each particular client we estimated the standard deviation (std. dev) of the response time as a percentage of its average (avg) value (i.e. the coefficient of variation, cv).


Figure 6. Internet routes used by different clients

The fastest response time (on average) was observed for the client from Frankfurt, whereas the Los Angeles client was the slowest one. This situation was easy to predict. However, we also found that the fastest client was not the most stable. Quite the contrary, the most stable response time was observed by the client from Los Angeles, and the most unstable response time by Simferopol_1. From time to time all clients (except for Los Angeles) faced very high delays. Some of these were ten times larger than the average response time and twenty times larger than the minimal values. The clients located in Moscow and Simferopol_1 faced high instability of the response time due to high network instability (as found from the ping statistics analysis). A deeper analysis of the trace_route statistics helped us to discover the remarkable fact that the network instability (instability of network delay) occurred on the part of the network route that was closer to the particular client than to the web service. Access to inside information (the server log) and additional network statistics (such as the ping and trace_route logs) allowed us to get a better understanding of the sources of performance instability and exceptions. For example, in Figure 8-a, showing the response time of the Frankfurt client, we can see five time intervals characterized by high response time instability. These were caused by different reasons (see Table 4). During the first and the fourth time intervals all clients were affected by the BASIS service overload due to a high number of external requests and the database backup. The second time interval was the result of BASIS Service maintenance: the BASIS server was restarted several times. As a result all clients caught exceptions periodically and suffered from response time instability.

Table 3. BASIS WS response time statistics summary and client instability ranks

                        Response Time                                        Instability   Number of
  Client location   min, ms   max, ms   avg, ms    std.dev, ms   cv, %       Rank          intermediate routers
  Frankfurt         317       6090      383.17     71.91         18.77       IV            11
  Moscow            804       65134     1228.38    437.69        35.63       III           13
  Simferopol_1      683       125531    1186.74    895.18        75.43       I             22
  Simferopol_2      716       11150     1272.12    634.53        49.88       II            19
  Los Angeles       1087      3663      1316.54    129.79        9.86        V             22


Figure 7. BASIS WS response time trends from different user-side perspectives

The response time instability during the third time interval was caused by extremely high network instability that occurred between the second and the third intermediate routers (counting from the Frankfurt client towards the BASIS service). It was an interval during which the network RTT suddenly increased threefold on average (from 28.3 ms up to 86.7 ms) and showed a large deviation (32.2 ms). The last unstable interval was observed by the Frankfurt client on December 27 (from 02 a.m. to 07 a.m.). In fact, the Frankfurt host is an integration server that is involved in software development. At the end of the week it performs automatic procedures of program code merging and unit testing. As a result, the host was overloaded by local tasks and our testing client even caught several operating system exceptions “java.io.IOException: error=24, Too many open files”.

Table 4. Response time instability intervals experienced by the Frankfurt client

  Time interval   Date/Time                  Instability cause
  1               from: Dec 23/12:23:59      BASIS Service overload
                  to:   Dec 23/13:23:59
  2               from: Dec 23/23:03:59      BASIS Service failure and maintenance actions
                  to:   Dec 24/01:44:00
  3               from: Dec 24/11:34:00      Network delay instability due to network congestion
                  to:   Dec 24/17:44:01
  4               from: Dec 25/14:24:15      BASIS Database backup
                  to:   Dec 26/00:14:15
  5               from: Dec 27/02:14:23      Local host overload
                  to:   Dec 27/07:14:23


Figure 8. WS Response time for German client: (a) response time trend; (b) response time probability distribution series

1.2.4. Response Time Probability Density Analysis

Probability distribution series of the response time obtained from the statistics of the different clients are shown in Figures 8-b and 9. As can be seen, all the probability distribution series of the service response time, taken in our experiments from different clients' perspectives, tend to be described by the Poisson law, whereas the network RTT and the request processing time (RPT) by the BASIS WS match the Exponential distribution well.

However, unlike the Poisson and Exponential distributions, all the probability distribution series obtained in our experiments have heavy tails caused by the 'real world' instability, when delays increase suddenly and significantly for different reasons that are hard to predict.

This finding is in line with Reinecke, van Moorsel, and Wolter (2006) and other existing experimental works. Thus, more realistic assumptions and more sophisticated distribution laws are needed to fit the practical data better. It may be the case that the Exponential distribution of RTT and RPT can be replaced with one of the heavy-tailed distributions such as log-normal, Weibull or Beta. At the same time, the service RT for different clients could be described in a more complex way as a composition of two distributions: RTT (which is unique to each particular client) and RPT (which is specific to the service used and, hence, is the same for all clients with identical priority).
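One minimal way to make this composition explicit, under the assumption (ours, not stated in the measurements) that RPT and RTT are independent non-negative random variables, is to write the response time density as the convolution of the two component densities:

\[ f_{RT}(t) \;=\; \bigl(f_{RPT} * f_{RTT}\bigr)(t) \;=\; \int_{0}^{t} f_{RPT}(\tau)\, f_{RTT}(t-\tau)\, d\tau \]

so that each client would combine its own RTT density with the RPT density shared by all clients.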

1.2.5. Errors and Exceptions Analysis

During our experiments, several clients caught different errors and exceptions with different error rates. Most of them (1-3) were caused by BASIS Service maintenance, when the BASIS WS, server and database were restarted several times (see Table 5).

The first one ('Null SOAP body') resulted in a null-sized response from the web service. It is a true failure that may potentially cause a dangerous situation, as it was not reported as an exception! According to the server-side log, the failures were caused by errors that occurred when the BASIS business logic processing component was trying to connect to the database. As the database was shut down due to an exception, this component failed to handle the connection exception and returned empty results to the client. Apparently, the BASIS WS should be improved to provide better mechanisms for error diagnosis and exception handling.
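As an illustration of the kind of improvement meant here (a sketch under our own assumptions, not the actual BASIS code), a JAX-WS endpoint can map a failed database connection to a SOAP fault instead of silently returning an empty body:

    import javax.jws.WebMethod;
    import javax.jws.WebService;

    // Illustrative-only endpoint: a database failure is reported to the client
    // as a SOAP fault rather than as a null/empty SOAP body.
    @WebService
    public class SbmlEndpoint {

        @WebMethod
        public String getSbml(String modelId) {
            try {
                return loadFromDatabase(modelId);
            } catch (Exception dbFailure) {
                // In JAX-WS an unchecked exception thrown from the endpoint method
                // is propagated to the client as a SOAP fault.
                throw new RuntimeException("database unavailable: " + dbFailure.getMessage(), dbFailure);
            }
        }

        private String loadFromDatabase(String modelId) throws Exception {
            // Hypothetical placeholder for the business-logic/database call.
            throw new Exception("connection refused");
        }
    }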


Figure 9. Probabilities Distribution Series of WS Response Time from different user-side perspectives: (a) Moscow, Russia; (b) Los Angeles, USA; (c) Simferopol_1, Ukraine; (d) Simferopol_2, Ukraine

The second exception was caused by the BASIS WS shutdown, whereas the third one was likely a result of the BASIS server shutdown while the BASIS WS was operating. However, we cannot be sure, because a 'Null pointer exception' gives too little information for troubleshooting. The reason for the fourth and fifth exceptions was network problems. It is noteworthy that the 'UnknownHostException' caused by a silent DNS server takes about 2 minutes (too long!) to be reported to the client.
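A minimal client-side guard against such long error-reporting delays, sketched below as an assumption about what a client could do rather than what WSsDAT does, is to run the whole invocation under an overall deadline, so the caller regains control whether the delay comes from DNS resolution, connection establishment or the SOAP exchange itself:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    // Runs a (blocking) web service call with an overall deadline.
    public class BoundedInvocation {

        static <T> T invokeWithDeadline(Callable<T> wsCall, long timeoutSeconds) throws Exception {
            ExecutorService executor = Executors.newSingleThreadExecutor();
            try {
                Future<T> future = executor.submit(wsCall);
                return future.get(timeoutSeconds, TimeUnit.SECONDS); // throws TimeoutException on expiry
            } finally {
                executor.shutdownNow(); // abandon the blocked call if it timed out
            }
        }
    }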

Table 5. BASIS WS errors and exceptions statistics

                                                           Number of exceptions per client
    Error/Exception                                        Germany   Simferopol_1   Simferopol_2
  1 Error: Null SOAP body                                  4         4              6
  2 Exception: HTTP transport error:
    java.net.ConnectException: Connection refused          2         0              0
  3 Exception: java.lang.NullPointerException              3         4              3
  4 Exception: HTTP transport error:
    java.net.NoRouteToHostException: No route to host      0         1              2
  5 Exception: HTTP transport error:
    java.net.UnknownHostException: basis1.ncl.ac.uk        0         0              1
    Error rate                                             0.015     0.015          0.02


1.3. Discussion

The purpose of the first section was to examine the uncertainty inherent to Service-Oriented Systems by benchmarking three bioinformatics web services typically used to perform in silico experiments in systems biology studies. The main finding is that the uncertainty comes from three sources: the Internet, the web service and the client itself. Network instability, as well as the internal instability of web service throughput, significantly affects the service response time. Occasional transient and long-term Internet congestion and network route changes that are difficult to predict affect the stability of Service-Oriented Systems. Because of network congestion causing packet losses and multiple retransmissions, the response time can sharply increase by an order of magnitude.

Because of the Internet, different clients have their own view of Web Service performance and dependability. Each client has its own unique network route to the web service. However, some parts of the route can be common to several clients or even to all of them (see Figure 6). Thus, the number of clients simultaneously suffering from Internet instability depends on the point where network congestion or failures happen. More objective data might be obtained by aggregating the clients' experience, for example, in the way proposed in (Zheng et al., 2010), and/or by having internal access to the Web Service operational statistics.

We can also conclude from our practical experience that the instability of the response time depends on the quality of the network connection used rather than on the length of the network route or the number of intermediate routers. The QoS of a particular WS cannot be ensured without guaranteeing network QoS, especially when the Internet is used as the communication medium for a global service-oriented architecture.

During the WS invocations different clients caught different numbers of errors and exceptions, but not all of them were caused by service unreliability. In fact, some clients were successfully serviced whereas others, at the same time, faced various problems due to timing errors or network failures. These errors might occur in different system components depending on the relative position in the Internet of a particular user and a web service and, also, on the instability points appearing during the execution. As a result, web services might be compromised by client-side or network failures which, actually, are not related to web service dependability. Most of the time, the clients are not very interested in their exact cause. Thus, from different client-side perspectives the same web service usually has different availability and reliability characteristics. A possible approach to solving the problem of predicting the reliability of SOA-based applications, given their uncertainty, through user collaboration is proposed in (Zheng, & Lyu, 2010).

Finally, the Exponential distribution that is typically used for network simulation and response time analysis does not represent well such unstable environments as the Internet and SOA. We believe that the SOA community needs a new exploratory theory and more realistic assumptions to predict and simulate the performance and dependability of Service-Oriented Systems by distinguishing the different delays contributing to the WS response time.

2. INSTABILITY MEASUREMENT OF DELAYS CONTRIBUTING TO WEB SERVICE RESPONSE TIME

This section reports a continuation of our previous work with the BASIS System Biology Applications aiming to measure the performance and dependability of e-science WSs from the end user's perspective. In the previous investigation we found evident performance instability existing in these SOAs and affecting the dependability of both the WSs and their clients. However, we were unable to capture the exact causes and shapes of the performance instability. In this section we focus on distinguishing between the different delays contributing to the overall Web Service response time. Besides, new theoretical results are presented at the end of this section, where we rigorously investigate the real distribution laws describing response time instability in SOA.


2.1. Experimental Technique

Basically, we used the same experimental technique described in the previous section. The BASIS WS, returning an SBML simulation result, was invoked by the client software placed in five different locations (Frankfurt, Moscow, Los Angeles and two in Simferopol) every 10 minutes for eighteen days, starting from April 11, 2009 (more than 2500 times in total). Simultaneously, we traced the network route (by sending ICMP Echo requests) between the client and the BASIS SBML web service to understand how the Internet latency affects the WS invocation delay and to find, where possible, the exact points of network instability.

At the same time there are significant differences between the measurement technique presented in the previous section and the work reported in this section. The main one is that in our new experiments we measure four time-stamps T1, T2, T3 and T4 (see Figure 10) for each request, instead of only T1 and T4 (as was done in the previous experiments). This became possible because during the new experiments we had internal access to the BASIS WS and were able to install monitoring software directly into it to capture the exact time when a user's request arrives at BASIS and when BASIS returns the response. This allowed us to separate the two main delays contributing to the WS response time (RT): the request processing time (RPT) by the web service and the network (the Internet) round trip time (RTT), i.e. RT = RPT + RTT.
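A minimal sketch of this separation is given below, assuming (as Figure 10 suggests but the text does not state explicitly) that T1 and T4 are client-side timestamps taken when the request is sent and the response is received, and T2 and T3 are service-side timestamps taken when the request arrives and the response leaves. RPT uses only server-clock differences and RT only client-clock differences, so the two clocks need not be synchronised.

    // Separates the delays from the four timestamps of a single request.
    public class DelaySeparation {

        static long responseTime(long t1, long t4)   { return t4 - t1; }   // RT, client clock
        static long processingTime(long t2, long t3) { return t3 - t2; }   // RPT, server clock

        static long roundTripTime(long t1, long t2, long t3, long t4) {
            return responseTime(t1, t4) - processingTime(t2, t3);          // RTT = RT - RPT
        }
    }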

Besides, we investigated how the performance of the BASIS WS and its instability changed during the three months since our previous large-scale experiment, to check the hypothesis that such characteristics, once measured, stay valid. Finally, when we set up these new experiments we wanted to know whether there is a way to predict and represent the performance uncertainty in SOA by employing one of the theoretical distributions used to describe random variables such as the web service response time. A motivation for this is the fact, shown by many studies (e.g. Reinecke et al., 2006), that the Exponential distribution does not represent well the accidental delays in the Internet and SOA. After processing the statistics for all the clients located in different places over the Internet we found the same uncertainty tendencies. Thus, in this section we report the results obtained only for the Frankfurt client.

Figure 10. Performance measurement

2.2. Performance Trends Analysis

Performance trends and probability density series of RPT, RTT and RT captured during the eighteen days are shown in Figure 11. It can be seen that RTT and especially RPT show significant instability, which together contribute to the instability of the total response time RT.

Sometimes, delays were twenty times (and even more) longer than their average values (see Table 6). In brackets we give estimates of the maximal and average values of RPT, RTT and RT and of their standard deviations, obtained after discarding the ten most extreme delay values. The ratio between the delay standard deviation and its average value is used as the uncertainty measure.


Compared with our experiments three months earlier, we observed a significant increase in the average response time (see Table 6) recorded by the Frankfurt client (889.7 ms instead of 502.15 ms). In addition, the uncertainty of BASIS performance from the client-side perspective increased several times over (94.1% instead of 18.77%). The network route between the BASIS WS and the Frankfurt client also changed significantly (18 intermediate routers instead of 11).

In our current work we set the number of bars in the histogram representing the probability density series (see Figure 11) equal to the square root of the number of elements in the experimental data, which is similar to the Matlab histfit(x) function. This allowed us to find out new interesting properties.
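A minimal sketch of this binning rule (our illustration of the square-root choice, not the actual monitoring code):

    // Builds a simple histogram whose number of bins is the square root of the sample size.
    public class ProbabilityDensitySeries {

        static int[] histogram(double[] delaysMs) {
            int bins = (int) Math.round(Math.sqrt(delaysMs.length));
            double min = delaysMs[0], max = delaysMs[0];
            for (double d : delaysMs) { min = Math.min(min, d); max = Math.max(max, d); }

            int[] counts = new int[bins];
            double width = (max - min) / bins;
            for (double d : delaysMs) {
                int bin = (width == 0) ? 0 : (int) ((d - min) / width);
                if (bin == bins) bin--;          // put the maximum value into the last bin
                counts[bin]++;
            }
            return counts;                        // divide by delaysMs.length for relative frequencies
        }
    }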

Figure 11. Performance trends and probability density series: RPT, RTT and RT

In particular, we could see that about 5% of the RPT, RTT and RT values are significantly larger than their average values. It is also clear that the probability distribution series of RTT has two extreme points. Besides, more than five percent of the RTT values are 80 ms (about 1/5) lower than the average one. Tracing the routes between the client and the service allows us to conclude that these fast responses were caused by a shortening of the network route. This would be very unusual for RPT but is typical for the Internet. Finally, this peculiarity of RTT causes the observable left tail in the RT probability distribution series. It also makes it difficult to find a theoretical distribution representing RTT.


Table 6. Performance statistics: RPT, RTT, RT

              Min, ms   Max, ms               Avg, ms           Std. Dev.         Cv, %
  RPT         287.0     241106.0 (8182.0)     657.7 (497.6)     4988.0 (773.5)    758.4 (155.4)
  RTT         210.0     19445.0 (1479.0)      405.8 (378.2)     621.1 (49.2)      153.1 (13.0)
  RT          616.0     241492.0 (11224.0)    1061.5 (889.7)    5031.0 (837.4)    474.1 (94.1)
  Ping RTT    26.4      346.9 (50.4)          32.0 (31.9)       3.6 (0.9)         11.3 (2.8)

As an availability concern, we should mention that the BASIS WS was unavailable for four hours (starting from 19:00 on April 11) because of network rerouting. Besides, the WS twice reported an exception instead of returning the normal results.

2.3. Retrieval of Real Distribution Laws of Web Service Delays

2.3.1. Hypothesis Checking Technique

In this section we provide the results of checking hypotheses about the distribution law of the web service response time (RT) and its component values RPT and RTT. In our work we use the Matlab numeric computing environment (www.mathworks.com) and its Statistics Toolbox, a collection of tools supporting a wide range of general statistical functions, from random number generation to curve fitting. The technique of hypothesis checking consists of two basic procedures. First, the values of the distribution parameters are estimated by analysing the experimental samples. Second, the null hypothesis that the experimental data have a particular distribution with certain parameters is checked. To perform the hypothesis checking itself we used the kstest function: [h, p] = kstest(x, cdf) performs a Kolmogorov-Smirnov test to compare the distribution of x to the hypothesized distribution defined by the matrix cdf.

The null hypothesis for the Kolmogorov-Smirnov test is that x has a distribution defined by cdf; the alternative hypothesis is that x does not have that distribution. The result h is equal to "1" if we can reject the hypothesis, or "0" if we cannot reject it. The function also returns the p-value, which is the probability that x does not contradict the null hypothesis. We reject the hypothesis if the test is significant at the 5% level (if the p-value is less than 0.05).
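The report uses Matlab throughout; purely as an illustration of the same two-step procedure outside Matlab, the Java sketch below fits Gamma parameters by the method of moments (an assumption on our part; Matlab's estimator differs) and then runs a one-sample Kolmogorov-Smirnov test using Apache Commons Math.

    import org.apache.commons.math3.distribution.GammaDistribution;
    import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest;

    // Step 1: estimate distribution parameters; Step 2: KS test against the fitted distribution.
    public class GoodnessOfFit {

        static double gammaFitPValue(double[] delaysMs) {
            // Step 1: Gamma shape (a) and scale (b) by the method of moments.
            double sum = 0;
            for (double d : delaysMs) sum += d;
            double mean = sum / delaysMs.length;
            double sq = 0;
            for (double d : delaysMs) sq += (d - mean) * (d - mean);
            double var = sq / (delaysMs.length - 1);
            double shape = mean * mean / var;
            double scale = var / mean;

            // Step 2: one-sample Kolmogorov-Smirnov test; the null hypothesis is rejected
            // at the 5% level if the returned p-value is below 0.05.
            KolmogorovSmirnovTest ks = new KolmogorovSmirnovTest();
            return ks.kolmogorovSmirnovTest(new GammaDistribution(shape, scale), delaysMs);
        }
    }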

2.3.2. Goodness-of-Fit Analysis

In our experimental work we checked six hypotheses: that the experimental data conform to the Exponential, Gamma, Beta, Normal, Weibull or Poisson distributions. These checks were performed for the request processing time (RPT), the round trip time (RTT) and the response time (RT) as a whole.

Our main finding is that none of the distributions fits the whole performance statistics gathered during the 18 days. Moreover, the more experimental data we used, the worse the approximation provided by all the distributions! This means that, in the general case, the instability existing in a Service-Oriented Architecture cannot be predicted and described by an analytic formula. Further work focused on finding the distribution law that fits the experimental data within limited time intervals. We chose two short time intervals, with the most stable (from 0:24:28 of April 12 until 1:17:50 of April 14) and the least stable (from 8:31:20 of April 23 until 22:51:36 of April 23) response time.

The first time interval includes 293 request samples. The results of hypothesis checking for RPT, RTT and RT are given in Tables 7, 8 and 9 respectively. The p-value returned by the kstest function was used to estimate the goodness-of-fit of each hypothesis. As can be seen, the Beta, Weibull and especially the Gamma (1) distributions fit the experimental data better than the others. Besides, RPT is approximated by these distributions better than RT and RTT.

\[ y = f(x \mid a, b) = \frac{1}{b^{a}\,\Gamma(a)}\; x^{a-1} e^{-x/b} \qquad (1) \]


Typically, the Gamma probability density function (PDF) is useful in reliability models of lifetimes. This distribution is more flexible than the Exponential one, which is a special case of the Gamma distribution (when a=1). It is remarkable that the Exponential distribution in our case describes the experimental data worst of all.
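Setting a = 1 in (1) makes this relationship explicit, collapsing the Gamma density to the Exponential density:

\[ f(x \mid 1, b) = \frac{1}{b}\, e^{-x/b} \]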

However, a close approximation, even using the Gamma distribution, can be achieved only within a limited sample interval (25 samples in our case). Moreover, RTT (and sometimes RT) can hardly be approximated even with such a limited sample length.

Table 7. RPT goodness-of-fit approximation (cell values are p-values)

Number of requests    Exp.          Gam.        Norm.       Beta        Weib.       Poiss.

293 (all) 7.8 10^-100 1.1 10^-06 9.5 10^-63 9.3 10^-25 2.3 10^-11 4.9 10^-66

First half 1.1 10^-99 0.0468 1.2 10^-62 0.0222 0.00023 1.1 10^-65

Second half 1.3 10^-47 0.2554 5.1 10^-30 0.2907 0.0729 1.6 10^-31

First 50 6.9 10^-18 0.2456 2.3 10^-11 0.2149 0.0830 7.5 10^-12

First 25 2.3 10^-09 0.9773 5.1 10^-06 0.9670 0.5638 2.9 10^-06

Second 25 2.5 10^-09 0.2034 5.2 10^-06 0.1781 0.0508 3.1 10^-06

Table 8. RTT goodness-of-fit approximation (cell values are p-values)

Number of requests    Exp.          Gam.        Norm.       Beta        Weib.       Poiss.

293 (all) 2.1 10^-94 5.1 10^-30 4.4 10^-59 7.0 10^-39 5.0 10^-38 7.5 10^-85

First half 6.5 10^-52 2.6 10^-17 9.1 10^-33 1.1 10^-16 2.6 10^-19 1.0 10^-45

Second half 5.0 10^-44 2.5 10^-11 1.8 10^-27 4.6 10^-16 4.6 10^-13 8.1 10^-40

First 50 8.1 10^-18 1.9 10^-04 2.1 10^-11 2.9 10^-04 2.0 10^-07 2.1 10^-15

First 25 2.7 10^-09 0.004 4.2 10^-06 0.0043 0.0133 4.6 10^-08

Second 25 1.6 10^-09 6.0 10^-04 4.0 10^-06 5.4 10^-04 3.5 10^-04 4.8 10^-08

Table 9. RT goodness-of-fit approximation (cell values are p-values)

Number of requests    Exp.          Gam.        Norm.       Beta        Weib.       Poiss.

293 (all) 1.6 10^-96 1.8 10^-14 4.4 10^-60 4.4 10^-29 1.0 10^-19 4.0 10^-67

First half 2.6 10^-52 0.0054 9.4 10^-33 0.0048 1.1 10^-06 2.6 10^-35

Second half 1.0 10^-45 9.8 10^-08 1.9 10^-28 5.2 10^-15 9.1 10^-09 2.2 10^-32

First 50 6.1 10^-18 0.1159 2.1 10^-11 0.1083 0.1150 6.1 10^-12

First 25 2.4 10^-09 0.8776 4.2 10^-06 0.8909 0.7175 2.7 10^-06

Second 25 1.9 10^-09 0.0843 4.5 10^-06 0.0799 0.0288 2.8 10^-06

For the second time interval all six hypotheses were rejected because the p-values fell below the 5% significance level. Thus, we can state that the deviation of the experimental data significantly affects the goodness of fit. Nevertheless, even here the Gamma distribution gave a better approximation than the other five distributions.

2.4. Discussion

In these experiments the major uncertainty came from the BASIS WS itself, whereas in the experiments conducted three months before (during the Christmas week) the Internet was most likely the main cause of the uncertainty.

An important finding is that RPT shows a higher instability than RTT; in spite of this, RPT can be represented better by a particular theoretical distribution. At the same time the probability


distribution series of RTT has unique characteristics that make it really difficult to describe theoretically. Among the existing theoretical distributions, Gamma, Beta and Weibull capture our experimental response-time statistics better than the others. However, the goodness of fit is acceptable only within short time intervals.

We should also mention here that the performance and other dependability characteristics of WSs can become out of date very quickly. The BASIS response time changed significantly after three months even though there were no essential changes in its architecture apart from changes in the usage profile and the Internet routes. The BASIS WS is a typical example of a number of SOA solutions found in e-science and grid. It has a rather complex structure which integrates a number of components, such as an SBML modeller and simulator, a database, a grid computing engine, a computing cluster, etc., typically used for many in silico studies in systems biology. We believe that performance

uncertainty, which is partially due to the systems themselves, can be reduced by further optimisation of

the internal structure and the right choice of components and technologies that suit each other and fit the

system requirements better.

Finally, our concrete suggestion for bio-scientists using BASIS is to set a time-out that is 1.2 times the average response time estimated over the last 20-25 requests. When the time-out is exceeded, a recovery action based on a simple retry can be effective most of the time in dealing with transient congestion in the Internet and/or the BASIS WS. A more sophisticated technique that predicts the response time more precisely and sets the time-out accordingly should assess both the average response time and the coefficient of variation. To be more dependable, clients should also distinguish between different exceptions and handle them in different ways depending on the exception source.
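The following Java sketch illustrates this suggestion. It is a minimal illustration rather than the actual BASIS client: the class and method names are ours, the Callable stands for an arbitrary Web Service invocation, and the window of 25 requests and the initial 120-second time-out are assumptions chosen for the example.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class AdaptiveTimeoutClient {

    private static final int WINDOW = 25;        // average over the last 20-25 requests
    private static final double FACTOR = 1.2;    // time-out = 1.2 x average response time

    private final Deque<Long> recentDelays = new ArrayDeque<>();
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    // Invokes the service with the adaptive time-out and retries once when it expires.
    public <T> T invoke(Callable<T> wsCall) throws Exception {
        long timeoutMs = currentTimeoutMs();
        try {
            return timedCall(wsCall, timeoutMs);
        } catch (TimeoutException firstAttemptTimedOut) {
            // A simple retry deals with transient congestion in the Internet
            // and/or the service itself most of the time.
            return timedCall(wsCall, timeoutMs);
        }
    }

    private <T> T timedCall(Callable<T> wsCall, long timeoutMs) throws Exception {
        long start = System.currentTimeMillis();
        Future<T> future = executor.submit(wsCall);
        try {
            T result = future.get(timeoutMs, TimeUnit.MILLISECONDS);
            record(System.currentTimeMillis() - start);   // remember the observed delay
            return result;
        } catch (TimeoutException te) {
            future.cancel(true);
            throw te;
        }
    }

    private synchronized void record(long delayMs) {
        if (recentDelays.size() == WINDOW) {
            recentDelays.removeFirst();
        }
        recentDelays.addLast(delayMs);
    }

    private synchronized long currentTimeoutMs() {
        if (recentDelays.isEmpty()) {
            return 120000;   // conservative start before any response times are observed
        }
        double avg = recentDelays.stream().mapToLong(Long::longValue).average().orElse(0.0);
        return (long) (FACTOR * avg);
    }
}

A client would wrap each BASIS invocation in such a call, so that the time-out follows the recently observed response times instead of being fixed once and for all.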

All experimental results can be found at http://homepages.cs.ncl.ac.uk/alexander.romanovsky/home.formal/Server-for-site.xls, including the invocation and the ping RTT statistics for the Frankfurt client, and the probability distribution series (RPT, RTT, and RT). An extended version of this section has been submitted to SRDS'2010.

3. BENCHMARKING EXCEPTION PROPAGATION MECHANISMS

Exception handling is one of the popular means of improving dependability and supporting recovery in the Service-Oriented Architecture. Knowing the exact causes and sources of the exceptions raised during the operation of a Web Service allows developers to apply more suitable fault-tolerant and error recovery techniques (AmberPoint, 2003). In this section we present an experimental analysis of the SOA-specific exception propagation mechanisms and provide some insights into the differences in error handling and propagation delays between two implementations of web services in the IBM WebSphere SDK1 and the Sun Java application server SDK2. We analyse the ability of the exception propagation mechanisms of the two Web Services development toolkits to disclose the exact root causes of different exceptions, and we examine their implications for the performance and uncertainty of the SOA applications using them.

To provide such an analysis we used fault injection, which is a well-proven method for assessing the dependability and fault tolerance of a computing system. In particular, Looker et al. (2004)

and Duraes et al. (2004) present a practical approach for the dependability benchmarking and evaluation

of the robustness of Web Services. However, the existing works neither consider the propagation behaviour of the exceptions raised by the injected faults nor study the exception propagation performance of the different Web Services platforms.

3.1. Experimental Technique

To conduct our experiments we first implemented a Java class, WSCalc, which performs a simple

arithmetic operation upon two integers, converting the result into a string. Then we implemented two

testbed Web Services using two different development toolkits: i) Sun Java System (SJS) Application

Server and ii) IBM WebSphere Software Developer Kit (WSDK). The next steps were to analyse SOA-specific errors and failures and to inject them into the testbed web service architecture. Finally, we analysed and compared the exception propagation mechanisms and their performance implications.


3.2. Web Services Development Toolkits

In our work we experimented with two widely used technologies: the Java cross-platform technology,

developed by Sun and the IBM Web Service development environments and runtime application servers.

The reasons for this choice are that Sun develops most of the standards and reference implementations of

Java Enterprise software whereas IBM is the largest enterprise software company.

NetBeans IDE/SJS Application Server. NetBeans IDE3 is a powerful integrated environment for

developing applications on the Java platform, supporting Web Services technologies through the Java

Platform Enterprise Edition (J2EE). Sun Java System (SJS) Application Server is the Java EE

implementation by Sun Microsystems. NetBeans IDE with SJS Application Server supports JSR-109, a development paradigm suited to J2EE development and based on JAX-RPC (JSR-101).

IBM WSDK for Web Service. IBM WebSphere Software Developer Kit Version 5.1 (WSDK) is an

integrated kit for creating, discovering, invoking, and testing Web Services. WSDK v5.1 is based on

WebSphere Application Server v5.0.2 and provides support for the following open industry standards:

SOAP 1.1, WSDL 1.1, UDDI 2.0, JAX-RPC 1.0, EJB 2.0, Enterprise Web services 1.0, WSDL4J,

UDDI4J, and WS-Security. WSDK can be used with the Eclipse IDE4 which provides a graphical

interactive development environment for building and testing Java applications. Supporting the latest Web Services specifications, WSDK enables developers to build, test, and deploy Web Services on the industry-leading IBM WebSphere Application Server. The functionality of WSDK v5.1 has been incorporated into

the IBM WebSphere Studio family of products.

Note that, at the time of writing, the JAX-RPC framework was being extensively replaced by the newer JAX-WS framework (with SOAP 1.2 compliance), but we believe our findings will still apply to present and future Web Services technologies as they will be facing the same dependability issues.

3.3. Web Service Testbed

The starting point for developing a JAX-RPC WS is the coding of a service endpoint interface and an

implementation class with public methods that must throw java.rmi.RemoteException. To analyse features of

the exception propagation mechanisms in the service-oriented architecture we have developed a testbed WS

executing simple arithmetic operations. The implementation bean class of the Web Service providing

arithmetic operations is shown in Figure 12.

package ai.xai12.loony.wscalc;

public class WSCalc implements WSCalcSEI {

public String getMul (int a, int b) {

return new Integer(a * b).toString();

}

...

}

Figure 12. The implementation bean class of the Web Service

providing simple arithmetic operations
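For completeness, a minimal sketch of the corresponding service endpoint interface is shown below. Only the name WSCalcSEI appears in the report, so its body is our reconstruction following the JAX-RPC rule stated above: the interface extends java.rmi.Remote and every operation declares java.rmi.RemoteException.

package ai.xai12.loony.wscalc;

import java.rmi.Remote;
import java.rmi.RemoteException;

// Reconstructed sketch of the service endpoint interface implemented by WSCalc.
public interface WSCalcSEI extends Remote {
    String getMul(int a, int b) throws RemoteException;
}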

The testbed service was implemented by using two different development kits provided by Sun and

IBM. Two diverse web services obtained in such a way were deployed on the two hosts using the same

runtime environment (hardware platform and operating system) but different application servers: i) IBM

WebSphere and ii) SJS AppServer. These hosts operated under Windows XP Profession Edition were

located in the university LAN. Thus, transfer delays and other network problems were insignificant and

affected both testbed services in the same way.


3.4. Error and Failure Injection

In our work we experimented with 18 types of SOA-specific errors and failures occurring during service binding and invocation, SOAP messaging and request processing by a web service (see Table 10), divided into three main categories: (i) network and remote system failures, (ii) internal errors and failures and (iii) client-side binding errors. They are general (not application-specific) and can appear in

any Web Service application during operation.

Table 10. SOA-specific errors and failures

Network and system failures:
1. Network connection break-off
2. Domain Name System is down
3. Loss of request/response packet
4. Remote host unavailable
5. Application Server is down

Service errors and failures:
6. Suspension of WS during transaction
7. System run-time error
8. Application run-time error
9. Error causing user-defined exception

Client-side binding errors:
10. Error in Target Name Space
11. Error in Web Service name
12. Error in service port name
13. Error in service operation name
14. Output parameter type mismatch
15. Input parameter type mismatch
16. Error in name of input parameter
17. Mismatching of number of input params
18. WS style mismatching ("Rpc" or "Doc")

Typical network failures are a DNS outage or packets lost due to network congestion. In addition, the operation of a WS depends on the operation of system software such as the web server, application server and database management system. In our work we analysed failures occurring when the application

servers (WebSphere or SJS AppServer) were shut down.

Client errors in early binding or dynamic invocation interface (DII) calls (like "Error in Target Name Space", "Error in Web Service name", etc.) occur because of changes in the invocation parameters and/or inconsistencies between the WSDL description and the service interface. Finally, the service failures are connected with program faults and run-time errors causing system- or user-defined exceptions. System run-time errors like "Stack overflow" or "Lack of memory" result in exceptions at the level of the system as a whole. The "Division by zero" operation is also caught and generates an exception at the system level, but such a system error is easier to simulate than the other ones.

Typical examples of application run-time errors are "Operand type mismatch", "Product overflow" and "Index out of bounds". In our experiments we injected the "Operand type mismatch" error, hangs of the WS caused by its program getting into a loop, and an error causing a user-defined exception (an exception defined by a programmer during WS development).

Service failures (6, 7, 8) were simulated by fault injection at the service side. Client-side binding errors (10-18), which are, in fact, a set of robustness tests (i.e. invalid web-service call parameters), were applied during web-service invocation in order to reveal possible robustness problems in the web-services middleware. We used a compile-time injection technique (Looker et al., 2004) in which the source code is modified to inject simulated errors and faults into the system.
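As an illustration of the compile-time injection technique, the sketch below shows how the getMul operation could be modified at the source level to trigger failure No 7 ("Division by zero") or No 8 ("Operand type mismatch"). It is our reconstruction rather than the exact code used in the experiments; the input string "578ER" is borrowed from the stack trace shown later in Figure 13.

package ai.xai12.loony.wscalc;

// Reconstructed example of compile-time fault injection: the original method body
// is replaced so that invoking the operation activates the injected fault.
public class WSCalcFaulty implements WSCalcSEI {

    // Switches between the two injected service faults.
    private static final boolean INJECT_DIVISION_BY_ZERO = false;

    public String getMul(int a, int b) {
        if (INJECT_DIVISION_BY_ZERO) {
            // Failure No 7: system run-time error raising java.lang.ArithmeticException.
            return Integer.toString(a / 0);
        }
        // Failure No 8: application run-time error ("Operand type mismatch") raising
        // java.lang.NumberFormatException while parsing a non-numeric operand.
        int corrupted = Integer.parseInt("578ER");
        return Integer.toString(a * b * corrupted);
    }
}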


Network and system failures were simulated by manually shutting down the DNS server, the application server and the network connections at the client and service sides.

3.5. Errors and Exceptions Correspondence Analysis

Table 11 describes the relationship between errors/failures and the exceptions raised at the top level on the different application platforms. As we discovered, some injected errors and failures cause the same exception, so we were not always able to determine the precise cause of an exception. There are several groups of such errors and failures: 1 and 2 (Sun); 3 and 6 (Sun); 4 and 5 (Sun); 1, 2 and 5 (IBM); 3 and 6 (IBM).

Some client-side binding errors (11 – “Error in Web Service name”, 12 – “Error in service port

name”) neither raise exceptions nor affect the service output. This happens because the WS is actually

invoked by the address location, whereas the service and port names are only used as supplementary

information. Moreover, the WS developed using IBM WSDK and deployed on the IBM WebSphere application server tolerates the following binding errors internally: 10 - "Error in Target Name Space", 14 - "Output parameter type mismatch", and 16 - "Error in name of input parameter". These features are supported by the WSDL description and a built-in function for automatic type conversion.

Errors in the name of an input parameter were tolerated because, in the IBM implementation of the web service, checking the order of parameters takes priority over matching the parameter names. On the other hand, it seems that WebSphere is unable to detect the potentially dangerous situation resulting from such a parameter mix-up.

Table 11. Examples of top-level exceptions raised by different types of errors and failures

Network connection break-off; DNS is down
  Sun WS Toolkit: "HTTP transport error: java.net.UnknownHostException: c1.xai12.ai"
  IBM WSDK: "{http://websphere.ibm.com/webservices/} Server.generalException"

Remote host unavailable (off-line)
  Sun WS Toolkit: "HTTP Status-Code 404: Not Found - /WS/WSCalc"
  IBM WSDK: "{http://websphere.ibm.com/webservices/} HTTP faultString: (404)Not Found"

Suspension of Web Service during transaction
  Sun WS Toolkit: waits for a response for a very long time (more than 2 hours) without raising an exception
  IBM WSDK: "{http://websphere.ibm.com/webservices/} Server.generalException faultString: java.io.InterruptedIOException: Read timed out"

System run-time error ("Division by Zero")
  Sun WS Toolkit: "java.rmi.ServerException: JAXRPC.TIE.04: Internal Server Error (JAXRPCTIE01: caught exception while handling request: java.lang.ArithmeticException: / by zero)"
  IBM WSDK: "{http://websphere.ibm.com/webservices/} Server.generalException faultString: java.lang.ArithmeticException: / by zero"

Application error causing user-defined exception
  Sun WS Toolkit: "java.rmi.RemoteException: ai.c1.loony.exception.UserException"
  IBM WSDK: "{http://websphere.ibm.com/webservices/}Server.generalException faultString:(13)UserException"

Error in name of input parameter
  Sun WS Toolkit: "java.rmi.RemoteException: JAXRPCTIE01: unexpected element name:expected=Integer_2, actual=Integer_1"
  IBM WSDK: OK, correct output without exception


3.6. Exception Propagation and Performance Analysis

Table 11 shows the exceptions raised at the top level on the client side. However, a particular exception can be

wrapped dozens of times before it finally propagates to the top. This process takes time and significantly

reduces the performance of exception handling in a service-oriented architecture.

An example of the stack trace corresponding to the “Operand Type Mismatch” run-time error caught by

a web service is given in Figure 13. The exception propagation chain has four nested calls (the lines starting with "at") when the WS development kit from Sun Microsystems is used. For comparison, the stack trace of the IBM-based implementation has 63 nested calls for the same error. The full stack traces and

technical details can be found in Gorbenko et al. (2007).

The results of the exception propagation and performance analysis are presented in Table 12. The table gives the number of stack traces (the length of the exception propagation chain, i.e. the number of nested calls recorded for a particular failure) and the propagation delay (minimum, maximum and average values), which is the time between the invocation of a service and the capture of the exception by a catch block. As can be seen from Table 12, for error-free invocations the IBM implementation of the web service performs almost twice as well as the service implemented with the Sun technology.

java.rmi.ServerException: JAXRPC.TIE.04: Internal Server Error

(JAXRPCTIE01: java.lang. NumberFormatException: For input string: "578ER")

at com.sun.xml.rpc.client.dii.BasicCall.invoke(BasicCall.java:497)

at ai.c1.xai12.wstest.InvoceWS.invoce(InvoceWS.java:125)

at ai.c1.xai12.wstest.InvoceWS.invoceByVector(InvoceWS.java:75)

at wstest.Main.main(Main.java:42)

Figure 13. Stack trace of failure No 8, raised in the client application developed in NetBeans IDE

by using JAX-RPC implementation of Sun Microsystems
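To make the measurement concrete, the sketch below shows one way the propagation delay defined above (the time between invoking the service and capturing the exception in a catch block) can be recorded on the client side. The class and method names are illustrative and are not the measurement harness actually used in the experiments.

import ai.xai12.loony.wscalc.WSCalcSEI;
import java.rmi.RemoteException;

public class PropagationDelayProbe {

    // Returns the delay in milliseconds between invocation and exception capture,
    // or -1 if the call completed without an exception.
    public static long measure(WSCalcSEI stub, int a, int b) {
        long start = System.currentTimeMillis();
        try {
            stub.getMul(a, b);
            return -1;
        } catch (RemoteException e) {
            long delay = System.currentTimeMillis() - start;
            // The number of frames roughly corresponds to the "at ..." lines of the
            // stack trace, the other indicator reported in Table 12.
            System.out.println("delay = " + delay + " ms, frames = " + e.getStackTrace().length);
            return delay;
        }
    }
}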

The performance of the exception propagation mechanisms was monitored in the university LAN on heterogeneous server platforms. The first row of the table corresponds to correct service output without any exceptions. The rows in which the number of stack traces is zero in spite of an injected error correspond to the cases of correct service output without exceptions.

It is clear from the table that the exception propagation delay is several times greater than the normal working time. However, the exception propagation delay of the Web Service developed with NetBeans IDE using the JAX-RPC implementation from Sun Microsystems was about half of the delay we experienced when we used IBM WSDK. This can be accounted for by the fact that the exception propagation chain in the IBM implementation of the web service is usually much longer. The factors affecting the performance and the differences between the two web-service development environments most probably depend on the internal structure of the toolkits and the application servers used. We believe that the most likely reason for this behaviour is that the IBM WSDK implementation produces a larger number of nested calls than the JAX-RPC implementation by Sun Microsystems.

In the case of service suspension or packet loss, a service client developed using the Sun WS toolkit may not raise an exception for as long as two hours. This delays recovery actions and complicates the developers' work. Analysing the exception stack trace and propagation delay can help in

identifying the source of the exception.

For example, failures 1 - “Network connection break-off” and 2 - “Domain Name System (DNS) is

down” raise the same top-level exception “HTTP transport error: java.net.UnknownHostException:

loony.xai12.ai". However, if we use the Sun WS toolkit, we can distinguish between these failures by comparing the numbers of stack traces (38 vs. 28). If we use IBM WSDK, we are able to distinguish failure 5 – "Application Server is down" from failures 1 and 2 by analysing the exception propagation delay (the former is an order of magnitude greater).


Table 12. Performance analysis of the exception propagation mechanisms
(each toolkit column gives: number of stack traces; exception propagation delay in ms as min, max, av.)

Type of error/failure | NetBeans IDE (Sun) | IBM WSDK
Without error/failure | 0; 40, 210, 95 | 0; 15, 120, 45
1. Network connection break-off | 38; 10, 30, 23 | 16; 10, 40, 28
2. Domain Name System is down | 28; 16, 32, 27 | 16; 15, 47, 34
3. Loss of packet with client request or service response | -; >7200000 | 15; 300503, 300661, 300622
4. Remote host unavailable (off-line) | 9; 110, 750, 387 | 11; 120, 580, 350
5. Application Server is down | 9; 70, 456, 259 | 16; 100, 550, 287
6. Suspension of Web Service during transaction (getting into a loop) | -; >7200000 | 15; 300533, 300771, 300642
7. System run-time error ("Division by Zero") | 7; 90, 621, 250 | 62; 120, 551, 401
8. Calculation run-time error ("Operand Type Mismatch") | 4; 90, 170, 145 | 63; 130, 581, 324
9. Application error causing user-defined exception | 4; 100, 215, 175 | 61; 150, 701, 366
10. Error in Target Name Space | 4; 100, 281, 180 | 0; 10, 105, 38
11. Error in Web Service name | 0; 40, 120, 80 | 0; 10, 125, 41
12. Error in service port name | 0; 30, 185, 85 | 0; 15, 137, 53
13. Error in service operation name | 4; 90, 270, 150 | 58; 190, 511, 380
14. Output parameter type mismatch | 14; 80, 198, 160 | 0; 15, 134, 48
15. Input parameter type mismatch | 4; 80, 190, 150 | 76; 90, 761, 305
16. Error in name of input parameter | 4; 70, 201, 141 | 0; 10, 150, 47
17. Mismatching of number of input service parameters | 4; 80, 270, 160 | 61; 130, 681, 350
18. Web Service style mismatching | 4; 70, 350, 187 | 58; 90, 541, 298

For failures 3 and 6 the Sun-based client raised no exception within two hours, so only a lower bound on the delay (>7200000 ms) is given and no stack trace is available.

3.7. Discussion

Exception handling is widely used as the basis for forward error recovery in service-oriented architecture.

Its effectiveness depends on the features of exception raising and on the propagation mechanisms. This

work allows us to draw the following conclusions.

1. Web services developed by using different toolkits react to some DII client errors differently

(“Output parameter type mismatch”, “Error in name of input parameter”). Sometimes this diversity can

allow us to mask client errors, yet in other cases it will lead to an erroneous outcome. Moreover, the

exception messages and stack traces gathered in our experimentation were not always sufficient to

identify the exact cause of these errors. For example, it is not possible to know if a remote host is down or

unreachable due to transient network failures. All this contributes to SOA uncertainty and prevents

developers from applying an adequate recovery technique.

2. Clients of web services developed using different toolkits can experience different response

time-outs. In our experimentation with simple Web Services we also observed substantial delays in client

software developed using the Sun Microsystems toolkit caused by WS hangs or packet loss.

3. Web Services developed using different toolkits have different exception propagation times. This

affects failure detection and failure notification delay. We believe that WSDK developers should make an

effort to reduce these times.


4. Analysing exception stack traces and propagation delays can help identify the exact sources of

exceptions even if we have caught the same top-level exception messages. It makes for a better fault

diagnosis, which identifies and records the cause of exception in terms of both location and type, as well

as better fault isolation and removal.

5. Knowing the exact causes and sources of exceptions is useful for applying appropriate failure recovery or fault-tolerant means during exception handling. Several types of failures resulting in exceptions can be effectively handled on the client side, whereas others should be handled on the service side. Exception handling of client-side errors in early binding procedures may include a retry with the help of dynamic invocation. Transient network failures can be tolerated by a simple retry. In other cases

redundancy and majority voting should be used.

6. Gathering and analysing exception statistics allow improvement of fault handling, which

prevents located faults from being activated again by using system reconfiguration or reinitialization.

This is especially relevant to a composite system with several alternative WSs.

7. Analysing exception stack traces helps identify the application server, WSDK, libraries and

packages used for WS development. This information is useful for choosing diverse variants from a set of

alternative Web Services deployed by third parties and building effective fault-tolerant systems by using

WS redundancy and diversity.

Below is a summary of our suggestions on how exception handling should be implemented in SOA systems so that they handle exceptions optimally.

First of all, a Web Service should return exceptions as soon as possible. Long notification delays

can significantly affect SOA performance, especially in complex workflow systems. To decrease the

exception propagation delay, developers should avoid unnecessary nesting of exceptions and reduce the

overall number of exception stack traces.
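A small illustration of this advice, using assumed names: rather than adding a new wrapper at every layer, service-side code can wrap the original exception once, preserving it as the cause, which keeps the propagation chain and the resulting stack traces short.

import java.rmi.RemoteException;

public class WrappingStyles {

    // Discouraged: each layer adds another wrapper, lengthening the propagation
    // chain that the client eventually has to unwind.
    static String parseOperandVerbose(String operand) throws RemoteException {
        try {
            return Integer.toString(Integer.parseInt(operand));
        } catch (NumberFormatException e) {
            throw new RemoteException("Internal error",
                    new RuntimeException("operand parsing failed", e));
        }
    }

    // Preferred: a single wrapper that states the fault clearly and keeps the
    // original exception as the cause, so no further nesting is needed upstream.
    static String parseOperandConcise(String operand) throws RemoteException {
        try {
            return Integer.toString(Integer.parseInt(operand));
        } catch (NumberFormatException e) {
            throw new RemoteException("Operand type mismatch: " + operand, e);
        }
    }
}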

It is also essential that exceptions contain more detailed information about the cause of the error and provide additional classification attributes to help error diagnosis and fault tolerance. For

example, if an exception reports whether the error seems to be transient or permanent, a user’s application

will be able to automatically choose and perform the most suitable error recovery action (a simple retry in

case of transient errors or more complex fault-tolerant techniques otherwise).
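A minimal client-side sketch of this idea is given below. The TransientServiceException and PermanentServiceException types and the invokeWithRecovery helper are names invented for the example, since current toolkits do not report such a transient/permanent attribute.

import java.util.concurrent.Callable;

// Hypothetical exception types carrying the transient/permanent attribute
// discussed above; they are not part of JAX-RPC or any existing toolkit.
class TransientServiceException extends Exception { }
class PermanentServiceException extends Exception { }

public class RecoveryDispatcher {

    // Chooses a recovery action automatically, assuming the exception reports
    // whether the underlying error is transient or permanent.
    public static <T> T invokeWithRecovery(Callable<T> wsCall, Callable<T> alternativeWsCall)
            throws Exception {
        try {
            return wsCall.call();
        } catch (TransientServiceException e) {
            // Transient error (e.g. network congestion): a simple retry is enough.
            return wsCall.call();
        } catch (PermanentServiceException e) {
            // Permanent error: switch to a redundant, diverse Web Service.
            return alternativeWsCall.call();
        }
    }
}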

CONCLUSION AND FUTURE WORK

Service-Oriented Architecture and Web Services technologies support rapid, low-cost and seamless

composition of globally distributed applications, and enable effective interoperability in a loosely-coupled

heterogeneous environment. Services are autonomous, platform-independent computational entities that

can be dynamically discovered and integrated into a single service to be offered to the users or, in turn,

used as a building block in further composition. The essential principles of SOA and WS form the

foundation for various modern and emerging IT technologies, such as service-oriented and cloud

computing, SaaS (software as a service), grid, etc.

According to International Data Corporation (2007), Web Services and service-oriented systems

are now widely used in e-science, critical infrastructures and business-critical systems. Failures in these

applications adversely affect people’s lives and businesses. Thus, ensuring dependability of WSs and

SOA-based systems is a must, as well as a challenge. To illustrate the problem, our earlier extensive

experiments with the BASIS and BLAST bioinformatics WSs show that the response time varies greatly because of such unpredictable factors as Internet congestion and failures and WS overloads. In particular, the BASIS WS response time ranges from 300 ms to 120000 ms, the response time in 22% of the requests is at least twice as long as the observed minimal value, and the response time in about 5% of the requests is more than 20 times longer. We believe it is impossible to build fast and dependable SOAs

without tackling these issues.

Our recent experimental work supports our claim that dealing with the uncertainty inherent in the

very nature of SOA and WSs is one of the main challenges in building dependable SOAs. Uncertainty

needs to be treated as a threat in a way similar to and in addition to faults, errors and failures as

traditionally dealt with by the dependability community (Avizienis et al., 2004).


The response time instability can cause timing failures when the time of response arrival or the

time in which information is delivered at the service interface (i.e. the timing of service delivery) deviates

from the time required to execute the system function. A timing failure may be in the form of early or late

response, depending on whether the service is delivered too early or too late (Avizienis et al., 2004). In

complex Service-Oriented Systems composed of many different Web Services, some users may receive a

correct service whereas others may receive incorrect services of different types due to timing errors.

These errors may occur in different system components depending on the relative position of a particular

user and particular Web Services in the Internet, and on the instability points appearing during the

execution. Thus, timing errors can become a major cause of inconsistent failures usually referred to, after

Lamport, Shostak, & Pease (1982), as the Byzantine failures.

The novel concepts of Service-Oriented Systems and their application in new domains clearly call

for continued attention to the SOA-specific uncertainty issues. For open intra-organisational SOA systems using the Internet, this uncertainty is unavoidable and the systems should be able to provide a trustworthy service in spite of it. This, in turn, will require developing new resilience engineering

techniques and resilience-explicit mechanisms to deal with this threat.

Good measurement of uncertainty is important (and our work contributes to this topic), and yet this is just the beginning because, once measured, the non-functional characteristics of WSs cannot be assumed to remain valid forever. This is why developing dynamic fault-tolerant techniques and mechanisms that set timeouts on-line and adapt the system architecture and its behaviour on the fly is crucial for SOA.

In fact, there is a substantial number of dependability-enhancing techniques that can be applied to SOA

(Zheng, & Lyu, 2009; Maamar et al., 2008; Laranjeiro, & Vieira, 2008; Fang et al., 2007; Salatge, &

Fabre, 2007, etc.), including retries of lost messages, redundancy and replication of WSs, variations of

recovery blocks trying different services, etc.

These techniques exploit the flexibility of the service infrastructure, but the major challenge in

utilising these dependability techniques is the uncertainty inherent in the services running over the

Internet and clouds. This uncertainty exhibits itself through the unpredictable response times of the

Internet messages and data transfers, the difficulty of diagnosing the root cause of service failures, the

lack of ability to see beyond the interfaces of a service, unknown common mode failures, etc. The

uncertainty of the Internet and service performance instability are such that on-line optimization of

redundancy can make a substantial difference in perceived dependability. There are, however, no good

tools available at the moment for a company to carry out such optimisation in a rigorous manner.

We believe that uncertainty can be resolved by two means: uncertainty removal through advances

in data collection and uncertainty tolerance through smart algorithms that improve decisions despite a

lack of data (e.g. by extrapolation, better mathematical models, etc.). The user can intelligently and

dynamically switch between the Internet service providers or WS providers if she/he understands which

delay makes the major contribution to the response time and its instability. The more aware the user is of the response time, of the different delays contributing to it and of their uncertainty, the more intelligent his or her choice will be.

Future solutions will need to deal with a number of issues, such as the uncertainty of fault

assumptions, of redundant resource behaviour, of error detection, etc. The traditional adaptive solutions based on control feedback will not be directly applicable, as they are designed for predictable behaviour. One of the possible ways to resist uncertainty is to use the service and path redundancy and diversity inherent in SOA.

In (Gorbenko, Kharchenko, & Romanovsky, 2009) we propose several patterns for dependability-aware service composition that allow us to construct composite Service-Oriented Systems resilient to various types of failure (signalled or unsignalled; content, timing or silent failures) by using the inherent redundancy and diversity of the Web Service components which exist in the SOA and by extending the mediator approach proposed by Chen and Romanovsky (2008).


ACKNOWLEDGEMENTS

A. Romanovsky is partially supported by the UK EPSRC TrAmS platform grant.

A. Gorbenko is partially supported by the UA DFFD grant GP/F27/0073 and School of Computing

Science, Newcastle University.

REFERENCES

AmberPoint, Inc (2003). Managing Exceptions in Web Services Environments, An AmberPoint

Whitepaper. Retrieved from http://www.amberpoint.com.

Avizienis, A., Laprie, J.-C., Randell, B., & Landwehr, C. (2004). Basic Concepts and Taxonomy of

Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11-

33

Chen, Y., & Romanovsky, A. (2008, Jan/Feb) Improving the Dependability of Web Services Integration.

IT Professional: Technology Solutions for the Enterprise, 20-26.

Chen, Y., Romanovsky, A., Gorbenko, A., Kharchenko, V., Mamutov, S., Tarasyuk, O. (2009).

Benchmarking Dependability of a System Biology Application. Proceedings of the 14th IEEE Int.

Conference on Engineering of Complex Computer Systems (ICECCS’2009), 146 – 153.

Duraes, J., Vieira, M., & Madeira, H. (2004). Dependability Benchmarking of Web-Servers. In M. Heisel

et al. (Eds.), Proceedings of the 23rd Int. Conf. on Computer Safety, Reliability and Security

(SAFECOMP'04), LNCS 3219, (pp. 297-310). Springer-Verlag.

Fang, C.-L., Liang, D., Lin, F., & Lin, C.-C. (2007). Fault tolerant web services. Journal of Systems

Architecture, 53(1), 21-38

Goad, R. (2008, Dec) Social Xmas: Facebook’s busiest day ever, YouTube overtakes Hotmail, Social

networks = 10% of UK Internet traffic, [Web log comment]. Retrieved from

http://weblogs.hitwise.com/robin-goad/2008/12/facebook_youtube_christmas_social_networking.html

Gorbenko, A., Mikhaylichenko, A., Kharchenko, V., & Romanovsky, A. (2007). Experimenting With

Exception Handling Mechanisms Of Web Services Implemented Using Different Development Kits.

Technical report CS-TR 1010, Newcastle University. Retrieved from

http://www.cs.ncl.ac.uk/research/pubs/trs/papers/1010.pdf.

Gorbenko, A., Kharchenko, V., Tarasyuk, O., Chen, Y., & Romanovsky, A. (2008). The Threat of

Uncertainty in Service-Oriented Architecture. Proceedings of the RISE/EFTS Joint International

Workshop on Software Engineering for Resilient Systems (SERENE'2008), ACM, 49-50.

Gorbenko, A., Kharchenko, V., & Romanovsky, A. (2009). Using Inherent Service Redundancy and

Diversity to Ensure Web Services Dependability. In M.J. Butler, C.B. Jones, A. Romanovsky, E.

Troubitsyna (Eds.) Methods, Models and Tools for Fault Tolerance, LNCS 5454 (pp. 324-341). Springer-

Verlag.

Institute for Ageing and Health (2009). BASIS: Biology of Ageing e-Science Integration and Simulation

System. Retrieved June 1, 2010, from http://www.basis.ncl.ac.uk/. Newcastle upon Tyne, UK: Newcastle

University.

International Data Corporation (2007). Mission Critical North American Application Platform Study, IDC

White Paper. Retrieved from www.idc.com.

Kirkwood, T.B.L., Boys, R.J., Gillespie, C.J., Proctor, C.J., Shanley, D.P., Wilkinson, D.J. (2003).

Towards an E-Biology of Ageing: Integrating Theory and Data. Journal of Nature Reviews Molecular

Cell Biology, 4, 243-249.


Lamport, L., Shostak, R., & Pease, M. (1982). The Byzantine Generals Problem. ACM Trans.

Programming Languages and Systems, 4(3), 382-401.

Laranjeiro, N., & Vieira, M. (2008). Deploying Fault Tolerant Web Service Compositions. International

Journal of Computer Systems Science and Engineering (CSSE): Special Issue on Engineering Fault

Tolerant Systems, 23(5).

Laranjeiro, N., Vieira, M., & Madeira, H. (2007). Assessing Robustness of Web-services Infrastructures.

Proceedings of the International Conference on Dependable Systems and Networks (DSN’07), 131-136

Li, P., Chen, Y., Romanovsky, A. (2006). Measuring the Dependability of Web Services for Use in e-

Science Experiments. In D. Penkler, M. Reitenspiess, & F. Tam (Eds.): International Service Availability

Symposium (ISAS 2006), LNCS 4328, (pp. 193-205). Springer-Verlag.

Looker, N., Munro, M., & Xu, J. (2004). Simulating Errors in Web Services. International Journal of

Simulation Systems, Science & Technology, 5(5)

Maamar, Z., Sheng, Q., & Benslimane, D. (2008). Sustaining Web Services High-Availability Using

Communities. Proceedings of the 3rd International Conference on Availability, Reliability and Security,

834-841.

Miyazaki, S., & Sugawara, H. (2000) Development of DDBJ-XML and its Application to a Database of

cDNA, Genome Informatics 2000, (pp. 380–381). Tokyo: Universal Academy Press Inc.

Reinecke, P., van Moorsel, A., & Wolter, K. (2006). Experimental Analysis of the Correlation of HTTP

GET invocations. In A. Horvath and M. Telek (Eds.): European Performance Engineering Workshop

(EPEW’2006), LNCS 4054, (pp. 226-237). Springer-Verlag.

Salatge, N., & Fabre, J.-C. (2007). Fault Tolerance Connectors for Unreliable Web Services. Proceedings

of the International Conference on Dependable Systems and Networks (DSN’07). 51-60.

Zheng, Z., Zhang, Y., & Lyu, M. (2010). Distributed QoS Evaluation for Real-World Web Services.

Proceedings of the IEEE International Conference on Web Services (ICWS’10), 83-90.

Zheng, Z., & Lyu, M. (2010). Collaborative Reliability Prediction for Service-Oriented Systems.

Proceedings of the ACM/IEEE 32nd International Conference on Software Engineering (ICSE’10), 35-

44.

Zheng, Z., & Lyu, M. (2009). A QoS-Aware Fault Tolerant Middleware for Dependable Service

Composition. Proceedings of the International Conference on Dependable Systems and Networks

(DSN’09), 239-248.

1 http://www-128.ibm.com/developerworks/webservices/wsdk/
2 http://www.sun.com/software/products/appsrvr_pe/index.xml
3 http://www.netbeans.org
4 http://www.eclipse.org