COMPUTING SCIENCE
Measuring and Dealing with the Uncertainty of SOA Solutions
Yuhui Chen, Anatoliy Gorbenko, Vyacheslav Kharchenko and Alexander Romanovsky
TECHNICAL REPORT SERIES
No. CS-TR-1225 November 2010
Measuring and Dealing with the Uncertainty of SOA Solutions

Y. Chen, A. Gorbenko, V. Kharchenko and A. Romanovsky

Abstract

The paper investigates the uncertainty of Web Services performance and the instability of their communication medium (the Internet), and shows the influence of these two factors on the overall dependability of SOA. We present our practical experience in benchmarking and measuring the behaviour of a number of existing Web Services used in e-science and bio-informatics, provide the results of statistical data analysis and discuss the probability distribution of delays contributing to the Web Services response time. The ratio between delay standard deviation and its average value is introduced to measure the performance uncertainty of a Web Service. Finally, we present the results of error and fault injection into Web Services. We summarise our experiments with SOA-specific exception handling features provided by two web service development kits and analyse exception propagation and performance as the major factors affecting fault tolerance (in particular, error handling and fault diagnosis) in Web Services.

© 2010 University of Newcastle upon Tyne. Printed and published by the University of Newcastle upon Tyne, Computing Science, Claremont Tower, Claremont Road, Newcastle upon Tyne, NE1 7RU, England.
Bibliographical details CHEN, Y., GORBENKO, A., KHARCHENKO, V., ROMANOVSKY, A. Measuring and Dealing with the Uncertainty of SOA Solutions [By] Y. Chen, A. Gorbenko, V. Kharchenko, A. Romanovsky Newcastle upon Tyne: University of Newcastle upon Tyne: Computing Science, 2010. (University of Newcastle upon Tyne, Computing Science, Technical Report Series, No. CS-TR-1225)
Added entries UNIVERSITY OF NEWCASTLE UPON TYNE Computing Science. Technical Report Series. CS-TR-1225

About the authors

Yuhui Chen completed his PhD study at Newcastle University (UK). He received an MSc in Computing Science in 2003 from Newcastle University. He started his PhD in April 2004 under the supervision of Prof. Alexander Romanovsky. His research focuses on the dependability of Service-Oriented Architecture.

Anatoliy Gorbenko graduated in computer science in 2000 and received the PhD degree from the National Aerospace University, Kharkiv, Ukraine in 2005. He is an Associate Professor at the Department of Computer Systems and Networks of the National Aerospace University in Kharkiv (Ukraine), where he co-coordinates the DESSERT (Dependable Systems, Services and Technologies) research group. His work focuses on ensuring dependability and fault tolerance in service-oriented architectures; on investigating system diversity, dependability assessment and exception handling; and on applying these results in real industrial applications. Dr. Gorbenko is a member of EASST (European Association of Software Science and Technology).

Vyacheslav Kharchenko (M'01) received his PhD in Technical Science at the Military Academy named after Dzerzhinsky (Moscow, Russia) in 1981 and the Doctor of Technical Science degree at the Kharkiv Military University (Ukraine) in 1995. He is a Professor and head of the Computer Systems and Networks Department and the DESSERT research group at the National Aerospace University, Ukraine. He is also a senior research investigator in the field of safety-related software at the State Science-Technical Center of Nuclear and Radiation Safety (Ukraine). He has published nearly 200 scientific papers, reports and book chapters, holds more than 500 inventions, and is the coauthor or editor of 28 books. He has been the principal investigator and consultant on a succession of research projects in the safety and dependability of NPP I&C and aerospace systems, and headed the DESSERT International Conference (http://www.stc-dessert.com) in 2006-2009. His research interests include critical computing, dependable and safety-related I&C systems, multi-version design technologies, software and FPGA-based systems verification and expert analysis.

Alexander (Sascha) Romanovsky is a Professor in the Centre for Software and Reliability, Newcastle University. His main research interests are system dependability, fault tolerance, software architectures, exception handling, error recovery, system structuring and verification of fault tolerance. He received an M.Sc. degree in Applied Mathematics from Moscow State University and a PhD degree in Computer Science from St. Petersburg State Technical University. He was with that university from 1984 until 1996, doing research and teaching. In 1991 he worked as a visiting researcher at the ABB Ltd Computer Architecture Lab Research Center, Switzerland. In 1993 he was a visiting fellow at the Istituto di Elaborazione della Informazione, CNR, Pisa, Italy. In 1993-94 he was a post-doctoral fellow with the Department of Computing Science, the University of Newcastle upon Tyne. In 1992-1998 he was involved in the Predictably Dependable Computing Systems (PDCS) ESPRIT Basic Research Action and the Design for Validation (DeVa) ESPRIT Basic Project. In 1998-2000 he worked on the Diversity in Safety Critical Software (DISCS) EPSRC/UK Project. He was a co-author of the Diversity with Off-The-Shelf Components (DOTS) EPSRC/UK Project and was involved in it in 2001-2004. In 2000-2003 he was on the executive board of the Dependable Systems of Systems (DSoS) IST Project. He was the Coordinator of the Rigorous Open Development Environment for Complex Systems (RODIN) IST Project (2004-2007), and is now the Coordinator of the major FP7 DEPLOY Integrated Project (2008-2012) on Industrial Deployment of System Engineering Methods Providing High Dependability and Productivity.
Suggested keywords: WEB SERVICES, SOA, BENCHMARKING, RESPONSE TIME
Measuring and Dealing with the Uncertainty of SOA Solutions

Yuhui Chen (1), Anatoliy Gorbenko (2), Vyacheslav Kharchenko (2), Alexander Romanovsky (3)

(1) Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
(2) Department of Computer Systems and Networks, National Aerospace University, Kharkiv, Ukraine
(3) School of Computing Science, Newcastle University, Newcastle upon Tyne, UK
ABSTRACT
The chapter investigates the uncertainty of Web Services performance and the instability of their
communication medium (the Internet), and shows the influence of these two factors on the overall
dependability of SOA. We present our practical experience in benchmarking and measuring the behaviour
of a number of existing Web Services used in e-science and bio-informatics, provide the results of
statistical data analysis and discuss the probability distribution of delays contributing to the Web Services
response time. The ratio between delay standard deviation and its average value is introduced to measure
the performance uncertainty of a Web Service. Finally, we present the results of error and fault injection
into Web Services. We summarise our experiments with SOA-specific exception handling features
provided by two web service development kits and analyse exception propagation and performance as the
major factors affecting fault tolerance (in particular, error handling and fault diagnosis) in Web Services.
INTRODUCTION
The paradigm of Service-Oriented Architecture (SOA) is a further step in the evolution of the well-known
component-based system development with Off-the-Shelf components. SOA and Web Services (WSs)
were introduced to ensure effective interaction of complex distributed applications. They are now
evolving within critical infrastructures (e.g. air traffic control systems), holding various business systems
and services together (for example, banking, e-health, etc.). Their ability to compose and implement
business workflows provides crucial support for developing globally distributed large-scale computing
systems, which are becoming integral to society and the economy.
Unlike common software applications, however, Web Services work in an unstable environment as
part of globally-distributed and loosely-coupled SOAs, communicating with a number of other services
deployed by third parties (e.g. in different administration domains), typically with unknown dependability
characteristics. When complex service-oriented systems are built dynamically, or when their components are dynamically replaced by new ones with the same (or similar) functionality but unknown dependability and performance characteristics, ensuring and assessing their dependability becomes genuinely complicated. This fact is the main motivation for this work.
By their very nature Web Services are black boxes, as neither their source code, nor their complete
specification, nor information about their deployment environments are available; the only known
information about them is their interfaces. Moreover, their dependability is not completely known and
they may not provide sufficient Quality of Service (QoS); it is often safer to treat them as “dirty” boxes,
assuming that they always have bugs, do not fit well enough, and have poor specification and
documentation. Web Services are heterogeneous, as they might be developed following different
standards, fault assumptions and different conventions, and may use different technologies. Finally,
Service-Oriented Systems are built as overlay networks over the Internet, and their construction and composition are complicated by the fact that the Internet is a poor communication medium (its quality is low and its behaviour is unpredictable).
Therefore, users cannot be confident of their availability, trustworthiness, reasonable response time
and other dependability characteristics (Avizienis, Laprie, Randell, & Landwehr, 2004), as these can vary
over wide ranges in a random and unpredictable manner. In this work we use the general synthetic term uncertainty to refer to the unknown, unstable, unpredictable and changeable characteristics and behaviour of Web Services and SOA, exacerbated by running these services over the Internet. Dealing with such uncertainty, which is in the very nature of SOA, is one of the main challenges that researchers are facing.
To become ubiquitous, Service-Oriented Systems should be capable of tolerating faults and potentially harmful events caused by a variety of reasons, including low or varying (decreasing) quality of components (services), shifting characteristics of the network media, component mismatches, permanent or temporary faults of individual services, composition mistakes, service disconnections, and changes in the environment and in policies.
The dependability and QoS of SOA have recently been the focus of significant research effort. A
number of studies (Zheng, & Lyu, 2009; Maamar, Sheng, & Benslimane, 2008; Fang, Liang, F. Lin, &
C.-C. Lin, 2007) have introduced several approaches to incorporating resilience techniques (including
voting, backward and forward error recovery mechanisms and replication techniques) into WS
architectures. There has been work on benchmarking and experimental measurements of dependability
(Laranjeiro, Vieira, & Madeira, 2007; Duraes, Vieira, & Madeira, 2004; Looker, Munro, & Xu, 2004) as
well as dependability and performance evaluation (Zheng, Zhang, & Lyu, 2010). But even though the
existing proposals offer useful means for improving SOA dependability by enhancing particular WS
technologies, most of them do not address the uncertainty challenge which exacerbates the lack of
dependability and varying quality.
The uncertainty of Web Services has two main consequences. First, it makes it difficult to assess
dependability and performance of services, and hence to choose between them and gain confidence in
their dependability. Secondly, it becomes difficult to apply fault tolerance mechanisms efficiently, as too
much of the data which is necessary to make choices is missing.
The purpose of the chapter is to investigate the dependability and uncertainty of SOA and the
instability of the communication medium through large-scale benchmarking of a number of existing Web
Services. Benchmarking is an essential and very popular approach to web services dependability
measurement. In addition to the papers (Laranjeiro et al., 2007; Duraes et al., 2004), we should mention such recent and ongoing European research projects as AMBER (http://www.amber-project.eu/) and WS-Diamond (http://wsdiamond.di.unito.it/). Mostly relying on stress-testing and failure injection techniques, these works analyse service robustness and behaviour in the presence of failures or under stressed load, and compare the effectiveness of the technologies used to implement web services. Hardly any of these studies, however, address the instability of web services or offer a strong mathematical foundation or proofs, mostly because, we believe, there is no general theory capturing the uncertainties inherent to SOA.
In this chapter we present our practical experience in benchmarking and measuring a number of
existing WSs used in e-science and bio-informatics (Blast and Fasta, providing API for bioinformatics
and genetic engineering, and available at http://xml.nig.ac.jp, and BASIS, the Biology of Ageing E-
Science Integration and Simulation System, available at http://www.basis.ncl.ac.uk/WebServices.html).
This chapter summarises our recent work in the area (for more information, the readers are referred to
Gorbenko, Mikhaylichenko, Kharchenko, and Romanovsky (2007), Gorbenko, Kharchenko, Tarasyuk,
Chen, and Romanovsky (2008), Chen et al. (2009)).
In the first section we describe the experimental techniques used, investigate performance
instability of the Blast, Fasta and BASIS WSs and analyse the delays induced by the communication
medium. We also show results of statistical data analysis (i.e. minimal, maximal and average values of the
delays and their standard deviations) and present probability distribution series.
The second section analyses the instability involved in delays as elements of the web service
response time. In this section we report the latest results of advanced BASIS web services measurements,
capable of distinguishing between the network round trip time (RTT) and the request processing time
(RPT) on the service side. The section also provides results of checking hypotheses about the distribution
law of the web service response time and its component values RPT and RTT.
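The decomposition used here treats the client-observed response time (RT) as the sum of the network round trip time (RTT) and the request processing time (RPT) on the service side. A minimal sketch of how RPT can be estimated from paired measurements (illustrative only; it assumes a per-request RTT sample is available alongside each response time, which is a simplification of the actual measurement setup):

```python
def estimate_rpt(response_times_ms, rtts_ms):
    """Estimate the per-request processing time (RPT) on the service
    side, under the model RT = RTT + RPT, i.e. RPT = RT - RTT."""
    return [rt - rtt for rt, rtt in zip(response_times_ms, rtts_ms)]
```

If RTT and RPT were statistically independent, the variance of the response time would be the sum of their variances, so whichever component varies more dominates the response-time uncertainty.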
The uncertainty discovered in web services operations affects the dependability of SOA and will
require additional specific resilience techniques. Exception handling is one of the means widely used for
attaining dependability and supporting recovery in SOA applications. The third section presents the
results of error and fault injection into web services. We summarise our experiments with SOA-specific
exception handling features provided by two tool kits: the Sun Microsystems JAX-RPC and the IBM
WebSphere Software Developer Kit for developing web services. In this section we examine the ability of
built-in exception handling mechanisms to eliminate certain causes of errors and analyse exception
propagation and performance as the major factors affecting fault tolerance (in particular, error handling
and fault diagnosis) in web services.
1. MEASURING DEPENDABILITY AND PERFORMANCE UNCERTAINTY
OF SYSTEM BIOLOGY APPLICATIONS
1.1. Measuring Uncertainty of Blast and Fasta Web Services
In our experiments we dealt with the DNA Data Bank of Japan (DDBJ), which provides an API for bioinformatics and genetic engineering (Miyazaki, & Sugawara, 2000). We benchmarked the Fasta and
Blast web services provided by DDBJ, which implement algorithms commonly used in the in silico
experiments in bioinformatics to search for gene and protein sequences that are similar to a given input
query sequence.
1.1.1. Experimental Technique
A Java client was developed to invoke the Fasta and Blast WSs at DDBJ during five days from
04 June 2008 to 08 June 2008. In particular, we invoked the getSupportDatabaseList operation supported
by both the Fasta and Blast WSs. The size of the SOAP request for the Fasta and Blast WS is 616 bytes,
whereas the SOAP responses are 2128 and 2171 bytes respectively.
The services were invoked simultaneously, in separate threads, every 10 minutes (in total, more than 650 times during the five days). At the same time, the DDBJ server was pinged to assess the network RTT (round trip time) and to take into account the Internet effects on the web service invocation delay. The total number of ICMP Echo requests sent to the DDBJ server was more than 200000 (one every two seconds).
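The measurement procedure described above can be sketched as follows. This is a minimal illustration, not the actual Java client used in the experiments; the invoked operation (e.g. getSupportDatabaseList) is represented by an opaque callable:

```python
import statistics
import time

def benchmark(invoke, rounds, interval_s=0.0):
    """Invoke a service `rounds` times, `interval_s` apart, and
    record the response delay of each call in milliseconds."""
    delays_ms = []
    for _ in range(rounds):
        start = time.perf_counter()
        invoke()  # e.g. a SOAP call such as getSupportDatabaseList
        delays_ms.append((time.perf_counter() - start) * 1000.0)
        time.sleep(interval_s)
    return delays_ms

def summarise(delays_ms):
    """The statistics reported in the tables below: minimal, maximal
    and average delay, and its standard deviation."""
    return {
        "min": min(delays_ms),
        "max": max(delays_ms),
        "avg": statistics.mean(delays_ms),
        "std_dev": statistics.stdev(delays_ms),
    }
```

In the real experiments the invocation loop ran for five days with a 10-minute interval; the sketch only shows the structure of the measurement and the summary statistics computed from the logged delays.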
1.1.2. Performance Trends Analysis
Figure 1 shows the response delays of the Fasta (a) and Blast (b) WSs. In spite of the fact that we
invoked similar operations of these two services with similar sizes of SOAP responses simultaneously,
they had different response times. Moreover, the response time of Blast was more unstable (see the
probability distribution series of Fasta (a) and Blast (b) in Figure 2).
This difference can be explained by internal reasons (such as a varying level of CPU utilization and
memory usage while processing the request, some differences in implementations, etc.). Besides, we
noted a period of time, Time_slot_2 (starting on June 05 at 23:23:48 and lasting for 3 hours and 8
minutes), during which the average response time increased significantly for both Fasta and Blast (see
Figure 1).
Table 1 presents the results of statistical data analysis of response times for the Fasta and Blast
WSs for the stable network route period, Time_slot_1, and for the period when the network route
changed, Time_slot_2.
Figure 1. Response delay trends: (a) Fasta web service; (b) Blast web service
Figure 2. Response time probability distribution series: (a) Fasta web service; (b) Blast web service
The standard deviation of the response time for Fasta is about 16% of its average value, whereas for the Blast web service it equals 27% and 45% for Time_slot_1 and Time_slot_2 respectively. We believe this shows that significant timing uncertainty exists in Service-Oriented Systems.
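These percentages follow directly from the averages and standard deviations reported in Table 1; the calculation is a simple ratio (values below are copied from Table 1):

```python
# (avg, std. dev) response-time values, in ms, from Table 1
stats = {
    ("Fasta", 1): (996.91, 163.28),
    ("Blast", 1): (1071.17, 291.57),
    ("Blast", 2): (1265.72, 572.70),
}

# uncertainty measure: std. dev as a percentage of the average
ratios = {k: 100.0 * sd / avg for k, (avg, sd) in stats.items()}
# Fasta slot 1 gives about 16 %; Blast gives about 27 % and 45 %
```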
Further investigation of the ping delays confirmed that this was a period during which the network
route between the client computer at Newcastle University (UK) and the DDBJ server in Japan changed.
Moreover, during the third time slot we observed 6 packets lost in 20 minutes. Together with the high
RTT deviation, it indicates that significant network congestion occurred.
Table 1. Response time statistics summary
Invocation response time (RT), ms
Time slot min. max. avg. std. dev.
Fasta WS
Time slot 1 937 1953 996.91 163.28
Time slot 2 937 4703 1087.28 171.12
Blast WS
Time slot 1 1000 1750 1071.17 291.57
Time slot 2 1015 3453 1265.72 572.70
1.1.3. PINGing Delay Analysis
Through monitoring the network using ICMP Echo requests, we discovered that the overall testing
interval can be split into three time slots with their own particular characteristics of the communication
medium as shown in Figure 3 and Table 2.
Figure 3. PINGing time slots
Table 2. PINGing statistics summary
PING round trip time (RTT), ms
Time slot min. max. avg. std. dev.
PINGing from Newcastle University LAN (UK)
Time slot 1 309 422 309.21 1.40
Time slot 2 332 699 332.72 3.48
Time slot 3 309 735 312.94 12.73
PINGing from KhAI University LAN (Kharkiv, Ukraine)
- 341 994 396.27 62.14
Time_slot_1 is a long period of time characterized by a highly stable network condition (see Figure 4-a), with an average round trip time (RTT) of 309.21 ms. This was observed over most of the testing period. According to the TTL parameter returned in the ICMP Echo replies from the DDBJ server, the network route contained 17 intermediate hosts (routers) between the Newcastle University campus LAN and the DDBJ server.
Time_slot_2 began on June 05 at 23:23:48 and ended on June 06 at 02:31:30. This was also a fairly stable period, with an average round trip time (RTT) of 332.72 ms (see Figure 4-b); the ratio of the standard deviation of the delay to its average value (referred to as the coefficient of variation), used in the chapter as a measure of uncertainty, was about 1% for this period. The higher average RTT is explained by the fact that during this time slot the network route changed: the number of intermediate hosts (routers) grew from 17 to 20. This also affected the average response time of the Fasta and Blast WSs.
Time_slot_3 is a short period (of about 20 minutes) characterized by a high RTT instability (a
higher value of standard deviation than in time slots 1 and 2) (see Figure 4-c and Table 2 for more
details). It was too short, however, to analyse its impact on the Fasta and Blast response times.
Packet losses occurred during all of the time slots, on average once every two hours (the total number of losses was 44, 8 of which were double losses). Sometimes the RTT increased significantly over a short period, indicating that transient network congestion occurred periodically throughout the testing period.
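For reference, the observed loss rate follows from the figures above (a rough calculation; the exact request count is only stated as "more than 200000"):

```python
# 44 lost packets out of roughly 200000 ICMP Echo requests
losses, requests = 44, 200_000
loss_rate_percent = 100.0 * losses / requests  # about 0.022 %
```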
Figure 4. PING probability distribution series of network round trip time: (a) Time_slot_1;
(b) Time_slot_2; (c) Time_slot_3; (d) pinging from KhAI University LAN
At the same time, we were surprised by the high stability of the network connection over long periods. We had expected greater instability of the round trip time due to the use of the Internet as the medium and the long distance between the client and the Web Services. To investigate this further, the DDBJ server was pinged from the KhAI University LAN (Kharkiv, Ukraine) for another two days. As a result, we observed a significant instability of the RTT (see Figure 4-d): the standard deviation of the RTT is about 16% of its average value, and packet losses occurred, on average, once in every 100 ICMP Echo requests.
We used the tracert command to discover the reason for this instability and found that the route from Ukraine to Japan includes 26 intermediate hosts and goes through the UK (host name ae-2.ebr2.London1.Level3.net, IP address 4.69.132.133), but that the main instability occurs at the side of the local Internet Service Provider (ISP). The RTT standard deviation for the first five intermediate hosts (all located in Ukraine) was extremely high (about 100% of the average value).
As a consequence, the standard deviation of the response times of requests sent to the Fasta and Blast WSs from the KhAI University LAN increased dramatically compared with those sent from Newcastle University. This is a result of the superposition of the high network instability and the performance uncertainty inherent to the Fasta and, especially, the Blast WSs.
1.2. Measuring Uncertainty of BASIS System Biology Web Service
In this section, we present a set of new experiments conducted with an instance of the System Biology
Web Service (BASIS WS) to continue our research on measuring the performance and dependability of
Web Services used in e-science experiments. In the study reported in the previous section we found evident performance instability in SOA that affects the dependability of web services and their clients.
The Fasta and Blast WSs we experimented with were part of the DNA Data Bank of Japan (Miyazaki, & Sugawara, 2000), which was outside our control; thus, we were unable to capture the exact causes of the performance instability. The main difference between that work and our experiments with the BASIS web service, hosted by the Institute for Ageing and Health (Newcastle University), is the fact that this WS is
under our local administration. Thus we are able to look inside its internal architecture and to perform
error and time logging for every external request. Moreover, we have used several clients from which the
BASIS WS was benchmarked to give us a more objective view and to allow us to see whether the
instability affects all clients in the same way or not.
The aims of the work are as follows: (i) to conduct a series of experiments similar to those reported
in the previous section but with access to inside information to obtain a better understanding of the
sources of exceptions and performance instability; (ii) to conduct a wider range of experiments by using
several clients from different locations over the Internet; (iii) to gain an inside understanding of the
bottlenecks of an existing system biology application to help in improving it in the future.
1.2.1. BASIS System Biology Applications
Our experiments were conducted in collaboration with the Systems Biology project BASIS (Biology of Ageing E-Science Integration and Simulation System) (Kirkwood et al., 2003). The BASIS application is a typical, representative example of a number of SOA solutions found in e-science and grid computing. One of the twenty pilot projects funded under the UK e-science initiative for the development of UK grid applications, BASIS, at the Institute for Ageing and Health in Newcastle University, aims at developing web-based services that support the biology-of-ageing research community in the quantitative study of the biology of ageing by integrating data and hypotheses from diverse biological sources. With the association and expertise of the UK National e-Science Centre in building Grid applications, the project has successfully built a system that integrates components such as model design, simulators and databases, and exposes their functionality as Web Services (Institute for Ageing and Health, 2009).
2009). The architecture of the BASIS Web Service (basis1.ncl.ac.uk) is shown in Figure 5.
Figure 5. The architecture of BASIS system
The system is composed of a BASIS Server (2x2.4GHz Xeon CPU, 2GB DDR RAM, 73GB
10,000 rpm U160 SCSI RAID), including a database (PostgreSQL v8.1.3) and Condor v 6.8.0 Grid
Computing Engine; a sixteen computer cluster, an internal 1Gbit network, and a web service interface
deployed on Sun Glassfish v2 Application Server with JAX-WS + JAXB web service development pack.
BASIS offers four main services to the community:
– BASIS Users Service allows users to manage their account;
– BASIS Simulation Service allows users to run simulations from ageing research;
– BASIS SBML Service allows users to create, use and modify SBML models. The Systems Biology Markup Language (SBML) is a machine-readable language, based on XML, for representing models of biochemical reaction networks. SBML can represent metabolic networks, cell-signalling pathways, regulatory networks, and other kinds of systems studied in systems biology;
– BASIS Model Service allows users to manage their models.
The most common BASIS usage scenario is: (i) to upload an SBML simulation model to the BASIS server; (ii) to run the uploaded SBML model with the biological statistics from the BASIS database; (iii) to download the simulation results. The size of the SBML models and simulation results uploaded to and downloaded from the BASIS server can vary over a wide range and can be very large (up to tens or even hundreds of megabytes). This can be a real problem for remote clients, especially those with low-speed or low-quality Internet connections.
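To see why large results are a problem for slow connections, a back-of-the-envelope lower bound on transfer time is enough (the figures below are illustrative, not measured in the experiments):

```python
def transfer_time_s(size_megabytes, link_mbit_per_s):
    """Rough lower bound on transfer time: payload size in megabytes
    over a link rated in megabits per second, ignoring protocol
    overhead, congestion and retransmissions."""
    return size_megabytes * 8.0 / link_mbit_per_s

# A 100 MB simulation result over a 2 Mbit/s link needs at least
# 100 * 8 / 2 = 400 seconds, i.e. more than six minutes.
```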
1.2.2. Experimental Technique
To provide a comprehensive assessment we used five clients deployed in different places over the
Internet: Frankfurt (Germany), Moscow (Russia), Los Angeles (USA) and two clients in Simferopol
(Ukraine) that use different Internet service providers. Figure 6, created by tracing routes between clients
and the BASIS WS, demonstrates different number of intermediate routers between the BASIS WS and
each of the clients. Note that there are parts of the routes common to different clients.
Our plan was to perform prolonged WS testing to capture long-term performance trends and to disclose performance instabilities and possible failures. The GetSMBL method, returning an SBML simulation result of 100 KB, was invoked simultaneously from all clients every 10 minutes during five days starting from Dec 23, 2008 (more than 600 times in total). At the same time the BASIS SBML Web Service was pinged to assess the network round trip time (RTT) and to take into account the Internet effects on the WS invocation delay. The total number of ICMP Echo requests sent to the BASIS server was more than 10000. In addition, we traced the network routes between the clients and the web service to find the exact points of network instability.
The experiment was run over the Christmas week for the following reasons. The University’s
internal network activity was minimal during this week. At the same time the overall Internet activity
typically grows during this time as social networks (e.g. Facebook, YouTube, etc.) and forums experience
a sharp growth during the holidays (Goad, 2008).
A Java-based application called the Web Services Dependability Assessment Tool (WSsDAT), aimed at evaluating the dependability of Web Services (Li, Chen, & Romanovsky, 2006), was used to test the BASIS SBML web service from the remote hosts. The tool supports various methods of dependability testing by acting as a client invoking the Web Services under investigation. It enables users to monitor Web Services by collecting the following characteristics: (i) availability; (ii) performance; (iii) faults and exceptions. During our experimentation we faced several organizational and technical problems: the test from Los Angeles started 16 hours late, and the Moscow client was unexpectedly terminated after the first thirty requests and restarted only five days later, when the first step of the experiment was already finished.
1.2.3. Performance Trends Analysis
Figure 7 shows the response time statistics for the different clients. A summary of the statistical data describing the response time, together with client instability ranks, is presented in Table 3. The average request processing time of the BASIS WS was about 163 ms; thus, the network delay makes the major contribution to the response time. To analyse the performance instability for each particular client we estimated the ratio of the standard deviation (std. dev.) of the response time to its average (avg) value, i.e. the coefficient of variation (cv).
Figure 6. Internet routes used by different clients
The fastest response time (on average) was observed for the client in Frankfurt, whereas the Los Angeles client was the slowest. This situation was easy to predict. However, we also found that the fastest client was not the most stable; quite the contrary, the most stable response time was observed by the client in Los Angeles, and the most unstable by Simferopol_1. From time to time all clients (except Los Angeles) faced very high delays, some of them ten times larger than the average response time and twenty times larger than the minimal values. The clients located in Moscow and Simferopol_1 experienced high instability of response time due to high network instability (as was found from the ping statistics analysis). A deeper analysis of the trace_route statistics revealed the remarkable fact that the network instability (instability of the network delay) occurred on the part of the network route that was closer to the particular client than to the web service.

Access to the inside information (the server log) and additional network statistics (such as the ping and trace_route logs) allowed us to gain a better understanding of the sources of performance instability and exceptions. For example, in Figure 8-a, showing the response time of the Frankfurt client, we can see five time intervals characterized by high response time instability. These were caused by different reasons (see Table 4). During the first and the fourth time intervals all clients were affected by an overload of the BASIS service caused by a high number of external requests and the database backup. The second time interval was the result of BASIS service maintenance: the BASIS server was restarted several times, and as a result all clients caught exceptions periodically and suffered from response time instability.
Table 3. BASIS WS response time statistics summary and client instability ranks

Client location   min, ms   max, ms   avg, ms   std. dev, ms   cv, %   Instability rank   Intermediate routers
Frankfurt             317      6090    383.17          71.91   18.77         IV                    11
Moscow                804     65134   1228.38         437.69   35.63         III                   13
Simferopol_1          683    125531   1186.74         895.18   75.43         I                     22
Simferopol_2          716     11150   1272.12         634.53   49.88         II                    19
Los Angeles          1087      3663   1316.54         129.79    9.86         V                     22
Figure 7. BASIS WS response time trends from different user-side perspectives
The response time instability during the third time interval was caused by extremely high network instability that occurred between the second and the third intermediate routers (counting from the Frankfurt client towards the BASIS service). During this interval the network RTT suddenly increased threefold on average (from 28.3 ms up to 86.7 ms) and showed a large deviation (32.2 ms). The last unstable interval was observed by the Frankfurt client on December 27 (from 02 a.m. to 07 a.m.). In fact, the Frankfurt host is an integration server involved in software development: at the end of the week it performs automatic code merging and unit testing procedures. As a result, the host was overloaded by local tasks, and our testing client even caught several operating system exceptions "java.io.IOException: error=24, Too many open files".
Table 4. Response time instability intervals experienced by the Frankfurt client

Interval 1 (from Dec 23/12:23:59 to Dec 23/13:23:59): BASIS Service overload
Interval 2 (from Dec 23/23:03:59 to Dec 24/01:44:00): BASIS Service failure and maintenance actions
Interval 3 (from Dec 24/11:34:00 to Dec 24/17:44:01): Network delay instability due to network congestion
Interval 4 (from Dec 25/14:24:15 to Dec 26/00:14:15): BASIS Database backup
Interval 5 (from Dec 27/02:14:23 to Dec 27/07:14:23): Local host overload
Figure 8. WS Response time for German client: (a) response time trend;
(b) response time probability distribution series
1.2.4. Response Time Probability Density Analysis
Probability distribution series of the response time obtained from the statistics of different clients are shown in Figures 8-b and 9. As can be seen, all the probability distribution series of the service response time taken in our experiments from different clients' perspectives tend to be described by the Poisson law, whereas the network RTT and the request processing time (RPT) by the BASIS WS match the Exponential distribution well.

However, unlike the Poisson and Exponential distributions, all the probability distribution series obtained in our experiments have heavy tails caused by 'real world' instability, when delays increase suddenly and significantly for reasons that are hard to predict.

This finding is in line with Reinecke, van Moorsel, and Wolter (2006) and other existing experimental works. Thus, more realistic assumptions and more sophisticated distribution laws are needed to fit the practical data better. It may be the case that the Exponential distribution of RTT and RPT can be replaced with one of the heavy-tailed distributions such as log-normal, Weibull or Beta. At the same time, the service RT for different clients could be described in a more complex way as a composition of two distributions: RTT (which is unique for each particular client) and RPT (which is unique for the service used and, hence, is the same for all clients with the identical priority).
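The composition idea can be illustrated with a small Monte Carlo sketch: sample RTT and RPT from two independent heavy-tailed distributions and add them to obtain RT. The distribution families and all parameters below are illustrative assumptions, not values fitted to the BASIS measurements.

```python
import random

random.seed(42)  # reproducible sketch

def sample_response_times(n):
    """Compose RT = RTT + RPT from two independent heavy-tailed delays.

    Parameters are purely illustrative: a log-normal network delay unique
    to one client, plus a Gamma-distributed service processing delay."""
    rts = []
    for _ in range(n):
        rtt = random.lognormvariate(5.9, 0.3)   # client-specific network delay, ms
        rpt = random.gammavariate(4.0, 40.0)    # service processing delay, ms
        rts.append(rtt + rpt)
    return rts

rts = sample_response_times(10_000)
mean_rt = sum(rts) / len(rts)
print(f"mean composed RT ~ {mean_rt:.0f} ms")
```

Changing only the RTT component then models how the same service looks from a different client location, which is exactly the per-client view discussed in the text.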
1.2.5. Errors and Exceptions Analysis
During our experiments, different clients caught different errors and exceptions at different error rates. Most of them (1-3 in Table 5) were caused by BASIS Service maintenance, when the BASIS WS, server and database were restarted several times.

The first one ('Null SOAP body') resulted in a null-sized response from the web service. It is a true failure that may potentially cause a dangerous situation, as it was not reported as an exception! According to the server-side log, these failures were caused by errors that occurred when the BASIS business logic processing component was trying to connect to the database. As the database was shut down due to an exception, this component failed to handle the connection exception and returned empty results to the client. Apparently, the BASIS WS should be improved to provide better mechanisms for error diagnosis and exception handling.
Figure 9. Probabilities Distribution Series of WS Response Time from different user-side perspectives:
(a) Moscow, Russia; (b) Los Angeles, USA; (c) Simferopol_1, Ukraine; (d) Simferopol_2, Ukraine
The second exception was caused by a BASIS WS shutdown, whereas the third one was likely the result of a BASIS server shutdown while the BASIS WS was operating. However, we cannot be sure, because a 'Null pointer exception' gives too little information for troubleshooting. The fourth and fifth exceptions were caused by network problems. It is noteworthy that the 'UnknownHostException' caused by an unresponsive DNS server took about two minutes (far too long!) to be reported to the client.
Table 5. BASIS WS errors and exceptions statistics (number of exceptions per client)

#   Error/Exception                                                                       Germany   Simferopol_1   Simferopol_2
1   Error: Null SOAP body                                                                    4            4              6
2   Exception: HTTP transport error: java.net.ConnectException: Connection refused           2            0              0
3   Exception: java.lang.NullPointerException                                                3            4              3
4   Exception: HTTP transport error: java.net.NoRouteToHostException: No route to host       0            1              2
5   Exception: HTTP transport error: java.net.UnknownHostException: basis1.ncl.ac.uk         0            0              1
    Error rate                                                                             0.015        0.015           0.02
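The error rate in the last row is simply the fraction of invocations that ended in any error or exception. A minimal sketch, where the request total of 600 is a hypothetical figure chosen only to reproduce the tabulated rate (the actual per-client request counts are not listed in the table):

```python
def error_rate(exception_counts, total_requests):
    """Fraction of invocations that ended in any error or exception."""
    return sum(exception_counts) / total_requests

# Germany column of Table 5 (errors/exceptions 1-5); 600 is hypothetical
germany_counts = [4, 2, 3, 0, 0]
rate = error_rate(germany_counts, total_requests=600)
print(f"error rate = {rate:.3f}")
```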
1.3. Discussion
The purpose of the first section was to examine the uncertainty inherent in Service-Oriented Systems by benchmarking three bioinformatics web services typically used to perform in silico experiments in systems biology studies. The main finding is that the uncertainty comes from three sources: the Internet, the web service and the client itself. Network instability, as well as the internal instability of web service throughput, significantly affects the service response time. Occasional transient and long-term Internet congestion and difficult-to-predict network route changes affect the stability of Service-Oriented Systems. Because of network congestion causing packet losses and multiple retransmissions, the response time can sharply increase by an order of magnitude.

Because of the Internet, different clients have their own view of Web Service performance and dependability. Each client has its own unique network route to the web service; however, some parts of the route can be shared by several clients or even by all of them (see Figure 6). Thus, the number of clients simultaneously suffering from Internet instability depends on the point where network congestion or failures happen. More objective data might be obtained by aggregating the clients' experience, for example, in the way proposed in (Zheng et al., 2010), and/or by having internal access to the Web Service operational statistics.
We can also conclude from our practical experience that the instability of the response time depends on the quality of the network connection used rather than on the length of the network route or the number of intermediate routers. The QoS of a particular WS cannot be ensured without guaranteeing network QoS, especially when the Internet is used as the communication medium for a global service-oriented architecture.
During the WS invocation different clients caught different numbers of errors and exceptions, but not all of them were caused by service unreliability. In fact, some clients were successfully serviced whereas others, at the same time, faced various problems due to timing errors or network failures. These errors might occur in different system components depending on the relative positions in the Internet of a particular user and a web service, and on the instability points appearing during execution. As a result, web services might be compromised by client-side or network failures which, actually, are not related to web service dependability. Most of the time, clients are not very interested in their exact cause. Thus, from different client-side perspectives the same web service usually has different availability and reliability characteristics. A possible approach to predicting the reliability of SOA-based applications, given their uncertainty, through user collaboration is proposed in (Zheng & Lyu, 2010).
Finally, the Exponential distribution typically used for network simulation and response time analysis does not represent well such unstable environments as the Internet and SOA. We believe that the SOA community needs a new exploratory theory and more realistic assumptions to predict and simulate the performance and dependability of Service-Oriented Systems by distinguishing the different delays contributing to the WS response time.
2. INSTABILITY MEASUREMENT OF DELAYS CONTRIBUTING TO
WEB SERVICE RESPONSE TIME
This section reports a continuation of our previous work with the BASIS System Biology Applications, aiming to measure the performance and dependability of e-science WSs from the end user's perspective. In the previous investigation we found evident performance instability existing in these SOAs and affecting the dependability of both the WSs and their clients. However, we were unable to capture the exact causes and shapes of the performance instability. In this section we focus on distinguishing between the different delays contributing to the overall Web Service response time. In addition, new theoretical results are presented at the end of this section, where we rigorously investigate the real distribution laws describing response time instability in SOA.
2.1. Experimental Technique
Basically, we used the same experimental technique as described in the previous section. The BASIS WS, returning an SBML simulation result, was invoked by client software placed in five different locations (Frankfurt, Moscow, Los Angeles and two in Simferopol) every 10 minutes during eighteen days starting from April 11, 2009 (more than 2500 times in total). Simultaneously, we traced the network route (by sending ICMP echo requests) between the client and the BASIS SBML web service to understand how the Internet latency affects the WS invocation delay and to locate, where possible, the exact points of network instability.

At the same time, there are significant differences between the measurement technique presented in the previous section and the work reported in this section. The main one is that in our new experiments we measure four time-stamps T1, T2, T3 and T4 (see Figure 10) for each request instead of only T1 and T4 (as was done in the previous experiments). This became possible because during the new experiments we had internal access to the BASIS WS and were able to install monitoring software directly into the BASIS WS to capture the exact times when a user's request arrives at BASIS and when BASIS returns the response. This allowed us to separate the two main delays contributing to the WS response time (RT): the request processing time (RPT) by the web service and the network (Internet) round trip time (RTT), i.e. RT = RPT + RTT.
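The decomposition from the four time-stamps can be sketched as follows. The exact semantics of T1-T4 are assumed from the text (Figure 10 is not reproduced here); Python stands in for the measurement tooling.

```python
def split_response_time(t1, t2, t3, t4):
    """Split the client-observed response time into RPT and RTT.

    Assumed timestamp meanings:
      t1: client sends the request     t2: request arrives at the service
      t3: service sends the response   t4: response arrives at the client
    t2 and t3 come from the server-side monitor; a constant clock offset
    between client and server cancels out in the RTT formula below."""
    rt = t4 - t1          # total response time seen by the client
    rpt = t3 - t2         # request processing time inside the WS
    rtt = rt - rpt        # network round trip time, so RT = RPT + RTT
    return rt, rpt, rtt

rt, rpt, rtt = split_response_time(t1=0.0, t2=190.0, t3=640.0, t4=840.0)
print(rt, rpt, rtt)  # 840.0 450.0 390.0
```

Because RTT is computed as a difference of two locally-measured intervals, the client and server clocks do not need to be synchronised, only reasonably stable over one request.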
Besides, we investigated how the performance of the BASIS WS and its instability changed during the three months since our previous large-scale experiment, to check the hypothesis that such characteristics, once measured, stay true. Finally, when setting up these new experiments we wanted to know whether there is a way to predict and represent the performance uncertainty in SOA by employing one of the theoretical distributions used to describe random variables such as the web service response time. A motivation for this is the fact, shown by many studies (e.g. Reinecke et al., 2006), that the Exponential distribution does not represent well the accidental delays in the Internet and SOA. After processing the statistics for all the clients located in different places over the Internet we found the same uncertainty tendencies; thus, in this section we report results obtained only for the Frankfurt client.
Figure 10. Performance measurement
2.2. Performance Trends Analysis
Performance trends and probability density series of RPT, RTT and RT captured during the eighteen days are shown in Figure 11. It can be seen that RTT, and especially RPT, exhibit significant instability, which together contribute to the instability of the total response time RT.

Sometimes delays were twenty times (or even more) longer than their average values (see Table 6). In brackets we give estimates of the maximal and average values of RPT, RTT and RT and of their standard deviations obtained after discarding the ten most extreme delay values. The ratio between the delay standard deviation and its average value is used as the uncertainty measure.
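The effect of discarding the most extreme delays, which produces the bracketed values in Table 6, can be sketched like this (Python as a stand-in for the Matlab processing; the data below are illustrative, not the BASIS measurements):

```python
import statistics

def summary(xs):
    """Max, average, std. dev and cv (%) of a delay sample."""
    avg = statistics.mean(xs)
    std = statistics.pstdev(xs)
    return {"max": max(xs), "avg": avg, "std": std, "cv": 100.0 * std / avg}

def with_and_without_extremes(delays_ms, discard=10):
    """Statistics before and after discarding the `discard` largest delays,
    mirroring the bracketed values in Table 6."""
    trimmed = sorted(delays_ms)[:-discard] if discard else sorted(delays_ms)
    return summary(delays_ms), summary(trimmed)

# Illustrative data: a stable baseline around 400-900 ms plus three huge spikes
data = [400 + 5 * i for i in range(100)] + [20000, 60000, 240000]
raw, trimmed = with_and_without_extremes(data, discard=3)
print(f"cv raw = {raw['cv']:.0f}%, cv trimmed = {trimmed['cv']:.0f}%")
```

A handful of spikes dominates both the maximum and the standard deviation, which is why the raw and bracketed figures in Table 6 differ so sharply.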
Compared with our experiments three months earlier, we observed a significant increase in the average response time recorded by the Frankfurt client (889.7 ms instead of 502.15 ms, see Table 6). In addition, the uncertainty of BASIS performance from the client-side perspective increased several-fold (94.1% instead of 18.77%). The network route between the BASIS WS and the Frankfurt client also changed significantly (18 intermediate routers instead of 11).
In our current work we set the number of bars in the histogram representing a probability density series (see Figure 11) equal to the square root of the number of elements in the experimental data, similarly to the Matlab histfit(x) function. This allowed us to discover new interesting properties.
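The square-root binning rule is straightforward to reproduce; a minimal sketch (Python instead of Matlab, with rounding up assumed where the square root is not an integer):

```python
import math
from collections import Counter

def sqrt_rule_histogram(values):
    """Equal-width histogram with ceil(sqrt(n)) bins, the binning rule the
    text describes as similar to Matlab's histfit(x)."""
    n_bins = math.ceil(math.sqrt(len(values)))
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data
    bin_of = lambda v: min(int((v - lo) / width), n_bins - 1)
    counts = Counter(bin_of(v) for v in values)
    return [counts.get(i, 0) for i in range(n_bins)]

hist = sqrt_rule_histogram([float(i) for i in range(100)])
print(len(hist), hist)  # 10 bins for 100 samples
```

With roughly 2500 samples this rule gives about 50 bins, fine enough to expose features such as the secondary RTT peak discussed below.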
Figure 11. Performance trends and probability density series: RPT, RTT and RT
In particular, we could see that about 5% of the RPT, RTT and RT values are significantly larger than their average values. It is also clear that the probability distribution series of RTT has two extreme points. Besides, more than five percent of the RTT values are about 80 ms (roughly one fifth) below the average one. Tracing the routes between the client and the service allows us to conclude that these fast responses were caused by a shortening of the network route. This would be very unusual for RPT but is typical for the Internet. Finally, this peculiarity of RTT produces the observable left tail in the RT probability distribution series. It also makes it difficult to find a theoretical distribution representing RTT.
Table 6. Performance statistics: RPT, RTT, RT (values in brackets were obtained after discarding the ten most extreme delays)

           Min, ms   Max, ms               Avg, ms           Std. dev, ms      Cv, %
RPT          287.0   241106.0 (8182.0)     657.7 (497.6)     4988.0 (773.5)    758.4 (155.4)
RTT          210.0    19445.0 (1479.0)     405.8 (378.2)      621.1 (49.2)     153.1 (13.0)
RT           616.0   241492.0 (11224.0)   1061.5 (889.7)     5031.0 (837.4)    474.1 (94.1)
Ping RTT      26.4      346.9 (50.4)         32.0 (31.9)        3.6 (0.9)       11.3 (2.8)
Regarding availability, we should mention that the BASIS WS was unavailable for four hours (starting from 19:00 on April 11) because of network rerouting. Besides, on two occasions the WS reported an exception instead of returning the normal results.
2.3. Retrieval of Real Distribution Laws of Web Service Delays
2.3.1. Hypothesis Checking Technique
In this section we provide the results of checking hypotheses about the distribution law of the web service response time (RT) and its component values RPT and RTT. In our work we used the Matlab numeric computing environment (www.mathworks.com) and its Statistics Toolbox, a collection of tools supporting a wide range of general statistical functions, from random number generation to curve fitting. The hypothesis-checking technique consists of two basic procedures. First, the values of the distribution parameters are estimated by analysing the experimental samples. Second, the null hypothesis that the experimental data have a particular distribution with certain parameters is checked. To perform the hypothesis checking itself we used the kstest function: [h, p] = kstest(x, cdf) performs a Kolmogorov-Smirnov test comparing the distribution of x to the hypothesized distribution defined by the matrix cdf.

The null hypothesis for the Kolmogorov-Smirnov test is that x has the distribution defined by cdf; the alternative hypothesis is that x does not have that distribution. The result h is equal to "1" if we can reject the hypothesis, or "0" if we cannot. The function also returns the p-value, i.e. the probability, under the null hypothesis, of observing a test statistic at least as extreme as the one computed from x. We reject the hypothesis if the test is significant at the 5% level (i.e. if the p-value is less than 0.05).
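The mechanics of the test can be sketched without Matlab. The following minimal one-sample Kolmogorov-Smirnov implementation (pure Python, using the standard asymptotic approximation of the p-value) plays the role of kstest:

```python
import math

def ks_test(samples, cdf):
    """One-sample Kolmogorov-Smirnov test, playing the role of Matlab's kstest.

    Returns the KS statistic D and an asymptotic p-value; the null hypothesis
    is that `samples` follow the distribution whose CDF is `cdf`."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    # Asymptotic tail probability of the Kolmogorov distribution
    # (with the small-sample correction used e.g. in Numerical Recipes)
    t = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * t * t)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)

# Evenly spread Uniform(0,1) points against the Uniform(0,1) CDF: not rejected
points = [(i + 0.5) / 50 for i in range(50)]
d, p = ks_test(points, cdf=lambda x: min(max(x, 0.0), 1.0))
h = 1 if p < 0.05 else 0   # Matlab-style reject flag
print(f"D = {d:.3f}, p = {p:.2f}, h = {h}")
```

Testing the same points against a wrong CDF (e.g. x cubed) yields a large D and a p-value far below 0.05, i.e. h = 1.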
2.3.2. Goodness-of-Fit Analysis
In our experimental work we checked six hypotheses: that the experimental data conform to the Exponential, Gamma, Beta, Normal, Weibull or Poisson distribution. These checks were performed for the request processing time (RPT), the round trip time (RTT) and the response time (RT) as a whole.

Our main finding is that none of the distributions fits the whole performance statistics gathered during the 18 days. Moreover, the more experimental data we used, the worse the approximation provided by all the distributions! This means that, in the general case, the instability existing in a Service-Oriented Architecture cannot be predicted and described by an analytic formula. Further work focused on finding a distribution law that fits the experimental data within limited time intervals. We chose two short time intervals with the most stable (from 0:24:28 of April 12 until 1:17:50 of April 14) and the least stable (from 8:31:20 of April 23 until 22:51:36 of April 23) response time.

The first time interval includes 293 request samples. The results of hypothesis checking for RPT, RTT and RT are given in Tables 7, 8 and 9 respectively. The p-value returned by the kstest function was used to estimate the goodness-of-fit of each hypothesis. As can be seen, the Beta, Weibull and especially the Gamma distribution (equation (1)) fit the experimental data better than the others. Besides, RPT is approximated by these distributions better than RT and RTT.
y = f(x | a, b) = (1 / (b^a Γ(a))) x^(a-1) e^(-x/b)     (1)
Typically, the Gamma probability density function (PDF) is used in reliability models of lifetimes. This distribution is more flexible than the Exponential one, which is a special case of the Gamma distribution (when a = 1). It is remarkable that, in our case, the Exponential distribution describes the experimental data worst of all.

However, a close approximation, even using the Gamma function, can be achieved only within a limited sample interval (25 samples in our case). Moreover, RTT (and sometimes RT) can hardly be approximated even with such a limited sample length.
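The Gamma density of equation (1), and its reduction to the Exponential density at a = 1, can be written down directly. A sketch in Python (the shape/scale parameterisation of Matlab's gampdf is assumed):

```python
import math

def gamma_pdf(x, a, b):
    """Gamma density f(x | a, b) = x^(a-1) e^(-x/b) / (b^a Gamma(a)),
    the shape/scale parameterisation of equation (1)."""
    if x < 0:
        return 0.0
    return x ** (a - 1) * math.exp(-x / b) / (b ** a * math.gamma(a))

def exp_pdf(x, b):
    """Exponential density with mean b: the Gamma special case a = 1."""
    return 0.0 if x < 0 else math.exp(-x / b) / b

# The Gamma density with a = 1 coincides with the Exponential density
for x in (0.5, 1.0, 3.0):
    assert abs(gamma_pdf(x, 1.0, 2.0) - exp_pdf(x, 2.0)) < 1e-12
print(f"f(1 | a=2, b=1) = {gamma_pdf(1.0, 2.0, 1.0):.4f}")  # e^-1 ~ 0.3679
```

The extra shape parameter a is what lets the Gamma distribution track the experimental data where the Exponential (a fixed at 1) cannot.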
Table 7. RPT goodness-of-fit approximation (p-values)

Number of requests   Exp.          Gam.         Norm.        Beta         Weib.        Poiss.
293 (all)            7.8×10^-100   1.1×10^-06   9.5×10^-63   9.3×10^-25   2.3×10^-11   4.9×10^-66
First half           1.1×10^-99    0.0468       1.2×10^-62   0.0222       0.00023      1.1×10^-65
Second half          1.3×10^-47    0.2554       5.1×10^-30   0.2907       0.0729       1.6×10^-31
First 50             6.9×10^-18    0.2456       2.3×10^-11   0.2149       0.0830       7.5×10^-12
First 25             2.3×10^-09    0.9773       5.1×10^-06   0.9670       0.5638       2.9×10^-06
Second 25            2.5×10^-09    0.2034       5.2×10^-06   0.1781       0.0508       3.1×10^-06
Table 8. RTT goodness-of-fit approximation (p-values)

Number of requests   Exp.          Gam.         Norm.        Beta         Weib.        Poiss.
293 (all)            2.1×10^-94    5.1×10^-30   4.4×10^-59   7.0×10^-39   5.0×10^-38   7.5×10^-85
First half           6.5×10^-52    2.6×10^-17   9.1×10^-33   1.1×10^-16   2.6×10^-19   1.0×10^-45
Second half          5.0×10^-44    2.5×10^-11   1.8×10^-27   4.6×10^-16   4.6×10^-13   8.1×10^-40
First 50             8.1×10^-18    1.9×10^-04   2.1×10^-11   2.9×10^-04   2.0×10^-07   2.1×10^-15
First 25             2.7×10^-09    0.004        4.2×10^-06   0.0043       0.0133       4.6×10^-08
Second 25            1.6×10^-09    6.0×10^-04   4.0×10^-06   5.4×10^-04   3.5×10^-04   4.8×10^-08
Table 9. RT goodness-of-fit approximation (p-values)

Number of requests   Exp.          Gam.         Norm.        Beta         Weib.        Poiss.
293 (all)            1.6×10^-96    1.8×10^-14   4.4×10^-60   4.4×10^-29   1.0×10^-19   4.0×10^-67
First half           2.6×10^-52    0.0054       9.4×10^-33   0.0048       1.1×10^-06   2.6×10^-35
Second half          1.0×10^-45    9.8×10^-08   1.9×10^-28   5.2×10^-15   9.1×10^-09   2.2×10^-32
First 50             6.1×10^-18    0.1159       2.1×10^-11   0.1083       0.1150       6.1×10^-12
First 25             2.4×10^-09    0.8776       4.2×10^-06   0.8909       0.7175       2.7×10^-06
Second 25            1.9×10^-09    0.0843       4.5×10^-06   0.0799       0.0288       2.8×10^-06
For the second time interval all six hypotheses failed, as the p-values were below the 5% significance level. Thus, we can state that the deviation of the experimental data significantly affects the goodness of fit. However, we should also mention that, even here, the Gamma distribution gave a better approximation than the other five distributions.
2.4. Discussion
In these experiments the major uncertainty came from the BASIS WS itself, whereas in the experiments conducted three months before (during the Christmas week) the Internet was most likely the main cause of the uncertainty.
An important fact we found is that RPT has a higher instability than RTT; in spite of this, however, RPT can be better represented by a particular theoretical distribution. At the same time, the probability distribution series of RTT has unique characteristics that make it really difficult to describe theoretically. Among the existing theoretical distributions, Gamma, Beta and Weibull capture our experimental response time statistics better than the others. However, the goodness of fit is acceptable only within short time intervals.
We should also mention here that performance and other dependability characteristics of WSs can become out of date very quickly. The BASIS response time changed significantly after three months, despite there being no essential changes in its architecture apart from changes in the usage profile and the Internet routes. The BASIS WS is a typical example of a number of SOA solutions found in e-science and grid computing. It has a rather complex structure which integrates a number of components, such as an SBML modeller and simulator, a database, a grid computing engine and a computing cluster, typically used for many in silico studies in systems biology. We believe that performance uncertainty, which is partially due to the systems themselves, can be reduced by further optimisation of the internal structure and by the right choice of components and technologies that suit each other and fit the system requirements better.
Finally, our concrete suggestion for bio-scientists using BASIS is to set a time-out that is 1.2 times longer than the average response time estimated over the last 20-25 requests. When the time-out is exceeded, a recovery action based on a simple retry is effective most of the time in dealing with transient congestion in the Internet and/or the BASIS WS. A more sophisticated technique that predicts the response time more precisely and sets the time-out accordingly should take into account both the average response time and the coefficient of variation. To be more dependable, clients should also distinguish between different exceptions and handle them in different ways depending on the exception source.
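The suggested adaptive time-out can be sketched as a small helper class. This is an illustrative sketch of the rule from the text (1.2 times the rolling average of the last 20-25 requests); the class, its defaults and the initial fallback value are assumptions, not part of BASIS.

```python
from collections import deque

class AdaptiveTimeout:
    """Time-out set to `factor` times the average response time of the
    last `window` requests, per the suggestion in the text."""

    def __init__(self, window=25, factor=1.2, initial_ms=2000.0):
        self.samples = deque(maxlen=window)  # sliding window of recent RTs
        self.factor = factor
        self.initial_ms = initial_ms

    def record(self, response_time_ms):
        """Feed the response time of a completed request."""
        self.samples.append(response_time_ms)

    def timeout_ms(self):
        """Current time-out; falls back to the default before any samples."""
        if not self.samples:
            return self.initial_ms
        return self.factor * sum(self.samples) / len(self.samples)

timer = AdaptiveTimeout()
for rt in (400.0, 420.0, 380.0):
    timer.record(rt)
print(timer.timeout_ms())  # 1.2 x the 400 ms rolling average
```

When a call exceeds timeout_ms(), the client would abort and retry once, which, as noted above, handles most transient congestion.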
All experimental results, including the invocation and ping RTT statistics for the Frankfurt client and the probability distribution series (RPT, RTT and RT), can be found at http://homepages.cs.ncl.ac.uk/alexander.romanovsky/home.formal/Server-for-site.xls. An extended version of this section has been submitted to SRDS'2010.
3. BENCHMARKING EXCEPTION PROPAGATION MECHANISMS
Exception handling is one of the popular means of improving dependability and supporting recovery in the Service-Oriented Architecture. Knowing the exact causes and sources of the exceptions raised during the operation of a Web Service allows developers to apply the most suitable fault-tolerance and error recovery techniques (AmberPoint, 2003). In this section we present an experimental analysis of the SOA-specific exception propagation mechanisms and provide some insights into the differences in error handling and propagation delays between two implementations of web services, in the IBM WebSphere SDK and the Sun Java application server SDK. We analyse the ability of the exception propagation mechanisms of the two Web Services development toolkits to disclose the exact root causes of different exceptions, and we examine their implications for the performance and uncertainty of the SOA applications using them.

To provide such an analysis we used fault injection, a well-proven method for assessing the dependability and fault tolerance of a computing system. In particular, Looker et al. (2004) and Duraes et al. (2004) present practical approaches for dependability benchmarking and for evaluating the robustness of Web Services. However, the existing works neither consider the propagation behaviour of the exceptions raised by the injected faults nor study the performance of exception propagation on different Web Services platforms.
3.1. Experimental Technique
To conduct our experiments we first implemented a Java class, WSCalc, which performs a simple arithmetic operation upon two integers, converting the result into a string. Then we implemented two testbed Web Services using two different development toolkits: i) Sun Java System (SJS) Application Server and ii) IBM WebSphere Software Developer Kit (WSDK). The next step was to analyse SOA-specific errors and failures and inject them into the testbed web service architecture. Finally, we analysed and compared the exception propagation mechanisms and their performance implications.
3.2. Web Services Development Toolkits
In our work we experimented with two widely used technologies: the Java cross-platform technology developed by Sun, and the IBM Web Service development environments and runtime application servers. The reasons for this choice are that Sun develops most of the standards and reference implementations of Java Enterprise software, whereas IBM is the largest enterprise software company.

NetBeans IDE/SJS Application Server. NetBeans IDE is a powerful integrated environment for developing applications on the Java platform, supporting Web Services technologies through the Java Platform Enterprise Edition (J2EE). Sun Java System (SJS) Application Server is the Java EE implementation by Sun Microsystems. NetBeans IDE with the SJS Application Server supports JSR-109, a development paradigm suited to J2EE development, based on JAX-RPC (JSR-101).

IBM WSDK for Web Services. IBM WebSphere Software Developer Kit Version 5.1 (WSDK) is an integrated kit for creating, discovering, invoking, and testing Web Services. WSDK v5.1 is based on WebSphere Application Server v5.0.2 and provides support for the following open industry standards: SOAP 1.1, WSDL 1.1, UDDI 2.0, JAX-RPC 1.0, EJB 2.0, Enterprise Web Services 1.0, WSDL4J, UDDI4J, and WS-Security. WSDK can be used with the Eclipse IDE, which provides a graphical interactive development environment for building and testing Java applications. Supporting the latest specifications for Web Services, WSDK enables developers to build, test, and deploy Web Services on the industry-leading IBM WebSphere Application Server. The functionality of WSDK v5.1 has been incorporated into the IBM WebSphere Studio family of products.
Note that at the time of writing, the JAX-RPC framework was being extensively replaced by the newer JAX-WS framework (with SOAP 1.2 compliance), but we believe our findings will still apply to present and future Web Services technologies, as they will face the same dependability issues.
3.3. Web Service Testbed
The starting point for developing a JAX-RPC WS is coding a service endpoint interface and an implementation class whose public methods must declare java.rmi.RemoteException. To analyse the features of the exception propagation mechanisms in the service-oriented architecture we developed a testbed WS executing simple arithmetic operations. The implementation bean class of the Web Service providing arithmetic operations is shown in Figure 12.
package ai.xai12.loony.wscalc;
public class WSCalc implements WSCalcSEI {
public String getMul (int a, int b) {
return new Integer(a * b).toString();
}
...
}
Figure 12. The implementation bean class of the Web Service
providing simple arithmetic operations
The testbed service was implemented using the two different development kits provided by Sun and IBM. The two diverse web services obtained in this way were deployed on two hosts using the same runtime environment (hardware platform and operating system) but different application servers: i) IBM WebSphere and ii) SJS AppServer. These hosts, running Windows XP Professional Edition, were located in the university LAN. Thus, transfer delays and other network problems were insignificant and affected both testbed services in the same way.
3.4. Error and Failure Injection
In our work we experimented with 18 types of SOA-specific errors and failures occurring during service binding and invocation, SOAP messaging and request processing by a web service (see Table 10), divided into three main categories: (i) network and system failures, (ii) service errors and failures, and (iii) client-side binding errors. They are general (not application-specific) and can appear in any Web Service application during operation.
Table 10. SOA-specific errors and failures

Network and system failures:
1. Network connection break-off
2. Domain Name System is down
3. Loss of request/response packet
4. Remote host unavailable
5. Application Server is down

Service errors and failures:
6. Suspension of WS during transaction
7. System run-time error
8. Application run-time error
9. Error causing user-defined exception

Client-side binding errors:
10. Error in Target Name Space
11. Error in Web Service name
12. Error in service port name
13. Error in service operation name
14. Output parameter type mismatch
15. Input parameter type mismatch
16. Error in name of input parameter
17. Mismatching number of input parameters
18. WS style mismatching ("Rpc" or "Doc")
Common network failures include a DNS outage and packets lost due to network congestion. Besides, the operation of a WS depends on the operation of system software such as the web server, application server and database management system. In our work we analysed the failures occurring when the application servers (WebSphere or SJS AppServer) were shut down.
Client errors in early binding or dynamic interface invocation (DII), such as "Error in Target Name Space" or "Error in Web Service name", occur because of changes in the invocation parameters and/or inconsistencies between the WSDL description and the service interface. Finally, the service failures are connected with program faults and run-time errors causing system- or user-defined exceptions. System run-time errors like "Stack overflow" or "Lack of memory" raise exceptions at the level of the system as a whole. A "Division by zero" operation is also caught and raises an exception at the system level, and it is easier to simulate than the other system errors.
Typical examples of application run-time errors are "Operand type mismatch", "Product overflow" and "Index out of bounds". In our experiments we injected the "Operand type mismatch" error, hangs of the WS caused by its program getting into an infinite loop, and an error causing a user-defined exception (an exception defined by a programmer during WS development).
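The injected "error causing user-defined exception" can be illustrated by a service method that raises a programmer-defined exception; the class and method names below are hypothetical sketches, not taken from our testbed services:

```java
// Hypothetical user-defined exception, as a WS developer might declare it.
class UserException extends Exception {
    UserException(String message) { super(message); }
}

public class CalcService {
    // Injected fault: an input outside the expected range raises the
    // user-defined exception, which the SOAP stack then wraps into a fault
    // propagated to the client (cf. Table 11).
    public static int divide(int a, int b) throws UserException {
        if (b == 0) {
            throw new UserException("divisor must be non-zero");
        }
        return a / b;
    }
}
```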
Service failures (6, 7, 8) were simulated by fault injection on the service side. Client-side binding errors (10-18), which are in fact a set of robustness tests (i.e. invalid web-service call parameters), were applied during web-service invocation in order to reveal possible robustness problems in the web-services middleware. We used a compile-time injection technique (Looker et al., 2004), in which the source code is modified to inject simulated errors and faults into the system.
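A minimal sketch of the compile-time injection technique: the fault is compiled into the service code but guarded by a flag, so the same build can run correctly or produce the simulated "Division by zero" system error (the class and flag names are illustrative):

```java
public class FaultyAdder {
    // Compile-time injected fault, enabled by a flag set before deployment.
    static boolean INJECT_DIVISION_BY_ZERO = false;

    public static int add(int a, int b) {
        if (INJECT_DIVISION_BY_ZERO) {
            // Simulated system run-time error: raises ArithmeticException,
            // which surfaces to the client as a SOAP fault (cf. Table 11).
            int zero = 0;
            return a / zero;
        }
        return a + b;
    }
}
```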
Network and system failures were simulated by manually shutting down the DNS server, the application server and the network connections on the client and service sides.
3.5. Errors and Exceptions Correspondence Analysis
Table 11 describes the relationship between errors/failures and the exceptions raised at the top level on different application platforms. As we discovered, some injected errors and failures cause the same exception, so we were not always able to determine the precise cause of an exception. There are several groups of such indistinguishable errors and failures: 1 and 2 (Sun); 3 and 6 (Sun); 4 and 5 (Sun); 1, 2 and 5 (IBM); 3 and 6 (IBM).
Some client-side binding errors (11 - "Error in Web Service name", 12 - "Error in service port name") neither raise exceptions nor affect the service output. This happens because the WS is actually invoked by its address location, whereas the service and port names are only used as supplementary information. Moreover, the WS developed using IBM WSDK and deployed on the IBM WebSphere application server tolerates the following binding errors internally: 10 - "Error in Target Name Space", 14 - "Output parameter type mismatch", and 16 - "Error in name of input parameter". These features are supported by the WSDL description and a built-in function for automatic type conversion.
Errors in the name of an input parameter were tolerated because, in the IBM implementation, checking the order of parameters takes priority over matching parameter names. On the other hand, this means WebSphere is unable to detect a potentially dangerous situation resulting from a parameter mix-up.
Table 11. Examples of top-level exceptions raised by different types of errors and failures

Network connection break-off; DNS is down
  Sun Microsystems WS Toolkit: "HTTP transport error: java.net.UnknownHostException: c1.xai12.ai"
  IBM WS Toolkit (WSDK): "{http://websphere.ibm.com/webservices/} Server.generalException"

Remote host unavailable (off-line)
  Sun Microsystems WS Toolkit: "HTTP Status-Code 404: Not Found - /WS/WSCalc"
  IBM WS Toolkit (WSDK): "{http://websphere.ibm.com/webservices/} HTTP faultString: (404)Not Found"

Suspension of Web Service during transaction
  Sun Microsystems WS Toolkit: waiting for a response for a very long time (more than 2 hours) without any exception
  IBM WS Toolkit (WSDK): "{http://websphere.ibm.com/webservices/} Server.generalException faultString: java.io.InterruptedIOException: Read timed out"

System run-time error ("Division by Zero")
  Sun Microsystems WS Toolkit: "java.rmi.ServerException: JAXRPC.TIE.04: Internal Server Error (JAXRPCTIE01: caught exception while handling request: java.lang.ArithmeticException: / by zero)"
  IBM WS Toolkit (WSDK): "{http://websphere.ibm.com/webservices/} Server.generalException faultString: java.lang.ArithmeticException: / by zero"

Application error causing user-defined exception
  Sun Microsystems WS Toolkit: "java.rmi.RemoteException: ai.c1.loony.exception.UserException"
  IBM WS Toolkit (WSDK): "{http://websphere.ibm.com/webservices/}Server.generalException faultString: (13)UserException"

Error in name of input parameter
  Sun Microsystems WS Toolkit: "java.rmi.RemoteException: JAXRPCTIE01: unexpected element name: expected=Integer_2, actual=Integer_1"
  IBM WS Toolkit (WSDK): OK - correct output without exception
3.6. Exception Propagation and Performance Analysis
Table 11 shows the exceptions raised at the top level on the client side. However, a particular exception can be wrapped dozens of times before it finally propagates to the top. This process takes time and significantly reduces the performance of exception handling in a service-oriented architecture.
An example of the stack trace corresponding to the "Operand Type Mismatch" run-time error caught by a web service is given in Figure 13. The exception propagation chain has four nested calls (each line starting with the preposition "at") when the WS development kit from Sun Microsystems is used. For comparison, the stack trace of the IBM-based implementation has 63 nested calls for the same error. The full stack traces and technical details can be found in Gorbenko et al. (2007).
The results of the exception propagation and performance analysis are presented in Table 12. For each failure the table gives the number of stack traces (the length of the exception propagation chain, i.e. the count of distinct stack traces for that particular failure) and the propagation delay (minimum, maximum and average values), which is the time between the invocation of a service and the capture of the exception by a catch block. As can be seen from Table 12, during normal (exception-free) operation the IBM implementation of the web service performs almost twice as well as the service implemented with the Sun technology.
java.rmi.ServerException: JAXRPC.TIE.04: Internal Server Error
(JAXRPCTIE01: java.lang. NumberFormatException: For input string: "578ER")
at com.sun.xml.rpc.client.dii.BasicCall.invoke(BasicCall.java:497)
at ai.c1.xai12.wstest.InvoceWS.invoce(InvoceWS.java:125)
at ai.c1.xai12.wstest.InvoceWS.invoceByVector(InvoceWS.java:75)
at wstest.Main.main(Main.java:42)
Figure 13. Stack trace of failure No 8, raised in the client application developed in NetBeans IDE
by using JAX-RPC implementation of Sun Microsystems
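The propagation delay reported in Table 12 is the interval between issuing the call and entering the client's catch block. It can be measured with a simple timing wrapper such as the sketch below; the invoked operation is a placeholder for a real generated WS stub, not code from our testbed:

```java
import java.util.concurrent.Callable;

public class PropagationTimer {
    // Invokes the given operation and returns the elapsed time in ms.
    // If the call fails, the elapsed time is the exception propagation
    // delay: from invocation start to capture in the catch block.
    public static long timedInvoke(Callable<?> operation) {
        long start = System.nanoTime();
        try {
            operation.call();
        } catch (Exception e) {
            // Exception captured here; fall through and report the delay.
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

Repeating such a measurement for every injected fault yields the minimum, maximum and average delays of the kind reported in Table 12.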
The performance of the exception propagation mechanisms was monitored on the university LAN on heterogeneous server platforms. The first row of the table corresponds to the correct service output without any exceptions. Rows showing zero stack traces in spite of an injected error correspond to cases where the service still produced correct output.
It is clear from the table that the exception propagation delay is several times greater than the normal working time. However, the exception propagation delay of the Web Service developed with the NetBeans IDE, using the JAX-RPC implementation from Sun Microsystems, was two times shorter than the delay we experienced when using IBM WSDK. This can be accounted for by the fact that the exception propagation chain in the IBM implementation of the web service is usually much longer. The factors behind the performance differences between the two web-service development environments most probably lie in the internal structure of the toolkits and the application servers used. We believe that the most likely reason for this behaviour is that IBM WSDK produces a larger number of nested calls than the JAX-RPC implementation by Sun Microsystems.
In the case of service suspension or packet loss, a service client developed using the Sun WS toolkit may not raise an exception even after as long as two hours. This delays recovery actions and complicates developers' work. Analysing the exception stack trace and propagation delay can help in identifying the source of an exception.
For example, failures 1 ("Network connection break-off") and 2 ("Domain Name System (DNS) is down") raise the same top-level exception, "HTTP transport error: java.net.UnknownHostException: loony.xai12.ai". However, with the Sun WS toolkit we can distinguish between these failures by comparing the numbers of stack traces (38 vs. 28). With IBM WSDK we are able to distinguish failure 5 ("Application Server is down") from failures 1 and 2 by analysing the exception propagation delay (the delay for failure 5 is an order of magnitude greater).
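These two signals, the stack trace count and the propagation delay, can be combined into a simple client-side diagnosis heuristic. The sketch below is illustrative only; its thresholds are drawn from the orders of magnitude observed in our measurements and would need recalibration for any other deployment:

```java
public class FailureDiagnoser {
    // Heuristic diagnosis from two observable signals: the number of
    // stack traces and the measured propagation delay in ms.
    // Thresholds are illustrative, modelled on Table 12.
    public static String diagnose(int stackTraceCount, long delayMs) {
        if (delayMs > 100) {
            // Application-server shutdown showed delays roughly an order
            // of magnitude above network/DNS failures.
            return "application-server down";
        }
        return stackTraceCount > 30 ? "network break-off" : "DNS down";
    }
}
```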
Table 12. Performance analysis of the exception propagation mechanism

For each toolkit the table gives the number of stack traces and the exception propagation delay in ms (min / max / average).

                                                 NetBeans IDE (Sun)         IBM WSDK
No  Type of error/failure                        traces  min   max   av.    traces  min     max     av.
-   Without error/failure                        0       40    210   95     0       15      120     45
1.  Network connection break-off                 38      10    30    23     16      10      40      28
2.  Domain Name System is down                   28      16    32    27     16      15      47      34
3.  Loss of request/response packet              -       >7200000           15      300503  300661  300622
4.  Remote host unavailable (off-line)           9       110   750   387    11      120     580     350
5.  Application Server is down                   9       70    456   259    16      100     550     287
6.  Suspension of WS (getting into a loop)       -       >7200000           15      300533  300771  300642
7.  System run-time error ("Division by Zero")   7       90    621   250    62      120     551     401
8.  Calculation error ("Operand Type Mismatch")  4       90    170   145    63      130     581     324
9.  Error causing user-defined exception         4       100   215   175    61      150     701     366
10. Error in Target Name Space                   4       100   281   180    0       10      105     38
11. Error in Web Service name                    0       40    120   80     0       10      125     41
12. Error in service port name                   0       30    185   85     0       15      137     53
13. Error in service operation name              4       90    270   150    58      190     511     380
14. Output parameter type mismatch               14      80    198   160    0       15      134     48
15. Input parameter type mismatch                4       80    190   150    76      90      761     305
16. Error in name of input parameter             4       70    201   141    0       10      150     47
17. Mismatch in number of input parameters       4       80    270   160    61      130     681     350
18. Web Service style mismatching                4       70    350   187    58      90      541     298
3.7. Discussion
Exception handling is widely used as the basis for forward error recovery in service-oriented architecture.
Its effectiveness depends on the features of exception raising and on the propagation mechanisms. This
work allows us to draw the following conclusions.
1. Web services developed using different toolkits react differently to some DII client errors ("Output parameter type mismatch", "Error in name of input parameter"). Sometimes this diversity can allow us to mask client errors, yet in other cases it leads to an erroneous outcome. Moreover, the exception messages and stack traces gathered in our experiments were not always sufficient to identify the exact cause of these errors. For example, it is not possible to know whether a remote host is down or merely unreachable due to transient network failures. All this contributes to SOA uncertainty and prevents developers from applying an adequate recovery technique.
2. Clients of web services developed using different toolkits can experience different response
time-outs. In our experimentation with simple Web Services we also observed substantial delays in client
software developed using the Sun Microsystems toolkit caused by WS hangs or packet loss.
3. Web Services developed using different toolkits have different exception propagation times. This
affects failure detection and failure notification delay. We believe that WSDK developers should make an
effort to reduce these times.
4. Analysing exception stack traces and propagation delays can help identify the exact sources of exceptions even when the same top-level exception message has been caught. This makes for better fault diagnosis, which identifies and records the cause of an exception in terms of both location and type, as well as better fault isolation and removal.
5. Knowing the exact cause and source of an exception is useful for applying appropriate failure recovery or fault-tolerance means during exception handling. Several types of failures resulting in exceptions can be effectively handled on the client side, whereas others should be handled on the service side. Exception handling of client-side errors in early binding procedures may include a retry using dynamic invocation. Transient network failures can be tolerated by a simple retry. In other cases redundancy and majority voting should be used.
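The simple-retry strategy for transient failures mentioned above can be sketched as a bounded retry loop; the attempt limit is an illustrative parameter, and failures are rethrown unchecked to keep the sketch self-contained:

```java
import java.util.concurrent.Callable;

public class RetryInvoker {
    // Retries the operation up to maxAttempts times, assuming failures are
    // transient; rethrows the last failure if every attempt fails.
    public static <T> T invokeWithRetry(Callable<T> op, int maxAttempts) {
        Exception last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;   // assume transient; try again
            }
        }
        throw new RuntimeException("all " + maxAttempts + " attempts failed", last);
    }
}
```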
6. Gathering and analysing exception statistics allows fault handling to be improved, preventing located faults from being activated again by means of system reconfiguration or reinitialization. This is especially relevant to a composite system with several alternative WSs.
7. Analysing exception stack traces helps identify the application server, WSDK, libraries and
packages used for WS development. This information is useful for choosing diverse variants from a set of
alternative Web Services deployed by third parties and building effective fault-tolerant systems by using
WS redundancy and diversity.
Below is a summary of our suggestions on how exception handling should be implemented in SOA systems so that exceptions are handled optimally.
First of all, a Web Service should return exceptions as soon as possible. Long notification delays
can significantly affect SOA performance, especially in complex workflow systems. To decrease the
exception propagation delay, developers should avoid unnecessary nesting of exceptions and reduce the
overall number of exception stack traces.
It is also essential that exceptions should contain more detailed information about the cause of error
and also provide additional classification attributes to help error diagnosis and fault tolerance. For
example, if an exception reports whether the error seems to be transient or permanent, a user’s application
will be able to automatically choose and perform the most suitable error recovery action (a simple retry in
case of transient errors or more complex fault-tolerant techniques otherwise).
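Such a classification attribute could, for instance, be carried by the exception type itself, letting the client application select a recovery action automatically. The exception class below is a hypothetical illustration and not part of any existing WS toolkit:

```java
// Hypothetical exception carrying a transient/permanent classification
// attribute, as suggested in the text.
class ClassifiedServiceException extends Exception {
    final boolean transientError;
    ClassifiedServiceException(String msg, boolean transientError) {
        super(msg);
        this.transientError = transientError;
    }
}

public class RecoverySelector {
    // Chooses a recovery action from the exception's classification:
    // a simple retry for transient errors, failover to a diverse
    // alternative service otherwise.
    public static String select(ClassifiedServiceException e) {
        return e.transientError ? "retry" : "failover-to-alternative-service";
    }
}
```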
CONCLUSION AND FUTURE WORK
Service-Oriented Architecture and Web Services technologies support rapid, low-cost and seamless
composition of globally distributed applications, and enable effective interoperability in a loosely-coupled
heterogeneous environment. Services are autonomous, platform-independent computational entities that
can be dynamically discovered and integrated into a single service to be offered to the users or, in turn,
used as a building block in further composition. The essential principles of SOA and WS form the
foundation for various modern and emerging IT technologies, such as service-oriented and cloud
computing, SaaS (software as a service), grid, etc.
According to International Data Corporation (2007), Web Services and service-oriented systems
are now widely used in e-science, critical infrastructures and business-critical systems. Failures in these
applications adversely affect people’s lives and businesses. Thus, ensuring dependability of WSs and
SOA-based systems is a must, as well as a challenge. To illustrate the problem, our earlier extensive experiments with the BASIS and BLAST bioinformatics WSs show that the response time varies greatly because of various unpredictable factors such as Internet congestion, network failures and WS overloads. In particular, the BASIS WS response time ranges from 300 ms to 120,000 ms; in 22% of the requests the response time is at least twice the observed minimum, and in about 5% of the requests it is more than 20 times greater. We believe it is impossible to build fast and dependable SOAs without tackling these issues.
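The measure of performance uncertainty used in this work, the ratio between the delay standard deviation and its average value, can be computed directly from logged response times; a minimal sketch:

```java
public class UncertaintyRatio {
    // Returns stddev/mean of the samples: the performance uncertainty
    // measure introduced in this work (population standard deviation).
    public static double ratio(double[] responseTimesMs) {
        double mean = 0;
        for (double t : responseTimesMs) mean += t;
        mean /= responseTimesMs.length;
        double var = 0;
        for (double t : responseTimesMs) var += (t - mean) * (t - mean);
        var /= responseTimesMs.length;
        return Math.sqrt(var) / mean;
    }
}
```

A perfectly stable service yields a ratio of zero; the more unstable the response time, the larger the ratio.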
Our recent experimental work supports our claim that dealing with the uncertainty inherent in the
very nature of SOA and WSs is one of the main challenges in building dependable SOAs. Uncertainty
needs to be treated as a threat in a way similar to and in addition to faults, errors and failures as
traditionally dealt with by the dependability community (Avizienis et al., 2004).
The response time instability can cause timing failures when the time of response arrival or the
time in which information is delivered at the service interface (i.e. the timing of service delivery) deviates
from the time required to execute the system function. A timing failure may be in the form of early or late
response, depending on whether the service is delivered too early or too late (Avizienis et al., 2004). In
complex Service-Oriented Systems composed of many different Web Services, some users may receive a
correct service whereas others may receive incorrect services of different types due to timing errors.
These errors may occur in different system components depending on the relative position of a particular
user and particular Web Services in the Internet, and on the instability points appearing during the
execution. Thus, timing errors can become a major cause of the inconsistent failures usually referred to, after Lamport, Shostak, & Pease (1982), as Byzantine failures.
The novel concepts of Service-Oriented Systems and their application in new domains clearly call for continued attention to the SOA-specific uncertainty issues. For open inter-organisational SOA systems using the Internet, this uncertainty is unavoidable, and the systems should be able to provide trustworthy service in spite of it. This, in turn, will require developing new resilience engineering techniques and resilience-explicit mechanisms to deal with this threat.
Good measurement of uncertainty is important (and our work contributes to this topic), yet it is only the beginning because, once measured, the non-functional characteristics of WSs cannot be assumed to hold forever. This is why developing dynamic fault-tolerance techniques and mechanisms that set timeouts on-line and adapt the system architecture and its behaviour on the fly is crucial for SOA. In fact, there is a substantial number of dependability-enhancing techniques that can be applied to SOA (Zheng, & Lyu, 2009; Maamar et al., 2008; Laranjeiro, & Vieira, 2008; Fang et al., 2007; Salatge, & Fabre, 2007, etc.), including retries of lost messages, redundancy and replication of WSs, variations of recovery blocks trying different services, etc.
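A variation of recovery blocks trying different services can be sketched as invoking diverse WS replicas in turn until one of them succeeds; the code below is an illustrative outline, not a production mediator:

```java
import java.util.List;
import java.util.concurrent.Callable;

public class RecoveryBlock {
    // Tries each alternative service in turn; returns the first successful
    // result, or throws if every diverse replica fails.
    public static <T> T invokeFirstAvailable(List<Callable<T>> alternatives) {
        Exception last = null;
        for (Callable<T> alt : alternatives) {
            try {
                return alt.call();
            } catch (Exception e) {
                last = e;   // this replica failed; fall through to the next
            }
        }
        throw new RuntimeException("all alternative services failed", last);
    }
}
```

In a real deployment the order of alternatives could itself be adapted on-line using the measured response-time uncertainty of each replica.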
These techniques exploit the flexibility of the service infrastructure, but the major challenge in
utilising these dependability techniques is the uncertainty inherent in the services running over the
Internet and clouds. This uncertainty exhibits itself through the unpredictable response times of the
Internet messages and data transfers, the difficulty of diagnosing the root cause of service failures, the
lack of ability to see beyond the interfaces of a service, unknown common mode failures, etc. The
uncertainty of the Internet and service performance instability are such that on-line optimization of
redundancy can make a substantial difference in perceived dependability. There are, however, no good
tools available at the moment for a company to carry out such optimisation in a rigorous manner.
We believe that uncertainty can be resolved by two means: uncertainty removal through advances
in data collection and uncertainty tolerance through smart algorithms that improve decisions despite a
lack of data (e.g. by extrapolation, better mathematical models, etc.). The user can intelligently and dynamically switch between Internet service providers or WS providers if he or she understands which delay makes the major contribution to the response time and its instability. The more aware the user is of the response time, of the different delays contributing to it and of their uncertainty, the more intelligent his or her choice will be.
Future solutions will need to deal with a number of issues, such as the uncertainty of fault
assumptions, of redundant resource behaviour, of error detection, etc. The traditional adaptive solutions
based on the control feedback will not be directly applicable as they are designed for predictable
behaviour. One of the possible ways to resist uncertainty is to use service and path redundancy and
diversity inherent to SOA.
In Gorbenko, Kharchenko, & Romanovsky (2009) we propose several patterns for dependability-aware service composition that allow us to construct composite Service-Oriented Systems resilient to various types of failure (signalled or unsignalled; content, timing or silent failures) by using the inherent redundancy and diversity of the Web Service components which exist in the SOA and by extending the mediator approach proposed by Chen and Romanovsky (2008).
ACKNOWLEDGEMENTS
A. Romanovsky is partially supported by the UK EPSRC TrAmS platform grant.
A. Gorbenko is partially supported by the UA DFFD grant GP/F27/0073 and School of Computing
Science, Newcastle University.
REFERENCES
AmberPoint, Inc (2003). Managing Exceptions in Web Services Environments, An AmberPoint
Whitepaper. Retrieved from http://www.amberpoint.com.
Avizienis, A., Laprie, J.-C., Randell, B., & Landwehr, C. (2004). Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11-33.
Chen, Y., & Romanovsky, A. (2008, Jan/Feb) Improving the Dependability of Web Services Integration.
IT Professional: Technology Solutions for the Enterprise, 20-26.
Chen, Y., Romanovsky, A., Gorbenko, A., Kharchenko, V., Mamutov, S., Tarasyuk, O. (2009).
Benchmarking Dependability of a System Biology Application. Proceedings of the 14th IEEE Int.
Conference on Engineering of Complex Computer Systems (ICECCS’2009), 146 – 153.
Duraes, J., Vieira, M., & Madeira, H. (2004). Dependability Benchmarking of Web-Servers. In M. Heisel
et al. (Eds.), Proceedings of the 23rd Int. Conf. on Computer Safety, Reliability and Security
(SAFECOMP'04), LNCS 3219, (pp. 297-310). Springer-Verlag.
Fang, C.-L., Liang, D., Lin, F., & Lin, C.-C. (2007). Fault tolerant web services. Journal of Systems
Architecture, 53(1), 21-38
Goad, R. (2008, Dec) Social Xmas: Facebook’s busiest day ever, YouTube overtakes Hotmail, Social
networks = 10% of UK Internet traffic, [Web log comment]. Retrieved from
http://weblogs.hitwise.com/robin-goad/2008/12/facebook_youtube_christmas_social_networking.html
Gorbenko, A., Mikhaylichenko, A., Kharchenko, V., & Romanovsky, A. (2007). Experimenting With
Exception Handling Mechanisms Of Web Services Implemented Using Different Development Kits.
Technical report CS-TR 1010, Newcastle University. Retrieved from
http://www.cs.ncl.ac.uk/research/pubs/trs/papers/1010.pdf.
Gorbenko, A., Kharchenko, V., Tarasyuk, O., Chen, Y., & Romanovsky, A. (2008). The Threat of Uncertainty in Service-Oriented Architecture. Proceedings of the RISE/EFTS Joint International Workshop on Software Engineering for Resilient Systems (SERENE'2008), ACM, 49-50.
Gorbenko, A., Kharchenko, V., & Romanovsky, A. (2009). Using Inherent Service Redundancy and
Diversity to Ensure Web Services Dependability. In M.J. Butler, C.B. Jones, A. Romanovsky, E.
Troubitsyna (Eds.) Methods, Models and Tools for Fault Tolerance, LNCS 5454 (pp. 324-341). Springer-
Verlag.
Institute for Ageing and Health (2009). BASIS: Biology of Ageing e-Science Integration and Simulation
System. Retrieved June 1, 2010, from http://www.basis.ncl.ac.uk/. Newcastle upon Tyne, UK: Newcastle
University.
International Data Corporation (2007). Mission Critical North American Application Platform Study, IDC
White Paper. Retrieved from www.idc.com.
Kirkwood, T.B.L., Boys, R.J., Gillespie, C.J., Proctor, C.J., Shanley, D.P., Wilkinson, D.J. (2003).
Towards an E-Biology of Ageing: Integrating Theory and Data. Journal of Nature Reviews Molecular
Cell Biology, 4, 243-249.
Lamport, L., Shostak, R., & Pease, M. (1982). The Byzantine Generals Problem. ACM Trans.
Programming Languages and Systems, 4(3), 382-401.
Laranjeiro, N., & Vieira, M. (2008). Deploying Fault Tolerant Web Service Compositions. International
Journal of Computer Systems Science and Engineering (CSSE): Special Issue on Engineering Fault
Tolerant Systems, 23(5).
Laranjeiro, N., Vieira, M., & Madeira, H. (2007). Assessing Robustness of Web-services Infrastructures.
Proceedings of the International Conference on Dependable Systems and Networks (DSN’07), 131-136
Li, P., Chen, Y., Romanovsky, A. (2006). Measuring the Dependability of Web Services for Use in e-
Science Experiments. In D. Penkler, M. Reitenspiess, & F. Tam (Eds.): International Service Availability
Symposium (ISAS 2006), LNCS 4328, (pp. 193-205). Springer-Verlag.
Looker, N., Munro, M., & Xu, J. (2004). Simulating Errors in Web Services. International Journal of
Simulation Systems, Science & Technology, 5(5)
Maamar, Z., Sheng, Q., & Benslimane, D. (2008). Sustaining Web Services High-Availability Using
Communities. Proceedings of the 3rd International Conference on Availability, Reliability and Security,
834-841.
Miyazaki, S., & Sugawara, H. (2000) Development of DDBJ-XML and its Application to a Database of
cDNA, Genome Informatics 2000, (pp. 380–381). Tokyo: Universal Academy Press Inc.
Reinecke, P., van Moorsel, A., & Wolter, K. (2006). Experimental Analysis of the Correlation of HTTP GET invocations. In A. Horvath and M. Telek (Eds.): European Performance Engineering Workshop (EPEW'2006), LNCS 4054, (pp. 226-237). Springer-Verlag.
Salatge, N., & Fabre, J.-C. (2007). Fault Tolerance Connectors for Unreliable Web Services. Proceedings
of the International Conference on Dependable Systems and Networks (DSN’07). 51-60.
Zheng, Z., Zhang, Y., & Lyu, M. (2010). Distributed QoS Evaluation for Real-World Web Services.
Proceedings of the IEEE International Conference on Web Services (ICWS’10), 83-90.
Zheng, Z., & Lyu, M. (2010). Collaborative Reliability Prediction for Service-Oriented Systems.
Proceedings of the ACM/IEEE 32nd International Conference on Software Engineering (ICSE’10), 35-
44.
Zheng, Z., & Lyu, M. (2009). A QoS-Aware Fault Tolerant Middleware for Dependable Service
Composition. Proceedings of the International Conference on Dependable Systems and Networks
(DSN’09), 239-248.
1 http://www-128.ibm.com/developerworks/webservices/wsdk/
2 http://www.sun.com/software/products/appsrvr_pe/index.xml
3 http://www.netbeans.org
4 http://www.eclipse.org