Adaptive Testing, Oracle Generation, and Test Case Ranking for
Web Services
Wei-Tek Tsai
Software Research Laboratory
Computer Science & Engineering Department
Arizona State University
[email protected]
Table of Contents
• Background
• Existing Dilemmas for SOA
• Introduction to WebStrar
• Difference between Blood and WS Group Testing
• Testing Process
• BBS Case Study
• Impact of Training Sizes and Target Sizes
• Impacts of Training on Test Case Ranking
• Conclusions and Future Work
Background
• Software development is shifting away from the product-oriented paradigm to the service-oriented paradigm.
• Service-Oriented Architecture (SOA) and its implementation, Web Services (WS), have received significant attention as major computer companies adopt this new approach to develop software and systems.
• However, trustworthiness becomes a serious problem, and appropriate tradeoffs must be made during the WS testing phase.
Verification of Web Services
• Collaborative Testing: Cooperation and collaboration among different testing activities and stakeholders including service provider, service consumer, and service brokers.
• Specification-Based Testing: SOA proposes a fully specification-based process. WS define an XML-based protocol stack to facilitate service inter-communication and inter-operation. Specifications such as WSDL, OWL-S, and WSFL describe the service features. Hence, test cases need to be generated based on these specifications.
Existing Dilemmas for SOA (continued)
• Run-time Testing: Most WS activities, such as service publishing, discovery, matching, composition, binding, execution, and monitoring, are done at runtime. Thus, verification and testing, including test case generation, test execution, test evaluation, and model checking, must be done at run-time.
• Different implementations of the same specification: For the same specification of a service requirement, many alternative implementations may be available online. Effective algorithms are needed to rank and select the best WS.
Introduction to WebStrar
• WebStrar: Infrastructure for Web Services Testing, Reliability Assessment, and Ranking. It is an infrastructure that facilitates the development of Web services, trustworthy Web services, and their applications. It provides
– the public (service providers, brokers, requestors, researchers, and regulators) on-line access to the tools and databases that enable describing (specifying), finding, scripting (composing complex services from existing services), testing, verification, validation, experimentation, and reliability evaluation of Web services;
– WS group testing to rank services that belong to the same specification.
[Figure: Current Web Service Model — clients find services through a UDDI service broker (publish/find) and bind to service providers via SOAP calls, exchanging data and results.]

[Figure: Trustworthy Web Service Model based on Testing — a trustworthy service broker containing a registry, a test master, a database of test scripts, check-in and check-out interfaces, an acceptance interface, and a service binder; service providers submit services via publishing requests and acceptance testing before approval, and clients inquire and bind through the broker.]
WebStrar Infrastructure

[Figure: WebStrar infrastructure — service providers submit WS together with test cases and test case oracles; a WS test master draws on test case generators, a test case database, test case validation, reliability models, and model checking over WSDL, OWL-S, and DAML-S specifications, with the testing method and reliability model dynamically replaceable; outputs include WS ranking, test case ranking, and oracle updates stored in a trustworthy WS repository, a reliability database, and a WS directory; service composition produces composite WS; service requestors/clients, researchers, and developers access the reliability data and ranks.]
Difference between Blood and WS Group Testing

• Testing goals
– BGT: Find bad samples from a large pool of blood samples.
– WSGT: Rank WS in a large pool of WS with the same specification; rank the fault-detection capacity of test scripts; determine the oracle of each test script; and identify faults.
• Optimization objectives
– BGT: Minimize the number of tests needed.
– WSGT: Minimize the number of tests and votes needed.
• Sample mix
– BGT: Arbitrary and physical mix.
– WSGT: Interoperability is constrained by WSDL, DAML-S, OWL-S, and composition semantics such as ontology.
• Testing methods
– BGT: Bio/chemical tests.
– WSGT: WS unit, integration, and interoperability testing using adaptive, progressive, and concurrent testing.
• Testing location
– BGT: Centralized testing.
– WSGT: Distributed and remote testing by agents and voters.
• Verification
– BGT: Contamination analysis.
– WSGT: Oracle comparison and/or majority voting.
• Test coverage
– BGT: One test for each mix.
– WSGT: Many tests are needed for each group of WS to verify a variety of aspects.
• Reliability evaluation
– BGT: Reliability of the testing process.
– WSGT: Reliability of the WS under test and of the testing process.
• Reliability of tests
– BGT: Tests can be reliable or unreliable; most BGT assumes tests are reliable.
– WSGT: The voting mechanism may be unreliable, and the number of faulty WS may exceed the number of correct WS, misleading the voter.
Testing Process
• Test a large number of WS at both the unit and integration levels.
• At each level, the testing process has two phases:
– Training Phase and
– Volume Testing Phase.
Phase 1: Training Phase
• Select a subset of WS randomly from the set of all WS to be tested. The size of the subset, called the "Training Size", can be decided experimentally.
• Apply each test case in the given set of test cases to test all the WS in the selected subset.
• Voting: For each test input, the outputs from the WS under test are compared by a stochastic voting mechanism based on majority and deviation voting principles.
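As a minimal sketch of the voting step, the toy voter below picks the majority output and reports the extent of the majority; the paper's stochastic voter also applies deviation voting for non-identical outputs, which this exact-match version omits. The function name, threshold, and data layout are illustrative assumptions.

```python
from collections import Counter

def vote(outputs, threshold=0.5):
    # Sketch only: exact-match majority voting over the outputs that the
    # WS under test produced for one test input. The real voter also
    # uses deviation voting principles.
    counts = Counter(outputs)
    winner, votes = counts.most_common(1)[0]
    confidence = votes / len(outputs)   # extent of the majority
    if confidence > threshold:
        return winner, confidence       # clear majority: oracle candidate
    return None, confidence             # no clear majority yet

# Five WS answer the same test input; four of them agree.
oracle, conf = vote([42.0, 42.0, 41.9, 42.0, 42.0])
# oracle == 42.0, conf == 0.8
```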
Phase 1: Training Phase (Cont’d)
• Oracle establishment: If a clear majority output is found, that output is used to form the oracle of the test case that generated it. A confidence level is defined based on the extent of the majority; the confidence level is also adjusted dynamically in Phase 2.
• Test case ranking: Test cases are ranked according to their fault-detection capacity, which is proportional to the number of failures they detect. In Phase 2, the higher-ranked test cases are applied first to eliminate the WS that fail them.
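The ranking step above can be sketched in a few lines; the dictionary layout (test case id mapped to the number of failures it detected during training) is an assumption for illustration.

```python
def rank_test_cases(failures_detected):
    # Order test cases by fault-detection capacity, approximated here by
    # the number of failures each detected during the training phase.
    return sorted(failures_detected, key=failures_detected.get, reverse=True)

# "t3" exposed the most failures during training, so Phase 2 runs it first.
phase2_order = rank_test_cases({"t1": 2, "t2": 0, "t3": 7, "t4": 5})
# phase2_order == ["t3", "t4", "t1", "t2"]
```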
Phase 1: Training Phase (Cont’d)
• WS ranking: The stochastic voting mechanism not only finds a majority output but also ranks the WS under group testing according to their average deviation from the majority output.
• By the end of the training phase, we have
– tested and ranked the selected WS;
– ranked the potency of the test cases; and
– established the oracle for the test cases and their confidence levels.
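A minimal sketch of ranking WS by average deviation from the majority output, assuming numeric outputs; the function name and data layout are illustrative, not the paper's implementation.

```python
def rank_ws(outputs_by_ws, oracles):
    # Average absolute deviation of each WS's outputs from the majority
    # (oracle) outputs across the training test cases; smaller deviation
    # ranks higher.
    avg_dev = {
        ws: sum(abs(out - oracle) for out, oracle in zip(outs, oracles)) / len(oracles)
        for ws, outs in outputs_by_ws.items()
    }
    return sorted(avg_dev, key=avg_dev.get)

ranking = rank_ws(
    {"WS1": [10.0, 20.0], "WS2": [10.5, 23.0], "WS3": [10.0, 20.1]},
    oracles=[10.0, 20.0],
)
# ranking == ["WS1", "WS3", "WS2"]
```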
Phase 1: Training Phase (Cont'd)

[Figure: Training-phase data flow — the voter compares the outputs of the WS under test to generate test script rankings, find the oracle and confidence level of each test script, and detect failures in order to compute WS reliability.]
Phase 2: Volume Testing Phase
• This phase continues testing the remaining WS and any newly arrived WS, based on the profiles and history (test case effectiveness, oracle, and WS ranking) obtained in the training phase.
• By the end of Phase 2:
– all available WS are tested;
– a short ranked list of WS is produced;
– test cases are updated and ranked; and
– oracles and their confidence levels are updated.
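The Phase-2 elimination loop can be sketched as follows: apply the highest-ranked test cases first and drop a WS as soon as it fails one, so few test runs are spent on weak WS. The `run` hook, the service layout, and the test ids are illustrative assumptions.

```python
def volume_test(ws_pool, ranked_tests, oracle, run):
    # Phase-2 sketch: ranked_tests come ordered by fault-detection
    # capacity; a WS is eliminated on its first failed test case.
    survivors = list(ws_pool)
    for test in ranked_tests:
        survivors = [ws for ws in survivors if run(ws, test) == oracle[test]]
    return survivors

# Toy services: each maps a test id to the output it would produce.
services = {
    "good": {"t1": 1, "t2": 2},
    "bad":  {"t1": 1, "t2": 9},
}
survivors = volume_test(services, ["t2", "t1"], {"t1": 1, "t2": 2},
                        run=lambda ws, t: services[ws][t])
# survivors == ["good"]
```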
Best Buy Stock WS Specification

Event (1): A client queries a stock's price.
Specification: A client can query any stock's price. If the queried stock name is not empty and the requested stock information is available, the server WS sends the requested stock price to the requesting client.

Event (2): 20 minutes have passed since the last stock price check.
Specification: The service automatically checks stock prices every 20 minutes. If the prices of some stocks increase >= 5% within the past 20 minutes, it sends messages to all owners of those stocks, reminding them to sell the stocks whose prices increased >= 5%, or to buy the stocks and sell them at a higher price. If the prices of some stocks decrease >= 10% within the past 20 minutes, the server WS sends messages to the stock owners, reminding them to buy the stocks whose prices decreased >= 10%, or to sell them to stop further losses. If the advancing volume or declining volume of some stocks increases >= 100% in the past 20 minutes compared to the same period yesterday, it sends messages to alert the stock owners.
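The three alert rules of event (2) can be sketched as a predicate; the thresholds (5%, 10%, 100%) come from the specification above, while the function name and message strings are illustrative placeholders.

```python
def bbs_alerts(price_change_pct, volume_change_pct):
    # Apply the BBS alert thresholds to one stock's 20-minute changes.
    alerts = []
    if price_change_pct >= 5.0:
        alerts.append("price up >= 5% in 20 min: consider selling")
    if price_change_pct <= -10.0:
        alerts.append("price down >= 10% in 20 min: consider buying")
    if volume_change_pct >= 100.0:
        alerts.append("volume up >= 100% vs. same period yesterday")
    return alerts

# A 12% drop combined with a 150% volume jump triggers two alerts.
alerts = bbs_alerts(-12.0, 150.0)
```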
Best Buy Stock WS Case Study
[Figure: BBS case study setup — clients call WS offered by brokers (BBS WS1 … BBS WSn); each WS maintains a stock database with updating, an evaluation engine, and purchase/sale info; a voter compares the answers a1 … an.]

class Stock {
  int StockID;  // only use distinct prime numbers as IDs
  float price;  // current purchase price
  int minute;   // percentage points of change in a minute
  int day;      // percentage points of change in a day
  int week;     // percentage points of change in a week
  int month;    // percentage points of change in a month
  int year;     // percentage points of change in a year
}
Impact of Training Sizes and Target Sizes
[Chart: Testing cost (0.4–1.0) plotted against training size (0–60), with one curve per target size from 1 to 15.]
Impacts of Target Size on Testing Cost
• The smaller the target size, the lower the cost. This is so because more WS can be eliminated sooner.
• The differences between curves 1 to 12 are small, while a large gap exists between curves 12 and 13. The reason is that there are 12 fault-free WS under test, from which no failures are detected. If these fault-free WS are in the current target set, any other WS is eliminated as soon as a single failure is detected.
• When the target size moves from 12 to 13 or higher, the testing cost increases sharply, because the algorithm must find a better WS among a set of imperfect WS.
Impacts of Training Size on Testing Cost
• The smaller the training size, the lower the cost.
• When the training size is less than or equal to the target size, increasing the training size does not increase the cost (the initial part of the curves is flat). When the training size exceeds the target size, the cost increases as the training size increases.
• When the training size equals the total number of WS under test, it becomes exhaustive testing and no test runs can be saved.
Oracle Establishment and Confidence
• Note that the oracle is established by a majority voting.
• If the training size is small, the confidence decreases, and it is even possible that an incorrect answer can get the majority vote.
• Also, an incorrect WS does not always produce an incorrect answer; it often produces an incorrect answer only some of the time.
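The risk of a small training set misleading the voter can be illustrated with a Monte Carlo sketch. The model is an assumption for illustration: each WS answers correctly with probability `p_correct`, and all incorrect WS are pessimistically assumed to agree on the same wrong answer, so the oracle is wrong whenever incorrect votes outnumber correct ones.

```python
import random

def wrong_oracle_rate(training_size, p_correct=0.7, trials=20000, seed=1):
    # Estimate how often majority voting over `training_size` WS
    # establishes an incorrect oracle under the pessimistic model above.
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        correct = sum(rng.random() < p_correct for _ in range(training_size))
        if training_size - correct > correct:  # wrong answer wins the vote
            wrong += 1
    return wrong / trials
```

Under these assumptions, a training size of 1 yields a wrong oracle roughly 30% of the time, while a training size of 5 reduces the rate noticeably, mirroring the trend in the tables below.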
The Impacts of Training Size on Oracle

Training size   Prob. of correct oracle   Prob. of no oracle   Prob. of incorrect oracle
1               76.9%                     0.0%                 23.1%
2               83.8%                     16.3%                0.0%
3               98.1%                     1.9%                 0.0%
4               98.1%                     0.6%                 1.3%
5               98.1%                     1.9%                 0.0%
Impacts of Training on Test Case Ranking

Training size   Prob. of correct ranking   Prob. of no decision   Prob. of incorrect ranking
1               60.0%                      0.0%                   40.0%
2               35.6%                      59.4%                  5.0%
3               65.1%                      21.5%                  13.4%
4               71.5%                      16.3%                  12.2%
5               75.8%                      15.2%                  9.0%
Conclusions
• This paper proposed an efficient process to test a large number of web services designed based on the same specification.
• The experiment results reveal that the smaller the training size, the lower the cost.
• However, a small training size can lead to an incorrect oracle, and thus to incorrect WS ranking.
• A small training size can also lead to incorrect test case ranking, resulting in a higher test cost in Phase 2.
• Therefore, it is critical to select a reasonably sized training set in WS group testing.
Future Work
• Need to address the impact of the age of the test cases, e.g., by using an adaptive window.
• Also, we need a stochastic algorithm to perform the majority voting automatically for complex outputs.