Adaptive Testing, Oracle Generation, and Test Case Ranking for Web Services
Wei-Tek Tsai, Software Research Laboratory, Computer Science & Engineering Department, Arizona State University, [email protected]


Page 1: Adaptive Testing, Oracle Generation, and Test Case Ranking

Adaptive Testing, Oracle Generation, and Test Case Ranking for Web Services

Wei-Tek Tsai
Software Research Laboratory
Computer Science & Engineering Department
Arizona State University
[email protected]

Page 2: Adaptive Testing, Oracle Generation, and Test Case Ranking

Table of Contents

• Background
• Existing Dilemmas for SOA
• Introduction to WebStrar
• Differences between Blood and WS Group Testing
• Testing Process
• BBS Case Study
• Impact of Training Sizes and Target Sizes
• Impacts of Training on Test Case Ranking
• Conclusions and Future Work

Page 3: Adaptive Testing, Oracle Generation, and Test Case Ranking

Background

• Software development is shifting away from the product-oriented paradigm to the service-oriented paradigm.

• Service-Oriented Architecture (SOA) and its implementation, Web Services (WS), have received significant attention as major computer companies are all adopting this new approach to developing software and systems.

• However, trustworthiness becomes a serious concern, and appropriate tradeoffs must be made during the WS testing phase.

Page 4: Adaptive Testing, Oracle Generation, and Test Case Ranking

Verification of Web Services

• Collaborative Testing: Cooperation and collaboration among different testing activities and stakeholders including service provider, service consumer, and service brokers.

• Specification-Based Testing: SOA proposes a fully specification-based process. WS define an XML-based protocol stack to facilitate service inter-communication and inter-operation. Specifications such as WSDL, OWL-S, and WSFL describe the service features. Hence, test cases need to be generated from these specifications.
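As a minimal sketch of what "test cases generated from specifications" can mean in practice, the snippet below extracts the operation names declared in a WSDL document, from which per-operation test stubs could then be created. The WSDL fragment and the function name `list_operations` are illustrative assumptions, not taken from the paper.

```python
# Sketch: extract operation names from a (hypothetical) WSDL document so
# that a test stub can be generated for each declared operation.
import xml.etree.ElementTree as ET

WSDL = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/">
  <portType name="StockQuotePort">
    <operation name="GetStockPrice"/>
    <operation name="ListStocks"/>
  </portType>
</definitions>"""

def list_operations(wsdl_text):
    """Return the operation names declared in a WSDL portType."""
    root = ET.fromstring(wsdl_text)
    # Tags are namespace-qualified because the WSDL declares a default namespace.
    return [op.get("name")
            for op in root.iter("{http://schemas.xmlsoap.org/wsdl/}operation")]

print(list_operations(WSDL))  # → ['GetStockPrice', 'ListStocks']
```

A real generator would also read the message and type definitions to build concrete inputs; here only the skeleton of the idea is shown.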

Page 5: Adaptive Testing, Oracle Generation, and Test Case Ranking

Existing Dilemmas for SOA (continued)

• Run-time Testing: Most WS activities, such as service publishing, discovery, matching, composition, binding, execution, and monitoring, are done at runtime. Thus, verification and testing, including test case generation, test execution, test evaluation, and model checking, must also be done at run-time.

• Different implementations of the same specification: For the same specification of a service requirement, many alternative implementations may be available online. Effective algorithms are needed to rank and select the best WS.

Page 6: Adaptive Testing, Oracle Generation, and Test Case Ranking

Introduction to WebStrar

• WebStrar: Infrastructure for Web Services Testing, Reliability Assessment, and Ranking. It is an infrastructure that facilitates the development of Web services, trustworthy Web services, and their applications. It provides

– the public (service providers, brokers, requestors, researchers, and regulators) with on-line access to the tools and databases that enable describing (specifying), finding, scripting (composing complex services from existing services), testing, verification, validation, experimentation, and reliability evaluation of Web services.

– WS group testing to rank services belonging to the same specification.

Page 7: Adaptive Testing, Oracle Generation, and Test Case Ranking

[Figure: Two models side by side. Left, the current Web Service model: service providers publish to a UDDI service broker; clients find services through the broker and invoke the providers directly via SOAP calls to obtain results. Right, the trustworthy Web Service model based on testing: a trustworthy service broker containing a registry, a test master, a database of test scripts, check-in and check-out interfaces, an acceptance interface, and a service binder mediates publishing requests, acceptance testing, approval, inquiry, binding, and the exchange of data and results between service providers and clients.]

Page 8: Adaptive Testing, Oracle Generation, and Test Case Ranking

WebStrar Infrastructure

[Figure: The WebStrar infrastructure. Service providers submit WS together with test cases and test case oracles. Test case generators driven by WSDL, OWL-S, and DAML-S specifications feed a test case database after test case validation; a WS test master runs the tests, supported by service composition (composite WS), reliability models, and model checking. Results flow into a trustworthy WS repository, a reliability database, and a WS directory, producing WS ranking, test case ranking, and oracle updates. Service requestors/clients access the data (reliability, ranks), while researchers and developers can dynamically replace the testing method and the reliability model.]

Page 9: Adaptive Testing, Oracle Generation, and Test Case Ranking

Differences between Blood and WS Group Testing

• Testing goals – BGT: find bad samples in a large pool of blood samples. WSGT: rank WS in a large pool of WS with the same specification; rank the fault detection capacity of test scripts; determine the oracle of each test script; and identify faults.

• Optimization objectives – BGT: minimize the number of tests needed. WSGT: minimize the number of tests and votes needed.

• Sample mix – BGT: arbitrary and physical mix. WSGT: interoperability is constrained by WSDL, DAML-S, OWL-S, and composition semantics such as ontology.

• Testing methods – BGT: bio/chemical tests. WSGT: WS unit, integration, and interoperability testing using adaptive, progressive, and concurrent testing.

• Testing location – BGT: centralized testing. WSGT: distributed and remote testing by agents and voters.

• Verification – BGT: contamination analysis. WSGT: oracle comparison and/or majority voting.

• Test coverage – BGT: one test for each mix. WSGT: many tests are needed for each group of WS to verify a variety of aspects.

• Reliability evaluation – BGT: reliability of the testing process. WSGT: reliability of both the WS under test and the testing process.

• Reliability of tests – BGT: tests can be reliable or unreliable; most BGT assumes tests are reliable. WSGT: the voting mechanism may be unreliable, and the number of faulty WS may exceed the number of correct WS, misleading the voter.

Page 10: Adaptive Testing, Oracle Generation, and Test Case Ranking

Testing Process

• Test a large number of WS at both the unit and integration levels.

• At each level, the testing process has two phases: – Training Phase and – Volume Testing Phase.

Page 11: Adaptive Testing, Oracle Generation, and Test Case Ranking

Phase 1: Training Phase

• Select a subset of WS randomly from the set of all WS to be tested. The size of this subset, called the “Training Size,” can be decided experimentally.

• Apply each test case in the given set of test cases to test all the WS in the selected subset.

• Voting: For each test input, the outputs from the WS under test are evaluated by a stochastic voting mechanism based on majority and deviation voting principles.
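The voting step can be sketched as follows. This is a simplified illustration, not the paper's exact stochastic algorithm: a plain majority vote establishes a candidate oracle, the fraction of agreeing WS serves as its confidence, and each WS's absolute deviation from the majority output provides the deviation ranking. The function name `vote` and the 0.5 threshold are assumptions for the sketch.

```python
# Sketch of majority voting with deviation ranking over the outputs of
# several WS for one test input (simplified vs. the paper's mechanism).
from collections import Counter

def vote(outputs, majority_threshold=0.5):
    """Return (majority_output_or_None, confidence, deviations).

    outputs: one numeric output per WS under test.
    confidence: fraction of WS agreeing with the majority output.
    deviations: per-WS absolute deviation from the majority output,
                used to rank the WS (smaller is better).
    """
    counts = Counter(outputs)
    candidate, votes = counts.most_common(1)[0]
    confidence = votes / len(outputs)
    if confidence <= majority_threshold:
        return None, confidence, None   # no clear majority: no oracle yet
    deviations = [abs(o - candidate) for o in outputs]
    return candidate, confidence, deviations

# Five WS answer the same query; one disagrees slightly, one badly.
oracle, conf, dev = vote([42.0, 42.0, 42.0, 41.5, 10.0])
print(oracle, conf)  # → 42.0 0.6
```

The deviation list is what lets the voter do more than accept/reject: WS close to the majority answer rank higher even when they are not exactly correct.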

Page 12: Adaptive Testing, Oracle Generation, and Test Case Ranking

Phase 1: Training Phase (Cont’d)

• Oracle establishment: If a clear majority output is found, that output is used to form the oracle of the test case that generated it. A confidence level is defined based on the extent of the majority. The confidence level is also dynamically adjusted in Phase 2.

• Test case ranking: Test cases are ranked according to their fault detection capacity, which is proportional to the number of failures they detect. In Phase 2, the higher-ranked test cases are applied first to eliminate the WS that fail them.
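Since fault detection capacity is proportional to the number of failures detected, the ranking itself reduces to a sort, as in this minimal sketch (the dictionary shape and test case ids are illustrative):

```python
# Sketch: rank test cases by fault detection capacity, i.e. by how many
# WS each test case caused to deviate from the voted oracle.
def rank_test_cases(failure_counts):
    """failure_counts: {test_case_id: number of failures it detected}.
    Returns test case ids, highest fault detection capacity first."""
    return sorted(failure_counts, key=failure_counts.get, reverse=True)

print(rank_test_cases({"t1": 2, "t2": 7, "t3": 0}))  # → ['t2', 't1', 't3']
```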

Page 13: Adaptive Testing, Oracle Generation, and Test Case Ranking

Phase 1: Training Phase (Cont’d)

• WS ranking: The stochastic voting mechanism not only finds a majority output, but also ranks the WS under group testing according to their average deviation from the majority output.

• By the end of the training phase, we have
– tested and ranked the selected WS;
– ranked the potency of the test cases; and
– established the oracles for the test cases and their confidence levels.

Page 14: Adaptive Testing, Oracle Generation, and Test Case Ranking

Phase 1: Training Phase (Cont’d)

[Figure: Training-phase workflow. Each test script is applied to the WS in the training subset (WS1 … WSn), and the outputs are voted on. The voting results are used to (1) find the oracle and confidence level of each test script, (2) generate the test script ranking, and (3) detect failures and compute WS reliability. Each test script thus acquires a rank, an oracle, and a confidence level.]

Page 15: Adaptive Testing, Oracle Generation, and Test Case Ranking

Phase 2: Volume Testing Phase

• This phase continues testing the remaining WS and any newly arrived WS, based on the profiles and history (test case effectiveness, oracle, and WS ranking) obtained in the training phase.

• By the end of Phase 2:
– all available WS have been tested;
– a short, ranked list of WS has been produced;
– test cases have been updated and ranked; and
– oracles and their confidence levels have been updated.
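The heart of the volume testing phase, applying test cases in rank order so that the most potent test cases prune the pool early, can be sketched as below. The names `volume_test`, `ranked_tests`, and `run` are illustrative assumptions; `run` stands in for actually invoking a WS, and the sketch eliminates a WS on its first failure rather than modeling the paper's full adaptive scheme.

```python
# Sketch of the volume-testing loop: apply test cases in ranked order and
# drop a WS as soon as it fails, so high-ranked test cases prune early.
def volume_test(ws_pool, ranked_tests, run):
    """ws_pool: WS under test; ranked_tests: [(test_id, oracle), ...] in
    rank order; run(ws, test_id): invoke a WS and return its output."""
    survivors = list(ws_pool)
    for test_id, oracle in ranked_tests:
        survivors = [ws for ws in survivors if run(ws, test_id) == oracle]
    return survivors

# Toy example: each WS is modeled as a dict mapping test id -> output.
ws_pool = [{"t1": 1, "t2": 5}, {"t1": 1, "t2": 6}, {"t1": 2, "t2": 5}]
run = lambda ws, t: ws[t]
print(volume_test(ws_pool, [("t1", 1), ("t2", 5)], run))
# → [{'t1': 1, 't2': 5}]
```

In the paper's process the oracles and ranks come from the training phase and keep being updated as new WS arrive; the loop above only shows why ranked test cases reduce the number of test runs.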

Page 16: Adaptive Testing, Oracle Generation, and Test Case Ranking

Best Buy Stock (BBS) WS Specification

Event (1): A client queries a stock’s price.

Specification: A client can query any stock’s price. If the queried stock name is not empty and the requested stock information is available, the server WS sends the requested stock price to the requesting client.

Event (2): 20 minutes have passed since the last stock price check.

Specification: The service automatically checks stock prices every 20 minutes. If the prices of some stocks increase >= 5% within the past 20 minutes, it sends messages to all owners of those stocks, reminding them either to sell the stocks whose prices increased >= 5% or to buy more to sell at a higher price.

If the prices of some stocks decrease >= 10% within the past 20 minutes, the server WS sends messages to the stock owners, reminding them either to buy the stocks whose prices decreased >= 10% or to sell them to stop further losses.

If the advancing volume or declining volume of some stocks increases >= 100% in the past 20 minutes compared to the same period yesterday, it sends messages to alert the stock owners.
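The 20-minute alert rules of Event (2) can be condensed into a small decision function; this is a sketch of the specification's logic, and the function name `alert` and message labels are illustrative, not part of the BBS WS interface.

```python
# Sketch of the BBS 20-minute alert rules: given a stock's price change
# and volume change, decide which messages (if any) to send to its owners.
def alert(change_pct, volume_change_pct):
    """change_pct: price change over the past 20 minutes, in percent.
    volume_change_pct: advancing/declining volume change vs. the same
    period yesterday, in percent."""
    messages = []
    if change_pct >= 5:
        messages.append("sell-or-buy-more")   # price rose >= 5%
    if change_pct <= -10:
        messages.append("buy-or-stop-loss")   # price fell >= 10%
    if volume_change_pct >= 100:
        messages.append("volume-alert")       # volume up >= 100%
    return messages

print(alert(6.2, 150))   # → ['sell-or-buy-more', 'volume-alert']
print(alert(-3.0, 20))   # → []
```

A specification this explicit is exactly what makes oracle generation feasible: each rule yields a checkable expected output for a test input.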

Page 17: Adaptive Testing, Oracle Generation, and Test Case Ranking

Best Buy Stock WS Case Study

[Figure: n BBS WS implementations (BBS WS1 … BBS WSn) offered by brokers, each backed by a stock database, an evaluation engine, and continually updated purchase/sale information. Clients invoke the WS, and a voter compares the outputs a1 … an.]

Class Stock {
    int StockID;  // only distinct prime numbers are used as IDs
    float price;  // current purchase price
    int minute;   // percentage points of change in a minute
    int day;      // percentage points of change in a day
    int week;     // percentage points of change in a week
    int month;    // percentage points of change in a month
    int year;     // percentage points of change in a year
}

Page 18: Adaptive Testing, Oracle Generation, and Test Case Ranking

Impact of Training Sizes and Target Sizes

[Figure: Testing cost (0.4 to 1.0) plotted against training size (0 to 60), with one curve per target size ranging from 1 to 15.]

Page 19: Adaptive Testing, Oracle Generation, and Test Case Ranking

Impacts of Target Size on Testing Cost

• The smaller the target size, the lower the cost. This is so because more WS can be eliminated sooner.

• The differences among curves 1 to 12 are small, while a large gap exists between curves 12 and 13. The reason is that there are 12 fault-free WS under test, from which no failures are detected. If these fault-free WS are in the current target set, any other WS can be eliminated as soon as a single failure is detected.

• When the target size moves from 12 to 13 or higher, the testing cost increases sharply, because the algorithm must find a better WS among a set of imperfect WS.

Page 20: Adaptive Testing, Oracle Generation, and Test Case Ranking

Impacts of Training Size on Testing Cost

• The smaller the training size, the lower the cost.

• When the training size is less than or equal to the target size, increasing the training size does not increase the cost (the initial part of the curves is flat). When the training size exceeds the target size, the cost increases as the training size increases.

• When the training size equals the total number of WS under test, it becomes exhaustive testing and no test runs can be saved.

Page 21: Adaptive Testing, Oracle Generation, and Test Case Ranking

Oracle Establishment and Confidence

• Note that the oracle is established by a majority voting.

• If the training size is small, the confidence decreases, and it is even possible that an incorrect answer can get the majority vote.

• Also, an incorrect WS does not always produce an incorrect answer; it typically produces incorrect answers only some of the time.

Page 22: Adaptive Testing, Oracle Generation, and Test Case Ranking

The Impacts of Training Size on Oracle

Training Size    Prob. of correct oracle    Prob. of no oracle    Prob. of incorrect oracle
1                76.9%                      0.0%                  23.1%
2                83.8%                      16.3%                 0.0%
3                98.1%                      1.9%                  0.0%
4                98.1%                      0.6%                  1.3%
5                98.1%                      1.9%                  0.0%

Page 23: Adaptive Testing, Oracle Generation, and Test Case Ranking

Impacts of Training on Test Case Ranking

Training Size    Prob. of correct ranking    Prob. of no decision    Prob. of incorrect ranking
1                60.0%                       0.0%                    40.0%
2                35.6%                       59.4%                   5.0%
3                65.1%                       21.5%                   13.4%
4                71.5%                       16.3%                   12.2%
5                75.8%                       15.2%                   9.0%

Page 24: Adaptive Testing, Oracle Generation, and Test Case Ranking

Conclusions

• This paper proposed an efficient process to test a large number of Web services built to the same specification.

• The experiment results reveal that the smaller the training size, the lower the cost.

• However, a small training size can lead to incorrect oracles, leading to incorrect WS ranking.

• A small training size can also lead to incorrect test case ranking, resulting in a higher test cost in Phase 2.

• Therefore, it is critical to select a reasonably sized training set in WS group testing.

Page 25: Adaptive Testing, Oracle Generation, and Test Case Ranking

Future Work

• We need to address the impact of the age of test cases; an adaptive window is needed for this.

• Also, we need a stochastic algorithm to perform the majority voting automatically for complex outputs.