On Representing and Eliciting Resilience Requirements of Microservice Architecture Systems

Kanglin Yin, Qingfeng Du, Wei Wang, Member, IEEE, Juan Qiu, Member, IEEE, and Jincheng Xu

Abstract—Together with the spread of DevOps practices and container technologies, Microservice Architecture has become a mainstream architecture style in recent years. Resilience is a key characteristic of Microservice Architecture Systems (MSA Systems): it shows the ability to cope with various kinds of system disturbances that cause degradations of services. However, because there is no consensus definition of resilience in the software field, although much work has been done on resilience for MSA Systems, developers still lack a clear idea of how resilient an MSA System should be and what resilience mechanisms are needed.

In this paper, by referring to existing systematic studies on resilience in other scientific areas, the definition of microservice resilience is provided and a Microservice Resilience Measurement Model is proposed to measure service resilience. We also give a requirement model to represent resilience requirements of MSA Systems, and a process framework to elicit those requirements. As a proof of concept, a case study is conducted on an MSA System to illustrate how resilience requirements are elicited and represented.

Index Terms—Microservice; Resilience; Requirement Specification; Requirement Elicitation Methods.

I. INTRODUCTION

Microservice Architecture (aka Microservices)[1] is an architectural style that modularizes software components as services, called microservices, and makes services and service development teams as independent of each other as possible. In recent years, Microservice Architecture has become a mainstream architecture style adopted by many leading internet companies[2][3]. Shifting to Microservice Architecture promises fast time-to-market of individual services and enables modern development processes like Continuous Delivery[4] and DevOps[5]. Besides delivery speed, scalability, flexibility in resource allocation, code reuse and other qualities are also improved greatly by the Microservice Architecture.

Although the Microservice Architecture has many benefits, Microservice Architecture Systems (MSA Systems) are more fragile than traditional monolithic systems[6], because microservices are usually deployed in a more sophisticated environment using virtual infrastructures like virtual machines and containers, and there are many components for service decoupling and management (e.g. API Gateway, Message Queue, Service Registry). Threats come from anywhere in an MSA System: components with high fault density [7], unstable message passing among microservices[8], and the underlying cloud environment with unreliable containers, virtual machines, or bare-metal servers[9]. Even normal actions taken in cloud environments, like software/hardware upgrades and dynamic changes to configuration files, may lead to severe service outages; these are lessons learned from hundreds of service outage reports of cloud companies in the literature[10].

Reliability, availability, fault tolerance, etc. are traditional metrics to evaluate a software system's ability to cope with failures [11]. These metrics assume that a software system has two states, such as "Available/Unavailable" or "Reliable/Unreliable", and calculate the probability of these two states. However, recent studies on cloud system failures [12][10][13] found that cloud systems are more likely to be in a "limped" state than to be totally unavailable. In the "limped" state, a cloud system can still provide services with normal functionality, but their performance falls below users' expectations, which is known as service degradation. In such situations, metrics like reliability or availability cannot evaluate the software system well. For example, in situation A, the average response time of an MSA System's service is degraded from 3 seconds to 5 seconds due to a failure, while in situation B, the average response time of the service is degraded from 3 seconds to 12 seconds due to the same failure, and the failure recovery time is the same in both situations. Obviously the service in situation A performs better than the service in situation B when the failure happens. But if we take "response time higher than 3 seconds" as an unreliable state, the service reliability metrics in these two situations are identical.

As a result, many practitioners of the Microservice Architecture [14][15][16] proposed Resilience as a characteristic describing how an MSA System copes with the occurrence of failures and recovers degraded services back to their normal performance. Existing reliability/fault-tolerance mechanisms used in Service-Oriented Architectures and cloud platforms, such as Circuit Breakers and Bulkheads, are used as resilience mechanisms in MSA Systems[8][15].

Although engineering resilience in MSA Systems has gained popularity among designers and engineers, consensus on how resilience can be designed and improved has not yet been reached. Until now, there has been no common definition of microservice resilience. And although several works address software resilience benchmarking [17][18], the available quantification metrics still exhibit very little standardization.

With no standard definitions and quantification metrics for microservice resilience, it is hard to state definite resilience requirements for MSA Systems. As a result, microservice developers seldom have a clear idea of the following questions


about resilience, which may lead to development failures according to the theory of requirement engineering[19]:

1) What is microservice resilience?
2) How to evaluate microservice resilience?
3) How resilient should an MSA System be?
4) How to set goals for resilience mechanisms?

Furthermore, how to represent resilience requirements of MSA Systems is another problem to face, even if standard definitions and quantification metrics of microservice resilience are given. Although Domain-Driven Design (DDD) [20] is the suggested way to build MSA System requirements [14], integrating resilience into DDD seems difficult. DDD focuses on decomposing a system into microservices by business context boundaries, so introducing resilience notions like service degradations and failures may trouble the partition of context boundaries. Another type of requirement model is needed for microservice resilience requirements.

To solve the problems above, this paper works on the representation and elicitation of MSA System resilience requirements. The contributions of this paper are:

• By referring to systematic studies on definitions and measurements of resilience in other scientific areas, we provide a definition of microservice resilience, and a Microservice Resilience Measurement Model (MRMM) is proposed to measure service resilience of MSA Systems.

• Based on MRMM, a requirement model is designed to represent resilience requirements of MSA Systems. The requirement model contains a Resilience Goal Decomposition View, which refines service resilience goals to system behaviors with a customized goal model, and a Resilience Mechanism Implementation View, which shows how resilience mechanisms work in MSA Systems.

• A process framework to elicit resilience requirements of MSA Systems is proposed. The process framework outlines the steps to elicit our resilience requirement model for MSA Systems, following the methodology of Goal-Oriented Requirement Engineering.

The remainder of this paper is organized as follows: Section II summarises related work; Section III provides the definition of microservice resilience and proposes the Microservice Resilience Measurement Model; Section IV proposes the service resilience requirement model based on the definition and measurement model in Section III; Section V describes the process framework to elicit MSA System resilience requirements; in Section VI a case study is conducted on an MSA System to illustrate the whole resilience requirement elicitation process; and Section VII concludes this paper and outlines future work.

II. RELATED WORKS

This section discusses existing studies in three related research areas: resilience in other scientific areas, resilience in microservices, and Goal-Oriented Requirement Engineering.

Fig. 1. The Resilience Triangle Model Proposed in [25]

A. Resilience in Other Scientific Areas

The word "resilience" originates from the Latin verb "resilire", meaning an object's ability to bounce back to its normal size after being pulled or pushed. Holling [21] first used this word in the field of ecology, to represent an ecosystem's ability to absorb disturbances. In recent decades, the notion of resilience has been used in many scientific areas, like psychology, city planning, management science, etc. As described in Righi's review of resilience engineering[22], the ability of anticipating/being aware of hazards, the capacity of adapting to variability, and the ability of responding/restoring are major concerns of resilience.

In Hosseini's literature review on system resilience[23], resilience assessment is classified into two categories: qualitative approaches and quantitative approaches. Qualitative approaches give conceptual frameworks of practices to achieve resilience[24], and are usually used in society-ecology, organization management, and healthcare. Quantitative approaches use resilience curves to illustrate the resilient behavior of an engineered system undergoing a disruptive event. Many researchers have used properties of the resilience curve to measure the resilience of a system.

Figure 1 shows Bruneau's resilience triangle model [25] on a resilience curve, which is the most used resilience model in quantitative assessment. In Figure 1, the x-axis stands for time and the y-axis stands for the quality of a system. Bruneau's model proposed three dimensions of resilience: Robustness, Rapidity and Resilience. Robustness and Rapidity are measured on the y-axis and x-axis of Figure 1 respectively, while Resilience measures the shaded area in Figure 1. A great number of studies on resilience quantification have also proposed metrics along these three dimensions [26][27][23].

B. Resilience in Microservices

The importance of resilience has been pointed out in many practitioner books on microservices [14][15][16] and in studies discussing key features of MSA Systems [6][28], which makes it common sense that resilience is a key characteristic of MSA Systems. In these books and studies, typical resilience mechanisms like Circuit Breakers and Bulkheads [8] are mentioned.

In recent years there have been several resilience-related works on MSA Systems. Richter et al. showed that the Microservice Architecture itself can have positive impacts on dependability and fault tolerance[29]. Nane analyzed the impact of containers on microservice performance [30]. Giovanni et al. proposed a self-managing mechanism for MSA Systems[31], where auto-scaling and health management of services are implemented through monitoring. Soenen et al. designed a scalable mechanism for microservice-based NFV systems for fault recovery and high availability [32]. Zwietasch used time-series prediction methods to predict failures in MSA Systems[33]. Stefan et al. designed a decision guidance model for service discovery and fault handling of microservices, where some faults in an MSA System are related to certain designs in the system [34]. Heorhiadi et al. designed a resilience testing framework for the Microservice Architecture [35], but how to inject faults into microservices was not discussed in detail. Brogi proposed a reference dataset generation framework for microservices which includes failure data[36]. Thomas and Andre built a meta-model for MSA Systems to be used for performance and resilience benchmarking[37], but how this model is used for benchmarking and how resilience is evaluated was not discussed.

C. Goal-Oriented Requirement Engineering

Goal-Oriented Requirement Engineering (GORE) is a branch of requirement engineering. GORE is used in the earlier stages of requirements analysis to elicit requirements from high-level system goals[38], while object-oriented analysis like UML[39] fits well in the later stages. KAOS [40][41], i* [42], GBRAM [43] and the NFR Framework [44] are the main goal modeling frameworks used in GORE[45]. The main concerns of recent papers on GORE are implementations, integrations, extensions and evaluations of the main goal models, as summarized in the literature review of Jennifer et al. [46].

Few works have been done on GORE for microservice-related systems. Wang et al. introduced service discovery into GORE, which helps the requirement decomposition of SOA systems[47]. Zardari and Bahsoon discussed how GORE can be adapted to clouds[48].

On GORE with failures or performance, Rashid et al. extended the goal-oriented model to an aspect-oriented model, which separates the functional and non-functional properties of a requirement [49]. Robert and Axel presented an approach that uses existing refinements as refinement patterns for goal refinement [50], which proved the validity of refinement in the studies above. Van and Letier integrated the notion of obstacle into their KAOS model in [51]. Axel then presented formal techniques for reasoning about obstacles to the satisfaction of goals [52], mainly focusing on finding and resolving obstacles in functional goals that are too ideal. Van Lamsweerde explicitly modeled the goals of an attacker, an agent that creates security risks for the system goals, in KAOS[53]. John, Lawrence and Brian proposed a goal-oriented framework for Non-Functional Requirements [44], with a series of refinement methods designed for accuracy and performance requirements. Fatama et al. proposed a three-layered goal model and conducted multi-objective risk analysis on it[54].

Similar to the studies above, we integrate notions of microservice resilience, such as disruptions and system resources, into a customized goal model to represent resilience requirements of MSA Systems.

III. MICROSERVICE RESILIENCE

The definition of microservice resilience and the Microservice Resilience Measurement Model are proposed in this section, as the basis of resilience requirement representation for MSA Systems.

A. Definition of Microservice Resilience

By summing up existing viewpoints on resilience in other scientific areas, and associating these viewpoints with what is required to be achieved in MSA Systems, we have come to the following conclusions on MSA System resilience:

• In the running environments of MSA Systems, there are many unpredictable events that make services of MSA Systems perform worse than expected. These events are termed "disruptions" in the field of resilience[23][22].
• It is hard to get the probabilities of disruptions, because the architecture and deployment environment of MSA Systems always change with the quick iteration of DevOps. Considering the high fault density in MSA Systems[6], and assumptions on disruptions in other scientific areas [23][27], it can be assumed that disruptions are inevitable and always happen to MSA Systems.
• Service performance is the main concern of MSA System resilience. Disruptions in MSA Systems cause losses of service performance, which are called service degradations. The curve representing how a service's performance varies over time under a service degradation is a typical "resilience curve" in resilience research[27].
• Resilient MSA Systems should keep the performance of degraded services from falling to a level that is unacceptable to users, and bring degraded services' performance back to normal as fast as possible.

Based on the above conclusions, the definition of resilience for MSA Systems is provided as follows:

Resilience of a Microservice Architecture System is the ability to maintain the performance of services at an acceptable level and recover the services back to normal, when a disruption causes service degradation.

B. Microservice Resilience Measurement Model

To quantify service resilience in MSA Systems, we propose the Microservice Resilience Measurement Model (MRMM). MRMM quantifies resilience by measuring service degradations caused by disruptions. The resilience metrics in MRMM are later used for setting service resilience goals in resilience requirements.

Figure 2 shows the meta-model of MRMM. Below we provide the definitions of the elements (e.g. Service Resilience, Disruption Tolerance) in MRMM. Mathematical representations of these elements are also given, in order to encode resilience requirements as formal propositions and verify the satisfiability of resilience requirements in our future work.

Definition 1 (MSA System and Service):


Fig. 2. Meta-Model of MRMM

An MSA System is a software system that provides services. Every service in an MSA System is an interface exposed to users or other systems, who can fulfil certain functionalities by accessing the services. Factors like how services are modularized and deployed are not included in our definition of MSA Systems, because they are out of the scope of our definition of microservice resilience.

In mathematics, an MSA System is represented by a set MS = {S1, ..., Sn}, where S1, ..., Sn are services provided by the MSA System. Each service is represented by a tuple S = ⟨L, A⟩ where:

• L is the label of the service, which is used for verification in goal models[55];
• A = {A1, ..., Am} is the set of performance attributes of the service. Performance attributes are defined in the next paragraph.

Definition 2 (Service Performance):

Each service in an MSA System has one or several performance attributes (e.g. response time, throughput) to evaluate service performance. The performance attributes of a service are decided by the type of the service. For example, availability and success rate are common performance attributes of transactional services, while video stream services are usually benchmarked by throughput. The selection of performance attributes is discussed in detail in our proposed resilience requirement process framework in Section V.

Service performance is the value of a service's performance attribute at a certain point in time. Service performance can be represented by a function P(S, A, t), where S is the service, A is the performance attribute, and t is the timestamp.

In well-developed MSA Systems, real-time service performance data can be collected by monitoring tools like cAdvisor and Zabbix. These data are stored in time-series databases so that the performance value of a service at a timestamp can be queried in the form of P(S, A, t).
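As an illustration of querying P(S, A, t) from such a monitoring stack, the sketch below reads one value from a Prometheus-style time-series database over its HTTP query API. This is a minimal sketch only; the endpoint, the service name "checkout" and the metric expression are hypothetical, though the metric names follow common Prometheus conventions.

    import requests

    PROMETHEUS_URL = "http://prometheus:9090"   # hypothetical monitoring endpoint

    def query_performance(query, timestamp):
        """Return P(S, A, t): evaluate a performance-attribute query
        for one service at a given Unix timestamp."""
        resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                            params={"query": query, "time": timestamp})
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        # each vector sample looks like {"metric": {...}, "value": [ts, "<value>"]}
        return float(result[0]["value"][1]) if result else None

    # e.g. P(checkout, response time, t) as a 1-minute average
    p = query_performance(
        'rate(http_request_duration_seconds_sum{service="checkout"}[1m]) / '
        'rate(http_request_duration_seconds_count{service="checkout"}[1m])',
        1569628800)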

Definition 3 (Service Performance Benchmark):

A service performance benchmark is the baseline value of service performance. Service performance benchmarks are used to judge whether services are degraded: if the service performance value is lower than its service performance benchmark at some time, the corresponding service is regarded as degraded.

As with the mathematical representation of service performance, a service performance benchmark is represented by a function P_B(S, A, t), meaning the baseline value of service S's performance attribute A at time t. In mathematics, if the predicate Degrad(S, t) means service S is degraded at time t, Degrad(S, t) can be defined by the following propositional formula, where the symbol ⟷ means equivalence in propositional logic.

Degrad(S, t) ⟷ ∃A, P(S, A, t) < P_B(S, A, t)   (1)

Depending on the type of service and performance attribute, a service performance benchmark may be either a constant value (e.g. the benchmark of service response time is usually fixed) or a dynamic value varying over time. How to set service performance benchmarks is discussed in Section V.

Definition 4 (Disruption):

A disruption in an MSA System is an event that happens to the MSA System and makes a service degraded. A disruption should contain the following information:

• The related objects when a disruption happens. An object may be any abstract or realistic entity that can be identified in an MSA System (e.g. servers, containers, CPUs, processes, network connections, services).
• The event type of the disruption. For an object where a disruption happens, there are several event types. For example, a virtual machine's disruption event types may include VM halt, OS upgrade, kernel break, etc.

A disruption is represented by the tuple D = ⟨O, E⟩ in mathematics, where O and E are the labels of the object and event type. If the predicate Occur(D, t) means a disruption D occurs at timestamp t, the fact that D causes a service degradation on service S at timestamp t′ can be represented by Formula 2.

Occur(D, t) ⟶ Degrad(S, t′)   (2)

Definition 5 (Service Degradation and Service Resilience):

Service degradation is a phenomenon in MSA Systems in which a service stays degraded because of a disruption. The degraded state of a service is confirmed by judging whether the service performance is lower than the service performance benchmark. In mathematics, a service degradation is represented by the tuple SD = ⟨S, A, D, ts, te⟩, where:

• S is the degraded service;
• A is the performance attribute whose service performance benchmark is violated;
• D is the disruption causing the service degradation;
• ts and te are the start time and the end time of the service degradation.

Service degradation shows the impact of a disruption on service performance. Service resilience is a model that measures the impact of a service degradation. Since the service performance values under a service degradation form a typical resilience curve, by referring to the Bruneau Model[25], three metrics are included in service resilience to measure a service degradation: Disruption Tolerance, Recovery Rapidity and Performance Loss.

• Disruption Tolerance:
Disruption Tolerance measures how much service performance is degraded compared with the service performance benchmark. The Disruption Tolerance of a service degradation is the maximum deviation of service performance from the service performance benchmark during the period of the service degradation. In mathematics, the Disruption Tolerance DT(SD) of a service degradation SD is given by Formula 3.

DT(SD) = max ( P_B(SD.S, SD.A, t) − P(SD.S, SD.A, t) ), t ∈ [SD.ts, SD.te]   (3)

When a service is suffering a degradation, the MSA System should keep the service from severe degradation that is unacceptable to users (for example, the frame rate of a video stream service can be lowered a bit, but not so low that the video looks like a slideshow).

• Recovery Rapidity:
Recovery Rapidity measures how fast a degraded service can be recovered to reach the service performance benchmark again. Similar to Mean Time To Repair (MTTR) in reliability assessment, Recovery Rapidity RR(SD) is measured by the time interval of the service degradation SD, as shown in Formula 4.

RR(SD) = SD.te − SD.ts   (4)

• Performance Loss:
Performance Loss quantifies the magnitude of a service degradation in terms of service performance. The Performance Loss PL(SD) of a service degradation SD is expressed by Formula 5.

PL(SD) = ∫_{SD.ts}^{SD.te} [ P_B(SD.S, SD.A, t) − P(SD.S, SD.A, t) ] dt   (5)

Performance Loss measures the cumulative degraded performance during the service degradation, shown as the shaded area in Figure 1. Performance Loss can reveal the business loss in a service degradation. For example, if a data transmission service benchmarked by throughput suffers a service degradation, Performance Loss can measure how much less data is transmitted than expected.
In time-series databases used for monitoring, there already exist data types for recording cumulative performance values (like the Counter data type in Prometheus), so it is possible to collect Performance Loss data of service degradations in MSA Systems.

Fig. 3. Performance on TPS of a Sample Service Degradation

Service resilience measures a service degradation with these three metrics. Mathematically, service resilience is represented by the tuple SR = ⟨DT, RR, PL⟩, where DT, RR and PL are Disruption Tolerance, Recovery Rapidity, and Performance Loss.

Take a service with the performance attribute TPS (Transactions Per Second) as an example. The performance benchmark on TPS of this service is 50 requests/second. The service suffered from a disruption and was recovered later; the TPS values collected during the service degradation are shown in Figure 3. The Disruption Tolerance of the service is 50 − 25 = 25 requests/second, the Recovery Rapidity of the service is 15 − 5 = 10 seconds, and the Performance Loss is 5 × (50 − 35) + 5 × (50 − 25) = 200 requests, which means 200 fewer user requests were processed than expected due to the disruption.
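To make the three metrics concrete, the following sketch computes DT, RR and PL from evenly sampled performance data under a constant benchmark. It is an illustration reproducing the numbers of the TPS example above; one-second sampling is an assumption.

    def resilience_metrics(samples, benchmark, interval=1.0):
        """Compute (DT, RR, PL) of one service degradation from
        (timestamp, performance) samples spaced `interval` seconds apart."""
        degraded = [(t, p) for t, p in samples if p < benchmark]
        if not degraded:
            return 0.0, 0.0, 0.0
        dt = max(benchmark - p for _, p in degraded)                # Formula 3
        rr = degraded[-1][0] - degraded[0][0] + interval            # Formula 4
        pl = sum((benchmark - p) * interval for _, p in degraded)   # Formula 5
        return dt, rr, pl

    # the sample degradation: 5 s at 35 req/s, then 5 s at 25 req/s
    samples = ([(t, 50) for t in range(5)] + [(t, 35) for t in range(5, 10)]
               + [(t, 25) for t in range(10, 15)] + [(t, 50) for t in range(15, 20)])
    print(resilience_metrics(samples, benchmark=50))   # (25, 10.0, 200.0)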

IV. RESILIENCE REQUIREMENT REPRESENTATION

In Section III, we proposed MRMM to measure service resilience when a service degradation happens. With measurable service resilience metrics, microservice practitioners can set service resilience goals to indicate how resilient an MSA System is supposed to be. Service resilience goals specify service resilience through thresholds on the service resilience metrics in MRMM. For a service with a service resilience goal, the service resilience metrics of any service degradation happening to this service are expected to stay within the thresholds of the service resilience goal.

When a service degradation violates the service resilience goal, the service degradation should be further analyzed to diagnose the disruption causing it. Developers then establish a corresponding resilience mechanism to mitigate the impact of the identified disruption so that the service resilience goal is satisfied again. In MSA Systems, a resilience mechanism is a process consisting of system behaviors (like monitoring and failure detection) executed by one or several components to react to disruptions. Microservice practitioners usually use architecture-level diagrams to show how a resilience mechanism works in MSA Systems[14][56].

Thus, based on our definition of microservice resilience and MRMM, the resilience requirement of an MSA System consists of the following information:

• Service resilience goals;
• Disruptions that cause service resilience goal violations;
• Resilience mechanisms established to mitigate the impacts of disruptions.

We propose the Service Resilience Requirement Model to represent the microservice resilience requirements above. The proposed requirement model consists of two views: the Resilience Goal Decomposition View and the Resilience Mechanism Implementation View. In the Resilience Goal Decomposition View, we integrate notions of MRMM into a customized goal model to represent service resilience goals; disruptions causing service resilience goal violations are regarded as obstacles to service resilience goals, and resilience mechanisms are resolutions to these disruptions. The Resilience Mechanism Implementation View uses microservice practitioners' existing documentation styles for resilience mechanisms (like the architecture-level diagrams in [14][56]) to show how resilience mechanisms work in MSA Systems in a more expressive way. Figure 4 shows how notions in MRMM are represented in the Resilience Goal Decomposition View, and how a resilience mechanism established in the Resilience Goal Decomposition View is implemented in the Resilience Mechanism Implementation View.

A. Resilience Goal Decomposition View

In the Resilience Goal Decomposition View, service resilience goals are set and decomposed with the methodology of Goal-Oriented Requirement Engineering (GORE) [38]: service resilience goals are decomposed into resource resilience goals, disruptions obstructing service/resource resilience goals are identified, resilience mechanisms resolving the obstacles are established, and resilience mechanisms are further refined into detailed system behaviors that can be implemented by components of MSA Systems.

The Resilience Goal Decomposition View uses a mainstream goal model, KAOS[41], which is more expressive than other goal models[57], and notions of MRMM are integrated into this goal model. The customized goal model uses the basic elements of KAOS: goal, obstacle, agent and domain property. Besides these elements, we introduce a new element, asset, into our goal model to represent necessary services and system resources in MSA Systems. The symbolic representations of these elements in diagrams are shown in the example in Figure 4. The following paragraphs explain in detail how resilience requirements of an MSA System are represented in the Resilience Goal Decomposition View, and some mathematical representations are given for our future work on verifying resilience goal satisfaction.

1) Service Resilience Goal:

Goals are objectives that the target system should achieve. In KAOS, goals cover different types of concerns: from high-level, strategic concerns to low-level, technical concerns, and from functional concerns to non-functional concerns[38]. In diagrams of our proposed goal model, goals are represented by blue parallelograms.

Service resilience goals are the final goals to achieve in an MSA System; they specify how resilient the services in the MSA System are supposed to be. A service resilience goal of a service contains thresholds on the service resilience metrics in MRMM. The performance attribute is also specified in a service resilience goal, because service resilience metrics are calculated from service performance variations.

Mathematically, let RG(S, A) denote the service resilience goal RG of service S on performance attribute A, and let the predicate satisfy(RG) mean that RG is satisfied. Based on the definitions of service degradation and service resilience in MRMM, RG satisfies the propositional formula in Formula 6, where SD(S, A) denotes all service degradations of service S on performance attribute A.

satisfy(RG(S, A)) ⟷ ∀SD ∈ SD(S, A), (SR(SD).DT < RG.DT) ∧ (SR(SD).RR < RG.RR) ∧ (SR(SD).PL < RG.PL)   (6)
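Read as executable logic, Formula 6 simply checks every observed degradation against all three thresholds. A minimal sketch follows; the record types are illustrative and not part of MRMM.

    from dataclasses import dataclass

    @dataclass
    class ServiceResilience:          # SR = <DT, RR, PL> of one degradation
        dt: float                     # Disruption Tolerance
        rr: float                     # Recovery Rapidity (seconds)
        pl: float                     # Performance Loss

    @dataclass
    class ResilienceGoal:             # thresholds of RG(S, A)
        dt: float
        rr: float
        pl: float

    def satisfies(goal, degradations):
        """Formula 6: RG(S, A) holds iff every service degradation of S on A
        stays strictly within all three thresholds."""
        return all(sr.dt < goal.dt and sr.rr < goal.rr and sr.pl < goal.pl
                   for sr in degradations)

    goal = ResilienceGoal(dt=30, rr=15, pl=300)
    assert satisfies(goal, [ServiceResilience(dt=25, rr=10, pl=200)])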

A service resilience goal is linked to an asset representing the service and to a domain property representing the service performance benchmark, to show which service the goal specifies and which service performance benchmark is used to calculate service resilience when a service degradation happens, as shown in Figure 5.

In KAOS models, a textual specification of elements is required[58]. We use the following information to specify a service resilience goal:

• Goal Name: the identifier of the service resilience goal;
• Service: the service which the service resilience goal specifies;
• Performance Attribute: the performance attribute of the service resilience goal;
• Service Resilience Thresholds: thresholds on the service resilience metrics.

2) Resilience Goal Refinement:

In the KAOS model, a high-level goal can be refined into low-level goals; these low-level goals are called sub-goals of the high-level goal. There are two types of refinement: AND-refinement and OR-refinement.

AND-refinement means a goal can be achieved by satisfying all of its sub-goals. Given a goal G0 and a set of G0's sub-goals {G1, ..., Gn}, AND-refinement is textually denoted as G0 −AND→ {G1, ..., Gn}. Mathematically, AND-refinement satisfies the propositional Formula 7.

satisfied(G1) ∧ ... ∧ satisfied(Gn) ⟶ satisfied(G0)   (7)


Fig. 4. Relations between MRMM, Resilience Goal Decomposition View and Resilience Mechanism Implementation View

Fig. 5. The Service Resilience Goal of a Service Benchmarked by a Service Performance Attribute

OR-refinement means a goal can be achieved when one of its sub-goals is satisfied. Given a goal G0 and a set of G0's sub-goals {G1, ..., Gn}, OR-refinement is textually denoted as G0 −OR→ {G1, ..., Gn}. Mathematically, OR-refinement satisfies the propositional Formula 8.

satisfied(G1) ∨ ... ∨ satisfied(Gn) ⟶ satisfied(G0)   (8)

Figure 6 shows the difference in representation between AND-refinement and OR-refinement in the Resilience Goal Decomposition View.

Fig. 6. AND-Refinement (a) and OR-Refinement (b)
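Formulas 7 and 8 give refinement trees a simple bottom-up semantics, which the sketch below evaluates recursively. This is an illustration only and not part of any KAOS tool; the tree encoding is an assumption.

    def satisfied(goal):
        """Evaluate a goal node per Formulas 7 and 8. Leaves carry an
        observed truth value; inner nodes carry a refinement type and
        their sub-goals."""
        if "value" in goal:                   # terminal goal / system behavior
            return goal["value"]
        subs = [satisfied(g) for g in goal["subs"]]
        return all(subs) if goal["type"] == "AND" else any(subs)

    service_goal = {                          # hypothetical refinement tree
        "type": "AND",                        # refined to resource goals
        "subs": [
            {"value": True},                  # container resilience goal holds
            {"type": "OR",                    # one resolving mechanism suffices
             "subs": [{"value": False}, {"value": True}]},
        ],
    }
    assert satisfied(service_goal)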

In our goal model, each service resilience goal is refined into resilience goals of the service's dependency system resources, which ensure the running of the service (such as containers and VMs), if the performance attribute of the service resilience goal is directly influenced by performance attributes of these system resources. Resource resilience goals can be further refined into resilience goals of the resources' own dependency system resources. Such a refinement is an AND-refinement, because a service resilience goal may be obstructed by just one resource resilience goal violation.

When disruptions that obstruct service/resource resilience goals are found, resilience mechanisms are established to resolve the disruptions. These resilience mechanisms are sub-goals of the service/resource resilience goals, since they promise the satisfaction of those goals. The refinement from a resilience goal to resilience mechanisms is an OR-refinement, because when a disruption happens, only the corresponding resilience mechanism is required.

Resilience mechanisms are decomposed one or several times by AND/OR-refinements, until detailed system behaviors that can be executed by individual components of the MSA System are figured out. System behaviors of resilience mechanisms are represented by blue parallelograms with bold borders, to show that they are terminal goals in the goal model. Each system behavior is linked with an agent representing a system component, to show which component executes the system behavior.

Figure 7 shows the whole goal refinement from a service resilience goal down to system behaviors in the Resilience Goal Decomposition View.

3) Resilience Obstacle:

Obstacles are a dual notion to goals in goal models, and they are represented by red parallelograms. When an obstacle becomes true, some goals may not be achieved[52]. In our goal model, service degradations are obstacles to service resilience goals. A service degradation is transformed into a disruption obstructing a resource resilience goal after the service degradation is diagnosed, as shown in Figure 8.


Fig. 7. Goal Refinement from a Service Resilience Goal to System Behaviors

The disruption is linked to the affected system resource, and the corresponding resilience mechanism then takes action on that resource.

The textual specification of an obstacle contains the following information:

• Obstacle Name: the identifier of the obstacle;
• Event: a description of how service/resource resilience goals are obstructed.

Fig. 8. A Service Degradation Obstructing a Service Resilience Goal (a), and its Root Cause Disruption Obstructing a Resource Resilience Goal (b)

4) Asset:

Assets are services or system resources (like containers and physical servers) in an MSA System. Assets are represented by purple hexagons in diagrams of our proposed goal model. The textual specification of an asset includes the asset name and the asset type.

A service asset is linked with assets representing the service's dependency system resources, and these resource assets are linked with other resource assets. The identified dependency relationships among assets are used as references for service resilience goal refinement and service degradation diagnosis.

5) Agent:

Agents are individual objects that perform behaviors to achieve goals[38]. In our goal model, agents are represented as yellow hexagons, and they stand for the components of an MSA System that execute the system behaviors decomposed from resilience mechanisms. Individual components for MSA System operation, like monitoring tools or anomaly detectors, are typical agents in the Resilience Goal Decomposition View. As with assets, the textual information of an agent includes the agent name and the agent type.

6) Domain Property:

Domain properties are indicative statements of domain knowledge which are used as references for elements in KAOS[38][59]. Service performance benchmarks are domain properties of service resilience goals, because service performance benchmarks are key variables in calculating the service resilience metrics in MRMM. Domain knowledge of MSA Systems (like architectural patterns and operation principles) constitutes domain properties of resilience mechanisms, because it serves as their reference. A domain property is represented by an orange pentagon, and the textual specification of a domain property includes the following information:

• Domain Property Name: the identifier of the domain property;
• Description: a detailed description of the domain property;
• Reference Resources: reference links to related documentation of the domain property.

B. Resilience Mechanism Implementation View

The Resilience Mechanism Implementation View shows how a resilience mechanism, established in the Resilience Goal Decomposition View, is implemented in an MSA System. Existing documentation styles for resilience mechanisms (like the architecture-level diagrams used for Circuit Breakers and Bulkheads in [14][60]) are directly used for the Resilience Mechanism Implementation View; there is no need to design a new model, since these documentation styles are expressive enough.

The Resilience Mechanism Implementation View is drawn according to the elements in the Resilience Goal Decomposition View. For example, if the component-and-connector architecture style[61] is used for the Resilience Mechanism Implementation View, each agent in the Resilience Goal Decomposition View is drawn as a box representing a component, and the system behaviors of resilience mechanisms are drawn as connections launched by these components.

V. RESILIENCE REQUIREMENT ELICITATION

A process framework is proposed in this section to elicit resilience requirements of MSA Systems represented by the Service Resilience Requirement Model of Section IV. The proposed process framework describes how to set service performance benchmarks and service resilience goals, find disruptions that cause service resilience goal violations, and establish resilience mechanisms to mitigate the impact of disruptions.

A. Assumptions

The proposed process framework follows these assumptions:

• The target MSA System is iteratively developed.
• The first version of the target MSA System has already been developed and deployed before resilience requirements are elicited.
• There exists a monitoring component in the target MSA System. The monitoring component is well designed, so that the necessary performance data of all services and resources, and the service resilience metrics of service degradations, can be collected.
• The target MSA System meets its performance requirements when no disruptions happen.
• The resilience requirements focus on service performance degradations caused by disruptions; software failures caused by code errors are not considered.

B. Stakeholders of Resilience Requirements

Stakeholders are the participants involved in software requirement elicitation. The following stakeholders in the development team of an MSA System may participate in the resilience requirement elicitation process:

• Quality Assurer: Stakeholders for quality assurance are present in all stages of an MSA System's development lifecycle. When eliciting resilience requirements, quality assurers should provide the key performance attributes and corresponding benchmarks of each service, and then set service resilience goals according to the performance benchmarks.
• Operation Engineer: Operation engineers deploy the system and maintain it after deployment. They detect service degradations in running MSA Systems and find possible root-cause disruptions of service degradations.
• Software Architect: Software architects are responsible for the architecture of the entire MSA System, or for the architecture of a single microservice. When disruptions are identified, architects should establish resilience mechanisms to cope with these disruptions.
• Developer: Developers responsible for developing operation components in MSA Systems (sometimes called operations developers in DevOps) will implement resilience mechanisms, so they should know why the resilience mechanisms are required.
• Tester: Testers need to know what is achieved by resilience mechanisms in order to design test cases for them.

C. Process Framework for Resilience Requirement Elicitation

Microservice Architecture is an architecture adapted to the DevOps process [62]; thus the resilience requirement elicitation of an MSA System is also an iterative process.

Fig. 9. Process Framework for Resilience Requirement Elicitation

Fig. 10. A Sample Identified Architecture of an MSA System in the Resilience Goal Decomposition View

Figure 9 outlines the steps (bold-line boxes) of resilience requirement elicitation in MSA System development. The proposed process framework has three major stages: System Identification, Resilience Goal Setting and Disruption Analysis. DevOps processes after requirement elicitation, such as system implementation, deployment and monitoring data collection (the dashed-line box), are not discussed in this paper.

1) Stage I: System Identification:

The architecture of an MSA System, including services and system resources, is identified first as the basis of resilience requirement elicitation. The identified architecture of an MSA System is represented in the Resilience Goal Decomposition View by a number of assets linked with each other, showing what services are provided by the MSA System and what system resources ensure the running of each service, as shown in Figure 10.

• Identify Services:
Services in an MSA System are identified in order to set service resilience goals and find the dependency system resources of these services. Since the Microservice Architecture is an architectural style that modularizes components as services, the microservices in an MSA System can be directly identified as the services on which service resilience goals are set.
Identified services are represented as assets in the Resilience Goal Decomposition View, and these service assets are linked to an asset representing the MSA System.

• Identify Dependency Resources:
Service degradations are caused by disruptions which affect the system resources ensuring the running of services, and resilience mechanisms are established to take action on these affected system resources. Thus the dependency system resources of a service need to be identified in resilience requirement elicitation.


In MSA Systems, services are deployed as independently of each other as possible, so each service in an MSA System usually has its own individual supporting system resources. The resource dependency of a service can generally be structured as Service ←supports− VirtualInfrastructure ←supports− PhysicalInfrastructure, which follows the SaaS-PaaS-IaaS architectural pattern of cloud systems (a minimal sketch of recording such dependencies follows below).
The granularity of dependency resource identification should not be too fine, because too many identified system resources may reduce the readability of resilience requirement diagrams. Only system resources that can be manipulated by system operation components need to be identified, since system behaviors are the terminal goals in the Resilience Goal Decomposition View.
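The identified assets and their links can be recorded as a simple dependency graph. The sketch below is one minimal representation; all asset names are hypothetical.

    # each asset maps to the system resources that support it, following
    # Service <-supports- VirtualInfrastructure <-supports- PhysicalInfrastructure
    depends_on = {
        "order-service":   ["order-container"],
        "order-container": ["vm-1"],
        "vm-1":            ["server-a"],
    }

    def dependency_closure(asset):
        """All system resources an asset (transitively) depends on."""
        stack, seen = [asset], []
        while stack:
            for dep in depends_on.get(stack.pop(), []):
                if dep not in seen:
                    seen.append(dep)
                    stack.append(dep)
        return seen

    print(dependency_closure("order-service"))
    # ['order-container', 'vm-1', 'server-a']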

In Stage I, the architecture of the MSA System is identified and represented in the Resilience Goal Decomposition View, in order to set resilience goals and analyze disruptions in later stages. As MSA Systems always face changes, the architecture in the Resilience Goal Decomposition View should also be renewed when microservices are added or removed, or when the deployment architecture is changed (e.g. from virtual machine deployment to container deployment).

2) Stage II: Resilience Goal Setting:

In Stage II, performance benchmarks of services and resources are determined, and then service resilience goals are set based on these performance benchmarks in the Resilience Goal Decomposition View.

• Set Performance Benchmarks:
As defined in MRMM, the service resilience of a service degradation is measured by performance changes of services compared to service performance benchmarks. So performance attributes for services should be determined after the services in an MSA System are identified. Table I lists common service performance attributes from standards for IT systems such as SPEC[63], TPC [64][65] and ETSI[66], datasets for service performance evaluation [67][68], and other research on web service metrics selection [69]. The selection of service performance attributes depends on the type of service, as discussed in Section III. Sometimes specific performance attributes are required by the business needs of the MSA System. For example, Netflix uses how many streams are started in a given second as a performance attribute because it impacts the success of the business[70].
Performance attributes are determined not only for services but also for services' dependency resources, because a service resilience goal may be refined to resilience goals of its dependency resources if service performance is directly influenced by resource performance, and anomalies in resource performance can also be regarded as disruptions in MSA Systems.
Table II shows collectable metrics of virtual infrastructures in monitoring tools like Zabbix, cAdvisor and Heapster (no available performance standard for virtual infrastructures was found as far as we studied), and Table III lists some performance attributes of physical infrastructures from the standards of SPEC[63], TPC[64] and EEMBC[71].

TABLE I
COMMON SERVICE PERFORMANCE ATTRIBUTES

Performance Attribute | Description                                               | Unit
Response Time         | Time taken to send a request and receive a response      | ms
Availability          | Number of successful invocations / total invocations     | %
Throughput            | Total number of invocations in a given period of time    | invokes/s
Successability        | Number of response messages / number of request messages | %

TABLE II
PERFORMANCE ATTRIBUTES OF VIRTUAL INFRASTRUCTURES

Performance Attribute    | Description                       | Unit
system.cpu.util          | CPU utility                       | %
system.cpu.intr          | CPU interrupts per second         | intr/s
proc.num                 | Number of processes               | processes
vm.memory.size.available | Available memory                  | %
system.swap.size.pfree   | Free swap space (percentage)      | %
vfs.fs.inode.pfree       | Free inodes on / (percentage)     | %
vfs.fs.size.pfree        | Free disk space on / (percentage) | %
net.if.in                | Incoming network traffic          | bps
net.if.out               | Outgoing network traffic          | bps

After performance attributes are determined, performance benchmarks are set on these attributes. As defined in Section III, a performance benchmark may be either a constant value or a dynamic value varying over time. Constant benchmark values can be set by referring to historical mean values, expert experience, or suggested values in standards. Dynamic benchmark values can be generated by running time-series prediction algorithms (like EWMA, ARIMA, LSTM, etc.) on historical performance data (a minimal sketch is given below).
Performance benchmarks are represented as domain properties in the Resilience Goal Decomposition View. Each performance benchmark is linked to the asset of a service or a resource, to indicate that the benchmark is set for that service/resource. The specification of a performance benchmark in the Resilience Goal Decomposition View contains the performance attribute of the benchmark, the benchmark value (for dynamic benchmark values this may be a reference to the data training result), and references to related standards/discussion minutes/algorithms if necessary.
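As one concrete way to derive a dynamic benchmark, the sketch below computes an EWMA over historical samples and scales it by a tolerance margin. The smoothing factor and margin are assumptions to be tuned per service, and the downward margin applies to attributes where higher values are better (such as TPS).

    def ewma_benchmark(history, alpha=0.3, margin=0.9):
        """Predict the next benchmark value P_B(S, A, t) as an
        exponentially weighted moving average of past performance,
        scaled down so small fluctuations do not count as degradations."""
        ewma = history[0]
        for value in history[1:]:
            ewma = alpha * value + (1 - alpha) * ewma
        return ewma * margin

    tps_history = [50, 52, 49, 51, 50, 48, 50]   # hypothetical TPS samples
    print(round(ewma_benchmark(tps_history), 1), "requests/second")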


TABLE III
PERFORMANCE ATTRIBUTES FOR PHYSICAL INFRASTRUCTURE

Performance Attribute | Description | Unit
Instances Created | Application virtual instances | instances
Elasticity | Whether the work performed by application instances scales linearly in a cloud when compared to the performance of application instances during the baseline phase | %
Mean Instance Provisioning Time | The time interval between the instance provisioning request and connectivity to port 22 on the instance | s
CPU Speed | Average time for a computer to complete single tasks | s
Network Throughput | Throughput of network in unit of time | bps
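To illustrate how a dynamic benchmark of the kind mentioned above might be produced, the following minimal Python sketch smooths historical data with an EWMA and derives a lower violation band; the smoothing factor and tolerance are our assumptions for illustration, not values prescribed by the framework.

```python
# Minimal sketch: dynamic performance benchmark via EWMA (assumed inputs).
import pandas as pd

def ewma_benchmark(history: pd.Series, alpha: float = 0.3,
                   tolerance: float = 0.2) -> pd.Series:
    """Smooth historical performance data and widen it into a benchmark band.

    history   -- per-interval values of a performance attribute (e.g. TPS)
    alpha     -- EWMA smoothing factor; larger reacts faster to recent data
    tolerance -- fraction by which observed values may fall short before
                 the benchmark is considered violated (an assumed policy)
    """
    smoothed = history.ewm(alpha=alpha).mean()
    lower = smoothed * (1.0 - tolerance)  # values below this line are anomalies
    return lower
```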

• Set Resilience Goals:
As defined in Section III, the service resilience metrics of a service degradation are calculated from the service performance benchmark. In Resilience Goal Decomposition View, a service resilience goal is therefore linked to the domain property of the service performance benchmark. For each service resilience goal, thresholds for Disruption Tolerance, Recovery Time and Performance Loss are set. Not all three metrics need to be set in a service resilience goal, since some of them are unnecessary or meaningless for certain performance attributes (for example, Performance Loss is meaningless for response time). Thresholds of Disruption Tolerance and Recovery Time can be set by referring to existing standards or expert experience on reliability or fault tolerance (e.g. ETSI [72] suggests values for minimum acceptable response time and outage recovery time for different types of NFV services). The threshold of Performance Loss is set depending on historical disruption data or the business needs of the target MSA System. If the performance of a service is directly influenced by the performance of its dependency resources (e.g. the TPS of a service is influenced by the CPU processing speed of the service's container), the service resilience goal is refined to resilience goals of the dependency resources, so that system resources affected by disruptions can be found by detecting resource resilience goal violations. Thresholds on the metrics of resource resilience goals are set in the same way as for service resilience goals. Figure 11 shows an example in which a service resilience goal is refined to resource resilience goals in Resilience Goal Decomposition View; a minimal threshold-checking sketch follows below.
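As a concrete illustration of how such thresholds could be checked, here is a minimal Python sketch; the field names and threshold semantics are our assumptions for illustration, not the paper's normative definitions.

```python
# Minimal sketch: checking a service degradation against resilience goal
# thresholds. Unset thresholds (None) are skipped, mirroring the point that
# some metrics are meaningless for certain performance attributes.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ResilienceGoal:
    attribute: str                          # e.g. "response_time"
    disruption_tolerance: Optional[float]   # max allowed drop below benchmark
    recovery_time: Optional[float]          # max allowed seconds to recover
    performance_loss: Optional[float]       # max allowed cumulative loss

def violates(goal: ResilienceGoal, max_deviation: float,
             recovery_seconds: float, cumulative_loss: float) -> bool:
    """True if any configured threshold of the goal is exceeded."""
    checks = [
        (goal.disruption_tolerance, max_deviation),
        (goal.recovery_time, recovery_seconds),
        (goal.performance_loss, cumulative_loss),
    ]
    return any(limit is not None and observed > limit
               for limit, observed in checks)
```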

As with Stage I, Stage II is an iterative process in which performance benchmarks and service resilience goals are updated as the MSA System changes.

3) Stage III: Disruption Analysis:

Fig. 11. A Service Resilience Goal is Refined to Resource Resilience Goals

Fig. 12. Detected Service Degradations Violating Service Resilience Goals

In Stage III, service degradations that violate service resilience goals are detected, and these service degradations are diagnosed to find possible root cause disruptions. Resilience mechanisms are then established to mitigate the disruptions' impact.

• Service Degradation Analysis:
To achieve service resilience goals in MSA Systems, service degradations that violate service resilience goals are detected and analyzed. Service resilience goal violations in an MSA System can easily be detected by setting alarms on service resilience metrics in the monitoring component; the sketch below shows how the metrics themselves can be derived from monitoring data. The corresponding service degradations are represented as obstacles to service resilience goals in Resilience Goal Decomposition View, as shown in Figure 12. Detected service degradations are further analyzed to find possible root cause disruptions which affect services' dependency resources. Detailed techniques for inferring disruptions from service degradations are not discussed in this paper, since a large body of work already exists on performance analysis and fault diagnosis in software systems. In Resilience Goal Decomposition View, an analyzed service degradation is substituted by an obstacle representing the possible root cause disruption. The disruption is linked to the service's dependency resource which it affects, and its original link to the service resilience goal may be redirected to resource resilience goals if resource resilience goal violations are detected. Moreover, a reference to event log files may also be attached to a disruption in the form of a domain property. Figure 13 shows a sample of identified disruptions in Resilience Goal Decomposition View.
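The following Python sketch shows one plausible way to derive the three MRMM-style metrics from monitoring samples of a single degradation; the authoritative definitions are those of Section III, so treat this as an approximation under our assumptions (uniform sampling, a benchmark value per sample).

```python
# Minimal sketch: deriving MRMM-style resilience metrics from a monitored
# performance series during one degradation (illustrative approximation).
from typing import Sequence, Tuple

def degradation_metrics(values: Sequence[float], benchmark: Sequence[float],
                        interval_s: float) -> Tuple[float, float, float]:
    """Return (disruption_tolerance, recovery_time_s, performance_loss).

    values     -- observed performance samples during the degradation
    benchmark  -- benchmark values for the same timestamps
    interval_s -- sampling interval in seconds
    """
    shortfalls = [max(b - v, 0.0) for v, b in zip(values, benchmark)]
    disruption_tolerance = max(shortfalls)            # worst observed drop
    degraded = sum(1 for s in shortfalls if s > 0.0)  # samples below benchmark
    recovery_time_s = degraded * interval_s
    performance_loss = sum(shortfalls) * interval_s   # area below the benchmark
    return disruption_tolerance, recovery_time_s, performance_loss
```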

• Establish Resilience Mechanisms:


Fig. 13. Identified Disruptions in Resilience Goal Decomposition View

Fig. 14. Using Resilience Mechanisms to Resolve Disruptions

When disruptions in MSA Systems are identified, corresponding resilience mechanisms are established to prevent service degradations from violating service resilience goals. As described in many practitioner books on microservices [14][16], MSA Systems use the typical fault-tolerance mechanisms of large-scale internet applications [60] (e.g. Circuit Breaker, Bulkhead, Active-Active Redundancy) as resilience mechanisms. These mechanisms are usually implemented in standalone components or middlewares (like Hystrix and Istio) in MSA Systems. Patterns of resilience mechanisms differ depending on the types of disruptions and the affected system resources, as has been catalogued in the literature [73]. A resilience mechanism is represented as a goal refining service/resource resilience goals and resolving disruptions. The resilience mechanism is further refined until the detailed system behaviors, by which individual system components manipulate the affected resources, are figured out; these system behaviors can then be represented in Resilience Mechanism Implementation View. A reference to the Resilience Mechanism Implementation View and other references, such as resilience mechanism patterns, are attached to the resilience mechanism as domain properties. Figure 14 shows a sample resilience mechanism in Resilience Goal Decomposition View; a minimal sketch of one such mechanism follows below.
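For concreteness, here is a minimal, generic Python sketch of the Circuit Breaker pattern mentioned above; it illustrates the pattern only, not the Hystrix or Istio implementation, and the threshold and timeout values are arbitrary assumptions.

```python
# Minimal sketch of a Circuit Breaker: fail fast after repeated failures,
# then allow a probe call once a cool-down period has elapsed.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout_s = reset_timeout_s      # cool-down before a probe
        self.failures = 0
        self.opened_at = None                       # None => circuit closed

    def call(self, operation):
        """Invoke `operation`; fail fast while the circuit is open."""
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None                   # half-open: allow a probe
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()        # open (or re-open) circuit
            raise
        self.failures = 0                           # success resets the count
        return result
```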

After Stage III, the resilience requirement elicitation process for an MSA System is finished for the current iteration.

VI. CASE STUDY

In order to verify the feasibility of our proposed resilience requirement model and resilience requirement elicitation process framework, we conducted a case study on an MSA System.

Fig. 15. User Interface of KAOSer

One of the benchmark MSA Systems proposed in the literature [74] was used as our target system.

In this study, the documentation and deployment configuration files of the target system were read to identify the system's services and resources in Stage I. Meanwhile, we deployed the target system on a cluster of servers, and used tools to generate workloads of user operations and to collect performance data of services and resources, in order to simulate a real-world running MSA System. By referring to Netflix's proposed methodology of finding faults in MSA Systems [70], we randomly injected faults into our 24*7 running system to find disruptions causing service resilience goal violations. The detected service degradations and the monitoring data for service performance were collected as inputs of Stage II and Stage III.

In order to draw diagrams of Resilience Goal Decomposition View, we developed an open source web application called KAOSer 1. Users can create and edit elements in Resilience Goal Decomposition View, and export the diagram and its textual documentation to local storage. Figure 15 shows the user interface of KAOSer. Users can drag elements into the diagram area from the toolbox on the left, and edit the textual specifications of these elements by clicking on them.

A. System Description

Sock Shop 2 is an open source MSA System for demonstration and testing of microservice and cloud-native technologies. It is built using Spring Boot, Go kit and Node.js, and is packaged in Docker containers. Sock Shop provides the basic services of an online shopping system; Figure 16 shows the architecture of these services.

We deployed Sock Shop on a Kubernetes cluster with one master node and three worker nodes. A Controller Node (which contains a set of tools for deployment, workload simulation, performance monitoring, fault injection, etc.) was used to generate and collect the data necessary for resilience requirement elicitation. Figure 17 shows the deployment scenario of Sock Shop in our case study.

1 https://github.com/XLab-Tongji/KAOSer
2 https://github.com/microservices-demo/microservices-demo


Fig. 16. Sock Shop Architecture

Fig. 17. Sock Shop Deployment Scenario (a Kubernetes cluster with one master node and three worker nodes running the Sock Shop containers, plus a Controller Node running Jenkins, Locust, Chaos Toolkit and Prometheus)

B. Resilience Requirement Elicitation

The resilience requirement elicitation for Sock Shop follows the process framework proposed in Section V. Limited by the length of the paper, we cannot list all service resilience goals, disruptions and resilience mechanisms of our case study, so at each step we illustrate one typical example.

1) Stage I: System Identification:
• Identify Services:

According to the system architecture shown in Figure 16, we decomposed Sock Shop into the microservices it provides, as shown in Figure 18.

• Identify Dependency Resources:
Sock Shop was deployed on a Kubernetes cluster in this study, so the dependency system resources of services were identified according to the Kubernetes deployment configuration file of Sock Shop. In Kubernetes, a service is deployed on a pod; a pod is a group of containers having the same functionalities. Containers in a pod are deployed on the worker nodes of a Kubernetes cluster and are managed by Kubernetes management services deployed on the master node.

Fig. 18. Services in Sock Shop

In the Kubernetes configuration file of Sock Shop, there are pods providing database services to the microservices Order, User, Cart and Catalogue; these database services are identified as dependency resources of the corresponding microservices. Containers in Kubernetes are assigned to worker nodes dynamically, so we did not know on which worker node a container would be deployed; we therefore took the whole Kubernetes cluster as a dependency resource for all containers. The Kubernetes cluster is supported by its worker nodes and by the Kubernetes management applications running on the master node. Figure 19 shows the identified services and resources in the Resilience Goal Decomposition View of Sock Shop; a sketch of enumerating such resources from a live cluster follows below.
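As an aid to this identification step, services and their pod/container dependency resources can be enumerated from a running cluster with the official Kubernetes Python client, as in the minimal sketch below; the "sock-shop" namespace is an assumption matching the demo project's defaults.

```python
# Minimal sketch: listing services, pods and containers in a namespace so
# they can be recorded as assets and dependency resources.
from kubernetes import client, config

config.load_kube_config()   # reads the local kubeconfig
v1 = client.CoreV1Api()

for svc in v1.list_namespaced_service("sock-shop").items:
    print("service:", svc.metadata.name)

for pod in v1.list_namespaced_pod("sock-shop").items:
    containers = [c.name for c in pod.spec.containers]
    # Each pod and its containers become dependency resources of a service.
    print("pod:", pod.metadata.name, "containers:", containers)
```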

2) Stage II: Resilience Goal Setting:

• Set Performance Benchmarks:
Considering the metrics collectable from the monitoring tool Prometheus, and common existing research and standards on web service quality (mentioned in Section V), we chose the following metrics as the general service performance attributes, since all services in Sock Shop are transactional services:

– Response Time: Time taken to send a request and receive a response.

– Success Rate: Number of responses / number of request messages.

As instructed in performance benchmark standards like TPC [65], performance attributes that price services (like revenue per second per service interaction) are required for benchmarking transactional services. Since the final goal of resilience is to prevent business loss caused by service degradations, we also set some business performance attributes for certain services, besides the general service performance attributes. Limited by the capability of our monitoring tool, we could not directly collect data on the total price of an order. As an alternative, we used the numbers of successful orders, add-to-cart operations and online users per second as our business performance attributes, since they are collectable and directly impact the "revenue" of Sock Shop. Table IV shows all service performance attributes selected for Sock Shop. Then the performance attributes of the services' dependency resources were determined. We only show the resource performance attributes of the Order service here, since performance attributes are set similarly for the dependency resources of the other services. The Order service has three performance attributes: Response Time, Success Rate and Success Orders. In the Kubernetes cluster where Sock Shop was deployed, the Order service was deployed on a pod consisting of several container instances. Metrics of the Order service's performance attributes can be calculated from the performance attributes of its containers by the following formulas:


Fig. 19. Services and Resources in Sock Shop (diagram: each microservice with its pod and container, database pods for Order, User, Catalogue and Cart, all depending on the Kubernetes cluster with its worker nodes and master-node manager)

TABLE IV
SERVICE PERFORMANCE ATTRIBUTES OF SOCK SHOP

Service | Performance Attribute | Description | Unit
All Services | Response Time | Average time taken to send a request and receive a response | ms
All Services | Success Rate | Number of responses / number of request messages | %
Order | Success Orders | Number of finished orders per second | orders/s
Cart | Add-Cart Count | Number of add-to-cart requests per second | requests/s
User | Online Users | Number of online users | users

$$\mathit{ResponseTime}(\mathit{Service}) = \frac{\sum_{i}^{n}\big(\mathit{ResponseTime}(C_i)\times \mathit{TPS}(C_i)\big)}{\sum_{i}^{n}\mathit{TPS}(C_i)} \qquad (9)$$

$$\mathit{SuccessRate}(\mathit{Service}) = \frac{\sum_{i}^{n}\big(\mathit{SuccessRate}(C_i)\times \mathit{TPS}(C_i)\big)}{\sum_{i}^{n}\mathit{TPS}(C_i)} \qquad (10)$$

$$\mathit{SuccessOrders}(\mathit{Service}) = \sum_{i}^{n}\big(\mathit{SuccessRate}(C_i)\times \mathit{TPS}(C_i)\big) \qquad (11)$$

In Formulas 9, 10 and 11, n is the number of available container instances of a pod and C_i is the i-th container instance. The number of available instances of a pod, and the TPS, Response Time and Success Rate of containers, were selected as the performance attributes to be benchmarked because they directly influence the performance attributes of the Order service. In this case study, we used a workload generator to produce the workload of a number of users, in order to simulate a real-world system and collect performance data. The workload generator simulated the whole shopping procedure on Sock Shop (including user login, browsing items, adding items to the cart, and submitting orders), so that workload on all services was ensured. We used the continuous integration tool Jenkins to trigger workloads at certain periods every day, so that the daily behavior of a running system was simulated. Figure 20 shows the performance monitoring data under the simulated workload. By referring to the mean values of the simulated running data, we set 3 seconds and 90% as the baseline values of Response Time and Success Rate of services and containers. The baseline value of Available Instances in a pod was set to 3 instances, as configured in our deployment configuration file. For performance attributes varying over time, like Success Orders and TPS, we used time series prediction algorithms to build performance benchmarks; a minimal smoothing sketch follows below. Figure 20 shows the predicted benchmark of the service performance attribute Success Orders (the red line), calculated by Triple Exponential Smoothing from the historical monitoring data (the blue line). Figure 21 shows the representation of the performance benchmarks of services and resources in Resilience Goal Decomposition View.
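Since the benchmark in Figure 20 was produced with Triple Exponential Smoothing, the following sketch shows how such a benchmark could be fitted with the Holt-Winters implementation in statsmodels; the per-minute sampling and daily seasonality (1440 periods) are assumptions about the data layout, not parameters reported by the study.

```python
# Minimal sketch: a Triple Exponential Smoothing (Holt-Winters) benchmark
# for Success Orders, assuming one sample per minute and a daily cycle.
# Fitting requires at least two full seasonal cycles of history.
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def success_orders_benchmark(history: pd.Series, horizon: int = 1440) -> pd.Series:
    """Fit Holt-Winters on historical Success Orders and forecast a benchmark."""
    model = ExponentialSmoothing(
        history,
        trend="add",            # additive trend component
        seasonal="add",         # additive daily seasonality
        seasonal_periods=1440,  # minutes per day
    ).fit()
    return model.forecast(horizon)  # plays the role of the red line in Fig. 20
```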

• Set Resilience Goals:
We set the thresholds of Disruption Tolerance and Recovery Time by referring to suggested values in ETSI standards.


Fig. 20. Success Orders of Sock Shop under the Simulated Workload (the Blue Line), and the Performance Benchmark (the Red Line)

Fig. 21. Performance Benchmarks for the Order Service and its Dependency Resources

TABLE V
RESILIENCE GOALS OF THE ORDER SERVICE AND ITS DEPENDENCY RESOURCES

Service/Resource | Performance Attribute | Disruption Tolerance | Recovery Time | Performance Loss
Order (Service) | Response Time | 10 s | 5 s | -
Order (Service) | Success Rate | 20% | - | -
Order (Service) | Success Orders | - | - | 500 orders
Order (POD) | Available Instances | 1 instance | 2 s | -
Order (Container) | Response Time | 10 s | 5 s | -
Order (Container) | Success Rate | 20% | - | -
Order (Container) | TPS | - | - | 150 transactions

For Performance Loss, by assuming that a 5% loss of orders in a day is not acceptable for Sock Shop, the Performance Loss thresholds for Success Orders of the Order service and for TPS of the Order container instances were roughly calculated; a worked example follows below. Table V shows the thresholds in the resilience goals of the Order service and its dependency resources.
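To make the Performance Loss arithmetic concrete, the back-of-the-envelope sketch below shows how a 5% daily loss budget could translate into the threshold in Table V; the daily order volume is a purely hypothetical figure chosen only so the arithmetic lands on the table's 500-order value.

```python
# Hypothetical worked example: deriving a Performance Loss threshold from a
# 5% daily loss budget. The daily volume is an assumption, not a measured
# value from the case study.
daily_orders = 10_000  # assumed benchmark: orders completed per day
loss_budget = 0.05     # at most 5% of a day's orders may be lost
performance_loss_threshold = daily_orders * loss_budget
print(performance_loss_threshold)  # 500.0 orders, matching Table V
```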

Fig. 22. Service Resilience Goal Refinement for the Order Service

As shown in Formulas 9, 10 and 11, the performance attributes of the Order service can be calculated from the performance attributes of its pods and containers. The service resilience goals of the Order service were therefore refined to resource resilience goals of the Order service's pod and container in Resilience Goal Decomposition View, as shown in Figure 22.

3) Stage III: Disruption Analysis:
• Service Degradation Analysis:

In this case study, we proactively injected random faults into our 24*7 running Sock Shop and detected the service degradations violating the service resilience goals we had set. The following types of faults were injected, by referring to tutorials on Chaos Engineering [70] (an injection sketch follows the list):

– Generate high load on one or more CPU cores of physical machines.
– Allocate a specific amount of a physical machine's RAM.
– Put read/write pressure on disks of physical machines and containers.
– Inject latency into network traffic between containers.
– Induce packet loss on a container network.
– Reboot the physical machines.
– Kill a pod in Kubernetes.
– Kill management processes of Kubernetes.
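As one example of the network faults above, latency can be injected on a Linux network interface with tc/netem, as in this minimal sketch; the interface name and delay are assumptions, and running it requires root privileges inside the target host or container.

```python
# Minimal sketch: injecting fixed egress latency with the netem qdisc.
import subprocess

def inject_latency(interface: str = "eth0", delay_ms: int = 200) -> None:
    """Add fixed egress latency on `interface` using tc/netem."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", interface, "root",
         "netem", "delay", f"{delay_ms}ms"],
        check=True,
    )

def clear_latency(interface: str = "eth0") -> None:
    """Remove the netem qdisc, restoring normal network behavior."""
    subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"], check=True)
```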

We used the alert manager in Prometheus to raise alarms when resilience goal violations were found; a minimal sketch of such a threshold check against Prometheus follows below. In total, we detected 37 service degradations that violated service resilience goals during the 5-day run with fault injections. This number of service degradations is not surprising, because both Sock Shop and Kubernetes are open source projects, and we used only 5 physical machines to build up our system.
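The following sketch polls Prometheus' HTTP API for a service metric and checks it against a resilience goal threshold; the metric name, benchmark and tolerance values are hypothetical placeholders, not names exported by the actual deployment.

```python
# Minimal sketch: instant PromQL query plus a Disruption Tolerance check.
import requests

PROMETHEUS = "http://localhost:9090"  # assumed Prometheus address

def current_value(promql: str) -> float:
    """Run an instant PromQL query and return the first sample's value."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

benchmark = 40.0  # assumed benchmark value for Success Orders
tolerance = 10.0  # assumed Disruption Tolerance threshold
observed = current_value("order_success_orders_per_second")  # hypothetical metric
if benchmark - observed > tolerance:
    print("service resilience goal violated:", observed)
```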


Fig. 23. A Service Degradation Violating a Service Resilience Goal of the Order Service

Fig. 24. Root Cause Disruption of the Service Degradation in Figure 23

Considering the length of the paper, we only show here how one detected service degradation was analyzed and resolved. Figure 23 shows a service degradation detected on the Order service that violated the service resilience goal on Success Orders. By searching the fault injection logs and related monitoring data, we found that the service degradation was caused by the network delay we had injected into containers of the Order service; it also violated the resilience goal on TPS of the Order service's container, as shown in Figure 24.

• Establish Resilience Mechanisms:
Network delays on containers impair a container's ability to process transactions, so we planned to transfer more transactions to other healthy container instances whenever a network delay on a container is detected. This can be realized by using a Service Mesh to control network traffic among containers. To implement container network traffic control, the sidecar component Envoy is injected into all containers to proxy container network connections, and the Service Mesh middleware Istio is used to monitor and manage the traffic among sidecars; Kubernetes is responsible for injecting the sidecars into containers. Figure 25 shows how this resilience mechanism is represented in Resilience Goal Decomposition View, and Figure 26 shows its representation in Resilience Mechanism Implementation View. After integrating the Service Mesh into Kubernetes, we tested the resilience mechanism by injecting the same fault again; the collected monitoring data are shown in Figure 27.

Fig. 25. Resilience Mechanism Established for Container Network Delay

Fig. 26. Resilience Mechanism in Resilience Mechanism Implementation View

The service degraded slightly during the test, but no alarm on resilience goal violation was reported, which indicates that the established resilience mechanism for the disruption is effective.

VII. CONCLUSION

In recent years, the microservice architecture has become a mainstream architecture adopted by internet companies.

Fig. 27. Monitoring Data of Success Orders after Integrating Service Mesh into Kubernetes


In cases where achieving ever-higher system reliability is no longer affordable and failure is inevitable [75], microservice practitioners have started to use the word "resilience" to describe the ability to cope with failures. However, due to the lack of consensus on definitions and measurements of resilience, microservice practitioners seldom have a clear idea of how to evaluate the resilience of an MSA System, or of how resilient an MSA System is supposed to be.

In this work, we have made the following contributions:
• The definition of microservice resilience is provided by referring to systematic studies on resilience in other scientific areas, and a Microservice Resilience Measurement Model (MRMM) is proposed to measure the service resilience of service degradations.
• A Service Resilience Requirement Model is proposed to represent service resilience goals, disruptions and resilience mechanisms. The requirement model uses a goal model to refine service resilience goals, with thresholds on the resilience metrics of MRMM, down to system behaviors to be implemented in MSA Systems.
• A process framework for microservice resilience requirement elicitation is proposed. The process framework outlines the steps to set service resilience goals, analyze disruptions, and establish resilience mechanisms against disruptions.

Possible limitations of our proposed work, which can be improved in future work, include the following:
• The MRMM model gives measurable resilience metrics to evaluate a single service's resilience; it may be worthwhile to find a modeling technique that models system-level resilience on top of service resilience.
• Verification of goal satisfaction is an important line of research in goal modeling. We have already given mathematical presentations of the notions in our measurement model and requirement model. How to encode these mathematical presentations into formal languages, and verify them with model checkers, is future work.
• In our case study, an open source microservice project was used as the target system. The target system contains few enough microservices that we could build the resilience documents with manual effort. For large-scale MSA Systems, generating resilience requirements completely manually is difficult because there are many services and resources to be identified, so auto-generation techniques for our resilience requirement model (e.g. generating the architecture of the system by reading deployment configuration files) are needed.

REFERENCES

[1] Martin Fowler and James Lewis. Microservices. ThoughtWorks. http://martinfowler.com/articles/microservices.html [last accessed on February 17, 2015], 2014.
[2] Todd Hoff. Lessons learned from scaling Uber to 2000 engineers, 1000 services, and 8000 git repositories, 2017.
[3] Tony Mauro. Adopting microservices at Netflix: Lessons for architectural design, 2016.
[4] Jez Humble and David Farley. Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Pearson Education, 2010.
[5] Len Bass, Ingo Weber, and Liming Zhu. DevOps: A Software Architect's Perspective. Addison-Wesley Professional, 2015.
[6] Nicola Dragoni, Saverio Giallorenzo, Alberto Lluch Lafuente, Manuel Mazzara, Fabrizio Montesi, Ruslan Mustafin, and Larisa Safina. Microservices: yesterday, today, and tomorrow. In Present and Ulterior Software Engineering, pages 195–216. Springer, 2017.
[7] Les Hatton. Reexamining the fault density component size connection. IEEE Software, 14(2):89–97, 1997.
[8] Fabrizio Montesi and Janine Weber. Circuit breakers, discovery, and API gateways in microservices. arXiv preprint arXiv:1609.05830, 2016.
[9] Christian Esposito, Aniello Castiglione, and Kim-Kwang Raymond Choo. Challenges in delivering software in the cloud as microservices. IEEE Cloud Computing, 3(5):10–14, 2016.
[10] Haryadi S. Gunawi, Mingzhe Hao, Riza O. Suminto, Agung Laksono, and Kurnia J. Eliazar. Why does the cloud stop computing?: Lessons from hundreds of service outages. In ACM Symposium on Cloud Computing, 2016.
[11] International Organization for Standardization. ISO/IEC 25010:2011 Systems and Software Engineering — Systems and Software Quality Requirements and Evaluation (SQuaRE) — System and Software Quality Models. ISO, 2011.
[12] Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-Anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, and Vincentius Martin. What bugs live in the cloud? A study of 3000+ issues in cloud systems. 2014.
[13] Haryadi S. Gunawi, Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah Bidokhti, Caitie McCaffrey, et al. Fail-slow at scale: Evidence of hardware performance faults in large production systems. ACM Transactions on Storage (TOS), 14(3):23, 2018.
[14] Sam Newman. Building Microservices. O'Reilly Media, Inc., 2015.
[15] Eberhard Wolff. Microservices: Flexible Software Architecture. Addison-Wesley Professional, 2016.
[16] Irakli Nadareishvili, Ronnie Mitra, Matt McLarty, and Mike Amundsen. Microservice Architecture: Aligning Principles, Practices, and Culture. O'Reilly Media, Inc., 2016.
[17] A. Bondavalli et al. Research roadmap deliverable D3.2. AMBER: Assessing, Measuring and Benchmarking Resilience, funded by the European Union, 2009.
[18] Raquel Almeida and Marco Vieira. Changeloads for resilience benchmarking of self-adaptive systems: a risk-based approach. In Dependable Computing Conference (EDCC), 2012 Ninth European, pages 173–184. IEEE, 2012.
[19] Jeremy Dick, Elizabeth Hull, and Ken Jackson. Requirements Engineering. Springer, 2017.
[20] Eric Evans. Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley Professional, 2004.
[21] Crawford S. Holling. Resilience and stability of ecological systems. Annual Review of Ecology and Systematics, 4(1):1–23, 1973.
[22] Angela Weber Righi, Tarcisio Abreu Saurin, and Priscila Wachs. A systematic literature review of resilience engineering: Research areas and a research agenda proposal. Reliability Engineering & System Safety, 141:142–152, 2015.
[23] Seyedmohsen Hosseini, Kash Barker, and Jose E. Ramirez-Marquez. A review of definitions and measures of system resilience. Reliability Engineering & System Safety, 145:47–61, 2016.
[24] Michael Ungar. Qualitative contributions to resilience research. Qualitative Social Work, 2(1):85–102, 2003.
[25] Michel Bruneau, Stephanie E. Chang, Ronald T. Eguchi, George C. Lee, Thomas D. O'Rourke, Andrei M. Reinhorn, Masanobu Shinozuka, Kathleen Tierney, William A. Wallace, and Detlof von Winterfeldt. A framework to quantitatively assess and enhance the seismic resilience of communities. Earthquake Spectra, 19(4):733–752, 2003.
[26] Devanandham Henry and Jose Emmanuel Ramirez-Marquez. Generic metrics and quantitative approaches for system resilience as a function of time. Reliability Engineering & System Safety, 99:114–122, 2012.
[27] Nita Yodo and Pingfeng Wang. Engineering resilience quantification and system design implications: A literature survey. Journal of Mechanical Design, 138(11):111408, 2016.
[28] Nicola Dragoni, Ivan Lanese, Stephan Thordal Larsen, Manuel Mazzara, Ruslan Mustafin, and Larisa Safina. Microservices: How to make your application scale. arXiv preprint arXiv:1702.07149, 2017.
[29] Daniel Richter, Marcus Konrad, Katharina Utecht, and Andreas Polze. Highly-available applications on unreliable infrastructure: Microservice architectures in practice. In Software Quality, Reliability and Security Companion (QRS-C), 2017 IEEE International Conference on, pages 130–137. IEEE, 2017.
[30] Nane Kratzke. About microservices, containers and their underestimated impact on network performance. arXiv preprint arXiv:1710.04049, 2017.
[31] Giovanni Toffetti, Sandro Brunner, Martin Blochlinger, Florian Dudouet, and Andrew Edmonds. An architecture for self-managing microservices. In Proceedings of the 1st International Workshop on Automated Incident Management in Cloud, pages 19–24. ACM, 2015.
[32] Thomas Soenen, Wouter Tavernier, Didier Colle, and Mario Pickavet. Optimising microservice-based reliable NFV management & orchestration architectures. In Resilient Networks Design and Modeling (RNDM), 2017 9th International Workshop on, pages 1–7. IEEE, 2017.
[33] Tim Zwietasch. Online failure prediction for microservice architectures. Master's thesis, 2017.
[34] Stefan Haselbock, Rainer Weinreich, and Georg Buchgeher. Decision guidance models for microservices: service discovery and fault tolerance. In Proceedings of the Fifth European Conference on the Engineering of Computer-Based Systems, page 4. ACM, 2017.
[35] Victor Heorhiadi, Shriram Rajagopalan, Hani Jamjoom, Michael K. Reiter, and Vyas Sekar. Gremlin: systematic resilience testing of microservices. In Distributed Computing Systems (ICDCS), 2016 IEEE 36th International Conference on, pages 57–66. IEEE, 2016.
[36] Antonio Brogi, Andrea Canciani, Davide Neri, Luca Rinaldi, and Jacopo Soldani. Towards a reference dataset of microservice-based applications. In International Conference on Software Engineering and Formal Methods, pages 219–229. Springer, 2017.
[37] Thomas F. Düllmann and André van Hoorn. Model-driven generation of microservice architectures for benchmarking performance and resilience engineering approaches. In the ACM/SPEC International Conference on Performance Engineering, pages 171–172, 2017.
[38] Axel van Lamsweerde. Goal-oriented requirements engineering: A guided tour. In Requirements Engineering, 2001. Proceedings. Fifth IEEE International Symposium on, pages 249–262. IEEE, 2001.
[39] Kenneth S. Rubin and Adele Goldberg. Object behavior analysis. Communications of the ACM, 35(9):48–62, 1992.
[40] Axel van Lamsweerde and Emmanuel Letier. From object orientation to goal orientation: A paradigm shift for requirements engineering. In Radical Innovations of Software and Systems Engineering in the Future, pages 325–340. Springer, 2004.
[41] Anne Dardenne, Axel van Lamsweerde, and Stephen Fickas. Goal-directed requirements acquisition. Science of Computer Programming, 20(1-2):3–50, 1993.
[42] Eric S. K. Yu. Towards modelling and reasoning support for early-phase requirements engineering. In Requirements Engineering, 1997. Proceedings of the Third IEEE International Symposium on, pages 226–235. IEEE, 1997.
[43] Annie I. Anton. Goal-based requirements analysis. In Requirements Engineering, 1996. Proceedings of the Second International Conference on, pages 136–144. IEEE, 1996.
[44] John Mylopoulos, Lawrence Chung, and Brian Nixon. Representing and using nonfunctional requirements: A process-oriented approach. IEEE Transactions on Software Engineering, 18(6):483–497, 1992.
[45] Alexei Lapouchnian. Goal-oriented requirements engineering: An overview of the current research. University of Toronto, page 32, 2005.
[46] Jennifer Horkoff, Fatma Basak Aydemir, Evellin Cardoso, Tong Li, Alejandro Mate, Elda Paja, Mattia Salnitri, John Mylopoulos, and Paolo Giorgini. Goal-oriented requirements engineering: a systematic literature map. In Requirements Engineering Conference (RE), 2016 IEEE 24th International, pages 106–115. IEEE, 2016.
[47] Hongbing Wang, Suxiang Zhou, and Qi Yu. Discovering web services to improve requirements decomposition. In Web Services (ICWS), 2015 IEEE International Conference on, pages 743–746. IEEE, 2015.
[48] Shehnila Zardari and Rami Bahsoon. Cloud adoption: a goal-oriented requirements engineering approach. In International Workshop on Software Engineering for Cloud Computing, pages 29–35, 2011.
[49] Awais Rashid, Peter Sawyer, Ana Moreira, and Joao Araujo. Early aspects: A model for aspect-oriented requirements engineering. In Requirements Engineering, 2002. Proceedings. IEEE Joint International Conference on, pages 199–202. IEEE, 2002.
[50] Robert Darimont and Axel van Lamsweerde. Formal refinement patterns for goal-driven requirements elaboration. In ACM SIGSOFT Software Engineering Notes, volume 21, pages 179–190. ACM, 1996.
[51] Axel van Lamsweerde and Emmanuel Letier. Integrating obstacles in goal-driven requirements engineering. In Proceedings of the 20th International Conference on Software Engineering, pages 53–62. IEEE Computer Society, 1998.
[52] Axel van Lamsweerde and Emmanuel Letier. Handling obstacles in goal-oriented requirements engineering. IEEE Transactions on Software Engineering, 26(10):978–1005, 2000.
[53] Axel van Lamsweerde. Elaborating security requirements by construction of intentional anti-models. In Proceedings of the 26th International Conference on Software Engineering, pages 148–157. IEEE Computer Society, 2004.
[54] Fatma Basak Aydemir, Paolo Giorgini, and John Mylopoulos. Multi-objective risk analysis with goal models. In 2016 IEEE Tenth International Conference on Research Challenges in Information Science (RCIS), pages 1–10. IEEE, 2016.
[55] Paolo Giorgini, John Mylopoulos, Eleonora Nicchiarelli, and Roberto Sebastiani. Reasoning with goal models. 2002.
[56] Kjell Jørgen Hole. Anti-Fragile ICT Systems. 2016.
[57] Chi Mai Nguyen, Roberto Sebastiani, Paolo Giorgini, and John Mylopoulos. Multi-objective reasoning with constrained goal models. Requirements Engineering, 23(2):189–225, 2018.
[58] Robert Darimont, Emmanuelle Delor, Philippe Massonet, and Axel van Lamsweerde. GRAIL/KAOS: an environment for goal-driven requirements engineering. In Proceedings of the (19th) International Conference on Software Engineering, pages 612–613. IEEE, 1997.
[59] A. van Lamsweerde. KAOS tutorial. Cediti, September 5, 2003.
[60] Michael T. Nygard. Release It!: Design and Deploy Production-Ready Software. Pragmatic Bookshelf, 2018.
[61] Paul Clements, David Garlan, Len Bass, Judith Stafford, Robert Nord, James Ivers, and Reed Little. Documenting Software Architectures: Views and Beyond. Pearson Education, 2002.
[62] Armin Balalaie, Abbas Heydarnoori, and Pooyan Jamshidi. Microservices architecture enables DevOps: migration to a cloud-native architecture. IEEE Software, 33(3):42–52, 2016.
[63] SPEC Benchmarks. Standard Performance Evaluation Corporation, 2000.
[64] TPC Benchmark A Standard Specification. Transaction Processing Performance Council. San Jose, CA, 5, 1989.
[65] Daniel A. Menasce. TPC-W: A benchmark for e-commerce. IEEE Internet Computing, 6(3):83–87, 2002.
[66] ISG NFV ETSI. ETSI GS NFV-TST 001 V1.1.1: Network Functions Virtualisation (NFV); Pre-deployment Testing; Report on Validation of NFV Environments and Services, 2016.
[67] E. Al-Masri and Q. H. Mahmoud. QoS-based discovery and ranking of web services. In 2007 16th International Conference on Computer Communications and Networks, pages 529–534, Aug 2007.
[68] Yilei Zhang, Zibin Zheng, and Michael R. Lyu. WSExpress: A QoS-aware search engine for web services. In IEEE International Conference on Web Services, 2010.
[69] Sravanthi Kalepu, Shonali Krishnaswamy, and Seng Wai Loke. Verity: a QoS metric for selecting web services and providers. In Web Information Systems Engineering Workshops, 2003. Proceedings. Fourth International Conference on, pages 131–139. IEEE, 2003.
[70] Ali Basiri, Niosha Behnam, Ruud de Rooij, Lorin Hochstein, Luke Kosewski, Justin Reynolds, and Casey Rosenthal. Chaos engineering. IEEE Software, 33(3):35–41, 2016.
[71] Embedded Microprocessor Benchmark Consortium et al. EEMBC benchmark suite, 2008.
[72] ISG NFV ETSI. ETSI GS NFV-REL 001 V1.1.1: Network Functions Virtualisation (NFV); Resiliency Requirements, 2015.
[73] Saurabh Hukerikar and Christian Engelmann. A pattern language for high-performance computing resilience. In Proceedings of the 22nd European Conference on Pattern Languages of Programs, page 12. ACM, 2017.
[74] Carlos M. Aderaldo, Claus Pahl, and Pooyan Jamshidi. Benchmark requirements for microservices architecture research. In International Workshop on Establishing the Community-wide Infrastructure for Architecture-based Software Engineering, 2017.
[75] Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, Inc., 2016.