
Rapid Quality Assurance with Requirements Smells

Henning Femmer a,*, Daniel Méndez Fernández a, Stefan Wagner b, Sebastian Eder a

a Software & Systems Engineering, Technische Universität München, Germany
b Institute of Software Technology, University of Stuttgart, Germany

Abstract

Context: Bad requirements quality can cause expensive consequences during the software development lifecycle, especially if iterations are long and feedback comes late. Objectives: We aim at a light-weight static requirements analysis approach that allows for rapid checks immediately when requirements are written down. Method: We transfer the concept of code smells to Requirements Engineering as Requirements Smells. To evaluate the benefits and limitations, we define Requirements Smells, realize our concepts for a smell detection in a prototype called Smella and apply Smella in a series of cases provided by three industrial and a university context. Results: The automatic detection yields an average precision of 59% at an average recall of 82% with high variation. The evaluation in practical environments indicates benefits such as an increase of the awareness of quality defects. Yet, some smells were not clearly distinguishable. Conclusion: Lightweight smell detection can uncover many practically relevant requirements defects in a reasonably precise way. Although some smells need to be defined more clearly, smell detection provides a helpful means to support quality assurance in Requirements Engineering, for instance, as a supplement to reviews.

Keywords: Requirements Engineering, Quality Assurance, Automatic Defect Detection, Requirements Smells

Preprint accepted at Elsevier, doi:10.1016/j.jss.2016.02.047

Contents

1 Introduction
2 Related work
  2.1 The notion of smells in software engineering
  2.2 Quality assurance of software requirements
  2.3 Discussion
3 Requirements Smells
  3.1 Requirements Smell terminology
  3.2 Requirements Smells based on ISO 29148
4 Smella: A prototype for Requirements Smell detection
  4.1 Requirements parsing
  4.2 Language annotation
  4.3 Findings identification
  4.4 Findings presentation
5 Requirements Smell detection in the process of quality assurance
6 Evaluation
  6.1 Case study design
    6.1.1 Research questions
    6.1.2 Case and subjects selection
    6.1.3 Data collection procedure
    6.1.4 Analysis procedure
    6.1.5 Validity procedure
  6.2 Results
    6.2.1 Case and subjects description
    6.2.2 RQ 1: How many Requirements Smells are present in the artifacts?
    6.2.3 RQ 2.1: How accurate is the smell detection?
    6.2.4 RQ 2.2: Which of these smells are practically relevant in which context?
    6.2.5 RQ 3: Which requirements quality defects can be detected with smells?
    6.2.6 RQ 4: How could smells help in the QA process?
    6.2.7 Evaluation of validity
7 Conclusion
  7.1 Summary of conclusions
  7.2 Relation to existing evidence
  7.3 Impact/Implications
  7.4 Limitations
  7.5 Future work
Appendix A Requirements Checklist

1. Introduction

Defects in requirements, such as ambiguities or incomplete requirements, can lead to time and cost overruns in a project [56]. Some of the issues require specific domain knowledge to be uncovered. For example, it is very difficult to decide whether a requirements artifact is complete without domain knowledge. Other issues, however, can be detected more easily: If a requirement states that a sensor should work with sufficient accuracy without detailing what sufficient means in that context, the requirement is vague and consequently not testable. The same holds for other pitfalls such as loopholes: Phrasing that a certain property of the software under development should be fulfilled as far as possible leaves room for subjective (mis-)interpretation and, thus, can have severe consequences during the acceptance phase of a product [24, 33].

To detect such quality defects, quality assurance processes often rely on reviews. Reviews of requirements artifacts, however, need to involve all relevant stakeholders [65], who must manually read and understand each requirements artifact. Moreover, they are difficult to perform. They require a high domain knowledge and expertise from the reviewers [65] and the quality of their outcome depends on the quality of the reviewer [75]. On top of all this, reviewers could be distracted by superficial quality defects such as the aforementioned vague formulations or loopholes. We therefore argue that reviews are time-consuming and costly.

Therefore, quality assurance processes would benefit from faster feedback cycles in requirements engineering (RE), which support requirements engineers and project participants in immediately discovering certain types of pitfalls in requirements artifacts. Such feedback cycles could enable a lightweight quality assurance, e.g., as a complement to reviews.

Since requirements in industry are nearly exclusively written in natural language [58] and natural language has no formal semantics, quality defects in requirements artifacts are hard to detect automatically. To face this challenge of fast feedback and the imperfect knowledge of a requirement's semantics, we created an approach that is based on what we call Requirements (Bad) Smells. These are concrete symptoms for a requirement artifact's quality defect for which we enable rapid feedback through automatic smell detection.

In this paper, we contribute an analysis of whether and to what extent Requirements Smell analysis can support quality assurance in RE. To this end, we

1. define the notion of Requirements Smells and integrate the Requirements Smells¹ concept into an analysis approach to complement (constructive and analytical) quality assurance in RE,

2. present a prototypical realization of our smell detection approach, which we call Smella, and

3. conduct an empirical investigation of our approach to better understand the usefulness of a Requirements Smell analysis in quality assurance.

¹ In the context of our studies, we use the ISO/IEC/IEEE 29148:2011 standard [33] (in the following: ISO 29148) as basis for defining requirements quality. The standard supplies a list of so-called Requirements Language Criteria, such as loopholes or ambiguous adverbs, which we use to define eight smells (see also the smell definition in Sect. 3.2).

Our empirical evaluation involves three industrial contexts: The companies Daimler AG as a representative for the automotive sector, Wacker Chemie AG as a representative for the chemical sector, and TechDivision GmbH as an agile-specialized company. We complement the industrial contexts with an academic one, where we apply Smella to 51 requirements artifacts created by students. With our evaluations, we aim at discovering the accuracy of our smell analysis taking both a technical and a practical perspective that determines the context-specific relevance of the detected smells. We further analyze which requirements quality defects can be detected with smells, and we conclude with a discussion of how smell detection could help in the (industrial) quality assurance (QA) process.

Previously published material

This article extends our previously published workshop paper [24] in the following aspects: We provide a richer discussion on the notion of Requirements Smell and give a precise definition. We introduce our (extended) tool-supported realization of our smell analysis approach and outline its integration into the QA process. We extend our first two case studies with another industrial one as well as with an investigation in an academic context to expand our initial empirical investigations by

1. investigating the accuracy of our smell detection including precision, recall, and relevance from a practical perspective,

2. analyzing which quality defects can be detected with smells and

3. gathering practitioners' feedback on how they would integrate smell detection in their QA process considering both formal and agile process environments.

Outline

The remainder of this paper is structured as follows. In Sect. 2, we describe previous work in the area. In Sect. 3, we define the concept of Requirements Smells and describe how we derived a set of Requirements Smells from ISO 29148. We introduce the tool realization in Sect. 4 and discuss the integration of smell detection in the context of quality assurance in Sect. 5. In Sect. 6, we report on the empirical study that we set up to evaluate our approach, before concluding our paper in Sect. 7.

2. Related work

In the following, we discuss work relating to the concept of natural language processing and smells in general, followed by quality assurance in RE, before critically discussing currently open research gaps.

2.1. The notion of smells in software engineering

The concept of code smells was, to the best of our knowledge, first proposed by Fowler and Beck [27] to answer the question at which point the quality of code is so low that it must be refactored. According to Fowler and Beck, the answer cannot be objectively measured, but we can look for certain concrete, visible symptoms, such as duplicated code [27] as an indicator for bad maintainability [35]. This concept of smells, as well as the list that Fowler and Beck proposed, led to a large field of research. Zhang et al. [76] provide an in-depth analysis of the state of the art in code smells. The metaphor of smells as concrete symptoms has since then been transferred to the quality of other artifacts, including (unit) test smells [72] and smells for system tests in natural language [31]. Ciemniewska et al. [12] further characterize different defects of use cases through the term use case smell. In our work, we extend the notion of smells to the broader context of requirements engineering and introduce a concrete definition for the term Requirements Smell.

2.2. Quality assurance of software requirements

The concept of Requirements Smells is located in the context of RE quality assurance (QA), which is performed either manually or automatically.

Manual QA. Various authors have worked on QA for software requirements by applying manual techniques. Some put their focus on the classification of quality into characteristics [15], others develop comprehensive checklists [2, 6, 50, 39, 38]. Regarding QA, some develop constructive QA approaches, such as creating new RE languages, e.g. [17], to prevent issues up front, others develop approaches to make analytic QA, such as reviews, more effective [69]. In a recent empirical study on analytical QA, Parachuri et al. [60] manually investigate the presence of defects in use cases. To sum it up, these works on manual QA provide analytical and constructive methods, as well as (varying) lists for defects. They strengthen our confidence that today's requirements artifacts are vulnerable to quality defects.

Automatic QA. Various publications discuss the automatic detection of quality violations in RE. We summarize existing approaches and tools, their publications, and empirical evaluations in Table 2. We also created an in-depth analysis of in total 27 related publications evaluating which quality defects or smells the approaches opt for in their described detection. In the following, we will first explain two related areas (automatic QA for redundancy and for controlled languages), before discussing automatic QA for ambiguity in general. For ambiguity, we first describe those approaches that conducted empirical evaluations of precision or recall of quality defects related, but not identical to, the ones of ISO 29148. Afterwards, we focus on publications that mention the same criteria as in the ISO 29148 (see Table 1 for this list and their respective empirical evaluations) and discuss the chosen approaches and results. We publish the complete list of each quality defect that is detected by each of the 27 papers, as well as the precision and recall (where provided), online as supplementary material [25].

Automatic QA for redundancy. One specific area of QA is avoiding redundancy and cloning. Whereas Juergens et al. [34] use ConQAT to search for syntactic identity resulting from a copy-and-paste reuse, Falessi et al. [21] aim at detecting similar content, therefore using methods from information retrieval (such as Latent Semantic Analysis [52]). Rago et al. [62] extend this work specifically for use cases. Their tool, ReqAlign, classifies each step with a semantic abstraction of the step. These publications analyze the performance of their approaches, and depending on the artifact and methods achieve precision and recall close to 1 (see Table 2).

Automatic QA for controlled languages. Another specific area is the application of controlled language and the QA of controlled language. RETA [4] specifically analyzes requirements that are written via certain requirements patterns (such as with the EARS template [54]). Their goal is to detect both conformance to the template and some of the ambiguities as defined by Berry et al. [7]. The authors report on a case study where they look at the template conformance in depth, indicating that template conformance can be classified with various NLP suites to a high accuracy (Precision > 0.85, Recall > 0.9), both with and without glossaries. However, the performance of ambiguity detection (such as the detection of pronouns) is not further discussed in the publication. Similarly, AQUSA [51] analyzes requirements written in user story format (cf. [14] for a detailed introduction into user stories), and detects various defects, such as missing rationales, where they achieve a precision of 0.63-1. Circe [1, 29] is a further tool that assumes that requirements are written in such requirements patterns and detects violations of context- and domain-specific quality characteristics by building logical models. The authors report on six exemplary findings, which were detected in a NASA case study. However, despite their value to automatic QA, such approaches require a very specific requirements structure.

Automatic QA for ambiguity in general. The remaining approaches listed in Table 2 aim at detecting ambiguities in unconstrained natural language. Since the quality defects detected by the approaches by Ciemniewska et al. [12], Kof [45], HeRA by Knauss et al. [41, 42], Kiyavitskaya et al. [40], RESI by Körner et al. [46, 47, 48], and Alpino by DeBruijn et al. [16] are not the ones discussed in ISO 29148 and since we could not find an evaluation of precision and recall of these approaches, we omit discussing these approaches in-depth here. An analysis of what these approaches focus on in detail as well as their evaluation can be found in short in Table 2 and in full length in our supplementary material online [25]. In the following, we first report on those publications that focus on criteria different from ISO 29148, but which report precision or recall. Afterwards, we describe publications that aim at detecting quality violations of ISO 29148 (see Table 1).

First, Chantree et al. [11] target the specific grammatical issue of coordination ambiguity (detecting problems of ambiguous references between parts of a sentence), mostly through statistical methods, such as occurrence and co-occurrence of words. In a case study, they report on a precision of their approach mostly between 54% and 75%, even though they do not explicitly differentiate between the detected ambiguities and the concept of pronouns. Second, Gleich et al. [30] base their approach on the ambiguity handbook, as defined by Berry et al. [7], as well as company-specific guidelines. They compare their dictionary- and POS-based approach against a gold standard which they created by letting people highlight ambiguities in requirements sentences. The gold standard deviates substantially, however, from what is considered high quality in their guidelines. Therefore, they create an additional gold standard, mostly based on the guideline rules. Consequently, their precision² varies between 34% for the pure experts' opinion and 97% for a more guideline-based gold standard. Third, Krisch and Houdek [49] focus on the detection of passive voice and so-called weak words. They present their dictionary- and POS-based approach to practitioners and find many false positives, similar to our RQ 3. On average, a precision of 12% is reported for the weak words detection. These approaches focus on very related, but not identical quality violations or smells.

² Gleich et al. calculate their metrics based on the combination of all ambiguities; unfortunately, they do not differentiate, e.g. by the type of ambiguity. Also, to our knowledge, the gold standard does not differentiate between the types. This prevents a direct comparison to their work.

Automatic QA for ISO 29148 criteria. Lastly, we specifically focus on those approaches that report to detect the criteria from the ISO 29148 standard. Table 1 provides an overview of these works and their respective evaluations.

Table 1: Related work on criteria of the ISO 29148 standard, detailed supplementary material can be found online [25]

                          ARM    QuARS                    RQA    SREE    Smella   RETA
                          [74]   [19]  [18]  [22]  [10]   [28]   [70]    [24]     [4]
Ambiguous Adv. & Adj.      –      –     –     –     –      –      –      E/Q       –
Comparatives               –      –     –     –     –      –      –      E/Q       –
Loopholes (or Options)     Q     E/Q   E/Q    E     E      –     Q*/P*   E/Q       –
Negative Terms             –      –     –     –     –      O      –      E/Q       –
Non-Verifiable Terms       –      –     –     –     –      –      –      E/Q       –
Pronouns                   –      –     –     –     –      O     Q*/P*   E/Q       O
Subjectivity               –     E/Q   E/Q    E     E      –      –      E/Q       –
Superlatives               –      –     –     –     –      –      –      E/Q       –

Legend: O=No empirical analysis, E=Examples from Case, Q=Quantification, P=Precision analyzed, R=Recall analyzed, *=Aggregated over multiple smells

The ARM tool [74] defines quality in terms of the (now superseded) IEEE 830 standard [32] and proposes generic metrics, instead of giving feedback directly to requirements engineers. The metrics are calculated through counting how often a set of predefined terms (per metric) occurs in a document, including a metric of what we call Loopholes. Even though they report on a case study with 46 specifications from NASA, only a quantitative overview is reported³. The QuARS tool [19, 18] is based on the authors' experience. Bucchiarone et al. [10] describe the use of QuARS in a case study with Siemens and show some exemplary findings. SyTwo [22] adopts the quality model of QuARS and applies it to use cases. Loopholes and Subjectivity are part of the QuARS quality model. Also RQA is built on a different, proprietary quality model, as described by Génova et al. [28], which includes negative terms as well as pronouns as quality defects. These works also build upon extending natural language with NLP annotations, such as POS tags, and searching through dictionaries for certain problematic phrases. However, we could not find a detailed empirical investigation of these tools, e.g. with regards to precision and recall. SREE is an approach by Tjong and Berry [70], which aims at detection of ambiguities with a recall of 100%. Therefore, they completely avoid all NLP approaches (since they come with imprecision), and build large dictionaries of words. The tool includes detection of loopholes, as well as pronouns; however, they report only on an aggregated precision for all the different types of ambiguities (66-68%) from two case studies. In our previous paper [24], we searched for violations of ISO 29148, yet we provided only a quantitative analysis, as well as qualitative examples. As mentioned before, RETA also issues warnings for pronouns; however, the evaluation in their paper [4] focuses on template conformance.

³ See also our RQ 1 in Sect. 6.

2.3. Discussion

Previous work has led to many valuable contributions to our field. To explore open research gaps, we now critically reflect on previous contributions from an evaluation, a quality definition and a technical perspective.

First, one gap in existing automatic QA approaches is the lack of empirical evidence, especially under realistic conditions. Only a few of the introduced contributions were evaluated using industrial requirements artifacts. Those who do apply their approach on such artifacts focus on quantitative summaries explaining which finding was detected and how often it was detected. Some authors also give examples of findings, but only few works analyze this aspect in depth with precision and recall, especially in the fuzzy domain of ambiguity (see Table 2). When looking at the characteristics that are described in ISO 29148, we have not seen a quantitative analysis of precision and recall. Furthermore, reported evidence does not include qualitative feedback from engineers who are supposed to use the approach, which could reveal many insights that cannot be captured by numbers alone. However, we postulate that the accuracy of quality violations very much depends on the respective context. This is especially true for the fuzzy domain of natural language where it is important to understand the (context-specific) impact of a finding to rate its detection for appropriateness and eventually justify resolving the issue.

Second, the existing approaches are based on proprietary definitions of quality, based on experience or, sometimes, simply on what can be directly measured. The ARM tool [74] is loosely based on the IEEE 830 [32] standard. However, as the recent literature survey by Schneider and Berenbach [67] states: “the ISO/IEC/IEEE 29148:2011 is actually the standard that every requirements engineer should be familiar with”. We are not aware of an approach that evaluates the current ISO 29148 standard [33] in this respect. As Table 1 shows, for most language quality defects of ISO 29148, there has not yet been a tool to detect these quality defects. To the best of our knowledge, for none of these factors is there a differentiated empirical analysis of precision and recall. Yet, many other quality models (most notably from the ambiguity handbook by Berry et al. [7]) and quality violations could lead to Requirements Smells, as far as they comply with the definition given in the next section.

Finally, taking a more technical perspective, our Requirements Smell detection approach does not fundamentally differ from existing approaches. Similar to previous works, we apply existing NLP techniques, such as lemmatization and POS tagging, as well as dictionaries. For the rules of the ISO 29148 standard, no parsing or ontologies (as used in other approaches) were required. However, to detect superlatives and comparatives in German, we added a morphological analysis, which we have not yet seen in related work.

In summary, in our contribution, we extend the current state of reported evidence on automatic QA for requirements artifacts via systematic studies in terms of distribution, precision, recall, and relevance, as well as by means of a systematic evaluation with practitioners under realistic conditions. We perform this on both existing as well as new quality defects taken from the ISO 29148. Therefore, we extend our previously published first empirical steps [24] to close these gaps by thorough empirical evaluation.

Table 2: Related approaches and tools, and their evaluation, detailed supplementary material can be found online [25]

Tool/Approach    Purpose (unless ambiguity det.)   Publications           Evaluation    Precision     Recall
ConQAT           Redundancy                        [34]                   E/Q/P         0.27-1        –
(Falessi)        Redundancy                        [21]                   Q/P/R         up to 96      up to 96
ReqAlign         Redundancy                        [62]                   Q/P/R         0.63          0.86
RETA             Structured Language Rules         [4]                    E/Q/P/R       0.85-0.94     0.91-1
AQUSA            User Story Rules                  [51]                   E/Q/P         0.63-1        –
CIRCE            Structured Language Rules         [29] [1]               E             –             –
(Ciemniewska)                                      [12]                   E             –             –
(Kof)                                              [44]                   E/Q           –             –
(Kiyavitskaya)                                     [40]                   E/Q           –             –
RESI                                               [46] [47] [48]         E/Q           –             –
HeRA                                               [41] [42]              E             –             –
Alpino                                             [16]                   E/Q           –             –
(Chantree)                                         [11]                   E/P/R         0.6-1         0.02-0.58
Gleich                                             [30]                   E/Q*/P*/R*    0.34-0.97     0.53-0.86
(Krisch)                                           [49]                   E/Q/P         0.12          –
ARM              RE Artifact Metrics               [74]                   Q             –             –
QuARS/SyTwo                                        [19] [18] [22] [10]    E/Q           –             –
RQA                                                [28]                   O             –             –
SREE                                               [70]                   Q*/P*         0.66-0.68*    –
Smella                                             [24]                   E/Q           –             –

Legend: O=No empirical analysis, E=Examples from Case, Q=Quantification, P=Precision analyzed, R=Recall analyzed, *=Aggregated over multiple smells

3. Requirements Smells

We first introduce the terminology on Requirements Smells as used in this paper. In a second step, we define those smells we derived from ISO 29148 and which we use in our studies, before describing the tool realization in the next section.

3.1. Requirements Smell terminology

Code smells are supposed to be an imprecise indication for bad code quality [27]. We apply this concept of smells to requirements and define it as follows: A Requirements Smell is an indicator of a quality violation, which may lead to a defect, with a concrete location and a concrete detection mechanism. In detail, we consider a smell as having the following characteristics:

1. A Requirements Smell is an indicator for a quality violation of a requirements artifact. For this definition, we understand requirements quality in terms of quality-in-use, meaning that bad requirements artifact quality is defined by its (potential) negative effects on activities in the software lifecycle that rely on these requirements artifacts (see also [26]).

2. A Requirements Smell does not necessarily lead to a defect and, thus, has to be judged in its context (supported, e.g., by indications or counter-indications). Whether a Requirements Smell finding is or is not a problem in a certain context must be individually decided for that context and is subject to reviews and other follow-up quality assurance activities.

3. A Requirements Smell has a concrete location in an entity of the requirements artifact itself, e.g. a word or a sequence. Requirements Smells always provide a pointer to a certain location that QA must inspect. In this regard, it differs from general quality characteristics, e.g. completeness, that only provide abstract criteria.

4. A Requirements Smell has a concrete detection mechanism. Due to its concrete nature, Requirements Smells offer techniques for detection of the smells. These techniques can, of course, be more or less accurate.

Furthermore, we define a quality defect as a concrete instance or manifestation of a quality violation in the artifact, in contrast to a finding, which is an instance of a smell. However, just as a smell indicates a quality violation, a finding indicates a defect. Fig. 1 visualizes the relation of these terms.

Figure 1: Terminology of Requirements Smells (simplified)
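To make the distinction between the type level (smell) and the instance level (finding) concrete, the following sketch models the terminology of Fig. 1 as simple data structures. The class and field names are hypothetical illustrations introduced here for clarity; they are not part of Smella.

```python
# A hypothetical data-model sketch of the terminology in Fig. 1; names are
# illustrative only and do not reflect Smella's internal API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Smell:
    # Type level: part of the quality model; indicates a quality violation.
    name: str                                   # e.g. "Loopholes"
    entity: str                                 # affected entity, e.g. "Word"
    detect: Callable[[str], List["Finding"]]    # concrete detection mechanism


@dataclass
class Finding:
    # Instance level: a concrete occurrence of a smell in a requirements artifact.
    # It indicates a possible quality defect and must still be judged in context.
    smell: str        # name of the smell this finding instantiates
    artifact: str     # requirements artifact the finding is present in
    start: int        # concrete location: character offset where the match starts
    end: int          # character offset where the match ends
    text: str         # matched phrase, e.g. "as far as possible"
```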

In the following, we will focus on natural language Requirements Smells, since requirements are mostly written in natural language [58]. Furthermore, the real benefits of smell detection in practice should come with automation. Therefore, the remainder of the paper discusses only Requirements Smells where the detection mechanism can be executed automatically (i.e. it requires no manual creation of intermediate or supporting artifacts).

3.2. Requirements Smells based on ISO 29148

We develop a set of Requirements Smells based on an existing definition of quality. For the investigations in scope of this paper, we take the ISO 29148 requirements engineering standard [33] as a baseline. The reasons for this are two-fold.

First, the ISO 29148 standard was created to harmonize a set of existing standards, including the IEEE 830:1998 [32] standard. It differentiates between quality characteristics for a set of requirements, such as completeness or consistency, and quality characteristics for individual requirements, such as unambiguity and singularity. The standard furthermore describes the usage of requirements in different project phases and gives exemplary contents and structure for requirements artifacts. Therefore, we argue that this standard is based on a broad agreement and acceptance. Recent literature studies come to the same conclusion [67].

Second, the standard provides readers with a list of so-called requirements language criteria which support the choice of proper language for requirements artifacts. The authors of the standard argue that violating the criteria results “in requirements that are often difficult or even impossible to verify or may allow for multiple interpretations” [33, p.12]. For defining our smells, which we describe next, we refer to this section of the standard and use all the defined requirements language criteria. We employ those criteria as a starting point and define the smells by adding the affected entities (e.g. a word) and an explanation. Here, we do not discuss the impact smells have on the quality-in-use. Essentially, smells hinder the understandability of requirements and consequently their subsequent handling and their verification (for a richer discussion, see also previous work in [26]).

Our current understanding is based on the examples given by the standard. A subset of the language criteria, namely Subjective Language, Ambiguous Adverbs and Adjectives and Non-verifiable Terms, as defined in [33], are strongly related, essentially since subjective language is a special type of ambiguity, which may lead to issues during verification. Since the intention of this work is to start with the standard as a definition of quality, in the following, we will remain with the provided definition based on the language criteria and leave the development of a precise and complete set of Requirements Smells to future work. In detail, we use the requirements language criteria to derive the smells summarized next.

Smell Name: Subjective Language
Entity: Word
Explanation: Subjective Language refers to words of which the semantics is not objectively defined, such as user friendly, easy to use, cost effective.
Example: The architecture as well as the programming must ensure a simple and efficient maintainability.

Smell Name: Ambiguous Adverbs and Adjectives
Entity: Adverb, Adjective
Explanation: Ambiguous Adverbs and Adjectives refer to certain adverbs and adjectives that are unspecific by nature, such as almost always, significant and minimal.
Example: If the (...) quality is too low, a fault must be written to the error memory.

Smell Name: Loopholes
Entity: Word
Explanation: Loopholes refer to phrases that express that the following requirement must be fulfilled only to a certain, imprecisely defined extent.
Example: As far as possible, inputs are checked for plausibility.

Smell Name: Open-ended, Non-verifiable Terms
Entity: Word
Explanation: Open-ended, non-verifiable terms are hard to verify as they offer a choice of possibilities, e.g. for the developers.
Example: The system may only be activated, if all required sensors (...) work with sufficient measurement accuracy.

Smell Name: Superlatives
Entity: Adverb, Adjective
Explanation: Superlatives refer to requirements that express a relation of the system to all other systems.
Example: The system must provide the signal in the highest resolution that is desired by the signal customer.

Smell Name: Comparatives
Entity: Adverb, Adjective
Explanation: Comparatives are used in requirements that express a relation of the system to specific other systems or previous situations.
Example: The display (...) contains the fields A, B, and C, as well as more exact build infos.

Smell Name: Negative Statements
Entity: Word
Explanation: Negative Statements are “statements of system capability not to be provided” [33]. Some argue that negative statements can lead to underspecification, such as lack of explaining the system's reaction on such a case.
Example: The system must not sign off users due to timeouts.

Smell Name: Vague Pronouns
Entity: Pronoun
Explanation: Vague Pronouns are unclear relations of a pronoun.
Example: The software must implement services for applications, which must communicate with controller applications deployed on other controllers.

Smell Name: Incomplete References
Entity: Text reference
Explanation: Incomplete References are references that a reader cannot follow (e.g. no location provided).
Example: [1] “Unknown white paper”. Peter Miller.
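To illustrate how the dictionary-based smells above could be captured in a configuration, the following sketch lists a few trigger phrases taken from the examples in this section. The lists are deliberately minimal and hypothetical; the dictionaries actually used by Smella (see Sect. 4.2) are based on the proposals of ISO 29148 and our earlier experiments, and are considerably larger.

```python
# A minimal, hypothetical dictionary configuration for the dictionary-based
# smells; the phrases below merely echo the examples given in this section.
SMELL_DICTIONARIES = {
    "Subjective Language": ["user friendly", "easy to use", "cost effective", "simple", "efficient"],
    "Ambiguous Adverbs and Adjectives": ["almost always", "significant", "minimal", "too low"],
    "Loopholes": ["as far as possible"],
    "Open-ended, Non-verifiable Terms": ["sufficient"],
    "Negative Statements": ["must not"],
}
```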

4. Smella: A prototype for Requirements Smell detection

Requirements Smell detection, as presented in this paper, serves to support manual quality assurance tasks (see also the next section). The smell detection is implemented on top of the software quality analysis toolkit ConQAT,⁴ a platform for source code analysis, which we extended with the required NLP features. In the following, we introduce the process for the automatic part of the approach, i.e. the detection and reporting of Requirements Smells. To the best of our knowledge, there is no tool, other than the ones mentioned in related work, that detects and presents these smells in natural language requirements documents.

The process takes requirements artifacts in various formats (MS Word, MS Excel, PDF, plain text, comma-separated values) and consists of four steps (see also Fig. 2):

1. Requirements parsing of the requirements artifacts into single items (e.g. sections or rows), resulting in plain texts, one for each item

2. Language annotation of the requirements with meta-information

3. Findings identification in the requirements, based on the language annotations

4. Presentation of a human-readable visualization of the findings as well as a summary of the results

⁴ http://www.conqat.org

Figure 2: The overall smell detection process

The techniques behind these steps are explained in the following sections.

4.1. Requirements parsing

Our current tool is able to process several file formats: MS Word, MS Excel, PDF, plain text and comma-separated values (CSV). Depending on the format, the files are parsed in different ways. Plain text and PDF are taken as is and parsed file by file. Microsoft Word files are grouped by their sections. For Microsoft Excel and CSV files, we define which columns represent the IDs or names (if there are any) and which columns should be used as text input to detect smells.

If a file is written in a known template, such as a common template for use cases, we can make use of this template to understand structural defects, such as lacking content items in a template. In the remainder of this paper, however, we focus on the natural language Requirements Smells as provided by the ISO standard.
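The paper names the supported input formats but not the parsing code itself. As a rough illustration of the CSV case, the sketch below splits an export (e.g. from JIRA) into single requirement items; the column names "ID" and "Text" are assumptions made for this example and are not fixed by Smella, which lets the user configure them.

```python
# A minimal sketch of parsing a CSV export into single requirement items.
# Column names are illustrative assumptions, not Smella's actual configuration.
import csv

def parse_csv_requirements(path, id_column="ID", text_columns=("Text",)):
    items = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # concatenate all configured text columns into one item text
            text = " ".join(row[c] for c in text_columns if row.get(c))
            items.append((row.get(id_column, ""), text))
    return items
```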

4.2. Language annotation

For the annotation and smell detection steps, we employ three techniques from Natural Language Processing (NLP) [36]. Table 3 additionally shows which of the techniques we use for which smell.

POS Tagging: For two smells, we use part-of-speech (POS) tagging. Given a sentence in natural language, it determines the role and function of each single word in the sentence. The output is a so-called tag for each word indicating, for instance, whether a word is an adjective, a particle, or a possessive pronoun. We used the Stanford NLP library [71] and the RFTagger [66] for this. Both are statistical, probabilistic taggers that train models similar to Hidden Markov Models based on existing databases of tagged texts. A detailed introduction into the technical details of POS tagging is beyond the scope of this paper but can be found, for example, in [36]. We use POS tagging to determine so-called substituting pronouns. These are pronouns that do not repeat the original noun and, thus, need a human's interpretation of their dependency.

Morphological Analysis: Based on POS tagging, we perform a more detailed analysis of text and determine a word's inflection. This includes, inter alia, determining a verb's tense or an adjective's comparison. We use this technique to analyze if adjectives or adverbs are used in their comparative or superlative form.

Dictionaries & Lemmatization: For the remaining five smells, we use dictionaries based on the proposals of the standard [33] and on our experiences from first experiments in a previous work [24]. We furthermore apply lemmatization for these words, which is a normalization technique that reproduces the original form of a word. In other words, if a lemmatizer is applied to the words were, is or are, the lemmatizer will return for all three the word be. Lemmatization is in its purpose very similar to stemming (see, e.g. the Porter Algorithm [61]), yet not based on heuristics but on the POS tag as well as the word's morphological form. For Requirements Smells, the difference is significant: For example, the words use and useful stem to the same word origin (use), but to different lemmas (i.e. meanings; use and useful). Whereas the lemma use is mostly clear to all stakeholders, the lemma useful is easily misinterpreted.
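As a rough, non-authoritative illustration of this annotation step, the sketch below uses NLTK instead of the Stanford tagger and RFTagger used by Smella; it produces a (token, POS tag, lemma) triple per word, which is the kind of information the findings identification step relies on.

```python
# An illustrative annotation sketch using NLTK (not the taggers used by Smella).
# Requires: pip install nltk, plus nltk.download("punkt"),
# nltk.download("averaged_perceptron_tagger") and nltk.download("wordnet").
import nltk
from nltk.stem import WordNetLemmatizer

_lemmatizer = WordNetLemmatizer()

def annotate(sentence):
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)  # Penn Treebank tags, e.g. JJS = superlative adjective
    result = []
    for word, tag in tagged:
        wn_pos = {"J": "a", "V": "v", "R": "r"}.get(tag[0], "n")  # map tag to a WordNet POS
        result.append((word, tag, _lemmatizer.lemmatize(word.lower(), wn_pos)))
    return result

print(annotate("The system must provide the signal in the highest resolution."))
# e.g. ('highest', 'JJS', 'high'): the superlative form is visible in the POS tag
```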

4.3. Findings identification

Based on the aforementioned information, we identify findings. This step actually finds the parts of an artifact that exhibit bad smells. Depending on the actual smell, we use different techniques, as shown in Table 3. If the smell relates to a grammatical aspect, we search through the information from POS tagging and morphological analyses. For example, for the Superlatives Smell, we report a finding if an adjective is, according to morphologic analysis, inflected in its superlative form. If the smell does not relate to grammatical aspects but rather the semantics of the requirements, we identify the smell by matching the lemma of a word against a set of words from predefined dictionaries. Since the requirements under analysis in our cases did not contain references, incomplete references are not part of our tool at present.
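A hedged sketch of this identification step is shown below: grammatical smells are matched on POS tags, dictionary smells on lemmas or raw phrases. It builds on the annotate() function and the SMELL_DICTIONARIES example introduced earlier; the tag sets are illustrative assumptions, not Smella's actual configuration.

```python
# Illustrative identification step, reusing annotate() and SMELL_DICTIONARIES
# from the earlier sketches; tag sets are assumptions for the example.
GRAMMATICAL_SMELLS = {
    "Superlatives": {"JJS", "RBS"},    # superlative adjectives / adverbs
    "Comparatives": {"JJR", "RBR"},    # comparative adjectives / adverbs
    "Vague Pronouns": {"PRP"},         # substituting (personal) pronouns
}

def identify_findings(sentence):
    findings = []
    for word, tag, lemma in annotate(sentence):
        for smell, tags in GRAMMATICAL_SMELLS.items():
            if tag in tags:
                findings.append((smell, word))
        for smell, phrases in SMELL_DICTIONARIES.items():
            if lemma in {p for p in phrases if " " not in p}:
                findings.append((smell, word))
    # multi-word dictionary entries (e.g. "as far as possible") are matched on the raw text
    lowered = sentence.lower()
    for smell, phrases in SMELL_DICTIONARIES.items():
        for phrase in (p for p in phrases if " " in p):
            if phrase in lowered:
                findings.append((smell, phrase))
    return findings

print(identify_findings("As far as possible, inputs are checked for plausibility."))
# expected to flag the Loopholes phrase "as far as possible"
```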

4.4. Findings presentation

We implemented the presentation of findings in a prototype, which we call Smella (Smell Analysis). Smella is a web-based tool that enables viewing, reviewing and blacklisting findings as well as a hotspot analysis at an artifact level. In the Smella presentation, we display the complete requirements artifact and annotate findings in a spell checker style. This follows the idea of smells as only indications that must be evaluated in their context. Lastly, the system gives detailed information when a user hovers over a finding (see Fig. 3). In the following, we briefly describe the features of Smella to provide the reader with a rough understanding of the prototype.

View findings: At the level of a single artifact, we present the text of the artifact and its structure. We mark all findings in the text. With a click on the markers, more information about the finding is displayed. The tool provides an explanation of the rationale behind this smell and possible improvements for the finding depending on the smell (every smell has a message for improvements).

Review findings: We allow the user to write a review and to set a status for each finding, both supporting feedback mechanisms within and between project teams. A user has the possibility to accept or reject a finding but also to set a custom state, for example under review. Accepting a finding means the finding needs to be addressed. If a finding is rejected, the finding does not need to be addressed. The semantics of the custom status is open to the reviewer.

Blacklist findings: Smells are only indicators for issues. Therefore, users can reject findings. If a finding is rejected by the user, the finding is removed from the visualization and will not be presented to the user anymore.

Disable smells: Often, users are interested in only a subset of smells or even just one smell. Therefore, we allow the user to hide all findings of particular smells and to select the smells she wants to display in the artifact view.

Analyze hotspots: In this view, we present all artifacts in a colored treemap (see Fig. 4). Every box in the treemap is one artifact, with the color of the box indicating the number of findings: the more red an artifact is, the more findings it contains (the more it “smells” bad). The artifacts are grouped by their folder structure. The tool provides a summarized treemap for all smells as well as a separate treemap for all individual smells. With these treemaps, users can identify artifacts or groups of artifacts exhibiting a high number of findings – for one single smell but also for all smells together. This feature supports the identification of candidates for in-depth reviews.

Figure 3: A sample output from the smell detection tool (detailed artifact view) with some smells disabled and some findings blacklisted

Figure 4: A sample output from the smell detection tool (hotspot analysis view)

Table 3: Detection techniques for smells

Smell Name                           Detection Mechanism
Subjective Language                  Dictionary
Ambiguous Adverbs and Adjectives     Dictionary
Loopholes                            Dictionary
Open-ended, non-verifiable terms     Dictionary
Superlatives                         Morphological analysis or POS tagging
Comparatives                         Morphological analysis or POS tagging
Negative Statements                  POS tagging and dictionary
Vague Pronouns                       POS tagging: Substituting pronouns
Incomplete References                Not in scope of this study

5. Requirements Smell detection in the process of quality assurance

The Requirements Smell detection approach described in previous sections serves the primary purpose of supporting quality assurance in RE. The detection process itself is, however, not restricted to particular quality assurance tasks, nor does it depend on a particular (software) process model, as we will show in Sect. 6. Hence, a smell detection, similar to the notion of quality itself, always depends on the views in a socio-economic context. Thus, how to integrate smell detection into quality assurance needs to be answered according to the particularities of that context. In the following, we therefore briefly outline the role smell detection can generally take in the process of quality assurance. More concrete proposals on how to integrate it into specific contexts are given in our case studies in Sect. 6.

We postulate the applicability of the Requirements Smell detection in the process of both constructive and analytical quality assurance (see Fig. 5). From the perspective of a constructive quality assurance, authors can use the smell detection to increase their awareness of potential smells in their requirements artifacts and to remove smells before releasing an artifact for, e.g., an inspection. External reviewers, in turn, can then use the smell detection to prepare analytical, potentially cost-intensive, quality assurance tasks, such as a Fagan inspection [20]. Such an inspection involves several reviewers and would benefit from making potential smells visible in advance. Iterative inspection approaches are also known as phased inspections, as defined by Knight and Myers [43].

We furthermore believe that one major advantage is that the scope of our smell detection is not to enforce resolving a potential smell but to increase the awareness of such smells and to make transparent, later on, the reasoning why certain decisions have been taken. Please note that two different roles (e.g. requirements engineer and QA engineer) can take two different viewpoints on the same smell, respectively its criticality and whether it should be resolved or not. In addition, a finding could be unambiguous to the author, but unclear to the target group of readers (represented by the reviewers). Therefore, one contribution of our tool-supported smell detection is also to actively foster the communication between reviewers and authors and to enable continuous feedback between both roles. For this reason, we enable stakeholders in Smella to comment on detected smells and make explicit whether they need to be resolved or whether and why they have been accepted or rejected.

Figure 5: A suggestion for applying Requirements Smell detection in QA

6. Evaluation

For a better, empirical understanding of smells in requirements artifacts, we conducted an exploratory multi-case study with both industrial and academic cases. We particularly rely on case study research over other techniques, such as controlled experiments, because we want to evaluate our approach in practical settings under realistic conditions. For the design and reporting of the case study, we largely follow the guidelines of Runeson and Höst [63].

6.1. Case study design

Our overall research objective is as follows:

Research Objective: Analyze whether automatic analysis of Requirements Smells helps in requirements artifact quality assurance.

To reach this aim, we formulate four research questions (RQ). In the following, we introduce those research questions, the procedures for the case and subjects selection, the data collection and analysis, and the validity procedures.

6.1.1. Research questions

RQ 1: How many smells are present in requirements artifacts? To see if the automatic detection of smells in requirements artifacts could help in QA, we first need to verify that Requirements Smells exist in the real world. The answer to this question fosters the understanding of how widespread the smells under analysis are in industrial and academic requirements artifacts.

RQ 2: How many of these smells are relevant? Not only the number of detected smells is important. If many of the detected smells are false positives and not relevant for the requirements engineers and developers, it would hinder QA more than it would help. As relevancy is a rather broad concept, we break down RQ 2 into two sub-questions.

RQ 2.1: How accurate is the smell detection? The first sub-question looks at the more technical view on relevance. We want to find false positives and false negatives to determine the precision and recall of the analysis in terms of correct detection of the defined smell.

RQ 2.2: Which of these smells are practically relevant in which context? This second sub-question is concerned with practical relevance. We investigate whether practitioners would react and change the requirement when confronted with the findings.

RQ 3: Which requirements quality defects can be detected with smells? After we understood how relevant the analyzed Requirements Smells are, we want to understand their relation to existing quality defects in requirements artifacts. Hence, we need to check whether, and if so, which defects in requirements artifacts correspond to smells, as we understand smell findings as indicators for defects.

RQ 4: How could smells help in the QA process? Finally, we collect general feedback from practitioners whether (and how) smell detection could be a useful addition to QA for requirements artifacts and whether as well as how they would integrate the smell detection into their QA process.

6.1.2. Case and subjects selection

Our case and subject selection is opportunistic but in a way that maximizes variation and, hence, evaluates the smell detection in very different contexts. This is particularly important for investigating requirements artifacts under realistic conditions, also due to the large variation in how these artifacts manifest themselves in practice. A prerequisite for our selection is the access to the necessary data. To get a reasonable quantitative analysis of the number of smells (RQ 1) and qualitative analysis of the relation of smells and defects (RQ 3), we complement our three industrial cases with a case in an academic setting. There, various student teams are asked to provide software with a certain set of (identical) functionality for a customer as part of a practical course. This is also a realistic setting but provides us with a higher number of specifications and reviews than in the industrial cases.

We will refer to the subjects of the industrial cases as practitioners and we will call the latter subjects students.

6.1.3. Data collection procedure

We used a seven-step procedure to collect the data necessary for answering the research questions.

1. Collect requirements artifact(s) for each case. We retrieved the requirements artifacts to be analyzed in each case. For one case, the requirements were stored in Microsoft Word documents. For the other cases, this involved extracting the requirements from other systems, either a proprietary requirements management tool (resulting in a list of html files), or the online task management system JIRA, which led to a set of comma-separated values files. For the student projects, the students handed in their final artifacts either as a single PDF or as a PDF with the general artifact and another PDF with the use cases. Where authors explicitly structured requirements in numbered requirements, user stories or use cases, we counted these artifacts.

2. Run the smell detection via Smella. We applied our detection tool as introduced in Sect. 4.4 on the given requirements artifacts, which generated a list of smells per artifact.

3. Classify false positives. For all cases in which we wanted to present our results to practitioners, we reviewed each detected finding. In pairs of researchers, we classified the findings as either true or false positive. We classified a finding as a false positive if the finding was not an instance of the smell, e.g. because the results of the linguistic analysis were incorrect.⁵ For artifacts containing more than 10 findings of a smell, we only inspected a set of 10 random findings (of that smell) per artifact. The same holds for Case D, where we inspected 10 random findings of each category for the whole case.

4. Inspect documents for false negatives. To calculate the recall of the smell detection, for each case we randomly selected one artifact that a pair of researchers inspected for false negatives. To ease the manual inspection, we grouped the smells Subjective Language, Ambiguous Adverbs and Adjectives, Loopholes, and Non-verifiable Terms (as ambiguity-related smells). We classified whether a finding is a true or false negative based on the same conditions as in the previous step.

One common cause for false negatives for dictionary-based smells can be that an ambiguous phrase is not part of the dictionary. Since we developed the dictionaries based on existing dictionaries, such as the standard, these dictionaries are not yet complete and must be further developed. However, since this is an issue that is not a problem of the smell detection approach in general, but rather a configuration task, we did not take these findings into consideration for the recall.

5. Get rating by practitioners. We selected a subset of the true positive findings so that we cover all smells with a minimum of two findings per smell as far as the artifacts allowed. When we found repeating or similar findings, e.g. multiple similar sentences with the same smell, we also included one of these findings into the set.

We presented this subset to the practitioners and interviewed them, finding by finding, through three closed questions (see also Table 9): Q1: Would you consider this smell as relevant? Q2: Have you been aware of this finding before? Q3: Would you resolve the finding? Of these, the first two must be answered with yes or no. For the last question, we also needed to take the criticality into account. Therefore, in case practitioners answered that they would resolve a finding, we also asked whether they would resolve it immediately, in a short time (i.e. within this project iteration) or in a long time (e.g. if it happens again). In addition to these three questions, we took notes of qualitative feedback, such as discussions.

6. Interview practitioners. In addition to the ratings, we performed open interviews with practitioners about their experience with the smell detection and how they might include it in their quality assurance process. We took notes of the answers.

7. Get review results from students. Lastly, the students performed reviews of the artifacts of other student teams. They documented and classified found problems according to a checklist (see Table A.11) without awareness of the smell findings in their artifacts. We then collected the review reports from the students.

⁵ For example, if the linguistic analysis incorrectly classified the word provider in the sentence “As a provider, I want [. . . ]” as a comparative adjective.

6.1.4. Analysis procedure

We structure our analysis procedure into seven steps. Each step leads to the results necessary for answering one of our research questions.

1. Calculate ratios of findings per artifact. To un-derstand whether smells are a common issue inrequirements artifacts, we compared the quanti-tative summaries of smells in the various artifactsand domains. To enable a comparison betweendifferent types of requirement artifacts, we usedthe number of words in each artifact as a measureof size. Hence, we finally reported the ratio offindings per 1000 words for each smell and allsmells in total. This provided answers for RQ 1.

2. Calculate ratios of findings for parts of user sto-ries. In one case, we had a common structure ofthe requirements, because they were formulatedas user stories. To get a deeper insight into thedistribution of smells and findings, we calculatedthe ratios of findings per 1000 words for eachpart. We divided the user stories into the partsrole (“As a. . . ”), feature (“I want to. . . ”) andreason (“so that. . . ”) using regular expressions.We counted the words and findings in each part.This provided further insights into the answer forRQ 1.


3. Calculate ratios of false positives. After the rough overview obtained under the umbrella of RQ 1, describing the number of findings for each smell in the various artifacts, we wanted to better understand the smells' relevance. The first step was to calculate the ratios of false positives as we classified them in Step 3 of the data collection. We reported false positive rates overall and for each smell. This provides the first part of the answer to RQ 2.1.

4. Calculate ratios of false negatives. The precision of a smell detection is tightly coupled with the recall. Therefore, we calculated the ratio of detected smell findings to all existing findings, according to our manual inspection, as described in Step 4 of the data collection procedure. This provides the second part of the answer to RQ 2.1.

5. Calculate ratio of irrelevant smells. We were not only interested in errors in the linguistic analysis but also in how relevant the correct analyses were for the practitioners. Hence, we calculated and reported the ratios of findings considered irrelevant by the practitioners. This answers RQ 2.2.

6. Compare defects from reviews with findings. From the students, we received review reports for each artifact. As the effort to check them all would have been overwhelming, we took a random sample of 20% of the artifacts. For each of the defects detected in the reviews, we checked whether there is a corresponding finding from a smell. This answers RQ 3.

7. Interpret interview notes. Finally, to answer RQ 4, we analyzed the interview transcripts and manually coded the answers given by the interviewees.

6.1.5. Validity procedure
First, we used peer debriefing in the sense that all data collection and analyses were done by at least two researchers. Analysis results were also checked by all researchers. This researcher triangulation especially increases the internal validity. Furthermore, we kept an audit trail in a Subversion system to capture all changes to documents and analyses.

Second, we performed all the classifications of findings into true and false positives in pairs. This already helped to avoid misclassifications. To further check our classifications, we afterwards did an independent re-classification of a randomly selected 10% of the findings and calculated the inter-rater agreement. We held discussions to clarify which findings we consider false positives and repeated the classifications until we reached an acceptable agreement. The same procedure held for the inspection of artifacts to detect false negatives, which we also conducted in pairs. Furthermore, we also independently re-classified one of the artifacts to understand the inter-rater agreement on the false negatives. Overall, our analysis of false positives and relevance of the findings is also a validity procedure in the sense that we check in RQ 2 the results from RQ 1.

Third, we discussed with the practitioners what relevance of smells means in the context of the study to avoid misinterpretations. Furthermore, we gave the students review guidelines to give them an indication of what quality defects in requirements artifacts might be. Both serve in particular as mitigation of threats to the internal and the construct validity.

Fourth, we performed the analysis of the correspondence between smells and defects with a pair of researchers. This pair derived a classification of the found and not found defects. The two other researchers reviewed the classification, and we improved it iteratively until we reached a joint agreement.

Fifth, we performed member checking by showing our transcriptions and interpretations for RQ 4 to the interviewed practitioners and incorporating their feedback.

Finally, to support the external validity of the results of our study, we aimed at selecting cases with maximum variation in their domains, sizes, and how they document requirements.

6.2. Results

In the following, we report on the results of our case studies. We first describe the cases and subjects under analysis, before we answer the research questions. We end by evaluating the validity of the cases.


6.2.1. Case and subjects description
The first three cases contain requirements produced in different industrial contexts: embedded systems in the automotive industry, business information systems for the chemical domain, and agile development of web-based systems. While the first two represent rather classical approaches to Requirements Engineering, the third case applies the concept of user stories, as it is popular in agile software development. The fourth case stems from an academic context and employs both use cases and textual requirements. Regarding subject selection, for each industrial case we selected practitioners involved in the company, domain and specification. We executed the findings rating (Step 5) and the interviews regarding the QA process (Step 6) with the same experts, so that their answers in Step 6 are based on their experience with practical, real examples. In the following, we describe the cases, as well as the experts or students for each case. Table 4 provides a quantitative overview of the cases.

Case A: Daimler AG. Daimler AG is a multinational automotive corporation headquartered in Stuttgart, Germany. At Daimler, we analyzed six different requirements artifacts (A1–A6), which were written by various authors. The requirements artifacts describe functionality in different domains of engine control as well as driving information. In this case, requirements are written down in the form of sentences, identified by an ID. The authors are domain experts who are coached on writing requirements.

The requirements artifacts A1–A6 consist of 323 requirements in total (see Table 4). All of the Daimler artifacts analyzed in our study were created by domain experts in a pilot phase after a change in the requirements engineering process as part of a software process improvement endeavour. For RQ 2.2, we reviewed 22 findings with an external coach who works as a consultant for requirements engineering and has collaborated closely with the group for many years.

Case B: Wacker Chemie AG. In the second case, we analyzed requirements artifacts of business information systems from Wacker Chemie AG. Wacker is a globally active company working in the chemical sector and headquartered in Munich, Germany. The systems that we analyzed fulfil company-internal purposes, such as systems for access to Wacker buildings or support systems for document management.

We analyzed three Wacker requirements artifacts that were written by five different authors. At Wacker, functional requirements are written as use cases (including fields for Name, Description, Role and Precondition) whereas non-functional requirements are described in simple sentences. The artifacts consisted of 53 use cases and 13 numbered requirements (see Table 4). For the reviews of the findings in RQ 2.2, we selected 18 findings and discussed them with the Chief Software Architect, who also has several years of experience in quality assurance.

Case C: TechDivision. For the third case, we analyzed the requirements of the agile software engineering company TechDivision GmbH. TechDivision has around 70 employees, working in 3 locations in Germany. They focus mainly on web development, i.e. creating product portals and e-commerce solutions for a variety of companies, as well as web consulting, especially focusing on search engine optimizations. Many of their products involve customisation of the Magento (http://www.magento.com) or Typo3 (http://www.typo3.org) frameworks.

In their projects, TechDivision follows an agile software development process using either Scrum [68] or Kanban [3] methodologies. For their requirements, TechDivision applies user stories [14], which they write and manage in Atlassian JIRA (https://atlassian.com/software/jira). User stories at TechDivision follow the common Connextra format: As a [Role], I want [Feature], so that [Reason]. We will also follow this terminology here.

The systems under analysis consist of two online shopping portals, a customer-relationship system and a content-management system, all of which we cannot name for non-disclosure-agreement reasons. In total, we analyzed over 1,000 user stories containing roughly 28,000 words. For RQ 2.2, we met with an experienced Scrum Master and a long-term developer, who have worked on several projects for TechDivision.



Case D: University of Stuttgart. The requirements of Case D were created by 52 groups of three 2nd-year students each during a compulsory practical course in the software engineering programme at the University of Stuttgart. We removed one artifact, because it was incorrectly encoded, thus resulting in 51 requirements artifacts for this analysis.

Figure 6: Variation of the size of requirements artifacts in Case D, in words.

The resulting requirements artifacts differ vastly in style; hence, we were unable to count them in terms of requirements, but instead only counted the structured use cases as provided by the authors, and quantified the artifacts by word size. The average size of a requirements artifact was 4,471 words (min: 1,425, max: 8,807, see Fig. 6) and contained 19 use cases (min: 6, max: 39), thus creating a set of artifacts of nearly a quarter of a million words, including more than 950 use cases.

For practical reasons, we could not evaluate each research question in each case: For example, RQ 3 depends on the existence of reviews with documented results, which often do not exist in practice. Furthermore, basing the answers to RQ 4 on the potentially less experienced students from Case D would introduce a threat to the validity of our evaluation. Table 5 shows the mapping between research questions and study objects. The interviews for RQ 2.2 and RQ 4 lasted 60 minutes each for Cases A and B and 120 minutes for Case C.

6.2.2. RQ 1: How many Requirements Smells are present in the artifacts?

Under this research question, we quantify the number of findings that appear in requirements. Table 6 shows the number of findings for each case, each requirements artifact and each smell, and also puts these numbers in relation to the size of the artifact. We analyzed requirements of a size of more than 250k words, on which the smell detection produced in total more than 11k findings, thus revealing roughly 44 findings per thousand words.

Table 6 shows that all requirements artifacts contain findings of Requirements Smells. They vary from 5 findings for the smallest case in terms of total number of words (A3) up to 572 for the largest case (C4). The number of findings strongly correlates with the size of the artifact (see Fig. 7, Spearman correlation of 0.9). Hence, in the remainder, we normalize the number of findings by the size of the artifact.
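To make this normalization concrete, the following Python sketch shows how the per-artifact ratios and the rank correlation can be computed. It is only an illustration of the analysis steps, not the tooling used in the study, and the word and finding counts in it are hypothetical placeholders rather than the actual values of our artifacts.

    from scipy.stats import spearmanr

    # Hypothetical (artifact -> (words, findings)) counts, not the study data.
    artifacts = {
        "X1": (1900, 45),
        "X2": (200, 5),
        "X3": (2700, 120),
        "X4": (13100, 570),
    }

    words = [w for w, _ in artifacts.values()]
    findings = [f for _, f in artifacts.values()]

    # Absolute numbers of findings grow with artifact size ...
    rho, _ = spearmanr(words, findings)
    print(f"Spearman correlation: {rho:.2f}")

    # ... hence we report findings per 1,000 words instead.
    for name, (w, f) in artifacts.items():
        print(name, round(1000 * f / w, 1), "findings per 1,000 words")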

The artifacts of Daimler have an average of 26 findings per thousand words, in contrast to 41 for both Wacker and TechDivision and 43 for the artifacts produced by the students. The variance within a single context is best analyzed in Case D, in which multiple teams had a similar background and project size. Fig. 8 shows the variance between the artifacts of Case D, with an average of 44 findings, a minimum of 26 findings (D11) and a maximum of 75 findings (D32) per 1,000 words.

When inspecting the different Requirements Smells, we can see that the most common smells are vague pronouns with 25 findings per 1,000 words, followed by the negative words smell with 6 findings and the loophole smell with 4 findings. The least frequent smells are non-verifiable terms with 1 finding per 1,000 words, and ambiguous adverbs and adjectives with 0.25 findings per 1,000 words. In fact, the most common smell, vague pronouns, appears 100 times more often than the ambiguous adverbs and adjectives.



Figure 7: The number of findings strongly correlates with the size of the artifact (number of words in artifact vs. number of findings in artifact; for readability reasons, for the Stuttgart cases only the IDs of less correlating artifacts are displayed).


Table 4: Study objects

Artifact           Topic                            Size in Words   # Requirements   # Use Cases   # User Stories
A1                 Adaptive valve control                    1896               91
A2                 Exhaust control                           2244               72
A3                 Driving information                        199               12
A4                 Engine startup control                     975               44
A5                 Engine control                             524               49
A6                 Powertrain communication                  1100               55
Sum Daimler                                                  6938              323

B1                 Management of access control              2093                9            18
B2                 Event notification                        1015                3            19
B3                 Document management                        458                1            16
Sum Wacker                                                   3566               13            53

C1                 Webshop for fashion articles              5226                                               168
C2                 CMS in transportation domain              2742                                               123
C3                 CRM system                                6863                                               230
C4                 Webshop for hardware articles            13124                                               561
Sum TechDivision                                            27955                                              1082

Avg Stuttgart                                                4470                            18.9
Sum Stuttgart                                              227973                             966
Sum over all                                               266432              336            53               1082

To analyze the variance in depth, we again take the students' artifacts for reference. Fig. 9 shows the relative number of findings across the projects.

Interpretation. We interpret the quantitative overview along three variables: projects, contexts and the different Requirements Smells.

Projects. When comparing at the project level, we see that Cases A1–A6 (with outlier A5) and C1–C4 (with outlier C3) show quite similar numbers. In contrast, B1 to B3 vary between 28 and 68 findings per 1,000 words. When looking into the most extreme outliers, B3 and D32, we see a systematic error that creates a large number of findings: Both projects repeatedly explain what the system should do (in the German originals: soll, a modal verb that is less strict than an English must) instead of what it must do. 16 of 19 loophole findings in B3 and 29 of 37 loophole findings in D32 root from this problem. This can lead to difficult issues in contracting, as requirements that are phrased with a should are commonly understood as optional (see e.g. RFC 2119 [9] for a detailed explanation).


Table 5: Study objects usage in research questions

Case                     RQ 1:          RQ 2.1:     RQ 2.1:   RQ 2.2:     RQ 3:          RQ 4:
                         Distribution   Precision   Recall    Relevance   Defect Types   QA Process
A: Daimler               X              X           X                                    X
B: Wacker                X              X           X                                    X
C: TechDivision          X              X           X         X                          X
D: Univ. of Stuttgart    X              X           X                     X

Hence, we could see a surprising consistency in two of the three industrial case studies. The Wacker data varies, as does the students' case. In both cases, the negative extremes point at issues that potentially have expensive consequences.

Context. The four cases differ strongly in their context: They write down requirements in different forms, vary in their software development methodology and also produce software for different domains. When comparing the findings at the domain level, we see that the Daimler artifacts, with an average of 26 findings per thousand words, contain fewer findings than both Wacker and TechDivision with 41 findings and the artifacts produced by the students with 43 findings.

Our partners reported that there have recently been trainings for the authors of the cases A1–A6, which could explain the difference. Another reason could be the strong focus that the automotive domain puts on requirements and requirements quality in contrast to the other domains. Lastly, also the strict process in this domain could be a reason for this striking difference of the Daimler requirements. Unsurprisingly, the students' requirements form the lower end of the scale, yet not by much.

Figure 8: Number of findings per 1,000 words in Case D.

Requirements Smells. When comparing the eight smells, we see a strong variance between the numbers of findings, both in absolute as well as relative values. A qualitative inspection indicates reasons for the most frequently occurring smells. First, the smell detection for vague pronouns finds all substituting pronouns in the requirements. Especially in German, the reference of the pronoun can in many sentences be derived from the gender and grammatical case of the word; the detection thus correctly finds pronouns, but not necessarily vague pronouns. RQ 2.1 quantifies this issue. Second, the most common indication for loophole findings is the aforementioned use of the word should. We discuss this case in depth with practitioners in RQ 2.2. Third, we will also inspect reasons for the high number of negative words findings in RQ 2.1 and RQ 2.2.
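To make the first of these observations concrete, a substituting pronoun can be flagged, for instance, by checking whether a pronoun is directly followed by a noun. The following Python sketch illustrates this principle on already tagged tokens; it is only an illustration, not the Smella implementation, and the candidate list and tag names are simplified assumptions.

    # Sketch: flag substituting pronouns (a pronoun that is not followed by a noun),
    # illustrating the vague pronoun detection idea on pre-tagged tokens.
    SUBSTITUTING_CANDIDATES = {"these", "this", "that", "those", "it"}  # illustrative list

    def vague_pronoun_findings(tagged_tokens):
        findings = []
        for i, (token, tag) in enumerate(tagged_tokens):
            if token.lower() in SUBSTITUTING_CANDIDATES and tag == "PRON":
                next_tag = tagged_tokens[i + 1][1] if i + 1 < len(tagged_tokens) else None
                if next_tag != "NOUN":  # no noun follows: the pronoun substitutes one
                    findings.append((i, token))
        return findings

    # "The father of these." is flagged, "The father of these kids." is not.
    print(vague_pronoun_findings(
        [("The", "DET"), ("father", "NOUN"), ("of", "ADP"), ("these", "PRON"), (".", "PUNCT")]))
    print(vague_pronoun_findings(
        [("The", "DET"), ("father", "NOUN"), ("of", "ADP"), ("these", "DET"), ("kids", "NOUN"), (".", "PUNCT")]))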

Answer to RQ 1. The number of findings in requirements artifacts strongly correlates with the size of the artifact. There are roughly 44 findings per 1,000 words, and some contexts show a striking similarity in the number of findings for their artifacts. In our cases, the automotive requirements had a lower number of findings whereas student artifacts contained a higher number of findings relative to the size of the artifacts. The most common findings are for the smells loopholes and vague pronouns.


Table 6: Quantitative summary of smell findings. For each requirements artifact and case, the table lists the number of words as well as the absolute (abs) and relative (rel, per 1,000 words) number of findings, in total and per smell (Subjective Language, Loophole, Vague Pronouns, Superlatives, Negative Words, Comparatives, Non-verifiables, Ambiguous Adverbs and Adjectives).


Figure 9: Variation of smells per 1,000 words in Case D (Subjective, Loopholes, Vague Pronouns, Superlatives, Negatives, Comparatives, Non-verifiable, Ambiguous Adverbs).


6.2.3. RQ 1: How accurate is the smell detection?
To understand the capabilities of the smell detection, we need to understand precision as a metric indicating how many of the detected findings are correct, as well as recall as a metric indicating how many of the correct findings are detected.
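Concretely, we compute both metrics from the manual classification in the usual way. The following Python sketch is only an illustration of these definitions (the counts are made up, not the study data):

    # Sketch: precision and recall as used in RQ 2.1.
    # true_positives:  detected findings confirmed in the manual classification
    # false_positives: detected findings rejected in the manual classification
    # false_negatives: findings identified only by the manual inspection

    def precision(true_positives: int, false_positives: int) -> float:
        return true_positives / (true_positives + false_positives)

    def recall(true_positives: int, false_negatives: int) -> float:
        return true_positives / (true_positives + false_negatives)

    # Hypothetical counts for one smell:
    print(precision(40, 10))  # 0.8
    print(recall(40, 8))      # ~0.83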

Precision. To understand to what extent the numbers of findings for certain smells in RQ 1 are caused by the detection mechanism, we inspected a random sample of 616 findings by taking equivalent sets of findings from each project and manually classifying whether each finding fulfills the smell definition. We could not inspect the same number of findings of each smell for each project, because some projects only had few or even no findings of a certain smell (see the number of findings per project in Table 6).

Table 7 and Fig. 10 show the summary of this analysis: The detection of the subjective language smell revealed only three false positives in total, thus leading to a precision of 0.96. The non-verifiable words, loophole, and ambiguous adverbs and adjectives smells range between 0.70 and 0.81, hence leading to roughly one mistake in four suggestions. Comparative and superlative smells range around 0.5, which means that every second finding is correct. At the rear end of the list are the negative words and vague pronouns smells with one correct finding in three to four suggestions. Across all smells, the precision is between 0.48 (over all inspections) and 0.59, if we take the varying number of inspected findings between the smells into account. To understand these numbers, we qualitatively inspected the false positive classifications, revealing the following main reasons for false positives:

Grammatical errors in real world language. The first issue that creates false positives is the fact that our study analyzes real world language. Some of the requirements, especially in Case C, contained a number of grammatical flaws as well as dialectal phrases, which led to wrong results in the automatic morphological analysis and automatic POS tagging and, consequently, also to false positives during smell detection.

Vague pronouns. The smell detection for vague pronouns showed the lowest precision. In the detection of this smell, we look for substituting pronouns, i.e. pronouns where the noun is not repeated after the pronoun (e.g. "The father of these." vs. "The father of these kids."), of which we characterize only every fourth finding as a defect. The reason behind this poor performance, besides a number of false positives due to the poor grammar mentioned before, is the comparably large number of grammatical exponents of the German language. In addition to number and three grammatical genders, the German language also has four grammatical cases. Therefore, in various instances of substituting pronouns, there is only one grammatical possibility of what the pronoun could refer to.

Findings in conditions. A third reason for false positives is that the smell detection, so far, takes very little context into account. For example, the comparatives smell aims at detecting requirements that define properties of the system relative to other systems or circumstances (as discussed in Sect. 3.2, the problem of comparatives in requirements is validation: how can we understand whether a system fulfills a requirement if that requirement is stated in a relative instead of an absolute way, and what if the system used for comparison changes its properties, would this render the requirement suddenly unfulfilled?). When searching for grammatical comparatives in requirements, roughly 48% of the cases are of the aforementioned kind. In roughly the same number of cases, however, the comparative describes a condition. For example, if the requirement states that if the system takes more than 1 second to respond [...], the comparison is not against another system or circumstance but against absolute numbers. Therefore, in this case, the comparative does not indicate a problem (one could even argue that this is an indicator of good quality).

A similar problem holds for the negative words smell: The smell detection aims at revealing statements of what the system should not do. Often, however, the negation is mentioned in conditions. For example, if the requirements express what to do if the user input is not zero [...], the negation relates to a condition and not to a property of the system.

Recall. When analyzing the accuracy of an automatic detection, we must look not only at precision, but also at recall, i.e. the ratio of all detected findings to all defects of a certain type in an artifact. To this end, we inspected one artifact of each case, in total a set of roughly 16,200 words, and manually identified the findings in each artifact. Due to the problems of distinguishing the various ambiguity-related smells, we analyzed the recall of these four smells as if they were one smell, without further differentiation (see Section 6.1.3).

The manual inspection revealed 200 findings in this artifact sample and an average recall of 0.82. Table 8 and Fig. 10 show the summary of the results: The comparison shows a recall between 0.84 and 0.95 for four of the five investigated smells. The highest recall was achieved by the Comparative Requirements Smell, with 0.95, which means that the smell detection missed one in 20 findings. The fifth smell, with the lowest recall, is the Superlative Requirements Smell with a recall of 0.5. However, this smell is one of the rarest of the smells, as one can also see in the results for RQ 1. Therefore, our analysis of the recall of this smell is based on few data points. Hence, we suggest taking the recall of this smell with care and recommend that future studies investigate this issue in more depth.

A further analysis of the false negatives shows that the smell detection missed findings because of imprecisions in the NLP libraries (i.e. Stanford NLP [71] for lemmatization and POS tagging and RFTagger [66] for morphological analysis). For the dictionary-based smells, the lemmatization did not deduce the correct lemma, e.g. it did not recognize that a certain word was the plural of a lemma. If only the lemmatized version of the word, i.e. the singular form, is in the dictionary, then the smell detector does not correctly identify the smell. In the false negative cases for the Comparative and Superlative Requirements Smells, RFTagger did not correctly classify the inflection.
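The role of the lemmatizer in the dictionary-based smells can be illustrated with the following Python sketch. It is not the Smella implementation: the tokenizer and the lemmatize function are placeholders for the NLP pipeline, and the dictionary entries are merely illustrative.

    # Sketch of a dictionary-based smell check. A finding is missed (false negative)
    # whenever the lemmatizer fails to map an inflected form to the lemma that is
    # stored in the dictionary.
    LOOPHOLE_DICTIONARY = {"as far as possible", "if possible", "where applicable"}

    def lemmatize(token: str) -> str:
        # Placeholder for the lemmatizer (e.g. an unrecognized plural
        # would not be reduced to its singular dictionary form here).
        return token.lower()

    def dictionary_findings(sentence: str, dictionary=LOOPHOLE_DICTIONARY):
        """Return (position, phrase) pairs whose lemmatized form is in the dictionary."""
        lemmas = [lemmatize(t.strip(".,;")) for t in sentence.split()]
        findings = []
        for n in (1, 2, 3, 4):  # check single words and short phrases
            for i in range(len(lemmas) - n + 1):
                phrase = " ".join(lemmas[i:i + n])
                if phrase in dictionary:
                    findings.append((i, phrase))
        return findings

    print(dictionary_findings("The system should respond as far as possible within one second."))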

Interpretation. The study revealed that the precision strongly varies between the different smells. Qualitative analysis provided further insights, described next.

We can now explain the high number of findings for vague pronouns in RQ 1. If we assume that a quarter of the findings are correct, the number of findings in this category is closer to the remaining smells. Also, we could see that while there are certain reasons for imprecision that root from the study objects themselves and are, thus, unavoidable, there is plenty of space for optimization. First, existing techniques from NLP could be applied to improve certain smells, such as the vague pronouns. Second, from the examples we have seen, we would argue that the application of heuristics could heavily improve the precision of existing smell detection techniques. For example, if we exploit the information available from POS tagging, we can find out whether a comparison refers to a number or numerical expression.
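As a sketch of such a heuristic, the following Python snippet discards a comparative finding when the comparison is made against a number, as in the response-time example above. It operates on already tagged tokens; the tag names are illustrative assumptions and the snippet is not the Smella implementation.

    # Heuristic sketch: drop comparative findings that compare against a number.
    # Input: (token, tag) pairs; "COMP" marks a grammatical comparative as a
    # morphological analyzer would report it, "NUM" marks a number.
    WINDOW = 3  # tokens after the comparative that are inspected

    def refers_to_number(tagged_tokens, index):
        for token, tag in tagged_tokens[index + 1 : index + 1 + WINDOW]:
            if tag == "NUM" or token.replace(".", "", 1).isdigit():
                return True
        return False

    def comparative_findings(tagged_tokens):
        return [(i, token) for i, (token, tag) in enumerate(tagged_tokens)
                if tag == "COMP" and not refers_to_number(tagged_tokens, i)]

    sentence = [("If", "CONJ"), ("the", "DET"), ("system", "NOUN"), ("takes", "VERB"),
                ("more", "COMP"), ("than", "ADP"), ("1", "NUM"), ("second", "NOUN")]
    print(comparative_findings(sentence))  # [] : the comparison refers to a number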

Regarding recall, our analysis shows only a slight variance between the smells, the only outlier being the Superlative Requirements Smell; however, since this is a very rare smell, its recall is based on only few data points and we must therefore consider this result with care. When inspecting the reasons for false negatives, we found that optimizations could be made through the lemmatizer. Future research in this direction should compare whether the accuracy of lemmatizers as reported in the field of computational linguistics also holds for requirements engineering artifacts. Furthermore, we analyzed requirements in the German language, where lemmatization is a more difficult problem than in English, since the language makes stronger use of inflections (e.g. with cases or gender). Hence, smell detectors based on lemmatization for the English language might work better than the results of our analysis indicate.


In general, the precision and recall are therefore comparable to other approaches with related purposes (see Sect. 2). However, are they sufficient for an application of Requirements Smells in practice?

First, when looking at precision, we must take into account that the current state of practice still consists of manual work and that the cost for running an automatic analysis is virtually zero. Nevertheless, checking a false positive finding takes effort which an inspector could rather spend on reading the document in more detail. However, as we see a high variation in the precision over different smells, we need to discuss these separately. Several of the smells have a precision of 0.7 and higher, which is considered acceptable in static code analysis [8]. For other Requirements Smells, the precision is below 0.5. This means that every other finding will be a false positive. This can be critical regarding the effort spent in vain and can annoy a user of the smell detection. Yet, we follow Menzies et al. [57] in that a low precision can still be useful "When there is little or no cost in checking false alarms." In our experience, the cost of checking a finding is often just a few seconds.

Second, when looking at recall, most of the smell detections reach a recall of more than 80%. Various publications, most prominently Kiyavitskaya [40] and Berry et al. [5], argue that a recall close to 100% is a basic requirement for any tool for automatic QA in RE. The core argument is that with a lower recall, reviewers stop checking these aspects and consequently miss defects, and that reviewers need to check the complete artifact anyway. However, taking the example of spell checkers and grammar checks, these are still used on a daily basis, although they are far away from 100% recall. Therefore, one could consequently also argue that the precision is more important than the recall.

In any case, whether the reported precision and recall are sufficient in industry needs further research in the future. As mentioned above, it mainly depends on two factors: the required investment versus the gained benefit (similar to the concept of technical debt). For the required investment, we argue that, based on our experience of analyzing the various cases presented here, one can quickly iterate through the detected findings with low investment. To further support this discussion, the following research question analyzes the aspect of the benefits to practitioners in more detail.

Answer to RQ 2.1. As shown in Tables 7 and 8 and in Fig. 10, the precision is on average around 59%, with an average recall of 82%, but both vary between smells. We consider this reasonable for a task that is usually performed manually. However, this also depends on the relevance of the findings to practitioners, which we analyze in RQ 2.2. The study also reveals improvements for future work through the application of deeper NLP.

6.2.4. RQ 2.2: Which of these smells are practically relevant in which context?

To understand whether the Requirements Smells help to detect relevant problems, we first performed a pre-study, in which we confronted practitioners of Daimler and Wacker with findings. The pre-study, which we reported in Femmer et al. [24], aimed at receiving qualitative and tacit feedback. It showed that Requirements Smells can in fact indicate relevant defects.

In contrast, in this study we analyze relevance in specific categories by interviewing practitioners at TechDivision on their opinion on the findings in terms of relevance, awareness, and whether these practitioners would resolve the suggested finding.

Quantitative observations. Table 9 reports the 20 findings that we discussed with TechDivision. In summary, we can see that they considered 65% of the findings as relevant for their context. Furthermore, they had not been aware of 45% of the findings. Lastly, they would act on 50% of the presented findings, and on 40% even immediately.

Qualitative observations (true positives). The findings that the tool produces mostly constituted forms of underspecification. For example, in Finding #1 (see Table 9): "As a searcher, I want to see the checkboxes in the different categories displayed more clearly, so that..." (for similar examples, see Findings 3, 4, 14, 16, and 20). In this case, as in many of the other examples, the practitioners stated that no developer could implement this story properly. They also recalled various discussions in estimation meetings on what was to be done to complete these types of stories. (Note that discussions can have different objectives, i.e. what is to be implemented and how: how to implement a story is the team's task, and thus discussions can help finding the best way; in contrast, what the product owner wants is outside of the team's scope and should therefore not be a matter of discussion.)


Figure 10: Precision and recall of the discussed smell detection approaches.


Table 7: Precision of smell detection

Smell                                      Findings    Findings   Findings   Precision
                                           Inspected   Accepted   Rejected
Subjective Language Smell                         69         66          3        0.96
Ambiguous Adverbs and Adjectives Smell            21         17          4        0.81
Loophole Smell                                    60         43         17        0.72
Non-verifiable Term Smell                         23         16          7        0.70
Superlative Requirements Smell                    39         19         20        0.49
Comparative Requirements Smell                    88         42         46        0.48
Negative Words Smell                             129         42         87        0.33
Vague Pronouns Smell                             187         48        139        0.26

Average                                           77       36.6       40.4        0.59
Overall                                          616        293        323        0.48

Table 8: Recall of smell detection within sample of 4 artifacts (16,271 words)

Smell                              Findings       Findings identified   Recall
                                   in artifacts   correctly
Ambiguity-related Smells                     74                    64     0.86
Superlative Requirements Smell                4                     2     0.50
Comparative Requirements Smell               21                    20     0.95
Negative Words Smell                         64                    54     0.84
Vague Pronouns Smell                         37                    34     0.92

Average                                      40                  34.8     0.82
Overall                                     200                   174     0.87

In the previous research questions, we have seen that Requirements Smells are able to detect loopholes in requirements, such as the usage of the word should. To understand the relevance of this finding in the context of an agile company, we also discussed the loophole in Finding #6. When we pointed out the finding, the practitioners responded that they considered expressing what the system should do in user stories problematic. They considered this defect a low risk, as the developers understood ("If you are told that you should take out the trash, you understand that it is an imperative.") and their user stories never turned out to be of legal relevance. They concluded that they want to avoid this, but it has no immediate urgency in a project situation.


Table 9: Exemplary findings; shortened and translated by the authors

1. As a visitor, I want to see the checkboxes in the different categories displayed more clearly, so that I can see more quickly that I can select and deselect categories. (Relevant: Yes; Aware: Yes; Resolve: Yes, in short term)
2. As a visitor, I want to see the checkboxes in the different categories displayed more clearly, so that I can see more quickly that I can select and deselect categories. (Relevant: No; Aware: No; Resolve: No)
3. As an editor, I want to make it simpler to differentiate between ... (Relevant: No; Aware: No; Resolve: No)
4. As a visitor, I want to see further details, e.g. (...), so that ... (Relevant: Yes; Aware: Yes; Resolve: Yes, immediately)
5. As a customer, I want, if I have a larger number of E-Mails in my mailbox, ... (Relevant: Yes; Aware: Yes; Resolve: Yes, immediately)
6. As an editor, I want to make it simpler to differentiate between A and B, therefore, A should be labeled as ... (Relevant: Yes; Aware: No; Resolve: Yes, in long term)
7. As a provider, I want that, as far as possible, all fields are mapped between System A and System B. (Relevant: Yes; Aware: Yes; Resolve: Yes, immediately)
8. As a provider I want the news section to be implemented with an effort as low as possible. (Relevant: Yes; Aware: Yes; Resolve: No)
9. As a visitor, I do not want to see category X, so that I am not confronted with the issue. (Relevant: No; Aware: No; Resolve: No)
10. As a visitor of the web page, I want, for not selected categories, the displayed hearts of the score (search results list) to be displayed in such a color that the score display is not changed and always only hearts of relevant categories are displayed in color. (Relevant: Yes; Aware: Yes; Resolve: Yes, immediately)
11. As an employee, I want that an article, if no price is imported, despite the label 'available', be not displayed in SYSTEM X, so that the article automatically resumes when a price is imported. (Relevant: Yes; Aware: Yes; Resolve: Yes, immediately)
12. As a user, I want to have the possibility to use custom values for minimum and maximum, so that (...). (Relevant: No; Aware: No; Resolve: No)
13. As a visitor, I want to have a possibility to browse through previous and next products, so that I can quickly and easily look at multiple products without having to go back to the overview page. (Relevant: No; Aware: No; Resolve: No)
14. As a visitor, I want to navigate to meaningfully structured categories via the menus. (Relevant: Yes; Aware: Yes; Resolve: Yes, immediately)
15. As a visitor, I want to quickly open the pictures of the website, so that unnecessary waiting is avoided. (Relevant: Yes; Aware: Yes; Resolve: Yes, immediately)
16. As a visitor of the website, I want a nicely designed search-suggest-box when I enter a text and wait. (Relevant: Yes; Aware: Yes; Resolve: Yes, immediately)
17. As a buyer, I want to select from a set of shopping providers (...), so that I can select the best suited shopping provider. (Relevant: No; Aware: No; Resolve: No)
18. As an editor, I want to have multiple entry points for linking categories, so that the visitor can (...) get an overview of selected brands and categories and their filters. (Relevant: Yes; Aware: No; Resolve: No)
19. As [OTHER SYSTEM], I want that an order of the status 'Order income' transitions into status 'wait for transmission into [SYSTEM]', so that I do not see the order when indexing open orders and so I do not process the order multiple times (and that one can see the status of the order in the backend properly). (Relevant: No; Aware: No; Resolve: No)
20. As an editor, I want to know a good way how to transfer news content from [SYSTEM] to [SYSTEM] to be able to efficiently migrate everything at once. (Relevant: Yes; Aware: Yes; Resolve: No)


ISO 29148 discusses the use of negative statements ("capabilities not to be provided"). In a previous study [24], practitioners expressed their reluctance towards this criterion. In contrast, in this study, practitioners said they would act upon 2 out of 3 of the negative statements (Findings #9–11) that we presented to them, as they revealed unclear requirements. In one case they even remembered that this led to discussions about the implementation during the sprint. Table 9 shows many more, similar examples.

Qualitative observations (false positives). Also interesting are those cases that practitioners considered not relevant in their context or upon which they said they would not act. Summarized, the reasons were the following:

Domain and context knowledge: Some stories that were unclear to outsiders were understandable for someone knowing the system under consideration. For example, in user story #18 it was unclear to the first and second author what "their" refers to. It was clear, however, to both practitioners with knowledge about the system.

Process requirement: In Finding #8, the smell reveals another conspicuous finding: The developer should put as low effort as possible into the implementation of this story. In the discussion, the reason for this was that the customer did not want to pay much for this implementation. Thus the story should only be fulfilled if it was possible to be fulfilled cheaply. While the practitioners told us they would not change anything about this story, they agreed that the smell pointed out something that violates common user story practice.

Finding in reason part: In four cases, the practitioners agreed to the finding but considered it irrelevant as the finding was inside the reason part of the user story. This is due to this part of the user story only serving as additional information. This reason part is not used in testing, nor is the information directly relevant for implementation. The main purpose is to understand the business value and to indicate the major goal to the team, similar to goals and goal modeling in traditional requirements engineering [50].

Answer to RQ 2.2. In summary, the practitioners expressed that 65% of the discussed findings were relevant, as they led to lengthy discussions and unnecessary iterations in estimation. They also saw the problem of legal binding, but in contrast to the practitioners of Cases A and B, they considered these findings less relevant. Due to these results, they expressed their strong interest in exploring smell detection for projects; we will explain the results of this discussion in RQ 4.

Further observations of quality defects in different parts of a user story

We considered the last explanation for rejecting findings (finding in reason part of a user story) particularly interesting. We had noticed that the reason part was often written in a rather imprecise way. To be able to quantify this aspect, we automatically split user stories according to the language patterns and quantified the distribution of words as well as findings over the different parts of user stories.

Table 10 shows the results of this analysis. The number of words is roughly distributed as follows: 11% of the words of a user story describe the role, 55% of the words describe the feature and 34% describe the reason. Of the 1,082 user stories, 290 had no reason part at all. Due to this uneven distribution, similar to the previous analyses, we normalize the number of findings by the number of words in each part, resulting in the number of findings per 1,000 words.
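A minimal sketch of this splitting, assuming English Connextra keywords (the patterns actually applied to the German stories are analogous but not shown here), could look as follows:

    import re

    # Sketch: split a Connextra-style user story into role, feature and reason,
    # and normalize findings by the number of words in a part.
    STORY_PATTERN = re.compile(
        r"As an?\s+(?P<role>.+?),\s*I want\s+(?P<feature>.+?)"
        r"(?:,?\s*so that\s+(?P<reason>.+))?$",
        re.IGNORECASE | re.DOTALL,
    )

    def split_user_story(story: str):
        match = STORY_PATTERN.search(story.strip())
        if not match:
            return None  # story does not follow the expected format
        parts = match.groupdict()
        parts["reason"] = parts["reason"] or ""  # stories may lack a reason part
        return parts

    def findings_per_1000_words(findings: int, words: int) -> float:
        return 1000.0 * findings / words if words else 0.0

    story = ("As a visitor, I want to see the checkboxes in the different categories "
             "displayed more clearly, so that I can select and deselect categories.")
    print(split_user_story(story))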

part. In fact, when we inspected these findings, theywere false positives due to the grammatical problemsdescribed in the previous section. The absence offindings in this section is expected, as this part ofthe user story only names the role and does not offermany chances for smells as described in Sect. 3.2. Forthe remainder, 46% of the findings are located in thefeature and 53% are located in the reason part. Inrelation to its size, the difference is striking: With 64


Table 10: Findings in different parts of user stories (T = total, Ro = role, F = feature, Re = reason). For cases C1–C4 and in total, the table lists the number of stories, the number of stories without a reason part, the size in words of each part, and the findings in each part, both absolute and per 1,000 words.


In summary, the reason part of user stories is particularly prone to smells, but the qualitative analysis in RQ 2.2 reveals that practitioners consider findings in this section to be less relevant. This investigation could support further application of Requirements Smells in practice by helping to prioritize smells according to their location.

6.2.5. RQ 3: Which requirements quality defects can be detected with smells?

For 44 of the 51 requirements artifacts, the students provided technical reviews. We qualitatively analyzed the results of 10 randomly selected reviews (around 20%). The inspected reviews were conducted by 5–7 reviewers (mean: 5.6), took 90 minutes and resulted in 18–69 defects (mean: 38.1). We iterated through the 381 defects documented in the reviews and evaluated whether the smell detection produced findings indicating these defects. If no smell indicated the defect, we openly classified the defects. We did not quantify these results, because the resulting numbers would assume and suggest that the distribution of defects is representative for regular projects, which we are unsure about (e.g. because of a high number of spelling and grammatical issues).

The classification of the defects and their comparison with the detected smells resulted in the following list of defects indicated by Requirements Smells:

Sentence not understandable. In some instances, when the defect suggested changing the sentence to improve understandability, these sentences were highlighted especially by the vague pronouns and negative statements smells.

Improper legal binding. Various requirements artifacts had issues with improper legal binding. In one case, the reviewers recognized this and demanded the use of the term must. The loopholes smell pinpointed this issue.

Unspecified/unmeasurable NFRs. Various smells, especially the superlatives smell, indicated defects of underspecification within non-functional requirements.

The remaining defects were not indicated by Requirements Smells.

Interpretation. The quantitative distribution of defects is not necessarily representative for industry projects and, thus, has not been analyzed. The reviews clearly show that manual inspection discovered the same kinds of defects as in the previous research question: understandability, legally binding terminology and underspecified requirements. These are issues with regard to the representation but also to the content described in the artifact. We argue that these issues are common for requirements artifacts. Requirements Smells can therefore indicate relevant defects from multiple, independent sources (manual inspection, interviews with practitioners, independent manual reviews) for multiple, independent cases.

Answer to RQ 3. Automatic smell detection can point to issues in both representation (e.g. improper legal binding) and content (underspecified/unmeasurable NFRs). The analysis of the reported defects indicates that more defects could be automatically detected (see the further discussion on the detectability of defects next). Nevertheless, just as for static code analysis, we see that automatic analysis cannot indicate all defects and thus must be accompanied by reviews [73]. The fourth research question aims at analyzing this aspect in depth.

Further discussion on detectability of defects. During the analysis, if no smells indicated the defect, we openly classified the defects. While discussing the resulting list of defects and the degree to which they are detectable within the group of authors, we came up with a classification which is broader than initially planned while designing the study. This classification considers whether a defect:

• can already be detected,
• could be detected, but is not yet implemented in our detection,
• cannot be detected at the moment, but should be soon, or
• cannot be detected at all and probably won't be soon.

This classification is purely based on our knowledge of existing related work and our subjective expectations gained during the data analysis process. The classification yielded the map visualised in Fig. 11. The figure is structured in two dimensions: On the vertical axis, we group the defects into defects relating to the content and defects relating to the representation. Furthermore, on the horizontal axis, we map the items according to the expected precision and completeness we believe the detection could achieve (i.e. the classification above). The further left an item, the more precise and complete we expect a smell detection to be; the items on the right we assume to be close to impossible to detect in the general case.

With the defects that our current approach does not reveal, this research question shows that more defects could be detected: These are namely defects with terminology, singularity in use cases and structural issues focusing on the content, such as the absence of mandatory elements in the artifact [37], structural redundancy [34] or structural inconsistency between content. It remains unclear how far enhanced language analysis with more sophisticated NLP and ontologies can enable a deeper understanding of the language. In any case, when a defect remains subtle and vague in its definition, such as an unintuitive structuring or design, we only see potential for automation if the defect can be defined precisely. For problems relating to the domain itself (e.g. incomplete information about the domain or incorrect information with regards to the domain), we consider it impossible to detect issues without formalizing the concepts of the domain.

6.2.6. RQ 4: How could smells help in the QA process?

After the interviews and analysis, we asked all involved practitioners whether or not they think requirements smell detection is a helpful support, and whether and how they would integrate it in their context. We asked those questions openly and transcribed the answers for validation by the interviewees and later coding. In the following, we report on the results structured by topics. Where applicable, we provide the verbatim answers in relation to their cases (A, B or C).

Overall Evaluation. In general, all practitioners agreed on the usefulness of the smell detection, even if considering different perspectives that arise from their process setting. One practitioner (Case C) reported that he expects one benefit of using smell detection to be a reduction of the time spent on effort estimations (in the context of agile methods), as the product owner could benefit from the smell detection on the fly and, thus, avoid misinterpretations later.

Quotes on Overall Evaluation
A. “I think that smells can help to analyze a specification.”
B. “The method of Requirements Smells is a valuable extension in the area of requirements engineering and gives helpful input concerning the quality of specified requirements in early development phases.”
C. “I think such a smell detection is of high value to make sure that our team is confronted with already quality assured [user] stories. This can reduce the time in our effort estimations, because the product owner would directly notice on the fly what could lead to misinterpretations later.”

Integration into Process. When asked how they would integrate the smell detection into their process setting, the practitioners gave varying answers depending on the process. The practitioner relying more on rich process models (Case B) could imagine using a smell detection either as a support for the person writing the requirements or as part of a more fundamental QA method for the company. The practitioner relying more on agile methods (Case C) could likewise imagine using Requirements Smells as a support for the person writing the requirements or in the context of analytical QA. In addition, one potential use is seen in the context of problem management. Importantly, all practitioners see the full potential of a smell detection only if integrated into their existing tool chain (see also the quotes on constraints and limitations).


Figure 11: Findings in requirements reviews, classified by content/representation and by detectability (detected by Requirements Smells, detectable by Requirements Smells, rather not detectable by Requirements Smells).


Quotes on Integration into Process
B. “I like to compare Requirements Smells to the “check spelling aid” known e.g. from Microsoft Word. So for me Requirements Smells are intuitive and lightweight and should be used and integrated within requirements engineering and quality assurance processes.”
C. “As a product owner, I would use a smell detection on the fly [...]. In addition, smell detection could help in analytical QA, as it could reveal when a problem occurs repeatedly, either in a project or in the company as a whole.”

Constraints and Limitations. One facet we consider especially interesting when using qualitative data is the chance to reveal further fields of improvement. We therefore now concentrate on the constraints that would hamper the usage of a smell detection. One aspect we believe to be important is that practitioners want to avoid additional effort when using smell detection in their context. Furthermore, the practitioner of Case A believes that the automatic smell detection requires a common understanding of the notion of RE quality. He further indicates that the smell detection should explicitly take into account that some criteria cannot be met at every stage of a project.

Quotes on Constraints and Limitations
A. “First, the people who need to write the specification received training which gives the required performance criteria. Second, abstraction levels must be taken into account during the smell detection process, since at higher abstraction levels different criteria cannot be met (e.g. vague pronouns or subjective language).”
B. “As a product owner, I would use a smell detection on the fly provided that it would not mean additional effort [such as by having to use another tool].”

Answer to RQ 4. Our practitioners expressed general agreement on the potential benefits of using smell detection in a quality assurance context. When asked how they would integrate the requirements smell detection, they see possibilities for both analytical and constructive QA, provided, however, that this integration does not increase the required effort, e.g. by integrating the detection into existing tool chains.

6.2.7. Evaluation of validity
We use the structure of threats to validity from [64] to discuss the evaluation of the validity of our study.

Construct validity. In our evaluation, we analyzed Requirements Smells in terms of false positives, relevance and relation to quality defects. There are threats that the understanding of these terms varies and, thus, the results are not repeatable. Yet, we are confident that our validity procedures described in Sect. 6.1.5 reduced this threat. For the false positives, we classified a subset of the findings independently, and afterwards compared (inter-rater agreement, Cohen's kappa: 0.53) and discussed the results. We subsequently reclassified a different subset of findings, which led to an inter-rater agreement (Cohen's kappa) of 0.72. For the classification of false negatives, we reclassified one document separately, calculating the percentage of agreement on false positives; we did not employ Cohen's kappa here, since the number of true negatives (non-smell words) would strongly dominate the result and therefore skew the inter-rater agreement. Instead, we calculated the ratio of findings which both rating teams independently classified as false positive to the number of findings which only one of the teams classified as false positive. This led to an agreement of 88%.

We consider both of these substantial agreements, especially in the inherently ambiguous and complex domain of RE. Thus, we consider this threat as sufficiently controlled.
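For illustration, the two agreement measures can be computed as in the following Python sketch; the counts are hypothetical, not the study data, and the false-positive agreement function reflects one plausible reading of the ratio described above.

    # Sketch of the two agreement measures (illustrative counts only).

    def cohens_kappa(both_pos, both_neg, only_rater1_pos, only_rater2_pos):
        """Cohen's kappa for two raters and two categories (positive/negative)."""
        total = both_pos + both_neg + only_rater1_pos + only_rater2_pos
        observed = (both_pos + both_neg) / total
        p1 = (both_pos + only_rater1_pos) / total  # rater 1: share of positives
        p2 = (both_pos + only_rater2_pos) / total  # rater 2: share of positives
        expected = p1 * p2 + (1 - p1) * (1 - p2)
        return (observed - expected) / (1 - expected)

    def false_positive_agreement(both_fp, only_one_fp):
        """Share of false-positive ratings on which both teams agree."""
        return both_fp / (both_fp + only_one_fp)

    print(round(cohens_kappa(20, 60, 5, 15), 2))      # hypothetical counts -> 0.53
    print(round(false_positive_agreement(22, 3), 2))  # hypothetical counts -> 0.88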

Internal validity. A threat to the internal validity of our results is that the experience of the students as well as the practitioners might play a role in their ratings of relevance or detection of quality defects. We mitigated this threat by choosing only practitioners for the ratings and interviews who had several years of experience. The students are only in their second year. We cannot mitigate this threat but consider the effect to be small.


There might be some defects not found by the students that could have been indicated by a smell, as well as unfound defects undetectable by smells. Hence, future studies will add to the classification but are unlikely to change it substantially. Personal pride could potentially have an impact on the answers to RQ 2.2, if practitioners are not able to professionally discuss their own work products. In our cases, however, all practitioners openly accepted the discussions (as can be seen in their answers). Even though we carefully supervised this threat, we have not found signs of personal bias in the cases involved. Finally, the students might also have been influenced by the review guidelines we provided. Yet, none of the investigated smells was explicitly listed in the guidelines. Instead, the guideline contained rather high-level aspects such as “unambiguity”. Although we consider this threat to be a minor one, it is still present.

External validity. As requirements engineering is a diverse field, the main threat to the external validity of our results is that we do not cover all domains and ways of specifying requirements. We mitigated this threat to some degree by covering at least several different domains and study objects, of which some are purely textual requirements artifacts, some use cases, and some user stories. We argue that this represents a large share of today's requirements practices.

Reliability. Our study contains several classifications and ratings performed by people. This constitutes a threat to the reliability of our results. We are confident, however, that the peer debriefing and member checking procedures helped to reduce this threat.

7. Conclusion

In this paper, we defined Requirements Smells and presented an approach to the detection of Requirements Smells, which we empirically evaluated in a multi-case study. In the following, we summarize our conclusions, relate them to existing evidence on the detection of natural language quality defects in requirements artifacts, and discuss the impact and limitations of our approach and its evaluation. We close by outlining future work.

7.1. Summary of conclusions

First, we proposed a light-weight approach to detect Requirements Smells. It is based on the natural language criteria of ISO 29148 and serves to rapidly detect such smells. We define the term Requirements Smell as an indicator of a quality violation, which may lead to a defect, with a concrete location and a detection mechanism, and we also give definitions of a concrete set of smells.

Second, we developed an implementation that is able to detect Requirements Smells by using part-of-speech (POS) tagging, morphological analysis and dictionaries. We found that it is possible to provide such tool support and outlined how such a tool could be integrated into quality assurance.
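To illustrate the basic mechanism, the following minimal sketch shows how dictionary lookups combined with POS tags can flag candidate smell words. It is only an illustration and not Smella's actual implementation: NLTK serves here as a stand-in tagger, and the term lists and smell labels are invented examples.

```python
# Illustrative sketch only, not Smella's implementation: dictionary- and
# POS-based flagging of candidate smell words. NLTK is used as a stand-in
# tagger; the word lists below are invented examples.
import nltk
from nltk import pos_tag, word_tokenize

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger model

SUBJECTIVE_TERMS = {"user-friendly", "easy", "flexible"}   # hypothetical list
LOOPHOLE_TERMS = {"possible", "applicable", "eventually"}  # hypothetical list

def find_smells(requirement: str):
    """Return (smell, token, index) triples for a single requirement."""
    findings = []
    for i, (token, tag) in enumerate(pos_tag(word_tokenize(requirement))):
        word = token.lower()
        if word in SUBJECTIVE_TERMS:
            findings.append(("Subjective Language", token, i))
        if word in LOOPHOLE_TERMS:
            findings.append(("Loophole", token, i))
        # Comparative/superlative adjectives or adverbs (Penn tags JJR, JJS,
        # RBR, RBS) hint at comparisons without a reference point.
        if tag in {"JJR", "JJS", "RBR", "RBS"}:
            findings.append(("Comparative Phrase", token, i))
    return findings

if __name__ == "__main__":
    req = "The system shall be as user-friendly as possible and faster than before."
    for smell, token, i in find_smells(req):
        print(f"{smell}: '{token}' (token {i})")
```

A production tool would additionally need morphological analysis (e.g., lemmatization) so that inflected word forms still match the dictionaries; this sketch deliberately omits that step.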

Third, in the empirical evaluation, our approach proved able to support us in automatically analysing requirements amounting to 250k words. Findings were present throughout all cases, but in varying frequencies between 22 and 67 findings per 1,000 words. Outliers indicated serious issues. An investigation of the detection precision showed an average precision of around 0.59 over all smells, varying between 0.26 and 0.96. The recall was on average 0.82, but also varied between 0.5 and 0.95. To improve the accuracy, we described concrete improvement potential based on real-world, practical examples.
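These values follow the usual definitions of precision and recall (standard measures, restated here in our wording), where a true positive (TP) is a reported finding confirmed as a relevant smell, a false positive (FP) is a reported finding rejected on inspection, and a false negative (FN) is a smell missed by the detection:

\[
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}
\]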

A further analysis of reviews and practitioners' opinions strengthened our confidence that smells indicate quality defects in requirements. For these quality defects, practitioners explicitly stated the negative impact of discovered findings on estimation and implementation in projects. The study also showed, however, that while Requirements Smell detection can presumably help during QA in a broad spectrum of methodologies followed (including agile ones), the relevance of Requirements Smells varies between cases. Hence, it is necessary to tailor the detection to the context of a project or company. We analyzed this factor in depth, demonstrating that the reason part of a user story contains the most findings (absolutely and relatively), but practitioners consider these findings less relevant as they argue that this part is not commonly used in implementation or testing. This raises the question of the relevance of this part at all, at least from a quality assurance perspective, which should be investigated in future work.

Our comparison with defects found in reviews furthermore showed that the Requirements Smell detection partly overlaps with results from reviews. As a result, we provide a map of defects in requirements artifacts in which we give a first indication where Requirements Smells can provide support and where they cannot.

Therefore, we provide empirical evidence from multiple, independent sources (manual inspection, interviews with practitioners, independent manual reviews) for multiple, independent cases, showing that Requirements Smells can indicate relevant defects across different forms of requirements, different domains, and different methodologies followed.

7.2. Relation to existing evidence

Existing approaches in the direction of automatic QA for RE are based on various quality models, including the ambiguity handbook by Berry et al. [7], the now superseded IEEE 830 standard [32] and proprietary models. Yet, according to a recent literature review by Schneider and Berenbach [67], ISO 29148 is the current standard in RE "that every requirements engineer should be familiar with". However, no detailed empirical studies (see Table 1) exist for the quality violations described in ISO 29148. When comparing to similar, related quality violations, only few empirical, industrial case studies exist (see Table 2). Gleich et al. [30] and Chantree et al. [11] report, for conceptually similar problems, a precision of the detection between 34% and 75% (97% in a special case), and a recall between 2% and 86%. Krisch and Houdek [49] report a lower precision in an industrial setting. The precision and recall for the detection of the smells, which we developed based on the description in the standard, are in a similar range to the aforementioned. In summary, this work provides a detailed empirical evaluation of the quality factors of ISO 29148, including a deeper understanding of both existing and novel factors.

We also take a first step from the opposite perspective: So far, to the best of our knowledge, all related work starts from a certain quality model and goes into automation. Our results for RQ 3 provide a bigger picture for understanding to what extent quality defects in requirements could be addressed through automatic analysis in general.

Our results for RQ 2.2 furthermore provide evidence for the claim by Gervasi and Nuseibeh [29] that "Lightweight validation can discover subtle errors in requirements." More precisely, our work indicates that automatic analysis can find a set of relevant defects in requirements artifacts by providing evidence from multiple case studies in various domains and approaches. The responses by practitioners to the findings do, to some extent, contradict the claim by Kiyavitskaya et al. [40] who state that "any tool [...] should have 100% recall". Practitioners responded very positively to our first prototype and the smells it finds. Yet, obviously, more detailed and broader evaluations, especially conducted independently by other researchers not involved in the development of Smella, should follow.

7.3. Impact/Implications

For practitioners, Requirements Smells provide a way to find certain issues in a requirements artifact without expensive review cycles. We see three main benefits of this approach: First, the approach, just as static analysis for code, can enable project leads to keep a basic hygiene for their requirements artifacts. Second, the review team can avoid discussing obvious issues and focus on the important, difficult, domain-specific aspects in the review itself. Third, the requirements engineers receive a tool for immediate feedback, which can help them to increase their awareness of certain quality aspects and establish common guidelines for requirements artifacts.

Yet, the low precision for some of the smells might cause unnecessary work checking and rejecting findings from the automatic smell detection. Hence, at least for now, it is advisable to concentrate on the highly accurate smells.

For researchers, this work sharpens the term Requirements Smell by providing a definition and a taxonomy. By implementing and rating concrete smell findings, we also came to the conclusion, however, that not all of the requirements defects from ISO/IEC/IEEE 29148 can be clearly distinguished as Requirements Smells. In particular, the difference between Subjective Language, Ambiguous Adverbs and Adjectives, Non-verifiable Terms, and Loopholes was not always clear to us during our investigations (see RQ 2.1). Therefore, we, as a community, can take our smell taxonomy as a starting point, but we also need to critically reflect on some smells to further refine the taxonomy.

Finally, empirical evidence in RE is, in general, difficult to obtain because many concepts depend on subjectivity [55]. One issue increasing the level of difficulty in evidence-based research in RE remains that most requirements specifications are written in natural language and therefore do not lend themselves to automated analyses. Requirements Smell detection provides us with a means to quantify the extent of certain defects in a large sample of requirements artifacts while explicitly taking into account the sensitivity of findings to their context. Hence, this allows us to consider a whole new spectrum of questions worth studying in an empirical manner.
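As a simple illustration of such quantification, the sketch below (a hypothetical helper, not part of Smella; the example numbers are invented) normalizes the number of findings per 1,000 words, the density measure used in our evaluation, so that artifacts of different size become comparable:

```python
# Hypothetical helper (not part of Smella): findings density per 1,000 words,
# so that artifacts of different size become comparable.
def findings_per_kword(num_findings: int, num_words: int) -> float:
    return 1000.0 * num_findings / num_words

# Invented example values: 180 findings in an artifact of 4,500 words.
print(findings_per_kword(180, 4500))  # 40.0 findings per 1,000 words
```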

7.4. Limitations

We concentrated on a first set of concrete Requirements Smells based on our interpretation of the sometimes imprecise language criteria of ISO/IEC/IEEE 29148. There are more smells, also with different characteristics than the ones we proposed and analyzed. In addition, even though we diversified our study objects over domains, methods and different types of requirements, we cannot generalize our findings to all applicable contexts. We therefore consider the presented results only a first step towards the continuous application of Requirements Smells in software engineering projects.

7.5. Future work

Our work focuses on Requirements Smells based on ISO/IEC/IEEE 29148. Future work needs to clarify and extend this taxonomy based on related work and experience in practice. This also includes the development of other Requirements Smell detection techniques to increase our understanding of which defects can be revealed by Requirements Smells and which defects cannot.

Second, this first study gained initial insights into the usefulness of Requirements Smells for QA. We furthermore sketched an integration of Requirements Smells into a QA process. Yet, a full integration and its consequences must be analyzed in depth. In particular, we need to understand whether smell detection as a supporting tool, similar to spell checking, as pointed out by one of our participants, enables requirements engineers to improve their requirements artifacts.

Lastly, Requirements Smells focus on the detection of issues in requirements artifacts. They require a thorough understanding of the impact of a quality defect, which is hence also part of the Requirements Smell taxonomy. This link must be carefully evaluated and analyzed in practice. Our preliminary works on this topic [23, 59] provide first ideas in that direction.

Acknowledgments

We would like to thank Elmar Juergens, Michael Klose, Ilona Zimmer, Joerg Zimmer, Heike Frank, Jonas Eckhardt as well as the software engineering students of Stuttgart University for their support during the case studies and feedback on earlier drafts of this paper.

This work was performed within the project Q-Effekt; it was partially funded by the German Federal Ministry of Education and Research (BMBF) under grant no. 01IS15003 A-B. The authors assume responsibility for the content.

Bibliography

[1] V. Ambriola and V. Gervasi. On the systematic analysis of natural language requirements with CIRCE. Automated Software Engineering, 13(1):107–167, 2006.

[2] B. Anda and D. I. K. Sjøberg. Towards an inspection technique for use case models. In Proceedings of the 14th International Conference on Software Engineering and Knowledge Engineering. ACM, 2002.

[3] D. J. Anderson. Kanban. Blue Hole Press, 2010.


[4] C. Arora, M. Sabetzadeh, L. Briand, and F. Zimmer. Automated Checking of Conformance to Requirements Templates using Natural Language Processing. IEEE Transactions on Software Engineering, 41(10):944–968, 2015.

[5] D. Berry, R. Gacitua, P. Sawyer, and S. F. Tjong. The case for dumb requirements engineering tools. In Requirements Engineering: Foundation for Software Quality, pages 211–217. Springer Berlin Heidelberg, 2012.

[6] D. M. Berry, A. Bucchiarone, S. Gnesi, G. Lami, and G. Trentanni. A new quality model for natural language requirements specifications. In Requirements Engineering: Foundation for Software Quality. Essener Informatik Beiträge, 2006.

[7] D. M. Berry, E. Kamsties, and M. M. Krieger. From Contract Drafting to Software Specification: Linguistic Sources of Ambiguity. Technical report, School of Computer Science, University of Waterloo, Waterloo, ON, Canada, 2003.

[8] A. Bessey, K. Block, B. Chelf, A. Chou, B. Fulton, S. Hallem, C. Henri-Gros, A. Kamsky, S. McPeak, and D. Engler. A few billion lines of code later: using static analysis to find bugs in the real world. Communications of the ACM, 53(2):66–75, 2010.

[9] S. Bradner. Key words for use in RFCs to Indicate Requirement Levels - RFC 2119. https://www.ietf.org/rfc/rfc2119.txt, 1997.

[10] A. Bucchiarone, S. Gnesi, and P. Pierini. Quality Analysis of NL Requirements: An Industrial Case Study. In 13th IEEE International Requirements Engineering Conference, pages 390–394, 2005.

[11] F. Chantree, B. Nuseibeh, A. D. Roeck, and A. Willis. Identifying Nocuous Ambiguities in Natural Language Requirements. 14th IEEE International Requirements Engineering Conference, pages 59–68, Sep. 2006.

[12] A. Ciemniewska, J. Jurkiewicz, L. Olek, and J. Nawrocki. Supporting Use-Case Reviews. In Business Information Systems, pages 424–437. Springer Berlin Heidelberg, 2007.

[13] A. Cockburn. Writing Effective Use Cases. Addison-Wesley, 2000.

[14] M. Cohn. User stories applied: For agile software development. Addison-Wesley Professional, 2004.

[15] A. Davis, S. Overmyer, K. Jordan, J. Caruso, F. Dandashi, A. Dinh, G. Kincaid, G. Ledeboer, P. Reynolds, P. Sitaram, A. Ta, and M. Theofanos. Identifying and measuring quality in a software requirements specification. In Proceedings First International Software Metrics Symposium, pages 141–152, 1993.

[16] F. De Bruijn and H. L. Dekkers. Ambiguity in natural language software requirements: A case study. In Requirements Engineering: Foundation for Software Quality, pages 233–247. Springer Berlin Heidelberg, 2010.

[17] C. Denger, D. Berry, and E. Kamsties. Higher quality requirements specifications through natural language patterns. In Software: Science, Technology and Engineering, pages 80–90. IEEE, 2003.

[18] F. Fabbrini, M. Fusani, S. Gnesi, and G. Lami. An automatic quality evaluation for natural language requirements. In Proceedings of the Seventh International Workshop on Requirements Engineering: Foundation for Software Quality, volume 1, pages 4–5, 2001.

[19] F. Fabbrini, M. Fusani, S. Gnesi, and G. Lami. The linguistic approach to the natural language requirements quality: benefit of the use of an automatic tool. In Proceedings 26th Annual NASA Goddard Software Engineering Workshop, pages 97–105. IEEE Computer Society, 2001.

[20] M. Fagan. Design and code inspections to reduce errors in program development. In Software pioneers, pages 575–607. Springer, 2002.

[21] D. Falessi, G. Cantone, and G. Canfora. Empirical Principles and an Industrial Case Study in Retrieving Equivalent Requirements via Natural Language Processing Techniques. IEEE Transactions on Software Engineering, 39(1):18–44, 2013.

[22] A. Fantechi, S. Gnesi, G. Lami, and A. Maccari. Application of linguistic techniques for Use Case analysis. Requirements Engineering, 8(3):161–170, 2003.

[23] H. Femmer, J. Kučera, and A. Vetrò. On the impact of passive voice requirements on domain modelling. In Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM '14, pages 21:1–21:4, New York, NY, USA, 2014. ACM.

[24] H. Femmer, D. Méndez Fernández, E. Juergens, M. Klose, I. Zimmer, and J. Zimmer. Rapid requirements checks with requirements smells: Two case studies. In Proceedings of the 1st International Workshop on Rapid Continuous Software Engineering, RCoSE 2014, pages 10–19, New York, NY, USA, 2014. ACM.

[25] H. Femmer, D. Méndez Fernández, S. Wagner, and S. Eder. Supplementary online material: Analysis of related work. Created on: 2015-12-22.

[26] H. Femmer, J. Mund, and D. Mendez Fernandez. It's the Activities, Stupid! A New Perspective on RE Quality. In Proceedings of the 2nd International Workshop on Requirements Engineering and Testing, pages 13–19, 2015.

[27] M. Fowler and K. Beck. Refactoring: improving the design of existing code. Addison-Wesley Professional, 1999.

[28] G. Génova, J. M. Fuentes, J. Llorens, O. Hurtado, and V. Moreno. A framework to measure and improve the quality of textual requirements. Requirements Engineering, 18(1):25–41, Sept. 2011.

[29] V. Gervasi and B. Nuseibeh. Lightweight validation of natural language requirements. Software: Practice and Experience, 32(2):113–133, Feb. 2002.

[30] B. Gleich, O. Creighton, and L. Kof. Ambiguity detection: Towards a tool explaining ambiguity sources. In Requirements Engineering: Foundation for Software Quality, volume 6182, pages 218–232. Springer Berlin Heidelberg, 2010.

[31] B. Hauptmann, M. Junker, S. Eder, L. Heinemann, R. Vaas, and P. Braun. Hunting for smells in natural language tests. In Proceedings of the International Conference on Software Engineering, pages 1217–1220, 2013.

[32] IEEE Computer Society. IEEE Recommended Practice for Software Requirements Specifications. https://standards.ieee.org/findstds/standard/830-1998.html, 1998.

[33] ISO, IEC, and IEEE. ISO/IEC/IEEE 29148:2011. https://standards.ieee.org/findstds/standard/29148-2011.html, 2011.

[34] E. Juergens, F. Deissenboeck, M. Feilkas, B. Hummel, B. Schaetz, S. Wagner, C. Domann, and J. Streit. Can Clone Detection Support Quality Assessments of Requirements Specifications? In Proceedings of the International Conference on Software Engineering, pages 79–88, 2010.

[35] E. Juergens, F. Deissenboeck, B. Hummel, and S. Wagner. Do code clones matter? In Proceedings of the International Conference on Software Engineering, pages 485–495, 2009.

[36] D. Jurafsky and J. H. Martin. Speech and Language Processing. Pearson Education, 2nd edition, 2014.

[37] M. I. Kamata and T. Tamai. How Does Requirements Quality Relate to Project Success or Failure? In 15th IEEE International Requirements Engineering Conference, pages 69–78, 2007.

[38] E. Kamsties, D. M. Berry, and B. Paech. Detecting Ambiguities in Requirements Documents Using Inspections. In Proceedings of the 1st Workshop on Inspection in Software Engineering, pages 68–80, 2001.


[39] E. Kamsties and B. Paech. Taming ambiguity in natural language requirements. In Proceedings of the International Conference on System and Software Engineering and their Applications, pages 1–8, 2000.

[40] N. Kiyavitskaya, N. Zeni, L. Mich, and D. M. Berry. Requirements for tools for ambiguity identification and measurement in natural language requirements specifications. Requirements Engineering, 13(3):207–239, 2008.

[41] E. Knauss and T. Flohr. Managing requirement engineering processes by adapted quality gateways and critique-based RE-tools. In Proceedings of Workshop on Measuring Requirements for Project and Product Success, Nov. 2007.

[42] E. Knauss, D. Lübke, and S. Meyer. Feedback-Driven Requirements Engineering: The Heuristic Requirements Assistant. In Proceedings of the International Conference on Software Engineering, pages 587–590, 2009.

[43] J. C. Knight and E. A. Myers. An improved inspection technique. Communications of the ACM, 36(11):51–61, 1993.

[44] L. Kof. Scenarios: Identifying missing objects and actions by means of computational linguistics. In Proceedings of the 15th IEEE International Requirements Engineering Conference, pages 121–130, Oct. 2007.

[45] L. Kof. Treatment of Passive Voice and Conjunctions in Use Case Documents. Natural Language Processing and Information Systems, 4592:181–192, 2007.

[46] S. J. Körner and T. Brumm. Improving natural language specifications with ontologies. In Proceedings of the 21st International Conference on Software Engineering and Knowledge Engineering, pages 552–557. World Scientific, 2009.

[47] S. J. Körner and T. Brumm. Natural Language Specification Improvement With Ontologies. International Journal of Semantic Computing, 03(04):445–470, 2009.

[48] S. J. Körner and T. Brumm. RESI - A natural language specification improver. In Proceedings of the 2009 IEEE International Conference on Semantic Computing, pages 1–8. IEEE, 2009.

[49] J. Krisch and F. Houdek. The Myth of Bad Passive Voice and Weak Words: An Empirical Investigation in the Automotive Industry. In 23rd IEEE International Requirements Engineering Conference, pages 344–351, 2015.

[50] A. V. Lamsweerde. Requirements Engineering. John Wiley & Sons, 2009.

[51] G. Lucassen, F. Dalpiaz, S. Brinkkemper, and J. van der Werf. Forging High-Quality User Stories: Towards a Discipline for Agile Requirements. In 23rd IEEE International Requirements Engineering Conference, pages 126–135, 2015.

[52] A. D. Lucia, F. Fasano, R. Oliveto, and G. Tortora. Recovering traceability links in software artifact management systems using information retrieval methods. ACM Transactions on Software Engineering and Methodology, 16(4), Sept. 2007.

[53] J. Ludewig and H. Lichter. Software Engineering. dpunkt.verlag, 2nd edition, 2010.

[54] A. Mavin, P. Wilkinson, A. Harwood, and M. Novak. EARS (Easy Approach to Requirements Syntax). Proceedings of the IEEE International Conference on Requirements Engineering, pages 317–322, 2009.

[55] D. Méndez Fernández, J. Mund, H. Femmer, and A. Vetrò. In Quest for Requirements Engineering Oracles: Dependent Variables and Measurements for (good) RE. In Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, pages 3:1–3:10. ACM, 2014.

[56] D. Méndez Fernández and S. Wagner. Naming the Pain in Requirements Engineering: A Design for a Global Family of Surveys and First Results from Germany. Information and Software Technology, 57(1):616–643, 2015.


[57] T. Menzies, A. Dekhtyar, J. Distefano, and J. Greenwald. Problems with precision: A response to "Comments on 'data mining static code attributes to learn defect predictors'". IEEE Transactions on Software Engineering, 33(9):637–640, 2007.

[58] L. Mich, M. Franch, and P. L. Novi Inverardi. Market research for requirements analysis using linguistic tools. Requirements Engineering, 9(2):151–151, 2004.

[59] J. Mund, H. Femmer, D. Méndez Fernández, and J. Eckhardt. Does Quality of Requirements Specifications matter? Combined Results of Two Empirical Studies. In Proc. of the 9th International Symposium on Empirical Software Engineering and Measurement, pages 1–10, 2015.

[60] D. Parachuri, A. Sajeev, and R. Shukla. An Empirical Study of Structural Defects in Industrial Use-cases. In Proceedings of the International Conference on Software Engineering, pages 14–23. ACM, 2014.

[61] M. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.

[62] A. Rago, C. Marcos, and J. A. Diaz-Pace. Identifying duplicate functionality in textual use cases by aligning semantic actions. Software & Systems Modeling, pages 1–25, Aug. 2014.

[63] P. Runeson and M. Höst. Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering, 14(2):131–164, Dec. 2008.

[64] P. Runeson, M. Höst, A. Rainer, and B. Regnell. Case Study Research in Software Engineering. Guidelines and Examples. Wiley, 2012.

[65] F. Salger. Requirements reviews revisited: Residual challenges and open research questions. In Proceedings of the 2013 21st IEEE International Requirements Engineering Conference, pages 250–255. IEEE, 2013.

[66] H. Schmid and F. Laws. Estimation of conditional probabilities with decision trees and an application to fine-grained POS tagging. In Proceedings of the Conference on Computational Linguistics, pages 777–784. Association for Computational Linguistics, 2008.

[67] F. Schneider and B. Berenbach. A Literature Survey on International Standards for Systems Requirements Engineering. In Proceedings of the Conference on Systems Engineering Research, volume 16, pages 796–805, Jan. 2013.

[68] K. Schwaber and J. Sutherland. The scrum guide. Technical report, Scrum.org, 2011.

[69] F. Shull, I. Rus, and V. Basili. How perspective-based reading can improve requirements inspections. Computer, 33(7):73–79, 2000.

[70] S. F. Tjong and D. M. Berry. The design of SREE - A prototype potential ambiguity finder for requirements specifications and lessons learned. In REFSQ, pages 80–95. Springer Berlin Heidelberg, 2013.

[71] K. Toutanova, D. Klein, and C. D. Manning. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, 1(June):252–259, 2003.

[72] A. van Deursen, L. Moonen, A. van den Bergh, and G. Kok. Refactoring test code. CWI, 2001.

[73] S. Wagner, J. Jürjens, C. Koller, and P. Trischberger. Comparing bug finding tools with reviews and tests. In Proceedings of Testing of Communicating Systems, pages 40–55. Springer, 2005.

[74] W. M. Wilson, L. H. Rosenberg, and L. E. Hyatt. Automated analysis of requirement specifications. In Proceedings of the International Conference on Software Engineering, pages 161–171. ACM, 1997.


[75] M. V. Zelkowitz, R. Yeh, R. G. Hamlet, J. D. Gannon, and V. R. Basili. The Software Industry: A State of the Art Survey. Foundations of Empirical Software Engineering: The Legacy of Victor R. Basili, 1:383–383, 1983.

[76] M. Zhang, T. Hall, and N. Baddoo. Code bad smells: a review of current knowledge. Journal of Software Maintenance and Evolution, 23(3):179–202, 2011.

Appendix A. Requirements Checklist


Table A.11: Checklist for the students' requirements reviews. Created by Anke Drappa, Patricia Mandl-Striegnitz and Holger Röder based on [13] and [53]. Translated from German.

The document is well structured and easy to understand.

All used terms are clearly defined and consistently used.

All external interfaces are clearly defined.

The level of detail is consistent throughout the document.

The requirements are consistent and unambiguous.

The defined requirements are consistent with the state of the art.

All tasks and data have useful identifiers.

Data is not defined redundantly.

The defined relationships between data objects are necessary and sufficient.

The specification of quality attributes is realistic, useful, quantifiable and unambiguous.

The user interface is comfortable and easy to learn.

The use case describes a behaviour of the system which is valuable and visible for the actor.

The use case is described in a table which is consistently used for the whole requirements specification.

The use case has a unique ID.

The use case has a unique and expressive name.

The main actor’s goal is described in an understandable way.

All actors participating in the use case are specified.

If there is more than one actor, the main actor is identified.

The preconditions of the use case are specified.

The postconditions for the use case are specified.

It is clearly specified how the main actor triggers the main success scenario.

The main success scenario has 3 to 9 steps.

After the main success scenario, the postconditions hold.

The main actor reaches their goal by the main success scenario.

Each step is sequentially numbered.

It is clear which actor is executing the step.

The step does not describe details of the user interface.

The step describes exactly one action of the acting actor.

There are postconditions for each extension.

It is clearly specified in which step the main success scenario deviates into an extension.

The conditions for the deviation into an extension are clearly specified.

After an extension, all postconditions for that extension hold.
