Fuzzy Sets and Systems 63 (1993) 349-358 349 North-Holland

Rule-based fuzzy classification for software quality control Christof Ebert University of Stuttgart, Institute for Control Engineering and Industrial Automation, 70550 Stuttgart, Germany

Received March 1993 Revised August 1993

Abstract: In the area of software development it would be of great benefit to predict early in the development process those components of the software system that are likely to have a high error rate or that need high development effort. This paper discusses fuzzy classification techniques as a basis for constructing quality models that can identify outlying software components that might cause potential quality problems. These models use software complexity metrics that are available early in the development process, thus providing support during the design and the code phases. Experimental results based on real project data are presented to underline the suggested approach and its advantages compared to crisp classification and decision techniques. The application to given data sets and to ongoing projects in the context of consulting activities indicates that a module quality model (with respect to changes) provides both quality of fit (according to past data) and predictive accuracy (according to the current projects).

Keywords: Approximate reasoning; fuzzy data analysis; software engineering; software metrics; software quality control.

1. Introduction

If you cannot measure it, you cannot manage it.

Peter Drucker

The assessment of quality and productivity factors is essential in the management of projects in many areas and environments. In the area of software development it would be of great benefit to predict early in the development process those components of the software system

Correspondence to: C. Ebert, University of Stuttgart, Institute for Control Engineering and Industrial Automation, Pfaffenwaidring 47, 70550 Stuttgart, Germany.

that are likely to have a high error rate or that need high development effort. To ensure success with such goals, an organization must know where it is and how it can get to its destination. Navigating with measures is one important approach to ensuring that it stays on course toward these goals. Another reason for the current awareness that measures are needed is the increase in client requests for accurate figures about cost and quality, and in management requests for indicators of what to improve.

Classes of software components are often not well separated and there are no well-defined boundaries: a distinct module might belong to more than one class, or it might lie in the middle of several classes, thus recommending treating it as partly belonging to each of them. For example, when distinguishing modules that are more error-prone than others, a metric vector consisting of a few measures such as module size, cohesiveness and data fan-in can be determined during the design phase. Now the goal is to determine those modules that belong to the rather small group of modules that potentially cause most of the errors, costs, and rework. Obviously, the best solution would be to filter out exactly the specific high-risk components in order to improve their design or start again from scratch. Unfortunately, the measurement vector usually provides rather continuous data, hence prohibiting clear frontiers between good and bad. Experts' knowledge in such cases covers these problems with linguistic uncertainty and fuzziness: 'if length is medium and cohesion is low the module is likely to cause trouble'. The problem is how to differentiate between such linguistic attributes and how to come to an exact and reproducible decision. While so-called hard clustering algorithms create disjoint subsets out of all objects, fuzzy clustering might result in overlapping subsets with specific degrees of membership for each object. One of the

0165-0114/94/$07.00 © 1994 Elsevier Science B.V. All rights reserved. SSDI 0165-0114(93)E0185-U


350 C. Ebert / Rule-based fuzzy classification

advantages over classification trees is hence the ability to treat different values of explanatory (metric) variables with different relevances.

This paper discusses fuzzy classification techniques as a basis for constructing quality models that can identify outlying software components that might cause potential quality problems. These models use software complexity metrics that are available early in the development process, thus providing support during the design and the code phases. We use already published data for validation in order to permit easy access for external studies. Note that we do not make any assumptions requiring distinct scales, ranges, or distributions of the metrics for the suggested classification technique.

2. Analytical quality control in software engineering

Most software enterprises apply quality assurance that identifies problems (as instances of poor quality) and ensures that they are resolved. Although striving for high quality standards, only a few organisations apply true quality control. Quality control consists of comparing observed quality with expected quality, hence minimizing the effort expended on correcting the sources of defects. In order to achieve software quality, software must be developed in an organized form by using methods and techniques as provided by CASE (Computer Aided Software Engineering) tools. In order to achieve an indication of software quality, the software must be subjected to measurement. This is accomplished through the use of metrics and statistical evaluation techniques that relate specific quantified product requirements to some attributes of quality. The approach of integrating software measures and statistical techniques is shown in Figure 1. The CASE environment provides the formal description of different products developed during the software life cycle with its necessarily defined methods and the underlying process. Multivariate analysis techniques provide feedback about relationships between components (e.g. factor analysis) [1, 2]. Classification techniques help determine outliers (e.g. error-prone components) [3]. Finally, detailed diagrams and tables provide insight into the reasons why distinct components are potential outliers and how to improve them [4]. In this article we focus on fuzzy classification techniques and on approximate reasoning based on such techniques.

Software metrics are measures of software components and their individual development processes. These metrics can be classified according to the empirical objects they measure. The delivered documents can be analyzed with respect to volume (lines of delivered code, lines of comments, module size, etc.), structural aspects (data flow between modules, control flow complexity, hierarchical depth, etc.), or quality factors (error rate, reliability, maintainability, etc.). The underlying development process can be measured in terms of effort (person-days, cumulative time, costs, etc.) or resources used during development (computer hours, number of software engineering personnel, etc.). Because of the extreme subjectivity of quality attributes per se, it is important to select metrics that can be applied to the specific objectives of a program, that have been derived from the program requirements and can be used to prove consistency, that can be applied during all phases of the development process to all products (design, code, documentation, etc.), and that can be collected and evaluated automatically by the CASE environment in use. Management usually likes figures; however, if costs are high from the start and the new CASE environment has already required more money than expected, such programs have no chance of being initiated. The paradox of measurement lies in the fact that it takes effort to measure and to use measures for process control, which in turn hinders the implementation of such programs.

For a better understanding of the text, we provide many small examples that underline the approaches, techniques and terms being used. Most examples in the context of software measurement repeatedly apply a measure called cyclomatic complexity [5]. This metric is based on the control flowgraph structure of a software component, regardless of whether it is source code or a detailed design. The cyclomatic complexity as proposed by McCabe counts all decisions in the flowgraph and was originally designed to measure the number of linearly


Fig. 1. Measures and statistical techniques in software engineering. (Figure not reproduced: it shows software products or documents feeding evaluation and improvement of both the products and the process, with feedback between the two.)

independent paths through a program, which in turn is believed to relate to the maintainability and testability of software. Because of its many derivatives [5] we do not provide exact counting rules. Nevertheless, it should be noted that its range is the positive integers, excluding zero.
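As a sketch of how such a count can be obtained, the graph-theoretic definition V(G) = E - N + 2P (edges, nodes, connected components) can be computed directly from a flowgraph. The toy flowgraph below (an if-else followed by a loop) is our own illustration, not an example from the paper.

```python
# Sketch: McCabe's cyclomatic complexity from a control flowgraph,
# V(G) = E - N + 2P (E edges, N nodes, P connected components).
# The flowgraph below is an invented toy example.

def cyclomatic_complexity(edges, num_nodes, num_components=1):
    return len(edges) - num_nodes + 2 * num_components

# Flowgraph of: if (c1) ... else ...; while (c2) ...
# nodes: 0 = test c1, 1 = then, 2 = else, 3 = test c2, 4 = loop body, 5 = exit
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 3), (3, 5)]
print(cyclomatic_complexity(edges, num_nodes=6))  # 2 decisions + 1 = 3
```

The same value follows from counting the two decisions (the if and the while) and adding one, which matches the decision-counting style the text refers to; the result always lies in the positive integers, never zero, as noted above.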

Quality models are built on former project experiences and combine the quantification of aspects of software components with a framework of rules (e.g. limits for metrics, appropriate ranges, etc.). The product metrics that are used to create quality models can be derived from all products or documents generated during the life cycle. Quality models, thus, are generated by the combination and statistical analysis of product metrics (e.g. complexity measures such as volume, data flow, decision count, etc.) and product or process attributes (e.g. quality characteristics, effort, etc.) [6-8]. These models are evaluated by applying and comparing exactly those invariant figures they are intended to predict, the process metrics (e.g. effort, error rate, number of changes since the project started, etc.). Iterative repetition of this process can refine the quality models, hence allowing their use as predictors for similar environments and projects. For assessing overall quality or productivity, it is suitable to break it down into its component factors (e.g. maintainability), thus arriving at several aspects of software that can be analyzed quantitatively. At the highest level, quality factors are defined that are appropriate for software acquisition managers to



use as an aid in specifying quality objectives for their software systems. Typical problems connected to data collection, analysis, and quality modelling are addressed and discussed comprehensively in [6, 7].

Quality or productivity factors to be predicted during the development of a software system are affected by many product and process attributes, e.g. software design characteristics or the underlying development process and its environment. An example of a metric-based decision environment for knowledge-based systems has been suggested by Behrendt et al. [9]. This tool is based on a factorial quality taxonomy that classifies the above-mentioned quality factors (e.g. reusability) and related sub-factors (e.g. modularity) into linguistic categories (e.g. 'not acceptable'). The classification system takes measurable and nonmeasurable attributes as input, such as the design of control structures or the number of system parameters. Another tool system, for assessing risk factors of software components, has been developed by Porter and Selby [3]. The proposed method automatically generates measurement-based models of high-risk components, based on metrics from previous releases or projects. These models are built according to a classification tree with binary and multivalue decision nodes. While the first approach permits the use of linguistic descriptions and qualitative reasoning without describing how the classes had been created, the latter is based on history-based crisp decisions that lack intuitive justification. Thus, both approaches clearly try to solve the problem of metric-based decision support; however, it is often not clear how to justify the decisions. The most serious constraint imposed by classification trees and other crisp clustering techniques is their goal of identifying mutually exclusive subsets, thus not allowing fuzzy membership in several classes.

3. Metric-based approximate reasoning

Cluster analysis is an instrument to identify structures and relations in a set of objects [1]. It is used to suggest possible organizations of the data into meaningful groups. The underlying assumption is that objects under investigation may be grouped such that elements residing in a particular group or cluster are, in some sense, more similar to each other than to elements belonging to other groups. Classification or clustering algorithms are mathematical tools for detecting similarities between members of a collection of objects. Information about the objects (software components) to be analyzed is input to classification algorithms in the form of metric vectors Mi. The components of a metric vector are the measurements of distinct features that have been chosen as a basis for comparing a specific object to other objects. The output of a clustering algorithm can then be used to classify the objects into subsets or clusters. Metric vectors assigned to the same cluster are in some sense similar to each other, more so than they are to other metric vectors not assigned to that cluster. The metric vectors can be considered as points in a multidimensional space where each software component is represented with distinct coordinates. We identify as a cluster any subset of the points which is internally well connected and externally poorly connected. To sum up, we are considering assignments to groups rather than algorithmic relationships. While the latter goal has been achieved more or less with regression models, predominantly for past projects, the former seems adequate for predicting potential outliers in running projects, where precision is too expensive and often not achievable, and thus inadequate.
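The difference between hard assignment and overlapping fuzzy membership can be sketched as follows. The one-dimensional metric values, the two cluster centers, and the fuzzifier m = 2 are invented for illustration; the memberships are computed with the standard fuzzy c-means formula, not with an algorithm from the paper.

```python
# Sketch: hard versus fuzzy assignment of a one-dimensional metric value to
# two clusters. Centers, values, and the fuzzifier m are invented; the
# membership formula is the standard fuzzy c-means one.

def fuzzy_memberships(x, centers, m=2.0):
    d = [abs(x - c) for c in centers]
    if any(di == 0.0 for di in d):          # x coincides with a center
        return [1.0 if di == 0.0 else 0.0 for di in d]
    return [1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1.0)) for j in range(len(d)))
            for i in range(len(d))]

centers = [5.0, 25.0]        # e.g. a 'low' and a 'high' complexity cluster
for x in (6.0, 15.0, 24.0):  # clearly low, borderline, clearly high
    u = fuzzy_memberships(x, centers)
    hard = u.index(max(u))   # a crisp assignment discards the borderline case
    print(x, [round(ui, 3) for ui in u], hard)
```

For the borderline value the memberships are 0.5 each, whereas a hard algorithm must force the component into exactly one subset.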

Developing a classification system consists of the following elements [10]:

- A set of descriptions (a metric vector M = {m1, ..., mn}) based on n selected features that contain enough information to describe the object in the application's domain. In the case of analyzing software projects this set consists of n software measurements, such as cyclomatic complexity or the number of input data to a module. These measures should be collected in a closely defined and reproducible fashion, hence enabling automatic retrieval. The metric set M can be gained during the complete project life cycle depending on its availability (e.g. design metrics vs. code metrics). M exists for each software component under investigation, thus resulting in a definite representation of each component in the n-dimensional metric space.

- A quality or productivity factor vector F that



consists of single attributes of software that contribute to quality or productivity, respectively. Typically F considers aspects such as reliability, error count, or person-days. F can be defined as a type of metric that provides a direct measure of software quality or productivity. Elements of F are often discrete (nominal, ordinal, or interval at most). They are supplied by human experts, hence being fuzzy in nature. When combinations of features are to be considered, their comprehensibility to human experts must be taken into account. Usually F is unknown during the project development and therefore the highest interest lies in its early and accurate prediction.

- A set of validating data (validating vector V) that should provide information about the degree to which given software possesses the quality or productivity attributes that are also collected in F. This data set V can have values such as number of errors or MTTF of a given group of software components. It is thus a collection of direct measures or indicators of quality or productivity factors from past projects that are of interest to the developer.

- A set of patterns (rule vector R) that reflect the experts' knowledge about how to assign data sets M to target classes in F. Such patterns consist of rules for differentiating between members and nonmembers of a distinct target class. With two ranking classes reflecting the discrimination of two categories 'good' and 'not so good', it is easier to find rules for classification than for more classes. The ability to identify such target classes and assign rules for discrimination can only be evaluated experimentally.

- In order to construct and evaluate a classification scheme, the data sets M and V need to be divided into two mutually exclusive sets before the classification process takes place. One is used for training or construction of a distinct classification scheme, while the other is used to validate the scheme. After the classification scheme has been completed, its predictive accuracy is tested against the data of the second group. This is especially important whenever the classification scheme is not intuitively based (e.g. the automatic construction of a classification tree requires one set of data for refinement and one for validation [3]).
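A minimal sketch of this division follows; the metric vectors and change counts are invented, and a real scheme would typically stratify or repeat the split, which is omitted here.

```python
# Sketch: dividing the data sets M and V into two mutually exclusive sets,
# one for constructing the classification scheme and one for validating it.
# The metric vectors and change counts below are invented.
import random

def split(component_ids, train_fraction=0.5, seed=42):
    ids = list(component_ids)
    random.Random(seed).shuffle(ids)      # reproducible random partition
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]

M = {i: (i % 7, i % 3) for i in range(10)}   # metric vector per component
V = {i: i % 2 for i in range(10)}            # e.g. observed change count
construction, validation = split(M)
assert not set(construction) & set(validation)    # mutually exclusive
assert set(construction) | set(validation) == set(M)
```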

Most of the facts and rules that belong to human expertise contain fuzzy predicates and thus are fuzzy propositions. This is particularly true of heuristic rules, which are often used in the management of software projects. Because all the factors in software development management are vague and imprecise (i.e. meaning different things to different people; e.g. maintainability can be interpreted as ease of understanding software documents or in terms of formal cohesiveness of components), an alternative method must be used, permitting approximate reasoning from vague inputs. The method suggested and discussed here is based on fuzzy reasoning [11-13]. As stated by Zadeh, fuzzy reasoning refers to a process by which an imprecise conclusion is deduced from a collection of imprecise observations or antecedents, and such reasoning is qualitative rather than quantitative in nature.

The desirability of a fuzzy classification technique is implied by the fact that although the extreme states of software components' quality, good or bad, are well defined, they are separated by a continuum of intermediate states. A further consideration is the importance of retaining the semantic richness of the descriptions of software quality provided by an expert. In existing software classification systems, the fuzziness of the knowledge base is ignored because neither predicate logic nor probability-based methods provide a systematic basis for dealing with it [2-4, 10]. As a consequence, fuzzy facts and rules are generally manipulated as if they were non-fuzzy, leading to conclusions whose validity is open to question. As a simple illustration of this point, consider the fact [3]: 'If data bindings are between 6 and 10 and cyclomatic complexity is greater than 18 the software component is likely to have errors of a distinct type'. Obviously the meaning of this (automatically generated) fact is less precise than stated and might be provided by a maintenance expert as a fuzzy fact: 'If data bindings are medium and cyclomatic complexity is large then the software component is likely to have errors of a distinct type'. Of course, the latter fact requires the determination of the fuzzy attributes 'medium' or 'large' in the context of the linguistic variables they are associated with (i.e. data bindings and cyclomatic complexity). Although human experts are



rather unsuccessful in quantitative predictions (e.g. predicting the error count or length of a given component), they may be relatively efficient in qualitative forecasting. Actually, most knowledge of experts consists of qualitative variables stated verbally, as evidenced by developments in knowledge-based systems [13].
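The contrast between the crisp fact and its fuzzy counterpart can be made concrete in a few lines. The trapezoidal membership functions for 'medium' and 'large' below are our own assumptions, placed around the quoted crisp thresholds (6-10 bindings, complexity greater than 18); the paper does not specify them.

```python
# Sketch: the crisp fact from [3] versus its fuzzy counterpart. The
# trapezoidal membership functions for 'medium' and 'large' are assumed,
# chosen around the crisp thresholds, not taken from the paper.

def trapezoid(x, a, b, c, d):
    """0 outside [a, d], 1 on [b, c], linear in between."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def medium_bindings(x):
    return trapezoid(x, 4, 6, 10, 12)

def large_complexity(x):
    return trapezoid(x, 14, 18, 1000, 1001)   # open-ended 'large'

def crisp_rule(bindings, cc):
    return 6 <= bindings <= 10 and cc > 18

def fuzzy_rule(bindings, cc):
    return min(medium_bindings(bindings), large_complexity(cc))  # fuzzy AND

# A component just outside the crisp boundaries is rejected outright,
# while the fuzzy rule still flags it with a substantial degree:
print(crisp_rule(5.5, 17.5))            # False
print(round(fuzzy_rule(5.5, 17.5), 2))  # 0.75
```

The fuzzy AND is taken as the minimum of the antecedent memberships, which is the usual choice in fuzzy rule systems; other t-norms could be substituted.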

In order to permit rule-based approximate reasoning based on external input data from software products or documents and vague knowledge about the underlying process that produced the software components to be classified, it is necessary to permit the formulation of fuzzy (expert) rules. The combination of interacting fuzzy rules derived from expert knowledge is called a fuzzy expert system, because it is supposed to model an expert and make his or her knowledge available to non-experts for purposes of diagnosis or decision making [12]. The declarative knowledge of fuzzy expert systems is represented as fuzzy sets and data. The reasoning process itself consists of the following steps:

(1) Describe an exactly defined process environment from which the software products under investigation are selected.

(2) Select a group of expert development staff who will be asked to develop a consensus concerning distinct quality factors. Of course, this jury should consist of people with respected knowledge in the areas that influence those projects being ranked (e.g. data-base or real-time experts for projects determined by such problem domains). If the selected quality factors include maintainability or testing effort, staff members that are assigned to such areas must be considered in the jury.

(3) Select a random sample of software components from the environment (e.g. modules, procedures, classes).

(4) Have the jury cluster these software items with respect to the factors in F by comparing and evaluating them. Each case must be explained exactly in order to permit a repeatable classification.

(5) Let the experts condense these explanations into a set of recursively refined rules R. The rules are usually dominated by fuzzy, nevertheless precise, linguistic and qualitative descriptions, in opposition to quantitative selection formulas that might seem preferable at first sight.

(6) Integrate this set of rules into a classification scheme that can be applied automatically to analyze other components. Widely used classification methodologies include cluster analysis, classification trees, fuzzy classification, and Bayesian classification [14]. Since many metric vectors are multimodal in nature, it is difficult to apply conventional partitioning procedures, such as a single-stage Bayesian classifier with a common set of features for all classes. In these cases, a hierarchical classification approach is usually preferred [3].

(7) Validate R by classifying other data sets V' of further software components according to the same factors F by applying the rules R.

In software measurement, the extent to which the number system reflects the properties of objects defines various types of scales and distributional forms. Measures are not connected to distinct scales per se [5]. They can be used in combination with an ordinal or an interval scale, depending on the specific representation. The conclusions and statistical methods that can be drawn about differences or ratios of measurements are restricted by proper recognition of the type of scale the particular measurement constitutes. Thus, it is important to realize that the specific conditions for the use of a measure on a certain scale need to be defined. For example, quality factors can be developed according to the whole bandwidth of scales. They range from nominal scaling (e.g. 'good' or 'bad'), through ordinal scaling (e.g. quality ranking) and interval scaling (e.g. understandability of comments), to ratio scaling (e.g. number of faults). Basically, most measures are only ordinal and often they are not normally distributed. Unfortunately, several multivariate statistical methods require the level of measurement to be at an interval or ratio scale. It is therefore necessary to be very cautious about the use of techniques which assume an underlying ratio scale and normal distribution. The proper use of measurement theory requires constructing a scale, for example for testability, as discussed in [5, 7, 10].

Automatically generated software measures (e.g. cyclomatic complexity or reliability) raise the question whether their preciseness (i.e. exact numbers, at least in an integer set and on an interval scale) is necessary and useful. In this context it is important to keep in mind that the human expert who is contributing to rules is an



essential source of fuzziness, because the same set of metric values may be perceived differently depending upon the experience and individual preferences in the acceptability of distinct metric values. Crisp boundaries provided during knowledge acquisition are usually not reproducible and, in their crispness, without intuitive rationale. An example of the awareness of this observation, which we call pseudo-exactness, is the changing attitude towards crisp boundaries for the cyclomatic complexity. Originally it was fixed at seven decisions per component; later it became less precise, resulting in the suggestion to limit the decision count to 'around ten to twenty' [3]. In order to reduce this pseudo-exactness, it is straightforward to provide different linguistic descriptors for categories the expert can deal with mentally. One of the commonly applied methods to develop such linguistic categories based on an interval scale is the categorical judgement technique [15]. In this approach, experts are asked to distinguish several categories or linguistic descriptors on the complete scale. The number of categories should provide enough precision to reflect those differences that are necessary for proper rule inferring. It is recommended to use between two and seven categories; a larger number of categories might result in the end-effect, which means that subjects tend to avoid the use of the end categories. Values of linguistic variables should be ordered along a favorable-neutral-unfavorable scale, such as very complex, fairly complex, in between, more simple, very simple.
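A sketch of mapping an interval-scaled measure onto such ordered linguistic categories follows; the five labels mirror the scale quoted above, while the cut points are hypothetical values a jury of experts might agree on.

```python
# Sketch: categorical judgement — mapping an interval-scaled measure onto
# ordered linguistic categories. The labels follow the favorable-to-
# unfavorable scale in the text; the cut points are hypothetical.

CATEGORIES = ["very simple", "more simple", "in between",
              "fairly complex", "very complex"]
BOUNDARIES = [5, 10, 20, 30]   # assumed expert-agreed cut points

def categorize(value, boundaries=BOUNDARIES, labels=CATEGORIES):
    for cut, label in zip(boundaries, labels):
        if value <= cut:
            return label
    return labels[-1]          # everything above the last cut point

print(categorize(3))    # very simple
print(categorize(24))   # fairly complex
```

With five categories the scheme stays within the recommended two-to-seven range, leaving the end categories reachable without inviting the end-effect.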

4. An experiment: Predicting changes based on software design metrics

It is relatively easy to construct metric-based quality models that happen to classify data of past projects well, because all such models can be calibrated according to quality of fit. The difficulty lies in improving and stabilizing models based on historic data so that they are of value in anticipating future outcomes. While consulting with two companies that develop software for telecommunication and industrial automation, respectively, we had the task of developing a quality model with predictive accuracy. This model had to be integrated in a development process based on structured analysis, preliminary

and detailed design with a high-level design language, and coding in C. The process is embedded in the EPOS CASE environment, which already provides a software metric tool applicable to software design documents [16]. The main interest in metric-based software development for the two companies was in detecting change-prone modules during the design. Changes include both corrective and additive maintenance; in either case they indicate components requiring more effort than others. We therefore concentrate on describing the application of the module-changes subset of our quality model.

Instead of incrementally refining a predictive quality model with training data, we decided to build a rather simple network of linguistic rules based on experience from the literature and from former projects in the same CASE environment. The result of this first step was a small set of intuitive rules for reducing the error-proneness of modules, as mentioned by Card and others [4, 8, 7]:

- reduce functional coupling of modules;
- reduce data coupling of modules (use of global data, etc.);
- reduce the number of operands in a module;
- reduce module size and increase module cohesion;
- reduce the number of decisions in the control flow.

Because the underlying goal was to find rules applicable to designs, we could provide information about module coupling (both functional and data), control flow, and the use of operands. We omitted module size because designs usually include a lot of redundant information (e.g. concerning coupling, descriptions, etc.), thus vastly increasing module size. Of course, we found relationships between module volume in design and in source code only late during the detailed design.

These basic rules only distinguish between values being either low or high. The next step therefore was to define ranges and membership functions for these linguistic labels. The threshold ranges where the two membership functions intersect were provided by the same sources as the rules. Since we suggest a Parnas-type approach for module design, with rather small modules each incorporating one function, the ranges of the linguistic labels are comparatively small. The membership functions and underlying linguistic rules are shown in Figure 2. As already mentioned, this set of rules is only a subset of the rather large complete model. For better understanding, we have left out the influences of explanatory variables resulting from environmental or human factors (e.g. review effort or designer experience).

Fig. 2. Fuzzy membership functions and inference rules for module design. Each linguistic variable carries two overlapping membership functions, 'low' and 'high'; the transition regions are approximately 10 to 30 for cyclomatic complexity, 0 to 10 for functional fan-in, 5 to 15 for data fan-in, and 20 to 60 for unique operands (n2).

Rules:
if cyclomatic complexity high then change-prone
if functional fan-in high then change-prone
if n2 high and cyclomatic complexity high then change-prone
if data fan-in high then change-prone
if data fan-in low and cyclomatic complexity low then change-free
if functional fan-in low and cyclomatic complexity low then change-free
if functional fan-in low and data fan-in low and n2 low then change-free
if n2 low and cyclomatic complexity low then change-free

To demonstrate the effectiveness of our approach we applied this fuzzy classification (Figure 2) to data originally published by Kitchenham et al. [6]. Given two sets of metrics from modules of the ICL general-purpose operating system VME, fuzzy classification was performed to estimate the change-proneness of the modules. Both data sets came from two different implementations of the same sub-system with identical functional requirements. Kitchenham provided exact data including, among others, measures for the cyclomatic number, functional fan-in, data fan-in, and unique operands, which we used as input to our fuzzy quality model. As each program was coded, it was placed under formal configuration control, and subsequent changes to modules were recorded. No distinction was made between corrective and additive change intentions. Because we used these data sets for validation purposes and not for testing purposes, we treated the sets equally despite knowing about the presence of outliers.
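To illustrate how two overlapping membership functions per variable combine with the rules of Figure 2 into a change-proneness estimate, here is a minimal Mamdani-style sketch. The breakpoints are read off Figure 2 and are approximate; the function and variable names are ours, and the defuzzification uses output singletons at 0% and 100% as a simplification of the paper's center-of-gravity method:

```python
def low(x, a, b):
    """Membership in 'low': 1 below a, falling linearly to 0 at b."""
    return min(1.0, max(0.0, (b - x) / (b - a)))

# Transition regions read off Figure 2 (approximate):
# cyclomatic complexity, functional fan-in, data fan-in, unique operands.
RANGES = {'cc': (10, 30), 'ffi': (0, 10), 'dfi': (5, 15), 'n2': (20, 60)}

def change_proneness(cc, ffi, dfi, n2):
    """Evaluate the Figure 2 rule base; returns 0..100 (% change-prone)."""
    values = {'cc': cc, 'ffi': ffi, 'dfi': dfi, 'n2': n2}
    lo = {k: low(v, *RANGES[k]) for k, v in values.items()}
    hi = {k: 1.0 - m for k, m in lo.items()}   # memberships sum to one
    # Rule strength: min over antecedents; max-aggregation per class.
    prone = max(hi['cc'], hi['ffi'], min(hi['n2'], hi['cc']), hi['dfi'])
    free = max(min(lo['dfi'], lo['cc']), min(lo['ffi'], lo['cc']),
               min(lo['ffi'], lo['dfi'], lo['n2']), min(lo['n2'], lo['cc']))
    if prone + free == 0.0:
        return 50.0   # no rule fires: undecided
    return 100.0 * prone / (prone + free)   # singleton defuzzification
```

A module with all metrics above their transition regions (e.g. cyclomatic complexity 35, functional fan-in 12) evaluates to 100% change-prone, while one with all metrics below them evaluates to 0%; intermediate values yield graded results.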

After deciding what metric factors to consider for creating the quality model and for the validation, the second step of the classification process was to define membership functions for the linguistic variables (Figure 2). We decided to use symmetrical triangular or trapezoidal functions because these functions are easy to understand and can be interpreted even without a fuzzy inference tool. According to the psychophysical measurement approach mentioned above, the sum of all memberships should equal one, corresponding to the concept of full membership. Full membership, however, is not mathematically necessary in this case. We proceeded by assigning only two membership functions, for the labels 'low' and 'high', to all linguistic variables. This decision is heuristic by nature; it is based on past project experiences and on insufficient experience to justify more detailed production rules for the quality model. Of course, it is possible to generate membership functions and rules automatically; however, the results are rather unintelligible and difficult to explain. The third step was to provide linguistic rules for all permutations of the values of the linguistic variables. Since we decided to control the membership functions and their mean values, we could apply rules we had in use for other projects. The rules can obviously be applied to membership functions with different scales because they consider only linguistic values (without questioning whether 'high' means a cyclomatic complexity of around 5 for procedures or around 50 for modules). Some typical rules are given in Figure 2. Membership values of the output variable 'result' were crisp in nature, with a 50% threshold separating 'few changes' from 'change-prone'. The defuzzification method was based on the center of gravity (of the resulting combined output membership functions), thus providing results with values anywhere between 0% and 100% change-proneness.

The results for the two sub-systems are presented in Tables 1 and 2. Both tables provide a portfolio of predictions versus reality. Notions in quotation marks are the predictions (corresponding to the columns of the tables), while the rows correspond to measured values. The margins contain the sums of rows and columns. The upper left field of each table shows the number of correctly classified modules with few changes, while the lower right field shows those correctly classified with many changes. The two other fields represent misclassification errors, either type I ('change-prone' modules classified as 'few changes') or type II ('few changes' modules classified as 'change-prone'). The lower left field, representing type I errors, is of course the least desirable because it provides false security. Table 1 provides the results for VME sub-system 1 with 27 modules, and Table 2 provides the data for the 40 modules of VME sub-system 2. In order to permit a crisp validation based on chi-square analysis, we assigned 'few changes' to fewer than four changes and 'change-prone' to four or more changes, respectively. Both samples yielded medium values of chi-square and acceptable significance levels α. Combining both samples, which is feasible as long as we do not consider dependencies and the internal history of the sets, we obtain a high value of chi-square (24.6) and a very small significance level (α ≤ 1E-6).

Table 1. Contingency table for Kitchenham data (VME sub-system 1). χ² = 6.3, α ≤ 0.1

                'Few changes'   'Change-prone'   Total
<4 changes           10               3            13
≥4 changes            4              10            14
Total                14              13            27

Table 2. Contingency table for Kitchenham data (VME sub-system 2). χ² = 18.8, α ≤ 0.001

                'Few changes'   'Change-prone'   Total
<4 changes           22               2            24
≥4 changes            4              12            16
Total                26              14            40

5. Conclusions
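The chi-square values reported above can be reproduced from the 2x2 contingency tables with the standard formula (a sketch, without continuity correction; function name is ours):

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 contingency table
    [[a, b], [c, d]] without continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d))

print(round(chi_square_2x2(10, 3, 4, 10), 1))   # sub-system 1: 6.3
print(round(chi_square_2x2(22, 2, 4, 12), 1))   # sub-system 2: 18.8
print(round(chi_square_2x2(32, 5, 8, 22), 1))   # combined sample: 24.6
```

The combined sample is obtained by adding the two tables cell by cell, which matches the chi-square value of 24.6 cited in the text.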

We have discussed some of the problems that arise in the management of uncertainty and vagueness when classifying software components. Fuzzy logic provides a natural conceptual framework for the representation of knowledge, and for inference processes based on this knowledge, that is imprecise, incomplete or inconsistent. The choice of the proper approach to automatic decision support depends on the problem: we have to distinguish between uncertainty in the case data and uncertainty in the knowledge base. In the course of classifying software project and product data we have been discussing the latter case. Membership functions of linguistic variables and rules for classification and decision support should hence be fuzzy in nature, because it is difficult for software engineers to provide a complete set of mutually exclusive heuristic classification rules. On the other hand, it is unsatisfactory and practically unrealistic to deal with automatically generated decision trees or rule sets with crisp thresholds and results. We emphasize the necessity of applying fuzzy concepts to metric-based software project and quality management because subjective and qualitative judgement plays an important role in this area.

A fuzzy classification technique has been introduced and applied to two data sets. The application to these sets and to ongoing projects in the context of consulting activities indicates that the module quality model, with respect to changes, provides both quality of fit (according to past data) and predictive accuracy (according to the current projects). Since some guiding principles for decision support are available, we focussed on utilizing expert-derived, albeit vague, knowledge that we included in a fuzzy expert-system-type classification scheme. For the same reason (i.e. software engineering expert knowledge is available), we strongly oppose using learning strategies that are purely result-driven (e.g. classification trees or mere neural network approaches). However, we see the necessity of such approaches when only a few guiding principles are available and sufficient project data can be utilized for supervised learning.

Further research in the area of fuzzy classification of software components should focus on the following areas:

- Investigation of more projects from different application areas in order to provide fundamental insight into the development of membership functions and rules and their influence on different project types. This should include analyses of different approaches to constructing classification schemes (e.g. decision trees) and optimizing their accuracy, intelligibility, and reproducibility.

- Improved fuzzy classification tools with visualization of decision paths, online changes, and editing functions for both rules and membership functions that can be integrated into common CASE repositories. In order to be accepted by software engineering experts and ordinary users, a classification system should have an explanation facility that shows how a conclusion is arrived at and how the classification inference engine makes use of the knowledge provided by the fuzzy production rules.

Acknowledgements

We would like to thank the two companies, who wish to remain anonymous, that provided data for improving the fuzzy classification model; GPP, Munich, for CASE tool support; and Dr. D. Gustafson of Kansas State University, Manhattan, KS, USA, for his helpful suggestions. The author acknowledges the financial support of the German National Science Foundation (DFG) for this ongoing research project.

References

[1] C. Ebert, Visualization techniques for analyzing and evaluating software measures, IEEE Transactions on Software Engineering 18 (11) (November 1992) 1029-1034.

[2] J.C. Munson and T.M. Khoshgoftaar, The detection of fault-prone programs, IEEE Transactions on Software Engineering 18 (5) (1992) 423-433.

[3] A.A. Porter and R.W. Selby, Empirically guided software development using metric-based classification trees, IEEE Software 7 (3) (March 1990) 46-54.

[4] D.N. Card and R.L. Glass, Measuring Software Design Quality (Prentice Hall, Englewood Cliffs, N.J., USA, 1990).

[5] H. Zuse, Software Complexity: Measures and Methods (De Gruyter, Berlin, 1991).

[6] B.A. Kitchenham and L. Pickard, Towards a constructive quality model, Software Engineering Journal 2 (7) (July 1987) 114-126.

[7] N.E. Fenton, Software Metrics: A Rigorous Approach (Chapman & Hall, London, UK, 1991).

[8] M. Shepperd, Early life-cycle metrics and software quality models, Information and Software Technology 32 (4) (1990) 311-316.

[9] W. Behrendt, S.C. Lambert et al., A metrication framework for knowledge-based systems, in: Proc. Eurometrics '92, Comm. of the E.C.: EUREKA, Brussels, April 1992, 197-210.

[10] N.F. Schneidewind, Methodology for validating software metrics, IEEE Transactions on Software Engineering 18 (5) (1992) 410-422.

[11] L.A. Zadeh, A fuzzy-algorithmic approach to the definition of complex or imprecise concepts, Int. J. Man-Machine Studies 8 (1976) 249-291.

[12] H.-J. Zimmermann, Fuzzy Set Theory and its Applications, 2nd edition (Kluwer, Boston, 1991).

[13] L. Zadeh and J. Kacprzyk (Eds.), Fuzzy Logic for the Management of Uncertainty (John Wiley & Sons, New York, 1992).

[14] H.M. Wadsworth (Ed.), Handbook of Statistical Methods for Engineers and Scientists (McGraw-Hill, New York, NY, USA, 1990).

[15] F.N. Jones, Overview of psychophysical scaling methods, In: E.C. Carterette and M.P. Friedman, Eds., Handbook of Perception, Vol. II, Psychophysical Judgement and Measurement (Academic Press, New York, NY, USA, 1974).

[16] C. Ebert, Integrating automated software design measures: Results from a CASE-study, Proc. of 10th Annual Software Reliability Symposium, Denver, June 25-26, 1992 (IEEE Comp. Soc. Press, Los Alamitos, CA, USA, 1992) 93-100.