

Research

Software development cost estimation: Integrating neural network with cluster analysis

Anita Lee a,*, Chun Hung Cheng 1,b, Jaydeep Balakrishnan 2,c

a Decision Science and Information Systems Area, School of Management, Gatton College of Business and Economics, University of Kentucky, Lexington, KY 40506-0034, USA

b Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong

c Faculty of Management, University of Calgary, 2500 University Drive N.W., Calgary, Alberta T2N 1N4, Canada

Received 19 March 1997; accepted 15 March 1998

Abstract

For software project planning, control, and management, an accurate estimate of software development cost is important. Past research has focused on using parametric models to predict development cost based on attributes such as lines of code or function points. This requires researchers to identify the set of factors that influence cost estimation before the system is constructed. We propose a non-parametric approach that integrates a neural network method with cluster analysis to estimate development cost. The integration of the two techniques not only allows for a more accurate cost estimate but also leads to an increase in the training efficacy of the network. © 1998 Elsevier Science B.V. All rights reserved

Keywords: Software development cost; Neural network; Cluster analysis; Machine learning

1. Introduction

Accurate cost estimation of a software development effort is critical for good management decision making; the estimate affects software project control, budgeting, personnel allocation, and bidding for contracts. Accuracy matters in both directions. An estimate that is too low may either cause a loss or compromise the quality of the software developed, resulting in partially functional or insufficiently tested software that later requires high maintenance costs. If the estimate is too high, many useful projects may not be funded, resulting in misallocation of resources and a backlog of needed software. An early cost estimate is equally important because the result will have value for project management and control only if it is provided in the early phases of the software development life cycle, preferably during planning and requirement analysis rather than during coding and testing. Thus, from an organizational perspective, an early and accurate cost estimate will reduce the possibility of organizational conflict during the later stages.

Information & Management 34 (1998) 1–9

*Corresponding author. E-mail: [email protected]
1 E-mail: [email protected]
2 E-mail: [email protected]

0378-7206/98/$19.00 © 1998 Elsevier Science B.V. All rights reserved

PII: S-0378-7206(98)00041-X


This paper is intended to provide software managers with a decision support tool for early cost estimation of software development efforts. The work is motivated by the need to explore innovative ways to estimate software development cost in the 1990s, given the increasing complexity of the problem space that results from advances in computer technologies, expert system applications, and interorganizational systems [2, 21]. A new technique integrating a neural network method with cluster analysis is implemented, tested using historical data, and shown to improve network performance. Unlike prior approaches to software development cost estimation such as size-based, function-based, or decision-tree learning based models, the new technique is capable of distinguishing relevant cost estimation factors from irrelevant ones. This removes the need to specify the set of cost estimation factors beforehand. In addition, our technique yields more accurate cost estimates in our experimental study.

2. Literature review

Software cost is growing at an annual rate of 12% and is expected to reach $400 billion by the year 2000 [4]. However, a significant amount of the cost, 40% or more, is devoted to the maintenance of existing software instead of developing much needed new products [5]. In addition, software cost is related closely to software quality and productivity. Unrealistically low cost estimates frequently lead to poor product quality and low project productivity [11]. The importance of understanding software cost has therefore motivated considerable research to identify factors that influence software costs. This has yielded a number of software cost estimation models.

2.1. Size-based models

Size-based models consider project size measured in lines of code (LOC) or thousands of lines of code (KLOC) to be the primary factor affecting software cost estimation. There are two parts to these cost estimation models. One provides a base estimate of development effort as a function of software size; it is of the form:

$$E = A + B \times (\mathrm{KLOC})^{C}$$

where E is the estimated effort in man-months and A, B, and C are constants.

The second modifies the base estimate by taking into account such environmental factors as the method used in top-down design, structured code, and personnel experience and ability, among others.

Typical models of this kind include the Walston–Felix model [20], the Doty model [8], the Bailey–Basili model [1], and Boehm's Constructive Cost Model (COCOMO) [3]. Among these models, COCOMO is the most widely known and studied. Table 1 provides an overview of each of these models in its base estimate form.

The problems with using LOC as an estimate of project size include:

1. there is no accepted definition of LOC and few researchers specify the line-counting rules used, resulting in variations and uncertainty;

2. LOC is language dependent; fewer LOC may be required for a higher-level language than a lower-level language and yet the time per line is greater, resulting in difficulty in directly comparing projects using different languages;

3. it is difficult to estimate LOC. An experiment performed by Yourdon, in which the size of 16 projects was estimated by experienced managers based on the specification of each project, showed a discrepancy between estimated and actual project size ranging from -210% to 83%, as shown in Table 2 [10];

4. LOC places undue emphasis on coding, which accounts for only 10 to 15% of the total effort in software development [6]. Hence, factors other than size should be considered in estimating software development cost.

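Table 1
An overview of size-based models (base estimate E = A + B × (KLOC)^C; a blank A means the model has no additive constant)

| Model name | A | B | C |
|---|---|---|---|
| Walston–Felix | | 5.2 | 0.91 |
| Bailey–Basili | 5.5 | 0.73 | 1.16 |
| Boehm basic | | 3.2 | 1.05 |
| Boehm intermediate | | 3 | 1.12 |
| Boehm advanced | | 2.8 | 1.2 |
| Doty | | 5.288 | 1.047 |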
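As an illustration of the base-estimate form, the following Python sketch evaluates E = A + B × (KLOC)^C with the constants listed in Table 1. It assumes that a blank A entry means A = 0; the names `SIZE_MODELS` and `base_effort` are hypothetical, and the environmental adjustment (the second part of each model) is not modeled.

```python
# Base-estimate constants (A, B, C) as listed in Table 1.
# Assumption: a blank A entry in the table is read as A = 0.
SIZE_MODELS = {
    "Walston-Felix": (0.0, 5.2, 0.91),
    "Bailey-Basili": (5.5, 0.73, 1.16),
    "Boehm basic": (0.0, 3.2, 1.05),
    "Boehm intermediate": (0.0, 3.0, 1.12),
    "Boehm advanced": (0.0, 2.8, 1.2),
    "Doty": (0.0, 5.288, 1.047),
}


def base_effort(model: str, kloc: float) -> float:
    """Return the base effort estimate in man-months: E = A + B * KLOC**C."""
    a, b, c = SIZE_MODELS[model]
    return a + b * kloc ** c


if __name__ == "__main__":
    # Compare the base estimates of all six models for a 10 KLOC project.
    for name in SIZE_MODELS:
        print(f"{name:20s} {base_effort(name, 10.0):8.1f} man-months")
```

Running the script on a single project size makes the spread between the models visible, which is one reason size alone is a weak basis for an estimate.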


2.2. Function-based models

Function-based models use measures other than LOC in estimating software development cost. 'Function points', as defined by Albrecht of IBM in 1979, are computed by a process called function point analysis (FPA) [14]. Albrecht's FPA involves the following steps:

1. Identify the major system components: external inputs, external outputs, logical internal files, external interface files, and external inquiries.

2. Classify each component as 'simple', 'average', or 'complex' depending on the number of interacting data elements and other factors.

3. Calculate the unadjusted function points (UFP) using the following table, which includes weights:

| Function type | Simple | Average | Complex | Total |
|---|---|---|---|---|
| External input | × 3 | × 4 | × 6 | |
| External output | × 4 | × 5 | × 7 | |
| Logical internal file | × 7 | × 10 | × 15 | |
| External interface file | × 5 | × 7 | × 10 | |
| External inquiry | × 3 | × 4 | × 6 | |
| Total unadjusted function points | | | | |

4. Adjust the unadjusted function points for application and environment complexity through a measure called the complexity adjustment factor (CAF), i.e., function points = UFP × CAF.

CAF is calculated by using the formula:

$$\mathrm{CAF} = 0.65 + 0.01\,N$$

where N is the total degree of influence (DI) of 14 characteristics, which are data communications, distributed processing, performance objective, configuration load, transaction rate, on-line data entry, end-user efficiency, on-line update, complex processing, re-usability, installation ease, operational ease, multiple sites, and change facilitation. DI takes a value from 0 (no influence) to 5 (strongest influence).
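The four FPA steps can be sketched in Python as follows. The weights come from the UFP table and the CAF formula above; the function names, the dictionary keys, and the example counts are hypothetical and serve only to show the arithmetic.

```python
# Weights from the unadjusted-function-point table: (simple, average, complex).
WEIGHTS = {
    "external_input": (3, 4, 6),
    "external_output": (4, 5, 7),
    "logical_internal_file": (7, 10, 15),
    "external_interface_file": (5, 7, 10),
    "external_inquiry": (3, 4, 6),
}


def unadjusted_function_points(counts):
    """counts maps each function type to (n_simple, n_average, n_complex)."""
    return sum(
        n * w
        for ftype, per_level in counts.items()
        for n, w in zip(per_level, WEIGHTS[ftype])
    )


def complexity_adjustment_factor(degrees_of_influence):
    """CAF = 0.65 + 0.01 * N, where N sums 14 ratings, each between 0 and 5."""
    if len(degrees_of_influence) != 14:
        raise ValueError("expected 14 degree-of-influence ratings")
    if any(not 0 <= d <= 5 for d in degrees_of_influence):
        raise ValueError("each rating must lie between 0 and 5")
    return 0.65 + 0.01 * sum(degrees_of_influence)


def function_points(counts, degrees_of_influence):
    """Function points = UFP * CAF."""
    ufp = unadjusted_function_points(counts)
    return ufp * complexity_adjustment_factor(degrees_of_influence)


if __name__ == "__main__":
    # Hypothetical component counts, for illustration only.
    example_counts = {
        "external_input": (2, 1, 0),
        "external_output": (0, 3, 0),
        "logical_internal_file": (1, 0, 0),
        "external_interface_file": (0, 0, 0),
        "external_inquiry": (0, 2, 0),
    }
    print(function_points(example_counts, [3] * 14))
```

Because each of the 14 ratings lies between 0 and 5, CAF ranges from 0.65 to 1.35, so the adjustment can move the unadjusted count by at most 35% in either direction.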

There are several problems with using FPA, including:

1. it is designed for business applications and is not appropriate for scientific or technical applications in which complex algorithms are involved;

2. the validity of the method for general objective assessment of system costs is questionable because many elements such as the weighting factors, component complexity, complexity factors, and degree of influence are subjectively developed for a particular environment only; and

3. systems of high internal complexity are not adequately considered.

2.3. Learning-based models

Both size-based and function-based models are parametric; they use a function/formula of fixed form for software cost estimation [18]. Assumptions about the form of the function are needed. Also, the function developed is static, i.e., the factors and their corresponding degree of influence on cost estimation are fixed. More importantly, the set of influential factors on cost estimation must be identified before the model can be constructed. Learning-based models are developed to overcome these problems. These models make no assumptions about the form of the function under study and they are capable of learning incrementally as new data are provided over time. In addition, the availability of historical data on this problem domain makes it particularly suitable for the application of a type of machine learning technique called 'learning by example'. Learning by example attempts to infer or generalize regularities from specific instances of a concept. Thus, the success of applying this type of example-driven learning technique requires the provision of domain-specific knowledge in the form of training and testing data sets. The background of this learning technique and a review of its recent applications can be found in Refs. [19, 9], respectively.

Table 2
The discrepancy between estimated and actual project size

| Project | Actual | Predicted | Actual - predicted | % Difference |
|---|---|---|---|---|
| 1 | 70900 | 34700 | 36200 | 51% |
| 2 | 129000 | 32100 | 96900 | 75% |
| 3 | 23000 | 22000 | 1000 | 4% |
| 4 | 34600 | 9100 | 25500 | 74% |
| 5 | 23000 | 12000 | 11000 | 48% |
| 6 | 25000 | 7300 | 17700 | 71% |
| 7 | 52100 | 28500 | 23600 | 45% |
| 8 | 7650 | 8000 | -350 | -5% |
| 9 | 25900 | 30600 | -4700 | -18% |
| 10 | 16300 | 2720 | 13580 | 83% |
| 11 | 17400 | 15300 | 2100 | 12% |
| 12 | 33900 | 105000 | -71100 | -210% |
| 13 | 57200 | 18500 | 38700 | 68% |
| 14 | 21000 | 35400 | -14400 | -69% |
| 15 | 8640 | 3650 | 4990 | 58% |
| 16 | 17500 | 2950 | 14550 | 83% |

2.3.1. Decision tree learning models

These models construct a decision tree for software cost estimation [15, 16]. The nodes of the tree represent attributes that best divide the data into disjoint groups. The leaves of the tree represent the average cost of software development. By descending the tree along an appropriate path, the cost of software development can be determined.

Relevant attributes for cost estimation are identified from previous efforts. Data on these attributes are accumulated to allow the construction of a decision tree through a process called recursive-partitioning regression in which the best 'divisive' attribute is selected to partition the data into subsets. The process is recursively repeated on these subsets as the tree is expanded until no further partitioning is feasible. Various attribute selection measures have been proposed. For example, ID3 selects the most informative attribute based on a measure that minimizes the following function:

$$E(A) = -\sum_{i=1}^{V} \frac{S_i}{S} \sum_{j=1}^{N} \frac{k_{ji}}{S_i} \log_2 \frac{k_{ji}}{S_i}$$

where V is the number of values for attribute A, k_{ji} the number of examples in the jth category with the ith value for attribute A, S the total number of examples, S_i the number of examples with the ith value for attribute A, and N the number of categories.
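A minimal Python sketch of the ID3 measure E(A) follows. It assumes the examples have already been assigned to discrete categories (for cost estimation this would mean binning development cost, which is an assumption made here); the function name `expected_entropy` is hypothetical. Category counts of zero are skipped, since they contribute nothing to the sum.

```python
from collections import Counter, defaultdict
from math import log2


def expected_entropy(values, categories):
    """E(A) for one attribute.

    values[i] is example i's value for attribute A,
    categories[i] is the category (class) of example i.
    """
    total = len(values)  # S
    by_value = defaultdict(list)
    for value, category in zip(values, categories):
        by_value[value].append(category)

    entropy = 0.0
    for cats in by_value.values():
        s_i = len(cats)  # S_i: examples with the ith value of A
        # Counter only holds categories that occur, so every k_ji > 0.
        inner = sum((k / s_i) * log2(k / s_i) for k in Counter(cats).values())
        entropy -= (s_i / total) * inner
    return entropy


if __name__ == "__main__":
    # Hypothetical toy data: attribute 'language level' vs. a binned cost category.
    language_level = ["high", "high", "low", "low"]
    cost_category = ["cheap", "costly", "costly", "costly"]
    print(expected_entropy(language_level, cost_category))
```

An attribute that splits the examples into pure subsets scores 0, so ID3 would prefer it over an attribute that leaves mixed subsets.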

One problem with this type of learning model is that the set of relevant attributes must be identified beforehand. A neural network, with its ability to differentiate relevant from irrelevant attributes, offers a more flexible approach to estimating software development cost. Moreover, empirical evidence suggests that a learning procedure based on a neural network often outperforms decision trees in terms of prediction accuracy [17].

2.3.2. Neural network learning models

These models are built on networks of processing units called neurons that are arranged in layers and are connected to one another by restricted links (see [12]). Links between neurons have associated weights. Each neuron in the network computes a non-linear function of its inputs; these functions are called activation functions. The most common activation function is:

$$f = \frac{1}{1 + \exp\left(-\sum_i W_i I_i\right)}$$

where \sum_i W_i I_i is the weighted sum of the inputs I_i to the neuron and W_i is the weight on the link carrying input I_i. The resultant value is passed along to the next layer after being multiplied by the connecting weight. This process is repeated all the way from the input layer to the output layer. The goal here is to generate an accurate mapping between input (project attributes) and output (software development cost) patterns.
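The layer-by-layer computation described above can be sketched in Python as a forward pass through a small feed-forward network. This is an illustrative sketch only: the function names and the weight layout (one list of weights per neuron) are assumptions, no bias terms are included because the text does not mention them, and the plain `exp` call is not guarded against overflow for extreme inputs.

```python
from math import exp


def sigmoid(x):
    """Logistic activation: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + exp(-x))


def neuron_output(weights, inputs):
    """Apply the activation function to the weighted sum of the inputs."""
    return sigmoid(sum(w * i for w, i in zip(weights, inputs)))


def forward(layers, inputs):
    """Propagate inputs through the network.

    layers is a list of layers; each layer is a list of neurons;
    each neuron is the list of weights on its incoming links.
    """
    activations = list(inputs)
    for layer in layers:
        activations = [neuron_output(weights, activations) for weights in layer]
    return activations


if __name__ == "__main__":
    # Hypothetical 2 : 2 : 1 network with arbitrary weights.
    network = [
        [[0.5, -0.2], [0.1, 0.4]],  # hidden layer: two neurons
        [[1.0, -1.0]],              # output layer: one neuron
    ]
    print(forward(network, [1.0, 2.0]))
```

Because the logistic function returns values between 0 and 1, a cost target would need to be scaled into that range before training; the text does not say which scaling was used.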

Different learning procedures have been proposed to train the network to generate appropriate output patterns for corresponding input patterns. One of the most commonly used is called back-propagation, in which the weights are modified in such a way as to reduce the error between actual and correct outputs on sample patterns. The error is determined by comparing the network's actual output pattern with an a priori known output pattern. The difference or error between the two is 'back-propagated' through the net by modifying the weights (see [7, 13] for recent business applications of back-propagation neural networks).

Srinivasan and Fisher [18] point out that the performance of neural network approaches is very sensitive to configuration choices, such as the number of hidden units, the stopping criteria, and the initial weight settings. The appropriate settings of these choices can only be determined empirically. Thus, the manner in which the network should be trained is a concern. We give here a new approach that integrates neural network methods with cluster analysis to improve both training efficacy and network performance.

3. The approach

Our approach involves two phases: the first groups similar projects together by cluster analysis to facilitate the training of the neural network in the second phase. Cluster analysis is designed to identify similar objects in an n-dimensional space, where n is the number of descriptive attributes of the object. When applied to the problem domain of software development cost estimation, it is assumed that similar projects share similar development cost. The similarity among different projects, once computed, can then be used as a valuable piece of input information to enhance the training efficiency of the network.

3.1. Cluster analysis

Projects are grouped together into clusters based on a similarity measure termed a resemblance coefficient. Two kinds of coefficients are computed, depending on the types of attributes. For quantitative attributes, the average Euclidean or RMS distance between two projects in an n-dimensional space is used. It is defined as:

$$d_{jk} = \sqrt{\frac{\sum_{i=1}^{n} (x_{ij} - x_{ik})^{2}}{n}}$$

where d_{jk} is the average Euclidean distance between projects j and k, x_{ij} the value of project j's attribute i, x_{ik} the value of project k's attribute i, and n the number of quantitative attributes. The average Euclidean distance is, in fact, a measure of dissimilarity between two projects: the smaller the value of the coefficient, the more similar are the two projects.
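The average Euclidean distance is straightforward to compute; a short Python sketch follows. The function name is hypothetical, and the sketch assumes both projects are given as equal-length lists of quantitative attribute values that are already on comparable scales.

```python
from math import sqrt


def average_euclidean_distance(x_j, x_k):
    """RMS distance between two projects over their n quantitative attributes."""
    if len(x_j) != len(x_k) or not x_j:
        raise ValueError("projects need the same, non-zero number of attributes")
    n = len(x_j)
    return sqrt(sum((a - b) ** 2 for a, b in zip(x_j, x_k)) / n)


if __name__ == "__main__":
    # Two hypothetical projects described by four quantitative attributes.
    print(average_euclidean_distance([0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]))
```

Dividing by n before taking the square root keeps the coefficient comparable between data sets with different numbers of attributes.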

For nominal attributes, the Jaccard coefficient is used and is defined as:

$$C_{jk} = \frac{-1 \times N(1\text{-}1)}{2 \times N(\mathrm{Data}) - N(1\text{-}1)}$$

where C_{jk} is the Jaccard coefficient of projects j and k, N(1-1) the number of matches between projects j and k over all nominal attributes, and N(Data) the total number of nominal attributes. Like the average Euclidean distance, a smaller value of the Jaccard coefficient indicates a higher similarity between two projects. Since the two coefficients have different ranges of values, the two coefficients are converted to standard deviations using the standard score method before combining them.
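The following Python sketch illustrates one way to compute and combine the two coefficients. It rests on several assumptions that the text does not settle: the coefficient is read as C_jk = -N(1-1) / (2 N(Data) - N(1-1)), standard scores use the population standard deviation, and the two standardized coefficients are combined by simple addition. All function names are hypothetical.

```python
from statistics import mean, pstdev


def jaccard_coefficient(nominal_j, nominal_k):
    """Negated match-based coefficient: smaller means more similar.

    Assumed reading of the formula: -N(1-1) / (2 * N(Data) - N(1-1)).
    """
    n_data = len(nominal_j)
    if n_data == 0 or n_data != len(nominal_k):
        raise ValueError("projects need the same, non-zero number of attributes")
    matches = sum(a == b for a, b in zip(nominal_j, nominal_k))
    return -matches / (2 * n_data - matches)


def standard_scores(values):
    """Convert values to standard scores (population standard deviation assumed)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def combined_resemblance(distances, jaccards):
    """Add the standardized coefficients pair by pair (addition is an assumption)."""
    return [d + c for d, c in zip(standard_scores(distances), standard_scores(jaccards))]


if __name__ == "__main__":
    # Coefficients for three hypothetical project pairs.
    print(combined_resemblance([0.5, 2.0, 3.5], [-1.0, -0.2, 0.0]))
```

Under this reading the coefficient runs from -1 (all nominal attributes match) to 0 (none match), which agrees with the statement that smaller values indicate higher similarity.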

A tree is then constructed based on the combined resemblance coefficients by using a hierarchical clustering analysis technique, the unweighted pair-group method. This iteratively selects the two most similar objects to cluster into one new 'object' until all objects are clustered. Various ways of forming the clusters can be read off from the tree. Our strategy is to cut the tree at the point where the range of the resemblance coefficient is the highest, because a large range in the value of the resemblance coefficient indicates that the resulting clusters are well separated in the attribute space.
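A compact Python sketch of this clustering step follows. It assumes the unweighted pair-group method uses arithmetic averages (average linkage) and interprets 'the point where the range of the resemblance coefficient is the highest' as the largest gap between two consecutive merge heights; both are assumptions, and the function names are hypothetical. The input is a symmetric matrix of combined resemblance coefficients in which smaller values mean more similar projects.

```python
from itertools import combinations


def average_linkage(dist, a, b):
    """Mean pairwise coefficient between the members of clusters a and b."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))


def upgma(dist):
    """Merge the two closest clusters repeatedly.

    Returns the merge history as [(height, clusters_after_merge), ...].
    """
    clusters = [(i,) for i in range(len(dist))]
    history = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(dist, *pair))
        height = average_linkage(dist, a, b)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        history.append((height, list(clusters)))
    return history


def cut_at_largest_gap(dist):
    """Return the clustering that precedes the largest jump in merge height."""
    history = upgma(dist)
    if len(history) < 2:
        # Too few objects for a gap to exist: keep every object separate.
        return [(i,) for i in range(len(dist))]
    gaps = [history[i + 1][0] - history[i][0] for i in range(len(history) - 1)]
    best = max(range(len(gaps)), key=gaps.__getitem__)
    return history[best][1]


if __name__ == "__main__":
    # Four hypothetical projects placed at 0, 1, 10, and 11 on a line.
    points = [0.0, 1.0, 10.0, 11.0]
    matrix = [[abs(p - q) for q in points] for p in points]
    print(cut_at_largest_gap(matrix))
```

For larger data sets a library routine such as SciPy's hierarchical clustering would be the usual choice; the pure-Python version is kept here only to make each step visible.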

3.2. Neural network

In phase two, a neural network is first trained, based only on attributes given from the project description, to determine the appropriate settings for the following network configuration: (1) the number of neurons per layer; (2) the size and selection of training and testing data; and (3) the choice of the activation function. These network configuration parameters can only be determined empirically as different problem domains require different settings. Hence, the network is trained twice with the intent that the best configuration choice will be decided in the first round. Then the information from phase one (the cluster analysis) and the preliminary neural network is fed as input to a second round of neural network training to complete the task of software development cost estimation. Our experimental study indicates that the proposed approach can lead to improved network performance.

4. Experimental study

The approach was tested by using the COCOMO dataset. Based on a regression analysis of 63 projects, Boehm developed three forms of COCOMO: the basic, intermediate, and advanced. The basic model produces a base estimate of development effort using KLOC only; the intermediate model adds 15 qualitative cost drivers to improve the base estimate. These cost drivers are classified into four categories: software product attributes; computer attributes; personnel attributes; and project attributes, as shown in Table 3. The advanced model assesses the cost drivers at each development phase. In addition to the 15 cost drivers, the COCOMO dataset also has other attributes, giving a total of 39 descriptive project attributes. A complete list is included in Appendix A.


4.1. Cluster analysis

In phase one, 24 of the 39 attributes were selected as critical cost-determining factors for cluster analysis. These attributes are ones that are used in the intermediate and advanced models of COCOMO. The 63 projects were selected in six different ways (as shown in Table 4) to serve as data for cluster analysis.

These six ways of clustering were compared using a common set of testing data. Assuming that two projects sharing similar project attributes will have similar software development cost, each project in the testing set was matched with a cluster and its software development cost was estimated as the ranked-sum-mean of all the cost of the projects in that cluster. The error between the estimated cost and the actual cost could then be measured. The average percentage estimation error was then used as the basis for selecting the 'best' way of clustering all 63 projects of the COCOMO data. According to the results reported in Table 5, the projects should be clustered in ways suggested by using DATA25-3, which yields the lowest average error in testing. This additional clustering information, i.e., to which cluster each project belonged, was passed to phase two.
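The evaluation step can be sketched in Python as below. The text does not define the 'ranked-sum-mean', so this sketch substitutes a plain arithmetic mean of the cluster's costs as a stand-in; it also assumes the percentage error is taken as an absolute value relative to the actual cost. Both choices, and the function names, are assumptions made for illustration.

```python
from statistics import mean


def estimate_from_cluster(cluster_costs):
    """Estimate a project's cost from the costs of the projects in its cluster.

    Stand-in for the 'ranked-sum-mean': a plain arithmetic mean (assumption).
    """
    return mean(cluster_costs)


def average_percentage_error(estimates, actuals):
    """Mean of |estimate - actual| / actual, expressed in percent."""
    return mean(abs(e - a) / a * 100.0 for e, a in zip(estimates, actuals))


if __name__ == "__main__":
    # Hypothetical man-month figures, for illustration only.
    matched_clusters = [[100.0, 200.0, 300.0], [40.0, 60.0]]
    actual_costs = [250.0, 40.0]
    estimates = [estimate_from_cluster(costs) for costs in matched_clusters]
    print(average_percentage_error(estimates, actual_costs))
```

Computing this figure for each candidate clustering on the same testing set gives the kind of comparison summarized in Table 5.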

4.2. Neural network

The appropriate network configuration choices and the network's sensitivity to various input data and activation functions were then analyzed in a series of four experiments. The first determined the best network configuration among six different settings. The settings are denoted by three numbers of the form m : n : o, where m is the number of neurons in the input layer, n the number of neurons in the hidden layer, and o the number of neurons in the output layer. The same set of training and testing data (involving all 63 projects) was used. Of all the projects, 50 were randomly selected as training data and the remaining 13 were used for testing. All 39 project attributes could be used without screening, since the neural network approach has the ability to discern relevant attributes from irrelevant ones. The second experiment determined the sensitivity of the network towards input data by training the network using three different sets of training and testing data. The last combined the results of the first two to finalize the best setting of the network configuration. The results of each experiment are shown in Tables 6–8. The best configuration is found to be 20 : 15 : 1 trained by using 41 projects (34 for training and 7 for testing). The 41 projects are selected from the original 63 by eliminating extreme cases so that the range of the actual development effort is reduced from 11 400 to 440 man-months.

Table 3
Cost drivers in intermediate COCOMO

| Category | Cost drivers |
|---|---|
| Product attributes | Required software reliability; Database size; Product complexity |
| Computer attributes | Execution time constraint; Main storage constraint; Virtual machine volatility; Computer turnaround time |
| Personnel attributes | Analyst capability; Applications experience; Programmer capability; Virtual machine experience; Programming language experience |
| Project attributes | Modern programming practices; Use of software tools; Required development schedule |

Table 4
Six datasets for cluster analysis

| Dataset | Means of selection |
|---|---|
| DATA50 | Select 50 projects randomly |
| DATA34 | Select 41 projects, excluding extreme cases based on actual man-months |
| DATA25-1 | Select 25 projects from DATA34 randomly |
| DATA25-2 | Select 25 projects from DATA34 randomly |
| DATA25-3 | Select 25 projects from DATA34 randomly |
| DATA25-4 | Select 25 projects from DATA34 randomly |

Table 5
Performance comparison of the cluster analysis datasets

| Dataset | Average % error |
|---|---|
| DATA50 | 57% |
| DATA34 | 39% |
| DATA25-1 | 37% |
| DATA25-2 | 29% |
| DATA25-3 | 26% |
| DATA25-4 | 36% |


The cost estimates obtained from both the cluster analysis and preliminary network analysis were used as additional input attributes to train the neural network a second time using the best configuration choices found earlier. In other words, the 41 project cases (34 training and 7 testing) were used to train a neural network of configuration 20 : 15 : 1. The performance of the integrated network was compared to the one without integration by using four different sets of testing data. Significant improvement in network performance in terms of estimation accuracy was found in all four cases, as shown in Table 9.

5. Conclusion

We demonstrated in this paper that integrating a neural network with cluster analysis is a viable and promising approach to provide relatively accurate estimates of software development cost. By integrating a neural network with cluster analysis, one can increase the training efficacy of the network, resulting in a more accurate cost estimate than by using a pure neural network approach. The estimates are derived early on in the software development life cycle so that appropriate software project management and control can be exercised.

Acknowledgements

Dr. Balakrishnan's research is supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Table 6
Result of Experiment 1

| Network configuration (i : j : k) | Best average % error | No. of iterations |
|---|---|---|
| 10 : 0 : 1 | 495% | 100000 |
| 15 : 0 : 1 | 346% | 40000 |
| 20 : 0 : 1 | 494% | 1000 |
| 10 : 5 : 1 | 148% | 9000 |
| 15 : 10 : 1 | 238% | 5000 |
| 20 : 15 : 1 | 382% | 1000 |

Note: i is the number of neurons in the input layer, j the number of neurons in the hidden layer, and k the number of neurons in the output layer.

Table 7
Result of Experiment 2 (best average % error)

| Network configuration | 63 projects (50 training, 13 testing) | 47 projects (38 training, 9 testing) | 41 projects (34 training, 7 testing) |
|---|---|---|---|
| 10 : 0 : 1 | 495% | 77% | 281% |
| 15 : 0 : 1 | 346% | 321% | 186% |
| 20 : 0 : 1 | 494% | 92% | 101% |
| 10 : 5 : 1 | 148% | 78% | 62% |
| 15 : 10 : 1 | 238% | 65% | 42% |
| 20 : 15 : 1 | 382% | 109% | 36% |

Table 8
Result of Experiment 3

| Network configuration | Best average % error (41 projects) |
|---|---|
| 10 : 0 : 1 | 281% |
| 15 : 0 : 1 | 186% |
| 20 : 0 : 1 | 101% |
| 10 : 5 : 1 | 62% |
| 15 : 10 : 1 | 42% |
| 20 : 15 : 1 | 36% |
| 30 : 20 : 1 | 178% |
| 40 : 30 : 1 | 67% |
| 50 : 40 : 1 | 156% |

Table 9
Network performance comparisons on best average % error

| Testing cases | Pure NN | NN + cluster | Improvement |
|---|---|---|---|
| Set 1 | 36% | 32% | 12% |
| Set 2 | 37% | 23% | 37% |
| Set 3 | 62% | 30% | 51% |
| Set 4 | 52% | 35% | 33% |

Appendix A

Project attributes in the COCOMO dataset

1 Project type
2 Year developed
3 Programming languages
4 Required software reliability
5 Database size
6 Product complexity
7 Adaptation adjustment factor
8 Execution time constraint
9 Main storage constraint
10 Virtual machine volatility
11 Computer turnaround time
12 Type of computer used
13 Analyst capability
14 Project team experience
15 Programmer capability
16 Virtual machine experience
17 Programming language experience
18 Personnel continuity on project
19 Modern programming practices
20 Software tools
21 Required development schedule
22 Requirement volatility effort multipliers
23 Effort multipliers
24 Software development mode
25 Total delivered source instructions in thousands
26 Adjusted delivered source instructions in thousands
27 Nominal man-months
28 Intermediate estimated man-months
29 Percentage estimation error in man-months estimation
30 Project productivity
31 Estimated development time in months
32 Percentage estimation errors for months estimation
33 Detailed estimated man-months
34 Percentage estimation error for detailed estimated man-months
35 Normalized effort parameter
36 Basic estimated man-months
37 Basic estimation error ratio
38 Thousands of pages of project documentation
39 Pages of documentation per thousand source instruction

References

[1] J.W. Bailey, V.R. Basili, A meta-model for source development

resource expenditures, Proceedings of the Fifth International

Conference on Software Engineering, 1981, pp. 107±116.

[2] F. Bergeron, J. St-Arnaud, Estimation of information systems

development efforts: Pilot study, Information and Manage-

ment 22(4), 1992, pp. 239±254.

[3] B.W. Boehm, Software Engineering Economics, Englewood

Cliffs, Prentice-Hall, NJ, 1981.

[4] B.W. Boehm, P.N. Papaccio, Understanding and controlling

software costs, IEEE Transactions on Software Engineering

14(10), 1988, pp. 1462±1477.

[5] F.P. Brooks, The Mythical Man-Month: Essays on Software

Engineering, Reading, Addison-Wesley, MA, 1982.

[6] R.D. Ermick, In search of a better metric for measuring

productivity of application development, Proceedings of

Function Point Users Group Conference, 1987.

[7] D. Fletcher, E. Goss, Forecasting with neural networks: An

application using bankruptcy data, Information and Manage-

ment 24(3), 1993, pp. 159±167.

[8] J.R. Herd, J.N. Postak, W.E. Russel, K.R. Stewart, Software

Cost Estimation Study-Study Result. Technical Report

RADC-TR-77-220, Doty Associates, Inc., Rockville, MD,

1977.

[9] P. Langley, H.A. Simon, Applications of Machine Learning

and Rule Induction, Communications of the ACM 38(11)

55±64.

[10] L.A. Laranjeira, Software size estimation of object-oriented systems, IEEE Transactions on Software Engineering 16(5), 1990, pp. 510–522.

[11] W.E. Lehder Jr., D.P. Smith, W.D. Yu, Software estimation technology, AT&T Technical Journal, 1988, pp. 10–18.

[12] E.Y. Li, Artificial neural networks and their business applications, Information and Management 27, 1994, pp. 303–313.

[13] R.W. Lodewyck, P.S. Deng, Experimentation with a back-propagation neural network: An application to planning and user system development, Information and Management 24(1), 1993, pp. 1–9.

[14] G.C. Low, D.R. Jeffery, Function points in the estimation and evaluation of the software process, IEEE Transactions on Software Engineering 16(1), 1990, pp. 64–71.

[15] A. Porter, R. Selby, Empirically-guided software development using metric-based classification tree, IEEE Software 7(5), 1990, pp. 46–54.

[16] R. Selby, A. Porter, Learning from examples: Generation and evaluation of decision trees for software resource analysis, IEEE Transactions on Software Engineering 14, 1988, pp. 1743–1757.

[17] J.W. Shavlik, R.J. Mooney, G.G. Towell, Symbolic and neural learning algorithms: An experimental comparison, Machine Learning 6(2), 1991, pp. 111–143.

[18] K. Srinivasan, D. Fisher, Machine learning approaches to estimating software development effort, IEEE Transactions on Software Engineering 21(2), 1995, pp. 126–136.

[19] K.Y. Tam, Automated construction of knowledge-bases from examples, Information Systems Research 1(2), 1990, pp. 144–167.

[20] C.E. Walston, C.P. Felix, A method of programming measurement and estimation, IBM Systems Journal 16(1), 1977, pp. 54–73.

[21] Y. Yoon, T. Guimaraes, Selecting expert system development techniques, Information and Management 24(4), 1993, pp. 209–223.

Anita Lee is an Associate Professor in the Decision Science and Information Systems area at the University of Kentucky. She received her Ph.D. in Business Administration from the University of Iowa in 1990. Her research interests include artificial intelligence, machine learning, knowledge-based systems, computer integrated manufacturing, and group technology. She has published extensively in refereed journals, including Annals of Operations Research, Expert Systems, IEEE Expert, and International Journal of Production Research. She is currently an associate editor for the Journal of Database Management.

Chun Hung Cheng obtained his Ph.D. in Business Administration from the University of Iowa and started his teaching career at Kentucky State University. He returned to Hong Kong in 1994 and is now an Associate Professor at the Chinese University of Hong Kong. He conducts research in Information Systems and Operations Management. His research articles have appeared in journals including Annals of Operations Research, Expert Systems, Expert Systems with Applications, IEEE Transactions on Systems, Man, and Cybernetics, IIE Transactions, International Journal of Production Research, and Operations Research.

Jaydeep Balakrishnan is currently Associate Professor of Operations Management in the Faculty of Management at the University of Calgary. He has a Ph.D. from Indiana University and an MBA from the University of Georgia, both in Operations Management. His undergraduate degree is in Mechanical Engineering from Nagpur University in India. He has also worked for the automobile industry in India. Dr. Balakrishnan's research interests include facility layout. He has published in journals including Management Science, The European Journal of Operational Research, and OMEGA. He has also presented papers at various international conferences. During 1995–96 he was a Visiting Scholar at the Chinese University of Hong Kong.
