1 The Distribution of Faults in a Large Industrial Software System Thomas Ostrand Elaine Weyuker AT&T Labs -- Research Florham Park, NJ


The Distribution of Faults in a Large Industrial Software System

Thomas Ostrand
Elaine Weyuker

AT&T Labs -- Research
Florham Park, NJ


The $59 Billion* Question

• Can we predict where severe bugs are likely to be?

• Can we identify characteristics of code units, files, or modules that indicate a higher probability of bugs? i.e., can we characterize the fault-proneness of code units?

*The Economic Impacts of Inadequate Infrastructure for Software Testing, NIST, May 2002


Software Engineering Folklore

• A relatively small number of modules have most of the faults

• If a relatively small number of modules have most faults, the reason is that they also contain most of the code

• Large modules are buggier than small ones

• Buggy early → Buggy late (inherently bad modules)

• New code is {more, less} buggy than old code

• Code metrics are {good, bad} predictors of code quality


The Case Study

We studied the fault database of a
• large
• currently in-use
• under continuing development
inventory tracking system.


System Information

• System type: Inventory tracking

• Lifespan: First release ~1998. Subsequent releases every 3 months.

• Development stages: requirements, design, development, unit testing, integration testing, system testing, beta release, controlled release, and general release.

• Code: About ¾ of the files are in Java, with smaller numbers written in shell script, makefiles, XML, HTML, Perl, C, SQL, and awk.

• Fault data studied from 13 successive releases.


Predicting Fault Proneness: Possible Factors

• File Size: Is the density of faults found in a file related to the file’s size?

• Faults found during early development stages: Does the number of faults found in early stages predict the number that will be found in later stages?

• Faults found during early releases: Does the number of faults found in early releases predict the number that will be found in later releases?

• Age of file: Are new files more likely to have faults than files that existed in earlier releases?
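The first factor depends on fault density, which the later charts report as faults per KLOC. A minimal sketch of that calculation follows; the file names and counts are hypothetical and are not taken from the studied system.

```python
def fault_density(faults: int, loc: int) -> float:
    """Return faults per thousand lines of code (faults/KLOC)."""
    if loc <= 0:
        raise ValueError("loc must be positive")
    return faults * 1000.0 / loc


# Hypothetical files (not from the case study): (name, lines of code, faults)
files = [("Small.java", 200, 2), ("Large.java", 4000, 8)]

densities = {name: fault_density(faults, loc) for name, loc, faults in files}
for name, density in densities.items():
    print(f"{name}: {density:.1f} faults/KLOC")
```

In this made-up example the larger file has more faults in total but a lower density, which is the distinction the file-size question is probing.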

Number of Files

[Figure: number of files (y-axis, 0 to 1800) by release number (x-axis, Releases 1 to 13)]

Size of System (KLOCs, including comments)

[Figure: system size in KLOCs (y-axis, 0 to 500) by release number (x-axis, Releases 1 to 12)]

Number of files and number of faults detected in each release

[Figure: number of files and faults (y-axis, 0 to 1800) by release number (x-axis, Releases 1 to 13); series: Files, Faults]

Size of System (KLOCs) and Number of Faults

[Figure: KLOCs and number of faults (y-axis, 0 to 1000) by release number (x-axis, Releases 1 to 12); series: KLOC, Faults]


Distribution of Faults over Files: Percent of Faulty Files

[Figure: percent of files that have faults (y-axis, 0 to 40) by release number (x-axis, Releases 1 to 13); the chart also carries the label "Number of files in release"]

Distribution of Faults over Files: Number of Faulty Files

[Figure: number of files that have faults (y-axis, 0 to 250) by release number (x-axis, Releases 1 to 13)]


Concentration of Faults in Rel 12

[Figure: percent (y-axis, 0 to 100) against percent of files (x-axis, ticks shown from 0 to 8); series: Percent of Faults, Percent of Code Size]
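A concentration chart of this kind can be approximated by ranking files by fault count and asking what share of the faults, and of the code, sits in the top few percent of files. The sketch below assumes that ranking and uses hypothetical data; the function name `concentration` and the numbers are illustrative, not the authors' method or results.

```python
def concentration(files, top_percent):
    """files: list of (faults, loc) pairs, one per file.

    Returns (percent of all faults, percent of all code) contained in the
    top `top_percent` of files when ranked by decreasing fault count.
    """
    ranked = sorted(files, key=lambda f: f[0], reverse=True)
    k = max(1, round(len(ranked) * top_percent / 100))
    top = ranked[:k]
    total_faults = sum(faults for faults, _ in ranked)
    total_loc = sum(loc for _, loc in ranked)
    if total_faults == 0 or total_loc == 0:
        raise ValueError("need at least one fault and one line of code")
    pct_faults = 100 * sum(faults for faults, _ in top) / total_faults
    pct_code = 100 * sum(loc for _, loc in top) / total_loc
    return pct_faults, pct_code


# Hypothetical release with 10 files: (faults, lines of code)
example = [(40, 2000), (30, 1500), (10, 500), (5, 1000), (5, 1000),
           (4, 1000), (3, 1000), (2, 500), (1, 500), (0, 1000)]

result = concentration(example, 20)
print(result)  # (70.0, 35.0)
```

Here the top 20% of files hold 70% of the faults but only 35% of the code, so in this invented data set size alone would not explain the concentration.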


Fault Density, All Releases (restricted to files with at least one fault)

[Figure: faults/KLOC (y-axis, 0 to 600) against size of file (x-axis, 0 to 8000)]

Fault Density vs. File Size for Release 12

[Figure: faults/KLOC (y-axis, 0 to 90) against size of file (x-axis, 0 to 9000)]

Fault density vs. file size (Fenton & Ohlsson, TSE 1997)

Fault density vs. module size (Hatton, Software 1997)



Does the number of faults found in early development stages predict the number that will be found in later stages?

Does the number of faults found in early releases predict the number that will be found in later releases?

Faults by Development Stages

[Figure: number of faults (y-axis, 0 to 800) by release number (x-axis, Releases 1 to 13); series: Early Pre: Dev & Unit Test, Late Pre: Int & Sys Test, Post: Beta & Gen Release]


Once Faulty, Always Faulty? From Stage to Stage

In every release, ALL of the post-release faults were found in files whose contribution to the integration and system test faults was relatively low (6% - 28%).

In other words, 72% - 94% of the late pre-release faults were in files that had NO post-release faults.

Fenton & Ohlsson observed similar results.


Once Faulty, Always Faulty? From Stage to Stage

However, across all the releases, there were only a total of 128 post-release faults. No release had more than 20 post-release faults, and half had no more than 10 post-release faults.

Not enough data to draw meaningful conclusions.


Number of Post-Release Faults

Release   in Files with 0        in Files with 1
          Late Pre-Rel Faults    Late Pre-Rel Faults
1         0                      0
2         1                      0
3         0                      0
4         3                      1
5         9                      4
6         8                      5
7         18                     0
8         5                      3
9         5                      5
10        7                      5
11        10                     8
12        14                     6
13        7                      4
Total     87                     41
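As a consistency check, the rows of the table above can be summed and compared with the totals quoted on the preceding slide. The sketch below only re-adds the published numbers; the variable names are illustrative.

```python
# Rows copied from the table above:
# release -> (post-release faults in files with 0 late pre-release faults,
#             post-release faults in files with 1 late pre-release faults)
post_release = {
    1: (0, 0), 2: (1, 0), 3: (0, 0), 4: (3, 1), 5: (9, 4), 6: (8, 5),
    7: (18, 0), 8: (5, 3), 9: (5, 5), 10: (7, 5), 11: (10, 8),
    12: (14, 6), 13: (7, 4),
}

col_zero = sum(a for a, _ in post_release.values())
col_one = sum(b for _, b in post_release.values())
per_release = {r: a + b for r, (a, b) in post_release.items()}
at_most_ten = sum(1 for total in per_release.values() if total <= 10)

print(col_zero, col_one, col_zero + col_one)  # 87 41 128
print(max(per_release.values()))              # 20
print(at_most_ten)                            # 6
```

The column sums match the Total row (87 and 41) and add up to the 128 post-release faults stated earlier. No release exceeds 20, and 6 of the 13 releases have no more than 10.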

Faults by Stages

[Figure: number of faults (y-axis, 0 to 800) by release number (x-axis, Releases 1 to 13); series: Early Pre, Late Pre]

Faults by Stages (Sorted by Decreasing Number of Early Pre-release Faults)

[Figure: number of faults (y-axis, 0 to 800) by release number; x-axis order: 1, 3, 8, 9, 4, 6, 5, 10, 12, 11, 2, 7, 13; series: Early Pre, Late Pre]


Once Faulty, Always Faulty? From Release to Release

High-fault files of a release: Top 20% of files ordered by decreasing number of faults.

Over all releases, roughly 35% of these files were also high-fault files in the preceding and/or succeeding releases.

For Release 12, more than 40% of its high-fault files were also high-fault files in Release 1.
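The high-fault definition above can be sketched as follows. This is one reading, under stated assumptions: the top 20% is taken over all files passed in, ties are broken by input order, and the helper names (`high_fault_files`, `persistence`) and the data are hypothetical. The source does not say how the authors handled ties or which files were included in the ranking.

```python
def high_fault_files(fault_counts, percent=20):
    """Return the top `percent` of files, ranked by decreasing fault count.

    fault_counts maps file name -> number of faults in one release.
    Assumption: ties are broken by input order (Python's sort is stable).
    """
    ranked = sorted(fault_counts, key=lambda name: fault_counts[name],
                    reverse=True)
    k = max(1, len(ranked) * percent // 100)
    return set(ranked[:k])


def persistence(this_release, other_release, percent=20):
    """Percent of this release's high-fault files that are also
    high-fault in the other release."""
    current = high_fault_files(this_release, percent)
    other = high_fault_files(other_release, percent)
    return 100 * len(current & other) / len(current)


# Hypothetical fault counts for the same 10 files in two releases
rel_a = {"a": 9, "b": 7, "c": 1, "d": 0, "e": 0,
         "f": 0, "g": 0, "h": 0, "i": 0, "j": 0}
rel_b = {"a": 5, "b": 0, "c": 8, "d": 1, "e": 0,
         "f": 0, "g": 0, "h": 0, "i": 0, "j": 0}

share = persistence(rel_a, rel_b)
print(share)  # 50.0
```

In this invented example, files a and b are high-fault in the first release, c and a in the second, so half of the first release's high-fault files persist.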

Persistence of high fault count between releases

[Figure: percent of high-fault files in this release that are high-fault in the previous/next release (y-axis, 0 to 70) by release (x-axis, Releases 1 to 12); series: High-fault in next release, High-fault in previous release]



Old Files and New Files

For Release n:

an old file is a file that existed in some Release i < n, and still exists in Release n.

a new file is a file that did not exist in any Release i < n, and is in Release n.
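These two definitions translate directly into set operations. The sketch below assumes each release is represented as a set of file names; the function name `classify` and the sample releases are hypothetical.

```python
def classify(releases, n):
    """Split the files of Release n into (old, new).

    releases[i - 1] is the set of file names present in Release i.
    old: existed in some Release i < n and still exists in Release n.
    new: did not exist in any Release i < n and is in Release n.
    """
    current = releases[n - 1]
    earlier = set().union(*releases[:n - 1])  # every file seen before Release n
    return current & earlier, current - earlier


# Hypothetical system with three releases
releases = [{"A", "B"}, {"A", "C"}, {"B", "C", "D"}]

old, new = classify(releases, 3)
print(sorted(old), sorted(new))  # ['B', 'C'] ['D']
```

Note that file B is absent from Release 2 but still counts as old in Release 3, because the definition only requires that it existed in some earlier release.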


Are new files more likely to have faults than files that existed in earlier releases?

Do new files have higher fault density than old files?
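The two questions correspond to two different measures per group of files: the percent of files containing any fault, and faults per KLOC taken over all files in the group. A minimal sketch with hypothetical data follows; `group_stats` is an illustrative name and the figures are not from the study.

```python
def group_stats(files):
    """files: list of (loc, faults) pairs for one group (old or new).

    Returns (percent of files with at least one fault,
             faults per KLOC over all files in the group).
    """
    if not files:
        raise ValueError("group is empty")
    faulty = sum(1 for _, faults in files if faults > 0)
    total_loc = sum(loc for loc, _ in files)
    total_faults = sum(faults for _, faults in files)
    return 100 * faulty / len(files), 1000 * total_faults / total_loc


# Hypothetical groups for one release: (lines of code, faults)
old_files = [(1000, 0), (2000, 1), (1000, 0), (1000, 1)]
new_files = [(500, 1), (500, 2)]

old_stats = group_stats(old_files)
new_stats = group_stats(new_files)
print(old_stats)  # (50.0, 0.4)
print(new_stats)  # (100.0, 3.0)
```

The density is computed over the whole group, including files with no faults, which differs from the earlier density charts that were restricted to files with at least one fault.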

Old Files, New Files: Percent Containing (any number of) Faults

[Figure: percent containing faults (y-axis, 0 to 40) by release number (x-axis, Releases 2 to 12); series: Old, New]

Old Files, New Files: Faults/KLOC over all files of the release

[Figure: faults/KLOC (y-axis, 0 to 6) by release number (x-axis, Releases 2 to 12); series: Old, New]


Summary of Fault Proneness Observations

• Faults are concentrated in a relatively small number of files, and become more heavily concentrated as the system matures.

• Large files do not generally have higher fault density than small files; the opposite seems to be true.

• Files with high fault counts during pre-release do not generally have high fault counts during post-release.


Fault Proneness Observations

• Files with the largest numbers of faults in an early release seem more likely to have large numbers of faults in the next release and in later releases.

• Newly written files are more likely to be faulty than old files, and to have higher fault density than old files.


Continuing Work

• Study additional systems

• Statistical analysis

• Study relation between bugs in successive releases (Do persistently high-fault files have related bugs in successive releases?)

• Do the numbers change if we calculate them for different levels of fault severity?

• Are code metrics good predictors of faults?