IMPROVEMENT OF SOFTWARE MAINTENANCE AND RELIABILITY USING DATA MINING TECHNIQUES


Vol 01, Issue 02, December 2012. International Journal of Data Mining Techniques and Applications. http://iirpublications.com ISSN: 2278-2419

Integrated Intelligent Research (IIR) 101


YETHIRAJ N G
Assistant Professor, Department of Computer Science
Maharani's Science College for Women, Bangalore, India

Abstract

Software is ubiquitous in our daily life. It brings us great convenience, but also a big headache over software reliability: software is never bug-free, and software bugs keep incurring monetary loss or even catastrophes. In the pursuit of better reliability, software engineering researchers have found that huge amounts of data in various forms can be collected from software systems, and that these data, when properly analyzed, can help improve software reliability. Unfortunately, the huge volume of complex data renders simple analysis techniques inadequate; consequently, studies have been resorting to data mining for more effective analysis. In the past few years, we have witnessed many studies on mining for software reliability reported in data mining as well as software engineering forums. These studies either develop new data mining techniques or apply existing ones to tackle reliability problems from different angles. In order to keep data mining researchers abreast of the latest developments in this growing research area, we propose this paper on data mining for software reliability. In this paper, we present a comprehensive overview of this area, examine representative studies, and lay out challenges to data mining researchers.

Key words: Software, Software Reliability, Data Mining, Frequent Item Set, Extracting Rules.

1. Introduction

The economies of all developed nations are dependent on software, and more and more systems are software controlled. Software engineering is an engineering discipline concerned with all aspects of software production, including the theories, methods and tools for professional software development. Software engineers should adopt a systematic and organized approach to their work and use appropriate tools and techniques depending on the problem to be solved, the development constraints and the resources available.

Software reliability, unlike many other quality factors, can be measured directly and estimated using historical and developmental data [1]. Software reliability is defined in statistical terms as the probability of failure-free operation of a computer program in a specified environment for a specified time.

Measures of reliability: for a computer-based system, a simple measure of reliability is mean-time-between-failure (MTBF), where MTBF = MTTF + MTTR. The acronyms MTTF and MTTR stand for mean-time-to-failure and mean-time-to-repair respectively [2].

Software reliability specification: reliability is a complex concept that should always be considered at the system level rather than the individual component level. Because the components in a system are interdependent, a failure in one component can propagate through the system and affect the operation of other components. In a computer-based system, we have to consider three dimensions when specifying the overall system reliability:

(i) Hardware reliability: What is the probability of a hardware component failing, and how long would it take to repair that component?
(ii) Software reliability: How likely is it that a software component will produce an incorrect output? Software failures differ from hardware failures in that software does not wear out; it can continue operating correctly after producing an incorrect result.
(iii) Operator reliability: How likely is it that the operator of a system will make an error? [1]
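
As a small numerical illustration of the relation MTBF = MTTF + MTTR, the following sketch computes the three quantities from a failure log. The function name and the sample figures are assumptions made purely for illustration; they are not taken from the cited texts.

```python
from statistics import mean


def mtbf(uptimes_hours, repair_hours):
    """Return (MTTF, MTTR, MTBF) from observed intervals.

    uptimes_hours: lengths of the failure-free intervals (hypothetical data)
    repair_hours:  lengths of the repair intervals that followed each failure
    """
    mttf = mean(uptimes_hours)      # mean-time-to-failure
    mttr = mean(repair_hours)       # mean-time-to-repair
    return mttf, mttr, mttf + mttr  # MTBF = MTTF + MTTR


# Hypothetical log: three failure-free runs and the repair after each one.
mttf, mttr, total = mtbf([100.0, 140.0, 120.0], [2.0, 4.0, 3.0])
```

With these sample figures, MTTF is 120 hours, MTTR is 3 hours, and MTBF is therefore 123 hours.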

The following basic terms are frequently used when discussing reliability:

System Failure: occurs when the system does not perform as per the user's expectations.
System Error: occurs when the system gives its result in an unexpected manner.
System Fault: the probability that a failure of the system leads to a system error.
Human Error: a human activity that causes a system fault to occur.

2. Mining Software Engineering Data

The main goal is to transform static, record-keeping software engineering data into active data so that hidden patterns and trends can be explored.

Why Reliability?

Software is normally full of bugs. Windows 2000, containing 35 million lines of code, had 63,000 known bugs at the time of release, about 2 per 1000 lines. Software failure costs are becoming very high: a study by the National Institute of Standards and Technology found that software errors cost the U.S. economy about $59.5 billion annually. Testing and debugging are also laborious and expensive: "50% of my company employees are testers, and the rest spends 50% of their time testing!" (Bill Gates, 1995). Software is, in general, complex; for example, MySQL has 1.2 million lines of code (LOC), and its runtime data is larger and more complex still. Finding bugs is therefore challenging: it requires specifications or properties, which often don't exist, as well as substantial human effort in analysing data [3].

Software reliability methods are:
(i) Static bug detection: detect bugs in the code without running it.
(ii) Dynamic bug detection (a.k.a. testing): run the code with some test inputs and detect failures/bugs.
(iii) Debugging: given known test failures (symptoms), pinpoint the bug locations in the code.

Mining for software reliability is needed because:
i. Finding bugs is challenging. It requires specifications/properties, which often don't exist, and substantial human effort in analyzing data.
ii. We can mine common patterns as likely specifications/properties, and detect violations of those patterns as likely bugs.
iii. We can mine huge volumes of data for patterns or locations to narrow down the scope of human inspection. E.g., code locations or predicates that are covered more in failing runs and less in passing runs may be suspicious bug locations.
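
The idea in the last point, that locations covered more often in failing runs than in passing runs are suspicious, can be sketched as a simple scoring function. This is a minimal illustration under stated assumptions: the scoring formula (failing coverage rate divided by the sum of failing and passing coverage rates), the function name, and the sample coverage data are all choices made here, not details given in the paper.

```python
def suspiciousness(coverage, outcomes):
    """Score each code location by how much more often it is covered in
    failing runs than in passing runs.

    coverage: list of sets, one per run, holding the locations that run covered
    outcomes: list of booleans, True if the corresponding run passed
    Returns {location: score in [0, 1]}; higher means more suspicious.
    """
    n_fail = sum(1 for ok in outcomes if not ok)
    n_pass = sum(1 for ok in outcomes if ok)
    scores = {}
    for loc in set().union(*coverage):
        fail_cov = sum(1 for c, ok in zip(coverage, outcomes) if not ok and loc in c)
        pass_cov = sum(1 for c, ok in zip(coverage, outcomes) if ok and loc in c)
        fail_rate = fail_cov / n_fail if n_fail else 0.0
        pass_rate = pass_cov / n_pass if n_pass else 0.0
        total = fail_rate + pass_rate
        scores[loc] = fail_rate / total if total else 0.0
    return scores


# Hypothetical coverage of four test runs over three locations.
runs = [{"L1", "L2"}, {"L1", "L3"}, {"L1", "L2", "L3"}, {"L1", "L3"}]
passed = [True, False, True, False]
scores = suspiciousness(runs, passed)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Here L3 is covered by both failing runs but only one passing run, so it is ranked first for human inspection; L2 is never covered by a failing run and is ranked last.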

3. Techniques

The software engineering tasks helped by data mining are (i) programming, (ii) defect detection, (iii) testing, (iv) debugging and (v) maintenance. The data mining techniques are (i) classification, (ii) association, (iii) pattern detection and (iv) clustering [4].

The software engineering data considered are (i) code bases, (ii) change history, (iii) program states, (iv) structural entities and (v) bug reports [5].

4. Analysis

Data mining for software bug detection relies on frequent pattern mining. Automated debugging in software programs then proceeds from frequent patterns to software bugs and to statistical debugging. Further, automated debugging in computer systems covers (i) automated diagnosis of system misconfigurations and (ii) performance debugging [6].

Software Bug Detection

Common approach: mine rules/patterns from source code or revision histories, and detect bugs as rule/pattern violations.

Mining rules from source code
i. Bugs as deviant behaviour [Engler et al., SOSP'01]
ii. Mining programming rules with PR-Miner [Li et al., FSE'05]
iii. Mining function precedence protocols [Ramanathan et al., ICSE'07]
iv. Revealing neglected conditions [Chang et al., ISSTA'07]

Mining rules from revision histories
i. DynaMine [Livshits & Zimmermann, FSE'05]

Mining copy-paste patterns from source code
i. CP-Miner [Li et al., OSDI'04] to find copy-paste bugs [7].

Bugs as Deviant Behaviour

Static verification tools need rules to check against program code. The aim here is to find errors without knowing the truth, in two ways:

(i) Contradiction in belief. To find lies, cross-examine one witness or many witnesses; any contradiction is an error (internal consistency).
(ii) Deviation from common behaviour. To infer correct behaviour: if 1 person does X, it might be right or a coincidence; if 1000s do X and 1 does Y, Y is probably an error (statistical analysis).

Crucially, we know that a contradiction is an error without knowing the correct belief.
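
The "deviation from common behaviour" argument can be made concrete with a small ranking sketch. It assumes that, for each candidate belief, we have already counted how many cases exist and how many conform. The ranking uses a standard z-test for a proportion against an assumed baseline p0 = 0.9; this statistic, the baseline value, and the belief names are illustrative assumptions, not details stated in this paper.

```python
from math import sqrt


def z_statistic(n, e, p0=0.9):
    """z-score for observing e conforming cases out of n, against an
    assumed baseline conformance probability p0 (an illustrative choice)."""
    return (e / n - p0) / sqrt(p0 * (1 - p0) / n)


def rank_deviations(observations, p0=0.9):
    """observations: {belief: (n_total, n_conforming)}.

    Returns (z, belief, number_of_deviations) tuples for beliefs that have
    at least one deviation, strongest statistical evidence first.
    """
    candidates = [
        (z_statistic(n, e, p0), belief, n - e)
        for belief, (n, e) in observations.items()
        if 0 < e < n  # the belief is held somewhere and broken somewhere
    ]
    return sorted(candidates, reverse=True)


# Hypothetical counts of how often the result of each call is checked.
observed = {
    "check result of alloc_buf": (1001, 1000),  # 1000s do X, 1 does Y
    "check result of get_flag": (2, 1),         # too few cases to judge
}
ranking = rank_deviations(observed)
```

The single deviation among 1001 cases ranks far above the one-in-two case, mirroring the reasoning that a lone exception to a widely followed behaviour is probably an error while a lone observation may be coincidence.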

5. A brief methodology: Software Bug Detection

Based on the discussion in the previous section, the following steps for software bug detection are presented.

Step 1: Mining rules from source code [8]
i. Bugs as deviant behaviour [Engler et al., SOSP'01]. Mining technique: statistical analysis.
ii. Mining programming rules with PR-Miner [Li et al., FSE'05]
iii. Mining function precedence protocols [Ramanathan et al., ICSE'07]
iv. Revealing neglected conditions [Chang et al., ISSTA'07]

Step 2: Mining copy-paste patterns from source code
i. CP-Miner [Li et al., OSDI'04] to find copy-paste bugs

An Overview of Extracting Rules

Observation: related program elements are usually used together.

Idea: find associations among elements that are frequently used together in source code, which implies frequent item set mining [9].

Example: spin_lock_irqsave and spin_unlock_irqrestore appear together within the same function more than 3600 times.

Fig. 1: Flowchart of extracting rules. Source files -> pre-processing (parsing & hashing) -> item sets -> mining -> programming patterns -> post-processing (generating rules) -> programming rules.

Step 3: Mining Programming Patterns and Generation of Rules

Parsing source code. Purpose: building an item set database. Each element (function call, variable, data type, etc.) is mapped to a number, so that the source code is mapped to an item set database.
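
A minimal sketch of this mapping step is shown below. It assumes the source has already been parsed into a list of element names per function, and it assigns sequential integers rather than hash values; both simplifications, as well as the function names scsi_probe and scsi_attach, are assumptions made for illustration.

```python
def build_itemset_db(functions):
    """Map each program element to an integer and each function to an item set.

    functions: {function_name: iterable of element names used in its body}
    Returns (db, ids): db is a list of frozensets of integers (one per
    function, in sorted function-name order); ids maps element name -> integer.
    """
    ids = {}
    db = []
    for name in sorted(functions):
        items = set()
        for element in functions[name]:
            # Assign the next free number the first time an element is seen.
            items.add(ids.setdefault(element, len(ids)))
        db.append(frozenset(items))
    return db, ids


# Hypothetical pre-parsed functions and the elements each one uses.
parsed = {
    "scsi_probe": ["Scsi_Host", "host_alloc", "add_host", "scan_host"],
    "scsi_attach": ["Scsi_Host", "host_alloc", "add_host"],
}
db, ids = build_itemset_db(parsed)
```

Each function becomes one row of the database, so elements that appear in the same function end up in the same item set.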

A frequent item set mining algorithm is then applied to the item set database; each frequent sub-item set corresponds to a programming pattern.

E.g., {39, 68, 36, 92}:27 corresponds to the pattern {Scsi_Host, host_alloc, add_host, scan_host}, which occurs 27 times.
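
To show what "frequent sub-item set" means operationally, the sketch below counts every sub-item set by brute-force enumeration. This is not the mining algorithm used by PR-Miner; it is an exponential-time illustration that is only practical for tiny databases. The database is a hypothetical one constructed to reproduce the support count of 27 quoted in the example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(db, min_support):
    """Count every non-empty sub-item set of every transaction and keep
    those that occur in at least min_support transactions.

    Exponential in transaction size, so suitable only for small examples.
    """
    counts = Counter()
    for transaction in db:
        items = sorted(transaction)
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                counts[frozenset(subset)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


# Hypothetical database: 27 functions use all four elements,
# 2 more use only the first three.
db = [frozenset({39, 68, 36, 92})] * 27 + [frozenset({39, 68, 36})] * 2
patterns = frequent_itemsets(db, min_support=27)
```

In this constructed database the four-element set has support 27, while its three-element subset {39, 68, 36} has support 29.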

Tradeoff: whether or not to consider the order of elements.

Step 4: Generating Programming Rules

Programming patterns are turned into programming rules. E.g., given the patterns

{a, b, d} : 3
{a} : 4

the following rules are generated, where the confidence of a rule X => Y is support(X ∪ Y) / support(X):

{a} => {b,d} with confidence = 3/4 = 75%
{b} => {a,d} with confidence = 100%
{d} => {a,b} with confidence = 100%
{a,b} => {d} with confidence = 100%
{a,d} => {b} with confidence = 100%
{b,d} => {a} with confidence = 100%
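
The rule list above can be reproduced with a short sketch. The support counts of 3 for {b}, {d}, {a,b}, {a,d} and {b,d} are not stated explicitly in the example; they are inferred here from the 100% confidences, and the function name is hypothetical.

```python
from itertools import combinations


def generate_rules(pattern, support, min_confidence=0.0):
    """Derive rules lhs => rhs from one frequent pattern.

    pattern: frozenset of items
    support: {frozenset: support count}; must contain the pattern and
             every left-hand side that should be considered
    Returns a list of (lhs, rhs, confidence) tuples.
    """
    rules = []
    for size in range(1, len(pattern)):
        for lhs in map(frozenset, combinations(sorted(pattern), size)):
            if lhs not in support:
                continue
            # confidence(lhs => rhs) = support(lhs and rhs) / support(lhs)
            confidence = support[pattern] / support[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, pattern - lhs, confidence))
    return rules


support = {
    frozenset("abd"): 3, frozenset("a"): 4,
    frozenset("b"): 3, frozenset("d"): 3,
    frozenset("ab"): 3, frozenset("ad"): 3, frozenset("bd"): 3,
}
rules = generate_rules(frozenset("abd"), support)
confidence_of = {tuple(sorted(lhs)): conf for lhs, _, conf in rules}
```

Six rules result from the three-element pattern; only {a} => {b,d} has confidence below 100%, because {a} occurs once without b and d.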

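Rule Explosion Problem

A pattern can generate an exponential number of rules. Solution: closed mining, which keeps only item sets that have no superset with the same support.

Example: given {a,b,d}:3 and {a}:4, the item sets {a,b}:3, {a,d}:3 and {b,d}:3 are not closed.

Closed item sets: {a,b,d}:3 | {a}:4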
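
The closedness test in this example can be sketched as a filter over a table of support counts. This is only a post-hoc filter for illustration; dedicated closed-mining algorithms avoid generating the non-closed sets in the first place. The supports of the single items b and d are assumed to be 3, consistent with the example.

```python
def closed_itemsets(support):
    """Keep only item sets with no proper superset of equal support.

    support: {frozenset: support count}
    """
    return {
        s: c
        for s, c in support.items()
        if not any(s < t and c == ct for t, ct in support.items())
    }


support = {
    frozenset("abd"): 3, frozenset("a"): 4,
    frozenset("b"): 3, frozenset("d"): 3,
    frozenset("ab"): 3, frozenset("ad"): 3, frozenset("bd"): 3,
}
closed = closed_itemsets(support)
```

Only {a,b,d}:3 and {a}:4 survive the filter, since every other set is contained in {a,b,d} with the same support of 3.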

Step 5: Detection of Violations

A violation of a programming rule is reported when
(i) the rule holds for most cases (confidence > threshold), and
(ii) the rule is violated for a few cases (confidence < 100%).

Example: Detecting Violations

Programming patterns:
{Scsi_Host, host_alloc, add_host, scan_host} : 27
{Scsi_Host, host_alloc, add_host} : 29

Programming rule:
{Scsi_Host, host_alloc, add_host} => {scan_host} with confidence 27/29 = 93%

The 2 of the 29 cases in which scan_host is missing are the violations.
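
The two conditions for reporting a violation can be sketched as follows. The threshold value of 0.9, the function name, and the item set database (built to reproduce the 27-of-29 figures) are assumptions made for illustration.

```python
def find_violations(db, lhs, rhs, min_confidence=0.9):
    """Return (confidence, violating transaction indices) for rule lhs => rhs.

    Violations are reported only if the rule holds in most cases
    (confidence > min_confidence) but not in all of them (confidence < 1).
    """
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    matching = [i for i, t in enumerate(db) if lhs <= t]
    if not matching:
        return 0.0, []
    violating = [i for i in matching if not rhs <= db[i]]
    confidence = (len(matching) - len(violating)) / len(matching)
    if min_confidence < confidence < 1.0:
        return confidence, violating
    return confidence, []


full = frozenset({"Scsi_Host", "host_alloc", "add_host", "scan_host"})
partial = frozenset({"Scsi_Host", "host_alloc", "add_host"})
db = [full] * 27 + [partial] * 2  # hypothetical: 2 functions omit scan_host

confidence, violating = find_violations(db, partial, {"scan_host"})
```

The rule holds in 27 of 29 matching functions, so the two functions at indices 27 and 28 are flagged for inspection.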

Table 1: Some Results of Bug Detection

Software      #C files   LOC         #functions
Linux         3,538      3,037,403   73,607
PostgreSQL    409        381,192     6,964
Apache        160        84,724      1,912

Software      Inspected (top 60)
              Bugs   Anomalies   False Positives
Linux         16     20          24
PostgreSQL    6      9           45
Apache        1      0           6

6. Limitations of PR-Miner

(i) Rules across multiple functions are missed, because no inter-procedural analysis is used.
(ii) False negatives arise for violations in control paths, because no sophisticated analysis techniques are used.

Inter-procedural, path-sensitive inference of function precedence protocols addresses these limitations [Ramanathan et al., ICSE'07] [10].

We shall now discuss mining function precedence protocols, using the following call sequence as an example:

    fp = fopen();
    fclose();

Definition (precedence protocol): a call to fclose is always preceded by a call to fopen.

Definition (successor protocol): a call to fopen is always succeeded by a call to fclose.
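
The precedence definition can be checked over a set of call traces with a few lines of code. This sketch is not the CHRONICLER algorithm, which performs inter-procedural, path-sensitive static analysis; it assumes instead that call sequences along individual paths are already available as lists, and it does not pair each fopen with one specific fclose. The traces are hypothetical.

```python
def precedence_confidence(traces, first, second):
    """Fraction of calls to `second` that are preceded, earlier in the same
    trace, by a call to `first`, plus the indices of violating traces.

    traces: list of call sequences, one per explored control-flow path
    """
    total = preceded = 0
    violating = []
    for index, trace in enumerate(traces):
        seen_first = False
        for call in trace:
            if call == first:
                seen_first = True
            elif call == second:
                total += 1
                if seen_first:
                    preceded += 1
                elif index not in violating:
                    violating.append(index)
    confidence = preceded / total if total else 0.0
    return confidence, violating


traces = [
    ["fopen", "fread", "fclose"],
    ["fopen", "fwrite", "fclose"],
    ["fopen", "fclose"],
    ["fclose"],  # fclose on a path with no preceding fopen
]
confidence, violating = precedence_confidence(traces, "fopen", "fclose")
```

A protocol that holds on most paths but not all, as here with three of four fclose calls preceded by fopen, is a candidate rule whose exceptions are candidate bugs.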

Violation of Precedence Protocols

    fp = fopen();
    if (fp == NULL)
        exit(-1);
    fclose();

Tool Implementation/Evaluation

The CHRONICLER tool, implemented in C, was evaluated as follows:
(i) Tested on open source C programs: Apache, Linux, OpenSSH, GIMP and PostgreSQL.
(ii) Lines of code vary from 66K to 2M.
(iii) Number of call-sites varies from 10K to 110K.

Some Results of Precedence-Related Bug Detection

Case study: Linux

Hardware bug
(i) Difficult to detect using traditional testing techniques
(ii) Platform-dependent error
(iii) Transparently identified using CHRONICLER

Performance bug
(i) Cache lookup operation was absent
(ii) Not easily specified as a bug for testing
(iii) The deviation delays data write flushes [11].

Limitation of Precedence-Related Bug Detection

Precedence-related bug detection does not take data flow or data dependency into account. A new approach to discovering neglected conditions [Chang et al., ISSTA'07] addresses this issue; it is based on dependence analysis, frequent item set mining and frequent subgraph mining.

Crucial Observation

Things that are frequently changed together often form a pattern, also known as a co-change. Co-changed items = patterns.

Finding Patterns

Find frequent item sets (with Apriori). For example, from the co-changed method calls

o.enterAlignment()
o.exitAlignment()
o.redoAlignment()
iter.hasNext()
iter.next()

the frequent item set {enterAlignment(), exitAlignment(), redoAlignment()} is mined.
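
A compact sketch of level-wise Apriori mining over co-change data is given below. The check-in contents and the minimum support of 3 are hypothetical, chosen so that the three alignment calls emerge as a frequent item set; the code illustrates the general Apriori scheme rather than any particular tool's implementation.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Level-wise frequent item set mining.

    Returns {frozenset: support count} for all frequent item sets.
    """
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    level = count({frozenset([i]) for t in transactions for i in t})
    frequent = dict(level)
    k = 2
    while level:
        # Join step: unions of frequent (k-1)-sets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = count(candidates)
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical check-ins: the method calls added together in each revision.
checkins = [
    {"enterAlignment", "exitAlignment", "redoAlignment"},
    {"enterAlignment", "exitAlignment", "redoAlignment", "hasNext"},
    {"enterAlignment", "exitAlignment", "redoAlignment"},
    {"hasNext", "next"},
    {"hasNext", "next"},
]
patterns = apriori(checkins, min_support=3)
```

The prune step is what distinguishes Apriori from brute-force counting: a candidate is only counted if all of its smaller subsets were already found frequent.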

Ranking Patterns

Support count = number of occurrences of a pattern.
Confidence = strength of a pattern, P(A|B).

Pattern classification (post-processing): using the number of validations v and the number of violations e of each pattern, patterns are classified as usage patterns, error patterns or unlikely patterns.





7. Conclusion

Finally, the following conclusions are drawn:

(i) Challenges in data mining
    - Statistical modelling of computer systems
    - Online, scalability, interpretability

(ii) Data mining for software bug detection
    - Frequent pattern mining
    Automated debugging in software programs
    - From frequent patterns to software bugs
    - Statistical debugging
    Automated debugging in computer systems
    - Automated diagnosis of system misconfigurations
    - Performance debugging

(iii) Limitations of Bugs as Deviant Behaviour
    - Fixed rule templates
    - Need specific knowledge about the software
    - 2 elements
    PR-Miner [Li et al., FSE'05] (mining implicit programming rules) was developed to address these limitations
    - General method (no prior knowledge; no templates)
    - General rules (different types: function, variable, data type, etc.; multiple elements)

(iv) Ubiquitous computing demands reliable software: mining for software reliability
    - Mining program source code/version histories to find bugs
    - Mining program runtime data to locate why an execution fails
    - Mining system snapshots to diagnose misconfigurations and performance problems

(v) An active and rewarding research area
    - International Workshop on Mining Software Repositories since 2004
    - SIGCOMM Workshop on Mining Network Data since 2005
    - Systems and Machine Learning Workshop since 2006
    - Workshop on Statistical Learning Techniques for Solving Systems Problems, co-located with NIPS

8. References

1) Ian Sommerville, Software Engineering, 8th edition, Pearson Education Publications, 2007.
2) Roger S. Pressman, Software Engineering: A Practitioner's Approach, 6th edition, McGraw-Hill International Edition, 2005.
3) James S. Peters & Witold Pedrycz, Software Engineering: An Engineering Approach, Wiley Publications, 2000.
4) Jiawei Han & Micheline Kamber, Data Mining: Concepts and Techniques, 2nd edition, Elsevier Publications, March 2006.
5) Chao Liu, Long Fei, Xifeng Yan, Jiawei Han and Samuel Midkiff, Statistical Debugging: A Hypothesis Testing-Based Approach, IEEE TSE, 2006.
6) Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou and Benjamin Chelf, Bugs as Deviant Behaviour: A General Approach to Inferring Errors in Systems Code, SOSP 2001.
7) Zhenmin Li, Shan Lu, Suvda Myagmar and Yuanyuan Zhou, CP-Miner: A Tool for Finding Copy-Paste and Related Bugs in Operating System Code, OSDI 2004.
8) Prof. S. Chitra & Dr. M. Rajaram, A Software Reliability Estimation Tool Using Artificial Immune Recognition System, Proceedings of the International MultiConference of Engineers and Computer Scientists 2008, vol. 1, IMECS 2008, 19-21 March 2008, Hong Kong.
9) Leon Wu, Boyi Xie, Gail Kaiser & Rebecca Passonneau, BUGMINER: Software Reliability Analysis via Data Mining of Bug Reports, Department of Computer Science, Columbia University, New York, NY 10027, USA, 2007.
10) Swapna S. Gokhale, A Simulation Approach to Structure-Based Software Reliability Analysis, IEEE Transactions on Software Engineering, vol. 31, no. 8, August 2005.
11) Simon P. Wilson and Francisco J. Samaniego, Nonparametric Analysis of the Order-Statistic Model in Software Reliability, IEEE Transactions on Software Engineering, vol. 33, no. 3, March 2007.
