Upload
iiradmin
View
212
Download
0
Embed Size (px)
Citation preview
7/29/2019 IMPROVEMENT OF SOFTWARE MAINTENANCE AND RELIABILITY USING DATA MINING TECHNIQUES
1/6
Vol 01, Issue 02, December 2012 International Journal of Data Mining Techniques and Applicationshttp://iirpublications.com ISSN: 2278-2419
Integrated Intelligent Research (IIR) 101
IMPROVEMENT OF SOFTWARE MAINTENANCE AND
RELIABILITY USING DATA MINING TECHNIQUES
YETHIRAJ N GAssistant Professor, Department of Computer Science
Maharanis Science College for Women, Bangalore, India
.
Abstract
Software is ubiquitous in our daily life. It brings us great convenience and a big headache about
software reliability as well: Software is never bug-free, and software bugs keep incurring monetary
loss of even catastrophes. In the pursuit of better reliability, software engineering researchers found
that huge amount of data in various forms can be collected from software systems, and these data,
when properly analyzed, can help improve software reliability. Unfortunately, the huge volume ofcomplex data renders the analysis of simple techniques incompetent; consequently, studies have been
resorting to data mining for more effective analysis. In the past few years, we have witnessed many
studies on mining for software reliability reported in data mining as well as software engineering
forums. These studies either develop new or apply existing data mining techniques to tackle reliability
problems from different angles. In order to keep data mining researchers abreast of the latestdevelopment in this growing research area, we propose this paper on data mining for software
reliability. In this paper, we will present a comprehensive overview of this area, examine
representative studies, and lay out challenges to data mining researchers.
Key words: Software, Software Reliability, Data Mining, Frequent Item Set, Extracting Rules.
1. IntroductionThe economies of all developed nations aredependent on software. More and Moresystems are software controlled. SoftwareEngineering is concerned with theories,
methods and tools for professional softwaredevelopment. Software Engineering is anengineering discipline which is concerned withall aspects of software production. SoftwareEngineers should adopt a systematic andorganized approach to their work and useappropriate tools and techniques depending onthe problem to be solved, the developmentconstraints and the resources available.
Software reliability, unlike many other qualityfactors, can be measured directed andestimated using historical and developmentaldata [1]. Software reliability is defined instatistical terms as the probability of failure-free operation of a computer program in aspecified environment for a specific time.Measures of reliability- if we consider acomputer-based system, a simple measure ofreliability is mean-time-between-failure(MTBF),where MTBF = MTTF + MTTR, theacronym MTTF and MTTR are mean-time-to-
failure and mean-time-to-repair respectively[2].Software reliability specification- Reliability isa complex concept that should always beconsidered at the system rather than theindividual component level. Because the
components in a system are interdependent, afailure in one component can be propagatedthrough the system and affect the operation ofother components. In a computer-basedsystem, we have to consider three dimensionswhen specifying the overall system reliability:
(i) Hardware reliability- What is theprobability of a hardware component failingand how long would it take to repair thatcomponent? (ii) Software reliability- Howlikely is it that a software component willproduce an incorrect output? Software failuresare different from hardware failures in thatsoftware does not wear out: It can continueoperating correctly after producing anincorrect result. (iii) Operator reliability How likely is it that the operator of a systemwill make an error? [1].
Following are the basic terminologies that arefrequently used for reliability-
7/29/2019 IMPROVEMENT OF SOFTWARE MAINTENANCE AND RELIABILITY USING DATA MINING TECHNIQUES
2/6
Vol 01, Issue 02, December 2012 International Journal of Data Mining Techniques and Applicationshttp://iirpublications.com ISSN: 2278-2419
Integrated Intelligent Research (IIR) 102
System FailureWhen the system doesnot perform as per theuser expectations, thensystem failure occurs.
System ErrorWhen the system givesthe result in anunexpected mannerthen the system erroroccurs.
System Fault It is probability of thesystem that the failurecan lead to systemerror.
Human Error It is human activity thatmakes the system faultto occur.
2. Mining Software Engineering Data Themain goal is to transform static record
keeping Software Engineering data toactive data so that the hidden patterns andtrends could be explored.
Why Reliability?Normally, a Software is full of bugs, InWindows 2000, containing35 million lines ofcode, there were 63,000 known bugs at thetime of release, 2 per 1000 lines. Softwarefailure costs are becoming very high. A studyby the National Institute of Standards andTechnology found that software errors cost theU.S. economy about $59.5 billion annually. Sotesting and debugging are laborious and
expensive. 50% of my company employeesare testers, and the rest spends 50% of theirtime testing! Bill Gates, in 1995. In generalSoftware is complex for e.g., MySQL has 1.2millions of LOC and its runtime data is largerand more complex. In fact, finding bugs ischallenging which requiresspecifications/properties, which often dontexist and also substantial human efforts inanalysing data are required [3].
Software Reliability Methods are:(i) Static Bug Detection - Without
running the code, detect bugs incode,
(ii) Dynamic Bug Detection (aka.Testing) - Run the code withsome test inputs and detectfailures/bugs and
(iii)Debugging - Given known testfailures (symptoms), pinpoint thebug locations in the code.
Mining for Soft Reliability is absolutely
needed because,i. Finding bugs is challenging
It requires specifications/properties, whichoften dont exist and also require substantialhuman efforts in analyzing data.
ii. We can mine common patterns aslikely specifications/propertiesDetect violations of patterns as likelybugs.
iii. We can mine huge data for patternsor locations to narrow down thescope of human inspection
E.g., code locations or predicates coveredmore in failing runs less in passing runs maybe suspicious bug locations.
3. TechniquesThe Software engineering tasks helped bydata mining are (i) programming,(ii)defectdetection,(iii)testing,(iv)debuggingand(v)maintenance.Data mining techniquesare(i)Classification, (ii) Association, (iii)Patterns Detection, (iv) Clustering [4].
Software engineering dataConsidered are- (i) Code bases, (ii)change history, (iii) program states,(iv)structural entities and (v) bug reports [5].
4.
Analysis
Data Mining for Software Bug Detectionneeds frequent pattern mining then automatedDebugging in Software Programs is carriedout from frequent patterns to software bugsand statistical debugging. Further, automatedDebugging in computer systems is carried outfrom (i) Automated diagnosis of systemmisconfigurations and (ii) performancedebugging [6].
Software Bug Detection
Common approach: mining rules/patternsfrom source code/revision histories anddetecting bugs as rule/pattern violations.
Mining rules from source code
i. Bugs as deviant behaviour [Engler etal., SOSP01]
7/29/2019 IMPROVEMENT OF SOFTWARE MAINTENANCE AND RELIABILITY USING DATA MINING TECHNIQUES
3/6
Vol 01, Issue 02, December 2012 International Journal of Data Mining Techniques and Applicationshttp://iirpublications.com ISSN: 2278-2419
Integrated Intelligent Research (IIR) 103
ii. Mining programming rules with PR-Miner [Li et al., FSE05]
iii. Mining function precedence protocols[Ramanathan et al., ICSE07]
iv. Revealing neglected conditions[Chang et al., ISSTA07]
Mining rules from revision histories
i. DynaMine [Livshits& Zimmermann,FSE05]
Mining copy-paste patterns from sourcecode
ii. CP-Miner [Li et al., OSDI04] to findcopy-paste bugs [7].
Bugs as Deviant BehaviourStatic verification tools need rules to checkagainst program code
To find errors without knowing the truth
Contradiction in belief. To find lies:cross-examine one witness or manywitness. Any contradiction is an error(internal consistency)
Deviation from common behaviour.To infer correct behaviour: if 1 persondoes X, might be right or acoincidence. If 1000s do X and 1 doesY, probably an error (statistical
analysis)
Crucial: we know contradiction is anerror without knowing the correctbelief!
5. A brief methodology: Software BugDetection
Based on the discussion presented in theprevious section, the following steps forsoftware bug detection are presented.
Step 1:
Mining rules from source code [8]
Bugs as deviant 103ehaviour [Engleret al., SOSP01]
Mining techniques: Statistical analysis Mining programming rules with PR-
Miner [Li et al., FSE05]
Mining function precedence protocols[Ramanathan et al., ICSE07]
Revealing neglected conditions[Chang et al., ISSTA07]
Step 2:Mining copy-paste patterns from sourcecode
CP-Miner [Li et al., OSDI04] to findcopy-paste bugs
An Overview of Extracting Rules -Observation: elements are usually usedtogether.
Idea: finding association among elements thatarefrequently used togetherin source codeImplies frequent item set mining [9].
Examples:spin_lock_irqsave and spin_unlock_irqrestoreappear together within the same function morethan 3600 times.
Flowchart of Extracting Rules
Source files
Parsing & hashing
Pre-ProcessingItemsets
Mining
Programming patterns
Post-ProcessingGenerating rules
Programming rules
Fig.1Step 3:
Mining Programming Patterns and
Generation of Rules
Parsing Source Code Purpose: building anitem set database.Element: function call, variable, data type, etc.are mapped to a number. The Source code ismapped to an item set database.
A frequent sub-item set corresponds to aprogramming pattern and application of
7/29/2019 IMPROVEMENT OF SOFTWARE MAINTENANCE AND RELIABILITY USING DATA MINING TECHNIQUES
4/6
Vol 01, Issue 02, December 2012 International Journal of Data Mining Techniques and Applicationshttp://iirpublications.com ISSN: 2278-2419
Integrated Intelligent Research (IIR) 104
frequent item set mining algorithm on the itemset database.
E.g., {39, 68, 36, 92}:27 corresponds topattern{Scsi_Host, host_alloc, add_host, scan_host}
Tradeoff: consider order or notStep 4:
Generating Programming Rules
Programming patterns - programming rulesE.g.,Patterns: {a, b, d} : 3,
{a} : 4
Rules:{a} => {b,d} with confidence = =75%
{b} => {a,d} with confidence = 100%{d} => {a,b} with confidence = 100%{a,b} => {d} with confidence = 100%{a,d} => {b} with confidence = 100%{b,d} => {a} with confidence = 100%
Rule Explosion Problem
Exponential number of rules Solution: closed mining
Example:{a,b,d}:3, {a}:4{a,b}:3, {a,d}:3, {b,d}:3 are not closed
Close rules{a,b,d}:3 | {a}:4
Detection of Violations
For violations of a programming rule(i) The rule holds for most cases
Confidence > threshold(ii) The rule is violated for a few cases
Confidence < 100%Example: Detecting Violations
Step 5:
Programming patterns:{Scsi_Host, host_alloc, add_host, scan_host}:27{Scsi_Host, host_alloc, add_host}: 29
Programming rule:{Scsi_Host, host_alloc, add_host}=>
{scan_host}
with confidence 27/29 = 93%
Missing
Table 1: Some Results of Bug DetectionSoftware #C files LOC #functions
Linux 3,538 3,037,403 73,607PostgreSQL
409 381,192 6,964
Apache 160 84,724 1,912
Software Inspected (top 60)
Bugs Anomalies FalsePositives
Linux 16 20 24
PostgreSQL
6 9 45
Apache 1 0 6
6. Limitations of PR-MinerRules across multiple functions
Not using inter-procedural analysisFalse negatives of violations in control paths
Not using sophisticated analysistechniques
Inter-procedural, path-sensitiveinference of function precedenceprotocols to address the limitations[Ramanathan et al., ICSE07] [10].
We shall now discuss Mining Function
Precedence Protocols
fp = fopen();fclose();
Definition:-Precedence protocol:A call tofcloseis always preceded by a calltofopen
Definition:-Successor protocol :A call tofopenis always succeeded by a calltofclose
Violation of Precedence Protocols
fp = fopen();
7/29/2019 IMPROVEMENT OF SOFTWARE MAINTENANCE AND RELIABILITY USING DATA MINING TECHNIQUES
5/6
Vol 01, Issue 02, December 2012 International Journal of Data Mining Techniques and Applicationshttp://iirpublications.com ISSN: 2278-2419
Integrated Intelligent Research (IIR) 105
if(fp == NULL)exit(-1);fclose();
Tool Implementation/Evaluation
CHRONICLER tool implemented in Chas the following features:
Tested on open source C programsApache, linux, openssh, gimp,postgresqlLines of code varies from 66K to 2MNumber of call-sites varies from 10K to110K
Some Results of Precedence-Related BugDetection
Case Study: LinuxHardware Bug
Difficult to detect using traditionaltesting techniques
Platform dependent error Transparently identified using
CHRONICLER
Performance Bug Cache lookup operation was absent Not easily specified as a bug for
testing
Deviation delays data write flushes[11].
Limitation of Precedence-Related Bug
Detection
Does not take data flow or datadependency into account
A new approach to discoveringneglected conditions [Chang et al.,ISSTA07] addresses the issue
Based on dependence analysis,frequent item set, and frequent subgraph mining
Crucial Observation
Things that are frequently changed together
often form a pattern...also known as co-
change
Co-changed items = patterns
Finding Patterns
Find frequent itemsets (with Apriori)
o.enterAlignment()
o.exitAlignment()
o.redoAlignment()
iter.hasNext()iter.next()
{enterAlignment (), exitAlignment(),redoAlignment()}
Ranking PatternsSupport count = #occurrences of a pattern
Confidence count= Strength of a pattern, P(A|B)
Pattern classification
Post-processv validations, e violations
Usage error unlikelypatterns patterns patterns
e
7/29/2019 IMPROVEMENT OF SOFTWARE MAINTENANCE AND RELIABILITY USING DATA MINING TECHNIQUES
6/6
Vol 01, Issue 02, December 2012 International Journal of Data Mining Techniques and Applicationshttp://iirpublications.com ISSN: 2278-2419
Integrated Intelligent Research (IIR) 106
7. ConclusionFinally, the following conclusions are drawn:
(i) Challenges in data mining Statistical modelling of computer
systems
Online, scalability, interpretability (ii) Data Mining for Software Bug
Detection
Frequent pattern miningAutomated Debugging in Software
Programs From frequent patterns to
software bugs Statistical debugging
Automated Debugging in Computer
Systems Automated diagnosis of
system misconfigurations
Performance debugging(iii) Limitations of Bugs as DeviantBehaviour
Fixed rule templates Need specific knowledge
about the software
2 elements PR-Miner [Li et al., FSE05]
(mining implicit programmingrules) developed to addressthe limitations
General method (No priorknowledge; No templates)
General rules (Different types:function, variable, data type, etc.;Multipleelements)
(iv) Ubiquitous computing demandsreliable software- Mining for
software reliability
Mining program sourcecode/version histories to findbugs
Mining program runtime datato locate why an execution
fails Mining system snapshots to
diagnose misconfigurationsand performance problems
(v) An active and rewarding researcharea
International Workshop on MiningSoftware Repositories since 2004
SIGCOMM Workshop on MiningNetwork Data since 2005
Systems and Machine LearningWorkshop since 2006
Workshop on Statistical LearningTechniques for Solving Systems
Problems, co-located with NIPS
8. References:1) Ian Sommerville, Software Engineering
8th edition, Pearson EducationPublications, 2007.
2) Roger S. Pressman, Software Engineering:A Practitioners Approach, 6th editionMcGraw-Hill International editionPublications, 2005.
3) James S. Peters &WitoldPedrycz,Software Engineering an EngineeringApproach, Wiley Publications, 2000.
4) Jiawei Han &MichelineKamber, DataMining: Concepts and Techniques, 2
nd
edition,, Elsevier Publications, March2006.
5) Chai Liu, Long Fei, Xifang Yan, JiaweiHan and Samuel Midkiff, StatisticalDebugging: A Hypothesis Testing-basedapproach, IEEETSE 2006.
6) Dawson Engler, David Yu Chen, SethHallem, Andy Chou and Benjamin Chelf,Bugs as Deviant Behaviour: A Generalapproach to inferring errors in systemscode, SOSP 2001.
7) Zhenmin Li, Shan Lu, SuvdaMyagmarand Yuanyan Zhou, CP-Miner: A tool forfinding copy-paste and related bugs inoperating system code, OSPI 2004.
8) Prof. S. Chitra&Dr. M. Rajaram, ASoftware Reliability Estimation tool usingArtificial Immune Recognition System:Proceedings of the InternationalMulticonference of Engineers andcomputer scientists 2008 vol 1, IMECS2008, pp. 19-21 March 2008, Hong Kong.
9) Leon Wu, BoyiXie, Gail Kaiser &Rebecca Passonneau, Department ofComputer Science, Columbia University,Newyork NY 10027 USA, BUGMINER:Software Reliability Analysis via DataMining of Bug Reports2007.
10) Swapna S. Gokhale, Member, IEEE, ASimulation Approach to structured-basedsoftware reliability analysis, IEEEtransactions on Software Engineering, vol31, No. 8, August 2005.
11) Simon P. Wilson and Francisco J.Samaniego, Nonparametric Analysis ofthe order-statistic model in softwarereliability, IEEE transactions on softwareengineering, vol 33, No. 3, March 2007.