IMPROVEMENT OF SOFTWARE MAINTENANCE AND RELIABILITY USING DATA MINING TECHNIQUES


    Vol 01, Issue 02, December 2012 International Journal of Data Mining Techniques and Applications http://iirpublications.com ISSN: 2278-2419

    Integrated Intelligent Research (IIR) 101


    YETHIRAJ N G
    Assistant Professor, Department of Computer Science
    Maharani's Science College for Women, Bangalore, India

    Abstract

    Software is ubiquitous in our daily life. It brings us great convenience, but also a big headache about software reliability: software is never bug-free, and software bugs keep incurring monetary loss or even catastrophes. In the pursuit of better reliability, software engineering researchers have found that huge amounts of data in various forms can be collected from software systems, and that these data, when properly analyzed, can help improve software reliability. Unfortunately, the huge volume of complex data renders simple analysis techniques inadequate; consequently, studies have been resorting to data mining for more effective analysis. In the past few years, we have witnessed many studies on mining for software reliability reported in data mining as well as software engineering forums. These studies either develop new data mining techniques or apply existing ones to tackle reliability problems from different angles. To keep data mining researchers abreast of the latest developments in this growing research area, we propose this paper on data mining for software reliability. In this paper, we present a comprehensive overview of this area, examine representative studies, and lay out challenges to data mining researchers.

    Key words: Software, Software Reliability, Data Mining, Frequent Item Set, Extracting Rules.

    1. Introduction

    The economies of all developed nations are dependent on software. More and more systems are software controlled. Software Engineering is concerned with theories, methods and tools for professional software development. It is an engineering discipline concerned with all aspects of software production. Software engineers should adopt a systematic and organized approach to their work and use appropriate tools and techniques depending on the problem to be solved, the development constraints and the resources available.

    Software reliability, unlike many other quality factors, can be measured directly and estimated using historical and developmental data [1]. Software reliability is defined in statistical terms as the probability of failure-free operation of a computer program in a specified environment for a specified time.

    Measures of reliability: for a computer-based system, a simple measure of reliability is mean-time-between-failure (MTBF), where MTBF = MTTF + MTTR; the acronyms MTTF and MTTR stand for mean-time-to-failure and mean-time-to-repair respectively [2].

    Software reliability specification: reliability is a complex concept that should always be considered at the system level rather than the individual component level. Because the components in a system are interdependent, a failure in one component can propagate through the system and affect the operation of other components. In a computer-based system, we have to consider three dimensions when specifying the overall system reliability:

    (i) Hardware reliability: What is the probability of a hardware component failing, and how long would it take to repair that component?
    (ii) Software reliability: How likely is it that a software component will produce an incorrect output? Software failures differ from hardware failures in that software does not wear out: it can continue operating correctly after producing an incorrect result.
    (iii) Operator reliability: How likely is it that the operator of a system will make an error? [1].
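
As a small worked illustration of the MTBF relation quoted in this section, the sketch below computes MTBF from MTTF and MTTR. The availability ratio MTTF / (MTTF + MTTR) is a commonly used companion measure and is included here as an assumption; it is not stated in the text, and the function names and numeric values are hypothetical.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures: MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is operational (assumed companion measure)."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)
```

For instance, a system that runs 98 hours on average before failing and takes 2 hours to repair has an MTBF of 100 hours and, under the assumed definition, an availability of 0.98.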

    The following basic terms are frequently used when discussing reliability:


    Term            Definition
    System failure  Occurs when the system does not perform as the user expects.
    System error    Occurs when the system produces a result in an unexpected manner.
    System fault    A characteristic of the system that can lead to a system error.
    Human error     Human activity that introduces a fault into the system.

    2. Mining Software Engineering Data

    The main goal is to transform static, record-keeping software engineering data into active data so that hidden patterns and trends can be explored.

    Why Reliability?

    Software is normally full of bugs. Windows 2000, containing 35 million lines of code, had 63,000 known bugs at the time of release, about 2 per 1,000 lines. Software failure costs are becoming very high: a study by the National Institute of Standards and Technology found that software errors cost the U.S. economy about $59.5 billion annually. Testing and debugging are laborious and expensive. "50% of my company employees are testers, and the rest spends 50% of their time testing!" (Bill Gates, 1995). Software is also complex; for example, MySQL has 1.2 million lines of code (LOC), and its runtime data is larger and more complex still. Finding bugs is challenging: it requires specifications/properties, which often don't exist, and substantial human effort in analysing data [3].

    Software reliability methods are:
    (i) Static bug detection: detect bugs in the code without running it.
    (ii) Dynamic bug detection (a.k.a. testing): run the code with some test inputs and detect failures/bugs.
    (iii) Debugging: given known test failures (symptoms), pinpoint the bug locations in the code.

    Mining for software reliability is needed because:
    i. Finding bugs is challenging. It requires specifications/properties, which often don't exist, and substantial human effort in analyzing data.
    ii. We can mine common patterns as likely specifications/properties, and detect violations of those patterns as likely bugs.
    iii. We can mine huge data for patterns or locations to narrow down the scope of human inspection. E.g., code locations or predicates covered more often in failing runs and less often in passing runs may be suspicious bug locations.
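
The last point can be sketched in a few lines: count how often each code location (or predicate) is covered in failing and in passing runs, then rank locations by how strongly they lean towards failing runs. The scoring ratio, the function name and the input layout are illustrative assumptions in the spirit of coverage-based fault localization; this is not the method of any study cited here.

```python
from collections import Counter


def rank_suspicious(runs):
    """Rank code locations by how much more often failing runs cover them.

    runs: iterable of (covered_locations, failed) pairs (hypothetical layout).
    Returns (location, score) pairs, most suspicious first; score is in [0, 1].
    """
    runs = list(runs)
    total_fail = sum(1 for _, failed in runs if failed)
    total_pass = len(runs) - total_fail

    fail_cov, pass_cov = Counter(), Counter()
    for covered, failed in runs:
        # Count each location at most once per run.
        (fail_cov if failed else pass_cov).update(set(covered))

    scores = {}
    for loc in set(fail_cov) | set(pass_cov):
        fail_rate = fail_cov[loc] / total_fail if total_fail else 0.0
        pass_rate = pass_cov[loc] / total_pass if total_pass else 0.0
        denom = fail_rate + pass_rate
        scores[loc] = fail_rate / denom if denom else 0.0

    # Highest score first; ties broken by location name for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

A location covered only by failing runs scores 1.0, and one covered equally often by both kinds of run scores 0.5, so a human inspector can start from the top of the list.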

    3. Techniques

    The software engineering tasks helped by data mining are (i) programming, (ii) defect detection, (iii) testing, (iv) debugging and (v) maintenance. The data mining techniques are (i) classification, (ii) association, (iii) pattern detection and (iv) clustering [4].

    The software engineering data considered are (i) code bases, (ii) change history, (iii) program states, (iv) structural entities and (v) bug reports [5].

    4. Analysis

    Data mining for software bug detection relies on frequent pattern mining. Automated debugging in software programs then proceeds from frequent patterns to software bugs and to statistical debugging. Further, automated debugging in computer systems covers (i) automated diagnosis of system misconfigurations and (ii) performance debugging [6].

    Software Bug Detection

    Common approach: mining rules/patterns from source code or revision histories and detecting bugs as rule/pattern violations.

    Mining rules from source code

    i. Bugs as deviant behaviour [Engler et al., SOSP'01]


    ii. Mining programming rules with PR-Miner [Li et al., FSE'05]
    iii. Mining function precedence protocols [Ramanathan et al., ICSE'07]
    iv. Revealing neglected conditions [Chang et al., ISSTA'07]

    Mining rules from revision histories

    i. DynaMine [Livshits & Zimmermann, FSE'05]

    Mining copy-paste patterns from source code

    i. CP-Miner [Li et al., OSDI'04] to find copy-paste bugs [7].

    Bugs as Deviant Behaviour

    Static verification tools need rules to check against program code. The goal is to find errors without knowing the truth:

    - Contradiction in belief. To find lies, cross-examine one witness or many witnesses; any contradiction is an error (internal consistency).
    - Deviation from common behaviour. To infer correct behaviour: if 1 person does X, it might be right or a coincidence. If 1000s do X and 1 does Y, Y is probably an error (statistical analysis).

    Crucial: we know a contradiction is an error without knowing the correct belief!
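
The "deviation from common behaviour" idea can be illustrated with a small sketch: for each function, count how its callers behave and flag the rare call sites that deviate from an overwhelming majority. The data layout, the 90% ratio, the minimum sample size and all names are assumptions made for illustration; this is not the checker described by Engler et al.

```python
from collections import Counter, defaultdict


def find_deviants(observations, min_ratio=0.9, min_count=10):
    """Flag call sites whose behaviour deviates from a dominant majority.

    observations: iterable of (function, behaviour, site) tuples (hypothetical).
    Returns a list of (function, site, behaviour, majority_behaviour).
    """
    by_func = defaultdict(list)
    for func, behaviour, site in observations:
        by_func[func].append((behaviour, site))

    deviants = []
    for func, items in by_func.items():
        counts = Counter(behaviour for behaviour, _ in items)
        majority, n_major = counts.most_common(1)[0]
        enough_evidence = len(items) >= min_count
        dominant = n_major / len(items) >= min_ratio
        # Only report when a clear majority exists and someone deviates from it.
        if enough_evidence and dominant and n_major < len(items):
            deviants.extend(
                (func, site, behaviour, majority)
                for behaviour, site in items
                if behaviour != majority
            )
    return deviants
```

Note that no rule such as "the result must be checked" is supplied: the majority behaviour is inferred from the code itself, which is the point of the approach.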

    5. A Brief Methodology: Software Bug Detection

    Based on the discussion in the previous section, the following steps for software bug detection are presented.

    Step 1:

    Mining rules from source code [8]

    - Bugs as deviant behaviour [Engler et al., SOSP'01]. Mining technique: statistical analysis.
    - Mining programming rules with PR-Miner [Li et al., FSE'05]
    - Mining function precedence protocols [Ramanathan et al., ICSE'07]
    - Revealing neglected conditions [Chang et al., ISSTA'07]

    Step 2:

    Mining copy-paste patterns from source code

    - CP-Miner [Li et al., OSDI'04] to find copy-paste bugs

    An Overview of Extracting Rules

    Observation: program elements are usually used together.

    Idea: find associations among elements that are frequently used together in source code. This implies frequent item set mining [9].

    Example: spin_lock_irqsave and spin_unlock_irqrestore appear together within the same function more than 3600 times.
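
This observation can be checked directly by counting how often pairs of elements occur within the same function. The sketch below assumes each function body has already been reduced to the list of names it uses; the function name and input layout are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_counts(functions):
    """Count, for every pair of elements, how many functions use both.

    functions: dict mapping a function name to the element names it uses.
    Returns a Counter keyed by alphabetically sorted (element, element) pairs.
    """
    counts = Counter()
    for elements in functions.values():
        # Deduplicate so that a pair is counted once per function.
        for pair in combinations(sorted(set(elements)), 2):
            counts[pair] += 1
    return counts
```

Pairs with a very high count, such as a lock call and its matching unlock call, are candidates for programming rules.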

    Fig. 1: Flowchart of extracting rules. Source files -> parsing and hashing (pre-processing) -> item sets -> mining -> programming patterns -> generating rules (post-processing) -> programming rules.

    Step 3:

    Mining Programming Patterns and Generation of Rules

    Parsing source code. Purpose: building an item set database. Elements (function calls, variables, data types, etc.) are each mapped to a number, so that the source code is mapped to an item set database.

    Mining. A frequent item set mining algorithm is applied to the item set database; each frequent sub-item set corresponds to a programming pattern.

    E.g., {39, 68, 36, 92}:27 corresponds to the pattern {Scsi_Host, host_alloc, add_host, scan_host}.
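
A minimal sketch of this step, assuming each function has already been parsed into a list of element names: every distinct element is mapped to an integer, each function becomes one item set, and item sets occurring in at least min_support functions are reported. The integer ids are assigned sequentially here (the hashing scheme of the original tool is not described in this paper), all names are hypothetical, and the brute-force enumeration is only suitable for tiny inputs.

```python
from collections import Counter
from itertools import combinations


def build_itemset_db(functions):
    """Map element names to integer ids and each function to a set of ids."""
    ids = {}
    db = []
    for elements in functions:
        # setdefault assigns the next free id the first time a name is seen.
        db.append(frozenset(ids.setdefault(e, len(ids)) for e in elements))
    return ids, db


def frequent_itemsets(db, min_support, max_size=4):
    """Brute force: count every subset of up to max_size items per function."""
    counts = Counter()
    for items in db:
        for k in range(1, min(max_size, len(items)) + 1):
            for subset in combinations(sorted(items), k):
                counts[frozenset(subset)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}
```

A practical miner would replace the enumeration with a dedicated frequent (closed) item set algorithm, since the number of subsets grows exponentially with function size.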

    Trade-off: whether or not to consider the order of elements.

    Step 4:

    Generating Programming Rules

    Programming patterns are turned into programming rules. E.g.,

    Patterns: {a, b, d} : 3, {a} : 4

    Rules:
    {a} => {b,d} with confidence = 3/4 = 75%
    {b} => {a,d} with confidence = 100%
    {d} => {a,b} with confidence = 100%
    {a,b} => {d} with confidence = 100%
    {a,d} => {b} with confidence = 100%
    {b,d} => {a} with confidence = 100%
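
The confidence values in this example follow from confidence(X => Y) = support(X union Y) / support(X). The sketch below recomputes them. The supports of the sub-patterns other than {a} are not listed in the text and are assumed to be 3, which is what the stated 100% confidences imply; the function and variable names are hypothetical.

```python
from itertools import combinations

# Supports for the example; every entry except {a,b,d}:3 and {a}:4 is assumed.
SUPPORT = {
    frozenset("abd"): 3,
    frozenset("a"): 4,
    frozenset("b"): 3,
    frozenset("d"): 3,
    frozenset("ab"): 3,
    frozenset("ad"): 3,
    frozenset("bd"): 3,
}


def rules_from_pattern(pattern, support):
    """Generate every rule lhs => rhs that splits the pattern in two parts.

    Returns a list of (lhs, rhs, confidence) with
    confidence = support(pattern) / support(lhs).
    """
    pattern = frozenset(pattern)
    rules = []
    for k in range(1, len(pattern)):
        for lhs in combinations(sorted(pattern), k):
            lhs = frozenset(lhs)
            rules.append((lhs, pattern - lhs, support[pattern] / support[lhs]))
    return rules
```

Calling rules_from_pattern("abd", SUPPORT) yields the six rules listed above, with confidence 0.75 for {a} => {b,d} and 1.0 for the others.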

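    Rule Explosion Problem

    A pattern can generate an exponential number of rules. Solution: closed mining, which keeps only closed patterns, i.e. patterns that have no superset with the same support.

    Example: given {a,b,d}:3 and {a}:4, the patterns {a,b}:3, {a,d}:3 and {b,d}:3 are not closed.

    Closed patterns: {a,b,d}:3 and {a}:4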
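
Closedness can be tested directly from the definition: a pattern is closed if no proper superset has the same support. The sketch below filters a table of supports accordingly. It is a quadratic illustration rather than a real closed-pattern miner, and the supports of {b} and {d} are assumed to be 3 because the text does not list them.

```python
def closed_itemsets(support):
    """Keep only item sets that have no proper superset with equal support."""
    closed = {}
    for items, count in support.items():
        absorbed = any(
            items < other and count == other_count  # '<' is proper subset
            for other, other_count in support.items()
        )
        if not absorbed:
            closed[items] = count
    return closed


# Example supports; {b}:3 and {d}:3 are assumptions, the rest is from the text.
EXAMPLE_SUPPORT = {
    frozenset("abd"): 3,
    frozenset("a"): 4,
    frozenset("b"): 3,
    frozenset("d"): 3,
    frozenset("ab"): 3,
    frozenset("ad"): 3,
    frozenset("bd"): 3,
}
```

Applied to EXAMPLE_SUPPORT, only {a,b,d}:3 and {a}:4 survive, so rules need to be generated from two patterns instead of seven.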

    Step 5:

    Detection of Violations

    A programming rule is considered violated when:
    (i) the rule holds in most cases (confidence > threshold), and
    (ii) the rule is violated in a few cases (confidence < 100%).

    Example: Detecting Violations

    Programming patterns:
    {Scsi_Host, host_alloc, add_host, scan_host}: 27
    {Scsi_Host, host_alloc, add_host}: 29

    Programming rule:
    {Scsi_Host, host_alloc, add_host} => {scan_host} with confidence 27/29 = 93%

    The 2 cases (29 - 27) in which scan_host is missing are reported as violations.
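
Violation detection for a single rule can be sketched as follows: every function that contains all elements of the left-hand side but lacks part of the right-hand side is reported, provided the confidence of the rule lies strictly between the threshold and 100%. The 90% threshold, the function name and the input layout are assumptions for illustration.

```python
def find_violations(functions, lhs, rhs, min_conf=0.9):
    """Return (confidence, violating function names) for the rule lhs => rhs.

    functions: dict mapping a function name to the elements it uses.
    Violators are reported only if min_conf < confidence < 1.0.
    """
    lhs, rhs = set(lhs), set(rhs)
    matching = [name for name, items in functions.items() if lhs <= set(items)]
    if not matching:
        return 0.0, []
    violators = [name for name in matching if not rhs <= set(functions[name])]
    conf = (len(matching) - len(violators)) / len(matching)
    if min_conf < conf < 1.0:
        return conf, violators
    # Either the rule is too weak to trust or nothing violates it.
    return conf, []
```

With 27 functions that use all four elements and 2 that omit scan_host, the rule has confidence 27/29 and the 2 omitting functions are returned for inspection.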

    Table 1: Some Results of Bug Detection

    Software     #C files   LOC         #functions
    Linux        3,538      3,037,403   73,607
    PostgreSQL   409        381,192     6,964
    Apache       160        84,724      1,912

    Inspected (top 60)
    Software     Bugs   Anomalies   False positives
    Linux        16     20          24
    PostgreSQL   6      9           45
    Apache       1      0           6

    6. Limitations of PR-Miner

    - Rules across multiple functions: PR-Miner does not use inter-procedural analysis.
    - False negatives of violations in control paths: PR-Miner does not use sophisticated analysis techniques.
    - Inter-procedural, path-sensitive inference of function precedence protocols addresses these limitations [Ramanathan et al., ICSE'07] [10].

    We shall now discuss Mining Function Precedence Protocols.

        fp = fopen();
        fclose();

    Definition (precedence protocol): a call to fclose is always preceded by a call to fopen.

    Definition (successor protocol): a call to fopen is always succeeded by a call to fclose.

    Violation of Precedence Protocols

        fp = fopen();
        if (fp == NULL)
            exit(-1);
        fclose();
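
To illustrate the two definitions, the sketch below checks a precedence protocol and a successor protocol against call sequences, one sequence per control-flow path. CHRONICLER itself infers protocols by inter-procedural, path-sensitive static analysis; the hand-written call sequences and the function names here are assumptions standing in for that analysis.

```python
def violates_precedence(path, before, after):
    """True if some call to `after` on this path has no earlier call to `before`."""
    seen_before = False
    for call in path:
        if call == before:
            seen_before = True
        elif call == after and not seen_before:
            return True
    return False


def violates_successor(path, first, then):
    """True if some call to `first` on this path is never followed by `then`."""
    pending = False
    for call in path:
        if call == first:
            pending = True   # a call now waits for its successor
        elif call == then:
            pending = False  # every earlier call has been followed
    return pending
```

For the path fopen, exit (the early-exit branch of the snippet above), the successor check for fopen and fclose returns True, while the path fopen, fread, fclose satisfies both protocols.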

    Tool Implementation/Evaluation

    The CHRONICLER tool, implemented in C, has the following characteristics:

    - Tested on open source C programs: Apache, Linux, OpenSSH, GIMP, PostgreSQL
    - Lines of code vary from 66K to 2M
    - Number of call-sites varies from 10K to 110K

    Some Results of Precedence-Related Bug Detection

    Case Study: Linux

    Hardware bug
    - Difficult to detect using traditional testing techniques
    - Platform-dependent error
    - Transparently identified using CHRONICLER

    Performance bug
    - Cache lookup operation was absent
    - Not easily specified as a bug for testing
    - Deviation delays data write flushes [11].

    Limitation of Precedence-Related Bug Detection

    - Does not take data flow or data dependency into account
    - A new approach to discovering neglected conditions [Chang et al., ISSTA'07] addresses the issue
    - It is based on dependence analysis, frequent item set mining, and frequent subgraph mining

    Crucial Observation

    Things that are frequently changed together often form a pattern, also known as co-change.

    Co-changed items = patterns

    Finding Patterns

    Find frequent item sets (with Apriori). For example, from calls such as

        o.enterAlignment()
        o.exitAlignment()
        o.redoAlignment()
        iter.hasNext()
        iter.next()

    the item set {enterAlignment(), exitAlignment(), redoAlignment()} is mined.

    Ranking Patterns

    Support count = number of occurrences of a pattern
    Confidence = strength of a pattern, P(A|B)
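
A compact, unoptimised Apriori-style sketch over co-change transactions (each transaction being the set of method calls changed together) is given below, together with the confidence measure used for ranking. The function names are hypothetical and the code is not DynaMine's implementation.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Level-wise frequent item set mining; returns {itemset: support count}."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items}
    level = {s for s in level if support(s) >= min_support}

    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s)
        k += 1
        # Join step: merge frequent (k-1)-sets that differ in one item.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if support(c) >= min_support}
    return frequent


def confidence(frequent, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the mined support counts."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]
```

Patterns can then be ranked by support count first and confidence second, so that frequently co-changed and strongly associated calls are inspected before the rest.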

    Pattern Classification

    After post-processing, each pattern has v validations and e violations and is classified as a usage pattern, an error pattern or an unlikely pattern.


    7. Conclusion

    Finally, the following conclusions are drawn:

    (i) Challenges in data mining
        - Statistical modelling of computer systems
        - Online, scalability, interpretability
    (ii) Data mining for software bug detection
        - Frequent pattern mining
        - Automated debugging in software programs: from frequent patterns to software bugs; statistical debugging
        - Automated debugging in computer systems: automated diagnosis of system misconfigurations; performance debugging
    (iii) Limitations of Bugs as Deviant Behaviour
        - Fixed rule templates
        - Need for specific knowledge about the software
        - Rules limited to 2 elements
        - PR-Miner [Li et al., FSE'05] (mining implicit programming rules) was developed to address these limitations: a general method (no prior knowledge; no templates) yielding general rules (different types: function, variable, data type, etc.; multiple elements)
    (iv) Ubiquitous computing demands reliable software: mining for software reliability
        - Mining program source code/version histories to find bugs
        - Mining program runtime data to locate why an execution fails
        - Mining system snapshots to diagnose misconfigurations and performance problems
    (v) An active and rewarding research area
        - International Workshop on Mining Software Repositories, since 2004
        - SIGCOMM Workshop on Mining Network Data, since 2005
        - Systems and Machine Learning Workshop, since 2006
        - Workshop on Statistical Learning Techniques for Solving Systems Problems, co-located with NIPS

    8. References

    1) Ian Sommerville, Software Engineering, 8th edition, Pearson Education Publications, 2007.
    2) Roger S. Pressman, Software Engineering: A Practitioner's Approach, 6th edition, McGraw-Hill International edition Publications, 2005.
    3) James S. Peters & Witold Pedrycz, Software Engineering: An Engineering Approach, Wiley Publications, 2000.
    4) Jiawei Han & Micheline Kamber, Data Mining: Concepts and Techniques, 2nd edition, Elsevier Publications, March 2006.
    5) Chao Liu, Long Fei, Xifeng Yan, Jiawei Han and Samuel Midkiff, Statistical Debugging: A Hypothesis Testing-Based Approach, IEEE TSE, 2006.
    6) Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou and Benjamin Chelf, Bugs as Deviant Behaviour: A General Approach to Inferring Errors in Systems Code, SOSP 2001.
    7) Zhenmin Li, Shan Lu, Suvda Myagmar and Yuanyuan Zhou, CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code, OSDI 2004.
    8) Prof. S. Chitra & Dr. M. Rajaram, A Software Reliability Estimation Tool using Artificial Immune Recognition System, Proceedings of the International MultiConference of Engineers and Computer Scientists 2008 (IMECS 2008), vol. 1, 19-21 March 2008, Hong Kong.
    9) Leon Wu, Boyi Xie, Gail Kaiser & Rebecca Passonneau, Department of Computer Science, Columbia University, New York, NY 10027, USA, BUGMINER: Software Reliability Analysis via Data Mining of Bug Reports, 2007.
    10) Swapna S. Gokhale, A Simulation Approach to Structure-Based Software Reliability Analysis, IEEE Transactions on Software Engineering, vol. 31, no. 8, August 2005.
    11) Simon P. Wilson and Francisco J. Samaniego, Nonparametric Analysis of the Order-Statistic Model in Software Reliability, IEEE Transactions on Software Engineering, vol. 33, no. 3, March 2007.