Software Defect Prediction via Convolutional Neural Network



Jian Li∗†, Pinjia He∗†, Jieming Zhu∗†, and Michael R. Lyu∗†
∗Department of Computer Science and Engineering, The Chinese University of Hong Kong, China
†Shenzhen Research Institute, The Chinese University of Hong Kong, China
{jianli, pjhe, jmzhu, lyu}@cse.cuhk.edu.hk

Abstract—To improve software reliability, software defect prediction is utilized to assist developers in finding potential bugs and allocating their testing efforts. Traditional defect prediction studies mainly focus on designing hand-crafted features, which are input into machine learning classifiers to identify defective code. However, these hand-crafted features often fail to capture the semantic and structural information of programs. Such information is important in modeling program functionality and can lead to more accurate defect prediction.

In this paper, we propose a framework called Defect Prediction via Convolutional Neural Network (DP-CNN), which leverages deep learning for effective feature generation. Specifically, based on the programs' Abstract Syntax Trees (ASTs), we first extract token vectors, which are then encoded as numerical vectors via mapping and word embedding. We feed the numerical vectors into a Convolutional Neural Network to automatically learn semantic and structural features of programs. After that, we combine the learned features with traditional hand-crafted features for accurate software defect prediction. We evaluate our method on seven open-source projects in terms of F-measure in defect prediction. The experimental results show that, on average, DP-CNN improves the state-of-the-art method by 12%.

Index Terms—software reliability; software defect prediction; deep learning; CNN

I. INTRODUCTION

With the ever-increasing scale of modern software, reliability has become a critical issue, since such software is often highly complicated and failure-prone. As code defects1 in the implementation of software are considered the main causes of failures [1], companies like Google employ code review and unit testing to find bugs in fresh code and improve reliability [2]. However, manual code reviews are labor-intensive, and testing all code units is impractical. As software project budgets are finite, it is beneficial to check potentially buggy code first. Therefore, software defect prediction techniques, which automatically find potential bugs, have been widely employed to help developers allocate their limited resources [3].

Software defect prediction [4], [5], [6], [7], [8] is a process of building classifiers to predict code areas that potentially contain defects, using information such as code complexity and change history. The prediction results (i.e., buggy code areas) can serve as warnings for code reviewers and guide the allocation of their efforts. The code areas could be files, changes, or methods.

1 Defect and bug are used interchangeably throughout this paper.

In this paper, we focus on file-level defect prediction. Typical defect prediction is composed of two phases [9]: feature extraction from source files, and classifier development using various machine learning algorithms. Previous studies toward building more accurate predictions mainly focus on manually designing new discriminative features, or new combinations of features, so that defects can be better distinguished. Traditional hand-crafted features include Halstead features based on the number of operators and operands [10], McCabe features based on control-flow complexity [11], CK features for object-oriented programs [12], etc.

However, programs have well-defined syntax and rich semantics hidden in their Abstract Syntax Trees (ASTs), which traditional features often fail to capture; thus the prediction results of traditional methods are often unsatisfactory. Recently, deep learning has emerged as a powerful technique for automated feature generation, since deep learning architectures can effectively capture highly complicated non-linear features. To make use of this powerful feature generation ability, the state-of-the-art method [13] leverages a Deep Belief Network (DBN) to learn semantic features from token vectors extracted from programs' ASTs, which outperforms traditional features-based approaches in defect prediction. However, it overlooks the structural information of programs, which can lead to more accurate defect prediction.

ASTs also contain structural information, specifying how adjacent tokens (i.e., nodes on ASTs) interact with each other to accomplish certain functionality. A slight difference in local structure may lead to a large variance in program results, or even a program crash. For example, in Figure 1, there are two Java files, both of which contain a for statement, a remove function, and an add function. The only difference between the two files is the order of the remove function and the add function. File2.java will encounter a NoSuchElementException when it calls remove at the beginning while the queue is empty. By treating a program as an unordered bag of words, the state-of-the-art methods overlook this local structural information. As reported by deep learning researchers in speech recognition [14] and image classification [15], the Convolutional Neural Network (CNN) is more advanced than the DBN, since the former can capture local patterns more effectively. Thus CNN is more capable of detecting local patterns such as the order difference in Figure 1, and of conducting defect prediction.


Fig. 1. A motivating example: two Java files, File1.java and File2.java. File2.java will encounter an exception when it calls remove() at the beginning if the queue is empty.

In addition, to better exploit the program context, the word embedding technique [16] can be helpful as well. Word embedding maps each AST token into a numerical vector, which is trained with regard to the context of each token. Consequently, tokens appearing in similar contexts tend to have similar vector representations that are close in the feature space, which can benefit CNN in learning the program semantics in certain contexts. Besides deep learning-generated features, traditional defect prediction features such as complexity metrics and process metrics have been shown to be informative in distinguishing buggy code [8], and may complement features generated by deep learning. Intuitively, by combining CNN-learned and traditional features, we can obtain a richer feature representation of buggy source code.

In this paper, we propose a framework called Defect Prediction via Convolutional Neural Network (DP-CNN), which captures both semantic and structural features of programs. Specifically, we first parse source code into ASTs, and select representative nodes on the ASTs to form token vectors; thus each source file is represented by a token vector. Then we conduct mapping and word embedding, which convert the token vectors into numerical vectors, and input the numerical vectors to a CNN. The CNN automatically generates semantic and structural features of the source code, which are then combined with several traditional defect prediction features. Finally, the combined features are fed into a Logistic Regression classifier. We evaluate our method on seven open-source Java projects with well-established labels (i.e., buggy or clean), in terms of F-measure in defect prediction. The experimental results indicate that, on average, the proposed DP-CNN improves the state-of-the-art DBN-based method [13] by 12% and the traditional features-based method by 16%. In summary, this paper makes the following contributions:

• We propose a CNN-based defect prediction framework to automatically generate discriminative features from programs' ASTs, which preserves the semantic and structural information of the source code.

• We employ word embedding to encode tokens extracted from ASTs, which benefits CNN in learning the semantics of programs.

• We combine the CNN-learned features with traditional defect prediction features, taking advantage of both non-linear features and hand-crafted features.

The rest of this paper is organized as follows. Section II introduces the background of defect prediction and CNN. Section III elaborates our proposed DP-CNN, which automatically learns semantic and structural features from source code for defect prediction. Section IV shows the experimental setup and results, including the parameter tuning. Sections V and VI present the threats to validity and related work, respectively. Finally, we conclude this paper and discuss plans for future work in Section VII.

II. BACKGROUND

In this section, we briefly introduce the background of file-level defect prediction techniques and the convolutional neural network. Here, file-level means that each training or test instance is a source code file.

Fig. 2. Defect Prediction Process

A. Defect Prediction

Software defect prediction is a process of predicting code areas that potentially contain defects, which can help developers allocate their testing efforts by first checking potentially buggy code [3]. Defect prediction is essential to ensuring the reliability of today's large-scale software. Figure 2 presents a typical file-level defect prediction process which is commonly adopted in the literature [5], [13], [17]. As the process shows, the first step is to collect source code files (instances) from software archives and label them as buggy or clean. The labeling process is based on the number of post-release defects of each file: a file is labeled as buggy if it contains at least one post-release bug; otherwise, the file is labeled as clean. The second step is to extract features from each file. There are many traditional features defined in past studies, which can be categorized into two kinds: code metrics (e.g., McCabe features [11] and CK features [12]) and process metrics (e.g., change histories). The instances with the corresponding features and labels are subsequently employed to train predictive classifiers using various machine learning algorithms such as SVM, Naive Bayes, and Dictionary Learning [5]. Finally, new instances are fed into the trained classifier, which predicts whether the files are buggy or clean.

The set of instances used for building the classifier is the training set, while the test set includes the instances used for evaluating the learned classifier. In this work, we focus on within-project defect prediction, i.e., the training and test sets belong to the same project. Following the previous work in this field [13], we use the instances from an older version of a project for training, and instances from a newer version for testing.

B. Convolutional Neural Network

Fig. 3. A CNN architecture. The architecture depicts the two characteristics of CNN: Sparse Connectivity and Shared Weights, which enable CNN to capture local structural information of the inputs.

Convolutional Neural Networks (CNNs) are a specialized kind of neural network for processing data that have a known grid-like topology [18], such as time-series data on a 1D grid and image data on a 2D grid. CNNs have proven successful in many practical fields, including speech recognition [14], image classification [15], [19], and natural language processing [20]. In this work, we leverage CNNs for effective feature generation from source code. Figure 3 depicts the general architecture of CNNs. Compared with traditional Artificial Neural Networks (ANNs) or Multi-Layer Perceptrons (MLPs), CNNs have two key characteristics, Sparse Connectivity and Shared Weights, which benefit our defect prediction by capturing local structural information of programs.

Sparse Connectivity means that CNNs employ a local connectivity pattern between neurons of adjacent layers to exploit spatially local correlation of the inputs. For example, in Figure 3, the inputs of units in hidden layer m come from a subset of units in layer m−1 that are spatially contiguous. The size of the subset is 3, so units in layer m only connect to 3 adjacent neurons in the layer below (i.e., neurons in the dashed rectangle), rather than connecting to all the neurons as in traditional ANNs. Each subset acts as a local filter over the input vector, which can produce strong responses to a spatially local input pattern. Each local filter applies a non-linear transformation just like usual ANNs: multiplying the input with a linear filter, adding a bias term, and then applying a non-linear function. In Figure 3, if we denote the i-th hidden unit in layer m as $h_i^m$, then the local filter in layer m−1 acts as follows (for sigmoid non-linearities):

$$h_i^m = \mathrm{sigmoid}\left((W^{m-1} * x)_i + b^{m-1}\right), \tag{1}$$

where $W^{m-1}$ and $b^{m-1}$ denote the weights and bias of the local filter, $x$ is the input vector, and $*$ denotes convolution.
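To make Eq. (1) concrete, the following minimal sketch (Python/NumPy; the array values are made up for illustration) slides one shared local filter of length 3 over a toy input vector:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_filter(x, W, b):
    """Eq. (1): each output unit sees only len(W) spatially contiguous
    inputs (sparse connectivity), and every position reuses the same
    weights W and bias b (shared weights)."""
    k = len(W)
    return np.array([sigmoid(np.dot(W, x[i:i + k]) + b)
                     for i in range(len(x) - k + 1)])

x = np.array([0.2, 0.5, 0.1, 0.9, 0.4, 0.7])  # toy activations of layer m-1
W = np.array([0.3, -0.1, 0.6])                # one shared local filter
print(local_filter(x, W, b=0.1))              # activations of layer m
```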

Shared Weights means that each filter shares the same parameterization (weight vector and bias) across the input. In our example in Figure 3, we show a local filter consisting of 3 units. Across the entire layer m−1, there are 3 local filters, and the same-colored arrows indicate that they share the same weights. Replicating filters in this way enables us to detect features regardless of their position in the input vector. Moreover, weight sharing greatly increases learning efficiency by reducing the number of free parameters.

Another important concept of CNNs is max-pooling, which partitions the output vector into several non-overlapping sub-regions and outputs the maximum value of each sub-region. This is a smart way of reducing the dimensionality of intermediate representations and providing additional robustness to our defect prediction.
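A minimal illustration of max-pooling over a 1D vector (the values are arbitrary):

```python
import numpy as np

def max_pool(v, pool_size):
    """Partition v into non-overlapping sub-regions of pool_size
    and keep only the maximum of each region."""
    n = len(v) // pool_size * pool_size          # drop any ragged tail
    return v[:n].reshape(-1, pool_size).max(axis=1)

v = np.array([0.1, 0.7, 0.3, 0.2, 0.9, 0.4])
print(max_pool(v, 2))                            # [0.7 0.3 0.9]
```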

The effectiveness of CNNs largely depends on their parameters, such as the filter length and the batch size; the model may not even converge under a bad parameter setting. Thus, parameter tuning is key to training a successful CNN. We will discuss how to set these parameters in Section IV-F.

III. APPROACH

In this section, we elaborate our proposed DP-CNN, a framework which automatically generates semantic and structural features from source code and combines them with traditional features for accurate software defect prediction. Figure 4 illustrates the overall workflow of DP-CNN.

As the workflow shows, we first parse the source code of both training files and test files into Abstract Syntax Trees (ASTs), then select representative nodes on the ASTs to form token vectors. Thus each source file becomes a token vector, which is fed into the following encoding phase. We build a mapping between integers and tokens, and employ word embedding to encode token vectors as numerical vectors, which are input to the subsequent CNN. The CNN automatically generates semantic and structural features of the source code from the input vectors, which are then combined with several traditional defect prediction features. This feature generation process is elaborated in Figure 6. Finally, the combined features are fed into a Logistic Regression classifier. After building our classifier model (i.e., deciding the weights and biases in the CNN and Logistic Regression), we can produce a probability for each fresh code file, indicating whether it is buggy or clean.

To sum up, our approach consists of four major steps: 1) parsing source code into ASTs and extracting tokens, 2) encoding token vectors into numerical vectors, 3) employing CNN to generate semantic and structural features and combining them with traditional defect prediction features, and 4) building a Logistic Regression classifier to decide whether fresh code files are buggy or clean.

A. Parsing Source Code

In order to represent each source code file as a vector, we should first answer a fundamental question: what is the proper granularity of representation? In general, vector representations map a symbol to a real-valued, distributed vector. For software programs, possible granularities of the symbol include the character level, the token level, nodes on ASTs, etc. As analyzed in [21], only nodes on ASTs are a suitable granularity to build program representations, since they preserve both the syntactic and structural information of the programs. In our experiments, we employ an open-source Python package called javalang2 to parse our Java source code into ASTs. Javalang provides a lexer and parser targeting Java 8, whose implementation is based on the Java language specification.

Fig. 4. The overall workflow of our proposed DP-CNN: (a) parsing source code and extracting token vectors, (b) encoding token vectors, (c) generating features via CNN, and (d) predicting defects (buggy or clean).

Following the state-of-the-art method [13], we only select three types of nodes on ASTs as tokens: 1) nodes of method invocations and class instance creations, which are recorded as their method names or class names; 2) declaration nodes, i.e., method declarations, type declarations, and enum declarations, whose values are extracted as our tokens; and 3) control-flow nodes, including IfStatement, WhileStatement, ForStatement, ThrowStatement, CatchClause, etc., which are simply recorded as their node types. We exclude other types of AST nodes, such as Assignment, because they are often method-specific or class-specific and do not have consistent meanings throughout the whole project. All the selected AST nodes are listed in Figure 5.

In this way, we convert each source file into a token vector. Take the two Java files in Figure 1 as an example: after the tokenization described above, File1.java and File2.java are denoted as [<FOR>, add, remove] and [<FOR>, remove, add], respectively.
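A minimal sketch of this extraction using javalang (assuming its pre-order tree iteration; the node types below are a simplified subset of Figure 5, and the control-flow naming, e.g. <ForStatement> rather than <FOR>, is our own choice):

```python
import javalang
from javalang.tree import (MethodInvocation, ClassCreator,
                           MethodDeclaration, ClassDeclaration,
                           IfStatement, ForStatement, WhileStatement)

def extract_tokens(source):
    """Parse one Java source string and keep only the selected AST nodes:
    invocations/instance creations (by name), declarations (by name),
    and control-flow nodes (by node type)."""
    tokens = []
    tree = javalang.parse.parse(source)
    for _, node in tree:                      # pre-order walk over the AST
        if isinstance(node, MethodInvocation):
            tokens.append(node.member)        # invoked method name
        elif isinstance(node, ClassCreator):
            tokens.append(node.type.name)     # instantiated class name
        elif isinstance(node, (MethodDeclaration, ClassDeclaration)):
            tokens.append(node.name)          # declaration name
        elif isinstance(node, (IfStatement, ForStatement, WhileStatement)):
            tokens.append('<%s>' % type(node).__name__)
    return tokens
```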

B. Encoding Tokens and Handling Imbalance

1) Encoding Tokens: Since CNNs require numerical vectors as inputs, the extracted token vectors cannot be directly fed into a CNN. To solve this problem, we first build a mapping between integers and tokens, and convert the token vectors into integer vectors. Each token is associated with a unique integer identifier ranging from 1 to the total number of token types. In this way, identical tokens map to the same identifier, while different tokens (such as different method names and class names) remain distinct. CNNs also require input vectors of equal length, but our converted integer vectors may differ in length. In response, we simply append 0s to each integer vector, making all lengths consistent with the longest vector; 0 carries no meaning, since we encode tokens starting from 1. Additionally, during the encoding process, we filter out infrequent tokens, which might be designed for a specific file and not generalize to other files. Specifically, we only encode tokens occurring three or more times, and denote the others as zeros.

2 https://github.com/c2nes/javalang

Fig. 5. The selected AST nodes: FormalParameter, BasicType, PackageDeclaration, InterfaceDeclaration, CatchClauseParameter, ClassDeclaration, MethodInvocation, SuperMethodInvocation, MemberReference, SuperMemberReference, ConstructorDeclaration, ReferenceType, MethodDeclaration, VariableDeclarator, IfStatement, WhileStatement, DoStatement, ForStatement, AssertStatement, BreakStatement, ContinueStatement, ReturnStatement, ThrowStatement, SynchronizedStatement, TryStatement, SwitchStatement, BlockStatement, StatementExpression, TryResource, CatchClause, SwitchStatementCase, ForControl, EnhancedForControl.
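A sketch of this encoding step (pure Python; the function and variable names are our own):

```python
from collections import Counter

def encode(token_vectors, min_count=3):
    """Map tokens to integer ids starting at 1, replace tokens seen
    fewer than min_count times with 0, and zero-pad every vector to
    the length of the longest one."""
    counts = Counter(t for vec in token_vectors for t in vec)
    vocab, encoded = {}, []
    for vec in token_vectors:
        ids = [0 if counts[t] < min_count
               else vocab.setdefault(t, len(vocab) + 1)
               for t in vec]
        encoded.append(ids)
    max_len = max(len(v) for v in encoded)
    return [v + [0] * (max_len - len(v)) for v in encoded], vocab
```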

As discussed in Section I, we also employ word embedding [16] in the encoding phase. However, our word embedding is built and trained at the same time as the CNN, so we wrap word embedding as a part of our CNN architecture and discuss it in the following CNN part.

2) Handling Imbalance: Software defect data are often imbalanced: the number of buggy instances is much smaller than the number of clean instances. Imbalanced data will degrade the performance of our model. To address this problem, two approaches are feasible. One approach is to reduce the training instances of the majority class (i.e., the clean files), while the other is to duplicate training instances of the minority class (i.e., the buggy files). As the first approach would lose some information, we use the second approach and duplicate the buggy files several times until we have a balanced dataset. Note that we only apply this process to the training files.

Fig. 6. The feature generation process of DP-CNN, which elaborates step (c) in Figure 4: an integer vector passes through an embedding layer, a convolutional layer, a max-pooling layer, and a fully connected hidden layer; the resulting CNN-generated features are concatenated with traditional hand-crafted features and fed to the output layer (Logistic Regression).
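A simple sketch of this oversampling (random duplication of buggy instances; applied to the training split only):

```python
import random

def oversample(instances, labels):
    """Duplicate minority-class (buggy, label 1) instances until the
    classes are balanced; return a reshuffled training set."""
    buggy = [x for x, y in zip(instances, labels) if y == 1]
    clean = [x for x, y in zip(instances, labels) if y == 0]
    if not buggy:                       # nothing to duplicate
        return instances, labels
    while len(buggy) < len(clean):
        buggy.append(random.choice(buggy))
    data = [(x, 1) for x in buggy] + [(x, 0) for x in clean]
    random.shuffle(data)
    return [x for x, _ in data], [y for _, y in data]
```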

C. Building CNN and Combining Traditional Features

1) Building CNN: As discussed in Section II-B, we take advantage of CNN's powerful feature generation ability to capture semantic and local structural information of the source code. We train our CNN (i.e., the weights and biases in the CNN) using the training data. Considering that this work engages CNN only as an application, we adopt a standard CNN architecture rather than the more complex architectures found in some other studies [15], [19]. In particular, our CNN consists of an embedding layer (i.e., word embedding), a convolutional layer, a max-pooling layer, a fully-connected hidden layer, and finally a single-unit output layer working as a Logistic Regression classifier (the last step in Figure 4). The overall architecture is illustrated in Figure 6. Except for the output layer, which uses the sigmoid activation function, all layers use the ReLU activation function. Our implementation is based on Keras3, through which we can easily and quickly build neural networks. There is an example in Keras demonstrating the use of a 1D Convolutional Neural Network for text classification, which we take as our reference.

As shown in Figure 6, we employ word embedding [16] as the first layer, which turns positive integers (indexes) into real-valued vectors of fixed size. Obviously, a simple index does not carry much context information about the token extracted from the ASTs. However, word embedding is trained with regard to the context of each token: a feature vector is learned for each token, and tokens appearing in similar contexts tend to have similar vector representations that are close in the feature space. In this way, we can further exploit the semantics of programs via CNN. A word embedding is defined as f : M → R^n, where M represents the dictionary of words (or tokens in this work) and f is a parameterized function mapping words to n-dimensional vectors. The parameters of the word embedding are initialized randomly and learned at the same time as the other parameters of the following CNN architecture. The implementation of word embedding is also based on Keras. Therefore, we wrap word embedding as a part of our CNN architecture and discuss them together in this paper. Hereinafter, the term CNN always includes word embedding.

3 Keras (http://keras.io) is a popular deep learning library for Python.
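A minimal Keras sketch of this architecture, using the hyperparameters selected in Section IV-F (embedding dimension 30, 10 filters of length 5, 100 hidden nodes); vocab_size and max_len come from the encoding step, and the pooling size is our assumption:

```python
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense

def build_cnn(vocab_size, max_len):
    model = Sequential()
    # Word embedding: integer token ids -> 30-dim real-valued vectors,
    # trained jointly with the rest of the network (id 0 is the padding).
    model.add(Embedding(vocab_size + 1, 30, input_length=max_len))
    model.add(Conv1D(10, 5, activation='relu'))   # 10 filters of length 5
    model.add(MaxPooling1D(pool_size=2))          # max-pooling
    model.add(Flatten())
    model.add(Dense(100, activation='relu'))      # fully connected hidden layer
    model.add(Dense(1, activation='sigmoid'))     # Logistic Regression unit
    return model
```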

Our CNN model is trained using the minibatch stochastic gradient descent (SGD) algorithm [22] with the Adam optimizer [23]. We will discuss the details of parameter tuning, such as the batch size and the number of training epochs, in Section IV-F.
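Training then reduces to compiling and fitting the model; the batch size of 32 and 15 epochs below are the values selected in Section IV-F, while x_train and y_train stand for the encoded, oversampled training data (our naming):

```python
model = build_cnn(vocab_size=3000, max_len=2000)   # illustrative sizes
model.compile(optimizer='adam',                    # Adam over minibatch SGD
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=15)
```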

2) Combining Traditional Features: So far we have only considered static code features through the CNN. However, in conventional defect prediction methods, other features such as complexity metrics and process metrics have been shown to be informative in distinguishing buggy code [8]. In fact, our dataset provides several traditional defect prediction features for each file, which were carefully extracted by the dataset contributors. To make use of this information, we directly concatenate the CNN-learned feature vectors with the traditional hand-crafted feature vectors. This concatenation can be realized via the Merge operator in Keras. Finally, the combined feature vectors are fed into the subsequent Logistic Regression classifier. To demonstrate the effectiveness of combining traditional features, we design a variant of DP-CNN which directly feeds the CNN-learned features to the final classifier without concatenation. In the experiments, we compare this variant with DP-CNN, as well as with other state-of-the-art methods.
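A sketch of the concatenation with the Keras functional API (the Merge operator of older Keras releases corresponds to concatenate in current ones; layer sizes follow the sketch above, and the input lengths are illustrative):

```python
from keras.models import Model
from keras.layers import (Input, Embedding, Conv1D, MaxPooling1D,
                          Flatten, Dense, concatenate)

tokens = Input(shape=(2000,))                   # encoded integer vector
hand = Input(shape=(20,))                       # 20 traditional features
x = Embedding(3001, 30)(tokens)
x = Conv1D(10, 5, activation='relu')(x)
x = MaxPooling1D(pool_size=2)(x)
x = Flatten()(x)
x = Dense(100, activation='relu')(x)            # CNN-generated features
merged = concatenate([x, hand])                 # feature concatenation
out = Dense(1, activation='sigmoid')(merged)    # Logistic Regression
dp_cnn = Model(inputs=[tokens, hand], outputs=out)
```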

D. Predicting Defects

We employ Logistic Regression as the final classifier, since it is widely used in the literature [24] and we mainly focus on feature generation in this paper. We process each file in both the training set and the test set following the above steps, and obtain the semantic and structural features of each source file. After we train our model using the training files with their corresponding labels, both the weights and the biases in our CNN and Logistic Regression are fixed. Then, for each file in the test set, we feed it into our defect prediction model, and the final classifier outputs a value indicating the probability of this file being buggy.

TABLE I
DATASET DESCRIPTION

Project | Description | Versions (Tr, T) | Avg. Files | Buggy Rate (%)
camel | Enterprise integration framework | 1.4, 1.6 | 892 | 18.6
jEdit | Text editor designed for programmers | 4.0, 4.1 | 284 | 23.8
lucene | Text search engine library | 2.0, 2.2 | 210 | 55.7
xalan | A library for transforming XML files | 2.5, 2.6 | 815 | 48.5
xerces | XML parser | 1.2, 1.3 | 441 | 15.5
synapse | Data transport adapters | 1.1, 1.2 | 239 | 30.5
poi | Java library to access Microsoft format files | 2.5, 3.0 | 409 | 64.7
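Prediction on held-out files then reduces to thresholding the model's output probabilities (a sketch; the 0.5 cut-off is our assumption, not a value stated in the paper):

```python
probs = dp_cnn.predict([x_test_tokens, x_test_hand])   # P(buggy) per file
labels = (probs.ravel() > 0.5).astype(int)             # 1 = buggy, 0 = clean
```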

IV. EVALUATION

In this section, we evaluate the effectiveness of our DP-CNN by comparing its defect prediction accuracy with other state-of-the-art methods. In particular, our evaluation addresses the following research questions (RQs):

• RQ1: Do the deep learning-based methods outperform traditional features-based methods?

• RQ2: Does DP-CNN, which combines traditional features, outperform deep learning-based methods?

• RQ3: How does DP-CNN perform under different parameter settings?

All our experiments were run on a Linux server with one Tesla K40m GPU. Unless otherwise stated, each experiment was run ten times and the average results are reported.

A. Evaluation Metrics

To evaluate the prediction accuracy, we use a metric widely adopted in the literature [8], [17]: the F-measure (also called the F1 score), which is the harmonic mean of precision and recall [25].

We first present some notation for defining precision, recall, and F-measure: (i) predicting a buggy file as buggy (b→b); (ii) predicting a buggy file as clean (b→c); and (iii) predicting a clean file as buggy (c→b). N denotes the number of files in each of the above situations, e.g., N_{b→b} for the first case. Then our metrics can be defined as follows.

Precision: the ratio of the number of files correctly classified as buggy to the number of files classified as buggy:

$$P = \frac{N_{b\to b}}{N_{b\to b} + N_{c\to b}}. \tag{2}$$

Recall: the ratio of the number of files correctly classified as buggy to the number of truly buggy files:

$$R = \frac{N_{b\to b}}{N_{b\to b} + N_{b\to c}}. \tag{3}$$

F-measure: the traditional F-measure (F1 score) is the harmonic mean of precision P and recall R:

$$F = \frac{2 \cdot P \cdot R}{P + R}. \tag{4}$$

Usually, there are trade-offs between precision and recall. For example, by predicting all the test files as buggy, we obtain a recall of 1 but a very low precision. The F-measure is therefore a composite measure of precision and recall, which falls in the range [0, 1]: the higher the F-measure, the better the prediction performance.
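The metric computation, as a plain-Python sketch of Eqs. (2)-(4):

```python
def f_measure(y_true, y_pred):
    """F1 from Eqs. (2)-(4); labels are 1 for buggy, 0 for clean."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # b -> b
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # c -> b
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # b -> c
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```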

B. Dataset Description

Our defect prediction dataset comes from the tera-PROMISE Repository4, a publicly available repository specializing in software engineering research datasets. We select seven open-source Java projects from this repository, for which the version numbers, the class name of each file, and, most importantly, the defect label of each source file are provided. With the version numbers and class names, we can extract the source code of each file from GitHub5 and use it in our DP-CNN framework. Table I shows the details of these projects, including the project description, versions, the average number of files, and the average buggy rate. To obtain the training and test data, following the state-of-the-art method [13], we use files from two consecutive versions of each project: the older version is denoted as Tr, and the newer version is denoted as T. On average, the number of files in each project is 330, and the buggy rate of each project is 35%.

Moreover, this dataset provides 20 traditional defect prediction features for each source file, including Lines of Code (LOC), Weighted Methods per Class (WMC), Depth of Inheritance Tree (DIT), Number of Children (NOC), McCabe complexity measures (Max CC and Avg CC), etc. The 20 traditional features were carefully extracted by Jureczko et al., the dataset contributors [26]. We list the detailed descriptions of the 20 features in Table III. These features and data have been widely used in previous studies [3], [5], [17].

C. Baseline Methods

We compare our proposed DP-CNN for defect prediction with the following baseline methods:

• DBN [13]: the state-of-the-art method, which employs a Deep Belief Network (DBN) on source code to extract semantic features for defect prediction.

• Traditional [24]: the traditional method, which builds a Logistic Regression classifier based on the 20 features.

• DBN+: an improved version of DBN proposed by us, which combines the semantic features with the traditional features as we do in this work.

• CNN: a variant of DP-CNN which directly feeds the CNN-learned features to the final classifier without combining traditional features.

4 http://openscience.us/repo/defect/
5 https://github.com/apache

Fig. 7. Performance comparison between the traditional method and deep learning-based methods for defect prediction (F-measure values on each of the seven projects).

When implementing DBN and DBN+, we use the same network architecture and parameter settings as in [13], i.e., 10 hidden layers and 100 nodes in each hidden layer. For a fair comparison, we follow the same process and tools as in our own work to parse source code, encode tokens, and handle data imbalance.

D. Performance of Deep Learning-based Methods (RQ1)

We first compare the traditional features-based method for defect prediction with the two deep learning-based methods, i.e., DBN and CNN, as introduced in the previous section. The purpose of this comparison is to validate the advantage of deep learning techniques in software defect prediction. We conduct seven sets of defect prediction experiments on the projects listed in Table I; within each set, the older version is used to train the prediction models, and the newer version is used as the test set to evaluate the trained models.

Figure 7 shows the experimental results, i.e., the F-measure values on each project for the three competing methods. Take project camel as an example: after training on version 1.4 and testing on version 1.6, the F-measure of defect prediction is 0.329, 0.335, and 0.505 for the traditional method, DBN, and CNN, respectively; both DBN and CNN outperform the traditional method. We can see from Figure 7 that, in most cases, the traditional method achieves the lowest F-measure, indicating the advantages of deep learning-based methods over the traditional one. More significantly, CNN achieves the best performance. These results not only validate the effectiveness of the state-of-the-art DBN method [13], but also justify our proposed CNN method, which performs even better than DBN.

TABLE II
PERFORMANCE COMPARISON OF DIFFERENT MODELS

Project | Traditional | DBN | DBN+ | CNN | DP-CNN
camel | 0.329 | 0.335 | 0.375 | 0.505 | 0.508
jEdit | 0.573 | 0.480 | 0.549 | 0.631 | 0.580
lucene | 0.618 | 0.758 | 0.761 | 0.761 | 0.761
xalan | 0.627 | 0.681 | 0.681 | 0.676 | 0.696
xerces | 0.273 | 0.261 | 0.276 | 0.311 | 0.374
synapse | 0.500 | 0.503 | 0.486 | 0.512 | 0.556
poi | 0.748 | 0.780 | 0.782 | 0.778 | 0.784
Average | 0.524 | 0.543 | 0.559 | 0.596 | 0.608

E. Performance of Combining Traditional Features (RQ2)

After validating the effectiveness of deep learning-based methods in defect prediction, we continue to improve DBN and CNN by combining traditional features as described in Section III-C. Taking both traditional hand-crafted features and deep learning-based features into consideration, we can expect to achieve more accurate prediction models. We consequently run experiments on the seven projects with all five models, including our proposed DP-CNN. As before, the older version of each project is used as the training set, while the newer version is used as the test set. The corresponding F-measure values of all models are listed in Table II.

In Table II, each row represents one project, with the project name in the first column and the F-measure of each method in the remaining five columns; the highest value in each row is marked in bold. The results generally validate our intuition that including the traditional features in the deep learning-based methods improves the prediction accuracy. For example, on project camel, DBN+ produces an F-measure of 0.375, which is 4% higher than DBN's, while DP-CNN achieves the highest F-measure of 0.508, slightly better than CNN's 0.505. Among all the experiments, DP-CNN achieves the best performance on six out of the seven projects. The only exception is jEdit, where DP-CNN produces a lower F-measure than CNN. The reason may be that the training set of jEdit is relatively small, so adding traditional features to CNN may cause overfitting and degrade the performance. Also, in most cases DBN+ performs slightly better than DBN.

As shown in the last row of Table II, on average, the order of model accuracy from lowest to highest is Traditional, DBN, DBN+, CNN, and DP-CNN. More specifically, DP-CNN improves Traditional, DBN, and CNN by 16%, 12%, and 2%, respectively. This answers RQ2: combining traditional features is beneficial, and DP-CNN performs the best.

F. Performance under Different Parameter Settings (RQ3)

In this section, we discuss how we set the free parameters in DP-CNN to achieve the best performance. Due to space limitations, we only analyze three parameters that are key to CNNs: the number of filters, the filter length, and the number of nodes in the hidden layers. We vary the values of these three parameters and conduct experiments on projects camel, xalan, and xerces, respectively. For the other parameters, we directly present the values obtained via our validation: the batch size is set to 32, the number of training epochs to 15, and the embedding dimension to 30.

Figure 8 shows the F-measure of DP-CNN under different numbers of filters, different filter lengths, and different numbers of hidden nodes. We can see that the optimal number of filters is 10, where the three curves generally reach their peak, while the filter length makes little difference in F-measure. Considering that a larger filter length takes more running time, we choose a filter length of 5. The number of hidden nodes is set to 100; similarly to the number of filters, the three curves roughly peak at 100.
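Such a sweep can be scripted by retraining the model over the candidate values shown on the x-axes of Figure 8 and recording the resulting F-measure (a sketch; train_and_eval is a hypothetical helper wrapping the training and evaluation pipeline above):

```python
# Vary one parameter at a time, holding the others at their chosen values.
for n_filters in [1, 2, 3, 5, 10, 20, 50, 100, 150, 200]:
    f1 = train_and_eval(n_filters=n_filters, filter_len=5, n_hidden=100)
    print(n_filters, f1)        # F-measure peaks around 10 filters
```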

Obviously, these optimal model parameters are application-dependent. But with our automated DP-CNN framework, they are not difficult to determine when the relevant data are available.

V. THREATS TO VALIDITY

A. Implementation of DBN

To evaluate the performance of our method in defect prediction, we compare our proposed DP-CNN with DBN [13], which is the state-of-the-art defect prediction technique. Since the original implementation of DBN is not released, we have reimplemented our own version of DBN via Keras, as well as DBN+. Although we strictly followed the procedures and parameter settings described in the DBN paper, some implementation details were not mentioned in the paper, such as the full list of selected AST nodes and the learning rate used when training the neural networks. Thus, our new implementation may not reflect all the implementation details of the original DBN method. However, we have consulted the first author of the DBN paper via email regarding key implementation details, and we are confident that our implementation is very close to the original DBN.

Fig. 8. Performance of DP-CNN under different parameter settings on camel, xalan, and xerces: (a) different numbers of filters, (b) different filter lengths, and (c) different numbers of hidden nodes.

B. Datasets Selection

We conducted our experiments using seven open-source projects from the PROMISE dataset, which might not be representative of all software projects. Besides, we only evaluated DP-CNN on projects written in the Java language; on projects that are not among the seven or that are written in other programming languages (e.g., C++ or Python), our proposed method might generate better or worse results. To make DP-CNN more generalizable, in the future, we will conduct experiments on a variety of projects, including open-source and closed-source projects, and extend our method to other programming languages.

TABLE III
DESCRIPTION OF THE 20 TRADITIONAL FEATURES IN PROMISE DATASETS [24]

Feature Name | Symbol | Description
Weighted Methods per Class | WMC | The number of methods in the class.
Depth of Inheritance Tree | DIT | Indicates the position of the class in the inheritance tree.
Number of Children | NOC | The number of immediate descendants of the class.
Coupling Between Object classes | CBO | The value increases when the methods of one class access services of another.
Response for a Class | RFC | Number of methods invoked in response to a message to the object.
Lack of Cohesion in Methods | LCOM | Number of pairs of methods that do not share a reference to an instance variable.
Lack of Cohesion in Methods, different from LCOM | LCOM3 | If m and a are the numbers of methods and attributes in a class and μ(a) is the number of methods accessing an attribute, then lcom3 = ((1/a) Σ_a μ(a) − m) / (1 − m).
Number of Public Methods | NPM | The number of all the methods in a class that are declared as public.
Data Access Metric | DAM | Ratio of the number of private (protected) attributes to the total number of attributes.
Measure of Aggregation | MOA | The number of data declarations (class fields) whose types are user-defined classes.
Measure of Functional Abstraction | MFA | Number of methods inherited by a class plus number of methods accessible by member methods of the class.
Cohesion Among Methods of class | CAM | Summation of the number of different types of method parameters in every method, divided by the product of the number of different method parameter types in the whole class and the number of methods.
Inheritance Coupling | IC | The number of parent classes to which a given class is coupled.
Coupling Between Methods | CBM | Total number of new/redefined methods to which all the inherited methods are coupled.
Average Method Complexity | AMC | The average method size, measured in Java byte codes.
Afferent couplings | Ca | How many other classes use the specific class.
Efferent couplings | Ce | How many other classes are used by the specific class.
Maximum McCabe | Max(CC) | Maximum McCabe's cyclomatic complexity value among the methods of the class.
Average McCabe | Avg(CC) | Average McCabe's cyclomatic complexity value among the methods of the class.
Lines of Code | LOC | Measures the volume of code.
Number of Bugs | Bug | Number of bugs detected in the class (the label).


VI. RELATED WORK

A. Software Defect Prediction

Software defect prediction is an active research area in Software Engineering [3], [4], [5], [6], [7], [13], [27], [28]. In the literature, most defect prediction techniques focus on manually designing new discriminative features, or new combinations of features, from labeled historical defect data, which are fed into machine learning classifiers to identify code defects [3]. Commonly used features can be categorized into static code features and process features [8]. Static code features [11], [12], [29] were introduced in previous sections; here we discuss the process features, which were proposed more recently and used in defect prediction. Moser et al. [4] employed the number of revisions, authors, past fixes, and ages of files as features to predict defects. Nagappan et al. [30] proposed code churn features and showed that these features were effective for defect prediction. Moreover, Hassan et al. [31] used entropy-of-change features in predicting defects. Based on these code and process features, many machine learning models have been built for two different defect prediction tasks: within-project defect prediction and cross-project defect prediction.

Within-project defect prediction means that both the training data and test data come from the same project, as in our work in this paper. When defect prediction was first proposed in 1971 [32], it meant within-project prediction, and this problem has been well explored in the past. For example, Elish et al. [33] evaluated the feasibility of the Support Vector Machine (SVM) in predicting defect-prone software modules, comparing SVM with eight other statistical learning methods on four NASA datasets. Amasaki et al. [34] proposed to employ the Bayesian Belief Network in predicting the final quality of a software product. Wang et al. [35] examined C4.5, a kind of Decision Tree (DT) algorithm, in defect prediction; their results indicated that tree-based algorithms can generate good predictions. Moreover, Jing et al. [5] introduced dictionary learning techniques to defect prediction; their cost-sensitive dictionary learning based approach significantly improved defect prediction in their experiments.

Recently, more and more papers have studied the cross-project defect prediction problem, where the training data and test data come from different projects. Zimmermann et al. [36] evaluated the performance of cross-project defect prediction on 12 projects and their 622 combinations; they found that the defect prediction models at that time could not adapt well to cross-project defect prediction. Premraj et al. [37] compared network and code metrics for defect prediction, and further built six cross-project defect prediction models using those metric sets; their results confirmed that cross-project defect prediction is a challenging problem. The state-of-the-art cross-project defect prediction was proposed by Nam et al. [7], who adopted a state-of-the-art transfer learning technique called Transfer Component Analysis (TCA). They further improved TCA into TCA+ by optimizing TCA's normalization process. They evaluated TCA+ on eight open-source projects, and the results showed that their approach significantly improved cross-project defect prediction.

Our proposed DP-CNN differs from the aforementioned defect prediction approaches in that we utilize a deep learning technique (i.e., CNN) to automatically generate discriminative features from source code, rather than manually designing features, which captures the semantic and structural information of programs and leads to more accurate prediction.

B. Deep Learning in Software Engineering

Hindle et al. [38] were the first to demonstrate that software corpora have a “naturalness” property, such that real programs (or source code) written by real people have rich statistical properties. They successfully applied n-grams, a statistical language model, to software languages to accomplish a code completion task. Recently, deep learning techniques have been adopted in Software Engineering to improve information retrieval-based tasks, due to their powerful feature generation ability and remarkable achievements in other fields [14], [15], [20]. Specifically, Lam et al. [39] proposed to combine deep learning with information retrieval to improve the performance of localizing buggy files for bug reports. Huo et al. [40] employed CNNs on source code in programs and natural language in bug reports to learn features respectively, and then combined the two kinds of features as unified features for bug localization. Raychev et al. [41] employed a Recurrent Neural Network (RNN) to tackle the code completion task, which was further improved by their later Decision Tree-based method [42]. White et al. [43] proposed to model sequential software languages using deep learning and applied their models to the task of code clone detection. Mou et al. [44] proposed a tree-based CNN to better model source code while preserving the structural information, which was employed to classify the functionalities of programs. Gu et al. [45] utilized the RNN encoder-decoder model to address the problem of retrieving API call sequences based on the user's natural language queries. Besides, program synthesis [46], [47], [48], [49] has also become an active research area based on deep learning techniques.

Deep learning has also been applied in defect prediction. Yang et al. [50] applied DBN on 14 existing change-level features to generate new features for change-level defect prediction. Wang et al. [13] further applied DBN on token vectors extracted from programs' ASTs for file-level defect prediction. Our work differs from the first work in that we leverage deep learning to generate features directly from source code rather than from existing features. Our work differs from the second work in that we employ CNN to capture the structural information of programs, and utilize word embedding and the combination of traditional features to further improve our prediction results.

VII. CONCLUSION AND FUTURE WORK

With the ever-increasing scale and complexity of modern software, reliability assurance has become a significant challenge. To enhance the reliability of software, in this paper, we focus on predicting potential code defects in the implementation of software, thereby reducing the workload of software maintenance. Specifically, we propose a defect prediction framework called DP-CNN (Defect Prediction via Convolutional Neural Network), which utilizes CNN for automated feature generation from source code with the semantic and structural information preserved. Besides, we employ word embedding and combine the CNN-learned features with traditional hand-crafted features to further improve our defect prediction. Our experiments on seven open-source projects show that, on average, DP-CNN improves the state-of-the-art DBN-based and traditional features-based methods by 12% and 16%, respectively, in terms of F-measure in defect prediction.

To make DP-CNN more generalizable, in the future, we will conduct experiments on more projects and extend our method to other programming languages like Python. Moreover, the results in this work demonstrate the feasibility of deep learning techniques in the field of program analysis. It is promising to adapt deep learning to other software engineering tasks such as code completion and code clone detection, which will be our future work.

ACKNOWLEDGMENT

The work described in this paper was supported by the National Natural Science Foundation of China (Project No. 61472338) and the Research Grants Council of the Hong Kong Special Administrative Region, China (No. CUHK 14234416 of the General Research Fund).

REFERENCES

[1] A. G. Liu, E. Musial, and M.-H. Chen, “Progressive reliability forecasting of service-oriented software,” in ICWS’11: Proc. of the International Conference on Web Services, 2011.

[2] Bug prediction at Google. [Online]. Available: http://google-engtools.blogspot.hk/

[3] T. Menzies, Z. Milton, B. Turhan, B. Cukic, Y. Jiang, and A. Bener, “Defect prediction from static code features: current results, limitations, new approaches,” Automated Software Engineering, vol. 17, no. 4, pp. 375–407, 2010.

[4] R. Moser, W. Pedrycz, and G. Succi, “A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction,” in ICSE’08: Proc. of the International Conference on Software Engineering, 2008.

[5] X.-Y. Jing, S. Ying, Z.-W. Zhang, S.-S. Wu, and J. Liu, “Dictionary learning based software defect prediction,” in ICSE’14: Proc. of the International Conference on Software Engineering, 2014.

[6] M. Tan, L. Tan, S. Dara, and C. Mayeux, “Online defect prediction for imbalanced data,” in ICSE’15: Proc. of the International Conference on Software Engineering-Volume 2, 2015.

[7] J. Nam, S. J. Pan, and S. Kim, “Transfer defect learning,” in ICSE’13: Proc. of the International Conference on Software Engineering, 2013.

[8] J. Nam, “Survey on software defect prediction,” Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Tech. Rep., 2014.

[9] M. R. Lyu et al., Handbook of Software Reliability Engineering. IEEE Computer Society Press, CA, 1996, vol. 222.

[10] M. H. Halstead, Elements of Software Science (Operating and Programming Systems Series), 1977.

[11] T. J. McCabe, “A complexity measure,” IEEE Transactions on Software Engineering, no. 4, pp. 308–320, 1976.

[12] S. R. Chidamber and C. F. Kemerer, “A metrics suite for object oriented design,” IEEE Transactions on Software Engineering, vol. 20, no. 6, 1994.

[13] S. Wang, T. Liu, and L. Tan, “Automatically learning semantic features for defect prediction,” in ICSE’16: Proc. of the International Conference on Software Engineering, 2016.

[14] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn, “Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition,” in ICASSP’12: Proc. of the International Conference on Acoustics, Speech and Signal Processing, 2012.

[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS’12: Proc. of the Advances in Neural Information Processing Systems, 2012.

[16] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in NIPS’13: Proc. of the Advances in Neural Information Processing Systems, 2013.

[17] T. Menzies, J. Greenwald, and A. Frank, “Data mining static code attributes to learn defect predictors,” IEEE Transactions on Software Engineering, vol. 33, no. 1, pp. 2–13, 2007.

[18] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.

[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, 1998.

[20] X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” in NIPS’15: Proc. of the Advances in Neural Information Processing Systems, 2015.

[21] L. Mou, G. Li, Y. Liu, H. Peng, Z. Jin, Y. Xu, and L. Zhang, “Building program vector representations for deep learning,” arXiv preprint arXiv:1409.3358, 2014.

[22] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010, 2010.

[23] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[24] Z. He, F. Peters, T. Menzies, and Y. Yang, “Learning from open-source projects: An empirical study on defect prediction,” in ESEM’13: Proc. of the International Symposium on Empirical Software Engineering and Measurement, 2013.

[25] C. D. Manning, P. Raghavan, H. Schütze et al., Introduction to Information Retrieval. Cambridge University Press, 2008, vol. 1, no. 1.

[26] M. Jureczko and L. Madeyski, “Towards identifying software project clusters with regard to defect prediction,” in Proc. of the International Conference on Predictive Models in Software Engineering, 2010.

[27] S. Kim, T. Zimmermann, E. J. Whitehead Jr, and A. Zeller, “Predicting faults from cached history,” in ICSE’07: Proc. of the International Conference on Software Engineering, 2007.

[28] S. Zhang, J. Ai, and X. Li, “Correlation between the distribution of software bugs and network motifs,” in QRS’16: Proc. of the International Conference on Software Quality, Reliability and Security, 2016.

[29] R. Harrison, S. J. Counsell, and R. V. Nithi, “An evaluation of the MOOD set of object-oriented software metrics,” IEEE Transactions on Software Engineering, vol. 24, no. 6, pp. 491–496, 1998.

[30] N. Nagappan and T. Ball, “Using software dependencies and churn metrics to predict field failures: An empirical case study,” in ESEM’07: Proc. of the International Symposium on Empirical Software Engineering and Measurement, 2007.

[31] A. E. Hassan, “Predicting faults using the complexity of code changes,” in ICSE’09: Proc. of the International Conference on Software Engineering, 2009.

[32] F. Akiyama, “An example of software system debugging,” in IFIP Congress (1), vol. 71, 1971, pp. 353–359.

[33] K. O. Elish and M. O. Elish, “Predicting defect-prone software modules using support vector machines,” Journal of Systems and Software, vol. 81, no. 5, pp. 649–660, 2008.

[34] S. Amasaki, Y. Takagi, O. Mizuno, and T. Kikuno, “A Bayesian belief network for assessing the likelihood of fault content,” in ISSRE’03: Proc. of the International Symposium on Software Reliability Engineering, 2003.

[35] J. Wang, B. Shen, and Y. Chen, “Compressed C4.5 models for software defect prediction,” in QSIC’12: Proc. of the International Conference on Quality Software, 2012.

[36] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, “Cross-project defect prediction: a large scale experiment on data vs. domain vs. process,” in FSE’09: Proc. of the Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, 2009.

[37] R. Premraj and K. Herzig, “Network versus code metrics to predict defects: A replication study,” in ESEM’11: Proc. of the International Symposium on Empirical Software Engineering and Measurement, 2011.

[38] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu, “On the naturalness of software,” in ICSE’12: Proc. of the International Conference on Software Engineering, 2012.

[39] A. N. Lam, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen, “Combining deep learning with information retrieval to localize buggy files for bug reports (n),” in ASE’15: Proc. of the International Conference on Automated Software Engineering, 2015.

[40] X. Huo, M. Li, and Z.-H. Zhou, “Learning unified features from natural and programming languages for locating buggy source code,” in IJCAI’16: Proc. of the International Joint Conference on Artificial Intelligence, 2016.

[41] V. Raychev, M. Vechev, and E. Yahav, “Code completion with statistical language models,” in ACM SIGPLAN Notices, vol. 49, no. 6. ACM, 2014, pp. 419–428.

[42] V. Raychev, P. Bielik, and M. Vechev, “Probabilistic model for code with decision trees,” in OOPSLA’16: Proc. of the International Conference on Object-Oriented Programming, Systems, Languages, and Applications, 2016.

[43] M. White, M. Tufano, C. Vendome, and D. Poshyvanyk, “Deep learning code fragments for code clone detection,” in ASE’16: Proc. of the International Conference on Automated Software Engineering, 2016.

[44] L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin, “Convolutional neural networks over tree structures for programming language processing,” arXiv preprint arXiv:1409.5718, 2014.

[45] X. Gu, H. Zhang, D. Zhang, and S. Kim, “Deep API learning,” in FSE’16: Proc. of the ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2016.

[46] C. Liu, X. Chen, E. C. Shin, M. Chen, and D. Song, “Latent attention for if-then program synthesis,” in NIPS’16: Proc. of the Advances in Neural Information Processing Systems, 2016.

[47] S. Reed and N. De Freitas, “Neural programmer-interpreters,” arXiv preprint arXiv:1511.06279, 2015.

[48] M. Balog, A. L. Gaunt, M. Brockschmidt, S. Nowozin, and D. Tarlow, “DeepCoder: Learning to write programs,” arXiv preprint arXiv:1611.01989, 2016.

[49] C. Shu and H. Zhang, “Neural programming by example,” in AAAI’17: Proc. of the AAAI Conference on Artificial Intelligence, 2017.

[50] X. Yang, D. Lo, X. Xia, Y. Zhang, and J. Sun, “Deep learning for just-in-time defect prediction,” in QRS’15: Proc. of the International Conference on Software Quality, Reliability and Security, 2015.