Content Classification of

Development Emails

Sreenath Appala, Vallary Singh

Alberto Bacchelli, Tommaso Dal Sasso, Marco D’Ambros, Michele Lanza

REVEAL @ Faculty of Informatics - University of Lugano, Switzerland

Contents

• Introduction

• Motivation

• Related Work

• Data Collection and Classification

• Experiment

• Threats to Validity

• Conclusion

Introduction

• Software repositories support software analysis and program comprehension

• Repositories that archive developer communication hold valuable information - design discussions, rationale, history, future plans and implementation details

• Emails help in program comprehension and software analysis

• Information extracted must be relevant, unbiased and comprehensible

• Various IR techniques treat natural language documents as a bag-of-words; however, they are most often applicable to well-formed documents with a definite structure

• IR techniques are effective in some software engineering tasks, but on development emails they reduce the quality, reliability and comprehensibility of the extracted information, because the natural language text is often not well-formed and is interleaved with other syntaxes: code fragments, stack traces, patches, etc.

• The paper's contributions -

• Classification of emails using a combination of parsing and machine learning techniques

• A web application for manually classifying email content

• A manual classification of the mailing lists of four different software systems, released as a freely available benchmark

• An empirical evaluation of the approach against the benchmark

Motivation

• Traceability recovery - By recognizing the context in which a term appears, one can assign weights to the words in a document dynamically and more accurately, improving the quality of the traceability links and giving more information to the user

• Stop words removal - By recognizing the different parts that compose an email, one can apply a different common-term removal technique to each part, exposing the most relevant information

• Artifact summarization - By recognizing the different parts of an email, one can use the most suited summarization technique according to each part’s type and extract correct information

• Fact extraction - By distinguishing the type of each email line, we can exploit ad hoc analysis techniques to extract precise information

• Non-essential information removal - By recognizing the noise in emails, the important data emerges, improving the information extraction quality

Related Work

• Bettenburg et al. focused on the noise in email data and presented the need for pre-processing, using filtering heuristics to recognize noise and irrelevant information with a tool called InfoZilla

• Emails present more challenges

• contain larger natural language vocabulary

• present more noise like in email headers/signatures

• email clients wrap long lines of text, breaking the original formatting

• Bird et al. proposed a method to measure the acceptance rate of patches submitted via email in open source projects, to analyze developer interactions

• Tang et al addressed the issue of cleaning email data exploiting probabilistic and machine learning models

• non-NL text filtering - to filter email header/signatures and program code

• paragraph recognition

• sentence boundary detection

• word normalization - corrects misspelled words

• Carvalho and Cohen devised methods to recognize signature blocks and reply lines in emails using machine learning classifiers

• In their earlier work, the authors developed BESC, a tool based on a lexical approach that detects Java code fragments in development emails

Related Work

• Current work differs from the previous works

• It addresses more compact classification tasks

• It considers larger granularity or different sources

• It distinguishes the structured data forms instead of merging patches, code and stack traces into one category

• It does not use hard-coded classification rules

Data Collection and Classification

• Data Collection

• Objective - to improve data quality and comprehension by using data sets that are accurate, comprehensive and of statistically significant sizes

• Data Set -

• Emails were imported, using MarkMail, from the mailing lists of unrelated systems that come from various open source software communities.

• Each of these OSS applications uses a different development environment and may follow different paradigms, so the usage of the mailing lists could differ, which mitigates threats to external validity

• Pre-processing was done to filter messages automatically generated by bug tracking systems and versioning systems

• Random samples of emails were drawn, sized for a 95% confidence level and a 5% error margin
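
The slides do not say how the sample sizes were computed. A common way to size a random sample for a given confidence level and error margin is Cochran's formula with a finite population correction, sketched below. The function name `sample_size`, its parameters, and the worst-case proportion of 0.5 are assumptions made for illustration, not details taken from the paper.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.05, proportion=0.5):
    """Cochran's sample size with finite population correction.

    Assumption: proportion=0.5 is the usual worst case when the true
    proportion is unknown; the slides do not state which formula was used.
    """
    # Two-sided z-score for the requested confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an infinitely large population.
    n0 = (z ** 2) * proportion * (1 - proportion) / margin ** 2
    # Correct for the finite number of emails in the mailing list.
    return math.ceil(n0 / (1 + (n0 - 1) / population))


if __name__ == "__main__":
    for emails_in_list in (1_000, 10_000, 100_000):
        print(emails_in_list, sample_size(emails_in_list))
```

Under these assumptions a mailing list of 1,000 emails needs a sample of 278, and the required size levels off at 385 for very large lists.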

Data Collection and Classification

• Data Classification

• Manually classified 1439 sample emails from the dataset

• To reduce manual errors, the authors devised MAILPEEK, a web application written in Smalltalk using the Seaside framework

• Two graduate students with extensive Java programming experience were asked to classify the emails

• Users conduct classification as follows -

• can classify at character level

• click on starting and ending characters to label a block

• verify correctness

• apply appropriate category

• Inter-rater agreement was also measured, by asking each rater to classify 5% of the emails analyzed by the other
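
The slides do not name the agreement statistic. One widely used option for two raters is Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The sketch below assumes that each rater's output is a list with one category label per line; the function name `cohen_kappa` and the sample labels are illustrative only.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same lines.

    Assumption: kappa is used here as a stand-in; the slides only say
    that inter-rater agreement was considered.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of lines")
    n = len(rater_a)
    # Fraction of lines on which the two raters chose the same category.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected if both raters labelled at random with their own frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    first = ["text", "text", "junk", "code"]
    second = ["text", "junk", "junk", "code"]
    print(round(cohen_kappa(first, second), 3))
```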

Data Collection and Classification

• Data Distribution

• Most lines are natural language text

• More than 30% of lines are junk

• Frequency of other categories is lower and the ranking changes according to the mailing list

• The differing composition of the email sets suggests that the OSS communities use their mailing lists in different ways

• 5% of lines are hybrid, as they belong to more than one category; they mostly consist of junk not separated from NL text

Experiment

• Term Based Classification

• In IR, documents are treated as a bag-of-words, where syntactic information, ordering and constituency of the words play no role in determining their meaning, and each document is modeled as a vector of features

• Machine Learning Method

• Use Naive Bayes, a supervised learning method. It relies on the conditional independence assumption, i.e. the presence of a feature is unrelated to the occurrence of other features

• The probability that a line l, made of terms t_k, belongs to class c is $P(c \mid l) \propto P(c) \prod_{k} P(t_k \mid c)$

• The classifier computes the posterior probability $P(c_i \mid l)$ for each class and chooses the class with the highest probability. This is called the maximum a posteriori (MAP) hypothesis: $c_{MAP} = \arg\max_{c} P(c \mid l) = \arg\max_{c} P(c) \prod_{k} P(t_k \mid c)$

• For example, to classify the line l = “Alice wrote :” as text, junk, or code, the algorithm first computes the probabilities P(text | l) = 0.43, P(junk | l) = 0.55 and P(code | l) = 0.02, then selects the highest value, 0.55, thus classifying l as junk.
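
The MAP rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class name `LineNaiveBayes`, the whitespace tokenization, the add-one (Laplace) smoothing, and the toy training lines are all assumptions. Scores are summed in log space, which selects the same class as the product in the formula while avoiding numeric underflow.

```python
import math
from collections import Counter, defaultdict


class LineNaiveBayes:
    """Tiny multinomial Naive Bayes over whitespace-separated terms."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing (assumption)
        self.class_counts = Counter()           # number of training lines per class
        self.term_counts = defaultdict(Counter)  # term frequencies per class
        self.vocabulary = set()

    def fit(self, lines, labels):
        for line, label in zip(lines, labels):
            tokens = line.split()
            self.class_counts[label] += 1
            self.term_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, line):
        """Return log P(c) + sum_k log P(t_k | c) for every class c."""
        total_lines = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_lines in self.class_counts.items():
            n_terms = sum(self.term_counts[label].values())
            score = math.log(n_lines / total_lines)
            for term in line.split():
                smoothed = self.term_counts[label][term] + self.alpha
                score += math.log(smoothed / (n_terms + self.alpha * vocab_size))
            scores[label] = score
        return scores

    def predict(self, line):
        # MAP decision: the class with the highest (log) posterior score.
        scores = self.log_scores(line)
        return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy training data, invented for illustration only.
    lines = ["Alice wrote :", "Bob wrote :", "public void run ( ) {",
             "return x ;", "I think this patch is fine", "thanks for the review"]
    labels = ["junk", "junk", "code", "code", "text", "text"]
    model = LineNaiveBayes().fit(lines, labels)
    print(model.predict("Carol wrote :"))
```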


• Selection of the Terms

• Words - Fundamental tokens of any language. Contrary to most IR techniques, we do not perform stop word removal, as we expect very frequent words to be representative of a particular class. We also do not perform stemming, as we expect some word variants to be more characteristic of certain classes

• Punctuation - We distinguish lines written in languages with different syntaxes, so punctuation is a valuable feature. A run of punctuation characters that is not separated by words or spaces is considered a single term. The characters > and >> have no role in line classification, as they are assumed to be part of email reply threads

• Bi-grams - Naive Bayes assumes conditional independence, yet certain languages have patterns of terms that appear together, e.g. “public void” in Java. To compensate, we also consider bi-grams

• Context - Features are extracted not only from the line being classified but also from the surrounding lines, because some classes are recognizable only as a structure (e.g. stack trace, patch) that spans a context. In the past, researchers proposed to solve this problem by adding features that describe the lines close to the one under classification
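
The four kinds of terms can be illustrated with a small feature extractor. This is a sketch under stated assumptions: the regular expressions, the function names `terms`, `bigrams` and `line_features`, the one-line context window, and the `prev:`/`next:` prefixes are all hypothetical choices, not the feature encoding used in the paper.

```python
import re
from collections import Counter

# A term is either a run of word characters or a run of punctuation.
TOKEN = re.compile(r"\w+|[^\w\s]+")
# Leading reply markers such as ">" or ">>" are dropped before tokenizing.
QUOTE = re.compile(r"^\s*(?:>\s*)+")


def terms(line):
    """Split a line into word terms and punctuation-run terms."""
    return TOKEN.findall(QUOTE.sub("", line))


def bigrams(tokens):
    """Pairs of adjacent terms, e.g. 'public void'."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def line_features(lines, index):
    """Count the terms, bi-grams, and context terms for lines[index].

    Assumption: context is limited to the previous and next line, and
    context terms are marked with 'prev:' / 'next:' prefixes.
    """
    tokens = terms(lines[index])
    features = Counter(tokens)
    features.update(bigrams(tokens))
    if index > 0:
        features.update("prev:" + t for t in terms(lines[index - 1]))
    if index + 1 < len(lines):
        features.update("next:" + t for t in terms(lines[index + 1]))
    return features


if __name__ == "__main__":
    email = ["Alice wrote :", "> public void run() {", "> }"]
    for name, count in sorted(line_features(email, 1).items()):
        print(name, count)
```

Keeping punctuation runs such as `();` as single terms means that syntax-heavy lines produce features that rarely occur in natural language text.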


• Line Modeling

• Modeled each line as a vector of n + 1 dimensions

• First n elements are the chosen features, while the last one is the manual classification value
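
The n + 1 layout can be sketched as follows. The function name `line_vector`, the use of term counts as feature values, and the whitespace tokenization are assumptions for illustration; the slides only fix the shape of the vector.

```python
def line_vector(line, vocabulary, label):
    """Model a line as n feature values followed by its manual class.

    Assumption: each feature value is the number of times the
    corresponding vocabulary term occurs in the line.
    """
    tokens = line.split()
    counts = [tokens.count(term) for term in vocabulary]
    # The last element is the manually assigned classification value.
    return counts + [label]


if __name__ == "__main__":
    vocabulary = ["public", "void", ":", "wrote"]  # n = 4 chosen features
    print(line_vector("Alice wrote :", vocabulary, "junk"))
```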

Experiment

• Training and Testing

• Since we take a machine learning approach, we need to train our system on classified data

• To evaluate model accuracy, we use the IR metrics of precision, recall and F-measure

• We use two approaches for training the model

• 10-fold stratified cross-validation - Split the data set into 10 folds, each time using 90% of the data to train the prediction model and the remaining 10% to test its accuracy. Considering all the features, the classification accuracy reaches almost 94%

• Mailing list cross-validation - A 4-fold cross-validation where each fold is neither stratified nor randomly taken, but corresponds exactly to one mailing list. We train the classifiers on three mailing lists and test the prediction model accuracy on the remaining one. The performance of the classifier drops to around 60% even when considering all the features. Patch and code lines are often misclassified, while the other categories obtain the best results
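
Mailing list cross-validation amounts to a leave-one-group-out split, where the group is the mailing list a line comes from. The sketch below shows such a split together with per-class precision, recall and F-measure. The function names and the placeholder list names are hypothetical, and no classifier is included.

```python
def leave_one_list_out(list_names):
    """Yield (held_out, train_indices, test_indices), one fold per mailing list.

    list_names[i] is the mailing list that instance i was sampled from.
    """
    for held_out in sorted(set(list_names)):
        train = [i for i, name in enumerate(list_names) if name != held_out]
        test = [i for i, name in enumerate(list_names) if name == held_out]
        yield held_out, train, test


def precision_recall_f1(gold, predicted, target):
    """Precision, recall and F-measure for one target class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Placeholder list names; the real benchmark uses four mailing lists.
    origin = ["list_a", "list_a", "list_b", "list_c", "list_d", "list_d"]
    for held_out, train, test in leave_one_list_out(origin):
        print(held_out, "train:", train, "test:", test)
```

Because every test fold comes from a community the model never saw during training, this split measures how well the learned terms transfer across projects, which is why it is a harder test than the stratified 10-fold setting.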

Experiment

• Term Based Features and Overfitting

• By considering the entire set of features, we obtain a complex classification model with more features than instances. In such a scenario, overfitting is likely to occur

• By removing the features that are not valuable for correctly predicting instances outside the training set, we decrease overfitting and increase the generalizability of the results

Experiment

Parsing Based Classification - a specialized parser for each class, based on island parsing

• Stack Trace Parsing : Dividing stack trace into exceptionMessage, atLine, ellipsisLine and causedByLine.

• Patch Parsing : Dividing patches into patchHeader, patchBlockHeader and patchBlock.

• Source Code Parsing : written for Java

• Junk Parsing : To parse noisy text in emails, like authors’ signatures and email headers
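
The four stack trace line kinds named above can be approximated with regular expressions. This is a deliberately simplified stand-in: island parsing uses a grammar that describes the constructs of interest (the islands) and skips everything else (the water), whereas the sketch below only matches single lines. The patterns and the function name `stack_trace_line_kind` are assumptions; only the four returned labels come from the slides.

```python
import re

# Simplified single-line patterns for Java stack traces (assumptions).
CAUSED_BY = re.compile(r"^\s*Caused by:\s+[\w$.]+(:.*)?$")
AT_LINE = re.compile(r"^\s*at\s+[\w$.<>]+\(.*\)\s*$")
ELLIPSIS = re.compile(r"^\s*\.\.\.\s+\d+\s+more\s*$")
EXCEPTION = re.compile(
    r'^\s*(Exception in thread "[^"]*"\s+)?'
    r"(?:[A-Za-z_$][\w$]*\.)+[A-Z][\w$]*(?:Exception|Error)(:.*)?$"
)


def stack_trace_line_kind(line):
    """Return the stack trace element a line looks like, or None."""
    if CAUSED_BY.match(line):
        return "causedByLine"
    if AT_LINE.match(line):
        return "atLine"
    if ELLIPSIS.match(line):
        return "ellipsisLine"
    if EXCEPTION.match(line):
        return "exceptionMessage"
    return None


if __name__ == "__main__":
    sample = [
        "Thanks, Alice",
        "java.lang.NullPointerException: boom",
        "\tat com.foo.Bar.baz(Bar.java:42)",
        "\t... 5 more",
        "Caused by: java.io.IOException: disk full",
    ]
    for text in sample:
        print(repr(text), "->", stack_trace_line_kind(text))
```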

Experiment

Parsing Based Classification - fusing the characteristics of the term based classification and the parser based approach.

• Adding parser results to Naïve Bayes: Add the parser based classification output as extra features to improve the Naïve Bayes ML process.

• Unified Classification Approach: Use Naïve Bayes to produce a partial classification from the term based features, then use another ML classifier to model the fusion of the Naïve Bayes results and the parser based classifications
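
The second step of the unified approach needs an input that combines both sources of evidence. One possible way to assemble it is sketched below: the Naïve Bayes posteriors and the yes/no outcome of each parser are concatenated into one vector for the second classifier. The function name `fuse`, the dictionary keys, and the encoding of parser results as 0/1 are assumptions; the slides do not specify the representation or the second classifier.

```python
def fuse(nb_posteriors, parser_hits, classes, parsers):
    """Build the second-stage feature vector for one line.

    nb_posteriors: class name -> Naive Bayes posterior probability
    parser_hits:   parser name -> True if that parser recognized the line
    The fixed `classes` and `parsers` orders keep vectors comparable.
    """
    vector = [float(nb_posteriors.get(c, 0.0)) for c in classes]
    vector += [1.0 if parser_hits.get(p, False) else 0.0 for p in parsers]
    return vector


if __name__ == "__main__":
    classes = ["text", "junk", "code"]                 # from the earlier example
    parsers = ["stack_trace", "patch", "code", "junk"]  # hypothetical parser keys
    posteriors = {"text": 0.43, "junk": 0.55, "code": 0.02}
    hits = {"junk": True}
    print(fuse(posteriors, hits, classes, parsers))
```

Vectors built this way, paired with the manual labels, could then be used to train any supervised classifier as the second step.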

Threats to Validity

• Construct Validity - threats concern the relation between theory and observation, i.e. the measured variables may not actually measure the conceptual variables

• Statistical Conclusion - threats are concerned with whether we have enough data to support our claims

• External Validity - threats are concerned with the generalizability of the results

Conclusion

• The authors presented a unified two-step approach that fuses a supervised ML approach with island parsing to automatically classify the content of development emails into five categories. The approach looks promising even under mailing list cross-validation.
