Proceedings of the X Latin American Workshop on Experimental Software Engineering
ESELAW 2013
April 8–10, 2013, Universidad ORT Uruguay – Campus Centro
Montevideo, Uruguay
CIbSE 2013 | X Workshop Latinoamericano Ingeniería de Software Experimental | ESELAW 2013
Editors – Program Chairs:
Ph.D. Martín Solari Buela, Universidad ORT Uruguay – Uruguay
Ph.D. Arilo Cláudio Dias Neto, Universidade Federal do Amazonas – Brasil
Organizers: Universidad ORT Uruguay – Universidad de la República – Antel
Media partner: EL PAIS
Sponsors: GeneXus Consulting – Onetree – Microsoft – Gallito.com – TATA Consultancy Services – MercadoLibre – Globant – Tilsor – UNIT
Support: Centro Latinoamericano de Estudios en Informática – Agencia Nacional de Investigación e Innovación – IEEE Computer Society Capítulo Uruguay – Cámara Uruguaya de Tecnologías de la Información – Gega Multimedios
ISBN: 978-9974-8379-3-5
Message from the General Chair
I cordially welcome you to Montevideo, Uruguay, and to the 16th Ibero-American Conference on Software Engineering (CIbSE 2013).
From its inception, CIbSE has been well known for providing a growing forum for those interested in exploring and discovering the advances and research in the field of software engineering in Ibero-America. Following this tradition, we believe this year's conference program has much to offer to researchers, practitioners, students and educators.
As in previous editions, this conference will hold the three traditional tracks covering the areas of Software Engineering, Requirements Engineering, and Experimental Software Engineering; it will also host a Doctoral Symposium, three keynotes and four tutorials. For the first time, this year's conference will host an Industry Forum that we hope will encourage the exchange of experiences between scientists and professionals.
CIbSE is a constantly growing research-oriented conference. In this edition there were 36 accepted papers in the three main tracks, and we are pleased to have authors representing several countries, including Argentina, Brazil, Cuba, Italy, Spain, Peru, Venezuela and Uruguay.
We are fortunate to have three keynotes that span different areas of interest to the software engineering community. Bill Nichols's keynote will take a deeper look into the stages of quality development using TSP. Carolyn Seaman will address how the Technical Debt metaphor can be used by practitioners and empirical researchers as a new path to technology transfer. Eduardo Mangarelli and Fernando Machado will bring an industrial perspective on the challenges presented by the complexities that today's applications are expected to manage, and on how these challenges are disrupting the way we engineer software.
Tutorials have always been an efficient way to learn about diverse themes in Software Engineering. This year we will offer four tutorials covering leading-edge topics, presented by the following renowned speakers: Renata Guizzardi, Luis Olsina, Andrea Delgado and Bill Nichols.
As with many conferences, this CIbSE was mostly organized by a group of dedicated people who generously offered their time to perform the countless tasks needed to make the conference a success. I would like to thank all the volunteers who have devoted their energy to the goal of making it an outstanding experience for you. These include, among others, the Steering Committee, Track Chairs, PC members, student volunteers and the university's staff. Thank you all for your dedication and support. I must especially recognize the work of our Organization Chair Santiago Matalonga, our Academy/Industry Liaison Ana Laura Trias, and our webmaster Liliana Pino; we couldn't have done this without your leadership, patience and hard work!
I would also like to acknowledge the invaluable support of the companies and organizations that, with their sponsorship, made it possible to offer full access to the conference to over fifty students. I believe this is a unique opportunity for our future software engineers and researchers.
Finally, we hope that you will enjoy your stay in Montevideo, the Ibero-American Capital of Culture 2013. Welcome to CIbSE 2013!
Gastón Mousqués / General Chair
ESELAW 2013 Preface

In Software Engineering, experimentation has proven necessary to support the evolution of software technologies and to reveal evidence. Experimentation has been used to build and evaluate processes, methods, techniques and tools. It is now understood that different types of objects demand different types of studies to observe their behavior, motivating researchers to investigate and evolve, collaboratively, models for planning, executing and packaging primary, secondary and tertiary studies applied to software engineering objects.

Over the past ten years, the Experimental Software Engineering Latin American Workshop (ESELAW) has become an important forum for researchers and practitioners to report on and discuss new research results in Experimental Software Engineering, bringing together Latin America's academic, industrial and commercial communities. The workshop encourages the exchange of ideas to understand the strengths and weaknesses of software engineering technologies, focusing on the process, design and structure of empirical studies, as well as on results from specific studies.

Since 2011, ESELAW has been held within the context of CIbSE, the leading research forum on Software Engineering in Ibero-America. The goal is to bring ESELAW to the different countries that usually take part in CIbSE, disseminating and motivating the conduct of experimental activities in software engineering across the Latin American countries.

In the tenth edition of ESELAW, the three technical sessions include high-quality scientific works addressing issues in experimentation in software engineering from different perspectives. They were titled "TS1 – Systematic Reviews", "TS2 – Primary Studies", and "TS3 – Infrastructures and Methods to Support ESE". Furthermore, ESELAW participants are also invited to take part in CIbSE's tutorials, talks and other activities.

ESELAW 2013 received 19 full paper submissions.
Each was reviewed by at least three referees. Eight were selected for presentation in the technical sessions and are included in these proceedings. This result was only possible due to the commitment and hard work of the Program Committee and external reviewers.

Universidad ORT Uruguay and Universidad de la República Uruguay are hosting the tenth edition of ESELAW within CIbSE. The Brazilian Computer Society (SBC) is supporting the event.

We welcome the authors, speakers and all other participants of ESELAW 2013. We wish everyone a great conference!

Arilo Claudio Dias Neto and Martín Solari
Co-Chairs of ESELAW 2013
Nominated Papers
Best Paper:
o Silverio Martínez-Fernández, Claudia P. Ayala, Xavier Franch, Helena Martins Marques and David Ameller. A Framework for Software Reference Architecture Analysis and Review.
Distinguished Papers:
o José Torres, Daniela Cruzes and Laís Salvador. Automatically Locating Results to Support Systematic Reviews in Software Engineering.
o Ciro X. Maretto and Monalessa P. Barcellos. Software Measurement Architectures: A Mapping Study.
Automatically Locating Results to Support Systematic Reviews in Software Engineering
José Alberto S. Torres1, Daniela S. Cruzes2, Laís do N. Salvador3
1 Regional Center of Telematics, DPRF, Salvador/BA, Brazil
[email protected]
2 Dept. of Computer and Information Science (IDI), NTNU, Trondheim, Norway
[email protected]
3 Dept. of Computer Science, UFBA, Salvador/BA, Brazil
Abstract. Background: Systematic Reviews are extremely dependent on human effort and, therefore, costly and time consuming. Some authors in Software Engineering are starting to research the use of computer support to reduce human labor in some tasks of the process. Aim: Define a method for automatic location of the sentences that describe the results in an unstructured scientific paper, aiming to reduce the human effort. Method: Three sentence classification methods were analyzed, and a new sentence classification method was proposed and tested with the same input set used in the other methods. Results: The method proposed in this work achieved recall rates between 60% and 72% for the sentences describing the results of the papers. Conclusions: The proximity between the recall rates found in automatic tests conducted with the proposed method and in a test with humans confirms the feasibility of this technique for automating part of the process.
Keywords: Evidence-Based Software Engineering; Systematic Review; Information Retrieval.
1 Introduction
For decades people have known about the gaps between research evidence and clinical practice, and their consequences in terms of ineffective or even harmful decision making. Evidence-based medicine came to help overcome this problem in decision-making. This methodology is about asking questions, finding and evaluating relevant data, and using that information in clinical practice to assist the work of doctors [21]. The success achieved by evidence-based medicine, especially after the 1980s, has led other areas that provide services to the general public to adopt this paradigm, such as psychiatry, nursing and, more recently, the Software Engineering area [3].
Evidence-Based Software Engineering (EBSE) aims to improve decision-making in the development and maintenance of software through the integration of current best research evidence with practical experience [13][6]. EBSE also aims to provide knowledge about when, how and in what context technologies, processes, tools and methods are most appropriate for the practice of software engineering. In this context, the Systematic Review (SR) has provided mechanisms to identify and aggregate research evidence, providing a full and fair assessment of the state of evidence related to a particular topic of interest [13][6]. The process of conducting this type of study is guided by a rigid and well-defined sequence of methodological steps, which follow a strict protocol defined before the beginning of activities [12].
In recent years, most strongly after the publication of the seminal papers on EBSE [13][6] and the procedures for undertaking systematic reviews [12], we have noticed an increase in the use of Systematic Reviews by researchers in Software Engineering. Zhang and Babar [29] investigated the adoption and use of Systematic Reviews in Software Engineering and discovered that the vast majority of the SE researchers consulted were convinced of the value of using a systematic methodology and rigorous literature reviews. In this interview study, interviewees showed concerns about the amount of time and resources required to run an SR. Cruzes and Dybå [4] performed a tertiary review of the types and methods of synthesis and concluded that synthesis of empirical research is at the heart of systematic reviews, and that future attention must be directed toward synthesis methods that increase our ability to find ways of comparing and combining what is seemingly incomparable and hard to combine.
One of the challenges in conducting Systematic Reviews of Literature is to maintain a balance between methodological rigor and the required effort. Felizardo [7], Malheiros [16] and Silva Rocha [24] have applied text mining and machine learning techniques to reduce the effort and time required for the construction of a Systematic Review of Literature (SRL), acting, respectively, on the steps of selecting papers and automatically identifying contextual information.
Thus, in this paper we propose a way to automate an important step in performing synthesis in Systematic Reviews: the results extraction activity. A previous study [27] analyzed the performance of some existing algorithms and methods for sentence classification and demonstrated the feasibility of deploying a technique to automatically locate result sentences in unstructured Software Engineering papers. The new method proposed in this paper, called Textum, has as its main objective automating the part of the process that deals with locating the sentences that represent a paper's results. This way, it becomes possible to analyze specific parts of the paper, according to rules executed by the proposed algorithm, instead of manually analyzing the full paper.
This paper is organized as follows: Section 2 provides information on related work to contextualize this study; Section 3 presents the main methods for automatically classifying sentences in scientific papers; Section 4 details the proposed method; Section 5 presents the feasibility study of the proposed approach, assesses its viability, and compares it with other methods; finally, Section 6 concludes this work.
2 Background and Related Work
The effort required to conduct a Systematic Review is one of the obstacles to greater use of this type of study. Zhang and Ali Babar [29] performed a survey asking Software Engineering researchers who had never executed SRLs why they had never performed this kind of study. The results showed that 50% of the respondents had not done so because they did not know about SRLs at the time of their research and writing. In addition, 37% of them had never used the technique because of the amount of time required to perform an SRL.
There are no studies defining a formula to estimate the average time taken to perform a systematic review in the Software Engineering area. In medicine, Allen and Olkin [2] presented a formula (1), produced from empirical observations of SRLs, to determine the number of hours spent as a function of the number of returned references (x).
Hours = 721 + 0.243x – 0.0000123x² (1)
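As a quick illustration (not part of the paper's tooling), formula (1) can be evaluated directly; a search returning 2,000 references, for example, comes out at roughly 1,158 hours:

```python
def estimated_hours(references):
    """Estimate SRL effort (in hours) from the number of returned
    references, using Allen and Olkin's empirical formula (1)."""
    return 721 + 0.243 * references - 0.0000123 * references ** 2

print(round(estimated_hours(2000)))  # -> 1158
```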
Some authors have published, especially in recent years, papers about tools that automate parts of the Systematic Review as a way to reduce the time and cost required for its realization. Felizardo et al. [7] created a tool to support the primary study selection activity using visual text mining (VTM) techniques and discovered that this approach was useful in accelerating the selection task. Furthermore, the results showed that the method helped to increase the inclusion of relevant papers and the exclusion of irrelevant papers. Malheiros et al. [16] proposed an automated tool, called pexExplorer, to help researchers in the initial selection of items to be used in the SRL. This work showed that the use of visualization allowed more information to be processed at once, and that during the selection of studies this technique is valuable for data cleaning. Silva Rocha [24] developed a tool, called ContextExtractor, for automatic extraction of context information from scientific papers. The results of the context extraction achieved by this tool were similar to those of manual context extraction performed by junior software engineers.
These tools were built on text mining techniques and aim to reduce the time required to construct reviews by automating part of the process. The prototypes proposed by Felizardo and Malheiros focus on information quality assessment and the selection of papers, while Silva Rocha's tool focuses on extracting context information from the studies described in the papers.
However, we did not find studies automating the results identification activity, one of the main steps of the systematic review process, probably due to the complexity of this activity. Cruzes et al. [5] performed an experiment with graduate students at the University of Maryland enrolled in a class on Experimental Software Engineering. In this study, they assigned a set of papers to students, who had to identify in the text the sentences that represented the results of the studies analyzed. The identified sentences were then compared with an oracle developed by experienced researchers, in order to measure the accuracy of the manual selection and the time taken to analyze the texts. The experiment showed that the number of results found by the participants was below expectations, since they located, on average, only 53% of the existing results. It was also observed that the students spent between 1.5 and 3 hours reading the papers, with each participant consuming, on average, 8.1 minutes per page. The analysis of the experiment results confirms the complexity of performing this activity, given both the low success rate and the large amount of time required for analysis.
Our study proposes a method for automatic identification of results in Software Engineering scientific papers, reducing the time spent performing the activity and improving the success rate in identifying study results.
3 Sentence Classification
The problem of making a machine understand and differentiate between different categories of sentences is usually treated in Computer Science as a task of automatic text classification. Hachey and Grover [8] use various techniques based on machine learning to determine the rhetorical status of sentences from a corpus of legal documents. Khoo et al. [11] propose the classification of sentences from a corpus of helpdesk e-mails into domain-related categories, such as education, answer and questions.
More focused on the topic covered in this study, there have also been works emphasizing the analysis of sentences in scientific texts. These works are divided into two categories: those that classify only the sentences found in the abstracts of papers [17][22][25], and those that work on the full text [26][10][1]. Only the full-text studies are summarized below, because they match the scope of this study. The full discussion of these studies can be found in [27].
The first study analyzed was developed by Teufel [26] and proposes defining a set of attributes that characterize types of sentences. With these attributes it would be possible to classify the sentences into different categories. Teufel conducted an experiment in which a set of sentences extracted from a series of computational linguistics papers was classified into seven different categories according to purpose. The second study was written by Ibekwe-SanJuan [10]. Its main purpose was to help users identify the key scientific information in a text through elements associated with specific sentences. In her work, Ibekwe-SanJuan observed, based on previous studies, that scientific writing is not a neutral act but a social one, because the authors need to convince the community of the validity of their research; hence they make use of rhetorical cues and a few recurrent patterns. The author argues that, in theory, this behavior would allow the automatic identification of sentences bearing these patterns through the use of templates or regular expressions.
The last work analyzed was developed by Agarwal and Yu [1] and focused on the classification of full-text biomedical journal papers. Four different methods for sentence classification were tested: two rule-based and two machine-learning based. The first rule-based method had just one rule: the category was assigned to a sentence based on the section in which the sentence occurred. In the second rule-based method, the authors identified 603 rules to classify the sentences into the four IMRaD (Introduction, Methods, Results and Discussion) categories. The machine-learning methods were also split in two: a supervised machine-learning system trained on a non-annotated corpus, and one trained on manually annotated full-text sentences.
4 An Approach For Locating Results (Textum)
The main objective of this paper is to define a method to automate the task of locating result sentences in unstructured Software Engineering papers. Papers that do not follow the IMRaD model, i.e., whose texts are not organized using the standard sections (Introduction, Methods, Results and Discussion), are treated in this work as unstructured papers. The expected result of applying this method is a reduction in the human effort required to perform this activity, and thus in the time spent on its execution.
The proposed algorithm was built on text mining techniques associated with classification strategies. The method is divided into four main activities, going from text importing to the identification and selection of the sentences that represent the results of the scientific papers under analysis (Fig. 1).
Fig. 1. - Textum Method Schema
4.1 Paper Import and Sentence Segmentation
The first step is to convert the papers into plain-text documents. Tags and undesirable characters are removed at this stage. The next step is to segment the sentences: pieces of text with full meaning. This activity is performed with a pattern recognition algorithm based on the use of punctuation, capitalization, names, acronyms and the position of grammatical particles in the text. In general, pattern recognition algorithms try to identify the "most likely" answer for possible inputs based on statistical parameters and characteristics, which differs from pattern matching algorithms, which look for exact matches.
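A minimal sketch of such punctuation-based segmentation is shown below. The abbreviation list and the splitting rule are illustrative simplifications, not the actual Textum implementation (which also uses capitalization, names and acronym cues):

```python
import re

# Illustrative abbreviation list; a real segmenter would use a larger one.
ABBREVIATIONS = ("e.g.", "i.e.", "et al.", "Fig.", "vs.")

def segment_sentences(text):
    """Split plain text into sentences at '.', '!' or '?' followed by
    whitespace and a capital letter, re-joining pieces that end in a
    known abbreviation."""
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
    sentences, buffer = [], ""
    for part in parts:
        buffer = f"{buffer} {part}".strip()
        if buffer.endswith(ABBREVIATIONS):
            continue  # likely a false boundary, keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

print(segment_sentences("Smith et al. Reported 53% recall. Our method improves it."))
```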
4.2 Attributes Calculation
There are two main activities in defining the new sentence classification method: the definition of the sentence attributes and the classification strategy. Text classification strategies use a number of predefined attributes to automatically infer the category and classify the pieces of text.
To define the final attribute set, we used elements defined in [26][10][15][14][20][19][23]. To select the best attributes from the whole set, we used a genetic-algorithm-based technique. Classification tests were run on sets of attributes assembled from the initial group. The attribute sets with the best classification results were selected and mixed to form new subgroups, used as the basis for a new classification process, and so on, until the best-performing attributes were defined to compose the final set.
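The selection loop just described can be sketched as a simple genetic search. Everything below (population size, the crossover scheme, the `score` callback) is an illustrative assumption rather than the authors' implementation:

```python
import random

def select_attributes(attributes, score, generations=20, pop_size=10):
    """Genetic-style search for a well-performing attribute subset.
    `score(subset)` is assumed to run a classification test on the
    subset and return a quality measure (e.g. recall)."""
    def random_subset():
        chosen = frozenset(a for a in attributes if random.random() < 0.5)
        return chosen or frozenset(attributes[:1])  # avoid empty subsets

    def crossover(p1, p2):
        # Mix two parents: each inherited attribute kept with probability 1/2.
        child = frozenset(a for a in (p1 | p2) if random.random() < 0.5)
        return child or p1

    population = [random_subset() for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the best-scoring half and breed the rest from it.
        survivors = sorted(population, key=score, reverse=True)[:pop_size // 2]
        children = [crossover(random.choice(survivors), random.choice(survivors))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=score)
```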
Seventeen attributes were selected to compose the final set, calculated for each sentence:
Keyword Frequency – this attribute follows Luhn's model [15], composed of the sum of the sentence word frequencies multiplied by the distance between words.
Cue Method – a weight is assigned to each word based on its presence in one of four previously created word lists: Bonus A – weight 3; Bonus B – weight 2; Bonus C – weight 1; and Stigma – weight -1 [26].
Paragraph Sector – this attribute was proposed by Teufel and Moens [26] and considers the sentence's position in the paragraph to which it belongs. The total number of sentences in the paragraph is divided by four to create equal paragraph sections, to which the sentences are allocated.
TF-IDF – the product of the term frequency (TF) and the inverse document frequency (IDF). The calculation method is the same as in Salton [23].
TF-ISF – the Inverse Sentence Frequency is defined based on the word's frequency in the sentence, F(w), the total number of words in the sentence, n, and the number of sentences in which the word appears, S(w). The TF-ISF equation is shown in (2):

TF-ISF(w) = F(w) · log(n / S(w)) (2)

Sentence Length – the number of words in the sentence.
Gist Sentence – uses the keyword frequency and TF-ISF methods to select the sentence with the highest score in the text, which, at least theoretically, represents the main idea. This sentence is called the gist sentence of the text and becomes the basis for calculating the weight of the other sentences [19].
Lexical Connectivity – the weight is set based on the number of words shared between the sentences divided by the total number of words in the sentence.
Section Position – the sequential position of each sentence in its section.
Segment Position – a variation of the method proposed by Teufel and Moens [26]. The text is divided into ten parts, and each one is given an identifier: a letter from A to J. The sentences belonging to each part of the text are labeled with the letter assigned to that piece of text.
Verbal Tense – the first verbal occurrence in the sentence. The possible values are: present in the third person singular, present in other persons, infinitive, past participle, present participle, gerund, and sentences without a verb.
Citations – if the sentence has a citation the value is 1; otherwise the value is 0.
Header – the relevance of the header is determined by whether the section title has at least one of these words: "result", "introduction", "conclusion", "date", "implementation" and "discussion".
Self-Indicative Phrases – this approach follows the Ibekwe-SanJuan [10] method. It uses a grammar to recognize the sentences that represent results in papers. The selected sentences are assigned weight 1, and the others value 0.
Comparison – checks, through grammar rules, whether there are comparatives or superlatives in the sentence. The "er" suffix and the "more … than" structure, for example, are comparative indicators; the "est" suffix is a superlative indicator.
Number Presence – checks if there are numbers in the sentence; if so, "Yes" is assigned, otherwise "No".
Percentage Indicator – if the "%" sign or the word "percent" (and variations) appears in the sentence, the value is "Yes"; otherwise it is "No".
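A few of the simpler attributes above can be sketched directly; the whitespace tokenization below is an illustrative simplification of the real pre-processing:

```python
import math

def tokenize(sentence):
    """Naive tokenizer: split on whitespace and strip punctuation."""
    return [w.strip('.,;:!?()"').lower() for w in sentence.split()]

def sentence_length(sentence):
    """Sentence Length attribute: number of words in the sentence."""
    return len(tokenize(sentence))

def percentage_indicator(sentence):
    """Percentage Indicator attribute: 'Yes' if the sentence mentions
    a percentage, 'No' otherwise."""
    s = sentence.lower()
    return "Yes" if "%" in s or "percent" in s else "No"

def tf_isf(word, sentence, all_sentences):
    """TF-ISF attribute as in (2): F(w) * log(n / S(w)), where F(w) is
    the word's frequency in the sentence, n the number of words in the
    sentence, and S(w) the number of sentences containing the word."""
    words = tokenize(sentence)
    f_w = words.count(word)
    n = len(words)
    s_w = sum(1 for s in all_sentences if word in tokenize(s))
    return f_w * math.log(n / s_w) if s_w else 0.0
```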
4.3 Classification
The classification task is performed in two steps: the first uses a rule-based algorithm and the second a machine-learning algorithm.
The process of choosing the classification rule set is based on the analysis of the distribution of each sentence attribute's values. To perform this task, a corpus of software testing papers was divided into three distinct subsets. The first set was used to analyze the values of the sentence attributes in order to identify attributes that were causing noise in the classification task. The second and third were used in the experiment to validate the method, to perform the training, and to execute tests.
To build the rule-based classification model it was necessary to discover the attributes whose values diverged widely between the two classification classes: "result sentence" and "no result sentence". To illustrate the concept, consider the analysis of the values of the attributes "Frequency of Keywords" and "Cue Method" in the graphs shown in Fig. 2. For the first, it is possible to observe a predominance of one kind of sentence for certain attribute values, making it a relevant attribute for the classification process. Unlike Frequency of Keywords, Cue Method shows the inverse behavior: the curve remains similar across the two categories, making this feature ineffective for sentence classification. Because of this, all attributes with similar value distributions for both categories were removed from the classification step.
Fig. 2. – Frequency of Keywords and Cue Method Graphs
The classification policy rules were developed based on observation of the value distributions of each analyzed attribute. Looking at the attribute values, it was possible to discover different patterns in the distributions between the two categories and to create rules to classify the sentences. We discovered, for example, that all result sentences contain a verb; hence the following rule was created: "If the sentence has no verb, it must be classified in the no result category". Some other rules are described in Table 1.
Table 1. - Rule Set Example
If the sentence has no verb, it must be set to "no result";
If the sentence, excluding "stop words", has a length less than or equal to 3, it must be defined as "no result";
Sentences with a value less than or equal to 600 for the attribute "frequency of keywords" must be defined as "no result".

The second classification stage is based on a machine-learning algorithm. Six algorithms were tested to select the heuristic to be used in our method: Naive Bayes (NB), Tree J4.8 (TJ48), Decision Table (DT), Support Vector Machines (SVM), Nearest Neighbor (IBK) and Multilayer Perceptron (MP). The tests were executed with a software testing corpus. The results are shown in Table 2.
Table 2. - Sentence Classification Result
Algorithm                       N. Bayes  T. J.48   TRD     SVM     MP      IBK
Result Sentences     Precision  26.2%     55.6%     0.0%    31.6%   32.7%   28.1%
                     Recall     17.5%      7.9%     0.0%     9.5%   28.6%   39.7%
No Result Sentences  Precision  93.0%     92.5%    91.9%    31.6%   93.8%   94.5%
                     Recall     95.7%     99.4%   100.0%     9.5%   94.8%   91.1%
Average (Results     Precision  87.6%     89.5%    84.5%    87.6%   88.9%   89.1%
and no results)      Recall     89.4%     92.1%    91.9%    91.0%   89.5%   86.9%

Two metrics were used in these tests: precision – the proportion of sentences correctly classified; and recall – the percentage of sentences identified out of all the sentences of the text [18]. In our study, we tried to select the "result sentences" in order to facilitate the researcher's work. In this case, it is better for the algorithm to classify "no result" sentences incorrectly than to miss "result" sentences: with a larger number of sentences in the final set the researcher will spend more time finishing his work, but without the result sentences his work could produce an incorrect result. Because of this, recall was prioritized in the selection of the machine-learning classification algorithm, and for this reason the Nearest Neighbor heuristic was chosen as the core of the second stage of the classification process.
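Putting the two steps together, the pipeline can be sketched as follows. The feature encoding, the rules shown, and the 1-nearest-neighbour implementation are illustrative stand-ins for the actual rule set and the Weka-based classifier:

```python
def rule_filter(sentence):
    """Step 1: rules in the spirit of Table 1 discard obvious
    'no result' sentences; returns None when undecided."""
    if not sentence["has_verb"]:
        return "no result"
    if len(sentence["words"]) <= 3:           # words excluding stop words
        return "no result"
    if sentence["keyword_frequency"] <= 600:
        return "no result"
    return None

def nn_classify(features, training):
    """Step 2: 1-nearest-neighbour over numeric attribute vectors."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(training, key=lambda ex: sq_dist(ex["features"], features))
    return nearest["label"]

def classify(sentence, training):
    """Full pipeline: apply the rule filter, then fall back to 1-NN."""
    label = rule_filter(sentence)
    return label if label is not None else nn_classify(sentence["features"], training)
```

A sentence that passes all three rules is handed to the nearest-neighbour step and receives the label of the closest training example.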
5 Feasibility Study With Textum
A feasibility study was developed to evaluate the precision, recall and effectiveness of the Textum algorithm in automatically identifying result sentences. Two questions were defined: Can the automatic method obtain better precision and recall rates in sentence classification than the rates obtained by humans in the experiment conducted by Cruzes [5]? Is it possible to reduce the text size without removing the sentences with results in the same proportion?
In this feasibility study, the annotation, text preprocessing, attribute calculation and classification rule steps were all executed using a specific tool developed for this work. The machine-learning classification was performed using the Weka library [9].
The first step was to select the papers to be used in the experiment. A corpus of 17 randomly selected papers from the software testing area was used in our research. Nine papers were used to define the classification rules and to select the machine learning algorithm, and the other eight papers were used to perform the feasibility study. A list of these papers can be found in [28]. The papers were imported and the sentences that represent results were marked in the tool. The sentences were annotated only to allow measuring the classification performance.
After that, the text mining processing was executed to index the sentences and create the matrix with them and their attribute values. The attributes described in Section 4.2 were calculated for each sentence in the set and stored in the matrix. This step finishes the pre-processing activity.
This matrix of sentences and attributes was the input to the classification activity, executed in the two steps described in Section 4.3. The rule-based classification was performed as a filter to remove false positives (non-result sentences) before the next classification step. The final output of this processing is a document with sentences and attributes to be loaded into Weka, where the machine learning classification is executed.
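The two-step classification can be sketched as follows; the single filtering rule and the 1-NN classifier below are simplified stand-ins (attribute values and the rule itself are illustrative, not those of Textum):

```python
import math

def rule_filter(sentences):
    """Step 1: drop obvious non-result sentences before machine learning;
    here, a single illustrative rule removes citation-only sentences."""
    return [s for s in sentences if not s["text"].lstrip().startswith("[")]

def nearest_neighbor(sentence, training):
    """Step 2: classify by the nearest training sentence in attribute space."""
    def dist(a, b):
        return math.dist(a["attrs"], b["attrs"])
    return min(training, key=lambda t: dist(sentence, t))["label"]

training = [
    {"attrs": [0.9, 3.0], "label": "result"},
    {"attrs": [0.1, 0.0], "label": "no-result"},
]
candidates = rule_filter([
    {"text": "[12] Prior work on testing.", "attrs": [0.2, 0.0]},
    {"text": "Precision reached 56%.",      "attrs": [0.8, 2.0]},
])
labels = [nearest_neighbor(s, training) for s in candidates]
```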
The study showed precision rates up to 56.5% and recall rates up to 60%, depending on the rule-based classification scheme used. These precision rates are close to those described in an experiment conducted with students at the University of Maryland, in which the students achieved rates close to 50% [5]. The main difference between the manual and automatic tests was the time spent to perform them: while the computational processing finished in minutes, the humans took hours to finish reading the texts.
After locating the result sentences, the tool highlights only them in the original text. Looking at the highlight screen (Fig. 3), the researcher can easily find the results in the full text. The main problem in this experiment was that, although the precision rates were close to the numbers found in the human experiment, the recall rates were too low.
Fig. 3. Highlight screen
However, in practice, we have noted that a researcher needs to read the whole paragraph to understand the context of a highlighted sentence, because sentences in the same paragraph help to explain the results. When the paragraphs containing highlighted sentences were completely analyzed, we discovered several other results that had not been automatically highlighted. This observation showed that if the tool highlights paragraphs instead of sentences, the recall rate reaches 72%, an increase of 12 percentage points over the original test.
Many paragraphs in a paper provide information about background or related studies. This information may be important for the reader to assess the confidence and validity of the study in an earlier step of the systematic review, but these aspects are not evaluated in the Results Extraction step, so reading these paragraphs implies unnecessary work for the researcher. We observed that most paragraphs containing result sentences are self-contained in explaining the context of the outcomes obtained. Furthermore, although they may appear scattered in the text, these paragraphs are usually clustered and have cross-references among the elements of each cluster, which also facilitates understanding of the context.
Therefore, we changed the feasibility study to pursue a new objective: instead of identifying only the result sentences, the algorithm tried to locate the paragraphs containing at least one sentence classified as a result of the study. With this new paradigm, the Textum method achieved a precision rate of 74% in classifying paragraphs in which there are results. Even with the change, the main goal, reducing the effort needed to create the review, was maintained. At the end of the process, the set of selected paragraphs represented approximately 20% of the initial text, a considerable reduction in the amount of information to be read by the researcher, with a consequent reduction of time.
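Expanding the selection from sentences to their enclosing paragraphs is a small post-processing step; a sketch assuming each sentence record carries a paragraph id (the field names are ours):

```python
def select_paragraphs(sentences, result_ids):
    """Return the ids of paragraphs containing at least one sentence
    classified as a result; the whole paragraph is then highlighted,
    giving the researcher the surrounding context."""
    return {s["para"] for s in sentences if s["id"] in result_ids}

doc = [
    {"id": 1, "para": 0}, {"id": 2, "para": 0},   # paragraph 0
    {"id": 3, "para": 1},                          # paragraph 1
    {"id": 4, "para": 2}, {"id": 5, "para": 2},   # paragraph 2
]
# sentences 2 and 5 were classified as results -> paragraphs 0 and 2 kept
kept = select_paragraphs(doc, {2, 5})
```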
5.1 Comparison of Textum with Other Algorithms
The methods described in Section 3 were applied to papers from the health area. As already discussed, this type of paper is structured, so that sentences of common interest are grouped into standard sections. The results of these papers, for example, are concentrated in a section called "Results" and the sentences that describe the methodology are in a section named "Method". The other two sections, "Introduction" and "Discussion", also gather sentences whose purpose is consistent with the section title.
However, as the area of interest in our work is Software Engineering (SE) and not health, we needed to test the performance of these algorithms in this domain. Unlike health papers, which use the IMRaD model, scientific papers in SE do not usually follow a fixed pattern to organize their content, so sentences of similar purpose are dispersed throughout the full text. In order to assess the impact of this feature on the automatic classification of sentences, we tested the algorithms of Agarwal [1] and Ibekwe-SanJuan [10] using a set of papers from a corpus developed by Cruzes et al. [5] in the software testing area. We could not test Teufel's algorithm because we did not have access to the code used by the researcher.
To evaluate the performance of Ibekwe-SanJuan's method on unstructured Software Engineering papers, we used a tool and a grammar definition file provided by the author. For an input set of a little more than two thousand sentences from the software testing corpus, the algorithm achieved an accuracy rate of 9.47%.
Agarwal's algorithm was tested using software provided by the author and the same input set used for the previous method. Despite the better results compared to Ibekwe-SanJuan, the rates were still not good: the test resulted in 25% accuracy in classifying results.
The main feature observed in both trials was the low performance of the classifiers, probably due to the use of unstructured papers as the input set, in contrast to the structured papers from the health area originally used by the researchers. A detailed description of this experiment is presented in [27].
Table 3 lists seven sentences extracted from unstructured papers of the software testing corpus, with the results of the automatic classification experiment using three different algorithms: Agarwal's and Ibekwe-SanJuan's algorithms, described previously, and our proposed algorithm, Textum.
The results show how difficult it is for the algorithms to automatically define the sentence types. The main problem is the lack of a pattern that characterizes the sentences. For example, for sentences such as the first two in Table 3, we could think of a rule stating that a percent sign in a sentence indicates that it is a result of the study. However, further analysis of the paper set shows that this sign is also commonly found in sentences that describe the method of the paper. The sentence "In the experiment, we used 30% of the original set as the training set", for example, is one example of a sentence appearing in the methods section.
Table 3. Result sentence classification test (sentences classified by the Agarwal, Ibekwe-SanJuan and Textum algorithms)

- 50% of the total effort required for error correction occurred in modified modules.
- 18% of errors have their source in mistakes in control logic or computation of an expression.
- Errors contained in modified modules were found to require more effort to correct than those in new modules, although the two classes contained approximately the same number of errors.
- Interfaces appear to be the major source of errors regardless of the module type.
- 50% of the total effort required for error correction occurred in modified modules.
- A major source of insight when analyzing a software development project is a record of the changes, including error corrections, made as the development progresses.
- The average effort to make a change was 5.0 man-hours, and the average to correct an error of any type was slightly higher, 5.4 man-hours.
6 Conclusion
In this paper we defined and applied a method, called the Textum method, for automatically locating results of empirical studies reported in unstructured published papers. The information in these papers is written in natural language, which is ambiguous even for humans. An experiment in which papers were given to groups of students to identify the sentences representing results [5] showed that, on average, only 53% of the sentences in the text were correctly located.
The tool proposed in this work performed even better than the students in that experiment. Of the three approaches for automatic semantic annotation of sentences discussed in this paper, none outperformed the proposed method when the input set consisted of papers that did not follow the IMRaD model. The precision and recall rates of those algorithms were below 30%, while the proposed method exceeded 47% precision at a recall of more than 60%. The precision rates obtained in the tests with Textum and in the experiment with humans [5] were similar.
Meanwhile, the recall rates of the automatic selection were lower than those of the human selection. However, we noticed that a text composed only of "result" sentences was not understood by the researchers: they needed the context information present in adjacent sentences to execute the systematic review process. Thus, instead of selecting only the "result" sentences of the text, we changed the paradigm and started to select the whole paragraphs containing at least one sentence classified as "result".
This change increased the precision level to 74% and the recall level to 72%. By selecting only the paragraphs, the Textum method reduced the text to be analyzed by researchers to 20% of its original size, which, in theory, would reduce by about 80% the time spent on traditional analysis of the paper. It is noteworthy that the Textum method is focused on Software Engineering papers. The use of this method in other study areas, especially health, has not been evaluated; future work includes creating an annotated corpus of papers from another area and evaluating the efficacy of the algorithm on those papers.
ACKNOWLEDGEMENTS
This work was partially supported by the National Institute of Science and Technology for Software Engineering (INES), funded by CNPq, grant 573964/2008-4.
REFERENCES
1. Agarwal, S.; Yu, H. Automatically Classifying Sentences in Full-Text Biomedical Articles into Introduction, Methods, Results and Discussion. In Proceedings of the AMIA Summit on Translational Bioinformatics, 2009.
2. Allen, I. E.; Olkin, I. Estimating time to conduct a meta-analysis from number of citations received. Journal of the American Medical Association, 282(7): 634–635.
3. Biolchini, J.; Mian, P. G.; Natali, A. C. C.; Travassos, G. H. Systematic Review in Software Engineering. Technical Report ES 679/05, Univ. Rio de Janeiro, 2005.
4. Cruzes, D.; Dybå, T. Research synthesis in software engineering: A tertiary study. Information & Software Technology 53(5): 440–455, 2011.
5. Cruzes, D.; Mendonça, M. G.; Basili, V. R.; Shull, F.; Jino, M. Extracting Information from Experimental Software Engineering Papers. SCCC 2007: 105–114.
6. Dybå, T.; Kitchenham, B. A.; Jørgensen, M. Evidence-based Software Engineering for Practitioners. IEEE Software, 22(1): 58–65, 2005.
7. Felizardo, K. R.; Nakagawa, E. Y.; Feitosa, D.; Minghim, R.; Maldonado, J. C. An Approach Based on Visual Text Mining to Support Categorization and Classification in the Systematic Mapping. In: 13th International Conference on Evaluation & Assessment in Software Engineering (EASE 2010), 2010.
8. Hachey, B.; Grover, C. Sequence modeling for sentence classification in a legal summarization system. In Proceedings of the 2005 ACM Symposium on Applied Computing, 2005.
9. Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I. H. The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1), 2009.
10. Ibekwe-SanJuan, F.; Chen, C.; Pinho, R. Identifying Strategic Information from Scientific Articles through Sentence Classification. 6th International Conference on Language Resources and Evaluation (LREC-08), Marrakesh, Morocco, 26 May–1 June, 2008.
11. Khoo, A.; Marom, Y.; Albrecht, D. Experiments with Sentence Classification. In Proceedings of the Australian Language Technology Workshop, pages 18–25, 2006.
12. Kitchenham, B. A. Procedures for Performing Systematic Reviews. Keele University Technical Report TR/SE-0401 and NICTA Technical Report 0400011T.1, 2004.
13. Kitchenham, B. A.; Dybå, T.; Jørgensen, M. Evidence-based Software Engineering. Proc. ICSE'04, Edinburgh, Scotland, 23–28 May, pp. 273–281, 2004.
14. Larocca, J.; Santos, A. D.; Kaestner, A. A.; Freitas, A. A. Generating Text Summaries through the Relative Importance of Topics. In: Proceedings of the International Joint Conference IBERAMIA/SBIA, Atibaia, SP, 2000.
15. Luhn, H. P. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2: 157–165, 1958.
16. Malheiros, V. D.; Hohn, E.; Pinho, R.; Mendonça Neto, M. G.; Maldonado, J. C. A Visual Text Mining approach for Systematic Reviews. In: Empirical Software Engineering and Measurement (ESEM 2007): 245–254, 2007.
17. McKnight, L.; Srinivasan, P. Categorization of Sentence Types in Medical Abstracts. In AMIA Symposium, 2003.
18. Olson, D.; Delen, D. Advanced Data Mining Techniques. Springer Verlag, 2008.
19. Pardo, T. A. S.; Rino, L. H. M.; Nunes, M. G. V. NeuralSumm: Uma Abordagem Conexionista para a Sumarização Automática de Textos. In Anais do IV Encontro Nacional de Inteligência Artificial (ENIA), pp. 1–10, Campinas, SP, Brasil, 2003.
20. Pardo, T. A. S. GistSumm: um sumarizador automático baseado na idéia principal de textos. Série de Relatórios do Núcleo Interinstitucional de Linguística Computacional, São Paulo, 2002.
21. Rosenberg, W.; Donald, A. Evidence based medicine: an approach to clinical problem-solving. British Medical Journal, 310(6987): 1122–1126, 1995.
22. Ruch, P.; Geissbühler, A.; Gobeill, J.; Lisacek, F.; Tbahriti, I.; Veuthey, A. L.; Aronson, A. R. Using Discourse Analysis to Improve Text Categorization in MEDLINE. Medinfo, 2007.
23. Salton, G.; Buckley, C. Term-weighting approaches in automatic text retrieval. Information Processing & Management 24(5): 513–523, 1988.
24. Silva Rocha, M. C. ContextExtractor: uma ferramenta de apoio para a extração de informações de contexto de artigos de engenharia de software experimental. Master Thesis, Universidade Salvador, 2009.
25. Tbahriti, I.; Chichester, C.; Lisacek, F.; Ruch, P. Using argumentation to retrieve articles with similar citations: An inquiry into improving related articles search in the MEDLINE digital library. International Journal of Medical Informatics, 75(6): 488–495, 2006.
26. Teufel, S.; Moens, M. Discourse-level argumentation in scientific articles: human and automatic annotation. In: Towards Standards and Tools for Discourse Tagging, ACL 1999 Workshop, 1999.
27. Torres, J. A. S.; Cruzes, D. S.; Salvador, L. N. Automatic Results Identification in Software Engineering Papers. Is it possible? In Proceedings of the International Conference on Computational Science and Its Applications (ICCSA), 2012.
28. Torres, J. A. S. Automatic summarization of software engineering papers to support the systematic review process. Master Thesis, Salvador University, Graduate Program in Computer Science, Salvador, BA, Brazil, 2011.
29. Zhang, H.; Ali Babar, M. An Empirical Investigation of Systematic Reviews in Software Engineering. Empirical Software Engineering and Measurement (ESEM 2011): 87–96.
Software Measurement Architectures: A Mapping Study
Ciro Xavier Maretto1, Monalessa Perini Barcellos1
1 Ontology and Conceptual Modeling Research Group (NEMO), Department of Computer
Science, Federal University of Espirito Santo, Vitória, Brazil {ciro.maretto, monalessa}@ufes.br
Abstract. During the execution of software projects, it is necessary to collect, store and analyze data to support project and organizational decisions. Software measurement is a fundamental practice for project management and process improvement. It is present in the main models and standards that address software process improvement, such as ISO/IEC 12207, CMMI and MR MPS.BR. In order to perform software measurement effectively, an infrastructure is necessary to support data collection, storage and analysis. This article presents a study that investigated measurement architectures described in the literature. As a result, eight architectures were found. Their main characteristics were analyzed and are presented in this paper.
Keywords: Systematic Mapping Study, Software Measurement, Measurement Architecture, Measurement Repository.
1 Introduction
Software measurement is used by organizations in many ways. For instance, in the context of project management, measurement helps to develop realistic plans, monitor progress, identify issues and justify decisions [1]. Throughout projects, data are collected for the measures and should be stored in a measurement repository in order to be used in project management and process improvement [2]. In maturity models that organize software processes into maturity levels, such as CMMI (Capability Maturity Model Integration) [3] and MR MPS.BR (Reference Model for Process Improvement of Brazilian Software) [4], measurement is located at the initial levels (CMMI level 2 and MR MPS.BR level F) and evolves as the maturity level increases. At high maturity levels (CMMI levels 4 and 5 and MR MPS.BR levels A and B), statistical process control (SPC) must be carried out, and it requires extra attention to some measurement aspects, such as data storage.
It is not easy to implement and maintain a measurement repository capable of meeting the needs associated with the organization's maturity level. Usually, organizations start recording measurement data in spreadsheets or in systems with little or no integration among them [5]. At initial maturity levels, spreadsheets seem to be enough, but as the organization's maturity level increases, the problems of using spreadsheets become more significant. Often, to achieve high maturity, organizations need to discard data stored in spreadsheets, develop a measurement repository using appropriate technologies (e.g., database management systems), and restart the collection and storage of project data. Thus, a good practice is to define an infrastructure that supports software measurement and can be used from the beginning of a measurement program up to the high maturity levels (or that can be extended to that end) [2].
This infrastructure is made of components and can be defined by means of an architecture. According to Zachman [6], an architecture can be understood as a logical structure in which components are organized and integrated. In the software measurement context, the architecture should consider aspects related to data collection, storage and analysis. One of the main components of a measurement architecture is the measurement repository. According to Bernstein [7], a repository can be defined as a shared database of information about engineering artifacts. In a measurement architecture, the measurement repository stores measurement data (not limited to the data collected for the measures) and acts as a data provider for analysis.
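A minimal model of such a repository, measures plus collected data serving as a data provider for analysis, might look like the sketch below (class and field names are ours, not taken from any surveyed proposal):

```python
from dataclasses import dataclass, field

@dataclass
class Measure:
    name: str
    unit: str
    entity: str            # what is measured: a process, a product, ...

@dataclass
class MeasurementRepository:
    """Stores measurement data and acts as a data provider for analysis."""
    data: dict = field(default_factory=dict)   # measure name -> values

    def collect(self, measure: Measure, value: float):
        self.data.setdefault(measure.name, []).append(value)

    def series(self, name: str):
        """Data provider for analysis (e.g. control charts)."""
        return list(self.data.get(name, []))

repo = MeasurementRepository()
size = Measure("module size", "LOC", "product")
for v in (120, 340, 90):
    repo.collect(size, v)
```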
Aiming to identify proposals for software measurement architectures, we carried out an investigation of the literature. According to Kitchenham [8], a systematic mapping (also known as an exploratory study) is an extensive study of a topic within a specific theme that aims to identify the available evidence about that topic. In this sense, we carried out a systematic mapping. For each identified architecture we analyzed its characteristics and verified whether the proposal provides support for SPC.
In this paper, we present the main results of the study. After this introduction, Section 2 briefly presents software measurement and statistical process control; Section 3 describes the methodology used; Section 4 presents the research protocol; Section 5 shows the main results obtained; Section 6 makes some considerations about the results; and, finally, Section 7 presents some final considerations.
2 Software Measurement and Statistical Process Control
Software measurement is a primary support process for managing projects. It is also a key discipline in evaluating the quality of software products and the performance and capability of organizational software processes. The software measurement process includes the following activities: planning the measurement process, execution of the measurement process, and measurement evaluation [9].
To perform software measurement, an organization must first plan it. Based on its goals, the organization defines which entities (processes, products and so on) to consider for software measurement and which of their properties (size, cost, time, etc.) to measure. The organization also defines which measures are to be used to quantify those properties. For each measure, an operational definition should be specified, indicating, among other things, how the measure must be collected and analyzed. Once planned, measurement can start. Measurement execution involves collecting data for the defined measures, according to their operational definitions. Once collected, the data should be analyzed. Data analysis provides information for decision making, supporting the identification of appropriate actions. Finally, the measurement process and its products should be evaluated in order to identify potential improvements [10].
Depending on the organization's maturity level, software measurement is performed in different ways. At initial maturity levels, such as CMMI levels 2 and 3, the focus is on developing and sustaining a measurement capability that supports project management information needs. At high maturity levels, such as CMMI levels 4 and 5, measurement is performed for the purpose of statistical process control (SPC), in order to understand process behavior and to support software process improvement efforts [11]. SPC uses a set of statistical techniques to determine whether a process is under control from the statistical point of view. A process is under control if its behavior is stable, i.e., if its variations are within the expected limits, calculated from historical data. The behavior of a process is described by data collected for the performance measures defined for that process [12].
A process under control is a stable process and, as such, has repeatable behavior. So, it is possible to predict its performance in future executions and, thus, to prepare achievable plans and improve the process continuously. On the other hand, a process that varies beyond the expected limits is an unstable process, and the causes of these variations (called special causes) must be investigated and addressed by improvement actions in order to stabilize the process. Once processes are stable, their levels of variation can be established and sustained, making it possible to predict their results. Thus, it is also possible to identify the processes that are capable of achieving the established goals and the processes that are failing to meet them. In this case, actions to change the process in order to make it capable should be carried out [12].
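Checking stability against limits computed from historical data can be sketched with the usual 3-sigma individuals chart (a simplification of the control charts used in practice):

```python
from statistics import mean, pstdev

def control_limits(history):
    """Centre line and 3-sigma limits derived from historical data."""
    m, s = mean(history), pstdev(history)
    return m - 3 * s, m, m + 3 * s

def unstable_points(data, lcl, ucl):
    """Points outside the limits signal special causes to investigate."""
    return [x for x in data if x < lcl or x > ucl]

# Historical performance data for some process measure
history = [10.2, 9.8, 10.1, 10.0, 9.9, 10.0]
lcl, centre, ucl = control_limits(history)
outliers = unstable_points([10.1, 12.5, 9.9], lcl, ucl)
```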
Statistical process control requires some changes in traditional measurement, especially related to the operational definition of measures, data collection frequency, measurement granularity, data homogeneity and data grouping for analysis [2, 13].
3 Methodology
In order to perform the systematic mapping, we used the process proposed in [14], which was defined based on [8]. It consists of the following three activities:
i) Develop Research Protocol: In this step the researcher prospects the topic of interest, defines the context to be considered in the study, and describes the object of analysis. Next, he/she defines the research protocol that will be used as a guideline to perform the research. The protocol must contain all the information necessary for a researcher to perform the research (research questions, source selection criteria, publication selection criteria, procedures for storing and analyzing the results, and so on). The protocol must be tested in order to verify its feasibility, i.e., whether the results obtained are satisfactory and whether the protocol execution is viable in terms of time and effort. The test results allow improving the protocol when necessary. If the protocol is viable, an expert must evaluate it and, once it is approved, the protocol can be used to conduct the research.
ii) Perform Research: In this step the researcher performs the research according to the research protocol. Publications are selected, and data are extracted, stored, and quantitatively and qualitatively analyzed.
iii) Provide Results: In this step the research results produced during the execution of the systematic review process should be packaged and published in a conference, journal, technical report or other publication vehicle.
4 Research Protocol
The research protocol used in the study contains the following information: objective, research questions, sources selection criteria, publications selection criteria, data storage and data analysis procedures, and protocol test procedure.
A. Objective
Analyzing the literature in the context of software measurement architectures, with the main purpose of identifying and analyzing:
(i) proposals for software measurement architectures;
(ii) the characteristics of the proposals;
(iii) whether the proposals are capable of supporting statistical process control.
B. Research Questions
Q1. Which proposals for software measurement architectures are reported in the literature?
Q2. What are the characteristics of the proposals?
Q3. Which proposals include support for statistical process control?
In Q3, support for statistical process control consists in supporting: data collection, storage, representation (by means of control charts), and process behavior analysis.
C. Sources
The publication sources must be digital libraries and must:
(i) have a search mechanism that allows the use of logical expressions and searches in different parts of the publications;
(ii) be available in the CAPES (Coordination for the Improvement of Higher Education Personnel) Journals Portal1;
(iii) include publications in the Physical Science area, in particular Computer Science.
D. Procedure for Publications Selection
The objects of analysis are papers published in conferences and journals. Publication selection must be done in three steps:
1st Step – Preliminary selection and cataloging: the preliminary selection must be done by applying the following criteria using the digital library search mechanism:
Scope: title, abstract and keywords.
Language: English.
Search String: ("measurement framework" OR "measurement database" OR "measurement repository" OR "measurement architecture" OR "metrics repository" OR "metrics database") AND "software".
Period: from 1990.
Area: Computer Science.
To establish the search string, we performed some tests using different terms, logical connectors, and combinations of them, aiming to obtain a search string able to return publications relevant to the study in a quantity viable to be analyzed.
1 CAPES Journals Portal (www.periodicos.capes.gov.br/) is sponsored by the Brazilian government and offers access to the publications of many international and national sources, covering all knowledge areas.
During the informal literature review that preceded the study, we found some relevant publications addressing measurement repositories. Although these publications use the term measurement repository, in the context of this study they address measurement architectures. Therefore, we decided to include terms related to repositories in the search string.
Also during the informal review, we identified two relevant publications ([15] and [16]) that we used as control publications to evaluate the candidate search strings (the string must be able to return the control publications). The tests to obtain the search string were carried out using the digital libraries Scopus (www.scopus.com) and IEEE (ieeexplore.ieee.org). Scopus was selected because it returned the largest number of publications during preliminary tests. IEEE, in turn, was selected because the control publication [16] was available only in IEEE.
Considering the test results, we decided to select a comprehensive string and to restrict the publication selection in the later steps, since more restrictive strings excluded one or both of the control publications. The selected string returned many publications that deal with measurement repositories not related to software measurement, but to scientific experiments from other computing areas. However, when we tried to restrict the publications by using the term "software measurement" instead of "software", the search results were very restricted and one of the control publications was not returned. So, even being comprehensive, the selected string was the one that provided the best results in terms of number and relevance of selected publications.
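The requirement that a viable string "must return the control publications" can be checked automatically; a toy evaluator for the boolean string above (matching here is naive substring search over titles, unlike a real digital-library engine, and the publication titles are invented for illustration):

```python
def matches(pub, repo_terms, extra_term="software"):
    """A publication matches if any repository-related term AND the
    extra term occur in its title/abstract/keywords text."""
    text = pub.lower()
    return extra_term in text and any(t in text for t in repo_terms)

# The OR-connected terms of the search string
TERMS = ["measurement framework", "measurement database",
         "measurement repository", "measurement architecture",
         "metrics repository", "metrics database"]

control_pubs = [
    "A software measurement repository for high maturity organizations",
    "Towards a metrics database supporting software process improvement",
]
# A viable string must return every control publication.
viable = all(matches(p, TERMS) for p in control_pubs)
```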
We decided to apply the search string to the title, abstract and keywords, because some tests applying the string to the full text returned a large number of publications, many of them useless. On the other hand, restricting the string only to the title eliminated useful publications.
2nd Step – Selection of Relevant Publications – 1st filter: selecting publications by applying a search string does not ensure that all selected publications are relevant, because such selection is restricted to syntactic aspects. Thus, the abstracts of the publications selected in the 1st step must be analyzed. Publications that do not satisfy one (or both) of the following criteria must be eliminated:
SC1: The publication addresses the collection, storage, analysis or retrieval of measurement data. SC2: The publication addresses some kind of software measurement architecture or measurement repository.
We refer explicitly to measurement repositories in SC2 (and in SC3, presented later) because, as said before, we noticed that some publications address measurement repository proposals that represent an architecture, according to the architecture concept used in this study (see Introduction).
In order to avoid premature exclusions of publications, in case of doubt, the publication should not be eliminated. Besides, publications without an abstract should not be eliminated.
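The keep-on-doubt rule of the 1st filter can be sketched as a small predicate. This is our illustrative sketch, not tooling from the study; the reviewer judgments for SC1 and SC2 are passed in as `True`, `False`, or `None` (doubt):

```python
# Sketch of the 1st filter (2nd step): keep a publication unless its
# abstract clearly fails SC1 or SC2. Publications with no abstract, or
# doubtful cases, are kept to avoid premature exclusion.

def passes_first_filter(pub, satisfies_sc1, satisfies_sc2):
    """pub is a dict with an optional 'abstract' key; the satisfies_*
    arguments are reviewer judgments: True, False, or None (doubt)."""
    if not pub.get("abstract"):          # no abstract: do not eliminate
        return True
    if satisfies_sc1 is None or satisfies_sc2 is None:
        return True                      # in case of doubt, keep
    return satisfies_sc1 and satisfies_sc2

print(passes_first_filter({"abstract": ""}, False, False))   # True
print(passes_first_filter({"abstract": "..."}, True, None))  # True
print(passes_first_filter({"abstract": "..."}, False, True)) # False
```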
CIbSE 2013 | X Workshop Latinoamericano Ingeniería de Software Experimental | ESELAW 2013

3rd Step – Selection of Relevant Publications – 2nd filter: the selection of publications in the 2nd step considers only the abstract. Consequently, it is possible that some selected publications do not contain relevant information. Therefore, the full text of the publications selected in the 2nd step must be read. Publications that do not satisfy one (or both) of the following criteria must be eliminated:
SC3: The publication describes software measurement architectures or measurement repositories.
SC4: The full text is accessible.
E. Data Storage Procedure
Each publication selected in the 1st step must be catalogued with the following data: title, author(s), year, reference data, source (digital library), and a summary. Each catalogued publication must be examined and submitted to the next two steps. The publications eliminated in the 2nd step must be identified as “E2: SC[number of the criterion not satisfied]”. Similarly, publications eliminated in the 3rd step must be identified as “E3: SC[number of the criterion not satisfied]”.
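The cataloguing and elimination-marking procedure could be represented by a record like the following (a sketch with class and field names of our choosing; the example values are hypothetical):

```python
# Hypothetical catalogue record for the data storage procedure. The
# elimination marker encodes the step and the unsatisfied criterion,
# e.g. "E2: SC1" or "E3: SC4".
from dataclasses import dataclass

@dataclass
class CataloguedPublication:
    title: str
    authors: list
    year: int
    reference_data: str
    source: str           # digital library ("Scopus" or "IEEE")
    summary: str = ""
    eliminated: str = ""  # empty while the publication remains selected

    def eliminate(self, step: int, criterion: int) -> None:
        # Mark elimination as "E2: SC<n>" (2nd step) or "E3: SC<n>" (3rd step)
        self.eliminated = f"E{step}: SC{criterion}"

pub = CataloguedPublication("A hypothetical paper", ["Doe, J."], 2005,
                            "Some Journal 1:1-10", "Scopus")
pub.eliminate(step=3, criterion=4)  # full text not accessible (SC4)
print(pub.eliminated)               # E3: SC4
```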
F. Data Extraction and Analysis Procedure
For each publication selected in the 3rd step, the following information must be extracted:
(i) Proposal identification. The identification is the proposal name as cited in the publication. If the proposal has no name, it is identified as “Proposal XYZ”, where XYZ are the initials of the proposal's authors;
(ii) A brief description of the proposal;
(iii) Proposal characteristics, organized according to the following categories: Technology, Architecture, Collection, Storage, and Analysis;
(iv) Indication of whether the proposal supports statistical process control (SPC).
Regarding (iv), “Yes” must be recorded for proposals whose publications make the support for SPC explicit. “Probably Applicable” must be recorded for proposals that do not make the support for SPC explicit, but that apparently are able to support it. “No” must be recorded for proposals that do not mention support for SPC and for which it is not possible to conclude that they support it.
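The three-valued classification rule for item (iv) can be sketched as follows (our illustration; the boolean judgments come from reading each publication):

```python
# Encoding of the SPC-support classification rule: explicit support ->
# "Yes"; not explicit but apparently able -> "Probably Applicable";
# otherwise -> "No".
def classify_spc_support(explicit: bool, apparently_able: bool) -> str:
    if explicit:
        return "Yes"
    if apparently_able:
        return "Probably Applicable"
    return "No"

print(classify_spc_support(True, False))    # Yes
print(classify_spc_support(False, True))    # Probably Applicable
print(classify_spc_support(False, False))   # No
```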
After the data is extracted from publications, a quantitative and qualitative analysis must be done with the main purpose of discussing the findings related to the research questions.
G. Test Protocol Procedure
The research protocol must be tested using a reduced number of sources in order to verify whether it is viable. The protocol is considered viable if the procedures can be performed as described, if it is possible to answer the research questions, and if the time and effort required are acceptable. During the protocol tests, some points need special attention:
(i) Number of publications selected in the 1st step: a large number of selected publications may mean that the string should be refined, because it is probably covering a larger domain than the target domain. This can be confirmed if many publications are eliminated in the subsequent steps. On the other hand, a small number of selected publications may mean that many useful publications have been prematurely excluded, that is, the search string is probably too restrictive.
(ii) Number of publications selected in the 2nd step: a large number of publications selected in the 2nd step, relative to the number selected in the 1st step, might mean either that the 2nd step criteria are too close to the search string and must be reviewed, or that the 2nd step is unnecessary.
(iii) Number of publications selected in the 3rd step: a small number of publications selected in the 3rd step, relative to the number selected in the 2nd step, suggests that the criteria from the previous step should be refined, because they are probably too broad in relation to the target domain. It is also important to consider that, in fact, only a small number of publications provide useful information for the research. Thus, once the criteria are aligned with the research objective and the target domain, the criteria can be considered appropriate even if the number of selected publications is small.
5 The Results
The protocol presented in the previous section was evaluated by an expert. Then, it was tested using the digital library IEEE. The protocol was considered viable and was executed once more using the digital library Scopus. This section presents some results obtained from these two executions, carried out in November and December of 2011. Publications selected in both digital libraries were counted only once. In total, 148 publications were selected in the 1st step, 22 in the 2nd step and 12 in the 3rd step.
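The counting rule (each publication counted only once, even when returned by both libraries) amounts to a set union followed by successive filters. A sketch with hypothetical publication identifiers, arranged so that the totals match the study's funnel of 148, 22 and 12:

```python
# Hypothetical identifiers: 130 Scopus hits and 54 IEEE hits, 36 shared.
scopus_hits = {"p%03d" % i for i in range(1, 131)}
ieee_hits   = {"p%03d" % i for i in range(95, 149)}

step1 = scopus_hits | ieee_hits   # union: each publication counted once
print(len(step1))                 # 148 selected in the 1st step

# The 2nd and 3rd filters then narrow the set (148 -> 22 -> 12 in the
# study); the predicates below are stand-ins for the real criteria.
step2 = {p for p in step1 if p <= "p022"}   # stand-in for the 1st filter
step3 = {p for p in step2 if p <= "p012"}   # stand-in for the 2nd filter
print(len(step2), len(step3))               # 22 12
```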
The significant decrease in the number of publications in the 2nd step was expected, since we decided to use a comprehensive search string, as argued in Section 4.
It is worth mentioning that the focus of the study is on measurement architectures; for this reason, publications that described lessons learned and case studies mentioning the use of a measurement architecture (without describing the architecture) were excluded when the selection criteria were applied.
Analyzing the publications per year, of the 148 publications selected by the search string (1st step), 25 (17%) date from 1990 to 2000 and 123 (83%) date from 2001 to 2011. Of the 12 publications selected in the 3rd step, a quarter date from 2009 on. Moreover, even though we limited the search to publications from 1990 on, the oldest publications found are from 1999 and 2000.
From the publications selected in the 3rd step, 8 proposals were identified. Table 1 presents a brief description of the proposals and their respective publications.
We analyzed the characteristics of each proposal. Due to space limitations, it is not possible to present the characteristics in detail; a summary is presented in Table 2. It is worth saying that the publications describe their proposals with different levels of detail and with different foci. Consequently, information regarding the characteristics also has heterogeneous levels of detail. For instance, some proposals describe the characteristics of the adopted architecture in detail, while others just mention the general model on which the architecture is based, and others say nothing about their architecture. In Table 2, when information regarding a category is not shown, it means that it was not possible to obtain information about it by reading the publications.
Table 1. Proposals found.

P01 - Generic Measurement Framework Based on MDA (refs. [15, 17, 18, 19]): Software measurement framework that supports software measurement entities through metamodels and transformations. For example, given a model of an ER (Entities and Relationships) diagram, measures such as the quantity of tables and relationships can be automatically calculated using the framework. For this, the framework uses a domain model and a measurement model, which say which entities will be measured and what methods will be used. These models go through QVT (Query View Transformation) transformation processing, which generates the measurements.

P02 - WebEv (Web for the Evaluation) (refs. [21, 22]): System that uses a measurement framework based on GQM (Goal Question Metric) [20] for business process evaluation and gives support to data collection, storage and analysis. It was defined in terms of measures, mechanisms for data collection and guides for using the collected data.

P03 - NSDIR (National Software Data and Information Repository) (ref. [23]): An organizational benchmarking repository for software projects from the U.S. Air Force. It was operational from 1994 to 1998. Although its use ended in 1998, the industry and academia efforts continued through CeBASE (Center for Empirically-Based Software Engineering).

P04 - MRS (Measurement Repository System) (ref. [24]): A measurement repository used by a group of telecommunication companies. One of its main purposes was supplier and product evaluation through report generation that compiled data from all participating companies. The safety and privacy of the information was a major concern of the repository.

P05 - MMR Tool (ref. [16]): Proposal of a generic and flexible measurement repository for data collection, storage, analysis, and publication. It was designed to support all CMMI levels and was applied at Ericsson Research Canada.

P06 - SPDW+ (Software Development Process Performance Data Warehousing) (ref. [25]): Presents the data warehousing architecture SPDW+ as a repository solution centered on measurements, with automatic collection and analysis mechanisms. SPDW+ is an improvement of SPDW, which was operational for 3 years at HP Brazil. It was developed to support process improvement in mature organizations.

P07 - A Universal Metrics Repository (ref. [26]): Proposes a structure for a flexible measurement repository, able to adapt itself to different life cycle models, methodologies, and software development processes. The proposal uses the transformational view of software development, which considers the software development process as a series of artifact transformations.

P08 - Proposal PAU (ref. [27]): Presents a generic framework that incorporates a database, a formal set of software tests and evaluation measures, as well as an advanced set of analytical techniques for information and knowledge extraction. The approach proposes using this framework and its techniques to extract detailed information and knowledge from software measurement repositories.
Table 2. Overview of the general characteristics of the identified proposals.

P01
- Technology: DSL (Domain-Specific Language) and tools based on the Eclipse platform
- Architecture: based on MDA (Model Driven Architecture)
- Collection: automatic (through model transformations)
- Storage: XML file
- SPC Support: No

P02
- Technology: Java (Java JDBC and Java Servlet API)
- Collection: semi-automatic (via web form)
- Storage: database
- Analysis: quantitative analysis resources
- SPC Support: Probably Applicable

P03
- Technology: Sun Solaris Unix, Oracle, and a client in Visual Basic with ODBC (Open Database Connectivity)
- Architecture: client-server (central repository that stores data collected by client software)
- Collection: manual and semi-automatic (through physical or electronic forms)
- Storage: database
- Analysis: analysis tools in a benchmark style
- SPC Support: No

P04
- Architecture: client-server (central repository that stores data collected by client software)
- Collection: semi-automatic (through electronic forms)
- Storage: database
- Analysis: generation of quarterly reports
- SPC Support: No

P05
- Technology: Microsoft technologies and tools (SQL Server 2000, Analysis Services Enterprise Edition, Internet Information Server, Intranet SharePoint Portal Server, ASP)
- Architecture: based on a data warehouse environment
- Collection: semi-automatic; intended to use ETL (Extraction, Transformation and Loading) to collect voluminous and periodic data
- Storage: data warehouse; the database model is generic, for data flexibility
- Analysis: SQL (Structured Query Language) and OLAP (On-line Analytical Processing) cubes; data is presented via a web portal, and it is possible to export data to statistics tools
- SPC Support: Yes

P06
- Technology: Microsoft technologies and tools (SQL Server 2005, BI Studio, Visual Studio 2005, SQL Server Integration Services and IIS 6.0)
- Architecture: service oriented (SOA – Service Oriented Architecture) and based on a data warehouse environment with four components
- Collection: semi-automatic and automatic, by using ETL
- Storage: data warehouse
- Analysis: BI (Business Intelligence) tools with a web interface, including OLAP and dashboards
- SPC Support: Yes

P07
- Technology: MySQL (only the repository is implemented)
- Storage: database; the database model is generic, for data flexibility
- SPC Support: No

P08
- Collection: semi-automatic
- Storage: database
- Analysis: statistical techniques and others, such as multiresolution analysis, classification trees, neural networks and influence diagrams
- SPC Support: No
6 Discussion

In this section we present additional information and some considerations about the results presented in the previous section. In general, the identified proposals are very different. Unfortunately, based on the information available in the publications, it is often not possible to compare the proposals in a substantial way. Regarding the proposals' characteristics, some considerations are presented below:
Technology
The technologies used in the proposals are diverse, varying from free software to proprietary technologies. This may be a reflection of the variety of technological solutions available in the market.
Architecture
All the proposals, except the Generic Measurement Framework Based on MDA [15, 17, 18, 19], include in their architecture a central repository to store and retrieve data, using a client-server architecture. The proposals MRS [24] and NSDIR [23] have specific client programs for communication with the server. WebEv [21, 22], MMR Tool [16], and SPDW+ [25], in turn, use web resources. The proposals SPDW+ [25] and MMR Tool [16] have architectures based on a data warehouse environment, including a component for data collection (ETL), a component for storage (the data warehouse) and a component for analysis with analytical capabilities (OLAP). SPDW+ [25] includes a fourth component, responsible for data integration, which acts as a temporary repository for standardization of the collected data.
The Generic Measurement Framework Based on MDA [15, 17, 18, 19] is a conceptual architecture and an adaptation of MDA. It is divided into levels, ranging from MOF (Meta-Object Facility) to measurement data, and also includes a measurement metamodel based on a software measurement ontology.
Collection
Table 2 shows three types of collection: manual, semi-automatic and automatic. Manual collection refers to the use of physical forms on which people record the data collected for the measures. Semi-automatic collection refers to the use of computational support (for instance, electronic forms and information systems) to record the data collected for the measures; although there is computational support, data are supplied by people. Automatic collection refers to the use of computational tools and mechanisms that obtain data for the measures without human intervention.
Most of the proposals use semi-automatic collection. The publications that describe the proposals MMR Tool [16] and MRS [24] mention the intention of using automatic collection mechanisms, but these mechanisms are not presented in the publications. Only two proposals implemented automatic collection: the Generic Measurement Framework Based on MDA [15, 17, 18, 19], by means of model transformations, and SPDW+ [25], with an ETL component. It is important to emphasize that these proposals deal with very specific types of measures (for instance, the quantity of tables and relationships in a certain data model, or the number of errors in a portion of source code), which are more favorable to automatic collection. Proposals that deal with measures whose automatic collection is more difficult or not possible adopt semi-automatic collection. This can be seen as a sign of the difficulty and, in some cases, impossibility of adopting automatic collection. Only one proposal (NSDIR [23]) uses manual collection, and the data collected on physical forms are afterwards recorded in electronic forms.
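As an illustration, the collection types reported in Table 2 can be tallied to confirm that semi-automatic collection predominates (the proposal-to-type mapping is taken from Table 2; `None` marks a proposal whose collection type is not reported):

```python
# Collection type of each proposal, as reported in Table 2.
COLLECTION = {
    "P01": "automatic",
    "P02": "semi-automatic",
    "P03": "manual and semi-automatic",
    "P04": "semi-automatic",
    "P05": "semi-automatic",
    "P06": "semi-automatic and automatic",
    "P07": None,  # not reported
    "P08": "semi-automatic",
}

semi = sorted(p for p, c in COLLECTION.items() if c and "semi-automatic" in c)
print(len(semi), semi)  # 6 of the 8 proposals involve semi-automatic collection
```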
Storage
The proposals use three different solutions for data storage: a relational database (WebEv [21, 22]), XML (eXtensible Markup Language) files (Generic Measurement Framework Based on MDA [15, 17, 18, 19]), and solutions based on databases. Although most of the proposals adopt solutions based on databases, we noticed that each proposal supports the storage of different measurement data. We believe that this occurs mainly because the repository structure (the database “model”) is defined based on the specification of which entities and elements are to be measured and which information needs are expected to be satisfied by the measurement data.
We also noticed that some proposals provide flexibility regarding which measurement data can be stored. For instance, MMR Tool [16] uses a measurement domain meta-level structure as a data model, with the purpose of allowing adaptation to different measurement contexts. On the other hand, the Universal Metrics Repository [26] is itself a flexible database that aims to store any data from any measures related to different entities.
Finally, we observed that the proposals which include support for statistical process control (SPDW+ [25] and MMR Tool [16]) adopt solutions based on data warehouses.
Analysis
Most of the proposals include mechanisms for data analysis and presentation. Some proposals, such as SPDW+ [25] and PAU [27], have more complex mechanisms and tools. The analysis can be purely qualitative, as in WebEv [21, 22], or of a benchmarking type, as in NSDIR [23] and MRS [24], in which general data about products and projects can be analyzed to support the identification of best practices. The proposals that support statistical process control (SPDW+ [25] and MMR Tool [16]) adopt more sophisticated mechanisms for data analysis (both of them use OLAP tools).
Support to SPC
Most proposals do not provide support for statistical process control. For instance, the proposal NSDIR [23] includes a repository which stores general data regarding products and projects with the main purpose of using them for benchmarking. Data concerning the process definition or its executions are not stored, which does not allow SPC to be carried out.
Only two proposals (SPDW+ [25] and MMR Tool [16]) include support for SPC. Both were developed in the context of large companies aiming at high maturity levels. These two proposals use Microsoft technologies and solutions based on a data warehouse environment.
7 Final Considerations
This paper presented the results of a systematic mapping about software measurement architectures. Altogether, 148 publications selected from the digital libraries IEEE and Scopus were analyzed, and 8 software measurement architecture proposals were found. The proposals have some similarities (for instance, the use of database-based solutions for data storage in most of them), but they also present many differences (for example, the technologies adopted).
Since the purpose of a systematic mapping is to present evidence from the literature about a specific topic, it was not the purpose of the study to compare the proposals and determine which one is the best (or worst). The main objective was to identify proposals for software measurement architectures in the literature and analyze them regarding their characteristics and their support for SPC.
Currently, the results of this study are being used in the definition of a software measurement architecture for organizations aiming to achieve high maturity.
As limitations of the study, we highlight the use of only two digital libraries as sources of publications and the unavailability of the full text of some publications. Concerning the use of only two sources, although it is a limitation, initial tests showed that the publications selected from some other libraries were similar to those selected from the two digital libraries used. Concerning publications whose full text was not available, we contacted the authors and some of them made their publications available; however, four publications were eliminated due to the unavailability of the full text.
Acknowledgments
This research is funded by the Brazilian research funding agencies FAPES (Process Number 52272362/11) and CNPq (Process Number 483383/2010-4).
References

1. McGarry, J., Card, D., Jones, C., Layman, B., Clark, E., Dean, J., Hall, F.: Practical Software Measurement: Objective Information for Decision Makers. Addison Wesley, Boston, USA (2002).
2. Barcellos, M.: A Strategy for Software Measurement and Software Measurement Repository Evaluation for Statistical Process Control in High Maturity Organizations, Computation and Systems Engineering Program, Doctorate Thesis, Federal University of Rio de Janeiro, COPPE/UFRJ, Rio de Janeiro, Brazil, 2009. (in Portuguese only)
3. CMMI Product Team: CMMI for Development, Version 1.3., http://www.sei.cmu.edu/library/abstracts/reports/10tr033.cfm, (2010).
4. SOFTEX: MPS.BR: Melhoria de Processo do Software Brasileiro - Guia Geral, http://www.softex.br/mpsbr, (2011).
5. Dumke, R., Ebert, C.: Software Measurement: Establish – Extract – Evaluate – Execute. Springer-Verlag (2010).
6. Zachman, J.: A framework for information systems architecture. IBM Systems Journal. 276–292 (1987).
7. Bernstein, P.A.: Repositories and Object-Oriented Databases, Proceedings of BTW Conference, 34–46. (1997).
8. Kitchenham, B., Charters, S.: Guidelines for performing systematic literature reviews in software engineering. Technical Report EBSE-2007-01, School of Computer Science and Mathematics, Keele University. (2007).
9. ISO/IEC 15939: Systems and Software Engineering – Measurement Process. (2007).
10. Barcellos, M., Falbo, R. A., Rocha, A. R.: Establishing a Well-Founded Conceptualization about Software Measurement in High Maturity Levels. In: Proceedings of the 7th International Conference on the Quality of Information and Communications Technology. 467–472. (2010).
11. Barcellos, M.P., Falbo, R.A., Rocha, A.R.A.: A Well-Founded Software Process Behavior Ontology to Support Business Goals Monitoring in High Maturity Software Organizations. IEEE 5th Joint VORTE-MOST Workshop. pp. 253–262. Proceedings of the IEEE International EDOC Enterprise Computing Conference Workshops, Vitória - ES (2010).
12. Florac, W.A., Carleton, A.D.: Measuring the Software Process: Statistical Process Control for Software Process Improvement. Addison Wesley, Boston, USA (1999).
13. Tarhan, A., Demirors, O.: Apply Quantitative Management Now. IEEE Software. 29, 77–85 (2012).
14. Montoni, M.: Investigation Regarding Critical Success Factors in Software Process Improvement Initiatives, Computation and Systems Engineering Program, Doctorate Thesis, Federal University of Rio de Janeiro, COPPE/UFRJ, Rio de Janeiro, Brazil, 2010. (in Portuguese only)
15. Mora, B., García, F., Ruiz, F., Piattini, M., Boronat, A., Gómez, A., Carsí, J.Á., Ramos, I.: Software generic measurement framework based on MDA. IEEE Latin America Transactions. 9, 130–137 (2011).
16. Palza, E., Fuhrman, C., Abran, A.: Establishing a Generic and Multidimensional Measurement Repository in CMMI Context. Proceedings of the 28th Annual NASA Goddard Software Engineering Workshop (2003).
17. Mora, B., García, F., Ruiz, F., Piattini, M., Boronat, A., Gómez, A., Carsí, J.Á., Ramos, I.: Software generic measurement framework based on MDA. IEEE Latin America Transactions. 8, 605–613 (2010).
18. Mora, B., García, F., Ruiz, F., Piattini, M., Boronat, A., Gómez, A., Carsí, J.Á., Ramos, I.: JISBD2007-08 : Software generic measurement framework based on MDA. IEEE Latin America Transactions. 6, 363–370 (2008).
19. Mora, B., Garcia, F., Ruiz, F., Piattini, M.: Model-Driven Software Measurement Framework: A Case Study. 2009 Ninth International Conference on Quality Software. 239–248 (2009).
20. Basili, V.R., Caldiera, G., Rombach, H.D.: The Goal Question Metric Approach. Encyclopedia of Software Engineering, Wiley. (1994).
21. Aversano, L., Bodhuin, T., Canfora, G., Tortorella, M.: WebEv - a Collaborative Environment for Supporting Measurement Frameworks. Proceedings of the 37th Annual Hawaii International Conference on System Sciences. 1–10 (2004).
22. Aversano, L., Bodhuin, T., Canfora, G., Tortorella, M.: A Framework for Measuring Business Processes based on GQM. Proceedings of the 37th Annual Hawaii International Conference on System Sciences. 1–10 (2004).
23. Goth, G.: NSDIR: A Legacy: Early Benchmarking Effort Was Ahead of Its Time. IEEE Software. 18, 53–56 (2001).
24. Bastani, F.B., Ntafos, S., Harris, D.E., Morrow, R.R., Paul, R.: A high-assurance measurement repository system. Proceedings. Fifth IEEE International Symposium on High Assurance Systems Engineering (HASE 2000). 265–272 (2000).
25. Silveira, P.S., Becker, K., Ruiz, D.D.: SPDW+: a seamless approach for capturing quality metrics in software development environments. Software Quality Journal. 18, 227–268 (2010).
26. Harrison, W.: A flexible method for maintaining software metrics data: a universal metrics repository. Journal of Systems and Software. 72, 225–234 (2004).
27. Paul, R.A., Kunii, T.L., Shinagawa, Y., Khan, M.F.: Software Metrics Knowledge and Databases for Project Management. IEEE Transactions on Knowledge and Data Engineering, 11, 255–264 (1999).
Systematic Review: Support Tools for the B Method and the Z Notation
Sofia L. Costa and Vinıcius Pereira
Instituto de Ciências Matemáticas e de Computação (ICMC), Universidade de São Paulo (USP)
Avenida Trabalhador São-Carlense, 400 - Centro, CEP: 13566-590 - São Carlos - SP
Phone: 55 (16) 3373-9700 - Fax: 55 (16) 3371-2238
{sofialc,vpereira}@icmc.usp.br
Abstract. The application of formal methods in software specification has grown in industry, and the extensive use of the B method and the Z notation is noticeable. However, support tools are important for the application of these methods. Objective: This paper aims to bring evidence of which of the methods - the B method or the Z notation - has better support tools for formal software specification. Methods: A systematic review was performed of studies that portray software specification with B or Z, and the characteristics of the tools for both were identified. Results: Twenty-five studies were retrieved, of which ten were selected for analysis. Despite Z being widely used, the identified tools do not offer all the features available in the B tools. Conclusion: The B tools provide functionality that supports the different activities of formal software specification. The evidence presented here may be useful for their adoption in industry.

Keywords: B Method, Z Notation, Systematic Review.
1 Introduction

The use of formal methods introduces more rigor into the software development process, improving the produced software with respect to its structure and maintainability and making it less error-prone. Many techniques and methods for formal software specification have been developed [6].

Two methods for formal software specification are widely used [14]: the B method and the Z notation. The B method, created by J. Abrial [1] for tool-supported software specification and development, uses the notion of abstract machines. The Z notation, in turn, is a specification language, also created by J. Abrial [2], based on axiomatic set theory and first-order predicate logic. B and Z are related and support the development of C code from the specification [12].

Woodcock et al. [14] observed that the formal methods area has many techniques applicable in industry, but few tools available. Of the projects reported in their survey, most involved the B method or the Z notation. Among the survey participants, the majority agreed that the use of formal methods was successful, and in general the participants were satisfied with their use in their projects.

Despite this, it was observed that the tools used did not live up to expectations. The development of support tools may affect how formal techniques are adopted in the future, since current tools are not robust enough for large-scale application.

Other researchers, such as Bowen and Hinchey [6] and Glass [8], state that there is a gulf between academia and industry regarding formal methods, and cite weaknesses in the notations and tools, as well as education in the area, as challenges for the wide acceptance of formal methods technology. Hence, it is necessary to develop tools that support application at commercial scale. Woodcock et al. [14] identified some challenges regarding support tools: support for automated deduction, common formats for exchanging models between tools, and tools with better usability.

In this context, the goal of this work is to present the results of a systematic review that identifies studies dealing with software specification and the support tools used for formal specification with the B method or the Z notation, so that formal methods practitioners get to know the kind of support offered by these two methods, as well as the characteristics of these tools. The results may help professionals who wish to apply formal methods in industry.

The paper is organized as follows. Section 2 presents the planning of the review, including the strategy adopted to select and use the search engines and the selection of studies, among other items. Section 3 presents how the selection of studies was conducted. Section 4 discusses the results obtained in the review. Finally, Section 5 presents the final considerations of this work.
CIbSE 2013 | X Workshop Latinoamericano Ingeniería de Software Experimental | ESELAW 2013
Systematic Review: supporting tools for the B method and the Z notation
2 Planning
The planning of the systematic review was carried out following the protocol model presented by Kitchenham [9]. This section presents the main points of the plan drawn up.

Objective: To analyze the supporting tools for the use of formal methods in software specification.

Research Question: What evidence exists to indicate whether the B method or the Z notation has better supporting tools for software specification?
A well-formulated research question is composed of four items, identified as PICO (Population, Intervention, Comparison, Outcome):

– Population: software specification
– Intervention: B method and Z notation
– Comparison: supporting tools offered by the two methods
– Outcome: the B method offers better supporting tools than the Z notation
Search Strategy for Selecting Primary Studies: The search and selection strategy for primary studies was defined according to the study sources, the keywords, and the checklist selected for the review:
– Criteria for source selection: The sources that index the main conferences and journals in which advances in the formal methods area are published were analyzed. The main conferences in the area are indexed by IEEE Xplore, while the main journals are indexed by both SpringerLink and Elsevier (through the Scopus database). The SBC Digital Library¹ and the Brazilian Symposium on Formal Methods (SBMF) were also selected for manual search.
– Search Methods: Automated search was used, by creating a search string that was executed on the selected sources, together with manual search in the SBC Digital Library and in the papers of the last 5 years of SBMF.
– Keywords: B Method, Z Notation, Formal Specification, supporting tools.
– Search string: (('Tool') AND (('B Method') OR ('Z Notation')) AND (('Software Specification') OR ('Formal Specification')))
– List of Selected Sources: IEEE, Springer, and Scopus.
Inclusion/exclusion criteria (CI/CE):

– Criterion 1: Studies that address software specification with the B method or the Z notation
1 http://www.lbd.dcc.ufmg.br/bdbcomp/bdbcomp.jsp
Sofia L. Costa and Vinícius Pereira
– Criterion 2: Studies that cite supporting tools to be used with the B method or the Z notation
– Criterion 3: Complete studies in Portuguese or English
– Criterion 4: Studies whose full text is available on the Web
– Criterion 5: If there are replicated studies, the most recent and most complete one should be selected.
2.1 Selection Process for Primary Studies

Preliminary selection process

1. Identify relevant studies
2. Each reviewer excludes studies based on reading the title/abstract
3. The reviewers validate each other's results and close a list of studies by common agreement.
Final selection process

1. The two reviewers performed the selection based on reading the full text
2. The reviewers validated the selection results in order to reach consensus
3. The reviewers carried out the Quality Assessment of the selected studies.
Quality assessment of primary studies

1. Is there a clear description of the research objectives?
2. Is there a description of the characteristics of the supporting tool?
3. Was the tool evaluated?
4. Are the results reported clearly?
5. Is the tool used in industry?
For each question, a study was scored 1.0 if the answer was "Yes", 0.5 if the answer was "Partially", and 0.0 if the answer was "No". Summing the points of the five questions, studies with a quality score of 2.0 or below were excluded from the review.
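The scoring rule just described can be sketched as a small function (an illustrative sketch; the function names are ours, not part of the review protocol):

```python
# Quality scoring as defined in the protocol: 1.0 for "Yes", 0.5 for
# "Partially", 0.0 for "No"; studies totalling 2.0 or less are excluded.
SCORES = {"yes": 1.0, "partially": 0.5, "no": 0.0}

def quality_score(answers):
    """answers: the answers to the five quality questions, in order."""
    assert len(answers) == 5, "the protocol defines five quality questions"
    return sum(SCORES[a.lower()] for a in answers)

def included(answers, cutoff=2.0):
    """A study stays in the review only if its score exceeds the cutoff."""
    return quality_score(answers) > cutoff

# Example: study ID 10 (Leuschel) scored 1.0, 0.5, 0.0, 1.0, 0.0 on the
# five questions, i.e. 2.5, and was therefore included.
```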
3 Conducting the Review

The systematic review was conducted over a period of 1 month (November 2012), according to the planning presented in the previous sections. In total, 25 relevant papers were retrieved and submitted to the stages of preliminary selection, final selection, and result extraction.
For the automated search, the search string was executed on each of the three databases (each database with its own version of the string), with the following results:

– IEEE: 10 studies
– Scopus: 19 studies
– Springer: 7 studies
– Total: 36 studies

Of these 36 returned studies, 11 were duplicates: 10 studies were duplicated between the IEEE and Scopus databases, and 1 study between the Springer and Scopus databases. This is because Scopus indexes some venues from other databases. On this basis, the duplicate Scopus versions were excluded and the total number of studies for the preliminary selection was reduced to 25, divided as follows:

– IEEE: 10 studies
– Scopus: 8 studies
– Springer: 7 studies
– Total: 25 studies
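The de-duplication step above can be sketched as follows (illustrative only; the review actually used the Zotero and ReVis tools, and the study identifiers here are made up). Since Scopus indexes some venues also covered by IEEE and Springer, the Scopus copies are the ones discarded:

```python
# De-duplication across the three databases: drop from Scopus any study
# already retrieved from IEEE or Springer.
def deduplicate(results):
    """results: dict mapping database name -> set of study identifiers."""
    seen_elsewhere = results["IEEE"] | results["Springer"]
    results["Scopus"] = results["Scopus"] - seen_elsewhere
    return results

# Hypothetical identifiers reproducing the counts reported in the review:
# Scopus returned 19 studies, of which 10 duplicate IEEE and 1 duplicates Springer.
searches = {
    "IEEE": {f"ieee_{i}" for i in range(10)},
    "Springer": {f"spr_{i}" for i in range(7)},
    "Scopus": {f"ieee_{i}" for i in range(10)} | {"spr_0"} | {f"sco_{i}" for i in range(8)},
}
deduped = deduplicate(searches)
total = sum(len(s) for s in deduped.values())  # 10 + 7 + 8 = 25
```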
Manual search was also carried out in the SBC Digital Library, together with queries over the last 5 years of the Brazilian Symposium on Formal Methods (SBMF), where 1 study was selected and promptly excluded by criterion 4, for not being available on the Web. The Zotero² tool was used for data extraction, and the ReVis tool [7] supported the selection process. The reviewers exchanged their selection files in order to reach consensus on the list of included studies.

The next subsections present further details of the activities carried out.
3.1 Preliminary Selection

Initially, 1 study was excluded while loading the information about the 25 studies into the ReVis tool, because it could not be found on the Web to make use of the information about its references. This means that the full text of the study could not be found (exclusion criterion 4).

The two reviewers read the titles and abstracts of the other 24 studies and, after a consensus meeting, included 18 studies for the next stage and excluded 6, for a total of 7 exclusions. Apart from the study excluded by criterion 4, all other 6 studies were excluded for not meeting criterion 1 (being about software specification in B or Z).
3.2 Final Selection

This section presents the final results obtained from conducting the systematic review, in accordance with the proposed objectives. Note that in this phase the studies were read in full, in order to determine which would actually remain among those selected. Owing to the unavailability of studies on the Web for full reading (exclusion criterion 4), 3 studies were eliminated in this phase.
2 http://www.zotero.org/
The process defined in Section 2.1 was followed. The reviewers performed the full-text reading and applied the inclusion/exclusion criteria; after completing the selection, they held a consensus meeting to determine the included studies.

Five studies were excluded on the basis of the full-text reading, because they dealt with hardware specification and wireless networks. Thus, at the end of this stage, a total of 10 studies had been included. Figures 1 and 2 present the studies of this stage according to the ReVis tool, where red circles represent the excluded studies and green circles the included ones.
(a) Similarity between the studies. (b) Citations between the studies.

Figure 1. The degree of similarity between the studies and the citations among them.
In Figure 1.a, the ReVis tool itself performs the similarity analysis, considering the words that appear most often in the studies.

Regarding Figure 1.b, the lines represent the citations between studies: the darker end of a line represents the study being cited and the lighter end the citing study. In this case, the excluded study that is cited by another (included) study is not about software specification (criterion 1).

Finally, in Figure 2, it is possible to visualize the references orbiting each study and to see whether those references are also used by other studies in this selection.
Figure 2. The references of each study in the final selection and their relationships.
Of this total of 10 studies, applying the quality criteria excluded 2 of them (ID 8 and ID 9), for having obtained a final score of 2.0 or below. The scores of these excluded studies were 1.5 and 2.0. The first received a score of 0 on quality criteria 1, 3, and 5, respectively the clear description of objectives, the evaluation of the tool, and the use of the tool in industry; the second received 0 on quality criteria 2, 3, and 5, where criterion 2 is the description of the tool. Table 1 presents the 8 studies whose quality score exceeded the cutoff (2.0), with data on authors, source database, year, study type, and the method each one refers to.
3.3 Data Extraction

Some information was extracted from the selected studies. Figure 3.a presents the publication type among the selected studies, and Figure 3.b the distribution of sources for the 8 selected studies. Figure 4 presents the distribution of studies by applied method, showing that most of the studies concern the B method.

Figure 5 shows the distribution of the selected studies by year of publication: there are studies from the 1990s as well as recent studies showing the use of these methods. A gap between the years 2000 and 2006 can also be noted.
Table 1. Data on the studies included at the end of the Final Selection stage

ID | Authors                                 | Source   | Year | Criteria | Quality | Study type   | Method
 3 | Attiogbe, J.C. [3]                      | Scopus   | 2006 | CI1, CI2 | 4.5     | Conference   | B
 4 | Babar, A.; Tosic, V.; Potter, J. [4]    | IEEE     | 2007 | CI1      | 3.0     | Conference   | B
 5 | Bowen, J.; Gordon, M. [5]               | Scopus   | 1995 | CI1, CI2 | 3.0     | Journal      | Z
10 | Leuschel, M. [10]                       | Springer | 2009 | CI2      | 2.5     | Book chapter | B
12 | Meyer, E.; Santen, T. [11]              | Springer | 2000 | CI1, CI2 | 4.0     | Book chapter | B
15 | Wang, S.; Wan, J.; Yang, X. [13]        | IEEE     | 2006 | CI1      | 2.5     | Conference   | B
17 | Zafar, N.A.; Sabir, N.; Ali, A. [16]    | Scopus   | 2009 | CI1      | 3.0     | Journal      | Z
18 | Zafar, N.A.; Khan, S.A.; Araki, K. [15] | Scopus   | 2012 | CI1      | 3.0     | Journal      | Z
(a) Publication type for the 8 selected studies. (b) Sources of the 8 selected studies.

Figure 3. Publication type and sources of the 8 selected studies.
4 Discussion of Results

Of the 8 selected studies, 5 (62%) reported the use of tools supporting the B method. Among these studies, 3 tools were identified, as shown in Table 2. Note that the characteristics of these tools are complementary: each can be used for different tasks in the context of formal specification.
Figure 4. Methods used by the selected studies

Figure 5. Year of publication of the selected studies
Among the three tools, two are commercial: Atelier-B³ and B-Toolkit⁴. The ProB⁵ tool is being used in industry in case studies for the data validation of complicated properties, and is on track to be employed commercially.

The Atelier-B tool is a theorem prover that enables the operational use of the B method to develop software with proofs that it is free of defects (formal software). It is used to analyze safety properties and to perform refinements. The tool generates verification conditions for
3 http://www.atelierb.eu/en/
4 http://www.b-core.com/btoolkit.html
5 http://www.stups.uni-duesseldorf.de/ProB/index.php5/The_ProB_Animator_and_Model_Checker
Table 2. Summary of results on tools supporting the B method

Tool      | Characteristic                                       | Domain                       | Commercial
Atelier-B | Theorem prover                                       | Software analysis and design | Yes
B-Toolkit | Set of functions (specification, design, and code)   | Software development         | Yes
ProB      | Animator and model checker                           | Software analysis and design | No
B abstract machines, which allow behavioral conformance to be proved. It also offers support for proving these conditions interactively.
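As a rough illustration of what such a verification condition looks like (standard B theory rather than anything reported in the surveyed studies), an operation of the form PRE P THEN S END on a machine with invariant I must preserve the invariant whenever its precondition holds:

```latex
% Invariant-preservation obligation for a B operation PRE P THEN S END,
% where [S]I denotes the weakest precondition of the substitution S
% with respect to the invariant I.
I \land P \Rightarrow [S]\,I
```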
B-Toolkit is a collection of programming tools intended to support writing and specification in B. It includes a theorem prover and an animator for B abstract machines, enabling the refinement and correct implementation of the specified software.

The ProB tool checks models created in B, assists in the discovery of errors, and is also an animator of B abstract machines. It offers features for the graphical visualization of automata and also supports the automatic consistency checking of B specifications.

Of the total studies included at the end of this systematic review process, three (38%) report the use of tools supporting the Z notation. Table 3 presents the two tools identified in these studies. One of these tools (Z/Eves) was the most used among the studies analyzed in this review (including the excluded ones) that dealt with the Z notation.
Table 3. Summary of results on tools supporting the Z notation

Tool   | Characteristic | Domain                       | Commercial
Z/Eves | Theorem prover | Critical and complex systems | Yes
HOL    | Theorem prover | Critical systems             | ProofPower version
The Z/Eves⁶ tool is commercial, acting as both a proof checker and a theorem prover. It performs type and domain checks, precondition calculation, schema expansion, proofs, parsing, refinements, and theorem proving in general. The tool supports almost all of the Z notation, the existential quantifier being the exception.
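For readers unfamiliar with Z, the unit such tools type-check and reason about is the schema; a toy example (our own, typeset in the style of the LaTeX zed-csp package, not taken from the studies) declaring a bounded counter:

```latex
\begin{schema}{Counter}
  value : \nat
\where
  value \leq 100
\end{schema}
```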
HOL⁷, in turn, is an interactive tool that assists proofs in higher-order logics. It has several versions, the latest of which is commercial (ProofPower). It enables automated theorem proving, and the user can additionally implement their own environment for specific proofs. The tool performs
6 http://oracanada.com/z-eves/welcome.html
7 http://hol.sourceforge.net/
predicate calculus, which is sufficient to express most ordinary mathematical theory. But this tool is not as generic: it is tied to higher-order logic.

Concluding the discussion of the tools, Table 4 presents a summary of the characteristics of the B method and the Z notation, together with the context into which each of the identified tools fits. The B method was created with the goal of supporting the whole software development process through the use of tools. B therefore has broader tool support for the tasks of formal software specification, such as an animator of B abstract machines, a model checker, and a theorem prover.
Table 4. Summary of the identified characteristics

                                     | B method                                    | Z notation
Objective                            | Formal method designed to be tool-oriented  | Notation for software specification
Advantages                           | More supporting tools with more features; easier to implement correctly | Easier to learn and use; more succinct than B
Application                          | Specification, design, proofs, and code generation | Software specification
Tools: Theorem prover                | Atelier-B                                   | HOL, Z/Eves
Tools: Model checker                 | ProB                                        | -
Tools: Animator                      | ProB                                        | -
Tools: Environment (set of functions)| B-Toolkit                                   | -
The Z notation, in turn, because its objective is more narrowly directed at software specification, has tools whose purposes are more restricted to formal proofs. One of the identified tools is commercial and widely used, but it is restricted to theorem proving.

The quality of the primary studies appears to affect the results of the review. In general, the quality scores were middling: of the 8 studies, 5 had a quality score of 3.0 or less, out of a maximum of 5.0. This indicates that the studies lacked characteristics the reviewers considered important, such as the characteristics of the tool employed. Table 5 shows the score of the included studies on each of the five quality questions, as well as the final score.
Table 5. Scores of the included studies on each of the quality criteria

ID | Authors                                 | CQ1 | CQ2 | CQ3 | CQ4 | CQ5 | Score
 3 | Attiogbe, J.C. [3]                      | 1.0 | 1.0 | 1.0 | 1.0 | 0.5 | 4.5
 4 | Babar, A.; Tosic, V.; Potter, J. [4]    | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 3.0
 5 | Bowen, J.; Gordon, M. [5]               | 0.0 | 1.0 | 0.5 | 0.5 | 1.0 | 3.0
10 | Leuschel, M. [10]                       | 1.0 | 0.5 | 0.0 | 1.0 | 0.0 | 2.5
12 | Meyer, E.; Santen, T. [11]              | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 4.0
15 | Wang, S.; Wan, J.; Yang, X. [13]        | 1.0 | 0.0 | 0.0 | 1.0 | 0.5 | 2.5
17 | Zafar, N.A.; Sabir, N.; Ali, A. [16]    | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 4.0
18 | Zafar, N.A.; Khan, S.A.; Araki, K. [15] | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 3.0
5 Conclusions

This work presented the results of a systematic review that compared the supporting tools for the B method and the Z notation. No study performing such a comparison was found. We searched for studies dealing with either of the two methods, so that evidence about the existing tools and their characteristics could be collected. It was observed that some studies mention that the B method has better tool support than the Z notation [10].

The main trend observed was the greater variety of tools for B, with different characteristics supporting the various formal specification activities, such as proving theorems, checking models, and animating models so as to uncover defects. Most of these tools are applied in industry: Atelier-B and B-Toolkit. Moreover, an evolution of the B method, called Event-B, is easy to learn and allows the same existing tools to be used.

The tools for Z, in turn, support few activities, basically theorem proving, and are thus more restricted in their application. Furthermore, only one tool was cited in more than one study: Z/Eves.

Even with the advantages of the identified tools, they may have limitations, especially regarding user interaction. Improvements in this direction may therefore be important for broadening their use. In addition, other tools can be developed to assist other formal software specification tasks not covered by the identified tools.

Although the studies were collected from sources covering the main venues and journals in the area, one threat to validity is the number of sources: more sources related to Software Engineering could have been included. Another threat concerns the search string, since several papers refer to the B method and the Z notation simply as B and Z, respectively, so the executed string may have missed some evidence.
This work can be extended by selecting more sources for the search and by improving the search string, so that more evidence can be identified. One difficulty in conducting the review concerned the sources,
since each one requires a different format for the search string, and the database's default format does not always appear to be the ideal one.

The present study can assist practitioners of formal methods, specifically those who use the B method and the Z notation, since the work has brought forward evidence about the tools in use and their characteristics. Such tools are important for ensuring the validation of specifications, reducing manual work and providing guarantees regarding the specifications produced.
References
1. J.-R. Abrial, M.K.O. Lee, D.S. Neilson, P.N. Scharbach, and I.H. Sørensen. The B-method. In Søren Prehn and Hans Toetenel, editors, VDM '91: Formal Software Development Methods, volume 552 of Lecture Notes in Computer Science, pages 398–405. Springer Berlin Heidelberg, 1991.
2. Jean-Raymond Abrial, Stephen A. Schuman, and Bertrand Meyer. Specification language. In On the Construction of Programs, pages 343–410. 1980.
3. J.C. Attiogbe. Tool-assisted multi-facet analysis of formal specifications (using Atelier-B and ProB). volume 2006, pages 85–90, 2006.
4. A. Babar, V. Tosic, and J. Potter. Aligning the MAP requirements modelling with the B-method for formal software development. In Software Engineering Conference, 2007. APSEC 2007. 14th Asia-Pacific, pages 17–24, December 2007.
5. J. Bowen and M. Gordon. A shallow embedding of Z in HOL. Information and Software Technology, 37(5-6):269–276, 1995.
6. J.P. Bowen and M.G. Hinchey. Ten commandments of formal methods... ten years later. IEEE Computer, 39(1):40–48, January 2006.
7. Katia R. Felizardo, Norsaremah Salleh, Rafael M. Martins, Emilia Mendes, Stephen G. MacDonell, and Jose C. Maldonado. Using visual text mining to support the study selection activity in systematic literature reviews. In Proceedings of the 2011 International Symposium on Empirical Software Engineering and Measurement, ESEM '11, pages 77–86, Washington, DC, USA, 2011. IEEE Computer Society.
8. R.L. Glass. Formal methods are a surrogate for a more serious software concern. IEEE Computer, 29(4):19, April 1996.
9. B. Kitchenham. Procedures for performing systematic reviews. Technical Report TR/SE-0401, Keele University and NICTA, 2004.
10. Michael Leuschel. Towards demonstrably correct compilation of Java byte code. In Frank de Boer, Marcello Bonsangue, and Eric Madelaine, editors, Formal Methods for Components and Objects, volume 5751 of Lecture Notes in Computer Science, pages 119–138. Springer Berlin / Heidelberg, 2009.
11. Eric Meyer and Thomas Santen. Behavioral conformance verification in an integrated approach using UML and B. In Wolfgang Grieskamp, Thomas Santen, and Bill Stoddart, editors, Integrated Formal Methods, volume 1945 of Lecture Notes in Computer Science, pages 358–379. Springer Berlin / Heidelberg, 2000.
12. C. Snook and R. Harrison. Practitioners' views on the use of formal methods: An industrial survey by structured interview. Information and Software Technology, 43(4):275–283, March 2001.
13. Shuaiqiang Wang, Jiancheng Wan, and Xiao Yang. Describing, verifying and developing Web service using the B-method. In International Conference on Next Generation Web Services Practices, 2006. NWeSP 2006, pages 11–16, September 2006.
14. J. Woodcock, P.G. Larsen, J. Bicarregui, and J. Fitzgerald. Formal methods: Practice and experience. ACM Computing Surveys (CSUR), 41(4):1–36, October 2009.
15. N.A. Zafar, S.A. Khan, and K. Araki. Towards the safety properties of moving block railway interlocking system. International Journal of Innovative Computing, Information and Control, 8(8):5677–5690, 2012.
16. N.A. Zafar, N. Sabir, and A. Ali. Formal transformation from NFA to Z notation by constructing union of regular languages. International Journal of Mathematical Models and Methods in Applied Sciences, 3(2):115–122, 2009.
Is it possible to outperform the accuracy of expert-judgment-based effort estimation for software products?
Gabriela Robiolo
Universidad Austral, Buenos Aires, Argentina

Oscar Castillo and Bibiana Rossi
Universidad Argentina de la Empresa, Buenos Aires, Argentina
oscar.alexander@gmail.com, [email protected]

Silvana Santos
Universidad Nacional de La Plata, La Plata, Argentina
silvanasantos@gmail.com
Abstract. Effort estimation of software products based on expert judgment is the most widespread method in the software industry, and there is evidence that it can be as accurate as, or more accurate than, formal estimation models. In order to provide further evidence for this claim, we present a case study of a complex application developed in the context of a public company. The expert estimation method is compared against the formal methods of linear regression and analogy, using single-variable models based on Size (measured in COSMIC) or Complexity (measured in Paths). The results show that it was not possible to outperform the accuracy of expert judgment, owing to the experts' medium-to-high level of experience.

Keywords: expert estimation, effort estimation, linear regression, analogy
1 Introduction

Jorgensen [1] states that expert-judgment-based effort estimation of software projects is the most widespread method in the software industry. While this claim does not invalidate the use of formal estimation methods, it highlights the limitations of those methods, which have not been able to surpass the human capacity to synthesize diverse complex variables. The author also observes that expert estimation can be as accurate as, or more accurate than, a formal estimation model. He finds strong evidence that
expert estimation is more accurate when the expert possesses substantial knowledge of the domain.

This article presents a case study intended to provide further evidence for these claims and to investigate whether they hold in the context of a public-sector company.

The research questions posed are:

– Is it possible to reduce the error of expert-based effort estimation for a complex software product by applying a formal estimation method that uses Size or Complexity as its single variable?
– Is it possible to reduce the effort estimation error for a complex software product if, over a history of successive estimates, analogy evaluated by experts is applied versus analogy based on Size or Complexity?

Secondly, a succession of estimates was chosen in order to understand how the analogy evaluation performed by an expert compares with analogies based on objective measures.

To answer these questions, a complex application developed in successive versions was selected, from which the requirements specification and the Actual Effort (ER) were obtained. It was also possible to obtain the effort estimates of a company expert, so as to compare them with the results obtained using formal estimation methods.
The formal estimation methods used in the comparison are frequently used ones [2]: linear regression [3] and analogy [4]. The COSMIC [5], [6] and Paths [7] metrics were selected, since the former is an international standard and the latter is a complexity metric suited to the characteristics of the application under analysis. It is also worth noting that [1] includes no articles using COSMIC or Paths.
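As a rough sketch of the two formal methods being compared (illustrative only; the variable names and data are made up, not taken from the case study), a single-variable linear regression and a nearest-neighbor analogy estimate can be written as:

```python
# Single-variable effort models: effort is predicted from one measure
# (e.g. Size in CFP or Complexity in Paths) of past projects.

def fit_linear(xs, effort):
    """Ordinary least squares for effort = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(effort) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, effort)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

def estimate_by_analogy(x_new, xs, effort):
    """Reuse the effort of the historical project closest in the measured variable."""
    closest = min(range(len(xs)), key=lambda i: abs(xs[i] - x_new))
    return effort[closest]

# Hypothetical history: sizes in CFP and actual effort in person-hours.
sizes = [100, 200, 300]
hours = [110, 210, 310]
a, b = fit_linear(sizes, hours)
reg_estimate = a + b * 250                             # -> 260.0
ana_estimate = estimate_by_analogy(250, sizes, hours)  # ties resolve to the earlier project -> 210
```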
COSMIC (Common Software Measurement International Consortium) function points is an increasingly accepted measurement standard that measures the Size of functional requirements. Functional user requirements can be mapped into functional processes. Each functional process consists of subprocesses that involve data movements. The count of entry, exit, read, and write data movements is the functional Size, expressed in CFP.
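Under the COSMIC model described above, the functional size of an application is simply the count of its data movements; a toy illustration (our own, with a made-up functional process):

```python
# COSMIC functional size: 1 CFP per data movement (Entry, Exit, Read, Write).
def cosmic_size(processes):
    """processes: list of dicts counting each data-movement type per functional process."""
    return sum(p["entry"] + p["exit"] + p["read"] + p["write"] for p in processes)

# Hypothetical functional process "register invoice" with 2 Entries,
# 1 Exit, 1 Read, and 1 Write -> 5 CFP.
app = [{"entry": 2, "exit": 1, "read": 1, "write": 1}]
size_cfp = cosmic_size(app)  # -> 5
```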
Paths is a measure introduced by Robiolo [7], [8] that captures the complexity of requirements. It is an application of McCabe's metric [9] to the description of requirements. The complexity of the requirements is expressed in terms of paths, where each path corresponds to an alternative scenario of a use case, and is expressed in P.
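A minimal sketch of that counting rule as we read it (following McCabe's idea, each use case contributes its main scenario plus one path per alternative scenario; counting the main scenario is our assumption, not a normative definition):

```python
# Paths complexity sketch: a use case with n alternative scenarios is
# counted as n + 1 paths (main scenario plus each alternative).
def paths(use_cases):
    """use_cases: list of alternative-scenario counts, one per use case."""
    return sum(1 + alts for alts in use_cases)

# Hypothetical system with three use cases having 0, 2, and 3 alternatives.
complexity_p = paths([0, 2, 3])  # -> 8 P
```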
This article presents what is meant by expert judgment, some of the latest empirical studies on the topic, and a literature discussion of the merits of expert-based estimation methods versus formal estimation methods. It then develops a case study and, finally, details the conclusions along with a description of possible future work.
2 El juicio de expertos
Jorgensen [1], uno de los autores con mayor cantidad de publicaciones en torno a este tema en los últimos años, define una estrategia de estimación como juicio de exper-tos a un trabajo de estimación realizado por una persona reconocida como un experto en esta tarea, donde una parte significativa del proceso de estimación es realizada en forma intuitiva, ejecutando un proceso no explícito e inconstruible. Realizo una revi-sión de estudios detallada en torno a este tema. Los resultados arrojados por dicho proceso de revisión sugieren que las estimaciones basadas en juicio de expertos es la más utilizada para proyectos de software. Afirma que no existe evidencia sustancial en favor del uso de modelos de estimación y que hay situaciones donde se puede es-perar que las estimaciones expertas sean mucho más precisas que los métodos forma-les de estimación. Además propone una lista de 12 “best practices” o principios de estimación experta validados empíricamente y provee sugerencias sobre cómo im-plementar estas guías en las organizaciones. Una de las mejores prácticas es buscar expertos con conocimiento del dominio y capacidad de realizar buenas estimaciones,aspecto que se destaca en este artículo.
Respecto de la evaluación de la incertidumbre de las estimaciones realizadas, se planteó el rol que cumple el feedback sobre la discrepancia existente entre las horas estimadas versus las trabajadas. Existe evidencia suficiente [10], que indica que la mayoría de los desarrolladores son, inicialmente, demasiado optimistas sobre la preci-sión de sus estimaciones, manteniéndose así aun cuando el feedback provisto indica lo contrario. El autor sugiere que una condición importante que se tendría que dar para mejorar las estimaciones sobre la base del feedback provisto luego de la finalización de la(s) tarea(s), sería el uso de una estrategia explícita de evaluar la incertidumbre. Esta condición mejora mientras mayor es la cantidad de información histórica de la que se dispone. Circunstancia que es confirmada en éste artículo.
Uno de los efectos negativos que más influyen en el fracaso de los proyectos de software es el exceso de optimismo. Jørgensen y Halkjelsvik [11] descubren un punto importante que pueden estar llevando a los estimadores a ser demasiado optimistas, esto es el formato utilizado al formular la pregunta que solicita la estimación de es-fuerzo. El formato tradicional sería: “¿Cuántas horas se necesitan para completar la tarea X?” y el alternativo sería: “¿Cuántas tareas se pueden completar en Y horas?” Cualquiera de estos dos formatos deberían, en teoría, arrojar los mismos resultados. Según Jørgensen, cuando se utiliza el formato alternativo, se obtienen sorprendente-mente estimaciones mucho más bajas y por ende mucho más optimistas que si se usara el formato tradicional. La recomendación final de dicho estudio es que siempre se opte por el formato tradicional ya que este no conlleva ninguna desviación impues-ta por los clientes que quieren obtener el máximo con un presupuesto irreal. En nues-tro caso de estudio se uso el formato tradicional.
Mendes [12] investigated, in the context of projects funded by the New Zealand government and later in Brazil, the use of an expert-centred approach combined with a technique that allows the explicit inclusion of uncertainty and causal relationships as a means of improving the estimation of software project effort. The article presents an overview of the effort estimation process, followed by a discussion of how an expert-centred approach improves that process and can be advantageous for software companies.

CIbSE 2013 | X Workshop Latinoamericano Ingeniería de Software Experimental | ESELAW 2013
3 Expert Judgment vs. Formal Methods
Jorgensen [1] presents fifteen studies comparing expert estimation with estimates based on formal estimation models. Five of the articles favour expert estimation, five find no difference, and five favour formal estimation models. He stresses that the design of the experiments has a strong impact on the results, and notes that the experiments that did not use calibrated formal models showed expert estimation to be more accurate. Since that survey, only one article comparing expert judgment with formal methods has been found; in particular, it analyses the advantages and disadvantages of expert judgment versus machine learning [13].
In another article, Jorgensen [14] reiterates the same ideas, noting that experts have a natural advantage because they typically process more information (or its absence) in a more flexible way, and that it may be difficult to build more accurate estimation models. The case study presented in this article provides evidence of this expert capability.
Jørgensen [1] also acknowledges that expert judgment has negative aspects, such as its degree of inconsistency and the weighting of variables. He points out that if these negative effects could be reduced, the improvement in estimation accuracy would be much greater.
Jørgensen and Boehm [15] take two opposing positions and debate them, aiming to show organizations the advantages and disadvantages of formal models and expert judgment. Boehm disagrees with Jørgensen, arguing that the expert estimates produced in empirical studies are not representative of the estimates produced in practice. He also maintains that, in the face of uncertainty, organizations will choose to carry out extensive sensitivity, risk and trade-off analyses through formal models, executed quickly and with very little human effort. Boehm recommends combining both approaches: storing the results at the end of development (or of its phases) and using these values in a "closed-loop feedback" process that compares estimation inputs, outputs and final outcomes, in order to learn from them and calibrate the estimates of future projects. Jørgensen encourages organizations to invest in studies and research to improve estimation processes based on expert judgment, a position also held by [12][16]. Jørgensen further proposes using Boehm's Wideband Delphi method to improve estimates and avoid possible conflicts of interest among stakeholders.
MacDonell and Shepperd [17] state that there is a high degree of interdependence between estimates based on common estimation models and expert estimation, and that it is difficult to derive rules for selecting the most accurate estimation model; the solution therefore seems to be to use a combination of models.
4 Case Study
The case study was carried out in a large and complex organization in the Argentine public sector. The application developed is a traffic-infringement system. For confidentiality reasons, no further details are given. The requirements specification of the selected application was based on use cases (UC). The use cases of five versions were selected; they were clearly documented, their coding had been completed, and their actual effort (ER) had been recorded. The Functional Size and Complexity of the use cases were measured by applying the COSMIC and Paths metrics, respectively.
Table 1 shows, for each use case, the ER measured in person-hours, the Size measured in COSMIC function points (CFP), the Complexity measured in Paths (P), and the expert's estimates in person-hours, both for the UCs of the whole application and for the successive estimation of versions.
Table 1. Data of the public-sector application

Version  UC ID   ER   CFP    P   Expert (UC)   Expert (successive versions)
   7       1    264    20    3       240            240
   7       2     32     5    3        24             24
   7       3    248    15   23       208            208
   7       4    112    37   15        88             88
   9       5    104    16    5        80             80
   9       6    136    15    9        96             96
  10       7     56    10    3        64             64
  10       8    184     7   25       112            120
  10       9    416    93   63       328            344
  10      10      8    11    2         8             16
  10      11     16     4    2         8             16
  11      12    208    90   58       176            200
  11      13     24     5    2        16             16
  11      14    144   149   16       120            144
  11      15     96    54    5        96             88
  12      16    520    46   43       504            504
  12      17    112    97    6       136            144
  12      18     40    71   30        40             40
Tables 2 and 3 show the characteristics of the expert who performed the estimates, describing his profile, level of experience and degree of knowledge of the environment. The expert was asked to rate his capabilities at one of the following levels: high, medium or low. The expert has a high level of experience both in software development and in leadership, project estimation and the technology used in the project. At the time the estimates were made, the expert was in charge of the area where the application was developed, but he did not work in that area when the versions shown in this case study were developed and therefore had no knowledge of the ER. At the request of the authors of this article, he produced the estimates based on the requirements specification, without knowing the actual hours of the application.
Table 2. Expert profile

Aptitude                                      Description
University degree                             Engineer
Years of experience in software development   12
Years of experience in leadership             8
Specialty, knowledge                          .NET development, Java, patterns, team
                                              leadership, development methodologies
Table 3. Expert's level of experience and degree of knowledge of the environment

Capability                                                     High   Medium   Low
Level of experience                                             x
Knowledge of the performance of the development team profiles            x
Knowledge of the technology                                     x
Knowledge of the domain                                                  x
To compute the errors, the Magnitude of Relative Error (MRE) and the relative error (ErR) are used, following formulas (1) and (2), respectively:
MRE = |ER - EE| / ER    (1)

ErR = (ER - EE) / ER    (2)
In addition, the prediction quality (PQ) is computed by applying formula (3):
PQ(0.25) = k / n    (3)
where k is the number of UCs whose error is below 0.25 and n is the total number of use cases [18].
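As an aside for the reader, formulas (1)-(3) can be computed as in the following sketch (illustrative code, not the authors' implementation; the example values are the ER and expert estimates of the first four use cases of Table 1):

```python
# Illustrative sketch: MRE, ErR and PQ(0.25) from paired actual effort
# (ER) and estimated effort (EE) values, as defined in formulas (1)-(3).

def mre(er: float, ee: float) -> float:
    """Magnitude of Relative Error: |ER - EE| / ER."""
    return abs(er - ee) / er

def err(er: float, ee: float) -> float:
    """Relative error: (ER - EE) / ER (sign shows under/overestimation)."""
    return (er - ee) / er

def pq(actuals, estimates, threshold=0.25):
    """Prediction Quality: fraction of cases whose MRE is below threshold."""
    k = sum(1 for er, ee in zip(actuals, estimates) if mre(er, ee) < threshold)
    return k / len(actuals)

# First four use cases of Table 1 (ER vs. the expert's estimates):
er = [264, 32, 248, 112]
ee = [240, 24, 208, 88]
print(round(mre(er[0], ee[0]), 3))   # 0.091
print(round(pq(er, ee), 2))          # 0.75  (CU2 has MRE exactly 0.25)
```

Note that the "error below 0.25" criterion is strict, so a use case whose MRE equals exactly 0.25 does not count towards k.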
In order to test the results statistically, the following alternative hypotheses are put forward:
H1a: The mean MRE of the expert estimate is lower than the mean MRE of the estimate obtained using the linear regression model with independent variable P.
H1b: The mean MRE of the expert estimate is lower than the mean MRE of the estimate by Analogy with Size measured in CFP.
H1c: The mean MRE of the expert estimate is lower than the mean MRE of the estimate by Analogy with Complexity measured in P.
H1d: The mean MRE of the expert estimate is lower than the mean MRE of the estimate by Analogy with Size measured in CFP, over a successive history of estimates.
H1e: The mean MRE of the expert estimate is lower than the mean MRE of the estimate by Analogy with Complexity measured in P, over a successive history of estimates.
4.1 Estimating the UCs of an Application
First, the application is analysed as a whole; all the UCs are therefore considered in order to compare the formal estimation methods with the expert's estimate.
Formal estimation methods.
Two formal methods were considered: linear regression and analogy.
Linear regression. The linear regression model Y = a + bX was used, where Y is the actual effort and X is CFP or P. Table 4 shows the results of the linear regression. No significant model was obtained using CFP as the independent variable: as can be seen in Table 4, the adjusted R2 is very low and the p-value is greater than 0.05. In contrast, it was possible to obtain a regression line using P as the independent variable, with an adjusted R2 of 0.50 and a p-value of 0.001, after removing two outliers. The normality of the residuals was also tested by applying the Shapiro-Wilk normality test, which yielded a p-value of 0.21.
Table 4. Linear regression method

Model (EE = a + b*X)   Outliers        R2     Adjusted R2   p-value
--                     --              0.07   0.02          0.26
EE = 58.78 + 5*P       CU12 and CU16   0.52   0.50          0.001
To compute the estimation errors, a leave-one-out cross-validation procedure was used, removing from the model each use case to be estimated. Table 5 shows the mean (MMRE) and median (MeMRE) of the MREs, the prediction quality (PQ) and the ErR values.
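The leave-one-out procedure can be sketched as follows (an illustrative pure-Python implementation, not the authors' code; `fit_line` is an ordinary least-squares fit for the one-variable model EE = a + bX, and the data at the end are synthetic):

```python
# Sketch of leave-one-out cross-validation for the model EE = a + b*X:
# each use case is removed from the data set before fitting the line
# that produces its estimate.

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def loo_estimates(xs, ys):
    """Estimate each point from a line fitted to all the other points."""
    est = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        est.append(a + b * xs[i])
    return est

# Sanity check on perfectly linear synthetic data (y = 1 + 2x): every
# held-out point is recovered exactly.
print([round(e, 6) for e in loo_estimates([1, 2, 3, 4], [3, 5, 7, 9])])
# [3.0, 5.0, 7.0, 9.0]
```

The per-use-case errors of Table 5 would then be obtained by applying the MRE and ErR formulas to each held-out estimate.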
Table 5. Comparison of the estimation methods

Estimation method       MMRE   MeMRE   PQ(0.25)   ErR (min..max)
Linear regression (P)   1.45   0.34    0.38       -8.38..0.79
Analogy (CFP)           1.53   0.83    0.06       -6.7..0.87
Analogy (P)             0.94   0.46    0.39       -4.52..0.89
Expert                  0.19   0.18    0.72       -0.21..0.5
Analogy. This estimation method consists of finding a project similar to the project to be estimated, based on one characteristic. Size [CFP] and Complexity [P] were used independently. The productivity of the UC "most similar" in Size or Complexity is used to obtain the EE. The productivity (PR) is computed with formula (4) [19]:

PR = ER / X    (4)

where X is Size [CFP] or Complexity [P]. The EE is then obtained with formula (5):

EE = X * PRPA    (5)

where PRPA is the productivity of the analogous project, that is, of the project whose Size [CFP] or Complexity [P] value is closest. If there is more than one analogous project, the average productivity of the analogous projects is taken.
For example, CU1 in Table 1 has an effort of 264 HH and a size of 20 CFP. The UC closest in Size is CU5, with 16 CFP and an ER of 104 HH; PRCU5 is the quotient of 104 and 16, which equals 6.5. Therefore, EECU1 equals 130 HH. Table 5 shows the statistical analysis of the values obtained using CFP or P.
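The worked example can be reproduced with the following sketch (illustrative code, not the authors' implementation; `cases` holds the (CFP, ER) pairs of some of the other use cases of Table 1):

```python
# Illustrative sketch of the analogy method: find the use case closest
# in size (CFP), take its productivity PR = ER / CFP, and estimate
# EE = CFP * PR_analog. When several cases are equally close, their
# productivities are averaged, as described in the text.

def estimate_by_analogy(target_x, cases):
    """cases: list of (x, er) pairs for the remaining use cases."""
    best = min(abs(x - target_x) for x, _ in cases)
    analogs = [er / x for x, er in cases if abs(x - target_x) == best]
    pr = sum(analogs) / len(analogs)   # mean productivity of the analogs
    return target_x * pr

# Worked example: CU1 has 20 CFP; the closest UC is CU5 (16 CFP, 104 HH),
# so PR = 6.5 and EE = 20 * 6.5 = 130 HH.
cases = [(5, 32), (15, 248), (37, 112), (16, 104)]   # (CFP, ER) of other UCs
print(estimate_by_analogy(20, cases))                 # 130.0
```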
Comparison of the formal methods with the expert's estimate. Table 5 compares the formal estimation methods with the expert-based estimation method. Using the expert-based estimation method, a result considered acceptable for an estimation technique was obtained: errors below 25% [18].
Table 5 shows that, with linear regression, significant values were obtained only when P was used as the independent variable. Only 38% of the estimates had an error below 25%, which is a low percentage.
With estimation by analogy, the results for CFP were very poor: only 6% of the estimates made by analogy using CFP as the independent variable had an error below 25%. In contrast, using P as the independent variable improved the result: 39% of the estimates had an error below 25%.
Comparing both techniques, we can conclude that using P as a measure of Complexity is better than using CFP as a measure of Size. At the same time, for this case study, neither technique outperforms the expert's estimate: as can be seen, 72% of the expert's estimates have an error below 25%.
It can therefore be concluded that none of the traditional estimation techniques was able to outperform the expert's estimate. Some are better than others, but none matches the expert's experience.
4.2 Successive Estimation of Versions of an Application
In a scenario where successive versions of an application are estimated, the expert is expected to learn from previous estimates. Table 1 shows the UCs grouped by version. The expert's estimate is compared with a formal method based on analogies. The analogy method using CFP and P as variables was applied in the same way as explained above, with the particularity that the set of UCs used to find the analog is the subset formed by the UCs of the previous versions. For example, when estimating version 10, the UCs of versions 7 and 9 are used. Given this limitation, it was not possible to apply linear regression, since in some cases the number of data points available for the model was too small, yielding non-significant models.
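The restriction of the analog pool in this successive-version scenario can be sketched as follows (an illustrative fragment, not the authors' code; following the version-10 example above, only use cases of earlier versions are taken as candidate analogs, and the data shown are a small subset of Table 1):

```python
# Illustrative sketch: when estimating a version's use cases by analogy,
# only the use cases of earlier versions are available as candidate
# analogs (e.g. version 10 draws on versions 7 and 9).

def analogs_for_version(version, history):
    """history: (version, cfp, er) triples; returns (cfp, er) candidates."""
    return [(cfp, er) for v, cfp, er in history if v < version]

# Small illustrative subset of Table 1:
history = [(7, 20, 264), (7, 5, 32), (9, 16, 104), (10, 10, 56)]
print(analogs_for_version(10, history))   # [(20, 264), (5, 32), (16, 104)]
```

The analogy formulas (4) and (5) are then applied over this reduced candidate set.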
To obtain the expert's results, a second interview was conducted. This time the expert was shown only the UCs to be estimated for each version, together with the errors made when estimating the previous versions. For example, when estimating version 9, he was shown the actual values of version 7 as well as his earlier estimate of that version. Table 1 shows the successive estimates made by the expert. Table 6 shows the errors of the successive estimates, applying the analogy method (CFP and P) and the expert-based method.
As Table 6 shows, the MMRE of the expert's estimates ranges over [0.18-0.22] and the PQ over [0.63-0.75], better values than the rest. For Paths, the MMRE range is [0.76-2.22] and the PQ range is [0-0.39]. The values obtained using COSMIC are MMRE [1.18-1.53] and PQ [0-0.25]. Although the result obtained using the Paths metric with the analogy method is better, the improvement over COSMIC is not significant.
The expert's successive estimates fluctuate, but the mean and median of the MRE remain within values acceptable to industry, that is, below 25%, and similar to the expert's estimate shown in Table 5. The PQ values are the highest in Table 6; considering the PQ of the first measurement (72%), we can conclude that the more detailed information did not improve the expert's accuracy. Knowing his error, the expert introduced, when correcting it, a larger deviation from the actual value. At the same time, Table 6 shows some learning by the expert, although no better value is achieved, owing to the limitations of human judgment.
The estimates made by analogy with the Paths metric also fluctuate, but their PQ tends to improve. Those using the COSMIC metric vary as well, with the PQ showing no appreciable improvement. In both cases, having fewer UCs available for selecting the most analogous one does not improve the estimation error.
Table 6. Comparison of the successive estimation of versions

Estimation method   MMRE   MeMRE   PQ(0.25)   ErR (min..max)
COSMIC v7           1.35   0.92    0.25       -3.36..0.2
COSMIC v9           1.18   0.9     0          -3.36..0.5
COSMIC v10          1.43   0.75    0          -6.7..0.87
COSMIC v11          1.29   0.79    0          -6.7..0.87
COSMIC v12          1.53   0.83    0.06       -6.7..0.87
Paths v7            2.22   0.66    0          -7.25..0.88
Paths v9            1.86   0.95    0          -7.25..0.88
Paths v10           1.01   0.83    0.1        -4..0.83
Paths v11           0.76   0.46    0.33       -4..0.83
Paths v12           0.94   0.46    0.39       -4.52..0.89
Expert (v7)         0.18   0.19    0.75       0.09..0.25
Expert (v9)         0.20   0.22    0.67       0.09..0.29
Expert (v10)        0.26   0.21    0.64       -1..0.35
Expert (v11)        0.22   0.17    0.67       -1..0.35
Expert (v12)        0.20   0.17    0.63       -1..0.35
4.3 Threats to Validity
The main threat to the validity of this case study is the recording of the ER carried out by the company. It is well known that such records are not exact, so they were validated against alternative records of hours worked, leading to the conclusion that the records are correct.
The omission of version 8 may draw attention. It was discarded because of the poor quality of the textual description of its use cases.
The number of use cases used is also limited and could affect the results. The use cases were selected from a period of versions for which the ER was recorded and available. Measurement time was another variable that had to be taken into account, considering that the application was complex. Although the amount of data is limited, the statistical testing of the stated hypotheses is significant.
Another aspect that may have an influence is the expert's knowledge of the actual hours in this case study. The expert joined the company at a stage after the actual development of the versions presented here and had no knowledge of the actual hours. Moreover, the team that developed the versions included had changed by the time the expert was consulted.
4.4 Conclusions of the Case Study
For the statistical verification, the non-parametric Wilcoxon test is used, since the distributions are not normal. This test was selected because, for paired samples, it can be applied to continuous distributions. It was possible to reject the null hypothesis in favour of the alternative hypothesis H1a by applying the Wilcoxon test (p-value of 0.009). It was also possible to reject the null hypothesis in favour of the alternative hypotheses H1b (p-value of 0.000), H1c (p-value of 0.006), H1d (p-value of 0.000) and H1e (p-value of 0.000).
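A paired, one-sided Wilcoxon signed-rank test of this kind can be sketched with an exact small-sample implementation (illustrative code, not the authors' analysis; it assumes no zero differences and no tied magnitudes, and the MRE lists below are invented for illustration, not the study's data):

```python
# Exact one-sided Wilcoxon signed-rank p-value (alternative: x < y),
# feasible for the small paired samples used in this kind of comparison.
from itertools import combinations

def wilcoxon_less(x, y):
    diffs = [a - b for a, b in zip(x, y)]
    # rank the differences by absolute value (1 = smallest)
    ranks = {d: r + 1 for r, d in enumerate(sorted(diffs, key=abs))}
    w_plus = sum(ranks[d] for d in diffs if d > 0)   # positive-rank sum
    n = len(diffs)
    # exact null distribution: every sign pattern is equally likely,
    # so count rank subsets whose sum does not exceed the observed W+
    count = sum(1 for k in range(n + 1)
                for s in combinations(range(1, n + 1), k)
                if sum(s) <= w_plus)
    return count / 2 ** n

# Illustrative per-use-case MRE values (expert vs. a formal method):
mre_expert = [0.09, 0.25, 0.16, 0.21, 0.23, 0.29]
mre_formal = [0.34, 0.83, 0.46, 0.52, 0.61, 0.74]
print(wilcoxon_less(mre_expert, mre_formal))   # 0.015625
```

With every expert MRE below its formal counterpart, the positive-rank sum is 0 and the exact p-value is 1/2^6.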
It is therefore concluded that, in this case study, the formal estimation methods (linear regression and analogy) were NOT able to outperform the expert's estimates, either in an estimation of all the UCs of the application or in a successive estimation of versions.
The participation of a single expert limits the conclusions of the case study, but does not invalidate them, underlining the value of any case study conducted in a real setting. At the same time, it is important to understand that the expert was characterized, and different accuracies can be expected when the expert's profile varies [20].
5 Final Conclusions
As anticipated by Jorgensen [1], there are situations in which expert estimates can be expected to be more accurate than formal estimation methods. In the case study presented here, the expert has extensive experience in software development, deep knowledge of the technology, and medium knowledge of the domain and of the team's performance. These aspects favoured the accuracy of his estimates.
It was surprising that the successive history of estimates did not bring an improvement in accuracy. This is believed to be due to the imprecision of human adjustments, although learning is observed over the succession of estimates, with the final estimate reaching a value similar to that obtained when estimating all the use cases of the application.
As for the formal estimation methods, single-variable models were used: Size or Complexity. This affects the accuracy of the models because, although these factors are the most important ones, their variation does not explain a high percentage of the variation in effort.
This work contributes a case study using metrics not surveyed in Jorgensen [1] and highlights a set of particular results, emphasizing the characteristics of the expert in a complex project from an industrial setting. To generalize its results it will be necessary to analyse other products from different domains, and to include experts with different profiles and other formal methods.
In future work it would be interesting to use models that include other variables affecting effort estimation and to vary the profile of the experts, in particular observing how the expert's accuracy varies with a high knowledge of the domain and of the performance of the development team profiles.
Acknowledgements. This project was carried out with the support of Universidad Austral and Universidad Argentina de la Empresa.
References
1. Jorgensen, M.: A Review of Studies on Expert Estimation of Software Development Effort. Journal of Systems and Software, Vol. 70, No. 1-2, 37-60 (2004)
2. Jorgensen, M., Shepperd, M.: A Systematic Review of Software Development Cost Estimation Studies. IEEE Transactions on Software Engineering, Vol. 33, No. 1, 33-53 (2007)
3. Montgomery, D.C., Peck, E.A., Vining, G.G.: Introducción al análisis de regresión lineal. Compañía Editorial Continental (2004)
4. Shepperd, M., Schofield, C.: Estimating Software Project Effort Using Analogies. IEEE Transactions on Software Engineering, Vol. 23, No. 11, 736-743 (1997)
5. COSMIC - Common Software Measurement International Consortium: The COSMIC Functional Size Measurement Method, Version 3.0.1, Measurement Manual (The COSMIC Implementation Guide for ISO/IEC 19761:2003), May (2009)
6. COCOMO II Model Definition Manual. http://csse.usc.edu/csse/research/COCOMOII/cocomo_downloads.htm
7. Robiolo, G., Badano, C., Orosco, R.: Transactions and Paths: Two Use Case Based Metrics which Improve the Early Effort Estimation. ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM '09), Lake Buena Vista, FL, October (2009)
8. Lavazza, L., Robiolo, G.: Introducing the Evaluation of Complexity in Functional Size Measurement: a UML-based Approach. ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), Bolzano-Bozen, Italy, September 16-17 (2010)
9. McCabe, T.: A Complexity Measure. IEEE Transactions on Software Engineering, Vol. SE-2, No. 4 (1976)
10. Jorgensen, M., Gruschke, T.M.: The Role of Outcome Feedback in Improving the Uncertainty Assessment of Software Development Effort Estimates. ACM Transactions on Software Engineering and Methodology, Vol. 17, No. 4, Article 20, 35 pages (2008)
11. Jorgensen, M., Halkjelsvik, T.: The Effects of Request Formats on Judgment-Based Effort Estimation. Journal of Systems and Software, Vol. 83, No. 1, 29-36 (2010)
12. Mendes, E.: Improving Software Effort Estimation Using an Expert-Centred Approach. In: Winckler, M., Forbrig, P., Bernhaupt, R. (eds.) Proceedings of the 4th International Conference on Human-Centered Software Engineering (HCSE'12), Springer-Verlag, Berlin, Heidelberg, 18-33 (2012)
13. Cuadrado-Gallego, J.J., Rodríguez-Soria, P., Martín-Herrera, B.: Analogies and Differences between Machine Learning and Expert Based Software Project Effort Estimation. In: Proceedings of the 11th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD '10), IEEE Computer Society, Washington, DC, 269-276 (2010)
14. Jørgensen, M.: Forecasting of Software Development Work Effort: Evidence on Expert Judgement and Formal Models. International Journal of Forecasting, Vol. 23, 449-462 (2007)
15. Jorgensen, M., Boehm, B.: Software Development Effort Estimation: Formal Models or Expert Judgement? IEEE Software, March/April, 14-19 (2008)
16. Hughes, R.T.: Expert Judgement as an Estimating Method. Information and Software Technology, Vol. 38, No. 2, 67-75 (1996)
17. MacDonell, S.G., Shepperd, M.J.: Combining Techniques to Optimize Effort Predictions in Software Project Management. Journal of Systems and Software, Vol. 66, No. 2, 91-98 (2003)
18. Fenton, N.E., Pfleeger, S.L.: Software Metrics. PWS Publishing Company (1997)
19. Jørgensen, M., Indahl, U., Sjøberg, D.: Software Effort Estimation by Analogy and "Regression Toward the Mean". Journal of Systems and Software, Vol. 68, No. 3, 253-262 (2003)
20. Halstead, S., Ortiz, R., Córdova, M., Seguí, M.: The Impact of Lack in Domain or Technology Experience on the Accuracy of Expert Effort Estimates in Software Projects. In: Dieste, O., Jedlitschka, A., Juristo, N. (eds.) Proceedings of the 13th International Conference on Product-Focused Software Process Improvement (PROFES'12), Springer-Verlag, Berlin, Heidelberg, 248-259 (2012)
A Comparison of Implementation Time: Android vs. HTML5
Venilton Falvo Júnior1, Lívia Castro Degrossi1, José Dario Pintor da Silva1, Ellen Francine Barbosa1
1 Universidade de São Paulo, Instituto de Ciências Matemáticas e de Computação, Av. Trabalhador São Carlense, 400, 69121 São Carlos, Brazil
{venilton, degrossi, dario, francine}@icmc.usp.br

Abstract. Recent technological advances have raised the potential of mobile devices to a new level, making a comparison of their development platforms necessary. To address this need, an experiment was conducted to determine the average implementation time of native applications for the Android platform and of Web applications based on the HTML5 standard. For the experiment, two use cases were assigned to a team of six developers. Initially, three of them implemented the first use case using HTML5, while the others built it on the Android development platform using the Java programming language. For the second use case, the teams swapped technologies. Analysis of the experimental data showed that the average implementation time using the HTML5 platform is lower than the average time using Android.
Keywords: Mobile development, native development, experimental software engineering, Android, HTML5.
1 Introduction
Mobile devices are becoming ubiquitous in today's society. Their rapid expansion means that new devices are released every day, often with a unique combination of hardware and software. In this context, one must also bear in mind the existence of several operating systems, such as Android, iOS and Windows, each with a significant number of users and potential customers.
The mobile development community adapted to the approaches specific to each operating system, a characteristic that led this approach to be called native development [6]. Alternatively, there is Web development, which in the context of mobile applications is independent of the operating system since, in theory, its applications can run on any device that has a browser.
There is some speculation about which mobile development technology offers the best development time. Some authors [2, 6, 10] argue that Web development technology is faster, or that its use has a positive influence in terms of cost. A study by the Global Intelligence Alliance claims that Web application development is weeks faster than native development, but this claim stems from interviews with developers and people involved in mobile application development rather than from an empirical study [5].
The main goal of this article is to compare these approaches, Web and native, particularly with respect to the development time spent on each. Based on the arguments presented above, the hypothesis is that Web applications take less time to implement than native applications.
Among native mobile development platforms, the Android, iOS and Windows operating systems stand out [2, 3]. With a considerably larger number of devices [2], the Android platform was chosen to represent the native approach. On the Web side, the HTML5 development platform was chosen as the other basis of comparison, on the grounds of its low development cost [2] and its recent ratification by the W3C1 as the standard for Web page development.
To test this hypothesis, an experiment was conducted with professional software developers, who were divided into two groups. The first group developed a mobile application using HTML5. The second group, in turn, developed a mobile app for the Android operating system using the Java programming language. The groups then swapped development technologies so that more experimental evidence could be collected.
After conducting the experiment, it could be confirmed that the average implementation time for mobile applications written in HTML5 is lower than the average time spent developing mobile apps for Android.
This article is organized as follows. Section II summarizes the work published to date in which Web development technologies are compared with native development technologies. Section III presents some concepts of native mobile development and Web mobile development. Section IV presents the experimental setting, research questions and hypotheses. Section V presents the results of the experiment together with their discussion. Finally, Section VI presents the conclusions and future work.
2 Background and Related Work
In today's society it is common for an individual to own at least one mobile device. Frequent internet users can notice significant changes over time in the pages they visit. Until recently, the interactivity of such pages was limited; even so, this did not prevent the growth and popularization of the internet, which gradually spread throughout the world. Today, browsers are a constant presence on mobile devices, regardless of model and operating system. The ubiquity of browsers on these devices has made the development of mobile Web applications commonplace. In addition, a wide variety of operating systems currently run on these devices. In this context, the big question for the developer community is which development approach to specialize in: Web or native.

1 http://www.w3c.br
Each approach has its own benefits and limitations. But which technology is the most suitable for developing mobile applications? Most companies face this dilemma when choosing a technology for mobile application development. On one side are user experience and application functionality; on the other, development cost and time. Although few studies explore this topic, some of them [6, 10] analyse which technology is most appropriate for the expected functionality.
The variety of operating systems gives rise to one of the most important disadvantages in this scenario. Native applications developed for a specific platform cannot run on another platform, and their development and maintenance processes are costly [6]. On the other hand, despite limited resources, one of the advantages of developing native applications is broader access to the device hardware. This makes it possible to create optimized applications specific to the hardware and system on which they will run. Conversely, the main advantage of Web applications is their compatibility across different platforms, in addition to their low development cost [2, 10]. However, unlike native applications, which have access to the device hardware, Web applications have either no access to the hardware or only limited access.
On the other hand, modern mobile devices ship with capable browsers that support new features such as HTML5. One example of this platform's potential is its advanced user interface components [6]. The same does not hold for native applications: each platform has unique characteristics with respect to user interface components, and developers must be familiar with them [7].
Therefore, choosing the best approach for developing an application depends on the specific needs of the organization and on several parameters, such as budget, schedule, internal resources, market, application functionality, and IT infrastructure, among others [6].
3 Native Mobile Development and Mobile Web Development
3.1 Native Mobile Development
A native mobile application is designed to run on a device tied to its specific operating system. These applications may come pre-installed, but they may also be available in an online repository. Notable native application repositories include the Apple App Store, which hosts applications for Apple's iOS mobile operating system, and Google Play, the application repository for Google's Android mobile operating system.
This paradigm has several advantages [10]. One of the main ones is that these applications are built to take full advantage of the device's capabilities, both in hardware (camera, geolocation features, graphics, among others) and in software (e-mail, image viewer, videos, calendars, file manager, etc.), since they run directly on the operating system.
In addition, a native mobile application can run in offline mode. Once installed, an internet connection is not strictly essential: users can use the device's resources at any time, and if the network signal is lost, data transfer resumes when the connection is reestablished.
Another advantage directly benefits the programmers who develop mobile applications. Once the application is finished, publishing it for consumption is simple: the developer uploads the application to an online repository and interested users pay a fixed price for the download.
As for disadvantages, the main one is that for native applications to work correctly on several platforms, a different version of the application is required for each platform. For example, suppose an application is developed for the Android platform. If the developer also wants to make it available for iOS, another version of the application, accounting for that platform's particularities, will have to be developed. As a result, the development cost becomes quite high.
Updating the application is another rather costly activity for those who develop native mobile applications. Indeed, each new version requires additional development, testing, and distribution work for every platform, unlike mobile Web applications, where it suffices to update a single site.
3.2 Mobile Web Development
A mobile Web application, as the name suggests, is an application available on the network and accessed through a Web browser, targeting tablets, smartphones, or any device with a browser. As with traditional Web applications, this paradigm is based on the HTML standard (responsible for the structure of Web pages, providing their essential components), CSS (which defines the page's style and presentation), and JavaScript (which defines user interactions with the page, making it dynamic). Since these applications live on a remote server and are accessed through a browser, which is the only requirement, they are platform independent; that is, they run on whatever operating system the client is using [10].
Although the application must be downloaded from the server each time it is used, mobile Web applications that use HTML5 can be run by users in offline mode. HTML5 is a language for structuring and presenting content, and this latest version of HTML brings important changes through new features such as semantics and accessibility. One of these changes is the ability to access hardware features, much as native applications do.
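The offline behavior mentioned above can be sketched with a small queue backed by HTML5 Web Storage. This is an illustrative sketch, not code from the paper: in a real page the `store` argument would be `window.localStorage`; here a plain object stands in so the idea can be shown outside a browser, and all names are invented for illustration.

```javascript
// Minimal sketch of HTML5-style offline persistence: actions are recorded
// locally while there is no connection and sent later. The `store` argument
// models window.localStorage with a plain object.
function makeOfflineQueue(store) {
  const KEY = "pendingActions";
  const read = () => JSON.parse(store[KEY] || "[]");
  return {
    // Record an action locally while the device has no connection.
    enqueue(action) {
      const pending = read();
      pending.push(action);
      store[KEY] = JSON.stringify(pending);
    },
    // When connectivity returns, send everything and clear the queue.
    flush(send) {
      const pending = read();
      pending.forEach(send);
      store[KEY] = "[]";
      return pending.length;
    },
  };
}
```

This mirrors the resume-on-reconnect behavior the text attributes to native applications, implemented with Web-standard building blocks.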
Compared with native applications, the main advantage of mobile Web applications is platform independence. Whoever builds such an application can therefore reach a larger number of users, since all the device needs is a browser. Still on the subject of advantages, mobile Web applications are relatively cheap and easy to implement [2, 10].
Regarding the disadvantages of mobile Web development, the first is that this kind of mobile application still cannot fully access the device's hardware resources. Consequently, more demanding applications that require heavier graphics and more processing power, such as games, are not supported. The second is that offline access is not completely available. Away from an internet access point, the user will need another internet access technology, such as 3G. Although wireless internet is cheap in some countries, it is not in others. These applications also depend on a server, which sometimes makes them slow or unavailable.
4 The Experiment
In order to compare the average time spent implementing Android applications and HTML5 applications, an experiment was conducted in which professional software developers acted as participants. Accordingly, the research question (RQ1) is: RQ1 - Which development platform, Android or HTML5, yields the lower average time for implementing mobile applications?
4.1 Goal
The structured definition of this experiment's goals is as follows:

Object of study: mobile development platforms.
Purpose: to identify the average implementation times of Android applications and HTML5 applications.
Focus: to investigate whether the choice between the Android and HTML5 development platforms can influence quantities such as time.
Perspective: researchers.
Context: professional software developers.

The experiment evaluates two dependent variables and four independent ones (Figure 1).
Figure 1 - Dependent and Independent Variables of the Experiment.
The average implementation time raised by the research question led to the following hypotheses:

Null Hypothesis (H0): The average time to implement applications for Android is equivalent to the average time for applications implemented in HTML5.
(t) Android = (t) HTML5
Alternative Hypothesis 1 (H1): The average time to implement applications for Android is greater than the average time for applications implemented in HTML5.
(t) Android > (t) HTML5
Alternative Hypothesis 2 (H2): The average time to implement applications for Android is lower than the average time for applications implemented in HTML5.
(t) Android < (t) HTML5
The experiment considered one factor, the implementation of a functionality, and two treatments, implementation using the Android platform and implementation using the HTML5 technology. In this context, the averages of the implementations resulting from two executions are computed.
To support the analysis, interpretation, and validation of the results, three statistical tests were used: the Shapiro-Wilk test, Levene's test, and the t-test. The Shapiro-Wilk test was used to check the normality of the samples, given their small size, fewer than 50 values [13]. Levene's test [12] was applied to check the homoscedasticity of the samples, that is, whether variances are equal across samples. Finally, the t-test [16] was used to compare the means of two independent samples.
The instrumentation phase of the experiment is presented in the next section.
4.2 Instrumentation
Participants were selected carefully so that no bias would be introduced into the experiment. To that end, we selected participants with less than one year of experience developing Web applications on the Java platform and with experience in HTML and JavaScript development. Then, with the support of a company specialized in software development, six professionals were selected to take part in the experiment. Before the start, an eight-hour training session was held, conducted by one of the authors. The training covered the fundamentals of both approaches, focusing on the essential elements of each development environment. The training dynamics were as follows:
Android: Since the participants had mastered the Java programming language, the training covered the architecture of Android applications in that language, along with a presentation of its development environment.
HTML5: Since Web development was likewise not new to any of them, the training could focus solely on the innovations of the HTML5 standard. Secondary topics such as JavaScript and CSS were also covered.
The training sessions took place in a room with the infrastructure needed for this and the other activities of the experiment.
The development environment was installed offline. For this purpose, all the files needed for installation were made available in the experimental package (available at: https://www.dropbox.com/sh/ovsmjpwtjuqzgxe/0eNRiKv-JG). This approach is of utmost importance for the continuity of this work, since the validity of a replication is directly related to the changes made to the original experiment.
In the execution phase, each participant implemented one Android application and one HTML5 application. This phase was divided into two steps, each covering one functionality to implement, both present in any application with authentication control:

Execution 1 - Functionality UC01 (Validate User Registration), which refers to validating the data of a user registration form.
Execution 2 - Functionality UC02 (Authenticate Default User), which authenticates an e-mail address and a predefined password.
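The two functionalities can be illustrated by a sketch like the following. The concrete field rules and the predefined credentials shown here are assumptions made for illustration; the real specifications are in the use case documents of the experimental package.

```javascript
// Illustrative sketch of the two functionalities used in the experiment.

// UC01 - Validate User Registration: check the fields of a registration form.
function validateRegistration(form) {
  const errors = [];
  if (!form.name || form.name.trim() === "") errors.push("name is required");
  if (!/^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(form.email || "")) errors.push("invalid e-mail");
  if (!form.password || form.password.length < 6) errors.push("password too short");
  return errors; // an empty array means the form is valid
}

// UC02 - Authenticate Default User: compare against a predefined
// e-mail/password pair (values invented here).
const DEFAULT_USER = { email: "user@example.com", password: "secret1" };
function authenticate(email, password) {
  return email === DEFAULT_USER.email && password === DEFAULT_USER.password;
}
```

In HTML5 this logic would run directly in the browser, while on Android the equivalent would be written in Java against the form widgets, which is exactly the contrast the experiment measures.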
With the executions defined, the participants had the following supporting artifacts for implementing the functionalities:
Use Case Document: A textual narrative with the full specification of a given functionality, following the template proposed by Larman [9].
Component Diagram: Captures the physical structure of the implementation, that is, it defines the physical modules of a system and their relationships [9, 14]. This diagram type was chosen to help familiarize participants with the structure of Android projects created with the ADT plugin in the Eclipse IDE.
Sequence Diagram: Shows how messages are exchanged between objects over time to carry out a given operation [4].
Execution Document: Created to support participants in carrying out their activities and to record the time taken to complete them. It also provides a feedback form on the execution.
To build the Android and HTML5 visual interfaces, two drag-and-drop tools were used: for Android, the ADT plugin [1]; for HTML5, participants were introduced to the online interface builder of the jQuery Mobile library [8].
In the first execution of the experiment, the participants in Group 1 coded a mobile application using the HTML5 technology while the participants in Group 2 coded an application for the Android platform, following use case 1. In the second execution the platforms were swapped: Group 1 implemented a mobile application for Android and Group 2 coded an application in HTML5, following use case 2. This dynamic is shown in Table 1.
Table 1 - Experiment execution dynamics.

Technology   Group 1   Group 2   Execution
Android                X         1st Execution (UC01)
HTML5        X
Android      X                   2nd Execution (UC02)
HTML5                  X
Each participant submitted their execution document so that the execution times and the feedback questionnaires could be analyzed.
Finally, a test case, which maps expected inputs and outputs [15], was defined for each execution of the experiment, since each execution has its own use case. The corresponding test case was then run against each Android application and HTML5 application, yielding the number of defects produced per participant.
5 Results and Discussion
This section presents the results of the experiment in order to answer research question RQ1. To answer it, we analyzed the average development time of the Android and HTML5 applications produced in the executions. A summary of each participant's implementation time is presented in Table 2.
The results show that participants G1.1, G1.2, G1.3, G2.1, G2.2, and G2.3, who implemented the Android application, averaged 58.50 minutes. For the development of the HTML5 application, the same participants averaged approximately 38.33 minutes. These results are shown in Figure 2.
Table 2 - Individual implementation times (minutes).

Participant   Android   HTML5
G1.1          43        36
G1.2          70        45
G1.3          52        41
G2.1          73        41
G2.2          45        35
G2.3          68        32
Figure 2 - Average implementation time on both platforms.
These results suggest that developing HTML5 applications takes, on average, less implementation time than developing Android applications in Java. In this context, one explanation is that HTML5 is interpreted, whereas on the Android development platform the code must be compiled and emulated before the application can be viewed, which can be decisive for variables such as average development time. Another explanation relates to the learning difficulty of the Android platform: in the feedback forms filled in by the participants, Android received their lowest average approval.
Thus, a preliminary analysis of the data suggests that the answer to RQ1 is HTML5, since this technology showed a difference of more than 20 minutes in average implementation time compared to the Android applications. However, no such claim can be made without sufficiently conclusive statistical evidence.
To that end, the Shapiro-Wilk test was first used to obtain the quantities needed to analyze the normal distribution of the samples (Tables 3 and 4). Since the Sig. values associated with the Shapiro-Wilk test, 0.176 for Android and 0.765 for HTML5, are greater than the 0.05 significance level, the data distribution for both technologies is assumed to be normal.
Table 3 - Shapiro-Wilk test for implementation time on the Android platform (Source: IBM SPSS).

               Shapiro-Wilk
               Statistic   df   Sig.
Time Android   0.856       6    0.176

Table 4 - Shapiro-Wilk test for implementation time on the HTML5 platform (Source: IBM SPSS).

               Shapiro-Wilk
               Statistic   df   Sig.
Time HTML5     0.953       6    0.765
Figure 3 - Normal distribution of implementation time on the Android platform (Source: IBM SPSS).
Figure 4 - Normal distribution of implementation time on the HTML5 platform (Source: IBM SPSS).
Based on Figures 3 and 4, neither sample shows a distance from the normal distribution line large enough for any point to be considered an outlier. Given these results, the most applicable hypothesis test in this context is the t-test, a parametric test in its unpaired form. Before running the t-test, Levene's test was applied to assess the homoscedasticity of the samples, that is, whether their variances are homogeneous or heterogeneous. The Sig. value of 0.001 (Table 5) is smaller than the 0.05 significance level, indicating heterogeneous variances. The difference between the means was then confirmed in the Mean Difference variable (Table 6), providing statistical confirmation that the average implementation times (Table 7) of the two technologies are significantly different.
Table 5 - Levene's test for implementation time (Source: IBM SPSS).

       Levene's Test for Equality of Variances
                                     F        Sig.
Time   Equal variances assumed       24.011   0.001
       Equal variances not assumed
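As a cross-check of Table 5, the mean-based form of Levene's test (a plausible reading of the SPSS default) can be reproduced from the raw times in Table 2. The function names are illustrative; the data are from the paper.

```javascript
// Mean-based Levene test for two groups: a one-way ANOVA F statistic
// computed over the absolute deviations from each group's mean.
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function leveneF(a, b) {
  // Transform each observation into its absolute deviation from the group mean.
  const za = a.map(x => Math.abs(x - mean(a)));
  const zb = b.map(x => Math.abs(x - mean(b)));
  const grand = mean(za.concat(zb));
  // Between-group and within-group sums of squares of the deviations.
  const ssBetween =
    za.length * (mean(za) - grand) ** 2 + zb.length * (mean(zb) - grand) ** 2;
  const ssWithin =
    za.reduce((s, z) => s + (z - mean(za)) ** 2, 0) +
    zb.reduce((s, z) => s + (z - mean(zb)) ** 2, 0);
  const dfBetween = 1;                       // k - 1 groups
  const dfWithin = za.length + zb.length - 2; // N - k
  return (ssBetween / dfBetween) / (ssWithin / dfWithin);
}

const androidTimes = [43, 70, 52, 73, 45, 68]; // Table 2
const html5Times = [36, 45, 41, 41, 35, 32];
// leveneF(androidTimes, html5Times) ≈ 24.011, matching Table 5, so
// equal variances are not assumed for the t-test.
```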
Table 6 - T-test for implementation time (Source: IBM SPSS).

       T-Test for Equality of Means
                                     t       df      Sig. (2-tailed)   Mean Diff.   Std. Error Diff.   95% CI Lower   95% CI Upper
Time   Equal variances assumed       3.471   10      0.006             20.167       5.810              7.220          33.113
       Equal variances not assumed   3.471   6.264   0.012             20.167       5.810              6.093          34.240
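The "equal variances not assumed" row of Table 6 is the Welch form of the t-test, and it can be reproduced directly from the raw times in Table 2. The function name is illustrative; the data are from the paper.

```javascript
// Welch's unequal-variances t-test for two independent samples.
function welch(a, b) {
  const mean = xs => xs.reduce((s, x) => s + x, 0) / xs.length;
  const variance = xs => {
    const m = mean(xs);
    return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
  };
  const va = variance(a) / a.length; // per-group squared standard errors
  const vb = variance(b) / b.length;
  const se = Math.sqrt(va + vb);
  const t = (mean(a) - mean(b)) / se;
  // Welch-Satterthwaite approximation for the degrees of freedom.
  const df = (va + vb) ** 2 / (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1));
  return { meanDifference: mean(a) - mean(b), stdError: se, t, df };
}

const androidTimes2 = [43, 70, 52, 73, 45, 68]; // Table 2
const html5Times2 = [36, 45, 41, 41, 35, 32];
// welch(androidTimes2, html5Times2) ≈
//   { meanDifference: 20.167, stdError: 5.810, t: 3.471, df: 6.264 }
```

These values match the Welch row of Table 6 term by term.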
Table 7 - Means for implementation time (Source: IBM SPSS).

Platform   N   Mean    Std. Deviation   Std. Error Mean
Android    6   58.50   13.398           5.470
HTML5      6   38.33   4.803            1.961
The statistical evidence corroborates the earlier supposition, since the HTML5 technology's average implementation time is approximately 35% lower than Android's. Thus, in addition to H0, H2 was also refuted, as it contradicts H1, which was statistically supported.
6 Threats to Validity
Although the null hypothesis was refuted, the experiment has some threats to validity that must be considered. First, the small number of participants is a threat to external validity, as it may negatively influence the results of the experiment. In addition, the participants' experience may lead to hasty conclusions, since, despite the efforts to balance the groups, it is impossible to validate that distribution.
Among the possible internal threats, the specifications of the proposed use cases can be considered, since the described functionalities may not have the same level of complexity with respect to the cognition of the tasks to be performed. Along the same lines, the discrepancies between the features of each development environment may have influenced, positively or negatively, the time evaluation of the Android and HTML5 development platforms.
7 Conclusion and Future Work
With a significant number of users and potential customers, choosing the type of application, Web or native, has become extremely important. In this context, the time needed to develop an application has become essential for bringing it to market faster. For this reason, investigating the average time needed to code an application proved relevant in the context of this experiment.
At the end of the experiment it was possible to verify, through statistical tests, that the average development time for mobile development using the HTML5 technology is lower than for the Android platform. This shows that, besides reaching a larger number of users, this technology also has a shorter implementation time.
Although the experiment proved consistent enough to answer the proposed research question, there are improvements that could considerably raise its relevance. From a statistical standpoint, increasing the number of participants in a future replication is necessary, since homoscedasticity and hypothesis tests yield more reliable results when applied to larger samples. It is also important to preserve the participant profile, since this experiment was run in industry with professionals specialized in software development.
Still regarding future work, a larger number of participants would enable the analysis of other kinds of variables, such as the defects produced during implementation. For that, a defect detection technique must be studied and properly documented so that the characteristics of the original experimental package are not lost. Future work could also consider the complexity of the developed code and its maintainability.
Acknowledgments
The authors would like to thank Cast Informática S.A. for providing its professionals and facilities for this experiment. We would also like to thank the funding agencies CAPES, CNPq, and FAPESP for all their support during the development of this work.
References
1. ADT Plugin, Android Developers, http://developer.android.com/tools/sdk/eclipse-adt.html
2. Appcelerator: Native vs. HTML5 Mobile App Development: Which option is best? Available: http://www.sudesicloud.com/whitepapers/appcelerator-whitepaper-native-html5-7.pdf (2012)
3. Charland, A., Leroux, B.: Mobile application development: web vs. native. ACM Queue (2011)
4. Fowler, M., Scott, K.: UML Distilled: Applying the Standard Object Modeling Language. The Addison-Wesley Object Technology Series. Addison Wesley Longman (1997)
5. Global Intelligence Alliance: Native or Web Application? How Best to Deliver Content and Services to Your Audiences over the Mobile Phone, pp. 1-38. Available: http://www.globalintelligence.com/insights-analysis/white-papers/native-or-Web-application-how-best-to-deliver-cont (2010)
6. IBM: Native, web or hybrid mobile-app development. Technical report, IBM (2012)
7. Software Developer's Journal: Mobile multiplatform programming. Technical report, Software Developer's Journal (2012)
8. jQuery Mobile, http://jquerymobile.com
9. Larman, C.: Utilizando UML e padrões. Bookman (2008)
10. Lionbridge: Mobile web apps vs. mobile native apps: How to make the right choice. Technical report, Lionbridge (2012)
11. mobiThinking: Mobile applications: native v Web apps – what are the pros and cons? Available: http://mobithinking.com/native-or-web-app (2012)
12. Olkin, I.: Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling. Stanford University Press (1960)
13. Royston, J.P.: Some Techniques for Assessing Multivariate Normality Based on the Shapiro-Wilk W. Journal of the Royal Statistical Society, vol. 32 (1983)
14. Rumbaugh, J., Jacobson, I., Booch, G.: The Unified Modeling Language Reference Manual. Pearson Higher Education (2004)
15. Sommerville, I.: Software Engineering. Addison-Wesley, Harlow (2010)
16. Wohlin, C., Runeson, P., Höst, M., Ohlsson, M.C., Regnell, B., Wesslén, A.: Experimentation in Software Engineering: An Introduction. Kluwer, Norwell (2000)
An Experimental Study of the ProgTest Environment in Programming Education

Draylson Micael de Souza, Sofia Larissa da Costa, Nemesio Freitas Duarte Filho, and Ellen Francine Barbosa

Instituto de Ciências Matemáticas e de Computação (ICMC/USP), São Carlos/SP, Brazil - 13560-970
{draylson,sofialc,nemesio,francine}@icmc.usp.br
http://www.icmc.usp.br
Abstract. Environments that support the submission and automatic evaluation of practical programming assignments have been developed as tools to support the integrated teaching of programming and software testing. Among them, ProgTest stands out: it evaluates students' work with respect to both the programming activity and the testing activity. This paper presents an experiment whose goal was to analyze the effectiveness of ProgTest on student learning. We present details of the planning, execution, and data analysis, describing the experiences, benefits, and difficulties related to the use of the environment; the results provide evidence that ProgTest improves student learning.
Keywords: Teaching and Learning, Programming, Software Testing, Experimental Study
1 Introduction
Teaching the fundamentals of programming is not a trivial task; many students have difficulty understanding programming concepts [1] and/or hold misconceptions about the programming activity [2].
Among the initiatives investigated to mitigate these problems is the integrated teaching of programming fundamentals and basic software testing concepts [3,2,4,5]. Introducing the testing activity can help students develop comprehension and analysis skills, because in order to test, students need to understand the behavior of their programs [2].
Despite its benefits, one limitation of this approach has been the scarcity of automated environments and tools that adequately support teaching and learning [5]. Indeed, building environments that support the submission and automatic evaluation of programming assignments, together with software testing learning aspects, can be of great help and relevance, enabling efficient feedback to students, who in turn receive an immediate assessment of their implementation [5].
In this context, the ProgTest environment [5,6,7] allows students to submit practical programming assignments. Using integrated testing tools, the environment automatically evaluates the students' work, giving them immediate feedback on their assignments.
Preliminary validations of ProgTest showed that the environment is able to evaluate students' assignments adequately [5,6,7]. However, no validation had been performed to assess whether the feedback provided by ProgTest actually improves student learning.
Accordingly, this paper describes an experiment involving the use of the ProgTest environment in the teaching of programming and testing. The results and analyses of the experiment make it possible to evaluate the effectiveness of ProgTest on student learning, and they describe the main experiences, benefits, and difficulties reported by the students regarding the use of the environment.
The paper is organized as follows. Section 2 presents an overview of the ProgTest environment. Section 3 describes how the experiment was planned. Section 4 describes how the experiment was carried out. Section 5 presents the results and analyses performed after the execution of the experiment. Finally, Section 6 presents the conclusions drawn from the experiment and future work.
2 The ProgTest Environment
A critical issue for the success of the integrated teaching of programming fundamentals and software testing is how to provide adequate feedback and evaluate student performance. The use of automated tools can bring additional benefits in terms of consistency, effectiveness, and efficiency. Such tools allow a final evaluation of the programs and also the generation of reports, so that each student is informed of their performance relative to the average and to the most productive students [7].
ProgTest fits into this context, having been developed as a tool for the submission and automatic evaluation of programming assignments based on testing activities. The idea is to provide automated support for evaluating the programs and test cases submitted by students [5,6,7].
For this, coverage testing tools can be integrated into ProgTest, providing support to apply the testing criteria and to evaluate the coverage of the test cases obtained from program execution. Both the quality of the code and the quality of the tests can be analyzed based on the adopted criteria.
Figure 1 illustrates the ProgTest interface. The environment's main features can be accessed through two views: instructor or student. In the instructor view, ProgTest allows the user to create new courses, enroll students in courses, and define programming assignments.
Figure 1. The ProgTest Environment

The instructor must also provide a reference implementation (the "oracle assignment"), which consists of: (1) a program implementing the correct solution to the proposed assignment; and (2) a test set for that program.
Based on the criteria and weights defined by the instructor, ProgTest computes a total coverage for the oracle assignment. It then computes the coverage for the following combinations:

1. the student's program with the student's test set;
2. the instructor's program with the student's test set;
3. the student's program with the instructor's test set.

From the coverages obtained, a grade for the student's assignment is computed as their weighted average, with the weight of each coverage defined by the instructor.
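The grading rule just described can be sketched as a weighted average over the three coverages. This is an illustrative sketch of the rule as the text states it, not ProgTest's actual code; the function name and the example weights are assumptions.

```javascript
// Sketch of ProgTest's grading rule: the grade is the weighted average of
// the coverages of the three program/test-set combinations, with weights
// chosen by the instructor. The arrays are parallel over the combinations:
// [student program + student tests, instructor program + student tests,
//  student program + instructor tests].
function progTestGrade(coverages, weights) {
  const totalWeight = weights.reduce((s, w) => s + w, 0);
  const weighted = coverages.reduce((s, c, i) => s + c * weights[i], 0);
  return weighted / totalWeight;
}

// Hypothetical example: coverages of 0.9, 0.8 and 1.0 with equal weights.
// progTestGrade([0.9, 0.8, 1.0], [1, 1, 1]) ≈ 0.9
```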
After submitting their assignment, students can view the evaluation report, along with other reports provided by the integrated testing tools. Moreover, students can submit new versions of their assignment until they reach the maximum grade.
Preliminary validations of ProgTest showed that the environment is able to evaluate students' assignments adequately [5,6,7]. However, no validation had been performed to assess whether the feedback provided by ProgTest actually improves student learning.
Accordingly, a controlled experiment was conducted involving the use of ProgTest by undergraduate students. The goal of the experiment was to evaluate whether student learning improves with the support of the ProgTest environment. The planning phase of the experiment is described next.
3 Experiment Planning

The goal of the experiment was outlined using the GQM (Goal/Question/Metric) template [8], and the goal is as follows:
– Object of study: student learning as fostered by the feedback of the ProgTest environment.
– Purpose: validation of the ProgTest environment.
– Focus: completeness and quality of the assignments developed by the students.
– Perspective: academic.
– Context: undergraduate students.
Accordingly, the research questions for the experiment were:
1. Do students who use ProgTest produce higher-quality programs than those who do not use the environment?
2. Do students who use ProgTest produce higher-quality test cases than those who do not use the environment?
3. Do students who use ProgTest perform more complete testing and debugging activities than those who do not use the environment?
To address these questions, seven metrics defined and measured by the ProgTest environment itself were used:
– Completeness of the debugging activity: assesses whether the student fixed all the program defects revealed by their test cases;
– Completeness of the testing activity: assesses whether the student tested all the elements of their program;
– Program correctness: checks whether there are defects in the student's program;
– Test case correctness: checks whether there are defects in the test cases;
– Program adequacy: checks whether the program is well structured, i.e., whether there are no code fragments that are unnecessary or that could be rewritten more simply (with fewer statements and control-flow branches);
– Test set adequacy: checks whether the student's test set is adequate with respect to the testing criteria considered;
– Suggested grade for the assignment: provides an overall evaluation of the student's assignment based on the other metrics above.
With the goals and metrics defined, the following hypotheses were considered:
– H0 = The support provided by ProgTest does not improve student learning.
– H1 = The support provided by ProgTest improves student learning.
The null hypothesis H0 is the one we aimed to reject, and the alternative hypothesis H1 the one we aimed to accept. To assess which hypothesis holds, the following variables were considered:
– Dependent variables: completeness of the testing and debugging activities, quality of the programs, and quality of the test sets.
Estudo Experimental do Ambiente ProgTest no Ensino de Programacao 5
– Independent variables: feedback on the assignments (with immediate feedback vs. without feedback), participants' experience, programming language, and working environment.
The idea of the experiment was to split the participants into two groups with different treatments regarding the variable "feedback on the assignments". The first group carried out the proposed activities without any feedback on the quality of their programs and test sets. This treatment was labeled "Traditional Learning" in this experiment.
The second group, in turn, carried out the same activities, but with the support of ProgTest, which provided them with immediate feedback on the quality of their programs and test sets. This treatment was labeled "Learning with ProgTest".
Thus, by comparing the quality of the assignments delivered by both groups, it was possible to determine for which group learning was better.
Since participants have different levels of experience, the variable "participants' experience" also had to be controlled so as not to invalidate the results of the experiment. If the participants of one group were more experienced than those of the other, their assignments would be expected to be of higher quality. In that case, assignment quality would result from the participants' experience rather than from the feedback provided by ProgTest.
Therefore, we sought to form the two groups in a balanced way, so that, overall, both groups were similar with respect to the participants' experience. In addition, after the experiment was run, a statistical analysis was performed to check how much the participants' experience influenced the results obtained. This analysis is described in Section 5.4.
The programming language adopted for the experiment was C, for both groups.
Finally, to ensure that all participants carried out the activities in the same working environment, Live DVDs were produced with all the tools and files needed for the proposed activities. All participants thus performed the activities in the environment set up on the Live DVDs. Details about the Live DVDs and the other artifacts used in the experiment are described next.
3.1 Instrumentation
As mentioned earlier, among the artifacts produced for the experiment, Live DVDs were created with the tools and files participants needed to carry out the proposed activities. Three kinds of Live DVDs were prepared: (1) the traditional-learning Live DVD, for participants who did not use the ProgTest environment; (2) the learning-with-ProgTest Live DVD, for those who used ProgTest; and (3) the instructor's Live DVD, for demonstration and training.
The main difference between the Live DVDs is that the traditional-learning Live DVD has testing tools installed and configured, so that participants could carry out the testing activities without ProgTest. The learning-with-ProgTest Live DVD, instead of the testing tools, has as its browser home page the address where a ProgTest instance was being served, already configured to evaluate the participants' assignments with the proposed exercises and with the same testing tools considered in the traditional-learning Live DVD.
The instructor's Live DVD combines the other two, so that the instructor could run demonstrations using both the testing tools and ProgTest.
Among the other artifacts developed are the source files for the proposed exercises. As presented in Section 2, ProgTest needs a reference implementation to evaluate the students' assignments. In addition, templates for the programs and test sets to be developed by the students were also produced, so that participants only had to complete the files with their algorithms and test cases. The templates were made available on the Live DVDs, and the reference implementations were taken into account when setting up ProgTest for the execution of the experiment.
Moreover, printed artifacts, specific to traditional learning and to learning with ProgTest, were prepared to help participants carry out the activities, namely: (1) statements of the proposed exercises; (2) guides describing the procedures participants should follow to complete the proposed activities; and (3) participation questionnaires to gather information from participants about their experience during the execution of the experiment.
Finally, an experimental package was built, containing the artifacts described in this section as well as replication instructions. The experimental package is available at: www.labes.icmc.usp.br/~draylson/PacoteExperimental/
3.2 Procedures
The experiment was planned to be conducted in two phases: training and execution. The training phase aimed to ensure that participants were familiar with the environments, tools, activities, artifacts, and methods considered in the experiment, so that lack of familiarity with these elements would not influence the results.
The training phase consisted of: (1) a demonstration of how to solve a programming and testing exercise, using both the testing tools to be used by the traditional-learning group and the ProgTest environment to be used by the learning-with-ProgTest group; and (2) the solving, by the participants, of a similar exercise, with the instructor available to answer their questions.
The execution phase, in turn, consisted of: (1) solving a programming and testing exercise similar to those of the training phase, with the instructor no longer allowed to answer any questions about the programming and testing activities; and (2) filling out the participation questionnaires, in order to gather information from participants about their experiences, opinions, and suggestions regarding the methods considered (in particular, from those in the learning-with-ProgTest group).
Note that only the data from the execution phase were actually considered in the data analysis.
3.3 Threats to the Validity of the Experiment
As part of the planning activities, the threats to the validity of the experiment were identified. This activity is essential, as it provides a better understanding of the mistakes that can be made while conducting the experiment. For this experiment, we highlight the following threats:
– Threats to internal validity:
• Learning: using the ProgTest tool is not intuitive and requires adequate training.
• Conformance with the original study: the training time and the difficulty level of the exercises must be carefully established.
– Threats to external validity:
• Number of participants: a very small number of participants may not adequately reveal the effectiveness of ProgTest on student learning.
• Experiment location: the experiment must be carried out in laboratories properly equipped with computers and internet access.
– Threats to construct validity:
• The researchers must handle the variables carefully, keeping the defined goals and objectives in mind.
After planning the experiment, an expert review and a pilot test with the experimentation team were carried out, in which the artifacts were validated and the time needed to conduct the experiment was estimated. Adjustments to the artifacts were then made, and some aspects of the planning were refined.
4 Experiment Execution
The experiment was carried out in one of the undergraduate laboratories at ICMC/USP. The main difficulty encountered was that the machines were configured not to allow booting from external devices and media, such as the Live DVDs prepared for the experiment. However, the team had anticipated this and provided some of their own notebooks to the participants, allowing the execution of the experiment to proceed as planned.
During training, a demonstration of how to solve a programming and testing exercise was given, including a brief presentation of the tools and an introduction to basic software testing concepts. The students then solved a programming problem and created test cases to test their programs.
After the training, the execution procedures began: the students solved a programming problem similar to the training one (but more difficult) and created test cases to test their programs. They then answered the participation questionnaire.
The use of pre-configured Live DVDs proved successful and eased the execution of the experiment. Moreover, the execution time was shorter than estimated. This may be because the students already had considerable programming experience, or because the difficulty level of the exercises did not pose a very complex challenge.
5 Analysis of the Results
For the analysis of the results, a descriptive analysis of the data was chosen, using descriptive statistics and graphical analysis. Descriptive statistics are used to describe relevant characteristics of the collected data. The goals of the descriptive analysis are to identify central tendencies of the variables and their treatments, to determine their dispersion, to identify outliers, and to identify correlations.
As described in Section 3, the metrics used to assess the quality of the assignments developed by the students were computed by the ProgTest environment itself. The metrics for the learning-with-ProgTest group were thus collected directly from ProgTest.
To obtain the metrics for the traditional-learning group, on the other hand, the experimenters themselves submitted the assignments produced by this group to ProgTest and collected the resulting metrics.
In addition, the data obtained from the participation questionnaires were also considered, in order to analyze the users' profile and experience, as well as their suggestions and criticisms regarding ProgTest.
For each of the defined metrics, measures of central tendency were computed and used to compare the two groups.
Finally, the correlation between the users' experience and the grades suggested by ProgTest for their assignments was assessed. The idea was to check whether ProgTest actually influenced the participants' learning, or whether the participants' programming experience determined the results obtained.
5.1 Participants' Experience
Tables 1 and 2 show the profile of the participants who carried out the experiment with the traditional learning method and with the learning-with-ProgTest method, respectively, based on the students' answers to the participation questionnaires.
One observable characteristic is that none of the students is a programming novice. Since the ProgTest environment is aimed at
Table 1. Participants – Traditional Learning
Student | Programming Experience | Testing Experience | Difficulty Programming | Difficulty Testing
1 | Expert | Expert | Very Easy | Easy
2 | Intermediate | Novice | Very Easy | Very Easy
3 | Expert | Intermediate | Easy | Medium
Table 2. Participants – Learning with ProgTest
Student | Programming Experience | Testing Experience | Difficulty Programming | Difficulty Testing
1 | Expert | Intermediate | Very Easy | Easy
2 | Intermediate | Novice | Easy | Easy
3 | Intermediate | Intermediate | Very Easy | Very Easy
supporting introductory programming education, better results might have been obtained if novice participants had taken part in the experiment.
Another important factor is that, due to the small number of participants, it was not possible to adequately balance the two groups with respect to the participants' programming experience. To check whether the results were affected by their experience, a correlation analysis between the students' experience and the grades assigned to their assignments was performed. This analysis is described in Section 5.4.
5.2 Satisfaction with ProgTest
Still regarding the questionnaire data, the participants in the learning-with-ProgTest group responded positively to the environment: all of them agreed that ProgTest helped them solve the programming problems and create test cases.
The participants also pointed out some strengths of the environment, notably: (1) ease of submission; (2) support for learning programming and testing; (3) display of the lines of code executed by the test cases, which helps in designing new test cases; and (4) support for compiling and running the test cases, since ProgTest performs these tasks automatically.
Some weaknesses were also pointed out, which may be useful for future improvements of the environment. In general terms, the participants noted that the layout and presentation of the information in the reports provided by ProgTest need to be improved, to make them more intuitive and less repetitive.
In addition, there was some criticism of the fact that the environment's interface is only available in English, which made it harder to use ProgTest and to interpret its reports.
The participants also suggested ways to improve the presentation of the information, notably: (1) adding textual explanations of each metric considered in the evaluation and of why the student's assignment did not reach 100% on each metric; and (2) showing the expected and actual outputs of each test case, since, for the C language, ProgTest does not yet display this information, showing only which test cases passed and which failed.
5.3 Coverages and Grades
To assess the effectiveness of the ProgTest environment on student learning, we analyzed the quality of the assignments submitted by the two groups of participants, according to the metrics presented in Section 3. Table 3 shows the results obtained by the students who carried out the experiment with the learning-with-ProgTest method. As can be seen, all participants supported by ProgTest reached maximum coverage on all metrics.
Table 3. Results – Learning with ProgTest
Student | Debugging Completeness | Testing Completeness | Program Correctness | Program Adequacy | Test Correctness | Test Adequacy
1 | 100% | 100% | 100% | 100% | 100% | 100%
2 | 100% | 100% | 100% | 100% | 100% | 100%
3 | 100% | 100% | 100% | 100% | 100% | 100%
Table 4, in turn, shows the results obtained by the students who carried out the experiment with traditional learning. When their assignments were submitted to ProgTest, problems were identified in the assignments of two students, who did not reach maximum coverage, mainly on the metrics related to program quality (program correctness and program adequacy) and to the completeness of the activities (completeness of the debugging activity and completeness of the testing activity).
Table 4. Results – Traditional Learning
Student | Debugging Completeness | Testing Completeness | Program Correctness | Program Adequacy | Test Correctness | Test Adequacy
1 | 100% | 95% | 100% | 95% | 100% | 100%
2 | 100% | 100% | 100% | 100% | 100% | 100%
3 | 25% | 50% | 25% | 50% | 100% | 100%
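The per-metric means and medians summarized in the figures below can be recomputed directly from Tables 3 and 4. A short sketch, with the data transcribed from the two tables (variable names are ours):

```python
from statistics import mean, median

# Per-student metric values from Tables 3 and 4 (rows: students; columns:
# debugging completeness, testing completeness, program correctness,
# program adequacy, test correctness, test adequacy), in percent.
progtest_group = [
    [100, 100, 100, 100, 100, 100],
    [100, 100, 100, 100, 100, 100],
    [100, 100, 100, 100, 100, 100],
]
traditional_group = [
    [100, 95, 100, 95, 100, 100],
    [100, 100, 100, 100, 100, 100],
    [25, 50, 25, 50, 100, 100],
]

def per_metric(stat, group):
    """Apply a central-tendency statistic column-wise (one value per metric)."""
    return [round(stat(col), 2) for col in zip(*group)]

print(per_metric(mean, traditional_group))    # [75, 81.67, 75, 81.67, 100, 100]
print(per_metric(median, traditional_group))  # [100, 95, 100, 95, 100, 100]
```

Note how the low scores of student 3 pull the traditional group's means well below 100% while leaving most medians at the maximum, which matches the contrast between the mean-based and median-based comparisons discussed next.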
The chart in Figure 2 compares the results obtained by the two groups using the mean of the students' results for each metric, making clear the difference in quality between the assignments submitted by the two groups.
The correctness and adequacy of the programs developed by the traditional-learning group were, on average, lower than those of the learning-with-ProgTest group. Looking at the metrics completeness of the debugging activity and
Figure 2. Coverages and Grades – Means
completeness of the testing activity, one can also conclude that, on average, the participants in the traditional-learning group failed to debug all the defects in their programs and did not test their programs completely, which explains the lack of correctness and adequacy of their programs.
Using the median of the results (Figure 3), the difference is smaller. Still, it can be seen that the students supported by ProgTest produced assignments of slightly higher quality.
Figure 3. Coverages and Grades – Medians
As can be seen, the median program adequacy of the traditional-learning group was lower than that of the learning-with-ProgTest group. This metric checks whether the program is well structured, i.e., whether there are no code fragments that are unnecessary or that could be written more simply (with fewer statements and control-flow branches).
In this sense, a lack of program adequacy increases the number of program elements to be tested, making the testing activity more complex. This factor may also explain the lower median of the completeness-of-testing metric, which shows that the programs of the traditional-learning group were not completely tested.
5.4 Spearman Correlation
As discussed earlier, due to the small number of participants, it was not possible to adequately balance the two groups with respect to the participants' programming experience. To check whether the results were affected by their experience, a correlation analysis between the students' experience and the grades assigned to their assignments was performed.
Since the data obtained are not normally distributed, the Spearman correlation [9] was computed to assess the correlation between the participants' experience and the grades they obtained, as shown in Table 5.
Table 5. Spearman Correlation
Student | Experience (X) | Rank of Xi | Grade (Y) | Rank of Yi | di | d²i
1 | 3 | 2 | 9.85% | 5 | -3 | 9
2 | 2 | 5 | 10.00% | 2.5 | 2.5 | 6.25
3 | 3 | 2 | 5.83% | 6 | -4 | 16
4 | 3 | 2 | 10.00% | 2.5 | -0.5 | 0.25
5 | 2 | 5 | 10.00% | 2.5 | 2.5 | 6.25
6 | 2 | 5 | 10.00% | 2.5 | 2.5 | 6.25
Total | | | | | | 44
Although the sum of squared rank differences is 44, to assess the significance of this value it was necessary to compute the correlation coefficient (rs):
rs = 1 − (6 Σd²i) / (n³ − n) = 1 − (6 × 44) / (6³ − 6) = −0.257142857
Applying the Hugg Test [9] (Table 6), which defines the significance for each interval of the absolute value of the correlation coefficient, it was possible to conclude that the correlation between the participants' experience and their grades is weak.
Finally, using the coefficients of determination (r²s) and alienation (K), it was possible to determine that the grades of the assignments varied only 6.61% with respect to the
Table 6. Hugg Test
Interval | Significance
0.00 – 0.20 | Null correlations
0.21 – 0.40 | Weak correlations
0.41 – 0.70 | Substantial correlations
0.71 – 0.90 | Strong correlations
0.91 – 1.00 | Extremely strong correlations
participants' experience, and that there is a 96.64% absence of correlation between the participants' experience and the grades of their assignments:
r²s = (−0.257142857)² = 0.066122449 = 6.61%

K = √(1 − r²s) = √(1 − 0.066122449) = 0.966373401 = 96.64%
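The whole analysis can be reproduced from the d²i column of Table 5. A stdlib-only sketch (variable names are ours, not from the paper):

```python
# Recomputing the Spearman analysis above from the d_i^2 column of Table 5.
from math import sqrt

d_squared = [9, 6.25, 16, 0.25, 6.25, 6.25]  # squared rank differences
n = len(d_squared)                            # six participants

r_s = 1 - (6 * sum(d_squared)) / (n**3 - n)   # Spearman coefficient
r_s2 = r_s**2                                  # coefficient of determination
K = sqrt(1 - r_s2)                             # coefficient of alienation

print(round(r_s, 4), round(r_s2 * 100, 2), round(K * 100, 2))
# → -0.2571 6.61 96.64
```

Since |r_s| = 0.2571 falls in the 0.21–0.40 interval of Table 6, the correlation is classified as weak, matching the conclusion above.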
We therefore conclude that the participants' experience had little influence on the quality of their assignments, which leads us to conclude that the feedback provided by ProgTest was decisive in enabling the learning-with-ProgTest group to develop assignments of better quality. Thus, of the hypotheses defined for this experiment (Section 3), hypothesis H1 was accepted:
H1 = The feedback provided by ProgTest helps student learning.
6 Conclusion
In this work, an experiment with the ProgTest environment was carried out, describing the experiences, benefits, and difficulties related to the use of the environment. The entire experiment was planned and executed following well-defined guidelines. The results and analyses produced by the experiment made it possible to assess how effective ProgTest's feedback is on student learning.
At the end of the experiment, the descriptive and statistical analyses showed that the results with ProgTest were better than those with traditional learning. The students in the traditional learning method performed, in general, worse than those in the learning-with-ProgTest method; the environment helped students identify errors and their fixes, providing more adequate feedback on the quality of their programs and test cases.
In addition to the measures of central tendency, the authors used the Spearman correlation as a complement, relating the participants' experience to their grades. This analysis was performed to check whether the participants' programming experience might have influenced the positive result. Based on the correlation between the grades and the participants' experience, however, we conclude that experience had no bearing on the results.
It can be concluded that there is evidence that student learning with ProgTest is better than with traditional learning. Despite this evidence, the study should be replicated with a larger and more homogeneous sample, in order to verify and generalize the results obtained.
As future work, the authors intend to replicate the experiment with a larger number of participants in introductory programming courses, increasing the sample size together with the homogeneity of experience by recruiting students who are new to programming.
Acknowledgments
The authors thank FAPESP, CNPq, and CAPES for their financial support.
References
1. Lahtinen, E., Ala-Mutka, K., Järvinen, H.M.: A study of the difficulties of novice programmers. SIGCSE Bulletin 37(3) (June 2005) 14–18
2. Edwards, S.H.: Using software testing to move students from trial-and-error to reflection-in-action. SIGCSE Bulletin 36(1) (March 2004) 26–30
3. Dvornik, T., Janzen, D.S., Clements, J., Dekhtyar, O.: Supporting introductory test-driven labs with WebIDE. In: Proceedings of the 2011 24th IEEE-CS Conference on Software Engineering Education and Training. CSEET '11, Honolulu, HI, USA (2011) 51–60
4. Janzen, D.S., Saiedian, H.: Test-driven learning: intrinsic integration of testing into the CS/SE curriculum. SIGCSE Bulletin 38(1) (March 2006) 254–258
5. Souza, D.M., Maldonado, J.C., Barbosa, E.F.: ProgTest: An environment for the submission and evaluation of programming assignments based on testing activities. In: Proceedings of the 2011 24th IEEE-CS Conference on Software Engineering Education and Training. CSEET '11, Honolulu, HI, USA (2011) 1–10
6. Souza, D.M., Maldonado, J.C., Barbosa, E.F.: Uma contribuição à submissão e avaliação automática de trabalhos de programação com base em atividades de teste. In: Anais do II Congresso Brasileiro de Software: Teoria e Prática - XVIII Sessão de Ferramentas. CBSoft '11, São Paulo, SP, Brazil (2011) 65–71
7. Souza, D.M., Maldonado, J.C., Barbosa, E.F.: ProgTest: Apoio automatizado ao ensino integrado de programação e teste de software. In: Anais do XXII Simpósio Brasileiro de Informática na Educação - XVII Workshop de Informática na Educação. SBIE-WIE '11, São Paulo, SP, Brazil (2011) 1893–1897
8. Basili, V.R.: Applying the goal/question/metric paradigm in the experience factory. Chapter 2 in Software Quality Assurance and Measurement: A Worldwide Perspective, Norman Fenton, Robin Whitty, and Yoshinori Iizuka (editors), ISBN: 1850321744, International Thomson Publishing, London, UK (April 1996)
9. Vieira, S.: Estatística Experimental. 2nd edn. Atlas (1999)
A Framework for Software Reference Architecture Analysis and Review
Silverio Martínez-Fernández1, Claudia Ayala1, Xavier Franch1, Helena Martins Marques2, and David Ameller1
1 GESSI Research Group, Universitat Politècnica de Catalunya, Barcelona, Spain {smartinez,cayala,franch,dameller}@essi.upc.edu
2 everis, Barcelona, Spain [email protected]
Abstract. Tight time-to-market needs push software companies and IT consulting firms to continuously look for techniques to improve their IT services in general, and the design of software architectures in particular. The use of software reference architectures allows IT consulting firms to reuse architectural knowledge and components in a systematic way. In return, IT consulting firms face the need to analyze the return on investment in software reference architectures for organizations, and to review these reference architectures in order to ensure their quality and incremental improvement. Little support exists to help IT consulting firms face these challenges. In this paper we present an empirical framework aimed at supporting the analysis and review of software reference architectures and their use in IT projects by harvesting relevant evidence from the wide spectrum of involved stakeholders. The framework stems from an action research approach held at everis, an IT consulting firm. We report the issues found so far.
Keywords: Software architecture, reference architecture, architecture analysis, architecture evaluation, empirical software engineering.
1 Introduction
Nowadays, the size and complexity of information systems, together with critical time-to-market needs, demand new software engineering approaches to design software architectures (SA) [17]. One of these approaches is the use of software reference architectures (RA), which make it possible to systematically reuse knowledge and components when developing a concrete SA [8][13].
As defined by Bass et al. [3], a reference model (RM) is "a division of functionality together with data flow between the pieces" and an RA is "a reference model mapped onto software elements (that cooperatively implement the functionality defined in the reference model) and the data flows between them".
A more detailed definition of RAs is given by Nakagawa et al. [17]. They define an RA as “an architecture that encompasses the knowledge about how to design concrete architectures of systems of a given application [or technological] domain; therefore, it
must address the business rules, architectural styles (sometimes also defined as architectural patterns that address quality attributes in the reference architecture), best practices of software development (for instance, architectural decisions, domain constraints, legislation, and standards), and the software elements that support development of systems for that domain. All of this must be supported by a unified, unambiguous, and widely understood domain terminology".
In this paper, we use these two RA definitions. We show the relationships among RM, RM-based RA, and RA-based concrete SA in Fig. 1. Throughout the paper, we use the term RA to refer to RM-based RA and SA to refer to RA-based concrete SA. Angelov et al. have identified the generic nature of RAs as the main feature that distinguishes them from concrete SAs [2]. Every application has its own, unique SA, which is derived from an RA. This is possible because RAs are abstract enough to allow their use in differing contexts.
Fig. 1. Relationships among RM, RA and SA.
Research problem. The motivations behind RAs are: to facilitate reuse, and thereby harvest potential savings through reduced cycle times, cost, and risk, and increased quality [8]; to help with the evolution of a set of systems stemming from the same RA [13]; and to ensure standardization and interoperability [2]. For these reasons, RAs are becoming a key asset of organizations [8].
However, although the adoption of an RA may bring many benefits to an organization, it also implies several challenges, such as the need for an initial investment [13] and ensuring its adequacy for the organization's portfolio of applications. Hence, in order to use RAs, software companies and information technology consulting firms face two fundamental questions:

1) Is it worth investing in the adoption of an RA?

2) Once adopted, how can the suitability of an RA for deriving concrete SAs for an organization's applications be ensured?
Currently, organizations lack support for dealing with these questions. On the one hand, there is a shortage of economic models to precisely evaluate the benefit of architecture projects [6] and thereby make informed decisions about adopting an RA in an organization. On the other hand, although there are qualitative evaluation methods for RAs [1][12][14], they do not systematize how RAs should be evaluated with respect to certain quality attributes (for instance, their capability to satisfy the variability in applications developed from RAs [18]).
In this context, the goal of this research is to devise a framework that supports organizations in dealing with the aforementioned questions by providing procedural guidelines for setting up and carrying out empirical studies aimed at extracting evidence for: 1) supporting organizations in assessing whether it is worth adopting an RA, and 2) ensuring the suitability of an RA for deriving concrete SAs for an organization's applications.
It is worth mentioning that this research has its origin in an ongoing action-research initiative between our research group and everis, a multinational consulting company based in Spain. The architecture group of everis faced the fundamental questions stated above, and the framework proposed in this paper was mainly originated and shaped through our involvement in helping everis envisage a suitable solution. The idea behind devising such a framework is twofold: to help other organizations dealing with problems similar to those of everis, and to improve the guidelines of the framework through the experience gained in each of its applications, in order to consolidate architectural knowledge from industrial practice.
The paper is structured as follows. In Section 2 we describe the fundamental aspects of RAs that we suggest should be assessed. In Section 3 we describe the empirical studies that compose the framework. In Section 4 we present the context of IT consulting firms and show how the framework can be applied in that context. In Sections 5 and 6 we present preliminary results of two studies of the framework applied at everis. In Section 7 we conclude and outline future work.
2 Practical Review Criteria for Reference Architectures
In order to devise the framework for RA analysis and review, it is first necessary to identify relevant aspects for assessing RAs. However, no commonly accepted set of criteria to assess RAs exists [1][12-14]. Thus, in this section we identify important aspects for assessing RAs from practice and from the literature. The framework presented in this paper envisages these aspects as a primary input for their further refinement based on evidence from organizations.
In [1], Angelov et al. state that SAs and RAs have to be assessed with respect to the same aspects. For this reason, we started by analyzing the available work on SA assessment [4][10]. However, existing evaluation methods for SAs are not directly applicable to RAs because they do not cover the generic nature of RAs [1]. Therefore, we elaborated on this analysis considering both the specific characteristics of RAs as described in [1][12-13][17] and our own experience in the field. The resulting aspects for assessing RAs are detailed below and summarized in Table 1.
Aspect 1 refers to the need for an overview of the RA. It includes an analysis of its generic functionalities, its domain [1], its origin and motivation, its correctness and utility, and its support for efficient adaptation and instantiation [13]. Since RAs are defined to abstract from certain contextual specifics, allowing their usage in differing contexts [2], their support for efficient adaptation and instantiation while deriving concrete SAs is an aspect to assess [13].
Many prominent researchers [1][7][10][12] highlight the importance of quality attributes, as well as architectural decisions, for the SA design process and architectural assessment. These two aspects should also be considered in RA assessment because, as noted above, SAs and RAs have to be assessed with respect to the same aspects [1]; we consider them as Aspects 2 and 3, respectively. However, since an RA has to address more architectural qualities than an SA (e.g., applicability) [1], this analysis could be broader for RAs. A list of quality attributes that are strictly determined by SAs is defined in [7]. This list consists of the following ten quality attributes: performance, reliability, availability, security, modifiability, portability, functionality, variability, subsetability and conceptual integrity.
SAs also address business qualities [1] (e.g., cost, time-to-market), that is, business goals that affect their competence [4]. This is considered as Aspect 4.
To improve the SA design process, there also exist supportive technologies such as methods, techniques and tools [10][17]. Thus, it is important to collect data to assess not only an RA's design process but also its supportive technologies; these are assessed by Aspects 5 and 6.
As stated in [10], a crucial aspect in defining the goodness of an SA is its ROI. The optimal set of architectural decisions is usually the one that maximizes the ROI. Aspect 7 is intended to quantify the benefits and costs of RAs in order to calculate their ROI.
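For concreteness, the standard definition of ROI can be written as follows; this is the generic formulation rather than a specific model from [10]:

```latex
\mathrm{ROI}_{\mathrm{RA}} = \frac{B_{\mathrm{RA}} - C_{\mathrm{RA}}}{C_{\mathrm{RA}}}
```

where B_RA aggregates the benefits (e.g., development and maintenance savings across all SA projects derived from the RA) and C_RA the costs of adopting and maintaining the RA.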
We recommend gathering evidence about all these aspects, which are summarized in Table 1, while assessing an RA. Existing methods for SA assessment have been previously applied for RA assessment, such as in [1][12][14]. However, none of them cover all the aspects of Table 1, especially Aspect 7. Hence, new approaches to assess RAs considering these aspects altogether are required. This has motivated our work.
These architectural aspects can be divided into two areas of different nature. First, Aspects 1 to 6 are qualitative architectural concerns. Second, Aspect 7 consists of quantitative metrics to calculate the benefits and costs of deriving SAs from RAs.
Table 1. Summary of relevant aspects for software reference architecture assessment.
Aspect | Description of the Architectural Aspect

Qualitative:
1. Overview: functionalities [1], origin, utility and adaptation [13]
2. Requirements and quality attributes analysis [1][10][12]
3. Architectural knowledge and decisions [10][12][17]
4. Business qualities [1] and architecture competence [4]
5. Software development methodology [10][17]
6. Technologies and tools [10][17]

Quantitative:
7. Benefits and costs metrics to derive SAs from RAs [10]
3 An Empirical Framework to Review Reference Architectures
In this section, we present the ongoing version of our empirical framework. It is composed of an assortment of empirical studies, each of which reviews a subset of the relevant architectural aspects presented in Table 1.
Current economic models [16] and RA evaluation methods (e.g., [1][12-14]) suggest gathering data on Aspect 7 and Aspects 1-6, respectively, directly from organizations. However, they do not provide support or guidelines for doing so. Thus, our framework aims to support this gathering process, while suggesting the application of any of the existing methods to evaluate, on the empirically obtained data, either the economic costs of adopting an RA or its quality. The selection of the method used in each situation would depend on the organization's context [5].
Regarding Aspect 7, an economic model becomes necessary. The data needed to feed such an economic model depends on the value-driven data existing in the organization (see Sections 3.1 and 5). Such data may be gathered by conducting post-mortem studies that collect real metrics or, when the organization does not have previous experience with RAs, by estimating these metrics using historical data.
In order to gather data to cover Aspects 1-6, our framework suggests conducting surveys (see Sections 3.3 and 6). These studies gather information not only from RA projects, but also from SA projects as they are a direct outcome of the RA usage. This allows analyzing the RA’s suitability for producing the SAs of enterprise applications in organizations as well as detecting improvement opportunities.
Fig. 2 summarizes the studies that compose the framework. The studies are classified by their approach to assessing the RA (qualitative or quantitative), depending on which question of Section 1 they answer.
Moreover, the studies have been designed following the guidelines for empirical studies of Wohlin et al. [21], who state that "it is in most cases impossible to start improving directly". Therefore, the framework is based on three steps: understand, evaluate and improve. The current version of the framework deals with the two former steps: the surveys belong to the understanding step, and the models and methods to the evaluation step. Thus, the studies are complementary and support each other (e.g., results from a preceding study can be used to corroborate or further develop later ones). For this reason, the suggested studies should be conducted sequentially.
[Fig. 2 arranges the four studies in two steps and two columns. Step 1 (Understand): a survey to check the value-driven data existing in organizations, yielding the data organizations have to quantitatively calculate the costs and benefits of adopting an RA; and a survey to understand the impact of using an RA, yielding evidence about RA practices and RA impact on the organization, refined review criteria for RAs, and the context of the organization. Step 2 (Evaluate): an economic model to calculate the ROI of adopting an RA, answering "What is the value of an RA?" (quantitative); and an architectural evaluation method specific for RAs, answering "How well does an RA support key aspects?" (qualitative). The quantitative column addresses the question "Is it worth to invest in the adoption of an RA?", while the qualitative column addresses "Once adopted, how can we ensure the suitability of an RA for deriving concrete SAs for an organization's applications?"]

Fig. 2. Empirical studies of the framework to assess RAs.
3.1 Surveys to check existing value-driven data in RA projects
Context. Typically, organizations do not have the resources to compare the real cost of creating applications with and without an RA. Thus, alternatives should be envisaged.

Objective. To discover the existing data that organizations have to quantitatively calculate the costs and benefits of adopting an RA in an organization.

Method. Exploratory surveys with personalized questionnaires applied to relevant stakeholders (e.g., manager, architect, developer) to find out the quantitative data that has been collected in RA projects and application projects.
3.2 Applying an economic model to calculate the ROI of adopting an RA
Context. Before deciding to launch an RA, organizations need to analyze whether or not to undertake the investment. Offering organizations an economic model that is based on data from former projects can help them make more informed decisions.

Objective. To assess whether it is worth investing in an RA.

Method. Depending on the maturity of the organization, two methodologies can be applied. If the organization does not have an RA, the economic model should be fed with estimated data. When the organization already has an RA, real data can be gathered by means of an exploratory quantitative post-mortem analysis. Then, the economic model quantifies the potential advantages and limitations of using an RA. Related work explains how to calculate the ROI of a product [11] and of software reuse [19]. We suggest using the economic model for RAs presented in [14].
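To make the idea concrete, the following is a minimal sketch of such an economic model; it is an illustrative simplification, not the model of [14], and all input figures are hypothetical placeholders rather than data from any real project:

```python
# Illustrative ROI sketch for RA adoption: compares the one-off effort of
# building the RA against the effort saved across the applications derived
# from it. All parameter names and figures are hypothetical examples.

def ra_roi(ra_cost_hours: float,
           app_cost_hours: float,
           savings_ratio: float,
           n_apps: int) -> float:
    """Return the ROI of adopting an RA.

    ra_cost_hours   -- effort to build and maintain the RA
    app_cost_hours  -- average effort per application without an RA
    savings_ratio   -- fraction of per-application effort saved by reuse
    n_apps          -- number of applications derived from the RA
    """
    benefit = n_apps * app_cost_hours * savings_ratio
    return (benefit - ra_cost_hours) / ra_cost_hours

# Break-even is reached when n_apps * app_cost_hours * savings_ratio
# equals ra_cost_hours; beyond that point the ROI turns positive.
roi = ra_roi(ra_cost_hours=4000, app_cost_hours=2000,
             savings_ratio=0.25, n_apps=10)
print(f"ROI: {roi:.2f}")  # (10*2000*0.25 - 4000) / 4000 = 0.25
```

The sketch makes explicit why the number of derived applications dominates the decision: the RA investment is amortized over every SA project that reuses it.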
3.3 Surveys to understand the impact of using an RA
Context. To refine the set of review criteria for RAs, it is necessary to understand an RA's characteristics, as well as its potential benefits and limitations. Assessing previous RA projects is a feasible way to start gaining such an understanding.

Objective. To understand the impact and suitability of an RM for the elaboration of RAs, and of an RA for the creation of SAs. Improvement insights can also be identified from different stakeholders.

Method. Exploratory surveys with personalized questionnaires applied to relevant stakeholders (e.g., architects, developers) to gather their perceptions and needs.
3.4 Applying an architectural evaluation method to prove RA effectiveness
Context. Architecture is the product of the early design phase [7]. RA evaluation is a way to find potential problems before implementing RA modules, and to gain confidence in the RA design provided to SA projects.

Objective. To analyze the RA's strengths and weaknesses and to determine which improvements should be incorporated into the RA.

Method. To apply an existing evaluation method specific for RAs, such as [1][12-14]. The selection of the method would depend on the organization's context [5].
4 Use of the framework in an IT consulting firm
4.1 Context of Information Technology Consulting Firms
Motivation. We are interested in the case in which an IT consulting firm has designed an RA with the purpose of deriving concrete SAs for each application of a client organization. This usually happens when the IT consulting firm is regularly contracted to create or maintain information systems in client organizations. Each information system is built upon the RA and includes many enterprise applications (see Fig. 3).
An RA can be designed with an intended scope of a single organization or of multiple organizations that share a certain property. Although Fig. 3 shows RAs that are used for the design of SAs in a single organization, there also exist RAs for multiple organizations that share a market or technological domain, such as web applications [2].
The use of RAs allows IT consulting firms to reuse the architectural knowledge of their RM, as well as software components (normally associated with particular technologies), for the design of SAs in client organizations. Thus, RAs inherit best practices from previous successful experiences and a certain level of quality. These RAs provide a baseline that facilitates standardization and interoperability, as well as the attainment of business goals, during the development and maintenance of enterprise applications.
Types of projects. There are three types of projects with different targets (Fig. 3): 1) RM projects; 2) RA projects; and 3) SA projects.
Stakeholders for RA analysis. Stakeholders need to be clearly defined for RA assessment purposes [1]. The people involved in an RA assessment are the evaluation team, which conducts the empirical studies of the framework, and the stakeholders of the architectural projects. For the three types of projects defined above, as performed by IT consulting firms, we consider the following five stakeholders essential for RA assessment: project business manager, project technological manager, software architect, developer, and application builder. Each of these stakeholders has a vested interest in different architectural aspects, which are important to analyze and reason about the appropriateness and the quality of the three types of projects [12]. However, there could be more people involved in an architectural evaluation, as Clements et al. indicate in [7]. As a consequence, although this context is generic for IT consulting firms, the projects' stakeholders may vary between firms. Below, we describe to which type of project the essential stakeholders belong and what their interests are.
RM projects. The RM team is composed of software architects from the IT consulting firm who worked in previous successful RA projects. They are specialized in architectural knowledge management. Their goal is to gather the best practices from previous RA projects' experiences in order to design and/or improve the corporate RM.
RA projects. RA projects involve people from the IT consulting firm and possibly from the client organization. Their members (project technological managers, software architects and architecture developers) are specialized in architectural design and have moderate knowledge of the organization's business domain.
Project technological managers from the IT consulting firm are responsible for meeting the schedule and for interfacing with the project business managers of the client organization.
Software architects (also called RA managers) usually come from the IT consulting firm, although the client organization may have software architects on whom the organization's managers rely. In the latter case, software architects from both sides work cooperatively to figure out a solution that accomplishes the desired quality attributes and architecturally significant requirements.
Architecture developers come from the IT consulting firm and are responsible for coding, maintaining, integrating, testing and documenting RA software components.
SA projects. Enterprise application projects can involve people from the client organization and/or subcontracted IT consulting firms (which may even be different from the RM owner), whose members are usually very familiar with the specific organization domain. The participation of the client organization in RA and SA projects is one possible strategy for ensuring the continuity of its information systems without much dependency on the subcontracted IT consulting firm.
Project business managers (i.e., customers) come from client organizations. They have the power to speak authoritatively for the project and to manage resources. Their aim is to provide their organization with useful applications that meet market expectations on time.
Application builders take the RA reusable components and instantiate them to build an application.
[Fig. 3 depicts the three types of projects and their stakeholders. An RM team (software architects from previous RA projects at the IT consulting firm) designs and improves the Reference Model (RM). The RM serves as a reference for one or more Reference Architectures (RAs), each designed by a software architect and a project technological manager and developed by architecture developers; the RA team comes from the IT consulting firm and maybe from the client organization. Each RA in turn provides a baseline for the Concrete Software Architectures (SAs) of many applications, which are managed by project business managers and built by application builders; concrete SA teams come from IT consulting firms and/or the client organization, and there can be many teams for several applications.]

Fig. 3. Relevant stakeholders for RA analysis.
4.2 Instantiation of the Framework
The presented empirical framework is currently being applied at everis. The main motivation of everis for conducting the empirical studies is twofold: 1) strategic: providing quantitative evidence to their clients about the potential benefits of applying an RA; 2) technical: identifying strengths and weaknesses of an RA.
As an IT consulting firm, everis fits into the context described in Section 4.1 (e.g., it carries out the three types of projects described there). Following the criteria found in [1], the RAs created by everis can be seen as Practice RAs, since they are defined from the accumulation of practical knowledge (the architectural knowledge of their corporate RM). According to the classification of [2], they are also classical, facilitation architectures designed to be implemented in a single organization: classical because their creation is based on experiences, and facilitation because their aim is to provide guidelines for the design of systems, specifically in the information system domain.
All the studies suggested in Section 3 are planned to be conducted to understand and evaluate the RAs defined by everis. In this paper, we present the protocol and preliminary results of the two surveys of the understanding step. Section 5 describes the value-driven data available in projects, which is needed in order to create or choose an economic model to calculate the ROI of adopting an RA. Section 6 presents an excerpt of the survey protocol, which has already been designed and reviewed. The survey is still in the analysis step; however, the data about Aspect 4 (business qualities) has already been processed. Preliminary results about this aspect show the benefits of, and the aspects to consider in, everis' RA projects, and we think that such aspects might be similar in other IT consulting firms that adopt RAs in client organizations.
Table 2 shows how the roles are covered by the different studies in the everis case.
Table 2. Stakeholders of the everis case.
Project(a) | Business Manager | Technical Manager | Software Architect | Architecture Developer | Application Builder
RM | n/a | n/a | S1, ROI, S2 | n/a | n/a
RA | S2, Eva | S1, S2, Eva | ROI, S2, Eva | S2, Eva | S2, Eva
SA | n/a | n/a | n/a | n/a | S1, S2

(a) Legend: Survey to study existing data (S1), Economic model for RA's ROI (ROI), Survey to understand RA projects (S2), and RA evaluation (Eva).
5 Survey to check existing value-driven data in projects
5.1 Protocol
Objectives of this study. The objective of this survey is to identify the quantitative information that can be retrieved from past projects in order to perform a cost-benefit analysis. The cost-benefit analysis, which belongs to the evaluation step of the framework, needs this kind of data to calculate the ROI of adopting an RA in an organization.

Sampling. A sample of 5 everis RA projects and 5 SA projects built upon their RAs has been selected.
Approach for data collection. The main perceived economic benefit of using RAs is the cost savings in the development and maintenance of systems, due to the reuse of software elements and the adoption of best practices of software development that increase the productivity of developers [16]. We use online questionnaires to ask project technical managers and application builders about the information existing in past projects for calculating these cost savings. When the client organization has no experience with RAs, these data need to be estimated, which could be error-prone.
5.2 Preliminary results: costs and benefits metrics for RAs
In this section we describe the information that was available to calculate the costs and benefits of adopting an RA. We divide the existing information into two categories: effort metrics and software metrics. On the one hand, the effort invested in the tracked activities allows the calculation of the costs of the project. On the other hand, software metrics help to analyze the benefits that can be found in the source code.

Effort metrics to calculate projects' costs. In RA projects, 4 out of 5 client organizations tracked development effort, while maintenance effort was tracked in all 5. In SA projects, 4 client organizations tracked development and maintenance effort.
The development effort is the total number of hours invested in the development of the RA and of the SAs of applications. It can be extracted from the time spent on each development activity of the projects. The maintenance effort is the total number of hours invested in the maintenance of the RA and of the SAs of applications. Maintenance activities include changes, incidents, support and consultations.

Software metrics to calculate benefits in reuse and maintainability. Code from RA and SA projects was obviously available in all projects. However, due to confidentiality issues with client organizations, access to source code is not always allowed.
The analysis of the code from RA and SA projects allows quantifying the size of these projects in terms of LOC or function points (number of methods). Having calculated the project costs as indicated above, we can compute the average cost of a LOC or a function point. Since the cost of developing and maintaining applications is lower because of the reuse of RA modules, we can calculate the benefits of an RA by estimating the benefits of reusing them. Poulin defines a model for measuring the benefits of software reuse [19]. Maintenance savings due to a modular design can be calculated with design structure matrices [15]. For a detailed explanation of how such metrics can be used in a cost-benefit analysis, the reader is referred to [16].
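The calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the full model of [19]: the cost-per-LOC derivation follows the text, while the relative cost of reuse (RCR) of 0.2 is the default value commonly cited for Poulin-style models, not a figure measured at everis, and all numeric inputs are hypothetical:

```python
# Sketch of a Poulin-style reuse-savings estimate. cost_per_loc is
# derived from the tracked effort and project size described in the text;
# rcr = 0.2 (relative cost of reuse) is a commonly cited default, not a
# value measured in the surveyed projects. All figures are hypothetical.

def cost_per_loc(effort_hours: float, hourly_rate: float, loc: int) -> float:
    """Average cost of one line of code in a project."""
    return effort_hours * hourly_rate / loc

def reuse_savings(reused_loc: int, unit_cost: float, rcr: float = 0.2) -> float:
    """Development cost avoided by reusing RA modules in an SA project:
    the reused LOC did not have to be written from scratch, only
    integrated, which costs an rcr fraction of new development."""
    return reused_loc * unit_cost * (1.0 - rcr)

# 1500 h at 40/h over 60,000 LOC gives a unit cost of 1.0 per LOC.
unit = cost_per_loc(effort_hours=1500, hourly_rate=40.0, loc=60000)
print(f"savings: {reuse_savings(20000, unit):.0f}")  # 20000 * 1.0 * 0.8 = 16000
```

Plugging such per-project savings, together with the RA investment, into a cost-benefit analysis like the one in [16] yields the ROI figures discussed in Section 3.2.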
5.3 Lessons learned
Architecture improvements are extremely difficult to evaluate in an analytic and quantitative fashion, unlike the efficacy of the business (e.g., sales) [6]. This is because software development is a naturally low-validity environment, and reliable expert intuition can only be acquired in a high-validity environment [9]. In order to evaluate RAs based on an economics-driven approach, software development needs to move towards a high-validity environment. The good news is that this can be done with the help of good practices like time tracking, continuous feedback, test-driven development and continuous integration. To obtain the metrics defined in Section 5.2, tools such as JIRA1 and Redmine2 support managing tasks and the time invested in them, while general software metrics (like LOC) and the percentages of test and rule compliance can be calculated by Sonar3 and Jenkins4. We think that adopting good practices to collect data is the basis for moving software development towards a high-validity environment and, consequently, for being able to perform an accurate cost-benefit analysis.
6 Survey to understand the impact of using an RA
6.1 Protocol
Objectives of this survey. The purpose of the survey is to understand the impact of using RAs for designing the SAs of the applications of an information system of a client organization. This is a descriptive survey that measures what occurred while using RAs rather than why. The following research questions are important in order to review relevant Aspects 1 to 6 of RAs (defined in Section 2):
1. How is an RA adapted for creating the SAs of an organization's applications?
2. What is the state of practice on requirements engineering for RAs?
3. What is the state of practice on architectural design for RAs?
4. How does the adoption of RAs provide observable benefits to the different involved actors?
5. What methodologies are currently being used in RA projects by everis?
6. Which tools and technologies are currently being used in RA projects by everis?
Sampling. The target populations of this survey are the RA projects and SA projects executed by everis. A sample of 9 representative everis projects in client organizations was selected. All these projects were from Europe (seven from Spain).

Approach for data collection. On the one hand, semi-structured interviews are used for project technological managers, software architects, and clients' project business managers. The reason for using interviews is that these roles have deeper knowledge than the other roles about the architectural aspects of Table 1 (or another perspective, in the case of clients' project business managers), so we want to collect as much information as possible from them. Prior to the interviews, questionnaires are delivered to collect personal information about the interviewee and to inform him/her about the interview. On the other hand, online questionnaires are used for RA developers and application builders, since most of their questions are about supportive technologies and their responses can be listed in advance, simplifying the data collection process.
This is an excerpt of the survey protocol. The complete version of the protocol is available at http://www.essi.upc.edu/~gessi/papers/eselaw13-survey-protocol.pdf.
1 JIRA, http://www.atlassian.com/es/software/jira/overview
2 Redmine, http://www.redmine.org/
3 Sonar, http://www.sonarsource.org/
4 Jenkins, http://jenkins-ci.org/
6.2 Preliminary results: strengths and weaknesses of RAs
In this section we present preliminary results for the business quality section of the survey, which answer the fourth research question of the protocol: "How does the adoption of RAs provide observable benefits to the different involved actors?" Below, the resulting benefits and aspects to consider are reported, followed by further detail. The benefits in everis' RA projects are:
- 4 out of 9 projects mentioned "increased quality of the enterprise applications".
  - An RA helps to accomplish business needs by improving key quality attributes.
  - An RA helps to improve the business processes of an organization.
  - An RA reuses architectural knowledge from previous successful experiences.
- 7 out of 9 projects stated "reduction of the development time and faster delivery of applications".
  - An RA allows application development to start from the first day, by following architectural decisions already taken.
  - An RA decreases the development time of applications, since the RA modules that implement needed functionality are reused in the applications.
- 7 out of 9 projects mentioned "increased productivity of application builders".
  - An RA provides materials and tools for the development, testing and documentation of applications, and for training application builders.
  - An RA generates or automates the creation of code in the applications.
  - An RA indicates the guidelines to be followed by the application builders.
  - An RA reduces the complexity of application development because part of the functionality is already resolved in the RA.
  - An RA facilitates the configuration of its modules and their integration with legacy or external systems.
- 6 out of 9 projects stated "cost savings in the maintenance of the applications".
  - An RA increases control over applications through their homogeneity.
  - An RA maintains services reused by all applications only once.
  - An RA allows adding or changing functionalities by means of a modular design.
  - An RA establishes long-term support standards and "de facto" technologies.
Aspects to consider that could eventually become risks in everis' RA projects are:
- 5 out of 9 projects considered the "additional learning curve". An RA implies additional training for its own tools and modules, even if its technologies are standard or "de facto" ones already known by the application builders.
- 3 out of 9 projects stated "dependency on the RA". Applications depend on the reused modules of the RA. If it is necessary to change a reused module of the RA or to add a new functionality, application builders have to wait for the RA developers to include it in the RA for all the applications.
- 2 out of 9 projects considered the "limited flexibility of the applications". The use of an RA implies following its guidelines during application development and adopting its architectural design. If business needs require a different type of application, the RA would limit the flexibility of that application.
CIbSE 2013 | X Workshop Latinoamericano Ingeniería de Software Experimental | ESELAW 2013
101
6.3 Lessons learned
During the pilot of the survey, we learnt the following lessons about its design:
The same term can have slightly different meanings in academia and in industry (for instance, the term "enterprise architecture" is sometimes used in industry to mean "software reference architecture for a single organization").
Questions that deal with several variables confuse the interviewee and make the analysis more difficult. It is better to split them so that each question covers only one variable.
If a survey targets several stakeholders, their questionnaires should be designed taking into account their knowledge of, and interest in, architectural concerns.
In online questionnaires, it is advisable to provide a field where the interviewee can write comments or clarifications, and to include an "n/a" option where necessary. A "previous" button is also useful for making changes to prior questions.
Contacting stakeholders from client organizations was harder than contacting interviewees from the IT consulting firm. This is mainly because it was the IT consulting firm who requested the study, so they had a clear interest in it.
7 Conclusions and Future Work
Conducting empirical studies is becoming one of the main channels of communication between practitioners and academia. The intended main contribution of this ongoing work is the formulation of a framework for conducting empirical studies that support decision making and assessment related to RAs. It consists of a list of aspects relevant to RA assessment, and an assortment of four complementary empirical studies that allow these aspects to be understood and evaluated.
It is a practical framework that can be adapted to the specific context of software companies and IT consulting firms. Consequently, organizations that apply the framework could benefit from a common reference framework to review RAs.
The framework is being applied in everis. This allows us to get feedback for assessing its effectiveness and to gather industrial evidence. Preliminary results of this application indicate the importance of good practices such as time tracking, continuous feedback, test-driven development and continuous integration in order to quantitatively evaluate RAs. Another result is that the adoption of an RA could bring, as its main benefits, cost savings in the development and maintenance of applications.
Future work spreads in two directions. In terms of validation, we are also conducting the evaluation step of the framework in everis. With respect to this first version of the framework, we aim to extend it considering Wohlin's improvement step in order to build preliminary guidelines for improving RAs in IT consulting firms.
Acknowledgements. This work has been supported by "Cátedra everis" and the Spanish project TIN2010-19130-C02-00. We would also like to thank all participants of the surveys for their kind cooperation.
References
1. Angelov, S., Trienekens, J., Grefen, P.: Towards a method for the evaluation of reference architectures: Experiences from a case. In: Morrison, R., Balasubramaniam, D., Falkner, K. (eds.) Software Architecture, LNCS, vol. 5292, pp. 225-240 (2008).
2. Angelov, S., Grefen, P., Greefhorst, D.: A framework for analysis and design of software reference architectures. Information and Software Technology 54(4), 417-431 (2012).
3. Bass, L., Clements, P., Kazman, R.: Software Architecture in Practice. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2nd edn. (2003).
4. Bass, L., Clements, P., Kazman, R., Klein, M.: Evaluating the software architecture competence of organizations. In: WICSA, pp. 249-252. IEEE (2008).
5. Bass, L., Nord, R.L.: Understanding the context of architecture evaluation methods. In: WICSA, pp. 277-281. IEEE (2012).
6. Carriere, J., Kazman, R., Ozkaya, I.: A cost-benefit framework for making architectural decisions in a business context. In: Software Engineering, vol. 2, pp. 149-157 (2010).
7. Clements, P., Kazman, R., Klein, M.: Evaluating Software Architectures: Methods and Case Studies. Addison-Wesley Professional (2001).
8. Cloutier, R., Muller, G., Verma, D., Nilchiani, R., Hole, E., Bone, M.: The concept of reference architectures. Systems Engineering 13(1), 14-27 (2010).
9. Erdogmus, H., Favaro, J.: The Value Proposition for Agility: A Dual Perspective (2012). Available at: http://www.infoq.com/presentations/Agility-Value
10. Falessi, D., Babar, M.A., Cantone, G., Kruchten, P.: Applying empirical software engineering to software architecture: challenges and lessons learned. Empirical Software Engineering 15(3), 250-276 (2010).
11. Forrester Research: Forrester's Total Economic Impact (TEI). Available online at: www.forrester.com/TEI, last access: January 2012.
12. Gallagher, B.P.: Using the Architecture Tradeoff Analysis Method (SM) to Evaluate a Reference Architecture: A Case Study. DTIC Document (2000).
13. Galster, M., Avgeriou, P.: Empirically-grounded reference architectures: a proposal. In: Proceedings of QoSA and ISARCS, pp. 153-158 (2011).
14. Graaf, B., van Dijk, H., van Deursen, A.: Evaluating an embedded software reference architecture. In: CSMR, pp. 354-363 (2005).
15. MacCormack, A., Rusnak, J., Baldwin, C.: Exploring the duality between product and organizational architectures. Harvard Business School Technology Paper, pp. 8-39 (2011).
16. Martínez-Fernández, S., Ayala, C., Franch, X.: A Reuse-Based Economic Model for Software Reference Architectures. Technical Report ESSI-TR-12-6 (2012).
17. Nakagawa, E., Oliveira, P., Becker, M.: Reference architecture and product line architecture: A subtle but critical difference. In: ECSA, pp. 207-211 (2011).
18. Nakagawa, E.: Reference architectures and variability: current status and future perspectives. In: Proceedings of WICSA/ECSA 2012, pp. 159-162. ACM (2012).
19. Poulin, J.: Measuring Software Reuse: Principles, Practices, and Economic Models. Addison-Wesley (1997).
20. Runeson, P., Höst, M.: Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering 14(2), 131-164 (2009).
21. Wohlin, C., Höst, M., Henningsson, K.: Empirical Research Methods in Software Engineering. ESERNET (2003).
Beliefs Underlying Teams' Intention and Practice: An Application of the Theory of Planned Behavior
Carol Passos 1, Daniela S. Cruzes2,3, Manoel Mendonça4
1 Department of Computer Science (DCC), UFBA, Salvador-BA, Brazil [email protected]
2 Department of Computer and Information Science (IDI), NTNU, Trondheim, Norway [email protected]
3 SINTEF, NO-7465 Trondheim, Norway 4 Fraunhofer Project Center at UFBA, Salvador-BA, Brazil
Abstract. Many theories aim to understand the beliefs underlying an intention or behavior. These theories are currently used to seek answers about how people progress from intention to practice in business environments. The Theory of Planned Behavior (TPB) is today one of the most popular socio-psychological models for the prediction of behavior. It is believed that people consider the implications of their actions and act based on a reasonable assessment of those implications. In this context, a belief can be defined as the psychological state in which an individual holds a proposition or premise to be true. So, behavior is driven by what is believed, by what is culturally assumed to be true about the world. Our work aims to study and characterize a belief system by applying TPB in project teams, in terms of organizational- and team-level factors associated with beliefs about software development practices. A set of interviews on the origins, sources and impacts of beliefs on software practices was conducted with professionals from different project teams and companies. The results point to a strong influence of past experiences and show that it is possible to characterize belief systems in software project contexts from a behavioral perspective.
Keywords: belief system, attitude, behavior, planned behavior, perceived behavioral control, software development practice.
1 Introduction
A project team's belief system is the foundation of software engineering practice adoption [31]. Most practitioners rely on beliefs to make their methodological choices. The definitive step towards introducing a new practice into an organizational culture is to capture and understand the beliefs and values of the project teams, which can guide behavior and important decisions. This requires a deep grasp of the organization's belief system and related factors. How do people come to believe that something is useful to them and reach the decision to use it in their
organization or particular project? This is a very interesting subject to be studied and understood by empirical software engineering researchers.
Beliefs can be defined as conceptions, personal ideologies, and perceptions of the world that shape practice and orient knowledge. The concept of belief implies the existence of a mental state with intentionality, interacting with goals and influencing ordinary actions [1]. These beliefs are built over a set of interactions, relationships, processes and activities of the group. Beliefs exist in the form of expectancy-rules and these rules are tested for a given situation. In this context, actions are driven by what is believed, by what is culturally assumed to be true about the world [13].
Declared and acknowledged project team's beliefs are often consistent with what is found in observations of software practices; and even the unannounced ones can influence specific aspects of the software process. Research has documented that practitioners' beliefs related to work processes have a significant impact on their behavior and that this influence on practices manifests itself in interesting ways [1][10][13][31]. However, most of the existing studies have not directly addressed or characterized the belief system in a software project context within a behavioral perspective.
There is a common understanding that beliefs and behavior are related. In order to determine what practices an individual is likely to perform at any given time, it is necessary to understand the set of beliefs underlying the individual's behavioral intention in a given circumstance. In the Software Engineering (SE) area, this lack of understanding may be due, in part, to a failure to investigate, understand and document the nature and effect of the belief system underlying current SE theory and practice.
Many theories have been developed to explain health-related intention and behavior and to seek answers for business administrators about how people progress from intention to behavior. The Theory of Planned Behavior (TPB) [2][3] is today one of the most popular social-psychological models for the prediction of behavior. The TPB provides a suitable framework for conceptualizing human behavior in organizational contexts, because it was developed specifically to account for behavior for which actual and perceived control may be low (spontaneous behavior).
We have run a case study [21][22] to characterize a belief system by applying the Theory of Reasoned Action (TRA) [4][7][11] in agile software project teams, in terms of the origins, sources and impacts of beliefs on self-management development practices [9][15][20]. The study addressed the influence factors associated with team beliefs, their attitudes toward behavior, the organizational culture and subjective norms to predict behavior intention, and also documented the inconsistencies between declared beliefs and real practice in agile software projects. With this study, we showed that it is possible and interesting to capture and represent a belief system in a project context, and that there is a strong influence of past experiences and organizational culture on the self-management practices of agile teams in the software industry.
The TPB is a generalization of the TRA that is suited to predicting behaviors that are not entirely under an individual's volitional control. It fits well in the context of SE research within the software industry and provides a suitable theoretical framework for mapping software project teams' behavior. After a literature review and an evaluation of the research objectives and questions, we decided to apply TPB to study team behavior in
software engineering and opted for an interview-based qualitative data collection approach.
The aim of this paper is to present a study to characterize a belief system by applying TPB in project teams, in terms of the origins, sources and impacts of beliefs on software development practices. In order to reach this goal, we performed a set of interviews and collected information about the project contexts. A conceptual framework was built, based on the TPB model, to focus and bound the collection of data. The study addresses the influence factors associated with team beliefs, their attitudes toward behavior, the organizational culture and normative pressure to predict behavior intention, and also documents the inconsistencies between declared beliefs and real practice in software projects. We relate our findings to relevant behavioral literature [2][3][4][5][6][11][12][26] in order to contribute to an improved understanding of how to apply this kind of theory to better study SE practices.
The rest of the paper is organized as follows: Section 2 gives an overview of the background. Section 3 presents the research methodology and describes how we conducted the study. Section 4 presents the results of applying Theory of Planned Behavior in the context of software development practices. Section 5 discusses the implications and limitations. Conclusions and opportunities for further work are presented in Section 6.
2 Background
2.1 The Theory of Planned Behavior
Fishbein and Ajzen [11] gave us a robust definition of behavior, avoiding the confusion and ambiguity of some past theories. Their research resulted in a solid body of work enabling a more uniform study of these terms [14]. They sought out a way not only to predict behavior, but also to understand its relationship with beliefs and their strength.
In more recent publications, Ajzen [2][3] has extended the Theory of Reasoned Action (TRA) to the Theory of Planned Behavior (TPB) by including a measure of perceived behavioral control, which it is argued will increase the prediction of intention and behavior in those instances where the behaviors are not entirely under the control of the individual or group.
Although the TRA model can predict the probable behavior, it may not predict the actual behavior, because people do not always do what they intend to do and there may be other factors that will cause them to go against their initial intention.
When used to explain behavior that is not fully a conscious choice or decision, the TPB is expected to perform better than the TRA. According to the TPB, illustrated in Fig. 1, human behavior is guided by three kinds of considerations: (i) beliefs about the likely outcomes of the behavior, whose evaluations produce a favorable or unfavorable attitude toward the behavior; (ii) beliefs about the normative expectations of referents, i.e., subjective norms; and (iii) beliefs about the presence of factors that may facilitate or inhibit performance of the behavior, whose perceived power gives rise to perceived behavioral control.
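These three components are commonly formalized in the TPB literature as expectancy-value sums (attitude as the sum of belief-strength times outcome-evaluation products, and analogously for subjective norm and perceived behavioral control), with intention a weighted combination of the three. The sketch below illustrates that standard formulation; the regression weights and belief scores are invented for illustration and are not estimates from this study.

```python
# Illustrative expectancy-value formulation of the TPB. The weights and
# belief scores below are hypothetical, for illustration only.

def expectancy_value(beliefs):
    """Sum of (belief strength * evaluation) pairs, e.g. attitude = sum(b_i * e_i)."""
    return sum(strength * evaluation for strength, evaluation in beliefs)

def behavioral_intention(attitude_beliefs, normative_beliefs, control_beliefs,
                         w_att=0.4, w_sn=0.3, w_pbc=0.3):
    """Intention as a weighted combination of the three TPB determinants."""
    attitude = expectancy_value(attitude_beliefs)          # b_i * e_i
    subjective_norm = expectancy_value(normative_beliefs)  # n_j * m_j
    pbc = expectancy_value(control_beliefs)                # c_k * p_k
    return w_att * attitude + w_sn * subjective_norm + w_pbc * pbc

# Hypothetical team beliefs about adopting an agile practice (-3..+3 scales):
intention = behavioral_intention(
    attitude_beliefs=[(3, 2), (2, 3)],   # e.g. "it improves productivity"
    normative_beliefs=[(2, 2)],          # e.g. "management expects adoption"
    control_beliefs=[(1, 3), (-1, 2)],   # e.g. training available / deadline pressure
)
```

Note that the relative weights are an empirical matter, estimated per behavior and population; the point of the sketch is only how the three kinds of beliefs combine into an intention score.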
The evolution of the TPB and its research is marked by a debate about the meaning of the third variable [32]. In its current dual-aspect conceptualization, perceived behavioral control is determined by two factors: perceived autonomy and confidence about how easy or difficult a behavior is. These factors can be both internal (knowledge, skill, willpower) and external (time, money, resources, cooperation of others). From this point of view, people will believe that they can carry out their intentions when they believe that they have the resources and opportunities to perform the behavior, and when they believe that they can freely make the decision to use those resources and opportunities.
Fig. 1. TPB Model.
We believe that TPB can be helpful in generating rich and detailed accounts of software project teams, the interactions between their members, and, especially, the actions oriented toward certain software practices. The TPB model appears to provide a better conceptual framework for dealing with the complexities of organizational contexts and to understand the influence of a belief system on team's practices. However, the application of a theoretical approach to SE can be challenging. The problems and objects of study in SE require approaches suited to their dynamics and contexts.
TPB has been widely used to predict and explain health-related intention and behavior. Attitude toward the behavior (whether the behavior is seen as good-bad or pleasant-unpleasant) and subjective norms (perceived social pressure from relevant others) were found to be significant direct predictors of intention and participation in many studies [8][19][23][24][25][29][30]. Facilitating and hampering conditions were also found to have significant effects in the prediction of participation [16][18][27][28][32].
2.2 Research Conceptual Framework
The conceptual framework devised for this research was derived from an adaptation of the TPB model and is shown in Fig. 2. In this framework, the values and beliefs of a project team, together with its attitude toward behavior, represent the strength of beliefs. This motivates people toward a behavior intention, given that attitude is a predisposition to act in a positive or negative way toward an object. Another factor,
represented by subjective norms added to organizational culture, can also impact the team's behavior intention and, consequently, the team's practices. Therefore, the behavioral intention is affected by what others think and by the strength of their opinion within the organization in context [4]. Lastly, perceived behavioral control denotes people's perception of the degree to which they are capable of, or have control over, performing a given behavior [12]. Believing that they can perform a practice, because they have the capacity and autonomy to do so, motivates project teams to try to perform the respective behavior and increases the likelihood that they will expend effort and persevere in their attempts [32].
Fig. 2. Conceptual Framework.
TPB fits the context of our research because it allows us to study the way beliefs, attitudes and a sense of self-efficacy are formed, and their relationship to behavior and practice, with room to explore other relevant aspects. Moreover, TPB has been widely used to predict and explain health-related intention and behavior, and these results were successfully replicated [18][28][32].
3 Methodology
We have been running a long-term case study, involving software development projects, which has gone through a main cycle of 18 months. The study is now undergoing another cycle, lasting approximately eight months, with the aim of characterizing project teams' belief systems in three organizations in Brazil (see Table 1) and investigating their origins, sources, and impacts on the teams' software development practices. In particular, we have been focusing on the social behavior and relationships that arise as an intrinsic part of adopting new software practices. To do this, we have been applying an interview approach, asking insightful questions, drawing maps of the project contexts, and collecting some artifacts.
All three companies studied provide software development and evolution services for customers in both the public and private sectors. The interviews were performed according to [17] in all companies. So far, the findings seem to confirm the influence of teams' belief systems on software practices, and methodological lessons learned
from the first cycle of this research [22] were used to define the next cycle. Based on the results obtained in the first cycle and reported in [21], new research questions (RQ) arose to direct the next phase, as follows:
RQ1: How do beliefs and attitude influence team practices in software organizations?
RQ2: How do organizational culture and subjective norms influence software team behavior and practices?
RQ3: How do team autonomy and confidence impact software practices?
A particular stage of the research is explored in this paper. It involved nine interviews across the three companies during the second cycle of the study. This second cycle included the application of the behavior theories to map and analyze the relevant experiences of the software team members and uncover the beliefs that could hinder or benefit the adoption of new software practices. Using TPB as a guide, we could also focus on how the influence of organizational culture, subjective norms, team confidence and perceived autonomy, embedded in the software development context, might help to predict behavior intention.
3.1 Data Collection and Analysis
After a literature review designed to address the key references on the related behavior theories, we prepared an initial version of the interview questions to identify the issues we intended to investigate. The purpose of this round was to cast light on the respondents' past experiences, the beliefs that emerged or evolved from these experiences, the impacts of new agile practices on projects, and the unexpected effects of known and new methods or techniques. Guided by the TPB model, we asked them to retell and relive specific and directed stories that illustrated the beliefs and attitudes we were trying to capture.
Table 1. Companies under Study.
Company   Age        Personnel   Software Process Certification
1         11 years   800         ISO-9001 and CMMI Level 2
2         03 years   15          ---
3         19 years   42,000      ISO-9001 and CMMI Level 3
To define the interview questions, we opted to keep our interview-based qualitative
approach based on the War Story technique [17]. War Story questionnaires usually have warm-up, past experience, lessons learned, and reaction questions. We used a few of each type in our interview questionnaire (see Appendix A and [21] for more details). Using the conceptual framework (Fig. 2) as a guide, we were able to capture and classify participants' beliefs, related impacts and describe their attitude toward practices. We uncovered the living experiences affected by the organizational culture and subjective norms. We also identified evidence of the real impacts of participants' beliefs on team practice and the influence of team confidence and level of autonomy on the adoption of new practices in software teams.
The data analysis was performed in a qualitative manner to cross-reference beliefs and their impacts on teams' behavior and practices. The transcription and coding were performed manually and validated with other researchers before analysis. All interview transcripts were categorized, tabulated, and analyzed via cycles of pattern coding. The transcription of the nine interviews produced 44 pages of text and an average of seven codes per transcription. In total, we built almost 50 code patterns, focusing on the relevant actions, interactions and events in past and current projects that might exert influence on the teams' behavior and practice.
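A tallying step like this pattern-coding pass can be sketched as a simple frequency count of codes across transcripts. The transcripts and code labels below are invented for illustration; in the study itself, coding was done manually and validated with other researchers.

```python
# Illustrative sketch of tallying pattern codes across interview transcripts.
# Transcript contents and code labels are hypothetical.
from collections import Counter

# Each transcript maps to the list of codes assigned to its segments.
coded_transcripts = {
    "interview_1": ["past_experience", "org_culture", "deadline_pressure"],
    "interview_2": ["past_experience", "team_autonomy"],
    "interview_3": ["org_culture", "past_experience", "team_confidence"],
}

# Frequency of each code across all transcripts, to surface salient beliefs.
code_frequency = Counter(code for codes in coded_transcripts.values()
                         for code in codes)

# Average number of codes per transcription (the study reports about seven).
avg_codes = sum(len(c) for c in coded_transcripts.values()) / len(coded_transcripts)
```

Sorting `code_frequency` by count is what lets the most frequently mentioned beliefs (such as "past experience" here) stand out before they are grouped into belief classes.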
Table 2. Participants Profile.
Software Practitioners
Company   Project   Roles                        Experience
1         STF       Development Center Manager   > 10 years
1         STF       Coding Leader                > 03 years
1         STF       Technical Leader             > 03 years
2         FD        Scrum Master                 > 03 years
2         FD        Developer                    02 years
2         FD        Developer                    01 year
3         SIG       Project Manager              > 03 years
3         SIGEP     Project Manager              > 03 years
3         DO        Quality Manager              01 year
After reducing the data to a limited number of belief classes, we characterized
them in terms of frequency, sources, context and associated impact. We identified a total of 10 beliefs with 32 related impacts. Each impact was grouped by similarity and type (negative or positive). We also recorded information about the profile of each study participant (see Table 2). This information served as context to better understand each participant's point of view regarding the beliefs found, and also to uncover aspects of organizational culture and subjective norms as well as the influence of team confidence and autonomy.
4 Results
In Table 3, we list the beliefs most frequently mentioned by the participants; their respective belief class, according to topics related to software team practice; the attitude in past or current experiences; the influence of the organizational culture; the evidence of perceived behavioral control; the observed impact of each belief; and, finally, the impact type, positive (+) or negative (-). The interviewees seem to share a common concern and interest in new approaches to project management. Another point is related to the Knowledge Management belief, which is common to three of the nine participants. The impacts of these beliefs on SE practices are positive, contributing to improve the implementation of key software team practices.
Table 3. Salient Beliefs.
Belief Class: Process
  Belief: Project management using the SCRUM methodology.
  Attitude: toward an agile software process
  Organizational Culture: IT market and competition
  Perceived Behavioral Control: high confidence
  Impact (+): better productivity; better quality; high team-level effectiveness

Belief Class: Task Estimation
  Belief: Project management using SE metrics supported by tools.
  Attitude: toward a precise task estimation practice
  Organizational Culture: motivated by fixed and early deadlines
  Perceived Behavioral Control: high autonomy
  Impact (+): better response time; better project monitoring

Belief Class: Project Management
  Belief: Bad project management increases the chance of failure in software projects.
  Attitude: toward task delivery
  Organizational Culture: high deadline pressure
  Perceived Behavioral Control: low autonomy, low confidence
  Impact (-): bad quality; high rework; bad scope and cost management

Belief Class: Knowledge Management
  Belief: Knowledge sharing practice through software documentation and planning increases the chance of success in software projects.
  Attitude: toward project information sharing
  Organizational Culture: CMMI certification program
  Perceived Behavioral Control: high confidence
  Impact (+): better quality; better productivity; better response time
4.1 Case 1
RQ1. In Case 1, the company is motivated by IT market demands to adopt an agile methodology, and the technical leaders declare explicit empathy with, and claim to be friendly to, this kind of methodology. At the beginning of the project, the STF project team showed some resistance to this adoption, but during the project the team accepted the new practices and started to work very well as a team. At first, they seemed to be afraid of taking on the bigger responsibility of being a self-managing team, since this approach presumes that the team has significant authority and responsibility for many aspects of its work, such as planning, scheduling, delegating, and making decisions.
Also, the company's development center manager has an attitude toward the use of an individual productivity rate as a metric for project monitoring. To him, a more precise task estimation has improved team autonomy in his projects, which led to better negotiations with customers and, consequently, higher team-level effectiveness.
Regarding requirements traceability control, Company 1 mandates the use of a poorly formatted and apparently useless requirements traceability artifact. The technical leaders believe that this practice reduces the effectiveness of configuration management and change management. However, it is part of the CMMI certification program, so it is required for every project in the company.
RQ2. Our study highlights the influence factors associated with organizational culture and subjective norms to predict behavior. The values and beliefs of a project team, when in agreement with the organizational culture, reinforce the strength of their belief that the behavior will lead to positive consequences, which will exert influence on the team's behavior intention.
In company 1, the CMMI certification culture influences the adoption of software engineering metrics for project monitoring. The company's development center manager believes that a better task estimation using individual productivity rate can lead to project success.
The STF project team believes that the adoption of CMMI practices increases the company's competitive edge and can stimulate the company's evolution, bringing more profitable and demanding projects. Nevertheless, some members of the project team reported that the CMMI organizational culture seems to frame software practices into a more traditional, waterfall-style development process, which is not so compatible with agile methodologies.
RQ3. The STF project team seems to believe that compliance with CMMI practices increases the chance of success in software projects. Knowledge sharing through software documentation and planning, driven by the CMMI certification program, also increases the chance of success in software projects. For this team, a defined development process and adequate planning can contribute to the achievement of project goals. So, the team's confidence level on this subject is substantially high.
The same applies to the adoption of agile methodologies for project management. The team has a high level of confidence and also autonomy to use the SCRUM1 model and its practices to conduct the STF project.
4.2 Case 2
RQ1. Our findings indicate that the attitude toward an object is based on how favorable the total set is, because one considers each belief about that object and its evaluation according to the project context. For example, most members of the FD project team agree that achieving knowledge sharing through a project tracking blog is a good practice. For them, this practice improves team productivity and integration, which, in turn, leads to team-level effectiveness.
In addition, the Scrum Master of the FD project is motivated to use an agile methodology to speed up the software process and deliver the software product. He believes that a good task estimation practice should be supported by appropriate tools, even though some team members are still resistant to this practice.
RQ2. The evidence suggests that a good fit between the organizational culture and subjective norms embedded in the software development context leads to an easier assimilation of a new practice or behavior.
Company 2 seems to be motivated by competition to adopt an agile methodology. Now it is part of its organizational culture and exerts huge influence on the behavior of the employees and how they act toward an agile practice.
1 SCRUM approach. http://www.scrum.org/
RQ3. The Scrum Master of the FD project is quite confident that the adoption of a task estimation practice supported by tools can help in the implementation of an agile software methodology. He believes that it is a big challenge to predict team productivity without a systematic process for scheduling and task estimation. Some team members declare explicit resistance to this practice but perform it anyway, which denotes a low level of team autonomy.
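The kind of tool-supported, systematic estimation the Scrum Master argues for can be illustrated with a minimal sketch (ours, not the study's; the function name and point values are hypothetical): a velocity-based forecast of how many sprints a backlog still needs.

```python
from math import ceil
from statistics import mean

def sprints_needed(backlog_points, past_velocities):
    """Forecast the remaining sprints from the mean historical velocity.

    Illustrative only: assumes velocity is stable across sprints.
    """
    velocity = mean(past_velocities)   # average story points delivered per sprint
    return ceil(backlog_points / velocity)

# A 120-point backlog with recent velocities of 28, 32 and 30 points
# per sprint yields a forecast of 4 remaining sprints.
print(sprints_needed(120, [28, 32, 30]))  # → 4
```

Even such a simple model makes productivity predictions inspectable and repeatable, which is precisely what an ad hoc estimation process lacks.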
4.3 Case 3
RQ1. In Company 3, the project managers overcame the challenge of dealing with geographically distributed teams by improving team communication practices: not only by adopting new communication tools, but mainly by using past experiences and lessons learned to improve the communication process.
The project managers are also motivated to use individual productivity control as part of a task estimation practice for a new change management process during project monitoring. They care about proposing new work processes that can become good practices for the whole company.
RQ2. The evidence also shows a significant connection between a team's belief system and the organizational culture. According to the participants, the company's senior managers foster a positive attitude toward the ISO 9001 processes and procedures because of the company's certification program. The CMMI certification program, in turn, seems to fit very well in the context of a development center operational model, but not so well in an agile project scenario.
RQ3. In the middle of the CMMI certification program, the project teams of Company 3 demonstrated low autonomy to adapt software practices to their needs. The evidence shows cases of mandatory production of complex, time-consuming, and useless reports, solely to comply with the CMMI model.
In addition, the project managers are not motivated to build a suitable project plan because, in most cases, they do not have full autonomy to conduct the project using the practices their project context requires.
5 Discussion
In this section, we summarize our results and answer the research questions. We discuss our findings in contrast to related work and present the limitations of this study.
Considering research question 1, the study indicates that the team members are concerned about the lack of a productivity metric for project monitoring, which influences their attitude toward task estimation practices. The team members' beliefs appear to come from personal hands-on experiences that did not work well in previous projects; these experiences were described in specific and direct stories reported by the participants during the interviews. It was thus possible to establish that all the participating companies are involved with agile methodologies and are introducing new agile practices into their software project routines. When applying the TPB model, we analyzed the relevant actions, interactions, and events in the team members' past, trying to uncover common or conflicting beliefs among them that could benefit or hinder the adoption of new software practices.
With respect to research question 2, the evidence suggests that when the CMMI culture embedded in the software project context already involves task estimation practices, the acceptance of such practices is easier. We confirmed that there is a significant connection between the organizational culture, the subjective norms around the project team, and its behavioral intention, as the TPB model indicates. Evidence to support this was found in the participants' statements about how essential organizational support and culture are to achieving team effectiveness. Our findings also indicated that when the values and beliefs of a project team are in agreement with the organizational culture, they exert a greater influence on the team's behavioral intention, because the organizational culture can reinforce the strength of these beliefs and the confidence that a particular behavior will lead to positive consequences.
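The TPB relationship invoked here is conventionally written, following Ajzen's formulation, as a weighted combination in which behavioral intention BI is driven by attitude A, subjective norm SN, and perceived behavioral control PBC, with empirically determined weights w_i:

```latex
BI = w_1 A + w_2 SN + w_3 PBC
```

The connection reported above corresponds to the SN term: an organizational culture aligned with the team's beliefs strengthens both the attitude and the subjective norm, and hence the intention to adopt the practice.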
The main findings related to research question 3 concern the low autonomy of the project teams in all three companies to adapt software practices to their project requirements and contexts. In some cases, this is due to the demands of CMMI certification and, in other cases, to unviable deadlines set by customers. This type of situation undermines the project team's confidence and reduces its effectiveness. In accordance with TPB, the perceived power of factors that may inhibit behavior performance will cause project teams to act against their initial intention. Thus, TPB proved to fit these cases better than TRA, since the behavior is not entirely under volitional control.
5.1 Limitations
To our knowledge, this study is one of the first initiatives to apply behavioral theories in the context of SE practices to guide research in software organizations. We acknowledge that we do not have a complete list of implications and answers to the research questions; further studies should therefore be performed to point to other possibilities of applying behavioral theories in SE contexts.
Construct validity concerns the design of a study and whether the studied scenario represents the real world. In our study, we tried to generalize the findings from empirical statements to theoretical statements. This involved generalizing data from interviews and perceptions by discussing them in light of the behavioral literature. In this respect, we related our findings to relevant examples of the application of TRA and TPB and compared them with related theories used in information systems research.
Another possible limitation is that we worked with the findings of software projects within only three participating organizations, so the results may not generalize beyond this context. However, the participants were professionals from three completely different companies using typical development technologies in a typical working environment, i.e., the natural setting demanded by the case study approach. This makes the results easier to validate.
We had to make a trade-off between the number of participants and the duration and cost of this study. We understand that nine interview subjects is not the ideal number of participants for the interview approach, but we had to balance that against our need for a case study of a software team's belief system. Even so, it seemed to be enough to show that it is possible and interesting to capture and represent a belief system in a software project context. Our intention is to increase the relevance of the obtained results to the software industry and to contribute to an improved understanding of how to apply this kind of theory to study software practices.
Lastly, there is also a risk that our findings could have been influenced by factors that escaped our attention. To mitigate this, we discussed and validated our findings with other researchers to seek the completeness of the conclusions.
6 Conclusion and Further Work
This paper has presented a study that characterizes belief systems by applying the Theory of Planned Behavior (TPB) to software project teams composed of professionals from three different companies. Using a conceptual framework as a guide, we captured participants' beliefs and described their main attributes and their impact on software development practices.
Scientific research has long tried to explain attitude and behavior, and TPB has received considerable attention within fields related to social behavior. It has most often been tested in environments where individuals have low perceived behavioral control. TPB fits well in the context of software development practices and serves as a good theoretical framework for mapping the behavior of software project teams.
In order to contribute to an improved understanding of how to apply a behavioral theory to the study of software practices, we shed light on relevant knowledge and experience on the characterization of belief systems and their impact on software industry practices. Overall, we aim to identify and map organizational and team factors that influence the adoption of new practices, through an approach that has proven to lead to practical and useful recommendations for companies.
Our next step is to conduct new case studies focused on the organizational culture factor. Through the synthesis of all the evidence, we intend to provide rich narrative accounts for this type of SE research activity and to elucidate further questions and issues that arise from the practice of software development.
Acknowledgment
This work was partially supported by a scholarship from the CAPES Foundation, process number 5744-11-3, and by the Fraunhofer Institute for Experimental Software Engineering. The authors are grateful to all involved in this study, especially the interviewees for their insights and cooperation, and to the SOLUTIS organization for supporting this work.
Appendix A: Interview Questionnaire
Warm-up questions:
1. In your opinion, what is the main challenge of your project (productivity, quality, deadline, cost, or other)?

Past experience questions:
2. Could you cite a past experience in which the project was conducted without planning, without a defined schedule, without following a software development methodology, and without risk analysis, and how this impacted (positively or negatively) the project results?

Lessons learned questions:
3. Is there any practice in your current project that is new to you? Do you think this new practice is beneficial to your project? Please explain in what sense. After using this new practice for a while, have you changed your opinion regarding its usefulness and importance?
4. Is there any practice in your current project (or past projects) that you adopted only because it was required by the organization where you work, but whose usefulness you do not see? Why do you not believe in this practice?
5. Is there any practice that you have introduced to your current project that was not used in the organization where you work? Why did you think it was important to introduce this practice? Was there any problem with acceptance? What are the results of this practice in your current project?

Reaction questions:
6. Is there any new methodology that the organization where you work is adopting? Do you know the reason for that change? Do you believe in this new methodology? Why? In which aspects is the company's new methodology affecting your current project?
Index of Articles
Index of Authors