
J. Andres Diaz Pace (2014)


PUC-Rio, May 2014 5/28/2014

Text Mining Techniques for Identifying Concerns in Software Documents

Alejandro Rago & J. Andrés Díaz-Pace

[email protected]

PUC-Rio, May/21/2014

ISISTAN Research Institute Facultad de Cs. Exactas, UNICEN

CONICET

Outline


Motivation

Pros/cons of textual documents

Crosscutting concerns

Basic NLP processing

Approach 1: REAssistant (use cases)

Main NLP techniques & tool support

Domain specialization (use cases)

Experimental results

Approach 2: ProSAD (architecture documents)

NLP & User Profiling techniques

Recommendation of SAD sections to stakeholders

Insights and future work

J. Andres Diaz Pace – PUC-Rio, May 2014


About me

Adjunct professor at UNICEN University (Tandil, Argentina)

Adjunct Researcher of CONICET

Former member of Technical Staff of SEI (2007-2010)

ATAM Evaluator Certificate

Software Architecture Professional Certificate

PhD in Computer Sciences, UNICEN (2004)

Research interests

Design driven by quality attributes

Conformance/reconstruction techniques

Assistive tools for software design

Object-oriented frameworks


Software development, artifacts & text

Textual documents are very common and useful in daily software development activities

Requirements

Design decisions

Architecture blueprint

User's guide

API specification

Bug reports

Although oral communication is important, some knowledge must be recorded anyway


Pros & cons of textual descriptions

Natural language writing (or loosely-structured text) is the norm in many projects

Simple

Promotes fluid interchanges between parties

However, NL comes with

Ambiguity

Implicit concepts

Complex semantics

Latent Concerns + scattering & tangling

(not necessarily a bad thing, but requires awareness)


A concern can be seen as a property, aspect or constraint that often affects several artifacts/decisions across the development lifecycle


Example 1: Use cases

[Figure: use case text with a potential Performance concern highlighted; individual matches are labeled direct or indirect]

Existing approaches perform a syntactic text analysis, sometimes helped by dictionaries; no semantic information is exploited


Example 2: Architecture document (SAD)

[Figure: SAD excerpt in which QAs (Performance) are fulfilled by Design Decisions (Performance)]

Exploit NLP to extract potential QAs and DDs, based on an ontology of Software Architecture concepts

Main point: The analyst's perspective

[Figure: software documents (SRS, SAD, other) and the concerns they contain]

Manual analysis of hidden/latent concerns is costly, time-consuming and error-prone

Not identifying relevant CCs can have negative effects

Inability to deal with them effectively, re-work, poor quality, customer dissatisfaction


REAssistant

Automated Identification of Crosscutting Concerns in Use Cases


The REAssistant tool for textual UCs


https://code.google.com/p/reassistant/


REAssistant: Behind the scenes

We rely on the Apache UIMA infrastructure plus NLP techniques

http://uima.apache.org/

Layers of annotations for the input text

Leverages UC knowledge: domain actions

SQL-like search engine

Enables diverse concern search strategies

REAssistant: The NLP pipeline

[Figure: NLP pipeline with three stages: basic NLP, advanced NLP, and domain-specific processing for use cases]


REAssistant: Layers of annotations

SEMANTIC ANALYSIS

A domain action refers to a semantic class that groups different actions (e.g., compute, calculate, perform, execute) commonly used in textual use cases, because those actions essentially mean the same thing
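The idea of stacking annotation layers on the input text, with domain actions as a top layer, can be illustrated with a minimal Python sketch. It is not REAssistant's implementation and does not use the Apache UIMA API. The Annotation and Document classes, the toy lemmatizer, and the CALCULATION label are assumptions made for illustration; only the four verbs come from the slide.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    layer: str   # name of the annotation layer, e.g. "token"
    begin: int   # character offset where the span starts
    end: int     # character offset where the span ends (exclusive)
    value: str   # payload: token text, lemma, or semantic label


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

    def select(self, layer):
        """Return all annotations of one layer, in insertion order."""
        return [a for a in self.annotations if a.layer == layer]


def tokenize(doc):
    # Basic NLP: one "token" annotation per alphabetic word.
    for m in re.finditer(r"[A-Za-z]+", doc.text):
        doc.annotations.append(Annotation("token", m.start(), m.end(), m.group()))


def lemmatize(doc):
    # Toy stand-in for a real lemmatizer: lowercase and strip a final "s".
    for tok in doc.select("token"):
        lemma = tok.value.lower()
        if lemma.endswith("s") and len(lemma) > 3:
            lemma = lemma[:-1]
        doc.annotations.append(Annotation("lemma", tok.begin, tok.end, lemma))


# Verbs from the slide; the class name CALCULATION is hypothetical.
DOMAIN_ACTIONS = {
    "compute": "CALCULATION",
    "calculate": "CALCULATION",
    "perform": "CALCULATION",
    "execute": "CALCULATION",
}


def tag_domain_actions(doc):
    # Domain-specific layer: map verb lemmas to their semantic class.
    for lem in doc.select("lemma"):
        label = DOMAIN_ACTIONS.get(lem.value)
        if label is not None:
            doc.annotations.append(
                Annotation("domain_action", lem.begin, lem.end, label))


PIPELINE = [tokenize, lemmatize, tag_domain_actions]


def run_pipeline(text):
    doc = Document(text)
    for annotator in PIPELINE:
        annotator(doc)   # each stage reads earlier layers and adds its own
    return doc


doc = run_pipeline("The system calculates the total and computes the tax.")
for a in doc.select("domain_action"):
    print(doc.text[a.begin:a.end], "->", a.value)
```

Because every layer stores character offsets instead of modifying the text, later stages and queries can combine layers freely. Both "calculates" and "computes" end up under the same label, which is what lets a single query cover different wordings.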

REAssistant: Types of concern queries

1. Direct queries

Focused on finding explicit references to a particular concern (tokens or lemmas)

e.g., “storage”, “table”, “roll-back”

Supported by standard tools, such as EAMiner

2. Indirect queries

Focused on more subtle associations that come from a semantic interpretation of the use cases (domain actions)

e.g., read/write/save/persist operations on data objects

Predicates with arguments

A given concern can be detected with a set of both direct & indirect queries
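The difference between the two query types can be sketched in Python as follows. This is an illustration under assumptions, not REAssistant's query language: the Step and Predicate structures, the DATA_ACCESS label, and the list of data objects are hypothetical, and the use-case steps are invented. Only the three direct keywords come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    verb: str           # lemma of the main verb
    domain_action: str  # semantic class of the verb (hypothetical labels)
    argument: str       # lemma of the object the verb acts on


@dataclass(frozen=True)
class Step:
    text: str
    lemmas: frozenset
    predicates: tuple


# Direct query: explicit vocabulary of the persistence concern (from the slide).
PERSISTENCE_KEYWORDS = {"storage", "table", "roll-back"}

# Indirect query: data-access actions applied to data objects (assumed list).
DATA_OBJECTS = {"order", "record", "profile"}


def direct_query(step):
    """True if the step mentions the concern explicitly."""
    return bool(step.lemmas & PERSISTENCE_KEYWORDS)


def indirect_query(step):
    """True if some predicate is a data-access action on a data object."""
    return any(
        p.domain_action == "DATA_ACCESS" and p.argument in DATA_OBJECTS
        for p in step.predicates
    )


def find_persistence(steps):
    """A concern is detected by the union of its direct and indirect queries."""
    return [s for s in steps if direct_query(s) or indirect_query(s)]


steps = [
    Step("The system checks the storage quota.",
         frozenset({"system", "check", "storage", "quota"}),
         (Predicate("check", "VERIFICATION", "quota"),)),
    Step("The system saves the customer profile.",
         frozenset({"system", "save", "customer", "profile"}),
         (Predicate("save", "DATA_ACCESS", "profile"),)),
    Step("The clerk prints the receipt.",
         frozenset({"clerk", "print", "receipt"}),
         (Predicate("print", "OUTPUT", "receipt"),)),
]

direct_hits = [s.text for s in steps if direct_query(s)]
indirect_hits = [s.text for s in steps if indirect_query(s)]
persistence_steps = [s.text for s in find_persistence(steps)]
print(persistence_steps)
```

In this invented sample the second step never uses a persistence keyword, so only the indirect query finds it.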


REAssistant: Predefined concern queries


REAssistant: Inferring domain actions

Multi-label classifier trained with samples from different domains


REAssistant: Case-studies

Goals

Compare CCs identified by the tool against those by humans

Compare our tool against other existing tools (EAMiner)

REAssistant works at finer granularity: steps in use cases

Assess results of analysts + REAssistant

3 case-studies

Course Registration System – academic, 8 UCs, ~20 pages

Health Watcher System – academic, 9 UCs, ~20 pages

MSLite – industrial, 22 UCs, ~40 pages

Need to validate against “gold standard” for CCs


“Assisting Requirements Analysts to Find Latent Concerns with REAssistant”, A. Rago, C. Marcos, A. Diaz-Pace. To appear in Automated Software Engineering (Springer), 2014

REAssistant: Some results


REAssistant: Lessons learned

In general, humans are likely to have good precision, but they still miss some concerns

So, “human” recall is low

Both EAMiner & REAssistant excel at identifying concerns, versus manual identification

Recall is the “sweet spot” where the advice of REAssistant becomes useful

Our tool has precision similar to EAMiner's

Boost in recall (~40%) due to indirect rules (when compared to direct rules only)
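For reference, precision and recall against a gold standard can be computed as in the short Python sketch below. The step identifiers and the three sets are invented to show the pattern described above (a careful human with high precision but low recall); they are not data from the case studies.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted items against a gold standard."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented example: use-case steps affected by one concern.
gold = {"UC1.s2", "UC1.s5", "UC2.s1", "UC3.s4"}
human = {"UC1.s2", "UC2.s1"}                               # misses two steps
tool = {"UC1.s2", "UC1.s5", "UC2.s1", "UC3.s4", "UC3.s7"}  # one false positive

human_scores = precision_recall(human, gold)
tool_scores = precision_recall(tool, gold)
print("human:", human_scores)
print("tool: ", tool_scores)
```

Here the human analyst is always right but finds only half of the affected steps, while the tool finds all of them at the price of one spurious suggestion.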



REAssistant: Future work

Apply the NLP techniques to the SAD domain

Need to investigate an equivalent to “domain actions”

Some “architectural” concerns will be design decisions

Mix of text and figures (architecture diagrams)

Improve query mechanisms/technology

Apache UIMA Ruta

Perform traceability analysis

Between concerns detected in requirements and concerns detected in the corresponding SAD (quality attributes)

Latent semantic indexing?


ProSAD

Automated Identification of Stakeholders’ Concerns for Optimizing SAD production


ProSAD: The need for documentation

Architecture knowledge must be properly documented in order to be effectively shared and communicated

Particularly in a context of multiple stakeholders

For project members and external stakeholders as well

[Figure: stakeholders access the Software Architecture Document (SAD), which describes the architecture]


ProSAD: The reality …

Who likes documentation?

Documentation efforts are often an expenditure that developers/managers do not wish to bear

Typically performed late (if at all), when all design activities are finished

The SAD must target multiple readers

Each with a different background and information needs

Studies have shown that individual stakeholders’ concerns are addressed by less than 25% of the SAD (per stakeholder)

It is likely a big document, which contains development-oriented contents only


ProSAD: How to add value?

Architecture documentation should be planned, iterative, but just enough

Document essential aspects for stakeholders

Alleviate costs of documenting “too much”

Produce documentation in increments (just like we do for design and code)

[Figure: Requirements, Design (some architectural decisions), Rest of SE process]

Documentation work goes hand in hand with design work


ProSAD: Assisting the documenter

SAD hosted in DokuWiki

Documentation assistant

Stakeholders’ interests on views (matrix)

Update plan (backlog of documentation tasks)

Documenter

The ProSAD approach: Schema

Agile principle: TAGRI (They Aren't Going to Read It)

Views & Beyond (V&B) SA Documentation Method

User profiles: V&B profiles + browsing activities

Knapsack formulation


ProSAD: V&B interests & views


ProSAD: About documentation tasks

A “unit of work” for the documenter

It applies to a SAD section (i.e., a given view)

It increases its level of detail

There is a backlog of feasible tasks for the current SAD version

Only those tasks that bring most benefit to the stakeholders should be chosen

… as long as their estimated costs do not exceed a cost constraint

Formulated as “Next SAD Version Problem” (NSVP)
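Choosing the most beneficial tasks under a cost constraint has the shape of a 0/1 knapsack problem. The Python sketch below solves a small instance with textbook dynamic programming. Whether ProSAD's optimizer works this way is not stated in the slides, and the Task fields, the backlog entries, their costs and utilities, and the budget are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    section: str    # SAD section (view) the task would refine
    cost: int       # estimated effort, in whole units (e.g. hours)
    utility: float  # estimated benefit to the stakeholders


def select_tasks(tasks, budget):
    """0/1 knapsack: maximize total utility with total cost <= budget.

    best[c] holds (utility, chosen tasks) for the best plan of cost <= c.
    """
    best = [(0.0, ())] * (budget + 1)
    for task in tasks:
        # Walk costs downwards so each task is used at most once.
        for c in range(budget, task.cost - 1, -1):
            candidate = best[c - task.cost][0] + task.utility
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - task.cost][1] + (task,))
    return best[budget]


# Hypothetical backlog for the next SAD version.
backlog = [
    Task("Module view", cost=4, utility=6.0),
    Task("Allocation view", cost=3, utility=5.0),
    Task("Component-and-connector view", cost=5, utility=7.0),
    Task("Rationale", cost=2, utility=2.0),
]

total_utility, plan = select_tasks(backlog, budget=8)
print(total_utility, [t.section for t in plan])
```

With a budget of 8, the plan keeps the allocation and component-and-connector tasks (cost 8, utility 12.0). The module view plus the allocation view would cost less but yield only 11.0.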


ProSAD: Preliminary evaluation

Groups of stakeholders accessing Wiki-based SADs

Compare increments produced in an ad-hoc manner with their counterparts generated with our optimizer (assistant)

Assignment: design an architecture solution for a model problem and then use V&B

Resulting SADs were sliced into 3 parts

SAD_slice1   SAD_slice2   SAD_final
10-25%       40-60%       75-85%


ProSAD: What-if analysis for utility/cost

Optimized SAD achieved higher utility in both transitions, while keeping production costs low

~15-25% utility improvement & ~10-33% cost reduction

Cost reduction actually means smaller SADs


ProSAD: Satisfaction per stakeholder

Not all stakeholders were evenly satisfied in optimized SADs: high-priority stakeholders were favored

“Conditioning” effects of preceding SAD versions (not-undo)


ProSAD: Lessons learned & Future work

Unnecessary information will not be documented, with the consequent effort savings

Measures of cost and utility are approximate

The stakeholder-centric motivation for architectural knowledge is reinforced in the development team

Apply optimization to real design sessions

Ongoing experiments with case-studies

Dependencies among SAD sections are not modeled

The V&B templates provide “consistency rules”, which require a more complex KP formulation

Leverage stakeholders' social network info


Final comments

REAssistant & ProSAD

Synergies between AI, agents and Software Engineering: software assistants

Metaphor: Assistants provide recommendations & developers make the final decisions

“Technology packages” provided by third parties simplify the integration

Learning users' information and/or training classifiers can be challenging

“Cold start” problem

Some prior body of knowledge must exist (and be codified)

Tool performance (e.g., precision) might vary with the domain


Thank you!


[email protected]

or

[email protected]
