J. Andres Diaz Pace (2014)
1
PUC-Rio, May 2014 5/28/2014
Text Mining Techniques for Identifying Concerns in Software Documents
Alejandro Rago & J. Andrés Díaz-Pace
PUC-Rio, May/21/2014
ISISTAN Research Institute Facultad de Cs. Exactas, UNICEN
CONICET
Outline
Motivation
Pros/cons of textual documents
Crosscutting concerns
Basic NLP processing
Approach 1: REAssistant (use cases)
Main NLP techniques & tool support
Domain specialization (use cases)
Experimental results
Approach 2: ProSAD (architecture documents)
NLP & User Profiling techniques
Recommendation of SAD sections to stakeholders
Insights and future work
J. Andres Diaz Pace – PUC-Rio, May 2014
About me
Adjunct Professor at UNICEN University (Tandil, Argentina)
Adjunct Researcher of CONICET
Former member of the Technical Staff of SEI (2007-2010)
ATAM Evaluator Certificate
Software Architecture Professional Certificate
PhD in Computer Science, UNICEN (2004)
Research interests
Design driven by quality attributes
Conformance/reconstruction techniques
Assistive tools for software design
Object-oriented frameworks
Software development, artifacts & text
Textual documents are very common, and useful in
daily software development activities
Requirements
Design decisions
Architecture blueprint
User's guide
API specification
Bug reports
…
Although oral communication is important, some
knowledge must be recorded anyway
Pros & cons of textual descriptions
Natural language writing (or loosely-structured text)
is the norm in many projects
Simple
Promotes fluid interchanges between parties
However, NL comes with
Ambiguity
Implicit concepts
Complex semantics
Latent Concerns + scattering & tangling
(not necessarily a bad thing, but requires awareness)
A concern can be seen as a
property, aspect or constraint
that often affects several
artifacts/decisions
across the development lifecycle
Example 1: Use cases
[Figure: use case text with a potential Performance concern; matches labeled direct, direct, and indirect]
Existing approaches perform a syntactic text analysis,
maybe helped by dictionaries; no semantic information is exploited
Example 2: Architecture document (SAD)
[Figure: SAD excerpt; QAs (Performance) fulfilled by Design Decisions (Performance)]
Exploit NLP to extract potential quality attributes (QAs) and design decisions (DDs),
based on an ontology of Software Architecture concepts
Main point: The analyst's perspective
[Figure: software documents (SRS, SAD, other) and their concerns]
Manual analysis of hidden/latent concerns is
costly, time-consuming and error-prone
Not identifying relevant crosscutting concerns (CCs) can have negative effects
Inability to deal with them effectively, rework, poor quality, customer dissatisfaction
REAssistant
Automated Identification of
Crosscutting Concerns in Use Cases
The REAssistant tool for textual UCs
https://code.google.com/p/reassistant/
REAssistant: Behind the scenes
We rely on the Apache UIMA infrastructure + NLP techniques
http://uima.apache.org/
Layers of annotations for input text
Leverage UC knowledge: domain actions
SQL-like search engine
Enables diverse concern search strategies
REAssistant: The NLP pipeline
[Figure: NLP pipeline; labels: Basic NLP, Advanced NLP, Domain-specific for use cases]
REAssistant: Layers of annotations
[Figure: layers of annotations; label: Semantic analysis]
A domain action refers to a
semantic class that groups
different actions (e.g., compute,
calculate, perform, execute)
commonly used in textual use
cases, because those actions
essentially mean the same thing
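
The grouping idea can be sketched in Python as a lookup from verb lemmas to a shared semantic class. This is only an illustration: the class labels (`CALCULATION`, `DATA_WRITE`) and the table itself are hypothetical, and the talk describes inferring domain actions with a trained classifier rather than a fixed dictionary.

```python
# Hypothetical lookup from verb lemmas to domain actions.
# The labels are illustrative and are not taken from REAssistant.
DOMAIN_ACTIONS = {
    "compute": "CALCULATION",
    "calculate": "CALCULATION",
    "perform": "CALCULATION",
    "execute": "CALCULATION",
    "save": "DATA_WRITE",
    "persist": "DATA_WRITE",
}


def domain_action(verb_lemma):
    """Return the domain action for a verb lemma, or None if it is unknown."""
    return DOMAIN_ACTIONS.get(verb_lemma.lower())


# Different wordings collapse into the same semantic class.
same_class = domain_action("compute") == domain_action("Calculate")
```

With such a mapping, a query can target the class instead of enumerating every verb that an author might have used.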
REAssistant: Types of concern queries
1. Direct queries
Focused on finding explicit references to a particular
concern (tokens or lemmas)
e.g., “storage”, “table”, “roll-back”
Supported by standard tools, such as EAMiner
2. Indirect queries
Focused on more subtle associations that come from a
semantic interpretation of the use cases (domain actions)
e.g., read/write/save/persist operations on data objects
Predicates with arguments
A given concern can be detected with a set of both
direct & indirect queries
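
The two query types can be contrasted with a small Python sketch. It assumes each use case step already carries annotation layers (lemmas, an inferred domain action, and the object the action applies to); the `Step` class, the action labels and the sample steps are hypothetical and do not reflect the tool's actual query language.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One use case step with hypothetical annotation layers."""
    text: str
    lemmas: list
    domain_action: Optional[str] = None   # illustrative label, e.g. "DATA_WRITE"
    action_object: Optional[str] = None   # argument of the predicate


def direct_query(steps, keywords):
    """Steps whose lemmas explicitly mention one of the keywords."""
    return [s for s in steps if keywords & set(s.lemmas)]


def indirect_query(steps, actions):
    """Steps whose domain action is in the given set and applies to some object."""
    return [s for s in steps if s.domain_action in actions and s.action_object]


steps = [
    Step("The system updates the table of courses.",
         ["system", "update", "table", "course"], "DATA_WRITE", "table"),
    Step("The system saves the student schedule.",
         ["system", "save", "student", "schedule"], "DATA_WRITE", "schedule"),
    Step("The student selects a course.",
         ["student", "select", "course"], "SELECTION", "course"),
]

# A persistence concern detected with both kinds of queries.
direct = direct_query(steps, {"storage", "table", "roll-back"})
indirect = indirect_query(steps, {"DATA_READ", "DATA_WRITE"})
persistence_steps = {s.text for s in direct + indirect}
```

In this made-up example the second step never mentions a storage keyword, so only the indirect query finds it.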
REAssistant: Predefined concern queries
REAssistant: Inferring domain actions
Multi-label classifier trained with samples from different domains
REAssistant: Case-studies
Goals
Compare CCs identified by the tool against those by humans
Compare our tool against other existing tools (EAMiner)
REAssistant works at finer granularity: steps in use cases
Assess results of analysts + REAssistant
3 case-studies
Course Registration System – academic, 8 UCs, ~20 pages
Health Watcher System – academic, 9 UCs, ~20 pages
MSLite – industrial, 22 UCs, ~40 pages
Need to validate against a “gold standard” for CCs
“Assisting Requirements Analysts to Find Latent Concerns with REAssistant”, A. Rago, C. Marcos, A. Diaz-Pace. To appear in Automated Software Engineering (Springer), 2014
REAssistant: Some results
REAssistant: Lessons learned
In general, humans are likely to have good
precision, but they still miss some concerns
So, “human” recall is low
Both EAMiner & REAssistant excel at identifying
concerns, versus manual identification
Recall is the “sweet spot” where the
advice of REAssistant becomes useful
Our tool has precision similar to EAMiner's
Boost in recall (~40%) due to indirect rules
(when compared to direct rules only)
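
For reference, precision and recall against a gold standard can be computed as in the following Python sketch. The step identifiers are made up for illustration and are not results from the case studies.

```python
def precision_recall(detected, gold):
    """Precision and recall of a set of detected items against a gold standard."""
    true_positives = len(detected & gold)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up use case step identifiers (not data from the study).
gold = {"UC1.2", "UC1.5", "UC3.1", "UC4.4"}
human = {"UC1.2", "UC3.1"}                    # precise, but misses concerns
tool = {"UC1.2", "UC1.5", "UC3.1", "UC2.2"}   # finds more, with one false positive

human_p, human_r = precision_recall(human, gold)
tool_p, tool_r = precision_recall(tool, gold)
```

In this toy case the human analyst reaches precision 1.0 with recall 0.5, while the tool trades some precision (0.75) for higher recall (0.75), which mirrors the qualitative pattern described above.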
REAssistant: Future work
Apply the NLP techniques to the SAD domain
Need to investigate an equivalent to “domain actions”
Some “architectural” concerns will be design decisions
Mix of text and figures (architecture diagrams)
Improve query mechanisms/technology
Apache UIMA Ruta
Perform traceability analysis
Between concerns detected in requirements and concerns
detected in the corresponding SAD (quality attributes)
Latent semantic indexing?
ProSAD
Automated Identification of
Stakeholders’ Concerns for
Optimizing SAD production
ProSAD: The need for documentation
Architecture knowledge must be properly documented
in order to be effectively shared and communicated
Particularly in a context of multiple stakeholders
For project members and external stakeholders as well
[Figure: stakeholders access the Software Architecture Document (SAD), which describes the architecture]
ProSAD: The reality …
Who likes documentation?
Documentation efforts are often an
expenditure that developers/managers
do not wish to bear
Typically performed late (if at all), when
all design activities are finished
The SAD must target multiple readers
Each with a different background and information needs
Studies have shown that individual stakeholders’
concerns are addressed by less than 25% of the SAD
(per stakeholder)
It is likely a big document, which contains
development-oriented contents only
ProSAD: How to add value?
Architecture documentation should be
planned, iterative, but just enough
Document essential aspects for stakeholders
Alleviate costs of documenting “too much”
Produce documentation in increments (just like
we do for design and code)
[Figure: Requirements, Design (some architectural decisions), rest of SE process]
Documentation work goes
hand in hand with design work
ProSAD: Assisting the documenter
[Figure: documentation assistant; elements: SAD hosted in DokuWiki, stakeholders’ interests on views (matrix), update plan (backlog of documentation tasks), documenter]
The ProSAD approach: Schema
Agile principle: TAGRI (They Aren't Going to Read It)
Views & Beyond SA Documentation Method
User profiles: V&B profiles + browsing activities
Knapsack formulation
ProSAD: V&B interests & views
ProSAD: About documentation tasks
A “unit of work” for the documenter
It applies to a SAD section (i.e., a given view)
It increases its level of detail
There is a backlog of feasible tasks for the current SAD version
Only those tasks that bring most benefit to the stakeholders
should be chosen
… as long as their estimated costs
do not exceed a cost constraint
Formulated as “Next SAD Version Problem” (NSVP)
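
The selection problem described above has the shape of a 0/1 knapsack: choose tasks that maximize total utility without exceeding a cost budget. The Python sketch below solves a toy instance by dynamic programming. The task names, costs and utilities are hypothetical, integer costs are assumed, and the slides do not say which solver ProSAD uses, so this is one standard technique rather than the tool's implementation.

```python
def select_tasks(tasks, budget):
    """0/1 knapsack by dynamic programming.

    tasks: list of (name, cost, utility) tuples with integer costs.
    Returns (best_utility, chosen_task_names).
    """
    # best[c] = (utility, task names) achievable with total cost <= c
    best = [(0, [])] * (budget + 1)
    for name, cost, utility in tasks:
        # Go through budgets downward so each task is used at most once.
        for c in range(budget, cost - 1, -1):
            candidate = best[c - cost][0] + utility
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - cost][1] + [name])
    return best[budget]


# Hypothetical backlog: (documentation task, estimated cost, utility to stakeholders).
backlog = [
    ("refine module view", 3, 8),
    ("add deployment view", 4, 9),
    ("detail C&C view", 2, 4),
    ("extend rationale", 5, 10),
]

utility, chosen = select_tasks(backlog, budget=7)
```

For this backlog and a budget of 7, the sketch picks the module and deployment view tasks, with total utility 17. It treats tasks as independent, so it does not capture dependencies among SAD sections.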
ProSAD: Preliminary evaluation
Groups of stakeholders accessing Wiki-based SADs
Compare increments produced in an ad-hoc manner with
their counterparts generated with our optimizer (assistant)
Assignment: design an architecture solution
for a model problem and then use V&B
Resulting SADs were sliced into 3 parts
SAD_slice1: 10-25%
SAD_slice2: 40-60%
SAD_final: 75-85%
ProSAD: What-if analysis for utility/cost
Optimized SAD achieved higher utility in both transitions, while keeping production costs low
~15-25% utility improvement & ~10-33% cost reduction
Cost reduction actually means smaller SADs
ProSAD: Satisfaction per stakeholder
Not all stakeholders were evenly satisfied in optimized SADs:
high-priority stakeholders were favored
“Conditioning” effects of preceding SAD versions (not-undo)
ProSAD: Lessons learned & Future work
Unnecessary information will not be documented,
with the consequent effort savings
Measures of cost and utility are approximate
The stakeholder-centric motivation for architectural
knowledge is reinforced in the development team
Apply optimization to real design sessions
Ongoing experiments with case-studies
Dependencies among SAD sections are not modeled
The V&B templates provide “consistency rules”, which
require a more complex KP formulation
Leverage stakeholders' social network info
Final comments
REAssistant & ProSAD
Synergies between AI, agents and Software Engineering
Software assistants
Metaphor: Assistants provide recommendations & developers make the final decisions
“Technology packages” provided by third parties simplify the integration
Learning users' information and/or training classifiers can be challenging
“Cold start” problem
Some prior body of knowledge must exist (and be codified)
Tool performance (e.g., precision) might vary with the domain
Thank you!