Improving Requirements Glossary Construction via Clustering: Approach and Industrial Case Studies

Chetan Arora (1), Mehrdad Sabetzadeh (1), Lionel Briand (1), Frank Zimmer (2)

(1) University of Luxembourg, Luxembourg
(2) SES TechCom, Luxembourg


NL Requirements

• NL requirements can be vague and ambiguous
  • Multiple interpretations of the same word (e.g., "scale")
• Inconsistent terminology
  • Multiple terms for the same concept
    • element / component / object
  • Multiple representations of the same keywords
    • status of Ground Station Interface component
    • Ground Station Interface component’s status
    • Interface component status

Requirements Glossaries

• Glossaries help mitigate ambiguities
  • consistent terminology
  • improved communication, especially in scenarios with multiple stakeholders

Glossary Building Process

• Identify / search for relevant glossary terms
• Define these terms, along with related terms
• Ideally, the glossary should be built alongside the requirements specification, but is that the reality?

Automated Term Extraction

TextRank

TermRaider

Story Behind

“Let’s identify the glossary terms!”

“Wait, my colleagues have used so many variations of the same terms.”

“Even I’ve used different variations myself. Let me correct them all in the document.”

“How about we cluster all of them together?”

Identification of Candidate Terms → Similarity Calculation → Clustering

Candidate Terms → Similarity Matrix:

$$
\begin{bmatrix}
s_{11} & \cdots & s_{1n} \\
\vdots & \ddots & \vdots \\
s_{n1} & \cdots & s_{nn}
\end{bmatrix}
$$
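As a rough illustration of the similarity calculation step, the sketch below fills an n x n matrix with one score per pair of candidate terms. It assumes a normalized Levenshtein similarity (Levenshtein is one of the measures compared later in the deck); the function names and the normalization by the longer string are assumptions made for this example, not the authors' implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical strings (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def similarity_matrix(terms):
    """Return the n x n matrix whose entry [i][j] is similarity(terms[i], terms[j])."""
    lowered = [t.lower() for t in terms]
    return [[similarity(x, y) for y in lowered] for x in lowered]


terms = ["GSI component", "GSI component's status", "GSI monitoring information"]
matrix = similarity_matrix(terms)
for term, row in zip(terms, matrix):
    print(term, [round(score, 2) for score in row])
```

The matrix is symmetric with ones on the diagonal, so a real implementation could compute only the upper triangle.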

Approach

[Figure: approach overview. Labels: NL Requirements; Combination and Filtering Heuristics; Similarity Measure; Clustering Parameter(s); Clusters.]

Identification of Candidate Terms → Similarity Calculation → Clustering

R1 - STS shall supply GSI monitoring information (GSI input parameters and GSI output parameters) to the STS subcontractor.

R2 - When GSI component’s status changes, STS shall update the progress of development activities.

Candidate terms:

• STS • STS subcontractor
• GSI • GSI input parameter • GSI output parameter
• GSI component • GSI component’s status • GSI monitoring information
• development activity • progress of development activity
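To make the clustering step concrete, here is a minimal sketch that groups the candidate terms listed above for R1 and R2. The slides do not name the clustering algorithm at this point, so average-linkage agglomerative clustering is used as a stand-in, with token-overlap (Jaccard) similarity as an assumed measure; `jaccard`, `average_linkage`, `cluster`, and the choice of four clusters are all hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1] between two non-empty terms."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


def average_linkage(c1, c2):
    """Mean pairwise similarity between the members of two clusters."""
    return sum(jaccard(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))


def cluster(terms, num_clusters):
    """Repeatedly merge the two most similar clusters until num_clusters remain."""
    clusters = [[t] for t in terms]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = average_linkage(clusters[i], clusters[j])
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]                          # j > i, so index i stays valid
    return clusters


candidate_terms = [
    "STS", "STS subcontractor",
    "GSI", "GSI input parameter", "GSI output parameter",
    "GSI component", "GSI component's status", "GSI monitoring information",
    "development activity", "progress of development activity",
]
clusters = cluster(candidate_terms, num_clusters=4)
for members in clusters:
    print(members)
```

The number of clusters is passed in by hand here; the deck treats choosing it systematically as a separate question (RQ3).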


Empirical Evaluation

Case Studies

Case-A: 380 requirements

Case-B: 110 requirements

Term Extraction Accuracy

RQ1 - How accurate is our approach at identifying glossary terms?

TextRank

TermRaider

Clustering Accuracy

RQ2 - Which similarity measure for candidate terms yields the most accurate clusters?

Tuning the Clustering Algorithm

RQ3 - How can one systematically specify the number of clusters for clustering?

Clustering Effectiveness

• RQ4 - Is clustering effective at grouping candidate terms?
• Comparison of generated clusters against ideal clusters

Inferring Ideal Clusters

• Difficult to objectively define ideal clusters directly
• Systematic inference of ideal clusters using domain models

Initial Model by Domain Expert → Model Completion by Researchers → Validation by Domain Expert

[Figure: example domain model. Classes: Ground Station Interface (GSI); GSI Component; GSI Monitoring Information; GSI Anomaly; GSI Output Parameter; GSI Input Parameter. Attributes shown: status, availability.]

Var(status):
- status
- status of GSI component
- GSI component status

Var(availability):
- availability
- availability of GSI component

From Domain Models to Clusters

Var(status): status; status of GSI component; GSI component status

[Figure: clusters derived from the domain model. Classes shown: GSI Component; GSI Anomaly; GSI Input Parameter. Labels: Specialization Relation; Aggregation Association; Attributes; Overlapping.]

Accuracy Metrics for Clusters

$$\text{Precision} = \frac{|\text{Common Elements}|}{|\text{Generated Cluster}|}$$

$$\text{Recall} = \frac{|\text{Common Elements}|}{|\text{Ideal Cluster}|}$$

Example: Precision = 4 / 9, Recall = 4 / 4
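The two metrics can be computed directly from set intersections. The sketch below reproduces the slide's example figures (4 common elements, a generated cluster of 9 terms, an ideal cluster of 4 terms); the term names `t1` to `t9` are placeholders invented for illustration, and the handling of empty clusters is an assumption.

```python
def cluster_precision_recall(generated, ideal):
    """Precision and recall of one generated cluster against one ideal cluster."""
    generated, ideal = set(generated), set(ideal)
    common = generated & ideal
    # Returning 0.0 for an empty cluster is an assumption; the slides do not cover it.
    precision = len(common) / len(generated) if generated else 0.0
    recall = len(common) / len(ideal) if ideal else 0.0
    return precision, recall


# Placeholder terms sized to match the slide's example.
ideal_cluster = {"t1", "t2", "t3", "t4"}
generated_cluster = ideal_cluster | {"t5", "t6", "t7", "t8", "t9"}

precision, recall = cluster_precision_recall(generated_cluster, ideal_cluster)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```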

Ideal Clusters - Only for Evaluation

Scalability

RQ5 - Does our approach run in reasonable time?

Results

How accurate is our approach at identifying glossary terms? (RQ1)

[Figure: precision, recall, and F-measure of JATE, TextRank, TermRaider, Topia, and our approach on Case-A and Case-B. Value labels recoverable from the chart: 12.5, 16.1, 20.8, 26, 72, 70.7, 73.6.]

Which similarity measure yields the most accurate clusters? (RQ2)

[Figure: F-measure vs. number of clusters for Case-A and Case-B, one curve per syntactic measure: SoftTFIDF, Monge Elkan, Levenshtein, Random.]

How can one systematically specify the number of clusters for clustering? (RQ3)

[Figure: overlay plots against the number of clusters for Case-A and Case-B, annotated with a 5% margin from the max BIC.]

Case-A: chosen # of clusters = 120

Case-B: chosen # of clusters = 23
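One plausible reading of the "5% margin from max BIC" annotation is sketched below: compute a BIC score for each candidate number of clusters, then pick the smallest number whose score lies within 5% of the maximum. Both the tie-breaking rule (smallest qualifying number) and the way the margin is applied are assumptions, since the slides do not spell them out, and the BIC values are made up for illustration.

```python
def choose_num_clusters(bic_by_k, margin=0.05):
    """Pick the smallest k whose BIC is within `margin` of the maximum BIC.

    Assumes higher BIC is better and that the margin is relative to the
    magnitude of the maximum; both are assumptions, not the authors' stated rule.
    """
    best = max(bic_by_k.values())
    threshold = best - margin * abs(best)
    return min(k for k, bic in bic_by_k.items() if bic >= threshold)


# Hypothetical BIC scores keyed by number of clusters.
bic_scores = {5: 600.0, 10: 900.0, 20: 960.0, 30: 1000.0, 40: 990.0}
chosen = choose_num_clusters(bic_scores)
print("chosen number of clusters:", chosen)
```

Preferring the smallest qualifying number favors fewer, larger clusters when several settings score almost equally well.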

Is clustering effective at grouping candidate terms? (RQ4)

[Figure: upper bound vs. actual clustering results for Case-A and Case-B. Value labels recoverable from the chart: 90.8, 66.8, 76.2.]

RQ5 - Does our approach run in reasonable time?

• Case-A: 380 requirements, ~26 minutes
• Case-B: 110 requirements, < 2 minutes

Future Work

• Further enhance the accuracy using semantic analysis
• More empirical studies
• Identifying relationships between different development artefacts

Conclusion

[Figure: summary slide repeating the RQ1 accuracy chart, the case study sizes (Case-A: 380 requirements; Case-B: 110 requirements), and the RQ3 plots with the chosen numbers of clusters (120 and 23).]
