Recovering UML class models from C++: A detailed explanation

Andrew Sutton, Jonathan I. Maletic *

Department of Computer Science, Kent State University, Kent, OH 44242, USA

Received 5 July 2006; accepted 25 October 2006
Available online 22 December 2006

Information and Software Technology 49 (2007) 212–229
www.elsevier.com/locate/infsof

Abstract

An approach to recovering design-level UML class models from C++ source code to support program comprehension is presented. A set of mappings are given that focus on accurately identifying such elements as relationship types, multiplicities, and aggregation semantics. These mappings are based on domain knowledge of the C++ language and common programming conventions and idioms. Additionally, formal concept analysis is used to detect design-level attributes of UML classes. An application implementing these mappings is used to reverse engineer a moderately sized, open-source application and the resultant class model is compared against those produced by other UML reverse engineering tools. This comparison shows that the presented mapping rules effectively produce meaningful and semantically accurate UML models.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Software engineering; Reverse engineering; Design recovery; Program comprehension; UML class models

1. Introduction

The software industry has widely accepted and often uses UML (Unified Modeling Language) [12] tools in forward engineering, but these tools are used less frequently during software maintenance and evolution. This is due to a number of reasons; foremost among these is that the manual recovery and maintenance of UML models is time-consuming and costly. As such, UML models become stale while the source code continues to evolve. Although many UML modeling tools allow us to reverse engineer UML models from source code, they often perform poorly at this task. A case study of reverse engineering tools [22] finds that many of these tools, despite advances in the research literature, continue to focus on producing the core elements of UML (i.e., simple class diagrams), but often fail to adequately represent design abstractions. This is problematic when recovered software models fail to accurately represent the abstract program semantics required for high-level program comprehension. This problem can be exacerbated by the fact that end users are typically unaware of the internal processes for producing the UML models. This is to say that these tools do not disclose their mechanisms for reverse engineering, which can lead to results that do not meet the end-user's expectations.

Although this study [22] concludes that these tools provide reliable functionality, the resulting models are anything but consistent. For example, Microsoft Visio is incapable of reverse engineering associations, Visual Paradigm creates dependencies when associations are appropriate, and Rational Rose C++ Modeler creates only aggregate associations (open diamonds in UML). The primary reason for these inconsistencies is the sizeable semantic gap between UML and C++. Although this gap is quite wide, it is by no means unbridgeable. Unfortunately, commonly used reverse engineering tools are closed source systems and provide little information about how UML models are created from C++. This leaves developers to speculate about rationale for the application's logic. As such, there is no standard "bridge" between C++ and UML, and all reverse engineering tools tend to build their own.

0950-5849/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.infsof.2006.10.011

* Corresponding author. Tel.: +1 330 672 9039.
E-mail addresses: [email protected] (A. Sutton), [email protected].edu (J.I. Maletic).


We address this problem by defining a set of mappings for the reverse engineering of UML class models from C++ source code [37,38]. These mappings employ a combination of C++ syntactic and semantic information along with domain knowledge of programming conventions, idioms, and reuse libraries to produce semantically accurate UML class models. Many of these mappings extend and integrate techniques presented in the literature on this topic. As part of these mappings, a sophisticated information analysis technique (formal concept analysis) is applied to the UML model to recover design-level attributes of classes rather than re-document member variables.

These mappings are implemented in a reverse engineering tool, pilfer, which is used to evaluate the relevance of the defined mappings by reverse engineering a moderately-sized C++ application. The model produced by pilfer is compared against those produced by other tools in order to validate the accuracy and completeness of the defined mappings. Because performance is an important aspect of reverse engineering tools, we also compare pilfer's run time performance against these tools.

This paper is organized as follows. Section 2 provides a more detailed context of the problem being addressed. Section 3 describes rationale and tradeoffs for each mapping. Section 4 describes the implementation of pilfer. In Section 5, we present a comparison of models generated by pilfer and other reverse engineering tools. Section 6 describes work related to this topic and Section 7 presents our conclusions and future work.

2. Reverse engineering analysis

Design recovery is the process of recovering design decisions, abstractions, and rationale from a program's source code [5]. Design recovery directly supports program comprehension through reverse engineering. Fig. 1 depicts the architecture of a technology stack used in reverse engineering to recover program designs. This technology stack is motivated in part by the Rigi reverse engineering environment [36,45] and the DMS program analysis system [3]. It integrates the technologies used in reverse engineering such as static analysis, concept analysis, and software clustering. More precisely, Fig. 1 illustrates an architecture (or reference model) for a reverse engineering environment. This architecture is composed of layered analysis methods and models. Each method computes a model of the system that is consumed or used by higher level analyses (e.g., parsing yields an AST that can be used to compute control flow graphs). Moreover, there are a number of potential metrics (those shown are only a small subset of software metrics) that can be computed from the models at all layers of this architecture.

Fig. 1. A reverse engineering stack produces increasingly abstract source models based on layered processes and analyses. Software metrics are often computed for source models at different levels of abstraction.

Much like the computation of metrics, mapping rules (or simply mappings) can be defined at any level of abstraction. Although these mappings can be used to produce any number of artifacts, here we specifically envision them being used to produce UML models. For example, simple mappings from C++ to UML produce UML classes from the C++ classes found in source code. However, these simple mappings (those most often implemented in common UML reverse engineering tools) are fairly naïve. They rarely embed more than a rudimentary knowledge of programming language semantics or libraries, nor do they allow users to embed their own specific domain knowledge into the reverse engineering processes. Moreover, with a limited set of analysis methods, these tools are often incapable of producing anything more than visually re-documented source code. Reverse-engineered artifacts of this nature are often too detailed to be of any great use to the casual reader. The information they relate is easily attainable from the source code, or through source code re-documentation tools such as Doxygen or JavaDoc.

Table 1
A list of concepts being reverse-engineered, the required level of analysis to perform that task including domain knowledge, and the degree of ambiguity associated with each task

| UML concept to be reverse-engineered | Marks (required analysis: parsing, semantic analysis, static analysis; domain knowledge) | Degree of ambiguity | Description of reverse engineering task |
|---|---|---|---|
| Entities | | | |
| Classes | X | None | |
| Interfaces | X X | Med | Distinguish interfaces from classes |
| Data types | X X | Med | Distinguish data types from classes |
| Attributes | X | None | |
| Design-level attributes | X X | High | Determine design-level attributes |
| Read-only attributes | X X | Low | Find attributes with only accessors |
| Attribute type | X X | Med | Correctly resolve type references |
| Multiplicity | X X | Med | Identify multiplicity of attributes |
| Ordering | X X | Med | Identify ordered and unordered containers |
| Parameters | X | None | |
| Parameter direction | X X | Med | Identify in, out or inout parameters |
| Relationships | | | |
| Association | X | High | Find attributes constituting associations |
| Aggregation semantics | X X | High | Determine object lifetime |
| Realization | X | Med | Distinguish inheritance from implementation |

The X marks are listed in source order; their assignment to the individual analysis and domain-knowledge columns is not preserved in this copy.

The use of domain knowledge within these mappings provides a significantly broader range of functionality. Domain knowledge can include programming style conventions (e.g., idioms, patterns, and even method naming), knowledge of reuse libraries, and even information from the problem domain. Additionally, these mappings can leverage other abstract source models (e.g., control flow, data flow, and call graphs) to assist in the mapping process.

Table 1 describes the UML concepts [31] to be reverse-engineered from source code. These concepts are representative of current deficiencies in reverse engineering tools as noted in [22] and through our practical experience with a number of reverse engineering tools. Although the list of tasks required to fully reverse engineer C++ programs is significantly longer, we are interested in defining mappings for a small subset of those.

The columns in Table 1 are described as follows. The required analysis column describes the analysis method required to accomplish each task. This classification is based primarily on experience with C++ parsing technology and the implementation of different reverse engineering methods. Parsing implies that the results could be obtained through the abstract syntax tree (AST) of the program, semantic analysis means that the application must rely upon specific C++ semantics in order to perform the task, and static program analysis (or simply static analysis) indicates the need for more sophisticated analysis methods (e.g., data flow analysis). The domain knowledge column indicates whether or not domain knowledge will contribute to the completion of the task. Here, domain knowledge is taken to include knowledge of the problem domain and (primarily) the solution domain of an application and its implementation. For example, knowledge of programming idioms, design patterns, and code concepts (e.g., STL template concepts) is potentially useful when recovering the semantics of source code. Tasks in which the application of domain knowledge plays a role are marked. The degree of ambiguity describes how consistently different tools perform the task. Low ambiguity implies that tools produce mostly consistent models, whereas high ambiguity tasks result in widely varying results. The degree of ambiguity was determined by considering the number of alternative design-level semantics for a given C++ declaration. For example, identifying UML associations is highly ambiguous; this becomes evident when comparing the results of different reverse engineering applications. The description column provides additional details for the reverse engineering task.

Note that we have not included methods (member functions) in this table. Most reverse engineering tools typically have little difficulty identifying the methods of a class. Here, we are interested in which methods are actually produced in the UML model. Most tools include constructors, destructors, and overloaded operations in the reverse-engineered classes. However, we see these as implementation details (i.e., language integration features) that only serve to pollute the resultant UML model. We recommend allowing the user to decide the relevance of those methods to their comprehension requirements.

3. Mappings for reverse engineering

In this section, we define mappings for UML reverse engineering tasks with a degree of ambiguity of medium or higher in Table 1, or those that are often seen as difficult or having potential ambiguities in their mappings. The mappings defined herein are heuristics based on syntactic and semantic features rather than deterministic analyses.

3.1. Identifying types of classes

The distinction between classes, interfaces, and data types is semantically important in UML. Taken as a whole, these elements are described as classifiers, or named modeling elements that (a) have properties and behaviors and (b) can participate in generalization relationships. However, the treatment of these elements in models can vary greatly. For example, UML does not allow associations between classes and data types, and only classes can realize interfaces. Unfortunately, C++ does not provide a rich enough vocabulary to easily distinguish these classifiers through simple parsing. Moreover, it is not always obvious whether a C++ class is a UML data type or interface. Let us now address each of these separately.

3.1.1. Interfaces

Our method for identifying interfaces is fairly straightforward. Our definition of interface is borrowed from other OO languages, namely Java and C#. We define a C++ interface as a class that defines only public, pure virtual methods, declares no member variables, defines no constructors or destructors, and, if derived, the base classes must also be interfaces.

A number of these restrictions derive from the notion that interfaces specify a contract rather than program behavior. For example, a class implementing a method associates behavior with that class. Moreover, a class declaring member variables associates state information with the class. Obviously, if a class declares no member variables then it has no need of specialized constructors or destructors. Finally, we restrict interfaces to being derived only from other interfaces in order to align our definition with common OO models.

The code in Fig. 2 shows a class, IList, that meets our criteria for interface declarations. It defines only public pure virtual methods, and defines no member variables, constructors, or destructors. Although we can use this knowledge to effectively distinguish a "true" C++ interface from other abstract classes, there are some tradeoffs in its usage. First, our definition of a C++ interface is fairly restrictive and may not align well with the common convention of treating abstract classes as interfaces. As such, this mapping may produce some unexpected results for some developers. Second, earlier versions of UML greatly restrict the usage of interface elements in associations (i.e., they do not participate in associations). These restrictions have been relaxed in newer versions.

Fig. 2. The IList class is an interface, providing only public abstract methods, and defining no state or behavior.
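The interface rule of this section can be phrased as a predicate over facts that a parser would collect for each class. What follows is a minimal sketch under stated assumptions, not pilfer's implementation: the ClassFacts record, its field names, and the members of IList are hypothetical, and the IList declaration is only a guess at the kind of class the Fig. 2 caption describes, since the figure itself is not reproduced here.

```cpp
#include <cstddef>
#include <type_traits>

// A class in the style described by the Fig. 2 caption (member names are
// hypothetical): only public, pure virtual methods; no member variables,
// constructors, or destructors.
class IList {
public:
    virtual void insert(int value) = 0;
    virtual void remove(int value) = 0;
    virtual std::size_t size() const = 0;
};

// The compiler can confirm that IList is abstract, but not that it is an
// interface in the sense of Section 3.1.1; that needs the facts below.
static_assert(std::is_abstract<IList>::value, "IList must be abstract");

// Hypothetical summary of what a parser has recorded about one C++ class.
struct ClassFacts {
    int publicPureVirtualMethods;   // public methods declared "= 0"
    int otherMethods;               // implemented or non-public methods
    int memberVariables;
    int constructorsOrDestructors;  // user-declared only
    bool allBasesAreInterfaces;     // true when the class has no bases
};

// Section 3.1.1 rule. Requiring at least one pure virtual method is an
// assumption of this sketch; the text does not discuss empty classes.
constexpr bool isInterface(const ClassFacts& c) {
    return c.publicPureVirtualMethods > 0
        && c.otherMethods == 0
        && c.memberVariables == 0
        && c.constructorsOrDestructors == 0
        && c.allBasesAreInterfaces;
}

// Facts for IList above, and for an abstract class that also carries state
// and an implemented method (an ordinary class under this definition).
constexpr ClassFacts kIListFacts{3, 0, 0, 0, true};
constexpr ClassFacts kAbstractWithState{2, 1, 1, 1, true};
```

With these facts, isInterface(kIListFacts) holds and isInterface(kAbstractWithState) does not, which mirrors the first tradeoff noted above: an abstract class that developers treat as an interface is still reported as a class.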

3.1.2. Data types

In UML, a data type is one that is identified only by its value such that two instances having the same value are said to be the same instance. Instances of classes, however, have a distinct identity, and two objects with the same state are not necessarily considered to be the same object. Typical examples of data types include programming language primitives such as integers, Boolean values, and enumeration values. A string class is also an example of a data type. In order to identify data types in C++, we have to rely heavily on how classes are used in a program. Specifically, we use a class's construction, copy, and assignment semantics in order to identify it as a data type.

Our definition of a C++ data type encompasses two distinct variations. A class that implements a public default constructor, a copy constructor, and assignment operator is a data type. In this case, the String class in Fig. 3 explicitly implements default construction, copy construction, and assignment semantics that will allow the class to behave like a POD (plain old data) type. A class that implements no constructors or assignment operators is also a data type (such as the Complex class). In this case, the developer is relying on the compiler to supply defaults for these methods. Note that we do not consider destructors in the classification. Destructors add little to the design-level semantics of the class because all classes have destructors, either implicit or explicit. As such, using the destructor in classification may lead to ambiguity when classifying C++ classes.

Fig. 3. The Complex class relies on the compiler to supply its copy, construction, and assignment semantics. The String class overloads these methods to provide specialized copy and assignment semantics.

Fig. 4. The ModelElement class has three member variables but defines only two modeled properties: its name and unique id. The reference count is a detail of the implementation.

Automating this mapping might lead to cases where classes are misidentified as data types, especially in cases where the author intentionally implements the constructor, copy constructor and assignment operator. However, we feel that these cases are rare, and might even represent extraneous functionality for those classes (e.g., dead code or poor class design).
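The two variations of the data type rule can be sketched in the same style. The Complex and String classes below are guesses at the shape of the classes named in the Fig. 3 caption, and the SpecialMembers record with its field names is hypothetical; none of this is taken from pilfer.

```cpp
#include <string>

// Variation 2: no constructors or assignment operators are declared, so the
// compiler supplies them.
struct Complex {
    double re;
    double im;
};

// Variation 1: default constructor, copy constructor, and assignment
// operator are all declared explicitly.
class String {
public:
    String() : text_() {}
    String(const String& other) : text_(other.text_) {}
    String& operator=(const String& other) {
        text_ = other.text_;
        return *this;
    }
private:
    std::string text_;  // stand-in storage; the class in Fig. 3 is not shown
};

// Hypothetical record of the special members a class declares itself.
// Compiler-generated members are not counted, and destructors are ignored,
// as in Section 3.1.2.
struct SpecialMembers {
    bool hasPublicDefaultCtor;
    bool hasCopyCtor;
    bool hasCopyAssignment;
    int declaredCtors;  // all user-declared constructors, any access level
};

constexpr bool isDataType(const SpecialMembers& m) {
    return (m.hasPublicDefaultCtor && m.hasCopyCtor && m.hasCopyAssignment)
        || (m.declaredCtors == 0 && !m.hasCopyAssignment);
}

constexpr SpecialMembers kComplex{false, false, false, 0};
constexpr SpecialMembers kString{true, true, true, 2};
// A class with only a default constructor declared satisfies neither
// variation and is therefore modeled as an ordinary class.
constexpr SpecialMembers kDefaultOnly{true, false, false, 1};
```

The predicate looks only at declarations, so it shares the weakness described above: a class whose author declares all three members for unrelated reasons is still reported as a data type.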

3.2. Identifying attributes

Typical reverse engineering tools correlate UML attributes with a class's member variables. However, in UML, attributes are used somewhat differently. Typically, an attribute reflects a facet of a class's interface that can be read or written rather than representing the implementation details of a member variable. This is to say that UML attributes more likely correspond to instances of the property idiom rather than the member variables of the class. This is more appropriate for reverse engineering tools because it represents a more abstract view of a class rather than its implementation details. Programming languages such as C# provide syntactic features that allow the explicit declaration of such properties, but C++ does not.

Our method for identifying attributes of classes is based on the collection of accessors and mutators of a C++ member variable. For the purpose of this discussion, we define an accessor as a constant method that returns a member variable of a class. A mutator is a method that writes the value of a member variable. Accessors and mutators are grouped by the member variables on which they operate. A read-only property can be identified by an accessor returning a member variable. A read-write property can be identified by the presence of an accessor and a set of mutators. We define writable properties as having a set of mutators because the property's interface could support collection semantics (e.g., add, remove, and clear). Examples of mapping rules are shown in Table 2.

Fig. 4 shows a class with two model-able properties: id (read-only) and name (read-write). These properties are derived by examining accessor and mutator methods and their relationship to member variables of the class.
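The grouping of accessors and mutators by member variable can be sketched as follows. The ModelElement class is a guess at the class the Fig. 4 caption describes (three member variables, of which only id and name are properties). MemberUsage, PropertyKind, and classify are hypothetical names, and treating a member with no accessor as "not a property" is an assumption drawn from that caption rather than a rule stated in the text.

```cpp
#include <string>

// Sketch of a class matching the Fig. 4 caption; member names are guesses.
class ModelElement {
public:
    ModelElement(int id, const std::string& name)
        : id_(id), name_(name), refCount_(0) {}

    int id() const { return id_; }                      // accessor only
    const std::string& name() const { return name_; }   // accessor
    void setName(const std::string& n) { name_ = n; }   // mutator

    void retain() { ++refCount_; }   // writes refCount_, but no accessor
    void release() { --refCount_; }  // exposes it: implementation detail

private:
    int id_;
    std::string name_;
    int refCount_;
};

enum class PropertyKind { None, ReadOnly, ReadWrite };

// Hypothetical per-member-variable tally produced by grouping methods by
// the member variable they read or write.
struct MemberUsage {
    int accessors;  // const methods returning the member
    int mutators;   // methods writing the member
};

constexpr PropertyKind classify(const MemberUsage& u) {
    return u.accessors == 0 ? PropertyKind::None
         : u.mutators == 0  ? PropertyKind::ReadOnly
                            : PropertyKind::ReadWrite;
}

// Tallies for the three member variables of ModelElement above.
constexpr MemberUsage kIdUsage{1, 0};
constexpr MemberUsage kNameUsage{1, 1};
constexpr MemberUsage kRefCountUsage{0, 2};
```

Under these tallies, id is read-only, name is read-write, and the reference count yields no attribute. A property with collection semantics would simply report more than one mutator.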

However, this method of detecting attributes of C++ classes is not without fault. Attempts to automate this detection without more sophisticated analysis techniques will almost certainly lead to the detection of false positives. This can have wide ranging effects if behaviors (i.e., methods) of a class are mis-modeled as UML attributes. However, a developer performing this task manually should have some intuition about which features of the class are properties and which are operations, allowing them to disambiguate cases where the function definitions are unclear.

Table 2
Example mappings from groups of C++ member-function declarations to UML attributes

| C++ member function declaration | UML type |
|---|---|
| const Foo & foo() const; void setFoo(const Foo &); | Read-write property, "foo" |
| Foo *foo() const; void setFoo(Foo *); | Read-write property, "foo" |
| const Foo & foo() const; | Read-only property, "foo" |
| Foo *foo() const; | Read-only property, "foo" |

3.2.1. Attribute type

Although it is relatively easy to determine the type of a member variable in C++, that type does not always map directly to UML. For example, UML provides no syntax for modeling pointer or reference types, and many common C++ type qualifiers (const, mutable, and volatile) can have little or no meaning because typed elements in UML are simply references to classifiers. No additional information is modeled in the specification of type.

To this end, we define a simple mapping for type resolution. We define the type of a modeled attribute to be the type reference in a C++ type expression. The type reference can be obtained by stripping out all pointer, reference, and array symbols and any qualifiers in the expression. Examples are shown in Table 3.

Table 3
Mappings from C++ type specifications to UML types discard qualifiers and pointer, reference, and array tokens

| C++ type declaration | UML type |
|---|---|
| Foo foo; | Foo |
| Foo *foo; | Foo |
| Foo **foo; | Foo |
| const Foo *foo; | Foo |
| Foo & foo; | Foo |
| const Foo & foo; | Foo |

In addition to the simple C++ type expression mapping in Table 3, we also need to deal with more complex template type expressions. Templates such as containers and smart pointers in the STL (Standard Template Library) are used frequently in C++ programs, but do not necessarily contribute to type information in a UML model. Instead, they embed semantics about the association between classes, but not about the types of the classes participating in that association. Such classes exhibit a transitive property for the collaborating classes. We consider these classes to be transitive in nature if they satisfy a condition of transitive containment. That is to say, "class A contains class B which contains class C, and therefore class A contains class C". In this case, class B is either a container or smart pointer. It is the summary statement that "class A contains class C" that is more appropriate in UML type specifications. The relevant type information can be extracted from a template declaration by extracting its inner-most arguments. The mapping for extracting type information from simple C++ type specifications also applies. Example mappings are shown in Table 4. As such, automating this mapping requires a great deal of embedded domain knowledge. An application performing this task needs to know in advance which classes admit this transitive containment property and which template parameters correspond to the appropriate type information.

Table 4
Transitive type mappings from C++ template typed declarations use the inner-most template arguments to construct UML type information

| C++ declaration | UML type |
|---|---|
| list<Foo> | Foo |
| set<Foo *> | Foo |
| const stack<Foo> & | Foo |
| queue<auto_ptr<Foo *> > | Foo |

3.2.2. Multiplicity

Multiplicity defines the allowable number of instances of an attribute and is expressed as a range between a lower and upper bound (e.g., 0..*). The multiplicity of attributes can be difficult to detect. There is no set of simple rules that readily describe a mapping of declarations to multiplicities. Fortunately, there are several indicators in C++ that can help us approximate the multiplicity of an attribute, especially pointers, array brackets, and transitive classes. Table 5 lists the mapping rules between C++ type declarations and UML multiplicity ranges.

Table 5
Mappings from C++ declarations to UML multiplicity ranges depend on pointer, reference, and array symbols associated with the type reference. Template classes can also contribute multiplicity information.

| C++ declaration | UML multiplicity range | Note |
|---|---|---|
| Foo | 1..1 | |
| Foo* | 0..* | |
| Foo [] | 0..* | a |
| Foo *[] | 0..* | b |
| Foo [n] | n..n | c |
| Foo *[n] | 0..n | |
| Foo ** | 0..* | |
| Foo & | 1..1 | |
| list<Foo> | 0..* | |
| auto_ptr<Foo *> | 0..1 | |

a The expression Foo [] is only usable in formal parameter lists and is semantically equivalent to Foo *.
b The expression Foo *[] is only usable in formal parameter lists and is semantically equivalent to Foo **.
c Where n is a constant, integral value.

Note that the only multiplicity ranges that can be unambiguously identified are those where (a) only a single instance is declared, (b) a reference to a single instance is declared, or (c) a fixed-size array is declared. In all other cases, we cannot accurately identify either the lower or upper bounds of the range. This is due to the ambiguity of C++ declarations. For example, we might typically expect a pointer declaration (i.e., Foo *) to represent a single object, but C++ defines no difference between this and a C-array of Foo objects. Additionally, we can use knowledge of containers and smart pointers to extract multiplicity semantics.

3.2.3. Ordering

Because many attributes represent the containment of multiple instances, the UML metamodel provides the ability to describe ordering semantics for containers. UML defines two types of ordering for attributes: ordered and unordered. These simply specify whether the containing attribute stores instances sequentially (e.g., a list or vector) or otherwise (e.g., a set). To date, there is no good static analysis method that can accurately recover the ordering semantics of containers. This is due to the fact that the storage mechanisms are woven throughout the implementation of various data structures. Fortunately, we can use the declarative semantics of arrays and the use of container classes to aid in this mapping. Mappings for container ordering semantics are shown in Table 6.

Table 6
Mappings from C++ declarations to UML orderings depend primarily on the semantics of abstract data types (e.g., set or vector)

| C++ declaration | UML multiplicity ordering | Note |
|---|---|---|
| Foo, Foo * | ordered | a |
| Foo [] | ordered | |
| Foo *[] | ordered | |
| vector<Foo *> | ordered | |
| list<Foo> | ordered | |
| deque<Foo *> | ordered | |
| set<Foo> | unordered | |

a Member variables with single or optional (1 or 0..1) multiplicity are typically described as ordered, which is the default ordering given by the UML specification.

The mappings for ordering semantics are easily derived from information about member variable type declarations. C-array and C-vector declarations are always allocated with sequential memory storage. The ordering semantics of containers are intrinsic to their data type.

We might note that the ordered/unordered attribute semantics of UML do not map precisely onto the concepts defined by the STL. An ordered attribute corresponds to the STL Sequence concept in that contained elements are arranged in a strict linear order. Examples include arrays, vectors, deques, and lists. On the other hand, the Sorted Associative Container concept has no equivalent semantics in the UML specification, although the possibility of a vendor-specific extension for sorted attribute types is mentioned.
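The type resolution, multiplicity, and ordering mappings of Sections 3.2.1 to 3.2.3 can be sketched together as simple string processing over a C++ type expression. This is an illustrative sketch only, under several assumptions: the function names (firstTypeName, typeReference, multiplicity, isOrdered), the Range record, and the list of transitive templates are hypothetical; names are unqualified (no std:: prefix); templates take a single type argument; and containers not listed in Tables 5 and 6 (stack, queue) are treated like the listed ones. A real tool would work on the parser's type representation rather than on text.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Illustrative domain knowledge: templates assumed to be transitive.
bool isTransitive(const std::string& name) {
    return name == "list" || name == "vector" || name == "deque" ||
           name == "set" || name == "stack" || name == "queue" ||
           name == "auto_ptr";
}

// First identifier in s that is not a qualifier: "const Foo &" -> "Foo".
std::string firstTypeName(const std::string& s) {
    std::string token;
    for (char ch : s + " ") {
        if (std::isalnum(static_cast<unsigned char>(ch)) || ch == '_') {
            token += ch;
            continue;
        }
        if (!token.empty() && token != "const" && token != "volatile" &&
            token != "mutable") {
            return token;
        }
        token.clear();
    }
    return "";
}

// Type reference: strip symbols and qualifiers, and descend into the
// inner-most argument of transitive templates.
std::string typeReference(const std::string& expr) {
    const std::string::size_type open = expr.find('<');
    const std::string::size_type close = expr.rfind('>');
    if (open != std::string::npos && close != std::string::npos && close > open) {
        const std::string name = firstTypeName(expr.substr(0, open));
        if (isTransitive(name)) {
            return typeReference(expr.substr(open + 1, close - open - 1));
        }
        return name;
    }
    return firstTypeName(expr);
}

// Multiplicity bounds as text; "*" is the unbounded upper limit.
struct Range {
    std::string lower;
    std::string upper;
};

Range multiplicity(const std::string& expr) {
    const std::string::size_type open = expr.find('<');
    if (open != std::string::npos) {
        const std::string name = firstTypeName(expr.substr(0, open));
        if (name == "auto_ptr") return {"0", "1"};
        if (isTransitive(name)) return {"0", "*"};
    }
    const bool pointer = expr.find('*') != std::string::npos;
    const std::string::size_type lb = expr.find('[');
    const std::string::size_type rb = expr.find(']');
    if (lb != std::string::npos && rb != std::string::npos && rb > lb) {
        const std::string n = expr.substr(lb + 1, rb - lb - 1);
        if (n.empty()) return {"0", "*"};   // Foo [] and Foo *[]
        return {pointer ? "0" : n, n};      // Foo *[n] and Foo [n]
    }
    if (pointer) return {"0", "*"};         // Foo * and Foo **
    return {"1", "1"};                      // Foo and Foo &
}

// Ordering: only set is reported as unordered.
bool isOrdered(const std::string& expr) {
    const std::string::size_type open = expr.find('<');
    return open == std::string::npos ||
           firstTypeName(expr.substr(0, open)) != "set";
}

// Spot checks against rows of Tables 3 to 6; call from a test or main().
void checkTableRows() {
    assert(typeReference("const Foo &") == "Foo");
    assert(typeReference("const stack<Foo> &") == "Foo");
    assert(multiplicity("Foo &").lower == "1");
    assert(multiplicity("list<Foo>").upper == "*");
    assert(isOrdered("deque<Foo *>"));
}
```

Keeping the list of transitive templates in one function makes the embedded domain knowledge explicit: supporting another container or smart pointer means extending isTransitive and, where its bounds differ, multiplicity.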

3.3. Identifying parameter direction

UML Parameters share some features with UML attributes (e.g., type resolution) and as such can be reverse-engineered using similar mappings. However, UML parameters also encode information about how their values are transmitted to (and from) an operation. This information is called the parameter’s direction kind. UML defines four directions for parameter passing: in, out, inout and return. The direction determines whether or not the parameter will be used as an input to an operation, an output of the operation, both (i.e., in the case of an operation that changes the state or value of an input), or as a pure return value. Because C++ does not provide us with enough declarative granularity for this mapping, we define a list of mappings based on the parameter’s type and its qualifiers. These mappings are shown in Table 7.

In C++, arguments are passed to methods either by reference or by value. Because pass-by-value parameters result in local copies of the supplied arguments, they can easily be identified as in parameters. However, if the parameter is passed by reference, we need to examine the const-ness of the declared parameter. If a parameter declaration includes the const keyword, then it can be modeled as an in parameter. Otherwise, it can be modeled as an out parameter.
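A minimal sketch of this mapping follows, using the directions listed in Table 7 for plain, reference, and pointer parameters. The function name `parameter_direction` is hypothetical, and the footnoted pointer row of Table 7 that maps to out is deliberately not covered because it cannot be told apart from a declaration alone.

```python
import re


def parameter_direction(param_type: str) -> str:
    """Map a C++ parameter type to a UML direction kind, following Table 7."""
    decl = " ".join(param_type.split())  # normalize whitespace
    if "&" not in decl and "*" not in decl:
        return "in"  # pass-by-value: the callee works on a local copy
    # Only a const that qualifies the pointed-to or referenced type counts,
    # so inspect the text before the first '*' or '&'.
    head = re.split(r"[*&]", decl, maxsplit=1)[0]
    if "const" in head.split():
        return "in"  # const Foo & and const Foo * cannot modify the argument
    return "inout"  # Foo & and Foo * may be both read and written
```

Checking only the text before the first `*` or `&` is a design choice of this sketch: it keeps a declaration such as `Foo * const` (a constant pointer to a modifiable object) from being mistaken for an input-only parameter.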

Table 7
Mappings from C++ parameter declarations to UML parameter direction rely upon pointer, reference, and const qualifiers in the type specification

C++ declaration     UML parameter direction
Foo                 in
Foo &               inout
const Foo &         in
Foo *               inout
Foo *ᵃ              out
const Foo *         in

ᵃ While it is possible to use parameters of this type as inputs, API designers often use pointers to store the outputs of pointer manipulation (e.g., in-place memory allocation).

3.4. Identifying associations

Although UML associations are most often used to represent “has-a” relationships, they are sometimes employed to model semantic relationships between two different classes. C++ does not provide a syntactic concept for modeling these semantic links. However, it is fairly easy to discern the “has-a” relationships from the member variable declarations of a class. Our method for identifying associations in C++ relies heavily on the correct identification of C++ classifier types and declarative type information. We define a C++ association as a member variable with a type reference to a modeled UML class, but not a data type. We restrict classes and interfaces from being associated with data types because the semantics of that particular relationship are wholly encapsulated in the fact that the member variable is modeled as an attribute. Note that the derivation of the type reference used in this mapping must follow the mapping rules for type resolution.

The code in Fig. 5 shows two classes participating in associations. The mNamespace member is an obvious candidate. In the case of mOwnedElement we extract the inner type of the set member to define an association between Namespace and ModelElement.
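The association rule can be illustrated with a short sketch. The declared types given to mNamespace and mOwnedElement below are assumptions (the text only says that the latter is a set), the mName member is a hypothetical addition that shows a data type being excluded, and the regular-expression type resolution is a simplification that handles single-argument templates only.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    owner: str      # class declaring the member variable
    name: str       # member variable name
    type_decl: str  # declared C++ type as written in the source


def referenced_type(type_decl: str) -> str:
    """Resolve a declaration to the bare type name it refers to."""
    decl = type_decl
    # Unwrap single-argument templates such as set<ModelElement *>.
    while (m := re.fullmatch(r"\s*(?:\w+::)*\w+\s*<(.*)>\s*", decl)):
        decl = m.group(1)
    # Drop const qualifiers and pointer, reference, and array symbols.
    tokens = re.sub(r"\bconst\b|[*&\[\]]", " ", decl).split()
    return tokens[-1] if tokens else ""


def find_associations(members, classes, data_types):
    """Yield (owner, target, member name) for members referencing a class."""
    for member in members:
        target = referenced_type(member.type_decl)
        if target in classes and target not in data_types:
            yield (member.owner, target, member.name)


# Assumed declarations in the spirit of Fig. 5.
MEMBERS = [
    Member("ModelElement", "mNamespace", "Namespace *"),
    Member("ModelElement", "mName", "std::string"),  # hypothetical data type member
    Member("Namespace", "mOwnedElement", "set<ModelElement *>"),
]
CLASSES = {"ModelElement", "Namespace"}
DATA_TYPES = {"std::string"}
```

Calling `list(find_associations(MEMBERS, CLASSES, DATA_TYPES))` on this input reports the two associations between ModelElement and Namespace and skips the string member, which would remain a plain attribute.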

3.4.1. Aggregation

An association’s aggregation kind defines lifetime semantics for instances contained through the relationship. UML defines three types of containment semantics for associations: none, aggregate, and composite. The only types of aggregation kind that can be derived from the C++ grammar are aggregate and composite. The none variety of aggregation kind represents purely semantic links between classes and is of little interest in this context.

Determining the aggregation kind of a property is difficult because C++ provides very little language support for declaring or embedding shared and composite association semantics. For example, a member variable pointing to another object could be a composed member of its enclosing class (as in the private implementation idiom), or it could simply be stored for convenience and shared between a number of other objects. The use of smart pointers allows the developer to embed specific lifetime semantics into a program. Recognizing these beacons can drastically reduce the amount of effort spent deducing the proper aggregation kind of these elements.

Much like the determination of multiplicity, the determination of aggregation kind has no simple set of rules. C++ provides some declarative information that can be used in the mapping such as pointer, reference, and array symbols. Also, transitive classes such as containers and smart pointers can be used in this mapping. Table 8 shows mappings of declared member variable types into UML aggregation semantics. Additionally, we require the attribute being analyzed to participate in an association before aggregation semantics can be defined.

Fig. 5. The ModelElement and Namespace classes participate in associations according to our heuristic. They both define member variables referencing another class.

Unlike smart pointers, container classes can have varied semantics. Most STL container classes define transitive aggregation semantics. This is to say that aggregation kind can be deduced from the inner type reference in the template expression according to the simple type expression mappings in Table 8. Specific domain knowledge of these classes must be used to deduce aggregation semantics.
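The following sketch applies the Table 8 mappings, including the transitive treatment of containers. The function name `aggregation_kind` and the particular set of containers treated as transitive are assumptions, and the prerequisite that the attribute already participates in an association is not modeled here.

```python
import re

# Smart pointer mappings taken from Table 8.
SMART_POINTERS = {
    "auto_ptr": "composite",
    "scoped_ptr": "composite",
    "shared_ptr": "aggregate",
}
# Containers assumed to be transitive for the purpose of this sketch.
TRANSITIVE_CONTAINERS = {"vector", "list", "deque", "set"}

TEMPLATE = re.compile(r"\s*(?:\w+::)*(\w+)\s*<(.*)>\s*")


def aggregation_kind(type_decl: str) -> str:
    """Deduce the UML aggregation kind of a member variable declaration."""
    match = TEMPLATE.fullmatch(type_decl)
    if match:
        name, inner = match.group(1), match.group(2)
        if name in SMART_POINTERS:
            return SMART_POINTERS[name]
        if name in TRANSITIVE_CONTAINERS:
            # Transitive: the kind follows from the inner type reference.
            return aggregation_kind(inner)
        return "unknown"  # needs domain knowledge about the template
    # Any level of indirection is treated as aggregation (Table 8, note a).
    if "*" in type_decl or "&" in type_decl:
        return "aggregate"
    return "composite"  # values and arrays of values are owned
```

For example, `aggregation_kind("vector<Foo *>")` follows the inner pointer and yields "aggregate", whereas `aggregation_kind("list<Foo>")` yields "composite".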

3.5. Identifying realization

In UML, inheritance is represented by generalization relationships, which can be easily determined through C++ inheritance specifications. However, the implementation (realization) of interfaces poses a slightly different problem. Although the syntactic mechanisms for realization inheritance are the same in C++, UML provides additional syntax that can be used to express interface semantics. In UML, realization is a dependency between a class and an interface, expressing the fact that the class implements the contract specified by the target interface.

Table 8
Mappings from C++ declarations to UML aggregation semantics rely upon pointer, reference, and array symbols within the type declaration

C++ member variable declaration     UML aggregation kind
Foo                                 composite
Foo *ᵃ                              aggregate
Foo &ᵇ                              aggregate
Foo []                              composite
Foo *[]                             aggregate
auto_ptr<>ᶜ                         composite
boost::scoped_ptr<>                 composite
boost::shared_ptr<>                 aggregate

Additional information can be acquired from transitive classes.
ᵃ We consider any level of indirection to be aggregation. Multiple levels of indirection generally imply the use of arrays, C-vectors or C-matrices.
ᵇ Although it is typically rare to define classes with reference member variables, it is possible. This requires an instance of the referenced type be passed as a parameter to a suitable constructor, and that the lifetime of the referenced object must exceed that of the container.
ᶜ The auto_ptr<> class allows ownership to be transferred so we cannot determine composition without additional analysis.

Our method for recognizing realization relationships relies on the ability to correctly determine classifier types. If a class implements all the abstract methods of an interface, then a realization relationship can be created between the class and the interface. Partial realization only results in an abstract base class derived from the interface. We might note that realization is not a replacement for generalization. The class is still derived from an interface, so the generalization relationship could also be modeled.
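The realization test reduces to a set comparison between the abstract methods of the interface and the methods the derived class implements. The sketch below is a simplification that matches methods by name only and ignores signatures; the function name and the example method names are hypothetical.

```python
def relate_to_interface(implemented, abstract_methods):
    """Classify the link from a derived class to an interface.

    Returns the relationship kind together with the abstract methods the
    class leaves unimplemented.
    """
    missing = sorted(set(abstract_methods) - set(implemented))
    if missing:
        # Partial realization: the class is itself an abstract base class.
        return ("generalization", missing)
    return ("realization", [])


# Hypothetical interface with two abstract methods.
SHAPE_INTERFACE = ["draw", "area"]
```

With this input, a class implementing both `draw` and `area` is reported as a realization, while a class implementing only `draw` stays an abstract class related to the interface by generalization.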

4. Implementation

The pilfer reverse engineering tool is currently implemented in the Python programming language. This was chosen for a number of reasons. First, it allows the application to be built and modified quickly, allowing developers to modify or experiment with the given mappings. pilfer leverages two key technologies to implement its reverse engineering capabilities: srcML¹ and the Open Modeling Framework (OMF).² These technologies are explained in Sections 4.1 and 4.2, respectively. In addition to the software used to implement the source code parsing and UML modeling, pilfer implements the mappings described above. Additional analysis components are also present in the implementation, specifically for the detection of attributes from C++ member variables.

The architecture and workflow for pilfer are shown in Fig. 6, which describes the interaction between the different components of the tool chain. Specifically, srcML is used to produce an XML representation of the source code. pilfer uses the srcML output to construct an abstract semantics graph (ASG), perform semantic analysis and execute the mappings over the resulting graph. Finally, UML model elements are constructed and serialized through the OMF. This section describes these components in detail.

4.1. srcML

Rather than implementing another C++ parser specifically for the purpose of implementing pilfer, we have instead turned to an existing, successful application for fact extraction. srcML (SouRce Code Markup Language) [7,8,27] is an XML representation that supports document and data views of source code. The source code document is preserved within the XML format by retaining all of the lexical information (e.g., white space, preprocessor directives, and comments) within the original file. A data view is provided by the srcML format by the addition of XML elements representing the syntactical structures of the C++ programming language (e.g., functions, classes, statements, etc.). A srcML translator, src2srcml, has been developed to transform C++ into its srcML representation.

We use srcML as a parsing platform for a number of reasons. Because srcML is an XML format, we can leverage any number of XML APIs to query or explore the program during reverse engineering. For example, we use a DOM tree (actually libxml2) to walk the AST of source files and produce a C++-specific model. Moreover, srcML is a lightweight (i.e., coarse-grained) markup language as opposed to, say, that of GCCXML.³ This means the size of intermediate files is only somewhat larger than the original source files (usually by a factor of less than 5), allowing pilfer to quickly process the marked up source code. Intermediate formats like GCCXML encode the entire AST in XML and produce files that can be several orders of magnitude greater than the original source code, and analyses based on these products will run much slower.

Fig. 6. The pilfer tool architecture. Source code is transformed into srcML, mapped into UML, and output as XMI.

¹ See www.sdml.info/projects/srcml/ for details on srcML.
² See www.sdml.info/projects/omf/ for details on the OMF.
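To illustrate how an XML markup of source code supports both a data view and a document view, the sketch below queries a small hand-written fragment with Python's standard `xml.etree.ElementTree` module. The fragment is a simplified, namespace-free stand-in written for this example; it is not actual src2srcml output, and the element names and nesting should be read as assumptions.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified markup (assumed structure, not real srcML output).
# Source text is kept between the elements, so nothing lexical is lost.
MARKUP = """
<unit>
  <class>class <name>Namespace</name> {
    <decl_stmt><decl><type>set&lt;ModelElement *&gt;</type> <name>mOwnedElement</name></decl>;</decl_stmt>
  };</class>
</unit>
"""

root = ET.fromstring(MARKUP)


def member_variables(unit):
    """Data view: yield (class name, declared type, member name) triples."""
    for cls in unit.iter("class"):
        class_name = cls.findtext("name")
        for decl in cls.iter("decl"):
            yield class_name, decl.findtext("type"), decl.findtext("name")


def source_text(unit):
    """Document view: concatenate all text to recover the original source."""
    return "".join(unit.itertext())
```

The data view yields the facts needed by the mappings, while the document view shows that the original source text can still be read back from the same tree.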

4.2. The open modeling framework

There are few viable solutions to producing UML models without tying into an existing, monolithic modeling application. The open modeling framework is a C++ implementation of both the Meta-Object Facility (MOF) and UML metamodels. In this regard, it is similar to both the Java Metadata Interface (JMI)⁴ and the Eclipse Modeling Framework (EMF),⁵ but adds comprehensive support for UML modeling. Moreover, the OMF provides a model-centric view of UML that focuses on models and their components rather than the content or management of diagrams. This allows developers to focus on the inspection, validation, or manipulation of models rather than trying to dovetail analysis-intensive tools to application framework interfaces provided by the existing graphical applications such as Rational Rose, Microsoft Visio, and Visual Paradigm. This feature has significant impact in the domain of reverse engineering tools where the focus is on source code analysis and model construction rather than the rendering of user interfaces and the layout of UML diagrams. Additionally, scriptable Python interfaces provided by the OMF enable integration with other (non-UML) technologies such as grammar parsers, databases, and XML.

We use the OMF as pilfer’s internal representation of UML. Because the OMF supports reading and writing of multiple versions of XMI, the resulting models can (theoretically) be exchanged between modeling applications for visualization, although the interoperability of XMI dialects is a well-documented problem [19].

³ See www.gccxml.org/ for details on GCCXML.
⁴ See java.sun.com/products/jmi/ for details on the JMI.
⁵ See www.eclipse.org/emf/ for details on EMF.

4.3. Implementing pilfer

The primary purpose of the pilfer software is to provide a framework that allows the development of flexible and replaceable mappings between C++ and UML. pilfer does this by iterating over the AST provided by srcML inputs and constructing an internal, C++-specific, abstract syntax graph (ASG). Additionally, pilfer implements lightweight semantic analysis of this graph. Specifically, pilfer augments the ASG with C++ language concepts such as the instantiability of the class, whether or not it is default- or copy-constructible, assignable, etc. Many of these concepts are critical to the implementation of UML mappings.

Once this model is obtained, the process of mapping it to UML is fairly straightforward. It begins by iterating over the elements of the model and executing specific mappings based on the type of element. Each mapping is implemented as a single function that performs one or more responsibilities. For example, the UML class mapping examines the list of constructors, the destructor, and operators to determine if the class is a data type. If not, it also examines the list of methods to determine whether or not the class is a UML interface. Finally, the mapping will create a UML element corresponding to the deduced classifier.

Many of the other mappings provided with the pilfer software are required to analyze type information to deduce UML semantics. These are implemented using simple string matching techniques. For example, the multiplicity of an attribute can be easily determined by searching for ‘*’, ‘[’, or ‘]’ characters within the type specifications. In fact, this technique is applied to many mappings implemented by pilfer.
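A string-matching check of this kind might look as follows. The specific multiplicity strings returned for plain values ("1") and for arrays ("0..*") are assumptions of this sketch, and the `pointers_are_optional` flag is a hypothetical name for pilfer's option that interprets pointers as 0..1 rather than 0..*.

```python
def multiplicity(type_decl: str, pointers_are_optional: bool = False) -> str:
    """Guess a UML multiplicity from symbols in a C++ type specification."""
    # Array brackets indicate storage for several instances.
    if "[" in type_decl or "]" in type_decl:
        return "0..*"
    # A pointer may refer to a single optional object or to a C-array,
    # so its interpretation is left to a configuration flag.
    if "*" in type_decl:
        return "0..1" if pointers_are_optional else "0..*"
    # Anything else is a single contained value.
    return "1"
```

The flag matters for code bases that use pointers mainly to reference single objects, where `multiplicity("Foo *", pointers_are_optional=True)` gives the more appropriate 0..1.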

Deducing information from template classes is somewhat more complicated. In order to determine the semantics of a template instance, pilfer creates a “template instantiation tree”. This is built such that the leaf nodes are represented by non-template types. A traversal of the tree yields the names of templates that contribute to the semantics of the declaration and the underlying types of the declaration as well. The template classes found within the tree are used to “guide” the traversal and how the semantics are extracted from the tree. For example, the std::list class uses its first template parameter as its value type (the type being stored), whereas the std::map class uses the second parameter as its value type. We might also note that template classes may provide somewhat ambiguous type and semantics information. The std::pair class is one such template that resists the application of type and semantics as described above.
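One way to picture a template instantiation tree is sketched below. The `TypeNode` class, the hand-rolled parser, and the `VALUE_ARGUMENT` table are assumptions made for illustration; only the choice of the first parameter for std::list and the second for std::map comes from the text.

```python
from dataclasses import dataclass, field


@dataclass
class TypeNode:
    name: str                                 # template or plain type name
    args: list = field(default_factory=list)  # child nodes; empty for leaves


def parse_type(text: str) -> TypeNode:
    """Parse an expression such as 'std::map<K, std::list<V> >' into a tree."""
    text = text.strip()
    start = text.find("<")
    if start == -1:
        return TypeNode(text)  # leaf: a non-template type
    body = text[start + 1 : text.rindex(">")]
    args, depth, current = [], 0, ""
    for ch in body:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        if ch == "," and depth == 0:
            args.append(current)  # split only on top-level commas
            current = ""
        else:
            current += ch
    args.append(current)
    return TypeNode(text[:start].strip(), [parse_type(a) for a in args])


# Which template argument holds the stored value type.
VALUE_ARGUMENT = {"std::list": 0, "std::map": 1}


def value_type(node: TypeNode):
    """Walk value-type arguments to a leaf; return (templates seen, leaf type)."""
    templates = []
    while node.args:
        templates.append(node.name)
        node = node.args[VALUE_ARGUMENT[node.name]]
    return templates, node.name
```

A template that is missing from `VALUE_ARGUMENT`, such as std::pair, raises a `KeyError` in this sketch, which mirrors the observation that some templates resist this kind of traversal.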

Additionally, pilfer allows the use of configuration options to control several of its implemented mappings. pilfer supports configuration options for the detection of design-level attributes (discussed in Section 4.4), the interpretation of pointer multiplicities, and the recovery of composite associations. The options for controlling attribute recovery allow the user to activate (or deactivate) the activity. When not activated, pilfer will simply model member variables as UML attributes. The option for interpreting the multiplicity of pointers will create 0..1 multiplicities for pointers. It is common practice to use pointers to reference single objects rather than allocate C-arrays, so 0..* multiplicities may be inappropriate. The use of this option depends on knowledge of the programming conventions used in the target system. Finally, the option for controlling association generation allows pilfer to restrict the interpretation of associations. In some cases, it may be useful to limit the recovery of associations to only member variables that are references or pointers. If this option is used, non-referential member variables are not considered when attempting to determine if an attribute should be modeled as an association. However, this is disabled by default.

⁶ See www.graphviz.org for details on GraphViz.

4.4. Detecting design-level attributes

In many cases, reverse engineering tools create too much information. This is to say that the mappings implemented by such tools generate detailed listings of a class’s implementation rather than attempting to present more abstract, design-level views of the system. To address this issue, pilfer can automatically detect design-level attributes of classes. To do this, pilfer will only model member variables that are associated with a set of accessor and mutator methods. Moreover, the methods associated with that attribute will be excluded from the list of methods being modeled.

The implementation of this method employs a sophisticated information analysis technique: formal concept analysis [10,11,26]. Here, concept analysis is used to find groupings of trivial accessors and mutators in C++ methods, allowing us to deduce the design-level attributes from the resulting concept lattice. The use of concept analysis allows us to reduce the information being reverse engineered into UML, and significantly improves the quality and readability of resulting class diagrams. Moreover, there is little to no loss of design-level semantics in the resulting models. This is to say that performing this reduction on a C++ implementation results in a semantically appropriate UML model.

Concept analysis operates on a relationship between a set of objects and the set of attributes that each object is described by. The algorithm computes the maximal set of attributes associated with a set of objects. Each maximal subset of objects and attributes is called a concept. Concepts are related by the attributes that each object shares, creating a lattice of related concepts.

In order to identify abstract attributes of C++ classes, we use concept analysis to identify cohesive units of functionality clustered around attributes. For our purposes, we define the attribute set as the set of member variables declared within a class, and the object set as the member functions defined within the class. The only relationship used is that in which a method in the object set reads or writes an attribute value in the attribute set. In other words, we are interested in which member functions use which member variables.

The examples in this section are based on the HippoDraw application (discussed in Section 5). Table 9 shows a list of member variables and functions from HippoDraw’s Range class and their usage table. This table was constructed for this example by examining which member functions used which member variables and removing the “m_” from each member variable. No distinction is made as to the actual type of usage (read, written, returned, or passed as an argument). This attribute usage is then used to produce a concept lattice according to the concept analysis algorithm. Each concept consists of an extent (the objects in each concept) and its intent (the attributes in each concept) that each function uses. The extent and intent of each concept in the Range class are shown in Table 10. Information in this table is used to generate the lattice shown in Fig. 7. Because the extents are required to be maximal collections of objects sharing common attributes, the resulting concepts represent potential cohesion between member functions around a set of member variables. Fig. 7 illustrates the relationship between member functions using sets of common attributes. It can be seen that concepts in the first tier of the lattice (i.e., those labeled max, min, pos, and empty) actually represent accessors and mutators for the given member variables. Using this lattice, we can easily define a heuristic for automatically identifying the clusters of mutators and accessors of a member variable. Essentially, we treat any concept in the first tier with exactly one member variable and one or more functions as an abstract class property. When such an attribute is found, it is modeled as a UML attribute. The member functions representing the set of accessors and mutators are removed from the list of UML operations to be modeled. Note that member variables that do not meet these criteria are not modeled in UML because they represent implementation level attributes.

The concept analysis capability is implemented as an optional analysis component of pilfer. This is to say that pilfer can be run with or without using this technique, the latter producing implementation views of the source code. Specifically, pilfer traverses the UML model and runs the concept analysis algorithm on every UML class, data type, and interface. The implementation is built around Lindig’s implementation [26] via its Python bindings. When used, pilfer can generate dot (GraphViz⁶) files describing the concept lattices such as the one pictured in Fig. 7.
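The heuristic can be reproduced on the Range class usage data of Table 9 with a few lines of Python. This sketch does not use Lindig's implementation; it substitutes a naive concept computation that closes the function usage sets under intersection, which is adequate only for small contexts. The names `USAGE`, `concepts`, and `design_level_attributes` are introduced here for illustration.

```python
from itertools import combinations

# Which member variables each member function of Range uses (Table 9).
USAGE = {
    "low": {"min"}, "setLow": {"min"},
    "high": {"max"}, "setHigh": {"max"},
    "pos": {"pos"}, "setPos": {"pos"},
    "setRange": {"min", "max", "pos"},
    "setLength": {"min", "max"},
    "includes": {"min", "max"},
    "excludes": {"min", "max"},
    "fraction": {"min", "max"},
    "setIntersect": {"min", "max", "pos"},
    "setUnion": {"min", "max", "pos"},
    "setEmpty": {"empty"},
    "numberOfBins": {"min", "max"},
}


def concepts(usage):
    """Return every formal concept as an (extent, intent) pair of frozensets."""
    attributes = frozenset().union(*usage.values())
    intents = {attributes} | {frozenset(v) for v in usage.values()}
    changed = True
    while changed:  # close the intents under pairwise intersection
        changed = False
        for a, b in combinations(list(intents), 2):
            if a & b not in intents:
                intents.add(a & b)
                changed = True
    return [
        (frozenset(f for f, used in usage.items() if intent <= used), intent)
        for intent in intents
    ]


def design_level_attributes(usage):
    """Map each design-level attribute to its accessor and mutator functions."""
    found = {}
    for extent, intent in concepts(usage):
        if len(intent) != 1:
            continue
        # Keep only functions that use this variable and nothing else.
        own = sorted(f for f in extent if usage[f] == intent)
        if own:
            (variable,) = intent
            found[variable] = own
    return found
```

Note that the full extent of a concept such as the one for min contains every function that touches min; filtering down to the functions that use only that variable reproduces the groupings listed for the first-tier concepts in Table 10. The functions returned for each attribute are the ones that would be dropped from the list of modeled operations.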

Table 9
The usage of member variables (attributes) by member functions (objects) in HippoDraw’s Range class

Member functions     Member variables (attributes)
(objects)            min    max    pos    empty
low                  X
setLow               X
high                        X
setHigh                     X
pos                                X
setPos                             X
setRange             X      X      X
setLength            X      X
includes             X      X
excludes             X      X
fraction             X      X
setIntersect         X      X      X
setUnion             X      X      X
setEmpty                                  X
numberOfBins         X      X

Fig. 7. This concept lattice represents the usage of member variables by functions. Only the sub-concepts of the top concept define abstract properties.

Table 10
Each concept in the resulting lattice consists of a maximal set of objects and attributes

Concepts     Object and attribute sets
top          {{low, setLow, high, setHigh, pos, setPos, range, setRange, setLength, includes, excludes, fraction, setIntersect, setUnion, setEmpty, numberOfBins}, ∅}
c0           {{low, setLow}, {min}}
c1           {{high, setHigh}, {max}}
c2           {{pos, setPos}, {pos}}
c3           {{setEmpty, empty}}
c4           {{setLength, includes, excludes, fraction, numberOfBins}, {min, max}}
c5           {{setRange, setUnion, setIntersect}, {min, max, pos}}
bottom       {∅, {min, max, pos, empty}}

Experiments with this technique have shown that this approach is very successful at finding design-level attributes. In fact, we found that this approach also successfully reduces delegate interfaces of composed member variables. This is to say that it can eliminate portions of a class interface that simply delegate to a contained member variable. We conducted two experiments to validate the applicability of this technique. These experiments were performed by reverse engineering two bodies of source code: the OMF’s implementation and HippoDraw. In both cases, we ran pilfer against the source code to produce both XMI documents and UML diagrams (generated by GraphViz).

The first test was conducted against the Open Modeling Framework’s implementation of the Meta-Object Facility. To perform this test, we fed pilfer the source code corresponding to the implementation of that metamodel. The goal of this experiment was to recover UML diagrams that correspond to the diagrams shown in the MOF specifications [30]. Specifically, we sought to show that the UML diagrams contained the same number and types of attributes present in this specification. In this experiment, pilfer performs flawlessly, identifying 100% of all design-level attributes (recall) with 100% precision (no false positives). A UML diagram of a subset of the resulting model is shown in Fig. 9. The listings of attributes and operations for each class are identical to those in the diagrams given in the OMG specification. It should be noted that some of the associations and type references are missing from the model. This is an artifact of the OMF’s extensive use of dynamically typed containers (such as the OMF::Set class shown in Fig. 8). However, the model is still fairly accurate in its depiction of the Meta-Object Facility. Additionally, pilfer-generated models retain the UML attributes from which the associations are derived. While this is not incorrect (as per the UML specifications), it is certainly atypical for modeling purposes. We do not feel that modeling attributes for these associations detracts from the overall quality or readability of the resulting model.

Fig. 8. A snippet of the OMF’s ModelElement class from the Meta-Object Facility. Dynamically typed containers such as OMF::Set impede pilfer’s ability to correctly identify associations.

Fig. 9. UML diagram of a subset of classes from the OMF implementation of the Meta-Object Facility (MOF).

In the second experiment, we ran pilfer against HippoDraw to produce an XMI model and extracted a list of UML attributes. We then compared these attributes against the member variables of classes in the HippoDraw source code to determine whether or not the mapping produced adequate results. To do this, we examined the source code of HippoDraw and classified member variables as attributes if a set of trivial accessor and mutator functions for those member variables was found. This classification was used to produce a list of attributes that was then compared against the list derived from the XMI model. However, this process of classification was fairly naive and only included the most trivial of accessor/mutator groupings (i.e., single line get and set methods). More complex attributes such as delegating methods were not considered as mutators. This restrictive classification was imposed in order to ensure consistency during the classification process.

In this experiment, pilfer recalled 96% of all attributes with a precision of 62%. In other words, the technique identified almost all of the attributes that we had manually classified as attributes, but admitted a significantly larger number of false positives than the first experiment. However, the lower precision is due to our method for manually classifying design-level attributes, which is admittedly restrictive. Inspection of these false positives conclusively shows that pilfer is including delegating methods in the accessor/mutator groups and producing UML attributes for the corresponding member variables.

5. A comparison of results

In order to evaluate the effectiveness of our mappings and analyses, we used pilfer to reverse engineer HippoDraw⁷ (version 1.13.1), an open-source tool for information visualization. HippoDraw is a medium-sized C++ application containing about 230 classes and consists of 88 KLOC. We compared the results produced by pilfer against those produced by Doxygen, Visual Paradigm for UML (2005), and Microsoft Visio 2003 (used as a plugin for Visual Studio .NET 2003). We used Doxygen as a control in this comparison because it provides an accurate description of the source code. We had originally planned to conduct this experiment with two other applications supporting reverse engineering: Rational Rose C++ Edition (last version) and Umbrello 1.3. However, in both cases, the applications were unable to successfully parse the source code, making it impossible to evaluate their output. Rational Rose, in particular, failed to parse standard includes from Microsoft’s implementation of the STL, and Umbrello crashed intermittently during parsing or model construction, making it an unviable candidate for this comparison. More recent versions of Rational (e.g., XDE) and Umbrello may work better. Our comparison is primarily based on the number of structural elements recovered through reverse engineering (e.g., classes, attributes, and generalizations).

⁷ See www.slac.stanford.edu/grp/ek/hippodraw/ for details on HippoDraw.

Although the quantitative measures generated from these comparisons give us some insight into the completeness of the reverse engineering systems, it is difficult to ascertain the quality of the recovered models. For example, different people might have different preferences on what information is recovered or how UML semantics are inferred from the source code. This makes it nearly impossible to create a qualitative measure against a “reference model” of the application. Moreover, each modeling application expresses its own view of the source code. Those mappings from C++ to UML might not align with the mappings the user expects.

pilfer includes a number of options that allow for specific control of the mappings from C++ to UML. For example, we know from experience that HippoDraw tends to use pointers to refer to single objects rather than C-arrays. As such, we configured pilfer to treat pointer member variables as having 0..1 multiplicity instead of 0..*.

An example of a UML class generated from pilfer is shown in Fig. 10. The corresponding class generated by Visual Paradigm contains 18 attributes and 38 operations. The amount of information present in the Visual Paradigm class is prohibitively large and significantly reduces the readability (and therefore comprehensibility) of diagrams in which the class appears. Moreover, the Visual Paradigm parser fails to correctly parse template expressions (i.e., the m_ticks member, shown as ticks in Fig. 10). Also, Visual Paradigm produces no multiplicity information about the reverse-engineered attributes.

Fig. 10. The UML representation of the AxisModelBase class produced by pilfer.

In order to further compare the results of our experiment, we have listed the number of elements recovered through the various reverse engineering tools. These results are shown in Table 11.

As mentioned, we used Doxygen as a control in the comparison. We used it to generate XML output for the classes in the system and then extracted the relevant facts using a simple XSL transform. Some of the information (such as STL classes) was removed from the resulting data in order to focus the comparison on only those classes defined within the HippoDraw application. Using Doxygen allows us to show the improvement of abstraction recovery for the various reverse engineering tools. In all other cases, scripts were written against the XMI exports of each method to extract the number of elements recovered.

Visual Paradigm was unable to parse elements in the qt subdirectory of HippoDraw, possibly accounting for the lower number of classes. The excessively large number of data types identified by Visual Paradigm is actually an error in their internal XMI production algorithms. They generate a new data type element for every instance of a C++ type reference. Included in this set of data types are all of the C++ primitive types (int, float, etc.). Also, the large number of parameters reflects the application’s tendency to create return parameters for nullary (void) functions. Obviously, it is difficult to understand, in detail, how Visual Paradigm determined its rules for generating associations. It also generated a number of class-to-class dependencies, but the rationale for generating those relationships is equally unclear.

Microsoft Visio performs significantly better (especially in terms of parsing), but also includes a large number of data types. This list of data types includes the entire set of 65 primitive types from C++, C#, IDL, and VisualBasic. Also, Visio models typedefs as UML data types rather than correctly creating import elements. Also, these measurements specifically exclude classes that Visio had rendered as “external” to the project. External classes included template instantiations and undefined classes. Unfortunately, Visio made no attempt to recover any kind of associations between elements.

Table 11
The number of UML elements recovered through various reverse engineering tools

Concept being reverse-engineered     Reverse engineering tool
                                     Doxygen
Entities
  Classes                            250
  Data types                         0
  Interfaces                         0

  Attributes                         889
  Operations                         7238
  Parameters                         7405

Relationships
  Association                        0
  Shared                             0
  Composite                          0

  Generalizations                    201
  Realizations                       0

Because pilfer, is configured differently than the othertools (i.e., its alternative method for identifying attributes),it produced a model with a significantly lower number ofmember variables and operations (as illustrated byFig. 10). We might also note that the data types recoveredby pilfer are also C++ classes (giving a total of 237 C++classes). The data types are not enumerations or typedefs.

For the most part, the different reverse engineering toolsare in agreement about the number of classes. Had VisualParadigm succeeded in reverse engineering the qt subdirec-tory, it would have presented numbers similar to MicrosoftVisio. We might suggest the Visio presents a low number ofclasses because it was not configured to reverse engineer allof the subdirectories in HippoDraw; the content beingreverse-engineered is based on the Visual Studio projectwhich may exclude a number of directories. A significantamount of variation also likely stems from each applica-tion’s internal mechanism for handling templates instantia-tions – which is not always documented and much lessobvious. For example, an application treating templateinstances as distinct classes in the system will obviouslyreport more classes, attributes an operations.

In general, support for recovering associations, specifi-cally aggregation semantics, in the studied tools is insuffi-cient. It appears the tools would rather avoid makingdecisions about association semantics than possibly makea wrong decision. Among the systems studied only pilfer

provides any significant information about associationsbetween classes.

In terms of performance, pilfer does quite well withrespect to the work of others who reported performanceinformation in the literature and those we could measure.While this is certainly not an exact comparison betweenthese tools, it is given to provide a general sense of therun time performance rather than a normative benchmark.In addition to timing comparisons against the applicationsabove, we also compare pilfer against g4re, [23] an interop-

tools

Visual Paradigm Microsoft Visio pilfer

202 212 219391 74 18

0 0 0541 626 302

2809 2934 20335731 4198 3088

92 0 770 0 650 0 12

136 190 1600 0 0

Page 15: Recovering UML class models from C++: A detailed explanation

Table 12Approximate timing results for the execution of various reverse engineer-ing tools against different bodies of source code

Program Size (KLOC) Time

Visio 102 2 min 14 sVisual Paradigm 88 >10 ming4re vs. FOX 125 �20 ming4re vs. Jikes 70 �11 minpilfer 102 1 min 42 s

226 A. Sutton, J.I. Maletic / Information and Software Technology 49 (2007) 212–229

erable reverse engineering tool based on a GCC intermedi-ate representation (translation unit files). To obtain ourperformance numbers, we ran both Microsoft Visio andVisual Paradigm on an Intel P4 1.7 GHz machine with1 GB of RAM. pilfer was run on an Intel P4 2.8 GHzmachine with 1 GB of RAM, running Linux. g4re wasreported to run on an Athlon64 3000+ with 1 GB ofRAM running Linux. The timing results are shown inTable 12.

Unfortunately, the comparisons are skewed for a number of reasons, especially the use of different platforms, different operating systems, and different implementation languages. Moreover, not all times are reported for the same inputs. For example, Visual Paradigm was unavailable (due to licensing) at the time of this performance test, so we could not run it against the same version of HippoDraw (i.e., 1.16.0) that Microsoft Visio and pilfer were tested against. Also, the recorded time is only an approximation because early experiments did not record timing performance accurately. The g4re application was tested against two systems: FOX, a C++ GUI toolkit, and Jikes, an experimental Java compiler.

Microsoft Visio and pilfer perform at roughly the same level, and we can imagine that Visio would perform better if run on a similarly equipped system. Of the tools we ran ourselves, Visual Paradigm took the longest, perhaps because of its Java roots or because of memory usage problems. Despite the slower times of g4re, its analysis times are actually quite good. The vast majority of the time spent by this application is in processing the output of GCC’s GENERIC module. It produces XML files containing the abstract syntax graph (ASG) of each compiled translation unit, resulting in an enormous amount of data. Contrast this with the output size of srcML, which is typically less than five times the original data size, resulting in significantly less XML processing.
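To make the figures in Table 12 easier to compare, the following sketch converts each row to an approximate throughput in KLOC per minute. The values are copied from Table 12; the lower bound for Visual Paradigm and the approximate g4re times are treated as exact only for the sake of the arithmetic, and the helper name kloc_per_minute is our own. Because the tools ran on different hardware and inputs, the resulting ratios are indicative at best.

```python
# (program, size in KLOC, run time in seconds), values taken from Table 12.
# ">10 min" and the approximate g4re times are treated as exact here,
# which is an assumption made only for this illustration.
TIMINGS = [
    ("Visio", 102, 2 * 60 + 14),
    ("Visual Paradigm", 88, 10 * 60),
    ("g4re vs. FOX", 125, 20 * 60),
    ("g4re vs. Jikes", 70, 11 * 60),
    ("pilfer", 102, 1 * 60 + 42),
]


def kloc_per_minute(size_kloc, seconds):
    """Throughput in thousands of lines of code processed per minute."""
    return size_kloc * 60.0 / seconds


throughput = {
    name: round(kloc_per_minute(size, seconds), 2)
    for name, size, seconds in TIMINGS
}

for name, rate in throughput.items():
    print(f"{name:16s} {rate:6.2f} KLOC/min")
```

Under these assumptions pilfer processes roughly 60 KLOC per minute and Visio roughly 46, which is consistent with the observation that the two perform at roughly the same level.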

6. Related work

The prevailing method of integrating modeling and reverse engineering tools is to build reverse engineering parsers and analyses into existing UML modeling applications. Examples include Rational Rose, Together, Umbrello, Visual Paradigm, and ArgoUML. However, IDEs are beginning to recognize the importance of providing a visual medium for source code and have begun to include UML modeling functionality. Both Microsoft’s Visual Studio 2005 and Apple’s XCode2 support the ability to model classes in UML-like diagrams.

Most advances in reverse engineering research have had little impact on industrial reverse engineering tools, even though research technologies address many of the shortcomings of industry applications. For example, numerous approaches have been described for reverse engineering associations and related information from Java. Barowski and Cross [2] propose a technique to extract dependency information from Java class (bytecode) files. The approach is used to detect inheritance, realization, and associations between classes. However, this definition of dependency is overly general and does not correspond to the more specific relationships of UML. Moreover, the approach is not capable of expressing multiplicity or aggregation semantics of the recovered associations. Ref. [16] provides a much stronger definition of associations at both the design and implementation level for both aggregate and composite associations. An algorithm for the detection of these associations is derived from a formal analysis of their properties and applied to a number of Java systems, performing quite well (96% recall and 75% precision). The approach described in [18] applies heuristics to static and semantic analysis of Java class files. This work is very much like our own approach, as it constructs object models (class models) using lightweight syntactic and semantic analysis. However, their implementation, Womble, operates only on Java bytecode, whereas pilfer operates on C++ source code.

In [15], UML associations, multiplicities, and aggregation semantics are inferred by examining the source code. Additionally, this approach defines one of the few techniques for finding bi-directional associations. Ref. [21] uses both static and dynamic analysis to accomplish many of the same goals. Yet another approach to identifying the multiplicity of associations is given in [20]. To our knowledge (possibly because of poor documentation), none of these approaches has been integrated into any of the prevalent reverse engineering tools.

Many of the analysis methods in the works listed above are not applicable to C++, especially those operating on bytecode. Unfortunately, there is significantly less related research on reverse engineering methods for C++. In [41], type information is reverse engineered from C++ by examining usage patterns of so-called weakly-typed containers. This approach can be used to identify which classes are used in containers if the container collects instances of abstract base classes such as the ubiquitous Object class. Although [42,43] focus more on recovering behavioral aspects of a program, they illustrate additional techniques for reverse engineering information from C++. In these approaches, an object flow graph is coupled with other forms of analysis to produce object and interaction diagrams, respectively. Ref. [29] takes an approach similar to ours: the implementation of a C++ reverse engineering environment. However, this work focused on the modeling of C++ syntax in UML rather than attempting to recover the design abstractions involved. Moreover, little is said about the rules used to recover multiplicities or aggregation semantics.

Formal concept analysis is being applied to a number of different areas in software engineering, many of which are surveyed in [39]. In legacy systems, formal concept analysis has been applied to preprocessor configurations in order to better understand and help maintain complicated portability configurations [24,35]. Concept analysis is frequently used to construct object-oriented models from legacy or procedural code. Examples include [4,32,33,44]. It has also been used to analyze the coupling and cohesion between modules in [26].

Another popular application of concept analysis is to reorganize or analyze class hierarchies in object-oriented software [13,14,34,35]. In [40], concept lattices are used to detect design patterns by analyzing inheritance relationships. Additionally, concept lattices are used as a visualization technique for information extracted from source code in [6].

One innovative use of concept analysis has been in the area of program comprehension via feature location [10,11]. In this research, concept analysis is used in conjunction with both dynamic and static analysis. Specifically, dynamic analysis is used to generate sets of related scenarios and subprograms. Concept analysis is used to identify features by relating which subprograms are invoked in each scenario. Static (call graph) analysis is then used to augment the information in the concept lattice, providing a detailed visualization of the feature’s implementation.

What distinguishes this work from other applications of concept analysis is primarily the scope to which the analysis is applied. These applications typically consider the entire scope of a program, sometimes generating an enormous amount of information that resists further analysis, let alone visualization [1]. In our approach, we consider object/attribute relations on a class-by-class basis rather than for the entire system as a whole. One application similar to ours is [9], in which concept lattices are constructed for individual Java classes to support program comprehension. Our approach differs in that the resulting lattices are used to determine additional qualities of a class rather than to produce visualization material.
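To illustrate what a class-by-class formal context can look like, the sketch below enumerates the formal concepts of a tiny context in which the objects are the methods of a single class and the attributes are the member variables each method uses. The class, its method and field names, the choice of methods as objects, and the brute-force enumeration are all illustrative assumptions; they are not the context definition or the algorithm used by pilfer.

```python
from itertools import combinations

# Hypothetical context for one class: method -> member variables it touches.
CONTEXT = {
    "getX": {"x"},
    "getY": {"y"},
    "moveTo": {"x", "y"},
    "setColor": {"color"},
}


def common_attributes(objects, context):
    """Attributes shared by every object in `objects` (all attributes if empty)."""
    shared = set().union(*context.values())
    for obj in objects:
        shared &= context[obj]
    return shared


def objects_having(attributes, context):
    """Objects that possess every attribute in `attributes`."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}


def formal_concepts(context):
    """Brute force: close every subset of objects. Suitable for small classes only."""
    concepts = set()
    names = sorted(context)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_attributes(subset, context)
            extent = objects_having(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


concepts = formal_concepts(CONTEXT)
for extent, intent in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(extent), sorted(intent))
```

For this context the enumeration yields six concepts. Restricting the context to one class keeps the lattice small enough to inspect directly, which is the practical benefit of the class-by-class scope described above; the exponential enumeration used here would not scale to a whole-program context.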

7. Conclusions and future work

In this paper, we have discussed the inconsistency of reverse engineering tools due to the semantic gap between UML and C++ and the non-disclosure policy of those tools. In an effort to bridge the gap between the two languages and to provide a platform for common modeling problems, we have detailed a set of mappings from C++ to UML class models. These heuristic mappings are based primarily on easily accessible syntactic and semantic information in the program. These mappings are intended to recover design-level UML class models from source code, supporting program comprehension.

To validate the applicability and accuracy of these mappings, we have implemented them in our own reverse engineering application, pilfer. When used to reverse engineer HippoDraw, it produced UML models comparable to those produced by other reverse engineering tools. However, the inclusion of domain knowledge allowed pilfer to produce models that reflected the abstract design rather than simply recreating the implementation-level structures of the program. This is immediately obvious in (a) the identification of abstract attributes and (b) the presence of appropriate aggregation semantics in associations.

In the future we plan to expand our investigation of common ambiguities and inconsistencies between reverse engineering tools and define similar mappings for their resolution. Moreover, we plan to refine the architecture of pilfer to allow even more user customization, scriptability, and interoperability. Allowing users to develop and integrate more analysis technologies is a priority for creating a more complete and accurate reverse engineering environment. Although pilfer is already capable of producing standards-compliant XMI for model interchange, and dot files for visualization of UML diagrams and concept lattices, we envision a broader range of exported formats so that we can interoperate with a broad range of applications in the reverse engineering community. Efforts are being undertaken by the research community to define such formats [17,28]. Moreover, interoperability is being increasingly discussed as a requirement for reverse engineering tools as a whole (e.g., [23]). To support these broader goals of the community, we envision using pilfer to export, say, GXL models of call graphs, object relation diagrams, and other such artifacts, allowing other applications to analyze or visualize the results.
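As an illustration of the kind of export discussed here, the following sketch writes a small class model as Graphviz dot text. The model, the class names, and the function to_dot are hypothetical and are not taken from pilfer. The arrow shapes used (empty, diamond, odiamond) are standard Graphviz arrow types, chosen here to approximate UML generalization, composite aggregation, and shared aggregation.

```python
# Hypothetical relationships: (source, target, kind).
RELATIONSHIPS = [
    ("Circle", "Shape", "generalization"),
    ("Canvas", "Shape", "shared"),
    ("Shape", "Point", "composite"),
]

# Graphviz edge attributes approximating UML notation. For aggregations the
# diamond is drawn at the owning (source) end, hence dir=back with arrowtail.
EDGE_STYLE = {
    "generalization": "arrowhead=empty",
    "shared": "dir=back, arrowtail=odiamond",
    "composite": "dir=back, arrowtail=diamond",
    "association": "arrowhead=none",
}


def to_dot(relationships, name="model"):
    """Render the relationships as the text of a Graphviz digraph."""
    classes = sorted({c for src, dst, _ in relationships for c in (src, dst)})
    lines = [f"digraph {name} {{", "  node [shape=box];"]
    lines += [f'  "{cls}";' for cls in classes]
    for src, dst, kind in relationships:
        lines.append(f'  "{src}" -> "{dst}" [{EDGE_STYLE[kind]}];')
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot(RELATIONSHIPS)
print(dot_text)
```

The same relationship list could in principle be serialized to other interchange formats such as GXL; dot is used here only because it is compact enough to show in full.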

Acknowledgments

We thank the reviewers for their helpful and detailed comments in revising this paper. This work was supported in part by a grant from the United States National Science Foundation (C-CR 02-04175).

References

[1] N. Anquetil, A comparison of graphs of concept for reverse engineering, in: Proceedings of 8th International Workshop on Program Comprehension (IWPC’00), June 10–11, Limerick, Ireland, 2000, pp. 231–240.

[2] L.A. Barowski, J.H. Cross, Extraction and use of class dependency information in Java, in: Proceedings of Ninth Working Conference on Reverse Engineering (WCRE’02), October 29–November 1, Richmond, Virginia, 2002, pp. 309–318.

[3] I.D. Baxter, C. Pidgeon, M. Mehlich, DMS: program transformations for practical scalable software evolution, in: Proceedings of 26th International Conference on Software Engineering (ICSE’04), 23–28 May, Edinburgh, Scotland, UK, 2004, pp. 625–634.


[4] G. Canfora, A. Cimitile, A. De Lucia, G.A. Di Lucca, A case study of applying an eclectic approach to identify objects in code, in: Proceedings of 7th International Workshop on Program Comprehension (IWPC’99), Pittsburgh, Pennsylvania, 1999, pp. 136–143.

[5] E.J. Chikofsky, J.H. Cross, Reverse engineering and design recovery: a taxonomy, IEEE Software 7 (1) (1990) 13–17.

[6] R. Cole, T. Tilley, Conceptual analysis of software structure, in: Proceedings of 15th International Conference on Software Engineering and Knowledge Engineering (SEKE’03), 1–3 July, San Francisco Bay, California, 2003, pp. 726–733.

[7] M.L. Collard, H.H. Kagdi, J.I. Maletic, An XML-based lightweight C++ fact extractor, in: Proceedings of 11th IEEE International Workshop on Program Comprehension (IWPC’03), 10–11 May, Portland, OR, 2003, pp. 134–143.

[8] M.L. Collard, J.I. Maletic, A. Marcus, Supporting document and data views of source code, in: Proceedings of ACM Symposium on Document Engineering (DocEng’02), 8–9 November, McLean, VA, 2002, pp. 34–41.

[9] U. Dekel, Y. Gil, Revealing class structure with concept lattices, in: Proceedings of 10th Working Conference on Reverse Engineering (WCRE’03), 13–16 November, Victoria, Canada, 2003, pp. 353–365.

[10] T. Eisenbarth, R. Koschke, D. Simon, Aiding program comprehension by static and dynamic feature analysis, in: Proceedings of International Conference on Software Maintenance (ICSM’01), 7–9 November, Florence, Italy, 2001, pp. 602–611.

[11] T. Eisenbarth, R. Koschke, D. Simon, Locating features in source code, IEEE Transactions on Software Engineering 29 (3) (2003) 210–224.

[12] M. Fowler, UML Distilled Third Edition. A Brief Guide to the Standard Object Modeling Language, Addison-Wesley, Reading, MA, 2000.

[13] R. Godin, H. Mili, G. Mineau, R. Missaoui, A. Arfi, T.-T. Chau, Building and maintaining analysis-level class hierarchies using Galois lattices, in: Proceedings of Conference on Object-Oriented Programming Systems, Languages, and Applications, September 26–October 1, Washington, D.C., 1993, pp. 394–410.

[14] R. Godin, G. Mineau, R. Missaoui, A. Arfi, T.-T. Chau, Design of class hierarchies based on concept (Galois) lattices, International Journal of Knowledge Engineering and Software Engineering 5 (1) (1998) 119–142.

[15] M. Gogolla, R. Kollman, Re-documentation of Java with UML class diagrams, in: Proceedings of 7th Reengineering Forum, Reengineering Week 2000, February 29–March 3, Zurich, Switzerland, 2000, pp. 41–48.

[16] Y.G. Gueheneuc, H. Albin-Amiot, Recovering binary class relationships: putting icing on the UML cake, in: Proceedings of 19th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA’04), 24–28 October, Vancouver, Canada, 2004, pp. 301–314.

[17] R.C. Holt, A. Winter, A. Schurr, GXL: toward a standard exchange format, in: Proceedings of 7th Working Conference on Reverse Engineering (WCRE’00), 23–25 November, Brisbane, Queensland, Australia, 2000, pp. 162–171.

[18] D. Jackson, A. Waingold, Lightweight extraction of object models from Bytecode, in: Proceedings of 21st International Conference on Software Engineering (ICSE’99), 16–22 May, Los Angeles, California, 1999, pp. 194–202.

[19] J. Jiang, T. Systa, Exploring differences in exchange formats – tool support and case studies, in: Proceedings of Seventh European Conference on Software Maintenance and Reengineering (CSMR’03), 26–28 March, Benevento, Italy, 2003, pp. 389–398.

[20] M. Keschenau, Student research competition: reverse engineering of UML specifications from Java programs, in: Proceedings of Companion to the 19th Annual ACM SIGPLAN Conference on Object-oriented Programming Systems, Languages, and Applications, 24–28 October, Vancouver, Canada, 2004, pp. 326–327.

[21] R. Kollman, M. Gogolla, Application of UML associations and their adornments in design recovery, in: Proceedings of Eighth Working Conference on Reverse Engineering (WCRE’01), 24–28 October, Stuttgart, Germany, 2001, pp. 81–92.

[22] R. Kollman, P. Selonen, E. Stroulia, A. Zundorf, A study in the current state of the art in tool-supported UML-based static reverse engineering, in: Proceedings of Ninth Working Conference on Reverse Engineering (WCRE’02), October 29–November 1, Richmond, Virginia, 2002, pp. 22–34.

[23] N.A. Kraft, B.A. Malloy, J.F. Power, Toward an infrastructure to support interoperability in reverse engineering, in: Proceedings of 12th Working Conference on Reverse Engineering (WCRE’05), 7–11 November, Pittsburgh, PA, 2005, pp. 196–205.

[24] M. Krone, G. Snelting, On the inference of configuration structures from source code, in: Proceedings of 16th International Conference on Software Engineering (ICSE’94), 16–21 May, Sorrento, Italy, 1994, pp. 49–57.

[26] C. Lindig, G. Snelting, Assessing modular structure of legacy code based on mathematical concept analysis, in: Proceedings of International Conference on Software Engineering (ICSE’97), 17–23 May 1997, Boston, MA, 1997, pp. 349–359.

[27] J.I. Maletic, M.L. Collard, A. Marcus, Source code files as structured documents, in: Proceedings of 10th IEEE International Workshop on Program Comprehension (IWPC’02), 27–29 June 2002, Paris, France, 2002, pp. 289–292.

[28] A.J. Malton, R.C. Holt, Boxology of NBA and TA: a basis for understanding software architecture, in: Proceedings of 12th Working Conference on Reverse Engineering (WCRE’05), 7–11 November 2005, Pittsburgh, PA, 2005, pp. 187–195.

[29] S. Matzko, P.J. Clarke, T.H. Gibbs, B.A. Malloy, J.F. Power, R. Monahan, Reveal: a tool to reverse engineer class diagrams, in: Proceedings of 40th International Conference on Tools Pacific: Objects for Internet, Mobile and Embedded Applications, Sydney, Australia, 2002, pp. 13–21.

[30] OMG, (2002), ‘‘Meta Object Facility (MOF), 1.4’’: <http://www.omg.org>.

[31] OMG, (2003), ‘‘Unified Modeling Language, 1.5’’: <http://www.omg.org>.

[32] H.A. Sahraoui, W. Melo, H. Lounis, F. Dumont, Applying concept formation methods to object identification in procedural code, in: Proceedings of International Conference on Automated Software Engineering (ASE’97), Lake Tahoe, CA, 2–5 November 1997.

[33] M. Siff, T.W. Reps, Identifying modules via concept analysis, IEEE Transactions on Software Engineering 25 (6) (1999) 749–768.

[34] G. Snelting, Software reengineering based on concept lattices, in: Proceedings of 4th European Conference on Software Maintenance and Reuse (CSMR’00), Feb 29–Mar 03 2000, Zurich, Switzerland, 2000, pp. 3–12.

[35] G. Snelting, F. Tip, Reengineering class hierarchies using concept analysis, in: Proceedings of SIGSOFT, 1998, pp. 99–110.

[36] M.-A.D. Storey, K. Wong, H.A. Muller, Rigi: a visualization environment for reverse engineering, in: Proceedings of IEEE International Conference on Software Engineering (ICSE’97), 17–23 May 1997, Boston, MA, 1997, pp. 606–607.

[37] A. Sutton, Accurately reverse engineering UML class models from C++, Kent State University, Kent, Ohio, Masters Thesis, 2005.

[38] A. Sutton, J.I. Maletic, Mappings for accurately reverse engineering UML class models from C++, in: Proceedings of 12th Working Conference on Reverse Engineering (WCRE’05), 7–11 November 2005, Pittsburgh, PA, 2005, pp. 175–184.

[39] T. Tilley, R. Cole, P. Becker, P. Eklund, A survey of formal concept analysis support for software engineering activities, in: Proceedings of 1st International Conference on Formal Concept Analysis (ICFCA’03), February 27–March 1 2003, Darmstadt, Germany, 2003.

[40] P. Tonella, G. Antoniol, Object-oriented design pattern inference, in: Proceedings of 3rd European Conference on Software Maintenance and Reuse (CSMR’99), 3–5 March 1999, St. Agnes, Amsterdam, 1999, pp. 230–240.

[41] P. Tonella, A. Potrich, Reverse engineering of the UML class diagram from C++ code in the presence of weakly typed containers, in: Proceedings of International Conference on Software Maintenance (ICSM’01), 6–10 November 2001, Florence, Italy, 2001, pp. 376–385.

[42] P. Tonella, A. Potrich, Static and dynamic C++ code analysis for the recovery of the object diagram, in: Proceedings of International Conference on Software Maintenance (ICSM’02), 3–6 October, Montreal, Canada, 2002, pp. 54–63.

[43] P. Tonella, A. Potrich, Reverse engineering of the interaction diagrams from C++ code, in: Proceedings of International Conference on Software Maintenance (ICSM’03), 22–26 September 2003, Amsterdam, The Netherlands, 2003, pp. 159–168.

[44] A. van Deursen, T. Kuipers, Identifying objects using cluster and concept analysis, in: Proceedings of 21st IEEE International Conference on Software Engineering (ICSE’99), 16–22 May 1999, Los Angeles, CA, 1999, pp. 246–255.

[45] K. Wong, S.R. Tilley, H. Muller, M.-A.D. Storey, Structural redocumentation: a case study, IEEE Software 12 (1) (1995) 46–54.