

JOURNAL OF SOFTWARE MAINTENANCE: RESEARCH AND PRACTICE
J. Softw. Maint: Res. Pract. 2000; 12:305–323

Research

Object localization in procedural programs: a graph-based approach

Doris L. Carver∗,† and Ramachenga Valasareddi

Department of Computer Science, Louisiana State University, Baton Rouge, LA 70803, U.S.A.

SUMMARY

Maintenance of legacy systems is a laborious, error-prone task. It is often difficult to define encapsulated components in procedural programs. We define a comprehensive process for re-engineering procedural, legacy code to an object-oriented architecture. The process is based on a program representation graph, called a statement dependence graph. The process includes a technique to recognize potential object hierarchies, state variables and operations. Procedures are partitioned into operations by analyzing variable use-def chains. The statement dependence graph is restructured by merging cohesive parts of the graph to produce a restructured graph. From the restructured graph, we identify hierarchies of objects. The process to encapsulate the objects includes streamlining the interfaces. Copyright 2000 John Wiley & Sons, Ltd.

KEY WORDS: object identification; statement dependence graph; restructured graph; state reference graph; object formation; object cohesion

∗Correspondence to: Dr. Doris L. Carver, Department of Computer Science, Louisiana State University, Baton Rouge, LA 70803, U.S.A.
†E-mail: [email protected]

Received 5 May 2000; Revised 18 July 2000

1. INTRODUCTION

Maintenance of legacy systems is a laborious, error-prone task. It is often difficult to define encapsulated components in procedural programs. Since an algorithmic decomposition typically highlights ordering of events, system modules tend to reflect steps in the algorithmic process. In addition, ongoing maintenance tends to destroy the original program structure. A system that is represented in the form of encapsulated components provides positive benefits not only for maintenance but also for reuse. Thus, there are advantages to re-engineering procedural systems to object-oriented architectures. Even when a complete re-engineering is not feasible, there are advantages to identifying objects in procedural systems. With an object-oriented model, it is possible to define and use modules that are functionally incomplete, thereby allowing easy extension of modules. Object-oriented systems are based on the concepts of abstraction, encapsulation, modularity, and information hiding. Systems exhibiting these features are in general easier to maintain because of the potential for localized impact of changes that result from maintenance. Object identification facilitates the gradual replacement of older systems. Also, the identification of object-like features in programs can help avoid degradation of the original design during maintenance.

Figure 1. The re-engineering process.

In this paper we describe a re-engineering process for object identification in procedural code. This re-engineering process is depicted in Figure 1. The approach is to reverse engineer the design and represent it as a statement dependence graph (StDG). The StDG is then restructured by merging cohesive components in the graph. From the restructured StDG (RSG), we identify object state and operations. Finally, we build a state reference graph (SRG) using the state variables, operations, call graph, user-defined data structures, and variable use-def information. We then use the SRG to identify candidate objects. We include a tool called ReArchitect in Figure 1. ReArchitect extracts information, creates models, provides visualization, performs slicing, and implements program restructuring transformations. We defined ReArchitect to provide automated support for the re-engineering process; however, since ReArchitect is not the focus of the paper, we do not include details of ReArchitect. Figure 1 also shows that the software engineer plays a role in the process by providing input to guide the process. This input is particularly useful in the object identification phase.

2. RELATED WORK

An object has state, behavior, and identity. The state of the object encompasses the properties of the object. Behavior is how an object acts and reacts to messages it receives. Operations are categorized as modifier, selector, iterator, constructor, and destructor. A modifier operation alters the object state, a selector operation accesses the object state, an iterator permits all parts of the object to be accessed, a constructor creates the object state, and a destructor frees it.
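To make the five operation categories concrete, the following C sketch models a small bounded stack as an object. The type `IntStack` and every function name are hypothetical illustrations and do not come from the paper.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical object: a bounded integer stack. State = items, sp, size. */
typedef struct {
    int *items;   /* state: storage */
    int  sp;      /* state: number of stored items */
    int  size;    /* state: capacity */
} IntStack;

/* Constructor: creates the object state. Returns NULL on allocation failure. */
IntStack *intstack_create(int size) {
    IntStack *s = malloc(sizeof *s);
    if (s == NULL) return NULL;
    s->items = malloc((size_t)size * sizeof *s->items);
    if (s->items == NULL) { free(s); return NULL; }
    s->sp = 0;
    s->size = size;
    return s;
}

/* Destructor: frees the object state. */
void intstack_destroy(IntStack *s) {
    if (s != NULL) { free(s->items); free(s); }
}

/* Modifier: alters the state. Returns 0 on success, -1 if the stack is full. */
int intstack_push(IntStack *s, int v) {
    if (s->sp >= s->size) return -1;
    s->items[s->sp++] = v;
    return 0;
}

/* Selector: reads the state without changing it. */
int intstack_is_empty(const IntStack *s) { return s->sp == 0; }

/* Iterator: visits every stored item, bottom to top. */
void intstack_for_each(const IntStack *s, void (*visit)(int, void *), void *ctx) {
    for (int i = 0; i < s->sp; i++) visit(s->items[i], ctx);
}

/* Visitor for the iterator: accumulates a sum in the int that ctx points to. */
void add_to_sum(int v, void *ctx) { *(int *)ctx += v; }
```

A caller would create the stack with `intstack_create`, change it only through `intstack_push`, inspect it through `intstack_is_empty` or `intstack_for_each`, and release it with `intstack_destroy`.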

Research in object identification has involved concept analysis [1,2] and clustering techniques [3,4]. The research has focused heavily on techniques for extracting objects from data that have already been aggregated in programmer-defined data structures. The concept analysis approach uses functions and attributes of functions to identify potential objects. The attributes of functions can include parameter and return types of functions, global variable usage information, and slice criteria [5]. This approach typically generates a set of possible partitions; however, one drawback to the approach is that it generates a large number of partitions.

Clustering techniques are based on a graph with procedures and external (and global) variables as nodes and with references by the procedures to the variables as edges. Each isolated sub-graph contained in the graph is a candidate for an object. For these techniques to be effective, either the state variables must be identified by some mechanism or they must be declared as global variables. Cluster-based object identification approaches typically involve variable selection, cluster identification, and cluster separation.

With variable selection, target variables are selected based either on scope or role in the procedure. Scope-based variables include global variables, parameter variables [1,2,6], and coexisting variables [7]. Role-based variables include variables representing persistent data or variables aggregated in programmer-defined data structures [4].

The cluster identification process groups target variables and procedures into clusters, and each cluster is a candidate for an object. Clusters are identified based either on data aggregated in programmer-defined data structures [8] or data bindings [1,4,7]. Clusters based on the data structures group the procedures accessing the data structure with the data structure. Data bindings identify the interfaces between the components of a system [9]. A data binding can be defined as a triple (r, s, V) where r and s are procedures and V is a set of variables either used or defined in both r and s. Two types of data bindings are the used data binding and the defined data binding. A used data binding is a data binding where both r and s only use V. A defined data binding is a data binding where both r and s define V. Data bindings result in a graph with both target variables and procedures as nodes and the uses/definitions of variables by the procedures as edges. Each sub-graph or highly connected portion of the sub-graph can form a cluster.
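The data binding triple (r, s, V) can be sketched in C as set operations on the use and def sets of two procedures. The bitmask encoding, the `Proc` type, and the function names below are assumptions made for this illustration only.

```c
#include <assert.h>

/* Each variable is one bit in a mask (an encoding assumed for brevity). */
enum { VAR_NC = 1, VAR_NW = 2, VAR_C = 4 };

/* A procedure summarized by the variables it reads and writes. */
typedef struct {
    const char *name;
    unsigned uses;   /* variables the procedure uses */
    unsigned defs;   /* variables the procedure defines */
} Proc;

/* V of the data binding (r, s, V): variables used or defined in both. */
unsigned binding_vars(const Proc *r, const Proc *s) {
    return (r->uses | r->defs) & (s->uses | s->defs);
}

/* Used data binding: variables that both procedures use and neither defines. */
unsigned used_binding(const Proc *r, const Proc *s) {
    return (r->uses & s->uses) & ~(r->defs | s->defs);
}

/* Defined data binding: variables that both procedures define. */
unsigned defined_binding(const Proc *r, const Proc *s) {
    return r->defs & s->defs;
}
```

A non-zero result from any of these functions would correspond to edges between the two procedures and the shared variables in the clustering graph.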

Cluster separation is needed when a procedure implements more than one function and thus may belong to multiple clusters. Each of the functions can logically belong to different objects. The clustering process should identify these procedures and separate them into smaller procedures. Statistical techniques and domain knowledge have been used to identify procedures that belong to more than one cluster [4].

With most graph-based approaches, two types of undesired links can occur among sub-graphs: coincidental and spurious connections. Coincidental connections are due to routines that implement more than one function, each function logically belonging to a different object. Spurious connections are related to routines that access more than one data structure. Slicing is used in [10] to separate routines that contribute to coincidental connections and to remove spurious connections from the graph. Many clustering methods do not address how to identify the spurious or coincidental connections.

One issue that arises in the process of defining objects is that code for individual operations is often interleaved. Rugaber [11] defines interleaving as the merging of two or more distinct plans within some contiguous textual area of a system. A plan denotes a group of statements that achieves some purpose or goal. In terms of objects, a plan may consist of one or more operations. Interleaving may occur for several reasons, including efficiency considerations or the sequential nature of procedural programming. For example, it may be more efficient to compute two related values at one place rather than separately. Also, constructors and modifiers of data structures are typically interleaved throughout a procedure. Interleaving is introduced to take advantage of commonalities; however, interleaved plans typically have a distinct purpose. Plans, which are candidates for operations, can be identified by using variable(s) definitions and their use contexts.
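A small C sketch may help picture interleaving. In the first function two plans (a sum and a maximum) share one loop; the other two functions show the same plans separated along the variables they define. All names are hypothetical, and the array is assumed to hold at least one element.

```c
#include <assert.h>

/* Interleaved: two plans share one loop for efficiency. */
void sum_and_max(const int *a, int n, int *sum, int *max) {
    *sum = 0;
    *max = a[0];
    for (int i = 0; i < n; i++) {
        *sum += a[i];                   /* plan 1: defines sum */
        if (a[i] > *max) *max = a[i];   /* plan 2: defines max */
    }
}

/* Separated plan 1: only the statements that define the sum. */
int sum_of(const int *a, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) sum += a[i];
    return sum;
}

/* Separated plan 2: only the statements that define the maximum. */
int max_of(const int *a, int n) {
    int max = a[0];
    for (int i = 1; i < n; i++)
        if (a[i] > max) max = a[i];
    return max;
}
```

Each separated function is a candidate operation, because its statements are tied together by the definition and use of a single variable.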

We define a program representation known as the statement dependence graph (StDG) that is the basis for object identification. Our approach clusters operations and state variables into candidate objects. It identifies modifier, selector, iterator, constructor, and destructor operations in procedures. It is based on a graph representation with functions, global variables, or function attributes as nodes. The edges represent references by the procedures to the variables. We extend the nodes to include key local variables and program slices. We reduce the number of undesired links by separating the uses from definitions, by replacing direct uses of state variables by selector operations, and by separating procedures that define multiple state variables.

3. STATEMENT DEPENDENCE GRAPH (StDG)

We developed the graph-based program representation based on the following criteria: support for fine-grained analysis, statement level manipulation, user input, representation for all parts of the program, and modular representation for a group of statements. Using the graph representation of a program, we determine dependences through graph traversal.

The StDG is based on statement-level granularity. Each program statement is given a statement number and is represented as a box in the graph. Each box has ports at the top and bottom. Top ports (use ports) represent the variables used and bottom ports (def ports) the variables defined (modified) in the statement. The dependences among variables within a statement are indicated as internal edges in the box. A box with ports and internal edges is called a site. The site representation model is an extension of the notation found in [12]. A site is labeled with the statement number it represents, and a port is labeled with the variable it represents along with the number of the site in which it occurs. The external edges in the graph connect the def ports of a site to the use ports of sites where they are used. The complete algorithm to construct the StDG can be found in [13].

     1  int nw, nc;
     2  void wordCount(int inword){
     3    int c;
     4    c = getchar();
     5    while (c != EOF){
     6      nc = nc + 1;
     7      if (c == ' ' || c == '\n' || c == '\t')
     8        inword = 0;
     9      else if (inword == 0){
    10        inword = 1;
    11        nw = nw + 1;
    12      }
    13      c = getchar();
    14    }
    15  }
    16  main(){
    17    int inw;
    18    inw = nc = nw = 0;
    19    wordCount(inw);
    20    printf("\n");
    21    printf(" %d %d\n", nc, nw);
    22  }

Figure 2. (a) C program; (b) StDG of function wordCount(); (c) StDG of function main().

A simple program and its representation are presented in Figure 2. In Figure 2(b), external edges indicate the data dependence (e.g., edge (c4, c7)), control dependence (e.g., edge (τ5, τ5)), or flow dependence (e.g., edge (ϕ15, τ)) among the statements. The data dependence and control dependence edges are computed by the same techniques used to construct the program dependence graph [14]. The flow dependence edges are used to connect sites representing those statements that do not use (or define) any variables to the sites that depend on them, either syntactically or semantically.
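As a rough illustration of sites and external data-dependence edges, the C sketch below represents each statement as a site with use and def ports and connects each use to the nearest earlier definition of the same variable. This is a straight-line simplification assumed for the example; it ignores loops, control ports, and flow ports, and it is not the construction algorithm of [13]. The types `Site` and `Edge` and the function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_PORTS 4

/* A site: one statement with use ports (top) and def ports (bottom). */
typedef struct {
    int stmt;                     /* statement number labelling the site */
    const char *use[MAX_PORTS];   /* variables used; unused slots are NULL */
    const char *def[MAX_PORTS];   /* variables defined; unused slots are NULL */
} Site;

/* An external edge from def port (var, from) to use port (var, to). */
typedef struct { const char *var; int from, to; } Edge;

/* Returns 1 when site s has a def port for variable v. */
static int site_defines(const Site *s, const char *v) {
    for (int i = 0; i < MAX_PORTS && s->def[i]; i++)
        if (strcmp(s->def[i], v) == 0) return 1;
    return 0;
}

/* Connects every use port to the nearest earlier def of the same variable.
   Writes at most max_edges edges to out and returns how many were written. */
int data_edges(const Site *sites, int n, Edge *out, int max_edges) {
    int count = 0;
    for (int t = 0; t < n; t++) {
        for (int u = 0; u < MAX_PORTS && sites[t].use[u]; u++) {
            for (int f = t - 1; f >= 0; f--) {
                if (!site_defines(&sites[f], sites[t].use[u])) continue;
                if (count < max_edges) {
                    out[count].var = sites[t].use[u];
                    out[count].from = sites[f].stmt;
                    out[count].to = sites[t].stmt;
                    count++;
                }
                break;   /* only the nearest earlier definition is linked */
            }
        }
    }
    return count;
}
```

Given sites for statements 4, 6, and 7 of Figure 2(a) in that order, the only edge this sketch produces is the one for variable c from site 4 to site 7, which corresponds to the edge (c4, c7) mentioned in the text.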

In addition to the program variable ports, we use four special ports: control variable (τ), flow variable (ϕ), final-use variable (η), and multi-def variable (λ). Control expressions (blockheads) define control variables τ. Statements that depend on control expressions (block members) use the control variable defined by their blockhead. The τ use ports of all block members have the same label, that of their blockhead; the control variable does not change from site to site, unlike the data variables. Within a site, all def ports are connected to the τ use port. Statements that do not use (or define) any variables define a flow variable ϕ. Flow dependence is indicated as an edge from the ϕ def port of a dependent statement to a τ use port of the statement on which it depends.

A simple use of a variable in statements (e.g., in output statements or return statements) that does not result in any definitions is indicated using a final-use variable η and an edge from the variable's use port to an η def port. The presence of multiple definitions in a statement is indicated using a multi-def variable λ and internal edges from the λ use port to each of the def ports in the site.

A summary site is a modular representation for a procedure or a slice. A call to a procedure is replaced by the summary site of the called procedure. A summary site of a procedure includes the summary dependences among the statements within the procedure. Each procedure in the program has two special sites: enter and exit sites. The enter site models the context of the call. The data variables defined in the site that are used in other sites are known as the exposed defs of the site. The exit site models the return context of the call. The variables (local, global, and parameter) used at the site that are defined in other sites are known as the reaching defs of the site. Similarly, dependences among the statements in a slice can be summarized and used in place of the slice statements in the graph. After the StDG is created, summary sites are created. The StDG process is based on discard-type multiple representations; that is, it discards earlier representations once new representations are obtained. It discards the dependence information among groups of cohesive statements such as those found in summary sites. We refer to the construction process for summary sites as compaction.

4. COMPACTION OF THE StDG

We perform compaction on the StDG to obtain the RSG. Compaction involves merging sites by moving a site (source site) into another site (destination site). In this work, we merge sites with high cohesion. A cohesive group of statements in a procedure is a candidate for an operation. Cohesion indicates the binding strength between two elements of a program. Binding strength is indicated as edges (dependences) in the graph. If a variable definition at a site is used at only one other site, then the cohesion between the variable definition and use sites is high. On the other hand, if the variable is used at several sites, the cohesion is divided among the sites that share the variable. We consider cohesion between two sites A and B as high if 'A and B are connected and every site reachable from A is also reachable from B'. In terms of the graph, two sites are considered highly cohesive if they are connected and satisfy any of the following three conditions (the definitions used in the formal statements are shown in Figure 4):

1. All the def edges of one site reach one other site only. Formally,

$$S_1, S_2 : \mathrm{SITE} \bullet \mathrm{DPD}(S_1) \subseteq \mathrm{UPS}(S_2) \cup \mathrm{UPS}(S_1). \tag{1}$$

That is, the def ports of def ports of S1 belong either to the use ports of S1 or to the use ports of S2.

2. A group of sites is circularly connected. Formally,

$$S_1, S_2 : \mathrm{SITE} \bullet \mathrm{DPD}(S_2)^{+} \equiv \mathrm{DPD}(S_1)^{+} \tag{2}$$

where

$$\mathrm{DPD}(S)^{+} = \mathrm{DPD}(S) \cup \mathrm{DPD}(\mathrm{DPD}(S)) \cup \cdots. \tag{3}$$

That is, the ports reachable from S1 and S2 are the same.

3. All the sites connected to one site are also connected to the other site. Formally,

$$S_1, S_2 : \mathrm{SITE} \bullet S_2 \in \mathrm{Site}(\mathrm{DPD}(S_1)) \wedge (\mathrm{Site}(\mathrm{DPD}(S_1)^{+}) - S_2) \subseteq \mathrm{Site}(\mathrm{DPD}(S_2)^{+}). \tag{4}$$

That is, S2 is one of the sites containing def ports of def ports of S1 (S1 and S2 are connected), and the sites containing def ports of def ports of S1, excluding S2, are a subset of the sites containing def ports of def ports of S2 (all the sites reachable from S1 can also be reached from S2).
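The three cohesion conditions can be approximated at site granularity with a few set operations. The C sketch below collapses ports to sites (an assumption made for brevity, so it is coarser than the port-level definitions) and stores the successors of each site as a bitmask. `NSITES`, `Mask`, and the function names are hypothetical.

```c
#include <assert.h>

#define NSITES 4   /* size of the hypothetical example graph */

/* succ[i] has bit j set when a def port of site i feeds a use port of site j. */
typedef unsigned Mask;

/* Site-level analogue of DPD(S)+: every site reachable in one or more steps. */
Mask reach_plus(const Mask succ[NSITES], int s) {
    Mask seen = succ[s], prev = 0;
    while (seen != prev) {
        prev = seen;
        for (int j = 0; j < NSITES; j++)
            if (seen & (1u << j)) seen |= succ[j];
    }
    return seen;
}

/* Condition 1: s1 feeds s2, and every def edge of s1 ends in s1 or s2. */
int cond_single_target(const Mask succ[NSITES], int s1, int s2) {
    Mask allowed = (1u << s1) | (1u << s2);
    return (succ[s1] & (1u << s2)) != 0 && (succ[s1] & ~allowed) == 0;
}

/* Condition 2: s1 and s2 are connected and reach exactly the same sites. */
int cond_circular(const Mask succ[NSITES], int s1, int s2) {
    Mask r1 = reach_plus(succ, s1), r2 = reach_plus(succ, s2);
    return r1 == r2 && (r1 & ((1u << s1) | (1u << s2))) != 0;
}

/* Condition 3: s2 is a direct successor of s1, and every site reachable
   from s1 other than s2 is also reachable from s2. */
int cond_shared_reach(const Mask succ[NSITES], int s1, int s2) {
    Mask r1 = reach_plus(succ, s1) & ~(1u << s2);
    return (succ[s1] & (1u << s2)) != 0 && (r1 & ~reach_plus(succ, s2)) == 0;
}
```

In a graph where site 0 feeds only site 1, sites 1 and 2 feed each other, and site 2 also feeds site 3, the pair (0, 1) satisfies conditions 1 and 3 and the pair (1, 2) satisfies condition 2.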

The StDG sites with high cohesion are merged into one site using structural compaction, data compaction, and edge compaction. For structural compaction, we consider that all members in a block are dependent on the blockhead and in turn on all members on which the blockhead depends. Block members on which the blockhead depends and the blockhead form a circular chain of sites. Hence, these member sites are merged into the blockhead site.

Data compaction involves the defining site (source) and use sites (sink) of a data variable. If a sink site has a final-use variable as its def port, then a temporary sink site called a final-use site is created. Variables used in a final-use site are related by coincidence only. If we include these sites in the merging process, the sites where these variables are defined will be represented as related in the graph. Thus, sites with a final-use variable are not merged with any other site. The StDG in Figure 2(b) with sites merged is shown in Figure 3. In Figure 3, only the statements represented by each site with reaching defs and exposed defs are shown.

For edge compaction, we seek to reduce the number of sites connected to a site in order to increase the possibilities of compaction. This reduction is achieved by changing a direct edge between two sites into an indirect connection through another connected site where possible. This concept is similar to replacing a global variable by a local variable and passing it as a parameter.
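The analogy with replacing a global variable by a parameter can be shown in a few lines of C. The function names are hypothetical, and the example illustrates only the analogy, not the edge compaction algorithm itself.

```c
#include <assert.h>

/* Before: the counter is global, so every function that touches it is
   connected directly to the site that defines it. */
int nc_global = 0;
void count_char_global(void) { nc_global = nc_global + 1; }

/* After: the value travels through the call as a parameter and a result,
   so the only direct connection is between caller and callee. */
int count_char(int nc) { return nc + 1; }

/* Hypothetical caller that threads the value through two calls. */
int count_two_chars(int nc) {
    nc = count_char(nc);
    nc = count_char(nc);
    return nc;
}
```

In the second form, a site that needs the count is connected to its caller rather than to every other site that reads or writes the global.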

The complete compaction algorithm that creates the RSG from the StDG is lengthy and is not included here. Note that the granularity of a site in the StDG is a statement, but the granularity of a site in the RSG is a set of statements that have cohesive properties.

There are numerous applications of the RSG, including identifying objects in code, maintenance, program understanding, and slicing. In this paper, we consider the object identification process.


Figure 3. StDG with sites merged.

SITE: set of all sites in the graph.
PORT: set of all ports in the graph.

Let s ∈ SITE; p ∈ PORT; sp ∈ SITE ∨ sp ∈ PORT. (Note that the def edges are the outgoing edges from a port or a site, and the use edges are the incoming edges to a port or a site.)

UPS(s): set of all use ports of a site s (the top ports of s).
DPS(s): set of all def ports of a site s (the bottom ports of s).
DED(sp): def edges of a def port sp or def edges of all def ports of site sp.
DPD(sp): def ports of a def port sp or def ports of all def ports of site sp. Ports connected to DED(sp).
UEU(sp): set of all use edges of a use port sp or use edges of all use ports of site sp.
UPU(sp): set of use ports of a use port sp or use ports of all use ports of site sp. Ports connected to UEU(sp).
Site(p): returns the site(s) in which the port(s) is present.
Blockend(s): returns the site number of the block end of s.

Exposed def ports are the def ports in a site that are connected to ports in other sites. These ports can be control (ECDP) or data (EDDP) ports.

Figure 4. Definitions.

5. OBJECT IDENTIFICATION

An object has state (variables) or attributes (SV) and operations (P) in the form of constructors, modifiers, selectors, and destructors. Each site in the RSG is a potential operation, and the program variables defined by these sites are candidates for object attributes. The object identification is a four-step process: identification of the object state, identification of object operations, identification of candidate objects, and formation of objects. We use the example in Figure 5 to demonstrate the object identification process. Figure 4 contains definitions used to describe the process.


Table I. Data generated from the queue example. FDV (final def variable): variable defined last in the site; FDSV (final def state variable): state variable defined last in the site. If the FDV of a site is a control variable or a local variable, then the site may have a different variable as FDSV than FDV. DSV (state variables defined in a site): all state variables defined in the site; USV (state variables used in a site): all state variables used in the site.

Procedures     Sites      FDV     FDSV    DSV            USV
initStack      1. base    base    base    {base}         ∅
               2. sp      sp      sp      {sp}           ∅
               3. size    size    size    {size}         ∅
initQ          1. front   front   front   {front}        ∅
               2. back    back    back    {back}         ∅
isEmptyStack              ∅       ∅       ∅              {sp,base}
isEmptyQ                  ∅       ∅       ∅              {sp,base,front,back}
push                      sp      sp      {sp}           {sp,size}
pop                       sp      sp      {sp}           {sp}
enq                       front   front   {sp,front}     {sp,front}
deq                       back    back    {front,back}   {front,back}

5.1. Identification of object state

We select variables (V) as potential candidates for attributes from three sources:

1. programmer-defined data structures present in the program (DV), DV = {stack, queue};
2. programmer-identified variables (IV);
3. exposed defs of sites with more than one def edge or those that reach an exit site in the StDG (PV).

That is,

$$SV \subseteq V : \mathrm{ProgramVariables}$$

$$SV = DV \cup IV \cup PV = \{\mathrm{base}, \mathrm{sp}, \mathrm{size}, \mathrm{front}, \mathrm{back}\}$$

$$PV = \{ v \bullet v : V;\ j : \mathrm{ProgramStatement};\ s : \mathrm{SITE} \mid \forall v_j \bullet (v_j, s, \mathrm{Definition}) \in \mathrm{DPS}(s) \wedge \#\mathrm{DPD}(v_j, s, \mathrm{Definition}) > 1 \}$$

(v_j is a def port in s and the number of ports connected to v_j is greater than one).
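The selection of PV can be sketched as a filter over def ports. The C code below follows the prose rule (more than one connected port, or reaching an exit site); the `DefPort` record and its fields are assumptions for the example, standing in for information that would be read off the StDG.

```c
#include <assert.h>
#include <string.h>

/* One def port of a hypothetical graph: the variable it defines, how many
   ports its def edges connect to (#DPD), and whether it reaches an exit site. */
typedef struct {
    const char *var;
    int connected_ports;
    int reaches_exit;
} DefPort;

/* Returns 1 when the def port qualifies its variable for PV. */
int qualifies_for_pv(const DefPort *p) {
    return p->connected_ports > 1 || p->reaches_exit;
}

/* Collects the distinct qualifying variable names into out.
   Writes at most max_out names and returns how many were written. */
int collect_pv(const DefPort *ports, int n, const char **out, int max_out) {
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (!qualifies_for_pv(&ports[i])) continue;
        int seen = 0;
        for (int k = 0; k < count; k++)
            if (strcmp(out[k], ports[i].var) == 0) seen = 1;
        if (!seen && count < max_out) out[count++] = ports[i].var;
    }
    return count;
}
```

A variable defined at several sites is reported once, since PV is a set of variables rather than a set of ports.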

5.2. Identification of operations (P)

Each site in the RSG is a potential operation. Sites and other representation information for the example in Figure 5 are presented in Table I. Procedure initStack has three sites and initQ has two sites; each of these sites is considered as a different operation. All other procedures have one site each in the graph.


    #include <stdlib.h>

    struct stack { int *base, *sp, size; };
    struct queue { struct stack *front, *back; };
    struct queue *q;

    /* Allocates a stack with room for sz integers. */
    struct stack *initStack(struct stack *s, int sz)
    {
        s = (struct stack *) malloc(sizeof(struct stack));
        s->base = s->sp = (int *) malloc(sz * sizeof(int));
        s->size = sz;
        return s;
    }

    /* Allocates the global queue and its two stacks. */
    struct queue *initQ()
    {
        q = (struct queue *) malloc(sizeof(struct queue));
        q->front = initStack(q->front, 10);
        q->back = initStack(q->back, 10);
        return q;
    }

    int isEmptyStack(struct stack *s)
    { return (s->sp == s->base); }

    int isEmptyQ()
    { return (q->front->sp == q->front->base && q->back->sp == q->back->base); }

    void push(struct stack *s, int i)
    { *(s->sp) = i; s->sp++; }

    void enq(int i)
    { *(q->front->sp) = i; q->front->sp++; }

    int pop(struct stack *s)
    {
        if (isEmptyStack(s)) return -1;
        s->sp--;
        return (*(s->sp));
    }

    int deq()
    {
        if (isEmptyQ()) return -1;
        if (!isEmptyStack(q->front)) push(q->back, pop(q->front));
        return pop(q->back);
    }

Figure 5. Queue with two stacks.


5.3. Identification of candidate objects

We define a seven-step process to identify candidate objects. Let OV(o) be the state of object o (OV(o) ⊆ SV), OP(o) the operations in object o (OP(o) ⊆ P), and O the set of objects in the program (o ∈ O).

1. For each selected data structure or group of variables (d), add an object od to the object set. Thatis, if d is a DV then add od to O, and if v is a member of d then add v to OV(od).

2. For each state variable v, a set P(v) of sites with v as FDSV is defined. P(v) and v are grouped into a candidate object o. If v already belongs to a candidate object o, then add P(v) to the object o. If v does not belong to any object, then add a new object (o) and add v and P(v) to the new object.

3. Add operations with null FDSV as candidate objects.

4. Build a SRG using candidate objects as nodes and references from candidate objects to variables of other candidate objects as edges. Nodes in the SRG are placed at different levels; if a node A references node B (operations in A reference variables in B) then node B is placed at a lower level than node A. Root nodes at the bottom of the SRG represent objects that do not reference any variables. If nodes are circularly connected then the nodes are merged into one node. The SRG is a directed graph G(N, E) with nodes ≡ candidate objects (O) and edges ≡ {(c1, c2) | c1, c2 ∈ O ∧ v ∈ OV(c1) ∧ P(v) ∈ OP(c1) ∧ w ∈ OV(c2) ∧ P(w) ∈ OP(c2)}. The SRG of the example in Figure 5 is shown in Figure 6. The nodes (sp, base, and size) are represented in one node as they belong to a data structure. The nodes (front and back) also belong to the same data structure. In the figure, nodes with out-going dotted edges represent the sites with null FDSV (isEmptyQ, isEmptyStack).

5. Merge connected nodes in the isolated sub-graphs of the SRG. The SRG nodes can only be merged with the node at the highest level that is connected to it. Nodes of sites with null FDSV are eliminated from the graph if they are not merged with any other node. These eliminated nodes are due to the presence of procedures that access several objects (e.g. the main function in C programs). A node is merged with a lower level node with the help of the software engineer and using the information related to the two nodes in Table I, StDG, and RSG. Two groups are identified in the example shown in Figure 6.

6. When an operation of one object uses a variable of another object, introduce a selector operationin the second object. That is, if there is an edge from c1 to c2 and an operation in c1 uses avariable of c2, a selector operation is introduced. Similarly, a modifier or an iterator operation isintroduced if one object defines a variable of another.

7. If a site (s1) defining a control variable and sites (s2) connected to s1 are grouped into differentobjects, then duplicate s1 and add it to each object that has a site connected to s1.
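Steps 1 to 3 can be sketched in C using the FDSV column of Table I: a state variable belongs to the object of its data structure, and a site joins the object that owns its FDSV. The `Member` and `Operation` types and the function names are hypothetical, and the later steps (SRG construction and merging) are not covered by this sketch.

```c
#include <assert.h>
#include <string.h>

/* Step 1: each state variable belongs to the object of its data structure. */
typedef struct { const char *var; const char *object; } Member;

/* A site (operation) with its final def state variable; NULL means null FDSV. */
typedef struct { const char *site; const char *fdsv; } Operation;

static const Member members[] = {
    {"base", "stack"}, {"sp", "stack"}, {"size", "stack"},
    {"front", "queue"}, {"back", "queue"},
};

/* Returns the object owning var, or NULL if var is not a known state variable. */
const char *object_of_var(const char *var) {
    for (size_t i = 0; i < sizeof members / sizeof members[0]; i++)
        if (strcmp(members[i].var, var) == 0) return members[i].object;
    return NULL;
}

/* Step 2: an operation joins the object of its FDSV. A NULL result marks a
   null-FDSV operation, which step 3 keeps as a candidate object of its own. */
const char *object_of_operation(const Operation *op) {
    return op->fdsv ? object_of_var(op->fdsv) : NULL;
}
```

With the Table I data, push (FDSV sp) is grouped with the stack object, enq (FDSV front) with the queue object, and isEmptyQ (null FDSV) is left as a separate candidate until the SRG merging step.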

5.4. Object formation

We complete the object identification process by using the variables and operations identified in the previous steps. Examples of the decision rules for constructing the objects include:

• if the FDV of a modifier operation is a local variable, add a return statement after each statement defining the local variable;
• operations are given user-supplied semantically meaningful names, and duplicate operations in each object are removed;
• if an operation of one object uses a state variable of another object and the operations in the two objects are connected by an edge, add the variable as a formal parameter;
• introduce constructors and destructors for each object;
• sites with null FDSV, USV, and DSV are grouped with the sites that are connected to them.

More information about the rules can be found in [13].

Figure 6. State reference graph.

6. OBJECT IDENTIFICATION PROCESS

To demonstrate the object extraction process, we use the sample program shown in Figure7. Followingthe process in Figure1, we first generate the StDG. We then apply compaction techniques to the StDG

Copyright 2000 John Wiley & Sons, Ltd. J. Softw. Maint: Res. Pract.2000;12:305–323

Page 13: Object localization in procedural programs: a graph-based approach

OBJECT LOCALIZATION IN PROCEDURAL PROGRAMS 317

1. int main(){2. long Fitems[MAX];3. long Bitems[MAX];4. int Fsp, Bsp, size=MAXj;5. long it,jt;6. Fsp=Bsp=0;7. for (j=0;j<size;j++){8. Fitems[j]=0;9. Bitems[j]=0;10. }11. ....12. if (Fsp<size)13. Fitems[Fsp++]=it;14. ...15. if (Fsp<size)16. Fitems[Fsp++]=it;17. ....18. if(Bsp)19. jt=Bitems[Bsp-1];20. else if(Fsp)21. jt=Fitems[0];22. else jt=0;23. .....24. if (Bsp)25. jt=Bitems[-Bsp];26. else if(Fsp){27. while (Fsp)28. Bitems[Bsp++]=Fitems[-Fsp];29. jt = Bitems[-Bsp];30. }31. else jt=0;32. .....33. if(Fsp<size)34. Fitems[Fsp++]=it’35. .....36. {

Figure 7. Sample program for object identification.


Figure 8. RSG for the sample program.

to obtain the RSG. The RSG is given in Figure 8, and the statements that compose the sites are shown in Figure 9. From the RSG, we identify variables and statements that are candidates for objects and operations. In this example, we identify the following state variables and operations:

SV = {Fitems, Fsp, Bitems, Bsp}
P = {6a, 6b, 7, 8, 9, 12, 15, 17, 24, 25, 26, 27, 29, 31, 33}
P(Fitems) = {8, 12, 15, 33}
P(Fsp) = {6a}
P(Bitems) = {9, 27}
P(Bsp) = {6b, 25, 29}
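The grouping of sites by state variable can be sketched as a simple table lookup. The sketch below assumes that P(v) collects the sites whose FDSV is v, and it takes the site-to-variable table directly from the sets listed above rather than computing it from the program. Treating sites 7 and 31 as null-FDSV sites is an inference from the fact that they appear in P but in none of the P(v) sets. The names `Site`, `SITES` and `sites_for` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* One RSG site and the state variable it is filed under.
 * fdsv == NULL marks a site with a null FDSV (assumed for 7 and 31). */
typedef struct {
    const char *site;
    const char *fdsv;
} Site;

static const Site SITES[] = {
    {"6a", "Fsp"},    {"6b", "Bsp"},    {"7",  NULL},
    {"8",  "Fitems"}, {"9",  "Bitems"}, {"12", "Fitems"},
    {"15", "Fitems"}, {"17", NULL},     {"24", NULL},
    {"25", "Bsp"},    {"26", NULL},     {"27", "Bitems"},
    {"29", "Bsp"},    {"31", NULL},     {"33", "Fitems"},
};
static const size_t N_SITES = sizeof SITES / sizeof SITES[0];

/* Collect the sites filed under var, or the null-FDSV sites when var
 * is NULL. Stores at most cap site labels in out and returns the
 * total number of matching sites. */
size_t sites_for(const char *var, const char *out[], size_t cap) {
    size_t n = 0;
    for (size_t i = 0; i < N_SITES; i++) {
        const char *f = SITES[i].fdsv;
        int match = (var == NULL) ? (f == NULL)
                                  : (f != NULL && strcmp(f, var) == 0);
        if (match) {
            if (n < cap)
                out[n] = SITES[i].site;
            n++;
        }
    }
    return n;
}
```

Calling `sites_for(NULL, ...)` lists the sites that still have to be attached to an object during the later merging step.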


Site 6:
    6.  Fsp=Bsp=0;
Site 7:
    7.  for (j=0;j<size;j++){
    10. }
Site 8:
    8.  Fitems[j]=0;
Site 9:
    9.  Bitems[j]=0;
Site 12:
    12. if (Fsp<size)
    13.   Fitems[Fsp++]=it;
Site 15:
    15. if (Fsp<size)
    16.   Fitems[Fsp++]=it;
Site 17:
    17. ...
    18. if (Bsp)
    19.   jt=Bitems[Bsp-1];
    20. else if (Fsp)
    21.   jt=Fitems[0];
    22. else jt=0;
    23. ...
Site 24:
    24. if (Bsp)
Site 25:
    25. jt=Bitems[--Bsp];
Site 26:
    26. else if (Fsp){
    30. }
Site 27:
    27. while (Fsp)
    28.   Bitems[Bsp++]=Fitems[--Fsp];
Site 29:
    29. jt=Bitems[--Bsp];
Site 31:
    31. else jt=0;
Site 33:
    33. if (Fsp<size)
    34.   Fitems[Fsp++]=it;

Figure 9. RSG sites.


Figure 10. SRG for the sample program.

Using the candidate object and operation information, we derive the SRG, shown in Figure 10. Based on the merging criteria, the nodes are merged as {26, Fsp} (26 is a null FDSV site), {24, Bsp} (24 is a null FDSV site), {Bsp, Bitems} (connected nodes at the same level), and {Bsp, Bitems, 17} (17 is a null FDSV site). The resulting objects and their operations are:

OV(o1) = {Bsp, Bitems}
OP(o1) = {6b, 25, 29, 9, 17, 27, 29}
OV(o2) = {Fitems}
OP(o2) = {8, 12, 15, 33}
OV(o3) = {Fsp}
OP(o3) = {6a, 26}

As Figure 1 indicates, the knowledge of the software engineer is utilized as an important part of this process. In this example, we assume that the engineer determined that Fitems and Fsp should be included in the same object, resulting in the following objects and operations:

OV(o1) = {Bsp, Bitems}
OP(o1) = {6b, 25, 29, 9, 17, 27, 29}
OV(o2) = {Fitems, Fsp}
OP(o2) = {8, 12, 15, 33, 6a, 26}

By applying the decision rules for object formation, we obtain the following objects:


initF() {              /* site 6 is separated into 6a & 6b */
    6.  Fsp = 0;       /* as it has a multi-def variable */
}                      /* site 6 - part */

initB() {              /* site 6 - part */
    6.  Bsp = 0;
}

initFi(int j) {        /* using - j */
    7.  for (j = 0; j < size; j++) {
    8.      Fitems[j] = 0;
    10. }
}                      /* site 8 */

pushF(long it) {       /* using - it */
    12. if (Fsp < size)
    13.     Fitems[Fsp++] = it;
}                      /* site 12, site 15, site 33 */

popB() {               /* defining non-SV */
    long jt;           /* non-SV defined */
    25. jt = Bitems[--Bsp];
    return jt;
}                      /* site 25, site 29 */

FtoB() {
    27. while (topF())
    28.     Bitems[Bsp++] = popF();
}

popF() {
    28. return Fitems[--Fsp];
}

ptrB() {               /* Bsp - used in sites not part of the object, also in 24 */
    18. return Bsp;
}                      /* site 18, site 24 */

ptrF() {               /* Fsp - used in sites not part of the object, also in 26 */
    20. return Fsp;
}                      /* site 20, site 26 */

topF() {
    21. return Fitems[0];
}                      /* site 21 */

topB() {
    19. return Bitems[Bsp-1];
}                      /* site 19 */

Figure 11. Operations in objects.


OV(o1) = {Bsp, Bitems}
OP(o1) = {initB, initBi, FtoB, popB, topB, ptrB}
OV(o2) = {Fitems, Fsp}
OP(o2) = {initF, initFi, pushF, popF, topF, ptrF}

The operations are shown in Figure 11. They have been assigned meaningful names, and duplicate operations have been removed. The sites that an operation represents are indicated as comments in the operation.
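To make the outcome concrete, the following is one possible compilable C rendering of the two objects, using the operation names of Figure 11 written without spaces. Several details are assumptions made only for this sketch: the struct types `FObj` and `BObj`, the value chosen for `MAX`, the explicit object pointers (C has no classes), the capacity guard in `FtoB`, and the helper `dequeue`, which replays statements 24 to 31 of Figure 7 against the two objects. The loop in `FtoB` tests `ptrF`, following the original `while (Fsp)` of statement 27.

```c
#include <stddef.h>

#define MAX 16                 /* assumed; Figure 7 leaves MAX undefined */

/* Object o2: state variables Fitems and Fsp. */
typedef struct { long items[MAX]; int sp; } FObj;
/* Object o1: state variables Bitems and Bsp. */
typedef struct { long items[MAX]; int sp; } BObj;

/* Constructors split as in the text: site 6a/6b and sites 8/9. */
void initF(FObj *f)  { f->sp = 0; }
void initB(BObj *b)  { b->sp = 0; }
void initFi(FObj *f) { for (int j = 0; j < MAX; j++) f->items[j] = 0; }
void initBi(BObj *b) { for (int j = 0; j < MAX; j++) b->items[j] = 0; }

/* Accessors for state used by sites outside the object. */
int ptrF(const FObj *f) { return f->sp; }
int ptrB(const BObj *b) { return b->sp; }

/* Sites 12, 15 and 33 collapse into one operation. */
void pushF(FObj *f, long it) {
    if (f->sp < MAX)
        f->items[f->sp++] = it;
}

/* Callers are expected to check ptrF()/ptrB() before popping. */
long popF(FObj *f) { return f->items[--f->sp]; }

long popB(BObj *b) {
    long jt;                   /* non-state variable, hence returned */
    jt = b->items[--b->sp];
    return jt;
}

long topF(const FObj *f) { return f->items[0]; }
long topB(const BObj *b) { return b->items[b->sp - 1]; }

/* Site 27: move everything from F to B, reversing the order. */
void FtoB(FObj *f, BObj *b) {
    while (ptrF(f) && ptrB(b) < MAX)
        b->items[b->sp++] = popF(f);
}

/* Hypothetical client code: statements 24-31 of Figure 7. */
long dequeue(FObj *f, BObj *b) {
    if (ptrB(b))
        return popB(b);
    if (ptrF(f)) {
        FtoB(f, b);
        return popB(b);
    }
    return 0;
}
```

With this split, the client code no longer touches `Fsp` or `Bsp` directly. It only calls operations, and the two stacks together behave as a first-in, first-out queue.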

7. SUMMARY

We described a systematic process for object identification in procedural code. This approach extends most object identification methods in that it is not based on the assumption that programs are built around well-defined data structures or that procedures are well-designed. The process, which is based on a graph representation called StDG, includes methods to deal with a procedure that uses the object state of more than one object. We also distinguish between variable uses and definitions. We first identify candidate operations from the StDG. We then restructure the StDG by merging cohesive parts of the graph. We consider the merged parts as potential operations. These operations, along with the state variables and the user-defined data structures, are arranged in a SRG that represents the object structure. From this system hierarchy, objects are identified and encapsulated by streamlining the interfaces.

The re-engineering process is partially automated, thereby enhancing the potential for processing large programs. The construction of the StDG is fully automated and the construction of the RSG requires minimal human intervention. Code changes can be incorporated into the representation, thus eliminating the need for re-derivation of the complete representation when the code is changed. In addition, the merging of the cohesive components early in the process reduces the graph size.
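The grouping of SRG nodes into objects summarized above can be sketched with a small union-find structure. The sketch replays the four merges reported for the sample program in Section 6, optionally followed by the engineer's decision to place Fitems and Fsp in the same object. It is an illustration under stated assumptions rather than the authors' implementation: the merges are hard-coded instead of being derived from graph edges, and the `Node` enum and all function names are hypothetical.

```c
#include <stddef.h>

/* SRG nodes of the sample program that take part in the merges:
 * four state variables and three null-FDSV sites. */
enum Node { FITEMS, FSP, BITEMS, BSP, S17, S24, S26, N_NODES };

static int parent[N_NODES];

void uf_init(void) {
    for (int i = 0; i < N_NODES; i++)
        parent[i] = i;         /* every node starts as its own group */
}

int uf_find(int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];   /* path halving */
        x = parent[x];
    }
    return x;
}

void uf_merge(int a, int b) {
    int ra = uf_find(a);
    int rb = uf_find(b);
    if (ra != rb)
        parent[ra] = rb;
}

int same_object(int a, int b) { return uf_find(a) == uf_find(b); }

/* Replay the merges listed in Section 6. */
void merge_sample(int engineer_joins_f) {
    uf_init();
    uf_merge(S26, FSP);        /* 26 is a null-FDSV site */
    uf_merge(S24, BSP);        /* 24 is a null-FDSV site */
    uf_merge(BSP, BITEMS);     /* connected nodes at the same level */
    uf_merge(S17, BSP);        /* 17 is a null-FDSV site */
    if (engineer_joins_f)
        uf_merge(FITEMS, FSP); /* decision supplied by the engineer */
}

/* Number of groups, i.e. candidate objects. */
int count_objects(void) {
    int n = 0;
    for (int i = 0; i < N_NODES; i++)
        if (uf_find(i) == i)
            n++;
    return n;
}
```

Without the engineer's decision the replay leaves three groups, matching o1, o2 and o3. With it, two groups remain.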

REFERENCES

1. Liu S, Wilde N. Identifying objects in a conventional procedural language: an example of data design recovery. Proceedings Conference on Software Maintenance. IEEE Computer Society Press: Los Alamitos CA, 1990; 266–271.

2. Siff M, Reps T. Identifying modules via concept analysis. Proceedings International Conference on Software Maintenance. IEEE Computer Society Press: Los Alamitos CA, 1997.

3. Wiggerts TA, Baxter I, Quilici A, Verhoef C. Using clustering algorithms in legacy systems remodularization. Proceedings 4th Working Conference on Reverse Engineering (WCRE ’97). IEEE Computer Society Press: Los Alamitos CA, 1997; 33–43.

4. Yeh AS, Harris D, Reubenstein H. Recovering abstract data types and object instances from a conventional procedural language. Proceedings 2nd Working Conference on Reverse Engineering (WCRE ’95). IEEE Computer Society Press: Los Alamitos CA, 1995; 227–236.

5. Weiser M. Program slicing. IEEE Transactions on Software Engineering 1984; SE-10(4):352–357.

6. Chu WC, Patel S. Software restructuring by enforcing localization and information hiding. Proceedings Conference on Software Maintenance. IEEE Computer Society Press: Los Alamitos CA, 1992; 165–171.

7. Achee B, Carver D. Creating object-oriented designs from legacy FORTRAN code. Journal of Systems and Software 1997; 39(2):179–194.

8. Gall H, Klosch R. Finding objects in procedural programs: An alternate approach. Proceedings 2nd Working Conference on Reverse Engineering. IEEE Computer Society Press: Los Alamitos CA, 1995; 208–216.

9. Hutchens D, Basili V. System structure analysis: clustering with data bindings. IEEE Transactions on Software Engineering 1985; 11(8):749–757.


10. Canfora G, Cimitile A, Munro M. An improved algorithm for identifying objects in code. Software Practice and Experience 1996; 26(1):25–45.

11. Rugaber S, Stirewalt K, Wills L. The interleaving problem in program understanding. Proceedings International Conference on Software Maintenance. IEEE Computer Society Press: Los Alamitos CA, 1995; 265–274.

12. Jackson D, Rollins EJ. A new model of program dependences for reverse engineering. Software Engineering Notes, SIGSOFT 1994; 19(5):2–10.

13. Valasareddi R, Carver D. A representation model for procedural program maintenance. Proceedings of the 1999 ACM Symposium on Applied Computing. ACM Press: New York NY, 1999; 580–585.

14. Ferrante J, Ottenstein JK, Warren JD. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems 1987; 9(3):319–349.

AUTHORS’ BIOGRAPHIES

Doris L. Carver is a Professor of Computer Science at Louisiana State University. She received her BS in mathematics from Carson-Newman College, MS in mathematics from the University of Tennessee, and her Ph.D. in computer science from Texas A&M University. Her primary research interests are requirements engineering, reuse, software maintenance, and re-engineering.

Ramachenga Valasareddi received his MS in computer science from the University of Southwestern Louisiana and his Ph.D. in computer science from Louisiana State University. His research interests are software maintenance, re-engineering, and object-oriented software development.
