
JOURNAL OF SOFTWARE MAINTENANCE AND EVOLUTION: RESEARCH AND PRACTICE
J. Softw. Maint. Evol.: Res. Pract. 2010; 22:211–237
Published online 6 November 2009 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/smr.418

Research

Migrating legacy data structures based on variable overlay to Java

Mariano Ceccato1, Thomas Roy Dean2, Paolo Tonella1,∗,† and Davide Marchignoli3

1Fondazione Bruno Kessler—IRST, via Sommarive 18, 38050 Povo, Trento, Italy
2Queen’s University, Kingston, Canada
3Informatica Bancaria Trentina, Trento, Italy

SUMMARY

Legacy information systems, such as banking systems, are usually organized around their data model. Hence, when these systems are migrated to modern environments, translation of the data model involves the most critical decisions, having strong implications on the rest of the translation. In this paper, we report our experience and describe the approaches adopted in migrating a large banking system (ten million lines of code) to Java, starting from a proprietary data model which gives programmers explicit control of the variable overlay in memory. After presenting the basic translation scheme, we discuss the exceptions that may occur in practice. Then, we consider two heuristic approaches useful to reduce the number of cases where a behavior equivalent to that of unions must be reproduced in Java. Finally, we comment on the experimental results obtained so far. Copyright © 2009 John Wiley & Sons, Ltd.

Received 13 February 2009; Revised 9 September 2009; Accepted 10 September 2009

KEY WORDS: reverse engineering; legacy systems migration; object-oriented data model

1. INTRODUCTION

Migrating a legacy system to new technologies and programming languages is a hard-to-make, high-risk decision. However, there are situations in which not making such a decision would expose the software company to an even higher risk, that of losing market share and eventually failing. This is especially true for legacy systems written in proprietary languages and running on proprietary platforms. In fact, customers are increasingly demanding widely adopted and well-supported solutions, often preferring ‘open’ technologies with a large user community.

∗Correspondence to: Paolo Tonella, Fondazione Bruno Kessler—IRST, via Sommarive 18, 38050 Povo, Trento, Italy.
†E-mail: [email protected]

Copyright © 2009 John Wiley & Sons, Ltd.


FBK-IRST is involved in a project aimed at migrating a terminal-based legacy banking system written in a proprietary language to a Java-based application server. The language used by the legacy system is BAL, an acronym for Business Application Language. BAL is a BASIC-like language that contains unstructured data elements (described in Section 2) as well as unstructured control statements (e.g., GOTO). Programs are composed of multiple segments and may also contain user-defined functions. BAL programs are compiled to a byte code representation and run on a virtual machine implemented in the C language. The execution environment, called B2U, is also proprietary.

Persistent data are currently stored in C-ISAM (Indexed Sequential Access Method) tables. The overall goal of the migration project is three-fold: (1) migrating C-ISAM tables to DB2 relational tables; (2) migrating the BAL language to Java; and (3) migrating the character-oriented UI to a GUI. Since big-bang migrations carry a significant risk [1], we initially considered the feasibility of changing the persistence layer first, then the language, and finally the UI. However, this path turned out to be impractical, for performance reasons. In fact, accessing a relational DB from BAL introduces a performance penalty that is quite high, as measured experimentally on typical long computations executed in BAL on C-ISAM vs DB2 tables. The reason is that BAL lacks the constructs to iterate over data retrieved from a relational DB (e.g., iterators over result-sets). Therefore, we decided to give priority to step (2), language migration. In fact, adoption of Java is a key enabler for both (1) and (3). Once all code is migrated to Java, it will be possible to migrate the persistent data from C-ISAM to DB2 and obtain acceptable performance, thanks to the JDBC layer available in Java. It will also be possible to rewrite input forms as graphical ones. In this paper, the focus is on step (2), language migration, and its implications on the data model. Hence, we present the migration of the legacy data model from a programming language perspective. Once an object-oriented (OO) data model is automatically generated, we expect that it will be relatively straightforward to convert the Java data model to a relational data schema. The abstraction and encapsulation provided by the Java OO model combined with the support for database access will simplify the incremental migration from C-ISAM tables to a relational database implementation. Hence, the Java OO model represents an intermediate but necessary step between the original BAL C-ISAM implementation and a final relational database implementation in Java.

At the core of the system being migrated is the persistent data model described explicitly in a so-called application dictionary, which is reflected in the data structures manipulated in the BAL programs. The header files containing the BAL declarations for the persistent data structures are automatically generated from the C-ISAM dictionary, so that they are guaranteed to be always consistent with the C-ISAM tables. Translation of the data model dictates the shape of the translated Java code, which will necessarily revolve around the equivalent of the original data structures. Unfortunately, the gap between the BAL data model and the Java OO data model is not small. In fact, the BAL data model is low level and close to the hardware, where the memory is accessible as an array of bytes. Any byte sequence in memory can be used for variable declaration, with arbitrary overlays and aliasing. The possibility of arbitrary variable overlays is not something special or unique about the BAL language. For example, the assembly language used for mainframe programming has a substantially similar data model, with similar primitives to declare and locate variables in memory. Other cases where a similar problem occurs are in FORTRAN code (equivalence construct), COBOL code (renames/redefines), C/C++ code (union), and, of course, in all assembly languages. It also appears in several DB languages: IMS DL/1, overlapping record subschema definitions in CODASYL DBMS, multiple record-type structures for the same COBOL file and even SQL (through substring operations on large column values in VIEWS). PL/I is another language that supports redefinition of data structures according to alternate views. Hence, the problem of migrating the legacy data structures of BAL is common to all programming languages providing an Assembly-like view of the memory.

In this paper we describe how to bridge the gap between a low-level, byte-array-oriented data model and an OO data model, with the goal of producing readable and maintainable Java code, and, at the same time, acceptable performance of operations for data access and manipulation.

We recognize a set of patterns that indicate a clear intention of the programmers to define a nested data structure, consisting of containers and containees (record–subrecord relation). We also define some heuristics to handle the cases that do not fall exactly into the container/containee pattern. One problem of the legacy data structures is that they admit multiple views over the same memory regions. This corresponds to the notion of union, which is not explicitly supported in Java. We defined a translation scheme, based on the copy-on-read/write protocol, that reproduces the semantics of unions in Java. However, such a scheme produces complex code, which is hard to maintain and involves a substantial performance penalty. In fact, whenever a switch occurs from one view to another, in the translated code an explicit variant switch takes place, which involves copying one object into another. Since those objects are not type compatible, this is achieved through serialization and deserialization, a quite expensive operation.

We investigated two techniques to reduce the number of cases in which the hard-to-maintain and low-performance union data structure has to be simulated in Java. One technique recognizes unions that only provide alternative substring access (SA) to a larger string. The second technique recognizes unions that are fully discriminated by one of their fields, which makes it possible to define an alternative translation scheme, based on inheritance. We investigated the effectiveness of these two approaches when applied to the system being migrated (around ten million lines of BAL code). We report and comment on the results of such an evaluation in this paper.

The remainder of the paper is organized as follows: in Section 2, we describe the basic cases that may occur in the BAL code and how we map them to Java. Exceptions to such basic cases are described in Section 3, where we explain the heuristics used to manage them. In Section 4 we propose two alternative translation schemes for unions that can be applied, respectively, in case of SA or fully discriminated access to a data structure with multiple overlays. Section 5 provides empirical data on the occurrence of the various cases and on the effectiveness of the alternative translations for unions, with reference to the banking system under migration. We also present and comment on the performance data related to the cost of accessing and manipulating data structures converted from BAL to Java. In Section 6, we describe some of the previous work in the area and conclude in Section 7.

2. BASE STRATEGY RULES: EXACT SIZE MATCH

In this section we give a short introduction to the data model provided by the BAL language and provide some examples of the basic ways in which the conventional notions of records and fields are expressed in the BAL language.


    DCL a#        // Byte Variable
    DCL b%        // Short Variable
    DCL c&=5      // BCD Variable, 5 bytes long
    DCL d$=100    // String Variable, 100 bytes long
    DCL e$        // String Variable, 16 bytes long

Figure 1. Primitive types.

While BAL contains some structured control flow statements, such as IF...ENDIF and WHILE...WEND, the data model is very unstructured and similar to that found in structured assembly languages (e.g., that of IBM mainframes). The data model is byte oriented and the language only provides four basic data types: byte, short, binary coded decimal (BCD) and string. The first two are the same as those available in most languages, representing a single byte and two contiguous bytes, respectively. Variables of the BCD and string data types can be of different lengths, and the developer must specify the length (in bytes) if he/she wants something different from the default length. Unlike languages such as C, there is no dynamic allocation, and the length of all variables is known at compile time. Compared with other procedural languages, such as PL/I and COBOL, BAL has fewer data types, which makes the conversion task easier for atomic types.

Figure 1 shows a simple example of variable declarations. The variables a and b are byte and short variables (indicated by the type specifiers ‘#’ and ‘%’). The variable c is a BCD variable (type specifier ‘&’, optional) that takes five bytes of storage. BAL stores the BCD value in its own, proprietary format. The variable d is a string variable 100 bytes long. In the absence of an explicit length (i.e., ‘=〈expression〉’), default lengths of 8 bytes for BCD variables and 16 bytes for string variables are used. Thus, the variable e is a string variable that is 16 bytes long. Arrays of each of the types are also supported by the language, with at most two indexes (i.e., either vectors or matrices).

Even mapping the atomic BAL types to Java types is not straightforward. Byte and short have a natural counterpart in Java, although using byte and short in Java introduces downcasts, since intermediate computations may get automatically promoted to int. BAL strings are different from Java strings in a few respects: they are represented as byte (8 bit) sequences, not as UNICODE character (16 bit) sequences, they are mutable and they have a fixed length. The mapping with the closest semantics would be the Java byte array. However, translation of BAL strings into byte arrays would result in low-quality, poorly maintainable Java code, in that it would deviate from the common Java programming practice and it would need ad hoc support for the manipulation of the translated strings. Instead of resorting to an ad hoc data type based on a byte-array representation, we decided to use the Java type String anyway, by providing proper translations and helper functions, when needed. Modifications of BAL strings are translated into reassignments and proper helper functions are provided for string truncation or padding up to the length declared in BAL. Such helper functions take advantage of annotations that record the original BAL string size. BCD numbers can be mapped to the BigDecimal type in Java, but again some care must be taken. As with BAL strings, the size of the BCD must be recorded in annotations. BigDecimals are also immutable, hence reassignment is needed whenever a BAL BCD is modified. Rounding rules should replicate exactly the same semantics as in BAL; hence, the appropriate mathematical context (MathContext object) should be chosen for all generated BigDecimals, as well as for the intermediate arithmetic.
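The mapping of atomic types described above can be illustrated with a minimal sketch. The class name BalTypesSketch, the helper fit, the choice of a space as padding character and the 15-digit HALF_UP context are all assumptions made for illustration; the actual helper functions and BAL rounding rules are not given in the text.

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

final class BalTypesSketch {
    // Truncates or pads (with spaces, an assumed padding character) so that
    // the result always has the length declared for the BAL string.
    static String fit(String value, int size) {
        if (value.length() >= size) {
            return value.substring(0, size);
        }
        StringBuilder sb = new StringBuilder(value);
        while (sb.length() < size) {
            sb.append(' ');
        }
        return sb.toString();
    }

    // Assumed arithmetic context; the real one must replicate BAL rounding.
    static final MathContext BCD_CONTEXT = new MathContext(15, RoundingMode.HALF_UP);

    // BigDecimal is immutable, so every BCD update is a reassignment.
    static BigDecimal add(BigDecimal x, BigDecimal y) {
        return x.add(y, BCD_CONTEXT);
    }

    public static void main(String[] args) {
        // DCL d$=8 : an in-place change of one character becomes a reassignment.
        String d = fit("HELLO", 8);
        d = fit(d.substring(0, 2) + "Y" + d.substring(3), 8);
        System.out.println("[" + d + "]");

        BigDecimal c = new BigDecimal("1.25");
        c = add(c, new BigDecimal("2.50"));
        System.out.println(c);
    }
}
```

Keeping the declared size outside the String value (here as an explicit argument, in the paper as an annotation) is what lets the translated code stay on the standard String type.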


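[Memory layout: a (9 bytes) spans b (5 bytes) followed by c (4 bytes); b in turn spans d (1 byte), e (1 byte) and f (3 bytes).]

    #ifdef A                      a   DS   CA9
    DCL a$ = 9                    b   EQU  a
    FIELD = M, a                      DS   CA5
      DCL b$ = 5                  c   DS   CA4
      DCL c$ = 4                  d   EQU  b
    FIELD = M, b                      DS   B1
      DCL d#                      e   DS   B1
      DCL e#                      f   DS   CA3
      DCL f$ = 3
    #endif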

Figure 2. Simple containment with exact size match.

2.1. Simple overlay

In BAL, variables are laid out sequentially in memory, with global variables in the global space and local variables on the data stack. Grouping of variables into records is done by explicitly overlaying variables by giving them overlapping positions in memory. This is accomplished with the FIELD=M,〈variable〉 statement, as shown in Figure 2. The code starts by declaring a string variable a of length nine. The FIELD statement resets the current variable position (i.e., the position of the next declared variable) in memory to the beginning of the variable a, and as a result, the string variable b has the same starting position as a, but a shorter length. The variable c that follows b is assigned to the next location in memory after b, which is also within the boundaries of the variable a. In fact, both variables (total length of nine bytes) are contained within variable a. Thus, an assignment to the variable a will also change both b and c, whereas an assignment to b will only change the first 5 bytes of the variable a.

The second FIELD statement resets the current variable position to the beginning of b (which is also the beginning of a), and the three variables, d, e and f are all allocated from that position. Figure 2 (top) shows the position of variables in memory diagrammatically. Using the FIELD=M statement without a variable name resets the current variable position to the first free position in memory. In our example, if appended at the end of the declarations, such a statement would move the next data position available in memory immediately past the end of variable a, since all of the other variables are located within the space allocated to a. The right-hand listing in Figure 2 shows an equivalent data structure in mainframe assembly language (DS=allocate data storage, CA=ASCII string, B1=binary byte). The EQU directive is the equivalent of the FIELD=M,〈variable〉 statement.

As can be seen in the figure, C style preprocessing statements are available to the developer. Data structures are kept in separate files (some of which are automatically generated from ISAM tables), which are included using the #include directive. Macro definitions are used to select (via #ifdef) which data structures to instantiate (e.g. #ifdef A in Figure 2).
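The effect of the overlay in Figure 2 can be reproduced with a small simulation that treats memory as a byte array and each BAL variable as an (offset, length) window on it. This is only an illustration of the BAL semantics, not the translation adopted in the paper; the class OverlaySketch, its constants and the space padding are assumptions.

```java
import java.nio.charset.StandardCharsets;

final class OverlaySketch {
    // Offsets and lengths implied by the declarations of Figure 2.
    static final int A_OFF = 0, A_LEN = 9;
    static final int B_OFF = 0, B_LEN = 5;
    static final int C_OFF = 5, C_LEN = 4;
    static final int F_OFF = 2, F_LEN = 3;

    private final byte[] memory = new byte[A_LEN];

    // Stores a string into a window, padding with spaces (assumed) or truncating.
    void write(int offset, int length, String value) {
        byte[] src = value.getBytes(StandardCharsets.ISO_8859_1);
        for (int i = 0; i < length; i++) {
            memory[offset + i] = i < src.length ? src[i] : (byte) ' ';
        }
    }

    // Reads a window back as a string, one character per byte.
    String read(int offset, int length) {
        return new String(memory, offset, length, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        OverlaySketch m = new OverlaySketch();
        m.write(A_OFF, A_LEN, "ABCDEFGHI");           // assignment to a
        System.out.println(m.read(B_OFF, B_LEN));     // b sees the first 5 bytes
        System.out.println(m.read(C_OFF, C_LEN));     // c sees the last 4 bytes
        m.write(B_OFF, B_LEN, "xy");                  // assignment to b
        System.out.println("[" + m.read(A_OFF, A_LEN) + "]"); // only 5 bytes of a changed
    }
}
```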


    public class A {
      Aa a = new Aa();
      class Aa {
        Ab b = new Ab();
        class Ab {
          byte d;
          byte e;
          @Field(size=3)
          String f;
        }
        @Field(size=4)
        String c;
      }
    }

Figure 3. Simple containment.

There are several consequences to the approach taken by the BAL language. The first consequence is that records do not introduce any additional lexical scope: the name space is flat and there is no equivalent of the dot notation (e.g., a.b), common in languages such as C and Java. The second consequence is that it is the developers’ responsibility to ensure that the sizes of the variables are correct. For example, in Figure 2, the variable a is intended to be a reference to the entire record. If the size of c is changed to five, then the size of a should also be changed. The last consequence is that there are many ways of expressing the exact layout of variables within memory. The last two consequences make the recovery of a structured record from a sequence of BAL declarations difficult.

For the record data type, which is obtained in BAL through the FIELD=M construct, the mapping to Java is straightforward in case of simple containment with exact size match (as depicted in Figure 2). Figure 3 shows the Java code produced for the example in Figure 2. Nested FIELD=M instructions are mapped to inner classes in Java. BAL strings used as record containers become Java objects, the type of which is the Java class corresponding to the FIELD=M defined upon them. For example, the BAL strings a and b in Figure 2 are turned into the two objects a and b, declared as class attributes within class A and Aa, and initialized with instances of class Aa and Ab, respectively.

The translation shown in Figure 3 makes the assumption that records are either accessed through their fields (the leaves of the containment tree) or, as a whole, through the container itself, used as a reference to the record. For example, if field b is read or written as a BAL string in the BAL code, the generated Java object must resort to serialization methods (e.g., readFrom and writeTo) to properly assign values to its attributes.

The @Field annotation is used to preserve the BAL size of non-primitive types, such as strings and BCD numbers. These annotations, along with the primitive types and structures, are later converted to constants that give the size and position of each field to assist in serialization.
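A minimal sketch of how a container such as b in Figure 3 could be read or written as a whole is shown next. The declaration of the @Field annotation, the constant names, and the String-based serialize/deserialize methods are assumptions: they stand in for the readFrom and writeTo methods mentioned above, whose actual code is not shown in the paper.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

final class RecordSketch {
    // Hypothetical declaration of the size annotation; the paper only shows its use.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Field { int size(); }

    // Inner record of Figure 3: b is redefined as d, e and f.
    static final class Ab {
        // Constants of the kind the annotations are converted to (names assumed).
        static final int D_OFFSET = 0, E_OFFSET = 1, F_OFFSET = 2, F_SIZE = 3, SIZE = 5;

        byte d;
        byte e;
        @Field(size = 3)
        String f = "   ";

        // Stand-in for writeTo: one character per BAL byte.
        String serialize() {
            return "" + (char) (d & 0xFF) + (char) (e & 0xFF) + fit(f, F_SIZE);
        }

        // Stand-in for readFrom: used when b is assigned as a whole string.
        void deserialize(String image) {
            String s = fit(image, SIZE);
            d = (byte) s.charAt(D_OFFSET);
            e = (byte) s.charAt(E_OFFSET);
            f = s.substring(F_OFFSET, F_OFFSET + F_SIZE);
        }
    }

    // Truncates or pads with spaces (assumed padding) to the declared size.
    static String fit(String value, int size) {
        StringBuilder sb = new StringBuilder(value);
        while (sb.length() < size) {
            sb.append(' ');
        }
        return sb.substring(0, size);
    }

    public static void main(String[] args) {
        Ab b = new Ab();
        b.deserialize("ABxyz");                 // BAL: assignment to b as a string
        System.out.println(b.d + " " + b.e + " " + b.f);
        b.f = "Q";                              // field access through a leaf
        System.out.println("[" + b.serialize() + "]"); // BAL: read of b as a string
    }
}
```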

2.2. Multiple overlay

As with many legacy applications, the developers sometimes use alternate views of the same memory. The root cause of this descends from the persistence layer, in our case ISAM tables, where multiple record types are often hosted inside the same table for performance optimization reasons or only because it is permitted by the language. In the source code, this turns out to be similar to the union construct, provided by languages such as C and C++. Figure 4 shows an example. Variable a, with length 9, has been redefined twice: once by two strings b and c, and once by three variables, d, e and f. The variable d is a byte, whereas the variable f is a short. The variable e is a three element array of strings, where each element has a length of two. An assignment to the variable b will change the values of the variable d and the first two elements of the array e.

[Memory layout: a (9 bytes) is overlaid by two alternative views: b (5 bytes) followed by c (4 bytes); and d (1 byte), e (3 elements of 2 bytes each) and f (2 bytes).]

    DCL a$ = 9
    FIELD = M, a
      DCL b$ = 5
      DCL c$ = 4
    FIELD = M, a
      DCL d#
      DCL e$ = 2 (3)
      DCL f%

Figure 4. Union.

Java does not have direct programming support for unions. The idea behind unions is that an object is made accessible through multiple views. In Java, such multiple accessibility can be expressed by making the object implement multiple interfaces, each of which is associated with one of the multiple views. In order to avoid replication of data, the union object implements the copy-on-read/write protocol, which allows lazy creation and update of the alternative views available from the object.

Figure 5 shows the translation of the union in Figure 4. Class UnionAa implements the two interfaces associated with the two alternative views defined in Figure 4 for variable a. The first view exposes the fields b and c, hence the related interface (Aa1Int) has getter and setter methods for the corresponding class attributes b and c. Similarly, the second interface will expose getters and setters for d, e and f (not shown in Figure 5 for space reasons). Since UnionAa implements both interfaces, it must expose getters and setters for all fields in all alternative views (i.e., b, c, d, e, f).

Lazy creation and update of the alternative views in a union are achieved by initializing the union fields for the variants to null. In Figure 5, inside class UnionAa both attributes a1 and a2 are initialized to null. When a setter or getter is invoked on the union object, a switchVariant operation is invoked if the current active variant of the union is different from the requested one. Then, the set or get operation can be delegated to the proper object (a1 or a2 in our example). The switchVariant operation is responsible for creating the requested variant, if the related attribute has null value, and for copying the field values from any other non-null variant, in case it exists. The switchVariant operation ensures that at each point in time only one union variant has non-null value, so it must also take care of assigning null to the copied non-null variant, when it is there.

    public class A {
      UnionAa a = new UnionAa();
      class UnionAa implements Aa1Int, Aa2Int {
        Aa1 a1 = null; // lazy creation
        Aa2 a2 = null; // lazy creation
        String getB() { ... return a1.getB(); }
        void setB(String b) {
          if (a1 == null) switchVariant(...);
          a1.setB(b);
        }
        ...
      }
      class Aa1 implements Aa1Int {
        @Field(size=5)
        String b;
        String getB() { ... }
        void setB(String b) { ... }
        @Field(size=4)
        String c;
        String getC() { ... }
        void setC(String c) { ... }
      }
      class Aa2 implements Aa2Int { ... }
    }
    interface Aa1Int {
      String getB(); void setB(String b);
      String getC(); void setC(String c);
    }
    interface Aa2Int { ... }

Figure 5. Union.

With reference to Figure 5, if setB is called and both a1 and a2 are null, the switchVariant method will create an Aa1 object and assign it to a1. If setB is called and a2 is non-null, switchVariant will copy all fields of a2 into fields of a1. Since fields may not be aligned and may be of different types, field copy from one variant to another resorts to the serialization operations readFrom(Reader) and writeTo(Writer) (not shown in Figure 5 for space reasons), to be used whenever a switch from one union variant to another occurs.
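The copy-on-read/write protocol can be sketched as follows for the union of Figure 4. This is a simplified illustration under several assumptions: the interfaces are omitted, switchVariant is split into two hypothetical methods (switchToA1, switchToA2), serialization goes through a 9-character String image instead of readFrom(Reader) and writeTo(Writer), and the short f is laid out big-endian.

```java
final class UnionSketch {
    // View 1 of Figure 4: b (5 bytes) followed by c (4 bytes).
    static final class Aa1 {
        String b = "     ";
        String c = "    ";
        String serialize() { return b + c; }
        void deserialize(String image) {
            b = image.substring(0, 5);
            c = image.substring(5, 9);
        }
    }

    // View 2 of Figure 4: d (byte), e (3 strings of 2 bytes), f (short).
    static final class Aa2 {
        byte d;
        String[] e = {"  ", "  ", "  "};
        short f;
        String serialize() {
            // Big-endian layout of the short is an assumption of this sketch.
            return "" + (char) (d & 0xFF) + e[0] + e[1] + e[2]
                    + (char) ((f >> 8) & 0xFF) + (char) (f & 0xFF);
        }
        void deserialize(String image) {
            d = (byte) image.charAt(0);
            for (int i = 0; i < 3; i++) {
                e[i] = image.substring(1 + 2 * i, 3 + 2 * i);
            }
            f = (short) ((image.charAt(7) << 8) | (image.charAt(8) & 0xFF));
        }
    }

    static final class UnionAa {
        private Aa1 a1 = null; // lazy creation
        private Aa2 a2 = null; // lazy creation

        // Makes view 1 active, copying from view 2 if that one holds the data.
        private void switchToA1() {
            if (a1 != null) return;
            a1 = new Aa1();
            if (a2 != null) {
                a1.deserialize(a2.serialize());
                a2 = null; // only one variant is non-null at any time
            }
        }

        // Makes view 2 active, copying from view 1 if that one holds the data.
        private void switchToA2() {
            if (a2 != null) return;
            a2 = new Aa2();
            if (a1 != null) {
                a2.deserialize(a1.serialize());
                a1 = null;
            }
        }

        String getB() { switchToA1(); return a1.b; }
        void setB(String b) { switchToA1(); a1.b = fit(b, 5); }
        byte getD() { switchToA2(); return a2.d; }
        String getE(int i) { switchToA2(); return a2.e[i]; }
        void setE(int i, String v) { switchToA2(); a2.e[i] = fit(v, 2); }
    }

    // Truncates or pads with spaces (assumed padding) to the declared size.
    static String fit(String value, int size) {
        StringBuilder sb = new StringBuilder(value);
        while (sb.length() < size) sb.append(' ');
        return sb.substring(0, size);
    }

    public static void main(String[] args) {
        UnionAa a = new UnionAa();
        a.setB("HELLO");                 // writes through view 1
        System.out.println(a.getD());    // view 2: first byte of b
        System.out.println(a.getE(0) + "|" + a.getE(1));
        a.setE(0, "ab");                 // writes through view 2
        System.out.println(a.getB());    // view 1 reflects the change
    }
}
```

Every alternation between views pays for one serialization and one deserialization, which is the cost the alternative translations discussed later in the paper try to avoid.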

3. EXCEPTIONS TO THE BASIC RULES

In this section, we examine variable declarations in BAL that deviate from the basic cases described in the previous section. For each case, we describe how we manage to reverse engineer a structured representation of the data. Since cases have been discovered heuristically and do not cover the entire set of possibilities offered by BAL, there is a chance that none of the cases presented in this section applies, with the consequence that our reverse engineering technique fails and manual intervention is required. Manual intervention is also required when a known problem is recognized automatically, but we have no automated solution for it (e.g., missing container described below). The amount and cost of such manual interventions are empirically assessed later.

[Memory layout: a overlays b followed by c.]

    DCL b& = 5
    DCL c$ = 3
    FIELD = M, b
      DCL a$ = 9

Figure 6. Inversion.

3.1. Inversion

Figure 6 represents a common alternative way of expressing the same top level structure as shown in Figure 2 (w.r.t. variables a, b and c only). In this example, the developer has first specified the sequence of fields in the structure before overlaying the fields with a single larger variable, which is used to reference the fields as a whole. The overall layout in memory, however, remains the same.

The case of a FIELD=M with inversion (Figure 6) is mapped to Java similarly to simple containment (Figure 3), once the container has been recognized. The heuristic to recognize an inversion is the following: a FIELD=M instruction refers to a variable (e.g., b) smaller than the first following declaration (a). Then, the exact size match condition is verified assuming that the redefiner (a) is the container and the redefinees (b, c) are the record fields. If the sum of the sizes of the redefinees is equal to the size of the candidate container (redefiner), an inversion is detected and mapped to the Java class described previously (Figure 3).
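The inversion check amounts to a size comparison followed by a sum. The sketch below is an assumed formulation: the class and method names are hypothetical, and the example uses the sizes of a, b and c from Figure 2 (9, 5 and 4 bytes).

```java
final class InversionSketch {
    // redefineeSizes: sizes of the declarations preceding the FIELD=M statement,
    //                 starting with the variable it names (e.g., b, then c).
    // redefinerSize:  size of the first declaration following FIELD=M (e.g., a).
    static boolean isInversion(int[] redefineeSizes, int redefinerSize) {
        // Step 1: the variable named in FIELD=M must be smaller than the redefiner.
        if (redefineeSizes.length == 0 || redefineeSizes[0] >= redefinerSize) {
            return false;
        }
        // Step 2: exact size match, with the redefiner taken as the container.
        int total = 0;
        for (int size : redefineeSizes) {
            total += size;
        }
        return total == redefinerSize;
    }

    public static void main(String[] args) {
        System.out.println(isInversion(new int[] {5, 4}, 9)); // fields fill the container
        System.out.println(isInversion(new int[] {9}, 5));    // ordinary redefinition
    }
}
```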

3.2. Missing container

The existence of a container for the entire record is neither enforced nor necessary in BAL. In fact, the first field of the record can be used as a reference to the beginning of the record and a FIELD=M instruction, followed by a list of declarations that exceed the field size, can be used to access the full record. An example of this programming style is shown in Figure 7. This data structure is a union for which no container variable is defined. The two views available in this union (either a record with fields a, b, or a record with c, d) are accessed through the first record field (either a or c), although its size is less than the entire record size. Access to the next fields (e.g., b) is easily achieved via FIELD=M,a followed by proper declarations (e.g., DCL aa$=5, DCL b$=4).

[Memory layout: a (5 bytes) followed by b (4 bytes), overlaid by c (2 bytes) followed by d (7 bytes).]

    #ifdef A
    DCL a$ = 5
    DCL b$ = 4
    FIELD = M, a
      DCL c$ = 2
      DCL d$ = 7
    #endif

Figure 7. Missing container.

Currently, we have no heuristics to manage this exception. The exact size match condition is clearly violated and no inversion can be detected. As a consequence, our tool reports a size mismatch error to be fixed manually. The manual fix consists of adding a surrounding container to the declarations shown in Figure 7, e.g., DCL aa$=9 before the declaration of a, which is turned into a redefinition of aa. Once the missing container has been added in the BAL code, mapping to Java of this example can be achieved along the lines described in the previous section for unions (Figure 5).

3.3. Wrong container size

Sometimes, the container for the whole record exists but has the wrong size. In fact, it is the programmer's responsibility to indicate the size of all variables, including those that act merely as containers. If, during software maintenance, any field size changes, the change must be propagated to all container variables for the changed field. Such a propagation is manual in the source code, whereas it is tool-supported for the code generated automatically from ISAM tables. In both cases, the programmer is in charge of performing the size update. In most cases the compiler does not complain and, as long as enough memory is allocated for the data structure, no run-time error ever shows up. So, from the point of view of BAL programming, it is acceptable. However, recognizing a single data structure with a container may become difficult in such a situation.

Figure 8 shows an example where the declared container size is 7 instead of 9. Apparently, half of variable c is declared inside the data structure, whereas the other half of it is a part of the next free memory positions. This is the typical hint of a wrong container size. However, it may be hard to determine how many declarations following b should be attributed to the record a. Consistent declaration of fields for a total size not exceeding 9 in the alternative views of this union indicates that the correct container size is probably 9 in this example.

The heuristics to recognize all cases of wrong container sizes are detailed in the next subsection. In the example in Figure 8, once we are able to recognize that the container size should be incremented, we can apply the same translation used for the basic case shown in Figure 4, resulting in a Java class similar to the one in Figure 5.

3.4. Other mismatching cases

The problem of grouping variable declarations according to the fields they are redefining can be formulated as a bracketing problem, as described in detail in a companion paper focused on the bracketing transformations [2]. When producing such declaration grouping, the size of the different


    #ifdef A
    DCL a$ = 7
    FIELD = M, a
      DCL b$ = 5
      DCL c$ = 4
    FIELD = M, a
      DCL d$ = 2
      DCL e$ = 2
      DCL f$ = 3
    #endif

Figure 8. Wrong container size.

memory overlays may not match. In particular the following cases can occur:

Case 1 Exact size match.

Case 2 Redefinition uses less memory than the original variable.

Case 3 Redefinition uses more memory than the original variable.

The first case represents the ideal situation, a perfect match between a variable and its redefinitions. This is the case of variable b and its redefinition as d, e and f in Figure 2. In this case we consider the redefinition finished when a variable is appended that makes the size of the content match the size of the container exactly.

In the second case, the sum of the sizes of the variables within a redefinition is smaller than the original variable size. For instance, this would occur if c had size 3 instead of 4 in Figure 2. Technically, the redefinition must then be closed explicitly by one of the following stopping conditions:

Case 2.1 The redefinition is explicitly closed by a FIELD=M statement, which resets the memory pointer to the next free available position.

Case 2.2 Another redefinition of the same variable starts before the full size is reached.

Case 2.3 A redefinition of another variable starts before the full size is reached.

Case 2.4 Declarations in the code are not enough to fit the size because the end of the variable declaration section is reached.

Case 3 occurs when the redefinition does not reach exactly the size of the original variable (case 1) and the redefinition is not terminated explicitly (case 2). In this case the redefinition is considered to be closed when a variable is added that crosses the boundary of the enclosing variable.


Since we do not know whether this structure corresponds to the actual intent of the developers, an error is reported and a manual fix is requested. However, we recognize this instance as an explicit intention of the developer when one of the following stopping conditions appears immediately after the last variable in the redefinition:

Case 3.1 The redefinition is followed by the FIELD=M statement that resets the memory pointer.

Case 3.2 The redefinition is followed by another redefinition of the same variable.

Case 3.3 The redefinition is followed by a redefinition of another variable.

Case 3.4 There are no other declarations in the code because the end of the variable declaration section is reached.

An example of Case 3 is shown in Figure 8, where the declaration of c crosses the boundary of a. Since the declaration of c is immediately followed by a stopping condition (case 3.2), bracketing of the redefinition of a can be completed automatically, without requiring any user intervention. The declarations of b and c are put inside the square brackets for the FIELD=M,a instruction. For d, e and f in the example shown in Figure 8 the stopping condition that applies is number 3.4 (end of declaration section).
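The case analysis above can be sketched in Java as follows. This is only an illustration under simplifying assumptions, not the bracketing implementation of the companion paper: the class name, the `Outcome` enum and the input convention (the sizes of the declarations that follow a FIELD=M, listed up to the next stopping condition) are all hypothetical.

```java
import java.util.List;

/** Hypothetical classifier for the size-matching cases of Section 3.4. */
final class RedefinitionCases {

    enum Outcome { CASE_1, CASE_2, CASE_3, ERROR }

    /**
     * @param containerSize declared size of the redefined variable
     * @param fieldSizes    sizes of the declarations following FIELD=M, in order,
     *                      up to the next stopping condition (assumed input format)
     */
    static Outcome classify(int containerSize, List<Integer> fieldSizes) {
        int used = 0;
        for (int i = 0; i < fieldSizes.size(); i++) {
            used += fieldSizes.get(i);
            if (used == containerSize) {
                return Outcome.CASE_1; // content fills the container exactly
            }
            if (used > containerSize) {
                // Boundary crossed: accepted only when the crossing variable is
                // immediately followed by the stopping condition that ends the list.
                boolean isLast = (i == fieldSizes.size() - 1);
                return isLast ? Outcome.CASE_3 : Outcome.ERROR;
            }
        }
        // A stopping condition was reached before the container was filled.
        return Outcome.CASE_2;
    }
}
```

With the first redefinition of Figure 8 (container a of size 7, followed by b of size 5 and c of size 4), `classify(7, Arrays.asList(5, 4))` yields `CASE_3`, since c crosses the boundary of a and is the last declaration before the next FIELD=M.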

4. AVOIDING UNIONS

Unions are not a part of the programming model of Java. We obtain the equivalent of unions by means of Java classes that implement multiple interfaces and adhere to the copy-on-read/write protocol. However, such a programming style is not common in Java. Maintaining Java code such as that reported in Figure 4 may be quite hard, especially for programmers who do not have a deep and thorough understanding of the rationale behind variants, lazy creation and copy on read/write, as implemented in Figure 4. Moreover, the copy-on-read/write protocol is particularly expensive from the computational point of view, since it resorts to complete serialization and deserialization of an object whenever the view used in the program changes. In the original BAL programs, such view changes are quite frequent, since they involve no performance penalty. On the contrary, in Java each such change involves substantial computation and data conversion. In summary, there are good reasons to avoid unions in Java whenever possible. In this section we describe two heuristic approaches that can be used to limit the generation of unions during migration to Java. In the next section we present experimental results on their effectiveness.

4.1. Recognizing unions used to provide SA

Often, the reason why BAL programmers declare multiple views on the same data region (hence defining a union) is to have a simple way to access some interesting substrings of the given data region. Let us consider the example in Figure 9. The outermost record container a has three fields, k, i and d, which can be interpreted as the record key, separator and data part (this is a common data organization used by BAL programmers). The key k may be structured in such a way that some of its substrings contain relevant information, which is meaningful


    DCL a$ = 9
    FIELD = M, a
      DCL k$ = 4
      FIELD = M, k
        DCL k11$ = 1
        DCL k12$ = 3
      FIELD = M, k
        DCL k21$ = 2
        DCL k22$ = 2
      DCL i#
      DCL d$ = 4

Figure 9. Views on union k provide substring access to k.

to programmers on its own. For example, a bank account can often be split into meaningful subparts. In our example, we assume that four substrings of k are meaningful: k11, k12, k21, k22. Since the first two substrings overlap with the last two, according to the migration rules described in the previous section, a union is associated with the declaration of variable k. This means that the Java translation of k would be similar to class UnionAa in Figure 5, making the resulting Java code substantially harder to comprehend and less efficient than the original one.

A better translation for the code in Figure 9 is shown in Figure 10. In this translation, the attribute k is a string, not a union. All SAs to its parts are achieved by means of proper accessor (setter and getter) methods, which can be generated automatically from the size information available in the original BAL code. Getters return substrings, while setters concatenate the modified substring with the unchanged parts of k, obtained as substrings of k.

For example, k11 is obtained by the related getter as the substring of k starting at 0 and having length 1. In Figure 10, we use BALStrings.substring instead of the Java native method substring, since we need an exception-free version of substring, which is compliant with the original BAL semantics. Moreover, trailing blanks are removed from the Java strings (whereas blank padding is automatic in BAL). This is achieved through the normalize method, which also truncates strings exceeding the declared size.

The setter for k11 concatenates the new value of k11 (i.e., its formal parameter k11) with the remaining part of k. The result is assigned to k (after normalization). Before performing the concatenation, a BALStrings.expand operation is necessary to realize the automatic blank padding semantics of BAL in Java.

The pseudo-code of the algorithm used to identify unions that can be translated into Java classes with proper substring accessors is shown in Figure 11. Its first step introduces an SA annotation


    public class A {
        @Field(size=4)
        String k;
        byte i;
        @Field(size=4)
        String d;

        public String getK() { return k; }
        public void setK(String k) { this.k = BALStrings.normalize(k, 4); }

        // k11 is the substring of k starting at 0 and having length 1
        public String getK11() {
            return BALStrings.normalize(BALStrings.substring(k, 0, 1), 1);
        }

        // new k11 followed by the unchanged remainder of k (offset 1, length 3)
        public void setK11(String k11) {
            k = BALStrings.normalize(BALStrings.expand(k11, 1) +
                BALStrings.expand(BALStrings.substring(k, 1, 3), 3), 4);
        }
        ...
    }

Figure 10. Java translation of fields with substring access.
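The helper operations used in Figure 10 are not listed in the paper. The following sketch shows one way they might behave, based only on the prose description (exception-free substring, removal of trailing blanks with truncation, blank padding). The class names `BalStringsSketch` and `KeySketch` are hypothetical, and the treatment of null and out-of-range arguments is an assumption.

```java
/** Hypothetical stand-in for the BALStrings helper used in Figure 10. */
final class BalStringsSketch {

    /** Exception-free substring: out-of-range parts are simply absent from the result. */
    static String substring(String s, int start, int length) {
        if (s == null || start < 0 || length <= 0 || start >= s.length()) {
            return "";
        }
        int end = Math.min(s.length(), start + length);
        return s.substring(start, end);
    }

    /** Truncates to the declared size, then drops trailing blanks. */
    static String normalize(String s, int size) {
        String t = (s == null) ? "" : s;
        if (t.length() > size) {
            t = t.substring(0, size);
        }
        int end = t.length();
        while (end > 0 && t.charAt(end - 1) == ' ') {
            end--;
        }
        return t.substring(0, end);
    }

    /** Pads with blanks (or truncates) so that the result has exactly the declared size. */
    static String expand(String s, int size) {
        StringBuilder sb = new StringBuilder(s == null ? "" : s);
        if (sb.length() > size) {
            sb.setLength(size);
        }
        while (sb.length() < size) {
            sb.append(' ');
        }
        return sb.toString();
    }
}

/** Key k of size 4 with accessors for k11 (offset 0, size 1) and k12 (offset 1, size 3). */
final class KeySketch {
    private String k = "";

    String getK() { return k; }
    void setK(String k) { this.k = BalStringsSketch.normalize(k, 4); }

    String getK11() {
        return BalStringsSketch.normalize(BalStringsSketch.substring(k, 0, 1), 1);
    }

    void setK11(String k11) {
        k = BalStringsSketch.normalize(
                BalStringsSketch.expand(k11, 1)
                + BalStringsSketch.expand(BalStringsSketch.substring(k, 1, 3), 3), 4);
    }

    String getK12() {
        return BalStringsSketch.normalize(BalStringsSketch.substring(k, 1, 3), 3);
    }
}
```

Under these assumptions, setting k to "AB" and then calling `setK11("Z")` leaves k equal to "ZB": the setter pads both parts to their declared sizes before concatenating, and normalization removes the trailing blanks again.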

Input: BAL program
Output: BAL program with some unions annotated as SA (substring access)

1 Identify leaf substring accesses
    • annotate with @SA each DCL of string type in the scope of a FIEL[...]
2 Repeat
3   [...]
      • annotate with @SA each FIELD=M that contains only DCL annotated with @SA
4   Identify SA unions
      • [...]notated with @SA
  Until no more SA annotation is added
5 Cleanup
    • keep @SA annotations only when associated with a union DCL (i[...]

Figure 11. Algorithm to recognize substring access in unions.

for the declarations of string type that are in the scope of a redefinition (FIELD=M) and are not redefined. When all declarations in the scope of a redefinition are annotated as SA, the entire redefinition is also annotated as SA (Step 3 in Figure 11). When all redefinitions of a declaration are annotated as SA, the declaration itself becomes an SA (Step 4). After Step 4, it may happen that the condition for Step 3 becomes true for more redefinitions, so we need to iterate (Step 2) until no more SA annotation is added. The post-processing performed at Step 5 aims at removing the SA annotations that do not have to be reported in the output.

After running the algorithm in Figure 11, we know exactly which unions of a BAL program can be translated into Java using substring accessors, instead of the union translation scheme in Figure 5. In the example in Figure 9, the algorithm first annotates the declarations of k11, k12, k21, k22 and d with @SA (k is redefined, so it is not initially annotated). Then, it annotates the two FIELD=M, k with @SA. Step 4 annotates k with @SA, since all its redefinitions are annotated with


    #ifdef A
    #define A1
    #define A2
    #endif

    #ifdef A1
    DCL a$ = 11
    FIELD = M, a
      DCL k$ = 2    // record key
      DCL t$ = 1    // t = "T" for this variant
      DCL c$ = 4
      DCL d$ = 4
    #endif

    #ifdef A2
    #ifdef A2_SKIP
    DCL a$ = 11
    #endif // A2_SKIP
    FIELD = M, a
      DCL k$ = 2    // record key
      DCL t$ = 1    // t = "D" for this variant
      DCL e#
      DCL f$ = 5
      DCL g%
    #endif // A2

Figure 12. Union with mutually exclusive overlays.

@SA. The final cleanup keeps only the annotation of k, i.e., the only declaration with more than one redefinition (union declaration). Then, translation of the union declarations annotated with @SA (such as k) by means of proper substring accessors can be fully automated, along the lines of the example shown in Figure 10.
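The annotation steps described above can be sketched as a small fixpoint computation. The data model below (`Decl` for a DCL, `Redef` for a FIELD=M block) is a hypothetical in-memory representation chosen for illustration; the actual tool works on BAL source text, and the step numbers in the comments follow the prose description of Figure 11.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the SA annotation algorithm described for Figure 11. */
final class SaAnnotator {

    /** A DCL; redefinitions holds one entry per FIELD=M that redefines it. */
    static final class Decl {
        final String name;
        final boolean isString;
        final List<Redef> redefinitions = new ArrayList<Redef>();
        boolean sa;

        Decl(String name, boolean isString) {
            this.name = name;
            this.isString = isString;
        }

        void redefine(Decl... fields) {
            Redef r = new Redef();
            for (Decl f : fields) {
                r.fields.add(f);
            }
            redefinitions.add(r);
        }
    }

    /** A FIELD=M block together with the declarations in its scope. */
    static final class Redef {
        final List<Decl> fields = new ArrayList<Decl>();
        boolean sa;
    }

    /** Returns the names of the union declarations that keep the @SA annotation. */
    static Set<String> annotate(Decl root) {
        List<Decl> decls = new ArrayList<Decl>();
        List<Redef> redefs = new ArrayList<Redef>();
        collect(root, decls, redefs);

        // Step 1: string declarations in the scope of a FIELD=M that are not redefined.
        for (Redef r : redefs) {
            for (Decl d : r.fields) {
                if (d.isString && d.redefinitions.isEmpty()) {
                    d.sa = true;
                }
            }
        }

        // Step 2: iterate Steps 3 and 4 until no more annotations are added.
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Redef r : redefs) {            // Step 3: redefinitions with only SA fields
                if (!r.sa && !r.fields.isEmpty() && allFieldsSa(r)) {
                    r.sa = true;
                    changed = true;
                }
            }
            for (Decl d : decls) {              // Step 4: declarations with only SA redefinitions
                if (!d.sa && !d.redefinitions.isEmpty() && allRedefsSa(d)) {
                    d.sa = true;
                    changed = true;
                }
            }
        }

        // Step 5: keep the annotation only on union declarations (more than one redefinition).
        Set<String> result = new LinkedHashSet<String>();
        for (Decl d : decls) {
            if (d.sa && d.redefinitions.size() > 1) {
                result.add(d.name);
            }
        }
        return result;
    }

    private static void collect(Decl d, List<Decl> decls, List<Redef> redefs) {
        decls.add(d);
        for (Redef r : d.redefinitions) {
            redefs.add(r);
            for (Decl f : r.fields) {
                collect(f, decls, redefs);
            }
        }
    }

    private static boolean allFieldsSa(Redef r) {
        for (Decl f : r.fields) {
            if (!f.sa) return false;
        }
        return true;
    }

    private static boolean allRedefsSa(Decl d) {
        for (Redef r : d.redefinitions) {
            if (!r.sa) return false;
        }
        return true;
    }
}
```

Applied to the structure of Figure 9, the sketch returns only k: the redefinition of a contains the byte field i, which is never annotated, so a itself is not annotated, while d is annotated in Step 1 but dropped by the cleanup because it is not a union.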

4.2. Recognizing unions that can be replaced by an inheritance hierarchy

A special case of the union data structure is characterized by mutually exclusive overlays. In this case, one or more bytes of the structure form a discriminator, which identifies which overlay is intended to be used. One situation in which this case occurs is when reading or writing a table where multiple record types are stored. Figure 12 shows an example of such a data structure. The two variants of the storage are the variables k, t, c and d on the one hand, and the variables k, t, e, f and g on the other hand. The two data structures can be instantiated individually (either by defining only the macro A1, or by defining only the macros A2, A2_SKIP). In such cases the data structure has only one view active (i.e., it is not a union). Union instantiation is achieved by defining both A1 and A2, while leaving A2_SKIP undefined (this is easily obtained by defining the macro A only), so that the second group of declarations (on the right in Figure 12) overlays with the first one.

When a union is instantiated, the string variable t acts as the discriminator for this record. One value, say the value ‘T’, will indicate that the first variant is to be used, whereas another, say the value ‘D’, will indicate that the other variant is valid. The BAL language does not enforce mutually exclusive access to the record variants. It is up to the developer to code the related logic appropriately, by making sure that every access according to one of the views defined for the given union is guarded by some instruction ensuring that the discriminator holds the value corresponding to the view being used.


    public class A {
        @Field(size=2)
        String k;
        @Field(size=1)
        String t;
    }

    public class A1 extends A {
        @Field(size=4)
        String c;
        @Field(size=4)
        String d;
    }

    public class A2 extends A {
        byte e;
        @Field(size=5)
        String f;
        short g;
    }

Figure 13. Java translation of union with mutually exclusive overlays.
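To illustrate how a hierarchy like the one in Figure 13 might be instantiated from a raw record, the sketch below picks the subclass from the discriminator byte. It is an assumption-laden example, not the generated code: the factory class, the ISO-8859-1 string encoding and the big-endian layout of the short field g are all hypothetical choices, and the classes are nested copies of those in Figure 13 without the `@Field` annotations.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical factory that selects the subclass of A from the discriminator t. */
final class VariantFactory {

    static class A { String k; String t; }
    static final class A1 extends A { String c; String d; }
    static final class A2 extends A { byte e; String f; short g; }

    /** Decodes an 11-byte record laid out as in Figure 12. */
    static A fromBytes(byte[] raw) {
        ByteBuffer buf = ByteBuffer.wrap(raw); // big-endian by default (assumed layout)
        String k = readString(buf, 2);
        String t = readString(buf, 1);
        A result;
        if ("T".equals(t)) {
            A1 a1 = new A1();
            a1.c = readString(buf, 4);
            a1.d = readString(buf, 4);
            result = a1;
        } else if ("D".equals(t)) {
            A2 a2 = new A2();
            a2.e = buf.get();
            a2.f = readString(buf, 5);
            a2.g = buf.getShort();
            result = a2;
        } else {
            throw new IllegalArgumentException("unknown discriminator: " + t);
        }
        // Attributes common to both views live in the superclass.
        result.k = k;
        result.t = t;
        return result;
    }

    private static String readString(ByteBuffer buf, int size) {
        byte[] bytes = new byte[size];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

A caller that knows the discriminator value can then downcast the returned object to `A1` or `A2`, which corresponds to the downcast discussed for Figure 13.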

When the different views of a data structure are mutually exclusive, we can take advantage of inheritance and instantiate the appropriate subclass, instead of resorting to unions. In Figure 12, the value of t determines the record type. Whenever t=‘T’, the first view is accessed, while t=‘D’ selects the second view. In Java, the discriminator t is moved to the common superclass A, together with all attributes common to the two alternative views (e.g., k, see Figure 13). The value of the discriminator in the code determines which subclass of A to use or which downcast to apply on an object of type A. For example, if a BAL code portion instantiates the data structure in Figure 12 assigning the value ‘T’ to t, we know that the Java translation must instantiate class A1. If an object has type A (e.g., because it is returned by a BAL function), but all its uses are guarded by t=‘D’, we can downcast it to A2 and use the specific methods of A2 in the translated code.

Figure 14 shows an algorithm to determine a safe subset of all statements where the value of a union discriminator is ensured to be a single constant value. For such statements it is possible to use the subclass in the inheritance tree which is associated with such a discriminator value (to gain generality, ranges of values instead of individual values may also be considered, with slight modifications of the algorithm).

The algorithm builds the control flow graph for the input program and performs a flow propagation inside it, until a fixpoint is reached (Steps 1–3). This flow propagation is a particular instance of constant propagation, a well-known program analysis technique [3]. Based on the outcome of constant propagation, it is possible to decide whether at a given statement only one union variant is accessible.

The outcome of the algorithm in Figure 14 is used as follows. Whenever a Control Flow Graph (CFG) edge connects two nodes having different discriminator values, respectively, (d,c1) and (d,c2), with c1 ≠ c2, we have potentially a union variant switch. At the target node, proper switching code must be generated in Java: if the currently active union variant is the one associated with the discriminator value c1, before executing the code of the target node, the object created for the subclass A1 (associated with c1) must be copied into that created for A2 (the subclass associated with c2). The same happens whenever an attribute of a union variant different from the currently active one is accessed. In general, this may introduce a huge number of switch variant statements if applied to code making undisciplined use of union variants with discriminators. On the contrary,


Figure 14. Algorithm to recognize unions with mutually exclusive overlays.

the use of inheritance becomes an improvement over the translation with unions when the number of such variant switches is low or zero.

Figure 15 (top) shows an example of BAL code which makes use of the union with the discriminators described above (Figure 12). The code contains two calls to fsearch, a routine to access persistent data in BAL and fill in the fields of the record whose key is passed as the second parameter. After performing a constant propagation on the CFG of this program, it is possible to determine that the discriminator is constant at both calls, whereas it is not so at the last statement in Figure 15 (top). Correspondingly, the Java translation shown in Figure 15 (bottom) makes use of a1, the first subclass of A, associated with the discriminator value t = ‘T’, in the first call, whereas it uses a2 at the second call. When there is a possible variant switch (before the execution of the second call to fsearch and before the last statement), the routine switchVariant is invoked. Such a routine just checks whether the first and second parameters reference the same object or not. If not, the active variant (first parameter) is copied into the second parameter via serialization and is returned by the routine.
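The constant propagation step discussed for Figure 14 can be approximated by the following worklist sketch. It is a generic textbook formulation under stated assumptions rather than the paper's algorithm: each CFG node optionally generates a constant for the discriminator (a GEN fact), a guard such as `if (t = "D")` is modeled as a separate node on the true branch, and the discriminator is taken to be unknown (NC) at program entry. All class and field names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of discriminator constant propagation over a CFG. */
final class DiscriminatorPropagation {

    static final String UNDEF = "<undef>"; // no information has reached the node yet
    static final String NC = "<NC>";       // not constant

    static final class Node {
        final String gen;                  // constant generated at this node, or null
        final List<Node> succ = new ArrayList<Node>();
        String in = UNDEF;                 // discriminator value on entry to the node
        String out = UNDEF;                // discriminator value on exit from the node

        Node(String gen) { this.gen = gen; }

        /** Adds a CFG edge to next and returns next, so that chains can be written fluently. */
        Node then(Node next) {
            succ.add(next);
            return next;
        }
    }

    /** Meet of two lattice values: equal constants stay constant, different ones become NC. */
    static String meet(String a, String b) {
        if (a.equals(UNDEF)) return b;
        if (b.equals(UNDEF)) return a;
        return a.equals(b) ? a : NC;
    }

    /** Propagates discriminator values from the entry node until a fixpoint is reached. */
    static void propagate(Node entry) {
        entry.in = NC; // assumption: nothing is known at program entry
        entry.out = (entry.gen != null) ? entry.gen : entry.in;
        Deque<Node> work = new ArrayDeque<Node>();
        work.add(entry);
        while (!work.isEmpty()) {
            Node n = work.remove();
            for (Node s : n.succ) {
                String newIn = meet(s.in, n.out);
                String newOut = (s.gen != null) ? s.gen : newIn;
                if (!newIn.equals(s.in) || !newOut.equals(s.out)) {
                    s.in = newIn;
                    s.out = newOut;
                    work.add(s); // values changed, so successors must be revisited
                }
            }
        }
    }
}
```

For the BAL fragment of Figure 15 (top), modeled as assignment, first fsearch, if, guard, second fsearch and final assignment to c, the sketch computes "T" on entry to the first fsearch, "D" on entry to the second, and NC on entry to the last statement, where the two branches merge.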

5. EXPERIMENTAL DATA

In this section, we first describe the migration process and tools, as well as the system being migrated. Then, we report on some data that we collected when applying the proposed data model structuring techniques.

5.1. Migration process and tools

The legacy system contains two different kinds of data structures that deal, respectively, with persistent and transient data. Persistent data are stored on ISAM files, each containing one or more ISAM tables. The structure of most of the ISAM tables is described in a particular ISAM table (indeed, a meta-table) called the dictionary. This is a detailed description of the table meta-data,


    t = "T"                // GEN = (t, "T")
    fsearch(A, k)          // @Discr(t, "T")
    ...
    if (t = "D")           // GEN(true) = (t, "D")
        fsearch(A, k)      // @Discr(t, "D")
    ...
    c = " "                // (t, NC) propagates to this node

    A1 a1 = new A1();
    A2 a2 = new A2();
    A activeVariant = a1;
    activeVariant.setT("T");
    fsearch("A", a1);
    ...
    if (activeVariant.getT().equals("D")) {
        a2 = switchVariant(activeVariant, a2);
        fsearch("A", a2);
    }
    ...
    a1 = switchVariant(activeVariant, a1);
    a1.setC(" ");

Figure 15. Java translation of code using unions with discriminators.

that includes not only the type and the size of table fields, but also supplementary information such as the fields used as discriminators as well as the discriminator values. Declarations for data coming from these ISAM tables are inside include-files that are periodically generated from the dictionary. When moving to Java, the dictionary must be translated as well, since its declarations have to be turned into class definitions that allow instantiating Java bean objects whenever a record is retrieved from the persistent storage.

A few remaining ISAM tables are described in developer-maintained data structures, but not in the dictionary. All the other data structures contain transient data: they are used in the front-end interaction or store intermediate results. The BAL code for transient data structures is manually maintained by the developers.

5.1.1. Dictionary

Considering the valuable information available in the dictionary, the analysis of persistent data structures is performed directly on it, instead of the generated include-files. The dictionary is converted by a pre-existing custom tool into an XMI representation that can be inspected with any XML library. We used XOM‡ (an XML manipulation library for Java) to analyze the dictionary representation and to generate the Java classes to access ISAM tables. The same cases, described in the previous section, apply both to data structures found in the user code and data structures documented in the dictionary. Hence, the same bracketing algorithm [2] was used, but the implementation of the algorithm for the dictionary is based on Java/XOM.

‡http://www.xom.nu/.


5.1.2. User code

Analysis of the user code is performed in three main steps:

• Code normalization: the code is normalized in order to make the subsequent analysis and transformation simpler;

• Fact extraction: a number of facts are extracted from the code and used in the final step;

• Data structure inference: the containment relationship is identified and all the possible overlays are grouped together.

In the first step the code is normalized and code ambiguities are resolved. We use agile parsing [4], modifying grammar and language to distinguish between ambiguous cases. For example, in BAL the same syntax (i.e., brackets) is used for array access and for function invocation. In the code normalization step, we change the declarations and all uses of arrays so as to comply with the C/Java syntax (square brackets). Unique naming [5] is used to generate identifiers that are unique within the system, regardless of their scope. Unique naming is required because segments can have local variables and global named constants can be used as sizes in segment local variables. Moreover, in the case of reused variable names, local names hide the global names.

In the first step we also identify and mark the portions of code originated from the expansion of include-files that are generated from the dictionary. The data structures in these portions of code are not analyzed in step 3, since their analysis is carried out directly on the dictionary where they come from.

In the second step (fact extraction) the code is analyzed and information about it is stored in a database. The most important facts produced in this phase deal with the type and length of all variables and constants. In this step, the combination of information from multiple files into a single database allows us to resolve the external information. An example is when the length of a variable is given by a constant or macro from another file.

The third step is the application of a source level transformation that searches for each of the cases described in the previous sections and inserts appropriate brackets to indicate the full extent of the boundaries of field redefinitions. In this step, size information is used to understand when different fields overlay in the memory. Redefinitions of the same field are grouped together and moved next to the field declaration, so that unions are immediately recognizable.

All three analysis steps for the user code have been implemented using the TXL language [6].

5.2. The legacy system

The system that we are migrating is a production banking application which supports all the functionalities necessary to operate a bank, including account management, financial products management, front-desk operations, communications to central bank and other authorities, inter-bank communications, statistics and report generation. The user interface is character-oriented and the overall architecture is client-server, with the client operating mostly as a character terminal. The execution environment is a proprietary platform called B2U.

Table I shows some indicators of the characteristics of the system being migrated. The application is quite large (around 9.7 MLOC). Since the BAL language admits preprocessor directives, the actual input to our analysis and transformation tools is the preprocessed (expanded) source code, with an approximate growth factor of 1.8. The persistent storage is also pretty large in terms of


Table I. Features of the system being migrated.

    Lines of code (user code)          9716109
    Lines of code (after expansion)    17636696
    Number of source code files        2936
    Number of ISAM files               1339
    Number of ISAM tables              5893
    Number of unique ISAM tables       3950

Table II. Occurrences of structuring cases.

    Case        User code    Dictionary

    Case 1      127636       8279
    Case 2      17635        65
    Case 3      13241        4

    Case 2.1    1847         0
    Case 2.2    9441         36
    Case 2.3    6251         11
    Case 2.4    196          18

    Case 3.1    5633         0
    Case 3.2    1912         1
    Case 3.3    5050         3
    Case 3.4    646          0

    Case err    2152         45

ISAM files and tables. For the latter, the correct number to consider is the number of unique ISAM tables. Some tables are duplicates of other tables, having exactly the same structure. In such a case, only one table, representative of the entire equivalence class, is actually translated to Java.
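As an illustration of how duplicate tables might be collapsed into equivalence classes, the sketch below maps every table to the first table seen with the same structure. The class name, the table names in the usage note and the field descriptor strings (such as "k$4") are hypothetical, since the paper does not describe how structures are compared.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical grouping of ISAM tables that share exactly the same structure. */
final class UniqueTables {

    /**
     * @param tables table name mapped to its ordered field descriptors (assumed format)
     * @return table name mapped to the representative of its equivalence class
     */
    static Map<String, String> representatives(Map<String, List<String>> tables) {
        // First table encountered for each distinct structure.
        Map<List<String>, String> firstWithStructure = new LinkedHashMap<List<String>, String>();
        Map<String, String> result = new LinkedHashMap<String, String>();
        for (Map.Entry<String, List<String>> e : tables.entrySet()) {
            String rep = firstWithStructure.get(e.getValue());
            if (rep == null) {
                rep = e.getKey();
                firstWithStructure.put(e.getValue(), rep);
            }
            result.put(e.getKey(), rep);
        }
        return result;
    }
}
```

Only the distinct values of the returned map would need a Java class; for instance, two tables named ACCOUNTS and ACCOUNTS_OLD with identical field lists would both map to ACCOUNTS.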

5.3. Data model structuring cases

When programs have to exchange data, the same data definition file is included by all the pieces of code accessing them. However, depending on the way files are included, the bracketing heuristics can report different cases. In fact, programmers may place any stopping condition after the inclusion (for example, an explicit memory pointer reset or a redefinition of any variable) when they know that such a file may contain size mismatches. For this reason, structuring cases in files that are included multiple times are counted as many times as they appear in the expanded code.

Table II shows the frequency of the cases considered during the inference of an object model from the existing flat memory model. The table is split into two columns, associated with data model inference for user code vs dictionary. Case err occurs whenever none of the case-based heuristics applies and manual intervention is required. Manual fixes are performed by the programmers who developed the original legacy system, who are BAL experts.

As is apparent from Table II, most of the cases, both in user code and in dictionary, can be handled by the simplest of the cases in our case analysis: exact size match (Case 1). Case 2 (redefinition


Figure 16. Performance (in microseconds) of data access primitives in BAL and Java.

of less memory than declared) prevails over Case 3 (redefinition of more memory than declared), both in user code and in dictionary. Whenever redefinition takes less memory (Case 2), the presence of a successive redefinition of the same or another variable can be exploited to infer the data structure boundaries in most situations (see Cases 2.2 and 2.3). Otherwise, when the redefinition is larger (Case 3) an explicit memory pointer reset (FIELD=M) or the presence of a successive redefinition of the same variable are the most frequently occurring heuristics to suggest the correct boundaries (see Cases 3.1 and 3.2).

The 45 error cases remaining in the dictionary have been fixed by BAL programmers. This manual intervention required 5 working days. A similar code fixing is under way for the 2152 error cases remaining in the user code. Although there are many cases to solve, further investigation showed that they are not independent: they all refer to only 401 recurring variables. In all other cases (158 512), Java classes have been automatically generated for the user code.

5.4. Performance evaluation

Given the data-intensive purpose of the legacy system migrated, its performance in accessing and manipulating data may represent an issue. For this reason we measured the overhead due to data access and data conversion in the migrated code.

Figure 16 plots the average time (in microseconds) required for accessing data by several primitives implemented in the two languages. In total, 200 000 accesses have been performed on a real ISAM table composed of 2 300 000 records, each record containing 1034 bytes. The dashed line represents the BAL reference performance, whereas the solid line reports the performance for Java. The performance is very similar; in two cases (delete and insert) the Java version is even faster than BAL. Thus, we can argue that no significant overhead is expected in the migrated system, due to direct access to and navigation through data.


Figure 17. Data conversion overhead in Java compared with a BAL assignment.

On the contrary, a significant performance penalty is expected to occur in the presence of unions. In fact, when the translated Java code has to change from one view to another, a conversion is required, involving data serialization and deserialization, which may possibly cause some overhead. In Figure 17, the overhead of several type-to-type conversions in Java is compared with the time spent in the corresponding BAL assignment (copy semantics). Data represent the average (in microseconds) on 1000 conversions/assignments. BAL requires a constant amount of time. On the contrary, depending on the type of source and target, the Java implementation exhibits major variations. Even if some conversions are very fast (e.g., arraycopy is even faster than BAL), others are substantially slower. Java takes between one and two orders of magnitude longer than the BAL reference. In two cases (BCDtoArray and BCDtoByteBuffer) the overhead is close to three orders of magnitude. In general, the most time-consuming conversions are those involving BCDs (either as source or as target), because they require heavy numerical processing.

Overall, the performance loss is not critical for this application, but in order to limit the high impact that conversions may have on the performance, they should be avoided as much as possible in the translated code. To achieve this goal, when possible, alternative constructs should replace unions, for example, resorting to SA and mutually exclusive overlay analysis.

5.5. Union reduction

Tables III and IV show the number of classes, interfaces and unions (counted also as classes) generated, respectively, for user code and dictionary. Interfaces are generated only to support the proper definition of unions in Java. The numbers indicate that in most of the cases, unions are not necessary, in that each data structure is accessible through a single view. In the user code, a large share of the unions (−45%) can be avoided if the SA analysis is included in the transformation. In the dictionary, the total number of unions is relatively small (532). However, if the SA analysis is performed, the number of required unions is roughly halved (−49%). It is interesting to note that, when

Copyright q 2009 John Wiley & Sons, Ltd. J. Softw. Maint. Evol.: Res. Pract. 2010; 22:211–237DOI: 10.1002/smr

Page 23: Migrating legacy data structures based on variable overlay to Java

MIGRATING LEGACY DATA STRUCTURES TO JAVA 233

Table III. Java code generated for the user code.

After recognizingBase string access

Classes 20542 7247 (−64%)Interfaces 4255 2043 (−51%)Unions 1209 653 (−45%)

Table IV. Java code generated for the dictionary.

After recognizing After recognizing After rec. string acc.Base string access discriminator and discriminator

Classes 10174 3474 (−65%) 10174 (−0%) 3474 (−65%)Interfaces 2402 1256 (−47%) 627 (−73%) 227 (−90%)Unions 532 268 (−49%) 205 (−61%) 81 (−84%)

adopting SA, the percentage of reduction in the amount of classes, interfaces and unions in theuser code is similar to that observed for the dictionary. This suggests that BAL programmers use asimilar coding style when implementing data structures, either in the user code or in the dictionary.Among the classes generated from the dictionary, many satisfy the pattern shown in Figure 12

(mutually exclusive overlay). For these cases it is possible to take advantage of inheritance and usea discriminator to decide which subclass to instantiate or downcast to. In this way it is possible todrastically reduce the number of unions (−61%). Eventually, when both SA and mutually exclusiveoverlay are recognized, only very few unions remain (81), so that it is reasonable to plan theircomplete manual elimination from the generated Java code.
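The use of a discriminator together with inheritance can be sketched in Java as follows. This is a minimal illustration, not the output of our translator: the class names CustomerData, PersonData and CompanyData, the discriminator values 'P' and 'C', and the single payload field are all hypothetical.

```java
/** Hypothetical base class for a record whose layouts are mutually exclusive. */
abstract class CustomerData {
    /** The discriminator value decides which subclass is instantiated. */
    static CustomerData of(char discriminator, String payload) {
        switch (discriminator) {
            case 'P':
                return new PersonData(payload);
            case 'C':
                return new CompanyData(payload);
            default:
                throw new IllegalArgumentException(
                        "unknown discriminator: " + discriminator);
        }
    }
}

/** Variant used when the discriminator is 'P' (assumed value). */
final class PersonData extends CustomerData {
    final String surname;

    PersonData(String payload) {
        this.surname = payload;
    }
}

/** Variant used when the discriminator is 'C' (assumed value). */
final class CompanyData extends CustomerData {
    final String companyName;

    CompanyData(String payload) {
        this.companyName = payload;
    }
}
```

Client code that has already tested the discriminator can downcast to the matching subclass, e.g., ((CompanyData) CustomerData.of('C', "ACME")).companyName. Since only one variant exists at any time, no conversion between views is ever needed, which is why this pattern removes the union altogether.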

6. RELATED WORK

The problem of migrating a legacy software system to a novel technology has been widely addressed in the literature by different approaches. The different strategies have been classified by Bisbal et al. [7] into (1) redevelopment from scratch; (2) wrapping and (3) migration. In their view, even the migration strategy requires substantial redevelopment. Our contribution belongs to the third class and consists of a set of automatic transformations.

Data conversion and migration have been the topics of several publications over the years [8–14]. Sneed [12] reports his experience at the UBS (Union Bank of Switzerland), where he was involved in a large reengineering project, which included data reengineering, data conversion and data access reengineering. Although in a different setting, we faced exactly the same problems. Martin and Muller [11] enumerate the difficulties encountered when migrating C programs to Java. Among those difficulties, unions are mentioned. For them, the authors refer to their previous work [10], where implicit inheritance relationships are automatically recognized. We also resort to inheritance whenever possible, but unfortunately such a pattern does not resolve all cases that may occur when arbitrary overlays are permitted.

Similar to Martin and Muller [10,11], we also considered data-type emulation [13] inadequate for the basic data types of BAL and preferred a more natural mapping to the Java data types (e.g., String, BigDecimal), in order to improve the readability of the generated code and performance. We also avoided overlay simulation through byte arrays, which again would have produced quite obscure and hard-to-maintain code.

Data conversion and reengineering [8,12,15,16] deal with the problem of moving from one data management system to another. The problem may involve only type mapping and size/displacement re-arrangement, but it may also require more complex operations, necessary when the structure of the two data management systems is substantially different. Since, in our project, the migrated Java code will access the same ISAM tables used by the legacy BAL code, the data conversion problem has not been encountered yet. However, we plan to migrate the ISAM tables to DB2 relational tables in the near future. Hence, we will face similar problems as those discussed in the data conversion and reengineering literature.

Migration to OO programming and extraction of an OO data model from procedural code are the topics of several works [17–21]. Class fields originate from persistent data, user interface, files, records and function parameters, whereas class operations come from the segmentation of the program according to branch labels in the work by Sneed and Nyary [19]. Other works on object identification rely on the analysis of global data and of the code accessing them [17,22,23]. Since a record is too large and often contains unrelated data, cluster analysis was used [21] to identify groups of related fields within a record. In order to decide which data and which routines should be grouped together into classes, OO design metrics (Chidamber and Kemerer) have been used to guide the migration [18,24].

Type inference was used to acquire information about variables in legacy applications, which goes beyond that conveyed by the declared type, so as to simplify migration toward a programming language with a richer and stronger type system [25,26]. For instance, type inference was applied to Cobol [27,28] to determine subtypes of existing types and to check for type equivalence. Static analysis and model checking have been used on Cobol to determine when a scalar type should be better regarded as a record type [29] and to determine unions, the variants of which are consistently accessed through discriminators [30,31].

The work presented in this paper differs from the existing literature in that it deals with a starting data model permitting arbitrary overlays in the memory. Languages such as C and COBOL place some restrictions so that fields from separate variables do not overlay each other. BAL has a more permissive model, similar to structured assembly language, which poses additional issues and requires ad-hoc treatment. Our work represents the first step (reverse engineering a structured data model) toward an OO model of the data.

This paper expands our previous publication [32] with more details and results. It includes two novel techniques that can be employed to avoid the generation of unions and previously unpublished experimental results on performance and union reduction. The interested reader is referred to our previous paper [2] for a description of the TXL program transformations necessary to bracket the original BAL code, so as to delimit the extension of each data variable of container type.

7. CONCLUSIONS AND FUTURE WORK

We have described an approach and a set of heuristics that can be used to migrate legacy data structures permitting arbitrary variable overlays to a data model that constrains the possible relationships among variables (in Java, only direct or indirect containment is allowed). We have also considered the problem of multiple alternative overlays (unions), i.e., multiple data structure variants defined upon the same memory region. In Java, it is possible to translate such data structures, but a specific protocol (copy-on-read/write) has to be implemented to ensure consistency among different views, which involves substantial performance penalties. We have presented two algorithms to avoid or reduce the impact of such a problem. One is based on the identification of unions that are introduced only to provide SA. The other is based on the identification of unions that are fully discriminated and can be translated into a hierarchy of classes that are referenced in the code in mutual exclusion.

Preliminary results indicate that unions have potentially a major impact on the performance of the translated program. In fact, the data conversion required by the copy-on-read/write protocol may increase the execution time by as much as three orders of magnitude. This justifies our effort to reduce the number of cases in which union-like data structures are instantiated in Java. Experimental results indicate that the two techniques investigated so far are quite effective. Taken in isolation, they more or less halve the number of unions that are produced in the translation. If combined, they reduce the number of unions to around 1/5.

Further experimental evaluation of the performance of the translated code is necessary to understand whether the two proposed techniques solve enough cases to eventually yield acceptable overall performance for the application as a whole. We may need to study other approaches or request manual interventions on the original code to simplify the job of the translator. We also plan to evaluate other performance bottlenecks that may be associated with the current translation process.

REFERENCES

1. Bisbal J, Lawless D, Bing Wu, Grimson J. Legacy information systems: Issues and directions. IEEE Software 1997; 16(5):102–111.

2. Ceccato M, Dean TR, Tonella P. Using program transformations to add structure to a legacy data model. Working Conference on Source Code Analysis and Manipulation, 2008; 197–206.

3. Wegman MN, Zadeck FK. Constant propagation with conditional branches. ACM Transactions on Programming Languages and Systems 1991; 13(2):181–210.

4. Dean TR, Cordy JR, Malton AJ, Schneider KA. Agile parsing in txl. Journal of Automated Software Engineering 2003; 10(4):311–336.

5. Guo X, Cordy JR, Dean TR. Unique renaming of Java using source transformation. Proceedings 3rd IEEE International Workshop on Source Code Analysis and Manipulation 2007. IEEE Computer Society: Amsterdam, The Netherlands, 2003.

6. Cordy JR. The txl source transformation language. Science of Computer Programming 2006; 61(3):190–210.

7. Bisbal J, Lawless D, Wu B, Grimson J. Legacy information systems: Issues and directions. IEEE Software 1999; 16(5):103–111.

8. Cleve A. Automating program conversion in database reengineering: A wrapper-based approach. Proceedings of the 10th European Conference on Software Maintenance and Reengineering, March 2006; 323–326.

9. Jacobson I, Lindstrom F. Re-engineering of old systems to an object-oriented architecture. Proceedings of the Conference Object-oriented Programming Systems, Languages, and Applications (OOPSLA). ACM: New York NY, 1991; 340–350.

10. Martin J, Muller HA. Advances in Software Engineering: Topics in Comprehension, Evolution, and Evaluation. Springer: New York NY, 2000.

11. Martin J, Muller HA. Strategies for migration from C to Java. Proceedings of the 5th European Conference on Software Maintenance and Reengineering. IEEE Computer Society: Silver Spring MD, 2001; 200–209.

12. Sneed HM. Bank application reengineering and conversion at the union bank of switzerland. October 1991; 60–72.

13. Terekhov AA, Verhoef C. The realities of language conversions. IEEE Software 2000; 17(6):111–124.


14. Waters RC. Program translation via abstraction and reimplementation. IEEE Transactions on Software Engineering 1988; 14(8):1207–1228.

15. Henrard J, Hick J-M, Thiran P, Hainaut J-L. Strategies for data reengineering. Proceedings of the Ninth Working Conference on Reverse Engineering (WCRE). IEEE Computer Society: Washington, DC U.S.A., 2002; 211.

16. Hick J-M, Hainaut J-L. Database application evolution: A transformational approach. Data and Knowledge Engineering 2006; 59(3):534–558.

17. Canfora G, Cimitile A, Munro M. An improved algorithm for identifying objects in code. Software: Practice and Experience 1996; 26:25–48.

18. De Lucia A, Di Lucca GA, Fasolino AR, Guerra P, Petruzzelli S. Migrating legacy systems towards object-oriented platforms. Proceedings of the International Conference on Software Maintenance, 1997, 1–3 October 1997; 122–129.

19. Sneed HM, Nyary E. Extracting object-oriented specification from procedurally oriented programs. Proceedings of the 2nd Working Conference on Reverse Engineering, 1995, 14–16 July 1995; 217–226.

20. Tan HBK, Ling TW. Recovery of object-oriented design from existing data-intensive business programs. Information and Software Technology 1995; 37:67–77.

21. van Deursen A, Kuipers T. Identifying objects using cluster and concept analysis. Proceedings of International Conference on Software Engineering, 1999, 1999; 246–255.

22. Liu S-S, Wilde N. Identifying objects in a conventional procedural language: An example of data design recovery. Proceedings of the International Conference on Software Maintenance, 1990, 26–29 November 1990; 266–271.

23. Pidaparthi S, Cysewski G. Case study in migration to object-oriented system structure using design transformation methods. Proceedings of the First Euromicro Conference on Software Maintenance and Reengineering, 1997, 17–19 March 1997; 128–135.

24. Cimitile A, De Lucia A, Di Lucca GA, Fasolino AR. Identifying objects in legacy systems using design metrics. Journal of Systems and Software 1999; 44:199–211.

25. O’Callahan R, Jackson D. Lackwit: A program understanding tool based on type inference. Proceedings of the 19th International Conference on Software Engineering, 1997, 17–23 May 1997; 338–348.

26. Ramalingam G, Komondoor R, Field J, Sinha S. Semantics-based reverse engineering of object-oriented data models. Proceedings of the International Conference on Software Engineering, 2006, 2006; 192–201.

27. van Deursen A, Moonen L. Understanding cobol systems using inferred types. Proceedings of the Seventh International Workshop on Program Comprehension, 1999, 1999; 74–81.

28. van Deursen A, Moonen L. Exploring legacy systems using types. Proceedings of the Seventh Working Conference on Reverse Engineering. IEEE Computer Society Press: Silver Spring MD, 2000; 32–41.

29. Komondoor R, Ramalingam G, Chandra S, Field J. Dependent types for program understanding. Proceedings of the International Conference on Tools and Algorithms for the Construction and Analysis of Systems 2005, 2005; 157–173.

30. Jhala R, Majumdar R, Xu R-G. State of the union: Type inference via craig interpolation. Proceedings of the 13th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, 2007, 2007; 553–567.

31. Komondoor R, Ramalingam G. Recovering data models via guarded dependences. Proceedings of the 14th Working Conference on Reverse Engineering, 2007, 28–31 October 2007; 110–119.

32. Ceccato M, Dean TR, Tonella P, Marchignoli D. Data model reverse engineering in migrating a legacy system to Java. Fifteenth Working Conference on Reverse Engineering (WCRE). IEEE Computer Society: Silver Spring MD, 2008; 177–186.

AUTHORS’ BIOGRAPHIES

Mariano Ceccato is a researcher in FBK-irst (Fondazione Bruno Kessler, former ITC-irst) in Trento, Italy. He received his master’s degree in Software Engineering from the University of Padova, Italy, in 2003, and his PhD in Computer Science from the University of Trento in 2006, with the thesis ‘Migrating Object Oriented code to Aspect Oriented Programming’, under the supervision of Paolo Tonella, head of the Software Engineering research unit in FBK. His research interests are source code analysis and transformation, remote entrusting and empirical software engineering.


Thomas Roy Dean is an Associate Professor in the Department of Electrical and Computer Engineering at Queen’s University and an Adjunct Associate Professor at the Royal Military College of Kingston. His background includes research in air traffic control systems, language formalization and five and a half years as a Sr. Research Scientist at Legasys Corporation where he worked on advanced software transformation and evolution techniques in an industrial setting. His current research interests are software transformation, web site evolution and the security of network applications.

Paolo Tonella is head of the Software Engineering Research Unit at Fondazione Bruno Kessler, Trento, Italy. He received his PhD degree in Software Engineering from the University of Padova, in 1999, with the thesis ‘Code Analysis in Support to Software Maintenance’. Since 1994 he has been with the Software Engineering group at IRST in Trento, Italy. He participated in several projects on software analysis and testing. He is the author of ‘Reverse Engineering of Object Oriented Code’, Springer, 2005. He wrote over 100 peer reviewed conference/workshop papers and over 30 journal papers. His current research interests include reverse engineering, crosscutting concerns, empirical studies, Web testing and test case generation.

Davide Marchignoli is responsible for the Gesbank migration project at IBT. He received his PhD degree in Computer Science in 2002 from the University of Pisa, working mainly on type theory. Since then, he has been mainly working on software system design.
