
Transformation of Knowledge, Information and Data: Theory and Applications


  • Hershey London Melbourne Singapore

    Patrick van Bommel, University of Nijmegen, The Netherlands

  • Acquisition Editor: Mehdi Khosrow-Pour
    Senior Managing Editor: Jan Travers
    Managing Editor: Amanda Appicello
    Development Editor: Michele Rossi
    Copy Editor: Alana Bubnis
    Typesetter: Jennifer Wetzel
    Cover Design: Mindy Grubb
    Printed at: Yurchak Printing Inc.

    Published in the United States of America by
    Information Science Publishing (an imprint of Idea Group Inc.)
    701 E. Chocolate Avenue, Suite 200
    Hershey PA 17033
    Tel: 717-533-8845
    Fax: 717-533-8661
    E-mail: [email protected]
    Web site: http://www.idea-group.com

    and in the United Kingdom by
    Information Science Publishing (an imprint of Idea Group Inc.)
    3 Henrietta Street
    Covent Garden
    London WC2E 8LU
    Tel: 44 20 7240 0856
    Fax: 44 20 7379 3313
    Web site: http://www.eurospan.co.uk

    Copyright 2005 by Idea Group Inc. All rights reserved. No part of this book may be reproduced in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.

    Library of Congress Cataloging-in-Publication Data

    Transformation of knowledge, information and data : theory and applications / Patrick van Bommel, editor.
    p. cm.
    Includes bibliographical references and index.
    ISBN 1-59140-527-0 (h/c)
    ISBN 1-59140-528-9 (s/c)
    ISBN 1-59140-529-7 (eisbn)
    1. Database management. 2. Transformations (Mathematics) I. Bommel, Patrick van, 1964-
    QA76.9.D3T693 2004
    005.74--dc22
    2004017926

    British Cataloguing in Publication Data
    A Cataloguing in Publication record for this book is available from the British Library.

    All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the authors, but not necessarily of the publisher.

  • Preface

    Section I: Fundamentals of Transformations

    Chapter I: Transformation-Based Database Engineering
    Jean-Luc Hainaut, University of Namur, Belgium

    Chapter II: Rule-Based Transformation of Graphs and the Product Type
    Renate Klempien-Hinrichs, University of Bremen, Germany
    Hans-Jörg Kreowski, University of Bremen, Germany
    Sabine Kuske, University of Bremen, Germany

    Chapter III: From Conceptual Database Schemas to Logical Database Tuning
    Jean-Marc Petit, Université Clermont-Ferrand 2, France
    Mohand-Saïd Hacid, Université Lyon 1, France

    Chapter IV: Transformation Based XML Query Optimization
    Dunren Che, Southern Illinois University, USA

    Chapter V: Specifying Coherent Refactoring of Software Artefacts with Distributed Graph Transformations
    Paolo Bottoni, University of Rome "La Sapienza", Italy
    Francesco Parisi-Presicce, University of Rome "La Sapienza", Italy and George Mason University, USA
    Gabriele Taentzer, Technical University of Berlin, Germany

    Section II: Elaboration of Transformation Approaches

    Chapter VI: Declarative Transformation for Object-Oriented Models
    Keith Duddy, CRC for Enterprise Distributed Systems Technology (DSTC), Queensland, Australia
    Anna Gerber, CRC for Enterprise Distributed Systems Technology (DSTC), Queensland, Australia
    Michael Lawley, CRC for Enterprise Distributed Systems Technology (DSTC), Queensland, Australia
    Kerry Raymond, CRC for Enterprise Distributed Systems Technology (DSTC), Queensland, Australia
    Jim Steel, CRC for Enterprise Distributed Systems Technology (DSTC), Queensland, Australia

    Chapter VII: From Conceptual Models to Data Models
    Antonio Badia, University of Louisville, USA

    Chapter VIII: An Algorithm for Transforming XML Documents Schema into Relational Database Schema
    Abad Shah, University of Engineering & Technology (UET), Pakistan
    Jacob Adeniyi, King Saud University, Saudi Arabia
    Tariq Al Tuwairqi, King Saud University, Saudi Arabia

    Chapter IX: Imprecise and Uncertain Engineering Information Modeling in Databases: Models and Formal Transformations
    Z. M. Ma, Université de Sherbrooke, Canada

    Section III: Additional Topics

    Chapter X: Analysing Transformations in Performance Management
    Bernd Wondergem, LogicaCMG Consulting, The Netherlands
    Norbert Vincent, LogicaCMG Consulting, The Netherlands

    Chapter XI: Multimedia Conversion with the Focus on Continuous Media
    Maciej Suchomski, Friedrich-Alexander University of Erlangen-Nuremberg, Germany
    Andreas Märcz, Dresden, Germany
    Klaus Meyer-Wegener, Friedrich-Alexander University of Erlangen-Nuremberg, Germany

    Chapter XII: Coherence in Data Schema Transformations: The Notion of Semantic Change Patterns
    Lex Wedemeijer, ABP Pensioenen, The Netherlands

    Chapter XIII: Model Transformations in Designing the ASSO Methodology
    Elvira Locuratolo, ISTI, Italy

    About the Authors

    Index

  • Preface

    Background

    Data today is in motion, going from one location to another. More and more, it moves between systems, system components, persons, departments, and organizations. This is essential, as it indicates that data is actually used, rather than just stored. In order to emphasize the actual use of data, we may also speak of information or knowledge.

    When data is in motion, there is not only a change of place or position. Other aspects change as well. Consider the following examples:

    - The data format may change when data is transferred between systems. This includes changes in data structure, data model, data schema, data types, etc.

    - The interpretation of data may vary when it is passed on from one person to another. Changes in interpretation are part of data semantics rather than data structure.

    - The level of detail may change in the exchange of data between departments or organizations, e.g., going from co-workers to managers or from local authorities to the central government. In this context, we often see changes in level of detail by the application of abstraction, aggregation, generalization, and specialization.

    - The systems development phase of data models may vary. This is particularly the case when implementation-independent data models are mapped to implementation-oriented models (e.g., semantic data models are mapped to operational database specifications).

    These examples illustrate just a few possibilities of changes in data. Numerous other applications exist and everybody uses them all the time. Most applications are of vital importance for the intelligent functioning of systems, persons, departments, and organizations.


    In this book, the fundamental treatment of moving knowledge, information, or data, with changing format, interpretation, level of detail, development phase, etc., is based on the concept of transformation. The generally accepted terms conversion, mutation, modification, evolution, or revision may be used in specific contexts, but the central concept is transformation.

    Note that this definition covers well-known topics such as rewriting and versioning, and that it is relevant for collaborative information systems and data warehouses. Although data transformation is typically applied in a networked context (e.g., Internet or intranet), it is applied in other contexts as well.

    Framework

    Transformation techniques have received a lot of attention in academic as well as industrial settings. Most of these techniques have one or more of the following problems:

    - Loss of data: the result of the transformation does not adequately describe the original data.

    - Incomprehensibility: the effect of the transformation is not clear.

    - Focus on instances: data instances are transformed, without incorporation of data types.

    - Focus on types: data types are transformed, without incorporation of data instances.

    - Correctness: the transformation has no provable correctness.

    We therefore aim at generic approaches for the treatment of data transformations. Some of the questions we deal with are the following: What is an adequate data transformation technique? What are the requirements for the input and output of those techniques? What are the problems in existing approaches? What are the possibilities of a generic approach in important areas such as the semantic web, supply chain management, the global information community, and information security?

    The theory and applications in this book are rooted in database schema transformation, as well as in database contents transformation. This allows for other transformations, including transformation of document type definitions (DTDs) and of concrete documents. It is obvious that graph transformations are relevant here. Note that we do not particularly focus on specific kinds of data or documents (e.g., RDBMS, HTML or XML), although the models under consideration do not exclude such a focus.


    From Source to Target

    Here we discuss general aspects of the move from source to target. They deal with the basic assumptions underlying all transformation processes.

    - Source. This is the structure to be transformed, or in other words, it is the input to the transformation process. An important distinction is made between formal and informal sources. If the source is informal, the transformation process cannot be fully automated. We then usually have a partly automated transformation aiming at support, with sufficient possibilities for interaction. As an example, a modeling process often is the mapping of an informal view to a formal model. In this book, the input and output of most transformations are assumed to be available in some formal language.

    - Target. This is the resulting structure, so it is the output of the transformation process. A main question here is how the relation between the target and the source is defined. Even when the transformation process has been completed, it is important that the relation of the target with the source remains clear. One way of establishing such a clear relation is to have the target defined in terms of the source. This is also helpful in providing correctness proofs.

    - Applicability. In some cases, transformations are not really general, in the sense that the possible source and target are rather restricted. If, for example, a theoretical model of transformations only allows for exotic targets that are not used in practical situations, the theoretical model suffers from applicability problems.

    - Structure vs. access operations. Besides the transformation of structures, we must provide mechanisms for the transformation of access operations. These operations may be modification operations as well as retrieval operations. Consequently, we have a source structure with corresponding access operations, and a target structure with equivalent operations. This situation is shown in Figure 1. The transformation kernel contains all metadata relevant for the transformation.
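
    The pairing of a structure transformation with an operation transformation can be illustrated with a minimal sketch. Everything named here (the `Kernel` class, the employee records, the `dept` and `floor` attributes) is a hypothetical example rather than anything defined in the book, and the sketch assumes that `floor` is functionally determined by `dept`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Kernel:
    """Metadata relevant for the transformation (hypothetical layout)."""
    key: str       # attribute linking the two target tables
    moved: tuple   # attributes moved out of the source records

def transform_structure(rows, kernel):
    """Structure transformation: split flat source records into two target tables.

    Assumes the moved attributes are functionally determined by kernel.key.
    """
    main = [{a: v for a, v in r.items() if a not in kernel.moved} for r in rows]
    side = {r[kernel.key]: {a: r[a] for a in kernel.moved} for r in rows}
    return main, side

def source_lookup(rows, emp, attr):
    """Source retrieval operation: read the attribute straight from the record."""
    return next(r[attr] for r in rows if r["emp"] == emp)

def transform_operation(kernel):
    """Operation transformation: build the equivalent retrieval on the target."""
    def target_lookup(target, emp, attr):
        main, side = target
        row = next(r for r in main if r["emp"] == emp)
        if attr in kernel.moved:
            return side[row[kernel.key]][attr]
        return row[attr]
    return target_lookup

# Hypothetical source structure: one flat record per employee.
source = [
    {"emp": "ann", "dept": "sales", "floor": 1},
    {"emp": "bob", "dept": "sales", "floor": 1},
    {"emp": "cid", "dept": "it", "floor": 2},
]
kernel = Kernel(key="dept", moved=("floor",))
target = transform_structure(source, kernel)
target_lookup = transform_operation(kernel)
```

    Both transformations read the same kernel, which is one way to keep the target operations equivalent to the source operations: `target_lookup(target, "bob", "floor")` and `source_lookup(source, "bob", "floor")` return the same value.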

    Correctness

    Evidently, the correctness of transformations is of vital importance. What purpose would transformations have, if the nature of the result is uncertain? A general setup for guaranteeing transformation correctness consists of three steps.


    - Wellformedness conditions. First, we describe the required properties of the target explicitly. We prefer to have basic (independent) wellformedness conditions here, as this facilitates the systematic treatment in the next steps.

    - Transformation algorithm. Next, we describe the construction of the target on the basis of the source at hand. This construction process is defined in the transformation algorithm, which may be enhanced using guidance parameters. Guidance is interpreted as the development towards target structures having certain desirable qualities.

    - Correctness proof. Finally, we prove that the result of the algorithm satisfies the wellformedness conditions. As a consequence, the resulting structure is correct in the sense that all wellformedness conditions are satisfied. Moreover, when specific guidance parameters are used, we have to prove that the resulting structure not only satisfies all wellformedness conditions, but has the desirable qualities (indicated by guidance parameters) as well.
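
    The three steps above can be sketched in code, with one important caveat: a runtime check over sample inputs is only a test, not the correctness proof the text asks for. The schema layout, the two conditions, and the `sort_columns` guidance parameter below are illustrative assumptions, not taken from the book.

```python
# Step 1: wellformedness conditions, stated as independent predicates on a target table.
def has_primary_key(table):
    return len(table["pk"]) > 0

def pk_in_columns(table):
    return set(table["pk"]) <= set(table["columns"])

CONDITIONS = [has_primary_key, pk_in_columns]

# Step 2: transformation algorithm, with one guidance parameter (hypothetical).
def transform(source, sort_columns=False):
    """Map each conceptual entity to a table keyed by its identifier."""
    target = {}
    for entity, spec in source.items():
        columns = list(spec["attributes"])
        if sort_columns:  # guidance towards a desirable quality: ordered columns
            columns = sorted(columns)
        target[entity.lower()] = {"columns": columns, "pk": [spec["identifier"]]}
    return target

# Step 3 (stand-in): check the conditions on a concrete result.
def is_wellformed(target):
    return all(cond(table) for table in target.values() for cond in CONDITIONS)

def has_sorted_columns(target):
    return all(t["columns"] == sorted(t["columns"]) for t in target.values())

source = {"Customer": {"attributes": ["name", "id"], "identifier": "id"}}
target = transform(source, sort_columns=True)
```

    Keeping the conditions independent, as the text recommends, means each predicate can be checked (or proved) separately; the guided quality is checked by its own predicate, `has_sorted_columns`.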

    Sequences of Transformations

    Transformations may be composed or applied in sequences. Such sequences sometimes consist of a relatively small number of steps. In more complex problem areas, however, this is no longer possible. Then, transformation sequences will be longer and, due to the various options in each transformation step, the outcome of the overall sequence is not a priori known. This is particularly the case when non-deterministic (e.g., random or probabilistic) transformation processes are considered.

    Figure 1. Framework for transformation of structures and operations

    [Figure: a transformation kernel linking the structure transformation (source structure to target structure) and the operation transformation (source operations to target operations).]

    Although the outcome is not a priori known, it is often desirable to predict the nature of the result. One way of predicting the behavior of probabilistic transformation processes is through the use of Markov theory. Here the probabilities of a single transformation step are summarized in a transition matrix, such that transformation sequences can be considered by matrix multiplication.

    We will illustrate the definition of a single-step matrix for two basic cases. In the first case, consider a transformation in a solution space $S$ where each input $x \in S$ has as possible output some $y \in N(x)$, where $N(x) \subseteq S$ and $x \notin N(x)$. So each neighbor $y \in N(x)$ can be produced from $x$ by the application of some transformation rule. Then the probability $P(x,y)$ for the transformation of $x$ into some $y \in N(x)$ has the following property:

    $$P(x,y) \cdot |N(x)| = 1 \qquad (1)$$

    Evidently, for $y \notin N(x)$ we have $P(x,y) = 0$. With this property it is guaranteed that $P(x,y)$ is a stochastic matrix, since $0 \le P(x,y) \le 1$ and $\sum_{y \in S} P(x,y) = 1$. Note that in the above transformation the production of all results is equally likely.

    In the second case, we consider situations where the production of all results is not equally likely. Consider a transformation in a solution space $S$ where each input $x \in S$ has as possible output some $y \in B(x)$, where $B(x) \subseteq N(x)$ contains all better neighbors of $x$. Then the probability $P(x,y)$ for the transformation of $x$ into some $y \in B(x)$ is given by the above-mentioned formula (1). However, as a result of accepting only improving transformations, this formula now does not guarantee $P(x,y)$ to be a stochastic matrix. The consequence of rejecting all neighbors in $N(x) - B(x)$ is that a transformation may fail. So now we have to consider $P(x,x)$. This probability has the following property:

    $$P(x,x) \cdot |N(x)| = |N(x)| - |B(x)| \qquad (2)$$

    In this case we have $P(x,y) = 0$ for $y \notin \{x\} \cup B(x)$. Now we have described a hill climbing transformation sequence. Note that the matrix underlying hill climbing transformations is indeed a stochastic matrix.

    We will now give an overview of the book. It consists of three parts: fundamentals of transformations, elaboration of transformation approaches, and additional topics. These three sections contain 13 chapters. It is possible to start in a later chapter (e.g., in Section II or III), without reading all earlier chapters (e.g., more theoretical chapters in Section I).
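
    The two single-step matrices from the Markov discussion above, and their composition by matrix multiplication, can be sketched as follows. The three-state solution space, the neighbor sets, and the cost function (where "better" is assumed to mean strictly lower cost) are hypothetical; the sketch also assumes every state has at least one neighbor and is not its own neighbor.

```python
from fractions import Fraction

def uniform_matrix(states, neighbors):
    """Case 1: every neighbor is equally likely, P(x,y) = 1/|N(x)| (formula 1)."""
    P = {x: {y: Fraction(0) for y in states} for x in states}
    for x in states:
        for y in neighbors[x]:
            P[x][y] = Fraction(1, len(neighbors[x]))
    return P

def hill_climbing_matrix(states, neighbors, cost):
    """Case 2: only better neighbors B(x) are accepted; a rejected step stays at x."""
    P = {x: {y: Fraction(0) for y in states} for x in states}
    for x in states:
        n = len(neighbors[x])
        better = [y for y in neighbors[x] if cost[y] < cost[x]]
        for y in better:
            P[x][y] = Fraction(1, n)               # formula (1), restricted to B(x)
        P[x][x] = Fraction(n - len(better), n)     # formula (2)
    return P

def multiply(A, B, states):
    """Compose two transformation steps by matrix multiplication."""
    return {x: {y: sum(A[x][z] * B[z][y] for z in states) for y in states}
            for x in states}

def is_stochastic(P, states):
    """Every entry lies in [0, 1] and every row sums to 1."""
    return all(sum(P[x].values()) == 1 and all(0 <= p <= 1 for p in P[x].values())
               for x in states)

# Hypothetical solution space with three states.
states = ["a", "b", "c"]
neighbors = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
cost = {"a": 3, "b": 2, "c": 1}

U = uniform_matrix(states, neighbors)
H = hill_climbing_matrix(states, neighbors, cost)
H2 = multiply(H, H, states)   # two hill climbing steps in sequence
```

    In this example, state `c` has no better neighbor, so `H["c"]["c"]` is 1; after two steps from `a` the process is in `c` with probability 3/4 and in `b` with probability 1/4. Exact fractions are used so that the row sums can be compared with 1 without rounding error.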


    Fundamentals of Transformations

    Section I is about fundamentals and consists of five chapters. The focus of Chapter I is databases: Transformation-Based Database Engineering. Here we consider the basic theory of the transformation of data schemata, where reversibility of transformations is also considered. We describe the use of basic transformations in the construction of more complex (higher-level) transformations. Several possibilities are recognized here, including compound transformations, and predicate-driven and model-driven transformations. Basic transformations and their higher-level derivations are embedded within database (forward) design processes as well as within database reverse design processes.

    Most models to be transformed are defined in terms of graphs. In Chapter II we will therefore focus on graph transformations: Rule-Based Transformation of Graphs and the Product Type. Graph transformations are based on rules. These rules yield new graphs, produced from a given graph. In this approach, conditions are used to have more control over the transformation process. This allows us to indicate the order of rule application. Moreover, the result (product) of the transformation is given special attention. In particular, the type of the product is important. This sets the context for defining the precise relation between two or more graph transformations.

    Having embedded our transformations within the graph transformation context, Chapter III proceeds with graphs for concrete cases: From Conceptual Database Schemas to Logical Database Tuning. Here we present several algorithms, aiming at the production of directed graphs. In databases we have several aims in transformations, including efficiency and freedom from null values. Note that wellformedness of the input (i.e., a conceptual model) as well as wellformedness of the output (i.e., the database) is addressed.

    It is evident that graphs have to be transformed, but what about operations on graphs? In systems design this corresponds with query transformation and optimization. We apply this to markup languages in Chapter IV: Transformation Based XML Query Optimization. After representing document type definitions in terms of a graph, we consider paths in the graph and an algebra for text search. Equivalent algebraic expressions set the context for optimization, as we know from database theory. Here we combine the concepts from previous chapters, using rule-based transformations. However, the aim of the transformation process now is optimization.

    In Chapter V, the final chapter of Section I, we consider a highly specialized foundation in the theory behind applications: Specifying Coherent Refactoring of Software Artefacts with Distributed Graph Transformations. Modifications in the structure of systems are recorded in terms of so-called refactoring. This means that a coordinated evolution of system components becomes possible. Again, this graph transformation is rule-based. We use this approach to reason about the behavior of the system under consideration.

    Elaboration of Transformation Approaches

    In Section II, we consider elaborated approaches to transformation. The focus of Chapter VI is object-oriented transformation: Declarative Transformation for Object-Oriented Models. This is relevant not only for object-oriented data models, but for object-oriented programming languages as well. The transformations under consideration are organized according to three styles of transformation: source-driven, target-driven, and aspect-driven transformations. Although source and target will be clear, the term aspect needs some clarification. In aspect-driven transformations, we use semantic concepts for setting up the transformation rule. A concrete SQL-like syntax is used, based on rule, for all, where, make, and linking statements. This also allows for the definition of patterns.

    It is generally recognized that in systems analysis we should use conceptual models, rather than implementation models. This creates the context for transformations of conceptual models. In Chapter VII we deal with this: From Conceptual Models to Data Models. Conceptual models are often expressed in terms of the Entity-Relationship approach, whereas implementation models are often expressed in terms of the relational model. Classical conceptual model transformations thus describe the mapping from ER to relational models. Having UML in the conceptual area and XML in the implementation area, we now also focus on UML to XML transformations.

    We proceed with this in the next chapter: An Algorithm for Transforming XML Documents Schema into Relational Database Schema. A typical approach to the generation of a relational schema from a document definition starts with preprocessing the document definition and finding the root node of the document. After generating trees and a corresponding relational schema, we should determine functional dependencies and other integrity constraints. During postprocessing, the resulting schema may be normalized in case this is desirable. Note that the performance (efficiency) of such algorithms is a critical factor. The proposed approach is illustrated in a case study based on library documents.

    Transformations are often quite complex. If data is inaccurate, we have a further complication. In Chapter IX we deal with this: Imprecise and Uncertain Engineering Information Modeling in Databases: Models and Formal Transformations. Uncertainty in information modeling is usually based on fuzzy sets and probability theory. Here we focus on transformations in the context of fuzzy Entity-Relationship models and fuzzy nested relations. In the models used in this transformation, the known graphical representation is extended with fuzzy elements, such as fuzzy type symbols.

    Additional Topics

    In Section III, we consider additional topics. The focus of Chapter X is the application of transformations in a new area: Analysing Transformations in Performance Management. The context of these transformations is an organizational model, along with a goal model. This results in a view of organizational management based on cycles of transformations. Typically, we have transformations of organizational models and goal models, as well as transformations of the relationship between these models. Basic transformations are the addition of items and detailing of components.

    Next we proceed with the discussion of different media: Multimedia Conversion with the Focus on Continuous Media. It is evident that the major challenge in multimedia research is the systematic treatment of continuous media. When focusing on transformations, we enter the area of streams and converters. As in previous chapters, we again base ourselves on graphs here, for instance chains of converters, yielding a graph of converters. Several qualities are relevant here, such as quality of service, quality of data, and quality of experience. This chapter introduces specific transformations for media-type changers, format changers, and content changers.

    The focus of Chapter XII is patterns in schema changes: Coherence in Data Schema Transformations: The Notion of Semantic Change Patterns. Here we consider updates of data schemata during system usage (operational schema). When the schema is transformed into a new schema, we try to find coherence. A catalogue of semantic changes is presented, consisting of a number of basic transformations. Several important distinctions are made, for example, between appending an entity and superimposing an entity. Also, we have the redirection of a reference to an owner entity, along with extension and restriction of entity intent. The basic transformations were found during empirical studies in real-life cases.

    In Chapter XIII, we conclude with the advanced approach: Model Transformations in Designing the ASSO Methodology. The context of this methodology is ease of specifying schemata and schema evolution during system usage. The transformations considered here particularly deal with subtyping (also called is-a relationships). This is covered by the transformation of class hierarchies or more general class graphs. It is evident that schema consistency is one of the properties required. This is based on consistency of class definitions, with inductive approaches by: (a) requiring that initialization adheres to application constraints, and (b) requiring that all operations preserve all constraints.

    Conclusions

    This book contains theory and applications of transformations in the context of information systems development. As data today is frequently moving between systems, system components, persons, departments, and organizations, the need for such transformations is evident.

    When data is in motion, there is not only a change of place or position. Other aspects change as well. The data format may change when data is transferred between systems, while the interpretation of data may vary when it is passed on from one person to another. Moreover, the level of detail may change in the exchange of data between departments or organizations, and the systems development phase of data models may vary, e.g., when implementation-independent data models are mapped to implementation-oriented models.

    The theory presented in this book will help in the development of new innovative applications. Existing applications presented in this book prove the power of current transformation approaches. We are confident that this book contributes to the understanding, the systematic treatment and refinement, and the education of new and existing transformations.

    Further Reading

    Kovacs, Gy. & van Bommel, P. (1997). From conceptual model to OO database via intermediate specification. Acta Cybernetica, (13), 103-140.

    Kovacs, Gy. & van Bommel, P. (1998). Conceptual modelling based design of object-oriented databases. Information and Software Technology, 40(1), 1-14.

    van Bommel, P. (1993, May). A randomised schema mutator for evolutionary database optimisation. The Australian Computer Journal, 25(2), 61-69.

    van Bommel, P. (1994). Experiences with EDO: An evolutionary database optimizer. Data & Knowledge Engineering, 13, 243-263.

    van Bommel, P. (1995, July). Database design by computer aided schema transformations. Software Engineering Journal, 10(4), 125-132.

    van Bommel, P., Kovacs, Gy. & Micsik, A. (1994). Transformation of database populations and operations from the conceptual to the internal level. Information Systems, 19(2), 175-191.

    van Bommel, P., Lucasius, C.B. & Weide, Th.P. van der (1994). Genetic algorithms for optimal logical database design. Information and Software Technology, 36(12), 725-732.

    van Bommel, P. & Weide, Th.P. van der (1992). Reducing the search space for conceptual schema transformation. Data & Knowledge Engineering, 8, 269-292.

    Acknowledgments

    The editor gratefully acknowledges the help of all involved in the production of this book. Without their support, this project could not have been satisfactorily completed. A further special note of thanks goes also to all the staff at Idea Group Publishing, whose contributions throughout the whole process from inception of the initial idea to final publication have been invaluable.

    Deep appreciation and gratitude is due to Theo van der Weide and other members of the Department of Information Systems at the University of Nijmegen, The Netherlands, for the discussions about transformations of information models.

    Most of the authors of chapters included in this book also served as reviewers for chapters written by other authors. Thanks go to all those who provided constructive and comprehensive reviews. Special thanks also go to the publishing team at Idea Group Publishing, in particular to Michele Rossi, Carrie Skovrinskie, Jan Travers, and Mehdi Khosrow-Pour.

    In closing, I wish to thank all of the authors for their insights and excellent contributions to this book.

    Patrick van Bommel, PhD
    Nijmegen, The Netherlands
    February [email protected]://www.cs.kun.nl/~pvb


  • Section I

    Fundamentals of Transformations

  • Transformation-Based Database Engineering 1

    Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.

    Chapter I

    Transformation-Based Database Engineering

    Jean-Luc Hainaut, University of Namur, Belgium

    Abstract

    In this chapter, we develop a transformational framework in which many database engineering processes can be modeled in a precise way, and in which properties such as semantics preservation and propagation can be studied rigorously. Indeed, the transformational paradigm is particularly suited to database schema manipulation and translation, which are the basis of such processes as schema normalization and optimization, model translation, reverse engineering, database integration and federation, or database migration. The presentation first develops a theoretical framework based on a rich, wide spectrum specification model. Then, it describes how more complex transformations can be built through predicate-based filtering and composition. Finally, it analyzes two major engineering activities, namely database design and reverse engineering, modeled as goal-oriented schema transformations.


    Motivation and Introduction

    Modeling software design as the systematic transformation of formal specifications into efficient programs, and building CASE¹ tools that support it, has long been considered one of the ultimate goals of software engineering. For instance, Balzer (1981) and Fikas (1985) consider that "the process of developing a program [can be] formalized as a set of correctness-preserving transformations [...] aimed to compilable and efficient program production." In this context, according to Partsch (1983),

    a transformation is a relation between two program schemes P and P' (a program scheme is the [parameterized] representation of a class of related programs; a program of this class is obtained by instantiating the scheme parameters). It is said to be correct if a certain semantic relation holds between P and P'.

    These definitions still hold for database schemas, which are special kinds ofabstract program schemes. The concept of transformation is particularly attrac-tive in this realm, though it has not often been made explicit (for instance, as auser tool) in current CASE tools. A (schema) transformation is most generallyconsidered to be an operator by which a data structure S1 (possibly empty) isreplaced by another structure S2 (possibly empty) which may have some sort ofequivalence with S1. Some transformations change the information contents ofthe source schema, particularly in schema building (adding an entity type or anattribute) and in schema evolution (removing a constraint or extending arelationship type). Others preserve it and will be called semantics-preserving orreversible. Among them, we will find those which just change the nature of aschema object, such as transforming an entity type into a relationship type orextracting a set of attributes as an independent entity type.Transformations that are proved to preserve the correctness of the originalspecifications have been proposed in practically all the activities related toschema engineering: schema normalization (Rauh, 1995), DBMS2 schematranslation (Hainaut, 1993b; Rosenthal, 1988), schema integration (Batini, 1992;McBrien, 2003), schema equivalence (DAtri, 1984; Jajodia, 1983; Kobayashi,1986; Lien, 1982), data conversion (Navathe, 1980; Estivenart, 2003), reverseengineering (Bolois, 1994; Casanova, 1984; Hainaut, 1993, 1993b), schemaoptimization (Hainaut, 1993b; Halpin, 1995) database interoperability (McBrien,2003; Thiran, 2001) and others. The reader will find in Hainaut (1995) anillustration of numerous application domains of schema transformations.The goal of this chapter is to develop and illustrate a general framework fordatabase transformations in which all the processes mentioned above can be


    formalized and analyzed in a uniform way. We present a wide-spectrum formalism in which all the information/data models currently used can be specified, and on which a set of basic transformational operators is defined. We also study the important property of semantics-preservation of these operators. Next, we explain how higher-level transformations can be built through three mechanisms, from mere composition to complex model-driven transformation. The database design process is revisited and given a transformational interpretation. The same exercise is carried out in the next section for database reverse engineering, then we conclude the chapter.

    Schema Transformation Basics

    This section describes a general transformational theory that will be used as the basis for modeling database engineering processes. First, we discuss some preliminary issues concerning the way such theories can be developed. Then, we define a wide-spectrum model from which operational models (i.e., those which are of interest for practitioners) can be derived. The next sections are dedicated to the concept of transformation, to its semantics-preservation property, and to the means to prove it. Finally, some important basic transformations are described.

    Warning. In the database world, a general formalism in which database specifications can be built is called a model. The specification of a database expressed in such a model is called a schema.

    Developing Transformational Theories

    Developing a general purpose transformational theory requires deciding on the specification formalism, i.e., the model, in which the schemas are expressed and on the set of transformational operators. A schema can be defined as a set of constructs (entity types, attributes, keys, indexes, etc.) borrowed from a definite model whose role is to state which constructs can be used, according to which assembly rules, in order to build valid schemas. For simplicity, the concept of entity type is called a construct of the ERA³ model, while entity type CUSTOMER is a construct of a specific schema. They are given the same name, though the latter is an instance of the former.

    Though some dedicated theories rely on a couple of models, such as those which are intended to produce relational schemas from ERA schemas, the most interesting theories are based on a single formalism. Such a formalism defines


    the reference model on which the operators are built. According to its generality and its abstraction level, this model defines the scope of the theory, which can address a more or less wide spectrum of processes. For instance, building a theory on the relational model will allow us to describe, and to reason on, the transformation of relational schemas into other relational schemas. The 1NF⁴ normalization theory is a popular example. Another example would be a transformational theory based on the ORM (Object-Role model) that would provide techniques for transforming (normalizing, optimizing) conceptual schemas into other schemas of the same abstraction level (de Troyer, 1993; Proper, 1998). The hard challenge is to choose a unique model that can address not only intra-model transformations, but inter-model operators, such as ORM-to-relational conversion.

    To identify such models, let us consider a set of models Γ that includes, among others, all the operational formalisms that are of interest for a community of practitioners, whatever the underlying paradigm, the age and the abstraction level of these formalisms. For instance, in a large company whose information system relies on many databases (be they based on legacy or modern technologies) that have been designed and maintained by several teams, this set is likely to include several variants of the ERA model, UML class diagrams, several relational models (e.g., Oracle 5 to 10 and DB2 UDB), the object-relational model, the IDMS and IMS models and of course the standard file structure model on which many legacy applications have been developed.

    Let us also consider the transitive inclusion relation ⊆ such that M ⊆ M', where M ≠ M' and M, M' ∈ Γ, means that all the constructs of M also appear in M'.⁵ For instance, if M denotes the standard relational model and M' the object-relational model, then M ⊆ M' holds, since each schema expressed in M is a valid schema according to model M'.

    Now, we consider a model M* in Γ, such that:

    ∀M ∈ Γ, M ≠ M*: M ⊆ M*,

    and a model M0 in Γ, for which the following property holds:

    ∀M ∈ Γ, M ≠ M0: M0 ⊆ M.

    (Γ, ⊆) forms a lattice of models, in which M0 denotes the bottom node and M* the upper node.

    M0, admittedly non-empty, is made up of a very small set of elementary abstract constructs, typically nodes, edges and labels. An ERA schema S comprising an entity type E with two attributes A1 and A2 would be represented in M0 by the


    nodes n1, n2, n3 which are given the labels E, A1 and A2, and by the edges (n1,n2) and (n1,n3).

    On the contrary, M* will include a greater variety of constructs, each of them being a natural abstraction of one or several constructs of lower-level models. This model should include, among others, the concepts of object type, attribute and inter-object association, so that the contents of schema S will be represented in M* by an object type with name E comprising two attributes with names A1 and A2.

    Due to their high level of abstraction, models M0 and M* are good candidates to develop a transformational theory relying on a single model. Considering the context-dependent definition of Γ, M0 and M*, we cannot assert that these concepts are unique. Therefore, there is no guarantee that a universal theory can be built.

    Approaches based on M0 generally define data structures as semantics-free binary graphs on which a small set of rewriting operators are defined. The representation of an operational model M such as ERA, relational or XML, in M0 requires some additional features such as typed nodes (object, attribute, association and roles for instance) and edges, as well as ad hoc assembly rules that define patterns. A transformation specific to M is also defined by a pattern, a sort of macro-transformation, defined by a chain of M0 transformations. McBrien (1998) is a typical example of such theories. We can call this approach constructive or bottom-up, since we build operational models and transformations by assembling elementary building blocks.

    The approaches based on M* naturally require a larger set of rewriting rules. An operational model M is defined by specializing M*, that is, by selecting a subset of concepts and by defining restrictive assembly rules. For instance, a relational schema can be defined as a set of object types (tables), a set of attributes (columns), each associated with an object type (at least one attribute per object type) and a set of uniqueness (keys) and inclusion (foreign keys) constraints. This model does not include the concept of association. The transformations of M are those of M* which remain meaningful. This approach can be qualified as specialization or top-down, since an operational model and its transformational operators are defined by specializing (i.e., selecting, renaming, restricting) M* constructs and operators. DB-MAIN (Hainaut, 1996b) is an example of this approach. In the next section, we describe the main aspects of its model, named GER.⁶
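    The two ends of the lattice can be made concrete with a small sketch. It is only an illustration under stated assumptions: a model is encoded as the set of construct names it offers, the construct names themselves are invented for the example, and the helper names (included_in, M0Graph, entity_type_to_m0) are hypothetical rather than taken from any tool.

```python
from dataclasses import dataclass, field

# Hypothetical encoding: a model is the set of construct names it offers.
RELATIONAL = frozenset({"table", "column", "primary key", "foreign key"})
OBJECT_RELATIONAL = RELATIONAL | {"object type", "collection attribute"}


def included_in(m, m_prime):
    """Inclusion of models: every construct of M also appears in M'."""
    return m <= m_prime


@dataclass
class M0Graph:
    """A schema expressed with M0 constructs only: labeled nodes and edges."""
    labels: dict = field(default_factory=dict)   # node id -> label
    edges: set = field(default_factory=set)      # (parent node, child node)


def entity_type_to_m0(name, attributes):
    """Encode an entity type and its attributes as nodes n1, n2, ... plus edges."""
    graph = M0Graph()
    graph.labels["n1"] = name
    for position, attribute in enumerate(attributes, start=2):
        node = f"n{position}"
        graph.labels[node] = attribute
        graph.edges.add(("n1", node))
    return graph


# Schema S of the text: entity type E with attributes A1 and A2.
s = entity_type_to_m0("E", ["A1", "A2"])
print(s.labels)
print(sorted(s.edges))
print(included_in(RELATIONAL, OBJECT_RELATIONAL))
```

    In an M*-based approach the same schema would instead be held as a single typed object (an object type E with two attributes), which is why M* needs more rewriting rules but fewer ad hoc patterns.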

    Data Structure Specification Model

    Database engineering is concerned with building, converting and transforming database schemas at different levels of abstraction, and according to various


    paradigms. Some processes, such as normalization, integration and optimization, operate in a single model, and will require intra-model transformations. Other processes, such as logical design, use two models, namely the source and target models. Finally, some processes, among others, reverse engineering and federated database development, can operate on an arbitrary number of models (or on a hybrid model made up of the union of these models) as we will see later on.

    The GER model is a wide-spectrum formalism that has been designed to:

    • express conceptual, logical and physical schemas, as well as their manipulation,
    • support all the data-centered engineering processes, and
    • support all DMS⁷ models and the production and manipulation of their schemas.

    The GER is an extended entity-relationship model that includes, among others, the concepts of schema, entity type, entity collection, domain, attribute, relationship type, keys, as well as various constraints. In this model, a schema is a description of data structures. It is made up of specification constructs which can be, for convenience, classified into the usual three abstraction levels, namely conceptual, logical and physical. We will enumerate some of the main constructs that can appear at each level:

    • A conceptual schema comprises entity types (with/without attributes; with/without identifiers), super/subtype hierarchies (single/multiple, total and disjoint properties), relationship types (binary/N-ary; cyclic/acyclic; with/without attributes; with/without identifiers), roles of relationship type (with min-max cardinalities; with/without explicit name; single/multi-entity-type), attributes (of entity or relationship types; multi/single-valued; atomic/compound; with cardinality), identifiers (of entity type, relationship type, multivalued attribute; comprising attributes and/or roles), constraints (inclusion, exclusion, coexistence, at-least-one, etc.).

    • A logical schema comprises record types, fields, arrays, foreign keys, redundancy, etc.

    • A physical schema comprises files, record types, fields, access keys (a generic term for index, calc key, etc.), physical data types, bag and list multivalued attributes, and other implementation details.

    It is important to note that these levels are not part of the model. The schema of Figure 1 illustrates some major concepts borrowed from these three levels. Such a hybrid schema could appear in reverse engineering.


    One remarkable characteristic of wide-spectrum models is that all the transformations, including inter-model ones, appear as intra-model operators. This has highly interesting consequences. First, a transformation Σ designed for manipulating schemas in an operational model M1 can be used in a model M2 as well, provided that M2 includes the constructs on which Σ operates. For instance, most transformations dedicated to COBOL data structure reverse engineering appear to be valid for relational schemas as well. This strongly reduces the number of operators. Secondly, any new model can profit from the techniques and reasoning that have been developed for current models. For instance, designing methods for translating conceptual schemas into object-relational structures or into XML schemas (Estiévenart, 2003), or reverse engineering OO-databases (Hainaut, 1997) have proved particularly easy since these new methods can be, to a large extent, derived from standard ones.

    The GER model has been given a formal semantics in terms of an extended NF2 model (Hainaut, 1989, 1996). This semantics will allow us to analyze the properties of transformations, and particularly to precisely describe how, and under which conditions, they propagate and preserve the information contents of schemas.

    Figure 1. Typical hybrid schema made up of conceptual constructs (e.g., entity types PERSON, CUSTOMER, EMPLOYEE and ACCOUNT, relationship type of, identifier Customer ID of CUSTOMER), logical constructs (e.g., record type ORDER, with various kinds of fields including an array, foreign keys ORIGIN and DETAIL.REFERENCE) and physical objects (e.g., table PRODUCT with primary key PRO_CODE and indexes PRO_CODE and CATEGORY, table space PRODUCT.DAT). (Note that the identifier of ACCOUNT, stating that the accounts of a customer have distinct Account numbers, makes it a dependent or weak entity type.)

    [Diagram not reproduced. Contents recoverable from the figure: PERSON (Name, Address), with an is-a marker T; EMPLOYEE (Employe Nbr, Date Hired; id: Employe Nbr); CUSTOMER (Customer ID; id: Customer ID); ACCOUNT (Account NBR, Amount; id: of.CUSTOMER, Account NBR); relationship type of (role cardinalities 1-1 and 0-N); ORDER (ORD-ID, DATE_RECEIVED, ORIGIN, DETAIL[1-5] array with components REFERENCE and QTY-ORD; id: ORD-ID; ref: ORIGIN; ref: DETAIL[*].REFERENCE); PRODUCT (PRO_CODE, CATEGORY, DESCRIPTION, UNIT_PRICE; id: PRO_CODE acc; acc: CATEGORY); collection PRODUCT.DAT containing PRODUCT.]


    Let us note that we have discarded the UML class model as a candidate for M* due to its intrinsic weaknesses, including its lack of agreed-upon semantics, its non-regularity and the absence of essential concepts. On the contrary, a carefully defined subset of the UML model could be a realistic basis for constructive approaches.

    Specifying Operational Models with the GER

    In this section, we illustrate the specialization mechanism by describing a popular operational formalism, namely the standard 1NF relational model. All the other models, be they conceptual, logical or physical, can be specified similarly.

    A relational schema mainly includes tables, domains, columns, primary keys, unique constraints, not null constraints and foreign keys. The relational model can therefore be defined as in Figure 2. A GER schema made up of constructs from the first column only, that satisfy the assembly rules, can be called relational. As a consequence, a relational schema cannot comprise is-a relations, relationship types, multivalued attributes or compound attributes.

    The physical aspects of the relational data structures can be addressed as well. Figure 3 gives additional specifications through which physical schemas for a specific RDBMS can be specified. These rules generally include limitations such as "no more than 64 columns per index", or "the total length of the components of any index cannot exceed 255 characters".

    Figure 2. Defining standard relational model as a subset of the GER model

    GER constructs | relational constructs | assembly rules
    schema | database schema |
    entity type | table | an entity type includes at least one attribute
    simple domain | domain |
    single-valued and atomic attribute with cardinality [0-1] | nullable column |
    single-valued and atomic attribute with cardinality [1-1] | not null column |
    primary identifier | primary key | a primary identifier comprises attributes with cardinality [1-1]
    secondary identifier | unique constraint |
    reference group | foreign key | the composition of the reference group must be the same as that of the target identifier
    GER names | SQL names | the GER names must follow the SQL syntax
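    Defining a model by specialization amounts to a compliance check: a schema is relational if it uses only the retained constructs and obeys the assembly rules. The sketch below illustrates the idea on a few of the rules of Figure 2. The classes Attribute and EntityType and the function relational_violations are hypothetical names chosen for the example; they do not reproduce the GER implementation of any tool, and only a subset of the rules is covered.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    name: str
    min_card: int = 1
    max_card: int = 1        # a value above 1 means multivalued
    components: tuple = ()   # a non-empty tuple means compound


@dataclass
class EntityType:
    name: str
    attributes: list
    primary_id: tuple = ()   # names of the attributes of the primary identifier


def relational_violations(entity_types, rel_types=(), is_a=()):
    """Return the list of reasons why a simplified GER schema is not relational."""
    problems = []
    if rel_types:
        problems.append("relationship types are not relational constructs")
    if is_a:
        problems.append("is-a relations are not relational constructs")
    for et in entity_types:
        if not et.attributes:
            problems.append(f"{et.name}: an entity type includes at least one attribute")
        by_name = {a.name: a for a in et.attributes}
        for a in et.attributes:
            if a.max_card > 1:
                problems.append(f"{et.name}.{a.name}: multivalued attribute")
            if a.components:
                problems.append(f"{et.name}.{a.name}: compound attribute")
        for id_name in et.primary_id:
            a = by_name.get(id_name)
            if a is None or (a.min_card, a.max_card) != (1, 1):
                problems.append(
                    f"{et.name}: primary identifier component {id_name} "
                    "must have cardinality [1-1]"
                )
    return problems


customer = EntityType(
    "CUSTOMER",
    [Attribute("CUST_ID"), Attribute("PHONE", min_card=0, max_card=5)],
    primary_id=("CUST_ID",),
)
for problem in relational_violations([customer]):
    print(problem)
```

    Each reported violation points at a construct that a transformation has to remove before the schema can be called relational, which is the idea that model-driven transformations build on.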


    Transformation: Definition

    The definitions that will be stated here are model-independent. In particular, they are valid for the GER model, so that the examples will be given in the latter. Let us denote by M the model in which the source and target schemas are expressed, by S the schema on which the transformation is to be applied and by S' the schema resulting from this application. Let us also consider sch(M), a function that returns the set of all the valid schemas that can be expressed in model M, and inst(S), a function that returns the set of all the instances that comply with schema S.

    A transformation Σ consists of two mappings T and t (Figure 4):

    • T is the structural mapping from sch(M) onto itself, that replaces source construct C in schema S with construct C'. C' is the target of C through T, and is noted C' = T(C). In fact, C and C' are classes of constructs that can be defined by structural predicates. T is therefore defined by the minimal precondition P that any construct C must satisfy in order to be transformed by T, and the maximal postcondition Q that T(C) satisfies. T specifies the rewriting rule of Σ.

    • t is the instance mapping from inst(S) onto inst(S'), that states how to produce the T(C) instance that corresponds to any instance of C. If c is an instance of C, then c' = t(c) is the corresponding instance of T(C). t can be specified through any algebraic, logical or procedural expression.
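    As a minimal sketch of this definition, a transformation can be held as a record of four functions: the precondition P, the structural mapping T, the postcondition Q and the instance mapping t. The class Transformation, the dictionary encoding of constructs and the example operator array_to_set are assumptions made for illustration; they are not the notation of the chapter nor an API of the DB-MAIN tool.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Transformation:
    P: Callable[[Any], bool]   # minimal precondition on the source construct C
    T: Callable[[Any], Any]    # structural mapping, C -> C'
    Q: Callable[[Any], bool]   # postcondition satisfied by T(C)
    t: Callable[[Any], Any]    # instance mapping, c -> c'

    def apply(self, construct, instance):
        """Return (T(C), t(c)) after checking P on C and Q on T(C)."""
        if not self.P(construct):
            raise ValueError("precondition P does not hold for this construct")
        target = self.T(construct)
        if not self.Q(target):
            raise AssertionError("postcondition Q does not hold for T(C)")
        return target, self.t(instance)


# Illustrative operator: an array attribute becomes a set of (Index, Value) pairs.
array_to_set = Transformation(
    P=lambda c: c["kind"] == "array attribute",
    T=lambda c: {
        "kind": "set attribute",
        "name": c["name"],
        "components": ("Index", "Value"),
        "id": ("Index",),
    },
    Q=lambda c: c["kind"] == "set attribute" and c["id"] == ("Index",),
    t=lambda values: {(i, v) for i, v in enumerate(values, start=1)},
)

target, instance = array_to_set.apply(
    {"kind": "array attribute", "name": "A2"}, ["x", "y"]
)
print(target["kind"], sorted(instance))
```

    Keeping P and Q explicit is what later allows an inverse to be described by swapping them.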

    Figure 3. Defining the main technical constructs of relational data structures as they are implemented in a specific RDBMS

    GER constructs | relational constructs | assembly rules for a specific DBMS
    access key | index | comprises from 1 to 64 attributes of the parent entity type
    collection | table space | a collection includes 1 to 255 entity types; an entity type belongs to at most 1 collection

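    Figure 4. Two mappings of schema transformation Σ = <T,t> (The inst_of arrow from x to X indicates that x is an instance of X.)

    [Diagram not reproduced: T maps C to C' = T(C); t maps c to c' = t(c); c is an instance (inst_of) of C and c' is an instance of C'.]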


    According to the context, Σ will be noted either <T,t> or <P,Q,t>. Each transformation Σ is associated with an inverse transformation Σ' which can undo the result of the former under certain conditions that will be detailed in the next section.

    Reversibility of a Transformation

    The extent to which a transformation preserves the information contents of a schema is an essential issue. Some transformations appear to augment the semantics of the source schema (e.g., adding an attribute), some remove semantics (e.g., removing an entity type), while others leave the semantics unchanged (e.g., replacing a relationship type with an equivalent entity type). The latter are called reversible or semantics-preserving. If a transformation is reversible, then the source and the target schemas have the same descriptive power, and describe the same universe of discourse, although with a different presentation.

    • A transformation Σ1 = <T1,t1> = <P1,Q1,t1> is reversible, iff there exists a transformation Σ2 = <T2,t2> = <P2,Q2,t2> such that, for any construct C, and any instance c of C: P1(C) ⇒ ([T2(T1(C)) = C] and [t2(t1(c)) = c]). Σ2 is the inverse of Σ1, but the converse is not true⁸. For instance, an arbitrary instance c' of T(C) may not satisfy the property c' = t1(t2(c')).

    • If Σ2 is reversible as well, then Σ1 and Σ2 are called symmetrically reversible. In this case, Σ2 = <Q1,P1,t2>. Σ1 and Σ2 are called SR-transformations for short.
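    The difference between a reversible and a symmetrically reversible transformation can be observed on instance mappings alone. The toy pair below is an assumption made for illustration (it is not one of the chapter's operators): t1 turns a set into a sorted tuple and t2 turns a tuple back into a set. The check t2(t1(c)) = c succeeds for every set, while t1(t2(c')) = c' fails for target instances that do not satisfy the postcondition Q.

```python
def t1(c):
    """Instance mapping of the toy transformation: a set becomes a sorted tuple."""
    return tuple(sorted(c))


def t2(c_prime):
    """Candidate inverse instance mapping: a tuple becomes a set."""
    return frozenset(c_prime)


def round_trips(forward, backward, instances):
    """True if backward(forward(x)) == x for every sample instance x."""
    return all(backward(forward(x)) == x for x in instances)


def satisfies_q(c_prime):
    """Postcondition Q on target instances: sorted and free of duplicates."""
    return list(c_prime) == sorted(set(c_prime))


source_instances = [frozenset(), frozenset({3, 1, 2}), frozenset({7})]
target_instances = [(1, 2, 3), (3, 1, 2), (1, 1, 2)]

# Reversibility of the toy transformation: t2 undoes t1 on every source instance.
print(round_trips(t1, t2, source_instances))

# Arbitrary target instances are not all recovered by t1 after t2 ...
print(round_trips(t2, t1, target_instances))

# ... but those satisfying Q are, which is the symmetric case.
print(round_trips(t2, t1, [c for c in target_instances if satisfies_q(c)]))
```

    Testing on samples does not prove reversibility; it only illustrates the two conditions that a proof has to establish.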

    Similarly, in the pure software engineering domain, Balzer (1981) introduces the concept of correctness-preserving transformation aimed at compilable and efficient program production.

    We have discussed the concept of reversibility in a context in which some kind of instance equivalence is preserved. However, the notion of inverse transformation is more general. Any transformation, be it semantics-preserving or not, can be given an inverse. For instance, del-ET(et_name), which removes entity type with name et_name from its schema, clearly is not a semantics-preserving operation, since its mapping t has no inverse. However, it has an inverse transformation, namely create-ET(et_name). Since only the T part is defined, this partial inverse is called a structural inverse transformation.


    Proving the Reversibility of a Transformation

    Thanks to the formal semantics of the GER, a proof system has been developed to evaluate the reversibility of a transformation. More precisely, this system relies on a limited set of NF2 transformational operators whose reversibility has been proven, and that can generate a large number of GER transformations. Basically, the system includes five families of transformations, that can be combined to form more complex operators:

    • denotation, through which a new object set is defined by a derivation rule based on existing structures,
    • project-join, which is a variant of the decomposition theorem,
    • composition, which replaces two relations by one of them and their composition,
    • nest-unnest, the typical 1NF ↔ N1NF operators, and
    • container, that states the equivalence between non-set containers (e.g., bags, lists, arrays) and sets.

    Thanks to a complete set of mapping rules between the GER model and the NF2 model in which these basic transformations have been built, the latter can be applied to operational schemas. Figure 5 shows how we have defined a decomposition operator for normalizing relationship types from the basic project-join transformation. It is based on a three-step process:

    1. Source schema (Figure 5, top-left) is expressed in the NF2 formalism (bottom-left):

       {entities: A,B,C; R(A,B,C); A → B}

    2. Basic project-join transformation is applied and yields a normalized relational schema (bottom-right):

       {entities: A,B,C; R1(A,B); R2(A,C); R1[A]=R2[A]}

    3. NF2 schema is expressed in the GER, leading to the target schema (Figure 5, top-right).


    Since the GER ↔ NF2 mappings are symmetrically reversible and the project-join is an SR-transformation, the ERA transformation is symmetrically reversible as well. It can be defined as follows, each chain being applied from left to right:

    T1 = T11 ∘ T12 ∘ T13

    T1' = T13' ∘ T12' ∘ T11'

    We note the important constraint R1[A]=R2[A] that gives the project-join transformation the SR property, while Fagin's theorem merely defines a reversible operator. We observe how this constraint translates into a coexistence constraint in the GER model that states that if an A entity is connected to a B entity, it must be connected to at least one C entity as well, and conversely.

    The reader interested in a more detailed description of this proof system is referred to Hainaut (1996).
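    The role of the constraint R1[A]=R2[A] can be checked on small instances. The following sketch encodes R as a set of (a, b, c) triples, which is an assumption made for the example, and uses hypothetical function names. It shows that decomposition followed by join returns R when A → B holds, and that join followed by decomposition returns (R1, R2) only when the coexistence constraint is satisfied.

```python
def holds_fd(r):
    """True if the functional dependency A -> B holds in R(A, B, C)."""
    seen = {}
    for a, b, _ in r:
        if seen.setdefault(a, b) != b:
            return False
    return True


def decompose(r):
    """Project R(A, B, C) onto R1(A, B) and R2(A, C)."""
    r1 = {(a, b) for a, b, _ in r}
    r2 = {(a, c) for a, _, c in r}
    return r1, r2


def join(r1, r2):
    """Natural join of R1(A, B) and R2(A, C) on A."""
    return {(a, b, c) for a, b in r1 for a2, c in r2 if a == a2}


def coexistence(r1, r2):
    """The constraint R1[A] = R2[A]."""
    return {a for a, _ in r1} == {a for a, _ in r2}


r = {("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a2", "b2", "c1")}
r1, r2 = decompose(r)
print(holds_fd(r), join(r1, r2) == r, coexistence(r1, r2))

# A target instance violating R1[A] = R2[A] is not recovered after join + decompose.
r1_bad, r2_bad = {("a1", "b1"), ("a3", "b3")}, {("a1", "c1")}
print(coexistence(r1_bad, r2_bad), decompose(join(r1_bad, r2_bad)) == (r1_bad, r2_bad))
```

    The second case is the reason why the decomposition alone is only reversible: the a3 row of R1 has no counterpart in R2 and disappears in the join.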

    Six Mutation Transformations

    A mutation is an SR-transformation that changes the nature of an object. Considering the three main natures of object, namely entity type, relationship type and attribute, six mutation transformations can be defined. In Figure 6, the couples of operators Σ1 to Σ3 show them applied on typical schema fragments. The transformations Σ4 are not primitive since they can be defined by combining other mutations. However, they have been added due to their usefulness. More

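    Figure 5. Proving the SR property of the decomposition of a relationship type according to a multivalued dependency (here an FD)

    [Diagram not reproduced. GER level (top): the source schema, a relationship type R among A, B and C (roles 0-N) with R: A → B, is mapped by T1 onto the target schema, and back by T1'; the target schema comprises relationship types R1 between A and B (roles 0-1 and 0-N) and R2 between A and C (roles 0-N), with the constraint coex: R2.C[*], R1.B. NF2 level (bottom): {entities: A,B,C; R(A,B,C); A → B} is mapped by T12 onto {entities: A,B,C; R1(A,B); R2(A,C); R1[A]=R2[A]}, and back by T12'. T11 and T11' link the source schema to its NF2 expression; T13 and T13' link the NF2 target to the target schema.]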


    sophisticated mutation operators can be defined, as illustrated in Hainaut (1991) in the range of entity-generating transformations.
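    To make the instance side of a mutation concrete, the sketch below mimics the attribute-to-entity-type mutation (transforming multivalued attribute A2 into entity type EA2, and conversely). The encoding is an assumption made for the example: A entities are held in a dictionary keyed by a surrogate entity id, EA2 is the set of distinct A2 values, and the relationship type ra2 is a set of (entity id, value) pairs. The function names are hypothetical.

```python
def attribute_to_entity_type(a_entities):
    """Extract multivalued attribute A2 as entity type EA2 linked through ra2.

    a_entities maps an entity id to {'A1': ..., 'A2': frozenset, 'A3': ...}.
    """
    a_target = {
        eid: {"A1": e["A1"], "A3": e["A3"]} for eid, e in a_entities.items()
    }
    ea2 = {value for e in a_entities.values() for value in e["A2"]}   # id: A2
    ra2 = {(eid, value) for eid, e in a_entities.items() for value in e["A2"]}
    return a_target, ea2, ra2


def entity_type_to_attribute(a_target, ea2, ra2):
    """Inverse mapping: fold the EA2 entities back into attribute A2 of A."""
    return {
        eid: {
            "A1": e["A1"],
            "A2": frozenset(value for owner, value in ra2 if owner == eid),
            "A3": e["A3"],
        }
        for eid, e in a_target.items()
    }


source = {
    "a#1": {"A1": 10, "A2": frozenset({"red", "blue"}), "A3": "x"},
    "a#2": {"A1": 20, "A2": frozenset({"red"}), "A3": "y"},
}
a_target, ea2, ra2 = attribute_to_entity_type(source)
print(sorted(ea2))
print(entity_type_to_attribute(a_target, ea2, ra2) == source)
```

    The value "red" is stored once in EA2 although two A entities use it, which reflects the identifier id: A2 of the target entity type.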

    Other Basic Transformations

    The mutation transformations can solve many database engineering problems, but other operators are needed to model special situations. The CASE tool associated with the DB-MAIN methodologies includes a kit of about 30 basic operators that have proven sufficient for most engineering activities. When necessary, user-defined operators can be developed through the meta functions of the tool (Hainaut, 1996b). We will describe some of the basic operators.

    Expressing supertype/subtype hierarchies in DMS that do not support them explicitly is a recurrent problem. The technique of Figure 7 is one of the most commonly used (Hainaut, 1996c). It consists in representing each source entity type by an independent entity type, then linking each subtype to its supertype through a one-to-one relationship type. The latter can, if needed, be further transformed into foreign keys by application of Σ2-direct (T2).

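    Figure 6. Six mutation transformations Σ1 to Σ3 (Though not primitive, compound transformations Σ4 are shown as well.)

    [Schema diagrams not reproduced. For each couple of operators, the source and target fragments recoverable from the figure are listed, followed by the comment given in the figure.]

    Σ1 (T1, T1'). Source: relationship type r between A and B (roles 0-N and 0-5). Target: entity type R linked to A through rA and to B through rB (roles 1-1 for R; id: rA.A, rB.B). Comment: Transforming relationship type r into entity type R (T1) and conversely (T1'). Note that R entities are identified by any couple (a,b) ∈ A×B through relationship types rA and rB (id: rA.A, rB.B).

    Σ2 (T2, T2'). Source: relationship type r between A (A1; id: A1) and B (roles 0-N and 0-5). Target: A (A1; id: A1) and B (A1[0-5]; ref: A1[*]). Comment: Transforming relationship type r into reference attribute B.A1 (T2) and conversely (T2').

    Σ3 (T3, T3'). Source: A (A1, A2[0-5], A3). Target: A (A1, A3) linked through ra2 (roles 0-5 and 1-N) to EA2 (A2; id: A2). Comment: Transforming attribute A2 into entity type EA2 (T3) and conversely (T3').

    Σ4 (T4, T4'). Source: A (A1, A2[0-5], A3). Target: A (A1, A3) linked through ra2 (roles 0-5 and 1-1) to EA2 (A2; id: ra2.A, A2). Comment: Not a primitive operator. T4 can be defined by application of T3 to EA2.A2, then of T1' to EA2 in the schema above. Note that the EA2 entities depending on the same A entity have distinct A2 values (id: ra2.A, A2).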


    Transformations Σ3 and Σ4 show how to process standard multivalued attributes. When the collection of values is no longer a set but a bag, a list or an array, operators to transform them into pure set-oriented constructs are most useful. Transformations Σ6 in Figure 8 are dedicated to arrays. Similar operators have been defined for the other types of containers.

    Attributes defined on the same domain and the name of which suggests a spatial or temporal dimension (e.g., departments, countries, years or pure numbers) are called serial attributes. In many situations, they can be interpreted as the

    Figure 7. Transforming an is-a hierarchy into one-to-one relationship types and conversely

    [Schema diagrams not reproduced.]

    Σ5 (T5, T5'). Source: A (A1, A2) with disjoint (D) subtypes B (B1, B2) and C (C1, C2). Target: A (A1, A2; excl: s.C, r.B) linked to B (B1, B2) through r and to C (C1, C2) through s (roles 0-1 and 1-1). Comment: An is-a hierarchy is replaced by one-to-one relationship types. The exclusion constraint (excl: s.C, r.B) states that an A entity cannot be simultaneously linked to a B entity and a C entity. It derives from the disjoint property (D) of the subtypes.

    Figure 8. Converting an array into a set-multivalued attribute and conversely

    [Schema diagrams not reproduced.]

    Σ6 (T6, T6'). Source: A (A1, A2[0-5] array, A3). Target: A (A1, A2[0-5] with components Index and Value, A3; id(A2): Index). Comment: Array A2 (left) is transformed into a multivalued compound attribute A2 (right), whose values are distinct wrt component Index (id(A2): Index). The latter indicates the position of the value (Value). The domain of Index is the range [1..5].

    Figure 9. Transforming serial attributes into a multivalued attribute and conversely

    [Schema diagrams not reproduced.]

    Σ7 (T7, T7'). Source: A (A1, A2X, A2Y, A2Z, A3). Target: A (A1, A2[3-3] with components Dimension and Value, A3; id(A2): Dimension), with dom(A2.Dimension) = {'X','Y','Z'}. Comment: The serial attributes {A2X, A2Y, A2Z} are transformed into the multivalued compound attribute A2 where the values (Value) are indexed with the distinctive suffix of the source attributes, interpreted as a dimension (sub-attribute Dimension, whose domain is the set of suffixes).


    representation of an indexed multivalued attribute (Figure 9). The identification of these attributes must be confirmed by the analyst.
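    The instance mapping behind the serial-attribute operator can be sketched on a single record. The functions below are hypothetical and assume that the analyst has already confirmed the common prefix and the list of suffixes; the multivalued compound attribute is encoded as a dictionary keyed by Dimension, which enforces id(A2): Dimension by construction.

```python
def serial_to_multivalued(record, prefix, suffixes):
    """Group the serial attributes prefix+suffix into one attribute named prefix.

    The result maps each Dimension (the suffix) to its Value.
    """
    serial_names = {prefix + s for s in suffixes}
    target = {k: v for k, v in record.items() if k not in serial_names}
    target[prefix] = {s: record[prefix + s] for s in suffixes}
    return target


def multivalued_to_serial(record, prefix):
    """Inverse mapping: spread the Dimension -> Value pairs into serial attributes."""
    target = {k: v for k, v in record.items() if k != prefix}
    for dimension, value in record[prefix].items():
        target[prefix + dimension] = value
    return target


a = {"A1": 1, "A2X": 10, "A2Y": 20, "A2Z": 30, "A3": "z"}
grouped = serial_to_multivalued(a, "A2", ["X", "Y", "Z"])
print(grouped)
print(multivalued_to_serial(grouped, "A2") == a)
```

    The same pair of functions covers temporal series such as Expense-2000 to Expense-2004 by passing "Expense-" as the prefix and the years as suffixes.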

    Higher-Level Transformations

    The transformations described in the section, Schema Transformation Basics, are intrinsically atomic: one elementary operator is applied to one object instance, and (Σ4 excluded) none can be defined by a combination of others (orthogonality). This section develops three ways through which more powerful transformations can be developed.

    Compound Transformations

    A compound transformation is made up of a chain of more elementary operators in which each transformation applies on the result of the previous one. The transformation Σ8 in Figure 10, illustrated by a concrete example, transforms a complex relationship type R into a sort of bridge entity type comprising as many foreign keys as there are roles in R. It is defined by the composition of Σ1-direct and Σ2-direct. This operator is of frequent use in relational database design.

    The transformation Σ9 is more complex (Figure 11). It is composed of a chain of four elementary operators. The first one transforms the serial attributes Expense-2000, ..., Expense-2004 into multivalued attribute Expense comprising sub-attributes Year (the dimension) and Amount (transformation Σ7-direct). The second one extracts this attribute into entity type EXPENSE, with attributes Year and Amount (transformation Σ4-direct). Then, the same operator is applied to

    Figure 10. Transformation of a complex relationship type into relational structures

    [Schema diagrams not reproduced.]

    Σ8 (T8, T8'). Source: relationship type export (Volume) among PRODUCT (Prod_ID; id: Prod_ID), COUNTRY (Ctry_Name; id: Ctry_Name) and COMPANY (Cy_Name; id: Cy_Name), each role 0-N. Target: PRODUCT, COUNTRY and COMPANY unchanged, and EXPORT (Prod_ID, Ctry_Name, Cy_Name, Volume; id: Ctry_Name, Prod_ID, Cy_Name; ref: Cy_Name; ref: Prod_ID; ref: Ctry_Name). Comment: The relationship type export is first transformed into an entity type + three many-to-one relationship types. Then, the latter are converted into foreign keys.


    attribute Year, yielding entity type YEAR, with attribute Year. Finally, the entity type EXPENSE is transformed into relationship type expense (Σ1-inverse).
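    A compound transformation is essentially function composition over schemas. The sketch below chains two simplified steps in the spirit of the bridge-entity-type operator of Figure 10. The dictionary encoding of schemas and the names compose, rt_into_et and binary_rts_into_refs are assumptions made for the example; the steps handle only the structural mapping and ignore cardinalities and instances.

```python
import copy
from functools import reduce


def compose(*steps):
    """Chain transformations: each step applies to the result of the previous one."""
    return lambda schema: reduce(lambda s, step: step(s), steps, schema)


def rt_into_et(rt_name):
    """Replace a relationship type by an entity type plus one binary
    relationship type per role (simplified structural mapping)."""
    def step(schema):
        s = copy.deepcopy(schema)
        rt = s["rel_types"].pop(rt_name)
        et_name = rt_name.upper()
        s["entity_types"][et_name] = {"attrs": list(rt["attrs"]), "id": [], "refs": []}
        for target in rt["roles"]:
            s["rel_types"][f"{rt_name}_{target}"] = {"roles": [et_name, target], "attrs": []}
        return s
    return step


def binary_rts_into_refs(owner):
    """Replace each binary relationship type of entity type owner by a foreign key
    copied from the identifier of the other entity type."""
    def step(schema):
        s = copy.deepcopy(schema)
        et = s["entity_types"][owner]
        names = [n for n, rt in s["rel_types"].items()
                 if len(rt["roles"]) == 2 and rt["roles"][0] == owner]
        for name in names:
            target = s["rel_types"].pop(name)["roles"][1]
            key = list(s["entity_types"][target]["id"])
            et["attrs"].extend(key)
            et["id"].extend(key)
            et["refs"].append((tuple(key), target))
        return s
    return step


schema = {
    "entity_types": {
        "PRODUCT": {"attrs": ["Prod_ID"], "id": ["Prod_ID"]},
        "COUNTRY": {"attrs": ["Ctry_Name"], "id": ["Ctry_Name"]},
        "COMPANY": {"attrs": ["Cy_Name"], "id": ["Cy_Name"]},
    },
    "rel_types": {
        "export": {"roles": ["PRODUCT", "COUNTRY", "COMPANY"], "attrs": ["Volume"]},
    },
}

to_bridge_table = compose(rt_into_et("export"), binary_rts_into_refs("EXPORT"))
print(to_bridge_table(schema)["entity_types"]["EXPORT"])
```

    Because every step returns a new schema, the chain can be extended or reordered without touching the elementary operators.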

    Predicate-Driven Transformations

    A predicate-driven transformation Σp applies an operator Σ to all the schema objects that meet a definite predicate p. It will be specified by Σ(p). p is a structural predicate that states the properties through which a class of patterns can be identified. Interestingly, a predicate-based transformation can be interpreted as a user-defined elementary operator. Indeed, considering the standard definition Σ = <P,Q,t>, we can rewrite Σp as Σ*(true) where Σ* = <P∧p,Q,t>. In general, the inverse of Σp cannot be derived from the expression of Σ and p. Indeed, there is no means to derive the predicate p' that identifies the constructs resulting from the application of Σp, and only them.

    We give in Figure 12 some useful transformations that are expressed in the specific language of the DB-MAIN tool, which follows the Σ(p) notation. Most predicates are parametric; for instance, the predicate ROLE_per_RT(<n1> <n2>), where <n1> and <n2> are integers such that <n1> ≤ <n2>, states that the number

    Figure 11. Extracting a temporal dimension from serial attributes

    Figure 12. Three examples of predicate-driven transformation

    source schema target schema comment

    9

    ProjectDep#InitialBudgetExpense-2000Expense-2001Expense-2002Expense-2003Expense-2004

    T9 T9'

    dom(Year) = [2000..2004]

    1-N

    5-5

    expenseAmount

    YEARYearid: Year

    ProjectDep#InitialBudget

    The serial attributes are first transformed into a multivalued attribute, which in turn is extracted as external entity type EXPENSE. The dimension attribute (Year) is also extracted as entity type YEAR. Finally, EXPENSE is mutated into relationship type expense.

    predicate-driven transformation interpretation RT_into_ET(ROLE_per_RT(3 N))

    transform each relationship type R into an entity type (RT_into_ET), if the number of roles of R (ROLE_per_RT) is in the range [3 N]; in short, convert all N-ary relationship types into entity types.

    RT_into_REF(ROLE_per_RT(2 2) and ONE_ROLE_per_RT(1 2))

    transform each relationship type R into reference attributes (RT_into_ET), if the number of roles of R is 2 and if R has from 1 to 2 one role(s), i.e., R has at least one role with max cardinality 1; in short, convert all one-to-many relationship types into foreign keys.

    INSTANTIATE(MAX_CARD_of_ATT(2 4))

    transform each attribute A into a sequence of single-value instances, if the max cardinality of A is between 2 and 4; in short, convert multivalued attributes with no more than 4 values into serial attributes.

  • Transformation-Based Database Engineering 17

    Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without writtenpermission of Idea Group Inc. is prohibited.

    of roles of the relationship type falls in the range [..]. The symbol Nstands for infinity.
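    The idea of applying an operator to every object that satisfies a structural predicate can be illustrated with a small Python sketch. It is not DB-MAIN code: the RelType class, the apply_where helper and the simplified result of rt_into_et are assumptions made for the example; only the names ROLE_per_RT and RT_into_ET and the meaning of N are taken from the text.

```python
import math
from dataclasses import dataclass, field

N = math.inf  # the text uses N for infinity

@dataclass
class RelType:
    name: str
    roles: list                       # entity types playing a role
    attributes: list = field(default_factory=list)

def role_per_rt(n1, n2):
    """Structural predicate: the number of roles falls in [n1..n2]."""
    return lambda rt: n1 <= len(rt.roles) <= n2

def rt_into_et(rt):
    """Simplified operator: describe the entity type that replaces rt,
    plus one many-to-one rel-type per former role."""
    return {"entity_type": rt.name.upper(),
            "attributes": list(rt.attributes),
            "rel_types": [(rt.name.upper(), role) for role in rt.roles]}

def apply_where(operator, predicate, objects):
    """Apply operator to the objects that satisfy predicate, keep the others."""
    return [operator(o) if predicate(o) else o for o in objects]

schema = [RelType("export", ["PRODUCT", "COUNTRY", "COMPANY"], ["Volume"]),
          RelType("works_in", ["EMPLOYEE", "DEPARTMENT"])]

# Counterpart of RT_into_ET(ROLE_per_RT(3 N)) on this toy schema.
result = apply_where(rt_into_et, role_per_rt(3, N), schema)
```

    Only the ternary relationship type export is transformed; the binary works_in (a made-up example) is left untouched because it does not satisfy the predicate.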

    Model-Driven Transformations

    A model-driven transformation is a goal-oriented compound transformation made up of predicate-driven operators. It is designed to transform any schema expressed in model M into an equivalent schema in model M'.

    As illustrated in the discussion of the relational model expressed as a specialization of the GER (Figure 2), identifying the components of a model also leads to identifying the constructs that do not belong to it. Except when M ⊆ M', an arbitrary schema S expressed in M may include constructs which violate M'.

    Each construct that can appear in a schema can be specified by a structural predicate. Let P_M denote the set of predicates that defines model M and P_M' that of model M'. In the same way, each potentially invalid construct can be specified by a structural predicate. Let P_M/M' denote the set of the predicates that identify the constructs of M that are not valid in M'. In the DB-MAIN language used in Figure 12, ROLE_per_RT(3 N) is a predicate that identifies N-ary relationship types that are invalid in DBTG CODASYL databases, while MAX_CARD_of_ATT(2 N) defines the family of multivalued attributes that is invalid in the SQL2 database model. Finally, we observe that each such set as P_M can be perceived as a single predicate formed by anding its components.

    Let us now consider predicate p ∈ P_M/M', and let us choose a transformation Σ = <P, Q, t> such that:

    (p ⇒ P) ∧ (P_M' ⇒ Q)

    Clearly, the predicate-driven transformation Σp solves the problem of invalid constructs defined by p. Proceeding in the same way for each component of P_M/M' provides us with a series of operators that can transform any schema in model M into schemas in model M'. We call such a series a transformation plan, which is the practical form of any model-driven transformation. In real situations, a plan can be more complex than a mere sequence of operations, and may comprise loops to process recursive constructs, for instance.

    In addition, transformations such as those specified above may themselves be compound, so that the set of required transformations can be quite large. In such cases, it can be better to choose a transformation that produces constructs that are not fully compliant with M', but that can be followed by other operators which complete the job. For instance, transforming a multivalued attribute can be obtained by an ad hoc compound transformation. However, it can be thought more convenient to first transform the attribute into an entity type + a one-to-many relationship type (Σ4-direct), which can then be transformed into a foreign key (Σ2-direct). This approach produces transformation plans which are more detailed and therefore less readable, but that rely on a smaller and more stable set of elementary operators.

    The transformation toolset of DB-MAIN includes about 30 operators that have proven sufficient to process schemas in a dozen operational models. If all the transformations used to build the plan have the SR-property, then the model-driven transformation that the plan implements is symmetrically reversible. When applied to any source schema, it produces a target schema semantically equivalent to the former. This property is particularly important for conceptual→logical transformations. Figure 13 sketches, in the form of a script, a simple transformation plan intended to produce SQL2 logical schemas from ERA conceptual schemas. Actual plans are more complex, but follow the approach developed in this section.

    It must be noted that this mechanism is independent of the process we are modeling, and that similar transformation plans can be built for processes such as conceptual normalization or reverse engineering. Though model-driven transformations provide an elegant and powerful means of specification of many aspects of most database engineering processes, some other aspects still require human expertise that cannot be translated into formal rules.

    Figure 13. Simple transformation plan to derive a relational schema from any ERA conceptual schema (To make them more readable, the transformations have been expressed in natural language instead of in the DB-MAIN language. The term rel-type stands for relationship type.)

    step 1. transform IS-A relations into one-to-one rel-types
        comment: operator Σ5-direct;
    step 2. transform complex rel-types into entity types
        comment: operator Σ1-direct; complex means N-ary or binary many-to-many or with attributes;
    step 3. disaggregate level-1 compound attributes
        comment: each compound attribute directly depending on an entity type is replaced by its components;
    step 4. transform level-1 multivalued attributes into entity types
        comment: operator Σ4-direct; each multivalued attribute directly depending on an entity type is replaced by an entity type;
    step 5. repeat steps 3 to 4 until the schema does not include complex attributes any more
        comment: to cope with multi-level attribute structures;
    step 6. transform relationship types into reference groups
        comment: at this point, only one-to-many and one-to-one rel-types subsist; they are transformed into foreign keys;
    step 7. if the schema still includes rel-types, add a technical identifier to the relevant entity types and apply step 6
        comment: step 6 fails in case of missing identifier; a technical attribute is associated with the entity type that will be referenced by the future foreign key;
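    The loop formed by steps 3 to 5 of this plan can be sketched in Python as follows. The representation is an assumption made for the example (a schema is a dictionary mapping each entity type name to its list of attributes, built with the hypothetical helper attr), and the relationship types that would link an extracted entity type to its origin are deliberately omitted.

```python
def attr(name, multivalued=False, components=()):
    """Hypothetical attribute record: compound if it has components."""
    return {"name": name, "multivalued": multivalued,
            "components": list(components)}

def disaggregate_compound(schema):
    """Step 3: replace each single-valued level-1 compound attribute
    by its components. Returns True if anything changed."""
    changed = False
    for attributes in schema.values():
        flat = []
        for a in attributes:
            if a["components"] and not a["multivalued"]:
                flat.extend(a["components"])
                changed = True
            else:
                flat.append(a)
        attributes[:] = flat
    return changed

def extract_multivalued(schema):
    """Step 4: replace each level-1 multivalued attribute by an entity type."""
    changed = False
    for name in list(schema):          # snapshot: new entity types are added
        for a in [a for a in schema[name] if a["multivalued"]]:
            schema[name].remove(a)
            schema[a["name"].upper()] = a["components"] or [attr(a["name"])]
            changed = True
    return changed

def run_plan(schema):
    """Step 5: repeat steps 3 and 4 until no complex attribute is left."""
    while disaggregate_compound(schema) | extract_multivalued(schema):
        pass
    return schema

schema = run_plan({"CUSTOMER": [
    attr("Name"),
    attr("Address", components=[attr("Street"), attr("City")]),
    attr("Phone", multivalued=True),
    attr("Order", multivalued=True, components=[
        attr("Date"),
        attr("Detail", multivalued=True,
             components=[attr("Item"), attr("Qty")])]),
]})
names = {et: [a["name"] for a in attrs] for et, attrs in schema.items()}
```

    The CUSTOMER example is invented for the illustration. The two-level attribute Order needs two passes of the loop, which is the situation step 5 is meant to handle.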


    Transformation-Based Database Design

    Most textbooks on database design of the eighties and early nineties propose a five-step approach that is sketched in Figure 14. Through the Conceptual Design phase, users' requirements are translated into a conceptual schema, which is the formal and abstract expression of these requirements. The Logical Design phase transforms the conceptual schema into data structures (the logical schema) that comply with the data model of a family of DMS such as relational, OO or standard file data structures. Through the Physical Design phase, the logical schema is refined and augmented with technical specifications that make it implementable into the target DMS and that give it acceptable performance. From the logical schema, users' views are derived that meet the requirements of classes of users (View Design). Finally, the physical schema and the users' views are coded into the DDL[9] of the DMS (Coding).

    Database Design as a Transformation Process

    Ignoring the view design process for simplification, database design can be modeled by (the structural part of) transformation DB-design:

    code = DB-design(UR)

    where code denotes the operational code and UR the users' requirements.

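    Figure 14. Standard strategy for database design

    [Figure 14: the Database design process comprises Conceptual design, Logical design, Physical design, View design and Coding. Users' requirements are the input of Conceptual design, which yields the Conceptual schema; Logical design yields the Logical schema; Physical design yields the Physical schema; View design yields the Users' views; Coding yields the Operational code.]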


    Denoting the conceptual, logical and physical schemas respectively by CS, LS and PS and the conceptual design, logical design, physical design and coding phases by C-design, L-design, P-design and Coding, we can refine the previous expression as follows:

    CS = C-design(UR)

    LS = L-design(CS)

    PS = P-design(LS)

    code = Coding(PS)

    Clearly, these processes are model-driven transformations and can then be described by transformation plans. The level of formality of these processes depends on the methodology, on the existence of CASE support and on non-functional requirements such as performance and robustness, which generally require human expertise. For instance, conceptual design (C-design) is a highly informal process based on human interpretation of complex information sources, while logical design can be an automated process completely described by a transformation plan. Anyway, these processes can be decomposed into sub-processes that, in turn, can be modeled by transformations and described by transformation plans, and so forth, until the latter reduce to elementary operators. Three of these processes are worth being examined a bit further.

    Conceptual Design

    This process includes, among others, two major sub-processes, namely Basic Analysis, through which informal or semi-formal information sources are analyzed and their semantic contents are translated into conceptual structures, and (Conceptual) Normalization, through which these raw structures are given such additional qualities as readability, normality, minimality, extensibility, compliance with representation standards, etc. (Batini, 1992; Blaha, 1998). This second process is more formal than the former, and is a good candidate for transformational modeling. The plan of Figure 15, though simplistic, can improve the quality of many raw conceptual schemas.
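    One step of the normalization plan of Figure 15, the removal of transitive IS-A relations, lends itself to a short Python sketch. The representation is an assumption made for the example: the IS-A relations are given as a set of (subtype, supertype) pairs, the hierarchy is supposed to be acyclic, and the function name remove_transitive_isa is hypothetical.

```python
def remove_transitive_isa(isa):
    """Drop every pair (A, C) that is implied by a chain such as
    A is-a B and B is-a C. Assumes the IS-A hierarchy has no cycle."""
    supertypes = {}
    for sub, sup in isa:
        supertypes.setdefault(sub, set()).add(sup)

    def reachable(start, target, skipped):
        """True if target can be reached from start without using
        the direct relation 'skipped'."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for nxt in supertypes.get(node, ()):
                if (node, nxt) == skipped or nxt in seen:
                    continue
                if nxt == target:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return {(a, c) for (a, c) in isa if not reachable(a, c, (a, c))}

raw = {("A", "B"), ("B", "C"), ("A", "C")}
normalized = remove_transitive_isa(raw)
```

    In this example the pair (A, C) is dropped because it can be derived from the two other relations.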

    Logical Design

    As shown in the preceding sections, this process can be specified by a model-driven transformation. In fact, we have to distinguish two different approaches, namely ideal and empirical. The ideal design produces a logical schema that meets two requirements only: it complies with the target logical model M and it is semantically equivalent to the conceptual schema. According to the transformational paradigm, the logical design process is an M-driven transformation comprising SR-operators only. The plan of Figure 13 illustrates this principle for relational databases. Similar plans have been designed for CODASYL DBTG, Object-relational and XML (Estiévenart, 2003) databases, among others. Empirical design is closer to the semi-formal way developers actually work, relying on experience and intuition, rather than on standardized procedures. Other requirements such as space and time optimization often are implicitly taken into account, making formal modeling more difficult, if not impossible. Though no comprehensive model-driven transformations can describe such approaches, essential fragments of empirical design based on systematic and reproducible rules can be described by compound or predicate-driven transformations.

    Figure 15. Simple transformation plan to normalize ERA conceptual schemas

    step 1. transform attribute entity types into attributes
        comment: an attribute entity type has key attributes only, and is linked to one other entity type only through a mandatory rel-type; operators Σ3-inverse and Σ4-inverse;
    step 2. transform relationship entity types into rel-types
        comment: a relationship entity type has few attributes, is linked to at least 2 entity types through mandatory many-to-one rel-types and is identified by these entity types; operator Σ1-inverse;
    step 3. decompose N-ary rel-types where a non full-key FD holds
        comment: transformation of Fig. 5;
    step 4. remove transitive IS-A relations
        comment: if A is-a B and B is-a C, then A is-a C cannot be explicitly asserted;
    step 5. transform complex multivalued attributes into entity types
        comment: a complex attribute is compound, multivalued and has a local identifier; it is assumed to express an autonomous entity type; operator Σ4-direct;

    Coding

    Quite often overlooked, this process can be less straightforward and more complex than generally described in the literature or carried out by CASE tools. Indeed, any DMS can cope with a limited range of structures and integrity constraints for which its DDL provides an explicit syntax. For instance, plain SQL2 DBMSs know about constraints such as machine value domains, unique keys, foreign keys and mandatory columns only. If such constructs appear in a physical schema, they can be explicitly declared in the SQL2 script. On the other hand, all the other constraints must be either ignored or expressed in any other way, at best through check predicates or triggers, but more frequently through procedural sections scattered throughout the application programs. Distinguishing the DDL code from the external code, the operational code can be split into two distinct parts:

    code = code_ddl ∪ code_ext

    Despite this variety of translation means, the Coding process typically is a two-model transformation (in our framework, GER to DMS-DDL) that can be automated.
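    The split between declared and external code can be illustrated by a small Python sketch. Everything in it is an assumption made for the example: the function name coding, the way columns and constraints are described, the column types and the cardinality rule are hypothetical; only the idea that unique keys, foreign keys and mandatory columns can be declared, while other constraints end up outside the DDL, comes from the text.

```python
def coding(table, columns, constraints):
    """Return (code_ddl, code_ext): the DDL text, and the constraints that
    have no explicit DDL syntax here and must be handled elsewhere."""
    parts = [f"{name} {sqltype}{' NOT NULL' if mandatory else ''}"
             for name, sqltype, mandatory in columns]
    code_ext = []
    for c in constraints:
        cols = ", ".join(c.get("columns", []))
        if c["kind"] == "unique":
            parts.append(f"UNIQUE ({cols})")
        elif c["kind"] == "foreign_key":
            parts.append(f"FOREIGN KEY ({cols}) REFERENCES {c['target']}")
        else:
            # left to check predicates, triggers or application programs
            code_ext.append(c)
    code_ddl = f"CREATE TABLE {table} (\n  " + ",\n  ".join(parts) + "\n);"
    return code_ddl, code_ext

code_ddl, code_ext = coding(
    "EXPORT",
    [("Prod_ID", "CHAR(10)", True), ("Ctry_Name", "CHAR(30)", True),
     ("Cy_Name", "CHAR(30)", True), ("Volume", "INTEGER", False)],
    [{"kind": "unique", "columns": ["Ctry_Name", "Prod_ID", "Cy_Name"]},
     {"kind": "foreign_key", "columns": ["Prod_ID"], "target": "PRODUCT"},
     {"kind": "cardinality",
      "rule": "hypothetical: at most 50 EXPORT rows per COMPANY"}],
)
```

    The table reuses the EXPORT entity type of Figure 10. The cardinality rule has no counterpart in the generated script, so it is returned separately, which mirrors the external part of the operational code.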

    Transformation-Based Database Reverse Engineering

    Database reverse engineering is the process through which one attempts to recover or to rebuild the technical and functional documentation of a legacy database. Intensive research in the past decade has shown that reverse engineering generally is much more complex than initially thought. We can put forward two major sources of difficulties. First, empirical design has been, and still is, more popular than systematic design. Second, only the code_ddl part of the code provides a reliable description of the database physical constructs.

    Empirical design itself accounts for two understanding problems. First, it often relies on non-standard, unpublished, translation rules that may be difficult to interpret. Second, actual logical schemas often are strongly optimized, so that extracting a conceptual schema from the logical schema involves understanding not only how the latter has been translated in the target model, but also how, and according to which criteria, it has been optimized.

    The code_ddl component expresses a part of the physical schema only. Therefore, the code_ext part must be retrieved and interpreted, which leads to two independent problems. The first one requires parsing a huge volume of program code to identify code sections that cope with implicit, i.e., undeclared, constructs such as decomposed (flattened) fields or referential constraints. The second problem concerns the correct interpretation of these code fragments, which translates into constructs to be added to the physical schema.

    The whole process is described in Figure 16. It shows that database reverse engineering is decomposed into two main sub-processes, namely Extraction and Conceptualization. The objective of the Extraction process is to recover the complete logical schema of the legacy database. It includes three activities: Parsing the DDL code to extract the raw physical schema, schema Refinement through which implicit and hidden constructs are elicited from external code (as well as from other sources, such as the data themselves, but we will ignore them in this discussion) and Cleaning, in which the technical constructs of the physical schema are removed.

    The second main sub-process, Conceptualization, is intended to derive a plausible conceptual schema from the logical schema. It consists in identifying the trace of the translation of conceptual constructs, then in replacing them with their source. For instance, a foreign key is interpreted as (i.e., replaced by) a many-to-one relationship type.

    The transformational interpretation of the reverse engineering process is straightforward:

    CS = DBRE(code)

    where code denotes operational code and CS the conceptual schema.

    DBRE can be developed as follows:

    LS = Extraction(code)

    CS = Conceptualization(LS)

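    Figure 16. Main processes of database reverse engineering

    [Figure 16: Database reverse engineering comprises Extraction and Conceptualization. Within Extraction, Parsing (input: code_ddl) and Refinement (input: code_ext) yield the Physical schema, and Cleaning yields the Logical schema. Conceptualization derives the Conceptual schema from the Logical schema.]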


    Extraction itself includes three processes:

    PS = Parsing(code_ddl)

    PS = Refinement(PS, code_ext)

    LS = Cleaning(PS)

    By comparing the transformational expression of the database design and database reverse engineering processes, we can state the following equivalence rules, in which, as usual, Σ' denotes the inverse of transformation Σ:

    Parsing + Refinement = Coding'

    Cleaning = P-design'

    Conceptualization = L-design'

    Through this analysis, we have shown that, if database design and reverse engineering can be modeled by transformations, then database reverse engineering is, to a large extent, the inverse of database design. This induces important consequences. In particular:

    - database reverse engineering requires a deep understanding of empirical database design methodologies, and
    - the Conceptualization process can be analyzed and specified by identifying the strategies and the transformations that are most popular in empirical logical design, and by considering their inverse.

    Among the operators that have been described, the transformations Σ1-inverse, Σ2-inverse, Σ3-direct and Σ3-inverse, Σ4-direct and Σ4-inverse, Σ5-inverse, Σ6-direct and Σ7-direct form a sound (but unfortunately not complete) basis for conceptualizing logical schemas. This process can be supported by predicate-driven and model-driven transformations, but, even more than for forward engineering, reverse engineering heavily relies on human expertise. An in-depth description of a wide-scope reverse engineering methodology can be found in Hainaut (2002).
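    The inverse relationship between a design step and its reverse engineering counterpart can be made concrete with a minimal Python sketch for the foreign key example given earlier. The tuple and dictionary layouts, the function names and the generated rel-type name are assumptions made for the illustration; instances and cardinality details are ignored.

```python
def fk_to_rel_type(table, fk_columns, target):
    """Conceptualization step: interpret a foreign key as a
    many-to-one relationship type."""
    return {"name": f"{table.lower()}_{target.lower()}",  # hypothetical name
            "many": table, "one": target, "via": tuple(fk_columns)}

def rel_type_to_fk(rel_type):
    """Logical design step: turn the many-to-one rel-type back
    into a foreign key."""
    return rel_type["many"], list(rel_type["via"]), rel_type["one"]

foreign_key = ("EXPORT", ["Prod_ID"], "PRODUCT")
rel_type = fk_to_rel_type(*foreign_key)
round_trip = rel_type_to_fk(rel_type)
```

    Applying the design step to the result of the reverse step gives back the original foreign key, which is the behaviour expected from a pair of mutually inverse operators.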


    Conclusions and Perspectives

    In this chapter, we have shown that schema transformation can be used as a major paradigm in database engineering. In particular, being formally defined, it can be used to precisely model complex processes and to reason on their properties such as semantics preservation. It has also been used to derive new processes from former ones, as illustrated by the formalization of database reverse engineering as the inverse of database design.

    Due to their formality, transformations can be implemented in CASE tools, either as implicit operators, or as tools that are explicitly made available to the developer. Two implementations are worth being mentioned, namely Rosenthal (1994) and Hainaut (1996b). The latter reference describes the DB-MAIN CASE environment which includes a transformation toolbox as well as special engines for user-defined, predicate-driven and model-driven transformations. Further information can be found at http://www.info.fundp.ac.be/libd.

    Transformations also have a great potential in other domains, such as database interoperability, in which mediation between existing databases (McBrien, 2003) and data wrapping (Thiran, 2001) can be formalized and automated thanks to the use of transformational operators. In this domain, data instance transformations are modeled by the t part of the transformations. Specifying how the source schema is transformed into the target schema automatically provides a chain of instance transformations that are used to generate the data conversion code that is at the core of data migrators (ETL processors), wrappers and mediators.

    References

    Balzer, R. (1981). Transformational implementation: An example. IEEE TSE, SE-7(1).

    Batini, C., Ceri, S. & Navathe, S. B. (1992). Conceptual Database Design. Benjamin/Cummings.

    Blaha, M. & Premerlani, W. (1998). Object-oriented Modeling and Design for Database Applications. Prentice Hall.

    Bolois, G. & Robillard, P. (1994). Transformations in reengineering techniques. Proceedings of the Fourth Reengineering Forum Reengineering in Practice. Victoria, Canada.

    Casanova, M. & Amaral De Sa, A. (1984). Mapping uninterpreted schemes into entity-relationship diagrams: Two applications to conceptual schema design. IBM Journal of Research & Development, 28(1).


    D'Atri, A. & Sacca, D. (1984). Equivalence and mapping of database schemes. Proceedings of the 10th VLDB Conference. Singapore.

    De Troyer, O. (1993). On Data Schema Transformation. PhD thesis, University of Tilburg, Tilburg, The Netherlands.

    Estiévenart, F., François, A., Henrard, J. & Hainaut, J.L. (2003). Web site engineering. Proceedings of the Fifth International Workshop on Web Site Evolution, Amsterdam, September 2003. IEEE CS Press.

    Fagin, R. (1977). Multivalued dependencies and a new normal form for relational databases. ACM TODS, 2(3).

    Fickas, S. F. (1985). Automating the transformational development of software. IEEE TSE, SE-11.

    Hainaut, J.L. (1989). A generic entity-relationship model. Proceedings of the IFIP WG 8.1 Conference on Information System Concepts: An In-depth Analysis, North-Holland.

    Hainaut, J.L. (1991). Entity-generating schema transformations for entity-relationship models. Proceedings of the 10th Entity-Relationship Approach, San Mateo (CA), North-Holland.

    Hainaut, J.L. (1995). Transformation-based database engineering. Tutorial notes, VLDB'95, Zürich, Switzerland, September 1995. Retrieved from the WWW: http://www.info.fundp.ac.be/libd

    Hainaut, J.L. (1996). Specification preservation in schema transformations - Application to semantics and statistics. Data & Knowledge Engineering, 11(1).

    Hainaut, J.L. (2002). Introduction to Data