Fundamentals of Data Warehouses

Springer-Verlag Berlin Heidelberg GmbH

Matthias Jarke • Maurizio Lenzerini • Yannis Vassiliou • Panos Vassiliadis

Fundamentals of Data Warehouses

Second, Revised and Extended Edition

With 59 Figures

Springer

Matthias Jarke

Dept. of Computer Science V

RWTH Aachen, Ahornstraße 55, 52056 Aachen, Germany [email protected]

Maurizio Lenzerini

Dipartimento di Informatica e Sistemistica, Università di Roma "La Sapienza"

Via Salaria 113, 00198 Rome, Italy [email protected]

Yannis Vassiliou, Panos Vassiliadis

Dept. of Electrical and Computer Engineering, Computer Science Division

National Technical University of Athens, 15773 Zographou, Athens, Greece [email protected]

ACM Computing Classification (1998): H.2.7, H.2-4, D.2.9, I.2, K.6.3-4

Library of Congress Cataloging-in-Publication Data

Fundamentals of data warehouses / Matthias Jarke ... [et al.]. p. cm.

Includes bibliographical references and index. ISBN 978-3-642-07564-3 ISBN 978-3-662-05153-5 (eBook) DOI 10.1007/978-3-662-05153-5

1. Data warehousing. I. Jarke, Matthias.

QA76.9.D37 F86 2002 658.4'038'0285574--dc21

2002070677

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German copyright law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

http://www.springer.de

© Springer-Verlag Berlin Heidelberg 2000, 2003 Originally published by Springer-Verlag Berlin Heidelberg New York in 2003 Softcover reprint of the hardcover 2nd edition 2003

The use of general descriptive names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Cover design: KünkelLopka, Heidelberg
Typesetting: Camera-ready by authors
Printed on acid-free paper SPIN: 10837815 45/3142 ud - 54321 0

Preface to the Second Edition

Data warehousing is still an active and rapidly evolving field, even though many of the foundations are stabilizing. The first edition of this book sold out within little more than a year. Moreover, we found that a number of updates had to be made, in particular to the state of the practice, because some of the tools described in the first edition are no longer on the market and a trend towards more integrated solutions can be observed. We are grateful to several researchers and developers from data warehouse vendor companies who pointed out such issues. In addition, the second edition contains more information about new developments in metadata management and, in a new Chap. 8, a comprehensive description and illustration of the quality-oriented data warehouse design and operation methodology developed in the final stages of the European DWQ project. Thanks to Ingeborg Mayer of Springer-Verlag for her persistence and support in the revision for this edition. Many thanks to Ulrike Drechsler and Christian Seeling for their technical support of this revision.

Aachen, Athens, Rome, September 2002

Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, Panos Vassiliadis

Preface to the First Edition

This book is an introduction and sourcebook for practitioners, graduate students, and researchers interested in the state of the art and the state of the practice in data warehousing. It resulted from our observation that, while there are a few hands-on practitioner books on data warehousing, the research literature tends to be fragmented and poorly linked to the commercial state of practice. As a result of the synergistic view taken in the book, the last chapter presents a new approach for data warehouse quality assessment and quality-driven design which reduces some of the recognized shortcomings. Readers will find it useful to be familiar with the basics of the relational model of databases in order to follow this book.

The book is made up of seven chapters. Chapter 1 sets the stage by giving a broad overview of some important terminology and vendor strategies. Chapter 2 summarizes the research efforts in data warehousing and gives a short description of the framework for data warehouses used in this book.

The next two chapters address the main data integration issues encountered in data warehousing. Chapter 3 presents a survey of the main techniques used when linking information sources to a data warehouse, emphasizing the need for semantic modeling of the relationships. Chapter 4 investigates the propagation of updates from operational sources through the data warehouse to the client analyst, looking both at incremental update computations and at the many facets of refreshment policies.

The next two chapters study the client side of a data warehouse. Chapter 5 shows how to reorganize relational data into the multidimensional data models used for online analytic processing applications, focusing on the conceptualization of, and reasoning about, multiple, hierarchically organized dimensions. Chapter 6 takes a look at query processing and its optimization, taking into account the reuse of materialized views and the multidimensional storage of data.

In the literature, there is not much coherence between all these technical issues on the one side and the business reasoning and design strategies underlying data warehousing projects on the other. Chapter 7 ties these aspects together. It presents an extended architecture for data warehousing and links it to explicit models of data warehouse quality. It is shown how this extended approach can be used to document the quality of a data warehouse project and to design a data warehouse solution for specific quality criteria.

The book resulted from the ESPRIT Long Term Research Project DWQ (Foundations of Data Warehouse Quality) which was supported by the Commission of the European Union from 1996 to 1999. DWQ's goal was to develop a semantic foundation that will allow the designers of data warehouses to link the choice of deeper models, richer data structures, and rigorous implementation techniques to quality-of-service factors in a systematic manner, thus improving the design, the operation, and most importantly the long-term evolution of data warehouse applications.

Many researchers from all DWQ partner institutions - the National Technical University of Athens (Greece), RWTH Aachen University of Technology (Germany), DFKI German Research Center for Artificial Intelligence, the INRIA National Research Center (France), IRST Research Center in Bolzano (Italy), and the University of Rome "La Sapienza" - have contributed to the underlying survey work. Their contributions are listed in the following overview. Great thanks go to our industrial collaborators who provided product information and case studies, including but not limited to Software AG, the City of Cologne, Team4 Systemhaus, Swiss Life, Telecom Italia, and Oracle Greece. Valuable comments from our EU project officer, David Cornwell, as well as from the project reviewers Stefano Ceri, Laurent Vieille, and Jari Veijalainen have sharpened the presentation of this material. Last but not least, we thank Dr. Hans Wössner and his team at Springer-Verlag for a smooth production process. Christoph Quix was instrumental in supporting many of the technical editing tasks for this book; special thanks go to him.

Aachen, Athens, Rome, June 1999

Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, Panos Vassiliadis

Overview and Contributors

1. Data Warehouse Practice: An Overview
Matthias Jarke, Christoph Quix

2. Data Warehouse Research: Issues and Projects
Yannis Vassiliou, Mokrane Bouzeghoub, Matthias Jarke, Manfred A. Jeusfeld, Maurizio Lenzerini, Spyros Ligoudistianos, Aris Tsois, Panos Vassiliadis

3. Source Integration
Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Daniele Nardi, Ricardo Rosati

4. Data Warehouse Refreshment
Mokrane Bouzeghoub, Françoise Fabret, Helena Galhardas, Maja Matulovic-Broque, João Pereira, Eric Simon

5. Multidimensional Data Models and Aggregation
Enrico Franconi, Franz Baader, Ulrike Sattler, Panos Vassiliadis

6. Query Processing and Optimization
Werner Nutt

7. Metadata and Data Warehouse Quality
Matthias Jarke, Manfred A. Jeusfeld, Christoph Quix, Timos Sellis, Panos Vassiliadis

8. Quality-Driven Data Warehouse Design
Matthias Jarke, Maurizio Lenzerini, Christoph Quix, Timos Sellis, Panos Vassiliadis

Bibliography

Appendix A. ISO Standards for Information Quality

Appendix B. Glossary

Index

Addresses of Contributors

Matthias Jarke, Christoph Quix
Informatik V (Information Systems), RWTH Aachen, Ahornstr. 55, 52056 Aachen, Germany
Email: [email protected]

Franz Baader, Ulrike Sattler
Institut für Theoretische Informatik, TU Dresden, 01062 Dresden, Germany
Email: [email protected]

Yannis Vassiliou, Spyros Ligoudistianos, Timos Sellis, Aris Tsois, Panos Vassiliadis
Department of Electrical and Computer Engineering, Computer Science Division, National Technical University of Athens, Zographou 157 73, Athens, Greece
Email: [email protected]

Enrico Franconi
University of Manchester, Department of Computer Science, Oxford Rd., Manchester M13 9PL, United Kingdom
Email: [email protected]

Eric Simon, Mokrane Bouzeghoub, Françoise Fabret, Helena Galhardas, Maja Matulovic-Broque, João Pereira
Project Rodin, INRIA Rocquencourt, Domaine de Voluceau BP.105, 78153 Le Chesnay Cedex, France
Email: [email protected]

Maurizio Lenzerini, Diego Calvanese, Giuseppe De Giacomo, Daniele Nardi, Ricardo Rosati
Dipartimento di Informatica e Sistemistica, Università di Roma "La Sapienza", Via Salaria 113, 00198 Roma, Italy
Email: [email protected]

Werner Nutt
Department of Computing and Electrical Engineering, Heriot-Watt University, Riccarton, Edinburgh EH14 4AS, United Kingdom
Email: [email protected]

Manfred A. Jeusfeld
KUB Tilburg, INFOLAB, Warandelaan 2, Postbus 90153, 5000 LE Tilburg, The Netherlands
Email: [email protected]

Contents

1 Data Warehouse Practice: An Overview
1.1 Data Warehouse Components
1.2 Designing the Data Warehouse
1.3 Getting Heterogeneous Data into the Warehouse
1.4 Getting Multidimensional Data out of the Warehouse
1.5 Physical Structure of Data Warehouses
1.6 Metadata Management
1.7 Data Warehouse Project Management

2 Data Warehouse Research: Issues and Projects
2.1 Data Extraction and Reconciliation
2.2 Data Aggregation and Customization
2.3 Query Optimization
2.4 Update Propagation
2.5 Modeling and Measuring Data Warehouse Quality
2.6 Some Major Research Projects in Data Warehousing
2.7 Three Perspectives of Data Warehouse Metadata

3 Source Integration
3.1 The Practice of Source Integration
  3.1.1 Tools for Data Warehouse Management
  3.1.2 Tools for Data Integration
3.2 Research in Source Integration
  3.2.1 Schema Integration
  3.2.2 Data Integration - Virtual
  3.2.3 Data Integration - Materialized
3.3 Towards Systematic Methodologies for Source Integration
  3.3.1 Architecture for Source Integration
  3.3.2 Methodology for Source Integration
3.4 Concluding Remarks

4 Data Warehouse Refreshment
4.1 What is Data Warehouse Refreshment?
  4.1.1 Refreshment Process within the Data Warehouse Lifecycle
  4.1.2 Requirements and Difficulties of Data Warehouse Refreshment
  4.1.3 Data Warehouse Refreshment: Problem Statement
4.2 Incremental Data Extraction
  4.2.1 Wrapper Functionality
  4.2.2 Change Monitoring
4.3 Data Cleaning
  4.3.1 Conversion and Normalization Functions
  4.3.2 Special-Purpose Cleaning
  4.3.3 Domain-Independent Cleaning
  4.3.4 Rule-Based Cleaning
  4.3.5 Concluding Remarks on Data Cleaning
4.4 Update Propagation into Materialized Views
  4.4.1 Notations and Definitions
  4.4.2 View Maintenance: General Results
  4.4.3 View Maintenance in Data Warehouses - Specific Results
4.5 Toward a Quality-Oriented Refreshment Process
  4.5.1 Quality Analysis for Refreshment
  4.5.2 Implementing the Refreshment Process
  4.5.3 Workflow Modeling with Rules
4.6 Implementation of the Approach

5 Multidimensional Data Models and Aggregation
5.1 Multidimensional View of Information
5.2 ROLAP Data Model
5.3 MOLAP Data Model
5.4 Logical Models for Multidimensional Information
5.5 Conceptual Models for Multidimensional Information
  5.5.1 Inference Problems for Multidimensional Conceptual Modeling
  5.5.2 Which Formal Framework to Choose?
5.6 Conclusion

6 Query Processing and Optimization
6.1 Description and Requirements for Data Warehouse Queries
  6.1.1 Queries at the Back End
  6.1.2 Queries at the Front End
  6.1.3 Queries in the Core
  6.1.4 Transactional Versus Data Warehouse Queries
  6.1.5 Canned Queries Versus Ad-hoc Queries
  6.1.6 Multidimensional Queries
  6.1.7 Extensions of SQL
6.2 Query Processing Techniques
  6.2.1 Data Access
  6.2.2 Evaluation Strategies
  6.2.3 Exploitation of Redundancy
6.3 Conclusions and Research Directions

7 Metadata and Data Warehouse Quality
7.1 Metadata Management in Data Warehouse Practice
  7.1.1 Metadata Interchange Specification (MDIS)
  7.1.2 The Telos Language
  7.1.3 Microsoft Repository
  7.1.4 OIM and CWM
7.2 A Repository Model for the DWQ Framework
  7.2.1 Conceptual Perspective
  7.2.2 Logical Perspective
  7.2.3 Physical Perspective
  7.2.4 Applying the Architecture Model
7.3 Defining Data Warehouse Quality
  7.3.1 Data Quality
  7.3.2 Stakeholders and Goals in Data Warehouse Quality
  7.3.3 State of Practice in Data Warehouse Quality
7.4 Representing and Analyzing Data Warehouse Quality
  7.4.1 Quality Function Deployment
  7.4.2 The Need for Richer Quality Models: An Example
  7.4.3 The Goal-Question-Metric Approach
  7.4.4 Repository Support for the GQM Approach
7.5 A Detailed Example: Quality Analysis in Data Staging
  7.5.1 Evaluation of the Quality of a DSA Schema
  7.5.2 Analyzing the Quality of a View

8 Quality-Driven Data Warehouse Design
8.1 Interactions between Quality Factors and DW Tasks
8.2 The DWQ Data Warehouse Design Methodology
  8.2.1 Source Integration
  8.2.2 Multidimensional Aggregation and OLAP Query Generation
  8.2.3 Design Optimization and Data Reconciliation
  8.2.4 Operational Support
8.3 Optimizing the Materialization of DW Views
8.4 Summary and Outlook

Bibliography

Appendix A. ISO Standards for Information Quality

Appendix B. Glossary

Index