High-Performance Parallel Database Processing and Grid Databases

David Taniar, Monash University, Australia
Clement H.C. Leung, Hong Kong Baptist University and Victoria University, Australia
Wenny Rahayu, La Trobe University, Australia
Sushant Goel, RMIT University, Australia

A John Wiley & Sons, Inc., Publication



  • Copyright 2008 by John Wiley & Sons, Inc. All rights reserved.

    Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

    No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400, fax 978-646-8600, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008.

    Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

    For general information on our other products and services please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or fax 317-572-4002.

    Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, however, may not be available in electronic formats.

    Library of Congress Cataloging-in-Publication Data:

    Taniar, David.
    High-performance parallel database processing and grid databases / by David Taniar, Clement Leung, Wenny Rahayu.
    p. cm.
    Includes bibliographical references.
    ISBN 978-0-470-10762-1 (cloth : alk. paper)
    1. High performance computing. 2. Parallel processing (Electronic computers) 3. Computational grids (Computer systems) I. Leung, Clement H. C. II. Rahayu, Johanna Wenny. III. Title.
    QA76.88.T36 2008
    004'.35—dc22

    2008011010

    Printed in the United States of America

    10 9 8 7 6 5 4 3 2 1


  • Contents

    Preface xv

    Part I Introduction

    1. Introduction 3

    1.1. A Brief Overview: Parallel Databases and Grid Databases 4
    1.2. Parallel Query Processing: Motivations 5
    1.3. Parallel Query Processing: Objectives 7

    1.3.1. Speed Up 7
    1.3.2. Scale Up 8
    1.3.3. Parallel Obstacles 10

    1.4. Forms of Parallelism 12

    1.4.1. Interquery Parallelism 13
    1.4.2. Intraquery Parallelism 14
    1.4.3. Intraoperation Parallelism 15
    1.4.4. Interoperation Parallelism 15
    1.4.5. Mixed Parallelism—A More Practical Solution 18

    1.5. Parallel Database Architectures 19

    1.5.1. Shared-Memory and Shared-Disk Architectures 20
    1.5.2. Shared-Nothing Architecture 22
    1.5.3. Shared-Something Architecture 23
    1.5.4. Interconnection Networks 24

    1.6. Grid Database Architecture 26
    1.7. Structure of this Book 29
    1.8. Summary 30
    1.9. Bibliographical Notes 30
    1.10. Exercises 31


    2. Analytical Models 33

    2.1. Cost Models 33
    2.2. Cost Notations 34

    2.2.1. Data Parameters 34
    2.2.2. Systems Parameters 36
    2.2.3. Query Parameters 37
    2.2.4. Time Unit Costs 37
    2.2.5. Communication Costs 38

    2.3. Skew Model 39
    2.4. Basic Operations in Parallel Databases 43

    2.4.1. Disk Operations 44
    2.4.2. Main Memory Operations 45
    2.4.3. Data Computation and Data Distribution 45

    2.5. Summary 47
    2.6. Bibliographical Notes 47
    2.7. Exercises 47

    Part II Basic Query Parallelism

    3. Parallel Search 51

    3.1. Search Queries 51

    3.1.1. Exact-Match Search 52
    3.1.2. Range Search Query 53
    3.1.3. Multiattribute Search Query 54

    3.2. Data Partitioning 54

    3.2.1. Basic Data Partitioning 55
    3.2.2. Complex Data Partitioning 60

    3.3. Search Algorithms 69

    3.3.1. Serial Search Algorithms 69
    3.3.2. Parallel Search Algorithms 73

    3.4. Summary 74
    3.5. Bibliographical Notes 75
    3.6. Exercises 75

    4. Parallel Sort and GroupBy 77

    4.1. Sorting, Duplicate Removal, and Aggregate Queries 78

    4.1.1. Sorting and Duplicate Removal 78
    4.1.2. Scalar Aggregate 79
    4.1.3. GroupBy 80

    4.2. Serial External Sorting Method 80


    4.3. Algorithms for Parallel External Sort 83

    4.3.1. Parallel Merge-All Sort 83
    4.3.2. Parallel Binary-Merge Sort 85
    4.3.3. Parallel Redistribution Binary-Merge Sort 86
    4.3.4. Parallel Redistribution Merge-All Sort 88
    4.3.5. Parallel Partitioned Sort 90

    4.4. Parallel Algorithms for GroupBy Queries 92

    4.4.1. Traditional Methods (Merge-All and Hierarchical Merging) 92
    4.4.2. Two-Phase Method 93
    4.4.3. Redistribution Method 94

    4.5. Cost Models for Parallel Sort 96

    4.5.1. Cost Models for Serial External Merge-Sort 96
    4.5.2. Cost Models for Parallel Merge-All Sort 98
    4.5.3. Cost Models for Parallel Binary-Merge Sort 100
    4.5.4. Cost Models for Parallel Redistribution Binary-Merge Sort 101
    4.5.5. Cost Models for Parallel Redistribution Merge-All Sort 102
    4.5.6. Cost Models for Parallel Partitioned Sort 103

    4.6. Cost Models for Parallel GroupBy 104

    4.6.1. Cost Models for Parallel Two-Phase Method 104
    4.6.2. Cost Models for Parallel Redistribution Method 107

    4.7. Summary 109
    4.8. Bibliographical Notes 110
    4.9. Exercises 110

    5. Parallel Join 112

    5.1. Join Operations 112
    5.2. Serial Join Algorithms 114

    5.2.1. Nested-Loop Join Algorithm 114
    5.2.2. Sort-Merge Join Algorithm 116
    5.2.3. Hash-Based Join Algorithm 117
    5.2.4. Comparison 120

    5.3. Parallel Join Algorithms 120

    5.3.1. Divide and Broadcast-Based Parallel Join Algorithms 121
    5.3.2. Disjoint Partitioning-Based Parallel Join Algorithms 124

    5.4. Cost Models 128

    5.4.1. Cost Models for Divide and Broadcast 128
    5.4.2. Cost Models for Disjoint Partitioning 129
    5.4.3. Cost Models for Local Join 130


    5.5. Parallel Join Optimization 132

    5.5.1. Optimizing Main Memory 132
    5.5.2. Load Balancing 133

    5.6. Summary 134
    5.7. Bibliographical Notes 135
    5.8. Exercises 136

    Part III Advanced Parallel Query Processing

    6. Parallel GroupBy-Join 141

    6.1. Groupby-Join Queries 141

    6.1.1. Groupby Before Join 142
    6.1.2. Groupby After Join 142

    6.2. Parallel Algorithms for Groupby-Before-Join Query Processing 143

    6.2.1. Early Distribution Scheme 143
    6.2.2. Early GroupBy with Partitioning Scheme 145
    6.2.3. Early GroupBy with Replication Scheme 146

    6.3. Parallel Algorithms for Groupby-After-Join Query Processing 148

    6.3.1. Join Partitioning Scheme 148
    6.3.2. GroupBy Partitioning Scheme 150

    6.4. Cost Model Notations 151
    6.5. Cost Model for Groupby-Before-Join Query Processing 153

    6.5.1. Cost Models for the Early Distribution Scheme 153
    6.5.2. Cost Models for the Early GroupBy with Partitioning Scheme 156
    6.5.3. Cost Models for the Early GroupBy with Replication Scheme 158

    6.6. Cost Model for “Groupby-After-Join” Query Processing 159

    6.6.1. Cost Models for the Join Partitioning Scheme 159
    6.6.2. Cost Models for the GroupBy Partitioning Scheme 161

    6.7. Summary 163
    6.8. Bibliographical Notes 164
    6.9. Exercises 164


    7. Parallel Indexing 167

    7.1. Parallel Indexing–An Internal Perspective on Parallel Indexing Structures 168

    7.2. Parallel Indexing Structures 169

    7.2.1. Nonreplicated Indexing (NRI) Structures 169
    7.2.2. Partially Replicated Indexing (PRI) Structures 171
    7.2.3. Fully Replicated Indexing (FRI) Structures 178

    7.3. Index Maintenance 180

    7.3.1. Maintaining a Parallel Nonreplicated Index 182
    7.3.2. Maintaining a Parallel Partially Replicated Index 182
    7.3.3. Maintaining a Parallel Fully Replicated Index 188
    7.3.4. Complexity Degree of Index Maintenance 188

    7.4. Index Storage Analysis 188

    7.4.1. Storage Cost Models for Uniprocessors 189
    7.4.2. Storage Cost Models for Parallel Processors 191

    7.5. Parallel Processing of Search Queries using Index 192

    7.5.1. Parallel One-Index Search Query Processing 192
    7.5.2. Parallel Multi-Index Search Query Processing 195

    7.6. Parallel Index Join Algorithms 200

    7.6.1. Parallel One-Index Join 200
    7.6.2. Parallel Two-Index Join 203

    7.7. Comparative Analysis 207

    7.7.1. Comparative Analysis of Parallel Search Index 207
    7.7.2. Comparative Analysis of Parallel Index Join 213

    7.8. Summary 216
    7.9. Bibliographical Notes 217
    7.10. Exercises 217

    8. Parallel Universal Qualification—Collection Join Queries 219

    8.1. Universal Quantification and Collection Join 220
    8.2. Collection Types and Collection Join Queries 222

    8.2.1. Collection-Equi Join Queries 222
    8.2.2. Collection-Intersect Join Queries 223
    8.2.3. Subcollection Join Queries 224

    8.3. Parallel Algorithms for Collection Join Queries 225
    8.4. Parallel Collection-Equi Join Algorithms 225

    8.4.1. Disjoint Data Partitioning 226


    8.4.2. Parallel Double Sort-Merge Collection-Equi Join Algorithm 227
    8.4.3. Parallel Sort-Hash Collection-Equi Join Algorithm 228
    8.4.4. Parallel Hash Collection-Equi Join Algorithm 232

    8.5. Parallel Collection-Intersect Join Algorithms 233

    8.5.1. Non-Disjoint Data Partitioning 234
    8.5.2. Parallel Sort-Merge Nested-Loop Collection-Intersect Join Algorithm 244
    8.5.3. Parallel Sort-Hash Collection-Intersect Join Algorithm 245
    8.5.4. Parallel Hash Collection-Intersect Join Algorithm 246

    8.6. Parallel Subcollection Join Algorithms 246

    8.6.1. Data Partitioning 247
    8.6.2. Parallel Sort-Merge Nested-Loop Subcollection Join Algorithm 248
    8.6.3. Parallel Sort-Hash Subcollection Join Algorithm 249
    8.6.4. Parallel Hash Subcollection Join Algorithm 251

    8.7. Summary 252
    8.8. Bibliographical Notes 252
    8.9. Exercises 254

    9. Parallel Query Scheduling and Optimization 256

    9.1. Query Execution Plan 257
    9.2. Subqueries Execution Scheduling Strategies 259

    9.2.1. Serial Execution Among Subqueries 259
    9.2.2. Parallel Execution Among Subqueries 261

    9.3. Serial vs. Parallel Execution Scheduling 264

    9.3.1. Nonskewed Subqueries 264
    9.3.2. Skewed Subqueries 265
    9.3.3. Skewed and Nonskewed Subqueries 267

    9.4. Scheduling Rules 269
    9.5. Cluster Query Processing Model 270

    9.5.1. Overview of Dynamic Query Processing 271
    9.5.2. A Cluster Query Processing Architecture 272
    9.5.3. Load Information Exchange 273

    9.6. Dynamic Cluster Query Optimization 275

    9.6.1. Correction 276
    9.6.2. Migration 280
    9.6.3. Partition 281

    9.7. Other Approaches to Dynamic Query Optimization 284
    9.8. Summary 285


    9.9. Bibliographical Notes 286
    9.10. Exercises 286

    Part IV Grid Databases

    10. Transactions in Distributed and Grid Databases 291

    10.1. Grid Database Challenges 292
    10.2. Distributed Database Systems and Multidatabase Systems 293

    10.2.1. Distributed Database Systems 293
    10.2.2. Multidatabase Systems 297

    10.3. Basic Definitions on Transaction Management 299
    10.4. ACID Properties of Transactions 301
    10.5. Transaction Management in Various Database Systems 303

    10.5.1. Transaction Management in Centralized and Homogeneous Distributed Database Systems 303

    10.5.2. Transaction Management in Heterogeneous Distributed Database Systems 305

    10.6. Requirements in Grid Database Systems 307
    10.7. Concurrency Control Protocols 309
    10.8. Atomic Commit Protocols 310

    10.8.1. Homogeneous Distributed Database Systems 310
    10.8.2. Heterogeneous Distributed Database Systems 313

    10.9. Replica Synchronization Protocols 314

    10.9.1. Network Partitioning 315
    10.9.2. Replica Synchronization Protocols 316

    10.10. Summary 318
    10.11. Bibliographical Notes 318
    10.12. Exercises 319

    11. Grid Concurrency Control 321

    11.1. A Grid Database Environment 321
    11.2. An Example 322
    11.3. Grid Concurrency Control 324

    11.3.1. Basic Functions Required by GCC 324
    11.3.2. Grid Serializability Theorem 325
    11.3.3. Grid Concurrency Control Protocol 329
    11.3.4. Revisiting the Earlier Example 333
    11.3.5. Comparison with Traditional Concurrency Control Protocols 334


    11.4. Correctness of GCC Protocol 336
    11.5. Features of GCC Protocol 338
    11.6. Summary 339
    11.7. Bibliographical Notes 339
    11.8. Exercises 339

    12. Grid Transaction Atomicity and Durability 341

    12.1. Motivation 342
    12.2. Grid Atomic Commit Protocol (Grid-ACP) 343

    12.2.1. State Diagram of Grid-ACP 343
    12.2.2. Grid-ACP Algorithm 344
    12.2.3. Early-Abort Grid-ACP 346
    12.2.4. Discussion 348
    12.2.5. Message and Time Complexity Comparison Analysis 349
    12.2.6. Correctness of Grid-ACP 350

    12.3. Handling Failure of Sites with Grid-ACP 351

    12.3.1. Model for Storing Log Files at the Originator and Participating Sites 351

    12.3.2. Logs Required at the Originator Site 352
    12.3.3. Logs Required at the Participant Site 353
    12.3.4. Failure Recovery Algorithm for Grid-ACP 353
    12.3.5. Comparison of Recovery Protocols 359
    12.3.6. Correctness of Recovery Algorithm 361

    12.4. Summary 365
    12.5. Bibliographical Notes 366
    12.6. Exercises 366

    13. Replica Management in Grids 367

    13.1. Motivation 367
    13.2. Replica Architecture 368

    13.2.1. High-Level Replica Management Architecture 368
    13.2.2. Some Problems 369

    13.3. Grid Replica Access Protocol (GRAP) 371

    13.3.1. Read Transaction Operation for GRAP 371
    13.3.2. Write Transaction Operation for GRAP 372
    13.3.3. Revisiting the Example Problem 375
    13.3.4. Correctness of GRAP 377

    13.4. Handling Multiple Partitioning 378

    13.4.1. Contingency GRAP 378
    13.4.2. Comparison of Replica Management Protocols 381
    13.4.3. Correctness of Contingency GRAP 383


    13.5. Summary 384
    13.6. Bibliographical Notes 385
    13.7. Exercises 385

    14. Grid Atomic Commitment in Replicated Data 387

    14.1. Motivation 388

    14.1.1. Architectural Reasons 388
    14.1.2. Motivating Example 388

    14.2. Modified Grid Atomic Commitment Protocol 390

    14.2.1. Modified Grid-ACP 390
    14.2.2. Correctness of Modified Grid-ACP 393

    14.3. Transaction Properties in Replicated Environment 395
    14.4. Summary 397
    14.5. Bibliographical Notes 397
    14.6. Exercises 398

    Part V Other Data-Intensive Applications

    15. Parallel Online Analytic Processing (OLAP) and Business Intelligence 401

    15.1. Parallel Multidimensional Analysis 402
    15.2. Parallelization of ROLLUP Queries 405

    15.2.1. Analysis of Basic Single ROLLUP Queries 405
    15.2.2. Analysis of Multiple ROLLUP Queries 409
    15.2.3. Analysis of Partial ROLLUP Queries 411
    15.2.4. Parallelization Without Using ROLLUP 412

    15.3. Parallelization of CUBE Queries 412

    15.3.1. Analysis of Basic CUBE Queries 413
    15.3.2. Analysis of Partial CUBE Queries 416
    15.3.3. Parallelization Without Using CUBE 417

    15.4. Parallelization of Top-N and Ranking Queries 418
    15.5. Parallelization of Cume Dist Queries 419
    15.6. Parallelization of NTILE and Histogram Queries 420
    15.7. Parallelization of Moving Average and Windowing Queries 422
    15.8. Summary 424
    15.9. Bibliographical Notes 424
    15.10. Exercises 425


    16. Parallel Data Mining—Association Rules and Sequential Patterns 427

    16.1. From Databases To Data Warehousing To Data Mining: A Journey 428

    16.2. Data Mining: A Brief Overview 431

    16.2.1. Data Mining Tasks 431
    16.2.2. Querying vs. Mining 433
    16.2.3. Parallelism in Data Mining 436

    16.3. Parallel Association Rules 440

    16.3.1. Association Rules: Concepts 441
    16.3.2. Association Rules: Processes 444
    16.3.3. Association Rules: Parallel Processing 448

    16.4. Parallel Sequential Patterns 450

    16.4.1. Sequential Patterns: Concepts 452
    16.4.2. Sequential Patterns: Processes 456
    16.4.3. Sequential Patterns: Parallel Processing 459

    16.5. Summary 461
    16.6. Bibliographical Notes 461
    16.7. Exercises 462

    17. Parallel Clustering and Classification 464

    17.1. Clustering and Classification 464

    17.1.1. Clustering 464
    17.1.2. Classification 465

    17.2. Parallel Clustering 467

    17.2.1. Clustering: Concepts 467
    17.2.2. k-Means Algorithm 468
    17.2.3. Parallel k-Means Clustering 471

    17.3. Parallel Classification 477

    17.3.1. Decision Tree Classification: Structures 477
    17.3.2. Decision Tree Classification: Processes 480
    17.3.3. Decision Tree Classification: Parallel Processing 488

    17.4. Summary 495
    17.5. Bibliographical Notes 498
    17.6. Exercises 498

    Permissions 501

    List of Conferences and Journals 507

    Bibliography 511

    Index 541

  • Preface

    The sizes of databases have seen exponential growth in the past, and such growth is expected to accelerate in the future, with the steady drop in storage cost accompanied by a rapid increase in storage capacity. Many years ago, a terabyte database was considered to be large, but nowadays such databases are sometimes regarded as small, and the daily volumes of data being added to some databases are measured in terabytes. In the future, petabyte and exabyte databases will be common.

    With such volumes of data, it is evident that the sequential processing paradigm will be unable to cope; for example, even assuming a data rate of 1 terabyte per second, reading through a petabyte database will take over 10 days. To effectively manage such volumes of data, it is necessary to allocate multiple resources to it, very often massively so. The processing of databases of such astronomical proportions requires an understanding of how high-performance systems and parallelism work. Besides the massive volume of data in the database to be processed, some data has been distributed across the globe in a Grid environment. These massive data centers are also a part of the emergence of Cloud computing, where data access has shifted from local machines to powerful servers hosting web applications and services, making data access across the Internet using standard web browsers pervasive. This adds another dimension to such systems.

    Parallelism in databases has been around since the early 1980s, when many researchers in this area aspired to build large special-purpose database machines—databases employing dedicated specialized parallel hardware. Some projects were born, including Bubba, Gamma, etc. These came and went. However, commercial DBMS vendors quickly realized the importance of supporting high performance for large databases, and many of them have incorporated parallelism and grid features into their products. Their commitment to high-performance systems and parallelism, as well as grid configurations, shows the importance and inevitability of parallelism.

    In addition, while traditional transactional data is still common, we see an increasing growth of new application domains, broadly categorized as data-intensive applications. These include data warehousing and online analytic processing (OLAP) applications, data mining, genome databases, and multiple media databases manipulating unstructured and semistructured data. Therefore, it is critical to understand the underlying principle of data parallelism before specialized and new application domains can be properly addressed.


    This book is written to provide a fundamental understanding of parallelism in data-intensive applications. It features not only the algorithms for database operations but also quantitative analytical models, so that performance can be analyzed and evaluated more effectively.

    The present book brings into a single volume the latest techniques and principles of parallel and grid database processing. It provides a much-needed, self-contained advanced text for database courses at the postgraduate or final-year undergraduate levels. In addition, for researchers with a particular interest in parallel databases and related areas, it will serve as an indispensable and up-to-date reference. Practitioners contemplating building high-performance databases or seeking to gain a good understanding of parallel database technology will also find this book valuable for the wealth of techniques and models it contains.

    STRUCTURE OF THE BOOK

    This book is divided into five parts. Part I gives an introduction to the topic, including the rationale behind the need for high-performance database processing, as well as basic analytical models that will be used throughout the book.

    Part II, consisting of three chapters, describes parallelism for basic query operations. These include parallel searching, parallel aggregate and sorting, and parallel join. These are the foundation of query processing, whereby complex queries can be decomposed into any of these atomic operations.

    Part III, consisting of the next four chapters, focuses on more advanced query operations. This part covers groupby-join operations, parallel indexing, parallel object-oriented query processing (in particular, collection join), and query scheduling and optimization.

    Just as the previous two parts deal with parallelism of read-only queries, the next part, Part IV, concentrates on transactions, also known as write queries. We use the grid environment to study transaction management. In grid transaction management, the focus is mainly on grid concurrency control, atomic commitment, durability, as well as replication.

    Finally, Part V introduces other data-intensive applications, including data warehousing, OLAP, business intelligence, and parallel data mining.

    ACKNOWLEDGMENTS

    The authors would like to thank the publisher, John Wiley & Sons, for agreeing to embark on this exciting journey. In particular, we would like to thank Paul Petralia, Senior Editor, for supporting this project. We would also like to thank Whitney Lesch and Anastasia Wasko, Assistants to the Editor, for their endless efforts to ensure that we remained on track from start to completion. Without their encouragement and reminders, we would not have been able to finish this book.


    We also thank Bruna Pomella, who proofread the entire manuscript, for commenting on ambiguous sentences and correcting grammatical mistakes.

    Finally, we would like to express our sincere thanks to our respective universities, Monash University, Victoria University, Hong Kong Baptist University, La Trobe University, and RMIT, where the research presented in this book was conducted. We are grateful for the facilities and time that we received during the writing of this book. Without these, the book would not have been written in the first place.

    David Taniar
    Clement H.C. Leung
    Wenny Rahayu
    Sushant Goel

  • Part I

    Introduction

  • Chapter 1

    Introduction

    Parallel databases are database systems that are implemented on parallel computing platforms. Therefore, high-performance query processing focuses on query processing, including database queries and transactions, that makes use of parallelism techniques applied to an underlying parallel computing platform in order to achieve high performance.

    In a Grid environment, applications need to create, access, manage, and distribute data on a very large scale and across multiple organizations. The main challenges arise due to the volume of data, distribution of data, autonomy of sites, and heterogeneity of data resources. Hence, Grid databases can be defined loosely as data access in a Grid environment.

    This chapter gives an introduction to parallel databases, parallel query processing, and Grid databases. Section 1.1 gives a brief overview. In Section 1.2, the motivations for using parallelism in database processing are explained. Understanding the motivations is a critical starting point in exploring parallel database processing in depth. This will answer the question of why parallelism is necessary in modern database processing.

    Once we understand the motivations, we need to know the objectives or the goals of parallel database processing. These are explained in Section 1.3. The objectives will become the main aim of any parallel algorithms in parallel database systems, and this will answer the question of what it is that parallelism aims to achieve in parallel database processing.

    Once we understand the objectives, we also need to know the various forms of parallelism that are available for parallel database processing. These are described in Section 1.4. The forms of parallelism are the techniques used to achieve the objectives described in the previous section. Therefore, this section answers the question of how parallelism can be performed in parallel database processing.


    Without an understanding of the kinds of parallel technology and parallel machines that are available for parallel database processing, our introductory discussion on parallel databases will not be complete. Therefore, in Section 1.5, we introduce various parallel architectures available for database processing.

    Section 1.6 introduces Grid databases. This includes the basic Grid architecture for data-intensive applications, and its current technological status is also outlined.

    Section 1.7 outlines the components of this book, including parallel query processing and Grid transaction management.

    1.1 A BRIEF OVERVIEW: PARALLEL DATABASES AND GRID DATABASES

    In 1965, Intel cofounder Gordon Moore predicted that the number of transistors on a chip would double every 24 months, a prediction that became known popularly as Moore's law. With further technological development, some researchers claimed the number would double every 18 months instead of 24 months. Thus it is expected that the CPU's performance would increase roughly by 50–60% per year. On the other hand, mechanical delays restrict the advancement of disk access time and disk throughput, which improves by only 8–10% per year. There has been some debate regarding the accuracy of these figures. Disk capacity is also increasing at a much higher rate than disk throughput. Although researchers do not agree completely with these values, they show the difference in the rate of advancement of each of these two areas.

    In the above scenario, it becomes increasingly difficult to use the available disk capacity effectively. Disk input/output (I/O) becomes the bottleneck as a result of such skewed processing speed and disk throughput. This inevitable I/O bottleneck was one of the major forces that motivated parallel database research. The necessity of storing high volumes of data, producing faster response times, scalability, reliability, load balancing, and data availability were among the factors that led to the development of parallel database systems research. Nowadays, most commercial database management systems (DBMS) vendors include some parallel processing capabilities in their products.

    Typically, a parallel database system assumes only a single administrative domain, a homogeneous working environment, and close proximity of data storage (i.e., data is stored in different machines in the same room or building). Later in this chapter, we will discuss the various forms of parallelism, motivations, and architectures.

    With the increasing diversity of scientific disciplines, the amount of data collected is increasing. In domains as diverse as global climate change, high-energy physics, and computational genomics, the volume of data being measured and stored is already scaling to terabytes and will soon increase to petabytes. For certain applications, such as earth observation and astronomy experiments, data is best collected locally, but the experimental analysis must be able to access the large volume of distributed data seamlessly. This requirement emphasizes the need for Grid-enabled data sources: it should be possible to quickly and automatically install, configure, and disassemble the data sources, along with support for data movement and replication.

    The Grid is a heterogeneous collaboration of resources and thus will contain a diverse range of data resources. Heterogeneity in a data Grid can be due to the data model, the transaction model, storage systems, or data types. Data Grids provide seamless access to geographically distributed data sources storing terabytes to petabytes of data, with proper authentication and security services.

    The development of a Grid infrastructure was necessary for large-scale computing and data-intensive scientific applications. A Grid enables the sharing, selection, and aggregation of a wide variety of geographically distributed resources, including supercomputers, storage systems, data sources, and specialized devices owned by different organizations, for solving large-scale resource-intensive problems in science, engineering, and commerce. One important aspect is that the resources—computing and data—are owned by different organizations. Thus the design and evolution of individual resources are autonomous and independent of each other, and the resources are mostly heterogeneous.

    Based on the above discussions, this book covers two main elements, namely, parallel query processing and Grid databases. The former aims at high performance of query processing, which mainly involves read-only queries, whereas the latter concentrates on Grid transaction management, focusing on read as well as write operations.

    1.2 PARALLEL QUERY PROCESSING: MOTIVATIONS

    It is common these days for databases to grow to enormous sizes and to be accessed by a large number of users. This growth strains the ability of single-processor systems to handle the load. Consider a database 10 terabytes in size: simple processing using a single processor capable of a speed of 1 megabyte/second would take roughly 120 days and nights of processing time. If this processing time needs to be reduced to several days or even several hours, parallel processing is an alternative answer.

    10 TB = 10 × 1,024 × 1,024 MB = 10,485,760 MB
    10,485,760 MB ÷ 1 MB/second = 10,485,760 seconds
    ≈ 174,763 minutes
    ≈ 2,913 hours
    ≈ 121 days and nights
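    The arithmetic above can be checked with a short script (a sketch of the chapter's back-of-the-envelope calculation; the 1 MB/second scan rate is the text's illustrative figure, not a measured value):

```python
# Sequential scan time for a 10 TB database at 1 MB/second,
# following the chapter's back-of-the-envelope calculation.

TB_IN_MB = 1024 * 1024          # 1 TB = 1,048,576 MB
db_size_mb = 10 * TB_IN_MB      # 10 TB = 10,485,760 MB
scan_rate_mb_per_sec = 1        # assumed sequential read rate

seconds = db_size_mb / scan_rate_mb_per_sec
minutes = seconds / 60
hours = minutes / 60
days = hours / 24

print(f"{seconds:,.0f} seconds")   # 10,485,760 seconds
print(f"{minutes:,.0f} minutes")   # ~174,763 minutes
print(f"{hours:,.0f} hours")       # ~2,913 hours
print(f"{days:,.0f} days")         # ~121 days
```

    Halving the scan time then only requires doubling the aggregate scan rate, which is exactly what adding processors and disks in parallel provides.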

Chapter 1 Introduction

Because of the performance benefits, and also in order to maintain higher throughput, more and more organizations are turning to parallel processing. Parallel machines are becoming readily available, and most RDBMS products now offer parallelism features.

But what is parallel processing, and why not just use a faster computer to speed up processing?

Computers were invented to solve problems faster than a human being can, and people continue to ask them to do more, and to do it faster. Computer design has consequently become more complex than ever before: advances in engineering have delivered improved circuitry, improved instruction sets, and improved algorithms to meet the demand for faster response times. Even with the advances in engineering that produce these complex, fast computers, however, there are speed limitations. The processing speed of a processor depends on how quickly information can travel between the electronic components within it, and that speed is ultimately limited by the speed of light. With advances in technology, particularly fiber optics, the speed at which information travels is approaching the speed of light, but it cannot exceed it because of the limitations of the medium. Another limiting factor is transistor density: the number of transistors that can be packed into a processor can be pushed only to a certain limit.

These limitations have led hardware designers to look for another way to increase performance; parallelism is the result of these efforts. Parallel processing takes a large task and, instead of feeding the computer the whole task, which may take a long time to complete, divides it into smaller subtasks that are worked on simultaneously. Ultimately, this divide-and-conquer approach aims to complete a large task in less time than it would take if it were processed as one large whole. Parallel systems improve processing and I/O speeds by using multiple processors and disks in parallel, enabling multiple processors to work simultaneously on several parts of a task and complete it faster than could be done otherwise.

Additionally, database processing works well with parallelism. Database processing is basically an operation on a database. When the same operation can be performed on different fragments of the database, this creates parallelism, which in turn creates the notion of parallel database processing.
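As a sketch of this idea, the same operation can be applied to each fragment of a table independently and the per-fragment results combined; the toy table, predicate, and thread pool below are illustrative assumptions, not a real DBMS:

```python
from concurrent.futures import ThreadPoolExecutor

# A toy "table" of account balances, split into four fragments.
# In a real parallel DBMS each fragment would live on its own disk/processor.
table = list(range(1, 101))                      # balances 1..100
fragments = [table[i:i + 25] for i in range(0, 100, 25)]

def count_over_50(fragment):
    # The same operation is applied to every fragment.
    return sum(1 for balance in fragment if balance > 50)

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(count_over_50, fragments))

total = sum(partial_counts)                      # consolidation step
print(total)  # 50, the same answer as a sequential scan of the whole table
```

The consolidation step at the end is exactly the kind of sequential tail discussed under parallel obstacles later in this section.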

The driving forces behind parallel database processing include:

• Querying large databases (of the order of terabytes), and
• Processing an extremely large number of transactions per second (of the order of thousands of transactions per second).

Since parallel database processing works at the query or transaction level, this approach views the degree of parallelism as coarse-grained. Coarse-grained parallelism is well suited to database processing because its operations are relatively simple, but they need to work with a large volume of data.


    1.3 PARALLEL QUERY PROCESSING: OBJECTIVES

The primary objective of parallel database processing is to gain performance improvement. There are two main measures of performance improvement. The first is throughput: the number of tasks that can be completed within a given time interval. The second is response time: the amount of time it takes to complete a single task from the time it is submitted. A system that processes a large number of small transactions can improve throughput by processing many transactions in parallel. A system that processes large transactions can improve response time as well as throughput by performing subtasks of each transaction in parallel.

    These two measures are normally quantified by the following metrics: (i) speedup and (ii) scale up.

    1.3.1 Speed Up

Speed up refers to the performance improvement gained by adding extra processing elements. In other words, it refers to running a given task in less time by increasing the degree of parallelism. Speed up is a typical metric used to measure the performance of read-only queries (data retrieval). Speed up can be measured by:

Speed up = (elapsed time on uniprocessor) / (elapsed time on multiprocessor)

A linear speed up refers to performance improvement growing linearly with additional resources; that is, a speed up of N when the larger system has N times the resources of the smaller system. A less desirable sublinear speed up is when the speed up is less than N. Superlinear speed up (a speed up greater than N) is very rare. It may occasionally be seen, but usually it is due to the use of a suboptimal sequential algorithm or some unique feature of the architecture that favors the parallel formation, such as the extra memory available in the multiprocessor system.

Figure 1.1 is a graph showing linear speed up in comparison with sublinear and superlinear speed up. The resources on the x-axis are normally measured in terms of the number of processors used, whereas the speed up on the y-axis is calculated with the above equation.

Since superlinear speed up rarely happens, and is questioned even by experts in parallel processing, the ultimate goal of parallel processing, including parallel database processing, is to achieve linear speed up. Linear speed up is then used as an indicator of the efficiency of data processing on multiprocessors.

To illustrate a speed up calculation, we give the following example. Suppose a database operation processed on a single processor takes 100 minutes to complete. If 5 processors are used and the completion time is reduced to 20 minutes, the speed up is 100/20 = 5. Since the speed up equals the number of processors used (5), a linear speed up is achieved.

If the elapsed time of the job with 5 processors is longer, say around 33 minutes, the speed up becomes approximately 3. Since this speed up is less than the number of processors used, a sublinear speed up is obtained.


Figure 1.1 Speed up (linear speed up compared with sublinear and superlinear speed up, plotted against resources)

In an extremely rare case, the elapsed time of the job with 5 processors may be less than 20 minutes, say 16.5 minutes; the speed up then becomes approximately 6. This is a superlinear speed up, since the speed up (6) is greater than the number of processors (5).
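The three cases above can be checked with a small helper; the formula and the elapsed times are the chapter's, while the function name is our own:

```python
def speed_up(uniprocessor_time, multiprocessor_time):
    # Speed up = elapsed time on uniprocessor / elapsed time on multiprocessor
    return uniprocessor_time / multiprocessor_time

# 100 minutes on one processor versus three outcomes on 5 processors:
linear = speed_up(100, 20)         # 5.0  -> equals N = 5: linear
sublinear = speed_up(100, 33)      # ~3.0 -> less than N: sublinear
superlinear = speed_up(100, 16.5)  # ~6.1 -> greater than N: superlinear
print(linear, round(sublinear, 1), round(superlinear, 1))
```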

    1.3.2 Scale Up

Scale up refers to the handling of larger tasks by increasing the degree of parallelism: that is, the ability to process larger tasks in the same amount of time by providing more resources. For a given application, we would like to examine whether it is viable to add more resources when the workload increases in order to maintain performance. This metric is typically used in transaction processing systems (data manipulation). Scale up is calculated as follows:

Scale up = (uniprocessor elapsed time on small system) / (multiprocessor elapsed time on larger system)

Linear scale up refers to the ability to maintain the same level of performance when both the workload and the resources are increased proportionally. Using the above formula, a scale up equal to 1 is a linear scale up. A sublinear scale up is one where the scale up is less than 1. Superlinear scale up is rare, and we exclude it from further discussion. Hence, linear scale up is the ultimate goal of parallel database processing. Figure 1.2 shows a graph demonstrating linear and sublinear scale up.

There are two kinds of scale up relevant to parallel databases, depending on how the size of the task is measured, namely: (i) transaction scale up and (ii) data scale up.


Figure 1.2 Scale up (linear and sublinear scale up plotted against workload, with resources increasing proportionally to the workload)

    Transaction Scale Up

Transaction scale up refers to an increase in the rate at which transactions are processed. The size of the database may also increase proportionally to the transaction arrival rate.

In transaction scale up, N times as many users submit N times as many requests or transactions against an N-times larger database. This kind of scale up is relevant in transaction processing systems where the transactions are small updates.

To illustrate transaction scale up, consider the following example. Assume it takes 10 minutes to complete 100 transactions on a single processor. If the number of transactions to be processed is increased to 300 and the number of processors is increased to 3, and the elapsed time remains 10 minutes, then a linear scale up has been achieved (scale up = 1).

If, for some reason, it takes longer than 10 minutes, say 15 minutes, to process the 300 transactions even though the number of processors has been increased to 3, then the scale up becomes 10/15 ≈ 0.67, which is less than 1, and hence a sublinear scale up is obtained.
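A corresponding sketch for the transaction scale up example (the elapsed times are the chapter's illustrative figures; the function name is ours):

```python
def scale_up(small_system_time, larger_system_time):
    # Scale up = uniprocessor elapsed time on small system /
    #            multiprocessor elapsed time on larger system
    return small_system_time / larger_system_time

# 100 transactions take 10 minutes on 1 processor;
# 300 transactions on 3 processors take either 10 or 15 minutes:
linear = scale_up(10, 10)     # 1.0   -> linear scale up
sublinear = scale_up(10, 15)  # ~0.67 -> sublinear scale up
print(linear, round(sublinear, 2))
```

The same formula covers the data scale up example later in this section (60 minutes versus 75 minutes gives 0.8).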

Transaction processing is especially well suited to parallel processing, since different transactions can run concurrently and independently on separate processors, and each transaction takes a small amount of time, even if the database grows.

    Data Scale Up

Data scale up refers to an increase in the size of the database, where the task is a large job whose runtime depends on the size of the database. For example, when sorting a table whose size is proportional to the size of the database, the size of the database is the measure of the size of the problem. This is typically found in online analytical processing (OLAP) in data warehousing, where the fact table is normally very large compared with all the dimension tables combined.

To illustrate data scale up, we use the following example. Suppose the fact table of a data warehouse occupies around 90% of the space in the database. Assume the job is to produce a report that groups data in the fact table according to some criteria specified by its dimensions.

For example, suppose the processing of this operation on a single processor takes one hour. If the size of the fact table is then doubled, it is sensible to double the number of processors as well. If the same process still takes one hour, a linear scale up has been achieved.

If the process now takes longer than one hour, say 75 minutes, then the scale up is 60/75 = 0.8, which is less than 1. Therefore, a sublinear scale up is obtained.

    1.3.3 Parallel Obstacles

A number of factors work against efficient parallel operation and can diminish both speed up and scale up, particularly: (i) start up and consolidation costs, (ii) interference and communication, and (iii) skew.

    Start Up and Consolidation Costs

Start up cost is the cost associated with initiating multiple processes. In a parallel operation consisting of multiple processes, the start up time may overshadow the actual processing time, adversely affecting speed up, especially if thousands of processes must be started. Even when only a small number of parallel processes are started, if the actual processing time is very short, the start up cost may dominate the overall processing time.

Consolidation cost refers to the cost of collecting the results obtained from each processor by a host processor. This cost can also be a factor that prevents linear speed up.

Parallel processing normally starts by breaking up the main task into multiple subtasks, each carried out by a different processing element. After these subtasks have been completed, the results produced by each subtask must be consolidated and presented to the user. Since the consolidation process is usually carried out by a single processing element, normally the host processor, no parallelism is applied at this stage, and consequently it limits the speed up of the overall process.

Both start up and consolidation are sequential parts of the process and cannot be parallelized. This is a manifestation of Amdahl's law, which states that the compute time can be divided into a parallel part and a serial part, and no matter how high the degree of parallelism in the former, the speed up will be asymptotically limited by the latter, which must be performed on a single processing element.

For example, suppose a database operation consists of a sequence of 10 steps, 8 of which can be done in parallel but 2 of which must be done in sequence (such as start up and consolidation operations). Compared with a single processing element, an 8-processing-element machine would attain a speed up not of 8 but of somewhere around 3, even though the processing element cost is 8 times higher.
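The figure of "around 3" follows directly from Amdahl's law. A quick sketch, under the assumption that all 10 steps take equal time (so the serial fraction is 0.2):

```python
def amdahl_speed_up(serial_fraction, n_processors):
    # Amdahl's law: the serial fraction runs on one processing element,
    # while the parallel fraction is divided across n_processors.
    return 1 / (serial_fraction + (1 - serial_fraction) / n_processors)

# 2 of 10 equal steps are serial (serial fraction 0.2), 8 processors:
print(round(amdahl_speed_up(0.2, 8), 2))      # 3.33, not 8
# Even with an effectively unlimited number of processors,
# the speed up is capped by the serial part at 1/0.2 = 5:
print(round(amdahl_speed_up(0.2, 10**9), 2))  # ~5.0
```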