
Distributed Computing: Principles, Algorithms, and Systems


  • Distributed Computing: Principles, Algorithms, and Systems

    Distributed computing deals with all forms of computing, information access, and information exchange across multiple processing platforms connected by computer networks. Design of distributed computing systems is a complex task. It requires a solid understanding of the design issues and an in-depth understanding of the theoretical and practical aspects of their solutions. This comprehensive textbook covers the fundamental principles and models underlying the theory, algorithms, and systems aspects of distributed computing.

    Broad and detailed coverage of the theory is balanced with practical systems-related problems such as mutual exclusion, deadlock detection, authentication, and failure recovery. Algorithms are carefully selected, lucidly presented, and described without complex proofs. Simple explanations and illustrations are used to elucidate the algorithms. Emerging topics of significant impact, such as peer-to-peer networks and network security, are also covered.

    With state-of-the-art algorithms, numerous illustrations, examples, and homework problems, this textbook is invaluable for advanced undergraduate and graduate students of electrical and computer engineering and computer science. Practitioners in data networking and sensor networks will also find this a valuable resource.

    Ajay D. Kshemkalyani is an Associate Professor in the Department of Computer Science at the University of Illinois at Chicago. He was awarded his Ph.D. in Computer and Information Science in 1991 from The Ohio State University. Before moving to academia, he spent several years working on computer networks at IBM Research Triangle Park. In 1999, he received the National Science Foundation's CAREER Award. He is a Senior Member of the IEEE, and his principal areas of research include distributed computing, algorithms, computer networks, and concurrent systems. He currently serves on the editorial board of Computer Networks.

    Mukesh Singhal is Full Professor and Gartner Group Endowed Chair in Network Engineering in the Department of Computer Science at the University of Kentucky. He was awarded his Ph.D. in Computer Science in 1986 from the University of Maryland, College Park. In 2003, he received the IEEE Technical Achievement Award, and currently serves on the editorial boards for the IEEE Transactions on Parallel and Distributed Systems and the IEEE Transactions on Computers. He is a Fellow of the IEEE, and his principal areas of research include distributed systems, computer networks, wireless and mobile computing systems, performance evaluation, and computer security.

  • Distributed Computing: Principles, Algorithms, and Systems

    Ajay D. Kshemkalyani
    University of Illinois at Chicago, Chicago

    and

    Mukesh Singhal
    University of Kentucky, Lexington

  • CAMBRIDGE UNIVERSITY PRESS

    Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo

    Cambridge University Press, The Edinburgh Building, Cambridge CB2 8RU, UK

    Published in the United States of America by Cambridge University Press, New York

    Information on this title: www.cambridge.org/9780521876346

    www.cambridge.org

    © Cambridge University Press 2008

    This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

    First published in print format 2008

    ISBN-13 978-0-511-39341-9   eBook (EBL)
    ISBN-13 978-0-521-87634-6   hardback

    Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

  • To my father Shri Digambar and my mother Shrimati Vimala.
    Ajay D. Kshemkalyani

    To my mother Chandra Prabha Singhal, my father Brij Mohan Singhal, and my daughters Meenakshi, Malvika, and Priyanka.
    Mukesh Singhal

  • Contents

    Preface

    1 Introduction
        1.1 Definition
        1.2 Relation to computer system components
        1.3 Motivation
        1.4 Relation to parallel multiprocessor/multicomputer systems
        1.5 Message-passing systems versus shared memory systems
        1.6 Primitives for distributed communication
        1.7 Synchronous versus asynchronous executions
        1.8 Design issues and challenges
        1.9 Selection and coverage of topics
        1.10 Chapter summary
        1.11 Exercises
        1.12 Notes on references
        References

    2 A model of distributed computations
        2.1 A distributed program
        2.2 A model of distributed executions
        2.3 Models of communication networks
        2.4 Global state of a distributed system
        2.5 Cuts of a distributed computation
        2.6 Past and future cones of an event
        2.7 Models of process communications
        2.8 Chapter summary
        2.9 Exercises
        2.10 Notes on references
        References

    3 Logical time
        3.1 Introduction
        3.2 A framework for a system of logical clocks
        3.3 Scalar time
        3.4 Vector time
        3.5 Efficient implementations of vector clocks
        3.6 Jard–Jourdan's adaptive technique
        3.7 Matrix time
        3.8 Virtual time
        3.9 Physical clock synchronization: NTP
        3.10 Chapter summary
        3.11 Exercises
        3.12 Notes on references
        References

    4 Global state and snapshot recording algorithms
        4.1 Introduction
        4.2 System model and definitions
        4.3 Snapshot algorithms for FIFO channels
        4.4 Variations of the Chandy–Lamport algorithm
        4.5 Snapshot algorithms for non-FIFO channels
        4.6 Snapshots in a causal delivery system
        4.7 Monitoring global state
        4.8 Necessary and sufficient conditions for consistent global snapshots
        4.9 Finding consistent global snapshots in a distributed computation
        4.10 Chapter summary
        4.11 Exercises
        4.12 Notes on references
        References

    5 Terminology and basic algorithms
        5.1 Topology abstraction and overlays
        5.2 Classifications and basic concepts
        5.3 Complexity measures and metrics
        5.4 Program structure
        5.5 Elementary graph algorithms
        5.6 Synchronizers
        5.7 Maximal independent set (MIS)
        5.8 Connected dominating set
        5.9 Compact routing tables
        5.10 Leader election
        5.11 Challenges in designing distributed graph algorithms
        5.12 Object replication problems
        5.13 Chapter summary
        5.14 Exercises
        5.15 Notes on references
        References

    6 Message ordering and group communication
        6.1 Message ordering paradigms
        6.2 Asynchronous execution with synchronous communication
        6.3 Synchronous program order on an asynchronous system
        6.4 Group communication
        6.5 Causal order (CO)
        6.6 Total order
        6.7 A nomenclature for multicast
        6.8 Propagation trees for multicast
        6.9 Classification of application-level multicast algorithms
        6.10 Semantics of fault-tolerant group communication
        6.11 Distributed multicast algorithms at the network layer
        6.12 Chapter summary
        6.13 Exercises
        6.14 Notes on references
        References

    7 Termination detection
        7.1 Introduction
        7.2 System model of a distributed computation
        7.3 Termination detection using distributed snapshots
        7.4 Termination detection by weight throwing
        7.5 A spanning-tree-based termination detection algorithm
        7.6 Message-optimal termination detection
        7.7 Termination detection in a very general distributed computing model
        7.8 Termination detection in the atomic computation model
        7.9 Termination detection in a faulty distributed system
        7.10 Chapter summary
        7.11 Exercises
        7.12 Notes on references
        References

    8 Reasoning with knowledge
        8.1 The muddy children puzzle
        8.2 Logic of knowledge
        8.3 Knowledge in synchronous systems
        8.4 Knowledge in asynchronous systems
        8.5 Knowledge transfer
        8.6 Knowledge and clocks
        8.7 Chapter summary
        8.8 Exercises
        8.9 Notes on references
        References

    9 Distributed mutual exclusion algorithms
        9.1 Introduction
        9.2 Preliminaries
        9.3 Lamport's algorithm
        9.4 Ricart–Agrawala algorithm
        9.5 Singhal's dynamic information-structure algorithm
        9.6 Lodha and Kshemkalyani's fair mutual exclusion algorithm
        9.7 Quorum-based mutual exclusion algorithms
        9.8 Maekawa's algorithm
        9.9 Agarwal–El Abbadi quorum-based algorithm
        9.10 Token-based algorithms
        9.11 Suzuki–Kasami's broadcast algorithm
        9.12 Raymond's tree-based algorithm
        9.13 Chapter summary
        9.14 Exercises
        9.15 Notes on references
        References

    10 Deadlock detection in distributed systems
        10.1 Introduction
        10.2 System model
        10.3 Preliminaries
        10.4 Models of deadlocks
        10.5 Knapp's classification of distributed deadlock detection algorithms
        10.6 Mitchell and Merritt's algorithm for the single-resource model
        10.7 Chandy–Misra–Haas algorithm for the AND model
        10.8 Chandy–Misra–Haas algorithm for the OR model
        10.9 Kshemkalyani–Singhal algorithm for the P-out-of-Q model
        10.10 Chapter summary
        10.11 Exercises
        10.12 Notes on references
        References

    11 Global predicate detection
        11.1 Stable and unstable predicates
        11.2 Modalities on predicates
        11.3 Centralized algorithm for relational predicates
        11.4 Conjunctive predicates
        11.5 Distributed algorithms for conjunctive predicates
        11.6 Further classification of predicates
        11.7 Chapter summary
        11.8 Exercises
        11.9 Notes on references
        References

    12 Distributed shared memory
        12.1 Abstraction and advantages
        12.2 Memory consistency models
        12.3 Shared memory mutual exclusion
        12.4 Wait-freedom
        12.5 Register hierarchy and wait-free simulations
        12.6 Wait-free atomic snapshots of shared objects
        12.7 Chapter summary
        12.8 Exercises
        12.9 Notes on references
        References

    13 Checkpointing and rollback recovery
        13.1 Introduction
        13.2 Background and definitions
        13.3 Issues in failure recovery
        13.4 Checkpoint-based recovery
        13.5 Log-based rollback recovery
        13.6 Koo–Toueg coordinated checkpointing algorithm
        13.7 Juang–Venkatesan algorithm for asynchronous checkpointing and recovery
        13.8 Manivannan–Singhal quasi-synchronous checkpointing algorithm
        13.9 Peterson–Kearns algorithm based on vector time
        13.10 Helary–Mostefaoui–Netzer–Raynal communication-induced protocol
        13.11 Chapter summary
        13.12 Exercises
        13.13 Notes on references
        References

    14 Consensus and agreement algorithms
        14.1 Problem definition
        14.2 Overview of results
        14.3 Agreement in a failure-free system (synchronous or asynchronous)
        14.4 Agreement in (message-passing) synchronous systems with failures
        14.5 Agreement in asynchronous message-passing systems with failures
        14.6 Wait-free shared memory consensus in asynchronous systems
        14.7 Chapter summary
        14.8 Exercises
        14.9 Notes on references
        References

    15 Failure detectors
        15.1 Introduction
        15.2 Unreliable failure detectors
        15.3 The consensus problem
        15.4 Atomic broadcast
        15.5 A solution to atomic broadcast
        15.6 The weakest failure detectors to solve fundamental agreement problems
        15.7 An implementation of a failure detector
        15.8 An adaptive failure detection protocol
        15.9 Exercises
        15.10 Notes on references
        References

    16 Authentication in distributed systems
        16.1 Introduction
        16.2 Background and definitions
        16.3 Protocols based on symmetric cryptosystems
        16.4 Protocols based on asymmetric cryptosystems
        16.5 Password-based authentication
        16.6 Authentication protocol failures
        16.7 Chapter summary
        16.8 Exercises
        16.9 Notes on references
        References

    17 Self-stabilization
        17.1 Introduction
        17.2 System model
        17.3 Definition of self-stabilization
        17.4 Issues in the design of self-stabilization algorithms
        17.5 Methodologies for designing self-stabilizing systems
        17.6 Communication protocols
        17.7 Self-stabilizing distributed spanning trees
        17.8 Self-stabilizing algorithms for spanning-tree construction
        17.9 An anonymous self-stabilizing algorithm for 1-maximal independent set in trees
        17.10 A probabilistic self-stabilizing leader election algorithm
        17.11 The role of compilers in self-stabilization
        17.12 Self-stabilization as a solution to fault tolerance
        17.13 Factors preventing self-stabilization
        17.14 Limitations of self-stabilization
        17.15 Chapter summary
        17.16 Exercises
        17.17 Notes on references
        References

    18 Peer-to-peer computing and overlay graphs
        18.1 Introduction
        18.2 Data indexing and overlays
        18.3 Unstructured overlays
        18.4 Chord distributed hash table
        18.5 Content addressible networks (CAN)
        18.6 Tapestry
        18.7 Some other challenges in P2P system design
        18.8 Tradeoffs between table storage and route lengths
        18.9 Graph structures of complex networks
        18.10 Internet graphs
        18.11 Generalized random graph networks
        18.12 Small-world networks
        18.13 Scale-free networks
        18.14 Evolving networks
        18.15 Chapter summary
        18.16 Exercises
        18.17 Notes on references
        References

    Index

  • Preface

    Background

    The field of distributed computing covers all aspects of computing and information access across multiple processing elements connected by any form of communication network, whether local or wide-area in coverage. Since the advent of the Internet in the 1970s, there has been a steady growth of new applications requiring distributed processing. This has been enabled by advances in networking and hardware technology, the falling cost of hardware, and greater end-user awareness. These factors have contributed to making distributed computing a cost-effective, high-performance, and fault-tolerant reality. Around the turn of the millennium, there was an explosive growth in the expansion and efficiency of the Internet, which was matched by increased access to networked resources through the World Wide Web, all across the world. Coupled with an equally dramatic growth in the wireless and mobile networking areas, and the plummeting prices of bandwidth and storage devices, we are witnessing a rapid spurt in distributed applications and an accompanying interest in the field of distributed computing in universities, government organizations, and private institutions.

    Advances in hardware technology have suddenly made sensor networking a reality, and embedded and sensor networks are rapidly becoming an integral part of everyone's life, from the home network with its interconnected gadgets, to the automobile communicating by GPS (global positioning system), to the fully networked office with RFID monitoring. In the emerging global village, distributed computing will be the centerpiece of all computing and information access sub-disciplines within computer science. Clearly, this is a very important field. Moreover, this evolving field is characterized by a diverse range of challenges for which the solutions need to have foundations on solid principles.

    The field of distributed computing is very important, and there is a huge demand for a good comprehensive book. This book comprehensively covers all important topics in great depth, combining this with a clarity of explanation and ease of understanding. The book will be particularly valuable to the academic community and the computer industry at large. Writing such a comprehensive book has been a Herculean task, and there is a deep sense of satisfaction in knowing that we were able to complete it and perform this service to the community.

    Description, approach, and features

    The book will focus on the fundamental principles and models underlying all aspects of distributed computing. It will address the principles underlying the theory, algorithms, and systems aspects of distributed computing. The manner of presentation of the algorithms is very clear, explaining the main ideas and the intuition with figures and simple explanations rather than getting entangled in intimidating notations and lengthy and hard-to-follow rigorous proofs of the algorithms. The selection of chapter themes is broad and comprehensive, and the book covers all important topics in depth. The selection of algorithms within each chapter has been done carefully to elucidate new and important techniques of algorithm design. Although the book focuses on foundational aspects and algorithms for distributed computing, it thoroughly addresses all practical systems-like problems (e.g., mutual exclusion, deadlock detection, termination detection, failure recovery, authentication, global state and time, etc.) by presenting the theory behind and algorithms for such problems. The book is written keeping in mind the impact of emerging topics such as peer-to-peer computing and network security on the foundational aspects of distributed computing.

    Each chapter contains figures, examples, exercises, a summary, and references.

    Readership

    This book is aimed as a textbook for the following:

    • Graduate students and senior-level undergraduate students in computer science and computer engineering.

    • Graduate students in electrical engineering and mathematics. As wireless networks, peer-to-peer networks, and mobile computing continue to grow in importance, an increasing number of students from electrical engineering departments will also find this book necessary.

    • Practitioners, systems designers/programmers, and consultants in industry and research laboratories will find the book a very useful reference because it contains state-of-the-art algorithms and principles to address various design issues in distributed systems, as well as the latest references.


    Hard and soft prerequisites for the use of this book include the following:

    • An undergraduate course in algorithms is required.

    • Undergraduate courses in operating systems and computer networks would be useful.

    • A reasonable familiarity with programming.

    We have aimed for a very comprehensive book that will act as a single source for distributed computing models and algorithms. The book has both depth and breadth of coverage of topics, and is characterized by clear and easy explanations. None of the existing textbooks on distributed computing provides all of these features.

    Acknowledgements

    This book grew from the notes used in the graduate courses on distributed computing at the Ohio State University, the University of Illinois at Chicago, and at the University of Kentucky. We would like to thank the graduate students at these schools for their contributions to the book in many ways.

    The book is based on the published research results of numerous researchers in the field. We have made all efforts to present the material in our own words and have given credit to the original sources of information. We would like to thank all the researchers whose work has been reported in this book. Finally, we would like to thank the staff of Cambridge University Press for providing us with excellent support in the publication of this book.

    Access to resources

    The following websites will be maintained for the book. Any errors and comments should be sent to [email protected] or [email protected]. Further information about the book can be obtained from the authors' web pages:

    www.cs.uic.edu/ajayk/DCS-Book
    www.cs.uky.edu/singhal/DCS-Book

  • C H A P T E R

    1 Introduction

    1.1 Definition

    A distributed system is a collection of independent entities that cooperate to solve a problem that cannot be individually solved. Distributed systems have been in existence since the start of the universe. From a school of fish to a flock of birds and entire ecosystems of microorganisms, there is communication among mobile intelligent agents in nature. With the widespread proliferation of the Internet and the emerging global village, the notion of distributed computing systems as a useful and widely deployed tool is becoming a reality. For computing systems, a distributed system has been characterized in one of several ways:

    • You know you are using one when the crash of a computer you have never heard of prevents you from doing work [23].

    • A collection of computers that do not share common memory or a common physical clock, that communicate by message passing over a communication network, and where each computer has its own memory and runs its own operating system. Typically the computers are semi-autonomous and are loosely coupled while they cooperate to address a problem collectively [29].

    • A collection of independent computers that appears to the users of the system as a single coherent computer [33].

    • A term that describes a wide range of computers, from weakly coupled systems such as wide-area networks, to strongly coupled systems such as local area networks, to very strongly coupled systems such as multiprocessor systems [19].

    A distributed system can be characterized as a collection of mostly autonomous processors communicating over a communication network and having the following features:

    • No common physical clock: This is an important assumption because it introduces the element of distribution in the system and gives rise to the inherent asynchrony amongst the processors.


    • No shared memory: This is a key feature that requires message-passing for communication. This feature implies the absence of the common physical clock. It may be noted that a distributed system may still provide the abstraction of a common address space via the distributed shared memory abstraction. Several aspects of shared memory multiprocessor systems have also been studied in the distributed computing literature.

    • Geographical separation: The geographically wider apart that the processors are, the more representative is the system of a distributed system. However, it is not necessary for the processors to be on a wide-area network (WAN). Recently, the network/cluster of workstations (NOW/COW) configuration connecting processors on a LAN is also being increasingly regarded as a small distributed system. This NOW configuration is becoming popular because of the low-cost high-speed off-the-shelf processors now available. The Google search engine is based on the NOW architecture.

    • Autonomy and heterogeneity: The processors are loosely coupled in that they have different speeds and each can be running a different operating system. They are usually not part of a dedicated system, but cooperate with one another by offering services or solving a problem jointly.

    1.2 Relation to computer system components

    A typical distributed system is shown in Figure 1.1. Each computer has a memory-processing unit and the computers are connected by a communication network. Figure 1.2 shows the relationships of the software components that run on each of the computers and use the local operating system and network protocol stack for functioning. The distributed software is also termed as middleware. A distributed execution is the execution of processes across the distributed system to collaboratively achieve a common goal. An execution is also sometimes termed a computation or a run.

    The distributed system uses a layered architecture to break down the complexity of system design.

    Figure 1.1 A distributed system connects processors by a communication network (WAN/LAN). Each node has one or more processors (P) and memory banks (M).


    Figure 1.2 Interaction of the software components at each processor: the distributed application runs over the distributed software (middleware libraries), which uses the local operating system and the network protocol stack (application, transport, network, and data link layers). The figure also marks the extent of the distributed protocols.

    The middleware is the distributed software that drives the distributed system, while providing transparency of heterogeneity at the platform level [24]. Figure 1.2 schematically shows the interaction of this software with these system components at each processor. Here we assume that the middleware layer does not contain the traditional application layer functions of the network protocol stack, such as http, mail, ftp, and telnet. Various primitives and calls to functions defined in various libraries of the middleware layer are embedded in the user program code. There exist several libraries to choose from to invoke primitives for the more common functions of the middleware layer, such as reliable and ordered multicasting. There are several standards, such as the Object Management Group's (OMG) common object request broker architecture (CORBA) [36], and the remote procedure call (RPC) mechanism [1, 11]. The RPC mechanism conceptually works like a local procedure call, with the difference that the procedure code may reside on a remote machine, and the RPC software sends a message across the network to invoke the remote procedure. It then awaits a reply, after which the procedure call completes from the perspective of the program that invoked it. Currently deployed commercial versions of middleware often use CORBA, DCOM (distributed component object model), Java, and RMI (remote method invocation) [7] technologies. The message-passing interface (MPI) [20, 30] developed in the research community is an example of an interface for various communication functions.

    1.3 Motivation

    The motivation for using a distributed system is some or all of the following requirements:

    1. Inherently distributed computations: In many applications such as money transfer in banking, or reaching consensus among parties that are geographically distant, the computation is inherently distributed.

    2. Resource sharing: Resources such as peripherals, complete data sets in databases, special libraries, as well as data (variable/files) cannot be fully replicated at all the sites because it is often neither practical nor cost-effective. Further, they cannot be placed at a single site because access to that site might prove to be a bottleneck. Therefore, such resources are typically distributed across the system. For example, distributed databases such as DB2 partition the data sets across several servers, in addition to replicating them at a few sites for rapid access as well as reliability.

    3. Access to geographically remote data and resources: In many scenarios, the data cannot be replicated at every site participating in the distributed execution because it may be too large or too sensitive to be replicated. For example, payroll data within a multinational corporation is both too large and too sensitive to be replicated at every branch office/site. It is therefore stored at a central server which can be queried by branch offices. Similarly, special resources such as supercomputers exist only in certain locations, and to access such supercomputers, users need to log in remotely. Advances in the design of resource-constrained mobile devices as well as in the wireless technology with which these devices communicate have given further impetus to the importance of distributed protocols and middleware.

    4. Enhanced reliability: A distributed system has the inherent potential to provide increased reliability because of the possibility of replicating resources and executions, as well as the reality that geographically distributed resources are not likely to crash/malfunction at the same time under normal circumstances. Reliability entails several aspects:
       • availability, i.e., the resource should be accessible at all times;
       • integrity, i.e., the value/state of the resource should be correct, in the face of concurrent access from multiple processors, as per the semantics expected by the application;
       • fault-tolerance, i.e., the ability to recover from system failures, where such failures may be defined to occur in one of many failure models, which we will study in Chapters 5 and 14.

    5. Increased performance/cost ratio: By resource sharing and accessing geographically remote data and resources, the performance/cost ratio is increased. Although higher throughput has not necessarily been the main objective behind using a distributed system, nevertheless, any task can be partitioned across the various computers in the distributed system. Such a configuration provides a better performance/cost ratio than using special parallel machines. This is particularly true of the NOW configuration.

    In addition to meeting the above requirements, a distributed system also offers the following advantages:

    6. Scalability: As the processors are usually connected by a wide-area network, adding more processors does not pose a direct bottleneck for the communication network.


    7. Modularity and incremental expandability: Heterogeneous processors may be easily added into the system without affecting the performance, as long as those processors are running the same middleware algorithms. Similarly, existing processors may be easily replaced by other processors.

    1.4 Relation to parallel multiprocessor/multicomputer systems

    The characteristics of a distributed system were identified above. A typical distributed system would look as shown in Figure 1.1. However, how does one classify a system that meets some but not all of the characteristics? Is the system still a distributed system, or does it become a parallel multiprocessor system? To better answer these questions, we first examine the architecture of parallel systems, and then examine some well-known taxonomies for multiprocessor/multicomputer systems.

    1.4.1 Characteristics of parallel systems

    A parallel system may be broadly classified as belonging to one of three types:

    1. A multiprocessor system is a parallel system in which the multiple processors have direct access to shared memory which forms a common address space. The architecture is shown in Figure 1.3(a). Such processors usually do not have a common clock. A multiprocessor system usually corresponds to a uniform memory access (UMA) architecture in which the access latency, i.e., waiting time, to complete an access to any memory location from any processor is the same. The processors are in very close physical proximity and are connected by an interconnection network.

    Figure 1.3 Two standard architectures for parallel systems: (a) a uniform memory access (UMA) multiprocessor system and (b) a non-uniform memory access (NUMA) multiprocessor. In both architectures, the processors may locally cache data from memory (P = processor, M = memory).


    Figure 1.4 Interconnection networks for shared memory multiprocessor systems: (a) a 3-stage Omega network [4] for n = 8 processors P0–P7 and memory banks M0–M7 (n = 8, M = 4); (b) a 3-stage Butterfly network [10] for n = 8 processors P0–P7 and memory banks M0–M7 (n = 8, M = 4).

    Interprocess communication across processors is traditionally through read and write operations on the shared memory, although the use of message-passing primitives such as those provided by the MPI is also possible (using emulation on the shared memory). All the processors usually run the same operating system, and both the hardware and software are very tightly coupled. The processors are usually of the same type, and are housed within the same box/container with a shared memory. The interconnection network to access the memory may be a bus, although for greater efficiency, it is usually a multistage switch with a symmetric and regular design.

    Figure 1.4 shows two popular interconnection networks – the Omega network [4] and the Butterfly network [10], each of which is a multi-stage network formed of 2×2 switching elements. Each 2×2 switch allows data on either of the two input wires to be switched to the upper or the lower output wire. In a single step, however, only one data unit can be sent on an output wire. So if the data from both the input wires is to be routed to the same output wire in a single step, there is a collision. Various techniques such as buffering or more elaborate interconnection designs can address collisions.

    Each 2×2 switch is represented as a rectangle in the figure. Furthermore, an n-input and n-output network uses log n stages and log n bits for addressing. Routing in the 2×2 switch at stage k uses only the kth bit, and hence can be done at clock speed in hardware. The multi-stage networks can be constructed recursively, and the interconnection pattern between any two stages can be expressed using an iterative or a recursive generating function. Besides the Omega and Butterfly (banyan) networks, other examples of multistage interconnection networks are the Clos [9] and the shuffle-exchange networks [37]. Each of these has very interesting mathematical properties that allow rich connectivity between the processor bank and memory bank.

    Omega interconnection function: The Omega network, which connects n processors to n memory units, has (n/2) log2 n switching elements of size 2×2 arranged in log2 n stages. Between each pair of adjacent stages of the Omega network, a link exists between output i of a stage and the input j to the next stage according to the following perfect shuffle pattern, which is a left-rotation operation on the binary representation of i to get j. The iterative generation function is as follows:

        j = 2i             for 0 ≤ i ≤ n/2 − 1,
        j = 2i + 1 − n     for n/2 ≤ i ≤ n − 1.        (1.1)

    Consider any stage of switches. Informally, the upper (lower) input lines for each switch come in sequential order from the upper (lower) half of the switches in the earlier stage.

    With respect to the Omega network in Figure 1.4(a), n = 8. Hence, for any stage, for the outputs i, where 0 ≤ i ≤ 3, the output i is connected to input 2i of the next stage. For 4 ≤ i ≤ 7, the output i of any stage is connected to input 2i + 1 − n of the next stage.

    Omega routing function: The routing function from input line i to output line j considers only j and the stage number s, where s ∈ [0, log2 n − 1]. In a stage s switch, if the (s+1)th MSB (most significant bit) of j is 0, the data is routed to the upper output wire, otherwise it is routed to the lower output wire.
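    As a concrete companion to equation (1.1) and the routing rule just described, the following minimal C sketch computes the perfect-shuffle connection and the per-stage routing decision; the constants, function names, and the fixed n = 8 configuration are illustrative assumptions, not from the text.

```c
#include <stdio.h>

#define N 8        /* number of processors / memory banks (a power of 2) */
#define STAGES 3   /* log2(N) */

/* Equation (1.1): output line i of a stage connects to input line j of the
 * next stage; this is a left rotation of the log2(N)-bit representation of i. */
int omega_shuffle(int i) {
    return (i < N / 2) ? (2 * i) : (2 * i + 1 - N);
}

/* Omega routing: at stage s, examine the (s+1)th most significant bit of the
 * destination j; 0 means take the upper output wire, 1 means the lower one. */
int omega_route_lower(int j, int s) {
    return (j >> (STAGES - 1 - s)) & 1;
}

int main(void) {
    for (int i = 0; i < N; i++)
        printf("output %d -> input %d of the next stage\n", i, omega_shuffle(i));
    for (int s = 0; s < STAGES; s++)   /* route towards destination j = 5 (101) */
        printf("stage %d: %s output wire\n", s,
               omega_route_lower(5, s) ? "lower" : "upper");
    return 0;
}
```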

    Butterfly interconnection function: Unlike the Omega network, the generation of the interconnection pattern between a pair of adjacent stages depends not only on n but also on the stage number s. The recursive expression is as follows. Let there be M = n/2 switches per stage, and let a switch be denoted by the tuple ⟨x, s⟩, where x ∈ [0, M−1] and stage s ∈ [0, log2 n − 1]. The two outgoing edges from any switch ⟨x, s⟩ are as follows. There is an edge from switch ⟨x, s⟩ to switch ⟨y, s+1⟩ if (i) x = y or (ii) x XOR y has exactly one 1 bit, which is in the (s+1)th MSB. For stage s, apply the rule above for M/2^s switches. Whether the two incoming connections go to the upper or the lower input port is not important because of the routing function, given below.

    Example: Consider the Butterfly network in Figure 1.4(b), with n = 8 and M = 4. There are three stages, s = 0, 1, 2, and the interconnection pattern is defined between s = 0 and s = 1 and between s = 1 and s = 2. The switch number x varies from 0 to 3 in each stage, i.e., x is a 2-bit string. (Note that unlike the Omega network formulation using input and output lines given above, this formulation uses switch numbers. Exercise 1.5 asks you to prove a formulation of the Omega interconnection pattern using switch numbers instead of input and output port numbers.)

    Consider the first stage interconnection (s = 0) of a butterfly of size M, and hence having log2 2M stages. For stage s = 0, as per rule (i), the first output line from switch 00 goes to the input line of switch 00 of stage s = 1. As per rule (ii), the second output line of switch 00 goes to the input line of switch 10 of stage s = 1. Similarly, x = 01 has one output line go to an input line of switch 11 in stage s = 1. The other connections in this stage can be determined similarly. For stage s = 1 connecting to stage s = 2, we apply the rules considering only M/2^1 = M/2 switches, i.e., we build two butterflies of size M/2 – the upper half and the lower half switches. The recursion terminates for M/2^s = 1, when there is a single switch.

    Butterfly routing function: In a stage s switch, if the (s+1)th MSB of j is 0, the data is routed to the upper output wire, otherwise it is routed to the lower output wire.

    Observe that for the Butterfly and the Omega networks, the paths from the different inputs to any one output form a spanning tree. This implies that collisions will occur when data is destined to the same output line. However, the advantage is that data can be combined at the switches if the application semantics (e.g., summation of numbers) are known.
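    The two-rule edge definition above can be sketched in the same style; the helper name and the fixed n = 8 (so M = 4) configuration below are assumptions made for illustration.

```c
#include <stdio.h>

#define M 4            /* switches per stage (n/2 for n = 8) */
#define SWITCH_BITS 2  /* log2(M): width of a switch label x */

/* The two switches of stage s+1 reachable from switch <x, s>:
 * rule (i)  y = x, and
 * rule (ii) y = x with the (s+1)th MSB of the switch label flipped. */
void butterfly_neighbours(int x, int s, int *straight, int *cross) {
    *straight = x;
    *cross    = x ^ (1 << (SWITCH_BITS - 1 - s));
}

int main(void) {
    /* Interconnection patterns exist between stages (0,1) and (1,2). */
    for (int s = 0; s < SWITCH_BITS; s++) {
        for (int x = 0; x < M; x++) {
            int straight, cross;
            butterfly_neighbours(x, s, &straight, &cross);
            printf("switch <%d,%d> -> switches <%d,%d> and <%d,%d>\n",
                   x, s, straight, s + 1, cross, s + 1);
        }
    }
    return 0;
}
```

    For s = 0 and x = 00 this reproduces the connections worked out in the example above (to switches 00 and 10 of stage 1).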

    2. A multicomputer parallel system is a parallel system in which the multiple processors do not have direct access to shared memory. The memory of the multiple processors may or may not form a common address space. Such computers usually do not have a common clock. The architecture is shown in Figure 1.3(b). The processors are in close physical proximity and are usually very tightly coupled (homogenous hardware and software), and connected by an interconnection network. The processors communicate either via a common address space or via message-passing. A multicomputer system that has a common address space usually corresponds to a non-uniform memory access (NUMA) architecture in which the latency to access various shared memory locations from the different processors varies.

    Examples of parallel multicomputers are: the NYU Ultracomputer and the Sequent shared memory machines, the CM* Connection machine, and processors configured in regular and symmetrical topologies such as an array or mesh, ring, torus, cube, and hypercube (message-passing machines). The regular and symmetrical topologies have interesting mathematical properties that enable very easy routing and provide many rich features such as alternate routing.

    Figure 1.5(a) shows a wrap-around 4×4 mesh. For a k×k mesh, which will contain k^2 processors, the maximum path length between any two processors is 2⌊k/2⌋. Routing can be done along the Manhattan grid. Figure 1.5(b) shows a four-dimensional hypercube. A k-dimensional hypercube has 2^k processor-and-memory units [13, 21]. Each such unit is a node in the hypercube, and has a unique k-bit label. Each of the k dimensions is associated with a bit position in the label. The labels of any two adjacent nodes are identical except for the bit position corresponding to the dimension in which the two nodes differ. Thus, the processors are labelled such that the shortest path between any two processors is the Hamming distance (defined as the number of bit positions in which the two equal sized bit strings differ) between the processor labels. This is clearly bounded by k.


    Figure 1.5 Some popular topologies for multicomputer shared-memory machines: (a) a wrap-around 2D-mesh, also known as a torus; (b) a hypercube of dimension 4 with 4-bit node labels. Each node is a processor-plus-memory unit.

    Example: Nodes 0101 and 1100 have a Hamming distance of 2. The shortest path between them has length 2.

    Routing in the hypercube is done hop-by-hop. At any hop, the message can be sent along any dimension corresponding to the bit position in which the current node's address and the destination address differ. The 4D hypercube shown in the figure is formed by connecting the corresponding edges of two 3D hypercubes (corresponding to the left and right cubes in the figure) along the fourth dimension; the labels of the 4D hypercube are formed by prepending a 0 to the labels of the left 3D hypercube and prepending a 1 to the labels of the right 3D hypercube. This can be extended to construct hypercubes of higher dimensions. Observe that there are multiple routes between any pair of nodes, which provides fault-tolerance as well as a congestion control mechanism. The hypercube and its variant topologies have very interesting mathematical properties with implications for routing and fault-tolerance.
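    A small C sketch of the Hamming-distance measure and one hop of the routing step just described; the function names and the k = 4 labels are illustrative assumptions.

```c
#include <stdio.h>

/* Number of bit positions in which two equal-sized labels differ. */
int hamming_distance(unsigned a, unsigned b) {
    unsigned diff = a ^ b;
    int count = 0;
    for (; diff != 0; diff >>= 1)
        count += diff & 1u;
    return count;
}

/* One routing hop: flip one bit position in which the current node's address
 * and the destination address differ (here, the lowest such position; any
 * differing dimension would do, which is what yields multiple routes). */
unsigned next_hop(unsigned current, unsigned dest) {
    unsigned diff = current ^ dest;
    return diff ? (current ^ (diff & (~diff + 1u))) : current;
}

int main(void) {
    unsigned a = 0x5, b = 0xC;   /* nodes 0101 and 1100 of the 4D hypercube */
    printf("Hamming distance = %d\n", hamming_distance(a, b));   /* prints 2 */
    for (unsigned node = a; node != b; node = next_hop(node, b))
        printf("at %x, next hop %x\n", node, next_hop(node, b));
    return 0;
}
```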

    3. Array processors belong to a class of parallel computers that are physically co-located, are very tightly coupled, and have a common system clock (but may not share memory and communicate by passing data using messages). Array processors and systolic arrays that perform tightly synchronized processing and data exchange in lock-step for applications such as DSP and image processing belong to this category. These applications usually involve a large number of iterations on the data. This class of parallel systems has a very niche market.

    The distinction between UMA multiprocessors on the one hand, and NUMA and message-passing multicomputers on the other, is important because the algorithm design and data and task partitioning among the processors must account for the variable and unpredictable latencies in accessing memory/communication [22]. As compared to UMA systems and array processors, NUMA and message-passing multicomputer systems are less suitable when the degree of granularity of accessing shared data and communication is very fine.

    The primary and most efficacious use of parallel systems is for obtaining a higher throughput by dividing the computational workload among the processors. The tasks that are most amenable to higher speedups on parallel systems are those that can be partitioned into subtasks very nicely, involving much number-crunching and relatively little communication for synchronization. Once the task has been decomposed, the processors perform large vector, array, and matrix computations that are common in scientific applications. Searching through large state spaces can be performed with significant speedup on parallel machines. While such parallel machines were an object of much theoretical and systems research in the 1980s and early 1990s, they have not proved to be economically viable for two related reasons. First, the overall market for the applications that can potentially attain high speedups is relatively small. Second, due to economy of scale and the high processing power offered by relatively inexpensive off-the-shelf networked PCs, specialized parallel machines are not cost-effective to manufacture. They additionally require special compiler and other system support for maximum throughput.

    1.4.2 Flynn's taxonomy

    Flynn [14] identified four processing modes, based on whether the processors execute the same or different instruction streams at the same time, and whether or not the processors processed the same (identical) data at the same time. It is instructive to examine this classification to understand the range of options used for configuring systems:

    • Single instruction stream, single data stream (SISD): This mode corresponds to the conventional processing in the von Neumann paradigm with a single CPU, and a single memory unit connected by a system bus.

    • Single instruction stream, multiple data stream (SIMD): This mode corresponds to the processing by multiple homogenous processors which execute in lock-step on different data items. Applications that involve operations on large arrays and matrices, such as scientific applications, can best exploit systems that provide the SIMD mode of operation because the data sets can be partitioned easily. Several of the earliest parallel computers, such as Illiac-IV, MPP, CM2, and MasPar MP-1 were SIMD machines. Vector processors, array processors and systolic arrays also belong to the SIMD class of processing. Recent SIMD architectures include co-processing units such as the MMX units in Intel processors (e.g., Pentium with the streaming SIMD extensions (SSE) options) and DSP chips such as the Sharc [22].

    • Multiple instruction stream, single data stream (MISD): This mode corresponds to the execution of different operations in parallel on the same data. This is a specialized mode of operation with limited but niche applications, e.g., visualization.


    Figure 1.6 Flynn's taxonomy of (a) SIMD, (b) MIMD, and (c) MISD architectures for multiprocessor/multicomputer systems (C = control unit, P = processing unit, I = instruction stream, D = data stream).

    • Multiple instruction stream, multiple data stream (MIMD): In this mode, the various processors execute different code on different data. This is the mode of operation in distributed systems as well as in the vast majority of parallel systems. There is no common clock among the system processors. Sun Ultra servers, multicomputer PCs, and IBM SP machines are examples of machines that execute in MIMD mode.

    SIMD, MISD, and MIMD architectures are illustrated in Figure 1.6. MIMD architectures are most general and allow much flexibility in partitioning code and data to be processed among the processors. MIMD architectures also include the classically understood mode of execution in distributed systems.

    1.4.3 Coupling, parallelism, concurrency, and granularity

    Coupling
    The degree of coupling among a set of modules, whether hardware or software, is measured in terms of the interdependency and binding and/or homogeneity among the modules. When the degree of coupling is high (low), the modules are said to be tightly (loosely) coupled. SIMD and MISD architectures generally tend to be tightly coupled because of the common clocking of the shared instruction stream or the shared data stream. Here we briefly examine various MIMD architectures in terms of coupling:

    • Tightly coupled multiprocessors (with UMA shared memory). These may be either switch-based (e.g., NYU Ultracomputer, RP3) or bus-based (e.g., Sequent, Encore).

    • Tightly coupled multiprocessors (with NUMA shared memory or that communicate by message passing). Examples are the SGI Origin 2000 and the Sun Ultra HPC servers (that communicate via NUMA shared memory), and the hypercube and the torus (that communicate by message passing).

    • Loosely coupled multicomputers (without shared memory) that are physically co-located. These may be bus-based (e.g., NOW connected by a LAN or Myrinet card) or use a more general communication network, and the processors may be heterogenous. In such systems, processors neither share memory nor have a common clock, and hence may be classified as distributed systems; however, the processors are very close to one another, which is characteristic of a parallel system. As the communication latency may be significantly lower than in wide-area distributed systems, the solution approaches to various problems may be different for such systems than for wide-area distributed systems.

    • Loosely coupled multicomputers (without shared memory and without a common clock) that are physically remote. These correspond to the conventional notion of distributed systems.

    Parallelism or speedup of a program on a specific system
    This is a measure of the relative speedup of a specific program on a given machine. The speedup depends on the number of processors and the mapping of the code to the processors. It is expressed as the ratio of the time T1 with a single processor to the time Tn with n processors.
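    Stated as a formula (the symbol S and the sample timings below are illustrative, not taken from the text):

```latex
S(n) = \frac{T_1}{T_n},
\qquad \text{e.g., } T_1 = 100\,\mathrm{s},\; T_{10} = 20\,\mathrm{s}
\;\Longrightarrow\; S(10) = \frac{100}{20} = 5.
```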

    Parallelism within a parallel/distributed program
    This is an aggregate measure of the percentage of time that all the processors are executing CPU instructions productively, as opposed to waiting for communication (either via shared memory or message-passing) operations to complete. The term is traditionally used to characterize parallel programs. If the aggregate measure is a function of only the code, then the parallelism is independent of the architecture. Otherwise, this definition degenerates to the definition of parallelism in the previous section.

    Concurrency of a program
    This is a broader term that means roughly the same as parallelism of a program, but is used in the context of distributed programs. The parallelism/concurrency in a parallel/distributed program can be measured by the ratio of the number of local (non-communication and non-shared memory access) operations to the total number of operations, including the communication or shared memory access operations.

    Granularity of a program
    The ratio of the amount of computation to the amount of communication within the parallel/distributed program is termed as granularity. If the degree of parallelism is coarse-grained (fine-grained), there are relatively many more (fewer) productive CPU instruction executions, compared to the number of times the processors communicate either via shared memory or message-passing and wait to get synchronized with the other processors. Programs with fine-grained parallelism are best suited for tightly coupled systems. These typically include SIMD and MISD architectures, tightly coupled MIMD multiprocessors (that have shared memory), and loosely coupled multicomputers (without shared memory) that are physically colocated. If programs with fine-grained parallelism were run over loosely coupled multiprocessors that are physically remote, the latency delays for the frequent communication over the WAN would significantly degrade the overall throughput. As a corollary, it follows that on such loosely coupled multicomputers, programs with a coarse-grained communication/message-passing granularity will incur substantially less overhead.

    Figure 1.2 showed the relationships between the local operating system, the middleware implementing the distributed software, and the network protocol stack. Before moving on, we identify various classes of multiprocessor/multicomputer operating systems:

    • The operating system running on loosely coupled processors (i.e., heterogenous and/or geographically distant processors), which are themselves running loosely coupled software (i.e., software that is heterogenous), is classified as a network operating system. In this case, the application cannot run any significant distributed function that is not provided by the application layer of the network protocol stacks on the various processors.

    • The operating system running on loosely coupled processors, which are running tightly coupled software (i.e., the middleware software on the processors is homogenous), is classified as a distributed operating system.

    • The operating system running on tightly coupled processors, which are themselves running tightly coupled software, is classified as a multiprocessor operating system. Such a parallel system can run sophisticated algorithms contained in the tightly coupled software.

    1.5 Message-passing systems versus shared memory systems

    Shared memory systems are those in which there is a (common) shared address space throughout the system. Communication among processors takes place via shared data variables, and control variables for synchronization among the processors. Semaphores and monitors that were originally designed for shared memory uniprocessors and multiprocessors are examples of how synchronization can be achieved in shared memory systems. All multicomputer (NUMA as well as message-passing) systems that do not have a shared address space provided by the underlying architecture and hardware necessarily communicate by message passing. Conceptually, programmers find it easier to program using shared memory than by message passing. For this and several other reasons that we examine later, the abstraction called shared memory is sometimes provided to simulate a shared address space. For a distributed system, this abstraction is called distributed shared memory. Implementing this abstraction has a certain cost but it simplifies the task of the application programmer. There also exists a well-known folklore result that communication via message-passing can be simulated by communication via shared memory and vice-versa. Therefore, the two paradigms are equivalent.


    1.5.1 Emulating message-passing on a shared memory system (MP → SM)

    The shared address space can be partitioned into disjoint parts, one part being assigned to each processor. Send and receive operations can be implemented by writing to and reading from the destination/sender processor's address space, respectively. Specifically, a separate location can be reserved as the mailbox for each ordered pair of processes. A Pi–Pj message-passing can be emulated by a write by Pi to the mailbox and then a read by Pj from the mailbox. In the simplest case, these mailboxes can be assumed to have unbounded size. The write and read operations need to be controlled using synchronization primitives to inform the receiver/sender after the data has been sent/received.
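    A minimal sketch of this mailbox-based emulation, under two simplifying assumptions not made in the text: POSIX threads stand in for the processes Pi and Pj, and the mailbox has capacity one rather than unbounded size.

```c
#include <pthread.h>
#include <stdio.h>

/* The shared address space holds one mailbox per ordered pair of processes
 * (here a single Pi -> Pj mailbox). Send is a synchronized write into the
 * mailbox and Receive is a synchronized read from it. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    int             full;   /* 1 if the mailbox holds an undelivered message */
    int             data;
} mailbox_t;

static mailbox_t box = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 };

void mp_send(mailbox_t *m, int value) {       /* emulated Send(): write */
    pthread_mutex_lock(&m->lock);
    while (m->full)                           /* wait until Pj has read */
        pthread_cond_wait(&m->changed, &m->lock);
    m->data = value;
    m->full = 1;
    pthread_cond_signal(&m->changed);         /* inform the receiver */
    pthread_mutex_unlock(&m->lock);
}

int mp_receive(mailbox_t *m) {                /* emulated Receive(): read */
    pthread_mutex_lock(&m->lock);
    while (!m->full)                          /* wait until Pi has written */
        pthread_cond_wait(&m->changed, &m->lock);
    int value = m->data;
    m->full = 0;
    pthread_cond_signal(&m->changed);         /* inform the sender */
    pthread_mutex_unlock(&m->lock);
    return value;
}

static void *receiver(void *arg) {            /* process Pj */
    printf("Pj received %d\n", mp_receive(&box));
    return NULL;
}

int main(void) {                              /* process Pi */
    pthread_t pj;
    pthread_create(&pj, NULL, receiver, NULL);
    mp_send(&box, 42);
    pthread_join(pj, NULL);
    return 0;
}
```

    Compile with a C compiler and link against the pthread library (e.g., cc -pthread).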

    1.5.2 Emulating shared memory on a message-passing system (SM → MP)

    This involves the use of send and receive operations for write and read operations. Each shared location can be modeled as a separate process; a write to a shared location is emulated by sending an update message to the corresponding owner process; a read from a shared location is emulated by sending a query message to the owner process. As accessing another processor's memory requires send and receive operations, this emulation is expensive. Although emulating shared memory might seem to be more attractive from a programmer's perspective, it must be remembered that in a distributed system, it is only an abstraction. Thus, the latencies involved in read and write operations may be high even when using shared memory emulation because the read and write operations are implemented by using network-wide communication under the covers.

    An application can of course use a combination of shared memory and message-passing. In a MIMD message-passing multicomputer system, each processor may be a tightly coupled multiprocessor system with shared memory. Within the multiprocessor system, the processors communicate via shared memory. Between two computers, the communication is by message passing. As message-passing systems are more common and more suited for wide-area distributed systems, we will consider message-passing systems more extensively than we consider shared memory systems.
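    A minimal sketch of the owner-process emulation described in Section 1.5.2, expressed here with MPI message passing (using MPI is an assumption of this sketch; the text's description is independent of any particular library). Rank 0 owns one shared location x; a write is an update message to the owner and a read is a query message followed by the owner's reply.

```c
#include <mpi.h>
#include <stdio.h>

enum { TAG_WRITE = 1, TAG_READ = 2, TAG_REPLY = 3, TAG_STOP = 4 };

static void owner(void) {                      /* owns the shared location x */
    int x = 0, buf;
    MPI_Status st;
    for (;;) {
        MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_WRITE)  x = buf;                  /* update x     */
        else if (st.MPI_TAG == TAG_READ)                        /* reply with x */
            MPI_Send(&x, 1, MPI_INT, st.MPI_SOURCE, TAG_REPLY, MPI_COMM_WORLD);
        else break;                                             /* TAG_STOP     */
    }
}

static void sm_write(int value) {              /* emulated write to x */
    MPI_Send(&value, 1, MPI_INT, 0, TAG_WRITE, MPI_COMM_WORLD);
}

static int sm_read(void) {                     /* emulated read of x */
    int dummy = 0, value;
    MPI_Send(&dummy, 1, MPI_INT, 0, TAG_READ, MPI_COMM_WORLD);
    MPI_Recv(&value, 1, MPI_INT, 0, TAG_REPLY, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return value;
}

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        owner();
    } else if (rank == 1) {
        int stop = 0;
        sm_write(7);
        printf("rank 1 reads x = %d\n", sm_read());
        MPI_Send(&stop, 1, MPI_INT, 0, TAG_STOP, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```

    Run with at least two MPI processes (e.g., mpirun -np 2).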

    1.6 Primitives for distributed communication

    1.6.1 Blocking/non-blocking, synchronous/asynchronous primitives

    Message send and message receive communication primitives are denotedSend() and Receive(), respectively. A Send primitive has at least two param-eters the destination, and the buffer in the user space, containing the datato be sent. Similarly, a Receive primitive has at least two parameters the

  • 15 1.6 Primitives for distributed communication

    source from which the data is to be received (this could be a wildcard), andthe user buffer into which the data is to be received.There are two ways of sending data when the Send primitive is invoked

    the buffered option and the unbuffered option. The buffered option which isthe standard option copies the data from the user buffer to the kernel buffer.The data later gets copied from the kernel buffer onto the network. In theunbuffered option, the data gets copied directly from the user buffer onto thenetwork. For the Receive primitive, the buffered option is usually requiredbecause the data may already have arrived when the primitive is invoked, andneeds a storage place in the kernel.The following are some definitions of blocking/non-blocking and syn-

The following are some definitions of blocking/non-blocking and synchronous/asynchronous primitives [12]:

Synchronous primitives A Send or a Receive primitive is synchronous if both the Send() and Receive() handshake with each other. The processing for the Send primitive completes only after the invoking processor learns that the other corresponding Receive primitive has also been invoked and that the receive operation has been completed. The processing for the Receive primitive completes when the data to be received is copied into the receiver's user buffer.

Asynchronous primitives A Send primitive is said to be asynchronous if control returns back to the invoking process after the data item to be sent has been copied out of the user-specified buffer. It does not make sense to define asynchronous Receive primitives.

Blocking primitives A primitive is blocking if control returns to the invoking process after the processing for the primitive (whether in synchronous or asynchronous mode) completes.

Non-blocking primitives A primitive is non-blocking if control returns back to the invoking process immediately after invocation, even though the operation has not completed. For a non-blocking Send, control returns to the process even before the data is copied out of the user buffer. For a non-blocking Receive, control returns to the process even before the data may have arrived from the sender.

For non-blocking primitives, a return parameter on the primitive call returns a system-generated handle which can later be used to check the status of completion of the call. The process can check for the completion of the call in two ways. First, it can keep checking (in a loop or periodically) if the handle has been flagged or posted. Second, it can issue a Wait with a list of handles as parameters. The Wait call usually blocks until one of the parameter handles is posted. Typically, after issuing the primitive in non-blocking mode, the process performs whatever actions it can and then needs to know the status of completion of the call; therefore, using a blocking Wait() call is usual programming practice. The code for a non-blocking Send would look as shown in Figure 1.7.


Figure 1.7 A non-blocking send primitive. When the Wait call returns, at least one of its parameters is posted.

    Send(X, destination, handlek) // handlek is a return parameter

Wait(handle1, handle2, ..., handlek, ..., handlem) // Wait always blocks

If at the time that Wait() is issued, the processing for the primitive (whether synchronous or asynchronous) has completed, the Wait returns immediately. The completion of the processing of the primitive is detectable by checking the value of handlek. If the processing of the primitive has not completed, the Wait blocks and waits for a signal to wake it up. When the processing for the primitive completes, the communication subsystem software sets the value of handlek and wakes up (signals) any process with a Wait call blocked on this handlek. This is called posting the completion of the operation.
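As a concrete analogue of the code in Figure 1.7, the following C fragment uses the MPI library (discussed in Section 1.6.3): MPI_Isend returns a request handle, and MPI_Waitany blocks until one handle in a list is posted. The wrapper nonblocking_send_example and its parameters are hypothetical names used only for illustration.

#include <mpi.h>

/* Issue m non-blocking sends, do other work, then Wait on the list of handles. */
void nonblocking_send_example(double *bufs[], int counts[], int dests[], int m) {
    MPI_Request handles[m];
    for (int k = 0; k < m; k++)        /* each Isend returns at once with a handle */
        MPI_Isend(bufs[k], counts[k], MPI_DOUBLE, dests[k], 0,
                  MPI_COMM_WORLD, &handles[k]);

    /* ... the process can perform other useful work here ... */

    int index;
    MPI_Waitany(m, handles, &index, MPI_STATUS_IGNORE); /* blocks until one handle is posted */
    /* bufs[index] may now be reused; the remaining sends must still be completed
       with further Wait (or Test) calls before their buffers are reused. */
}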

There are therefore four versions of the Send primitive: synchronous blocking, synchronous non-blocking, asynchronous blocking, and asynchronous non-blocking. For the Receive primitive, there are the blocking synchronous and non-blocking synchronous versions. These versions of the primitives are illustrated in Figure 1.8 using a timing diagram.

Figure 1.8 Blocking/non-blocking and synchronous/asynchronous primitives [12]. Process Pi is sending and process Pj is receiving. (a) Blocking synchronous Send and blocking (synchronous) Receive. (b) Non-blocking synchronous Send and non-blocking (synchronous) Receive. (c) Blocking asynchronous Send. (d) Non-blocking asynchronous Send.

[Figure 1.8 legend: S = Send primitive issued, S_C = processing for Send completes, R = Receive primitive issued, R_C = processing for Receive completes, W = process may issue Wait to check completion of the non-blocking operation, P = completion of the previously initiated non-blocking operation; the shaded durations mark copying of data from or to the user buffer and the interval during which the issuing process is blocked.]


Here, three time lines are shown for each process: (1) for the process execution, (2) for the user buffer from/to which data is sent/received, and (3) for the kernel/communication subsystem.

Blocking synchronous Send (See Figure 1.8(a)) The data gets copied from the user buffer to the kernel buffer and is then sent over the network. After the data is copied to the receiver's system buffer and a Receive call has been issued, an acknowledgement back to the sender causes control to return to the process that invoked the Send operation and completes the Send.

Non-blocking synchronous Send (See Figure 1.8(b)) Control returns back to the invoking process as soon as the copy of data from the user buffer to the kernel buffer is initiated. A parameter in the non-blocking call also gets set with the handle of a location that the user process can later check for the completion of the synchronous send operation. The location gets posted after an acknowledgement returns from the receiver, as per the semantics described for (a). The user process can keep checking for the completion of the non-blocking synchronous Send by testing the returned handle, or it can invoke the blocking Wait operation on the returned handle (Figure 1.8(b)).

Blocking asynchronous Send (See Figure 1.8(c)) The user process that invokes the Send is blocked until the data is copied from the user's buffer to the kernel buffer. (For the unbuffered option, the user process that invokes the Send is blocked until the data is copied from the user's buffer to the network.)

Non-blocking asynchronous Send (See Figure 1.8(d)) The user process that invokes the Send is blocked until the transfer of the data from the user's buffer to the kernel buffer is initiated. (For the unbuffered option, the user process that invokes the Send is blocked until the transfer of the data from the user's buffer to the network is initiated.) Control returns to the user process as soon as this transfer is initiated, and a parameter in the non-blocking call also gets set with the handle of a location that the user process can check later using the Wait operation for the completion of the asynchronous Send operation. The asynchronous Send completes when the data has been copied out of the user's buffer. The checking for the completion may be necessary if the user wants to reuse the buffer from which the data was sent.

Blocking Receive (See Figure 1.8(a)) The Receive call blocks until the data expected arrives and is written in the specified user buffer. Then control is returned to the user process.

Non-blocking Receive (See Figure 1.8(b)) The Receive call will cause the kernel to register the call and return the handle of a location that the user process can later check for the completion of the non-blocking Receive operation. This location gets posted by the kernel after the expected data arrives and is copied to the user-specified buffer.


The user process can check for the completion of the non-blocking Receive by invoking the Wait operation on the returned handle. (If the data has already arrived when the call is made, it would be pending in some kernel buffer, and still needs to be copied to the user buffer.)
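A corresponding sketch of the non-blocking Receive, again using MPI purely as an example (the function name and buffer size are assumptions), pairs MPI_Irecv with a later blocking MPI_Wait on the returned handle.

#include <mpi.h>

/* Register a non-blocking Receive, work in parallel, then Wait on the handle. */
void nonblocking_receive_example(void) {
    double buf[1024];
    MPI_Request handle;
    MPI_Irecv(buf, 1024, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
              MPI_COMM_WORLD, &handle);      /* returns immediately with a handle */

    /* ... perform other instructions in parallel with the Receive ... */

    MPI_Status st;
    MPI_Wait(&handle, &st);   /* posted only after the data is in buf; only now
                                 may buf be read safely */
}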

A synchronous Send is easier to use from a programmer's perspective because the handshake between the Send and the Receive makes the communication appear instantaneous, thereby simplifying the program logic. The instantaneity is, of course, only an illusion, as can be seen from Figure 1.8(a) and (b). In fact, the Receive may not get issued until much after the data arrives at Pj, in which case the data arrived would have to be buffered in the system buffer at Pj and not in the user buffer. At the same time, the sender would remain blocked. Thus, a synchronous Send lowers the efficiency within process Pi.

The non-blocking asynchronous Send (see Figure 1.8(d)) is useful when a large data item is being sent because it allows the process to perform other instructions in parallel with the completion of the Send. The non-blocking synchronous Send (see Figure 1.8(b)) also avoids the potentially large delays for handshaking, particularly when the receiver has not yet issued the Receive call. The non-blocking Receive (see Figure 1.8(b)) is useful when a large data item is being received and/or when the sender has not yet issued the Send call, because it allows the process to perform other instructions in parallel with the completion of the Receive. Note that if the data has already arrived, it is stored in the kernel buffer, and it may take a while to copy it to the user buffer specified in the Receive call. For non-blocking calls, however, the burden on the programmer increases because he or she has to keep track of the completion of such operations in order to meaningfully reuse (write to or read from) the user buffers. Thus, conceptually, blocking primitives are easier to use.

    1.6.2 Processor synchrony

As opposed to the classification of synchronous and asynchronous communication primitives, there is also the classification of synchronous versus asynchronous processors. Processor synchrony indicates that all the processors execute in lock-step with their clocks synchronized. As this synchrony is not attainable in a distributed system, what is more generally indicated is that for a large granularity of code, usually termed as a step, the processors are synchronized. This abstraction is implemented using some form of barrier synchronization to ensure that no processor begins executing the next step of code until all the processors have completed executing the previous steps of code assigned to each of the processors.
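A minimal C sketch of this step-level synchrony, assuming MPI's collective barrier is available (the function name run_in_steps is an illustrative assumption), is the following.

#include <mpi.h>

/* Execute the code in steps: no process starts step s+1 until every process has
   completed step s. */
void run_in_steps(int num_steps) {
    for (int s = 0; s < num_steps; s++) {
        /* ... execute the code assigned to this process for step s ... */
        MPI_Barrier(MPI_COMM_WORLD);   /* barrier synchronization ends the step */
    }
}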


    1.6.3 Libraries and standards

The previous subsections identified the main principles underlying all communication primitives. In this subsection, we briefly mention some publicly available interfaces that embody some of the above concepts. There exists a wide range of primitives for message-passing.

Many commercial software products (banking, payroll, etc., applications) use proprietary primitive libraries supplied with the software marketed by the vendors (e.g., the IBM CICS software, which has a very widely installed customer base worldwide, uses its own primitives). The message-passing interface (MPI) library [20, 30] and the PVM (parallel virtual machine) library [31] are used largely by the scientific community, but other alternative libraries exist. Commercial software is often written using the remote procedure calls (RPC) mechanism [1, 6] in which procedures that potentially reside across the network are invoked transparently to the user, in the same manner that a local procedure is invoked [1, 6]. Under the covers, socket primitives or socket-like transport layer primitives are invoked to call the procedure remotely. There exist many implementations of RPC [1, 7, 11], for example, Sun RPC, and distributed computing environment (DCE) RPC. Messaging and streaming are two other mechanisms for communication. With the growth of object based software, libraries for remote method invocation (RMI) and remote object invocation (ROI) with their own set of primitives are being proposed and standardized by different agencies [7]. CORBA (common object request broker architecture) [36] and DCOM (distributed component object model) [7] are two other standardized architectures with their own set of primitives. Additionally, several projects in the research stage are designing their own flavour of communication primitives.

    1.7 Synchronous versus asynchronous executions

In addition to the two classifications of processor synchrony/asynchrony and of synchronous/asynchronous communication primitives, there is another classification, namely that of synchronous/asynchronous executions.

An asynchronous execution is an execution in which (i) there is no processor synchrony and there is no bound on the drift rate of processor clocks, (ii) message delays (transmission + propagation times) are finite but unbounded, and (iii) there is no upper bound on the time taken by a process to execute a step. An example asynchronous execution with four processes P0 to P3 is shown in Figure 1.9. The arrows denote the messages; the tail and head of an arrow mark the send and receive event for that message, denoted by a circle and vertical line, respectively. Non-communication events, also termed as internal events, are shown by shaded circles.



Figure 1.9 An example of an asynchronous execution in a message-passing system. A timing diagram is used to illustrate the execution.


A synchronous execution is an execution in which (i) processors are synchronized and the clock drift rate between any two processors is bounded, (ii) message delivery (transmission + delivery) times are such that they occur in one logical step or round, and (iii) there is a known upper bound on the time taken by a process to execute a step. An example of a synchronous execution with four processes P0 to P3 is shown in Figure 1.10. The arrows denote the messages.

It is easier to design and verify algorithms assuming synchronous executions because of the coordinated nature of the executions at all the processes. However, there is a hurdle to having a truly synchronous execution. It is practically difficult to build a completely synchronous system, and have the messages delivered within a bounded time. Therefore, this synchrony has to be simulated under the covers, and will inevitably involve delaying or blocking some processes for some time durations. Thus, synchronous execution is an abstraction that needs to be provided to the programs. When implementing this abstraction, observe that the fewer the steps or synchronizations of the processors, the lower the delays and costs. If processors are allowed to have an asynchronous execution for a period of time and then they synchronize, then the granularity of the synchrony is coarse. This is really a virtually synchronous execution, and the abstraction is sometimes termed as virtual synchrony. Ideally, many programs want the processes to execute a series of instructions in rounds (also termed as steps or phases) asynchronously, with the requirement that after each round/step/phase, all the processes should be synchronized and all messages sent should be delivered. This is the commonly understood notion of a synchronous execution. Within each round/phase/step, there may be a finite and bounded number of sequential sub-rounds (or sub-phases or sub-steps) that processes execute.

Figure 1.10 An example of a synchronous execution in a message-passing system. All the messages sent in a round are received within that same round.



Each sub-round is assumed to send at most one message per process; hence the message(s) sent will reach in a single message hop.

The timing diagram of an example synchronous execution is shown in Figure 1.10. In this system, there are four nodes P0 to P3. In each round, process Pi sends a message to P(i+1) mod 4 and P(i−1) mod 4 and calculates some application-specific function on the received values, as sketched below.
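The round structure of Figure 1.10 can be sketched in C with MPI as follows; combine() stands for the unspecified application-specific function, and the generalization from four processes to n processes on a ring, as well as all names used, are assumptions made only for this illustration.

#include <mpi.h>

/* Placeholder for the application-specific function applied to received values. */
static double combine(double mine, double left, double right) {
    return (mine + left + right) / 3.0;        /* e.g., simple averaging */
}

void synchronous_rounds(int num_rounds) {
    int rank, n;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n);
    int next = (rank + 1) % n, prev = (rank - 1 + n) % n;

    double value = (double)rank, from_prev, from_next;
    for (int r = 0; r < num_rounds; r++) {
        /* Send to the successor and receive from the predecessor ... */
        MPI_Sendrecv(&value, 1, MPI_DOUBLE, next, 0,
                     &from_prev, 1, MPI_DOUBLE, prev, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... then send to the predecessor and receive from the successor. */
        MPI_Sendrecv(&value, 1, MPI_DOUBLE, prev, 1,
                     &from_next, 1, MPI_DOUBLE, next, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* All messages of this round have been received before the next round. */
        value = combine(value, from_prev, from_next);
    }
}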

1.7.1 Emulating an asynchronous system by a synchronous system (A → S)
An asynchronous program (written for an asynchronous system) can be emulated on a synchronous system fairly trivially, as the synchronous system is a special case of an asynchronous system: all communication finishes within the same round in which it is initiated.

1.7.2 Emulating a synchronous system by an asynchronous system (S → A)
A synchronous program (written for a synchronous system) can be emulated on an asynchronous system using a tool called a synchronizer, to be studied in Chapter 5.

    1.7.3 Emulations

Section 1.5 showed how a shared memory system could be emulated by a message-passing system, and vice-versa. We now have four broad classes of programs, as shown in Figure 1.11. Using the emulations shown, any class can be emulated by any other. If system A can be emulated by system B, denoted A/B, and if a problem is not solvable in B, then it is also not solvable in A. Likewise, if a problem is solvable in A, it is also solvable in B. Hence, in a sense, all four classes are equivalent in terms of computability (what can and cannot be computed) in failure-free systems.

Figure 1.11 Emulations among the principal system classes in a failure-free system.

[The figure shows the four classes: asynchronous message-passing (AMP), synchronous message-passing (SMP), asynchronous shared memory (ASM), and synchronous shared memory (SSM), connected pairwise by the MP→SM, SM→MP, A→S, and S→A emulations.]


However, in fault-prone systems, as we will see in Chapter 14, this is not the case; a synchronous system offers more computability than an asynchronous system.

    1.8 Design issues and challenges

Distributed computing systems have been in widespread existence since the 1970s when the Internet and ARPANET came into being. At the time, the primary issues in the design of the distributed systems included providing access to remote data in the face of failures, file system design, and directory structure design. While these continue to be important issues, many newer issues have surfaced as the widespread proliferation of the high-speed high-bandwidth internet and distributed applications continues rapidly.

Below we describe the important design issues and challenges after categorizing them as (i) having a greater component related to systems design and operating systems design, or (ii) having a greater component related to algorithm design, or (iii) emerging from recent technology advances and/or driven by new applications. There is some overlap between these categories. However, it is useful to identify these categories because of the chasm among (i) the systems community, (ii) the theoretical algorithms community within distributed computing, and (iii) the forces driving the emerging applications and technology. For example, the current practice of distributed computing follows the client-server architecture to a large degree, whereas that receives scant attention in the theoretical distributed algorithms community. Two reasons for this chasm are as follows. First, an overwhelming number of applications outside the scientific computing community of users of distributed systems are business applications for which simple models are adequate. For example, the client-server model has been firmly entrenched with the legacy applications first developed by the Blue Chip companies (e.g., HP, IBM, Wang, DEC [now Compaq], Microsoft) since the 1970s and 1980s. This model is largely adequate for traditional business applications. Second, the state of the practice is largely controlled by industry standards, which do not necessarily choose the technically best solution.

    1.8.1 Distributed systems challenges from a system perspective

The following functions must be addressed when designing and building a distributed system:

Communication This task involves designing appropriate mechanisms for communication among the processes in the network.


Some example mechanisms are: remote procedure call (RPC), remote object invocation (ROI), and message-oriented communication versus stream-oriented communication.

Processes Some of the issues involved are: management of processes and threads at clients/servers; code migration; and the design of software and mobile agents.

Naming Devising easy-to-use and robust schemes for names, identifiers, and addresses is essential for locating resources and processes in a transparent and scalable manner. Naming in mobile systems provides additional challenges because naming cannot easily be tied to any static geographical topology.

Synchronization Mechanisms for synchronization or coordination among the processes are essential. Mutual exclusion is the classical example of synchronization, but many other forms of synchronization, such as leader election, are also needed. In addition, synchronizing physical clocks, and devising logical clocks that capture the essence of the passage of time, as well as global state recording algorithms, all require different forms of synchronization.

Data storage and access Schemes for data storage, and implicitly for accessing the data in a fast and scalable manner across the network, are important for efficiency. Traditional issues such as file system design have to be reconsidered in the setting of a distributed system.

Consistency and replication To avoid bottlenecks, to provide fast access to data, and to provide scalability, replication of data objects is highly desirable. This leads to issues of managing the replicas, and dealing with consistency among the replicas/caches in a distributed setting. A simple example issue is deciding the level of granularity (i.e., size) of data access.

Fault tolerance Fault tolerance requires maintaining correct and efficient operation in spite of any failures of links, nodes, and processes. Process resilience, reliable communication, distributed commit, checkpointing and recovery, agreement and consensus, failure detection, and self-stabilization are some of the mechanisms to provide fault-tolerance.

Security Distributed systems security involves various aspects of cryptography, secure channels, access control, key management (generation and distribution), authorization, and secure group management.

Applications Programming Interface (API) and transparency The API for communication and other specialized services is important for the ease of use and wider adoption of the distributed systems services by non-technical users. Transparency deals with hiding the implementation policies from the user, and can be classified as follows [33]. Access transparency hides differences in data representation on different systems and provides uniform operations to access system resources. Location transparency makes the locations of resources transparent to the users. Migration transparency allows relocating resources without changing names.


The ability to relocate the resources as they are being accessed is relocation transparency. Replication transparency does not let the user become aware of any replication. Concurrency transparency deals with masking the concurrent use of shared resources for the user. Failure transparency refers to the system being reliable and fault-tolerant.

Scalability and modularity The algorithms, data (objects), and services must be as distributed as possible. Various techniques such as replication, caching and cache management, and asynchronous processing help to achieve scalability.

Some of the recent experiments in designing large-scale distributed systems include the Globe project at Vrije University [35], and the Globus project [15]. The Grid infrastructure for large-scale distributed computing is a very ambitious project that has gained significant attention to date [16, 17]. All these projects attempt to provide the above-listed functions as efficiently as possible.

    1.8.2 Algorithmic challenges in distributed computing

The previous section addresses the challenges in designing distributed systems from a system building perspective. In this section, we briefly summarize the key algorithmic challenges in distributed computing.

Designing useful execution models and frameworks
The interleaving model and partial order model are two widely adopted models of distributed system executions. They have proved to be particularly useful for operational reasoning and the design of distributed algorithms. The input/output automata model [25] and the TLA (temporal logic of actions) are two other examples of models that provide different degrees of infrastructure for reasoning more formally with and proving the correctness of distributed programs.

Dynamic distributed graph algorithms and distributed routing algorithms
The distributed system is modeled as a distributed graph, and the graph algorithms form the building blocks for a large number of higher level communication, data dissemination, object location, and object search functions. The algorithms need to deal with dynamically changing graph characteristics, such as to model varying link loads in a routing algorithm. The efficiency of these algorithms impacts not only the user-perceived latency but also the traffic and hence the load or congestion in the network. Hence, the design of efficient distributed graph algorithms is of paramount importance.

Time and global state in a distributed system
The processes in the system are spread across three-dimensional physical space.


Another dimension, time, has to be superimposed uniformly across space. The challenges pertain to providing accurate physical time, and to providing a variant of time, called logical time. Logical time is relative time, and eliminates the overheads of providing physical time for applications where physical time is not required. More importantly, logical time can (i) capture the logic and inter-process dependencies within the distributed program, and also (ii) track the relative progress at each process.

Observing the global state of the system (across space) also involves the time dimension for consistent observation. Due to the inherent distributed nature of the system, it is not possible for any one process to directly observe a meaningful global state across all the processes, without using extra state-gathering effort which needs to be done in a coordinated manner.

Deriving appropriate measures of concurrency also involves the time dimension, as judging the independence of different threads of execution depends not only on the program logic but also on execution speeds within the logical threads, and communication speeds among threads.

Synchronization/coordination mechanisms
The processes must be allowed to execute concurrently, except when they need to synchronize to exchange information, i.e., communicate about shared data. Synchronization is essential for the distributed processes to overcome the limited observation of the system state from the viewpoint of any one process. Overcoming this limited observation is necessary for taking any actions that would impact other processes. The synchronization mechanisms can also be viewed as resource management and concurrency management mechanisms to streamline the behavior of the processes that would otherwise act independently. Here are some examples of problems requiring synchronization:

Physical clock synchronization Physical clocks usually diverge in their values due to hardware limitations. Keeping them synchronized is a fundamental challenge to maintain common time.

Leader election All the processes need to agree on which process will play the role of a distinguished process, called a leader process. A leader is necessary even for many distributed algorithms because there is often some asymmetry, as in initiating some action like a broadcast or collecting the state of the system, or in regenerating a token that gets lost in the system.

Mutual exclusion This is clearly a synchronization problem because access to the critical resource(s) has to be coordinated.

Deadlock detection and resolution Deadlock detection should be coordinated to avoid duplicate work, and deadlock resolution should be coordinated to avoid unnecessary aborts of processes.


Termination detection This requires cooperation among the processes to detect the specific global state of quiescence.

Garbage collection Garbage refers to objects that are no longer in use and that are not pointed to by any other process. Detecting garbage requires coordination among the processes.

Group communication, multicast, and ordered message delivery
A group is a collection of processes that share a common context and collaborate on a common task within an application domain. Specific algorithms need to be designed to enable efficient group communication and group management wherein processes can join and leave groups dynamically, or even fail. When multiple processes send messages concurrently, different recipients may receive the messages in different orders, possibly violating the semantics of the distributed program. Hence, formal specifications of the semantics of ordered delivery need to be formulated, and then implemented.

Monitoring distributed events and predicates
Predicates defined on program variables that are local to different processes are used for specifying conditions on the global system state, and are useful for applications such as debugging, sensing the environment, and in industrial process control. On-line algorithms for monitoring such predicates are hence important. An important paradigm for monitoring distributed events is that of event streaming, wherein streams of relevant events reported from different processes are examined collectively to detect predicates. Typically, the specification of such predicates uses physical or logical time relationships.

Distributed program design and verification tools
Methodically designed and verifiably correct programs can greatly reduce the overhead of software design, debugging, and engineering. Designing mechanisms to achieve these design and verification goals is a challenge.

Debugging distributed programs
Debugging sequential programs is hard; debugging distributed programs is that much harder because of the concurrency in actions and the ensuing uncertainty due to the large number of possible executions defined by the interleaved concurrent actions. Adequate debugging mechanisms and tools need to be designed to meet this challenge.

Data replication, consistency models, and caching
Fast access to data and other resources requires them to be replicated in the distributed system. Managing such replicas in the face of updates introduces the problems of ensuring consistency among the replicas and cached copies. Additionally, placement of the replicas in the systems is also a challenge because resources usually cannot be freely replicated.


World Wide Web design: caching, searching, scheduling
The Web is an example of a widespread distributed system with a direct interface to the end user, wherein the operations are predominantly read-intensive on most objects. The issues of object replication and caching discussed above have to be tailored to the web. Further, prefetching of objects, when access patterns and other characteristics of the objects are known, can also be performed. An example of where prefetching can be used is the case of subscribing to Content Distribution Servers. Minimizing response time to minimize user-perceived latencies is an important challenge. Object search and navigation on the web are important functions in the operation of the web, and are very resource-intensive. Designing mechanisms to do this efficiently and accurately is a great challenge.

Distributed shared memory abstraction
A shared memory abstraction simplifies the task of the programmer because he or she has to deal only with read and write operations, and no message communication primitives. However, under the covers in the middleware layer, the abstraction of a shared address space has to be implemented by using message-passing. Hence, in terms of overhea