
ALGORITHMS
OF INFORMATICS

Volume 2

AnTonCom, Budapest, 2010

This electronic book was prepared in the framework of project Eastern Hungarian Informatics Books Repository no. TÁMOP-4.1.2-08/1/A-2009-0046.

This electronic book appeared with the support of the European Union and with the co-financing of the European Social Fund.

    Editor: Antal Iványi

Authors of Volume 1: László Lovász (Preface), Antal Iványi (Introduction), Zoltán Kása (Chapter 1), Zoltán Csörnyei (Chapter 2), Ulrich Tamm (Chapter 3), Péter Gács (Chapter 4), Gábor Ivanyos and Lajos Rónyai (Chapter 5), Antal Járai and Attila Kovács (Chapter 6), Jörg Rothe (Chapters 7 and 8), Csanád Imreh (Chapter 9), Ferenc Szidarovszky (Chapter 10), Zoltán Kása (Chapter 11), Aurél Galántai and András Jeney (Chapter 12).

Validators of Volume 1: Zoltán Fülöp (Chapter 1), Pál Dömösi (Chapter 2), Sándor Fridli (Chapter 3), Anna Gál (Chapter 4), Attila Pethő (Chapter 5), Lajos Rónyai (Chapter 6), János Gonda (Chapter 7), Gábor Ivanyos (Chapter 8), Béla Vizvári (Chapter 9), János Mayer (Chapter 10), András Recski (Chapter 11), Tamás Szántai (Chapter 12), Anna Iványi (Bibliography).

Authors of Volume 2: Burkhard Englert, Dariusz Kowalski, Grzegorz Malewicz, and Alexander Shvartsman (Chapter 13), Tibor Gyires (Chapter 14), Claudia Fohry and Antal Iványi (Chapter 15), Eberhard Zehendner (Chapter 16), Ádám Balogh and Antal Iványi (Chapter 17), János Demetrovics and Attila Sali (Chapters 18 and 19), Attila Kiss (Chapter 20), István Miklós (Chapter 21), László Szirmay-Kalos (Chapter 22), Ingo Althöfer and Stefan Schwarz (Chapter 23).

Validators of Volume 2: István Majzik (Chapter 13), János Sztrik (Chapter 14), Dezső Sima (Chapters 15 and 16), László Varga (Chapter 17), Attila Kiss (Chapters 18 and 19), András Benczúr (Chapter 20), István Katsányi (Chapter 21), János Vida (Chapter 22), Tamás Szántai (Chapter 23), Anna Iványi (Bibliography).

Cover art: Victor Vasarely, Dirac, 1978. With the permission of the Museum of Fine Arts, Budapest. The film used is due to GOMA ZRt.

    Cover design by Antal Iványi

© 2010 AnTonCom Infokommunikációs Kft.
Homepage: http://www.antoncom.hu/


Contents

IV. COMPUTER NETWORKS  591

13. Distributed Algorithms  592
13.1. Message passing systems and algorithms  593
13.1.1. Modeling message passing systems  593
13.1.2. Asynchronous systems  593
13.1.3. Synchronous systems  594
13.2. Basic algorithms  595
13.2.1. Broadcast  595
13.2.2. Construction of a spanning tree  596
13.3. Ring algorithms  600
13.3.1. The leader election problem  600
13.3.2. The leader election algorithm  601
13.3.3. Analysis of the leader election algorithm  604
13.4. Fault-tolerant consensus  607
13.4.1. The consensus problem  607
13.4.2. Consensus with crash failures  608
13.4.3. Consensus with Byzantine failures  609
13.4.4. Lower bound on the ratio of faulty processors  610
13.4.5. A polynomial algorithm  610
13.4.6. Impossibility in asynchronous systems  611
13.5. Logical time, causality, and consistent state  612
13.5.1. Logical time  613
13.5.2. Causality  614
13.5.3. Consistent state  617
13.6. Communication services  619
13.6.1. Properties of broadcast services  619
13.6.2. Ordered broadcast services  621
13.6.3. Multicast services  625
13.7. Rumor collection algorithms  626
13.7.1. Rumor collection problem and requirements  626
13.7.2. Efficient gossip algorithms  627
13.8. Mutual exclusion in shared memory  634


13.8.1. Shared memory systems  634
13.8.2. The mutual exclusion problem  634
13.8.3. Mutual exclusion using powerful primitives  635
13.8.4. Mutual exclusion using read/write registers  636
13.8.5. Lamport’s fast mutual exclusion algorithm  640

14. Network Simulation  644
14.1. Types of simulation  644
14.2. The need for communications network modelling and simulation  645
14.3. Types of communications networks, modelling constructs  647
14.4. Performance targets for simulation purposes  649
14.5. Traffic characterisation  652
14.6. Simulation modelling systems  660
14.6.1. Data collection tools and network analysers  660
14.6.2. Model specification  660
14.6.3. Data collection and simulation  660
14.6.4. Analysis  661
14.6.5. Network Analysers  662
14.6.6. Sniffer  669
14.7. Model Development Life Cycle (MDLC)  669
14.8. Modelling of traffic burstiness  675
14.8.1. Model parameters  680
14.8.2. Implementation of the Hurst parameter  681
14.8.3. Validation of the baseline model  683
14.8.4. Consequences of traffic burstiness  686
14.8.5. Conclusion  690
14.9. Appendix A  690
14.9.1. Measurements for link utilisation  690
14.9.2. Measurements for message delays  690

15. Parallel Computations  703
15.1. Parallel architectures  705
15.1.1. SIMD architectures  705
15.1.2. Symmetric multiprocessors  706
15.1.3. Cache-coherent NUMA architectures  707
15.1.4. Non-cache-coherent NUMA architectures  707
15.1.5. No remote memory access architectures  708
15.1.6. Clusters  708
15.1.7. Grids  708
15.2. Performance in practice  709
15.3. Parallel programming  713
15.3.1. MPI programming  714
15.3.2. OpenMP programming  717
15.3.3. Other programming models  719
15.4. Computational models  720
15.4.1. PRAM  720
15.4.2. BSP, LogP and QSM  721


15.4.3. Mesh, hypercube and butterfly  722
15.5. Performance in theory  724
15.6. PRAM algorithms  728
15.6.1. Prefix  729
15.6.2. Ranking  735
15.6.3. Merge  737
15.6.4. Selection  741
15.6.5. Sorting  746
15.7. Mesh algorithms  749
15.7.1. Prefix on chain  749
15.7.2. Prefix on square  750

16. Systolic Systems  754
16.1. Basic concepts of systolic systems  755
16.1.1. An introductory example: matrix product  755
16.1.2. Problem parameters and array parameters  756
16.1.3. Space coordinates  757
16.1.4. Serialising generic operators  758
16.1.5. Assignment-free notation  759
16.1.6. Elementary operations  760
16.1.7. Discrete timesteps  760
16.1.8. External and internal communication  761
16.1.9. Pipelining  763
16.2. Space-time transformation and systolic arrays  764
16.2.1. Further example: matrix product  764
16.2.2. The space-time transformation as a global view  765
16.2.3. Parametric space coordinates  767
16.2.4. Symbolically deriving the running time  770
16.2.5. How to unravel the communication topology  770
16.2.6. Inferring the structure of the cells  771
16.3. Input/output schemes  773
16.3.1. From data structure indices to iteration vectors  774
16.3.2. Snapshots of data structures  775
16.3.3. Superposition of input/output schemes  776
16.3.4. Data rates induced by space-time transformations  777
16.3.5. Input/output expansion  777
16.3.6. Coping with stationary variables  778
16.3.7. Interleaving of calculations  779
16.4. Control  781
16.4.1. Cells without control  781
16.4.2. Global control  782
16.4.3. Local control  783
16.4.4. Distributed control  786
16.4.5. The cell program as a local view  790
16.5. Linear systolic arrays  794
16.5.1. Matrix-vector product  794
16.5.2. Sorting algorithms  795


16.5.3. Lower triangular linear equation systems  796

V. DATA BASES  798

17. Memory Management  799
17.1. Partitioning  799
17.1.1. Fixed partitions  800
17.1.2. Dynamic partitions  806
17.2. Page replacement algorithms  813
17.2.1. Static page replacement  815
17.2.2. Dynamic paging  822
17.3. Anomalies  824
17.3.1. Page replacement  825
17.3.2. Scheduling with lists  826
17.3.3. Parallel processing with interleaved memory  833
17.3.4. Avoiding the anomaly  837
17.4. Optimal file packing  837
17.4.1. Approximation algorithms  838
17.4.2. Optimal algorithms  841
17.4.3. Shortening of lists (SL)  842
17.4.4. Upper and lower estimations (ULE)  842
17.4.5. Pairwise comparison of the algorithms  843
17.4.6. The error of approximate algorithms  845

18. Relational Data Base Design  850
18.1. Functional dependencies  851
18.1.1. Armstrong-axioms  851
18.1.2. Closures  852
18.1.3. Minimal cover  855
18.1.4. Keys  857
18.2. Decomposition of relational schemata  859
18.2.1. Lossless join  860
18.2.2. Checking the lossless join property  860
18.2.3. Dependency preserving decompositions  864
18.2.4. Normal forms  867
18.2.5. Multivalued dependencies  872
18.3. Generalised dependencies  878
18.3.1. Join dependencies  878
18.3.2. Branching dependencies  879

19. Query Rewriting in Relational Databases  883
19.1. Queries  883
19.1.1. Conjunctive queries  885
19.1.2. Extensions  890
19.1.3. Complexity of query containment  898
19.2. Views  902
19.2.1. View as a result of a query  902
19.3. Query rewriting  905


19.3.1. Motivation  905
19.3.2. Complexity problems of query rewriting  910
19.3.3. Practical algorithms  913

20. Semi-structured Databases  932
20.1. Semi-structured data and XML  932
20.2. Schemas and simulations  934
20.3. Queries and indexes  939
20.4. Stable partitions and the PT-algorithm  945
20.5. A(k)-indexes  952
20.6. D(k)- and M(k)-indexes  954
20.7. Branching queries  961
20.8. Index refresh  965

VI. APPLICATIONS  972

21. Bioinformatics  973
21.1. Algorithms on sequences  973
21.1.1. Distances of two sequences using linear gap penalty  973
21.1.2. Dynamic programming with arbitrary gap function  976
21.1.3. Gotoh algorithm for affine gap penalty  977
21.1.4. Concave gap penalty  977
21.1.5. Similarity of two sequences, the Smith-Waterman algorithm  980
21.1.6. Multiple sequence alignment  981
21.1.7. Memory-reduction with the Hirschberg algorithm  983
21.1.8. Memory-reduction with corner-cutting  984
21.2. Algorithms on trees  986
21.2.1. The small parsimony problem  986
21.2.2. The Felsenstein algorithm  987
21.3. Algorithms on stochastic grammars  989
21.3.1. Hidden Markov Models  989
21.3.2. Stochastic context-free grammars  991
21.4. Comparing structures  994
21.4.1. Aligning labelled, rooted trees  994
21.4.2. Co-emission probability of two HMMs  995
21.5. Distance based algorithms for constructing evolutionary trees  997
21.5.1. Clustering algorithms  998
21.5.2. Neighbour joining  1001
21.6. Miscellaneous topics  1005
21.6.1. Genome rearrangement  1006
21.6.2. Shotgun sequencing  1007

22. Computer Graphics  1012
22.1. Fundamentals of analytic geometry  1012
22.1.1. Cartesian coordinate system  1013
22.2. Description of point sets with equations  1013
22.2.1. Solids  1014
22.2.2. Surfaces  1014


22.2.3. Curves  1015
22.2.4. Normal vectors  1016
22.2.5. Curve modelling  1017
22.2.6. Surface modelling  1022
22.2.7. Solid modelling with blobs  1023
22.2.8. Constructive solid geometry  1024
22.3. Geometry processing and tessellation algorithms  1026
22.3.1. Polygon and polyhedron  1026
22.3.2. Vectorization of parametric curves  1027
22.3.3. Tessellation of simple polygons  1027
22.3.4. Tessellation of parametric surfaces  1029
22.3.5. Subdivision curves and meshes  1031
22.3.6. Tessellation of implicit surfaces  1033
22.4. Containment algorithms  1035
22.4.1. Point containment test  1035
22.4.2. Polyhedron-polyhedron collision detection  1039
22.4.3. Clipping algorithms  1040
22.5. Translation, distortion, geometric transformations  1044
22.5.1. Projective geometry and homogeneous coordinates  1045
22.5.2. Homogeneous linear transformations  1049
22.6. Rendering with ray tracing  1052
22.6.1. Ray surface intersection calculation  1054
22.6.2. Speeding up the intersection calculation  1056
22.7. Incremental rendering  1070
22.7.1. Camera transformation  1071
22.7.2. Normalizing transformation  1073
22.7.3. Perspective transformation  1074
22.7.4. Clipping in homogeneous coordinates  1076
22.7.5. Viewport transformation  1077
22.7.6. Rasterization algorithms  1078
22.7.7. Incremental visibility algorithms  1084

23. Human-Computer Interaction  1093
23.1. Multiple-choice systems  1093
23.1.1. Examples of multiple-choice systems  1094
23.2. Generating multiple candidate solutions  1097
23.2.1. Generating candidate solutions with heuristics  1097
23.2.2. Penalty method with exact algorithms  1100
23.2.3. The linear programming - penalty method  1108
23.2.4. Penalty method with heuristics  1112
23.3. More algorithms for interactive problem solving  1113
23.3.1. Anytime algorithms  1114
23.3.2. Interactive evolution and generative design  1115
23.3.3. Successive fixing  1115
23.3.4. Interactive multicriteria decision making  1115


23.3.5. Miscellaneous  1116

Bibliography  1118

Index  1129

Name Index  1140

IV. COMPUTER NETWORKS

13. Distributed Algorithms

We define a distributed system as a collection of individual computing devices that can communicate with each other. This definition is very broad: it includes anything from a VLSI chip, to a tightly coupled multiprocessor, to a local area cluster of workstations, to the Internet. Here we focus on more loosely coupled systems. In a distributed system as we view it, each processor has its own semi-independent agenda, but for various reasons, such as sharing of resources, availability, and fault-tolerance, processors need to coordinate their actions.

Distributed systems are highly desirable, but it is notoriously difficult to construct efficient distributed algorithms that perform well in realistic system settings. These difficulties are not just practical; they are fundamental. In particular, many of them are introduced by three factors: asynchrony, limited local knowledge, and failures. Asynchrony means that global time may not be available, and that both the absolute and the relative times at which events take place at individual computing devices often cannot be known precisely. Moreover, each computing device is aware only of the information it receives, and therefore has an inherently local view of the global status of the system. Finally, computing devices and network components may fail independently, so that some remain functional while others do not.

We will begin by describing the models used to analyse distributed systems in the message-passing model of computation. We present and analyse selected distributed algorithms based on these models. We include a discussion of fault-tolerance in distributed systems and consider several algorithms for reaching agreement in the message-passing model in settings prone to failures. Given that global time is often unavailable in distributed systems, we present approaches for providing logical time that allows one to reason about causality and consistent states in distributed systems. Moving on to more advanced topics, we present a spectrum of broadcast services often considered in distributed systems, together with algorithms implementing these services. We also present advanced rumor gathering algorithms. Finally, we consider the mutual exclusion problem in the shared-memory model of distributed computation.


    13.1. Message passing systems and algorithms

We present our first model of distributed computation: message passing systems without failures. We consider both synchronous and asynchronous systems, and present selected algorithms for message passing systems with arbitrary network topology in both settings.

    13.1.1. Modeling message passing systems

In a message passing system, processors communicate by sending messages over communication channels, where each channel provides a bidirectional connection between two specific processors. The pattern of connections described by the channels is called the topology of the system. This topology is represented by an undirected graph, where each node represents a processor, and an edge is present between two nodes if and only if there is a channel between the two processors represented by the nodes. The collection of channels is also called the network. An algorithm for such a message passing system with a specific topology consists of a local program for each processor in the system. The local program enables the processor to perform local computations, and to send messages to and receive messages from each of its neighbours in the given topology.
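To make the model concrete, the following minimal Python sketch (the helper name build_topology and the example graph are ours, for illustration) represents a topology as an undirected graph and derives each processor's neighbour set from a list of bidirectional channels.

from collections import defaultdict

def build_topology(edges):
    """Return each processor's neighbour set, given the bidirectional channels."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)  # each channel connects two specific processors
        neighbours[v].add(u)  # and carries messages in both directions
    return dict(neighbours)

# Example: a ring of four processors with one chord.
topology = build_topology([(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)])
assert topology[0] == {1, 2, 3}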

Each processor in the system is modeled as a possibly infinite state machine. A configuration is a vector C = (q0, ..., qn−1) where each qi is the state of a processor pi. Activities that can take place in the system are modeled as events (or actions) that describe indivisible system operations. Examples of events include local computation events and delivery events, where a processor receives a message. The behaviour of the system over time is modeled as an execution, a (finite or infinite) sequence of configurations (Ci) alternating with events (ai): C0, a1, C1, a2, C2, .... Executions must satisfy a variety of conditions that are used to represent the correctness properties, depending on the system being modeled. These conditions can be classified as either safety or liveness conditions. A safety condition for a system is a condition that must hold in every finite prefix of any execution of the system; informally, it states that nothing bad has happened yet. A liveness condition is a condition that must hold a certain (possibly infinite) number of times; informally, it states that eventually something good must happen. An important liveness condition is fairness, which requires that an (infinite) execution contain infinitely many actions by a processor, unless after some configuration no actions are enabled at that processor.
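The prefix-based definitions lend themselves to a small illustration. In the sketch below (the recorded run and both predicates are invented for illustration), a safety condition is checked on every finite prefix of an execution, while a liveness condition can only be confirmed, never refuted, on a finite prefix.

def satisfies_safety(execution, condition):
    # a safety condition must hold in every finite prefix of the execution
    return all(condition(execution[:k + 1]) for k in range(len(execution)))

def held_so_far(execution, condition):
    # liveness ("eventually something good") can only be confirmed, never
    # refuted, by inspecting a finite prefix
    return any(condition(execution[:k + 1]) for k in range(len(execution)))

# An execution recorded as a list of events; both predicates are illustrative.
run = ['step', 'step', 'elect(3)', 'step']
assert satisfies_safety(run, lambda p: p.count('elect(3)') <= 1)  # at most one election
assert held_so_far(run, lambda p: 'elect(3)' in p)                # a leader was elected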

    13.1.2. Asynchronous systems

We say that a system is asynchronous if there is no fixed upper bound on how long it takes for a message to be delivered or how much time elapses between consecutive steps of a processor. An obvious example of such an asynchronous system is the Internet. In an implementation of a distributed system there are often upper bounds on message delays and processor step times. But since these upper bounds are often very large and can change over time, it is often desirable to develop an algorithm that is independent of any timing parameters, that is, an asynchronous algorithm.

In the asynchronous model we say that an execution is admissible if each processor has an infinite number of computation events, and every message sent is eventually delivered. The first of these requirements models the fact that processors do not fail. (It does not mean that a processor's local program contains an infinite loop. An algorithm can still terminate by having a transition function not change a processor's state after a certain point.)

We assume that each processor's set of states includes a subset of terminated states. Once a processor enters such a state it remains in it. The algorithm has terminated if all processors are in terminated states and no messages are in transit.

The message complexity of an algorithm in the asynchronous model is the maximum, over all admissible executions of the algorithm, of the total number of (point-to-point) messages sent.

A timed execution is an execution that has a nonnegative real number associated with each event, the time at which the event occurs. To measure the time complexity of an asynchronous algorithm we first assume that the maximum message delay in any execution is one unit of time. Hence the time complexity is the maximum time until termination among all timed admissible executions in which every message delay is at most one. Intuitively, this can be viewed as taking any execution of the algorithm and normalising it in such a way that the longest message delay becomes one unit of time.
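As a small worked example of this normalisation (the helper is ours, not the book's), divide the observed completion time by the longest observed message delay; the quotient is the execution's time measure under the convention that the maximum delay is one unit.

def normalised_duration(send_receive_pairs, finish_time):
    """send_receive_pairs: one (send_time, receive_time) pair per message."""
    longest_delay = max(r - s for s, r in send_receive_pairs)
    return finish_time / longest_delay  # rescale so the longest delay becomes 1

# A run finishing at time 6 whose slowest message took 2 units measures 3 units.
assert normalised_duration([(0, 2), (1, 2)], finish_time=6) == 3.0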

    13.1.3. Synchronous systems

In the synchronous model processors execute in lock-step. The execution is partitioned into rounds so that every processor can send a message to each neighbour, the messages are delivered, and every processor computes based on the messages just received. This model is very convenient for designing algorithms. Algorithms designed in this model can in many cases be automatically simulated to work in other, more realistic timing models.

In the synchronous model we say that an execution is admissible if it is infinite. From the round structure it follows then that every processor takes an infinite number of computation steps and that every message sent is eventually delivered. Hence in a synchronous system with no failures, once a (deterministic) algorithm has been fixed, the only relevant aspect determining an execution that can change is the initial configuration. On the other hand, in an asynchronous system there can be many different executions of the same algorithm, even with the same initial configuration and no failures, since here the interleaving of processor steps and the message delays are not fixed.

The notion of terminated states and the termination of the algorithm is defined in the same way as in the asynchronous model.

The message complexity of an algorithm in the synchronous model is the maximum, over all admissible executions of the algorithm, of the total number of messages sent.

To measure time in a synchronous system we simply count the number of rounds until termination. Hence the time complexity of an algorithm in the synchronous model is the maximum number of rounds, in any admissible execution of the algorithm, until the algorithm has terminated.

    13.2. Basic algorithms

    We begin with some simple examples of algorithms in the message passing model.

    13.2.1. Broadcast

We start with a simple algorithm Spanning-Tree-Broadcast for the (single message) broadcast problem, assuming that a spanning tree of the network graph with n nodes (processors) is already given. Later, we will remove this assumption. A processor pr wishes to send a message M to all other processors. The spanning tree rooted at pr is maintained in a distributed fashion: each processor has a distinguished channel that leads to its parent in the tree, as well as a set of channels that lead to its children in the tree. The root pr sends the message M on all channels leading to its children. When a processor receives the message on the channel from its parent, it sends M on all channels leading to its children.

    Spanning-Tree-Broadcast

Initially M is in transit from pr to all its children in the spanning tree.

Code for pr:
1  upon receiving no message:   // first computation event by pr
2    terminate

Code for pj, 0 ≤ j ≤ n − 1, j ≠ r:
3  upon receiving M from parent:
4    send M to all children
5    terminate

The algorithm Spanning-Tree-Broadcast is correct whether the system is synchronous or asynchronous. Moreover, the message and time complexities are the same in both models.
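The following runnable sketch (our own simulation on an assumed example tree, not part of the book) mimics the synchronous execution of Spanning-Tree-Broadcast round by round, and checks the claim proved next: a processor at distance t from the root receives M in round t.

def broadcast_rounds(children, root):
    """children: dict mapping each node to the list of its children in the tree."""
    received_in_round = {root: 0}      # M starts at the root before round 1
    frontier = [root]
    rnd = 0
    while frontier:                    # one synchronous round per iteration
        rnd += 1
        next_frontier = []
        for node in frontier:          # every holder forwards M to its children
            for child in children.get(node, []):
                received_in_round[child] = rnd
                next_frontier.append(child)
        frontier = next_frontier
    return received_in_round

# Example: a chain 0-1-2-3 rooted at 0, i.e. a spanning tree of depth d = 3.
rounds = broadcast_rounds({0: [1], 1: [2], 2: [3]}, root=0)
assert rounds == {0: 0, 1: 1, 2: 2, 3: 3}  # distance t processor gets M in round t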

Using simple inductive arguments we will first prove a lemma that shows that by the end of round t, the message M reaches all processors at distance t (or less) from pr in the spanning tree.

Lemma 13.1 In every admissible execution of the broadcast algorithm in the synchronous model, every processor at distance t from pr in the spanning tree receives the message M in round t.

Proof We proceed by induction on the distance t of a processor from pr. First let t = 1. It follows from the algorithm that each child of pr receives the message in round 1.

Assume that each processor at distance t − 1 received the message M in round t − 1. We need to show that each processor pt at distance t receives the message in round t. Let ps be the parent of pt in the spanning tree. Since ps is at distance t − 1 from pr, by the induction hypothesis ps received M in round t − 1. By the algorithm, pt will hence receive M in round t.

By Lemma 13.1, the time complexity of the broadcast algorithm is d, where d is the depth of the spanning tree. Now since d is at most n − 1 (when the spanning tree is a chain), we have:

Theorem 13.2 There is a synchronous broadcast algorithm for n processors with message complexity n − 1 and time complexity d, when a rooted spanning tree with depth d is known in advance.

    We now move to an asynchronous system and apply a similar analysis.

Lemma 13.3 In every admissible execution of the broadcast algorithm in the asynchronous model, every processor at distance t from pr in the spanning tree receives the message M by time t.

Proof We proceed by induction on the distance t of a processor from pr. First let t = 1. It follows from the algorithm that M is initially in transit to each processor pi at distance 1 from pr. By the definition of time complexity for the asynchronous model, pi receives M by time 1.

Assume that each processor at distance t − 1 received the message M by time t − 1. We need to show that each processor pt at distance t receives the message by time t. Let ps be the parent of pt in the spanning tree. Since ps is at distance t − 1 from pr, by the induction hypothesis ps receives M by time t − 1 and then sends M to pt. Since every message delay is at most one, pt will hence receive M by time t.

    We immediately obtain:

Theorem 13.4 There is an asynchronous broadcast algorithm for n processors with message complexity n − 1 and time complexity d, when a rooted spanning tree with depth d is known in advance.

    13.2.2. Construction of a spanning tree

The asynchronous algorithm called Flood, discussed next, constructs a spanning tree rooted at a designated processor pr. The algorithm is similar to the Depth First Search (DFS) algorithm. However, unlike DFS, where there is just one processor with “global knowledge” about the graph, in the Flood algorithm each processor has only “local knowledge” about the graph, processors coordinate their work by exchanging messages, and processors and messages may get delayed arbitrarily. This makes the design and analysis of the Flood algorithm challenging, because we need to show that the algorithm indeed constructs a spanning tree despite a conspiratorial selection of these delays.

  • 13.2. Basic algorithms 597

Algorithm description. Each processor has four local variables. The links adjacent to a processor are identified with distinct numbers starting from 1 and stored in a local variable called neighbours. We will say that the spanning tree has been constructed when the variable parent stores the identifier of the link leading to the parent of the processor in the spanning tree (except that this variable is none for the designated processor pr), children is the set of identifiers of the links leading to the children processors in the tree, and other is the set of identifiers of all other links. So the knowledge about the spanning tree may be “distributed” across processors.

The code of each processor is composed of segments. There is a segment (lines 1–4) that describes how the local variables of a processor are initialised. Recall that the local variables are initialised that way before time 0. The next three segments (lines 5–11, 12–15 and 16–19) describe the instructions that any processor executes in response to having received a message: ⟨adopt⟩, ⟨approved⟩ or ⟨rejected⟩. The last segment (lines 20–22) is only included in the code of processor pr. This segment is executed only when the local variable parent of processor pr is nil. At some point of time, it may happen that more than one segment can be executed by a processor (e.g., because the processor received ⟨adopt⟩ messages from two processors). Then the processor executes the segments serially, one by one (segments of any given processor are never executed concurrently). However, instructions of different processors may be arbitrarily interleaved during an execution. Every message that can be processed is eventually processed and every segment that can be executed is eventually executed (fairness).

    Flood

Code for any processor pk, 1 ≤ k ≤ n

 1  initialisation
 2    parent ← nil
 3    children ← ∅
 4    other ← ∅

 5  process ⟨adopt⟩ message that has arrived on link j
 6  if parent = nil
 7    then parent ← j
 8      send ⟨approved⟩ to link j
 9      send ⟨adopt⟩ to all links in neighbours \ {j}
10    else
11      send ⟨rejected⟩ to link j

12  process ⟨approved⟩ message that has arrived on link j
13  children ← children ∪ {j}
14  if children ∪ other = neighbours \ {parent}
15    then terminate

16  process ⟨rejected⟩ message that has arrived on link j
17  other ← other ∪ {j}
18  if children ∪ other = neighbours \ {parent}
19    then terminate

Extra code for the designated processor pr
20  if parent = nil
21    then parent ← none
22      send ⟨adopt⟩ to all links in neighbours

Let us outline how the algorithm works. The designated processor sends an ⟨adopt⟩ message to all its neighbours, and assigns none to its parent variable (nil and none are two distinguished values, different from any natural number), so that it never again sends the ⟨adopt⟩ message to any neighbour.

When a processor processes an ⟨adopt⟩ message for the first time, the processor assigns to its own parent variable the identifier of the link on which the message has arrived, responds with an ⟨approved⟩ message on that link, and forwards an ⟨adopt⟩ message on every other link. However, when a processor processes an ⟨adopt⟩ message again, the processor responds with a ⟨rejected⟩ message, because its parent variable is no longer nil.

When a processor processes an ⟨approved⟩ message, it adds the identifier of the link on which the message has arrived to the set children. It may turn out that the sets children and other combined contain the identifiers of all links adjacent to the processor except for the identifier stored in the parent variable. In this case the processor enters a terminating state.

When a processor processes a ⟨rejected⟩ message, the identifier of the link is added to the set other. Again, when the union of children and other is large enough, the processor enters a terminating state.
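To see Flood in action, here is a runnable Python sketch (our own simulation, using the message names as reconstructed above). It delivers pending ⟨adopt⟩ messages in an arbitrary order and checks that the parent variables nevertheless form a spanning tree rooted at the designated processor; the ⟨approved⟩ and ⟨rejected⟩ replies, which only govern termination, are omitted from the sketch.

import random

def flood(neighbours, root, seed=0):
    rng = random.Random(seed)
    parent = {p: None for p in neighbours}
    parent[root] = 'none'                        # the designated processor pr
    pending = [(root, q, 'adopt') for q in neighbours[root]]
    while pending:
        i = rng.randrange(len(pending))          # deliver some pending message,
        sender, receiver, kind = pending.pop(i)  # in arbitrary order
        if kind == 'adopt':
            if parent[receiver] is None:         # first <adopt>: adopt the sender
                parent[receiver] = sender
                pending += [(receiver, q, 'adopt')
                            for q in neighbours[receiver] if q != sender]
            # a repeated <adopt> would trigger a <rejected> reply, which only
            # affects termination, not the shape of the tree
    return parent

nbrs = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
tree = flood(nbrs, root=0)
assert tree[0] == 'none' and all(tree[p] is not None for p in nbrs)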

Correctness proof. We now argue that Flood constructs a spanning tree. The key moments in the execution of the algorithm are when any processor assigns a value to its parent variable. These assignments determine the “shape” of the spanning tree. The facts that any processor eventually executes an instruction, any message is eventually delivered, and any message is eventually processed, ensure that the knowledge about these assignments spreads to neighbours. Thus the algorithm is expanding a subtree of the graph, albeit the expansion may be slow. Eventually, a spanning tree is formed. Once a spanning tree has been constructed, eventually every processor will terminate, even though some processors may have terminated even before the spanning tree has been constructed.

Lemma 13.5 For any 1 ≤ k ≤ n, there is a time tk which is the first moment when there are exactly k processors whose parent variables are not nil, and these processors and their parent variables form a tree rooted at pr.

Proof We prove the statement of the lemma by induction on k. For the base case, assume that k = 1. Observe that processor pr eventually assigns none to its parent variable. Let t1 be the moment when this assignment happens. At that time, the parent variable of any processor other than pr is still nil, because no ⟨adopt⟩ messages have been sent so far. Processor pr and its parent variable form a tree with a single node and no arcs. Hence they form a rooted tree. Thus the inductive hypothesis holds for k = 1.

For the inductive step, suppose that 1 ≤ k < n and that the inductive hypothesis holds for k. Consider the time tk which is the first moment when there are exactly k processors whose parent variables are not nil. Because k < n, there is a non-tree processor. But the graph G is connected, so there is a non-tree processor adjacent to the tree. (For any subset T of processors, a processor pi is adjacent to T if and only if there is an edge in the graph G from pi to a processor in T.) Recall that, by definition, the parent variable of such a processor is nil. By the inductive hypothesis, the k tree processors must have executed line 7 of their code, and so each either has already sent or will eventually send an ⟨adopt⟩ message to all its neighbours on links other than the parent link. So the non-tree processors adjacent to the tree have already received or will eventually receive ⟨adopt⟩ messages. Eventually, each of these adjacent processors will therefore assign a value other than nil to its parent variable. Let tk+1 > tk be the first moment when any processor performs such an assignment, and let us denote this processor by pi. This cannot be a tree processor, because such a processor never again assigns any value to its parent variable. Could pi be a non-tree processor that is not adjacent to the tree? It could not, because such a processor does not have a direct link to a tree processor, so it cannot receive ⟨adopt⟩ directly from the tree, and so this would mean that at some time t′ between tk and tk+1 some other non-tree processor pj must have sent an ⟨adopt⟩ message to pi, and so pj would have had to assign a value other than nil to its parent variable some time after tk but before tk+1, contradicting the fact that tk+1 is the first such moment. Consequently, pi is a non-tree processor adjacent to the tree, such that, at time tk+1, pi assigns to its parent variable the index of a link leading to a tree processor. Therefore, time tk+1 is the first moment when there are exactly k + 1 processors whose parent variables are not nil, and, at that time, these processors and their parent variables form a tree rooted at pr. This completes the inductive step, and the proof of the lemma.

Theorem 13.6 Eventually each processor terminates, and when every processor has terminated, the subgraph induced by the parent variables forms a spanning tree rooted at pr.

Proof By Lemma 13.5, we know that there is a moment tn which is the first moment when all processors and their parent variables form a spanning tree.

Is it possible that every processor has terminated before time tn? By inspecting the code, we see that a processor terminates only after it has received ⟨approved⟩ or ⟨rejected⟩ messages from all its neighbours other than the one to which its parent link leads. A processor receives such messages only in response to ⟨adopt⟩ messages that the processor sends. At time tn, there is a processor that still has not even sent its ⟨adopt⟩ messages. Hence, not every processor has terminated by time tn.

Will every processor eventually terminate? We notice that by time tn, each processor either has already sent or will eventually send an ⟨adopt⟩ message to all its neighbours other than the one to which its parent link leads. Whenever a processor receives an ⟨adopt⟩ message, it responds with ⟨approved⟩ or ⟨rejected⟩, even if it has already terminated. Hence, eventually, each processor will receive either an ⟨approved⟩ or a ⟨rejected⟩ message on each link to which it has sent an ⟨adopt⟩ message. Thus, eventually, each processor terminates.

We note that the fact that a processor has terminated does not mean that a spanning tree has already been constructed. In fact, it may happen that processors in a different part of the network have not even received any message, let alone terminated.

Theorem 13.7 The message complexity of Flood is O(e), where e is the number of edges in the graph G.

    The proof of this theorem is left as Problem 13-1.

Exercises
13.2-1 It may happen that a processor has terminated even though another processor has not even received any message. Show a simple network, and how to delay message delivery and processor computation, to demonstrate that this can indeed happen.
13.2-2 It may happen that a processor has terminated but may still respond to a message. Show a simple network, and how to delay message delivery and processor computation, to demonstrate that this can indeed happen.

    13.3. Ring algorithms

One often needs to coordinate the activities of processors in a distributed system. This can frequently be simplified when there is a single processor that acts as a coordinator. Initially, the system may not have any coordinator, or an existing coordinator may fail, and so another may need to be elected. This creates the problem where processors must elect exactly one among them, a leader. In this section we study the problem for a special type of network—rings. We will develop an asynchronous algorithm for the problem. As we shall demonstrate, the algorithm has asymptotically optimal message complexity. We will also see a distributed analogue of the well-known divide-and-conquer technique often used in sequential algorithms to keep their time complexity low; in distributed systems the technique helps reduce the message complexity.

    13.3.1. The leader election problem

The leader election problem is to elect exactly one leader among a set of processors. Formally, each processor has a local variable leader initially equal to nil. An algorithm is said to solve the leader election problem if it satisfies the following conditions:

1. in any execution, exactly one processor eventually assigns true to its leader variable, and all other processors eventually assign false to their leader variables, and

2. in any execution, once a processor has assigned a value to its leader variable, the variable remains unchanged.

Ring model. We study the leader election problem on a special type of network—the ring. Formally, the graph G that models the distributed system consists of n nodes that form a simple cycle; no other edges exist in the graph. The two links adjacent to a processor are labeled CW (Clock-Wise) and CCW (Counter Clock-Wise). Processors agree on the orientation of the ring, i.e., if a message is passed on in the CW direction n times, then it visits all n processors and comes back to the one that initially sent the message; the same holds for the CCW direction. Each processor has a unique identifier that is a natural number, i.e., the identifier of each processor is different from the identifier of any other processor; the identifiers do not have to be the consecutive numbers 1, ..., n. Initially, no processor knows the identifier of any other processor. Also, processors do not know the size n of the ring.

    13.3.2. The leader election algorithm

Bully elects a leader among asynchronous processors p1, ..., pn. Identifiers of processors are used by the algorithm in a crucial way. Briefly speaking, each processor tries to become the leader; the processor that has the largest identifier among all processors blocks the attempts of other processors, declares itself to be the leader, and forces the others to declare themselves not to be leaders.

Let us begin with a simpler version of the algorithm to exemplify some of its ideas. Suppose that each processor sends a message around the ring containing the identifier of the processor. Any processor passes on such a message only if the identifier that the message carries is strictly larger than its own identifier. Thus the message sent by the processor that has the largest identifier among the processors of the ring will always be passed on, and so it will eventually travel around the ring and come back to the processor that initially sent it. The processor can detect that such a message has come back, because no other processor sends a message with this identifier (identifiers are distinct). We observe that no other message will make it all the way around the ring, because the processor with the largest identifier will not pass it on; we could say that the processor with the largest identifier “swallows” the messages that carry smaller identifiers. Then the processor becomes the leader and sends a special message around the ring forcing all others to decide not to be leaders. The algorithm has Θ(n²) message complexity, because each processor induces at most n messages and the leader induces n extra messages; and one can assign identifiers to processors, and delay processors and messages, in such a way that the messages sent by a constant fraction of the n processors are passed on around the ring for a constant fraction of n hops. The algorithm can be improved so as to reduce the message complexity to O(n lg n), and such an improved algorithm will be presented in the remainder of the section.
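A runnable sketch of this simpler scheme (our own illustration, with identifiers placed in decreasing order so that messages travel far) counts the hops and exhibits the quadratic behaviour.

def simple_ring_election(ids):
    """Every processor sends its identifier CW; only larger ids are passed on."""
    n = len(ids)
    messages = 0
    for start in range(n):             # the message originated by `start`
        carried = ids[start]
        pos = (start + 1) % n
        while True:
            messages += 1              # one hop = one message
            if carried == ids[pos]:    # travelled around the ring: leader found
                leader = carried
                break
            if carried < ids[pos]:     # swallowed by a larger identifier
                break
            pos = (pos + 1) % n
    return leader, messages

# Decreasing identifiers force n(n+1)/2 = 36 hops for n = 8: quadratic growth.
leader, msgs = simple_ring_election([8, 7, 6, 5, 4, 3, 2, 1])
assert leader == 8 and msgs == 36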

The key idea of the Bully algorithm is to make sure that not too many messages travel far, which will ensure O(n lg n) message complexity. Specifically, the activity of any processor is divided into phases. At the beginning of a phase, a processor sends “probe” messages in both directions: CW and CCW. These messages carry the identifier of the sender and a certain “time-to-live” value that limits the number of hops that each message can make. A probe message may be passed on by a processor provided that the identifier carried by the message is larger than the identifier of the processor. When the message reaches the hop limit without having been swallowed, it is “bounced back”. Hence when the initial sender receives two bounced-back messages, one from each direction, the processor is certain that there is no processor with a larger identifier within the limit in either the CW or the CCW direction, because otherwise such a processor would have swallowed a probe message. Only then does the processor enter the next phase, by sending probe messages again, this time with the time-to-live value increased by a factor, in an attempt to find out whether there is a processor with a larger identifier in a twice as large neighbourhood. As a result, a probe message that the processor sends will make many hops only when there is no processor with a larger identifier in a large neighbourhood of the processor. Therefore, fewer and fewer processors send messages that can travel longer and longer distances. Consequently, as we will soon argue in detail, the message complexity of the algorithm is O(n lg n).
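Before the formal analysis, a rough tally (our own back-of-the-envelope arithmetic, not the book's proof) already suggests the bound: any two survivors of phase k − 1 must be more than 2^(k−1) hops apart, so roughly n/2^(k−1) processors still send probes in phase k, and each of them induces O(2^k) messages.

import math

def probe_message_bound(n):
    """Crude upper tally of probe/reply traffic over all phases."""
    total = 0
    for k in range(math.ceil(math.log2(n)) + 1):
        senders = n if k == 0 else n // 2 ** (k - 1)  # processors entering phase k
        total += 4 * senders * 2 ** k                 # 2 directions x (probe + reply)
    return total

# Each phase costs O(n), and there are O(lg n) phases: O(n lg n) in total.
assert probe_message_bound(1024) <= 16 * 1024 * math.log2(1024)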

We now detail the Bully algorithm. Each processor has five local variables. The variable id stores the unique identifier of the processor. The variable leader stores true when the processor decides to be the leader, and false when it decides not to be the leader. The remaining three variables are used for bookkeeping: asleep determines if the processor has ever sent a ⟨probe, id, 0, 0⟩ message carrying the identifier id of the processor. Any processor may send a ⟨probe, id, phase, ttl⟩ message in both directions (CW and CCW) for different values of phase. Each time such a message is sent, a ⟨reply, id, phase⟩ message may be sent back to the processor. The variables CWreplied and CCWreplied are used to remember whether the replies have already been processed by the processor.

The code of each processor is composed of five segments. The first segment (lines 1–5) initialises the local variables of the processor. The second segment (lines 6–8) can only be executed when the local variable asleep is true. The remaining three segments (lines 9–17, 18–25 and 26–30) describe the actions that the processor takes when it processes each of the three types of messages: ⟨probe, ids, phase, ttl⟩, ⟨reply, ids, phase⟩ and ⟨terminate⟩, respectively. The parameters ids, phase and ttl carried by the messages are natural numbers.

We now describe how the algorithm works. Recall that we assume that the local variables of each processor have been initialised before time 0 of the global clock. Each processor eventually sends a ⟨probe, id, 0, 0⟩ message carrying the identifier id of the processor. At that time we say that the processor enters phase number zero. In general, when a processor sends a message ⟨probe, id, phase, 2^phase − 1⟩, we say that the processor enters phase number phase. The message ⟨probe, id, 0, 0⟩ is never sent again because false is assigned to asleep in line 7. It may happen that by the time this message is sent, some other messages have already been processed by the processor.

When a processor processes a ⟨probe, ids, phase, ttl⟩ message that has arrived on link CW (the link leading in the clock-wise direction), the actions depend on the relationship between the parameter ids and the identifier id of the processor. If ids is smaller than id, then the processor does nothing else (the processor swallows the message). If ids is equal to id and the processor has not yet decided, then, as we shall see, the probe message that the processor sent has circulated around the entire ring. Then the processor sends a ⟨terminate⟩ message, decides to be the leader, and terminates (the processor may still process messages after termination). If ids is larger than id, then the actions of the processor depend on the value of the parameter ttl (time-to-live). When the value is strictly larger than zero, the processor passes on the probe message with ttl decreased by one. If, however, the value of ttl is already zero, then the processor sends back (in the CW direction) a ⟨reply, ids, phase⟩ message. Symmetric actions are executed when the ⟨probe, ids, phase, ttl⟩ message has arrived on link CCW, in the sense that the directions of sending messages are respectively reversed – see the code for details.

    Bully

Code for any processor pk, 1 ≤ k ≤ n

 1 initialisation
 2   asleep ← true
 3   CWreplied ← false
 4   CCWreplied ← false
 5   leader ← nil

 6 if asleep
 7   then asleep ← false
 8     send ⟨probe, id, 0, 2^0 − 1⟩ to links CW and CCW

 9 process message ⟨probe, ids, phase, ttl⟩ that has arrived on link CW (resp. CCW)
10 if id = ids and leader = nil
11   then send ⟨elected, id⟩ to link CCW
12     leader ← true
13     terminate
14 if ids > id and ttl > 0
15   then send ⟨probe, ids, phase, ttl − 1⟩ to link CCW (resp. CW)
16 if ids > id and ttl = 0
17   then send ⟨reply, ids, phase⟩ to link CW (resp. CCW)


18 process message ⟨reply, ids, phase⟩ that has arrived on link CW (resp. CCW)
19 if id ≠ ids
20   then send ⟨reply, ids, phase⟩ to link CCW (resp. CW)
21   else CWreplied ← true (resp. CCWreplied ← true)
22     if CWreplied and CCWreplied
23       then CWreplied ← false
24         CCWreplied ← false
25         send ⟨probe, id, phase + 1, 2^{phase+1} − 1⟩ to links CW and CCW

26 process message ⟨elected, ids⟩ that has arrived on link CW
27 if leader = nil
28   then send ⟨elected, ids⟩ to link CCW
29     leader ← false
30     terminate

When a processor processes a message ⟨reply, ids, phase⟩ that has arrived on link CW, the processor first checks if ids is different from the identifier id of the processor. If so, the processor merely passes on the message. However, if ids = id, then the processor records the fact that a reply has been received from direction CW, by assigning true to CWreplied. Next the processor checks if both the CWreplied and CCWreplied variables are true. If so, the processor has received replies from both directions. Then the processor assigns false to both variables. Next the processor sends a probe message. This message carries the identifier id of the processor, the next phase number phase + 1, and an increased time-to-live parameter 2^{phase+1} − 1. Symmetric actions are executed when ⟨reply, ids, phase⟩ has arrived on link CCW.

The last type of message that a processor can process is ⟨elected, ids⟩. The processor checks if it has already decided to be or not to be the leader. When no decision has been made so far, the processor passes on the ⟨elected, ids⟩ message and decides not to be the leader. This message eventually reaches a processor that has already decided, and then the message is no longer passed on.
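
Before turning to the analysis, it may help to see the algorithm run. The following minimal Python sketch is our own illustration, not the book's pseudocode: the class and function names, the ring wiring, and the FIFO message delivery are assumptions made for the example, and FIFO delivery realises just one of the many admissible asynchronous executions.

from collections import deque

CW, CCW = 0, 1   # link directions

class Processor:
    def __init__(self, ident):
        self.id = ident
        self.leader = None                    # nil in the pseudocode
        self.cw_replied = self.ccw_replied = False
        self.out = []                         # (link, message) pairs to deliver

    def start(self):                          # lines 6-8: enter phase 0
        self.send(CW, ('probe', self.id, 0, 0))    # ttl = 2**0 - 1 = 0
        self.send(CCW, ('probe', self.id, 0, 0))

    def send(self, link, msg):
        self.out.append((link, msg))

    def receive(self, link, msg):
        other = CCW if link == CW else CW     # "resp." direction
        if msg[0] == 'probe':
            _, ids, phase, ttl = msg
            if ids == self.id and self.leader is None:   # lines 10-13
                self.send(other, ('elected', self.id))
                self.leader = True
            elif ids > self.id and ttl > 0:              # lines 14-15
                self.send(other, ('probe', ids, phase, ttl - 1))
            elif ids > self.id and ttl == 0:             # lines 16-17
                self.send(link, ('reply', ids, phase))
        elif msg[0] == 'reply':
            _, ids, phase = msg
            if ids != self.id:                           # lines 19-20
                self.send(other, msg)
            else:                                        # lines 21-25
                if link == CW: self.cw_replied = True
                else: self.ccw_replied = True
                if self.cw_replied and self.ccw_replied:
                    self.cw_replied = self.ccw_replied = False
                    ttl = 2 ** (phase + 1) - 1
                    self.send(CW, ('probe', self.id, phase + 1, ttl))
                    self.send(CCW, ('probe', self.id, phase + 1, ttl))
        elif msg[0] == 'elected' and self.leader is None:  # lines 27-30
            self.send(other, msg)
            self.leader = False

def elect(ids):
    n = len(ids)
    ring = [Processor(i) for i in ids]
    queue, messages = deque(), 0
    def flush(k):
        nonlocal messages
        while ring[k].out:
            link, msg = ring[k].out.pop(0)
            # a message sent on the CW link arrives at the clockwise
            # neighbour, on that neighbour's CCW link (and vice versa)
            dest = (k + 1) % n if link == CW else (k - 1) % n
            queue.append((dest, CCW if link == CW else CW, msg))
            messages += 1
    for k in range(n):
        ring[k].start()
        flush(k)
    while queue:
        dest, link, msg = queue.popleft()
        ring[dest].receive(link, msg)
        flush(dest)
    return [p.id for p in ring if p.leader], messages

print(elect([3, 17, 5, 11, 2, 8]))    # expected: ([17], message count)

Running elect on rings of growing size gives message counts that can be compared against the O(n lg n) bound derived below.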

    13.3.3. Analysis of the leader election algorithm

We begin the analysis by showing that the algorithm Bully solves the leader election problem.

Theorem 13.8 Bully solves the leader election problem on any ring with asynchronous processors.

Proof We need to show that the two conditions listed at the beginning of the section are satisfied. The key idea that simplifies the argument is to focus on one processor. Consider the processor pi with maximum id among all processors in the ring. This processor eventually executes lines 6–8. Then the processor sends ⟨probe, id, 0, 2^0 − 1⟩ messages in the CW and CCW directions. Note that whenever the processor sends ⟨probe, id, phase, 2^phase − 1⟩ messages, each such message is always passed on by other processors, until the ttl parameter of the message drops down to zero, or the message travels around the entire ring and arrives at pi. If the message never arrives at pi, then a processor eventually receives the probe message with ttl equal to zero, and the processor sends a response back to pi. Then, eventually pi receives ⟨reply, id, phase⟩ messages from both directions, and enters phase number phase + 1 by sending ⟨probe, id, phase + 1, 2^{phase+1} − 1⟩ messages in both directions. These messages carry a larger time-to-live value compared to the value from the previous phase number phase. Since the ring is finite, eventually ttl becomes so large that processor pi receives a probe message that carries the identifier of pi. Note that pi will eventually receive two such messages. The first time pi processes such a message, the processor sends an ⟨elected, id⟩ message and terminates as the leader. The second time pi processes such a message, lines 11–13 are not executed, because the variable leader is no longer nil. Note that no other processor pj can execute lines 11–13, because a probe message originated at pj cannot travel around the entire ring, since pi is on the way, and pi would swallow the message; and since identifiers are distinct, no other processor sends a probe message that carries the identifier of processor pj. Thus no processor other than pi can assign true to its leader variable. Any processor other than pi will receive the ⟨elected, id⟩ message, assign false to its leader variable, and pass on the message. Finally, the ⟨elected, id⟩ message will arrive at pi, and pi will not pass it on anymore. The argument presented thus far ensures that eventually exactly one processor assigns true to its leader variable, all other processors assign false to their leader variables, and once a processor has assigned a value to its leader variable, the variable remains unchanged.

Our next task is to give an upper bound on the number of messages sent by the algorithm. The subsequent lemma shows that the number of processors that can enter a phase decays exponentially as the phase number increases.

Lemma 13.9 Given a ring of size n, the number k of processors that enter phase number i ≥ 0 is at most n/2^{i−1}.

Proof There are exactly n processors that enter phase number i = 0, because each processor eventually sends a ⟨probe, id, 0, 2^0 − 1⟩ message. The bound stated in the lemma says that the number of processors that enter phase 0 is at most 2n, so the bound evidently holds for i = 0. Let us consider any of the remaining cases, i.e., let us assume that i ≥ 1. Suppose that a processor pj enters phase number i, and so by definition it sends a message ⟨probe, id, i, 2^i − 1⟩. In order for a processor to send such a message, each of the two probe messages ⟨probe, id, i − 1, 2^{i−1} − 1⟩ that the processor sent in the previous phase in both directions must have made 2^{i−1} hops, always arriving at a processor with strictly lower identifier than the identifier of pj (because otherwise, if a probe message arrives at a processor with a strictly larger or the same identifier, then the message is swallowed, and so a reply message is not generated, and consequently pj cannot enter phase number i). As a result, if a processor enters phase number i, then there is no other processor 2^{i−1} hops away in both directions that can ever enter the phase. Suppose that there are k ≥ 1 processors that enter phase i. We can associate with each such processor pj the 2^{i−1} consecutive processors that follow pj in the CW direction. This association assigns 2^{i−1} distinct processors to each of the k processors. So there must be at least k + k · 2^{i−1} distinct processors in the ring. Hence k(1 + 2^{i−1}) ≤ n, and so we can weaken this bound by dropping the 1, and conclude that k · 2^{i−1} ≤ n, as desired.

Theorem 13.10 The algorithm Bully has O(n lg n) message complexity, where n is the size of the ring.

Proof Note that any processor in phase i sends messages that are intended to travel 2^i away and back in each direction (CW and CCW). This contributes at most 4 · 2^i messages per processor that enters phase number i. The contribution may be smaller than 4 · 2^i if a probe message gets swallowed on the way away from the processor. Lemma 13.9 provides an upper bound on the number of processors that enter phase number i. What is the highest phase that a processor can ever enter? The number k of processors that can be in phase i is at most n/2^{i−1}. So when n/2^{i−1} < 1, there can be no processor that ever enters phase i. Thus no processor can enter any phase beyond phase number h = 1 + ⌈log₂ n⌉, because n < 2^{(h+1)−1}. Finally, a single processor sends one termination message that travels around the ring once. So for the total number of messages sent by the algorithm we get the

n + Σ_{i=0}^{1+⌈log₂ n⌉} (n/2^{i−1} · 4 · 2^i) = n + Σ_{i=0}^{1+⌈log₂ n⌉} 8n = O(n lg n)

upper bound.
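
To get a feel for the constants, the bound can be evaluated directly. The following short Python fragment is our own illustration, not part of the original analysis; its output can be compared against the message count reported by the simulation sketched earlier.

import math

def message_bound(n):
    # phases run from i = 0 to h = 1 + ceil(log2 n); each contributes <= 8n
    h = 1 + math.ceil(math.log2(n))
    return n + 8 * n * (h + 1)

for n in (8, 64, 1024):
    print(n, message_bound(n))    # 8 -> 328, 64 -> 4160, 1024 -> 99328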

Burns furthermore showed that the asynchronous leader election algorithm is asymptotically optimal: any uniform algorithm solving the leader election problem in an asynchronous ring must send a number of messages at least proportional to n lg n.

Theorem 13.11 Any uniform algorithm for electing a leader in an asynchronous ring sends Ω(n lg n) messages.

The proof, for any algorithm, is based on constructing certain executions of the algorithm on rings of size n/2. Then two rings of size n/2 are pasted together in such a way that the constructed executions on the smaller rings are combined, and Θ(n) additional messages are received. This construction strategy yields the desired logarithmic multiplicative overhead.

Exercises
13.3-1 Show that the simplified Bully algorithm has Ω(n²) message complexity, by appropriately assigning identifiers to processors on a ring of size n, and by determining how to delay processors and messages.
13.3-2 Show that the algorithm Bully has Ω(n lg n) message complexity.


    13.4. Fault-tolerant consensus

The algorithms presented so far are based on the assumption that the system on which they run is reliable. Here we present selected algorithms for unreliable distributed systems, where the active (or correct) processors need to coordinate their activities based on common decisions.

It is inherently difficult for processors to reach agreement in a distributed setting prone to failures. Consider the deceptively simple problem of two failure-free processors attempting to agree on a common bit using a communication medium where messages may be lost. This problem is known as the two generals problem. Here two generals must coordinate an attack using couriers that may be destroyed by the enemy. It turns out that this problem cannot be solved using a finite number of messages. We prove this fact by contradiction. Assume that there is a protocol used by processors A and B involving a finite number of messages. Let us consider such a protocol that uses the smallest number of messages, say k messages. Assume without loss of generality that the last, kth, message is sent from A to B. Since this final message is not acknowledged by B, A must determine the decision value whether or not B receives this message. Since the message may be lost, B must determine the decision value without receiving this final message. But now both A and B decide on a common value without needing the kth message. In other words, there is a protocol that uses only k − 1 messages for the problem. But this contradicts the assumption that k is the smallest number of messages needed to solve the problem.

In the rest of this section we consider agreement problems where the communication medium is reliable, but where the processors are subject to two types of failures: crash failures, where a processor stops and does not perform any further actions, and Byzantine failures, where a processor may exhibit arbitrary, or even malicious, behaviour as the result of the failure.

The algorithms presented deal with the so-called consensus problem, first introduced by Lamport, Pease, and Shostak. The consensus problem is a fundamental coordination problem that requires processors to agree on a common output, based on their possibly conflicting inputs.

    13.4.1. The consensus problem

We consider a system in which each processor pi has a special state component xi, called the input, and yi, called the output (also called the decision). The variable xi initially holds a value from some well-ordered set of possible inputs and yi is undefined. Once an assignment to yi has been made, it is irreversible. Any solution to the consensus problem must guarantee:

• Termination: In every admissible execution, yi is eventually assigned a value, for every nonfaulty processor pi.

• Agreement: In every execution, if yi and yj are assigned, then yi = yj, for all nonfaulty processors pi and pj. That is, nonfaulty processors do not decide on conflicting values.


• Validity: In every execution, if for some value v, xi = v for all processors pi, and if yi is assigned for some nonfaulty processor pi, then yi = v. That is, if all processors have the same input value, then any value decided upon must be that common input.

Note that in the case of crash failures this validity condition is equivalent to requiring that every nonfaulty decision value is the input of some processor. Once a processor crashes, it is of no interest to the algorithm, and no requirements are put on its decision.

We begin by presenting a simple algorithm for consensus in a synchronous message passing system with crash failures.

    13.4.2. Consensus with crash failures

Since the system is synchronous, an execution of the system consists of a series of rounds. Each round consists of the delivery of all messages, followed by one computation event for every processor. The set of faulty processors can be different in different executions, that is, it is not known in advance. Let F be a subset of at most f processors, the faulty processors. Each round contains exactly one computation event for the processors not in F and at most one computation event for every processor in F. Moreover, if a processor in F does not have a computation event in some round, it does not have such an event in any further round. In the last round in which a faulty processor has a computation event, an arbitrary subset of its outgoing messages are delivered.

    Consensus-with-Crash-Failures

Code for processor pi, 0 ≤ i ≤ n − 1.
Initially V = {x}

round k, 1 ≤ k ≤ f + 1
1 send {v ∈ V : pi has not already sent v} to all processors
2 receive Sj from pj, 0 ≤ j ≤ n − 1, j ≠ i
3 V ← V ∪ ⋃_{j=0}^{n−1} Sj
4 if k = f + 1
5   then y ← min(V)

In the previous algorithm, which is based on an algorithm by Dolev and Strong, each processor maintains a set of the values it knows to exist in the system. Initially, the set contains only its own input. In later rounds the processor updates its set by joining it with the sets received from other processors. It then broadcasts any new additions to the set to all processors. This continues for f + 1 rounds, where f is the maximum number of processors that can fail. At this point, the processor decides on the smallest value in its set of values.
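
A compact simulation can make the round structure concrete. The following Python sketch is our own illustration of the algorithm above, under one simple crash model: the function crash_consensus, its parameters, and the random choice of delivered messages in the crash round are assumptions made for the sake of the example.

import random

def crash_consensus(inputs, faulty, crash_round):
    """inputs: list of input values; faulty: set of processor indices;
    crash_round[i]: round in which faulty processor i crashes."""
    n, f = len(inputs), len(faulty)
    V = [{v} for v in inputs]            # each processor's known-value set
    sent = [set() for _ in range(n)]     # values already broadcast
    crashed = set()
    for k in range(1, f + 2):            # rounds 1 .. f+1
        msgs = []
        for i in range(n):
            if i in crashed:
                continue
            new = V[i] - sent[i]         # broadcast only new additions
            sent[i] |= new
            receivers = range(n)
            if i in faulty and crash_round[i] == k:
                # crashing: an arbitrary subset of outgoing messages arrives
                receivers = random.sample(range(n), random.randrange(n + 1))
                crashed.add(i)
            for j in receivers:
                if j != i:
                    msgs.append((j, new))
        for j, vals in msgs:             # all surviving messages delivered
            if j not in crashed:
                V[j] |= vals
    return [min(V[i]) for i in range(n) if i not in faulty]

# all nonfaulty decisions agree, here on the smallest known input
print(crash_consensus([3, 1, 4, 1, 5], faulty={2}, crash_round={2: 1}))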

To prove the correctness of this algorithm, we first notice that the algorithm requires exactly f + 1 rounds. This implies termination. Moreover, the validity condition is clearly satisfied, since the decision value is the input of some processor. It remains to show that the agreement condition holds. We prove the following lemma:

Lemma 13.12 In every execution, at the end of round f + 1, Vi = Vj, for every two nonfaulty processors pi and pj.

Proof We prove the claim by showing that if x ∈ Vi at the end of round f + 1, then x ∈ Vj at the end of round f + 1.

Let r be the first round in which x is added to Vi for any nonfaulty processor pi. If x is initially in Vi, let r = 0. If r ≤ f then, in round r + 1 ≤ f + 1, pi sends x to each pj, causing pj to add x to Vj, if it is not already present.

Otherwise, suppose r = f + 1 and let pj be a nonfaulty processor that receives x for the first time in round f + 1. Then there must be a chain of f + 1 processors pi1, ..., pif+1 that transfers the value x to pj: pi1 sends x to pi2 in round one, and so on, until pif+1 sends x to pj in round f + 1. But then pi1, ..., pif+1 is a chain of f + 1 processors, while at most f processors are faulty. Hence at least one of them, say pik, must be nonfaulty. Hence pik adds x to its set in round k − 1 < r, contradicting the minimality of r.

This lemma, together with the aforementioned observations, implies the following theorem.

Theorem 13.13 The previous consensus algorithm solves the consensus problem in the presence of f crash failures in a message passing system in f + 1 rounds.

The following theorem was first proved by Fischer and Lynch for Byzantine failures. Dolev and Strong later extended it to crash failures. The theorem shows that the previous algorithm, assuming the given model, is optimal.

Theorem 13.14 There is no algorithm which solves the consensus problem in less than f + 1 rounds in the presence of f crash failures, if n ≥ f + 2.

What if failures are not benign? That is, can the consensus problem be solved in the presence of Byzantine failures? And if so, how?

    13.4.3. Consensus with Byzantine failures

In a computation step of a faulty processor in the Byzantine model, the new state of the processor and the message sent are completely unconstrained. As in the reliable case, every processor takes a computation step in every round and every message sent is delivered in that round. Hence a faulty processor can behave arbitrarily and even maliciously. For example, it could send different messages to different processors. It can even appear that the faulty processors coordinate with each other. A faulty processor can also mimic the behaviour of a crashed processor by failing to send any messages from some point on.

In this case, the definition of the consensus problem is the same as in the message passing model with crash failures. The validity condition in this model, however, is not equivalent to requiring that every nonfaulty decision value is the input of some processor. As in the crash case, no conditions are put on the output of faulty processors.


    13.4.4. Lower bound on the ratio of faulty processors

    Pease, Shostak and Lamport first proved the following theorem.

Theorem 13.15 In a system with n processors and f Byzantine processors, there is no algorithm which solves the consensus problem if n ≤ 3f.

    13.4.5. A polynomial algorithm

The following algorithm uses messages of constant size, takes 2(f + 1) rounds, and assumes that n > 4f. It was presented by Berman and Garay.

This consensus algorithm for Byzantine failures contains f + 1 phases, each taking two rounds. Each processor has a preferred decision for each phase, initially its input value. In the first round of each phase, processors send their preferences to each other. Let v_i^k be the majority value in the set of values received by processor pi at the end of the first round of phase k. If no majority exists, a default value v⊥ is used. In the second round of the phase, processor pk, called the king of the phase, sends its majority value v_k^k to all processors. If pi receives more than n/2 + f copies of v_i^k (in the first round of the phase), then it sets its preference for the next phase to be v_i^k; otherwise it sets its preference to the phase king's preference, v_k^k, received in the second round of the phase. After f + 1 phases, the processor decides on its preference. Each processor maintains a local array pref with n entries.

We prove correctness using the following lemmas. Termination is immediate. We next note the persistence of agreement:

Lemma 13.16 If all nonfaulty processors prefer v at the beginning of phase k, then they all prefer v at the end of phase k, for all k, 1 ≤ k ≤ f + 1.

Proof Since all nonfaulty processors prefer v at the beginning of phase k, they all receive at least n − f copies of v (including their own) in the first round of phase k. Since n > 4f, n − f > n/2 + f, implying that all nonfaulty processors will prefer v at the end of phase k.

    Consensus-with-Byzantine-failures

Code for processor pi, 0 ≤ i ≤ n − 1.
Initially pref[i] = x and pref[j] = v⊥, for any j ≠ i

round 2k − 1, 1 ≤ k ≤ f + 1
 1 send ⟨pref[i]⟩ to all processors
 2 receive ⟨vj⟩ from pj and assign to pref[j], for all 0 ≤ j ≤ n − 1, j ≠ i
 3 let maj be the majority value of pref[0], ..., pref[n − 1] (v⊥ if none)
 4 let mult be the multiplicity of maj

round 2k, 1 ≤ k ≤ f + 1
 5 if i = k
 6   then send ⟨maj⟩ to all processors
 7 receive ⟨king-maj⟩ from pk (v⊥ if none)
 8 if mult > n/2 + f
 9   then pref[i] ← maj
10   else pref[i] ← king-maj
11 if k = f + 1
12   then y ← pref[i]

This implies the validity condition: if they all start with the same input v, they will continue to prefer v and finally decide on v in phase f + 1. Agreement is achieved by the king breaking ties. Since each phase has a different king and there are f + 1 phases, at least one phase has a nonfaulty king.
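
The following Python sketch is, again, our own illustration rather than the book's code: it simulates the phase-king algorithm with one particular, simple Byzantine strategy (sending different values to different receivers). Real Byzantine processors may behave arbitrarily, so this is a sanity check rather than a worst-case test; phases and kings are 0-indexed here, unlike in the pseudocode.

from collections import Counter

BOT = object()   # the default value v⊥

def phase_king(inputs, byzantine):
    n, f = len(inputs), len(byzantine)
    assert n > 4 * f
    pref = list(inputs)
    for k in range(f + 1):               # phases; king of phase k is pk
        # first round: everyone sends its preference; Byzantine senders lie
        received = [[(j % 2 if i in byzantine else pref[i]) for i in range(n)]
                    for j in range(n)]   # received[j][i]: what j got from i
        maj, mult = [], []
        for j in range(n):
            value, count = Counter(received[j]).most_common(1)[0]
            maj.append(value if count > n // 2 else BOT)
            mult.append(count)
        # second round: the king broadcasts its majority value
        king_maj = 0 if k in byzantine else maj[k]
        for j in range(n):
            if j in byzantine:
                continue
            # keep own majority only if its multiplicity exceeds n/2 + f
            pref[j] = maj[j] if mult[j] > n / 2 + f else king_maj
    return [pref[j] for j in range(n) if j not in byzantine]

# n = 9, f = 2 satisfies n > 4f; all nonfaulty processors decide alike
print(phase_king([1, 0, 1, 1, 0, 1, 1, 1, 0], byzantine={4, 8}))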

Lemma 13.17 Let g be a phase whose king pg is nonfaulty. Then all nonfaulty processors finish phase g with the same preference.

Proof Suppose all nonfaulty processors use the majority value received from the king for their preference. Since the king is nonfaulty, it sends the same message to all, and hence all the nonfaulty preferences are the same.

Suppose a nonfaulty processor pi uses its own majority value v for its preference. Thus pi receives more than n/2 + f messages for v in the first round of phase g. Hence every processor, including pg, receives more than n/2 messages for v in the first round of phase g and sets its majority value to v. Hence every nonfaulty processor has v for its preference.

Hence at phase g + 1 all processors have the same preference, and by Lemma 13.16 they will decide on the same value at the end of the algorithm. Hence the algorithm has the agreement property and solves consensus.

Theorem 13.18 There exists an algorithm for n processors which solves the consensus problem in the presence of f Byzantine failures within 2(f + 1) rounds using constant size messages, if n > 4f.

    13.4.6. Impossibility in asynchronous systems

As shown before, the consensus problem can be solved in synchronous systems in the presence of both crash (benign) and Byzantine (severe) failures. What about asynchronous systems? Under the assumption that the communication system is completely reliable, and the only possible failures are caused by unreliable processors, it can be shown that if the system is completely asynchronous, then there is no consensus algorithm even in the presence of only a single processor failure. The result holds even if the processors only fail by crashing. The impossibility proof relies heavily on the system being asynchronous. This result was first shown in a breakthrough paper by Fischer, Lynch and Paterson. It is one of the most influential results in distributed computing.

The impossibility holds both for shared memory systems, if only read/write registers are used, and for message passing systems. The proof first shows it for shared memory systems. The result for message passing systems can then be obtained through simulation.

Theorem 13.19 There is no consensus algorithm for a read/write asynchronous shared memory system that can tolerate even a single crash failure.

    And through simulation the following assertion can be shown.

Theorem 13.20 There is no algorithm for solving the consensus problem in an asynchronous message passing system with n processors, one of which may fail by crashing.

Note that these results do not mean that consensus can never be solved in asynchronous systems. Rather, the results mean that there are no algorithms that guarantee termination, agreement, and validity in all executions. It is reasonable to assume that agreement and validity are essential, that is, if a consensus algorithm terminates, then agreement and validity are guaranteed. In fact, there are efficient and useful algorithms for the consensus problem that are not guaranteed to terminate in all executions. In practice this is often sufficient, because the special conditions that cause non-termination may be quite rare. Additionally, since in many real systems one can make some timing assumptions, it may not be necessary to provide a solution for asynchronous consensus.

Exercises
13.4-1 Prove the correctness of the algorithm Consensus-with-Crash-Failures.
13.4-2 Prove the correctness of the consensus algorithm in the presence of Byzantine failures.
13.4-3 Prove Theorem 13.20.

    13.5. Logical time, causality, and consistent state

In a distributed system it is often useful to compute a global state that consists of the states of all processors. Having access to the global state allows us to reason about system properties that depend on all processors, for example, to be able to detect a deadlock. One may attempt to compute the global state by stopping all processors and then gathering their states to a central location. Such a method is, however, ill-suited for many distributed systems, which must continue computation at all times. This section discusses how one can compute a global state that is quite intuitive, yet consistent, in a precise sense. We first discuss a distributed algorithm that imposes a global order on the instructions of processors. This algorithm creates the illusion of a global clock available to processors. Then we introduce the notion of one instruction causally affecting another instruction, and an algorithm for computing which instruction affects which. The notion turns out to be very useful in defining a consistent global state of a distributed system. We close the section with distributed algorithms that compute a consistent global state of a distributed system.

    13.5.1. Logical time

The design of distributed algorithms is easier when processors have access to a (Newtonian) global clock, because then each event that occurs in the distributed system can be labelled with the reading of the clock, processors agree on the ordering of any events, and this consensus can be used by algorithms to make decisions. However, the construction of a global clock is difficult. There exist algorithms that approximate the ideal global clock by periodically synchronising drifting local hardware clocks. However, it is possible to totally order events without using hardware clocks. This idea is called the logical clock.

Recall that an execution is an interleaving of instructions of the n programs. Each instruction can be either a computational step of a processor, or sending a message, or receiving a message. Any instruction is performed at a distinct point of global time. However, the reading of the global clock is not available to processors. Our goal is to assign values of the logical clock to each instruction, so that these values appear to be readings of the global clock. That is, it is possible to postpone or advance the instants when instructions are executed in such a way that each instruction x that has been assigned a value tx of the logical clock is executed exactly at the instant tx of the global clock, and that the resulting execution is a valid one, in the sense that it can actually occur when the algorithm is run with the modified delays.

The Logical-Clock algorithm assigns a logical time to each instruction. Each processor has a local variable called counter. This variable is initially zero and it gets incremented every time the processor executes an instruction. Specifically, when a processor executes any instruction other than sending or receiving a message, the variable counter gets incremented by one. When a processor sends a message, it increments the variable by one, and attaches the resulting value to the message. When a processor receives a message, the processor retrieves the value attached to the message, calculates the maximum of this value and the current value of counter, increments the maximum by one, and assigns the result to the counter variable. Note that every time an instruction is executed, the value of counter is incremented by at least one, and so it grows as the processor keeps on executing instructions. The value of logical time assigned to instruction x is defined as the pair (counter, id), where counter is the value of the variable counter right after the instruction has been executed, and id is the identifier of the processor. The values of logical time form a total order, where pairs are compared lexicographically. This logical time is also called Lamport time. We define tx to be the quotient counter + 1/(id + 1), which is an equivalent way to represent the pair.
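
A direct transcription into code may clarify the bookkeeping. The sketch below is our own; the method names local_step, send, and receive are illustrative, not part of the text.

class LamportClock:
    def __init__(self, ident):
        self.id = ident
        self.counter = 0

    def timestamp(self):
        # logical time of the instruction just executed: the pair
        # (counter, id), compared lexicographically
        return (self.counter, self.id)

    def local_step(self):
        self.counter += 1            # any instruction increments counter
        return self.timestamp()

    def send(self):
        # increment, then attach the resulting value to the message
        self.counter += 1
        return self.counter

    def receive(self, attached):
        # take the maximum of the attached value and counter, plus one
        self.counter = max(attached, self.counter) + 1
        return self.timestamp()

# usage: p0 sends to p1; the receive is ordered after the send
p0, p1 = LamportClock(0), LamportClock(1)
t_send = p0.send()                   # p0.counter becomes 1
p1.local_step()                      # some unrelated instruction at p1
t_recv = p1.receive(t_send)          # p1.counter = max(1, 1) + 1 = 2
assert (p0.counter, p0.id) < t_recv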

Remark 13.21 For any execution, logical time satisfies three conditions:
(i) if an instruction x is performed by a processor before an instruction y is performed by the same processor, then the logical time of x is strictly smaller than that of y,
(ii) any two distinct instructions of any two processors get assigned different logical times,


(iii) if instruction x sends a message and instruction y receives this message, then the logical time of x is strictly smaller than that of y.

Our goal now is to argue that the logical clock provides to processors the illusion of a global clock. Intuitively, the reason why such an illusion can be created is that we can take any execution of a deterministic algorithm, compute the logical time tx of each instruction x, and run the execution again, delaying or speeding up processors and messages in such a way that each instruction x is executed at the instant tx of the global clock. Thus, without access to a hardware clock or other external measurements not captured in our model, the processors cannot distinguish the reading of the logical clock from the reading of a real global clock. Formally, the reason why the re-timed sequence is a valid execution that is indistinguishable from the original execution is summarised in the subsequent corollary, which follows directly from Remark 13.21.

Corollary 13.22 For any execution α, let T be the assignment of logical time to instructions, and let β be the sequence of instructions ordered by their logical time in α. Then for each processor, the subsequence of instructions executed by the processor in α is the same as the subsequence in β. Moreover, each message is received in β after it is sent in β.

    13.5.2. Causality

In a system execution, an instruction can affect another instruction by altering the state of the computation in which the second instruction executes. We say that one instruction can causally affect (or influence) another, if the information that one instruction produces can be passed on to the other instruction. Recall that in our model of a distributed system, each instruction is executed at a distinct instant of global time, but processors do not have access to the reading of the global clock. Let us illustrate causality. If two instructions are executed by the same processor, then we could say that the instruction executed earlier can causally affect the instruction executed later, because it is possible that the result of executing the former instruction was used when the later instruction was executed. We stress the word possible, because in fact the later instruction may not use any information produced by the former. However, when defining causality, we simplify the problem of capturing how processors influence other processors, and focus on what is possible. If two instructions x and y are executed by two different processors, then we could say that instruction x can causally affect instruction y, when the processor that executes x sends a message when or after executing x, and the message is delivered before or during the execution of y at the other processor. It may also be the case that influence is passed on through intermediate processors or multiple instructions executed by processors, before reaching the second processor.
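
As a sketch of how such a relation could be computed from a recorded execution, one can add program-order edges and send/receive edges and close the result transitively. This is our own illustration, anticipating the formal definition given next; the trace format and the function happens_before are invented for the example.

from itertools import product

def happens_before(trace):
    """trace: list of events (proc, kind, msg_id) in global-time order;
    kind is 'step', 'send' or 'recv'. Returns the set of pairs (a, b) of
    trace indices such that event a can causally affect event b."""
    hb = set()
    last = {}                        # last event index of each processor
    sender = {}                      # msg_id -> index of its send event
    for b, (proc, kind, msg) in enumerate(trace):
        if proc in last:             # same processor: earlier affects later
            hb.add((last[proc], b))
        last[proc] = b
        if kind == 'send':
            sender[msg] = b
        elif kind == 'recv':         # a send affects the matching receive
            hb.add((sender[msg], b))
    # transitive closure: influence may pass through intermediate events
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(hb), list(hb)):
            if b == c and (a, d) not in hb:
                hb.add((a, d))
                changed = True
    return hb

trace = [(0, 'send', 'm1'), (1, 'step', None), (1, 'recv', 'm1'),
         (1, 'send', 'm2'), (2, 'recv', 'm2')]
print((0, 4) in happens_before(trace))   # True: influence via processor 1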

We will formally define the intuition that one instruction can causally affect another in terms of a relation called happens before, which relates pairs of instructions. The relation is defined for a given execution, i.e., we fix a sequence of instructions executed by the algorithm and the instants of the global clock when the instructions were executed, and define which pairs of instructions are related by the happens before relation. The relation is introduced in two steps. If instructions x and y are executed by the same processor, t