
PG DisDaS Seminar - Heinz Nixdorf Institut: Seminar of the Project Group DisDaS, Distributed Data Streams, WS 2015/16. Alexander Mäcker, Manuel Malatyali, Sören Riechers and Friedhelm Meyer auf der Heide



Seminar of the Project Group

DisDaS

Distributed Data Streams

WS 2015/16

Alexander Mäcker, Manuel Malatyali, Sören Riechers and Friedhelm Meyer auf der Heide


Contact

Friedhelm Meyer auf der Heide
Heinz Nixdorf Institute
University of Paderborn
Fürstenallee 11
33102 Paderborn
Germany

email: [email protected]


Participants of the Seminar

Seminar: Pascal Bemmann
Advisor: Sören Riechers

Seminar: Steffen Knorr
Advisor: Alexander Mäcker

Seminar: Felix Biermeier
Advisors: Alexander Mäcker, Sören Riechers

Seminar: Till Knollmann
Advisor: Manuel Malatyali

Seminar: Arne Kemper
Advisor: Manuel Malatyali

Seminar: Jannik Sundermeier
Advisor: Alexander Mäcker

Seminar: Jan Bürmann
Advisor: Manuel Malatyali

Seminar: Nils Kothe
Advisor: Sören Riechers


Contents

Interval Selection in the streaming model (Pascal Bemmann)
  1 Introduction
  2 Basics and Definitions
    2.1 Intervals
    2.2 Sampling
  3 H-random samples
  4 A 2-approximation algorithm
  5 Estimating the size of an optimal solution
    5.1 Segments
    5.2 Algorithms in the Streaming Model
  6 Same-size intervals
    6.1 Largest independent set of same-size intervals
    6.2 Size of largest independent set of same-size intervals
  7 Conclusion and other results
  References

Multi-Dimensional Online Tracking (Steffen Knorr)
  1 Introduction
    1.1 Structure of this essay
  2 Online Tracking in one dimension
    2.1 An O(log Δ)-competitive algorithm
    2.2 Lower bound on competitive ratio
  3 Online Tracking in d dimensions
    3.1 Using Tukey medians
    3.2 Using volume cutting
    3.3 Tracking a dynamic set
  4 Online Tracking with predictions
  5 Open Problems
  6 Summary / Conclusion
  References

Frequency Moments - Approximations and Space Complexity (Felix Biermeier)
  1 Introduction
  2 Approximations of frequency moments Fk
    2.1 Approximating Fk
    2.2 Improved space bound for F2
    2.3 Approximating F0
  3 Lower bounds
    3.1 Space complexity of deterministic algorithms
    3.2 Space complexity of F∞
    3.3 Space complexity of Fk
  4 Remarks
  References

Estimating Simple Functions on the Union of Data Streams (Till Knollmann)
  1 Introduction
    1.1 The Scenario and Model
    1.2 Applications of this scenario
    1.3 Related Work
    1.4 Why coordinated Sampling?
  2 Formal Preliminaries
    2.1 Independence and Random Variables
    2.2 Upper Bounds
  3 Coordinated 1-Sampling
    3.1 General Approach
    3.2 Public Coins Scheme
    3.3 Private Coins Scheme
  4 Extensions and Applications
    4.1 Other Scenarios
    4.2 General Values
    4.3 Estimation of F0
    4.4 Set resemblance
  5 Summary / Conclusion
  References

On Graph Problems in a Semi-streaming Model (Arne Kemper)
  1 Introduction
  2 Preliminaries
  3 Graph Matching
    3.1 Unweighted Bipartite Matching
    3.2 Weighted Matching
    3.3 Lower Bounds
  4 Distances and further problems
  5 Conclusion
  References

The Count-Min-Sketch and its Applications (Jannik Sundermeier)
  1 Problem Setup
    1.1 The Scenario
    1.2 Update Semantics
    1.3 Query types
  2 Preliminaries
    2.1 Linearity of Expectation
    2.2 The Markov inequality
    2.3 Chernoff bound(s)
    2.4 Pairwise-Independent Hash Functions
  3 The Count-Min Sketch
    3.1 Idea
    3.2 Construction
    3.3 Example
  4 Query Estimation
    4.1 Point Query
    4.2 Inner Product Query
    4.3 Range Query
    4.4 φ-Quantiles
    4.5 Heavy Hitters
  5 Summary / Conclusion
  References

Palindrome Recognition In The Streaming Model (Jan Bürmann)
  1 Introduction
    1.1 Terminology
    1.2 (KR-)Fingerprints
    1.3 Results
  2 ApproxSqrt
    2.1 Simple ApproxSqrt
    2.2 Space Efficient ApproxSqrt
    2.3 Variant of ApproxSqrt
  3 Exact and ApproxLog
    3.1 Exact
    3.2 ApproxLog
  4 Conclusion
  References

Randomized Algorithms for Tracking Distributed Count, Frequencies, and Ranks (Nils Kothe)
  1 Introduction
    1.1 The model
    1.2 Declaration of the problems, previous and new results
  2 Tracking Distributed Count
    2.1 The algorithm and upper bound
    2.2 The lower bound
  3 Tracking Distributed Frequencies
    3.1 The algorithm and upper bound
    3.2 Communication space trade-off
  4 Tracking Distributed Ranks
    4.1 The basic algorithm
    4.2 The algorithm C
    4.3 Upper space and communication bound
    4.4 Estimation of the rank
  References


Interval Selection in the streaming model

Pascal Bemmann

Abstract In the interval selection problem we are given a set of intervals via a stream and want to find a maximum set of pairwise independent intervals. Let α(I) denote the size of an optimal solution. We present the results of S. Cabello and P. Pérez-Lantero for estimating α(I) in the streaming model, where only one pass over the data is allowed, the endpoints of the intervals lie within the range 1, . . . , n, and the memory is constrained.

For intervals of potentially different sizes we provide an algorithm that computes an estimate α̂ of α(I) such that (1/2)·(1 − ε)·α(I) ≤ α̂ ≤ α(I) holds with probability at least 2/3. For same-sized intervals we explain an algorithm that computes an estimate α̂ of α(I) for which (2/3)·(1 − ε)·α(I) ≤ α̂ ≤ α(I) holds with probability at least 2/3. The required space is polynomial in ε⁻¹ and log n.

We also present approximation algorithms for the interval selection problem which use O(α(I)) space and which are used in the mentioned estimates.

1 Introduction
2 Basics and Definitions
  2.1 Intervals
  2.2 Sampling
3 H-random samples
4 A 2-approximation algorithm
5 Estimating the size of an optimal solution
  5.1 Segments
  5.2 Algorithms in the Streaming Model
6 Same-size intervals
  6.1 Largest independent set of same-size intervals
  6.2 Size of largest independent set of same-size intervals
7 Conclusion and other results
References

1 Introduction

In this work we will present results developed by Cabello and Pérez-Lantero [1]. We consider problems in the streaming model: huge data sets arrive sequentially, and the task is to solve problems with limited memory. We will assume that we are not able to look at input items again unless we have saved them in our memory; this can also be described as using only one pass over the input. Furthermore we assume that we observe an input stream that is too big to be stored as a whole in our memory.

Pascal Bemmann, Universität Paderborn, Warburger Str. 100, 33098 Paderborn, e-mail: [email protected]



Within this model we will analyze the interval selection problem. As input we get intervals within a predetermined range; the task is to find the biggest set of intervals that are pairwise disjoint while using the memory as efficiently as possible. Another problem that arises is to estimate only the size of an optimal solution without providing the actual set of intervals. Both of these problems will be analyzed in the following chapters.

Note that the interval selection problem is a generalization of the distinct elements problem. This fundamental problem deals with the task of computing the number of pairwise different elements of a data stream: if both endpoints of each input interval are equal, the task becomes counting the number of different points (elements).

We will start by providing general definitions and tools that we use to approach the interval selection problem. After that we will sketch how to design a 2-approximation algorithm in the mentioned setting. This algorithm will be used together with other general results to construct an algorithm estimating the size of an optimal solution for the interval selection problem. At the end we will outline how the presented results can be improved if we assume that all input intervals are of the same size.

2 Basics and Definitions

In this section we provide definitions and useful tools which we will use in later proofs. To shorten notation we will use [n] to denote the set of integers {1, . . . , n}. Also, we will assume for all later constructions that 0 < ε < 1/2 holds.

Definition 1 (Interval selection problem). The interval selection problem is defined as follows: given a set I of (input) intervals, the task is to find a largest set of intervals that are pairwise disjoint. These intervals are also called independent. α(I) denotes the size of an optimal solution for this problem.

Another problem that arises from this definition is to estimate α(I) without outputting an independent subset of the input intervals. We will consider both of these problems in this work.
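When the whole input fits into memory, α(I) itself can be computed exactly by the classical greedy rule: repeatedly take the interval with the smallest right endpoint that is disjoint from everything chosen so far. A minimal offline sketch in Python, useful as a baseline for the streaming algorithms below (the function name is ours, not from [1]):

```python
def alpha(intervals):
    """Offline baseline: size of a largest set of pairwise disjoint closed
    intervals, via the classical greedy by smallest right endpoint."""
    count, last_right = 0, float("-inf")
    for left, right in sorted(intervals, key=lambda iv: iv[1]):
        if left > last_right:        # disjoint from the last chosen interval
            count += 1
            last_right = right
    return count

print(alpha([(1, 3), (2, 5), (4, 8), (9, 9)]))  # 3, e.g. [1,3], [4,8], [9,9]
```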

2.1 Intervals

We consider the input intervals of the interval selection problem to be closed. Intervals constructed during the algorithm in Section 4 will be called windows to distinguish them from the input intervals. For the same reason we use the term segment for an interval used in the segment tree in Section 5.1. We say that an interval I = [x, y] is contained in another interval I′ if both endpoints x, y are elements of I′.

Definition 2 (Leftmost and Rightmost intervals). Given a window W and a set of input intervals I, we define:

• Leftmost(W) is the interval with the smallest right endpoint among the intervals of I contained in W. We use the left endpoint as a tiebreaker, choosing the interval with the largest left endpoint.

• Rightmost(W) is the interval with the largest left endpoint among the intervals of I contained in W. We use the right endpoint as a tiebreaker, choosing the interval with the smallest right endpoint.

In case W contains no input interval, both Leftmost(W) and Rightmost(W) are undefined.

Page 11: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Interval Selection in the streaming model 3

If W contains just a single interval I ∈ I, it holds that Leftmost(W) = Rightmost(W). Also, the intersection of all input intervals contained in W is equal to Leftmost(W) ∩ Rightmost(W); anything else would contradict the definition of Leftmost(W) or Rightmost(W). It will be clear from context to which set of input intervals Leftmost(W) and Rightmost(W) refer.

2.2 Sampling

Definition 3 (ε-min-wise independence). A family of permutations H ⊂ {h : [n] → [n]} is ε-min-wise independent if

∀X ⊂ [n] and ∀y ∈ X : (1 − ε)/|X| ≤ Pr_{h∈H}[h(y) = min h(X)] ≤ (1 + ε)/|X|.

Note that we will only look at proper subsets of the set of all permutations on [n]: if we examine the set of all permutations, h(y) is distributed uniformly on [n] and it follows that ε = 0. Based on this definition we can use the results of Indyk [2] to obtain a family of permutations with the properties we will need later.

Lemma 1. For every ε ∈ (0, 1/2) and n > 0 there exists a family of permutations H(n, ε) ⊂ {h : [n] → [n]} with the following properties:

(i) H(n, ε) has n^{O(log(1/ε))} permutations,
(ii) H(n, ε) is ε-min-wise independent,
(iii) an element of H(n, ε) can be chosen uniformly at random in O(log(1/ε)) time,
(iv) for h ∈ H(n, ε) and x, y ∈ [n], we can decide with O(log(1/ε)) arithmetic operations whether h(x) < h(y).

Proof. We will just provide a rough idea of how to prove the above properties. The results of Indyk [2] grant a family H′ of ε′-min-wise independent hash functions, with ε′ depending on ε and some constant factors. It can be shown that each hash function h′ ∈ H′ can be used to create an ε-min-wise independent permutation using the lexicographic order of the pairs (h′(i), i) over all i ∈ [n]. Standard constructions over finite fields grant a family of hash functions satisfying conditions (i), (iii) and (iv); transforming these hash functions into permutations with the above approach grants property (ii). □
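To illustrate the transformation used in this proof sketch, the following Python snippet turns a hash function h′ into a permutation of [n] via the lexicographic order of the pairs (h′(i), i). The linear hash family below is only a stand-in for Indyk's construction; it does not by itself guarantee ε-min-wise independence.

```python
import random

def random_linear_hash(p=2**31 - 1):
    """Stand-in hash function h'(i) = (a*i + b) mod p; Indyk's construction
    would supply a suitably independent family here."""
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda i: (a * i + b) % p

def permutation_from_hash(n, h_prime):
    """Permutation of [n] = {1, ..., n} induced by sorting lexicographically
    by (h'(i), i); ties in h' are broken by i, so this is a bijection."""
    order = sorted(range(1, n + 1), key=lambda i: (h_prime(i), i))
    return {element: rank for rank, element in enumerate(order, start=1)}

h = permutation_from_hash(16, random_linear_hash())
print(min(range(1, 17), key=h.get))  # the h-minimal element of [16]
```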

3 H-random samples

We will use the result of the previous lemma to obtain H-random samples. These are elements chosen nearly uniformly at random, while we still maintain some characteristic information about these samples. The general idea is based on the work of Datar and Muthukrishnan [4]. We consider a fixed subset X ⊂ [n] and a family of permutations H = H(n, ε) as stated in Lemma 1. To obtain an H-random element s of X we choose a permutation h ∈ H uniformly at random and set s = arg min{h(x) | x ∈ X}. It is important that we do not choose s completely uniformly at random. With the definition of ε-min-wise independence we obtain that

∀x ∈ X : (1 − ε)/|X| ≤ Pr[s = x] ≤ (1 + ε)/|X|.

This follows from the observation that if we fix h, it holds that h(x) = h(y) ⇔ x = y. Moreover, with Pr[s ∈ Y] = Σ_{y∈Y} Pr[s = y] we can conclude that

Page 12: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

4 Pascal Bemmann

∀Y ⊂ X : (1 − ε)·|Y|/|X| ≤ Pr[s ∈ Y] ≤ (1 + ε)·|Y|/|X|.   (1)

This gives us the opportunity to estimate the ratio |Y|/|X| for a fixed Y: we keep computing H-random samples from X and count how many of the samples are elements of Y; the probability that an H-random sample is an element of Y corresponds to the ratio between |Y| and |X|.
Furthermore, H-random samples can be maintained during the stream: after choosing h ∈ H uniformly at random, we check for each new element a of the stream whether h(a) < h(s) holds; if so, a is the new minimum of X and we update s = a. We will also use H-random samples for conditional sampling, where we sample elements until we obtain an element satisfying certain properties.
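A minimal sketch of this maintenance rule (the class and names are ours; h stands for the permutation drawn from H):

```python
class HRandomSample:
    """Maintains s = argmin{ h(x) | x in X } for the set X of elements seen
    so far in the stream; with h drawn from an eps-min-wise independent
    family, s is a nearly uniform sample of X."""

    def __init__(self, h):
        self.h = h          # permutation h : [n] -> [n], chosen u.a.r. from H
        self.sample = None  # current H-random sample s

    def update(self, a):
        # a new stream element replaces s whenever h(a) < h(s)
        if self.sample is None or self.h(a) < self.h(self.sample):
            self.sample = a
```

Running many independent copies and counting how often the maintained sample lands in Y then estimates the ratio |Y|/|X| via inequality (1). To analyze the later results we will need the following observation.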

Lemma 2. Let Y ⊂ X ⊂ [n] and ε ∈ (0, 1/2). Consider a family of permutations H = H(n, ε) with the properties of Lemma 1 and an H-random sample s from X. Then

∀y ∈ Y : (1 − 4ε)/|Y| ≤ Pr[s = y | s ∈ Y] ≤ (1 + 4ε)/|Y|.

Proof. Fix an arbitrary y ∈ Y. With s = y ⇒ s ∈ Y and the considerations above we observe

Pr[s = y | s ∈ Y] = Pr[s = y and s ∈ Y] / Pr[s ∈ Y] = Pr[s = y] / Pr[s ∈ Y] ≤ ((1 + ε)/|X|) / ((1 − ε)·|Y|/|X|) = ((1 + ε)/(1 − ε)) · (1/|Y|) ≤ (1 + 4ε) · (1/|Y|).

Here the last inequality follows from

(1 + ε)/(1 − ε) ≤ 1 + 4ε ⇔ 1 + ε ≤ (1 + 4ε)(1 − ε) = 1 + 3ε − 4ε² ⇔ 0 ≤ 2ε − 4ε²

and 2ε − 4ε² = ε(2 − 4ε) ≥ ε(2 − 4·(1/2)) = 0.

Similarly we conclude that Pr[s = y | s ∈ Y] ≥ (1 − 4ε) · (1/|Y|). □

4 A 2-approximation algorithm

The goal of this section is to construct a 2-approximation algorithm for the interval selection problem using O(α(I)) space.
The algorithm maintains a set W which is a partition of the real line. We call the elements of W windows; these are intervals whose endpoints may each be included or excluded. More specifically, all elements of W are pairwise disjoint and the union of all elements of W is the whole of R. We formalize this desired set and its resulting properties in the next lemma.

Lemma 3. Let I be a set of intervals and let W be a partition of the real line with the following properties:

• Each window of W contains at least one interval from I.
• For each window W ∈ W, the intervals of I contained in W pairwise intersect.

Let J be any set of intervals constructed by selecting, for each window W of W, one interval of I contained in W. Then |J| > α(I)/2.

Page 13: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Interval Selection in the streaming model 5

Fig. 1 At the bottom is a partition of the real line. Filled circles represent included endpoints; empty circles excluded endpoints. At the top we split an optimal solution J∗ (marked in blue) into J∗⊂ and J∗∩.

Proof. Consider a partition W of the real line with the above properties. To shorten notation we set k = |W|. With J∗ ⊂ I we refer to an optimal solution of the interval selection problem; by definition |J∗| = α(I) holds. For the further investigation we split J∗ into two disjoint sets J∗⊂ and J∗∩; for an example see Figure 1.
J∗⊂ is the set of intervals which are fully contained in some window of W. By construction all intervals contained in a window W of W pairwise intersect, therefore at most one interval of J∗⊂ is contained in each window. With W having k elements we obtain |J∗⊂| ≤ k.
Every interval which intersects at least two successive windows of W is contained in J∗∩; these intervals cannot be elements of J∗⊂. Since the k windows of W are separated by k − 1 endpoints, |J∗∩| ≤ k − 1 holds. Since each element of J∗ is either contained in J∗⊂ or in J∗∩, we combine the above results to obtain

α(I) = |J∗| = |J∗⊂| + |J∗∩| ≤ k + k − 1 = 2k − 1.

Since J is constructed by choosing one interval from each window and |J| = k, we conclude that

2·|J| = 2k > 2k − 1 ≥ α(I),

which completes the proof. □

We will now present the general idea of an algorithm that maintains such a partition throughout the stream; for an example of such a partition see Figure 2.
The overall goal is to partition the real line while storing Leftmost(W) and Rightmost(W) for all windows W ∈ W. When receiving the first input interval I0 of the stream we set W = {R} and Leftmost(R) = Rightmost(R) = I0. At this point Lemma 3 holds.
Now we show that we can insert new intervals while keeping the above conditions. Let I be a new interval of the stream. If I is not contained in any window of W, no update is needed: I will be disregarded by the algorithm, because we choose our final intervals from the set of intervals contained in a window of W. Otherwise I is contained in some window W and we distinguish two cases.

• If I intersects all intervals in W, we check whether we have to update Leftmost(W) or Rightmost(W).
• Otherwise we have to split window W into two windows W1 and W2. If both endpoints of I are bigger than Leftmost(W) ∩ Rightmost(W), we use the right endpoint of Leftmost(W) as the splitting value; then we set W1 to the part containing Leftmost(W) and W2 to the part containing I. If both endpoints of I are smaller than Leftmost(W) ∩ Rightmost(W), we use the same approach with Rightmost(W) instead of Leftmost(W).

With these operations we ensure that our partition satisfies Lemma 3. The formal proof showing that these instructions maintain a partition satisfying Lemma 3 is a simple case distinction using inductive arguments.

Page 14: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

6 Pascal Bemmann

By storing W in a dynamic binary search tree, the algorithm needs O(|W|) = O(α(I)) space, and all operations within this tree can be handled in O(log|W|) = O(log α(I)) time. By choosing one arbitrary input interval from each window we end up with a 2-approximate solution for the interval selection problem.
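The update rule can be sketched as follows in Python. This is our simplification, not the paper's pseudocode: endpoint inclusion/exclusion at split points is glossed over by assuming distinct endpoint values, and after a split each new window keeps a single representative interval; the full algorithm performs a more careful case analysis.

```python
import bisect

class TwoApprox:
    """Sketch of the 2-approximation from Section 4: maintains a partition
    of the real line into windows, storing per window only the pair
    (Leftmost, Rightmost) of contained input intervals."""

    def __init__(self):
        self.cuts = []   # sorted split points defining the partition of R
        self.reps = []   # reps[i] = (leftmost, rightmost) of window i

    def insert(self, iv):
        x, y = iv
        if not self.reps:                        # first interval: W = {R}
            self.reps = [(iv, iv)]
            return
        i = bisect.bisect_left(self.cuts, x)
        if i != bisect.bisect_left(self.cuts, y):
            return                               # iv not contained in one window
        lm, rm = self.reps[i]
        core_lo, core_hi = rm[0], lm[1]          # Leftmost(W) ∩ Rightmost(W)
        if x <= core_hi and y >= core_lo:        # iv intersects all intervals in W
            if (y, -x) < (lm[1], -lm[0]):
                lm = iv                          # new Leftmost(W)
            if (x, -y) > (rm[0], -rm[1]):
                rm = iv                          # new Rightmost(W)
            self.reps[i] = (lm, rm)
        elif x > core_hi:                        # iv lies right of the core: split
            self.cuts.insert(i, lm[1])           # cut at right endpoint of Leftmost
            self.reps[i:i + 1] = [(lm, lm), (iv, iv)]
        else:                                    # symmetric case, left of the core
            self.cuts.insert(i, rm[0])
            self.reps[i:i + 1] = [(iv, iv), (rm, rm)]

    def solution(self):
        return [lm for lm, _ in self.reps]       # one interval per window
```

Calling solution() returns one stored interval per window, which by Lemma 3 is a set of size greater than α(I)/2.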

Fig. 2 Maintaining a partition of the real line in the 2-approximation algorithm.

5 Estimating the size of an optimal solution

In this section we want to estimate the size of an optimal solution of the interval selection problem. To achieve this we split the interval [1, n] into segments and apply the 2-approximation algorithm from Section 4 to each of these segments. We construct our segments in such a way that each segment contains neither too many nor too few input intervals. We start by presenting results independent of the streaming model; after that we explain how to use them in algorithms in the streaming model.

5.1 Segments

For our overall construction we use a segment tree T. This is a balanced binary tree on the segments [i, i + 1) with i ∈ [n]. Each leaf of T corresponds to a segment [i, i + 1) for some i, including the left endpoint and excluding the right endpoint. Note that the order of the leaves in the tree is the same as the order of the corresponding segments on the real line. For any inner node v of T the corresponding segment S(v) is the (disjoint) union of the segments of v's children. Then for the root node r the corresponding segment S(r) is equal to [1, n + 1). With S we denote the set of segments corresponding to the nodes of T. Since T is a balanced binary tree, the size of S is 2n − 1. For an example see Figure 3.

To refer to the parent of a segment S ∈ S with S ≠ S(r) we use π(S): the segment corresponding to the parent node of the node v for which S(v) = S holds.

For the upcoming constructions we denote by β(S) the size of the largest independent subset if we only consider input intervals which are contained in S ∈ S. Analogously we use β̂(S) for the size of the solution computed by the 2-approximation algorithm of Section 4 applied only to input intervals which are contained in S. From these definitions we directly conclude that

∀S ∈ S : β(S) ≥ β̂(S) ≥ β(S)/2.   (2)

Page 15: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Interval Selection in the streaming model 7

The next lemma tells us that if we restrict the 2-approximation algorithm to segments of S with certain properties and apply it to the input intervals contained in them, we obtain a (1/2 − ε)-approximation of the size of an optimal solution.

Fig. 3 Segment tree for n = 8.

Lemma 4. Let S′ ⊂ S be such that

(i) S(r) is the disjoint union of the segments in S′ and
(ii) for each S ∈ S′ it holds that β(π(S)) ≥ 2ε⁻¹⌈log n⌉.

Then

α(I) ≥ Σ_{S∈S′} β̂(S) ≥ (1/2 − ε)·α(I).

Proof. Consider a set S′ with the above properties. Then we can merge the solutions produced by the 2-approximation algorithm applied independently to each S ∈ S′; no input interval is chosen multiple times because the segments in S′ are disjoint. With inequality (2) the first inequality follows:

α(I) ≥ Σ_{S∈S′} β(S) ≥ Σ_{S∈S′} β̂(S).

To obtain the second inequality we look at the set S̄ of leafmost elements in the set of parents {π(S) | S ∈ S′}. This means that each element of S̄ has a child in S′ but no descendant which is an element of S̄. By definition, for each segment S ∈ S′ there exists an S̄ ∈ S̄ such that the parent of S is on the path Π_T(S̄) in T from the root to S̄; otherwise we would have found a leafmost parent node which is not an element of S̄. For each of these S̄ ∈ S̄, using (ii), it holds that β(S̄) ≥ 2ε⁻¹⌈log n⌉.

Now we want to link these considerations to an optimal solution J∗ ⊂ I for the interval selection problem. Assume we are given such a solution. Then for each segment S̄ ∈ S̄, J∗ contains at most two intervals that intersect S̄ but are not completely contained in S̄; otherwise at least two intervals of the optimal solution would intersect, which leads to a contradiction. Therefore we can conclude for all S̄ ∈ S̄

|{J ∈ J∗ | J ∩ S̄ ≠ ∅}| ≤ |{J ∈ J∗ | J ⊂ S̄}| + 2 ≤ β(S̄) + 2.   (3)

Per definition the segments in S̄ are pairwise disjoint, as otherwise they would not be leafmost parent nodes. Hence we can join the single solutions restricted to segments of S̄ into a solution for the whole input. Together with (ii) we obtain

|J∗| ≥ Σ_{S̄∈S̄} β(S̄) ≥ |S̄| · 2ε⁻¹⌈log n⌉.   (4)

The maximum path length in T is ⌈log n⌉ + 1 since T is a balanced tree. Then for all S̄ ∈ S̄ the path from the root to S̄ has at most ⌈log n⌉ vertices, because S̄ is a parent node. Each S ∈ S′ has a parent on the path from the root to some S̄ ∈ S̄. With the fact that each node on such a path has at most two children and by rearranging (4) to |S̄| we obtain

|S′| ≤ 2⌈log n⌉ · |S̄| ≤ 2⌈log n⌉ · |J∗| / (2ε⁻¹⌈log n⌉) = ε·|J∗|.

Since the segments of S′ form a disjoint union of S(r), we can conclude with (3) and the bound on |S′| that

|J∗| ≤ Σ_{S∈S′} |{J ∈ J∗ | J ∩ S ≠ ∅}| ≤ Σ_{S∈S′} (β(S) + 2) ≤ 2·|S′| + Σ_{S∈S′} β(S) ≤ 2ε·|J∗| + Σ_{S∈S′} β(S).

Because β̂(S) is the result of a 2-approximation and |J∗| = α(I), the next chain of inequalities proves the second inequality of the lemma:

|J∗| ≤ 2ε·|J∗| + Σ_{S∈S′} β(S) ≤ 2ε·|J∗| + Σ_{S∈S′} 2·β̂(S)

⇒ (1 − 2ε)·|J∗| ≤ Σ_{S∈S′} 2·β̂(S)

⇒ (1/2 − ε)·|J∗| ≤ Σ_{S∈S′} β̂(S). □

The next goal is to find a set which satisfies the properties of Lemma 4. To determine whether a segment S belongs to this set S′ we want to use only local information, which does not require knowledge about other segments, in order to minimize our space requirements. The value β̂(S) of the 2-approximation algorithm on a segment is not suitable for this task, because for some segment S ∈ S \ {S(r)} it is possible that β̂(π(S)) < β̂(S) holds, which would cause problems in our overall construction. Instead we define another estimate which is monotone nondecreasing along paths to the root. In particular, we define for each segment S ∈ S

γ(S) = |{S′ ∈ S | S′ ⊂ S and ∃I ∈ I s.t. I ⊂ S′}|,

the number of segments of S that are contained in S and contain at least one input interval. This number corresponds to the nodes in the segment tree that are descendants of S and whose segments contain some input interval. For this estimate we can prove the following properties.

Lemma 5. For all S ∈ S, we have the following properties:

(i) γ(S) ≤ γ(π(S)), if S ≠ S(r),
(ii) γ(S) ≤ β(S) · ⌈log n⌉,
(iii) γ(S) ≥ β(S) and
(iv) γ(S) can be computed in O(γ(S)) space using the portion of the stream after the first interval contained in S.

Proof. The first statement follows immediately from the definition of the segment tree, because each parent node contains all input intervals which are contained in its children.
To prove the remaining properties we fix some S ∈ S and define

S′ := {S′ ∈ S | S′ ⊂ S and ∃I ∈ I : I ⊂ S′},

the set of segments contained in S which themselves contain at least one input interval. These segments are associated with descendants of S in the segment tree. Let T_S be the subtree with root S. Since T is a balanced tree, T_S has at most ⌈log n⌉ levels. Due to the fact that γ(S) is exactly the size of S′, we can use the pigeonhole principle to conclude that there has to exist a level L of T_S which contains at least γ(S)/⌈log n⌉ pairwise distinct elements of S′. All these segments are disjoint because they are on the same level of the segment tree. This means we can pick an arbitrary input interval from each of these segments to obtain an independent subset of the input intervals, resulting in β(S) ≥ γ(S)/⌈log n⌉. Rearranging grants the second property.

To prove (iii) we consider an optimal solution J∗ constrained to S. Each interval J from J∗ is contained in some segment of S′. Let S(J) be the smallest of the segments containing J. Then J contains the middle point of S(J): otherwise we could split S(J) in half and choose the half containing J, which would be smaller than S(J). Therefore the segments S(J) are pairwise distinct for all J ∈ J∗, or else two input intervals of the optimal solution would intersect. Note that these segments do not have to be disjoint, as their associated nodes might be on different levels of the segment tree. With these minimal segments being elements of S′ we can conclude that γ(S) = |S′| ≥ |{S(J) | J ∈ J∗}| = |J∗|. The property follows with |J∗| = β(S).

We can use a binary search tree to prove the fourth property. For a new input interval I we check whether our tree already contains the segments which are contained in S and contain I; if not, we add those segments to the structure. Then γ(S) corresponds to the number of stored segments and can be computed by traversing the tree. The necessary space for such a tree is proportional to the number of elements stored, resulting in O(γ(S)). □

Equipped with this estimate we can define a certain type of segments which will help us to find a set satisfying Lemma 4.

Definition 4. A segment S ∈ S with S ≠ S(r) is relevant if

(i) γ(π(S)) ≥ 2ε⁻¹⌈log n⌉² and
(ii) 1 ≤ γ(S) < 2ε⁻¹⌈log n⌉².

Then Srel ⊂ S denotes the set of relevant segments of S. In case Srel is empty, we set Srel = {S(r)}.

This formalizes the idea that relevant segments contain at least one input interval, but not too many. It is also guaranteed that the parents of nodes associated with relevant segments contain a certain amount of input intervals. Now we will analyze the result of applying the 2-approximation algorithm to relevant segments. Recall that β̂(S) denotes the size of the solution produced by the 2-approximation algorithm.

Lemma 6. It holds that

α(I) ≥ Σ_{S∈Srel} β̂(S) ≥ (1/2 − ε)·α(I).

Proof. If γ(S(r)) < 2ε⁻¹⌈log n⌉², then Srel is empty: by Lemma 5, γ(·) is nondecreasing along paths from the leaves to the root in T, so no node S can have a parent for which γ(π(S)) ≥ 2ε⁻¹⌈log n⌉² holds. In this case we set Srel = {S(r)}, and the above inequality follows directly from the fact that β̂(S(r)) is the size of a 2-approximation. Therefore we can assume that γ(S(r)) ≥ 2ε⁻¹⌈log n⌉²; then the root node of T is not an element of Srel.

Define S0 = {S ∈ S \ {S(r)} | γ(S) = 0 and γ(π(S)) ≥ 2ε⁻¹⌈log n⌉²}. Let S be the first node on a path from the root to a leaf for which γ(π(S)) ≥ 2ε⁻¹⌈log n⌉² and γ(S) < 2ε⁻¹⌈log n⌉² hold. If S contains any input interval it follows that S ∈ Srel, and S ∈ S0 otherwise. Therefore Srel ∪ S0 forms a disjoint union of S(r). By rearranging Lemma 5 (ii) to β(S), using the definition of relevant segments and the assumption that γ(S(r)) ≥ 2ε⁻¹⌈log n⌉², we obtain

∀S ∈ Srel ∪ S0 : β(π(S)) ≥ γ(π(S))/⌈log n⌉ ≥ 2ε⁻¹⌈log n⌉,

implying that S′ = Srel ∪ S0 satisfies Lemma 4. Because all S ∈ S0 contain no input interval by definition, it holds for them that γ(S) = β̂(S) = 0, which grants the above statement. □

Fig. 4 Active segments (dotted, marked in red) in a segment tree caused by an interval I.

Another important type of segment we need for our overall estimate is the active segment.

Denition 5. A segment S ∈ S is active if S = S(r) or its parent π(S) contains some input interval.

For an example of active segments see Figure 4. Note that every relevant segment is also an activesegment because relevant segments contain at least one input interval. Later we will use H-randomsamples to estimate the ratio between relevant and active segments.

5.2 Algorithms in the Streaming Model

In this section we use the findings of the previous section to construct algorithms estimating the number of active segments, the ratio between active and relevant segments, and the average value of β̂(S) over the relevant segments. Putting all these estimates together, we obtain an estimate for the size of an optimal solution.

To provide an estimate for the number of active segments, we first denote by σS(I) the sequence of segments that become active because of the input interval I. This sequence is ordered nonincreasing by the size of the segments; it starts with the root node, followed by nodes whose parents contain I. Since the segment of a parent node is bigger than those of its children, parent nodes appear before their children in σS(I). The length of the sequence is bounded by 2⌈log n⌉ + 1: if an interval is contained in some leaf of the segment tree, there exists a path of nodes whose associated segments contain the interval, and all children of these nodes are by definition active; since T is a balanced tree, we obtain a length of at most 2⌈log n⌉ + 1 if we add the root node as well. Note that we do not need to store the whole segment tree to compute σS, because the segments associated with nodes are independent of the input intervals. Therefore we can calculate the segments associated with the nodes of our segment tree on the fly and compute the at most 2⌈log n⌉ + 1 active nodes using only O(log n) space.
With this we can show an estimate for the number of active segments Nact.
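A sketch of this on-the-fly computation (our helper; the balanced tree is realized by midpoint splitting over [1, n + 1)):

```python
def sigma(interval, n):
    """Active segments caused by the closed input interval [x, y]: the root
    plus both children of every node whose segment contains the interval,
    emitted largest first; uses only O(log n) space."""
    x, y = interval
    lo, hi = 1, n + 1
    out = [(lo, hi)]                  # the root S(r) = [1, n+1) is always active
    while lo <= x and y < hi and hi - lo > 1:
        mid = (lo + hi) // 2
        out.append((lo, mid))         # both children of a node containing
        out.append((mid, hi))         # the interval become active
        if y < mid:
            hi = mid                  # continue in the left child
        elif x >= mid:
            lo = mid                  # continue in the right child
        else:
            break                     # interval straddles the midpoint
    return out

print(sigma((3, 3), 8))  # root [1,9), then children along the path to [3,4)
```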

Lemma 7. There is an algorithm in the data stream model that computes, in O(ε⁻² + log n) space, a value N̂act such that

Pr[|Nact − N̂act| ≤ ε·Nact] ≥ 11/12.


Proof. The stream of input intervals I = I1, I2, . . . defines the stream of active segments σ = σS(I1), σS(I2), . . ., where S is the set of segments of the balanced segment tree over [n]. Recall that this stream is O(log n) times longer than I. If we count the distinct elements in the stream σ, we obtain the number of active segments; therefore we reduce our problem to counting the distinct elements of the stream σ.
For this we can use the result of Kane, Nelson and Woodruff [3]. Their algorithm grants a (1 ± ε)-approximation for the distinct elements problem using O(ε⁻² + log|S|) = O(ε⁻² + log n) space with success probability 2/3; the space needed is proven to match the optimal space bound up to constant factors. The success probability can be improved by a standard technique: by running O(log((2/12)⁻¹)) instances of the algorithm in parallel and taking the median of their outputs, we can amplify the success probability to 1 − 1/12 = 11/12. The general idea of their algorithm is to run several constant-factor approximations in parallel while keeping several counters updated to compensate for the constant factors. With the help of their algorithm we obtain a value N̂act matching the above property. □
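The amplification step, as a sketch: run independent copies of an estimator that is within (1 ± ε) with probability at least 2/3 and return the median. The constant in the number of runs below is a crude Chernoff-style choice of ours, not taken from [3].

```python
import math
import statistics

def amplify(run_estimator, delta):
    """Median-of-runs amplification: if each independent run succeeds with
    probability >= 2/3, the median of O(log(1/delta)) runs fails with
    probability at most delta."""
    runs = math.ceil(36 * math.log(1 / delta))  # crude Chernoff-style constant
    return statistics.median(run_estimator() for _ in range(runs))

# e.g. amplify(noisy_distinct_count, delta=1/12) for the bound of Lemma 7
```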

Given an estimate for the number of active segments, we can use it to compute an estimate for the number of relevant segments, which form a subset of the active segments.

Lemma 8. There is an algorithm in the data stream model that uses O(ε⁻⁴ log⁴ n) space and computes a value N̂rel such that

Pr[|Nrel − N̂rel| ≤ ε·Nrel] ≥ 10/12.

Proof. First we estimate the number of active segments using the algorithm from Lemma 7. Using H-random samples we will be able to estimate the ratio between active and relevant segments; this can be turned into an estimate via Nrel = (Nrel/Nact)·Nact.
To ensure that the sample we take later will be representative, we need a lower bound on Nrel/Nact. In our segment tree T, each relevant segment S′ ∈ Srel has at most 2γ(S′) < 4ε⁻¹⌈log n⌉² active segments below it: a segment below S′ is active only if its parent contains some input interval, there are γ(S′) segments below S′ that contain an input interval, and each of them has at most two children; the inequality follows from the definition of a relevant segment. Also, there are at most 2⌈log n⌉ active segments whose parent is an ancestor of S′; this upper bound is attained exactly if S′ is a leaf of T. Then for each relevant segment we have at most

4ε⁻¹⌈log n⌉² + 2⌈log n⌉ ≤ 4ε⁻¹⌈log n⌉² + 2ε⁻¹⌈log n⌉² = 6ε⁻¹⌈log n⌉²

active segments. Using this upper bound we can conclude that

Nrel/Nact ≥ 1/(6ε⁻¹⌈log n⌉²) = ε/(6⌈log n⌉²),   (5)

which we will use later.

which we will use later.Now consider an arbitrary mapping b between S and [n2] that can be easily computed. Then

Lemma 1 gives us a family H = H(n2, ε) of ε-min-wise independent hash functions. Note that weuse n2 as S contains at most 2n − 1 elements to get an injective function. For a xed h ∈ H, theconcatenation h b denes an order among the elements of S. This ordering will be used togetherwith ε-min-wise independent hash function to compute H-random samples.

We choose k = d72dlog ne2/(ε3(1 − ε))e ∈ Θ(ε−3 log2 n) and pick k permutation h1, . . . , hk ∈ Huniformly and independently at random. We will see shortly why k was chosen in this specic way.Then for each permutation hj with j = 1, . . . , k our H-random sample Sj is dened as

Sj = arg minhj(b(S))|S ∈ S is active.

Therefore Sj is the active segment of S which minimizes (hj b)(·). Then Sj is approximately arandom active segment of S. It is not completely uniformly random because we want to keep some

Page 20: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

12 Pascal Bemmann

information so we can apply properties of ε-min-wise independent hash functions to it. By deningthe random variable

X = |j ∈ 1, . . . , k|Sj is relevant|we have formalized the number of H-random samples which are active. We shift the discussion howX can be compute to the end of the proof. Then Nrel/Nact is in the order of X/k. To get a boundfor X we dene

p = Pr_{hj∈H}[Sj is relevant]

and recall that every relevant segment is also an active segment. Then we can apply our result from inequality (1) in Section 3 by setting Y = Srel and X = Sact. We obtain

(1 − ε)·Nrel/Nact ≤ p ≤ (1 + ε)·Nrel/Nact.   (6)

By using the definition of k, the lower bound on p and the lower bound in (5), it holds that

kp ≥ (72⌈log n⌉²/(ε³(1 − ε))) · (1 − ε)·Nrel/Nact ≥ (72⌈log n⌉²/ε³) · ε/(6⌈log n⌉²) = 12/ε².   (7)

We can also regard each indicator [Sj is relevant] as a binary random variable. Then X is the sum of k independent random variables and it follows that E[X] = kp. Using Chebyshev's inequality, the lower bound for kp in (7) and the fact that Var[X] = kp(1 − p) for sums of k independent binary random variables, it holds that

Pr[|X/k − p| ≥ εp] = Pr[|X − kp| ≥ εkp] ≤ Var[X]/(εkp)² = kp(1 − p)/(εkp)² = (1 − p)/(kpε²) < 1/(kpε²) ≤ 1/12.   (8)

This allows us to formalize our final estimator. We use the estimator N̂act for the number of active segments as stated in Lemma 7 and define N̂rel = N̂act·(X/k). For now assume that the estimator N̂act is successful and that X/k lies within an ε-range around p; formally:

[|Nact − N̂act| ≤ ε·Nact] and [|X/k − p| ≤ ε·p].

We use these bounds together with the upper bound from (6) for our estimate N̂rel:

N̂rel = N̂act·(X/k) ≤ (1 + ε)·Nact·(1 + ε)·p ≤ (1 + ε)²·Nact·(1 + ε)·Nrel/Nact = (1 + ε)³·Nrel.

With ε < 1/2 it holds that (1 + ε)³ ≤ 1 + 7ε. This grants N̂rel ≤ (1 + 7ε)·Nrel; analogously, using the lower bounds, we can conclude that N̂rel ≥ (1 − 7ε)·Nrel.
It remains to analyze the overall success probability. This can be done via the probabilities that the estimates for Nact and p fail:

Pr[N̂rel = (1 ± 7ε)·Nrel] ≥ 1 − Pr[|Nact − N̂act| ≥ ε·Nact] − Pr[|X/k − p| ≥ ε·p] ≥ 1 − 1/12 − 1/12 = 10/12.

We can scale ε by a factor of 1/7 to obtain the claimed bound; this is a valid operation because ε stays within the range (0, 1/2).
It remains to analyze how X can be computed and how much space the estimate needs. For each j ∈ [k] we store the H-random element Sj over all active segments seen so far using hj. Moreover, we store information about the choice of hj and also about γ(Sj) and γ(π(Sj)), so we can decide whether Sj is relevant.
Let I1, I2, . . . be the stream of input intervals and σ = σS(I1), σS(I2), . . . the stream of active segments described earlier. When given a segment S of σ we have to update Sj if hj(b(S)) < hj(b(Sj)). With the segments of σS(I) ordered nonincreasing in size, we can keep γ(π(Sj)) updated, because Sj becomes active for the first time when its parent contains some input interval. Then γ(π(Sj)) > 0 and the following parts of σ can be used to compute γ(Sj) and γ(π(Sj)) using Lemma 5 (iv).
To stay within our desired space bounds we use the following trick. If at some point γ(Sj) ≥ 2ε⁻¹⌈log n⌉² violates condition (ii) for a relevant segment, we only store the fact that Sj is not relevant and nothing more; no later arriving input interval could possibly change this. Likewise, once γ(π(Sj)) ≥ 2ε⁻¹⌈log n⌉², we only store that it is possible for Sj to be relevant. This information is enough, and maintaining a counter for bigger values is not necessary. Then for each j we need O(ε⁻¹ log² n) space. With the definition of k we obtain a space bound of at most O(k·ε⁻¹ log² n) = O(ε⁻⁴ log⁴ n). □
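The counter trick at the end of this proof amounts to counters that saturate at the relevance threshold 2ε⁻¹⌈log n⌉²; beyond it only the fact "too large" matters. A sketch (the class name is ours):

```python
class SaturatingCounter:
    """Counter for gamma(Sj) or gamma(pi(Sj)) that saturates at the
    relevance threshold cap = 2 * ceil(log n)**2 / eps: exact values above
    the cap never influence whether a segment is relevant."""

    def __init__(self, cap):
        self.cap = cap
        self.value = 0                   # exact as long as value < cap

    def increment(self):
        if self.value < self.cap:
            self.value += 1              # stop counting once the cap is hit

    def exceeds_threshold(self):
        return self.value >= self.cap
```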

The last ingredient for our final result concerns the average value of the 2-approximation applied to relevant segments. Specifically, we define ρ = (Σ_{S∈Srel} β̂(S))/|Srel|.

Lemma 9. There is an algorithm in the data stream model that uses O(ε⁻⁵ log⁶ n) space and computes a value ρ̂ such that

Pr[|ρ − ρ̂| ≤ ε·ρ] ≥ 10/12.

Proof. To get an estimate of ρ we use conditional sampling: we compute H-random samples until we get a sample satisfying a certain condition.
As in the previous lemma, let b be an arbitrary injective mapping between S and [n²], and consider a family H = H(n², ε) from Lemma 1; then h ∘ b defines an order among the elements of S. In the following, Sact denotes the set of active segments.
We repeatedly sample h ∈ H uniformly at random until S1 = arg min_{S∈Sact} h(b(S)) is a relevant segment. Then we set the random variable Y1 = β̂(S1). Using Lemma 2 with X = Sact and Y = Srel we obtain

∀S ∈ Srel : (1 − 4ε)/|Srel| ≤ Pr[S1 = S] ≤ (1 + 4ε)/|Srel|.

Using the upper bound of the inequality above and the definition of $\rho$, it follows from the definition of the expected value that $E[Y_1] \leq (1+4\varepsilon)\cdot\rho$. Similarly, using the lower bounds we can show that $E[Y_1] \geq (1-4\varepsilon)\cdot\rho$. With Lemma 5 (iii) we have that $\gamma(S) \geq \beta(S)$. Using the definition of relevant segments and $\rho$ we obtain an upper bound for the variance of $Y_1$:

$$\mathrm{Var}[Y_1] = E[Y_1^2] - (E[Y_1])^2 \leq E[Y_1^2] = \sum_{S \in S_{rel}} \Pr[S_1 = S]\cdot(\beta(S))^2 \leq \sum_{S \in S_{rel}} \frac{1+4\varepsilon}{|S_{rel}|}\cdot\beta(S)\cdot\frac{2\lceil\log n\rceil^2}{\varepsilon} \leq \rho\cdot\frac{2(1+4\varepsilon)\lceil\log n\rceil^2}{\varepsilon} \leq \rho\cdot\frac{6\lceil\log n\rceil^2}{\varepsilon}.$$

Since $\gamma(S) \geq 1$ means that $S$ contains at least one input interval, the 2-approximation algorithm constrained to $S$ will also choose at least one input interval, and therefore $\beta(S) \geq 1$. This leads to $\rho \geq 1$. Consider some integer $k$ which we choose later. This $k$ will be the number of relevant segments we use to estimate $\rho$. Define $Y_2, \ldots, Y_k$ as independent random variables with the same distribution as $Y_1$, and our estimate $\hat{\rho}$ as the average over those random variables: $\hat{\rho} = (\sum_{i=1}^k Y_i)/k$. With Chebyshev's inequality and $\rho \geq 1$ it holds that


$$\Pr[|\hat{\rho} - E[Y_1]| \geq \varepsilon\rho] = \Pr\left[|\hat{\rho}k - E[Y_1]k| \geq \varepsilon k\rho\right] \leq \frac{\mathrm{Var}[\hat{\rho}k]}{(\varepsilon k\rho)^2} \leq \frac{6\lceil\log n\rceil^2}{k\varepsilon^3}.$$

By setting $k = 6\cdot 12\cdot\lceil\log n\rceil^2/\varepsilon^3$ we get $\Pr[|\hat{\rho} - E[Y_1]| \geq \varepsilon\rho] \leq 1/12$. With this as a basis we apply the same approach as in Lemma 8. First we set $k_0 = 12\lceil\log n\rceil^2 k/(\varepsilon(1-\varepsilon)) \in \Theta(\varepsilon^{-4}\log^4 n)$. Then for each $j \in [k_0]$ we compute an $H$-random sample by choosing $h_j \in H$ uniformly at random and setting $S_j = \arg\min\{h_j(b(S)) \mid S \text{ is active}\}$. For the further analysis let $X$ denote the overall number of relevant segments among $S_1, \ldots, S_{k_0}$ and $p = \Pr[S_1 \in S_{rel}]$. Using the lower bound of inequality (6) and the ratio between relevant and active segments (5) from Lemma 8 we get

$$k_0 p \geq \frac{12\lceil\log n\rceil^2 k}{\varepsilon(1-\varepsilon)}\cdot\frac{(1-\varepsilon)N_{rel}}{N_{act}} \geq \frac{12\lceil\log n\rceil^2 k}{\varepsilon(1-\varepsilon)}\cdot\frac{(1-\varepsilon)\varepsilon}{6\lceil\log n\rceil^2} = 2k.$$

This result, Chebyshev's inequality, and the fact that $\mathrm{Var}[X] = k_0p(1-p)$ (which holds because $X$ is a sum of independent binary variables) lead us to

$$\Pr[|X - k_0p| \geq k_0p/2] \leq \frac{\mathrm{Var}[X]}{(k_0p/2)^2} = \frac{4k_0p(1-p)}{k_0^2p^2} < \frac{4}{k_0p} \leq \frac{4}{2k} \leq \frac{1}{12}.$$

Here we also used that $p > 0$ and $k_0p \geq 2k$. This implies that with probability at least $11/12$ the sample $S_1, \ldots, S_{k_0}$ (represented by $X$) contains at least $(1/2)k_0p \geq k$ relevant segments. With these first $k$ relevant segments we are able to estimate $\rho$ as defined above. Like before, we can use the failure probabilities for $X$ and $\hat{\rho}$ to obtain the probability $1 - 1/12 - 1/12 = 10/12$ that both

$$\left[|X - k_0p| \leq k_0p/2\right] \quad \text{and} \quad \left[|\hat{\rho} - E[Y_1]| \leq \varepsilon\rho\right]$$

hold. In case of success we get by using the upper bounds and our results for the expected value ofY1 that

ρ ≤ ερ+ E[Y1] ≤ ερ+ (1 + 4ε)ρ = (1 + 5ε)ρ

and analogously using the lower bounds ρ ≥ (1 − 5ε)ρ. Combining these two inequalities and theabove success probabilities grants

Pr [|ρ− ρ| ≤ 5ερ] ≥ 10/12.

As in the previous lemma we can rescale $\varepsilon$, this time using the factor $1/5$. Also, the argumentation for the space bounds stays basically the same. For each $j \in [k_0]$ we keep information about $h_j$, $\gamma(S_j)$, $\gamma(\pi(S_j))$ and $\beta(S_j)$. As discussed before, $\beta(S_j) \leq \gamma(S_j)$ holds. Therefore our space bound for each index $j$ is the same as in Lemma 8, namely $O(\varepsilon^{-1}\log^2 n)$. Because we have $k_0$ indices we need $O(k_0\varepsilon^{-1}\log^2 n) = O(\varepsilon^{-5}\log^6 n)$ space in total. □

Now we are able to combine all the algorithms presented so far into one algorithm that computes an estimate for the size of an optimal solution of the interval selection problem.

Theorem 1. Let $\varepsilon \in (0, 1/2)$ and $I$ be a set of intervals with endpoints in $[n]$ that arrive in a data stream. There is an algorithm that uses $O(\varepsilon^{-5}\log^6 n)$ space and computes a value $\hat{\alpha}$ such that

$$\Pr[(1/2 - \varepsilon)\cdot\alpha(I) \leq \hat{\alpha} \leq \alpha(I)] \geq 2/3.$$

Proof. We start with estimating $N_{rel}$ and $\rho$ using Lemma 8 and Lemma 9, respectively, and obtain the estimates $\hat{N}_{rel}$ and $\hat{\rho}$. Then we combine them into a single estimate $\hat{\alpha}_0 = \hat{N}_{rel}\cdot\hat{\rho}$, which we will now further investigate. The success probability of $\hat{\alpha}_0$ is at least $1 - \frac{2}{12} - \frac{2}{12} = 2/3$, using that the failure probabilities of $\hat{N}_{rel}$ and $\hat{\rho}$ are $\frac{2}{12}$ each. In case of success both


$$\left[|\hat{N}_{rel} - N_{rel}| \leq \varepsilon\cdot N_{rel}\right] \quad \text{and} \quad \left[|\hat{\rho} - \rho| \leq \varepsilon\rho\right]$$

hold. Using the definitions of $\hat{N}_{rel}$ and $\hat{\rho}$ and the fact that $N_{rel} = |S_{rel}|$ together with Lemma 6, we can show that

$$\hat{\alpha}_0 \leq (1+\varepsilon)N_{rel}\cdot(1+\varepsilon)\rho = (1+\varepsilon)N_{rel}\cdot(1+\varepsilon)\left(\sum_{S \in S_{rel}}\beta(S)\right)/|S_{rel}| = (1+\varepsilon)^2\sum_{S \in S_{rel}}\beta(S) \leq (1+\varepsilon)^2\alpha(I).$$

Similarly, one can show that $\hat{\alpha}_0 \geq (1-\varepsilon)^2(\frac12 - \varepsilon)\alpha(I)$ holds, using the lower bounds for $\hat{N}_{rel}$ and $\hat{\rho}$. Combining these two results leads to

$$\Pr\left[(1-\varepsilon)^2\left(\tfrac12 - \varepsilon\right)\cdot\alpha(I) \leq \hat{\alpha}_0 \leq (1+\varepsilon)^2\cdot\alpha(I)\right] \geq \frac{2}{3}.$$

To obtain the final result we use that

$$(1-\varepsilon)^2\left(\tfrac12 - \varepsilon\right)/(1+\varepsilon)^2 \geq \tfrac12 - 3\varepsilon$$

for all $\varepsilon \in (0, 1/2)$. This holds because

$$\frac{(1-\varepsilon)^2}{(1+\varepsilon)^2}\cdot\left(\frac12 - \varepsilon\right) = \left(1 - \frac{4\varepsilon}{(1+\varepsilon)^2}\right)\left(\frac12 - \varepsilon\right) = \frac12 - \varepsilon - \frac{4\varepsilon}{(1+\varepsilon)^2}\left(\frac12 - \varepsilon\right) \overset{(*)}{\geq} \frac12 - \varepsilon - 2\varepsilon(1-2\varepsilon) = \frac12 - 3\varepsilon + 4\varepsilon^2 \geq \frac12 - 3\varepsilon,$$

with $(*)$ using

$$\frac{4\varepsilon}{(1+\varepsilon)^2}\left(\frac12 - \varepsilon\right) = \frac{2\varepsilon(1-2\varepsilon)}{(1+\varepsilon)^2} \leq 2\varepsilon(1-2\varepsilon),$$

since $(1+\varepsilon)^2 \geq 1$.

Also, we can again rescale $\varepsilon$, this time using the factor $1/6$. If we then set $\hat{\alpha} = \hat{\alpha}_0/(1+\varepsilon)^2$ to avoid overestimation, we obtain our final result. The space needed for this algorithm is precisely the space needed for our two estimates $\hat{N}_{rel}$ and $\hat{\rho}$ from Lemma 8 and Lemma 9, which completes this proof. □
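How the pieces fit together can be seen in a tiny sketch; `estimate_N_rel` and `estimate_rho` are hypothetical stand-ins for the streaming algorithms of Lemma 8 and Lemma 9 run on the same stream with the rescaled $\varepsilon$.

```python
def estimate_alpha(eps, estimate_N_rel, estimate_rho):
    """Combine the two estimators as in the proof of Theorem 1 (sketch)."""
    n_rel_hat = estimate_N_rel()        # within (1 +/- eps) of N_rel w.h.p.
    rho_hat = estimate_rho()            # within (1 +/- eps) of rho w.h.p.
    alpha0 = n_rel_hat * rho_hat
    return alpha0 / (1 + eps) ** 2      # scale down to avoid overestimating
```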

6 Same-size intervals

Within this section we will show how the results presented so far can be improved if we assume that all input intervals have the same length $\lambda > 0$.


6.1 Largest independent set of same-size intervals

The first approach is again to compute an approximation of the largest independent set. By using the shifting technique of Hochbaum and Maass [6] we obtain a $(3/2)$-approximation using $O(\alpha(I))$ space. We will maintain a partition of the real line using windows of length $3\lambda$. For $l \in \mathbb{R}$ we define the window $W_l = [l, l+3\lambda)$, including the left endpoint and excluding the right endpoint. Given an $a \in \{0, 1, 2\}$ we also define

$$\mathcal{W}_a = \{W_{(a+3j)\lambda} \mid j \in \mathbb{Z}\}.$$

Note that $\mathcal{W}_a$ is a partition of the real line. Furthermore, we define $I_a$ for some $a \in \{0, 1, 2\}$ as the set of input intervals that are contained in some window of $\mathcal{W}_a$. Formally:

$$I_a = \{I \in I \mid \exists j \in \mathbb{Z} : I \subset W_{(a+3j)\lambda}\}.$$

Each interval of length $\lambda$ is contained in exactly two windows of $\mathcal{W}_0 \cup \mathcal{W}_1 \cup \mathcal{W}_2$. Then it can be shown that $\max\{\alpha(I_0), \alpha(I_1), \alpha(I_2)\} \geq (2/3)\alpha(I)$, where $\alpha(I_a)$ denotes the size of an optimal solution restricted to the intervals contained in $I_a$. Because at most 2 disjoint intervals can fit in a window of length $3\lambda$, we are able to compute and store an optimal solution $J_a$ restricted to the input intervals of $I_a$ for $a \in \{0, 1, 2\}$. By returning the maximum of $J_0$, $J_1$ and $J_2$ we obtain a $(3/2)$-approximation. We will now describe how an algorithm can maintain these solutions $J_a$ throughout the stream.

We use the same approach as the algorithm in Section 4. We store Leftmost($W$) and Rightmost($W$) for each window $W \in \mathcal{W}_a$ with $a \in \{0, 1, 2\}$. In addition, we store a boolean value active($W$) indicating whether some interval earlier in the stream is contained in $W$. If active($W$) = false, $W$ does not contain any input interval and we therefore declare Leftmost($W$) and Rightmost($W$) undefined. When receiving a new interval $I$ of the stream we look, for each $a \in \{0, 1, 2\}$, at the window $W \in \mathcal{W}_a$ containing $I$, if it exists. If $W$ is not active, we set active($W$) = true and add the input interval to $J_a$. In case $W$ is already active and so far contains only intersecting intervals, we check if there is an interval of $W$ that is disjoint from $I$. If so, these two intervals are added to $J_a$. In the other cases there is nothing to do.

By following the above instructions we indeed maintain an optimal solution $J_a$ restricted to the intervals of $I_a$. By using a binary search tree for storing the at most $O(\alpha(I))$ active windows we can execute all necessary operations in time $O(\log\alpha(I))$ using $O(\alpha(I))$ space. This grants a $(3/2)$-approximation to the largest independent set.
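The window maintenance can be sketched as follows. This is a minimal sketch, assuming intervals are (lo, hi) pairs with hi = lo + λ; it stores only the active windows per shift $a$ in a dictionary keyed by the window index (the text uses a binary search tree instead), interprets Leftmost as the interval with the smallest right endpoint and Rightmost as the one with the largest left endpoint, and `window_index` is a hypothetical helper.

```python
def window_index(interval, a, lam):
    """Return j if interval fits into [(a+3j)*lam, (a+3j+3)*lam), else None."""
    lo, hi = interval
    j = int((lo - a * lam) // (3 * lam))
    start = (a + 3 * j) * lam
    if start <= lo and hi < start + 3 * lam:
        return j
    return None

class SameSizeSolver:
    def __init__(self, lam):
        self.lam = lam
        # shift a -> {window index j: (leftmost, rightmost)}
        self.windows = {0: {}, 1: {}, 2: {}}

    def insert(self, interval):
        for a in (0, 1, 2):
            j = window_index(interval, a, self.lam)
            if j is None:
                continue                        # I fits in no window of W_a
            w = self.windows[a]
            if j not in w:
                w[j] = (interval, interval)     # window becomes active
            else:
                leftmost, rightmost = w[j]
                if interval[1] < leftmost[1]:   # smaller right endpoint
                    leftmost = interval
                if interval[0] > rightmost[0]:  # larger left endpoint
                    rightmost = interval
                w[j] = (leftmost, rightmost)

    def solution_size(self):
        best = 0
        for a in (0, 1, 2):
            size = 0
            for leftmost, rightmost in self.windows[a].values():
                # two disjoint intervals fit iff Leftmost ends before Rightmost starts
                size += 2 if leftmost[1] < rightmost[0] else 1
            best = max(best, size)
        return best
```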

6.2 Size of largest independent set of same-size intervals

To estimate the size of an optimal solution when all intervals are of the same size, we will use $H$-random samples again. First we will show how to estimate a solution constrained to $I_a$ for $a \in \{0, 1, 2\}$. Then we will use this result to get a final estimate.

Lemma 10. Let $a \in \{0, 1, 2\}$ and $\varepsilon \in (0, 1)$. There is an algorithm in the data stream model that in $O(\varepsilon^{-2}\log(1/\varepsilon) + \log n)$ space computes a value $\hat{\alpha}_a$ such that

$$\Pr[|\alpha(I_a) - \hat{\alpha}_a| \leq \varepsilon\cdot\alpha(I_a)] \geq 8/9.$$

Proof. Fix some $a \in \{0, 1, 2\}$. We define the type $i$ of a window $W$ of $\mathcal{W}_a$ as the maximum number of pairwise disjoint input intervals contained in $W$. Since each window can contain at most two disjoint intervals, it can be of type 0, 1 or 2. By $\gamma_2$ we denote the number of windows of type 2 in $\mathcal{W}_a$, and by $\gamma_1$ the number of windows of type at least 1, that is, of windows containing at least one input interval. Then $\alpha(I_a) = \gamma_1 + \gamma_2$: type-0 windows do not contain any input interval, every window counted by $\gamma_1$ contributes one interval to an optimal solution, and every type-2 window contributes one more. Like in Section 5 we will use $H$-random samples to estimate $\gamma_1$ and then the ratio $\gamma_2/\gamma_1$ to obtain $\gamma_2$. First we will describe how to obtain an estimate $\hat{\gamma}_1$ of $\gamma_1$. Given the stream of input intervals


$I = I_1, I_2, \ldots$, we can compute the sequence of windows $W(I) = W(I_1), W(I_2), \ldots$ with $W(I_i)$ denoting the window of $\mathcal{W}_a$ that contains $I_i$. If such a window does not exist we skip $I_i$. It follows that $\gamma_1$ is the number of distinct elements in $W(I)$. Again, using the results of Kane, Nelson and Woodruff [3], we are able to compute an estimate $\hat{\gamma}_1$ for which $\Pr[(1-\varepsilon)\gamma_1 \leq \hat{\gamma}_1 \leq (1+\varepsilon)\gamma_1] \geq 17/18$ holds, using $O(\varepsilon^{-2} + \log n)$ space. Next we will estimate the ratio $\gamma_2/\gamma_1$. For this we use $H$-random samples, very similar to the approach in Lemma 8. We are given a family $H = H(n, \varepsilon)$ of permutations $[n] \to [n]$ as stated in Lemma 1. We choose $k = \lceil 18\varepsilon^{-2}\rceil \in \Theta(\varepsilon^{-2})$ and pick $h_1, \ldots, h_k \in H$ uniformly and independently at random. For each $h_j$ with $j \in [k]$, let $W_j$ be the window $[l, l+3\lambda)$ of $\mathcal{W}_a$ that contains at least one input interval and minimizes $h_j(l)$:

$$W_j = \arg\min\{h_j(l) \mid [l, l+3\lambda) \in \mathcal{W}_a \text{ and } \exists I \in I : I \subset [l, l+3\lambda)\}.$$

Then $W_j$ is a nearly uniform random window among the windows of $\mathcal{W}_a$ which contain at least one input interval. We define the random variable

$$M = |\{j \in [k] \mid W_j \text{ is of type } 2\}|.$$

With the considerations above, $M\gamma_1/k$ is roughly $\gamma_2$. By applying Chebyshev's inequality to $M$ and using the choice of $k$ it is possible to show that $M\gamma_1/k = \gamma_2 \pm \varepsilon\gamma_1$. Using both results above we output $\hat{\gamma}_1(1 + M/k)$ as the desired estimate. Note that $M$ can be computed in $O(\varepsilon^{-2}\log(1/\varepsilon))$ space by keeping information about $h_j$ and the current window $W_j$ for each index $j$. We also store Leftmost($W_j$) and Rightmost($W_j$) to decide whether $W_j$ is of type 1 or 2. □

Theorem 2. Let $\varepsilon \in (0, 1/2)$ and $I$ be a set of intervals of length $\lambda$ with endpoints in $[n]$ that arrive in a data stream. There is an algorithm that uses $O(\varepsilon^{-2}\log(1/\varepsilon) + \log n)$ space and computes a value $\hat{\alpha}$ such that

$$\Pr[(2/3 - \varepsilon)\cdot\alpha(I) \leq \hat{\alpha} \leq \alpha(I)] \geq 2/3.$$

Proof. For each $a \in \{0, 1, 2\}$ we use Lemma 10 to obtain the estimate $\hat{\alpha}_a$ for $\alpha(I_a)$. The probability that all three estimates are successful is at least $1 - 1/9 - 1/9 - 1/9 = 2/3$. With the properties of these estimates we conclude that $(2/3)(1-\varepsilon)\cdot\alpha(I) \leq \max\{\hat{\alpha}_0, \hat{\alpha}_1, \hat{\alpha}_2\} \leq (1+\varepsilon)\alpha(I)$. Rescaling the maximum by $1/(1+\varepsilon)$ to avoid overestimation and replacing $\varepsilon$ by $\varepsilon/2$ completes the proof.

7 Conclusion and other results

We have shown how to get a 2-approximation for the interval selection problem and how to use it to get an estimate for the size of an optimal solution. It is also possible to show lower bounds for both problems considered. Emek, Halldórsson and Rosén [5] showed that no streaming algorithm for the interval selection problem can achieve an approximation ratio of $r$ for any constant $r < 2$. For same-size intervals no ratio below $3/2$ is possible. In [1], similar results are shown for the problem of estimating $\alpha(I)$. For this, the interval selection problem is reduced to the INDEX problem, where the task is to decide whether a subset of $[n]$ contains some element $i \in [n]$. The complexity of INDEX is well studied [7] [8]: to achieve a non-trivial success probability, $\Omega(n)$ bits of memory are required. This reduction shows that any algorithm that uses $o(n)$ bits of memory cannot compute an estimate $\hat{\alpha}$ for which

$$\Pr\left[\left(\frac12 + c\right)\alpha(I) \leq \hat{\alpha} \leq \alpha(I)\right] \geq \frac{2}{3}$$


holds, for an arbitrary constant $c > 0$. This means that the results presented in this work match the lower bounds up to constant factors if we use $o(n)$ space.

References

1. S. Cabello, P. Pérez-Lantero. Interval Selection in the Streaming Model. Lecture Notes in Computer Science, vol. 9214, pp. 127–139, 2015.
2. P. Indyk. A small approximately min-wise independent family of hash functions. J. Algorithms 38(1):84–90, 2001.
3. D. M. Kane, J. Nelson and D. P. Woodruff. An optimal algorithm for the distinct elements problem. PODS 2010, pp. 41–52, 2010.
4. M. Datar and S. Muthukrishnan. Estimating rarity and similarity over data stream windows. ESA 2002, pp. 323–334. Springer, Lecture Notes in Computer Science 2461, 2002.
5. Y. Emek, M. M. Halldórsson and A. Rosén. Space-constrained interval selection. ICALP 2012(1), pp. 302–313. Springer, Lecture Notes in Computer Science 7391, 2012.
6. D. S. Hochbaum and W. Maass. Approximation schemes for covering and packing problems in image processing and VLSI. J. ACM 32(1):130–136, 1985.
7. T. S. Jayram, R. Kumar, and D. Sivakumar. The one-way communication complexity of Hamming distance. Theory of Computing 4(6):129–135, 2008.
8. E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, New York, NY, USA, 1997.


Multi-Dimensional Online Tracking

Steffen Knorr

Abstract This essay introduces the concepts presented in the paper "Multi-Dimensional Online Tracking" by Ke Yi and Zhan Qin, and rewrites and explains the proofs and concepts presented there. We study a class of online tracking problems for two parties, one of which observes a multi-valued function and tries to inform the second party about the value of this function. The goal is to minimize the communication between those parties. We will see three algorithms, for one-dimensional as well as multi-dimensional functions. For these algorithms we will show both a competitive factor with respect to an optimal solution for the communication needed and the running time. Unless otherwise denoted, all content is derived from the above mentioned paper. Images that have been added and were not part of the original paper are marked with an asterisk (*).

1 Introduction
  1.1 Structure of this essay
2 Online Tracking in one dimension
  2.1 An O(log ∆)-competitive algorithm
  2.2 Lower bound on competitive ratio
3 Online Tracking in d dimensions
  3.1 Using Tukey medians
  3.2 Using volume cutting
  3.3 Tracking a dynamic set
4 Online Tracking with predictions
5 Open Problems
6 Summary / Conclusion
References

1 Introduction

We construct a model in which one observer named Alice observes a function $f(t)$ in an online fashion and wants to inform a tracker named Bob about the values of that function. We allow an error of $\Delta$ on the messages that Bob receives and want to minimize the communication between both parties. As long as Alice does not send a new message, the function is still within a $\Delta$-range of the last value Bob received. The most intuitive solution to this problem would be to send the exact value of the function whenever necessary. However, there is a simple example in which this approach proves problematic. Consider a function $f$ that is defined as follows:

$$f(t) = \begin{cases} 0 & \text{if } t \text{ is even} \\ 2\Delta & \text{if } t \text{ is odd} \end{cases}$$


Our approach has to send a message at every time step, whereas the optimal solution is to send exactly one message with value $\Delta$. This difference in communication is quantified by the competitive ratio: the worst-case ratio of the number of messages sent by the online algorithm to the number of messages an optimal offline algorithm $A_{OPT}$ would send. From this definition we can derive rounds in which $A_{OPT}$ sends exactly one message. For our example the competitiveness is infinite, because we can make the length of the round arbitrarily large while $A_{OPT}$ still sends only one message.

1.1 Structure of this essay

In this essay we define our problem to be the tracking of a function $f : \mathbb{Z}^+ \to \mathbb{Z}^d$. The observer Alice only sees the function values $f(t)$, $t \leq t_{now}$, up to the current time and can decide to send a message $g(t)$ which may differ from $f(t)$. We expect Alice to send a message whenever the function value deviates by more than $\Delta$ from her last message, and that whenever she sends a message $g(t_{now})$, it satisfies $||f(t_{now}) - g(t_{now})|| \leq \Delta$. Unless stated otherwise, $||\cdot||$ denotes the $l_2$ norm in this essay. Our focus in this essay is the competitiveness of the algorithms with respect to communication.

Section 2 shows an algorithm that solves the problem for $d = 1$. This algorithm has a competitive ratio of $O(\log\Delta)$ and we also prove that this ratio is optimal. The following Section 3 presents two algorithms to solve the problem for any dimension $d$. At the end of the section we see an example of using these algorithms to track a dynamic set. Afterwards, Section 4 tries to optimize the competitiveness of the multi-dimensional algorithms by predicting how the function will behave. Following a short summary of open problems in Section 5, this essay closes with a short conclusion in Section 6.

2 Online Tracking in one dimension

For this section the function that we want to track is of the form $f : \mathbb{Z}^+ \to \mathbb{Z}$. The algorithm presented here will build the general framework for all algorithms described in this essay.

2.1 An O (log∆)-competitive algorithm

The idea for the one-dimensional algorithm is rather simple. We try to find the optimal value $v_{OPT}$ that the optimal algorithm $A_{OPT}$ sent. For this we maintain an interval $S$ which contains all possible values for $v_{OPT}$ and update it whenever we have to send a new message. Any update is a cut of this interval with the interval of size $2\Delta$ around the current function value. Therefore the interval $S$ shrinks whenever we have to communicate, and the median of this interval is the value we choose as the message to send. The reason to take the median of the interval lies in the fact that it ensures that $S$ halves whenever we have to send. Assuming our algorithm $A_{SOL}$ needs to communicate, this means that the current function value is more than $\Delta$ bigger or smaller than our last message. Therefore either the lower or the upper half of $S$ is cut off, because the optimal solution is in range of this new value as well.

The pseudo code is described in Algorithm 1. First, in each round, the interval $S$ is set to a range of size $2\Delta$ around the first function value observed. Then, as long as there is an option for an optimal value $v_{OPT}$, we update the interval, obtain its median and send it out. On each update the interval $S$ is cut with an interval of $\Delta$-range around the current function value $f(t_{now})$. If $S$ becomes empty we know that the round has ended and we immediately start a new round with the current


Fig. 1 * The interval S halves when we need to send a message.

function value.

Algorithm 1 One round of 1-dim. tracking
  let S = [f(tnow) − ∆, f(tnow) + ∆] ∩ Z
  while S ≠ ∅ do
    let g(tnow) be the median of S
    send g(tnow) to Bob
    wait until ||f(tnow) − g(tlast)|| > ∆
    S ← S ∩ [f(tnow) − ∆, f(tnow) + ∆]
  end while

Correctness:

Proof. It is easy to see that the algorithm always communicates valid values. Whenever we send a message it is within the cut $S \cap [f(t_{now}) - \Delta, f(t_{now}) + \Delta]$ and therefore within a $\Delta$-range of $f(t_{now})$. If the cut is empty we start a new round with the current function value, and the message is correct as well. □

Competitiveness:

Proof. The proof for the competitiveness is done in two steps. Firstly, we ensure that $A_{OPT}$ always sends at least one message in each round. Secondly, we show that our algorithm only sends $O(\log\Delta)$ messages in each round. For the first part we assume that the optimal offline algorithm did not send any message in this round. That would mean that the last value $y$ that $A_{OPT}$ sent in the previous round also applies to the current round. Therefore $y \in S$, because the value is in $\Delta$-range of all other function values of this round. However, the round is terminated only when the set $S$ is empty, which is a contradiction, because then there is no choice for $y$. Thus, $A_{OPT}$ sends at least one message in this round. For the second part, recall that $S$ initially contains at most $2\Delta + 1$ values and at least halves with every message our algorithm sends, so after $O(\log\Delta)$ messages $S$ is empty and the round ends. □

We thus obtain the following theorem:

Theorem 1. There is an O (log∆)-competitive online algorithm to track any function f : Z+ → Z.

2.2 Lower bound on competitive ratio

Given the competitiveness of our algorithm, we will now show that it is also optimal. For this we construct an adaptive adversary Carole who controls the function values depending on the messages


that an online algorithm $A_{SOL}$ sends out. Similar to our algorithm, Carole maintains an interval $S$ of possible values for $A_{OPT}$ that is updated when the function value changes. When this interval becomes small enough, Carole stops the round and the actual solution for the optimal algorithm is obtained. Afterwards, Carole starts a new round with function values that are $2\Delta + 1$ away from any values used before.

The interval is initialized exactly like in our algorithm, as a $\Delta$-range around an arbitrary first value. Depending on the announced value $v_{SOL}$ of $A_{SOL}$, Carole changes the function value until a new $v_{SOL}$ is sent:

• if $v_{SOL}$ is greater than the median of $S$, decrease the function value
• else, increase the function value

This step is repeated until the interval size drops below 3, and $S$ is updated whenever necessary. We now observe the number of elements $n_i$ in $S$ at the time of the $i$-th triggering of $A_{SOL}$. If we fix the median $m_S$ of $S$ at the time of one such triggering, we know that when $m_S$ is cut off, $A_{SOL}$ has to send a new message: by then the function value is more than $\Delta$ away from $m_S$ and, since we move in the direction opposite to the one in which the online algorithm's message differs from the median, it is also more than $\Delta$ away from $v_{SOL}$. Therefore, the interval shrinks between two triggerings by at most a factor of $1/2$, which results in $\Omega(\log\Delta)$ messages that have to be sent before the round is over. This gives us:

Theorem 2. To track a function $f : \mathbb{Z}^+ \to \mathbb{Z}$, any online algorithm has to send $\Omega(\log\Delta\cdot\mathrm{OPT})$ messages in the worst case, where OPT is the number of messages needed by the optimal offline algorithm.
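The adversary can also be sketched in code. The fragment below is a simplified model of one round, assuming a hypothetical callback `solver_message(f)` that returns the online algorithm's current message after observing the value `f`; it counts how often the solver is forced to communicate.

```python
def carole_round(f0, delta, solver_message):
    """Sketch of the adversary Carole for one round of the lower bound."""
    S = list(range(f0 - delta, f0 + delta + 1))   # candidates for v_OPT
    f, triggers = f0, 0
    v = solver_message(f)
    while len(S) >= 3:                            # stop once S is small
        median = S[len(S) // 2]
        step = -1 if v > median else 1            # move away from v_SOL
        while abs(f - v) <= delta:                # until the solver must send
            f += step
            S = [x for x in S if abs(x - f) <= delta]
        triggers += 1
        v = solver_message(f)
    return triggers                               # Omega(log delta) triggers
```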

3 Online Tracking in d dimensions

In this section we generalize our algorithm to $d$ dimensions. We will discuss two approaches for this problem and extend our analysis to $(\alpha, \beta)$-competitiveness. In contrast to the one-dimensional case, we now allow our algorithm a deviation of $\beta\Delta$, whereas the optimal solution still has to work with a deviation of $\Delta$. Only when the function has deviated by more than this threshold does the algorithm have to communicate again. We still divide the tracking period into rounds in which the optimal offline algorithm communicates at least once. If within those rounds our algorithm communicates at most $k$ times, we obtain a competitive ratio of $k$.

Luckily, we can simply extend the framework we used before. Instead of using an interval for our set of possible values for $v_{OPT}$, we now construct a set $S = S_0$ at the beginning of the round. In each iteration we then pick the "median" from this set and send it to Bob. Whenever we have to communicate again we cut our set with a ball of radius $\Delta$ instead of an interval. This way we still maintain a set of points, and we can terminate a round whenever this set becomes empty, because then there is no optimal solution for this round. The balls Ball($p, r$) we are using are closed balls centred at $p \in \mathbb{R}^d$ with radius $r \in \mathbb{R}$.

From here on we will assume $\beta$ to be smaller than 2, because otherwise the problem becomes trivial. In that case the algorithm can simply send the first function value it received and does not have to communicate until the round is over: if the difference to the function ever increases to more than $2\Delta$, there is no optimal solution that covers both the current function value and the value our algorithm sent. Therefore we will set $\beta = 1 + \varepsilon$, $0 < \varepsilon < 1$.


3.1 Using Tukey medians

For this approach we consider $\beta$ to be 1. Algorithm 2 shows the pseudo code of the algorithm. First we will discuss the creation of $S_0$, which will be the starting set of our algorithm, and later we will define what the Tukey median is. For the construction of $S_0$ we consider all points $p_i \in \mathbb{Z}^d$ that lie within a $2\Delta$ radius of the first function value. We then construct sets $C_l$, $2 \leq l \leq d+1$, from the centres of the smallest enclosing balls of any combination of $l$ many of those points $p_i$. The limits for $l$ are chosen because a smallest enclosing ball in $d$ dimensions is defined by at least 2 and at most $d+1$ many points. From these sets we obtain $S_0 = C_2 \cup C_3 \cup \cdots \cup C_{d+1}$.

Algorithm 2 One round of d-dim. tracking using Tukey medians
  let S = S0
  while S ≠ ∅ do
    let g(tnow) be the Tukey median of S
    send g(tnow) to Bob
    wait until ||f(tnow) − g(tlast)|| > ∆
    S ← S ∩ Ball(f(tnow), ∆)
  end while

The following lemma shows that $S_0$ fulfils our requirements.

Lemma 1. If $S$ becomes empty at some time step, then the optimal offline algorithm must have communicated once in the current round.

Proof. Suppose this claim is false. Then the optimal offline algorithm $A_{OPT}$ does not send any message in the current round when $S$ becomes empty. Let $s$ be the last point $A_{OPT}$ sent in its last communication and $q_i$, $i \in [m]$, the values the function took in the current round. Since $A_{OPT}$ did not send any message, it holds that $||s - q_i|| \leq \Delta$ for all $i \in [m]$. We also know that $m$ must be greater than 1, because if there was no other function value in the current round, our algorithm picked the only and therefore optimal solution. Thus we can assume $m \geq 2$.

Now, let $B$ be the smallest enclosing ball with centre $o$ which contains all the $q_i$. We can then see that $o \in S_0$: let $X$ be the set of smallest enclosing balls of sets of integer points in $\mathbb{Z}^d$, each of which is within a distance of at most $2\Delta$ from $f(t_{start})$. Then $S_0$ is exactly the set of centres of the balls in $X$.

Fig. 2 Centre of B lies within a ∆ distance to both f(tstart) and any qi.

If $o \notin S_0$, then at least one $q_j$, $j \in [m]$, must be at a distance of more than $2\Delta$ from $f(t_{start})$. Since $||s - q_j|| \leq \Delta$, we have $||s - f(t_{start})|| > \Delta$. That means that $A_{OPT}$ has to have communicated in this round, which is a contradiction to our assumption. Therefore we now assume $o \in S_0$.


Since $||s - q_i|| \leq \Delta$ for all $i \in [m]$, and $B$ is the smallest enclosing ball containing all $q_i$ with centre $o$, we have $||o - q_i|| \leq \Delta$ for all $i$. This means that $o$ lies in the intersection of all the balls of radius $\Delta$ centred at the $q_i$s. Therefore $S$ should contain $o$, which means that $S$ is not empty and the round has not finished, which is another contradiction. □

Now we only need to define a useful median on our set $S$ and Algorithm 2 is constructed successfully. Before we can go on we need the following definition:

Definition 1. (Location depth) Let $S$ be a set of points in $\mathbb{R}^d$. The location depth of a point $q \in \mathbb{R}^d$ with respect to $S$ is the minimum number of points of $S$ in a closed halfspace containing $q$.

Given the location depth we only need this additional theorem:

Theorem 3. (Helly's Theorem) Let $C_1, C_2, \ldots, C_n$ be convex sets in $\mathbb{R}^d$, $n \geq d+1$. Suppose that the intersection of every $d+1$ of these sets is nonempty. Then the intersection of all the $C_i$ is nonempty.

Helly's Theorem [4] gives us the following observation:

Observation 1 Given a set $S$ in $\mathbb{R}^d$, there always exists a point $q \in \mathbb{R}^d$ having location depth at least $|S|/(d+1)$ with respect to $S$. The point with maximum depth is usually called the Tukey median.

This is the Tukey median our algorithm will use. Due to the properties of the Tukey median we know that whenever $||f(t_{now}) - g(t_{last})|| > \Delta$, the ball Ball($f(t_{now}), \Delta$) is strictly contained in a halfspace bounded by a hyperplane passing through $g(t_{last})$. Therefore the cardinality of $S$ decreases by a factor of at least $1/(d+1)$. Thus, the algorithm sends $\log_{1+\frac{1}{d}}|S_0| = O(d\log|S_0|)$ messages in each

round. We now bound the size of $S_0$ to conclude our analysis. Initially, we set $S_0 = C_2 \cup C_3 \cup \cdots \cup C_{d+1}$ for $C_j$, $2 \leq j \leq d+1$. These $C_j$ are the collections of smallest enclosing balls of every $j$ points in Ball($f(t_{start}), 2\Delta$). The cardinality of any $C_j$ is at most $\binom{(\lfloor 4\Delta\rfloor+1)^d}{j}$, because in every dimension there are $2\Delta$ integer points in positive and negative direction plus the midpoint, and we choose $j$ many of those points. Therefore $S_0$ contains at most

$$\sum_{j=2}^{d+1}\binom{(\lfloor 4\Delta\rfloor+1)^d}{j} = O\left(d\left(\frac{e(\lfloor 4\Delta\rfloor+1)^d}{d+1}\right)^{d+1}\right)$$

points. This finishes our analysis with the following theorem:

Theorem 4. There is an $O(d^3\log\Delta)$-competitive online algorithm that tracks any function $f : \mathbb{Z}^+ \to \mathbb{Z}^d$.

Running time. There are certain disadvantages to this approach. First of all, we have to maintain $S$ explicitly, which has size exponential in $d$. Clarkson et al. [2] proposed a fast algorithm to compute an approximate Tukey median via random sampling; however, sampling becomes difficult from a set that we will maintain only implicitly. Therefore we will show a new approach in the next section which does not need $S$ to be explicit and also improves the competitive ratio by roughly a factor of $d$.

3.2 Using volume cutting

3.2.1 Case β = 1 + ε

In this section we now set $\beta = 1+\varepsilon$. In a later part we will then show that we can let $\varepsilon$ become arbitrarily small to generalize the algorithm presented here to $\beta = 1$. As mentioned earlier, we will show that


the algorithm has both a better competitive ratio than the one explained in Subsection 3.1 and only a polynomial runtime with respect to $d$ and $\Delta$ per step.

The technique used here is called volume cutting. Instead of maintaining the set $S$ of valid points, we maintain a convex volume $P$ which carries no direct information about the points that would be in $S$. We will show that it is sufficient to use this volume and that we may terminate the round when the volume is small enough. Given the volume, our algorithm returns its "median", sends it to Bob, and updates $P$ similarly to the Tukey algorithm by cutting the volume with a ball centred at the current function value.

Algorithm 3 depicts the pseudo code for the algorithm that we are going to discuss now.

Algorithm 3 One round of d-dim. tracking using volume cutting
  let P = Ball(f(tnow), β∆)
  while ωmax(P) ≥ ε∆ do
    let g(tnow) be the centroid of P
    send g(tnow) to Bob
    wait until ||f(tnow) − g(tlast)|| > β∆
    P ← P ∩ Ball(f(tnow), β∆)
  end while

Using our typical framework, we initialize the volume in which the optimal solution is contained, compute the "median" of $P$ and send it, and update $P$ as the cut of all balls centred at the function values at which we had to send a new message. To determine when $P$ is "small enough" we use the term $\omega_{max}(P)$, which we now define:

Definition 2. (Directional width) For a set $P$ of points in $\mathbb{R}^d$ and a unit direction $\mu$, the directional width of $P$ in direction $\mu$ is $\omega_\mu(P) = \max_{p\in P}\langle\mu, p\rangle - \min_{p\in P}\langle\mu, p\rangle$, where $\langle\mu, p\rangle$ is the standard inner product.

Accordingly, $\omega_{max}(P) = \max_{v\in\mathbb{R}^d, ||v||=1}\omega_v(P)$, and analogously $\omega_{min}(P) = \min_{v\in\mathbb{R}^d, ||v||=1}\omega_v(P)$. The "median" we use here is the centroid of the volume $P$, which is defined as follows:

Definition 3. (Centroid of $P$) The centroid of $P$ is the intersection of the hyperplanes that divide $P$ into two parts of equal moment.

On closer inspection this definition comes down to the centre of mass, or the midpoint of the volume, which is exactly the requirement a median has fulfilled for us so far. The correctness can be seen quite easily, because we still use the same framework and every value of $S$ is contained in $P$. This holds because initially $P$ contains $S_0$, with all possible points in a $\beta\Delta$-ball centred at the first function value. We update $P$ the same way we would have to update $S$, and therefore we perform the same actions on both $P$ and $S$.

The size check on $P$ using $\omega_{max}(P)$ ensures that if $P$'s size is below this threshold, then $P$ does not contain any point of $S_0$. In this case $S$ is empty and the round is over. The reason for that is rather simple: given that $\omega_{max}(P) < \varepsilon\Delta$, which means that $P$ is smaller than $\varepsilon\Delta$ in every direction, the cut would be empty if we had made the cuts with $\Delta$-radius balls instead of $\beta\Delta = (1+\varepsilon)\Delta$-radius balls. Figure 3 illustrates this fact.
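To see the mechanics of a round, here is a Monte Carlo sketch: $P$ is kept implicitly as a list of (centre, radius) pairs and the centroid is approximated by rejection sampling inside the first ball. This is purely illustrative; the actual algorithm uses the approximate-centroid procedure discussed at the end of this section, not naive sampling.

```python
import math
import random

def inside(point, balls):
    """Is the point in the intersection P of all balls?"""
    return all(math.dist(point, c) <= r for c, r in balls)

def sample_in_ball(centre, radius, d):
    """Rejection-sample a uniform point in a d-dimensional ball."""
    while True:
        p = [random.uniform(-radius, radius) for _ in range(d)]
        if sum(x * x for x in p) <= radius * radius:
            return [c + x for c, x in zip(centre, p)]

def approx_centroid(balls, d, trials=2000):
    """Estimate the centroid of P; None signals that P looks empty."""
    centre, radius = balls[0]
    hits = [p for p in (sample_in_ball(centre, radius, d)
                        for _ in range(trials)) if inside(p, balls)]
    if not hits:
        return None                      # treat as end of the round
    return [sum(x) / len(hits) for x in zip(*hits)]
```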

We will now proceed to analyse the number of iterations necessary until $P$ is smaller than the threshold mentioned above, which will lead us to the competitive ratio.

Lemma 2. (By Grünbaum [3]) For a convex set $P$ in $\mathbb{R}^d$, any halfspace that contains the centroid of $P$ also contains at least $1/e$ of the volume of $P$.

From this lemma we can deduce that $P$ decreases by a constant factor whenever it is updated: when we have to communicate, we know that the centroid is not within the cut of $P$ and the ball, so

$$\frac{\mathrm{vol}(P \cap \mathrm{Ball}(f(t_{now}), \beta\Delta))}{\mathrm{vol}(P)} < 1 - \frac{1}{e}.$$

Fig. 3 * If ωmax(P) is smaller than or equal to ε∆, the cut of the ∆-balls is empty.

A problem that could arise is that $P$ is cut into very thin slices and the number of iterations becomes huge. However, we will show that such a situation cannot occur, because the balls with which we cut $P$ have radii that are not too large. Our goal is to show the following lemma:

Lemma 3. If $\omega_{max}(P) \geq \varepsilon\Delta$, then $\omega_{min}(P) = \Omega(\varepsilon^2\Delta)$.

Fig. 4 Two-dimensional plane J cut by hyperplane H.
Fig. 5 Trying to obtain the width of the cut.
Fig. 6 * Using the Pythagorean theorem to compute b; the width of the cut is 2(r − b).

To prove this Lemma we need to show some additional facts beforehand.

Lemma 4. Let $H$ be any supporting hyperplane of $P$ at $p \in \partial P$, that is, $H$ contains $p$ and $P$ is contained in one of the two halfspaces bounded by $H$. Then there is a ball $B$ with radius $\beta\Delta$ such that $H$ is tangent to $B$ at $p$ and $B$ contains $P$.

Proof. We know that there is a unique ball $B$ with centre $o_B$ and radius $\beta\Delta$ such that $H$ is tangent to $B$ at $p$ and $B$ is on the same side as $P$. The ball is unique because its centre lies orthogonally above the point $p$ where $B$ and $H$ touch. We want to show that $B$ contains $P$ entirely. Suppose therefore that $B$ does not fully contain $P$, i.e. $\exists q \in P : q \notin B$. We can simplify our situation by considering the plane $J$ that is spanned by the points $o_B$, $p$ and $q$. Figure 4 illustrates the situation for this proof from the view of the plane $J$.

Suppose $P$ is an intersection of balls $B_i \neq B$, $i \in [m]$, with radius $\beta\Delta$; the intersections of those balls with the plane $J$, which are disks, we denote by $D_i$ for the corresponding ball $B_i$. We know that every $\partial D_i$ has to intersect $H$ in exactly two points: if it intersected $H$ at only one point, then $B_i$ would be $B$ and $P$ would be contained in $B$, which is a contradiction. If it did not intersect $H$ at all, then $p$ would not be contained in $B_i$ and could therefore not be in $P = \bigcap_{i\in[m]}B_i$, a contradiction as well. We therefore call the two intersection points of $\partial D_i$ with the line $l$, which is the intersection of $H$ and $J$, $u_i$ and $v_i$. It is easy to see that either $u_i$ or $v_i$ is different from $p$, and that one of those two has to lie on the same side of $p$ as $q$. The point that fulfils this


requirement we call $v_i$. Then the intersection of all disks $D_i$ contains the segment $pv_j$, where $v_j$ is the closest point to $p$ among all the $v_i$. If this is the case, however, then the segment is part of $P$ and $H$ cannot be tangent to $P$, which is a contradiction.

Since we know that $P$ is constructed from the above mentioned balls, the only solution is that such a point $q$ which does not lie in $B$ does not exist, and therefore $B$ contains all of $P$. □

We will now bound the minimum size of a convex volume which is the intersection of balls. Figures 5 and 6 illustrate the problem. Given the maximal directional width, in this example the direction from $p$ to $q$, we want to show that the width marked in Figure 5 with "?" is not too small.

Lemma 5. Let $M$ be the intersection of two balls of radius $r$ in $\mathbb{R}^d$. If $\omega_{max}(M) \geq \varepsilon r$, then $\omega_{min}(M) = \Omega(\varepsilon^2 r)$.

Proof. Let $B_1$ and $B_2$ be the two balls and $M$ their intersection. The centres of those balls are $o_1$ and $o_2$, respectively. We now consider the intersection of the boundaries $S_1$ and $S_2$ of those balls and call it $S$. This intersection $S$ is a $(d-2)$-dimensional sphere, drawn through $p$ and $q$ in the example in Figure 5. We simplify our situation once again by taking an arbitrary point $p$ in $S$ and letting $J$ be a two-dimensional plane passing through $p$, $o_1$ and $o_2$. Thus we arrive at the situation of Figures 5 and 6, with a point $q$ which is another intersection point of $S$ and $J$. The maximum width of $M$ is $||pq||$, and $\omega_{min}(M)$ is equal to

$$2(r - b) = 2\left(r - \sqrt{c^2 - a^2}\right) = 2\left(r - \sqrt{r^2 - (||pq||/2)^2}\right),$$

where $a$, $b$ and $c$ are taken from Figure 6. At last, we see that if $||pq|| \geq \varepsilon r$, then $\omega_{min}(M) = \Omega(\varepsilon^2 r)$. This last step can be seen as follows:

$$2\left(r - \sqrt{r^2 - \left(\frac{\varepsilon r}{2}\right)^2}\right) = 2r\left(1 - \sqrt{1 - \frac{\varepsilon^2}{4}}\right) \geq 2r\cdot\frac{\varepsilon^2}{8} = \Omega(\varepsilon^2 r),$$

where the inequality uses $1 - \sqrt{1-x} = \frac{x}{1+\sqrt{1-x}} \geq \frac{x}{2}$ for $x \in [0, 1]$. □

Fig. 7 The construction used in the proof of Lemma 3; each cap has diameter at most λ.

We now have all the tools necessary to prove Lemma 3:


Proof. (Lemma 3) Let $Q$ be a polytope inscribed in $P$ such that the diameter of every cap formed by the intersection of $P$ and a halfspace bounded by a hyperplane containing one face of $\partial Q$ is no more than $\lambda$. Figure 7 shows such a cap. Now let $\mu$ be the direction in which the directional width of $Q$ is minimized. Then we can construct two hyperplanes $H_p$ and $H_q$ which support $Q$, and let $p$ and $q$ be two points lying on both $Q$ and the corresponding hyperplane such that $pq$ is in direction $\mu$. These points must exist because $Q$ is a polytope. We construct two additional hyperplanes $H_x$ and $H_y$, parallel to the other two and supporting $P$ at two points $x$ and $y$, respectively. Then we call the points where the line $pq$ intersects $H_x$ and $H_y$ $x'$ and $y'$, respectively.

′ and y′.Lemma 4 gives us a unique ball Bo′1 centred at o′1 with radius β∆ containing P and being tangent

to H. We now pick a point o1 on the line pq between p and q, such that the distance between ||o1x′||

and ||o′1x|| is equal. Around this point o1 we obtain a ball with radius β∆ + λ which contains allof Bo′1 and therefore all of P . Similarly, there is another ball Bo2 containing P as well, using y,y′

instead of x,x′. We now take the two intersection points p′ and q′ of the boundary of the balls Bo1and Bo2 and the line pq each. It is easy to see that

||pq|| ≤ ωmin(P ) ≤ ||p′q′||

holds. Since ||p′q′|| is the minimum width of M = Bo1 ∩Bo2 we can apply our results from Lemma5 by contraposition. Thus, ωmin(M) = ||p′q′|| gives us ωmax(M) = O (||p′q′||/ε), assuming thatωmax(M) ≥ ε∆. Note, that if ωmax(M) < ε∆ the same would apply to P and is therefore theassumption for Lemma 3. Finally, if we choose our λ suciently small, the points x and x′, y andy′ and therefore o1 and o′1 as well as o2 and o′2 come closer, respectively. Also the dierence of theradii of Bo1 and Bo′1 is getting smaller and we get the following equation:

ωmin((P ) ≥ Ω(||p′q′||) ≥ Ω(ε · ωmax(M)) ≥ Ω(ε · ωmax(P ))

The rst part of the equation is obtained using a constant factor c, 0 < c < 1 dependent on λ. Thesecond part is the application of our result from Lemma 5 and the third part is due to the fact thatP is entirely contained within M . Together this gives us the desired result of ωmin(P ) = Ω(ε2r), ifωmax(P ) ≥ ε∆ holds. ut

We now have a lower bound on the size of $P$. To relate that size to the necessary number of iterations, one final lemma is needed.

Lemma 6. Let $K$ be a convex set in $\mathbb{R}^d$. If $\omega_\mu(K) \geq r$ for all $\mu \in S^{d-1}$, then $\mathrm{vol}(K) \geq r^d/d!$.

Proof. We know that the diameter of $K$ must be at least $r$, because all directional widths of $K$ are at least $r$. Consequently, we pick two points $p_1, q_1 \in K$ with the largest distance in $K$ and denote by $\mu_1$ the direction of $p_1q_1$. From that we derive a direction $\mu_2$ which is orthogonal to $\mu_1$ and obtain the points $p_2, q_2$ that have the largest distance in direction $\mu_2$ within $K$. Those four points form a convex quadrilateral $Q_2$ in $\mathbb{R}^2$ (with basis $\mu_1$ and $\mu_2$). We continue in this fashion, obtaining $p_i, q_i$ for the $i$-th dimension and connecting them to all the other points, forming a convex polytope $Q_i$ in $\mathbb{R}^i$, for all $d$ dimensions. The volume of $Q_d$ we subsequently obtain is no smaller than $r^d/d!$. Since $Q_d$ is fully contained within $K$, the same applies to $K$. □

Finally, we can put everything together to bound the number of iterations needed. Lemma 3 gives us that the width of $P$ in all directions is at least $c\cdot\varepsilon^2\Delta$. Therefore, Lemma 6 grants us that the volume of $P$ is at least $(c\cdot\varepsilon^2\Delta)^d/d!$. Then, recalling that $\mathrm{vol}(P) \leq (4\Delta)^d$ at the beginning of the round, Lemma 2 gives us:

$$\log\frac{(4\Delta)^d}{(c\varepsilon^2\Delta)^d/d!} \leq \log\frac{(4\Delta)^d}{(c\varepsilon^2\Delta)^d/d^d} = \log\left(\frac{4\Delta d}{c\varepsilon^2\Delta}\right)^d = d\log\frac{4d}{c\varepsilon^2} \leq d\log\frac{c_2d^2}{\varepsilon^2} = O\left(d\log\frac{d}{\varepsilon}\right).$$

This is the number of communication triggerings Algorithm 3 needs until the size of $P$ falls below the given threshold and the round ends.


Running time: In the general case the computation of the centroid of a convex body is quite complicated, as mentioned in [5]. There is, however, a randomized algorithm by Bertsimas and Vempala [1] that computes an approximate centroid using a separation oracle. This is what they proved:

Lemma 7. Let $K$ be a convex body in $\mathbb{R}^d$ given by a separation oracle, and a point in a ball of radius $\Delta$ that contains $K$. If $\omega_{min}(K) \geq r$, then there is a randomized algorithm with running time $\mathrm{poly}(d, \log(\frac{\Delta}{r}))$ that computes, with high probability, an approximate centroid $z$ of $K$ such that any halfspace that contains $z$ also contains at least $1/3$ of the volume of $K$.

In our case we can emulate such a separation oracle by checking all of the $O(d\log\frac{d}{\varepsilon})$ balls that form $P$. Additionally, we can use $f(t_{start})$ as the initial point required by the lemma. For our purpose we set $r = c\cdot\varepsilon^2\Delta$ and can therefore compute an approximate centroid in time $\mathrm{poly}(d, \log(\frac{\Delta}{r}))$. If the computation fails, then with high probability $\omega_{max}(P) < \varepsilon\Delta$. This gives us a convenient way to check whether we should end the round. We adjust the algorithm accordingly in the following two lines:

1. Line 2 → while the number of iterations in the current round is no more than the $O(d\log(d/\varepsilon))$ bound derived above do
2. Line 3 → compute the approximate centroid of P using the algorithm of Lemma 7 and assign it to g(tnow); if the algorithm of Lemma 7 fails, terminate the current round;

Thus, we can conclude this part with the following result:

Theorem 5. There is an $(O(d\log(d/\varepsilon)), 1+\varepsilon)$-competitive online algorithm to track any function $f : \mathbb{Z}^+ \to \mathbb{Z}^d$. The algorithm runs in time $\mathrm{poly}(d, \log\frac{1}{\varepsilon})$ at every time step.

3.2.2 Case β = 1

In Subsection 3.1 we have shown that considering the points contained in $S_0 = C_2 \cup \cdots \cup C_{d+1}$ is sufficient to find the optimal value for the current round. From the definition of $S_0$ as a union of centres of smallest enclosing balls of at most $d+1$ points in $\mathbb{Z}^d$, we can obtain the following fact:

Lemma 8. For any point $s = (x_1, \ldots, x_d)$ in $S_0$, the $x_i$ $(1 \leq i \leq d)$ are fractions of the form $\frac{y}{z}$, where $y, z$ are integers and $|z| \leq d!(16\Delta^2d)^d$.

Proof. For any point $s \in S_0$, let $B$ be one of the minimum enclosing balls of some points of $\mathbb{Z}^d$ which is centred at $s$. We know that there are $k$ $(2 \leq k \leq d+1)$ integer points $q_1, \ldots, q_k$ lying on the boundary of $B$. Supposing $k = d+1$, we can compute $s$ by solving a linear system of $d$ equations where $||q_1s|| = ||q_2s||, ||q_1s|| = ||q_3s||, \ldots, ||q_1s|| = ||q_{d+1}s||$, or we can rewrite them as $As^T = b$ where $A = (a_{ij}) \in \mathbb{Z}^{d\times d}$ and $b \in \mathbb{Z}^{d\times 1}$. The following derivation for the first equality shows how the matrix $A$ and the vector $b$ are constructed. Here $v[i]$ denotes the $i$-th component of the vector $v$.


$$\sqrt{\sum_{i=1}^d (q_1[i] - s[i])^2} = c = \sqrt{\sum_{i=1}^d (q_2[i] - s[i])^2}$$
$$\Rightarrow \sum_{i=1}^d (q_1[i] - s[i])^2 = c^2 = \sum_{i=1}^d (q_2[i] - s[i])^2$$
$$\Leftrightarrow \sum_{i=1}^d q_1[i]^2 - 2q_1[i]\cdot s[i] + s[i]^2 = c^2 = \sum_{i=1}^d q_2[i]^2 - 2q_2[i]\cdot s[i] + s[i]^2$$
$$\Leftrightarrow \sum_{i=1}^d (q_1[i]^2 - q_2[i]^2) - 2(q_1[i] - q_2[i])\cdot s[i] = 0$$
$$\Leftrightarrow \sum_{i=1}^d 2(q_1[i] - q_2[i])\cdot s[i] = \sum_{i=1}^d (q_1[i]^2 - q_2[i]^2)$$
$$\rightarrow a_{1j} = 2(q_1[j] - q_2[j]), \qquad b[1] = \sum_{i=1}^d (q_1[i]^2 - q_2[i]^2)$$

It is easy to see that each coefficient $a_{ij}$ $(1 \leq i, j \leq d)$ is an integer in the range $[-8\Delta, 8\Delta]$, because the difference of two $q_i[j]$s is at least $-4\Delta$ and at most $4\Delta$, and $b$ is a vector of integers.

For our matrix $A$, Cramer's rule gives us

$$x_i = \frac{\det A_i}{\det A},$$

where $A_i$ denotes the matrix formed by replacing the $i$-th column of $A$ with $b$. Given our bounds on the coefficients $a_{ij}$ we can bound the determinant of $A$: $|\det A| \leq d!(8\Delta)^d$, since the determinant is a sum of $d!$ products of $d$ coefficients, each of absolute value at most $8\Delta$.

Now, if $2 \leq k \leq d$, then the centre $s$ of our ball must lie in the $(k-1)$-dimensional subspace determined by the $q_i$, $i \in [k]$, because all $q_i$s are on the boundary of the ball and the ball is a smallest enclosing ball. We can therefore obtain $s$ from these $q_i$ as $s = \alpha_1q_1 + \alpha_2q_2 + \cdots + \alpha_kq_k = \sum_{i=1}^k\alpha_iq_i$ with $\sum_{i=1}^k\alpha_i = 1$. This is basically due to the fact that if the point lies within the subspace of these $q_i$, together they form a basis from which we can compute $s$. By similar arguments it can be shown that each $\alpha_i$ is in fact a fraction of the form $\frac{y}{z}$, $y, z \in \mathbb{Z}$ and $|z| \leq d!(16\Delta^2d)^d$. Subsequently, since $s = \sum_{i=1}^k\alpha_iq_i$, the same applies to $s$ as well. □

Following from this lemma we know that any two distinct points in $S_0$ have a distance of at least $1/(d!(16\Delta^2d)^d)^2$. Thus, if we set $\varepsilon = 1/(8\Delta(d!(16\Delta^2d)^d)^2)$, we can ensure that if $\omega_{max}(P) < \varepsilon\Delta$ then there is at most one point of $S_0$ in $P$. What is left is to find this point, provided it exists. If we do find it, we can send it to Bob and we do not have to send again until the round ends. Trying to calculate this point directly might prove difficult, therefore we try a different approach.

We call a number $x$ good if $x = \frac{y}{z}$, $y, z \in \mathbb{Z}$ and $|z| \leq d!(16\Delta^2d)^d$. A point $s$ is good if all of its coordinates are good numbers. What we want to do is snap the computed centroid $p$ of $P$ to the nearest good point $s$. If there is a point $s' \in S_0$ which is inside $P$, then $s'$ is a good point to which we can snap our centroid. If the snapped point does not lie within $P$ we terminate the current round; however, if it does lie within $P$, that must mean that we have found the last point of $S_0$.

A problem that arises is that our volume $P$ might get too small to apply the algorithm we use in Lemma 7. To circumvent this problem we use a small trick: we increase the radii of the balls with which we update our volume from $\Delta$ to $(1+\varepsilon)\Delta$, obtaining a volume $P'$. Note that we still want to ensure that our values are within a $\Delta$-range of the actual function value, so we use both $P$ and $P'$ in the algorithm.


Now, if we use this trick, we can see that our new volume $P'$ still contains only the one point $s'$ if our previous $P$ contained it. We can then use our algorithm with the same value $r = c\cdot\varepsilon^2\Delta$. If the algorithm fails, we know that $P$ is empty and the round is over; otherwise we obtain an approximate centroid $p \in P'$. We can find the last $s \in S_0$ by rounding each coordinate of $p$ to the nearest good number and checking whether $s \in P$. [] gives us a theorem that allows us to compute the rounding in polynomial time. Finally, the choice of our $\varepsilon$ and our previous Theorem 5 give us:

Theorem 6. There is an $O(d^2\log(d\Delta))$-competitive online algorithm to track any function $f : \mathbb{Z}^+ \to \mathbb{Z}^d$. The algorithm runs in time $\mathrm{poly}(d, \log\Delta)$ at every time step.

3.3 Tracking a dynamic set

A nice application of multi-dimensional tracking is the task of tracking a dynamic subset of a finite universe $U$. The universe consists of exactly $d$ elements and the function we want to track is defined as $f : \mathbb{Z}^+ \to 2^U$. To accomplish this task we redefine $f$ so that any subset of $U$ is represented by a $d$-dimensional vector, as depicted in Figure 8. Each coordinate of the vector states whether the corresponding item of the universe is contained in the subset. Thus we get a function $f : \mathbb{Z}^+ \to \mathbb{R}^d$. To quantify the difference between two subsets we can now just use the $l_2$ distance.

Fig. 8 * The universe U is mapped to a d-dimensional vector.

Alice now sends Bob approximate results for $f(t_{now})$ using vectors with fractional coordinates. However, if we insist that Alice sends $\{0, 1\}$-vectors, the competitive ratio becomes exponentially large in $\Delta$. This is shown in the following theorem:

Theorem 7. Suppose that there is an $(\alpha, \beta)$-competitive algorithm for online tracking of $f : \mathbb{Z}^+ \to 2^U$ with $|U| > (\beta\Delta)^2$. If the algorithm can only send subsets of $U$, then $\alpha = 2^{\Omega(\Delta^2)}$ for any constant $\beta < 19/18$.

Proof. First we define $H = \{0, 1\}^d$, $d = (\beta\Delta)^2 + 1$. Similar to our proof of Theorem 2, we construct an adaptive adversary that ensures that a round has at least $\alpha$ iterations by manipulating $f(t)$.

Let $S_0$ be the set of possible vectors sent by $A_{OPT}$ in its last communication, i.e. all vectors within distance $\Delta$ of $f(t_{start})$. The cardinality of $S_0$ is

$$|S_0| = \sum_{k=1}^{\Delta^2}\binom{d}{k} = \Omega\left(2^{\Delta^2}\right).$$

The adversary Carole sets $S = S_0$ at the beginning of each round. Analogously to before, Carole chooses the values of the function $f$ according to the messages $A_{SOL}$ sends. Whenever $A_{SOL}$ communicates a message $v \in H$, Carole changes $f$ to the complement of $v$, $u = \mathbf{1} - v$. The distance


between $v$ and $u$ is $\sqrt{d} = \sqrt{(\beta\Delta)^2 + 1} > \beta\Delta$, since every coordinate flips, so $A_{SOL}$ has to send a new message. Whenever Carole changes the function value, the set $S$ decreases by at most

$$|H \setminus \mathrm{Ball}(u, \Delta)| = \sum_{k=\Delta^2+1}^{d}\binom{d}{k} \leq \binom{2\Delta^2}{3\varepsilon\Delta^2} \leq (e/\varepsilon)^{3\varepsilon\Delta^2}$$

elements, for $\varepsilon = \beta - 1 < 1/18$. This results in at least

$$\Omega\left(\frac{2^{\Delta^2}}{(e/\varepsilon)^{3\varepsilon\Delta^2}}\right) = \Omega\left(c^{\Delta^2}\right)$$

iterations, for some constant $c > 1$, before $S$ is empty and the round is over. □

Therefore, if we want to avoid an exponentially large competitive ratio, we have to allow Alice to send fractional coordinates. Using our previous algorithms we ensure that the values $g(t_{last})$ sent by Alice are within a $\Delta$-range of $f(t_{now})$, and if a $\{0, 1\}$-vector is required, the tracker can convert the vector by probabilistically rounding every coordinate. In expectation, the resulting vector is then still within a $\Delta$-range of $f(t_{now})$.
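The rounding step is a one-liner: each fractional coordinate $g_i \in [0, 1]$ becomes 1 with probability $g_i$, so the rounded vector equals $g$ coordinate-wise in expectation. A minimal sketch:

```python
import random

def round_to_subset(g):
    """Probabilistically round a fractional vector to a 0/1 vector.

    Each coordinate g_i in [0, 1] becomes 1 with probability g_i,
    so E[round_to_subset(g)] == g coordinate-wise.
    """
    return [1 if random.random() < x else 0 for x in g]
```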

4 Online Tracking with predictions

In this section we try to improve our previous results by letting Alice predict the future trend of the function. For these predictions Alice uses the function values she has already seen and assumes a "well-behaved" function. The messages Alice sends to Bob are now not values but linear functions. If such a prediction is correct, the necessary communication can be reduced immensely. It is important to notice that we cannot simply compute a new prediction from scratch whenever the actual function $f$ deviates by more than $\Delta$. Our algorithm therefore uses line predictions and follows the framework presented in Section 2.

At the beginning of the round we send the first value of $f$, assuming a constant function. We call the first time when $f$ deviates from the previously sent value by more than $\Delta$ $t_1$. We then construct lines using $q_0, q_1$ as parameters, such that the line passes through $(0, q_0)$ and $(t_1, q_1)$. We call the $(q_0, q_1)$-space the parametric space, wherein every prediction the algorithm sends out is a point. Within this space we can construct a square $P$ which contains all points that are valid $\Delta$-approximations of $f(0)$ and $f(t_1)$. We will now proceed to cut this space with the regions constructed from $f(0)$ and $f(t_i)$, where $t_i$ is the $i$-th triggering of our algorithm. Analogously to before, the optimal solution must lie within the intersection of all these regions.

The major task is to choose the initial set $S = S_0$ at the beginning of the round, when the algorithm has to send a value for $t_1$. Afterwards we proceed similarly to the algorithm described in Subsection 3.1. To construct $S$ we need to define two intermediate sets $M$ and $L$. Let $M = \{(t, y) \mid t \in [T], y \in (\mathbb{Z} + \Delta) \cup (\mathbb{Z} - \Delta)\}$, where $T$ is the length of the tracking period and $\mathbb{Z} + \Delta$ denotes $\{x \mid x = y + \Delta, y \in \mathbb{Z}\}$, and similarly $\mathbb{Z} - \Delta$. It is worth noting at this point that $\Delta$ does not need to be an integer. Then we construct $L$ as the collection of all lines that pass through two points of $M$. We then need two additional collections $X$ and $Y$ which contain the intersection points of all lines of $L$ with the line $t = 0$ and with the line $t = t_1$, respectively. $S_0$

is then chosen as (q0, q1)|q1 ∈ X, q1 ∈ Y ∩ P , where P is the square mentioned above.We now argue that the lines in S0 are sucient to consider and we will show when AOPT does not

communicate there must be some surviving point (q0, q1) ∈ S0. We consider the function space asdepicted in gure 11. Let l be the line chosen by AOPT in its last communication. If AOPT does notcommunicate in the current round, l must still intersect all the line segments ((t, f(t)−∆), (t, f(t)+∆), for Tstart ≤ t ≤ tnow. We can then translate an rotate l so that it passes through two points in

Page 41: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Mutli-Dimensional Online Tracking 33

q0

q1

f(0) - Δ f(0) + Δ

q1 = f(t1) - Δ

q1 = f(t1) + Δ P

g(t1)

Fig. 9 The parametric space

Q

q0

q1

f(0) - Δ f(0) + Δ

q1 = t1/t2(f(t1) + Δ - q0) P

g(t1) q1 = t1/t2(f(t1) - Δ - q0)

Fig. 10 Cutting in the parametric space

Δ

t1 t2 t3 t4

l

l'

Fig. 11 Moving optimal the line l

M , and the resulting line l′ still intersects all line segments. Therefore l′ survives and the round isnot yet nished.

At last we bound the cardinality of S0:

Lemma 9. |S0| = O(∆2T 6

), where T is the length of the tracking period.

Proof. If a line (q0, q1) passes two points (ti, f(ti)±∆), (tj , f(tj)±∆), (ti, tj ∈ Z+, 0 ≤ ti ≤ tj ≤ T )in M , then

q0 = (f(ti)±∆)− (f(ti)±∆)− (f(tj)±∆)

ti − tjti,

q1 = (f(ti)±∆)− (f(ti)±∆)− (f(tj)±∆)

ti − tj(ti − t1)

The number of possible choices for q0 is O(∆T 3

)and the same applies to q1. Thus, the cardinality

of S0 is at most O(∆2T 6). ut

From the proof of the Tukey we still know that the number of iteration is logarithmic in S0. Inthis case the dimension is 1 so the number of iterations is just O (logS0) which results with thelemma above to:

Theorem 8. There is an O (log(∆T ))-competitive online algorithm to track any function f : Z+ →Z with line predictions, where T is the length of the tracking period.

To construct our S0 our algorithm assumes to know the length of the tracking period. Supposingthat this information is not given in advance we can use a squaring trick to keep the competitiveratio. We begin by setting T to ∆. Whenever tnow reaches T and the current round has not endedyet, we restart our computations as if a new round started with T set to T 2. The number of iterationsis still at most O (log(∆T )).

5 Open Problems

The problem studied in this essay is a special case of the tracking framework where there is onlyone site. Potentially, a generalization of the techniques presented here to multiple sides might beuseful. Additionally if we consider transmitted bits instead of entire messages the competitive factorwill increase by a factor of d. Therefore exploring whether a transmission of only specic vectorcoordinates will improve the results. Finally the problem could be reconsidered using dierent metricspaces.

Page 42: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

34 Steen Knorr

6 Summary / Conclusion

We have seen that we can track multi-dimensional functions with a good competitive ratio andpolynomial runtime and shown the application of tracking dynamic set. The technique of applyingvolume cutting to implicitly maintain a set has proven protable and moreover we found a way tominimize the communication even further by predicting the values of the function.

References

1. Bertismas, A and Vempala, S.: Solving convex programs by random walks. Journal of the ACM 51 (2004)2. Clarkson, K. L., Eppstein,D., Miller G. L., Sturtivant, C., and Teng, S.: Appoximating center points with iterated

radon points. In Proc. Annual Symposium on Computational Geometry (1993)3. Grunbaum, B.: Partitions of mass-distributions and of convex bodies by hyperplanes. Pacic J. Math. (1960)4. Matousek, J.: Lectures on Discrete Geometry. Springer-Verlag New York Inc. (2002)5. Rademacher, L.: Approximating the centroid is hard. In Proc. Annual Symposium on Computational Geometry.

(2007)6. Yi, K and Zhang, Q.: Mutli-Dimensional Online Tracking. ACM-SIAM Symposium on Discrete Algorithms

(SODA. (2009)

Page 43: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Frequency Moments - Approximations and SpaceComplexity

Felix Biermeier

Abstract The frequency moments characterize a sequence of elements by a number Fk =∑ni=1m

ki

where mi denotes the amount of each type of element i. We present the work of N. Alon, Y.Matias and M. Szegedy who construct randomized algorithms to approximate Fk and analyze itsspace complexity [2]. It turns out that there exists approximation algorithms for certain frequencymoments in logarithmic space. In particular for those frequency moments which are useful in database applications. Unfortunately, the lower bound to approximate Fk for k > 5 is at least Ω(n1−1/k).

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352 Approximations of frequency moments Fk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.1 Approximating Fk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362.2 Improved space bound for F2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392.3 Approximating F0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3 Lower bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423.1 Space complexity of deterministic algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423.2 Space complexity of F∞ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.3 Space complexity of Fk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

1 Introduction

In this work we present estimates of the frequency moments of data streams by Noga Alon, YossiMatias and Mario Szegedy [2]. Due to huge data sets where we cannot store the whole input in ourmemory it is only natural to consider a data streaming model. The frequency moment is simplythe sum over the number of occurrences of each data value in the stream to the power of a certaininteger number which we denote by Fk =

∑ni=1m

ki where mi counts the occurrence of each data

values i ∈ 1, . . . , n. A special case is F∞ = max1≤i≤nmi.In general the frequency moment supplies statistical informations about the given data set. Dueto the denition F0 and F1 are very intuitive where F0 is the number of distinct elements and F1

is the length of the whole data stream. The other frequency moments are less intuitive but notless interesting. These are indicators for the skew of data which is especially relevant in databaseapplications. The frequency moment F2 is used in [8] to estimate the size of query results.The objective of this work is to present several approximation algorithms to estimate frequencymoments. In particular we are interest in algorithms with sublinear space complexity due to the

Felix BiermeierUniversität Paderborn, Warburger Str. 100, 33098 Paderborn, e-mail: [email protected]

35

Page 44: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

36 Felix Biermeier

huge amount of data. The main focus lies on randomized algorithms. We will see that in comparisonto deterministic algorithms the randomized algorithms are more ecient in regards to the spacecomplexity.This work is structured as follows. We start with approximations of the frequency moment Fk insection 2 which contains additionally an improved approximation of F2. After this we show lowerbounds regarding the space complexity in section 3. In this section we rely most of the time oncommunication complexity. We also disclose the connection between randomized and deterministicalgorithms mentioned before. The last section provides some additional remarks.

2 Approximations of frequency moments Fk

In this section we present randomized approximations to compute the frequency moment Fk. Theobjective is to construct those algorithms in sublinear space with respect to the length of the datastream and the number of dierent data values. A naive approach would be to consider a countervariable for every possible data value which is a simple approach but not space ecient unless weare only interested in F1. By the denition of frequency moments F1 denotes the length of the datastream. Therefore it is sucient to use a single counter for all dierent values.

2.1 Approximating Fk

In this part we present and analyze an algorithm to approximate the frequency moments Fk. Thegeneral idea of this algorithm is to compute a good estimation of Fk in expectation. Additionallythe algorithm will repeat this procedure independently from each other and outputs the median ofthose results for each instance. This is a usual technique to boost the success probability of a certainevent. It follows the whole statement.

Theorem 1. For every k ≥ 1, every δ > 0 and every ε > 0 there exists a randomized algorithm thatcomputes, given a sequence A = (a1, . . . , am) of members of N = 1, . . . , n in one pass and using

O(k log (1/ε)

λ2n1−1/k(log n+ logm)

)memory bits, a number Y such that

Pr [|Y − Fk| > λFk] ≤ ε.

Proof. Le be s1 = 8kn1−1/k

λ2 and s2 = 2 log (1/ε). To simplify the proof we assume that the lengthof a sequence A is known and we set m = |A|. We can get rid of this assumption but we omit thedetails. The algorithm to approximate Fk works as follows.

Page 45: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Frequency Moments - Approximations and Space Complexity 37

Algorithm 1 Estimate Fk1: for i← 1 to s2 do

2: for j ← 1 to s1 do

3: apR← A

4: r ← |q : q ≥ p, aq = ap|5: Xij ← m

(rk − (r − 1)k

)6: end for

7: Yi ← 1s1

s1∑j=1

Xij

8: end for

9: output median of (Y1, . . . , Ys2 )

The inner loop of the algorithm computes a random variable Yi which consists of the average ofindependent and identically distributed random variables Xij . The outer loop repeats this procedurefor a certain amount of runs. Finally the algorithm outputs the median of those Yi. We can describethis approach by the median of means. The use of the median is a usual technique to boost thesuccess probability of a certain event what we will see in the analysis. The main issue of the algorithmis to track a random value ap from the sequence and count the occurrence of ap in the sequenceA subsequently. In order to compute a certain Xij we maintain the current value ap ∈ N and acounter r ∈ 1, . . . ,m. This requires O (log n+ logm) space.The analysis of the algorithm is quite simple. We are interested in the expected value and variance ofa single Yi. This requires the analysis of Xij . To simplify the notation we set X = Xij . Fortunately,the expected value of X supplies already a good result to estimate Fk

E[X] =1

m

n∑i=1

mi∑j=1

m(jk − (j − 1)k)

=m

m[(1k + (2k − 1k) + · · ·+ (mk

1 − (m1 − 1)k)) +

(1k + (2k − 1k) + · · ·+ (mk2 − (m2 − 1)k)) + · · ·+

(1k + (2k − 1k) + · · ·+ (mkn − (mn − 1)k))]

=

n∑i=1

mki = Fk.

With regards to the variance we consider the denition Var[X] = E[X2]−E[X]2 where the expectedvalue of X2 is

E[X2] =1

m

n∑i=1

mi∑j=1

(m(jk − (j − 1)k)

)2=m2

m[(12k + (2k − 1k)2 + · · ·+ (mk

1 − (mi − 1)k)2) +

(12k + (2k − 1k)2 + · · ·+ (mk2 − (m2 − 1)k)2) + · · ·+

(12k + (2k − 1k)2 + · · ·+ (mkn − (mn − 1)k)2)].

We can estimate this term by using the following inequality

ak − bk ≤ (a− b)kak−1, a > b > 0

which leads to

Page 46: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

38 Felix Biermeier

E[X2] ≤ m[(k12k−1 + · · ·+ kmk−11 (mk

1 − (m1 − 1)k)) + · · ·+(k12k−1 + · · ·+ kmk−1

n (mkn − (mn − 1)k))]

= km[(12k−1 + · · ·+mk−11 (mk

1 − (m1 − 1)k)) + · · ·+(12k−1 + · · ·+mk−1

n (mkn − (mn − 1)k))]

≤ km[(m2k−11 + · · ·+mk−1

1 (mk1 − (m1 − 1)k)) + · · ·+

(m2k−1n + · · ·+mk−1

n (mkn − (mn − 1)k))]

= km

n∑i=1

mk−1i mk

i = km

n∑i=1

m2k−1i = kmF2k−1 = kF1F2k−1.

The denition of Yi supplies

E[Yi] = E

1

s1

s1∑j=1

Xj

=1

s1

s1∑j=1

E [Xj ] = Fk.

This is a crucial observation and it allows us the computation of Var[Yi] with the fact below(n∑i=1

mi

)(n∑i=1

m2k−1i

)≤ n1−1/k

(n∑i=1

mki

)2

. (1)

Therefore by the independence of the random variables Xj for Yi we get

Var[Yi] = Var

1

s1

s1∑j=1

Xj

=1

s21

s1∑j=1

Var [Xj ] =1

s21

s1∑j=1

E[X2j

]− E [Xj ]

2 ≤E[X2j

]s1

≤ kF1F2k−1

s1

(1)

≤ kn1−1/kF 2k

s1.

Together with the observation above the Chebyshev's Inequality supplies for every xed i and

s1 = 8kn1−1/k

λ2

Pr[|Yi − Fk| > λFk] ≤ Var[Yi]

(λFk)2≤ kn1−1/kF 2

k

s1λ2F 2k

≤ 1

8.

Remember that the algorithm outputs the median of all Yi. The Cherno Bound will show that themedian is a good choice as an output in most of the cases. To apply Cherno Bound we dene byZi a random variable

Zi = 1⇔ |Yi − Fk| > λFk

such that Z =∑s2i=1 Zi. All Zi are independent binary random variables since all Yi respectively

all Xij are independent. With respect to our desired result Zi describes a bad event where ourestimation Yi deviates too much from the frequency moment Fk. The expectation of Z is

E [Z] =

s2∑i=1

E [Zi] ≤s2

8.

By choosing δ = 3 and µ = E[Z] Cherno bound supplies

Page 47: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Frequency Moments - Approximations and Space Complexity 39

Pr[Z ≥ (1 + 3)

s2

8=s2

2

]≤ e−

9s2/8

2(4/3) = e−27 log(1/ε)

32 = e27 log(ε)

32 = ε27

32 ln(2) ≤ ε, 0 < ε < 1

which is sucient for all relevant ε. Therefore if the number of bad events is less than s2/2 themedian Y supplies a good estimation for Fk with the probability

Pr [|Y − Fk| ≤ λFk] ≥ 1− ε.

ut

2.2 Improved space bound for F2

While Theorem 1 already supplies a relatively good approximation with a sublinear space constraintit is possible to improve this result for certain Fk. In this section we present such a case for F2. Thealgorithm basically use a similar approach what we have seen before but the space complexity isdecreased to a logarithmic term.

Theorem 2. For every δ > 0 and every ε > 0 there exists a randomized algorithm that computes,given a sequence A = (a1, . . . , am) of members of N = 1, . . . , n in one pass and using

O(

log (1/ε)

λ2(log n+ logm)

)memory bits, a number Y such that

Pr [|Y − F2| > λF2] ≤ ε.

Proof. Similar to the previous proof we set the parameter s1 and s2 to s1 = 16λ2 respectively s2 =

2 log (1/ε). The algorithm to approximate F2 works as follows.

Algorithm 2 Estimate F2

1: for i← 1 to s2 do

2: for j ← 1 to s1 do

3: vp = (ε1, . . . , εn)R← V

4: Z ←(

n∑l=1

εlml

)2

5: Xij ← Z2

6: end for

7: Yi ← 1s1

s1∑j=1

Xij

8: end for

9: output median of (Y1, . . . , Ys2 )

Similar to the previous algorithm in section 2.1 Yi and Xij are random variables while Xij areindependent and identically distributed. The main dierence between those two algorithms is thecomputation of Xij . The set V of size h ∈ O

(n2)consists of vectors vi where each entry is a random

variable εl ∈ −1, 1. Additionally the random variables εl are four-wise independent. This will besucient to proof the statement. The construction of such a set is based on arithmetic operations ina nite eld which is rather technical and can be nd in [1]. In order to fulll the space constraintthe random variables Xij are computed in two steps. Since Z is a linear function it is only necessaryto maintain the current sum of Z and the current value p to get the relative εl. This is possible inO (log n+ logm) space.

Page 48: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

40 Felix Biermeier

As in the proof before we compute the expected value and variance of Yi. Due to the denition ofYi the expected value is

E[Yi] = E

1

s1

s1∑j=1

Xj

=1

s1

s1∑i=1

E [Xj ] = E[Xj ]

where the expected value of a single Xj is

E[X] = E

( n∑i=1

εimi

)2 =

n∑i=1

m2iE[ε2

i ] + 2∑

1≤i<j≤nmimjE[εi]E[εj ] =

n∑i=1

m2i = F2.

By the denition of the variance we have to compute the term

E[X2] = E

( n∑i=1

εimi

)4 = E

( n∑i=1

εimi

) n∑j=1

εjmj

( n∑k=1

εkmk

)(n∑l=1

εlml

)=

n∑i=1

m4iE[ε4i

]+

∑1≤i<j≤n

4m3imjE

[ε3i εj]

+∑

1≤i<j≤n6m2

im2jE[ε2i ε

2j

]+

∑1≤i<j<k≤n

12m2imjmkE

[ε2i εjεk

]+

∑1≤i<j<k<l≤n

24mimjmkmlE [εiεjεkεl]

=

n∑i=1

m4i + 6

∑1≤i<j≤n

m2im

2j .

This leads to

Var[X] = E[X2]− E [X]

2=

n∑i=1

m4i + 6

∑1≤i<j≤n

m2im

2j −

(n∑i=1

m2i

)2

=

n∑i=1

m4i + 6

∑1≤i<j≤n

m2im

2j −

n∑i=1

m4i − 2

∑1≤i<j≤n

m2im

2j

≤ 4∑

1≤i<j≤nm2im

2j + 2

n∑i=1

m4i = 2

2∑

1≤i<j≤nm2im

2j +

n∑i=1

m4i

= 2

(n∑i=1

m2i

)= 2F 2

2 .

By using the observation from above we obtain

Var[Y ] = Var

[1

s1

s1∑i=1

Xi

]=

1

s21

s1∑i=1

Var [Xi] =1

s1Var [Xi] ≤

2F 22

s1.

This is sucient to apply the Chebyshev Inequality such that

Pr[|Yi − F2| > λF2] ≤ Var[Yi]

(λFk)2≤ 2F 2

k

s1λ2F 2k

=1

8.

Similar to the previous proof the Cherno Bound completes the proof.

Page 49: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Frequency Moments - Approximations and Space Complexity 41

2.3 Approximating F0

In this section we present the missing frequency moment F0 which is not covered by the algorithmsabove. Due to its denition of F0 we cannot apply the approach from before. However, there existsother approaches and the space complexity is not less exciting since we can construct an algorithmwith logarithmic space constraint. The crucial part of the algorithm is the existence of a family oflinear hash functions with certain properties.

Theorem 3. For every c > 2 there exists a randomized algorithm that, given a sequence A ofmembers N , computes a number Y using O (log n) memory bits, such that the probability that theratio between Y and F0 is not between 1/c and c is at most 2c.

Proof. Consider a nite eld F = GF (2d) with the smallest d ∈ N such that 2d > n and N =1, . . . , n is a subset of F . This allows computations of arithmetic operations in F with membersof the sequence A. This can be observed in the algorithm below.

Algorithm 3 Estimate F0

1: a, bR← F . independent

2: R← 0

3: for i← 1 to m do

4: zi ← a · ai + b

5: ri ← r(zi) . r(z) denotes the largest number of rightmost bits which are all 06: if ri > R then

7: R← ri;8: end if

9: end for

10: output 2R

We choose a, b ∈ F uniformly at random and independently. The general idea of this algorithm isthe computation of linear hash functions zi over the nite eld F and members of the sequence A.This is a random mapping of elements ai ∈ A to zi and it possesses some useful properties whatwill be observed later.An other interesting point is the function r(z) which denotes the largest number of rightmost bitsof a binary vector where those bits are all 0. This function aects the output of the algorithm .Regarding the space complexity it is sucient to maintain the values a, b and the current binaryindex of R. This is possible in O (log n) respectively O (log log n) space.For the rest of the proof we assume F0 is the correct value of distinct elements in A and r is xed.The analysis is based on two properties of zi which were indicated before. The random mappingfrom ai to zi is uniformly distributed over F if both values are xed. This implies for r ∈ N

Pr [r(zi) ≥ r] =1

2r.

The other property is the pairwise independence for two distinct and xed elements ai and aj whichimplies

Pr [r(zi) ≥ r ∧ r(zj) ≥ r] =1

22r.

In order to continue it is necessary to dene a random variable with regards to the number of distinctelements F0. Let Wai be a random variable

Wai = 1⇔ r(zi) ≥ r

Page 50: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

42 Felix Biermeier

such that Zr =∑Wai where each value of all F0 distinct elements is count only once. This leads to

E [Zr] =∑

E [Wai ] =F0

2r.

Due to the pairwise independence of zi and zj the variance of Zr is

Var[Zr] = Var

[∑Wai

]=∑

Var [Wai ] =∑ 1

2r

(1− 1

2r

)=F0

2r

(1− 1

2r

)≤ F0

2r.

To archive the desired result it is necessary to bound the probability of

2r

F0/∈[

1

c, c

].

We distinguish between the probabilities of Zr = 0 and Zr > 0. The inequality c2r < F0 and theChebyshev inequality implies

Pr [Zr = 0] ≤ Pr [|Zr − E [Zr]| ≥ E [Zr]] ≤Var [Zr]

E [Zr]2 <

E [Zr]

E [Zr]2 =

1

E [Zr]=

2r

F0<

1

c

while the inequality 2r > cF0 and the Markov inequality implies

Pr [Zr > 0] = Pr [Zr ≥ 1] ≤ E [Zr]

1=F0

2r<

1

c.

By using the union bound we archive

Pr

[2r

F0/∈[

1

c, c

]]<

1

c+

1

c=

2

c.

In case the algorithm outputs Y = 2R for which Zr > 0 we obtain our desired result. ut

3 Lower bounds

In the previous section we have seen a couple of randomized approximations to compute the fre-quency moment Fk in sublinear space. The question arises if we can establish lower bounds for Fk.We show a rather simple case for F∞ and we continue with the general cases of Fk. Especially thelatter cases require a deeper introduction to communication complexity. But before we start with astatement which demonstrates the benets of randomization in the approximations above.

3.1 Space complexity of deterministic algorithms

In this section we present the necessity of randomization in all algorithms of section 2. If we use adeterministic algorithm it is impossible for almost all frequency moments Fk to archive a sublinearspace constraint. The only exception is F1 what is the length of the data stream and it can computein logarithmic space by using a single counter.

Page 51: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Frequency Moments - Approximations and Space Complexity 43

Proposition 1. For any nonnegative integer k 6= 1, any deterministic algorithm that outputs, givena sequence A of n/2 elements of N = 1, . . . , n, a number Y such that |Y − Fk| ≤ 0.1Fk must useΩ(n) memory bits.

Proof. By results of coding theory there exists a family G of t = 2Ω(n) subsets of N such that eachG ∈ G has a cardinality of at most n/4 and for all G1 6= G2 holds

|G1 ∩G2| ≤ n/8.

We x a deterministic algorithm D from above and dene a input sequence A(G1, G2) with lengthn/2 for all two members of G. The length of such an input is by denition n/2. We will comparethe memory congurations of two distinct input sequences after the algorithm nished the rst n/4elements respectively the rst subset G of A and the whole input.We assume that the size of the memory is at most log t. Imagine that the algorithm D produces foreach G ∈ G a memory conguration and we can give each conguration a number. Since the sizeof G is t the assumption together with the pigeonhole principle imply that there exists at least twodistinct sets G1, G2 ∈ G for which the algorithm D produces the same memory conguration.We consider two input sequences A(G1, G1) and A(G2, G1) where G1, G2 ∈ G are distinct. Bythe argument above the algorithm produces for the rst half of both inputs the same memoryconguration and eventually the same result for the whole input. The rst input A(G1, G1) supplies

F0 = n/4,

Fk =∑i∈G1

2k = 2k · |G1| = 2k · n/4

while the second input A(G2, G1) supplies

F0 ≥n

2− n

8=

3n

8,

Fk ≤(n

2− 2n

8

)+ 2k · n

8=n

4+ 2k · n

8.

We denote by F(1)k and F

(2)k for k ∈ N the output of the algorithm D under the rst respectively

second input sequence. By comparing both outputs we can see that they are unequal

F(1)0 =

n

4<

3n

8≤ F (2)

0

F(1)k = 2k · n

4> 2k · n

(12 + 1

2k

)4

= 2k ·(n

8+

n

4 · 2k)

=n

4+ 2k · n

8≥ F (2)

k , k > 1.

This is a contradiction to the assumption that the algorithm D outputs the same value for bothinputs and it follows that the algorithm must use at least log t = Ω(n) memory bits. ut

3.2 Space complexity of F∞

In this section we give a briey introduction to communication complexity which we can apply tothe special case F∞. Just for a reminder F∞ describes the maximum number of occurrences of anitem value over all possible item values in the data stream. The denitions below will help us to setup a foundation for the upcoming statements. About that the rst denition is in general useful toanalyze communication complexity of certain problems. It is based on [7].

Page 52: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

44 Felix Biermeier

Denition 1 (probabilistic communication complexity). We have given a boolean functionf : 0, 1n × 0, 1n 7→ 0, 1, two players with unlimited computation power and inputs x, y ∈0, 1n. Both players know their own input x respectively y and they are allowed to send messagesto each other according to a probabilistic protocol. At the end they output the value of f(x, y).With a probability of at least 1 − ε this output is the correct value. We denote by Cε(f) the ε-error probabilistic communication complexity of f which is dened as the expected number of bitscommunicated in the worst-case.

Denition 2 (Disjointness problem). We have given the disjointness function DISn : 0, 1n ×0, 1n 7→ 0, 1 and two inputs x, y ∈ 0, 1n. Both inputs characterize a subset Nx respectivelyNy of 1, . . . , n. The function DISn outputs 1 i Nx and Ny intersect.

Together with a result of [4] for the disjointness problem it follows a lower bound for F∞.

Proposition 2. Any randomized algorithm that outputs, given a sequence A of at most 2n elementsof N = 1, . . . , n a number Y such that

Pr[|Y − F∞| ≤ F∞/3] > 1− ε

for some xed ε < 1/2 must use Ω(n) memory bits.

Proof. We have given an algorithm D as above and two players with binary inputs x respectively ywhich characterize subsets Nx, Ny ⊆ 1, . . . , n. We will reduce the disjointness problem to the F∞problem by dening a communication protocol between those two players. Let A be a sequence oflength at most 2n consisting of all elements in Nx and Ny without mixing or ordering both subsets.The communication protocol is quite simple. It starts with the rst player who runs the algorithmD on the rst |Nx| elements of A. After this it sends its content of memory to the second playerwho applies the algorithm D on the rest of A and outputs DISn(x, y). According to the disjointnessfunction we distinguish between two results

DISn(x, y) =

0 if D outputs a value < 4/3

1 else.

The rst case implies that both sets Nx and Ny are disjoint which means that the correct value ofF∞ is 1 and |Y − F∞| ≤ F∞/3 holds for the required probability.The second case implies that both sets Nx and Ny intersect which means that the correct value ofF∞ is 2 and |Y − F∞| ≤ F∞/3 holds for the required probability as well.This shows that the F∞ problem is at least as hard as the disjointness problem. Regarding the resultfrom [4] we know that the disjointness problem needs at least Ω(n) memory bits. Therefore we geta lower bound for F∞ by the reduction above and we obtain our desired result. ut

3.3 Space complexity of Fk

In this section we analyze the space complexity of the general frequency moments Fk. To do thiswe have to do more preliminary. First we introduce the distributional communication complexitywhich gives us a lower bound for the probabilistic communication complexity Cε(f). The correctnessbehind this based on Yao's Minimax Principle which gives us the possibility to establish lower boundsfor randomized algorithm by analyzing the performance of deterministic algorithms. The argumentsare based on [3, 5, 6].

Denition 3 (distributional communication complexity). Similar to Denition 1 we havegiven a boolean function f , two players and inputs x, y ∈ 0, 1n. But in this case both players use

Page 53: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Frequency Moments - Approximations and Space Complexity 45

a deterministic protocol to communicate with each other. At the end they output the correct valueof f(x, y) for all input pairs (x, y) except for at most an ε-fraction of the input under a probabilitydistribution µ. We denote by Dε(f |µ) the ε-error distributional communication complexity for funder a distribution µ.

It will be necessary to change the communication game in comparison to the one we used in Propo-sition 2. The core of this game treats still a disjointness problem but we adapt the number of playersand the objective itself.

Denition 4 (communication game). We have given s, t, n ∈ N and N = 1, . . . , n. We callDIS(s, t) a communication game with s players P1, . . . , Ps. Each player has an input Ai with Ai ⊆ Nand |Ai| = t. The goal is to distinguish between disjoint respectively uniquely intersecting inputsequences (A1, . . . , As). The players are allowed to communicate with each other according to aprobabilistic protocol. At the end of the protocol player Ps outputs a value.

Before we nally proof the main result of this section we need a last additional denition whichhelps us to quantify the quality of a probabilistic protocol in regards to the communication gameDIS.

Denition 5 (ε-correct protocol). We have given the communication game DIS(s, t). A proba-bilistic protocol is ε-correct if the protocol outputs 0 for any disjoint input respectively it outputs 1for any uniquely intersecting input with probability at least 1− ε. The output of other inputs maybe arbitrary.

Theorem 4. For any xed k > 5 and δ < 1/2, any randomized algorithm that outputs, given aninput sequence A of at most n elements of N = 1, . . . , n a number Zk such that

Pr[|Zk − Fk| > 0.1Fk] < δ

uses at least Ω(n1−5/k) memory bits.

Proof. Given an algorithm D as above we dene a randomized protocol for DIS(s, t) where D usesM memory bits, n = (2t− 1)s+ 1 with s = n1/k, t ∈ Θ(n1−1/k) and n is suciently large enough.To simplify the proof we assume t = n1−1/k.By the denition of DIS(s, t) we have s players P1, . . . , Ps with corresponding inputs A1, . . . , Aswhere each Ai is a subset of N with a cardinality of t. The communication between the players isdened by the following protocol. The rst player P1 runs the algorithm D on his input A1 and hesends his content of memory to the second player A2. This continues until the last player Ps outputsthe nal value Zk of D.According to DIS(s, t) we distinguish between two results

DIS(s, t) =

0 if Zk ≤ 1.1st

1 else .

The rst case implies that the inputs A1, . . . , As are disjoint which means that the correct value ofFk is st and |Zk − Fk| > 0.1Fk holds for the required probability. The second case implies that theinputs A1, . . . , As are uniquely intersecting which means that the correct value of Fk is

Fk = sk + s(t− 1) = (2t− 1)s+ 1 + s(t− 1) = s(3t− 2) + 1 > s(3t− 2) ≥ (3

2+ a)n, a ∈ o(1)

and |Zk − Fk| > 0.1Fk holds for the required probability as well.Therefore the algorithm D approximates Fk with probability at least 1 − γ which implies that theprotocol for DIS(s, t) is γ-correct. With regards to the communication complexity the protocol uses

Page 54: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

46 Felix Biermeier

at most sM memory bits. By applying Proposition 3 which we will proof later we get a lower boundΩ(t/s3) for the amount of memory bits. This leads to the desired result

M ≥ Ω(t/s4) = Ω(n/s5) = Ω(n1−5/k).

utThe rest of this section contains all statements which are necessary to complete the proof above andstill unproved.

Proposition 3. For any xed ε < 1/2 and any t ≥ s4, the length of any randomized ε-correctprotocol for the communication problem DIS(s, t) is at least Ω(t/s3).

Proof. We dene a probability distribution µ on the input sequence (A1, . . . , As) according toDIS(s, t). We consider partitions P of N = 1, . . . , n such that P =

⋃si=1 Ij ∪ x where each

Ij has cardinality of 2t − 1 and all Ij and x are pairwise disjoint. We choose such a partition Puniformly at random of all those partitions of N . Furthermore we choose a subset Aj of cardinalityt of Ij uniformly at random. At last we dene for both cases with probability 1/2

Aj =

Aj ∀j : 1 ≤ j ≤ s(Ij −Aj) ∪ x ∀j : 1 ≤ j ≤ s.

This shows that the distribution µ generates disjoint respectively uniquely intersecting input se-quences (A1, . . . , As) with probability 1/2 in both cases. An other aspect is that each of the disjointrespectively uniquely intersecting input has the same probability. To distinguish between both in-puts we denote by

(A0

1, . . . , A0s

)disjoint input sequences and by

(A1

1, . . . , A1s

)uniquely intersecting

input sequences. Furthermore we dene a box as a family X1 × · · · ×Xs where each Xi is a set ofsubsets of cardinality t of N .It is possible to show that all input sequences with a xed and corresponding communication betweenplayers according to DIS(s, t) forms a box. Regarding the distributional communication complexityproblem it suces to show that every deterministic protocol with less than Ω(t/s3) communicationbits errs with probability Ω(1) where we apply inputs according to the distribution µ.As we have mentioned above each box contains a xed communication pattern. We are looking forcommunication patterns where the relative protocol outputs 0 on an input sequence (A1

1, . . . , A1s)

which means that the protocol errs. If the number of patterns which output 0 is less than ps2ct/s

3

we conclude by summing up Lemma 1 over all boxes that

Pr[output 0 on input (A1

1, . . . , A1s)]≥ 1

2ePr[output 0 on input (A0

1, . . . , A0s)]− p.

By choosing a suciently small constant p > 0 we have shown the desired result. utLemma 1. There exists an absolute constant c > 0 such that for every box X1 × · · · ×Xs

Pr[(A1

1, . . . , A1s

)∈ X1 × · · · ×Xs

]≥ 1

2ePr[(A0

1, . . . , A0s

)∈ X1 × · · · ×Xs

]− s2−ct/s3 .

Proof. We x a box X1 × · · · ×Xs. A partition P is j-bad for a constant c > 0 if

PrP [A1j ∈ Xj ] <

(1− 1

s+ 1

)PrP [A0

j ∈ Xj ]− s2−ct/s3

,∀j, 1 ≤ j ≤ s

where PrP (·) denotes the conditional probability of a certain event given a partition P . Similar tothis a partition P is bad if the partition P is j-bad for at least one j respectively a partition is goodif it is not j-bad for all j. According to these denitions we dene random variables χ(P ) and χj(P )such that

Page 55: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Frequency Moments - Approximations and Space Complexity 47

χ(P ) = 1⇔ P is bad partition,

χj(P ) = 1⇔ P is j-bad.

It is obvious that χ(P ) ≤s∑j=1

χj(P ). To simplify the notation we denote two events by

E ′ =(A0

1 × · · · ×A0s

)∈ X1 × · · · ×Xs,

E =(A1

1 × · · · ×A1s

)∈ X1 × · · · ×Xs.

The crucial observation is that the partition P is chosen uniformly at random which leads to

Pr [E ] =∑P

PrP [E ] · Pr [P ] =1

#P

∑P

PrP [E ](Def)

= E [PrP [E ]] . (2)

By Lemma 3 and the equality 2 above it follows that

Pr [E ] ≥ E [PrP [E ] (1− χ(P ))] ≥ 1

eE [PrP [E ′] (1− χ(P ))]− s2−ct/s3 . (3)

Keeping this in mind we continue with the term E [PrP [E ′]χj(P )] and we consider Lemma 2 tobound χj(P ). Given the information on the partition P it is enough to analyze the conditionalprobability PrP

[A0j ∈ Xj

]. Due to Lemma 2 the choice of x of the union Ij ∪ x is still left. We

denote by l the number of subsets of cardinality t in Xj which are part of the union Ij ∪ x. Thisimplies

PrP[A0j ∈ Xj

]= l/

(2t

t

).

Even any choice of x changes not much but only that

PrP[A0j ∈ Xj

]≤ l/

(2t− 1

t

)= l/

(2t

t

).

We conclude by Lemma 2 that

E [PrP [E ′]χj(P )] ≤ 2

20sE [PrP [E ′]] ≤ 1

2sE [PrP [E ′]] .

By using the inequality from above we get

E [PrP [E ′]χ(P )] ≤ E

PrP [E ′]s∑j=1

χj(P )

≤ s∑j=1

E [PrP [E ′]χj(P )] ≤s∑j=1

1

2sE [PrP [E ′]] =

1

2E [PrP [E ′]] .

(4)

Therefore by combining all inequalities we archive the desired result

Pr [E ](3)

≥ 1

eE [PrP [E ′]]− 1

eE [PrP [E ′]χ(P )]− s2−ct/s3

(4)

≥ 1

eE [PrP [E ′]]− 1

2eE [PrP [E ′]]− s2−ct/s3

=1

2eE [PrP [E ′]]− s2−ct/s3 .

ut

Page 56: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

48 Felix Biermeier

Lemma 2. There exists a choice for the constant c > 0 such that a partition P is j-bad and thefollowing holds. For any set of s − 1 pairwise disjoint t-subsets I ′r ⊂ N , (1 ≤ r ≤ s, r 6= j), theconditional probability that the partition P = I1 ∪ · · · ∪ Is ∪ x is j-bad, given that Ir = I ′r for allr 6= j, is at most 1

20s .

Proof. Due to the condition above the sets Ir for r 6= j and the union Ij ∪x are known. Thereforewe have 2t possibilities to construct a partition P by choosing an element of the union as x. Weconsider a simple case distinction over the set

B =C ⊆ Ij ∪ x : C ∈ Xj , |C| = t

.

The case |B| < 12

(2tt

)2−ct/s

3

implies PrP[A0j ∈ Xj

]< 2−ct/s

3

for all possible partitions P . We proofthis implication by contraposition.

∃ Partition P : PrP[A0j ∈ Xj

]≥ 2−ct/s

3

=⇒ PrP[chosen t-subset ∈ Xj

]≥ 2−ct/s

3

=⇒ |B| /(

2t− 1

t

)≥ 2−ct/s

3

=⇒ |B| ≥ 2−ct/s3

(2t− 1

t

)=

1

2

(2t

t

)2−ct/s

3

.

This in turn implies that no partition P is j-bad by combining this implication and the denitionof a j-bad partition P . We conclude that in this case the claim of this Lemma holds since theconditional probability is zero.Now we consider the other case |B| ≥ 1

2

(2tt

)2−ct/s

3

and we denote by F the family of all subsetsof Ij ∪ x of cardinality t. Furthermore we set Ij ∪ x = x1, . . . , x2t and we denote by pi theprobability that

pi =|C ∈ F : xi ∈ C|

|F| .

Due to the binary entropy function and a standard inequality we can bound the size of F

|F| ≤ 2∑2ti=1H(pi). (5)

Now we have to complete the partition P by choosing a xi as x. We combine this with the questionwhich xi results in a j-bad partition P and we denote by b the number of responsible xi. Due tothe denition of a j-bad partition P we archive pi < (1− 1/(s+ 1))(1− pi) which leads to an upperbound of H(pi). The rst step is to bound the probability pi such that

pi < (1− 1

s+ 1) (1− pi) s>1⇐⇒ pi

1− pi<

2

3≤ s

s+ 1⇐⇒ pi <

2

5.

By choosing a positive constant c′ ≤ 2/25 it follows

H(pi) ≤ 2√pi(1− pi) < 0.98 ≤ 1− c′/s2, s > 1.

This fact is sucient to give an upper bound to the size of F where we distinguish between xi whichleads to a j-bad partition P or not

|F| ≤ 2∑2ti=1H(pi) = 2

∑not j-badH(pi)+

∑j-badH(pi) ≤ 2(2t−b)+b(1−c′/s2) = 22t−bc′/s2 .

Together with the lower bound of the size of B from above we obtain

Page 57: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Frequency Moments - Approximations and Space Complexity 49

1

2

(2t

t

)2−ct/s

3 ≤ |F| ≤ 22t−bc′/s2 . (6)

This again implies that (t

s3 log t =⇒ b ≤ c1ct

s

)(7)

where c1 ≥ 2/c′ is a constant and t/s3 log t means in this case there exists a suciently largeenough constant a > 1 such that t/s3 > a log t holds. We show that (6) implies (7) by contradiction.Therefore, we assume b > c1ct

s and it is sucient to proof

22t−bc′/s2+ct/s3+1 ≤(

2t

t

).

Starting from the left side of the inequality we obtain

22t−bc′/s2+ct/s3+1 ≤ 22t−ct/s3(c1c′−1)+1 ≤ 22t+1−ct/s3 ≤ 22t+1−ca log t =

22t+1

tca

t≥2

≤ t22t

tca=

22t

tca−1.

By choosing a suciently small constant c ≥ 5/(2a) we get our desired contradiction

22t

tca−1

t≥2

≤ 22t

t3/2≤ 22t

2t1/2=

22t−1

t1/2≤(

2t

t

).

The choice of c together with the other constants are enough to adapt the shown upper bound ofb such that b ≤ 2t

20s . Remember that b denotes the number of xi which leads to a j-bad partitionP and there exists only 2t possibilities for a partition P . By an average argument we obtain therequired result. ut

Lemma 3. If P = I1 ∪ · · · ∪ Is ∪ x is a good partition then

PrP[(A1

1, . . . , A1s

)∈ X1 × · · · ×Xs

]≥ 1

ePrP

[(A0

1, . . . , A0s

)∈ X1 × · · · ×Xs

]− s2−ct/s3 .

Proof. By negating the denition of a bad partition P we obtain a good partition such that

PrP [A1j ∈ Xj ] ≥

(1− 1

s+ 1

)PrP [A0

j ∈ Xj ]− s2−ct/s3

,∀j, 1 ≤ j ≤ s.

The desired result is obtained by using the denition of the distribution µ, an upper bound of theinverse natural exponential function and the following fact

(a− b)n ≥ an − nan−1b. (8)

Therefore we get

Page 58: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

50 Felix Biermeier

PrP [(A1

1 × · · · ×A1s

)∈ X1 × · · · ×Xs]

≥s∏j=1

((1− 1

s+ 1

)PrP

[A0j ∈ Xj

]− 2−ct/s

3

)

(8)

≥s∏j=1

(1− 1

s+ 1

)PrP

[A0j ∈ Xj

]−

s∑i=1

s−1∏i 6=j

((1− 1

s+ 1

)PrP

[A0j ∈ Xj

])︸ ︷︷ ︸

≤1

2−ct/s3

s∏j=1

(1− 1

s+ 1

)PrP

[A0j ∈ Xj

]−

s∑2−ct/s

3

=

(1− 1

s+ 1

)sPrP

[(A0

1 × · · · ×A0s

)∈ X1 × · · · ×Xs

]− s2−ct/s3

>1

ePrP

[(A0

1 × · · · ×A0s

)∈ X1 × · · · ×Xs

]− s2−ct/s3 .

ut

4 Remarks

In this section we give some additional notes what we have omitted in this work. Similar to thenecessity of randomization for the algorithms in section 2 we can say the same about the necessityof approximations for nearly all frequency moments. The proof is based on the reduction we haveseen in Proposition 2. An other point is the tightness of the lower bounds for certain frequencymoments.

References

1. Alon, N., Babai, L., Itai, A.: A fast and simple randomized parallel algorithm for the maximal independent setproblem. Journal of algorithms, 7(4):567583, 1986.

2. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. In Proceedingsof the twenty-eighth annual ACM symposium on Theory of computing, pages 2029. ACM, 1996.

3. Babai, L., Frankl, P., Simon, J.: Complexity classes in communication complexity theory. In Foundations ofComputer Science, 1986., 27th Annual Symposium on, pages 337347. IEEE, 1986.

4. Kalyanasundaram, B., Schintger, G.: The probabilistic communication complexity of set intersection. SIAMJournal on Discrete Mathematics, 5(4):545557, 1992.

5. Razborov., A.A.: On the distributional complexity of disjointness. Theoretical Computer Science, 106(2):385390, 1992.

6. Yao., A.C.: Lower bounds by probabilistic arguments. In Foundations of Computer Science, 1983., 24th AnnualSymposium on, pages 420428. IEEE, 1983.

7. Yao., A.C.: Some complexity questions related to distributive computing (preliminary report). In Proceedingsof the eleventh annual ACM symposium on Theory of computing, pages 209213. ACM, 1979.

8. Ioannidis, Y.E., Poosala, V.: Balancing histogram optimality and practicality for query result size estimation.In ACM SIGMOD Record, volume 24, pages 233244. ACM, 1995.

Page 59: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Frequency Moments - Approximations and Space Complexity 51

svmult [utf8x]inputenc [T1]fontenc makeidx graphicx multicol [bottom]footmisc amsmath, amssymb,amsfonts, amscd algpseudocode algorithm listings verbatim

Page 60: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF
Page 61: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Estimating Simple Functions on the Union of DataStreams

Till Knollmann

Abstract Calculating statistics on the trac of a network oers interesting insights on load balanc-ing and trac analysis. Thereby, observing streams at dierent devices in the network arises tightconditions. We have a set-up of many devices with low computation power and limited memory.Those devices want to save a part of the streams they read (a sample) and send their observationsto a central server. This central instance then may approximate functions. The properties of theobserving parties lead to the task of nding ecient algorithms which have to use workspace loga-rithmic in the input size. The described scenario is used in current network monitoring products.We give insights in a coordinated sampling approach which oers fast processing time with lowworkspace usage. We point out that a coordinated sampling is needed to suce tight workspacebounds with low error. The samples we create can be used to approximate the Union of two streamsas well as other functions. Numerous modications and extensions allow us to reuse our algorithmsin dierent models and for dierent functions as well.

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541.1 The Scenario and Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541.2 Applications of this scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551.4 Why coordinated Sampling? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

2 Formal Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562.1 Independence and Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562.2 Upper Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3 Coordinated 1-Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573.1 General Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573.2 Public Coins Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583.3 Private Coins Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4 Extensions and Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664.1 Other Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664.2 General Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664.3 Estimation F0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674.4 Set resemblance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5 Summary / Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

Till KnollmannUniversität Paderborn, Warburger Str. 100, 33098 Paderborn, e-mail: [email protected]

53

Page 62: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

54 Till Knollmann

1 Introduction

The following thesis will give a detailed overview about the main topic of the scientic paper "Es-timating Simple Functions on the Union of Data Streams" by Philip B. Gibbons and SrikantaTirhapura. The paper has been seen in at least two versions. This thesis will be oriented towardsthe version of 2001 [2]. All content references to [2]. If content origins from other sources, it is re-spectively marked.We will give a short introduction by dening the scenario and model, motivating the topic and theimportance, and introducing basic notation in this section. In section 2, we will shortly present someformulas needed for analysis. In the main part in section 3, we will present a general approach calledCoordinated 1-Sampling which aims to reduce workspace cost while estimating a function. Afterthat we will present two concrete techniques; the Public Coins Scheme (section 3.2) and the PrivateCoins Scheme (section 3.3). We will analyse and describe the workspace and time bounds for thealgorithms. Furthermore, we will see detailed proofs for both techniques which show error and errorprobability bounds we can provide. In section 4, we will see some extensions and modications whichallows us to use Coordinated 1-Sampling in multiple scenarios. We conclude this thesis in section 5by summarizing the most important parts.

1.1 The Scenario and Model

In the following section, we will present the general scenario for this thesis. Consider two parties Aliceand Bob. Each of them sees a unique Stream of n numbered bits. Alice receives A = a1, a2, ..., anand Bob receives B = b1, b2, ..., bn in order. After receiving their streams Alice and Bob transmitthe contents of their streams to a mutual Referee which then estimates a function F on A and B.This scenario has some restrictions:

• Alice and Bob each have a limited amount of workspace which is signicant smaller than n.• Alice and Bob are not allowed to communicate directly.• The Referee does not see A nor B. He has to estimate F only based on the data received from

Alice and Bob.

Our goal in this scenario is to minimize the workspace and the processing time used by Alice and Bobwhile observing their streams. We assume that the Streams arrive in order and that both streamscan only be observed exactly once.

Denition 1 (The Union).Let A = a1, a2, .., an and B = b1, b2, .., bn be two Streams consisting of bits. The Union U isdened as the number of 1's in the bitwise or of the two Streams:U(A,B) =

∑ni=1(ai ∨ bi)

We are interested in approximating the Union function as dened above. Within this thesis, we willuse U to refer to the value returned by the Union. We will present two concrete techniques whichallow us to approximate U within tight bounds. To be more specic, we are interested in showingthat both techniques are (ε, δ)-Approximation schemes which use only logarithmic workspace.

Denition 2 ((ε, δ)-Approximation Scheme).An (ε, δ)-Approximation scheme for a value X is an algorithm that computes an approximation Xsuch that PrX /∈ ((1− ε)X, (1 + ε)X) < δ for positive ε, δ < 1.

An (ε, δ)-Approximation scheme is very useful. It allows us to dene an error ε and the errorprobability δ at will. We will see in section 3 a direct relationship between our workspace usage, ε,and δ.

Page 63: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Estimating Simple Functions on the Union of Data Streams 55

1.2 Applications of this scenario

The presented scenario is motivated by network monitoring and the characterization of internettrac. Consider we aim to compute statistics on the trac of a network. Then, we have many smalldevices like routers which are receiving packet streams. These devices have only a small amount ofmemory and they periodically send information concerning the observed streams to a central server.The central server has the task to compute statistics on the streams. This set-up is used in currentnetwork monitoring products like Lucent's Interpret-Net or Ciscos's NetFlow. [2] We will presentsome extensions and mention other applications in section 4.It is often important to consider the Union when approximating functions. For example, one streamfrom a source to a destination may often be split and send among dierent paths in order to increasethe general throughput. If we monitor the network, we are only able to observe parts of the stream.In order to understand the logical stream, we have to recalculate the Union of the streams. Forexample, calculating the Union enables us to determine the number of distinct IP destinations.

1.3 Related Work

In this section we will have a short look on other interesting results concerning the calculation ofthe Union. At rst, we want to point out that approximating the Union for a relative error ε = 1

3

is known to be possible. Alice and Bob simply calculate a =∑i ai and b =

∑i bi. This approach

is possible as a deterministic solution using workspace log(n). If we take 43 maxa, b as estimate,

we have an (ε, δ)-Approximation scheme with workspace usage log(n) bits. This holds, because the

Union may be between maxa, b and 2maxa, b. This solution covers the case ε ≥ 13 . Therefore,

we will focus on the case ε 13 .

There exists another interesting way to calculate the Union with an (ε, δ)-Approximation scheme.We can modify an algorithm for the size of the symmetric dierence,

∑i |(ai − bi)|, which has been

presented in [3]. We can use the function 2 |∑i(ai∨bi)| = |∑i ai|+ |

∑i bi|+

∑i |(ai−bi)| to express

the Union based on the symmetric dierence. Compared to the schemes we will present later, theresulting scheme has worse time complexity. Furthermore, it is not known how to extend it to morethan two parties [2]. Time consumption is a crucial factor in practice. Hence, we want to nd afaster solution.

1.4 Why coordinated Sampling?

The most important insight is that the sampling of Alice and Bob needs to be coordinated. One can show that uncoordinated sampling leads to higher workspace consumption while producing worse results. In [2] it is shown that any estimator for U that is, with probability 2/3, within a relative error of ε < 1/3 requires a sample size of Ω(√n) if the sampling is not coordinated. The problem is that with samples smaller than √n the referee cannot distinguish between the following two situations with high probability:

• Each party has n/2 1-bits. There is no overlap in the positions of the 1-bits, and the Union is n.
• Each party has n/2 1-bits. There is a complete overlap in the positions of the 1-bits, and U = n/2.

If we only sample fewer than √n 1-bits, then with high probability there is no overlap between the two sample sets even in the second scenario, so the referee cannot distinguish the two scenarios. Consequently, the referee outputs an estimate ≥ 2n/3 at most half the time, or an estimate ≤ 2n/3 at most half the time; hence in scenario 1 or in scenario 2 we fail to achieve a relative error better than ε = 1/3 at least half the time.


2 Formal Preliminaries

In this section we present the formal preliminaries needed in the rest of this thesis. The presented algorithms are randomized; therefore, we need some notions of independence as well as standard probability bounds for the analysis.

2.1 Independence and Random Variables

One important notion we will need in the analysis of our algorithms is mutual independence.

Theorem 1 (Mutual Independence). A set S = {X_1, X_2, ..., X_n} of events is mutually independent if and only if for every subset S′ ⊆ S it holds: Pr[∩_{X∈S′} X] = ∏_{X∈S′} Pr[X].

In some cases the above requirement is too strong. We will need a weaker property which is still useful.

Theorem 2 (Pairwise Independence). A set S of events is pairwise independent if and only if for any two different events X ∈ S and Y ∈ S: Pr[X ∩ Y] = Pr[X] · Pr[Y].

To calculate the expectation of a random variable which is a sum of other random variables we will need the linearity of expectation.

Theorem 3 (Linearity of Expectation). Let X_1, ..., X_n be a finite set of discrete random variables. Then it holds for the expectation E: E[Σ_{i=1}^{n} X_i] = Σ_{i=1}^{n} E[X_i].

If a set of random variables is at least pairwise independent, we gain useful properties like the following:

Theorem 4 (Variance for Pairwise Independence). Let X_1, ..., X_n be a set of pairwise independent discrete random variables. Then it holds for the variance Var: Var[Σ_{i=1}^{n} X_i] = Σ_{i=1}^{n} Var[X_i].

2.2 Upper Bounds

In order to bound the error probabilities of our algorithms, we will need the following two upper bounds:

Theorem 5 (Chernoff Bounds). Let X_1, X_2, ..., X_n be a sequence of n independent Bernoulli experiments with Pr[X_i = 1] = p and Pr[X_i = 0] = 1 − p. Note that np is the expected number of successes.
For any δ > 0 it holds: Pr[Σ X_i ≥ (1 + δ)pn] ≤ exp(−δ²pn/3).
For any δ ∈ [0, 1] it holds: Pr[Σ X_i ≤ (1 − δ)pn] ≤ exp(−δ²pn/2).

Unfortunately, Chernoff bounds can only be applied when we have mutually independent random variables. Therefore, we will also need Chebyshev's inequality, which is more general.

Theorem 6 (Chebyshev's Inequality). Let X be a random variable with expected value µ and variance σ² > 0. For any k > 0 it holds: Pr[|X − µ| ≥ kσ] ≤ 1/k².


3 Coordinated 1-Sampling

In this section we explain a strategy called Coordinated 1-Sampling. We start by explaining the general approach and the idea behind the strategy. Afterwards, we present two concrete algorithms, one for the Public Coins Scheme and one for the Private Coins Scheme.
In the Public Coins Scheme, Alice and Bob have access to a shared random string of fully independent bits which does not count towards the workspace of a party. In contrast, in the Private Coins Scheme the shared string has to be stored by both parties.
We will explain the algorithms and analyse their time and space complexity. Thereby, we will see that the algorithms use workspace sub-linear in the length n of the input streams. Moreover, we will prove that both algorithms are (ε, δ)-Approximation schemes.

3.1 General Approach

In the following we explain the general approach of Coordinated 1-Sampling. We have already mentioned that the sampling in the presented scenario has to be coordinated in order to achieve low workspace bounds.
Generally speaking, we are interested in finding a subset S(p) ⊆ P of the set of all positions P = {1, ..., n}. S(p) defines which positions of the streams have to be sampled. If both Alice and Bob know this subset, they can collect their samples S_A and S_B. We use randomization in our approach: let p ≤ 1 be some probability. Then S(p) ⊆ P is created by selecting each position in P independently with probability p. To save workspace, each party only stores the 1-bits of its stream. That means Alice samples the set S_A = {i | (a_i = 1) ∧ (i ∈ S(p))}, and Bob creates the sample S_B = {i | (b_i = 1) ∧ (i ∈ S(p))}. By only storing the 1-bits of the streams, the expected number of stored positions is Up, which corresponds to O(Up · log(n)) bits; here U denotes the value of the Union. Note that storing all sampled bits would require an expected O(np · log(n)) bits of workspace, which is not sub-linear in n.
Let us assume we have a probability p and we created the sample sets S_A and S_B. We are then able to compute L = Σ_{i∈S(p)} (a_i ∨ b_i), which is simply the Union restricted to our samples. Recap what we just did: we looked at each position which is relevant for the Union (i.e., each i where a_i ∨ b_i = 1) and chose each such position to be sampled with probability p. Thereby, we performed U Bernoulli experiments with success probability p. Hence, the expected value of L is E[L] = Up. Assuming we created a somehow "good" S(p), we can approximate U by dividing L by p.
Next, we want to define a measure for a "good" result. We are interested in finding an (ε, δ)-Approximation; in other words, we want to achieve Pr[L/p ∉ (1 ± ε)U] < δ. By using Chernoff bounds we can see that:

Pr[L/p ∉ (1 ± ε)U] ≤ Pr[L ≤ (1 − ε)Up] + Pr[L ≥ (1 + ε)Up] ≤ 2 · exp(−Upε²/3) := (∗)

If we aim for Up ≈ log(1/δ)/ε² up to constants, we are assured to get a good estimate with high probability. For instance, setting Up = 3 · log(2/δ)/ε² bounds (∗) by δ.
We are left with the task of choosing p, which is difficult because we do not know U, and we restricted our model such that Alice and Bob cannot communicate and hence cannot agree on p beforehand. Therefore, each party has its own probability p_A (Alice) and p_B (Bob), initially set to 1. Whenever the workspace at a party overflows a certain bound, that party halves its probability and reduces the current sample by keeping each sampled position with probability 1/2. To ensure coordination, we guarantee for any two probabilities p and p′: if p < p′, then S(p) ⊆ S(p′). With this property, the referee can estimate U based on the lowest sampling probability; it does not matter whether one party stops with a higher or lower probability, since we can always create samples based on the lowest probability.
The interesting part in the following two schemes is how Alice and Bob are coordinated within this procedure.


As this approach is a coordinated way of sampling only 1-bits, it is called Coordinated 1-Sampling.

3.2 Public Coins Scheme

We will now have a closer look at Coordinated 1-Sampling in the Public Coins Scheme.
The main property of the Public Coins Scheme is that Alice and Bob have access to a shared random string of fully independent and unbiased bits. Let us call that string I. Importantly, I does not have to be stored by either party; therefore, I does not count towards the workspace of our algorithm. Note that the presence of I can be realised by letting Alice and Bob use the same pseudorandom number generator with a common starting seed of Θ(log(n)) bits.

3.2.1 Coordination

Our main problem for the coordination is again the prohibited communication between both parties. Therefore, we have to clarify which positions they should sample at a given probability. Recap that we always halve the sampling probability. That means we have to define the sets M_0 = S(1/2⁰), M_1 = S(1/2¹), M_2 = S(1/2²), ..., M_{log(n)} = S(1/2^{log(n)}). For simplicity, we say a set S(p) with p = 1/2^ℓ is at level ℓ. Since I can be accessed by both parties, we use the shared string I to define which position may be in which sets.
We now define a hash function h : P → {0, ..., log(n)}, where P = {1, 2, ..., n} is the set of all positions of the streams. We want position i to be in the sets M_0, M_1, ... with probabilities 1, 1/2, 1/4, .... To achieve such a distribution, consider the following idea: we toss a coin; as long as we get heads, we toss again until tails appears. Let D be the number of heads achieved. Then it follows that Pr[D ≥ 0] = 1, Pr[D ≥ 1] = 1/2, Pr[D ≥ 2] = 1/2², .... If we repeat the coin toss at most log(n) times, we get the desired distribution. We could now just assign M′_j = {i | i ∈ P and at least j coin tosses concerning i were heads}.
But how can we emulate coin tosses in our scenario? Remember that I is just a chain of fully independent bits. For each position i ∈ P, we look at x_i, the i-th sub-string of length log(n) of I. Now consider the number m_i of most significant bits of x_i which are all zero. It holds that m_i is distributed like D in the coin toss experiment for position i. Therefore we define h(i) = m_i = the number of leading 0-bits in the i-th substring of length log(n) of I. Similar to the experiment above, we assign M_j = {i | i ∈ P and h(i) ≥ j}. We finally achieve Pr[i ∈ M_j] = 1/2^j, which is exactly the desired distribution. Furthermore, the mapping is mutually independent, as I consists of mutually independent bits. This property will be very useful in the analysis.
As already mentioned, I can be accessed by both Alice and Bob. Therefore, it is guaranteed that they calculate the same hash value for each position; hence they can both reconstruct the sets M_0, M_1, ..., M_{log(n)}. In other words, their coordination problem is solved.
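To make the construction concrete, the following small Python sketch (our own illustration; the names LOG_N, I, h and in_level_set are ours, not from [2]) derives the level of a position from the shared string:

import random

LOG_N = 4                                             # stream length n = 16
I = [random.randrange(2) for _ in range(16 * LOG_N)]  # shared random bits

def h(i):
    # Level of position i (1-based): the number of leading 0-bits in the
    # i-th substring of length log(n) of the shared string I.
    x = I[(i - 1) * LOG_N : i * LOG_N]
    level = 0
    for bit in x:
        if bit == 1:
            break
        level += 1
    return level

def in_level_set(i, j):
    # Position i belongs to M_j = S(1/2^j) iff h(i) >= j, so
    # Pr[i in M_j] = 1/2^j and M_(j+1) is always a subset of M_j.
    return h(i) >= j

Since both parties evaluate h on the same string I, the nesting M_{log(n)} ⊆ ... ⊆ M_1 ⊆ M_0 comes for free.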

3.2.2 Algorithm

The algorithm follows the basic idea of the general approach. Our workspace bound here is cα, where c = 84 is a constant determined by the analysis and α = log(3/δ)/ε².
Each party starts with probability 1 = 1/2⁰, i.e., at level 0. Let us consider how Alice proceeds. Whenever the used workspace overflows cα, Alice increases her level. Let M_i be the targeted set of positions which are sampled by Alice, i.e., her level is i. By increasing the level, Alice basically halves her sampling probability, which equals changing the targeted position set to M_{i+1}. Thereby the overall sample size is halved in expectation, so at some probability the workspace will


suffice again. The algorithm works the same way for Bob.
After the streams have finished, the referee receives the samples S_A and S_B of Alice and Bob, as well as the sampling probabilities p_A and p_B. Now p′ = min{p_A, p_B} is determined, because both parties sampled with at least that probability. Recall that we guaranteed for two probabilities p_0 and p_1: if p_0 < p_1, then S(p_0) ⊆ S(p_1). Therefore, we create a useful sample by sub-sampling S_A and S_B with respect to the probability p′. Let S′_A and S′_B be the sub-sampled samples; note that at least one of S′_A = S_A or S′_B = S_B holds, as one party already sampled at probability p′. The referee now computes L, the Union based on the sub-sampled samples S′_A and S′_B, and L/p′ is returned as the estimate. Knowing that E[L] = U · p′, the estimate will hopefully be sufficiently close to U.
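The following end-to-end sketch (ours, under the conventions above; h is the level function from the previous listing, and the workspace bound is counted in stored positions) summarizes one party and the referee:

def run_party(stream_bits, capacity, h):
    # One pass over this party's stream; only 1-bit positions whose
    # level h(i) reaches the current level are kept.
    level, sample = 0, set()
    for i, bit in enumerate(stream_bits, start=1):
        if bit == 1 and h(i) >= level:
            sample.add(i)
            while len(sample) > capacity:
                level += 1                   # halve the sampling probability
                sample = {j for j in sample if h(j) >= level}
    return sample, level

def referee(sample_a, level_a, sample_b, level_b, h):
    # Sub-sample both parties down to the coarser level, then rescale.
    level = max(level_a, level_b)            # p' = min(p_A, p_B) = 1/2^level
    union = {i for i in sample_a | sample_b if h(i) >= level}
    return len(union) * 2 ** level

Instantiated with the streams of Fig. 1, the hash values of Fig. 2 and capacity 4, run_party yields S_A = {1, 7, 10} at level 2 and S_B = {1, 6, 7, 10} at level 1, and referee returns 12, exactly as in the worked example below.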

Example
In order to clarify how the algorithm proceeds, we will have a closer look at an example. In Fig. 1 we see the input streams of length n = 16 for Alice and Bob. The Union U(A, B) has already been calculated as 13.

position  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
A         1  1  1  0  1  0  1  1  1  1  0  1  0  0  1  0
B         1  0  0  0  1  1  1  1  0  1  0  0  1  1  0  0
U         1  1  1  0  1  1  1  1  1  1  0  1  1  1  1  0

U(A, B) = 13

Fig. 1 Example of input streams for Alice and Bob

Let us consider the string in Fig. 2 as our shared random string of independent bits. There we can see how the hash function h(i) maps positions to a certain level by using the shared string.

String: 0011 0111 0101 1101 1100 0110 0001 1101
i          1    2    3    4    5    6    7    8
h(i)       2    1    1    0    0    1    3    0

String: 1100 0010 0000 0100 1011 1010 0100 1110
i          9   10   11   12   13   14   15   16
h(i)       0    2    4    1    0    0    1    0

Level  Set of positions                   Size  Sampling probability
0      all                                16    1/2⁰
1      {1, 2, 3, 6, 7, 10, 11, 12, 15}    9     1/2¹
2      {1, 7, 10, 11}                     4     1/2²
3      {7, 11}                            2     1/2³
4      {11}                               1     1/2⁴

Fig. 2 Example of a shared random string and the resulting hash function

In the table of Fig. 2, the created sets S(1), S(1/2), ... are listed. Note how the size of each set is roughly halved from level to level.
The most interesting part can be seen in Fig. 3, where the processing of the streams by Alice and Bob is presented. The workspace at both parties is limited to 4 sampled positions. Alice needs to halve her probability twice, while Bob only needs one level change. Eventually we end up with the samples S_A = {1, 7, 10} and S_B = {1, 6, 7, 10}, and the probabilities p_A = 1/2² and p_B = 1/2¹. This information is sent to the referee.


Alice:
Received positions        Action        Sample after action  Size  Level
1, 2, 3, 4, 5, 6          Sample        {1, 2, 3, 5}         4     0
7                         Change level  {1, 2, 3, 7}         4     1
8, 9                      Sample        {1, 2, 3, 7}         4     1
10                        Change level  {1, 7, 10}           3     2
11, 12, 13, 14, 15, 16    Sample        {1, 7, 10}           3     2

→ Send sample S_A = {1, 7, 10} and probability p_A = 1/2²

Bob:
Received positions              Action        Sample after action  Size  Level
1, 2, 3, 4, 5, 6, 7             Sample        {1, 5, 6, 7}         4     0
8                               Change level  {1, 6, 7}            3     1
9, 10, 11, 12, 13, 14, 15, 16   Sample        {1, 6, 7, 10}        4     1

→ Send sample S_B = {1, 6, 7, 10} and probability p_B = 1/2¹

Fig. 3 The sampling process of Alice and Bob

How the Referee approximates:

Let Robert be the referee. First of all, Robert calculates p′ = min{p_A, p_B} = 1/2². He now has to sub-sample S_A and S_B according to p′. Therefore, he creates S′_A = S_A and S′_B = S_B ∩ S(p′) = {1, 7, 10}. Robert then outputs the Union based on S′_A and S′_B divided by p′: he returns (1/p′) · 3 = 4 · 3 = 12 as our estimate. Remembering that the actual Union U(A, B) was 13, our estimate is quite good, while using only a workspace of 4 stored positions to process streams of length 16.

3.2.3 Complexity

Let us take a closer look at the time and space complexity. As already stated, we want to minimize time and space consumption while keeping a good estimate.

Time Complexity
We only analyse the time complexity for Alice, as it is the same for Bob. Upon receiving a 1-bit at position i, Alice has to do the following:

• Compute h(i) and determine whether i should be stored.
• If we run out of workspace, increase the level and discard all old items.

To efficiently discard all items which are no longer needed, we can use the following data structure: we store our sample as an array of linked lists S[1, ..., log(n)], where S[k] contains a list of all items which will be discarded in level k.
Inserting into this data structure costs only constant time. If we need to discard old items, it suffices to delete all elements in the list at the current level. Note that every element is inserted exactly once into the data structure. This leads to amortized constant time per bit over the whole stream of O(n) bits.
We can reduce the time bound to expected constant time using the following approach: once our algorithm reaches the workspace bound, we discard old items one by one whenever space is needed for a new item. This technique, called lazy discarding, has the disadvantage that we may need to search several levels for the next item to be discarded.
But we can analyse the following: let W_i be a random variable denoting the number of level changes an item i survives without being discarded. Then it follows that Pr[W_i = j] = 1/2^j, as i is in level j with this probability. This gives us:

E[W_i] = Σ_{k=1}^{log(n)} Pr[W_i = k] · k = Σ_{k=1}^{log(n)} k · (1/2)^k = (log(n) · (1/2)^{log(n)+2} − (log(n)+1) · (1/2)^{log(n)+1} + (1/2)) / ((1/2) − 1)²


The last equality holds because for q ≠ 1 we have Σ_{k=0}^{n} a_0 · k · q^k = a_0 · (n · q^{n+2} − (n+1) · q^{n+1} + q) / (q − 1)². Using log(n) + 2 = log(4n) and log(n) + 1 = log(2n), we obtain:

E[W_i] = (log(n)/2^{log(4n)} − (log(n)+1)/2^{log(2n)} + 1/2) / (1/4)
       = 4 · log(n)/(4n) − (4 · log(n) + 4)/(2n) + 2
       = (log(n) − 2 · log(n) − 2)/n + 2
       = −(log(n) + 2)/n + 2

We have lim_{n→∞} (log(n) + 2)/n = lim_{n→∞} 1/(n · ln(2)) = 0 using L'Hôpital's rule. Therefore E[W_i] approaches the constant 2 for large n.
We observe that the number of level changes an item survives is constant in expectation. The time needed to delete an item from the data structure using lazy discarding is constant as well. Therefore, the algorithm uses constant expected time to discard an item; note that this bound also holds for worst-case input streams. All in all, we have constant expected time per received bit, assuming we are able to calculate h(i) in constant time.
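A minimal sketch of this bookkeeping (our own illustration of the data structure and of lazy discarding; class and method names are ours):

class LeveledSample:
    # Items with hash value h(i) = k live in buckets[k]: they survive
    # exactly the level changes up to level k and are dead afterwards.
    def __init__(self, log_n, capacity):
        self.buckets = [[] for _ in range(log_n + 1)]
        self.level = 0
        self.size = 0
        self.capacity = capacity

    def insert(self, i, h_i):
        if h_i < self.level:
            return                    # not sampled at the current level
        if self.size == self.capacity:
            self._evict_one()         # lazy discarding: free one slot
        if h_i >= self.level:         # eviction may have raised the level
            self.buckets[h_i].append(i)
            self.size += 1

    def _evict_one(self):
        # Look for a dead item, i.e. one stored strictly below the current
        # level; if none exists, raise the level (halving the sampling
        # probability) until a bucket dies.
        while True:
            for k in range(self.level):
                if self.buckets[k]:
                    self.buckets[k].pop()
                    self.size -= 1
                    return
            self.level += 1

Every item is inserted once and removed at most once, which is where the amortized and expected constant time comes from.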

Space Complexity

Each party has to store up to cα items, for c = 84 and α = log(3/δ)/ε². Each item consists of a pair (i, h(i)) with h(i) ≤ log(n) and i ≤ n. Therefore, we can store a pair using Θ(log(n)) bits. Hence, our workspace is in O(log(1/δ) · log(n)/ε²) bits.
It is interesting to see here how the workspace is in a direct relationship to the relative error ε and the error probability δ. If we aim to reduce either ε or δ, we have to increase the workspace, while still staying sub-linear in n.

3.2.4 Properties of an (ε, δ)-Approximation

We are now interested in showing the following Theorem:

Theorem 7. The probability that the algorithm produces an estimate not within a relative ε error of U is less than δ.

The probability of failure can be written as follows, where L = Σ_{i∈S(p)} (a_i ∨ b_i):

Pr[L/p ∉ ((1 − ε)U, (1 + ε)U)] ≤ Pr[L ≤ (1 − ε)Up] + Pr[L ≥ (1 + ε)Up] := (∗)

At this point it is essential to realise that the process of choosing a position to be in S(p), and therefore to be used in the calculation of L, is mutually independent over the positions. This is correct because the mapping of the hash function is mutually independent. Therefore we can use Chernoff bounds here to upper-bound the two probabilities. Using both Chernoff bounds in (∗) we get:

Pr[L/p ∉ ((1 − ε)U, (1 + ε)U)] ≤ exp(−Upε²/2) + exp(−Upε²/3) < 2 · exp(−Upε²/3) := (∗₂)

Recap that Alice and Bob each target a workspace of about cα stored positions while only saving 1-bits. Depending on the streams of Alice and Bob, there are two extreme cases for Up:

• Alice and Bob save exactly the same positions: Up = cα.
• Alice and Bob save disjoint sets of positions: Up = 2cα.

Combining those two gives cα ≤ Up ≤ 2cα. As we want to find an upper bound for (∗₂), we are interested in minimizing Up. Using that Up is at least cα, with c = 84 and α = log(3/δ)/ε², we get:

Pr[L/p ∉ ((1 − ε)U, (1 + ε)U)] < 2 · exp(−cαε²/3) = 2 · exp(−(84 · log(3/δ)/ε²) · (ε²/3)) = 2 · exp(−84 · ln(3/δ)/(3 · ln(2)))
< 2 · exp(−ln(3/δ))


The last inequality holds because 84/(3 · ln(2)) > 1. As e^{ln(x)} = x, we have:

Pr[L/p ∉ ((1 − ε)U, (1 + ε)U)] < 2 · (3/δ)^{−1} = 2δ/3 < δ

Thus, the theorem holds and our algorithm is indeed an (ε, δ)-Approximation. □

3.3 Private Coins Scheme

Since it is often unrealistic that Alice and Bob have a shared random string which they do not have to store, we also explain the algorithm for the so-called Private Coins Scheme.
In the Private Coins Scheme, Alice and Bob have access to the same random string S of unbiased and independent bits. In contrast to the Public Coins Scheme, they now have to store S; therefore, S counts towards the workspace used by our algorithm.
As a result, storing S in full is too expensive. Hence, we need a new hash function which does not utilise S. We will see that the new hash function no longer creates a mutually independent mapping; therefore, we incur more error compared to the Public Coins Scheme. To reduce the error probability down to δ again, we use multiple instances of the algorithm. For more details see sections 3.3.2 and 3.3.4.

3.3.1 A new Hash function

In order to achieve the same distribution as discussed in section 3.2.1, we are again interested in generating a string of length log(n) for each position. We now use a method which has been intensively analysed in [1].
Consider the finite field G = GF(2^d) with d = log(n). We can imagine the members of this field as the bit representations of length d of the positions 1, ..., n; each number 1, ..., n is mapped to exactly one member. This offers us the possibility to calculate within the number space [1, n], while we are assured to always obtain an element which can be interpreted as a bitstring of length log(n). In [1] it is proven that, if we choose two members q, r ∈ G independently at random, we can create pairwise independent bitstrings for the positions by the following procedure: for a position i, we calculate x_i = q · i + r, where all operations are performed in the field G.
We define our new hash function h′(i) as the number of leading 0's of x_i. This is the same trick we already used in section 3.2.1. Note that the mapping is pairwise and not mutually independent, so we incur more error compared to the function presented in section 3.2.1. Moreover, each instance j of our algorithm has to choose its variables q_j and r_j independently from the other instances; otherwise we would not be able to prove our probability bounds.
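A small sketch of this construction (ours; the choice d = 4, the irreducible polynomial and the function names are our assumptions for illustration):

D = 4                # field GF(2^4); positions are treated as 4-bit strings
POLY = 0b10011       # irreducible polynomial x^4 + x + 1

def gf_mult(a, b):
    # Carry-less multiplication in GF(2^d), reduced modulo POLY.
    result = 0
    for _ in range(D):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << D):
            a ^= POLY
    return result

def make_hash(q, r):
    # x_i = q * i + r in GF(2^d); h'(i) is the number of leading 0-bits.
    def h(i):
        x = gf_mult(q, i) ^ r        # addition in GF(2^d) is XOR
        level = 0
        for bit in range(D - 1, -1, -1):
            if (x >> bit) & 1:
                break
            level += 1
        return level
    return h

Only q and r, i.e. Θ(log n) bits per instance, have to be stored, which is what replaces the full shared string.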

3.3.2 Algorithm

The algorithm stays basically the same, but we have a new workspace bound. This threshold is again cα, but now with c = 36 and α = 1/ε². Note that we no longer have the factor log(1/δ) in α. That is because we now run 48 · log(1/δ) many instances and still want to guarantee the same total workspace bounds as in section 3.2.3. We need to run this many instances to reduce the error probability down to δ again, as shown in section 3.3.4.
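Sketched on top of the construction above (again our own illustration; run_instance stands for one execution of the sampling algorithm with the given hash function), the complete private-coins estimator runs the instances and reports the median:

import math
import random
import statistics

def private_coins_estimate(run_instance, delta):
    # Runs s = 48 * log(1/delta) independent instances, each with fresh,
    # independently drawn field elements q and r, and takes the median,
    # which pushes the per-instance error probability of 1/3 down to delta.
    s = math.ceil(48 * math.log2(1 / delta))
    estimates = []
    for _ in range(s):
        q = random.randrange(1, 2 ** D)   # q != 0 keeps the map non-constant
        r = random.randrange(2 ** D)
        estimates.append(run_instance(make_hash(q, r)))
    return statistics.median(estimates)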


3.3.3 Complexity

We proceed with an analysis of the space and time complexity of our algorithm for the Private Coins Scheme.

Time Complexity:
We get the same time complexity bounds for one instance as in section 3.2.3: constant expected processing time per bit, dominated by the calculation of h′(i) for a position i. Assuming we can multiply and add in constant time within a field of Θ(log(n))-bit elements, one instance needs constant expected time per bit. Remember that we run 48 · log(1/δ) many instances; therefore, we need an expected time of O(log(1/δ)) per bit. The time to process each item is dominated by performing O(log(1/δ)) multiplications in the field, one per instance.

Space Complexity:
As in section 3.2.3, each instance stores up to cα items of Θ(log(n)) bits each at one party. With c = 36 and α = 1/ε², this means we use O(log(n)/ε²) bits per instance. We run 48 · log(1/δ) many instances, resulting in a total workspace of O(log(1/δ) · log(n)/ε²) bits.

3.3.4 Properties of an (ε, δ)-Approximation

Let us now have a look at the properties of our algorithm. We still want to prove that it is an (ε, δ)-Approximation. The problem is that the new hash function h′ only generates a pairwise and not a mutually independent mapping of positions to levels. Hence, the process of choosing a position can no longer be modelled by mutually independent random variables as in section 3.2.4, and we need a more involved proof. We will show the following theorem:

Theorem 8. The probability that one instance produces an estimate not within a relative ε error of U is less than 1/3.

We use the median of s = 48 · log(1/δ) many instances to reduce the error down to δ again.

Theorem 9. The probability that the median of s = 48 · log(1/δ) many instances fails to be within a relative ε error of U is less than δ.

We proceed by proving step by step each theorem we need for the total correctness, starting from Theorem 9.

Proof of Theorem 9:
For this proof, assume that Theorem 8 holds (we will prove it later). At least s/2 instances have to fail to falsify the median. Let Y_i be a random variable with Y_i = 1 if and only if instance i fails. Then we know Pr[Y_i = 1] < 1/3 and hence E[Σ_{i∈{1,...,s}} Y_i] < s/3, as the instances operate independently. We are interested in the case Σ_{i∈{1,...,s}} Y_i ≥ s/2. To give an upper bound we want to use Chernoff bounds, which is permitted since the Y_i are mutually independent. In order to apply the bound, we bring Pr[Σ_{i∈{1,...,s}} Y_i ≥ s/2] into the form Pr[Σ_{i∈{1,...,s}} Y_i ≥ (1 + t) · s/3] by choosing t = 1/2. Hence:

Pr[the median fails] ≤ Pr[Σ_{i∈{1,...,s}} Y_i ≥ (1 + t) · s/3] ≤ exp(−(1/2)² · (s/3)/3) = exp(−48 · ln(1/δ)/(36 · ln(2))) < exp(−ln(1/δ)) = (1/δ)^{−1} = δ

Therefore, our approximation is indeed an (ε, δ)-Approximation, provided Theorem 8 holds, which we assumed at the beginning of this proof and prove next.


Proof of Theorem 8:
To prove this theorem, we have to analyse what the estimate may be for every possible level in {0, ..., log(n)}. To keep things short, we define:

Definition 3. Let i be a level. Then the Union based on samples from the set S(1/2^i) is Z_i := Σ_{j∈S(1/2^i)} (a_j ∨ b_j). We call the estimate of level i bad if 2^i · Z_i ∉ ((1 − ε)U, (1 + ε)U).

Let B_i be the event that level i produces a bad estimate, i.e., B_i = 1 if and only if level i produces a bad estimate. We also need to track in which level our algorithm stops. Therefore, let S_i = 1 for a level i iff the algorithm stops in level i; here the stopping level is the level reached at the referee. At this point, we can express when our algorithm fails. The algorithm fails in two cases:

• It stops in level log(n).
• It stops in a level i < log(n) and produces a bad estimate (i.e., B_i = 1).

Using this and our random variables we have:

Pr[algorithm fails] ≤ Σ_{i=0}^{log(n)−1} Pr[S_i ∧ B_i] + Pr[S_{log(n)}] := (∗)

For a deeper analysis, we still need some more notation. First, we only consider the positions which may be relevant for the Union, because the other ones are never stored. Let 1_U = {i | i ∈ P and (a_i ∨ b_i) = 1} be the set of all relevant positions. Furthermore, for a level ℓ and a position i ∈ P, let X_{ℓi} be a random variable with X_{ℓi} = 1 iff position i is in the set of level ℓ. Finally, let X_ℓ = Σ_{i∈1_U} X_{ℓi} be the number of relevant positions at level ℓ.
We break the proof into two smaller lemmas. To do so, we fix ℓ to be the first level such that E[X_ℓ] < Cα, for a constant C = 24. This allows us to rewrite the term (∗):

Pr[failure] ≤ Σ_{i=0}^{log(n)−1} Pr[S_i ∧ B_i] + Pr[S_{log(n)}]
= Σ_{i=0}^{ℓ} Pr[S_i ∧ B_i] + Σ_{i=ℓ+1}^{log(n)−1} Pr[S_i ∧ B_i] + Pr[S_{log(n)}]
< Σ_{i=0}^{ℓ} Pr[B_i] + Σ_{i=ℓ+1}^{log(n)−1} Pr[S_i] + Pr[S_{log(n)}]
= Σ_{i=0}^{ℓ} Pr[B_i] + Σ_{i=ℓ+1}^{log(n)} Pr[S_i]

We will show that it is very unlikely to produce a bad estimate in the levels up to ℓ (i.e., Σ_{i=0}^{ℓ} Pr[B_i] is small). Furthermore, we will show that it is very unlikely to stop in a level higher than ℓ (i.e., Σ_{i=ℓ+1}^{log(n)} Pr[S_i] is small). More formally, we will prove the following lemmas:

Lemma 1. Σ_{i=0}^{ℓ} Pr[B_i] < 1/6

Lemma 2. Σ_{i=ℓ+1}^{log(n)} Pr[S_i] < 1/6

If those two lemmas are correct, we have proven Theorem 8, because:

Pr[algorithm fails] < Σ_{i=0}^{ℓ} Pr[B_i] + Σ_{i=ℓ+1}^{log(n)} Pr[S_i] < 1/6 + 1/6 = 1/3

Proof of the two lemmas:
In order to prove these lemmas we use Chebyshev's inequality. Therefore, we need the expected number E[X_k] of items at level k and the variance Var[X_k]. To calculate these values we use that the X_{ℓi} are pairwise independent. For example, for distinct i, j we have Pr[(X_{ℓi} = 1) ∧ (X_{ℓj} = 1)] = Pr[(h′(i) ≥ ℓ) ∧ (h′(j) ≥ ℓ)] = Pr[h′(i) ≥ ℓ] · Pr[h′(j) ≥ ℓ] = Pr[X_{ℓi} = 1] · Pr[X_{ℓj} = 1], because the mapping of h′ is pairwise independent. The other cases follow similarly, so the X_{ℓi} are pairwise independent.


Lemma 3. For k < log(n) it holds that E[X_k] = U/2^k and Var[X_k] = (U/2^k) · (1 − 1/2^k).

Proof of Lemma 3:
Let us first look at the expectation. It holds that E[X_k] = E[Σ_{i∈1_U} X_{ki}] = Σ_{i∈1_U} E[X_{ki}] due to the linearity of expectation. Now remember that E[X_{ki}] is the probability that position i made it to level k. By our hash function this probability is 1/2^k, independent of i. Hence, it follows:

E[X_k] = |1_U| · 1/2^k = U/2^k

For the variance of X_k we have Var[X_k] = Var[Σ_{i∈1_U} X_{ki}] = Σ_{i∈1_U} Var[X_{ki}] := (∗₂), because the X_{ki} are pairwise independent. The variance of X_{ki} is defined as Var[X_{ki}] = Σ_{x∈{0,1}} (x − E[X_{ki}])² · Pr[X_{ki} = x]. Thus, the variance is independent of i, as the expectation is independent of i. We now calculate:

Var[X_{ki}] = (1/2^k) · (1 − 1/2^k)² + (1 − 1/2^k) · (0 − 1/2^k)² = (1/2^k) · (1 − 1/2^k) · ((1 − 1/2^k) + 1/2^k) = (1/2^k) · (1 − 1/2^k)

From the points mentioned above, it follows in (∗₂):

Var[X_k] = |1_U| · Var[X_{ki}] = (U/2^k) · (1 − 1/2^k)  □

Proof of Lemma 1:
For one level k we know that Pr[B_k] = Pr[|X_k − E[X_k]| ≥ ε · E[X_k]]. This holds because B_k is the event that the estimate of level k is bad, i.e., that X_k is farther away from its expectation U/2^k than ε · U/2^k. In order to use Chebyshev's inequality, we need the form Pr[|X_k − E[X_k]| ≥ t · σ_k], where σ_k = √(Var[X_k]) is the standard deviation of X_k. To achieve this, we choose t = ε · E[X_k]/σ_k, and it follows:

Pr[B_k] = Pr[|X_k − E[X_k]| ≥ ε · E[X_k]] = Pr[|X_k − E[X_k]| ≥ t · σ_k] ≤ Var[X_k]/(ε² · E[X_k]²) = (2^k/(U · ε²)) · (1 − 1/2^k) < 2^k/(U · ε²)

Using this we conclude:

Σ_{i=0}^{ℓ} Pr[B_i] < Σ_{i=0}^{ℓ} 2^i/(U · ε²) = (1/(U · ε²)) · Σ_{i=0}^{ℓ} 2^i = (1/(U · ε²)) · (1 − 2^{ℓ+1})/(1 − 2) = (2^{ℓ+1} − 1)/(U · ε²) < 2^{ℓ+1}/(U · ε²)

Here we use that for any q ≠ 1 it holds that a_0 · Σ_{k=0}^{n} q^k = a_0 · (1 − q^{n+1})/(1 − q).
Remember that we chose ℓ to be the first level with E[X_ℓ] < Cα. It follows that E[X_{ℓ−1}] = U/2^{ℓ−1} ≥ Cα = C/ε². Hence, U ≥ C · 2^{ℓ−1}/ε², and we get:

Σ_{i=0}^{ℓ} Pr[B_i] < 2^{ℓ+1}/(U · ε²) ≤ 2^{ℓ+1} · ε²/(C · 2^{ℓ−1} · ε²) = 2²/C = 4/C

As we have chosen C = 24, we have Σ_{i=0}^{ℓ} Pr[B_i] < 4/24 = 1/6 and the lemma holds. □

Proof of Lemma 2:
We want to show that it is unlikely to stop in a level higher than ℓ. We know that our algorithm stops only once, and that it only proceeds to a level higher than ℓ if level ℓ still holds more than cα items. Therefore, we have:

Σ_{i=ℓ+1}^{log(n)} Pr[S_i] = Pr[X_ℓ ≥ cα] = Pr[X_ℓ − E[X_ℓ] ≥ cα − E[X_ℓ]] ≤ Pr[X_ℓ − E[X_ℓ] ≥ cα − Cα]

The last inequality holds because we have chosen ℓ such that E[X_ℓ] < Cα. Using Chebyshev's inequality again, we get:

Σ_{i=ℓ+1}^{log(n)} Pr[S_i] ≤ Pr[X_ℓ − E[X_ℓ] ≥ cα − Cα] ≤ Var[X_ℓ]/((c − C)² · α²)

Notice that Var[X_ℓ] = (U/2^ℓ) · (1 − 1/2^ℓ) ≤ U/2^ℓ = E[X_ℓ] < Cα holds. Therefore:

Σ_{i=ℓ+1}^{log(n)} Pr[S_i] ≤ Var[X_ℓ]/((c − C)² · α²) < Cα/((c − C)² · α²) = C · ε²/(c − C)²

We have already chosen c = 36 and C = 24. Furthermore, ε is smaller than 1. Therefore:

Σ_{i=ℓ+1}^{log(n)} Pr[S_i] < C · ε²/(c − C)² = ε²/6 < 1/6  □

4 Extensions and Applications

Coordinated 1-Sampling can be used for problems other than estimating the Union as well. In the following section, we give a brief overview of possible extensions and applications of the approach in general.

4.1 Other Scenarios

The presented algorithms can be extended to suit more general scenarios:

• The same algorithms can be used for more than two parties. We only have to ensure that each party has access to the hash function.
• The same algorithms can be used if bits arrive in a mixed order, as long as the positions of the bits can be observed. We can then imagine a model where pairs (i, a_i) and (j, b_j) are sent in an order controlled by an adversary.
• As the algorithms only observe the 1-bits, we can still use them if 0-bits are not sent at all.
• We can imagine situations in which some positions arrive multiple times. As long as both parties can see the position number, redundancy is ignored.
• We can imagine a scenario in which one stream is shorter. If we just assume the missing bits of the shorter stream are all zero, we can still approximate the Union.

For all scenarios above, we are able to guarantee the same workspace and time complexity bounds. Furthermore, we can guarantee the (ε, δ)-Approximation scheme property.

4.2 General Values

We can extend the presented algorithms to approximate g(A, B) = Σ_{i=1}^{n} max{a_i, b_i} when the streams A and B consist of n integers in the range {0, ..., M − 1}. We even obtain the same asymptotic space bounds as in the approximation of U if M is polynomial in n.
We can transform the input so that it fits our algorithm as follows. Consider one value v ∈ {0, ..., M − 1}. We transform v with a function Y(v, M) into the unary representation of v, filled up with zeros such that Y(v, M) has length M. For example, Y(4, 8) = 11110000 and Y(6, 8) = 11111100. Then we use that U(Y(x, M), Y(y, M)) = max{x, y}; for example, U(Y(4, 8), Y(6, 8)) = U(11110000, 11111100) = 6 = max{4, 6}. In general, we create A′ = Y(a_1, M), Y(a_2, M), ..., Y(a_n, M) from the stream A = a_1, ..., a_n, do the same for B, and use our algorithm to approximate U(A′, B′), which yields an approximation of g.
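A two-line sketch (ours) makes the reduction explicit:

def unary(v, M):
    # Y(v, M): v ones followed by M - v zeros.
    return [1] * v + [0] * (M - v)

def union_size(xs, ys):
    # Number of positions where at least one stream has a 1-bit.
    return sum(x | y for x, y in zip(xs, ys))

assert union_size(unary(4, 8), unary(6, 8)) == max(4, 6)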


4.3 Estimating F0

We can modify the presented approach to estimate the zeroth frequency moment F_0, the number of distinct values in a sequence. Estimating F_0 is interesting, for example, for database optimization and internet traffic analysis.
In order to estimate F_0, we have to ensure that each value is stored only once in the data structure at a party. In addition, we use a hash table of size Θ(α) to store the at most cα values, which allows constant expected lookup time; the number of occurrences can also be tracked in this hash table. If the values come from the range {1, ..., m}, we achieve a total workspace bound of O(log(1/δ) · log(m)/ε²) bits, as each value can be stored using log(m) bits.
This result is very interesting: estimating the Union is just the special case of estimating F_0 in which all items within one stream are distinct. Therefore, the lower bounds for uncoordinated sampling presented in section 1.4 also apply to F_0. Note that we can estimate related functions as well, because we create a sample which does not depend on the estimated function.

4.4 Set resemblance

The resemblance r(A, B) of two sets A and B is defined as the size of their intersection divided by the size of their union. Using |A ∩ B| = |A| + |B| − |A ∪ B| we get:

r(A, B) = |A ∩ B|/|A ∪ B| = (|A| + |B| − |A ∪ B|)/|A ∪ B| = (|A| + |B|)/|A ∪ B| − 1

Therefore, we can estimate the set resemblance by utilising our algorithms for the Union. Note that the error guarantees are in terms of an absolute and not a relative error in this case; it is mentioned in [2] that this is unavoidable due to known lower bounds. Furthermore, [2] points out that this method provides the same space bounds and approximation guarantees as, and is faster than, the best previously known approaches.
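In code (ours; union_estimate stands for the output of any of the (ε, δ)-schemes for the Union from section 3), the reduction is a one-liner:

def resemblance(size_a, size_b, union_estimate):
    # |A| and |B| are known exactly; only |A ∪ B| is estimated, which is
    # why the resulting error guarantee is additive rather than relative.
    return (size_a + size_b) / union_estimate - 1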

5 Summary / Conclusion

At the end of this thesis, we summarize the content and conclude. We have seen that the sampling of Alice and Bob needs to be coordinated in order to achieve good results with logarithmic workspace, and that all previously known approaches have some crucial weakness, like a high error rate or a high processing time.
With the Coordinated 1-Sampling approach presented in section 3, we are able to overcome these weaknesses. We presented two techniques based on the general approach. In the Public Coins Scheme we obtained an (ε, δ)-Approximation scheme for the Union function, assuming Alice and Bob have access to a shared random string of fully independent bits; we illustrated the algorithm with an example in section 3.2.2. In the Private Coins Scheme, the random string would have to be stored by both parties. As storing it is too expensive, we defined a new hash function, which only offers a pairwise independent mapping; therefore, we had to run multiple instances of our algorithm to reduce the error probability down to δ again. We have shown that the Private Coins Scheme is an (ε, δ)-Approximation scheme as well. Moreover, we analysed both algorithms in terms of space and time complexity: both need only constant expected time per instance per bit, and a total workspace of O(log(1/δ) · log(n)/ε²) bits.
In section 4 we presented possible modifications and extensions of our algorithms. We have seen that we can estimate other functions, like F_0, as well. The presented algorithms


can be extended to more than two parties and to more general scenarios.
All in all, we have seen an efficient way of collecting samples which can then be used to approximate multiple functions.

References

1. Zimand, M.: On generating independent random strings. In: Mathematical Theory and Computational Practice, pp. 499-508. Springer (2009)
2. Gibbons, P. B., Tirthapura, S.: Estimating simple functions on the union of data streams. In: Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures. ACM (2001)
3. Feigenbaum, J., et al.: An approximate L1-difference algorithm for massive data streams. SIAM Journal on Computing 32(1), 131-151 (2002)


On Graph Problems in a Semi-streaming Model

Arne Kemper
Universität Paderborn, Warburger Str. 100, 33098 Paderborn, e-mail: [email protected]

Abstract Massive graphs arise naturally in a lot of applications, especially in communication networks like the internet. The size of these graphs makes it very hard or even impossible to store the set of edges in main memory. Thus, random access to the edges cannot be realized, which makes most offline algorithms unusable. This essay investigates efficient algorithms that read the edges only in a fixed sequential order. Since even basic graph problems often need at least linear space in the number of vertices to be solved, the storage space bound is relaxed compared to the classic streaming model to O(n · polylog n). The essay describes algorithms for approximations of the unweighted and weighted matching problem and gives an Ω(log^{1−ε} n) lower bound for approximations of the diameter. Finally, some results for further graph problems are discussed.

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3 Graph Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
   3.1 Unweighted Bipartite Matching . . . . . . . . . . . . . . . . . 71
   3.2 Weighted Matching . . . . . . . . . . . . . . . . . . . . . . . . 76
   3.3 Lower Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4 Distances and further problems . . . . . . . . . . . . . . . . . . . . . 78
5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

1 Introduction

Streaming [2, 4, 5] is a well-studied method for handling massive data sets. There are two interpretations of how the stream is actually created in an application. The first one usually assumes a client-server model, in which data items are sent one by one from the client to the server, where they are processed. The other one is used for local computations on massive instances of a problem, where it is very expensive or even impossible to realize random access to the items, usually because the data is much larger than the local memory. Here the access problems are reduced by allowing only sequential access to the items in a fixed order. This practice makes streaming interesting even for local computations. Obviously, streaming algorithms work in both interpretations, but the latter one allows us a new approach to solve problems on data sets that are too big for normal computations.
One set of problems that is worth exploring in this model is graph problems. Massive graphs are common in real-world applications, for example the web graph, where vertices are web pages and edges are links, or the call graph, which records phone calls between parties.


Since standard graph algorithms assume random access to the edge set, they are unusable for these massive graphs: the constant reloading of data into memory would be very expensive and time-consuming. Still, an application like a web crawler that collects the edges of the web graph would generate the edges one by one and could be used in a client-server model. But since these algorithms are mostly used in a context where the edge set is stored locally but is too big for the memory, we can actually do multiple passes over the data without much trouble.
A big problem for solving graph problems in the streaming model is the rather low storage space bound. A lot of basic graph problems are not solvable in sublinear space, so such a bound is not feasible. Therefore this essay discusses basic graph problems in the semi-streaming model. This model allows storage space of O(n · polylog n) instead of just O(polylog n) and explicitly allows multiple passes over the input, which is sometimes considered in the normal streaming model but is still rather uncommon.
I will present a (2/3 − ε)-approximation semi-streaming algorithm for the bipartite matching problem in unweighted graphs using O(log(1/ε)/ε) passes, and a one-pass algorithm for a 1/6-approximation of the weighted graph matching problem. Finally, I will present a (log n / log log n)-approximation for the diameter and an Ω(log^{1−ε} n) lower bound for the approximation of the diameter and other distance problems.

2 Preliminaries

A graph G is denoted by G = (V, E), where V is the set of vertices and E the set of edges. Note that the number of vertices is n, unless stated otherwise; all bounds depend on this number. The number of edges is denoted by m.

Definition 1. A graph stream is a sequence of edges e_{i_1}, e_{i_2}, ..., e_{i_m} ∈ E, where i_1, ..., i_m is an arbitrary permutation, so the order of the edges is arbitrary but fixed during the execution of an algorithm.

An algorithm can access these edges one by one and only in this order. This means that if the algorithm wants to access an edge again after discarding it, it has to go through the whole set of edges again. Since the order of the edges is not random, this definition also covers streams that present the edges in the order of an adjacency matrix or list, where edges adjacent to a vertex are grouped. But since this is not guaranteed, the algorithms have to work for any order of edges.
The efficiency of these algorithms is measured in the number of bits of storage space they need, the time they need per edge, and the number of passes they need over the input.

Definition 2. A semi-streaming graph algorithm is an algorithm that computes over a graph stream. It accesses the input in P(n, m) one-way passes, using T(n, m) time per edge and S(n, m) bits of space, where S(n, m) ∈ O(n · polylog n).

To show that the higher space limit is actually necessary, consider the following lemma:

Lemma 1. Every algorithm that decides if there is a directed path from s ∈ V to t ∈ V in a graph G = (V, E) requires Ω(m) bits of space.

Proof. Consider a family F of graphs G = (L ∪ R ∪ {s, t}, E), where all edges are directed from L to R or are of the form (s, l) or (r, t) with l ∈ L and r ∈ R. Let |L| = |R| = n and |E| ≤ n²/2. Now consider a stream that gives all edges between L and R, then one edge of the form (s, l), and at last an edge of the form (r, t). Before the last two edges are streamed in, the algorithm cannot decide whether an edge it has seen before is part of a path between s and t. It therefore needs a different memory configuration for every possible combination of edges between L and R, since the existence of any such edge could change the answer. Thus the algorithm needs at least log₂ |F| ∈ Ω(m) bits.


Since connectivity of two vertices appears as a subproblem in many graph problems, for example the diameter or shortest paths, linear space is needed for graph problems and the classic streaming model is not applicable.

3 Graph Matching

Definition 3. Given a graph G = (V, E), a matching is a set M ⊆ E such that no two edges in M have a common endpoint. If the matching cannot be enlarged by just adding edges, it is called a maximal matching; if it is the biggest of all possible matchings, it is called a maximum matching.

This section shows that (approximations of) graph matchings are indeed possible in the semi-streaming model. First the unweighted case is considered; the weighted case is discussed afterwards.

3.1 Unweighted Bipartite Matching

The first step towards a bipartite matching is testing whether the graph is bipartite. This can be done in one pass.

Algorithm 1 IsBipartite
Maintain a disjoint-set data structure for the connected components found so far and associate a sign with every vertex, such that no two connected vertices have the same sign.
If a new edge connects two vertices with the same sign, try fixing it by flipping the signs of one of the connected components.
If this can't fix the signs, the graph is not bipartite.
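A compact sketch (ours, not part of the paper) of a disjoint-set structure that carries the signs implicitly as parities relative to the component root; flipping a whole component then costs nothing:

class BipartitenessChecker:
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n
        self.parity = [0] * n     # sign of v relative to its parent
        self.bipartite = True

    def find(self, v):
        # Path compression; returns (root, sign of v relative to root).
        if self.parent[v] == v:
            return v, 0
        root, p = self.find(self.parent[v])
        self.parent[v] = root
        self.parity[v] ^= p
        return root, self.parity[v]

    def add_edge(self, u, v):
        ru, pu = self.find(u)
        rv, pv = self.find(v)
        if ru == rv:
            if pu == pv:          # same sign in one component: odd cycle
                self.bipartite = False
            return
        if self.rank[ru] < self.rank[rv]:
            ru, rv, pu, pv = rv, ru, pv, pu
        # Union by rank; pick rv's relative sign so that u and v differ.
        self.parent[rv] = ru
        self.parity[rv] = pu ^ pv ^ 1
        if self.rank[ru] == self.rank[rv]:
            self.rank[ru] += 1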

The algorithm tests whether an edge connects two vertices with the same sign. If it does not, the edge does not contradict a bipartition of the graph seen so far. The information that the connected vertices are in the same connected component is stored. Since the only way to change a sign is to flip the signs of a whole connected component, the condition is maintained over the whole algorithm if it is ensured for every edge. So if the algorithm does not find an edge that connects two vertices with the same sign and that cannot be fixed, the graph is bipartite, where the two sides are the vertices with a common sign, respectively. The disjoint-set data structure with union by rank and path compression can be modified such that it maintains the signs and still uses only amortized constant time.
Similarly, a 1/2-approximate matching of any unweighted graph can be found in one pass. This does not need a bipartition and can therefore be done in the same pass as finding the bipartition.
Given a matching M, a vertex is called free if it is not an endpoint of any edge in M.

Algorithm 2 GreedyMatching
Require: A graph stream of a graph G.
Ensure: A maximal matching in G.
1: M ← ∅
2: for each edge e that is streamed in do
3:   if both ends of e are free w.r.t. M then
4:     M ← M ∪ {e}
5:   end if
6: end for
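In Python (our sketch), the greedy pass is just:

def greedy_matching(edge_stream):
    matched = set()          # vertices already covered by the matching
    matching = []
    for u, v in edge_stream:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching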


Since every edge is considered, this matching is obviously maximal. For the approximation factor, consider the general case:

Lemma 2. Every maximal matching M is a 1/2-approximation of a maximum matching OPT.

Proof. Every edge in M has endpoints in common with at most two edges in OPT. (If an edge in M had no endpoint in common with any edge in OPT, one could add it to OPT, contradicting optimality, so every edge in M shares at least one endpoint with an edge in OPT.) Additionally, if an edge in OPT had no common endpoint with any edge in M, this edge could be added to M, so M would not be maximal. Hence every edge of OPT is blocked by some edge of M, and each edge of M can block at most two edges of OPT. Thus, |M| ≥ (1/2)|OPT|. □

Algorithm 2 finds a 1/2-approximation in one pass in any graph. For better approximations in bipartite graphs we need to improve this matching. Since all graphs considered in this chapter from now on are bipartite, they are treated as if they were directed, with all edges going from L to R. Because edges are unweighted, we can improve the matching by finding two edges which end in the two endpoints of a matched edge and whose other respective endpoints are free: if we remove the matched edge from the matching and add the two found edges instead, the set is still a matching, but its size has improved. These three edges form an augmenting path.

Definition 4. For a matching M of a bipartite graph G = (L ∪ R, E), a (length-3) augmenting path is a tuple (w_l, u, v, w_r) such that (u, v) ∈ M, (u, w_l) ∈ E, (w_r, v) ∈ E and w_l, w_r are free. w_l and w_r are the left and right wing tip, respectively; (u, w_l) is the left wing and (w_r, v) the right wing.
A set of length-3 augmenting paths in which the paths are pairwise disjoint is called simultaneously augmentable.

This definition can easily be generalized to longer paths (see Fig. 1), but this algorithm uses only paths of length 3.

Fig. 1 Two augmenting paths (of length 3 and 5) in G with a matching M. Green vertices are free w.r.t. M, red edges are in M, blue ones in E \ M.

Note that applying a set of simultaneously augmentable length-3 augmenting paths still results in a matching, since the paths are pairwise vertex-disjoint.
Now an algorithm that finds such a set can be defined; it will be used in the main algorithm.


Algorithm 3 FindAugmentingPaths
Require: A graph stream of a bipartite graph G, a matching M and a parameter δ.
Ensure: A set of simultaneously augmentable augmenting paths.
1: In one pass, find a maximal set of disjoint left wings L
2: if |L| ≤ δ|M| then return the set of paths found
3: end if
4: In another pass, for every edge that has a left wing, find a maximal set of disjoint right wings (they form a set of simultaneously augmentable augmenting paths)
5: In another pass, find (and subsequently ignore) the set of vertices that:
   • are endpoints of an edge e ∈ M with a left wing
   • are the wing tips of edges with both wings
   • are endpoints of a matched edge that can't be augmented any more
6: Repeat

Note that this algorithm does not change the matching, but only finds augmenting paths. The change of the matching happens in the main algorithm:

Algorithm 4 UnweightedBipartiteMatching
Require: A graph stream of a bipartite graph G and a parameter ε.
Ensure: A (2/3 − ε)-approximation of a maximum matching.
1: In one pass, find a maximal matching M (by Algorithm 2) and the bipartition (by Algorithm 1)
2: for k = 1, 2, ..., ⌈log(6ε)/log(8/9)⌉ do
3:   Run Algorithm 3 with G, M and δ = ε/(2 − 3ε)
4:   for each augmenting path (w_l, u, v, w_r) found in the last step do
5:     remove (u, v) from M and add (u, w_l) and (w_r, v)
6:   end for
7: end for

The efficiency of this algorithm hinges on the quality of the set of augmenting paths found by Algorithm 3. To evaluate this, a relationship with a maximum set has to be established.

Lemma 3. The size of a maximal set S of simultaneously augmentable length-3 augmenting paths is at least 1/3 of the size of a maximum set X.

Proof. Consider an augmenting path from S. Each of its wing tips can block at most one path in X, since no vertex can appear twice. Additionally, there might be a different augmenting path in X for the matched edge, so for every path in S there are at most 3 paths in X. Since S is maximal, every path in X is blocked by some path in S. Thus, |S| ≥ (1/3)|X|. □

Algorithm 3 does not produce a maximal set of augmenting paths, since it terminates as soon as the number of left wings drops to at most the threshold δ|M|. So the next question is how many paths, compared to a maximum set X, the algorithm does find.

Lemma 4. Algorithm 3 finds at least (|X| − 2δ|M|)/3 simultaneously augmentable length-3 augmenting paths in 3/δ passes, where X is a maximum set of simultaneously augmentable length-3 augmenting paths.

Proof. Since every repetition of the steps in the algorithm takes three passes, the number of passes depends on how often more than δ|M| left wings can be found. Since all further repetitions ignore at least those edges in M that have a left wing, at least δ|M| edges are removed from consideration in every round. Thus, after at most 3/δ passes the algorithm terminates.
Let L(M) be the set of endpoints of edges in M that are in L, and let V_L(M) be the set of possible left wing tips for edges in M. The left wings form a maximal matching between the parts of L(M) and V_L(M) that are not ignored, and in the last repetition at most δ|M| left wings are found. By Lemma 2, the maximum matching between these sets therefore has size at most 2δ|M|. Furthermore, there obviously cannot be more augmenting paths than left wings. This means that all other augmenting paths lie in the part of the graph that is ignored in the last repetition, so there are at least |X| − 2δ|M| such paths. The algorithm finds a maximal set of augmenting paths in this part of the graph; thus, according to Lemma 3, it finds at least (|X| − 2δ|M|)/3 paths. □

This makes no statement about how big the set X is compared to the matching M and a maximum matching. A relationship between them has to be established to determine the size of the matching at the end of the algorithm.

Lemma 5. Let X be a maximum set of simultaneously augmentable length-3 augmenting paths fora maximal matching M and OPT be an maximum matching. Then |M |+ |X| ≥ 2/3|OPT |.

Proof. Consider the connected components of the symmetric difference M△OPT. Note that the edges are undirected. No connected component can consist of only a single edge of OPT, since then both endpoints of this edge would be free with regard to M, which contradicts the maximality of M. Every connected component has at most one more edge from OPT than from M, because no two edges from the same set can be adjacent. So the only connected components that do not have a ratio between M and OPT of at least 2/3 are those with one edge from M and two from OPT. But there are at most |X| of those, since they are augmenting paths.
So every edge from M is either part of OPT and therefore not in M△OPT, part of a connected component where the ratio is at least 2/3, or part of one of the at most |X| components where the ratio is 1/2. Thus, |M| + |X| ≥ (2/3)|OPT|. ⊓⊔

Now, putting all of this together gives the desired result.

Theorem 1. For any 0 < ε < 1/3 and a bipartite graph, Algorithm 4 finds a 2/3 − ε approximation of a maximum matching in O(log(1/ε)/ε) passes. The algorithm takes amortized constant time per edge in the first pass and constant time per edge in every other pass. The algorithm needs O(n · log n) bits of storage space.

Proof. The constant processing time can be achieved by keeping the state of each vertex, regarding whether it is part of the matching or ignored for the time being, in memory.
The needed storage is then defined by this state (O(n)), the bipartition (O(n)) and several matchings, which are sets of at most n/2 edges, where a single edge takes O(log n) space. So the total storage space is O(n · log n) bits.
The correctness of the algorithm is yet to be shown. Let OPT be the size of a maximum matching in G. Consider the i-th repetition of the loop in Algorithm 4. Let M_i be the matching found in this repetition, X_i a maximum-sized set of simultaneously augmentable length-3 augmenting paths for M_i, α_i = |X_i|/|M_i| and s_i = |M_i|/OPT.

Note that if α_i ≤ 3ε/(2 − 3ε), then by Lemma 5

|M_i|(1 + α_i) ≥ (2/3)·OPT ⇔ |M_i| ≥ (2/3) · (1/(1 + α_i)) · OPT ⇒ |M_i| ≥ (2/3 − ε)·OPT.

Thus, M_i is a 2/3 − ε approximation. Therefore we assume that α_i > 3ε/(2 − 3ε) holds for all i. Remember that δ = ε/(2 − 3ε); therefore δ ≤ α_i/3 for all α_i. Using this and Lemma 4 gives


(|X_i| − 2δ|M_i|)/3 = (α_i|M_i| − 2δ|M_i|)/3 ≥ (α_i|M_i| − 2(α_i/3)|M_i|)/3 = (1/3 · α_i|M_i|)/3 = α_i|M_i|/9,

and in particular (α_i − 2δ)/3 ≥ α_i/9.   (1)

By Lemma 5,

|M_i| + |X_i| ≥ 2/3 · OPT
⇔ |M_i| + α_i|M_i| ≥ 2/3 · OPT
⇔ (|M_i| + α_i|M_i|)/OPT ≥ 2/3
⇔ s_i + α_i s_i ≥ 2/3
⇔ α_i s_i ≥ 2/3 − s_i.   (2)

Furthermore, with Lemma 4 we get

|M_{i+1}| ≥ |M_i| + (α_i|M_i| − 2δ|M_i|)/3 = |M_i| · (1 + (α_i − 2δ)/3) ≥ |M_i| · (1 + α_i/9)   by (1),

and therefore

|M_{i+1}|/OPT ≥ |M_i| · (1 + α_i/9)/OPT ⇔ s_{i+1} ≥ s_i · (1 + α_i/9) = s_i + s_iα_i/9.   (3)

Putting this together we get a closed recurrence:

s_{i+1} ≥ s_i + s_iα_i/9 ≥ s_i + (2/3 − s_i)/9   by (2)
⇔ s_{i+1} ≥ s_i − s_i/9 + 2/27
⇔ s_{i+1} ≥ (8/9)s_i + 2/27.   (4)

Because the algorithm finds a maximal matching in the first pass, Lemma 2 gives s_0 ≥ 1/2. Using this, a solution for (4) can be found:

s_i ≥ 2/3 − (1/6) · (8/9)^i.

The algorithm does k = ⌈log(6ε)/log(8/9)⌉ repetitions, so


|M_k|/OPT = s_k ≥ 2/3 − (1/6) · (8/9)^k ≥ 2/3 − (1/6) · (8/9)^{log(6ε)/log(8/9)} = 2/3 − (1/6) · 6ε = 2/3 − ε.

Conclusively, the number of passes is

1 + k · 3/δ = 1 + k · (6 − 9ε)/ε = 1 + ⌈log(6ε)/log(8/9)⌉ · (6 − 9ε)/ε ∈ O(log(1/ε)/ε).   ⊓⊔

3.2 Weighted Matching

In the unweighted case all edges have the same value in a matching, so only the number of edges influences the quality of a matching. In contrast, in the weighted case every edge e has a weight w(e) and the quality of a matching M is Σ_{e∈M} w(e). There are many offline algorithms to solve this problem and at least one can be adapted for the streaming model.
First the edges are partitioned into ⌈log_{1+ε/3}(⌈3/ε + 1.5⌉ · n)⌉ groups and the algorithm finds maximal matchings in all those groups, starting with the group of highest weight. This leads to a 1/(2+ε)-approximation in O(log_{1+ε/3} n) passes and O(n log n) storage. Details can be found in [6].

Since the number of passes is no longer constant in n, it is interesting whether there are approximations with fewer passes. The following algorithm achieves a 1/6-approximation in only one pass.
For a matching M, let w_adj(e) be the sum of the weights of the (at most 2) edges in M that are adjacent to e.

Algorithm 5 WeightedMatching
Require: A graph stream of a weighted graph G.
Ensure: A 1/6-approximation of a maximum matching.
1: M ← ∅    ▷ maintain a matching at all times
2: for every streamed edge e do
3:   if w(e) > 2w_adj(e) then
4:     remove the adjacent edges from M and add e
5:   end if
6: end for
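A minimal Python sketch of Algorithm 5 follows. It assumes the stream is an iterable of (u, v, weight) triples and keeps, for every matched vertex, the edge matching it; the function name and data layout are our own.

def weighted_matching(stream):
    matched = {}                                   # vertex -> matched edge (u, v, w)
    for (u, v, w) in stream:
        # the (at most 2) edges of M adjacent to the new edge
        adjacent = {e for x in (u, v) if (e := matched.get(x)) is not None}
        w_adj = sum(e[2] for e in adjacent)
        if w > 2 * w_adj:
            for (a, b, _) in adjacent:             # removal is final
                matched.pop(a, None)
                matched.pop(b, None)
            matched[u] = matched[v] = (u, v, w)
    return set(matched.values())

edges = [("a", "b", 1), ("b", "c", 3), ("a", "b", 10)]
print(weighted_matching(edges))                    # {('a', 'b', 10)}

The dictionary makes w_adj(e) available in constant time per streamed edge, which is what gives the one-pass behaviour.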

Note that removing an edge is final and can't be reversed, even if both of its endpoints would be free with regard to the final matching. So an edge has to account not only for the edges it directly removes, but for all edges that were removed by those, and so on. This is why w(e) must be more than twice w_adj(e).

Theorem 2. In one pass and O(n log n) storage, the algorithm constructs a 1/6-approximation of a maximum matching.

Proof. For any set of edges A, let w(A) = Σ_{e∈A} w(e).
An edge is called born if it is ever part of M, and killed if it was born but afterwards removed from M. The removed edge is murdered by the new one. If an edge is born but never killed, it is called a survivor. Let S be the set of survivors. Since the edges in M are exactly those that were born but never killed, S = M, and therefore the quality of this matching is w(S).
Let the trail of dead of an edge e be T(e) = ⋃_{i≥1} C_i, with C_i the set of edges murdered by the edges in C_{i−1} and C_0 = {e}. So C_1 is the set of edges murdered by e, C_2 the set of edges murdered by these edges, and so on.


Claim 1: w(T(e)) ≤ w(e).
Proof of claim: Every edge has at most one murderer and by the algorithm w(e) > 2w_adj(e), so an edge has at least twice the weight of the edges it murdered; hence 2w(C_{i+1}) ≤ w(C_i). Therefore

2w(T(e)) = Σ_{i≥1} 2w(C_i) ≤ Σ_{i≥0} w(C_i) = w(C_0) + Σ_{i≥1} w(C_i) = w(e) + w(T(e)),

which implies the claim.
Now consider a maximum solution OPT = {o_1, o_2, ...}. The weight of these edges is distributed to the survivors of the algorithm.
First, an edge e is accountable to o ∈ OPT if e = o or if o was never born because of e. In the second case two edges might be accountable to o. If only one edge is accountable to o, its whole weight w(o) is charged to e; otherwise let e_1, e_2 be the two edges accountable to o. In this case e_1 is charged with w(o)·w(e_1)/(w(e_1) + w(e_2)) and e_2 with w(o)·w(e_2)/(w(e_1) + w(e_2)). Since in this case

w(o) ≤ 2(w(e_1) + w(e_2)) ⇔ w(o)/2 ≤ w(e_1) + w(e_2),

it holds that

w(o)·w(e_1)/(w(e_1) + w(e_2)) ≤ w(o)·w(e_1)/(w(o)/2) = 2w(e_1).

So e_1 is charged at most twice its own weight; e_2 is analogous. For the case that only one edge is accountable, e is also charged at most twice its weight, since either o = e or we have w(o) ≤ 2w(e) because o was never born due to only one edge. So any edge is charged at most twice its own weight by a single charge.
Note that an edge has an endpoint in common with every edge it murders and with every edge it gets charged by. So an edge can be charged by two edges from OPT, one for every endpoint. Because the calculations are easier if all edges in a trail of dead are charged by only one edge in OPT, the charges are redistributed as follows. For distinct u_1, u_2, u_3 and a v ∈ V: if e' = (u_1, v) gets charged by (u_2, v) and is afterwards murdered by e = (u_3, v), the charge is transferred from e' to e. Edges are still charged at most twice their own weight since w(e) ≥ w(e').
Although the edges in a trail of dead are only charged once, survivors can still be charged twice. Thus, for the set of survivors S,

w(OPT) ≤ Σ_{e∈S} (2w(T(e)) + 2·2w(e)) ≤ Σ_{e∈S} (2w(e) + 4w(e)) = 6w(S),

using Claim 1 in the second step. Therefore the set of survivors, and thus the matching found by the algorithm, is a 1/6-approximation. ⊓⊔

Recently, a (1+ε)-approximation in O((1/ε)^5) passes was introduced for bipartite, unweighted graphs [3], as well as an algorithm for the weighted case that achieves an approximation ratio of 5.58 in one pass [7] (compared to the 6 presented here).

3.3 Lower Bounds

According to Lemma 1, deciding whether there is a directed path between two fixed vertices s, t requires Ω(m) bits of storage. So if "s-t-connectivity" ≤_p "find augmenting path", it can be inferred that the matching algorithm is impossible in the classic streaming model, since finding an augmenting path would also take Ω(m) bits of storage. Note that an augmenting path does not have fixed length; otherwise the definition is similar to that of length-3 augmenting paths (cf. Figure 1).

Page 86: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

78 Arne Kemper

Theorem 3. "s-t-connectivity" ≤_p "find augmenting path"

Proof. Without loss of generality, let G = ({v_1 = s, v_2, ..., v_{n−1}, v_n = t}, E) be a directed graph. Construct the undirected graph G' = (V', E') with V' = {v_il | v_i ∈ V} ∪ {v_ir | v_i ∈ V} ∪ {v_s, v_t} and E' = {(v_il, v_ir) | v_i ∈ V} ∪ {(v_ir, v_jl) | (v_i, v_j) ∈ E} ∪ {(v_s, v_1l), (v_t, v_nr)}. The initial matching is M = {(v_il, v_ir) | v_i ∈ V}.
Since all edges (v_il, v_ir) are part of the matching and every other edge except (v_s, v_1l) and (v_t, v_nr) is of the form (v_ir, v_jl), the only way to find an augmenting path, i.e. a path that contains more edges from E' \ M than from M, is for (v_s, v_1l) and (v_t, v_nr) to be part of this path. Due to the construction of G', v_1l and v_nr are connected exactly if there is a path from s to t in G.
Thus, there is an augmenting path in G' exactly if there is a path from s to t in G. ⊓⊔
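The construction of G' is purely local per edge and can be written down directly. The following Python sketch builds V', E' and the initial matching M from a directed edge list with s = v_1 and t = v_n; all identifier names are our own.

def build_reduction(n, directed_edges):
    """Reduce s-t-connectivity on a directed graph to finding an augmenting path."""
    V = ([f"v{i}l" for i in range(1, n + 1)]
         + [f"v{i}r" for i in range(1, n + 1)] + ["vs", "vt"])
    E = [(f"v{i}l", f"v{i}r") for i in range(1, n + 1)]        # matched gadget edges
    E += [(f"v{i}r", f"v{j}l") for (i, j) in directed_edges]   # one per directed edge
    E += [("vs", "v1l"), ("vt", f"v{n}r")]                     # the two free endpoints
    M = [(f"v{i}l", f"v{i}r") for i in range(1, n + 1)]        # initial matching
    return V, E, M

# A 3-vertex path s -> v2 -> t yields an augmenting path vs, v1l, v1r, ..., vt.
V, E, M = build_reduction(3, [(1, 2), (2, 3)])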

4 Distances and further problems

This section gives an overview of results regarding distances, like shortest paths and the diameter, in the semi-streaming model. Notably, a lower bound for approximations in one pass will be presented.

Definition 5. An edge (u, v) of a graph is called k-critical if the shortest path from u to v in (V, E \ {(u, v)}) has length ≥ k.

The next lemma shows the existence of a graph with useful properties. This is done using the probabilistic method, i.e. by showing that a random graph has these properties with positive probability, so that at least one such graph must exist.

Lemma 6. For 0 < ε < 1 and sufficiently large n there exists a graph G = (V, E) with |V| = n and |E| = 2^{log^ε n}·n/4 such that more than half of the edges are k-critical with k = (log^{1−ε} n)/2, and more than half of the subgraphs in the set

{G' ⊆ G | G' is formed by deleting a subset of the k-critical edges}

have diameter less than or equal to 8k = 4 log^{1−ε} n.

Proof. Consider a random graph G ∈ G_{n,p}, i.e. a random graph with n vertices where every fixed edge exists with probability p, independently of all other edges. Let p = 2^{log^ε n}/n. It is intuitively clear, and can be shown using the Chernoff bound, that with high probability the number of edges in G is at least 2^{log^ε n}·n/4, which is 1/2 of the expectation, since there are about n²/2 possible edges.
Claim 1: With high probability, the majority of edges in G is k-critical.
Proof of Claim 1: By using the Chernoff bound again, for any vertex v, Pr[d(v) ≥ 2·2^{log^ε n}] ≤ (e/4)^{2^{log^ε n}}, where d(v) is the degree of v. Using the union bound, the probability that any vertex has a degree that large is at most Σ_{v∈V} Pr[d(v) ≥ 2·2^{log^ε n}] ≤ n·(e/4)^{2^{log^ε n}}.
For sufficiently large n it holds that 2^{log^ε n} ≥ log² n ≥ 1, and note that e/4 < 1. Thus

n·(e/4)^{2^{log^ε n}} ≤ n·(e/4)^{(log_{4/e} n)²} = 1/n^{log n − 1}.

Therefore, the probability that no vertex has degree larger than 2·2^{log^ε n} is at least 1 − 1/n^{log n − 1}, and it is assumed in the following that this is always the case.
This implies that the number of vertices that have distance i from any vertex v is at most (2·2^{log^ε n})^i. Consider any edge (u, v) in G = (V, E). Let Γ_i(v) be the set of vertices with distance at most i from v in (V, E \ {(u, v)}), so


|Γ_k(v)| ≤ Σ_{0≤i≤k} (2·2^{log^ε n})^i ≤ 2·(2^{log^ε n})^{k+1} = 2·(2^{log^ε n})^{(log^{1−ε} n)/2 + 1} = 2·2^{log^ε n · (log^{1−ε} n)/2 + log^ε n} = 2^{(log n)/2 + log^ε n + 1}.

This is smaller than 2^{2(log n)/3}, because

2^{(log n)/2 + log^ε n + 1} ≤ 2^{2(log n)/3} ⇔ (log n)/2 + log^ε n + 1 ≤ (2/3)·log n ⇔ log^ε n + 1 ≤ (1/6)·log n

holds for sufficiently large n. In a random graph, the other endpoint u of an edge at v is selected uniformly at random from V \ {v}, so the probability that u is in Γ_k(v) is at most 2^{2(log n)/3}/n. Since (u, v) is k-critical exactly if u is not in Γ_k(v), the probability of it being k-critical is at least

1 − 2^{2(log n)/3}/n = 1 − 1/n^{1/3}.

Using the Chernoff bound once more shows that, with high probability, the majority of edges in G is k-critical. ⊓⊔
Claim 2: The diameter of a random graph G ∈ G_{n,p/2} is, with high probability, less than D = 4 log^{1−ε} n.
Proof of Claim 2: Consider any node v ∈ V. Let S_i = Γ_i(v) \ Γ_{i−1}(v). For those i with |Γ_i(v)| < n/2, the Chernoff bound gives that, with high probability, |S_{i+1}| > |S_i|·2^{log^ε n}/4. Now consider the first t such that |Γ_t(v)| ≥ n/2. Since |Γ_t(v)| = Σ_{j=1}^t |S_j| and by the above bound, |S_t| > n/4, and with high probability |Γ_{t+1}(v)| = n. Solving this for t gives

t + 1 ≤ log n/(log^ε n − 2) < 2 log^{1−ε} n.

This is the maximum distance from v to every other node, so the diameter must be smaller than 4 log^{1−ε} n. ⊓⊔

Using these two claims, the actual lemma can be proven. Let G ∈ G_{n,p}. Picking a random subgraph of G is the same as picking a graph from G_{n,p/2}. Note that any edge could be removed from G, not only those that are k-critical. Consider the events

• A = {G = (V, E) ∈ G_{n,p} | the majority of edges is k-critical}
• B = {G = (V, E) ∈ G_{n,p/2} | the diameter of G is less than D}
• B_G = {for subgraphs G' of G | the diameter of G' is less than D}


The former argumentation shows that Pr[B] is high and that Pr[¬A] is very low. Thus, the probability of A and B happening at the same time is

Pr[A ∩ B] ≥ Pr[B] − Pr[¬A] > 1/2.

Also,

Pr[A ∩ B] = Σ_G I[A] · Pr[G] · Pr[B_G].

With a simple averaging argument, it follows that there is a G ∈ A with Pr[B_G] > 1/2. This encompasses all subgraphs of G, not only those where only k-critical edges are deleted. But since adding edges cannot increase the diameter of a graph, re-adding the non-k-critical edges does not change this. So the majority of the subgraphs of G formed by deleting k-critical edges has diameter less than D. ⊓⊔

The following theorem can be proven using this graph G and its subgraphs.

Theorem 4. For 0 < ε < 1, it is impossible to approximate the diameter of an unweighted graph within a factor of o(log^{1−ε} n) in one pass in the semi-streaming model.

Proof. Let k = (log^{1−ε} n)/2 and D = 8k. Consider a graph with the properties of Lemma 6 on η = (n − 2D)/D = n/D − 2 vertices. Let F(G) be the set of subgraphs of G with diameter less than D. Note that

|F(G)| ≥ 2^{2^{log^ε n}·n/8} = 2^{ω(n·polylog n)}.

So there are more such graphs than possible memory configurations in the semi-streaming model. Hence, for any given algorithm, there are two graphs G', G'' ∈ F(G) that are indistinguishable by that algorithm. So when these graphs are streamed to this algorithm, there exists an edge in those graphs whose existence is undetermined at the end of the algorithm. Consider D of those graphs G_1, ..., G_D, and let e_i = (u_il, u_ir) be an edge whose existence is undetermined in a stream of G_i.
The stream that is actually fed to the algorithm consists of first the graphs G_1, ..., G_D, then edges of the form (u_ir, u_{(i+1)l}) for i = 1, 2, ..., D − 1 and finally two new paths of length D each, one with endpoints s and u_1l and the other with endpoints t and u_Dr. See Figure 2 for the final graph. Since no path in the subgraphs G_1, ..., G_D has length D or longer, the diameter of the graph is the length of the path between s and t. This path has length 4D − 1, where 2D comes from the two added paths, plus D from the undetermined edges and D − 1 from the edges connecting the subgraphs. Since all the undetermined edges might have been k'-critical, with k' = (log^{1−ε} η)/2, the minimum diameter the algorithm can ensure is 3D − 1 + k'D. This is an Ω(k')-approximation. If k' ∈ Θ(k), the theorem follows. Observe that

k' = (log^{1−ε} η)/2 = (log^{1−ε}(n/D − 2))/2 = (log^{1−ε}(n/(4 log^{1−ε} n) − 2))/2 = Θ(log^{1−ε}(n/(4 log^{1−ε} n)))

and

log^{1−ε}(n/(4 log^{1−ε} n)) = (log n − log(4 log^{1−ε} n))^{1−ε} = Θ(log^{1−ε} n) = Θ(k);

thus the theorem follows. ⊓⊔


Fig. 2 Visualization of the streamed graph

Additionally, shortest paths can be approximated in one pass by building a spanner; in particular, a (log n/log log n)-spanner can be constructed for unweighted graphs by an algorithm similar to the one described in [1]. The algorithm constructs this spanner S by adding the streamed edges to S, unless this would create a cycle of length at most log n/log log n. The weighted case is more complicated, since the edges would have to be sorted according to their weight, which is at least very difficult in the streaming model.
Instead, there is an algorithm that groups the edges into sets of edges with similar weight and uses the unweighted algorithm on these sets. If w_max is the maximum weight and w_min the minimum one, the range [w_min, w_max] is divided into intervals [(1+ε)^i·w_min, (1+ε)^{i+1}·w_min), all edges with weight in such an interval are treated as if they had weight (1+ε)^i·w_min, and the unweighted algorithm is used on each set. This leads to log_{1+ε}(w_max/w_min) many spanners on subsets of the edges. Note that if this is more than polylog n, the storage space needed is too big and the algorithm can't be used. This is independent of ε, so some graphs can't be handled in the semi-streaming model. The union of the single spanners gives a (1+ε) log n-spanner. The algorithm does not need prior knowledge of the bounds of the edge weights and can operate on the maximum and minimum values seen so far.

Theorem 5. For ε > 0 and a weighted undirected graph on n vertices with maximum edge weight w_max and minimum edge weight w_min, where log_{1+ε}(w_max/w_min) = polylog n, there is a semi-streaming algorithm that calculates a (1+ε) log n-spanner of the graph in one pass. It needs O(log_{1+ε}(w_max/w_min) · n log n) bits of space and at most O(log_{1+ε}(w_max/w_min) · n) time per edge.

This spanner can be used to approximate shortest paths and the diameter of the graph. It can also be used to find an approximation of the girth, i.e. the length of the shortest cycle in the graph. If the girth is larger than k, it can be determined exactly in a k-spanner; this spanner gives a log n/log log n-approximation.
Some other graph problems that are solvable in the semi-streaming model are the construction of a minimum spanning tree and testing for planarity. Since the numbers of edges in a spanning tree and in a planar graph are linear in the number of nodes, there are offline algorithms that can be adapted without exceeding the storage bound. Note that at least the planarity test is impossible in the classic streaming model.
Lastly, the next algorithm finds articulation points (i.e. vertices whose removal would disconnect the graph). It uses a disjoint set data structure for every vertex v to keep all neighbors of v that are in the same connected component of the graph without v in the same set.


Algorithm 6 FindArticulationPoints
Require: A graph stream of an unweighted, connected graph G.
Ensure: The set of articulation points in G.
1: T = (V, ∅)
2: for each v ∈ V do
3:   SF.makeset(v)
4: end for
5: for each streamed edge (u, v) do
6:   if SF.findset(u) = SF.findset(v) then
7:     find the path u = a_0, a_1, ..., a_k = v in T
8:     for each a_i, 1 ≤ i ≤ k − 1 do
9:       a_i.union(a_{i−1}, a_{i+1})
10:    end for
11:  else
12:    SF.union(u, v)
13:    T = T ∪ (u, v)
14:    u.makeset(v)
15:    v.makeset(u)
16:  end if
17: end for
18: for each v ∈ V do
19:   if the neighbors of v w.r.t. T are in different sets then
20:     output v as an articulation point
21:   end if
22: end for

If v is an articulation point, then it has two neighbors in T that have no other connection. Therefore their sets in v's data structure are never unioned and the algorithm outputs v. A possible implementation is sketched below.
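The following Python sketch implements Algorithm 6 under simplifying assumptions: the stream is an in-memory iterable of edges and the union-find uses plain path halving. All identifiers are our own and not part of the original pseudocode.

import collections

class DSU:
    """A simple disjoint set forest."""
    def __init__(self):
        self.parent = {}
    def makeset(self, x):
        self.parent.setdefault(x, x)
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x
    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def tree_path(tree, u, v):
    """BFS path u = a_0, ..., a_k = v in the spanning forest T."""
    prev, queue = {u: None}, collections.deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            break
        for y in tree[x]:
            if y not in prev:
                prev[y] = x
                queue.append(y)
    path = [v]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    return path[::-1]

def articulation_points(vertices, stream):
    SF = DSU()                                  # global connectivity
    local = {v: DSU() for v in vertices}        # one structure per vertex
    tree = {v: set() for v in vertices}         # the forest T
    for v in vertices:
        SF.makeset(v)
    for (u, v) in stream:
        if SF.find(u) == SF.find(v):            # edge closes a cycle w.r.t. T
            path = tree_path(tree, u, v)
            for i in range(1, len(path) - 1):   # merge neighbor sets along it
                local[path[i]].union(path[i - 1], path[i + 1])
        else:                                   # tree edge
            SF.union(u, v)
            tree[u].add(v); tree[v].add(u)
            local[u].makeset(v); local[v].makeset(u)
    return [v for v in vertices
            if len({local[v].find(x) for x in tree[v]}) > 1]

# The middle vertex of the path a-b-c is the only articulation point.
print(articulation_points(["a", "b", "c"], [("a", "b"), ("b", "c")]))   # ['b']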

5 Conclusion

The semi-streaming model allows a lot of new possibilities while still keeping the core idea of the streaming model. The multiple passes might not be suitable for a client-server setting, but the page-wise reading of data is often still more efficient than random access.
Interesting further research would be the trade-off between the number of passes, the storage space, the per-edge time and the approximation factor.

References

1. Althöfer, I., Das, G., Dobkin, D., Joseph, D.: Generating sparse spanners for weighted graphs. In: Proc. 2nd Scandinavian Workshop on Algorithm Theory (SWAT'90), LNCS 447, pp. 26–37, 1990.
2. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. In: Journal of Computer and System Sciences, 58(1), pp. 137–147, Feb. 1999.
3. Eggert, S., Kliemann, L., Munstermann, P., Srivastav, A.: Bipartite matching in the semi-streaming model. In: Algorithmica, pp. 490–508, August 2011. http://dx.doi.org/10.1007/s00453-011-9556-8. Cited 25.11.2015.
4. Feigenbaum, J., Kannan, S., Strauss, M., Viswanathan, M.: An approximate L1-difference algorithm for massive data streams. In: SIAM Journal on Computing, 32(1), pp. 131–151, 2002.
5. Rauch Henzinger, M., Raghavan, P., Rajagopalan, S.: Computing on data streams. In: Technical Report 1998-001, DEC Systems Research Center, 1998.
6. Uehara, R., Chen, Z.: Parallel approximation algorithms for maximum weighted matching in general graphs. In: Information Processing Letters, 76(1-2), pp. 13–17, 2000.
7. Zelke, M.: Weighted matching in the semi-streaming model. http://www.tks.informatik.uni-frankfurt.de/getpdf?id=715. Cited 25.11.2015.


The Count-Min-Sketch and its Applications

Jannik Sundermeier

Abstract In this thesis, we want to show how to handle a huge amount of data which is difficult or even impossible to store in local memory. Our dataset consists of events connected to their frequencies. We are interested in several queries concerning this dataset. Since we cannot store the whole dataset, provably some approximation is needed. The Count-Min Sketch is a data structure which allows us to store a sublinear approximation of our original set. Furthermore, the Count-Min Sketch is a very easy to implement data structure which offers good estimation guarantees. In this thesis, we explain the structure of the Count-Min Sketch and prove the guarantees for some basic query types.

1 Problem Setup . . . 83
  1.1 The Scenario . . . 84
  1.2 Update Semantics . . . 84
  1.3 Query types . . . 84
2 Preliminaries . . . 86
  2.1 Linearity of Expectation . . . 86
  2.2 The Markov inequality . . . 86
  2.3 Chernoff bound(s) . . . 86
  2.4 Pairwise-Independent Hash Functions . . . 86
3 The Count-Min Sketch . . . 87
  3.1 Idea . . . 87
  3.2 Construction . . . 88
  3.3 Example . . . 89
4 Query Estimation . . . 90
  4.1 Point Query . . . 90
  4.2 Inner Product Query . . . 92
  4.3 Range Query . . . 94
  4.4 φ-Quantiles . . . 100
  4.5 Heavy Hitters . . . 100
5 Summary / Conclusion . . . 101
References . . . 102

1 Problem Setup

In this thesis, we explain and analyze the Count-Min Sketch. Section 1.1 introduces the initial situation where we want to make use of the Count-Min Sketch. Section 1.2 deals with some variants of our general model, and section 1.3 introduces several query types we want to answer using the Count-Min Sketch.

Jannik SundermeierUniversität Paderborn, Warburger Str. 100, 33098 Paderborn, e-mail: [email protected]



1.1 The Scenario

The information given in this thesis is mainly based on [1]. Whenever we make use of another source, you will find a citation.
The goal of the Count-Min Sketch is to handle a huge dataset which is difficult or even impossible to store in local memory. We can imagine this dataset as a set consisting of events combined with their number of occurrences. The setup is the data stream scenario. Hence, we have n different events and our set consists of tuples (i, a) with i ∈ {1, ..., n} and an amount a ∈ N. We can interpret this dataset as a vector a = (a_1, ..., a_n). An entry a_i determines how often event i has happened in our distributed system. Initially all values of our vector are set to zero. Furthermore, our dataset continuously receives updates. Updates are given as tuples (i, c); this means that we have to look at the vector entry a_i and add the amount c to that entry. All other entries are not modified. Since we cannot store the whole dataset, we need some approximation of the information contained. A sketch data structure allows us to store an approximation of the original dataset sublinear in space. We take the definition of a sketch from [2].

Definition 1. An f(n)-sketch data structure for a class Q of queries is a data structure for providing approximate answers to queries from Q that uses O(f(n)) space for a data set of size n, where f(n) = n^ε for some constant ε < 1.

The set of queries Q will be introduced in section 1.3. All in all, we are looking for a sketch data structure which allows us to store a sublinear approximation of our original vector a. The Count-Min Sketch, introduced below, is such a data structure.

1.2 Update Semantics

The dataset we are observing permanently receives updates. Depending on the application field of our data structure, different update types might be plausible. In some cases it seems useful or even natural that the updates are strictly positive. For these update semantics, we speak of the cash-register case.
If we want to allow negative updates, we talk about the turnstile case. There are two important variations of the turnstile case to consider. The first one is called the general case, which has no further restrictions on the updates. The second one is called the non-negative case: negative updates are allowed, but the application somehow ensures that the entries of the original vector cannot become negative.

1.3 Query types

We want to be able to compute different query types on our vector a. Hereinafter we introduce the queries we are interested in.


1.3.1 Point Queries

The very basic point query asks for an approximation of the value of a specific a_i. Given a value i, an approximated value for a_i is returned.

1.3.2 Range Queries

A range query computes the sum of a given range of values. Given the values l and r, the sum Σ_{i=l}^r a_i is (approximately) computed. Consider that there is a certain order on the indices of the original vector. For example, suppose you want to analyze the regional allocation of visitors of your website, so each entry of your original vector represents the number of visitors of your website in a certain city. Assume that the entries from 1000 to 5000 are from European cities. By range querying for the range [1000, 5000] you can compute the number of website visitors from Europe.

1.3.3 Inner Product Queries

An inner product query computes (an approximation of) the inner product of our vector a and a given vector b. The inner product of two vectors a and b is defined as a ⊙ b = Σ_{i=1}^n a_i·b_i.
Inner product queries might be reasonable in a database scenario. Computing the join size of two relations is a task query optimizers of database systems regularly have to solve. Related to our sketch, we assume without loss of generality that the attribute values of the two joined relations a and b are integers in the range [1, n]. Now we can represent our relations with vectors a and b. A value a_i determines how many tuples with value i there are in the first relation; b is defined analogously. Using sketches, we can estimate the join size of a and b in the presence of continuous updates, so that items can be removed from or added to the relations at any time.

1.3.4 φ-quantiles

The φ-quantiles of our original vector are defined as follows: first, we have to sort the vector entries by their frequencies. The φ-quantiles consist of all indices with rank kφ‖a‖_1 for k = 0, ..., 1/φ. In order to approximate the φ-quantiles, we accept all indices whose rank lies between (kφ − ε)‖a‖_1 and (kφ + ε)‖a‖_1 for some specified ε < φ.

1.3.5 Heavy hitters

The φ-heavy hitters of our dataset are the events with the highest frequencies. We can describe the set of φ-heavy hitters as H = {a_i | i ∈ {1, ..., n} and a_i ≥ φ‖a‖_1}. Because we cannot compute exact values, we approximate the φ-heavy hitters by accepting any i such that a_i ≥ (φ − ε)‖a‖_1 for some ε < φ.
The problem of finding heavy hitters is very popular because we can imagine various applications. For instance, we are able to detect the most popular products of a web-shop, for example Amazon. Besides, we are able to detect heavy TCP flows. Here, we can interpret a as the amounts of packets passing a switch or a router. This might be useful for identifying denial-of-service attacks.


2 Preliminaries

In the following section, we introduce the formal preliminaries used in later parts of this thesis. At first we need a general theorem about expectation: sometimes we want to analyze the expectation of a sum of random variables, and linearity of expectation (section 2.1) helps us to compute it. Additionally, we need two important results of probability theory, namely Markov's inequality (section 2.2) and Chernoff bounds (section 2.3), to prove our results. Afterwards we introduce pairwise-independent hash functions (section 2.4), the family of hash functions used here.

2.1 Linearity of Expectation

Linearity of expectation allows us to compute the expectation of a sum of random variables by computing the sum of the individual expectations.

Theorem 1. Let X_1, X_2, ..., X_n be discrete random variables and X = Σ_{i=1}^n X_i. Then it holds that

E[X] = E[Σ_{i=1}^n X_i] = Σ_{i=1}^n E[X_i].

2.2 The Markov inequality

For our probabilistic estimates, we need Markov's inequality.

Theorem 2. For a non-negative random variable X and any a > 0, it holds that

Pr[X ≥ a] ≤ E[X]/a.

2.3 Chernoff bound(s)

For one analysis, we will need the following Chernoff bound:

Theorem 3. Let X_1, ..., X_n be a sequence of independent Bernoulli experiments with success probability p. Let X = Σ_{i=1}^n X_i with expectation µ = np. Then, for every δ > 0 it holds that

Pr[X > (1 + δ)µ] < (e^δ/(1 + δ)^{(1+δ)})^µ.

2.4 Pairwise-Independent Hash Functions

For our sketch, we need the notion of pairwise independent hash functions. They are useful because the probability of a hash collision is kept small and the mapping is independent of the hashed data.


Definition 2. Let N, M ∈ N, N ≥ M. A family of functions H ⊆ {h | h : [N] → [M]} is called a family of pairwise independent hash functions if for all i ≠ j ∈ [N] and all k, l ∈ [M]: Pr[h(i) = k ∧ h(j) = l] = 1/M².

We remark here that another name for this class of hash functions is "strongly 2-universal". This is due to the following fact: let h be from a family of pairwise independent hash functions. Then for i ≠ k,

Pr[h(i) = h(k)] ≤ 1/M.

This inequality results directly from the definition given above. In the following we introduce a concrete family of pairwise independent hash functions which we can use for examples in later parts of this thesis.

Example 1 The following exemplary construction is taken from [4]. For further interest, you can find there a proof that the given construction indeed is a family of pairwise-independent hash functions.
Let U = {0, 1, 2, ..., p^k − 1} and V = {0, 1, 2, ..., p − 1} for some integer k and prime p. Now, we rewrite every element u of the universe U as u = (u_0, ..., u_{k−1}) such that Σ_{i=0}^{k−1} u_i·p^i = u. Then, for every vector a = (a_0, a_1, ..., a_{k−1}) with 0 ≤ a_i ≤ p − 1 for 0 ≤ i ≤ k − 1 and for any value b with 0 ≤ b ≤ p − 1, let

h_{a,b}(u) = (Σ_{i=0}^{k−1} a_i·u_i + b) mod p.

Consider the family

H = {h_{a,b} | 0 ≤ a_i, b ≤ p − 1 for all 0 ≤ i ≤ k − 1}.

H is a family of pairwise-independent hash functions.

3 The Count-Min Sketch

In this section, we introduce the Count-Min Sketch. First, we explain the general idea of the Count-Min Sketch (section 3.1) and then we define it formally (section 3.2). Afterwards, we give an example of how it works (section 3.3).

3.1 Idea

The goal of the Count-Min Sketch is to store an approximation of the vector a sublinear in space. Therefore, it is not possible to store each entry of the original vector. A shorter vector a' = (a'_1, a'_2, ..., a'_w) of size w < n is used instead. Using a hash function h : {1, ..., n} → {1, ..., w} enables us to map the indices of the original vector a to the new vector a'. This means that an update (i, c) will not be executed by adding c to a_i, but by adding c to a'_{h(i)}.

Example 2 We assume that we have a vector a of length 7, more precisely the exemplary vector a = (2, 0, 6, 3, 0, 1, 4). The used hash function is h_1 : {0, ..., 9} → {0, ..., 3}, x ↦ x mod 4. The resulting vector would be a' = (2, 1, 10, 3). Since h_1(3) = h_1(7) = 3, we get an overestimate on the entries 3 and 7 (just as one example). With the hash function h_1, the probability that an arbitrary x maps to 1 (3/10) is higher than the probability that an arbitrary x maps to 3 (1/5).


To improve the previously seen results, we can use hash functions that provide certain guarantees, for example a function out of a family of pairwise independent hash functions as defined in section 2.4. Thus, the probability for an overestimate is smaller than before. But there are still possibilities to improve the results.
Instead of storing just one vector a', we can store a certain number of smaller vectors a'_1, ..., a'_d. Each vector a'_i gets its own randomly chosen hash function h_i out of a family of pairwise independent hash functions. On an update, each vector is updated. To estimate the value of a_i, we compute the minimum of the estimates of all vectors. Intuitively, the minimum of all estimates is a good one, because it is the estimate least inflated by hash collisions (if our updates are always positive). In the following section, we describe the concrete construction.

3.2 Construction

The construction is defined as follows. We need two parameters ε and δ: we want to guarantee that the error of our estimate is within a factor of ε with probability 1 − δ. With the parameters ε and δ, we choose d = ⌈ln(1/δ)⌉ and w = ⌈e/ε⌉. The Count-Min Sketch is represented by a two-dimensional array count[d][w]. Initially, each entry of the sketch is set to zero. Additionally, we choose d hash functions randomly from a family of pairwise independent hash functions. Each hash function is assigned to exactly one row of our sketch. On an update (i, c) the procedure works as follows:

For each j set count[j, hj(i)] to count[j, hj(i)] + c.

Therefore, we can interpret each row of the sketch as its own approximation of the original vector a. For our queries, we can then compute the estimation of our sketch that seems closest to the original value. Figure 1 shows a visualization of the Count-Min Sketch.

Fig. 1 A visualization of the Count-Min Sketch. h_1, ..., h_d represent the chosen hash functions for the corresponding rows of width w. Each hash function maps an index of the original vector to an index in its row.

It seems remarkable that the size of the sketch itself does not depend on the input size; indeed, the size of the sketch is always constant in n. However, this is not the whole truth. To achieve reasonable results, you have to choose your parameters ε and δ depending on your input size. For example, w > n does not make sense, because then your sketch is larger than your input. Furthermore, we have to store the hash functions; usually we can do this using O(log(n)) space. At the latest when analyzing this circumstance, we see that there is a dependency on the input size.


But basically, if you want to achieve better error guarantees, you will have to store larger vectors, and if you want to increase the probability of staying within a certain bound, you will have to store more vectors.

3.3 Example

We want to illustrate the construction and the update procedure with an example. Set n = 8, w = 3 and d = 3. Now, we choose 3 hash functions from the family of pairwise independent hash functions described in Example 1 (with p = 3 and k = 2): h_1(u) = (1u_0 + 2u_1 + 1) mod 3, h_2(u) = (2u_0 + 2u_1) mod 3 and h_3(u) = (2u_0 + u_1 + 1) mod 3. Consider the update (6, 2). At first, we have to compute the base-3 representation of 6: 6 = 0·3^0 + 2·3^1. Thus, we rewrite 6 as the vector (0, 2). Now, we can compute the update position for each row:

• h_1(6) = (1·0 + 2·2 + 1) mod 3 = 2
• h_2(6) = (2·0 + 2·2) mod 3 = 1
• h_3(6) = (2·0 + 1·2 + 1) mod 3 = 0

Consider a second update (5, 7). 5 = 2·3^0 + 1·3^1, so we rewrite 5 as the vector (2, 1).

• h_1(5) = (1·2 + 2·1 + 1) mod 3 = 2
• h_2(5) = (2·2 + 2·1) mod 3 = 0
• h_3(5) = (2·2 + 1·1 + 1) mod 3 = 0

Figure 2 visualizes the sketch after executing both updates. Note that h_1(6) = h_1(5) = 2 and h_3(6) = h_3(5) = 0. Hence, we have an overestimation of the values for indices 5 and 6 with respect to the hash functions h_1 and h_3. If we were interested in point querying for index 5 or 6, we would look at the estimation of every row. Exemplarily, we will do this for index 5. In the first row, we see count[1, h_1(5)] = 9. The estimation of the second row is count[2, h_2(5)] = 7 and the estimation of the third row is count[3, h_3(5)] = 9. We should take the estimation of h_2, because no hash collision happened in that row. Indeed, computing the minimum of all estimations is the procedure we use. In section 4.1 we analyze the error guarantees we can achieve with that procedure.

0" 0""

9"

0" 2" 7"9" 0" 0"

h1

h3

0" 0""

2"

0" 2" 0"2" 0" 0"

h1

h3

h2 h2

Update(6,2)" Update(5,7)"

Fig. 2 This gure visualizes the resulting sketch of the example. The left part visualizes the sketch after the update(6,2). The right parts shows the resulting sketch after both updates (6,2) and (5,7)


4 Query Estimation

In this chapter, we analyze the estimations for the queries described in section 1.3. Section 4.1 deals with estimating a point query, section 4.2 is about the inner product query, and the topic of section 4.3 is range queries. Afterwards, we briefly give an overview of the estimations of φ-quantiles (section 4.4) and heavy hitters (section 4.5). In the whole chapter we make use of the L1-norm of a vector: the L1-norm of a vector a = (a_1, a_2, ..., a_n) is defined as ‖a‖_1 = Σ_{i=1}^n |a_i|.

4.1 Point Query

In the following, we focus on point queries as defined in section 1.3. We analyze how to estimate a point query using a Count-Min Sketch and afterwards prove the error guarantees we can achieve. We start with an estimation for the non-negative case.

4.1.1 Non-Negative Case

By intuition, we have to find the estimation for a_i with the least amount of hash collisions, because every hash collision on index i leads to an overestimation. More formally, we compute our estimate â_i for a_i as follows:

â_i = min_j count[j, h_j(i)]

In other words, we look at the estimation for a_i in every row j of our sketch and subsequently compute the minimum of all estimations. The succeeding theorem states the error guarantees which we can achieve using the described estimation for a point query.

Theorem 4. The estimate â_i offers the following guarantees:

1. a_i ≤ â_i.
2. With probability at least 1 − δ: â_i ≤ a_i + ε‖a‖_1.

Proof of Theorem 4: The idea of the proof is to analyze the expected error for one row of the sketch. Afterwards, we use that knowledge to show that it is unlikely that the error in every row exceeds the introduced bound.

Proof of part 1: a_i ≤ â_i results directly from our update assumption. Since our updates cannot be negative, the correct value a_i is always contained in our estimation. The only thing that can happen is that we overestimate the correct value for a_i due to hash collisions.

Proof of part 2: We define â_{i,j} as count[j, h_j(i)], which is the estimation for a_i in row j. Because of our observations in part 1 we can say that count[j, h_j(i)] = a_i + X_{i,j}; it consists of the correct value a_i and an error term, which depends on the chosen hash function j and our estimation index i.
For analyzing the error term, we need an indicator variable I_{i,j,k} that indicates whether there is a hash collision of the index i and a different index k under the hash function h_j:


I_{i,j,k} = 1 ⇔ h_j(i) = h_j(k) for i ≠ k, and I_{i,j,k} = 0 otherwise.

Because we are interested in the expected error term for a row j, we are interested in the expectation of our indicator variable. As our indicator variable is a binary random variable, the expectation of I_{i,j,k} is the probability that I_{i,j,k} = 1:

E[I_{i,j,k}] = Pr[h_j(i) = h_j(k)] ≤(*) 1/range(h_j) = 1/w ≤ ε/e.

(∗) holds because h_j is chosen from a family of pairwise independent hash functions, which is in particular a family of universal hash functions. Using the introduced indicator variable, we can express our error term X_{i,j} formally:

X_{i,j} = Σ_{k=1}^n I_{i,j,k}·a_k

Now, we want to analyze the error term.

E[X_{i,j}] = E[Σ_{k=1}^n I_{i,j,k}·a_k] =(**) Σ_{k=1}^n a_k·E[I_{i,j,k}] ≤(***) (ε/e)·‖a‖_1

By using linearity of expectation (∗∗), and the definition of the L1-norm together with E[I_{i,j,k}] ≤ ε/e (∗∗∗), we have shown that the expected error term in a certain row j is less than or equal to (ε/e)‖a‖_1.

After analyzing the error term for one row of our sketch, we have to analyze the error of the whole estimation. We can apply Markov's inequality to show that the probability that the estimation of every row is off by more than ε‖a‖_1 is at most δ. As already mentioned, our estimate is off by more than ε‖a‖_1 only if the error term of every row exceeds ε‖a‖_1, because we compute the minimum of all possible estimations. It follows that

Pr[â_i − a_i > ε‖a‖_1] = Pr[∀j : X_{i,j} > ε‖a‖_1].

Since we have chosen our hash functions independently from each other, we can interpret each row (or each smaller vector) as its own random experiment. Therefore, we can simply multiply the probabilities of the rows:

Pr[∀j : X_{i,j} > ε‖a‖_1] = Π_{j=1}^d Pr[X_{i,j} > ε‖a‖_1].

As a last step, we can apply Markov's inequality, because in the non-negative case it holds that X_{i,j} ≥ 0:

Π_{j=1}^d Pr[X_{i,j} > ε‖a‖_1] ≤ Π_{j=1}^d Pr[X_{i,j} ≥ ε‖a‖_1] ≤(Markov) Π_{j=1}^d E[X_{i,j}]/(ε‖a‖_1).

Substituting the expectation of our error term by the bound we have already calculated, we finally prove the theorem:

Π_{j=1}^d E[X_{i,j}]/(ε‖a‖_1) ≤ Π_{j=1}^d ((ε/e)‖a‖_1)/(ε‖a‖_1) = Π_{j=1}^d 1/e = e^{−d} = e^{−⌈ln(1/δ)⌉} ≤ e^{−ln(1/δ)} = δ.

Finally, we have shown that, in the non-negative case, the error of our estimation is at most ε‖a‖_1 with probability at least 1 − δ. ⊓⊔


4.1.2 General Case

After analyzing the non-negative case, we are interested in possibly negative updates, i.e. the analysis of the general turnstile case. For this case, our estimation works as follows:

â_i = median_j count[j, h_j(i)]

For this estimation, we can show that the following theorem is true:

Theorem 5. With probability 1 − δ^{1/4}: a_i − 3ε‖a‖_1 ≤ â_i ≤ a_i + 3ε‖a‖_1.

Proof of Theorem 5: Similar to the proof for the non-negative case, we analyze the expected error term of a certain row j. The only difference to the non-negative case is that our error term X_{i,j} could become negative. Thus, we write our estimation for row j as count[j, h_j(i)] = a_i + X_{i,j}. In the following, we analyze the absolute value of our error term X_{i,j}, because then we can apply Markov's inequality:

E[|X_{i,j}|] = E[|Σ_{k=1}^n I_{i,j,k}·a_k|] ≤ Σ_{k=1}^n |a_k|·E[I_{i,j,k}] ≤ Σ_{k=1}^n |a_k|·(ε/e) = (ε/e)‖a‖_1

So we can use the same bound on the error term as before. Applying Markov's inequality again, we show that with probability at least 7/8 the error term of a row stays within the bounds of Theorem 5:

Pr[|X_{i,j}| > 3ε‖a‖_1] ≤(Markov) E[|X_{i,j}|]/(3ε‖a‖_1) ≤ ((ε/e)‖a‖_1)/(3ε‖a‖_1) = 1/(3e) ≤ 1/8

Consider the random experiment of taking a certain row j and checking whether the error term of that row exceeds our bounds. Due to the independent choice of our hash functions, we can interpret this as a Bernoulli experiment with success probability p = 1/8 (where "success" is the low-probability event that a row exceeds the bounds). The expectation is µ = d·p = d/8. We need at least d/2 of such low-probability events to make the median fall outside our bounds. With that knowledge, we can apply the Chernoff bound and finally prove the theorem. Denote S_j = 1 if the error of row j does not fit our bounds and S_j = 0 otherwise. Then

Pr[Σ_{j=1}^d S_j > (1+3)·d/8] ≤ (e³/(1+3)^{(1+3)})^{d/8} = (e^{3/8}/2)^{ln(1/δ)} = (1/δ)^{3/8}/(1/δ)^{ln(2)} = δ^{ln(2) − 3/8} < δ^{1/4}

Eventually, it holds that

Pr[a_i − 3ε‖a‖_1 ≤ â_i ≤ a_i + 3ε‖a‖_1] > 1 − δ^{1/4}.   ⊓⊔

4.2 Inner Product Query

In this section, we analyze how to estimate an inner product query a ⊙ b for two vectors a and b. We only consider the non-negative (strict turnstile) case. For our estimation, we need two sketches count_a and count_b; the first sketch represents vector a and the second one vector b. Both sketches have to share the same hash functions and the same parameters ε and δ to achieve reasonable estimations. The idea of the estimation is the same as for point queries: for each row of our sketches (or for each chosen hash function), we compute the inner product query using that specific hash function. Afterwards, we compute the minimum of all estimations:


(a ⊙ b)^ = min_j Σ_{k=1}^w count_a[j, k] · count_b[j, k]
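As a minimal sketch (again on top of the hypothetical CountMinSketch class from section 3.2, assuming both sketches were built with identical hash functions), the estimate is a minimum of row-wise dot products:

def inner_product_query(count_a, count_b):
    """min over rows j of sum_k count_a[j][k] * count_b[j][k]."""
    return min(sum(x * y for x, y in zip(row_a, row_b))
               for row_a, row_b in zip(count_a.count, count_b.count))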

The following theorem deals with the error guarantees we can achieve using the estimation proceduredescribed above.

Theorem 6. The estimation (a ⊙ b)^ offers the following guarantees:

1. a ⊙ b ≤ (a ⊙ b)^.
2. With probability 1 − δ: (a ⊙ b)^ ≤ a ⊙ b + ε‖a‖_1‖b‖_1.

Proof of Theorem 6: The idea of the proof is similar to the proof of Theorem 4. First, we analyze the error term of a certain row j and afterwards show that it is unlikely that the error term of every row exceeds our bounds.

Proof of part 1: The first fact again results directly from our update assumption. The correct inner product a ⊙ b is always contained in our solution because our updates cannot be negative.

Proof of part 2: We look at the estimation of a fixed row j. We determine

(a ⊙ b)_j = Σ_{i=1}^n a_i·b_i + Σ_{p≠q, h_j(p)=h_j(q)} a_p·b_q = a ⊙ b + Σ_{p≠q, h_j(p)=h_j(q)} a_p·b_q.

The second summand is again the error term we have to analyze further. In the following we describe it using a random variable X_j. For the further analysis, we rewrite the error term using the indicator variable I_{p,j,q} that we already used for analyzing point queries:

X_j = Σ_{p≠q, h_j(p)=h_j(q)} a_p·b_q = Σ_{(p,q), p≠q} I_{p,j,q}·a_p·b_q

Now, we can proceed by analyzing the expectation of our error term, using the expectation of our indicator variable and applying linearity of expectation:

E[X_j] = E[Σ_{(p,q)} I_{p,j,q}·a_p·b_q] = Σ_{(p,q)} a_p·b_q·E[I_{p,j,q}] ≤ Σ_{(p,q)} a_p·b_q·(ε/e)

To complete our analysis for one row, we need to know how many pairs of indices exist. We can analyze this by looking at the Cartesian product of the index set with itself. In order to see that this is linked to the product of the L1-norms, we rewrite the sum over the Cartesian product as a double sum:

Σ_{(p,q)} a_p·b_q·(ε/e) ≤ (ε/e)·Σ_{p=1}^n Σ_{q=1}^n a_p·b_q = (ε/e)·(Σ_{p=1}^n a_p)·(Σ_{q=1}^n b_q) = (ε/e)‖a‖_1‖b‖_1

With the expected error term of one row, we show that it is unlikely that the error term of every row is greater than ε‖a‖_1‖b‖_1:

Pr[∀j : X_j > ε‖a‖_1‖b‖_1] = Π_{j=1}^d Pr[X_j > ε‖a‖_1‖b‖_1] ≤(Markov) Π_{j=1}^d ((ε/e)‖a‖_1‖b‖_1)/(ε‖a‖_1‖b‖_1) = Π_{j=1}^d 1/e = e^{−d} ≤ δ

Eventually, it holds that

Pr[(a ⊙ b)^ ≤ a ⊙ b + ε‖a‖_1‖b‖_1] > 1 − δ.   ⊓⊔


As mentioned in section 1.3, we can use an inner product query to estimate the join size of two relations. With the previous analysis, we can directly conclude the time and space complexity of this concrete application.

Corollary 1. The join size of two relations on a particular attribute can be approximated up to ε‖a‖_1‖b‖_1 with probability 1 − δ, keeping space O((1/ε)·ln(1/δ)). The time used for the estimate can also be bounded by O((1/ε)·ln(1/δ)).

This follows because the space used by our sketch is d·w. Plugging in our values for d and w, we get ⌈ln(1/δ)⌉·⌈e/ε⌉. Therefore, the statement about space complexity is true. Because we have to look at every sketch entry while computing the estimation, the same statement holds for the time complexity as well.

4.3 Range Query

The following sections deal with the approximation of a range query as defined in section 1.3. We start with two naive approaches for estimating a range query (section 4.3.1) and afterwards introduce a better approach using dyadic ranges (section 4.3.2). For the whole chapter we consider the non-negative turnstile case.

4.3.1 Naive approaches

The most naive approach to estimate a range query for a range [l, r] is simply point querying every index contained in the range and summing up the estimates. It is easy to see that in this case the error term grows linearly with the size of the range; the effective factor would be r − l + 1.
Another approach is to compute a special inner product query. For this version, we have to define a second vector x of size n, with x_i = 1 if l ≤ i ≤ r and x_i = 0 otherwise. Computing the inner product of the vectors a and x exactly yields the range query for the range [l, r]. But again, the error term increases linearly with the size of the range, and we have the additional costs of creating a sketch for the second vector and executing its updates.

4.3.2 Computation with dyadic intervals

The following section is based on the basic usage of dyadic ranges introduced in [3]; the concrete procedure is described in [1]. To ensure that the error term does not increase linearly but only logarithmically with the size of the range, we need the notion of dyadic ranges.

Definition 3 (Dyadic Ranges). For x ∈ N and y ∈ N_0, a dyadic range is an interval of the form

D_{x,y} := [(x − 1)·2^y + 1, x·2^y].

For better understanding and for a proof, we illustrate the dyadic ranges contained in an original interval [1, n] (n a power of two) using a binary tree. Figure 3 shows a visualization of that tree. The following lemma and the related corollary tell us why the usage of dyadic ranges helps to keep the error term sublinear in the size of the range.


[1, n]

[1, n2 ]

[1, n4 ]

...

[1, 1] [2, 2]

...

[n4 + 1, n2 ]

......

[n2 + 1, n]

[n2 + 1, 3n

4 ]

......

[3n4 + 1, n]

......

[n 1, n 1][n, n]

1

Fig. 3 This is a visualization of a dyadic tree. Each node of the tree represents a dyadic interval, which is part of theoriginal interval [1, n]. The root represents the original interval. Each inner node has two children. The inner node isexactly halved by its children. Therefore, the left child of a node represents the rst half of the interval and the rightchild the right half.

Lemma 1. Let [l,r] = r a range. One every height of the tree of dyadic ranges, there are at most 2dyadic ranges in a minimum cardinality dyadic cover C which covers the interval r.

Since the tree of dyadic ranges is a balanced binary tree, we know that the height of that tree islog(n) + 1. This immediately implies the following corollary.

Corollary 2. Each interval of size n can be covered by using at most d2log(n)e non-overlappingdyadic ranges.

We want to use an example to clarify that this seems to be true and afterwards give a formal proof.

Example 3 Figure 3 shows the dyadic tree of the interval [1, 16]. We want to cover the intervali = [2, 12]. The black marked nodes build a dyadic cover of the interval i. However, this cover is nota minimum cardinality dyadic cover of i. Note, that on height log(n)+1 three nodes are marked black.We are able to reduce the amount of black nodes on that height by replacing the nodes representingthe intervals [7, 7] and [8, 8] by their parent node. After doing this, we could notice that on the heightabove are three nodes marked black. But again you can replace two black nodes sharing the sameparent by replacing them with their parent which is marked in grey here. So, if we take the grey nodeto our solution and drop every node of the subtree below, we will have a minimum cardinality dyadiccover.

Page 104: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

96 Jannik Sundermeier

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

1

Fig. 4 This is a visualization of a dyadic tree which represents all dyadic intervals which are part of the originalinterval [1,16]. The black marked nodes build a dyadic cover of the interval [2, 12].

We will proceed with the analysis of Lemma 1.Proof of Lemma 1: Let C a set of non-overlapping dyadic ranges covering an interval i = [l, r].Especially assume that C is not minimal and there is a height of the dyadic tree of i which containsat least 3 non-overlapping dyadic intervals of C. We start by looking at the lowest height h withat least 3 dyadic intervals of C. Let left ∈ C (especially left ⊆ [l, r]) the interval with the lowestindices of h contained in C and analogous right ∈ C (especially right ⊆ [l, r]) with the highestindices of h contained in C. Due to our assumption there has to be an interval d ∈ C (d ⊆ [l, r]) ofh whose indices are between the indices of left and right. Let dsib the sibling of d (this means, thatthey share the same parent node p).

Case 1: dsib = left or dsib = rightIn this case, we can replace d and dsib directly by their parent p, because dsib and d are both nonoverlapping dyadic ranges and d, dsib ⊆ [l, r]. Denote further, that d∪ dsib = p. Therefore, p ⊆ [l, r].By replacing d and dsib with p, we still cover the intervals d and dsib but have reduced the cardinalityof C.

Case 2: dsib 6= left and dsib 6= right.Without loss of generality, we assume that the indices of dsib are smaller than the indices of d (theproof for larger indices works analogous). Determine that all indices of dsib are larger than the in-dices of left but smaller than the indices of d.

Case 2.1 dsib ∈ C.This sub-case is analogous to Case 1. Simply replace d and dsib by their parent. The cardinality ofC decreased but we still cover the original interval.

Case 2.2 dsib /∈ CDue to our assumption the interval dsib has to be covered by intervals of C. A further consequenceis, that the interval dsib has to be covered by children of dsib because the intervals in C are non-overlapping and d ∈ C. Now replace all children c of dsib with c ∈ C, dsib itself and d with theparent of d and dsib p. We still cover the original interval but we have reduced the cardinality of Cby at least 1.

Since we have reduced the cardinality of C in both cases and we have reduced the amount ofdyadic ranges on height h (which are part of C), this implies our Lemma. If we use the described

Page 105: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

The Count-Min-Sketch and its Applications 97

procedure to reduce the amount of dyadic ranges on height h until there are at most 2, we will lookat height h− 1 and execute the same procedure. If we do this for every height of our dyadic tree, wewill have at most 2 dyadic ranges on each height of the tree. Since the height of our tree is log(n)+1,it follows that we can cover our original range [l, r] by taking at most d2log(n)e dyadic ranges. ut

Using the previously gained knowledge, we can explain the idea of estimating range queries usingdyadic ranges. The core idea is to construct a data structure where we can answer range queriesfor dyadic intervals very easily. In that case we can answer our original range query by estimatingthe queries for the dyadic cover of the original interval and sum up the estimates. Now, we do notuse only one sketch, but we use log(n) + 1 sketches. Assume that n is a power of two (or increasen to the next power of two). Each sketch represents one height of the dyadic tree of the interval[1, n]. Denote the particular sketches with count0, · · · countlog(n) with counti describing the sketchfor height i. All sketches share the same hash functions and parameters ε and δ. On update (i, c)execute the following update procedure:

• At each height of the tree look at the interval Dx,y with i ∈ Dx,y

• Use x as key for the hash functions and execute update (x, c) for county

Using this procedure, we can execute point queries for each dyadic range. The rst step to answer arange query for an interval [l,r] is to compute a minimum cardinality dyadic cover C. Our estimationfor the whole range query works as follows: Denote a[l, r] as the exact result of a range query for an

interval [l, r] and a[l, r] the estimation using the procedure above.

a[l, r] :=∑

Dx,y∈Cminjcounty[j, hj(x)]

The subsequent theorem deals with the error guarantees we can achieve using the procedure de-scribed above.

Theorem 7. The estimation a[l, r] oers the following guarantees:

1. a[l, r] ≤ a[l, r]

2. With probability at least 1− δ: a[l, r] ≤ a[l, r] + 2εlog(n)‖a‖1

Proof of Theorem 7: The ideas of this proof are still the same as for point and inner productqueries. At the beginning, we look at the error term of a certain row j in our whole sketch systemand analyze the expected error. Afterwards we use Markov's inequality to show that exceeding thebounds is very unlikely.

Proof of part 1:We still consider the non-negative case. Hence, we cannot have negative updates and the point queryof every dyadic interval always contains the exact solution plus possible error terms based on hashcollisions. Since every point query for a dyadic interval cannot underestimate, the estimation for adyadic cover cannot underestimate.

Proof of part 2:Let us have a look at xed row j of the sketch county querying for a dyadic range Dx,y. We can ndthe sum of all values ai with i ∈ Dx,y and a certain error term Xy,j,x which depends on the chosensketch y, the row j and the queried index x.

Page 106: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

98 Jannik Sundermeier

county[j, hj(x)] =∑

i∈Dx,yai +Xy,j,x

Using the indicator variable Ii,j,k we have already used for point and inner product queries, we canrewrite our error term. The whole dyadic interval Dk,y of every index k with hj(k) = hj(k) is partof the error term if a hash collision occurs.

Xy,j,x =n∑k=1

Ix,j,k∑

i∈Dk,yai

As a result, we are able to analyze the expected error using linearity of expectation and the expec-tation of our indicator variable.

E[Xy,j,x] = E[ n∑k=1

Ix,j,k∑

i∈Dk,yai

]=

n∑k=1

Ix,j,k∑

i∈Dk,yai

≤n∑k=1

ε

e

∑i∈Dk,y

ai(*)=

ε

e

n∑i=1

ai =ε

e‖a‖1

(*) holds, because on every height of the dyadic tree, i is contained in exactly one dyadic intervalDx,y. For simplicity, we assume that we answer all queries using just one hash function j, becausethe following inequality holds:

a[l, r] :=∑

Dx,y∈Cminjcounty[j, hj(x)] ≤ minj

∑Dx,y∈C

county[j, hj(x)]

With this assumption we can bound the error term of our queries related to county using an addi-tional random variable which represents the error term for the whole row j.

Xj =∑

Dx,y∈CXy,k,x

E[Xj ] = E[ ∑Dx,y∈C

Xy,k,x =]

=∑

Dx,y∈CE[Xy,k,x] ≤

∑Dx,y∈C

ε

e‖a‖1 ≤ 2log(n)

ε

e‖a‖1

Finally, we are able to analyze the overall error.

Pr[minjXj > 2log(n)ε‖a‖1] =

d∏i=1

Pr[Xj > 2log(n)ε‖a‖1]Markov≤

d∏i=1

E[Xj ]

2log(n)ε‖a‖1

≤d∏i=1

2log(n) εe‖a‖12log(n)ε‖a‖1

=

d∏i=1

1

e≤ δ

Putting things together, we have nally shown:

Pr[a[l, r] > a[l, r] + 2log(n)ε‖a‖1] < 1− δ. utTo complete this chapter, we briey want to analyze the space complexity of the introduced proce-dure. Since we need log(n)+1 dierent sketches we get O(log(n)( 1

ε )(ln( 1δ )) as a bound for the space

complexity. If we further regard the space we need for our hash functions, we still get a bound ofO(log(n)). All in all, the space complexity still stays sublinear in n which is exactly what we wantedto achieve.

Page 107: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

The Count-Min-Sketch and its Applications 99

4.3.3 Example

We consider the same basic setup as described in Example 3.3. n = 8, d = 3 and w = 3 and we useexactly the same hash functions h1, h2 and h3. First, we build our basic sketch structure. Thereforewe have to compute the dyadic tree of the interval [1, 8]. Figure 4.3.3 shows the set-up of our datastructure.

0& 0& 0&0& 0& 0&0& 0& 0&

count0

0& 0& 0&0& 0& 0&0& 0& 0&

0& 0& 0&0& 0& 0&0& 0& 0&

0& 0& 0&0& 0& 0&0& 0& 0&

count1

count2

count3

[1,8]&

[5,8]&[1,4]&

[1,2]& [3,4]&[5,6]& [7,8]&

[1,1]& [8,8]&…&

Fig. 5 The right part visualizes the dyadic tree for the interval [1, 8]. On the left side, we see the sketch structurefor each height of the dyadic tree. This is the initial set-up, hence all values contained are set to 0.

Determine the update (4,3). 4 = 1 ∗ 30 + 1 ∗ 31. Thus, we can rewrite 4 using the vector (1,1). Forexecuting the update, we have to look at every row of our sketch.

• count0 : 4 ∈ [1, 8] = D1,3 → execute update (1, 3)• count1 : 4 ∈ [1, 4] = D1,2 → execute update (1,2)• count2 : 4 ∈ [3, 4] = D2,1 → execute update (2,3)• count3 : 4 ∈ [4, 4] = D4,0 → execute update (4, 3)

Figure 4.3.3 illustrates the resulting sketch after executing the updates (4, 3), (7, 4) and (1, 3). Wewant to estimate a range query for the interval [1, 5]. Note, that the correct value a[1, 5] = 3+3 = 6.For our estimation we have to compute a minimum cardinality dyadic cover of [1, 5] which is C =[1, 4], [5, 5]. At rst, we execute a point query for the dyadic interval [1, 4]. [1, 4] = D1,2 is part ofthe sketch count1 so we have to point query for the index 1 using the sketch count1. a1 = 6. Thesecond step is point querying for the dyadic interval [5, 5]. [5, 5] = D5,0 is part of the sketch count3so we have to execute a point query for the index 5 using count3. a5 = 3. Summing up the results,

we get a[1, 5] = 6 + 3 = 9 > a[1, 5] = 6. Due to hash collisions, the result of our estimation is largerthan the real value.

Page 108: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

100 Jannik Sundermeier

0& 0& 10&0& 0& 10&10& 0& 0&

count0

4& 0& 6&0& 4& 6&6& 0& 4&

3& 7& 3&7& 3& 3&3& 7& 3&

4& 3& 3&3& 4& 3&3& 3& 4&

count1

count2

count3

[1,8]&

[5,8]&[1,4]&

[1,2]& [3,4]&[5,6]& [7,8]&

[1,1]& [8,8]&…&

Fig. 6 The visualization of the data structure after executing the updates (4, 3), (7, 4) and (1, 3).

4.4 φ-Quantiles

The authors of [1] describe two dierent approaches of computing the φ-quantiles. We will focuson the rst approach and skip the approach which uses random subset sums, because both arriveat a similar runtime bound. We consider the non-negative turnstile case. Basically, computing φ-quantiles of our input stream can be reduced to a series of range queries. Consider the array A =[a[1, 1], a[1, 2], · · · , a[n, n]| containing all possible range queries with 1 as a lower bound. Executing abinary search on the array A, we can compute the smallest index rk with a[1, rk] > kφa[1, n]. Usingthe sketch system of section 4.3, we can reduce this to computing O(log(n)) point queries.Using the described procedure, we can compute the ε-approximate φ-quantiles with probability atleast 1− δ.

4.5 Heavy Hitters

We analyze the problem in both the cash-register case and the turnstile case. In the turnstile case,we assume the non-negative case. In this section, we omit the proofs of our guarantees. At least theideas of the proofs can be looked up using the referenced literature.

4.5.1 Cash-Register Case

For estimating the heavy hitters of our original vector a, we need a second supporting data structure- a Min-Heap. Additionally, we have to keep track of the current value of ‖a‖1. On update (i, c), wehave to increment the value of ‖a‖1 by c - additional to our basic update procedure. After updating,we execute a point query for index i. If the estimation ai is above the threshold of φ‖a‖1 it is addedto our heap. Periodically, we analyze whether to root of our heap (meaning the event with lowestcount of our heap) is still above the threshold. If not, it is deleted from the heap. At the end ofour input stream, we scan every element of our heap and all elements whose estimate is still above

Page 109: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

The Count-Min-Sketch and its Applications 101

φ‖a‖1 are output.Analyzing the error guarantees, we could proof the following theorem:

Theorem 8. Using the procedure described above, Count-Min Sketches oer the following guaran-tees:

1. Every index with count more than φ‖a‖1 is output2. With probability 1− δ, no index with count less than (φ− ε)‖a‖1 is output.

4.5.2 Turnstile Case

If updates are allowed to be negative, we have to get aware of a two-sided error bound. In orderto solve the problem, we make use of the setup we have introduced in section 4.3 for estimatingRange Queries. Hence, we have one sketch for each height of the dyadic tree of our overall range.The update procedure stays the same. In order to nd all heavy hitters, we execute a parallel binarysearch in our dyadic tree. All single items whose approximated frequency is still above the thresholdare output. With a further analysis, we could show the following theorem:

Theorem 9. Using the procedure described above, estimating the heavy hitters using Count-MinSketches oer the following guarantees:

1. Every index with frequency at least (φ+ ε)‖a‖1 is output2. With probability 1− δ no index with frequency less than φ‖a‖1 is output.

5 Summary / Conclusion

In this thesis we have introduced and analyzed the Count-Min Sketch extensive. Initially, we haveseen that we can represent a set of events connected to their frequencies using a vector. Since wecannot store the whole vector, we need a technique to approximate the entries of the original dataset. The basic idea is to map the original vector into a second vector of smaller size. Using hashfunctions from a family of pairwise-independent hash functions, we can map the indices from theoriginal vector to the smaller vector and additionally, the probability of a hash collision is kept small.Storing not just one vector, but a certain amount of smaller vectors leads to the construction of theCount-Min Sketch.Afterwards, we have analyzed the approximation techniques for several query types. For the es-timation of point queries and inner product queries, we have seen that computing the minimumestimation of all rows j seems to be a good estimation (if updates cannot be negative). If negativeupdates are possible, we have proved that for point queries, computing the median oers good guar-antees.Furthermore, we have seen that estimating range queries with naive approaches cannot achieve goodguarantees (or even better guarantees are possible). Therefore, we introduced a sketch structure us-ing the notion of dyadic ranges. Our structure allows us to point query for dyadic ranges quite easily.Since we can cover each interval of size n using at most 2log(n) dyadic ranges, we could achieve anerror term which does not increase linearly to the size of the range, but logarithmically.Finally, we have roughly seen how to estimate heavy hitters and φ-quantiles, two applications whichare very popular in the data stream scenario.

Page 110: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

102 Jannik Sundermeier

References

1. Cormode, Graham and S.Muthukrishnan "An improved data stream summary: the count-min sketch and itsapplications" Journal of Algorithms 55.1 (2005) 58-75.

2. Gibbons, Phillip B and Matias, Yossi. "Synopsis Data Structures for Massive Data Sets" In: " SODA '99 Pro-ceedings of the tenth annual ACM-SIAM symposium on Discrete algorithms", pp. 909-910.

3. Gilbert, Anna C., Kotidis, Yannis, S.Muthukrishnan and Strauss, Martin J.: "How to Summarize the Universe:Dynamic Maintenance of Quantiles" In: Proceeding VLDB '02 Proceedings of the 28th international conferenceon Very Large Data Bases, pp. 454-465.

4. Mitzenmacher, Michael, and Eli Upfal. Probability and computing: Randomized algorithms and probabilisticanalysis. Cambridge University Press, 2005, pp. 324,325.

Page 111: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Palindrome Recognition In The Streaming Model

Jan Bürmann

Abstract Palindromes, a string which reads forwards the same as it does backwards, play a vitalrole in many processes evolving in nucleic acids like DNA. These nucleic acid strands are of greatlength which makes it interesting to try to minimise the space requirement to nd them. We presentan algorithm for the palindromic problem which is to nd all palindromes and briey show two algo-rithms for the longest palindromic substring problem. The main algorithm ApproxSqrt is a O (

√n)-

sliding-window steaming algorithm with an additive error which uses Karp-Rabin-Fingerprints anda compression technique for palindromic sequences. The algorithm archives a space complexity of

O(√

). The results are the rst for the palindrome problem in the streaming model and might be

a starting point for further research and improvement.

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1031.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1051.2 (KR-)Fingerprints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1061.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

2 ApproxSqrt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1082.1 Simple ApproxSqrt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1082.2 Space Ecient ApproxSqrt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1132.3 Variant of ApproxSqrt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

3 Exact and ApproxLog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1203.1 Exact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1203.2 ApproxLog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

1 Introduction

A palindrome is a string which reads forwards the same as it does backwards. Examples for palin-dromes are radar and madam. Associated with palindromes are two problems. The rst is thePalindrome Problem where the aim is to nd all substrings which are palindromes. For example,the String aradarr contains the following palindromes: radar, ara, rr. It is evident that palindromescan also contain palindromes, but they are only interesting if they have a dierent midpoint. Hencethe substring ada would not be of interest, because it has the same midpoint as radar. The secondproblem is the Longest Palindromic Substring Problem where the intention is to nd an arbitrary

Jan BürmannUniversität Paderborn, Warburger Str. 100, 33098 Paderborn, e-mail: [email protected]

103

Page 112: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

104 Jan Bürmann

longest palindrome. For the aforementioned example this is the string radar. In this case radar isunique as the longest palindrome, which does not have to be the case.

The most important area of application is molecular biology. Nucleotide bases are constituentsof nucleic acids like DNA and RNA. Taking the rst letter of their names a nucleic acid strandis a word over the alphabet consisting of those letters. In the case of DNA, this alphabet is Σ =A,G,C, T[5]. Palindromes occur in two forms: either in two opposite strands of the same sectionor as inverted repeats where in two dierent segments of the nucleic acid the same sequence occursin opposite directions[7]. Palindromes are common throughout the human genome[9]. Furthermore,several important processes regarding nucleic acids involve palindromic sequences. Those processeswhere palindromic sequences are involved include gene regulation, DNA replication, initiation ofgene amplication and DNA-protein binding[4, 8]. Figure 1 shows how a palindromic sequence isinvolved in the formation of a stem-loop which is the basis for binding with proteins and the structureof the nucleic acid in 3D[4]. Therefore identifying palindromic sequences is vital for the analysis ofnucleic acids. It is noticeable that the palindromes in language, as explained above, are dierentfrom those in nucleic acids. However it is possible to adjust the algorithms, which are presentedin this work, to identify those palindromes with an adjustment of the ngerprints. Thus this workconsiders palindromes as they appear in language as presented earlier.

Fig. 1 DNA structure with palindrome forming a stem-loop [12]

Solving the problem of nding palindromes is a well-studied area. Algorithms to nd palindromeshave existed for some time. Those algorithms include on-line algorithms to solve the problem inlinear time[10, 1] as well as a logarithmic time parallel algorithm[1]. Nevertheless, it seems thatthere are no results for the streaming model. Several techniques which are used by the presentedalgorithms have their origin in pattern matching algorithms. The most notable technique are Karp-Rabin Fingerprints[6]. A technique to map sequences of symbols to a single number by ensuringwith high probability that two ngerprints are only equal to each other if the sequences were thesame. The technique is introduced and explained below.

This work presents results of research by Berenbrink, Ergün, Mallmann-Trenn and Erfan SadeqiAzer[2]. Inspiration for the images are by Berenbrink et al. as well. Following the introduction weintroduce the terminology (see subsection 1.1) and the ngerprint technique (see subsection 1.2). Inthe main part of the work we present a steaming algorithm ApproxSqrt which solves the palindrome

problem in O(√

)space (see section 2). This section is divided into two parts, one part where

we present a correct working simple version which is not sublinear (see subsection 2.1) and anotherpart where we present a sublinear improved version (see subsection 2.2). Towards the end we present

Page 113: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Palindrome Recognition In The Streaming Model 105

briey two further algorithm for the longest palindromic substring problem which use the ideas andtechniques of ApproxSqrt (see section 3).

1.1 Terminology

In the following work S ∈ Σn will denote the input stream with length |S| = n over the alphabetΣ. For simplicity the alphabet is assumed to consist of positive integers (Σ ⊂ N). One symbolat index i of the stream S will be denoted by S[i] with i ∈ 1, . . . n and a sequence of symbolsas S[i, j] = S[i]S[i + 1] · · ·S[j] with i, j ∈ 1, . . . n and i ≤ j. Palindromes in the stream can beidentied by the midpoint and the length of the palindrome. They will be dened with the followingdenition.

Denition 1. PalindromeA stream S ∈ Σn contains a palindrome of length ` at midpointm ∈

⌊l2

⌋, . . . , n−

⌊l2

⌋if

S[m− i] = S[m+ i] (1)

orS[m− i+ 1] = S[m+ i] (2)

for i ∈ 1, . . . ,⌊l2

⌋. The palindrome is an even palindrome if the length ` is even and it satises

equation (1). The palindrome is an odd palindrome if the length ` is odd and it satises equation(2).

Denition 1 states that a palindrome is odd if and only if the length is odd. For simplicity,the algorithms regarded below will assume that palindromes are even. The input stream can bealtered to make all palindromes of even length by double the symbols, hence apply the algorithm toS[1]S[1]S[2]S[2] · · ·S[n]S[n]. Furthermore, it follows from the denition 1, that the midpoint of anodd palindrome is the single symbol in the middle of the palindrome and the midpoint of an evenpalindrome is the left symbol of the two symbols in the middle of the palindrome (see gure 2 and3).

Fig. 2 Stream containing an even palin-drome. Midpoint m and maximal length`(m) are marked.

Fig. 3 Stream containing anodd palindrome. Midpoint mand maximal length `(m) aremarked.

Denition 2. Maximal Palindrome, Length of Palindrome, `∗-palindromeA palindrome at midpoint m, as dened in denition 1, is a maximal palindrome if its length is

maximal i.e. there is no palindrome with midpoint m which is longer. The maximal palindrome atmidpoint m in the stream up to index i S[1, i] is denoted by P [m, i] and the maximal palindromeat midpoint m in the entire stream S = S[1, n] is denoted by P [m] = P [m,n].

The maximal length `(m, i) of the palindrome at midpoint m in the stream up to index i S[1, i] is

dened as the ceiling of half of the length of the palindrome `(m, i) =⌈length of P [m,i]

2

⌉. The maximal

length in the entire stream is denoted by `(m) = `(m,n). For all integers which are not an index ofthe stream the maximal length is dened as zero: `(z) = 0 ∀z ∈ Z\1, . . . n.

Page 114: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

106 Jan Bürmann

A palindrome P [m] is a `∗-palindrome for l∗ ∈ N if `(m) ≥ `∗.

The algorithms presented below are estimating the length of the palindromes. This estimation ofthe maximal length `(m) of the palindrome P [m] is denoted by ˜(m). Subsequently the term lengthor length of a palindrome P [m] is used to refer to the maximal length `(m) if not stated otherwise.

1.2 (KR-)Fingerprints

As mentioned in the introduction, the algorithms use Karp-Rabin Fingerprints (KR-Fingerprints)or in short just Fingerprints. It is a technique used to compress strings. The idea is to use thesymbols of the string as the coecients of a polynomial and modulo the polynomial. Therefore a bigenough prime number and an integer smaller than this prime number are chosen and the integer isused as the indeterminate and the prime number is used to modulo the polynomial. The techniquewas dened by Karp and Rabin[6]. It is used in several pattern matching algorithms [6, 11, 3]. Thedenition in this work is similar to the denition of Breslauer and Galil[3].

Denition 3. FingerprintsFor an arbitrary prime number p ∈ [n4, n5] and an integer r ∈ 1, . . . p which is chosen randomly

and a string S′ the forward ngerprint φFr,p(S′) and the backward ngerprint φRr,p(S

′) are denedas:

φFr,p(S′) =

|S′|∑i=1

S′[i] · ri mod p (3)

and

φRr,p(S′) =

|S′|∑i=1

S′[i] · rl−i+1

mod p (4)

The prime number and the integer needs to be xed for one execution of an algorithm to usetheir features. Therefore we dene, for the ease of notation, the following.

Denition 4. Fingerprint NotationLet p, r, φFr,p(·) and φRr,p(·) be dened as in denition 3.

φr,p(·) refers to either the forward or the backwards ngerprint. If p and r are xed, φF (·) andφR(·) are written to refer to the forward and backward ngerprint respectively.

In respect of the stream S and two indices i, j with 1 ≤ i ≤ j ≤ n the forward and backwardngerprint of a substring of the stream are dened as:

FF (i, j) = φF (S[i, j]) (5)

andFR(i, j) = φR(S[i, j]) (6)

Besides as a method to compress strings, ngerprints are also useful because of the possibility tocompare them. Mapping a string to its ngerprint makes it possible to decide if two strings are thesame by just comparing two numbers instead of the two strings. Furthermore, the forward ngerprintof one string and the backward ngerprint of another string equal each other if one string is thereverse of the other one. This makes it very useful for palindromes which, if divided in the centre,have one string which is the reverse of the other string.

The algorithms which are presented below are streaming algorithms. Hence symbols are read oneafter the other and, due to memory restrictions, it is not possible to store all of them or read asymbol again which is no longer in the memory. However ngerprints have further properties which

Page 115: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Palindrome Recognition In The Streaming Model 107

are very useful under those premises. If for a string with two substrings the ngerprints and length oftwo of those strings are known it is possible to calculate the ngerprint of the remaining ngerprint.The following lemma and corollary[3] state these properties formally:

Lemma 1. One can compute the ngerprint of the concatenated strings u and v as

φr,p(uv) = φr,p(u) + rkφr,p(v) mod p uv = u1u2 · · ·ukv1v2 · · · vl (7)

Corollary 1. To extract the ngerprints of u or v from the ngerprint of uv:

φr,p(v) = r−k(φr,p(uv)− φr,p(u)) mod p (8)

φr,p(u) = φr,p(uv)− rkφr,p(v) mod p (9)

Considering the streaming model we use we can express those statements in relation to thengerprints of the stream. The following lemma is therefore a reformulation of the statements abovefor the forward ngerprints in relation to the stream.

Lemma 2. Consider two substrings S[i, k] and S[k+1, j] and their concatenated string S[i, j] where1 ≤ i ≤ k ≤ j ≤ n.

FF (i, j) = (FF (i, k) + rk−i+1 · FF (k + 1, j)) mod p (10)

FF (k + 1, j) = r−(k−i+1)(FF (i, j)− FF (i, k)) mod p (11)

FF (i, k) = (FF (i, j)− rk−i+1 · FF (k + 1, j)) mod p (12)

1.2.1 Failure Probability

By the denition of the ngerprints (see denition 3) they are elements of the nite eld of integersmodulo p denoted by Fp. It is possible that two ngerprints are mapped to the same element.However it was shown that the probability of this is very small[3]. The next lemma states theprobability for our choice of p.

Lemma 3. For two arbitrary string s and s′ with s 6= s′ the probability that φ(s) = φ(s′) is smallerthan 1

n4 .

Since it is still possible that ngerprints fail we have that the statements about the algorithmspresented below are only valid with high probability (w.h.p.). We say that something is valid w.h.p.if the probability is at least 1− 1

n .

1.3 Results

As mentioned in the introduction we present one algorithm in detail and four algorithms briey.The aim of all of them is to optimise the space which is used. For all algorithms we assume to havea limited amount of space for the algorithm to work with, but the output space is not limited andalso not considered. The results of the presented algorithms are summarised in the following fourtheorems.

The rst algorithm ApproxSqrt nds all palindromes and estimates their length with a certainprecision. The problem the algorithm solves is the palindrome problem.

Theorem 1. For any ε ∈[

1√n, 1]Algorithm ApproxSqrt(S, ε) reports for every palindrome P [m]

in S its midpoint m as well as an estimate ˜(m) (of `(m)) such that w.h.p.

Page 116: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

108 Jan Bürmann

`(m)− ε√n < ˜(m) ≤ `(m).

The algorithm makes one pass over S, uses O(nε

)time, and O

(√nε

)space.

The second algorithm nds all longest palindromes. This means it solves the longest palindromicsubstring problem. It is a two pass algorithm which uses among other things ApproxSqrt.

Theorem 2. Algorithm Exact reports w.h.p. `max and m for all palindromes P [m] with a length of`max. The algorithm makes two passes over S, uses O (n) time, and O (

√n) space.

The third algorithm nds one longest palindrome which does not have to be the longest. Henceit might solve the longest palindromic substring problem. Its limits come from the limitation tologarithmic space and a multiplicative error.

Theorem 3. For any ε ∈ (0, 1], Algorithm ApproxLog reports w.h.p. an arbitrary palindrome

P [m] of length at least `max+ε . The algorithm makes one pass over S, uses O

(n log(n)ε log(1+ε)

)time, and

O(

log(n)ε log(1+ε)

)space.

The algorithm ApproxSqrt can be modied to be used for certain cases. One variant is to only con-sider palindromes which are shorter than

√n. In comparison to ApproxSqrt the variant represented

by the following theorem would calculate the `max precisely.

Theorem 4. For `max <√n, there is an algorithm which reports w.h.p. `max and a P [m] s.t.

`(m) = `max. The algorithm makes one pass over S, uses O (n) time, and O (√n) space.

2 ApproxSqrt

In this section we are introducing a single pass algorithm which reports the midpoints of all palin-dromes and their estimated length. We will also introduce the basic principals which are used by allthree algorithms as well as statements about the structure of palindromes and compressibility.

In order to introduce the principles and the general procedure of the algorithm, we will rstpresent a simple version of the algorithm (see subsection 2.1) which works as correct as the actualalgorithm, but does not hold the time and space restrictions. Most of the procedure is the samewhich makes it possible to part the basic procedure and the compression method (see subsection2.2.1 and 2.2.2) by which it is possible to archive sublinear space (see subsection 2.2.3 and 2.2.4).

2.1 Simple ApproxSqrt

In this subsection we will introduce the one-pass algorithm Simple ApproxSqrt which is a simpleversion of the actual algorithm ApproxSqrt. The algorithm reports all palindromes and their approx-imated lengths, but does not hold the goal of sublinearity in the worst case.

2.1.1 Procedure of Simple ApproxSqrt

The algorithm is a streaming algorithm and therefore processes the symbols of S one by one. Beside

the stream S the algorithm has a second parameter ε ∈[

1√n, 1]. This parameter controls the

precision of the estimation of the length of the palindromes. Throughout the description i will be

Page 117: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Palindrome Recognition In The Streaming Model 109

the current index the algorithm reads in the i-th step. At that point the algorithm has alreadyprocessed S[1, i− 1]. Simple ApproxSqrt works with a sliding window of the size 2

√n (see gure 4).

The index i is the last symbol in the sliding window. Hence the algorithm keeps the 2√n symbols

of S[i − 2√n, i] in the memory. Due to the denition of the sliding window, the window can be

divided into two halves. The midpoint of the window denoted by m is the index of the last symbolof the rst half (i−√n). This is in accordance with the way we dene the midpoint of a palindrome(see denition 1). A consequence of the sliding window is the division of palindromes which are

Fig. 4 Illustration of the sliding window and the Fingerprint Pairs in the case of |S| = 25,√n = 5, ε = 1

2. The

sliding window indicated by the square with the dashed line. i is the current index, m is the midpoint of the windowand the braces indicate the ngerprint pairs of the window.

shorter than√n and palindromes which are longer than or equal to

√n. Subsequently we refer

to palindromes which are shorter than√n as short palindromes and palindromes which are longer

than or equal to√n as long palindromes. Short palindromes could be easily identied, because

at some point they are fully contained in the window. However, in order to archive a better timebound the length of those palindromes is only approximated. Identifying long palindromes, especiallyapproximating their length, is more complicated. We start with the identication of midpoints, thelength estimation of short palindromes and the detection of long palindromes. For this purposethe algorithms stores Fingerprint Pairs within the sliding window. Each of those pairs has onehalf ending in the middle of the window and one half starting in the middle of the window asboth have a size which is a multiple of bε√nc. Additionally there is one Fingerprint Pair of thesize√n. More formally: for r = maxr′|r′ ∈ N, r′ · bε√nc < √n the ngerprints have the sizes

1 · bε√nc , 2 · bε√nc , . . . , r · bε√nc ,√n. As the name suggests, a Fingerprint Pair consists of twongerprints, one for the rst half and one for the second half. Those ngerprints are used to comparethe rst half to the second half and if for a ngerprint j ∈ 1 · bε√nc , 2 · bε√nc , . . . , r · bε√nc ,√nthe ngerprint pairs match (FR(m − j + 1,m) = FF (m + 1,m + j)) the algorithm has found apalindrome P [m] of the size at least j. If a Fingerprint Pair j matches and another FingerprintPair j′ with j < j′ it is evident that the palindrome P [m] is shorter than

√n and the algorithm can

output m and ˜(m). The alternative case is that the Fingerprint Pair of size√n, and therefore all

Fingerprint Pairs, match. In this case the palindrome P [m] is at least of the size√n and therefore

a long palindrome as specied earlier.As mentioned earlier it is more complicated to calculate the length of long palindromes. This

is caused by the use of the sliding window procedure. The symbols which need to be comparedare outside the window and therefore not any more in memory or not yet processed. To solvethis, the algorithm stores these midpoints and uses equidistant checkpoints at which the lengthestimation is updated. As mentioned earlier, if the algorithm detects a midpoint m of a long palin-drome P [m] it stores the necessary information of the position m, the length estimation `(m, i),and the ngerprints up to the midpoint FR(1,m) and FF (1,m). We call the storage unit RS entry(RS(m, i) = (m, ˜(m, i), FR(1,m), FF (1,m))) and the algorithms stores all of those in a list calledLi. The aforementioned checkpoints are distributed throughout the stream with a distance of bε√nc.The rst checkpoint is a theoretically existing checkpoint because the checkpoint is at position 0 andtherefore is before the beginning of the stream. Every further checkpoint is then at an index which

is bε√nc after the preceding checkpoint. Formally this means that for k ∈ N with 0 ≤ k ≤⌊√

⌋the

checkpoints are at the indices k ·bε√nc. A checkpoint is created at the current index i whenever i is a

Page 118: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

110 Jan Bürmann

multiple of bε√nc. At that point the checkpoint and the ngerprint up to that checkpoint (FR(m, i))are stored in a list called CLi. We use the term creating a checkpoint in the following to refer to thealgorithm storing the data belonging to the checkpoint. Those checkpoints enable the algorithm tocheck in regular distances if the estimation of the length can be updated. For every midpoint m ∈ i

the algorithm checks if the distance between the current index and the midpoint i−m is the same asthe distance of a checkpoint c ∈ CLi to the midpoint m− c (i−m = m− c). If this is the case for amidpoint and a checkpoint, the algorithm compares the reverse of the string from the checkpoint tothe midpoint to the string from the midpoint to the current index S[c+ 1,m]R = S[m+ 1, i]. If thetwo strings equal the length estimation of the palindrome P [m] can be updated to the distance ofthe current index and the midpoint ˜(m, i) = i−m. We refer to this as the midpoint was successfullychecked against the checkpoint. This is done by ngerprints as well. Therefore the algorithm storesa further two ngerprints called Master Fingerprints. Those two are the ngerprints of the wholestream to the current index FF (1, i) and FR(1, i). The ngerprints of the two strings which are to becompared can be calculated by the means of the equations of corollary 1. The backwards ngerprintFR(c+1, i) of the string S[c+1,m] can be calculated from the ngerprint to the midpoint FR(1,m)and the checkpoint ngerprint FR(1, c) and forward ngerprint FR(m+1, i) of the string S[m+1, i]can be calculated from the master ngerprint FF (1, i) and the ngerprint to the midpoint FF (1,m)(see gure 5).

Fig. 5 Illustration of the calculation of the ngerprints used to compare the substring between a checkpoint c1 andthe midpoint (mj) and the substring between the midpoint and the current index (i). The ngerprint FR(mj + 1, i)can be calculated from the ngerprint of the palindrome FR(1,mj) and the checkpoint ngerprint FR(1, c1). Thengerprint FF (mj +1, i) can be calculated from the ngerprint of the master ngerprint FF (1, i) and the ngerprintof the palindrome FF (1,mj).

2.1.2 Pseudo Code

Now that all the necessary steps of the algorithm are explained we can conclude with the memoryinvariants and a summary of the steps of the algorithm. As mentioned in the previous subsectionthe algorithm maintains two lists: the list of long palindromes Li and the list of checkpoints CLi.Both are indexed with the current index to make it easier to reference to a list of a particular step.

Memory Invariants

After the algorithm has processed S[1, i− 1] of the stream and before the algorithm reads S[i] andperforms the operations of one step the following data is stored.

1. The symbols of the sliding window S[i− 1− 2√n, i− 1]

2. The Master Fingerprints FF (1, i− 1) and FR(1, i− i)3. The Fingerprint Pairs: For every j ∈ 1·bε√nc , 2·bε√nc , . . . , r·bε√nc ,√n with r = maxr′|r′ ∈

N, r′ · bε√nc < √n the ngerprints FR(i−√n− j, i−√n− 1) and FF (i−√n, i−√n+ j − 1)

Page 119: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Palindrome Recognition In The Streaming Model 111

4. A list of checkpoint ngerprints: CLi−1 =

[FR(1, c1), FR(1, c2), . . . , FR

(1, c⌊

i−1

bε√nc

⌋)]

where

ck = k · bε√nc with k ∈ N and 0 ≤ k ≤⌊√

⌋. Those are all ngerprints of prexes of the already

seen checkpoints from the stream S[1, i− 1].5. A list of the RS entries of all detected

√n-palindromes whose midpoints are in S[1, i− 1−√n].

• The jth entry is of the form: RS(mj , i− 1) = (mj , ˜(mj , i− 1), FR(1,mj), FF (1,mj))

mj is the midpoint and ˜(mj , i) is the current length estimation of `(mj , i) up to index i−1.

Maintenance

The algorithm maintains the memory invariants by performing the following steps of iteration i asexplained in the previous subsection. The list of palindromes Li−1 and the list of checkpoints CLi−1

implicitly become Li and CLi respectively.

1. Read the ith symbol S[i], set the midpoint of the window m = i − √n and update the slidingwindow to S[m−√n, i] = S[i− 2

√n, i].

2. Update the Master Fingerprints to FF (1, i) and FR(1, i).3. Check if i is a checkpoint and therefore a multiple of bε√nc. In that case add the checkpoint

ngerprint FR(1, i) to CLi.4. Update and compare the Fingerprint Pairs:

a. For j ∈ 1 · bε√nc , 2 · bε√nc , . . . , r · bε√nc ,√ni. Update FR(m− j,m− 1) to FR(m− j + 1,m) and FF (m,m+ j − 1) to FF (m+ 1,m+ j).ii. If FR(m− j + 1,m) = FF (m+ 1,m+ j):iii. Set ˜(m, i) = j.

b. If ˜(m, i) <√n: Output m and ˜(m, i).

5. If ˜(m, i) ≥ √n• Create an RS(m, i) = (m, ˜(m, i), FR(1,m), FF (1,m)) entry for the palindrome.• Add the entry to the list of palindromes Li. RS(m, i)

6. Check for all checkpoints and palindromes if length estimates can be updated

• For ck with 0 ≤ k ≤⌊

i

bε√nc

⌋and RS(mj , i) ∈ Li with 1 ≤ j ≤ |Li|.

a. If i−mj = mj − ck and FR(c+ 1, i) = FR(m+ 1, i)

i. Update RS(mj , i) by setting ˜(mj , i) = i−mj .

7. If i = n: Output Li = Ln.

2.1.3 Correctness SimpleApproxSqrt

SimpleApproxSqrt does not hold the time and space bounds stated by Theorem 1. Nevertheless, thealgorithm works correct in nding all palindromes and estimating the lengths of the palindromes.Therefore we are going to prove the correctness, stated by the following lemma 4, which also givesus the correctness of the space ecient version presented below (see section 2.2).

Lemma 4. For any ε ∈[

1√n, 1]Algorithm ApproxSqrt(S, ε) reports for every palindrome P [m] in

S its midpoint m as well as an estimate ˜(m) (of `(m)) such that w.h.p.

Page 120: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

112 Jan Bürmann

`(m)− ε√n < ˜(m) ≤ `(m).

Proof. We x an arbitrary palindrome P [m].At rst we assume `(m) <

√n. In that case the whole palindrome is in iteration i = m+

√n inside

the sliding window with its midpoint at the midpoint of the window. The algorithm checks in step4.a.ii. for all j ∈ 1 · bε√nc , 2 · bε√nc , . . . , r · bε√nc ,√n if FR(m− j + 1,m) = FF (m+ 1,m+ j)and sets if applicable ˜(m, i) = j. Let us dene jmax as the maximum j where FR(m− j + 1,m) =FF (m+1,m+j) checks successfully. As a consequence the ngerprint pair of size jmax+bε√nc or√ndoes not check successfully. This means P [m] covers the ngerprint of size jmax, but must be shorterthan jmax+bε√nc. The algorithm has set ˜(m, i) = jmax, hence we have `(m)−ε√n < ˜(m) ≤ `(m).

Now let us assume `(m) ≥ √n. For this case we have as well that the midpoint of the palin-drome is in iteration i = m +

√n at the midpoint of the sliding window. Then the ngerprint

pairs of step 4. of the algorithm are all successful and the algorithm detects that `(m) ≥ √n. Asa consequence the algorithm does not output the palindrome but adds an RS(m, i) entry to Li instep 5.. We prove the boundaries of the length estimation by showing that for every i > m +

√n

the estimation up to that iteration `(m, i)− ε√n < ˜(m, i) ≤ `(m, i) holds. Let i′ be the last itera-tion where the algorithm updated the length estimate. We show the inequalities one after each other.

`(m, i)− ε√n < ˜(m, i)

To prove this, we show `(m, i) < i′ + ε√n−m. To do this we have to dierentiate between two

cases. The rst case is that i is closer to i′ than the distance between two checkpoints and the secondcase is that i is more than the distance between checkpoints after i′.

Case 1: `(m, i) > i′ + ε√n−m.

This means there is no checkpoint in between i′ and i. Since i is the current index we haveby denition `(m, i) < i − m. Therefore we have i′ + ε

√n > i ≥ l(m, i) + m, hence we have

`(m, i) > i′ + ε√n−m.

Case 2: `(m, i) ≥ i′ + ε√n−m.

This means there is at least one checkpoint in between i′ and i. The checkpoint for which thepalindrome was checked successfully must be at m − (i′ − m) = 2m − i′, because that was thelast time the algorithm updated the length estimation. As we stated before, the distance between iand i′ is greater than the checkpoint distance, which means that the algorithm created in step 3. acheckpoint at 2m− i′ − bε√nc. We dened i′ to be the last index where the length estimation wasupdated therefore the reverse of S[m− (i′ + bε√nc)] can not be the same as S[m+ 1, i′ + bε√nc],because otherwise the algorithm would have updated in step 6. and i′ would not have been the lasttime the length estimation was updated. Therefore `(m, i) < i′ + bε√nc −m ≤ ε√n.

With case 1 and case 2 and again the fact that the last update was at i′ we have:

`(m, i) < i′ +⌊ε√n⌋−m = `(m, i′) + ε

√n = ˜(m, i′) + ε

√n = ˜(m, i) + ε

√n

˜(m, i) ≤ `(m, i).Now we show the second inequality ˜(m, i) ≤ `(m, i). Every time the algorithm updated the

length estimation ˜(m, i) in 6.a.i. to `(m, i′) it means that the ngerprints in 6.a. equal each otherFR(m+1, i′) = FR(2m−i′,m). This means, under the assumption that ngerprints do not fail, thatthe reverse of S[m+1, i′] is the same as S[2m−i′,m]. Therefore we have the reverse of S[m+1, ˜(m, i′)]is the same as S[m− ˜(m, i′),m]. Hence we have ˜(m, i) = ˜(m, i′) = `(m, i′) ≥ `(m, i) ut

2.1.4 Limitations of Simple ApproxSqrt

As mentioned above the simple version of SimpleApproxSqrt does not have to be sublinear in space.To see this, let us consider the stream S = an with a ∈ Σ. Every index in the interval S[

√n, n−√n]

is the midpoint of a√n-palindrome and all those palindromes are stored in the list of palindromes.

Page 121: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

Palindrome Recognition In The Streaming Model 113

This is obviously linear in space which means that SimpleApproxSqrt requires linear space in theworst-case.

2.2 Space Ecient ApproxSqrt

In this section we introduce a modication of Simple ApproxSqrt to archive sublinear space require-ments in the worst case. The key idea is that if there exists a sequence of palindromes similar tothe example in subsection 2.1.4 (see gure 6 for an additional example), that they form a structurewhich can be exploited to compress the list of palindromes. We begin with the structure and theproves for it (see subsection 2.2.1). Then we present the new entries for the list which uses thestructure (see subsection 2.2.2). Afterwards we consider the algorithm Simple ApproxSqrt and thechanges to the algorithm Simple ApproxSqrt (see subsection 2.2.3). Finally we prove the correctnessof the compression and the algorithm by proving theorem 1 (see subsection 2.2.4).

2.2.1 Palindrome Structure

This list contains RS entries. In those entries we have the index of the midpoint, the length estimateand the ngerprints up to the midpoint (RS(m, i) = (m, ˜(m, i), FR(1,m), FF (1,m))). Therefore tocompress multiple RS entries we need relations between the midpoints and the length estimates. Wesee below that midpoints are equidistant and we can formulate a formula for the relation betweenthe length estimates.

To start at the beginning, the rst thing we notice about sequences of palindromes is that thepalindromes have to overlap (see gure 6). Therefore we dene below a run as a sequence of over-lapping palindromes.

Denition 5. `∗-runLet `∗ be an arbitrary integer and h ≥ 3. Let m1,m2, . . . ,mh be consecutive midpoints of `∗-palindromes (see denition 1) in S. The midpoints m1, . . . ,mh form an `∗-run, if mj+1 −mj ≤ `∗

2for all j ∈ 1, . . . , h− 1.

Such a run is maximal if it can not be extended by palindromes in the beginning or the end. Thishappens either if there is no close enough midpoint of a palindrome in front of the rst midpoint orafter the last midpoint of the run or if the run reaches the beginning or the end of the stream.

Denition 6. Maximal `∗-runAn `∗-run over the midpoints m1, . . . ,mh is maximal if it satises both of the following conditions:

i) `(m1 − (m2 −m1)) < `∗ (13)

andii) `(mh + (m2 −m1)) < `∗ (14)

Fig. 6 Illustration of the structure of overlapping palindromes. The three palindromes are 5-palindromes (`(mj) ≥ 5

with j ∈ 1, 2, 3). The symbols between the rst and the second midpoint form a word w which is repeated forwardsand backwards within the structure.

If we have a `∗-run we can prove that the distance between the midpoints are actually equallyspaced. Furthermore, if we dene the symbols in between the rst midpoint and the second midpoint

Page 122: PG DisDaS Seminar - Heinz Nixdorf Institut · Seminar of the Project Group DisDaS Distributed Data Streams WS 2015/16 Alexander Mäcker, Manuel Malatyali, Sören Riechers and riedhelmF

114 Jan Bürmann

as a word w we can prove as well that within the run this word and its reverse wR are repeatedalternating at least from the rst to the last midpoint (see gure 6). We prove this structure inlemma 5 and corollary 2 below. For the proof we need a denition of periodicity.

Denition 7. PeriodA string S′ is said to have period p if it consists of repetitions of a block of p symbols. Formally, S′

has period p if S′[j] = S′[j + p] for all j = 1, . . . , |S′| − p.

We have to notice that this denition includes to call a block of symbols a period even if p > |S′|2 ,

which means the symbols might not even be repeated a second time before the end of the string.At rst we prove the structure of runs for the special case that the run is at most of length `∗.

Lemma 5. Let m1 < m2 < m3 < · · · < mh be indices in S that are consecutive midpoints of`∗-palindromes for an arbitrary natural number `∗. If mh −m1 ≤ `∗, then

a)m1,m2,m3, · · · ,mh are equally spaced in S, i.e. |m2 −m1| = |mk+1 −mk|∀k ∈ 1, . . . , h− 1(15)

and

b)S[m1 + 1,mh] =

(wwR)

h−12 h is odd

(wwR)h−12 w h is even

, where w = S[m1 + 1,m2] (16)

Proof. Let m1,m2,m3, · · · ,mh be consecutive midpoints of `∗-palindromes and `∗ an arbitrary butxed integer. We will proof the claim by proving a stronger claim. For that we dene the string orword w similar to the statement as the sequence of symbols between m1 and m2: w := S[m1 +1,m2].We do an induction over the midpoints m1, . . . ,mj and claim that the following holds:

a′)m1,m2,m3, · · · ,mj are equally spaced and b′)S[m1 + 1,m2 + `∗] is a prex of wwRwwR

Base Case j = 2:For j = 2 it holds trivially that the midpoints are equally spaced.

We know that the whole run has at most the length `∗ therefore we have that `∗ ≥ mh −m1 ≥ m2 − m1 = |w|. This and the fact that m1 is premised to be a palindrome implies thatS[m1 − |w| + 1,m1] = wR. Equally, we have that m2 is the midpoint of an `∗ palindrome andtherefore `(m2) ≥ `∗ ≥ |w|. Hence we have S[m1 +1,m2 + |w|] = wwR. Continue with this argumentthat m1 and m2 are `-palindromes this can be extended to the interval S[m + 1,m2 + `] being aprex of wwR · · ·wwR.Induction Step j − 1→ j:We assume that a') and b') hold up to midpoint mj−1 which is followed by a midpoint mj which isalso part of the run.

As a first step we show that the distance |mj − m1| between the first midpoint and the midpoint mj is a multiple of the distance |m2 − m1| between the first and the second midpoint. We prove this by contradiction. Therefore we assume mj = m1 + |w| · q + r with |m2 − m1| = |w|, q ∈ Z, q ≥ 0 and r ∈ {1, . . . , |w| − 1}. We know that mj is a midpoint of the run, which means its closeness to mj−1, expressed as mj − mj−1 ≤ ℓ∗ ⇔ mj ≤ mj−1 + ℓ∗, lets us conclude that mj is contained in the interval S[m1 + 1, mj−1 + ℓ∗]. Using that and the induction hypothesis, we conclude that mj − r is the start of either w or wR, because the addition of r makes mj not a multiple of |m2 − m1|. Since mj is the midpoint of a palindrome, wwR or wRw would contain a palindrome of length 2r, which would give it a period of 2r. However, if we consider the definition of w = S[m1 + 1, m2], r < |w| − 1 and the consecutiveness of the midpoints in the run, we know that S[m1 + 1, m2] does not contain a palindrome and |w| = |m2 − m1|. Therefore |wwR| and |wRw| must be greater than 2r, which means wwR or wRw cannot have a period of 2r; this is a contradiction to our assumption.

We know now that mj = m1 + |w| · q. Since mj is the midpoint of an ℓ∗-palindrome, we can conclude from the induction hypothesis that b′) holds. If we consider the structure of the interval


S[mj−1 + |w| − ℓ∗, mj−1 + |w| + ℓ∗], we infer that mj−1 + |w| is the midpoint of a palindrome, which can only be mj. This proves a′) and concludes the induction step. ⊓⊔

Using the proof of this special case, we can now prove that every run obeys this structure.

Corollary 2. If m1, m2, m3, · · · , mh form an ℓ∗-run for an arbitrary natural number ℓ∗, then

a) m1, m2, m3, · · · , mh are equally spaced in S, i.e. |m2 − m1| = |mk+1 − mk| for all k ∈ {1, . . . , h − 1}, (17)

and

b) S[m1 + 1, mh] = (wwR)^((h−1)/2) if h is odd, and (wwR)^((h−2)/2) w if h is even, where w = S[m1 + 1, m2]. (18)

Proof. Similar to Lemma 5 we will prove this by induction. Let m1, m2, m3, · · · , mh form an ℓ∗-run with ℓ∗ ∈ N. If mh − m1 ≤ ℓ∗, the statement follows directly from Lemma 5. Therefore we assume mh − m1 > ℓ∗.

Base case:
Because of the assumption there must exist a j0 with mj0 − m1 ≤ ℓ∗ which has the highest index among the midpoints whose distance to m1 is at most ℓ∗. The definition of an ℓ∗-run gives us that j0 ≥ 3. The midpoints m1, . . . , mj0 satisfy the requirements of Lemma 5. Hence they are equally spaced and S[m1 + 1, mj0] is a prefix of wwR · · · .

Induction step:
We assume that the statement holds up to a midpoint mj and consider the midpoints mj−1, mj and mj+1. By the definition of an ℓ∗-run we have mj+1 − mj ≤ ℓ∗/2. Therefore we have mj+1 − mj−1 = mj+1 − mj + mj − mj−1 ≤ ℓ∗/2 + ℓ∗/2 = ℓ∗. This means the midpoints mj−1, mj and mj+1 satisfy the requirements of Lemma 5 and the structural properties hold for them. It remains to mention that the overlap between the structure of the midpoints up to mj and the three midpoints mj−1, mj and mj+1 means that the properties hold up to midpoint mj+1. ⊓⊔

With Corollary 2 we can calculate the indices of all midpoints of a run if we have the midpoint of the first palindrome and the number of midpoints. Further, the structure makes it possible to calculate the fingerprints of all the midpoints. It remains to find a way to calculate the lengths or length estimates within a run. The following lemma supplies us with a formula to do exactly that.

Lemma 6. At iteration i, let m1, m2, m3, · · · , mh be the midpoints of a maximal ℓ∗-run in S[1, i] for an arbitrary natural number ℓ∗. For any midpoint mj, we have:

ℓ(mj, i) = ℓ(m1, i) + (j − 1) · (m2 − m1)  if j < (h + 1)/2,
ℓ(mj, i) = ℓ(mh, i) + (h − j) · (m2 − m1)  if j > (h + 1)/2. (19)

Proof. We will prove the statement only for the case j < (h + 1)/2, because the other case is similar. The statement of Corollary 2 provides us with the implications that the structure between the first and the last midpoint, S[m1 + 1, mh], is a prefix of wwR · · · with w = S[m1 + 1, m2], and that the midpoints are equally spaced. Therefore we know that mj = m1 + (j − 1)|w|. We define the index m0 before m1 to be at the index m1 − |w|. This is close enough to m1, since ℓ(m1, i) ≥ ℓ∗ ≥ |w|, to infer that the symbols between m0 and m1 are wR (S[m0 + 1, m1] = wR). Now we consider a further midpoint, namely m2j. This midpoint is at most mh since 2j < 2 · (h + 1)/2 = h + 1 ⇔ 2j ≤ h. Again, with the structure of Corollary 2 and the fact that mj is the midpoint of a palindrome, we know that at least the reverse of S[m0 + 1, mj], i.e. S[m0 + 1, mj]^R, equals S[mj + 1, m2j]. Therefore the length of the palindrome ℓ(mj, i) is at least j · |w|.


We define k to be the amount by which the palindrome is actually longer than that, i.e. k := ℓ(mj, i) − j · |w|. This k must be the length of the longest suffix of S[1, m0] which is equal to the reverse of the longest prefix of S[m2j + 1, n]. Once more, Corollary 2 can be used to deduce that S[m2j + 1, m2j + ℓ∗] must equal S[m0 + 1, m0 + ℓ∗]. Hence k is the length of the longest suffix of S[1, m0] which is the reverse of a prefix of S[m0 + 1, m0 + ℓ∗]. The length is limited to an additional distance of ℓ∗, because the ℓ∗-run starting at m1 is maximal.

Now we can rewrite k as max{k′ | S[m0 − k′ + 1, m0] = S[m0 + 1, m0 + k′]^R}. This is exactly the length of the palindrome with midpoint m0. Therefore we know that the length ℓ(mj, i) can be calculated from the length of the palindrome at m0 by the expression ℓ(m0, i) + j|w|. ⊓⊔
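Corollary 2 and Lemma 6 together mean that a whole run can be reconstructed from very few stored values. A minimal sketch of this reconstruction (our own code and naming, not the paper's):

```python
# Sketch: recovering midpoint positions (Corollary 2 a)) and palindrome
# lengths (Lemma 6 / formula (19)) of a maximal run from the first midpoint
# m1, the word length |w| = m2 - m1, and the lengths at both run ends.

def midpoint(m1, word_len, j):
    return m1 + (j - 1) * word_len          # midpoints are equally spaced

def length_in_run(j, h, word_len, len_first, len_last):
    if j < (h + 1) / 2:
        return len_first + (j - 1) * word_len
    if j > (h + 1) / 2:
        return len_last + (h - j) * word_len
    # the middle midpoint(s) are not covered by formula (19); in the
    # compression below they are stored explicitly
    raise ValueError("middle midpoint: value must be stored explicitly")
```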

2.2.2 Compression structure

After the explanation of the structure of close palindromes (see Subsection 2.2.1) we present how these results can be used to compress the list of palindromes. The subsection above presented runs and results about runs in general. In the improved algorithm we will consider √n-runs. The space-efficient version of ApproxSqrt stores the palindromes in a list Li which uses the palindrome structure to be compressed in comparison to the list Li from the algorithm Simple ApproxSqrt. The list can contain three different types of entries, which are created by the algorithm depending on the palindromes it detects in the stream. The three entries are RS-entries, RNF-entries, and RF-entries. We present the entries and the palindromes which are stored in them in the following.

• All √n-palindromes which are not part of a √n-run are stored in an RS-entry as in Simple ApproxSqrt. A palindrome P[m, i] is stored in the entry RS(m, i).
• All √n-palindromes which are part of a maximal √n-run are stored in an RF-entry. This structure contains all information needed to decompress all information about the palindromes within the run. For the midpoints m1, . . . , mh which form a √n-run, the RF-entry RF(m1, . . . , mh, i) stores on the one hand the midpoint of the first palindrome, the length of the word, and the length estimates of the first, the last, and the two middle palindromes; on the other hand the entry also contains the fingerprints up to the first midpoint and the fingerprints of the word:

(m1, m2 − m1, ℓ̃(m1, i), ℓ̃(m⌊(h+1)/2⌋, i), ℓ̃(m⌈(h+1)/2⌉, i), ℓ̃(mh, i), F^F(1, m1), F^R(1, m1), F^F(m1 + 1, m2), F^R(m1 + 1, m2))

• The last category are √n-palindromes which are part of a √n-run, but where the algorithm has not yet been able to detect whether the run is maximal. Therefore it is possible that the algorithm will add further palindromes to the run. This storage entry is called an RNF-entry. The entry contains the same data as an RF-entry without the three length estimates ℓ̃(m⌊(h+1)/2⌋, i), ℓ̃(m⌈(h+1)/2⌉, i), ℓ̃(mh, i).

The list Li contains RNF-entries only during the execution of the algorithm. If the algorithm reaches the end of the stream, any remaining RNF-entry must describe a maximal run and can be converted into an RF-entry. In other words, the returned list Ln contains only RS-entries and RF-entries.

An observation we can make is that certain palindromes are stored directly and some palindromes are only stored indirectly. For directly stored palindromes the structure contains the midpoint or the length estimate in one of the entries, and the algorithm can access this information without the necessity to calculate it. In the following we call directly stored palindromes explicitly stored palindromes; they are the following.

• RS-entry: the palindrome P[m] which is stored by the entry.
• RNF-entry: the palindrome P[m1].
• RF-entry: the palindromes P[m1], P[m⌊(h+1)/2⌋], P[m⌈(h+1)/2⌉] and P[mh].


This distinction is important for the algorithm because only explicitly stored palindromes are updated throughout the execution. Indirectly stored palindromes are updated indirectly through the connection we proved above (see Subsection 2.2.1).
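One possible in-memory layout for the three entry types is sketched below; the field selection follows the description above, while the class and field names are our own, and fingerprint values are kept as opaque integers.

```python
from dataclasses import dataclass

@dataclass
class RSEntry:            # one sqrt(n)-palindrome on its own
    m: int                # midpoint
    est: int              # length estimate l~(m, i)
    fp_f: int             # F^F(1, m)
    fp_r: int             # F^R(1, m)

@dataclass
class RNFEntry:           # run whose maximality is not yet known
    m1: int               # first midpoint
    word_len: int         # m2 - m1
    h: int                # number of midpoints so far
    est_first: int        # l~(m1, i)
    fps: tuple            # F^F(1,m1), F^R(1,m1), F^F(m1+1,m2), F^R(m1+1,m2)

@dataclass
class RFEntry(RNFEntry):  # finished (maximal) run: three more estimates
    est_mid_lo: int       # l~(m_floor((h+1)/2), i)
    est_mid_hi: int       # l~(m_ceil((h+1)/2), i)
    est_last: int         # l~(m_h, i)
```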

2.2.3 Pseudo Code

Several parts of the algorithm ApproxSqrt are similar to Simple ApproxSqrt. Therefore we only consider the steps which differ from Simple ApproxSqrt (see Subsection 2.1.2). The differences are in Steps 5, 6 and 7. The algorithm must deal with three different types of entries in the list of palindromes and therefore has to differentiate between them in Step 5. Furthermore, not all palindromes are stored explicitly and not all length estimates of palindromes in a run have to be updated; hence Step 6 has to consider only explicitly stored palindromes. Finally, Step 7 needs an additional step if the last entry is an RNF-entry.

5. If ℓ̃(m, i) ≥ √n, obtain Li by adding the palindrome with midpoint m to Li−1:

a. The last element in Li is an RNF-entry. This entry has the structure (m1, m2 − m1, h, ℓ̃(m1, i), F^F(1, m1), F^R(1, m1), F^F(m1 + 1, m2), F^R(m1 + 1, m2)).
   i. If the palindrome can extend the run because it has the right distance m = m1 + h(m2 − m1) to the last palindrome in the run, then increase the h in the RNF-entry by 1.
   ii. If the palindrome cannot be added:
      • Convert the RNF-entry into an RF-entry. Calculate m⌊(h+1)/2⌋ = m1 + (⌊(h+1)/2⌋ − 1)(m2 − m1) and m⌈(h+1)/2⌉ similarly. For m′ ∈ {m⌊(h+1)/2⌋, m⌈(h+1)/2⌉, mh} calculate ℓ̃(m′, i) = max{j − m′ | i − 2√n ≤ j ≤ i, ∃ck with j − m′ = m′ − ck and F^R(ck + 1, m′) = F^F(m′ + 1, j)}. Store the RF-entry.
      • Store P[m, i] as an RS-entry.
b. If the last two entries in Li are stored as RS-entries and together with P[m, i] form a √n-run:
   • Remove the two entries from the list Li.
   • Create an RNF-entry for the three palindromes: calculate or set m1, m2 − m1, h = 3, ℓ̃(m1, i), F^F(1, m1), F^R(1, m1), F^F(m1 + 1, m2), F^R(m1 + 1, m2).
   • Store the RNF-entry in the list.
c. Otherwise, store P[m, i] as an RS-entry: Li = Li ◦ (m, ℓ̃(m, i), F^F(1, m), F^R(1, m)).

6. Check for all checkpoints and explicitly stored palindromes whether length estimates can be updated (similar to Simple ApproxSqrt, but limited to explicitly stored palindromes).

7. If i = n:
   • If the last element in Li is an RNF-entry, then convert it into an RF-entry as in Step 5(a)ii.
   • Report Ln.

The palindromes in the RF-entries can be extracted with

ℓ̃(mj, i) = ℓ̃(m1, i) + (j − 1) · (m2 − m1)  if j < (h + 1)/2,
ℓ̃(mj, i) = ℓ̃(mh, i) + (h − j) · (m2 − m1)  if j > (h + 1)/2. (20)


2.2.4 Correctness of ApproxSqrt

We want to conclude the description of ApproxSqrt with a proof of its correctness. Therefore we prove that the compression is lossless before we prove Theorem 1.

Lemma 7. At iteration i, the RF-entry over m1, m2, . . . , mh is a lossless compression of RS(m1, i), . . . , RS(mh, i).

Proof. We consider the RF-entry RF(m1, . . . , mh, i) = (m1, m2 − m1, ℓ̃(m1, i), ℓ̃(m⌊(h+1)/2⌋, i), ℓ̃(m⌈(h+1)/2⌉, i), ℓ̃(mh, i), F^F(1, m1), F^R(1, m1), F^F(m1 + 1, m2), F^R(m1 + 1, m2)) and fix one arbitrarily chosen index j, or midpoint mj respectively. We prove that RS(mj, i − 1) = (mj, ℓ̃(mj, i − 1), F^R(1, mj), F^F(1, mj)) can be retrieved from the RF-entry with the same precision of the length estimate. In order to do this we need to retrieve three pieces of information: firstly the position of the midpoint, secondly the fingerprints, and thirdly the length estimate.

The first piece of information is the position of the midpoint mj. The RF-entry contains the index of the midpoint m1 and the distance between the midpoints m1 and m2. Hence we can calculate the position of mj by the results of Corollary 2 with the equation mj = m1 + (j − 1) · (m2 − m1).

The second piece of information are the forward and backward fingerprints up to the midpoint mj. By Corollary 2 the interval S[m1 + 1, mh] has the structure wwR wwR · · · . The RF-entry contains the fingerprints up to the first midpoint, F^F(1, m1) and F^R(1, m1), and it also contains the fingerprints F^F(m1 + 1, m2) and F^R(m1 + 1, m2), which are the fingerprints of w and wR. Using Lemma 1 or Lemma 2 implies that we can compute F^R(1, mj) and F^F(1, mj), because we have fingerprints which together contain all the symbols from index 1 up to index mj.

The third piece of information is the length estimate. We have to argue that by calculating the retrieved length estimate of mj from the length estimates we have, we end up with the same accuracy as if the palindrome were stored in an RS-entry. For that we let i′ be the index where the RF-entry was finished. Then we have to differentiate between three cases.

1. mj = m1: ℓ̃(m1, i) is stored explicitly in the RF-entry. Furthermore, the algorithm stored it as an RS-entry at first and treats it throughout its run as if it were an RS-entry. Therefore ℓ(mj, i) − ε√n < ℓ̃(mj, i) ≤ ℓ(mj, i).
2. mj ∈ {m⌊(h+1)/2⌋, m⌈(h+1)/2⌉, mh}: The length estimates of these midpoints are calculated at index i′. The length estimate is set to the maximal size consistent with all checkpoints within reach. Therefore, similar to the proof of Lemma 4, the accuracy ℓ(mj, i) − ε√n < ℓ̃(mj, i) ≤ ℓ(mj, i) must hold. Similar to m1, the palindrome P[mj] and its length estimate are treated from then on as if it were an RS-entry. Therefore ℓ(mj, i) − ε√n < ℓ̃(mj, i) ≤ ℓ(mj, i).
3. Otherwise: WLOG 1 ≤ j ≤ ⌊(h+1)/2⌋. To retrieve the length estimate of mj we can use formula (20), which is motivated by Lemma 6. Therefore, by Lemma 6 we have ℓ(m1, i) − ℓ̃(m1, i) = ℓ(mj, i) − ℓ̃(mj, i). Furthermore, we can infer from case 1, the definition of the length, and the way length estimates are set the inequality 0 ≤ ℓ(m1, i) − ℓ̃(m1, i) < ε√n. Hence the precision ℓ(mj, i) − ε√n < ℓ̃(mj, i) ≤ ℓ(mj, i) holds in this case as well.

We can conclude that an RF-entry over m1, m2, . . . , mh is a lossless compression. ⊓⊔

To argue the sublinearity of the list of palindromes Li, one final observation is necessary.

Observation 1 For any interval of length √n there can be at most two RS-entries and two compressed runs in L∗.

The observation can be justified by considering a section of size √n and the √n-palindromes which might be inside of it. If there are more than two palindromes, which would be stored as RS-entries, they must overlap and can be stored as a run. If there are more than two runs, they have to


overlap as well and can be combined into one run. Therefore the number of palindromes and runs in a section of that size is constant.

Finally, we have described every part of ApproxSqrt and have proven every necessary step and technique to prove Theorem 1.

Proof (of Theorem 1).
W.h.p.:
The theorem states that the results of ApproxSqrt hold w.h.p. From our definition (see Subsection 1.2.1) this means the probability is at least 1 − 1/n. Whether the algorithm fails depends on the fingerprints and their probability to fail. If we recall the fingerprints the algorithm calculates (Master Fingerprints, checkpoint fingerprints, palindrome midpoint fingerprints, run fingerprints, and Fingerprint Pairs) and how often we calculate them, we can surely upper-bound their number by n². The probability that one fingerprint fails is 1/n⁴ by Lemma 3. Using the union bound results in a probability of at least 1 − 1/n² that no fingerprint fails. Therefore the results hold w.h.p.

Correctness:
Similar to Lemma 4, where we proved the correctness of Simple ApproxSqrt, we fix an arbitrary palindrome P[m]. If the length ℓ(m) of this palindrome is smaller than √n, the algorithm ApproxSqrt deals with the palindrome in the same way as Simple ApproxSqrt. Therefore the correctness follows from Lemma 4.

Therefore we assume ℓ(m) ≥ √n. The algorithm identifies P[m] as being a palindrome of length at least √n at index i = m + √n. Since ℓ(m) ≥ √n, the interval S[i − 2√n + 1, i − √n] must be the reverse of the interval S[i − √n + 1, i], which means that all Fingerprint Pairs inside the window are checked successfully and the palindrome is added as an RS-entry to the list Li. If the algorithm identifies later that P[m] is part of a run, it is added to an RNF-entry, and eventually this entry is transformed into an RF-entry in Step 5(a)ii or Step 7. At no point is a palindrome removed from the list Li. Hence P[m] is stored as an RS-entry or an RF-entry in the list which is reported in Step 7. If the palindrome is part of an RF-entry, the data can be obtained as proven in Lemma 7.

It remains to prove that the length estimate is within the boundaries given by ℓ(m) − ε√n < ℓ̃(m) ≤ ℓ(m). After iteration n, the list Ln contains only RS-entries and RF-entries. The length estimate of an RS-entry is correct by Lemma 4 and the length estimate of an RF-entry is correct by Lemma 7.

Space:
The algorithm stores the sliding window, the checkpoints, the Fingerprint Pairs, and the list of palindromes Li. The sliding window stores 2√n symbols. The checkpoints are ⌊ε√n⌋ indices apart with ε ∈ [1/√n, 1]. Therefore there are ⌊n/⌊ε√n⌋⌋ ≤ 2n/(ε√n) checkpoints. As a result of the choice of ε we have √n ≤ √n/ε ≤ n. Therefore the number of checkpoints can be bounded by O(√n/ε). Furthermore, we store only one fingerprint per checkpoint. The number of Fingerprint Pairs is bounded by O(1/ε). The palindromes occur either on their own or in a run; both can be stored in constant space. By Observation 1 there is a constant number of runs and RS-entries in any interval of size √n. As a consequence we can bound the space of the list of palindromes by O(√n).

Running time:
The running time is dominated by the comparisons performed in Steps 4 and 6 of the algorithm. Therefore the running time of those steps determines the running time of the algorithm.

We consider the running time of Step 6 first. The algorithm checks for each checkpoint and each explicitly stored palindrome whether the length estimate can be updated. As stated above, the number of checkpoints is bounded by O(√n/ε). The number of explicitly stored palindromes is one for RS-entries


and four for runs. By Observation 1, the number of single palindromes and runs in an interval of size √n is bounded by a constant. Hence we have O(√n) explicitly stored palindromes which are checked at most once against the O(√n/ε) checkpoint fingerprints, which gives us an order of O(n/ε) for the comparisons.

In Step 4 the Fingerprint Pairs are calculated and compared. There are O(1/ε) many fingerprints. Due to the properties of fingerprints, they can be updated in O(1). For every index the algorithm compares the midpoint to O(1/ε) Fingerprint Pairs. Altogether, Step 4 is in the order of O(n/ε).

As a conclusion we have that the algorithm works in time O(n/ε). ⊓⊔

2.3 Variant of ApproxSqrt

In this section we would like to sketch one variant of ApproxSqrt, because the variant is used by the algorithm Exact. The variant is constructed to prove Theorem 4. In Subsection 2.1.1 we explained that it is possible to calculate the length of small palindromes exactly, because they are fully contained inside the sliding window. The algorithm initialises ℓmax with 1. For every index the algorithm checks whether there is a palindrome of length ℓmax or longer and updates ℓmax and the longest palindrome if applicable. This way the algorithm is guaranteed to find ℓmax and a palindrome of that length whenever ℓmax < √n.

3 Exact and ApproxLog

In this section we give an overview of the two algorithms Exact (see Subsection 3.1) and ApproxLog (see Subsection 3.2), which consider the longest palindromic substring problem. The description is limited to an overview because the main ideas and techniques are used in ApproxSqrt and are already explained in Section 2.

3.1 Exact

Exact reports w.h.p. every palindrome of maximal length in O(n) time and O(√n) space. This means the algorithm has to calculate the maximal length ℓmax and find all palindromes of that length. As described above, ApproxSqrt gives an additive approximation with an uncertainty of ⌊ε√n⌋. This means that its results do not tell us the exact maximal length, but they give us a range for the length. Furthermore, ApproxSqrt does find all palindromes. Hence the idea is to use those results to find the maximal length and the palindromes of that length. The algorithm Exact does exactly that by being a two-pass algorithm.

The first pass executes ApproxSqrt(S, 1/2) and the variant outlined in Subsection 2.3 at the same time. The second pass then depends on the first pass. If for every palindrome P[m] the length ℓ(m) is strictly smaller than √n, then the variant of ApproxSqrt returns ℓmax. A second pass checking for every position of the window whether there is a palindrome of that length suffices to find all palindromes of length ℓmax. If there exists a palindrome P[m] which is at least √n long, we have the positions of all palindromes from ApproxSqrt(S, 1/2), but we do not have ℓmax, since the length estimates only satisfy ℓ(m) − ε√n < ℓ̃(m) ≤ ℓ(m). The algorithm's second pass has two phases. In the first phase the length estimates are used to remove all palindromes which cannot be the longest. Furthermore, only the palindromes in the middle of the runs are kept. For the remaining palindromes we know that the length estimates are within a certain boundary. In addition we know that those


palindromes were checked successfully against one checkpoint and not successfully against another checkpoint. Therefore there is an interval between those checkpoints for which it is uncertain whether the palindrome extends into this interval and whether it or part of it is the reverse of the interval on the other end of the palindrome (see Figure 7). The algorithm uses this information to store the symbols of the first interval when the current index reaches them and compares them to the symbols of the second interval when the current index reaches those symbols. Afterwards those stored symbols can be deleted to reduce the space requirements. In this way the algorithm can find ℓmax and all palindromes of that length.

Fig. 7 Illustration of the limits of ApproxSqrt. The midpoint m is successfully checked against c2, but for checkpoint c1 the comparison of the fingerprints failed. The information which can be deduced from that is that the symbols in the area with the lines from top left to bottom right are definitely part of the palindrome. The symbols in the area with the crossed lines are definitely not part of the palindrome. However, there is no information about the symbols in the dotted area. I1 and I2 identify those intervals whose symbols might be completely or partly part of the palindrome.

To prove that this algorithm is correct, we would have to prove that the pre-processing does not delete a palindrome which belongs to the longest palindromes. The interesting part is to prove that keeping only the palindromes in the middle of a run does not delete a longest palindrome. We do not prove this statement, but on a logical consideration it is correct: for runs with an even number of palindromes the statement follows directly from Lemma 6; for runs with an odd number of palindromes it is necessary to prove that all palindromes are shorter than the palindrome in the centre, which seems right as well if we remember the structure proven in Corollary 2.

To prove Theorem 2, and therefore the algorithm Exact, the biggest effort goes into the space requirements. We would need to bound the number of uncertainty intervals; eventually this can be bounded by O(√n).

3.2 ApproxLog

ApproxLog reports w.h.p. an arbitrary palindrome of length at least ℓmax/(1 + ε) in O(n log(n)/(ε log(1 + ε))) time and O(log(n)/(ε log(1 + ε))) space. This means that we find only one palindrome and it might not be the longest. One main difference to ApproxSqrt and Exact is that ApproxLog has a multiplicative error. This means that it is not necessary to have equally spaced checkpoints. ApproxLog creates checkpoints whose number decreases exponentially with the distance to the current index, and it also removes checkpoints in a fixed interval exponentially with increasing distance to the current index. Also, because the algorithm only reports one palindrome, it keeps only the length estimate of the longest palindrome the algorithm has found so far.

The functionality and correctness of ApproxLog depend on the way the checkpoints are created and on the compression of palindromes. Checkpoints are created close enough to ensure that the length estimate is within the boundaries. At the same time, the storage of the palindromes uses the results of Lemma 7 to ensure that the storage space for a certain interval can be bounded by a constant.

To prove Theorem 3 we would mainly have to prove the length estimate accuracy and a bound on the storage space for the checkpoints. An alert spectator might guess from the high-level description of how checkpoints are created and from the space requirements that the number of checkpoints dominates


the time and space requirements. With the results on the compressibility of palindromes, the storage of one checkpoint can once again be bounded by a constant, and the number of checkpoints can be bounded by O(log(n)/(ε log(1 + ε))).

4 Conclusion

We saw in the main part of this work the presentation of an algorithm, ApproxSqrt, and the techniques used to solve the palindrome problem. The target of optimisation was the space needed to compute the result. The algorithm solves the problem with an additive error and a space requirement which can be bounded by O(√n/ε).

The algorithm uses several techniques which are interesting beyond the identification of palindromes. The most interesting are fingerprints. Fingerprints make it possible to compress sequences of symbols while preserving the possibility to compare the sequences. With the right choice of initialisation, the probability that two different sequences are mapped to the same number is low. We have seen that the choice we made results in a fingerprint failure probability of 1/n⁴. Another technique is the creation of checkpoints to ensure a certain accuracy of estimation. Those can be created equidistantly or distributed by other means, depending on the accuracy to achieve. The last technique we mention here is the compressibility of palindromes and the bound on the number of palindromes or runs in an area of the stream. Certainly, the presented compression works only for palindromes, but similar techniques are used in pattern matching algorithms [3].

At the end of this work we presented an overview of two algorithms, Exact and ApproxLog, which solve the longest palindromic substring problem. The techniques used by these algorithms are the same as introduced for ApproxSqrt. Exact has a space complexity of O(√n) and ApproxLog has a logarithmic space complexity.

The algorithm ApproxSqrt solves the problem in sublinear space, but it would be interesting in terms of prospects whether it is possible to transfer the algorithms or techniques to a distributed model and whether this would result in a real benefit. Nucleic acid strands are long; hence a space complexity of √n might also be quite big. An easy idea would be to give every node a part of the input stream. Surely it is possible with the presented algorithm to achieve the same accuracy in those parts, but the question at hand is how to deal with palindromes that overlap those parts. It might be possible to reduce the communication to fingerprints, but the questions here are at what point these should be exchanged and how to minimise their number.

References

1. Apostolico, A., Breslauer, D., Galil, Z.: Parallel Detection of All Palindromes in a String. Theor. Comput. Sci. 141.4 (April 1995) doi:10.1016/0304-3975(94)00083-U, pp. 163–173
2. Berenbrink, P., Ergün, F., Mallmann-Trenn, F., Azer, E.S.: Palindrome Recognition In The Streaming Model. 31st International Symposium on Theoretical Aspects of Computer Science (STACS 2014). (2014) doi:10.4230/LIPIcs.STACS.2014.149, pp. 149–161
3. Breslauer, D., Galil, Z.: Real-Time Streaming String-Matching. ACM Trans. Algorithms 10.4 (August 2014) doi:10.1145/2635814, pp. 22:1–22:12
4. Eltayeb, F., Elbahir, M., Mohamed, S., Ahmed, M., Zaki, N.: Development of a Web-based Application to Detect Palindromes in DNA Sequence. 4th International Conference on Innovations in Information Technology, 2007. IIT '07. (2007) doi:10.1109/IIT.2007.4430509, pp. 725–727
5. IUPAC: Compendium of Chemical Terminology, 2nd ed. (the "Gold Book"). Compiled by A. D. McNaught and A. Wilkinson. Blackwell Scientific Publications, Oxford (1997). XML on-line corrected version: http://goldbook.iupac.org (2006–) created by M. Nic, J. Jirat, B. Kosata; updates compiled by A. Jenkins. ISBN 0-9678550-9-8. Last update: 2014-02-24; version 2.3.3. Cited 01 November 2015. doi:10.1351/goldbook.N04254


6. Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31.2 (March 1987) doi:10.1147/rd.312.0249, pp. 249–260
7. Kimball, J.W.: Kimball's general biology (2015) http://biology-pages.info. Cited 01 November 2015
8. Leung, M.-Y., Choi, K.P., Xia, A., Chen, L.H.Y.: Nonrandom Clusters of Palindromes in Herpesvirus Genomes. Journal of Computational Biology: a journal of computational molecular cell biology. (2005), pp. 331–354
9. Lu, L., Jia, H., Dröge, P., Li, J.: The human genome-wide distribution of DNA palindromes. Functional & Integrative Genomics. (2007) doi:10.1007/s10142-007-0047-6, pp. 221–227
10. Manacher, G.: A New Linear-Time On-Line Algorithm for Finding the Smallest Initial Palindrome of a String. J. ACM 22.3 (July 1975) doi:10.1145/321892.321896, pp. 346–351
11. Porat, B., Porat, E.: Exact and Approximate Pattern Matching in the Streaming Model. Proceedings of the 2009 50th Annual IEEE Symposium on Foundations of Computer Science. (2009) doi:10.1109/FOCS.2009.11, pp. 315–323
12. Wikimedia User Acdx: DNA palindrome.svg. Wikimedia Commons (2010) https://commons.wikimedia.org/wiki/File:DNA_palindrome.svg. Cited 01 November 2015


Randomized Algorithms for Tracking Distributed Count, Frequencies, and Ranks

Nils Kothe

Abstract This essay deals with count-, frequency- and rank-tracking, which are fundamental distributed tracking problems. The results presented in this essay are not based on own research, but instead are a presentation of the results achieved by Zengfeng Huang, Ke Yi, and Qin Zhang [1]. We use two-way communication and randomization in order to achieve significant improvements over previous results. Our basic algorithm is introduced for the count-tracking problem, in which k distributed sites each hold a counter ni that is equal to the number of elements that the respective site has received from a data stream. The task that needs to be solved is that an additional member of the network wants to know an ε-approximation of n := ∑_{i=1}^k ni at all times. It is proven in previous works that deterministic solutions to this problem need Ω(k/ε · logN) communication. We will show that through two-way communication and randomization, this can be reduced to O(√k/ε · logN). For the other two problems, we will use the insights and the algorithm developed for the count-tracking problem to achieve similar results.

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1251.1 The model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1271.2 Declaration of the problems, previous and new results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

2 Tracking Distributed Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1292.1 The algorithm and upper bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1292.2 The lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

3 Tracking Distributed Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1363.1 The algorithm and upper bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1363.2 Communication space trade-o . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

4 Tracking Distributed Ranks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1414.1 The basic algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1424.2 The algorithm C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1424.3 Upper space and communication bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1424.4 Estimation of the rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

1 Introduction

This essay deals with the results and techniques from the paper 'Randomized Algorithms for Tracking Distributed Count, Frequencies, and Ranks' by Zengfeng Huang, Ke Yi, and Qin Zhang [1]. Please notice that the goal of this essay is not to present the results of original,

Nils Kothe, Universität Paderborn, Warburger Str. 100, 33098 Paderborn, e-mail: [email protected]


new research. Instead it focuses on presenting the results of Huang et al. in a more lengthy and understandable fashion.

The central problem of this essay is the count-tracking problem, which provides the basis for the two other problems mentioned in the title. In the count-tracking problem, we consider k sites that each have a counter ni that counts the number of elements that have arrived at site i. Notice that this counter is only incremented and never decremented. This is an appropriate approach, since this counter is meant to represent the number of elements coming from a data stream. Initially each ni is 0, and it is incremented over time whenever a new element arrives. The value of ni at a certain time t is called ni(t). Now the task that is to be solved in the count-tracking problem is that a coordinator wants to know an ε-approximation of the sum over all individual counters, n(t) := ∑_{i=1}^k ni(t), for all possible values of a specific time t. This basically means that the coordinator wants to have an estimator n̂(t) of n(t) that is continuously, at all times, in an ε-range of n(t), i.e. (1 − ε)n(t) ≤ n̂(t) ≤ (1 + ε)n(t). The challenge of the count-tracking problem is that we want to minimize the communication needed between the coordinator and each of the sites in order to achieve a feasible solution. An additional challenge that becomes interesting with the other two problems mentioned in the title is the minimization of the space required at each of the sites.

Notice that the count-tracking problem is already solved for the deterministic case. There exists a simple algorithm in which each site sends out a new message to the coordinator whenever its counter ni has increased by a factor of 1 + ε. This implies that the coordinator always has an ε-approximation of each ni, which in turn implies that the coordinator has an ε-approximation of n. If we assume that N is the final value of n, then the communication cost of this algorithm is O(k/ε · log(εN/k)) = O(k/ε · logN) for large enough N. This algorithm was used in [10] to solve basically the same problem. Central properties of this algorithm are that there is only one-way communication and that it is deterministic. However, it was proven in [11] that this simple algorithm is already optimal. The reason we still take a look at this problem is that we want to improve this communication upper bound with the usage of randomization and two-way communication.


1.1 The model

Fig. 1 Basic underlying model used in this essay. S1, · · · , Sk represent the sites and C the coordinator. Each site has a two-way communication channel to the coordinator.

The formal model that we will use to evaluate the mentioned distributed tracking problems consists of k distributed sites named S1, · · · , Sk, each having a stream that receives new elements at unknown times. Each site is connected with a two-way communication channel to the coordinator called C (cf. Figure 1). We assume that the total number of elements sent to all sites is N. Each site has a multiset (bag) Ai(t) containing all elements that Si has received up to time t. Additionally, we define A(t) := ⊎_{i=1}^k Ai(t) as the multiset that contains all elements received until time t over all sites, with ⊎ being multiset addition. The coordinator is tasked with computing an (approximate) value of f(A(t)), with f being a function representing the current task. For example, in the count-tracking problem it holds that f(A(t)) = |A(t)|. It is also important to note that we assume that a broadcast from the coordinator to all sites costs exactly k single messages.

Another point is that sites can use the coordinator in order to communicate with each other by first sending the message to the coordinator, which in turn sends the message to the desired site. We also assume that communication time is zero, which means that a new element can only arrive after all participants have finished deciding whether they want to send messages. The space bounds presented in this essay are related to the number of words needed to store the desired information, with a word being able to store any integer from [0, N) as well as one element of a data stream.

Notice that our model degenerates into the standard streaming model for k = 1. Also note that if our goal is to perform a one-shot computation of f(A(∞)) for k ≥ 2, then the model degenerates into the (number-in-hand) k-party communication model. This means that our model is more general than either of those models. Additionally, our model seems to be significantly different from the two mentioned models. A good example is again the count-tracking problem, which is trivial in the other two models and non-trivial in our model, meaning that it requires new techniques and algorithms to be solved, like the two-way communication randomized approach presented in this essay.

Also note that there exists a very similar model used in [3] and [2], with but one major difference: the other model also consists of k streams that run a streaming algorithm on the local data. However, the value of f is only computed if a user requests it or if all N elements have arrived. This of course means that the count-tracking problem becomes trivial as well. It also means that each site waits passively to get polled for the computation of f. In contrast, in our model the sites actively participate in maintaining the value of f. Notice that the other model needs to poll all sites whenever it wants to know the value of f continuously.


1.2 Declaration of the problems, previous and new results

As mentioned in the beginning, we will first provide a solution to the count-tracking problem that uses two-way communication as well as randomization. Afterwards, we generalize our problem to the frequency-tracking problem, and after that we generalize it even further to the rank-tracking problem. The deterministic communication complexity of the count-tracking problem has been proven to be Θ(k/ε · logN) [11], as mentioned in the beginning. The deterministic approach only requires one-way communication, while we will prove in this essay that with the help of two-way communication and randomization we can reach Θ(√k/ε · logN). Throughout the whole essay we will assume that k ≤ 1/ε²; otherwise all upper bounds have an additional term of O(k · logN). Notice that the solution we provide actually works with an arbitrary success probability from the interval (0, 1). This can be achieved through the rescaling of ε and a probability called p. Another approach to reach a success probability of 1 − δ would be to make the algorithm correct for O(log_{1+ε} N) = O(1/ε · logN) time instances. If we run O(log(logN/(εδ))) independent copies of the algorithm and take the median of all estimators as our final estimator, then we have a success probability of at least 1 − δ. Also notice that we prove in Section 2.2.1 that two-way communication is actually necessary to reach the desired √k-factor improvement. This is done by proving that any randomized one-way algorithm needs at least Ω(k/ε · logN) communication, which is the same value as for deterministic algorithms.

least Ω(k/ε · logN) communication, which is the same value as for deterministic algorithms.The frequency-tracking problem, sometimes also called the heavy-hitters tracking problem, is a

slight generalization of the count-tracking problem. Instead of counting the total number of ele-ments, we want to estimate the frequency of multiple types of elements. Again, A(t) is a multiset ofcardinality n(t) for a time t. The task that is to be solved in the frequency-tracking problem is thatthe coordinator wants to have an estimator of the number of elements over all sites S1, S2, · · ·Sk foreach j ∈ T , assuming that T is the set of all element types. We dene fij as the actual number of el-

ements of type j that have arrived at the site i and fj :=∑ki=1 fij . Notice that during our algorithm

presented in Section 3.1 we actually do not know the real value of each fij but rather an estimation

f ′ij that is good enough for our desired ε-range of (1 − ε)n(t) ≤ ∑ki=1

∑j∈T f

′ij ≤ (1 + ε)n(t) with

probability at least 0.9. It is necessary to only demand an error over the sum over all estimators,since otherwise every element would need to be reported to the coordinator if every element wouldhave a unique type. Notice that this success probability of 0.9 is again rather arbitrarily and thata dierent success probability can be achieved through rescaling. Also note that if we only want totrack the frequency of a single element type, then this problem degenerates into the count-trackingproblem. Previous studies have shown that the best deterministic solution for the frequency-trackingproblem needs O(k/ε · logN) communication [11]. In this essay, we will prove that one can achieveO(√k/ε · logN) through the usage of our count-tracking algorithm and randomization. Notice that

the lower communication bound of the count-tracking algorithm also applies for frequency tracking.Also note that our approach only needs Ω(1/(ε

√k)) space at each site. Of course it would be pos-

sible to reach O(1) space at each site if we simply send a message for each arriving element to thecoordinator. However, this would also imply a strong increase of communication cost. In this essaywe will prove a space-communication trade-o in Section 3.2.

Afterwards we take a look at the rank-tracking problem, where we will assume that A(t) contains no duplicate entries in order to make the proofs easier to follow. Now we define the rank of an element x as the number of elements in A(t) that are smaller than x. Our goal is to create a data structure that allows us to know the rank of x with an error of at most εn(t) with a constant probability. It is actually possible to solve the frequency-tracking problem with an algorithm that solves the rank-tracking problem by turning each element into a pair (x, y) to break all ties. We then compare the pairs lexicographically and maintain the rank-tracking data structure mentioned above. Now, if we want to know the frequency of the element x, we simply query our data structure for (x, 0), (x, 1), · · · , (x, ∞). In previous works, a deterministic algorithm that solves the rank-tracking problem with communication O(k/ε · logN log²(1/ε)) [11] was presented. In this essay, we will show in Section 4 that we can achieve an upper communication bound of


O(√k/ε · logN · log^1.5(1/(ε√k))) through randomization and two-way communication. Since rank-tracking is more general than frequency tracking, the lower communication bound of the count-tracking problem also applies here. The space used at each site in our algorithm is also close to 1/(ε√k).
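A tiny sketch of the reduction just described (our own illustration; rank(q) is assumed to return the number of elements smaller than q):

```python
# Frequencies from ranks over lexicographically ordered pairs (x, y):
# all pairs of type x lie between (x, 0) and (x, "infinity").

def frequency_from_rank(rank, x):
    return rank((x, float("inf"))) - rank((x, 0))
```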

A direct competitor for the algorithms presented in this essay are algorithms that use random sampling for a constant success probability. Notice that previous research [12] has shown that random sampling can solve all the problems mentioned above by taking a sample of size O(1/ε²). A random sample can be maintained continuously over distributed streams by using a total communication amount of O(1/ε² · logN) [13, 14]. This means that our algorithm is better than random sampling for k = O(1/ε²). This is the reason we focus on k ≤ 1/ε² in this essay.

2 Tracking Distributed Count

2.1 The algorithm and upper bound

2.1.1 The algorithm with a fixed p

Fig. 2 S1, · · · , Sk represent the sites and C the coordinator. This graphic shows the basic idea of our algorithm. Whenever a new element arrives, we increment ni by one and, with probability p, send it out as n̄i.

The basic idea of our algorithm is that whenever a site Si receives a new element, we increment ni by one and send ni to the coordinator with probability p ∈ [0, 1] (cf. Figure 2). We define n̄i as the last value sent by Si. Now we define the estimator n̂i with which the coordinator estimates ni as the following:

n̂i := n̄i − 1 + 1/p if n̄i exists, and n̂i := 0 otherwise. (1)

Additionally, we define the estimator over all sites as n̂ := ∑_{i=1}^k n̂i. Notice that our estimator is exactly ni if p = 1 and is larger than the last received n̄i for p < 1. This also means that our estimator increases if p gets smaller. Now, if we want to analyse the estimator, first note that our estimator should hold for any given time t. We will now show that n̂i is an unbiased estimator of ni with variance at most 1/p². This result feels intuitive, since ni − n̄i is exactly the number of elements for which Si decided not to send a message to the coordinator. This number roughly follows a geometric distribution that is bounded by ni and has an additional addend if there is no n̄i.
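A toy simulation of this fixed-p protocol may help fix ideas (our own sketch, not the paper's code; message passing is modelled by a direct callback):

```python
import random

class Site:
    def __init__(self, sid, p, send):
        self.sid, self.p, self.n, self.send = sid, p, 0, send

    def receive_element(self):
        self.n += 1
        if random.random() < self.p:
            self.send(self.sid, self.n)          # this value becomes nbar_i

class Coordinator:
    def __init__(self, k, p):
        self.p = p
        self.nbar = [None] * k                   # last value received per site

    def on_message(self, i, value):
        self.nbar[i] = value

    def estimate(self):                          # hat n = sum of (1) per site
        return sum(0 if v is None else v - 1 + 1 / self.p for v in self.nbar)

# wiring: coord = Coordinator(k, p); sites = [Site(i, p, coord.on_message) for i in range(k)]
```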

Lemma 1 E[n̂i] = ni, Var[n̂i] ≤ 1/p².

Proof. First we define a random variable


X := ni − n̄i + 1 if n̄i exists at C, and X := ni + 1/p otherwise,

with which we can rewrite n̂i as n̂i = ni − X + 1/p. This means that we only need to show that E[X] = 1/p and that Var[X] ≤ 1/p².

E[X] = ∑_{t=1}^{n_i} t(1 − p)^{t−1} p + (1 − p)^{n_i} (n_i + 1/p)
     = (p/(1 − p)) ∑_{t=0}^{n_i} t(1 − p)^t + (1 − p)^{n_i} (n_i + 1/p)
     = 1/p + (1 − p)^{n_i} (−(n_i + 1/p) + n_i + 1/p)
     = 1/p

Var[X] = E[X²] − E[X]²
       = ∑_{t=1}^{n_i} t² p(1 − p)^{t−1} + (n_i + 1/p)² (1 − p)^{n_i} − 1/p²
       = (p/(1 − p)) ∑_{t=0}^{n_i} t² (1 − p)^t + (n_i + 1/p)² (1 − p)^{n_i} − 1/p²
       = −(1/p²) (n_i²(1 − p)^{n_i+2} − (2n_i² + 2n_i − 1)(1 − p)^{n_i+1} + (n_i + 1)²(1 − p)^{n_i} − (1 − p) − 1)
         + (n_i + 1/p)² (1 − p)^{n_i} − 1/p²
       = −((1 − p)^{n_i}/p²) (2 + n_i²p² + 2n_i p − p) + (2 − p)/p² + (n_i + 1/p)² (1 − p)^{n_i} − 1/p²
       = (1 − (1 − p)^{n_i})(1 − p)/p²
       ≤ 1/p² ⊓⊔

With the help of Lemma 1 we also know that E[n̂] = n and Var[n̂] ≤ k/p². Thus, we define p as p := √k/(εn) to have an error of at most 2εn with a success probability of at least 0.75. This can be shown by applying the Chebyshev inequality:

Pr[|n̂ − n| ≥ 2εn] ≤ Var[n̂]/(2εn)² ≤ 0.25.

Chebyshev's inequality also shows that we can change our success probability as well as the error simply by rescaling ε and p with constant factors in order to achieve the desired success probability of 0.9 with an error of at most εn.
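For concreteness, plugging the chosen p = √k/(εn) into the bound (this is just a restatement of the step above):

Var[n̂] ≤ k/p² = k · (εn)²/k = (εn)²,  so  Pr[|n̂ − n| ≥ 2εn] ≤ (εn)²/(2εn)² = 1/4.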


2.1.2 Dealing with a decreasing p

Fig. 3 S1, · · · , Sk represent the sites and C the coordinator. This graphic shows the refined idea of our algorithm. Whenever a new element arrives, we increment ni by one and, with probability p, send it out as n̄i. Additionally, each site receives a broadcast value n̄ with which it computes p.

In order to use the probability p as it is defined in Section 2.1.1, we need to somehow know the value n at each site. It is actually impossible to know the exact value of n within our model, but the analysis in the previous section should make it clear that it is sufficient to have p ∈ Θ(√k/(εn)). In order to achieve this, each site sends out an additional message whenever ni doubles. Also, the coordinator broadcasts n̄ := ∑_{i=1}^k n̄i whenever n̄ has changed by a factor between two and four (cf. Figure 3). This ensures that each site always has a constant-factor approximation of n through n̄. The communication cost of these broadcasts is O(k logN), since each site sends O(logN) messages to the coordinator and the coordinator broadcasts O(logN) times. These broadcasts split the whole time interval into O(logN) rounds, with each round having a unique value of n̄ that is used to define p. Notice that a new round begins whenever a new n̄ arrives. The question is now how a site actually uses n̄ in order to compute p.

First of all, if n̄ ≤ √k/ε, then all sites set p = 1. This means that each of the first O(√k/ε) elements triggers a message to the coordinator. Secondly, if n̄ > √k/ε, then all sites set p = 1/⌊εn̄/√k⌋₂, with ⌊x⌋₂ being the next power of 2 that is smaller than x. This implies that p gets halved whenever it changes. If we take a look at our estimator n̂i in (1), then we see that this also means that our estimator increases at the beginning of a new round, although Si didn't receive a new element. This means we need to adjust each n̄i in order to have our estimator remain unbiased. Whenever a new round begins, if p was halved, then each site needs to adjust its n̄i and send out a new message as follows: With probability 1/2 we simply send out n̄i to the coordinator again. If we lose the flip, we flip another coin with probability p′ (the new probability) until we succeed and decrement n̄i by the number of flips, including the successful one. Afterwards we send the new value of n̄i to the coordinator. This feels very counter-intuitive, but the explanation for this behaviour is actually quite simple. We define p′ := 1/2 · p as the new value of p at the beginning of a new round and assume that n̄i was decremented by a. Note that the expected value of a is 1/p′, since a behaves like a geometrically distributed number of flips. Now we can define n̄′i as the new value of n̄i after the adjustment as follows:

n̄′i := n̄i with probability 1/2, and n̄′i := n̄i − a otherwise.

With E[n̄′i] = 1/2 · n̄i + 1/2 · (n̄i − 1/p′) = n̄i − 1/p. If we now look at our new estimator n̂′i, then we can conclude E[n̂′i] = E[n̄′i] − 1 + 1/p′ = n̄i − 1 + 1/p = n̂i. This means that our adjustment of n̄i actually removes the bias of our new estimator n̂′i on expectation, and our model behaves as if it had always been running with the new value of p.
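The round logic of a site can be summarised in a few lines (our own sketch following the description above; compute_p implements ⌊·⌋₂, and the flip count includes the successful flip so that E[a] = 1/p′):

```python
import math, random

def compute_p(nbar, k, eps):
    """p = 1 while nbar <= sqrt(k)/eps, else 1/floor_pow2(eps*nbar/sqrt(k))."""
    if nbar <= math.sqrt(k) / eps:
        return 1.0
    x = eps * nbar / math.sqrt(k)
    return 1.0 / (2 ** math.floor(math.log2(x)))   # next power of 2 below x

def adjust_on_halved_p(nbar_i, p_new, send):
    """Resampling step when p is halved: resend nbar_i with prob. 1/2,
    otherwise decrement it by a geometric number of flips (E[a] = 1/p_new)."""
    if random.random() < 0.5:
        send(nbar_i)
        return nbar_i
    a = 1
    while random.random() >= p_new:                # flip until first success
        a += 1
    nbar_i -= a
    send(nbar_i)
    return nbar_i
```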


Now, if we look at the communication cost of a whole round, then we can see that we have O(k + np) = O(k + √k/ε) = O(√k/ε) communication per round for k ≤ 1/ε². Combined with the number of rounds, we receive an upper bound of O(√k/ε · logN) for the total communication. We can combine this upper bound with the previous statements into the following theorem:

can combine this upper bound with the previous statements to the following theorem:

Theorem 1 There is an algorithm for the count-tracking problem that estimates n = ∑_i ni with error at most εn with success probability at least 0.9 at any given time. It uses O(1) space at each site and O(√k/ε · logN) total communication.

2.2 The lower bound

Before we look into the actual proofs for the lower bounds of the count-tracking problem, it is necessary that we formalize the model in which these proofs are made. Each of the k sites receives elements at arbitrary times. This behaviour is plausible, since the count-tracking problem is essentially an online problem, because we have to know an estimation of n at all time instances. Also, spontaneous action is not allowed. Notice that this means that our sites are only allowed to send out messages if a new element or a new broadcast value n̄ arrives. The same is true for the coordinator, meaning that the coordinator can only send out a broadcast value whenever it receives a message n̄i from one of the sites. Additionally, each site and the coordinator are only allowed to use local information and some source of randomization. This implies that every site Si only knows its own counter ni, the message history between Si and the coordinator, the allowed error ε and the number of sites k. The counters of the other sites as well as additional information the coordinator might have are unknown. We also assume that each site has no access to a clock, or does not look at a clock it might have. The reason for this is that elements arrive arbitrarily and are not predictable. Otherwise, if it were possible to predict the number of elements that a site has based on the current value of the clock, then there would be no need for communication at all. Notice that if our sites and the coordinator only have access to local information and a source of randomization, then all decisions they make about whether a message will be sent or not stem from these two sources. Finally, we will formulate all lower bounds in the number of messages, disregarding the size of messages.

2.2.1 One-way communication lower bound

In this section we will prove that we actually need the ability to perform two-way communication to achieve the upper bound of Theorem 1. This is done by showing that one-way communication needs at least the following amount of communication. Also notice that we assume that N is so large that it dominates k and 1/ε in the O-notation.

Theorem 2 If only the sites can communicate with the coordinator, but not the other way around, then any randomized algorithm for the count-tracking problem that, at any time, estimates n within error εn with probability at least 0.9 must send Ω(k/ε · log N) messages.

Proof. First of all, we define a hard input distribution µ (a small sampler sketch follows the case distinction):

(a) With probability 1/2: All N elements arrive at one site that is uniformly picked at random

(b) Else: All N elements arrive in a round-robin fashion with each of the k sites receiving roughlyN/k elements
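The following sketch (representation is ours) draws one input from µ, returning how many of the N elements arrive at each site:

```python
import random

def sample_mu(N, k):
    """Draw one input from the hard distribution mu; returns a list
    load with load[i] = number of elements arriving at site i."""
    if random.random() < 0.5:              # case (a): one random site gets all
        load = [0] * k
        load[random.randrange(k)] = N
    else:                                  # case (b): round-robin, roughly N/k each
        q, r = divmod(N, k)
        load = [q + (1 if i < r else 0) for i in range(k)]
    return load
```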

Now, using Yao's Minimax principle [4], we only need to argue that any deterministic algorithm with success probability at least 0.8 under the distribution µ has expected communication of Ω(k/ε · log N).


Notice that if we use one-way communication, then a site Si can only decide whether it wants to send a message based on its own counter ni and the parameters ε and k. Taking this into consideration, the behaviour of Si is essentially the following: each site Si has a series of thresholds t_i^1, t_i^2, … such that if ni = t_i^j holds, then Si sends its j-th message to the coordinator. We can assume that these thresholds are known from the beginning, since we are only interested in the lower bound.

Now we divide the incoming elements into Ω(1/ε · log N) rounds. We achieve this by defining Wj as the number of elements that have arrived up to round j. If we set W1 := k/ε and W_{j+1} := ⌈(1 + ε)Wj⌉ for j ≥ 1, then there are 1/ε · log(εN/k) rounds, which is in Ω(1/ε · log N) if N is large enough, as mentioned earlier.
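To make the round structure concrete, a small sketch (parameter values in the comment are illustrative) that counts the rounds defined by W1 = k/ε and W_{j+1} = ⌈(1 + ε)Wj⌉:

```python
import math

def count_rounds(k, eps, N):
    """Number of rounds until N elements have arrived, following
    W_1 = k/eps and W_{j+1} = ceil((1 + eps) * W_j)."""
    W, rounds = k / eps, 1
    while W < N:
        W = math.ceil((1 + eps) * W)
        rounds += 1
    return rounds

# count_rounds(100, 0.1, 10**9) returns about 146, close to
# log(eps * N / k) / log(1 + eps) = log(10**6) / log(1.1) ~ 145.
```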

The next step is to look at the beginning of a new round j + 1 and assume that S1, S2, …, Sk have already sent z_1^j, z_2^j, …, z_k^j messages to the coordinator. Now we define

$$t_{\max}^{j+1} := (1+\varepsilon) \cdot \max\{\, t_i^{z_i^j} \mid i = 1, 2, \ldots, k \,\}$$

based on the biggest threshold reached over all sites up to the end of the old round j. It is not hard to see that there must be at least k/2 sites whose next thresholds satisfy t_i^{z_i^j+1} ≤ t_max^{j+1}: otherwise, if there are fewer than k/2 such sites, then with probability at least 1/4 case (a) happens and the site Si chosen to receive all elements satisfies t_i^{z_i^j+1} > t_max^{j+1} ≥ (1 + ε)·t_i^{z_i^j}. This implies that our algorithm fails when the t_max^{j+1}-th element arrives, with probability at least 1/4, which is bigger than our allowed failure probability of 1 − 0.9 = 0.1, contradicting our success guarantee.

On the other hand, if (b) happens, then all t_i^{z_i^j} for i ∈ {1, 2, …, k} are at most Wj/k, since all elements come in a round-robin fashion. If we now look at the next εWj elements that arrive at the sites, each site receives εWj/k additional elements. If a site Si has t_i^{z_i^j+1} ≤ t_max^{j+1}, then it must send a message in the current round, since Wj/k + εWj/k ≥ t_max^{j+1} ≥ t_i^{z_i^j+1}. This means that its (z_i^j + 1)-th threshold gets triggered and a message gets sent to the coordinator. By the argument of the previous paragraph, there are at least k/2 sites with t_i^{z_i^j+1} ≤ t_max^{j+1}, so the communication cost of this round is at least k/2. Multiplied by the number of rounds, the total communication is at least Ω(k/ε · log N). □

2.2.2 Two-way communication lower bound

In this section we will prove two lower bounds for the case that two-way communication is allowed. The first one justifies the restriction k ≤ 1/ε², since otherwise plain random sampling is already nearly optimal.

Theorem 3 Any randomized algorithm for the count-tracking problem that, at any time, estimates n within error 0.1n with probability at least 0.9 must exchange Ω(k) messages.

Proof. First we fix the same hard input distribution as in Theorem 2, and we only look at the number of sites that communicate with the coordinator at least once. Before any element arrives, we can still assume that each site keeps a triggering threshold; the threshold of a site Si remains the same unless Si communicates with the coordinator at least once. One can argue that the number of sites whose triggering threshold is no more than 1 must be at least k/2: otherwise, if case (a) happens and the randomly chosen site is one with a triggering threshold larger than 1, the algorithm fails, which would be the case with probability at least 1/4, again contradicting the success probability. On the other hand, if case (b) happens, then all sites with threshold 1 have to communicate with the coordinator at least once: either their thresholds are triggered, or they receive a message from the coordinator which changes their threshold. Hence at least k/2 sites communicate, which yields Ω(k) messages. □


Before we get to the second theorem, we introduce a primitive problem that we will use in the proof of the second lower bound. Let s be defined as

$$s := \begin{cases} k/2 + \sqrt{k} & \text{with probability } 1/2,\\ k/2 - \sqrt{k} & \text{otherwise.} \end{cases} \qquad (2)$$

Definition 1 (1-BIT) From all sites S1, S2, …, Sk, a subset of s sites, with s as defined in (2), is picked uniformly at random. Each of those s sites holds a bit of 1, while the other k − s sites hold a bit of 0. The goal of this communication problem is for the coordinator to learn the value of s with probability at least 0.8.
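A small sketch of how a 1-BIT instance could be generated (the representation is ours):

```python
import math, random

def sample_one_bit_instance(k):
    """Draw a 1-BIT instance: s from (2), then s sites picked uniformly
    at random hold a 1. Returns (bits, s); s is the hidden value."""
    delta = int(math.sqrt(k))
    s = k // 2 + (delta if random.random() < 0.5 else -delta)
    bits = [0] * k
    for i in random.sample(range(k), s):   # s sites chosen uniformly
        bits[i] = 1
    return bits, s
```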

Now we will show the following lower bound for the 1-BIT problem:

Lemma 2 Any deterministic algorithm that solves 1-BIT has distributional communication complexity Ω(k).

Proof. Notice that whenever the coordinator communicates with a site, the site can send its whole input, which is just one bit, to the coordinator. Afterwards, the coordinator knows all bits that this site will ever receive and has no need to communicate with this site again. The quantity we need to bound is therefore the number of sites with which the coordinator communicates. There can be two types of actions in the protocol:

(a) A site sends its bit to the coordinator, based on the bit it has.
(b) The coordinator, based on the accumulation of all the information it has, asks a site to send its bit.

Also note that if a type (b) communication happens before a type (a) communication, then we can always swap the two, since this only gives the coordinator more information at an earlier stage. Thus we can assume that all type (a) communication happens before any type (b) communication.

Now we look at the first phase, in which all type (a) communication happens. Let x be the number of sites that send bit 0 to the coordinator and y be the number of sites that send bit 1 to the coordinator. If E[x + y] = Ω(k), then our lemma is proven. So for the rest of the proof we can assume E[x + y] = o(k). By Markov's inequality, with probability at least 0.9 we have x + y = o(k). After the first phase is completed, the problem becomes the following: there are s′ := s − y = s − o(k) sites having bit 1, and the total number of sites that did not send a message is k′ := k − x − y = k − o(k). The success probability with which the coordinator needs to learn s′ now has to be at least 0.8 − (1 − 0.9) = 0.7.

Next we look at the second phase, in which all type (b) communication happens. Since the original s sites were chosen uniformly at random and the coordinator has no extra information besides the messages received in the first phase, the best the coordinator can do is to query a site chosen uniformly at random from the remaining k′ sites. Even after the coordinator obtains the bit of a sampled site, this information cannot be used to find out which of the remaining sites still hold a bit of 1. This means that in the second phase the problem essentially becomes: the coordinator picks z sites out of the remaining k′ sites and has to approximate the value of s with probability at least 0.7. We call this the sampling problem. We now need to prove that z must be at least Ω(k) to achieve a success probability of at least 0.7. Since this proof is rather long, and since it proves a property of a problem different from the original 1-BIT problem, we defer it to Lemma 3 below. With this we can conclude the proof, since all cases lead to communication of at least Ω(k).

□

Lemma 3 Any solution to the sampling problem with success probability at least 0.7 needs at least Ω(k) communication.


Proof. Throughout the proof we use the values k′ and s′ as defined in the proof of Lemma 2; that is, s′ is the number of sites that hold a 1 but have not sent a message to the coordinator, and k′ is the total number of sites that have not sent a message to the coordinator. Additionally, we reuse y from Lemma 2 as the number of sites that have sent the bit 1 to the coordinator.

Now we assume for contradiction that the coordinator samples only z = o(k) sites. We define X as the number of sampled sites with bit 1. This means that X follows the hypergeometric distribution with the probability density function (called pdf going forward)

$$\Pr[X = x] = \binom{s'}{x}\binom{k'-s'}{z-x}\bigg/\binom{k'}{z}.$$

Notice that E[X] = z/k′ · s′, which is either z/k′ · (k/2 − y + √k) or z/k′ · (k/2 − y − √k), depending on the value of s. We define p := (k/2 − y)/k′ = 0.5 ± o(1) and α := √k/k′ = (1 ± o(1))/√k.

In the following we assume that X is picked randomly from one of the two normal distributions N1(µ1, σ1²) and N2(µ2, σ2²) with equal probability, where µ1 := z(p − α), µ2 := z(p + α) and σ1 = σ2 = Θ(√(zp(1 − p))) = Θ(√z). Feller [9] shows that the normal distribution approximates the hypergeometric distribution very well if z is large and p ± α are constants in the interval (0, 1). What is left to do is to decide from which of the two distributions X was drawn, based on the value of X, with success probability at least 0.7.

As a next step, let f1(x; µ1, σ1²) and f2(x; µ2, σ2²) be the pdfs of the two normal distributions N1 and N2. It is easy to see that the best deterministic algorithm for differentiating the two distributions based on the value of a sample X does the following:

• If X > x0, then decide that X was chosen from N2; otherwise decide that X was chosen from N1. Here x0 is the value with f1(x0; µ1, σ1²) = f2(x0; µ2, σ2²).

Notice that this implies µ1 < x0 < µ2. Also note that if the algorithm decided on N1 for some X > x0, we could always flip this decision and improve the success probability of the algorithm.

Now we take a look at the error rate of this approach. The error comes from two sources: first, X > x0 while X was actually drawn from N1; and second, X ≤ x0 while X was actually drawn from N2. The sum of these errors is

$$1/2 \cdot \left(\Phi(-l_1/\sigma_1) + \Phi(-l_2/\sigma_2)\right)$$

with l1 := x0 − µ1 and l2 := µ2 − x0. Notice that these definitions imply l1 + l2 = µ2 − µ1 = 2αz. Here Φ(·) denotes the cumulative distribution function (called cdf in the following) of the standard normal distribution.

Finally, note that l1/σ1 = O(αz/√z) = O(√(z/k)) = o(1), and likewise l2/σ2 = o(1). This implies Φ(−l1/σ1) + Φ(−l2/σ2) > 0.99, so the failure probability is at least 0.49, contradicting our success guarantee. This means that we must have z = Ω(k), which completes the proof.

□
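The core of Lemma 3 can be checked numerically. The following Monte Carlo sketch (all names are ours) estimates the error of the threshold test between N1 and N2 for p = 0.5 and α = 1/√k; for z ≪ k the error stays near 1/2, while for z = Θ(k) it drops:

```python
import math, random

def distinguish_error(k, z, trials=100_000):
    """Estimate the error of the optimal threshold test between
    N1(z(p - a), zp(1-p)) and N2(z(p + a), zp(1-p)), p = 0.5, a = 1/sqrt(k)."""
    p, a = 0.5, 1 / math.sqrt(k)
    sigma = math.sqrt(z * p * (1 - p))
    mu1, mu2 = z * (p - a), z * (p + a)
    x0 = (mu1 + mu2) / 2                  # equal variances: midpoint threshold
    errors = 0
    for _ in range(trials):
        if random.random() < 0.5:         # truth is N1; error if X > x0
            errors += random.gauss(mu1, sigma) > x0
        else:                             # truth is N2; error if X <= x0
            errors += random.gauss(mu2, sigma) <= x0
    return errors / trials

# distinguish_error(10_000, 100)    -> roughly 0.42 (z << k: near coin flip)
# distinguish_error(10_000, 10_000) -> roughly 0.02 (z = k: easy to tell apart)
```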

Now we can finally prove our second lower bound for the count-tracking problem.

Theorem 4 Any randomized algorithm for the count-tracking problem that, at any time, estimates n within error εn with probability at least 0.9 must exchange Ω(√k/ε · log N) messages, when k ≤ 1/ε².

Proof. For this proof we again fix a hard input distribution first, and focus on the distributional communication complexity of deterministic algorithms with success probability at least 0.8. Additionally, we define [m] := {0, 1, …, m − 1}.

Now we consider an adversary which sends us an input consisting of l := log(εN/k) = Ω(log N) rounds. The second equality holds since we assume that N dominates k and 1/ε, as stated at the beginning of Section 2.2.1. We further divide each round i ∈ [l] into r := 1/(2ε√k) subrounds. The input that is sent in each round i ∈ [l] is constructed by the adversary as follows: for each subround j ∈ [r] we select s as defined in (2), then choose s sites uniformly at random and send 2^i elements to each of those sites.

If we take a look at the total number of elements sent in each round i ∈ [l], we see that at most rs · 2^i < τ_i := √k/ε · 2^i elements were sent. Notice that the allowed deviation εn is therefore also bounded: if we are in round q, then εn < ε · Σ_{i=1}^{q} τ_i must hold. This implies that in every round i ∈ [l] the algorithm must not deviate by more than ετ_i, since otherwise the summed-up deviation would exceed εn. This means that after s · 2^i elements have arrived in a subround, the algorithm needs to correctly identify the value of s with probability at least 0.8; otherwise it deviates from the true count by at least √k · 2^i ≥ ετ_i, violating the success guarantee of the algorithm. But this is exactly the 1-BIT problem of Definition 1, so we can use Lemma 2 to obtain a lower bound of Ω(k) for the communication of each subround. Combining this lower bound with the number of rounds and the number of subrounds per round, we get l · r · Ω(k) = Ω(√k/ε · log N). □

3 Tracking Distributed Frequencies

The frequency-tracking problem is similar to count-tracking, but instead of tracking just one counter we want to track the frequency, i.e. the number of occurrences, of many different types of elements. In the following we call a specific type of element an 'item', and the goal is to track the frequency of any item j within error εn. We also define fij as the local frequency of item j at site Si. Finally, we define f_j := Σ_{i=1}^{k} fij.

In addition to these new definitions, we will continue to use the definitions of the count-tracking section wherever their meaning has not changed. For example, we will continue to use ni without further remark for the total number of elements that have arrived at site Si.

3.1 The algorithm and upper bound

3.1.1 The algorithm with a fixed p

Similar to Section 2.1.1, we first describe the algorithm under the assumption that we have a fixed value of p. If each site kept a counter for every fij, we could simply use count-tracking to solve the problem. The issue with this solution is of course that each site needs space linear in the total number of distinct items, which can be arbitrarily large. To solve this space issue we use an algorithm due to Manku and Motwani [5] at each site Si: the site maintains a list of counters called Li. Whenever a new element of item j arrives at Si, the site first checks whether a counter cij already exists. If there is such a counter, the site increments cij by one. After the counter has been incremented, the site sends the new value of cij to the coordinator with probability p. If no counter exists, we create cij = 1 with probability p and send it to the coordinator. It is not hard to see that the expected size of Li is O(pni). Notice that when a new element arrives and a counter already exists, we always increment it but only send it to the coordinator with probability p. Also note that there can be elements that were not sampled because no counter for the respective item existed when they arrived and the creation of a new counter failed; the probability of this failure is 1 − p. Additionally, we denote the last value sent from site Si for item j by c̄ij.
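A per-site sketch of this counter-based sampling (the class name and the send callback are ours; send(item, value) stands for a message c̄ij to the coordinator):

```python
import random

class FrequencySite:
    """Maintains the counter list L_i for one site under a fixed p."""

    def __init__(self, p, send):
        self.p = p
        self.counters = {}        # L_i: item j -> c_ij
        self.send = send          # send(item, value) delivers the new \bar{c}_ij

    def receive(self, item):
        if item in self.counters:
            self.counters[item] += 1            # always increment ...
            if random.random() < self.p:        # ... but send only w.p. p
                self.send(item, self.counters[item])
        elif random.random() < self.p:          # create c_ij = 1 w.p. p
            self.counters[item] = 1
            self.send(item, 1)
```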

The new problem that arises from this algorithm is the estimation of fij at the coordinator. As mentioned earlier, we now have two possible sources of error: first, the elements that were not sampled, i.e. for which no counter existed or was created when they arrived; second, the changes of cij that were not sent to the coordinator. It is not hard to see that both of these errors follow the same distribution as ni − n̄i in the count-tracking problem, since both errors also depend on the probability p and behave similarly. Thus the intuitive estimator for fij would be:

$$\tilde{f}_{ij} := \begin{cases} \bar{c}_{ij} - 2 + 2/p & \text{if } \bar{c}_{ij} \text{ exists},\\ 0 & \text{else.} \end{cases} \qquad (3)$$

Contrary to first intuition, this estimator is actually biased, and its bias might be as big as Θ(εn/√k). Summed over the k sites, this would exceed our error guarantee. To see this, consider all copies of item j that arrive at site Si. Effectively, the site samples every copy with probability p, while c̄ij − 2 is exactly the number of copies between the first and the last sampled copy, excluding both. We define two random variables X1 and X2 to formalize this concept:

$$X_1 := \begin{cases} t_1 & \text{if the } t_1\text{-th copy is the first one sampled},\\ f_{ij} + 1/p & \text{if none is sampled.} \end{cases}$$

Notice that if the t1-th copy is the first one sampled, then the t1-th element is the one that created the counter cij. Also note that if the t2-th element is the first one sampled in reversed order, then it is also the last element sent to the coordinator in the original order.

$$X_2 := \begin{cases} t_2 & \text{if the } t_2\text{-th copy is the first one sampled in reversed order},\\ f_{ij} + 1/p & \text{if none is sampled.} \end{cases}$$

It is clear that X1 and X2 have the same distribution, with E[X1] = E[X2] = 1/p by Lemma 1. This means that the estimator should be fij − (X1 + X2) + 2/p. Since it holds that c̄ij − 2 = fij − (X1 + X2), the corrected unbiased estimator should be:

$$\hat{f}_{ij} := \begin{cases} \bar{c}_{ij} - 2 + 2/p & \text{if } \bar{c}_{ij} \text{ exists},\\ -f_{ij} & \text{else.} \end{cases} \qquad (4)$$

If we compare this unbiased estimator with the intuitive one in (3), the only difference lies in the case in which no counter exists for item j at site Si. When fij = Θ(εn/√k) and p = Θ(1/fij), this case happens with constant probability, so the resulting bias of (3) would be Θ(fij) = Θ(εn/√k).

The problem with the corrected estimator (4) is that it depends on fij, the very value that we want to estimate. The workaround is to do another, independent sampling of fij, called dij: dij gets incremented and sent to the coordinator with probability p for each arriving copy, as long as no actual counter cij is present yet. Once cij exists, we can safely ignore the sampling of dij. Now we can define the final estimator of fij as:

$$f'_{ij} := \begin{cases} \bar{c}_{ij} - 2 + 2/p & \text{if } \bar{c}_{ij} \text{ exists},\\ -d_{ij}/p & \text{else.} \end{cases} \qquad (5)$$
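On the coordinator side, estimator (5) can be sketched as follows (a hypothetical helper; c_bar is the last received counter value or None, d the independently sampled counter):

```python
def estimate_frequency(c_bar, d, p):
    """Unbiased per-site estimate of f_ij following (5)."""
    if c_bar is not None:
        return c_bar - 2 + 2 / p    # a counter exists
    return -d / p                   # no counter: negative correction term
```

The coordinator's estimate of f_j is then the sum of these per-site values.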

This final estimator of fij is still unbiased because of the independence of dij and c̄ij. Now we can analyse the variance of (5). Intuitively, the variance should not be affected by the introduction of the second independent counter dij, since it is only relevant for relatively small values of fij, and even for those only with relatively small probability. A formal proof follows:

Lemma 4 E[f′ij] = fij and Var[f′ij] = O(1/p²).


Proof. We start by analysing the corrected estimator f̂ij from (4). That E[f̂ij] = fij holds follows directly from the construction above. The variance of this estimator is Var[f̂ij] = Var[X1 + X2]. Notice that these two random variables are not independent, but both have expected value 1/p and variance at most 1/p². First we formulate an upper bound for Var[X1 + X2]:

$$\begin{aligned}
\mathrm{Var}[X_1 + X_2] &= E[X_1^2 + X_2^2 + 2X_1X_2] - E[X_1 + X_2]^2\\
&= \mathrm{Var}[X_1] + E[X_1]^2 + \mathrm{Var}[X_2] + E[X_2]^2 + 2E[X_1X_2] - (E[X_1] + E[X_2])^2\\
&\le 4/p^2 + 2E[X_1X_2] - 4/p^2 \le 2E[X_1X_2].
\end{aligned}$$

We define E_t as the event that the t-th copy of j is the first one sampled. Then we have

$$\begin{aligned}
E[X_1X_2] &= \sum_{t=1}^{f_{ij}} (1-p)^{t-1} p \cdot t \cdot E[X_2 \mid E_t] + (1-p)^{f_{ij}} (f_{ij} + 1/p)^2\\
&= \sum_{t=1}^{f_{ij}} (1-p)^{t-1} p \cdot t \left( (1-p)^{f_{ij}-t}(f_{ij}-t+1) + \sum_{l=1}^{f_{ij}-t} (1-p)^{l-1} p \cdot l \right) + (1-p)^{f_{ij}} (f_{ij}+1/p)^2\\
&\le 1/p^2 + (1-p)^{f_{ij}} f_{ij}^2 + \frac{(1-p)^{f_{ij}} f_{ij}}{p}.
\end{aligned}$$

Now we define c := fij · p. If c ≤ 2 we are done, since in this case fij ≤ 2/p and the variance is in O(1/p²). For the rest of the proof we assume c > 2, in which case c² ≤ e^c holds. Putting it all together, we get

$$E[X_1X_2] \le \frac{1}{p^2} + \frac{c^2 + c}{p^2 e^c} = O(1/p^2).$$

Next we look at the final estimator f′ij from (5). Note that dij is the sum of fij Bernoulli random variables with success probability p, which means that E[dij/p] = fij and Var[dij/p] ≤ fij · p/p² = fij/p. We now define E⋆ as the event that at least one element of item j is sampled, meaning that c̄ij is available to the coordinator, and E0 as the complementary event that none is sampled. The expected value of f′ij is

$$\begin{aligned}
E[f'_{ij}] &= E[\hat{f}_{ij} \mid E_\star] \Pr[E_\star] + E[-d_{ij}/p \mid E_0] \Pr[E_0]\\
&= E[\hat{f}_{ij} \mid E_\star] \Pr[E_\star] + (-f_{ij}) \Pr[E_0]\\
&= E[\hat{f}_{ij}] = f_{ij}
\end{aligned}$$

and the variance is

$$\begin{aligned}
\mathrm{Var}[f'_{ij}] &= E[f'^2_{ij}] - E[f'_{ij}]^2\\
&= E[\hat{f}_{ij}^2 \mid E_\star] \Pr[E_\star] + E[(d_{ij}/p)^2 \mid E_0] \Pr[E_0] - f_{ij}^2\\
&= E[\hat{f}_{ij}^2 \mid E_\star] \Pr[E_\star] - f_{ij}^2 + E[(d_{ij}/p)^2] \Pr[E_0]\\
&= E[\hat{f}_{ij}^2 \mid E_\star] \Pr[E_\star] - f_{ij}^2 + (\mathrm{Var}[d_{ij}/p] + f_{ij}^2) \Pr[E_0].
\end{aligned}$$


Also notice that the following holds:

$$\begin{aligned}
\mathrm{Var}[\hat{f}_{ij}] &= E[\hat{f}_{ij}^2] - f_{ij}^2\\
&= E[\hat{f}_{ij}^2 \mid E_\star] \Pr[E_\star] + E[\hat{f}_{ij}^2 \mid E_0] \Pr[E_0] - f_{ij}^2\\
&= E[\hat{f}_{ij}^2 \mid E_\star] \Pr[E_\star] + f_{ij}^2 \Pr[E_0] - f_{ij}^2.
\end{aligned}$$

Using this in the prior equation, we obtain

$$\mathrm{Var}[f'_{ij}] = \mathrm{Var}[\hat{f}_{ij}] + \mathrm{Var}[d_{ij}/p] \Pr[E_0] \le \mathrm{Var}[\hat{f}_{ij}] + \frac{f_{ij}}{p} \cdot (1-p)^{f_{ij}}.$$

For the same reason as above, the second term is O(1/p²), which completes the proof. □

3.1.2 Dealing with a decreasing p


Fig. 4 S1, …, Sk represent the sites and C the coordinator. Li represents the current cij counters and Di the current dij counters of site i. Whenever a new element of type j arrives, we increment cij by one if it exists and send it out with probability p as c̄ij. If cij does not exist, then we create cij = 1 with probability p and send it to C. n̄ represents the sum over the received counters for the current round. At the beginning of a new round, a new n̄ is sent.

For the actual algorithm we use the count-tracking approach of dividing the whole time period into O(log N) rounds. This means that each site must send out an additional message whenever the counter of a specific item j reaches twice the last sent value, i.e. cij = 2c̄ij. The coordinator in turn broadcasts the sum n̄ := Σ_{i=1}^{k} Σ_{j∈T} c̄ij over the last sent counters, where T is the set of all distinct items, whenever it changes by a factor between 2 and 4. If the coordinator has not received any value c̄ij, we simply treat c̄ij as well as dij as zero. This approach gives each site a constant-factor approximation of n through n̄ at all times (cf. Figure 4). Also notice that within each round n̄ remains the same, and the arrival of a different n̄ starts the new round.

At the beginning of a new round we set p = 1/⌊εn̄/√k⌋₂. Then all sites clear their memory and start a new copy of the algorithm from scratch with the new p. This implies that for every item j and each site Si, the coordinator receives independent counters from each round; the overall estimator is then the sum of these independent per-round estimators. Notice that the variance of a round is bounded by O(k/p²) and that 1/p increases geometrically over the rounds. This means that the total variance is asymptotically bounded by the variance of the last round, which is O((εn̄)²).
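A small sketch of the probability used at a round start, with ⌊x⌋₂ implemented explicitly (the helper name is ours):

```python
import math

def round_probability(eps, n_bar, k):
    """p = 1 / floor_2(eps * n_bar / sqrt(k)), where floor_2(x) is the
    largest power of two not exceeding x (clamped to 1 for small x)."""
    x = eps * n_bar / math.sqrt(k)
    floor2 = 2 ** math.floor(math.log2(x)) if x >= 1 else 1
    return 1 / floor2
```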

However, if one site receives all O(n̄) elements of a round, then the space bound of this site would be O(pn̄) = O(√k/ε). In order to reduce this, we restrict each site to at most n̄/k elements per round. If a site receives more than n̄/k elements in a round, it clears its memory and sends a notification to the coordinator; afterwards, it starts from scratch as a new virtual site. The coordinator treats it as a new site and assumes that the original site simply does not receive any more elements for the remainder of the round. This means that the space at each site is at most pn̄/k = O(1/(ε√k)). Notice that there are at most O(k) of these new virtual sites, which means that the variance is only affected by a constant factor.

Now, if we look at the total communication of a single round, we see that it is essentially the same as for the count-tracking problem. Maintaining the constant-factor approximation of n takes O(k) messages. Additionally, for each arriving element we send out at most two messages, each with probability p: one for the normal sampling of cij, and one for the independent sampling that maintains dij. This takes O(pn̄) communication. Putting it all together, we get O((k + pn̄) log N) = O(√k/ε · log N) total communication over all rounds for k ≤ 1/ε². This leads us to the following result:

Theorem 5 There is an algorithm for the frequency-tracking problem that, at any time, estimates the frequency of any element within error εn with probability at least 0.9. It uses O(1/(ε√k)) space at each site and O(√k/ε · log N) communication.

3.2 Communication-space trade-off

For the lower bound of the frequency-tracking problem we can also use knowledge gained from the count-tracking problem. In fact, the lower bound of the count-tracking problem is also a lower bound for the frequency-tracking problem. To show this, we prove the following space-communication trade-off.

Theorem 6 Consider any randomized algorithm for the frequency-tracking problem that, at any time, estimates the frequency of any element within error εn with probability at least 0.9. If the algorithm uses C bits of communication and M bits of space per site, then we must have CM = Ω(log N/ε²), assuming k ≤ 1/ε².

Notice that if Theorem 6 holds, then the communication cost of C = O(√k/ε · log N) implies a space usage at each site of at least M = Ω(1/(ε√k)) bits. Hence, if we ignore the difference between words and bits, the bound in Theorem 5 is tight. We now prove Theorem 6:

Proof. First of all, this proof uses a result by Woodruff and Zhang [6] which states that, in the k-party communication model, there is an input distribution µk such that for k ≤ 1/ε² the following holds: any algorithm that solves the one-shot version of the problem under µk within error 2εn with probability 0.9 needs at least c√k/ε bits of communication for some constant c. Additionally, any algorithm that solves l independent copies of the one-shot version needs at least l · c√k/ε bits of communication.

In our proof we consider the problem mentioned above for ρk sites, for some value ρ ≥ 1 that we determine later. Again, we first divide the whole time period into log N rounds. In each round i = 1, 2, …, log N, we generate the incoming elements using the distribution µ_{ρk} independently. Moreover, for each round we use a fresh domain for µ_{ρk} in order to get log N different and independent instances of the problem. The actual input generation goes as follows: for each round i, for every element e picked from µ_{ρk}, for any site, we replace e with 2^{i−1} copies of e. Then we arrange the elements such that S1 receives all its elements first, then S2 receives all its elements, and so forth. Through this arrangement it becomes clear that for the lower bound it suffices to solve the frequency-tracking problem only after the last site has received all its elements for the current round. This means that we only need to solve the frequency-tracking problem once per round, i.e. we need to solve log N independent instances of the one-shot version. Combined with the results from [6], this means that we need at least c√(ρk)/ε · log N communication over all rounds.

Now we define Ak as a continuous tracking algorithm for the frequency-tracking problem with total communication C bits and M bits of space at each of the k sites. In the following we show how to solve our problem over ρk sites in each round using Ak. For each round we start the simulation with S1, S2, …, Sk. Whenever Ak exchanges a message, we do the same. When S1 has received all its allowed elements, it sends its memory to site S_{k+1}, and S_{k+1} takes over the role of S1. Analogously, when a site Sj has received all its allowed elements, it sends its memory to site S_{k+j}, and S_{k+j} takes over the role of Sj. After S_{ρk} is done, we have finished the workload for the current round; S_{ρk} then sends a broadcast message and we proceed to the next round.

Notice that we exchange exactly the same messages as Ak does, which means communication C. Additionally, we communicate at most ρk memory snapshots and one broadcast message per round. Putting everything together, the total communication satisfies

$$C + \rho k M \log N \ge c\sqrt{\rho k}/\varepsilon \cdot \log N$$
$$\Rightarrow\quad M \ge \frac{c}{\varepsilon\sqrt{\rho k}} - \frac{C}{\rho k \log N} = \frac{1}{\sqrt{\rho k}}\left(\frac{c}{\varepsilon} - \frac{C}{\sqrt{\rho k}\,\log N}\right).$$

This means that if we set $\sqrt{\rho} := \left\lceil \frac{2C\varepsilon}{c\sqrt{k}\,\log N} \right\rceil$, then we have

$$M \ge \frac{c}{2\varepsilon\sqrt{\rho k}} = \Omega\left(\frac{\log N}{C\varepsilon^2}\right), \quad\text{i.e.}\quad CM = \Omega\left(\frac{\log N}{\varepsilon^2}\right),$$

which completes the proof. □

4 Tracking Distributed Ranks

For a single stream of n elements, an algorithm that produces an unbiased estimator for any rank with variance O((εn)²) was presented in [8] and improved in [7] to work in a stronger model. It uses O(1/ε · log^{1.5}(1/ε)) working space to maintain a data structure of size O(1/ε) that is used for the rank estimation. In the following we call this algorithm A and use it as a black box to solve the rank-tracking problem.


4.1 The basic algorithm

As in the previous two sections, we use the broadcast technique introduced earlier to obtain a constant-factor approximation of n through n̄ at each site; this takes O(k · log N) communication and again divides the whole time period into O(log N) rounds. We have Θ(n̄) elements arriving in each round, but a major difference is that for the rank-tracking problem each site is allowed to hold more than n̄/k elements at the same time. The reason is that we run an algorithm called C on a chunk of at most n̄/k elements: if more than n̄/k elements arrive, we finish the computation of the current instance of C and start another instance, which then processes the next n̄/k elements, and so forth.

4.2 The algorithm C

As mentioned earlier, the algorithm C receives at most n̄/k elements and starts by dividing these elements into blocks of size b := εn̄/√k. This means that there are at most 1/(ε√k) blocks. The algorithm builds a balanced binary tree on the blocks, in the order of arrival of the elements; notice that the height of the resulting tree is h ≤ log(1/(ε√k)). For a node v, we write D(v) for the set of elements contained in the leaves of the subtree rooted at v. For each D(v) we run an instance of A, which we call Av; note that Av processes the elements of D(v) as they arrive. In the following we call v active if Av is still accepting new elements, and full if all elements of D(v) have arrived. We count the levels of the tree so that each leaf is at level 0, the direct parents of the leaves are at level 1, and so forth. We set the error parameter of Av for a node v at level l to 2^{−l}/√h. Additionally, when v becomes full, the site sends the summary computed by Av to the coordinator and deletes Av from local memory. Finally, each arriving element is also sampled with probability p := √k/(εn̄) and sent to the coordinator if it is sampled.
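The following structural sketch (all names ours) mirrors this bookkeeping; the summary algorithm A is abstracted to a plain buffer, so only the block size, the per-level node capacities, the error parameters, and the extra sampling follow the description above:

```python
import math, random

class ChunkAlgorithmC:
    """Processes one chunk of at most n_bar/k elements."""

    def __init__(self, eps, k, n_bar, send):
        self.b = max(1, round(eps * n_bar / math.sqrt(k)))     # block size
        n_blocks = max(1, round(1 / (eps * math.sqrt(k))))     # at most 1/(eps sqrt k)
        self.h = max(1, math.ceil(math.log2(n_blocks)))        # tree height
        self.p = math.sqrt(k) / (eps * n_bar)                  # sampling probability
        self.send = send                                       # message to coordinator
        self.buf = {l: [] for l in range(self.h + 1)}          # one active A_v per level

    def receive(self, x):
        for level, elems in self.buf.items():
            elems.append(x)                                    # stub for A_v.update(x)
            if len(elems) == self.b * 2 ** level:              # node v is now full
                err = 2 ** (-level) / math.sqrt(self.h)        # error parameter of A_v
                self.send(("summary", level, err, elems))
                self.buf[level] = []                           # next node at this level
        if random.random() < self.p:                           # extra element sampling
            self.send(("sample", x))
```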

4.3 Upper space and communication bound

First of all, notice that C has at most h active nodes, one at each level. This means that the total space used by C is

$$\sum_{l=0}^{h} \sqrt{h}\, 2^l \cdot \log^{1.5}(1/\varepsilon) = O\left(\frac{\sqrt{h}}{\varepsilon\sqrt{k}}\, \log^{1.5}(1/\varepsilon)\right).$$

Notice that the communication for C includes all computed summaries as well as all sampled elements. For each level l, the size of all summaries is

$$O\left(\frac{1}{\varepsilon\sqrt{k}}\, 2^{-l} \cdot 2^l\sqrt{h}\right) = O\left(\frac{\sqrt{h}}{\varepsilon\sqrt{k}}\right).$$

If we now sum this value over all h levels, we obtain a total size of O(h^{1.5}/(ε√k)) for all summaries. Notice that we have at most 2k instances of the algorithm C in a single round, which means that the total communication cost of a round is O(h^{1.5}√k/ε). As in the count-tracking and frequency-tracking problems, the number of sampled elements is O(n̄p) = O(√k/ε). Summing over all O(log N) rounds, we get a total communication cost of O(h^{1.5}√k/ε · log N).

4.4 Estimation of the rank

Now all that is left is to show how the coordinator actually estimates the rank of any given element x at any time with variance O((εn)²). First we decompose all n elements into smaller subsets and estimate the rank of x within each of those subsets. Notice that each estimator is unbiased, which means that the combined estimator is unbiased as well; moreover, the total variance is the sum of the variances of the individual estimators.
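Sketched on the coordinator side (all names ours; each entry of `summaries` is a callable returning an unbiased estimate of the rank of x within one full node's elements):

```python
def estimate_rank(x, summaries, sampled, p):
    """Combine per-subset estimates: summaries cover the full nodes,
    sampled holds the elements forwarded with probability p that are
    not yet covered by any summary."""
    rank = sum(estimate(x) for estimate in summaries)
    c = sum(1 for y in sampled if y < x)     # sampled elements below x
    return rank + c / p                      # scale the sample count by 1/p
```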

Next we focus on one single round. Notice that O(n) = O(n̄) elements arrive in each round. Every chunk of up to n̄/k elements is processed by one instance of C; we now focus on one chunk. Assume that n′ ≤ n̄/k elements have already arrived in this chunk. We write n′ = q · b + r for some r < b and decompose those n′ elements into h + 1 subsets. The first qb elements are decomposed into at most h subsets, each corresponding to a full node in the balanced binary tree of C. Each such node has already sent its summary to the coordinator, which we use to estimate the rank. Finally, notice that for a node at level l the variance is (2^{−l}/√h · 2^l b)² = b²/h, which implies that the total variance from all h nodes is b².

Now we focus on the remaining r elements of the current chunk that are still being processed by an active node. The coordinator does not have a summary for these elements. However, we can use the fact that each site also samples each element with probability p = √k/(εn̄): we simply count the sampled elements among these remaining r elements that are smaller than x. We define c as this count and use c/p as an estimator. Notice that the variance of this estimator is r/p ≤ b/p = b². Combining this insight with the results from the previous paragraph, we get a variance of O(b²) for each chunk. Since there are at most 2k chunks per round, we have a total variance of O(b²k) = O((εn̄)²) = O((εn)²), as desired. Putting it all together, we obtain the following theorem:

Theorem 7 There is an algorithm for the rank-tracking problem that, at any time, estimates the rank of any element within error εn with probability at least 0.9. It uses $O\left(\frac{1}{\varepsilon\sqrt{k}} \log^{1.5}\frac{1}{\varepsilon} \cdot \log^{0.5}\frac{1}{\varepsilon\sqrt{k}}\right)$ space at each site, with communication cost $O\left(\frac{\sqrt{k}}{\varepsilon} \log N \cdot \log^{1.5}\frac{1}{\varepsilon\sqrt{k}}\right)$.

References

1. Zengfeng Huang, Ke Yi, and Qin Zhang. Randomized algorithms for tracking distributed count, frequencies, and ranks. In Proceedings of the 31st Symposium on Principles of Database Systems, pages 295–306. ACM, 2012.

2. Amit Manjhi, Vladislav Shkapenyuk, Kedar Dhamdhere, and Christopher Olston. Finding (recently) frequent items in distributed data streams. In Proceedings of the 21st International Conference on Data Engineering (ICDE 2005), pages 767–778. IEEE, 2005.

3. Phillip B. Gibbons and Srikanta Tirthapura. Estimating simple functions on the union of data streams. In Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '01, pages 281–291, New York, NY, USA, 2001. ACM.

4. Andrew Chi-Chih Yao. Probabilistic computations: Toward a unified measure of complexity. In Proceedings of the 18th Annual Symposium on Foundations of Computer Science, SFCS '77, pages 222–227, Washington, DC, USA, 1977. IEEE Computer Society.

5. Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases, pages 346–357. VLDB Endowment, 2002.

6. David P. Woodruff and Qin Zhang. Tight bounds for distributed functional monitoring. In Proceedings of the Forty-fourth Annual ACM Symposium on Theory of Computing, pages 941–960. ACM, 2012.


7. Pankaj K. Agarwal, Graham Cormode, Zengfeng Huang, Jeff M. Phillips, Zhewei Wei, and Ke Yi. Mergeable summaries. ACM Transactions on Database Systems (TODS), 38(4):26, 2013.

8. Subhash Suri, Csaba D. Tóth, and Yunhong Zhou. Range counting over multidimensional data streams. Discrete & Computational Geometry, 36(4):633–655, 2006.

9. William Feller. An Introduction to Probability Theory and Its Applications. Wiley, 1968.

10. Ram Keralapura, Graham Cormode, and Jeyashankher Ramamirtham. Communication-efficient distributed monitoring of thresholded counts. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 289–300. ACM, 2006.

11. Ke Yi and Qin Zhang. Optimal tracking of distributed heavy hitters and quantiles. Algorithmica, 65(1):206–223, 2013.

12. Vladimir N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971.

13. Graham Cormode, S. Muthukrishnan, Ke Yi, and Qin Zhang. Continuous sampling from distributed streams. Journal of the ACM (JACM), 59(2):10, 2012.

14. Srikanta Tirthapura and David P. Woodruff. Optimal random sampling from distributed streams revisited. In Distributed Computing, pages 283–297. Springer, 2011.