Università di Roma “La Sapienza” — Dipartimento di Informatica e Sistemistica
Middleware Laboratory (MIDLAB)

Publish/Subscribe Systems
Distributed Systems course (Sistemi distribuiti, AA 11/12)
Leonardo [email protected]
Sunday, 6 May 2012
■ Client/server is the most widely used architecture for distributed applications.
■ Communications take place following the request/reply (pull) interaction model.
■ Drawbacks:
■ interaction is limited to two entities (one-to-one);
■ each entity must know how to address its partner in the communication;
■ the two entities must be available at the same time in order to communicate;
■ communication is inherently synchronous;
■ communication is only pull-based.
■ To address these issues, many alternative models were introduced: RPC, shared memory, message queues, publish/subscribe.

[The pub/sub interaction model. Figure: a client and a server exchanging request and reply messages.]
Publish/subscribe was conceived as a comprehensive solution to these problems:
Many-to-many communication model - Interactions take place in an environment where various information producers and consumers can communicate, all at the same time. Each piece of information can be delivered at the same time to various consumers. Each consumer receives information from various producers.
Space decoupling - Interacting parties do not need to know each other: messages are addressed based on their content.
Time decoupling - Interacting parties do not need to be actively participating in the interaction at the same time. Information delivery is mediated through a third party.
Synchronization decoupling - Information flow from producers to consumers is also mediated, thus synchronization among interacting parties is not needed.
Push/Pull interactions - both methods are allowed.
These characteristics make pub/sub perfectly suited for distributed applications relying on document-centric communication.
■ The publish/subscribe communication paradigm:
■ Publishers: produce data in the form of events.
■ Subscribers: declare interests on published data with subscriptions.
■ Each subscription is a filter on the set of published events.
■ An Event Notification Service (ENS) notifies each subscriber of every published event that matches at least one of its subscriptions.
[Basic building blocks. Figure: publishers (information producers) and subscribers (information consumers) interacting through a mediating Event Notification Service via the subscribe, publish, notify and unsubscribe operations.]
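The building blocks above can be sketched in a few lines of illustrative Python (all names here are invented for the example): subscriptions are filters over events, and the ENS notifies every subscriber whose filter matches a published event.

```python
class EventNotificationService:
    """Toy centralized ENS: a mediator between publishers and subscribers."""

    def __init__(self):
        self._subs = []  # list of (subscriber, filter) pairs

    def subscribe(self, subscriber, flt):
        # A subscription is a filter (predicate) on the set of published events.
        self._subs.append((subscriber, flt))

    def unsubscribe(self, subscriber):
        self._subs = [(s, f) for (s, f) in self._subs if s is not subscriber]

    def publish(self, event):
        # Notify each subscriber of every event matching one of its subscriptions.
        notified = set()
        for subscriber, flt in self._subs:
            if flt(event) and id(subscriber) not in notified:
                subscriber.notify(event)
                notified.add(id(subscriber))


class Subscriber:
    def __init__(self):
        self.inbox = []

    def notify(self, event):
        self.inbox.append(event)


ens = EventNotificationService()
alice, bob = Subscriber(), Subscriber()
ens.subscribe(alice, lambda e: e["x"] > 10)
ens.subscribe(bob, lambda e: e["x"] < 0)
ens.publish({"x": 42})   # matches only alice's subscription
```

Note how publishers never address subscribers directly: space, time and synchronization decoupling all come from the mediator holding the only references.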
■ Events represent information structured according to an event schema.
■ The event schema is fixed, defined a priori, and known to all the participants.
■ It defines a set of fields or attributes, each consisting of a name and a type. The allowed types depend on the specific implementation, but basic types (integers, floats, booleans, strings) are usually available.
■ Given an event schema, an event is a collection of values, one for each attribute defined in the schema.
[Event schema and subscription models]
Example: suppose we are dealing with an application whose purpose is to distribute updates about computer-related blogs.
Event schema:

name        | type        | allowed values
blog_name   | string      | ANY
address     | URL         | ANY
genre       | enumeration | [hardware, software, peripherals, development]
author      | string      | ANY
abstract    | string      | ANY
rating      | integer     | [1-5]
update_date | date        | > 1-1-1970 00:00

Event:

name        | value
blog_name   | Prad.de
address     | http://www.prad.de/en/index.html
genre       | peripherals
author      | Mark Hansen
abstract    | “The review of the new TFT panel...”
rating      | 4
update_date | 26-4-2006 17:58
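As a sketch (not from the original slides), the blog-update schema and the example event can be checked against each other; the attribute names come from the slide, while the validation helper and the way types are modeled are assumptions.

```python
# Event schema: attribute name -> type (a simplified stand-in for the
# slide's "type + allowed values" columns).
SCHEMA = {
    "blog_name": str,
    "address": str,          # URL, modeled as a plain string here
    "genre": str,            # enumeration, checked separately below
    "author": str,
    "abstract": str,
    "rating": int,
    "update_date": str,      # date, modeled as a string here
}
GENRES = {"hardware", "software", "peripherals", "development"}


def is_valid_event(event):
    """An event is a collection of values, one per attribute of the schema."""
    if set(event) != set(SCHEMA):
        return False
    if not all(isinstance(event[name], t) for name, t in SCHEMA.items()):
        return False
    return event["genre"] in GENRES and 1 <= event["rating"] <= 5


event = {
    "blog_name": "Prad.de",
    "address": "http://www.prad.de/en/index.html",
    "genre": "peripherals",
    "author": "Mark Hansen",
    "abstract": "The review of the new TFT panel...",
    "rating": 4,
    "update_date": "26-4-2006 17:58",
}
```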
■ Subscribers express their interest in specific events by issuing subscriptions.
■ A subscription is, generally speaking, a constraint expressed on the event schema.
■ The Event Notification Service will notify a subscriber x of an event e only if the values that define the event satisfy the constraint defined by one of the subscriptions s issued by x. In this case we say that e matches s.
■ Subscriptions can take various forms, depending on the subscription language and model employed by each specific implementation.
■ Example: a subscription can be a conjunction of constraints, each expressed on a single attribute. In this case each constraint can be as simple as a comparison operator (<, =, >) applied to an integer attribute, or as complex as a regular expression applied to a string.
■ From an abstract point of view the event schema defines an n-dimensional event space (where n is the number of attributes).
■ In this space each event e represents a point.
■ Each subscription s identifies a subspace.
■ An event e matches the subscription s if, and only if, the corresponding point is included in the portion of the event space delimited by s.
[Figure: the event space, with a subscription S delimiting a subspace and two events e and e' shown as points.]
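Under this geometric reading, a subscription made of per-attribute range constraints delimits an axis-aligned subspace, and matching reduces to point-in-box containment. A minimal sketch (attribute names and bounds are illustrative):

```python
def matches(event, subscription):
    """subscription: attribute -> (low, high) bounds delimiting a subspace.
    An event (a point in the event space) matches iff it lies inside."""
    return all(lo <= event[attr] <= hi
               for attr, (lo, hi) in subscription.items())


s = {"rating": (3, 5), "price": (0, 100)}   # a subspace of a 2-D event space
e = {"rating": 4, "price": 20}              # a point inside s
e_prime = {"rating": 2, "price": 20}        # a point outside s
```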
■ Depending on the subscription model used, we distinguish various flavors of publish/subscribe:
■ Topic-based
■ Hierarchy-based
■ Content-based
■ Type-based
■ Concept-based
■ XML-based
■ ... and more
Topic-based selection: data published in the system is mostly unstructured, but each event is “tagged” with the identifier of a topic it is published in. Subscribers issue subscriptions containing the topics they are interested in.
A topic can thus be represented as a “virtual channel” connecting producers to consumers. For this reason, the problem of data distribution in topic-based publish/subscribe systems is considered quite close to that of group communication.
[Figure: a publisher and two subscribers connected through the ENS by the virtual channel of topic X.]
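Topic-based selection amounts to exact matching on the topic tag, so the ENS degenerates to a map from topics to subscriber lists, one “virtual channel” per topic. A sketch (class and variable names are invented):

```python
from collections import defaultdict


class TopicENS:
    def __init__(self):
        # topic -> list of subscriber inboxes (the "virtual channels")
        self._channels = defaultdict(list)

    def subscribe(self, topic, inbox):
        self._channels[topic].append(inbox)

    def publish(self, topic, event):
        # Deliver the tagged event on its topic's virtual channel only.
        for inbox in self._channels[topic]:
            inbox.append((topic, event))


ens = TopicENS()
inbox = []
ens.subscribe("X", inbox)
ens.publish("X", "update 1")   # delivered: subscribed topic
ens.publish("Y", "update 2")   # not delivered: different channel
```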
Hierarchy-based selection: as in the topic-based case, each event is “tagged” with the topic it is published in, and subscribers issue subscriptions containing the topics they are interested in.
In contrast to the previous model, here topics are organized in a hierarchical structure that expresses a notion of containment between topics. When a subscriber subscribes to a topic, it will receive all the events published in that topic and in all the topics of the corresponding sub-tree.
[Figure: topics organized in a hierarchy, with A at the root, B and C below it, and D deeper in the tree. An event published in topic A is delivered only to the subscriber of A; an event published in topic D is delivered both to the subscriber of D and to the subscriber of A.]
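Hierarchy-based selection can be sketched by storing, for each topic, its parent: an event published in topic t is delivered to every subscription on t or on one of its ancestors, so a subscriber of A covers A's whole sub-tree. The tree below mirrors the slide's example, assuming B and C are children of A and D is a child of C:

```python
PARENT = {"A": None, "B": "A", "C": "A", "D": "C"}  # topic -> parent topic


def ancestors_or_self(topic):
    while topic is not None:
        yield topic
        topic = PARENT[topic]


def receivers(published_topic, subscriptions):
    """subscriptions: subscriber -> subscribed topic.
    A subscriber receives the event iff its subscribed topic contains
    (is an ancestor of, or equals) the published topic."""
    covering = set(ancestors_or_self(published_topic))
    return {sub for sub, topic in subscriptions.items() if topic in covering}


subs = {"s1": "A", "s2": "D"}   # the two subscribers of the slide's figure
```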
Content-based selection: data published in the system is structured. Each subscription can be expressed as a conjunction of constraints on attributes. The Event Notification Service filters out non-matching events before notifying a subscriber.
[Figure: two subscribers attached to the ENS with subscriptions “name=Acme* AND value>20$” and “value=23$”. An event e1 (name=“Acme cables”, value=23$) matches both subscriptions and is notified to both subscribers; an event e2 (name=“Acme RE”, value=18$) matches neither and is filtered out.]
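The slide's two subscriptions can be reproduced with a small content-based matcher: the name constraint becomes a wildcard pattern and the value constraint a numeric comparison (this particular encoding of the subscription language is an assumption for the example).

```python
from fnmatch import fnmatch

# Each subscription is a conjunction of per-attribute constraints.
sub1 = lambda e: fnmatch(e["name"], "Acme*") and e["value"] > 20
sub2 = lambda e: e["value"] == 23

e1 = {"name": "Acme cables", "value": 23}
e2 = {"name": "Acme RE", "value": 18}


def notified(event, subscriptions):
    # The ENS filters out non-matching events before notifying subscribers.
    return [name for name, s in subscriptions.items() if s(event)]


subscriptions = {"sub1": sub1, "sub2": sub2}
```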
■ The Event Notification Service is usually implemented as a:
■ Centralized service: the ENS is implemented on a single server.
■ Distributed service: the ENS consists of a set of nodes, the event brokers, which cooperate to implement the service.
■ The latter is usually preferred in large settings where scalability is a fundamental issue.
[General architecture. Figure: a distributed ENS made of interconnected event brokers (B), with publishers (P) and subscribers (S) attached at its edges.]
■ Modern ENSs are implemented through a set of processes, called event brokers, forming an overlay network.
■ Each client (publisher or subscriber) accesses the service through a broker that masks the system complexity.
■ An event routing mechanism routes each event inside the ENS from the broker where it is published to the broker(s) where it must be notified.

[Event routing]
Event flooding: each event is broadcast from the publisher to the whole system.
The implementation is straightforward but very expensive.
This solution has the highest message overhead but no memory overhead.
[Figure: a broker network whose subscribers hold predicates on a single attribute x (x>30, x=167, x<18 AND x>10, x=30 OR x>200, x=30, x<>30, x<5, x>10, x>40); a published event is flooded to every broker.]
Subscription flooding: each subscription is copied to every broker in order to build locally complete subscription tables. These tables are then used to match events locally and directly notify interested subscribers. This approach suffers from a large memory overhead, but event diffusion is optimal. It is impractical in applications where subscriptions change frequently.
[Figure: the same broker network; every broker stores the complete subscription table, mapping each predicate (x>30, x<>30, x<5, x>40, x>10) to the address of its subscriber, so an event x=22 is matched locally at its entry broker.]
Filter-based routing: subscriptions are partially diffused in the system and used to build routing tables. These tables are then exploited during event diffusion to dynamically build a multicast tree that (hopefully) connects the publisher to all, and only, the interested subscribers.
[Figure: the broker network annotated with per-broker routing tables; each entry maps a neighbor link to the filter covering the subscriptions reachable through it (for example x>=30 OR (x<18 AND x>10) toward one neighbor, x>10 or x<5 toward others). An event x=22 follows only the links whose filters it matches.]
Rendez-vous routing: it is based on two functions, SN and EN, which associate subscriptions and events, respectively, to brokers in the system.
Given a subscription s, SN(s) returns a set of nodes responsible for storing s and forwarding received events matching s to all the subscribers that issued it.
Given an event e, EN(e) returns a set of nodes that must receive e to match it against the subscriptions they store.
Event routing is a two-phase process: first an event e is sent to all brokers returned by EN(e); then those brokers match it against the subscriptions they store and notify the corresponding subscribers.
This approach works only if, for each subscription s and event e such that e matches s, the intersection between EN(e) and SN(s) is not empty (mapping intersection rule).
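For a topic-tagged event space the two functions can simply hash the topic, which makes the mapping intersection rule hold trivially, since SN and EN then coincide. A sketch under that assumption (function bodies are illustrative, not the slides' definition):

```python
def SN(subscription_topic, n_brokers):
    # Brokers responsible for storing subscriptions on this topic.
    return {hash(subscription_topic) % n_brokers}


def EN(event_topic, n_brokers):
    # Brokers that must receive events on this topic for matching.
    return {hash(event_topic) % n_brokers}


# Mapping intersection rule: if an event matches a subscription (here: same
# topic), EN(e) and SN(s) must intersect -- equal by construction in this toy.
N = 16
rendezvous = EN("quotes", N) & SN("quotes", N)
```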
Rendez-Vous routing: example.
Phase 1: two nodes issue the same subscription S.
SN(S) = {4,a}
[Figure: the broker network; the subscription S is routed to and stored at the rendez-vous brokers 4 and a.]
Rendez-Vous routing: example.
Phase II: an event e matching S is routed toward the rendez-vous node, where it is matched against S.
EN(e) = {5,6,a}
Broker a is the rendez-vous point between event e and subscription S.
[Figure: event e travels through the broker network to rendez-vous broker a, where it meets subscription S.]
A generic architecture of a publish/subscribe system, as a stack of layers:

Matching: flooding, selective diffusion, gossiping
Event routing: event flooding, subscription flooding, rendez-vous, filter-based, blind gossip, informed gossip
Overlay infrastructures: broker overlay, P2P structured overlay, P2P unstructured overlay
Network protocols: TCP/IP, IP multicast, SOAP, 802.11b/g
Antonio Carzaniga, Matthew J. Rutherford, Alexander L. Wolf
“A Routing Scheme for Content-Based Networking” (SIENA), in Proceedings of IEEE INFOCOM 2004.
Abstract: “This paper proposes a routing scheme for content-based networking. A content-based network is a communication network that features a new advanced communication model where messages are not given explicit destination addresses, and where the destinations of a message are determined by matching the content of the message against selection predicates declared by nodes. Routing in a content-based network amounts to propagating predicates and the necessary topological information in order to maintain loop-free and possibly minimal forwarding paths for messages. The routing scheme we propose uses a combination of a traditional broadcast protocol and a content-based routing protocol. We present the combined scheme and its requirements over the broadcast protocol. We then detail the content-based routing protocol, highlighting a set of optimization heuristics. We also present the results of our evaluation, showing that this routing scheme is effective and scalable.”

[SIENA]
The specific architecture of this system: SIENA implements filter-based event routing on a broker overlay, over standard network protocols.
■ Each node has a service interface consisting of two operations:
■ send_message(m)
■ set_predicate(p)
■ A predicate is a disjunction of conjunctions of constraints of individual attributes.
■ A content-based network can be seen as a dynamically-configurable broadcast network, where each message is treated as a broadcast message whose broadcast tree is dynamically pruned using content-based addresses.
Combined Broadcast and Content-Based (CBCB) routing scheme:
Content-based layer: “prunes” broadcast forwarding paths.
Broadcast layer: diffuses messages in the network.
Overlay point-to-point network: manages connections.
[Figure: the broker network with the subscribers' predicates on x; an event x=22 is diffused along the broadcast tree while the content-based layer prunes the branches with no matching predicates.]
The broadcast layer:
■ A broadcast function B : N × I → I* is available at each router. Given a source node s and an input interface i, it returns a set of output interfaces.
■ The broadcast function defines a broadcast tree rooted at each source node.
■ The broadcast function satisfies the all-pairs path symmetry property: for each pair of nodes x and y, it defines two broadcast trees Tx and Ty, rooted at x and y respectively, such that the path x⇝y in Tx is congruent to the reverse of the path y⇝x in Ty.
Example: [Figure: a 15-node overlay network and a broadcast tree defined on it.]
The content-based layer:
■ Maintains forwarding state in the form of a content-based forwarding table. For each node, the table associates a content-based address with each interface.
[Figure: the 15-node network annotated with the subscribers' predicates: x>30, x=167, x<18 AND x>10, x=30 OR x>200, x=30, x<>30, x<5, x>10, x>40.]
The message forwarding mechanism:
■ The content-based forwarding table is used by a forwarding function Fc that, given a message m, selects the subset of interfaces associated with predicates matching m.
■ The result of Fc is then combined with the broadcast function B, computed for the original source of m.
■ A message is therefore forwarded along the set of interfaces returned by the following formula:
(B(source(m), incoming_if(m)) ∪ {I0}) ∩ Fc(m)
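The forwarding formula can be evaluated directly once B and Fc are modeled as sets of interfaces (a toy model; interface names are invented):

```python
def forward_set(b_ifaces, fc_ifaces, local_iface="I0"):
    """(B(source(m), incoming_if(m)) ∪ {I0}) ∩ Fc(m):
    follow the broadcast tree, pruned to the content-matching interfaces,
    always considering the local delivery interface I0 as well."""
    return (set(b_ifaces) | {local_iface}) & set(fc_ifaces)


# The broadcast layer would forward on i1, i2, i3; the content-based table
# says only i2 and I0 (a local subscriber) carry matching predicates.
out = forward_set({"i1", "i2", "i3"}, {"i2", "I0"})
```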
Example: [Figure: a message x=12 is forwarded through the network along the interfaces whose content-based addresses it matches.]
Forwarding tables maintenance:
■ Push mechanism based on receiver advertisements.
■ Pull mechanism based on sender requests and update replies.
Receiver advertisements:
■ are issued by nodes periodically and/or when the node changes its local content-based address p0.
■ Content-based RA ingress filtering: a router receiving through interface i an RA issued by node r and carrying content-based address pRA first verifies whether or not the content-based address pi associated with interface i covers pRA. If pi covers pRA, then the router simply drops the RA.
■ Broadcast RA propagation: if pi does not cover pRA, then the router computes the set of next-hop links on the broadcast tree rooted in r (i.e., B(r, i)) and forwards the RA along those links.
■ Routing table update: if pi does not cover pRA, then the router also updates its routing table, adding pRA to pi, computing pi ← pi ∨ pRA.
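The three RA rules can be sketched over a toy predicate model in which a content-based address is the set of values it accepts, so that “covers” becomes superset and “∨” becomes union; real SIENA manipulates constraint formulas instead, so this encoding is an assumption.

```python
def handle_ra(table, iface, p_ra, broadcast_next_hops):
    """Process a receiver advertisement (RA) arriving on `iface`.
    table: iface -> set of accepted values (toy content-based address).
    Returns the links the RA must be forwarded on (B(r, i))."""
    p_i = table.get(iface, set())
    if p_ra <= p_i:                  # ingress filtering: p_i covers p_RA
        return []                    # drop the RA, no forwarding
    table[iface] = p_i | p_ra        # routing table update: p_i <- p_i ∨ p_RA
    return broadcast_next_hops       # broadcast RA propagation along B(r, i)


table = {"i1": {1, 2}}
fwd = handle_ra(table, "i1", {2, 3}, ["i2", "i3"])   # widens the address
dropped = handle_ra(table, "i1", {2}, ["i2", "i3"])  # already covered
```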
Example: Broker 6 issues subscription s1. [Figure: a six-broker network; as the RA for s1 propagates from broker 6, each broker adds to its table an entry associating the interface toward broker 6 with s1.]
Example: Broker 2 issues subscription s2≺s1. [Figure: entries for s2 are added on the interfaces toward broker 2, while the RA is dropped wherever the stored address s1 already covers s2.]
Notice that, because of the ingress filtering rule, the RA protocol can only widen the content-based addresses stored in routing tables. In the long run, this may cause an “inflation” of those content-based addresses.
Example: Broker 6 substitutes its predicate with s3≺s1. [Figure: since the stored address s1 already covers s3, the RA is filtered and the tables keep advertising the wider s1.]
Sender Requests and Update Replies:
■ A router uses sender requests (SRs) to pull content-based addresses from all receivers in order to update its routing table.
■ The results of an SR come back to the issuer of the SR through update replies (URs).
■ The SR/UR protocol is designed to complement the RA protocol. Specifically, it is intended to balance the effect of the address inflation caused by RAs, and also to compensate for possible losses in the propagation of RAs.
■ An SR issued by n is broadcast to all routers, following the broadcast paths defined at each router by the broadcast function B(n, . ).
■ A leaf router in the broadcast tree immediately replies with a UR containing its content-based address p0.
■ A non-leaf router assembles its UR by combining its own content-based address p0 with those of the URs received from downstream routers, and then sends its URs upstream.
■ The issuer of the SR processes incoming URs by updating its routing table. In particular, an issuer receiving a UR carrying predicate pUR from interface i updates its routing table entry for interface i with pi ← pUR.
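UR assembly is a bottom-up fold over the broadcast tree: leaves reply with their own address, inner routers combine their address with the URs received from downstream. Same toy set-based predicate model as above (an assumption; names are illustrative):

```python
def update_reply(children_urs, own_address):
    """Compute the UR a router sends upstream.
    children_urs: URs already received from downstream routers (empty at a
    leaf). own_address: this router's content-based address p0 (a value set)."""
    ur = set(own_address)
    for child_ur in children_urs:   # non-leaf: combine p0 with downstream URs
        ur |= child_ur
    return ur                       # a leaf simply returns p0


# Broker pulling addresses via an SR: two leaves below one inner router.
leaf_a = update_reply([], {1})
leaf_b = update_reply([], {2})
inner = update_reply([leaf_a, leaf_b], set())  # router with empty p0
```

The issuer would then overwrite its table entry for the incoming interface with the received UR (p_i ← p_UR), deflating any over-wide address the RA protocol left behind.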
Example: Broker 5 sends a Sender Request (SR) to refresh its forwarding table. [Figure: the SR is broadcast from broker 5 along its broadcast tree; broker 5's current table maps interfaces 3 and 4 to s1.]
Example: Update Replies (URs) are collected on the paths toward broker 5. [Figure: leaf brokers reply with their own content-based addresses ([s2], [s3], [ ]); intermediate brokers combine them ([s2 ⋁ s3]); broker 5 updates its table so that interfaces 3 and 4 map to s2 ⋁ s3.]
■ Exercise: consider the following system:
■ The event space is represented by a single numerical attribute x, which can assume real values. Subscriptions can be expressed using the operators <, =, >.
[Figure: the 15-node broker network, with subscribers A-I and publisher P attached to its brokers.]
■ Subscribers issued the following subscriptions.
■ First define a spanning tree associated with the broker attached to publisher P. Then, for every broker, compute the content-based forwarding table associated with this spanning tree. Finally, compute the path followed by event x=16 through the ENS.

Subscriber | Subscription
A | x>23
B | x<0 OR x>90
C | x<40
D | x>25 AND x<60
E | x>5 AND x<18
F | x>5 AND x<10
G | x>15 AND x<20
H | x<12
I | x>50
■ Step 1: define a spanning tree rooted at broker 1.
■ Any tree including all the brokers is acceptable.

[Figure: one such spanning tree over the 15-node broker network, with subscribers A-I and publisher P attached.]
■ The content of the subscription tables is computed starting from each subscriber and “climbing the tree” toward the root (broker 1).
■ We are referring to a run-time status where we can assume that, independently of the order in which subscriptions were issued, the tables' content is complete.

Broker | Interface | Content-based address
1 | 2 | x>50
1 | 3 | x>23 OR (x<0 OR x>90) OR x<40 OR (x>25 AND x<60)
2 | 7 | x>50
3 | 4 | x>23 OR (x<0 OR x>90)
3 | 8 | x<40 OR (x>25 AND x<60)
4 | 5 | x>23 OR (x<0 OR x>90)
5 | 6 | x>23 OR (x<0 OR x>90)
8 | 10 | x<12 OR (x>15 AND x<20)
8 | 11 | x>5 AND x<10
8 | 12 | x<40 OR (x>5 AND x<18) OR (x>25 AND x<60)
10 | 9 | x<12 OR (x>15 AND x<20)
11 | 13 | x>5 AND x<10
12 | 14 | (x>5 AND x<18) OR (x>25 AND x<60)
14 | 15 | (x>5 AND x<18) OR (x>25 AND x<60)
■ Routing event x=16. Notified subscribers: C, E, G.
■ The content-based addresses satisfied by the event (highlighted in the original table) are those of entries 1→3, 3→8, 8→10, 10→9, 8→12, 12→14 and 14→15; the event is forwarded along exactly those interfaces.
■ On the graph: [Figure: the path of event x=16 highlighted on the spanning tree, reaching subscribers C, E and G.]
Miguel Castro, Peter Druschel, Anne-Marie Kermarrec and Antony Rowstron
“SCRIBE: A large-scale and decentralized application-level multicast infrastructure”, IEEE Journal on Selected Areas in Communications, 2002.
Abstract: “This paper presents Scribe, a scalable application-level multicast infrastructure. Scribe supports large numbers of groups, with a potentially large number of members per group. Scribe is built on top of Pastry, a generic peer-to-peer object location and routing substrate overlayed on the Internet, and leverages Pastry's reliability, self-organization, and locality properties. Pastry is used to create and manage groups and to build efficient multicast trees for the dissemination of messages to each group. Scribe provides best-effort reliability guarantees, but we outline how an application can extend Scribe to provide stronger reliability. Simulation results, based on a realistic network topology model, show that Scribe scales across a wide range of groups and group sizes. Also, it balances the load on the nodes while achieving acceptable delay and link stress when compared to IP multicast.”

[SCRIBE]
The specific architecture of this system: Scribe implements rendez-vous event routing on top of a P2P structured overlay (Pastry).
■ Scribe is a topic-based publish/subscribe system able to support a large number of groups with a potentially large number of publishers and subscribers.
■ Each user in the system (publisher or subscriber) is also a broker. The event notification service therefore consists of all the users.
■ Users can join and leave the system. The event notification service can therefore change at runtime.
■ Scribe is built upon Pastry, a peer-to-peer location and routing service.
■ Pastry is used to build and maintain the application-level topology that connects brokers in the event notification service.
■ Pastry also provides applications with efficient primitives for object storage and location.
■ Pastry implements a Distributed Hash Table:
■ Each object is associated with a key.
■ Each key is stored (together with the corresponding objects) in a node.
■ Each object can be efficiently located and retrieved knowing its key.
■ Each node participating in Pastry is identified by a 128-bit NodeID, obtained by applying a hash function h to its IP address.
■ NodeIDs are written in base 2^b, where b is a configuration parameter.
■ The function h evenly distributes node identifiers in the circular key-space [0, 2^128 − 1].
■ Each object is stored on the node with the closest NodeID.
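The closest-NodeID rule can be sketched numerically. This is an illustrative helper (the names are mine, not Pastry's API): distances are measured along the circular key space, and a key is assigned to the node whose NodeID minimizes that distance.

```python
# Hypothetical sketch of Pastry's key-to-node mapping: each object key
# is stored on the node whose NodeID is numerically closest on the
# circular key space [0, 2^128 - 1].
KEYSPACE = 2 ** 128

def circular_distance(a, b, modulus=KEYSPACE):
    """Shortest distance between two points on the identifier ring."""
    d = abs(a - b) % modulus
    return min(d, modulus - d)

def responsible_node(key, node_ids, modulus=KEYSPACE):
    """Return the NodeID numerically closest to the key on the ring."""
    return min(node_ids, key=lambda n: circular_distance(n, key, modulus))
```

For example, with nodes {23, 60, 74, 83}, key 70 is stored on node 74, whose distance to the key (4) is the smallest.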
■ Each node maintains three data structures:
■ Leaf set
■ Routing table
■ Neighborhood set
■ Leaf set: contains the set of nodes with the L/2 numerically closest larger NodeIDs, and the L/2 nodes with numerically closest smaller NodeIDs, relative to the present node’s NodeID.
■ Example: node 60, L=6
[Figure: Pastry identifier ring with nodes 2, 23, 25, 53, 60, 63, 74, 83, 98, 121, 125, 127, 135, 160, 177, 183, 191, 208, 215, 226, 240; the leaf set of node 60 is LS60 = {23, 25, 53, 63, 74, 83}.]
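The leaf-set rule can be sketched as follows, under the assumption that "closest" means the smallest gap along the circular identifier space (the function name is illustrative, not Pastry code):

```python
# A minimal sketch of leaf-set construction: the L/2 numerically
# closest smaller NodeIDs plus the L/2 numerically closest larger
# NodeIDs, measured as gaps on the circular identifier space.
def leaf_set(node_id, all_ids, L, modulus=256):
    others = [i for i in all_ids if i != node_id]
    # closest on the "smaller" side: minimal gap node_id - i (mod M)
    smaller = sorted(others, key=lambda i: (node_id - i) % modulus)[: L // 2]
    # closest on the "larger" side: minimal gap i - node_id (mod M)
    larger = sorted(others, key=lambda i: (i - node_id) % modulus)[: L // 2]
    return sorted(smaller + larger)
```

Run on the ring of the example with node 60 and L=6, this reproduces LS60 = {23, 25, 53, 63, 74, 83}.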
■ Routing table: a matrix of ⌈log_{2^b} N⌉ rows and 2^b − 1 columns. Entries in the n-th row match the current NodeID in the first n−1 digits, while their n-th digit takes one of the 2^b − 1 possible values other than the n-th digit of the current NodeID.
■ Example: routing table at node 10233102, b=2
[Routing table at node 10233102 (b=2). Columns correspond to the possible digit values 0–3; a bare digit marks the column of the local NodeID's own digit at that row:

Row 1: -0-2212102 | 1 | -2-2301203 | -3-1203203
Row 2: 0 | 1-1-301233 | 1-2-230203 | 1-3-021022
Row 3: 10-0-31203 | 10-1-32102 | 2 | 10-3-23302
Row 4: 102-0-0230 | 102-1-1302 | 102-2-2302 | 3
Row 5: 1023-0-322 | 1023-1-000 | 1023-2-121 | 3
Row 6: 10233-0-01 | 1 | 10233-2-32 |
Row 7: 0 | | 102331-2-0 |
Row 8: | | 2 | ]
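The row/column placement rule can be sketched with a small helper (hypothetical names; rows are 0-indexed here, so row r corresponds to the table's row r+1). NodeIDs are handled as digit strings:

```python
# Illustrative sketch of which routing-table slot a candidate NodeID
# belongs to: row r holds NodeIDs sharing the first r digits with the
# local ID, and the column is the value of the first differing digit.
def shared_prefix_len(a, b):
    """Number of leading digits two NodeID strings have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def routing_slot(local_id, candidate_id):
    """Return (row, column) for candidate_id, or None for the local ID."""
    r = shared_prefix_len(local_id, candidate_id)
    if r == len(local_id):
        return None  # same NodeID: no routing-table slot
    return r, int(candidate_id[r])
```

For instance, at node 10233102 the candidate 10031203 shares the prefix "10" and has digit 0 next, so it lands in (0-indexed) row 2, column 0, matching the entry 10-0-31203 in the table's row 3.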
■ Neighborhood set: list of the M closest nodes.
■ Node distance is measured using a proximity metric (IP hops, latency, bandwidth, etc.).
■ Nodes in this list are used to update entries in the routing table.
■ The main function provided by Pastry is route(msg,key).
■ Routing is realized matching key prefixes with nodes stored in each routing table.
■ In each routing step, the current node forwards the message to a node whose NodeID shares with the target key a prefix that is at least one digit longer than the prefix that the key shares with the current NodeID.
■ If no such node is found in the routing table, the message is forwarded to a node whose NodeID shares a prefix with the key as long as the current node’s, but is numerically closer to the key than the current NodeID.
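One routing step can be sketched as follows (illustrative names, not Pastry's actual code; NodeIDs and keys are digit strings of equal length):

```python
# Sketch of one step of route(msg, key): prefer a known node whose
# NodeID extends the prefix shared with the key by at least one digit;
# otherwise fall back to a node with an equally long shared prefix
# that is numerically closer to the key.
def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(local_id, key, known_ids):
    p = shared_prefix_len(local_id, key)
    # case 1: some known node shares a strictly longer prefix with the key
    better = [n for n in known_ids if shared_prefix_len(n, key) > p]
    if better:
        return max(better, key=lambda n: shared_prefix_len(n, key))
    # case 2: equally long prefix, but numerically closer to the key
    closer = [n for n in known_ids
              if shared_prefix_len(n, key) == p
              and abs(int(n) - int(key)) < abs(int(local_id) - int(key))]
    if closer:
        return min(closer, key=lambda n: abs(int(n) - int(key)))
    return None  # the local node is numerically closest: routing ends here
```

With each hop extending the shared prefix by at least one digit, a message reaches the node closest to the key in O(log N) steps.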
■ Scribe uses the key-node mapping provided by Pastry to assign a rendez-vous node to each topic:
■ Each topic t (called Group in Scribe) is mapped to a key by applying h(t)
■ EN(e)=h(e), SN(s)=h(s)
■Membership management:
■ Joining a group
■ Leaving a group
■Message diffusion
■ When a node n wants to subscribe to t (join group t):
■ it invokes route(JOIN[t],h(t))
■ the message is routed toward the rendez-vous node for t
■ each node n’ along the route checks a local groups list to see if it is currently a forwarder for t
■ if so, it accepts n as a child and adds it to the local children table
■ otherwise, it adds t to the groups list, adds n to the children table and, finally, invokes route(JOIN[t],h(t))
■ A node can unsubscribe from t at any time:
■ if it has no children, it sends a LEAVE message to its parent in the diffusion tree
■ if it still has children for that group, it cannot leave the diffusion tree
■ Message diffusion is done in two steps:
■ the node that publishes an event e for topic t invokes route(MCAST[e],h(t))
■ when the message reaches the rendez-vous point, it is diffused following the links defined by the children tables for that group.
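The JOIN handling above can be sketched as follows. `ScribeNode` and its fields are illustrative names, and Pastry routing is abstracted into a `next_hop` pointer toward the rendez-vous node:

```python
# Minimal sketch of how each node on the Pastry route handles a JOIN
# for group t: a node already forwarding for t just records the new
# child; otherwise it becomes a forwarder itself and propagates the
# JOIN one hop further toward h(t).
class ScribeNode:
    def __init__(self, node_id, next_hop):
        self.node_id = node_id
        self.next_hop = next_hop   # next node toward h(t); None at the root
        self.children = {}         # group -> set of child node ids

    def handle_join(self, group, child_id):
        if group in self.children:          # already a forwarder: graft child
            self.children[group].add(child_id)
            return
        self.children[group] = {child_id}   # become a forwarder ...
        if self.next_hop is not None:       # ... and keep routing the JOIN
            self.next_hop.handle_join(group, self.node_id)
```

Note how the second subscriber to reach a forwarder stops the JOIN there: the multicast tree is the union of the Pastry routes from subscribers to the rendez-vous node.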
■ Example
[Figure: example multicast tree for a group with key h(t)=73 on the Pastry ring of the previous example. Each node in the tree keeps an entry (h(t), children, father); the tables shown are (73; children 177, 191; father 83), (73; 121; 74), (73; 83; 83), and two leaves (73; –; 121).]
R. Baldoni, R. Beraldi, V. Quema, L. Querzoni, S. Tucci Piergiovanni, “TERA: Topic-based Event Routing for Peer-to-Peer Architectures”, International Conference on Distributed Event-Based Systems (DEBS), 2007.
Abstract: “The completely decoupled interaction model offered by the publish/subscribe communication paradigm perfectly suits the interoperability needs of today's large-scale, dynamic, peer-to-peer applications. Unmanaged inter-administrative environments, where these applications are expected to work, pose a series of problems (potentially wide number of participants, low reliability of nodes, absence of a centralized authority, etc.) that severely limit the scalability of existing approaches, which were originally thought for supporting distributed applications built on top of static and managed environments. In this paper we propose a novel architecture for implementing the topic-based publish/subscribe paradigm in large scale peer-to-peer systems. The proposed architecture is based on probabilistic mechanisms and peer-to-peer overlay management protocols. It achieves event diffusion by implementing traffic confinement (published events have a high probability to reach only interested subscribers), high scalability (with respect to several fundamental parameters like number of participants, subscriptions, topics and event publication rate) and fair load distribution (load distribution closely follows the distribution of subscriptions on nodes).”
TERA
■ A two-layer infrastructure:
■ All clients are connected by a single overlay network at the lower layer (general overlay).
■ Various overlay network instances at the upper layer connect clients subscribed to the same topics (topic overlays).
■ Event diffusion:
■ The event is routed in the general overlay toward one of the nodes subscribed to the target topic.
■ This node acts as an access point for the event that is then diffused in the correct topic overlay.
[Figure 1: The TERA publish/subscribe system. (a) System overview: a general overlay connects all nodes, and topic overlays connect the nodes subscribed to the same topic; an event is routed in the general overlay to a node used as access point, then diffused in the topic overlay. (b) Node architecture: Event Management and Subscription Management on top of Broadcast, Partition Merging, Access Point Lookup, Peer Sampling and Size Estimation, all above the Overlay Management Protocol; the application-facing API is subscribe, unsubscribe, publish, notify.]
2 An Overview of TERA

TERA is a topic-based publish/subscribe system designed to offer an event diffusion service for very large scale peer-to-peer systems. Each published event is “tagged” with a topic and is delivered to all the subscribers that expressed their interest in the corresponding topic by issuing a subscription for it. The set of available topics is not fixed, nor predefined: applications using TERA can dynamically create or delete them.

2.1 Architecture

Nodes participating in TERA are organized in a two-layer infrastructure (ref. Figure 1(a)). At the lower layer, a global overlay network connects all nodes, while at the upper layer various topic overlay networks connect subsets of all the nodes; each topic overlay contains nodes subscribed to the same topic. All these overlay networks are separated and are maintained through an overlay management protocol.

Subscription management and event diffusion in TERA are based on two simple ideas: nodes …
■ Event routing in the general overlay is realized through a random walk.
■ The walk stops at the first broker that knows an access point for the target topic.
[Figure: a random walk among brokers B1–B6 for an event on topic t. Each broker keeps an Access Point Table mapping topics to access points (e.g., a → B5, f → B6; x → B1, a → B5; e → B4, h → B4; t → B1, y → B6). The walk stops at the first broker whose APT has an entry for t, which hands the event over to the topic overlay.]
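The lookup walk can be sketched as follows (illustrative names and data shapes, not TERA code): the event hops between random neighbors until it meets a broker whose APT knows an access point for the topic, or until the walk's lifetime expires.

```python
# Hypothetical sketch of TERA's access-point lookup in the general
# overlay: a random walk of at most k hops, stopping at the first
# broker whose APT holds an access point for the target topic.
import random

def random_walk_lookup(start, topic, neighbors, apt, k, rng=random):
    """neighbors: node -> list of nodes; apt: node -> {topic: access_point}."""
    node = start
    for _ in range(k):
        if topic in apt.get(node, {}):
            return apt[node][topic]       # access point found: stop the walk
        node = rng.choice(neighbors[node])
    return None                           # walk lifetime exhausted: lookup fails
```

Returning `None` models the probabilistic nature of the mechanism: a lookup for an active topic can fail, with a probability that shrinks as the walk lifetime or the APT size grows (quantified later in the analysis).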
■ Each node maintains locally an Access Point Table (APT)
■ Each entry in the APT is a pair <topic, node address>
■ An entry <t,n> represents the fact that n is an access point for topic t.
■ The length of the APT is fixed.
■Goal:
■ each topic in the APT must be a uniform random sample among all the topics in the system;
■ the access point associated with a topic in an APT must be a uniform random sample among all the nodes subscribed to that topic.
[Example APT: topic x → access point B1, topic a → access point B5]
■ Subscription advertisement:
■ each node periodically advertises its subscriptions to a set of nodes chosen uniformly at random among the population;
■ each advertisement is a set of pairs <topic, popularity>
■ An advertisement <t,p> represents the fact that there are (approximately) p nodes subscribed to topic t.
■ APT update. When a node receives an advertisement <t,p> from node n:
■ if the APT contains an entry <t,m>, it simply replaces m with n
■ otherwise it adds a new entry <t,n> to the APT with probability 1/p
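The update rule can be sketched in a few lines (a sketch under the slide's assumptions; names are illustrative, and the random eviction when the table overflows follows the paper's description):

```python
# Sketch of the APT update rule: on an advertisement <topic, popularity>
# from node `sender`, refresh an existing entry, otherwise insert the
# new entry with probability 1/popularity. The 1/p acceptance rate is
# what keeps popular and rare topics equally represented in APTs.
import random

def apt_update(apt, topic, popularity, sender, max_size, rng=random):
    if topic in apt:
        apt[topic] = sender               # refresh with the fresher access point
        return
    if rng.random() < 1.0 / popularity:   # admit with probability 1/p
        apt[topic] = sender
        if len(apt) > max_size:           # evict a random entry to stay bounded
            del apt[rng.choice(list(apt))]
```

Intuitively, a topic with p subscribers is advertised about p times more often, but each advertisement is accepted with probability 1/p, so every active topic ends up in APTs at roughly the same rate.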
■ OMPs: Newscast, Cyclon, etc.
[Figure 2: A detailed view of the architecture of TERA. Event Management and Subscription Management sit on top of the Access Point Lookup (backed by the Access Point Table), the Inner-Cluster Dissemination and Partition Merging services, and a Subscription Table; below, one Overlay Management Protocol instance runs per overlay (the general overlay plus one per topic, each with peer sampling and size estimation). The application-facing API is publish, notify, subscribe and unsubscribe.]
as for notifying subscribers. An event dissemination starts as soon as an application publishes some data in a topic. It is done in two steps: the event is first routed to a node subscribed to the topic (this node acts as an access point for it); then, the access point diffuses the event in the overlay associated to the topic. The first step is realized through a lookup executed on the Access Point Lookup component: if the lookup returns an empty list of node identifiers, the node discards the event.

When a node subscribed to the topic receives an event for which it must act as an access point, it uses the broadcast primitive provided by the Inner-Cluster Dissemination service to forward the event to all nodes belonging to the corresponding topic overlay. When a node subscribed to the topic receives a broadcasted event, it notifies the application.

Access Points Lookup.
The Access Point Lookup component plays a central role in TERA's architecture as it is used by both the Event Management and Subscription Management components to obtain lists of access point identifiers for specific topics. Its functioning is based on a local data structure, called Access Point Table (APT), and a distributed search algorithm based on random walks.

Each APT is a cache, containing a limited number of entries, each with the form <t, n>, where t is a topic and n the identifier of a node that can act as an access point for t. APTs are continuously updated following a simple strategy: each time a node receives a subscription advertisement for topic t from a node n, it substitutes the access point identifier for t if an entry <t, n'> exists in the APT, otherwise it adds a new entry <t, n> with probability 1/Pt, where Pt is the popularity of topic t estimated by n and attached to the subscription advertisement. When an APT exceeds a predefined size, randomly chosen entries are removed.

As a consequence of this update strategy, APTs have the following properties:

1. APT entries tend to contain non-stale access points,
2. inactive topics (i.e. topics that are no longer subscribed by any node) tend to disappear from APTs,
3. each access point is a uniform random sample of the population of nodes subscribed to that topic,
4. the content of each APT is a uniform random sample of the set of active topics (i.e. topics subscribed by at least one node),
5. the size of each APT is limited.

The first property is a consequence of the way new entries are added to APTs; suppose, in fact, that there is only one topic t in the system, subscribed by two nodes, na and nb; suppose, moreover, that, at a certain point of time, nb unsubscribes t. Starting from that moment, only na will advertise t, therefore nodes containing an entry <t, nb> will eventually substitute it with entry <t, na>, as the uniformity of node samples provided by the peer sampling service guarantees that na will eventually advertise t to all the system population. The second property comes from the fact that inactive topics are no longer advertised. They are, thus, eventually replaced by active topics in APTs (assuming that the set of active topics is larger than the maximum APT size). The third property is a consequence of the fact that subscription advertisements are sent to nodes returned by the peer sampling service that provides uniform random samples, and that each node advertises its subscriptions with
■We want every topic to appear with the same probability in every APT, regardless of its popularity.
4.2.1 Topic distribution in APTs
We start by presenting an experiment showing that the method used in TERA to update APTs content ensures a uniform distribution of topics in every APT. This is a fundamental property for APTs as it allows TERA to use their content as a uniform random sample of the active topic population and build on it the access point lookup mechanism. We ran tests over a system with 10^4 nodes, each advertising its subscriptions every 5 cycles to 5 neighbors out of 20 (the overlay management protocol view size). APT size was limited to 10 entries. We issued 5000 subscriptions distributed in various ways on 1000 distinct topics, and we measured, for each topic, the number of APTs containing an entry for it. The expected outcome of these tests is to find a constant value for such measure, regardless of the initial topic popularity distribution.

Figure 3(a) shows the results for an initial uniform distribution of topic popularity. The X axis represents the topic population (each topic is mapped to a number). Each black dot represents the number of times a specific topic appears in APTs, while the grey dot represents its popularity. The plot shows that each topic is present, on average, in the same number of APTs, with a very small error that is randomly distributed around the mean. This confirms that the topic distribution in APTs can be considered uniform.

Figures 3(b) and 3(c) show the results for an initial zipf distribution of topic popularity. The two graphs report the results for differently skewed popularity distributions (distribution parameter a = 0.7 and a = 2.0). As these graphs show, TERA is always able to balance APT updates, and delivers an almost uniform distribution. Even in an extreme case (a = 2.0), the APT update mechanism is able to balance the updates coming from the small number of active topics (in this scenario only 79 topics share the whole 5000 subscriptions), maintaining their presence in APTs around the same average value with a small standard deviation (always below 5%). In the next evaluations, we only report results for zipf popularity distribution with a = 0.7, as results for other values of a did not exhibit significant differences.

4.2.2 Access Point Lookup
In this section, we evaluate the probability for the access point lookup mechanism to successfully return a node identifier for a lookup operation (in the case such a node exists). We denote by K the lifetime of the random walk (the maximum number of visited nodes), by |APT| the size of APT tables, and by |T| the number of topics[8]. The probability p to find an access point for a specific topic in an APT is p = |APT|/|T|. Assuming that every APT contains the maximum allowed number of entries, the probability that an access point cannot be found within K steps is Pr{fail} = (1 − p)^K. Thus, the probability to find the access point visiting at most K nodes is Pr{success} = 1 − (1 − p)^K = 1 − (1 − |APT|/|T|)^K.

Therefore, to ensure with probability P that an access point for a given topic will be found, it is necessary that sizes K or |APT| be such that:

[8] Thanks to the fact that APTs can be considered as uniform random samples of the set of active topics, each node can estimate at runtime the value of |T| [16].
[Figure 3: The plots show how topics are distributed among APTs (black dots) when the topic popularity distribution (grey dots) is (a) uniform (Std.Dev = 1.11), (b) zipf with a = 0.7 (Std.Dev = 2.16), and (c) zipf with a = 2.0 (Std.Dev = 51.49).]
■ What is the probability for an event to be correctly routed in the general overlay toward an access point?
■ Depends on:
■ uniform randomness of topics contained in access point tables;
■ access point table size;
■ random walk lifetime.
K = ln(1 − P) / ln(1 − |APT|/|T|)   or   |APT| = |T| · (1 − (1 − P)^(1/K))
Note that, given K and P, |APT| linearly depends on |T|. In order to reduce APT size, it would be necessary to increase random walks length (i.e. using a large value for K), negatively affecting the time it takes to find an access point. To mitigate this problem, it is advisable to launch r multiple concurrent random walks, each having a lifetime ⌈K/r⌉. Indeed, the fact that topics are uniformly distributed among APTs guarantees that launching multiple concurrent random walks does not impact the lookup success rate. In this way, access point lookup responsiveness is improved at the cost of a slightly larger overhead due to the independency of each random walk lifetime.
We ran experiments to check that TERA's behavior is close to the one predicted by the analytical study. Tests were run on a system with 1000 nodes, each having Cyclon views holding 20 nodes. At the beginning, 5000 subscriptions were issued, uniformly distributed on 1000 distinct topics. Lookups were started after 1000 cycles. Each lookup was conducted starting four concurrent random walks (r = 4).

Figure 4(a) shows how the access point lookup success ratio changes when varying the lifetime of each random walk (K) for different values of |APT|. For each line, we plotted both simulation results (solid line) and values calculated using the analytical study (dashed line). The plot confirms that TERA's lookup mechanism is able to probabilistically guarantee that an access point for an active topic will be found with probability P. Note that this plot also shows that the actual memory size required by APTs is limited. Indeed, consider the biggest APT size plotted on the graph: 400 entries. Assuming that each entry in an APT is a string containing 256 characters, the memory size occupied by an APT containing 400 entries is about 104kB.
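A quick numeric check of the lookup analysis (a sketch, not TERA code): with success probability Pr{success} = 1 − (1 − |APT|/|T|)^K, the walk lifetime needed to reach a target probability P follows by solving for K.

```python
# Numeric check of the access-point lookup analysis:
#   Pr{success} = 1 - (1 - |APT|/|T|)^K
#   K needed for target P = ln(1 - P) / ln(1 - |APT|/|T|), rounded up.
import math

def success_probability(apt_size, num_topics, k):
    """Probability that a walk of k hops finds an access point."""
    return 1.0 - (1.0 - apt_size / num_topics) ** k

def required_walk_length(apt_size, num_topics, target_p):
    """Smallest integer K ensuring success with probability target_p."""
    return math.ceil(math.log(1.0 - target_p) /
                     math.log(1.0 - apt_size / num_topics))
```

For instance, with |APT| = 100 and |T| = 1000 (p = 0.1), reaching P = 0.99 requires a walk lifetime of 44 hops; splitting it into r = 4 concurrent walks of ⌈44/4⌉ = 11 hops preserves the success rate while cutting latency.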
4.2.3 Partition Merging
In this section, we analyze the probability for the partition merging mechanism to detect a very small overlay partition, and the time it takes for this to happen. Suppose that there is a topic represented by an overlay network partitioned in two clusters containing |G| and 1 nodes, respectively[9]. Let us call n this single node. The probability p to detect the partition in a cycle can be expressed as p = 1 − (pa · pb), where pa is the probability that none of the nodes in G advertises its subscriptions to n, and pb is the probability that n does not advertise its subscriptions to any of the nodes in G.

Probability pa can be expressed as

pa = (1 − Pr{a node advertises to n})^|G|

Every node in G advertises its subscriptions to n only if n is contained in its view for the general overlay, and if n is one of the D nodes selected for the advertisement. Let us suppose, for the sake of simplicity, that D is equal to the view size. In this case Pr{a node advertises to n} = |View|/(N − 1),

[9] Note that the case where a partition is constituted by a single node is the most difficult to solve, as the probability for nodes belonging to distinct partitions to meet is the lowest possible one.
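The per-cycle detection probability can be computed directly from the formulas above (a sketch under the stated assumption that D equals the view size; by symmetry pb has the same form as pa):

```python
# Sketch of the partition-detection analysis: per-cycle probability
# that the lone node n and the cluster G "meet" via subscription
# advertisements, assuming D (advertisement fan-out) = view size.
def detection_probability(group_size, n_nodes, view_size):
    q = view_size / (n_nodes - 1)       # Pr{one node advertises to a given node}
    p_a = (1.0 - q) ** group_size       # no node in G advertises to n
    p_b = (1.0 - q) ** group_size       # n advertises to none of the |G| nodes
    return 1.0 - p_a * p_b
```

As expected, the probability grows with the cluster size |G|: larger partitions advertise more often and are merged back sooner, which matches the |G| = 4, 16, 64 curves of the evaluation.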
[Figure 4: (a) The plot shows how the success rate for access point lookups changes when varying the maximum APT size (50, 100, 400 entries) and the random walk lifetime. Solid lines represent results from the simulator, while dashed lines plot values from the formula. (b) The plot shows how the probability to detect a topic overlay partition increases with time (cycles), for |G| = 4, 16 and 64 nodes subscribed to the topic. Solid lines represent results from the simulator, while dashed lines plot values from the formula.]
■ Load imposed on nodes is fairly distributed:
■ no hot spots or single points of failure;
■ Nodes that subscribe to more topics suffer more load.
[Figure 5: The plots show how the load generated by TERA is distributed among nodes when the distribution of topic popularity is either uniform (a) or zipf (b). For both popularities, the left graph shows the load distribution in the general overlay and the right graph the global load distribution (black points), together with the subscription distribution on nodes (grey points).]
Figure 5(a) shows the results for a test with uniform topic popularity, while Figure 5(b) shows the same results for an initial zipf distribution with parameter a = 0.7. Pictures on the left show how load is distributed in the general overlay. As shown by the graphs, TERA is able to uniformly distribute load among nodes, avoiding the appearance of hot spots. This result is obtained regardless of the distribution of topic popularities. Pictures on the right show the global load experienced by nodes; in these graphs, nodes on the X axis are ordered by decreasing local subscriptions count (i.e. points on the left refer to nodes subscribed to more topics), in order to show how the global load is affected by the number of subscriptions maintained at each node. The number of subscriptions per node is also plotted with grey dots. The graphs show how load distribution closely follows the distribution of subscriptions on nodes, actually implementing the pragmatic rule “the more you ask, the more you pay”, thus fairly distributing the load among participants.
■ Experiments show how the system scales with respect to:
■ Number of subscriptions.
■ Number of topics.
■ Event publication rate.
■ Number of nodes.
■ (reference figure is given by a simple event flooding approach)
[Figure 6: The plots show the average number of messages needed by TERA to notify an event when the number of subscriptions (a), the number of topics (b), the event publication rate (c) and the total number of nodes in the system (d) vary. For each figure, results from a simple event flooding algorithm are reported for comparison.]
4.3.2 Message cost per notification

The traffic confinement strategy implemented by TERA induces some overhead. In order to assess the global impact of this overhead, we evaluated the average cost incurred by TERA to notify a single event to a subscriber, namely the total number of generated messages divided by the number of notifications. This cost includes both messages generated to diffuse the event, and messages generated for TERA's maintenance. To offer a reference figure, we also evaluated the cost incurred by a simple event flooding-based approach[7] in the same settings.

Figure 6(a) reports the results when the total number of subscriptions varies between 10^2 and 10^6. The number of topics is fixed and equal to 100. The network considered in this test was constituted by 10^4 nodes, while the event publication rate was maintained constant at 1 event per topic in each cycle. For the evaluation to be meaningful, we required each topic to be subscribed by at least one subscriber; therefore, each curve is limited on its left end by the number of available topics. Moreover, we required each node to subscribe each topic at most once; therefore, each curve

[7] Each event is broadcast in an overlay network containing all participants. The overlay is built and maintained through the same overlay management protocol employed by TERA (Cyclon). Also the broadcast mechanism is the same considered in TERA.
■ Experiments show how the system scales with respect to:■ Number of subscriptions.
■ Number of topics.
■ Event publication rate.
■ Number of nodes.
■ (reference figure is given by a simple event flooding approach)
Average notification cost
1,E+01
1,E+02
1,E+03
1,E+04
1,E+05
1,E+06
1,E+01 1,E+03 1,E+05 1,E+07
Subscriptions
Messag
es p
er n
oti
ficati
on
Event flooding TERA
nodes: 10000
topics: 100
event rate: 1
(a)
Average notification cost
1,E+00
1,E+01
1,E+02
1,E+03
1,E+04
1,E+05
1,E+06
1,E+00 1,E+01 1,E+02 1,E+03 1,E+04 1,E+05
Topics
Messag
es p
er n
oti
ficati
on
Event flooding TERA
nodes: 10000
subscriptions: 10000
event rate: 1
(b)
Average notification cost
1,E+01
1,E+02
1,E+03
1,E+04
1,E+05
1,E-05 1,E-03 1,E-01 1,E+01 1,E+03 1,E+05
Event publication rate
Messag
es p
er n
oti
ficati
on
Event flooding TERA
nodes: 10000
topics: 100
subscriptions: 10000
(c)
Average notification cost
1,E+01
1,E+02
1,E+03
1,E+04
1,E+05
1,E+06
1,E+07
1,E+08
1,E+09
1,E+01 1,E+03 1,E+05 1,E+07 1,E+09
Nodes
Messag
es p
er n
oti
ficati
on
Event flooding TERA
subscriptions: 10000
topics: 100
event rate: 1
(d)
Figure 6: The plots show the average number of messages needed by TERA to notify an event
when the number of subscriptions (a), of topics (b), the event publication rate (c) and the total
number of nodes in the system (d) varies. For each figure, results from a simple event flooding
algorithm are reported for comparison.
4.3.2 Message cost per notification
The traffic confinement strategy implemented by TERA induces some overhead. In order to assess
the global impact of this overhead, we evaluated the average cost incurred by TERA to notify a
single event to a subscriber, namely the total number of generated messages divided by the number
of notifications. This cost includes both messages generated to diffuse the event and messages
generated for TERA's maintenance. To offer a reference figure, we also evaluated the cost incurred
by a simple event flooding-based approach7 in the same settings.
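The metric just defined can be sketched in a few lines of Python. This is a minimal illustration of the computation, not TERA's actual code; the function name and the example counter values are hypothetical.

```python
# Minimal sketch (hypothetical helper, not from TERA's simulator) of the
# metric used in this section: average message cost per notification.

def avg_notification_cost(diffusion_msgs, maintenance_msgs, notifications):
    """Total messages generated (event diffusion + overlay maintenance)
    divided by the number of delivered notifications."""
    return (diffusion_msgs + maintenance_msgs) / notifications

# e.g. 5,000 diffusion messages plus 3,000 maintenance messages
# delivering 4,000 notifications cost 2 messages per notification.
print(avg_notification_cost(5_000, 3_000, 4_000))  # 2.0
```

Note that maintenance messages are charged to notifications even though they are not event traffic: this is what makes the metric sensitive to the publication rate, as Figure 6(c) shows.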
Figure 6(a) reports the results when the total number of subscriptions varies between 10^2 and
10^6. The number of topics is fixed and equal to 100. The network considered in this test was
constituted by 10^4 nodes, while the event publication rate was kept constant at 1 event per
topic in each cycle. For the evaluation to be meaningful, we required each topic to be subscribed to
by at least one subscriber; therefore, each curve is limited on its left end by the number of available
topics. Moreover, we required each node to subscribe to each topic at most once; therefore, each curve

7 Each event is broadcast in an overlay network containing all participants. The overlay is built and maintained through the same overlay management protocol employed by TERA (Cyclon). The broadcast mechanism is also the same as the one considered in TERA.
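The flooding baseline described in the footnote admits a simple back-of-envelope estimate. The sketch below is our own approximation, not the paper's simulation: it ignores maintenance traffic and assumes subscriptions are spread evenly across topics.

```python
# Back-of-envelope estimate (not from the paper's simulator) of the
# messages-per-notification cost of the event-flooding baseline.
# In flooding, every published event is broadcast to all nodes, but it
# only counts as a notification at the topic's subscribers.

def flooding_cost(nodes, topics, subscriptions):
    """Approximate messages per notification for naive event flooding.

    Assumes subscriptions are spread evenly, so each topic has
    subscriptions / topics subscribers on average; overlay maintenance
    traffic is ignored.
    """
    subscribers_per_topic = subscriptions / topics
    messages_per_event = nodes  # broadcast reaches every node
    return messages_per_event / subscribers_per_topic

# Settings of Figure 6(a): 10^4 nodes, 100 topics, 10^4 subscriptions.
print(flooding_cost(nodes=10_000, topics=100, subscriptions=10_000))  # 100.0
```

This estimate also explains the shape of the flooding curve in Figure 6(a): with nodes and topics fixed, the cost falls linearly as the number of subscriptions grows, since every extra subscriber turns an already-paid broadcast message into an additional notification.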
■ Experiments show how the system scales with respect to:
■ Number of subscriptions.
■ Number of topics.
■ Event publication rate.
■ Number of nodes.
■ (The reference figure is given by a simple event flooding approach.)
[Figure: log-log breakdown of TERA's average notification cost ("Messages per notification" vs. number of nodes; subscriptions: 10000, topics: 100, event rate: 1). Series: pub diffusion, rnd walks, topic shuffle, subs advert., general shuffle, TOTAL. The plot separates the cost incurred to diffuse events inside topic overlays from the cost incurred to maintain the general overlay.]