Università di Roma “La Sapienza” — Dipartimento di Informatica e Sistemistica
Middleware Laboratory (MIDLAB)

Publish/Subscribe Systems
Distributed Systems course (Sistemi distribuiti, AA 11/12)
Leonardo [email protected]
Sunday, 6 May 2012
■ Client/server is the most widely used architecture for distributed applications.
■ Communications take place following the request/reply (pull) interaction model.
■ Drawbacks:
■ interaction is limited to two entities (one-to-one);
■ each entity must know how to address its partner in the communication;
■ the two entities must be available at the same time in order to communicate;
■ communication is inherently synchronous;
■ communication is only pull-based.
■ To address these issues, many alternative models were introduced: RPC, shared memory, message queues, publish/subscribe.

[The pub/sub interaction model. Figure: a client and a server exchanging request and reply messages.]
Publish/subscribe was conceived as a comprehensive solution to these problems:
Many-to-many communication model - Interactions take place in an environment where various information producers and consumers can communicate, all at the same time. Each piece of information can be delivered at the same time to various consumers. Each consumer receives information from various producers.
Space decoupling - Interacting parties do not need to know each other: messages are addressed based on their content.
Time decoupling - Interacting parties do not need to be actively participating in the interaction at the same time. Information delivery is mediated through a third party.
Synchronization decoupling - Information flow from producers to consumers is also mediated, thus synchronization among interacting parties is not needed.
Push/Pull interactions - both methods are allowed.
These characteristics make pub/sub perfectly suited for distributed applications relying on document-centric communication.
■ The publish/subscribe communication paradigm:
■ Publishers: produce data in the form of events.
■ Subscribers: declare interests on published data with subscriptions.
■ Each subscription is a filter on the set of published events.
■ An Event Notification Service (ENS) notifies each subscriber of every published event that matches at least one of its subscriptions.
[Basic building blocks. Figure: publishers (information producers) and subscribers (information consumers) interacting through a mediating Event Notification Service via the subscribe, publish, notify and unsubscribe operations.]
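The building blocks above can be sketched in a few lines of illustrative Python (all names here are invented for the example): subscriptions are filters over events, and the ENS notifies every subscriber whose filter matches a published event.

```python
class EventNotificationService:
    """Toy centralized ENS: a mediator between publishers and subscribers."""

    def __init__(self):
        self._subs = []  # list of (subscriber, filter) pairs

    def subscribe(self, subscriber, flt):
        # A subscription is a filter (predicate) on the set of published events.
        self._subs.append((subscriber, flt))

    def unsubscribe(self, subscriber):
        self._subs = [(s, f) for (s, f) in self._subs if s is not subscriber]

    def publish(self, event):
        # Notify each subscriber of every event matching one of its subscriptions.
        notified = set()
        for subscriber, flt in self._subs:
            if flt(event) and id(subscriber) not in notified:
                subscriber.notify(event)
                notified.add(id(subscriber))


class Subscriber:
    def __init__(self):
        self.inbox = []

    def notify(self, event):
        self.inbox.append(event)


ens = EventNotificationService()
alice, bob = Subscriber(), Subscriber()
ens.subscribe(alice, lambda e: e["x"] > 10)
ens.subscribe(bob, lambda e: e["x"] < 0)
ens.publish({"x": 42})   # matches only alice's subscription
```

Note how publishers never address subscribers directly: space, time and synchronization decoupling all come from the mediator holding the only references.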
■ Events represent information structured according to an event schema.
■ The event schema is fixed, defined a priori, and known to all the participants.
■ It defines a set of fields or attributes, each consisting of a name and a type. The allowed types depend on the specific implementation, but basic types (integers, floats, booleans, strings) are usually available.
■ Given an event schema, an event is a collection of values, one for each attribute defined in the schema.
[Event schema and subscription models]
Example: suppose we are dealing with an application whose purpose is to distribute updates about computer-related blogs.
Event schema:

name        | type        | allowed values
blog_name   | string      | ANY
address     | URL         | ANY
genre       | enumeration | [hardware, software, peripherals, development]
author      | string      | ANY
abstract    | string      | ANY
rating      | integer     | [1-5]
update_date | date        | > 1-1-1970 00:00

Event:

name        | value
blog_name   | Prad.de
address     | http://www.prad.de/en/index.html
genre       | peripherals
author      | Mark Hansen
abstract    | “The review of the new TFT panel...”
rating      | 4
update_date | 26-4-2006 17:58
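As a sketch (not from the original slides), the blog-update schema and the example event can be checked against each other; the attribute names come from the slide, while the validation helper and the way types are modeled are assumptions.

```python
# Event schema: attribute name -> type (a simplified stand-in for the
# slide's "type + allowed values" columns).
SCHEMA = {
    "blog_name": str,
    "address": str,          # URL, modeled as a plain string here
    "genre": str,            # enumeration, checked separately below
    "author": str,
    "abstract": str,
    "rating": int,
    "update_date": str,      # date, modeled as a string here
}
GENRES = {"hardware", "software", "peripherals", "development"}


def is_valid_event(event):
    """An event is a collection of values, one per attribute of the schema."""
    if set(event) != set(SCHEMA):
        return False
    if not all(isinstance(event[name], t) for name, t in SCHEMA.items()):
        return False
    return event["genre"] in GENRES and 1 <= event["rating"] <= 5


event = {
    "blog_name": "Prad.de",
    "address": "http://www.prad.de/en/index.html",
    "genre": "peripherals",
    "author": "Mark Hansen",
    "abstract": "The review of the new TFT panel...",
    "rating": 4,
    "update_date": "26-4-2006 17:58",
}
```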
■ Subscribers express their interest in specific events by issuing subscriptions.
■ A subscription is, generally speaking, a constraint expressed on the event schema.
■ The Event Notification Service will notify a subscriber x of an event e only if the values that define the event satisfy the constraint defined by one of the subscriptions s issued by x. In this case we say that e matches s.
■ Subscriptions can take various forms, depending on the subscription language and model employed by each specific implementation.
■ Example: a subscription can be a conjunction of constraints, each expressed on a single attribute. In this case each constraint can be as simple as a comparison operator (<, =, >) applied to an integer attribute, or as complex as a regular expression applied to a string.
■ From an abstract point of view the event schema defines an n-dimensional event space (where n is the number of attributes).
■ In this space each event e represents a point.
■ Each subscription s identifies a subspace.
■ An event e matches the subscription s if, and only if, the corresponding point is included in the portion of the event space delimited by s.
[Figure: the event space, with a subscription S delimiting a subspace and two events e and e' shown as points.]
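Under this geometric reading, a subscription made of per-attribute range constraints delimits an axis-aligned subspace, and matching reduces to point-in-box containment. A minimal sketch (attribute names and bounds are illustrative):

```python
def matches(event, subscription):
    """subscription: attribute -> (low, high) bounds delimiting a subspace.
    An event (a point in the event space) matches iff it lies inside."""
    return all(lo <= event[attr] <= hi
               for attr, (lo, hi) in subscription.items())


s = {"rating": (3, 5), "price": (0, 100)}   # a subspace of a 2-D event space
e = {"rating": 4, "price": 20}              # a point inside s
e_prime = {"rating": 2, "price": 20}        # a point outside s
```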
■ Depending on the subscription model used, we distinguish various flavors of publish/subscribe:
■ Topic-based
■ Hierarchy-based
■ Content-based
■ Type-based
■ Concept-based
■ XML-based
■ ... and more
Topic-based selection: data published in the system is mostly unstructured, but each event is “tagged” with the identifier of a topic it is published in. Subscribers issue subscriptions containing the topics they are interested in.
A topic can thus be represented as a “virtual channel” connecting producers to consumers. For this reason, the problem of data distribution in topic-based publish/subscribe systems is considered quite close to that of group communication.
[Figure: a publisher and two subscribers connected through the ENS by the virtual channel of topic X.]
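Topic-based selection amounts to exact matching on the topic tag, so the ENS degenerates to a map from topics to subscriber lists, one “virtual channel” per topic. A sketch (class and variable names are invented):

```python
from collections import defaultdict


class TopicENS:
    def __init__(self):
        # topic -> list of subscriber inboxes (the "virtual channels")
        self._channels = defaultdict(list)

    def subscribe(self, topic, inbox):
        self._channels[topic].append(inbox)

    def publish(self, topic, event):
        # Deliver the tagged event on its topic's virtual channel only.
        for inbox in self._channels[topic]:
            inbox.append((topic, event))


ens = TopicENS()
inbox = []
ens.subscribe("X", inbox)
ens.publish("X", "update 1")   # delivered: subscribed topic
ens.publish("Y", "update 2")   # not delivered: different channel
```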
Hierarchy-based selection: as in the topic-based case, each event is “tagged” with the topic it is published in, and subscribers issue subscriptions containing the topics they are interested in.
In contrast to the previous model, here topics are organized in a hierarchical structure that expresses a notion of containment between topics. When a subscriber subscribes to a topic, it will receive all the events published in that topic and in all the topics of the corresponding sub-tree.
[Figure: topics organized in a hierarchy, with A at the root, B and C below it, and D deeper in the tree. An event published in topic A is delivered only to the subscriber of A; an event published in topic D is delivered both to the subscriber of D and to the subscriber of A.]
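Hierarchy-based selection can be sketched by storing, for each topic, its parent: an event published in topic t is delivered to every subscription on t or on one of its ancestors, so a subscriber of A covers A's whole sub-tree. The tree below mirrors the slide's example, assuming B and C are children of A and D is a child of C:

```python
PARENT = {"A": None, "B": "A", "C": "A", "D": "C"}  # topic -> parent topic


def ancestors_or_self(topic):
    while topic is not None:
        yield topic
        topic = PARENT[topic]


def receivers(published_topic, subscriptions):
    """subscriptions: subscriber -> subscribed topic.
    A subscriber receives the event iff its subscribed topic contains
    (is an ancestor of, or equals) the published topic."""
    covering = set(ancestors_or_self(published_topic))
    return {sub for sub, topic in subscriptions.items() if topic in covering}


subs = {"s1": "A", "s2": "D"}   # the two subscribers of the slide's figure
```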
Content-based selection: data published in the system is structured. Each subscription can be expressed as a conjunction of constraints on attributes. The Event Notification Service filters out non-matching events before notifying a subscriber.
[Figure: two subscribers attached to the ENS with subscriptions “name=Acme* AND value>20$” and “value=23$”. An event e1 (name=“Acme cables”, value=23$) matches both subscriptions and is notified to both subscribers; an event e2 (name=“Acme RE”, value=18$) matches neither and is filtered out.]
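The slide's two subscriptions can be reproduced with a small content-based matcher: the name constraint becomes a wildcard pattern and the value constraint a numeric comparison (this particular encoding of the subscription language is an assumption for the example).

```python
from fnmatch import fnmatch

# Each subscription is a conjunction of per-attribute constraints.
sub1 = lambda e: fnmatch(e["name"], "Acme*") and e["value"] > 20
sub2 = lambda e: e["value"] == 23

e1 = {"name": "Acme cables", "value": 23}
e2 = {"name": "Acme RE", "value": 18}


def notified(event, subscriptions):
    # The ENS filters out non-matching events before notifying subscribers.
    return [name for name, s in subscriptions.items() if s(event)]


subscriptions = {"sub1": sub1, "sub2": sub2}
```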
■ The Event Notification Service is usually implemented as a:
■ Centralized service: the ENS is implemented on a single server.
■ Distributed service: the ENS consists of a set of nodes, the event brokers, which cooperate to implement the service.
■ The latter is usually preferred in large settings where scalability is a fundamental issue.
[General architecture. Figure: a distributed ENS made of interconnected event brokers (B), with publishers (P) and subscribers (S) attached at its edges.]
■ Modern ENSs are implemented through a set of processes, called event brokers, forming an overlay network.
■ Each client (publisher or subscriber) accesses the service through a broker that masks the system complexity.
■ An event routing mechanism routes each event inside the ENS from the broker where it is published to the broker(s) where it must be notified.

[Event routing]
Event flooding: each event is broadcast from the publisher to the whole system.
The implementation is straightforward but very expensive.
This solution has the highest message overhead but no memory overhead.
[Figure: a broker network whose subscribers hold predicates on a single attribute x (x>30, x=167, x<18 AND x>10, x=30 OR x>200, x=30, x<>30, x<5, x>10, x>40); a published event is flooded to every broker.]
Subscription flooding: each subscription is copied to every broker in order to build locally complete subscription tables. These tables are then used to match events locally and directly notify interested subscribers. This approach suffers from a large memory overhead, but event diffusion is optimal. It is impractical in applications where subscriptions change frequently.
[Figure: the same broker network; every broker stores the complete subscription table, mapping each predicate (x>30, x<>30, x<5, x>40, x>10) to the address of its subscriber, so an event x=22 is matched locally at its entry broker.]
Filter-based routing: subscriptions are partially diffused in the system and used to build routing tables. These tables are then exploited during event diffusion to dynamically build a multicast tree that (hopefully) connects the publisher to all, and only, the interested subscribers.
[Figure: the broker network annotated with per-broker routing tables; each entry maps a neighbor link to the filter covering the subscriptions reachable through it (for example x>=30 OR (x<18 AND x>10) toward one neighbor, x>10 or x<5 toward others). An event x=22 follows only the links whose filters it matches.]
Rendez-vous routing: it is based on two functions, SN and EN, which associate subscriptions and events, respectively, to brokers in the system.
Given a subscription s, SN(s) returns a set of nodes responsible for storing s and forwarding received events matching s to all the subscribers that issued it.
Given an event e, EN(e) returns a set of nodes that must receive e to match it against the subscriptions they store.
Event routing is a two-phase process: first an event e is sent to all brokers returned by EN(e); then those brokers match it against the subscriptions they store and notify the corresponding subscribers.
This approach works only if, for each subscription s and event e such that e matches s, the intersection between EN(e) and SN(s) is not empty (mapping intersection rule).
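For a topic-tagged event space the two functions can simply hash the topic, which makes the mapping intersection rule hold trivially, since SN and EN then coincide. A sketch under that assumption (function bodies are illustrative, not the slides' definition):

```python
def SN(subscription_topic, n_brokers):
    # Brokers responsible for storing subscriptions on this topic.
    return {hash(subscription_topic) % n_brokers}


def EN(event_topic, n_brokers):
    # Brokers that must receive events on this topic for matching.
    return {hash(event_topic) % n_brokers}


# Mapping intersection rule: if an event matches a subscription (here: same
# topic), EN(e) and SN(s) must intersect -- equal by construction in this toy.
N = 16
rendezvous = EN("quotes", N) & SN("quotes", N)
```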
Rendez-Vous routing: example.
Phase 1: two nodes issue the same subscription S.
SN(S) = {4,a}
[Figure: the broker network; the subscription S is routed to and stored at the rendez-vous brokers 4 and a.]
Rendez-Vous routing: example.
Phase II: an event e matching S is routed toward the rendez-vous node, where it is matched against S.
EN(e) = {5,6,a}
Broker a is the rendez-vous point between event e and subscription S.
[Figure: event e travels through the broker network to rendez-vous broker a, where it meets subscription S.]
A generic architecture of a publish/subscribe system, as a stack of layers:

Matching: flooding, selective diffusion, gossiping
Event routing: event flooding, subscription flooding, rendez-vous, filter-based, blind gossip, informed gossip
Overlay infrastructures: broker overlay, P2P structured overlay, P2P unstructured overlay
Network protocols: TCP/IP, IP multicast, SOAP, 802.11b/g
Antonio Carzaniga, Matthew J. Rutherford, Alexander L. Wolf
“A Routing Scheme for Content-Based Networking” (SIENA), in Proceedings of IEEE INFOCOM 2004.
Abstract: “This paper proposes a routing scheme for content-based networking. A content-based network is a communication network that features a new advanced communication model where messages are not given explicit destination addresses, and where the destinations of a message are determined by matching the content of the message against selection predicates declared by nodes. Routing in a content-based network amounts to propagating predicates and the necessary topological information in order to maintain loop-free and possibly minimal forwarding paths for messages. The routing scheme we propose uses a combination of a traditional broadcast protocol and a content-based routing protocol. We present the combined scheme and its requirements over the broadcast protocol. We then detail the content-based routing protocol, highlighting a set of optimization heuristics. We also present the results of our evaluation, showing that this routing scheme is effective and scalable.”

[SIENA]
The specific architecture of this system: SIENA implements filter-based event routing on a broker overlay, over standard network protocols.
■ Each node has a service interface consisting of two operations:
■ send_message(m)
■ set_predicate(p)
■ A predicate is a disjunction of conjunctions of constraints of individual attributes.
■ A content-based network can be seen as a dynamically-configurable broadcast network, where each message is treated as a broadcast message whose broadcast tree is dynamically pruned using content-based addresses.
Combined Broadcast and Content-Based (CBCB) routing scheme:
Content-based layer: “prunes” broadcast forwarding paths.
Broadcast layer: diffuses messages in the network.
Overlay point-to-point network: manages connections.
[Figure: the broker network with the subscribers' predicates on x; an event x=22 is diffused along the broadcast tree while the content-based layer prunes the branches with no matching predicates.]
The broadcast layer:
■ A broadcast function B : N × I → I* is available at each router. Given a source node s and an input interface i, it returns a set of output interfaces.
■ The broadcast function defines a broadcast tree rooted at each source node.
■ The broadcast function satisfies the all-pairs path symmetry property: for each pair of nodes x and y, it defines two broadcast trees Tx and Ty, rooted at x and y respectively, such that the path x⇝y in Tx is congruent to the reverse of the path y⇝x in Ty.
Example: [Figure: a 15-node overlay network and a broadcast tree defined on it.]
The content-based layer:
■ Maintains forwarding state in the form of a content-based forwarding table. For each node, the table associates a content-based address with each interface.
[Figure: the 15-node network annotated with the subscribers' predicates: x>30, x=167, x<18 AND x>10, x=30 OR x>200, x=30, x<>30, x<5, x>10, x>40.]
The message forwarding mechanism:
■ The content-based forwarding table is used by a forwarding function Fc that, given a message m, selects the subset of interfaces associated with predicates matching m.
■ The result of Fc is then combined with the broadcast function B, computed for the original source of m.
■ A message is therefore forwarded along the set of interfaces returned by the following formula:
(B(source(m), incoming_if(m)) ∪ {I0}) ∩ Fc(m)
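The forwarding formula can be evaluated directly once B and Fc are modeled as sets of interfaces (a toy model; interface names are invented):

```python
def forward_set(b_ifaces, fc_ifaces, local_iface="I0"):
    """(B(source(m), incoming_if(m)) ∪ {I0}) ∩ Fc(m):
    follow the broadcast tree, pruned to the content-matching interfaces,
    always considering the local delivery interface I0 as well."""
    return (set(b_ifaces) | {local_iface}) & set(fc_ifaces)


# The broadcast layer would forward on i1, i2, i3; the content-based table
# says only i2 and I0 (a local subscriber) carry matching predicates.
out = forward_set({"i1", "i2", "i3"}, {"i2", "I0"})
```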
Example: [Figure: a message x=12 is forwarded through the network along the interfaces whose content-based addresses it matches.]
Forwarding tables maintenance:
■ Push mechanism based on receiver advertisements.
■ Pull mechanism based on sender requests and update replies.
Receiver advertisements:
■ are issued by nodes periodically and/or when the node changes its local content-based address p0.
■ Content-based RA ingress filtering: a router receiving through interface i an RA issued by node r and carrying content-based address pRA first verifies whether or not the content-based address pi associated with interface i covers pRA. If pi covers pRA, then the router simply drops the RA.
■ Broadcast RA propagation: if pi does not cover pRA, then the router computes the set of next-hop links on the broadcast tree rooted in r (i.e., B(r, i)) and forwards the RA along those links.
■ Routing table update: if pi does not cover pRA, then the router also updates its routing table, adding pRA to pi, computing pi ← pi ∨ pRA.
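The three RA rules can be sketched over a toy predicate model in which a content-based address is the set of values it accepts, so that “covers” becomes superset and “∨” becomes union; real SIENA manipulates constraint formulas instead, so this encoding is an assumption.

```python
def handle_ra(table, iface, p_ra, broadcast_next_hops):
    """Process a receiver advertisement (RA) arriving on `iface`.
    table: iface -> set of accepted values (toy content-based address).
    Returns the links the RA must be forwarded on (B(r, i))."""
    p_i = table.get(iface, set())
    if p_ra <= p_i:                  # ingress filtering: p_i covers p_RA
        return []                    # drop the RA, no forwarding
    table[iface] = p_i | p_ra        # routing table update: p_i <- p_i ∨ p_RA
    return broadcast_next_hops       # broadcast RA propagation along B(r, i)


table = {"i1": {1, 2}}
fwd = handle_ra(table, "i1", {2, 3}, ["i2", "i3"])   # widens the address
dropped = handle_ra(table, "i1", {2}, ["i2", "i3"])  # already covered
```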
Example: Broker 6 issues subscription s1. [Figure: a six-broker network; as the RA for s1 propagates from broker 6, each broker adds to its table an entry associating the interface toward broker 6 with s1.]
Example: Broker 2 issues subscription s2≺s1. [Figure: entries for s2 are added on the interfaces toward broker 2, while the RA is dropped wherever the stored address s1 already covers s2.]
Notice that, because of the ingress filtering rule, the RA protocol can only widen the content-based addresses stored in routing tables. In the long run, this may cause an “inflation” of those content-based addresses.
Example: Broker 6 substitutes its predicate with s3≺s1. [Figure: since the stored address s1 already covers s3, the RA is filtered and the tables keep advertising the wider s1.]
Sender Requests and Update Replies:
■ A router uses sender requests (SRs) to pull content-based addresses from all receivers in order to update its routing table.
■ The results of an SR come back to the issuer of the SR through update replies (URs).
■ The SR/UR protocol is designed to complement the RA protocol. Specifically, it is intended to balance the effect of the address inflation caused by RAs, and also to compensate for possible losses in the propagation of RAs.
■ An SR issued by n is broadcast to all routers, following the broadcast paths defined at each router by the broadcast function B(n, . ).
■ A leaf router in the broadcast tree immediately replies with a UR containing its content-based address p0.
■ A non-leaf router assembles its UR by combining its own content-based address p0 with those of the URs received from downstream routers, and then sends its URs upstream.
■ The issuer of the SR processes incoming URs by updating its routing table. In particular, an issuer receiving a UR carrying predicate pUR from interface i updates its routing table entry for interface i with pi ← pUR.
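UR assembly is a bottom-up fold over the broadcast tree: leaves reply with their own address, inner routers combine their address with the URs received from downstream. Same toy set-based predicate model as above (an assumption; names are illustrative):

```python
def update_reply(children_urs, own_address):
    """Compute the UR a router sends upstream.
    children_urs: URs already received from downstream routers (empty at a
    leaf). own_address: this router's content-based address p0 (a value set)."""
    ur = set(own_address)
    for child_ur in children_urs:   # non-leaf: combine p0 with downstream URs
        ur |= child_ur
    return ur                       # a leaf simply returns p0


# Broker pulling addresses via an SR: two leaves below one inner router.
leaf_a = update_reply([], {1})
leaf_b = update_reply([], {2})
inner = update_reply([leaf_a, leaf_b], set())  # router with empty p0
```

The issuer would then overwrite its table entry for the incoming interface with the received UR (p_i ← p_UR), deflating any over-wide address the RA protocol left behind.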
Example: Broker 5 sends a Sender Request (SR) to refresh its forwarding table. [Figure: the SR is broadcast from broker 5 along its broadcast tree; broker 5's current table maps interfaces 3 and 4 to s1.]
Example: Update Replies (URs) are collected on the paths toward broker 5. [Figure: leaf brokers reply with their own content-based addresses ([s2], [s3], [ ]); intermediate brokers combine them ([s2 ⋁ s3]); broker 5 updates its table so that interfaces 3 and 4 map to s2 ⋁ s3.]
■ Exercise: consider the following system:
■ The event space is represented by a single numerical attribute x, which can assume real values. Subscriptions can be expressed using the operators <, =, >.
[Figure: the 15-node broker network, with subscribers A-I and publisher P attached to its brokers.]
■ Subscribers issued the following subscriptions.
■ First define a spanning tree associated with the broker attached to publisher P. Then, for every broker, compute the content-based forwarding table associated with this spanning tree. Finally, compute the path followed by event x=16 through the ENS.

Subscriber | Subscription
A | x>23
B | x<0 OR x>90
C | x<40
D | x>25 AND x<60
E | x>5 AND x<18
F | x>5 AND x<10
G | x>15 AND x<20
H | x<12
I | x>50
■ Step 1: define a spanning tree rooted at broker 1.
■ Any tree including all the brokers is acceptable.

[Figure: one such spanning tree over the 15-node broker network, with subscribers A-I and publisher P attached.]
■ The content of the subscription tables is computed starting from each subscriber and “climbing the tree” toward the root (broker 1).
■ We are referring to a run-time status where we can assume that, independently of the order in which subscriptions were issued, the tables' content is complete.

Broker | Interface | Content-based address
1 | 2 | x>50
1 | 3 | x>23 OR (x<0 OR x>90) OR x<40 OR (x>25 AND x<60)
2 | 7 | x>50
3 | 4 | x>23 OR (x<0 OR x>90)
3 | 8 | x<40 OR (x>25 AND x<60)
4 | 5 | x>23 OR (x<0 OR x>90)
5 | 6 | x>23 OR (x<0 OR x>90)
8 | 10 | x<12 OR (x>15 AND x<20)
8 | 11 | x>5 AND x<10
8 | 12 | x<40 OR (x>5 AND x<18) OR (x>25 AND x<60)
10 | 9 | x<12 OR (x>15 AND x<20)
11 | 13 | x>5 AND x<10
12 | 14 | (x>5 AND x<18) OR (x>25 AND x<60)
14 | 15 | (x>5 AND x<18) OR (x>25 AND x<60)
■ Routing event x=16. Notified subscribers: C, E, G.
■ The content-based addresses satisfied by the event (highlighted in the original table) are those of entries 1→3, 3→8, 8→10, 10→9, 8→12, 12→14 and 14→15; the event is forwarded along exactly those interfaces.
■ On the graph: [Figure: the path of event x=16 highlighted on the spanning tree, reaching subscribers C, E and G.]
Miguel Castro, Peter Druschel, Anne-Marie Kermarrec and Antony Rowstron
“SCRIBE: A large-scale and decentralized application-level multicast infrastructure”, IEEE Journal on Selected Areas in Communications, 2002.
Abstract: “This paper presents Scribe, a scalable application-level multicast infrastructure. Scribe supports large numbers of groups, with a potentially large number of members per group. Scribe is built on top of Pastry, a generic peer-to-peer object location and routing substrate overlayed on the Internet, and leverages Pastry's reliability, self-organization, and locality properties. Pastry is used to create and manage groups and to build efficient multicast trees for the dissemination of messages to each group. Scribe provides best-effort reliability guarantees, but we outline how an application can extend Scribe to provide stronger reliability. Simulation results, based on a realistic network topology model, show that Scribe scales across a wide range of groups and group sizes. Also, it balances the load on the nodes while achieving acceptable delay and link stress when compared to IP multicast.”

[SCRIBE]
The specific architecture of this system: Scribe implements rendez-vous event routing on top of a P2P structured overlay (Pastry).
■ Scribe is a topic-based publish/subscribe system able to support a large number of groups with a potentially large number of publishers and subscribers.
■ Each user in the system (publisher or subscriber) is also a broker. The event notification service therefore consists of all the users.
■ Users can join and leave the system. The event notification service can therefore change at runtime.
■ Scribe is built upon Pastry, a peer-to-peer location and routing service.
■ Pastry is used to build and maintain the application-level topology that connects brokers in the event notification service.
■ Pastry also provides applications with efficient primitives for object storage and location.
■ Pastry implements a Distributed Hash Table:
■ Each object is associated with a key.
■ Each key is stored (together with the corresponding objects) in a node.
■ Each object can be efficiently located and retrieved knowing its key.
■ Each node participating in Pastry is identified by a 128-bit NodeID, obtained by applying a hash function h to its IP address.
■ NodeIDs are written in base 2^b, where b is a configuration parameter.
■ The function h evenly distributes node identifiers in the circular key-space [0, 2^128 − 1].
■ Each object is stored on the node with the closest NodeID.
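The closest-NodeID rule can be sketched numerically. This is an illustrative helper (the names are mine, not Pastry's API): distances are measured along the circular key space, and a key is assigned to the node whose NodeID minimizes that distance.

```python
# Hypothetical sketch of Pastry's key-to-node mapping: each object key
# is stored on the node whose NodeID is numerically closest on the
# circular key space [0, 2^128 - 1].
KEYSPACE = 2 ** 128

def circular_distance(a, b, modulus=KEYSPACE):
    """Shortest distance between two points on the identifier ring."""
    d = abs(a - b) % modulus
    return min(d, modulus - d)

def responsible_node(key, node_ids, modulus=KEYSPACE):
    """Return the NodeID numerically closest to the key on the ring."""
    return min(node_ids, key=lambda n: circular_distance(n, key, modulus))
```

For example, with nodes {23, 60, 74, 83}, key 70 is stored on node 74, whose distance to the key (4) is the smallest.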
■ Each node maintains three data structures:
■ Leaf set
■ Routing table
■ Neighborhood set
■ Leaf set: contains the set of nodes with the L/2 numerically closest larger NodeIDs, and the L/2 nodes with numerically closest smaller NodeIDs, relative to the present node’s NodeID.
■ Example: node 60, L=6
[Figure: Pastry identifier ring with nodes 2, 23, 25, 53, 60, 63, 74, 83, 98, 121, 125, 127, 135, 160, 177, 183, 191, 208, 215, 226, 240; the leaf set of node 60 is LS60 = {23, 25, 53, 63, 74, 83}.]
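The leaf-set rule can be sketched as follows, under the assumption that "closest" means the smallest gap along the circular identifier space (the function name is illustrative, not Pastry code):

```python
# A minimal sketch of leaf-set construction: the L/2 numerically
# closest smaller NodeIDs plus the L/2 numerically closest larger
# NodeIDs, measured as gaps on the circular identifier space.
def leaf_set(node_id, all_ids, L, modulus=256):
    others = [i for i in all_ids if i != node_id]
    # closest on the "smaller" side: minimal gap node_id - i (mod M)
    smaller = sorted(others, key=lambda i: (node_id - i) % modulus)[: L // 2]
    # closest on the "larger" side: minimal gap i - node_id (mod M)
    larger = sorted(others, key=lambda i: (i - node_id) % modulus)[: L // 2]
    return sorted(smaller + larger)
```

Run on the ring of the example with node 60 and L=6, this reproduces LS60 = {23, 25, 53, 63, 74, 83}.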
■ Routing table: a matrix of ⌈log_{2^b} N⌉ rows and 2^b − 1 columns. Entries in the n-th row match the current NodeID in the first n−1 digits, while their n-th digit takes one of the 2^b − 1 possible values other than the n-th digit of the current NodeID.
■ Example: routing table at node 10233102, b=2
[Routing table at node 10233102 (b=2). Columns correspond to the possible digit values 0–3; a bare digit marks the column of the local NodeID's own digit at that row:

Row 1: -0-2212102 | 1 | -2-2301203 | -3-1203203
Row 2: 0 | 1-1-301233 | 1-2-230203 | 1-3-021022
Row 3: 10-0-31203 | 10-1-32102 | 2 | 10-3-23302
Row 4: 102-0-0230 | 102-1-1302 | 102-2-2302 | 3
Row 5: 1023-0-322 | 1023-1-000 | 1023-2-121 | 3
Row 6: 10233-0-01 | 1 | 10233-2-32 |
Row 7: 0 | | 102331-2-0 |
Row 8: | | 2 | ]
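The row/column placement rule can be sketched with a small helper (hypothetical names; rows are 0-indexed here, so row r corresponds to the table's row r+1). NodeIDs are handled as digit strings:

```python
# Illustrative sketch of which routing-table slot a candidate NodeID
# belongs to: row r holds NodeIDs sharing the first r digits with the
# local ID, and the column is the value of the first differing digit.
def shared_prefix_len(a, b):
    """Number of leading digits two NodeID strings have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def routing_slot(local_id, candidate_id):
    """Return (row, column) for candidate_id, or None for the local ID."""
    r = shared_prefix_len(local_id, candidate_id)
    if r == len(local_id):
        return None  # same NodeID: no routing-table slot
    return r, int(candidate_id[r])
```

For instance, at node 10233102 the candidate 10031203 shares the prefix "10" and has digit 0 next, so it lands in (0-indexed) row 2, column 0, matching the entry 10-0-31203 in the table's row 3.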
■ Neighborhood set: list of the M closest nodes.
■ Node distance is measured using a proximity metric (IP hops, latency, bandwidth, etc.).
■ Nodes in this list are used to update entries in the routing table.
■ The main function provided by Pastry is route(msg,key).
■ Routing is realized matching key prefixes with nodes stored in each routing table.
■ In each routing step, the current node forwards the message to a node whose NodeID shares with the target key a prefix that is at least one digit longer than the prefix that the key shares with the current NodeID.
■ If no such node is found in the routing table, the message is forwarded to a node whose NodeID shares a prefix with the key as long as the current node’s, but is numerically closer to the key than the current NodeID.
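One routing step can be sketched as follows (illustrative names, not Pastry's actual code; NodeIDs and keys are digit strings of equal length):

```python
# Sketch of one step of route(msg, key): prefer a known node whose
# NodeID extends the prefix shared with the key by at least one digit;
# otherwise fall back to a node with an equally long shared prefix
# that is numerically closer to the key.
def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(local_id, key, known_ids):
    p = shared_prefix_len(local_id, key)
    # case 1: some known node shares a strictly longer prefix with the key
    better = [n for n in known_ids if shared_prefix_len(n, key) > p]
    if better:
        return max(better, key=lambda n: shared_prefix_len(n, key))
    # case 2: equally long prefix, but numerically closer to the key
    closer = [n for n in known_ids
              if shared_prefix_len(n, key) == p
              and abs(int(n) - int(key)) < abs(int(local_id) - int(key))]
    if closer:
        return min(closer, key=lambda n: abs(int(n) - int(key)))
    return None  # the local node is numerically closest: routing ends here
```

With each hop extending the shared prefix by at least one digit, a message reaches the node closest to the key in O(log N) steps.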
■ Scribe uses the key-node mapping provided by Pastry to assign a rendez-vous node to each topic:
■ Each topic t (called Group in Scribe) is mapped to a key by applying h(t)
■ EN(e)=h(e), SN(s)=h(s)
■Membership management:
■ Joining a group
■ Leaving a group
■Message diffusion
■ When a node n wants to subscribe to t (join group t):
■ it invokes route(JOIN[t],h(t))
■ the message is routed toward the rendez-vous node for t
■ each node n’ along the route checks a local groups list to see if it is currently a forwarder for t
■ if so, it accepts n as a child and adds it to the local children table
■ otherwise, it adds t to the groups list, adds n to the children table and, finally, invokes route(JOIN[t],h(t))
■ A node can unsubscribe from t at any time:
■ if it has no children, it sends a LEAVE message to its parent in the diffusion tree
■ if it still has children for that group, it cannot leave the diffusion tree
■ Message diffusion is done in two steps:
■ the node that publishes an event e for topic t invokes route(MCAST[e],h(t))
■ when the message reaches the rendez-vous point, it is diffused following the links defined by the children tables for that group.
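The JOIN handling above can be sketched as follows. `ScribeNode` and its fields are illustrative names, and Pastry routing is abstracted into a `next_hop` pointer toward the rendez-vous node:

```python
# Minimal sketch of how each node on the Pastry route handles a JOIN
# for group t: a node already forwarding for t just records the new
# child; otherwise it becomes a forwarder itself and propagates the
# JOIN one hop further toward h(t).
class ScribeNode:
    def __init__(self, node_id, next_hop):
        self.node_id = node_id
        self.next_hop = next_hop   # next node toward h(t); None at the root
        self.children = {}         # group -> set of child node ids

    def handle_join(self, group, child_id):
        if group in self.children:          # already a forwarder: graft child
            self.children[group].add(child_id)
            return
        self.children[group] = {child_id}   # become a forwarder ...
        if self.next_hop is not None:       # ... and keep routing the JOIN
            self.next_hop.handle_join(group, self.node_id)
```

Note how the second subscriber to reach a forwarder stops the JOIN there: the multicast tree is the union of the Pastry routes from subscribers to the rendez-vous node.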
■ Example
[Figure: example multicast tree for a group with key h(t)=73 on the Pastry ring of the previous example. Each node in the tree keeps an entry (h(t), children, father); the tables shown are (73; children 177, 191; father 83), (73; 121; 74), (73; 83; 83), and two leaves (73; –; 121).]
R. Baldoni, R. Beraldi, V. Quema, L. Querzoni, S. Tucci Piergiovanni, “TERA: Topic-based Event Routing for Peer-to-Peer Architectures”, International Conference on Distributed Event-Based Systems (DEBS), 2007.
Abstract: “The completely decoupled interaction model offered by the publish/subscribe communication paradigm perfectly suits the interoperability needs of today's large-scale, dynamic, peer-to-peer applications. Unmanaged inter-administrative environments, where these applications are expected to work, pose a series of problems (potentially wide number of participants, low reliability of nodes, absence of a centralized authority, etc.) that severely limit the scalability of existing approaches, which were originally thought for supporting distributed applications built on top of static and managed environments. In this paper we propose a novel architecture for implementing the topic-based publish/subscribe paradigm in large scale peer-to-peer systems. The proposed architecture is based on probabilistic mechanisms and peer-to-peer overlay management protocols. It achieves event diffusion by implementing traffic confinement (published events have a high probability to reach only interested subscribers), high scalability (with respect to several fundamental parameters like number of participants, subscriptions, topics and event publication rate) and fair load distribution (load distribution closely follows the distribution of subscriptions on nodes).”
TERA
■ A two-layer infrastructure:
■ All clients are connected by a single overlay network at the lower layer (general overlay).
■ Various overlay network instances at the upper layer connect clients subscribed to the same topics (topic overlays).
■ Event diffusion:
■ The event is routed in the general overlay toward one of the nodes subscribed to the target topic.
■ This node acts as an access point for the event that is then diffused in the correct topic overlay.
[Figure 1: The TERA publish/subscribe system. (a) System overview: a general overlay connects all nodes, and topic overlays connect the nodes subscribed to the same topic; an event is routed in the general overlay to a node used as access point, then diffused in the topic overlay. (b) Node architecture: Event Management and Subscription Management on top of Broadcast, Partition Merging, Access Point Lookup, Peer Sampling and Size Estimation, all above the Overlay Management Protocol; the application-facing API is subscribe, unsubscribe, publish, notify.]
2 An Overview of TERA

TERA is a topic-based publish/subscribe system designed to offer an event diffusion service for very large scale peer-to-peer systems. Each published event is “tagged” with a topic and is delivered to all the subscribers that expressed their interest in the corresponding topic by issuing a subscription for it. The set of available topics is not fixed, nor predefined: applications using TERA can dynamically create or delete them.

2.1 Architecture

Nodes participating in TERA are organized in a two-layer infrastructure (ref. Figure 1(a)). At the lower layer, a global overlay network connects all nodes, while at the upper layer various topic overlay networks connect subsets of all the nodes; each topic overlay contains nodes subscribed to the same topic. All these overlay networks are separated and are maintained through an overlay management protocol.

Subscription management and event diffusion in TERA are based on two simple ideas: nodes …
■ Event routing in the general overlay is realized through a random walk.
■ The walk stops at the first broker that knows an access point for the target topic.
[Figure: a random walk among brokers B1–B6 for an event on topic t. Each broker keeps an Access Point Table mapping topics to access points (e.g., a → B5, f → B6; x → B1, a → B5; e → B4, h → B4; t → B1, y → B6). The walk stops at the first broker whose APT has an entry for t, which hands the event over to the topic overlay.]
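The lookup walk can be sketched as follows (illustrative names and data shapes, not TERA code): the event hops between random neighbors until it meets a broker whose APT knows an access point for the topic, or until the walk's lifetime expires.

```python
# Hypothetical sketch of TERA's access-point lookup in the general
# overlay: a random walk of at most k hops, stopping at the first
# broker whose APT holds an access point for the target topic.
import random

def random_walk_lookup(start, topic, neighbors, apt, k, rng=random):
    """neighbors: node -> list of nodes; apt: node -> {topic: access_point}."""
    node = start
    for _ in range(k):
        if topic in apt.get(node, {}):
            return apt[node][topic]       # access point found: stop the walk
        node = rng.choice(neighbors[node])
    return None                           # walk lifetime exhausted: lookup fails
```

Returning `None` models the probabilistic nature of the mechanism: a lookup for an active topic can fail, with a probability that shrinks as the walk lifetime or the APT size grows (quantified later in the analysis).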
■ Each node maintains locally an Access Point Table (APT)
■ Each entry in the APT is a pair <topic, node address>
■ An entry <t,n> represents the fact that n is an access point for topic t.
■ The length of the APT is fixed.
■Goal:
■ each topic in the APT must be a uniform random sample among all the topics in the system;
■ the access point associated with a topic in an APT must be a uniform random sample among all the nodes subscribed to that topic.
[Example APT: topic x → access point B1, topic a → access point B5]
■ Subscription advertisement:
■ each node periodically advertises its subscriptions to a set of nodes chosen uniformly at random among the population;
■ each advertisement is a set of pairs <topic, popularity>
■ An advertisement <t,p> represents the fact that there are (approximately) p nodes subscribed to topic t.
■ APT update. When a node receives an advertisement <t,p> from node n:
■ if the APT contains an entry <t,m>, it simply replaces m with n
■ otherwise it adds a new entry <t,n> to the APT with probability 1/p
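The update rule can be sketched in a few lines (a sketch under the slide's assumptions; names are illustrative, and the random eviction when the table overflows follows the paper's description):

```python
# Sketch of the APT update rule: on an advertisement <topic, popularity>
# from node `sender`, refresh an existing entry, otherwise insert the
# new entry with probability 1/popularity. The 1/p acceptance rate is
# what keeps popular and rare topics equally represented in APTs.
import random

def apt_update(apt, topic, popularity, sender, max_size, rng=random):
    if topic in apt:
        apt[topic] = sender               # refresh with the fresher access point
        return
    if rng.random() < 1.0 / popularity:   # admit with probability 1/p
        apt[topic] = sender
        if len(apt) > max_size:           # evict a random entry to stay bounded
            del apt[rng.choice(list(apt))]
```

Intuitively, a topic with p subscribers is advertised about p times more often, but each advertisement is accepted with probability 1/p, so every active topic ends up in APTs at roughly the same rate.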
■ OMPs: Newscast, Cyclon, etc.
[Figure 2: A detailed view of the architecture of TERA. Event Management and Subscription Management sit on top of the Access Point Lookup (backed by the Access Point Table), the Inner-Cluster Dissemination and Partition Merging services, and a Subscription Table; below, one Overlay Management Protocol instance runs per overlay (the general overlay plus one per topic, each with peer sampling and size estimation). The application-facing API is publish, notify, subscribe and unsubscribe.]
as for notifying subscribers. An event dissemination starts as soon as an application publishes some data in a topic. It is done in two steps: the event is first routed to a node subscribed to the topic (this node acts as an access point for it); then, the access point diffuses the event in the overlay associated to the topic. The first step is realized through a lookup executed on the Access Point Lookup component: if the lookup returns an empty list of node identifiers, the node discards the event.

When a node subscribed to the topic receives an event for which it must act as an access point, it uses the broadcast primitive provided by the Inner-Cluster Dissemination service to forward the event to all nodes belonging to the corresponding topic overlay. When a node subscribed to the topic receives a broadcasted event, it notifies the application.

Access Points Lookup.
The Access Point Lookup component plays a central role in TERA's architecture as it is used by both the Event Management and Subscription Management components to obtain lists of access point identifiers for specific topics. Its functioning is based on a local data structure, called Access Point Table (APT), and a distributed search algorithm based on random walks.

Each APT is a cache, containing a limited number of entries, each with the form <t, n>, where t is a topic and n the identifier of a node that can act as an access point for t. APTs are continuously updated following a simple strategy: each time a node receives a subscription advertisement for topic t from a node n, it substitutes the access point identifier for t if an entry <t, n'> exists in the APT, otherwise it adds a new entry <t, n> with probability 1/Pt, where Pt is the popularity of topic t estimated by n and attached to the subscription advertisement. When an APT exceeds a predefined size, randomly chosen entries are removed.

As a consequence of this update strategy, APTs have the following properties:

1. APT entries tend to contain non-stale access points,
2. inactive topics (i.e. topics that are no longer subscribed by any node) tend to disappear from APTs,
3. each access point is a uniform random sample of the population of nodes subscribed to that topic,
4. the content of each APT is a uniform random sample of the set of active topics (i.e. topics subscribed by at least one node),
5. the size of each APT is limited.

The first property is a consequence of the way new entries are added to APTs; suppose, in fact, that there is only one topic t in the system, subscribed by two nodes, na and nb; suppose, moreover, that, at a certain point of time, nb unsubscribes t. Starting from that moment, only na will advertise t, therefore nodes containing an entry <t, nb> will eventually substitute it with entry <t, na>, as the uniformity of node samples provided by the peer sampling service guarantees that na will eventually advertise t to all the system population. The second property comes from the fact that inactive topics are no longer advertised. They are, thus, eventually replaced by active topics in APTs (assuming that the set of active topics is larger than the maximum APT size). The third property is a consequence of the fact that subscription advertisements are sent to nodes returned by the peer sampling service that provides uniform random samples, and that each node advertises its subscriptions with
■We want every topic to appear with the same probability in every APT, regardless of its popularity.
4.2.1 Topic distribution in APTs
We start by presenting an experiment showing that the method used in TERA to update APTs content ensures a uniform distribution of topics in every APT. This is a fundamental property for APTs as it allows TERA to use their content as a uniform random sample of the active topic population and build on it the access point lookup mechanism. We ran tests over a system with 10^4 nodes, each advertising its subscriptions every 5 cycles to 5 neighbors out of 20 (the overlay management protocol view size). APT size was limited to 10 entries. We issued 5000 subscriptions distributed in various ways on 1000 distinct topics, and we measured, for each topic, the number of APTs containing an entry for it. The expected outcome of these tests is to find a constant value for such measure, regardless of the initial topic popularity distribution.

Figure 3(a) shows the results for an initial uniform distribution of topic popularity. The X axis represents the topic population (each topic is mapped to a number). Each black dot represents the number of times a specific topic appears in APTs, while the grey dot represents its popularity. The plot shows that each topic is present, on average, in the same number of APTs, with a very small error that is randomly distributed around the mean. This confirms that the topic distribution in APTs can be considered uniform.

Figures 3(b) and 3(c) show the results for an initial zipf distribution of topic popularity. The two graphs report the results for differently skewed popularity distributions (distribution parameter a = 0.7 and a = 2.0). As these graphs show, TERA is always able to balance APT updates, and delivers an almost uniform distribution. Even in an extreme case (a = 2.0), the APT update mechanism is able to balance the updates coming from the small number of active topics (in this scenario only 79 topics share the whole 5000 subscriptions), maintaining their presence in APTs around the same average value with a small standard deviation (always below 5%). In the next evaluations, we only report results for zipf popularity distribution with a = 0.7, as results for other values of a did not exhibit significant differences.

4.2.2 Access Point Lookup
In this section, we evaluate the probability for the access point lookup mechanism to successfully return a node identifier for a lookup operation (in the case such a node exists). We denote by K the lifetime of the random walk (the maximum number of visited nodes), by |APT| the size of APT tables, and by |T| the number of topics[8]. The probability p to find an access point for a specific topic in an APT is p = |APT|/|T|. Assuming that every APT contains the maximum allowed number of entries, the probability that an access point cannot be found within K steps is Pr{fail} = (1 − p)^K. Thus, the probability to find the access point visiting at most K nodes is Pr{success} = 1 − (1 − p)^K = 1 − (1 − |APT|/|T|)^K.

Therefore, to ensure with probability P that an access point for a given topic will be found, it is necessary that sizes K or |APT| be such that:

[8] Thanks to the fact that APTs can be considered as uniform random samples of the set of active topics, each node can estimate at runtime the value of |T| [16].
[Figure 3: The plots show how topics are distributed among APTs (black dots) when the topic popularity distribution (grey dots) is (a) uniform (Std.Dev = 1.11), (b) zipf with a = 0.7 (Std.Dev = 2.16), and (c) zipf with a = 2.0 (Std.Dev = 51.49).]
■ What is the probability for an event to be correctly routed in the general overlay toward an access point?
■ Depends on:
■ uniform randomness of topics contained in access point tables;
■ access point table size;
■ random walk lifetime.
K = ln(1 − P) / ln(1 − |APT|/|T|)   or   |APT| = |T| · (1 − (1 − P)^(1/K))
Note that, given K and P, |APT| linearly depends on |T|. In order to reduce APT size, it would be necessary to increase random walks length (i.e. using a large value for K), negatively affecting the time it takes to find an access point. To mitigate this problem, it is advisable to launch r multiple concurrent random walks, each having a lifetime ⌈K/r⌉. Indeed, the fact that topics are uniformly distributed among APTs guarantees that launching multiple concurrent random walks does not impact the lookup success rate. In this way, access point lookup responsiveness is improved at the cost of a slightly larger overhead due to the independency of each random walk lifetime.
We ran experiments to check that TERA's behavior is close to the one predicted by the analytical study. Tests were run on a system with 1000 nodes, each having Cyclon views holding 20 nodes. At the beginning, 5000 subscriptions were issued, uniformly distributed on 1000 distinct topics. Lookups were started after 1000 cycles. Each lookup was conducted starting four concurrent random walks (r = 4).

Figure 4(a) shows how the access point lookup success ratio changes when varying the lifetime of each random walk (K) for different values of |APT|. For each line, we plotted both simulation results (solid line) and values calculated using the analytical study (dashed line). The plot confirms that TERA's lookup mechanism is able to probabilistically guarantee that an access point for an active topic will be found with probability P. Note that this plot also shows that the actual memory size required by APTs is limited. Indeed, consider the biggest APT size plotted on the graph: 400 entries. Assuming that each entry in an APT is a string containing 256 characters, the memory size occupied by an APT containing 400 entries is about 104kB.
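A quick numeric check of the lookup analysis (a sketch, not TERA code): with success probability Pr{success} = 1 − (1 − |APT|/|T|)^K, the walk lifetime needed to reach a target probability P follows by solving for K.

```python
# Numeric check of the access-point lookup analysis:
#   Pr{success} = 1 - (1 - |APT|/|T|)^K
#   K needed for target P = ln(1 - P) / ln(1 - |APT|/|T|), rounded up.
import math

def success_probability(apt_size, num_topics, k):
    """Probability that a walk of k hops finds an access point."""
    return 1.0 - (1.0 - apt_size / num_topics) ** k

def required_walk_length(apt_size, num_topics, target_p):
    """Smallest integer K ensuring success with probability target_p."""
    return math.ceil(math.log(1.0 - target_p) /
                     math.log(1.0 - apt_size / num_topics))
```

For instance, with |APT| = 100 and |T| = 1000 (p = 0.1), reaching P = 0.99 requires a walk lifetime of 44 hops; splitting it into r = 4 concurrent walks of ⌈44/4⌉ = 11 hops preserves the success rate while cutting latency.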
4.2.3 Partition Merging
In this section, we analyze the probability for the partition merging mechanism to detect a very small overlay partition, and the time it takes for this to happen. Suppose that there is a topic represented by an overlay network partitioned in two clusters containing |G| and 1 nodes, respectively[9]. Let us call n this single node. The probability p to detect the partition in a cycle can be expressed as p = 1 − (pa · pb), where pa is the probability that none of the nodes in G advertises its subscriptions to n, and pb is the probability that n does not advertise its subscriptions to any of the nodes in G.

Probability pa can be expressed as

pa = (1 − Pr{a node advertises to n})^|G|

Every node in G advertises its subscriptions to n only if n is contained in its view for the general overlay, and if n is one of the D nodes selected for the advertisement. Let us suppose, for the sake of simplicity, that D is equal to the view size. In this case Pr{a node advertises to n} = |View|/(N − 1),

[9] Note that the case where a partition is constituted by a single node is the most difficult to solve, as the probability for nodes belonging to distinct partitions to meet is the lowest possible one.
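The per-cycle detection probability can be computed directly from the formulas above (a sketch under the stated assumption that D equals the view size; by symmetry pb has the same form as pa):

```python
# Sketch of the partition-detection analysis: per-cycle probability
# that the lone node n and the cluster G "meet" via subscription
# advertisements, assuming D (advertisement fan-out) = view size.
def detection_probability(group_size, n_nodes, view_size):
    q = view_size / (n_nodes - 1)       # Pr{one node advertises to a given node}
    p_a = (1.0 - q) ** group_size       # no node in G advertises to n
    p_b = (1.0 - q) ** group_size       # n advertises to none of the |G| nodes
    return 1.0 - p_a * p_b
```

As expected, the probability grows with the cluster size |G|: larger partitions advertise more often and are merged back sooner, which matches the |G| = 4, 16, 64 curves of the evaluation.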
[Figure 4: (a) The plot shows how the success rate for access point lookups changes when varying the maximum APT size (50, 100, 400 entries) and the random walk lifetime. Solid lines represent results from the simulator, while dashed lines plot values from the formula. (b) The plot shows how the probability to detect a topic overlay partition increases with time (cycles), for |G| = 4, 16 and 64 nodes subscribed to the topic. Solid lines represent results from the simulator, while dashed lines plot values from the formula.]
■ Load imposed on nodes is fairly distributed:
■ no hot spots or single points of failure;
■ Nodes that subscribe to more topics suffer more load.
[Figure 5: The plots show how the load generated by TERA is distributed among nodes when the distribution of topic popularity is either uniform (a) or zipf (b). For both popularities, the left graph shows the load distribution in the general overlay and the right graph the global load distribution (black points), together with the subscription distribution on nodes (grey points).]
Figure 5(a) shows the results for a test with uniform topic popularity, while Figure 5(b) shows the same results for an initial zipf distribution with parameter a = 0.7. Pictures on the left show how load is distributed in the general overlay. As shown by the graphs, TERA is able to uniformly distribute load among nodes, avoiding the appearance of hot spots. This result is obtained regardless of the distribution of topic popularities. Pictures on the right show the global load experienced by nodes; in these graphs, nodes on the X axis are ordered by decreasing local subscriptions count (i.e. points on the left refer to nodes subscribed to more topics), in order to show how the global load is affected by the number of subscriptions maintained at each node. The number of subscriptions per node is also plotted with grey dots. The graphs show how load distribution closely follows the distribution of subscriptions on nodes, actually implementing the pragmatic rule “the more you ask, the more you pay”, thus fairly distributing the load among participants.
■ Experiments show how the system scales with respect to:
■ Number of subscriptions.
■ Number of topics.
■ Event publication rate.
■ Number of nodes.
■ (reference figure is given by a simple event flooding approach)
[Figure 6: The plots show the average number of messages needed by TERA to notify an event when the number of subscriptions (a), the number of topics (b), the event publication rate (c) and the total number of nodes in the system (d) vary. For each figure, results from a simple event flooding algorithm are reported for comparison.]
4.3.2 Message cost per notification

The traffic confinement strategy implemented by TERA induces some overhead. In order to assess the global impact of this overhead, we evaluated the average cost incurred by TERA to notify a single event to a subscriber, namely the total number of generated messages divided by the number of notifications. This cost includes both messages generated to diffuse the event, and messages generated for TERA's maintenance. To offer a reference figure, we also evaluated the cost incurred by a simple event flooding-based approach[7] in the same settings.

Figure 6(a) reports the results when the total number of subscriptions varies between 10^2 and 10^6. The number of topics is fixed and equal to 100. The network considered in this test was constituted by 10^4 nodes, while the event publication rate was maintained constant at 1 event per topic in each cycle. For the evaluation to be meaningful, we required each topic to be subscribed by at least one subscriber; therefore, each curve is limited on its left end by the number of available topics. Moreover, we required each node to subscribe each topic at most once; therefore, each curve

[7] Each event is broadcast in an overlay network containing all participants. The overlay is built and maintained through the same overlay management protocol employed by TERA (Cyclon). Also the broadcast mechanism is the same considered in TERA.
■ Experiments show how the system scales with respect to:■ Number of subscriptions.
■ Number of topics.
■ Event publication rate.
■ Number of nodes.
■ (reference figure is given by a simple event flooding approach)
Average notification cost
1,E+01
1,E+02
1,E+03
1,E+04
1,E+05
1,E+06
1,E+01 1,E+03 1,E+05 1,E+07
Subscriptions
Messag
es p
er n
oti
ficati
on
Event flooding TERA
nodes: 10000
topics: 100
event rate: 1
(a)
Average notification cost
1,E+00
1,E+01
1,E+02
1,E+03
1,E+04
1,E+05
1,E+06
1,E+00 1,E+01 1,E+02 1,E+03 1,E+04 1,E+05
Topics
Messag
es p
er n
oti
ficati
on
Event flooding TERA
nodes: 10000
subscriptions: 10000
event rate: 1
(b)
Average notification cost
1,E+01
1,E+02
1,E+03
1,E+04
1,E+05
1,E-05 1,E-03 1,E-01 1,E+01 1,E+03 1,E+05
Event publication rate
Messag
es p
er n
oti
ficati
on
Event flooding TERA
nodes: 10000
topics: 100
subscriptions: 10000
(c)
Average notification cost
1,E+01
1,E+02
1,E+03
1,E+04
1,E+05
1,E+06
1,E+07
1,E+08
1,E+09
1,E+01 1,E+03 1,E+05 1,E+07 1,E+09
Nodes
Messag
es p
er n
oti
ficati
on
Event flooding TERA
subscriptions: 10000
topics: 100
event rate: 1
(d)
Figure 6: The plots show the average number of messages needed by TERA to notify an event
when the number of subscriptions (a), of topics (b), the event publication rate (c) and the total
number of nodes in the system (d) varies. For each figure, results from a simple event flooding
algorithm are reported for comparison.
4.3.2 Message cost per notification
The traffic confinement strategy implemented by TERA induces some overhead. In order to assess
the global impact of this overhead, we evaluated the average cost incurred by TERA to notify a
single event to a subscriber, namely the total number of generated messages divided by the number
of notifications. This cost includes both messages generated to diffuse the event and messages
generated for TERA's maintenance. To offer a reference figure, we also evaluated the cost incurred
by a simple event flooding-based approach7 in the same settings.
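The metric just defined can be sketched in a few lines of Python. This is a minimal illustration of the computation, not TERA's actual code; the function name and the example counter values are hypothetical.

```python
# Minimal sketch (hypothetical helper, not from TERA's simulator) of the
# metric used in this section: average message cost per notification.

def avg_notification_cost(diffusion_msgs, maintenance_msgs, notifications):
    """Total messages generated (event diffusion + overlay maintenance)
    divided by the number of delivered notifications."""
    return (diffusion_msgs + maintenance_msgs) / notifications

# e.g. 5,000 diffusion messages plus 3,000 maintenance messages
# delivering 4,000 notifications cost 2 messages per notification.
print(avg_notification_cost(5_000, 3_000, 4_000))  # 2.0
```

Note that maintenance messages are charged to notifications even though they are not event traffic: this is what makes the metric sensitive to the publication rate, as Figure 6(c) shows.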
Figure 6(a) reports the results when the total number of subscriptions varies between 10^2 and
10^6. The number of topics is fixed and equal to 100. The network considered in this test was
constituted by 10^4 nodes, while the event publication rate was kept constant at 1 event per
topic in each cycle. For the evaluation to be meaningful, we required each topic to be subscribed to
by at least one subscriber; therefore, each curve is limited on its left end by the number of available
topics. Moreover, we required each node to subscribe to each topic at most once; therefore, each curve

7 Each event is broadcast in an overlay network containing all participants. The overlay is built and maintained through the same overlay management protocol employed by TERA (Cyclon). The broadcast mechanism is also the same as the one considered in TERA.
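The flooding baseline described in the footnote admits a simple back-of-envelope estimate. The sketch below is our own approximation, not the paper's simulation: it ignores maintenance traffic and assumes subscriptions are spread evenly across topics.

```python
# Back-of-envelope estimate (not from the paper's simulator) of the
# messages-per-notification cost of the event-flooding baseline.
# In flooding, every published event is broadcast to all nodes, but it
# only counts as a notification at the topic's subscribers.

def flooding_cost(nodes, topics, subscriptions):
    """Approximate messages per notification for naive event flooding.

    Assumes subscriptions are spread evenly, so each topic has
    subscriptions / topics subscribers on average; overlay maintenance
    traffic is ignored.
    """
    subscribers_per_topic = subscriptions / topics
    messages_per_event = nodes  # broadcast reaches every node
    return messages_per_event / subscribers_per_topic

# Settings of Figure 6(a): 10^4 nodes, 100 topics, 10^4 subscriptions.
print(flooding_cost(nodes=10_000, topics=100, subscriptions=10_000))  # 100.0
```

This estimate also explains the shape of the flooding curve in Figure 6(a): with nodes and topics fixed, the cost falls linearly as the number of subscriptions grows, since every extra subscriber turns an already-paid broadcast message into an additional notification.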
■ Experiments show how the system scales with respect to:
■ Number of subscriptions.
■ Number of topics.
■ Event publication rate.
■ Number of nodes.
■ (The reference figure is given by a simple event flooding approach.)
[Figure: log-log breakdown of TERA's average notification cost ("Messages per notification" vs. number of nodes; subscriptions: 10000, topics: 100, event rate: 1). Series: pub diffusion, rnd walks, topic shuffle, subs advert., general shuffle, TOTAL. The plot separates the cost incurred to diffuse events inside topic overlays from the cost incurred to maintain the general overlay.]