Elastic Scaling of a High-Throughput Content-Based Publish/Subscribe Engine
Raphaël Barazzutti, Thomas Heinze, André Martin, Emanuel Onica, Pascal Felber, Christof Fetzer,
Zbigniew Jerzak, Marcelo Pasin and Etienne Rivière
University of Neuchâtel, Switzerland
SAP AG, Germany
TU Dresden, Germany
ICDCS 2014, Madrid
[email protected]
Context
• Publish/Subscribe as a service
  • Running on private or public clouds
  • Decoupled communication
  • Composition of applications from multiple domains
• Content-based filtering
  • Subscriptions filter on the content of publications
  • Canonical example: stock quote filtering
[Figure: Pub/Sub as a Service spanning administrative domains A, B and C. A subscriber calls register(sub) with s: name = "IBM", price > 120$, volume > 10,000; a publisher calls register(pub). A publication p with name = "IBM", price = 131$, volume = 12,312 matches s and triggers notify(pub); a publication p with name = "IBM", price = 112$, volume = 9,892, open = 109$, close = 113$ does not match.]
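As an illustrative sketch (hypothetical types, not the StreamHub API), a content-based subscription is a conjunction of predicates evaluated against each publication's attributes, as in the stock-quote example above:

```java
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration of content-based filtering (not StreamHub's API):
// a subscription is a conjunction of predicates over publication attributes.
public class ContentFilterExample {
    public static void main(String[] args) {
        Predicate<Map<String, Object>> sub = pub ->
            "IBM".equals(pub.get("name"))
            && (int) pub.get("price") > 120
            && (int) pub.get("volume") > 10_000;

        Map<String, Object> match = Map.of("name", "IBM", "price", 131, "volume", 12_312);
        Map<String, Object> noMatch = Map.of("name", "IBM", "price", 112, "volume", 9_892);

        System.out.println(sub.test(match));   // true  -> subscriber is notified
        System.out.println(sub.test(noMatch)); // false -> publication filtered out
    }
}
```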
Pub/sub as a service: requirements
• Arbitrary representation of publications and subscriptions
  • Not limited to attribute-based filtering
  • Untrusted domains: encrypted filtering
• High throughput and scalability
  • Thousands of subscriptions per second
  • Thousands of publications per second
  • Thousands to millions of notifications per second
• Availability, dependability and low delays
• StreamHub [DEBS 2013]
  • Supports arbitrary filtering schemes, in particular encrypted filtering (ASPE)
  • Built on top of a Stream Processing Engine
[Figure: Pub/Sub as a Service between trusted publisher and subscriber domains; publications p and subscriptions s are encrypted before entering the untrusted domain that hosts the service.]
Problem: resource provisioning
• Pub/sub service resource usage is unpredictable and varies over time
  • Example: Frankfurt stock exchange, Nov 18, 2011
• Throughput/delay requirements versus budget call for appropriate provisioning
• This talk: can we make a high-throughput pub/sub service elastic?
  • Scale in and scale out based on actual requirements
  • Maintain service continuity and minimize reconfiguration impact
[Figure: Frankfurt stock exchange, Nov 18, 2011: ticks per second (0 to 1,200) over the trading day (09:00 to 21:00).]
Outline
• Background
  • StreamHub: high-throughput content-based pub/sub engine
  • StreamMine3G: stream processing engine
• e-StreamHub (elastic StreamHub) principles
  • Components
  • Slice migration
  • Enforcing elasticity policies
• Evaluation
  • Micro-benchmarks
  • Trace-based, using the Frankfurt stock exchange workload
StreamHub: principles
• Tiered approach to pub/sub for cloud deployments
  • Split pub/sub into three fundamental, consecutive operations
  • Exploit the massive data parallelism of each operation
• Supports arbitrary filtering mechanisms
  • Event flows do not depend on the filtering scheme's characteristics
  • This work: computationally intensive encrypted filtering
• StreamHub engine = stream processing application
  • Each of the 3 pub/sub operations is mapped to a stream operator
  • Operators supported by a Stream Processing Engine
StreamMine3G stream processing engine
• Distributed stream computation as a DAG of operators
• Each operator is split into multiple slices
  • Externalized state management, no state sharing between slices
• Support for unicast, anycast & broadcast primitives
[Figure: a DAG of four operators, each split into slices (slice 1 … slice n); <key, value> events reach the slices of the next operator by broadcast, anycast, or unicast (e.g., all <k,v> with k % 2 = 1 go to the same slice); each slice is supported by multiple cores and manages its state without shared memory.]
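A minimal sketch of the key-based unicast routing shown in the figure, under assumed semantics (this is not the SM3G API):

```java
// Minimal sketch (assumed semantics, not the SM3G API) of key-based unicast
// routing in a DAG engine: an event <key, value> is sent to exactly one slice
// of the downstream operator, chosen from the key, so slices never share state.
public class SliceRouting {
    static int sliceFor(Object key, int numSlices) {
        // e.g., with two slices, all keys with k % 2 == 1 land on slice 1
        return Math.floorMod(key.hashCode(), numSlices);
    }

    public static void main(String[] args) {
        System.out.println(sliceFor("IBM", 4)); // same key -> same slice, always
        System.out.println(sliceFor(131, 2));   // 131 % 2 == 1 -> slice 1
    }
}
```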
StreamHub Engine [DEBS 2013]
Operator: Access Point (AP) — Function: subscription partitioning — State: stateless
➡ Decide where to store subs
➡ Broadcast pubs to the next operator
Operator: Matching (M) — Function: publication filtering — State: stateful (persistent)
➡ Store subs
➡ Filter incoming pubs, create the list of matching subscriber ids
Operator: Exit Point (EP) — Function: publication dispatching — State: stateful (transient)
➡ Aggregate the lists of matching subscriber ids
➡ Prepare & dispatch notifications
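The end-to-end flow through the three operators can be illustrated with a toy sketch (illustrative only, not the engine's code):

```java
import java.util.List;
import java.util.Map;
import java.util.function.IntPredicate;

// Toy sketch (illustrative only) of the AP -> M -> EP flow: AP broadcasts a
// publication to every Matching slice, each M slice filters only the
// subscriptions it stores, and EP aggregates the matching subscriber ids.
public class PipelineSketch {
    public static void main(String[] args) {
        // each M slice holds a disjoint partition of subscriptions (id -> predicate)
        Map<Integer, IntPredicate> mSlice1 = Map.of(1, p -> p > 120, 2, p -> p > 200);
        Map<Integer, IntPredicate> mSlice2 = Map.of(3, p -> p < 100);
        List<Map<Integer, IntPredicate>> mSlices = List.of(mSlice1, mSlice2);

        int publication = 131;                                // AP: broadcast to all M
        List<Integer> matching = mSlices.stream()             // M: filter per slice
            .flatMap(slice -> slice.entrySet().stream()
                .filter(e -> e.getValue().test(publication))
                .map(Map.Entry::getKey))
            .toList();                                        // EP: aggregate ids
        System.out.println("notify subscribers " + matching); // EP: dispatch
    }
}
```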
[Figure: e-StreamHub engine deployed on a public cloud. A subscriber encrypts SUB (x > 3 & y == 5 → C5F80BA363) and a publisher encrypts PUB (x = 7, y = 10 → 88F3B2A09C) with ASPE; a DCCP (Data Conversion and Connection Point) forwards them into the engine. Subscriptions are unicast to one AP slice (AP:1 … AP:6) and stored at one M slice; publications are broadcast from the AP operator to all M slices (M:1 … M:6); matching results reach the EP slices (EP:1 … EP:6) by multicast/anycast, and notifications of matching encrypted publications return through the same DCCP.]
e-StreamHub (elastic StreamHub): principle
• Fixed number of slices for each operator
• Elastic scaling decisions based on the experienced load
  • Scale out: allocate new host(s), migrate slices
  • Scale in: migrate slices, deallocate host(s)
  • Redistribute slices upon load imbalance between hosts
[Figure: initially, all slices (AP:1–6, M:1–6, EP:1–6) are placed on a single host; scale out spreads them over Hosts 1–3; scale in consolidates them back onto Hosts 1–2.]
e-StreamHub: components
• StreamMine3G: support for slice migration (including state)
  • Goal: minimal interruption of the event flow
• Manager: orchestrates migrations and collects workload probes
• SM3G and manager state stored in ZooKeeper for dependability (see the sketch below)
• Elasticity enforcer: takes migration decisions based on the observed load
  • Based on elasticity policies
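A minimal sketch of how placement state could be kept in ZooKeeper (the znode layout and host names are assumptions, not e-StreamHub's actual schema):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Minimal sketch (assumed znode layout, not e-StreamHub's actual schema):
// the manager records which host runs each slice in ZooKeeper, so a
// restarted manager can recover the placement after a failure.
public class ManagerStateSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> {});
        String path = "/estreamhub/placement/M:3"; // parent znodes assumed to exist
        byte[] host = "host2".getBytes();
        if (zk.exists(path, false) == null)
            zk.create(path, host, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        else
            zk.setData(path, host, -1);            // -1 matches any version
        zk.close();
    }
}
```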
[Figure: SM3G runtimes on Hosts 1–3 host the operator slices (AP:1–3, M:1–3, EP:1–3); the Manager collects probes and orchestrates migrations; the Elasticity enforcer is configured with elasticity policies, receives aggregated probes, and issues migration requests; SM3G and Manager state is stored in ZooKeeper.]
Slice migration
1. Input events are duplicated to a buffer on the destination host
2. Slice halted; state is copied from the source to the destination host
  • Incoming events are still buffered
  • The state is associated with a vector of sequence numbers, one per slice of the previous operator
  • Allows knowing which events the state already accounts for
3. Missed events replayed on the state, slice resumed
➡ “Downtime” of the slice is essentially the state copying time (see the replay sketch below)
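A self-contained sketch of the replay decision in step 3 (hypothetical names, not the SM3G API): the copied state carries one sequence number per upstream slice, and a buffered event is replayed only if that state does not yet account for it.

```java
import java.util.List;
import java.util.Map;

// Sketch (hypothetical names, not the SM3G API) of step 3's replay decision:
// the migrated state carries one sequence number per upstream slice; a
// buffered event is replayed only if the copied state has not accounted for it.
public class MigrationReplay {
    record Event(String upstreamSlice, long seqNo, String payload) {}

    static boolean needsReplay(Map<String, Long> stateSeqVector, Event e) {
        // the state covers all events from that upstream up to the stored seqNo
        return e.seqNo() > stateSeqVector.getOrDefault(e.upstreamSlice(), -1L);
    }

    public static void main(String[] args) {
        Map<String, Long> seqVector = Map.of("M:1", 41L, "M:2", 17L);
        List<Event> buffered = List.of(
            new Event("M:1", 41, "already in copied state"),
            new Event("M:1", 42, "replay me"),
            new Event("M:2", 18, "replay me too"));
        buffered.stream().filter(e -> needsReplay(seqVector, e))
                .forEach(e -> System.out.println("replay " + e));
    }
}
```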
[Figure: migration of slice OP:1 from Host 1 to Host 2, in five steps: (1) input events are duplicated into a buffer on Host 2; (2) the slice is halted on Host 1; (3) its state is copied to Host 2; (4) the buffered events (sequence numbers t to t+n) are replayed; (5) the slice resumes on Host 2.]
Elasticity policy
• CPU utilization probes collected periodically
  • Secondary metrics: network usage and memory usage
• Global rules: criteria on the average load
  • Trigger scale-in and scale-out operations
  • Define a target average CPU utilization
• Local rules: criteria on the load of a single host
• Grace period between migrations (30 s)
• Minimize the cost of migrations (sum of migrated state sizes)
Rule   | CPU criterion  | Effect
Global | average > 70%  | scale out
Global | average < 30%  | scale in
Local  | < 20% or > 80% | redistribute slices
Target average CPU = 50%
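As a sketch, the rules above map to a simple check (thresholds from the table; illustrative code, not the enforcer's implementation, and the 30 s grace period is omitted):

```java
import java.util.Arrays;

// Sketch of the policy rules in the table above (illustrative): global rules
// act on the fleet-wide average CPU, the local rule on any single host,
// with the enforcer aiming at a 50% average utilization.
public class ElasticityPolicy {
    enum Action { SCALE_OUT, SCALE_IN, REDISTRIBUTE, NONE }

    static Action evaluate(double[] hostCpuPercent) {
        double avg = Arrays.stream(hostCpuPercent).average().orElse(0);
        if (avg > 70) return Action.SCALE_OUT;   // global rule: overload
        if (avg < 30) return Action.SCALE_IN;    // global rule: underload
        for (double c : hostCpuPercent)          // local rule: imbalance
            if (c < 20 || c > 80) return Action.REDISTRIBUTE;
        return Action.NONE;
    }

    public static void main(String[] args) {
        System.out.println(evaluate(new double[]{85, 90, 75})); // SCALE_OUT
        System.out.println(evaluate(new double[]{45, 15, 60})); // REDISTRIBUTE
    }
}
```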
Elasticity policy enforcement
• Three-step resolution when a global or local rule is violated
1. Decide on the set of slices to migrate
  • For each host, identify a set of slices with: sum(CPU utilization) ≥ abs(current average utilization − target utilization)
  • Subset sum problem: dynamic programming, pseudo-polynomial complexity
  • Multiple solutions may exist: select the one involving the least state transfer (see the selection sketch after this list)
2. Scaling out: add enough hosts for avg(load) to be ≤ target (50%)
   Scaling in: mark enough least-loaded hosts for avg(load) to be ≥ target (50%)
3. Decide on the new placement
  • First-fit bin packing algorithm (see the placement sketch below)
  • Start from the current placement without the selected slices; greedily assign them by decreasing weight
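Step 1 can be sketched as follows (illustrative, hypothetical numbers; a brute-force search over subsets stands in for the pseudo-polynomial subset-sum dynamic program):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of step 1 (illustrative, hypothetical numbers): pick the set of
// slices on a host whose CPU shares sum to at least the required offload,
// preferring the set with the least state to copy. Brute force over subsets
// stands in for the pseudo-polynomial subset-sum dynamic program.
public class SliceSelection {
    record Slice(String name, int cpu, int stateKb) {}

    static List<Slice> select(List<Slice> slices, int delta) {
        List<Slice> best = null;
        int bestState = Integer.MAX_VALUE;
        for (int mask = 1; mask < (1 << slices.size()); mask++) {
            int cpu = 0, state = 0;
            List<Slice> set = new ArrayList<>();
            for (int i = 0; i < slices.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    cpu += slices.get(i).cpu();
                    state += slices.get(i).stateKb();
                    set.add(slices.get(i));
                }
            }
            if (cpu >= delta && state < bestState) { // feasible and cheaper
                bestState = state;
                best = set;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Slice> host = List.of(new Slice("AP:1", 12, 5),
                new Slice("M:1", 25, 1200), new Slice("EP:1", 13, 40));
        System.out.println(select(host, 20)); // cheapest set with >= 20% CPU
    }
}
```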
[Figure: before scale out, Host 1 runs AP:1 (12%), AP:2 (11%), M:1 (25%) and M:2 (25%), and Host 2 runs M:3 (26%), M:4 (23%), EP:1 (13%) and EP:2 (12%); after scale out, the AP and EP slices move to a new Host 3, leaving M:1/M:2 on Host 1 and M:3/M:4 on Host 2.]
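The placement of step 3 can be sketched as first-fit decreasing bin packing (illustrative code with numbers taken from the figure above, not the actual implementation):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of step 3 (illustrative, numbers from the figure above): first-fit
// decreasing bin packing of the selected slices onto hosts, assigning the
// heaviest slice first to the first host that stays at or under the target.
public class FirstFitPlacement {
    public static void main(String[] args) {
        double target = 50.0;                  // target CPU utilization per host (%)
        double[] hostLoad = {50.0, 49.0, 0.0}; // M slices stay put; Host 3 is new
        Map<String, Double> slices = new LinkedHashMap<>(Map.of(
            "EP:1", 13.0, "AP:1", 12.0, "EP:2", 12.0, "AP:2", 11.0));

        slices.entrySet().stream()
              .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
              .forEach(s -> {
                  for (int h = 0; h < hostLoad.length; h++) {  // first fit
                      if (hostLoad[h] + s.getValue() <= target) {
                          hostLoad[h] += s.getValue();
                          System.out.println(s.getKey() + " -> Host " + (h + 1));
                          return;                              // next slice
                      }
                  }
                  System.out.println(s.getKey() + " does not fit: scale out again");
              });
    }
}
```

Running the sketch sends all four AP/EP slices to the new Host 3, reproducing the "after scale out" placement in the figure.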
Evaluation
Experimental setup
• Prototype in C, C++ and Java
• 22 hosts, each with 8 cores and 8 GB of RAM
  • 1 Gbps switched network
• 4 hosts running generators (G) and sinks (S)
• 3 hosts for infrastructure
• 1 to 15 hosts used for e-StreamHub
  • 8 to 120 cores
• Number of slices is fixed
  • Micro-benchmarks: 4 AP, 8 M, 4 EP
  • Trace-based evaluation: 8 AP, 16 M, 8 EP
• Encrypted pubs and subs (ASPE)
  • 100,000 encrypted subs
  • Each sub is ~1.2 KB
  • 1% matching ratio: 1,000 notifications per publication
[Figure: experimental setup. Hosts A–D run generator/sink pairs G:1/S:1 … G:4/S:4; Host E runs ZooKeeper, Host F the Manager, and Host G the Elasticity enforcer; Hosts 1–15 run e-StreamHub, with all slices (AP:1–8, M:1–16, EP:1–8) initially on Host 1 and Hosts 2–15 unused.]
Static e-StreamHub: performance
[Figure, left: distribution of delays (min, 25th, 50th, 75th percentiles, max, in ms) under half of the maximum throughput, for 2 to 12 hosts attributed to operators as AP|M|EP (2: 0.5|1|0.5 up to 12: 3|6|3); the 12-host configuration uses 3 hosts for AP slices, 6 for M slices and 3 for EP slices.]
[Figure, right: maximum throughput (publications/s) for the same configurations; at 12 hosts, 42.2 million matchings and 422,000 notifications per second.]
Slice migration: times
• Individual migration times, averaged over 25 migrations
• Under a constant flow of 100 publications per second
• Worst-case scenario with “large” slice states
  • 4 AP, 8 M and 4 EP slices, with 100K and 400K subscriptions in total
• Migration time is minimal for AP and EP, and dominated by the state size for M
                   | AP     | M (12,500 subs/slice, 100,000 total) | M (50,000 subs/slice, 400,000 total) | EP
Average            | 232 ms | 1.497 s                              | 2.533 s                              | 275 ms
Standard deviation | 31 ms  | 354 ms                               | 1.557 s                              | 52 ms
Impact of migrations on delay
[Figure: min, average and max notification delay (0 to 2.5 s) over time (100 to 200 s), with markers for five migrations: AP (1), AP (2), M (1), M (2) and EP.]
Steady increase/decrease of load
• Increase/decrease of the publication rate
  • 0 to 350 publications per second
  • 0 to 35 million matching operations per second
• CPU load: average, max and min over 30 seconds
[Figure: publication rate (0 to 400 publications/s) and number of hosts (0 to 18) over time (0 to 1,800 s), with host CPU load (min, avg, max; 0 to 100%).]
• Notification delays: average over 30 seconds
[Figure: same publication rate and host count, with average notification delays (0 to 4 s) over time (0 to 1,800 s).]
Frankfurt stock exchange
• Replay of the Frankfurt stock exchange workload
  • Sped up 10 times: about 40 minutes shown
  • 100,000 encrypted subscriptions
• CPU load: average, max and min over 30 seconds
[Figure: publication rate (0 to 200 publications/s) and number of hosts (0 to 10) over time (0 to 2,500 s), with host CPU load (min, avg, max; 0 to 100%).]
• Notification delays: average over 30 seconds
[Figure: same publication rate and host count, with average notification delays (0 to 2 s) over time (0 to 2,500 s).]
Conclusion
• Elastic scaling of a high-throughput content-based pub/sub engine
  • Implemented at the level of the supporting stream processing engine
  • Applicable to other long-lived stream processing applications
• Support of live operator slice migration
• Redistribution of slices according to monitored workload
• Allows deploying pub/sub as a service on a public cloud without the need to provision for worst-case scenarios
• Evaluation using Frankfurt stock exchange traces
• Perspectives
  • Monitor network flows; minimize inter-host communication in migration plans
  • Leverage active replication, used for dependability, to minimize the impact of migrations on delay (this requires deterministic execution)
Elastic Scaling of a High-Throughput Content-Based Publish/Subscribe Engine
Questions?
This work was supported by the EU FP7 program (SRT-15 project)
http://www.srt-15.eu
ICDCS 2014, Madrid
[email protected]
Additional slides
Independence from Filtering Scheme
• Attribute-based filtering is widely studied
  • Represents content using discrete attributes
  • Subscriptions = conjunctions of discrete predicates on attribute values
  • Broker overlays typically rely on the containment & aggregation capabilities of attribute-based filtering
• Alternative filtering schemes
  • Encrypted filtering: ASPE (Choi et al., DEXA 2010)
  • Prefiltering (DEBS 2012)
  ➡ No guaranteed support for containment or aggregation
• The architecture of a pub/sub engine should be independent from the filtering scheme(s) it supports
[Figure: without encrypted filtering, the pub/sub service runs in a domain that publishers and subscribers must trust (?).]
[Figure: with encrypted filtering, publications and subscriptions are encrypted within the trusted publisher and subscriber domains before entering the untrusted domain hosting the service.]
Events Paths and Support Libraries
[Figure: a subscription s travels over TCP from the subscriber to a DCCP, then to an AP slice, which uses libcluster to determine the key to the matching operator (clustering) and anycasts s to an M slice, where it is stored in the operator slice state. A publication p travels over TCP from the publisher to a DCCP, then to an AP slice, and is broadcast to all M slices; each M slice uses libfilter to filter the publication and return its list of matching subscribers; the publication with its subscriber list (p,{s}) is sent (unicast/multicast) to EP slices, which notify the matching subscribers through the DCCPs. Legend: lib = support library (optional), O = operator slice, C = static component.]
StreamHub interface
• Data Conversion and Connection Point(s) (DCCP)
  • Stateless component(s) running on server(s) with direct WAN connectivity
  • Maintain persistent TCP connections to/from publishers and subscribers
  • Convert between external and internal formats (see the sketch below)
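A toy sketch of a connection point's receive side (the port number and length-prefixed framing are assumptions, and GPB parsing is elided):

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Toy sketch of a connection point's receive side (port and framing are
// assumptions; GPB parsing is elided): accept persistent client connections
// and read length-prefixed external messages to convert and forward.
public class DccpSketch {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(7000)) {
            while (true) {
                Socket client = server.accept();          // persistent TCP connection
                new Thread(() -> handle(client)).start(); // one thread per client
            }
        }
    }

    static void handle(Socket client) {
        try (DataInputStream in = new DataInputStream(client.getInputStream())) {
            while (true) {
                int length = in.readInt();                // length-prefixed frame
                byte[] external = in.readNBytes(length);  // external (e.g., GPB) bytes
                // conversion to the internal format and hand-off to the engine
                // would happen here; both steps are stubbed in this sketch
                System.out.println("received " + external.length + "-byte message");
            }
        } catch (IOException e) {
            // client disconnected
        }
    }
}
```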
[Figure: the DCCP sits between publishers/subscribers and the StreamHub engine; it maintains persistent TCP connections and converts between the external GPB format and the internal format.]
Use an Overlay of Brokers?
• Brokers organized in an overlay (mesh, tree); each broker performs all pub/sub operations
  • Storing subscriptions; matching, forwarding and notifying of publications
• Complex maintenance of routing tables between brokers
• Assumptions on the filtering scheme for inter-broker communication and publication forwarding: containment & aggregation
• Scalability/throughput depend on the workload; notification delays may vary
[Figure: subscriptions s propagate from subscribers through the broker overlay, triggering routing table (RT) updates between brokers.]
[Figure: publications p flow from publishers through the broker overlay toward the matching subscribers.]