Using Self-Regenerative Tools to Tackle Challenges of Scale
Ken Birman, QuickSilver Project, Cornell University
2
The QuickSilver team
At Cornell:
  Birman/van Renesse: Core platform
  Gehrke: Content filtering technology
  Francis: Streaming content delivery
At Raytheon:
  DiPalma/Work: Military scenarios
With help from: the AFRL JBI team in Rome, NY
3
Technical Overview
Objective: Overcome communications challenges that plague (and limit) current GIG/NCES platforms
  For example, dramatically improve time-critical event delivery delays and the speed of the event filtering layer
  Do this even when sustaining damage or when under attack
Existing COTS publish-subscribe technology, particularly in Web Services (SOA) platforms:
  Not designed for these challenging settings. Scales poorly. Very expensive to own and operate, and easily disabled
  Forces the military to "hack around" limitations, or else major projects can stumble badly (the Navy's CEC effort is an example)
  Problem identified by the Air Force JBI team in Rome, NY
  Also a major concern for companies like Google and Amazon
  Charles Holland, Assistant DDR&E, highlighted the topic as a top priority
4
Technical Overview
The GIG/NCES vision centers on reliable communication protocols, like publish-subscribe.
  The underlying protocols are old... they hit their limits 15 years ago!
  Faster hardware has helped... but only a little
Peer-to-peer epidemic protocols ("gossip") have never been applied in such systems
  We're fusing these with more conventional protocols, and achieving substantial improvements
  This also makes our system robust and self-repairing
Existing systems take an all-or-nothing approach to reliability. Under stress, we often get nothing.
  Probabilistic guarantees enable better solutions
  But we need provable guarantees of quality
5
Major risks, mitigation
Building a big platform fast... despite profound technical hurdles.
  But we are not constrained by an existing product to sell.
  We have already demonstrated some solutions in the SRS seedling
Users demand standards.
  We're extending the Web Services architecture and tools
Focus on the real needs of military users.
  We work closely with AFRL (JBI) and Raytheon (Navy)
What about the baseline (scenario II) and quantitative metrics and goals?
  Deferred until the last 15 minutes of the talk.
6
Expected major achievement?
QuickSilver will represent a breakthrough technology for building new GIG/NCES applications
  ... applications that operate reliably even under stress that cripples existing COTS solutions
  ... that need far less hand-holding from the application developer, deployment team, and systems administrator (saving money!)
  ... and that enable powerful new information-enabled applications for uses like self-managed sensor networks, new real-time information tools for urban warfare, and control and exploitation of autonomous vehicles
We'll enable the military to take GIG concepts into domains where commercial products just can't go!
7
Our topic: GIG and NCES platforms
Military computing systems are growing
  ... larger,
  ... and more complex,
  ... and must operate "unattended"
With existing technology they
  ... are far too expensive to develop
  ... require much too much time to deploy
  ... are insecure and too easily disrupted
QuickSilver: brings SRS concepts to the table
8
How are big systems structured?
Typically a "data center" of web servers
  Some human-generated traffic
  Some automatic traffic from WS clients
Front-end servers connected to a pool of back-end application "services" (new applications on clusters, plus wrapped legacy applications)
Publish-subscribe is very popular
Sensor networks have similarities, although they lack this data center "focus"
9
GIG/NCES (and SOA) vision
Pub-sub combined with point-to-point communication technologies like TCP
[Diagram: clients connect via "front-end interface systems" to a pool of load-balanced (LB) services; legacy applications are exposed through wrappers]
10
Big sensor networks?
QuickSilver will also be useful in, e.g., sensor networks
We're focused on a fixed mesh of sensors using wireless ad-hoc communication, mobile query sources, and QuickSilver as the middleware
11
How to build big systems today?
The programmer is on his own!
  Expected to use GIG/NCES standards, based on Service Oriented Architectures (SOAs)
  No support for this architecture as a whole
  The focus is on isolated aspects, like legacy wrappers
Existing SOAs focus on
  A single client and a single server
  No attention to performance, stability, or scale
  The structure of the data center is overlooked!
The result is high costs and lower-quality solutions
12
Drill down: An example
Many services (not all) will be RAPS of RACS
  RAPS: A reliable array of partitioned services
  RACS: A reliable array of cluster-structured server processes
[Diagram: General Pershing searches for "Faluja SITREP 11-22-04 0900h". The pmap resolves "Faluja" to {x, y, z}, a set of equivalent replicas (a RACS within the RAPS); here, y gets picked, perhaps based on load. A sketch of this lookup step follows below.]
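To make the routing step concrete, here is a minimal C# sketch: a partition map ("pmap") resolves a keyword to a set of equivalent replicas, and one replica is picked based on load. All class and member names (Replica, PartitionMap, Route) are illustrative assumptions, not the QuickSilver API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch only: a pmap resolves a key such as "Faluja" to a set of
// equivalent replicas (a RACS), then routes to the least-loaded one.
class Replica
{
    public string Name;      // e.g. "x", "y", "z"
    public double Load;      // current load estimate, fed by monitoring
}

class PartitionMap
{
    private readonly Dictionary<string, List<Replica>> map = new Dictionary<string, List<Replica>>();

    public void Register(string key, params Replica[] replicas) => map[key] = replicas.ToList();

    // Resolve the key to its RACS, then pick the least-loaded equivalent replica.
    public Replica Route(string key)
    {
        var racs = map[key];
        return racs.OrderBy(r => r.Load).First();
    }
}

class PmapDemo
{
    static void Main()
    {
        var pmap = new PartitionMap();
        pmap.Register("Faluja",
            new Replica { Name = "x", Load = 0.7 },
            new Replica { Name = "y", Load = 0.2 },
            new Replica { Name = "z", Load = 0.9 });

        // A query for "Faluja SITREP ..." lands on replica y, the least loaded.
        Console.WriteLine(pmap.Route("Faluja").Name);
    }
}
```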
13
Multiple datacenters
Services are hosted at data centers but are accessible system-wide
[Diagram: query and update sources reach services at data centers A and B; a pmap gives the logical partitioning of services, and an l2P map takes logical services to a physical resource pool (the server pool), perhaps many to one]
One application can be a source of both queries and updates.
Operators can control the pmap, the l2P map, and other parameters. Large-scale multicast is used to disseminate updates.
14
Problems you must solve by hand
Membership
  Within RACS
  Of the service
  Services in data centers
Communication
  Multicast
  Streaming media
Resource management
  Pool of machines
  Set of services
  Subdivision into RACS
Fault-tolerance
Consistency
15
Replication
The unifying "concept" here?
  Replication within a clustered service
  "Notification" in publish-subscribe apps
  Replicated system configuration data
  Replication of streaming media
Existing platforms lack replication tools, or provide them only in small-scale forms
16
QuickSilver vision
We'll develop a new generation of solutions that
  Offers scalable replication at its core
  Is presented to the user through GIG/NCES interfaces (Web Services, CORBA)
  Is fast, stable, and self-managed, and self-repairs when disrupted
17
Core challenges
To solve our problem...
  Reduce the big challenge to smaller ones
  Tackle these using new conceptual tools
  Then integrate the solutions into a publish-subscribe platform
  And apply them to high-value scenarios
18
Milestones
[Timeline, 9/04 through 12/05, with parallel work streams:]
  Scalable reliable multicast (many receivers, "groups")
  Time-critical event notification
  Management and self-repair
  Streaming real-time media data
  Scalable content filtering
  Integrate into Core Platform
Phases: develop baselines and overall architecture; solve key subproblems; integrate into the platform; deliver to early users
19
Large scale makes it hard! We want...
  Reliability
  Performance: publish rates, latency, recovery time
  Scalability: # participants, # topics, subscription or failure rates
  Self-tuning
  Nice interfaces
[Diagram: many overlapping topic groups (A, B, C, AB, AC, BC, ABC), each with on the order of x100 members]
Structured solution:
  Detecting regularities
  Introducing some structure
  Sophisticated methods
  Re-adjusting dynamically
20
Techniques
• Detecting overlap patterns
• IP multicast
• Buffering
• Aggregation, routing
• Gossip (structured)
• Receivers forwarding data
• Flow control
• Reconfiguring upon failure
• Self-monitoring
• Reconfiguring for speed-up
• Modular structure
• Reusable hot-plug modules
[Diagram: overlapping groups A, B, C, AB, AC, BC, ABC]
The system
• ~65,000 lines in C#
• Modular architecture
• Testing on a cluster
21
Drill down: How will we do it?
Combine scalable multicast...
  Uses peer-to-peer gossip to enhance the reliability of a scalable multicast protocol
  Achieves dramatic scalability improvements
... with a scalable "groups" framework
  Uses gossip to take many costly aspects of group management "offline"
  Slashes the costs of huge numbers of groups!
22
Reliable multicast is too "fragile"
[Diagram: most members are healthy... but one is slow]
23
Performance drops with scale
[Graph: virtually synchronous Ensemble multicast protocols; average throughput on non-perturbed members vs. perturb rate, for group sizes 32, 64, 96, and 128]
24
Gossip 101
Suppose that I know something
  I'm sitting next to Fred, and I tell him: now 2 of us "know"
  Later, he tells Mimi and I tell Anne: now 4
This is an example of a push epidemic
Push-pull occurs if we exchange data (see the sketch below)
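A minimal C# sketch of a push epidemic, under simplifying assumptions (synchronous rounds, fanout of one, uniform random peer choice). It is only meant to make the "rounds until everyone knows" behavior concrete, not to mirror QuickSilver's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Push epidemic sketch: each round, every node that knows the rumor pushes it
// to one peer chosen uniformly at random.
class PushGossip
{
    static void Main()
    {
        int n = 64;
        var rng = new Random(1);
        var knows = new bool[n];
        knows[0] = true;                         // "I know something"

        int rounds = 0;
        while (knows.Count(k => k) < n)
        {
            // Snapshot the informed set, then let each informed node tell one random peer.
            foreach (int i in Enumerable.Range(0, n).Where(i => knows[i]).ToList())
                knows[rng.Next(n)] = true;
            rounds++;
        }
        Console.WriteLine($"All {n} nodes informed after {rounds} push rounds");
    }
}
```

Rerunning this with larger n shows the expected logarithmic growth in the number of rounds, which is the scaling property the next slide relies on.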
25
Gossip scales very nicely
  Participants' loads are independent of system size
  Network load is linear in system size
  Information spreads in log(system size) time
[Graph: % of nodes infected (0.0 to 1.0) vs. time]
26
Gossip in distributed systems
We can gossip about membership
  We need a bootstrap mechanism, but can then discuss failures and new members
We can gossip to repair faults in replicated data
  "I have 6 updates from Charlie"
If we aren't in a hurry, we can gossip to replicate data too
27
Bimodal Multicast (ACM TOCS 1999)
Send multicasts to report events; some messages don't get through.
Periodically, but not synchronously, gossip about messages. A typical exchange:
  "The gossip source has a message from Mimi that I'm missing, and he seems to be missing two messages from Charlie that I have."
  "Here are some messages from Charlie that might interest you. Could you send me a copy of Mimi's 7th message?"
  "Mimi's 7th message was 'The meeting of our Q exam study group will start late on Wednesday...'"
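The repair exchange can be sketched roughly as follows, assuming each node keeps a per-sender history of sequence numbers and that a gossip round is a push-pull comparison of those histories. Class names and the message handling are hypothetical simplifications of what pbcast actually does (real digests are compact, and repairs are rate-limited).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified gossip repair: compare per-sender histories and fill gaps both ways.
class Node
{
    public string Name;
    // sender -> sequence numbers received (unreliable multicast may have dropped some)
    public Dictionary<string, HashSet<int>> History = new Dictionary<string, HashSet<int>>();

    public void Receive(string sender, int seq)
    {
        if (!History.TryGetValue(sender, out var s)) History[sender] = s = new HashSet<int>();
        s.Add(seq);
    }

    // One push-pull gossip exchange: each side copies whatever the other is missing.
    public void GossipWith(Node peer)
    {
        foreach (var sender in History.Keys.Union(peer.History.Keys).ToList())
        {
            var mine = History.TryGetValue(sender, out var a) ? a : (History[sender] = new HashSet<int>());
            var theirs = peer.History.TryGetValue(sender, out var b) ? b : (peer.History[sender] = new HashSet<int>());
            mine.UnionWith(theirs);   // "Could you send me a copy of Mimi's 7th message?"
            theirs.UnionWith(mine);   // "Here are some messages from Charlie..."
        }
    }
}

class PbcastRepairDemo
{
    static void Main()
    {
        var p = new Node { Name = "p" };
        var q = new Node { Name = "q" };
        p.Receive("Charlie", 1); p.Receive("Charlie", 2);   // q missed these
        q.Receive("Mimi", 7);                               // p missed Mimi's 7th

        p.GossipWith(q);
        Console.WriteLine(string.Join(",", p.History["Mimi"]));                       // 7
        Console.WriteLine(string.Join(",", q.History["Charlie"].OrderBy(x => x)));    // 1,2
    }
}
```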
28
Bimodal multicast in the baseline scenario
[Graphs: low-bandwidth and high-bandwidth comparisons of pbcast performance at faulty and correct hosts; average throughput vs. perturb rate for traditional multicast and pbcast, measured at perturbed and unperturbed hosts]
Bimodal multicast scales well; baseline multicast throughput collapses under stress
29
Bimodal Multicast Summary
Imposes a constant overhead on participants
  Many optimizations and tricks are needed, but nothing that isn't practical to implement
  The hardest issues involve "biased" gossip to handle LANs connected by WAN long-haul links
Reliability is easy to analyze mathematically using epidemic theory
  We use the theory to derive optimal parameter settings
  The theory also lets us predict behavior
  Despite the simplified model, the predictions work!
30
So we have part of our solution
To multicast in many groups:
  Map down to IP multicast in popular overlap regions (see the sketch below)
  Multicast unreliably
Then, in the background:
  Use gossip to repair omissions
  Also use it for flow control (rate based) and surge handling (deals with bursty traffic)
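The "map down to IP multicast in popular overlap regions" step can be illustrated with a toy sketch that groups topics by identical subscriber sets and assigns one multicast address per distinct set. Real overlap detection in QuickSilver is more sophisticated than exact matching, and all names and addresses below are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy overlap detection: topics whose subscriber sets coincide share one IP
// multicast group, so popular overlap regions get native multicast and gossip
// only has to repair the leftovers.
class OverlapRegions
{
    static void Main()
    {
        var subscriptions = new Dictionary<string, string[]>
        {
            ["topic-A"] = new[] { "n1", "n2", "n3" },
            ["topic-B"] = new[] { "n1", "n2", "n3" },   // same members as A -> same region
            ["topic-C"] = new[] { "n2", "n4" },
        };

        // Canonicalize each subscriber set; one multicast region per distinct set.
        var regions = subscriptions
            .GroupBy(kv => string.Join(",", kv.Value.OrderBy(m => m)))
            .Select((g, i) => new { Region = $"239.1.0.{i + 1}", Topics = g.Select(kv => kv.Key) });

        foreach (var r in regions)
            Console.WriteLine($"{r.Region}: {string.Join(" ", r.Topics)}");
        // 239.1.0.1: topic-A topic-B
        // 239.1.0.2: topic-C
    }
}
```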
31
Techniques
• Detecting overlap patterns
• IP multicast
• Buffering
• Aggregation, routing
• Gossip (structured)
• Receivers forwarding data
• Flow control
• Reconfiguring upon failure
• Self-monitoring
• Reconfiguring for speed-up
• Modular structure
• Reusable hot-plug modules
[Diagram: overlapping groups A, B, C, AB, AC, BC, ABC]
The system
• ~65,000 lines in C#
• Modular architecture
• Testing on a cluster
32
Other components of QuickSilver?
Astrolabe: developed during the seedling
  A hierarchical distributed database
  It also uses gossip...
  ... and is used for self-organizing, scalable, robust distributed management and control
Slingshot: uses FEC for low-latency, time-critical event notification
ChunkySpread: focus is on streaming media
Event Filter: rapidly scans the event stream to identify relevant data
33
State Merge: Core of the Astrolabe epidemic

Copy at swift.cs.cornell.edu:
Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2011  2.0   0          1      6.2
falcon    1976  2.7   1          0      4.1
cardinal  2201  3.5   1          1      6.0

Copy at cardinal.cs.cornell.edu:
Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2011  2.0   0          1      6.2
falcon    1971  1.5   1          0      4.1
cardinal  2201  3.5   1          0      6.0
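A rough sketch of the merge rule, under the usual assumption that, for each node, the row with the larger Time field wins when two agents gossip. Field and method names are illustrative, not Astrolabe's actual interfaces, and only two of the columns are carried through.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Astrolabe-style state merge: per node name, keep the freshest row.
class Row
{
    public string Name; public int Time; public double Load;
    public override string ToString() => $"{Name} t={Time} load={Load}";
}

class StateMerge
{
    static Dictionary<string, Row> Merge(Dictionary<string, Row> a, Dictionary<string, Row> b) =>
        a.Values.Concat(b.Values)
         .GroupBy(r => r.Name)
         .ToDictionary(g => g.Key, g => g.OrderByDescending(r => r.Time).First());

    static void Main()
    {
        var atSwift = new[] {
            new Row { Name = "falcon",   Time = 1976, Load = 2.7 },
            new Row { Name = "cardinal", Time = 2201, Load = 3.5 },
        }.ToDictionary(r => r.Name);
        var atCardinal = new[] {
            new Row { Name = "falcon",   Time = 1971, Load = 1.5 },   // stale copy
            new Row { Name = "cardinal", Time = 2201, Load = 3.5 },
        }.ToDictionary(r => r.Name);

        foreach (var r in Merge(atSwift, atCardinal).Values)
            Console.WriteLine(r);   // the falcon row with t=1976 survives; t=1971 is replaced
    }
}
```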
34
Scaling up… and up…
With a stack of domains, we don't want every system to "see" every domain
  The cost would be huge
So instead, we'll see a summary
[Diagram: every host (e.g., cardinal.cs.cornell.edu) holds a replica of the same domain table:]
Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2011  2.0   0          1      6.2
falcon    1976  2.7   1          0      4.1
cardinal  2201  3.5   1          1      6.0
35
Build a hierarchy using a P2P protocol that "assembles the puzzle" without any servers

San Francisco:
Name      Load  Weblogic?  SMTP?  Word Version  ...
swift     2.0   0          1      6.2
falcon    1.5   1          0      4.1
cardinal  4.5   1          0      6.0

New Jersey:
Name      Load  Weblogic?  SMTP?  Word Version  ...
gazelle   1.7   0          0      4.5
zebra     3.2   0          1      6.2
gnu       .5    1          0      6.2

Summary level (an SQL query "summarizes" the data; the dynamically changing query output is visible system-wide):
Name   Avg Load  WL contact    SMTP contact
SF     2.6       123.45.61.3   123.45.61.17
NJ     1.8       127.16.77.6   127.16.77.11
Paris  3.1       14.66.71.8    14.66.71.12
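A sketch of what the per-zone summarization amounts to, assuming an AVG-style aggregate over the child rows shown above. The mapping of hosts to contact addresses is an illustrative guess (the slide does not say which child owns which address), and the contact-selection rule here, "first child that runs the service", is only an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Zone summary sketch: the parent row is recomputed from the children by an
// aggregation query, roughly "SELECT AVG(Load), ... FROM children".
class Child { public string Name; public double Load; public bool Smtp; public string Addr; }

class ZoneSummary
{
    static void Main()
    {
        var newJersey = new List<Child> {
            new Child { Name = "gazelle", Load = 1.7, Smtp = false, Addr = "127.16.77.6"  },
            new Child { Name = "zebra",   Load = 3.2, Smtp = true,  Addr = "127.16.77.11" },
            new Child { Name = "gnu",     Load = 0.5, Smtp = false, Addr = "127.16.77.12" },
        };

        double avgLoad = newJersey.Average(c => c.Load);            // 1.8, matching the NJ row
        string smtpContact = newJersey.First(c => c.Smtp).Addr;     // an SMTP-capable representative

        // This one row is all that hosts outside the zone ever see of New Jersey.
        Console.WriteLine($"NJ  AvgLoad={avgLoad:F1}  SMTP contact={smtpContact}");
    }
}
```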
36
Astrolabe "compared" to multicast
Both use gossip in similar ways
But here, data comes from all nodes in the system, not just a few sources
  Rates are low... hence overhead is low...
  ... but invaluable when orchestrating adaptation and self-repair
Astrolabe is extremely robust to disruption
  The hierarchy is self-constructed and self-healing
37
Remaining time: 2 baselines
The first focuses on the latency of real-time event notification
The second on the speed of event filtering
Both involve key elements of QuickSilver, and both are easy to compare with the prior state of the art
38
Slingshot
A time-critical event notification protocol
Idea: probabilistic real-time goals
  Pay a higher overhead but reduce the frequency of missed deadlines
Already yielding multiple-order-of-magnitude improvements in latency and throughput!
39
Redefining Time-Critical
Probabilistic guarantees: with x% overhead, y% of the data is delivered within t seconds.
Data "expires": stock quotes, location updates
Urgency-sensitive: new data is prioritized over old
The application runs in COTS settings, co-existing with other, non-time-critical applications on the same machine
40
Time-Critical Eventing
Eventing: publishers publish events to topics, which are then received by subscribers
Applications are characterized by many-to-many flows of small, discrete units of data
Scalability dimensions:
  number of topics
  numbers of publishers and subscribers per topic
  degree of subscription overlap
41
Slingshot: Receiver-Based FEC
Topics are mapped to multicast groups
Publishers multicast events unreliably
Subscribers constantly exchange error-correction packets for message history suffixes (a sketch follows below)
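A toy sketch of receiver-based error correction, assuming the simplest possible repair packet: an XOR over a history suffix, so that a peer missing exactly one packet in that window can rebuild it locally instead of going back to the sender. Slingshot's actual encoding and exchange policy are more elaborate; this only illustrates the principle.

```csharp
using System;
using System.Linq;

// Receiver-based FEC sketch: one XOR repair packet over the last k payloads.
class XorRepair
{
    static byte[] Xor(byte[] a, byte[] b) =>
        a.Zip(b, (x, y) => (byte)(x ^ y)).ToArray();

    static void Main()
    {
        byte[] m1 = { 1, 2, 3 }, m2 = { 4, 5, 6 }, m3 = { 7, 8, 9 };

        // A healthy subscriber builds one repair packet over its history suffix {m1, m2, m3}.
        byte[] repair = Xor(Xor(m1, m2), m3);

        // A peer that received m1 and m3 but lost m2 recovers it from the repair packet.
        byte[] recovered = Xor(Xor(repair, m1), m3);
        Console.WriteLine(string.Join(",", recovered));   // 4,5,6
    }
}
```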
42
Slingshot: Tunable Reliability
[Graph: overhead vs. performance; fraction of lost packets recovered (reliability) and average recovery/discovery time (in RTTs), plotted against the ratio of overhead]
43
Slingshot: Scalability in Topics
[Graph: recovery time (in RTTs) vs. degree of membership, comparing the multi-group scheme with a naïve per-group approach]
44
A second baseline
Scalable stateful content filtering (Gehrke)
  Arises when deciding which events to deliver to the client system
  Usually pub-sub is "coarse grained", and content filtering is then applied
  A chance to apply security policy and prune unwanted data... but it can be slow
45
Model and problem statement
Model:
  An event is a set of (attribute, value) pairs
    Example: an event notifying the location of a vehicle: {(Type, "Tank"), (Latitude, 10), (Longitude, 25)}
  A subscription is a set of predicates on event attributes (conjunctive semantics)
    Example: a subscription looking for tanks in the area: {(Type = "Tank"), (8 < Latitude < 12)}
    Equality and range predicates
Problem:
  Given: a (large) set of subscriptions, S, and a stream of events, E
  Find: for each event e in E, determine the set of subscriptions whose predicates are satisfied by e
Scalability:
  With the event rate
  With the number of subscriptions
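A minimal sketch of the stateless matching model: events as attribute/value sets and subscriptions as conjunctions of equality and range predicates. The names are illustrative, and the brute-force scan over subscriptions is only there to fix the semantics; scalable filtering must index the subscriptions rather than test them one by one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stateless content filtering sketch: a subscription matches an event when all
// of its predicates (equality and range) hold on the event's attributes.
class Sub
{
    public string Id;
    public List<Func<IDictionary<string, object>, bool>> Predicates = new List<Func<IDictionary<string, object>, bool>>();
    public bool Matches(IDictionary<string, object> e) => Predicates.All(p => p(e));
}

class Matcher
{
    static void Main()
    {
        var e = new Dictionary<string, object> {
            ["Type"] = "Tank", ["Latitude"] = 10.0, ["Longitude"] = 25.0
        };

        var tanksInArea = new Sub {
            Id = "tanks-in-area",
            Predicates = {
                ev => (string)ev["Type"] == "Tank",                               // equality predicate
                ev => (double)ev["Latitude"] > 8 && (double)ev["Latitude"] < 12   // range predicate
            }
        };

        var subs = new[] { tanksInArea };      // in practice, a very large set S
        foreach (var s in subs.Where(s => s.Matches(e)))
            Console.WriteLine($"deliver to {s.Id}");
    }
}
```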
46
What About State?
Model:
  An event is a set of (attribute, value) pairs
    Example: an event notifying the location of a vehicle: {(Type, "Tank"), (Latitude, 10), (Longitude, 25)}
  A subscription is a query over sequences of events
    Example: a subscription looking for adversaries with suspicious behavior: "Notify me if the enemy first visits location A and then location B"
  Subscriptions need to maintain state across events
Problem:
  Given: a (large) set of stateful subscriptions, S, and a stream of events, E
  Find: for each event e in E, determine the set of subscriptions whose predicates are satisfied by e
47
Managing State
Use a linear finite state automaton with self-loops to encapsulate the state (sketched below)
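A sketch of one stateful subscription expressed as a linear automaton with self-loops: the subscription "enemy visits A and then B" advances one stage per matching event and simply stays put (self-loop) on anything else, so irrelevant events never reset it. Names, the single "location" attribute, and the overall structure are illustrative assumptions, not the actual filtering engine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Linear automaton with self-loops for a sequence subscription.
class SequenceSub
{
    private readonly List<Func<string, bool>> stages;
    private int state;                     // index of the next stage to satisfy

    public SequenceSub(params Func<string, bool>[] stages) => this.stages = stages.ToList();

    // Feed one event; returns true once the final stage has been reached.
    public bool Advance(string location)
    {
        if (state < stages.Count && stages[state](location))
            state++;                       // transition to the next stage
        // otherwise: self-loop, keep waiting without resetting
        return state == stages.Count;
    }
}

class StatefulDemo
{
    static void Main()
    {
        var sub = new SequenceSub(loc => loc == "A", loc => loc == "B");
        foreach (var loc in new[] { "C", "A", "D", "B" })
            if (sub.Advance(loc))
                Console.WriteLine($"notify: enemy visited A and then B (triggered at {loc})");
    }
}
```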
48
Baseline System Architecture
[Diagram: app server]
49
Experimental Results (Y axis in log scale!)
50
Putting it all together
[Timeline, 9/04 through 12/05, with parallel work streams:]
  Scalable reliable multicast (many receivers, "groups")
  Time-critical event notification
  Management and self-repair
  Streaming real-time media data
  Scalable content filtering
  Integrate into Core Platform
Phases: develop baselines and overall architecture; solve key subproblems; integrate into the platform; deliver to early users
51
Will QuickSilver solve our problem?
[Diagram, as before: query and update sources reach services hosted at data centers A and B but accessible system-wide; a pmap gives the logical partitioning of services, and an l2P map takes logical services to the physical server pool, perhaps many to one. One application can be a source of both queries and updates; operators can control the pmap, the l2P map, and other parameters, and large-scale multicast is used to disseminate updates.]
Annotations on the diagram:
  Scalable multicast is used to update system-wide parameters and management controls
  Within and between groups, we need stronger reliability properties and higher speeds; groups are smaller, but there are many of them
  We need a way to monitor and manage the collection of services in our data center: a good match to Astrolabe
  We need a way to monitor and manage the machines in the server pool... another good match to Astrolabe
  We're exploring the limits beyond which a strong (non-probabilistic) replication scheme is needed in clustered services; QuickSilver will support virtual synchrony too
52
DoD “Typical” Baseline Data - 1
According to a study by the Congressional Budget Office for the Department of the Army in 2003, bandwidth demands for the Army alone will exceed bandwidth supply by a factor of between 10:1 and 30:1 by the year 2010.
The Army’s Bandwidth Bottleneck, A CBO Report, August 2003, http://www.cbo.gov/ftpdoc.cfm?index=4500&type=1
The growth rates, data volumes, and characterization of networked transactions described in a DCGS Block 10.2 Navy study are consistent with the CBO study. In many cases the DCGS-N study predicts earlier bandwidth saturation, given the disparate rates of growth in total network capacity compared to the technological innovation that will necessarily increase demand.
Throughput requirements of 3-10 Mbps for imagery data and 200 Kbps-1 Mbps for other forms (see next slide)
53
DoD “Typical” Baseline Data - 2
Input               "Typical" DoD Scenario
System size         100s-1000s of nodes
System topology     Hierarchical networks with "bridges" using LAN/WAN, SATCOM, and wireless RF (LOS & BLOS)
Event type          Multiple: Situational Awareness updates (Binary/Text/XML); Plans and Reports (Text); Imagery
Event rate          Multiple SA: 100s/sec (size = 1 KB per entity); Plans & Reports: aperiodic/sporadic (size = 10 KB); Imagery: aperiodic/continuous (size = 50 MB)
Perturbation rate

Most of this sort of data is short-lived yet requires processing in a time-valued ordering scheme.
54
DoD Challenges for SRS (2010-2025) – Network Oriented
Granular, Scalable Redundancy: USN FORCEnet
Source: NETWARCOM Official FORCEnet World Wide Web site http://forcenet.navy.mil/fnep/FnEP_Brief.zip
55
DoD Challenges for SRS (2010-2025) – Network Oriented
Granular, Scalable Redundancy: Ground Sensor Netting
Source: Raytheon Company © 2004 Raytheon Company. All Rights Reserved. Unpublished Work