Using Self-Regenerative Tools to Tackle Challenges of Scale
Ken Birman, QuickSilver Project, Cornell University
2
The QuickSilver team
At Cornell:
  Birman/van Renesse: Core platform
  Gehrke: Content filtering technology
  Francis: Streaming content delivery
At Raytheon:
  DiPalma/Work: Military scenarios
With help from: the AFRL JBI team in Rome, NY
3
Technical Overview
Objective: Overcome communications challenges that plague (and limit) current GIG/NCES platforms
  For example, dramatically improve time-critical event delivery delays and the speed of the event filtering layer
  Do this even when sustaining damage or when under attack
Existing COTS publish-subscribe technology, particularly in Web Services (SOA) platforms:
  Not designed for these challenging settings. Scales poorly. Very expensive to own and operate, and easily disabled
  Forces the military to "hack around" limitations, or else major projects can stumble badly (the Navy's CEC effort is an example)
  Problem identified by the Air Force JBI team in Rome, NY
  Also a major concern for companies like Google and Amazon
  Charles Holland, Assistant DDR&E, highlighted the topic as a top priority
4
Technical Overview
The GIG/NCES vision centers on reliable communication protocols, like publish-subscribe.
  The underlying protocols are old... they hit their limits 15 years ago!
  Faster hardware has helped... but only a little
Peer-to-peer epidemic protocols ("gossip") have never been applied in such systems
  We're fusing these with more conventional protocols, and achieving substantial improvements
  This also makes our system robust and self-repairing
Existing systems take an all-or-nothing approach to reliability. Under stress, we often get nothing.
  Probabilistic guarantees enable better solutions
  But we need provable guarantees of quality
5
Major risks, mitigation
Building a big platform fast... despite profound technical hurdles.
  But we are not constrained by an existing product to sell.
  We have already demonstrated some solutions in the SRS seedling
Users demand standards.
  We're extending the Web Services architecture and tools
Focus on the real needs of military users.
  We work closely with AFRL (JBI) and Raytheon (Navy)
What about the baseline (scenario II) and quantitative metrics and goals?
  Deferred until the last 15 minutes of the talk.
6
Expected major achievement?
QuickSilver will represent a breakthrough technology for building new GIG/NCES applications
  ... applications that operate reliably even under stress that cripples existing COTS solutions
  ... that need far less hand-holding from the application developer, deployment team, and systems administrator (saving money!)
  ... and that enable powerful new information-enabled applications for uses like self-managed sensor networks, new real-time information tools for urban warfare, and control and exploitation of autonomous vehicles
We'll enable the military to take GIG concepts into domains where commercial products just can't go!
7
Our topic: GIG and NCES platforms
Military computing systems are growing
  ... larger,
  ... and more complex,
  ... and must operate "unattended"
With existing technology they
  ... are far too expensive to develop
  ... require much too much time to deploy
  ... are insecure and too easily disrupted
QuickSilver: brings SRS concepts to the table
8
How are big systems structured?
Typically a "data center" of web servers
  Some human-generated traffic
  Some automatic traffic from WS clients
Front-end servers connected to a pool of back-end application "services" (new applications on clusters, plus wrapped legacy applications)
Publish-subscribe is very popular
Sensor networks have similarities, although they lack this data center "focus"
9
GIG/NCES (and SOA) vision
Pub-sub combined with point-to-point communication technologies like TCP
[Diagram: clients connect via "front-end interface systems" to a pool of load-balanced (LB) services; legacy applications are exposed through wrappers]
10
Big sensor networks?
QuickSilver will also be useful in, e.g., sensor networks
We're focused on a fixed mesh of sensors using wireless ad-hoc communication, mobile query sources, and QuickSilver as the middleware
11
How to build big systems today?
The programmer is on his own!
  Expected to use GIG/NCES standards, based on Service Oriented Architectures (SOAs)
  No support for this architecture as a whole
  The focus is on isolated aspects, like legacy wrappers
Existing SOAs focus on
  A single client and a single server
  No attention to performance, stability, or scale
  The structure of the data center is overlooked!
The result is high costs and lower-quality solutions
12
Drill down: An example
Many services (not all) will be RAPS of RACS
  RAPS: A reliable array of partitioned services
  RACS: A reliable array of cluster-structured server processes
[Diagram: General Pershing searches for "Faluja SITREP 11-22-04 0900h". The pmap resolves "Faluja" to {x, y, z}, a set of equivalent replicas (a RACS within the RAPS); here, y gets picked, perhaps based on load. A sketch of this lookup step follows below.]
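To make the routing step concrete, here is a minimal C# sketch: a partition map ("pmap") resolves a keyword to a set of equivalent replicas, and one replica is picked based on load. All class and member names (Replica, PartitionMap, Route) are illustrative assumptions, not the QuickSilver API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch only: a pmap resolves a key such as "Faluja" to a set of
// equivalent replicas (a RACS), then routes to the least-loaded one.
class Replica
{
    public string Name;      // e.g. "x", "y", "z"
    public double Load;      // current load estimate, fed by monitoring
}

class PartitionMap
{
    private readonly Dictionary<string, List<Replica>> map = new Dictionary<string, List<Replica>>();

    public void Register(string key, params Replica[] replicas) => map[key] = replicas.ToList();

    // Resolve the key to its RACS, then pick the least-loaded equivalent replica.
    public Replica Route(string key)
    {
        var racs = map[key];
        return racs.OrderBy(r => r.Load).First();
    }
}

class PmapDemo
{
    static void Main()
    {
        var pmap = new PartitionMap();
        pmap.Register("Faluja",
            new Replica { Name = "x", Load = 0.7 },
            new Replica { Name = "y", Load = 0.2 },
            new Replica { Name = "z", Load = 0.9 });

        // A query for "Faluja SITREP ..." lands on replica y, the least loaded.
        Console.WriteLine(pmap.Route("Faluja").Name);
    }
}
```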
13
Multiple datacenters
Services are hosted at data centers but are accessible system-wide
[Diagram: query and update sources reach services at data centers A and B; a pmap gives the logical partitioning of services, and an l2P map takes logical services to a physical resource pool (the server pool), perhaps many to one]
One application can be a source of both queries and updates.
Operators can control the pmap, the l2P map, and other parameters. Large-scale multicast is used to disseminate updates.
14
Problems you must solve by hand
Membership
  Within RACS
  Of the service
  Services in data centers
Communication
  Multicast
  Streaming media
Resource management
  Pool of machines
  Set of services
  Subdivision into RACS
Fault-tolerance
Consistency
15
Replication
The unifying "concept" here?
  Replication within a clustered service
  "Notification" in publish-subscribe apps
  Replicated system configuration data
  Replication of streaming media
Existing platforms lack replication tools, or provide them only in small-scale forms
16
QuickSilver vision
We'll develop a new generation of solutions that
  Offers scalable replication at its core
  Is presented to the user through GIG/NCES interfaces (Web Services, CORBA)
  Is fast, stable, and self-managed, and self-repairs when disrupted
17
Core challenges
To solve our problem...
  Reduce the big challenge to smaller ones
  Tackle these using new conceptual tools
  Then integrate the solutions into a publish-subscribe platform
  And apply them to high-value scenarios
18
Milestones
[Timeline, 9/04 through 12/05, with parallel work streams:]
  Scalable reliable multicast (many receivers, "groups")
  Time-critical event notification
  Management and self-repair
  Streaming real-time media data
  Scalable content filtering
  Integrate into Core Platform
Phases: develop baselines and overall architecture; solve key subproblems; integrate into the platform; deliver to early users
19
Large scale makes it hard! We want...
  Reliability
  Performance: publish rates, latency, recovery time
  Scalability: # participants, # topics, subscription or failure rates
  Self-tuning
  Nice interfaces
[Diagram: many overlapping topic groups (A, B, C, AB, AC, BC, ABC), each with on the order of x100 members]
Structured solution:
  Detecting regularities
  Introducing some structure
  Sophisticated methods
  Re-adjusting dynamically
20
Techniques
• Detecting overlap patterns
• IP multicast
• Buffering
• Aggregation, routing
• Gossip (structured)
• Receivers forwarding data
• Flow control
• Reconfiguring upon failure
• Self-monitoring
• Reconfiguring for speed-up
• Modular structure
• Reusable hot-plug modules
[Diagram: overlapping groups A, B, C, AB, AC, BC, ABC]
The system
• ~65,000 lines in C#
• Modular architecture
• Testing on a cluster
21
Drill down: How will we do it?
Combine scalable multicast...
  Uses peer-to-peer gossip to enhance the reliability of a scalable multicast protocol
  Achieves dramatic scalability improvements
... with a scalable "groups" framework
  Uses gossip to take many costly aspects of group management "offline"
  Slashes the costs of huge numbers of groups!
22
Reliable multicast is too "fragile"
[Diagram: most members are healthy... but one is slow]
23
Performance drops with scale
[Graph: virtually synchronous Ensemble multicast protocols; average throughput on non-perturbed members vs. perturb rate, for group sizes 32, 64, 96, and 128]
24
Gossip 101
Suppose that I know something
  I'm sitting next to Fred, and I tell him: now 2 of us "know"
  Later, he tells Mimi and I tell Anne: now 4
This is an example of a push epidemic
Push-pull occurs if we exchange data (see the sketch below)
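A minimal C# sketch of a push epidemic, under simplifying assumptions (synchronous rounds, fanout of one, uniform random peer choice). It is only meant to make the "rounds until everyone knows" behavior concrete, not to mirror QuickSilver's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Push epidemic sketch: each round, every node that knows the rumor pushes it
// to one peer chosen uniformly at random.
class PushGossip
{
    static void Main()
    {
        int n = 64;
        var rng = new Random(1);
        var knows = new bool[n];
        knows[0] = true;                         // "I know something"

        int rounds = 0;
        while (knows.Count(k => k) < n)
        {
            // Snapshot the informed set, then let each informed node tell one random peer.
            foreach (int i in Enumerable.Range(0, n).Where(i => knows[i]).ToList())
                knows[rng.Next(n)] = true;
            rounds++;
        }
        Console.WriteLine($"All {n} nodes informed after {rounds} push rounds");
    }
}
```

Rerunning this with larger n shows the expected logarithmic growth in the number of rounds, which is the scaling property the next slide relies on.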
25
Gossip scales very nicely
  Participants' loads are independent of system size
  Network load is linear in system size
  Information spreads in log(system size) time
[Graph: % of nodes infected (0.0 to 1.0) vs. time]
26
Gossip in distributed systems
We can gossip about membership
  We need a bootstrap mechanism, but can then discuss failures and new members
We can gossip to repair faults in replicated data
  "I have 6 updates from Charlie"
If we aren't in a hurry, we can gossip to replicate data too
27
Bimodal Multicast (ACM TOCS 1999)
Send multicasts to report events; some messages don't get through.
Periodically, but not synchronously, gossip about messages. A typical exchange:
  "The gossip source has a message from Mimi that I'm missing, and he seems to be missing two messages from Charlie that I have."
  "Here are some messages from Charlie that might interest you. Could you send me a copy of Mimi's 7th message?"
  "Mimi's 7th message was 'The meeting of our Q exam study group will start late on Wednesday...'"
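The repair exchange can be sketched roughly as follows, assuming each node keeps a per-sender history of sequence numbers and that a gossip round is a push-pull comparison of those histories. Class names and the message handling are hypothetical simplifications of what pbcast actually does (real digests are compact, and repairs are rate-limited).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified gossip repair: compare per-sender histories and fill gaps both ways.
class Node
{
    public string Name;
    // sender -> sequence numbers received (unreliable multicast may have dropped some)
    public Dictionary<string, HashSet<int>> History = new Dictionary<string, HashSet<int>>();

    public void Receive(string sender, int seq)
    {
        if (!History.TryGetValue(sender, out var s)) History[sender] = s = new HashSet<int>();
        s.Add(seq);
    }

    // One push-pull gossip exchange: each side copies whatever the other is missing.
    public void GossipWith(Node peer)
    {
        foreach (var sender in History.Keys.Union(peer.History.Keys).ToList())
        {
            var mine = History.TryGetValue(sender, out var a) ? a : (History[sender] = new HashSet<int>());
            var theirs = peer.History.TryGetValue(sender, out var b) ? b : (peer.History[sender] = new HashSet<int>());
            mine.UnionWith(theirs);   // "Could you send me a copy of Mimi's 7th message?"
            theirs.UnionWith(mine);   // "Here are some messages from Charlie..."
        }
    }
}

class PbcastRepairDemo
{
    static void Main()
    {
        var p = new Node { Name = "p" };
        var q = new Node { Name = "q" };
        p.Receive("Charlie", 1); p.Receive("Charlie", 2);   // q missed these
        q.Receive("Mimi", 7);                               // p missed Mimi's 7th

        p.GossipWith(q);
        Console.WriteLine(string.Join(",", p.History["Mimi"]));                       // 7
        Console.WriteLine(string.Join(",", q.History["Charlie"].OrderBy(x => x)));    // 1,2
    }
}
```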
28
Bimodal multicast in the baseline scenario
[Graphs: low-bandwidth and high-bandwidth comparisons of pbcast performance at faulty and correct hosts; average throughput vs. perturb rate for traditional multicast and pbcast, measured at perturbed and unperturbed hosts]
Bimodal multicast scales well; baseline multicast throughput collapses under stress
29
Bimodal Multicast Summary
Imposes a constant overhead on participants
  Many optimizations and tricks are needed, but nothing that isn't practical to implement
  The hardest issues involve "biased" gossip to handle LANs connected by WAN long-haul links
Reliability is easy to analyze mathematically using epidemic theory
  We use the theory to derive optimal parameter settings
  The theory also lets us predict behavior
  Despite the simplified model, the predictions work!
30
So we have part of our solution
To multicast in many groups:
  Map down to IP multicast in popular overlap regions (see the sketch below)
  Multicast unreliably
Then, in the background:
  Use gossip to repair omissions
  Also use it for flow control (rate based) and surge handling (deals with bursty traffic)
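The "map down to IP multicast in popular overlap regions" step can be illustrated with a toy sketch that groups topics by identical subscriber sets and assigns one multicast address per distinct set. Real overlap detection in QuickSilver is more sophisticated than exact matching, and all names and addresses below are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy overlap detection: topics whose subscriber sets coincide share one IP
// multicast group, so popular overlap regions get native multicast and gossip
// only has to repair the leftovers.
class OverlapRegions
{
    static void Main()
    {
        var subscriptions = new Dictionary<string, string[]>
        {
            ["topic-A"] = new[] { "n1", "n2", "n3" },
            ["topic-B"] = new[] { "n1", "n2", "n3" },   // same members as A -> same region
            ["topic-C"] = new[] { "n2", "n4" },
        };

        // Canonicalize each subscriber set; one multicast region per distinct set.
        var regions = subscriptions
            .GroupBy(kv => string.Join(",", kv.Value.OrderBy(m => m)))
            .Select((g, i) => new { Region = $"239.1.0.{i + 1}", Topics = g.Select(kv => kv.Key) });

        foreach (var r in regions)
            Console.WriteLine($"{r.Region}: {string.Join(" ", r.Topics)}");
        // 239.1.0.1: topic-A topic-B
        // 239.1.0.2: topic-C
    }
}
```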
31
Techniques
• Detecting overlap patterns
• IP multicast
• Buffering
• Aggregation, routing
• Gossip (structured)
• Receivers forwarding data
• Flow control
• Reconfiguring upon failure
• Self-monitoring
• Reconfiguring for speed-up
• Modular structure
• Reusable hot-plug modules
[Diagram: overlapping groups A, B, C, AB, AC, BC, ABC]
The system
• ~65,000 lines in C#
• Modular architecture
• Testing on a cluster
32
Other components of QuickSilver?
Astrolabe: developed during the seedling
  A hierarchical distributed database
  It also uses gossip...
  ... and is used for self-organizing, scalable, robust distributed management and control
Slingshot: uses FEC for low-latency, time-critical event notification
ChunkySpread: focus is on streaming media
Event Filter: rapidly scans the event stream to identify relevant data
33
State Merge: Core of the Astrolabe epidemic

Copy at swift.cs.cornell.edu:
Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2011  2.0   0          1      6.2
falcon    1976  2.7   1          0      4.1
cardinal  2201  3.5   1          1      6.0

Copy at cardinal.cs.cornell.edu:
Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2011  2.0   0          1      6.2
falcon    1971  1.5   1          0      4.1
cardinal  2201  3.5   1          0      6.0
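A rough sketch of the merge rule, under the usual assumption that, for each node, the row with the larger Time field wins when two agents gossip. Field and method names are illustrative, not Astrolabe's actual interfaces, and only two of the columns are carried through.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Astrolabe-style state merge: per node name, keep the freshest row.
class Row
{
    public string Name; public int Time; public double Load;
    public override string ToString() => $"{Name} t={Time} load={Load}";
}

class StateMerge
{
    static Dictionary<string, Row> Merge(Dictionary<string, Row> a, Dictionary<string, Row> b) =>
        a.Values.Concat(b.Values)
         .GroupBy(r => r.Name)
         .ToDictionary(g => g.Key, g => g.OrderByDescending(r => r.Time).First());

    static void Main()
    {
        var atSwift = new[] {
            new Row { Name = "falcon",   Time = 1976, Load = 2.7 },
            new Row { Name = "cardinal", Time = 2201, Load = 3.5 },
        }.ToDictionary(r => r.Name);
        var atCardinal = new[] {
            new Row { Name = "falcon",   Time = 1971, Load = 1.5 },   // stale copy
            new Row { Name = "cardinal", Time = 2201, Load = 3.5 },
        }.ToDictionary(r => r.Name);

        foreach (var r in Merge(atSwift, atCardinal).Values)
            Console.WriteLine(r);   // the falcon row with t=1976 survives; t=1971 is replaced
    }
}
```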
34
Scaling up… and up…
With a stack of domains, we don't want every system to "see" every domain
  The cost would be huge
So instead, we'll see a summary
[Diagram: every host (e.g., cardinal.cs.cornell.edu) holds a replica of the same domain table:]
Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2011  2.0   0          1      6.2
falcon    1976  2.7   1          0      4.1
cardinal  2201  3.5   1          1      6.0
35
Build a hierarchy using a P2P protocol that "assembles the puzzle" without any servers

San Francisco:
Name      Load  Weblogic?  SMTP?  Word Version  ...
swift     2.0   0          1      6.2
falcon    1.5   1          0      4.1
cardinal  4.5   1          0      6.0

New Jersey:
Name      Load  Weblogic?  SMTP?  Word Version  ...
gazelle   1.7   0          0      4.5
zebra     3.2   0          1      6.2
gnu       .5    1          0      6.2

Summary level (an SQL query "summarizes" the data; the dynamically changing query output is visible system-wide):
Name   Avg Load  WL contact    SMTP contact
SF     2.6       123.45.61.3   123.45.61.17
NJ     1.8       127.16.77.6   127.16.77.11
Paris  3.1       14.66.71.8    14.66.71.12
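A sketch of what the per-zone summarization amounts to, assuming an AVG-style aggregate over the child rows shown above. The mapping of hosts to contact addresses is an illustrative guess (the slide does not say which child owns which address), and the contact-selection rule here, "first child that runs the service", is only an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Zone summary sketch: the parent row is recomputed from the children by an
// aggregation query, roughly "SELECT AVG(Load), ... FROM children".
class Child { public string Name; public double Load; public bool Smtp; public string Addr; }

class ZoneSummary
{
    static void Main()
    {
        var newJersey = new List<Child> {
            new Child { Name = "gazelle", Load = 1.7, Smtp = false, Addr = "127.16.77.6"  },
            new Child { Name = "zebra",   Load = 3.2, Smtp = true,  Addr = "127.16.77.11" },
            new Child { Name = "gnu",     Load = 0.5, Smtp = false, Addr = "127.16.77.12" },
        };

        double avgLoad = newJersey.Average(c => c.Load);            // 1.8, matching the NJ row
        string smtpContact = newJersey.First(c => c.Smtp).Addr;     // an SMTP-capable representative

        // This one row is all that hosts outside the zone ever see of New Jersey.
        Console.WriteLine($"NJ  AvgLoad={avgLoad:F1}  SMTP contact={smtpContact}");
    }
}
```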
36
Astrolabe "compared" to multicast
Both use gossip in similar ways
But here, data comes from all nodes in the system, not just a few sources
  Rates are low... hence overhead is low...
  ... but invaluable when orchestrating adaptation and self-repair
Astrolabe is extremely robust to disruption
  The hierarchy is self-constructed and self-healing
37
Remaining time: 2 baselines
The first focuses on the latency of real-time event notification
The second on the speed of event filtering
Both involve key elements of QuickSilver, and both are easy to compare with the prior state of the art
38
Slingshot
A time-critical event notification protocol
Idea: probabilistic real-time goals
  Pay a higher overhead but reduce the frequency of missed deadlines
Already yielding multiple-order-of-magnitude improvements in latency and throughput!
39
Redefining Time-Critical
Probabilistic guarantees: with x% overhead, y% of the data is delivered within t seconds.
Data "expires": stock quotes, location updates
Urgency-sensitive: new data is prioritized over old
The application runs in COTS settings, co-existing with other, non-time-critical applications on the same machine
40
Time-Critical Eventing
Eventing: publishers publish events to topics, which are then received by subscribers
Applications are characterized by many-to-many flows of small, discrete units of data
Scalability dimensions:
  number of topics
  numbers of publishers and subscribers per topic
  degree of subscription overlap
41
Slingshot: Receiver-Based FEC
Topics are mapped to multicast groups
Publishers multicast events unreliably
Subscribers constantly exchange error-correction packets for message history suffixes (a sketch follows below)
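A toy sketch of receiver-based error correction, assuming the simplest possible repair packet: an XOR over a history suffix, so that a peer missing exactly one packet in that window can rebuild it locally instead of going back to the sender. Slingshot's actual encoding and exchange policy are more elaborate; this only illustrates the principle.

```csharp
using System;
using System.Linq;

// Receiver-based FEC sketch: one XOR repair packet over the last k payloads.
class XorRepair
{
    static byte[] Xor(byte[] a, byte[] b) =>
        a.Zip(b, (x, y) => (byte)(x ^ y)).ToArray();

    static void Main()
    {
        byte[] m1 = { 1, 2, 3 }, m2 = { 4, 5, 6 }, m3 = { 7, 8, 9 };

        // A healthy subscriber builds one repair packet over its history suffix {m1, m2, m3}.
        byte[] repair = Xor(Xor(m1, m2), m3);

        // A peer that received m1 and m3 but lost m2 recovers it from the repair packet.
        byte[] recovered = Xor(Xor(repair, m1), m3);
        Console.WriteLine(string.Join(",", recovered));   // 4,5,6
    }
}
```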
42
Slingshot: Tunable Reliability
[Graph: overhead vs. performance; fraction of lost packets recovered (reliability) and average recovery/discovery time (in RTTs), plotted against the ratio of overhead]
43
Slingshot: Scalability in Topics
[Graph: recovery time (in RTTs) vs. degree of membership, comparing the multi-group scheme with a naïve per-group approach]
44
A second baseline
Scalable stateful content filtering (Gehrke)
  Arises when deciding which events to deliver to the client system
  Usually pub-sub is "coarse grained", and content filtering is then applied
  A chance to apply security policy and prune unwanted data... but it can be slow
45
Model and problem statement
Model:
  An event is a set of (attribute, value) pairs
    Example: an event notifying the location of a vehicle: {(Type, "Tank"), (Latitude, 10), (Longitude, 25)}
  A subscription is a set of predicates on event attributes (conjunctive semantics)
    Example: a subscription looking for tanks in the area: {(Type = "Tank"), (8 < Latitude < 12)}
    Equality and range predicates
Problem:
  Given: a (large) set of subscriptions, S, and a stream of events, E
  Find: for each event e in E, determine the set of subscriptions whose predicates are satisfied by e
Scalability:
  With the event rate
  With the number of subscriptions
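A minimal sketch of the stateless matching model: events as attribute/value sets and subscriptions as conjunctions of equality and range predicates. The names are illustrative, and the brute-force scan over subscriptions is only there to fix the semantics; scalable filtering must index the subscriptions rather than test them one by one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stateless content filtering sketch: a subscription matches an event when all
// of its predicates (equality and range) hold on the event's attributes.
class Sub
{
    public string Id;
    public List<Func<IDictionary<string, object>, bool>> Predicates = new List<Func<IDictionary<string, object>, bool>>();
    public bool Matches(IDictionary<string, object> e) => Predicates.All(p => p(e));
}

class Matcher
{
    static void Main()
    {
        var e = new Dictionary<string, object> {
            ["Type"] = "Tank", ["Latitude"] = 10.0, ["Longitude"] = 25.0
        };

        var tanksInArea = new Sub {
            Id = "tanks-in-area",
            Predicates = {
                ev => (string)ev["Type"] == "Tank",                               // equality predicate
                ev => (double)ev["Latitude"] > 8 && (double)ev["Latitude"] < 12   // range predicate
            }
        };

        var subs = new[] { tanksInArea };      // in practice, a very large set S
        foreach (var s in subs.Where(s => s.Matches(e)))
            Console.WriteLine($"deliver to {s.Id}");
    }
}
```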
46
What About State?
Model:
  An event is a set of (attribute, value) pairs
    Example: an event notifying the location of a vehicle: {(Type, "Tank"), (Latitude, 10), (Longitude, 25)}
  A subscription is a query over sequences of events
    Example: a subscription looking for adversaries with suspicious behavior: "Notify me if the enemy first visits location A and then location B"
  Subscriptions need to maintain state across events
Problem:
  Given: a (large) set of stateful subscriptions, S, and a stream of events, E
  Find: for each event e in E, determine the set of subscriptions whose predicates are satisfied by e
47
Managing State
Use a linear finite state automaton with self-loops to encapsulate the state (sketched below)
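A sketch of one stateful subscription expressed as a linear automaton with self-loops: the subscription "enemy visits A and then B" advances one stage per matching event and simply stays put (self-loop) on anything else, so irrelevant events never reset it. Names, the single "location" attribute, and the overall structure are illustrative assumptions, not the actual filtering engine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Linear automaton with self-loops for a sequence subscription.
class SequenceSub
{
    private readonly List<Func<string, bool>> stages;
    private int state;                     // index of the next stage to satisfy

    public SequenceSub(params Func<string, bool>[] stages) => this.stages = stages.ToList();

    // Feed one event; returns true once the final stage has been reached.
    public bool Advance(string location)
    {
        if (state < stages.Count && stages[state](location))
            state++;                       // transition to the next stage
        // otherwise: self-loop, keep waiting without resetting
        return state == stages.Count;
    }
}

class StatefulDemo
{
    static void Main()
    {
        var sub = new SequenceSub(loc => loc == "A", loc => loc == "B");
        foreach (var loc in new[] { "C", "A", "D", "B" })
            if (sub.Advance(loc))
                Console.WriteLine($"notify: enemy visited A and then B (triggered at {loc})");
    }
}
```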
48
Baseline System Architecture
[Diagram: app server]
49
Experimental Results (Y axis in log scale!)
50
Putting it all together
[Timeline, 9/04 through 12/05, with parallel work streams:]
  Scalable reliable multicast (many receivers, "groups")
  Time-critical event notification
  Management and self-repair
  Streaming real-time media data
  Scalable content filtering
  Integrate into Core Platform
Phases: develop baselines and overall architecture; solve key subproblems; integrate into the platform; deliver to early users
51
Will QuickSilver solve our problem?
[Diagram, as before: query and update sources reach services hosted at data centers A and B but accessible system-wide; a pmap gives the logical partitioning of services, and an l2P map takes logical services to the physical server pool, perhaps many to one. One application can be a source of both queries and updates; operators can control the pmap, the l2P map, and other parameters, and large-scale multicast is used to disseminate updates.]
Annotations on the diagram:
  Scalable multicast is used to update system-wide parameters and management controls
  Within and between groups, we need stronger reliability properties and higher speeds; groups are smaller, but there are many of them
  We need a way to monitor and manage the collection of services in our data center: a good match to Astrolabe
  We need a way to monitor and manage the machines in the server pool... another good match to Astrolabe
  We're exploring the limits beyond which a strong (non-probabilistic) replication scheme is needed in clustered services; QuickSilver will support virtual synchrony too
52
DoD “Typical” Baseline Data - 1
According to a study by the Congressional Budget Office for the Department of the Army in 2003, bandwidth demands for the Army alone will exceed bandwidth supply by a factor of between 10:1 and 30:1 by the year 2010.
The Army’s Bandwidth Bottleneck, A CBO Report, August 2003, http://www.cbo.gov/ftpdoc.cfm?index=4500&type=1
The growth rates, data volumes, and characterization of networked transactions described in a DCGS Block 10.2 Navy study are consistent with the CBO study. In many cases the DCGS-N study predicts earlier bandwidth saturation, given the disparate rates of growth in total network capacity compared to the technological innovation that will necessarily increase demand.
Throughput requirements of 3-10 Mbps for imagery data and 200 Kbps-1 Mbps for other forms (see next slide)
53
DoD “Typical” Baseline Data - 2
Input               "Typical" DoD Scenario
System size         100s-1000s of nodes
System topology     Hierarchical networks with "bridges" using LAN/WAN, SATCOM, and wireless RF (LOS & BLOS)
Event type          Multiple: Situational Awareness updates (Binary/Text/XML); Plans and Reports (Text); Imagery
Event rate          Multiple SA: 100s/sec (size = 1 KB per entity); Plans & Reports: aperiodic/sporadic (size = 10 KB); Imagery: aperiodic/continuous (size = 50 MB)
Perturbation rate

Most of this sort of data is short-lived yet requires processing in a time-valued ordering scheme.
54
DoD Challenges for SRS (2010-2025) – Network Oriented
Granular, Scalable Redundancy: USN FORCEnet
Source: NETWARCOM Official FORCEnet World Wide Web site http://forcenet.navy.mil/fnep/FnEP_Brief.zip
55
DoD Challenges for SRS (2010-2025) – Network Oriented
Granular, Scalable Redundancy: Ground Sensor Netting
Source: Raytheon Company © 2004 Raytheon Company. All Rights Reserved. Unpublished Work