A Framework for Highly-Available
Real-Time Internet Services
Bhaskaran Raman, EECS, U.C. Berkeley
Problem Statement
• Real-time services with long-lived sessions
• Need to provide continued service in the face of failures
[Figure: clients, content server, and real-time services (video smoothing proxy, transcoding proxy) connected across the Internet]
Problem Statement
• Goals:
  1. Quick recovery
  2. Scalability
[Figure: client, monitoring, video-on-demand server, and service replica; failures shown: network path failure, service failure]
Our Approach
• Service infrastructure
  – Computer clusters deployed at several points on the Internet
  – For service replication & path monitoring
  – (path = path of the data stream in a client session)
Summary of Related Work
• Fail-over within a single cluster
  – Active Services (e.g., video transcoding)
  – TACC (web-proxy)
  – Only service failure is handled
• Web mirror selection
  – E.g., SPAND
  – Does not handle failure during a session
• Fail-over for network paths
  – Internet route recovery
    • Not quick enough
  – ATM, Telephone networks, MPLS
    • No mechanism for wide-area Internet
No architecture to address network path failure during a real-time session
Research Challenges
• Wide-area monitoring
  – Feasibility: how quickly and reliably can failures be detected?
  – Efficiency: per-session monitoring would impose significant overhead
• Architecture
  – Who monitors whom?
  – How are services replicated?
  – What is the mechanism for fail-over?
Is Wide-Area Monitoring Feasible?
Monitoring for liveness of path using keep-alive heartbeat
[Figure: heartbeat timelines (axis: time) with the timeout period marked]
• Failure: detected by timeout
• False-positive: failure detected incorrectly
There’s a trade-off between time-to-detection and rate of false-positives
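The timeout mechanism can be sketched as follows. This is a minimal illustration, not the implementation behind these slides; the class name `HeartbeatMonitor` and its methods are hypothetical. A shorter `timeout` detects real failures sooner but also flags more delayed heartbeats as failures.

```python
class HeartbeatMonitor:
    """Timeout-based liveness check for one monitored path (illustrative sketch)."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds of silence tolerated before declaring failure
        self.last_seen = None       # arrival time of the most recent heartbeat

    def on_heartbeat(self, now):
        """Record that a heartbeat arrived at time `now` (seconds)."""
        self.last_seen = now

    def is_failed(self, now):
        """True if the path has been silent for longer than the timeout."""
        if self.last_seen is None:
            return False            # nothing received yet, so nothing to time out
        return now - self.last_seen > self.timeout


monitor = HeartbeatMonitor(timeout=2.0)
monitor.on_heartbeat(10.0)
print(monitor.is_failed(11.5))   # 1.5 s of silence, within the timeout
print(monitor.is_failed(12.5))   # 2.5 s of silence, beyond the timeout
```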
Is Wide-Area Monitoring Feasible?
• False-positives due to:
  – Simultaneous losses
  – Sudden increase in RTT
• Related studies:
  – Internet RTT study, Acharya & Saltz, UMD 1996
    • RTT spikes are isolated
  – TCP RTO study, Allman & Paxson, SIGCOMM 1999
    • Significant RTT increase is quite transient
• Our experiments:
  – Ping data from ping servers
  – UDP heartbeats between Internet hosts
Ping measurements
• Ping servers
  – 12 geographically distributed servers chosen
  – Approximation of a keep-alive stream
• Count number of loss runs with > 4 consecutive losses
  – Could be an actual failure or just intermittent losses
  – If we have 1 second HBs, and timeout after losing 4 HBs
  – This count gives the upper bound on the number of false-positives
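Counting loss runs in a ping trace could look like the sketch below. It assumes the trace is available as a sequence of booleans (True when a probe was answered); the function name and the `threshold` parameter are illustrative, not taken from the measurement scripts.

```python
def count_long_loss_runs(received, threshold=4):
    """Count maximal runs of consecutive lost probes longer than `threshold`.

    received: iterable of booleans, True if the probe was answered.
    """
    runs = 0
    current = 0                     # length of the loss run in progress
    for ok in received:
        if ok:
            if current > threshold:
                runs += 1
            current = 0
        else:
            current += 1
    if current > threshold:         # trace ended in the middle of a loss run
        runs += 1
    return runs


trace = [True] * 3 + [False] * 5 + [True] + [False] * 4 + [True] + [False] * 6
print(count_long_loss_runs(trace))  # runs of 5 and 6 count, the run of 4 does not
```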
[Figure: Berkeley, ping server, and Internet host, with numbered steps 1 to 3; links labeled HTTP and ICMP]
Ping measurements
Ping server                 Ping host                   Total time (h:m:s)  Runs of > 4 misses
cgi.shellnet.co.uk          hnets.uia.ac.be             15:14:14            12
hnets.uia.ac.be             sites.inka.de               20:55:58            29
cgi.shellnet.co.uk          hnets.uia.ac.be             14:53:14            0
www.hit.net                 d18183f47.rochester.rr.com  20:27:32            10
d18183f47.rochester.rr.com  www.his.com                 17:30:31            0
www.hit.net                 www.his.com                 20:28:05            2
www.csu.net                 zeus.lyceum.com             20:01:15            17
zeus.lyceum.com             www.atmos.albany.edu        20:01:17            62
www.csu.net                 www.atmos.albany.edu        20:01:18            7
www.interworld.net          www.interlog.com            14:19:44            6
www.interlog.com            inoc.iserv.net              20:32:07            24
www.interworld.net          inoc.iserv.net              14:39:56            0
UDP-based keep-alive stream
• Geographically distributed hosts:
  – Berkeley, Stanford, UIUC, TU-Berlin, UNSW
• UDP heart-beat every 300ms
• Measure gaps between receipt of successive heart-beats
• False positive:
  – No heartbeat received for > 2 seconds, but received before 30 seconds
• Failure:
  – No HB for > 30 seconds
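The classification rule above can be written down directly. The sketch assumes heartbeat arrival times are recorded in seconds; the constant and function names are illustrative.

```python
FALSE_POSITIVE_GAP = 2.0   # seconds: a longer gap would trigger a 2 s timeout
FAILURE_GAP = 30.0         # seconds: a longer gap is treated as a real failure


def classify_gaps(arrival_times):
    """Classify each gap between successive heartbeat arrivals."""
    counts = {"ok": 0, "false_positive": 0, "failure": 0}
    for prev, cur in zip(arrival_times, arrival_times[1:]):
        gap = cur - prev
        if gap > FAILURE_GAP:
            counts["failure"] += 1
        elif gap > FALSE_POSITIVE_GAP:
            counts["false_positive"] += 1   # heartbeat resumed before 30 s
        else:
            counts["ok"] += 1
    return counts


arrivals = [0.0, 0.3, 0.6, 3.1, 3.4, 40.0]
print(classify_gaps(arrivals))
```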
UDP-based keep-alive stream
HB destination  HB source  Total time (h:m:s)  Num. false positives
Berkeley        UNSW       130:48:45           135
UNSW            Berkeley   130:51:45           9
Berkeley        TU-Berlin  130:49:46           27
TU-Berlin       Berkeley   130:50:11           174
TU-Berlin       UNSW       130:48:11           218
UNSW            TU-Berlin  130:46:38           24
Berkeley        Stanford   124:21:55           258
Stanford        Berkeley   124:21:19           2
Stanford        UIUC       89:53:17            4
UIUC            Stanford   76:39:10            74
Berkeley        UIUC       89:54:11            6
UIUC            Berkeley   76:39:40            3
What does this mean?
• If we have a failure detection scheme
  – Timeout of 2 sec
  – False positives can be as low as once a day
  – For many pairs of Internet hosts
• In comparison, BGP route recovery:
  – > 30 seconds
  – Can take up to 10s of minutes, Labovitz & Ahuja, SIGCOMM 2000
  – Worse with multi-homing (an increasing trend)
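The "once a day" figure can be checked against the UDP heartbeat measurements by converting each row to a daily rate. The helper names below are illustrative; the inputs are rows from the UDP keep-alive table.

```python
def hms_to_hours(hms):
    """Convert an 'h:m:s' duration string to hours."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h + m / 60 + s / 3600


def false_positives_per_day(total_time, count):
    """Average number of false positives per 24 hours of measurement."""
    return count / hms_to_hours(total_time) * 24


# (HB destination, HB source, total time, false positives) from the UDP table
rows = [
    ("Stanford", "Berkeley", "124:21:19", 2),
    ("UIUC", "Berkeley", "76:39:40", 3),
    ("Berkeley", "Stanford", "124:21:55", 258),
]
for dst, src, total, count in rows:
    rate = false_positives_per_day(total, count)
    print(f"{src} -> {dst}: {rate:.2f} false positives/day")
```

The first two pairs come out below one false positive per day, while Berkeley as destination from Stanford is around fifty per day, so the rate depends heavily on the pair and direction.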
Architectural Requirements
• Efficiency:
  – Monitoring per-session: too much overhead
  – End-to-end: more latency, so monitoring is less effective
• Need aggregation
  – Client-side aggregation: using a SPAND-like server
  – Server-side aggregation: clusters
  – But not all clients have the same server & vice-versa
• Service infrastructure to address this
  – Several service clusters on the Internet
Architecture
[Figure: service clusters on the Internet between Source and Client, connected by keep-alive streams]
• Service cluster: compute cluster capable of running services
• Overlay topology
  – Nodes = service clusters
  – Links = monitoring channels between clusters
• Source to Client
  – Routed via monitored paths
  – Could go through an intermediate service
  – Local recovery using a backup path
Architecture
Monitoring within cluster for process/machine failure
Monitoring across clusters for network path failure
Peering of service clusters: to serve as backups for one another, or for monitoring the path between them
Architecture: Advantages
• 2-level monitoring
  – Process/machine failures still handled within cluster
  – Common failure cases do not require wide-area mechanisms
• Aggregation of monitoring across clusters
  – Efficiency
• Model works for cascaded services as well
[Figure: cascaded services S1 and S2 between Source and Client]
Architecture: Potential Criticism
• Potential criticism:
  – Does not handle resource reservation
• Response:
  – Related issue, but could be orthogonal
  – Aggregated/hierarchical reservation schemes (e.g., the Clearing House)
  – Even if reservation is solved (or is not needed), we still need to address failures
Architecture: Issues (1) Routing
[Figure: overlay topology on the Internet between Source and Client]
Given the overlay topology, need a routing algorithm to go from source to destination via service(s)
Also need: (a) WA-SDS, (b) Closest service cluster to a given host
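One way to route "via a service" is a shortest-path search over an expanded state space, where each state records whether the path has already passed a cluster that can run the service. The sketch below is an assumption for illustration, not the routing algorithm proposed in this talk; the graph encoding, link costs, and function name are all hypothetical.

```python
import heapq


def shortest_path_via_service(graph, source, client, service_clusters):
    """Cheapest source-to-client path that visits at least one service cluster.

    graph: dict mapping node -> {neighbor: link cost} for the overlay.
    service_clusters: set of nodes able to run the required service.
    Returns (cost, path), or (inf, []) if no such path exists.
    """
    start = (source, source in service_clusters)   # state = (node, service seen?)
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, state = heapq.heappop(heap)
        if d > dist.get(state, float("inf")):
            continue                                # stale heap entry
        node, served = state
        if node == client and served:
            path = [node]
            while state in prev:                    # walk predecessors back to source
                state = prev[state]
                path.append(state[0])
            return d, path[::-1]
        for nbr, cost in graph.get(node, {}).items():
            nxt = (nbr, served or nbr in service_clusters)
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = state
                heapq.heappush(heap, (nd, nxt))
    return float("inf"), []


overlay = {
    "source": {"A": 1, "B": 5},
    "A": {"client": 1, "S": 2},
    "S": {"client": 2, "A": 2},
    "B": {"client": 1},
    "client": {},
}
print(shortest_path_via_service(overlay, "source", "client", {"S"}))
```

In this example the cheapest plain path (source, A, client) is rejected because it never passes the service cluster S, so the search returns the path through S instead.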
Architecture: Issues (2) Topology
How many service-clusters? (nodes)
How many monitored-paths? (links)
Ideas on Routing
• BGP
  – Border router
  – Peering session
    • Heartbeats
  – IP route
    • Destination based
• Service infrastructure
  – Service cluster
  – Peering session
    • Heartbeats
  – Route across clusters
    • Based on destination and intermediate service(s)
Similarities with BGP
• How is routing in the overlay topology different from BGP?
• Overlay topology
  – More freedom than physical topology
  – Constraints on graph can be imposed more easily
  – For example, can have local recovery
Ideas on Routing
S (Berkeley)
S’ (San Francisco)
Ideas on Routing
[Figure: Source and Client, with AP1, AP2, AP3]
• BGP exchanges a lot of information
  – Increases with the number of APs: O(100,000)
• Service clusters need to exchange very little information
  – The problems of (a) service discovery and (b) knowing the nearest service cluster are decoupled from routing
Ideas on Routing
• BGP routers do not have per-session state
• But, service clusters can maintain per-session state
  – Can have local recovery pre-determined for each session
• Finally, we probably don’t need to have as many nodes as the number of BGP routers
  – Can have more aggressive routing algorithm
It is feasible to have fail-over in the overlay topology quicker than Internet route recovery with BGP
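Per-session state with a pre-determined backup could be kept in a small table, as in the sketch below. The data layout and names (`SessionTable`, `on_link_failure`) are assumptions for illustration; the slides do not specify how session state is stored.

```python
def uses_link(path, link):
    """True if consecutive nodes of `path` traverse `link` in either direction."""
    a, b = link
    return any({u, v} == {a, b} for u, v in zip(path, path[1:]))


class SessionTable:
    """Per-session active and backup overlay paths held by a service cluster."""

    def __init__(self):
        self.sessions = {}

    def add_session(self, session_id, primary, backup):
        self.sessions[session_id] = {"active": primary, "backup": backup}

    def on_link_failure(self, link):
        """Move affected sessions to their backup path; return their ids."""
        moved = []
        for sid, s in self.sessions.items():
            backup = s["backup"]
            if (uses_link(s["active"], link)
                    and backup is not None
                    and not uses_link(backup, link)):
                s["active"], s["backup"] = backup, None   # backup is now in use
                moved.append(sid)
        return moved


table = SessionTable()
table.add_session("s1", ["src", "A", "B", "client"], ["src", "A", "C", "client"])
table.add_session("s2", ["src", "D", "client"], ["src", "A", "B", "client"])
print(table.on_link_failure(("A", "B")))
```

Because the backup is chosen ahead of time, recovery needs no route computation at failure time; only sessions whose active path crosses the failed link are touched.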
Ideas on topology
• Decision criteria
  – Additional latency for client session
    • Sparse topology: more latency
  – Monitoring overhead
    • Many monitored paths: more overhead, but additional flexibility might reduce end-to-end latency
Ideas on topology
• Number of nodes:
  – Claim: upper bound is: one per AS
  – This will ensure that there is a service cluster “close” to every client
  – Topology will be close to Internet backbone topology
• Number of monitoring channels:
  – # AS: ~7000 as of 1999
  – Monitoring overhead: ~100 Bytes/sec
    • Can have ~1000 peering sessions per service cluster easily
  – Dense overlay topology possible (to minimize additional latency)
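The peering-session estimate is simple arithmetic, shown here under the assumption that the ~100 Bytes/sec figure applies to each monitored channel independently. The function name is illustrative.

```python
def monitoring_overhead_bits_per_sec(peering_sessions, bytes_per_sec_per_session=100):
    """Aggregate heartbeat traffic for one service cluster, in bits per second."""
    return peering_sessions * bytes_per_sec_per_session * 8


bps = monitoring_overhead_bits_per_sec(1000)
print(f"{bps / 1e6:.1f} Mbit/s for 1000 peering sessions")
```

At 1000 sessions this is under 1 Mbit/s per cluster, which is the sense in which that many peering sessions is "easy".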
Implementation + Performance
[Figure: service cluster, peer cluster, manager node; exchange of session-information + monitoring heartbeat]
Implementation + Performance
• PCM-to-GSM codec service
  – Overhead of “hot” backup service: 1 process (idle)
  – Service reinstantiation time: ~100ms
• End-to-end recovery over wide-area: three components
  – Detection time: O(2sec)
  – Communication with replica (RTT): O(100ms)
  – Replica activation: ~100ms
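Adding the three components gives a rough end-to-end recovery time. The default values below are the order-of-magnitude figures from this slide, and the function name is illustrative.

```python
def recovery_time(detection_s=2.0, rtt_s=0.1, activation_s=0.1):
    """Sum of detection, replica communication, and activation times (seconds)."""
    return detection_s + rtt_s + activation_s


print(f"{recovery_time():.1f} s end-to-end")
```

Detection dominates the total, so the timeout choice largely determines how quickly a session recovers.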
Implementation + Performance
• Overhead of monitoring
  – Keep-alive heartbeat
  – One per 300ms in our implementation
  – O(100 Bytes/sec)
• Overhead of false-positive in failure detection
  – Session transfer
  – One message exchange across adjacent clusters
    • Few hundred bytes
Summary
• Real-time applications with long-lived sessions
  – No support exists for path-failure recovery (unlike, say, the PSTN)
• Service infrastructure to provide this support
  – Wide-area monitoring for path liveness:
    • O(2sec) failure detection with low rate of false positives
  – Peering and replication model for quick fail-over
• Interesting issues:
  – Nature of overlay topology
  – Algorithms for routing and fail-over
Questions
• What are the factors in deciding topology?
• Strategies for handling failure near client/source
  – Currently deployed mechanisms for RIP/OSPF, ATM/MPLS failure recovery
  – What is the time to recover?
• Applications: video/audio streaming
  – More?
  – Games? Proxies for games on hand-held devices?
http://www.cs.berkeley.edu/~bhaskar
(Presentation running under VMWare under Linux)