Paolo Costa
joint work with Ant Rowstron, Austin Donnelly, and Greg O’Shea (MSR Cambridge) Hussam Abu-Libdeh, Simon Schubert (Interns)
• Data centers
  – large groups of networked servers
  – typically built from commodity hardware
• Scale-out vs. scale-up
  – many low-cost PCs rather than a few expensive high-end servers
  – economies of scale
• Custom software stack
  – Microsoft: Autopilot, Dryad
  – Google: GFS, BigTable, MapReduce
  – Yahoo: Hadoop, HDFS
  – Amazon: Dynamo, Astrolabe
  – Facebook: Cassandra, Scribe, HBase
Focus of this talk
Paolo Costa CamCube - Rethinking the Data Center Cluster 3/54
• Mismatch between abstraction & reality
• How programmers see the network:
  – key-based
  – overlay networks
  – explicit structure
  – symmetry of roles
  – …
[Figure: conventional data center topology: ~20-40 servers per rack, ~12 racks per pod, 10 Gbps uplinks between tiers]
Data center services are distributed, but they have no control over the network:
– must infer network properties (e.g., locality)
– must use inefficient overlay networks
– no bandwidth guarantees
• Top-down (applications)
– Custom routing? • Overlays
– Priority traffic? • Use multiple TCP flows
– Locality? • Use traceroute and ping
– TCP Incast? • Add random jitter
• Bottom-up (network)
– More bandwidth? • Use richer topologies (e.g., fat-trees)
– Fate-sharing? • Use prediction models to estimate flow length / traffic patterns
– TCP Incast? • Use ECN in routers
• The network is still a black box
Good for a multi-owned, heterogeneous Internet
Limiting for a single-owned, homogeneous data center
• DCs are not mini-Internets
– single owner / administration domain
– we know (and define) the topology
– low hardware/software diversity
– no malicious or selfish nodes
• Is TCP/IP still appropriate?
• “A data center from the perspective of a distributed systems builder”
• What happens if you integrate network & compute?
[Diagram: CamCube sits at the intersection of HPC, networking, and distributed systems]
– Topology
– API: link- and key-based (instead of TCP/IP)
– Architecture: service composition
• No distinction between overlay and underlying network
• Nodes have an (x, y, z) coordinate
  – defines the key-space
• Simple one-hop API to send/receive packets
  – all other functionality implemented as services
• Key coordinates are re-mapped in case of failures
  – enables a KBR-like API
[Figure: 3D torus with axes x, y, z; neighbouring servers (1,2,1) and (1,2,2) highlighted]
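The coordinate abstraction can be sketched in a few lines. This is a minimal model, not CamCube's actual API: it assumes a 3x3x3 torus in which each server's six one-hop neighbours are found by wrapping ±1 along each axis, and a hypothetical `key_to_coord` helper that re-maps a key to a nearby live server on failure (the KBR-like behaviour mentioned above):

```python
# Minimal sketch of CamCube-style (x, y, z) coordinates on a 3x3x3 torus.
# Torus size follows the prototype; function names are illustrative.

N = 3  # servers per dimension (3x3x3 = 27 servers)

def one_hop_neighbors(coord):
    """Return the six direct neighbours of a server, wrapping around
    each axis (torus links)."""
    x, y, z = coord
    return [
        ((x + dx) % N, (y + dy) % N, (z + dz) % N)
        for dx, dy, dz in [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                           (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    ]

def key_to_coord(key, live_servers):
    """Map a key into the coordinate space; if the natural owner has
    failed, re-map to the closest live server (KBR-like behaviour)."""
    h = hash(key)
    natural = (h % N, (h // N) % N, (h // N // N) % N)
    if natural in live_servers:
        return natural
    # naive re-mapping: nearest live server by wrap-around Manhattan distance
    def dist(a, b):
        return sum(min((p - q) % N, (q - p) % N) for p, q in zip(a, b))
    return min(live_servers, key=lambda s: dist(s, natural))
```

Because the topology wraps, even a corner server such as (0,0,0) has exactly six neighbours, e.g. both (1,0,0) and (2,0,0) along the x axis.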
• Packets move from service to service and server to server
• Services can intercept and process packets along the way
• Keys are re-mapped in case of failures
• Server resources (RAM, disks, CPUs) are accessible to all services
[Figure: a "fetch mail costa:1234" request routed via (0,0,0), (1,0,0), (2,0,0); after a failure, the key is re-mapped and traffic is redirected via (1,1,0)]
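Service composition along the routing path can be illustrated with a toy model. All names here (`route`, `mail_service`, the cache contents) are hypothetical, not CamCube's API; the point is only that every hop hands the packet to the local service, which may answer it or forward it unchanged:

```python
# Sketch of service composition on the path: as a packet travels
# server-to-server, each server's service can process it or pass it on.

def route(path, packet, on_hop):
    """Deliver `packet` along `path`, invoking the service callback
    `on_hop(server, packet)` at every server; the callback returns the
    (possibly transformed) packet to forward onward."""
    for server in path:
        packet = on_hop(server, packet)
    return packet

# Example: a caching service intercepts a "fetch" request mid-path.
cache = {(1, 0, 0): {"costa:1234": "3 new messages"}}

def mail_service(server, packet):
    if packet["type"] == "fetch" and packet["key"] in cache.get(server, {}):
        return {"type": "reply", "body": cache[server][packet["key"]]}
    return packet  # not ours: forward unchanged

result = route([(0, 0, 0), (1, 0, 0), (2, 0, 0)],
               {"type": "fetch", "key": "costa:1234"},
               mail_service)
# the request is answered from the cache at (1,0,0)
```

This is what distinguishes CamCube from a conventional network: intermediate servers are programmable, so a request can be satisfied on-path rather than only at its destination.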
• 27-server (3x3x3) prototype
  – quad-core Intel Xeon 5520 2.27 GHz
  – 12 GB RAM
  – 6 Intel PRO/1000 PT 1 Gbps ports
  – runtime & services implemented in C#
• Multicast, key-value store, MapReduce, graph processing engine, …
• Experiment: all servers saturating 6 links
  – 11.94 Gbps (9K pkts) using < 22% CPU
• MapReduce
– originally developed by Google
– open-source version (Hadoop) used by several companies
• Yahoo, Facebook, Twitter, Amazon, …
– many applications
• data mining, search, indexing…
• Simple abstraction
– map: apply a function that generates (key, value) pairs
– reduce: combine intermediate results
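The abstraction in miniature, as the canonical word-count sketch (`map_fn`, `reduce_fn`, and `mapreduce` are illustrative names, not a real framework API):

```python
# The map/reduce abstraction: map emits (key, value) pairs,
# a shuffle groups them by key, and reduce combines each group.
from collections import defaultdict

def map_fn(document):
    # map: emit a (word, 1) pair per word
    for word in document.split():
        yield (word, 1)

def reduce_fn(key, values):
    # reduce: combine all values for a key
    return sum(values)

def mapreduce(documents):
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):      # map phase
            groups[key].append(value)       # shuffle: group by key
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}  # reduce phase

counts = mapreduce(["the cat", "the dog"])
# counts == {"the": 2, "cat": 1, "dog": 1}
```

The shuffle step, where intermediate pairs cross the network to reach their reducer, is exactly the traffic the rest of the talk targets.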
[Diagram: all-to-one: every Map task streams its output to a single Reduce task]
+ results stored on a single node
+ can handle all reduce functions
– bottleneck (network and CPU)
[Diagram: all-to-all: Map outputs partitioned across many Reduce tasks]
+ CPU & network load distributed
– N² flows: needs full bisection bandwidth
– results are split across R servers
– not always applicable (e.g., max)
[Diagram: the all-to-one pattern, revisited]
Use hierarchical aggregation!
[Diagram: Map tasks arranged along a tree rooted at the Reduce task]
[Diagram: Aggregate steps inserted at interior nodes of the tree]
Intermediate results can be aggregated at each hop.
In-network aggregation:
– traffic reduced at each hop
– no need for full bisection bandwidth
– final results stored on a single node
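The idea can be sketched as follows. This is a toy model of a hypothetical aggregation tree, assuming a combiner that is commutative and associative (e.g., sum, as in word count): each interior node combines its children's partial results before forwarding a single value upward, so traffic shrinks at every hop.

```python
# Sketch of hierarchical in-network aggregation over a tree.

def aggregate_tree(tree, partials, combine):
    """tree: node -> list of children; leaves hold partial results in
    `partials`. Returns the combined result at the root and the number
    of values transferred over links (a proxy for network traffic)."""
    traffic = [0]

    def up(node):
        if node in partials:                 # leaf: a map task's output
            return partials[node]
        child_results = []
        for child in tree[node]:
            r = up(child)
            traffic[0] += 1                  # one aggregated value per link
            child_results.append(r)
        acc = child_results[0]
        for r in child_results[1:]:
            acc = combine(acc, r)            # aggregate at this hop
        return acc

    return up("root"), traffic[0]

tree = {"root": ["a", "b"], "a": ["m1", "m2"], "b": ["m3", "m4"]}
partials = {"m1": 5, "m2": 7, "m3": 1, "m4": 2}
total, links_used = aggregate_tree(tree, partials, lambda x, y: x + y)
# total == 15; exactly one value crosses each of the 6 tree links,
# instead of every map task streaming its full output to the root
```

The combiner must be commutative and associative for this to be correct, which is why the slides note that the all-to-one alternative "can handle all reduce functions" while tree aggregation cannot.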
• Camdoop builds 6 disjoint trees (rather than a single tree)
  – failure resilient
  – (up to) 6 Gbps throughput / server
• 27-server testbed
• 27 GB input size
– 1GB / server
• Output size / intermediate size (S)
– S=1/N ≈ 0 (full aggregation)
– S=1 (no aggregation)
• Two configurations:
– All-to-one
– All-to-all
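The parameter S can be illustrated with a toy helper (`aggregation_ratio` is a hypothetical name, and "size" is counted here in key-value pairs rather than bytes): when every map task emits the same key, full aggregation collapses the intermediate data to a single pair (S = 1/N ≈ 0); when all keys are distinct, nothing collapses (S = 1).

```python
# Illustrating S = output size / intermediate size, measured in
# (key, value) pairs for simplicity.

def aggregation_ratio(intermediate_pairs):
    distinct_keys = {k for k, _ in intermediate_pairs}
    return len(distinct_keys) / len(intermediate_pairs)

# 27 map tasks all emitting the same key: full aggregation possible
print(aggregation_ratio([("the", 1)] * 27))            # 1/27, approx 0.037
# every pair has a unique key: no aggregation possible
print(aggregation_ratio([(i, 1) for i in range(27)]))  # 1.0
```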
[Graph: time (s, log scale) vs. aggregation ratio S, from full aggregation (S ≈ 0) to no aggregation (S = 1); lines: switch, Camdoop (no agg), Camdoop, plus the 1 Gbps switch bound; lower is better. Annotations: the switch line is independent of S; the gap to Camdoop (no agg) shows the impact of higher bandwidth; the gap to Camdoop shows the impact of aggregation; a marker highlights the aggregation ratio reported by Facebook]
[Graph: the same experiment for the second configuration; annotations again mark the impact of aggregation and the impact of higher bandwidth]
[Bar chart: time (ms) with 20 KB and 200 KB of input data per server; bars: switch, Camdoop (no agg), Camdoop; lower is better]
In-network aggregation is beneficial also for low-latency services.
• Building large-scale services is hard
  – mismatch between physical and logical topology
  – the network is a black box
• CamCube
  – designed from the perspective of developers
  – explicit topology and key-based abstraction
• Benefits
  – simplified development
  – more efficient applications and better fault-tolerance
H. Abu-Libdeh, P. Costa, A. Rowstron, G. O'Shea, and A. Donnelly. "Symbiotic Routing for Future Data Centres". In ACM SIGCOMM, 2010.
P. Costa, A. Donnelly, A. Rowstron, and G. O'Shea. "Camdoop: Exploiting In-network Aggregation for Big Data Applications". In USENIX NSDI, 2012.