Circuit Switching
Under the Radar with REACToR
He Liu, Feng Lu, Alex Forencich, Rishi Kapoor
Malveeka Tewari, Geoffrey M. Voelker
George Papen, Alex C. Snoeren, George Porter
How to build 100G datacenter networks?
Datacenter Traffic Is Skewed
[Figure: bandwidth required per destination; demand is skewed, with a hot destination near 100G]
10G Fat-Tree
[Figure: bandwidth required per destination; a 10G fat-tree caps every destination at 10G, far below the 100G peak]
100G Fat-Tree
6
[Figure: bandwidth required per destination; a 100G fat-tree meets the full demand]
Full bisection bandwidth, but expensive: cables, transceivers, …
Helios, c-Through: Hotspot Circuits
[Figure: a single 100G optical circuit serves the hottest destination, on top of a 10G electrical fat-tree]
Uses mirrors; reconfigures in 10 ms [SIGCOMM 2010]
OSA: More Circuits
[Figure: several 100G optical circuits serve multiple hot destinations]
[NSDI 2012]
Mordia: Fast Circuit Switching
[Figure: a 100G optical circuit time-shared across destinations]
Reconfigures in 15 microseconds, enabling time sharing of a circuit. [SIGCOMM 2013]
Limitation: Still Circuit Switching
Good with a few big flows
[Figure: the 100G circuit alternates over time between destinations A and B]
Limitation: Still Circuit Switching
[Figure: bandwidth required per destination; two destinations each demand 50G of the 100G link]
Limitation: Inefficient with Small Flows
Low link utilization: reconfiguring takes time
[Figure: the circuit cycles through destinations A–F over time; reconfiguration gaps leave the 100G link underutilized]
Our Approach: REACToR
A 100G hybrid of circuit and packet switching
[Figure: big flows A and B time-share the 100G circuit path; small flows C–F share the 10G packet path continuously]
Start with a Pre-existing 10G Network
[Diagram: end hosts connect over 10G links to a ToR box, which connects over 10G links to a 10G electrical packet network]
Connect via REACToR
[Diagram: end hosts connect over 100G links to a REACToR, which connects over 10G links to the 10G electrical packet network]
Connect Circuits
[Diagram: the REACToR also connects over 100G links to a 100G optical circuit network, alongside the 10G electrical packet network]
Hybrid Network
[Diagram: big flows travel over the 100G circuit links; small flows travel over the 10G packet links]
Challenge: Two Different Networks
Electrical packet switching:
• Low bandwidth
• Buffers all along the path
• Can transmit at any time

Optical circuit switching:
• High bandwidth
• Bufferless TDMA
• Can transmit only when the circuit is connected
Design Requirements
• Hybrid scheduling: classify traffic into circuits or packets
• Buffer packets at source hosts until circuit is available
• Have sources transmit when the circuit is connected
• Rate control to prevent downlink overload
The Hybrid Scheduling Problem
• Collect traffic demand from all hosts
• TDMA schedule the big flows on the circuit path
• Schedule the rest on the packet path
• An oracle predicts the demand and builds the schedules
[Diagram: a centralized TDMA schedule drives the circuit path; the packet network runs free]
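The classification step above can be sketched as a small function. This is an illustrative sketch, not the paper's scheduler: the 10G threshold and the example demand are assumptions.

```python
# Minimal sketch: split one host's demand between circuit and packet paths.
PACKET_CAPACITY_GBPS = 10  # capacity of the electrical packet path

def classify_demand(demand_gbps):
    """Map {dst: gbps} into circuit-path and packet-path demand.

    Big flows (which the 10G packet path cannot carry) go to the
    circuit path; everything else stays on the packet path.
    """
    circuit, packet = {}, {}
    for dst, gbps in demand_gbps.items():
        if gbps > PACKET_CAPACITY_GBPS:
            circuit[dst] = gbps
        else:
            packet[dst] = gbps
    return circuit, packet

# Hypothetical demand from host A to its peers:
circuit, packet = classify_demand({"B": 60, "C": 25, "D": 3, "E": 1})
print(circuit)  # {'B': 60, 'C': 25}  -> TDMA-scheduled on the circuit
print(packet)   # {'D': 3, 'E': 1}   -> free-running packet network
```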
End Host: Classify and Buffer Packets
• Classify packets and map into different hardware queues
– Based on the schedule
• Packet path: one hardware queue for all destinations
– Can transmit at any time, but at 10G
• Circuit path: one hardware queue for each destination
– Can only transmit when the particular circuit is connected
• Buffer the packets in end-host memory
[Diagram: host A keeps one packet queue shared by all destinations and one circuit queue per destination B–F]
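The queue mapping can be sketched as follows. The queue numbering and schedule representation are assumptions for illustration; the real mapping lives in NIC hardware queues.

```python
# Sketch of the end-host mapping from destination to hardware queue.
PACKET_QUEUE = 0  # one shared queue for the always-on 10G packet path

def queue_for(dst, circuit_dests):
    """Return the hardware queue for a packet to `dst`.

    Destinations on the circuit schedule each get their own queue
    (paused until their circuit connects); all other destinations
    share the packet-path queue, which may transmit at any time.
    """
    if dst in circuit_dests:
        # queues 1..N, one per circuit destination
        return 1 + sorted(circuit_dests).index(dst)
    return PACKET_QUEUE

circuit_dests = {"B", "C"}
print(queue_for("B", circuit_dests))  # 1 (per-destination circuit queue)
print(queue_for("E", circuit_dests))  # 0 (shared packet queue)
```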
Packet Transmission
• Packet path: rate-limited to 10G
• Circuit path: transmits only when the circuit is connected
• REACToR pulls packets from the circuit queues in real time
• Uses PFC frames to selectively pause and unpause queues
[Diagram: while the circuit connects to host A, the REACToR sends a PFC unpause for queue A on the end host's NIC; when it reconfigures to host B, it pauses A and unpauses B; the packet path keeps running throughout]
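The PFC-gated pull can be modeled as a toy simulation: only the queue that the switch has unpaused may transmit during a slot. Slot sizes, queue contents, and the schedule here are illustrative, not from the prototype.

```python
from collections import deque

def run_tdma(queues, schedule, pkts_per_slot):
    """Drain `queues` ({dst: deque of packets}) following `schedule`
    (a list of destinations, one per TDMA slot). Only the unpaused
    queue transmits in a slot, mimicking PFC pause/unpause from the
    REACToR; the others stay buffered in host memory."""
    sent = []
    for dst in schedule:
        q = queues[dst]
        for _ in range(min(pkts_per_slot, len(q))):
            sent.append((dst, q.popleft()))
    return sent

queues = {"A": deque([1, 2, 3]), "B": deque([4, 5])}
log = run_tdma(queues, schedule=["A", "B", "A"], pkts_per_slot=2)
print(log)  # [('A', 1), ('A', 2), ('B', 4), ('B', 5), ('A', 3)]
```

Note that packets for B wait, untransmitted, until B's slot arrives; there is no buffering inside the circuit switch itself.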
Rate Control
• Problem: the downlink must merge the 100G circuit path and the 10G packet path onto a single 100G link to the end host. What to do?
Rate Control
• Problem: the downlink must merge 100G + 10G onto a 100G link to the end host
• Our approach: rate limit the circuit path at the source to avoid overload
[Diagram: circuit path limited to 90G + packet path at 10G = 100G to the end host]
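The arithmetic behind the 90G cap is simple enough to state as code. A minimal sketch, assuming the source knows the host-link and packet-path rates:

```python
# Source-side rate control for the merged downlink: cap the circuit
# path so that circuit + packet traffic never exceeds the host link.
HOST_LINK_GBPS = 100
PACKET_PATH_GBPS = 10

def circuit_rate_limit(packet_rate_gbps=PACKET_PATH_GBPS):
    """Max rate a source may send on the circuit path toward a host
    whose downlink also carries up to `packet_rate_gbps` of packets."""
    return HOST_LINK_GBPS - packet_rate_gbps

print(circuit_rate_limit())  # 90  -> 90G circuit + 10G packet = 100G
```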
Implementation
[Photo: the REACToR prototype]
10G/1G Prototype
[Diagram: two REACToR prototypes built on Virtex-6 FPGAs, each serving four Linux servers with Intel 82599 10G NICs; the circuit switch is a Mordia OCS and the packet switch a Fulcrum Monaco EPS; a control computer collects demand and distributes the circuit schedule; 10G circuit links, 1G packet links]
Timing Parameters
• End-to-end reconfiguration time: 30 μs
• Schedule reconfigures every 1,500 μs
• Example: 7-flow TDMA gives an 86% duty cycle
[Timeline: seven slots per 1,500 μs schedule, each preceded by a 30 μs reconfiguration]
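The 86% figure follows directly from the two timing parameters above:

```python
# Duty-cycle arithmetic from the slide: a 1,500 us schedule shared
# by 7 TDMA slots, each losing 30 us to reconfiguration.
def duty_cycle(schedule_us=1500, slots=7, reconfig_us=30):
    slot_us = schedule_us / slots            # ~214 us per slot
    return (slot_us - reconfig_us) / slot_us # fraction spent sending

print(round(duty_cycle(), 2))  # 0.86
```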
Evaluation
• Experiment 1: Supporting TCP
– Performance with a stock network stack
• Experiment 2: Reacting to demand changes
– Dynamics of handling changes and mispredictions
• Experiment 3: Demonstrating the benefit of the hybrid
– Performance gain on skewed demand
Experiment 1: Supporting TCP
• Each host receives 7 TCP flows, one from every other host
• Hybrid schedule: data packets via the OCS, ACKs via the EPS
• 7-flow TDMA, fairly sharing the link
• Verify that TCP sustains high throughput
TCP Throughput
[Figure: per-flow TCP throughput across the 7 TDMA slots; the TDMA is invisible to TCP]
Experiment 2: React to Demand Changes
[Diagram: traffic shifts from intra-rack (within Rack 1 and Rack 2) to inter-rack (between the racks)]
Use pktgen to impose a precise, sudden traffic-pattern change, and see whether REACToR reacts in time.
React to Demand Changes
[Figure: per-host throughput in Racks 1 and 2 during a shift from 3-host to 4-host round-robin demand; transmissions briefly fall back to the packet path]
REACToR reacts quickly and robustly to demand changes.
Experiment 3: Demonstrating Hybrid
• Simulated 64 hosts with demand of varying skewness
• Big benefit from a small electrical packet switch
[Figure: two demand distributions across destinations, skewed and very skewed, each on a 100G link]
[Figure detail: one case has a 99G big flow with 20 flows sharing 1G; the other has a 50G big flow with 20 flows sharing 50G; both on 100G links]
Optical Circuit Switching Not Enough
[Figure: link utilization vs. the fraction of bandwidth the big flow uses, from 99% down to 50%; the 100G electrical fat-tree is the baseline; the optical circuit reaches only 85%, falling to 75%, due to the cost of switching and utilization lost to inefficient circuit use; datacenters typically operate in this region]
Hybrid Switching with REACToR
[Figure: same axes; the REACToR hybrid reaches 90% utilization and degrades gradually, staying near 90%: performance akin to the 100G fat-tree]
Conclusion
100G optical circuit switching + 10G electrical packet switching = 100G electrical packet switching, for datacenter workloads, at a lower cost: REACToR
Thank you!
DCTCP: Datacenter Workload
[SIGCOMM 2010]
Cost of Transceivers
• Cost of 10G transceivers
– Cost: $500 per pair
– Power: 1 W per pair
– (100G costs even more)
• 3-level fat-tree: 27.6k hosts
• Transceivers per host:
Scheduling
• Problem: matrix decomposition
– Similar to Birkhoff–von Neumann (BvN) decomposition, but must account for the reconfiguration penalty
– NP-complete
– Goal: schedule all the big flows (90% of the demand)
• Greedy approaches (e.g., iSLIP) are suboptimal
• Naïve BvN gets fragmented by small elements and residuals
• A good algorithm should:
– Prioritize the big flows
– Perform a full matrix decomposition (like BvN)
– While minimizing the number of reconfigurations
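The problem shape can be illustrated with a greedy BvN-style sketch: repeatedly pick a matching that favors the largest remaining entries, run it for as long as its smallest entry allows, and stop before small residuals fragment the schedule. This is an illustration of the trade-off, not the paper's algorithm, and the demand matrix is made up.

```python
def greedy_decompose(demand, min_weight=1):
    """demand: square list-of-lists (src x dst, in Gbps).
    Returns a list of (matching, weight) pairs, matching = {src: dst}.
    Stops once the best remaining matching is below `min_weight`,
    bounding the number of reconfigurations at the cost of leaving
    small residual demand for the packet path."""
    n = len(demand)
    d = [row[:] for row in demand]  # work on a copy
    schedule = []
    while True:
        # Greedy matching: rows in order of their largest entry each
        # grab their biggest still-free column (suboptimal, like iSLIP).
        rows = sorted(range(n), key=lambda r: -max(d[r]))
        used, match = set(), {}
        for r in rows:
            for c in sorted(range(n), key=lambda c: -d[r][c]):
                if c not in used and d[r][c] > 0:
                    used.add(c)
                    match[r] = c
                    break
        if not match:
            break
        weight = min(d[r][c] for r, c in match.items())
        if weight < min_weight:
            break  # residuals go to the packet path instead
        for r, c in match.items():
            d[r][c] -= weight
        schedule.append((match, weight))
    return schedule

demand = [[0, 80, 10],
          [10, 0, 80],
          [80, 10, 0]]
print(greedy_decompose(demand, min_weight=20))
# One 80G matching covers the big flows; the 10G residuals are left
# for the packet path rather than costing extra reconfigurations.
```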