What are Data Centers?
•Large facilities with 10s of thousands of networked servers
• Compute, storage, and networking working in concert
• “Warehouse-Scale Computers”
Types of Data Centers
•Specialized data centers built for one big app
• Social networking: Facebook
• Web Search: Google, Bing
•“Cloud” data centers
• Amazon EC2, Windows Azure
• Google App Engine
Cloud Computing
•On-demand
• Use resources when you need them; pay-as-you-go
•Elastic
• Scale up & down based on demand
•Multi-tenancy
• Multiple independent users share infrastructure
• Security and resource isolation
• SLAs on performance & reliability (sometimes)
•Dynamic Management
• Resiliency: isolate failure of servers and storage
• Workload movement: move work to other locations
Data Centers with 100,000+ Servers
Photos: Microsoft, Google, and Facebook data center facilities.
These things are really big
100 billion searches per month
120+ million users
1.15 billion users
10-100K servers
100s of Petabytes of storage
100s of Terabits/s of bandwidth (more than the core of the Internet)
10–100 MW of power (1–2% of global energy consumption)
100s of millions of dollars
Google Datacenter Infrastructure
Datacenter Traffic Growth
Figure: datacenter network traffic growth over time; demand for DCN bandwidth has grown much faster than traditional designs could supply.
Source: “Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network”, SIGCOMM 2015.
What’s Different about DCNs?
Figure: servers interconnected by a datacenter fabric, which connects to the Internet.
•Single administrative domain
•No need to be compatible with outside world
•Tiny round trip times (microseconds)
•Latency/tail latency critical
•Massive multipath topologies
•Shallow buffers
•Backplane for large-scale parallel computation
Example: Web Search
Figure: a top-level aggregator (TLA) fans each query out to mid-level aggregators (MLAs), which fan out to many worker nodes.
Partition/Aggregate App Structure
Figure: the query “Picasso” is partitioned across workers, each returning matching quotes (e.g., “Art is a lie that makes us realize the truth”, “The chief enemy of creativity is good sense”), and the partial results are aggregated back up the tree; an illustrative sketch follows below.
• Strict deadlines, shrinking down the tree (e.g., 250 ms end-to-end, 50 ms at mid-level aggregators, 10 ms at workers)
• Tail latency matters
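To make the partition/aggregate pattern concrete, here is a minimal scatter-gather sketch (all names, worker counts, and timings are hypothetical, not from the lecture): the aggregator fans a query out to workers and returns whatever answers arrive before its deadline, dropping stragglers.

```python
# Minimal partition/aggregate sketch: fan a query out to workers and merge
# only the answers that arrive before the aggregator's deadline.
import concurrent.futures as cf
import random
import time

def worker_query(worker_id: int, query: str) -> list[str]:
    """Hypothetical leaf worker: returns its matches after some service time."""
    time.sleep(random.uniform(0.001, 0.020))   # 1-20 ms; the tail misses the deadline
    return [f"result {worker_id}.{i} for {query!r}" for i in range(2)]

def aggregate(query: str, num_workers: int = 40, deadline_s: float = 0.010) -> list[str]:
    """Fan out to workers and keep whatever returns within the 10 ms deadline."""
    pool = cf.ThreadPoolExecutor(max_workers=num_workers)
    futures = [pool.submit(worker_query, w, query) for w in range(num_workers)]
    done, _not_done = cf.wait(futures, timeout=deadline_s)
    results: list[str] = []
    for fut in done:
        results.extend(fut.result())
    # Stragglers are abandoned: answer quality degrades a little, but the deadline holds.
    pool.shutdown(wait=False, cancel_futures=True)
    return results

if __name__ == "__main__":
    print(len(aggregate("Picasso")), "results returned before the deadline")
```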
Data Center Challenges
• Massive bisection bandwidth
• Topologies
• Load balancing
• Optics
• Ultra-Low latency (<10 microseconds)
• Scheduling
• Centralized or distributed control?
• Managing resources across network & servers
• Multi-tenant performance isolation
• App-aware network scheduling (e.g. for big data)
• Next-generation hardware
• RDMA, Rack-Scale Computing
Data Center Costs

Amortized Cost*   Component              Sub-Components
~45%              Servers                CPU, memory, disk
~25%              Power infrastructure   UPS, cooling, power distribution
~15%              Power draw             Electrical utility costs
~15%              Network                Switches, links, transit

*3-yr amortization for servers, 15-yr for infrastructure; 5% cost of money
Source: “The Cost of a Cloud: Research Problems in Data Center Networks”, SIGCOMM CCR 2009. Greenberg, Hamilton, Maltz, Patel.
Server Costs
Ugly secret: 30% utilization considered “good” in data centers
• Uneven application fit
• Each server has CPU, memory, disk: most applications exhaust one resource, stranding the others
• Long provisioning timescales
• New servers purchased quarterly at best
• Uncertainty in demand
• Demand for a new service can spike quickly
• Risk management
• Not having spare servers to meet demand brings failure just when success is at hand
• Session state and storage constraints
• If the world were stateless servers, life would be good
Goal: Agility – Any service, Any Server
•Turn the servers into a single large fungible pool
• Dynamically expand and contract service footprint as needed
•Benefits
• Increase service developer productivity
• Lower cost
• Achieve high performance and reliability
The 3 motivators of most infrastructure projects
Achieving Agility
•Workload management
• Means for rapidly installing a service’s code on a server
• Virtual machines, disk images, containers
•Storage Management
• Means for a server to access persistent data
• Distributed filesystems (e.g., HDFS, blob stores)
•Network
• Means for communicating with other servers, regardless of where they are in the data center
More Software…
Figure: a conventional router/switch is a closed box combining specialized features, a specialized control plane, and specialized hardware:
• Tens of millions of lines of code; closed, proprietary, outdated
• Hundreds of protocols, 6,500 RFCs
• Billions of gates; power hungry and bloated
Software Defined Networking
Figure: the control plane is separated from the packet-forwarding devices; control programs run against a global network map in a logically centralized control plane, while the switches themselves only do packet forwarding.
Conventional DC Network
Reference: “Data Center: Load Balancing Data Center Services”, Cisco, 2004.
Figure: Internet → Core Routers (CR) → Access Routers (AR) → Ethernet switches (S) → racks of application servers (A); layer 3 (CR, AR) above, layer 2 (S) below.
Key:
• CR = Core Router (L3)
• AR = Access Router (L3)
• S = Ethernet Switch (L2)
• A = Rack of app servers
• ~1,000 servers/pod == IP subnet
Layer 2 vs. Layer 3
•Ethernet switching (layer 2)
✓Fixed addresses and auto-configuration (plug & play)
✓Seamless mobility, migration, and failover
x Broadcast limits scale (ARP)
x Spanning Tree Protocol
•IP routing (layer 3)
✓Scalability through hierarchical addressing
✓Multipath routing through equal-cost multipath
x More complex configuration
x Can’t migrate w/o changing IP address
Conventional DC Network Problems
Figure: the same tree topology, heavily oversubscribed; typical oversubscription ratios grow from ~5:1 near the servers to ~40:1 and ~200:1 higher in the tree.
• Dependence on high-cost proprietary routers
• Extremely limited server-to-server capacity
Some DC network designs…
Fat-tree [SIGCOMM’08]
Jellyfish (random) [NSDI’12]
BCube [SIGCOMM’09]
Transport inside the DC
Figure: servers attached to the datacenter fabric, which connects out to the Internet.
• Internet paths: 100 Kbps–100 Mbps links, ~100 ms latency
• Fabric paths: 10–40 Gbps links, ~10–100 μs latency
• The fabric is the interconnect for distributed compute workloads: web apps, databases, map-reduce, HPC, caches, monitoring
What’s Different About DC Transport?
• Network characteristics
  • Very high link speeds (Gb/s); very low latency (microseconds)
• Application characteristics
  • Large-scale distributed computation
• Challenging traffic patterns
  • Diverse mix of mice & elephants
  • Incast
• Cheap switches
  • Single-chip shared-memory devices; shallow buffers
DC Traffic Characteristics
• Instrumented a large cluster used for data mining and identified distinctive traffic patterns
• Traffic patterns are highly volatile
• A large number of distinctive patterns even in a day
• Traffic patterns are unpredictable
• Correlation between patterns very weak
Data Center Workloads: Mice & Elephants
• Short messages (e.g., query, coordination) → need Low Latency
• Large flows (e.g., data update, backup) → need High Throughput
Comments: 1,500-node cluster in a data center that supports data mining on petabytes of data. The servers are distributed roughly evenly across 75 ToR switches, which are connected hierarchically.
Figure: Mice are numerous; 99% of flows are smaller than 1 MB. However, more than 90% of bytes are in flows between 100 MB and 1 GB.
Incast
• Synchronized fan-in congestion
Figure: an aggregator gathers responses from Workers 1–4; their synchronized responses overflow the switch buffer, and a TCP timeout (RTOmin = 300 ms) stalls the whole request.
Vasudevan et al. (SIGCOMM’09)
DC Transport Requirements
1. Low Latency
– Short messages, queries
2. High Throughput
– Continuous data updates, backups
3. High Burst Tolerance
– Incast
The challenge is to achieve these together
High Throughput vs. Low Latency
Baseline fabric latency (propagation + switching): ~10 microseconds
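To see why queueing dominates the latency budget (a back-of-the-envelope with assumed numbers, not from the lecture): at 10 Gbps a 1,500-byte packet serializes in about 1.2 μs, so a standing queue of only 100 packets adds roughly 120 μs of delay, an order of magnitude more than the ~10 μs baseline.

```python
# Back-of-the-envelope: queueing delay vs. baseline fabric latency.
LINK_RATE_BPS = 10e9    # 10 Gbps link (assumed for illustration)
PKT_BYTES = 1500        # full-size Ethernet frame
BASELINE_US = 10        # propagation + switching, from the slide

serialization_us = PKT_BYTES * 8 / LINK_RATE_BPS * 1e6   # ~1.2 us per packet
for queued_pkts in (10, 100, 1000):
    queueing_us = queued_pkts * serialization_us
    print(f"{queued_pkts:4d} queued packets -> {queueing_us:7.1f} us "
          f"({queueing_us / BASELINE_US:.0f}x the baseline)")
```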
TCP in the Data Center
• TCP [Jacobson et al. ’88] is widely used in the data center
  • More than 99% of the traffic
• Operators work around TCP problems
  ‒ Ad hoc, inefficient, often expensive solutions
  ‒ TCP is deeply ingrained in applications
Practical deployment is hard → keep it simple!
Review: The TCP Algorithm
Figure: two senders share a bottleneck link to a receiver; each sender’s window size (rate) traces a sawtooth over time.
• Additive Increase: W → W+1 per round-trip time
• Multiplicative Decrease: W → W/2 per drop or ECN mark
• ECN = Explicit Congestion Notification (1-bit mark)
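A toy rendering of the AIMD rules above (a sketch only; real TCP tracks the window in bytes and also handles slow start, timeouts, and much more):

```python
# Toy AIMD congestion window, following the slide's rules:
# +1 per RTT without congestion, halved on a drop or ECN mark.
def aimd_step(cwnd: float, congestion_seen: bool) -> float:
    if congestion_seen:
        return max(1.0, cwnd / 2)   # multiplicative decrease
    return cwnd + 1.0               # additive increase

# Example: the classic sawtooth, with congestion signaled every 10th RTT.
cwnd = 10.0
for rtt in range(30):
    cwnd = aimd_step(cwnd, congestion_seen=(rtt % 10 == 9))
    print(f"RTT {rtt:2d}: cwnd = {cwnd:5.1f}")
```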
TCP Buffer Requirement
• Bandwidth-delay product rule of thumb:
  • A single flow needs B = C×RTT of buffering for 100% throughput
Figure: throughput vs. buffer size; throughput reaches 100% when B ≥ C×RTT and falls below 100% when B < C×RTT.
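As a concrete instance (link speed and RTTs are assumed for illustration): a 10 Gbps link with a 100 μs datacenter RTT has C×RTT ≈ 125 KB, whereas the same link with a 100 ms wide-area RTT would need ~125 MB of buffer for one flow.

```python
# Bandwidth-delay product: buffer needed by a single TCP flow for full throughput.
def bdp_bytes(link_bps: float, rtt_s: float) -> float:
    return link_bps * rtt_s / 8

print(bdp_bytes(10e9, 100e-6))   # DC-like RTT:  10 Gbps x 100 us -> 125,000 bytes (~125 KB)
print(bdp_bytes(10e9, 100e-3))   # WAN-like RTT: 10 Gbps x 100 ms -> 125,000,000 bytes (~125 MB)
```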
Reducing Buffer Requirements
• Appenzeller et al. (SIGCOMM ‘04): with a large number N of flows, B = C×RTT/√N is enough
Figure: throughput vs. buffer size with many flows; the aggregate rate has low variance, so throughput stays near 100% with a much smaller buffer.
• Can’t rely on this statistical-multiplexing benefit in the DC
  • Measurements show typically only 1–2 large flows at each server
Key Observation: Low variance in sending rate → Small buffers suffice
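For intuition (illustrative numbers, not from the lecture): the √N rule only pays off when N is large, which is exactly what datacenter servers lack.

```python
import math

# Buffer needed under the Appenzeller et al. rule, B = C*RTT/sqrt(N).
# Link speed and RTT are assumed here purely for illustration.
def buffer_bytes(link_bps: float, rtt_s: float, n_flows: int) -> float:
    return link_bps * rtt_s / 8 / math.sqrt(n_flows)

for n in (1, 2, 10_000):
    print(f"N = {n:6d}: {buffer_bytes(10e9, 100e-6, n):12,.0f} bytes")
# N = 1 or 2 (typical per DC server) gives essentially no reduction;
# N = 10,000 (a backbone router) cuts the buffer by 100x.
```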
DCTCP: Main Idea
➢ Extract multi-bit feedback from the single-bit stream of ECN marks
• Reduce window size based on the fraction of marked packets

ECN Marks              TCP                  DCTCP
1 0 1 1 1 1 0 1 1 1    Cut window by 50%    Cut window by 40%
0 0 0 0 0 0 0 0 0 1    Cut window by 50%    Cut window by 5%

Figure: window size (bytes) vs. time for TCP and DCTCP; TCP’s sawtooth swings widely, while DCTCP makes small, smooth adjustments.
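The cut sizes in the table follow directly from cutting by half the marked fraction (formalized on the next slide); a quick check using the mark streams from the table:

```python
# DCTCP cuts the window by alpha/2, where alpha is the fraction of marked packets;
# standard TCP always cuts by 1/2 regardless of how many packets were marked.
for marks in ("1011110111", "0000000001"):   # the two rows of the table above
    alpha = marks.count("1") / len(marks)
    print(f"marks = {marks}: TCP cuts 50%, DCTCP cuts {alpha / 2:.0%}")
# -> 40% and 5%, matching the table.
```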
DCTCP: Algorithm
Switch side:
• Mark packets (set ECN) when Queue Length > K.
Figure: buffer of size B with marking threshold K; packets are marked above K, not marked below.
Sender side:
– Each RTT, compute the fraction of marked ACKs: F = (# of marked ACKs) / (total # of ACKs)
– Maintain a running average of the marked fraction (α): α ← (1 − g)·α + g·F
➢ Adaptive window decrease: W ← (1 − α/2)·W
– Note: the effective decrease factor lies between 1 (no marks) and 2 (all packets marked).
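A compact sketch of the sender-side update (illustrative only, not the real kernel implementation; the gain g = 1/16 is an assumed value in the spirit of the DCTCP paper):

```python
# DCTCP sender-side state, updated once per RTT (window counted in packets for simplicity).
class DctcpSender:
    def __init__(self, cwnd: float = 10.0, g: float = 1 / 16):
        self.cwnd = cwnd
        self.alpha = 0.0   # running estimate of the fraction of marked packets
        self.g = g         # EWMA gain

    def on_rtt_end(self, marked_acks: int, total_acks: int) -> None:
        if total_acks == 0:
            return
        f = marked_acks / total_acks                      # F: fraction marked this RTT
        self.alpha = (1 - self.g) * self.alpha + self.g * f
        if marked_acks > 0:
            # Gentle, proportional decrease: full halving only when alpha -> 1.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            self.cwnd += 1.0                              # additive increase, as in TCP
```

With only occasional marks α stays small and the window barely shrinks; under persistent congestion α approaches 1 and the decrease converges to TCP-style halving.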
DCTCP vs TCP
Experiment: 2 flows (Windows 7 stack), Broadcom 1 Gbps switch, ECN marking threshold = 30 KB.
Figure: switch queue length over time for the 2 flows; TCP repeatedly fills the queue (hundreds of packets), while with DCTCP the buffer stays mostly empty.
• DCTCP mitigates incast by creating a large buffer headroom
Why it Works
1. Low Latency
   ✓ Small buffer occupancies → low queueing delay
2. High Throughput
   ✓ ECN averaging → smooth rate adjustments, low variance
3. High Burst Tolerance
   ✓ Large buffer headroom → bursts fit
   ✓ Aggressive marking → sources react before packets are dropped