What are Data Centers?
•Large facilities with 10s of thousands of networked servers
• Compute, storage, and networking working in concert
• “Warehouse-Scale Computers”
Types of Data Centers
•Specialized data centers built for one big app
• Social networking: Facebook
• Web Search: Google, Bing
•“Cloud” data centers
• Amazon EC2, Windows Azure
• Google App Engine
Cloud Computing
•On-demand
• Use resources when you need them; pay-as-you-go
•Elastic
• Scale up & down based on demand
•Multi-tenancy
• Multiple independent users share infrastructure
• Security and resource isolation
• SLAs on performance & reliability (sometimes)
•Dynamic Management
• Resiliency: isolate failure of servers and storage
• Workload movement: move work to other locations
Data Centers with 100,000+ Servers
Photos: Microsoft, Google, and Facebook data center facilities.
These things are really big
100 billion searches per month
120+ million users
1.15 billion users
10-100K servers
100s of Petabytes of storage
100s of Terabits/s of bandwidth (more than the core of the Internet)
10–100 MW of power (1–2% of global energy consumption)
100s of millions of dollars
Google Datacenter Infrastructure
Datacenter Traffic Growth
Figure: datacenter network traffic growth over time; demand for DCN bandwidth has grown much faster than traditional designs could supply.
Source: “Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network”, SIGCOMM 2015.
What’s Different about DCNs?
Figure: servers interconnected by a datacenter fabric, which connects to the Internet.
•Single administrative domain
•No need to be compatible with outside world
•Tiny round trip times (microseconds)
•Latency/tail latency critical
•Massive multipath topologies
•Shallow buffers
•Backplane for large-scale parallel computation
Example: Web Search
Figure: a top-level aggregator (TLA) fans each query out to mid-level aggregators (MLAs), which fan out to many worker nodes.
Partition/Aggregate App Structure
Figure: the query “Picasso” is partitioned across workers, each returning matching quotes (e.g., “Art is a lie that makes us realize the truth”, “The chief enemy of creativity is good sense”), and the partial results are aggregated back up the tree; an illustrative sketch follows below.
• Strict deadlines, shrinking down the tree (e.g., 250 ms end-to-end, 50 ms at mid-level aggregators, 10 ms at workers)
• Tail latency matters
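To make the partition/aggregate pattern concrete, here is a minimal scatter-gather sketch (all names, worker counts, and timings are hypothetical, not from the lecture): the aggregator fans a query out to workers and returns whatever answers arrive before its deadline, dropping stragglers.

```python
# Minimal partition/aggregate sketch: fan a query out to workers and merge
# only the answers that arrive before the aggregator's deadline.
import concurrent.futures as cf
import random
import time

def worker_query(worker_id: int, query: str) -> list[str]:
    """Hypothetical leaf worker: returns its matches after some service time."""
    time.sleep(random.uniform(0.001, 0.020))   # 1-20 ms; the tail misses the deadline
    return [f"result {worker_id}.{i} for {query!r}" for i in range(2)]

def aggregate(query: str, num_workers: int = 40, deadline_s: float = 0.010) -> list[str]:
    """Fan out to workers and keep whatever returns within the 10 ms deadline."""
    pool = cf.ThreadPoolExecutor(max_workers=num_workers)
    futures = [pool.submit(worker_query, w, query) for w in range(num_workers)]
    done, _not_done = cf.wait(futures, timeout=deadline_s)
    results: list[str] = []
    for fut in done:
        results.extend(fut.result())
    # Stragglers are abandoned: answer quality degrades a little, but the deadline holds.
    pool.shutdown(wait=False, cancel_futures=True)
    return results

if __name__ == "__main__":
    print(len(aggregate("Picasso")), "results returned before the deadline")
```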
Data Center Challenges
• Massive bisection bandwidth
• Topologies
• Load balancing
• Optics
• Ultra-Low latency (<10 microseconds)
• Scheduling
• Centralized or distributed control?
• Managing resources across network & servers
• Multi-tenant performance isolation
• App-aware network scheduling (e.g. for big data)
• Next-generation hardware
• RDMA, Rack-Scale Computing
Data Center Costs

Amortized Cost*   Component              Sub-Components
~45%              Servers                CPU, memory, disk
~25%              Power infrastructure   UPS, cooling, power distribution
~15%              Power draw             Electrical utility costs
~15%              Network                Switches, links, transit

*3-yr amortization for servers, 15-yr for infrastructure; 5% cost of money
Source: “The Cost of a Cloud: Research Problems in Data Center Networks”, SIGCOMM CCR 2009. Greenberg, Hamilton, Maltz, Patel.
Server Costs
Ugly secret: 30% utilization considered “good” in data centers
• Uneven application fit
• Each server has CPU, memory, disk: most applications exhaust one resource, stranding the others
• Long provisioning timescales
• New servers purchased quarterly at best
• Uncertainty in demand
• Demand for a new service can spike quickly
• Risk management
• Not having spare servers to meet demand brings failure just when success is at hand
• Session state and storage constraints
• If the world were stateless servers, life would be good
Goal: Agility – Any service, Any Server
•Turn the servers into a single large fungible pool
• Dynamically expand and contract service footprint as needed
•Benefits
• Increase service developer productivity
• Lower cost
• Achieve high performance and reliability
The 3 motivators of most infrastructure projects
Achieving Agility
•Workload management
• Means for rapidly installing a service’s code on a server
• Virtual machines, disk images, containers
•Storage Management
• Means for a server to access persistent data
• Distributed filesystems (e.g., HDFS, blob stores)
•Network
• Means for communicating with other servers, regardless of where they are in the data center
More Software…
Figure: a conventional router/switch is a closed box combining specialized features, a specialized control plane, and specialized hardware:
• Tens of millions of lines of code; closed, proprietary, outdated
• Hundreds of protocols, 6,500 RFCs
• Billions of gates; power hungry and bloated
Software Defined Networking
Figure: the control plane is separated from the packet-forwarding devices; control programs run against a global network map in a logically centralized control plane, while the switches themselves only do packet forwarding.
Conventional DC Network
Reference: “Data Center: Load Balancing Data Center Services”, Cisco, 2004.
Figure: Internet → Core Routers (CR) → Access Routers (AR) → Ethernet switches (S) → racks of application servers (A); layer 3 (CR, AR) above, layer 2 (S) below.
Key:
• CR = Core Router (L3)
• AR = Access Router (L3)
• S = Ethernet Switch (L2)
• A = Rack of app servers
• ~1,000 servers/pod == IP subnet
Layer 2 vs. Layer 3
•Ethernet switching (layer 2)
✓Fixed addresses and auto-configuration (plug & play)
✓Seamless mobility, migration, and failover
x Broadcast limits scale (ARP)
x Spanning Tree Protocol
•IP routing (layer 3)
✓Scalability through hierarchical addressing
✓Multipath routing through equal-cost multipath
x More complex configuration
x Can’t migrate w/o changing IP address
Conventional DC Network Problems
Figure: the same tree topology, heavily oversubscribed; typical oversubscription ratios grow from ~5:1 near the servers to ~40:1 and ~200:1 higher in the tree.
• Dependence on high-cost proprietary routers
• Extremely limited server-to-server capacity
Some DC network designs…
Fat-tree [SIGCOMM’08]
Jellyfish (random) [NSDI’12]
BCube [SIGCOMM’09]
Transport inside the DC
Figure: servers attached to the datacenter fabric, which connects out to the Internet.
• Internet paths: 100 Kbps–100 Mbps links, ~100 ms latency
• Fabric paths: 10–40 Gbps links, ~10–100 μs latency
• The fabric is the interconnect for distributed compute workloads: web apps, databases, map-reduce, HPC, caches, monitoring
What’s Different About DC Transport?
• Network characteristics
  • Very high link speeds (Gb/s); very low latency (microseconds)
• Application characteristics
  • Large-scale distributed computation
• Challenging traffic patterns
  • Diverse mix of mice & elephants
  • Incast
• Cheap switches
  • Single-chip shared-memory devices; shallow buffers
DC Traffic Characteristics
• Instrumented a large cluster used for data mining and identified distinctive traffic patterns
• Traffic patterns are highly volatile
• A large number of distinctive patterns even in a day
• Traffic patterns are unpredictable
• Correlation between patterns very weak
Data Center Workloads: Mice & Elephants
• Short messages (e.g., query, coordination) → need Low Latency
• Large flows (e.g., data update, backup) → need High Throughput
Comments: 1,500-node cluster in a data center that supports data mining on petabytes of data. The servers are distributed roughly evenly across 75 ToR switches, which are connected hierarchically.
Figure: Mice are numerous; 99% of flows are smaller than 1 MB. However, more than 90% of bytes are in flows between 100 MB and 1 GB.
Incast
• Synchronized fan-in congestion
Figure: an aggregator gathers responses from Workers 1–4; their synchronized responses overflow the switch buffer, and a TCP timeout (RTOmin = 300 ms) stalls the whole request.
Vasudevan et al. (SIGCOMM’09)
DC Transport Requirements
1. Low Latency
– Short messages, queries
2. High Throughput
– Continuous data updates, backups
3. High Burst Tolerance
– Incast
The challenge is to achieve these together
High Throughput vs. Low Latency
Baseline fabric latency (propagation + switching): ~10 microseconds
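To see why queueing dominates the latency budget (a back-of-the-envelope with assumed numbers, not from the lecture): at 10 Gbps a 1,500-byte packet serializes in about 1.2 μs, so a standing queue of only 100 packets adds roughly 120 μs of delay, an order of magnitude more than the ~10 μs baseline.

```python
# Back-of-the-envelope: queueing delay vs. baseline fabric latency.
LINK_RATE_BPS = 10e9    # 10 Gbps link (assumed for illustration)
PKT_BYTES = 1500        # full-size Ethernet frame
BASELINE_US = 10        # propagation + switching, from the slide

serialization_us = PKT_BYTES * 8 / LINK_RATE_BPS * 1e6   # ~1.2 us per packet
for queued_pkts in (10, 100, 1000):
    queueing_us = queued_pkts * serialization_us
    print(f"{queued_pkts:4d} queued packets -> {queueing_us:7.1f} us "
          f"({queueing_us / BASELINE_US:.0f}x the baseline)")
```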
TCP in the Data Center
• TCP [Jacobson et al. ’88] is widely used in the data center
  • More than 99% of the traffic
• Operators work around TCP problems
  ‒ Ad hoc, inefficient, often expensive solutions
  ‒ TCP is deeply ingrained in applications
Practical deployment is hard → keep it simple!
Review: The TCP Algorithm
Figure: two senders share a bottleneck link to a receiver; each sender’s window size (rate) traces a sawtooth over time.
• Additive Increase: W → W+1 per round-trip time
• Multiplicative Decrease: W → W/2 per drop or ECN mark
• ECN = Explicit Congestion Notification (1-bit mark)
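A toy rendering of the AIMD rules above (a sketch only; real TCP tracks the window in bytes and also handles slow start, timeouts, and much more):

```python
# Toy AIMD congestion window, following the slide's rules:
# +1 per RTT without congestion, halved on a drop or ECN mark.
def aimd_step(cwnd: float, congestion_seen: bool) -> float:
    if congestion_seen:
        return max(1.0, cwnd / 2)   # multiplicative decrease
    return cwnd + 1.0               # additive increase

# Example: the classic sawtooth, with congestion signaled every 10th RTT.
cwnd = 10.0
for rtt in range(30):
    cwnd = aimd_step(cwnd, congestion_seen=(rtt % 10 == 9))
    print(f"RTT {rtt:2d}: cwnd = {cwnd:5.1f}")
```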
TCP Buffer Requirement
• Bandwidth-delay product rule of thumb:
  • A single flow needs B = C×RTT of buffering for 100% throughput
Figure: throughput vs. buffer size; throughput reaches 100% when B ≥ C×RTT and falls below 100% when B < C×RTT.
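As a concrete instance (link speed and RTTs are assumed for illustration): a 10 Gbps link with a 100 μs datacenter RTT has C×RTT ≈ 125 KB, whereas the same link with a 100 ms wide-area RTT would need ~125 MB of buffer for one flow.

```python
# Bandwidth-delay product: buffer needed by a single TCP flow for full throughput.
def bdp_bytes(link_bps: float, rtt_s: float) -> float:
    return link_bps * rtt_s / 8

print(bdp_bytes(10e9, 100e-6))   # DC-like RTT:  10 Gbps x 100 us -> 125,000 bytes (~125 KB)
print(bdp_bytes(10e9, 100e-3))   # WAN-like RTT: 10 Gbps x 100 ms -> 125,000,000 bytes (~125 MB)
```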
Reducing Buffer Requirements
• Appenzeller et al. (SIGCOMM ‘04): with a large number N of flows, B = C×RTT/√N is enough
Figure: throughput vs. buffer size with many flows; the aggregate rate has low variance, so throughput stays near 100% with a much smaller buffer.
• Can’t rely on this statistical-multiplexing benefit in the DC
  • Measurements show typically only 1–2 large flows at each server
Key Observation: Low variance in sending rate → Small buffers suffice
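For intuition (illustrative numbers, not from the lecture): the √N rule only pays off when N is large, which is exactly what datacenter servers lack.

```python
import math

# Buffer needed under the Appenzeller et al. rule, B = C*RTT/sqrt(N).
# Link speed and RTT are assumed here purely for illustration.
def buffer_bytes(link_bps: float, rtt_s: float, n_flows: int) -> float:
    return link_bps * rtt_s / 8 / math.sqrt(n_flows)

for n in (1, 2, 10_000):
    print(f"N = {n:6d}: {buffer_bytes(10e9, 100e-6, n):12,.0f} bytes")
# N = 1 or 2 (typical per DC server) gives essentially no reduction;
# N = 10,000 (a backbone router) cuts the buffer by 100x.
```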
DCTCP: Main Idea
➢ Extract multi-bit feedback from the single-bit stream of ECN marks
• Reduce window size based on the fraction of marked packets

ECN Marks              TCP                  DCTCP
1 0 1 1 1 1 0 1 1 1    Cut window by 50%    Cut window by 40%
0 0 0 0 0 0 0 0 0 1    Cut window by 50%    Cut window by 5%

Figure: window size (bytes) vs. time for TCP and DCTCP; TCP’s sawtooth swings widely, while DCTCP makes small, smooth adjustments.
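The cut sizes in the table follow directly from cutting by half the marked fraction (formalized on the next slide); a quick check using the mark streams from the table:

```python
# DCTCP cuts the window by alpha/2, where alpha is the fraction of marked packets;
# standard TCP always cuts by 1/2 regardless of how many packets were marked.
for marks in ("1011110111", "0000000001"):   # the two rows of the table above
    alpha = marks.count("1") / len(marks)
    print(f"marks = {marks}: TCP cuts 50%, DCTCP cuts {alpha / 2:.0%}")
# -> 40% and 5%, matching the table.
```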
DCTCP: Algorithm
Switch side:
• Mark packets (set ECN) when Queue Length > K.
Figure: buffer of size B with marking threshold K; packets are marked above K, not marked below.
Sender side:
– Each RTT, compute the fraction of marked ACKs: F = (# of marked ACKs) / (total # of ACKs)
– Maintain a running average of the marked fraction (α): α ← (1 − g)·α + g·F
➢ Adaptive window decrease: W ← (1 − α/2)·W
– Note: the effective decrease factor lies between 1 (no marks) and 2 (all packets marked).
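A compact sketch of the sender-side update (illustrative only, not the real kernel implementation; the gain g = 1/16 is an assumed value in the spirit of the DCTCP paper):

```python
# DCTCP sender-side state, updated once per RTT (window counted in packets for simplicity).
class DctcpSender:
    def __init__(self, cwnd: float = 10.0, g: float = 1 / 16):
        self.cwnd = cwnd
        self.alpha = 0.0   # running estimate of the fraction of marked packets
        self.g = g         # EWMA gain

    def on_rtt_end(self, marked_acks: int, total_acks: int) -> None:
        if total_acks == 0:
            return
        f = marked_acks / total_acks                      # F: fraction marked this RTT
        self.alpha = (1 - self.g) * self.alpha + self.g * f
        if marked_acks > 0:
            # Gentle, proportional decrease: full halving only when alpha -> 1.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            self.cwnd += 1.0                              # additive increase, as in TCP
```

With only occasional marks α stays small and the window barely shrinks; under persistent congestion α approaches 1 and the decrease converges to TCP-style halving.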
DCTCP vs TCP
Experiment: 2 flows (Windows 7 stack), Broadcom 1 Gbps switch, ECN marking threshold = 30 KB.
Figure: switch queue length over time for the 2 flows; TCP repeatedly fills the queue (hundreds of packets), while with DCTCP the buffer stays mostly empty.
• DCTCP mitigates incast by creating a large buffer headroom
Why it Works
1. Low Latency
   ✓ Small buffer occupancies → low queueing delay
2. High Throughput
   ✓ ECN averaging → smooth rate adjustments, low variance
3. High Burst Tolerance
   ✓ Large buffer headroom → bursts fit
   ✓ Aggressive marking → sources react before packets are dropped