
Page 1: Spanning Tree and Datacenters

Spanning Tree and Datacenters

EE 122, Fall 2013
Sylvia Ratnasamy

http://inst.eecs.berkeley.edu/~ee122/

Material thanks to Mike Freedman, Scott Shenker, Ion Stoica, Jennifer Rexford, and many other colleagues

Page 2: Spanning Tree and Datacenters

Last time: Self-Learning in Ethernet

Approach:

• Flood the first packet to the node you are trying to reach

• Avoid loops by restricting flooding to a spanning tree

• Flooding allows the packet to reach its destination

• And in the process, switches learn how to reach the source of the flood (see the sketch below)
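A minimal sketch of this self-learning behavior, in Python (a simplified model with hypothetical names, not real switch firmware):

# Sketch of a self-learning switch: learn the source MAC on every arriving
# frame, then forward on the learned port or flood along the spanning tree.

class LearningSwitch:
    def __init__(self, spanning_tree_ports):
        self.table = {}                      # MAC address -> output port
        self.st_ports = spanning_tree_ports  # this switch's ports on the tree

    def on_frame(self, src_mac, dst_mac, in_port):
        self.table[src_mac] = in_port        # learn: src reachable via in_port
        if dst_mac in self.table:
            return [self.table[dst_mac]]     # destination known: unicast
        # Destination unknown: flood on all spanning-tree ports except in_port.
        return [p for p in self.st_ports if p != in_port]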

Page 3: Spanning Tree and Datacenters

Missing Piece: Building the Spanning Tree

• Distributed
• Self-configuring
• No global information
• Must adapt when failures occur

Page 4: Spanning Tree and Datacenters

A convenient spanning tree

• Shortest paths to (or from) a node form a tree
  – Covers all nodes in the graph
  – No shortest path can have a cycle (removing the cycle would give a strictly shorter path)

Page 5: Spanning Tree and Datacenters

Algorithm Has Two Aspects

• Pick a root:
  – Pick the one with the smallest identifier (MAC addr.)
  – This will be the destination to which shortest paths go

• Compute shortest paths to the root
  – Only keep the links on shortest paths
  – Break ties by picking the lowest neighbor switch addr.

• Ethernet’s spanning tree construction does both with a single algorithm

Page 6: Spanning Tree and Datacenters

Constructing a Spanning Tree

• Messages (Y, d, X)
  – From node X
  – Proposing Y as the root
  – And advertising a distance d to Y

• Switches elect the node with the smallest identifier (MAC address) as root
  – the Y in the messages

• Each switch determines if a link is on the shortest path from the root; excludes it from the tree if not
  – based on the d to Y in the message

Page 7: Spanning Tree and Datacenters

Steps in Spanning Tree Algorithm

• Initially, each switch proposes itself as the root
  – Example: switch X announces (X, 0, X)

• Switches update their view of the root
  – Upon receiving message (Y, d, Z) from Z, check Y’s id
  – If Y’s id < current root: set root = Y

• Switches compute their distance from the root
  – Add 1 to the shortest distance received from a neighbor

• If the root or the shortest distance to it changed, “flood” the updated message (Y, d+1, X) (see the sketch below)
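A compact sketch of this update rule (a simplified model of the steps above; real Ethernet STP, IEEE 802.1D, carries more state and timers):

# Per-switch spanning tree update rule (simplified sketch).

class STSwitch:
    def __init__(self, my_id):
        self.my_id = my_id
        self.root = my_id      # initially, propose myself as the root
        self.dist = 0          # my distance to the root
        self.next_hop = None   # neighbor on my shortest path to the root

    def on_message(self, root, dist, sender):
        """Handle (Y, d, Z): Z proposes Y as root at distance d from Z."""
        better = (
            root < self.root                                  # smaller root id
            or (root == self.root and dist + 1 < self.dist)   # shorter path
            or (root == self.root and dist + 1 == self.dist   # tie-break on
                and self.next_hop is not None                 # lowest neighbor
                and sender < self.next_hop)
        )
        if better:
            self.root, self.dist, self.next_hop = root, dist + 1, sender
            return (self.root, self.dist, self.my_id)  # flood updated message
        return None  # no change, nothing to flood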

Page 8: Spanning Tree and Datacenters

Example: messages are (root, dist, from), over a topology of seven switches numbered 1-7.

• Initially, every switch proposes itself as root: each switch X floods (X, 0, X).
• Switches adopt lower-id roots heard from neighbors: e.g., switch 3 now floods (1, 1, 3) while switch 4 still floods (2, 1, 4).
• Updates propagate until every switch agrees on switch 1 as the root; switch 2, for instance, ends up announcing (1, 3, 2).
• Links not on a shortest path to root 1 are excluded from the tree (marked ✗ in the original figure).

Page 9: Spanning Tree and Datacenters

Robust Spanning Tree Algorithm

• Algorithm must react to failures
  – Failure of the root node
  – Failure of other switches and links

• Root switch continues sending messages
  – Periodically re-announcing itself as the root

• Detecting failures through timeout (soft state), sketched below
  – If no word from the root, time out and claim to be the root!
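A sketch of the soft-state timeout (the timeout value is illustrative, not Ethernet's actual default):

import time

ROOT_TIMEOUT = 6.0  # seconds; illustrative value

class RootWatchdog:
    def __init__(self, my_id):
        self.my_id = my_id
        self.last_heard = time.monotonic()

    def on_root_announcement(self):
        self.last_heard = time.monotonic()      # refresh the soft state

    def check(self):
        # If the root has gone silent, claim the root role and re-elect.
        if time.monotonic() - self.last_heard > ROOT_TIMEOUT:
            return (self.my_id, 0, self.my_id)  # announce myself as root
        return None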

Page 10: Spanning Tree and Datacenters

Problems with Spanning Tree?

• Delay in reestablishing the spanning tree
  – Network is “down” until the spanning tree is rebuilt

• Much of the network bandwidth goes unused
  – Forwarding is only over the spanning tree

• A real problem for datacenter networks…

Page 11: Spanning Tree and Datacenters

Datacenters

Page 12: Spanning Tree and Datacenters

What you need to know

• Characteristics of a datacenter environment
  – goals, constraints, workloads, etc.

• How and why DC networks are different (vs. WAN)
  – e.g., latency, geography, autonomy, …

• How traditional solutions fare in this environment
  – e.g., IP, Ethernet, TCP, ARP, DHCP

• Not details of how datacenter networks operate

Page 13: Spanning Tree and Datacenters

Disclaimer

• Material is emerging (not established) wisdom

• Material is incomplete
  – many details of how and why datacenter networks operate aren’t public

Page 14: Spanning Tree and Datacenters

What goes into a datacenter (network)?

• Servers organized in racks

Page 15: Spanning Tree and Datacenters

What goes into a datacenter (network)?

• Servers organized in racks
• Each rack has a `Top of Rack’ (ToR) switch

Page 16: Spanning Tree and Datacenters

What goes into a datacenter (network)?

• Servers organized in racks
• Each rack has a `Top of Rack’ (ToR) switch
• An `aggregation fabric’ interconnects ToR switches

Page 17: Spanning Tree and Datacenters

What goes into a datacenter (network)?

• Servers organized in racks
• Each rack has a `Top of Rack’ (ToR) switch
• An `aggregation fabric’ interconnects ToR switches
• Connected to the outside via `core’ switches
  – note: blurry line between aggregation and core
• With network redundancy of ~2x for robustness

Page 18: Spanning Tree and Datacenters

Example 1: Brocade reference design [figure]

Page 19: Spanning Tree and Datacenters

Example 2: Cisco reference design

[Figure: Internet at the top, core routers (CR), aggregation routers (AR), aggregation switches (S), and racks of servers (A), with ~40-80 servers/rack.]

Page 20: Spanning Tree and Datacenters

Observations on DC architecture

• Regular, well-defined arrangement
• Hierarchical structure with rack/aggregation/core layers
• Mostly homogeneous within a layer
• Supports communication between servers, and between servers and the external world

Contrast: ad-hoc structure, heterogeneity of WANs

Page 21: Spanning Tree and Datacenters

Datacenters have been around for a while

1961, Information Processing Center at the National Bank of Arizona

Page 22: Spanning Tree and Datacenters

What’s new?

Page 23: Spanning Tree and Datacenters

SCALE!

Page 24: Spanning Tree and Datacenters

How big exactly?

• ~1M servers [Microsoft]
  – less than Google, more than Amazon

• > $1B to build one site [Facebook]

• >$20M/month/site operational costs [Microsoft ’09]

But only O(10-100) sites

Page 25: Spanning Tree and Datacenters

What’s new?

• Scale
• Service model
  – user-facing, revenue-generating services
  – multi-tenancy
  – jargon: SaaS, PaaS, DaaS, IaaS, …

Page 26: Spanning Tree and Datacenters

Implications

• Scale
  – need scalable solutions (duh)
  – improving efficiency and lowering cost is critical: hence `scale out’ solutions w/ commodity technologies

• Service model
  – performance means $$
  – virtualization for isolation and portability

Page 27: Spanning Tree and Datacenters

Multi-Tier Applications

• Applications decomposed into tasks
  – Many separate components
  – Running in parallel on different machines


Page 28: Spanning Tree and Datacenters

Componentization leads to different types of network traffic

• “North-South traffic”
  – Traffic between external clients and the datacenter
  – Handled by front-end (web) servers, mid-tier application servers, and back-end databases
  – Traffic patterns fairly stable, though with diurnal variations


Page 29: Spanning Tree and Datacenters

North-South Traffic

[Figure: user requests from the Internet arrive at a router, pass through front-end proxies to web servers, which consult data caches and back-end databases.]

Page 30: Spanning Tree and Datacenters

Componentization leads to different types of network traffic

• “North-South traffic”
  – Traffic between external clients and the datacenter
  – Handled by front-end (web) servers, mid-tier application servers, and back-end databases
  – Traffic patterns fairly stable, though with diurnal variations

• “East-West traffic”
  – Traffic between machines in the datacenter
  – Communication within “big data” computations (e.g., MapReduce)
  – Traffic may shift on small timescales (e.g., minutes)

Page 31: Spanning Tree and Datacenters

East-West Traffic

[Figure: map tasks read from distributed storage, shuffle intermediate data to reduce tasks, and reduce tasks write results back to distributed storage.]

Page 32: Spanning Tree and Datacenters

East-West Traffic

[Figure: the same CR/AR/S/A tree topology, highlighting east-west flows between servers in different racks, which climb through the aggregation (and possibly core) layers and back down.]

Page 33: Spanning Tree and Datacenters

East-West Traffic

[Figure: the MapReduce pipeline again, annotated: map tasks’ reads from distributed storage often don’t cross the network (data is placed locally); the map-to-reduce shuffle always goes over the network; writes back to distributed storage cross the network for some fraction of the data (typically 2/3).]

Page 34: Spanning Tree and Datacenters

What’s different about DC networks?

Characteristics

• Huge scale:
  – ~20,000 switches/routers
  – contrast: AT&T ~500 routers

Page 35: Spanning Tree and Datacenters

What’s different about DC networks?

Characteristics

• Huge scale
• Limited geographic scope:
  – High bandwidth: 10/40/100G (contrast: DSL/WiFi)
  – Very low RTT: 10s of microseconds (contrast: 100s of milliseconds in the WAN)

Page 36: Spanning Tree and Datacenters

What’s different about DC networks?

Characteristics

• Huge scale
• Limited geographic scope
• Single administrative domain
  – Can deviate from standards, invent your own, etc.
  – “Green field” deployment is still feasible

Page 37: Spanning Tree and Datacenters

What’s different about DC networks?

Characteristics

• Huge scale
• Limited geographic scope
• Single administrative domain
• Control over one/both endpoints
  – can change (say) addressing, congestion control, etc.
  – can add mechanisms for security/policy/etc. at the endpoints (typically in the hypervisor)

Page 38: Spanning Tree and Datacenters

What’s different about DC networks?

Characteristics

• Huge scale
• Limited geographic scope
• Single administrative domain
• Control over one/both endpoints
• Control over the placement of traffic source/sink
  – e.g., the map-reduce scheduler chooses where tasks run
  – alters the traffic pattern (what traffic crosses which links)

Page 39: Spanning Tree and Datacenters

What’s different about DC networks?

Characteristics

• Huge scale
• Limited geographic scope
• Single administrative domain
• Control over one/both endpoints
• Control over the placement of traffic source/sink
• Regular/planned topologies (e.g., trees/fat-trees)
  – Contrast: ad-hoc WAN topologies (dictated by real-world geography and facilities)

Page 40: Spanning Tree and Datacenters

What’s different about DC networks?

Characteristics

• Huge scale
• Limited geographic scope
• Single administrative domain
• Control over one/both endpoints
• Control over the placement of traffic source/sink
• Regular/planned topologies (e.g., trees/fat-trees)
• Limited heterogeneity
  – link speeds, technologies, latencies, …

Page 41: Spanning Tree and Datacenters

What’s different about DC networks?

Goals

• Extreme bisection bandwidth requirements
  – recall: all that east-west traffic
  – (bisection bandwidth: the bandwidth across a cut dividing the network into two equal halves)
  – target: any server can communicate at its full link speed
  – problem: a server’s access link is 10Gbps!

Page 42: Spanning Tree and Datacenters

Full Bisection Bandwidth

[Figure: the tree topology again (Internet, CR, AR, S, A; ~40-80 servers/rack). Each server has a 10Gbps access link, so supporting full speed needs O(40x10)Gbps per rack at the aggregation layer and O(40x10x100)Gbps at the core.]

Traditional tree topologies “scale up”:
• full bisection bandwidth is expensive
• typically, tree topologies are “oversubscribed”
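To make “oversubscribed” concrete, a small worked example in Python (the uplink capacity below is an assumed figure, chosen only for illustration):

# Oversubscription: offered server bandwidth vs. available uplink bandwidth.
servers_per_rack = 40
server_link_gbps = 10
tor_uplink_gbps = 100                               # assumed ToR uplink capacity

demand_gbps = servers_per_rack * server_link_gbps   # 400 Gbps from servers
ratio = demand_gbps / tor_uplink_gbps               # 4.0, i.e., "4:1"
print(f"oversubscription: {ratio:.0f}:1")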

Page 43: Spanning Tree and Datacenters

A “Scale Out” Design

• Build multi-stage `Fat Trees’ out of k-port switches
  – k/2 ports up, k/2 down
  – Supports k³/4 hosts
    • e.g., 48 ports: 48³/4 = 27,648 hosts

• All links are the same speed (e.g., 10Gbps)
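A quick sanity check of the k³/4 formula (the switch counts are the standard k-ary fat-tree figures, stated here as an assumption, not from the slides):

def fat_tree_counts(k):
    """Host and switch counts for a 3-stage fat tree of k-port switches."""
    hosts = k ** 3 // 4        # k pods x (k/2 edge switches) x (k/2 hosts each)
    edge = k * (k // 2)        # k/2 edge switches in each of k pods
    agg = k * (k // 2)         # k/2 aggregation switches in each pod
    core = (k // 2) ** 2       # core layer
    return hosts, edge + agg + core

print(fat_tree_counts(48))     # (27648, 2880): matches the 27,648 hosts above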

Page 44: Spanning Tree and Datacenters

Full Bisection Bandwidth Not Sufficient

• To realize full bisection throughput, routing must spread traffic across paths

• Enter load-balanced routing
  – How? (1) Let the network split traffic/flows at random (e.g., ECMP -- RFC 2991/2992), sketched below
  – How? (2) Centralized flow scheduling (e.g., w/ SDN, Dec 2nd lec.)
  – Many more research proposals
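A minimal sketch of option (1): hash a flow’s 5-tuple to pick one of the equal-cost next hops, so every packet of a flow takes the same path and TCP sees no reordering (names and addresses here are illustrative):

import hashlib

def ecmp_next_hop(five_tuple, next_hops):
    # Hash the flow identity (not per-packet state) so a flow sticks to one path.
    key = "|".join(map(str, five_tuple)).encode()
    digest = hashlib.sha256(key).digest()
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]

flow = ("10.0.1.5", "10.0.9.7", 6, 45123, 80)   # src, dst, proto, sport, dport
print(ecmp_next_hop(flow, ["agg0", "agg1", "agg2", "agg3"]))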


Page 45: Spanning Tree and Datacenters

What’s different about DC networks?

Goals

• Extreme bisection bandwidth requirements
• Extreme latency requirements
  – real money on the line
  – current target: 1μs RTTs
  – how? cut-through switches making a comeback (lec. 2!)
    • reduces switching time

Page 46: Spanning Tree and Datacenters

What’s different about DC networks?

Goals

• Extreme bisection bandwidth requirements
• Extreme latency requirements
  – real money on the line
  – current target: 1μs RTTs
  – how? cut-through switches making a comeback (lec. 2!)
  – how? avoid congestion
    • reduces queuing delay

Page 47: Spanning Tree and Datacenters

What’s different about DC networks?

Goals

• Extreme bisection bandwidth requirements
• Extreme latency requirements
  – real money on the line
  – current target: 1μs RTTs
  – how? cut-through switches making a comeback (lec. 2!)
  – how? avoid congestion
  – how? fix TCP timers (e.g., default timeout is 500ms!)
  – how? fix/replace TCP to more rapidly fill the pipe

Page 48: Spanning Tree and Datacenters

What’s different about DC networks?

Goals

• Extreme bisection bandwidth requirements
• Extreme latency requirements
• Predictable, deterministic performance
  – “your packet will reach in X ms, or not at all”
  – “your VM will always see at least Y Gbps throughput”
  – Resurrecting `best effort’ vs. `Quality of Service’ debates
  – How is still an open question

Page 49: Spanning Tree and Datacenters

What’s different about DC networks?

Goals

• Extreme bisection bandwidth requirements
• Extreme latency requirements
• Predictable, deterministic performance
• Differentiating between tenants is key
  – e.g., “No traffic between VMs of tenant A and tenant B”
  – “Tenant X cannot consume more than X Gbps” (one enforcement sketch below)
  – “Tenant Y’s traffic is low priority”
  – We’ll see how in the lecture on SDN
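One way such a rate cap could be enforced, sketched as a token bucket (where enforcement actually lives, e.g., the hypervisor vswitch, varies; all names and values here are illustrative):

import time

class TokenBucket:
    """Cap a tenant's rate: packets pass only while tokens remain."""
    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec
        self.burst = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def allow(self, packet_bytes):
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= packet_bytes:
            self.tokens -= packet_bytes
            return True    # within the tenant's cap: forward
        return False       # over the cap: drop (or queue)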

Page 50: Spanning Tree and Datacenters

What’s different about DC networks?

Goals

• Extreme bisection bandwidth requirements
• Extreme latency requirements
• Predictable, deterministic performance
• Differentiating between tenants is key
• Scalability (of course)
  – Q: How’s Ethernet spanning tree looking?

Page 51: Spanning Tree and Datacenters

What’s different about DC networks?

Goals

• Extreme bisection bandwidth requirements
• Extreme latency requirements
• Predictable, deterministic performance
• Differentiating between tenants is key
• Scalability (of course)
• Cost/efficiency
  – focus on commodity solutions, ease of management
  – some debate over its importance in the network case

Page 52: Spanning Tree and Datacenters

Announcements/Summary

• I will not have office hours tomorrow
  – email me to schedule an alternate time

• Recap: datacenters
  – new characteristics and goals, some liberating, some constraining
  – scalability is the baseline requirement
  – more emphasis on performance
  – less emphasis on heterogeneity
  – less emphasis on interoperability