
Peregrine: An All-Layer-2 Container Computer Network

Tzi-cker Chiueh, Cheng-Chun Tu, Yu-Cheng Wang, Pai-Wei Wang, Kai-Wen Li, and Yu-Ming Huan

∗ Computer Science Department, Stony Brook University

† Industrial Technology Research Institute, Taiwan


Outline

• Motivation
  – Layer 2 + Layer 3 design
  – Requirements for cloud-scale DC
  – Problems of classic Ethernet in the cloud
• Solutions
  – Related solutions
  – Peregrine's solution
• Implementation and Evaluation
  – Software architecture
  – Performance evaluation

L2 + L3 Architecture: Problems

• Problem: configuration
  – Routing tables in the routers
  – IP assignment
  – DHCP coordination
  – VLAN and STP
• Problem: virtual machine mobility constrained to a physical location
• Problem: bandwidth bottleneck
• Problem: forwarding table size
  – Commodity switches hold only 16-32K entries

Ref: Cisco data center with FabricPath and the Cisco FabricPath Switching System

Requirements for Cloud-Scale DC

• Any-to-any connectivity with a non-blocking fabric
  – Scale to more than 10,000 physical nodes
• Virtual machine mobility
  – Large Layer 2 domain
• Fast fail-over
  – Quick failure detection and recovery
• Support for multi-tenancy
  – Share resources between different customers
• Load-balancing routing
  – Efficiently use all available links

Solution: A Huge L2 Switch!

[Figure: one giant Layer 2 switch connecting racks of VMs, scaling to 1 million VMs]

• Single L2 network
• Non-blocking backplane bandwidth
• Config-free, plug-and-play
• Linear cost and power scaling

However, Ethernet does not scale!

Revisit Ethernet: Spanning Tree Topology

[Figure: example topology with switches s1-s4 connecting hosts N1-N8; s1 is elected the root, and ports are marked R (root), D (designated), or B (blocked)]

Revisit Ethernet: Broadcast and Source Learning

[Figure: the same spanning-tree topology, illustrating how frames are flooded along the tree while switches learn source MAC addresses]

Benefit: plug-and-play

Ethernet's Scalability Issues

• Limited forwarding table size
  – Commodity switch: 16K to 64K entries
• STP as the solution to loop prevention
  – Not all physical links are used
  – No load-sensitive dynamic routing
• Slow fail-over
  – Fail-over latency is high (> 5 seconds)
• Broadcast overhead
  – A typical Layer 2 domain consists of hundreds of hosts

Related Works / Solution Strategies

• Scalability
  – Clos network / fat-tree to scale out
• Alternatives to STP
  – Link aggregation, e.g. LACP, Layer 2 trunking
  – Routing protocols applied to the Layer 2 network
• Limited forwarding table size
  – Packet header encapsulation or rewriting
• Load balancing
  – Randomized or traffic-engineering approaches

Design of the Peregrine

Peregrine's Solutions

• Not all links are used
  – Disable the Spanning Tree Protocol
• L2 loop prevention
  – Redirect broadcasts and block flooded packets
• Source learning and forwarding
  – Calculate routes for all node pairs at the Route Server
• Limited switch forwarding table size
  – Mac-in-Mac two-stage forwarding by a Dom0 kernel module

ARP Intercept and Redirect

[Figure: host A reaches host B across switches sw1-sw4. A's ARP request is redirected to the Directory Service: 1. DS-ARP; 2. DS-Reply; 3. Send data. Control flow and data flow are shown separately. DS = Directory Service, RAS = Route Algorithm Server]
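A minimal sketch of the lookup behind this redirect, assuming a hypothetical in-memory directory that maps a destination IP to the VM's MAC and its egress switch; class and method names are illustrative, not Peregrine's actual interfaces:

```python
# Minimal sketch of the ARP intercept-and-redirect step (hypothetical names,
# not Peregrine's actual API). Instead of broadcasting an ARP request, the
# host asks the Directory Service (DS), which answers with the destination
# VM's MAC and the egress switch that reaches it.

from dataclasses import dataclass

@dataclass
class DirectoryEntry:
    vm_mac: str          # MAC of the destination VM (e.g. host B)
    egress_switch: str   # MAC of the switch the VM is attached to (e.g. sw4)

class DirectoryService:
    """Central map from VM IP address to (VM MAC, egress switch MAC)."""
    def __init__(self):
        self._table = {}

    def register(self, ip, vm_mac, egress_switch):
        self._table[ip] = DirectoryEntry(vm_mac, egress_switch)

    def resolve(self, ip):
        # Unicast reply to the querying host; no network-wide ARP broadcast.
        return self._table.get(ip)

# Usage: host A asks the DS where B lives, then sends data (step 3 in the figure).
ds = DirectoryService()
ds.register("10.0.0.2", vm_mac="02:00:00:00:00:0b", egress_switch="02:00:00:00:00:54")
entry = ds.resolve("10.0.0.2")
print(entry.vm_mac, entry.egress_switch)
```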

Peregrine's Solutions

• Not all links are used
  – Disable the Spanning Tree Protocol
• L2 loop prevention
  – Redirect broadcasts and block flooded packets
• Source learning and forwarding
  – Calculate routes for all node pairs at the Route Server
  – Fast fail-over: primary and backup routes for each pair
• Limited switch forwarding table size
  – Mac-in-Mac two-stage forwarding by a Dom0 kernel module

Mac-in-Mac Encapsulation

[Figure: host A sends to host B across switches sw1-sw4 with the Directory Service (DS). Steps: 1. ARP redirect; 2. DS replies that B is located at sw4; 3. sw4 is returned to A; 4. A's Dom0 module encapsulates sw4 in the source MAC (frame shown as sw4 | B | A); 5. the receiving side decapsulates and restores the original frame. Control flow and data flow are shown separately]
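A minimal sketch of the two-stage forwarding idea, under the assumption that the sender's module rewrites the outer header so the fabric forwards on the egress switch's MAC and the receiver's module restores the original frame; the frame layout here is illustrative, not the exact Peregrine encapsulation format:

```python
import struct

ETH_HDR = struct.Struct("!6s6sH")  # dst MAC, src MAC, EtherType

def mac(s):
    """Convert 'aa:bb:cc:dd:ee:ff' to 6 raw bytes."""
    return bytes(int(b, 16) for b in s.split(":"))

def encap(frame, egress_switch_mac):
    """Stage 1 (assumed behavior): make the fabric forward on the egress
    switch's MAC; the original frame is carried as the payload so it can be
    restored at the far end."""
    dst, src, etype = ETH_HDR.unpack_from(frame)
    outer = ETH_HDR.pack(mac(egress_switch_mac), src, etype)
    return outer + frame

def decap(packet):
    """Stage 2 (step 5 in the figure): drop the outer header, restore the frame."""
    return packet[ETH_HDR.size:]

original = ETH_HDR.pack(mac("02:00:00:00:00:0b"),  # B
                        mac("02:00:00:00:00:0a"),  # A
                        0x0800) + b"payload"
wire = encap(original, "02:00:00:00:00:54")        # sw4
assert decap(wire) == original
```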

Fast Fail-Over

• Goal: fail-over latency < 100 ms
  – Application agnostic: stays below the 200 ms TCP timeout
• Strategy:
  – Pre-compute a primary and a backup route for each VM
  – Each VM has two virtual MACs
  – When a link fails, notify hosts whose primary routes are affected so that they switch to the corresponding backup routes
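A small sketch of the switch-over logic, with hypothetical data structures; in Peregrine the notification would come from the control plane rather than a local function call:

```python
# Sketch of the pre-computed primary/backup route idea (hypothetical structures).
# Each VM pair has a primary and a backup route; on a link-failure notification,
# pairs whose primary route uses the failed link flip to the backup route, which
# is already installed, so recovery is a table swap rather than a re-computation.

routes = {
    # (src VM, dst VM): primary route, backup route, currently active route
    ("A", "B"): {"primary": ["sw1", "sw2", "sw4"],
                 "backup":  ["sw1", "sw3", "sw4"],
                 "active":  "primary"},
}

def on_link_failure(failed_link):
    """Called when a failed link is reported, e.g. ("sw2", "sw4")."""
    for pair, r in routes.items():
        primary = r["primary"]
        hops = list(zip(primary, primary[1:]))
        if failed_link in hops or tuple(reversed(failed_link)) in hops:
            r["active"] = "backup"   # switch the affected pair to its backup route

on_link_failure(("sw2", "sw4"))
print(routes[("A", "B")]["active"])  # -> "backup"
```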

When a Network Link Fails

Implementation and Evaluation

Software Architecture

Review All Components

[Figure: hosts A and B connected through switches sw1-sw7, with the Directory Service (DS), Route Algorithm Server (RAS), MIM module, and a backup route. Annotated questions: what is the ARP request rate, how fast can the DS handle requests, how long does the RAS take to process a request, and what is the performance of the MIM module, DS, RAS, and switches?]

Mac-In-Mac Performance

• Time spent for decapsulation / encapsulation / total: 1 µs / 5 µs / 7 µs (2.66 GHz CPU)
• Roughly 2.66K / 13.3K / 18.6K cycles
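These cycle counts follow directly from multiplying the measured times by the 2.66 GHz clock rate:

```latex
% cycles = time x clock rate, at 2.66 GHz:
1~\mu\text{s} \times 2.66~\text{GHz} \approx 2.66\text{K cycles (decap)}, \quad
5~\mu\text{s} \times 2.66~\text{GHz} \approx 13.3\text{K cycles (encap)}, \quad
7~\mu\text{s} \times 2.66~\text{GHz} \approx 18.6\text{K cycles (total)}
```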


Aggregate Throughput for Multiple VMs

1. ARP table size < 1K
2. Measure the TCP throughput of 1, 2, and 4 VMs communicating with each other.

ARP Broadcast Rate in a Data Center

• What is the ARP traffic rate in the real world?
  – From 2456 hosts, the CMU CS department reports 1150 ARP/sec at peak and 89 ARP/sec on average.
  – From 3800 hosts on a university network, there are around 1000 ARP/sec at peak and < 100 ARP/sec on average.
• To scale to 1M nodes: 20K-30K ARP/sec on average; the current optimized DS handles 100K ARP/sec.
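Linearly extrapolating the per-host averages above to one million nodes (an assumption, since ARP traffic need not scale linearly) lands in the same range as the 20K-30K estimate:

```latex
% Per-host average ARP rate from the two traces, scaled linearly to 10^6 nodes:
\frac{89~\text{ARP/s}}{2456~\text{hosts}} \times 10^{6} \approx 3.6 \times 10^{4}~\text{ARP/s},
\qquad
\frac{100~\text{ARP/s}}{3800~\text{hosts}} \times 10^{6} \approx 2.6 \times 10^{4}~\text{ARP/s}
```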


Fail-over Time and Its Breakdown

• Average fail-over time: 75 ms
  – Switch: 25-45 ms to send the trap (soft unplug)
  – RS: 25 ms to receive and process the trap
  – DS: 2 ms to receive the information from the RS and notify the hosts
  – The rest is network delay and Dom0 processing time

Conclusion

• A unified Layer-2-only network for LAN and SAN
• Centralized control plane and distributed data plane
• Uses only commodity Ethernet switches
  – An army of commodity switches vs. a few high-port-density switches
  – Requirements on switches: fast forwarding and a programmable routing table
• Centralized load-balancing routing using a real-time traffic matrix
• Fast fail-over using pre-computed primary/backup routes

Thank you

Questions?

Review All Components: Result

[Figure: the same component diagram annotated with measured results: the DS handles 100K ARP/sec, the RS takes 25 ms per request, link-down detection takes 35 ms, and MIM packet processing takes 7 µs; the backup route is shown]

Backup Slides

OpenFlow Architecture

• OpenFlow switch: a data plane that implements a set of flow rules specified in terms of the OpenFlow instruction set
• OpenFlow controller: a control plane that sets up the flow rules in the flow tables of OpenFlow switches
• OpenFlow protocol: a secure protocol for an OpenFlow controller to set up the flow tables in OpenFlow switches

[Figure: an OpenFlow switch with its data path (hardware) and control path, connected to an OpenFlow controller over the OpenFlow protocol (SSL/TCP)]

Conclusion and Contribution

• Uses commodity switches to build a large-scale Layer 2 network
• Provides solutions to Ethernet's scalability issues
  – Suppressing broadcast
  – Load-balancing route calculation
  – Controlling the MAC forwarding table
  – Scaling up to one million VMs with Mac-in-Mac two-stage forwarding
  – Fast fail-over
• Future work
  – High availability of DS and RAS (master-slave model)
  – Inter

Comparisons

• Scalable and available data center fabrics
  – IEEE 802.1aq Shortest Path Bridging, IETF TRILL
  – Competitors: Cisco, Juniper, Brocade
  – Differences: commodity switches, centralized load-balancing routing, and proactive backup route deployment
• Network virtualization
  – OpenStack Quantum API
  – Competitors: Nicira, NEC
  – Generality carries a steep performance price: every virtual network link is a tunnel
  – Differences: simpler and more efficient because Peregrine runs on L2 switches directly

Three-Stage Clos Network (m, n, r)

[Figure: r ingress switches of size n x m, m middle switches of size r x r, and r egress switches of size m x n; each ingress switch has n inputs and each egress switch has n outputs]

Clos Network Theory

• A Clos(m, n, r) configuration has rn inputs and rn outputs
• It uses 2r switches of size n x m plus m switches of size r x r, which is less than a single rn x rn switch
• Each r x r middle switch can in turn be implemented as a 3-stage Clos network
• Clos(m, n, r) is rearrangeably non-blocking iff m >= n
• Clos(m, n, r) is strictly non-blocking iff m >= 2n - 1
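To make the "less than rn x rn" comparison concrete, here is the crosspoint count, worked through for a hypothetical m = n = r = 8 instance:

```latex
% Crosspoint count of Clos(m, n, r):
%   r ingress switches (n x m) + m middle switches (r x r) + r egress switches (m x n)
C_{\mathrm{Clos}}(m, n, r) = 2r \cdot nm + m \cdot r^{2}
% Hypothetical instance with m = n = r = 8 (rearrangeably non-blocking, since m >= n):
C_{\mathrm{Clos}}(8, 8, 8) = 2 \cdot 8 \cdot 64 + 8 \cdot 64 = 1536 \;\ll\; (rn)^{2} = 64^{2} = 4096
```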

Link Aggregation

ECMP: Equal-Cost Multipath

• Pros: multiple links are used
• Cons: hash collisions; traffic can re-converge downstream onto a single link
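For reference, a switch-agnostic sketch of how flow-based ECMP hashes a flow's 5-tuple to pick one of the equal-cost links, which is also why two heavy flows can collide on the same link:

```python
# Sketch of flow-based ECMP next-hop selection: a hash over the flow's 5-tuple
# picks one of the equal-cost links, so packets of one flow stay on one path
# (no reordering), but two large flows can still hash onto the same link.

import hashlib

def ecmp_next_hop(src_ip, dst_ip, proto, sport, dport, links):
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    digest = hashlib.sha256(key).digest()
    return links[int.from_bytes(digest[:4], "big") % len(links)]

uplinks = ["sw2", "sw3", "sw5", "sw6"]
print(ecmp_next_hop("10.0.0.1", "10.0.0.2", "tcp", 12345, 80, uplinks))
```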

Example: Brocade Data Center

[Figure: link aggregation at Layer 2 and L3 ECMP between tiers]

Ref: Deploying Brocade VDX 6720 Data Center Switches with Brocade VCS in Enterprise Data Centers

PortLand

• Scale-out: three-layer, multi-root topology
• Hierarchical; encodes location into the MAC address
• Location Discovery Protocol to find shortest paths; routes by MAC
• Fabric Manager maintains the IP-to-MAC mapping
• 60-80 ms fail-over; centrally controlled and notified

VL2: Virtual Layer 2

• Three-layer Clos network
• Flat addressing; IP-in-IP with Location Addresses (LA) and Application Addresses (AA)
• Link-state routing to disseminate LAs
• VLB + flow-based ECMP
• Depends on ECMP to detect link failures
• Packet interception at S; VL2 Directory Service

Monsoon

• Three-layer, multi-root topology
• 802.1ah MAC-in-MAC encapsulation, source routing
• Centralized routing decisions
• VLB + MAC rotation
• Depends on LSA to detect failures
• Packet interception at S; Monsoon Directory Service
  – IP <-> (server MAC, ToR MAC)

TRILL and SPB

TRILL
• Transparent Interconnection of Lots of Links (IETF)
• IS-IS as the topology management protocol
• Shortest-path forwarding
• New TRILL header
• Transit hash to select the next hop

SPB
• Shortest Path Bridging (IEEE)
• IS-IS as the topology management protocol
• Shortest-path forwarding
• 802.1ah MAC-in-MAC
• Computes 16 source-node-based trees

TRILL Packet Forwarding

• Link-state routing
• TRILL header fields: A-ID (nickname of A), C-ID (nickname of C), HopC (hop count)

Ref: NIL Data Communications

SPB Packet Forwarding

• Link-state routing
• 802.1ah Mac-in-Mac fields: I-SID (Backbone Service Instance Identifier), B-VID (backbone VLAN identifier)

Ref: NIL Data Communications

Rearrangeably Non-Blocking Clos Network

[Figure: a three-stage Clos network with N = 6, n = 2, k = 2: ingress switches of size n x k, middle switches of size (N/n) x (N/n), and egress switches of size k x n]

Example:
1. Three-stage Clos network
2. Condition: k >= n
3. An unused input at an ingress switch can always be connected to an unused output at an egress switch
4. Existing calls may have to be rearranged

Features of the Peregrine Network

• Utilizes all links
• Load-balancing routing algorithm
• Scales up to 1 million VMs
  – Two-stage dual-mode forwarding
• Fast fail-over

Goal

• Given a mesh network and a traffic profile
  – Load-balance the network resource utilization: prevent congestion by spreading the load, so that as much traffic as possible can be supported
  – Provide fast recovery from failure: provide a primary and a backup route per pair to minimize recovery time

[Figure: source S and destination D connected by a primary route and a backup route]

Factors

• Hop count only
• Hop count and link residual capacity
• Hop count, link residual capacity, and link expected load
• Hop count, link residual capacity, link expected load, and additional forwarding table entries required

How can these be combined into one number for a particular candidate route?

Route Selection: Idea

[Figure: S1 reaches D1 through switches A, B, C, D; S2-D2 traffic shares link C-D. One candidate S1-D1 route leaves C-D free, the other shares C-D with S2-D2]

• Which route is better from S1 to D1?
• Link C-D is more important; the idea is to use it as sparingly as possible.

Route Selection: Hop Count and Residual Capacity

[Figure: the same topology with traffic matrix S1 -> D1: 1G, S2 -> D2: 1G; one route leaves C-D free, the other shares it with S2-D2]

• Using hop count or residual capacity alone makes no difference between the two candidate routes.

Determine the Criticality of a Link

• Criticality of link $l$ for a source-destination pair $(s, d)$:
$$\phi_l(s, d) = \text{fraction of all } (s, d) \text{ routes that pass through link } l$$
• Expected load of link $l$ at the initial state, where $D(s, d)$ is the bandwidth demand matrix entry for $(s, d)$:
$$EL(l) = \sum_{(s, d)} D(s, d)\, \phi_l(s, d)$$
• These quantities determine the importance of a link.

Criticality Example

[Figure: example with nodes A, B, C where B reaches C over four possible routes; for s = B, d = C each link's criticality is the fraction of those routes that use it (values of 2/4 and 4/4 in the figure). The case s = A, d = C is similar]

Expected Load

• Assumption: load is equally distributed over the possible routes between S and D.

[Figure: with a bandwidth demand of 20 for B-C, each link's expected load is the demand times its criticality, giving values of 10 and 20 in the example]
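A small sketch of this bookkeeping, reproducing the 20 -> 10 + 10 split on a hypothetical pair of parallel B-C links:

```python
# Sketch of the expected-load computation: demand for each (s, d) pair is split
# equally over that pair's candidate routes, and each link accumulates its share.
# With unit demand this is exactly the criticality (fraction of routes per link).

from collections import defaultdict

def expected_load(routes_per_pair, demand):
    """routes_per_pair: {(s, d): [route, ...]}, each route a list of links.
    demand: {(s, d): bandwidth demand}. Returns the expected load per link."""
    load = defaultdict(float)
    for pair, routes in routes_per_pair.items():
        share = demand.get(pair, 0) / len(routes)   # equal split over routes
        for route in routes:
            for link in route:
                load[link] += share
    return dict(load)

# Two parallel B-C links, demand of 20 from B to C -> 10 on each link.
routes = {("B", "C"): [[("B", "C", 1)], [("B", "C", 2)]]}
print(expected_load(routes, {("B", "C"): 20}))   # {('B','C',1): 10.0, ('B','C',2): 10.0}
```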

Cost Metrics

• The cost metric represents the expected load per unit of available capacity on the link:
$$cost(l) = \frac{EL(l)}{R(l)}$$
where $R(l)$ is the residual capacity and $EL(l)$ the expected load of link $l$.
• Idea: pick the link with the minimum cost.

[Figure: example link costs of 0.01 and 0.02]
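A sketch of how a route server might rank candidate routes with this metric; the link names and load/capacity numbers are hypothetical:

```python
# Sketch of the cost metric: cost(l) = expected load / residual capacity; a
# route's cost is the sum of its link costs, and the lowest-cost candidate wins.

def link_cost(expected_load, residual_capacity):
    return expected_load / residual_capacity

def route_cost(route, expected, residual):
    return sum(link_cost(expected[l], residual[l]) for l in route)

def pick_route(candidates, expected, residual):
    return min(candidates, key=lambda r: route_cost(r, expected, residual))

# Hypothetical values: link C-D already carries a high expected load.
expected = {"A-B": 10.0, "B-D": 10.0, "A-C": 10.0, "C-D": 30.0}
residual = {"A-B": 1000, "B-D": 1000, "A-C": 1000, "C-D": 1000}
candidates = [["A-B", "B-D"], ["A-C", "C-D"]]
print(pick_route(candidates, expected, residual))  # avoids the heavily loaded C-D link
```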

Forwarding Table Metric

• Consider commodity switches with a forwarding table of 16-32K entries.
• Idea: minimize entry consumption, to prevent the forwarding table from being exhausted.
• Let $F(n)$ be the number of available forwarding table entries at node $n$, and INC_FWD the extra entries needed to route A-C.

[Figure: example with 100, 200, and 300 available entries at the nodes]

Load-Balanced Routing

• Simulated network
  – 52 PMs with 4 NICs each, 384 links in total
  – Replay 17 multi-VDC 300-second traces
• Compared schemes
  – Random shortest-path routing (RSPR)
  – Full Link Criticality-based Routing (FLCR)
• Metric: congestion count
  – Number of links whose capacity is exceeded
• Low additional traffic induced by FLCR
