
Page 1: SDN for the Cloud

SDN for the Cloud

Albert Greenberg

Distinguished Engineer

Director of Networking @ Microsoft Azure

[email protected]

Page 2: SDN for the Cloud

Road to SDN

• WAN
  • Motivating scenario: network engineering at scale
  • Innovation: infer traffic, control routing, centralize control to meet network-wide goals
• Cloud
  • Motivating scenario: software defined data center, requiring per-customer virtual networks
  • Innovation: VL2
    • Scale-out L3 fabric
    • Network virtualization at scale, enabling SDN and NFV

TomoGravity: ACM SIGMETRICS 2013 Test of Time Award
RCP: USENIX NSDI 2015 Test of Time Award
4D: ACM SIGCOMM 2015 Test of Time Award
VL2: ACM SIGCOMM 2009

Page 3: SDN for the Cloud

Cloud provided the killer scenario for SDN

• Cloud has the right scenarios
  • Economic and scale pressure → huge leverage
  • Control → huge degree of control to make changes in the right places
  • Virtualized data center for each customer → prior art fell short
• Cloud had the right developers and the right systems
  • High-scale, fault-tolerant distributed systems and data management

At Azure we changed everything because we had to, from optics to host to NIC to physical fabric to WAN to Edge/CDN to ExpressRoute (last mile)

Page 4: SDN for the Cloud

Hyperscale Cloud

Page 5: SDN for the Cloud

Microsoft Cloud Network

15 billion dollar bet

Page 6: SDN for the Cloud

Microsoft Cloud Network

Page 7: SDN for the Cloud

Microsoft Cloud Network

Page 8: SDN for the Cloud

2010 → 2015
• Compute instances: 100K → Millions
• Azure Storage: 10's of PB → Exabytes
• Datacenter network: 10's of Tbps → Pbps

Page 9: SDN for the Cloud

Scale
• >85% of the Fortune 500 use the Microsoft Cloud
• >60 trillion Azure Storage objects
• 425 million Azure Active Directory users
• >18 billion Azure Active Directory authentications/week
• 1 out of 4 Azure VMs are Linux VMs
• >5 million requests/sec
• 1 trillion Azure Event Hubs events/month
• 1,400,000 SQL databases in Azure
• >93,000 new Azure customers a month

Page 10: SDN for the Cloud

Agenda
• Consistent cloud design principles for SDN
• Physical and virtual networks, NFV
• Integration of enterprise & cloud, physical & virtual
• Future: reconfigurable network hardware
• Demo
• Career advice
• Acknowledgements

Page 11: SDN for the Cloud

Azure SmartNIC

Page 12: SDN for the Cloud

Cloud Design Principles

• Scale-out, N-active data plane: embrace and isolate failures
• Centralized control plane: drive the network to target state
  • Resource managers service requests while meeting system-wide objectives
  • Controllers drive each component relentlessly to the target state
  • Stateless agents plumb the policies dictated by the controllers

These principles are built into every component
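A minimal sketch of the drive-to-target-state pattern behind these principles. The names here (observe(), apply(), diff()) are illustrative, not Azure APIs; the point is a controller that repeatedly closes the gap between observed and target state through idempotent actions.

```python
import time

def reconcile(component, target_state, interval_s=5):
    """Drive one component relentlessly toward its target state.

    Hypothetical sketch: `component` exposes observe() and apply(action);
    diff() computes the idempotent actions needed to close the gap.
    """
    while True:
        observed = component.observe()            # current state from monitors
        for action in diff(observed, target_state):
            component.apply(action)               # stateless agent plumbs the policy
        time.sleep(interval_s)                    # keep driving; failures retried next pass

def diff(observed, target):
    """Return the actions that move the observed state to the target state."""
    return [("set", key, value)
            for key, value in target.items()
            if observed.get(key) != value]
```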

Page 13: SDN for the Cloud

Hyperscale Physical Networks

Page 14: SDN for the Cloud

VL2 Azure Clos Fabrics with 40G NICs

[Fabric diagram: servers attach to rack (T0) switches, which connect upward through a row spine (T1), a data center spine (T2), and a regional spine (T3); each tier is a scale-out Clos layer.]

Outcome of >10 years of history, with major revisions every six months
• VL2 Clos fabric: scale-out, active-active, L3
• Replaces the traditional design: scale-up, active-passive, L2 with LB/FW pairs

Page 15: SDN for the Cloud

Challenges of Scale

• Clos network management problem
  • Huge number of paths, ASICs, and switches to examine, with a dynamic set of gray failure modes, when chasing app latency issues at 99.995% levels
• Solution
  • Infrastructure for graph and state tracking to provide an app platform
  • Monitoring to drive out gray failures
  • Azure Cloud Switch OS to manage the switches as we do servers

Page 16: SDN for the Cloud

Azure State Management System Architecture

Abstract network graph and state model: the basic programming paradigm
• Monitors read device state into Observed State
• Applications (Infrastructure Provisioning, Traffic Engineering, Fault Mitigation, …) write Proposed State
• The Checker merges Proposed State into Target State, enforcing network-wide invariants
• Updaters drive the network from Observed State to Target State

[Diagram: the managed fabric — podsets of 48-server pods under 16 spine switches per plane, border routers connecting to the GNS/WAN network, multiple vendors' gear (e.g., n7k), and BGP/eBGP boundaries over many 10G links.]

High-scale infrastructure is complex: multiple vendors, designs, software versions, failures.

Centralized control plane: drive the network to target state.
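A hypothetical sketch of the Checker's role in this pipeline: accept a proposed change into the target state only if network-wide invariants (here, a minimum fraction of the fabric in service) still hold. The state model and threshold are illustrative.

```python
def check_and_merge(target, proposed, min_up_fraction=0.9):
    """Toy checker: merge a proposed change into the target state only if
    the result keeps enough of the fabric in service.

    States are modeled as {device_or_link_id: "up" | "down" | "maintenance"}.
    Returns (new_target, accepted).
    """
    candidate = dict(target)
    candidate.update(proposed)

    up = sum(1 for status in candidate.values() if status == "up")
    if up < min_up_fraction * max(len(candidate), 1):
        return target, False       # reject: too much capacity would leave service

    return candidate, True          # accept: updaters drive the network here
```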

Page 17: SDN for the Cloud

App I: Automatic Failure Mitigation

• Link and device monitors feed Observed State
• The Fault Mitigation app writes Proposed State; the Checker validates it against Target State
• The link updater repairs bad links; the device updater repairs bad devices
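A sketch of how a fault-mitigation app might turn monitor data into a proposed state change; the error-rate threshold and data shapes are illustrative, and the checker still decides whether applying the proposal is safe.

```python
def propose_mitigations(link_stats, max_error_rate=1e-6):
    """Turn raw link error counters into a proposed state change.

    link_stats: {link_id: {"errors": int, "packets": int}}
    Returns a proposed state mapping lossy links to "maintenance".
    """
    proposed = {}
    for link_id, stats in link_stats.items():
        if stats["packets"] == 0:
            continue
        error_rate = stats["errors"] / stats["packets"]
        if error_rate > max_error_rate:
            proposed[link_id] = "maintenance"    # drain traffic, then repair
    return proposed
```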

Page 18: SDN for the Cloud

App II: Traffic Engineering Towards High Utilization

• LSP, service, and device & link monitors feed Observed State: routes, topology, link load, service demand
• The Traffic Engineering app writes Proposed State; the Checker validates it against Target State
• Routing and service LB updaters apply the changes: add or remove LSPs, change weights

[SWAN traffic chart: WAN traffic over a week (0-250 Gbps), showing total traffic composed of interactive (20-60 Gbps), background (100 Gbps), and elastic (100 Gbps) classes.]
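A toy sketch of the idea behind SWAN-style traffic engineering: serve traffic classes in priority order (interactive, then elastic, then background), using lower-priority traffic to fill whatever capacity remains. The function name and numbers are illustrative.

```python
def allocate_bandwidth(demands, link_capacity):
    """Grant bandwidth to traffic classes in priority order on one link.

    demands: {class_name: requested_gbps}; returns {class_name: granted_gbps}.
    """
    priority = ["interactive", "elastic", "background"]
    remaining = link_capacity
    allocation = {}
    for cls in priority:
        give = min(demands.get(cls, 0), remaining)
        allocation[cls] = give
        remaining -= give
    return allocation

# Example: 240 Gbps of demand on a 200 Gbps link
print(allocate_bandwidth({"interactive": 40, "elastic": 100, "background": 100}, 200))
# -> {'interactive': 40, 'elastic': 100, 'background': 60}
```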

Page 19: SDN for the Cloud

Azure Scale Monitoring – Pingmesh

• Problem: is it the app or the network that explains app latency issues?
• Solution: measure the network latency between any two servers
  • Full coverage, always-on, brute force
  • Running in Microsoft DCs for nearly 5 years, generating 200+ billion probes every day

[Latency heatmaps: normal, podset down, podset failure, spine failure.]
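A toy Pingmesh-style agent: each server probes a sample of its peers and reports latencies for offline analysis. The port, sample size, and TCP-connect probe are illustrative choices, not the production design.

```python
import random
import socket
import time

def pingmesh_agent(peers, samples_per_round=100, port=7777, timeout_s=1.0):
    """Measure connect latency from this server to a random sample of peers.

    peers: list of peer IP addresses. Returns [(peer, rtt_ms_or_None), ...].
    """
    results = []
    for peer in random.sample(peers, min(samples_per_round, len(peers))):
        start = time.monotonic()
        try:
            with socket.create_connection((peer, port), timeout=timeout_s):
                rtt_ms = (time.monotonic() - start) * 1000
                results.append((peer, rtt_ms))
        except OSError:
            results.append((peer, None))     # failed probe is a signal in itself
    return results
```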

Page 20: SDN for the Cloud

EverFlow: Packet-level Telemetry + Cloud Analytics

[Diagram: switches match & mirror selected packets toward collectors; a guided prober injects traceable probes, and a reshuffler spreads mirrored traffic across collectors. Collectors filter & aggregate into distributed storage (Azure Storage, Azure Analytics), and the EverFlow controller answers queries about load imbalance, latency spikes, packet drops, and loops. Challenge: cloud-scale data.]
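A sketch of the match-and-mirror predicate evaluated on switches: mirror TCP SYN/FIN/RST packets, packets carrying a debug marker from the guided prober, and flows the controller has asked to trace. The packet representation is illustrative.

```python
def should_mirror(pkt, watched_flows):
    """Decide whether a packet should be mirrored to the collectors.

    pkt: dict with "five_tuple", "tcp_flags", "debug_bit" fields (illustrative).
    """
    if pkt["tcp_flags"] & 0x07:               # FIN, SYN, or RST bit set
        return True
    if pkt.get("debug_bit"):                   # marked by the guided prober
        return True
    return pkt["five_tuple"] in watched_flows  # flow the controller wants traced
```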

Page 21: SDN for the Cloud

Azure Cloud Switch – an Open Way to Build a Switch Operating System

[Diagram: user-space applications (BGP, LLDP, SNMP, ACL, hardware load balancer, other SDN apps, provisioning and deployment automation) write to a Switch State Service; sync agents translate that state through the Switch Abstraction Interface (SAI) to the switch ASIC SDK and driver, which program the switch ASIC and the front-panel Ethernet ports.]

• SAI collaboration is industry-wide
• SAI simplifies bringing up Azure Cloud Switch (Azure's switch OS) on new ASICs
• Switch control: drive to target state

SAI is on GitHub
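A sketch of the sync-agent pattern in this design: subscribe to one table in the switch state service and push changes down through an ASIC abstraction. The watch/add_route/remove_route names stand in for the real SAI C API and are purely illustrative.

```python
def sync_agent(state_service, asic, table="route"):
    """Subscribe to one state table and drive the ASIC to match it.

    state_service.watch(table) is assumed to yield (op, key, value) events;
    asic.add_route/remove_route are stand-ins, not real SAI calls.
    """
    for op, prefix, next_hop in state_service.watch(table):
        if op == "SET":
            asic.add_route(prefix, next_hop)   # program forwarding entry
        elif op == "DEL":
            asic.remove_route(prefix)          # withdraw forwarding entry
        # the agent is stateless: on restart it replays the full table
```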

Page 22: SDN for the Cloud

Hyperscale Virtual Networks

Page 23: SDN for the Cloud

Microsoft Azure Virtual Networks

• Network Virtualization (VNet) is the right abstraction, the counterpart of the VM for compute
• Azure is the hub of your enterprise; reach branch offices via VPN (over ISP/MPLS or the Internet, with QoS)
• Efficient and scalable communication within and across VNets

[Diagram: customer sites with overlapping address space (10.1/16) connect over L3 tunnels into their own virtual networks in Microsoft Azure.]

Page 24: SDN for the Cloud

Hyperscale SDN: All Policy is in the Host

[Diagram: a traditional proprietary appliance bundles the management, control, and data planes in one box. With SDN, the planes are separated and scaled out: Azure Resource Manager forms the management plane, replicated controllers form the control plane, and the switch in each host forms the data plane.]

Page 25: SDN for the Cloud

Key Challenges for Hyperscale SDN Controllers

Must scale up to 500k+ Hosts in a region

Needs to scale down to small deployments too

Must handle millions of updates per day

Must support frequent updates without downtime

Page 26: SDN for the Cloud

Microsoft Azure Service Fabric: A platform for reliable, hyperscale, microservice-based applications


http://aka.ms/servicefabric

Page 27: SDN for the Cloud

Regional Network Manager Microservices

A partitioned, federated controller exposing management REST APIs through a gateway, built from microservices:
• VIP Manager
• MAC Manager
• VNET Manager (VNets, gateways)
• Service Manager (VMs, cloud services)
• Network Resource Manager (ACLs, routes, QoS)
• Fabric task driver

It drives software load balancers, other appliances, and cluster network managers.
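A toy sketch of the partitioned-controller idea: spread resources (for example, VNets) across manager partitions with consistent hashing, so each partition owns a stable slice of the key space and the service scales out by adding partitions. All names here are hypothetical.

```python
import bisect
import hashlib

class PartitionedManager:
    """Route each resource to the partition that owns it (consistent hashing)."""

    def __init__(self, partitions, virtual_nodes=64):
        self.ring = sorted(
            (self._hash(f"{p}-{v}"), p)
            for p in partitions
            for v in range(virtual_nodes)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, resource_id):
        """Return the partition responsible for this resource."""
        h = self._hash(resource_id)
        idx = bisect.bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]

# Example: route a VNet update to its owning partition
mgr = PartitionedManager(["vnet-mgr-0", "vnet-mgr-1", "vnet-mgr-2"])
print(mgr.owner("vnet-contoso-prod"))
```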

Page 28: SDN for the Cloud

Regional Network Controller Stats

• 10s of millions of API calls per day
• API execution time
  • Read: <50 milliseconds
  • Write: <150 milliseconds
• Varying deployment footprint
  • Smallest: <10 hosts

  • Largest: >100K hosts

Write API transactions: AddOrUpdateVnet, AllocateMacs, AllocateVips, AssociateToAclGroup, CreateOrUpdateNetworkService, DeleteNetworkService, FreeMac

Page 29: SDN for the Cloud

Hyperscale Network Function Virtualization

Page 30: SDN for the Cloud

Azure SLB: Scaling Virtual Network Functions

• Key idea: decompose load balancing into tiers to achieve a scale-out data plane and a centralized control plane
• Tier 1: distribute packets (Layer 3)
  • Routers: ECMP
• Tier 2: distribute connections (Layer 3-4)
  • Multiplexer, or Mux
  • Enables high availability and scale-out
• Tier 3: virtualized network functions (Layer 3-7)
  • Examples: Azure VPN, Azure Application Gateway, third-party firewall

[Diagram: routers ECMP-spread load-balanced IP packets across Muxes; Muxes spread load-balanced connections across VNF instances.]
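A minimal sketch of what a Tier-2 Mux has to do: map a connection's 5-tuple to one backend so every packet of that connection lands on the same instance while connections spread evenly. The hashing scheme and addresses are illustrative.

```python
import hashlib

def pick_backend(five_tuple, backends):
    """Deterministically map a connection 5-tuple to one backend (DIP).

    five_tuple: (src_ip, dst_ip, proto, src_port, dst_port); backends: list of DIPs.
    """
    key = "|".join(map(str, five_tuple)).encode()
    h = int(hashlib.sha256(key).hexdigest(), 16)
    return backends[h % len(backends)]

# Example: a client connection to a VIP on port 80 is steered to one DIP
dips = ["10.1.0.4", "10.1.0.5", "10.1.0.6"]
print(pick_backend(("64.3.2.5", "79.3.1.2", "tcp", 6754, 80), dips))
```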

Page 31: SDN for the Cloud

Express Route: Direct Connectivity to the Cloud

Virtual Networks

Azure Edge (network virtual function)

Connectivity Provider Infrastructure (physical WAN)

Page 32: SDN for the Cloud

Data Center-Scale Distributed Router

Customer’s network ↔ Microsoft WAN ↔ Customer’s Virtual Network in Azure

Centralized control plane, distributed data plane

Page 33: SDN for the Cloud

Building a Hyperscale Host SDN

Page 34: SDN for the Cloud

Virtual Filtering Platform (VFP)

Acts as a virtual switch inside Hyper-V VMSwitch

Provides core SDN functionality for Azure networking services, including:

• Address virtualization for VNET
• VIP -> DIP translation for SLB
• ACLs, metering, and security guards

Uses programmable rule/flow tables to perform per-packet actions

Supports all Azure data plane policy at 40GbE+ with offloads

Coming to private cloud in Windows Server 2016

[Diagram: VFP sits inside the Hyper-V VMSwitch between the physical NIC and each VM's vNIC, applying layered policy: SLB (NAT), VNET, and ACLs/metering/security.]

Page 35: SDN for the Cloud


Flow Tables: the Right Abstraction for the Host

VMSwitch exposes a typed Match-Action-Table API to the controller

• Controllers define policy

• One table per policy

Key insight: Let controller tell switch exactly what to do with which packets

• e.g. encap/decap, rather than trying to use existing abstractions (tunnels, …)

[Diagram: the controller pushes tenant and VNet descriptions (VNet routing policy, ACLs, NAT, endpoints) to VFP on host 10.4.1.5, which hosts VM1 (10.1.1.2) behind its NIC. Example tables:]

VNET table:
  TO: 10.2/16    -> Encap to GW
  TO: 10.1.1.5   -> Encap to 10.5.1.7
  TO: !10/8      -> NAT out of VNET

LB NAT table:
  TO: 79.3.1.2   -> DNAT to 10.1.1.2
  TO: !10/8      -> SNAT to 79.3.1.2

ACLs table:
  TO: 10.1.1/24  -> Allow
  TO: 10.4/16    -> Block
  TO: !10/8      -> Allow
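A minimal sketch of the layered match-action-table idea (not the real VFP API): each policy table matches the packet and rewrites it before the next table runs, with the first matching rule in each table winning.

```python
def apply_layers(packet, layers):
    """packet: mutable dict (src, dst, ...); layers: ordered list of tables,
    each a list of (match_fn, action_fn) pairs. First match in a table wins."""
    for table in layers:
        for match, action in table:
            if match(packet):
                packet = action(packet)
                break                       # one action per table per packet
    return packet

# Illustrative VNET table: encap traffic for 10.1.1.5 to physical host 10.5.1.7
vnet_table = [
    (lambda p: p["dst"] == "10.1.1.5",
     lambda p: {**p, "outer_dst": "10.5.1.7", "encap": "vxlan"}),
]
# Illustrative ACL table: block 10.4/16, allow everything else
acl_table = [
    (lambda p: p["dst"].startswith("10.4."),
     lambda p: {**p, "verdict": "block"}),
    (lambda p: True,
     lambda p: {**p, "verdict": "allow"}),
]

print(apply_layers({"src": "10.1.1.2", "dst": "10.1.1.5"}, [vnet_table, acl_table]))
```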

Page 36: SDN for the Cloud

Table Typing/Flow Caching are Critical to Performance

• COGS in the cloud is driven by VM density: 40GbE is here
• First-packet actions can be complex
• Established-flow matches must be typed, predictable, and simple hash lookups

[Diagram: the first packet from VM1 (10.1.1.2) traverses the full VNET, LB NAT, and ACL tables; the composite result is cached so subsequent packets hit a single connection table:]

Connection (src IP, dst IP, port, port)  ->  Action
  10.1.1.2, 10.2.3.5, 80, 9876           ->  DNAT + Encap to GW
  10.1.1.2, 10.2.1.5, 80, 9876           ->  Encap to 10.5.1.7
  10.1.1.2, 64.3.2.5, 6754, 80           ->  SNAT to 79.3.1.2
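A sketch of the first-packet/established-flow split described above (illustrative, not VFP code): the slow path composes the per-table actions once, and the result is cached by connection 5-tuple so later packets need only one hash lookup.

```python
flow_cache = {}

def process(packet, layers):
    """Apply cached composite action if present, else build and cache it."""
    key = (packet["src"], packet["dst"], packet["sport"], packet["dport"])
    action = flow_cache.get(key)
    if action is None:
        action = compose_actions(packet, layers)   # slow path: walk all tables once
        flow_cache[key] = action                    # cache per-connection composite
    return action(packet)                           # fast path: single hash lookup

def compose_actions(packet, layers):
    """Walk every table for the first packet and compose the chosen actions."""
    steps = []
    for table in layers:
        step = match_table(packet, table)
        steps.append(step)
        packet = step(packet)        # transform before matching the next layer
    def composite(pkt):
        for step in steps:
            pkt = step(pkt)
        return pkt
    return composite

def match_table(packet, table):
    """Return the action of the first matching rule, or a no-op."""
    for match, action in table:
        if match(packet):
            return action
    return lambda p: p
```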

Page 37: SDN for the Cloud

RDMA/RoCEv2 at Scale in Azure

• RDMA addresses the high CPU cost and long latency tail of TCP
  • Zero CPU utilization at 40Gbps
  • μs-level end-to-end latency
• Running RDMA at scale
  • RoCEv2 for RDMA over commodity IP/Ethernet switches
  • Cluster-level RDMA
  • DCQCN* for end-to-end congestion control

[Diagram: an application asks its NIC to write a local memory buffer at address A into a remote buffer at address B; buffer B is filled NIC-to-NIC, without involving the remote application or CPU.]

*DCQCN is running on Azure NICs
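A simplified sketch of DCQCN-style sender (reaction point) rate control: on each congestion notification the sender cuts its rate in proportion to a congestion estimate, and between notifications it decays the estimate and recovers toward its pre-cut rate. Parameter values are illustrative, and the full algorithm has additional increase phases.

```python
class DcqcnSender:
    """Toy DCQCN-like reaction point (illustrative, not the production NIC code)."""

    def __init__(self, line_rate_gbps=40.0, g=1 / 256):
        self.rc = line_rate_gbps      # current sending rate
        self.rt = line_rate_gbps      # target rate to recover toward
        self.alpha = 1.0              # congestion estimate
        self.g = g                    # alpha update gain

    def on_cnp(self):
        """Congestion notification received: remember the rate, then cut it."""
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.rt = self.rc
        self.rc = self.rc * (1 - self.alpha / 2)

    def on_quiet_period(self):
        """No notifications for a while: decay alpha, recover toward the target."""
        self.alpha = (1 - self.g) * self.alpha
        self.rc = (self.rc + self.rt) / 2
```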

Page 38: SDN for the Cloud

RDMA Latency Reduction at the 99.9th Percentile in Bing

[Chart: latency for TCP at the 99th percentile vs. RDMA at the 99th and 99.9th percentiles.]

Page 39: SDN for the Cloud

Host SDN Scale Challenges

• The host network is scaling up: 1G → 10G → 40G → 50G → 100G
  • The driver is VM density (more VMs per host), reducing COGS
  • Need the performance of hardware to implement policy without CPU
• Need to support new scenarios: BYO IP, BYO topology, BYO appliance
  • We are always pushing richer semantics to virtual networks
  • Need the programmability of software to be agile and future-proof

How do we get the performance of hardware with the programmability of software?

Page 40: SDN for the Cloud

Azure SmartNIC
• Use an FPGA for reconfigurable functions
  • FPGAs are already used in Bing (Catapult)
  • Roll out hardware as we do software
• Programmed using Generic Flow Tables (GFT)
  • Language for programming SDN to hardware
  • Uses connections and structured actions as primitives
• SmartNIC can also do crypto, QoS, storage acceleration, and more…

[Diagram: the SmartNIC pairs an FPGA with the NIC ASIC, sitting between the host CPU and the ToR switch.]

Page 41: SDN for the Cloud

[Diagram: controllers program VFP in the VMSwitch through its northbound API. For the first packet of a flow, VFP walks the SLB decap, SLB NAT, VNET, ACL, and metering layers and compiles the composite action (e.g. Decap, DNAT, Rewrite 1.2.3.1->1.3.4.1, 62362->80, Meter) into a GFT table entry. The southbound GFT Offload API (NDIS) pushes that entry to the GFT offload engine and transposition engine on the 50G SmartNIC, which also provides QoS, crypto, and RDMA, so subsequent packets of the flow are handled entirely in hardware.]
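A sketch of the GFT exception path in pseudocode (illustrative, not the NDIS API): packets with no hardware flow entry are punted to VFP in software, which computes the composite per-connection action and programs it into the SmartNIC so the rest of the flow bypasses the host CPU. The nic/vfp method names are hypothetical.

```python
def handle_packet(pkt, nic, vfp):
    """Route a packet through the hardware fast path or the software slow path."""
    key = (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"])

    if nic.has_flow(key):                  # fast path: NIC already knows this flow
        return nic.apply(key, pkt)

    action = vfp.slow_path(pkt)            # first packet: walk all VFP layers
    nic.program_flow(key, action)          # offload composite action to the FPGA
    return action.apply(pkt)
```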

Page 42: SDN for the Cloud

Azure SmartNIC

Page 43: SDN for the Cloud

Demo: SmartNIC Encryption

Page 44: SDN for the Cloud

Closing Thoughts

• Cloud scale and financial pressure unblocked SDN
  • Control and systems developed earlier for compute, storage, and power helped
  • Moore's Law helped: on the order of 7B transistors per ASIC
  • We did not wait a moment for standards or for vendor persuasion
• SDN realized through consistent application of the principles of cloud design
  • Embrace and isolate failure
  • Centralize (partition, federate) control and relentlessly drive to target state
• Microsoft Azure re-imagined networking, created SDN, and it paid off

Page 45: SDN for the Cloud

Career Advice

• Cloud
  • Software leverage and agility, even for hardware people
• Quantity time
  • With team, with project
  • Hard infrastructure problems take >3 years, but it's worth it
• Usage and its measurement: oxygen for ideas
  • Quick wins (3 years is a long time)
  • Foundation and proof that the innovation matters

Page 46: SDN for the Cloud

Shout-Out to Colleagues & Mentors

AT&T & Bell Labs
• Han Nguyen, Brian Freeman, Jennifer Yates, … and the entire AT&T Labs team
• Alumni: Dave Belanger, Rob Calderbank, Debasis Mitra, Andrew Odlyzko, Eric Sumner Jr.

Microsoft Azure
• Reza Baghai, Victor Bahl, Deepak Bansal, Yiqun Cai, Luis Irun-Briz, Alireza Dabagh, Nasser Elaawar, Gopal Kakivaya, Yousef Khalidi, Chuck Lenzmeier, Dave Maltz, Aaron Ogus, Parveen Patel, Mark Russinovich, Murari Sridharan, Marne Staples, Junhua Wang, Jason Zander, and the entire Azure team
• Alumni: Arne Josefsberg, James Hamilton, Randy Kern, Joe Chau, Changhoon Kim, Parantap Lahiri, Clyde Rodriguez, Amitabh Srivastava, Sumeet Singh, Haiyong Wang

Academia
• Nick McKeown, Jennifer Rexford, Hui Zhang

Page 47: SDN for the Cloud

MSFT @ SIGCOMM’15

• DCQCN: congestion control for large RDMA deployments. 2000x reduction in pause frames, 16x better performance.
• Everflow: packet-level telemetry for large DC networks. 106x reduction in trace overhead, pinpoint accuracy.
• Silo: virtual networks with guaranteed bandwidth and latency. No changes to the host stack!
• Iridium: low-latency geo-distributed analytics. 19x query speedup, 64% reduction in WAN traffic.
• Corral: joint data & compute placement for big data jobs. 56% reduction in completion time over Yarn.
• Eden: enable network functions at the end host. Dynamic, stateful policies, F#-based API.
• R2C2: network stack for rack-scale computing. The rack is the new building block!
• Hopper: speculation-aware scheduling of big-data jobs. 66% speedup of production queries.
• PingMesh: DC network latency measurement and analysis. 200 billion probes per day!

http://research.microsoft.com/en-us/um/cambridge/events/sigcomm2015/papers.html