Microsoft's Production Configurable Cloud
Mark Russinovich, Chief Technology Officer, Microsoft Azure
Context in 2010
• Moore's Law was fine
• More cores, but single-thread performance gains slowing
• No real focus on datacenter accelerators
• FPGAs still strong in their traditional markets
  • But a non-consensus bet for compute
• But there were storm clouds on the horizon
• >90% of the Fortune 500 are on the Microsoft Cloud
• >80% of the world's largest banks are Azure customers
• >75% of G-SIFIs have signed enterprise agreements with Azure
• Compliance: FedRAMP High, DISA IL-4, ITAR, CJIS
Regions
38 Azure regions (more than AWS and Google combined):
Central US (Iowa), West US (California), East US (Virginia), East US 2 (Virginia), North Central US (Illinois), South Central US (Texas), US Gov Virginia, US Gov Iowa, US DoD East (Virginia), US DoD West (Iowa), Canada Central (Toronto), Canada East (Quebec City), Brazil South (Sao Paulo State), North Europe (Ireland), West Europe (Netherlands), United Kingdom, Germany Central (Frankfurt), Germany North East (Magdeburg), France (Paris), France (Marseille), China North (Beijing), China South (Shanghai), Japan East (Tokyo, Saitama), Japan West (Osaka), Korea (Seoul), Korea (Busan), East Asia (Hong Kong), SE Asia (Singapore), India Central (Pune), India South (Chennai), India West (Mumbai), Australia East (New South Wales), Australia South East (Victoria)
[Quadrant diagram: Training vs. Inference across Client and Cloud, with labels Humans, ASICs, GPUs, and "?" for cloud acceleration]
What Drives a Post-CPU “Enhanced” Cloud?
[Diagram: FPGAs positioned between the generality of CPUs and the efficiency of ASICs, and between homogeneity and accelerators, for cloud applications]
[Chart: % of servers (0-100) vs. time the workload is stable (years, 1-5); the long-stability end of the distribution is ideal for ASICs]
What is FPGA Technology?
• Field Programmable Gate Array: programmable hardware
  • Can be rewritten with a new image (bitstream) in seconds, soon in hundreds of ms
• Chip has large quantities of programmable units: network, memories, and logic (LUTs)
• Program specialized circuits that communicate directly
  • Stored as bit tables rather than polygons of materials
  • Can build functional units, state machines, networking circuits, etc.
• Programmed in the same languages (e.g., Verilog) used to design ASIC chips
• FPGA chips are now large SoCs: thousands of hardened DSP blocks, DRAM controllers, PCIe controllers, and now ARM cores
• Now: a growing process gap between ASICs and "big iron" (CPU, FPGA, GPU)
[FPGA die diagram: generic logic (LUTs), DSP multiplier blocks, 20 Kb dual-port RAMs, and specialized I/O blocks (network, PCIe)]
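As an illustrative sketch of the "bit tables" idea above (not how real FPGA tools work): a k-input LUT is just a 2^k-entry truth table, and the bitstream is what fills those tables. A minimal C++ model of a 4-input LUT programmed as a "three or more inputs high" function:

```cpp
#include <bitset>
#include <iostream>

// Software model of a 4-input LUT: the "configuration" is a 16-bit
// truth table, exactly what a bitstream loads into the fabric.
struct Lut4 {
    std::bitset<16> truth;  // one output bit per input combination
    bool eval(bool a, bool b, bool c, bool d) const {
        unsigned idx = (a << 3) | (b << 2) | (c << 1) | d;
        return truth[idx];
    }
};

int main() {
    Lut4 lut{};
    // "Program" the LUT: output 1 when at least 3 of 4 inputs are 1.
    for (unsigned idx = 0; idx < 16; ++idx)
        lut.truth[idx] = std::bitset<4>(idx).count() >= 3;
    std::cout << lut.eval(true, true, true, false) << "\n";  // prints 1
}
```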
First try: v0
• Use commodity SuperMicro servers
  • 1U rack-mounted, 2 x 10GbE ports, 3 x16 PCIe slots, 12 Intel Westmere cores (2 sockets)
• 6 Xilinx LX240T FPGAs
• One appliance per rack
• All rack machines communicate over 1Gb Ethernet
No production:
• Additional single point of failure
• Additional SKU to maintain
• Too much load on the network
• Inelastic FPGA scaling or stranded capacity
Second try: v1
• Altera Stratix V D5
  • 172.6K ALMs, 2,014 M20Ks (457 KLEs; 1 KLE ≈ 12K gates; an M20K is a 2.5KB SRAM)
• PCIe Gen3 x8, 8GB DDR3
• 20 Gb network among FPGAs
[Card diagram: Stratix V with 8GB DDR3 and PCIe Gen3 x8]
Mapped Fabric into a Pod
• 1 pod = 48 servers, occupying one half-rack with a 48-port 10G TOR switch
• 1 server = 2 sockets, 64GB RAM, 2TB SSD storage
• FPGA network: 20Gb (2 x 10Gb) links to N/S/E/W neighbors in a 2-D torus topology (a 6x8 torus; see the sketch below the diagram)
• Offered capabilities:
  • Low-latency access to a local FPGA
  • Compose multiple FPGAs to accelerate large workloads
  • Low-latency, high-bandwidth sharing of storage and memory across server boundaries
[Diagram: Servers 1-48, each with an FPGA, hang off the top-of-rack switch (TOR); the FPGAs form a separate 6x8 torus over dedicated 10Gb Ethernet links]
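A minimal sketch of the torus wiring math, assuming a row-major numbering of the 48 FPGAs (the production cable plan may differ): every node has exactly four neighbors, with links wrapping around the edges.

```cpp
#include <array>
#include <cstdio>

// Hypothetical row-major numbering of the 6x8 FPGA torus; the real
// pod wiring may differ, but the wraparound neighbor math is the same.
constexpr int ROWS = 6, COLS = 8;

std::array<int, 4> torusNeighbors(int id) {
    int r = id / COLS, c = id % COLS;
    auto idx = [](int rr, int cc) { return rr * COLS + cc; };
    return {
        idx((r + ROWS - 1) % ROWS, c),  // north (wraps top to bottom)
        idx((r + 1) % ROWS, c),         // south
        idx(r, (c + 1) % COLS),         // east
        idx(r, (c + COLS - 1) % COLS),  // west (wraps left to right)
    };
}

int main() {
    for (int n : torusNeighbors(0))  // even a corner node has 4 links
        std::printf("%d ", n);       // prints: 40 8 1 7
    std::printf("\n");
}
```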
Built Three Programmable Engines for Bing
• FE: feature extraction state machines with stream preprocessing FSMs feeding a feature transmission network
  • FE0: 89 non-BodyBlock features, 34 state machines, 55% utilization
  • FE1: 55 BodyBlock features, 20 state machines, 45% utilization
• FFE: free-form expression cores built from basic tiles with local and complex ALUs (ln, divide), registers, constants, per-instance instruction stores, scheduling logic, and distribution latches carrying control/data tokens; 64 cores/chip, 256-512 threads
• DTS: decision-tree scoring tiles (DTTs) with compression thresholds; 48 DTT tiles/chip, 240 tree processors, 2,880 trees/chip
[Diagram: arrays of FFE cores and DTT tiles laid out across the FPGA]
1,632-server pilot deployed in BN2
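To make the DTS stage concrete, here is a minimal, hypothetical sketch of scoring an ensemble of decision trees, the operation the DTT tiles parallelize in hardware (the node layout and field names are illustrative, not the FPGA implementation):

```cpp
#include <vector>

// One node of a binary decision tree: compare a feature against a
// threshold and branch; leaves carry a partial score.
struct Node {
    int feature;      // index into the feature vector (-1 for a leaf)
    float threshold;  // branch left if feature value < threshold
    int left, right;  // child indices within the tree's node array
    float value;      // leaf contribution to the document score
};

using Tree = std::vector<Node>;

float scoreTree(const Tree& t, const std::vector<float>& features) {
    int i = 0;
    while (t[i].feature >= 0)  // walk until a leaf is reached
        i = features[t[i].feature] < t[i].threshold ? t[i].left : t[i].right;
    return t[i].value;
}

// The ensemble score is the sum over all trees; in hardware, 240 tree
// processors walk trees in parallel and the partial scores are summed.
float scoreEnsemble(const std::vector<Tree>& trees,
                    const std::vector<float>& features) {
    float score = 0.0f;
    for (const Tree& t : trees) score += scoreTree(t, features);
    return score;
}
```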
No production:
• Microsoft was converging on a single SKU, and no one else wanted the secondary network
  • Complex, difficult to handle failures; difficult to service boxes
• No killer infrastructure accelerator; the application presence was too small
Hyperscale SDN: Building the Right Abstractions
• Management plane: create a tenant (traditionally in a proprietary appliance; in Azure, the Azure Resource Manager)
• Control plane: plumb tenant ACLs to switches (in Azure, the SDN controller)
• Data plane: apply ACLs to flows (in Azure, the switch, i.e., the host)
Key to flexibility and scale is Host SDN
Virtual Filtering Platform (VFP)
• Acts as a virtual switch inside the Hyper-V VMSwitch
• Provides core SDN functionality for Azure networking services, including:
  • Address virtualization for VNET
  • VIP -> DIP translation for SLB
  • ACLs, metering, and security guards
• Uses programmable rule/flow tables to perform per-packet actions
• Supports all Azure data plane policy at 40GbE+ with offloads
• Coming to private cloud in Windows Server 2016
[Diagram: on host 10.4.1.5, VM vNICs attach to the VMSwitch; VFP layers (ACLs/metering/security, VNET, SLB NAT) sit between the vNICs and the physical NIC]
Flow Tables: the Right Abstraction for the Host
• VMSwitch exposes a typed match-action-table API to the controller
  • Controllers define policy; one table per policy
• Key insight: let the controller tell the switch exactly what to do with which packets (e.g., encap/decap) rather than trying to use existing abstractions (tunnels, ...)
[Diagram: the controller compiles a tenant description (VNet description, VNet routing policy, ACLs, NAT endpoints) into per-VM flow tables in VFP, between VM1 (10.1.1.2) and the NIC]

VNET table:
  TO: 10.2/16    -> Encap to GW
  TO: 10.1.1.5   -> Encap to 10.5.1.7
  TO: !10/8      -> NAT out of VNET
LB NAT table:
  TO: 79.3.1.2   -> DNAT to 10.1.1.2
  TO: !10/8      -> SNAT to 79.3.1.2
ACLs table:
  TO: 10.1.1/24  -> Allow
  TO: 10.4/16    -> Block
  TO: !10/8      -> Allow
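A minimal sketch of the match-action idea behind the tables above, assuming a simplified destination-prefix match (real VFP conditions and actions are far richer and match on full headers):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Simplified packet: just a destination IPv4 address.
struct Packet { uint32_t dstIp; };

// One match-action rule: a destination prefix plus an action to run.
struct Rule {
    uint32_t prefix;
    int prefixLen;                        // bits that must match
    bool negate;                          // models the "!10/8" style rules
    std::function<void(Packet&)> action;  // e.g., encap, NAT, allow/block
    bool matches(uint32_t ip) const {
        uint32_t mask = prefixLen ? ~0u << (32 - prefixLen) : 0;
        bool hit = (ip & mask) == (prefix & mask);
        return negate ? !hit : hit;
    }
};

// A policy layer is a table of rules; packets traverse the layers
// (VNET -> NAT -> ACL) and the first matching rule in each layer acts.
using Table = std::vector<Rule>;

void process(Packet& p, const std::vector<Table>& layers) {
    for (const Table& t : layers)
        for (const Rule& r : t)
            if (r.matches(p.dstIp)) { r.action(p); break; }
}
```

The design point matches the slide's key insight: the controller populates the tables with exact actions, so the switch never has to infer behavior from a generic tunnel abstraction.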
Host SDN Scale Challenges
• Hosts are scaling up: 1G -> 10G -> 40G -> 50G -> 100G
  • Reduces COGS of VMs (more VMs per host) and enables new workloads
  • Need the performance of hardware to implement policy without burning CPU
• Need to support new scenarios: BYO IP, BYO topology, BYO appliance
  • We are always pushing richer semantics to virtual networks
  • Need the programmability of software to be agile and future-proof
How do we get the performance of hardware with the programmability of software?
Azure SmartNIC
• Use an FPGA for reconfigurable functions
  • FPGAs are already used in Bing (Catapult)
  • Roll out hardware the way we roll out software
• Programmed using Generic Flow Tables (GFT)
  • A language for programming SDN policy into hardware
  • Uses connections and structured actions as primitives
• SmartNIC can also do crypto, QoS, storage acceleration, and more...
[Diagram: the SmartNIC (NIC ASIC plus FPGA) sits between the host CPU and the ToR]
[Diagram: the first packet of a connection takes the slow path through the VMSwitch, where VFP computes the composed action (e.g., Decap, DNAT, Rewrite, Meter for 1.2.3.1 -> 1.3.4.1, 62362 -> 80) and installs it through the GFT offload API (NDIS) into the GFT table; the 50G SmartNIC's GFT offload engine (alongside QoS, crypto, and RDMA) then handles all subsequent packets; see the sketch below]
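A sketch of that first-packet/fast-path split, assuming a simplified exact-match key (`FlowKey`) and a stand-in `evaluatePolicy` for the full VFP layer traversal (neither is the real GFT interface):

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <unordered_map>

// Exact-match key for a connection; a real GFT key covers the full
// 5-tuple plus encapsulation headers. Packed into 64 bits here.
using FlowKey = uint64_t;
using Action = std::function<void()>;

// Hardware table on the SmartNIC: exact-match flow -> composed action.
struct GftOffloadEngine {
    std::unordered_map<FlowKey, Action> table;
    std::optional<Action> lookup(FlowKey k) const {
        auto it = table.find(k);
        if (it == table.end()) return std::nullopt;
        return it->second;
    }
};

// Host slow path: VFP evaluates the full layered policy once, then
// installs the resulting composed action so later packets of the
// connection never touch the CPU.
void onPacket(FlowKey k, GftOffloadEngine& nic,
              const std::function<Action(FlowKey)>& evaluatePolicy) {
    if (auto act = nic.lookup(k)) {
        (*act)();                             // fast path: hardware acts
    } else {
        Action composed = evaluatePolicy(k);  // first packet: slow path
        nic.table.emplace(k, composed);       // offload to the FPGA
        composed();
    }
}
```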
[Diagram: the SmartNIC transposition engine composes the per-layer rules pushed by the controllers (SLB decap -> Decap*, SLB NAT -> DNAT*, VNET -> Rewrite*, ACL -> Allow*, metering -> Meter*) into a single packet rewrite, plus encap on the outbound path]
Catapult V2: This one works
[Diagram: WCS 2.0 server blade with Catapult v2: the FPGA sits as a bump-in-the-wire between the 40Gb/s NIC and the 40Gb/s QSFP link to the ToR, with PCIe Gen3 (2x8 and x8) connectivity to the two QPI-connected CPUs and its own DRAM]
[Photos: WCS Gen4.1 blade with Mellanox NIC and Catapult FPGA ("Pikes Peak"), the WCS tray backplane, option card mezzanine connectors, and the Catapult v2 mezzanine card]
• The architecture justifies the economics:
  1. Can act as a local compute accelerator
  2. Can act as a network/storage accelerator
  3. Can act as a remote compute accelerator
Configurable Cloud: a CPU compute layer plus a reconfigurable compute layer over a converged network
Local acceleration
Production Results (December 2015)
[Chart: 99.9th-percentile query latency versus queries/sec for Bing ranking in software vs. FPGA, plotting average software load, 99.9% software latency, 99.9% FPGA latency, and average FPGA query load]
Infrastructure acceleration
Azure Accelerated Networking: Fastest Cloud Network!
• Highest-bandwidth VMs of any cloud: DS15v2 and D15v2 VMs get up to 25Gbps with <25μs latency
• Consistent low-latency network performance
  • Provides SR-IOV to the VM
  • 10x latency improvement
  • Increased packets per second (PPS)
  • Reduced jitter means more consistency in workloads
• Enables workloads requiring native performance to run in cloud VMs
  • >2x improvement for many DB and OLTP applications
Accelerated Networking Internals
• SDN/networking policy is defined in software on the host
• FPGA acceleration is used to apply all policies to traffic
Remote acceleration
Azure Data Center Network Fabrics with 40G NICs
[Diagram: Clos fabric: racks of 20 servers under T0 ToR switches, T1 row spines (T1-1 ... T1-8), T2 data center spines, and T3 regional spines]
• >10 years of experience, with major revisions every six months
• Scale-out, active-active: up to 128 switches wide!
• Microsoft software on merchant silicon
  • Switch Abstraction Interface (SAI)
  • SONiC: Linux-based switch firmware
  • OCP support
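The fabric stays active-active by spreading flows across equal-cost paths at each tier; a minimal sketch of the standard ECMP idea (illustrative only, not SONiC code):

```cpp
#include <cstdint>
#include <functional>

// ECMP illustration: a flow's 5-tuple hashes to one of N equal-cost
// next hops, so load spreads across a spine up to 128 switches wide
// while all packets of one flow stay on one path (no reordering).
struct FiveTuple {
    uint32_t srcIp, dstIp;
    uint16_t srcPort, dstPort;
    uint8_t proto;
};

int pickNextHop(const FiveTuple& t, int numPaths) {
    // Any stable hash works; real switches use configurable hardware
    // hash functions. std::hash over a packed value suffices here.
    uint64_t packed = (uint64_t)t.srcIp << 32 | t.dstIp;
    packed ^= (uint64_t)t.srcPort << 48 | (uint64_t)t.dstPort << 32 | t.proto;
    return (int)(std::hash<uint64_t>{}(packed) % numPaths);
}
```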
Architecture of a Configurable Cloud
[Diagram: two racks shown, each ToR serving four FPGA+NIC servers at L0, connected up through L1/L2 switch tiers (CS0-CS3, SP0-SP3), so FPGAs can reach each other across the datacenter network]
• FPGAs can encapsulate their own UDP packets
• Low-latency inter-FPGA communication
• Can provide strong network primitives
  • Reliable transport
  • Smart transport
• But this topology opens up other opportunities
Lightweight Transport Layer (LTL)
[Diagram: the LTL stack sits between the Elastic Router (a multi-virtual-channel on-chip router carrying header, data, and credits) and the 40G MAC+PHY facing the datacenter network. Send path: connection lookup, send connection table, transmit state machine, send frame queue, packetizer and transmit buffer, unack'd frame store, Ethernet encap. Receive path: Ethernet decap, receive connection table, receive state machine, depacketizer, ack receiver, ack generation, credit management. Solid links show data flow; dotted links show ACK flow]
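A minimal software sketch of the reliability mechanism the diagram implies: frames carry sequence numbers, unacknowledged frames are buffered for retransmit, and cumulative ACKs release them (the field names are assumptions for illustration, not the LTL wire format):

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// One frame on the wire: sequence number plus payload. The real LTL
// header also carries connection IDs and credits; this is a sketch.
struct Frame {
    uint32_t seq;
    std::vector<uint8_t> payload;
};

// Per-connection send state, mirroring the "unack'd frame store" and
// transmit state machine in the diagram above.
class Sender {
    uint32_t nextSeq = 0;
    std::deque<Frame> unacked;  // retransmit buffer
public:
    Frame send(std::vector<uint8_t> payload) {
        Frame f{nextSeq++, std::move(payload)};
        unacked.push_back(f);  // hold until cumulatively acked
        return f;              // hand off to Ethernet encap / MAC
    }
    // Cumulative ACK: everything with seq < ackSeq was delivered.
    void onAck(uint32_t ackSeq) {
        while (!unacked.empty() && unacked.front().seq < ackSeq)
            unacked.pop_front();
    }
    // On timeout, everything still here is retransmitted.
    const std::deque<Frame>& pending() const { return unacked; }
};
```

In hardware this state machine runs per connection at line rate, which is what makes FPGA-to-FPGA round trips so much cheaper than a software transport.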
LTL Enables Iron Channels
[Chart: FPGA-to-FPGA round-trip latencies over LTL, server to server across the network]
HaaS: Deploying Hardware Microservices
• Services may co-design with their local FPGAs or allocate a HaaS service remotely
• Currently Bing ranking co-locates SW and HW fabric, but decoupling is trivial
• Line-rate services (crypto) should be local
• Can pipeline services with no software intervention
[Diagram: racks behind ToRs and cluster switches host pools of hardware microservices: Bing ranking HW consumed by Bing ranking SW, plus large-scale deep learning, audio decode, HMM, and LB services]
Benefits of HaaS
• Decouple the CPU-to-FPGA usage ratio
  • Flexibility: some services need a large number of FPGAs; others underutilize theirs
• Share accelerators (oversubscription); see the allocator sketch below
  • One accelerator can often handle the load of multiple software clients
  • Consolidate underutilized FPGA accelerators into fewer shared instances
  • Increases efficiency and makes room for more accelerators
• Expose multiple accelerators to a service
  • Many datacenter services need to access multiple types of accelerators
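A toy sketch of the oversubscription idea: a pool allocator that attaches multiple software clients to shared FPGA instances (the names and the capacity model are assumptions for illustration, not the HaaS manager):

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical HaaS pool: each FPGA instance of a service can serve
// several clients, so fewer instances cover the same software load.
struct FpgaInstance {
    std::string service;  // e.g., "ranking", "crypto"
    int capacity;         // concurrent clients it can sustain
    int inUse = 0;
};

class HaasPool {
    std::vector<FpgaInstance> pool;
public:
    void add(std::string service, int capacity) {
        pool.push_back({std::move(service), capacity});
    }
    // Attach a client to the least-loaded instance of the service.
    std::optional<int> attach(const std::string& service) {
        int best = -1;
        for (int i = 0; i < (int)pool.size(); ++i)
            if (pool[i].service == service &&
                pool[i].inUse < pool[i].capacity &&
                (best < 0 || pool[i].inUse < pool[best].inUse))
                best = i;
        if (best < 0) return std::nullopt;  // pool exhausted: add capacity
        ++pool[best].inUse;
        return best;  // handle the client uses to address the FPGA
    }
};
```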
[Diagram: a shared HW plane of FPGAs serving a SW plane of servers; ranking and DNN workloads draw on the same hardware plane]
Programmable DNNs on FPGA
• Programmed in C++: no Verilog skills required
• Accelerator primitives:
  • Matrix-vector multiply
  • Vector-vector add/sub, multiply
  • Element-wise ops: sigmoid, tanh, etc.
• 1.2 TeraOps per FPGA in 16-bit fixed point (3,000 multiply-adds/cycle x 2 ops x 200MHz)
• No minibatch required

Aggressive ML: Scalable DNNs over HaaS
• Ultra-low-latency, high-throughput evaluation
• Long term: in-situ training for model freshness and training within compliance boundaries
• Achieve high ops/$ and ops/W vs. CPUs
[Diagram: an NN model's layers (L0, L1) partitioned across FPGAs over HaaS and LTL; each FPGA runs a vector engine with an instruction decoder & control block driving neural FUs]

Microsoft FPGA DNN goals
• 1.2 TOPs of 16-bit fixed-point inference in hundreds of μs or a few ms
• No Verilog expertise required
• Engines can be composed to support large-scale models
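To make the primitives concrete, a minimal sketch of a 16-bit fixed-point matrix-vector multiply with a wide accumulator, the core operation the engine parallelizes across its DSP blocks (the Q1.15 format choice here is an assumption):

```cpp
#include <cstdint>
#include <vector>

// 16-bit fixed point, assumed Q1.15 format: value = raw / 32768.
using Fixed16 = int16_t;

// y = W * x, accumulating in 64 bits to avoid overflow, then
// rescaling back to Q1.15 with saturation. The FPGA engine performs
// thousands of these multiply-adds every cycle.
std::vector<Fixed16> matVec(const std::vector<std::vector<Fixed16>>& W,
                            const std::vector<Fixed16>& x) {
    std::vector<Fixed16> y(W.size());
    for (size_t r = 0; r < W.size(); ++r) {
        int64_t acc = 0;  // wide accumulator
        for (size_t c = 0; c < x.size(); ++c)
            acc += (int32_t)W[r][c] * x[c];  // 16x16 -> 32-bit product
        acc >>= 15;  // rescale the Q2.30 sum back to Q1.15
        if (acc > INT16_MAX) acc = INT16_MAX;  // saturate
        if (acc < INT16_MIN) acc = INT16_MIN;
        y[r] = (Fixed16)acc;
    }
    return y;
}
```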
MS Engine: SW-Programmable DNN Engine
[Chart: GigaOps/sec (0-6,000) versus batch size (0-19) for a 3072x1024 matrix multiplication used in high-dimensional LSTM evaluation, with three series. Annotations: the operating range for online interactive services (e.g., search) sits at small batch sizes, where FPGAs excel at low-latency DNN inference; the peak performance of a low-power Nvidia M4 GPU is reached only at high batch sizes]
Low Level AI Representation (LLAIR)
Objectives:
• Serve pre-trained CNTK and TensorFlow models on FPGA, CPU, and other backends
• Export models to a framework-neutral Low Level AI Representation (LLAIR)
• Develop a federated, modular runtime for backward compatibility with existing DNN runtimes while allowing extensibility
[Diagram: a CNTK or TensorFlow model lowered to a framework-neutral LLAIR dataflow graph: a 1000-dim input vector is Split into 500-wide halves, fed through MatMul500 nodes against 500x500 matrices, combined with Add500 and Sigmoid500 nodes, and re-joined with Concat; the graph targets FPGA and CPU backends]
Federated Runtime Executes LLAIR Subgraphs
[Pipeline diagram: CNTK and TensorFlow exporters emit an LLAIR file; a transformer and subgraph compilers (custom, CNTK, FPGA) produce an optimized, partitioned LLAIR model bundle; a packager deploys the bundle to environments such as HaaS-AP, SearchGold, and SingleBox, where the federated runtime dispatches each subgraph to its backend (custom, CNTK, FPGA)]
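A skeletal sketch of the federated-runtime idea: a framework-neutral graph whose nodes are tagged with a backend, executed by dispatching each node to the matching engine (the node and backend names are illustrative, not the LLAIR schema):

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Framework-neutral op node: an op type, input edges, and the
// backend the partitioner assigned it to.
enum class Backend { FPGA, CNTK, Custom };

struct Node {
    std::string op;           // e.g., "MatMul", "Sigmoid"
    std::vector<int> inputs;  // indices of producer nodes
    Backend backend;          // set by the partitioner
};

using Tensor = std::vector<float>;
using Engine = std::function<Tensor(const Node&, const std::vector<Tensor>&)>;

// The federated runtime walks the graph in topological order and
// hands each node to the engine registered for its backend.
Tensor run(const std::vector<Node>& graph,  // assumed topologically sorted
           const std::map<Backend, Engine>& engines) {
    std::vector<Tensor> results(graph.size());
    for (size_t i = 0; i < graph.size(); ++i) {
        std::vector<Tensor> ins;
        for (int j : graph[i].inputs) ins.push_back(results[j]);
        results[i] = engines.at(graph[i].backend)(graph[i], ins);
    }
    return results.back();  // output of the final node
}
```

This is what lets one model mix FPGA-resident subgraphs with CPU fallbacks while keeping the exported representation framework-neutral.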
Configurable Clouds will Change the World
• Ability to reprogram a datacenter's hardware protocols: networking, storage, security
• Can turn homogeneous machines into specialized SKUs dynamically
• Unprecedented performance and low latency at hyperscale: exa-ops of performance with a 10 microsecond diameter
• What would you do with the world's most powerful fabric?

Exa-ops at 10μs
© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.