Microsoft's Production Configurable Cloud
Mark Russinovich, Chief Technology Officer, Microsoft Azure
Context in 2010
• Moore's Law was fine
• More cores, but single-thread performance gains slowing
• No real focus on datacenter accelerators
• FPGAs still strong in their traditional markets
  • But a non-consensus bet for compute
• But there were storm clouds on the horizon
• >90% of the Fortune 500 are on the Microsoft Cloud
• >80% of the world's largest banks are Azure customers
• >75% of G-SIFIs have signed enterprise agreements with Azure
• Compliance: FedRAMP High, DISA IL-4, ITAR, CJIS
Regions
38 Azure regions (more than AWS and Google combined):
Central US (Iowa), West US (California), East US (Virginia), East US 2 (Virginia), North Central US (Illinois), South Central US (Texas), US Gov Virginia, US Gov Iowa, US DoD East (Virginia), US DoD West (Iowa), Canada Central (Toronto), Canada East (Quebec City), Brazil South (Sao Paulo State), North Europe (Ireland), West Europe (Netherlands), United Kingdom, Germany Central (Frankfurt), Germany North East (Magdeburg), France (Paris), France (Marseille), China North (Beijing), China South (Shanghai), Japan East (Tokyo, Saitama), Japan West (Osaka), Korea (Seoul), Korea (Busan), East Asia (Hong Kong), SE Asia (Singapore), India Central (Pune), India South (Chennai), India West (Mumbai), Australia East (New South Wales), Australia South East (Victoria)
[Quadrant diagram: Training vs. Inference across Client and Cloud, with labels Humans, ASICs, GPUs, and "?" for cloud acceleration]
What Drives a Post-CPU “Enhanced” Cloud?
[Diagram: FPGAs positioned between the generality of CPUs and the efficiency of ASICs, and between homogeneity and accelerators, for cloud applications]
[Chart: % of servers (0-100) vs. time the workload is stable (years, 1-5); the long-stability end of the distribution is ideal for ASICs]
What is FPGA Technology?
• Field Programmable Gate Array: programmable hardware
  • Can be rewritten with a new image (bitstream) in seconds, soon in hundreds of ms
• Chip has large quantities of programmable units: network, memories, and logic (LUTs)
• Program specialized circuits that communicate directly
  • Stored as bit tables rather than polygons of materials
  • Can build functional units, state machines, networking circuits, etc.
• Programmed in the same languages (e.g., Verilog) used to design ASIC chips
• FPGA chips are now large SoCs: thousands of hardened DSP blocks, DRAM controllers, PCIe controllers, and now ARM cores
• Now: a growing process gap between ASICs and "big iron" (CPU, FPGA, GPU)
[FPGA die diagram: generic logic (LUTs), DSP multiplier blocks, 20 Kb dual-port RAMs, and specialized I/O blocks (network, PCIe)]
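As an illustrative sketch of the "bit tables" idea above (not how real FPGA tools work): a k-input LUT is just a 2^k-entry truth table, and the bitstream is what fills those tables. A minimal C++ model of a 4-input LUT programmed as a "three or more inputs high" function:

```cpp
#include <bitset>
#include <iostream>

// Software model of a 4-input LUT: the "configuration" is a 16-bit
// truth table, exactly what a bitstream loads into the fabric.
struct Lut4 {
    std::bitset<16> truth;  // one output bit per input combination
    bool eval(bool a, bool b, bool c, bool d) const {
        unsigned idx = (a << 3) | (b << 2) | (c << 1) | d;
        return truth[idx];
    }
};

int main() {
    Lut4 lut{};
    // "Program" the LUT: output 1 when at least 3 of 4 inputs are 1.
    for (unsigned idx = 0; idx < 16; ++idx)
        lut.truth[idx] = std::bitset<4>(idx).count() >= 3;
    std::cout << lut.eval(true, true, true, false) << "\n";  // prints 1
}
```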
First try: v0
• Use commodity SuperMicro servers
  • 1U rack-mounted, 2 x 10GbE ports, 3 x16 PCIe slots, 12 Intel Westmere cores (2 sockets)
• 6 Xilinx LX240T FPGAs
• One appliance per rack
• All rack machines communicate over 1Gb Ethernet
No production:
• Additional single point of failure
• Additional SKU to maintain
• Too much load on the network
• Inelastic FPGA scaling or stranded capacity
Second try: v1
• Altera Stratix V D5
  • 172.6K ALMs, 2,014 M20Ks (457 KLEs; 1 KLE ≈ 12K gates; an M20K is a 2.5KB SRAM)
• PCIe Gen3 x8, 8GB DDR3
• 20 Gb network among FPGAs
[Card diagram: Stratix V with 8GB DDR3 and PCIe Gen3 x8]
Mapped Fabric into a Pod
• 1 pod = 48 servers, occupying one half-rack with a 48-port 10G TOR switch
• 1 server = 2 sockets, 64GB RAM, 2TB SSD storage
• FPGA network: 20Gb (2 x 10Gb) links to N/S/E/W neighbors in a 2-D torus topology (a 6x8 torus; see the sketch below the diagram)
• Offered capabilities:
  • Low-latency access to a local FPGA
  • Compose multiple FPGAs to accelerate large workloads
  • Low-latency, high-bandwidth sharing of storage and memory across server boundaries
[Diagram: Servers 1-48, each with an FPGA, hang off the top-of-rack switch (TOR); the FPGAs form a separate 6x8 torus over dedicated 10Gb Ethernet links]
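A minimal sketch of the torus wiring math, assuming a row-major numbering of the 48 FPGAs (the production cable plan may differ): every node has exactly four neighbors, with links wrapping around the edges.

```cpp
#include <array>
#include <cstdio>

// Hypothetical row-major numbering of the 6x8 FPGA torus; the real
// pod wiring may differ, but the wraparound neighbor math is the same.
constexpr int ROWS = 6, COLS = 8;

std::array<int, 4> torusNeighbors(int id) {
    int r = id / COLS, c = id % COLS;
    auto idx = [](int rr, int cc) { return rr * COLS + cc; };
    return {
        idx((r + ROWS - 1) % ROWS, c),  // north (wraps top to bottom)
        idx((r + 1) % ROWS, c),         // south
        idx(r, (c + 1) % COLS),         // east
        idx(r, (c + COLS - 1) % COLS),  // west (wraps left to right)
    };
}

int main() {
    for (int n : torusNeighbors(0))  // even a corner node has 4 links
        std::printf("%d ", n);       // prints: 40 8 1 7
    std::printf("\n");
}
```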
Built Three Programmable Engines for Bing
• FE: feature extraction state machines with stream preprocessing FSMs feeding a feature transmission network
  • FE0: 89 non-BodyBlock features, 34 state machines, 55% utilization
  • FE1: 55 BodyBlock features, 20 state machines, 45% utilization
• FFE: free-form expression cores built from basic tiles with local and complex ALUs (ln, divide), registers, constants, per-instance instruction stores, scheduling logic, and distribution latches carrying control/data tokens; 64 cores/chip, 256-512 threads
• DTS: decision-tree scoring tiles (DTTs) with compression thresholds; 48 DTT tiles/chip, 240 tree processors, 2,880 trees/chip
[Diagram: arrays of FFE cores and DTT tiles laid out across the FPGA]
1,632-server pilot deployed in BN2
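To make the DTS stage concrete, here is a minimal, hypothetical sketch of scoring an ensemble of decision trees, the operation the DTT tiles parallelize in hardware (the node layout and field names are illustrative, not the FPGA implementation):

```cpp
#include <vector>

// One node of a binary decision tree: compare a feature against a
// threshold and branch; leaves carry a partial score.
struct Node {
    int feature;      // index into the feature vector (-1 for a leaf)
    float threshold;  // branch left if feature value < threshold
    int left, right;  // child indices within the tree's node array
    float value;      // leaf contribution to the document score
};

using Tree = std::vector<Node>;

float scoreTree(const Tree& t, const std::vector<float>& features) {
    int i = 0;
    while (t[i].feature >= 0)  // walk until a leaf is reached
        i = features[t[i].feature] < t[i].threshold ? t[i].left : t[i].right;
    return t[i].value;
}

// The ensemble score is the sum over all trees; in hardware, 240 tree
// processors walk trees in parallel and the partial scores are summed.
float scoreEnsemble(const std::vector<Tree>& trees,
                    const std::vector<float>& features) {
    float score = 0.0f;
    for (const Tree& t : trees) score += scoreTree(t, features);
    return score;
}
```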
No production:
• Microsoft was converging on a single SKU, and no one else wanted the secondary network
  • Complex, difficult to handle failures; difficult to service boxes
• No killer infrastructure accelerator; the application presence was too small
Hyperscale SDN: Building the Right Abstractions
• Management plane: create a tenant (traditionally in a proprietary appliance; in Azure, the Azure Resource Manager)
• Control plane: plumb tenant ACLs to switches (in Azure, the SDN controller)
• Data plane: apply ACLs to flows (in Azure, the switch, i.e., the host)
Key to flexibility and scale is Host SDN
Virtual Filtering Platform (VFP)
• Acts as a virtual switch inside the Hyper-V VMSwitch
• Provides core SDN functionality for Azure networking services, including:
  • Address virtualization for VNET
  • VIP -> DIP translation for SLB
  • ACLs, metering, and security guards
• Uses programmable rule/flow tables to perform per-packet actions
• Supports all Azure data plane policy at 40GbE+ with offloads
• Coming to private cloud in Windows Server 2016
[Diagram: on host 10.4.1.5, VM vNICs attach to the VMSwitch; VFP layers (ACLs/metering/security, VNET, SLB NAT) sit between the vNICs and the physical NIC]
Flow Tables: the Right Abstraction for the Host
• VMSwitch exposes a typed match-action-table API to the controller
  • Controllers define policy; one table per policy
• Key insight: let the controller tell the switch exactly what to do with which packets (e.g., encap/decap) rather than trying to use existing abstractions (tunnels, ...)
[Diagram: the controller compiles a tenant description (VNet description, VNet routing policy, ACLs, NAT endpoints) into per-VM flow tables in VFP, between VM1 (10.1.1.2) and the NIC]

VNET table:
  TO: 10.2/16    -> Encap to GW
  TO: 10.1.1.5   -> Encap to 10.5.1.7
  TO: !10/8      -> NAT out of VNET
LB NAT table:
  TO: 79.3.1.2   -> DNAT to 10.1.1.2
  TO: !10/8      -> SNAT to 79.3.1.2
ACLs table:
  TO: 10.1.1/24  -> Allow
  TO: 10.4/16    -> Block
  TO: !10/8      -> Allow
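A minimal sketch of the match-action idea behind the tables above, assuming a simplified destination-prefix match (real VFP conditions and actions are far richer and match on full headers):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Simplified packet: just a destination IPv4 address.
struct Packet { uint32_t dstIp; };

// One match-action rule: a destination prefix plus an action to run.
struct Rule {
    uint32_t prefix;
    int prefixLen;                        // bits that must match
    bool negate;                          // models the "!10/8" style rules
    std::function<void(Packet&)> action;  // e.g., encap, NAT, allow/block
    bool matches(uint32_t ip) const {
        uint32_t mask = prefixLen ? ~0u << (32 - prefixLen) : 0;
        bool hit = (ip & mask) == (prefix & mask);
        return negate ? !hit : hit;
    }
};

// A policy layer is a table of rules; packets traverse the layers
// (VNET -> NAT -> ACL) and the first matching rule in each layer acts.
using Table = std::vector<Rule>;

void process(Packet& p, const std::vector<Table>& layers) {
    for (const Table& t : layers)
        for (const Rule& r : t)
            if (r.matches(p.dstIp)) { r.action(p); break; }
}
```

The design point matches the slide's key insight: the controller populates the tables with exact actions, so the switch never has to infer behavior from a generic tunnel abstraction.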
Host SDN Scale Challenges
• Hosts are scaling up: 1G -> 10G -> 40G -> 50G -> 100G
  • Reduces COGS of VMs (more VMs per host) and enables new workloads
  • Need the performance of hardware to implement policy without burning CPU
• Need to support new scenarios: BYO IP, BYO topology, BYO appliance
  • We are always pushing richer semantics to virtual networks
  • Need the programmability of software to be agile and future-proof
How do we get the performance of hardware with the programmability of software?
Azure SmartNIC
• Use an FPGA for reconfigurable functions
  • FPGAs are already used in Bing (Catapult)
  • Roll out hardware the way we roll out software
• Programmed using Generic Flow Tables (GFT)
  • A language for programming SDN policy into hardware
  • Uses connections and structured actions as primitives
• SmartNIC can also do crypto, QoS, storage acceleration, and more...
[Diagram: the SmartNIC (NIC ASIC plus FPGA) sits between the host CPU and the ToR]
[Diagram: the first packet of a connection takes the slow path through the VMSwitch, where VFP computes the composed action (e.g., Decap, DNAT, Rewrite, Meter for 1.2.3.1 -> 1.3.4.1, 62362 -> 80) and installs it through the GFT offload API (NDIS) into the GFT table; the 50G SmartNIC's GFT offload engine (alongside QoS, crypto, and RDMA) then handles all subsequent packets; see the sketch below]
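A sketch of that first-packet/fast-path split, assuming a simplified exact-match key (`FlowKey`) and a stand-in `evaluatePolicy` for the full VFP layer traversal (neither is the real GFT interface):

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <unordered_map>

// Exact-match key for a connection; a real GFT key covers the full
// 5-tuple plus encapsulation headers. Packed into 64 bits here.
using FlowKey = uint64_t;
using Action = std::function<void()>;

// Hardware table on the SmartNIC: exact-match flow -> composed action.
struct GftOffloadEngine {
    std::unordered_map<FlowKey, Action> table;
    std::optional<Action> lookup(FlowKey k) const {
        auto it = table.find(k);
        if (it == table.end()) return std::nullopt;
        return it->second;
    }
};

// Host slow path: VFP evaluates the full layered policy once, then
// installs the resulting composed action so later packets of the
// connection never touch the CPU.
void onPacket(FlowKey k, GftOffloadEngine& nic,
              const std::function<Action(FlowKey)>& evaluatePolicy) {
    if (auto act = nic.lookup(k)) {
        (*act)();                             // fast path: hardware acts
    } else {
        Action composed = evaluatePolicy(k);  // first packet: slow path
        nic.table.emplace(k, composed);       // offload to the FPGA
        composed();
    }
}
```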
[Diagram: the SmartNIC transposition engine composes the per-layer rules pushed by the controllers (SLB decap -> Decap*, SLB NAT -> DNAT*, VNET -> Rewrite*, ACL -> Allow*, metering -> Meter*) into a single packet rewrite, plus encap on the outbound path]
Catapult V2: This one works
[Diagram: WCS 2.0 server blade with Catapult v2: the FPGA sits as a bump-in-the-wire between the 40Gb/s NIC and the 40Gb/s QSFP link to the ToR, with PCIe Gen3 (2x8 and x8) connectivity to the two QPI-connected CPUs and its own DRAM]
[Photos: WCS Gen4.1 blade with Mellanox NIC and Catapult FPGA ("Pikes Peak"), the WCS tray backplane, option card mezzanine connectors, and the Catapult v2 mezzanine card]
• The architecture justifies the economics:
  1. Can act as a local compute accelerator
  2. Can act as a network/storage accelerator
  3. Can act as a remote compute accelerator
Configurable Cloud: a CPU compute layer plus a reconfigurable compute layer over a converged network
Local acceleration
Production Results (December 2015)
[Chart: 99.9th-percentile query latency versus queries/sec for Bing ranking in software vs. FPGA, plotting average software load, 99.9% software latency, 99.9% FPGA latency, and average FPGA query load]
Infrastructure acceleration
Azure Accelerated Networking: Fastest Cloud Network!
• Highest-bandwidth VMs of any cloud: DS15v2 and D15v2 VMs get up to 25Gbps with <25μs latency
• Consistent low-latency network performance
  • Provides SR-IOV to the VM
  • 10x latency improvement
  • Increased packets per second (PPS)
  • Reduced jitter means more consistency in workloads
• Enables workloads requiring native performance to run in cloud VMs
  • >2x improvement for many DB and OLTP applications
Accelerated Networking Internals
• SDN/networking policy is defined in software on the host
• FPGA acceleration is used to apply all policies to traffic
Remote acceleration
Azure Data Center Network Fabrics with 40G NICs
[Diagram: Clos fabric: racks of 20 servers under T0 ToR switches, T1 row spines (T1-1 ... T1-8), T2 data center spines, and T3 regional spines]
• >10 years of experience, with major revisions every six months
• Scale-out, active-active: up to 128 switches wide!
• Microsoft software on merchant silicon
  • Switch Abstraction Interface (SAI)
  • SONiC: Linux-based switch firmware
  • OCP support
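The fabric stays active-active by spreading flows across equal-cost paths at each tier; a minimal sketch of the standard ECMP idea (illustrative only, not SONiC code):

```cpp
#include <cstdint>
#include <functional>

// ECMP illustration: a flow's 5-tuple hashes to one of N equal-cost
// next hops, so load spreads across a spine up to 128 switches wide
// while all packets of one flow stay on one path (no reordering).
struct FiveTuple {
    uint32_t srcIp, dstIp;
    uint16_t srcPort, dstPort;
    uint8_t proto;
};

int pickNextHop(const FiveTuple& t, int numPaths) {
    // Any stable hash works; real switches use configurable hardware
    // hash functions. std::hash over a packed value suffices here.
    uint64_t packed = (uint64_t)t.srcIp << 32 | t.dstIp;
    packed ^= (uint64_t)t.srcPort << 48 | (uint64_t)t.dstPort << 32 | t.proto;
    return (int)(std::hash<uint64_t>{}(packed) % numPaths);
}
```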
Architecture of a Configurable Cloud
[Diagram: two racks shown, each ToR serving four FPGA+NIC servers at L0, connected up through L1/L2 switch tiers (CS0-CS3, SP0-SP3), so FPGAs can reach each other across the datacenter network]
• FPGAs can encapsulate their own UDP packets
• Low-latency inter-FPGA communication
• Can provide strong network primitives
  • Reliable transport
  • Smart transport
• But this topology opens up other opportunities
Lightweight Transport Layer (LTL)
[Diagram: the LTL stack sits between the Elastic Router (a multi-virtual-channel on-chip router carrying header, data, and credits) and the 40G MAC+PHY facing the datacenter network. Send path: connection lookup, send connection table, transmit state machine, send frame queue, packetizer and transmit buffer, unack'd frame store, Ethernet encap. Receive path: Ethernet decap, receive connection table, receive state machine, depacketizer, ack receiver, ack generation, credit management. Solid links show data flow; dotted links show ACK flow]
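A minimal software sketch of the reliability mechanism the diagram implies: frames carry sequence numbers, unacknowledged frames are buffered for retransmit, and cumulative ACKs release them (the field names are assumptions for illustration, not the LTL wire format):

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// One frame on the wire: sequence number plus payload. The real LTL
// header also carries connection IDs and credits; this is a sketch.
struct Frame {
    uint32_t seq;
    std::vector<uint8_t> payload;
};

// Per-connection send state, mirroring the "unack'd frame store" and
// transmit state machine in the diagram above.
class Sender {
    uint32_t nextSeq = 0;
    std::deque<Frame> unacked;  // retransmit buffer
public:
    Frame send(std::vector<uint8_t> payload) {
        Frame f{nextSeq++, std::move(payload)};
        unacked.push_back(f);  // hold until cumulatively acked
        return f;              // hand off to Ethernet encap / MAC
    }
    // Cumulative ACK: everything with seq < ackSeq was delivered.
    void onAck(uint32_t ackSeq) {
        while (!unacked.empty() && unacked.front().seq < ackSeq)
            unacked.pop_front();
    }
    // On timeout, everything still here is retransmitted.
    const std::deque<Frame>& pending() const { return unacked; }
};
```

In hardware this state machine runs per connection at line rate, which is what makes FPGA-to-FPGA round trips so much cheaper than a software transport.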
LTL Enables Iron Channels
[Chart: FPGA-to-FPGA round-trip latencies over LTL, server to server across the network]
HaaS: Deploying Hardware Microservices
• Services may co-design with their local FPGAs or allocate a HaaS service remotely
• Currently Bing ranking co-locates SW and HW fabric, but decoupling is trivial
• Line-rate services (crypto) should be local
• Can pipeline services with no software intervention
[Diagram: racks behind ToRs and cluster switches host pools of hardware microservices: Bing ranking HW consumed by Bing ranking SW, plus large-scale deep learning, audio decode, HMM, and LB services]
Benefits of HaaS
• Decouple the CPU-to-FPGA usage ratio
  • Flexibility: some services need a large number of FPGAs; others underutilize theirs
• Share accelerators (oversubscription); see the allocator sketch below
  • One accelerator can often handle the load of multiple software clients
  • Consolidate underutilized FPGA accelerators into fewer shared instances
  • Increases efficiency and makes room for more accelerators
• Expose multiple accelerators to a service
  • Many datacenter services need to access multiple types of accelerators
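A toy sketch of the oversubscription idea: a pool allocator that attaches multiple software clients to shared FPGA instances (the names and the capacity model are assumptions for illustration, not the HaaS manager):

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical HaaS pool: each FPGA instance of a service can serve
// several clients, so fewer instances cover the same software load.
struct FpgaInstance {
    std::string service;  // e.g., "ranking", "crypto"
    int capacity;         // concurrent clients it can sustain
    int inUse = 0;
};

class HaasPool {
    std::vector<FpgaInstance> pool;
public:
    void add(std::string service, int capacity) {
        pool.push_back({std::move(service), capacity});
    }
    // Attach a client to the least-loaded instance of the service.
    std::optional<int> attach(const std::string& service) {
        int best = -1;
        for (int i = 0; i < (int)pool.size(); ++i)
            if (pool[i].service == service &&
                pool[i].inUse < pool[i].capacity &&
                (best < 0 || pool[i].inUse < pool[best].inUse))
                best = i;
        if (best < 0) return std::nullopt;  // pool exhausted: add capacity
        ++pool[best].inUse;
        return best;  // handle the client uses to address the FPGA
    }
};
```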
[Diagram: a shared HW plane of FPGAs serving a SW plane of servers; ranking and DNN workloads draw on the same hardware plane]
Programmable DNNs on FPGA
• Programmed in C++: no Verilog skills required
• Accelerator primitives:
  • Matrix-vector multiply
  • Vector-vector add/sub, multiply
  • Element-wise ops: sigmoid, tanh, etc.
• 1.2 TeraOps per FPGA in 16-bit fixed point (3,000 multiply-adds/cycle x 2 ops x 200MHz)
• No minibatch required

Aggressive ML: Scalable DNNs over HaaS
• Ultra-low-latency, high-throughput evaluation
• Long term: in-situ training for model freshness and training within compliance boundaries
• Achieve high ops/$ and ops/W vs. CPUs
[Diagram: an NN model's layers (L0, L1) partitioned across FPGAs over HaaS and LTL; each FPGA runs a vector engine with an instruction decoder & control block driving neural FUs]

Microsoft FPGA DNN goals
• 1.2 TOPs of 16-bit fixed-point inference in hundreds of μs or a few ms
• No Verilog expertise required
• Engines can be composed to support large-scale models
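To make the primitives concrete, a minimal sketch of a 16-bit fixed-point matrix-vector multiply with a wide accumulator, the core operation the engine parallelizes across its DSP blocks (the Q1.15 format choice here is an assumption):

```cpp
#include <cstdint>
#include <vector>

// 16-bit fixed point, assumed Q1.15 format: value = raw / 32768.
using Fixed16 = int16_t;

// y = W * x, accumulating in 64 bits to avoid overflow, then
// rescaling back to Q1.15 with saturation. The FPGA engine performs
// thousands of these multiply-adds every cycle.
std::vector<Fixed16> matVec(const std::vector<std::vector<Fixed16>>& W,
                            const std::vector<Fixed16>& x) {
    std::vector<Fixed16> y(W.size());
    for (size_t r = 0; r < W.size(); ++r) {
        int64_t acc = 0;  // wide accumulator
        for (size_t c = 0; c < x.size(); ++c)
            acc += (int32_t)W[r][c] * x[c];  // 16x16 -> 32-bit product
        acc >>= 15;  // rescale the Q2.30 sum back to Q1.15
        if (acc > INT16_MAX) acc = INT16_MAX;  // saturate
        if (acc < INT16_MIN) acc = INT16_MIN;
        y[r] = (Fixed16)acc;
    }
    return y;
}
```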
MS Engine: SW-Programmable DNN Engine
[Chart: GigaOps/sec (0-6,000) versus batch size (0-19) for a 3072x1024 matrix multiplication used in high-dimensional LSTM evaluation, with three series. Annotations: the operating range for online interactive services (e.g., search) sits at small batch sizes, where FPGAs excel at low-latency DNN inference; the peak performance of a low-power Nvidia M4 GPU is reached only at high batch sizes]
Low Level AI Representation (LLAIR)
Objectives:
• Serve pre-trained CNTK and TensorFlow models on FPGA, CPU, and other backends
• Export models to a framework-neutral Low Level AI Representation (LLAIR)
• Develop a federated, modular runtime for backward compatibility with existing DNN runtimes while allowing extensibility
[Diagram: a CNTK or TensorFlow model lowered to a framework-neutral LLAIR dataflow graph: a 1000-dim input vector is Split into 500-wide halves, fed through MatMul500 nodes against 500x500 matrices, combined with Add500 and Sigmoid500 nodes, and re-joined with Concat; the graph targets FPGA and CPU backends]
Federated Runtime Executes LLAIR Subgraphs
[Pipeline diagram: CNTK and TensorFlow exporters emit an LLAIR file; a transformer and subgraph compilers (custom, CNTK, FPGA) produce an optimized, partitioned LLAIR model bundle; a packager deploys the bundle to environments such as HaaS-AP, SearchGold, and SingleBox, where the federated runtime dispatches each subgraph to its backend (custom, CNTK, FPGA)]
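A skeletal sketch of the federated-runtime idea: a framework-neutral graph whose nodes are tagged with a backend, executed by dispatching each node to the matching engine (the node and backend names are illustrative, not the LLAIR schema):

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Framework-neutral op node: an op type, input edges, and the
// backend the partitioner assigned it to.
enum class Backend { FPGA, CNTK, Custom };

struct Node {
    std::string op;           // e.g., "MatMul", "Sigmoid"
    std::vector<int> inputs;  // indices of producer nodes
    Backend backend;          // set by the partitioner
};

using Tensor = std::vector<float>;
using Engine = std::function<Tensor(const Node&, const std::vector<Tensor>&)>;

// The federated runtime walks the graph in topological order and
// hands each node to the engine registered for its backend.
Tensor run(const std::vector<Node>& graph,  // assumed topologically sorted
           const std::map<Backend, Engine>& engines) {
    std::vector<Tensor> results(graph.size());
    for (size_t i = 0; i < graph.size(); ++i) {
        std::vector<Tensor> ins;
        for (int j : graph[i].inputs) ins.push_back(results[j]);
        results[i] = engines.at(graph[i].backend)(graph[i], ins);
    }
    return results.back();  // output of the final node
}
```

This is what lets one model mix FPGA-resident subgraphs with CPU fallbacks while keeping the exported representation framework-neutral.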
Configurable Clouds will Change the World
• Ability to reprogram a datacenter's hardware protocols: networking, storage, security
• Can turn homogeneous machines into specialized SKUs dynamically
• Unprecedented performance and low latency at hyperscale: exa-ops of performance with a 10 microsecond diameter
• What would you do with the world's most powerful fabric?

Exa-ops at 10μs
© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.