Overview of Smart NICs II:Programmable Data Planes and
“Reflex Control”Sean Choi, Muhammad Shahbaz, Balaji Prabhakar and Mendel Rosenblum
CPU
RAMGPU
NIC
NICCPU GPU
PCI-eRDMA
Reflex Control on SmartNICs
• Reflex Control: Applications that need instantaneous local response
• Motivation for using SmartNICs for Reflex Control
• Low Latency, bump-in-the-wire hardwarethat exists in every server
• Programmable Processors (NPUs)(ASIC and FPGA based)
• Familiar Languages for Programming NPUS(P4, eBPF and C)
Netronome SmartNICs (ASIC-based)
• Programmable NPUs capable up to 100G
• Contains up to 120 Cores @ 1.2Ghz and 8GB DDR3 RAM
• Programmable via P4 and Micro-C
• Active open-source community (openNFP.org)
• Currently used to offload OpenvSwitch and other network functions
Programming Netronome SmartNICs: P4
• P4: A domain-specific language developed for expressing how the pipeline of a network device should process packets.• P4 Abstract Forwarding Model
• Limited language constructs for actions• Use cases of P4 in Netronome NICs: VM Switching, Simple Firewall
PacketParser
Packet Deparser
CustomMatch-Action
Tables
Ingress Egress
Programming Netronome SmartNICs: Micro-C
• Micro-C: C-like language developed by Netronome for NPUs• Complements P4 with Custom Actions• Contains added constructs for accessing different types of memory• Not a full-blown C• No floating point types or recursions, and limited standard library functions
• Use Cases: Deep Packet Inspection, Custom Functions
Architecture of Netronome SmartNICs
• Each NPU (Micro-Engine, or ME) is grouped into a unit called island• Each ME has its own instructions and data, and can run 8 threads
(each ME can run different programs)• There also are shared memory and special execution units for shared data
Code
Data
IslandME
Code
Data
MECode
Data
MECode
Data
ME
Architecture of Netronome SmartNICs
• Netronome memories are organized into hierarchies• Within NPU: Registers (1 cycle), Local Memory (1 to 3 cycles)• Within Island:
Cluster Local Scratch (20 to 50 cycles), Cluster Target Memory (50 to 100 cycles)• Outside of Island:
Internal Memory (150 to 250 cycles), External Memory (150 to 500 cycles)
IslandLocal Mem: 4KB
CLS: 64KB
CTM: 256k
IMEM: 8MB EMEM: 8GB
Evaluation
• Type of Serverless Workloads
1. Short Workload without External Data Dependency• Simple server sending simple response
2. Short Workload with External Data Dependency• API backend making queries to remote memcached Database
3. Long Workload with External Data Dependency• Image transformer (Grayscale by averaging over RGB channels)
Architecture Overview
• Compute Architectures: (SmartNIC (λ-NIC) vs. Container vs. Bare-Metal)
ParseHeader
MatchWorkload
SmartNICWorkload (W1)
Container Engine
Operating System
Container Workload (W2)
W2 W3W1
SmartNIC
Worker Node
ResponseRequest
Bare-MetalProcess (W3)
M2SmartNIC
Container
M3SmartNIC
Container
M4SmartNIC
Container
M5SmartNIC
Container
RequestGenerator
M1
Switch
Evaluation
• Experimental Testbed• 5x Dell R640 1U Server• Intel Xeon Gold 5117 14 Cores @ 2Ghz• 32GB DDR4 RAM• 120GB SSD• Netronome CX 10Gb SmartNIC
• 56 Cores @ 633MHz• 2GB RAM
• 10Gb Arista Switch
Evaluation
• Average Latency (ms)• vs. Container: 5x to 880x improvements• vs. Bare-Metal: 3x to 32x improvements
λ-NIC Bare-Metal Container
Simple Server 0.05 1.64 (+32x) 45.56 (+880x)
Memcached GET 0.11 1.69 49.93Image Transform 199.80 656.20 (+3x) 943.00 (+5x)
Evaluation
• Workload Throughput (up to 734x improvement)• Measuring end-to-end time (s) for completing 10,000 Requests
1. Sequentially by 1 thread2. In Parallel by 56 threads.
1.21 0.3817.03 9.96
515.69
38.850
100
200
300
400
500
600
Memcached Get1 Thread
Memcached Get56 Threads
!-NIC Bare Metal
0.62 0.1816.37 9.62
455.18
34.640
100
200
300
400
500
Simple Server1 Thread
Simple Server56 Threads
Wor
k Co
mpl
etio
n Ti
me
(s)
!-NIC Bare Metal Container
1923.81
48.68
5932.76
240.82
9395.26
720.21
0
2000
4000
6000
8000
10000
Image Transform1 Thread
Image Transform56 Threads
!-NIC Bare Metal Container