1 of 20
Smart-NICs: Power Proxying for Reduced Power Consumption in Network Edge Devices
Karthikeyan Sabhanatarajan, Ann Gordon-Ross+, Mark Oden, Mukund Navada, Alan D. George+
High Performance Computing and Simulation Research Laboratory
Department of Electrical and Computer Engineering
University of Florida, Gainesville
This work was supported by the U.S. National Science Foundation
+ Also affiliated with NSF Center for High-Performance Reconfigurable Computing
2 of 20
Introduction
[Figure: edge devices connected to the Internet]
3 of 20
Introduction
• Connected edge devices account for 2% of the total power consumed in the US [EPA-06]
  – 130 TWh/Year
    • This is $13 billion @ $0.10 per kWh
    • 1 single-unit nuclear power plant outputs 8 TWh/Year
    • Translates to 16 single-unit nuclear power plants!
• Why so much power?
  – PCs can consume up to 200 W
  – 1 billion PCs worldwide by 2010 [Kanellos-04]
• What can we do?
  – PCs are idle 75% of the time [Purushothaman-06]
  – But only 10% of PCs are allowed to sleep during that time [EPA-06]
  – Sleeping reduces power consumption by 80% or more
  – If PCs were allowed to sleep, only 3 single-unit nuclear power plants would be required
Question: Why aren’t these PCs asleep?!?!
4 of 20
Maintaining Network Connectivity
[Figure: Alice and Bob are connected through the Internet and run a Gnutella file sharing application. Alice checks to see if Bob has a file needed for P2P file sharing by sending a file query packet; Bob's idle PC answers with a file response packet. The animation then shows Bob's PC asleep when a file query packet arrives.]
Problem: PC must be awake to maintain network connectivity
5 of 20
A Solution – Power Proxying
• Primary challenge is to maintain network connectivity while the PC is powered down to standby mode (sleeping)
• Some packets do not require a complex response
  – Automated responses are sufficient
  – Network Interface Card (NIC) can act as a proxy for the PC
  – Allow the PC to sleep while the NIC services packets with automated responses
  – A technique known as power proxying
  – We call such a NIC a “Smart”-NIC (SNIC)
6 of 20
Power Proxying
[Figure: Alice and Bob are connected through the Internet and run a Gnutella file sharing application. Bob's idle PC is asleep; Alice's file query packet is answered with a file response packet.]
PC delegates power to the SNIC to handle network traffic
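7 of 20
Power Proxying
[Figure: Bob's idle PC sleeps behind the SNIC, which receives three kinds of traffic from the Internet: a proxiable packet, answered with a response from the SNIC; a chatter packet; and a non-proxiable/wake-up packet, answered with a response from Bob.]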
8 of 20
What to Proxy? - Proxiable Protocols
• Proxiable protocols - Network protocols amenable to proxying
  – Responses may be automated
  – Keep-alive packets, IP conflict avoidance, etc.
[Figure: an idle, sleeping PC receives ARP query / ARP response, ping / ping response, P2P file query / P2P response, and mail notification traffic.]
FOUR Categories of Proxiable Packets
• ARP (Address Resolution Protocol)
• ICMP (Internet Control Message Protocol)
• TCP (Transmission Control Protocol)
• UDP (User Datagram Protocol)
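To make "responses may be automated" concrete, here is a minimal sketch of how an ARP query could be answered without waking the PC. It is an illustration only, not the SNIC's implementation: the function name `build_arp_reply` and its parameters are hypothetical, and only the 28-byte ARP body for Ethernet/IPv4 is handled (the Ethernet frame header is left out).

```python
import struct

# ARP body for Ethernet/IPv4: htype, ptype, hlen, plen, oper,
# sender MAC, sender IP, target MAC, target IP (28 bytes in total).
ARP_FMT = "!HHBBH6s4s6s4s"
ARP_REQUEST, ARP_REPLY = 1, 2


def build_arp_reply(request: bytes, my_mac: bytes, my_ip: bytes):
    """Return an ARP reply body for a query aimed at my_ip, else None.

    Hypothetical helper: my_mac and my_ip stand for the sleeping PC's
    addresses, which the proxy would have to be given before the PC sleeps.
    """
    size = struct.calcsize(ARP_FMT)
    if len(request) < size:
        return None
    htype, ptype, hlen, plen, oper, sha, spa, _tha, tpa = struct.unpack(
        ARP_FMT, request[:size])
    # Only answer Ethernet/IPv4 requests.
    if (htype, ptype, hlen, plen, oper) != (1, 0x0800, 6, 4, ARP_REQUEST):
        return None
    # Ignore queries for some other host's address.
    if tpa != my_ip:
        return None
    # Swap roles: the proxy answers as the sender, the requester is the target.
    return struct.pack(ARP_FMT, htype, ptype, hlen, plen, ARP_REPLY,
                       my_mac, my_ip, sha, spa)
```

The reply is derived entirely from the request plus two stored addresses, which is why this class of packet needs no help from the host.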
9 of 20
Power Proxying Operation
[Figure: an idle, sleeping PC attached to an SNIC that contains a packet classifier (HW or SW?) and an application handler (SW). The steps shown are:]
1. PC decides to sleep
2. PC offloads power proxy rules to the SNIC
3. PC sleeps and SNIC proxy is activated
4. Packet arrives
5. Header inspection (packet split into payload and header)
6. Rule checking (source addr, source port, and dest port are compared against the rules)
7. Match?
   7(a) No (not chatter): wake up PC
   7(b) No (chatter): discard
   7(c) Yes: invoke app handler with the App ID
8. Determine response
9. Proxied response
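The decision in steps 5 to 7 can be sketched as a small lookup. This is a sketch under stated assumptions, not the SNIC's actual logic: rules are modeled as exact matches on (source addr, source port, dest port), and `is_chatter` is a hypothetical caller-supplied predicate, since the slides do not say how chatter is recognized.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Optional, Tuple


class Action(Enum):
    WAKE_PC = "7(a) wake up PC"
    DISCARD = "7(b) discard"
    PROXY = "7(c) invoke app handler"


@dataclass(frozen=True)
class Header:
    source_addr: str
    source_port: int
    dest_port: int


# Assumed rule format: exact header tuple -> application ID.
RuleTable = Dict[Tuple[str, int, int], int]


def classify(header: Header, rules: RuleTable,
             is_chatter: Callable[[Header], bool]
             ) -> Tuple[Action, Optional[int]]:
    """Return the action for one packet, plus the app ID when proxied."""
    key = (header.source_addr, header.source_port, header.dest_port)
    app_id = rules.get(key)          # step 6: rule checking
    if app_id is not None:
        return Action.PROXY, app_id  # step 7(c)
    if is_chatter(header):
        return Action.DISCARD, None  # step 7(b)
    return Action.WAKE_PC, None      # step 7(a)
```

Because the rules are disjoint, a dictionary with one entry per rule is enough for the sketch; rules with wildcards or ranges would need a different structure.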
10 of 20
Packet Classifier Requirements
PC-BASED CLASSIFIER vs. ROUTER-BASED CLASSIFIER
1) PC-based: Must sustain link rates of 10/100/1000/10000 Mbps
   Router-based: Must sustain link rates of 10/100/1000/10000 Mbps
2) PC-based: No packet loss allowed
   Router-based: No packet loss allowed
3) PC-based: Operates only during system inactivity
   Router-based: Continual operation
4) PC-based: Processes packets addressed only to a particular destination, plus broadcast/multicast packets
   Router-based: Processes packets to any destination
5) PC-based: Limited processing resources - processors clocked in the MHz range
   Router-based: Processors clocked in the GHz range
6) PC-based: Limited number of rules, directly dependent on the number of proxiable applications running
   Router-based: Larger number of rules with a wide complexity range
7) PC-based: Packets match only one rule - rules are disjoint
   Router-based: Packets can match multiple rules
11 of 20
Packet Classifier - SW vs. HW
Software Classifier vs. Hardware Classifier
1) Software: Limited operating frequency, between 66 MHz and 400 MHz
   Hardware: Custom hardware can be designed for the required frequency
2) Software: Cannot meet the network throughput demands, even with the fastest packet classification algorithms
   Hardware: Can easily meet the network throughput demands
3) Software: High power even during idle periods
   Hardware: Comparatively lower power
12 of 20
Custom HW Packet Classification
[Figure: block diagram of the hardware packet classifier. The incoming packet (from the MAC core) enters a header processor, which extracts the source IP, source port, and dest port along with the packet class. Each field is looked up against the rules in its own CAM (Source IP Address CAM, Source Port CAM, Dest Port CAM). Each CAM outputs Match, Match Address, and MultiMatch signals, which are combined into a match ID / application ID that invokes the application handler.]
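A toy software model may help in reading the diagram. It assumes, as an illustration only, that rule i is stored at address i in all three CAMs and that the per-CAM match vectors are ANDed to find the matching rule; the slides do not spell out how the match signals are combined, and the names `Cam` and `cam_classify` are hypothetical.

```python
from typing import List, Optional, Sequence


class Cam:
    """Toy model of a content-addressable memory (one stored key per address)."""

    def __init__(self, entries: Sequence[object]) -> None:
        self.entries = list(entries)

    def match_vector(self, key: object) -> List[bool]:
        # Real CAM hardware compares every entry in parallel in one lookup;
        # this loop only models the result, not the timing.
        return [entry == key for entry in self.entries]


def cam_classify(src_ip: str, src_port: int, dst_port: int,
                 ip_cam: Cam, sport_cam: Cam, dport_cam: Cam) -> Optional[int]:
    """Return the rule address whose three fields all match, else None."""
    combined = zip(ip_cam.match_vector(src_ip),
                   sport_cam.match_vector(src_port),
                   dport_cam.match_vector(dst_port))
    hits = [addr for addr, bits in enumerate(combined) if all(bits)]
    # Assumption: rules are disjoint, so at most one address should hit.
    return hits[0] if len(hits) == 1 else None


# Hypothetical two-rule set; rule i occupies address i in every CAM.
ip_cam = Cam(["10.0.0.2", "10.0.0.3"])
sport_cam = Cam([5000, 6000])
dport_cam = Cam([6346, 25])
```

In hardware the three lookups and the AND happen in parallel, so the lookup cost does not grow with the number of stored rules, whereas this Python model scans every entry.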
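13 of 20
Packet Classifier Placement
[Figure: NIC datapath. Packets from the PHY enter the MAC core, which connects to the Rx FIFO and Tx FIFO. The packet classifier sits alongside this path with a packet descriptor FIFO and feeds the application handler running on the uP, which produces the response.]
No change to critical path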
14 of 20
Experimental Setup
• Software packet classifier
  – Implemented on RiceNIC platform using PowerPC405
    • RiceNIC is a programmable NIC
  – PowerPC clocked at 300 MHz and 100 MHz
• Hardware packet classifier
  – Xilinx IP cores to generate CAMs as block memory
  – Prototyped in Verilog HDL
  – System implemented and simulated using Xilinx ISE 9.1 and ModelSim
  – Clocked at 1.25 MHz, 12.5 MHz, and 125 MHz, corresponding to 10 Mbps, 100 Mbps, and 1000 Mbps
  – Power calculated using Xilinx XPower
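The three hardware clock rates follow from the link rates if the classifier consumes one byte per clock cycle. The 8-bit datapath is an assumption inferred from the ratio of the quoted numbers, not something the slides state; `required_clock_mhz` is a hypothetical helper.

```python
def required_clock_mhz(link_rate_mbps: float, bits_per_cycle: int = 8) -> float:
    """Clock needed to keep up with a link, assuming bits_per_cycle bits
    are consumed on every clock cycle (8 is an assumed datapath width)."""
    return link_rate_mbps / bits_per_cycle


for rate_mbps in (10, 100, 1000):
    print(f"{rate_mbps} Mbps -> {required_clock_mhz(rate_mbps)} MHz")
```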
15 of 20
Results – Packet Classification Time
• Hardware classification outperforms software classification at 300 MHz and 100 MHz
[Figure: bar chart, "Worst-case packet classification time for each protocol class with 100 rules". Y-axis: Classification Time (ns), 0 to 4000. X-axis: ARP, ICMP, UDP, TCP (increasing packet classification complexity). Series: Hardware at 125 MHz (1 Gbps), PowerPC - 300 MHz, PowerPC - 100 MHz.]
16 of 20
Results – Classification Time vs. Rules
• As more applications are identified as proxiable, rule set sizes will increase
• Thus scalability is important
[Figure: line chart. Y-axis: Classification Time (ns), 0 to 4000. X-axis: Number of Rules, 20 to 100. Series: Hardware - 125 MHz - UDP, Hardware - 125 MHz - TCP, PPC - 300 MHz - UDP, PPC - 300 MHz - TCP, PPC - 100 MHz - UDP, PPC - 100 MHz - TCP. Annotations: software classification time grows logarithmically; hardware classification time is constant.]
17 of 20
Results – Packet Throughput
• Throughput is measured in Millions of Packets Per Second (MPPS)
[Figure: line chart. Y-axis: Throughput (MPPS), 0 to 3. X-axis: Number of Rules, 20 to 100. Series: Hardware - 125 MHz, PPC - 300 MHz, PPC - 100 MHz, Ethernet Throughput Limit (minimum throughput for 1 Gbps).]
Software cannot meet requirements!
Hardware exceeds Gbps throughput
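As a rough check on what "minimum throughput for 1 Gbps" could mean, the sketch below computes the worst-case packet rate for minimum-size Ethernet frames and converts a classification time into a packet rate. It assumes the chart's limit refers to 64-byte frames with standard preamble and interframe gap, and that the classifier handles one packet at a time with no pipelining; the slides state neither, and both function names are hypothetical.

```python
# Standard Ethernet on-the-wire overheads for a minimum-size frame.
MIN_FRAME_BYTES = 64
PREAMBLE_SFD_BYTES = 8
INTERFRAME_GAP_BYTES = 12


def max_packet_rate_mpps(link_rate_bps: float) -> float:
    """Worst-case packet arrival rate, in millions of packets per second."""
    wire_bits = (MIN_FRAME_BYTES + PREAMBLE_SFD_BYTES
                 + INTERFRAME_GAP_BYTES) * 8   # 672 bits per packet
    return link_rate_bps / wire_bits / 1e6


def classifier_throughput_mpps(classification_time_ns: float) -> float:
    """Packets classified per second (in millions), one packet at a time."""
    return 1e3 / classification_time_ns


print(f"1 Gbps limit: {max_packet_rate_mpps(1e9):.3f} MPPS")
```

Under these assumptions, a classifier has to finish each packet in at most 672 ns to keep up with a saturated 1 Gbps link of minimum-size frames.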
18 of 20
Results – HW Speedup vs. SW
[Figure: line chart. Y-axis: Speedup, 0 to 12. X-axis: Number of Rules, 20 to 100. Series: 300 MHz - UDP, 300 MHz - TCP, 100 MHz - UDP, 100 MHz - TCP. Annotation: 9x speedup to 2.5x speedup.]
19 of 20
Results – Power Consumption
• SW classifier consumes 2.4x more power than HW
  – SW = 259.5 mW and 441 mW for 100 MHz and 300 MHz, respectively
  – HW = 180 mW for 100 rules
• Link rate scalability
  – For SW to meet 1 Gbps throughput
    • Clocked at 500 MHz
    • Requires an additional 294 mW of power
    • Resulting in 4x more power than HW
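The power ratios can be reproduced from the quoted figures. This sketch assumes the additional 294 mW is added on top of the 300 MHz software figure, which is the reading that matches the 4x claim; the variable names are illustrative.

```python
HW_MW = 180.0                  # hardware classifier, 100 rules
SW_300MHZ_MW = 441.0           # software classifier at 300 MHz
SW_EXTRA_FOR_GBPS_MW = 294.0   # extra power to reach 500 MHz (assumed additive)

sw_500mhz_mw = SW_300MHZ_MW + SW_EXTRA_FOR_GBPS_MW   # 735 mW
ratio_300 = SW_300MHZ_MW / HW_MW                     # about 2.4x
ratio_500 = sw_500mhz_mw / HW_MW                     # about 4x
saving_vs_500 = 1 - HW_MW / sw_500mhz_mw             # about 75% less power

print(f"{ratio_300:.2f}x, {ratio_500:.2f}x, {saving_vs_500:.1%}")
```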
20 of 20
Conclusions
• PCs consume a lot of power
  – Left powered on to maintain network connectivity
• Introduced power proxying
  – SNIC maintains network connectivity so PC can sleep
  – Can increase sleep time by 85% [Purushothaman-06]
• Low-power hardware-based packet classifier to enable power proxying
  – Exceeds Gigabit Ethernet throughput requirement
  – Up to 9x speedup in packet classification time over a software packet classifier
  – 75% less power than a software packet classifier
  – Better scalability with respect to future rule set size and link rates than a software packet classifier