Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Luca BeniniIIS-ETHZ & DEI-UNIBO
PULP: an open source hardware-software platform for near-sensor analytics
2
An IoT System View
cm2 Harvesting powered mW (MaxP) + uW (AvgP)
Long range, low BW
Short range, BW
Low rate (periodic) data
SW update, commands
Transmit
Idle: ~1µWActive: ~ 10mW
Analyze
µController
IOs
1 ÷ 25 MOPS1 ÷ 10 mW
e.g. CortexM
Sense
MEMS IMU
MEMS Microphone
ULP Imager
100 µW ÷ 2 mW
EMG/ECG/EIT
L2 Memory
1 ÷ 2000 MOPS1 ÷ 10 mW
The Computing Bottleneck
Luca Benini 3
*not exhaustive
High performance MCUs
Low-Pow
er MC
Us
Our Target
Microcontroller Landscape
Reaching pJ/OP
1. Extract descriptors from raw data 2D: Corners, blobs, … 1D: LPC coefficients, …
2. Use descriptors to classify data among familyrepresentatives Machine learning, Bayesian, ….
Usually highly parallel
Also highly parallel
A general pattern
Content Understanding
Minimum energy operation
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2
Total EnergyLeakage EnergyDynamic Energy
0.55 0.55 0.55 0.55 0.6 0.7 0.8 0.9 1 1.1 1.2
Ener
gy/C
ycle
(nJ)
32nm CMOS, 25oC
4.7X
Logic Vcc / Memory Vcc (V)
Source: Vivek De, INTEL – Date 2013
Near-Threshold Computing (NTC): 1. Don’t waste energy pushing devices in strong inversion
2. Recover performance with parallel execution6
The “best” Processor
Single issue in-order is most energy efficient Put more than one + shared memory to fill cluster area
7Departement Informationstechnologie und Elektrotechnik
[AziziISCA10]
L1 TCDMMB0 MBM
LOGARITHMIC INTERCONNECT
. . . . .
I$
PE0 PEN‐1
Near-Threshold Multiprocessing
IL0 IL0
I$B0 I$Bk
Up to 16 simple cores
DMA
Shared L1 I$ + configurable broadcasting
Private Loop/prefetch buffer
SIMD/MIMD/SEQ
Shared L1 DataMem +configurable interleaving
Tightly Coupled DMA
NT but parallel Max. Energy efficiency when Active + strong PM for (partial) idleness
Near threshold FDSOI technology
9
Body bias: Highly effective knob for power management!
Luca Benini 10
Silicon Results
Technology UTBB FD-SOI 28nm
Transistors Flip wellL = 24 nm
Cluster area 1.3 mm2
VDD range(memories)
0.32V - 1.15V(0.45 – 1.15V)
BB range
0V - 1.75V
SRAM macros
8 x 32 kbit (TCDM)
SCM macros
16x4 kbit (TCDM)4x 2x4 kbit (I$)
Gates 200K
Frequency range
NO BB: 40.5-710 MHz MAX FBB: 63.5 - 825 MHz
Power range
NO FBB: 0.56 - 85 mWMAX FBB: 6.9 - 480 mW
Hot Chips 15 V1 Cool Chips 16 V2 193MOPS/mW @162MOPS ~5pJ/OP
Luca Benini 11
Cluster Energy Efficiency
193 MOPS/mW @ 40.5MHz, 0.46V, 0V FBB, 840 µW
10 mW @ 1GOPS, 100 MOPS/mW, 0.66V, 0.5V FBB
Full FBB heavily degrades energy efficiency at low voltage due to high Leakage!
PULP Boards
86.5mm x 57 mm Battery supply PULP Interfaces: JTAG ‒ SPI I2S ‒ I2C UART ‒ LEDs
128 Mbit Flash Apollo M4 Interfaces: SWD ‒ I2C SPI ‒ LEDs Button ‒ GPIO
Daughterboard expansions14.09.2016 12
PCB Front
PCB Back
13
Open SourceParallel ULP computing for the IoT
(sub)-pJ/op computing platform - let’s make it Open!
VirtualizationLayer
CompilerInfrastructure
Programming Model
Processor &Hardware IPs
What has been released
RISC-V compatible 32-bit efficient microprocessor core with AXI/AMBA peripherals.
RV32I, RV32C supported Most of RV32M (full
support soon) Custom extensions (needs
our compiler extensions) Hardware loops Post-increment load ALU/MAC instructions
Hundreds of GIT forks
Taped out UMC65nm 32.8mW, @1.2V, 400MHz
14
Confirmed by silicon measurement
Why Open Hardware?
Community Building We want that PULP is used, need a community
Cooperation with Academic Partners Allows us to exchange ideas, projects freely We find more partners we can work with
Cooperation/supporting Industry Lowers costs for an SME in entering IC business Creates jobs/opportunities (for our students and others)
Integrated Systems Laboratory 15
IP, Consulting, Customization
FundingFunded Projects
Volume Chip production
Towards fJ/OP
Maximizing Silicon Efficiency
1 > 1003 6
CPU GPGPU HW IP
GOPS/W
Accelerator Gap
SW HWMixed
ThroughputComputing
General-purposeComputing 1GOPS/mW
Closing The Accelerator Efficiency Gap with Agile Customization
17
Fractal Heterogeneity
18
Fixed function accelerators have limited reuse… how to limit proliferation?
Learn to Accelerate
Brain-inspired systems are high performers in many tasks over many domains.
Image recognition[E.g., Krizhevsky et al., 2012]
Speech recognition[E.g., Heigold et al., 2013]
NLP[E.g., Socher et al., ICML 2011;Collobert & Weston, ICML 2008]
[Honglak Lee]
19
PULP CNN Accelerator
20Departement Informationstechnologie und Elektrotechnik
How do we fare?
IBM TrueNorth[Merolla et al.]
Convolution Engine[Qadeer et al.]
DianNao[Chen et al.]
NeuFlow/nn-X[Gokhale et al.]
Spiking-Based4096 neurons @ 64 mW46・109 spiking ops/s/W
Deep Network ASIC452 GOPS @ 931 GOPS/W
ConvNet FPGA / ASICup to 230 GOPS/W
SIMD-like Convolution ISA Extension
410 GOPS @ 236 GOPS/W
PULP + HWCE0.4V: 2.78 GOPS @ 2750
GOPS/W0.8V: 74 GOPS @ 412 GOPS/W
Ample margins for further improvements…