PULP: an open source hardware-software platform for near ...retis.sssup.it/iwes/technical/benini.pdf · PULP: an open source hardware-software platform for near-sensor analytics

Luca BeniniIIS-ETHZ & DEI-UNIBO

PULP: an open source hardware-software platform for near-sensor analytics

2

An IoT System View

cm2 Harvesting powered mW (MaxP) + uW (AvgP)

Long range, low BW

Short range, BW

Low rate (periodic) data

SW update, commands

Transmit

Idle: ~1µWActive: ~ 10mW

Analyze

µController

IOs

1 ÷ 25 MOPS1 ÷ 10 mW

e.g. CortexM

Sense

MEMS IMU

MEMS Microphone

ULP Imager

100 µW ÷ 2 mW

EMG/ECG/EIT

L2 Memory

1 ÷ 2000 MOPS1 ÷ 10 mW

The Computing Bottleneck

Luca Benini 3

*not exhaustive

High performance MCUs

Low-Pow

er MC

Us

Our Target

Microcontroller Landscape

Reaching pJ/OP

1. Extract descriptors from raw data 2D: Corners, blobs, … 1D: LPC coefficients, …

2. Use descriptors to classify data among familyrepresentatives Machine learning, Bayesian, ….

Usually highly parallel

Also highly parallel

A general pattern

Content Understanding

Minimum energy operation

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2

Total EnergyLeakage EnergyDynamic Energy

0.55 0.55 0.55 0.55 0.6 0.7 0.8 0.9 1 1.1 1.2

Ener

gy/C

ycle

(nJ)

32nm CMOS, 25oC

4.7X

Logic Vcc / Memory Vcc (V)

Source: Vivek De, INTEL – Date 2013

Near-Threshold Computing (NTC): 1. Don’t waste energy pushing devices in strong inversion

2. Recover performance with parallel execution6

The “best” Processor

Single issue in-order is most energy efficient Put more than one + shared memory to fill cluster area

7Departement Informationstechnologie und Elektrotechnik

[AziziISCA10]

L1 TCDMMB0 MBM

LOGARITHMIC INTERCONNECT

. . . . .

I$

PE0 PEN‐1

Near-Threshold Multiprocessing

IL0 IL0

I$B0 I$Bk

Up to 16 simple cores

DMA

Shared L1 I$ + configurable broadcasting

Private Loop/prefetch buffer

SIMD/MIMD/SEQ

Shared L1 DataMem +configurable interleaving

Tightly Coupled DMA

NT but parallel Max. Energy efficiency when Active + strong PM for (partial) idleness

Near threshold FDSOI technology

9

Body bias: Highly effective knob for power management!

Luca Benini 10

Silicon Results

Technology UTBB FD-SOI 28nm

Transistors Flip wellL = 24 nm

Cluster area 1.3 mm2

VDD range(memories)

0.32V - 1.15V(0.45 – 1.15V)

BB range

0V - 1.75V

SRAM macros

8 x 32 kbit (TCDM)

SCM macros

16x4 kbit (TCDM)4x 2x4 kbit (I$)

Gates 200K

Frequency range

NO BB: 40.5-710 MHz MAX FBB: 63.5 - 825 MHz

Power range

NO FBB: 0.56 - 85 mWMAX FBB: 6.9 - 480 mW

Hot Chips 15 V1 Cool Chips 16 V2 193MOPS/mW @162MOPS ~5pJ/OP

Luca Benini 11

Cluster Energy Efficiency

193 MOPS/mW @ 40.5MHz, 0.46V, 0V FBB, 840 µW

10 mW @ 1GOPS, 100 MOPS/mW, 0.66V, 0.5V FBB

Full FBB heavily degrades energy efficiency at low voltage due to high Leakage!

PULP Boards

86.5mm x 57 mm Battery supply PULP Interfaces: JTAG ‒ SPI I2S ‒ I2C UART ‒ LEDs

128 Mbit Flash Apollo M4 Interfaces: SWD ‒ I2C SPI ‒ LEDs Button ‒ GPIO

Daughterboard expansions14.09.2016 12

PCB Front

PCB Back

13

Open SourceParallel ULP computing for the IoT

(sub)-pJ/op computing platform - let’s make it Open!

VirtualizationLayer

CompilerInfrastructure

Programming Model

Processor &Hardware IPs

What has been released

RISC-V compatible 32-bit efficient microprocessor core with AXI/AMBA peripherals.

RV32I, RV32C supported Most of RV32M (full

support soon) Custom extensions (needs

our compiler extensions) Hardware loops Post-increment load ALU/MAC instructions

Hundreds of GIT forks

Taped out UMC65nm 32.8mW, @1.2V, 400MHz

14

Confirmed by silicon measurement

Why Open Hardware?

Community Building We want that PULP is used, need a community

Cooperation with Academic Partners Allows us to exchange ideas, projects freely We find more partners we can work with

Cooperation/supporting Industry Lowers costs for an SME in entering IC business Creates jobs/opportunities (for our students and others)

Integrated Systems Laboratory 15

IP, Consulting, Customization

FundingFunded Projects

Volume Chip production

Towards fJ/OP

Maximizing Silicon Efficiency

1 > 1003 6

CPU GPGPU HW IP

GOPS/W

Accelerator Gap

SW HWMixed

ThroughputComputing

General-purposeComputing 1GOPS/mW

Closing The Accelerator Efficiency Gap with Agile Customization

17

Fractal Heterogeneity

18

Fixed function accelerators have limited reuse… how to limit proliferation?

Learn to Accelerate

Brain-inspired systems are high performers in many tasks over many domains.

Image recognition[E.g., Krizhevsky et al., 2012]

Speech recognition[E.g., Heigold et al., 2013]

NLP[E.g., Socher et al., ICML 2011;Collobert & Weston, ICML 2008]

[Honglak Lee]

19

PULP CNN Accelerator

20Departement Informationstechnologie und Elektrotechnik

How do we fare?

IBM TrueNorth[Merolla et al.]

Convolution Engine[Qadeer et al.]

DianNao[Chen et al.]

NeuFlow/nn-X[Gokhale et al.]

Spiking-Based4096 neurons @ 64 mW46･109 spiking ops/s/W

Deep Network ASIC452 GOPS @ 931 GOPS/W

ConvNet FPGA / ASICup to 230 GOPS/W

SIMD-like Convolution ISA Extension

410 GOPS @ 236 GOPS/W

PULP + HWCE0.4V: 2.78 GOPS @ 2750

GOPS/W0.8V: 74 GOPS @ 412 GOPS/W

Ample margins for further improvements…

Documents

PULP: an open source hardware-software platform for near ...retis.sssup.it/iwes/technical/benini.pdf · PULP: an open source hardware-software platform for near-sensor analytics