Using RISC-V in high computing, ultra-low power, programmable circuits for inference on battery operated edge devices
Martin Croome, VP Business Development, GreenWaves Technologies
RISC-V Day in Shanghai, 30 June 2018
What this talk is about?
RISC-V Foundation 2
The IoT pipe: NB-IoT, LTE-M, Sigfox, LoRa, etc.
• B/day to kB/day
Battery operated sensors
Market demand: rich sensor data
• 8-bit, 160x120 @ 10 fps = 4.6 Mbit/s
• 24-bit @ 50 kHz = 1.2 Mbit/s
• Linear PCM = 1.4 Mbit/s
• Keyword spotting, beam forming, speech pre-processing
• Vibration analysis, fault detection
• Face detection, presence detection, counting, emotion detection
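The raw sensor rates quoted above are products of sample width, samples per frame and frame (or sampling) rate. A minimal sketch of that arithmetic follows; the function names are illustrative and not from the talk. Under these assumptions, 24-bit samples at 50 kHz give 1.2 Mbit/s, and 16-bit stereo at 44.1 kHz (an assumed reading of "Linear PCM", whose parameters the slide does not give) gives about 1.4 Mbit/s. For the image stream, 8-bit 160x120 frames come to about 1.5 Mbit/s at 10 fps, while 4.6 Mbit/s corresponds to 30 fps, or to 24-bit pixels at 10 fps.

```c
#include <stdint.h>

/* Raw (uncompressed) bit rate of a sensor stream, in bits per second.
 * samples_per_unit: pixels per frame, or channels per audio sample
 * units_per_second: frames per second, or sampling rate in Hz */
static uint64_t raw_bit_rate(uint32_t bits_per_sample,
                             uint32_t samples_per_unit,
                             uint32_t units_per_second)
{
    return (uint64_t)bits_per_sample * samples_per_unit * units_per_second;
}

/* How many times larger the raw daily volume is than a given uplink
 * budget in bytes per day (the budget value is the caller's assumption). */
static double reduction_factor(uint64_t bits_per_second,
                               double uplink_bytes_per_day)
{
    double raw_bytes_per_day = (double)bits_per_second / 8.0 * 86400.0;
    return raw_bytes_per_day / uplink_bytes_per_day;
}
```

With an assumed budget of 1 kB/day, `reduction_factor(raw_bit_rate(24, 1, 50000), 1000.0)` is on the order of ten million, which is the gap that on-device inference has to close.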
30 June 2017
Processing on the device reduces rich sensor data to B/day to kB/day:
• CNN, SVM, Bayesian, Boosting, Cepstral analysis
Issue: this takes way more MIPS than an MCU can deliver, but it needs to stay within an MCU power envelope.
General Patterns for content understanding
• Extract descriptors from raw data (usually highly parallel)
  • 2D: corners, blobs, HOG, DOG, …
  • 1D: LPC coefficients, cepstral coefficients, …
• Use descriptors to classify data among representative families (also highly parallel)
  • Machine learning (CNN, SVM, Boost), Bayesian, …
GAP8: Ultra Low Power IoT Processor
Architecture efficiency
• Extended RISC-V ISA
• Low contention shared memory 8+1 core clustered architecture
• Tight synchronization
• CNN based pattern matching engine (HWCE)
Performance
• Up to 12 GOPS
• Up to 0.4 GOPS @ 1 mW
• Up to 40 MOPS @ 300 µW
• 3 µW stand-by power consumption
HW features
• Smart I/Os
• Voltage regulator / DVFS
• RTC
• Secured execution
GAP8 hierarchical power architecture
• Quasi stand-by (µWs): smart I/Os, voltage regulator & RTC, SRAM in retentive mode
• Low computing power (mWs): monitoring event qualification, protocol stack, system control; extended RISC-V
• High computing power (10 to 50 mWs): data analysis & classification; extended RISC-V, efficient 8 core parallelization, HW synchronization, shared instruction cache, CNN HW engine
[Figure: power hierarchy diagram; "primary energy consumption" is marked at two points]
GAP8: Open Source Origin
• RISC-V: best in class Instruction Set Architecture (ISA), UC Berkeley originated
• PULP: Open Source Computing Platform created by ETHZ and UniBo
• GAP8: engineered as an ultra-low power IoT Application Processor
SW development flow
[Figure: GAP8 block diagram. FC clock & voltage domain: Fabric Controller (L1, ROM, I$), L2 memory, debug, PMU, RTC, Micro DMA and peripherals (LVDS, Serial I/Q, UART, SPI, I2C, I2S, CPI, HyperBus, GPIO / PWM). Cluster clock & voltage domain: cores 0 to 7, HWCE, shared L1 memory, shared instruction cache, logarithmic interconnect, cluster DMA, H/W sync, debug.]
• Identical cores: single GCC/GDB toolchain (including support for extended ISA)
• CNN graph translators (TF2GAP8, ONNX2GAP8 in development)
• Code generators for common algorithms (CNN layers, Matrix, FIR, FFT, HoG, MFCC, …)
• GAP8 AutoTiler
  • Separates kernel parallelization / vectorization and data flow
  • Automatic code generation for data flow
• OpenMP or Native API
• GAPUINO development board
• Classic MCU development
  • PULP OS, ARM™ Mbed, FreeRTOS, other OS's in development
  • Drivers
  • Cluster APIs
Arm and Mbed are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere.
Automated Memory Management
Basic Kernels: how to handle a parametric tile (usually seen as libraries)
• Vectorization + parallelization
• No assumption on where actual data are located
User Kernels: passing actual data to Basic Kernels and having data circulate between them (can be grouped and organized as generators)
• A multi-dimensional iteration space (2D, 3D, 4D) and a traversal order
• Each argument is a subspace of the iteration space and has actual dimensions, location (L2, external) and properties
• Given a memory budget, the auto tiler "tiles" each argument and generates a fully pipelined implementation interleaving processing and data movements
• Basic Kernels are inserted at defined locations in the iteration space (prologue, body, epilogue, …)
• Generated tiles are passed to Basic Kernels
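The tiling idea described above can be sketched by hand in plain C. This is only an illustration of the pattern, not AutoTiler output: `TILE`, `scale_tile`, `copy_in`, `copy_out` and `process_tiled` are hypothetical names, and `memcpy` stands in for the cluster DMA, so the "prefetch" here is synchronous, whereas a real target would overlap it with compute.

```c
#include <stddef.h>
#include <string.h>

#define TILE 64 /* hypothetical L1 budget: elements per tile buffer */

/* Hypothetical basic kernel: processes whatever tile it is handed and
 * makes no assumption about where the full data set lives. */
static void scale_tile(short *tile, size_t n, short k)
{
    for (size_t i = 0; i < n; i++)
        tile[i] = (short)(tile[i] * k);
}

/* Stand-ins for DMA transfers between "L2" and "L1". */
static void copy_in(short *l1, const short *l2, size_t n)
{
    memcpy(l1, l2, n * sizeof *l1);
}

static void copy_out(short *l2, const short *l1, size_t n)
{
    memcpy(l2, l1, n * sizeof *l2);
}

/* Length of tile t; the last tile may be shorter than TILE. */
static size_t tile_len(size_t total, size_t t)
{
    size_t start = t * TILE;
    return (total - start < TILE) ? total - start : TILE;
}

/* User-kernel-like driver: walks a 1D iteration space tile by tile,
 * alternating between two L1 buffers. Returns the number of tiles. */
static size_t process_tiled(short *data, size_t total, short k)
{
    short l1[2][TILE];
    size_t ntiles = (total + TILE - 1) / TILE;

    if (ntiles == 0)
        return 0;

    copy_in(l1[0], data, tile_len(total, 0)); /* prologue: first tile */

    for (size_t t = 0; t < ntiles; t++) {
        size_t cur = t & 1;

        /* Fetch the next tile into the other buffer before computing. */
        if (t + 1 < ntiles)
            copy_in(l1[cur ^ 1], data + (t + 1) * TILE,
                    tile_len(total, t + 1));

        scale_tile(l1[cur], tile_len(total, t), k);               /* body */
        copy_out(data + t * TILE, l1[cur], tile_len(total, t));   /* write back */
    }
    return ntiles;
}
```

The kernel never sees more than one tile, which is what lets the memory budget (`TILE` here) be chosen independently of the data size.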
Automated Memory Management
AutoTiler flow:
• Inputs: Basic Kernels, User Kernels, groups of User Kernels, Generators (C programs, calls to AutoTiler's Model API; C libraries)
• AutoTiler library (constraints solver, C code generator)
• Compile & run on PC
• Output: C code for the target, handling data movements and Basic Kernels dispatch on the cluster's cores
#include "AutoTilerLib.h"
#include "CNN_Generator.h"

void Mnist()
{
    CNN_TiledConvNxNReLUPool2x2_SW_fp("Conv5x5RLMP_0", 5, 1, 32, 28, 28, 1);
    CNN_TiledConvNxNReLUPool2x2_SW_fp("Conv5x5RLMP_1", 5, 32, 64, 12, 12, 1);
    CNN_TiledLinearLayer("LinearLayerRL_2", 64, 4, 4, 10, 1, 0, 0);
}
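The numeric arguments in the model above are consistent with a 5x5 convolution without padding followed by 2x2 pooling: 28x28 becomes 12x12, then 4x4, and the linear layer maps 64 x 4 x 4 values to 10 outputs. The sketch below checks that arithmetic. The padding and stride assumptions and all names (`FeatureMap`, `conv_relu_pool2x2`, …) are illustrative and are not part of the AutoTiler API.

```c
/* Feature map shape: number of planes (channels), width and height. */
typedef struct {
    int planes;
    int width;
    int height;
} FeatureMap;

/* Assumed: convolution with no padding and stride 1. */
static int conv_out(int in, int filter)
{
    return in - filter + 1;
}

/* Assumed: 2x2 pooling with stride 2. */
static int pool2x2_out(int in)
{
    return in / 2;
}

/* Shape after an NxN convolution followed by 2x2 pooling. */
static FeatureMap conv_relu_pool2x2(FeatureMap in, int filter, int out_planes)
{
    FeatureMap out;
    out.planes = out_planes;
    out.width = pool2x2_out(conv_out(in.width, filter));
    out.height = pool2x2_out(conv_out(in.height, filter));
    return out;
}

/* Number of values fed to a fully connected layer. */
static int flattened_size(FeatureMap m)
{
    return m.planes * m.width * m.height;
}
```

Starting from a 1 x 28 x 28 input, two calls with filter size 5 and 32 then 64 output planes give 64 x 4 x 4, that is 1024 inputs to the final layer.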
Algorithm Benchmarks
Cycles per produced output, by number of cores:

Application                          1 core   2 cores  4 cores  8 cores  Notes
1D FFT 1024 Radix4                   28.2     14.3     7.8      4.7
2D FFT 256 x 256 Radix4              78.9     41.9     22.6     13.3     0.88 MHz/Frame
Byte 5x5 Conv                        18.5     9.3      4.7      2.2
Short 5x5 Conv                       37.8     18.9     9.5      4.6
Binary 5x5 Conv                      20.8     10.5     5.3      2.8
Short MaxPool2x2                     8.2      4.2      2.1      1.1
Short MatMult 32x32                  41.9     20.9     14.0     5.2
Short 2048 to 1 Fully Connected      3112.0   1616.0   847.0    495.0
CannyEdge                            99.5     50.9     26.2     12.7     VGA: 3.9 MHz/Frame
AES-CTR 128b                         15.3     7.7      4.0      2.1      0.47 MHz per Mb/s
64 Mel Coefficients                  542.7    299.4    176.7    101.3    10 ms slots: 0.64 MHz
HoG, 8x8 Cells, 2x2 Blocks, 9 Bins   65.0     35.0     18.0     9.0      VGA: 2.76 MHz/Frame
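The benchmark figures can be turned into parallel speedup and per-frame cycle budgets with simple ratios. The helpers below are an illustrative sketch (the names are not from the talk), and reading "VGA" as 640x480 with one output per pixel is an assumption, though it does reproduce the 3.9 MHz/Frame note for CannyEdge.

```c
/* Speedup of an n-core run over the single-core run. */
static double speedup(double cycles_1core, double cycles_ncores)
{
    return cycles_1core / cycles_ncores;
}

/* Fraction of ideal linear scaling achieved on ncores. */
static double parallel_efficiency(double cycles_1core, double cycles_ncores,
                                  int ncores)
{
    return speedup(cycles_1core, cycles_ncores) / ncores;
}

/* Cycles needed for one frame, given cycles per produced output.
 * Dividing by 1e6 gives the "MHz/Frame" figure, that is the clock
 * needed to process one frame per second. */
static double cycles_per_frame(double cycles_per_output, long outputs_per_frame)
{
    return cycles_per_output * (double)outputs_per_frame;
}
```

For example, the 1D FFT goes from 28.2 to 4.7 cycles per output on 8 cores, a speedup of 6.0 (75% efficiency), and CannyEdge at 12.7 cycles per pixel needs about 3.9 million cycles for a 640x480 frame.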
[Figure: algorithm benchmark chart; the only surviving data label is 7.1]
CNN based text recognition
• Trainable parameters: 421 263
• Neurons: 1 511 904
• 33 ms per image
Dronet – Autonomous Drone
Power envelope breakdown @ 165 MHz, 12 images/sec
Unique energy efficiency vs performance
[Figure: energy efficiency (vertical axis) vs. computing power (horizontal axis: 100s of MOPS, several GOPS, TFLOPS). Plotted: best in class ULP MCUs; high end low power MCUs and mid-range application processors; embedded vision processors and dedicated CNN processors; GAP8 (µAs asleep, mWs awake, 10s of mWs).]
20X:
• Extended Instruction Set (ISA)
• Efficient parallelization
• Shared instruction cache
• HW Convolution Engine
• Ultra fast HW state changes
Comparison of the latest optimized ARM CMSIS-CNN library versus a GAP8 implementation of an identical CNN graph trained on CIFAR-10 images. Source: ARM processors blog.

Target     Clock      Time      Cycles       Active power
STM32 F7   216 MHz    99.1 ms   21 400 000   60 mW
GAP8 *     15.4 MHz   99.1 ms   1 500 000    3.7 mW
GAP8 *     175 MHz    8.7 ms    1 500 000    70 mW
GAP8 **    4.7 MHz    99.1 ms   460 000      0.8 mW

* Running on GAP8 cluster, no Hardware Convolution Engine
** Running on GAP8 cluster, with Hardware Convolution Engine
Annotations on the slide: 16 X reduction; 11 X; STM 32 H7 216 MHz, 40 nm
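The rows of the comparison are tied together by two relations: cycles = clock x time, and energy per inference = active power x time. The sketch below recomputes them. The energy figures are derived here and do not appear on the slide, and the function names are illustrative. On these numbers, the 16 X annotation matches the active power ratio at equal run time (60 mW vs 3.7 mW) and the 11 X annotation matches the run time ratio (99.1 ms vs 8.7 ms); that reading is an inference from the table.

```c
/* Cycles executed at a given clock (Hz) over a run time (seconds). */
static double cycles(double clock_hz, double time_s)
{
    return clock_hz * time_s;
}

/* Energy per inference in microjoules: mW x ms = uJ. */
static double energy_uj(double power_mw, double time_ms)
{
    return power_mw * time_ms;
}

/* Ratio of a baseline quantity to an improved one. */
static double ratio(double baseline, double improved)
{
    return baseline / improved;
}
```

For example, `energy_uj(60, 99.1)` is about 5946 µJ for the STM32 F7, against about 367 µJ for GAP8 at 15.4 MHz and about 79 µJ with the Hardware Convolution Engine.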
Unique energy efficiency vs performance
HWCE: Boosted convolution
@ 1.0 V, 50 MHz. Input: W=32, H=100

                                      Conv 3x3   Conv 5x5
SW time                               129.7 µs   332.1 µs
SW power                              12.58 mW   12.80 mW
HWCE time                             69.2 µs    60.8 µs
HWCE power                            4.95 mW    5.1 mW
Speed gain                            1.87       5.46
Power gain (intrinsic)                2.54       2.51
Power gain combined with speed gain   4.76       13.71
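The gain rows of the HWCE table follow directly from the measured times and powers: speed gain is the ratio of run times, intrinsic power gain is the ratio of powers, and their product is the ratio of energy spent per convolution. A minimal sketch of that calculation, with illustrative names:

```c
/* One measurement: run time in microseconds and power in milliwatts. */
typedef struct {
    double time_us;
    double power_mw;
} Run;

/* How many times faster the hardware engine is than software. */
static double speed_gain(Run sw, Run hwce)
{
    return sw.time_us / hwce.time_us;
}

/* How many times less power the hardware engine draws while running. */
static double power_gain(Run sw, Run hwce)
{
    return sw.power_mw / hwce.power_mw;
}

/* Combined gain: ratio of energy (power x time) per convolution. */
static double combined_gain(Run sw, Run hwce)
{
    return speed_gain(sw, hwce) * power_gain(sw, hwce);
}
```

For the 5x5 case, 332.1 µs at 12.80 mW against 60.8 µs at 5.1 mW gives a speed gain of about 5.46, a power gain of about 2.51 and a combined gain of about 13.71.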
Conclusion
GAP8's extended RISC-V ISA and flexible, programmable architecture enable massive deployment of edge intelligence:
• by dramatically reducing rich sensing device installation costs through true autonomy
• by reducing solution cost with system on a chip integration
Built on top of 2 major HW open source initiatives: architectural innovation enabled by PULP, RISC-V and Open Source
Thank You!
Backup Slides
People Counting
Advanced Power Management
MCU sleep mode (µW range)
• Embedded DC/DC, low current
• Real Time Clock 32 kHz only
• L2 memory partially retentive
MCU active mode (1 mW range)
• Embedded DC/DC, high current
• Voltage can dynamically change
• One clock gen active, frequency can dynamically change
• Systematic clock gating
MCU + parallel processor active mode (10-40 mW range)
• Embedded DC/DC, high current
• Voltage can dynamically change
• Two clock gen active, frequencies can dynamically change
• Systematic clock gating
Ultra fast switching time from one mode to another
Ultra fast voltage and frequency change time
Highly optimized system level power consumption
Source of Energy Efficiency?
Data analysis & classification:
• Extended RISC-V
• Efficient 8 core parallelization
• HW synchronization
• Shared instruction cache
• CNN HW engine
Gains marked on the diagram: 3-5x, 1.4x, 4x, 1.5x
[Figure: GAP8 block diagram. Cluster: eight eRISC-V cores, CNN-HWE, shared L1 memory, shared instruction cache, logarithmic interconnect, DMA, HW sync, debug unit. Outside the cluster: eRISC-V core with I$, L1, ROM, clock and debug; L2 memory; Micro DMA; LVDS, UART, SPI, I2S, I2C, // 10b, GPIOs, HyperBus.]
Overall, in practice on targeted algorithms: typically 20x
System Cost
[Figure: system cost (vertical axis) vs. computing power (horizontal axis: 100s of MOPS, several GOPS, TFLOPS). Plotted: best in class ULP MCUs; high end low power MCUs and mid-range application processors; embedded vision processors and dedicated CNN processors; GAP8.]
2-3X: System-On-a-Chip, high integration