11 September 2017 Sapphyre-P-009 v1.0
Power Efficient Computation through
Processor & Algorithm Co-Design
Bryan Donoghue
NMI: High Performance Digital Systems & Applications Event
Bryan Donoghue Biography:
Bryan Donoghue is Group Leader of the Digital Systems Group at Cambridge Consultants. He has over 20 years’ experience in the field of electronics and
chip design at Cambridge Consultants, 3Com Networks and Hewlett Packard Research Laboratories. Bryan holds 15 patents in the fields of wireless
communications and ASIC design. His current areas of technical interest are in fully-digital radio design and in processor optimisation for signal processing
and machine learning.
Power Efficient Computation
When you don’t care:
– Do you know or care whether your desktop PC consumes 2W, 20W or 200W?
When you do care:
– Battery-powered systems
– Cell phones
– Tablets
– Laptops
– Cooling-constrained systems
– Cloud data centres
What is driving power-constrained computation?
Wireless modulation standards
– GSM: GMSK
– 3G: CDMA
– LTE: OFDM, 64QAM
Machine Learning
– Cars: latency-sensitive image recognition
– IoT: Tiered wake-up
– Cloud Systems: cooling-constrained massive systems
Computation Systems
Conventional solutions trade flexibility for power-efficiency
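[Figure: Efficiency plotted against Flexibility, each axis running up to HIGH. Labelled solutions: Pure hardware; Conventional DSP, GPU; Microprocessor. Regions marked Worst and Best, with a "?" on the chart.]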
How to improve power efficiency?
Gates = Power
Reduce the ratio of control and datapath logic to computation logic
Processor                     Number of Multipliers   Gate Count   MACs/MegaGate
16*16 Multiply-Accumulator    1                       5K           200
Ceva Teaklite-II              1                       100K         10
ARM Cortex-R7                 1                       1350K        0.74
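The MACs/MegaGate column is the number of multipliers divided by the gate count in millions. A minimal sketch that reproduces the column from the other two is shown here; the function name is illustrative and not part of the original slides.

```python
def macs_per_megagate(multipliers: int, gate_count: int) -> float:
    """Return the number of multiply-accumulate units per million gates."""
    return multipliers * 1_000_000 / gate_count


# (processor, number of multipliers, gate count) as listed in the table.
processors = [
    ("16*16 Multiply-Accumulator", 1, 5_000),
    ("Ceva Teaklite-II", 1, 100_000),
    ("ARM Cortex-R7", 1, 1_350_000),
]

for name, multipliers, gates in processors:
    density = macs_per_megagate(multipliers, gates)
    print(f"{name}: {density:.2f} MACs/MegaGate")
```

The bare multiplier-accumulator sets the upper bound; every gate spent on control and datapath logic around it lowers the figure.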
How to improve power efficiency?
Lost Cycles = Power
Computation is memory-access limited
MAC needs 3 memory accesses
– Hardware = 1 cycle
– CPU = 3 to 30 cycles
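To put numbers on the lost cycles, the sketch below applies the per-MAC figures above to a block of MAC operations. The 64-MAC workload and the function name are hypothetical and chosen only for illustration.

```python
ACCESSES_PER_MAC = 3  # memory accesses needed by each MAC, as stated above


def mac_cost(num_macs: int, cycles_per_mac: int) -> dict:
    """Return total memory accesses and clock cycles for a run of MACs."""
    return {
        "memory_accesses": num_macs * ACCESSES_PER_MAC,
        "cycles": num_macs * cycles_per_mac,
    }


num_macs = 64  # hypothetical workload, e.g. one output of a 64-tap filter

hardware = mac_cost(num_macs, cycles_per_mac=1)    # hardware: 1 cycle per MAC
cpu_best = mac_cost(num_macs, cycles_per_mac=3)    # CPU, best case
cpu_worst = mac_cost(num_macs, cycles_per_mac=30)  # CPU, worst case

print("hardware cycles:", hardware["cycles"])
print("CPU cycles:", cpu_best["cycles"], "to", cpu_worst["cycles"])
```

The memory traffic is identical in every case; only the number of cycles spent waiting for it differs.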
Pure Hardware DSP
Comparison with conventional CPU or DSP…
Advantages
– Lowest power
– Lowest silicon area
Disadvantages
– Time-consuming and costly to design in RTL
– Limited flexibility in case of:
  • Standard / algorithm change
  • RTL error
  • Re-use of IP in a new product
How to build Flexible Hardware DSP?
Programmable VLIW DSP Engine
VLIW instruction mini-opcodes control
– Sequencer (program counter)
– Many DSP modules
– Dynamic data routing
– Access to multiple memories
Advantages
– Low control and data-path overhead
– Choose DSP modules & routing for application
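[Figure: block diagram of the programmable VLIW DSP engine. Blocks: Program Memory, Instruction decoder, Sequencer, ALU, Indexer, MAC, Memory Interface, IO Registers, Module N, connected through Data Routing, a Data Bus and an IO Bus.]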
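One way to picture mini-opcodes is as independent bit fields packed into a single wide instruction word, one field per module. The sketch below is a toy model under that assumption: the field names and widths are hypothetical and do not describe the actual instruction encoding.

```python
# Hypothetical field layout: (module name, width in bits), lowest bits first.
FIELDS = [
    ("sequencer", 4),
    ("alu", 3),
    ("mac", 2),
    ("indexer", 3),
    ("routing", 6),
]


def encode(mini_opcodes: dict) -> int:
    """Pack one mini-opcode per module into a single instruction word."""
    word = 0
    shift = 0
    for name, width in FIELDS:
        value = mini_opcodes.get(name, 0)  # 0 is treated as "no operation"
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} mini-opcode {value} exceeds {width} bits")
        word |= value << shift
        shift += width
    return word


def decode(word: int) -> dict:
    """Split an instruction word back into its per-module mini-opcodes."""
    mini_opcodes = {}
    shift = 0
    for name, width in FIELDS:
        mini_opcodes[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return mini_opcodes


word = encode({"sequencer": 1, "mac": 2, "routing": 5})
print(decode(word))
```

Because each field is decoded independently, every module receives its command in the same clock cycle, which keeps the shared control logic small.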
Design philosophy
Run it slow(er)
– Short pipelines: efficient loops and low control logic and datapath overhead
– Low-latency access to memory
– Low drive/power gates
Match the mix of modules to your algorithm
Match memory bandwidth to task
– e.g. MAC has 3 memory accesses per cycle
If you want to go faster…
– Add modules e.g. multiple MACs
– Add VLIW cores
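The "add modules, add cores" rule can be sketched as a sizing calculation. This assumes each MAC module completes one MAC per clock cycle and that work divides evenly across cores; the workload figures and the function name are hypothetical.

```python
import math


def mac_modules_needed(required_macs_per_s: int, clock_hz: int, cores: int = 1) -> int:
    """Return the MAC modules needed per core to meet a throughput target.

    Assumes one MAC per module per clock cycle and an even split across cores.
    """
    macs_per_cycle = required_macs_per_s / (clock_hz * cores)
    return math.ceil(macs_per_cycle)


# Hypothetical target: 250 million MAC/s on a 100 MHz clock.
print(mac_modules_needed(250_000_000, 100_000_000))           # one core
print(mac_modules_needed(250_000_000, 100_000_000, cores=2))  # two cores
```

Raising the module count or the core count keeps the clock slow, which is what allows the short pipelines and low-power gates listed above.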
Sapphyre™ VLIW DSP
VLIW instruction controls multiple modules each clock cycle
Modules interconnected by multiplexed data routing
Modules have next-cycle access to multiple memories
Library of modules to suit different algorithms
Balanced cores – you can really use the available processing capacity for processing
[Figure: block diagram of the Sapphyre™ VLIW DSP. Blocks: Program Memory, Instruction decoder, Sequencer, ALU, Indexer, MAC, Memory Interface, IO Registers, Module N, connected through Data Routing, a Data Bus and an IO Bus.]
Module library: Sequencer, ALU, MAC, Cart2Polar, Constants, Debug monitor, Memory Interface, I/O Registers, Bit operator, Register Bank, Adder, ABS, Sin Cos, Indexer, Min Max, Shifter, Radix, FFT Addr, Oscillator, Limiter
Sapphyre™ DSP – Programmers Toolchain
Code development for Sapphyre™ cores is supported by the Programmers Toolchain, consisting of:
– Macro Assembler
– Export Tool
– Graphical Simulator
– Real-time Debug Monitor
Sapphyre™ DSP – Graphical Simulator
Configurable for:
– DSP Module choice
– Data paths
– New DSP modules
Macro Assembler
Bit & cycle-accurate simulation
Single stepping, breakpoints, register watch windows
Profiling for code efficiency
Sapphyre™ DSP – Real-time debug monitoring output in Silicon
Real-time and non-invasive
Test point monitoring of inputs, configuration and intermediate outputs
Replay and debug in the simulator
Sapphyre™ DSP – Simultaneous Core and Code Development
We develop the DSP application code in parallel with the customised core
Simultaneous development allows quick prototyping of data routing and modules
– Balanced I/O, memory access and processing
– Reduced development time
– Algorithm can be written before ASIC is complete
Reduced ASIC development risk
– ASIC RTL verified against DSP simulator vectors of real application code
– Modest clock speed allows real-time verification of ASIC RTL on FPGA
The resulting Sapphyre™ DSP cores are balanced, efficient designs, tailored to
an application but with the flexibility to cope with future expansions
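Verifying RTL against simulator vectors comes down to a bit-exact comparison of two output streams. The sketch below shows that check in its simplest form, assuming both streams are available as sequences of integer words; the vector format and function name are hypothetical and are not the toolchain's own.

```python
from typing import Optional, Sequence, Tuple

Mismatch = Tuple[int, Optional[int], Optional[int]]


def first_mismatch(expected: Sequence[int], actual: Sequence[int]) -> Optional[Mismatch]:
    """Return (index, expected word, actual word) for the first difference.

    Returns None when the two streams are bit-exact and of equal length.
    A missing word in the shorter stream is reported as None.
    """
    for index, (want, got) in enumerate(zip(expected, actual)):
        if want != got:
            return (index, want, got)
    if len(expected) != len(actual):
        index = min(len(expected), len(actual))
        want = expected[index] if index < len(expected) else None
        got = actual[index] if index < len(actual) else None
        return (index, want, got)
    return None


simulator_vectors = [0x0000, 0x1A2B, 0x7FFF, 0x8000]  # hypothetical reference output
rtl_vectors = [0x0000, 0x1A2B, 0x7FFE, 0x8000]        # hypothetical RTL output

print(first_mismatch(simulator_vectors, rtl_vectors))
```

Reporting the first differing index lets the failing cycle be replayed in a cycle-accurate simulator rather than searched for in waveforms.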
Does it really work?
384MMAC/s
1mW typical
$0.03 of silicon
Sapphyre™ Gen 5
Geometry                   40nm
Clock                      96MHz
Gates                      116K
Program Memory (typical)   64KByte
Data Memory (typical)      64KByte
MMAC/s                     384
Power (mW)                 8 (peak), 1 (average)
Power (µW/MHz)             80 (peak), 10 (average)
Die Area (mm²)             0.06 (core), 0.25 (memory)
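The Gen 5 figures can be cross-checked against one another. The sketch below derives MACs per clock cycle, MACs per million gates and power per MHz from the quoted clock, throughput, gate count and power. The derived values are inferences from those numbers, not figures stated on the slide, and the gate count is assumed to cover the core only.

```python
def derived_figures(clock_mhz: float, mmac_per_s: float, gates: int, power_mw: dict) -> dict:
    """Derive per-cycle, per-gate and per-MHz figures from headline numbers."""
    macs_per_cycle = mmac_per_s / clock_mhz
    return {
        "macs_per_cycle": macs_per_cycle,
        "macs_per_megagate": macs_per_cycle / (gates / 1_000_000),
        "uw_per_mhz": {
            label: milliwatts * 1000 / clock_mhz
            for label, milliwatts in power_mw.items()
        },
    }


figures = derived_figures(
    clock_mhz=96,
    mmac_per_s=384,
    gates=116_000,
    power_mw={"peak": 8, "average": 1},
)

print(f"MACs per cycle:    {figures['macs_per_cycle']:.1f}")
print(f"MACs per MegaGate: {figures['macs_per_megagate']:.1f}")
for label, value in figures["uw_per_mhz"].items():
    print(f"Power ({label}):   {value:.1f} uW/MHz")
```

The computed power-per-MHz values round to the quoted 80 and 10 µW/MHz, and the throughput corresponds to four MACs completing on every clock cycle.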
How does Sapphyre™ VLIW approach compare?
[Figure: bar chart of MACs/MegaGate (axis from 0 to 40) for Sapphyre Gen 3, Sapphyre Gen 5, ARM Cortex-R4, ARM Cortex-R5, ARM Cortex-R7, Ceva Teaklite-II and Ceva Teaklite-III-tl3210.]
VLIW DSP – What applications is it good for?
Low-power audio processing e.g. codecs
Software defined radio
Machine learning inference
Taking cost out of projects - replace dollars of DSP with cents of silicon
CPU hardware accelerators
Cambridge Consultants is part of the Altran group, a global leader in Innovation. www.Altran.com
www.CambridgeConsultants.com
UK, USA, Singapore, Japan
Registered No. 1036296 England