ENG3050 Embedded Reconfigurable Computing Systems
Application Specific Instruction Processors “ASIPs”
“Reconfigurable Processors”
ENG3050 ERCS 2
Topics
• ASIPs: Definition
• Motivation
• How to customize ASIPs
• Tools for ASIPs
• Approaches
• Conclusions
References
1. “Engineering the Complex SOC: Fast, Flexible Design with Configurable Processors”, Chris Rowen, 2004.
2. “Xtensa Architecture and Performance”, Tensilica Inc., Sep 2002.
3. “Configurable Processors: What, Why, How?”, Tensilica Inc., June 2007.
Microprocessors and ASICs
• For the ultimate in flexibility, programmers map the application onto a general-purpose microprocessor.
• For the ultimate in performance, logic designers map the application into a custom circuit.
[Figure: Application → Microprocessor (programmers); Application → ASIC (logic designers); FPGA]
Classic Options for Systems-on-Chip
[Figure annotation: Design Gap!]
General Purpose Processors
A Case for Customization
• General Purpose Processors: flexible, but tend to customize the application to the architecture!
• ASICs: high performance, but expensive, and tend to customize the architecture to the application!
• We need to find a technology that can customize the architecture to the application and at the same time is flexible and cheap!
Processor Specialization: Get the Best of Both Options
[Figure annotation: Gains!]
Motivations: reduce size
• A Pentium 4 die can fit about 50 ARM9 processors at 0.13 um, and 80 at 0.10 um.
• At 0.13 um and a 250 MHz clock, an ARM9 dissipates 0.1 W, so 50 ARM9s = 5 W.
• ARM9 at 0.13 um = 3 mm^2; Pentium 4 at 0.13 um = 144 mm^2 (12 mm x 12 mm die).
• Cost, power, and size are important for embedded applications!
• Processing vs. dedicated hardware (ASIC)?
• System-on-a-Chip concept
Programmable Processors
• Past: Microprocessor, Microcontroller, DSP, Graphics Processor
• Now / Future: Network Processor, Sensor Processor, Crypto Processor, Game Processor, Wearable Processor, Mobile Processor
A Case for Customization
• General-purpose processors handle many applications fairly well, but…
  – Each application has different requirements
  – The instruction set is fixed!
  – Data path width may not suit your application!
  – Cache size/configuration may not be optimal
  – Register file is either too small or …
  – Functional units might be missing or …
  – Internal busses are slow or too narrow …
Processor Customizations
• Specialized instructions
  – Optimization, searching, classification, …
• Specialized functional units
  – MAC units, special comparators, sorting units
• Parameterized busses and datapaths
  – 8-bit, 16-bit, synchronous/asynchronous busses
• Parameterized register files
• Parameterized caches
  – Cache size, replacement strategy, …
[Figure: processor (µP) with register file, D/I caches, and functional units FU1, FU2, FU3]
Application-specific instruction processors
• An ASIP is a stored-program CPU whose architecture is tailored for a particular set of applications.
  – Instruction sets tailored to specific applications or application domains
  – Customized functional units within the data path for high performance
• Programmability allows changes to the implementation, so the same ASIP can be used in several different products.
• Application-specific architecture provides smaller silicon area, higher speed, and lower power consumption.
Recall: Different levels of coupling
[Figure: levels of coupling to a workstation (CPU, memory caches, I/O interface), ranging from tightly coupled to loosely coupled: FU, Coprocessor, Attached Processing Unit, Standalone Processing Unit]
Application Specific Instruction Processors
[Figure: FPGA, ASIC, and µP each rated on design cost, time-to-market, flexibility, determinism, power, and performance]
Embedded Applications Requirements
[Figure: FPGA, ASIC, µP, and ASCP rated on design cost, time-to-market, flexibility, determinism, power, and performance]
• ASCP: Application-Specific Customizable Embedded Processor
  – Helps preserve the benefits of generality
  – Alleviates the drawbacks of general-purpose processors
Performance vs. Flexibility
[Figure: flexibility plotted against performance for ASIC, GPP, DSP, RCS, and ASIPs]
ASIPs: Advantages
• Tailor for specific applications by:
  – Customizing the instruction set
  – Adding customized execution units that efficiently perform task-specific algorithms
  – Adding special registers sized to the natural data types of the tasks to be performed
• Instructions will often execute in one or two clock cycles, which keeps clock rates low and thus energy consumption low as well.
• You can further customize the processor as your application evolves with time.
ASIP Design Methodology
[Figure: application mapped onto a design-time configurable microprocessor]
• Profile the application.
• Create custom hardware and instructions to accelerate critical application sections.
• Most of the application runs as execution of general-purpose instructions.
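The payoff of accelerating only the critical sections can be estimated with Amdahl's law. The sketch below is illustrative only: the function name `overall_speedup` and the numbers in the usage note are assumptions, not taken from the slides.

```c
/* Amdahl's law: overall speedup when a fraction `hot_fraction` (0..1) of the
 * original run time is accelerated by a factor `local_speedup` (> 0), while
 * the rest still runs as general-purpose instructions.
 * Hypothetical helper for illustration; returns 0.0 on invalid input. */
double overall_speedup(double hot_fraction, double local_speedup)
{
    if (hot_fraction < 0.0 || hot_fraction > 1.0 || local_speedup <= 0.0)
        return 0.0;
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / local_speedup);
}
```

For example, if profiling showed that 80% of the run time sits in a loop that custom instructions make 10 times faster, `overall_speedup(0.8, 10.0)` is about 3.57. The untouched 20% limits the gain, which is why profiling comes first.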
ASIP based approach: Reconfigurable Instruction Set Processors
Compiler Structure
[Figure: compiler flow. C Code → C Parsing → Optimizations → Inst. Identification → Inst. Selection → Config. Scheduling → Code Generation → Assembly Code. Side blocks: Hardware Estimator, Hardware Generation, Configuration bits]
Instruction Set Extension
• Idea: provide a way to augment the processor’s instruction set with operations needed by a particular application.
Determinants of CPU Performance
CPU time = Instruction_count × CPI × clock_cycle

                         Instruction_count   CPI   clock_cycle
Algorithm                        X            X
Programming language             X            X
Compiler                         X            X
ISA                              X            X         X
Processor organization                        X         X
Technology                                              X
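The performance equation can be evaluated directly to see how an instruction-set change trades instruction count against CPI. This is a minimal sketch; the function name and the example figures in the usage note are assumptions chosen for illustration.

```c
/* CPU time = instruction_count * CPI * clock_cycle (seconds per cycle). */
double cpu_time_seconds(double instruction_count, double cpi,
                        double clock_cycle_s)
{
    return instruction_count * cpi * clock_cycle_s;
}
```

As a hypothetical case, a baseline of 1e9 instructions at CPI 1.2 with a 4 ns cycle (250 MHz) takes 4.8 s. If custom instructions cut the count to 6e8 while raising CPI to 1.3 at the same clock, the time drops to 3.12 s, roughly a 1.5x speedup.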
Instruction Specialization
• The instruction set determines the functions directly implemented in hardware and the operations which can be performed in parallel.
• How to improve the instruction set?
  – Operations which can frequently be scheduled concurrently should be coded in the same instruction
  – Operations which can often be chained should be coded in the same way
    – Multiply-accumulation
    – Vector operations
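Multiply-accumulation is the classic chained pair. The sketch below models it in plain C so that it runs anywhere; the names `mac` and `dot_product` are illustrative, and on an ASIP the body of `mac` would correspond to a single instruction. It assumes the running sum stays within the 32-bit range.

```c
#include <stddef.h>
#include <stdint.h>

/* Software model of a chained multiply-accumulate: acc + a * b in one step.
 * Assumes the result fits in 32 bits. */
static inline int32_t mac(int32_t acc, int16_t a, int16_t b)
{
    return acc + (int32_t)a * (int32_t)b;
}

/* Dot product written so that each loop iteration is exactly one MAC. */
int32_t dot_product(const int16_t *x, const int16_t *y, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = mac(acc, x[i], y[i]);
    return acc;
}
```

Without the chained operation, each iteration would need a multiply followed by a separate add, so coding the two together halves the arithmetic instruction count of the loop body.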
Instruction Set Customization
• Computationally demanding parts of applications run on special hardware
• New instructions use the special hardware
[Figure: a cluster of standard operations (XOR, MPY, LD, SHR, MOV, AND) replaced by a CUSTOM instruction]
Automatically Collapsing Clusters of Instructions into New Ones
• One ad-hoc complex operation instead of a long sequence of standard ones
• If the ad-hoc functional unit completes the job faster: GAIN
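A software model can check that a collapsed operation computes the same result as the sequence it replaces, and estimate the gain. Everything here is hypothetical: the three-operation cluster (XOR, SHR, AND), the function names, and the cycle counts are chosen only to illustrate the idea.

```c
#include <stdint.h>

/* The cluster as three dependent standard operations. */
uint32_t cluster_standard(uint32_t a, uint32_t b, uint32_t mask)
{
    uint32_t t = a ^ b;   /* XOR */
    t = t >> 3;           /* SHR */
    t = t & mask;         /* AND */
    return t;
}

/* Software stand-in for the collapsed instruction: one call models one
 * ad-hoc operation issued to a custom functional unit. */
uint32_t custom_op(uint32_t a, uint32_t b, uint32_t mask)
{
    return ((a ^ b) >> 3) & mask;
}

/* Cycles saved over `executions` runs, given assumed latencies for the
 * standard sequence and for the custom operation. */
long cycles_saved(long executions, int standard_cycles, int custom_cycles)
{
    return executions * (long)(standard_cycles - custom_cycles);
}
```

If the sequence took 3 cycles and the custom unit 1 cycle, 1000 executions would save 2000 cycles. A negative result would mean the ad-hoc unit is slower and there is no gain.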
Function Unit and Data Path Specialization
• To reduce power consumption and increase performance:
  – Word length adaptation
  – Implementation of application-specific HW functions
    – String manipulation
    – String matching
    – Pixel operation
    – Multiplication-accumulation
• Special consideration: clock frequency
  – It may be better to use a slower clock in embedded systems.
Customized Function Units
• Goal: support important computation subgraphs
• Add specialized units within the data path of the processor
• Exploits subgraph parallelism
• Allows natural data propagation
[Figure: pipeline with Fetch, Issue, ALUs, CCA, and WB stages; the CCA is drawn as rows of FUs fed by inputs IN 1, IN 2, …]
Interconnect Specialization
• Specialization can be done with respect to:
  – Interconnect of functional modules
    – Reduced bus instead of standard system bus to save cost or power consumption
    – Dedicated connection between registers (accumulator) and memories to increase parallelism
  – Protocol used for the communication between components
    – Synchronous
    – Asynchronous
    – Semi-synchronous
Optimizing Power in ASIPs
• Configurable processors have a deep influence on low-power design in two ways:
  – Compared to hardwired logic, software-based design allows for more sophisticated algorithms and control of operating modes. In many applications, the software can be much smarter than custom RTL about when to run and how fast.
  – ASIPs pack the same work into far fewer cycles than GPPs, allowing the SOC to run at a lower clock frequency (How?)
Optimizing Power in ASIPs
• E = alpha · C · V^2 · n
  – E is the energy used by active switching in CMOS logic
  – C is the total capacitance of all the switched nodes in the circuit
  – V is the voltage
  – alpha is the average fraction of circuit nodes switching between one and zero each cycle
  – n is the number of cycles required to execute the function
Optimizing Power (insight)
• The impact of a good processor configuration is to sharply reduce ‘n’, while increasing ‘C’ only slightly relative to a baseline processor.
• ASIPs can be quite smart about activating execution units only when necessary.
  – The processor generator can determine the combinations of logic blocks that must be active at each stage of the pipeline and create logic for fine-granularity clock gating, thereby reducing ‘alpha’.
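The trade-off between ‘n’, ‘C’ and ‘alpha’ can be made concrete with the switching-energy relation E = alpha · C · V^2 · n. The sketch below is a hedged illustration: the function names and the sample ratios in the usage note are assumptions, not measured values.

```c
/* Active switching energy in CMOS logic: E = alpha * C * V^2 * n. */
double switching_energy(double alpha, double capacitance_f,
                        double voltage_v, double cycles)
{
    return alpha * capacitance_f * voltage_v * voltage_v * cycles;
}

/* Energy of a configured processor relative to a baseline at the same
 * voltage. Each argument is configured/baseline for that term; V cancels. */
double energy_ratio(double alpha_ratio, double c_ratio, double n_ratio)
{
    return alpha_ratio * c_ratio * n_ratio;
}
```

Suppose, hypothetically, that the added hardware raises C by 20% while the cycle count falls tenfold and alpha is unchanged. Then `energy_ratio(1.0, 1.2, 0.1)` is 0.12, so the function would use about 12% of the baseline energy. Clock gating that lowers alpha would reduce it further.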
Tools?
Tensilica
• Tensilica has two main product lines of 32-bit processor cores for SOC design (IP):
  1. Diamond Standard processors (not modifiable)
  2. Xtensa processors (can be modified)
• Tensilica also has several CAD tool flows to extend the instruction sets:
  – TIE Language
  – XPRES Compiler
1. Tensilica Diamond Processor
• A set of off-the-shelf synthesizable cores (fixed and not configurable), directly available from Tensilica and foundry partners, that range from area-efficient, low-power controllers to an audio processor, a high-performance DSP, and a video processor.
• Diamond Standard processors come with a comprehensive software tool set:
  – Compilers
  – Assemblers
  – Debuggers, …
2. Tensilica Xtensa Processor
• Tensilica’s Xtensa processors are synthesizable processors that are configurable and extensible!
Xtensa Processors Architecture
• The Xtensa Instruction Set Architecture (ISA) is a 32-bit RISC architecture featuring a compact instruction set optimized for embedded designs.
• RISC?
  – A small number of memory addressing modes
  – Large uniform register files for computation operations
  – Fixed-size instruction words
  – Optimized pipelined architecture
  – Simple and fixed instruction-field encoding
  – Memory access via loads and stores of registers
Xtensa Processors Architecture
• The architecture has:
  – a 32-bit ALU;
  – 16, 32 or 64 general-purpose physical registers;
  – six special-purpose registers;
  – Cache:
    – up to 32 KB and 1-, 2-, 3-, or 4-way set associative cache?
    – Replacement policy?
    – Write back vs. write through?
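The configuration ranges quoted above can be captured as a small validity check. This is a sketch under stated assumptions: the struct, its field names, and `config_is_valid` are hypothetical and are not part of any Tensilica tool. Only the limits themselves come from the slide.

```c
#include <stdbool.h>

/* Hypothetical description of a core configuration (illustrative names). */
struct core_config {
    unsigned gp_registers;  /* general-purpose physical registers */
    unsigned cache_kb;      /* cache size in KB */
    unsigned cache_ways;    /* set associativity */
};

/* True when the configuration stays within the quoted ranges:
 * 16, 32 or 64 registers; cache of at most 32 KB; 1- to 4-way associative. */
bool config_is_valid(const struct core_config *cfg)
{
    bool regs_ok = cfg->gp_registers == 16 || cfg->gp_registers == 32 ||
                   cfg->gp_registers == 64;
    bool cache_ok = cfg->cache_kb <= 32;
    bool ways_ok = cfg->cache_ways >= 1 && cfg->cache_ways <= 4;
    return regs_ok && cache_ok && ways_ok;
}
```

A design-space exploration script could enumerate candidate configurations and discard those for which the check fails before estimating area or power.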
Xtensa Processors Architecture
• The architecture has:
  – a 32-bit ALU;
  – 16, 32 or 64 general-purpose physical registers;
  – six special-purpose registers;
  – 5- or 7-stage pipelines:
    – 5-stage: power usage 47 uW/MHz @ 350 MHz
    – 7-stage: power usage 57 uW/MHz @ 400 MHz
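The uW/MHz figures turn into absolute power once multiplied by the clock frequency. The helper below is a minimal sketch with an illustrative name; it assumes the quoted figure scales linearly with frequency, which the slide does not state.

```c
/* Power in microwatts from a uW/MHz rating and a clock in MHz. */
unsigned long core_power_uw(unsigned uw_per_mhz, unsigned clock_mhz)
{
    return (unsigned long)uw_per_mhz * clock_mhz;
}
```

With the quoted numbers, `core_power_uw(47, 350)` gives 16450 uW (16.45 mW) for the 5-stage pipeline and `core_power_uw(57, 400)` gives 22800 uW (22.8 mW) for the 7-stage pipeline.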
Tensilica Xtensa Architecture
[Figure: Tensilica Xtensa architecture]
Xtensa Processor Generator
• The designer can select from a broad selection of predefined standard RISC microprocessor options and can add instructions and register extensions to the tailored processor.
• Or the designer can use Tensilica's XPRES Compiler to automatically tailor the processor to optimize existing C/C++ code.
• The Xtensa Processor Generator then creates the complete processor solution set: pre-verified processor hardware description in source RTL (Verilog or VHDL), plus supporting hardware implementation methodology scripts.
• This complete package includes software development tools, including commercial RTOS support, and comprehensive system modeling and co-verification support.
XPRES Compiler
[Figures: three slides on the XPRES Compiler]
Tensilica Instruction Extension (TIE)
• TIE is a Verilog-like language used to describe desired custom instructions.
• You can express the desired functionality in the Tensilica Instruction Extension (TIE) language.
• TIE helps you get orders-of-magnitude performance increases out of your processor design through:
  1. Fusion,
  2. SIMD (Single Instruction Multiple Data),
  3. FLIX (Flexible Length Instruction Xtensions)
TIE Extensions
(I) Fusion
Effect of TIE Instructions
TIE Flow
Fusion Example
Exploiting Parallelism
Creating SIMD TIE Execution Units
FLIX Acceleration
Creating FLIX (VLIW) Acceleration
• An Xtensa processor can become a multi-issue VLIW processor.
• The Xtensa C/C++ compiler can aggressively extract instruction-level parallelism from the code and schedule multiple operations in a single VLIW instruction.
• By allowing two or three instructions to execute simultaneously, FLIX allows the processor to act as a 2- or 3-issue VLIW CPU, accelerating general-purpose code by 40-60%.
FLIX
Estimation (energy)
Example: MPEG Acceleration
• One of the most difficult parts of encoding MPEG-4 video streams is motion estimation, which searches adjacent video frames for similar pixel blocks as part of the MPEG-4 compression algorithm.
• The search algorithm’s inner loop contains a SAD (sum of absolute differences) algorithm consisting of:
  – Subtraction
  – Absolute value operation
  – Addition of the resulting value with previously computed values
• For a QCIF (quarter common intermediate format) frame, a 15-Hz frame rate and an exhaustive-search motion estimation scheme, SAD operations require slightly more than 641 million operations/sec.
MPEG Acceleration
• Combining all three SAD component operations (subtraction, absolute value, addition) into one operation that executes in one clock cycle, and executing 16 single-pixel SAD operations in one SIMD SAD instruction during the same clock cycle, reduces the count from 641 million operations/sec to 14 million instructions/sec, a 98% reduction.
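The arithmetic behind the reduction can be checked with a software model. Fusing 3 operations into 1 and processing 16 pixels per instruction gives up to 3 × 16 = 48 times fewer operations, and 641 million / 48 is about 13.4 million, consistent with the quoted 14 million. The sketch below models the scalar and 16-wide forms in plain C; the function names are illustrative, and the loop inside `sad16` stands in for lanes that hardware would evaluate in the same cycle.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar SAD: subtract, absolute value, accumulate for each pixel pair. */
uint32_t sad_scalar(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (uint32_t)abs((int)a[i] - (int)b[i]);
    return sum;
}

/* Software model of one SIMD SAD instruction: 16 pixel pairs per call,
 * added to the running accumulator `acc`. */
uint32_t sad16(uint32_t acc, const uint8_t a[16], const uint8_t b[16])
{
    for (int lane = 0; lane < 16; lane++)
        acc += (uint32_t)abs((int)a[lane] - (int)b[lane]);
    return acc;
}

/* SAD over a block whose length is a multiple of 16: one sad16 call per
 * 16 pixels. Trailing pixels beyond a multiple of 16 are ignored. */
uint32_t sad_simd_model(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 16 <= n; i += 16)
        sum = sad16(sum, a + i, b + i);
    return sum;
}
```

For a 256-pixel block, the scalar form performs 768 component operations while the model issues 16 `sad16` calls, and both return the same sum.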
MPEG Acceleration
• The full MPEG-4 decoder adds approximately 100,000 gates to the base processor and implements a 2-way (coder and decoder) QCIF video codec that operates at 15 frames/sec.
• When instructions are added to accelerate all of these MPEG-4 decoding tasks, creating an MPEG-4 SIMD engine within the tailored processor, the results can be quite surprising.
• The resulting SIMD engine drops the number of cycles required to decode the MPEG-4 video clips from billions to millions, and cuts the required processor operating frequency by roughly 30x to around 10 MHz (power dissipation!!)
How Xtensa Compares
Reconfigurable Instruction Set Processors
Two roads to customization
• Augment GPPs with programmable logic
  – Couple a standard processor (ARM, MIPS) with an FPGA fabric
  – Fixed processor instruction set
  – FPGA implements custom instructions
• Implement the processors themselves in FPGAs
  – Customize instructions at compile time or at run time
Reconfigurable Instruction Set Processors
• Duplicated instruction decode logic (2 symmetrical data channels)
• Duplicated commonly used function units (ALU and shifter)
• All other function units are shared (DSP operations, memory handler)
• A tightly coupled pipelined configurable gate array
Dynamic Instruction Set Extension (1)
Original code:
    for (i = 0; i < 16; i++) {
        temp = abs(v1[i] - v2[i]);
        out = out + temp;
    }
Optimized XiRisc code:
    for (i = 0; i < 16; i++) {
        pgaop(out, v1[i], v2[i]);
    }
[Figure: PiCoGA next to the register file, ALUs & multiplier, and memory unit; the pgaop datapath consists of A-B and B-A units, a MUX, and an accumulator]
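The datapath labels in the figure (A-B, B-A, MUX, Accumulator) suggest how the absolute difference is formed without a separate abs step. The sketch below is a software model of that reading; `pgaop_model` is a hypothetical stand-in and not the actual XiRisc `pgaop` operation, whose exact semantics the slide does not give.

```c
/* Software model of the figure's datapath: both differences are computed,
 * a multiplexer picks the non-negative one, and an accumulator adds it to
 * the running total. */
void pgaop_model(int *out, int a, int b)
{
    int a_minus_b = a - b;                                    /* A-B unit */
    int b_minus_a = b - a;                                    /* B-A unit */
    int abs_diff = (a_minus_b >= 0) ? a_minus_b : b_minus_a;  /* MUX */
    *out += abs_diff;                                         /* Accumulator */
}

/* The optimized loop from the slide, expressed with the model. */
int sad16_with_pgaop(const int v1[16], const int v2[16])
{
    int out = 0;
    for (int i = 0; i < 16; i++)
        pgaop_model(&out, v1[i], v2[i]);
    return out;
}
```

The loop body shrinks from a subtract, an abs call, and an add to a single operation, which is the instruction-count saving the slide illustrates.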
Summary
• Configurable and extensible (tailorable) processor cores are a combination of hardware and software IP that give system developers the ability to tailor processors for better performance in specific applications.
• The main difference between GPPs and ASIPs is specialization. It is important to note that specialization must not compromise flexibility!
• Advantages:
  – Faster, more power efficient, less silicon area
  – No other company will have your version of that task-specific processor.
  – No one will have the matching compiler and software tool chain.
Conclusion
• ASIP design is related to hardware/software co-design methodology, since a GPP is involved along with hardware accelerators in the form of specialized functional units.
• Tensilica provides all the necessary tools to automatically create Application Specific Instruction Set Processors in minimum time.
  – The designer can rely on the TIE language to manually extend the instruction set of the newly created processor.
  – Another option is to rely on the Tensilica XPRES compiler to automatically create the processor and all the necessary software development tools, such as compilers, debuggers, …
• The designer can extend the capabilities of the processor by changing the cache, ports, queues, register files, functional units, …
• It is worth using the Tensilica tools to perform some type of design exploration for your application before you attempt to custom build hardware accelerators.