Upload
jabari
View
17
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Platform Design. ASIP Application Specific Instruction-set Processor. TU/e 5kk70 Henk Corporaal Bart Mesman. Application domain specific processors (ADSP or ASIP). DSP. Programmable CPU. Programmable DSP. Application domain specific. Application - PowerPoint PPT Presentation
Citation preview
Platform Design
TU/e 5kk70Henk Corporaal Bart Mesman
ASIPApplication Specific Instruction-set Processor
04/22/23 Platform Design H.Corporaal and B. Mesman
2
flexibility
efficiency
DSP
Programmable CPU
Programmable DSP
Application domain specific
Applicationspecific processor
Application domain specific processors (ADSP or ASIP)
04/22/23 Platform Design H.Corporaal and B. Mesman
3
Application domain specific processors (ADSP or ASIP)
takes a well defined application domain as a starting point• exploits characteristics of the domain (computation kernels)• still programmable within the domain
e.g. MPEG2 coding uses 8*8 DCT transform, DECT, GSM etc ...
performance: clock speed + ILP ILP,DLP, tuning to domain flexible dev. (new apps.) cost effective (high volume)
Appl. domain
implementation
ADSP
implementation
Appl. domain
GP
problems - specification manual design, - design time and effort large effort => synthesized cores
04/22/23 Platform Design H.Corporaal and B. Mesman
4
Part DescriptionClock(MHz)
Size(gates)
ROM(Kbyte)
RAM(Kbyte)
Speech Components
ADPCM Full duplex ITU-T G.726 compliant and 40 kbit/s speech-compression encoder/decoder. 4 5,100 1.3 0.128
ADPCM-16 Full duplex 16 Channel ITU-T G.726 compliant 16, 24, 32 and 40 kbit/s speech-compression encoder/decoder. 32 10,200 1.3 2.048
IW-ASRSpeechRecognition
Template-based speaker-dependent, isolated-word automatic speech recognition 1.3 9,000 6approx.1kbyte/word
G.723.1 Low bit-rate ITU-TG.723.1 compliant speech-compression at 6.3 kbit/s; can be combined with G.723.1A. 20 24,000 22 2.3
G.723.1AExtended version of G.723.1 to reduce bit rate by a silence compression scheme. Uses voice activity detection andcomfort-noise generation. Fully compliant with Annex A of speech-compression standard CODEC G.723.1.Yields no additional hardware cost.
20 24,000 22 2.3
SpeechSynthesis
Phrase-concatenated speech synthesisDepends on compressionrequirements
Telecommunications
EchoCancellation
High-performance Echo-cancellation and suppression processor. 4 6,000 2.80 0.15
DTMF Full-duplex DTMF transceiver. 2 4,000 1.00 0.15
Caller-ID On-hook and off-hook caller line identification. Includes DTMF and V.23. 3 6,000 2.10 0.15
Reed-Solomon Full-duplex Reed-Solomon codec 7,000 3.75 0.15
ViterbiDecoder
Configurable rate, code and constraint-length. (depending on throughput) Configurable traceback depth. Supportssoft & hard decision making. Supports code puncturing.
5,000
to9,000
--- ---
V.23 modem ITU-T V23 compliant 1200 baud FSK modem 6,000 0.80 0.15
Other
Pink NoiseGenerator
Low-ripple pink noise filter with filter characteristic of -3 ± 0.08 dB per octave over the bandwidth 20Hz to 20kHz 4,000 0.10 0.10
CCIR 656/601 Digital video converter : CCIR to raw-video data and vice versa. 1,500 none none
www.adelantetech.com
04/22/23 Platform Design H.Corporaal and B. Mesman
5
application(s)processor
-model
OK?
more appl.? yes
no
noyes
Estimationscycles/algoccupation
HWdesign
SW (code generation)
Estimationsnsec/cycle,
area, power/instr
go to phase 2
3 phases 1. exploration 2. hw design (layout) + processing 3. design appl. sw
Fast, accurate and early feedback
Design process
parametersinstance
e.g. VLIW withshared RFs
04/22/23 Platform Design H.Corporaal and B. Mesman
6
A compiler is retargetable if it can generate code for a ‘new’ processor architecture specified in a machine description file.
A guarded register transfer pattern (GRTP) is a register transferpattern (RTP) together with the control bits of the instruction word that control the RTP. a: = b + c | instr = xxxx0101GRTPs contain all inter-RT-conflict information.
Instruction set extraction (ISE) is the process of generating all possible GRTPs for a specific processor.
Problem statement
04/22/23 Platform Design H.Corporaal and B. Mesman
7
Algorithmspec
FE
CDFG
Code Generation
Machinecode
Processorspec (instance)
ISE
GRTP
Problem statement
in ch 4 this is
part of the code
generator
04/22/23 Platform Design H.Corporaal and B. Mesman
8
PC
IM
+1
I.(20:0)
RAM
I.(12:5)
I.(4)
Inp
I.(20:13)
I.(3:2)
I.(1:0)
REG
outp
Example: Simple processor [Leupers]
04/22/23 Platform Design H.Corporaal and B. Mesman
9
Instruction Instruction bits21111111111098765432109876543210
PC := PC + 1 xxxxxxxxxxxxxxxxxxxxxREG := Inp xxxxxxxxxxxxxxxxx011x
REG := IM PC .(20..13) xxxxxxxxxxxxxxxxx001x
REG := RAM IM PC . (12..5 ) xxxxxxxxxxxxxxxxx1x1xREG := REG - Inp xxxxxxxxxxxxxxxxx0101
REG := REG - IM PC .(20..13) xxxxxxxxxxxxxxxxx0001
REG := REG - RAM IM PC . (12..5 ) xxxxxxxxxxxxxxxxx1x01REG := REG + Inp xxxxxxxxxxxxxxxxx0100
REG := REG + IM PC .(20..13) xxxxxxxxxxxxxxxxx0000
REG := REG + RAM IM PC . (12..5 ) xxxxxxxxxxxxxxxxx1x00RAM IM PC . (12..5 ) := REG xxxxxxxxxxxxxxxx1xxxxoutp := REG xxxxxxxxxxxxxxxxxxxxxRAM_NOP xxxxxxxxxxxxxxxx0xxxx
Example: Simple processor [Leupers]
04/22/23 Platform Design H.Corporaal and B. Mesman
10
ASIP/VLIW architectures
A|RT designer template as an example (= set of rules, a model)
Differences with VLIW processors of ch. 41. // FUs
• ASUs = complex appl. Spec. FUs (beyond subword //) e.g. biquad, median, DCT etc …
• larger grainsize, more heterogeneous, more pipelines2. Rfiles
• many Rfiles (>5 vs 1 or 2)• limited # ports (3 vs 15) • limited size (<16 vs. 128)
3. Issue slots• all in parallel vs. 5
04/22/23 Platform Design H.Corporaal and B. Mesman
11
RF1
FU1
RF2 RF3
FU2
RF4 RF5
FU3
RF6 RF7
FU4
RF8
IR1 IR2 IR3 IR4
Instruction memory Con-trol
flags
04/22/23 Platform Design H.Corporaal and B. Mesman
12
readaddress
RF 1
writeaddress
RF 1
readaddress
RF 2
writeaddress
RF 2mux 1 mux 2
controlFU
outputdrivers
Additional characteristics of the A|RT designer template• interconnect network: busses + input multiplexers
mux control is part of the instruction control can change every clock cycle network can be incomplete busses can be merged
• memories are modeled as FUs separate data in and data out 2 inputs (data in and address) and 1 output
• Each FU can generate one or more flags• instruction format (per issue slot)
ASIP/VLIW architectures
04/22/23 Platform Design H.Corporaal and B. Mesman
13
ALU MACbus1 bus2
RF1 RF2 RF3 RF4
mux 2
read RF1
write RF1
read RF2
write RF2
ALU instr.mux
3read RF4
write RF4
read RF3
write RF3
MAC instr.
091019
ASIP/VLIW architectures: example
04/22/23 Platform Design H.Corporaal and B. Mesman
14
GRTP Instruction bits1 1 1 1 1 1 1 1 1 19 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
RF1 = ALU (RF1, RF2) x c c c c x x c c c x x x x x x x x x xRF2 = ALU (RF1, RF2) x c x c c c c c c c x x x x x x x x x xRF3 = ALU (RF1, RF2) x c x c c x x c c c c x x c c x x x x xRF3 = MAC (RF3, RF4) x x x x x x x x x x c c c c c c x c c cRF4 = MAC (RF3, RF4) x x x x x x x x x x x c c x x c c c c cRF2 = MAC (RF3, RF4) c x x x x c c x x x x c c x x c x c c c
ASIP/VLIW architectures : example
04/22/23 Platform Design H.Corporaal and B. Mesman
15
Datapath synthesis
Controller synthesis
OK?
Changepragmas
Algorithmspec
no
yes
RTs
Estimationsarea, power, timing
RF1 : x = RF2 : y, RF3 : z | ALU = ADDInmux = bus2
assign ( a+b, ALU, fu_alu1)assign ( a+_, ALU, fu_alu2)assign ( _+_, ALU, fu_alu3)
VLIW makes relatively simple code selection
possible
ASIP/VLIW architectures:design flow
04/22/23 Platform Design H.Corporaal and B. Mesman
16
*1
+2
*3
*4
*5
+6
+7
*8
*9
+10
IPB
OPB
ALU
MULT
IPB
OPB
+2*3
*1
*1
*3
+2
*1
*3
*4
*3
*4
*4
*3+6
*3
+6
+7*8
*5
*5
*8
*8
+7
*5
*9
*5
*9
*5
*9+10
*9
+10
CandidateLIST
Conflict & Priority Comp.
ScheduledOperation
0 0
1 1
2 2
3 3
4 4
5
ASIP/VLIW architectures: list scheduling
04/22/23 Platform Design H.Corporaal and B. Mesman
17
architecture viewarchitecture view
life-time analysislife-time analysis
resource loadresource load
bus loadbus load
cycle-countcycle-count
ASIP/VLIW architectures: feedback
04/22/23 Platform Design H.Corporaal and B. Mesman
18
ImplementationIndependent
Design Database
ImplementationIndependent
Design Database
Low power aspects
• Estimation
EXU ACTIVITY AREA POWERalu_1 20% 261 105acs_asu_1 83% 2382 3816or_asu_1 10% 611 122romctrl_1 16% 65 21acu_1 36% 294 205ipb_1 20% 107 43opb_1 11% 163 35ctrl 1864 3597total 5747 7944
area
speed
power
Estimation Database
+Architecture
Mistral2 Mistral2
04/22/23 Platform Design H.Corporaal and B. Mesman
19
GSM viterbi decoder : default solution
13750
EXU ACTIV AREA POWERalu_1 96% 3469 46196romctrl_1 48% 39 259acu_1 26% 327 1209ipb_1 5% 131 105opb_1 23% 1804 5801ctrl 9821 135035total 15591 188605
EXU ACTIV AREA POWERalu_1 96% 3469 46196romctrl_1 48% 39 259acu_1 26% 327 1209ipb_1 5% 131 105opb_1 23% 1804 5801ctrl 9821 135035total 15591 188605
• controller responsible for 70% of power consumption
– maximum resource-sharing
– heavy decision-making : “main” loop with 16 metrics-computations per iteration
• EXU-numbers include Registers for local storage
04/22/23 Platform Design H.Corporaal and B. Mesman
20
GSM viterbi decoder : no loop-folding
• area down by 33%• power down by 35%• next step: reduce # of program-steps with
second ALU
14247
EXU ACTIV AREA POWERalu_1 92% 3411 45073romctrl_1 45% 39 255acu_1 25% 294 1087ipb_1 5% 107 86opb_1 22% 1661 5340ctrl 4919 70087total 10431 121928
EXU ACTIV AREA POWERalu_1 92% 3411 45073romctrl_1 45% 39 255acu_1 25% 294 1087ipb_1 5% 107 86opb_1 22% 1661 5340ctrl 4919 70087total 10431 121928
04/22/23 Platform Design H.Corporaal and B. Mesman
21
GSM viterbi decoder : 2 ALU’s
9739
EXU ACTIV AREA POWERalu_1 69% 1797 12248alu_2 65% 1393 8916romctrl_1 67% 39 255acu_1 37% 294 1087ipb_1 8% 149 119opb_1 33% 2136 6871ctrl 8957 87235total 14766 116731
EXU ACTIV AREA POWERalu_1 69% 1797 12248alu_2 65% 1393 8916romctrl_1 67% 39 255acu_1 37% 294 1087ipb_1 8% 149 119opb_1 33% 2136 6871ctrl 8957 87235total 14766 116731
cycle count down 30%
area up 42% power down by 5% next step: introduce
ASU to reduce ALU-load
04/22/23 Platform Design H.Corporaal and B. Mesman
22
GSM viterbi decoder : 1 x ACS-ASU
EXU ACTIV AREA POWERalu_1 20% 261 105acs_asu_1 83% 2382 3816or_asu_1 10% 611 122romctrl_1 16% 65 21acu_1 36% 294 205ipb_1 20% 107 43opb_1 11% 163 35ctrl 1864 3597total 5747 7944
EXU ACTIV AREA POWERalu_1 20% 261 105acs_asu_1 83% 2382 3816or_asu_1 10% 611 122romctrl_1 16% 65 21acu_1 36% 294 205ipb_1 20% 107 43opb_1 11% 163 35ctrl 1864 3597total 5747 7944
func ACS ( M1, M2, d ) MS, MS8 =begin MS = if ( M1+d > M2-d ) -> ( M1+d) || ( M2-d) fi; MS8 = if ( M1- d > M2+d) -> ( M1- d) || ( M2+d) fi;end;
func ACS ( M1, M2, d ) MS, MS8 =begin MS = if ( M1+d > M2-d ) -> ( M1+d) || ( M2-d) fi; MS8 = if ( M1- d > M2+d) -> ( M1- d) || ( M2+d) fi;end;
=
1930
cycle count down 5X power down 20X !
04/22/23 Platform Design H.Corporaal and B. Mesman
23
GSM viterbi decoder : 4 x ACS-ASU
EXU ACTIV AREA POWERalu_1 94% 243 97acs_asu_1 95% 1041 420acs_asu_2 95% 1041 420acs_asu_3 95% 1041 420acs_asu_4 95% 1041 420split_asu_1 47% 90 18or_asu_1 47% 592 118romctrl_1 28% 48 6acu_1 98% 212 85ipb_1 23% 60 6opb_1 50% 369 80ctrl 1306 555total 7084 2645
EXU ACTIV AREA POWERalu_1 94% 243 97acs_asu_1 95% 1041 420acs_asu_2 95% 1041 420acs_asu_3 95% 1041 420acs_asu_4 95% 1041 420split_asu_1 47% 90 18or_asu_1 47% 592 118romctrl_1 28% 48 6acu_1 98% 212 85ipb_1 23% 60 6opb_1 50% 369 80ctrl 1306 555total 7084 2645
cycle count down another 5X
area up 23% power down another
3X !
425
04/22/23 Platform Design H.Corporaal and B. Mesman
24
GSM viterbi example : summary
ImplementationIndependent
Design Database
ImplementationIndependent
Design Database
0
2000
4000
6000
8000
10000
12000
14000
16000
18000
20000
default loop 2 ALU 1 ACS 4 ACS
power
areacycles
72x !72x !
Mistral2 Mistral2
04/22/23 Platform Design H.Corporaal and B. Mesman
25
Exploration phase
Application softwaredevelopment:
constraint driven compilation
application(s)processor
-model
OK?
more appl.? yes
no
noyes
HWdesign
SW (code generation)
application(s)
OK?no
yes
SW (code generation)
Freezeprocessor
model
no
Discussion: phase 3
04/22/23 Platform Design H.Corporaal and B. Mesman
26
Discussion: problems with VLIWs
• code compaction = reduce code size after scheduling possible compaction ratio ?e.g. p0 = 0.9 and p1 = 0.1 information content (entropy) = - pi log2 pi = 0.47
maximum compression factor 2 • control parallelism during scheduling = switch between
different processor models (10% of code = 90% runtime) • architecture
reduce number of control bits for operand addressese.g. 128 reg (TM) -> 28 bits/issue slot for addresses only=> use stacks and fifos
code size and instruction bandwidth
04/22/23 Platform Design H.Corporaal and B. Mesman
27
RF1
FU1 FU2 FU3 FU4
IR1 IR2 IR3 IR4
Instruction memory Con-trol
flags
RF2 RF3 RF4
04/22/23 Platform Design H.Corporaal and B. Mesman
28
Conclusions
• ASIPs provide efficient solutions for well-defined application domains (2 orders of magnitude higher efficiency).
• The methodology is interesting for IP creation.
• The key problem is retargetable compilation.
• A (distributed) VLIW model is a good compromise between HW and SW.
• Although an automatic process can generate a default solution, the process usually is interactive and iterative for efficiency reasons. The key is fast and accurate feedback.