Upload
zagiri
View
49
Download
0
Embed Size (px)
DESCRIPTION
Application-Specific Customization of Soft Processor Microarchitecture. Peter Yiannacouras J. Gregory Steffan Jonathan Rose University of Toronto Edward S. Rogers Sr. Department of Electrical and Computer Engineering. Processors and FPGA Systems. We seek improvement through customization. - PowerPoint PPT Presentation
Citation preview
Application-Specific Customization of Soft Processor MicroarchitecturePeter YiannacourasJ. Gregory Steffan Jonathan Rose
University of Toronto
Edward S. Rogers Sr. Department of Electrical and Computer Engineering
2
Processors and FPGA Systems
We seek improvement through customization
Processors lie at the “heart” of FPGA systems
MemoryInterface
UART
Custom Logic
Ethernet
Performs coordination and even computation Better processors => less hardware to design
Soft Processor
3
Motivating Application-Specific Customizations of Soft Processors
1. FPGA Configurability Can consider unlimited processor variants
2. A soft processor might be used to run either:a) A single applicationb) A single class of applicationsc) Many applications, but can be reconfigured
3. Applications differ in architectural requirements Can specialize architecture for each application
We want to evaluate effectiveness of specialization
4
Research Goals
To investigate1. The potential for “Application-tuning”
Tune processor microarchitecture to favour an application Preserve general purpose functionality
2. “Instruction-set Subsetting” Sacrifice general purpose functionality Eliminate hardware not required by application
3. Combination of both methods
Measure efficiency gained through real implementations
5
SPREE
SPREE System (Soft Processor Rapid Exploration Environment)
RTL
ISA Datapath ■ Input: Processor description■ Made of hand-coded components
1. Verify ISA against datapath
2. Datapath Instantiation3. Control Generation
■ Multi-cycle/variable-cycle FUs■ Multiplexer select signals■ Interlocking■ Branch handling
■ SPREE System
■ Output: Synthesizable Verilog
ProcessorDescription
6
Back-End Infrastructure
RTL
2. Resource Usage3. Clock Frequency4. Power
1. Cycle Count
Quartus II 4.2CAD Software
ModelsimRTL Simulator
Benchmarks(MiBench,
Dhrystone 2.1,RATES,XiRisc)
Stratix 1S40C5
We can measure area/performance/energy accurately
7
Comparison to Altera’s Nios II
Has three variations:Nios II/e – unpipelined, no HW multiplierNios II/s – 5-stage, with HW multiplierNios II/f – 6-stage, dynamic branch prediction
8
Architectural Parameters Used in SPREE
We focus on core microarchitecture
Multiplication Support Hardware FU or software routine
Shifter implementation Flipflops, multiplier, or LUTs
Pipelining Depth
(2-7 stages)
Organization Forwarding
9
300
500
700
900
1100
1300
1500
1700
1900
500 700 900 1100 1300 1500 1700 1900
Area (Equivalent LEs)
Ge
om
ea
n W
all
Clo
ck
Tim
e (
us
) SPREE Processors
Altera Nios II/e
Altera Nios II/s
Altera Nios II/f
SPREE vs Nios II
smaller
faster
-3-stage pipe-HW multiply-Multiply-based shifter
10
Exploration of Soft Processor Architectural Customizations
1. Architectural-tuning
2. Instruction-set subsetting
3. Combination (Arch-tuning + Subsetting)
11
1. Architectural Tuning Experiment
Vary the same parameters Multiplication Support Shifter implementation Pipelining
Determine1. Best overall (general purpose) processor
2. Best per application (application-tuned)
Metric: Performance per Area (MIPS/LE) Basically inverse of Area-Delay product
12
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08b
ub
ble
_so
rt
crc
de
s fft fir
qu
an
t
iqu
an
t
turb
o
vlc
bitc
nts
CR
C3
2
qso
rt
sha
stri
ng
sea
rch
FF
T
dijk
stra
pa
tric
ia go
l
dct
dh
ry
OV
ER
AL
L
Benchmark
Pe
rfo
rma
nc
e p
er
Un
it A
rea
(M
IPS
/LE
)
Performance per Area of All Processors
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08b
ub
ble
_so
rt
crc
de
s fft fir
qu
an
t
iqu
an
t
turb
o
vlc
bitc
nts
CR
C3
2
qso
rt
sha
stri
ng
sea
rch
FF
T
dijk
stra
pa
tric
ia go
l
dct
dh
ry
OV
ER
AL
L
Benchmark
Pe
rfo
rma
nc
e p
er
Un
it A
rea
(M
IPS
/LE
)
General Purpose
serialshift_norise
serialshift_lowrise
serialshift_judicialrise
serialshift_highrise
mulshift_norise
mulshift_minrise
mulshift_highrise
barrelshift_norise
barrelshift_minrise
barrelshift_highrise
serialshiftdatamem_norisepipe3_serialshift
pipe3_mulshift
pipe3_barrelshift
pipe4_serialshift
pipe4_mulshift
pipe4_barrelshift
pipe4_serialshift_2
pipe4_mulshift_stall_2
pipe4_barrelshift_2
pipe5_serialshift
pipe5_mulshift
pipe5_barrelshift
pipe5_serialshift_load
pipe5_mulshift_load
pipe5_mulshift_stall_loadpipe5_barrelshift_load
pipe7_serialshift
pipe7_mulshift
pipe7_barrelshift
pipe7_barrelshift_2
barrelshift_norise.nomul
barrelshift_minrise.nomulbarrelshift_highrise.nomulserialshift_norise.nomul
serialshift_lowrise.nomulserialshift_judicialrise.nomulserialshift_highrise.nomulmulshift_norise.nomul
mulshift_minrise.nomul
mulshift_highrise.nomul
serialshiftdatamem_norise.nomulpipe3_serialshift.nomul
pipe3_mulshift.nomul
pipe3_mulshift_stall.nomulpipe3_barrelshift.nomul
pipe4_serialshift.nomul
pipe4_mulshift.nomul
pipe4_barrelshift.nomul
pipe5_serialshift.nomul
pipe5_mulshift.nomul
pipe5_barrelshift.nomul
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08b
ub
ble
_so
rt
crc
de
s fft fir
qu
an
t
iqu
an
t
turb
o
vlc
bitc
nts
CR
C3
2
qso
rt
sha
stri
ng
sea
rch
FF
T
dijk
stra
pa
tric
ia go
l
dct
dh
ry
OV
ER
AL
L
Benchmark
Pe
rfo
rma
nc
e p
er
Un
it A
rea
(M
IPS
/LE
)
General Purpose
Application-tuned
serialshift_norise
serialshift_lowrise
serialshift_judicialrise
serialshift_highrise
mulshift_norise
mulshift_minrise
mulshift_highrise
barrelshift_norise
barrelshift_minrise
barrelshift_highrise
serialshiftdatamem_norisepipe3_serialshift
pipe3_mulshift
pipe3_barrelshift
pipe4_serialshift
pipe4_mulshift
pipe4_barrelshift
pipe4_serialshift_2
pipe4_mulshift_stall_2pipe4_barrelshift_2
pipe5_serialshift
pipe5_mulshift
pipe5_barrelshift
pipe5_serialshift_load
pipe5_mulshift_load
pipe5_mulshift_stall_loadpipe5_barrelshift_loadpipe7_serialshift
pipe7_mulshift
pipe7_barrelshift
pipe7_barrelshift_2
barrelshift_norise.nomulbarrelshift_minrise.nomulbarrelshift_highrise.nomulserialshift_norise.nomulserialshift_lowrise.nomulserialshift_judicialrise.nomulserialshift_highrise.nomulmulshift_norise.nomul
mulshift_minrise.nomulmulshift_highrise.nomulserialshiftdatamem_norise.nomulpipe3_serialshift.nomulpipe3_mulshift.nomul
pipe3_mulshift_stall.nomulpipe3_barrelshift.nomulpipe4_serialshift.nomulpipe4_mulshift.nomul
pipe4_barrelshift.nomulpipe5_serialshift.nomulpipe5_mulshift.nomul
pipe5_barrelshift.nomul
14.1%
32%
13
2. Instruction-set Subsetting
SPREE automatically removesUnused connectionsUnused components
Reduce processor by reducing the ISACan create application-specific processor
Eliminate unused parts of the ISA
14
Instruction-set Usage of Benchmarks
Applications do not use complete ISA
0%
25%
50%
75%
100%
bu
bb
le_
so
rt
crc
de
s fft
fir
qu
an
t
iqu
an
t
turb
o
vlc
bit
cn
ts
CR
C3
2
qs
ort
sh
a
str
ing
se
arc
h
FF
T
dijk
str
a
pa
tric
ia
go
l
dc
t
dh
ry
AV
ER
AG
E
Pe
rce
nt
of
ISA
Us
ed
Strong potential for hardware reduction
15
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
0.80
0.90
1.00
bubb
le_s
ort crc
des fft fir
quan
t
iqua
nt
turb
o vlc
bitc
nts
CR
C32
qsor
t
sha
strin
gsea
rch
FF
T_M
I
dijk
stra
patr
icia go
l
dct
dhry
AV
ER
AG
E
Benchmark
Fra
ctio
n o
f Are
aF
ract
ion
of
Are
aArea Reduction from Subsetting
Area reduced by 60% in some, 23% on average
23%
Similar reductions for energy, small impact on performance
16
3. Combining Application Tuning and Instruction-set Subsetting
Subsetting is effective on its own Can apply subsetting on top of tuning
Compare different customization methods1. Tuning
2. Subsetting
3. Tuning + Subsetting
17
Combining Application Tuning and Instruction-set Subsetting
0
10
20
30
40
50
60
General purpose Application-tuned General purpose +subsetted
Application-tuned +subsetted
Pe
rfo
rma
nc
e P
er
Are
a (
MIP
S/1
00
0L
Es
)
14% 16%25%
Tuning reduces the waste that subsetting eliminates
18
Summary of Presented Architectural Conclusions
Application tuning 14% average efficiency gainWill increase with more architectural axes
Instruction-set SubsettingUp to 60% area & energy savings16% average efficiency gain
Combined Tuning & Subsetting25% average efficiency gain
19
Future Work
Consider other promising architectural axes Branch prediction, aggressive forwarding ISA changes Datapaths (eg. VLIW) Caches and memory hierarchy
Compiler assistance Can improve tuning & subsetting