Frank Vahid, UC Riverside
1
Recent Results at UCR with Configurable Cache and Hw/Sw Partitioning
Frank Vahid, Associate Professor
Dept. of Computer Science and Engineering, University of California, Riverside
Also with the Center for Embedded Computer Systems at UC Irvine
http://www.cs.ucr.edu/~vahid
Trend Towards Pre-Fabricated Platforms: ASSPs

ASSP: application-specific standard product – a domain-specific pre-fabricated IC, e.g., a digital camera IC.
ASIC: application-specific IC – a unique IC design (design-start counts ignore quantity of the same IC).
ASSP revenue exceeds ASIC revenue, and ASSP design starts exceed ASIC design starts.
ASIC design starts are decreasing, due to the strong benefits of using pre-fabricated devices.
Source: Gartner/Dataquest, September '01
Will High-End ICs Still be Made?

YES – the point is that mainstream designers likely won't be making them.
They will be very high volume or very high cost products.
Platforms are one such product – high volume.
They need to be highly configurable to adapt to different applications and constraints.
[Chart: cost per IC vs. volume for mainstream design in 1990, 2000, and 2010; high-end ICs are becoming out of reach of mainstream designers.]
UCR Focus

Configurable Cache
Hardware/Software Partitioning
Configurable Cache: Why

[Figure: a pre-fabricated platform IC (a pre-designed system-level architecture) containing a uP with L1 cache, DSP, JPEG decoder, peripherals, and FPGA.]
ARM920T: caches consume half of total power (Segars '01).
M*CORE: the unified cache consumes half of total power (Lee/Moyer/Arends '99).
Best Cache for Embedded Systems?

Not clear – there is huge variety among popular embedded processors. What is the best associativity, line size, and total size?

Processor | I-cache: size, assoc., line | D-cache: size, assoc., line
AMD-K6-IIIE | 32K, 2, 32 | 32K, 2, 32
Alchemy AU1000 | 16K, 4, 32 | 16K, 4, 32
ARM 7 | 8K/U, 4, 16 | 8K/U, 4, 16
ColdFire | 0-32K, DM, 16 | 0-32K, N/A, N/A
Hitachi SH7750S (SH4) | 8K, DM, 32 | 16K, DM, 32
Hitachi SH7727 | 16K/U, 4, 16 | 16K/U, 4, 16
IBM PPC 750CX | 32K, 8, 32 | 32K, 8, 32
IBM PPC 7603 | 16K, 4, 32 | 16K, 4, 32
IBM 750FX | 32K, 8, 32 | 32K, 8, 32
IBM 403GCX | 16K, 2, 16 | 8K, 2, 16
IBM PowerPC 405CR | 16K, 2, 32 | 8K, 2, 32
Intel 960JA | 2K, 2, N/A | 1K, 2, N/A
Intel 960JD | 4K, 2, N/A | 2K, 2, N/A
Intel 960IT | 16K, 2, N/A | 4K, 2, N/A
Motorola MPC8240 | 16K, 4, 32 | 16K, 4, 32
Motorola MPC8540 | 32K, 4, 32/64 | 32K, 4, 32/64
Motorola MPC7455 | 32K, 8, 32 | 32K, 8, 32
NEC VR5500 | 32K, 2, 32 | 32K, 2, 32
NEC VR4131 | 16K, 2, 16/32 | 16K, 2, 16/32
NEC VR4181 | 4K, DM, 16 | 4K, DM, 16
NEC VR4181A | 8K, DM, 32 | 8K, DM, 32
NEC VR4121 | 16K, DM, 16 | 8K, DM, 16
PMC-Sierra RM9000X2 | 16K, 4, N/A | 16K, 4, N/A
PMC-Sierra RM7000A | 16K, 4, 32 | 16K, 4, 32
SandCraft sr71000 | 32K, 4, 32 | 32K, 4, 32
Sun UltraSPARC IIe | 16K, 2, N/A | 16K, DM, N/A
SuperH | 32K, 4, 32 | 32K, 4, 32
TI TMS320C6414 | 16K, DM, N/A | 16K, 2, N/A
TriMedia TM32A | 32K, 8, 64 | 16K, 8, 64
Xilinx Virtex-II Pro | 16K, 2, 32 | 8K, 2, 32
(DM = direct mapped; /U = unified; line sizes in bytes.)
Cache Associativity

Direct-mapped cache (1-way set associative): certain address bits "index" into the cache, and the remaining "tag" bits are compared against the stored tag.
[Figure: four addresses A-D with identical index bits conflict in the direct-mapped cache.]
Set-associative cache: multiple "ways" – fewer index bits and more tag bits, with simultaneous comparisons across the ways. More expensive, but a better hit rate.
[Figure: the same conflicting addresses C and D coexist in a 2-way set-associative cache.]
Cache Associativity

Associativity reduces the miss rate, thus improving performance. But what is its impact on power and energy? (Energy = Power × Time)
[Chart: miss rate (0-2%) vs. associativity (1, 2, 4 ways) for epic and mpeg2.]
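The effect charted here can be reproduced in miniature with a toy cache-set simulator. This is an illustrative sketch of ours, not code from the talk; the block addresses and the access pattern are invented:

```python
def simulate(accesses, num_sets, ways):
    """Count misses for a set-associative cache with LRU replacement.

    accesses: iterable of block addresses
    num_sets: number of sets (index = address % num_sets)
    ways:     associativity (blocks per set)
    """
    sets = [[] for _ in range(num_sets)]  # each set holds tags in LRU order
    misses = 0
    for addr in accesses:
        s = sets[addr % num_sets]
        tag = addr // num_sets
        if tag in s:
            s.remove(tag)          # hit: move tag to most-recently-used
        else:
            misses += 1
            if len(s) >= ways:
                s.pop(0)           # evict the least-recently-used tag
        s.append(tag)
    return misses

# Two blocks that map to the same set of a 4-set direct-mapped cache
# ping-pong and miss every time; the same capacity organized as a
# 2-way set-associative cache holds both and takes only cold misses.
pattern = [0, 4, 0, 4, 0, 4]
print(simulate(pattern, num_sets=4, ways=1))  # direct mapped: 6 misses
print(simulate(pattern, num_sets=2, ways=2))  # 2-way: 2 cold misses
```

Both configurations hold four blocks; only the set organization differs, which is exactly the conflict-miss effect behind the miss-rate curves.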
Associativity is Costly

Associativity improves the hit rate, but at the cost of more power per access.
Are the power savings from reduced misses outweighed by the increased power per hit?
[Figure: energy-per-access breakdown for an 8 Kbyte, 4-way set-associative cache (dynamic power only): decoders, wordlines, bitlines, sense amps, comparators, mux drivers, and output drivers for the data and tag arrays.]
[Chart: energy per access (nJ) for an 8 Kbyte cache at 1, 2, and 4 ways.]
Associativity and Energy

The best-performing cache is not always the lowest-energy cache.
[Charts: miss rate and normalized energy vs. associativity (1, 2, 4) for epic and mpeg2; one configuration shows significantly poorer energy despite its miss rate.]
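The trade-off behind these charts can be sketched with a first-order model in which total energy combines per-access energy and miss-penalty energy (Energy = Power × Time, per the previous slide). Every number below – per-access energies, miss rates, and the miss penalty – is an invented placeholder, not a measured value from these experiments:

```python
def cache_energy(accesses, miss_rate, e_access_nj, e_miss_nj):
    """Total dynamic energy (nJ): per-access energy plus miss-penalty energy."""
    return accesses * (e_access_nj + miss_rate * e_miss_nj)

ACCESSES = 1_000_000
E_MISS = 20.0  # nJ per miss (off-chip fetch), assumed

# associativity -> (per-access energy in nJ, miss rate), illustrative only
configs = {1: (0.40, 0.015), 2: (0.55, 0.005), 4: (0.90, 0.004)}

for ways, (e_acc, mr) in configs.items():
    e_mj = cache_energy(ACCESSES, mr, e_acc, E_MISS) / 1e6
    print(f"{ways}-way: miss rate {mr:.1%}, energy {e_mj:.2f} mJ")
# With these numbers the 4-way cache misses least, yet the 2-way cache uses
# the least energy: the 4-way's extra per-access cost (more tag compares and
# way reads) outweighs the few misses it avoids.
```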
Associativity Dilemma

Direct-mapped cache: good hit rate on most examples and low power per access, but a poor hit rate on some examples, where many misses cause high power.
Four-way set-associative cache: good hit rate on nearly all examples, but high power per access – overkill for most examples, thus wasting energy.
Dilemma: design for the average case or the worst case?
Associativity Dilemma

There is obviously no clear choice.
Previous work: Albonesi proposed a configurable cache with the ability to shut down ways to save dynamic power; Motorola's M*CORE provides this as well.
[Figure: the embedded-processor cache table from the "Best Cache for Embedded Systems?" slide, repeated.]
Our Solution: Way-Concatenatable Cache

The cache can be configured as 4, 2, or 1 way: ways can be concatenated.
[Figure: in concatenated mode, an address bit selects the way, so the previously conflicting blocks C and D coexist.]
Configurable Cache Design: Way Concatenation (4, 2, or 1 way)

[Figure: the cache data array with its configuration circuit. Configuration signals c0-c3, driven by registers reg0 and reg1 together with address bits a11 and a12, gate the bank decoders; the tag part, tag-address comparator, and mux drivers feed the data output, with the critical path through the tag compare. Address split: a31..a13 tag address, a12/a11, a10..a5 index, a4..a0 line offset.]
The configuration circuit adds only a small area and performance overhead.
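The address-field arithmetic implied by the figure can be sketched as follows. `cache_fields` is an illustrative helper of ours, not the paper's code; the sizes match the slide's 8 Kbyte cache with 32-byte lines:

```python
# Way concatenation views the same 8 KB of SRAM as 4, 2, or 1 way:
# index bits grow and tag bits shrink as ways are concatenated.

def cache_fields(total_bytes, ways, line_bytes, addr_bits=32):
    """Return (offset_bits, index_bits, tag_bits) for a set-associative cache.
    Sizes must be powers of two."""
    sets = total_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2 of the line size
    index_bits = sets.bit_length() - 1          # log2 of the number of sets
    return offset_bits, index_bits, addr_bits - index_bits - offset_bits

# The slide's 8 KB cache with 32-byte lines:
for ways in (4, 2, 1):
    off, idx, tag = cache_fields(8 * 1024, ways, 32)
    print(f"{ways}-way: {off} offset bits, {idx} index bits, {tag} tag bits")
```

In 4-way mode the index is 6 bits (a10..a5); concatenating to 2 and then 1 way pulls a11 and a12 into the index, which is exactly what the configuration signals select.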
Way Concatenate Experiments

Experiment: the Motorola PowerStone benchmark g3fax, considering dynamic power only (L1 access energy, CPU stall energy, and memory access energy).
Way concatenation outperforms both the 4-way and direct-mapped caches, and is just as good as way shutdown.
[Chart: total energy for each cache configuration of g3fax.]
Way Concatenate Experiments

We considered 23 programs (PowerStone, MediaBench, and SPEC2000), dynamic power only (L1 access energy, CPU stall energy, memory access energy).
Way concatenation is better than way shutdown (due to a smaller performance penalty), saves energy over a conventional 4-way cache, and also avoids the big penalties a 1-way cache suffers on some programs.
[Chart: energy normalized to a conventional 4-way cache (100%) for each benchmark (padpcm, crc, auto2, bcnt, bilv, binary, blit, brev, g3fax, fir, pjepg, ucbqsort, v42, adpcm, epic, jpeg, mpeg2, pegwit, g721, art, mcf, parser, vpr) and the average, under four configurations: conventional I1/D1 (Cnv I1D1), concatenate (cnct), shutdown (shut), and both; three bars clipped at 111%, 113%, and 289%.]
Way Concatenate Experiments

The best configuration varies, so the configuration must be tuned to a given program.

Example: best configuration
padpcm: I8KD8KI1D1
crc: I4KD4KI1D1
auto2: I8KD4KI1D1
bcnt: I8KD2KI1D1
bilv: I4KD4KI1D1
binary: I8KD2KI1D1
blit: I2KD8KI1D1
brev: I8KD4KI1D2
g3fax: I4KD4KI1D1
fir: I8KD2KI1D1
pjepg: I4KD8KI1D1
pegwit: I8KD8KI1D1
ucbqsort: I4KD4KI1D1
v42: I8KD8KI1D1
adpcm: I2KD8KI1D1
epic: I8KD8KI1D1
jpeg: I8KD8KI4D2
mpeg2: I8KD8KI1D2
g721: I8KD8KI2D1
art: I4KD8KI1D1
mcf: I8KD8KI1D1
parser: I8KD8K41D1
vpr: I8KD8KI2D1
Normalized Execution Times

[Chart: execution time normalized to a conventional 4-way cache for each benchmark and the average, under Cnv I1D1, cnct, shut, and both; three bars clipped at 122%, 245%, and 121%.]
Way shutdown suffers a performance penalty, as does the direct-mapped configuration.
Way concatenation has almost no performance penalty, though its critical path is 3% longer than a conventional 4-way cache's.
Way Shutdown for Static Power Savings

Albonesi and Motorola used logic to gate the clock, which reduces dynamic power but not static (leakage) power.
Way concatenation is clearly superior for reducing dynamic power, but shutting down ways is still useful for saving static power.
For shutdown we use another method (Agarwal's DRG-cache).
[Figure: an SRAM cell with its bitlines and Vdd, whose path to ground is cut by a Gated-Vdd control transistor.]
Way Concatenate Plus Way Shutdown

We set static power to 30% of dynamic power. Way shutdown is now preferred in many examples, but way concatenation is still very helpful.
[Chart: normalized energy per benchmark and the average for Cnv I1D1, cnct, shut, and both; three bars clipped at 114%, 268%, and 116%.]
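The accounting behind this experiment can be sketched with a simple model in which static energy is 30% of the full cache's dynamic energy and scales with the fraction of SRAM left powered on. The function and all numbers here are illustrative assumptions of ours, not the paper's model or measurements:

```python
def total_energy(dyn, active_fraction, dyn_full=1.0, static_ratio=0.30):
    """dyn: dynamic energy of this configuration (normalized, assumed).
    active_fraction: fraction of the cache SRAM still powered on.
    Static energy = static_ratio * dyn_full, scaled by the SRAM left on."""
    return dyn + static_ratio * dyn_full * active_fraction

# A program that only needs 1 of the 4 ways (normalized energies, assumed):
dyn_cnct = 0.6   # way concatenation cuts dynamic energy per access
print(f"{total_energy(dyn_cnct, 1.0):.3f}")    # concatenate only: leakage unchanged
print(f"{total_energy(dyn_cnct, 0.25):.3f}")   # concatenate + shut down 3 of 4 ways
# Once leakage matters, shutdown buys savings on top of concatenation,
# consistent with the slide's conclusion that both techniques help.
```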
Configurable Line Size Too

The best line size also differs per example. Our cache can be configured for a line size of 16, 32, or 64 bytes; 64 is usually best, but 16 is much better in a couple of cases.
[Chart: energy normalized to a conventional 4-way cache (100%) per benchmark for csb16, csb32, csb64, cnv4w32, and cnv1w32; several bars clipped, ranging from 119% to 230%.]
csb: concatenate-plus-shutdown cache
Configurable Cache

A configurable cache with way concatenation, way shutdown, and a variable line size can save a lot of energy.
It is well suited to configurable devices like Triscend's.
UCR Focus

Configurable Cache
Hardware/Software Partitioning
Using On-Chip FPGA to Reduce Sw Energy

Hennessy/Patterson: "The best way to save power is to have less hardware" (p. 392).
Actually, the best way is to have less ACTIVE hardware.
Paradoxically, MORE hardware can actually REDUCE power, as long as overall activity is reduced.
How?
Using On-Chip FPGA to Reduce Sw Energy

[Figure: the pre-fabricated platform IC (uP with L1 cache, DSP, JPEG decoder, peripherals, FPGA).]
Move critical software loops to the FPGA: the loop then executes in 1/10th the time.
Use the time saved to power down the system longer during each task period; alternatively, slow down the microprocessor using voltage scaling.
[Timeline: software-only, the uP is active for most of the task period, then idle; partitioned, the uP and FPGA finish early and the system idles longer.]
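The power-down argument can be sketched numerically. Every power level and time below is an assumed placeholder of ours, not a measurement from the talk:

```python
# Energy over one task period, software-only vs. hw/sw partitioned.
period = 20e-3                                 # task period: 20 ms (assumed)
p_cpu, p_fpga, p_idle = 200e-3, 80e-3, 5e-3    # watts (assumed)

# All-software: 10 ms of CPU work per period, 8 ms of it in one hot loop.
e_sw = p_cpu * 10e-3 + p_idle * (period - 10e-3)

# Partitioned: the loop runs on the FPGA in 1/10th the time, the CPU still
# does the remaining 2 ms, and the system idles for the rest of the period.
loop_hw = 8e-3 / 10
e_hwsw = (p_cpu * 2e-3 + p_fpga * loop_hw
          + p_idle * (period - 2e-3 - loop_hw))

print(f"energy saved per period: {1 - e_hwsw / e_sw:.0%}")
```

More hardware, but far less active time, so total energy per period drops – the paradox the previous slide describes.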
The 90-10 Rule (or 80-20 Rule)

Most software time is spent in a few small loops – e.g., in the MediaBench and NetBench benchmarks.
This is known as the 90-10 rule: 10% of the code accounts for 90% of the execution time.
Move those loops to the FPGA.
[Charts: percent of execution time vs. loop number (1-10) for g721, adpcm, pegwit, dh, md5, and tl.]
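The 90-10 shape can be checked from a loop profile with a few lines; the per-loop times below are invented for illustration:

```python
def cumulative_coverage(loop_times):
    """Cumulative execution-time fractions, hottest loop first."""
    total = sum(loop_times)
    out, running = [], 0.0
    for t in sorted(loop_times, reverse=True):
        running += t
        out.append(running / total)
    return out

profile = [450, 250, 120, 80, 40, 30, 15, 10, 3, 2]   # made-up loop times
cov = cumulative_coverage(profile)
print(["%.0f%%" % (c * 100) for c in cov])
# A handful of loops cover nearly all the time, so moving only those few
# loops to the FPGA captures most of the benefit.
```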
Hardware/Software Partitioning Results

Example | Archit. | Cycles_orig | Cycles_sw | Cycles_hw | Loop Sp. | Clk_hw | Total Sp. | P_sw | P_hw | E_orig | E_sw/hw | E Sav.
PS_g3fax | 8051 | 19,675,456 | 10,812,544 | 176,562 | 61 | 25 | 2.2 | 0.05 | 0.032 | 0.1142 | 0.05408 | 53%
PS_crc | 8051 | 291,196 | 180,224 | 7,168 | 25 | 25 | 2.5 | 0.05 | 0.028 | 0.0017 | 0.00071 | 58%
PS_summin | 8051 | 109,821,892 | 20,394,080 | 384,416 | 53 | 25 | 1.2 | 0.05 | 0.033 | 0.6376 | 0.53657 | 16%
PS_brev | 8051 | 330,064 | 305,768 | 1,360 | 225 | 25 | 12.9 | 0.05 | 0.034 | 0.0019 | 0.00015 | 92%
PS_matmul | 8051 | 119,420 | 101,576 | 2,560 | 40 | 25 | 5.9 | 0.05 | 0.035 | 0.0007 | 0.00012 | 82%
PS_g3fax | MIPS | 15,600,000 | 4,720,000 | 599,000 | 8 | 100 | 1.4 | 0.07 | 0.111 | 0.0265 | 0.02163 | 18%
PS_adpcm | MIPS | 113,000 | 29,300 | 5,440 | 5 | 100 | 1.3 | 0.07 | 0.181 | 0.0002 | 0.00018 | 6%
PS_crc | MIPS | 5,040,000 | 3,480,000 | 460,800 | 8 | 100 | 2.5 | 0.07 | 0.061 | 0.0086 | 0.00379 | 56%
PS_des | MIPS | 142,000 | 70,700 | 15,100 | 5 | 100 | 1.6 | 0.07 | 0.197 | 0.0002 | 0.00019 | 20%
PS_engine | MIPS | 915,000 | 145,000 | 28,100 | 5 | 100 | 1.1 | 0.07 | 0.082 | 0.0016 | 0.00146 | 6%
PS_jpeg | MIPS | 7,900,000 | 646,000 | 171,000 | 4 | 100 | 1.1 | 0.07 | 0.092 | 0.0134 | 0.01360 | -1%
PS_summin | MIPS | 2,920,000 | 1,270,000 | 266,000 | 5 | 100 | 1.5 | 0.07 | 0.111 | 0.0050 | 0.00375 | 24%
PS_v42 | MIPS | 3,850,000 | 846,000 | 216,000 | 4 | 100 | 1.2 | 0.07 | 0.102 | 0.0065 | 0.00605 | 7%
PS_brev | MIPS | 3,566 | 2,499 | 138 | 18 | 100 | 3.0 | 0.07 | 0.107 | 0.0000 | 0.00000 | 62%
MB_g721 | MIPS | 838,230,002 | 457,674,179 | 9,985,261 | 46 | 100 | 2.1 | 0.07 | 0.152 | 1.4250 | 0.75035 | 47%
MB_adpcm | MIPS | 32,894,094 | 32,866,110 | 1,183,260 | 28 | 42 | 11.6 | 0.07 | 0.130 | 0.0559 | 0.00821 | 85%
MB_pegwit | MIPS | 42,752,919 | 33,276,287 | 2,167,651 | 15 | 50 | 3.1 | 0.07 | 0.170 | 0.0727 | 0.03241 | 55%
NB_dh | MIPS | 1,793,032,157 | 1,349,063,192 | 45,156,767 | 30 | 69 | 3.5 | 0.07 | 0.121 | 3.0482 | 1.00547 | 67%
NB_md5 | MIPS | 5,374,034 | 3,046,881 | 289,877 | 11 | 47 | 1.8 | 0.07 | 0.251 | 0.0091 | 0.00722 | 21%
NB_tl | MIPS | 57,412,470 | 29,244,221 | 2,479,552 | 12 | 58 | 1.8 | 0.07 | 0.059 | 0.0976 | 0.05930 | 39%
Averages: loop speedup 30, total speedup 3.2, energy savings 34%.

A speedup of 3.2 and energy savings of 34%, obtained with only 10,500 gates on average. (Simulation-based.)
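The table's last column follows directly from its two energy columns. `energy_savings` is a hypothetical helper of ours; the two rows below are taken from the table:

```python
def energy_savings(e_orig, e_hwsw):
    """Fractional energy saved by the partitioned (sw/hw) version."""
    return 1.0 - e_hwsw / e_orig

# (E_orig, E_sw/hw) pairs from the results table:
rows = {"PS_crc (8051)": (0.0017, 0.00071),    # table reports 58%
        "MB_adpcm (MIPS)": (0.0559, 0.00821)}  # table reports 85%
for name, (eo, eh) in rows.items():
    print(f"{name}: {energy_savings(eo, eh):.0%} saved")
```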
Analysis of Ideal Speedup

Assume each loop is 10x faster in hardware (an average based on our observations).
Notice the leveling off after the first couple of loops (due to the 90-10 rule): most of the speedup comes from the first few loops.
Good for us – a moderate amount of FPGA gives most of the speedup.
How much FPGA?
[Charts: ideal speedup vs. number of loops (0-10) moved to hardware for g721, adpcm, pegwit, dh, md5, tl, url, and the average.]
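The leveling-off follows from Amdahl's law once the per-loop hardware speedup (10x, per the slide) is fixed. The per-loop time fractions below are invented for illustration:

```python
def amdahl_speedup(moved_fraction, loop_speedup=10.0):
    """Overall speedup when moved_fraction of time runs loop_speedup x faster."""
    return 1.0 / ((1.0 - moved_fraction) + moved_fraction / loop_speedup)

loop_fracs = [0.60, 0.20, 0.05, 0.03, 0.02]   # hottest first, made up
moved = 0.0
for k, f in enumerate(loop_fracs, start=1):
    moved += f
    print(f"top {k} loops ({moved:.0%} of time): {amdahl_speedup(moved):.2f}x")
# The gain from each additional loop shrinks quickly -- consistent with the
# observation that the first few loops provide most of the speedup.
```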
Speedup Gained with Relatively Few Gates

We manually created several partitioned versions of each benchmark. Most of the speedup is gained with the first 20,000 gates – surprisingly few.
[Chart: speedup vs. gates (0-25,000) for G721 (MB), ADPCM (MB), PEGWIT (MB), DH (NB), MD5 (NB), TL (NB), and URL (NB); one point clipped at 27.2, and one benchmark reaches 2.05x at 90,000 gates.]
Stitt, Grattan, and Vahid, Field-Programmable Custom Computing Machines (FCCM), 2002.
Stitt and Vahid, IEEE Design & Test, Dec. 2002.
J. Villarreal, D. Suresh, G. Stitt, F. Vahid, and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear).
Impact of Microprocessor/FPGA Clock Ratio

The previous data assumed equal clock frequencies, but a faster microprocessor has a significant impact; we analyzed 1:1, 2:1, 3:1, 4:1, and 5:1 ratios.
We are planning additional such analyses: memory bandwidth, power ratios, and more.
[Charts: speedup vs. loop (0-10) for g721, adpcm, pegwit, dh, md5, and tl at microprocessor:FPGA clock ratios of 1:1, 2:1, 3:1, 4:1, and 5:1.]
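The clock-ratio effect can be sketched by charging each hardware cycle r software-cycle units of time. The cycle counts below are illustrative, not from the experiments:

```python
def speedup(cyc_orig, cyc_sw_rest, cyc_hw, clock_ratio):
    """clock_ratio = f_cpu / f_fpga (1.0 means equal clocks)."""
    t_hwsw = cyc_sw_rest + cyc_hw * clock_ratio   # time in CPU-cycle units
    return cyc_orig / t_hwsw

# A loop that was 90% of a 1,000,000-cycle run, using 10x fewer cycles in hw:
for r in (1, 2, 3, 4, 5):
    print(f"{r}:1 clock ratio -> {speedup(1_000_000, 100_000, 90_000, r):.2f}x")
```

As the microprocessor clock pulls ahead of the FPGA clock, the same hardware loop buys less and less wall-clock time, which is why the curves flatten at higher ratios.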
Software Improvements using On-Chip Configurable Logic – Verified through Physical Measurement

We performed physical measurements on Triscend A7 and E5 devices, with results similar to simulation (even a bit better).

A7 results:
Benchmark | Time_orig | Time_sw/hw | Sp. | P_orig | P_sw/hw | E_orig | E_sw/hw | E Sav.
PS_g3fax | 11.47 | 7.44 | 1.5 | 1.320 | 1.332 | 15.140 | 9.910 | 35%
PS_crc | 10.92 | 4.51 | 2.4 | 1.320 | 1.320 | 14.414 | 5.953 | 59%
PS_brev | 9.84 | 3.28 | 3.0 | 1.332 | 1.344 | 13.107 | 4.408 | 66%
Average speedup: 2.3; average energy savings: 53%.

E5 results:
Benchmark | Time_orig | Time_sw/hw | Sp. | P_orig | P_sw/hw | E_orig | E_sw/hw | E Sav.
PS_g3fax | 15.16 | 7.11 | 2.1 | 0.252 | 0.270 | 3.820 | 1.920 | 50%
PS_crc | 10.64 | 4.64 | 2.3 | 0.207 | 0.225 | 2.202 | 1.044 | 53%
PS_brev | 17.81 | 1.81 | 9.8 | 0.252 | 0.270 | 4.488 | 0.489 | 89%
Average speedup: 4.8; average energy savings: 64%.

[Photo: Triscend A7 development board with the A7 IC.]
Other Research Directions: Tiny Caches

Impact of tiny caches on instruction-fetch power: filter caches, dynamic loop caches, and preloaded loop caches.
Gordon-Ross, Cotterell, Vahid, Computer Architecture Letters, 2002.
Gordon-Ross, Vahid, ICCD 2002.
Cotterell, Vahid, ISSS 2002 and ICCAD 2002.
Gordon-Ross, Cotterell, Vahid, IEEE TECS, 2002.
[Figure: a loop cache sits alongside the L1 cache/instruction memory, feeding the processor through a mux.]
[Chart: percent instruction-fetch energy savings per benchmark (adpcm, bcnt, binary, blit, compress, crc, des, engine, fir, g3fax, jpeg, summin, ucbqsort, v42, and the average) for two loop-cache configurations.]
Other Research Directions: Platform-Based CAD

Use the physical platform itself to aid the search of the configuration space: configure the cache and the hw/sw partition, then configure, execute, and measure.
Goal: define the best cooperation between desktop CAD and the platform.
NSF grant, 2002-2005 (with N. Dutt at UC Irvine).
Other Research Directions: Dynamic Hw/Sw Partitioning

My favorite. Add an on-chip component that:
detects the most frequent software loops,
decompiles a loop,
performs compiler optimizations,
synthesizes it to a netlist,
places and routes the netlist onto the FPGA, and
updates the software to call the FPGA.
The result is a self-improving IC. It can be invisible to the designer, appearing simply as an efficient processor, and it can also dynamically tune the cache configuration.
[Figure: processor with I$ and D$, a profiler, DMA, and memory, plus configurable logic with its own processor and memory to perform the on-chip partitioning.]
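The first step, detecting the most frequent loops, is commonly done by counting taken backward branches, whose targets mark loop headers. This is a hypothetical software sketch of that idea, not the actual on-chip profiler; the trace format is invented:

```python
from collections import Counter

def hot_loops(branch_trace, top=3):
    """branch_trace: (source_pc, target_pc) pairs of taken branches.
    A backward branch (target <= source) signals a loop; count per target."""
    counts = Counter(tgt for src, tgt in branch_trace if tgt <= src)
    return counts.most_common(top)

# Made-up trace: a loop at 0x100 iterating often, one at 0x400 rarely,
# and one forward branch that is correctly ignored.
trace = [(0x120, 0x100)] * 500 + [(0x430, 0x400)] * 20 + [(0x200, 0x300)]
print(hot_loops(trace))  # loop headers the hardware would pick to synthesize
```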
Current Researchers Working in Embedded Systems at UCR

Prof. Frank Vahid: 5 Ph.D. students, 2 M.S.
Prof. Walid Najjar: 3 Ph.D. students, 1 M.S., working on hw/sw partitioning and on compiling C to FPGAs.
Prof. Tom Payne: 1 Ph.D. student, working on compiling C to FPGAs.
Prof. Jun Yang (new hire): working on low-power architectures (frequent value detection).
Prof. Harry Hsieh: 2 Ph.D. students, working on formal verification of system models.
Prof. Sheldon Tan (new hire): 1 Ph.D. student, working on physical design and analog synthesis.
Conclusions

Highly configurable platforms have a bright future: the cost equations just don't justify ASIC production as much as before. Triscend parts are well situated; close collaboration is desired.
A configurable cache improves memory energy, and tuning to a particular program is CRUCIAL for low energy: way concatenation is effective at reducing dynamic power, way shutdown saves static power, and a variable line size reduces memory traffic – but all must be tuned to the particular program.
Configurable logic improves software energy, without requiring excessive amounts of hardware.
Many exciting avenues to investigate!