Roman Lysecky University of California, Riverside 1
Pre-fetching for Improved Core Interfacing
Roman Lysecky, Frank Vahid, Tony Givargis, & Rilesh Patel
Department of Computer Science
University of California
Riverside, CA 92521
{rlysecky, vahid, givargis, rrpatel}@cs.ucr.edu
This work was supported in part by the NSF and a DAC scholarship.
Introduction
[Figure: a core library; labels: MIPS, MEM, Cache, DSP, DMA, Core X, Core Y]
• Core-based designs are becoming common
– Available as both soft and hard
• Problem - How can interfacing be simplified to ease integration?
Introduction
• One Solution - One standard on-chip bus
– All cores have same interface
– Appears to be unlikely (VSIA)
• Another Solution - Divide core into a bus wrapper and internal parts
– Rowson and Sangiovanni-Vincentelli ’97 - Interface-Based Design
– VSIA developing standard for interface between wrapper and internals
• Far simpler than standard on-chip bus
– Refer to bus wrapper as an interface module (IM)
[Figure labels: standard interface, any bus, IM, internals, standard bus]
Introduction
• Problem - Using an Interface Module can result in extra cycles for reads
• Pre-fetching can reduce or eliminate extra cycles
• Outline
– Interfacing Options
– Classification of registers and common register occurrences
– Architecture of IM and pre-fetch heuristics
– Experiments
– Conclusions
No Interface Module (IM)
• Interface logic is designed as part of the core’s internal logic
• Pros
– Small Size
– High Performance (No Overhead)
• Cons
– May be hard to integrate with different busses
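[Figure: timing diagram with signals clk, rd, addr, data showing a 2-cycle read of D; block diagram of a core holding register D, with ports data, addr, rd, wr]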
Separating a Core into IM & Internals
• Interface module is separate from core internals
– Standard bus between IM and internals
• Pros
– Easily integrate with different busses
– Any changes are restricted to the IM
• Cons
– May incur performance overhead due to the interface module
– Possible increases in size and power
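[Figure: timing diagram with signals clk, rd, addr, data, i_rd, i_addr, i_data; a read of D through the IM takes 4 cycles total, 2 of them overhead. Block diagram: the IM (ports data, addr, rd, wr) connects to the core internals, which hold D, through i_wr, i_addr, i_data, i_rd]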
Proposed Solution - Pre-fetching in IM
• Pre-fetching
– Analogous to caching: store local copies of registers inside the interface module
– Enables quick response time
– Eliminates extra cycles for register reads
– Transparent to system bus and core internals
• Pros
– Easily integrate with different busses
– No performance overhead
• Cons
– Possible increases in size and power
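[Figure: timing diagram with signals clk, rd, addr, data, i_rd, i_addr, i_data; D is pre-fetched into D’ and a bus read completes in 2 cycles. Block diagram: the IM (ports data, addr, rd, wr) holds the copy D’ and connects to the core internals, which hold D, through i_wr, i_addr, i_data, i_rd]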
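The cycle counts on the three interfacing slides can be compared with a small back-of-the-envelope model. This is only a sketch: the 2-cycle and 4-cycle figures come from the timing diagrams, while the hit-rate parameter and the assumption that a pre-fetch miss costs the same as a read through an IM without pre-fetching are illustrative and not taken from the slides.

```python
# Read latencies in cycles, as labeled on the timing diagrams.
NO_IM_READ = 2       # interface logic built into the core
IM_READ_NO_PF = 4    # IM without pre-fetching (2 cycles of overhead)
IM_READ_PF_HIT = 2   # IM with pre-fetching, value already held in the IM


def average_read_cycles(hit_rate, miss_cycles=IM_READ_NO_PF):
    """Expected cycles per read through an IM with pre-fetching.

    Assumption (not from the slides): a read that misses the pre-fetch
    register costs `miss_cycles`, defaulting to the no-pre-fetch latency.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * IM_READ_PF_HIT + (1.0 - hit_rate) * miss_cycles


for rate in (0.0, 0.5, 1.0):
    print(f"hit rate {rate:.1f}: {average_read_cycles(rate):.1f} cycles per read")
```

Under these assumptions the IM matches the no-IM latency only when every read hits, which is why the pre-fetch heuristics aim to maximize hits.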
Classification of Core Registers
• Different registers need different pre-fetch schemes
• Need classification for registers
– Update Type
– Access Type
– Notification Type
– Structure Type
Common Register Types
• We identified three common register combinations found in cores
– Configuration, Task, and Input-buffered registers
– Implemented cores representative of each of these three common register combinations
– Provide classification for registers in each of the cores
Common Register Types
• Core1 - Configuration Registers
– Example: Configuration registers in a UART or DMA Controller
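[Figure: Core1 block diagram. The core internals hold the Configuration Register (D); the IM contains a Controller (ports data, addr, rd, wr) and the pre-fetch register D'; signals shown: e, ld]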
Common Register Types
• Core2 - Task Registers
– Example: JPEG or MPEG CODEC, or DES Encryption
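[Figure: Core2 block diagram. The core internals hold the Data Input Register (DI), Data Output Register (DO), and Status Register (S); the IM contains a Controller (ports data, addr, rd, wr), a PFU, and the pre-fetch registers DO' and S'; signals shown: rd, writing, e1, e2, ld1, ld2]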
Common Register Types
• Core3 - Input-buffered Registers
– Example: FIFO or UART
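[Figure: Core3 block diagram. The core internals hold the Status Register (S) and Data Register (D); the IM contains a Controller (ports data, addr, rd, wr), a PFU, and the pre-fetch registers D' and S'; signals shown: rd, e1, e2, ld1, ld2]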
Architecture of IM
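[Figure: IM architecture, shown for the task-register core. The IM contains a Controller (ports data, addr, rd, wr), a PFU, and the pre-fetch registers DO' and S'; the core internals hold DI, DO, and S; signals shown: rd, writing, e1, e2, ld1, ld2]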
Pre-fetch Unit - Implements the pre-fetching heuristic
Goal: maximize the number of hits
Controller - Interfaces to system bus
Pre-fetch Heuristic for Core2
[Figure: timing diagram with signals clk, wr, i_rd, i_addr, i_data, addr, rd, data; labels: 2 cycles, D, S, S', D']
• Core2 - Task Register
– After system writes to register DI
• Read S into pre-fetch register S’
• When S indicates completion, read DO from core into pre-fetch register DO’
– Repeat this process
• Similar heuristics were developed for Core1 and Core3
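The Core2 heuristic above can be sketched as a small behavioral model. This is a toy simulation, not the authors' hardware: the class names, the BUSY/DONE status encoding, the fixed task latency, and the "DI + 1" result are all hypothetical stand-ins. Only the sequence (write DI, read S into S', copy DO into DO' on completion) follows the slide.

```python
BUSY, DONE = 0, 1  # hypothetical encodings of the status register S


class Core2Internals:
    """Hypothetical task core: DO becomes valid `latency` ticks after a DI write."""

    def __init__(self, latency=3):
        if latency < 1:
            raise ValueError("latency must be at least 1")
        self.latency = latency
        self.di = 0
        self.do = 0
        self.remaining = 0  # ticks left until the current task completes

    def write_di(self, value):
        self.di = value
        self.remaining = self.latency

    def tick(self):
        if self.remaining > 0:
            self.remaining -= 1
            if self.remaining == 0:
                self.do = self.di + 1  # placeholder for the real task

    def read_status(self):
        return DONE if self.remaining == 0 else BUSY

    def read_do(self):
        return self.do


class InterfaceModule:
    """IM with pre-fetch registers S' and DO' (here s_copy and do_copy)."""

    def __init__(self, core):
        self.core = core
        self.s_copy = core.read_status()  # S'; an idle core reports DONE in this toy model
        self.do_copy = core.read_do()     # DO'
        self.polling = False

    def bus_write_di(self, value):
        # A write to DI starts the heuristic: read S into S'.
        self.core.write_di(value)
        self.s_copy = self.core.read_status()
        self.polling = True

    def tick(self):
        self.core.tick()
        if self.polling:
            self.s_copy = self.core.read_status()
            if self.s_copy == DONE:
                # S indicates completion: read DO into DO' and stop polling.
                self.do_copy = self.core.read_do()
                self.polling = False

    def bus_read_status(self):
        return self.s_copy   # served from S' without accessing the core

    def bus_read_do(self):
        return self.do_copy  # served from DO' without accessing the core


im = InterfaceModule(Core2Internals(latency=3))
im.bus_write_di(41)
while im.bus_read_status() == BUSY:
    im.tick()
print(im.bus_read_do())
```

In this model every bus read is answered from a pre-fetch register, so the system bus never waits on the internal bus; a real pre-fetch unit would also have to arbitrate between pre-fetches and pass-through writes.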
Experiments - Area (Gates)
[Bar chart: area in gates (axis 0 to 14000) for Core1, Core2, Core3; series: No IM, IM w/o PF, IM w/ PF]
Note: To better evaluate the effects of IM’s, our cores were kept simple, thus resulting in a smaller than normal size.
Average increase of IM w/o PF over no IM of 1.4K gates
Average increase of IM w/ PF over IM w/o PF of 1.3K gates
Experiments - Performance (ns)
[Bar chart: performance in ns (axis 0 to 10000) for Core1, Core2, Core3; series: No IM, IM w/o PF, IM w/ PF]
Experiments - Energy (nJ)
[Bar chart: energy in nJ (axis 0 to 14) for Core1, Core2, Core3; series: No IM, IM w/o PF, IM w/ PF]
Digital Camera Peripheral Read Access (cycles)
[Bar chart: read access in cycles (axis 0 to 1200) for CODEC Status, CODEC Data, CCD Status, CCD Data; series: IM w/o PF, IM w/ PF]
12% of execution time for peripheral reads
50% decrease in peripheral read access
25% decrease in overall peripheral access
3.2% improvement in overall system performance
Conclusion
• Separating interface from internals eases core integration but may increase read cycles
• Pre-fetching eliminated the performance degradation in common cases
– Increases in size and power were acceptable
– Transparent to system bus and core internals
– Pre-fetching thus improves the marketability of cores