Pre-fetching for Improved Core Interfacing

Roman Lysecky, Frank Vahid, Tony Givargis, & Rilesh Patel

Department of Computer Science

University of California

Riverside, CA 92521

{rlysecky, vahid, givargis, rrpatel}@cs.ucr.edu

This work was supported in part by the NSF and a DAC scholarship.


Introduction

[Figure: core-based system; labeled blocks: Core Library, MIPS, MEM, Cache, DSP, DMA, Core X, Core Y]

• Core-based designs are becoming common

– Cores are available as both soft and hard cores

• Problem - How can interfacing be simplified to ease integration?


Introduction

• One Solution - One standard on-chip bus

– All cores have the same interface

– Appears to be unlikely (VSIA)

• Another Solution - Divide core into a bus wrapper and internal parts

– Rowson and Sangiovanni-Vincentelli ‘97 - Interface-Based Design

– VSIA developing standard for interface between wrapper and internals

• Far simpler than a standard on-chip bus

– Refer to bus wrapper as an interface module (IM)

[Figure: two options; a core with a standard interface on a standard bus, and core internals with a standard interface to an IM that connects to any bus]


Introduction

• Problem - Using an interface module can result in extra cycles for reads

• Pre-fetching can reduce or eliminate the extra cycles

• Outline

– Interfacing options

– Classification of registers and common register combinations

– Architecture of IM and pre-fetch heuristics

– Experiments

– Conclusions


No Interface Module (IM)

• Interface logic is designed as part of the core’s internal logic

• Pros

– Small size

– High performance (no overhead)

• Cons

– May be hard to integrate with different busses

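[Figure: timing diagram (signals clk, rd, addr, data) of a read with no IM, completing in 2 cycles; block diagram of the core with register D and bus signals data, addr, rd, wr]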


Separating a Core into IM & Internals

• Interface module is separate from the core internals

– Standard bus between IM and internals

• Pros

– Easily integrate with different busses

– Any changes are restricted to the IM

• Cons

– May incur performance overhead due to the interface module

– Possible increases in size and power

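[Figure: timing diagram (signals clk, rd, addr, data, i_rd, i_addr, i_data) of a read through the IM: 4 cycles total, of which 2 cycles are overhead; block diagram of the IM between the bus signals (data, addr, rd, wr) and the core internals (i_data, i_addr, i_rd, i_wr), which hold register D]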


Proposed Solution - Pre-fetching in IM

• Pre-fetching

– Analogous to caching: store local copies of registers inside the interface module

– Enables quick response time

– Eliminates extra cycles for register reads

– Transparent to system bus and core internals

• Pros

– Easily integrate with different busses

– No performance overhead

• Cons

– Possible increases in size and power

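[Figure: timing diagram (signals clk, rd, addr, data, i_rd, i_addr, i_data) in which D is pre-fetched into D’ so that the bus read completes in 2 cycles; block diagram of the IM holding D’ between the bus signals (data, addr, rd, wr) and the core internals (i_data, i_addr, i_rd, i_wr), which hold D]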
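The cycle counts on the last three slides can be summarized in a small model. This is only an illustrative sketch: the class `InterfaceModule`, its method names, and the two constants are hypothetical names chosen here, and the only figures taken from the slides are the 2-cycle bus read and the 2 cycles of IM overhead.

```python
from dataclasses import dataclass, field

BUS_READ_CYCLES = 2       # system-bus read, per the timing diagrams
INTERNAL_READ_CYCLES = 2  # IM-to-internals read, the "2 cycles overhead"


@dataclass
class InterfaceModule:
    """Toy model of an IM that may hold pre-fetched copies of core registers."""
    core_regs: dict                                  # registers inside the core internals
    prefetched: dict = field(default_factory=dict)   # local copies (D') inside the IM

    def prefetch(self, name):
        # Assumed to happen in the background, so it costs no system-bus cycles.
        self.prefetched[name] = self.core_regs[name]

    def read(self, name):
        """Return (value, cycles seen on the system bus)."""
        if name in self.prefetched:
            return self.prefetched[name], BUS_READ_CYCLES
        return self.core_regs[name], BUS_READ_CYCLES + INTERNAL_READ_CYCLES


im = InterfaceModule(core_regs={"D": 0x5A})
print(im.read("D"))   # -> (90, 4): no local copy, read goes to the internals
im.prefetch("D")
print(im.read("D"))   # -> (90, 2): served from the pre-fetched copy
```

The model ignores stale copies on purpose; deciding when to refresh a copy is the job of the pre-fetch heuristic chosen for each register type.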


Classification of Core Registers

• Different registers need different pre-fetch schemes

• Need classification for registers

– Update Type

– Access Type

– Notification Type

– Structure Type


Common Register Types

• We identified three common register combinations found in cores

– Configuration, Task, and Input-buffered registers

– Implemented cores representative of each of these three common register combinations

– Provide classification for registers in each of the cores



• Core1 - Configuration Registers

– Example: Configuration registers in a UART or DMA Controller

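[Figure: Core1 block diagram; the IM contains a Controller (bus signals data, addr, rd, wr; control signals e, ld) and pre-fetch register D', which mirrors Configuration Register (D) in the core internals]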
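The slides do not spell out the heuristic used for Core1. One plausible scheme, assumed here purely for illustration, is to load D' on every bus write to D, which works if only the system bus can change a configuration register. The class `ConfigShadowIM` and its methods are hypothetical names.

```python
class ConfigShadowIM:
    """Hypothetical IM for a core with a single configuration register D."""

    def __init__(self):
        self.core_d = 0      # register D inside the core internals
        self.shadow_d = 0    # pre-fetch register D' inside the IM

    def bus_write(self, value):
        # Forward the write to the internals and load D' with the same value,
        # so D' cannot go stale (assumption: only the bus ever writes D).
        self.core_d = value
        self.shadow_d = value

    def bus_read(self):
        # Served from D'; no access to the core internals is needed.
        return self.shadow_d


im = ConfigShadowIM()
im.bus_write(0x3)
print(im.bus_read())   # -> 3
```

Under that assumption every read of D is a hit, so reads never pay the extra IM cycles.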



• Core2 - Task Registers

– Example: JPEG or MPEG CODEC, or DES Encryption

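[Figure: Core2 block diagram; the IM contains a Controller (bus signals data, addr, rd, wr), a PFU (signals rd, writing, e1, e2, ld1, ld2), and pre-fetch registers DO' and S'; the core internals contain Data Input Register (DI), Data Output Register (DO), and Status Register (S)]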



• Core3 - Input-buffered Registers

– Example: FIFO or UART

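[Figure: Core3 block diagram; the IM contains a Controller (bus signals data, addr, rd, wr), a PFU (signals rd, e1, e2, ld1, ld2), and pre-fetch registers D' and S'; the core internals contain Status Register (S) and Data Register (D)]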


Architecture of IM

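[Figure: IM architecture, shown for Core2; a Controller (bus signals data, addr, rd, wr), a PFU (signals rd, writing, e1, e2, ld1, ld2), and the pre-fetch registers DO' and S', connected to DI, DO, and S in the core internals]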

Pre-fetch Unit - Implements the pre-fetching heuristic

Goal: maximize the number of hits

Controller - Interfaces to system bus


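Pre-fetch Heuristic for Core2

[Figure: timing diagram (signals clk, wr, rd, addr, data, i_rd, i_addr, i_data) showing S and D pre-fetched into S' and D'; the bus read completes in 2 cycles]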

• Core2 - Task Register

– After the system writes to register DI

• Read S into pre-fetch register S’

• When S indicates completion, read DO from the core into pre-fetch register DO’

– Repeat this process

• Similar heuristics were developed for Core1 and Core3
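The Core2 steps above can be sketched as a small cycle-by-cycle simulation. Everything here is an assumption made for illustration: `TaskCore` is a stand-in for the core internals (it doubles its input after a fixed latency), `PrefetchUnit` and its method names are hypothetical, and clearing S’ on a new write to DI is a choice of this sketch rather than something the slides state.

```python
class TaskCore:
    """Stand-in core internals: DO = 2 * DI, ready after `latency` ticks."""

    def __init__(self, latency=3):
        self.latency = latency
        self.remaining = 0
        self.pending = 0
        self.status = 0   # S: 1 means DO holds a finished result
        self.do = 0       # DO

    def write_di(self, value):
        self.pending = value
        self.remaining = self.latency
        self.status = 0

    def tick(self):
        if self.remaining > 0:
            self.remaining -= 1
            if self.remaining == 0:
                self.do = self.pending * 2
                self.status = 1


class PrefetchUnit:
    """Polls S after a write to DI, then copies DO into DO'."""

    def __init__(self, core):
        self.core = core
        self.s_copy = 0      # S'
        self.do_copy = 0     # DO'
        self.polling = False

    def bus_write_di(self, value):
        self.core.write_di(value)
        self.s_copy = 0      # assumption: avoid reporting a stale "done"
        self.polling = True

    def tick(self):
        self.core.tick()
        if self.polling:
            s = self.core.status          # read S from the internals
            if s == 1:
                self.do_copy = self.core.do   # load DO' before S' shows "done"
                self.polling = False
            self.s_copy = s               # update S'


pfu = PrefetchUnit(TaskCore(latency=3))
pfu.bus_write_di(21)
for _ in range(3):
    pfu.tick()
print(pfu.s_copy, pfu.do_copy)   # -> 1 42
```

Once S’ reads 1, both the status read and the data read from the system bus can be answered from the IM's local copies.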


Experiments - Area (Gates)

[Figure: bar chart of area in gates (axis 0 to 14000) for Core1, Core2, and Core3 under three configurations: No IM, IM w/o PF, IM w/ PF]

Note: To better evaluate the effects of IMs, our cores were kept simple, so they are smaller than typical cores.

Average increase of IM w/o PF over no IM: 1.4K gates

Average increase of IM w/ PF over IM w/o PF: 1.3K gates


Experiments - Performance (ns)

[Figure: bar chart of performance in ns (axis 0 to 10000) for Core1, Core2, and Core3 under three configurations: No IM, IM w/o PF, IM w/ PF]


Experiments - Energy (nJ)

[Figure: bar chart of energy in nJ (axis 0 to 14) for Core1, Core2, and Core3 under three configurations: No IM, IM w/o PF, IM w/ PF]


Digital Camera Peripheral Read Access (cycles)

[Figure: bar chart of read access in cycles (axis 0 to 1200) for CODEC Status, CODEC Data, CCD Status, and CCD Data, comparing IM w/o PF and IM w/ PF]

12% of execution time for peripheral reads

50% decrease in peripheral read access

25% decrease in overall peripheral access

3.2% improvement in overall system performance


Conclusion

• Separating the interface from the internals eases core integration but may increase read cycles

• Pre-fetching eliminated the performance degradation in common cases

– Increases in size and power were acceptable

– Transparent to system bus and core internals

– Pre-fetching thus improves the marketability of cores
