Pre-fetching for Improved Core Interfacing

Roman Lysecky, Frank Vahid, Tony Givargis, & Rilesh Patel
Department of Computer Science
University of California
Riverside, CA 92521

{rlysecky, vahid, givargis, rrpatel}@cs.ucr.edu

This work was supported in part by the NSF and a DAC scholarship.

Introduction

[Figure: a core library containing MIPS, MEM, Cache, DSP, DMA, Core X, and Core Y]

• Core-based designs are becoming common
  – Available as both soft and hard
• Problem - How can interfacing be simplified to ease integration?

Introduction

• One Solution - One standard on-chip bus
  – All cores have same interface
  – Appears to be unlikely (VSIA)
• Another Solution - Divide core into a bus wrapper and internal parts
  – Rowson and Sangiovanni-Vincentelli '97 - Interface-Based Design
  – VSIA developing standard for interface between wrapper and internals
    • Far simpler than standard on-chip bus
  – Refer to bus wrapper as an interface module (IM)

[Figure: two arrangements. Labels: standard interface, standard bus; any bus, IM, standard interface, internals]

Introduction

• Problem - Using an Interface Module can result in extra cycles for reads
• Pre-fetching can reduce or eliminate extra cycles
• Outline
  – Interfacing Options
  – Classification of registers and common register occurrences
  – Architecture of IM and pre-fetch heuristics
  – Experiments
  – Conclusions

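No Interface Module (IM)

• Interface logic is designed as part of the core's internal logic
• Pros
  – Small Size
  – High Performance (No Overhead)
• Cons
  – May be hard to integrate with different busses

[Figure: timing diagram with signals clk, rd, addr, data; a read of register D takes 2 cycles. Block diagram: core containing register D, with bus signals data, addr, rd, wr]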

Separating a Core into IM & Internals

• Interface module is separate from core internals
  – Standard bus between IM and internals
• Pros
  – Easily integrate with different busses
  – Any changes are restricted to the IM
• Cons
  – May incur performance overhead due to the interface module
  – Possible increases in size and power

[Figure: timing diagram with signals clk, rd, addr, data, i_rd, i_addr, i_data; a read of register D takes 4 cycles total, 2 of them overhead. Block diagram: IM with bus signals data, addr, rd, wr, connected to the core internals (register D) through i_wr, i_addr, i_data, i_rd]

Proposed Solution - Pre-fetching in IM

• Pre-fetching
  – Analogous to caching, store local copies of registers inside the interface module
  – Enable quick response time
  – Eliminates extra cycles for register reads
  – Transparent to system bus and core internals
• Pros
  – Easily integrate with different busses
  – No performance overhead
• Cons
  – Possible increases in size and power

[Figure: timing diagram with signals clk, rd, addr, data, i_rd, i_addr, i_data; register D is pre-fetched into D', so the read takes 2 cycles. Block diagram: IM holding D', with bus signals data, addr, rd, wr, connected to the core internals (register D) through i_wr, i_addr, i_data, i_rd]
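The cycle counts on the last three slides can be compared with a small model. This is only an illustrative sketch, not the authors' implementation: the class `InterfaceModule`, its methods, and the dictionary standing in for the core internals are hypothetical names, and the only figures taken from the slides are the 2-cycle direct read and the 2 cycles of IM overhead. The model also ignores how the copy is kept up to date, which is the job of the pre-fetch heuristic.

```python
# Cycle counts taken from the timing diagrams on the slides.
BUS_READ_CYCLES = 2      # read answered directly (no IM, or a pre-fetch hit)
IM_OVERHEAD_CYCLES = 2   # extra cycles when the IM forwards the read to the internals


class InterfaceModule:
    """Toy IM (hypothetical) that may keep local copies (D') of core registers."""

    def __init__(self, core_registers, prefetch=True):
        self.core = core_registers   # dict standing in for the core internals
        self.prefetch = prefetch
        self.copies = {}             # pre-fetch registers, keyed by address

    def refresh(self, addr):
        """Copy a register from the internals into the IM ahead of a read."""
        if self.prefetch:
            self.copies[addr] = self.core[addr]

    def read(self, addr):
        """Return (value, cycles) for a read issued on the system bus."""
        if self.prefetch and addr in self.copies:
            return self.copies[addr], BUS_READ_CYCLES          # hit: served from D'
        return self.core[addr], BUS_READ_CYCLES + IM_OVERHEAD_CYCLES


core = {0x0: 0xAB}
plain = InterfaceModule(core, prefetch=False)
pf = InterfaceModule(core, prefetch=True)
pf.refresh(0x0)

print(plain.read(0x0))  # (171, 4): IM without pre-fetching
print(pf.read(0x0))     # (171, 2): IM with pre-fetching, same as no IM
```

A read that misses the pre-fetch registers falls back to the 4-cycle path, which is why the heuristic aims to maximize hits.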

Classification of Core Registers

• Different registers need different pre-fetch schemes
• Need classification for registers
  – Update Type
  – Access Type
  – Notification Type
  – Structure Type

Common Register Types

• We identified three common register combinations found in cores
  – Configuration, Task, and Input-buffered registers
  – Implemented cores representative of each of these three common register combinations
  – Provide classification for registers in each of the cores

Common Register Types

• Core1 - Configuration Registers
  – Example: Configuration registers in a UART or DMA Controller

[Figure: block diagram of Core1. Labels: IM with Controller and pre-fetch register D'; core internals with Configuration Register (D); signals data, addr, rd, wr, e, ld]

Common Register Types

• Core2 - Task Registers
  – Example: JPEG or MPEG CODEC, or DES Encryption

[Figure: block diagram of Core2. Labels: IM with Controller, PFU, and pre-fetch registers DO' and S'; core internals with Data Input Register (DI), Data Output Register (DO), and Status Register (S); signals data, addr, rd, wr, writing, e1, e2, ld1, ld2]

Common Register Types

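• Core3 - Input-buffered Registers
  – Example: FIFO or UART

[Figure: block diagram of Core3. Labels: IM with Controller, PFU, and pre-fetch registers D' and S'; core internals with Status Register (S) and Data Register (D); signals data, addr, rd, wr, e1, e2, ld1, ld2]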

Architecture of IM

[Figure: block diagram of the IM and core internals. Labels: Controller, PFU, pre-fetch registers DO' and S', core registers DI, DO, and S; signals data, addr, rd, wr, writing, e1, e2, ld1, ld2]

• Pre-fetch Unit - Implements the pre-fetching heuristic
  – Goal: maximize the number of hits
• Controller - Interfaces to system bus

Pre-fetch Heuristic for Core2

• Core2 - Task Register
  – After system writes to register DI
    • Read S into pre-fetch register S'
    • When S indicates completion, read DO from core into pre-fetch register DO'
  – Repeat this process
• Similar heuristics were developed for Core1 and Core3

[Figure: timing diagram with signals clk, wr, i_rd, i_addr, i_data, addr, rd, data; S and D are fetched over the internal bus into S' and D', with a 2-cycle read marked]
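The Core2 heuristic above can be sketched as a small simulation. Everything here is a hypothetical illustration rather than the authors' design: `TaskCore`, `PrefetchUnit`, the fixed `latency`, the XOR "task", and the convention that `S == 1` means completion are all assumptions made for the example. Clearing S' when DI is written is likewise an assumption, added so that a stale "done" flag is never returned.

```python
class TaskCore:
    """Hypothetical stand-in for the Core2 internals (registers DI, DO, S)."""

    def __init__(self, latency=3):
        self.latency = latency   # assumed number of cycles the task takes
        self.remaining = 0
        self.DI = None
        self.DO = None
        self.S = 0               # assumption: 1 means "task complete"

    def write_di(self, value):
        self.DI = value
        self.S = 0
        self.remaining = self.latency

    def tick(self):
        """Advance the core by one clock cycle."""
        if self.remaining > 0:
            self.remaining -= 1
            if self.remaining == 0:
                self.DO = self.DI ^ 0xFF   # placeholder for the real task
                self.S = 1


class PrefetchUnit:
    """Sketch of the heuristic: poll S into S', then fetch DO into DO'."""

    def __init__(self, core):
        self.core = core
        self.S_copy = 0       # pre-fetch register S'
        self.DO_copy = None   # pre-fetch register DO'
        self.polling = False

    def on_write_di(self, value):
        """System wrote DI: forward the write and start polling S."""
        self.core.write_di(value)
        self.S_copy = 0       # assumption: invalidate the old status copy
        self.polling = True

    def tick(self):
        """Use one cycle on the internal bus to refresh the copies."""
        if not self.polling:
            return
        self.S_copy = self.core.S          # read S into S'
        if self.S_copy == 1:               # S indicates completion
            self.DO_copy = self.core.DO    # read DO into DO'
            self.polling = False           # wait for the next DI write


core = TaskCore(latency=3)
pfu = PrefetchUnit(core)
pfu.on_write_di(0x0F)
for _ in range(3):
    core.tick()
    pfu.tick()

print(pfu.S_copy, hex(pfu.DO_copy))  # 1 0xf0
```

Once polling stops, system reads of S and DO can be answered from S' and DO' without touching the internal bus. The sketch fetches S and DO in the same cycle for brevity; a real IM would likely need separate internal reads.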

Experiments - Area (Gates)

[Figure: bar chart of area in gates (axis 0 to 14000) for Core1, Core2, and Core3 under No IM, IM w/o PF, and IM w/ PF]

• Average increase of IM w/o PF over no IM of 1.4K gates
• Average increase of IM w/ PF over IM w/o PF of 1.3K gates

Note: To better evaluate the effects of IMs, our cores were kept simple, thus resulting in a smaller than normal size.

Experiments - Performance (ns)

[Figure: bar chart of performance in ns (axis 0 to 10000) for Core1, Core2, and Core3 under No IM, IM w/o PF, and IM w/ PF]

Experiments - Energy (nJ)

[Figure: bar chart of energy in nJ (axis 0 to 14) for Core1, Core2, and Core3 under No IM, IM w/o PF, and IM w/ PF]

Digital Camera Peripheral Read Access (cycles)

[Figure: bar chart of read-access cycles (axis 0 to 1200) for CODEC Status, CODEC Data, CCD Status, and CCD Data under IM w/o PF and IM w/ PF]

• 12% of execution time for peripheral reads
• 50% decrease in peripheral read access
• 25% decrease in overall peripheral access
• 3.2% improvement in overall system performance

Conclusion

• Separating interface from internals eases core integration but may yield an increase in read cycles
• Pre-fetching eliminated the performance degradation in common cases
  – Increases in size and power were acceptable
  – Transparent to system bus and core internals
  – Pre-fetching thus improves the marketability of cores