24
Roman Lysecky University of California, Riverside 1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis Department of Computer Science University of California Riverside, CA 92521 {rlysecky, vahid, givargis}@cs.ucr.edu This work was supported in part by the NSF and a DAC scholarship.

Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

  • View
    214

  • Download
    1

Embed Size (px)

Citation preview

Page 1: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 1

Techniques for Reducing Read Latency of Core Bus Wrappers

Roman L. Lysecky, Frank Vahid, & Tony D. GivargisDepartment of Computer Science

University of CaliforniaRiverside, CA 92521

{rlysecky, vahid, givargis}@cs.ucr.edu

This work was supported in part by the NSF and a DAC scholarship.

Page 2: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 2

Introduction

CoreLibrary

MIPSMEM

Cache

DSPDMA

Core X Core Y

• Core-based designs are becoming common– available as both soft and hard

• Problem - How can interfacing be simplified to ease integration?

Page 3: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 3

Introduction

• One Solution - One standard on-chip bus– All cores have same interface– Appears to be unlikely (VSIA)

• Another Solution - Divide core into a bus wrapper and internal parts– Rowson and Sangiovanni-Vincentelli ‘97 -

Interface-Based Design– VSIA developing standard for interface

between wrapper and internals• Far simpler than standard on-chip bus

– Refer to bus wrapper as an interface module(IM)

standardinterface

any bus

IMs

internals

s t a n d a r di n t e r f a c e

s t a n d a r d b u s

Page 4: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 4

Previous Work - Pre-fetching

2 cycles

pre-fetch

D

D’

clkrd

addrdata

i_rdi_addri_data

data addr rd wr

Core internals

IM

i_w

r

i_ad

dr

i_d

ata

i_rd

D

D’

• Analogous to caching, store local copies of registers inside the interface module

• Enable quick response time• Eliminates extra cycles for register reads• Transparent to system bus and core

internals• Easily integrate with different busses• No performance overhead• Acceptable increases in size and power• Pre-fetching was manually added to each

core

Page 5: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 5

data addr rdwr

On-chip bus

A

A'

Core internals

Bus wrapper

ControllerB'

PFU

B C

writing

e1

e2

ld1 ld2

...

Internal bus

Core

i_rd

i_w

r

i_ad

dr

i_da

ta

Previous Work - Architecture of IM

pre-fetchregisters

Pre-fetch Unit - Implements the pre-fetching heuristicGoal: maximize the number of hits

Controller - Interfaces to system bus

How can we automate the design of the PFU?

Page 6: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 6

Outline

• “Real-time” Pre-fetching– Mapping to real-time scheduling

• Update Dependency Model– General Register Attributes– Petri Net model construction– Petri Net model refinement– Pre-fetch Scheduling

• Experiments• Conclusions

Page 7: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 7

Real-time Pre-fetching

Idle cycle Schedule 1 Schedule 20 A A1 B B2 A3 A4 B A5 A6 B7 A B8 A A9 B

A - Age Constraint = 4B - Age Constraint = 6Access-time Constraint = 2

Naïve Schedule More Efficient Schedule

• Age constraint– Number of cycles old data may be when read

• Access-time constraint– Maximum number of cycles a read access may take

Page 8: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 8

Real-time Pre-fetching

• Mapping to Real-time scheduling– Register -> Process

– Internal bus -> Processor

– Pre-fetch -> Process execution

– Register age constraint -> Process period

– Register Access-time constraint -> Process deadline

– Pre-fetch time -> Process computation time• Assume a pre-fetch requires 2 cycles

Page 9: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 9

Real-time Pre-fetching

• Cyclic Executive– Major cycle = time required to pre-fetch all registers

– Minor cycle = rate at which highest priority process will be executed

– Problems• Sporadic writes

• All process periods must be multiples of the minor cycle

• Computationally infeasible for large register sets

Page 10: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 10

Real-time Pre-fetching

Core Register Max Age D Priority RM Pre-fetch Time Resp. Time Utilization Util. Bound1 DATA 3 2 1 2 2 66.7 1002 GCD1 10 2 1 2 2 50 78

GCD2 10 2 2 2 4CS 20 2 3 2 6

3 STAT 5 2 1 2 2 86 75.7A 25 2 3 2 8B 25 2 4 2 16RES 10 2 2 2 4

• Rate monotonic priority assignment– Register with smallest register age constraint will have the

highest priority

Page 11: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 11

Real-time Pre-fetching

Ci = Computation Time for register iAi = Pre-fetch Time for register i

N

i

N

i

iN

A

C

1

112

Core Register Max Age D Priority RM Pre-fetch Time Resp. Time Utilization Util. Bound1 DATA 3 2 1 2 2 66.7 1002 GCD1 10 2 1 2 2 50 78

GCD2 10 2 2 2 4CS 20 2 3 2 6

3 STAT 5 2 1 2 2 86 75.7A 25 2 3 2 8B 25 2 4 2 16RES 10 2 2 2 4

• Utilization-based schedulability test

Page 12: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 12

Real-time Pre-fetching

Ri = Response Time for register iCi = Computation Time for register iIi = Maximum interference in interval [t, t+Ri)

iii I C R

Core Register Max Age D Priority RM Pre-fetch Time Resp. Time Utilization Util. Bound1 DATA 3 2 1 2 2 66.7 1002 GCD1 10 2 1 2 2 50 78

GCD2 10 2 2 2 4CS 20 2 3 2 6

3 STAT 5 2 1 2 2 86 75.7A 25 2 3 2 8B 25 2 4 2 16RES 10 2 2 2 4

• Response Time Analysis– Response of register I is defined as follows– Register set is schedulable if for each register the response

time is less than or equal to its age constraint

Page 13: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 13

Real-time Pre-fetching

• Sporadic register writes– Writes to registers are sporadic– Take control of internal bus, thus delaying pre-fetching of

registers• Deadline monotonic priority

– Register with smallest register access-time constraint will have the highest priority

– Add a write register WR to register set• Access-time constraint = Deadline• Age constraint = maximum rate at which write will occur

Page 14: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 14

Experiments - Area(Gates)

0

2

4

6

8

10

12

14

16

ADJUST CODEC FIFO

No BW

BW

RTPF

Note: To better evaluate the effects of IM’s, our cores were kept simple, thus resulting in a smaller than normal size.

Average increase of IM w/ RTPFover IM w/ BW of 1.4K gates

Page 15: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 15

Experiments - Performance(ns)

0

5

10

15

20

25

30

ADJUST CODEC FIFO

No BW

BW

RTPF

Page 16: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 16

Experiments - Energy(nJ)

0

2

4

6

8

10

12

14

ADJUST CODEC FIFO

No BW

BW

RTPF

Page 17: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 17

Register Attributes• Register Attributes

– Update type, access type, notification type, and structure type

• Update dependencies– Internal dependencies

• dependencies between registers

– External dependencies• updates to register via reads and writes from on-chip bus• updates from external ports to internal core register

• Petri Nets– Determined that we could use Petri Nets to model our

update dependencies

Page 18: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 18

Petri Net Based Dependency Model

On-ChipBus

wr = 1addr = GO

wr = 1addr = MD

GO = 1wr = 1addr = MD

GO

MD

S

BusPlace

Random Transition

RegisterPlaces

UpdateDependencies

Page 19: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 19

Refined Petri Net Model

On-ChipBus

wr = 1addr = GO

wr = 1addr = MD

GO = 1S = 1wr = 1addr = MD

GO

MD

S

Data Dependency

RefinedTransition

Page 20: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 20

Pre-fetch Schedule• Create a heap registers to be pre-fetched

• Create a list for update arcs

• Repeat

– if request detected then• add outgoing arcs to heap

• set write register access-time to 0 and add to heap

– if read request detected then• add outgoing arcs to update arc list

– for register at top of heap do• if access-time = 0 then pre-fetch register, remove from heap

• if current age = 0 then pre-fetch register, reset current age, add register to heap

– while update arcs list is not empty do• if transition fires then set register’s access-time to 0 and add to heap

Page 21: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 21

Experiments - Area(Gates)

0

2

4

6

8

10

12

14

16

ADJUST CODEC FIFO

No BW

BW

RTPF

PF

Note: To better evaluate the effects of IM’s, our cores were kept simple, thus resulting in a smaller than normal size.

Average increase of IM w/ PFover IM w/ BW of 1.5K gates

Average increase of IM w/ PF over IM w/ RTPF of .1K gates

Page 22: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 22

Experiments - Performance(ns)

0

5

10

15

20

25

30

ADJUST CODEC FIFO

No BW

BW

RTPF

PF

Page 23: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 23

Experiments - Energy(nJ)

0

2

4

6

8

10

12

14

ADJUST CODEC FIFO

No BW

BW

RTPF

PF

Page 24: Roman LyseckyUniversity of California, Riverside1 Techniques for Reducing Read Latency of Core Bus Wrappers Roman L. Lysecky, Frank Vahid, & Tony D. Givargis

Roman Lysecky University of California, Riverside 24

Conclusions

• Real-time pre-fetching and update dependency pre-fetching produce good results

• Update dependency model is more efficient in pre-fetching registers

• Two approaches are complementary• Enable the automatic generation of pre-fetching unit