
An Integrated Temporal Partitioning and Mapping Framework for Handling

Custom Instructions on a Reconfigurable Functional Unit

Farhad Mehdipour†, Hamid Noori††, Morteza Saheb Zamani†, Kazuaki Murakami††, Mehdi Sedighi†, Koji Inoue††

†Computer and IT Engineering Department, Amirkabir University of Technology {mehdipur,szamani,msedighi}@aut.ac.ir

††Department of Informatics, Graduate School of Information Science and Electrical Engineering, Kyushu University

[email protected], {murakami,inoue}@i.kyushu-u.ac.jp


ACSAC 2006 - Shanghai, China Kyushu University

Agenda
- Introduction
- General overview of the architecture
- Generating Custom Instructions
- Reconfigurable Functional Unit (RFU)
- Tool Chain used for our quantitative approach
- Integrated Temporal Partitioning and Mapping
  - The Integrated Framework
  - Incremental Temporal Partitioning Algorithm
  - Mapping Procedure
- Experimental Results


Introduction

Approaches for designing embedded SoCs
- Application Specific Integrated Circuits (ASICs)
  - Higher performance
  - Lower power consumption
  - Not flexible
  - Expensive and time-consuming design process
- General Purpose Processors (GPPs)
  - Availability of tools
  - Programmability
  - Low performance
  - High power consumption
- Application Specific Instruction-set Processors (ASIPs)
  - More flexible than ASICs
  - Higher performance than GPPs
  - Long and costly design and verification
- Extensible Processors
  - More flexibility
  - Significant non-recurring engineering costs


General overview of the architecture

Adaptive Dynamic Extensible Processor

[Figure: block diagram. Base Processor: N-way in-order general RISC with Fetch, Decode, Execute, Memory and Write stages and a Reg File. Augmented Hardware: Profiler (detects start addresses of Hot Basic Blocks (HBBs)), RFU (executes Custom Instructions), Sequencer (switches between main processor and RFU).]


Operation modes

[Figure: Training Mode and Normal Mode, each showing Applications, Processor, Profiler, RFU, Sequencer and Sequencer Table.]
- Training Mode: binary-level profiling by the Profiler; detecting start addresses of HBBs; running tools for generating Custom Instructions, generating configuration data for the RFU and initializing the Sequencer Table; binary rewriting
- Normal Mode: the Sequencer monitors the PC and switches between main processor and RFU; the RFU executes CIs


Integrating base processor with other components

[Figure: datapath diagram. GPP: Register File, ID/EXE Reg, Functional Unit, MUX, EXE/MEM Reg. Augmented HW: RFU with Configuration Memory, Sequencer with Sequencer Table, Profiler with Profiler Table. Label: Online Training.]


Generation of Custom Instructions

- Custom instructions
  - Limited to one Hot Basic Block (HBB)
  - Exclude floating point, multiply, divide and load instructions
  - Include at most one STORE, at most one BRANCH/JUMP and all other fixed point instructions
- Simple algorithm for generating custom instructions
  - HBBs usually include 10~40 instructions for MiBench
  - Custom instruction generator is going to be executed on the base processor (in online training mode)
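The instruction-class rules listed above can be illustrated with a small checker. This is a minimal sketch, not the authors' generator: the opcode sets and the function names are assumptions chosen for illustration, and a real tool would use the full MIPS/PISA opcode tables.

```python
# Opcode sets below are illustrative assumptions, not the authors' tables.
LOADS = {"lw", "lh", "lhu", "lb", "lbu"}
MUL_DIV = {"mult", "multu", "div", "divu"}
STORES = {"sw", "sh", "sb"}
BRANCH_JUMP = {"beq", "bne", "blez", "bgtz", "bltz", "bgez",
               "j", "jal", "jr", "jalr"}
FP_MOVES = {"mfc1", "mtc1", "lwc1", "swc1"}


def is_floating_point(opcode):
    """Treat FP moves and mnemonics with a format suffix (".d", ".s") as FP."""
    return "." in opcode or opcode in FP_MOVES


def satisfies_ci_constraints(opcodes):
    """Return True if the opcode sequence obeys the CI rules on the slide."""
    for op in opcodes:
        # Floating point, multiply, divide and load instructions are excluded.
        if op in LOADS or op in MUL_DIV or is_floating_point(op):
            return False
    # At most one STORE and at most one BRANCH/JUMP are allowed.
    n_stores = sum(op in STORES for op in opcodes)
    n_branches = sum(op in BRANCH_JUMP for op in opcodes)
    return n_stores <= 1 and n_branches <= 1
```

For example, `satisfies_ci_constraints(["addu", "sll", "sw", "bne"])` is accepted, while any sequence containing `lw` or two stores is rejected.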


Generating Custom Instructions

4052c0 addiu $29,$29,-32
4052c8 mov.d $f0,$f12
4052d0 sw $18,24($29)
4052d8 addu $18,$0,$6
4052e0 sw $31,28($29)
4052e8 sw $16,16($29)
4052f0 mfc1 $16,$f0
4052f8 mfc1 $17,$f1
405300 srl $6,$17,0x14
405308 andi $6,$6,2047
405310 sltiu $2,$6,2047
405318 addu $6,$6,$18
405320 sltiu $2,$6,2047
405328 lui $2,32783
405330 and $17,$17,$2
405338 andi $2,$6,2047
405340 sll $2,$2,0x14
405348 or $17,$17,$2
405350 mtc1 $16,$f0
405358 mtc1 $17,$f1
405360 lw $31,28($29)
405370 lw $16,16($29)
405378 addiu $29,$29,32
405380 jr $31

Finding the biggest sequence of instructions in the HBB that can be executed on the ACC

Moving the instructions and appending supportable instructions to the head of the detected instruction sequence after checking flow-dependency and anti-dependency

Moving the instructions and appending supportable instructions to the tail of the detected instruction sequence after checking flow-dependency and anti-dependency

Rewriting object code if instructions have been moved

Moving instructions should not modify the logic of the application

Custom instruction generation is done without considering any other constraints.
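The first step above, finding the biggest sequence of supportable instructions in an HBB, can be sketched as a single linear scan. This is only an illustration under stated assumptions: `longest_supported_run` and the `is_supported` predicate are hypothetical names, and the later steps (moving instructions to the head or tail after dependency checks, rewriting object code) are not modelled.

```python
def longest_supported_run(opcodes, is_supported):
    """Return (start, length) of the longest contiguous run of supported opcodes.

    `is_supported` is a caller-supplied predicate (an assumption of this
    sketch); ties are resolved in favour of the earliest run.
    """
    best_start, best_len = 0, 0
    run_start = 0
    for i, op in enumerate(opcodes):
        if not is_supported(op):
            # An unsupported instruction ends the current run.
            run_start = i + 1
            continue
        run_len = i - run_start + 1
        if run_len > best_len:
            best_start, best_len = run_start, run_len
    return best_start, best_len
```

Applied to the opcodes of the HBB listed on this slide, with only fixed point ALU operations treated as supported, the scan selects the ten instructions from `srl` through `or`.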


Reconfigurable Functional Unit (RFU)

- RFU is a matrix of Functional Units (FUs)
- RFU has configuration memory
- FUs support only logical operations, add/subtract, shifts and compare
- RFU updates the PC after executing each CI
- RFU has variable delay which depends on depth of DFG of Custom Instructions


RFU Architecture: A Quantitative Approach

- 22 programs of MiBench were chosen
- SimpleScalar toolset was utilized for simulation
- RFU is a matrix of FUs
  - No. of inputs
  - No. of outputs
  - No. of FUs
  - Width
  - Depth
  - Connections
  - Location of inputs and outputs
- Coverage (mapping) rate: percentage of generated CIs that can be mapped on the RFU considering constraints
- Considering frequency and weight in measurement
  - CI execution frequency
  - Weight (to equal the number of executed instructions)
  - Average = Σ (Freq × Weight) over all CIs
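One plausible reading of the frequency- and weight-based coverage measure is sketched below. The slide does not give the exact formula, so the normalisation (mapped Freq × Weight divided by total Freq × Weight) and the name `weighted_coverage` are assumptions of this sketch.

```python
def weighted_coverage(cis):
    """Coverage in percent, weighting each CI by frequency * weight.

    `cis` is a list of (frequency, weight, is_mappable) tuples, where
    weight stands for the number of instructions in the CI. The
    normalisation used here is an assumption, not taken from the slides.
    """
    total = sum(freq * weight for freq, weight, _ in cis)
    if total == 0:
        return 0.0
    mapped = sum(freq * weight for freq, weight, ok in cis if ok)
    return 100.0 * mapped / total
```

With two CIs, one mappable (frequency 100, weight 5) and one not (frequency 50, weight 10), both contribute 500 executed instructions, so the weighted coverage is 50%.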


Tool Chain

[Figure: tool-chain flow. 22 applications of MiBench run on SimpleScalar (PISA configuration) with the Profiler and Base Processor. Blocks shown: Detecting Start Addr of HBBs; Reading HBBs from Obj Code; Custom Instruction Generator; Generating DFG for HBBs; Optimization (Constant Propagation); Updating DFG; Mapping CIs on the RFU. Results are used for designing the RFU.]


RFU Inputs (no constraint)

[Chart: Input No. Analysis, Optimized Version. Coverage (axis 0 to 120) versus Input No. (1 to 19). Annotations: 89.37, 96.37, 98.48 and 8.]


RFU Outputs (no constraint)

[Chart: Output No. Analysis, Optimized Version. Coverage (axis 0 to 120) versus Output No. (1 to 14). Annotations: 96.58 and 6.]


RFU Architecture

- Distributing inputs in different rows
  - Row 1 = 7
  - Row 2 = 2
  - Row 3 = 2
  - Row 4 = 2
  - Row 5 = 1
- Connections with variable length
  - row1 to row3 = 1
  - row1 to row4 = 1
  - row1 to row5 = 1
  - row2 to row4 = 1
- Synthesis results using Hitachi 0.18 μm
  - Area: 1.1534 mm²
  - Delay: 9.66 ns


Generating Custom Instruction for the Target RFU

- In our primary CI generator we did not consider any constraints for the generated CIs and tried to generate CIs as large as possible.
- Therefore, after the architecture was fixed, some of the generated CIs could not be mapped on the proposed RFU due to its constraints.


Customizing CI generator for the Target RFU – First Approach (CIGen)

- Some primary constraints of the RFU (number of inputs, number of outputs and number of nodes) were added to our CI generator tool to generate CIs that are mappable.
- In this approach the CI generator is unaware of the mapping process results.
- Some of the CIs may not ultimately be mapped to the RFU due to the routing and connection constraints.

[Figure: flowchart. The CI Generation Tool takes the RFU Architectural Constraints and produces CIs. "Mapping is successful?" YES: Final Configurations. NO: Rejected CIs (have to be run on base processor).]


Customizing CI generator for the Target RFU – Second Approach

Integrated Framework
- Performs an integrated temporal partitioning and mapping process
- Takes rejected CIs as input
- Partitions them to appropriate mappable CIs

Advantages
- All generated CIs are mappable
- Using a mapping-aware temporal partitioning process

[Figure: flowchart. CIs generated by the CI generation tool go to Mapping on RFU. "Mapping is successful?" YES: Final Configurations. NO: Integrated Framework. Inside the Integrated Framework: Temporal Partitioning produces Temporal Partitions. "Mapping is successful?" YES: Final Configurations. NO: Incremental Temporal Partitioning produces Incremental Temporal Partitions.]
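The control flow of the integrated framework can be sketched as a loop that shrinks a partition until the mapper accepts it and then continues with the nodes that were moved out. This is a simplified illustration, not the authors' tool: `try_map` and `pick_node_to_move` are hypothetical callbacks standing in for the RFU mapper and the node selection rule.

```python
def integrated_partition_and_map(nodes, try_map, pick_node_to_move):
    """Split `nodes` into partitions that the mapper accepts.

    try_map(partition) -> bool   : hypothetical stand-in for the RFU mapper.
    pick_node_to_move(partition) : hypothetical selection rule returning the
                                   node to push to the subsequent partition.
    """
    final_partitions = []
    pending = list(nodes)
    while pending:
        current, moved = list(pending), []
        # Move one node at a time until the current partition is mappable.
        while current and not try_map(current):
            node = pick_node_to_move(current)
            current.remove(node)
            moved.insert(0, node)
        if not current:
            raise ValueError("a single node could not be mapped")
        final_partitions.append(current)
        # The moved nodes become the candidate for the next partition.
        pending = moved
    return final_partitions
```

With a toy mapper that accepts at most three nodes and a rule that always moves the last node, seven nodes end up in three partitions: `[0, 1, 2]`, `[3, 4, 5]` and `[6]`.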


Integrated Framework - Incremental Temporal Partitioning Algorithm

Incremental Temporal Partitioning
- The node with the highest ASAP level is selected and moved to the subsequent partition.
- Node selection and moving order: 15, 13, 11, 9, 14, 12, 10, 8, 3 and 7.

[Figure: Data Flow Graph of the input CI (nodes 0 to 15) split into a 1st Partition and a 2nd Partition, with the RFU Map of each partition.]
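The selection rule, picking the node with the highest ASAP level, can be sketched on a small graph. The DFG below is a hypothetical example, not the 16-node graph from the figure, and the tie-break by larger node id is an assumption of this sketch.

```python
def asap_levels(preds):
    """Compute ASAP levels for a DFG given as {node: [predecessor nodes]}.

    Nodes without predecessors get level 0; every other node sits one
    level below its deepest predecessor.
    """
    levels = {}

    def level(node):
        if node not in levels:
            levels[node] = 1 + max((level(p) for p in preds[node]), default=-1)
        return levels[node]

    for node in preds:
        level(node)
    return levels


def pick_node_to_move(partition, levels):
    """Select the node with the highest ASAP level (ties: larger id, assumed)."""
    return max(partition, key=lambda n: (levels[n], n))
```

For the hypothetical graph `{0: [], 1: [], 2: [0, 1], 3: [2], 4: [2]}`, nodes 3 and 4 share the highest level, and node 4 is selected first under the assumed tie-break.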


Mapping Custom Instructions

- Mapping is the same as the well-known placement problem: determining the appropriate positions for DFG nodes on the RFU.
- Assigning CI instructions to FUs is done based on the priority of the nodes.


An Example: Mapping of a CI on the RFU

A Custom Instruction
1: SUBU R3, R0, R3
2: ADDU R10, R0, R0
3: SRA R8, R10, 0x3
4: SLT R2, R3, R8
5: BNE R0, 400488, R2

[Figure: Data Flow Graph of the CI (nodes SUBU, ADDU, SRA, SLT, BNE; operands R0, R2, R3, R8, R10, 0x3, 400488) and its RFU Map (positions 1 to 5).]
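The data flow graph of the five-instruction CI in this example can be derived by tracking the last writer of each register. The sketch below is illustrative only: the tuple encoding, `build_dfg` and `dfg_depth` are assumptions of this example, and immediate operands (0x3, 400488) are left out because they create no dependencies.

```python
def build_dfg(instructions):
    """Return flow-dependency edges (producer, consumer) for a CI.

    `instructions` is a list of (index, dest_or_None, source_registers)
    in program order; this encoding is an assumption of the sketch.
    """
    last_writer = {}
    edges = []
    for idx, dest, sources in instructions:
        for src in sources:
            edge = (last_writer.get(src), idx)
            if edge[0] is not None and edge not in edges:
                edges.append(edge)
        if dest is not None:
            last_writer[dest] = idx
    return edges


def dfg_depth(instructions, edges):
    """Length in nodes of the longest dependency chain (the DFG depth)."""
    depth = {idx: 1 for idx, _, _ in instructions}
    # Edges are emitted in program order of the consumer, so each
    # producer's depth is final by the time it is read here.
    for producer, consumer in edges:
        depth[consumer] = max(depth[consumer], depth[producer] + 1)
    return max(depth.values())


# The CI from the slide; BNE writes no register.
CI = [
    (1, "R3", ["R0", "R3"]),   # SUBU R3, R0, R3
    (2, "R10", ["R0", "R0"]),  # ADDU R10, R0, R0
    (3, "R8", ["R10"]),        # SRA R8, R10, 0x3
    (4, "R2", ["R3", "R8"]),   # SLT R2, R3, R8
    (5, None, ["R0", "R2"]),   # BNE R0, 400488, R2
]
```

For this CI, `build_dfg(CI)` yields the edges 2→3, 1→4, 3→4 and 4→5, and the depth is 4 (ADDU, SRA, SLT, BNE).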


Customizing Mapping Tool

Spiral-shaped mapping is possible thanks to the horizontal connections in the third and fourth rows of the RFU.

A Custom Instruction
1: SLL R2, R17, 0x1
2: ADDU R2, R2, R17
3: SLL R2, R2, 0x3
4: ADDU R2, R2, R17
5: SLL R2, R2, 0x4
6: ADDU R2, R2, R20
7: SLL R3, R18, 0x2
8: ADDU R3, R3, R2

[Figure: Data Flow Graph of the CI with its critical path marked, and the spiral-shaped RFU Map (positions 1 to 8).]


CI lengths for MiBench applications


Percentage of rejected CIs for CIGen

[Chart: % of rejected CIs (axis 0 to 70) per benchmark: bitcounts, blowfish, blowfish (dec), cjpeg, djpeg, fft, fft (inv), gsm (dec), gsm (enc), lame.]


Initial and final number of partitions

[Chart: No. of partitions (axis 0 to 90) per benchmark: bitcnts, blowfish, blowfish (dec), cjpeg, djpeg, fft, fft (inv), gsm (dec), gsm (enc), lame, rijndael (enc), rijndael (dec), sha. Series: Initial No. of Partitions; Final No. of Partitions.]


Maximum critical path length for CIs

[Chart: maximum critical path length (axis 0 to 8) per benchmark: bitcnts, blowfish, blowfish (dec), cjpeg, djpeg, fft, fft (inv), gsm (dec), gsm (enc), lame, rijndael (enc), rijndael (dec), sha.]


Performance Evaluation

Issue: 1-way
L1 I-cache: 32K, 2-way, 1 cycle latency
L1 D-cache: 32K, 4-way, 1 cycle latency
Unified L2: 1M, 6 cycle latency
Execution units: 1 integer, 1 floating point
RUU size: 64
Fetch queue size: 64

- SimpleScalar was configured to behave as a MIPS32 4K processor.
- The base processor supports the MIPS instruction set.
- 22 applications of MiBench


Delay of RFU according to CI length

CI Length    RFU Delay (ns)
1            1.38
2            2.28
3            3.12
4            4.89
5            6.47
6            7.57
7            8.65
8            9.66

Synopsys Tools + Hitachi 0.18 μm
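The delay table can be turned into a per-CI latency in processor cycles once a clock period is fixed. The slides do not state the base processor's clock, so the `clock_period_ns` parameter and the rounding-up rule are assumptions of this sketch. Only the delay values come from the table.

```python
import math

# RFU delay in ns indexed by CI length, copied from the table above.
RFU_DELAY_NS = {1: 1.38, 2: 2.28, 3: 3.12, 4: 4.89,
                5: 6.47, 6: 7.57, 7: 8.65, 8: 9.66}


def rfu_cycles(ci_length, clock_period_ns):
    """Cycles needed by the RFU for a CI of the given length.

    Rounding the delay up to whole clock periods is an assumption of
    this sketch; the clock period is a hypothetical parameter.
    """
    return math.ceil(RFU_DELAY_NS[ci_length] / clock_period_ns)
```

With a hypothetical 3 ns clock period, a CI of length 1 would take one cycle and a CI of length 8 would take four.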


Speedup

[Chart: speedup (axis 1 to 2.4) per benchmark: bitcounts, blowfish, blowfish (dec), cjpeg, djpeg, fft, fft (inv), gsm (dec), gsm (enc), lame, rijndael (enc), rijndael (dec), sha. Series: Speedup using integrated framework; Speedup using CI generation tool.]


Conclusions

Proposing a reconfigurable functional unit for an Adaptive Dynamic Extensible Processor using a quantitative approach.

Developing an integrated framework for partitioning and mapping custom instructions for the proposed RFU.


Thank you for your attention.