39
High Performance, Low Power Reconfigurable Processor for Embedded Systems Farhad Mehdipour, Hamid Noori, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

High Performance, Low Power Reconfigurable Processor for Embedded Systems

  • Upload
    syshe

  • View
    56

  • Download
    0

Embed Size (px)

DESCRIPTION

High Performance, Low Power Reconfigurable Processor for Embedded Systems. Farhad Mehdipour , Hamid Noori, Koji Inoue, Kazuaki Murakami Kyushu University, Japan. Outline. Introduction Motivations for supporting Control Instructions Basic Requirements for Supporting Conditional Execution - PowerPoint PPT Presentation

Citation preview

Page 1: High Performance, Low Power Reconfigurable Processor for Embedded Systems

High Performance, Low Power Reconfigurable Processor for Embedded Systems

Farhad Mehdipour, Hamid Noori, Koji Inoue, Kazuaki Murakami

Kyushu University, Japan

Page 2: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 2/38

Oct 16, 2007

Outline

Introduction Motivations for supporting Control Instructions Basic Requirements for Supporting Conditional Execution Algorithms for CDFG Temporal Partitioning Case study: Extending an Extensible Processor to Support

Conditional Execution Experimental Results Conclusion

Page 3: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 3/38

Oct 16, 2007

Outline

Introduction Motivations for supporting Control Instructions Basic Requirements for Supporting Conditional Execution Algorithms for CDFG Temporal Partitioning Case study: Extending an Extensible Processor to Support

Conditional Execution Experimental Results Conclusion

Page 4: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 4/38

Oct 16, 2007

Introduction

Designing Embedded Systems Embedded Microprocessors Application Specific Integrated Circuits (ASICs) Application Specific Instruction set Processors (ASIPs) Extensible Processors

CPU

Instruction Dispatcher

Register File

+ & x LD/ST CFU1 CFU2

LD/ST: Load / Store

CFU: Custom Functional Unit

Page 5: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 5/38

Oct 16, 2007

Goal

Improving the performance and energy efficiency of embedded processors, while maintaining compatibility and flexibility.

Page 6: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 6/38

Oct 16, 2007

Extensible Processors

Enhancing the performance of a processor in embedded systems Using an accelerator for accelerating frequently executed portions of applications

Accelerator implementations reconfigurable fine/coarse grain hw custom hardware (such as ASIP or Extensible Processors)

AND AND

OR

AND2_OR

CPU

Instruction Dispatcher

Register File

+ & x LD/ST CFU1 CFU2

LD/ST: Load / Store

CFU: Custom Functional Unit

Page 7: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 7/38

Oct 16, 2007

Custom Instructions

Critical segments Most frequently executed portions of the applications

Instruction set customization hardware/software partitioning

Custom Instructions (CIs) are extracted from critical segments of an application and executed on an Custom Functional Unit (CFU)

A CI is represented as a Data Flow Graph (DFG)

CI or DFG generation levels high level or binary level (DFG nodes are the instructions level operations)

Page 8: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 8/38

Oct 16, 2007

Adaptive Extensible Processors

Issues of Extensible Processors High NRE (Non-Recurring Engineering) and manufacturing costs Long time-to-market

Adaptive Extensible Processor Adding and generating custom instructions after fabrication Using a reconfigurable functional unit (RFU) instead of custom functional unit

CPU

Instruction Dispatcher

Register File

+ & x LD/ST CFU1 CFU2RFU Config Mem

CFU: Custom Functional Unit

RFU: Reconfigurable Functional Unit

Page 9: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 9/38

Oct 16, 2007

General Overview of the Proposed Architecture

GPP: General Purpose Processor

RFU: Reconfigurable Functional Unit

Register File

ID/EXE Reg

RFUALU

MUX

EXE/MEM Reg

GPP Augmented HW

400680 subiu $25,$25,1400688 lbu $13,0($7)400690 lbu $2,0($4)400698 sll $2,$2,0x184006a0 sra $14,$2,0x184006a8 addiu $4,$4,14006b0 srl $8,$2,0x1c4006b8 sll $2,$8,0x24006c0 addu $2,$2,$254006c8 lw $2,0($2)4006d0 xori $13,$13,14006d8 addu $10,$10,$2400680 subiu $25,$25,1400698 sll $2,$2,0x184006a0 sra $14,$2,0x18400688 lbu $13,0($7)4006e0 bgez $10,4006f0 ....

Hot Basic Block

Config Memory

Page 10: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 10/38

Oct 16, 2007

Outline

Introduction Motivations for supporting Control Instructions Basic Requirements for Supporting Conditional Execution Algorithms for CDFG Temporal Partitioning Case study: Extending an Extensible Processor to Support

Conditional Execution Experimental Results Conclusion

Page 11: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 11/38

Oct 16, 2007

Control Data Flow Graph: Definition

CDFG The DFG containing control instructions (e.g. branch instruction)

The sequence of execution changes due to the result of a branch instructions

Types of CDFGs: CDFGs containing at most one branch instruction (last instruction)

accelerator does not need to support conditional execution

CDFGs containing more than one branch instruction accelerator should support conditional execution

Page 12: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 12/38

Oct 16, 2007

Why need to support Control Instructions- Motivations (1/2)

Quantitative analysis approach using applications of Mibench DFG extraction process

Short distance control instructions small size DFGs (SSDFG) SSDFGs do not offer noticeable speedup have to be run on the base

processor

1 2 5 beq 7 8bgez3 9 10 bne 12 13 14 15 . . .bne16

Page 13: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 13/38

Oct 16, 2007

Why need to support Control Instructions- Motivations (2/2)

Analysis on 17 application of Mibench bitcount

almost 92% of application is hot 32% out of 92% of hot portions do not worth to be accelerated due to the SSDFGs

fft, fft(inv) and sha include few branch instructions supporting conditional execution results in no considerable speedup

0102030405060708090

100

adpcm

(enc)

adpcm

(dec

)

bitcounts

blowfis

h

blowfis

h (dec

)

basic

mat

hcj

peg crc

dijkst

ra

djpeg fft

fft (i

nv)la

me

patric

iash

a

strin

gsear

ch

susa

n

%

Percentage of hot portions Percentage of eliminated hot portions due to SSDFGs

Page 14: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 14/38

Oct 16, 2007

Outline

Introduction Motivations for supporting Control Instructions Basic Requirements for Supporting Conditional Execution Algorithms for CDFG Temporal Partitioning Case study: Extending an Extensible Processor to Support

Conditional Execution Experimental Results Conclusion

Page 15: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 15/38

Oct 16, 2007

Basic Requirements

inst. # address inst. operands (dest, src1, src2)

1 400418 addu R13 R0 R02 400420 addiu R4 R4 23 400428 subu R3 R2 R114 400430 bgez 400440 R35 400438 addiu R13 R0 86 400440 beq 400450 R137 400448 subu R3 R0 R38 400450 addu R10 R0 R09 400458 sra R8 R9 0x310 400460 slt R2 R3 R911 400468 bne 400488 R212 400470 addiu R10 R0 413 400478 subu R3 R3 R914 400480 addu R8 R8 R915 400488 sra R9 R9 0x116 400490 slt R2 R3 R917 400498 bne 4004b8 R218 4004a0 ori R10 R10 219 4004a8 subu R3 R3 R920 4004b0 addu R8 R8 R921 4004b8 sra R9 R9 0x122 4004c0 slt R2 R3 R923 4004c8 bne 4004e0 R224 4004d0 ori R10 R10 125 4004d8 addu R8 R8 R9

19,20

13: subu

16: slt

17:bne

21: sra

22: slt

15: sra

23:bne

19

3, 7

3, 7

253, 7, 19

R3

R3

R3

R9

R9

R3

R24004e0

4004b8

0x1

R90x1

R9

1 5 6 7 84 11 15

17 16212223242526

... ... ...

...

Control Flow Graph

DFG nodes receive their input from a single source CDFG nodes can have multiple sources The correct source is selected at run time according to the results of branches

Page 16: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 16/38

Oct 16, 2007

Basic Requirements

Capability of selective receiving of inputs from both accelerator primary inputs and output of other instructions (FUs) for each node

Capability of selecting the valid outputs from several outputs generated by accelerator according to conditions made by branch instructions

Accelerator should be equipped by control path to provide with the correct selection of inputs and outputs for each FU and entire accelerator

Page 17: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 17/38

Oct 16, 2007

CDFG Generation-Example (1/3)

BB1 BB2 BB3

BB4

2 5 beq 7 8bgez3 9

18 17 16 14151920bne

bne10 11 1210

…………….30

load load

Page 18: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 18/38

Oct 16, 2007

Example

2 5 beq 7 8bgez3 10

1920bne

bne11 120

exit4

exit3

exit1

exit2

2 5 beq 7 8bgez3 9

18 17 16 14151920bne

bne10 11 1210

BB1BB2 BB3

BB4

…………….30

CDFG Generation-Example (2/3)

Page 19: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 19/38

Oct 16, 2007

CDFG Generation- Example (3/3)

inst. # address inst. operands (dest, src1, src2)

1 400418 lw R23 100 R22 400420 addiu R4 R4 23 400428 subu R3 R2 R114 400430 bgez 400440 R35 400438 addiu R13 R0 86 400440 beq 400468 R137 400448 subu R3 R0 R38 400450 addu R10 R0 R09 400458 lw R8 R9 0x310 400460 slt R2 R3 R9

13 400478 bne 4004a8 R214 400480 addiu R10 R0 415 400488 subu R3 R3 R916 400490 addu R8 R8 R917 400498 sra R9 R9 0x118 4004a0 slt R2 R3 R9

21 4004b8 bne 400410 R2

19 4004a8 ori R10 R10 220 4004b0 subu R3 R3 R9

22 4004c0 slt R2 R3 R9

11 40046812 400470

0 400410 addu R13 R0 R0

inst. # address inst. operands (dest, src1, src2)

1 400410 lw R23 100 R2

2 400420 addiu R4 R4 23 400428 subu R3 R2 R114 400430 bgez 400440 R35 400438 addiu R13 R0 86 400440 beq 400468 R137 400448 subu R3 R0 R38 400450 addu R10 R0 R0

9 400460 lw R8 R9 0x310 400458 slt R2 R3 R9

13 400478 bne 4004a8 R214 400480 addiu R10 R0 415 400488 subu R3 R3 R916 400490 addu R8 R8 R917 400498 sra R9 R9 0x118 4004a0 slt R2 R3 R9

21 4004b8 bne 400410 R2

19 4004a8 ori R10 R10 220 4004b0 subu R3 R3 R9

22 4004c0 slt R2 R3 R9

11 40046812 400470

0 400418 addu R13 R0 R0

Code before generating MECI Code after generating MECI

ori R10 R10 1addu R8 R8 R9

ori R10 R10 1addu R8 R8 R9

2 5 beq 7 8bgez3 10

1920bne

bne11 120exit4

exit3

exit1

exit2

Page 20: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 20/38

Oct 16, 2007

Outline

Introduction Motivations for supporting Control Instructions Basic Requirements for Supporting Conditional Execution Algorithms for CDFG Temporal Partitioning Case study: Extending an Extensible Processor to Support

Conditional Execution Experimental Results Conclusion

Page 21: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 21/38

Oct 16, 2007

CDFG Temporal Partitioning Algorithms

CDFGs extracted from various applications have different sizes for some of the CDFGs

• the whole of CDFG can not be mapped on the accelerator due to the resource limitations of the accelerator

Resource constraints the number of inputs, outputs, logics and routing resource

constraints

Page 22: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 22/38

Oct 16, 2007

CDFG Temporal Partitioning Algorithms

Temporal partitioning Temporally divides a DFG into a number of smaller partitions each partition can fit into the target hardware dependencies among the nodes are not violated

Temporal Partitioning algorithms for CDFGs Not-Taken Path Traversing alg. (NTPT) Frequency based TP alg.

Page 23: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 23/38

Oct 16, 2007

TP Based on Not-Taken Paths (NTPT)

Adds instructions from not-taken path of a control instruction to a partition until violating the target hardware architectural constraints or reaching to a terminator control instruction

Generating a new partition is started with the branch instructions which at least one of their taken or not-taken instructions has not been located in the current partition

Terminator instruction an instruction which changes execution direction of the program an exit point for a CDFG

Page 24: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 24/38

Oct 16, 2007

TP Based on Not-Taken Paths (NTPT)

2 5 beq 7 8bgez3 9

18 17 161920bne

bne10 11 1210

0 1 2 53 bgez beq 7 8 9 10beq 7 8 9 10 11 12 bne 151414 15

Page 25: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 25/38

Oct 16, 2007

Frequency-Based TP Algorithm

NTPT algorithm instructions were selected only from not-taken paths of branches

Frequency based algorithm execution frequency of taken and not-taken paths is an effective factor

for in adding them to the current partition a frequency threshold is defined to determine that whether instruction is

critical or not one of the taken or not-taken paths or both of them can be critical

Page 26: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 26/38

Oct 16, 2007

TP Based on Not-Taken Paths (NTPT)

2 5 beq 7 8bgez3 9

18 17 161920bne

bne10 11 12100 1 2 53 bgez beq 7 8 9 10 11 12 bne 151414 15

50%

50%

90%

10%

11

80%

20%

12

Page 27: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 27/38

Oct 16, 2007

Evaluating Proposed Algorithms

Average Partition No. Efficiency

0

1

2

3

4

5

6

Av

era

ge

Pa

rtit

ion

No

.

NTPT Alg. Frequency Based Partitioning Alg.

012345678

Eff

icie

nc

y

NTPT Alg. Frequency Based Partitioning Alg.

Page 28: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 28/38

Oct 16, 2007

Outline

Introduction Motivations for supporting Control Instructions Basic Requirements for Supporting Conditional Execution Algorithms for CDFG Temporal Partitioning Case study: Extending an Extensible Processor to Support

Conditional Execution Experimental Results Conclusion

Page 29: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 29/38

Oct 16, 2007

Case study: Extending an Extensible Processor to Support Conditional Execution

AMBER an extensible processor targeted

for embedded systems Goal: accelerating application

execution base processor a general

RISC processor Including: profiler, sequencer and

a coarse grain RFU RFU: an accelerator without

conditional execution support

Register File

ID/EXE Reg

RFU

Multi-ContextMemory

Cache

Functional Unit

MUX SequencerSequencer

Tabel

EXE/MEM Reg Profiler ProfilerTable

GPP Augmented HW

OnlineTraining

Page 30: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 30/38

Oct 16, 2007

Extending AMBER’s RFU to Support Conditional Execution (CRFU)

The selectors of muxes are used for choosing data for FU inputs controlled by the configuration bits

The outputs of FUs are only applied to the Selector-Muxes in the lower-level rows

Connections from input ports toinputs of the rows

RFU Input Ports

RFU Output Ports

Outputs of 2nd row to theinputs of 4th and 5th rows

Row1

Row5

FU FU FU FU

Outputs of 1st row to theinputs of 3rd, 4th and 5th rows

Page 31: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 31/38

Oct 16, 2007

CRFU

FU1 FU2

FU3 FU4

ConfigurationBits

ConfigurationBits

DataSelectionMux

Branch result from FU1

Branch result from FU2

Configuration Bits

Branch result from FU1

Branch result from FU2

Configuration Bits

Configuration Bits

Configuration Bits

Selection using branch results

Inputs of Selector-Mux (one-bit width) originate from the FUs executing branch instructions

Data Selector-MuxSelection using configuration bits

Selection using configuration bits

Page 32: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 32/38

Oct 16, 2007

Outline

Introduction Motivations for supporting Control Instructions Basic Requirements for Supporting Conditional Execution Algorithms for CDFG Temporal Partitioning Case study: Extending an Extensible Processor to Support

Conditional Execution Experimental Results Conclusion

Page 33: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 33/38

Oct 16, 2007

Experimental Results

Synthesis result of the extended RFU Synopsys tools and Hitachi 0.18μm area 2.1 mm2.

CDFG configuration bit-stream 615 bits Control signals: 375 bits

Executing applications on the Simplescalar as ISS Profiling and extracting hot instruction sequence

NTPT temporal partitioning algorithm for generating mappable CDFGs

Page 34: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 34/38

Oct 16, 2007

Experimental Results-Speedup

Speedup Comparison

11.21.41.61.8

22.22.42.62.8

3

adpcm

(enc)

adpcm

(dec

)

blowfis

h (enc)

blowfis

h (dec

) cr

c

dijkst

ra

Avera

ge

Sp

eed

up

DFG

CDFG

Average Speedup: 2.1 (CDFG) versus 1.1 (DFG)

Page 35: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 35/38

Oct 16, 2007

Experimental Results-Energy Saving

Energy Saving

01020304050607080

adpcm

(enc)

adpcm

(dec

)

blowfis

h(enc)

blowfis

h(dec

)cr

c

dijkst

ra

Avera

ge

To

tal E

ner

gy

Red

uct

ion

(%

) C

DF

G v

s D

FG

DFGCDFG

Average Energy Saving: 43% (CDFG) versus 21% (DFG)

Page 36: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 36/38

Oct 16, 2007

Experimental Results-Comparison

Page 37: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 37/38

Oct 16, 2007

Outline

Introduction Motivations for supporting Control Instructions Basic Requirements for Supporting Conditional Execution Algorithms for CDFG Temporal Partitioning Case study: Extending an Extensible Processor to Support

Conditional Execution Experimental Results Conclusion

Page 38: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Kyushu University ISOCC 2007@Seoul, Korea 38/38

Oct 16, 2007

Conclusion

Handling branch instruction & extending DFGs to CDFGs Basic requirements for an accelerator featuring conditional

execution CDFG temporal partitioning algorithms

NTPT traverse not-taken path of the branch instructions Frequency-based temporal partitioning algorithm

Extending AMBER’s RFU to support conditional execution Increasing speedup Reducing energy consumption

Page 39: High Performance, Low Power Reconfigurable Processor for Embedded Systems

Thank You!