19
Application- Specific Customization of Soft Processor Microarchitecture Peter Yiannacouras J. Gregory Steffan Jonathan Rose University of Toronto Edward S. Rogers Sr. Department of Electrical and Computer Engineering

Application-Specific Customization of Soft Processor Microarchitecture

  • Upload
    zagiri

  • View
    49

  • Download
    0

Embed Size (px)

DESCRIPTION

Application-Specific Customization of Soft Processor Microarchitecture. Peter Yiannacouras J. Gregory Steffan Jonathan Rose University of Toronto Edward S. Rogers Sr. Department of Electrical and Computer Engineering. Processors and FPGA Systems. We seek improvement through customization. - PowerPoint PPT Presentation

Citation preview

Page 1: Application-Specific Customization of Soft Processor Microarchitecture

Application-Specific Customization of Soft Processor MicroarchitecturePeter YiannacourasJ. Gregory Steffan Jonathan Rose

University of Toronto

Edward S. Rogers Sr. Department of Electrical and Computer Engineering

Page 2: Application-Specific Customization of Soft Processor Microarchitecture

2

Processors and FPGA Systems

We seek improvement through customization

Processors lie at the “heart” of FPGA systems

MemoryInterface

UART

Custom Logic

Ethernet

Performs coordination and even computation Better processors => less hardware to design

Soft Processor

Page 3: Application-Specific Customization of Soft Processor Microarchitecture

3

Motivating Application-Specific Customizations of Soft Processors

1. FPGA Configurability Can consider unlimited processor variants

2. A soft processor might be used to run either:a) A single applicationb) A single class of applicationsc) Many applications, but can be reconfigured

3. Applications differ in architectural requirements Can specialize architecture for each application

We want to evaluate effectiveness of specialization

Page 4: Application-Specific Customization of Soft Processor Microarchitecture

4

Research Goals

To investigate1. The potential for “Application-tuning”

Tune processor microarchitecture to favour an application Preserve general purpose functionality

2. “Instruction-set Subsetting” Sacrifice general purpose functionality Eliminate hardware not required by application

3. Combination of both methods

Measure efficiency gained through real implementations

Page 5: Application-Specific Customization of Soft Processor Microarchitecture

5

SPREE

SPREE System (Soft Processor Rapid Exploration Environment)

RTL

ISA Datapath ■ Input: Processor description■ Made of hand-coded components

1. Verify ISA against datapath

2. Datapath Instantiation3. Control Generation

■ Multi-cycle/variable-cycle FUs■ Multiplexer select signals■ Interlocking■ Branch handling

■ SPREE System

■ Output: Synthesizable Verilog

ProcessorDescription

Page 6: Application-Specific Customization of Soft Processor Microarchitecture

6

Back-End Infrastructure

RTL

2. Resource Usage3. Clock Frequency4. Power

1. Cycle Count

Quartus II 4.2CAD Software

ModelsimRTL Simulator

Benchmarks(MiBench,

Dhrystone 2.1,RATES,XiRisc)

Stratix 1S40C5

We can measure area/performance/energy accurately

Page 7: Application-Specific Customization of Soft Processor Microarchitecture

7

Comparison to Altera’s Nios II

Has three variations:Nios II/e – unpipelined, no HW multiplierNios II/s – 5-stage, with HW multiplierNios II/f – 6-stage, dynamic branch prediction

Page 8: Application-Specific Customization of Soft Processor Microarchitecture

8

Architectural Parameters Used in SPREE

We focus on core microarchitecture

Multiplication Support Hardware FU or software routine

Shifter implementation Flipflops, multiplier, or LUTs

Pipelining Depth

(2-7 stages)

Organization Forwarding

Page 9: Application-Specific Customization of Soft Processor Microarchitecture

9

300

500

700

900

1100

1300

1500

1700

1900

500 700 900 1100 1300 1500 1700 1900

Area (Equivalent LEs)

Ge

om

ea

n W

all

Clo

ck

Tim

e (

us

) SPREE Processors

Altera Nios II/e

Altera Nios II/s

Altera Nios II/f

SPREE vs Nios II

smaller

faster

-3-stage pipe-HW multiply-Multiply-based shifter

Page 10: Application-Specific Customization of Soft Processor Microarchitecture

10

Exploration of Soft Processor Architectural Customizations

1. Architectural-tuning

2. Instruction-set subsetting

3. Combination (Arch-tuning + Subsetting)

Page 11: Application-Specific Customization of Soft Processor Microarchitecture

11

1. Architectural Tuning Experiment

Vary the same parameters Multiplication Support Shifter implementation Pipelining

Determine1. Best overall (general purpose) processor

2. Best per application (application-tuned)

Metric: Performance per Area (MIPS/LE) Basically inverse of Area-Delay product

Page 12: Application-Specific Customization of Soft Processor Microarchitecture

12

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08b

ub

ble

_so

rt

crc

de

s fft fir

qu

an

t

iqu

an

t

turb

o

vlc

bitc

nts

CR

C3

2

qso

rt

sha

stri

ng

sea

rch

FF

T

dijk

stra

pa

tric

ia go

l

dct

dh

ry

OV

ER

AL

L

Benchmark

Pe

rfo

rma

nc

e p

er

Un

it A

rea

(M

IPS

/LE

)

Performance per Area of All Processors

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08b

ub

ble

_so

rt

crc

de

s fft fir

qu

an

t

iqu

an

t

turb

o

vlc

bitc

nts

CR

C3

2

qso

rt

sha

stri

ng

sea

rch

FF

T

dijk

stra

pa

tric

ia go

l

dct

dh

ry

OV

ER

AL

L

Benchmark

Pe

rfo

rma

nc

e p

er

Un

it A

rea

(M

IPS

/LE

)

General Purpose

serialshift_norise

serialshift_lowrise

serialshift_judicialrise

serialshift_highrise

mulshift_norise

mulshift_minrise

mulshift_highrise

barrelshift_norise

barrelshift_minrise

barrelshift_highrise

serialshiftdatamem_norisepipe3_serialshift

pipe3_mulshift

pipe3_barrelshift

pipe4_serialshift

pipe4_mulshift

pipe4_barrelshift

pipe4_serialshift_2

pipe4_mulshift_stall_2

pipe4_barrelshift_2

pipe5_serialshift

pipe5_mulshift

pipe5_barrelshift

pipe5_serialshift_load

pipe5_mulshift_load

pipe5_mulshift_stall_loadpipe5_barrelshift_load

pipe7_serialshift

pipe7_mulshift

pipe7_barrelshift

pipe7_barrelshift_2

barrelshift_norise.nomul

barrelshift_minrise.nomulbarrelshift_highrise.nomulserialshift_norise.nomul

serialshift_lowrise.nomulserialshift_judicialrise.nomulserialshift_highrise.nomulmulshift_norise.nomul

mulshift_minrise.nomul

mulshift_highrise.nomul

serialshiftdatamem_norise.nomulpipe3_serialshift.nomul

pipe3_mulshift.nomul

pipe3_mulshift_stall.nomulpipe3_barrelshift.nomul

pipe4_serialshift.nomul

pipe4_mulshift.nomul

pipe4_barrelshift.nomul

pipe5_serialshift.nomul

pipe5_mulshift.nomul

pipe5_barrelshift.nomul

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08b

ub

ble

_so

rt

crc

de

s fft fir

qu

an

t

iqu

an

t

turb

o

vlc

bitc

nts

CR

C3

2

qso

rt

sha

stri

ng

sea

rch

FF

T

dijk

stra

pa

tric

ia go

l

dct

dh

ry

OV

ER

AL

L

Benchmark

Pe

rfo

rma

nc

e p

er

Un

it A

rea

(M

IPS

/LE

)

General Purpose

Application-tuned

serialshift_norise

serialshift_lowrise

serialshift_judicialrise

serialshift_highrise

mulshift_norise

mulshift_minrise

mulshift_highrise

barrelshift_norise

barrelshift_minrise

barrelshift_highrise

serialshiftdatamem_norisepipe3_serialshift

pipe3_mulshift

pipe3_barrelshift

pipe4_serialshift

pipe4_mulshift

pipe4_barrelshift

pipe4_serialshift_2

pipe4_mulshift_stall_2pipe4_barrelshift_2

pipe5_serialshift

pipe5_mulshift

pipe5_barrelshift

pipe5_serialshift_load

pipe5_mulshift_load

pipe5_mulshift_stall_loadpipe5_barrelshift_loadpipe7_serialshift

pipe7_mulshift

pipe7_barrelshift

pipe7_barrelshift_2

barrelshift_norise.nomulbarrelshift_minrise.nomulbarrelshift_highrise.nomulserialshift_norise.nomulserialshift_lowrise.nomulserialshift_judicialrise.nomulserialshift_highrise.nomulmulshift_norise.nomul

mulshift_minrise.nomulmulshift_highrise.nomulserialshiftdatamem_norise.nomulpipe3_serialshift.nomulpipe3_mulshift.nomul

pipe3_mulshift_stall.nomulpipe3_barrelshift.nomulpipe4_serialshift.nomulpipe4_mulshift.nomul

pipe4_barrelshift.nomulpipe5_serialshift.nomulpipe5_mulshift.nomul

pipe5_barrelshift.nomul

14.1%

32%

Page 13: Application-Specific Customization of Soft Processor Microarchitecture

13

2. Instruction-set Subsetting

SPREE automatically removesUnused connectionsUnused components

Reduce processor by reducing the ISACan create application-specific processor

Eliminate unused parts of the ISA

Page 14: Application-Specific Customization of Soft Processor Microarchitecture

14

Instruction-set Usage of Benchmarks

Applications do not use complete ISA

0%

25%

50%

75%

100%

bu

bb

le_

so

rt

crc

de

s fft

fir

qu

an

t

iqu

an

t

turb

o

vlc

bit

cn

ts

CR

C3

2

qs

ort

sh

a

str

ing

se

arc

h

FF

T

dijk

str

a

pa

tric

ia

go

l

dc

t

dh

ry

AV

ER

AG

E

Pe

rce

nt

of

ISA

Us

ed

Strong potential for hardware reduction

Page 15: Application-Specific Customization of Soft Processor Microarchitecture

15

0.00

0.10

0.20

0.30

0.40

0.50

0.60

0.70

0.80

0.90

1.00

bubb

le_s

ort crc

des fft fir

quan

t

iqua

nt

turb

o vlc

bitc

nts

CR

C32

qsor

t

sha

strin

gsea

rch

FF

T_M

I

dijk

stra

patr

icia go

l

dct

dhry

AV

ER

AG

E

Benchmark

Fra

ctio

n o

f Are

aF

ract

ion

of

Are

aArea Reduction from Subsetting

Area reduced by 60% in some, 23% on average

23%

Similar reductions for energy, small impact on performance

Page 16: Application-Specific Customization of Soft Processor Microarchitecture

16

3. Combining Application Tuning and Instruction-set Subsetting

Subsetting is effective on its own Can apply subsetting on top of tuning

Compare different customization methods1. Tuning

2. Subsetting

3. Tuning + Subsetting

Page 17: Application-Specific Customization of Soft Processor Microarchitecture

17

Combining Application Tuning and Instruction-set Subsetting

0

10

20

30

40

50

60

General purpose Application-tuned General purpose +subsetted

Application-tuned +subsetted

Pe

rfo

rma

nc

e P

er

Are

a (

MIP

S/1

00

0L

Es

)

14% 16%25%

Tuning reduces the waste that subsetting eliminates

Page 18: Application-Specific Customization of Soft Processor Microarchitecture

18

Summary of Presented Architectural Conclusions

Application tuning 14% average efficiency gainWill increase with more architectural axes

Instruction-set SubsettingUp to 60% area & energy savings16% average efficiency gain

Combined Tuning & Subsetting25% average efficiency gain

Page 19: Application-Specific Customization of Soft Processor Microarchitecture

19

Future Work

Consider other promising architectural axes Branch prediction, aggressive forwarding ISA changes Datapaths (eg. VLIW) Caches and memory hierarchy

Compiler assistance Can improve tuning & subsetting