30
© 2005 Stretch Inc. All rights reserved. A Software Configurable Processor Architecture Ricardo E. Gonzalez

Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

  • Upload
    buimien

  • View
    250

  • Download
    19

Embed Size (px)

Citation preview

Page 1: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved.

A Software Configurable Processor Architecture

Ricardo E. Gonzalez

Page 2: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 2

Embedded System Design Dilemma

COMPUTEPERFORMANCE

SYSTEM TIME-TO-MARKET

GPP

FPGA

ASICASSP

DSPMEDIA

SECURITYETC.

SW HW

?

Page 3: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 3

Solving Compute Intensive Problems

Processor with embedded programmable logic

– Lots of compute resources

Utilize fabric by extending the ISA– Design your own FU– Add processor state

Simple programming model– Issue instructions to perform

computation

Tailored for high-throughput applications

– Compute intensive

Page 4: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 4

Compute Intensive Markets & Applications

• Sonar/Radar• Biometrics

• CAT Scan• Ultrasound

• Printers• Scanners

• Encryption/Decryption• Network Security

• Broadcast Equipment• Audio Studio

Page 5: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 5

877868

629379

275181

164122

6945

4037.7

28.319.5

15.313.61312.711.411.210.210.29.58.78.37.97.16.86.86.35.85.15.14.84.74.63.43.332.9

0 100 200 300 400 500 600 700 800 900 1000

Stretch S5000 - 300 MHz - C OPTIntrins ity Fas tMATH - 2 GHz - ASM OPT

TI C6416 - 720 MHz - ASM OPTTI C6416 - 720 MHz - C OPT

BOPS Manta v2.0 - - MHz - ASM OPTBOPS Manta v2.0 - - MHz - C OPT

Motorola MPC7447 - 1.3GHz - Altivec OPTMotorola MPC7455 - 1GHz - Altivec OPT

TI C6203 - 300 MHz - ASM OPTTI C6203 - 300 MHz - C OPT

LSI 402ZX - 200 MHz - ASM OPTMotorola MPC7447 - 1.3GHz

Motorola MPC7455-1GHzTI TMS320C6416-720IBM 440GX - 667 MHz

Intrins ity Fas tMATH 2GHzPMC Sierra RM7000C - 625MHz

Motorola PowerPC 7400-500IBM 440GP-500

IBM PowerPC 750CX - 500MIPS 20Kc 600 MHz

Motorola MPC755-400IBM 440GP-500

AMD K6-IIIE+ 550/ACRMotorola MPC8245-400

AMD K6-2E+ 500/ACRIBM 405GPr - 400 MHz

NEC VR5500 - 400TI TMS320C6203-300AMD K6-2E 400/ACR

Motorola PowerPC 603e - 300AMD Au1100 - 396 MHz

NEC VR5500 - 300Infineon Carm el 10xx-170

IBM 405GPr - 266 MHzStretch S5000 - 300 MHz

NEC VR5000 - 250LSI402ZX - 200

IDT79RC64575-250Toshiba TMPR4927ATB-200

Proc

esso

r

EEMBC Telemark ScoreEEMBC Telemark

OPTIMIZED

OUT-OF-BOX

Embedded Microprocessor Benchmark ConsortiumEEMBC creates, certifies and publishes performance benchmark results

Embedded Microprocessor Benchmark ConsortiumEEMBC creates, certifies and publishes performance benchmark results

S5000 Performance

With ISEF Acceleration

Page 6: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved.

Application Acceleration Process

Page 7: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 7

RGB → YCbCr

Color Conversion Function:

Void RGB2YCBCR (WR A, WR *B) {char Y[4], Cb[4], Cr[4];char R[4], G[4], B[4];

R[0] = A(23,16); G[0] = A(15,8); B[0] = A(7,0); /* and so on. . . */

for (i = 0; i < 4; i++) {Y[i] = (77*R[i] + 150*G[i] + 29*B[i]) >> 8;Cb[i] = (32768 - 43*R[i] - 85*G[i] + (B[i] << 7)) >> 9;Cr[i] = (32768 + (R[i] << 7) - 107*G[i] - 21*B[i]) >> 9;

}*B = (Y[3],Cr[3]+Cr[2],Y[2],Cb[3]+Cb[2],Y[1],Cr[1]+Cr[0],Y[0],Cb[1]+Cb[0]);

}

Program Loop:for (…) {

WRGET0(&A, 12); /* Load 12 bytes (4 RGB pixels) */RGB2YCBCR(A, &B); /* Convert 4 pixels */WRPUT0(B, 8); /* Store 8 bytes (4 YCbCr pixels */

}

SE_FUNC /* Tells Stretch C-Compiler to reduce this function to an instruction */

Page 8: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 8

Programming Flow

Page 9: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 9

Development & Debug

Debug

Programmers develop & debug using familiar tools

Faster debug cycle

Port & accelerate apps within Stretch IDE

No hardware design experience required!

Page 10: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 10

Stretch S5 Development Flow

Start DoneApplication

code Compile ExecuteFast

enough?Correctoutput?

ProfileSelect

hot-spot

Create orModify

ExtensionInstructions

Modifyapplication

Y

NN

Y

C/C++Gnu-basedOptimizingcompiler

Targetsimulator

or chipGUI Debugger

Easy touse GUIOne EI does the work of

many operationsand iterations

Page 11: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved.

Under The Hood

Page 12: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 12

S5 Datapath and Key Parameters

Wide Register File32 128-bit entries

Load/store unit128-bit load/storeAuto increment/decrementImmediate, indirect, circularVariable-byte load/storeVariable-bit load/store

ISEF3 inputs and 2 outputsPipelined, interlocked32 16-bit MACs and 256 ALUsBit-sliced for arbitrary bit-widthAllow control logicAllow internal processor state

RISC ProcessorTensilica – Xtensa V32 KB I & D CacheOn-Chip Memory, MMU24 Channels of DMA, FPU

Page 13: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 13

Instruction Set Architecture

Based on Xtensa ISASingle opcode space

– Different for each applicationHardware “caches” EI subset

– # instructions not limited by hardware

Xtensa V WR loads/stores EI’s

Always present

Xtensa V WR Loads/stores EI’s

Cached set

App 1

App 2

Page 14: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 14

Matching Operating Frequency

IR

Decoder WRF

ISEF

Two isochronous clock domains– Processor runs @ higher MHZ– Programmable clock ratio (1:1→1:4)– WR acts as gateway between them– Constraints EI issue rate

Clock ratio invisible to programmer

Table-driven EI decode

ISEF clock domain

Page 15: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 15

S5 Pipeline

Standard 5-stage RISC pipelineExtension instruction may be much longer

– Fully pipelinedCompiler schedules allinstructionsHardware interlocks ensure correctness

– Instruction groups subdivided by use/def requirements

– Interlocks driven by config table in Instruction Unit

I Cache

Decode

Addr gen

D Cache

Write back

AR WR

ALU Forward

DP RAM Forward

Stage 0

Stage 1

Stage 2

Stage n-1

Write back

I

R

E

M

W

Page 16: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 16

Pipeline Schedule

Extension Instructions may have different latenciesInitiation interval of EI a function of:

– ISEF clock ratio– WR and state use/def

Optimal schedule typically has initiation interval > 1

– No performance penalty for lower clock rate!

Statically scheduled– Pipeline behavior is completely

known at compile time

U

D

D

U

U

D

D

U

LD n+2

OP n

ST n-4

LD n+3

OP n+1

ST n-3

DLD 0

DLD 1Startuptransient

Steadystate

Time

InitiationInterval

Page 17: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 17

ISEF Architecture

Array of 64-bit ALUs– May be broken at 4-bit or 1-bit boundaries– Conditional ALU operation: Y = C ? (A op1 B) : (A op2 B)– Registers to implement state, pipelining

Array of 4x8 multipliers, cascadable to 32x32– Programmable pipelining

Programmable routing fabricExecution clock divided from processor clock: 1:1, 1:2, 1:3, 1:4Up to 16 instructions per ISEF configurationMany configurations per executableReconfiguration

– User directed or on demand– ~100 µsec for complete reconfiguration

ALUArray

MultiplierArray

Routing Fabric

ISEF Block

ALUArray

MultiplierArray

Routing Fabric

ISEF Block

WR

ALUArray

MultiplierArray

Routing Fabric

ISEF Block

Page 18: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 18

Dynamic ISEF Configuration

Two ISEFsEach holds up to 16 instructionsConfigured via system bus Managed by DMAIndependently configurable

On-Demand ConfigurationAuto managed by HW and OSHandled like cache miss

Configuration PreloadingManaged by applicationHandled like instruction pre-fetch

TaskConfig

On-Demand Configuration Configuration Preloading

Page 19: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 19

ISEF Bitstream Generation

ELFC/C++

opt

map

plac

e

rout

e

retim

e

bits

tream

xt-x

cc

cc

linke

r

Page 20: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 20

C Compiler Tradeoffs

Page 21: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved.

Examples of Extension Instructions

Page 22: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 22

RGB2YCbCr: SCP Instruction Efficiency

R G B R G B R G B R G B

0 1 2 3RGB 96 bits

YCbCr 64 bits Y

Y +* *+*

>>

Cb Cr

Cr* * *+

++

>>

Cb* * *+

++

>>

0 1 2 3Y Y Cb Y Cr

*+* * * * * * * *++>> +>>

++>>

* * * * * * * * *++

+>>

++

+>>

++

+>>

* * * * * * * * *++

+>>

++

+>>

++

+>>

SCALAR PROCESSOR

4W VLIW PROCESSOR

8W SIMD PROCESSOR

SCP

Page 23: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 23

Flexibility in FIR Implementations

0011

22334455

6677

8899

101112

13

CC

-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 00 11 22 33 44 55 66 77 88 99 10 11 12 13 14 15

XX

0011223344556677889910111213

CC

-13-12 -11-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 00 11 22 33 44 55 66 77 88 99 10 11 12 13 14 15XX

01

2345

67

89

101112

13

C

-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

X

8-MACs/Cycle 16-MACs/Cycle

32-MACs/CycleS5000 Strengths

– Up to 32-MAC/Cycle– Flexibility to trade-off ISEF usage

vs. compute speed– Wide Load & Stores feed the

compute array

Page 24: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved.

Sample Applications

Page 25: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 25

H.264 Encoder Challenges

ME

DeblockingFilter

MC

Intra Predict

ForwardTransform

InverseTransform

Quantization

InverseQuantization

Compute Intensive Tasks

Page 26: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 26

H.264 Encode

Page 27: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 27

802.16d (WiMAX) Challenges

FIR Freq/TimeCorrection

Sync

SoftDemapperDeinterleaver

FECViterbi

FECR-S

De-RandomizerFFT

Compute Intensive Tasks

Page 28: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 28

802.16d Transmit (PHY)

Page 29: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 29

802.16d Receiver (PHY)

Page 30: Software Configurable Processor - Stanford Universityweb.stanford.edu/.../SoftwareConfigurableProcessor.pdf · DSP MEDIA SECURITY ETC. ... Array of 4x8 multipliers, cascadable to

© 2005 Stretch Inc. All rights reserved. 30

Summary

C/C++ Programmers can tailor the processor for their application– Continue to develop & debug in a familiar environment (IDE/gdb)

Application-specific instruction provide 10-100X speedup– Modest porting effort

C/C++ programmers can easily design high-performance systems– Enabling a new system design methodology