5.5 GFLOPS Vector Units for “Emotion Synthesis” System ULSI Engineering Laboratory, TOSHIBA Corp. A.Kunimatsu, N.Ide, T.Sato, Y.Endo, H.Murakami, T.Kamei, M.Hirano Sony Computer Entertainment Inc. M.Oka, A.Ohba, T.Yutaka, T.Okada, M.Suzuoki


Page 1: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

5.5 GFLOPS Vector Units for “Emotion Synthesis”

System ULSI Engineering Laboratory, TOSHIBA Corp.

A.Kunimatsu, N.Ide, T.Sato, Y.Endo, H.Murakami, T.Kamei, M.Hirano

Sony Computer Entertainment Inc.

M.Oka, A.Ohba, T.Yutaka, T.Okada, M.Suzuoki

Page 2: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

Overview

• Strategy of VU architecture
• VU features and instruction sets
• Examples
• Performance

Page 3: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

Strategy of VU Architecture

• Targets
– Optimized for 3D calculations
– Easy to develop software pipelined programs
– Simple hardware

• Solutions
– VLIW processor
– 4 parallel fMACs
– All operations have same pipeline depth (except DIV/SQRT/RSQRT operations)

And one more essence ...

Page 4: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

"Emotion Synthesis"

• "Emotion Synthesis"
– More realistic, more detailed and more natural motions

• We prepared two different Vector Units (VUs)
– VU0 : for flexible calculation with CPU control
– VU1 : for well-defined 3D calculations

Need huge calculation power

Page 5: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

[Figure: “Emotion Engine” block diagram. Blocks shown: CPU (128-bit processor core with I$, D$, SP-RAM and COP1 FPU); VPU0 (VU0 with 4KB Inst. MEM and 4KB Data MEM, VIF0), attached to the CPU as COP2; VPU1 (VU1 with EFU, 16KB Inst. MEM and 16KB Data MEM, VIF1); GIF to the Rendering Engine (GS LSI); IPU; 10ch DMAC; Memory Interface to External Memory; I/O Interface to Peripherals. The blocks are connected by a 128-bit System Bus.]

Page 6: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

VU

• VLIW mode
– 64-bit VLIW instruction formats
– 5 function units are available simultaneously: 4 FMAC + (FDIV, Branch, load/store, or iALU)

• Co-processor mode
– MIPS COP2 instruction : 32-bit opcode
– Two kinds of COP2 instructions :
  • 4 parallel floating operations
  • Call micro-subroutine of VLIW mode

Page 7: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

VU0 and VU1

VU0
– Main purpose : Flexible calculation with CPU control
– Co-processor mode : Available
– VLIW mode : Available
– VPU : With Inst. memory : 4KB, Data memory : 4KB, VIF (system bus I/F)

VU1
– Main purpose : Well-defined 3D calculations
– Co-processor mode : N/A
– VLIW mode : Available
– VPU : With Inst. memory : 16KB, Data memory : 16KB, VIF (system bus I/F), GIF (Graphics I/F), EFU (VU option)

Page 8: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

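[Figure: VPU1 block diagram. Inside the Vector Unit (VU), a 64-bit instruction from the 16 KByte Inst. Mem is split into a 32-bit Upper Instruction and a 32-bit Lower Instruction. The Upper Execution Unit holds four fMACs (x, y, z, w). The Lower Execution Unit holds fDIV, LSU, iALU, RAN and other units. Register files: floating registers 128bit x 32 and integer registers 16bit x 16. Also shown: 16 KByte Data Mem, the EFU (fMAC, fDIV), the VIF on the System Bus, and the GIF with Path1, Path2 and Path3 to the Rendering Engine (GS LSI). Data paths are labelled 128, 64 and 16 bits wide.]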

Page 9: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

Instruction opcode to support 2 modes

VLIW mode : 64-bit VLIW instruction opcode

Upper 32bit (bits 63-32), example MADD :
  I E M D T - - (1 bit each, last two are 0 0) | dest (4 bit) | ft reg (5 bit) | fs reg (5 bit) | fd reg (5 bit) | MADD 0010 (4 bit) | bc (2 bit)

Lower 32bit (bits 31-00), example LQI :
  Lower OP. 1000000 (7 bit) | dest (4 bit) | ft reg (5 bit) | is reg (5 bit) | LQI 01101 1111 00 (11 bit)

Co-processor mode : 32-bit MIPS COP2 instruction opcode

Example VMADD (bits 31-00) :
  COP2 010010 (6 bit) | co 1 (1 bit) | dest (4 bit) | ft reg (5 bit) | fs reg (5 bit) | fd reg (5 bit) | VMADD 0010 (4 bit) | bc (2 bit)

Example VLQI (bits 31-00) :
  COP2 010010 (6 bit) | co 1 (1 bit) | dest (4 bit) | ft reg (5 bit) | is reg (5 bit) | VLQI 01101 1111 00 (11 bit)

The fields after the first 7 bits form the Opcode Body, which is the same in both modes.

A VLIW instruction includes two COP2 instructions.
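The field layout on this slide can be made concrete with a small decoder. The following is a minimal sketch, assuming the bit positions read off the slide (7 flag bits, then dest, ft, fs, fd, a 4-bit operation code and a 2-bit bc field in the upper word). The function names are hypothetical and chosen only for illustration.

```python
def split_vliw(word64):
    """Split a 64-bit VLIW word into its upper and lower 32-bit halves."""
    upper = (word64 >> 32) & 0xFFFFFFFF
    lower = word64 & 0xFFFFFFFF
    return upper, lower


def encode_upper(flags, dest, ft, fs, fd, op, bc):
    """Pack upper-instruction fields using the widths shown on the slide."""
    return ((flags << 25) | (dest << 21) | (ft << 16) |
            (fs << 11) | (fd << 6) | (op << 2) | bc)


def decode_upper(word32):
    """Unpack a 32-bit upper instruction into named fields."""
    return {
        "flags": (word32 >> 25) & 0x7F,  # I E M D T plus two reserved bits
        "dest": (word32 >> 21) & 0xF,    # 4-bit dest field
        "ft": (word32 >> 16) & 0x1F,     # 5-bit ft register number
        "fs": (word32 >> 11) & 0x1F,     # 5-bit fs register number
        "fd": (word32 >> 6) & 0x1F,      # 5-bit fd register number
        "op": (word32 >> 2) & 0xF,       # 0b0010 is MADD on the slide
        "bc": word32 & 0x3,              # 2-bit bc field
    }
```

For example, `decode_upper(encode_upper(0, 0b1111, 3, 2, 1, 0b0010, 0))` returns the same field values that were packed, with `op` equal to the MADD code from the slide.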

Page 10: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

64-bit VLIW instruction

Upper Instructions (upper 32-bit, FMAC x 4) :
– 4 parallel floating ADD/SUB
– 4 parallel floating MUL
– 4 parallel floating MADD/MSUB
– 4 parallel floating MAX/MIN
– Outer product calculation
– Clipping detection

Lower Instructions (lower 32-bit, FDIV, load/store, iALU) :
– Floating DIV/SQRT/RSQRT
– Load/Store 128-bit data (4 FP data)
– Jump/Branch
– Elementary Function Unit
– Random Unit

Page 11: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

Registers

128-bit floating register x 32 (VF00 to VF31)
– Each register holds four 32bit fields: w (bits 127-96), z (bits 95-64), y (bits 63-32), x (bits 31-0)
– VF00 holds w = 1.00, z = 0.00, y = 0.00, x = 0.00
– VF01 to VF31 hold VFnnw, VFnnz, VFnny, VFnnx

16-bit integer register x 16 (VI00 to VI15)
– Each register holds bits 15-0
– VI00 holds 0
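The register layout can be modelled in a few lines. This is a minimal sketch, assuming that VF00 and VI00 keep the values shown on the slide and that writes to them are ignored; the slide shows the values but does not describe write behaviour. The class name `VURegisters` is hypothetical.

```python
class VURegisters:
    """Toy model of the register files: 32 x 128-bit float, 16 x 16-bit integer."""

    def __init__(self):
        # Each floating register has four fields named x, y, z, w.
        self.vf = [{"x": 0.0, "y": 0.0, "z": 0.0, "w": 0.0} for _ in range(32)]
        self.vf[0]["w"] = 1.0  # VF00 = (w, z, y, x) = (1.00, 0.00, 0.00, 0.00)
        self.vi = [0] * 16     # VI00 = 0

    def write_vf(self, n, **fields):
        """Write selected fields of VFn; VF00 is left unchanged (assumption)."""
        if n == 0:
            return
        for name, value in fields.items():
            if name not in self.vf[n]:
                raise KeyError(name)
            self.vf[n][name] = float(value)

    def write_vi(self, n, value):
        """Write VIn, truncated to 16 bits; VI00 is left unchanged (assumption)."""
        if n == 0:
            return
        self.vi[n] = value & 0xFFFF
```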

Page 12: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

4 Parallel FMADD

[Figure: the x, y, z and w fields of Source register 1 and Source register 2 feed fMACx, fMACy, fMACz and fMACw. Each fMAC multiplies its pair of fields and adds the product to the matching accumulator field ACCx, ACCy, ACCz or ACCw.]
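The data flow in this figure amounts to a field-by-field multiply-add. The following is a minimal sketch, assuming registers are represented as tuples in (x, y, z, w) order. The function name `fmadd` is a hypothetical label, not an instruction mnemonic from the slides.

```python
def fmadd(acc, src1, src2):
    """Return acc + src1 * src2 computed independently for x, y, z and w."""
    return tuple(a + s1 * s2 for a, s1, s2 in zip(acc, src1, src2))
```

Each of the four lanes corresponds to one fMAC, so the four products and sums do not depend on each other.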

Page 13: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

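4 Parallel FMADD with Broadcast

[Figure: as in 4 Parallel FMADD, but a single field of one source register is broadcast to all four fMACs (fMACx, fMACy, fMACz, fMACw). Each fMAC multiplies it by its own field of the other source register and adds the product to ACCx, ACCy, ACCz or ACCw.]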
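The broadcast variant differs only in where the second operand comes from. This is a minimal sketch, assuming tuples in (x, y, z, w) order and assuming the broadcast field is taken from the second source; the helper name `fmadd_bc` and its `bc` argument are hypothetical.

```python
FIELDS = ("x", "y", "z", "w")


def fmadd_bc(acc, src1, src2, bc):
    """Multiply every field of src1 by one field of src2, then add to acc.

    bc names the field of src2 that is broadcast to all four lanes.
    """
    scalar = src2[FIELDS.index(bc)]
    return tuple(a + s * scalar for a, s in zip(acc, src1))
```

Broadcasting one field lets a vector be scaled by a scalar without first copying that scalar into all four fields of a register.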

Page 14: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

VLIW mode Pipeline Stages

• 7 cycle Latency : DIV / SQRT
  DIV    F D 1 2 3 4 5 6 7
  SQRT   F D 1 2 3 4 5 6 7

• 13 cycle Latency : RSQRT
  RSQRT  F D 1 2 3 4 5 6 7 8 9 10 11 12 13

• 3 cycle Latency : Basic Operations (ADD, SUB, MUL, MADD, MSUB, load/store, iALU operations, etc.)
  F D X1 X2 X3 W

Page 15: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

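Geometry Transformation

\[
\begin{pmatrix}
m_{00} & m_{01} & m_{02} & m_{03} \\
m_{10} & m_{11} & m_{12} & m_{13} \\
m_{20} & m_{21} & m_{22} & m_{23} \\
m_{30} & m_{31} & m_{32} & m_{33}
\end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ w \end{pmatrix}
=
\begin{pmatrix}
m_{00}x + m_{01}y + m_{02}z + m_{03}w \\
m_{10}x + m_{11}y + m_{12}z + m_{13}w \\
m_{20}x + m_{21}y + m_{22}z + m_{23}w \\
m_{30}x + m_{31}y + m_{32}z + m_{33}w
\end{pmatrix}
\]

The four rows of the result are computed by FMACx, FMACy, FMACz and FMACw respectively.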

Page 16: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

Geometry Transformation Pipeline

Upper instructions, one issued per cycle:

  4 x MULA  (x broadcast)   F D X1 X2 X3 W
  4 x MADDA (y broadcast)     F D X1 X2 X3 W
  4 x MADDA (z broadcast)       F D X1 X2 X3 W
  4 x MADD  (w broadcast)         F D X1 X2 X3 W

Terms accumulated in the lanes FMACx, FMACy, FMACz, FMACw:

  MULA  :   (m00 x, m10 x, m20 x, m30 x)
  MADDA : + (m01 y, m11 y, m21 y, m31 y)
  MADDA : + (m02 z, m12 z, m22 z, m32 z)
  MADD  : + (m03 w, m13 w, m23 w, m33 w)

F : Fetch, D : Decode, X1 - 3 : eXecute, W : Write back
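The four-instruction sequence on this slide can be traced step by step. The following is a minimal sketch, assuming the matrix is stored as four rows of four numbers and the vector as (x, y, z, w). The comments map each line to the instruction named on the slide; the Python itself is illustrative and not VU code.

```python
def transform(m, v):
    """Compute m * v with one accumulate step per broadcast field."""
    x, y, z, w = v
    # Column c of the matrix holds (m0c, m1c, m2c, m3c), one entry per lane.
    cols = [tuple(m[r][c] for r in range(4)) for c in range(4)]
    acc = tuple(c * x for c in cols[0])                    # 4 x MULA, x broadcast
    acc = tuple(a + c * y for a, c in zip(acc, cols[1]))   # 4 x MADDA, y broadcast
    acc = tuple(a + c * z for a, c in zip(acc, cols[2]))   # 4 x MADDA, z broadcast
    return tuple(a + c * w for a, c in zip(acc, cols[3]))  # 4 x MADD, w broadcast
```

With a translation matrix whose last column is (5, 6, 7, 1), the point (1, 2, 3, 1) maps to (6, 8, 10, 1).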

Page 17: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

Perspective Transformation

[Figure: Perspective Transformation Software Pipelining. Four overlapped iterations are shown (LOOP 1, LOOP 2, LOOP 3, LOOP 4). In each iteration the Upper side issues 4 x MULA, 4 x MADDA, 4 x MADDA and 4 x MADD (the Geometry Transformation), followed by 4 x MUL and others, each passing through F D X1 X2 X3 W. The Lower side issues DIV, which passes through F D d1 d2 d3 d4 d5 d6 d7.]

• 7 Cycles : DIV repeat time
• Another Lower inst.: load/store, loop control
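The instruction mix on this slide (geometry transformation, one DIV, then a 4-wide MUL) suggests the following data flow. This is a minimal sketch, assuming the DIV computes the reciprocal of the transformed w field and the MUL scales the transformed vector by it. The slide names the instructions but not their operands, so that pairing is an assumption.

```python
def perspective(m, v):
    """Transform v by the 4x4 matrix m, then scale by 1 / w (assumed operand)."""
    x, y, z, w = v
    # Geometry transformation: MULA, MADDA, MADDA, MADD on the upper side.
    t = tuple(r[0] * x + r[1] * y + r[2] * z + r[3] * w for r in m)
    q = 1.0 / t[3]                  # DIV on the lower side
    return tuple(c * q for c in t)  # 4 x MUL on the upper side
```

Because the DIV runs on the lower side, it can overlap with the upper-side multiply-adds of neighbouring iterations, which is what the software-pipelined loops on the slide depict.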

Page 18: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

Vector Normalization

  MUL     \( x^2 \)                                F D X1 X2 X3 W
  MADD    \( x^2 + y^2 \)                          F D X1 X2 X3 W
  MADD    \( x^2 + y^2 + z^2 \)                    F D X1 X2 X3 W
  RSQRT   \( \frac{1}{\sqrt{x^2 + y^2 + z^2}} \)   F D 1 2 3 4 5 6 7 8 9 10 11 12 13
  3 x MUL                                          F D X1 X2 X3 W

\[
\left( \frac{x}{\sqrt{x^2 + y^2 + z^2}},\; \frac{y}{\sqrt{x^2 + y^2 + z^2}},\; \frac{z}{\sqrt{x^2 + y^2 + z^2}} \right)
\]

• 25 cycles for normalization
• 13 cycles for repeat (software pipelining)
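The normalization steps can be followed one instruction at a time. The following is a minimal sketch in Python; the comments name the instruction from the slide that each line stands for, and the function name `normalize` is a hypothetical label.

```python
import math


def normalize(v):
    """Return v scaled to unit length using the slide's instruction sequence."""
    x, y, z = v
    s = x * x                     # MUL   : x^2
    s = s + y * y                 # MADD  : x^2 + y^2
    s = s + z * z                 # MADD  : x^2 + y^2 + z^2
    r = 1.0 / math.sqrt(s)        # RSQRT : 1 / sqrt(x^2 + y^2 + z^2)
    return (x * r, y * r, z * r)  # 3 x MUL
```

The sketch does not guard against a zero-length input, which would raise `ZeroDivisionError`.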

Page 19: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

Co-processor Mode Pipeline stages

• Basic operations : 10 stages
  MIPS core pipeline   I Q R A D W
  VU pipeline          F D X1 X2 X3 W

• Call VLIW processor mode programs : 11 + n stages
  MIPS core pipeline   I Q R A D W
  VU VLIW pipelines    F D X1 X2 X3 W, repeated for n steps

Page 20: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

Example : Jelly

• Dynamic simulation

Spring model

Page 21: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

Performance

[Figure: bar chart for the Emotion Engine (300 MHz). Vertical axis: M vectors/sec, 0 to 160. Categories: Geometry Trans., Perspective Trans., Distance, 1/Distance. Bar values in the order extracted: 150, 66, 66, 46.]
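The chart figures can be related to the clock rate with simple arithmetic. The following is a minimal sketch that converts between M vectors/sec and cycles per vector at 300 MHz. The `units` parameter is an assumption: the slides do not state how many vector units contribute to each bar, so it is left as an input.

```python
CLOCK_MHZ = 300  # Emotion Engine clock from the slide


def cycles_per_vector(rate_mvps, clock_mhz=CLOCK_MHZ):
    """Average clock cycles spent per vector at a given rate in M vectors/sec."""
    return clock_mhz / rate_mvps


def m_vectors_per_sec(repeat_cycles, units=1, clock_mhz=CLOCK_MHZ):
    """Rate in M vectors/sec when each unit finishes a vector every repeat_cycles."""
    return units * clock_mhz / repeat_cycles
```

For instance, `cycles_per_vector(150)` gives 2.0. With the 13-cycle repeat time quoted for normalization, `m_vectors_per_sec(13, units=2)` gives about 46.2, and four instructions per vector on two units gives `m_vectors_per_sec(4, units=2)` = 150.0. These match two of the chart values, but reading them as two units working in parallel is an inference, not something the slides state.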

Page 22: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

EE Micrograph

Die size : 15.02mm x 15.04mm
Frequency : 300MHz
Transistors : 13.5M
Power : 18 Watts
Design Rule : 0.25um
Gate Length : 0.18um

Page 23: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

VU Micrograph

Page 24: 5.5 GFLOPS Vector Units for “Emotion Synthesis”

Acknowledgements

TOSHIBA Corp.

Y.Fujimoto, H.Goto, S.Hayakawa, F.Ishihara, T.Kawaai, S.Kawakami, M.Saito, H.Tago, H.Takeda, T.Teruyama, Y.Watanabe, I.Yamazaki

TOSHIBA MICROELECTRONICS Corp.

Y.Ogawa

TOSHIBA ENGINEERING Corp.

N.Iwatsuki, M.Otsuki, Y.Tsutsumi

TOSHIBA INFORMATION SYSTEMS (JAPAN) Corp.

M.Kuramochi, Y.Hikita

SONY COMPUTER ENTERTAINMENT Inc.

S.Aoki, K.Kutaragi, S.Shinohara, K.Umemura