A Low-Power Handheld GPU using Logarithmic Arithmetic and Triple DVFS Power Domains Byeong-Gyu Nam, Jeabin Lee, Kwanho Kim, Seung Jin Lee, and Hoi-Jun

A Low-Power Handheld GPU using Logarithmic Arithmetic and

Triple DVFS Power Domains

Byeong-Gyu Nam, Jeabin Lee, Kwanho Kim, Seung Jin Lee, and Hoi-Jun Yoo

Outline• Backgrounds• Proposed Handheld GPU

• Low-Power Vertex Shader• Low-Power Rendering Engine• Triple-Domain DVFS

• Evaluation Results• Conclusion

Motivations• Handheld 3D Graphics Systems

• Low-Power long-battery lifetime• Small-Area limited-footprint• High-Performance more realistic scenes

• Previous Works• [Lindholm et al., SIGGRAPH01] Conventional

Vertex Shader Architecture• [Kameyama et al., GH03] Low-cost, limited performance• [Sohn et al., GH04] Fixed-Point Vertex Shader• [Yu et al., ISSCC06] High-Performance Floating-Point

Vertex Shader

Contributions• Handheld GPU using Logarithmic Arithmetic

• Power-, area-efficient, and high-throughput graphics pipeline

• Vertex shader using multi-function unit• Merges matrix, vector, and elementary functions in a

single-arithmetic 4-way unit

• Handheld GPU with Triple-domain DVFS• Separate control of each power domain

according to its workload• Minimize power consumption for given workload

Logarithmic Arithmetic• Various Math Operations for 3DG Pipeline

• Matrix-vector multiplication• Vector operations• Elementary functionsToo complicated for resource limited handsets

• Logarithmic Number System (LNS)• Reduces arithmetic complexity• Conversion errors

• Handheld 3D Graphics• Small screen tolerates a little computation error Logarithmic arithmetic can be used

2 2

2 2

2

(log log )

(log log )

log

2 2

2 2

2 2

x y X Y

x y X Y

y xy y X

x y

x y

x

MAT,VMUL, VMAD, VDOT, VDIV, VDSQ,

POW, LOG, SIN, COS, ATAN

Dynamic Voltage Frequency Scaling• Power Consumption is Proportional to

• Square of supply voltage• Operating frequency Dynamic scaling supply voltage and frequency

is important for low-power design• DVFS in module-wise manner

• Can assign the optimal frequency and voltage to each module [Sheaffer et al. GH04]• cf) Chip-wise DVFS cannot exploit dynamic workload

characteristic of each module

2ddP C V f

Overall Architecture

• Integrates RISC, VS, and RE• Logarithmic arithmetic based pipeline• Triple power domains with DVFS• 3 PMUs for management and FIFO for communication

RISCVertex Shader

(VS)

Rendering Engine

(RE)VIB

PMU

ClkVdd

PMU PMU

ClkVdd ClkVdd

External asynchronous SRAM

RISC Domain

VS Domain

RE Domain

VOB

IndexFIFO

MatrixFIFO

Application Geometry Rendering

Vertex Shader• Multi-function Unit

• Unifies matrix, vector, and elementary ftn.

• Single-cycle throughput for vector and elem. ftn.

• 2-cycle Transformation• Matrix-vector

multiplication using LNS• 4-cycle in conventional

way• Operand Forwarding

• Log-domain forwarding

InstructionMemory

(2KB)

GeneralPurposeRegister(512B)

Instruction Decode and C

ontrol

ConstantMemory

(4KB)

Vertex InputBuffer(512B)

SWZ

VOR(256B)

Multifunctionunit

IndexFIFO(64B)

SWZ SWZ

VertexFetch

MatrixFIFO(4KB)

Matrix

Number System

• Logarithmic converter • Antilogarithmic Converter

15-entry LUT

aim+bi

x

m>>ci m>>di m>>ei

CPA**

CSA*

CSA*

ci di ei bi

8-entry LUT

ai f+bi

x

f>>ci f>>di

CSA*

CPA**

CSA*

ci di bi

( )log log ( )log ( )

2 2

2

2 11

1

e

i i

x mx e m

m a m b

2 2 2

2

x e f f

fi i

e

a f b

• Hybrid Number System (HNS)• Complex op. in LNS• Linear add, sub in FLP• Number Converters based on

piecewise linear approximation

FLP*(add,sub)

LNS**(complex op)

HNS

+

* FLP: Floating-Point Number System** LNS: Logarithmic Number System

* CSA: Carry Save Adder** CPA: Carry Propagate Adder

Matrix-Vector Multiplication00 01 02 03 00 030 01 02

10 11 12 13 10 131 11 120 1 2 3

20 21 22 23 2 20 21 22 23

3 31 3230 31 32 33 30 33

c c c c c cx c cc c c c c cx c c

x x x xc c c c x c c c c

x c cc c c c c c

• MAT requires 16 multiplications• MAT in HNS requires 20 LOGCs, 16 ADDs, 16 ALOGCs• Fixed matrix coeffis. for an object 4 LOGCs, 16 ADDs, 16 ALOGCs• 2-phase implementation requires 2 LOGCs, 8 ADDs, 8 ALOGCs / phase

2 00 2 032 01 2 022 10 2 132 11 2 12

2 0 2 1 2 22 20 2 232 21 2 222 30 2 31 2 32 2 33

log loglog loglog loglog loglog log loglog loglog loglog log log log2 2 2 2

c cc cc cc cx x xc cc cc c c c

2 3log x

MAT 1st phaseMAT 2nd phase

* PMUL: programmable multiplier

to Channel1,2,3

x0x2

log2 c00log2 c02

log2 c01log2 c03

PMUL*

Cha

nnel

0

AD

D

adde

rtr

ee

LOG

LUT

LOGC

Cha

nnel

1C

hann

el 2

Cha

nnel

3

Boo

th E

nc.

log2x1, for 1st phase, log2x3, for 2nd phase (from LOGC of Channel1 )

ALO

GLU

TLO

GLU

T

E1 E2

adde

rtr

ee

AD

D

AD

D

AC

C

E3 E4 E5

adde

rtr

ee

ALO

GLU

T

ALOGC

>>

PADD

Vector Operations• Defined as a Generic Operation

• Vector MUL,DIV,DSQ,MAD• Converted into HNS

• 2 LOGCs are required / Channel

{0,1,2,3}

where { , }, { , }, {0.5,1}

pi i i i

p

x y z

2 2log log

{0,1,2,3}

where { , }, {0,1}

2 i ix y qii

q

z

x0

y0

z0

PMUL

Cha

nnel

0

AD

D

adde

rtr

ee

LOG

LUT

LOGC

Cha

nnel

1C

hann

el 2

Cha

nnel

3

Boo

th E

nc.

ALO

GLU

TLO

GLU

T

E1 E2

adde

rtr

ee

AD

D

AD

D

AC

C

E3 E4 E5

adde

rtr

ee

ALO

GLU

T

ALOGC

>>

PADD*

* PADD: programmable adder

Channel 0 Channel 1 Channel 2 Channel 3

+

+

+

+

Elementary Functions• Requires Power Ftns. Evaluations

• POW, Taylor Series for Trigonometric (SIN,COS,ATAN...)• Converted into HNS

• Booth multiplier is required0 31 2 4

0 1 2 3 4

where { , }, and are coefficients

k kk k k

c ki i

c x c x c x c x c x

3 3 21 1 2 2 2 2 4 4 20 loglog log log0 [2 2 2 2 ]C k xC k x C k x C k xkc x

xyxy y

x 22 loglog 22

x0

y0

bias

PMUL

Cha

nnel

0

AD

D

adde

rtr

ee

LOG

LUT

LOGC

Cha

nnel

1C

hann

el 2

Cha

nnel

3

Boo

th E

nc.

ALO

GLU

TLO

GLU

TE1 E2

adde

rtr

ee

AD

D

AD

D

AC

C

E3 E4 E5

adde

rtr

ee

ALO

GLU

T

ALOGC

>>

PADD

Operand Forwarding in Log-domaint

c

op1

op2

op1

op2

LogAlog

Log

Log

Log

Alog

Log

Log

Log

Alog

Cancelled out

• Avoids back-to-back antilog and log converters• Forward intermediate log value to next instruction• Pipeline latency reduction• Computation error reduction

• Divisions required in RE are implemented as subtractors• SIMD Interpolators in TSE and RST• SIMD Divider in Texture Unit

Rendering EngineVe

rtex

buffe

r

Vert

ex S

hade

r Triangle SetupEngine (TSE)

Rasterizer(RST)

IntplIntpl

Texture Unit(TEX)

DIV

Pixel pipeline

2 2 2 2 2log ,log ,log ,log log, , , / 2 s t u v ws t u v w

LOGConv

SUB

LOGConv

LOGConv

LOGConv

ALOGConv

ALOGConv

ALOGConv

LOGConv

ALOGConv

SUBSUBSUB

s t u v w

s/w t/w u/w v/w

Triple-Domain DVFS

PMU

FIFO FIFO

PMU PMU

Objects Vertices Pixels

RISC VS RE• Workloads for each module in 3D pipeline are

different• Objects for RISC, vertices for VS, pixels for RE

• Separate power-domains with DVFS• Managed separately according to its workload

Performance Evaluations• Environment Setup

• Accuracy simulation• C libraries for HNS & FLP ops.

• Performance measurement• Cycle count evaluator

• Hardware model for graphics processor core• Fully synthesizable structural HDL model

3D application

Hardware model

Cycle countevaluator

Power managementmodel

HNS library FLP library

Accuracy Comparison

from FLP from HNS

• FLP and HNS libraries• QVGA Test Screens• Tr&Lighting , Tr&Texturing• Tolerable image difference

Performance Comparison

• Test vectors• Standard OpenGL TnL• O-N and C-T for advanced diffuse and specular lighting

• Comparison results• 19.1%, 49.3%, and 23.9% cycle count reduction for OpenGL, O-N,

and C-T models from latest work [YCK*06]

OpenGL Oren-Nayar Cook-Torrance0

20

40

60

80

100

120

140

160

180

Cyc

le c

ount

s

[AYH*04][SWY04][KCY*06][YCK*06]

This work

Implementation

RISC D$I$

Matrix FIFO

VertexShader

CMEM

VOB

Rendering Engine

IMEMVIB

DMEM

RISCPMU

VSPMU

REPMU

• Process Technology• TSMC 0.18um CMOS 1P

6M• Area

• 17.2mm2

• 393K gates, 29KB SRAM• Processing Speed

• 5.26 Mvertices/s • 50Mpixels/s

• Power Consumption• 52.4mW @ 60fps• 153mW @ full speed

• PMU Specifications• 1.0 – 1.8V, 89 – 200MHz• 0.45mm2, 5mW

Power Consumption

• DVFS for VS according to the workload• Higher workload makes a droop in FIFO entry level• FIFO droop increases Freq and Vdd of VS• Increased VS speed restores FIFO entry level

• For test model with 5,878 triangles, 5 DC, and OpenGL TnL• GPU consumes 52.4mW @ 60fps

11

0

180

16

1.789

1.0

50 Triangle size

FIFO entry level

Freq. of VS

Voltage of VS

# of

pixe

ls#

ofen

trie

sC

lkVd

d

Handheld GPUs

• Compared with the latest work [YCK*06]• 2.47x performance improvement• 50.5% power reduction• 38.5% area reduction• 23.5% Kvertices/MHz improvement

[KKF*03]

[SWY04]

Thiswork

[YCK*06]

Functions

GE+RE

GE

GE+RE

GE

Performance(Mvertices/s)

0.185

7.2

5.26

2.13

mW @ fullspeed

38

115

153(87 for GE)

157

Process /Freq.

0.18um /30MHz

0.13um /400MHz

0.18um /200MHz

0.18um /100MHz

Area(Kgates)

360

230

393(242 for GE)

375

Kvertices/s/MHz

6.17

18

26.3

21.3

Conclusion• A Handheld GPU for Low-Power Wireless

Applications• 2.47x performance improvement• 50.5% power reduction• 38.5% area reduction

• 3D Graphics Pipeline based on Logarithmic Arithmetic• 5.26Mvertices/s for OpenGL TnL

• Triple power domains with DVFS• 52.4mW @ 60fps

Thank You• For more information

• e-mail: [email protected]

mailto:[email protected]

Documents

A Low-Power Handheld GPU using Logarithmic Arithmetic and Triple DVFS Power Domains Byeong-Gyu Nam, Jeabin Lee, Kwanho Kim, Seung Jin Lee, and Hoi-Jun