A Low-Power 0.7-V H.264 720p Video Decoder

A Low-Power 0.7-V H.264 720p Video Decoder

D. Finchelstein, V. Sze, M.E. Sinangil, Y. Koken,

A.P. Chandrakasan

A-SSCC 2008

Outline

• Motivation for low-power video decoders• Low-power techniques

– pipelining and parallelism– independent voltage/clock domains– efficient memory accessing

• ASIC results• Comparison with state of the art• Summary

Motivation

• High demand for video capture and playback on mobile devices

• H.264 state of the art video coding standard

• Goal: Ultra Low Power H.264 decoder in 65nm– 1280x720 @ 30fps

Digital Camera

iPhone

PSP

DVC

H.264 Decoder Architecture

Pipelined, highly parallel architecture to reduce voltage (and power)

ED

COEFFS

MODES

MVS

IT

SHARED

LUMA

CHROMA

INTRA

MC MEM

DB+

MVS

MUX

MUX

BitstreamInput

FRAME BUFFER(ZBT SRAM)

YUV->RGB(FPGA)

OFF-CHIP

MODES

MVS

INTRA

MC MEM

DB+

MVSFIFO PARALLEL

Pipeline FIFO Sizing

• Pipeline stages have variable latencies– ex: ED latency is 0-33 cycles per 4x4 block

• Larger FIFOs help average out workload– increase performance by up to 45%– FIFOs of depths 1-4 chosen to reduce area

NormalizedSystem

Throughput

11.051.1

1.151.2

1.251.3

1.351.4

1.451.5

1 4 16 32 256

FIFO DepthSizes on this chip

82

Deblocking Filter Parallelism

• Process entire 4x4 edge (4 filters) in parallel• Filter luma and chroma in parallel• 192 cycles reduced to ~ 46 cycles per 16x16 block

4 edges in parallel

Deblocking Filter Architecture

Last Line Cache104kb

[SRAM]

Internal Memory4x(4x4x8b)

[DFF]

Block IN

PIN

Filters (bS=4)

Filters (bS=1 to 3)

BoundaryStrength (bS)

DatapathControl

DatapathControl

Block OUT

DatapathControl

DatapathControl

DatapathControl

QOUT

POUT

(bS=0)

4x4 4x4

4x4 4x4

QIN

4x4

4x4

4x1

4x14x4

4x4

calc clip

<<

>>

......

......

threshold

threshold

<<

4 PARALLEL FILTERS

Motion Compensation (MC)

• Use two interpolators in parallel• Interpolate luma and chroma in parallel• 176 reduced to ~ 72 cycles per 16x16 block

4x4 block in current frame

Reference blockVector (1, -1)

Reference blockVector (0.5, -0.5)

Parallel MC Interpolators

• Interpolators can run in same cycle when…– motion vectors are all available– memory interface supplies 2 columns per cycle

• Interpolators are synchronized– MC0: even 4x4 rows, MC1: odd 4x4 rows– shared interpolation data reused

0 1 4 52 3 6 78 9 12 13

10 11 14 15

MC0

MC1

MC0

MC1

MC0

MC1

FrameBuffer(FB)

EntropyDecoding

(ED)

column0

MV0

MV1

column16:1

6:1

6:1

6:1

6:1x9

4x4 16x16

Clock & Voltage Domains

• Decouple voltage / clock domains– lower core voltage and frequency– 25% power savings vs. single domain

• 4% further savings if we used 3 domains

– dual-clock FIFOs and level-shifters link domains

02468

101214161820

MEM_luma

MEM_chroma

EDDB_c

hroma

DB_luma

MC_luma

MC_chroma

IT

Average Cycles / block

Memory Controller Core Domain

MemoryController

LevelShifters

VhighVlow

CLKfastCLKslow

CoreDomain

DFF

Dual-ClockFIFO

DQ D Q

DQ FIFOLOGIC

MemArray

D Q

VhighVlow

Workload Variation

• INTER-INTRA workload variation– MAX: maximum frequency on each domain– DVFS: 1 frame every 33ms– Frame Averaging (FA): 15 frames every 15 * 33ms

• switches less often than DVFS, but needs output buffer

50 @ 0.8414 @ 0.70P-frame

25 @ 0.7653 @ 0.90I-frame

7348 @ 0.8317 @ 0.72FA

7325 @ 0.7650 @ 0.84

14 @ 0.7053 @ 0.90DVFS

10050 @ 0.8453 @ 0.90MAX

Relative Power@720p

1 I-frame,14 P-frames

[%]

Memory Controller

[ MHz @ V ]

Core Domain

[ MHz @ V ]Power

Workload

Perfect DVFS

No DVFS

100%

ΔP

MC Data Overlap

– Neighboring 4x4 with same MVs– Overlap area shared horizontally and vertically– Reduced MC read bandwidth

OverlappedInterpolation

Area

HorizontalNeighbors

VerticalNeighbors

Current4x4 block Interpolation

Area

Last-line On-chip Caches

M A B CIJKL

D E F G H

Intra prediction

Deblocking 16x16block

Top-Left

Top Top-Right

Left

0

20

40

60

80

100

120

DB INTRA MC ED

Cache Size [kb]

807 Mbps

122 Mbps

26 Mbps 21 Mbps

Off-chip Bandwidth

• Frame buffer off-chip (1.4 MByte per frame)• P-frames more common than I-frames

– P-frame off-chip BW larger due to MC• 40% (0.9 Gbps) total reduction

– last-line caches– avoided redundant reads in MC

0

1

2

original caching cache &fewer reads

P-frameoff-chip BW

[ Gbps ]

26%19%

Voltage Scalable SRAM8T SRAM Cell Write Assist to improve

writability at low voltages

Extra 2 Tx ensures read stability at low voltages

snsEn

snsRefRDBL

DOUT

Pseudo-differential sense amplifier with global snsRef

• Low voltage SRAM needed– Typical 6T SRAMs fail at low

voltages– 8T SRAMs work down to 0.5V

H.264 Decoder ASIC3.

3 m

m

3.3 mm

MEMORY CONTROLLERDOMAIN

COREDOMAIN

CACHES

176 I/O PADS

Area (w/o pads) :Area Utilization :

Technology :I/O Pads :

On-chip SRAM :

2.76 x 2.76 mm2

31 %65-nm17617kB

DECODER STATISTICS

Area Breakdown

• Cache area 3x larger than logic

• Standard Cells: 134k

• Parallelism Overhead– 1.5% of active chip area– 4 luma + 2 chroma filters:

– 1.5% of DB– 2 luma + 4 chroma interpolators:

– 9% of MC DB56% MC

16.6%

INTRA15.6%

ED1.6% IT

8.2%

MISC1.4%

Caches75%

Logic25%

Power Measurements

3.21.81.8Power [ mW ]50 @ 0.8450 @ 0.8450 @ 0.84MEM [ MHz @ V ]

25 @ 0.8014 @ 0.7014 @ 0.70Core [ MHz @ V ]

720p Video mobcal shields parkrunInput Bitrate [ Mbps ] 5.4 7.0 26

0123456789

10

Minimum Core Vdd @ 720p

0123456789

10

DistributionAcross15 Dies

0.69 0.70 0.71 0.72

Power Breakdown

• P-frame power dominated by:– MC (frame buffer reads)– deblocking filter

MC42%

DB26%

MEMwrite7%

Pipelinecontrol & FIFOs

19%INTRA1%

IDCT1%

ED3% Motion Vector

predictor5%

MEMRead 75%

Interpolators20%

P-Frame

Survey of Other Decoders

0.01 mW

0.1 mW

1 mW

10 mW

100 mW

1 W

0.1 1 10 100

[work] - process, profileQCIF CIF D1 720p 1080p

Resolution30fps

Pow

er

Mpixels/s

15fps

[3] - 130-nm, Baseline



[6] - 180-nm, Main

This work - 65-nm, Baseline

0.70 V0.85 V

Core DomainMemory Cntl

0.5 V0.5 V

0.55 V0.68 V

0.85 V1.15 V

0.66 V0.74 V

Summary

• Pipeline and parallelism– Concurrency allows 14MHz @ 720p– Parallelism: luma DB = 4x, luma MC = 2x

• Separate voltage/clock domains– 25% P-frame power savings – DVFS on each domain for I/P-frame differences

• Efficient memory accesses– Low-voltage on-chip caches and data reuse– Off-chip BW lowered by 40%

Acknowledgements

• Funding: Nokia, TI, and NSERC • Chip fabrication: TI• Valuable feedback:

– Nokia: J. Hicks, G. Raghavan, J. Ankcorn– TI: M. Budagavi, D. Buss, M. Zhou– MIT: Arvind, E. Fleming

Video Demo

Documents

A Low-Power 0.7-V H.264 720p Video Decoder