A Low-Power 0.7-V H.264 720p Video Decoder
D. Finchelstein, V. Sze, M.E. Sinangil, Y. Koken,
A.P. Chandrakasan
A-SSCC 2008
Outline
• Motivation for low-power video decoders
• Low-power techniques
– pipelining and parallelism
– independent voltage/clock domains
– efficient memory accessing
• ASIC results
• Comparison with state of the art
• Summary
Motivation
• High demand for video capture and playback on mobile devices
• H.264 is the state-of-the-art video coding standard
• Goal: Ultra Low Power H.264 decoder in 65nm
– 1280x720 @ 30fps
[Figure: example mobile devices: digital camera, iPhone, PSP, DVC]
H.264 Decoder Architecture
Pipelined, highly parallel architecture to reduce voltage (and power)
[Figure: decoder block diagram. Labels: bitstream input, ED, IT, INTRA, MC, MEM, DB, MUX, with LUMA, CHROMA, and SHARED paths; parallel FIFOs carry COEFFS, MODES, and MVS between stages. Off-chip: frame buffer (ZBT SRAM) and YUV->RGB (FPGA).]
Pipeline FIFO Sizing
• Pipeline stages have variable latencies
– ex: ED latency is 0-33 cycles per 4x4 block
• Larger FIFOs help average out workload
– increase performance by up to 45%
– FIFOs of depths 1-4 chosen to reduce area
[Figure: normalized system throughput (axis 1 to 1.5) versus FIFO depth (1 to 256), with the sizes used on this chip marked]
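The averaging effect of deeper FIFOs can be illustrated with a small model. This is a minimal sketch, not the authors' simulator: it assumes a single producer and consumer joined by one FIFO, the function name `run_pipeline` is hypothetical, the producer latency borrows the 0-33 cycle ED range from the slide, and the consumer latency range is an arbitrary choice for illustration.

```python
import random

def run_pipeline(depth, n_blocks=20000, seed=1):
    """Return blocks per cycle for a producer -> FIFO -> consumer pipeline."""
    rng = random.Random(seed)  # same seed, so every depth sees the same workload
    # Assumed latencies: producer mimics the 0-33 cycle ED range,
    # consumer range (8-24 cycles) is invented for this sketch.
    prod = [rng.randint(0, 33) for _ in range(n_blocks)]
    cons = [rng.randint(8, 24) for _ in range(n_blocks)]

    push = [0] * n_blocks   # cycle at which block i enters the FIFO
    start = [0] * n_blocks  # cycle at which the consumer pops block i
    done = 0                # cycle at which the consumer finishes its current block
    for i in range(n_blocks):
        ready = (push[i - 1] if i else 0) + prod[i]
        # The producer stalls until the block `depth` positions earlier was popped.
        slot_free = start[i - depth] if i >= depth else 0
        push[i] = max(ready, slot_free)
        start[i] = max(push[i], done)
        done = start[i] + cons[i]
    return n_blocks / done

baseline = run_pipeline(1)
for depth in (1, 4, 16, 32, 256):
    print(depth, round(run_pipeline(depth) / baseline, 3))
```

In this model the throughput can only stay the same or improve as the depth grows, and the gain flattens once the FIFO is deep enough to absorb the latency variation. The numbers it prints depend on the assumed latency distributions and are not expected to match the measured curve.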
Deblocking Filter Parallelism
• Process entire 4x4 edge (4 filters) in parallel
• Filter luma and chroma in parallel
• 192 cycles reduced to ~46 cycles per 16x16 block
[Figure: 4 edges in parallel]
Deblocking Filter Architecture
[Figure: deblocking filter datapath. Block IN feeds the P_IN and Q_IN 4x4 registers; the boundary strength (bS) selects between the bS=4 filters, the bS=1 to 3 filters (calc, clip, threshold, and shift stages), and a bypass for bS=0; P_OUT and Q_OUT feed Block OUT. Storage: last line cache, 104 kb [SRAM]; internal memory, 4x(4x4x8b) [DFF]. 4 parallel filters.]
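The bS=1 to 3 path in the datapath above (threshold, calc, clip) follows the normal-mode edge filter of the H.264 standard. The sketch below is a simplified illustration, not the chip's RTL: it updates only the two samples nearest the edge, takes `alpha`, `beta`, and `tc` directly as parameters instead of deriving them from the quantization parameter, and leaves out the bS=4 strong filter. The function names are hypothetical.

```python
def clip3(lo, hi, x):
    """Clamp x to the range [lo, hi]."""
    return max(lo, min(hi, x))

def filter_edge_sample(p1, p0, q0, q1, bs, alpha, beta, tc):
    """Return (p0', q0') for one sample row across an edge, bS in 0..3."""
    if bs == 0:
        return p0, q0  # bypass path
    # Threshold checks: a large step is treated as a real image edge and kept.
    if not (abs(p0 - q0) < alpha and abs(p1 - p0) < beta and abs(q1 - q0) < beta):
        return p0, q0
    # calc, then clip: the correction is limited to +/- tc.
    delta = clip3(-tc, tc, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    return clip3(0, 255, p0 + delta), clip3(0, 255, q0 - delta)

def filter_edge(p_rows, q_rows, bs, alpha, beta, tc):
    """Filter the 4 sample rows of one 4x4 edge (4 parallel filters in hardware)."""
    return [filter_edge_sample(p1, p0, q0, q1, bs, alpha, beta, tc)
            for (p1, p0), (q0, q1) in zip(p_rows, q_rows)]

# Example: a mild blocking step of 20 is smoothed by at most tc per side.
print(filter_edge([(60, 60)] * 4, [(80, 80)] * 4, bs=2, alpha=40, beta=10, tc=4))
```

The four rows of an edge are independent of each other, which is what makes it possible to process an entire 4x4 edge in one pass.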
Motion Compensation (MC)
• Use two interpolators in parallel
• Interpolate luma and chroma in parallel
• 176 cycles reduced to ~72 cycles per 16x16 block
[Figure: a 4x4 block in the current frame and its reference blocks for vector (1, -1) and vector (0.5, -0.5)]
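A fractional vector such as (0.5, -0.5) points between integer samples, so the reference block has to be interpolated. For luma, the H.264 standard defines the half-sample value with a six-tap filter (1, -5, 20, 20, -5, 1). The sketch below shows that filter for one dimension only; the function names are hypothetical, and the slides do not describe how the chip schedules these operations.

```python
def half_pel(samples):
    """Half-sample value between samples[2] and samples[3] of six neighbours."""
    e, f, g, h, i, j = samples
    acc = e - 5 * f + 20 * g + 20 * h - 5 * i + j
    # Round, divide by 32, and clip to the 8-bit sample range.
    return max(0, min(255, (acc + 16) >> 5))

def half_pel_row(row):
    """Four horizontal half-sample values from nine integer samples."""
    return [half_pel(row[k:k + 6]) for k in range(4)]

print(half_pel_row([0, 0, 0, 0, 0, 255, 255, 255, 255]))
```

Because each output needs six inputs, a row of four outputs consumes nine integer samples, and a 4x4 block with a fractional vector in both dimensions draws on a 9x9 reference window.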
Parallel MC Interpolators
• Interpolators can run in same cycle when…
– motion vectors are all available
– memory interface supplies 2 columns per cycle
• Interpolators are synchronized
– MC0: even 4x4 rows, MC1: odd 4x4 rows
– shared interpolation data reused
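[Figure: 4x4 block order within a 16x16 block (rows: 0 1 4 5 / 2 3 6 7 / 8 9 12 13 / 10 11 14 15), with rows assigned alternately to MC0 and MC1. The frame buffer (FB) supplies column 0 and column 1, entropy decoding (ED) supplies MV0 and MV1, and the interpolators are built from 6:1 filters (x9).]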
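The split between the two interpolators can be written down from the block order shown in the figure. This is a small sketch assuming the standard H.264 scan order of 4x4 blocks inside a 16x16 block; the function names are hypothetical.

```python
def block_position(idx):
    """Map a 4x4 block index (0..15, H.264 scan order) to (row, col)."""
    row = ((idx >> 3) & 1) * 2 + ((idx >> 1) & 1)
    col = ((idx >> 2) & 1) * 2 + (idx & 1)
    return row, col

def interpolator_for(idx):
    """MC0 handles even 4x4 rows, MC1 handles odd 4x4 rows."""
    row, _ = block_position(idx)
    return "MC0" if row % 2 == 0 else "MC1"

# Rebuild the 4x4 grid of block indices from the mapping.
grid = [[None] * 4 for _ in range(4)]
for idx in range(16):
    r, c = block_position(idx)
    grid[r][c] = idx

mc0_blocks = [i for i in range(16) if interpolator_for(i) == "MC0"]
mc1_blocks = [i for i in range(16) if interpolator_for(i) == "MC1"]
print(grid)
print(mc0_blocks, mc1_blocks)
```

With this assignment, blocks 0 and 2 (or 1 and 3, and so on) sit directly above one another and are processed by different interpolators, which is consistent with sharing interpolation data between the two units.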
Clock & Voltage Domains
• Decouple voltage / clock domains
– lower core voltage and frequency
– 25% power savings vs. single domain
• 4% further savings if we used 3 domains
– dual-clock FIFOs and level-shifters link domains
[Figure: average cycles per block (axis 0 to 20) for MEM_luma, MEM_chroma, ED, DB_chroma, DB_luma, MC_luma, MC_chroma, and IT]
[Figure: memory controller domain and core domain linked by level shifters and a dual-clock FIFO (DFFs, FIFO logic, memory array). Labels: Vhigh / Vlow, CLKfast / CLKslow.]
Workload Variation
• INTER-INTRA workload variation
– MAX: maximum frequency on each domain
– DVFS: 1 frame every 33ms
– Frame Averaging (FA): 15 frames every 15 * 33ms
  • switches less often than DVFS, but needs output buffer
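Workload requirements per frame type:

Frame type   Core Domain [MHz @ V]   Memory Controller [MHz @ V]
P-frame      14 @ 0.70               50 @ 0.84
I-frame      53 @ 0.90               25 @ 0.76

Relative power @ 720p (1 I-frame, 14 P-frames):

Scheme   Core Domain [MHz @ V]   Memory Controller [MHz @ V]   Relative Power [%]
MAX      53 @ 0.90               50 @ 0.84                     100
DVFS     14 @ 0.70, 53 @ 0.90    25 @ 0.76, 50 @ 0.84          73
FA       17 @ 0.72               48 @ 0.83                     73

[Figure: power versus workload, comparing perfect DVFS with no DVFS (100%); the gap is marked ΔP]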
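The frame-averaging operating point can be checked against the per-frame requirements. The sketch below assumes that the cycles a frame needs are proportional to the frequency required to decode it in one 33 ms slot, so the FA frequency is the plain average over a group of 1 I-frame and 14 P-frames. The function name is hypothetical.

```python
def frame_average(i_freq_mhz, p_freq_mhz, n_i=1, n_p=14):
    """Average frequency needed to decode n_i I-frames and n_p P-frames
    within (n_i + n_p) frame slots."""
    return (n_i * i_freq_mhz + n_p * p_freq_mhz) / (n_i + n_p)

core_fa = frame_average(53, 14)  # core: I-frame 53 MHz, P-frame 14 MHz
mem_fa = frame_average(25, 50)   # memory controller: I-frame 25 MHz, P-frame 50 MHz
print(round(core_fa, 1), round(mem_fa, 1))
```

Rounded to whole MHz, these averages come out at 17 MHz and 48 MHz, the same values listed for FA. Running at the average rate means an I-frame takes longer than one frame slot to decode, which is why FA needs an output buffer.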
MC Data Overlap
– Neighboring 4x4 with same MVs
– Overlap area shared horizontally and vertically
– Reduced MC read bandwidth
[Figure: interpolation area of the current 4x4 block and the overlapped interpolation area shared with its horizontal and vertical neighbors]
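The size of the overlap can be estimated with a little arithmetic. This sketch assumes a six-tap interpolation filter in both dimensions, so a block of W x H samples needs a (W + 5) x (H + 5) reference window; it only counts reference samples and does not model the chip's actual read pattern. The function name is hypothetical.

```python
TAPS = 6  # assumed six-tap luma interpolation filter

def ref_area(width, height, taps=TAPS):
    """Reference samples needed to interpolate a width x height block."""
    return (width + taps - 1) * (height + taps - 1)

separate_pair = 2 * ref_area(4, 4)  # two 4x4 blocks fetched independently
shared_pair = ref_area(8, 4)        # same two blocks, same MV, fetched once
separate_quad = 4 * ref_area(4, 4)  # four 4x4 blocks fetched independently
shared_quad = ref_area(8, 8)        # 2x2 group of blocks sharing one window

print(separate_pair, shared_pair, round(1 - shared_pair / separate_pair, 3))
print(separate_quad, shared_quad, round(1 - shared_quad / separate_quad, 3))
```

Two horizontal neighbors with the same motion vector need 117 reference samples together instead of 162, and the saving grows when the overlap is also exploited vertically.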
Last-line On-chip Caches
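[Figure: neighboring data kept in the last-line caches. Intra prediction uses the samples labeled M, A to H, and I to L around the current block; deblocking of a 16x16 block uses its top-left, top, top-right, and left neighbors.]
[Figure: cache size [kb] (axis 0 to 120) for DB, INTRA, MC, and ED, with bandwidth annotations of 807 Mbps, 122 Mbps, 26 Mbps, and 21 Mbps]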
Off-chip Bandwidth
• Frame buffer off-chip (1.4 MByte per frame)
• P-frames more common than I-frames
– P-frame off-chip BW larger due to MC
• 40% (0.9 Gbps) total reduction
– last-line caches
– avoided redundant reads in MC
[Figure: P-frame off-chip BW [Gbps] (axis 0 to 2) for three cases: original, caching, and cache & fewer reads, with reductions of 26% and 19% marked]
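Two of the figures on this slide can be reproduced with short arithmetic. The sketch assumes 8-bit 4:2:0 video (one luma sample plus half a chroma sample pair per pixel) and reads the 26% and 19% reductions as applying one after the other; both are assumptions, not statements from the slides.

```python
WIDTH, HEIGHT = 1280, 720

# 4:2:0, 8 bits per sample: 1.5 bytes per pixel (assumed format).
frame_bytes = WIDTH * HEIGHT * 3 // 2
print(frame_bytes / 1e6, "MByte per frame")

# Assumed compounding: caching removes 26%, then fewer MC reads remove
# 19% of what is left.
remaining = (1 - 0.26) * (1 - 0.19)
total_reduction = 1 - remaining
print(round(total_reduction, 3))
```

Under these assumptions the frame size is about 1.4 MByte and the combined reduction is about 40%, in line with the totals quoted above.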
Voltage Scalable SRAM
• Low voltage SRAM needed
– Typical 6T SRAMs fail at low voltages
– 8T SRAMs work down to 0.5V
[Figure: 8T SRAM cell. Write assist improves writability at low voltages; the extra 2 transistors ensure read stability at low voltages.]
[Figure: pseudo-differential sense amplifier with global snsRef. Signals: snsEn, snsRef, RDBL, DOUT.]
H.264 Decoder ASIC
[Figure: die photo, 3.3 mm x 3.3 mm, showing the memory controller domain, core domain, caches, and 176 I/O pads]
DECODER STATISTICS
Technology: 65-nm
Area (w/o pads): 2.76 x 2.76 mm²
Area Utilization: 31 %
I/O Pads: 176
On-chip SRAM: 17kB
Area Breakdown
• Cache area 3x larger than logic
• Standard Cells: 134k
• Parallelism Overhead
– 1.5% of active chip area
– 4 luma + 2 chroma filters: 1.5% of DB
– 2 luma + 4 chroma interpolators: 9% of MC
[Figure: area breakdown. Caches 75%, logic 25%. By block: DB 56%, MC 16.6%, INTRA 15.6%, IT 8.2%, ED 1.6%, MISC 1.4%.]
Power Measurements
720p Video               mobcal      shields     parkrun
Input Bitrate [ Mbps ]   5.4         7.0         26
Core [ MHz @ V ]         14 @ 0.70   14 @ 0.70   25 @ 0.80
MEM [ MHz @ V ]          50 @ 0.84   50 @ 0.84   50 @ 0.84
Power [ mW ]             1.8         1.8         3.2
[Figure: minimum core Vdd @ 720p, distribution across 15 dies; histogram (count axis 0 to 10) over 0.69, 0.70, 0.71, and 0.72 V]
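The measured operating points can be turned into two per-unit figures: the cycle budget per 16x16 block and the energy per decoded pixel. This is back-of-the-envelope arithmetic from the numbers in the table, assuming 1280x720 at 30 fps; the function names are hypothetical.

```python
WIDTH, HEIGHT, FPS = 1280, 720, 30
MB_PER_FRAME = (WIDTH // 16) * (HEIGHT // 16)  # 16x16 blocks per frame

def cycles_per_macroblock(core_mhz):
    """Core clock cycles available for each 16x16 block."""
    return core_mhz * 1e6 / (MB_PER_FRAME * FPS)

def energy_per_pixel_pj(power_mw):
    """Energy per decoded pixel in picojoules."""
    return power_mw * 1e-3 / (WIDTH * HEIGHT * FPS) * 1e12

print(MB_PER_FRAME)
print(round(cycles_per_macroblock(14), 1))  # core at 14 MHz
print(round(energy_per_pixel_pj(1.8), 1))   # 1.8 mW measurement
```

At 14 MHz the core has roughly 130 cycles for each of the 3600 blocks in a frame, so the per-block cycle counts quoted earlier for deblocking (~46) and motion compensation (~72) fit within the budget.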
Power Breakdown
• P-frame power dominated by:
– MC (frame buffer reads)
– deblocking filter
[Figure: P-frame power breakdown. MC 42%, DB 26%, pipeline control & FIFOs 19%, MEM write 7%, ED 3%, INTRA 1%, IDCT 1%. Within MC: MEM read 75%, interpolators 20%, motion vector predictor 5%.]
Survey of Other Decoders
[Figure: power (0.01 mW to 1 W, log scale) versus throughput in Mpixels/s (0.1 to 100, log scale), with resolutions QCIF, CIF, D1, 720p, and 1080p marked at 15fps and 30fps. Legend format: [work] - process, profile]
[3] - 130-nm, Baseline
[4] - 180-nm, Baseline
[5] - 180-nm, Baseline
[6] - 180-nm, Main
This work - 65-nm, Baseline
[Voltage annotations for this work, core domain / memory controller: 0.70 V / 0.85 V, 0.5 V / 0.5 V, 0.55 V / 0.68 V, 0.85 V / 1.15 V, 0.66 V / 0.74 V]
Summary
• Pipeline and parallelism
– Concurrency allows 14MHz @ 720p
– Parallelism: luma DB = 4x, luma MC = 2x
• Separate voltage/clock domains
– 25% P-frame power savings
– DVFS on each domain for I/P-frame differences
• Efficient memory accesses
– Low-voltage on-chip caches and data reuse
– Off-chip BW lowered by 40%
Acknowledgements
• Funding: Nokia, TI, and NSERC
• Chip fabrication: TI
• Valuable feedback:
– Nokia: J. Hicks, G. Raghavan, J. Ankcorn
– TI: M. Budagavi, D. Buss, M. Zhou
– MIT: Arvind, E. Fleming
Video Demo