Upload
dominick-mccoy
View
216
Download
1
Tags:
Embed Size (px)
Citation preview
S
Understanding the Sources of
Inefficiency in General-Purpose
Chips
General Purpose Processors serve a wide class of applications
Pros: Quick recovery Non Recurring Engineering costs
Cons: Low energy efficiency and poor performance Specific applications (cells, video cameras) have strict needs
Video encoding is used as the representative application by the author
Motivation
H.264 Encoding Format
Prediction
•Inter Prediction (IME, FME)•Intra Prediction
Transform / Quantize
Entropy
Encode•CABAC
Input
H.264 Encoding Format
Integer Motion Estimation – Finds closest match for an image block from previous image and computes a vector to represent the observed motion
FME (Fractional Motion Estimation) – Finds a match at quarter pixel resolution
H.264 Encoding Format
IP intra prediction – Uses previously encoded image-blocks within current image to form prediction of current image
DCT/Quantization – difference between current and predicted image block
CABAC – entropy encode coeffecients/elements
H.264 algorithm
Prediction
•Inter Prediction (IME, FME)•Intra Prediction
Transform / Quantize
Entropy
Encode•CABAC
Input
Data Parallel Sequential
Percentage execution timeH.264
5636
71.6
IMEFMEDCTCABAC
IME + FME account for 92% of the execution time!
CABAC is small but sequential – becomes bottleneck
What is exactly the problem?
(H.264)
33
10
619
22
10
Instruction FetchRegister FileALUD cachePipeline registersControl
2.8GHz Pentium 4 is 500x worse in energy
Four processor Tensilica based CMP is also 500x worse in energy
ASIC – Application Specific integrated circuits
ASIC - Feasibility
Is it feasible? – Inflexible Increased manufacturing and design time
Non Recurring Engineering costs? Expensive to make for every different application
General Idea
Is there any incremental way of going from GP to ASIC?
What are the nature of the overheads?
A solution that has the benefits of both GP and ASIC
Provide flexibility for application experts to build customized solution for future energy efficiency
Case study – transform a conventional CMP into a customizable processor which is an efficient H.264 encoder Use Tensilica to create optimized processors
Baseline H.264 Implementation
H.264 video encoding path is long and sequential Map five major algorithmic blocks to macro blocks Map four macro block to 4 processor CMP system Each processor has 16KB 2-way set associative I & D
caches
Baseline H.264 Implementation
SIMD and ILP
Exploiting VLIW and SIMD
SIMD – Single Instruction Multiple Data
A0
A1
A2
A3
B0
B1
B2
B3
C0
C1
C2
C3
SIMD and ILP
VLIW – Very Long Instruction Word
ADD SUB MUL DIV
ALU1 ALU2 ALU3 ALU4
ADD SUB MUL DIV
SIMD and ILP
SIMD and ILP
Processor Energy breakdown SIMD and ILP
Operation Fusion
Operation fusion – Fusion of complex instruction subgraph Reduces instruction count and register file accesses Intermediate results are consumed within op
Eg: xn = x-2 -5x-1 + 20x0 +20x1 -5x2 +x3 (Pixel upsampling)
After fusion
acc = 0;acc = AddShft(acc, x0, x1, 20);acc = AddShft(acc, x-1, x2, -5);acc = AddShft(acc, x-2, x3, 1);Xn = Sat(acc);
Operation Fusion
Compiler can find interesting instructions to merge Tensilica’s Xpress tries to do this automatically
Authors created manually
Found ~20 fusion instructions across 4 algorithmic blocks
Not a big gain
Not good enough
Problem remains that 90% of the energy is going in overhead instructions
Need more compute / overhead
Need to aggregate works in large chunks to create highly optimized FU
Magic Instructions
Can achieve a large number of computation at very low costs
Achieved by creating instructions that are tightly connected to custom data storage elements with algorithm specific communications links
IME Strategy
SAD – Sum of absolute differences
Hundreds of SAD calculations to get one image difference
Data for each calculation is nearly the same
IME Strategy
FME Strategy
Pixel up-sampling example
Eg: xn = x-2 -5x-1 + 20x0 +20x1 -5x2 +x3 (Pixel upsampling)
Normal register files require five register transfers per step
Augment them with 6 8-bit wide entry shift register structure
Works like FIFO – when a new entry comes, all shift
FME Strategy
X-2 X-1 X0 X1 X2 X3
X4
X-1 X-0 X1 X2 X3 X4
Create a six input multiplier /adder
For 2-D up-sampling, build a shift register that stores horizontally up-sampled data and feeds its output to the vertical up-sampling units
FME Strategy
X3
X2
X1
X-1
X-2
X-3
Other magic instructions
DCT Matrix Transpose Operation fusion with no limitation on number of
operands
Intra Prediction Customized interconnections for different prediction
modes
CABAC FIFO structures in binarization module Fundamentally different computation fused with no
restrictions
Magic Instructions Energy( within 3x of ASIC )
Magic Instructions Performance
Over 35% of the energy is now used in the ALU’s
Most of the code involved magic instructions
Summary
Many operations are very simple with low energy The SIMD / Vector parallelize well but overheads
dominate To get 100 ops/sec, need specialized hardware &
memory
Authors put emphasis on making chip customization feasible The focus should be on designing chip generators
and not chips
Discussion Points
How are their architecture designs going to scale across multiple applications?
Their comparison baseline for a general purpose CMP is invalid. They should compare against designs having similar units
For very varied applications having specific requirements, this might just boil down into designing ASIC’s
They do not evaluate the quality of the video (and both encode time and power varies with quality)