Understanding the Sources of Inefficiency in General-Purpose Chips

S

Understanding the Sources of

Inefficiency in General-Purpose

Chips

General Purpose Processors serve a wide class of applications

Pros: Quick recovery Non Recurring Engineering costs

Cons: Low energy efficiency and poor performance Specific applications (cells, video cameras) have strict needs

Video encoding is used as the representative application by the author

Motivation

H.264 Encoding Format

Prediction

•Inter Prediction (IME, FME)•Intra Prediction

Transform / Quantize

Entropy

Encode•CABAC

Input


Integer Motion Estimation – Finds closest match for an image block from previous image and computes a vector to represent the observed motion

FME (Fractional Motion Estimation) – Finds a match at quarter pixel resolution


IP intra prediction – Uses previously encoded image-blocks within current image to form prediction of current image

DCT/Quantization – difference between current and predicted image block

CABAC – entropy encode coeffecients/elements

H.264 algorithm

Prediction

•Inter Prediction (IME, FME)•Intra Prediction

Transform / Quantize

Entropy

Encode•CABAC

Input

Data Parallel Sequential

Percentage execution timeH.264

5636

71.6

IMEFMEDCTCABAC

IME + FME account for 92% of the execution time!

CABAC is small but sequential – becomes bottleneck

What is exactly the problem?

(H.264)

33

10

619

22

10

Instruction FetchRegister FileALUD cachePipeline registersControl

2.8GHz Pentium 4 is 500x worse in energy

Four processor Tensilica based CMP is also 500x worse in energy

ASIC – Application Specific integrated circuits

ASIC - Feasibility

Is it feasible? – Inflexible Increased manufacturing and design time

Non Recurring Engineering costs? Expensive to make for every different application

General Idea

Is there any incremental way of going from GP to ASIC?

What are the nature of the overheads?

A solution that has the benefits of both GP and ASIC

Provide flexibility for application experts to build customized solution for future energy efficiency

Case study – transform a conventional CMP into a customizable processor which is an efficient H.264 encoder Use Tensilica to create optimized processors

Baseline H.264 Implementation

H.264 video encoding path is long and sequential Map five major algorithmic blocks to macro blocks Map four macro block to 4 processor CMP system Each processor has 16KB 2-way set associative I & D

caches

Baseline H.264 Implementation

SIMD and ILP

Exploiting VLIW and SIMD

SIMD – Single Instruction Multiple Data

A0

A1

A2

A3

B0

B1

B2

B3

C0

C1

C2

C3

SIMD and ILP

VLIW – Very Long Instruction Word

ADD SUB MUL DIV

ALU1 ALU2 ALU3 ALU4

ADD SUB MUL DIV

SIMD and ILP

SIMD and ILP

Processor Energy breakdown SIMD and ILP

Operation Fusion

Operation fusion – Fusion of complex instruction subgraph Reduces instruction count and register file accesses Intermediate results are consumed within op

Eg: xn = x-2 -5x-1 + 20x0 +20x1 -5x2 +x3 (Pixel upsampling)

After fusion

acc = 0;acc = AddShft(acc, x0, x1, 20);acc = AddShft(acc, x-1, x2, -5);acc = AddShft(acc, x-2, x3, 1);Xn = Sat(acc);

Operation Fusion

Compiler can find interesting instructions to merge Tensilica’s Xpress tries to do this automatically

Authors created manually

Found ~20 fusion instructions across 4 algorithmic blocks

Not a big gain

Not good enough

Problem remains that 90% of the energy is going in overhead instructions

Need more compute / overhead

Need to aggregate works in large chunks to create highly optimized FU

Magic Instructions

Can achieve a large number of computation at very low costs

Achieved by creating instructions that are tightly connected to custom data storage elements with algorithm specific communications links

IME Strategy

SAD – Sum of absolute differences

Hundreds of SAD calculations to get one image difference

Data for each calculation is nearly the same

IME Strategy

FME Strategy

Pixel up-sampling example

Eg: xn = x-2 -5x-1 + 20x0 +20x1 -5x2 +x3 (Pixel upsampling)

Normal register files require five register transfers per step

Augment them with 6 8-bit wide entry shift register structure

Works like FIFO – when a new entry comes, all shift

FME Strategy

X-2 X-1 X0 X1 X2 X3

X4

X-1 X-0 X1 X2 X3 X4

Create a six input multiplier /adder

For 2-D up-sampling, build a shift register that stores horizontally up-sampled data and feeds its output to the vertical up-sampling units

FME Strategy

X3

X2

X1

X-1

X-2

X-3

Other magic instructions

DCT Matrix Transpose Operation fusion with no limitation on number of

operands

Intra Prediction Customized interconnections for different prediction

modes

CABAC FIFO structures in binarization module Fundamentally different computation fused with no

restrictions

Magic Instructions Energy( within 3x of ASIC )

Magic Instructions Performance

Over 35% of the energy is now used in the ALU’s

Most of the code involved magic instructions

Summary

Many operations are very simple with low energy The SIMD / Vector parallelize well but overheads

dominate To get 100 ops/sec, need specialized hardware &

memory

Authors put emphasis on making chip customization feasible The focus should be on designing chip generators

and not chips

Discussion Points

How are their architecture designs going to scale across multiple applications?

Their comparison baseline for a general purpose CMP is invalid. They should compare against designs having similar units

For very varied applications having specific requirements, this might just boil down into designing ASIC’s

They do not evaluate the quality of the video (and both encode time and power varies with quality)

Documents

Understanding the Sources of Inefficiency in General-Purpose Chips