Understanding the Sources of Inefficiency in General-Purpose Chips


Motivation

General-purpose processors serve a wide class of applications

Pros: Quick recovery of non-recurring engineering (NRE) costs

Cons: Low energy efficiency and poor performance

Specific applications (cell phones, video cameras) have strict needs

Video encoding is used as the representative application by the authors

H.264 Encoding Format

Input → Prediction → Transform / Quantize → Entropy Encode

•Prediction: Inter Prediction (IME, FME), Intra Prediction
•Entropy Encode: CABAC

H.264 Encoding Format

IME (Integer Motion Estimation) – Finds the closest match for an image block in a previous image and computes a vector that represents the observed motion

FME (Fractional Motion Estimation) – Refines the match to quarter-pixel resolution
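To make the IME step concrete, here is a minimal sketch of an exhaustive integer block search. The function names, the toy frames, the block size and the search range are all illustrative assumptions, not details from the slides; a real encoder would use a smarter search pattern.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def integer_motion_search(cur, ref, bx, by, n, search_range):
    """Try every integer displacement in the window; return (dx, dy, cost)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates that fall outside the reference frame.
            if not (0 <= bx + dx and bx + dx + n <= width):
                continue
            if not (0 <= by + dy and by + dy + n <= height):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


# Toy frames: the current frame is the reference shifted right by one pixel.
ref_frame = [[(7 * x + 13 * y) % 32 for x in range(8)] for y in range(8)]
cur_frame = [[ref_frame[y][max(x - 1, 0)] for x in range(8)] for y in range(8)]
motion = integer_motion_search(cur_frame, ref_frame, bx=2, by=2, n=4, search_range=2)
```

In this toy case the best match lies one pixel to the left in the reference frame, so the search returns the vector (-1, 0) with zero cost.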

H.264 Encoding Format

IP (Intra Prediction) – Uses previously encoded image blocks within the current image to form a prediction of the current block

DCT/Quantization – Transforms and quantizes the difference between the current and predicted image block

CABAC – Entropy-encodes the coefficients and other elements
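The transform step can be sketched as follows. The 4x4 integer matrix is the H.264 forward core transform, which the slides do not spell out; the uniform quantizer is a deliberately simplified stand-in (an assumption), since the real standard folds scaling factors into quantization.

```python
# H.264 4x4 forward core transform (integer approximation of the DCT).
CORE = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def matmul(a, b):
    """Multiply two 4x4 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def residual(current, predicted):
    """Difference between the current block and its prediction."""
    return [[c - p for c, p in zip(cr, pr)] for cr, pr in zip(current, predicted)]


def forward_transform(block):
    """Y = C * X * C^T for a 4x4 residual block X."""
    return matmul(matmul(CORE, block), transpose(CORE))


def quantize(coeffs, step):
    """Hypothetical uniform quantizer: divide and truncate toward zero."""
    return [[int(c / step) for c in row] for row in coeffs]


current = [[12] * 4 for _ in range(4)]
predicted = [[10] * 4 for _ in range(4)]
coeffs = forward_transform(residual(current, predicted))
levels = quantize(coeffs, step=8)
```

A flat residual ends up entirely in the top-left (DC) coefficient, which is why a good prediction leaves very little for the entropy coder to encode.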

H.264 algorithm

Input → Prediction → Transform / Quantize → Entropy Encode

•Prediction: Inter Prediction (IME, FME), Intra Prediction
•Entropy Encode: CABAC

The stages range from data parallel (Prediction) to sequential (Entropy Encode)

Percentage of execution time (H.264)

IME: 56%
FME: 36%
DCT: 7%
CABAC: 1.6%

IME + FME account for 92% of the execution time!

CABAC is small but sequential, so it becomes the bottleneck
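Why a small sequential stage becomes the bottleneck follows from Amdahl's law. The sketch below uses the 92% figure from the slide; the speedup values passed in are hypothetical.

```python
def overall_speedup(accelerated_fraction, speedup):
    """Amdahl's law: only `accelerated_fraction` of the run time gets faster."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / speedup)


# IME + FME take 92% of the time. Even if they became almost free,
# the remaining 8% caps the overall speedup at 1 / 0.08 = 12.5x.
ime_fme_fraction = 0.92
speedup_10x = overall_speedup(ime_fme_fraction, 10)
speedup_limit = overall_speedup(ime_fme_fraction, 1e12)
```

The same reasoning applies to CABAC: once every data-parallel stage is accelerated, the sequential share sets the ceiling.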

What exactly is the problem?

Processor energy breakdown (H.264):

Instruction Fetch: 33%
Register File: 10%
ALU: 6%
D-cache: 19%
Pipeline registers: 22%
Control: 10%

A 2.8 GHz Pentium 4 is 500x worse in energy than an ASIC

A four-processor Tensilica-based CMP is also 500x worse in energy than an ASIC

ASIC – Application-Specific Integrated Circuit

ASIC – Feasibility

Is it feasible?

Inflexible

Increased manufacturing and design time

Non-recurring engineering (NRE) costs: expensive to build for every different application

General Idea

Is there an incremental way of going from GP to ASIC?

What is the nature of the overheads?

A solution that has the benefits of both GP and ASIC

Provide flexibility for application experts to build customized solutions for future energy efficiency

Case study – Transform a conventional CMP into a customizable processor that is an efficient H.264 encoder

Use Tensilica tools to create the optimized processors

Baseline H.264 Implementation

The H.264 video encoding path is long and sequential

Map the five major algorithmic blocks onto a four-stage macroblock pipeline

Map the four pipeline stages onto a 4-processor CMP system

Each processor has 16KB 2-way set-associative I and D caches
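One way to read this mapping is as a macroblock pipeline with one stage per processor. The sketch below shows which macroblock each stage would handle at each time step. The stage names and the way five blocks are grouped into four stages are assumptions for illustration, not details from the slides.

```python
# Hypothetical grouping of the five algorithmic blocks into four stages.
STAGES = ["IME", "FME", "IP + DCT/Quant", "CABAC"]


def pipeline_schedule(num_macroblocks, stages=STAGES):
    """Return one dict per time step mapping stage name -> macroblock index."""
    steps = []
    for t in range(num_macroblocks + len(stages) - 1):
        step = {}
        for s, name in enumerate(stages):
            mb = t - s  # stage s lags stage 0 by s steps
            if 0 <= mb < num_macroblocks:
                step[name] = mb
        steps.append(step)
    return steps


schedule = pipeline_schedule(3)
```

With three macroblocks the pipeline needs six steps, and all four processors are busy only once it has filled.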


SIMD and ILP

Exploiting VLIW and SIMD

SIMD – Single Instruction Multiple Data

A0 A1 A2 A3
B0 B1 B2 B3
C0 C1 C2 C3
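The lane picture above can be emulated in a few lines. This is only a sketch: the choice of addition as the operation and the lane values are assumptions, and plain Python lists stand in for vector registers.

```python
def simd_add(a, b):
    """Emulate one SIMD instruction: the same add applied to every lane."""
    if len(a) != len(b):
        raise ValueError("both operands must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]


A = [1, 2, 3, 4]      # lanes A0..A3
B = [10, 20, 30, 40]  # lanes B0..B3
C = simd_add(A, B)    # lanes C0..C3 produced by a single instruction
```

A scalar processor would need four separate add instructions for the same work, each paying its own fetch and decode cost.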

SIMD and ILP

VLIW – Very Long Instruction Word

Instruction word: ADD | SUB | MUL | DIV
Execution units: ALU1 | ALU2 | ALU3 | ALU4
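A VLIW bundle can be sketched as a list of independent operations that issue together, one per ALU. The register names, the tuple encoding and the integer division are illustrative assumptions.

```python
import operator

OPS = {
    "ADD": operator.add,
    "SUB": operator.sub,
    "MUL": operator.mul,
    "DIV": operator.floordiv,
}


def issue_bundle(bundle, regs):
    """Execute one bundle: every slot reads the old register values,
    then all results are written back together."""
    results = {dst: OPS[op](regs[a], regs[b]) for op, dst, a, b in bundle}
    new_regs = dict(regs)
    new_regs.update(results)
    return new_regs


regs = {"r1": 8, "r2": 2}
bundle = [
    ("ADD", "r3", "r1", "r2"),  # slot for ALU1
    ("SUB", "r4", "r1", "r2"),  # slot for ALU2
    ("MUL", "r5", "r1", "r2"),  # slot for ALU3
    ("DIV", "r6", "r1", "r2"),  # slot for ALU4
]
regs_after = issue_bundle(bundle, regs)
```

The compiler, not the hardware, is responsible for making sure the four slots are independent of one another.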


Processor energy breakdown (SIMD and ILP)

Operation Fusion

Operation fusion – Fuses a complex instruction subgraph into a single instruction

Reduces instruction count and register file accesses

Intermediate results are consumed within the fused op

E.g.: x_n = x_{-2} - 5 x_{-1} + 20 x_0 + 20 x_1 - 5 x_2 + x_3 (pixel up-sampling)

After fusion:

acc = 0;
acc = AddShft(acc, x[0], x[1], 20);    // acc += 20 * (x_0 + x_1)
acc = AddShft(acc, x[-1], x[2], -5);   // acc += -5 * (x_{-1} + x_2)
acc = AddShft(acc, x[-2], x[3], 1);    // acc += x_{-2} + x_3
xn = Sat(acc);                         // saturate the result
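The fused sequence can be checked against the plain filter equation. In this sketch the semantics of `add_shft` (add the two pixels, scale by the coefficient, accumulate) are inferred from the equation, and the 0..255 clamp in `sat` is an assumption. H.264 also rounds and divides by 32 before clipping, which the slide leaves out and so does this sketch.

```python
def add_shft(acc, a, b, coeff):
    """Assumed fused op: acc + coeff * (a + b) in a single instruction."""
    return acc + coeff * (a + b)


def sat(value, lo=0, hi=255):
    """Saturate to an assumed 8-bit pixel range."""
    return max(lo, min(hi, value))


def upsample_unfused(x):
    """x holds [x_-2, x_-1, x_0, x_1, x_2, x_3]; one op per term."""
    xm2, xm1, x0, x1, x2, x3 = x
    return sat(xm2 - 5 * xm1 + 20 * x0 + 20 * x1 - 5 * x2 + x3)


def upsample_fused(x):
    """Same filter using three fused operations, exploiting tap symmetry."""
    xm2, xm1, x0, x1, x2, x3 = x
    acc = 0
    acc = add_shft(acc, x0, x1, 20)
    acc = add_shft(acc, xm1, x2, -5)
    acc = add_shft(acc, xm2, x3, 1)
    return sat(acc)


pixels = [1, 2, 3, 4, 5, 6]
fused_value = upsample_fused(pixels)
```

The intermediate sums never touch the register file in the fused version, which is where the energy saving comes from.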

Operation Fusion

A compiler can find promising instruction sequences to merge

Tensilica’s XPRES tries to do this automatically

The authors created the fused instructions manually

Found ~20 fusion instructions across 4 algorithmic blocks

Not a big gain

Not good enough

The problem remains: 90% of the energy still goes to overhead rather than useful computation

Need more compute per unit of overhead

Need to aggregate work into large chunks to create highly optimized functional units (FUs)
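A toy energy model shows why more compute per instruction helps. The energy units below are hypothetical, chosen only so that one operation per instruction gives the 90% overhead figure from the slide.

```python
def energy_per_op(overhead_per_instr, energy_per_alu_op, ops_per_instr):
    """Average energy per useful operation when the fixed per-instruction
    overhead is shared by `ops_per_instr` operations."""
    return overhead_per_instr / ops_per_instr + energy_per_alu_op


def compute_fraction(overhead_per_instr, energy_per_alu_op, ops_per_instr):
    """Share of instruction energy spent on useful computation."""
    useful = energy_per_alu_op * ops_per_instr
    return useful / (useful + overhead_per_instr)


# Hypothetical units: 9 units of overhead, 1 unit per ALU operation.
baseline = compute_fraction(9, 1, 1)       # one op per instruction
aggregated = compute_fraction(9, 1, 100)   # one hundred ops per instruction
```

Under these assumed numbers the useful share rises from 10% to over 90% once a single instruction carries a hundred operations.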

Magic Instructions

Perform a large amount of computation at very low cost

Achieved by creating instructions that are tightly coupled to custom data storage elements with algorithm-specific communication links

IME Strategy

SAD – Sum of absolute differences

Hundreds of SAD calculations are needed for each image block

The data for each calculation is nearly the same
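The amount of shared data between neighboring candidates can be estimated directly. In this sketch the 16x16 block size and the search range of 8 are assumed values, not figures from the slides.

```python
def shared_pixel_fraction(block_size, dx, dy):
    """Fraction of reference pixels two candidate blocks have in common
    when their positions differ by (dx, dy)."""
    overlap_w = max(block_size - abs(dx), 0)
    overlap_h = max(block_size - abs(dy), 0)
    return (overlap_w * overlap_h) / (block_size * block_size)


def candidate_count(search_range):
    """Number of integer positions in a square window of +/- search_range."""
    side = 2 * search_range + 1
    return side * side


neighbor_overlap = shared_pixel_fraction(16, 1, 0)
candidates = candidate_count(8)
```

Adjacent candidates share about 94% of their pixels, so custom storage that keeps the window on chip avoids reloading almost everything for each of the hundreds of SADs.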


FME Strategy

Pixel up-sampling example

E.g.: x_n = x_{-2} - 5 x_{-1} + 20 x_0 + 20 x_1 - 5 x_2 + x_3 (pixel up-sampling)

Normal register files require five register transfers per step

Augment them with a shift-register structure of six 8-bit-wide entries

Works like a FIFO: when a new entry arrives, all entries shift

FME Strategy

Before shift: x_{-2} x_{-1} x_0 x_1 x_2 x_3 (x_4 arrives)
After shift: x_{-1} x_0 x_1 x_2 x_3 x_4

Create a six-input multiplier/adder

For 2-D up-sampling, build a shift register that stores the horizontally up-sampled data and feeds its outputs to the vertical up-sampling units
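The shift-register idea can be sketched in software with a fixed-length queue. The class and function names are illustrative, and the output is left unnormalized and unsaturated to keep the focus on data movement.

```python
from collections import deque

TAPS = (1, -5, 20, 20, -5, 1)  # coefficients of the six-tap filter


class ShiftRegisterFilter:
    """Six-entry shift register feeding a six-input multiply-add."""

    def __init__(self):
        self.window = deque(maxlen=len(TAPS))

    def push(self, pixel):
        """Shift in one pixel; the oldest entry drops out automatically.
        Returns the filter output once all six entries are filled."""
        self.window.append(pixel)
        if len(self.window) < len(TAPS):
            return None
        return sum(t * p for t, p in zip(TAPS, self.window))


def filter_row(pixels):
    """Stream a row of pixels through the register, one per step."""
    reg = ShiftRegisterFilter()
    outputs = []
    for p in pixels:
        out = reg.push(p)
        if out is not None:
            outputs.append(out)
    return outputs


row_outputs = filter_row([1, 2, 3, 4, 5, 6, 7])
```

Each step brings in only one new pixel, compared with the five register transfers per step that the slides quote for a normal register file.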

FME Strategy

(Figure: vertical column of pixels labeled X3, X2, X1, X-1, X-2, X-3)

Other magic instructions

DCT
•Matrix transpose
•Operation fusion with no limitation on the number of operands

Intra Prediction
•Customized interconnections for different prediction modes

CABAC
•FIFO structures in the binarization module
•Fundamentally different computations fused with no restrictions

Magic Instructions: Energy (within 3x of ASIC)

Magic Instructions: Performance

Over 35% of the energy is now used in the ALUs

Most of the code now involves magic instructions

Summary

Many operations are very simple and consume little energy

SIMD / vector units parallelize well, but overheads dominate

To get ~100 ops per instruction, specialized hardware and memory are needed

Authors put emphasis on making chip customization feasible

The focus should be on designing chip generators and not chips

Discussion Points

How will their architecture designs scale across multiple applications?

Their comparison baseline of a general-purpose CMP is invalid. They should compare against designs having similar units

For very varied applications with specific requirements, this might just boil down to designing ASICs

They do not evaluate the quality of the video (and both encode time and power vary with quality)