Understanding the Sources of Inefficiency in General-Purpose Chips


General Purpose Processors serve a wide class of applications

Pros: Quick recovery of non-recurring engineering (NRE) costs

Cons: Low energy efficiency and poor performance

Specific applications (cell phones, video cameras) have strict needs

Video encoding is used as the representative application by the authors

Motivation

H.264 Encoding Format

Prediction

• Inter Prediction (IME, FME)
• Intra Prediction

Transform / Quantize

Entropy Encode

• CABAC

IME (Integer Motion Estimation) – Finds the closest match for an image block in the previous image and computes a vector to represent the observed motion

FME (Fractional Motion Estimation) – Finds a match at quarter-pixel resolution

IP (Intra Prediction) – Uses previously encoded image blocks within the current image to form a prediction of the current image block

DCT/Quantization – Transforms and quantizes the difference between the current and predicted image block

CABAC – Entropy-encodes the coefficients and other elements
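The integer motion estimation step listed above can be sketched as an exhaustive block search. This is a minimal illustration, not the encoder's actual code: the function names, the full-search strategy, and the use of sum of absolute differences as the matching cost are assumptions made for the example.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def crop(frame, top, left, size):
    """Extract a size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def integer_motion_search(cur, ref, top, left, size, search_range):
    """Hypothetical full search: return (dy, dx, cost) of the best match.

    The block at (top, left) in the current frame is compared against every
    candidate block in the reference frame within +/- search_range pixels.
    """
    target = crop(cur, top, left, size)
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue
            cost = sad(target, crop(ref, y, x, size))
            if best is None or cost < best[2]:
                best = (dy, dx, cost)
    return best
```

The returned (dy, dx) pair is the motion vector; a cost of zero means the candidate block matches exactly.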

H.264 algorithm

Prediction (Inter Prediction: IME, FME; Intra Prediction) → Transform / Quantize → Entropy Encode (CABAC)

Data parallel: Prediction, Transform / Quantize

Sequential: Entropy Encode (CABAC)

Figure: Percentage of H.264 execution time by block (IME, FME, DCT, CABAC)

IME + FME account for 92% of the execution time!

CABAC is a small share of the execution time but sequential, so it becomes the bottleneck
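Why the small sequential part matters can be seen with Amdahl's law. The sketch below assumes, as a simplification, that only the 92% spent in IME and FME is accelerated and that the remaining 8% runs at its original speed; the function name is made up for the example.

```python
def overall_speedup(accelerated_fraction, factor):
    """Amdahl's law: total speedup when only a fraction of the run time
    is sped up by `factor` and the rest is left unchanged."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / factor)


# Speeding up the 92% by 10x gives well under a 10x overall gain,
# because the untouched 8% now dominates the run time.
speedup_at_10x = overall_speedup(0.92, 10)
```

Under this assumption the overall speedup can never exceed 1 / 0.08 = 12.5x, however fast IME and FME become.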

What exactly is the problem? (H.264)

Figure legend: Instruction Fetch, Register File, ALU, D-cache, Pipeline Registers, Control

A 2.8 GHz Pentium 4 is 500x worse in energy than an ASIC

A four-processor Tensilica-based CMP is also 500x worse in energy

ASIC – Application-Specific Integrated Circuit

ASIC - Feasibility

Is it feasible? – Inflexible; increased manufacturing and design time

Non-recurring engineering costs? – Expensive to make for every different application

General Idea

Is there any incremental way of going from GP to ASIC?

What is the nature of the overheads?

A solution that has the benefits of both GP and ASIC

Provide flexibility for application experts to build customized solutions for future energy efficiency

Case study – transform a conventional CMP into a customizable processor that is an efficient H.264 encoder

Use Tensilica to create optimized processors

Baseline H.264 Implementation

H.264 video encoding path is long and sequential

Map the five major algorithmic blocks to a four-stage macroblock pipeline

Map the four macroblock pipeline stages to a 4-processor CMP system

Each processor has 16KB 2-way set-associative I & D caches


SIMD and ILP

Exploiting VLIW and SIMD

SIMD – Single Instruction Multiple Data

SIMD and ILP

VLIW – Very Long Instruction Word

Instruction word: ADD | SUB | MUL | DIV

Issued in parallel to: ALU1 | ALU2 | ALU3 | ALU4

Figure: Processor energy breakdown with SIMD and ILP

Operation Fusion

Operation fusion – Fusion of a complex instruction subgraph into a single operation

Reduces instruction count and register file accesses

Intermediate results are consumed within the op

E.g.: x_n = x_{-2} - 5x_{-1} + 20x_0 + 20x_1 - 5x_2 + x_3 (pixel upsampling)

After fusion

acc = 0;
acc = AddShft(acc, x_0, x_1, 20);
acc = AddShft(acc, x_{-1}, x_2, -5);
acc = AddShft(acc, x_{-2}, x_3, 1);
x_n = Sat(acc);
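The fused sequence above can be checked against the unfused formula with a small software model. This is only a sketch: the slide does not define AddShft or Sat, so the semantics below are assumptions (AddShft adds two pixels, scales the sum by a coefficient, and accumulates; Sat rounds, divides by 32 because the taps sum to 32, and clamps to the 8-bit range).

```python
def add_shft(acc, a, b, coeff):
    """Assumed fused op: add two pixels, scale by coeff, accumulate."""
    return acc + (a + b) * coeff


def sat(acc):
    """Assumed normalisation: round, divide by 32, clamp to 0..255."""
    return max(0, min(255, (acc + 16) >> 5))


def upsample_fused(x):
    """x holds the six neighbours x_{-2} .. x_3 in order."""
    xm2, xm1, x0, x1, x2, x3 = x
    acc = 0
    acc = add_shft(acc, x0, x1, 20)
    acc = add_shft(acc, xm1, x2, -5)
    acc = add_shft(acc, xm2, x3, 1)
    return sat(acc)


def upsample_reference(x):
    """Same filter written as the plain six-term sum, for comparison."""
    xm2, xm1, x0, x1, x2, x3 = x
    total = xm2 - 5 * xm1 + 20 * x0 + 20 * x1 - 5 * x2 + x3
    return sat(total)
```

The fused form needs three accumulate steps where the plain sum needs separate multiplies and adds, and its intermediate values never return to the register file.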

Operation Fusion

A compiler can find interesting instructions to merge

Tensilica’s XPRES tries to do this automatically

The authors created the fused instructions manually

Found ~20 fusion instructions across 4 algorithmic blocks

Not a big gain

Not good enough

The problem remains that 90% of the energy goes to overhead instructions

Need more compute per unit of overhead

Need to aggregate work into large chunks to create highly optimized functional units (FUs)
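A rough model shows why more compute per unit of overhead helps. It assumes, purely for illustration, that the overhead energy per instruction stays fixed while the useful work done by each instruction is multiplied; the function name and this fixed-overhead assumption are not from the slides.

```python
def useful_energy_fraction(work_multiplier, overhead_share=0.90):
    """Fraction of energy spent on useful work when each instruction does
    `work_multiplier` times more work at the same overhead cost.

    overhead_share is the starting overhead fraction (90% on the slide).
    """
    useful = (1.0 - overhead_share) * work_multiplier
    return useful / (useful + overhead_share)


# With no extra work per instruction the useful share stays at 10%;
# packing 9x more work into each instruction raises it to one half.
baseline = useful_energy_fraction(1)
aggregated = useful_energy_fraction(9)
```

In this model the useful share only approaches 100% when each instruction does far more work than a single simple operation.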

Magic Instructions

Can achieve a large amount of computation at very low cost

Achieved by creating instructions that are tightly connected to custom data storage elements with algorithm-specific communication links

IME Strategy

SAD – Sum of absolute differences

Hundreds of SAD calculations to get one image difference

Data for each calculation is nearly the same
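The amount of data shared between SAD calculations can be estimated by counting pixel reads. The sketch assumes a full search over a square window; the block size and search range used in the example are hypothetical values, not figures from the slides.

```python
def reference_reads(block_size, search_range):
    """Return (total_reads, distinct_pixels) for a full SAD search.

    total_reads counts every reference pixel read if each candidate block
    is fetched independently; distinct_pixels is the size of the search
    window that all candidates actually draw from.
    """
    positions = (2 * search_range + 1) ** 2
    total_reads = positions * block_size ** 2
    window_side = block_size + 2 * search_range
    distinct_pixels = window_side ** 2
    return total_reads, distinct_pixels


# Hypothetical example: 16x16 block, +/-8 pixel search range.
total, distinct = reference_reads(16, 8)
```

For these example values the search reads 73,984 pixels but touches only 1,024 distinct ones, which is the overlap that custom storage can exploit.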


FME Strategy

Pixel up-sampling example

E.g.: x_n = x_{-2} - 5x_{-1} + 20x_0 + 20x_1 - 5x_2 + x_3 (pixel upsampling)

Normal register files require five register transfers per step

Augment them with a shift register structure of six 8-bit-wide entries

Works like a FIFO: when a new entry arrives, all entries shift

FME Strategy

Before shift: x_{-2} x_{-1} x_0 x_1 x_2 x_3

After shift: x_{-1} x_0 x_1 x_2 x_3 x_4

Create a six-input multiplier/adder

For 2-D up-sampling, build a shift register that stores horizontally up-sampled data and feeds its output to the vertical up-sampling units
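The shift-register idea can be modelled in software for the one-dimensional case. This is a sketch under stated assumptions: the class and function names are invented, the tap weights come from the up-sampling formula on the slide, and the round, divide by 32, and clamp step is assumed rather than taken from the slides.

```python
from collections import deque


class SixTapShiftRegister:
    """Software model of a six-entry shift register feeding a six-input
    multiplier/adder (names and normalisation are assumptions)."""

    TAPS = (1, -5, 20, 20, -5, 1)

    def __init__(self):
        # maxlen=6 drops the oldest entry whenever a new one is appended,
        # which mimics all entries shifting by one position.
        self.entries = deque(maxlen=6)

    def push(self, pixel):
        """Shift in one pixel; return a filtered value once six are held."""
        self.entries.append(pixel)
        if len(self.entries) < 6:
            return None
        acc = sum(tap * value for tap, value in zip(self.TAPS, self.entries))
        return max(0, min(255, (acc + 16) >> 5))


def upsample_row(pixels):
    """Feed a row through the register, one new pixel per step."""
    register = SixTapShiftRegister()
    outputs = []
    for pixel in pixels:
        result = register.push(pixel)
        if result is not None:
            outputs.append(result)
    return outputs
```

Each step brings in only one new pixel, whereas a conventional register file would need several transfers to line up the six operands again.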


Other magic instructions

DCT: matrix transpose; operation fusion with no limitation on the number of operands

Intra Prediction: customized interconnections for different prediction modes

CABAC: FIFO structures in the binarization module; fundamentally different computations fused with no restrictions

Magic Instructions Energy (within 3x of ASIC)

Magic Instructions Performance

Over 35% of the energy is now used in the ALUs

Most of the code involved magic instructions

Summary

Many operations are very simple and low energy

SIMD / vector units parallelize well, but overheads dominate

To get hundreds of operations per instruction, need specialized hardware and memory

The authors put emphasis on making chip customization feasible

The focus should be on designing chip generators, not chips

Discussion Points

How are their architecture designs going to scale across multiple applications?

Their comparison baseline for a general purpose CMP is invalid. They should compare against designs having similar units

For very varied applications with specific requirements, this might just boil down to designing ASICs

They do not evaluate the quality of the video (and both encode time and power vary with quality)
