Specialized Computing ECE 5775 (Fall’17) High-Level Digital Design Automation

Specialized Computing - Cornell University Computing ECE 5775 ... The main computation of image/video processing is ... DOPA DOPB DOB 18K/36K block RAM


Specialized Computing

ECE 5775 (Fall’17) High-Level Digital Design Automation

Announcements

▸ All students enrolled in CMS & Piazza

▸ Vivado HLS tutorial on Tuesday 8/29
– Install an SSH client (MobaXterm or PuTTY)
– % ssh -X <netid>@ecelinux-01.ece.cornell.edu
– Bring your laptop!

Agenda

▸ Motivation for hardware specialization
– Case study on convolution

▸ FPGA introduction
– Basic building blocks
– Basic homogeneous FPGA architecture
– Modern heterogeneous FPGA architecture

Tension between Efficiency and Flexibility

[Figure: “The Dilemma: Flexibility vs. Efficiency”; energy efficiency in MOPS/mW plotted for programmable processing and more specialized alternatives]

Source: “High-performance Energy-Efficient Reconfigurable Accelerator Circuits for the Sub-45nm Era,” July 2011, by Ram K. Krishnamurthy, Circuits Research Labs, Intel Corp.

A Simple Single-Cycle Microprocessor

[Figure: single-cycle datapath with a program counter (PC) and adder, register file (RF), ALU, and RAM; control signals include LD, SA, SB, DR, IMM, MB, FS, MD, and MW; the ALU produces status flags V, C, Z, N]

Evaluating a Simple Expression on CPU

Step-by-step CPU activities:
R1 <= M[R0]
R2 <= M[R0+1]
R3 <= R1 + R2
M[R0+2] <= R3

[Figure: the PC / RF / ALU / RAM datapath, repeated once for each of the four steps]

Source: Adapted from Desh Singh’s talk at HCP’14 workshop

“Unrolling” the CPU Hardware

1. Replicate the CPU hardware

[Figure: four copies of the PC / RF / ALU / RAM datapath (CPU1 to CPU4) laid out in space, one per instruction: R1 <= M[R0]; R2 <= M[R0+1]; R3 <= R1 + R2; M[R0+2] <= R3]

Eliminating Unused Logic

1. Replicate the CPU hardware
2. Instruction fixed -> Remove FETCH logic
3. Remove unused ALU operations
4. Remove unused LOAD/STORE logic

[Figure: the four replicated datapaths laid out in space, one per instruction (R1 <= M[R0]; R2 <= M[R0+1]; R3 <= R1 + R2; M[R0+2] <= R3), each reduced to RF / ALU / RAM blocks with the PC removed; one stage keeps only RF and ALU]

A Special-Purpose Architecture

1. Replicate the CPU hardware
2. Instruction fixed -> Remove FETCH logic
3. Remove unused ALU operations
4. Remove unused LOAD/STORE logic
5. Wire up registers and propagate values

[Figure: dataflow circuit for R1 <= M[R0]; R2 <= M[R0+1]; R3 <= R1 + R2; M[R0+2] <= R3, built from registers R0 to R3, three adders (+), two loads (LW), and one store (SW)]

Resulting circuit can be realized with either ASIC or FPGA
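The five steps above can be sketched in C as a contrast between an interpreter-style CPU loop and the hard-wired datapath it collapses into. This is only an illustrative software model: the `Instr` encoding and the names `cpu_run`, `kProgram`, and `specialized` are assumptions made for this sketch and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-opcode ISA, just enough for the slide's example. */
typedef enum { OP_LOAD, OP_ADD, OP_STORE } Opcode;

typedef struct {
    Opcode op;
    int rd;   /* destination register (LOAD, ADD) or source register (STORE) */
    int ra;   /* base address register (LOAD, STORE) or first operand (ADD) */
    int rb;   /* second operand register (ADD only) */
    int imm;  /* address offset (LOAD, STORE only) */
} Instr;

/* General-purpose path: fetch, decode, and execute one instruction per step. */
static void cpu_run(const Instr *prog, size_t n, int *reg, int *mem)
{
    assert(prog != NULL && reg != NULL && mem != NULL);
    for (size_t pc = 0; pc < n; pc++) {
        const Instr *in = &prog[pc];          /* fetch */
        switch (in->op) {                     /* decode + execute */
        case OP_LOAD:  reg[in->rd] = mem[reg[in->ra] + in->imm]; break;
        case OP_ADD:   reg[in->rd] = reg[in->ra] + reg[in->rb];  break;
        case OP_STORE: mem[reg[in->ra] + in->imm] = reg[in->rd]; break;
        }
    }
}

/* R1 <= M[R0]; R2 <= M[R0+1]; R3 <= R1 + R2; M[R0+2] <= R3 */
static const Instr kProgram[4] = {
    { OP_LOAD,  1, 0, 0, 0 },
    { OP_LOAD,  2, 0, 0, 1 },
    { OP_ADD,   3, 1, 2, 0 },
    { OP_STORE, 3, 0, 0, 2 },
};

/* Specialized path: no fetch, no decode, no unused ALU operations.
 * Two loads feed one adder, whose output feeds one store. */
static void specialized(int *mem, int r0)
{
    mem[r0 + 2] = mem[r0] + mem[r0 + 1];
}
```

Calling `cpu_run(kProgram, 4, reg, mem)` with `reg[0] = 0` and calling `specialized(mem, 0)` leave the same value in `mem[2]`; the second version simply has nothing left to interpret.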

Understanding Energy Inefficiency of General-Purpose Processors (GPPs)

Typical Superscalar OoO Pipeline

[Figure: pipeline with L1-I$, branch predictor, fetch, decode, rename (RAT, free list), reservation station, schedule, register read/write (Int RF, FP RF), ALU, FPU, D-cache, LSQ + TLB, ROB, and commit]

Parameter                        Value
Fetch/issue/retire width         4
# Integer ALUs                   3
# FP ALUs                        2
# ROB entries                    96
# Reservation station entries    64
L1 I-cache                       32 KB, 8-way set associative
L1 D-cache                       32 KB, 8-way set associative
L2 cache                         6 MB, 8-way set associative

[source: Jason Cong, ISLPED’14 keynote]

Energy Breakdown of Pipeline Components

Component         Energy
Fetch unit        9%
Decode            6%
Rename            12%
Register files    3%
Scheduler         11%
Int ALU           14%
Mul/div           4%
FPU               8%
Memory            10%
Misc              23%

[Figure: the superscalar OoO pipeline from the previous slide, with the control portion marked]

Removing “Non-Computing” Portions

[Figure: energy breakdown pie chart: Fetch unit 9%, Decode 6%, Rename 12%, Register files 3%, Scheduler 11%, Int ALU 14%, Mul/div 4%, FPU 8%, Memory 10%, Misc 23%]

▸ “Computing” portion: 10% (memory) + 26% (compute: Int ALU 14% + Mul/div 4% + FPU 8%) = 36%

Energy Comparison of Processor ALUs and Dedicated Units

▸ Why are processor units so expensive?
– ALU can perform multiple operations
• Add/sub/bitwise XOR/OR/AND
– 64-bit ALU
– Dynamic/domino logic used to run at high frequency
• Higher power dissipation

Operation                        Processor ALU       45 nm TSMC standard cell library
32-bit add                       0.122 nJ @ 2 GHz    0.002 nJ @ 1 GHz
32-bit multiply                  0.120 nJ @ 2 GHz    0.007 nJ @ 1 GHz
Single-precision FP operation    0.150 nJ @ 2 GHz    0.008 nJ @ 500 MHz
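Dividing the processor ALU figure by the standard-cell figure for each operation gives a rough sense of the gap. This is only a back-of-the-envelope reading of the numbers on the slide; the two sets of figures are quoted at different clock frequencies, so it is not a like-for-like comparison.

```latex
\begin{align*}
\text{32-bit add:}\quad & \frac{0.122\ \text{nJ}}{0.002\ \text{nJ}} = 61\times \\
\text{32-bit multiply:}\quad & \frac{0.120\ \text{nJ}}{0.007\ \text{nJ}} \approx 17\times \\
\text{Single-precision FP:}\quad & \frac{0.150\ \text{nJ}}{0.008\ \text{nJ}} \approx 19\times
\end{align*}
```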

Energy Breakdown with Standard-Cell ASICs

[Figure: energy breakdown pie chart: Fetch unit 9%, Decode 6%, Rename 12%, Register files 3%, Scheduler 11%, Int ALU 0.2%, Mul/div 0.2%, FPU 0.4%, ALU/FPU savings 25.0%, Memory 10%, Misc 23%]

▸ “Computing” portion: 10% (memory) + ~1% (compute) = 11%

Only 10X gain is attainable?
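One way to read the 10X figure is as an upper bound: if everything except the “computing” portion were eliminated, energy would drop from 100% to roughly 11%. The arithmetic below is a sketch under that assumption, using only the percentages quoted on the slides.

```latex
\begin{align*}
E_{\text{compute, GPP}} &= 14\% + 4\% + 8\% = 26\% \\
E_{\text{compute, ASIC}} &= 0.2\% + 0.2\% + 0.4\% = 0.8\% \approx 1\% \\
E_{\text{computing}} &\approx 10\%\ (\text{memory}) + 1\%\ (\text{compute}) = 11\% \\
\text{gain} &\approx \frac{100\%}{11\%} \approx 9\times
\end{align*}
```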

Additional Energy Savings from Specialization

▸ Specialized memory architecture
– Exploit regular memory access patterns to minimize energy per memory read/write

▸ Specialized communication architecture
– Exploit data movement patterns to optimize the structure/topology of the on-chip interconnection network

▸ Customized data type
– Exploit data range information to reduce bitwidth/precision and simplify arithmetic operations

These techniques combined can lead to another 10-100X energy efficiency improvement over GPPs
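As a small worked example of the customized data type point, assume 8-bit unsigned pixels (0 to 255) and a 3x3 kernel whose rows are (-1, -2, -1), (0, 0, 0), and (1, 2, 1). Both the pixel range and the kernel are assumptions chosen for illustration. The output range then fixes the bit width:

```latex
\begin{align*}
y_{\max} &= (1 + 2 + 1)\cdot 255 = 1020 \\
y_{\min} &= -(1 + 2 + 1)\cdot 255 = -1020 \\
-2^{10} = -1024 &\le y \le 1023 = 2^{10} - 1
  \quad\Rightarrow\quad \text{an 11-bit two's-complement value suffices}
\end{align*}
```

Under these assumptions, an 11-bit accumulator can stand in for a default 32-bit integer, which is the kind of saving that bit-level sizing targets.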

Case Study: Memory Specialization for Convolution

▸ The main computation in image/video processing is performed over overlapping stencils, termed convolution

3x3 convolution kernel:
-1 -2 -1
 0  0  0
 1  2  1

[Figure: the 3x3 window sliding over the input image frame (rows and columns 0 to 6) to produce the output image frame]

Example Application: Edge Detection

▸ Identifies discontinuities in an image where brightness (or image intensity) changes sharply
– Very useful for feature extraction in computer vision

[Figure: edge detection with the Sobel operator G = (G_X, G_Y)]

Figures: Pilho Kim, GaTech

CPU Implementation of Convolution

[Figure: CPU connected to main memory through a cache]

for (n = 1; n < height - 1; n++)
  for (m = 1; m < width - 1; m++) {
    out[n][m] = 0;  /* clear the accumulator for this output pixel */
    for (i = 0; i < 3; i++)
      for (j = 0; j < 3; j++)
        /* window is centered on (n, m), so offsets run from -1 to +1 */
        out[n][m] += img[n + i - 1][m + j - 1] * f[i][j];
  }

General-Purpose Cache for Convolution

▸ Minimizes main memory accesses to improve performance

▸ A general-purpose cache is expensive in cost and incurs nontrivial energy overhead

[Figure: input picture, W pixels wide, held in the cache]

Specializing Cache for Convolution (1)

▸ Remove rows that are not in the neighborhood of the convolution window

[Figure: only the W-pixel-wide rows that overlap the window are kept]

Specializing Cache for Convolution (2)

▸ Rearrange the rows as a 1D array of pixels
▸ Each time the window moves one pixel to the right, push the new pixel into the “cache”

Remove the edge pixels that are not needed for computation

[Figure: the W-pixel rows laid end to end as a 1D array; the new pixel enters at one end and the old pixel leaves at the other]

A Specialized “Cache”: Line Buffer

▸ Line buffer: a fixed-width “cache” with (K-1)*W+K pixels in flight
– Fixed addressing: Low area/power and high performance

▸ In a customized FPGA implementation, line buffers can be efficiently implemented with on-chip BRAMs

[Figure: line buffer holding 2W+3 pixels (with K=3); the new pixel is pushed in as the old pixel is dropped]
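The line buffer idea can be sketched in software as a shift register of (K-1)*W+K pixels that is fed one pixel per step in raster order. This is a minimal model under stated assumptions, not the course's implementation: the function name `conv3x3_linebuffer`, the flattened row-major image layout, and the `MAX_W` bound are all hypothetical choices made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define K 3                                /* kernel size, as on the slide */
#define MAX_W 64                           /* assumed maximum image width */
#define BUF_LEN(w) ((K - 1) * (w) + K)     /* (K-1)*W+K pixels in flight */

/* Streams a row-major image through a line buffer and writes the 3x3
 * convolution of each interior pixel; border outputs are set to zero. */
void conv3x3_linebuffer(const int *img, int *out, int height, int width,
                        const int f[K][K])
{
    int buf[BUF_LEN(MAX_W)];
    int len = BUF_LEN(width);

    assert(width >= K && width <= MAX_W && height >= K);
    memset(buf, 0, sizeof buf);
    memset(out, 0, (size_t)height * (size_t)width * sizeof out[0]);

    for (int r = 0; r < height; r++) {
        for (int c = 0; c < width; c++) {
            /* Drop the oldest pixel and push in the new one. */
            memmove(&buf[0], &buf[1], (size_t)(len - 1) * sizeof buf[0]);
            buf[len - 1] = img[r * width + c];  /* one image read per pixel */

            /* Once the newest pixel is the bottom-right corner of a full
             * window, window element (i, j) sits at buf[i * width + j]. */
            if (r >= K - 1 && c >= K - 1) {
                int acc = 0;
                for (int i = 0; i < K; i++)
                    for (int j = 0; j < K; j++)
                        acc += buf[i * width + j] * f[i][j];
                out[(r - 1) * width + (c - 1)] = acc;
            }
        }
    }
}
```

Each input pixel is read from `img` exactly once, and the window taps are at fixed offsets into the buffer. In software the `memmove` costs time proportional to W; in hardware the same shift is performed by the storage itself, which is what makes the fixed addressing cheap.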

What is an FPGA?

▸ FPGA: Field-Programmable Gate Array
– An integrated circuit designed to be configured by a customer or a designer after manufacturing (Wikipedia)

▸ Components in an FPGA Chip
– Programmable logic blocks
– Programmable interconnects
– Programmable I/Os

Three Important Pieces

▸ Pass transistor (controlled by an SRAM bit)
▸ Multiplexer (controlled by SRAM bits)
▸ Lookup table (LUT, formed by SRAM bits)

▸ SRAM-based implementation is popular
– SRAM is built in the standard CMOS process; a non-standard technology (e.g., flash or antifuse) would mean an older technology generation

Multiplexer as a Universal Gate

▸ Any function of k variables can be implemented with a 2^k:1 multiplexer

Full-adder truth table:
A     0 0 0 0 1 1 1 1
B     0 0 1 1 0 0 1 1
Cin   0 1 0 1 0 1 0 1
S     0 1 1 0 1 0 0 1
Cout  0 0 0 1 0 1 1 1

[Figure: 8:1 MUX with data inputs 0 to 7 and select lines S2, S1, S0 driving Cout; the eight data-input values are left as a question (????????)]
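The mapping from a truth table to multiplexer data inputs can be sketched in C as follows. The assignment of A, B, and Cin to select lines S2, S1, and S0 is an assumption of this sketch, as are the names `mux8`, `kSumInputs`, and `kCoutInputs`; the constant arrays are simply the S and Cout columns of the full-adder truth table.

```c
#include <assert.h>

/* 8:1 multiplexer: selects one of eight constant data inputs.
 * s2 is the most significant select bit. */
static int mux8(const int data[8], int s2, int s1, int s0)
{
    assert(s2 >= 0 && s2 <= 1);
    assert(s1 >= 0 && s1 <= 1);
    assert(s0 >= 0 && s0 <= 1);
    return data[(s2 << 2) | (s1 << 1) | s0];
}

/* Data inputs copied from the truth-table columns,
 * in row order A B Cin = 000, 001, ..., 111. */
static const int kSumInputs[8]  = {0, 1, 1, 0, 1, 0, 0, 1};
static const int kCoutInputs[8] = {0, 0, 0, 1, 0, 1, 1, 1};

/* Assumed wiring: A -> S2, B -> S1, Cin -> S0. */
int full_adder_sum(int a, int b, int cin)
{
    return mux8(kSumInputs, a, b, cin);
}

int full_adder_cout(int a, int b, int cin)
{
    return mux8(kCoutInputs, a, b, cin);
}
```

Changing the function only changes the eight constants; the multiplexer itself stays the same.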

How Many Functions?

▸ How many distinct 3-input 1-output Boolean functions exist?

▸ What about K inputs?
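A counting argument answers both questions: a K-input function has 2^K truth-table rows, and each row's output can be chosen independently as 0 or 1.

```latex
\begin{align*}
N(K) &= 2^{\,2^{K}} \\
N(3) &= 2^{\,2^{3}} = 2^{8} = 256
\end{align*}
```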

Look-Up Table (LUT)

▸ A k-input LUT (k-LUT) can be configured to implement any k-input 1-output combinational logic
– 2^k SRAM bits
– Delay is independent of logic function

[Figure: a 3-input LUT; eight SRAM bits (each 0/1) feed a MUX selected by inputs x2, x1, x0, producing output Y]
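A k-input LUT can be modeled in C as a 2^k-bit configuration word indexed by the inputs. The sketch below is a behavioral model only; the `Lut` type, the limit of k <= 6 (so the configuration fits in 64 bits), and the helper names are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a k-input LUT with k <= 6: the 2^k SRAM configuration bits are
 * packed into a 64-bit word, bit i holding the output for input pattern i. */
typedef struct {
    unsigned k;
    uint64_t config;
} Lut;

/* Evaluate the LUT: the inputs only form an address into the configuration
 * bits, so the work is the same whatever function is programmed. */
static int lut_eval(const Lut *lut, unsigned inputs)
{
    assert(lut->k >= 1 && lut->k <= 6);
    assert(inputs < (1u << lut->k));
    return (int)((lut->config >> inputs) & 1u);
}

/* Build the configuration bits for any function by enumerating its
 * truth table, one input pattern at a time. */
static Lut lut_program(unsigned k, int (*fn)(unsigned inputs))
{
    Lut lut = { k, 0 };
    assert(k >= 1 && k <= 6);
    for (unsigned i = 0; i < (1u << k); i++)
        if (fn(i))
            lut.config |= (uint64_t)1 << i;
    return lut;
}

/* Two example functions of three inputs (bit 0 = x0, bit 1 = x1, bit 2 = x2). */
static int parity3(unsigned x)
{
    return (int)((x ^ (x >> 1) ^ (x >> 2)) & 1u);
}

static int majority3(unsigned x)
{
    return ((x & 1u) + ((x >> 1) & 1u) + ((x >> 2) & 1u)) >= 2u;
}
```

`lut_program(3, parity3)` and `lut_program(3, majority3)` produce two different 8-bit configurations for the same structure, and `lut_eval` performs the same single lookup for either one, which is the software analogue of the delay being independent of the logic function.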

How Many LUTs?

▸ How many 3-input LUTs are needed to implement the following full adder?

▸ How about using 4-input LUTs?

A  B  Cin  Cout  S
0  0  0    0     0
0  0  1    0     1
0  1  0    0     1
0  1  1    1     0
1  0  0    0     1
1  0  1    1     0
1  1  0    1     0
1  1  1    1     1

A Logic Element

▸ A k-input LUT is usually followed by a flip-flop (FF) that can be bypassed
▸ The LUT and FF combined form a logic element

[Figure: a LUT followed by a bypassable flip-flop]

A Logic Block

▸ A logic block clusters multiple logic elements

▸ Example: In Xilinx 7-series FPGAs, each configurable logic block (CLB) has two slices
– Two independent carry chains per CLB for implementing adders
– Each slice contains four LUTs

[Figure: a CLB with two SLICEs, each with its own CIN/COUT carry chain, attached to a crossbar switch]

Traditional Homogeneous FPGA Architecture

[Figure: array of logic blocks connected by routing tracks and switch blocks]

Modern Heterogeneous Field-Programmable System-on-Chip

▸ Island-style configurable mesh routing
▸ Lots of dedicated components
– Memories/multipliers, I/Os, processors
– Specialization leads to higher performance and lower power

[Figure credit: embeddedrelated.com]

Dedicated DSP Blocks

▸ Built-in components for fast arithmetic operations, optimized for DSP applications
– Essentially a multiply-accumulate core with many other features
– Fixed logic and connections; functionality may be configured using control signals at run time
– Much faster than LUT-based implementation (ASIC vs. LUT)

Example: Xilinx DSP48E Slice

▸ 25x18 signed multiplier
▸ 48-bit add/subtract/accumulate
▸ 48-bit logic operations
▸ SIMD operations (12/24 bit)
▸ Pipeline registers for high speed

[source: Xilinx Inc.]

Finite Impulse Response (FIR) Filter Mapped to DSP Slices

$y[n] = \sum_{i=0}^{N} c_i \, x[n-i]$

[Figure: four cascaded DSP slices with coefficients C0 to C3, input x(n), a cascade input of 0, and output y(n); the annotated bus widths are 18 and 38 bits]

[source: Xilinx Inc.]
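The FIR sum can be written directly in C as one multiply-accumulate per tap, which is the operation each DSP slice performs in the mapped design. This is a plain software sketch; the function name `fir_filter`, the choice of 32-bit samples with a 64-bit accumulator, and the zero-padding of samples before the start of the input are assumptions of this example rather than details from the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Direct-form FIR filter: y[n] = sum over i = 0..N of c[i] * x[n - i].
 * 'taps' is N + 1, the number of coefficients.
 * Samples before the start of the input (n - i < 0) are treated as zero. */
void fir_filter(const int32_t *c, size_t taps,
                const int32_t *x, int64_t *y, size_t len)
{
    assert(c != NULL && x != NULL && y != NULL && taps > 0);
    for (size_t n = 0; n < len; n++) {
        int64_t acc = 0;                      /* wide accumulator */
        for (size_t i = 0; i < taps && i <= n; i++)
            acc += (int64_t)c[i] * x[n - i];  /* one multiply-accumulate per tap */
        y[n] = acc;
    }
}
```

Feeding a unit impulse through the filter returns the coefficients in order, which is a quick sanity check for any FIR implementation.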

Dedicated Block RAMs (BRAMs)

▸ Example: Xilinx 18K/36K block RAMs
– 32k x 1 to 512 x 72 in one 36K block
– Simple dual-port and true dual-port configurations
– Built-in FIFO logic
– 64-bit error correction coding per 36K block

[Figure: 18K/36K block RAM with port A signals DIA, DIPA, ADDRA, WEA, ENA, CLKA, DOA, DOPA and port B signals DIB, DIPB, ADDRB, WEB, ENB, CLKB, DOB, DOPB]

[source: Xilinx Inc.]
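The two extreme aspect ratios quoted for a 36K block can be checked with a line of arithmetic. The interpretation of the 4 Kb difference as parity bits (carried on the DIP/DOP ports, available only in the wider configurations) is this note's reading and should be confirmed against the vendor documentation.

```latex
\begin{align*}
32\text{K} \times 1 &= 32{,}768\ \text{bits} = 32\ \text{Kb} \\
512 \times 72 &= 36{,}864\ \text{bits} = 36\ \text{Kb} \\
36\ \text{Kb} - 32\ \text{Kb} &= 4\ \text{Kb}
\end{align*}
```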

Embedded FPGA System-on-Chip

Xilinx Zynq All Programmable System-on-Chip
– Dual ARM Cortex-A9 + NEON SIMD extension @ 600 MHz to 1 GHz
– Up to 350K logic cells
– 2 MB block RAM
– 900 DSP48s

[Source: Xilinx Inc.]

FPGA as an Accelerator for Cloud Computing

▸ Massive amount of fine-grained parallelism

▸ Silicon configurable to fit the application

▸ Performance/watt advantage over CPUs & GPUs

AWS Cloud F1 FPGA instance: Xilinx UltraScale+ VU9P
– ~2 million logic blocks
– ~5000 DSP blocks
– ~300 Mb block RAM

[Figure source: David Pellerin, AWS]

Microsoft Deploying FPGAs in Datacenter

▸ FPGAs deployed in Microsoft datacenters to accelerate various web, database, and AI services
– e.g., project BrainWave claimed ~40 teraflops on large recurrent neural networks using Intel Stratix 10 FPGAs

[source: Microsoft, BrainWave, HotChips’2017]

Summary: FPGA as a Programmable Accelerator

▸ Massive amount of fine-grained parallelism
– Highly parallel and/or deeply pipelined to achieve maximum parallelism
– Distributed data/control dispatch

▸ Silicon configurable to fit algorithm
– Compute the exact algorithm at the desired level of numerical accuracy
• Bit-level sizing and sub-cycle chaining
– Customized memory hierarchy

▸ Performance/watt advantage
– Low power consumption compared to CPUs and GPGPUs
• Low clock speed
• Specialized architecture blocks

Next Class

▸ Vivado HLS tutorial (led by “TAs”)

Acknowledgements

▸ These slides contain/adapt materials developed by
– Prof. Jason Cong (UCLA)