ENG6530 Reconfigurable Computing Systems
Introduction to Reconfigurable Computing
ENG6530 RCS 2
Topics
• The VLSI design process
• Application-specific integrated circuits
• Issues related to power consumption
• Issues related to scaling
• The traditional Von Neumann architecture: limitations and enhancements
• Reconfigurable computing (fill the gap!) and research issues
• Summary
References
I. “Reconfigurable Computing: Accelerating Computation with FPGAs”, by Maya Gokhale
II. “Introduction to Reconfigurable Computing: Architectures, Algorithms and Applications”, by C. Bobda
III. “Computer Organization and Design”, by Patterson and Hennessy
IV. “Digital Integrated Circuits: A Design Perspective”, by Jan Rabaey
Computing Devices Everywhere!
PDAs, body-worn devices, entertainment, household appliances, communication, home networking, cars, medicine, PCs, supercomputers, game consoles
The Transistor Revolution
• First transistor: Bell Labs, 1948
• Bipolar logic: 1960s
• Intel 4004 processor: designed in 1971, almost 3000 transistors, 1 MHz operation
The VLSI Design Process
Specification → Functional design → Logic design → Circuit design → Physical design → Test/Fabrication
Physical design itself comprises partitioning, floorplanning & placement, and routing.
The process is very costly and time consuming.
Productivity Gap in VLSI Design
A growing gap exists between design complexity and design productivity.
MOSFET: Metal Oxide Semiconductor Field Effect Transistor
• A voltage-controlled device
• Handles less current than a BJT (slower)
• Dissipates less power
• Achieves higher density on an IC
• Has full voltage swing (0–5 V)
MOS Transistors – Types and Symbols
[Figure: schematic symbols, each with gate (G), source (S), and drain (D) terminals: enhancement NMOS, depletion NMOS, enhancement PMOS, and NMOS with bulk contact (B)]
Circuits can be built using either NMOS transistors or PMOS transistors.
Implementing Logic using nMOS vs. pMOS Devices
Complementary MOS (CMOS)
• NMOS transistors pass a "strong" 0 but a "weak" 1
• PMOS transistors pass a "strong" 1 but a "weak" 0
• Combining both leads to circuits that can pass strong 0's and strong 1's
Static Complementary MOS (CMOS)
A static CMOS gate F(In1, In2, …, InN) consists of a pull-up network (PUN) of PMOS transistors only, connected to VDD, and a pull-down network (PDN) of NMOS transistors only, connected to VSS. PUN and PDN are dual logic networks.
At every point in time (except during switching transients) each gate output is connected to either VDD or VSS via a low-resistance path.
CMOS Inverter (normal output)
A | Y
0 | 1
1 | 0
The PMOS device forms the pull-up network; the NMOS device forms the pull-down network.
CMOS Inverter Operation
• When A = 1: the NMOS is ON, the PMOS is OFF, and Y = 0.
• When A = 0: the NMOS is OFF, the PMOS is ON, and Y = 1.
Example Gates: NAND and NOR
Complex CMOS Gate
OUT = D + A · (B + C)
[Figure: pull-up (PMOS) and pull-down (NMOS) networks built from transistors driven by inputs A, B, C, and D]
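As a quick sanity check, the gate's Boolean function can be tabulated in software (a Python sketch, not part of the original slides; the helper name is invented):

```python
from itertools import product

def complex_gate(a, b, c, d):
    """Logic function of the complex CMOS gate: OUT = D + A*(B + C)."""
    return d or (a and (b or c))

# Tabulate all 16 input combinations.
for a, b, c, d in product([0, 1], repeat=4):
    print(a, b, c, d, "->", int(complex_gate(a, b, c, d)))
```

In the actual CMOS gate the PDN implements this expression with NMOS devices and the PUN implements its dual with PMOS devices.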
Sources of Power Consumption
Power has three components:
• Static power: consumed when the input isn't switching
• Dynamic capacitive power: due to charging and discharging of the load capacitance
• Dynamic short-circuit power: direct current from VDD to GND while both transistors are momentarily on
Dynamic Power
• Dynamic power is required to charge and discharge load capacitances when transistors switch.
• One cycle involves a rising and a falling output:
  – On a rising output, charge Q = C·VDD is drawn from the supply
  – On a falling output, the charge is dumped to GND
[Figure: inverter driving a load, showing (1) the short-circuit current and (2) the charge/discharge current iDD(t) drawn from VDD]
Dynamic Power
Averaging the supply current over a clock period T gives:

P_dynamic = (1/T) ∫₀ᵀ i_DD(t) · V_DD dt
          = (V_DD / T) ∫₀ᵀ i_DD(t) dt
          = (V_DD / T) · (T · f_sw · C · V_DD)
          = C · V_DD² · f_sw

Short-circuit power is typically less than 10% of dynamic power.
Lowering Dynamic Power
P_dyn = C_L · V_DD² · f
• Capacitance: a function of fan-out, wire length, and transistor sizes
• Supply voltage: has been dropping with successive generations
• Clock frequency: increasing…
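To make the formula concrete, here is a small calculation (a Python sketch; the component values are assumed for illustration, not figures from the slides):

```python
def dynamic_power(c_load, vdd, f_sw):
    """Dynamic capacitive power: P = C * VDD^2 * f_sw (watts)."""
    return c_load * vdd ** 2 * f_sw

# Assumed example: 100 fF switched load, 1.2 V supply, 1 GHz switching rate.
p = dynamic_power(100e-15, 1.2, 1e9)
print(f"{p * 1e3:.3f} mW")  # → 0.144 mW
```

The quadratic dependence on V_DD is why supply-voltage scaling has been the most effective dynamic-power lever.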
Static Power Consumption
• Static current: in CMOS there is no static current as long as Vin < VTN or Vin > VDD + VTP
• Leakage current: determined by the "off" transistor; influenced by transistor width, supply voltage, and transistor threshold voltages
Leakage mechanisms:
• Junction leakage
• Gate-oxide leakage
• Subthreshold leakage
Digital Circuit Implementation Approaches
Implementation choices (target technology):
• Custom
• Semicustom
  – Cell-based: standard cells, macro cells
  – Array-based: pre-diffused (gate arrays), pre-wired (FPGAs, PLDs)
Design Style I: Full Custom
The designer specifies the layout of each individual transistor and connection.
Design Styles
• Full Custom
  – Utilized for large production-volume chips such as microprocessors.
  – No restriction on the placement of functional blocks and their interconnections.
  – Highly optimized, but labour intensive.
Design Style II: Gate Array
• Array of prefabricated gates/transistors
[Figure: sea-of-gates and channel-based gate arrays; a NAND gate built using gate isolation, with unused transistors available in principle to the adjacent cell]
Design Styles
• Gate Arrays
  – Pre-fabricated array of gates (could be NAND).
  – The design is mapped onto the gates, and the interconnections are routed.
Design Style III: Standard Cells
[Figure: rows of cells with routing channels and IO cells. Courtesy R. K. Brayton (UCB) and A. Kuehlmann (Cadence)]
Design Styles
• Standard Cell
  – Utilized for smaller-production ASICs that are generated by synthesis tools.
  – Layout is arranged in rows of cells that perform computation.
  – Routing is done in "channels" between the rows.
Field Programmable Gate Array (FPGA)
• Pre-fabricated array of programmable logic and interconnections.
• No fabrication step required.
Design Style Comparisons

STYLE             Full Custom   Standard Cell   Gate Array   FPGA
Cell size         Variable      Fixed height    Fixed        Fixed
Cell type         Variable      Variable        Fixed        Prog.
Cell placement    Variable      In row          Fixed        Fixed
Interconnections  Variable      Variable        Variable     Prog.
Design cost       High          Medium          Medium       Low
An ASIC (Application-Specific Integrated Circuit) is an IC customized for a particular use, rather than intended for general-purpose use, e.g., phones.
[Figure: an ASIC combining full-custom blocks, dual-port RAM, single-port RAM, a FIFO, and standard cells]
Technology Scaling
Technology scaling has a threefold objective:
• Increase the transistor density
• Reduce the gate delay
• Stabilize dynamic power consumption
The main challenges faced are static power consumption, power delivery, and power density, which determine the performance of the chip. Finding solutions to these challenges makes technology scaling an important issue for both academia and industry.
VLSI Trends: Moore's Law
In 1965, Gordon Moore predicted that transistors would continue to shrink, allowing:
• Doubled transistor density every 18–24 months
• Doubled performance every 18–24 months
History has proven Moore right. But is the end in sight?
• Physical limitations
• Economic limitations
Gordon Moore, Intel co-founder and chairman emeritus. Image source: Intel Corporation, www.intel.com
Technology Scaling: Moore's Law
Amazingly visionary: the million-transistor/chip barrier was crossed in the 1980s.
• 2300 transistors, 1 MHz clock (Intel 4004) – 1971
• 16 million transistors (UltraSPARC III)
• 42 million transistors, 2 GHz clock (Intel Xeon) – 2001
• 55 million transistors, 3 GHz, 130 nm technology, 250 mm² die (Intel Pentium 4) – 2004
• 140 million transistors (HP PA-8500)
• 1 billion transistors (SoC)
Moore's Law in Microprocessors
[Chart: transistor count (millions, log scale) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6 processors, 1970–2010, showing 2X growth in 1.96 years]
Transistors on lead microprocessors double every 2 years.
Frequency
[Chart: clock frequency (MHz, log scale) for the 4004 through P6/Pentium processors, 1970–2010]
Lead-microprocessor frequency doubles every 2 years.
Power Dissipation!
[Chart: power (watts, log scale) for the 4004 through P6/Pentium processors, 1971–2000]
Lead-microprocessor power continues to increase.
Leakage or "Static" Power
• Leakage power is becoming a substantial portion of total power.
• Trends project that leakage power will account for up to 50% of total power.
• Not a practical limit of power dissipation.
Power Dissipation Projection
• Considering only active power, power dissipation increases from 100 W in 1999 to 2000 W in 2010.
• The dotted line indicates power dissipation including leakage power.
Interconnect Scaling
Higher densities are only possible if the interconnects also scale:
• Reduced width → increased resistance
• Denser interconnects → higher capacitance
To account for increased parasitics and integration complexity, more interconnection layers are added:
• thinner and tighter layers → local interconnections
• thicker and sparser layers → global interconnections and power
Higher resistance and capacitance lead to higher RC delay.
Technology Trends: Interconnect Delay
[Chart: typical gate delay vs. interconnect delay (ns) as the minimum feature size shrinks from 2.0 µm to 0.35 µm]
Interconnect delay has become the performance limiter as technology shrinks into the sub-micron regime.
Technology Trend and Challenges
(Source: ITRS'03)
• Interconnect determines the overall performance
• In addition: noise, power => design closure
• Furthermore: manufacturability => manufacturing closure
Application-Specific Integrated Circuits
ASIC advantages:
• Best performance (speed)
• Smallest chip (area)
• Best power consumption
ASIC limitations:
• Inflexible (fixed!)
• Expensive to design
• High fixed costs require large production runs
• Requires skillful designers
How do we cope with complexity & cost?
Intellectual Property (IP)-based Design
System on Chip (SoC)
• Increasing complexity: MPSoC, NoC
• Assembly of "prefabricated components"
• Maximize VC (IP) reuse: over 90%
• New economics: a fast and correct design > an optimum design
• Design and verification move to the system level
• Interfaces between VCs and SW become more important
[Figure: an SoC combining a µP, memory, video unit, graphics, communications, DSP, custom logic, and software]
The Von Neumann Computer
Principle: In 1945, the mathematician John von Neumann demonstrated in a study of computation that a computer could have a simple structure, capable of executing any kind of program, given a properly programmed control unit, without the need for hardware modification.
ENIAC – the first electronic computer (1946)
The Von Neumann Computer
Structure:
• A memory for storing program and data. The memory consists of words of the same length.
• A control unit (control path) featuring a program counter for controlling program execution.
• An arithmetic and logic unit (ALU), also called the data path, for program execution.
[Figure: processor (central processing unit) containing the datapath, control path, registers, PC, instruction register, and address register, connected to memory by data and address buses carrying data and instructions]
The Von Neumann Computer
Coding: a program is coded as a set of instructions to be sequentially executed.
Program execution proceeds in a cycle (IF, D, R, EX, W):
• Instruction Fetch (IF): the next instruction to be executed is fetched from memory
• Decode (D): the instruction is decoded (which operation?)
• Read operands (R): operands are read from memory
• Execute (EX): the operation is executed on the ALU
• Write result (W): results are written back to memory
What is the problem with this computing paradigm?
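The cycle above can be sketched as a toy interpreter (a Python sketch of the idea, not any real ISA; the three-instruction set and addresses are invented for illustration). Note that program and data share one memory, which is the defining Von Neumann trait:

```python
# Toy Von Neumann machine: program and data live in the same memory.
memory = {
    0: ("LOAD", 100),   # acc <- mem[100]
    1: ("ADD", 101),    # acc <- acc + mem[101]
    2: ("STORE", 102),  # mem[102] <- acc
    3: ("HALT", None),
    100: 7, 101: 5, 102: 0,
}

pc, acc = 0, 0
while True:
    op, addr = memory[pc]          # IF: fetch (and D: decode)
    pc += 1
    if op == "HALT":
        break
    value = memory.get(addr)       # R: read operand
    if op == "LOAD":               # EX: execute
        acc = value
    elif op == "ADD":
        acc += value
    elif op == "STORE":
        memory[addr] = acc         # W: write result

print(memory[102])  # → 12
```

Every step funnels through the single memory, one instruction at a time, which previews the bottleneck discussed next.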
Bottlenecks in the VN Architecture
Processor-Memory Performance Gap
[Chart ("Moore's Law"): relative performance, 1 to 10000, by year; µProc improves 55%/year (2X/1.5 yr) while DRAM improves 7%/year (2X/10 yrs); the processor-memory performance gap grows 50%/year]
The Von Neumann Computer
Advantages:
• Simplicity.
• Flexibility: any well-coded program can be executed.
Drawbacks:
• Speed efficiency: not efficient, due to sequential program execution (temporal resource sharing).
• Resource efficiency: only one part of the hardware resources is required for the execution of an instruction; the rest remains idle.
• Memory access: memories are about 5 times slower than the processor.
How can we compensate for these deficiencies?
Improving Performance of VN (GPPs)
1. Technology scaling: improve performance (increase clock frequency!)
2. Improving the instruction set of the processor
3. Application-specific processors (DSPs)
4. Use of a hierarchical memory system: caches can enhance speed
5. Multiplicity of functional units (H/W): adders/multipliers/dividers (CDC 6600)
6. Pipelining within the CPU (H/W): e.g., a four-stage pipeline (IF/ID/OF/EX)
7. Overlapping CPU & I/O operations (H/W): DMA (Direct Memory Access) can be used to enhance performance
8. Time sharing (S/W): multi-tasking assigns fixed or variable time slices to multiple programs
9. Parallelism & multithreading (S/W, H/W): compilers / multi-core systems
Technology Scaling
• Technology scaling tends to increase transistor density and enhance performance.
• Scaling is essential to ensure sustained growth in the IC industry to meet future needs.
• However, the main challenges faced are:
  – Power consumption
  – Manufacturing issues, i.e., yield
  – New CAD tools that can support newer technologies
Instruction Set of the Processor
CPU execution time (CPU time) – the time the CPU spends working on a task. It does not include time spent waiting for I/O or running other programs.

CPU execution time = (# CPU clock cycles for a program) × (clock cycle time)
or
CPU execution time = (# CPU clock cycles for a program) / (clock rate)

We can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program.
Example: Improving Performance
Our favorite program runs in 10 seconds on computer A, which has a 4 GHz clock. We are trying to help a computer designer build a computer, B, that will run this program in 6 seconds. The designer has determined that a substantial increase in the clock rate is possible, but this increase will affect the rest of the CPU design, causing computer B to require 1.2 times as many clock cycles as computer A for this program. What clock rate should we tell the designer to target?

CPU timeA = CPU clock cyclesA / clock rateA
10 seconds = CPU clock cyclesA / (4 × 10⁹ cycles/second)
CPU clock cyclesA = 10 seconds × 4 × 10⁹ cycles/sec = 40 × 10⁹ cycles

CPU timeB = 1.2 × CPU clock cyclesA / clock rateB
6 seconds = 1.2 × 40 × 10⁹ cycles / clock rateB
clock rateB = 1.2 × 40 × 10⁹ cycles / 6 seconds = 8 GHz
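The arithmetic can be checked directly (a Python sketch of the slide's calculation; the function name is invented):

```python
def target_clock_rate(time_a, rate_a, time_b, cycle_ratio):
    """Clock rate B must execute cycle_ratio * cyclesA in time_b seconds."""
    cycles_a = time_a * rate_a            # cycles used by computer A
    return cycle_ratio * cycles_a / time_b

rate_b = target_clock_rate(time_a=10, rate_a=4e9, time_b=6, cycle_ratio=1.2)
print(rate_b / 1e9, "GHz")  # → 8.0 GHz
```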
Clock Cycles per Instruction
• Not all instructions take the same amount of time to execute.
• One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction.
• Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute. It is a way to compare two different implementations of the same ISA.

# CPU clock cycles for a program = (# instructions for a program) × (average clock cycles per instruction)

CPI for instruction classes:  A: 1   B: 2   C: 3
Determinants of CPU Performance

CPU time = Instruction_count × CPI × clock_cycle

                        Instruction_count   CPI   Clock_cycle
Algorithm                      X             X
Programming language           X             X
Compiler                       X             X
ISA                            X             X         X
Processor organization                       X         X
Technology                                             X
Application-Specific Processors
#1: CPU designed for efficient DSP processing
  – Multiple MAC units, 2 accumulators, an additional adder, a barrel shifter
#2: Multiple busses for efficient data and program flow
  – Four busses and large on-chip memory that result in sustained performance near peak
#3: Highly tuned instruction set for powerful DSP computing
  – Sophisticated instructions that execute in fewer cycles, with less code and low power demands
Memory Hierarchy: The Principle of Locality
The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
Two different types of locality:
• Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
• Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
Take advantage of locality. How? Use a memory hierarchy.
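Both kinds of locality show up in even the simplest loop (a Python sketch; the array and access pattern are invented for illustration):

```python
data = list(range(1024))

total = 0
for i in range(len(data)):
    total += data[i]   # data[0], data[1], ...: adjacent addresses (spatial locality)
    # 'total' and 'i' are touched on every iteration (temporal locality)

print(total)  # → 523776
```

A cache exploits exactly this: fetching a whole block serves the upcoming `data[i+1]` accesses, while the hot variables stay resident.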
Levels of the Memory Hierarchy (Speed vs. Cost)
From the upper level (fastest, smallest) to the lower level (largest, slowest):
• Registers: 100s of bytes, < 10s of ns; staged as instruction operands, 1-8 bytes (prog./compiler)
• Cache: KBytes, 10-100 ns, 1-0.1 cents/bit; staged as blocks, 8-128 bytes (cache controller)
• Main memory: MBytes, 200-500 ns, 0.0001-0.00001 cents/bit; staged as pages, 512-4K bytes (OS)
• Disk: GBytes, 10 ms (10,000,000 ns), 10⁻⁵-10⁻⁶ cents/bit; staged as files, MBytes (user/operator)
• Tape: infinite capacity, sec-min, 10⁻⁸ cents/bit
Harvard Architecture
The Harvard architecture uses a different approach than the Von Neumann architecture: the program memory and the data memory are not shared (they are on different busses).
[Figure: CPU with PC connected to a program memory and a data memory over separate address/data busses]
This allows instructions to be fetched and executed concurrently with data accesses. The Harvard architecture allows for cleaner pipelining of instructions, since there is no contention between fetching data and fetching instructions.
Exploiting Parallelism
• Bit-level parallelism: 1970 to ~1985
  – 4-bit, 8-bit, 16-bit, 32-bit microprocessors
• Instruction-level parallelism (ILP): ~1985 through today
  – Pipelining (enhances throughput)
  – Superscalar
  – Limits to the benefits of ILP?
• Process-level or thread-level parallelism; mainstream for general-purpose computing?
  – Servers (multi-processing systems)
  – High-end desktop dual processors (multi-core)
Pipelining
Pipelining exploits parallelism at the instruction level. It is an implementation technique in which multiple instructions are overlapped in execution. Today, pipelining is key to making processors fast. Pipelining is not only used in general-purpose processors but can also be used in hardware accelerators (reconfigurable computing systems).
Pipelining Strategy
Speed Up
Time between instructions (pipelined) = Time between instructions (non-pipelined) / Number of pipe stages

If the stages are perfectly balanced, then the time between instructions on the pipelined processor (assuming ideal conditions) is given by the formula above. Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages, i.e., a five-stage pipeline is nearly five times faster.

Pipelining improves performance by increasing instruction throughput; the execution time of an individual instruction remains the same (i.e., latency is unchanged).
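The ideal-speedup claim can be checked with a small model (a Python sketch; the stage count and instruction count are assumed values, and hazards are ignored):

```python
def pipelined_time(n_instructions, n_stages, stage_time):
    """Fill the pipe (n_stages cycles), then complete one instruction per cycle."""
    return (n_stages + n_instructions - 1) * stage_time

def nonpipelined_time(n_instructions, n_stages, stage_time):
    """Each instruction occupies all stages back to back."""
    return n_instructions * n_stages * stage_time

n, stages, t = 1_000_000, 5, 1.0
speedup = nonpipelined_time(n, stages, t) / pipelined_time(n, stages, t)
print(round(speedup, 3))  # approaches 5 as the instruction count grows
```

The fill (and drain) overhead is why the speedup only approaches, and never quite reaches, the stage count.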
Example: Conventional Data Path Timing
The figure shows the maximum delay values for each of the components of a typical data path:
1. 4 ns (3 ns + 1 ns) to read two operands from the register file
2. 4 ns to perform an operation
3. 4 ns to write the result back
In total, 12 ns to perform a single micro-operation. The rate of execution is then 1 / 12 ns ≈ 83.3 MHz.
Can we make it faster?
Example: Pipelined Data Path Timing
We can break the 12 ns delay by inserting registers between the different components of the system:
• A register is inserted between the function unit and the register file (OF)
• Another register is inserted between the function unit and MUX D (EX + WB)
This gives a 3-stage pipeline: OF / EX / WB. The maximum stage delay is now 5 ns, allowing a maximum clock frequency of 200 MHz.
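A quick check of the two clock rates from the slide's numbers (a Python sketch; the helper name is invented):

```python
def max_clock_mhz(worst_stage_delay_ns):
    """Clock frequency is limited by the slowest stage: f = 1 / t (ns -> MHz)."""
    return 1e3 / worst_stage_delay_ns

print(round(max_clock_mhz(12), 1))  # unpipelined: 83.3 MHz
print(round(max_clock_mhz(5), 1))   # pipelined:   200.0 MHz
```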
It's Not That Easy for Computers
• Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of instructions - two dogs fighting for the same bonetwo dogs fighting for the same bone
» In a computer system: instead of having two memories we have a single memory and we need to write a value and also read a new instruction.
– Data hazards: Instruction depends on result of prior instruction still in the pipeline
» Example:• Add $s0, $t0, $t1• Sub $t2, $s0, $t3
– Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow:
» (branches and» jumps).
Resource Hazards
A resource hazard occurs when two or more instructions that are already in the pipeline need the same resource. The result is that the instructions must be executed serially rather than in parallel for a portion of the pipeline. A resource hazard is sometimes referred to as a structural hazard.
Parallel Processing
Using more than one processor to solve a problem: the idea is that n processors operating simultaneously can achieve the result n times faster.
Motives:
• Diminishing returns from ILP
  – Limited ILP in programs
  – ILP increasingly expensive to exploit
• Fault tolerance
• Availability of multiple (independent) threads/processes
Flynn's Taxonomy of Multiprocessing
• Single-instruction, single-data stream (SISD) machines
• Single-instruction, multiple-data stream (SIMD) machines
• Multiple-instruction, single-data stream (MISD) machines
• Multiple-instruction, multiple-data stream (MIMD) machines

                     Single instruction (SI)         Multiple instruction (MI)
Single data (SD)     SISD: single-threaded process   MISD: pipeline architecture
Multiple data (MD)   SIMD: vector processing         MIMD: multi-threaded programming
Single Instruction Stream, Single Data Stream (SISD)
In a single-processor computer, a single stream of instructions is generated by the program. The instructions operate on a single stream of data items (the traditional Von Neumann architecture).
[Figure: control unit → processor → memory, with one instruction stream and one data stream]
Algorithms for SISD computers do not contain any parallelism.
Single Instruction Stream, Multiple Data Stream (SIMD)
• A specially designed computer in which a single instruction stream comes from a single program, but multiple data streams exist.
• The instructions from the program are broadcast to more than one processor.
• Each processor executes the same instruction in synchronism, but using different data.
Example: vector computers.
SIMD Architecture
[Figure: one control unit broadcasting an instruction stream to processors P1, P2, …, PN, each with its own data stream, over a shared memory or interconnection network]
The processors operate synchronously, and a global clock is used to ensure lockstep operation.
SIMD Application: Example
Add two matrices: C = A + B.
Say we have two matrices A and B of order 2 and we have 4 processors, i.e., we wish to calculate:
C11 = A11 + B11    C12 = A12 + B12
C21 = A21 + B21    C22 = A22 + B22
The same instruction (add the two numbers) is sent to each processor, but each processor receives different data.
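In software, elementwise array addition mirrors this broadcast-one-instruction model (a Python sketch; the matrix values are invented for illustration):

```python
A = [[1, 2], [3, 4]]
B = [[10, 20], [30, 40]]

# Conceptually, four "processors" each apply the same instruction (add)
# to their own element pair in lockstep.
C = [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]
print(C)  # → [[11, 22], [33, 44]]
```

On a real SIMD machine (or a vector unit) the four additions happen in a single broadcast step rather than in a Python loop.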
Multiple Instruction Stream, Single Data Stream (MISD)
• A computer with multiple processors, each sharing a common memory.
• There are multiple streams of instructions and one stream of data.
[Figure: two control units, each driving its own processor, sharing one memory]
Example?
MISD Example
Check whether a number Z is prime:
• Each processor is assigned a set of test divisors in its instruction stream.
• Each processor takes Z as input and tries to divide it by its divisors.
MISD is awkward to implement, and such machines are mostly experimental; no commercial MISD machine exists.
Multiple Instruction Stream, Multiple Data Stream (MIMD)
• A general-purpose multiprocessor system.
• Each processor has a separate program, and one instruction stream is generated from each program for each processor.
• Each instruction stream operates upon different data.
This is the most general and most useful of our classifications.
MIMD Architecture
[Figure: control units C1…CN, each driving its own processor P1…PN, over a shared memory or interconnection network]
Each processor operates under the control of an instruction stream issued by its own control unit. Processors operate asynchronously in general.
Parallel vs. Distributed
• Parallel: multiple CPUs within a shared-memory machine.
• Distributed: multiple machines, each with its own memory, connected over a network, with network connections used for data transfer.
[Figure: processors with their instruction and data streams sharing one memory vs. communicating over a network]
MIMD: Parallel Programming Models
• Distributed memory
  – Explicit communication
    » Send messages: Send(tid, tag, message), Receive(tid, tag, message)
  – Synchronization
    » Block on messages (implicit sync)
• Shared memory
  – Implicit communication
    » Using a shared address space, via loads and stores
  – Synchronization
    » Atomic memory operations
    » Semaphores
Scalability?
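The two models can be contrasted in a few lines (a Python sketch using threads, with a queue standing in for message passing; the worker functions and values are invented for illustration):

```python
import threading, queue

# Message passing: the worker only sees what is explicitly sent to it.
inbox, outbox = queue.Queue(), queue.Queue()

def mp_worker():
    x = inbox.get()          # blocks until a message arrives (implicit sync)
    outbox.put(x * 2)        # send the result back

t = threading.Thread(target=mp_worker)
t.start()
inbox.put(21)
print(outbox.get())          # → 42
t.join()

# Shared memory: both workers touch the same variable; updates need a lock.
shared = {"total": 0}
lock = threading.Lock()

def sm_worker():
    with lock:               # synchronization via a lock (atomic update)
        shared["total"] += 21

threads = [threading.Thread(target=sm_worker) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(shared["total"])       # → 42
```

Real distributed-memory systems replace the queues with network messages (e.g., MPI-style send/receive), which is what makes them scale beyond one machine.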
Multi-Processing Issues
• How to assign tasks to processors?
• What if we have more tasks than processors?
• What if processors need to share partial results?
• How do we aggregate partial results?
• How do we know all processors have finished?
• What if processors die?
Speedup Factor
S(n) = (execution time on a single processor) / (execution time on a multiprocessor with n processors)
S(n) gives the increase in speed obtained by using a multiprocessor.
The speedup factor can also be cast in terms of computational steps:
S(n) = (number of steps using one processor) / (number of parallel steps using n processors)
The maximum speedup is n with n processors (linear speedup); this theoretical limit is not always achieved. WHY?
Amdahl's Law
• Amdahl's law is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using n processors.
• The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program.
Limitations?
Amdahl's Law
T = 1 / ((1 - a) + a / s)
where:
• T = overall speedup
• a = fraction of the original program that can be enhanced by executing it in parallel or transferring it to hardware
• s = expected speedup obtained (e.g., from hardware) for that particular fraction of the program
Amdahl's Law: Example
• Assume you profiled an application and noticed that 12% of the application can be parallelized and 88% of the operations are not parallelizable.
• Question: what is the maximum speedup of the parallelized version?
• Answer: Amdahl's law states that the maximum speedup of the parallelized version is 1 / (1 - 0.12) = 1.136 times as fast as the non-parallelized implementation.
Amdahl's Law: Example
Floating-point instructions are improved to run 2X faster, but only 10% of the executed instructions are FP:
Speedup_overall = 1 / ((1 - 0.1) + 0.1 / 2) = 1 / 0.95 = 1.053
Law of diminishing returns: focus on the common case!
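Both worked examples follow from one function (a Python sketch of the law; the function name is invented):

```python
def amdahl_speedup(fraction, s):
    """Overall speedup when `fraction` of the work is sped up by factor s."""
    return 1.0 / ((1.0 - fraction) + fraction / s)

# 12% perfectly parallelized (s -> infinity): the cap is 1 / (1 - 0.12).
print(round(1 / (1 - 0.12), 3))           # → 1.136
# 10% of instructions (FP) run 2x faster.
print(round(amdahl_speedup(0.10, 2), 3))  # → 1.053
```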
Spatial vs. Temporal Computing
The polynomial Ax² + Bx + C can be computed spatially (ASIC), with all operators laid out at once in hardware, or temporally (processor), as the sequence (Ax + B)x + C executed step by step.
[Figure: the same computation mapped to a spatial datapath vs. a temporal instruction sequence]
The ability to extract parallelism (or concurrency) from algorithm descriptions is the key to acceleration using reconfigurable computing.
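The two forms compute the same value; the temporal (step-by-step) evaluation looks like this (a Python sketch with invented coefficients):

```python
def poly_direct(a, b, c, x):
    """Spatial-style form: all products available 'at once': A*x^2 + B*x + C."""
    return a * x * x + b * x + c

def poly_horner(a, b, c, x):
    """Temporal-style form: one multiply-accumulate per step: (A*x + B)*x + C."""
    acc = a * x + b    # step 1: multiply-accumulate
    acc = acc * x + c  # step 2: multiply-accumulate
    return acc

print(poly_direct(2, 3, 4, 5), poly_horner(2, 3, 4, 5))  # → 69 69
```

A spatial implementation can evaluate the independent products of the direct form concurrently, while a processor with one multiplier must serialize, which is the essence of the spatial/temporal trade-off.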
Methods for Executing Algorithms
• Hardware (application-specific integrated circuits)
  – Advantages: very high performance and efficiency
  – Disadvantages: not flexible (can't be altered after fabrication); expensive
• Software-programmed processors
  – Advantages: software is very flexible to change
  – Disadvantages: performance can suffer if the clock is not fast; the instruction set is fixed by the hardware
• Reconfigurable computing
  – Advantages: fills the gap between hardware and software; much higher performance than software; a higher level of flexibility than hardware
Reconfigurable Computing
The ideal device should combine:
• the flexibility of the Von Neumann computer
• the efficiency of ASICs
The ideal device should be able to:
• optimally implement an application at a given time
• re-adapt to allow the optimal implementation of a new application
We call such a device a reconfigurable device.
Definition: reconfigurable computing can be defined as the study of computations involving reconfigurable devices. This includes (1) architecture, (2) algorithms, and (3) applications.
How to fill the gap?
[Figure: flexibility vs. performance; general-purpose processors (GPs) and DSPs offer high flexibility but lower performance, ASICs offer high performance but low flexibility, and RCS sits in between]
Reconfigurable Hardware (FPGAs)
KEY ADVANTAGE: the performance of hardware with the flexibility of software.
[Figure: FPGA fabric showing CLBs, block RAM, and IP cores (multipliers)]
Why is reconfigurable computing more relevant these days?
• Demand for high-performance computation is soaring: large-scale optimization problems, physics and earth simulation, bioinformatics, signal processing (e.g., HDTV), etc.
• Why are software-programmed processors no longer attractive?
  – Faster temporal execution of instructions is no longer improving
  – General-purpose multi-core processors require coarse-grain thread-level parallelism
• Why are reconfigurable fabrics currently attractive?
  – Increased integration densities allow large FPGAs that can implement substantial functions
  – They provide the spatial computational resources required to implement massively parallel computations directly in hardware
Reconfigurable Devices
1. Rapid Prototyping
Testing hardware in real conditions before fabrication:
• Software simulation: relatively inexpensive, but slow; accuracy?
• Hardware emulation: hardware testing under real operating conditions; fast, accurate, and allows several iterations
Examples: APTIX System Explorer, ITALTEL FLEXBENCH
2. Non-Frequent Reconfiguration

3. Frequent Reconfiguration
Computing systems that are able to adapt their behaviour and structure to changing operating and environmental conditions, time-varying optimization objectives, and physical constraints, such as changing protocols, new standards, or dynamically changing operating conditions of technical systems.
Static & Dynamic Reconfiguration
Benefits of RCS
• Non-permanent customization and application development after fabrication ("late binding")
• High performance for applications that require real-time processing
• Time-to-market (evolving requirements and standards, new ideas)
Disadvantages:
• Efficiency penalty (area, performance, power) compared to an ASIC
• Less ease of use compared to a GPP
Conclusions