ENG6530 Reconfigurable Computing Systems
Introduction to Reconfigurable Computing
ENG6530 RCS 2
Topics
• The VLSI design process
• Application-specific integrated circuits
• Issues related to power consumption
• Issues related to scaling
• The traditional Von Neumann architecture: limitations and enhancements
• Reconfigurable computing (fill the gap!) and research issues
• Summary
References
I. “Reconfigurable Computing: Accelerating Computation with FPGAs”, by Maya Gokhale
II. “Introduction to Reconfigurable Computing: Architectures, Algorithms and Applications”, by C. Bobda
III. “Computer Organization and Design”, by Patterson and Hennessy
IV. “Digital Integrated Circuits: A Design Perspective”, by Jan Rabaey
Computing Devices Everywhere!
PDAs, body-worn devices, entertainment, household appliances, communication, home networking, cars, medicine, PCs, supercomputers, game consoles
The Transistor Revolution
• First transistor: Bell Labs, 1948
• Bipolar logic: 1960s
• Intel 4004 processor: designed in 1971, almost 3000 transistors, 1 MHz operation
The VLSI Design Process
Specification → Functional design → Logic design → Circuit design → Physical design → Test/Fabrication
Physical design itself comprises partitioning, floorplanning & placement, and routing.
The process is very costly and time consuming.
Productivity Gap in VLSI Design
A growing gap exists between design complexity and design productivity.
MOSFET: Metal Oxide Semiconductor Field Effect Transistor
• A voltage-controlled device
• Handles less current than a BJT (slower)
• Dissipates less power
• Achieves higher density on an IC
• Has full voltage swing (0–5 V)
MOS Transistors – Types and Symbols
[Figure: schematic symbols, each with gate (G), source (S), and drain (D) terminals: enhancement NMOS, depletion NMOS, enhancement PMOS, and NMOS with bulk contact (B)]
Circuits can be built using either NMOS transistors or PMOS transistors.
Implementing Logic using nMOS vs. pMOS Devices
Complementary MOS (CMOS)
• NMOS transistors pass a "strong" 0 but a "weak" 1
• PMOS transistors pass a "strong" 1 but a "weak" 0
• Combining both leads to circuits that can pass strong 0's and strong 1's
Static Complementary MOS (CMOS)
A static CMOS gate F(In1, In2, …, InN) consists of a pull-up network (PUN) of PMOS transistors only, connected to VDD, and a pull-down network (PDN) of NMOS transistors only, connected to VSS. PUN and PDN are dual logic networks.
At every point in time (except during switching transients) each gate output is connected to either VDD or VSS via a low-resistance path.
CMOS Inverter (normal output)
A | Y
0 | 1
1 | 0
The PMOS device forms the pull-up network; the NMOS device forms the pull-down network.
CMOS Inverter Operation
• When A = 1: the NMOS is ON, the PMOS is OFF, and Y = 0.
• When A = 0: the NMOS is OFF, the PMOS is ON, and Y = 1.
Example Gates: NAND and NOR
Complex CMOS Gate
OUT = D + A · (B + C)
[Figure: pull-up (PMOS) and pull-down (NMOS) networks built from transistors driven by inputs A, B, C, and D]
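As a quick sanity check, the gate's Boolean function can be tabulated in software (a Python sketch, not part of the original slides; the helper name is invented):

```python
from itertools import product

def complex_gate(a, b, c, d):
    """Logic function of the complex CMOS gate: OUT = D + A*(B + C)."""
    return d or (a and (b or c))

# Tabulate all 16 input combinations.
for a, b, c, d in product([0, 1], repeat=4):
    print(a, b, c, d, "->", int(complex_gate(a, b, c, d)))
```

In the actual CMOS gate the PDN implements this expression with NMOS devices and the PUN implements its dual with PMOS devices.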
Sources of Power Consumption
Power has three components:
• Static power: consumed when the input isn't switching
• Dynamic capacitive power: due to charging and discharging of the load capacitance
• Dynamic short-circuit power: direct current from VDD to GND while both transistors are momentarily on
Dynamic Power
• Dynamic power is required to charge and discharge load capacitances when transistors switch.
• One cycle involves a rising and a falling output:
  – On a rising output, charge Q = C·VDD is drawn from the supply
  – On a falling output, the charge is dumped to GND
[Figure: inverter driving a load, showing (1) the short-circuit current and (2) the charge/discharge current iDD(t) drawn from VDD]
Dynamic Power
Averaging the supply current over a clock period T gives:

P_dynamic = (1/T) ∫₀ᵀ i_DD(t) · V_DD dt
          = (V_DD / T) ∫₀ᵀ i_DD(t) dt
          = (V_DD / T) · (T · f_sw · C · V_DD)
          = C · V_DD² · f_sw

Short-circuit power is typically less than 10% of dynamic power.
Lowering Dynamic Power
P_dyn = C_L · V_DD² · f
• Capacitance: a function of fan-out, wire length, and transistor sizes
• Supply voltage: has been dropping with successive generations
• Clock frequency: increasing…
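To make the formula concrete, here is a small calculation (a Python sketch; the component values are assumed for illustration, not figures from the slides):

```python
def dynamic_power(c_load, vdd, f_sw):
    """Dynamic capacitive power: P = C * VDD^2 * f_sw (watts)."""
    return c_load * vdd ** 2 * f_sw

# Assumed example: 100 fF switched load, 1.2 V supply, 1 GHz switching rate.
p = dynamic_power(100e-15, 1.2, 1e9)
print(f"{p * 1e3:.3f} mW")  # → 0.144 mW
```

The quadratic dependence on V_DD is why supply-voltage scaling has been the most effective dynamic-power lever.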
Static Power Consumption
• Static current: in CMOS there is no static current as long as Vin < VTN or Vin > VDD + VTP
• Leakage current: determined by the "off" transistor; influenced by transistor width, supply voltage, and transistor threshold voltages
Leakage mechanisms:
• Junction leakage
• Gate-oxide leakage
• Subthreshold leakage
Digital Circuit Implementation Approaches
Implementation choices (target technology):
• Custom
• Semicustom
  – Cell-based: standard cells, macro cells
  – Array-based: pre-diffused (gate arrays), pre-wired (FPGAs, PLDs)
Design Style I: Full Custom
The designer specifies the layout of each individual transistor and connection.
Design Styles
• Full Custom
  – Utilized for large production-volume chips such as microprocessors.
  – No restriction on the placement of functional blocks and their interconnections.
  – Highly optimized, but labour intensive.
Design Style II: Gate Array
• Array of prefabricated gates/transistors
[Figure: sea-of-gates and channel-based gate arrays; a NAND gate built using gate isolation, with unused transistors available in principle to the adjacent cell]
Design Styles
• Gate Arrays
  – Pre-fabricated array of gates (could be NAND).
  – The design is mapped onto the gates, and the interconnections are routed.
Design Style III: Standard Cells
[Figure: rows of cells with routing channels and IO cells. Courtesy R. K. Brayton (UCB) and A. Kuehlmann (Cadence)]
Design Styles
• Standard Cell
  – Utilized for smaller-production ASICs that are generated by synthesis tools.
  – Layout is arranged in rows of cells that perform computation.
  – Routing is done in "channels" between the rows.
Field Programmable Gate Array (FPGA)
• Pre-fabricated array of programmable logic and interconnections.
• No fabrication step required.
Design Style Comparisons

STYLE             Full Custom   Standard Cell   Gate Array   FPGA
Cell size         Variable      Fixed height    Fixed        Fixed
Cell type         Variable      Variable        Fixed        Prog.
Cell placement    Variable      In row          Fixed        Fixed
Interconnections  Variable      Variable        Variable     Prog.
Design cost       High          Medium          Medium       Low
An ASIC (Application-Specific Integrated Circuit) is an IC customized for a particular use, rather than intended for general-purpose use, e.g., phones.
[Figure: an ASIC combining full-custom blocks, dual-port RAM, single-port RAM, a FIFO, and standard cells]
Technology Scaling
Technology scaling has a threefold objective:
• Increase the transistor density
• Reduce the gate delay
• Stabilize dynamic power consumption
The main challenges faced are static power consumption, power delivery, and power density, which determine the performance of the chip. Finding solutions to these challenges makes technology scaling an important issue for both academia and industry.
VLSI Trends: Moore's Law
In 1965, Gordon Moore predicted that transistors would continue to shrink, allowing:
• Doubled transistor density every 18–24 months
• Doubled performance every 18–24 months
History has proven Moore right. But is the end in sight?
• Physical limitations
• Economic limitations
Gordon Moore, Intel co-founder and chairman emeritus. Image source: Intel Corporation, www.intel.com
Technology Scaling: Moore's Law
Amazingly visionary: the million-transistor/chip barrier was crossed in the 1980s.
• 2300 transistors, 1 MHz clock (Intel 4004) – 1971
• 16 million transistors (UltraSPARC III)
• 42 million transistors, 2 GHz clock (Intel Xeon) – 2001
• 55 million transistors, 3 GHz, 130 nm technology, 250 mm² die (Intel Pentium 4) – 2004
• 140 million transistors (HP PA-8500)
• 1 billion transistors (SoC)
Moore's Law in Microprocessors
[Chart: transistor count (millions, log scale) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6 processors, 1970–2010, showing 2X growth in 1.96 years]
Transistors on lead microprocessors double every 2 years.
Frequency
[Chart: clock frequency (MHz, log scale) for the 4004 through P6/Pentium processors, 1970–2010]
Lead-microprocessor frequency doubles every 2 years.
Power Dissipation!
[Chart: power (watts, log scale) for the 4004 through P6/Pentium processors, 1971–2000]
Lead-microprocessor power continues to increase.
Leakage or "Static" Power
• Leakage power is becoming a substantial portion of total power.
• Trends project that leakage power will account for up to 50% of total power.
• Not a practical limit of power dissipation.
Power Dissipation Projection
• Considering only active power, power dissipation increases from 100 W in 1999 to 2000 W in 2010.
• The dotted line indicates power dissipation including leakage power.
Interconnect Scaling
Higher densities are only possible if the interconnects also scale:
• Reduced width → increased resistance
• Denser interconnects → higher capacitance
To account for increased parasitics and integration complexity, more interconnection layers are added:
• thinner and tighter layers → local interconnections
• thicker and sparser layers → global interconnections and power
Higher resistance and capacitance lead to higher RC delay.
Technology Trends: Interconnect Delay
[Chart: typical gate delay vs. interconnect delay (ns) as the minimum feature size shrinks from 2.0 µm to 0.35 µm]
Interconnect delay has become the performance limiter as technology shrinks into the sub-micron regime.
Technology Trend and Challenges
(Source: ITRS'03)
• Interconnect determines the overall performance
• In addition: noise, power => design closure
• Furthermore: manufacturability => manufacturing closure
Application-Specific Integrated Circuits
ASIC advantages:
• Best performance (speed)
• Smallest chip (area)
• Best power consumption
ASIC limitations:
• Inflexible (fixed!)
• Expensive to design
• High fixed costs require large production runs
• Requires skillful designers
How do we cope with complexity & cost?
Intellectual Property (IP)-based Design
System on Chip (SoC)
• Increasing complexity: MPSoC, NoC
• Assembly of "prefabricated components"
• Maximize VC (IP) reuse: over 90%
• New economics: a fast and correct design > an optimum design
• Design and verification move to the system level
• Interfaces between VCs and SW become more important
[Figure: an SoC combining a µP, memory, video unit, graphics, communications, DSP, custom logic, and software]
The Von Neumann Computer
Principle: In 1945, the mathematician John von Neumann demonstrated in a study of computation that a computer could have a simple structure, capable of executing any kind of program, given a properly programmed control unit, without the need for hardware modification.
ENIAC – the first electronic computer (1946)
The Von Neumann Computer
Structure:
• A memory for storing program and data. The memory consists of words of the same length.
• A control unit (control path) featuring a program counter for controlling program execution.
• An arithmetic and logic unit (ALU), also called the data path, for program execution.
[Figure: processor (central processing unit) containing the datapath, control path, registers, PC, instruction register, and address register, connected to memory by data and address buses carrying data and instructions]
The Von Neumann Computer
Coding: a program is coded as a set of instructions to be sequentially executed.
Program execution proceeds in a cycle (IF, D, R, EX, W):
• Instruction Fetch (IF): the next instruction to be executed is fetched from memory
• Decode (D): the instruction is decoded (which operation?)
• Read operands (R): operands are read from memory
• Execute (EX): the operation is executed on the ALU
• Write result (W): results are written back to memory
What is the problem with this computing paradigm?
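The cycle above can be sketched as a toy interpreter (a Python sketch of the idea, not any real ISA; the three-instruction set and addresses are invented for illustration). Note that program and data share one memory, which is the defining Von Neumann trait:

```python
# Toy Von Neumann machine: program and data live in the same memory.
memory = {
    0: ("LOAD", 100),   # acc <- mem[100]
    1: ("ADD", 101),    # acc <- acc + mem[101]
    2: ("STORE", 102),  # mem[102] <- acc
    3: ("HALT", None),
    100: 7, 101: 5, 102: 0,
}

pc, acc = 0, 0
while True:
    op, addr = memory[pc]          # IF: fetch (and D: decode)
    pc += 1
    if op == "HALT":
        break
    value = memory.get(addr)       # R: read operand
    if op == "LOAD":               # EX: execute
        acc = value
    elif op == "ADD":
        acc += value
    elif op == "STORE":
        memory[addr] = acc         # W: write result

print(memory[102])  # → 12
```

Every step funnels through the single memory, one instruction at a time, which previews the bottleneck discussed next.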
Bottlenecks in the VN Architecture
Processor-Memory Performance Gap
[Chart ("Moore's Law"): relative performance, 1 to 10000, by year; µProc improves 55%/year (2X/1.5 yr) while DRAM improves 7%/year (2X/10 yrs); the processor-memory performance gap grows 50%/year]
The Von Neumann Computer
Advantages:
• Simplicity.
• Flexibility: any well-coded program can be executed.
Drawbacks:
• Speed efficiency: not efficient, due to sequential program execution (temporal resource sharing).
• Resource efficiency: only one part of the hardware resources is required for the execution of an instruction; the rest remains idle.
• Memory access: memories are about 5 times slower than the processor.
How can we compensate for these deficiencies?
Improving Performance of VN (GPPs)
1. Technology scaling: improve performance (increase clock frequency!)
2. Improving the instruction set of the processor
3. Application-specific processors (DSPs)
4. Use of a hierarchical memory system: caches can enhance speed
5. Multiplicity of functional units (H/W): adders/multipliers/dividers (CDC 6600)
6. Pipelining within the CPU (H/W): e.g., a four-stage pipeline (IF/ID/OF/EX)
7. Overlapping CPU & I/O operations (H/W): DMA (Direct Memory Access) can be used to enhance performance
8. Time sharing (S/W): multi-tasking assigns fixed or variable time slices to multiple programs
9. Parallelism & multithreading (S/W, H/W): compilers / multi-core systems
Technology Scaling
• Technology scaling tends to increase transistor density and enhance performance.
• Scaling is essential to ensure sustained growth in the IC industry to meet future needs.
• However, the main challenges faced are:
  – Power consumption
  – Manufacturing issues, i.e., yield
  – New CAD tools that can support newer technologies
Instruction Set of the Processor
CPU execution time (CPU time) – the time the CPU spends working on a task. It does not include time spent waiting for I/O or running other programs.

CPU execution time = (# CPU clock cycles for a program) × (clock cycle time)
or
CPU execution time = (# CPU clock cycles for a program) / (clock rate)

We can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program.
Example: Improving Performance
Our favorite program runs in 10 seconds on computer A, which has a 4 GHz clock. We are trying to help a computer designer build a computer, B, that will run this program in 6 seconds. The designer has determined that a substantial increase in the clock rate is possible, but this increase will affect the rest of the CPU design, causing computer B to require 1.2 times as many clock cycles as computer A for this program. What clock rate should we tell the designer to target?

CPU timeA = CPU clock cyclesA / clock rateA
10 seconds = CPU clock cyclesA / (4 × 10⁹ cycles/second)
CPU clock cyclesA = 10 seconds × 4 × 10⁹ cycles/sec = 40 × 10⁹ cycles

CPU timeB = 1.2 × CPU clock cyclesA / clock rateB
6 seconds = 1.2 × 40 × 10⁹ cycles / clock rateB
clock rateB = 1.2 × 40 × 10⁹ cycles / 6 seconds = 8 GHz
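The arithmetic can be checked directly (a Python sketch of the slide's calculation; the function name is invented):

```python
def target_clock_rate(time_a, rate_a, time_b, cycle_ratio):
    """Clock rate B must execute cycle_ratio * cyclesA in time_b seconds."""
    cycles_a = time_a * rate_a            # cycles used by computer A
    return cycle_ratio * cycles_a / time_b

rate_b = target_clock_rate(time_a=10, rate_a=4e9, time_b=6, cycle_ratio=1.2)
print(rate_b / 1e9, "GHz")  # → 8.0 GHz
```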
Clock Cycles per Instruction
• Not all instructions take the same amount of time to execute.
• One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction.
• Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute. It is a way to compare two different implementations of the same ISA.

# CPU clock cycles for a program = (# instructions for a program) × (average clock cycles per instruction)

CPI for instruction classes:  A: 1   B: 2   C: 3
Determinants of CPU Performance

CPU time = Instruction_count × CPI × clock_cycle

                        Instruction_count   CPI   Clock_cycle
Algorithm                      X             X
Programming language           X             X
Compiler                       X             X
ISA                            X             X         X
Processor organization                       X         X
Technology                                             X
Application-Specific Processors
#1: CPU designed for efficient DSP processing
  – Multiple MAC units, 2 accumulators, an additional adder, a barrel shifter
#2: Multiple busses for efficient data and program flow
  – Four busses and large on-chip memory that result in sustained performance near peak
#3: Highly tuned instruction set for powerful DSP computing
  – Sophisticated instructions that execute in fewer cycles, with less code and low power demands
Memory Hierarchy: The Principle of Locality
The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
Two different types of locality:
• Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
• Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
Take advantage of locality. How? Use a memory hierarchy.
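Both kinds of locality show up in even the simplest loop (a Python sketch; the array and access pattern are invented for illustration):

```python
data = list(range(1024))

total = 0
for i in range(len(data)):
    total += data[i]   # data[0], data[1], ...: adjacent addresses (spatial locality)
    # 'total' and 'i' are touched on every iteration (temporal locality)

print(total)  # → 523776
```

A cache exploits exactly this: fetching a whole block serves the upcoming `data[i+1]` accesses, while the hot variables stay resident.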
Levels of the Memory Hierarchy (Speed vs. Cost)
From the upper level (fastest, smallest) to the lower level (largest, slowest):
• Registers: 100s of bytes, < 10s of ns; staged as instruction operands, 1-8 bytes (prog./compiler)
• Cache: KBytes, 10-100 ns, 1-0.1 cents/bit; staged as blocks, 8-128 bytes (cache controller)
• Main memory: MBytes, 200-500 ns, 0.0001-0.00001 cents/bit; staged as pages, 512-4K bytes (OS)
• Disk: GBytes, 10 ms (10,000,000 ns), 10⁻⁵-10⁻⁶ cents/bit; staged as files, MBytes (user/operator)
• Tape: infinite capacity, sec-min, 10⁻⁸ cents/bit
Harvard Architecture
The Harvard architecture uses a different approach than the Von Neumann architecture: the program memory and the data memory are not shared (they are on different busses).
[Figure: CPU with PC connected to a program memory and a data memory over separate address/data busses]
This allows instructions to be fetched and executed concurrently with data accesses. The Harvard architecture allows for cleaner pipelining of instructions, since there is no contention between fetching data and fetching instructions.
Exploiting Parallelism
• Bit-level parallelism: 1970 to ~1985
  – 4-bit, 8-bit, 16-bit, 32-bit microprocessors
• Instruction-level parallelism (ILP): ~1985 through today
  – Pipelining (enhances throughput)
  – Superscalar
  – Limits to the benefits of ILP?
• Process-level or thread-level parallelism; mainstream for general-purpose computing?
  – Servers (multi-processing systems)
  – High-end desktop dual processors (multi-core)
Pipelining
Pipelining exploits parallelism at the instruction level. It is an implementation technique in which multiple instructions are overlapped in execution. Today, pipelining is key to making processors fast. Pipelining is not only used in general-purpose processors but can also be used in hardware accelerators (reconfigurable computing systems).
Pipelining Strategy
Speed Up
Time between instructions (pipelined) = Time between instructions (non-pipelined) / Number of pipe stages

If the stages are perfectly balanced, then the time between instructions on the pipelined processor (assuming ideal conditions) is given by the formula above. Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages, i.e., a five-stage pipeline is nearly five times faster.

Pipelining improves performance by increasing instruction throughput; the execution time of an individual instruction remains the same (i.e., latency is unchanged).
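The ideal-speedup claim can be checked with a small model (a Python sketch; the stage count and instruction count are assumed values, and hazards are ignored):

```python
def pipelined_time(n_instructions, n_stages, stage_time):
    """Fill the pipe (n_stages cycles), then complete one instruction per cycle."""
    return (n_stages + n_instructions - 1) * stage_time

def nonpipelined_time(n_instructions, n_stages, stage_time):
    """Each instruction occupies all stages back to back."""
    return n_instructions * n_stages * stage_time

n, stages, t = 1_000_000, 5, 1.0
speedup = nonpipelined_time(n, stages, t) / pipelined_time(n, stages, t)
print(round(speedup, 3))  # approaches 5 as the instruction count grows
```

The fill (and drain) overhead is why the speedup only approaches, and never quite reaches, the stage count.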
Example: Conventional Data Path Timing
The figure shows the maximum delay values for each of the components of a typical data path:
1. 4 ns (3 ns + 1 ns) to read two operands from the register file
2. 4 ns to perform an operation
3. 4 ns to write the result back
In total, 12 ns to perform a single micro-operation. The rate of execution is then 1 / 12 ns ≈ 83.3 MHz.
Can we make it faster?
Example: Pipelined Data Path Timing
We can break the 12 ns delay by inserting registers between the different components of the system:
• A register is inserted between the function unit and the register file (OF)
• Another register is inserted between the function unit and MUX D (EX + WB)
This gives a 3-stage pipeline: OF / EX / WB. The maximum stage delay is now 5 ns, allowing a maximum clock frequency of 200 MHz.
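A quick check of the two clock rates from the slide's numbers (a Python sketch; the helper name is invented):

```python
def max_clock_mhz(worst_stage_delay_ns):
    """Clock frequency is limited by the slowest stage: f = 1 / t (ns -> MHz)."""
    return 1e3 / worst_stage_delay_ns

print(round(max_clock_mhz(12), 1))  # unpipelined: 83.3 MHz
print(round(max_clock_mhz(5), 1))   # pipelined:   200.0 MHz
```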
It's Not That Easy for Computers
• Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of instructions - two dogs fighting for the same bonetwo dogs fighting for the same bone
» In a computer system: instead of having two memories we have a single memory and we need to write a value and also read a new instruction.
– Data hazards: Instruction depends on result of prior instruction still in the pipeline
» Example:• Add $s0, $t0, $t1• Sub $t2, $s0, $t3
– Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow:
» (branches and» jumps).
Resource Hazards
A resource hazard occurs when two or more instructions that are already in the pipeline need the same resource. The result is that the instructions must be executed serially rather than in parallel for a portion of the pipeline. A resource hazard is sometimes referred to as a structural hazard.
Parallel Processing
Using more than one processor to solve a problem: the idea is that n processors operating simultaneously can achieve the result n times faster.
Motives:
• Diminishing returns from ILP
  – Limited ILP in programs
  – ILP increasingly expensive to exploit
• Fault tolerance
• Availability of multiple (independent) threads/processes
Flynn's Taxonomy of Multiprocessing
• Single-instruction, single-data stream (SISD) machines
• Single-instruction, multiple-data stream (SIMD) machines
• Multiple-instruction, single-data stream (MISD) machines
• Multiple-instruction, multiple-data stream (MIMD) machines

                     Single instruction (SI)         Multiple instruction (MI)
Single data (SD)     SISD: single-threaded process   MISD: pipeline architecture
Multiple data (MD)   SIMD: vector processing         MIMD: multi-threaded programming
Single Instruction Stream, Single Data Stream (SISD)
In a single-processor computer, a single stream of instructions is generated by the program. The instructions operate on a single stream of data items (the traditional Von Neumann architecture).
[Figure: control unit → processor → memory, with one instruction stream and one data stream]
Algorithms for SISD computers do not contain any parallelism.
Single Instruction Stream, Multiple Data Stream (SIMD)
• A specially designed computer in which a single instruction stream comes from a single program, but multiple data streams exist.
• The instructions from the program are broadcast to more than one processor.
• Each processor executes the same instruction in synchronism, but using different data.
Example: vector computers.
SIMD Architecture
[Figure: one control unit broadcasting an instruction stream to processors P1, P2, …, PN, each with its own data stream, over a shared memory or interconnection network]
The processors operate synchronously, and a global clock is used to ensure lockstep operation.
SIMD Application: Example
Add two matrices: C = A + B.
Say we have two matrices A and B of order 2 and we have 4 processors, i.e., we wish to calculate:
C11 = A11 + B11    C12 = A12 + B12
C21 = A21 + B21    C22 = A22 + B22
The same instruction (add the two numbers) is sent to each processor, but each processor receives different data.
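In software, elementwise array addition mirrors this broadcast-one-instruction model (a Python sketch; the matrix values are invented for illustration):

```python
A = [[1, 2], [3, 4]]
B = [[10, 20], [30, 40]]

# Conceptually, four "processors" each apply the same instruction (add)
# to their own element pair in lockstep.
C = [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]
print(C)  # → [[11, 22], [33, 44]]
```

On a real SIMD machine (or a vector unit) the four additions happen in a single broadcast step rather than in a Python loop.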
Multiple Instruction Stream, Single Data Stream (MISD)
• A computer with multiple processors, each sharing a common memory.
• There are multiple streams of instructions and one stream of data.
[Figure: two control units, each driving its own processor, sharing one memory]
Example?
MISD Example
Check whether a number Z is prime:
• Each processor is assigned a set of test divisors in its instruction stream.
• Each processor takes Z as input and tries to divide it by its divisors.
MISD is awkward to implement, and such machines are mostly experimental; no commercial MISD machine exists.
Multiple Instruction Stream, Multiple Data Stream (MIMD)
• A general-purpose multiprocessor system.
• Each processor has a separate program, and one instruction stream is generated from each program for each processor.
• Each instruction stream operates upon different data.
This is the most general and most useful of our classifications.
MIMD Architecture
[Figure: control units C1…CN, each driving its own processor P1…PN, over a shared memory or interconnection network]
Each processor operates under the control of an instruction stream issued by its own control unit. Processors operate asynchronously in general.
Parallel vs. Distributed
• Parallel: multiple CPUs within a shared-memory machine.
• Distributed: multiple machines, each with its own memory, connected over a network, with network connections used for data transfer.
[Figure: processors with their instruction and data streams sharing one memory vs. communicating over a network]
MIMD: Parallel Programming Models
• Distributed memory
  – Explicit communication
    » Send messages: Send(tid, tag, message), Receive(tid, tag, message)
  – Synchronization
    » Block on messages (implicit sync)
• Shared memory
  – Implicit communication
    » Using a shared address space, via loads and stores
  – Synchronization
    » Atomic memory operations
    » Semaphores
Scalability?
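The two models can be contrasted in a few lines (a Python sketch using threads, with a queue standing in for message passing; the worker functions and values are invented for illustration):

```python
import threading, queue

# Message passing: the worker only sees what is explicitly sent to it.
inbox, outbox = queue.Queue(), queue.Queue()

def mp_worker():
    x = inbox.get()          # blocks until a message arrives (implicit sync)
    outbox.put(x * 2)        # send the result back

t = threading.Thread(target=mp_worker)
t.start()
inbox.put(21)
print(outbox.get())          # → 42
t.join()

# Shared memory: both workers touch the same variable; updates need a lock.
shared = {"total": 0}
lock = threading.Lock()

def sm_worker():
    with lock:               # synchronization via a lock (atomic update)
        shared["total"] += 21

threads = [threading.Thread(target=sm_worker) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(shared["total"])       # → 42
```

Real distributed-memory systems replace the queues with network messages (e.g., MPI-style send/receive), which is what makes them scale beyond one machine.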
Multi-Processing Issues
• How to assign tasks to processors?
• What if we have more tasks than processors?
• What if processors need to share partial results?
• How do we aggregate partial results?
• How do we know all processors have finished?
• What if processors die?
Speedup Factor
S(n) = (execution time on a single processor) / (execution time on a multiprocessor with n processors)
S(n) gives the increase in speed obtained by using a multiprocessor.
The speedup factor can also be cast in terms of computational steps:
S(n) = (number of steps using one processor) / (number of parallel steps using n processors)
The maximum speedup is n with n processors (linear speedup); this theoretical limit is not always achieved. WHY?
Amdahl's Law
• Amdahl's law is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using n processors.
• The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program.
Limitations?
Amdahl's Law
T = 1 / ((1 - a) + a / s)
where:
• T = overall speedup
• a = fraction of the original program that can be enhanced by executing it in parallel or transferring it to hardware
• s = expected speedup obtained (e.g., from hardware) for that particular fraction of the program
Amdahl's Law: Example
• Assume you profiled an application and noticed that 12% of the application can be parallelized and 88% of the operations are not parallelizable.
• Question: what is the maximum speedup of the parallelized version?
• Answer: Amdahl's law states that the maximum speedup of the parallelized version is 1 / (1 - 0.12) = 1.136 times as fast as the non-parallelized implementation.
Amdahl's Law: Example
Floating-point instructions are improved to run 2X faster, but only 10% of the executed instructions are FP:
Speedup_overall = 1 / ((1 - 0.1) + 0.1 / 2) = 1 / 0.95 = 1.053
Law of diminishing returns: focus on the common case!
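Both worked examples follow from one function (a Python sketch of the law; the function name is invented):

```python
def amdahl_speedup(fraction, s):
    """Overall speedup when `fraction` of the work is sped up by factor s."""
    return 1.0 / ((1.0 - fraction) + fraction / s)

# 12% perfectly parallelized (s -> infinity): the cap is 1 / (1 - 0.12).
print(round(1 / (1 - 0.12), 3))           # → 1.136
# 10% of instructions (FP) run 2x faster.
print(round(amdahl_speedup(0.10, 2), 3))  # → 1.053
```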
Spatial vs. Temporal Computing
The polynomial Ax² + Bx + C can be computed spatially (ASIC), with all operators laid out at once in hardware, or temporally (processor), as the sequence (Ax + B)x + C executed step by step.
[Figure: the same computation mapped to a spatial datapath vs. a temporal instruction sequence]
The ability to extract parallelism (or concurrency) from algorithm descriptions is the key to acceleration using reconfigurable computing.
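The two forms compute the same value; the temporal (step-by-step) evaluation looks like this (a Python sketch with invented coefficients):

```python
def poly_direct(a, b, c, x):
    """Spatial-style form: all products available 'at once': A*x^2 + B*x + C."""
    return a * x * x + b * x + c

def poly_horner(a, b, c, x):
    """Temporal-style form: one multiply-accumulate per step: (A*x + B)*x + C."""
    acc = a * x + b    # step 1: multiply-accumulate
    acc = acc * x + c  # step 2: multiply-accumulate
    return acc

print(poly_direct(2, 3, 4, 5), poly_horner(2, 3, 4, 5))  # → 69 69
```

A spatial implementation can evaluate the independent products of the direct form concurrently, while a processor with one multiplier must serialize, which is the essence of the spatial/temporal trade-off.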
Methods for Executing Algorithms
• Hardware (application-specific integrated circuits)
  – Advantages: very high performance and efficiency
  – Disadvantages: not flexible (can't be altered after fabrication); expensive
• Software-programmed processors
  – Advantages: software is very flexible to change
  – Disadvantages: performance can suffer if the clock is not fast; the instruction set is fixed by the hardware
• Reconfigurable computing
  – Advantages: fills the gap between hardware and software; much higher performance than software; a higher level of flexibility than hardware
Reconfigurable Computing
The ideal device should combine:
• the flexibility of the Von Neumann computer
• the efficiency of ASICs
The ideal device should be able to:
• optimally implement an application at a given time
• re-adapt to allow the optimal implementation of a new application
We call such a device a reconfigurable device.
Definition: reconfigurable computing can be defined as the study of computations involving reconfigurable devices. This includes (1) architecture, (2) algorithms, and (3) applications.
How to fill the gap?
[Figure: flexibility vs. performance; general-purpose processors (GPs) and DSPs offer high flexibility but lower performance, ASICs offer high performance but low flexibility, and RCS sits in between]
Reconfigurable Hardware (FPGAs)
KEY ADVANTAGE: the performance of hardware with the flexibility of software.
[Figure: FPGA fabric showing CLBs, block RAM, and IP cores (multipliers)]
Why is reconfigurable computing more relevant these days?
• Demand for high-performance computation is soaring: large-scale optimization problems, physics and earth simulation, bioinformatics, signal processing (e.g., HDTV), etc.
• Why are software-programmed processors no longer attractive?
  – Faster temporal execution of instructions is no longer improving
  – General-purpose multi-core processors require coarse-grain thread-level parallelism
• Why are reconfigurable fabrics currently attractive?
  – Increased integration densities allow large FPGAs that can implement substantial functions
  – They provide the spatial computational resources required to implement massively parallel computations directly in hardware
Reconfigurable Devices
1. Rapid Prototyping
Testing hardware in real conditions before fabrication:
• Software simulation: relatively inexpensive, but slow; accuracy?
• Hardware emulation: hardware testing under real operating conditions; fast, accurate, and allows several iterations
Examples: APTIX System Explorer, ITALTEL FLEXBENCH
2. Non-Frequent Reconfiguration

3. Frequent Reconfiguration
Computing systems that are able to adapt their behaviour and structure to changing operating and environmental conditions, time-varying optimization objectives, and physical constraints, such as changing protocols, new standards, or dynamically changing operating conditions of technical systems.
Static & Dynamic Reconfiguration
Benefits of RCS
• Non-permanent customization and application development after fabrication ("late binding")
• High performance for applications that require real-time processing
• Time-to-market (evolving requirements and standards, new ideas)
Disadvantages:
• Efficiency penalty (area, performance, power) compared to an ASIC
• Less ease of use compared to a GPP
Conclusions