Tiled Processing Systems Shervin Vakili [email protected] October 21, 2007 All materials are copyrights of their respective authors as listed in references


Contents
• Why parallel processing
• Fundamental MP design decision
• Design Space SoC Architectures
• Tiled Processor
• M.I.T. Raw Processor
• Field Programmable Function Array
• Performance Analysis for Data Intensive Application

Why parallel processing
• Performance drive
• Diminishing returns for exploiting ILP and OLP
• Multiple processors fit easily on a chip
• Cost effective (just connect existing processors or processor cores)
• Low power: parallelism may allow lowering Vdd
However:
• Parallel programming is hard
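The low-power bullet above can be made concrete with the standard dynamic-power relation for CMOS logic, P = C · Vdd² · f. The sketch below is only an illustration: the capacitance, voltages and clock rates are hypothetical values, and it assumes that halving the clock lets the supply drop from 1.2 V to 0.9 V.

```python
def dynamic_power(c_eff, vdd, freq_hz):
    """Dynamic CMOS switching power: P = C * Vdd^2 * f."""
    return c_eff * vdd ** 2 * freq_hz


# Hypothetical numbers, chosen only for illustration.
C_EFF = 1e-9  # effective switched capacitance per core, in farads

# One core running at full speed and full supply voltage.
single = dynamic_power(C_EFF, vdd=1.2, freq_hz=1.0e9)

# Two cores at half the clock deliver the same aggregate cycle rate.
# Assumption: the slower clock tolerates a lower supply voltage (0.9 V).
dual = 2 * dynamic_power(C_EFF, vdd=0.9, freq_hz=0.5e9)

ratio = dual / single
print(f"single: {single:.2f} W, two slower cores: {dual:.2f} W, ratio: {ratio:.4f}")
```

Because the voltage enters squared, the two slower cores draw less power than the single fast one for the same nominal throughput, provided the workload actually parallelizes.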

Which parallelism are we talking about?
Classification: Flynn Categories [9]
• SISD (Single Instruction Single Data)
    • Uniprocessors
• MISD (Multiple Instruction Single Data)
    • Stream based processing
• SIMD (Single Instruction Multiple Data = DLP)
    • Examples: Illiac-IV, CM-2 (Thinking Machines), Xetal (Philips), Imagine (Stanford), Vector machines, Cell architecture (Sony)
    • Simple programming model
    • Low overhead
• MIMD (Multiple Instruction Multiple Data)
    • Examples: Sun Enterprise 5000, Cray T3D, SGI Origin, Multi-core Pentiums, and many more…

Fundamental MP design decision
We have already discussed:
• Shared memory versus Message passing
• Coherence, Consistency and Synchronization issues
Other extremely important decisions:
• Processing units:
    • Homogeneous versus Heterogeneous?
    • Generic versus application specific?
• Interconnect:
    • Bus versus Network?
    • Type (topology) of network
• What types of parallelism to support?
• Focus on Performance, Power or Cost?
• Memory organization?

SMP: Symmetric Multi-Processor [9]
• Memory: centralized with uniform access time (UMA) and bus interconnect, I/O
• Examples: Sun Enterprise 6000, SGI Challenge, Intel
[Figure: four processors, each with one or more cache levels, sharing main memory and the I/O system]

DSM: Distributed Shared Memory [9]
• Nonuniform access time (NUMA) and scalable interconnect (distributed memory)
[Figure: four nodes, each with a processor, a cache and a memory, joined by the interconnection network; the figure also labels main memory and the I/O system]

Independent Memory [9]
• Appropriate for message passing scheme
[Figure: four nodes, each with a processor, a cache and a memory, joined by the interconnection network, plus the I/O system]

Homogeneous or Heterogeneous
• Homogeneous:
    • Replication effect
    • Memory dominated anyway
    • Less performance
    • Advantages:
        • Scalability
        • Degradability (Intel Core Solo)
        • …

Homogeneous or Heterogeneous
• Heterogeneous:
    • Better fit to application domain
    • Most modern systems are heterogeneous

MP vs. SoC
• An SoC (System on Chip) is a multi-IP system
• IPs can be:
    • Custom hardware
    • General purpose or DSP processor
    • Coprocessor
    • Memory blocks
    • Reconfigurable matrix
    • I/O protocol cores
    • …
• Multi-processor systems can be categorized as an SoC (MPSoC)

Design Space SoC Architectures [7]
[Figure: taxonomy of reconfigurable SoC (R-SOC) architectures]
• FINE GRAIN (FPGA)
    • Island Topology: Xilinx Virtex, Xilinx Spartan, Atmel AT40K, Lattice ispXPGA
    • Hierarchical Topology: Altera Stratix, Altera Apex, Altera Cyclone
• MULTI GRANULARITY (Heterogeneous)
    • Processor + Coprocessor
        • Coarse Grain Coprocessor: Chameleon, REMARC, Morphosys
        • Fine Grain Coprocessor: Pleiades, Garp, FIPSOC, Triscend E5, Triscend A7, Xilinx Virtex-II Pro, Altera Excalibur, Atmel FPSIC
    • Tile-Based Architecture: aSoC, E-FPFA
• COARSE GRAIN (Systolic)
    • Linear Topology: Systolic Ring, RaPiD, PipeRench
    • Hierarchical Topology: DART, FPFA
    • Mesh Topology (Tiled processors): RAW, AsAP, CHESS, MATRIX, KressArray, Systolix Pulsedsp

Tiled Processor
• Homogeneous multi-processor systems
• Generally with a 2D structure
    • Maps well onto a 2D die
• Uses simple processors for each tile
• Advantages:
    • Scalability
    • Potential degradability (fault tolerance)
• Disadvantages:
    • Less efficient than heterogeneous structures

M.I.T. Raw Processor
• M.I.T. Raw Architecture Workstation (Raw) architecture
    • Raw processor tile array
    • What’s in a Raw tile?
        • Raw processor tile
        • Inside the compute processor
        • Raw’s network routing resources
        • Raw inter-processor communication
• M.I.T. Raw novel features

M.I.T. Raw Architecture Workstation (RAW) Architecture
• Composed of a replicated processor tile. [8]
• 8-stage pipelined MIPS-like 32-bit processor [7]
• Static and dynamic routers
• Any tile output can be routed off the edge of the chip to the I/O pins.
• Chip bandwidth (16-tile version):
    • Single channel (32-bit) bandwidth of 7.2 Gb/s @ 225 MHz.
    • 14 channels for a total chip bandwidth of 201 Gb/s @ 225 MHz.
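The bandwidth figures above can be checked with a few lines of arithmetic. This is a rough sketch, not a statement from [8]: the per-channel number follows directly from width times clock, but reaching roughly 201 Gb/s from 14 channels requires counting both directions of each channel, which is an assumption made here.

```python
CHANNEL_WIDTH_BITS = 32   # width of one channel, from the slide
CLOCK_HZ = 225e6          # 225 MHz clock, from the slide
NUM_CHANNELS = 14         # channels at the chip edge, from the slide

# One 32-bit word per cycle on a single channel.
channel_gbps = CHANNEL_WIDTH_BITS * CLOCK_HZ / 1e9

# All channels, one direction only.
one_way_gbps = NUM_CHANNELS * channel_gbps

# Assumption: the quoted chip total counts input and output separately.
both_ways_gbps = 2 * one_way_gbps

print(f"per channel: {channel_gbps:.1f} Gb/s")
print(f"14 channels, one direction: {one_way_gbps:.1f} Gb/s")
print(f"14 channels, both directions: {both_ways_gbps:.1f} Gb/s")
```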

RAW Architecture [8]
Divide the silicon into an array of identical, programmable tiles.

Raw Processor Tile [8]
[Figure: a Raw tile, showing the compute processor and the routers attached to the on-chip networks]

Inside the Compute Processor [8]
[Figure: compute processor pipeline (stage labels include IF, TL, M1, M2, TV, F4 and WB), with registers r24 to r27 mapped to the input FIFOs from the static router and to the output FIFOs to the static router, and the local bypass network]

Tiles Static Communication [8]

RAW’s Static Network
• Consists of two tightly-coupled sub-networks:
    • Tile interconnection network
        • For operands & streams between tiles
        • Controlled by the 16 tiles’ static router processors
        • Used to:
            • route operands among local and remote ALUs
            • route data streams among tiles, DRAM, I/O ports
    • Local bypass network
        • For operands & streams within a tile

RAW’s Dynamic Network
• Insert header, and < 32 data words.
• Worms through network.
• Enables MPI programming
• Inter-message ordering not guaranteed.
• RAW’s memory network
• RAW’s general network
    • User-level messaging
    • Can interrupt tile when message arrives
    • Lower performance; for coarse-grained apps
    • For non-compile-time-predictable communication
        • among tiles
        • possibly with I/O devices
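The dynamic-network rules above (a header followed by fewer than 32 data words, with no ordering guarantee between messages) suggest how user-level software might split a long transfer. The sketch below is hypothetical: the header fields `dest`, `length` and `seq` are invented for illustration and are not the Raw header format; the sequence number is a software convention added so the receiver can restore order.

```python
MAX_DATA_WORDS = 31  # "< 32 data words" per message


def packetize(dest_tile, words):
    """Split a word list into (header, payload) messages of at most 31 words."""
    packets = []
    for start in range(0, len(words), MAX_DATA_WORDS):
        chunk = words[start:start + MAX_DATA_WORDS]
        # Hypothetical header layout, for illustration only.
        header = {
            "dest": dest_tile,
            "length": len(chunk),
            "seq": start // MAX_DATA_WORDS,
        }
        packets.append((header, chunk))
    return packets


def reassemble(packets):
    """Restore the original word order using the software sequence numbers."""
    ordered = sorted(packets, key=lambda packet: packet[0]["seq"])
    return [word for _, chunk in ordered for word in chunk]


words = list(range(70))
packets = packetize(dest_tile=5, words=words)
# Deliver the messages in reverse to mimic out-of-order arrival.
restored = reassemble(list(reversed(packets)))
print(len(packets), [header["length"] for header, _ in packets], restored == words)
```

Carrying a sequence number costs one header field per message, which is one reason this style suits coarse-grained communication rather than operand-level traffic.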

M.I.T. Raw Novel Features
■ Dynamic and static network routers.
■ Scalability of Raw chips.
■ Fabricated Raw chips can be placed in an array to further increase the system computing performance.
■ Specifies a homogeneous 2-D array of very simple processors
■ Local bypass network

Field Programmable Function Array of Chameleon Structure
• An FPFA consists of interconnected processor tiles
• Multiple processes can coexist in parallel on different tiles
• Within a tile, multiple data streams can be processed in parallel
• Each processor tile contains multiple reconfigurable ALUs, local memories, a control unit and a communication unit
[Figure: FPFA with five ALUs [7]: a processor tile in which five ALUs, each paired with two RAMs, are joined by an interconnection crossbar]

Field Programmable Function Array
The FPFA concept has a number of advantages:
• The FPFA has a highly regular organization
• It uses general-purpose processor cores
• Its scalability stands in contrast to the dedicated chips designed nowadays
• The FPFA can do media processing tasks such as compression/decompression efficiently

Field Programmable Function Array [7]
[Figure: processor tile with five ALUs, ten memories (M), registers and a crossbar]
Processor tiles
• A tile consists of five identical blocks, which share a control unit and a communication unit
• An individual block contains an ALU, two memories and four register banks of four 20-bit wide registers each
• A crossbar switch provides flexible routing between the ALUs, registers and memories
• This structure is convenient for the Fast Fourier Transform (6-input, 4-output) and the Finite Impulse Response (FIR) filter
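One reading of the "6-input, 4-output" remark above is the radix-2 FFT butterfly, which takes the real and imaginary parts of two samples and of one twiddle factor (six values) and produces the real and imaginary parts of two results (four values). The sketch below shows that operation in plain Python; the function name and argument order are illustrative and are not taken from [7].

```python
def butterfly(a_re, a_im, b_re, b_im, w_re, w_im):
    """Radix-2 butterfly: returns (a + w*b, a - w*b) as four real values."""
    # Complex product t = w * b, written out with real arithmetic.
    t_re = b_re * w_re - b_im * w_im
    t_im = b_re * w_im + b_im * w_re
    # Sum and difference outputs.
    return (a_re + t_re, a_im + t_im, a_re - t_re, a_im - t_im)


# Example: a = 1 + 2j, b = 3 + 4j, twiddle w = -j.
outputs = butterfly(1, 2, 3, 4, 0, -1)
print(outputs)
```

The butterfly needs four multiplications and six additions or subtractions, which is the kind of work that can be spread across several ALUs fed through a crossbar.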

Performance Analysis for Data Intensive Application [1]
Three data-intensive radar sub-systems:
• The corner turn: a matrix transpose operation
    • The matrix size is larger than Imagine’s SRF (128 KB) and Raw’s internal memories (2 MB), but smaller than VIRAM’s on-chip memory (13 MB)
• Beam steering: directs a phased-array radar without physically rotating the antenna
• Coherent side-lobe canceller (CSLC) kernels
    • Consists of FFTs, a weight application (multiplication) stage, and IFFTs
Implemented on:
• Processors In Memory (PIM)
• Stream Processors
• Tile Processors
• PowerPC
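The corner turn listed above is a matrix transpose, and when the matrix does not fit in on-chip memory a common approach is to process it one block at a time. The sketch below is a generic blocked transpose in plain Python, not the implementation evaluated in [1]; the block size is an arbitrary illustrative parameter.

```python
def corner_turn(matrix, block=4):
    """Transpose a rows x cols matrix by visiting it in block x block tiles."""
    rows, cols = len(matrix), len(matrix[0])
    out = [[None] * rows for _ in range(cols)]
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            # Each tile is small enough to stay in fast local memory.
            for r in range(r0, min(r0 + block, rows)):
                for c in range(c0, min(c0 + block, cols)):
                    out[c][r] = matrix[r][c]
    return out


# A 3 x 5 example, transposed with 2 x 2 tiles.
example = [[r * 5 + c for c in range(5)] for r in range(3)]
turned = corner_turn(example, block=2)
print(turned)
```

Choosing the block so that one input tile and one output tile fit in local storage keeps both the reads and the writes reasonably contiguous.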

Vector Intelligent RAM (VIRAM, Berkeley) [6]
• Merge DRAM with Vector Processor
• Mixed logic-DRAM CMOS process
• Scalar MIPS processor core
• 6.4 16-bit GOPS, 1.6 GFLOPS
• 4 float ALUs; 8 32-bit int ALUs; 16 16-bit ALUs
• 12.8 GB/s peak memory access
• 13 MB DRAM
• 15 x 18 mm; IBM Foundry
• Chips fabbed in Q1 ‘03, ISI board on

Imagine Streaming Processor (Stanford)
• 300 MHz, VLIW SIMD machine
• 28 16-bit GOPS, 14 GFLOPS
• 128 kB Streaming Register File
• 8 ALU Clusters
    • 6 ALUs / cluster
    • 84-95% ALU utilization typical
    • 256 x 32 bit local register file
• Streaming Memory Buffers
    • Re-order DRAM accesses
    • Expose data locality
• ALU Intra-cluster BW - 435 GB/sec
• DRAM BW - 2.1 GB/sec
• 16 x 16 mm; TI Foundry

MIT RAW
• 16 tiles of MIPS R4000 @ 300 MHz
• 4.6 GOPS or GFLOPS
• 4 Communication Networks
    • 2 Static Networks, 38.3 GB/sec
    • 2 Dynamic Networks
• 14 External Ports (I/O or DRAM)
    • 33.5 GB/sec
• C and ASM; gcc based compiler
• 18.2 x 18.2 mm; IBM Foundry
• Fully scalable architecture

Experimental Results [1]
[Table: Processor Parameters]

Experimental Results (×10^3 cycles)
[Figure: Speedup compared with PPC with AltiVec]

References
[1] J. Suh, E. G. Kim, S. P. Crago, L. Srinivasan, M. C. French, “A performance analysis of PIM, stream processing, and tiled processing on memory-intensive signal processing kernels,” Proc. of the International Symposium on Computer Architecture, Jun. 2003.
[2] M. B. Taylor, “The Raw processor specification,” Comprehensive specification for the Raw processor, Cambridge, MA, Continuously Updated 2003.
[3] D. Wentzlaff, M. B. Taylor, “The Raw architecture: signal processing on a scalable composable computation fabric,” High Performance Embedded Computing Workshop, 2001.
[4] M. B. Taylor, W. Lee, J. Miller, D. Wentzlaff, I. Bratt, B. Greenwald, “Evaluation of the Raw microprocessor: an exposed-wire-delay architecture for ILP and streams,” Proceedings of International Symposium on Computer Architecture, Jun. 2004.
[5] M.I.T. Raw architecture workstation website: http://cag-www.lcs.mit.edu/raw/
[6] Berkeley Intelligent RAM website: http://iram.cs.berkeley.edu/
[7] “Reconfigurable computation and communication architectures,” Available on: http://vada.skku.ac.kr/ClassInfo/system_level_design/sdr_slides/lec5-reconfigurable-architecture.ppt
[8] J. W. Webb, “Processor architectures at a glance: M.I.T. Raw vs. UC Davis AsAP,” Course Presentation, Available on: http://www.ece.ucdavis.edu/~jwwebb/docs/eec289q_presentation.pdf
[9] H. Corporaal, “Multi-Processor,” Course Presentation, Available on: http://www.es.ele.tue.nl/~heco/courses/aca/lect-8-MP.ppt

Appendix
• Chess
• UC Davis Asynchronous Array of simple Processors (AsAP)

Chess
• HP Labs, Bristol, England
• 2-D array, similar to Matrix
• Contains more “FPGA-like” routing resources.
• No reported software or application results
• Doesn’t support incremental compilation

Chess Interconnect
• More like an FPGA
• Takes advantage of near-neighbor connectivity

Chess Basic Block
• Switchbox memory can be used as storage
• ALU core for computation

Chess Statistics
• Use metrics to evaluate computational power.
• Efficient multiplies due to embedded ALU
• Process independent.

UC Davis Asynchronous Array of simple Processors (AsAP) Architecture
• Composed of a replicated processor tile.
• 9-stage pipelined reduced complexity DSP processor
• Four nearest neighbor inter-processor communication.
• An individual processor tile can operate at a different frequency than its neighbors.
• Off-chip access to the I/O pins must be reached by routing to boundary processors.
• Chip Bandwidth
    • Single channel (16-bit) bandwidth of 16 Gb/s @ 800 MHz.
• The array topology of AsAP is well-suited for applications that are composed of a series of independent tasks.
    • Each of these tasks can be assigned to one or more processors.

Asynchronous Array of simple Processors [8]
What’s in an AsAP tile?
• 16-bit fixed point datapath single issue CPU
    • Instructions for AsAP processors are 32 bits wide.
    • ALU, MAC
• Small Instruction/Data Memories
    • 64-entry instruction memory and a 128-word data memory.
• Hardware address generation
    • Each processor has 4 address generators that calculate addresses for data memory.
• Local programmable clock oscillator
• 2 input and 1 output dual-clock FIFOs, 16 bits wide and 32 words deep
• ~1.1 mm²/processor in 0.18 µm CMOS
• 800 MHz targeted operation

AsAP Single Processor Tile [8]

AsAP Inter-processor Communication [8]
• Each processor output is hard-wired to its four nearest neighbors’ input multiplexers.
• At power-up the input multiplexers are configured.
• As input FIFOs fill up, the sourcing neighbor can be halted by asserting the corresponding hold signal.
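The hold-signal behavior described above is a form of backpressure: a full input FIFO stalls the neighbor that feeds it. The sketch below is a simplified cycle-level model, not the AsAP hardware; the class and function names are hypothetical, the 32-word depth comes from the tile description, and the dual-clock aspect is ignored.

```python
from collections import deque


class InputFifo:
    """Toy model of one input FIFO with a hold signal."""

    def __init__(self, depth=32):
        self.depth = depth
        self.words = deque()

    @property
    def hold(self):
        # Asserted while the FIFO is full, halting the sourcing neighbor.
        return len(self.words) >= self.depth

    def push(self, word):
        if self.hold:
            return False  # producer must stall this cycle
        self.words.append(word)
        return True

    def pop(self):
        return self.words.popleft() if self.words else None


def run(cycles, consume_every):
    """Producer offers one word per cycle; consumer pops every Nth cycle."""
    fifo = InputFifo()
    sent = stalled = 0
    for cycle in range(cycles):
        if fifo.push(sent):
            sent += 1
        else:
            stalled += 1
        if cycle % consume_every == consume_every - 1:
            fifo.pop()
    return sent, stalled, len(fifo.words)


sent, stalled, occupancy = run(cycles=100, consume_every=2)
print(f"sent={sent} stalled={stalled} occupancy={occupancy}")
```

Once the FIFO fills, the producer is throttled to the consumer's rate, so no words are dropped and no global synchronization between the two processors is needed.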

AsAP Contributions
• Provides parallel execution of independent tasks by providing many parallel, independent processing engines
• AsAP specifies a homogeneous 2-D array of very simple processors
    • Single-issue pipelined CPUs
• Independent tasks are mapped across processors and executed in parallel
• Allows efficient exploitation of application-level parallelism.
