
Tiled Processing Systems
Shervin Vakili (shervinv@gmail.com)
October 21, 2007
All materials are copyrights of their respective authors as listed in references


Page 1:

Tiled Processing Systems

Shervin Vakili
shervinv@gmail.com

October 21, 2007
All materials are copyrights of their respective authors as listed in references

Page 2:

Contents

• Why parallel processing
• Fundamental MP design decisions
• Design space of SoC architectures
• Tiled processors
• M.I.T. Raw processor
• Field Programmable Function Array
• Performance analysis for data-intensive applications

Page 3:

Why parallel processing

• Performance drive
• Diminishing returns for exploiting ILP and OLP
• Multiple processors fit easily on a chip
• Cost effective (just connect existing processors or processor cores)
• Low power: parallelism may allow lowering Vdd
• However:
  - Parallel programming is hard

Page 4:

Which parallelism are we talking about? Classification: Flynn Categories [9]

• SISD (Single Instruction, Single Data)
  - Uniprocessors
• MISD (Multiple Instruction, Single Data)
  - Stream-based processing
• SIMD (Single Instruction, Multiple Data = DLP)
  - Examples: Illiac-IV, CM-2 (Thinking Machines), Xetal (Philips), Imagine (Stanford), vector machines, Cell architecture (Sony)
  - Simple programming model
  - Low overhead
• MIMD (Multiple Instruction, Multiple Data)
  - Examples: Sun Enterprise 5000, Cray T3D, SGI Origin, multi-core Pentiums, and many more (a C sketch contrasting SIMD/DLP with MIMD follows below)
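
The SIMD/MIMD distinction can be made concrete in code. The following is a minimal C sketch (not from the slides; names and sizes are arbitrary): the same SAXPY-style computation is written once as a data-parallel loop, the form a SIMD machine or vectorizing compiler exploits, and once as two POSIX threads, i.e., separate instruction streams in the MIMD sense.

```c
/* Hedged illustration (not from the slides): the same computation expressed as
 * data-level parallelism (SIMD/DLP style) and as task-level parallelism
 * (independent instruction streams, MIMD style). */
#include <pthread.h>
#include <stdio.h>

#define N 1024
static float a[N], b[N], c[N];

/* DLP: one operation applied to many data elements.  A SIMD machine (or a
 * vectorizing compiler) executes these iterations in lockstep. */
static void saxpy_dlp(float alpha) {
    for (int i = 0; i < N; i++)
        c[i] = alpha * a[i] + b[i];
}

/* MIMD: independent threads, each running its own instruction stream on its
 * own slice of the data. */
typedef struct { int lo, hi; float alpha; } task_t;

static void *saxpy_task(void *arg) {
    task_t *t = (task_t *)arg;
    for (int i = t->lo; i < t->hi; i++)
        c[i] = t->alpha * a[i] + b[i];
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 1.0f; }

    saxpy_dlp(2.0f);                         /* SIMD/DLP view */

    pthread_t th[2];
    task_t    tk[2] = { {0, N / 2, 2.0f}, {N / 2, N, 2.0f} };
    for (int k = 0; k < 2; k++)              /* MIMD view: two instruction streams */
        pthread_create(&th[k], NULL, saxpy_task, &tk[k]);
    for (int k = 0; k < 2; k++)
        pthread_join(th[k], NULL);

    printf("c[10] = %f\n", c[10]);
    return 0;
}
```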

Page 5:

Fundamental MP Design Decisions

• We have already discussed:
  - Shared memory versus message passing
  - Coherence, consistency and synchronization issues
• Other extremely important decisions:
  - Processing units: homogeneous versus heterogeneous? Generic versus application-specific?
  - Interconnect: bus versus network? Type (topology) of network?
  - What types of parallelism to support?
  - Focus on performance, power or cost?
  - Memory organization?

Page 6:

SMP: Symmetric Multi-Processor [9]

• Memory: centralized with uniform access time (UMA), bus interconnect, and I/O
• Examples: Sun Enterprise 6000, SGI Challenge, Intel

[Figure: several processors, each with one or more cache levels, sharing a bus to main memory and the I/O system]

Page 7:

DSM: Distributed Shared Memory [9]

• Nonuniform access time (NUMA) and scalable interconnect (distributed memory)

[Figure: nodes, each containing a processor, cache and local memory, connected by a scalable interconnection network along with the I/O system]

Page 8:

Independent Memory [9]

• Appropriate for a message-passing scheme (a minimal message-passing sketch follows below)

[Figure: nodes, each containing a processor, cache and private memory, connected only through an interconnection network; the I/O system attaches to the same network]
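
Since the memories are private, nodes can cooperate only by exchanging messages. The sketch below is a hedged, OS-level stand-in (not from the slides): two processes play the role of two nodes and a pipe plays the role of the network.

```c
/* Hedged illustration: message passing between "nodes" with no shared memory.
 * Two OS processes stand in for two nodes; a pipe stands in for the network. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fd[2];                        /* fd[0]: read end, fd[1]: write end */
    if (pipe(fd) != 0) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {                   /* "node 1": computes locally, then sends */
        close(fd[0]);
        int partial = 42;             /* pretend this is a locally computed result */
        if (write(fd[1], &partial, sizeof partial) < 0) _exit(1);
        close(fd[1]);
        _exit(0);
    }

    /* "node 0" cannot read node 1's memory; it must receive the message. */
    close(fd[1]);
    int received = 0;
    if (read(fd[0], &received, sizeof received) < 0) return 1;
    close(fd[0]);
    wait(NULL);

    printf("node 0 received %d from node 1\n", received);
    return 0;
}
```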

Page 9:

Homogeneous or Heterogeneous?

• Homogeneous:
  - Replication effect
  - Memory dominated anyway → less performance loss
  - Advantages:
    - Scalability
    - Degradability (Intel Core Solo)
    - …

Page 10:

Homogeneous or Heterogeneous?

• Heterogeneous:
  - Better fit to the application domain
  - Most modern systems are heterogeneous

Page 11:

MP vs. SoC

• A SoC (System on Chip) is a multi-IP system
• IPs can be:
  - Custom hardware
  - General-purpose or DSP processors
  - Coprocessors
  - Memory blocks
  - Reconfigurable matrices
  - I/O protocol cores
  - …
• Multi-processor systems can be categorized as SoCs (MPSoC)

Page 12:

Design Space of SoC Architectures [7]

Design space of reconfigurable SoC (R-SoC) architectures:

• Fine grain (FPGA)
  - Island topology: Xilinx Virtex, Xilinx Spartan, Atmel AT40K, Lattice ispXPGA
  - Hierarchical topology: Altera Stratix, Altera Apex, Altera Cyclone
• Multi granularity (heterogeneous)
  - Processor + coprocessor
    - Fine-grain coprocessor: Pleiades, Garp, FIPSOC, Triscend E5, Triscend A7, Xilinx Virtex-II Pro, Altera Excalibur, Atmel FPSLIC
    - Coarse-grain coprocessor: Chameleon, REMARC, Morphosys
  - Tile-based architecture: aSoC, E-FPFA
• Coarse grain (systolic)
  - Linear topology: Systolic Ring, RaPiD, PipeRench
  - Hierarchical topology: DART, FPFA
  - Mesh topology (tiled processors): RAW, AsAP, CHESS, MATRIX, KressArray, Systolix PulseDSP

Page 13:

Tiled Processor

• Homogeneous multi-processor systems, generally with a 2D structure
  - Map well onto a 2D die
• Use simple processors for each tile
• Advantages:
  - Scalability
  - Potential degradability / fault tolerance
• Disadvantages:
  - Less efficient than heterogeneous structures

Page 14:

M.I.T. Raw Processor

• M.I.T. Raw Architecture Workstation (Raw) architecture
• Raw processor tile array
• What's in a Raw tile?
  - Raw processor tile
  - Inside the compute processor
  - Raw's network routing resources
  - Raw inter-processor communication
• M.I.T. Raw novel features

Page 15:

M.I.T. Raw Architecture Workstation (RAW)

• Composed of a replicated processor tile [8]
• 8-stage pipelined MIPS-like 32-bit processor [7]
• Static and dynamic routers
• Any tile output can be routed off the edge of the chip to the I/O pins
• Chip bandwidth (16-tile version):
  - Single-channel (32-bit) bandwidth of 7.2 Gb/s @ 225 MHz (32 bit × 225 MHz = 7.2 Gb/s)
  - 14 channels for a total chip bandwidth of 201 Gb/s @ 225 MHz (the total matches 14 ports counted in both directions: 14 × 2 × 7.2 Gb/s ≈ 201 Gb/s)

Page 16:

RAW Architecture [8]

Divide the silicon into an array of identical, programmable tiles.

Page 17:

Raw Processor Tile [8]

[Figure: a single Raw tile, showing the compute processor, the routers, and the on-chip networks]

Page 18:

Inside the Compute Processor [8]

[Figure: the compute processor pipeline from instruction fetch through writeback with its local bypass network; input FIFOs from the static router and output FIFOs to the static router are mapped onto registers r24-r27]

Page 19:

Tiles Static Communication [8]

Page 20:

RAW's Static Network

• Consists of two tightly coupled sub-networks:
  - Tile interconnection network
    - For operands & streams between tiles
    - Controlled by the 16 tiles' static router processors
    - Used to route operands among local and remote ALUs, and to route data streams among tiles, DRAM and I/O ports (a routing sketch follows below)
  - Local bypass network
    - For operands & streams within a tile
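
The static routes themselves are fixed at compile time; the router processors simply replay them. As a hedged stand-in (this is not Raw's router program or its routing algorithm, only an illustration of what a per-hop route looks like), the C sketch below computes a dimension-ordered hop sequence between two tiles of a 4x4 array.

```c
/* Hedged illustration: a dimension-ordered (X-then-Y) hop sequence between two
 * tiles of a 4x4 array, the kind of per-tile route a compile-time schedule
 * could encode.  Not Raw's actual router code. */
#include <stdio.h>

typedef struct { int x, y; } tile_t;

static void print_route(tile_t src, tile_t dst) {
    printf("route (%d,%d) -> (%d,%d):", src.x, src.y, dst.x, dst.y);
    tile_t cur = src;
    while (cur.x != dst.x) {                  /* step along X first */
        cur.x += (dst.x > cur.x) ? 1 : -1;
        printf(" (%d,%d)", cur.x, cur.y);
    }
    while (cur.y != dst.y) {                  /* then along Y */
        cur.y += (dst.y > cur.y) ? 1 : -1;
        printf(" (%d,%d)", cur.x, cur.y);
    }
    printf("\n");
}

int main(void) {
    tile_t src = {0, 0}, dst = {3, 2};        /* two tiles of the 16-tile (4x4) array */
    print_route(src, dst);
    return 0;
}
```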

Page 21:

RAW's Dynamic Network

• A message is a header plus fewer than 32 data words; it worms through the network (a hypothetical message-format sketch follows below)
• Enables MPI-style programming
• Inter-message ordering is not guaranteed
• Two dynamic networks: RAW's memory network and RAW's general network
  - The general network carries user-level messaging and can interrupt a tile when a message arrives
  - Lower performance; intended for coarse-grained applications
  - For non-compile-time-predictable communication among tiles, possibly with I/O devices
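
The slide fixes only the message shape: one header, then fewer than 32 data words, wormhole-routed. The C sketch below packs such a message the way a user-level send helper might; the header layout (destination tile in the low byte, length in the next byte) is purely hypothetical, not Raw's actual encoding.

```c
/* Hedged illustration: building a dynamic-network message of one header word
 * plus fewer than 32 data words.  The header layout here is invented. */
#include <stdint.h>
#include <stdio.h>

#define MAX_PAYLOAD_WORDS 31            /* "< 32 data words" per message */

typedef struct {
    uint32_t header;                    /* hypothetical: dest tile | (length << 8) */
    uint32_t payload[MAX_PAYLOAD_WORDS];
    int      len;
} dyn_msg_t;

static int msg_build(dyn_msg_t *m, uint8_t dest_tile,
                     const uint32_t *words, int len) {
    if (len < 0 || len > MAX_PAYLOAD_WORDS) return -1;   /* too long to send */
    m->header = (uint32_t)dest_tile | ((uint32_t)len << 8);
    for (int i = 0; i < len; i++) m->payload[i] = words[i];
    m->len = len;
    return 0;
}

int main(void) {
    uint32_t data[3] = {1, 2, 3};
    dyn_msg_t m;
    if (msg_build(&m, /*dest_tile=*/5, data, 3) == 0)
        printf("header=0x%08x, %d payload words\n", (unsigned)m.header, m.len);
    return 0;
}
```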

Page 22:

M.I.T. Raw Novel Features

■ Dynamic and static network routers
■ Scalability of Raw chips
  - Fabricated Raw chips can be placed in an array to further increase system computing performance
■ Specifies a homogeneous 2-D array of very simple processors
■ Local bypass network

Page 23:

Field Programmable Function Array of the Chameleon Structure

• An FPFA consists of interconnected processor tiles
• Multiple processes can coexist in parallel on different tiles
• Within a tile, multiple data streams can be processed in parallel
• Each processor tile contains multiple reconfigurable ALUs, local memories, a control unit and a communication unit

[Figure: an FPFA processor tile with five ALUs, each paired with two RAMs, joined by an interconnection crossbar ("FPFA with five ALUs") [7]]

Page 24:

Field Programmable Function Array

• The FPFA concept has a number of advantages:
  - The FPFA has a highly regular organization
  - It uses general-purpose processor cores
  - Its scalability stands in contrast to the dedicated chips designed nowadays
  - The FPFA can do media-processing tasks such as compression/decompression efficiently

Page 25:

Field Programmable Function Array

[Figure: five ALUs and their memories connected to the register banks through a crossbar]

• Processor tiles
  - Consist of five identical blocks, which share a control unit and a communication unit
  - An individual block contains an ALU, two memories and four register banks of four 20-bit-wide registers
  - A crossbar switch provides flexible routing between the ALUs, registers and memories
  - This structure is convenient for the Fast Fourier Transform (6-input, 4-output) and the Finite Impulse Response filter (an FIR sketch follows below)

[7]
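
One way to picture the five-block tile on an FIR filter is one multiply-accumulate per block per sample, with the partial results combined over the crossbar. The plain-C model below only mimics that partitioning; the actual FPFA mapping, word widths and routing are not taken from the slides.

```c
/* Hedged illustration: a 5-tap FIR filter split as one multiply-accumulate per
 * "ALU block", with partial sums combined afterwards.  Coefficients and input
 * samples are arbitrary example data. */
#include <stdio.h>

#define TAPS    5          /* one tap per ALU block */
#define SAMPLES 8

int main(void) {
    const int coeff[TAPS] = {1, 2, 3, 2, 1};
    const int x[SAMPLES]  = {1, 0, 0, 1, 2, 3, 0, 1};
    int y[SAMPLES] = {0};

    for (int n = 0; n < SAMPLES; n++) {
        int partial[TAPS] = {0};
        /* Each "ALU block" k computes one product coeff[k] * x[n-k]. */
        for (int k = 0; k < TAPS; k++)
            if (n - k >= 0)
                partial[k] = coeff[k] * x[n - k];
        /* Combining the partial results stands in for the crossbar/register
         * traffic between blocks. */
        for (int k = 0; k < TAPS; k++)
            y[n] += partial[k];
        printf("y[%d] = %d\n", n, y[n]);
    }
    return 0;
}
```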

Page 26:

Performance Analysis for Data-Intensive Applications [1]

• Three data-intensive radar sub-systems:
  - Corner turn: a matrix transpose operation (a transpose sketch follows below)
    - The matrix size is larger than Imagine's SRF (128 KB) and Raw's internal memories (2 MB), but smaller than VIRAM's on-chip memory (13 MB)
  - Beam steering: directs a phased-array radar without physically rotating the antenna
  - Coherent side-lobe canceller (CSLC) kernels: consist of FFTs, a weight-application (multiplication) stage, and IFFTs
• Implemented on:
  - Processors In Memory (PIM)
  - Stream processors
  - Tile processors
  - PowerPC
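
The corner turn is simply a matrix transpose, and its difficulty comes from where the matrix fits in the memory hierarchy. The C sketch below is a generic blocked transpose, not the benchmark code from [1]; matrix and block sizes are arbitrary stand-ins for "block chosen to fit on-chip memory".

```c
/* Hedged illustration: a blocked (tiled) matrix transpose, the generic form of
 * the corner-turn kernel.  N and BLOCK are arbitrary example sizes. */
#include <stdio.h>

#define N     8            /* matrix is N x N */
#define BLOCK 4            /* block edge, chosen so a block fits in local memory */

static float a[N][N], at[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = (float)(i * N + j);

    /* Transpose one BLOCK x BLOCK sub-matrix at a time to keep the working set small. */
    for (int bi = 0; bi < N; bi += BLOCK)
        for (int bj = 0; bj < N; bj += BLOCK)
            for (int i = bi; i < bi + BLOCK; i++)
                for (int j = bj; j < bj + BLOCK; j++)
                    at[j][i] = a[i][j];

    printf("a[1][2] = %.0f, at[2][1] = %.0f\n", a[1][2], at[2][1]);
    return 0;
}
```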

Page 27:

Vector Intelligent RAM (VIRAM, Berkeley) [6]

• Merges DRAM with a vector processor
• Mixed logic-DRAM CMOS process
• Scalar MIPS processor core
• 6.4 16-bit GOPS, 1.6 GFLOPS
  - 4 float ALUs; 8 32-bit int ALUs; 16 16-bit ALUs
• 12.8 GB/s peak memory access
• 13 MB DRAM
• 15 x 18 mm; IBM foundry
• Chips fabbed in Q1 '03, ISI board on

Page 28:

Imagine Streaming Processor (Stanford)

• 300 MHz VLIW SIMD machine
• 28 16-bit GOPS, 14 GFLOPS
• 128 kB Streaming Register File
• 8 ALU clusters, 6 ALUs per cluster
  - 84-95% ALU utilization typical
  - 256 x 32-bit local register file
• Streaming memory buffers
  - Re-order DRAM accesses
  - Expose data locality
• Intra-cluster ALU BW: 435 GB/s; DRAM BW: 2.1 GB/s
• 16 x 16 mm; TI foundry

Page 29:

MIT RAW

• 16 tiles of MIPS R4000 @ 300 MHz
• 4.6 GOPS or GFLOPS
• 4 communication networks
  - 2 static networks, 38.3 GB/s
  - 2 dynamic networks
• 14 external ports (I/O or DRAM), 33.5 GB/s
• C and ASM; gcc-based compiler
• 18.2 x 18.2 mm; IBM foundry
• Fully scalable architecture

Page 30:

Experimental Results [1]

[Tables: processor parameters; experimental results (x10^3 cycles); speedup compared with a PowerPC with AltiVec]

Page 31:

References

[1] J. Suh, E. G. Kim, S. P. Crago, L. Srinivasan, M. C. French, "A performance analysis of PIM, stream processing, and tiled processing on memory-intensive signal processing kernels," Proc. of the International Symposium on Computer Architecture, Jun. 2003.
[2] M. B. Taylor, "The Raw processor specification," comprehensive specification for the Raw processor, Cambridge, MA, continuously updated, 2003.
[3] D. Wentzlaff, M. B. Taylor, "The Raw architecture: signal processing on a scalable composable computation fabric," High Performance Embedded Computing Workshop, 2001.
[4] M. B. Taylor, W. Lee, J. Miller, D. Wentzlaff, I. Bratt, B. Greenwald, "Evaluation of the Raw microprocessor: an exposed-wire-delay architecture for ILP and streams," Proceedings of the International Symposium on Computer Architecture, Jun. 2004.
[5] M.I.T. Raw architecture workstation website: http://cag-www.lcs.mit.edu/raw/
[6] Berkeley Intelligent RAM website: http://iram.cs.berkeley.edu/
[7] "Reconfigurable computation and communication architectures," available at: http://vada.skku.ac.kr/ClassInfo/system_level_design/sdr_slides/lec5-reconfigurable-architecture.ppt
[8] J. W. Webb, "Processor architectures at a glance: M.I.T. Raw vs. UC Davis AsAP," course presentation, available at: http://www.ece.ucdavis.edu/~jwwebb/docs/eec289q_presentation.pdf
[9] H. Corporaal, "Multi-Processor," course presentation, available at: http://www.es.ele.tue.nl/~heco/courses/aca/lect-8-MP.ppt

Page 32:

Appendix

• Chess
• UC Davis Asynchronous Array of simple Processors (AsAP)

Page 33:

Chess

• HP Labs, Bristol, England
• 2-D array, similar to Matrix
• Contains more "FPGA-like" routing resources
• No reported software or application results
• Doesn't support incremental compilation

Page 34:

Chess Interconnect

• More like an FPGA
• Takes advantage of near-neighbor connectivity

Page 35:

Chess Basic Block

• Switchbox memory can be used as storage
• ALU core for computation

Page 36:

Chess Statistics

• Uses metrics to evaluate computational power
• Efficient multiplies due to the embedded ALU
• Process independent

Page 37:

UC Davis Asynchronous Array of simple Processors (AsAP)

• Composed of a replicated processor tile
• 9-stage pipelined, reduced-complexity DSP processor
• Four-nearest-neighbor inter-processor communication
• Individual processor tiles can operate at different frequencies than their neighbors
• Off-chip access to the I/O pins must be reached by routing to boundary processors
• Chip bandwidth:
  - Single-channel (16-bit) bandwidth of 16 Gb/s @ 800 MHz
• The array topology of AsAP is well suited for applications composed of a series of independent tasks
  - Each of these tasks can be assigned to one or more processors

Page 38:

Asynchronous Array of simple Processors [8]

Page 39:

What's in an AsAP tile?

• 16-bit fixed-point datapath, single-issue CPU
  - Instructions for AsAP processors are 32 bits wide
  - ALU, MAC
• Small instruction/data memories
  - 64-entry instruction memory and a 128-word data memory
• Hardware address generation
  - Each processor has 4 address generators that calculate addresses for data memory
• Local programmable clock oscillator
• 2 input and 1 output dual-clock FIFOs, 16 bits wide and 32 words deep
• ~1.1 mm² per processor in 0.18 µm CMOS
• 800 MHz targeted operation

Page 40:

AsAP Single Processor Tile [8]

Page 41:

AsAP Inter-processor Communication

• Each processor's output is hard-wired to the input multiplexers of its four nearest neighbors
• At power-up the input multiplexers are configured
• As an input FIFO fills up, the sourcing neighbor can be halted by asserting the corresponding hold signal (a FIFO model with this backpressure is sketched below)

[8]
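
The hold mechanism is easy to model in software. The C sketch below is a behavioral stand-in, not AsAP hardware: a bounded FIFO whose hold flag plays the role of the backpressure signal that halts the sourcing neighbor; only the 32-word depth is taken from the earlier slide.

```c
/* Hedged illustration: a bounded FIFO with a "hold" flag modelling the
 * backpressure signal that halts the sourcing neighbor when the FIFO is full. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FIFO_DEPTH 32

typedef struct {
    uint16_t data[FIFO_DEPTH];
    int      head, tail, count;
    bool     hold;               /* asserted when the FIFO cannot accept a word */
} fifo_t;

static void fifo_init(fifo_t *f) { f->head = f->tail = f->count = 0; f->hold = false; }

/* The producing neighbor checks 'hold' before pushing; returns false if halted. */
static bool fifo_push(fifo_t *f, uint16_t word) {
    if (f->hold) return false;
    f->data[f->tail] = word;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    f->hold = (f->count == FIFO_DEPTH);       /* assert backpressure when full */
    return true;
}

static bool fifo_pop(fifo_t *f, uint16_t *word) {
    if (f->count == 0) return false;
    *word = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    f->hold = false;                          /* room again: release the source */
    return true;
}

int main(void) {
    fifo_t f;
    fifo_init(&f);

    int pushed = 0;
    while (fifo_push(&f, (uint16_t)pushed)) pushed++;    /* fill until hold asserts */
    printf("pushed %d words, hold=%d\n", pushed, f.hold);

    uint16_t w;
    fifo_pop(&f, &w);                                    /* consumer drains one word */
    printf("after one pop, hold=%d (source may resume)\n", f.hold);
    return 0;
}
```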

Page 42:

AsAP Contributions

• Provides parallel execution of independent tasks by providing many parallel, independent processing engines
• AsAP specifies a homogeneous 2-D array of very simple processors
  - Single-issue pipelined CPUs
• Independent tasks are mapped across processors and executed in parallel
• Allows efficient exploitation of application-level parallelism