Tiled Processing Systems
Shervin Vakili
[email protected]
October 21, 2007
All materials are copyright of their respective authors, as listed in the references.
Contents
• Why parallel processing
• Fundamental MP design decision
• Design Space SoC Architectures
• Tiled Processor
• M.I.T. Raw Processor
• Field Programmable Function Array
• Performance Analysis for Data Intensive Application
Why parallel processing
• Performance drive
• Diminishing returns for exploiting ILP and OLP
• Multiple processors fit easily on a chip
• Cost effective (just connect existing processors or processor cores)
• Low power: parallelism may allow lowering Vdd
However:
• Parallel programming is hard
Which parallelism are we talking about? Classification: Flynn Categories [9]
• SISD (Single Instruction Single Data)
  • Uniprocessors
• MISD (Multiple Instruction Single Data)
  • Stream based processing
• SIMD (Single Instruction Multiple Data = DLP)
  • Examples: Illiac-IV, CM-2 (Thinking Machines), Xetal (Philips), Imagine (Stanford), Vector machines, Cell architecture (Sony)
  • Simple programming model
  • Low overhead
• MIMD (Multiple Instruction Multiple Data)
  • Examples: Sun Enterprise 5000, Cray T3D, SGI Origin, Multi-core Pentiums, and many more
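The difference between the SIMD and MIMD categories can be sketched in a few lines of Python. This is only an illustration of the programming models, not of any machine listed above: the function names are hypothetical, and Python threads stand in for independent processors without claiming real hardware parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def simd_add(xs, ys):
    """SIMD style: one operation (add) applied across every data lane."""
    return [x + y for x, y in zip(xs, ys)]


def mimd_run(tasks):
    """MIMD style: each worker runs its own instruction stream on its own data."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(fn, data) for fn, data in tasks]
        # Results are collected in submission order, not completion order.
        return [f.result() for f in futures]


simd_result = simd_add([1, 2, 3, 4], [10, 20, 30, 40])
mimd_result = mimd_run([(sum, [1, 2, 3]), (max, [4, 9, 2]), (sorted, [3, 1, 2])])
```

In the SIMD call every element goes through the same instruction, which is why the programming model is simple; in the MIMD call the three workers execute unrelated functions.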
Fundamental MP design decision
We have already discussed:
• Shared memory versus Message passing
• Coherence, Consistency and Synchronization issues
Other extremely important decisions:
• Processing units:
  • Homogeneous versus Heterogeneous?
  • Generic versus application specific?
• Interconnect:
  • Bus versus Network?
  • Type (topology) of network
• What types of parallelism to support?
• Focus on Performance, Power or Cost?
• Memory organization?
SMP: Symmetric Multi-Processor [9]
• Memory: centralized with uniform access time (UMA) and bus interconnect, I/O
• Examples: Sun Enterprise 6000, SGI Challenge, Intel
[Figure: four processors, each with one or more cache levels, sharing main memory and an I/O system]
DSM: Distributed Shared Memory [9]
• Nonuniform access time (NUMA) and scalable interconnect (distributed memory)
[Figure: four nodes, each with a processor, a cache and a memory, attached to an interconnection network; further labels: main memory, I/O system]
Independent Memory [9]
• Appropriate for message passing scheme
[Figure: four nodes, each with a processor, a cache and a memory, attached to an interconnection network; further label: I/O system]
Homogeneous or Heterogeneous
• Homogeneous:
  • Replication effect
  • Memory dominated anyway
  • Less performance
  • Advantages:
    • Scalability
    • Degradability (Intel Core Solo)
    • ...
• Heterogeneous:
  • Better fit to application domain
  • Most modern systems are heterogeneous
MP vs. SoC
• SoC (System on Chip) is a multi-IP system
• IPs can be:
  • Custom hardware
  • General purpose or DSP processor
  • Coprocessor
  • Memory blocks
  • Reconfigurable matrix
  • I/O protocol cores
  • ...
• Multi-processor systems can be categorized as SoCs (MPSoC)
Design Space SoC Architectures [7]
[Figure: classification tree of reconfigurable SoC (R-SOC) architectures]
Category labels in the figure:
• Fine grain (FPGA): island topology, hierarchical topology
• Multi granularity (heterogeneous): processor + coprocessor (coarse grain coprocessor, fine grain coprocessor), tile-based architecture
• Coarse grain (systolic): linear topology, hierarchical topology, mesh topology (tiled processors)
Example architectures, in the groups shown in the figure:
• Chameleon, REMARC, Morphosys
• Pleiades, Garp, FIPSOC, Triscend E5, Triscend A7, Xilinx Virtex-II Pro, Altera Excalibur, Atmel FPSIC
• Xilinx Virtex, Xilinx Spartan, Atmel AT40K, Lattice ispXPGA
• Altera Stratix, Altera Apex, Altera Cyclone
• Systolic Ring, RaPiD, PipeRench
• DART, FPFA
• RAW, AsAP, CHESS, MATRIX, KressArray, Systolix Pulsedsp
• aSoC, E-FPFA
Tiled Processor
• Homogeneous multi-processor systems
• Generally with 2D structure
  • Maps well onto a 2D die
• Uses simple processors for each tile
• Advantages:
  • Scalability
  • Potential degradability and fault tolerance
• Disadvantages:
  • Less efficient than heterogeneous structures
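To make the 2D structure concrete, the sketch below models a mesh of tiles numbered in row-major order and counts nearest-neighbour hops between two tiles. The numbering scheme, the function names and the row-first routing order are all assumptions chosen for illustration; they are not taken from any particular tiled processor.

```python
def tile_coords(tile_id, width):
    """Map a row-major tile index to (row, col) on a mesh `width` tiles wide."""
    return divmod(tile_id, width)


def hops(src, dst, width):
    """Nearest-neighbour hops between two tiles (Manhattan distance)."""
    (r0, c0), (r1, c1) = tile_coords(src, width), tile_coords(dst, width)
    return abs(r0 - r1) + abs(c0 - c1)


def xy_route(src, dst, width):
    """Tile ids visited when moving along the row first, then along the column."""
    r, c = tile_coords(src, width)
    r1, c1 = tile_coords(dst, width)
    path = [src]
    while c != c1:  # travel horizontally until the destination column is reached
        c += 1 if c1 > c else -1
        path.append(r * width + c)
    while r != r1:  # then travel vertically to the destination row
        r += 1 if r1 > r else -1
        path.append(r * width + c)
    return path


corner_to_corner = xy_route(0, 15, width=4)
```

On a 4 x 4 mesh the worst case is corner to corner, and the hop count grows with the side length rather than with the number of tiles, which is one reason meshes scale well.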
M.I.T. Raw Processor
• M.I.T. Raw Architecture Workstation (Raw) architecture
• Raw processor tile array
• What's in a Raw tile?
  • Raw processor tile
  • Inside the compute processor
  • Raw's network routing resources
  • Raw inter-processor communication
• M.I.T. Raw novel features
M.I.T. Raw Architecture Workstation (RAW) Architecture
• Composed of a replicated processor tile. [8]
• 8-stage pipelined MIPS-like 32-bit processor [7]
• Static and dynamic routers
• Any tile output can be routed off the edge of the chip to the I/O pins.
• Chip bandwidth (16-tile version):
  • Single channel (32-bit) bandwidth of 7.2 Gb/s @ 225 MHz.
  • 14 channels for a total chip bandwidth of 201 Gb/s @ 225 MHz.
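The bandwidth figures can be checked with a little arithmetic. The single-channel number follows directly from 32 bits per cycle at 225 MHz. The 201 Gb/s total is reproduced only if each of the 14 channels is counted in both directions; that reading is an assumption made here, not something stated on the slide.

```python
def channel_mbps(width_bits, clock_mhz):
    """Peak channel bandwidth in Mb/s, assuming one word transferred per cycle."""
    return width_bits * clock_mhz


single_mbps = channel_mbps(32, 225)   # 7200 Mb/s, i.e. 7.2 Gb/s
one_way_mbps = 14 * single_mbps       # 100800 Mb/s if channels are counted once
both_ways_mbps = 2 * one_way_mbps     # 201600 Mb/s if both directions are counted

print(f"single channel: {single_mbps / 1000} Gb/s")
print(f"14 channels, both directions: {both_ways_mbps / 1000} Gb/s")
```

Integer Mb/s values are used for the intermediate results so that the comparison with the slide is not blurred by floating-point rounding.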
RAW Architecture [8]
Divide the silicon into an array of identical, programmable tiles.
Raw Processor Tile [8]
[Figure: a Raw tile, showing the compute processor, the routers and the on-chip networks]
Inside the Compute Processor [8]
[Figure: compute processor pipeline, with registers r24 to r27 shown at the input FIFOs from the static router and at the output FIFOs to the static router, plus the local bypass network]
Tiles Static Communication [8]
RAW's Static Network
• Consists of two tightly-coupled sub-networks:
  • Tile interconnection network
    • For operands & streams between tiles
    • Controlled by the 16 tiles' static router processors
    • Used to:
      • route operands among local and remote ALUs
      • route data streams among tiles, DRAM, I/O ports
  • Local bypass network
    • For operands & streams within a tile
RAW's Dynamic Network
• Insert header, and < 32 data words.
• Worms through network.
• Enables MPI programming
• Inter-message ordering not guaranteed.
• RAW's memory network
• RAW's general network
  • User-level messaging
  • Can interrupt tile when message arrives
  • Lower performance; for coarse-grained apps
  • For communication that is not predictable at compile time
    • among tiles
    • possibly with I/O devices
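The dynamic-network rules above (one header followed by fewer than 32 data words, with no ordering guarantee between messages) can be sketched as a packetizer. The header layout used here, a dictionary with `dest`, `length` and `seq` fields, is purely hypothetical and is not the Raw header format; the sequence number is added only to show how a receiver could cope with out-of-order arrival.

```python
MAX_DATA_WORDS = 31  # "< 32 data words" per message, as stated on the slide


def packetize(dest_tile, words):
    """Split `words` into messages of one header plus at most 31 data words."""
    messages = []
    for start in range(0, len(words), MAX_DATA_WORDS):
        chunk = words[start:start + MAX_DATA_WORDS]
        # Hypothetical header fields, for illustration only.
        header = {"dest": dest_tile, "length": len(chunk),
                  "seq": start // MAX_DATA_WORDS}
        messages.append((header, chunk))
    return messages


def reassemble(messages):
    """Rebuild the payload even if the messages arrived out of order."""
    ordered = sorted(messages, key=lambda m: m[0]["seq"])
    return [word for _, chunk in ordered for word in chunk]


payload = list(range(70))
msgs = packetize(dest_tile=5, words=payload)
restored = reassemble(list(reversed(msgs)))  # simulate reversed arrival order
```

A 70-word payload needs three messages here, and the receiver recovers the original order from the sequence numbers rather than from arrival order.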
M.I.T. Raw Novel Features
■ Dynamic and static network routers.
■ Scalability of Raw chips.
■ Fabricated Raw chips can be placed in an array to further increase the system computing performance.
■ Specifies a homogeneous 2-D array of very simple processors
■ Local bypass network
Field Programmable Function Array of Chameleon Structure
• An FPFA consists of interconnected processor tiles
• Multiple processes can coexist in parallel on different tiles
• Within a tile, multiple data streams can be processed in parallel
• Each processor tile contains multiple reconfigurable ALUs, local memories, a control unit and a communication unit
[Figure: FPFA with five ALUs [7]; a processor tile containing five ALUs, each with two RAMs, and an interconnection crossbar]
Field Programmable Function Array
The FPFA concept has a number of advantages:
• The FPFA has a highly regular organization
• We use a general-purpose processor core
• Its scalability stands in contrast to the dedicated chips designed nowadays
• The FPFA can do media processing tasks such as compression/decompression efficiently
Field Programmable Function Array
[Figure [7]: a processor tile with five ALUs, registers, a crossbar and ten memories (M)]
Processor tiles
• Consist of five identical blocks, which share a control unit and a communication unit
• An individual block contains an ALU, two memories and four register banks of four 20-bit wide registers
• A crossbar switch provides flexible routing between the ALUs, registers and memories
• This structure is convenient for the Fast Fourier Transform (6-input, 4-output) and the Finite Impulse Response (FIR) filter
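One way to read the "6-input, 4-output" remark is as a radix-2 FFT butterfly written on real numbers: the real and imaginary parts of two samples and of one twiddle factor go in, and the real and imaginary parts of two results come out. The sketch below follows that reading, which is an interpretation rather than something spelled out on the slide, and it does not model how the operations are distributed over the five ALUs.

```python
def butterfly(a_re, a_im, b_re, b_im, w_re, w_im):
    """Radix-2 FFT butterfly on real-valued operands: 6 inputs, 4 outputs."""
    # Complex product t = W * b, expanded into four real multiplications.
    t_re = w_re * b_re - w_im * b_im
    t_im = w_re * b_im + w_im * b_re
    # The two outputs are a + t and a - t.
    return a_re + t_re, a_im + t_im, a_re - t_re, a_im - t_im


# a = 1 + 2i, b = 3 + 4i, W = i, so t = i * (3 + 4i) = -4 + 3i
outputs = butterfly(1, 2, 3, 4, 0, 1)
```

The butterfly needs four multiplications and six additions or subtractions, so several arithmetic units working side by side have independent work available in every step.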
Performance Analysis for Data Intensive Application [1]
Three data-intensive radar sub-system kernels:
• Corner turn: a matrix transpose operation
  • The matrix size is larger than Imagine's SRF (128 KB) and Raw's internal memories (2 MB), but smaller than VIRAM's on-chip memory (13 MB)
• Beam steering: directs a phased-array radar without physically rotating the antenna
• Coherent side-lobe canceller (CSLC): consists of FFTs, a weight application (multiplication) stage, and IFFTs
Implemented on:
• Processors In Memory (PIM)
• Stream processors
• Tile processors
• PowerPC
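The corner turn is a matrix transpose, and because the matrix does not fit in the on-chip storage of two of the three machines, it is natural to process it block by block. The sketch below shows a blocked transpose in plain Python; the block size and function name are hypothetical, and this is not the implementation evaluated in [1].

```python
def corner_turn(matrix, block=2):
    """Transpose `matrix` one block at a time (blocked corner turn)."""
    rows, cols = len(matrix), len(matrix[0])
    out = [[None] * rows for _ in range(cols)]
    for r0 in range(0, rows, block):          # walk over block rows
        for c0 in range(0, cols, block):      # walk over block columns
            # Copy one block; only this block must be resident at a time.
            for r in range(r0, min(r0 + block, rows)):
                for c in range(c0, min(c0 + block, cols)):
                    out[c][r] = matrix[r][c]
    return out


turned = corner_turn([[1, 2, 3], [4, 5, 6]])
```

Choosing `block` so that one source block and one destination block fit in local memory keeps each pass within a tile's or cluster's storage, at the price of more passes over external memory.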
Vector Intelligent RAM (VIRAM, Berkeley) [6]
• Merges DRAM with a vector processor
• Mixed logic-DRAM CMOS process
• Scalar MIPS processor core
• 6.4 16-bit GOPS, 1.6 GFLOPS
• 4 float ALUs; 8 32-bit int ALUs; 16 16-bit ALUs
• 12.8 GB/s peak memory access
• 13 MB DRAM
• 15 x 18 mm; IBM Foundry
• Chips fabbed in Q1 '03, ISI board on
Imagine Streaming Processor (Stanford)
• 300 MHz, VLIW SIMD machine
• 28 16-bit GOPS, 14 GFLOPS
• 128 KB Streaming Register File
• 8 ALU clusters
  • 6 ALUs / cluster
  • 84-95% ALU utilization typical
  • 256 x 32 bit local register file
• Streaming Memory Buffers
  • Re-order DRAM accesses
  • Expose data locality
• ALU intra-cluster BW: 435 GB/sec
• DRAM BW: 2.1 GB/sec
• 16 x 16 mm; TI Foundry
MIT RAW
• 16 tiles of MIPS R4000 @ 300 MHz
• 4.6 GOPS or GFLOPS
• 4 communication networks
  • 2 static networks, 38.3 GB/sec
  • 2 dynamic networks
• 14 external ports (I/O or DRAM), 33.5 GB/sec
• C and ASM; gcc-based compiler
• 18.2 x 18.2 mm; IBM Foundry
• Fully scalable architecture
Experimental Results [1]
Processor Parameters
Experimental Results (*10^3 cycles)
Speedup compared with PPC with AltiVec
References
[1] J. Suh, E.G. Kim, S. P. Crago, L. Srinivasan, M. C. French, "A Performance analysis of PIM, stream processing, and tiled processing on memory-intensive signal processing kernels," Proc. of the International Symposium on Computer Architecture, Jun. 2003.
[2] M. B. Taylor, "The Raw processor specification," Comprehensive specification for the Raw processor, Cambridge, MA, Continuously Updated 2003.
[3] D. Wentzlaff, M. B. Taylor, "The Raw architecture: signal processing on a scalable composable Computation Fabric," High Performance Embedded Computing Workshop, 2001.
[4] M. B. Taylor, W. Lee, J. Miller, D. Wentzlaff, I. Bratt, B. Greenwald, "Evaluation of the Raw microprocessor: an exposed-wire-delay architecture for ILP and streams," Proceedings of International Symposium on Computer Architecture, Jun. 2004.
[5] M.I.T. Raw architecture workstation website: http://cag-www.lcs.mit.edu/raw/
[6] Berkeley Intelligent RAM website: http://iram.cs.berkeley.edu/
[7] "Reconfigurable computation and communication architectures," Available on: http://vada.skku.ac.kr/ClassInfo/system_level_design/sdr_slides/lec5-reconfigurable-architecture.ppt
[8] J. W. Webb, "Processor architectures at A glance: M.I.T. Raw vs. UC Davis AsAP", Course Presentation, Available on: http://www.ece.ucdavis.edu/~jwwebb/docs/eec289q_presentation.pdf
[9] H. Corporaal, "Multi-Processor", Course Presentation, Available on: http://www.es.ele.tue.nl/~heco/courses/aca/lect-8-MP.ppt
Appendix
• Chess
• UC Davis Asynchronous Array of simple Processors (AsAP)
Chess
• HP Labs, Bristol, England
• 2-D array, similar to Matrix
• Contains more "FPGA-like" routing resources.
• No reported software or application results
• Doesn't support incremental compilation
Chess Interconnect
• More like an FPGA
• Takes advantage of near-neighbor connectivity
Chess Basic Block
• Switchbox memory can be used as storage
• ALU core for computation
Chess Statistics
• Use metrics to evaluate computational power.
• Efficient multiplies due to embedded ALU
• Process independent.
UC Davis Asynchronous Array of simple Processors (AsAP) Architecture
• Composed of a replicated processor tile.
• 9-stage pipelined reduced complexity DSP processor
• Four nearest neighbor inter-processor communication.
• Individual processor tiles can operate at different frequencies than their neighbors.
• Off-chip access to the I/O pins must be reached by routing to boundary processors.
• Chip bandwidth
  • Single channel (16-bit) bandwidth of 16 Gb/s @ 800 MHz.
• The array topology of AsAP is well-suited for applications that are composed of a series of independent tasks.
  • Each of these tasks can be assigned to one or more processors.
Asynchronous Array of simple Processors [8]
What's in an AsAP tile?
• 16-bit fixed point datapath single issue CPU
  • Instructions for AsAP processors are 32 bits wide.
• ALU, MAC
• Small instruction/data memories
  • 64-entry instruction memory and a 128-word data memory.
• Hardware address generation
  • Each processor has 4 address generators that calculate addresses for data memory.
• Local programmable clock oscillator
• 2 input and 1 output dual-clock FIFOs, 16 bits wide and 32 words deep
• ~1.1 mm² / processor in 0.18 µm CMOS
• 800 MHz targeted operation
AsAP Single Processor Tile [8]
AsAP Inter-processor Communication [8]
• Each processor output is hard-wired to its four nearest neighbors' input multiplexers.
• At power-up the input multiplexers are configured.
• As input FIFOs fill up, the sourcing neighbor can be halted by asserting the corresponding hold signal.
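The hold-signal mechanism can be illustrated with a small software model of a bounded input FIFO. This is a single-clock simulation with hypothetical class and function names; the real AsAP FIFOs are dual-clock hardware, so only the back-pressure behaviour is represented.

```python
from collections import deque


class InputFifo:
    """Bounded FIFO that asserts `hold` while it is full."""

    def __init__(self, depth=32):
        self.depth = depth
        self.words = deque()

    @property
    def hold(self):
        """True while full, telling the sourcing neighbour to stall."""
        return len(self.words) >= self.depth

    def push(self, word):
        if self.hold:
            return False  # producer must retry on a later cycle
        self.words.append(word)
        return True

    def pop(self):
        return self.words.popleft() if self.words else None


def run(producer_words, consume_every, depth=32):
    """Producer offers one word per cycle; consumer pops every `consume_every` cycles."""
    fifo, pending = InputFifo(depth), deque(producer_words)
    received, stalls, cycle = [], 0, 0
    while pending or fifo.words:
        if pending:
            if fifo.push(pending[0]):
                pending.popleft()
            else:
                stalls += 1  # hold was asserted, the producer is halted
        if cycle % consume_every == consume_every - 1:
            word = fifo.pop()
            if word is not None:
                received.append(word)
        cycle += 1
    return received, stalls


received, stalls = run(list(range(100)), consume_every=2, depth=4)
```

With a consumer that is half as fast as the producer, the FIFO fills and the producer is stalled repeatedly, yet no word is lost or reordered.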
AsAP Contributions
• Provides parallel execution of independent tasks by providing many parallel, independent processing engines
• AsAP specifies a homogeneous 2-D array of very simple processors
  • Single-issue pipelined CPUs
• Independent tasks are mapped across processors and executed in parallel
• Allows efficient exploitation of application-level parallelism.