Upload
silvester-stewart
View
213
Download
0
Tags:
Embed Size (px)
Citation preview
CSE 661 PAPER CSE 661 PAPER PRESENTATIONPRESENTATION
ON-CHIP INTERCONNECTION ARCHITECTURE OF THE TILE PROCESSOR
By D. Wentzlaff et al
Presented BySALAMI, Hamza Onoruoiza
g201002240
OUTLINE OF OUTLINE OF PRESENTATIONPRESENTATION IntroductionTile64 Architecture Interconnect HardwareNetwork UsesNetwork to Tile InterfaceReceive-side Hardware DemultiplexingProtectionShared Memory Communication and
Ordering Interconnect SoftwareCommunication InterfaceApplicationsConclusion
2
INTRODUCTIONINTRODUCTIONTile Processor’s five on-chip 2D
mesh networks differ from ◦traditional bus based scheme;
requires global broadcast hence not scalable beyond 8 – 16 cores
◦1D ring not scalable; bisection BW is constant
Can support few or many processors with minimal changes to network structure
3
TILE64 ARCHITECTURETILE64 ARCHITECTURE2D grid of 64 identical compute
elements (tiles) arranged in 8 x 8 mesh1GHz clock, 3-way VLIW, 192 bil. 32-bit
instructions/sec4.8MB distributed cache, per tile TLBSupports DMA and virtual memoryTiles may run independent OSs. May be
combined to run multiprocessor OS such as SMP Linux
Shared memory. Cores directly access other cores’ cache
through on-chip interconnects4
TILE64 ARCHITECTURE (3)TILE64 ARCHITECTURE (3)
6
Courtesy: http://www.tilera.com/products/processors/TILE64
INTERCONNECT INTERCONNECT HARDWAREHARDWARE5 low latency mesh networksEach network connects tile in five
directions; north, south, east, west and processor
Each link made of two 32-bit unidirectional links
7
NETWORK USESNETWORK USES4 dynamic networks
◦ packet header contains destination’s (x, y) coordinate and packet length (≤128 words)
◦ Flow controlled, reliable delivery◦ UDN: low latency comm. between userland
processes without OS intervention◦ IDN: direct communication with I/O devices◦ MDN: communication with off-chip memory◦ TDN: direct tile-to-tile transfers; requests
through TDN, response through MDN1 static network
◦ Streams of data instead of packets◦ First setup route, then send streams (circuit
switched)◦ Also a userland network
9
LOGICAL VS. PHYSICAL LOGICAL VS. PHYSICAL NETWORKSNETWORKS
5 physically independent networks
Lots of free nearest neighbor on-chip wiring
Buffer space takes about 60% tile area vs 1.1% for each network
More reliable on-chip network => less buffering to manage link failure
10
NETWORK TO TILE NETWORK TO TILE INTERFACEINTERFACETiles have register access to on-
chip networks. Instructions can read/write from/to UDN, IDN or STN.
MDN and UDN used indirectly on cache miss
Register-mapped network access is provided
11
RECEIVE-SIDE HARDWARE RECEIVE-SIDE HARDWARE DEMULTIPLEXINGDEMULTIPLEXINGTag word = (sending node, stream
num., message type)Receiving hardware demultiplexes
message into appropriate queue using tag.
On a tag miss, send data to ‘catch all’ queue, then raise interrupt
UDN has 4 deMUX queues, one ‘catch all’
IDN has 2 deMUX queues, one ‘catch all’
128-word reverse side buffering per tile12
PROTECTIONPROTECTIONTile Architecture implements Multicore
Hardwall (MH)MH protects UDN, IDN and STN linksStandard memory protection mechanisms
used for MDN and TDN MH blocks attempts to send traffic over
hardwalled link, then signals an interrupt to system software
Protection is implemented on outbound links
14
SHARED MEMORY SHARED MEMORY COMMUNICATION AND COMMUNICATION AND ORDERINGORDERINGOn-chip distributed shared cacheData could be retrieved from
1. Local cache2. Home tile (request sent through TDN).
Data available only in home tile. Coherency maintained here.
3. Main MemoryNo guaranteed ordering between
networks and shared memoryMemory fence instructions used to
enforce ordering
15
INTERCONNECT INTERCONNECT SOFTWARESOFTWAREC based iLib provides
communication primitives implemented via UDN◦Lightweight socket-like streaming
channels for streaming algorithms◦MPI-like message passing interface
for adhoc messaging
16
COMMUNICATION COMMUNICATION INTERFACESINTERFACESiLib Socket
◦Long-lived FIFO point-to-point connection between two processes
◦Good for producer-consumer relationship
◦Multiple senders-one receiver possible; good for forwarding results to single node for aggregation
◦Raw Channels: low overhead; use as much space as available in buffer
◦Buffered Channels: higher overhead, but virtualization of memory is possible
17
COMMUNICATION COMMUNICATION INTERFACES(2)INTERFACES(2)Message Passing API
◦Similar to MPI◦Messages can be sent from a node to any
other at all times◦No need to establish connections
Implementation◦Sender: Send packet with message key
and size◦Receiver’s catch-all queue interrupts
processor◦ If expecting a message with this key, send
packet to sender to begin transfer◦Else, save notification. ◦On ilib_msg_receive() with same key, send
packet to interrupt sender to begin transfer
18
COMMUNICATION COMMUNICATION INTERFACES(4)INTERFACES(4)UDN’s maximum BW is 4
bytes/cycleRaw Channels’ max BW 3.93
bytes/cycle; overhead due to header word and tag word
Buffered Channel: Overhead of memory read/write
Message Passing: Overhead of interrupting receiving tile
20
APPLICATIONSAPPLICATIONSCorner Turn
◦Reorganize distributed array from 1 dimension to another
◦Each core send data to every other core
◦Important Factors Network for Distribution (TDN using
shared memory or UDN using raw channels)
Network for tiles’ synchronization (STN or UDN)
22
APPLICATIONS (2)APPLICATIONS (2)
◦Raw Channel, STN synch: best performance. Minimum overhead raw channels. STN ensures synch messages don’t interfere with data
◦Raw Channel, UDN synch: UDN used for data and synch messages. Extra overhead data to distinguish between both messages.
◦Shared Memory: Simpler to program . Each user data word incurs four extra words to manage network and avoid deadlock
23
APPLICATIONS (3)APPLICATIONS (3)Dot Product
◦ Pairwise element multiplication, followed by addition of all products.
◦ 65,536-element dot product◦ Shared memory not scalable, higher
communication overhead◦ From 2 to 4 tiles, speedup is sublinear
because dataset completely fits into tiles’ L2 cache.
24