A Pipelined Execution of Tiled Nested Loops on SMPs with Computation and
Communication Overlapping
Maria Athanasaki, Aristidis Sotiropoulos, Georgios Tsoukalas, Nectarios Koziris
National Technical University of Athens, Dept. of Electrical and Computer Engineering
Computing Systems Laboratory
Overview
• Advanced architectures
• Tiling for parallelization
• Non-overlapping vs. overlapping scheme
• Vertical vs. hyperplane grouping
• Application on clusters of SMP nodes
TCP/IP over FastEthernet
• Use of the popular socket interface
• Create a socket descriptor sd, then read/write from/to descriptor sd
[Figure: nodes connected through a hub; write corresponds to send, read corresponds to receive]
Example: Send
write(sd, buffer, length);
1) System call (CPU switches from user to kernel mode)
2) CPU copies data from user to kernel space
3) CPU adds protocol headers (TCP, IP, ETH)
4) CPU programs DMA engine
5) DMA copies data to NIC (Fast Ethernet)
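As a minimal sketch of the send path from the application's point of view, the C fragment below writes a buffer to a socket descriptor and reads it back on the other end. It assumes a connected local socket pair created with socketpair() in place of a real TCP connection over Fast Ethernet, and the helper name send_and_receive is hypothetical. On a TCP socket, steps 1 to 5 would all be triggered by the single write() call; the local socket pair used here only exercises the system call and the user-to-kernel copy.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: sends `length` bytes through a connected socket
 * pair and receives them on the other end.
 * Returns the number of bytes received, or -1 on error. */
ssize_t send_and_receive(const char *buffer, size_t length,
                         char *out, size_t out_size)
{
    int sd[2];

    /* Assumption: a local socket pair stands in for a TCP connection. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sd) != 0)
        return -1;

    /* Sender side: one system call; the kernel copies the user buffer. */
    ssize_t sent = write(sd[0], buffer, length);
    if (sent < 0) {
        close(sd[0]);
        close(sd[1]);
        return -1;
    }

    /* Receiver side: a stream socket may deliver the data in pieces,
     * so keep reading until everything that was sent has arrived. */
    size_t got = 0;
    while (got < (size_t)sent && got < out_size) {
        ssize_t n = read(sd[1], out + got, out_size - got);
        if (n <= 0)
            break;
        got += (size_t)n;
    }

    close(sd[0]);
    close(sd[1]);
    return (ssize_t)got;
}
```

The point of the sketch is that the application sees only one call, while the CPU work listed in the steps happens inside the kernel.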
SCI
What about the Scalable Coherent Interface?
• Point-to-point
• DSM (distributed shared memory) approach
SCI DSM scheme
[Figure: an exported memory segment on one node and the imported memory segment on another, connected over SCI; labels: write 100, 100, read, 50]
SCI Zero Copy Scheme
Contiguous data in the process VM area are not contiguous in physical memory
[Figure: process VM area and Physical Memory]
SCI Zero Copy Scheme
SCICreateSegment, SCIMapLocalSegment: mapping between virtual and contiguous physical memory
[Figure: the process VM area is mapped to pinned-down physical memory]
Data transfers
• Programmed I/O mode: the CPU handles the data transfer, so CPU cycles are “lost”
• DMA mode: the CPU programs the NIC’s buffers, is not blocked during the transfer, and can perform useful tasks
SCI DMA approach
• No copying by CPU
• Data already contiguous in PM
• DMA engine copies data to network
• No packetization (done in hardware)
• But, init only by kernel
We need VIA
Nested For-Loops
for (i1=l1; i1<=u1; i1++)
  for (i2=l2; i2<=u2; i2++)
    …
      for (in=ln; in<=un; in++)
      {
        Loop Body
      }
Dependence Vectors
for (i1=0; i1<=7; i1++)
  for (i2=0; i2<=7; i2++)
    A[i1][i2] = A[i1-1][i2] + A[i1][i2-1];
[Figure: iteration space with axes i1 and i2, showing the dependence vectors of this loop]
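The loop body reads A[i1-1][i2] and A[i1][i2-1], so every iteration depends on its neighbours at distance (1,0) and (0,1). Below is a minimal C sketch of this. It assumes a one-element border at index 0, initialized to 1, so that reads at the edge of the 8x8 space stay in bounds. The shifted index range 1..8, the constant N, the dep table and the function name are choices made for this example and are not part of the slide.

```c
#define N 8  /* the iteration space is N x N, as on the slide */

/* Dependence vectors of the loop body: iteration (i1,i2) needs the
 * results of (i1-1,i2) and (i1,i2-1). */
static const int dep[2][2] = { {1, 0}, {0, 1} };

/* Fills A (which has a border at index 0) and returns A[N][N]. */
long run_loop(long A[N + 1][N + 1])
{
    /* Assumed boundary values, not given on the slide. */
    for (int k = 0; k <= N; k++) {
        A[0][k] = 1;
        A[k][0] = 1;
    }

    for (int i1 = 1; i1 <= N; i1++)
        for (int i2 = 1; i2 <= N; i2++)
            A[i1][i2] = A[i1 - dep[0][0]][i2 - dep[0][1]]
                      + A[i1 - dep[1][0]][i2 - dep[1][1]];

    return A[N][N];
}
```

Because both vectors have non-negative components, any execution order that never runs an iteration before its left and upper neighbours gives the same result.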
Tiling
[Figure: tiled iteration space with axes i1 and i2]
Tiling
[Figure: tiled iteration space with axes i1 and i2; tiles assigned to Processor 0 and Processor 1]
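A minimal sketch of how tiling reorders such a loop nest: the two outer loops walk over tiles and the two inner loops walk over the points of one tile. It uses the two-dimensional recurrence from the Dependence Vectors slide. The rectangular tile sizes t1 and t2, the border of ones at index 0, and all function names are assumptions made for this example. Rectangular tiles are legal here because the dependence vectors (1,0) and (0,1) have no negative components.

```c
#define SIZE 8   /* the iteration space is SIZE x SIZE */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Sets the assumed boundary values at index 0. */
static void init_border(long A[SIZE + 1][SIZE + 1])
{
    for (int k = 0; k <= SIZE; k++) {
        A[0][k] = 1;
        A[k][0] = 1;
    }
}

/* Executes the loop tile by tile; t1 x t2 is the tile size (both >= 1).
 * Returns A[SIZE][SIZE]. */
long run_tiled(long A[SIZE + 1][SIZE + 1], int t1, int t2)
{
    init_border(A);

    for (int s1 = 1; s1 <= SIZE; s1 += t1)          /* tiles along i1 */
        for (int s2 = 1; s2 <= SIZE; s2 += t2)      /* tiles along i2 */
            /* all points of the tile whose first point is (s1, s2) */
            for (int i1 = s1; i1 <= min_int(s1 + t1 - 1, SIZE); i1++)
                for (int i2 = s2; i2 <= min_int(s2 + t2 - 1, SIZE); i2++)
                    A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1];

    return A[SIZE][SIZE];
}
```

The result does not depend on the tile size, which is what allows the tile shape to be chosen for communication and scheduling reasons.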
Non-Overlapping Scheme
[Figure: tiled iteration space with axes i1 and i2; tiles assigned to Processor 0, Processor 1 and Processor 2]
Non-Overlapping vs. Overlapping Scheme
[Figure: two diagrams, one per scheme, each showing processors P0, P1, P2 and P3]
Overlapping Scheme
[Figure: tiled iteration space with axes i1 and i2; tiles assigned to Processor 0, Processor 1 and Processor 2]
Generalization to SMPs
[Figure: processors P0, P1, P2 and P3]
Generalization to SMPs
[Figure: nodes SMP0, SMP1, SMP2 and SMP3, each with CPU0 and CPU1]
Generalization to SMPs
[Figure: CPU0 and CPU1 of one SMP node]
Vertical vs. Hyperplane grouping
[Figure: two diagrams, one per grouping, each showing nodes SMP0, SMP1, SMP2 and SMP3 with CPU0 and CPU1 in every node]
Example
[Figure: Tile Space and Group Space, with groups on SMP node0 and SMP node1]
Scheduling vector Π=(1,1)
Non-overlapping vs. Overlapping scheme
The overlapping scheme has:
• Almost half the duration per execution step
• Slightly more steps
[Figure: two diagrams, one per scheme, each showing processors P0, P1, P2 and P3]
Non-overlapping scheme: 9 computation + 8 communication steps
Overlapping scheme: 12 steps
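The step counts can be reproduced with a small model. The sketch below assumes a pipeline of P processors, each executing K tiles. Tile k of processor p runs at step k + p in the non-overlapping scheme and at step k + 2p in the overlapping scheme. P = 4 matches the figure. K = 6 and the two start-time formulas are assumptions, chosen because they yield the 9, 8 and 12 stated on the slide. The time model (a non-overlapping step pair costs computation plus communication, an overlapping step costs the longer of the two) and all function names are likewise hypothetical.

```c
/* Computation steps when tile k on processor p runs at step k + p. */
int nonoverlap_compute_steps(int procs, int tiles)
{
    return (tiles - 1) + (procs - 1) + 1;
}

/* One communication step separates consecutive computation steps. */
int nonoverlap_comm_steps(int procs, int tiles)
{
    return nonoverlap_compute_steps(procs, tiles) - 1;
}

/* Steps when tile k on processor p runs at step k + 2p. */
int overlap_steps(int procs, int tiles)
{
    return (tiles - 1) + 2 * (procs - 1) + 1;
}

/* Assumed total time: computation and communication phases alternate. */
double nonoverlap_time(int procs, int tiles, double t_comp, double t_comm)
{
    return nonoverlap_compute_steps(procs, tiles) * t_comp
         + nonoverlap_comm_steps(procs, tiles) * t_comm;
}

/* Assumed total time: each step lasts as long as the slower phase. */
double overlap_time(int procs, int tiles, double t_comp, double t_comm)
{
    double step = t_comp > t_comm ? t_comp : t_comm;
    return overlap_steps(procs, tiles) * step;
}
```

With P = 4, K = 6 and computation and communication phases of one time unit each, this model gives 17 units for the non-overlapping scheme and 12 for the overlapping one.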
Vertical vs. Hyperplane Grouping
• Slower pipeline filling
• Faster execution because of the lack of intra-tile synchronization
• Preferable for tile spaces where the mapping direction is comparatively large
[Figure: two diagrams, one per grouping, each showing nodes SMP0, SMP1, SMP2 and SMP3 with CPU0 and CPU1 in every node]
Experimental Platform
• Linux SMP (Symmetric Multi-Processor) cluster
• 8 nodes: 2 Pentium III 800 MHz, 128 MB RAM
• SCI ring (Dolphin’s PCI-SCI D330 cards)
Initial Code
for (i=1; i<=X; i++)
  for (j=1; j<=Y; j++)
    for (k=1; k<=Z; k++) {
      A[i][j][k] = func(A[i-1][j][k],
                        A[i][j-1][k], A[i][j][k-1]);
    }
Experimental results
[Figure: two plots of Time (sec) versus Tile Height, one for iteration space 16x16x1024K and one for iteration space 48x48x512K. Curves: non-overlapping scheme with vertical grouping, overlapping scheme with vertical grouping, non-overlapping scheme with hyperplane grouping, overlapping scheme with hyperplane grouping.]
Grouping matrix
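[Equation: n x n grouping matrix $H^G$, written in terms of $m_1, \dots, m_{i-1}, m_{i+1}, \dots, m_n$; the matrix layout is not recoverable from the extraction]
$m_1 \cdots m_{i-1}\, m_{i+1} \cdots m_n$ = number of CPUs within an SMP node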
Example
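[Figure: Tile Space and Group Space, with groups on SMP node0 and SMP node1]
[Equation: the grouping matrices $H^G$ and $P^G$ for this example; entries are not recoverable from the extraction]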
Scheduling vector Π=(1,1)