HA-PACS/TCA: Tightly Coupled Accelerators for Low-Latency Communication among GPUs
Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, Mitsuhisa Sato
Center for Computational Sciences, University of Tsukuba, 1-1-1 Tennodai, Tsukuba 305-8577, Japan
Concept
GPGPU is now widely used to accelerate scientific and engineering computing, improving performance significantly with less power consumption. However, the I/O bandwidth bottleneck causes serious performance degradation in GPGPU computing. To solve this problem, TCA (Tightly Coupled Accelerators) enables direct communication among multiple GPUs across computation nodes; the sketch below shows the conventional staged path that TCA eliminates.
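To make the bottleneck concrete, the following minimal MPI + CUDA sketch (our illustration, not code from the poster) shows the conventional staged path: every inter-node GPU-to-GPU transfer crosses PCIe twice and passes through both hosts' memories.

    /* Minimal sketch (not from the poster) of the conventional staged
     * GPU-to-GPU path that TCA avoids: GPU -> host -> network -> host
     * -> GPU. Each extra hop crosses the PCIe I/O bottleneck. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const size_t n = 1 << 20;                 /* 1 MiB payload */
        void *dev, *host = malloc(n);
        cudaMalloc(&dev, n);

        if (rank == 0) {
            cudaMemcpy(host, dev, n, cudaMemcpyDeviceToHost);   /* GPU -> CPU */
            MPI_Send(host, (int)n, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(host, (int)n, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            cudaMemcpy(dev, host, n, cudaMemcpyHostToDevice);   /* CPU -> GPU */
        }

        cudaFree(dev); free(host);
        MPI_Finalize();
        return 0;
    }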
The HA-PACS project is supported by the MEXT special fund program named "Research and Education on Interdisciplinary Computational Science Based on Exascale Computing Technology Development," and by the JST/CREST program named "Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale Era," in the research area of "Development of System Software Technologies for Post-Peta Scale High Performance Computing."
Contact E-mail: [email protected] URL: http://www.ccs.tsukuba.ac.jp/CCS/eng/research-activities/projects/ha-pacs
Figure 5. TCA Communication (PEACH2) Board (front and top views)
Figure 2. Block Diagram of TCA Nodes of HA-PACS/TCA
CPU & GPU side
(Endpoint)
To PEACH2 (Root Complex /
Endpoint)
To PE
AC
H2
(Endpoint)
To PE
AC
H2
(Root C
omplex)
NIOS (CPU)
Memory
DMAC
Routing function
N port
E port
W port
S port
PCI Express Direct Communication Link
PEARL (PCI Express Adaptive and Reliable Link): a concept for direct communication using PCI Express technology.
PEACH2 (PCI Express Adaptive Communication Hub ver. 2): the communication hardware implementing PEARL.
HA-PACS System
HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences) is the 8th generation of the PACS/PAX series of supercomputers at the Center for Computational Sciences, University of Tsukuba. The HA-PACS system consists of the following two parts.
• The HA-PACS base cluster has been in operation for development and production runs of advanced scientific computations since Feb. 2012, with 802 Tflops of peak performance. Each node is equipped with two Intel Xeon E5 CPUs, four NVIDIA M2090 GPUs, and dual-port InfiniBand QDR.
• HA-PACS/TCA is an extension of the base cluster; each node has not only an InfiniBand HCA but also a TCA communication board (PEACH2 board) as a proprietary network for GPUs.
Figure 1. Photograph of HA-PACS base cluster system (26 racks)
Figure 4. Address Mapping and Routing Mechanism on TCA
Overview of PEACH2 Chip
• PCI Express Gen2 x8, 4 ports: 1 port for the host connection and 3 ports for peer-to-peer connection over PCIe external cables
• Implemented on an FPGA (Altera Stratix IV GX)
• Clock frequency: 250 MHz
• 4-channel DMA controller with a chaining DMA mechanism and block-stride access support
• A unique PCIe address is assigned to each node and device within a TCA sub-cluster, so routing can be decided simply by comparing the destination PCIe address (Fig. 4; see the routing sketch after this list)
• GPU-to-GPU communication across nodes under TCA is realized by GPUDirect support for RDMA with Kepler GPUs and CUDA
• A TCA sub-cluster of 8 to 16 nodes is built up as a basic unit, and TCA sub-clusters are connected with InfiniBand
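As a rough illustration of the address-based routing of Fig. 4, the following C sketch assumes a hypothetical PCIe address layout in which some upper bits encode the destination node ID, and a hypothetical ring-forwarding policy; the poster does not specify PEACH2's actual bit layout or forwarding rule.

    /* Hypothetical PEACH2-style routing: the node ID is assumed to sit
     * in the upper bits of the PCIe address (bit positions are made up
     * for illustration). Routing needs only an address comparison. */
    #include <stdint.h>

    #define NODE_ID_SHIFT 36            /* hypothetical node-ID position */
    #define NODE_ID_MASK  0xFULL        /* up to 16 nodes per sub-cluster */

    enum port { PORT_N, PORT_E, PORT_W, PORT_S };   /* N = host side */

    static inline unsigned node_of(uint64_t pcie_addr) {
        return (unsigned)((pcie_addr >> NODE_ID_SHIFT) & NODE_ID_MASK);
    }

    /* Decide the output port by comparing only the destination PCIe
     * address with this chip's own node ID, as in Fig. 4. */
    enum port route(uint64_t dst_addr, unsigned my_node) {
        unsigned dst = node_of(dst_addr);
        if (dst == my_node) return PORT_N;        /* deliver to local host/GPU */
        return (dst > my_node) ? PORT_E : PORT_W; /* assumed ring policy */
    }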
Hybrid Communication between TCA and InfiniBand
(Diagram: each TCA node has two CPUs (CPU0/CPU1) joined by QPI, four GPUs on PCIe Gen3 x16 links, an InfiniBand HCA on Gen3 x8, and a PEACH2 board on Gen2 x8. PEACH2 chips of neighboring nodes are linked directly by PCIe Gen2 x8 cables, while the nodes also connect to the common InfiniBand network. See Figure 2.)
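The poster gives no API for the hybrid scheme, but the intended division of labor can be sketched as below. Here tca_put() is a hypothetical stand-in for a PEACH2-side transfer, SUBCLUSTER_SIZE reflects the 8-to-16-node sub-cluster, and one rank per node with consecutive numbering inside each sub-cluster is assumed for illustration.

    /* Hypothetical path selection for hybrid TCA/InfiniBand communication:
     * peers inside the same TCA sub-cluster use the PEACH2 network, all
     * others go through MPI over InfiniBand. */
    #include <mpi.h>
    #include <stddef.h>

    #define SUBCLUSTER_SIZE 16   /* a TCA sub-cluster has 8 to 16 nodes */

    /* Placeholder, not a real API from the poster: stands in for a
     * PIO/DMA write over the PEACH2 network. */
    static void tca_put(int dst_rank, const void *buf, size_t n) {
        (void)dst_rank; (void)buf; (void)n;   /* stub for illustration */
    }

    void hybrid_send(int my_rank, int dst_rank, const void *buf, size_t n) {
        if (my_rank / SUBCLUSTER_SIZE == dst_rank / SUBCLUSTER_SIZE) {
            tca_put(dst_rank, buf, n);        /* low-latency PEACH2 path */
        } else {
            MPI_Send((void *)buf, (int)n, MPI_BYTE, dst_rank, 0,
                     MPI_COMM_WORLD);         /* InfiniBand path */
        }
    }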
Preliminary Evaluation
(Chart: DMA bandwidth in Mbytes/sec versus data size in bytes, for CPU-to-CPU, CPU-to-GPU, GPU-to-CPU, and GPU-to-GPU transfers.)
Figure 7. DMA bandwidth between components in a local node by PEACH2 (128 iterations, measured within the driver)
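Figure 7 exercises the 4-channel DMA controller described above. Its chaining and block-stride features can be pictured with the following sketch; the descriptor fields are hypothetical, since the poster does not specify PEACH2's descriptor format.

    /* Hypothetical chained-DMA descriptor list for a block-stride
     * transfer. The idea of chaining: hardware follows the `next`
     * pointers and moves one block per descriptor without CPU
     * intervention between blocks. */
    #include <stdint.h>
    #include <stddef.h>

    struct dma_desc {
        uint64_t src;            /* source PCIe address          */
        uint64_t dst;            /* destination PCIe address     */
        uint32_t len;            /* bytes in this block          */
        struct dma_desc *next;   /* next descriptor, NULL = stop */
    };

    /* Build one descriptor per block of a strided region. */
    void build_stride_chain(struct dma_desc *d, uint64_t src, uint64_t dst,
                            uint32_t block, uint64_t stride, int nblocks) {
        for (int i = 0; i < nblocks; i++) {
            d[i].src  = src + (uint64_t)i * stride;
            d[i].dst  = dst + (uint64_t)i * stride;
            d[i].len  = block;
            d[i].next = (i + 1 < nblocks) ? &d[i + 1] : NULL;
        }
    }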
Table 1. Evaluation Environment
CPU: Intel Xeon E5-2670, 2.6 GHz, 8 cores x 2 sockets
Memory: DDR3-1600, 4 channels, 128 Gbytes
Motherboard: SuperMicro X9DRG-QF (Intel S2600IP)
PEACH2 (FPGA): Altera Stratix IV 530GX (EP4SGX530NF45C2)
OS: CentOS 6.3 (kernel 2.6.32-279.19.1.el6.x86_64)
GPU: NVIDIA K20c (driver NVIDIA-Linux-x86_64-310.32, CUDA 5.0)
MPI: MVAPICH2 1.8.1 (built with the "enable-cuda" option)
Summary of Evaluation
• PEACH2 achieves 3.3 Gbytes/sec, about 90% of the theoretical peak of PCIe Gen2 x8.
• GPU read performance is limited to about 800 Mbytes/sec, while GPU write performance is almost the same as that of CPU read/write.
• Pingpong performance almost reaches its maximum at 128 Kbytes.
• PEACH2 is 3.5x faster than CUDA p2p copy with UVA within a node, and 6.8x faster than MVAPICH2 for short message sizes.
(Chart: pingpong bandwidth in Mbytes/sec versus data size in bytes, for CPU-to-CPU, CPU-to-GPU, and GPU-to-GPU transfers.)
Figure 8. Bandwidth by Pingpong between two nodes (user level)
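The pingpong measurement of Figure 8 follows the usual pattern; a generic MPI sketch (not the poster's actual benchmark code) looks like this, with bandwidth computed as message size over half the round-trip time.

    /* Generic two-rank MPI pingpong sketch. Rank 0 sends n bytes and
     * rank 1 echoes them back; bandwidth = n / (round_trip / 2),
     * averaged over `iters` iterations. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 128 * 1024;    /* 128 Kbytes: near-peak per Fig. 8 */
        const int iters = 1000;
        char *buf = malloc(n);

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double half_rtt = (MPI_Wtime() - t0) / iters / 2.0;
        if (rank == 0)
            printf("%d bytes: %.1f Mbytes/sec\n", n, n / half_rtt / 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }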
(Chart: GPU-to-GPU latency in microseconds versus data size in bytes, for CUDA (no-UVA), CUDA (UVA), MVAPICH2 (2 node), MVAPICH2 (IPC=1), MVAPICH2 (IPC=0), and PEACH2.)
Figure 9. GPU-to-GPU latency and bandwidth between two nodes in comparison with CUDA p2p copy and MVAPICH2
(Chart: the corresponding GPU-to-GPU bandwidth panel of Figure 9, in Mbytes/sec versus data size in bytes, for the same six cases.)

Future Plan
HA-PACS/TCA will be constructed in the fall of 2013, with a PEACH2 board in each node. HA-PACS/TCA will be managed together with the HA-PACS base cluster, and as a whole HA-PACS will be a peta-scale class HPC cluster.