Core Research for Evolutional Science and Technology (CREST)
Peta-scale General Solver for Semidefinite Programming Problems with over Two Million Constraints
High-Performance General Solver for Extremely Large-Scale Semidefinite Programming Problems (SDPs)
1. Mathematical programming: SDP is one of the most central problems in mathematical optimization.
2. Many applications: combinatorial optimization, control theory, structural optimization, quantum chemistry, sensor network location, data mining, etc.
• SDPARA is a parallel implementation of the interior-point method for Semidefinite Programming (SDP)
• Parallel computation for the two major bottleneck parts:
  • ELEMENTS ⇒ computation of the Schur complement matrix (SCM)
  • CHOLESKY ⇒ Cholesky factorization of the Schur complement matrix (SCM)
• SDPARA attains high scalability on 16,320 CPU cores of the TSUBAME 2.0 supercomputer, using processor-affinity and memory-interleaving techniques, when the generation of the SCM is the bottleneck.
• With 4,080 NVIDIA M2050 GPUs on the TSUBAME 2.0 supercomputer, our implementation achieved 1.018 PFlops in double precision for a large-scale problem with over two million constraints.
Semidefinite Programming (SDP)
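SDPARA works with SDPs in the standard equality form of the SDPA family; the following is a sketch of the primal-dual pair under that assumption, where m is the number of constraints of the primal problem and n is the matrix size:

\begin{align*}
\text{(Primal)}\quad & \min_{x\in\mathbb{R}^m}\ \sum_{k=1}^{m} c_k x_k
  \quad \text{s.t.}\quad X=\sum_{k=1}^{m} F_k x_k - F_0,\ \ X\succeq O,\\
\text{(Dual)}\quad & \max_{Y\in\mathbb{S}^n}\ F_0\bullet Y
  \quad \text{s.t.}\quad F_k\bullet Y = c_k\ \ (k=1,\dots,m),\ \ Y\succeq O,
\end{align*}

where $F_0,\dots,F_m\in\mathbb{S}^n$ are given symmetric data matrices and $A\bullet B=\mathrm{trace}(AB)$.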
The major bottleneck parts of the primal-dual interior-point method (PDIPM) for various types of SDP problems:
1. CHOLESKY: sparse SCM
   e.g., the sensor network location problem and the polynomial optimization problem
   → parallel sparse Cholesky factorization of the sparse SCM
2. ELEMENTS: m < n (not m >> n), and fully dense SCM
   e.g., the quantum chemistry problem and the truss topology problem
   → efficient hybrid (MPI-OpenMP) parallel computation; automatic configuration of CPU affinity and memory interleaving
3. CHOLESKY: m >> n, and fully dense SCM
   e.g., the combinatorial optimization problem and the quadratic assignment problem (QAP)
   → massively parallel dense Cholesky factorization using GPUs; overlapping techniques for computation, PCI-Express communication, and MPI communication. Our implementation achieved 1.018 PFlops in double precision for large-scale Cholesky factorization using 2,720 CPUs and 4,080 GPUs.
Data Decomposition
• The dense matrix B (m x m) is distributed in a 2D block-cyclic fashion with block size nb (see the ownership sketch below).
For problems with m >> n, a high-performance CHOLESKY is implemented for GPU supercomputers. The key to petaflops performance is overlapping computation, PCI-Express communication, and MPI communication.
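As an illustration of the 2D block-cyclic distribution described above, the following minimal C sketch computes which process owns a given nb x nb block on a pr x pc process grid; the owner_rank helper and the row-major rank numbering are assumptions of this sketch, not SDPARA's actual interface.

#include <stdio.h>

/* Sketch: 2D block-cyclic ownership with block size nb on a pr x pc grid.
   Global block (bi, bj) is owned by grid position (bi mod pr, bj mod pc). */
static int owner_rank(long bi, long bj, int pr, int pc) {
    int prow = (int)(bi % pr);              /* process-grid row            */
    int pcol = (int)(bj % pc);              /* process-grid column         */
    return prow * pc + pcol;                /* assumed row-major rank order */
}

int main(void) {
    const long m  = 2330000;                /* matrix order, e.g. QAP10    */
    const long nb = 1024;                   /* block size                  */
    const int  pr = 2, pc = 3;              /* 6 (= 2 x 3) processes       */
    long bi = (m - 1) / nb, bj = 0;         /* block containing row m-1    */
    printf("block (%ld, %ld) is owned by rank %d\n",
           bi, bj, owner_rank(bi, bj, pr, pc));
    return 0;
}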
SDPARA can solve the largest SDP problem:
• a DNN relaxation problem, QAP10 from QAPLIB, with 2.33 million constraints
• using 1,360 nodes, 2,720 CPUs, and 4,080 M2050 GPUs
• 1.018 PFlops in CHOLESKY (2.33M x 2.33M Schur complement matrix)
• the fastest and largest result for a mathematical optimization problem!
Fig. Data decomposition for CHOLESKY: the m x m matrix B is distributed over 6 (= 2 x 3) processes with block size nb = 1024; each process holds a "partial matrix", and the trailing matrix is updated as B' = B' - L x L'.
Fig. Process-to-hardware mapping on a node: local rank 0 uses GPU0, RAM0, and processor cores 0, 2, 4, 6, 8, 10; local rank 1 uses GPU1, RAM1, and cores 1, 3, 5; local rank 2 uses GPU2, RAM1, and cores 7, 9, 11.
Parallel Computation for CHOLESKY
Fig. Node diagram: local rank 0 on processor cores 0, 2, 4, 6, 8, 10 and local rank 1 on cores 1, 3, 5, 7, 9, 11; GPU1, GPU2, RAM0, and RAM1 are attached to the node.
Step 2: processor mapping table
  Processor ID | Package ID | Core ID
  0            | 0          | 0
  1            | 1          | 1
  2            | 0          | 2
  ...
Step 3: thread-processor assignment table
  Thread ID | Scatter | Compact
  0         | 0       | 0
  1         | 1       | 2
  2         | 2       | 4
  ...
Automatic configuration of CPU affinity and memory allocation policy
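As a concrete illustration of Step 3, the sketch below derives the thread-to-processor assignment table for the scatter and compact policies on a node whose even processor IDs belong to CPU package 0 and odd IDs to package 1 (as in the mapping table above); the helper names and the node layout are assumptions of this sketch.

#include <stdio.h>

#define NPROC 24                     /* hardware threads on the node */

/* Step 2 table for this node layout: processor i sits in package i % 2. */
static int package_of(int proc) { return proc % 2; }

/* Scatter: round-robin over packages; with this numbering that is simply
   processor ID = thread ID (0 -> 0, 1 -> 1, 2 -> 2, ...). */
static int scatter_processor(int t) { return t; }

/* Compact: fill package 0 (even processor IDs) first, then package 1
   (0 -> 0, 1 -> 2, 2 -> 4, ...). */
static int compact_processor(int t) {
    int per_pkg = NPROC / 2;
    return (t < per_pkg) ? 2 * t : 2 * (t - per_pkg) + 1;
}

int main(void) {
    printf("Thread  Scatter (pkg)  Compact (pkg)\n");
    for (int t = 0; t < 6; t++)
        printf("%6d  %7d (%d)    %7d (%d)\n", t,
               scatter_processor(t), package_of(scatter_processor(t)),
               compact_processor(t), package_of(compact_processor(t)));
    return 0;
}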
Fig. Scatter-type affinity vs. compact-type affinity: placement of OpenMP threads on the 24 hardware threads (processor IDs 0-23) of CPU package 0 and CPU package 1.
Parallel Computation for ELEMENTS
Fig. Scalability and efficiency of ELEMENTS (strong scaling relative to 128 nodes; the ideal speedup at 1,360 nodes is 10.6x). Parallel efficiency with the (scatter, interleaving) setting: Electronic4 100.0%, 97.1%, 93.3%, 83.3%, 75.9%; Electronic5 100.0%, 93.0%, 89.8%, 70.6%, 59.2% (efficiency for large problems).
Automatic detection: Step 1 reads the Linux device files under /sys/devices/system/*; Step 2 builds the processor mapping table.
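A minimal sketch of this automatic detection, assuming the standard Linux sysfs topology files (physical_package_id and core_id) under /sys/devices/system/cpu/; the read_topology helper is illustrative only, not SDPARA's actual code.

#include <stdio.h>

/* Read one topology attribute of a logical CPU from sysfs (Step 1). */
static int read_topology(int cpu, const char *leaf) {
    char path[128];
    int value = -1;
    FILE *f;
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/%s", cpu, leaf);
    f = fopen(path, "r");
    if (f) { if (fscanf(f, "%d", &value) != 1) value = -1; fclose(f); }
    return value;
}

int main(void) {
    /* Build the Step 2 processor mapping table for up to 24 hardware threads. */
    for (int cpu = 0; cpu < 24; cpu++) {
        int pkg  = read_topology(cpu, "physical_package_id");
        int core = read_topology(cpu, "core_id");
        if (pkg < 0) break;                /* stop at the last online CPU */
        printf("Processor %2d  Package %d  Core %2d\n", cpu, pkg, core);
    }
    return 0;
}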
Fig. TSUBAME 2.0 HP node.
Fig. ELEMENTS time (seconds/iteration) vs. number of nodes (64-2,048) for six settings: default, scatter, compact, default + interleaving, scatter + interleaving, compact + interleaving.
packages ← ∅
for thread_id = 0, 1, ..., p − 1 do
    packages ← packages ∪ { get_package_id(thread_id, scatter) }
end
set_mempolicy(packages, MPOL_INTERLEAVE)   // set the memory allocation policy

Figure: Memory interleaving configuration for the memory allocation policy

#pragma omp parallel {
    thread_id ← omp_get_thread_num()
    pinned(thread_id, scatter)
    ...
    unpinned()
}

Figure: Scatter-type CPU Affinity Configuration
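One possible realization of the two figures above, given as a hedged sketch rather than SDPARA's actual code: scatter-type pinning via sched_setaffinity and interleaved allocation via the set_mempolicy system call (compile with OpenMP, link with -lnuma). The even/odd package layout and the scatter mapping are the same assumptions as in the Step 3 sketch.

#define _GNU_SOURCE
#include <sched.h>      /* cpu_set_t, CPU_SET, sched_setaffinity   */
#include <numaif.h>     /* set_mempolicy, MPOL_INTERLEAVE (-lnuma) */
#include <omp.h>
#include <stdio.h>

/* Assumed node layout: processor ID = thread ID under scatter,
   and processor i belongs to package i % 2. */
static int scatter_processor(int thread_id) { return thread_id; }
static int get_package_id(int processor)    { return processor % 2; }

int main(void) {
    int p = omp_get_max_threads();

    /* Memory interleaving over the packages that will run threads. */
    unsigned long nodemask = 0;
    for (int t = 0; t < p; t++)
        nodemask |= 1UL << get_package_id(scatter_processor(t));
    set_mempolicy(MPOL_INTERLEAVE, &nodemask, 8 * sizeof(nodemask));

    /* Pin each OpenMP thread to its scatter-type processor. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(scatter_processor(tid), &set);
        sched_setaffinity(0, sizeof(set), &set);   /* bind calling thread */
        /* ... ELEMENTS computation runs here ... */
    }
    printf("pinned %d threads (scatter) with interleaved allocation\n", p);
    return 0;
}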
Fig. Hybrid MPI + OpenMP parallelization of ELEMENTS: the rows of the SCM (index F_i) are distributed across MPI processes Process01-Process04, and the elements in each row (index F_j) are computed by OpenMP threads with dynamic scheduling.
Fig. CPU time for Electronic4. Fig. Scalability for Electronic4 and Electronic5.
• Memory interleaving is effective, and the parameter set "scatter + interleaving" is the fastest.
• ELEMENTS shows high efficiency (75.9% for Electronic4); we expect even higher efficiency for SDP problems larger than Electronic4, since the efficiency for Electronic4 exceeds that for Electronic5.
• Compact-type affinity binds the (n+1)-th OpenMP thread to a free thread context as close as possible to the context where the n-th OpenMP thread was bound.
• Scatter-type affinity distributes OpenMP threads as evenly as possible across the entire system.
Combinatorial optimization
Fig. Performance of CHOLESKY on TSUBAME 2.0 for quadratic assignment problems (QAP): speed of CHOLESKY (TFlops) vs. number of nodes (400, 700, and 1,360) for QAP6-QAP10. Measured speeds include 232.9, 329.1, 437.8, 306.2, 469.9, 718.8, 512.9, 825.6, and 964.4 TFlops, peaking at 1,018.5 TFlops (1.018 PFlops) at 1,360 nodes.
Basic Algorithm
For each block column k = 0, 1, 2, ...:
 1. Diagonal block factorization
 2. Panel factorization → compute L by GPU DTRSM
 3. Broadcast L, transpose L (L'), and broadcast L'
 4. Update B' = B' − L x L' with fast GPU DGEMM
(A single-node sketch of this structure follows below.)
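To make the four steps concrete, here is a minimal single-node, CPU-only sketch of the blocked right-looking Cholesky structure in plain C (no BLAS); in SDPARA the blocks are distributed 2D block-cyclically, Step 3 broadcasts the panel over MPI, and Steps 2 and 4 run as DTRSM/DGEMM on the GPUs.

#include <math.h>
#include <stdio.h>

#define N  8      /* matrix dimension (m in the poster)  */
#define NB 4      /* block size (nb)                     */

static double A[N][N];   /* SPD matrix; lower triangle is used */

int main(void) {
    /* Small SPD test matrix: A = ones + N * I. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = (i == j) ? N + 1.0 : 1.0;

    for (int k = 0; k < N; k += NB) {
        int kb = (k + NB <= N) ? NB : N - k;

        /* Step 1: factor the diagonal block A[k..k+kb-1, k..k+kb-1]. */
        for (int j = k; j < k + kb; j++) {
            for (int p = k; p < j; p++) A[j][j] -= A[j][p] * A[j][p];
            A[j][j] = sqrt(A[j][j]);
            for (int i = j + 1; i < k + kb; i++) {
                for (int p = k; p < j; p++) A[i][j] -= A[i][p] * A[j][p];
                A[i][j] /= A[j][j];
            }
        }

        /* Step 2: panel factorization (triangular solve; GPU DTRSM in SDPARA). */
        for (int i = k + kb; i < N; i++)
            for (int j = k; j < k + kb; j++) {
                for (int p = k; p < j; p++) A[i][j] -= A[i][p] * A[j][p];
                A[i][j] /= A[j][j];
            }

        /* Step 3 (distributed case only): broadcast panel L and L'. */

        /* Step 4: trailing update B' = B' - L x L' (GPU DGEMM in SDPARA). */
        for (int i = k + kb; i < N; i++)
            for (int j = k + kb; j <= i; j++)
                for (int p = k; p < k + kb; p++)
                    A[i][j] -= A[i][p] * A[j][p];
    }

    printf("L[N-1][N-1] = %f\n", A[N - 1][N - 1]);
    return 0;
}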
Design Strategy
• Our target is large problems with m > 2 million → GPU memory is too small, so the matrix data are usually placed in host memory.
• The block size nb should be sufficiently large to mitigate CPU-GPU PCIe communication; we still suffer from heavy PCIe communication → nb = 1024.
• Overlap GPU computation, PCIe communication, and MPI communication in Steps 2, 3, and 4.
Fig. Timeline of overlapping on the producer and the consumers of panel L: GPU computation (DTRSM, DGEMM), PCIe communication, and MPI communication proceed concurrently while successive panels L0-L3 and trailing blocks B0-B2 are processed.
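The timeline above can be realized with CUDA streams; the following is a rough, hedged sketch (not SDPARA's actual code) of overlapping the PCIe upload of the next panel with a DGEMM on the current one, using pinned host buffers and two streams. MPI progress would be driven concurrently by the CPU, and sizes here are tiny placeholders.

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>

int main(void) {
    const int n = 1024, nb = 256;          /* placeholder sizes            */
    double *hostPanel[2], *devPanel[2], *devB, *devC;
    cudaStream_t copyStream, compStream;
    cublasHandle_t handle;
    const double alpha = -1.0, beta = 1.0;

    for (int b = 0; b < 2; b++) {
        cudaMallocHost((void **)&hostPanel[b], (size_t)n * nb * sizeof(double));
        cudaMalloc((void **)&devPanel[b], (size_t)n * nb * sizeof(double));
    }
    cudaMalloc((void **)&devB, (size_t)n * nb * sizeof(double));
    cudaMalloc((void **)&devC, (size_t)n * n * sizeof(double));
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&compStream);
    cublasCreate(&handle);
    cublasSetStream(handle, compStream);

    for (int k = 0; k < 4; k++) {          /* a few block steps            */
        int cur = k % 2, nxt = (k + 1) % 2;
        /* PCIe: prefetch the next panel while the GPU works on the current. */
        cudaMemcpyAsync(devPanel[nxt], hostPanel[nxt],
                        (size_t)n * nb * sizeof(double),
                        cudaMemcpyHostToDevice, copyStream);
        /* GPU: trailing update C <- C - panel * B' (DGEMM) on compStream.   */
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n, n, nb,
                    &alpha, devPanel[cur], n, devB, n, &beta, devC, n);
        cudaDeviceSynchronize();           /* join both streams per step    */
    }
    printf("overlapped %d block steps\n", 4);

    /* Cleanup omitted for brevity in this sketch. */
    cublasDestroy(handle);
    return 0;
}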
CPU/GPU Affinity
• 1 MPI process per GPU → 3 processes per node
• CPU/memory affinity is determined in a GPU-centric way
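A hedged sketch of GPU-centric process placement under these assumptions: one MPI process per GPU, so the local rank on the node selects the CUDA device, and CPU/memory affinity would then follow that GPU. MPI_Comm_split_type is an MPI-3 feature; the original code may have obtained the local rank differently.

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm node_comm;          /* processes sharing this node (MPI-3) */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    cudaSetDevice(local_rank);   /* local rank 0,1,2 -> GPU0, GPU1, GPU2 */
    printf("local rank %d bound to GPU %d\n", local_rank, local_rank);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}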
1.018 PFlops (double precision) with 4,080 GPUs!
Fig. Primal and dual forms of the SDP: m denotes the number of constraints of the primal problem and n the matrix size.

Katsuki FUJISAWA 1,4, Hitoshi SATO 2,4, Toshio ENDO 2,4, Naoki MATSUZAWA 2, Yuichiro YASUI 1,4, Hayato WAKI 3,4
1 Chuo University  2 Tokyo Institute of Technology  3 Kyushu University  4 JST CREST
For inquiries, please email [email protected]
Projects: Advanced Computing & Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers; Software Technology that Deals with Deeper Memory Hierarchy in Post-petascale Era
Research Area: Technologies for Post-Peta Scale High Performance Computing