Peta-scale General Solver for Semidefinite Programming Problems with over Two Million Constraints
1. Mathematical Programming: one of the most central problems in mathematical optimization.
2. Many Applications: combinatorial optimization, control theory, structural optimization, quantum chemistry, sensor network location, data mining, etc.
Data Decomposition
• The dense matrix B (m × m) is distributed in a 2D block-cyclic distribution with block size nb
For problems with m >> n, a high-performance CHOLESKY is implemented for GPU supercomputers. The key to petaflops performance is overlapping computation, PCI-Express communication, and MPI communication.
SDPARA can solve the largest SDP problems:
• DNN relaxation problem for QAP10 from QAPLIB, with 2.33 million constraints
• Using 1,360 nodes, 2,720 CPUs, and 4,080 K20X GPUs
• 1.713 PFLOPS in CHOLESKY (2.33M × 2.33M)
• The fastest and largest result among mathematical optimization problems!!
nb = 1024 (for TSUBAME 2.0), 2048 (for TSUBAME 2.5)
[Figure: Matrix distribution on 6 (= 2×3) processes. Each process holds a "partial matrix" of the m × m matrix B (local ranks 0, 1, and 2 are shown); the panel L and its transpose L' enter the update B' = B' − L × L'.]
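As a minimal sketch of this 2D block-cyclic mapping (an illustration in the ScaLAPACK-style convention, not SDPARA's actual code; the 2×3 grid matches the figure, and the queried indices are made-up examples):

```cpp
#include <cstdio>

// Owner of block (bi, bj) under a 2D block-cyclic distribution
// on a Pr x Pc process grid (ScaLAPACK-style row-major grid).
struct Grid { int Pr, Pc; };

int owner(const Grid& g, long bi, long bj) {
    int pr = (int)(bi % g.Pr);   // process row that holds block row bi
    int pc = (int)(bj % g.Pc);   // process column that holds block column bj
    return pr * g.Pc + pc;       // row-major rank within the grid
}

int main() {
    const Grid g{2, 3};      // 6 (= 2x3) processes, as in the figure
    const long nb = 1024;    // block size used on TSUBAME 2.0
    long i = 1500000, j = 2000000;   // an element of the 2.33M x 2.33M matrix
    std::printf("element (%ld,%ld) -> block (%ld,%ld) -> process %d\n",
                i, j, i / nb, j / nb, owner(g, i / nb, j / nb));
    return 0;
}
```

Each process thus stores only the blocks mapped to it, giving every rank a balanced "partial matrix" of B.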
Parallel Computation for CHOLESKY
Parallel Computation for ELEMENTS
[Figure: Performance of ELEMENTS for quantum chemistry problems on TSUBAME 2.0. Scalability and efficiency of ELEMENTS versus number of nodes; strong scaling efficiency is measured relative to 128 nodes, with the "scatter, interleaving" setting, against the ideal line.
Electronic4 (scatter, interleaving): 100.0% (128 nodes), 97.1% (256), 93.3% (512), 83.3% (1024), 75.9% (1360)
Electronic5 (scatter, interleaving): 100.0% (128 nodes), 93.0% (256), 89.8% (512), 70.6% (1024), 59.2% (1360)
Efficiency remains high for the large problem.]
[Figure: Performance of CHOLESKY for QAP on TSUBAME 2.0 / 2.5. Speed of CHOLESKY (TFlops) versus number of nodes, for the combinatorial optimization problems QAP6-QAP10, comparing the original implementation on TSUBAME 2.0 (org/2.0), the new implementation on TSUBAME 2.0 (new/2.0), and the new implementation on TSUBAME 2.5 (new/2.5). Measured speeds range from 233 TFlops up to peaks of 1019 TFlops (new, TSUBAME 2.0) and 1713 TFlops (new, TSUBAME 2.5).]
Basic Algorithm
Design Strategy
Our target is large problems with m > two million. GPU memory is too small for such problems, so the matrix data are usually placed in host memory (for m = 2.33 million, the SCM alone occupies about 43 TB in double precision, roughly 32 GB per node on 1,360 nodes).
• The block size nb should be sufficiently large to mitigate CPU-GPU PCIe communication; we still suffer from heavy PCIe communication
• nb = 1024 for TSUBAME 2.0, nb = 2048 for TSUBAME 2.5
• Overlap GPU computation, PCIe communication, and MPI communication in Steps 2, 3, and 4 of the algorithm
[Figure: Overlap timeline for CPU and GPU. The producer of panel L factorizes panels Lt0, Lt1, Lt2 in sequence; the consumers of panel L overlap MPI communication (broadcasts of L and L'), PCIe communication (transfers of sub-panels L0-L3), and the GPU DTRSM/DGEMM that update the partial matrices B0, B1, B2, keeping both timelines busy.]
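The following is a minimal double-buffering sketch of this overlap (host-resident panels, one GPU, cuBLAS; the panel count and sizes are made up, and the MPI broadcast is reduced to a comment, so this illustrates the pipeline structure rather than SDPARA's implementation):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstring>

int main() {
    // Made-up sizes: 4 panels of width NB, trailing matrix M x M.
    const int NPANELS = 4, NB = 1024, M = 4096;
    const size_t PANEL = sizeof(double) * M * NB;

    cublasHandle_t h;
    cublasCreate(&h);
    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);
    cublasSetStream(h, compute);   // DGEMMs run on the compute stream

    // Panels live in pinned host memory (the matrix stays on the host);
    // two device buffers alternate so copy and compute can overlap.
    double *hL[NPANELS], *dL[2], *dB;
    for (int k = 0; k < NPANELS; ++k) {
        cudaMallocHost((void**)&hL[k], PANEL);
        std::memset(hL[k], 0, PANEL);
    }
    cudaMalloc((void**)&dL[0], PANEL);
    cudaMalloc((void**)&dL[1], PANEL);
    cudaMalloc((void**)&dB, sizeof(double) * M * M);
    cudaMemset(dB, 0, sizeof(double) * M * M);

    const double alpha = -1.0, beta = 1.0;
    cudaMemcpyAsync(dL[0], hL[0], PANEL, cudaMemcpyHostToDevice, copy);
    cudaStreamSynchronize(copy);   // preload panel 0 over PCIe

    for (int k = 0; k < NPANELS; ++k) {
        // PCIe transfer of panel k+1 overlaps the DGEMM on panel k
        // (in SDPARA the MPI broadcast of the next panel overlaps here too).
        if (k + 1 < NPANELS)
            cudaMemcpyAsync(dL[(k + 1) % 2], hL[k + 1], PANEL,
                            cudaMemcpyHostToDevice, copy);
        // Step 4: trailing update B' = B' - L * L' on the GPU.
        cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_T, M, M, NB,
                    &alpha, dL[k % 2], M, dL[k % 2], M, &beta, dB, M);
        cudaStreamSynchronize(copy);     // next panel is resident
        cudaStreamSynchronize(compute);  // update finished
    }

    cublasDestroy(h);
    cudaFree(dB); cudaFree(dL[0]); cudaFree(dL[1]);
    for (int k = 0; k < NPANELS; ++k) cudaFreeHost(hL[k]);
    return 0;
}
```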
CPU/GPU Affinity
• 1 MPI process per GPU; 3 MPI processes per node
• CPU/memory affinity is determined in a GPU-centric way
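A minimal sketch of such GPU-centric binding (assumptions: node-local ranks map one-to-one to GPUs, and the cores-per-GPU count is invented here; a production code would derive the core map from the PCIe/NUMA topology, e.g. with hwloc):

```cpp
#include <mpi.h>
#include <cuda_runtime.h>
#include <sched.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // Rank within the node: processes sharing memory form one communicator.
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);
    int local_rank;
    MPI_Comm_rank(node, &local_rank);

    // One MPI process per GPU (3 per node, as on TSUBAME).
    cudaSetDevice(local_rank);

    // Bind to the cores nearest the chosen GPU; the core map below is a
    // made-up example, not the actual TSUBAME topology.
    const int cores_per_gpu = 4;                 // assumption for illustration
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = 0; c < cores_per_gpu; ++c)
        CPU_SET(local_rank * cores_per_gpu + c, &set);
    sched_setaffinity(0, sizeof(set), &set);     // 0 = calling process

    std::printf("local rank %d -> GPU %d\n", local_rank, local_rank);
    MPI_Finalize();
    return 0;
}
```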
1.713 PFLOPS (DP) with 4,080 GPUs!!
Katsuki FUJISAWA 1,4, Hitoshi SATO 2,4, Toshio ENDO 2,4, Naoki MATSUZAWA 2, Yuichiro YASUI 1,4, Hayato WAKI 3,4
1 Chuo University, 2 Tokyo Institute of Technology, 3 Kyushu University, 4 JST CREST
For inquiries, please email [email protected]
Projects: "Advanced Computing & Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers" and "Software Technology that Deals with Deeper Memory Hierarchy in Post-petascale Era"
Research Area: Technologies for Post-Peta Scale High Performance Computing
SemiDefinite Programming (SDP)
Algorithmic Framework of PDIPM
• ELEMENTS: computation of the SCM. Memory access-intensive. Time complexity: O(mn³ + m²n²)
• CHOLESKY: Cholesky factorization of the SCM. Compute-intensive. Time complexity: O(m³)
These are the major bottleneck parts (80% - 90% of total execution time).
ELEMENTS-bound SDP problems: m < n (not m >> n), fully dense SCM
CHOLESKY-bound SDP problems: m >> n, fully dense SCM
(n: matrix size, m: number of constraints)
Fast computation of a sparse SCM is future work, e.g., the sensor network location problem and the polynomial optimization problem.
Efficient hybrid (MPI-OpenMP) parallel computation of the SCM B, with automatic configuration of CPU affinity and memory interleaving
[Figure: The Schur complement matrix (SCM) B is m × m; element B_ij (= B_ji) is computed from the data matrices F_i and F_j, and each F_k (k = 1, …, m) is accessed widely, which motivates memory interleaving and scatter-type CPU affinity. Our algorithm: an outer i-loop over F_i and an inner j-loop over F_j fill the rows of B.]
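A minimal sketch of this doubly-nested loop structure (OpenMP only; the true SCM element involves the PDIPM search-direction matrices, so a plain inner product B_ij = <F_i, F_j> stands in, and all sizes are made-up):

```cpp
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    const int m = 512;            // number of constraints (assumption)
    const int nsq = 64 * 64;      // entries per data matrix F_k (assumption)
    std::vector<std::vector<double>> F(m, std::vector<double>(nsq, 0.01));
    std::vector<double> B(static_cast<size_t>(m) * m, 0.0);

    // Outer i-loop over F_i, inner j-loop over F_j; B is symmetric, so only
    // the lower triangle (j <= i) is computed. With scatter-type affinity
    // (e.g. OMP_PROC_BIND=spread) threads land on distinct sockets, and
    // memory interleaving (e.g. numactl --interleave=all) spreads the
    // widely-accessed F_k across NUMA nodes.
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j <= i; ++j) {
            double s = 0.0;
            for (int t = 0; t < nsq; ++t) s += F[i][t] * F[j][t];
            B[(size_t)i * m + j] = B[(size_t)j * m + i] = s;
        }
    }
    std::printf("B[0][0] = %g\n", B[0]);
    return 0;
}
```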
[Figure: ELEMENTS time (seconds/iteration, log scale) versus number of nodes (64 to 1360), under six affinity/interleaving settings: default (w/o), scatter, compact, default (w/o) + interleaving, scatter + interleaving, compact + interleaving. Measured times range from roughly 3284 s down to 191 s per iteration across settings and node counts.
Electronic5 (SCM size: 47.7k × 47.7k): "interleaving" is effective.
Electronic4 (SCM size: 76.6k × 76.6k): high efficiency; "scatter and interleaving" is fastest.
Caption: Scalability of "scatter and interleaving"; CPU time varied with the affinity/interleaving setting.]
Primal-Dual Formulation
Primal: $\min \sum_{k=1}^{m} c_k x_k$ subject to $X = \sum_{k=1}^{m} F_k x_k - F_0$, $X \succeq O$
Dual: $\max F_0 \bullet Y$ subject to $F_k \bullet Y = c_k$ for $k = 1, \ldots, m$, $Y \succeq O$
($F_k$ are $n \times n$ symmetric data matrices, in the standard SDPA form)
for k = 0, 1, 2, …, m/nb − 1:
1. Diagonal block factorization
2. Panel factorization: compute L by GPU DTRSM
3. Broadcast L, transpose L (into L'), and broadcast L'
4. Update B' = B' − L × L' with fast GPU DGEMM
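A serial sketch of these four steps (one process, no GPU: the DTRSM and DGEMM calls are written as plain loops and the Step-3 broadcast becomes a no-op, so this shows only the blocked structure, not the distributed implementation):

```cpp
#include <cmath>
#include <vector>
#include <cstdio>
#include <algorithm>

// In-place blocked Cholesky of a symmetric positive definite matrix A
// (row-major, lower triangle used), following Steps 1-4 of the algorithm.
void blocked_cholesky(std::vector<double>& A, int m, int nb) {
    auto a = [&](int i, int j) -> double& { return A[(size_t)i * m + j]; };
    for (int k = 0; k < m; k += nb) {
        int kb = std::min(nb, m - k);
        // Step 1: factorize the diagonal block (unblocked Cholesky).
        for (int j = k; j < k + kb; ++j) {
            for (int t = k; t < j; ++t) a(j, j) -= a(j, t) * a(j, t);
            a(j, j) = std::sqrt(a(j, j));
            for (int i = j + 1; i < k + kb; ++i) {
                for (int t = k; t < j; ++t) a(i, j) -= a(i, t) * a(j, t);
                a(i, j) /= a(j, j);
            }
        }
        // Step 2: panel factorization (the DTRSM against the diagonal block).
        for (int i = k + kb; i < m; ++i)
            for (int j = k; j < k + kb; ++j) {
                for (int t = k; t < j; ++t) a(i, j) -= a(i, t) * a(j, t);
                a(i, j) /= a(j, j);
            }
        // Step 3 (broadcast of L and L') is a no-op in this serial sketch.
        // Step 4: trailing update B' = B' - L * L' (the DGEMM/DSYRK).
        for (int i = k + kb; i < m; ++i)
            for (int j = k + kb; j <= i; ++j)
                for (int t = k; t < k + kb; ++t)
                    a(i, j) -= a(i, t) * a(j, t);
    }
}

int main() {
    int m = 8, nb = 4;
    // Small SPD test matrix: A = m * I + ones.
    std::vector<double> A((size_t)m * m, 1.0);
    for (int i = 0; i < m; ++i) A[(size_t)i * m + i] += m;
    blocked_cholesky(A, m, nb);
    std::printf("L(0,0) = %.6f\n", A[0]);  // sqrt(m + 1) = 3 for m = 8
    return 0;
}
```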
• SDPARA is a parallel implementation of the primal-dual interior-point method (PDIPM) for Semidefinite Programming. It provides parallel computation for the two major bottleneck parts:
• ELEMENTS ⇒ computation of the Schur complement matrix (SCM)
• CHOLESKY ⇒ Cholesky factorization of the Schur complement matrix (SCM)
• SDPARA attained high scalability using 16,320 CPU cores on the TSUBAME 2.5 supercomputer, together with techniques for processor affinity and memory interleaving, when the computation of the SCM (ELEMENTS) constituted the bottleneck.
• With 4,080 NVIDIA GPUs on the TSUBAME 2.0 & 2.5 supercomputers, our implementation achieved 1.019 PFlops (TSUBAME 2.0) and 1.713 PFlops (TSUBAME 2.5) in double precision for a large-scale problem (CHOLESKY) with over two million constraints.
ELEMENTS for large-scale SDP problems generally requires significant computational resources in terms of CPU cores and memory bandwidth.