




Peta-scale General Solver for Semidefinite Programming Problems with over Two Million Constraints

1. Mathematical Programming: one of the most central problems in mathematical optimization.

2. Many Applications: combinatorial optimization, control theory, structural optimization, quantum chemistry, sensor network location, data mining, etc.

Data Decomposition

• The dense matrix B (m x m) is distributed in a 2D block-cyclic distribution with block size nb (an index-mapping sketch follows the matrix-distribution figure below).

For problems with m >> n, a high-performance CHOLESKY is implemented for GPU supercomputers. The key to petaflops performance is overlapping computation, PCI-Express communication, and MPI communication.

SDPARA can solve the largest SDP problem

• DNN relaxation problem for QAP10 from QAPLIB, with 2.33 million constraints
• Using 1,360 nodes, 2,720 CPUs, 4,080 K20X GPUs
• 1.713 PFLOPS in CHOLESKY (2.33M x 2.33M SCM)
• The fastest and largest result for a mathematical optimization problem!!

nb = 1024 (for TSUBAME 2.0), 2048 (for TSUBAME 2.5)

[Figure] Matrix distribution on 6 (= 2 x 3) processes: each process holds a "partial-matrix" of the m x m matrix B; the trailing-matrix update B' = B' - L L' (panel L and its transpose L') is applied to each partial matrix. Within a node, the three processes have local ranks 0, 1, 2.
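To make the 2D block-cyclic layout concrete, the sketch below maps a global block index to its owner process and local block index on a Pr x Pc process grid (here 2 x 3, as in the figure). This is a minimal illustration written for this poster, not SDPARA's or ScaLAPACK's actual code; the function and variable names are ours.

  // Minimal sketch (not SDPARA code): owner and local index of a global block
  // in a 2D block-cyclic distribution over a Pr x Pc process grid.
  #include <cstdio>

  struct Owner { int prow, pcol, lrow, lcol; };

  // gi, gj: global block row/column indices (block size nb is implicit).
  Owner owner_of_block(int gi, int gj, int Pr, int Pc) {
      Owner o;
      o.prow = gi % Pr;   // process row that owns block row gi
      o.pcol = gj % Pc;   // process column that owns block column gj
      o.lrow = gi / Pr;   // position of the block in the local block matrix
      o.lcol = gj / Pc;
      return o;
  }

  int main() {
      const int Pr = 2, Pc = 3;          // 6 = 2 x 3 processes, as in the figure
      const int nb = 1024;               // block size used on TSUBAME 2.0
      const int m  = 2330000;            // ~2.33 million constraints
      const int nblocks = (m + nb - 1) / nb;
      Owner o = owner_of_block(nblocks - 1, nblocks - 1, Pr, Pc);
      std::printf("last diagonal block -> process (%d,%d), local block (%d,%d)\n",
                  o.prow, o.pcol, o.lrow, o.lcol);
      return 0;
  }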

Parallel Computation for CHOLESKY


Parallel Computation for ELEMENTS

[Figure: Performance of ELEMENTS for quantum-chemistry problems on TSUBAME 2.0]
Strong scaling based on 128 nodes; y-axis: scalability and efficiency of ELEMENTS, x-axis: number of nodes (ideal speedup at 1360 nodes is 10.6x). Parallel efficiency with scatter affinity and memory interleaving:

  nodes         128      256     512     1024    1360
  Electronic4   100.0%   97.1%   93.3%   83.3%   75.9%
  Electronic5   100.0%   93.0%   89.8%   70.6%   59.2%

Efficiency remains high for the large problem (Electronic4).

[Figure: Performance of CHOLESKY for QAP (combinatorial optimization) on TSUBAME 2.0 / 2.5]
Speed of CHOLESKY (TFlops) vs. number of nodes (up to 1,360), comparing the original implementation (org, TSUBAME 2.0) with the new implementation (new) on TSUBAME 2.0 and TSUBAME 2.5 for QAP6-QAP10. The new implementation peaks at 1019 TFlops for QAP10 on TSUBAME 2.0 and 1713 TFlops for QAP10 on TSUBAME 2.5.

Basic Algorithm

Design Strategy

• Our target is large problems with m > two million. GPU memory is too small, so matrix data are usually placed on host memory.
• Block size nb should be sufficiently large to mitigate CPU-GPU PCIe communication, yet we still suffer from heavy PCIe communication: nb = 1024 for TSUBAME 2.0, nb = 2048 for TSUBAME 2.5.
• Overlap GPU computation, PCIe communication, and MPI communication in Steps 2, 3, and 4 of the algorithm (illustrated in the timeline figure and the sketch below).

[Figure] Timeline of the overlapped CHOLESKY pipeline. Producer of panel L: the CPU drives MPI communication of the panel while the GPU runs DTRSM. Consumers of panel L: PCIe transfers of the panel pieces (L0-L3) are overlapped with GPU DGEMM updates of the local blocks (B0, B1, B2), so MPI communication, PCIe communication, and GPU computation proceed concurrently.
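The timeline above can be expressed with CUDA streams: one stream copies the next piece of the panel over PCIe while another runs DGEMM on a piece that already resides on the GPU. The sketch below shows only this double-buffering idea; it is not SDPARA's code, the buffer names and tile layout are invented, and the MPI broadcast of the panel is shown as a single blocking call for brevity (the real implementation overlaps it with the other two stages as well).

  // Sketch of overlapping PCIe transfers and GPU DGEMM with two CUDA streams.
  // Illustrative only; buffer names and sizes are hypothetical.
  #include <mpi.h>
  #include <cuda_runtime.h>
  #include <cublas_v2.h>

  void update_trailing_matrix(double *hostL, double *devB, int nb, int ntiles,
                              int root, MPI_Comm comm, cublasHandle_t handle) {
      cudaStream_t s[2];
      double *devL[2];
      for (int i = 0; i < 2; ++i) {
          cudaStreamCreate(&s[i]);
          cudaMalloc(&devL[i], sizeof(double) * nb * nb);
      }
      const double one = 1.0, minus_one = -1.0;

      // Panel L arrives on the host via MPI (the producer broadcasts it).
      MPI_Bcast(hostL, nb * nb * ntiles, MPI_DOUBLE, root, comm);

      for (int t = 0; t < ntiles; ++t) {
          int cur = t % 2;
          // Stage tile t of the panel to the GPU on its own stream (PCIe).
          cudaMemcpyAsync(devL[cur], hostL + (size_t)t * nb * nb,
                          sizeof(double) * nb * nb,
                          cudaMemcpyHostToDevice, s[cur]);
          // The DGEMM for tile t runs on the same stream, so it starts as soon
          // as its copy finishes, while the other stream copies tile t+1.
          cublasSetStream(handle, s[cur]);
          cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, nb, nb, nb,
                      &minus_one, devL[cur], nb, devL[cur], nb,
                      &one, devB + (size_t)t * nb * nb, nb);
      }
      for (int i = 0; i < 2; ++i) {
          cudaStreamSynchronize(s[i]);
          cudaFree(devL[i]);
          cudaStreamDestroy(s[i]);
      }
  }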

CPU/GPU Affinity

• 1 MPI process per GPU; 3 MPI processes per node.
• CPU/memory affinity is determined in a GPU-centric way, i.e., each MPI process is placed near the GPU it drives (a hedged sketch follows).
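One common way to realize "1 MPI process per GPU" with GPU-centric binding is to pick the GPU from the node-local rank and then pin the process to cores near that GPU. The sketch below shows this pattern with the CUDA runtime and an MPI-3 shared-memory sub-communicator; the core ranges are placeholders, since the real mapping depends on the node's PCIe/NUMA topology (e.g., queried with hwloc), and this is not SDPARA's launch code.

  // Sketch: 1 MPI process per GPU, with a GPU-centric CPU binding.
  // The core ranges below are placeholders, not the TSUBAME topology.
  #ifndef _GNU_SOURCE
  #define _GNU_SOURCE
  #endif
  #include <mpi.h>
  #include <sched.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);

      // Node-local rank via a shared-memory sub-communicator (MPI-3).
      MPI_Comm nodecomm;
      MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                          MPI_INFO_NULL, &nodecomm);
      int lrank;
      MPI_Comm_rank(nodecomm, &lrank);

      // 3 MPI processes per node, one GPU each (local ranks 0, 1, 2).
      int ngpus = 0;
      cudaGetDeviceCount(&ngpus);
      if (ngpus > 0) cudaSetDevice(lrank % ngpus);

      // Bind this process to a block of cores "near" its GPU.
      // Hypothetical layout: 4 consecutive cores per local rank.
      cpu_set_t set;
      CPU_ZERO(&set);
      for (int c = 4 * lrank; c < 4 * (lrank + 1); ++c)
          CPU_SET(c, &set);
      sched_setaffinity(0, sizeof(set), &set);

      /* ... solver runs here ... */

      MPI_Finalize();
      return 0;
  }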

1.713 PFLOPS (DP) with 4,080 GPUs!!

Katsuki FUJISAWA 1,4, Hitoshi SATO 2,4, Toshio ENDO 2,4, Naoki MATSUZAWA 2, Yuichiro YASUI 1,4, Hayato WAKI 3,4

1 Chuo University   2 Tokyo Institute of Technology   3 Kyushu University   4 JST CREST

For inquiries please email to [email protected]

Projects: Advanced Computing & Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers; Software Technology that Deals with Deeper Memory Hierarchy in Post-petascale Era

Research Area: Technologies for Post-Peta Scale High Performance Computing

SemiDefinite Programming (SDP)

Algorithmic Framework of PDIPM

• ELEMENTS: computation of the SCM. Memory-access-intensive. Time-complexity: O(mn^3 + m^2 n^2) (dense case).

• CHOLESKY: Cholesky factorization of the SCM. Compute-intensive. Time-complexity: O(m^3).

These are the major bottleneck parts (80% - 90% of total execution time).

• ELEMENTS-bound SDP problems: m < n (not m >> n), and fully dense SCM
• CHOLESKY-bound SDP problems: m >> n, and fully dense SCM

(n: matrix size, m: # of constraints)

Fast computation of sparse SCM is future work (e.g., the sensor network location problem and the polynomial optimization problem).

[Figure] The primal and dual SDP problems (see the Primal-Dual Formulation below); example application: quadratic assignment.

Efficient Hybrid (MPI-OpenMP) Parallel Computation of the SCM B, with Automatic Configuration for CPU Affinity and Memory Interleaving

[Figure] The Schur complement matrix (SCM) B is m x m; element B_ij (and its symmetric counterpart B_ji) is built from the data matrices F_i and F_j. The F_k (k = 1, ..., m) are accessed widely across memory, so memory interleaving and scatter-type CPU affinity are applied. Our algorithm organizes the computation as an outer i-loop over F_i and an inner j-loop over F_j (a hedged code sketch follows).
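A minimal sketch of the hybrid loop structure described above, assuming a row-wise assignment of SCM rows to MPI ranks and OpenMP threads within each rank; DataMatrix, eval_element, and my_rows are placeholders invented for this example, not SDPARA's data structures. Scatter-type affinity and memory interleaving are requested externally, e.g. with OMP_PROC_BIND=spread OMP_PLACES=cores and numactl --interleave=all.

  // Sketch of the ELEMENTS loop structure (outer i-loop over F_i, inner
  // j-loop over F_j). Placeholder types/kernels; not SDPARA's implementation.
  // Launch e.g. as:
  //   OMP_PROC_BIND=spread OMP_PLACES=cores numactl --interleave=all ./elements
  #include <omp.h>
  #include <vector>
  #include <cstddef>

  struct DataMatrix { /* sparse storage of one F_k (placeholder) */ };

  // Placeholder for the real per-element kernel, which also involves the
  // current primal/dual iterates.
  static double eval_element(const DataMatrix &Fi, const DataMatrix &Fj) {
      (void)Fi; (void)Fj;
      return 0.0;
  }

  // Compute the rows of the m x m SCM B assigned to this MPI rank.
  void elements(const std::vector<DataMatrix> &F,     // F_1, ..., F_m
                std::vector<double> &B,               // row-major m x m
                const std::vector<int> &my_rows) {
      const int m = static_cast<int>(F.size());
      // Threads are spread across sockets (scatter affinity), so the widely
      // scattered reads of the F_k data benefit from interleaved memory.
      #pragma omp parallel for schedule(dynamic)
      for (int r = 0; r < static_cast<int>(my_rows.size()); ++r) {
          const int i = my_rows[r];
          for (int j = i; j < m; ++j) {               // B is symmetric: j >= i
              B[(std::size_t)i * m + j] = eval_element(F[i], F[j]);
          }
      }
  }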

[Figure: ELEMENTS time with different affinity/interleaving settings on TSUBAME 2.0]
ELEMENTS time (seconds/iteration) vs. number of nodes (64-1360) for six settings: default (w/o), scatter, and compact CPU affinity, each with and without memory interleaving.

• Electronic5 (SCM size: 47.7k x 47.7k): CPU time varies with the affinity/interleaving setting; "interleaving" is effective, and "scatter and interleaving" is fastest.
• Electronic4 (SCM size: 76.6k x 76.6k): "scatter and interleaving" scales with high efficiency.

Primal-Dual Formulation
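The equations of the formulation did not survive extraction. For reference, a standard form of the primal-dual pair (equivalent to the SDPA-family formulation, with n x n symmetric data matrices F_0, F_1, ..., F_m and costs c in R^m) is:

  \begin{aligned}
  \text{Primal:}\quad & \min_{x \in \mathbb{R}^m,\; X}\ \sum_{k=1}^{m} c_k x_k
    \quad \text{s.t.}\quad X = \sum_{k=1}^{m} F_k x_k - F_0,\quad X \succeq O,\\
  \text{Dual:}\quad  & \max_{Y}\ F_0 \bullet Y
    \quad \text{s.t.}\quad F_k \bullet Y = c_k\ \ (k = 1,\dots,m),\quad Y \succeq O,
  \end{aligned}

where A \bullet B denotes the trace inner product and X \succeq O, Y \succeq O mean that X and Y are symmetric positive semidefinite.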

for k = 0, 1, 2, ..., m/nb - 1:

1. Diagonal block factorization
2. Panel factorization: compute L by GPU DTRSM
3. Broadcast L, transpose L (L'), and broadcast L'
4. Update B' = B' - L x L' with fast GPU DGEMM
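The loop above is the classic blocked, right-looking Cholesky factorization. As a point of reference only, the sketch below shows the same four steps on a single node with CPU LAPACK/BLAS (LAPACKE/CBLAS) and no MPI, GPU offload, or 2D block-cyclic distribution; the function name blocked_cholesky is ours, not SDPARA's.

  // Single-node sketch of the blocked right-looking Cholesky loop
  // (Steps 1-4 above), using LAPACKE/CBLAS. Illustrative only.
  #include <lapacke.h>
  #include <cblas.h>
  #include <algorithm>

  // Factor the lower triangle of the column-major m x m matrix A (lda >= m)
  // in place, with block size nb.
  int blocked_cholesky(double *A, int m, int lda, int nb) {
      for (int kk = 0; kk < m; kk += nb) {
          int ib = std::min(nb, m - kk);      // current block size
          int r  = m - kk - ib;               // rows remaining below the block
          double *A11 = A + kk + (size_t)kk * lda;                 // diagonal block
          double *A21 = A + (kk + ib) + (size_t)kk * lda;          // panel below it
          double *A22 = A + (kk + ib) + (size_t)(kk + ib) * lda;   // trailing matrix

          // Step 1: diagonal block factorization (DPOTRF).
          int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', ib, A11, lda);
          if (info != 0) return info;
          if (r == 0) break;

          // Step 2: panel factorization, L21 = A21 * L11^{-T} (DTRSM).
          cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                      CblasNonUnit, r, ib, 1.0, A11, lda, A21, lda);

          // Step 3 (broadcast of L and L' between processes) has no
          // counterpart in this single-node sketch.

          // Step 4: trailing update B' = B' - L * L' (DGEMM; only the lower
          // triangle of A22 is referenced afterwards).
          cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, r, r, ib,
                      -1.0, A21, lda, A21, lda, 1.0, A22, lda);
      }
      return 0;
  }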

• SDPARA is a parallel implementation of the primal-dual interior-point method (PDIPM) for Semidefinite Programming, with parallel computation of the two major bottleneck parts:
  • ELEMENTS ⇒ computation of the Schur complement matrix (SCM)
  • CHOLESKY ⇒ Cholesky factorization of the SCM

• SDPARA could attain high scalability using 16,320 CPU cores on the TSUBAME 2.5 supercomputer, together with processor-affinity and memory-interleaving techniques, when the computation of the SCM (ELEMENTS) constituted the bottleneck.

• With 4,080 NVIDIA GPUs on the TSUBAME 2.0 and 2.5 supercomputers, our implementation achieved 1.019 PFlops (TSUBAME 2.0) and 1.713 PFlops (TSUBAME 2.5) in double precision for a large-scale problem (CHOLESKY) with over two million constraints.

• ELEMENTS for large-scale SDP problems generally requires significant computational resources in terms of CPU cores and memory bandwidth.