
Core Research for Evolutional Science and Technology (JST CREST)

    Peta-scale General Solver for Semidefinite Programming Problems with over Two Million Constraints

High-Performance General Solver for Extremely Large-scale Semidefinite Programming Problems (SDPs)

1. Mathematical Programming: one of the most central problems in mathematical optimization.
2. Many Applications: combinatorial optimization, control theory, structural optimization, quantum chemistry, sensor network location, data mining, etc.

•  SDPARA is a parallel implementation of the interior-point method for Semidefinite Programming (SDP).

•  Parallel computation for two major bottleneck parts:
   •  ELEMENTS ⇒ computation of the Schur complement matrix (SCM)
   •  CHOLESKY ⇒ Cholesky factorization of the Schur complement matrix (SCM)

•  SDPARA attained high scalability using 16,320 CPU cores on the TSUBAME 2.0 supercomputer, together with processor-affinity and memory-interleaving techniques, when generation of the SCM constituted the bottleneck.

•  With 4,080 NVIDIA M2050 GPUs on the TSUBAME 2.0 supercomputer, our implementation achieved 1.018 PFlops in double precision for a large-scale problem with over two million constraints.

Semidefinite Programming (SDP)

The major bottleneck parts of the PDIPM (primal-dual interior-point method) for various types of SDP problems

1. CHOLESKY: sparse SCM, e.g., the sensor network location problem and the polynomial optimization problem. Parallel sparse Cholesky factorization of the sparse SCM.

2. ELEMENTS: m < n (not m >> n), and fully dense SCM, e.g., the quantum chemistry problem and the truss topology problem. Efficient hybrid (MPI-OpenMP) parallel computation. Automatic configuration of CPU affinity and memory interleaving.

3. CHOLESKY: m >> n, and fully dense SCM. Massively parallel dense Cholesky factorization using GPUs, e.g., the combinatorial optimization problem and the quadratic assignment problem (QAP). Overlapping techniques for computation, PCI-Express communication, and MPI communication. Our implementation achieved 1.018 PFlops in double precision for large-scale Cholesky factorization using 2,720 CPUs and 4,080 GPUs.

     

     

     

Data Decomposition
•  The dense matrix B (m × m) is distributed with a 2D block-cyclic distribution with block size nb.

For problems with m >> n, a high-performance CHOLESKY is implemented for GPU supercomputers. The key to petaflops performance is overlapping computation, PCI-Express communication, and MPI communication.

SDPARA can solve the largest SDP problems:
•  DNN relaxation problem: QAP10 from QAPLIB, with 2.33 million constraints
•  Using 1,360 nodes, 2,720 CPUs, and 4,080 M2050 GPUs
•  1.018 PFLOPS in CHOLESKY (2.33M × 2.33M matrix)
•  The fastest and largest result for a mathematical optimization problem!

•  nb = 1024
•  Matrix distribution on 6 (= 2 × 3) processes; each process holds a "partial matrix" of B.
•  Trailing update in each step: B' = B' − L × L', where L' is the transpose of the panel L.
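To make the 2D block-cyclic layout concrete, the following is a minimal sketch (not the ScaLAPACK/SDPARA routine) of how the owning process of a matrix entry is determined on a 2 × 3 process grid with block size nb = 1024; the function name owner_of is illustrative.

```c
/* A minimal sketch of 2D block-cyclic ownership with block size NB on a
 * PR x PC process grid (2 x 3 here, as in the six-process example above). */
#include <stdio.h>

#define NB 1024
#define PR 2
#define PC 3

/* Returns the (row, col) coordinates of the process that owns matrix
 * entry (i, j), 0-based, under a 2D block-cyclic distribution. */
static void owner_of(long i, long j, int *prow, int *pcol)
{
    *prow = (int)((i / NB) % PR);   /* block-row index, wrapped over process rows    */
    *pcol = (int)((j / NB) % PC);   /* block-column index, wrapped over process cols */
}

int main(void)
{
    int pr, pc;
    owner_of(2000000, 1500000, &pr, &pc);   /* an entry of a 2.33M x 2.33M matrix */
    printf("entry is owned by process (%d, %d)\n", pr, pc);
    return 0;
}
```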

Fig. GPU-centric node layout for CHOLESKY (three MPI processes per node): local rank 0 on processor cores 0, 2, 4, 6, 8, 10 with GPU0; local rank 1 on cores 1, 3, 5 with GPU1; local rank 2 on cores 7, 9, 11 with GPU2; RAM0 and RAM1 are the node's two NUMA memories.

Parallel Computation for CHOLESKY

Fig. Two-process node layout: local rank 0 on cores 0, 2, 4, 6, 8, 10; local rank 1 on cores 1, 3, 5, 7, 9, 11; RAM0, RAM1, GPU1, and GPU2 labels as in the original node diagram.

Step 2: Processor mapping table

  Processor ID   Package ID   Core ID
  0              0            0
  1              1            1
  2              0            2
  ...

Step 3: Thread-processor assignment table

  Thread ID   Scatter   Compact
  0           0         0
  1           1         2
  2           2         4
  ...

Automated configuration of CPU affinity and memory allocation policy

Fig. Core-binding diagrams for scatter-type affinity and compact-type affinity on a TSUBAME 2.0 node (two CPU packages; logical processors 0–23).

Parallel Computation for ELEMENTS

Fig. Scalability and efficiency of ELEMENTS (strong scaling, baseline 128 nodes; x-axis: number of nodes). Parallel efficiency at 128 / 256 / 512 / 1024 / 2048 nodes:
  Electronic4 (scatter, interleaving): 100.0% / 97.1% / 93.3% / 83.3% / 75.9%
  Electronic5 (scatter, interleaving): 100.0% / 93.0% / 89.8% / 70.6% / 59.2%
Efficiency for large problems.

Automated detection on a TSUBAME 2.0 HP node (a detection sketch is given below):
  Step 1: Linux device files (/sys/devices/system/*)
  Step 2: Processor mapping table
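As a rough illustration of Steps 1 and 2, the following minimal sketch (not the SDPARA code) builds the processor-to-package mapping table by reading the standard Linux sysfs files under /sys/devices/system/cpu:

```c
/* A minimal sketch of automated topology detection: build a processor -> package
 * mapping table from /sys/devices/system/cpu/cpu<N>/topology/physical_package_id. */
#include <stdio.h>

#define MAX_CPUS 1024

int main(void)
{
    int package_of[MAX_CPUS];
    int ncpu = 0;

    for (int cpu = 0; cpu < MAX_CPUS; cpu++) {
        char path[128];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/physical_package_id", cpu);
        FILE *f = fopen(path, "r");
        if (!f) break;                            /* no more logical processors */
        if (fscanf(f, "%d", &package_of[cpu]) == 1)
            ncpu++;
        fclose(f);
    }

    /* Print the processor mapping table (Step 2). */
    for (int cpu = 0; cpu < ncpu; cpu++)
        printf("Processor %d -> Package %d\n", cpu, package_of[cpu]);
    return 0;
}
```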

Fig. ELEMENTS time (seconds/iteration) versus number of nodes (64–2048) for six parameter sets: default, scatter, compact, default + interleaving, scatter + interleaving, compact + interleaving.

packages ← ∅
for thread_id = 0, 1, ..., p − 1 do
    packages ← packages ∪ { get_package_id(thread_id, scatter) }
end
set_mempolicy(packages, MPOL_INTERLEAVE)

Figure: Memory interleaving configuration for the memory allocation policy

#omp parallel {
    thread_id ← omp_get_thread_num()
    pinned(thread_id, scatter)
    ...
    unpinned()
}

Figure: Scatter-type CPU affinity configuration

Fig. Hybrid MPI+OpenMP computation of the SCM elements: the elements indexed by the data matrices F_i and F_j are distributed over MPI processes (Process01–Process04) along one index and over OpenMP threads with dynamic scheduling along the other (a loop-structure sketch is given below).
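The figure's loop structure could look roughly as follows; this is a minimal sketch with hypothetical names (compute_element, elements), not the SDPARA source, and it omits the symmetric fill-in and communication steps:

```c
/* A minimal sketch of the hybrid MPI+OpenMP ELEMENTS loop: each MPI rank owns a
 * cyclic subset of the SCM row indices i, and the elements of each owned row are
 * computed by OpenMP threads with dynamic scheduling. compute_element(i, j)
 * stands in for SDPARA's actual SCM element formula. */
#include <omp.h>

extern double compute_element(int i, int j);   /* hypothetical element kernel */

/* B_local holds this rank's rows of the m x m SCM, packed consecutively. */
void elements(double *B_local, int m, int rank, int nprocs)
{
    for (int i = rank; i < m; i += nprocs) {           /* cyclic row distribution (MPI)   */
        double *row = &B_local[(long)(i / nprocs) * m];
        #pragma omp parallel for schedule(dynamic)      /* uneven element costs -> dynamic */
        for (int j = i; j < m; j++)                     /* SCM is symmetric: j >= i only   */
            row[j] = compute_element(i, j);
    }
}
```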

Fig. CPU time for Electronic4.   Fig. Scalability for Electronic4 and Electronic5.

•  Memory interleaving is effective.
•  The parameter set "scatter and interleaving" is the fastest.
•  High efficiency (75.9% for Electronic4).
•  Even higher efficiency is expected when solving an SDP problem larger than Electronic4, because the efficiency for Electronic4 is higher than that for Electronic5.

Compact-type affinity binds the (n+1)-th OpenMP thread in a free thread context as close as possible to the thread context in which the n-th OpenMP thread was bound. Scatter-type affinity distributes OpenMP threads as evenly as possible across the entire system.

Combinatorial optimization

Performance of CHOLESKY on TSUBAME 2.0: quadratic assignment problems (QAP)

Fig. Speed of CHOLESKY (TFlops) versus number of nodes for QAP6–QAP10. Measured values include 232.9, 329.1, 437.8, 306.2, 469.9, 718.8, 512.9, 825.6, 964.4, and 1018.5 TFlops, peaking at 1018.5 TFlops for QAP10 on 1,360 nodes.

Basic Algorithm (a CPU-only sketch of the blocked factorization is given below)
For k = 0, 1, 2, ..., m/nb − 1:
1. Diagonal block factorization
2. Panel factorization → compute L by GPU DTRSM
3. Broadcast L, transpose L (to L'), and broadcast L'
4. Update B' = B' − L × L' with fast GPU DGEMM

Design Strategy
•  Our target is large problems with m > 2 million → GPU memory is too small; matrix data are usually placed in host memory.
•  The block size nb should be sufficiently large to mitigate CPU-GPU PCIe communication.
•  We still suffer from heavy PCIe communication → nb = 1024 → overlap GPU computation, PCIe communication, and MPI communication in Steps 2, 3, and 4.
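For reference, a CPU-only, single-node sketch of the blocked right-looking Cholesky factorization (Steps 1, 2, and 4 above) is shown below; it uses standard CBLAS/LAPACKE calls and omits the 2D block-cyclic distribution, GPU offload, and the communication overlap that SDPARA relies on:

```c
/* A minimal, single-node, CPU-only sketch of blocked right-looking Cholesky.
 * B is symmetric positive definite, stored row-major; only its lower triangle
 * is meaningful (the trailing DGEMM also touches the unused upper triangle,
 * which keeps the sketch short without affecting the result). */
#include <cblas.h>
#include <lapacke.h>

void blocked_cholesky(double *B, int m, int nb)
{
    for (int k = 0; k < m; k += nb) {
        int kb = (m - k < nb) ? (m - k) : nb;
        double *Bkk = &B[(long)k * m + k];

        /* Step 1: factorize the diagonal block, B[k][k] = L[k][k] * L[k][k]^T. */
        LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'L', kb, Bkk, m);

        if (k + kb < m) {
            int rest = m - (k + kb);
            double *panel = &B[(long)(k + kb) * m + k];   /* blocks below the diagonal */

            /* Step 2: panel factorization, L[i][k] = B[i][k] * L[k][k]^{-T} (DTRSM). */
            cblas_dtrsm(CblasRowMajor, CblasRight, CblasLower, CblasTrans, CblasNonUnit,
                        rest, kb, 1.0, Bkk, m, panel, m);

            /* Step 4: trailing update B' = B' - L * L' (DGEMM). */
            double *trail = &B[(long)(k + kb) * m + (k + kb)];
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                        rest, rest, kb, -1.0, panel, m, panel, m, 1.0, trail, m);
        }
    }
}
```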

Fig. Overlapping timelines for the producer and consumers of a panel L: CPU, GPU (DTRSM on the producer, DGEMM updates of the local blocks B0–B2 on the consumers), MPI communication, and PCIe communication are pipelined over the panel chunks L0–L3.

CPU/GPU Affinity
•  1 MPI process per GPU → 3 processes per node
•  CPU/memory affinity is determined in a GPU-centric way (a binding sketch is given below).
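A minimal sketch of this GPU-centric binding, assuming MPI-3 and the CUDA runtime (the helper bind_process_to_gpu is illustrative, not SDPARA's code):

```c
/* A minimal sketch of GPU-centric binding: one MPI process per GPU (3 per node),
 * each selecting the GPU that matches its node-local rank. */
#include <mpi.h>
#include <cuda_runtime.h>

void bind_process_to_gpu(void)
{
    /* Node-local rank via an MPI-3 shared-memory sub-communicator. */
    MPI_Comm node_comm;
    int local_rank;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);

    /* Select the GPU with the same index as the local rank (3 GPUs per node). */
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    if (num_gpus > 0)
        cudaSetDevice(local_rank % num_gpus);

    /* CPU and memory affinity would then be set toward the NUMA node closest to
     * this GPU, e.g. with sched_setaffinity/set_mempolicy as in the ELEMENTS sketch. */
    MPI_Comm_free(&node_comm);
}
```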

1.018 PFLOPS (DP) with 4,080 GPUs!!

    Katsuki FUJISAWA 1, 4

    1 Chuo University

    2 Tokyo Institute of Technology

    For inquiries please email to [email protected]

    3 Kyushu University

    Advanced Computing & Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers

    Software Technology that Deals with Deeper Memory Hierarchy in Post-petascale Era

Research Area: Technologies for Post-Peta Scale High Performance Computing

Primal and dual SDP (annotations from the original formulation figure): m is the number of constraints of the primal problem, and n is the matrix size.
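The formulation itself did not survive the text extraction; for reference, a standard equality-form primal-dual SDP pair consistent with these annotations (this exact form is an assumption, not taken from the poster) is:

```latex
% Standard primal-dual SDP pair (assumed standard form; the poster's exact
% max/min and sign conventions are not recoverable from the extracted text).
\begin{align*}
\text{(Primal)}\quad & \min_{X}\; \langle C, X\rangle
  && \text{s.t. } \langle A_k, X\rangle = b_k \;\; (k = 1,\dots,m),\quad X \succeq 0,\; X \in \mathbb{S}^n,\\
\text{(Dual)}\quad  & \max_{y,\,S}\; b^{\top} y
  && \text{s.t. } \sum_{k=1}^{m} y_k A_k + S = C,\quad S \succeq 0,\; S \in \mathbb{S}^n .
\end{align*}
```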

    Hitoshi SATO 2, 4

    Toshio ENDO 2, 4

    Naoki MATSUZAWA 2

    Yuichiro YASUI 1, 4

    Hayato WAKI 3, 4

    4 JST CREST