
Core Research for Evolutional Science and Technology (JST CREST)

    Peta-scale General Solver for Semidefinite Programming Problems with over Two Million Constraints

High-Performance General Solver for Extremely Large-scale Semidefinite Programming Problems (SDPs)

1. Mathematical Programming: one of the most central problems in mathematical optimization.
2. Many Applications: combinatorial optimization, control theory, structural optimization, quantum chemistry, sensor network location, data mining, etc.

•  SDPARA is a parallel implementation of the interior-point method for Semidefinite Programming (SDP).

•  Parallel computation for two major bottleneck parts:
   •  ELEMENTS ⇒ computation of the Schur complement matrix (SCM)
   •  CHOLESKY ⇒ Cholesky factorization of the Schur complement matrix (SCM)

•  SDPARA attained high scalability using 16,320 CPU cores on the TSUBAME 2.0 supercomputer, together with processor-affinity and memory-interleaving techniques, when generation of the SCM constituted the bottleneck.

•  With 4,080 NVIDIA M2050 GPUs on the TSUBAME 2.0 supercomputer, our implementation achieved 1.018 PFlops in double precision for a large-scale problem with over two million constraints.

Semidefinite Programming (SDP)

The major bottleneck parts of the PDIPM (primal-dual interior-point method) for various types of SDP problems

1. CHOLESKY: sparse SCM, e.g., the sensor network location problem and the polynomial optimization problem. Parallel sparse Cholesky factorization of the sparse SCM.

2. ELEMENTS: m < n (not m >> n), and fully dense SCM, e.g., the quantum chemistry problem and the truss topology problem. Efficient hybrid (MPI-OpenMP) parallel computation. Automatic configuration of CPU affinity and memory interleaving.

3. CHOLESKY: m >> n, and fully dense SCM. Massively parallel dense Cholesky factorization using GPUs, e.g., the combinatorial optimization problem and the quadratic assignment problem (QAP). Overlapping techniques for computation, PCI-Express communication, and MPI communication. Our implementation achieved 1.018 PFlops in double precision for large-scale Cholesky factorization using 2,720 CPUs and 4,080 GPUs.

     

     

     

Data Decomposition
•  The dense matrix B (m × m) is distributed with a 2D block-cyclic distribution with block size nb.

For problems with m >> n, a high-performance CHOLESKY is implemented for GPU supercomputers. The key to petaflops performance is overlapping computation, PCI-Express communication, and MPI communication.

SDPARA can solve the largest SDP problems:
•  DNN relaxation problem: QAP10 from QAPLIB, with 2.33 million constraints
•  Using 1,360 nodes, 2,720 CPUs, and 4,080 M2050 GPUs
•  1.018 PFLOPS in CHOLESKY (2.33M × 2.33M matrix)
•  The fastest and largest result for a mathematical optimization problem!

•  nb = 1024
•  Matrix distribution on 6 (= 2 × 3) processes; each process holds a "partial matrix" of B.
•  Trailing update in each step: B' = B' − L × L', where L' is the transpose of the panel L.
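To make the 2D block-cyclic layout concrete, the following is a minimal sketch (not the ScaLAPACK/SDPARA routine) of how the owning process of a matrix entry is determined on a 2 × 3 process grid with block size nb = 1024; the function name owner_of is illustrative.

```c
/* A minimal sketch of 2D block-cyclic ownership with block size NB on a
 * PR x PC process grid (2 x 3 here, as in the six-process example above). */
#include <stdio.h>

#define NB 1024
#define PR 2
#define PC 3

/* Returns the (row, col) coordinates of the process that owns matrix
 * entry (i, j), 0-based, under a 2D block-cyclic distribution. */
static void owner_of(long i, long j, int *prow, int *pcol)
{
    *prow = (int)((i / NB) % PR);   /* block-row index, wrapped over process rows    */
    *pcol = (int)((j / NB) % PC);   /* block-column index, wrapped over process cols */
}

int main(void)
{
    int pr, pc;
    owner_of(2000000, 1500000, &pr, &pc);   /* an entry of a 2.33M x 2.33M matrix */
    printf("entry is owned by process (%d, %d)\n", pr, pc);
    return 0;
}
```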

Fig. GPU-centric node layout for CHOLESKY (three MPI processes per node): local rank 0 on processor cores 0, 2, 4, 6, 8, 10 with GPU0; local rank 1 on cores 1, 3, 5 with GPU1; local rank 2 on cores 7, 9, 11 with GPU2; RAM0 and RAM1 are the node's two NUMA memories.

Parallel Computation for CHOLESKY

Fig. Two-process node layout: local rank 0 on cores 0, 2, 4, 6, 8, 10; local rank 1 on cores 1, 3, 5, 7, 9, 11; RAM0, RAM1, GPU1, and GPU2 labels as in the original node diagram.

Step 2: Processor mapping table

  Processor ID   Package ID   Core ID
  0              0            0
  1              1            1
  2              0            2
  ...

Step 3: Thread-processor assignment table

  Thread ID   Scatter   Compact
  0           0         0
  1           1         2
  2           2         4
  ...

Automated configuration of CPU affinity and memory allocation policy

Fig. Core-binding diagrams for scatter-type affinity and compact-type affinity on a TSUBAME 2.0 node (two CPU packages; logical processors 0–23).

Parallel Computation for ELEMENTS

Fig. Scalability and efficiency of ELEMENTS (strong scaling, baseline 128 nodes; x-axis: number of nodes). Parallel efficiency at 128 / 256 / 512 / 1024 / 2048 nodes:
  Electronic4 (scatter, interleaving): 100.0% / 97.1% / 93.3% / 83.3% / 75.9%
  Electronic5 (scatter, interleaving): 100.0% / 93.0% / 89.8% / 70.6% / 59.2%
Efficiency for large problems.

Automated detection on a TSUBAME 2.0 HP node (a detection sketch is given below):
  Step 1: Linux device files (/sys/devices/system/*)
  Step 2: Processor mapping table
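As a rough illustration of Steps 1 and 2, the following minimal sketch (not the SDPARA code) builds the processor-to-package mapping table by reading the standard Linux sysfs files under /sys/devices/system/cpu:

```c
/* A minimal sketch of automated topology detection: build a processor -> package
 * mapping table from /sys/devices/system/cpu/cpu<N>/topology/physical_package_id. */
#include <stdio.h>

#define MAX_CPUS 1024

int main(void)
{
    int package_of[MAX_CPUS];
    int ncpu = 0;

    for (int cpu = 0; cpu < MAX_CPUS; cpu++) {
        char path[128];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/physical_package_id", cpu);
        FILE *f = fopen(path, "r");
        if (!f) break;                            /* no more logical processors */
        if (fscanf(f, "%d", &package_of[cpu]) == 1)
            ncpu++;
        fclose(f);
    }

    /* Print the processor mapping table (Step 2). */
    for (int cpu = 0; cpu < ncpu; cpu++)
        printf("Processor %d -> Package %d\n", cpu, package_of[cpu]);
    return 0;
}
```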

Fig. ELEMENTS time (seconds/iteration) versus number of nodes (64–2048) for six parameter sets: default, scatter, compact, default + interleaving, scatter + interleaving, compact + interleaving.

packages ← ∅
for thread_id = 0, 1, ..., p − 1 do
    packages ← packages ∪ { get_package_id(thread_id, scatter) }
end
set_mempolicy(packages, MPOL_INTERLEAVE)

Figure: Memory interleaving configuration for the memory allocation policy

#omp parallel {
    thread_id ← omp_get_thread_num()
    pinned(thread_id, scatter)
    ...
    unpinned()
}

Figure: Scatter-type CPU affinity configuration

Fig. Hybrid MPI+OpenMP computation of the SCM elements: the elements indexed by the data matrices F_i and F_j are distributed over MPI processes (Process01–Process04) along one index and over OpenMP threads with dynamic scheduling along the other (a loop-structure sketch is given below).
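The figure's loop structure could look roughly as follows; this is a minimal sketch with hypothetical names (compute_element, elements), not the SDPARA source, and it omits the symmetric fill-in and communication steps:

```c
/* A minimal sketch of the hybrid MPI+OpenMP ELEMENTS loop: each MPI rank owns a
 * cyclic subset of the SCM row indices i, and the elements of each owned row are
 * computed by OpenMP threads with dynamic scheduling. compute_element(i, j)
 * stands in for SDPARA's actual SCM element formula. */
#include <omp.h>

extern double compute_element(int i, int j);   /* hypothetical element kernel */

/* B_local holds this rank's rows of the m x m SCM, packed consecutively. */
void elements(double *B_local, int m, int rank, int nprocs)
{
    for (int i = rank; i < m; i += nprocs) {           /* cyclic row distribution (MPI)   */
        double *row = &B_local[(long)(i / nprocs) * m];
        #pragma omp parallel for schedule(dynamic)      /* uneven element costs -> dynamic */
        for (int j = i; j < m; j++)                     /* SCM is symmetric: j >= i only   */
            row[j] = compute_element(i, j);
    }
}
```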

Fig. CPU time for Electronic4.   Fig. Scalability for Electronic4 and Electronic5.

•  Memory interleaving is effective.
•  The parameter set "scatter and interleaving" is the fastest.
•  High efficiency (75.9% for Electronic4).
•  Even higher efficiency is expected when solving an SDP problem larger than Electronic4, because the efficiency for Electronic4 is higher than that for Electronic5.

Compact-type affinity binds the (n+1)-th OpenMP thread in a free thread context as close as possible to the thread context in which the n-th OpenMP thread was bound. Scatter-type affinity distributes OpenMP threads as evenly as possible across the entire system.

Combinatorial optimization

Performance of CHOLESKY on TSUBAME 2.0: quadratic assignment problems (QAP)

Fig. Speed of CHOLESKY (TFlops) versus number of nodes for QAP6–QAP10. Measured values include 232.9, 329.1, 437.8, 306.2, 469.9, 718.8, 512.9, 825.6, 964.4, and 1018.5 TFlops, peaking at 1018.5 TFlops for QAP10 on 1,360 nodes.

Basic Algorithm (a CPU-only sketch of the blocked factorization is given below)
For k = 0, 1, 2, ..., m/nb − 1:
1. Diagonal block factorization
2. Panel factorization → compute L by GPU DTRSM
3. Broadcast L, transpose L (to L'), and broadcast L'
4. Update B' = B' − L × L' with fast GPU DGEMM

Design Strategy
•  Our target is large problems with m > 2 million → GPU memory is too small; matrix data are usually placed in host memory.
•  The block size nb should be sufficiently large to mitigate CPU-GPU PCIe communication.
•  We still suffer from heavy PCIe communication → nb = 1024 → overlap GPU computation, PCIe communication, and MPI communication in Steps 2, 3, and 4.
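For reference, a CPU-only, single-node sketch of the blocked right-looking Cholesky factorization (Steps 1, 2, and 4 above) is shown below; it uses standard CBLAS/LAPACKE calls and omits the 2D block-cyclic distribution, GPU offload, and the communication overlap that SDPARA relies on:

```c
/* A minimal, single-node, CPU-only sketch of blocked right-looking Cholesky.
 * B is symmetric positive definite, stored row-major; only its lower triangle
 * is meaningful (the trailing DGEMM also touches the unused upper triangle,
 * which keeps the sketch short without affecting the result). */
#include <cblas.h>
#include <lapacke.h>

void blocked_cholesky(double *B, int m, int nb)
{
    for (int k = 0; k < m; k += nb) {
        int kb = (m - k < nb) ? (m - k) : nb;
        double *Bkk = &B[(long)k * m + k];

        /* Step 1: factorize the diagonal block, B[k][k] = L[k][k] * L[k][k]^T. */
        LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'L', kb, Bkk, m);

        if (k + kb < m) {
            int rest = m - (k + kb);
            double *panel = &B[(long)(k + kb) * m + k];   /* blocks below the diagonal */

            /* Step 2: panel factorization, L[i][k] = B[i][k] * L[k][k]^{-T} (DTRSM). */
            cblas_dtrsm(CblasRowMajor, CblasRight, CblasLower, CblasTrans, CblasNonUnit,
                        rest, kb, 1.0, Bkk, m, panel, m);

            /* Step 4: trailing update B' = B' - L * L' (DGEMM). */
            double *trail = &B[(long)(k + kb) * m + (k + kb)];
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                        rest, rest, kb, -1.0, panel, m, panel, m, 1.0, trail, m);
        }
    }
}
```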

Fig. Overlapping timelines for the producer and consumers of a panel L: CPU, GPU (DTRSM on the producer, DGEMM updates of the local blocks B0–B2 on the consumers), MPI communication, and PCIe communication are pipelined over the panel chunks L0–L3.

CPU/GPU Affinity
•  1 MPI process per GPU → 3 processes per node
•  CPU/memory affinity is determined in a GPU-centric way (a binding sketch is given below).
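A minimal sketch of this GPU-centric binding, assuming MPI-3 and the CUDA runtime (the helper bind_process_to_gpu is illustrative, not SDPARA's code):

```c
/* A minimal sketch of GPU-centric binding: one MPI process per GPU (3 per node),
 * each selecting the GPU that matches its node-local rank. */
#include <mpi.h>
#include <cuda_runtime.h>

void bind_process_to_gpu(void)
{
    /* Node-local rank via an MPI-3 shared-memory sub-communicator. */
    MPI_Comm node_comm;
    int local_rank;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);

    /* Select the GPU with the same index as the local rank (3 GPUs per node). */
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    if (num_gpus > 0)
        cudaSetDevice(local_rank % num_gpus);

    /* CPU and memory affinity would then be set toward the NUMA node closest to
     * this GPU, e.g. with sched_setaffinity/set_mempolicy as in the ELEMENTS sketch. */
    MPI_Comm_free(&node_comm);
}
```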

1.018 PFLOPS (DP) with 4,080 GPUs!!

    Katsuki FUJISAWA 1, 4

    1 Chuo University

    2 Tokyo Institute of Technology

    For inquiries please email to [email protected]

    3 Kyushu University

    Advanced Computing & Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers

    Software Technology that Deals with Deeper Memory Hierarchy in Post-petascale Era

Research Area: Technologies for Post-Peta Scale High Performance Computing

Primal and dual SDP (annotations from the original formulation figure): m is the number of constraints of the primal problem, and n is the matrix size.
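The formulation itself did not survive the text extraction; for reference, a standard equality-form primal-dual SDP pair consistent with these annotations (this exact form is an assumption, not taken from the poster) is:

```latex
% Standard primal-dual SDP pair (assumed standard form; the poster's exact
% max/min and sign conventions are not recoverable from the extracted text).
\begin{align*}
\text{(Primal)}\quad & \min_{X}\; \langle C, X\rangle
  && \text{s.t. } \langle A_k, X\rangle = b_k \;\; (k = 1,\dots,m),\quad X \succeq 0,\; X \in \mathbb{S}^n,\\
\text{(Dual)}\quad  & \max_{y,\,S}\; b^{\top} y
  && \text{s.t. } \sum_{k=1}^{m} y_k A_k + S = C,\quad S \succeq 0,\; S \in \mathbb{S}^n .
\end{align*}
```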

    Hitoshi SATO 2, 4

    Toshio ENDO 2, 4

    Naoki MATSUZAWA 2

    Yuichiro YASUI 1, 4

    Hayato WAKI 3, 4

    4 JST CREST