
  • http://www.gsic.titech.ac.jp/sc15

    Project introduction
    Supercomputers make large-scale simulations possible. However, the growing number of nodes and components leads to a very high failure frequency: on exa-scale supercomputers the MTBF is expected to be no more than tens of minutes, so compute nodes can barely make useful progress between failures. We are seeking a solution to this problem.

    FMI: Fault Tolerant Messaging Interface
    FMI is an MPI-like survivable messaging interface that enables scalable failure detection, dynamic node allocation, and fast, transparent recovery.

    [Figure: FMI's two views. User's view: virtual FMI ranks 0-7. FMI's view: processes P0-P9 running on Node 0-Node 4, with additional processes (P8, P9 on Node 4) available for dynamic node allocation; each process's checkpoint is split into chunks (e.g. P0-0, P0-1, P0-2) together with parity blocks (Parity 0 ... Parity 7), so that lost checkpoint data can be reconstructed after a failure. Caption: Scalable failure detection and dynamic node allocation.]
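    The FMI's-view diagram above shows each process's checkpoint split into chunks and protected by parity blocks. As a rough illustration of that idea only, here is a minimal sketch of XOR parity encoding over a group of checkpoint chunks: the parity block lets any single lost chunk be rebuilt from the surviving ones. The group size, chunk size, and function names are assumptions for illustration; this is not FMI's actual encoding code.

    #include <stdio.h>
    #include <string.h>

    #define GROUP  4      /* checkpoint chunks per parity group (assumed) */
    #define CHUNK  8      /* bytes per chunk, kept tiny for the example   */

    /* XOR-encode a parity block over GROUP chunks. */
    static void encode_parity(unsigned char chunks[GROUP][CHUNK],
                              unsigned char parity[CHUNK])
    {
        memset(parity, 0, CHUNK);
        for (int p = 0; p < GROUP; p++)
            for (int b = 0; b < CHUNK; b++)
                parity[b] ^= chunks[p][b];
    }

    /* Rebuild one lost chunk from the surviving chunks and the parity block. */
    static void rebuild_chunk(unsigned char chunks[GROUP][CHUNK],
                              const unsigned char parity[CHUNK], int lost)
    {
        memcpy(chunks[lost], parity, CHUNK);
        for (int p = 0; p < GROUP; p++)
            if (p != lost)
                for (int b = 0; b < CHUNK; b++)
                    chunks[lost][b] ^= chunks[p][b];
    }

    int main(void)
    {
        unsigned char chunks[GROUP][CHUNK], parity[CHUNK], saved[CHUNK];

        for (int p = 0; p < GROUP; p++)
            for (int b = 0; b < CHUNK; b++)
                chunks[p][b] = (unsigned char)(p * 16 + b);  /* fake checkpoint data */

        encode_parity(chunks, parity);

        memcpy(saved, chunks[2], CHUNK);      /* remember chunk 2                */
        memset(chunks[2], 0, CHUNK);          /* simulate losing one chunk       */
        rebuild_chunk(chunks, parity, 2);     /* recover it from the parity data */

        printf("chunk 2 recovered: %s\n",
               memcmp(saved, chunks[2], CHUNK) == 0 ? "yes" : "no");
        return 0;
    }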

    MPI-like interface; fast checkpoint/restart
    [Figure: Himeno benchmark. Performance (GFlops, up to about 2,500) vs. number of processes (12 processes/node, up to about 1,500); series: MPI, FMI, MPI + C, FMI + C, FMI + C/R.]

    int main(int argc, char *argv[]) {
        int rank, n;
        FMI_Init(&argc, &argv);
        FMI_Comm_rank(FMI_COMM_WORLD, &rank);
        /* Application's initialization */
        while ((n = FMI_Loop(…)) < numloop) {
            /* Application's program */
        }
        /* Application's finalization */
        FMI_Finalize();
    }

    Lossy Compression for Checkpoint/Restart
    To reduce checkpoint time, lossy compression is applied to the checkpoint data, shrinking the checkpoint size.
    Target: 1D/2D/3D mesh data without pointers
    Approach (sketches of steps 1 and 2 follow the figures below):
    1. Wavelet transformation
    2. Quantization
    3. Encoding
    4. gzip

    [Figure: Compression pipeline. (1) Wavelet transformation splits the original checkpoint data (a floating-point array) into low- and high-frequency band arrays; (2) quantization and (3) encoding turn the high-frequency band arrays into compact arrays plus bitmap arrays; (4) formatting packs the low- and high-frequency band arrays together; (5) gzip is applied to produce the compressed data.]
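    To make step 1 concrete, here is a minimal sketch of a one-level Haar-style wavelet transform on a 1D float array, the kind of low/high-frequency split the pipeline above starts with. The function name and the choice of the Haar filter are assumptions for illustration; this is not the project's actual implementation.

    #include <stdio.h>

    /* One level of a Haar-style wavelet transform on a 1D float array of
     * even length n: low[] receives pairwise averages (low-frequency band),
     * high[] receives pairwise half-differences (high-frequency band).
     * Hypothetical helper for illustration only. */
    static void haar_forward(const float *in, int n, float *low, float *high)
    {
        for (int i = 0; i < n / 2; i++) {
            low[i]  = (in[2 * i] + in[2 * i + 1]) * 0.5f;
            high[i] = (in[2 * i] - in[2 * i + 1]) * 0.5f;
        }
    }

    int main(void)
    {
        float data[8] = {1.0f, 1.1f, 0.9f, 1.0f, 3.5f, 3.6f, 1.0f, 1.1f};
        float low[4], high[4];

        haar_forward(data, 8, low, high);
        for (int i = 0; i < 4; i++)
            printf("low[%d] = %.3f  high[%d] = %.3f\n", i, low[i], i, high[i]);
        /* Smooth regions yield near-zero high-frequency values, which is why
         * quantizing and gzip-compressing the high band works well. */
        return 0;
    }

    For 2D/3D mesh data the same pairwise split would be applied along each dimension in turn.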

    [Figure: Simple vs. proposed quantization of the high-frequency band (example with n = 4 quantization levels; a parameter d = 10 is also shown). A histogram of the high-frequency band values is built; in the proposed quantization, elements falling in the "spike" parts of the distribution (marked red in the figure) are flagged in a bitmap (e.g. 0 1 1 1 1 1 1 0) and treated separately from the rest of the band, whereas simple quantization applies one uniform set of levels to all values.]
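    To illustrate step 2, below is a minimal sketch of simple uniform quantization of a high-frequency band into n levels, plus a bitmap that flags values outside a central range, loosely mimicking the spike/non-spike split in the figure above. The function name, the fixed value range, the spike threshold, and the bitmap convention are all assumptions for illustration; the poster's proposed quantization builds the split from a histogram of the actual data.

    #include <math.h>
    #include <stdio.h>
    #include <stdint.h>

    /* Uniformly quantize a value from [lo, hi] into one of n levels.
     * Hypothetical helper for illustration only. */
    static uint8_t quantize(float v, float lo, float hi, int n)
    {
        float t = (v - lo) / (hi - lo);   /* normalize to [0, 1]        */
        int level = (int)(t * (float)n);  /* scale to a bin index       */
        if (level < 0) level = 0;         /* clamp into the valid range */
        if (level > n - 1) level = n - 1;
        return (uint8_t)level;
    }

    int main(void)
    {
        /* A toy high-frequency band: mostly near zero, with a few large "spikes". */
        float band[8] = {0.01f, -0.02f, 3.8f, 0.00f, -0.01f, 0.03f, -3.5f, 0.02f};
        int n = 4;                 /* number of quantization levels (division number)   */
        uint8_t level[8];
        unsigned char bitmap[8];   /* assumed convention: 1 = ordinary value, 0 = spike */

        for (int i = 0; i < 8; i++) {
            level[i]  = quantize(band[i], -4.0f, 4.0f, n);
            bitmap[i] = (fabsf(band[i]) < 1.0f) ? 1 : 0;   /* assumed spike test */
            printf("%6.2f -> level %d, bitmap %d\n", band[i], level[i], bitmap[i]);
        }
        return 0;
    }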

    [Figure: Compression rate [%] (0-100) of the checkpoint data for: original, gzip, simple quantization (n = 128), and proposed quantization (n = 128).]

    [Figures: Average relative error [%] vs. division number (1, 2, 4, ..., 128) for simple and proposed quantization, on the pressure array and on the temperature array (y-axis ranges roughly 0-0.007% and 0-0.9% in the two panels).]

    Target application: NICAM [M. Satoh, 2008], a climate simulation; the checkpointed data are pressure, temperature, and velocity, stored as 3D arrays (1156 * 82 * 2) over 720 steps.

    Machine spec: CPU: Intel Core i7-3930K (6 cores, 3.20 GHz); memory: 16 GB

    This work was supported by JSPS KAKENHI Grant Number 23220003.

    Checkpoint size is greatly reduced with low error in this particular setting.

    Highly Optimized Stencils Larger than GPU Memory

    HHRT: System Software for GPU Memory Swap

    Integration with Real Simulation Application

    For extremely large stencil simulations, we implemented the temporal blocking (TB) technique and clever optimizations on GPUs [1][2] (a minimal sketch of basic TB follows the figure caption below):
    ● Eliminating redundant computation
    ● Reducing the memory footprint of the TB algorithm

    7-point stencil on K20X GPU
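    To make the idea concrete, here is a minimal CPU-side sketch of basic temporal blocking for a 1D 3-point Jacobi stencil: each spatial block is loaded with a halo of width BT and advanced BT time steps before moving on, so the block's working set stays small across steps. The domain size, block sizes, and fixed-boundary treatment are assumptions for illustration; the project's GPU kernels are far more involved and, as listed above, also eliminate the redundant halo computation that this basic form still performs.

    #include <stdio.h>
    #include <string.h>

    #define N   64   /* interior cells 1..N; cells 0 and N+1 are fixed boundaries */
    #define BS  16   /* spatial block size                                        */
    #define BT  4    /* time steps fused per block (temporal blocking depth)      */

    /* One Jacobi-style 3-point stencil sweep: out[1..len-2] from in[0..len-1]. */
    static void step(const double *in, double *out, int len)
    {
        for (int i = 1; i < len - 1; i++)
            out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
    }

    int main(void)
    {
        double u[N + 2], v[N + 2] = {0}, ref[N + 2], tmp[N + 2];
        for (int i = 0; i < N + 2; i++)
            u[i] = (i % 8 == 0) ? 1.0 : 0.0;

        /* Temporal blocking: load each spatial block plus a halo of width BT and
         * advance it BT time steps before moving on.  The region of valid cells
         * shrinks by one per step, so after BT steps exactly the BS core cells
         * are up to date and are written back to the new array v. */
        for (int b = 1; b <= N; b += BS) {
            double in[BS + 2 * BT], out[BS + 2 * BT];
            int lo = b - BT, len = BS + 2 * BT;
            for (int i = 0; i < len; i++) {
                int g = lo + i;                                  /* global index */
                in[i] = (g >= 0 && g <= N + 1) ? u[g] : 0.0;     /* pad outside  */
            }
            for (int t = 0; t < BT; t++) {
                step(in, out, len);
                memcpy(in + 1, out + 1, (len - 2) * sizeof(double));
                /* keep the global boundary cells fixed inside the block, too */
                if (-lo >= 0 && -lo < len)               in[-lo] = u[0];
                if (N + 1 - lo >= 0 && N + 1 - lo < len) in[N + 1 - lo] = u[N + 1];
            }
            for (int i = 0; i < BS; i++)
                v[b + i] = in[BT + i];                           /* core cells   */
        }

        /* Non-blocked reference: BT whole-domain sweeps with fixed boundaries. */
        memcpy(ref, u, sizeof(ref));
        for (int t = 0; t < BT; t++) {
            memcpy(tmp, ref, sizeof(tmp));
            step(ref, tmp, N + 2);
            memcpy(ref, tmp, sizeof(ref));
        }

        double maxdiff = 0.0;
        for (int i = 1; i <= N; i++) {
            double d = v[i] - ref[i];
            if (d < 0) d = -d;
            if (d > maxdiff) maxdiff = d;
        }
        printf("max |blocked - reference| = %g\n", maxdiff);   /* expected: 0 */
        return 0;
    }

    Each cell is thus loaded from slow memory once per BT time steps rather than once per step, which is the bandwidth saving TB targets; the cost is the 2*BT-wide halo that neighbouring blocks recompute, the redundancy the optimized GPU version eliminates.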

    Execution Model
    For easier programming, we implemented system software named HHRT (hybrid hierarchical runtime) [3].
    ● HHRT supports user programs written in MPI and CUDA with little modification
    ● Oversubscription-based execution model
    ● HHRT implicitly supports memory swapping between GPU memory and the host

    We integrated our techniques with the city airflow simulation.

    The original MPI+CUDA code was developed by Naoyuki Onodera, Tokyo Tech. We integrated TB into it and executed it on HHRT.

    Airflow performance on a K20X GPU

    [1] G. Jin, T. Endo, S. Matsuoka. A Parallel Optimization Method for Stencil Computation on the Domain that is Bigger than Memory Capacity of GPUs. IEEE Cluster 2013.
    [2] G. Jin, J. Lin, T. Endo. Efficient Utilization of Memory Hierarchy to Enable the Computation on Bigger Domains for Stencil Computation in CPU-GPU Based Systems. IEEE ICHPA 2014.
    [3] T. Endo, G. Jin. Software Technologies Coping with Memory Hierarchy of GPGPU Clusters for Stencil Computations. IEEE Cluster 2014.

    PI: Toshio Endo ([email protected]), supported by JST-CREST

    8x larger simulation than device memory!
    [Figure: Execution model. Without HHRT (typically), one process per node keeps its data in device memory, using MPI communication and explicit cudaMemcpy between host and device memory. With HHRT, many processes share a node's GPU (oversubscription); each process's data can spill from device memory into host memory, and HHRT swaps it between the two implicitly.]

    Overview of Project
    On exa-scale supercomputers, the "memory wall" problem will become even more severe, preventing the realization of extremely Fast&Big simulations. This project promotes research on this problem through a co-design approach spanning application algorithms, system software, and architecture.

    Target Architecture
    A deeper memory hierarchy consisting of heterogeneous memory devices:
    ● Hybrid Memory Cube (HMC): DRAM chips are stacked with TSV technology. It will have a bandwidth advantage over DDR, but smaller capacity.
    ● NAND Flash: SSDs are already commodity. Newer products, such as IO-drive, have O(GB/s) bandwidth.
    ● Next-gen non-volatile RAM (NVRAM): Several kinds of NVRAM, such as STT-MRAM, ReRAM, and FeRAM, will be available in a few years.

    Target: Realizing extremely Fast&Big simulations of O(100 PB/s) & O(100 PB) in the exa-scale era.
    [Figure: Co-design stack. Communication/bandwidth-reducing algorithms; system software for memory hierarchy management; HPC architecture with hybrid memory devices (HMC, HBM, O(GB/s) Flash, next-gen NVM).]

    Fault Tolerant Infrastructure for Billion-Way Parallelization

    Dealing with Deeper Memory Hierarchy

    Extreme Resilience and Deeper Memory Hierarchy