
  • http://www.gsic.titech.ac.jp/sc15

    Project introduction
    Supercomputers make large-scale simulations possible. However, the growing number of nodes and components leads to a very high failure frequency: on exa-scale supercomputers the MTBF is expected to be no more than tens of minutes, so compute nodes can barely make useful progress between failures. We are seeking a solution to this problem.

    FMI: Fault Tolerant Messaging Interface
    FMI is an MPI-like survivable messaging interface that enables scalable failure detection, dynamic node allocation, and fast, transparent recovery.

    [Figure: FMI's two views. User's view: virtual FMI ranks 0-7. FMI's view: processes P0-P9 running on Node 0-Node 4, with additional processes (P8, P9 on Node 4) available for dynamic node allocation; each process's checkpoint is split into chunks (e.g. P0-0, P0-1, P0-2) together with parity blocks (Parity 0 ... Parity 7), so that lost checkpoint data can be reconstructed after a failure. Caption: Scalable failure detection and dynamic node allocation.]
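    The FMI's-view diagram above shows each process's checkpoint split into chunks and protected by parity blocks. As a rough illustration of that idea only, here is a minimal sketch of XOR parity encoding over a group of checkpoint chunks: the parity block lets any single lost chunk be rebuilt from the surviving ones. The group size, chunk size, and function names are assumptions for illustration; this is not FMI's actual encoding code.

    #include <stdio.h>
    #include <string.h>

    #define GROUP  4      /* checkpoint chunks per parity group (assumed) */
    #define CHUNK  8      /* bytes per chunk, kept tiny for the example   */

    /* XOR-encode a parity block over GROUP chunks. */
    static void encode_parity(unsigned char chunks[GROUP][CHUNK],
                              unsigned char parity[CHUNK])
    {
        memset(parity, 0, CHUNK);
        for (int p = 0; p < GROUP; p++)
            for (int b = 0; b < CHUNK; b++)
                parity[b] ^= chunks[p][b];
    }

    /* Rebuild one lost chunk from the surviving chunks and the parity block. */
    static void rebuild_chunk(unsigned char chunks[GROUP][CHUNK],
                              const unsigned char parity[CHUNK], int lost)
    {
        memcpy(chunks[lost], parity, CHUNK);
        for (int p = 0; p < GROUP; p++)
            if (p != lost)
                for (int b = 0; b < CHUNK; b++)
                    chunks[lost][b] ^= chunks[p][b];
    }

    int main(void)
    {
        unsigned char chunks[GROUP][CHUNK], parity[CHUNK], saved[CHUNK];

        for (int p = 0; p < GROUP; p++)
            for (int b = 0; b < CHUNK; b++)
                chunks[p][b] = (unsigned char)(p * 16 + b);  /* fake checkpoint data */

        encode_parity(chunks, parity);

        memcpy(saved, chunks[2], CHUNK);      /* remember chunk 2                */
        memset(chunks[2], 0, CHUNK);          /* simulate losing one chunk       */
        rebuild_chunk(chunks, parity, 2);     /* recover it from the parity data */

        printf("chunk 2 recovered: %s\n",
               memcmp(saved, chunks[2], CHUNK) == 0 ? "yes" : "no");
        return 0;
    }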

    MPI-like interface; fast checkpoint/restart
    [Figure: Himeno benchmark. Performance (GFlops, up to about 2,500) vs. number of processes (12 processes/node, up to about 1,500); series: MPI, FMI, MPI + C, FMI + C, FMI + C/R.]

    int main(int argc, char *argv[]) {
        int rank, n;
        FMI_Init(&argc, &argv);
        FMI_Comm_rank(FMI_COMM_WORLD, &rank);
        /* Application's initialization */
        while ((n = FMI_Loop(…)) < numloop) {
            /* Application's program */
        }
        /* Application's finalization */
        FMI_Finalize();
    }

    Lossy Compression for Checkpoint/Restart
    To reduce checkpoint time, lossy compression is applied to the checkpoint data, shrinking the checkpoint size.
    Target: 1D/2D/3D mesh data without pointers
    Approach (sketches of steps 1 and 2 follow the figures below):
    1. Wavelet transformation
    2. Quantization
    3. Encoding
    4. gzip

    [Figure: Compression pipeline. (1) Wavelet transformation splits the original checkpoint data (a floating-point array) into low- and high-frequency band arrays; (2) quantization and (3) encoding turn the high-frequency band arrays into compact arrays plus bitmap arrays; (4) formatting packs the low- and high-frequency band arrays together; (5) gzip is applied to produce the compressed data.]
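    To make step 1 concrete, here is a minimal sketch of a one-level Haar-style wavelet transform on a 1D float array, the kind of low/high-frequency split the pipeline above starts with. The function name and the choice of the Haar filter are assumptions for illustration; this is not the project's actual implementation.

    #include <stdio.h>

    /* One level of a Haar-style wavelet transform on a 1D float array of
     * even length n: low[] receives pairwise averages (low-frequency band),
     * high[] receives pairwise half-differences (high-frequency band).
     * Hypothetical helper for illustration only. */
    static void haar_forward(const float *in, int n, float *low, float *high)
    {
        for (int i = 0; i < n / 2; i++) {
            low[i]  = (in[2 * i] + in[2 * i + 1]) * 0.5f;
            high[i] = (in[2 * i] - in[2 * i + 1]) * 0.5f;
        }
    }

    int main(void)
    {
        float data[8] = {1.0f, 1.1f, 0.9f, 1.0f, 3.5f, 3.6f, 1.0f, 1.1f};
        float low[4], high[4];

        haar_forward(data, 8, low, high);
        for (int i = 0; i < 4; i++)
            printf("low[%d] = %.3f  high[%d] = %.3f\n", i, low[i], i, high[i]);
        /* Smooth regions yield near-zero high-frequency values, which is why
         * quantizing and gzip-compressing the high band works well. */
        return 0;
    }

    For 2D/3D mesh data the same pairwise split would be applied along each dimension in turn.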

    [Figure: Simple vs. proposed quantization of the high-frequency band (example with n = 4 quantization levels; a parameter d = 10 is also shown). A histogram of the high-frequency band values is built; in the proposed quantization, elements falling in the "spike" parts of the distribution (marked red in the figure) are flagged in a bitmap (e.g. 0 1 1 1 1 1 1 0) and treated separately from the rest of the band, whereas simple quantization applies one uniform set of levels to all values.]
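    To illustrate step 2, below is a minimal sketch of simple uniform quantization of a high-frequency band into n levels, plus a bitmap that flags values outside a central range, loosely mimicking the spike/non-spike split in the figure above. The function name, the fixed value range, the spike threshold, and the bitmap convention are all assumptions for illustration; the poster's proposed quantization builds the split from a histogram of the actual data.

    #include <math.h>
    #include <stdio.h>
    #include <stdint.h>

    /* Uniformly quantize a value from [lo, hi] into one of n levels.
     * Hypothetical helper for illustration only. */
    static uint8_t quantize(float v, float lo, float hi, int n)
    {
        float t = (v - lo) / (hi - lo);   /* normalize to [0, 1]        */
        int level = (int)(t * (float)n);  /* scale to a bin index       */
        if (level < 0) level = 0;         /* clamp into the valid range */
        if (level > n - 1) level = n - 1;
        return (uint8_t)level;
    }

    int main(void)
    {
        /* A toy high-frequency band: mostly near zero, with a few large "spikes". */
        float band[8] = {0.01f, -0.02f, 3.8f, 0.00f, -0.01f, 0.03f, -3.5f, 0.02f};
        int n = 4;                 /* number of quantization levels (division number)   */
        uint8_t level[8];
        unsigned char bitmap[8];   /* assumed convention: 1 = ordinary value, 0 = spike */

        for (int i = 0; i < 8; i++) {
            level[i]  = quantize(band[i], -4.0f, 4.0f, n);
            bitmap[i] = (fabsf(band[i]) < 1.0f) ? 1 : 0;   /* assumed spike test */
            printf("%6.2f -> level %d, bitmap %d\n", band[i], level[i], bitmap[i]);
        }
        return 0;
    }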

    [Figure: Compression rate [%] (0-100) of the checkpoint data for: original, gzip, simple quantization (n = 128), and proposed quantization (n = 128).]

    [Figures: Average relative error [%] vs. division number (1, 2, 4, ..., 128) for simple and proposed quantization, on the pressure array and on the temperature array (y-axis ranges roughly 0-0.007% and 0-0.9% in the two panels).]

    Target application: NICAM [M. Satoh, 2008], a climate simulation; the checkpointed data are pressure, temperature, and velocity, stored as 3D arrays (1156 * 82 * 2) over 720 steps.

    Machine spec: CPU: Intel Core i7-3930K (6 cores, 3.20 GHz); memory: 16 GB

    This work was supported by JSPS KAKENHI Grant Number 23220003.

    Checkpoint size is greatly reduced with low error in this particular setting.

    Highly Optimized Stencils Larger than GPU Memory

    HHRT: System Software for GPU Memory Swap

    Integration with Real Simulation Application

    For extremely large stencil simulations, we implemented the temporal blocking (TB) technique and clever optimizations on GPUs [1][2] (a minimal sketch of basic TB follows the figure caption below):
    ● Eliminating redundant computation
    ● Reducing the memory footprint of the TB algorithm

    7-point stencil on K20X GPU
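    To make the idea concrete, here is a minimal CPU-side sketch of basic temporal blocking for a 1D 3-point Jacobi stencil: each spatial block is loaded with a halo of width BT and advanced BT time steps before moving on, so the block's working set stays small across steps. The domain size, block sizes, and fixed-boundary treatment are assumptions for illustration; the project's GPU kernels are far more involved and, as listed above, also eliminate the redundant halo computation that this basic form still performs.

    #include <stdio.h>
    #include <string.h>

    #define N   64   /* interior cells 1..N; cells 0 and N+1 are fixed boundaries */
    #define BS  16   /* spatial block size                                        */
    #define BT  4    /* time steps fused per block (temporal blocking depth)      */

    /* One Jacobi-style 3-point stencil sweep: out[1..len-2] from in[0..len-1]. */
    static void step(const double *in, double *out, int len)
    {
        for (int i = 1; i < len - 1; i++)
            out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
    }

    int main(void)
    {
        double u[N + 2], v[N + 2] = {0}, ref[N + 2], tmp[N + 2];
        for (int i = 0; i < N + 2; i++)
            u[i] = (i % 8 == 0) ? 1.0 : 0.0;

        /* Temporal blocking: load each spatial block plus a halo of width BT and
         * advance it BT time steps before moving on.  The region of valid cells
         * shrinks by one per step, so after BT steps exactly the BS core cells
         * are up to date and are written back to the new array v. */
        for (int b = 1; b <= N; b += BS) {
            double in[BS + 2 * BT], out[BS + 2 * BT];
            int lo = b - BT, len = BS + 2 * BT;
            for (int i = 0; i < len; i++) {
                int g = lo + i;                                  /* global index */
                in[i] = (g >= 0 && g <= N + 1) ? u[g] : 0.0;     /* pad outside  */
            }
            for (int t = 0; t < BT; t++) {
                step(in, out, len);
                memcpy(in + 1, out + 1, (len - 2) * sizeof(double));
                /* keep the global boundary cells fixed inside the block, too */
                if (-lo >= 0 && -lo < len)               in[-lo] = u[0];
                if (N + 1 - lo >= 0 && N + 1 - lo < len) in[N + 1 - lo] = u[N + 1];
            }
            for (int i = 0; i < BS; i++)
                v[b + i] = in[BT + i];                           /* core cells   */
        }

        /* Non-blocked reference: BT whole-domain sweeps with fixed boundaries. */
        memcpy(ref, u, sizeof(ref));
        for (int t = 0; t < BT; t++) {
            memcpy(tmp, ref, sizeof(tmp));
            step(ref, tmp, N + 2);
            memcpy(ref, tmp, sizeof(ref));
        }

        double maxdiff = 0.0;
        for (int i = 1; i <= N; i++) {
            double d = v[i] - ref[i];
            if (d < 0) d = -d;
            if (d > maxdiff) maxdiff = d;
        }
        printf("max |blocked - reference| = %g\n", maxdiff);   /* expected: 0 */
        return 0;
    }

    Each cell is thus loaded from slow memory once per BT time steps rather than once per step, which is the bandwidth saving TB targets; the cost is the 2*BT-wide halo that neighbouring blocks recompute, the redundancy the optimized GPU version eliminates.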

    Execution Model
    For easier programming, we implemented system software named HHRT (hybrid hierarchical runtime) [3].
    ● HHRT supports user programs written in MPI and CUDA with little modification
    ● Oversubscription-based execution model
    ● HHRT implicitly supports memory swapping between GPU memory and the host

    We integrated our techniques with the city airflow simulation.

    The original MPI+CUDA code was developed by Naoyuki Onodera, Tokyo Tech. We integrated TB into it and executed it on HHRT.

    Airflow performance on a K20X GPU

    [1] G. Jin, T. Endo, S. Matsuoka. A Parallel Optimization Method for Stencil Computation on the Domain that is Bigger than Memory Capacity of GPUs. IEEE Cluster 2013.
    [2] G. Jin, J. Lin, T. Endo. Efficient Utilization of Memory Hierarchy to Enable the Computation on Bigger Domains for Stencil Computation in CPU-GPU Based Systems. IEEE ICHPA 2014.
    [3] T. Endo, G. Jin. Software Technologies Coping with Memory Hierarchy of GPGPU Clusters for Stencil Computations. IEEE Cluster 2014.

    PI: Toshio Endo ([email protected]), supported by JST-CREST

    8x larger simulation than device memory!
    [Figure: Execution model. Without HHRT (typically), one process per node keeps its data in device memory, using MPI communication and explicit cudaMemcpy between host and device memory. With HHRT, many processes share a node's GPU (oversubscription); each process's data can spill from device memory into host memory, and HHRT swaps it between the two implicitly.]

    Overview of Project
    On exa-scale supercomputers, the "memory wall" problem will become even more severe, preventing the realization of extremely Fast&Big simulations. This project promotes research on this problem through a co-design approach spanning application algorithms, system software, and architecture.

    Target Architecture
    A deeper memory hierarchy consisting of heterogeneous memory devices:
    ● Hybrid Memory Cube (HMC): DRAM chips are stacked with TSV technology. It will have a bandwidth advantage over DDR, but smaller capacity.
    ● NAND Flash: SSDs are already commodity. Newer products, such as IO-drive, have O(GB/s) bandwidth.
    ● Next-gen non-volatile RAM (NVRAM): Several kinds of NVRAM, such as STT-MRAM, ReRAM, and FeRAM, will be available in a few years.

    Target: Realizing extremely Fast&Big simulations of O(100 PB/s) & O(100 PB) in the exa-scale era.
    [Figure: Co-design stack. Communication/bandwidth-reducing algorithms; system software for memory hierarchy management; HPC architecture with hybrid memory devices (HMC, HBM, O(GB/s) Flash, next-gen NVM).]

    Fault Tolerant Infrastructure for Billion-Way Parallelization

    Dealing with Deeper Memory Hierarchy

    Extreme Resilience and Deeper Memory Hierarchy