Parallel Algorithm Design on Sunway TaihuLight — Wei Xue, Department of Computer Science and Technology, Tsinghua University / National Supercomputing Center in Wuxi — May 30, 2018, Tsinghua University



  • Parallel Algorithm Design on Sunway TaihuLight

    Wei Xue

    Department of Computer Science and Technology, Tsinghua University; National Supercomputing Center in Wuxi

    May 30, 2018, Tsinghua University

  • Outline

    Sunway TaihuLight

    Programming and performance-analysis tools on Sunway TaihuLight

    Parallel algorithm design on Sunway TaihuLight

  • Comparison of top high-end computing systems (latest TOP500 list)

    System              Peak (PFLOPS)  Sustained (PFLOPS)  Efficiency (MFLOPS/W)  Total cores   Memory (PB)  Processor architecture
    Sunway TaihuLight      125.436          93.015              6,051.131          10,649,600      1.3       heterogeneous many-core
    Tianhe-2 (KNC)          54.90           33.86               1,901.54            3,120,000      1.0       heterogeneous many-core
    Piz Daint (P100)        25.33           19.59               8,622.36              361,760      0.34      heterogeneous many-core
    Gyoukou                 28.19           19.14              14,174.67           19,860,000      0.58      heterogeneous many-core
    Titan (K20x)            27.11           17.59               2,142.77              560,640      0.7       heterogeneous many-core
    Sequoia (BQC)           20.13           17.17               2,176.58            1,572,864      1.6       many-core
    Trinity                 43.90           14.14               3,677.76              979,968      2.07      many-core
    Cori (KNL)              27.88           14.01               3,557.93              622,336      0.9       many-core

    Many-core has become the trend in high-performance computing.

  • Comparison of current high-end systems (2017.9)

    System              TaihuLight  Tianhe-2  Piz Daint  Titan  Sequoia
    Rank of Top500           1          2         3         4       5
    Rank of Green500        17        147         6       109     100
    Rank of Graph500         2          8         /         /       3
    Rank of HPCG             3          2         4         8       7

    The K computer holds Rank 1 on both Graph500 and HPCG.

  • The domestic many-core processor SW26010

    Per CPU:
    Peak performance    3.06 TFlops (DP)
    Memory              32 GB
    Memory bandwidth    136.5 GB/s
    # CPUs              1
    # cores             260

    For comparison:
    Intel KNL (2016): DP ~3 TF; memory BW 400+ GB/s (MCDRAM), 90+ GB/s (DDR)
    NVIDIA Pascal (2016): DP 5+ TF; NVLink 80 GB/s; PCIe Gen3 16 GB/s

  • SW26010: Sunway 260-Core Processor

    [Chip diagram: four core groups (CG0-CG3) connected by a network on chip (NoC) and a data-transfer network; each core group contains one MPE, an 8x8 CPE mesh, a protocol processing unit (PPU), and a memory controller (iMC) attached to its own memory. Each CPE in the mesh has a computing core, registers, and an LDM, and is connected to the row/column communication buses, the control network, and a transfer agent (TA); the memory hierarchy spans the memory, LDM, register, and computing levels.]

  • The Sunway (Shenwei) architecture

    [Diagram: each core group consists of an MPE (with L1 and L2 caches), a memory controller, and an 8x8 CPE array (CPE (0,0) ... CPE (7,7)), each CPE with its own software-managed cache (SPM); the four core groups and their memories are connected through the network on chip and an interface bus.]

    Direct Memory Access (DMA) at 26+ GB/s; global load/store (gload/gstore) at 1.5 GB/s.

    • Software-managed cache (SPM): 64 KB per CPE; in pipelined operation, 32 B can be read/written per cycle.

    • Coarse-grained DMA

    • Register communication: P2P latency below 11 instruction cycles; aggregate bandwidth of 600+ GB/s.

    Xu Z, Lin J, Matsuoka S. Benchmarking the SW26010 many-core processor. Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, 2017: 743-752.

  • NRCPC

    The measured memory bandwidth of one CG with DMA sequential read

    Theoretical BW for each CG is 34 GB/s.

  • SW26010 Capability Model

    Total Time = Computation Time + Memory Access Time - Overlapping Time

    Computation Time = Instructions / (Average instruction parallelism)

    Transaction Count = Data Size / (Active CPEs x Transaction Size)
    Memory Access Time = Transaction Count x Transaction Size / Bandwidth

    Which bound? Determined from the number of active CPEs, (MRP - 1) x DMA latency, and the computation time.

    Overlapping Time (depending on which resource is the bound):
    Case 1: MAT x (1 - 1/Cycle)
    Case 2: CT x (1 - 1/Cycle)

    Shizhen Xu et al., Taming the "Monster": Overcoming Program Optimization Challenges on SW26010 Through Precise Performance Modeling, IPDPS 2018.

  • SW26010 Capability Model

    MRT latency refers to the memory access time seen by a CPE.

  • SW26010 Capability Model: Rodinia 3.1 OpenMP Benchmarks

    The mean absolute error of the model is 3.5%; the maximum absolute error is 12.3% (on BFS).

  • The Sunway TaihuLight system

    [System hierarchy: domestic processor -> compute node -> compute board -> compute supernode -> compute cabinet -> compute system, with peak performance of 3.168 TFlops per node, 6.336 TFlops and 25.344 TFlops at the board level, 811.008 TFlops per supernode, 3.244 PFlops per cabinet (1024 processors), and 125.436 PFlops for the full system (40 cabinets).]

    Interconnect: InfiniBand FDR with a two-level network (intra-supernode and inter-supernode).
    - Within a supernode: all-to-all interconnect; the 4 core groups of a node contend for one IB port; with the 16x16 configuration, destination nodes that are equal modulo 16 go through the same router.
    - Between supernodes: a tree network with 1/4 tapering; destination processors that are equal modulo 64 share the same link.

  • Sunway TaihuLight storage system architecture

    Challenges:
    • Complex I/O paths
    • Resource contention
    • Buffer usage
    • Scheduling and queuing

  • Outline

    Sunway TaihuLight

    Programming and performance-analysis tools on Sunway TaihuLight

    Parallel algorithm design on Sunway TaihuLight

  • NRCPC

    Principal Programming Model on TaihuLight: MPI + X

    X: OpenACC* / Athread

    One MPI process runs on each management core (MPE).

    OpenACC* handles data transfer between main memory and the on-chip memory (SPM), and distributes the kernel workload across the compute cores (CPEs).

    Athread is the threading library that manages threads on the compute cores (CPEs); it is also used in the OpenACC* implementation. A minimal sketch follows.
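    As a minimal sketch of this MPI + Athread structure (one MPI process per MPE, 64 CPE threads spawned through the athread library): the SLAVE_FUN declaration macro, the exact athread signatures, and the kernel name here are assumptions to be checked against the Sunway programming guide, and the CPE kernel uses plain global loads/stores instead of DMA for brevity.

    /* MPE side (master.c): one MPI process per core group. */
    #include <mpi.h>
    #include <athread.h>

    typedef struct { double *a; int n; } args_t;
    extern void SLAVE_FUN(scale_kernel)(void *);   /* declaration macro assumed from athread.h */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        athread_init();                            /* bring up the 64 CPEs of this core group */

        double a[640];
        args_t args = { a, 640 };
        athread_spawn(scale_kernel, &args);        /* run the kernel on all 64 CPEs */
        athread_join();                            /* wait for the CPE threads to finish */

        athread_halt();
        MPI_Finalize();
        return 0;
    }

    /* CPE side (slave.c), compiled with the slave compiler. */
    #include <slave.h>

    typedef struct { double *a; int n; } args_t;

    void scale_kernel(void *p) {
        args_t *args = (args_t *)p;
        for (int i = _MYID; i < args->n; i += 64)  /* _MYID: this CPE's id, 0..63 */
            args->a[i] *= 2.0;                     /* gload/gstore here; real codes stage data in the LDM via DMA */
    }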

  • NRCPC

    Brief view of SWACC/SWAFORT compiler

    OpenACC* is a directive-based programming tool for SW26010:

    Based on OpenACC 2.0

    Extended for the architecture of SW26010

    Supported by the SWACC/SWAFORT compilers

    Interactive debugging supported

    OpenACC* compilers: SWACC and SWAFORT

    Source-to-source compilers (SWACC: C99; SWAFORT: Fortran 2003), developed by NRCPC

    Based on the ROSE compiler infrastructure (0.9.6a)

    • An open-source compiler infrastructure for building source-to-source program transformation and analysis tools

    • Developed by LLNL

  • NRCPC

    Brief view of SWACC/SWAFORT compiler

    Source code with OpenACC* directives:

    int A[1024][1024]; int B[1024][1024]; int C[1024][1024];
    #pragma acc parallel loop copyin(B, C) copyout(A)
    for (i = 0; i < 1024; i++) {
        for (j = 0; j < 1024; j++) {
            A[i][j] = B[i][j] + C[i][j];
        }
    }

    Generated MPE code:

    CPEs_spawn(CPE_kernel, args);
    ...

    Generated CPE code:

    __SPM_local int SPM_A[1][1024];
    __SPM_local int SPM_B[1][1024];
    __SPM_local int SPM_C[1][1024];
    void CPE_kernel(args) {
        for (i = CPE_id; i < 1024; i += CPE_num) {
            dma_get(&B[i][0], SPM_B, 4096);
            dma_get(&C[i][0], SPM_C, 4096);
            for (j = 0; j < 1024; j++) {
                SPM_A[0][j] = SPM_B[0][j] + SPM_C[0][j];
            } // j-loop
            dma_put(SPM_A, &A[i][0], 4096);
        } // i-loop
    }

    Compilation flow: source code with OpenACC* directives -> SWACC -> basic compiler -> a.out

    Compute pattern: data in to SPM -> calculation -> data out to main memory

    Workload distribution and the size of each data transfer are determined automatically by the compiler.

  • NRCPC

    Motivation to extend OpenACC

    Memory model of the OpenACC standard [OpenACC 2.0, Section 1.3]:

    Mainly targets devices with non-shared memory
    • The memory on the accelerator may be physically and/or virtually separate from host memory; all data movement between host memory and device memory must be performed by the host thread.
    • This is the case with most current GPUs.

    The data environment of OpenACC can be ignored on shared-memory devices
    • The implementation need not create new copies of the data for the device, and no data movement between host and device is required.

    What we need: a data environment for OpenACC* that utilizes the high-speed SPMs and the aggregated bandwidth of the CPEs for performance.

  • NRCPC

    The difference between the memory models

    [Diagram: in standard OpenACC, host memory and device memory are separate; the memory that accelerator threads can access is the device memory, and data movement is executed by the host thread. On SW26010, the MPE and the CPEs share the host memory, each CPE has its own SPM, and data movement is initiated by each CPE thread.]

  • NRCPC

    The memory model of OpenACC*: three kinds of memory spaces that a CPE thread can access

    • Memory space of the host thread: shared by all accelerator threads.
    • Private space: owned by each accelerator thread, located in host memory, large.
    • Local space: owned by each accelerator thread, located in the SPM, limited in size.

    [Diagram: each accelerator thread (1..n) has a private space in host memory and a local space in its SPM, alongside the shared memory space of the host thread.]

  • NRCPC

    The principal extensions of OpenACC*

    Extended usage of OpenACC's data-environment directives
    • Data copy can be used inside an accelerator parallel region
    • copy on parallel performs data movement and distribution between SPMs

    New directives/clauses (an illustrative sketch follows)
    • local clause: allocates space in the SPM of a CPE thread
    • Data-transformation support to speed up data transfer: pack/packin/packout and swap/swapin/swapout clauses
    • Annotation clauses for better compiler control of data movement and execution: tilemask, entire, co_compute
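    An illustrative sketch of how the extended clauses might appear in source code; the clause names (copyin, copyout, local) come from this slide, but the exact SWACC syntax and semantics (array sections, tiling) are assumptions, not confirmed here.

    double a[1024][512], b[1024][512];
    double buf[512];   /* per-CPE-thread scratch buffer, to be placed in the SPM via local() */

    /* Illustrative only: the precise SWACC clause syntax follows the compiler manual. */
    #pragma acc parallel loop copyin(a) copyout(b) local(buf)
    for (int i = 0; i < 1024; i++) {
        for (int j = 0; j < 512; j++) buf[j] = 2.0 * a[i][j];  /* work inside the SPM buffer */
        for (int j = 0; j < 512; j++) b[i][j] = buf[j];        /* written back row by row */
    }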

  • Performance analysis tools: gprof, mpiP, etc.

    When submitting a job, add the option --sw3runarg="-p -f" to generate gmon.out; then gprof a.out gmon.out shows profiling data for both the MPE and the CPEs. CPE functions carry the slave_ prefix.

      %   cumulative    self              self     total
     time    seconds   seconds   calls  Ts/call  Ts/call  name
    98.15      12.75     12.75                            slave_Array_Waiting_For_Task
     0.77      12.85      0.10                            _IO_vfscanf_internal
     0.38      12.90      0.05                            athread_halt
     0.31      12.94      0.04                            slave_BARFIT
     0.23      12.97      0.03                            ____strtod_l_internal
     0.08      12.98      0.01                            __mpn_impn_sqr_n_basecase
     0.08      12.99      0.01                            slave_bipol

  • Performance counters

    Cycle counting on the MPE:

    void rpcc_(unsigned long *counter)
    {
        unsigned long rpcc;
        asm volatile("rtc %0" : "=r"(rpcc));
        *counter = rpcc;
    }

    Cycle counting on the CPE:

    void rtc_(unsigned long *counter)
    {
        unsigned long rpcc;
        asm volatile("rcsr %0, 4" : "=r"(rpcc));
        *counter = rpcc;
    }
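    A short usage sketch: bracket a region of interest with the counter routine above and convert cycles to seconds (1.45 GHz is the published SW26010 clock; adjust if the actual clock differs). The timed function is a hypothetical placeholder.

    #include <stdio.h>

    extern void rpcc_(unsigned long *counter);   /* MPE cycle counter defined above */

    void time_region(void (*region)(void))       /* 'region' is any function to be timed */
    {
        unsigned long t0, t1;
        rpcc_(&t0);
        region();
        rpcc_(&t1);
        printf("elapsed: %lu cycles (%.6f s)\n", t1 - t0, (double)(t1 - t0) / 1.45e9);
    }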

  • Performance-data access paths on SW26010

    • Path 1: CPE status registers -> CPE general-purpose registers
    • Path 2: CPE status registers -> global memory mapping -> MPE general-purpose registers
    • Path 3: CPE status registers -> global memory mapping -> other core groups

  • SW26010 performance-counter configuration

    Each CPE has:
    • 1 cycle counter (CC)
    • 3 performance counters (PCR): 2 count in-core performance events and 1 counts transfer events; the top 5 bits select the event
    • 1 performance-counter control register (PCRC): enables/disables the 3 performance counters and selects floating-point event counting

    Any MPE on the same chip can access the PC/CC/PCR/PCRC of any CPE.

  • Events countable by PCR0/PCR1

    PCR0:
    • (conditional) branch instruction count
    • instruction count
    • pipeline-0 instruction count
    • cycles stalled on a full LD/ST buffer
    • cycles stalled on a full GLQ/GSQ
    • cycles stalled on a full channel buffer
    • LDM read/write count
    • ICache access count
    • cycles the core is stalled when accessing the LDM
    • branch misprediction count
    • LDM access conflict count
    • count of LDM accesses not using full bandwidth
    • add/subtract/multiply count

    PCR1:
    • branch miss count
    • (un)conditional branch count
    • pipeline-1 instruction count
    • cycles stalled on synchronization
    • cycles stalled on memory access
    • cycles stalled on register communication
    • cycles not issuing at full rate
    • cycles stalled on data dependences
    • cycles stalled on synchronization / conditional branches
    • ICache miss count
    • total LDM read/write count
    • cycles stalled on transfer/atomic/address dependences
    • count of LDM accesses at full bandwidth
    • divide/square-root count

  • Events countable by PCR2

    • total requests on the array control network
    • GLD/GST count
    • GF&A count
    • GUPDT count
    • L1 ICache miss count
    • row/column synchronization count
    • user interrupt count
    • DMA request count
    • SBMD start/stop interrupt count
    • pipeline stalls caused by misses
    • SBMD cycle count
    • autonomous-run cycle count
    • register-communication sends (row/column/total)
    • external LDM read/write count
    • DMA reply-word increment count
    • total instruction fills
    • instruction fill miss count
    • SBMD instruction fill count
    • main-memory read/write response count
    • total external I/O reads/writes
    • read-and-send count (row/column/total)

  • Introducing Beacon

    A full-stack I/O resource monitoring and diagnosis system deployed on the Sunway TaihuLight supercomputer

    Lightweight and low-overhead; the collected data are compressed online while preserving accuracy

    Collects and analyzes I/O behavior data from compute nodes, I/O forwarding nodes, and storage nodes, and uses these data to characterize the I/O behavior of applications and of the system

  • Beacon architecture

    [Architecture diagram: compute nodes (40,960, LWFS client), I/O forwarding nodes (160, LWFS server / Lustre client), storage nodes (288, Lustre server), and metadata nodes (2, MDS), each instrumented with profiling points and data compression; trace/log data are collected (Logstash) into an in-memory cache (Redis) and a distributed log database (Elasticsearch) running on 84+1 part-time servers (85 storage nodes, N1...N85); a job database (MySQL) and statistics analysis feed the I/O diagnostic system and the user.]

  • Beacon visualization tool

    For compute nodes, forwarding nodes, and storage nodes:
    • bandwidth
    • IOPS
    • metadata accesses
    • data distribution
    • number of active I/O processes

    For system administrators:
    • system load
    • user statistics

    For users:
    • application details
    • query and statistical analysis of past applications

  • Outline

    Sunway TaihuLight

    Programming and performance-analysis tools on Sunway TaihuLight

    Parallel algorithm design on Sunway TaihuLight

  • Sunway TaihuLight: Major Features to Consider

    125 Pflops

    32 GB and 136 GB/s per node (22 flops/byte)

    10 million cores, MPE + CPE

    user-controlled 64 KB LDM

    register communication among CPEs

  • Sunway TaihuLight: Major Features to Consider (continued)

    The 22 flops/byte ratio of TaihuLight compares with:
    Intel KNL 7250 of Cori: 6.5 flops/byte
    NVIDIA P100 of Piz Daint: 7.2 flops/byte

  • Sunway TaihuLight: Major Challenge #1: Scaling (10 million MPE + CPE cores)

  • General Programming Approach: two levels

    First level: up to 163,840 MPI processes (racks -> chips -> core groups)

    Second level: 64 or 65 concurrent threads per core group (cores; 163,840 processes x 65 threads = 10,649,600 cores in total)

    Reserve enough parallelism for the 65 cores in each CG; redesign your algorithm to expose enough parallelism for the cores in each CG.

  • Sunway TaihuLight: Major Challenge #2: Memory Wall

    32 GB and 136 GB/s per node (22 flops/byte): refactoring and redesigning are required.

  • Register Communication of the SW26010 Processor

    [Diagram: CPEs in a row/column exchange data through Get C (column get), Get R (row get), and Put operations.]

    // P2P test
    if (id % 2 == 0)
        while (1)
            putr(data, id + 1);   // register put toward CPE id+1
    else
        while (1)
            getr(&data);          // register get (receive)

    Latency: less than 11 cycles
    Bandwidth: 637 GB/s

    Xu, Zhigeng, James Lin, and Satoshi Matsuoka. "Benchmarking SW26010 Many-Core Processor." Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, 2017.

  • Risks of Register Communication

    1. High overhead of manually checking whether data are already resident in the software-managed cache (SPM);

    2. Limitations of register communication: only row/column communication is supported, and careless pairing can create a potential deadlock cycle among CPEs (e.g., (0,0) -> (0,1) -> (1,1) -> (1,0)).

    Lin H, Tang X, Yu B, et al. Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores. Parallel and Distributed Processing Symposium (IPDPS), IEEE, 2017: 635-645.

  • Characteristics of many-core architectures (including SW26010)

    Similarities:
    • Many small cores and wide vectors (fine-grained parallelism); the vector width of SW26010 is comparatively low
    • Memory bandwidth cannot keep up with computation (locality matters); the bandwidth problem is even more pronounced on SW26010
    • Low clock frequency, complex nodes, and network topology -> how to saturate the network (heterogeneous programs, dedicated communication optimizations)

    Differences:
    • Intel and NVIDIA take a moderate versus an aggressive approach to multithreading
    • SW, constrained by its process technology, leans on the memory system instead
    • Programming standards are not unified (OpenMP, OpenACC, ...)

  • Deep optimization for many-core

    • Shift from coarse-grained parallelism to fine-grained (hierarchical) parallelism

    • Shift from homogeneous to heterogeneous programs, supporting computation-communication overlap and even functional parallelism and asynchrony

    • Algorithm design for a high compute-to-memory-access ratio

    • Deep regularization of memory accesses (regularity, locality, small volume); re-evaluation and possible redesign of data structures and top-level algorithms, led by application experts

    • Program portability: choose a suitable implementation language; hide differing hardware and software details to speed up understanding and analysis; support from basic algorithms, compilers/analysis tools, libraries, and system software, which can largely be tracked and delivered by computer engineers

  • Examples (a personal selection)

    • Yulong Ao, et al., 26 PFLOPS Stencil Computation for Atmospheric Modeling on Sunway TaihuLight, IPDPS 2017

    • Xinliang Wang, et al., swSpTRSV: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architectures, PPoPP 2018

    • Heng Lin, et al., Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores, IPDPS 2017

    • Haohuan Fu, et al., 18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios, SC 2017

    • Haohuan Fu, et al., Redesigning CAM-SE on Sunway TaihuLight for Peta-Scale Performance and Ultra-High Resolution, SC 2017

  • Computation-Communication Overlapping

    • An inner-outer subdomain partition:
      • Maximize the most contiguous dimension.
      • Deepen the halo to replace communication with computation (possibly in future work).

    • Overlap the halo exchange and data movement on the MPE with the inner computation on the CPEs (see the sketch below).
      • MPE: halo exchange, copy, pack, and unpack.
      • CPE: inner and halo computation.
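    A minimal sketch of this overlap pattern, simplified to a 1D exchange with two neighbors; the SLAVE_FUN macro and the CPE kernels (inner_kernel, halo_kernel) are hypothetical placeholders, and the copy/pack/unpack work done on the MPE in the real code is omitted.

    #include <mpi.h>
    #include <athread.h>

    extern void SLAVE_FUN(inner_kernel)(void *);   /* hypothetical CPE kernel for the inner subdomain */
    extern void SLAVE_FUN(halo_kernel)(void *);    /* hypothetical CPE kernel for the halo subdomain */

    void step(double *send_buf, double *recv_buf, int count,
              int left, int right, void *args, MPI_Comm comm)
    {
        MPI_Request req[4];

        /* MPE: start the halo exchange (non-blocking) ... */
        MPI_Irecv(recv_buf,         count, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Irecv(recv_buf + count, count, MPI_DOUBLE, right, 1, comm, &req[1]);
        MPI_Isend(send_buf,         count, MPI_DOUBLE, left,  1, comm, &req[2]);
        MPI_Isend(send_buf + count, count, MPI_DOUBLE, right, 0, comm, &req[3]);

        /* ... while the CPEs compute the inner subdomain */
        athread_spawn(inner_kernel, args);
        athread_join();

        /* MPE: finish the exchange, then let the CPEs compute the halo region */
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        athread_spawn(halo_kernel, args);
        athread_join();
    }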

  • Optimization of Comp-Comm Overlapping

    • DMGlobalToLocalBegin (PETSc): 1. copy the inner data; 2. pack the halo data; 3. initiate the communication.

    • DMGlobalToLocalEnd (PETSc): 1. wait for the communication to finish; 2. unpack the halo data.

    • MPE: halo exchange.

    • CPE: copy, pack (optional), unpack, inner and halo computation.

    Optimization of pack and unpack: process data as contiguously as possible on the CPE cluster until a switch is detected in the loaded index array. A sketch of the overlap around the PETSc calls follows.
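    A minimal sketch of how the two PETSc calls bracket the overlapped inner computation; the CPE kernel launches are hypothetical placeholders, and PETSc error checking is omitted for brevity.

    #include <petscdm.h>
    #include <athread.h>

    extern void SLAVE_FUN(inner_kernel)(void *);   /* hypothetical CPE kernels */
    extern void SLAVE_FUN(halo_kernel)(void *);

    PetscErrorCode overlapped_step(DM dm, Vec gvec, Vec lvec, void *args)
    {
        /* begin the scatter: copy inner data, pack halo data, initiate communication */
        DMGlobalToLocalBegin(dm, gvec, INSERT_VALUES, lvec);

        /* overlap: the CPEs work on the inner subdomain while messages are in flight */
        athread_spawn(inner_kernel, args);
        athread_join();

        /* end the scatter: wait for the communication and unpack the halo data */
        DMGlobalToLocalEnd(dm, gvec, INSERT_VALUES, lvec);

        /* halo computation once the ghost values are available */
        athread_spawn(halo_kernel, args);
        athread_join();
        return 0;
    }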

  • Locality-aware Thread Blocking (2.5D)

    [Figure: within the z-y cross-section, each block is processed plane by plane; at each step there are an unused plane, a prefetching plane, the current computing plane, and a dependent plane, so that prefetching overlaps with computation.]

    • Each subdomain is partitioned along the z-x plane into small blocks.
    • Each thread is responsible for one block, which is processed plane by plane.
    • The data required by the CPE are loaded into the LDM by DMA.
    • A circular array and the double-buffering approach are employed (see the sketch below).
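    A schematic sketch of the plane-by-plane circular-buffer/double-buffering loop on one CPE; dma_get_plane and dma_wait_slot are hypothetical wrappers for the asynchronous athread DMA calls, and the dependent plane, boundary handling, and write-back DMA are omitted.

    #define NBUF 2                 /* circular array of plane buffers (double buffering) */
    #define PLANE_SIZE 1024        /* doubles per plane tile (illustrative size) */

    /* hypothetical wrappers over the asynchronous athread DMA interface */
    void dma_get_plane(double *ldm_buf, const double *mem_block, int k);  /* async get of plane k */
    void dma_wait_slot(int slot);                                         /* wait for that slot's DMA */
    void compute_plane(const double *plane, double *out);                 /* stencil on one plane */

    void block_sweep(const double *block_in, double *block_out, int nz)
    {
        double ldm[NBUF][PLANE_SIZE];              /* CPE-local buffers; CPE stack data live in the 64 KB LDM */

        dma_get_plane(ldm[0], block_in, 0);        /* prefetch the first plane */
        for (int k = 0; k < nz; k++) {
            int cur = k % NBUF, nxt = (k + 1) % NBUF;
            if (k + 1 < nz)
                dma_get_plane(ldm[nxt], block_in, k + 1);   /* prefetch the next plane ... */
            dma_wait_slot(cur);                             /* ... while waiting only for the current one */
            compute_plane(ldm[cur], block_out + (long)k * PLANE_SIZE);
        }
    }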

  • Optimization of Locality-aware Thread Blocking

    [Figure: mapping of the inner and halo parts of a subdomain onto the 8x8 CPE cluster (rows R0-R7, columns C0-C7).]

    • For the inner part: a second-level partitioning along the y direction may be conducted if the occupation rate is below 50%.

    • For the halo part: each area is treated as a flat rectangular block; different areas are scheduled onto the CPE cluster simultaneously; the west and east areas need to be padded for vectorization.

  • Collaborative Data Accessing

    [Figure: 4 CPEs with 4x4 block size; each CPE loads a contiguous chunk from memory by DMA, duplicates part of it to form pieces with 2-layer halos, and exchanges the pieces on chip (steps 1-3).]

    1. Grouping: 4 CPEs are grouped together, and each loads a contiguous, larger chunk through DMA.
    2. Duplicating: some data on each CPE are duplicated to construct data pieces that include the 2-layer halos.
    3. Exchanging: the resulting data pieces are exchanged on chip so that each CPE obtains the data it requires.

  • Online Data Layout Transformation

    • Conversion between array of structures (AoS) and structure of arrays (SoA): 4 cell structures (each with 6 double elements) become 6 vectors (each with 4 double elements).

    • The shuffle instruction can produce a vector from two input vectors in one cycle.
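    A minimal scalar sketch of the AoS-to-SoA transform for this 4x6 case; the real kernel performs the transpose in registers with vector shuffle instructions rather than element-wise copies, and the structure layout here is an assumption.

    #define NCELL 4          /* cells processed per vector (4 doubles per 256-bit vector) */
    #define NFIELD 6         /* double fields per cell structure (assumed layout) */

    typedef struct { double f[NFIELD]; } cell_t;      /* AoS: one structure per cell */

    /* Gather 4 consecutive cells into 6 vectors of 4 doubles (AoS -> SoA). */
    void aos_to_soa(const cell_t aos[NCELL], double soa[NFIELD][NCELL])
    {
        for (int field = 0; field < NFIELD; field++)
            for (int cell = 0; cell < NCELL; cell++)
                soa[field][cell] = aos[cell].f[field];   /* each soa[field] is one vector */
    }

    /* Scatter the 6 vectors back into the 4 cell structures (SoA -> AoS). */
    void soa_to_aos(const double soa[NFIELD][NCELL], cell_t aos[NCELL])
    {
        for (int cell = 0; cell < NCELL; cell++)
            for (int field = 0; field < NFIELD; field++)
                aos[cell].f[field] = soa[field][cell];
    }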

  • Examples (a personal selection)

    • Yulong Ao, et al., 26 PFLOPS Stencil Computation for Atmospheric Modeling on Sunway TaihuLight, IPDPS 2017

    • Xinliang Wang, et al., swSpTRSV: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architectures, PPoPP 2018

    • Heng Lin, et al., Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores, IPDPS 2017

    • Haohuan Fu, et al., Redesigning CAM-SE on Sunway TaihuLight for Peta-Scale Performance and Ultra-High Resolution, SC 2017

    • Haohuan Fu, et al., 18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios, SC 2017

  • Challenges for Designing SpTRSV on the SW Architecture

    Limited memory system of SW26010:
    • Limited fast memory (only 64 KB per core)
    • Limited memory bandwidth (22 flops/byte; 34 GB/s per CG in theory)

    Restricted usage patterns for good application performance:
    • Manually controlled SPM (cache-less architecture)
    • Single CG: 22.6 GB/s (DMA) vs. 1.5 GB/s (fine-grained global load/store)
    • Register communication only supports in-row and in-column communication

  • Challenges for Designing SpTRSV on the SW Architecture

    Memory limitations:
    • Manually controlled SPM
    • DMA vs. gload
    • Restricted register communication

    Straightforward refactoring of the algorithm and implementation:
    • Check whether data are in the SPM and load missing elements
    • Support communication between arbitrary pairs of cores

    Resulting problems:
    • High overhead
    • Inefficient memory access
    • Potential deadlock*

    *Lin H, Tang X, Yu B, et al. Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores, IPDPS 2017: 635-645.

  • swSpTRSV

    Sparse level tiles:
    • suited to the software-managed cache
    • suited to coarse-grained DMA

    Producer-consumer pairing:
    • suited to the regular register communication
    • avoids deadlock

  • Sparse level tiles

    Accesses to the solution vector x: fine-grained, random, not prefetchable.

    Accesses to the right-hand-side vector b: coarse-grained, predictable, prefetchable.

  • Sparse level tiles

    [Figure: the lower-triangular matrix is partitioned into sparse tiles ordered by level (B regions 0-3 along the rows, X regions 0-3 along the columns).]

    • Predictable: the number of x entries solved and b entries updated by each sparse tile is bounded, so cache hits are guaranteed without extra branch instructions.

    • Prefetchable: the storage order of the sparse tiles matches the computation order, and the demand for the right-hand-side vector b is contiguous.

    • Coarse-grained: all nonzeros of a sparse tile are stored together and can be read in bulk; for the right-hand-side vector b, contiguous + prefetchable = coarse-grained.

  • swSpTRSV

    Sparse level tiles: suited to the software-managed cache and to coarse-grained DMA.

    Producer-consumer pairing: suited to the regular register communication; avoids deadlock.

  • Producer-consumer pairing

    The 8x8 CPE mesh (CPE (0,0) ... (7,7)) is split into a consumer half (8 rows x 4 columns) and a producer half (4 columns), pairing each consumer with a producer.

    The sequential update
        x_j = b_j / l_jj;  b_i = b_i - l_ij * x_j
    is split between the two roles:
        consumer: x_j = b_j / l_jj;  b_i = b_i - delta_ij
        producer: delta_ij = l_ij * x_j

    A schematic of the pairing follows.
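    A schematic C sketch of the role split, with hypothetical reg_send/reg_recv helpers standing in for the row/column register-communication primitives; level scheduling, the tile layout, and the routing of delta values back to the correct b entries are omitted.

    /* hypothetical wrappers over the register-communication primitives */
    void   reg_send(int dst_cpe, double v);
    double reg_recv(int src_cpe);

    /* Consumer CPE: solves x_j = b_j / l_jj and applies incoming updates b_i -= delta_ij. */
    void consumer(double *x, double *b, const double *diag,
                  int jbegin, int jend, int producer_id)
    {
        for (int j = jbegin; j < jend; j++) {
            x[j] = b[j] / diag[j];
            reg_send(producer_id, x[j]);          /* ship x_j to the paired producer */
        }
        /* ... receive delta_ij values from producers and apply b[i] -= delta_ij */
    }

    /* Producer CPE: holds off-diagonal nonzeros l_ij and computes delta_ij = l_ij * x_j. */
    void producer(const double *lvals, int nnz, int consumer_id)
    {
        double xj = reg_recv(consumer_id);        /* x_j from the paired consumer */
        for (int k = 0; k < nnz; k++)
            reg_send(consumer_id, lvals[k] * xj); /* the consumer accumulates b_i -= delta_ij */
    }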

  • Examples (a personal selection)

    • Yulong Ao, et al., 26 PFLOPS Stencil Computation for Atmospheric Modeling on Sunway TaihuLight, IPDPS 2017

    • Xinliang Wang, et al., swSpTRSV: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architectures, PPoPP 2018

    • Heng Lin, et al., Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores, IPDPS 2017

    • Haohuan Fu, et al., 18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios, SC 2017

    • Haohuan Fu, et al., Redesigning CAM-SE on Sunway TaihuLight for Peta-Scale Performance and Ultra-High Resolution, SC 2017

  • Compression: Squeezing Extra Performance

    Resource      Peak      Utilized   %
    Flops         765 G     94.7 G     12.2%
    Memory size   5 GB      4.6 GB     92%
    Memory BW     34 GB/s   25 GB/s    73.5%
    LDM size      64 KB     60 KB      93.8%

    [Figure: data are kept in DDR memory in compressed form, transferred by DMA into the 64 KB LDM, decompressed for the compute functions on the CPE, and compressed again before being written back.]

  • Compression: Squeezing Extra Performance

    Within the same resource budget, compression enables even larger problems (memory size) and pumps more data in and out (memory bandwidth).

  • Compression: Not an Easy Task

    Additional complexity and cost:
    • Extra LDM reads/writes due to compression/decompression operations
    • Broken floating-point instruction pipelines

  • Compression: Further Optimization

    Naive on-the-fly compression of every point: 1/3 of the original performance.

    Buffering a plane instead of compressing every point on the fly: from 1/3 to 90% of the original performance.

    Switching the buffering of temporary variables from the LDM to registers by using intrinsic assembly instructions, especially around function calls: from 90% to 120% of the original performance.

    Before (temporaries spilled to the LDM):
        LOAD  LDM1, $ra
        SSL   $ra, $ra
        STORE $ra, LDM1
        LOAD  LDM2, $rb
        SSL   $rb, $rb
        STORE $rb, LDM2
        LOAD  LDM3, $rc
        SSL   $rc, $rc
        STORE $rc, LDM3
        LOAD  LDM1, $ra
        LOAD  LDM2, $rb
        ADD   $ra, $rb, $ra
        LOAD  LDM3, $rc
        MUL   $ra, $rc, $ra
        STORE $ra, LDM2

    After (temporaries kept in registers):
        LOAD  LDM1, $ra
        SSL   $ra, $ra
        LOAD  LDM2, $rb
        SSL   $rb, $rb
        LOAD  LDM3, $rc
        SSL   $rc, $rc
        ADD   $ra, $rb, $ra
        MUL   $ra, $rc, $ra
        STORE $ra, LDM2

    With further optimization: from 120% to 130% of the original performance.

  • On-the-fly Compression: Results

    [Figure: results over 0 s to 120 s at Cangzhou and Ninghe; blue solid line: base, red dashed line: compressed.]

  • Examples (a personal selection)

    • Yulong Ao, et al., 26 PFLOPS Stencil Computation for Atmospheric Modeling on Sunway TaihuLight, IPDPS 2017

    • Xinliang Wang, et al., swSpTRSV: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architectures, PPoPP 2018

    • Heng Lin, et al., Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores, IPDPS 2017

    • Haohuan Fu, et al., 18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios, SC 2017

    • Haohuan Fu, et al., Redesigning CAM-SE on Sunway TaihuLight for Peta-Scale Performance and Ultra-High Resolution, SC 2017

  • Finding more parallelism

    [Figure: on the cubed-sphere mesh, each element can be mapped to one CPE, or decomposed into sub-elements 0..i-1 mapped to CPEs 0..i-1 (#i CPEs per element), giving more parallelism or a lower LDM requirement; the decomposition is realized either through OpenACC or through direct (Athread) implementation.]

  • Challenge for the future: the CAM-SE effort took 3 years. How can we make this easier?

    Initial code (PHY + DYN): 754,129 LOC, 304 kernels, with no hotspots

    Modified 152,336 LOC, added 57,709 LOC

    Optimized 185 kernels, covering 80% of the total time

  • Thank you!

    Thanks for listening!