Parallel Algorithm Design on Sunway TaihuLight
Wei Xue
Department of Computer Science and Technology, Tsinghua University; National Supercomputing Center in Wuxi
May 30, 2018, Tsinghua University
Outline
Sunway TaihuLight
Programming and performance-analysis tools on Sunway TaihuLight
Parallel algorithm design on Sunway TaihuLight
Comparison of top-end computing systems (latest TOP500 list)

System              Peak (PFLOPS)  Sustained (PFLOPS)  Green (MFLOPS/W)  Total cores   Total memory (PB)  Architecture
Sunway TaihuLight   125.436        93.015              6051.131          10,649,600    1.3                heterogeneous many-core
Tianhe-2 (KNC)      54.90          33.86               1901.54           3,120,000     1.0                heterogeneous many-core
Piz Daint (P100)    25.33          19.59               8622.36           361,760       0.34               heterogeneous many-core
Gyoukou             28.19          19.14               14,174.67         19,860,000    0.58               heterogeneous many-core
Titan (K20x)        27.11          17.59               2142.77           560,640       0.7                heterogeneous many-core
Sequoia (BQC)       20.13          17.17               2176.58           1,572,864     1.6                many-core
Trinity             43.90          14.14               3677.76           979,968       2.07               many-core
Cori (KNL)          27.88          14.01               3557.93           622,336       0.9                many-core

Many-core has become the trend in high-performance computing.
Comparison of current high-end systems (2017.9)

                   TaihuLight  Tianhe-2  Piz Daint  Titan  Sequoia
Rank of Top500         1          2          3        4       5
Rank of Green500      17        147          6      109     100
Rank of Graph500       2          8          /        /       3
Rank of HPCG           3          2          4        8       7

The K computer holds rank 1 on both Graph500 and HPCG.
Domestic Many-Core Processor SW26010
Each CPU:
Peak performance: 3.06 TFlops (DP)
Memory: 32 GB
Memory bandwidth: 136.5 GB/s
# CPUs: 1
# cores: 260
For comparison — Intel KNL (2016): DP 3 TF; memory BW 400+ GB/s (MCDRAM), 90+ GB/s (DDR). NVIDIA Pascal (2016): DP 5+ TF; NVLink 80 GB/s; PCIe Gen3 16 GB/s.
[Figure: SW26010 block diagram — four core groups (CG 0–3) connected by an NoC; each CG contains one MPE, an 8*8 CPE mesh, a PPU, and an iMC attached to its own memory; the chip connects outward through a data-transfer network. Each computing core in the 8*8 CPE mesh has an LDM, registers, row/column communication buses, a control network, and a transfer agent (TA); the hierarchy spans memory level, LDM level, register level, and computing level.]
SW26010: Sunway 260-Core Processor
[Figure: Sunway architecture — four core groups on an on-chip network; each core group has a management core (MPE) with L1/L2 caches, an 8×8 compute-core (CPE) array, CPEs (0,0) through (7,7), each with a software-managed cache, and a memory controller attached to its own memory; an interface bus connects the chip externally.]
Direct Memory Access (DMA) at 26+ GB/s; global load/store (gload/gstore) at 1.5 GB/s
• Software-managed scratchpad memory (SPM): 64 KB per CPE; in steady state, 32 B can be read/written per cycle
• Coarse-grained DMA
• Register communication: P2P latency under 11 instruction cycles; aggregate bandwidth over 600 GB/s
Xu Z., Lin J., Matsuoka S. Benchmarking SW26010 Many-Core Processor. 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, 2017: 743–752.
NRCPC
The measured memory bandwidth of one CG with DMA sequential read.
Theoretical bandwidth for each CG is 34 GB/s.
SW26010 Capability Model

Total Time = Computation Time + Memory Access Time − Overlapping Time

Computation Time (CT) = Instructions / Average Instruction Parallelism
Transaction Count = Data Size / (Active CPEs × Transaction Size)

Memory Access Time (MAT) — which bound? Either the per-CPE transaction cost over the active CPEs (with an (MRP − 1) × DMA term) or the aggregate Bandwidth limit, whichever dominates.

Overlapping Time:
Case 1: MAT × (1 − 1/Cycle)
Case 2: CT × (1 − 1/Cycle)
Shizhen Xu et al., Taming the “Monster”: Overcoming Program Optimization Challenges on SW26010 Through Precise Performance Modeling, IPDPS 2018
SW26010 Capability Model
MRT latency refers to the memory access time seen by a CPE.
SW26010 Capability Model: Rodinia 3.1 OpenMP Benchmarks
The mean absolute error of the model is 3.5%; the maximum absolute error is 12.3% (on BFS).
Sunway TaihuLight System — hardware hierarchy:
domestic processor → compute node (3.168 TFlops) → compute card (6.336 TFlops) → compute plugin board (25.344 TFlops) → compute supernode (811.008 TFlops) → compute cabinet (3.244 PFlops, 1024 processors) → compute system (125.436 PFlops, 40 cabinets)
Interconnect: IB FDR; a two-level network, intra-supernode and inter-supernode.
- Intra-supernode: fully connected; the 4 CGs within a node contend for one IB port; with the 16×16 configuration, destination nodes that are equal modulo 16 use the same route.
- Inter-supernode: tree network with 1/4 tapering; destination processor numbers that are equal modulo 64 share the same link.
Sunway TaihuLight Storage System Architecture
Challenges:
• Complex I/O paths
• Resource contention
• Buffer usage
• Scheduling and queuing
Outline
Sunway TaihuLight
Programming and performance-analysis tools on Sunway TaihuLight
Parallel algorithm design on Sunway TaihuLight
Principal Programming Model on TaihuLight
MPI+X
X : OpenACC* / Athread
One MPI process runs on each management core (MPE)
OpenACC* conducts data transfer between main memory and the on-chip memory (SPM), and distributes the kernel workload across the compute cores (CPEs)
Athread is the threading library that manages threads on the compute cores (CPEs); the OpenACC* implementation is built on it
Brief view of SWACC/SWAFORT compiler
OpenACC* is a directive-based programming tool for the SW26010
• Based on OpenACC 2.0
• Extensions for the architecture of the SW26010
• Supported by the SWACC/SWAFORT compilers
• Interactive debugging supported
OpenACC* compiler: SWACC/SWAFORT
• Source-to-source compilers; SWACC: C99, SWAFORT: Fortran 2003
• SWACC and SWAFORT are developed by NRCPC
• Based on the ROSE compiler infrastructure (0.9.6a), an open-source infrastructure for building source-to-source program transformation and analysis tools, developed by LLNL
Brief view of SWACC/SWAFORT compiler
Source code with OpenACC* directives (MPE side):

int A[1024][1024];
int B[1024][1024];
int C[1024][1024];
#pragma acc parallel loop copyin(B, C) copyout(A)
for (i = 0; i < 1024; i++) {
    for (j = 0; j < 1024; j++) {
        A[i][j] = B[i][j] + C[i][j];
    }
}

Generated MPE code:

...
CPEs_spawn(CPE_kernel, args);
...

Generated CPE code:

__SPM_local int SPM_A[1][1024];
__SPM_local int SPM_B[1][1024];
__SPM_local int SPM_C[1][1024];
void CPE_kernel(args) {
    for (i = CPE_id; i < 1024; i += CPE_num) {
        dma_get(&B[i][0], SPM_B, 4096);
        dma_get(&C[i][0], SPM_C, 4096);
        for (j = 0; j < 1024; j++) {
            SPM_A[0][j] = SPM_B[0][j] + SPM_C[0][j];
        }
        dma_put(SPM_A, &A[i][0], 4096);
    }
}
Compile flow: source code with OpenACC* directives → SWACC → basic compiler → a.out
Compute pattern: data into SPM → calculation → data out to main memory
Workload distribution and data-transfer sizes are determined automatically by the compiler
Motivation to extend OpenACC
Memory Model of the OpenACC Standard [OpenACC2.0-1.3]
Mainly supports non-shared-memory devices
• The memory on the accelerator may be physically and/or virtually separate from host memory; all data movement between host memory and device memory must be performed by the host thread.
• This is the case with most current GPUs.
The data environment of OpenACC can be ignored on shared-memory devices
• The implementation need not create new copies of the data for the device, and no data movement between host and device is needed.
What we need: a data environment for OpenACC* that utilizes the high-speed SPMs and the aggregated bandwidth of the CPEs for performance.
The difference between the memory models:
• OpenACC: host memory and device memory; accelerator threads can access device memory; data movement is executed by the host thread.
• SW26010: host memory is shared by the MPE and the CPEs, and each CPE has its own SPM; data movement is initiated by each CPE thread.
The Memory Model of OpenACC*
Three kinds of memory spaces that a CPE thread can access:
• Memory space of the host thread: shared by all accelerator threads
• Private space: owned by each accelerator thread, located in host memory, large
• Local space: owned by each accelerator thread, located in the SPM, limited in size
The Principal Extensions of OpenACC*
Extend the usage of OpenACC's data-environment directives
• Use data copy inside the accelerator parallel region
• copy on parallel performs data movement and distribution between SPMs
Add new directives/clauses
• local clause, to allocate space in the SPM of a CPE thread
• Data-transform support to speed up data transfer
  • pack/packin/packout clauses
  • swap/swapin/swapout clauses
• Annotation clauses for finer compiler control of data movement and execution
  • tilemask, entire, co_compute
Performance Analysis Tools
gprof, mpiP, etc.
Run the job with the option --sw3runarg="-p -f " to generate gmon.out; then gprof a.out gmon.out shows profiling data for both the MPE and the CPEs. CPE functions carry the slave_ prefix.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
98.15 12.75 12.75 slave_Array_Waiting_For_Task
0.77 12.85 0.10 _IO_vfscanf_internal
0.38 12.90 0.05 athread_halt
0.31 12.94 0.04 slave_BARFIT
0.23 12.97 0.03 ____strtod_l_internal
0.08 12.98 0.01 __mpn_impn_sqr_n_basecase
0.08 12.99 0.01 slave_bipol
Performance Counters

MPE cycle count:

void rpcc_(unsigned long *counter)
{
    unsigned long rpcc;
    asm("rtc %0": "=r" (rpcc) : );
    *counter = rpcc;
}

CPE cycle count:

void rtc_(unsigned long *counter)
{
    unsigned long rpcc;
    asm volatile("rcsr %0, 4":"=r"(rpcc));
    *counter = rpcc;
}
Performance-data access paths on the SW26010:
• Path 1: CPE status register → CPE general-purpose register
• Path 2: CPE status register → global memory mapping → MPE general-purpose register
• Path 3: CPE status register → global memory mapping → other core groups
SW26010 performance-counter configuration. Each CPE has:
• 1 cycle counter (CC)
• 3 performance counters (PCR): 2 count in-core performance events, 1 counts transfer events; the top 5 bits select the event
• 1 performance-counter control register (PCRC): switches the 3 performance counters on/off and selects floating-point event counting
• Any MPE on the chip can access the PC/CC/PCR/PCRC of any CPE
Events countable by PCR0/PCR1
• PCR0:
  • (conditional) branch instruction count
  • instruction count
  • pipeline-0 instruction count
  • cycles stalled on a full LD/ST buffer
  • cycles stalled on a full GLQ/GSQ
  • cycles stalled on a full channel buffer
  • LDM read/write count
  • ICache access count
  • cycles the core is blocked accessing the LDM
  • branch misprediction count
  • LDM access conflict count
  • under-utilized LDM bandwidth count
  • add/subtract/multiply count
• PCR1:
  • branch miss count
  • (un)conditional branch count
  • pipeline-1 instruction count
  • synchronization stall cycles
  • memory-access stall cycles
  • register-communication stall cycles
  • cycles unable to issue at full rate
  • data-dependence stall cycles
  • sync/conditional-branch stall cycles
  • ICache miss count
  • total LDM read/write count
  • transfer/atomic/address-dependence stall cycles
  • full LDM bandwidth count
  • divide/square-root count
Events countable by PCR2
• total requests on the array control network
• GLD/GST count
• GF&A count
• GUPDT count
• L1 ICache miss count
• row/column synchronization count
• user interrupt count
• DMA request count
• SBMD start/stop interrupt count
• pipeline stalls caused by misses
• SBMD cycle count
• autonomous-run cycle count
• register-communication sends (row/column/total)
• external LDM read/write count
• DMA reply-word increment count
• total instruction fills
• instruction-fill miss count
• SBMD instruction-fill count
• main-memory read/write response count
• total external I/O reads/writes
• row/column/total read & send count
Introduction to Beacon
A full-stack I/O resource monitoring and diagnosis system deployed on the Sunway TaihuLight supercomputer
Lightweight and low-overhead; collected data are compressed online without sacrificing accuracy
Collects I/O behavior data from the compute nodes, I/O forwarding nodes, and storage nodes, and uses them to characterize the I/O behavior of both applications and the system
Beacon Architecture
[Figure: compute nodes (40,960; LWFS client) → I/O forwarding nodes (160; LWFS server + Lustre client) → storage nodes (288; Lustre server) and metadata nodes (2; MDS). Profiling points with data compression feed a trace/log data collector (Logstash) into a distributed log database (Elasticsearch) with an in-memory cache (Redis), hosted on 84+1 part-time servers (85 storage nodes, N1–N85); a job database (MySQL), statistics analysis, and the I/O diagnostic system serve users on top.]
Beacon Visualization Tools
For compute nodes, forwarding nodes, and storage nodes:
• bandwidth
• IOPS
• metadata accesses
• data distribution
• number of active I/O processes
For system administrators:
• system load
• user statistics
For users:
• application details
• historical application queries and statistical analysis
Outline
Sunway TaihuLight
Programming and performance-analysis tools on Sunway TaihuLight
Parallel algorithm design on Sunway TaihuLight
Major Features to Consider
Sunway TaihuLight:
• 125 Pflops
• 32 GB and 136 GB/s per node (22 flops/byte)
• 10 million cores (MPE + CPE)
• user-controlled 64 KB LDM
• register communication among CPEs
Intel KNL 7250 of Cori: 6.5 flops/byte
NVIDIA P100 of Piz Daint: 7.2 flops/byte
Major Challenge #1: Scaling
Two-level approach
first level: up to 163,840 MPI processes
second level: 64 or 65 concurrent threads
General Programming Approach
• Reserve enough parallelism for the 65 cores in each CG
• Redesign your algorithm to expose enough parallelism for the cores in each CG
Hierarchy: racks → chips → core groups → cores; in total, 163,840 MPI processes × 65 threads.
Major Challenge #2: Memory Wall
Refactoring and Redesigning
Register Communication of SW26010 Processor
[Figure: four CPEs, each cycling through register-communication operations — Get C (column get), Get R (row get), and Put.]
// P2P test
if (id % 2 == 0)
    while (1)
        putr(data, id + 1);
else
    while (1)
        getr(&data);
Xu Z., Lin J., Matsuoka S. Benchmarking SW26010 Many-Core Processor. 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, 2017.
Latency: less than 11 cycles
Bandwidth: 637 GB/s
Risks of Register Communication
1. High cost of manually determining whether the needed data is already resident (a software "cache-miss" check);
2. Register communication is limited to rows and columns.
[Figure: a communication cycle among CPEs (0,0), (0,1), (1,0), and (1,1) — a potential deadlock.]
Lin H., Tang X., Yu B., et al. Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores. 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), IEEE, 2017: 635–645.
Characteristics of Many-Core Architectures (including SW26010)
Similarities:
• Many small cores and wide vectors (fine-grained parallelism); SW26010's vector width is comparatively low
• Memory bandwidth cannot keep pace with compute (locality matters); the bandwidth problem is more pronounced on SW26010
• Low clock rates and complex node/topology architectures → how to saturate the network (heterogeneous programs, dedicated communication optimizations)
Differences:
• Intel and NVIDIA treat multithreading moderately and aggressively, respectively
• Constrained by its process technology, SW focuses on the memory system
• Programming standards are not unified (OpenMP, OpenACC, ...)

Deep Optimization for Many-Core
• Transition from coarse-grained parallelism to fine-grained (hierarchical) parallelism
• Transition from homogeneous to heterogeneous programs, supporting computation-communication overlap and even functional parallelism and asynchrony
• Algorithm design for a high compute-to-memory-access ratio
• Deep regularization of memory accesses (regularity, locality, fewer accesses): data structures and top-level algorithms must be re-evaluated and possibly redesigned, led by application experts
• Program portability: choose an appropriate implementation language
• Hiding hardware/software details to accelerate understanding and analysis
• Support from basic algorithms, compilation/analysis tools, libraries, and systems — work that computer engineers can largely track and deliver
Examples (a personal selection)
• Yulong Ao, et al., 26 PFLOPS Stencil Computation for Atmospheric Modeling on Sunway TaihuLight, IPDPS 2017
• Xinliang Wang, et al., swSpTRSV: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architectures, PPoPP 2018
• Heng Lin, et al., Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores, IPDPS 2017
• Haohuan Fu, et al., 18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios, SC 2017
• Haohuan Fu, et al., Redesigning CAM-SE on Sunway TaihuLight for Peta-Scale Performance and Ultra-High Resolution, SC 2017
Computation-Communication Overlapping
• An inner-outer subdomain partition:
  • Maximize the most contiguous dimension.
  • Deepen the halo to replace communication by computation (possible future work).
• Overlap the halo exchange and data movement on the MPE with the inner computation on the CPEs.
  • MPE: halo exchange, copy, pack, and unpack.
  • CPE: inner and halo computation.
Optimization of Comp-Comm Overlapping
• DMGlobalToLocalBegin (PETSc):
  1. Copy the inner data.
  2. Pack the halo data.
  3. Initiate the communication.
• DMGlobalToLocalEnd (PETSc):
  1. Wait for the end of the communication.
  2. Unpack the halo data.
• MPE: halo exchange.
• CPE: copy, pack (optional), unpack, inner and halo computation.
Optimization of pack and unpack: process data contiguously on the CPE cluster as much as possible, until a switch is detected according to the loaded index array.
Locality-aware Thread Blocking (2.5D)
[Figure: three steps of a sweep through the z-y plane; at each step the planes rotate roles — unused plane, prefetching plane, current computing plane, dependent plane — so that prefetching overlaps computation.]
• Each subdomain is partitioned along the z-x plane into small blocks.
• Each thread is responsible for one block, which is processed plane by plane.
• The data required by a CPE are loaded into the LDM by DMA.
• A circular array and the double-buffering approach are employed.
Optimization of Locality-aware Thread Blocking
[Figure: data mapping of the inner and halo parts onto the 8×8 CPE cluster (rows R0–R7, columns C0–C7).]
• For the inner part: a second-level partitioning along the y direction may be conducted if the occupation rate is less than 50%.
• For the halo part:
  • Each area is treated as a flat rectangular block.
  • Different areas are scheduled to the CPE cluster simultaneously.
  • The west and east areas need to be padded for vectorization.
Collaborative Data Accessing
[Figure: 4 CPEs with 4×4 block size; a 20×32 z-x region with a 2-layer halo is (1) split and loaded from memory in contiguous chunks, (2) duplicated, and (3) exchanged among CPE 0–3.]
1. Grouping: 4 CPEs are grouped together; each loads a contiguous larger chunk through DMA.
2. Duplicating: some data on each CPE are duplicated to construct data pieces that include the 2-layer halos.
3. Exchanging: the resulting data pieces are exchanged on chip so that each CPE gets the data it requires.
Online Data Layout Transformation
• Conversion between array of structures (AoS) and structure of arrays (SoA): 4 cell structures (each with 6 double elements) become 6 vectors (each with 4 double elements).
• The shuffle instruction can produce one output vector from two input vectors in one cycle.
Examples (a personal selection)
• Yulong Ao, et al., 26 PFLOPS Stencil Computation for Atmospheric Modeling on Sunway TaihuLight, IPDPS 2017
• Xinliang Wang, et al., swSpTRSV: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architectures, PPoPP 2018
• Heng Lin, et al., Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores, IPDPS 2017
• Haohuan Fu, et al., Redesigning CAM-SE on Sunway TaihuLight for Peta-Scale Performance and Ultra-High Resolution, SC 2017
• Haohuan Fu, et al., 18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios, SC 2017
Limited Memory System of SW26010
• Limited fast memory (only 64 KB per core)
• Limited memory bandwidth (22 flops/byte; 34 GB/s per CG in theory)
• Restrictions applications must work around for good performance:
  • Manually controlled SPM (cache-less architecture)
  • Single CG: 22.6 GB/s (DMA) vs. 1.5 GB/s (fine-granularity global load/store)
  • Register communication only supports in-row and in-column communication
Challenges for Designing SpTRSV on SW architecture
Limitations of the memory system:
• Manually controlled SPM
• DMA vs. gload
• Restricted register communication
Refactoring in algorithm and implementation:
• Check whether data is in the SPM; load missed data elements
• Support arbitrary two-core communication
Problems:
• High overhead
• Inefficient memory access
• Potential deadlock*
*Lin H., Tang X., Yu B., et al. Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores, IPDPS 2017: 635–645.
swSpTRSV
Sparse level tiles:
• suited to the software-managed cache
• suited to coarse-grained DMA
Producer-consumer pairing:
• suited to the regular register communication
• avoids deadlock
Sparse Level Tiles
[Figure: accesses to the solution vector x are (1) fine-grained, (2) random, and (3) not prefetchable; accesses to the right-hand-side vector b are (1) coarse-grained, (2) predictable, and (3) prefetchable.]
Sparse Level Tiles
[Figure: the matrix is partitioned into sparse tiles across B regions 0–3 and X regions 0–3; the numbers 0–12 mark the tile processing order.]
• Predictable: each sparse tile is limited in how many x entries it solves and how many b entries it updates, guaranteeing cache hits without extra branch statements
• Prefetchable: tiles are stored in the same order as they are computed, and the demand for the right-hand-side vector b is contiguous
• Coarse-grained: all non-zeros of a tile are stored together and can be read in bulk; for the right-hand-side vector b, contiguous + prefetchable = coarse-grained
swSpTRSV
Sparse level tiles: suited to the software-managed cache and to coarse-grained DMA
Producer-consumer pairing: suited to the regular register communication; avoids deadlock
Producer-Consumer Pairing
[Figure: the 8×8 CPE mesh is split into a 4-column producer half and a 4-column consumer half (8 rows each), pairing CPEs (0,0)–(7,3) with CPEs (0,4)–(7,7).]
Without pairing, one core performs both steps:
x_j = b_j / l_jj;  b_i = b_i − l_ij · x_j
With pairing, the work is split: the producer computes Δ_ij = l_ij · x_j, and the consumer performs x_j = b_j / l_jj and b_i = b_i − Δ_ij.
Examples (a personal selection)
• Yulong Ao, et al., 26 PFLOPS Stencil Computation for Atmospheric Modeling on Sunway TaihuLight, IPDPS 2017
• Xinliang Wang, et al., swSpTRSV: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architectures, PPoPP 2018
• Heng Lin, et al., Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores, IPDPS 2017
• Haohuan Fu, et al., 18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios, SC 2017
• Haohuan Fu, et al., Redesigning CAM-SE on Sunway TaihuLight for Peta-Scale Performance and Ultra-High Resolution, SC 2017
Compression: Squeezing Extra Performance

              Peak      Utilized   %
Flops         765 G     94.7 G     12.2%
Memory size   5 G       4.6 G      92%
Memory BW     34 GB/s   25 GB/s    73.5%
LDM size      64 KB     60 KB      93.8%

[Figure: data resides in DDR memory in compressed form; DMA moves it into the 64 KB LDM of each CPE, where decompress → compute functions → compress run.]
enable even larger problems
pumping more data in and out
Compression: Not an Easy Task
• Additional complexity and cost
• Extra LDM reads/writes due to compression/decompression operations
• Broken floating-point instruction pipeline
Compression: Further Optimization
• Direct on-the-fly compression of every point: 1/3 of the original performance.
• Buffering a plane instead of compressing every point on the fly: from 1/3 to 90% of the original performance.
• Switching the buffering of temporary variables from LDM to registers using intrinsic assembly instructions, especially around function calls: from 90% to 120%, and finally to 130%, of the original performance.

Before (temporaries bounce through LDM):
    LOAD  LDM1, $ra
    SSL   $ra, $ra
    STORE $ra, LDM1
    LOAD  LDM2, $rb
    SSL   $rb, $rb
    STORE $rb, LDM2
    LOAD  LDM3, $rc
    SSL   $rc, $rc
    STORE $rc, LDM3
    LOAD  LDM1, $ra
    LOAD  LDM2, $rb
    ADD   $ra, $rb, $ra
    LOAD  LDM3, $rc
    MUL   $ra, $rc, $ra
    STORE $ra, LDM2

After (temporaries stay in registers):
    LOAD  LDM1, $ra
    SSL   $ra, $ra
    LOAD  LDM2, $rb
    SSL   $rb, $rb
    LOAD  LDM3, $rc
    SSL   $rc, $rc
    ADD   $ra, $rb, $ra
    MUL   $ra, $rc, $ra
    STORE $ra, LDM2
On-the-fly Compression
[Figure: seismograms at Cangzhou and Ninghe stations over 0–120 s; blue solid line: base run; red dashed line: compressed run.]
Examples (a personal selection)
• Yulong Ao, et al., 26 PFLOPS Stencil Computation for Atmospheric Modeling on Sunway TaihuLight, IPDPS 2017
• Xinliang Wang, et al., swSpTRSV: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architectures, PPoPP 2018
• Heng Lin, et al., Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores, IPDPS 2017
• Haohuan Fu, et al., 18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios, SC 2017
• Haohuan Fu, et al., Redesigning CAM-SE on Sunway TaihuLight for Peta-Scale Performance and Ultra-High Resolution, SC 2017
Finding More Parallelism
[Figure: on the cubed-sphere mesh, each element is decomposed into sub-elements 0 through i−1, moving from "one CPE per element" to "i CPEs per element" — more parallelism or a lower LDM requirement; realized through decomposition, OpenACC, or direct (Athread) implementations.]
Challenge for the Future: the CAM-SE Effort Took 3 Years — How Can It Be Done More Easily?
• Initial code (DYN + PHY): 754,129 LOC; 304 kernels with no hotspots
• Modified 152,336 LOC; added 57,709 LOC
• Optimized 185 kernels, covering 80% of total runtime
Thank you!
Thanks for listening!