Parallel Algorithm Design on Sunway TaihuLight — Wei Xue, Department of Computer Science and Technology, Tsinghua University / National Supercomputing Center in Wuxi — May 30, 2018, Tsinghua University



  • Parallel Algorithm Design on Sunway TaihuLight

    Wei Xue

    Department of Computer Science and Technology, Tsinghua University; National Supercomputing Center in Wuxi

    May 30, 2018, Tsinghua University

  • Outline

    Sunway TaihuLight

    Programming and performance-analysis tools on Sunway TaihuLight

    Parallel algorithm design on Sunway TaihuLight

  • Comparison of top high-end computing systems (latest TOP500 list)

    System              Peak (PFLOPS)  Sustained (PFLOPS)  Efficiency (MFLOPS/W)  Total cores   Memory (PB)  Processor architecture
    Sunway TaihuLight      125.436          93.015              6,051.131          10,649,600      1.3       heterogeneous many-core
    Tianhe-2 (KNC)          54.90           33.86               1,901.54            3,120,000      1.0       heterogeneous many-core
    Piz Daint (P100)        25.33           19.59               8,622.36              361,760      0.34      heterogeneous many-core
    Gyoukou                 28.19           19.14              14,174.67           19,860,000      0.58      heterogeneous many-core
    Titan (K20x)            27.11           17.59               2,142.77              560,640      0.7       heterogeneous many-core
    Sequoia (BQC)           20.13           17.17               2,176.58            1,572,864      1.6       many-core
    Trinity                 43.90           14.14               3,677.76              979,968      2.07      many-core
    Cori (KNL)              27.88           14.01               3,557.93              622,336      0.9       many-core

    Many-core has become the trend in high-performance computing.

  • Comparison of current high-end systems (2017.9)

    System              TaihuLight  Tianhe-2  Piz Daint  Titan  Sequoia
    Rank of Top500           1          2         3         4       5
    Rank of Green500        17        147         6       109     100
    Rank of Graph500         2          8         /         /       3
    Rank of HPCG             3          2         4         8       7

    The K computer holds Rank 1 on both Graph500 and HPCG.

  • The domestic many-core processor SW26010

    Per CPU:
    Peak performance    3.06 TFlops (DP)
    Memory              32 GB
    Memory bandwidth    136.5 GB/s
    # CPUs              1
    # cores             260

    For comparison:
    Intel KNL (2016): DP ~3 TF; memory BW 400+ GB/s (MCDRAM), 90+ GB/s (DDR)
    NVIDIA Pascal (2016): DP 5+ TF; NVLink 80 GB/s; PCIe Gen3 16 GB/s

  • SW26010: Sunway 260-Core Processor

    [Chip diagram: four core groups (CG0-CG3) connected by a network on chip (NoC) and a data-transfer network; each core group contains one MPE, an 8x8 CPE mesh, a protocol processing unit (PPU), and a memory controller (iMC) attached to its own memory. Each CPE in the mesh has a computing core, registers, and an LDM, and is connected to the row/column communication buses, the control network, and a transfer agent (TA); the memory hierarchy spans the memory, LDM, register, and computing levels.]

  • The Sunway (Shenwei) architecture

    [Diagram: each core group consists of an MPE (with L1 and L2 caches), a memory controller, and an 8x8 CPE array (CPE (0,0) ... CPE (7,7)), each CPE with its own software-managed cache (SPM); the four core groups and their memories are connected through the network on chip and an interface bus.]

    Direct Memory Access (DMA) at 26+ GB/s; global load/store (gload/gstore) at 1.5 GB/s.

    • Software-managed cache (SPM): 64 KB per CPE; in pipelined operation, 32 B can be read/written per cycle.

    • Coarse-grained DMA

    • Register communication: P2P latency below 11 instruction cycles; aggregate bandwidth of 600+ GB/s.

    Xu Z, Lin J, Matsuoka S. Benchmarking the SW26010 many-core processor. Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, 2017: 743-752.

  • NRCPC

    The measured memory bandwidth of one CG with DMA sequential read

    Theoretical BW for each CG is 34 GB/s.

  • SW26010 Capability Model

    Total Time = Computation Time + Memory Access Time - Overlapping Time

    Computation Time = Instructions / (Average instruction parallelism)

    Transaction Count = Data Size / (Active CPEs x Transaction Size)
    Memory Access Time = Transaction Count x Transaction Size / Bandwidth

    Which bound? Determined from the number of active CPEs, (MRP - 1) x DMA latency, and the computation time.

    Overlapping Time (depending on which resource is the bound):
    Case 1: MAT x (1 - 1/Cycle)
    Case 2: CT x (1 - 1/Cycle)

    Shizhen Xu et al., Taming the "Monster": Overcoming Program Optimization Challenges on SW26010 Through Precise Performance Modeling, IPDPS 2018.

  • SW26010 Capability Model

    MRT latency refers to the memory access time seen by a CPE.

  • SW26010 Capability Model: Rodinia 3.1 OpenMP Benchmarks

    The mean absolute error of the model is 3.5%; the maximum absolute error is 12.3% (on BFS).

  • The Sunway TaihuLight system

    [System hierarchy: domestic processor -> compute node -> compute board -> compute supernode -> compute cabinet -> compute system, with peak performance of 3.168 TFlops per node, 6.336 TFlops and 25.344 TFlops at the board level, 811.008 TFlops per supernode, 3.244 PFlops per cabinet (1024 processors), and 125.436 PFlops for the full system (40 cabinets).]

    Interconnect: InfiniBand FDR with a two-level network (intra-supernode and inter-supernode).
    - Within a supernode: all-to-all interconnect; the 4 core groups of a node contend for one IB port; with the 16x16 configuration, destination nodes that are equal modulo 16 go through the same router.
    - Between supernodes: a tree network with 1/4 tapering; destination processors that are equal modulo 64 share the same link.

  • Sunway TaihuLight storage system architecture

    Challenges:
    • Complex I/O paths
    • Resource contention
    • Buffer usage
    • Scheduling and queuing

  • Outline

    Sunway TaihuLight

    Programming and performance-analysis tools on Sunway TaihuLight

    Parallel algorithm design on Sunway TaihuLight

  • NRCPC

    Principal Programming Model on TaihuLight: MPI + X

    X: OpenACC* / Athread

    One MPI process runs on each management core (MPE).

    OpenACC* handles data transfer between main memory and the on-chip memory (SPM), and distributes the kernel workload across the compute cores (CPEs).

    Athread is the threading library that manages threads on the compute cores (CPEs); it is also used in the OpenACC* implementation. A minimal sketch follows.
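    As a minimal sketch of this MPI + Athread structure (one MPI process per MPE, 64 CPE threads spawned through the athread library): the SLAVE_FUN declaration macro, the exact athread signatures, and the kernel name here are assumptions to be checked against the Sunway programming guide, and the CPE kernel uses plain global loads/stores instead of DMA for brevity.

    /* MPE side (master.c): one MPI process per core group. */
    #include <mpi.h>
    #include <athread.h>

    typedef struct { double *a; int n; } args_t;
    extern void SLAVE_FUN(scale_kernel)(void *);   /* declaration macro assumed from athread.h */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        athread_init();                            /* bring up the 64 CPEs of this core group */

        double a[640];
        args_t args = { a, 640 };
        athread_spawn(scale_kernel, &args);        /* run the kernel on all 64 CPEs */
        athread_join();                            /* wait for the CPE threads to finish */

        athread_halt();
        MPI_Finalize();
        return 0;
    }

    /* CPE side (slave.c), compiled with the slave compiler. */
    #include <slave.h>

    typedef struct { double *a; int n; } args_t;

    void scale_kernel(void *p) {
        args_t *args = (args_t *)p;
        for (int i = _MYID; i < args->n; i += 64)  /* _MYID: this CPE's id, 0..63 */
            args->a[i] *= 2.0;                     /* gload/gstore here; real codes stage data in the LDM via DMA */
    }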

  • NRCPC

    Brief view of SWACC/SWAFORT compiler

    OpenACC* is a directive-based programming tool for SW26010:

    Based on OpenACC 2.0

    Extended for the architecture of SW26010

    Supported by the SWACC/SWAFORT compilers

    Interactive debugging supported

    OpenACC* compilers: SWACC and SWAFORT

    Source-to-source compilers (SWACC: C99; SWAFORT: Fortran 2003), developed by NRCPC

    Based on the ROSE compiler infrastructure (0.9.6a)

    • An open-source compiler infrastructure for building source-to-source program transformation and analysis tools

    • Developed by LLNL

  • NRCPC

    Brief view of SWACC/SWAFORT compiler

    Source code with OpenACC* directives:

    int A[1024][1024]; int B[1024][1024]; int C[1024][1024];
    #pragma acc parallel loop copyin(B, C) copyout(A)
    for (i = 0; i < 1024; i++) {
        for (j = 0; j < 1024; j++) {
            A[i][j] = B[i][j] + C[i][j];
        }
    }

    Generated MPE code:

    CPEs_spawn(CPE_kernel, args);
    ...

    Generated CPE code:

    __SPM_local int SPM_A[1][1024];
    __SPM_local int SPM_B[1][1024];
    __SPM_local int SPM_C[1][1024];
    void CPE_kernel(args) {
        for (i = CPE_id; i < 1024; i += CPE_num) {
            dma_get(&B[i][0], SPM_B, 4096);
            dma_get(&C[i][0], SPM_C, 4096);
            for (j = 0; j < 1024; j++) {
                SPM_A[0][j] = SPM_B[0][j] + SPM_C[0][j];
            } // j-loop
            dma_put(SPM_A, &A[i][0], 4096);
        } // i-loop
    }

    Compilation flow: source code with OpenACC* directives -> SWACC -> basic compiler -> a.out

    Compute pattern: data in to SPM -> calculation -> data out to main memory

    Workload distribution and the size of each data transfer are determined automatically by the compiler.

  • NRCPC

    Motivation to extend OpenACC

    Memory model of the OpenACC standard [OpenACC 2.0, Section 1.3]:

    Mainly targets devices with non-shared memory
    • The memory on the accelerator may be physically and/or virtually separate from host memory; all data movement between host memory and device memory must be performed by the host thread.
    • This is the case with most current GPUs.

    The data environment of OpenACC can be ignored on shared-memory devices
    • The implementation need not create new copies of the data for the device, and no data movement between host and device is required.

    What we need: a data environment for OpenACC* that utilizes the high-speed SPMs and the aggregated bandwidth of the CPEs for performance.

  • NRCPC

    The difference between the memory models

    [Diagram: in standard OpenACC, host memory and device memory are separate; the memory that accelerator threads can access is the device memory, and data movement is executed by the host thread. On SW26010, the MPE and the CPEs share the host memory, each CPE has its own SPM, and data movement is initiated by each CPE thread.]

  • NRCPC

    The memory model of OpenACC*: three kinds of memory spaces that a CPE thread can access

    • Memory space of the host thread: shared by all accelerator threads.
    • Private space: owned by each accelerator thread, located in host memory, large.
    • Local space: owned by each accelerator thread, located in the SPM, limited in size.

    [Diagram: each accelerator thread (1..n) has a private space in host memory and a local space in its SPM, alongside the shared memory space of the host thread.]

  • NRCPC

    The principal extensions of OpenACC*

    Extended usage of OpenACC's data-environment directives
    • Data copy can be used inside an accelerator parallel region
    • copy on parallel performs data movement and distribution between SPMs

    New directives/clauses (an illustrative sketch follows)
    • local clause: allocates space in the SPM of a CPE thread
    • Data-transformation support to speed up data transfer: pack/packin/packout and swap/swapin/swapout clauses
    • Annotation clauses for better compiler control of data movement and execution: tilemask, entire, co_compute
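    An illustrative sketch of how the extended clauses might appear in source code; the clause names (copyin, copyout, local) come from this slide, but the exact SWACC syntax and semantics (array sections, tiling) are assumptions, not confirmed here.

    double a[1024][512], b[1024][512];
    double buf[512];   /* per-CPE-thread scratch buffer, to be placed in the SPM via local() */

    /* Illustrative only: the precise SWACC clause syntax follows the compiler manual. */
    #pragma acc parallel loop copyin(a) copyout(b) local(buf)
    for (int i = 0; i < 1024; i++) {
        for (int j = 0; j < 512; j++) buf[j] = 2.0 * a[i][j];  /* work inside the SPM buffer */
        for (int j = 0; j < 512; j++) b[i][j] = buf[j];        /* written back row by row */
    }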

  • Performance analysis tools: gprof, mpiP, etc.

    When submitting a job, add the option --sw3runarg="-p -f" to generate gmon.out; then gprof a.out gmon.out shows profiling data for both the MPE and the CPEs. CPE functions carry the slave_ prefix.

      %   cumulative    self              self     total
     time    seconds   seconds   calls  Ts/call  Ts/call  name
    98.15      12.75     12.75                            slave_Array_Waiting_For_Task
     0.77      12.85      0.10                            _IO_vfscanf_internal
     0.38      12.90      0.05                            athread_halt
     0.31      12.94      0.04                            slave_BARFIT
     0.23      12.97      0.03                            ____strtod_l_internal
     0.08      12.98      0.01                            __mpn_impn_sqr_n_basecase
     0.08      12.99      0.01                            slave_bipol

  • Performance counters

    Cycle counting on the MPE:

    void rpcc_(unsigned long *counter)
    {
        unsigned long rpcc;
        asm volatile("rtc %0" : "=r"(rpcc));
        *counter = rpcc;
    }

    Cycle counting on the CPE:

    void rtc_(unsigned long *counter)
    {
        unsigned long rpcc;
        asm volatile("rcsr %0, 4" : "=r"(rpcc));
        *counter = rpcc;
    }
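    A short usage sketch: bracket a region of interest with the counter routine above and convert cycles to seconds (1.45 GHz is the published SW26010 clock; adjust if the actual clock differs). The timed function is a hypothetical placeholder.

    #include <stdio.h>

    extern void rpcc_(unsigned long *counter);   /* MPE cycle counter defined above */

    void time_region(void (*region)(void))       /* 'region' is any function to be timed */
    {
        unsigned long t0, t1;
        rpcc_(&t0);
        region();
        rpcc_(&t1);
        printf("elapsed: %lu cycles (%.6f s)\n", t1 - t0, (double)(t1 - t0) / 1.45e9);
    }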

  • Performance-data access paths on SW26010

    • Path 1: CPE status registers -> CPE general-purpose registers
    • Path 2: CPE status registers -> global memory mapping -> MPE general-purpose registers
    • Path 3: CPE status registers -> global memory mapping -> other core groups

  • SW26010 performance-counter configuration

    Each CPE has:
    • 1 cycle counter (CC)
    • 3 performance counters (PCR): 2 count in-core performance events and 1 counts transfer events; the top 5 bits select the event
    • 1 performance-counter control register (PCRC): enables/disables the 3 performance counters and selects floating-point event counting

    Any MPE on the same chip can access the PC/CC/PCR/PCRC of any CPE.

  • Events countable by PCR0/PCR1

    PCR0:
    • (conditional) branch instruction count
    • instruction count
    • pipeline-0 instruction count
    • cycles stalled on a full LD/ST buffer
    • cycles stalled on a full GLQ/GSQ
    • cycles stalled on a full channel buffer
    • LDM read/write count
    • ICache access count
    • cycles the core is stalled when accessing the LDM
    • branch misprediction count
    • LDM access conflict count
    • count of LDM accesses not using full bandwidth
    • add/subtract/multiply count

    PCR1:
    • branch miss count
    • (un)conditional branch count
    • pipeline-1 instruction count
    • cycles stalled on synchronization
    • cycles stalled on memory access
    • cycles stalled on register communication
    • cycles not issuing at full rate
    • cycles stalled on data dependences
    • cycles stalled on synchronization / conditional branches
    • ICache miss count
    • total LDM read/write count
    • cycles stalled on transfer/atomic/address dependences
    • count of LDM accesses at full bandwidth
    • divide/square-root count

  • Events countable by PCR2

    • total requests on the array control network
    • GLD/GST count
    • GF&A count
    • GUPDT count
    • L1 ICache miss count
    • row/column synchronization count
    • user interrupt count
    • DMA request count
    • SBMD start/stop interrupt count
    • pipeline stalls caused by misses
    • SBMD cycle count
    • autonomous-run cycle count
    • register-communication sends (row/column/total)
    • external LDM read/write count
    • DMA reply-word increment count
    • total instruction fills
    • instruction fill miss count
    • SBMD instruction fill count
    • main-memory read/write response count
    • total external I/O reads/writes
    • read-and-send count (row/column/total)

  • Introducing Beacon

    A full-stack I/O resource monitoring and diagnosis system deployed on the Sunway TaihuLight supercomputer

    Lightweight and low-overhead; the collected data are compressed online while preserving accuracy

    Collects and analyzes I/O behavior data from compute nodes, I/O forwarding nodes, and storage nodes, and uses these data to characterize the I/O behavior of applications and of the system

  • Beacon architecture

    [Architecture diagram: compute nodes (40,960, LWFS client), I/O forwarding nodes (160, LWFS server / Lustre client), storage nodes (288, Lustre server), and metadata nodes (2, MDS), each instrumented with profiling points and data compression; trace/log data are collected (Logstash) into an in-memory cache (Redis) and a distributed log database (Elasticsearch) running on 84+1 part-time servers (85 storage nodes, N1...N85); a job database (MySQL) and statistics analysis feed the I/O diagnostic system and the user.]

  • Beacon visualization tool

    For compute nodes, forwarding nodes, and storage nodes:
    • bandwidth
    • IOPS
    • metadata accesses
    • data distribution
    • number of active I/O processes

    For system administrators:
    • system load
    • user statistics

    For users:
    • application details
    • query and statistical analysis of past applications

  • Outline

    Sunway TaihuLight

    Programming and performance-analysis tools on Sunway TaihuLight

    Parallel algorithm design on Sunway TaihuLight

  • Sunway TaihuLight: Major Features to Consider

    125 Pflops

    32 GB and 136 GB/s per node (22 flops/byte)

    10 million cores, MPE + CPE

    user-controlled 64 KB LDM

    register communication among CPEs

  • Sunway TaihuLight: Major Features to Consider (continued)

    The 22 flops/byte ratio of TaihuLight compares with:
    Intel KNL 7250 of Cori: 6.5 flops/byte
    NVIDIA P100 of Piz Daint: 7.2 flops/byte

  • Sunway TaihuLight: Major Challenge #1: Scaling (10 million MPE + CPE cores)

  • General Programming Approach: two levels

    First level: up to 163,840 MPI processes (racks -> chips -> core groups)

    Second level: 64 or 65 concurrent threads per core group (cores; 163,840 processes x 65 threads = 10,649,600 cores in total)

    Reserve enough parallelism for the 65 cores in each CG; redesign your algorithm to expose enough parallelism for the cores in each CG.

  • Sunway TaihuLight: Major Challenge #2: Memory Wall

    32 GB and 136 GB/s per node (22 flops/byte): refactoring and redesigning are required.

  • Register Communication of the SW26010 Processor

    [Diagram: CPEs in a row/column exchange data through Get C (column get), Get R (row get), and Put operations.]

    // P2P test
    if (id % 2 == 0)
        while (1)
            putr(data, id + 1);   // register put toward CPE id+1
    else
        while (1)
            getr(&data);          // register get (receive)

    Latency: less than 11 cycles
    Bandwidth: 637 GB/s

    Xu, Zhigeng, James Lin, and Satoshi Matsuoka. "Benchmarking SW26010 Many-Core Processor." Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, 2017.

  • Risks of Register Communication

    1. High overhead of manually checking whether data are already resident in the software-managed cache (SPM);

    2. Limitations of register communication: only row/column communication is supported, and careless pairing can create a potential deadlock cycle among CPEs (e.g., (0,0) -> (0,1) -> (1,1) -> (1,0)).

    Lin H, Tang X, Yu B, et al. Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores. Parallel and Distributed Processing Symposium (IPDPS), IEEE, 2017: 635-645.

  • Characteristics of many-core architectures (including SW26010)

    Similarities:
    • Many small cores and wide vectors (fine-grained parallelism); the vector width of SW26010 is comparatively low
    • Memory bandwidth cannot keep up with computation (locality matters); the bandwidth problem is even more pronounced on SW26010
    • Low clock frequency, complex nodes, and network topology -> how to saturate the network (heterogeneous programs, dedicated communication optimizations)

    Differences:
    • Intel and NVIDIA take a moderate versus an aggressive approach to multithreading
    • SW, constrained by its process technology, leans on the memory system instead
    • Programming standards are not unified (OpenMP, OpenACC, ...)

  • Deep optimization for many-core

    • Shift from coarse-grained parallelism to fine-grained (hierarchical) parallelism

    • Shift from homogeneous to heterogeneous programs, supporting computation-communication overlap and even functional parallelism and asynchrony

    • Algorithm design for a high compute-to-memory-access ratio

    • Deep regularization of memory accesses (regularity, locality, small volume); re-evaluation and possible redesign of data structures and top-level algorithms, led by application experts

    • Program portability: choose a suitable implementation language; hide differing hardware and software details to speed up understanding and analysis; support from basic algorithms, compilers/analysis tools, libraries, and system software, which can largely be tracked and delivered by computer engineers

  • Examples (a personal selection)

    • Yulong Ao, et al., 26 PFLOPS Stencil Computation for Atmospheric Modeling on Sunway TaihuLight, IPDPS 2017

    • Xinliang Wang, et al., swSpTRSV: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architectures, PPoPP 2018

    • Heng Lin, et al., Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores, IPDPS 2017

    • Haohuan Fu, et al., 18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios, SC 2017

    • Haohuan Fu, et al., Redesigning CAM-SE on Sunway TaihuLight for Peta-Scale Performance and Ultra-High Resolution, SC 2017

  • Computation-Communication Overlapping

    • An inner-outer subdomain partition:
      • Maximize the most contiguous dimension.
      • Deepen the halo to replace communication with computation (possibly in future work).

    • Overlap the halo exchange and data movement on the MPE with the inner computation on the CPEs (see the sketch below).
      • MPE: halo exchange, copy, pack, and unpack.
      • CPE: inner and halo computation.
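    A minimal sketch of this overlap pattern, simplified to a 1D exchange with two neighbors; the SLAVE_FUN macro and the CPE kernels (inner_kernel, halo_kernel) are hypothetical placeholders, and the copy/pack/unpack work done on the MPE in the real code is omitted.

    #include <mpi.h>
    #include <athread.h>

    extern void SLAVE_FUN(inner_kernel)(void *);   /* hypothetical CPE kernel for the inner subdomain */
    extern void SLAVE_FUN(halo_kernel)(void *);    /* hypothetical CPE kernel for the halo subdomain */

    void step(double *send_buf, double *recv_buf, int count,
              int left, int right, void *args, MPI_Comm comm)
    {
        MPI_Request req[4];

        /* MPE: start the halo exchange (non-blocking) ... */
        MPI_Irecv(recv_buf,         count, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Irecv(recv_buf + count, count, MPI_DOUBLE, right, 1, comm, &req[1]);
        MPI_Isend(send_buf,         count, MPI_DOUBLE, left,  1, comm, &req[2]);
        MPI_Isend(send_buf + count, count, MPI_DOUBLE, right, 0, comm, &req[3]);

        /* ... while the CPEs compute the inner subdomain */
        athread_spawn(inner_kernel, args);
        athread_join();

        /* MPE: finish the exchange, then let the CPEs compute the halo region */
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        athread_spawn(halo_kernel, args);
        athread_join();
    }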

  • Optimization of Comp-Comm Overlapping

    • DMGlobalToLocalBegin (PETSc): 1. copy the inner data; 2. pack the halo data; 3. initiate the communication.

    • DMGlobalToLocalEnd (PETSc): 1. wait for the communication to finish; 2. unpack the halo data.

    • MPE: halo exchange.

    • CPE: copy, pack (optional), unpack, inner and halo computation.

    Optimization of pack and unpack: process data as contiguously as possible on the CPE cluster until a switch is detected in the loaded index array. A sketch of the overlap around the PETSc calls follows.
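    A minimal sketch of how the two PETSc calls bracket the overlapped inner computation; the CPE kernel launches are hypothetical placeholders, and PETSc error checking is omitted for brevity.

    #include <petscdm.h>
    #include <athread.h>

    extern void SLAVE_FUN(inner_kernel)(void *);   /* hypothetical CPE kernels */
    extern void SLAVE_FUN(halo_kernel)(void *);

    PetscErrorCode overlapped_step(DM dm, Vec gvec, Vec lvec, void *args)
    {
        /* begin the scatter: copy inner data, pack halo data, initiate communication */
        DMGlobalToLocalBegin(dm, gvec, INSERT_VALUES, lvec);

        /* overlap: the CPEs work on the inner subdomain while messages are in flight */
        athread_spawn(inner_kernel, args);
        athread_join();

        /* end the scatter: wait for the communication and unpack the halo data */
        DMGlobalToLocalEnd(dm, gvec, INSERT_VALUES, lvec);

        /* halo computation once the ghost values are available */
        athread_spawn(halo_kernel, args);
        athread_join();
        return 0;
    }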

  • Locality-aware Thread Blocking (2.5D)

    [Figure: within the z-y cross-section, each block is processed plane by plane; at each step there are an unused plane, a prefetching plane, the current computing plane, and a dependent plane, so that prefetching overlaps with computation.]

    • Each subdomain is partitioned along the z-x plane into small blocks.
    • Each thread is responsible for one block, which is processed plane by plane.
    • The data required by the CPE are loaded into the LDM by DMA.
    • A circular array and the double-buffering approach are employed (see the sketch below).
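    A schematic sketch of the plane-by-plane circular-buffer/double-buffering loop on one CPE; dma_get_plane and dma_wait_slot are hypothetical wrappers for the asynchronous athread DMA calls, and the dependent plane, boundary handling, and write-back DMA are omitted.

    #define NBUF 2                 /* circular array of plane buffers (double buffering) */
    #define PLANE_SIZE 1024        /* doubles per plane tile (illustrative size) */

    /* hypothetical wrappers over the asynchronous athread DMA interface */
    void dma_get_plane(double *ldm_buf, const double *mem_block, int k);  /* async get of plane k */
    void dma_wait_slot(int slot);                                         /* wait for that slot's DMA */
    void compute_plane(const double *plane, double *out);                 /* stencil on one plane */

    void block_sweep(const double *block_in, double *block_out, int nz)
    {
        double ldm[NBUF][PLANE_SIZE];              /* CPE-local buffers; CPE stack data live in the 64 KB LDM */

        dma_get_plane(ldm[0], block_in, 0);        /* prefetch the first plane */
        for (int k = 0; k < nz; k++) {
            int cur = k % NBUF, nxt = (k + 1) % NBUF;
            if (k + 1 < nz)
                dma_get_plane(ldm[nxt], block_in, k + 1);   /* prefetch the next plane ... */
            dma_wait_slot(cur);                             /* ... while waiting only for the current one */
            compute_plane(ldm[cur], block_out + (long)k * PLANE_SIZE);
        }
    }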

  • Optimization of Locality-aware Thread Blocking

    [Figure: mapping of the inner and halo parts of a subdomain onto the 8x8 CPE cluster (rows R0-R7, columns C0-C7).]

    • For the inner part: a second-level partitioning along the y direction may be conducted if the occupation rate is below 50%.

    • For the halo part: each area is treated as a flat rectangular block; different areas are scheduled onto the CPE cluster simultaneously; the west and east areas need to be padded for vectorization.

  • Collaborative Data Accessing

    [Figure: 4 CPEs with 4x4 block size; each CPE loads a contiguous chunk from memory by DMA, duplicates part of it to form pieces with 2-layer halos, and exchanges the pieces on chip (steps 1-3).]

    1. Grouping: 4 CPEs are grouped together, and each loads a contiguous, larger chunk through DMA.
    2. Duplicating: some data on each CPE are duplicated to construct data pieces that include the 2-layer halos.
    3. Exchanging: the resulting data pieces are exchanged on chip so that each CPE obtains the data it requires.

  • Online Data Layout Transformation

    • Conversion between array of structures (AoS) and structure of arrays (SoA): 4 cell structures (each with 6 double elements) become 6 vectors (each with 4 double elements).

    • The shuffle instruction can produce a vector from two input vectors in one cycle.
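    A minimal scalar sketch of the AoS-to-SoA transform for this 4x6 case; the real kernel performs the transpose in registers with vector shuffle instructions rather than element-wise copies, and the structure layout here is an assumption.

    #define NCELL 4          /* cells processed per vector (4 doubles per 256-bit vector) */
    #define NFIELD 6         /* double fields per cell structure (assumed layout) */

    typedef struct { double f[NFIELD]; } cell_t;      /* AoS: one structure per cell */

    /* Gather 4 consecutive cells into 6 vectors of 4 doubles (AoS -> SoA). */
    void aos_to_soa(const cell_t aos[NCELL], double soa[NFIELD][NCELL])
    {
        for (int field = 0; field < NFIELD; field++)
            for (int cell = 0; cell < NCELL; cell++)
                soa[field][cell] = aos[cell].f[field];   /* each soa[field] is one vector */
    }

    /* Scatter the 6 vectors back into the 4 cell structures (SoA -> AoS). */
    void soa_to_aos(const double soa[NFIELD][NCELL], cell_t aos[NCELL])
    {
        for (int cell = 0; cell < NCELL; cell++)
            for (int field = 0; field < NFIELD; field++)
                aos[cell].f[field] = soa[field][cell];
    }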

  • Examples (a personal selection)

    • Yulong Ao, et al., 26 PFLOPS Stencil Computation for Atmospheric Modeling on Sunway TaihuLight, IPDPS 2017

    • Xinliang Wang, et al., swSpTRSV: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architectures, PPoPP 2018

    • Heng Lin, et al., Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores, IPDPS 2017

    • Haohuan Fu, et al., Redesigning CAM-SE on Sunway TaihuLight for Peta-Scale Performance and Ultra-High Resolution, SC 2017

    • Haohuan Fu, et al., 18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios, SC 2017

  • Challenges for Designing SpTRSV on the SW Architecture

    Limited memory system of SW26010:
    • Limited fast memory (only 64 KB per core)
    • Limited memory bandwidth (22 flops/byte; 34 GB/s per CG in theory)

    Restricted usage patterns for good application performance:
    • Manually controlled SPM (cache-less architecture)
    • Single CG: 22.6 GB/s (DMA) vs. 1.5 GB/s (fine-grained global load/store)
    • Register communication only supports in-row and in-column communication

  • Challenges for Designing SpTRSV on the SW Architecture

    Memory limitations:
    • Manually controlled SPM
    • DMA vs. gload
    • Restricted register communication

    Straightforward refactoring of the algorithm and implementation:
    • Check whether data are in the SPM and load missing elements
    • Support communication between arbitrary pairs of cores

    Resulting problems:
    • High overhead
    • Inefficient memory access
    • Potential deadlock*

    *Lin H, Tang X, Yu B, et al. Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores, IPDPS 2017: 635-645.

  • swSpTRSV

    Sparse level tiles:
    • suited to the software-managed cache
    • suited to coarse-grained DMA

    Producer-consumer pairing:
    • suited to the regular register communication
    • avoids deadlock

  • Sparse level tiles

    Accesses to the solution vector x: fine-grained, random, not prefetchable.

    Accesses to the right-hand-side vector b: coarse-grained, predictable, prefetchable.

  • Sparse level tiles

    [Figure: the lower-triangular matrix is partitioned into sparse tiles ordered by level (B regions 0-3 along the rows, X regions 0-3 along the columns).]

    • Predictable: the number of x entries solved and b entries updated by each sparse tile is bounded, so cache hits are guaranteed without extra branch instructions.

    • Prefetchable: the storage order of the sparse tiles matches the computation order, and the demand for the right-hand-side vector b is contiguous.

    • Coarse-grained: all nonzeros of a sparse tile are stored together and can be read in bulk; for the right-hand-side vector b, contiguous + prefetchable = coarse-grained.

  • swSpTRSV

    Sparse level tiles: suited to the software-managed cache and to coarse-grained DMA.

    Producer-consumer pairing: suited to the regular register communication; avoids deadlock.

  • Producer-consumer pairing

    The 8x8 CPE mesh (CPE (0,0) ... (7,7)) is split into a consumer half (8 rows x 4 columns) and a producer half (4 columns), pairing each consumer with a producer.

    The sequential update
        x_j = b_j / l_jj;  b_i = b_i - l_ij * x_j
    is split between the two roles:
        consumer: x_j = b_j / l_jj;  b_i = b_i - delta_ij
        producer: delta_ij = l_ij * x_j

    A schematic of the pairing follows.
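    A schematic C sketch of the role split, with hypothetical reg_send/reg_recv helpers standing in for the row/column register-communication primitives; level scheduling, the tile layout, and the routing of delta values back to the correct b entries are omitted.

    /* hypothetical wrappers over the register-communication primitives */
    void   reg_send(int dst_cpe, double v);
    double reg_recv(int src_cpe);

    /* Consumer CPE: solves x_j = b_j / l_jj and applies incoming updates b_i -= delta_ij. */
    void consumer(double *x, double *b, const double *diag,
                  int jbegin, int jend, int producer_id)
    {
        for (int j = jbegin; j < jend; j++) {
            x[j] = b[j] / diag[j];
            reg_send(producer_id, x[j]);          /* ship x_j to the paired producer */
        }
        /* ... receive delta_ij values from producers and apply b[i] -= delta_ij */
    }

    /* Producer CPE: holds off-diagonal nonzeros l_ij and computes delta_ij = l_ij * x_j. */
    void producer(const double *lvals, int nnz, int consumer_id)
    {
        double xj = reg_recv(consumer_id);        /* x_j from the paired consumer */
        for (int k = 0; k < nnz; k++)
            reg_send(consumer_id, lvals[k] * xj); /* the consumer accumulates b_i -= delta_ij */
    }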

  • Examples (a personal selection)

    • Yulong Ao, et al., 26 PFLOPS Stencil Computation for Atmospheric Modeling on Sunway TaihuLight, IPDPS 2017

    • Xinliang Wang, et al., swSpTRSV: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architectures, PPoPP 2018

    • Heng Lin, et al., Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores, IPDPS 2017

    • Haohuan Fu, et al., 18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios, SC 2017

    • Haohuan Fu, et al., Redesigning CAM-SE on Sunway TaihuLight for Peta-Scale Performance and Ultra-High Resolution, SC 2017

  • Compression: Squeezing Extra Performance

    Resource      Peak      Utilized   %
    Flops         765 G     94.7 G     12.2%
    Memory size   5 GB      4.6 GB     92%
    Memory BW     34 GB/s   25 GB/s    73.5%
    LDM size      64 KB     60 KB      93.8%

    [Figure: data are kept in DDR memory in compressed form, transferred by DMA into the 64 KB LDM, decompressed for the compute functions on the CPE, and compressed again before being written back.]

  • Compression: Squeezing Extra Performance

    Within the same resource budget, compression enables even larger problems (memory size) and pumps more data in and out (memory bandwidth).

  • Compression: Not an Easy Task

    Additional complexity and cost:
    • Extra LDM reads/writes due to compression/decompression operations
    • Broken floating-point instruction pipelines

  • Compression: Further Optimization

    Naive on-the-fly compression of every point: 1/3 of the original performance.

    Buffering a plane instead of compressing every point on the fly: from 1/3 to 90% of the original performance.

    Switching the buffering of temporary variables from the LDM to registers by using intrinsic assembly instructions, especially around function calls: from 90% to 120% of the original performance.

    Before (temporaries spilled to the LDM):
        LOAD  LDM1, $ra
        SSL   $ra, $ra
        STORE $ra, LDM1
        LOAD  LDM2, $rb
        SSL   $rb, $rb
        STORE $rb, LDM2
        LOAD  LDM3, $rc
        SSL   $rc, $rc
        STORE $rc, LDM3
        LOAD  LDM1, $ra
        LOAD  LDM2, $rb
        ADD   $ra, $rb, $ra
        LOAD  LDM3, $rc
        MUL   $ra, $rc, $ra
        STORE $ra, LDM2

    After (temporaries kept in registers):
        LOAD  LDM1, $ra
        SSL   $ra, $ra
        LOAD  LDM2, $rb
        SSL   $rb, $rb
        LOAD  LDM3, $rc
        SSL   $rc, $rc
        ADD   $ra, $rb, $ra
        MUL   $ra, $rc, $ra
        STORE $ra, LDM2

    With further optimization: from 120% to 130% of the original performance.

  • On-the-fly Compression: Results

    [Figure: results over 0 s to 120 s at Cangzhou and Ninghe; blue solid line: base, red dashed line: compressed.]

  • Examples (a personal selection)

    • Yulong Ao, et al., 26 PFLOPS Stencil Computation for Atmospheric Modeling on Sunway TaihuLight, IPDPS 2017

    • Xinliang Wang, et al., swSpTRSV: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architectures, PPoPP 2018

    • Heng Lin, et al., Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores, IPDPS 2017

    • Haohuan Fu, et al., 18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios, SC 2017

    • Haohuan Fu, et al., Redesigning CAM-SE on Sunway TaihuLight for Peta-Scale Performance and Ultra-High Resolution, SC 2017

  • Finding more parallelism

    [Figure: on the cubed-sphere mesh, each element can be mapped to one CPE, or decomposed into sub-elements 0..i-1 mapped to CPEs 0..i-1 (#i CPEs per element), giving more parallelism or a lower LDM requirement; the decomposition is realized either through OpenACC or through direct (Athread) implementation.]

  • Challenge for the future: the CAM-SE effort took 3 years. How can we make this easier?

    Initial code (PHY + DYN): 754,129 LOC, 304 kernels, with no hotspots

    Modified 152,336 LOC, added 57,709 LOC

    Optimized 185 kernels, covering 80% of the total time

  • Thank you!

    Thanks for listening!