Scheduling of Memory Accesses in the Memory Interface of GPUs


  • 8/13/2019 1. Scheduling of Memory Accesses in the Memory Interface of GPUs - Luis

    1/36

    Luis Garrido, 2012

    IEE5008 Memory Systems, Autumn 2012

    Survey on the Off-Chip Scheduling of Memory Accesses in the Memory Interface of GPUs

    Garrido Platero, Luis Angel

    EECS Graduate Program

    National Chiao Tung University

    [email protected]


    NCTU IEE5008 Memory Systems 2012, Luis Garrido

    Outline

    Introduction

    Overview of GPU Architectures

    The SIMD Execution Model

    Memory Requests in GPUs

    Differences between GDDRx and DDRx

    State-of-the-art Memory Scheduling Techniques

    Effect of Instruction Fetch and Memory Scheduling on GPU Performance

    An Alternative Memory Access Scheduling in Manycore Accelerators


    Outline

    DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function

    Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems

    Conclusion

    References


    Introduction

    Memory controllers are a critical performance bottleneck.

    Scheduling policies must comply with the SIMD execution model.

    Memory requests in GPU architectures have distinctive characteristics.

    Integrated GPU+CPU systems add further scheduling constraints.


    Overview: SIMD execution model

    [Figure: streaming multiprocessor and memory hierarchy. Each multiprocessor contains an array of cores, double-precision units (DPUnit), and load/store units (LD/ST), a 65536 x 32-bit register file, warp schedulers, an instruction cache, 64 KB of shared memory / L1 cache, 48 KB of read-only data cache, and texture units. The multiprocessors connect through an interconnect network to an L2 unified cache and multiple memory controllers. A GigaThread engine dispatches a grid of thread blocks (Block (0,0) through Block (2,2)), each block holding a 2-D arrangement of threads.]


    Overview: Memory requests in GPUs

    A varying number of accesses, with different characteristics, is generated simultaneously.

    Word addresses (A-F) accessed by four threads over consecutive SIMD instructions:

    Instr.  T0  T1  T2  T3
    i       A   A   B   B
    i+1     C   C   D   D
    i+2     E   E   B   C
    i+3     A   F   A   -
    i+4     F   F   B   F
    i+5     A   E   E   E
    i+6     B   B   B   B

    Load/Store units handle the data fetch.

    Concept of memory coalescing: accesses can be merged

    Intra-core merging

    Inter-core merging

    Reduces the number of transactions
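    As a minimal sketch of the merging idea (assuming 128-byte transaction segments and 4-byte words; the segment size is an illustrative parameter, not a figure from the slides), duplicate word accesses within one SIMD instruction can be dropped and the survivors grouped into aligned segments:

    ```python
    def merge_and_coalesce(addresses, segment_size=128):
        """Number of memory transactions left after merging duplicate
        word accesses and coalescing the rest into aligned segments."""
        unique_words = set(addresses)                         # same-word merging
        segments = {a // segment_size for a in unique_words}  # coalescing
        return len(segments)

    # Four lanes accessing words A, A, B, B (two distinct words, one segment):
    print(merge_and_coalesce([0, 0, 4, 4]))        # 1 transaction
    # Four lanes scattered across different segments:
    print(merge_and_coalesce([0, 128, 256, 384]))  # 4 transactions
    ```

    The same counting applies across cores: requests from different multiprocessors that fall inside one segment can likewise be served by a single transaction.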

  • 8/13/2019 1. Scheduling of Memory Accesses in the Memory Interface of GPUs - Luis

    7/36

    NCTU IEE5008 Memory Systems 2012Luis Garrido

    Overview: Memory requests in GPUs

    The number of transactions depends on:

    Parameters of the memory sub-system: number of cache levels, cache line size, etc.

    Application behavior

    Scheduling policy and GDDRx controller capabilities
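    The first dependency can be seen with a hypothetical warp making a unit-stride access: the transaction count changes purely as a function of the line size (the 32-byte and 128-byte values below are illustrative assumptions, not figures from the slides):

    ```python
    def transactions(addresses, line_size):
        """Count the distinct aligned lines touched by a set of byte addresses."""
        return len({a // line_size for a in addresses})

    warp = [4 * lane for lane in range(32)]  # 32 threads, consecutive 4-byte words

    print(transactions(warp, 128))  # 1 transaction with 128-byte lines
    print(transactions(warp, 32))   # 4 transactions with 32-byte lines
    ```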

    [Figure: thread access patterns over a 0-384 byte address range and the transactions they generate. First panel set: a) coalesced accesses = 1 transaction; b) permuted accesses = 1 transaction; c) same-word accesses = 1 transaction; d) misaligned accesses = 2 transactions; e) scattered accesses = k transactions. Second panel set: a) coalesced accesses = 4 transactions; b) same-word access = 1 transaction; c) scattered accesses = k transactions; d) permuted accesses = 4 transactions.]

    Misaligned access =
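    The counts in the figure's first panel set can be reproduced with the same segment-counting rule; the concrete addresses below are illustrative (four threads, 128-byte segments, 4-byte words), not taken from the slides:

    ```python
    def segments_touched(addresses, segment_size=128):
        """Distinct aligned segments (i.e. transactions) touched by the accesses."""
        return len({a // segment_size for a in addresses})

    coalesced  = [0, 4, 8, 12]         # consecutive words in one segment
    permuted   = [12, 0, 8, 4]         # same words, shuffled across threads
    same_word  = [64, 64, 64, 64]      # every thread reads the same word
    misaligned = [120, 124, 128, 132]  # run straddles a segment boundary
    scattered  = [0, 160, 320, 480]    # one segment per thread (k = 4)

    print(segments_touched(coalesced))   # 1
    print(segments_touched(permuted))    # 1
    print(segments_touched(same_word))   # 1
    print(segments_touched(misaligned))  # 2
    print(segments_touched(scattered))   # 4
    ```

    Under the stricter rules of the second panel set, permuted accesses are not recognized as coalescible, so each thread's word becomes its own transaction, matching the figure's 4 transactions for four threads.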