Scheduling of Memory Accesses in the Memory Interface of GPUs


  • 8/13/2019 1. Scheduling of Memory Accesses in the Memory Interface of GPUs - Luis

    1/36

    Luis Garrido, 2012

    IEE5008 Memory Systems, Autumn 2012

    Survey on the Off-Chip Scheduling of Memory Accesses in the Memory Interface of GPUs

    Garrido Platero, Luis Angel

    EECS Graduate Program

    National Chiao Tung University

    [email protected]


    NCTU IEE5008 Memory Systems 2012, Luis Garrido

    Outline

    Introduction

    Overview of GPU Architectures

    The SIMD Execution Model

    Memory Requests in GPUs

    Differences between GDDRx and DDRx

    State-of-the-art Memory Scheduling Techniques

    Effect of Instruction Fetch and Memory Scheduling on GPU Performance

    An Alternative Memory Access Scheduling in Manycore Accelerators


    Outline

    DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function

    Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems

    Conclusion

    References


    Introduction

    Memory controllers are a critical performance bottleneck.

    Scheduling policies must comply with the SIMD execution model.

    Memory requests in GPU architectures have distinctive characteristics.

    Integrated GPU+CPU systems add further scheduling constraints.


    Overview: SIMD execution model

    [Figure: streaming multiprocessor and memory hierarchy. Each multiprocessor contains an array of cores, double-precision units (DPUnit), and load/store units (LD/ST), a 65536 x 32-bit register file, warp schedulers, an instruction cache, 64 KB of shared memory / L1 cache, 48 KB of read-only data cache, and texture units. The multiprocessors connect through an interconnect network to an L2 unified cache and multiple memory controllers. A GigaThread engine dispatches a grid of thread blocks (Block (0,0) through Block (2,2)), each block holding a 2-D arrangement of threads.]


    Overview: Memory requests in GPUs

    A varying number of accesses, with different characteristics, is generated simultaneously.

    Word addresses (A-F) accessed by four threads over consecutive SIMD instructions:

    Instr.  T0  T1  T2  T3
    i       A   A   B   B
    i+1     C   C   D   D
    i+2     E   E   B   C
    i+3     A   F   A   -
    i+4     F   F   B   F
    i+5     A   E   E   E
    i+6     B   B   B   B

    Load/Store units handle the data fetch.

    Concept of memory coalescing: accesses can be merged

    Intra-core merging

    Inter-core merging

    Reduces the number of transactions
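    As a minimal sketch of the merging idea (assuming 128-byte transaction segments and 4-byte words; the segment size is an illustrative parameter, not a figure from the slides), duplicate word accesses within one SIMD instruction can be dropped and the survivors grouped into aligned segments:

    ```python
    def merge_and_coalesce(addresses, segment_size=128):
        """Number of memory transactions left after merging duplicate
        word accesses and coalescing the rest into aligned segments."""
        unique_words = set(addresses)                         # same-word merging
        segments = {a // segment_size for a in unique_words}  # coalescing
        return len(segments)

    # Four lanes accessing words A, A, B, B (two distinct words, one segment):
    print(merge_and_coalesce([0, 0, 4, 4]))        # 1 transaction
    # Four lanes scattered across different segments:
    print(merge_and_coalesce([0, 128, 256, 384]))  # 4 transactions
    ```

    The same counting applies across cores: requests from different multiprocessors that fall inside one segment can likewise be served by a single transaction.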

  • 8/13/2019 1. Scheduling of Memory Accesses in the Memory Interface of GPUs - Luis

    7/36

    NCTU IEE5008 Memory Systems 2012Luis Garrido

    Overview: Memory requests in GPUs

    The number of transactions depends on:

    Parameters of the memory sub-system: number of cache levels, cache line size, etc.

    Application behavior

    Scheduling policy and GDDRx controller capabilities
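    The first dependency can be seen with a hypothetical warp making a unit-stride access: the transaction count changes purely as a function of the line size (the 32-byte and 128-byte values below are illustrative assumptions, not figures from the slides):

    ```python
    def transactions(addresses, line_size):
        """Count the distinct aligned lines touched by a set of byte addresses."""
        return len({a // line_size for a in addresses})

    warp = [4 * lane for lane in range(32)]  # 32 threads, consecutive 4-byte words

    print(transactions(warp, 128))  # 1 transaction with 128-byte lines
    print(transactions(warp, 32))   # 4 transactions with 32-byte lines
    ```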

    [Figure: thread access patterns over a 0-384 byte address range and the transactions they generate. First panel set: a) coalesced accesses = 1 transaction; b) permuted accesses = 1 transaction; c) same-word accesses = 1 transaction; d) misaligned accesses = 2 transactions; e) scattered accesses = k transactions. Second panel set: a) coalesced accesses = 4 transactions; b) same-word access = 1 transaction; c) scattered accesses = k transactions; d) permuted accesses = 4 transactions.]

    Misaligned access =
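    The counts in the figure's first panel set can be reproduced with the same segment-counting rule; the concrete addresses below are illustrative (four threads, 128-byte segments, 4-byte words), not taken from the slides:

    ```python
    def segments_touched(addresses, segment_size=128):
        """Distinct aligned segments (i.e. transactions) touched by the accesses."""
        return len({a // segment_size for a in addresses})

    coalesced  = [0, 4, 8, 12]         # consecutive words in one segment
    permuted   = [12, 0, 8, 4]         # same words, shuffled across threads
    same_word  = [64, 64, 64, 64]      # every thread reads the same word
    misaligned = [120, 124, 128, 132]  # run straddles a segment boundary
    scattered  = [0, 160, 320, 480]    # one segment per thread (k = 4)

    print(segments_touched(coalesced))   # 1
    print(segments_touched(permuted))    # 1
    print(segments_touched(same_word))   # 1
    print(segments_touched(misaligned))  # 2
    print(segments_touched(scattered))   # 4
    ```

    Under the stricter rules of the second panel set, permuted accesses are not recognized as coalescible, so each thread's word becomes its own transaction, matching the figure's 4 transactions for four threads.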