Performance in GPU Architectures: Potentials and Distances
Ahmad Lashgar, ECE, University of Tehran
Amirali Baniasadi, ECE, University of Victoria
WDDD-9, June 5, 2011


Page 1: Performance in GPU Architectures: Potentials and Distances

Performance in GPU Architectures: Potentials and Distances

Ahmad Lashgar, ECE, University of Tehran
Amirali Baniasadi, ECE, University of Victoria

WDDD-9, June 5, 2011

Page 2: Performance in GPU Architectures: Potentials and Distances

This Work

Goal: Investigating GPU performance for general-purpose workloads

How: Studying the isolated impact of:
I. Memory divergence
II. Branch divergence
III. Context-keeping resources

Key finding: Memory has the biggest impact; branch-divergence solutions need to take memory into consideration.

A. Lashgar and A. Baniasadi. Performance in GPU architectures: Potentials and Distances.

Page 3: Performance in GPU Architectures: Potentials and Distances

Outline

Background

Performance Impacting Parameters

Machine Models

Performance Potentials

Performance Distances

Sensitivity Analysis

Conclusion


Page 4: Performance in GPU Architectures: Potentials and Distances

GPU Architecture

[Figure: baseline GPU architecture. 10 TPCs, each holding 3 SMs, connect through an interconnection network to 6 memory controllers (MCtrl1–MCtrl6), each with its own DRAM. Each SM contains a thread pool (per-thread TID, CTAID, and program counter), L1 data, constant, and texture caches, a register file, and 32 PEs (PE1–PE32).]

Number of concurrent CTAs per SM is limited by the size of 3 shared resources:
1. Thread Pool
2. Register File
3. Shared Memory

Page 5: Performance in GPU Architectures: Potentials and Distances

Branch Divergence

The SM is a SIMD processor: a group of threads (a warp) executes the same instruction across the lanes. A branch instruction can diverge a warp into two groups:
1. Threads with the taken outcome
2. Threads with the not-taken outcome


Active mask per lane of an 8-wide warp at each basic block:

A: 1 1 1 1 1 1 1 1
B: 1 1 0 1 0 0 1 0
C: 0 0 1 0 1 1 0 1
D: 1 1 1 1 1 1 1 1

A: // pre-divergence
if (CONDITION) {
B: // NT path
} else {
C: // T path
}
D: // reconvergence point
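To make the cost of this pattern concrete, here is a small sketch of our own (not code from the slides) that computes SIMD utilization for the masks above, assuming each basic block issues across the full warp width:

```python
# Toy model of the divergence example above: SIMD efficiency is the fraction
# of issued lane-slots that are actually active. Assumes each basic block
# issues over the full (8-wide) warp, as under serialization.
WARP = 8  # lanes in this toy example

masks = {
    "A": [1, 1, 1, 1, 1, 1, 1, 1],  # pre-divergence: all lanes active
    "B": [1, 1, 0, 1, 0, 0, 1, 0],  # not-taken path
    "C": [0, 0, 1, 0, 1, 1, 0, 1],  # taken path (complement of B)
    "D": [1, 1, 1, 1, 1, 1, 1, 1],  # reconvergence point
}

def simd_efficiency(masks):
    active = sum(sum(m) for m in masks.values())  # lanes doing useful work
    issued = len(masks) * WARP                    # lane-slots issued
    return active / issued

print(simd_efficiency(masks))  # 24/32 = 0.75
```

Under this toy model, one divergent branch already wastes a quarter of the issue slots; this is the inefficiency the control-flow mechanisms on the following slides address.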

Page 6: Performance in GPU Architectures: Potentials and Distances

Control-flow mechanism

Control-flow solutions address this. Previous solutions:

Postdominator Reconvergence (PDOM): masking and serializing the diverging paths, finally reconverging all paths
Dynamic Warp Formation (DWF): regrouping the threads in diverging paths into new warps


Page 7: Performance in GPU Architectures: Potentials and Distances

PDOM


[Figure: SIMD utilization over time under PDOM for two warps. Both W0 and W1 execute A fully active (1111); the reconvergence stack (TOS = top of stack) then serializes the paths, issuing B with partial masks (W0: 0110, W1: 0001) and C with the complementary masks (W0: 1001, W1: 1110), before both warps reconverge fully active at D.]

Dynamic regrouping of diverged threads at the same path increases utilization.
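As a rough illustration (our own sketch, not GPGPU-sim's implementation), a PDOM-style reconvergence stack for one warp can be modeled as follows; block names and masks follow the running example:

```python
# Minimal model of a PDOM reconvergence stack for one warp and one branch.
# Assumption: a simple if/else whose immediate post-dominator is block D.
def pdom_execute(full_mask, taken_mask):
    """Serialize both paths of a divergent branch under partial masks,
    then reconverge. Returns the issued (block, active_mask) slots in order."""
    not_taken = [a & (1 - t) for a, t in zip(full_mask, taken_mask)]
    stack = [
        ("D", full_mask),   # reconvergence entry (bottom)
        ("C", taken_mask),  # taken path
        ("B", not_taken),   # not-taken path executes first (top of stack)
    ]
    trace = []
    while stack:
        trace.append(stack.pop())
    return trace

# Warp W0 of the example: full mask 1111, lanes 0 and 3 take the branch.
print(pdom_execute([1, 1, 1, 1], [1, 0, 0, 1]))
```

Note that B and C each still issue at partial utilization; only the reconverged D runs fully active, matching the figure.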

Page 8: Performance in GPU Architectures: Potentials and Distances

DWF


[Figure: SIMD utilization over time under DWF for the same example. A warp pool tracks (warp, PC, mask vector) entries; when W0 and W1 diverge, threads on the same path are regrouped into new warps (W0 B 0110 and W1 B 0001; W2 C 1001 and W3 C 1110), and merge possibilities let partially full warps at the same PC combine, raising lane utilization until all threads reach D.]
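The regrouping idea can be sketched in a few lines (an assumption-laden simplification of DWF, ignoring register-lane placement and scheduling constraints):

```python
# Simplified dynamic warp formation: threads that arrive at the same PC are
# packed into new warps, greedily filling lanes. Real DWF must also respect
# register-file lane placement and an issue heuristic (e.g. Majority).
WARP_SIZE = 4

def form_warps(threads):
    """threads: iterable of (thread_id, pc) pairs after a divergent branch."""
    by_pc = {}
    for tid, pc in threads:
        by_pc.setdefault(pc, []).append(tid)
    warps = []
    for pc, tids in by_pc.items():
        for i in range(0, len(tids), WARP_SIZE):
            warps.append((pc, tids[i:i + WARP_SIZE]))
    return warps

# Two half-full warps at B and two half-full warps at C merge into
# one full warp per path:
threads = [(0, "B"), (1, "B"), (2, "C"), (3, "C"),
           (4, "C"), (5, "C"), (6, "B"), (7, "B")]
print(form_warps(threads))  # one warp at B with threads [0,1,6,7], one at C with [2,3,4,5]
```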

Page 9: Performance in GPU Architectures: Potentials and Distances

Performance Impacting Parameters

Memory Divergence: un-coalesced memory accesses increase memory pressure.

Branch Divergence: inter-warp diverging branches decrease SIMD efficiency.

Workload Parallelism: CTA-limiting resources bound the memory-latency-hiding capability. Concurrent CTAs share 3 CTA-limiting resources:
1. Shared Memory
2. Register File
3. Thread Pool


Page 10: Performance in GPU Architectures: Potentials and Distances

Machine Models

Each model is named X-Y-Z, isolating the impact of each parameter:

X (resources): LR (Limited Resources) or UR (Unlimited Resources)
Y (control flow): DC (DWF control-flow), PC (PDOM control-flow), or IC (Ideal Control-flow, MIMD)
Z (memory): M (Real Memory) or IM (Ideal Memory)

Page 11: Performance in GPU Architectures: Potentials and Distances

Machine Models continued…

Limited per-SM resources (LR):
Real memory: LR-DC-M, LR-PC-M, LR-IC-M
Ideal memory: LR-DC-IM, LR-PC-IM, LR-IC-IM

Unlimited per-SM resources (UR):
Real memory: UR-DC-M, UR-PC-M, UR-IC-M
Ideal memory: UR-DC-IM, UR-PC-IM, UR-IC-IM
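The twelve models are simply the cross product of the three naming axes; a quick sketch of ours, for illustration:

```python
from itertools import product

# The 12 machine models as the cross product of the three naming axes.
resources = ["LR", "UR"]        # Limited / Unlimited per-SM resources
control   = ["DC", "PC", "IC"]  # DWF / PDOM / Ideal (MIMD) control flow
memory    = ["M", "IM"]         # Real / Ideal memory

models = ["-".join(parts) for parts in product(resources, control, memory)]
print(len(models))           # 12
print("LR-PC-IM" in models)  # True
```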

Page 12: Performance in GPU Architectures: Potentials and Distances

Methodology

GPGPU-sim v2.1.1b
13 benchmarks from the RODINIA benchmark suite and CUDA SDK 2.3

NoC: 30 SMs total; 6 memory controllers; 3 SMs sharing an interconnect port
SM: warp size 32 threads; 1024 threads per SM; 16384 32-bit registers per SM; 32 PEs per SM; 16KB shared memory; 32KB L1 data cache
Clocking: core clock 325 MHz; interconnect clock 650 MHz; DRAM memory clock 800 MHz
Control flow: base DWF issue heuristic Majority; PDOM warp scheduling round-robin
Page 13: Performance in GPU Architectures: Potentials and Distances

Performance Potentials

The speedup that can be reached if the impacting parameter is idealized.

3 Potentials (per control-flow mechanism):

Memory Potential: speedup due to ideal memory
Control Potential: speedup due to a free-of-divergence architecture
Resource Potential: speedup due to infinite CTA-limiting resources per SM
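One natural way to read this definition (our interpretation; the IPC numbers below are hypothetical, chosen only to reproduce a headline figure):

```python
# Sketch: a potential is the speedup gained by idealizing one parameter
# while holding the other two fixed. IPC values here are made up.
def potential(ipc_idealized, ipc_baseline):
    return ipc_idealized / ipc_baseline - 1.0

# e.g. PDOM's memory potential compares LR-PC-IM against LR-PC-M:
ipc_lr_pc_m  = 100.0  # hypothetical baseline IPC
ipc_lr_pc_im = 159.0  # hypothetical IPC with ideal memory
print(potential(ipc_lr_pc_im, ipc_lr_pc_m))  # ~0.59, i.e. a 59% memory potential
```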


Page 14: Performance in GPU Architectures: Potentials and Distances

Performance Potentials continued…


Page 15: Performance in GPU Architectures: Potentials and Distances

Memory Potentials

[Chart: memory potential per benchmark. On average: DWF 61%, PDOM 59%.]

Page 16: Performance in GPU Architectures: Potentials and Distances

Resource Potentials

[Chart: resource potential per benchmark. On average: DWF 8.6%, PDOM 9.4%.]

Page 17: Performance in GPU Architectures: Potentials and Distances

Control Potentials

[Chart: control potential per benchmark. On average: DWF 2%, PDOM -7%.]

Page 18: Performance in GPU Architectures: Potentials and Distances

Performance Distances

How much an otherwise-ideal GPU is distanced from the ideal due to the parameter.

3 Distances:

Memory Distance: distance from the ideal GPU due to real memory
Resource Distance: distance from the ideal GPU due to limited resources
Control Distance: distance from the ideal GPU due to branch divergence
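Read the same way (again our interpretation, with hypothetical IPC values), a distance measures the shortfall of an otherwise-ideal machine relative to the fully ideal one:

```python
# Sketch: a distance is the relative slowdown of a machine that is ideal in
# every respect except one parameter, versus the fully ideal machine.
def distance(ipc_otherwise_ideal, ipc_fully_ideal):
    return 1.0 - ipc_otherwise_ideal / ipc_fully_ideal

# e.g. memory distance compares UR-IC-M against UR-IC-IM (made-up IPCs):
print(distance(60.0, 100.0))  # 0.4, i.e. a 40% memory distance
```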


Page 19: Performance in GPU Architectures: Potentials and Distances

Performance Distances continued…


Page 20: Performance in GPU Architectures: Potentials and Distances

Memory Distance

[Chart: memory distance per benchmark. Average: 40%.]

Page 21: Performance in GPU Architectures: Potentials and Distances

Resource Distance

[Chart: resource distance per benchmark. Average: 2%.]

Page 22: Performance in GPU Architectures: Potentials and Distances

Control Distances

[Chart: control distance per benchmark. On average: DWF 15%, PDOM 8%.]

Page 23: Performance in GPU Architectures: Potentials and Distances

Sensitivity Analysis

Validating the findings under aggressive configurations:

Aggressive-Memory: 2x L1 caches, 2x number of memory controllers
Aggressive-Resource: 2x CTA-limiting resources

Limited to performance potentials


Page 24: Performance in GPU Architectures: Potentials and Distances

Aggressive-memory

Memory Potentials

PDOM memory potential: 28%; DWF memory potential: 28%

Page 25: Performance in GPU Architectures: Potentials and Distances

Aggressive-memory continued…

Control Potentials

PDOM control potential: -8%; DWF control potential: -0.4%

Page 26: Performance in GPU Architectures: Potentials and Distances

Aggressive-memory continued…

Resource Potentials

PDOM resource potential: 8%; DWF resource potential: ~0%

Page 27: Performance in GPU Architectures: Potentials and Distances

Aggressive-resource

Memory Potentials

PDOM memory potential: 51%; DWF memory potential: 52%

Page 28: Performance in GPU Architectures: Potentials and Distances

Aggressive-resource continued…

Control Potentials

PDOM control potential: -8%; DWF control potential: 2%

Page 29: Performance in GPU Architectures: Potentials and Distances

Aggressive-resource continued…

Resource Potentials

PDOM resource potential: 4%; DWF resource potential: 3%

Page 30: Performance in GPU Architectures: Potentials and Distances

Conclusion


Page 31: Performance in GPU Architectures: Potentials and Distances

Conclusion

Performance in GPUs:

Potentials (improvement by idealizing):
Memory: 59% and 61% for PDOM and DWF
Control: -7% and 2% for PDOM and DWF
Resource: 9.4% and 8.6% for PDOM and DWF

Distances (distance from the ideal system due to a non-ideal factor):
Memory: 40%
Control: 8% and 15% for PDOM and DWF
Resource: 2%

Findings:
Memory has the biggest impact among the 3 factors.
Improving the control-flow mechanism has to take memory pressure into account.
The same trends hold under aggressive memory and context-keeping resources.


Page 32: Performance in GPU Architectures: Potentials and Distances


Thank you.

Questions?


Page 33: Performance in GPU Architectures: Potentials and Distances

Why 32 PEs per SM

GPGPU-sim v2.1.1b coalesces memory accesses over SIMD-width slices of a warp separately, similar to pre-Fermi GPUs.

Example: warp size = 32, PEs per SM = 8 gives 4 independent coalescing domains per warp (lanes 0-7, 8-15, 16-23, 24-31).

We used 32 PEs per SM at 1/4 the clock rate, so a warp forms a single coalescing domain (lanes 0-31), modeling coalescing similar to Fermi GPUs.
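The slicing rule just described can be sketched as follows (our own illustration):

```python
# How a warp's lanes split into independent coalescing domains when the
# simulator coalesces over SIMD-width slices of the warp.
def coalescing_domains(warp_size, simd_width):
    return [list(range(i, i + simd_width))
            for i in range(0, warp_size, simd_width)]

print(len(coalescing_domains(32, 8)))   # 4 domains: lanes 0-7, 8-15, 16-23, 24-31
print(len(coalescing_domains(32, 32)))  # 1 domain: lanes 0-31 (Fermi-like)
```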
