Performance in GPU Architectures: Potentials and Distances
Ahmad Lashgar, ECE, University of Tehran
Amirali Baniasadi, ECE, University of Victoria
WDDD-9, June 5, 2011


Page 1: Performance in GPU Architectures: Potentials and Distances

Performance in GPU Architectures: Potentials and Distances

Ahmad Lashgar, ECE, University of Tehran
Amirali Baniasadi, ECE, University of Victoria

WDDD-9, June 5, 2011

Page 2: Performance in GPU Architectures: Potentials and Distances

This Work

Goal: Investigating GPU performance for general-purpose workloads

How: Studying the isolated impact of:
I. Memory divergence
II. Branch divergence
III. Context-keeping resources

Key finding: Memory has the biggest impact; branch-divergence solutions need to take memory into consideration.

A. Lashgar and A. Baniasadi. Performance in GPU architectures: Potentials and Distances.

Page 3: Performance in GPU Architectures: Potentials and Distances

Outline

Background

Performance Impacting Parameters

Machine Models

Performance Potentials

Performance Distances

Sensitivity Analysis

Conclusion


Page 4: Performance in GPU Architectures: Potentials and Distances

GPU Architecture

[Figure: baseline GPU architecture. 10 TPCs, each holding 3 SMs, connect through an interconnection network to 6 memory controllers (MCtrl1–MCtrl6), each with its own DRAM. Each SM contains a thread pool (per-thread TID, CTAID, and program counter), L1 data, constant, and texture caches, a register file, and 32 PEs (PE1–PE32).]

Number of concurrent CTAs per SM is limited by the size of 3 shared resources:
1. Thread Pool
2. Register File
3. Shared Memory

Page 5: Performance in GPU Architectures: Potentials and Distances

Branch Divergence

The SM is a SIMD processor: a group of threads (a warp) executes the same instruction across the lanes. A branch instruction can diverge a warp into two groups:
1. Threads with the taken outcome
2. Threads with the not-taken outcome


Active mask per lane of an 8-wide warp at each basic block:

A: 1 1 1 1 1 1 1 1
B: 1 1 0 1 0 0 1 0
C: 0 0 1 0 1 1 0 1
D: 1 1 1 1 1 1 1 1

A: // pre-divergence
if (CONDITION) {
B: // NT path
} else {
C: // T path
}
D: // reconvergence point
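To make the cost of this pattern concrete, here is a small sketch of our own (not code from the slides) that computes SIMD utilization for the masks above, assuming each basic block issues across the full warp width:

```python
# Toy model of the divergence example above: SIMD efficiency is the fraction
# of issued lane-slots that are actually active. Assumes each basic block
# issues over the full (8-wide) warp, as under serialization.
WARP = 8  # lanes in this toy example

masks = {
    "A": [1, 1, 1, 1, 1, 1, 1, 1],  # pre-divergence: all lanes active
    "B": [1, 1, 0, 1, 0, 0, 1, 0],  # not-taken path
    "C": [0, 0, 1, 0, 1, 1, 0, 1],  # taken path (complement of B)
    "D": [1, 1, 1, 1, 1, 1, 1, 1],  # reconvergence point
}

def simd_efficiency(masks):
    active = sum(sum(m) for m in masks.values())  # lanes doing useful work
    issued = len(masks) * WARP                    # lane-slots issued
    return active / issued

print(simd_efficiency(masks))  # 24/32 = 0.75
```

Under this toy model, one divergent branch already wastes a quarter of the issue slots; this is the inefficiency the control-flow mechanisms on the following slides address.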

Page 6: Performance in GPU Architectures: Potentials and Distances

Control-flow mechanism

Control-flow solutions address this. Previous solutions:

Postdominator Reconvergence (PDOM): masking and serializing the diverging paths, finally reconverging all paths
Dynamic Warp Formation (DWF): regrouping the threads in diverging paths into new warps


Page 7: Performance in GPU Architectures: Potentials and Distances

PDOM


[Figure: SIMD utilization over time under PDOM for two warps. Both W0 and W1 execute A fully active (1111); the reconvergence stack (TOS = top of stack) then serializes the paths, issuing B with partial masks (W0: 0110, W1: 0001) and C with the complementary masks (W0: 1001, W1: 1110), before both warps reconverge fully active at D.]

Dynamic regrouping of diverged threads at the same path increases utilization.
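As a rough illustration (our own sketch, not GPGPU-sim's implementation), a PDOM-style reconvergence stack for one warp can be modeled as follows; block names and masks follow the running example:

```python
# Minimal model of a PDOM reconvergence stack for one warp and one branch.
# Assumption: a simple if/else whose immediate post-dominator is block D.
def pdom_execute(full_mask, taken_mask):
    """Serialize both paths of a divergent branch under partial masks,
    then reconverge. Returns the issued (block, active_mask) slots in order."""
    not_taken = [a & (1 - t) for a, t in zip(full_mask, taken_mask)]
    stack = [
        ("D", full_mask),   # reconvergence entry (bottom)
        ("C", taken_mask),  # taken path
        ("B", not_taken),   # not-taken path executes first (top of stack)
    ]
    trace = []
    while stack:
        trace.append(stack.pop())
    return trace

# Warp W0 of the example: full mask 1111, lanes 0 and 3 take the branch.
print(pdom_execute([1, 1, 1, 1], [1, 0, 0, 1]))
```

Note that B and C each still issue at partial utilization; only the reconverged D runs fully active, matching the figure.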

Page 8: Performance in GPU Architectures: Potentials and Distances

DWF


[Figure: SIMD utilization over time under DWF for the same example. A warp pool tracks (warp, PC, mask vector) entries; when W0 and W1 diverge, threads on the same path are regrouped into new warps (W0 B 0110 and W1 B 0001; W2 C 1001 and W3 C 1110), and merge possibilities let partially full warps at the same PC combine, raising lane utilization until all threads reach D.]
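The regrouping idea can be sketched in a few lines (an assumption-laden simplification of DWF, ignoring register-lane placement and scheduling constraints):

```python
# Simplified dynamic warp formation: threads that arrive at the same PC are
# packed into new warps, greedily filling lanes. Real DWF must also respect
# register-file lane placement and an issue heuristic (e.g. Majority).
WARP_SIZE = 4

def form_warps(threads):
    """threads: iterable of (thread_id, pc) pairs after a divergent branch."""
    by_pc = {}
    for tid, pc in threads:
        by_pc.setdefault(pc, []).append(tid)
    warps = []
    for pc, tids in by_pc.items():
        for i in range(0, len(tids), WARP_SIZE):
            warps.append((pc, tids[i:i + WARP_SIZE]))
    return warps

# Two half-full warps at B and two half-full warps at C merge into
# one full warp per path:
threads = [(0, "B"), (1, "B"), (2, "C"), (3, "C"),
           (4, "C"), (5, "C"), (6, "B"), (7, "B")]
print(form_warps(threads))  # one warp at B with threads [0,1,6,7], one at C with [2,3,4,5]
```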

Page 9: Performance in GPU Architectures: Potentials and Distances

Performance Impacting Parameters

Memory Divergence: un-coalesced memory accesses increase memory pressure.

Branch Divergence: inter-warp diverging branches decrease SIMD efficiency.

Workload Parallelism: CTA-limiting resources bound the memory-latency-hiding capability. Concurrent CTAs share 3 CTA-limiting resources:
1. Shared Memory
2. Register File
3. Thread Pool


Page 10: Performance in GPU Architectures: Potentials and Distances

Machine Models

Each model is named X-Y-Z, isolating the impact of each parameter:

X (resources): LR (Limited Resources) or UR (Unlimited Resources)
Y (control flow): DC (DWF control-flow), PC (PDOM control-flow), or IC (Ideal Control-flow, MIMD)
Z (memory): M (Real Memory) or IM (Ideal Memory)

Page 11: Performance in GPU Architectures: Potentials and Distances

Machine Models continued…

Limited per-SM resources (LR):
Real memory: LR-DC-M, LR-PC-M, LR-IC-M
Ideal memory: LR-DC-IM, LR-PC-IM, LR-IC-IM

Unlimited per-SM resources (UR):
Real memory: UR-DC-M, UR-PC-M, UR-IC-M
Ideal memory: UR-DC-IM, UR-PC-IM, UR-IC-IM
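The twelve models are simply the cross product of the three naming axes; a quick sketch of ours, for illustration:

```python
from itertools import product

# The 12 machine models as the cross product of the three naming axes.
resources = ["LR", "UR"]        # Limited / Unlimited per-SM resources
control   = ["DC", "PC", "IC"]  # DWF / PDOM / Ideal (MIMD) control flow
memory    = ["M", "IM"]         # Real / Ideal memory

models = ["-".join(parts) for parts in product(resources, control, memory)]
print(len(models))           # 12
print("LR-PC-IM" in models)  # True
```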

Page 12: Performance in GPU Architectures: Potentials and Distances

Methodology

GPGPU-sim v2.1.1b
13 benchmarks from the RODINIA benchmark suite and CUDA SDK 2.3

NoC: 30 SMs total; 6 memory controllers; 3 SMs sharing an interconnect port
SM: warp size 32 threads; 1024 threads per SM; 16384 32-bit registers per SM; 32 PEs per SM; 16KB shared memory; 32KB L1 data cache
Clocking: core clock 325 MHz; interconnect clock 650 MHz; DRAM memory clock 800 MHz
Control flow: base DWF issue heuristic Majority; PDOM warp scheduling round-robin
Page 13: Performance in GPU Architectures: Potentials and Distances

Performance Potentials

The speedup that can be reached if the impacting parameter is idealized.

3 Potentials (per control-flow mechanism):

Memory Potential: speedup due to ideal memory
Control Potential: speedup due to a free-of-divergence architecture
Resource Potential: speedup due to infinite CTA-limiting resources per SM
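One natural way to read this definition (our interpretation; the IPC numbers below are hypothetical, chosen only to reproduce a headline figure):

```python
# Sketch: a potential is the speedup gained by idealizing one parameter
# while holding the other two fixed. IPC values here are made up.
def potential(ipc_idealized, ipc_baseline):
    return ipc_idealized / ipc_baseline - 1.0

# e.g. PDOM's memory potential compares LR-PC-IM against LR-PC-M:
ipc_lr_pc_m  = 100.0  # hypothetical baseline IPC
ipc_lr_pc_im = 159.0  # hypothetical IPC with ideal memory
print(potential(ipc_lr_pc_im, ipc_lr_pc_m))  # ~0.59, i.e. a 59% memory potential
```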


Page 14: Performance in GPU Architectures: Potentials and Distances

Performance Potentials continued…


Page 15: Performance in GPU Architectures: Potentials and Distances

Memory Potentials

[Chart: memory potential per benchmark. On average: DWF 61%, PDOM 59%.]

Page 16: Performance in GPU Architectures: Potentials and Distances

Resource Potentials

[Chart: resource potential per benchmark. On average: DWF 8.6%, PDOM 9.4%.]

Page 17: Performance in GPU Architectures: Potentials and Distances

Control Potentials

[Chart: control potential per benchmark. On average: DWF 2%, PDOM -7%.]

Page 18: Performance in GPU Architectures: Potentials and Distances

Performance Distances

How much an otherwise-ideal GPU is distanced from the ideal due to the parameter.

3 Distances:

Memory Distance: distance from the ideal GPU due to real memory
Resource Distance: distance from the ideal GPU due to limited resources
Control Distance: distance from the ideal GPU due to branch divergence
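Read the same way (again our interpretation, with hypothetical IPC values), a distance measures the shortfall of an otherwise-ideal machine relative to the fully ideal one:

```python
# Sketch: a distance is the relative slowdown of a machine that is ideal in
# every respect except one parameter, versus the fully ideal machine.
def distance(ipc_otherwise_ideal, ipc_fully_ideal):
    return 1.0 - ipc_otherwise_ideal / ipc_fully_ideal

# e.g. memory distance compares UR-IC-M against UR-IC-IM (made-up IPCs):
print(distance(60.0, 100.0))  # 0.4, i.e. a 40% memory distance
```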


Page 19: Performance in GPU Architectures: Potentials and Distances

Performance Distances continued…


Page 20: Performance in GPU Architectures: Potentials and Distances

Memory Distance

[Chart: memory distance per benchmark. Average: 40%.]

Page 21: Performance in GPU Architectures: Potentials and Distances

Resource Distance

[Chart: resource distance per benchmark. Average: 2%.]

Page 22: Performance in GPU Architectures: Potentials and Distances

Control Distances

[Chart: control distance per benchmark. On average: DWF 15%, PDOM 8%.]

Page 23: Performance in GPU Architectures: Potentials and Distances

Sensitivity Analysis

Validating the findings under aggressive configurations:

Aggressive-Memory: 2x L1 caches, 2x number of memory controllers
Aggressive-Resource: 2x CTA-limiting resources

Limited to performance potentials


Page 24: Performance in GPU Architectures: Potentials and Distances

Aggressive-memory

Memory Potentials

PDOM memory potential: 28%; DWF memory potential: 28%

Page 25: Performance in GPU Architectures: Potentials and Distances

Aggressive-memory continued…

Control Potentials

PDOM control potential: -8%; DWF control potential: -0.4%

Page 26: Performance in GPU Architectures: Potentials and Distances

Aggressive-memory continued…

Resource Potentials

PDOM resource potential: 8%; DWF resource potential: ~0%

Page 27: Performance in GPU Architectures: Potentials and Distances

Aggressive-resource

Memory Potentials

PDOM memory potential: 51%; DWF memory potential: 52%

Page 28: Performance in GPU Architectures: Potentials and Distances

Aggressive-resource continued…

Control Potentials

PDOM control potential: -8%; DWF control potential: 2%

Page 29: Performance in GPU Architectures: Potentials and Distances

Aggressive-resource continued…

Resource Potentials

PDOM resource potential: 4%; DWF resource potential: 3%

Page 30: Performance in GPU Architectures: Potentials and Distances

Conclusion


Page 31: Performance in GPU Architectures: Potentials and Distances

Conclusion

Performance in GPUs:

Potentials (improvement by idealizing):
Memory: 59% and 61% for PDOM and DWF
Control: -7% and 2% for PDOM and DWF
Resource: 9.4% and 8.6% for PDOM and DWF

Distances (distance from the ideal system due to a non-ideal factor):
Memory: 40%
Control: 8% and 15% for PDOM and DWF
Resource: 2%

Findings:
Memory has the biggest impact among the 3 factors.
Improving the control-flow mechanism has to take memory pressure into account.
The same trends hold under aggressive memory and context-keeping resources.


Page 32: Performance in GPU Architectures: Potentials and Distances


Thank you.

Questions?


Page 33: Performance in GPU Architectures: Potentials and Distances

Why 32 PEs per SM

GPGPU-sim v2.1.1b coalesces memory accesses over SIMD-width slices of a warp separately, similar to pre-Fermi GPUs.

Example: warp size = 32, PEs per SM = 8 gives 4 independent coalescing domains per warp (lanes 0-7, 8-15, 16-23, 24-31).

We used 32 PEs per SM at 1/4 the clock rate, so a warp forms a single coalescing domain (lanes 0-31), modeling coalescing similar to Fermi GPUs.
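The slicing rule just described can be sketched as follows (our own illustration):

```python
# How a warp's lanes split into independent coalescing domains when the
# simulator coalesces over SIMD-width slices of the warp.
def coalescing_domains(warp_size, simd_width):
    return [list(range(i, i + simd_width))
            for i in range(0, warp_size, simd_width)]

print(len(coalescing_domains(32, 8)))   # 4 domains: lanes 0-7, 8-15, 16-23, 24-31
print(len(coalescing_domains(32, 32)))  # 1 domain: lanes 0-31 (Fermi-like)
```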
