37
Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Embed Size (px)

Citation preview

Page 1: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Memory System Performance in a NUMA Multicore Multiprocessor

Zoltan Majo and Thomas R. Gross

Department of Computer ScienceETH Zurich

1

Page 2: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Summary

• NUMA multicore systems are unfair to local memory accesses

• Local execution sometimes suboptimal

2

Page 3: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Outline

• NUMA multicores: how it happened

• Experimental evaluation: Intel Nehalem

• Bandwidth sharing model

• The next generation: Intel Westmere

3

Page 4: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

NUMA multicores: how it happened

3210

BusC

Northbridge

MC

DRAM memory

4

0 1 2 3 7654

BusC

4 5 6 7

BusC BusC BusC BusC BusC BusC

MC

First generation: SMP

Page 5: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

NUMA multicores: how it happened

3210

BusC

Northbridge

DRAM memory

5

7654

BusC

MC MCMC

DRAM memory

BusC BusC

Next generation: NUMA

IC IC

Page 6: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

NUMA multicores: how it happened

3210

DRAM memory

6

7654

MC MC

DRAM memory

0 1 2 3 4 5 6 7

IC IC

Next generation: NUMA

Page 7: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

NUMA multicores: how it happened

3210

DRAM memory

7

7654

MC MC

DRAM memory

0 1 2 3 4 5 6 7

IC IC

Next generation: NUMA

Page 8: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

3210

DRAM memory

7654

MC MC

DRAM memory

IC IC

Bandwidth sharing

• Frequent scenario:

bandwidth shared between cores

• Sharing model for the Intel Nehalem

8

0 1 2 3 4 5 6 7

Page 9: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Outline

• NUMA multicores: how it happened

• Experimental evaluation: Intel Nehalem

• Bandwidth sharing model

• The next generation: Intel Westmere

9

Page 10: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Evaluation system

Intel Nehalem E5520

2 x 4 cores

8 MB level 3 cache

12 GB DDR3 RAM

5.86 GT/s QPI

10

3210

DRAM memory

7654

MC MC

DRAM memory

QPI QPI

Level 3 cache

Global Queue

Level 3 cache

Global Queue

QPI QPI

Global Queue Global Queue

Processor 0 Processor 1

Page 11: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Bandwidth sharing: local accesses

11

3210

DRAM memory

7654

MC MC

DRAM memory

QPI QPI

Level 3 cache

Global Queue

Level 3 cache

Global Queue

0

DRAM memory

3

Global Queue

Processor 0 Processor 1

Page 12: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Bandwidth sharing: remote accesses

12

3210

DRAM memory

7654

MC MC

DRAM memory

QPI QPI

Level 3 cache

Global Queue

Level 3 cache

Global Queue

4

DRAM memory

5

Global Queue

0 3

Processor 0 Processor 1

Page 13: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Bandwidth sharing: combined accesses

13

3210

DRAM memory

7654

MC MC

DRAM memory

QPI QPI

Level 3 cache

Global Queue

Level 3 cache

Global Queue

4

DRAM memory

5

Global Queue

0 3

Processor 0 Processor 1

Global Queue

Page 14: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Global Queue

• Mechanism to arbitrate between different types of memory accesses

• We look at fairness of the Global Queue:

– local memory accesses

– remote memory accesses

– combined memory accesses

14

Page 15: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Benchmark program

• STREAM triad

for (i=0; i<SIZE; i++)

{

a[i]=b[i]+SCALAR*c[i];

}

• Multiple co-executing triad clones

15

Page 16: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Multi-clone experiments

• All memory allocated on Processor 0

• Local clones: Remote clones:

• Example benchmark configurations:

16

C C

C C

(2L, 0R)

C C C C C C C C

(0L, 3R) (2L, 3R)

Processor 0 Processor 1 Processor 0 Processor 1

Page 17: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

GQ fairness: local accesses

17

Total bandwidth [GB/s]

3210

DRAM

7654

IMC IMC

DRAM

QPI QPI

Cache

GQ

Cache

GQ

C

DRAM memory

C

Processor 0 Processor 1

CC

Page 18: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

GQ fairness: remote accesses

18

Total bandwidth [GB/s]

3210

DRAM

7654

IMC IMC

DRAM

QPI QPI

Cache

GQ

Cache

GQ

C

DRAM memory

C

Processor 0 Processor 1

CC

Page 19: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Global Queue fairness

• Global Queue fair when there areonly local/remote accesses in the system

• What about combined accesses?

19

Page 20: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

GQ fairness: combined accesses

Execute clones in all possible configurations

20

# local clones

0 1 2 3 4

# remote clones

0

1

2

3

4

(2L, 3R)

Page 21: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

GQ fairness: combined accesses

Execute clones in all possible configurations

21

# local clones

0 1 2 3 4

# remote clones

0

1

2

3

4

Page 22: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

GQ fairness: combined accesses

22

Total bandwidth [GB/s]

Page 23: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

GQ fairness: combined accesses

Execute clones in all possible configurations

23

# local clones

0 1 2 3 4

# remote clones

0

1

2

3

4

Page 24: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Combined accesses

24

Total bandwidth [GB/s]

Page 25: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Combined accesses

• In configuration (4L, 1R) remote clone gets 30% more bandwidth than a local clone

• Remote execution can be better than local

25

Page 26: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Outline

• NUMA multicores: how it happened

• Experimental evaluation: Intel Nehalem

• Bandwidth sharing model

• The next generation: Intel Westmere

26

Page 27: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Bandwidth sharing model

27

remotelocaltotal bandwidthbandwidthbandwidth )1(

3210

DRAM memory

7654

IMC IMC

DRAM memory

QPI QPI

Level 3 cache

Global Queue

Level 3 cache

Global Queue

DRAM memory

C C

Page 28: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Sharing factor ()

• Characterizes the fairness of the Global Queue

• Dependence of sharing factor on contention?

28

Page 29: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Contention affects sharing factor

29

DRAM

Processor 0 Processor 0

C

CQPI

contenders

C

C

C

Page 30: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Contention affects sharing factor

30

Sharing factor ()

Page 31: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Combined accesses

31

Total bandwidth [GB/s]

Page 32: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Contention affects sharing factor

• Sharing factor decreases with contention

• With local contention remote execution becomes more favorable

32

Page 33: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Outline

• NUMA multicores: how it happened

• Experimental evaluation: Intel Nehalem

• Bandwidth sharing model

• The next generation: Intel Westmere

33

Page 34: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

The next generation

Intel Westmere X5680

2 x 6 cores

12 MB level 3 cache

144 GB DDR3 RAM

6.4 GT/s QPI

34

3210

DRAM memory

IMC

DRAM memory

QPI

Level 3 cache

Global Queue

BA98

IMCQPI

Level 3 cache

Global Queue

764 5

Page 35: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

The next generation

35

Total bandwidth [GB/s]

Page 36: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Conclusions

• Optimizing for data locality can be suboptimal

• Applications:

– OS scheduling (see ISMM’11 paper)

– data placement and computation scheduling36

Page 37: Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Thank you! Questions?

37