Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

Memory System Performance in a NUMA Multicore Multiprocessor

Zoltan Majo and Thomas R. Gross

Department of Computer ScienceETH Zurich

1

Summary

• NUMA multicore systems are unfair to local memory accesses

• Local execution sometimes suboptimal

2

Outline

• NUMA multicores: how it happened

• Experimental evaluation: Intel Nehalem

• Bandwidth sharing model

• The next generation: Intel Westmere

3

NUMA multicores: how it happened

3210

BusC

Northbridge

MC

DRAM memory

4

0 1 2 3 7654

BusC

4 5 6 7

BusC BusC BusC BusC BusC BusC

MC

First generation: SMP


3210

BusC

Northbridge

DRAM memory

5

7654

BusC

MC MCMC

DRAM memory

BusC BusC

Next generation: NUMA

IC IC


3210

DRAM memory

6

7654

MC MC

DRAM memory

0 1 2 3 4 5 6 7

IC IC



3210

DRAM memory

7

7654

MC MC

DRAM memory

0 1 2 3 4 5 6 7

IC IC


3210

DRAM memory

7654

MC MC

DRAM memory

IC IC

Bandwidth sharing

• Frequent scenario:

bandwidth shared between cores

• Sharing model for the Intel Nehalem

8

0 1 2 3 4 5 6 7

Outline





9

Evaluation system

Intel Nehalem E5520

2 x 4 cores

8 MB level 3 cache

12 GB DDR3 RAM

5.86 GT/s QPI

10

3210

DRAM memory

7654

MC MC

DRAM memory

QPI QPI

Level 3 cache

Global Queue

Level 3 cache

Global Queue

QPI QPI

Global Queue Global Queue

Processor 0 Processor 1

Bandwidth sharing: local accesses

11

3210

DRAM memory

7654

MC MC

DRAM memory

QPI QPI

Level 3 cache

Global Queue

Level 3 cache

Global Queue

0

DRAM memory

3

Global Queue


Bandwidth sharing: remote accesses

12

3210

DRAM memory

7654

MC MC

DRAM memory

QPI QPI

Level 3 cache

Global Queue

Level 3 cache

Global Queue

4

DRAM memory

5

Global Queue

0 3


Bandwidth sharing: combined accesses

13

3210

DRAM memory

7654

MC MC

DRAM memory

QPI QPI

Level 3 cache

Global Queue

Level 3 cache

Global Queue

4

DRAM memory

5

Global Queue

0 3


Global Queue

Global Queue

• Mechanism to arbitrate between different types of memory accesses

• We look at fairness of the Global Queue:

– local memory accesses

– remote memory accesses

– combined memory accesses

14

Benchmark program

• STREAM triad

for (i=0; i<SIZE; i++)

{

a[i]=b[i]+SCALAR*c[i];

}

• Multiple co-executing triad clones

15

Multi-clone experiments

• All memory allocated on Processor 0

• Local clones: Remote clones:

• Example benchmark configurations:

16

C C

C C

(2L, 0R)

C C C C C C C C

(0L, 3R) (2L, 3R)

Processor 0 Processor 1 Processor 0 Processor 1

GQ fairness: local accesses

17

Total bandwidth [GB/s]

3210

DRAM

7654

IMC IMC

DRAM

QPI QPI

Cache

GQ

Cache

GQ

C

DRAM memory

C


CC

GQ fairness: remote accesses

18


3210

DRAM

7654

IMC IMC

DRAM

QPI QPI

Cache

GQ

Cache

GQ

C

DRAM memory

C


CC

Global Queue fairness

• Global Queue fair when there areonly local/remote accesses in the system

• What about combined accesses?

19

GQ fairness: combined accesses

Execute clones in all possible configurations

20

# local clones

0 1 2 3 4

# remote clones

0

1

2

3

4

(2L, 3R)



21

# local clones

0 1 2 3 4

# remote clones

0

1

2

3

4


22




23

# local clones

0 1 2 3 4

# remote clones

0

1

2

3

4

Combined accesses

24


Combined accesses

• In configuration (4L, 1R) remote clone gets 30% more bandwidth than a local clone

• Remote execution can be better than local

25

Outline





26

Bandwidth sharing model

27

remotelocaltotal bandwidthbandwidthbandwidth )1(

3210

DRAM memory

7654

IMC IMC

DRAM memory

QPI QPI

Level 3 cache

Global Queue

Level 3 cache

Global Queue

DRAM memory

C C

Sharing factor ()

• Characterizes the fairness of the Global Queue

• Dependence of sharing factor on contention?

28

Contention affects sharing factor

29

DRAM


C

CQPI

contenders

C

C

C


30

Sharing factor ()

Combined accesses

31



• Sharing factor decreases with contention

• With local contention remote execution becomes more favorable

32

Outline





33

The next generation

Intel Westmere X5680

2 x 6 cores

12 MB level 3 cache

144 GB DDR3 RAM

6.4 GT/s QPI

34

3210

DRAM memory

IMC

DRAM memory

QPI

Level 3 cache

Global Queue

BA98

IMCQPI

Level 3 cache

Global Queue

764 5

The next generation

35


Conclusions

• Optimizing for data locality can be suboptimal

• Applications:

– OS scheduling (see ISMM’11 paper)

– data placement and computation scheduling36

Thank you! Questions?

37

Documents

Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1