Memory System Performance in a NUMA Multicore Multiprocessor


1

Memory System Performance in a NUMA Multicore Multiprocessor

Zoltan Majo and Thomas R. Gross

Department of Computer Science, ETH Zurich

2

Summary

• NUMA multicore systems are unfair to local memory accesses

• Local execution sometimes suboptimal

3

Outline

• NUMA multicores: how it happened

• Experimental evaluation: Intel Nehalem

• Bandwidth sharing model

• The next generation: Intel Westmere

4

NUMA multicores: how it happened

[Chart: total bandwidth [GB/s] vs. number of active cores (1-8), SMP]

[Diagram: first generation (SMP): cores 0-7 connect through bus controllers (BusC) to a shared Northbridge whose memory controller (MC) serves a single DRAM memory]

5

NUMA multicores: how it happened

[Chart: total bandwidth [GB/s] vs. number of active cores (1-8), SMP]

[Diagram: next generation (NUMA): the memory controllers move from the Northbridge onto the processors; cores 0-3 and cores 4-7 each get a memory controller (MC) with their own DRAM memory, and the two processors are connected by an interconnect (IC)]

6

NUMA multicores: how it happened

[Chart: total bandwidth [GB/s] vs. number of active cores (1-8), SMP and NUMA (local)]

[Diagram: next generation (NUMA): two processors (cores 0-3 and cores 4-7), each with its own memory controller (MC) and DRAM memory, connected by an interconnect (IC)]

7

NUMA multicores: how it happened

[Chart: total bandwidth [GB/s] vs. number of active cores (1-8), SMP, NUMA (local), and NUMA (remote)]

[Diagram: next generation (NUMA): two processors (cores 0-3 and cores 4-7), each with its own memory controller (MC) and DRAM memory, connected by an interconnect (IC)]

8

Bandwidth sharing

• Frequent scenario: bandwidth shared between cores

• Sharing model for the Intel Nehalem

[Diagram: the two-processor NUMA system (cores 0-3 and 4-7), each processor with its own memory controller (MC) and DRAM memory, connected by an interconnect (IC)]

9

Outline

• NUMA multicores: how it happened

• Experimental evaluation: Intel Nehalem

• Bandwidth sharing model

• The next generation: Intel Westmere

10

Evaluation system

Intel Nehalem E5520

2 x 4 cores

8 MB level 3 cache

12 GB DDR3 RAM

5.86 GT/s QPI

[Diagram: Processor 0 (cores 0-3) and Processor 1 (cores 4-7), each with a Global Queue, a level 3 cache, a memory controller (MC) attached to local DRAM memory, and a QPI link to the other processor]

11

Bandwidth sharing: local accesses

[Diagram: local accesses: cores 0 and 3 on Processor 0 access Processor 0's DRAM memory through the Global Queue and the local memory controller]

12

Bandwidth sharing: remote accesses

[Diagram: remote accesses: cores 4 and 5 on Processor 1 access Processor 0's DRAM memory through the Global Queues and the QPI interconnect]

13

Bandwidth sharing: combined accesses

[Diagram: combined accesses: local cores 0 and 3 and remote cores 4 and 5 access Processor 0's DRAM memory at the same time through its Global Queue]

14

Global Queue

• Mechanism to arbitrate between different types of memory accesses

• We look at the fairness of the Global Queue for:

– local memory accesses

– remote memory accesses

– combined memory accesses

15

Benchmark program

• STREAM triad

for (i = 0; i < SIZE; i++) {
    a[i] = b[i] + SCALAR * c[i];
}

• Multiple co-executing triad clones
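A self-contained clone built around this kernel could look roughly as follows; the array size, repetition count, and initial values are illustrative assumptions, not the parameters used in the experiments.

#include <stdio.h>
#include <stdlib.h>

#define SIZE   (64 * 1024 * 1024)  /* assumed array length: far larger than the L3 cache */
#define SCALAR 3.0
#define REPS   100                 /* assumed number of repetitions */

int main(void)
{
    double *a = malloc(SIZE * sizeof(double));
    double *b = malloc(SIZE * sizeof(double));
    double *c = malloc(SIZE * sizeof(double));
    if (!a || !b || !c) return 1;

    for (long i = 0; i < SIZE; i++) { b[i] = 1.0; c[i] = 2.0; }

    /* STREAM triad kernel, repeated so the clone keeps the memory system busy */
    for (int r = 0; r < REPS; r++)
        for (long i = 0; i < SIZE; i++)
            a[i] = b[i] + SCALAR * c[i];

    printf("%f\n", a[0]);          /* keep the compiler from eliminating the loop */
    free(a); free(b); free(c);
    return 0;
}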

16

Multi-clone experiments

• All memory allocated on Processor 0

• Local clones run on Processor 0 (where all memory is allocated); remote clones run on Processor 1

• Example benchmark configurations: (2L, 0R), (0L, 3R), (2L, 3R), where nL, mR denotes n local and m remote clones
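One way to realize this placement is to bind each clone's memory to Processor 0 and its execution to one of the two processors. The libnuma sketch below is an assumption about how this could be done, not the mechanism the authors describe; the function names are illustrative.

#include <numa.h>           /* libnuma; link with -lnuma */
#include <stdlib.h>

/* Allocate one triad array on Processor 0's memory (NUMA node 0). */
static double *alloc_array_on_node0(size_t n)
{
    return numa_alloc_onnode(n * sizeof(double), 0);
}

/* Turn the calling process into a local clone (node 0) or a remote clone (node 1). */
static void place_clone(int node)
{
    if (numa_available() < 0)
        exit(EXIT_FAILURE);      /* no NUMA support on this machine or kernel */
    numa_run_on_node(node);      /* restrict execution to the cores of that node */
}

The same placement can also be expressed from the command line, for example with numactl --membind=0 --cpunodebind=0 for a local clone and --cpunodebind=1 for a remote one.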

17

GQ fairness: local accesses

[Chart: total bandwidth [GB/s] for configurations 1L to 4L, broken down by core (Core 0 to Core 3); x-axis: benchmark configurations]

[Diagram: the local clones run on the cores of Processor 0 and access Processor 0's DRAM memory]
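The bars in these plots are throughput figures. A clone can estimate its own bandwidth roughly as in the sketch below (an assumed measurement scheme, not necessarily the instrumentation used for the talk): one triad sweep over arrays of n doubles reads b and c and writes a, i.e. it moves at least 3 * n * sizeof(double) bytes.

#include <time.h>

/* Time one triad sweep over arrays of length n and return its bandwidth in MB/s. */
static double triad_bandwidth(long n, double *a, double *b, double *c)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < n; i++)
        a[i] = b[i] + 3.0 * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return 3.0 * n * sizeof(double) / secs / 1e6;   /* bytes moved per second, in MB/s */
}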

18

GQ fairness: remote accesses

[Chart: total bandwidth [GB/s] for configurations 1R to 4R, broken down by core; x-axis: benchmark configurations]

[Chart: the local (1L to 4L) and remote (1R to 4R) configurations side by side; y-axis: total bandwidth [GB/s]]

[Diagram: the remote clones run on the cores of Processor 1 and access Processor 0's DRAM memory over QPI]

19

Global Queue fairness

• The Global Queue is fair when there are only local or only remote accesses in the system

• What about combined accesses?

20

GQ fairness: combined accesses

Execute clones in all possible configurations

[Table: all combinations of 0 to 4 local clones and 0 to 4 remote clones; the cell (2L, 3R) marks one example configuration]
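A simple driver for sweeping these configurations can fork the requested number of local and remote clones and wait for them, as in the sketch below. It assumes per-process placement with libnuma; run_triad is a hypothetical name standing for the triad clone shown earlier.

#include <numa.h>            /* libnuma; link with -lnuma */
#include <sys/wait.h>
#include <unistd.h>

extern void run_triad(void); /* hypothetical: the triad clone from the benchmark slide */

/* Fork one clone and restrict it to the cores of the given NUMA node. */
static void spawn_clone(int node)
{
    if (fork() == 0) {
        numa_run_on_node(node);   /* node 0 = local clone, node 1 = remote clone */
        run_triad();
        _exit(0);
    }
}

/* Run one (n_local L, n_remote R) configuration and wait for all clones to finish. */
static void run_configuration(int n_local, int n_remote)
{
    for (int i = 0; i < n_local; i++)  spawn_clone(0);
    for (int i = 0; i < n_remote; i++) spawn_clone(1);
    while (wait(NULL) > 0)
        ;
}

Looping n_local and n_remote from 0 to 4 covers the whole table above.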

21

GQ fairness: combined accesses

Execute clones in all possible configurations

[Table: all combinations of 0 to 4 local clones and 0 to 4 remote clones]

22

GQ fairness: combined accesses

[Chart: total bandwidth [GB/s] for configurations (4L, 0R) through (4L, 4R), split into the share of the local clones and the share of the remote clones; x-axis: benchmark configurations]

23

GQ fairness: combined accesses

Execute clones in all possible configurations

[Table: all combinations of 0 to 4 local clones and 0 to 4 remote clones]

24

Combined accesses

[Chart: total bandwidth [GB/s] for configurations (1L, 1R) through (4L, 1R), broken down into the remote clone and local clones 1 to 4]

25

Combined accesses

• In configuration (4L, 1R) the remote clone gets 30% more bandwidth than a local clone

• Remote execution can be better than local

26

Outline

• NUMA multicores: how it happened

• Experimental evaluation: Intel Nehalem

• Bandwidth sharing model

• The next generation: Intel Westmere

27

Bandwidth sharing model

total bandwidth = α · local bandwidth + (1 - α) · remote bandwidth

[Diagram: a local clone on Processor 0 and a remote clone on Processor 1 both access Processor 0's DRAM memory; the Global Queue divides the memory bandwidth between the two streams]
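A direct transcription of the model as reconstructed above, useful for plugging in measured values; the function name and the sample numbers are illustrative, not from the talk.

/* Total bandwidth predicted by the sharing model: alpha is the sharing factor,
   bw_local and bw_remote are the bandwidths of the local and remote access streams. */
static double model_total_bandwidth(double alpha, double bw_local, double bw_remote)
{
    return alpha * bw_local + (1.0 - alpha) * bw_remote;
}

/* e.g. model_total_bandwidth(0.4, 12000.0, 6000.0) = 0.4*12000 + 0.6*6000 = 8400 */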

28

Sharing factor (α)

• Characterizes the fairness of the Global Queue

• Dependence of sharing factor on contention?

29

Contention affects sharing factor

[Diagram: a local clone and a remote clone access Processor 0's DRAM memory while additional local clones ("contenders") are added on Processor 0]

30

Contention affects sharing factor

[Chart: sharing factor (α) vs. additional contention (+0L to +3L local contenders); y-axis from 0% to 50%]

31

Combined accesses

[Chart, repeated from earlier: total bandwidth [GB/s] for configurations (1L, 1R) through (4L, 1R), broken down into the remote clone and local clones 1 to 4]

32

Contention affects sharing factor

• Sharing factor decreases with contention

• With local contention, remote execution becomes more favorable

33

Outline

• NUMA multicores: how it happened

• Experimental evaluation: Intel Nehalem

• Bandwidth sharing model

• The next generation: Intel Westmere

34

The next generation

Intel Westmere X5680

2 x 6 cores

12 MB level 3 cache

144 GB DDR3 RAM

6.4 GT/s QPI

[Diagram: two six-core Westmere processors, each with a Global Queue, a level 3 cache, an integrated memory controller (IMC) attached to local DRAM memory, and a QPI link to the other processor]

35

The next generation

[Chart: total bandwidth [GB/s] on the Westmere for configurations (1L, 1R) through (6L, 1R), broken down into the remote clone and local clones 1 to 6; x-axis: benchmark configurations]

36

Conclusions

• Optimizing for data locality can be suboptimal

• Applications:

– OS scheduling (see ISMM’11 paper)

– data placement and computation scheduling

37

Thank you! Questions?
