Upload
sherman-vincent-arnold
View
218
Download
1
Tags:
Embed Size (px)
Citation preview
Memory System Performance in a NUMA Multicore Multiprocessor
Zoltan Majo and Thomas R. Gross
Department of Computer ScienceETH Zurich
1
Summary
• NUMA multicore systems are unfair to local memory accesses
• Local execution sometimes suboptimal
2
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
3
NUMA multicores: how it happened
3210
BusC
Northbridge
MC
DRAM memory
4
0 1 2 3 7654
BusC
4 5 6 7
BusC BusC BusC BusC BusC BusC
MC
First generation: SMP
NUMA multicores: how it happened
3210
BusC
Northbridge
DRAM memory
5
7654
BusC
MC MCMC
DRAM memory
BusC BusC
Next generation: NUMA
IC IC
NUMA multicores: how it happened
3210
DRAM memory
6
7654
MC MC
DRAM memory
0 1 2 3 4 5 6 7
IC IC
Next generation: NUMA
NUMA multicores: how it happened
3210
DRAM memory
7
7654
MC MC
DRAM memory
0 1 2 3 4 5 6 7
IC IC
Next generation: NUMA
3210
DRAM memory
7654
MC MC
DRAM memory
IC IC
Bandwidth sharing
• Frequent scenario:
bandwidth shared between cores
• Sharing model for the Intel Nehalem
8
0 1 2 3 4 5 6 7
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
9
Evaluation system
Intel Nehalem E5520
2 x 4 cores
8 MB level 3 cache
12 GB DDR3 RAM
5.86 GT/s QPI
10
3210
DRAM memory
7654
MC MC
DRAM memory
QPI QPI
Level 3 cache
Global Queue
Level 3 cache
Global Queue
QPI QPI
Global Queue Global Queue
Processor 0 Processor 1
Bandwidth sharing: local accesses
11
3210
DRAM memory
7654
MC MC
DRAM memory
QPI QPI
Level 3 cache
Global Queue
Level 3 cache
Global Queue
0
DRAM memory
3
Global Queue
Processor 0 Processor 1
Bandwidth sharing: remote accesses
12
3210
DRAM memory
7654
MC MC
DRAM memory
QPI QPI
Level 3 cache
Global Queue
Level 3 cache
Global Queue
4
DRAM memory
5
Global Queue
0 3
Processor 0 Processor 1
Bandwidth sharing: combined accesses
13
3210
DRAM memory
7654
MC MC
DRAM memory
QPI QPI
Level 3 cache
Global Queue
Level 3 cache
Global Queue
4
DRAM memory
5
Global Queue
0 3
Processor 0 Processor 1
Global Queue
Global Queue
• Mechanism to arbitrate between different types of memory accesses
• We look at fairness of the Global Queue:
– local memory accesses
– remote memory accesses
– combined memory accesses
14
Benchmark program
• STREAM triad
for (i=0; i<SIZE; i++)
{
a[i]=b[i]+SCALAR*c[i];
}
• Multiple co-executing triad clones
15
Multi-clone experiments
• All memory allocated on Processor 0
• Local clones: Remote clones:
• Example benchmark configurations:
16
C C
C C
(2L, 0R)
C C C C C C C C
(0L, 3R) (2L, 3R)
Processor 0 Processor 1 Processor 0 Processor 1
GQ fairness: local accesses
17
Total bandwidth [GB/s]
3210
DRAM
7654
IMC IMC
DRAM
QPI QPI
Cache
GQ
Cache
GQ
C
DRAM memory
C
Processor 0 Processor 1
CC
GQ fairness: remote accesses
18
Total bandwidth [GB/s]
3210
DRAM
7654
IMC IMC
DRAM
QPI QPI
Cache
GQ
Cache
GQ
C
DRAM memory
C
Processor 0 Processor 1
CC
Global Queue fairness
• Global Queue fair when there areonly local/remote accesses in the system
• What about combined accesses?
19
GQ fairness: combined accesses
Execute clones in all possible configurations
20
# local clones
0 1 2 3 4
# remote clones
0
1
2
3
4
(2L, 3R)
GQ fairness: combined accesses
Execute clones in all possible configurations
21
# local clones
0 1 2 3 4
# remote clones
0
1
2
3
4
GQ fairness: combined accesses
22
Total bandwidth [GB/s]
GQ fairness: combined accesses
Execute clones in all possible configurations
23
# local clones
0 1 2 3 4
# remote clones
0
1
2
3
4
Combined accesses
24
Total bandwidth [GB/s]
Combined accesses
• In configuration (4L, 1R) remote clone gets 30% more bandwidth than a local clone
• Remote execution can be better than local
25
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
26
Bandwidth sharing model
27
remotelocaltotal bandwidthbandwidthbandwidth )1(
3210
DRAM memory
7654
IMC IMC
DRAM memory
QPI QPI
Level 3 cache
Global Queue
Level 3 cache
Global Queue
DRAM memory
C C
Sharing factor ()
• Characterizes the fairness of the Global Queue
• Dependence of sharing factor on contention?
28
Contention affects sharing factor
29
DRAM
Processor 0 Processor 0
C
CQPI
contenders
C
C
C
Contention affects sharing factor
30
Sharing factor ()
Combined accesses
31
Total bandwidth [GB/s]
Contention affects sharing factor
• Sharing factor decreases with contention
• With local contention remote execution becomes more favorable
32
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
33
The next generation
Intel Westmere X5680
2 x 6 cores
12 MB level 3 cache
144 GB DDR3 RAM
6.4 GT/s QPI
34
3210
DRAM memory
IMC
DRAM memory
QPI
Level 3 cache
Global Queue
BA98
IMCQPI
Level 3 cache
Global Queue
764 5
The next generation
35
Total bandwidth [GB/s]
Conclusions
• Optimizing for data locality can be suboptimal
• Applications:
– OS scheduling (see ISMM’11 paper)
– data placement and computation scheduling36
Thank you! Questions?
37