Euro-Par 2004. Uppsala Architecture Research Team [UART] | www.it.uu.se/research/group/uart
Uppsala University
Dept. of Information Technology
Div. of Computer Systems
Uppsala Architecture Research Team [UART]
Exploiting Store Locality through Permission Caching in Software DSMs
Håkan Zeffer, Zoran Radovic, Oskar Grenholm and Erik Hagersten [email protected]
Software Distributed Shared Memory
Traditional Software DSMs
Page-based coherence [e.g., Ivy, Munin, TreadMarks]
Virtual memory hardware for coherence checks
• Expensive TLB traps
Large coherence unit size
• Problem: False sharing
• Solution: Weak memory consistency models
[Figure: CPUs with DATA and dir; a ST miss sends a req.]
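The false-sharing problem of a large coherence unit can be illustrated with a small sketch. The 8 KB page size, the 64-byte fine-grain unit and the two addresses are assumptions chosen for illustration; they are not taken from the slides.

```python
PAGE_SIZE = 8192   # assumed page size in bytes
LINE_SIZE = 64     # assumed fine-grain coherence unit in bytes


def coherence_unit(addr, unit_size):
    """Return the index of the coherence unit that holds addr."""
    return addr // unit_size


def falsely_shared(addr_a, addr_b, unit_size):
    """Two distinct addresses falsely share when they fall in the same unit."""
    same_unit = coherence_unit(addr_a, unit_size) == coherence_unit(addr_b, unit_size)
    return addr_a != addr_b and same_unit


# Two unrelated variables written by different nodes (hypothetical addresses).
x_addr = 0x10000
y_addr = 0x10400  # 1 KB away from x_addr

page_conflict = falsely_shared(x_addr, y_addr, PAGE_SIZE)  # same page
line_conflict = falsely_shared(x_addr, y_addr, LINE_SIZE)  # different 64-byte units
print(page_conflict, line_conflict)
```

With page-sized units the two variables collide even though the nodes never touch the same data; with the smaller unit they do not.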
Fine-Grain Software DSMs
Fine-grain access-control checks [Shasta, Blizzard]
• Relies on binary instrumentation
• Avoids operating system trapping
• Less false sharing
• Extra instructions introduce overhead
[Figure: CPUs with DATA and dir; the check "if (miss) goto st_protocol" runs before the ST and sends a req. on a miss]
Checking code instrumented into the application
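The inserted check "if (miss) goto st_protocol" can be modeled roughly as follows. This is a minimal sketch, not the Shasta or Blizzard implementation: the `Node` class, the three access states and the 64-byte unit size are hypothetical names and values introduced for illustration.

```python
UNIT_SIZE = 64  # assumed coherence unit size in bytes

INVALID, SHARED, EXCLUSIVE = range(3)  # hypothetical access states


class Node:
    def __init__(self):
        self.state = {}          # coherence unit index -> access state
        self.memory = {}         # address -> value (local copy)
        self.protocol_calls = 0  # how often the slow path ran

    def st_protocol(self, unit):
        """Stand-in for the store protocol: obtain write permission."""
        self.protocol_calls += 1
        self.state[unit] = EXCLUSIVE

    def checked_store(self, addr, value):
        """The inserted check followed by the original store."""
        unit = addr // UNIT_SIZE
        if self.state.get(unit, INVALID) != EXCLUSIVE:  # if (miss)
            self.st_protocol(unit)                      # goto st_protocol
        self.memory[addr] = value                       # ST


node = Node()
for offset in range(0, 128, 8):  # 16 stores covering two coherence units
    node.checked_store(0x1000 + offset, offset)
print(node.protocol_calls)
```

Every store pays for the check, but only the first store to each unit enters the protocol.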
Fine-Grain Pros and Cons
Pros
• Small coherence unit
• Hardware-like memory consistency model
Cons
• Extra check instructions to execute
Our proposal: Write Permission Cache (WPC)
• Exploits store locality
• Caches write permission
• Effectively reduces the store instrumentation cost
Outline
Motivation
Problem: Instrumentation Overhead
Solution: Write Permission Cache
Experimental Setup
Results on Real HW- and SW-DSM Systems
Conclusions
Software Fine-Grain Coherence
Binary instrumentation of global loads and stores
Inserted code "snippet" maintains coherence

Original program:
      add R1, R2 -> R3
loop: ld [R1 + R4] -> R6    // G_LD1
      ld [R2 + R4] -> R7    // G_LD2
      sub R9, 1 -> R9
      add R6, R7 -> R8
      st R8 -> [R3 + R4]    // G_ST1
      add R4, 4 -> R4
      bnz R9, loop
L134: st R3 -> [R7 + 4]

Instrumented program:
      add R1, R2 -> R3
loop: load snippet for G_LD1
      call coherence protocol if load miss
      load snippet for G_LD2
      call coherence protocol if load miss
      sub R9, 1 -> R9
      add R6, R7 -> R8
      store snippet for G_ST1
      call coherence protocol if store miss
      add R4, 4 -> R4
      bnz R9, loop
L134: st R3 -> [R7 + 4]
The Lock Problem (original DSZOOM)
Example store access pattern (array traversal)

Operation        CUID  Original snippet handling
ST 0xE22F0000    98    lock dir entry 98; store; unlock dir entry 98
ST 0xE22F0008    98    lock dir entry 98; store; unlock dir entry 98
ST 0xE22F0010    98    lock dir entry 98; store; unlock dir entry 98
ST 0xE22F0018    98    lock dir entry 98; store; unlock dir entry 98
ST 0xE22F0020    98    lock dir entry 98; store; unlock dir entry 98
ST 0xE22F0028    98    lock dir entry 98; store; unlock dir entry 98
ST 0xE22F0030    98    lock dir entry 98; store; unlock dir entry 98
ST 0xE22F0038    98    lock dir entry 98; store; unlock dir entry 98
ST 0xE22F0040    99    lock dir entry 99; store; unlock dir entry 99
ST 0xE22F0048    99    lock dir entry 99; store; unlock dir entry 99
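The array-traversal example can be replayed with a short sketch that counts one lock acquisition per store. The `cu_id` mapping is an assumption inferred from the example, where the CUID changes after eight 8-byte stores, which suggests 64-byte coherence units; the real address-to-CUID mapping is not given in the slides.

```python
BASE_ADDR = 0xE22F0000
BASE_CUID = 98   # CUID of the unit holding BASE_ADDR in the example
UNIT_SIZE = 64   # inferred: the CUID changes every 8 stores of 8 bytes


def cu_id(addr):
    """Hypothetical address-to-CUID mapping that matches the example."""
    return BASE_CUID + (addr - BASE_ADDR) // UNIT_SIZE


def original_snippet_trace(addresses):
    """Replay the original store snippet: lock, store, unlock on every store."""
    trace = []
    lock_ops = 0
    for addr in addresses:
        unit = cu_id(addr)
        trace.append(f"lock dir entry {unit}; store; unlock dir entry {unit}")
        lock_ops += 1  # one atomic lock acquisition per store
    return trace, lock_ops


addresses = [BASE_ADDR + 8 * i for i in range(10)]
trace, lock_ops = original_snippet_trace(addresses)
for addr, line in zip(addresses, trace):
    print(f"ST 0x{addr:08X}  {line}")
print("lock acquisitions:", lock_ops)
```

Ten stores cost ten lock acquisitions, although only two directory entries are involved.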
DSZOOM Fine-Grain Coherence
Magic value (load), atomic operations (store)

Original program:
      add R1, R2 -> R3
loop: ld [R1 + R4] -> R6    // G_LD1
      ld [R2 + R4] -> R7    // G_LD2
      sub R9, 1 -> R9
      add R6, R7 -> R8
      st R8 -> [R3 + R4]    // G_ST1
      add R4, 4 -> R4
      bnz R9, loop
L134: st R3 -> [R7 + 4]

Instrumented program:
      add R1, R2 -> R3
loop: ld [R1 + R4] -> R6                    // original load
      if (R6 == MAGIC)                      // test permission
          LD_PROTOCOL();                    // protocol if miss
      ld [R2 + R4] -> R7                    // original load
      if (R7 == MAGIC)                      // test permission
          LD_PROTOCOL();                    // protocol if miss
      sub R9, 1 -> R9
      add R6, R7 -> R8
      LOCK(LOCAL_DIR);                      // lock local dir
      if (LOCAL_DIR != WRITE_PERMISSION)
          ST_PROTOCOL();                    // protocol if miss
      st R8 -> [R3 + R4]                    // original store
      UNLOCK(LOCAL_DIR);                    // unlock local dir
      add R4, 4 -> R4
      bnz R9, loop
L134: st R3 -> [R7 + 4]
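The magic-value load check in the instrumented program can be sketched as below. The sketch assumes that invalidating a local copy overwrites the data with the magic value, so the loaded value itself doubles as the permission test; the `LocalCopy` class, the `home` dictionary and the `MAGIC` bit pattern are hypothetical.

```python
MAGIC = 0xDEADBEEF  # hypothetical magic bit pattern


class LocalCopy:
    def __init__(self):
        self.words = {}          # address -> value held in the local copy
        self.protocol_calls = 0  # how often the load protocol ran

    def invalidate(self, addr):
        """Assumed behavior: invalidation writes the magic value over the data."""
        self.words[addr] = MAGIC

    def ld_protocol(self, addr, home):
        """Slow path: fetch the current value from the home copy."""
        self.protocol_calls += 1
        self.words[addr] = home[addr]
        return self.words[addr]

    def checked_load(self, addr, home):
        value = self.words[addr]                  # original load
        if value == MAGIC:                        # test permission
            value = self.ld_protocol(addr, home)  # protocol if miss
        return value


home = {0x100: 7}
copy = LocalCopy()
copy.words[0x100] = 7

copy.invalidate(0x100)  # another node wrote the data
home[0x100] = 42

first = copy.checked_load(0x100, home)   # miss: runs the protocol
second = copy.checked_load(0x100, home)  # hit: only the compare is paid
print(first, second, copy.protocol_calls)
```

A valid word that happens to equal the magic value also takes the slow path, which in this model simply returns the same value.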
Sequential Instrumentation Overhead
Average instrumentation overhead when run on a single processor (SPLASH2, -O3):
Integer load instrumentation overhead: 3%
• Only integer loads instrumented
Float load instrumentation overhead: 31%
• Only floating-point loads instrumented
Store instrumentation overhead: 61%
• Only stores instrumented
Write Permission Caching in Action
Example store access pattern (array traversal)

Operation        CUID  WPC snippet handling
ST 0xE22F0000    98    check WPC; miss; upd. WPC; lock dir entry 98; store
ST 0xE22F0008    98    check WPC; hit; store
ST 0xE22F0010    98    check WPC; hit; store
ST 0xE22F0018    98    check WPC; hit; store
ST 0xE22F0020    98    check WPC; hit; store
ST 0xE22F0028    98    check WPC; hit; store
ST 0xE22F0030    98    check WPC; hit; store
ST 0xE22F0038    98    check WPC; hit; store
ST 0xE22F0040    99    check WPC; miss; unlock 98; upd. WPC; lock 99; store
ST 0xE22F0048    99    check WPC; hit; store

[Figure: Write Permission Cache contents: 98, then 99]
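The same traversal can be replayed with a one-entry write permission cache to count how many lock acquisitions remain. As before, the `cu_id` mapping is an assumption inferred from the example (64-byte units starting at CUID 98), and the trace strings only approximate the wording of the table.

```python
BASE_ADDR = 0xE22F0000
BASE_CUID = 98
UNIT_SIZE = 64  # inferred from the example access pattern


def cu_id(addr):
    """Hypothetical address-to-CUID mapping that matches the example."""
    return BASE_CUID + (addr - BASE_ADDR) // UNIT_SIZE


def wpc_trace(addresses):
    """Replay the stores with a single-entry WPC that keeps the lock."""
    wpc = None  # the cached CUID, None while the cache is empty
    trace = []
    lock_ops = 0
    for addr in addresses:
        unit = cu_id(addr)
        if wpc == unit:  # fast path
            trace.append("check WPC; hit; store")
            continue
        steps = ["check WPC", "miss"]  # slow path
        if wpc is not None:
            steps.append(f"unlock {wpc}")
        wpc = unit
        lock_ops += 1
        steps += ["upd. WPC", f"lock {unit}", "store"]
        trace.append("; ".join(steps))
    return trace, lock_ops


addresses = [BASE_ADDR + 8 * i for i in range(10)]
trace, lock_ops = wpc_trace(addresses)
hits = sum(1 for line in trace if "hit" in line)
print("lock acquisitions:", lock_ops, "hits:", hits)
```

For this pattern the lock acquisitions drop from one per store to one per coherence unit.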
The Write Permission Cache Idea
Keep the lock
Rely on store locality
SPARC application registers

Original program:
      add R1, R2 -> R3
loop: ld [R1 + R4] -> R6    // G_LD1
      ld [R2 + R4] -> R7    // G_LD2
      sub R9, 1 -> R9
      add R6, R7 -> R8
      st R8 -> [R3 + R4]    // G_ST1
      add R4, 4 -> R4
      bnz R9, loop
L134: st R3 -> [R7 + 4]

Write Permission Cache snippet:
WPC_FASTPATH:
      if (WPC != CU_ID(ADDR))
          WPC_SLOWPATH();
      st R8 -> [R3 + R4]    // original store

WPC_SLOWPATH:
      UNLOCK(WPC);
      WPC = CU_ID(ADDR);
      LOCK(WPC);
      if (LOCAL_DIR != WRITE_PERMISSION)
          ST_PROTOCOL();
Experimental Setup: Software
Benchmarks: unmodified SPLASH2
Compiler: GCC 3.3.3 (-O0 and -O3)
Instrumentation tool: custom made
Experimental Setup: Hardware
SMP: Sun Enterprise E6000 Server
• 16 UltraSPARC II (250 MHz)
• Memory access time 330 ns [lmbench]
HW-DSM: Sun Wildfire (2 E6000 nodes)
• Remote memory access time 1700 ns [lmbench]
• Hardware coherent interconnect, BW 800 MB/s
DSZOOM: Runs in user space on the Wildfire system
• put (get) = uncacheable block load (store) operation
• atomic = ldstub (load store unsigned byte, SPARC V9)
• maintains coherence between private copies of G_MEM
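The way a one-byte lock can be built on ldstub is sketched below. This is a single-threaded model only: on SPARC the read and the write of ldstub form one atomic operation, which plain Python does not reproduce. The `directory` byte array and the helper names are hypothetical.

```python
LOCKED = 0xFF  # ldstub leaves the addressed byte set to all ones
FREE = 0x00


def ldstub(memory, addr):
    """Model of ldstub: return the old byte, then set the byte to 0xFF.

    On real hardware both steps happen as one atomic operation.
    """
    old = memory[addr]
    memory[addr] = LOCKED
    return old


def try_lock(memory, addr):
    """The lock was acquired only if the byte was free before the ldstub."""
    return ldstub(memory, addr) == FREE


def unlock(memory, addr):
    """Release the lock with an ordinary store of zero."""
    memory[addr] = FREE


directory = bytearray(4)  # four hypothetical one-byte directory locks

first = try_lock(directory, 2)   # byte was free: acquired
second = try_lock(directory, 2)  # byte already 0xFF: collision
unlock(directory, 2)
third = try_lock(directory, 2)   # free again: acquired
print(first, second, third)
```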
Write Permission Cache Hit Rate
[Figure: WPC hit rate (0.0 to 1.0) for fft, lu-c, lu-nc, radix, barnes, cholesky, fmm, ocean-c, ocean-nc, radiosity, raytrace, water-nsq, water-sp and the average. Series: 1, 2, 4, 8, 16, 32 and 1024 WPC entries]
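How the number of WPC entries can affect the hit rate is sketched below on a synthetic store stream. The direct-mapped organization (indexing by CUID modulo the number of entries) and the interleaved two-array stream are assumptions for illustration; the numbers it prints are not SPLASH2 results.

```python
def wpc_hit_rate(cu_ids, entries):
    """Hit rate of a direct-mapped WPC (assumed organization)."""
    cache = [None] * entries
    hits = 0
    for unit in cu_ids:
        slot = unit % entries
        if cache[slot] == unit:
            hits += 1
        else:
            cache[slot] = unit  # miss: replace the cached CUID
    return hits / len(cu_ids)


def interleaved_stream(units, stores_per_unit=8):
    """Alternate stores to two arrays: A from CUID 98, B from CUID 201."""
    stream = []
    for k in range(units):
        for _ in range(stores_per_unit):
            stream.append(98 + k)
            stream.append(201 + k)
    return stream


stream = interleaved_stream(units=4)
one_entry = wpc_hit_rate(stream, 1)
two_entries = wpc_hit_rate(stream, 2)
print(one_entry, two_entries)
```

With a single entry the two arrays evict each other on every store, while two entries hold both active units at once.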
Sequential Instrumentation Overhead
[Figure: Instrumentation overhead [%] (0% to 300%) for fft, lu-c, lu-nc, radix, barnes, cholesky, fmm, ocean-c, ocean-nc, radiosity, raytrace, water-nsq, water-sp and the average. Series: st, st-swpc, st-dwpc]
Execution Time, 16 processors (2x8)
Performance bug in paper (popc).
[Figure: Normalized execution time (0.0 to 3.0) for fft, lu-c, lu-nc, radix, barnes, fmm, radiosity, raytrace, water-nsq, water-sp and the average. Series: HW-DSM, DSZOOM-base, DSZOOM-dwpc]
Conclusions
Write permission cache (WPC)
• Effectively reduces store instrumentation overhead
• 2 entries are sufficient
Store instrumentation overhead reduction: 42%
HW-DSM/SW-DSM gap reduction: 28%
Parallel performance improvement: 9%
http://www.it.uu.se/research/group/uart
Thanks and Questions
Memory Consistency
The base architecture implements sequential consistency (SC) by requiring acknowledgements from all sharing nodes before a global store request is granted.
Introducing the WPC in an invalidation-based environment does not weaken the memory model.
The WPC only extends the permission tenure, that is, the time before the write permission is given up.
If the memory model of each node is weaker than SC, the node's model determines the memory model of the system.
Deadlock
WPC entries are flushed at:
• Synchronization points
• Failures to acquire directory locks
• Thread termination
WPC + flag synchronization can lead to deadlock
• Timers
• Interrupt other CPUs
• Lack of forward progress
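Why a cached write permission has to be flushed can be sketched with two threads and one directory lock. This is a simplified, single-threaded model: the `Directory` and `WpcThread` classes are hypothetical, and the explicit `flush()` call stands in for a flush at one of the points listed above.

```python
class Directory:
    def __init__(self):
        self.owner = {}  # CUID -> name of the thread holding its lock

    def try_lock(self, unit, thread):
        """Acquire the directory lock unless another thread holds it."""
        if unit in self.owner and self.owner[unit] != thread:
            return False  # directory collision
        self.owner[unit] = thread
        return True

    def unlock(self, unit, thread):
        if self.owner.get(unit) == thread:
            del self.owner[unit]


class WpcThread:
    def __init__(self, name, directory):
        self.name = name
        self.directory = directory
        self.wpc = None  # single-entry WPC

    def flush(self):
        """Give up the cached write permission and release its lock."""
        if self.wpc is not None:
            self.directory.unlock(self.wpc, self.name)
            self.wpc = None

    def store(self, unit):
        """Return True if the store completed, False if it must be retried."""
        if self.wpc == unit:
            return True  # WPC hit
        self.flush()     # slow path: release the old entry first
        if not self.directory.try_lock(unit, self.name):
            return False  # lock is cached elsewhere; own WPC stays empty
        self.wpc = unit
        return True


directory = Directory()
a = WpcThread("A", directory)
b = WpcThread("B", directory)

a_done = a.store(98)     # A stores and keeps the lock in its WPC
b_blocked = b.store(98)  # B cannot make progress while A holds the lock
a.flush()                # A reaches a flush point, e.g. a synchronization point
b_done = b.store(98)     # B now acquires the lock
print(a_done, b_blocked, b_done)
```

If A waited on a flag that B sets only after its store, and A never flushed, neither thread would make progress.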
Directory Collisions
Directory collision: if a requesting processor fails to acquire a directory lock
The number of directory collisions does not increase when fewer than 32 WPC entries are used
More information in the paper