Improving Real-Time Performance on Multicore Platforms Using MemGuard
University of Kansas Dr. Heechul Yun
10/28/2013
Multicore
• Server, desktop, mobile, RT/embedded
Challenges: Shared Resources
[Figure: unicore vs. multicore. Unicore: tasks T1 and T2 share one CPU and its memory hierarchy. Multicore: Core1 to Core4 run tasks T1 to T8 and share a single memory hierarchy.]
Performance Impact
Case Study
• HRT
  – Synthetic real-time video capture
  – P=20ms, D=13ms
  – Cache-insensitive
• X-server
  – Scrolling text on a gnome-terminal
• Hardware platform
  – Intel Xeon 3530
  – 8MB shared L3 cache
  – 4GB DDR3 1333MHz DIMM (1ch)
• CPU cores are isolated
[Figure: a desktop PC (Intel Xeon 3530). HRT runs on Core1 and the X-server on Core2; both share the 8MB L3 and DRAM.]
HRT Time Distribution
• 28% deadline violations
• Due to contention in DRAM
[Figure: distribution of HRT execution times. Solo: 99th percentile = 10.2ms. With X-server: 99th percentile = 14.3ms.]
Outline
• Motivation
• Background
  – DRAM basics
  – Worst-case memory performance
  – MemGuard [RTAS’13]
• Improving Real-Time Performance with MemGuard
Background: DRAM Organization
• DRAM has multiple banks
• Different banks can be accessed in parallel
[Figure: Core1 to Core4 share the L3 cache, which connects through the memory controller (MC) to a DRAM DIMM with Bank1 to Bank4.]
Best-case
• Peak = 10.6 GB/s
  – DDR3 1333MHz
• Out-of-order processors
[Figure: DRAM organization diagram (cores, L3, MC, Bank1 to Bank4), labeled "Fast".]
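The peak figure can be reproduced with a short calculation. This is a minimal sketch, assuming the standard 64-bit DIMM data bus and the single channel listed for the case-study platform; the function name `peak_bandwidth_gbs` is made up for illustration. It gives about 10.66 GB/s, which the slide quotes as 10.6 GB/s.

```python
def peak_bandwidth_gbs(transfers_per_sec: float,
                       bus_width_bits: int = 64,
                       channels: int = 1) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumes every transfer moves a full bus width of data, which is
    only reachable in the best case.
    """
    bytes_per_transfer = bus_width_bits // 8
    return transfers_per_sec * bytes_per_transfer * channels / 1e9


# DDR3-1333: 1333 million transfers per second on one 64-bit channel.
ddr3_1333 = peak_bandwidth_gbs(1333e6)
print(f"DDR3-1333 peak: {ddr3_1333:.2f} GB/s")
```

Real workloads only approach this number when requests spread across banks and mostly hit open rows.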
Most-cases
• Performance = ??
[Figure: DRAM organization diagram (cores, L3, MC, Bank1 to Bank4), labeled "Mess".]
(*) Intel® 64 and IA-32 Architectures Optimization Reference Manual
Worst-case
• 1-bank b/w
  – Less than peak b/w
  – How much?
[Figure: DRAM organization diagram (cores, L3, MC, Bank1 to Bank4), labeled "Slow".]
(*) Intel® 64 and IA-32 Architectures Optimization Reference Manual
Background: DRAM Operation
• Stateful per-bank access time
  – Row miss: 19 cycles
  – Row hit: 9 cycles
(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)
[Figure: Bank 1 with Row 1 to Row 5 and a row buffer. A READ (Bank 1, Row 3, Col 7) activates Row 3 into the row buffer and reads Col 7 from it; a precharge closes the row.]
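The stateful access time can be illustrated with a toy model of a single bank. This is a sketch, not a DRAM simulator: it uses only the two latencies quoted above, charges the first access to an idle bank as a row miss (a simplification), and the `Bank` class and `total_cycles` helper are hypothetical names.

```python
ROW_HIT_CYCLES = 9     # from the slide (PC6400-DDR2, 5-5-5)
ROW_MISS_CYCLES = 19   # from the slide


class Bank:
    """One DRAM bank that remembers which row is in its row buffer."""

    def __init__(self):
        self.open_row = None  # no row open yet

    def access(self, row: int) -> int:
        """Return the access latency in cycles and update the open row."""
        if self.open_row == row:
            return ROW_HIT_CYCLES
        # Different (or no) row open: precharge + activate + read.
        self.open_row = row
        return ROW_MISS_CYCLES


def total_cycles(rows) -> int:
    """Total latency of a sequence of accesses to a single bank."""
    bank = Bank()
    return sum(bank.access(r) for r in rows)


same_row = total_cycles([3] * 8)        # one miss, then seven hits
alternating = total_cycles([1, 2] * 4)  # every access is a miss
print(same_row, alternating)            # 82 152
```

The same eight accesses take almost twice as long when they alternate between rows, which is why the access pattern matters as much as the access count.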
Real Worst-case
• 1 bank & always row miss: ~1.2GB/s
• Each core = ¼ x 1.2GB/s = 300MB/s ?
[Figure: Core1 to Core4 all access Bank1; the request order over time is Row 1, Row 2, Row 3, Row 4, Row 1, Row 2, …, so every access is a row miss.]
(*) Intel® 64 and IA-32 Architectures Optimization Reference Manual
Background: Memory Controller (MC)
• Request queue(s)
  – Not fair (open-row first re-ordering)
  – Unpredictable queuing delay
[Figure: Bruce Jacob et al., “Memory Systems: Cache, DRAM, Disk”, Fig. 13.1.]
Challenges for Real-Time Systems
• Multiple parallel resources (banks)
• Stateful bank access latency
• Queuing delay
• Unpredictable memory performance
MemGuard [RTAS’13]
• Goal: guarantee minimum memory b/w for each core
• How: b/w reservation + best-effort sharing
[Figure: MemGuard sits in the operating system on top of the multicore processor. Each of Core1 to Core4 has a hardware performance counter (PMC) and its own B/W regulator, with reservations of 0.6GB/s, 0.2GB/s, 0.2GB/s and 0.2GB/s; a reclaim manager is shared by the regulators. The cores reach the DRAM DIMM through the memory controller.]
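Reservation
• Idea
  – Scheduler regulates per-core memory b/w using h/w counters
  – Period = 1 scheduler tick (e.g., 1ms)
[Figure: timeline from 0 to 2ms showing a core's budget and its activity (computation vs. memory fetch). Within a period an RT idle task is scheduled on the core; at the next period boundary the RT idle task is suspended.]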
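The per-period budget idea can be sketched as a small user-space simulation. This is not MemGuard's kernel code: the 64-byte cache line, the `Regulator` class, and its method names are assumptions for illustration, and the `throttled` flag stands in for scheduling the RT idle task.

```python
CACHE_LINE_BYTES = 64  # assumption: one memory fetch moves one 64-byte line


def budget_lines(bandwidth_mb_s: float, period_ms: float = 1.0) -> int:
    """Convert a bandwidth reservation into memory fetches per period."""
    bytes_per_period = bandwidth_mb_s * 1e6 * period_ms / 1e3
    return int(bytes_per_period // CACHE_LINE_BYTES)


class Regulator:
    """Hypothetical per-core regulator with a replenishing budget."""

    def __init__(self, bandwidth_mb_s: float, period_ms: float = 1.0):
        self.budget = budget_lines(bandwidth_mb_s, period_ms)
        self.remaining = self.budget
        self.throttled = False

    def on_period_start(self) -> None:
        """Replenish the budget and let the core run again."""
        self.remaining = self.budget
        self.throttled = False

    def on_memory_fetches(self, lines: int) -> int:
        """Account for requested fetches; return how many were served."""
        if self.throttled:
            return 0
        served = min(lines, self.remaining)
        self.remaining -= served
        if self.remaining == 0:
            self.throttled = True  # stands in for scheduling the RT idle task
        return served


reg = Regulator(bandwidth_mb_s=300)   # 300MB/s over a 1ms period
first = reg.on_memory_fetches(5000)   # exceeds the budget, so it is capped
second = reg.on_memory_fetches(100)   # core is throttled until the next period
reg.on_period_start()
third = reg.on_memory_fetches(100)    # served again after replenishment
print(reg.budget, first, second, third)
```

In the real system the hardware counter would count the fetches and raise an interrupt when the budget is exhausted; here the caller reports them explicitly.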
Reservation
• Key insight
  – Worst-case bandwidth can be guaranteed.
  – Total reserved bandwidth must not exceed the worst-case DRAM bandwidth
• System-wide reservation rule
  $\sum_{i=1}^{m} B_i \le \text{worst-case DRAM bandwidth}$
  – $m$: number of cores
  – $B_i$: Core i’s b/w reservation
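The system-wide reservation rule amounts to an admission check on the per-core reservations. A minimal sketch, assuming the ~1.2GB/s worst-case figure quoted earlier for a single bank with constant row misses; the `admissible` helper is a hypothetical name.

```python
WORST_CASE_MB_S = 1200  # ~1.2GB/s: 1 bank & always row miss (from the slides)


def admissible(reservations_mb_s, worst_case_mb_s: float = WORST_CASE_MB_S) -> bool:
    """True if the total reserved bandwidth fits within the worst case."""
    return sum(reservations_mb_s) <= worst_case_mb_s


four_core = admissible([600, 200, 200, 200])  # the four-core example
case_study = admissible([900, 300])           # HRT=900MB/s, X=300MB/s
too_much = admissible([900, 600])             # exceeds the worst case
print(four_core, case_study, too_much)        # True True False
```

Any reservation set that passes this check can be guaranteed even under the worst-case access pattern; bandwidth beyond it can only be offered on a best-effort basis.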
Best-Effort Sharing
• Spare Sharing [RTAS’13]
• Proportional Sharing [Unpublished TR]
[Figure: timeline (ms) from 0 to 2 for Core0 (900MB/s reservation) and Core1 (300MB/s reservation), marking guaranteed b/w, best-effort b/w, throttled intervals, and reschedule points.]
Case Study
• HRT
  – Synthetic real-time video capture
  – P=20ms, D=13ms
  – Cache-insensitive
• X-server
  – Scrolling text on a gnome-terminal
• Hardware platform
  – Intel Xeon 3530
  – 8MB shared cache
  – 4GB DDR3 1333MHz DIMM
[Figure: a desktop PC (Intel Xeon 3530). HRT runs on Core1 and the X-server on Core2; both share the 8MB L3 and DRAM.]
w/o MemGuard
• HRT (solo): 99pct = 10.2ms
• HRT (w/ Xserver): 99pct = 14.3ms; X’s CPU util: 78%
MemGuard reserve only (HRT=900MB/s, X=300MB/s)
• HRT (solo): 99pct = 10.7ms
• HRT (w/ Xserver): 99pct = 11.2ms; X’s CPU util: 4%
MemGuard reserve (HRT=900MB/s, X=300MB/s) + best-effort sharing
• HRT (solo): 99pct = 10.7ms
• HRT (w/ Xserver): 99pct = 10.7ms; X’s CPU util: 48%
MemGuard reserve (HRT=600MB/s, X=600MB/s) + best-effort sharing
• HRT (solo): 99pct = 10.9ms
• HRT (w/ Xserver): 99pct = 12.1ms; X’s CPU util: 61%
Real-Time Performance Improvement
• Using MemGuard, we can achieve
  – No deadline miss for HRT
  – Good X-server performance
[Figure: HRT and X-server results.]
Conclusion
• Unpredictable memory performance
  – Multiple resources (banks), per-bank state, unpredictable queueing delay
• MemGuard
  – Guarantees minimum memory bandwidth for each core
  – b/w reservation (guaranteed part) + best-effort sharing
• Case study
  – On an Intel Xeon multicore platform, using HRT + X-server
  – MemGuard can improve real-time performance efficiently
• Limitations and Future Work
  – Coarse-grained enforcement (one OS tick)
  – Small guaranteed b/w → DRAM bank partitioning (submitted to RTAS’14)
https://github.com/heechul/memguard
Thank you.
Evaluation on Intel Core2
• T1: Synthetic video capture task (HRT)
  – Period=20ms (50Hz)
  – Deadline=14ms
  – Metrics: ACET, WCET, stdev, deadline miss ratio (out of 1000 periods)
• T2: Xserver, update screen (SRT)
  – Metric: CPU utilization
  – Higher CPU utilization → faster screen update
• Platform
  – Intel Core2Quad 8400, 2MB L2 cache x 2, tunable H/W prefetchers
  – PC6400 DDR2 DRAM DIMM x 1
• Three platform configurations
  – Exp1: Private L2, Prefetch=off
  – Exp2: Private L2, Prefetch=on
  – Exp3: Shared L2, Prefetch=on
[Figure: Intel Core2Quad based PC. Core0 to Core3 sit on top of two L2 caches (with prefetchers), which connect to DRAM.]
Experiment 1 (Private L2, Prefetch=off)
• Configurations (T1 on Core1, T2 on Core2, one private L2 each)
  – Original
  – MemGuard (reserve only): 550MB/s per core
  – MemGuard (reclaim + share): 550MB/s per core
[Figure: T1’s execution time (ms, axis 0 to 18), ACET and WCET, solo vs. corun for each configuration, with the deadline marked. Annotations: 38%, 78%, 92%; “30% WCET”; “Performance guarantee”; “Performance target”.]
Experiment 2: Prefetcher (Private L2, Prefetch=ON)
• Configurations (T1 on Core1, T2 on Core2, one private L2 each)
  – Original
  – MemGuard (reserve only): 550MB/s per core
  – MemGuard (reclaim + share): 550MB/s per core
• Observations
  – More slowdown (60%)
  – Not enough reservation
  – Deadline violation
[Figure: T1’s execution time (ms, axis 0 to 24), solo vs. corun for each configuration, with the deadline marked. Annotations: 33%, 82%, 94%.]
Experiment 2-2 (Private L2, Prefetch=ON)
• Configurations (T1 on Core1, T2 on Core2, one private L2 each)
  – Original
  – MemGuard (reserve only): T1=900MB/s, T2=200MB/s
  – MemGuard (reclaim + share): T1=900MB/s, T2=200MB/s
• Observations
  – Enough reservation
  – No deadline violation
[Figure: T1’s execution time (ms, axis 0 to 18), solo vs. corun for each configuration. Annotations: 14%, 69%, 94%; 60%.]
Experiment 3: Shared Cache (Shared L2, Prefetch=ON)
• Configurations (T1 on Core1, T2 on Core2, sharing one L2)
  – Original
  – MemGuard (reserve only): T1=900MB/s, T2=200MB/s
  – MemGuard (reclaim + share): T1=900MB/s, T2=200MB/s
• Observations
  – Even more slowdown (108%)
  – Minimum reservation
  – No deadline violation
[Figure: T1’s execution time, solo vs. corun for each configuration. Annotations: 11%, 63%, 92%.]