Improving Real-Time Performance on Multicore Platforms Using MemGuard

University of Kansas Dr. Heechul Yun

10/28/2013

Multicore

Server Desktop Mobile RT/Embedded

Challenges: Shared Resources

[Figure: Unicore: tasks T1 and T2 share one CPU and its memory hierarchy. Multicore: tasks T1–T8 run on Core1–Core4, which share one memory hierarchy.]

Performance Impact

Case Study

• HRT
  – Synthetic real-time video capture
  – P=20ms, D=13ms
  – Cache-insensitive
• X-server
  – Scrolling text on a gnome-terminal
• Hardware platform
  – Intel Xeon 3530
  – 8MB shared L3 cache
  – 4GB DDR3 1333MHz DIMM (1ch)
• CPU cores are isolated

[Figure: a desktop PC (Intel Xeon 3530). HRT runs on Core1 and the X-server on Core2; both cores share the L3 (8MB) and DRAM.]

HRT Time Distribution

• 28% deadline violations
• Due to contention in DRAM

[Figure: distribution of HRT execution times. Solo: 99pct = 10.2ms. With Xserver: 99pct = 14.3ms.]
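The two figures quoted on this slide, a 99th-percentile execution time and a share of missed deadlines, can be computed from a list of per-job measurements. The sketch below is a minimal illustration. The nearest-rank percentile definition, the function names, and the sample values are assumptions for illustration, not the method or data behind the slide.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def deadline_miss_ratio(samples, deadline):
    """Fraction of jobs whose execution time exceeds the deadline."""
    return sum(1 for s in samples if s > deadline) / len(samples)


# Hypothetical measurements in milliseconds (not taken from the experiment).
samples_ms = [10.0] * 72 + [14.0] * 28

print(percentile(samples_ms, 99))
print(deadline_miss_ratio(samples_ms, deadline=13.0))
```

A high percentile is used instead of the mean because a real-time task is judged by its slow jobs, not its typical ones.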

Outline

• Motivation

• Background
  – DRAM basics
  – Worst-case memory performance
  – MemGuard [RTAS’13]

• Improving Real-Time Performance with MemGuard

Background: DRAM Organization

[Figure: Core1–Core4 share the L3 cache; the memory controller (MC) connects it to Bank1–Bank4 of the DRAM DIMM.]

• Have multiple banks
• Different banks can be accessed in parallel

Best-case

• Peak = 10.6 GB/s
  – DDR3 1333MHz
• Out-of-order processors

[Figure: Core1–Core4, L3, memory controller (MC), and Bank1–Bank4 of the DRAM DIMM, annotated "Fast".]
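The peak figure can be sanity-checked with simple arithmetic: transfers per second times bytes per transfer. The sketch below assumes a single 64-bit channel and reads "1333MHz" as 1333 million transfers per second; both are assumptions, since the slides do not spell out how the number was obtained.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, bus_width_bits=64, channels=1):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).

    bus_width_bits=64 and channels=1 are assumed defaults, not values
    taken from the slides.
    """
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


# 1333 MT/s x 8 bytes is roughly 10.66 GB/s, in line with the ~10.6 GB/s quoted.
print(round(peak_bandwidth_gb_s(1333), 2))
```

This is an upper bound that only holds when accesses spread across banks and hit open rows; the following slides are about how far real workloads fall below it.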

Most-cases

• Performance = ??

[Figure: Core1–Core4, L3, memory controller (MC), and Bank1–Bank4 of the DRAM DIMM, annotated "Mess".]

(*) Intel® 64 and IA-32 Architectures Optimization Reference Manual

Worst-case

• 1 bank b/w
  – Less than peak b/w
  – How much?

[Figure: Core1–Core4, L3, memory controller (MC), and Bank1–Bank4 of the DRAM DIMM, annotated "Slow".]

Background: DRAM Operation

• Stateful per-bank access time
  – Row miss: 19 cycles
  – Row hit: 9 cycles

(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)

[Figure: Bank 1 with Row 1 to Row 5 and a row buffer, with arrows labeled "activate", "precharge", and "read/write". Example request: READ (Bank 1, Row 3, Col 7).]
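The effect of the row buffer can be illustrated with a toy model of a single bank. The sketch below uses the two latencies from the slide (9 cycles for a row hit, 19 for a row miss). The class and function names are hypothetical, and treating the very first access to an idle bank as a row miss is a simplifying assumption.

```python
ROW_HIT_CYCLES = 9    # from the slide (PC6400-DDR2, 5-5-5)
ROW_MISS_CYCLES = 19  # from the slide


class Bank:
    """One DRAM bank with a single row buffer (illustrative model only)."""

    def __init__(self):
        self.open_row = None  # no row is open initially

    def access(self, row):
        """Return the access latency in cycles and update the row buffer."""
        if row == self.open_row:
            return ROW_HIT_CYCLES
        # Different row: precharge the old row and activate the new one.
        self.open_row = row
        return ROW_MISS_CYCLES


def total_cycles(rows):
    """Total latency of accessing the given rows, in order, on one bank."""
    bank = Bank()
    return sum(bank.access(r) for r in rows)


print(total_cycles([3, 3, 3, 3]))        # same row: one miss, then hits
print(total_cycles([1, 2, 3, 4, 1, 2]))  # row changes on every access
```

The same number of requests costs very different amounts of time depending only on the order of rows, which is what makes per-bank access time stateful.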

Real Worst-case

• 1 bank & always row miss: ~1.2GB/s
• Each core = ¼ x 1.2GB/s = 300MB/s ?

[Figure: Core1–Core4, L3, memory controller (MC), and Bank1–Bank4 of the DRAM DIMM. Request order over time: Row 1, Row 2, Row 3, Row 4, Row 1, Row 2.]

Background: Memory Controller(MC)

• Request queue(s)
  – Not fair (open-row first re-ordering)
  – Unpredictable queuing delay

[Figure: Bruce Jacob et al., “Memory Systems: Cache, DRAM, Disk”, Fig. 13.1.]
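To see why open-row-first re-ordering is unfair, consider a toy scheduler for a single bank that always serves the oldest request hitting the currently open row and falls back to the oldest request otherwise. This is a simplified sketch under stated assumptions (one bank, all requests already queued, no cap on how long a request can be bypassed); it is not the policy of any specific memory controller, and the function name is hypothetical.

```python
def schedule_open_row_first(queue, open_row=None):
    """Return the service order for a list of (core, row) requests.

    queue is given in arrival order. Requests that hit the open row are
    served first; otherwise the oldest pending request is served.
    """
    pending = list(queue)
    served = []
    while pending:
        # Oldest pending request that hits the currently open row, if any.
        hit = next((req for req in pending if req[1] == open_row), None)
        req = hit if hit is not None else pending[0]
        pending.remove(req)
        open_row = req[1]  # the served row is now in the row buffer
        served.append(req)
    return served


# Core "B" arrives second but is served last, behind core "A"'s row hits.
arrivals = [("A", 1), ("B", 2), ("A", 1), ("A", 1)]
print(schedule_open_row_first(arrivals))
```

The delay seen by core B depends on how many row hits other cores keep supplying, which is the unpredictable queuing delay named on the slide.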

Challenges for Real-Time Systems

• Multiple parallel resources (banks)
• Stateful bank access latency
• Queuing delay
• Unpredictable memory performance

MemGuard [RTAS’13]

• Goal: guarantee minimum memory b/w for each core
• How: b/w reservation + best effort sharing

[Figure: MemGuard architecture. In the operating system, MemGuard consists of one BW regulator per core (0.6GB/s, 0.2GB/s, 0.2GB/s, 0.2GB/s) and a reclaim manager. In the multicore processor, Core1–Core4 each have a PMC and share the memory controller, which connects to the DRAM DIMM.]

Reservation

• Idea
  – Scheduler regulates per-core memory b/w using h/w counters
  – Period = 1 scheduler tick (e.g., 1ms)

[Figure: timeline from 0 to 2ms showing the budget and the core activity (computation, memory fetch), with events labeled "Schedule a RT idle task" and "Suspend the RT idle task".]
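The reservation idea can be sketched as a per-core budget that is replenished every period and consumed by memory accesses. The model below is only an illustration under assumptions: it counts bytes in software, whereas the slide describes hardware counters and a scheduler, and all names (`Regulator`, `budget_for`) are hypothetical and not taken from the MemGuard code.

```python
from dataclasses import dataclass


@dataclass
class Regulator:
    """Per-core bandwidth regulator (illustrative model, not kernel code)."""
    budget_bytes: int       # bytes the core may fetch per period
    used_bytes: int = 0
    throttled: bool = False

    def on_period_start(self):
        # New period: replenish the budget and let the core run again.
        self.used_bytes = 0
        self.throttled = False

    def on_memory_access(self, nbytes):
        """Account one access; return False if the core is throttled."""
        if self.throttled:
            return False
        self.used_bytes += nbytes
        if self.used_bytes >= self.budget_bytes:
            # Budget exhausted: the core stays idle until the next period.
            self.throttled = True
        return True


def budget_for(bandwidth_mb_s, period_ms=1.0):
    """Convert a bandwidth reservation (MB/s) into a per-period byte budget."""
    return int(bandwidth_mb_s * 1_000_000 * period_ms / 1000)


reg = Regulator(budget_for(300))  # 300MB/s over a 1ms period
served = sum(reg.on_memory_access(64) for _ in range(10_000))
print(served, reg.throttled)
```

With a 1ms period, a 300MB/s reservation corresponds to a budget of 300,000 bytes per period; once it is used up, further accesses in that period are refused until `on_period_start` is called.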

Reservation

• Key insight
  – Worst-case bandwidth can be guaranteed.
  – Total reserved bandwidth < worst-case DRAM bandwidth
• System-wide reservation rule: $\sum_{i=1}^{m} B_i <$ worst-case DRAM bandwidth
  – $m$: # of cores
  – $B_i$: Core i’s b/w reservation
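The system-wide rule amounts to a simple admission check over the per-core reservations. The sketch below is a hypothetical helper, not part of MemGuard. It works in whole MB/s to avoid floating-point rounding, and it accepts equality because the reservations shown elsewhere in the deck (0.6 + 0.2 + 0.2 + 0.2 GB/s, and 900 + 300 MB/s) add up to exactly the ~1.2GB/s worst-case figure; whether the bound is strict is therefore an assumption.

```python
def admit(reservations_mb_s, guaranteed_mb_s):
    """Return True if the per-core reservations fit in the guaranteed b/w.

    reservations_mb_s: one integer reservation (MB/s) per core.
    guaranteed_mb_s:   worst-case DRAM bandwidth (MB/s).
    """
    return sum(reservations_mb_s) <= guaranteed_mb_s


WORST_CASE_MB_S = 1200  # ~1.2GB/s, the worst-case figure quoted in the slides

print(admit([600, 200, 200, 200], WORST_CASE_MB_S))
print(admit([900, 600], WORST_CASE_MB_S))
```

Keeping the total within the worst-case bandwidth is what turns each reservation into a guarantee: even if every access is a row miss on one bank, the DRAM can still deliver the sum.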

Best-Effort Sharing

• Spare Sharing [RTAS’13]
• Proportional Sharing [Unpublished TR]

[Figure: timeline (ms) from 0 to 2 for Core0 (900MB/s) and Core1 (300MB/s), with legend entries "guaranteed b/w", "best-effort b/w", "throttled", and "reschedule".]

Case Study

• HRT
  – Synthetic real-time video capture
  – P=20ms, D=13ms
  – Cache-insensitive
• X-server
  – Scrolling text on a gnome-terminal
• Hardware platform
  – Intel Xeon 3530
  – 8MB shared cache
  – 4GB DDR3 1333MHz DIMM

[Figure: a desktop PC (Intel Xeon 3530). HRT runs on Core1 and the X-server on Core2; both cores share the L3 (8MB) and DRAM.]

w/o MemGuard

• HRT (solo): HRT’s 99pct: 10.2ms
• HRT (w/ Xserver): HRT’s 99pct: 14.3ms, X’s CPU util: 78%

MemGuard reserve only (HRT=900MB/s, X=300MB/s)

• HRT (solo): HRT’s 99pct: 10.7ms
• HRT (w/ Xserver): HRT’s 99pct: 11.2ms, X’s CPU util: 4%

MemGuard reserve (HRT=900MB/s, X=300MB/s) + best-effort sharing

• HRT (solo): HRT’s 99pct: 10.7ms
• HRT (w/ Xserver): HRT’s 99pct: 10.7ms, X’s CPU util: 48%

MemGuard reserve (HRT=600MB/s, X=600MB/s) + best-effort sharing

• HRT (solo): HRT’s 99pct: 10.9ms
• HRT (w/ Xserver): HRT’s 99pct: 12.1ms, X’s CPU util: 61%

Real-Time Performance Improvement

• Using MemGuard, we can achieve
  – No deadline miss for HRT
  – Good X-server performance

[Figure: panels labeled "HRT" and "X-server".]

Conclusion

• Unpredictable memory performance
  – Multiple resources (banks), per-bank state, unpredictable queueing delay
• MemGuard
  – Guarantee minimum memory bandwidth for each core
  – b/w reservation (guaranteed part) + best-effort sharing
• Case study
  – On Intel Xeon multicore platform, using HRT + X-server
  – MemGuard can improve real-time performance efficiently
• Limitations and Future Work
  – Coarse-grained enforcement (one OS tick)
  – Small guaranteed b/w → DRAM bank partitioning (submitted to RTAS’14)

https://github.com/heechul/memguard

Thank you.

Evaluation on Intel Core2

• T1: Synthetic video capture task (HRT)
  – Period=20ms (50Hz)
  – Deadline=14ms
  – Metrics: ACET, WCET, stdev, deadline miss ratio (out of 1000 periods)
• T2: Xserver, update screen (SRT)
  – Metric: CPU utilization
  – Higher CPU utilization → faster screen update
• Platform
  – Intel Core2Quad 8400, 2MB L2 cache x 2, tunable H/W prefetchers
  – PC6400 DDR2 DRAM DIMM x 1
• Three platform configurations
  – Exp1: Private L2, Prefetch=off
  – Exp2: Private L2, Prefetch=on
  – Exp3: Shared L2, Prefetch=on

[Figure: Intel Core2Quad based PC with Core0–Core3, two L2 caches (pref.), and DRAM.]

Experiment 1

Private L2, Prefetch=off

[Figure: T1’s exec. time (ms), ACET and WCET, solo vs. corun, with the deadline marked, for three configurations: Original; MemGuard (reserve only), T1=550M/s and T2=550M/s; MemGuard (reclaim + share), T1=550M/s and T2=550M/s. T1 runs on Core1 and T2 on Core2, each with its own L2, sharing DRAM. Annotations: 38%, 78%, 92%; "30% WCET"; "Performance guarantee"; "Performance target".]

Experiment 2: Prefetcher

Private L2, Prefetch=ON

[Figure: T1's exec. time (ms), solo vs. corun, with the deadline marked, for three configurations: Original; MemGuard (reserve only), T1=550M/s and T2=550M/s; MemGuard (reclaim + share), T1=550M/s and T2=550M/s. Annotations: 33%, 82%, 94%; 60%; "More slowdown"; "Not enough reserv."; "Deadline violation".]

Experiment 2-2

Private L2, Prefetch=ON

[Figure: T1's exec. time (ms), solo vs. corun, for three configurations: Original; MemGuard (reserve only), T1=900M/s and T2=200M/s; MemGuard (reclaim + share), T1=900M/s and T2=200M/s. Annotations: 14%, 69%, 94%; 60%; "Enough reserv."; "No deadline violation".]

Experiment 3: Shared Cache

Shared L2, Prefetch=ON

[Figure: T1's exec. time (ms), solo vs. corun, for three configurations: Original; MemGuard (reserve only), T1=900M/s and T2=200M/s; MemGuard (reclaim + share), T1=900M/s and T2=200M/s. T1 on Core1 and T2 on Core2 share one L2. Annotations: 11%, 63%, 92%; 108%; "Even more slowdown"; "Minimum reserv."; "No deadline violation".]
