Improving Real-Time Performance on Multicore Platforms Using MemGuard University of Kansas Dr. Heechul Yun 10/28/2013

Page 1

Improving Real-Time Performance on Multicore Platforms Using MemGuard

University of Kansas Dr. Heechul Yun

10/28/2013

Page 2

Multicore

Server, Desktop, Mobile, RT/Embedded

Page 3

Challenges: Shared Resources

[Figure: Unicore (one CPU, tasks T1 and T2, one memory hierarchy) vs. Multicore (Core1 to Core4, tasks T1 to T8, one memory hierarchy)]

Performance Impact

Page 4

Case Study

• HRT
  – Synthetic real-time video capture
  – P=20ms, D=13ms
  – Cache-insensitive
• X-server
  – Scrolling text on a gnome-terminal
• Hardware platform
  – Intel Xeon 3530
  – 8MB shared L3 cache
  – 4GB DDR3 1333MHz DIMM (1ch)
• CPU cores are isolated

[Figure: A desktop PC (Intel Xeon 3530): HRT on Core1 and Xsrv. on Core2, sharing the L3 (8MB) and DRAM]

Page 5

HRT Time Distribution

• 28% deadline violations
• Due to contention in DRAM

[Figure: HRT execution-time distributions. Solo: 99pct = 10.2ms. W/ Xserver: 99pct = 14.3ms.]
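The 99th-percentile and deadline-violation figures on this slide are plain statistics over per-period execution times. The sketch below shows one way such numbers could be computed. The sample values and function names are hypothetical, and the nearest-rank percentile definition is an assumption, not necessarily the one used to produce the slide.

```python
import math

def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

def deadline_miss_ratio(samples, deadline_ms):
    """Fraction of jobs whose execution time exceeds the deadline."""
    misses = sum(1 for t in samples if t > deadline_ms)
    return misses / len(samples)

# Hypothetical execution times (ms) for ten periods; not measured data.
exec_times_ms = [9.8, 10.1, 12.9, 13.4, 10.0, 14.3, 9.9, 13.6, 10.2, 11.0]
p99 = percentile_nearest_rank(exec_times_ms, 99)
miss = deadline_miss_ratio(exec_times_ms, deadline_ms=13.0)
```

With D=13ms as on the case-study slide, any job that runs longer than 13ms counts as a violation.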

Page 6

Outline

• Motivation
• Background
  – DRAM basics
  – Worst-case memory performance
  – MemGuard [RTAS’13]
• Improving Real-Time Performance with MemGuard

Page 7

Background: DRAM Organization

[Figure: Core1 to Core4 share the L3, which connects through the Memory Controller (MC) to a DRAM DIMM containing Bank1 to Bank4]

• Have multiple banks
• Different banks can be accessed in parallel

Page 8

Best-case

[Figure: Core1 to Core4, L3, Memory Controller (MC), and a DRAM DIMM with Bank1 to Bank4; labeled "Fast"]

• Peak = 10.6 GB/s
  – DDR3 1333MHz

Page 9

Best-case

[Figure: Core1 to Core4, L3, Memory Controller (MC), and a DRAM DIMM with Bank1 to Bank4; labeled "Fast"]

• Peak = 10.6 GB/s
  – DDR3 1333MHz
• Out-of-order processors
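The peak figure follows from the transfer rate and the bus width. The sketch below reproduces the arithmetic, reading "1333MHz" as 1333 mega-transfers per second and assuming the standard 64-bit data bus of a single DDR3 channel. The function name is hypothetical.

```python
def peak_bandwidth_gb_s(transfers_per_sec, bus_width_bits=64, channels=1):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return transfers_per_sec * bytes_per_transfer * channels / 1e9

# DDR3-1333, one channel: roughly the 10.6 GB/s quoted on the slide.
peak = peak_bandwidth_gb_s(1333e6)
```

This is an upper bound that real workloads approach only when accesses spread across banks and mostly hit open rows.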

Page 10

Most-cases

[Figure: Core1 to Core4, L3, Memory Controller (MC), and a DRAM DIMM with Bank1 to Bank4; labeled "Mess"]

• Performance = ??

(*) Intel® 64 and IA-32 Architectures Optimization Reference Manual

Page 11

Worst-case

[Figure: Core1 to Core4, L3, Memory Controller (MC), and a DRAM DIMM with Bank1 to Bank4; labeled "Slow"]

• 1 bank b/w
  – Less than peak b/w
  – How much?

(*) Intel® 64 and IA-32 Architectures Optimization Reference Manual

Page 12

Background: DRAM Operation

• Stateful per-bank access time
  – Row miss: 19 cycles
  – Row hit: 9 cycles

(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)

[Figure: Bank 1 with Row 1 to Row 5 and a Row Buffer. Labeled operations: activate, precharge, read/write (Col7). Example request: READ (Bank 1, Row 3, Col 7)]
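To make "stateful per-bank access time" concrete, here is a toy model of one bank that remembers which row is open. The 9-cycle and 19-cycle latencies come from the slide; the class and function names are hypothetical, and treating the very first access as a full row miss is a simplification.

```python
ROW_HIT_CYCLES = 9    # from the slide
ROW_MISS_CYCLES = 19  # from the slide

class Bank:
    """Toy model of one DRAM bank with a single open-row buffer."""

    def __init__(self):
        self.open_row = None  # no row is open initially

    def access(self, row):
        """Return the latency in cycles and update the open row."""
        if row == self.open_row:
            return ROW_HIT_CYCLES
        self.open_row = row  # the requested row replaces the open one
        return ROW_MISS_CYCLES

def total_cycles(rows):
    """Total latency of a sequence of row accesses to a fresh bank."""
    bank = Bank()
    return sum(bank.access(r) for r in rows)

same_row = total_cycles([3, 3, 3, 3])     # one miss, then three hits
alternating = total_cycles([3, 4, 3, 4])  # every access is a row miss
```

The same four requests cost 46 cycles when they stay in one row and 76 cycles when they alternate rows, so latency depends on access history rather than on the request alone.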

Page 13

Real Worst-case

• 1 bank & always row miss: ~1.2GB/s
• Each core = ¼ x 1.2GB/s = 300MB/s ?

[Figure: Core1 to Core4, L3, Memory Controller (MC), and a DRAM DIMM with Bank1 to Bank4. Request order over time: Row 1, Row 2, Row 3, Row 4, Row 1, Row 2]

(*) Intel® 64 and IA-32 Architectures Optimization Reference Manual
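The numbers on this slide can be checked with two lines of arithmetic. The sketch below assumes each request moves one 64-byte cache line (an assumption, not stated on the slide) and derives the per-request service time implied by 1.2GB/s, along with the even four-way split. Function names are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumption: one request transfers one cache line

def implied_service_time_ns(bandwidth_bytes_per_s, request_bytes=CACHE_LINE_BYTES):
    """Time per request (ns) if requests are served strictly one at a time."""
    return request_bytes / bandwidth_bytes_per_s * 1e9

def even_share_mb_s(bandwidth_bytes_per_s, cores):
    """Bandwidth per core (MB/s) under a perfectly even split."""
    return bandwidth_bytes_per_s / cores / 1e6

worst_case = 1.2e9                             # ~1.2 GB/s from the slide
t_ns = implied_service_time_ns(worst_case)     # about 53 ns per request
share = even_share_mb_s(worst_case, cores=4)   # 300 MB/s per core
```

The even split is only a nominal figure; the question mark on the slide signals that the hardware does not promise it.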

Page 14

Background: Memory Controller (MC)

• Request queue(s)
  – Not fair (open-row first re-ordering)
  – Unpredictable queuing delay

[Figure: Bruce Jacob et al., “Memory Systems: Cache, DRAM, Disk”, Fig. 13.1]
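Open-row-first reordering can delay a request behind later arrivals. The toy scheduler below illustrates this for a single bank queue. It is a simplified model written for this example, not the policy of any specific controller: all requests are assumed to be queued up front, and the function name and the `(core, row)` tuple format are hypothetical.

```python
def open_row_first_order(requests, open_row):
    """Serve (core, row) requests from one bank queue.

    Policy: the oldest request that hits the currently open row goes first;
    if none hits, the oldest request overall is served and opens its row.
    """
    pending = list(requests)
    served = []
    while pending:
        hit = next((r for r in pending if r[1] == open_row), None)
        chosen = hit if hit is not None else pending[0]
        pending.remove(chosen)  # removes the oldest matching entry
        open_row = chosen[1]
        served.append(chosen)
    return served

# Core B's request arrives second but targets a different row.
queue = [("A", 1), ("B", 2), ("A", 1), ("A", 1)]
order = open_row_first_order(queue, open_row=1)
```

Core B's request is served last even though it arrived second, and its wait grows with the number of row hits core A keeps supplying. That is the unfairness and unpredictable queuing delay the slide refers to.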

Page 15

Challenges for Real-Time Systems

• Multiple parallel resources (banks)
• Stateful bank access latency
• Queuing delay

• Unpredictable memory performance

Page 16

MemGuard [RTAS’13]

• Goal: guarantee minimum memory b/w for each core
• How: b/w reservation + best effort sharing

[Figure: MemGuard in the Operating System, above the Multicore Processor. Core1 to Core4 each have a PMC and a BW Regulator, with budgets of 0.6GB/s, 0.2GB/s, 0.2GB/s, and 0.2GB/s, plus a Reclaim Manager. The Memory Controller connects the processor to the DRAM DIMM.]

Page 17

Reservation

• Idea
  – Scheduler regulates per-core memory b/w using h/w counters
  – Period = 1 scheduler tick (e.g., 1ms)

[Figure: timeline from 0 to 2ms showing the per-period Budget and Core activity (computation vs. memory fetch). Annotations: "Schedule a RT idle task" and "Suspend the RT idle task".]
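The regulation idea can be sketched as a per-period budget that is refilled at every tick. The model below only counts accesses; it does not reproduce the kernel mechanism from the figure (scheduling and suspending an RT idle task). The 64-byte access size, the function names, and the demand values are assumptions for illustration.

```python
CACHE_LINE_BYTES = 64  # assumption: one counted event = one 64-byte line fetch

def budget_per_period(bandwidth_mb_s, period_ms=1.0):
    """Convert a bandwidth reservation into a per-period access budget."""
    bytes_per_period = bandwidth_mb_s * 1e6 * period_ms / 1e3
    return int(bytes_per_period // CACHE_LINE_BYTES)

def regulate(demand_per_period, budget):
    """Simulate per-period throttling of one core.

    demand_per_period: accesses the core wants to issue in each period.
    Returns (served, throttled). Unserved demand is dropped in this toy
    model rather than carried over to the next period.
    """
    served, throttled = [], []
    for demand in demand_per_period:
        served.append(min(demand, budget))  # budget is refilled every period
        throttled.append(demand > budget)   # core would idle until next tick
    return served, throttled

budget = budget_per_period(900)  # 900 MB/s over a 1 ms period
served, throttled = regulate([5000, 20000, 10000], budget)
```

A 900MB/s reservation over a 1ms period corresponds to 14062 line fetches; only the second period exceeds that budget and is throttled.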

Page 18

Reservation

• Key insight
  – Worst-case bandwidth can be guaranteed.
  – Total reserved bandwidth ≤ worst-case DRAM bandwidth ($r_{min}$)
• System-wide reservation rule

  $\sum_{i=1}^{m} B_i \le r_{min}$

  • m: # of cores
  • $B_i$: Core i’s b/w reservation
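The system-wide rule amounts to an admission check on the set of per-core reservations. The sketch below uses hypothetical names. The comparison is non-strict on the assumption that the budgets shown on the MemGuard overview slide (0.6 + 0.2 + 0.2 + 0.2 GB/s) are meant to add up to exactly the 1.2GB/s worst case.

```python
def reservations_admissible(reservations_mb_s, worst_case_mb_s):
    """True if the per-core reservations fit within the worst-case bandwidth."""
    return sum(reservations_mb_s) <= worst_case_mb_s

# Budgets from the MemGuard overview slide, in MB/s.
ok = reservations_admissible([600, 200, 200, 200], 1200)

# A hypothetical over-subscribed configuration.
too_much = reservations_admissible([900, 600], 1200)
```

Keeping the total at or below the worst-case figure is what lets each reservation be honored regardless of what the other cores do.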

Page 19

Best-Effort Sharing

• Spare Sharing [RTAS’13]
• Proportional Sharing [Unpublished TR]

[Figure: timeline (ms) from 0 to 2 for Core0 (900MB/s) and Core1 (300MB/s). Legend: guaranteed b/w, best-effort b/w, throttled. Annotation: reschedule.]
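The slide names the two sharing policies without spelling them out. Purely as an illustration of what a proportional split means arithmetically, and not as a description of MemGuard's implementation, the sketch below divides a hypothetical amount of spare bandwidth in proportion to the reservations shown in the figure. The function name and the 400MB/s spare value are assumptions.

```python
def proportional_split(spare_mb_s, reservations_mb_s):
    """Split spare bandwidth across cores in proportion to their reservations."""
    total = sum(reservations_mb_s)
    return [spare_mb_s * r / total for r in reservations_mb_s]

# Core0 = 900 MB/s and Core1 = 300 MB/s as in the figure; 400 MB/s of
# spare bandwidth is a made-up value for the example.
extra = proportional_split(400, [900, 300])
```

Under this reading Core0 would receive three times as much best-effort bandwidth as Core1, matching the 3:1 ratio of their guaranteed shares.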

Page 20

Case Study

• HRT
  – Synthetic real-time video capture
  – P=20ms, D=13ms
  – Cache-insensitive
• X-server
  – Scrolling text on a gnome-terminal
• Hardware platform
  – Intel Xeon 3530
  – 8MB shared cache
  – 4GB DDR3 1333MHz DIMM

[Figure: A desktop PC (Intel Xeon 3530): HRT on Core1 and Xsrv. on Core2, sharing the L3 (8MB) and DRAM]

Page 21

w/o MemGuard

• HRT (solo): HRT’s 99pct: 10.2ms
• HRT (w/ Xserver): HRT’s 99pct: 14.3ms, X’s CPU util: 78%

Page 22

MemGuard reserve only (HRT=900MB/s, X=300MB/s)

• HRT (solo): HRT’s 99pct: 10.7ms
• HRT (w/ Xserver): HRT’s 99pct: 11.2ms, X’s CPU util: 4%

Page 23

MemGuard reserve (HRT=900MB/s, X=300MB/s) + best-effort sharing

• HRT (solo): HRT’s 99pct: 10.7ms
• HRT (w/ Xserver): HRT’s 99pct: 10.7ms, X’s CPU util: 48%

Page 24

MemGuard reserve (HRT=600MB/s, X=600MB/s) + best-effort sharing

• HRT (solo): HRT’s 99pct: 10.9ms
• HRT (w/ Xserver): HRT’s 99pct: 12.1ms, X’s CPU util: 61%

Page 25

Real-Time Performance Improvement

• Using MemGuard, we can achieve
  – No deadline miss for HRT
  – Good X-server performance

[Figure: two panels, HRT and X-server]

Page 26

Conclusion

• Unpredictable memory performance
  – Multiple resources (banks), per-bank state, unpredictable queueing delay
• MemGuard
  – Guarantee minimum memory bandwidth for each core
  – b/w reservation (guaranteed part) + best-effort sharing
• Case study
  – On Intel Xeon multicore platform, using HRT + X-server
  – MemGuard can improve real-time performance efficiently
• Limitations and Future Work
  – Coarse-grained enforcement (one OS tick)
  – Small guaranteed b/w → DRAM bank partitioning (submitted to RTAS’14)

https://github.com/heechul/memguard

Page 27

Thank you.

Page 28

Evaluation on Intel Core2

• T1: Synthetic video capture task (HRT)
  – Period=20ms (50Hz)
  – Deadline=14ms
  – Metrics: ACET, WCET, stdev, deadline miss ratio (out of 1000 periods)
• T2: Xserver, update screen (SRT)
  – Metric: CPU utilization
    • Higher CPU utilization → faster screen update
• Platform
  – Intel Core2Quad 8400, 2MB L2 cache x 2, tunable H/W prefetchers
  – PC6400 DDR2 DRAM DIMM x 1
• Three platform configurations
  – Exp1: Private L2, Prefetch=off
  – Exp2: Private L2, Prefetch=on
  – Exp3: Shared L2, Prefetch=on

[Figure: Intel Core2Quad based PC: Core0 to Core3, two L2 caches (with prefetchers), DRAM]

Page 29

Experiment 1 (Private L2, Prefetch=off)

[Figure: bar chart of T1’s exec. time (ms), y-axis 0 to 18, solo vs. corun, with the deadline marked, for three configurations, each with T1 on Core1 and T2 on Core2, private L2 caches, and shared DRAM:
• Original
• MemGuard (reserve only): T1=550MB/s, T2=550MB/s
• MemGuard (reclaim + share): T1=550MB/s, T2=550MB/s
Chart labels: 38%, 78%, 92%. Annotation: Performance guarantee.]

Page 30

Experiment 1 (Private L2, Prefetch=off)

[Figure: bar chart of T1’s exec. time (ms), y-axis 0 to 18, solo vs. corun, showing ACET and WCET with the deadline marked, for three configurations, each with T1 on Core1 and T2 on Core2, private L2 caches, and shared DRAM:
• Original
• MemGuard (reserve only): T1=550MB/s, T2=550MB/s
• MemGuard (reclaim + share): T1=550MB/s, T2=550MB/s
Chart labels: 38%, 78%, 92%. Annotations: Performance guarantee; 30% WCET.]

Page 31

Experiment 1 (Private L2, Prefetch=off)

[Figure: bar chart of T1’s exec. time (ms), y-axis 0 to 18, solo vs. corun, with the deadline marked, for three configurations, each with T1 on Core1 and T2 on Core2, private L2 caches, and shared DRAM:
• Original
• MemGuard (reserve only): T1=550MB/s, T2=550MB/s
• MemGuard (reclaim + share): T1=550MB/s, T2=550MB/s
Chart labels: 38%, 78%, 92%. Highlighted: 550MB/s, 550MB/s.]

Page 32

Experiment 1 (Private L2, Prefetch=off)

[Figure: bar chart of T1’s exec. time (ms), y-axis 0 to 18, solo vs. corun, with the deadline marked, for three configurations, each with T1 on Core1 and T2 on Core2, private L2 caches, and shared DRAM:
• Original
• MemGuard (reserve only): T1=550MB/s, T2=550MB/s
• MemGuard (reclaim + share): T1=550MB/s, T2=550MB/s
Chart labels: 38%, 78%, 92%.]

Page 33

Experiment 1 (Private L2, Prefetch=off)

[Figure: bar chart of T1’s exec. time (ms), y-axis 0 to 18, solo vs. corun, for three configurations, each with T1 on Core1 and T2 on Core2, private L2 caches, and shared DRAM:
• Original
• MemGuard (reserve only): T1=550MB/s, T2=550MB/s
• MemGuard (reclaim + share): T1=550MB/s, T2=550MB/s
Chart labels: 38%, 78%, 92%. Annotation: Performance target.]

Page 34

Experiment 2: Prefetcher (Private L2, Prefetch=ON)

[Figure: bar chart of T1’s exec. time (ms), y-axis 0 to 24, solo vs. corun, with the deadline marked, for three configurations, each with T1 on Core1 and T2 on Core2, private L2 caches, and shared DRAM:
• Original
• MemGuard (reserve only): T1=550MB/s, T2=550MB/s
• MemGuard (reclaim + share): T1=550MB/s, T2=550MB/s
Chart labels: 33%, 82%, 94%, 60%. Annotations: More slowdown; Not enough reserv.; Deadline violation.]

Page 35

Experiment 2-2 (Private L2, Prefetch=ON)

[Figure: bar chart of T1’s exec. time (ms), y-axis 0 to 18, solo vs. corun, for three configurations, each with T1 on Core1 and T2 on Core2, private L2 caches, and shared DRAM:
• Original
• MemGuard (reserve only): T1=900MB/s, T2=200MB/s
• MemGuard (reclaim + share): T1=900MB/s, T2=200MB/s
Chart labels: 14%, 69%, 94%, 60%. Annotations: Enough reserv.; No deadline violation.]

Page 36

Experiment 3: Shared Cache (Shared L2, Prefetch=ON)

[Figure: bar chart of T1’s exec. times (ms), y-axis 0 to 24, solo vs. corun, for three configurations, each with T1 on Core1 and T2 on Core2, a shared L2, and DRAM:
• Original
• MemGuard (reserve only): T1=900MB/s, T2=200MB/s
• MemGuard (reclaim + share): T1=900MB/s, T2=200MB/s
Chart labels: 11%, 63%, 92%, 108%. Annotations: Even more slowdown; Minimum reserv.; No deadline violation.]