Age Based Scheduling for Asymmetric Multiprocessors
Nagesh B Lakshminarayana, Jaekyu Lee & Hyesoon Kim




Outline

• Background and Motivation
• Age Based Scheduling
• Evaluation
• Conclusion


Asymmetric (Chip) Multiprocessors

• Heterogeneous architectures in which all cores have the same ISA but different performance

[Figure: Heterogeneous Architecture, one PE A alongside four PE B]


Asymmetric (Chip) Multiprocessors

• Potential for better performance than SMPs that occupy the same area and consume the same power

[Figure: Symmetric Chip Multiprocessor (SMP/CMP) with Core0 to Core3, compared with an Asymmetric Chip Multiprocessor (AMP/ACMP) with Core0 to Core3]


AMPs present new challenges

• Thread scheduling is one of them


Scheduling in Multiprocessor OSes

• Thread assignment: assign each new thread to the least loaded core

• Load balancing: make the load on all cores uniform

• Idle balancing: move threads from busy cores to an idle core
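The three mechanisms above can be modeled in a few lines of Python. This is a simplified sketch, not Linux kernel code: each core is represented by a hypothetical run queue (a list of thread names), load is approximated by queue length, and the function names `assign_least_loaded`, `load_balance` and `idle_balance` are invented for illustration. None of the functions consults core speed, which is the sense in which such schedulers treat all cores as identical.

```python
def assign_least_loaded(run_queues, thread):
    """Thread assignment: put a new thread on the core with the shortest run queue."""
    target = min(range(len(run_queues)), key=lambda c: len(run_queues[c]))
    run_queues[target].append(thread)
    return target


def load_balance(run_queues):
    """Load balancing: move threads from the busiest to the least busy core
    until queue lengths differ by at most one."""
    while True:
        busiest = max(range(len(run_queues)), key=lambda c: len(run_queues[c]))
        idlest = min(range(len(run_queues)), key=lambda c: len(run_queues[c]))
        if len(run_queues[busiest]) - len(run_queues[idlest]) <= 1:
            return
        run_queues[idlest].append(run_queues[busiest].pop())


def idle_balance(run_queues, idle_core):
    """Idle balancing: an idle core pulls one waiting thread from the busiest core."""
    busiest = max(range(len(run_queues)), key=lambda c: len(run_queues[c]))
    if len(run_queues[busiest]) > 1:  # leave the busiest core its running thread
        run_queues[idle_core].append(run_queues[busiest].pop())


# Four cores; core speeds never enter any of the decisions.
queues = [[], [], [], []]
for name in ["T0", "T1", "T2", "T3", "T4"]:
    assign_least_loaded(queues, name)
print(queues)  # [['T0', 'T4'], ['T1'], ['T2'], ['T3']]
```

On an AMP, whichever thread happens to land on the fast core stays there by accident rather than by design.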


Scheduling in Multiprocessor OSes

• These schedulers assume that all cores are identical
• On an AMP, this results in poor performance and application instability

[Figure: Parsec benchmarks on a (real) AMP using the Linux scheduler]

all-fast: 16 cores at 2 GHz
half-half: 8 cores at 2 GHz, 8 cores at 1 GHz
all-slow: 16 cores at 1 GHz


Problem with Current Scheduling

• The scheduler does not take advantage of the fast cores


Outline

• Background and Motivation
• Age Based Scheduling (ABS)
• Evaluation
• Conclusion


Motivation for Age Based Scheduling

• Many compute-intensive multithreaded applications follow the fork-join model
• Thread execution has milestones (barriers)

[Figure: Application Model. The main thread forks worker threads, which pass through a series of barriers and then join the main thread.]
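The fork-join model with barrier milestones can be illustrated with Python's standard `threading` module. This toy sketch is not taken from any benchmark; the function name `run_fork_join` and the thread and phase counts are hypothetical. Each `barrier.wait()` call acts as a milestone, so no thread logs phase p + 1 before every thread has logged phase p.

```python
import threading


def run_fork_join(num_threads=4, num_phases=3):
    """Fork num_threads workers that run num_phases barrier-separated phases, then join."""
    barrier = threading.Barrier(num_threads)
    lock = threading.Lock()
    log = []  # (phase, thread id) in the order the phases complete

    def worker(tid):
        for phase in range(num_phases):
            # ... the compute-intensive work of this phase would go here ...
            with lock:
                log.append((phase, tid))
            barrier.wait()  # milestone: nobody starts phase + 1 until all finish phase

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:  # fork: the main thread creates the workers together
        t.start()
    for t in threads:  # join: the main thread waits for all of them to terminate
        t.join()
    return log


log = run_fork_join()
phases = [phase for phase, _ in log]
print(phases)  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

Because all phase-0 entries are appended before any thread passes the first barrier, the phase list is always non-decreasing, whatever the thread interleaving.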


Symmetry of Applications

• Threads created together are symmetric
  – Based on instruction count
  – Degree of Symmetry = Std Dev / Average

[Figure: Degree of Symmetry of Parsec Benchmarks]

(Benchmarks with a degree of symmetry ≤ 0.1 are considered symmetric)
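The degree-of-symmetry metric is the coefficient of variation of the per-thread instruction counts. A minimal Python sketch follows; the slides do not say whether the population or sample standard deviation is used, so the use of `statistics.pstdev` (population) is an assumption, and the instruction counts are made-up inputs.

```python
from statistics import mean, pstdev

SYMMETRY_THRESHOLD = 0.1  # from the slide: degree of symmetry <= 0.1 means symmetric


def degree_of_symmetry(instruction_counts):
    """Std Dev / Average of the instruction counts of threads created together.
    Population standard deviation is an assumption of this sketch."""
    return pstdev(instruction_counts) / mean(instruction_counts)


def is_symmetric(instruction_counts):
    return degree_of_symmetry(instruction_counts) <= SYMMETRY_THRESHOLD


balanced = [1000, 1020, 980, 1000]  # hypothetical per-thread instruction counts
skewed = [1000, 2000, 500, 1500]
print(round(degree_of_symmetry(balanced), 4), is_symmetric(balanced))
print(round(degree_of_symmetry(skewed), 4), is_symmetric(skewed))
```

A lower value means the threads are closer to executing the same amount of work.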


Insight

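• It is difficult to predict the absolute execution duration, so predict the relative execution duration instead

[Figure: threads T1, T2, T3 and T4 run between two barriers; their absolute execution duration is unknown (execution duration = ?), but exe_dur(T1) = exe_dur(T2) = exe_dur(T3) = exe_dur(T4)]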


Putting It Together

• Applications follow the fork-join model with milestones in between
• Many applications are symmetric
• It is therefore easy to predict the relative execution duration to the next milestone

These observations lead to Age Based Scheduling


What is Age?

Age is the progress made by a thread towards its next milestone


Age Calculation

• Threads created together have the same age

• As a thread executes, it ages
• Its age is reset when it crosses a milestone

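[Figure: age over time for two threads; tA = age of thread A, tB = age of thread B. Thread A: tA = 0 at creation, tA = 30 during execution, tA = X at a milestone, tA = 0 after the barrier. Thread B: tB = 0, tB = 50, tB = 0. Milestones are barriers and termination. X is unknown and assumed to be a large value.]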
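The age bookkeeping described above fits in a small class. In this hypothetical sketch (`ThreadAge` and its methods are invented names), age is counted in abstract work units; instruction counts are a plausible choice given that progress is tracked with performance counters, but the unit is an assumption.

```python
class ThreadAge:
    """Tracks the age of one thread: its progress towards the next milestone."""

    def __init__(self, name):
        self.name = name
        self.age = 0  # threads created together all start with the same age (0)

    def execute(self, work):
        """The thread ages as it executes `work` units (assumed: retired instructions)."""
        self.age += work

    def cross_milestone(self):
        """Reset the age when the thread passes a barrier."""
        self.age = 0


a, b = ThreadAge("A"), ThreadAge("B")
a.execute(30)
b.execute(50)
print(a.age, b.age)  # 30 50
b.cross_milestone()
print(b.age)  # 0
```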


Age Based Scheduling Algorithm

To make a scheduling decision:

• Calculate each thread's remaining execution duration to its next milestone based on its age

• Assign threads with longer remaining execution durations to the fast core: Longest Job to Fast Core First (LJFCF)
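LJFCF can be sketched as a sort by predicted remaining duration. In this Python sketch the remaining duration is X − age, as in the algorithm's worked example; the concrete value of X, the clamp at zero for threads whose age exceeds X, and the function names are assumptions made for illustration.

```python
def remaining_duration(age, x):
    """Predicted remaining execution duration to the next milestone: X - age.
    Clamped at 0 in case a thread has already run past the assumed X."""
    return max(x - age, 0)


def ljfcf(ages, num_fast, x):
    """Longest Job to Fast Core First.

    ages:     dict mapping thread name -> current age
    num_fast: number of fast cores
    x:        assumed age of every thread at its next milestone (the slides' X)
    Returns (threads placed on fast cores, threads placed on slow cores).
    """
    ranked = sorted(ages, key=lambda t: remaining_duration(ages[t], x), reverse=True)
    return ranked[:num_fast], ranked[num_fast:]


X = 100  # hypothetical value standing in for the unknown, large X
fast, slow = ljfcf({"T1": 30, "T2": 60, "T3": 10, "T4": 45}, num_fast=1, x=X)
print(fast, slow)  # ['T3'] ['T1', 'T4', 'T2']
```

Since all threads share the same X, ranking by X − age is the same as ranking by youngest age first; X only matters once per-thread values differ.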


Application of LJFCF

• Apply whenever
  – A thread is created
  – A core becomes idle
  – The reassignment timer expires (for load balancing)


Working of the Algorithm

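[Figure: thread T1 is created with tA = 0 and has aged to tA = 30 during execution. Its age at the next milestone (barrier) is X, so its remaining execution duration is rem_exe = X − 30. The final milestone is termination.]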


Remaining Execution Duration (I)

• Track the progress of threads
• Using prediction [AGE]
  – Predict that all threads have the same inter-milestone distance

[Figure: tA = age of thread A, tB = age of thread B. Both threads start at age 0 at creation and are predicted to reach the same age X at each milestone (barrier or termination): tA = X, tB = X. After the barrier, tA is reset to 0.]


Remaining Execution Duration (II)

• Using profiling [AGE(PROF)]
  – Threads have different inter-milestone distances, calculated from a metric obtained by profiling

[Figure: tA = age of thread A, tB = age of thread B. Thread A is expected to reach age X at each milestone (barrier or termination), with tA reset to 0 after the barrier; thread B is expected to reach age rX, where r is from the profiler.]

• Only one r value for each thread
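The only difference between AGE and AGE(PROF) is the expected age at the next milestone: X for every thread under prediction, rX under profiling. The sketch below assumes X = 1000 and made-up r values; it shows how a profiled r can separate two threads that prediction alone considers equal.

```python
def remaining_duration(age, x, r=1.0):
    """Remaining execution duration to the next milestone.

    AGE (prediction): every thread is assumed to reach the milestone at age X (r = 1).
    AGE(PROF): a thread with profiled ratio r is assumed to reach it at age r * X.
    """
    return max(r * x - age, 0.0)


X = 1000.0                      # hypothetical value for the unknown inter-milestone age
profile = {"A": 1.0, "B": 1.5}  # hypothetical profiled r values, one per thread
ages = {"A": 400.0, "B": 400.0}

pred = {t: remaining_duration(ages[t], X) for t in ages}
prof = {t: remaining_duration(ages[t], X, profile[t]) for t in ages}
print(pred)  # {'A': 600.0, 'B': 600.0}
print(prof)  # {'A': 600.0, 'B': 1100.0}
```

With prediction alone the two threads tie; with the profiled r, LJFCF would favor B for the fast core.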


Working of the Algorithm

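[Figure: threads A, B, C and D on one fast core and three slow cores, with rem_exeA = 50, rem_exeB = 70, rem_exeC = 90 and rem_exeD = 30. Threads A and C swap: C, which has the longest remaining execution duration (90), moves to the fast core, and A (50) moves to a slow core.]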
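The swap in this example corresponds to one reassignment step: rank all threads by remaining execution duration and give the fast core(s) the longest ones. This Python sketch assumes, as the figure appears to show, that thread A starts on the fast core; the function name `reassign` and the choice to put a demoted thread into the slot vacated by the promoted one are illustrative assumptions.

```python
def reassign(fast_threads, slow_threads, rem_exe):
    """One LJFCF reassignment step (illustrative sketch).

    fast_threads / slow_threads: thread names currently on fast / slow cores
    rem_exe: dict mapping thread name -> remaining execution duration
    Returns new (fast, slow) lists in which the fast cores hold the threads
    with the longest remaining execution durations.
    """
    ranked = sorted(fast_threads + slow_threads, key=lambda t: rem_exe[t], reverse=True)
    new_fast = ranked[:len(fast_threads)]
    # Threads leaving the fast cores take over the slots vacated by promoted threads.
    demoted = [t for t in fast_threads if t not in new_fast]
    new_slow = [demoted.pop(0) if t in new_fast else t for t in slow_threads]
    return new_fast, new_slow


rem_exe = {"A": 50, "B": 70, "C": 90, "D": 30}  # values from the example above
print(reassign(["A"], ["B", "C", "D"], rem_exe))  # (['C'], ['B', 'A', 'D'])
```

Threads that already sit on the right kind of core are left in place, so only A and C migrate.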


Benefit of Age Based Scheduling

• Asymmetry aware
• Utilizes all cores
• Gives all threads opportunities to run on fast cores


Implementation

• OS
  – Track progress using performance counters
  – Disable the counter on interrupts

• Compiler (AGE(PROF))
  – Pass the profiled information
  – One value for each thread
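Disabling the counter on interrupts keeps a thread's age from including work done by interrupt handlers. The toy model below is only an illustration; a real implementation would program hardware performance counters inside the OS, and the `ProgressCounter` class, its methods, and the choice of retired instructions as the counted event are hypothetical.

```python
class ProgressCounter:
    """Per-thread progress counter that ignores work done in interrupt handlers (toy model)."""

    def __init__(self):
        self.count = 0
        self.enabled = True

    def on_interrupt_entry(self):
        self.enabled = False  # interrupt-handler instructions are not the thread's progress

    def on_interrupt_exit(self):
        self.enabled = True

    def retire(self, instructions):
        """Account for `instructions` retired instructions (assumed counted event)."""
        if self.enabled:
            self.count += instructions


c = ProgressCounter()
c.retire(100)
c.on_interrupt_entry()
c.retire(40)  # executed by the interrupt handler, not counted
c.on_interrupt_exit()
c.retire(60)
print(c.count)  # 160
```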


Outline

• Background and Motivation
• Age Based Scheduling
• Evaluation
• Conclusion


Evaluation

• Simulation-based experiments
  – Trace + execution hybrid simulator
  – Locks and barriers are modeled
  – Context switch and migration overheads are simulated
  – 10 ms time slice for each thread

• Machine configuration
  – 1 fast core, 7 slow cores, 8:1 speed ratio (other configurations are in the paper)

• Benchmarks
  – Symmetric: Parsec (simmedium input)
  – Asymmetric: Splash-2, OMPSCR, SuperLU


Comparisons with Other Policies


Policy | Description
Linux | Linux O(1) scheduler
RR | Threads are assigned to fast cores in a round-robin fashion
SCALEDLD [Li’07] | Fast Core First assignment, asymmetry-aware load balancing (baseline)
FCA-AGE | Fast Core First assignment with age-based periodic reassignment
AGE | Age-based assignment and reassignment using prediction
AGE(PROF) | Age-based assignment and reassignment using profiling
AGE(ORACLE) | Age-based assignment and reassignment using an oracle


LJFCF vs Other Policies (I)

• Parsec

[Figure: % reduction in execution time over SCALEDLD for RR, FCA-AGE, AGE, AGE(PROF) and AGE(ORACLE)]

Policy | Avg % reduction in execution time over SCALEDLD
RR | -36.64
FCA-AGE | 9.8
AGE | 10.4
AGE(PROF) | 13.2
AGE(ORACLE) | 15.4

Baseline: SCALEDLD


LJFCF vs Other Policies (II)

• Asymmetric Benchmarks

[Figure: % reduction in execution time over SCALEDLD for FCA-AGE, AGE, AGE(PROF) and AGE(ORACLE)]

Policy | Avg % reduction in execution time over SCALEDLD
FCA-AGE | 8.2
AGE | 7.7
AGE(PROF) | 9.4
AGE(ORACLE) | 13.1

Baseline: SCALEDLD


Idle Cycles

[Figure: idle cycles (0% to 100%) of the slow cores and the fast core for blackscholes, bodytrack, fluidanimate and swaptions under Linux, SCALEDLD and AGE]

• Linux scheduler: most of the idle cycles are contributed by the fast core

• SCALEDLD: keeps the same thread(s) on the fast core
• AGE: assigns different threads to the fast core


Different AMP Configurations

• The need for asymmetry-aware scheduling increases as cores become more asymmetric

• AGE-based policies show larger improvements over SCALEDLD as asymmetry increases

[Figure: normalized execution time of Linux, SCALEDLD, AGE and AGE(PROF) on Parsec for the 2/1, 4/1, 6/1 and 8/1 configurations]

X/1: the ratio of the speeds of the fast and slow cores is X:1


Outline

• Background and Motivation
• Age Based Scheduling
• Evaluation
• Conclusion


Conclusion

• Age based scheduling (ABS) for asymmetric multiprocessors
  – ABS assumes that threads created at the same time are symmetric
  – ABS assigns threads to cores based on their predicted remaining execution durations
  – Predictions are made based on the age of threads

• Average reduction in execution time over Li's mechanism (SCALEDLD): 10.4% (Pred) and 13.2% (Prof) for Parsec, and 7.6% (Pred) and 9.4% (Prof) for asymmetric benchmarks


THANK YOU