2009-11-27
Simplified Design Flow
(a picture from Ingo Sander)
Hardware architecture
So far, we have talked only about single-processor systems
– “Concurrency” implemented by “scheduling”
RT systems often consist of several processors
• Multiprocessor systems
– “Tightly coupled” processors connected by a “high-speed” interconnect, e.g. crossbar, bus, NoC (network on chip), etc.
– Single processor with multiple threads
• Distributed systems
– “Loosely coupled” processors connected by a “low-speed” network, e.g. CAN, Ethernet, Token Ring, etc.
Multiprocessor/Distributed Systems
-- Local area distributed system (our focus)
(figure: example architectures, from a complete multiprocessor system to a LAN-based distributed system)
OUTLINE
• Why multiprocessor? – energy, performance and predictability
• What are multiprocessor systems – OS etc.
• Designing RT systems on multiprocessors – task assignment
• Multiprocessor scheduling
– (semi-)partitioned scheduling
– global scheduling
Why multiprocessor systems?
“Tightly connected” processors by a “high-speed” interconnect e.g. cross-bar, bus, NoC etc.
(figure: “Hardware: Trends” – performance [log scale] vs. year; single-core performance has flattened, and further gains come from multicore, which requires parallel applications)
You must do “parallel processing” to get:
• Higher performance
– Single-core has reached the “limit” of implicit parallelism
– Frequency does not scale any more
• Lower power consumption: battery lifetime, cost of energy
– Increase the number of cores, decrease the frequency and voltage:
• Performance = Cores × F → 2·Cores × (F/2) = Cores × F
• Power = C × V² × F → 2 × C × (V/2)² × (F/2) = C × V² × F / 4
Keep the “same performance” using ¼ of the energy (doubling the cores) -- theoretically
• Experiments show that a 1% clock increase results in a 3% power increase.
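The arithmetic above can be checked directly. A minimal sketch, assuming the idealized dynamic-power model P = C·V²·F and perfectly parallel performance = cores·F (real chips also have static leakage and imperfect parallel speedup):

```python
# Idealized models from the slide: performance = cores * F, power = cores * C * V^2 * F.
# Assumes perfect parallel speedup and no static (leakage) power.

def performance(cores, f):
    return cores * f

def dynamic_power(cores, c, v, f):
    return cores * c * v ** 2 * f

base_perf = performance(1, 1.0)                 # one core at frequency F
base_power = dynamic_power(1, 1.0, 1.0, 1.0)    # power C * V^2 * F

# Double the cores, halve both frequency and supply voltage:
dual_perf = performance(2, 0.5)                 # 2 * (F/2) = F
dual_power = dynamic_power(2, 1.0, 0.5, 0.5)    # 2 * C * (V/2)^2 * (F/2) = C*V^2*F/4

print(dual_perf / base_perf)    # -> 1.0  : same throughput
print(dual_power / base_power)  # -> 0.25 : a quarter of the power
```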
CPU frequency vs Power consumption
1. Standard processor over-clocked 20%
2. Standard processor
3. Two standard processors each under-clocked 20%
What’s happening now?
• General-Purpose Computing (Symposium on High-Performance Chips, Hot Chips 21, Palo Alto, Aug 23-25, 2009)
– 4 cores in notebooks
– 12 cores in servers
• AMD 12-core Magny-Cours will consume less energy than previous generations with 6 cores
– 16 cores for IBM servers, Power 7
• Embedded Systems
– 4 cores in ARM11 MPCore embedded processors
What next?
• Manycores (hundreds of cores) predicted to be here in a few years – e.g. Ericsson
Typical Multicore Architecture
(figure: several CPU cores, each with a private L1 cache, sharing an on-chip L2 cache and the bandwidth to off-chip memory)
Multiprocessor models
• Identical multiprocessors:
– each processor has the same computing capacity
• Uniform multiprocessors:
– different processors have different computing capacities
• Heterogeneous multiprocessors:
– each (task, processor) pair may have a different computing capacity
Multiprocessor models
(figure: processors P1, P2 and P3 with equal computing capacity)
identical multiprocessors: each processor has the same computing capacity
Multiprocessor models
(figure: processors P1, P2 and P3 with speeds 1, 2 and 3; task T1 needs time x on P1, x/2 on P2 and x/3 on P3, and likewise task T2 needs y, y/2 and y/3)
uniform multiprocessors: different processors have different computing capacities
Multiprocessor models
(figure: processors P1, P2 and P3 where task T1 needs x, x/2 and x/3, but task T2 needs y, 1.5y and y; the speedup depends on the (task, processor) pair)
heterogeneous multiprocessors: each (task, processor) pair may have a different computing capacity
Single processor vs. multiprocessor OS
Symmetric Multiprocessing (SMP)
Simplified Design Flow of RT Systems
(a picture from Ingo Sander)
Task assignment
Task Assignment
• In the design flow:
– First, the application is partitioned into tasks or task graphs.
– At some stage, the execution times, communication costs, and data and control dependencies of all the tasks become known.
• Task assignment determines
– how many processors are needed (a bin-packing problem)
– on which processor each task executes
• This is a very complex problem (NP-hard)
– Often done off-line
– Often heuristics work well
Task Assignment
• The task models used in task assignment vary in complexity, depending on what is considered or ignored:
– Communication costs
– Data and control dependencies
– Resource requirements, e.g. WCET, memory, etc.
• In multicore platforms with shared caches, communication costs for tasks on the same chip may be very small
• It is often meaningful to consider the execution-time requirement (WCET) and ignore communication in an early design phase
Multiprocessor scheduling
• "Given a set J of jobs where job ji has length li, and a number of processors m, what is the minimum possible time required to schedule all jobs in J on m processors such that none overlap?" – Wikipedia
– That is, design a schedule such that the response time of the last task is minimized
(Alternatively: given M processors and N tasks, find a mapping from tasks to processors such that all the tasks are schedulable)
• The problem is NP-complete
• It is also known as the “load balancing problem”
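Since the problem is NP-complete, greedy heuristics are used in practice. A sketch of the classic Longest-Processing-Time-first heuristic (a standard textbook method, not taken from these slides): sort the jobs by decreasing length and always give the next job to the least-loaded processor.

```python
import heapq

def lpt_makespan(job_lengths, m):
    """Longest Processing Time first: greedy heuristic for minimizing the
    makespan of independent jobs on m identical processors. Its makespan is
    provably within a factor 4/3 - 1/(3m) of the optimum (Graham, 1969)."""
    heap = [(0.0, p) for p in range(m)]   # (current load, processor id)
    heapq.heapify(heap)
    for length in sorted(job_lengths, reverse=True):
        load, p = heapq.heappop(heap)     # least-loaded processor so far
        heapq.heappush(heap, (load + length, p))
    return max(load for load, _ in heap)

# Greedy result is 10; the optimum is 9 ({5, 4} vs {3, 3, 3}):
print(lpt_makespan([5, 4, 3, 3, 3], 2))   # -> 10.0
```

The example also shows why this is only a heuristic: the greedy schedule is close to, but not always equal to, the optimum.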
Multiprocessor scheduling of “periodic/sporadic real-time tasks”
• Partitioned scheduling
– Static allocation – task assignment
• Each task may only execute on a fixed processor
• No task migration
• Semi-partitioned scheduling
– Static allocation – task assignment
• Each instance (or part of an instance) of a task is assigned to a fixed processor
• A task instance, or part of it, may migrate
• Global scheduling
– Dynamic allocation – (dynamic) task assignment
• Any instance of any task may execute on any processor
• Task migration allowed
Multiprocessor Scheduling
(figures: under global scheduling, one shared waiting queue feeds all CPUs; under partitioned scheduling, each CPU has its own queue and task instances never move; under semi-partitioned scheduling, most tasks are fixed to a CPU but some instances are split across CPUs)
Partitioned scheduling
• Two steps:
– Determine a mapping of tasks to processors
– Perform run-time scheduling
• Example: Partitioned with EDF
– Assign tasks to the processors such that no processor’s capacity is exceeded (utilization bounded by 1.0)
– Schedule each processor using EDF
Bin-packing algorithms
• The problem concerns packing objects of varying sizes into boxes (“bins”) with the objective of minimizing the number of bins used.
– Heuristic solutions: Next Fit and First Fit
• Application to multiprocessor systems:
– Bins are represented by processors, objects by tasks.
– The decision whether a processor is “full” is derived from a utilization-based schedulability test.
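A minimal sketch of the First Fit variant for partitioned EDF, using utilization as the schedulability test (a processor is “full” once its total utilization would exceed 1.0; the utilization values are invented for illustration):

```python
def first_fit_edf(utilizations):
    """First Fit bin packing for partitioned EDF: place each task on the
    first processor whose total utilization stays at or below 1.0,
    opening a new processor when no existing one fits."""
    processors = []          # one utilization sum per processor
    assignment = []          # processor index chosen for each task
    for u in utilizations:
        for i, load in enumerate(processors):
            if load + u <= 1.0:
                processors[i] = load + u
                assignment.append(i)
                break
        else:                # no existing processor fits: open a new one
            processors.append(u)
            assignment.append(len(processors) - 1)
    return processors, assignment

loads, where = first_fit_edf([0.6, 0.5, 0.3, 0.3, 0.1])
print(len(loads))   # -> 2 : two processors suffice
print(where)        # -> [0, 1, 0, 1, 0]
```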
Rate-Monotonic First Fit (RMFF) [Dhall and Liu, 1978]
• First, sort the tasks in order of increasing periods.
• Task assignment
– All tasks are assigned in First Fit manner, starting from the task with the highest priority
– A task can be assigned to a processor if all tasks assigned to that processor remain RM-schedulable, i.e.
• the total utilization of the tasks assigned to that processor is bounded by n(2^(1/n) − 1), where n is the number of tasks assigned.
(One may also use the precise test to get a better assignment!)
– Add a new processor if needed to pass the RM test.
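The steps above can be sketched as follows (the task parameters are made up for illustration):

```python
def ll_bound(n):
    """Liu & Layland bound n(2^(1/n) - 1), used as the RM schedulability test."""
    return n * (2 ** (1.0 / n) - 1)

def rmff(tasks):
    """Rate-Monotonic First Fit, sketched: tasks are (C, T) pairs.
    Sort by period (RM priority order), then assign First Fit, opening a
    new processor when the Liu & Layland test would fail everywhere."""
    tasks = sorted(tasks, key=lambda ct: ct[1])   # increasing period
    processors = []                               # list of task lists
    for c, t in tasks:
        for assigned in processors:
            n = len(assigned) + 1
            if sum(ci / ti for ci, ti in assigned) + c / t <= ll_bound(n):
                assigned.append((c, t))
                break
        else:
            processors.append([(c, t)])           # open a new processor
    return processors

procs = rmff([(1, 4), (2, 5), (1, 10), (4, 20)])
print(len(procs))   # -> 2
```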
Partitioned scheduling
• Advantages:
– Most techniques for single-processor scheduling are also applicable here
• Partitioning of tasks can be automated
– by solving a bin-packing problem
• Disadvantages:
– Cannot exploit/share all unused processor time
Partitioned Scheduling to Minimize Communication Costs
• Communication between tasks that run on the same processor is usually significantly cheaper than between tasks on different processors
– Cij = communication cost between tasks i and j (e.g. amount of data to send)
• When communication costs are considered, the objective of task assignment is twofold:
– Find the minimum number of processors needed
– Minimize the communication costs for this number of processors
• The problem may be solved using optimization techniques such as integer/linear programming.
Example of Task Partitioning
(figure: a task graph partitioned across processors P1 and P2)
Semi-Partitioned scheduling
• High resource utilization
• High overhead (due to task migration)
Example: a (semi-)partitioned scheduling algorithm
• Sort all tasks in priority order: task 1 has the highest priority, task 7 the lowest
• Take the tasks one at a time and assign each to the processor (P1, P2 or P3) on which the assigned utilization is currently the lowest
(figure sequence: tasks are placed one by one on the least-utilized processor; when a remaining task no longer fits entirely on any single processor, its instance is split into parts that are assigned to different processors; this splitting is the “semi-partitioned” step)
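The walk-through above can be sketched as code. This is a hypothetical helper illustrating worst-fit assignment with splitting, not a specific published algorithm; the utilization values are invented for illustration:

```python
def semi_partition(utilizations, m, capacity=1.0):
    """Assign each task (in given priority order) to the processor with the
    lowest assigned utilization; if it does not fit entirely, split the
    remainder onto the next least-loaded processor(s).
    Returns {processor: [(task_id, utilization share), ...]}."""
    loads = [0.0] * m
    placement = {p: [] for p in range(m)}
    for task_id, u in enumerate(utilizations):
        remaining = u
        while remaining > 1e-9:
            p = min(range(m), key=lambda i: loads[i])    # least-loaded processor
            share = min(remaining, capacity - loads[p])  # what fits here
            if share <= 0:
                raise ValueError("task set does not fit on m processors")
            loads[p] += share
            placement[p].append((task_id, share))
            remaining -= share
    return placement

placement = semi_partition([0.7, 0.7, 0.7, 0.5], 3)
# Task 3 is split: one part stays on a processor, the rest migrates to another.
print([[tid for tid, _ in placement[p]] for p in range(3)])  # -> [[0, 3], [1, 3], [2]]
```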
Global scheduling
• All ready tasks are kept in a global queue
• When selected for execution, a task can be dispatched to any processor, even after being preempted
Algorithms for Global Scheduling
• EDF – unfortunately not optimal!
– No simple exact schedulability test known (only sufficient ones)
• Fixed-priority scheduling, e.g. RM
– Difficult to find the optimal priority order
– Difficult to check schedulability
• Any algorithm for single-processor scheduling may work, but schedulability analysis is non-trivial.
Example Priority Assignment
Schedulability Test
Note that U/m is close to 1/3 when m is large enough.
This means that, regardless of the number of tasks and the number of processors, the task set is schedulable if no more than about 33% of the total processing capacity is required. For traditional RM, this is not true.
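A small check of the bound behind this statement, assuming it is the RM-US(m/(3m−2)) test of Andersson, Baruah and Jonsson: a task set is schedulable on m processors if its total utilization is at most m²/(3m−2), so the per-processor share m/(3m−2) tends to 1/3 as m grows.

```python
def rm_us_total_bound(m):
    """Total-utilization bound m^2 / (3m - 2) for global RM-US(m/(3m-2))."""
    return m * m / (3 * m - 2)

def rm_us_schedulable(utilizations, m):
    """Sufficient (not necessary) utilization test for global RM-US."""
    return sum(utilizations) <= rm_us_total_bound(m)

for m in (2, 4, 16, 64):
    # The per-processor share m/(3m-2) approaches 1/3 as m grows:
    print(m, round(rm_us_total_bound(m) / m, 3))
```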
Priority Assignment
Poor Resource Utilization
Known Results
(figure: best known utilization bounds for multiprocessor scheduling, as a percentage of total processing capacity)
– Global scheduling, fixed priority: 38%
– Global scheduling, dynamic priority: 50%
– Partitioned scheduling, fixed priority: 50%
– Partitioned scheduling, dynamic priority: 50%
– Semi-partitioned scheduling, fixed priority: 65%
– Semi-partitioned scheduling, dynamic priority: 66%
(for comparison, Liu and Layland’s single-processor utilization bound is 69.3%)
Global Scheduling: + and -
• Advantages:
– Supported by most multiprocessor operating systems
• Windows NT, Solaris, Linux, ...
– Effective use of processing resources (if it works)
• Unused processor time can easily be reclaimed at run-time (e.g. mixing hard and soft RT tasks to optimize resource utilization)
• Disadvantages:
– Few results from single-processor scheduling can be reused
– No “optimal” algorithms known, except under idealized assumptions (Pfair scheduling)
– Poor resource utilization for hard timing constraints
• No more than 50% resource utilization can be guaranteed for hard RT tasks
– Suffers from scheduling anomalies
• Adding processors or reducing computation times and other parameters can actually decrease performance in some scenarios!
Underlying causes
– The ”root of all evil” in global scheduling (Liu, 1969):
“The simple fact that a task can use only one processor, even when several processors are free at the same time, adds a surprising amount of difficulty to the scheduling of multiple processors.”
– Dhall’s effect: with RM, DM and EDF, some low-utilization task sets can be unschedulable regardless of how many processors are used.
– Hard-to-find critical instant: a critical instant does not always occur when a task arrives at the same time as all its higher-priority tasks.
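Dhall’s effect can be made concrete with the classic construction (an illustrative instance, not taken from these slides): on m processors, take m light tasks with C = 2ε, T = 1 and one heavy task with C = 1, T = 1 + ε. Under global RM the light tasks have the shorter period, seize all m processors at time 0, and leave the heavy task only 1 − ε time units before its deadline, so it misses; yet the total utilization tends to 1 however many processors there are.

```python
def dhall_total_utilization(m, eps):
    """Utilization of the classic Dhall task set: m light tasks of
    utilization 2*eps each, plus one heavy task of utilization 1/(1+eps)."""
    return m * 2 * eps + 1.0 / (1.0 + eps)

# On 4 processors (total capacity 4.0) the set is unschedulable under
# global RM, although its total utilization approaches 1 as eps shrinks:
for eps in (0.1, 0.01, 0.001):
    print(round(dhall_total_utilization(4, eps), 3))  # -> 1.709, 1.07, 1.007
```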
How to avoid Dhall’s effect
• Problem: RM, DM and EDF only account for task periods!
– Actual computation demands are not accounted for.
• Solution: Dhall’s effect can easily be avoided by giving tasks with high utilization higher priority.
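A hedged sketch of that idea (the 0.5 threshold and the task parameters below are illustrative, not from the slides): tasks whose utilization exceeds the threshold get top priority, and the rest keep their RM order.

```python
def utilization_aware_priorities(tasks, threshold=0.5):
    """tasks: list of (C, T) pairs. Returns tasks in decreasing priority
    order: heavy tasks (utilization >= threshold) first, the rest in RM
    (shortest-period-first) order."""
    heavy = [t for t in tasks if t[0] / t[1] >= threshold]
    light = [t for t in tasks if t[0] / t[1] < threshold]
    return heavy + sorted(light, key=lambda ct: ct[1])

tasks = [(1, 10), (2, 10), (9, 10), (1, 5)]
# The heavy task (9, 10) jumps ahead of the short-period tasks:
print(utilization_aware_priorities(tasks))  # -> [(9, 10), (1, 5), (1, 10), (2, 10)]
```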
MP Scheduling Anomalies
• Adding processors, reducing computation times, or “improving” other parameters can actually decrease performance in some scenarios!
• Doubling the processor speed (or increasing the number of processors) may cause a deadline miss.
Anomalies under Resource constraints
• Task set J of five tasks on two processors.
• Tasks 2 and 4 share the same resource in exclusive mode.
• Static allocation: P1 runs (J1, J2) and P2 runs (J3, J4, J5).
• Reducing the computation time of task 1 will increase the overall schedule length!
(figure: two Gantt charts for P1 and P2 over time 0-22, with critical sections and blocking marked; with the reduced computation time of task 1, the changed blocking pattern on the shared resource makes the schedule finish later)