2009-11-27
Simplified Design Flow
(a picture from Ingo Sander)
Hardware architecture
So far, we have talked only about single-processor systems
– “Concurrency” implemented by “scheduling”
RT systems often consist of several processors
• Multiprocessor systems
– “Tightly coupled” processors connected by a “high-speed” interconnect, e.g. crossbar, bus, NoC (network on chip), etc.
– Single processor with multiple threads
• Distributed systems
– “Loosely coupled” processors connected by a “low-speed” network, e.g. CAN, Ethernet, Token Ring, etc.
Multiprocessor/Distributed Systems
-- Local area distributed system (our focus)
(figure: example architectures, from a complete multiprocessor system to a LAN-based distributed system)
OUTLINE
• Why multiprocessor? – energy, performance and predictability
• What are multiprocessor systems – OS etc.
• Designing RT systems on multiprocessors – task assignment
• Multiprocessor scheduling
– (semi-)partitioned scheduling
– global scheduling
Why multiprocessor systems?
“Tightly connected” processors by a “high-speed” interconnect e.g. cross-bar, bus, NoC etc.
(figure: “Hardware: Trends” – performance [log scale] vs. year; single-core performance has flattened, and further gains come from multicore, which requires parallel applications)
You must do “parallel processing” to get:
• Higher performance
– Single-core has reached the “limit” of implicit parallelism
– Frequency does not scale any more
• Lower power consumption: battery lifetime, cost of energy
– Increase the number of cores, decrease the frequency and voltage:
• Performance = Cores × F → 2·Cores × (F/2) = Cores × F
• Power = C × V² × F → 2 × C × (V/2)² × (F/2) = C × V² × F / 4
Keep the “same performance” using ¼ of the energy (doubling the cores) -- theoretically
• Experiments show that a 1% clock increase results in a 3% power increase.
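The arithmetic above can be checked directly. A minimal sketch, assuming the idealized dynamic-power model P = C·V²·F and perfectly parallel performance = cores·F (real chips also have static leakage and imperfect parallel speedup):

```python
# Idealized models from the slide: performance = cores * F, power = cores * C * V^2 * F.
# Assumes perfect parallel speedup and no static (leakage) power.

def performance(cores, f):
    return cores * f

def dynamic_power(cores, c, v, f):
    return cores * c * v ** 2 * f

base_perf = performance(1, 1.0)                 # one core at frequency F
base_power = dynamic_power(1, 1.0, 1.0, 1.0)    # power C * V^2 * F

# Double the cores, halve both frequency and supply voltage:
dual_perf = performance(2, 0.5)                 # 2 * (F/2) = F
dual_power = dynamic_power(2, 1.0, 0.5, 0.5)    # 2 * C * (V/2)^2 * (F/2) = C*V^2*F/4

print(dual_perf / base_perf)    # -> 1.0  : same throughput
print(dual_power / base_power)  # -> 0.25 : a quarter of the power
```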
CPU frequency vs Power consumption
1. Standard processor over-clocked 20%
2. Standard processor
3. Two standard processors each under-clocked 20%
What’s happening now?
• General-Purpose Computing (Symposium on High-Performance Chips, Hot Chips 21, Palo Alto, Aug 23-25, 2009)
– 4 cores in notebooks
– 12 cores in servers
• AMD 12-core Magny-Cours will consume less energy than previous generations with 6 cores
– 16 cores for IBM servers, Power 7
• Embedded Systems
– 4 cores in ARM11 MPCore embedded processors
What next?
• Manycores (hundreds of cores) predicted to be here in a few years – e.g. Ericsson
Typical Multicore Architecture
(figure: several CPU cores, each with a private L1 cache, sharing an on-chip L2 cache and the bandwidth to off-chip memory)
Multiprocessor models
• Identical multiprocessors:
– each processor has the same computing capacity
• Uniform multiprocessors:
– different processors have different computing capacities
• Heterogeneous multiprocessors:
– each (task, processor) pair may have a different computing capacity
Multiprocessor models
(figure: processors P1, P2 and P3 with equal computing capacity)
identical multiprocessors: each processor has the same computing capacity
Multiprocessor models
(figure: processors P1, P2 and P3 with speeds 1, 2 and 3; task T1 needs time x on P1, x/2 on P2 and x/3 on P3, and likewise task T2 needs y, y/2 and y/3)
uniform multiprocessors: different processors have different computing capacities
Multiprocessor models
(figure: processors P1, P2 and P3 where task T1 needs x, x/2 and x/3, but task T2 needs y, 1.5y and y; the speedup depends on the (task, processor) pair)
heterogeneous multiprocessors: each (task, processor) pair may have a different computing capacity
Single processor vs. multiprocessor OS
Symmetric Multiprocessing (SMP)
Simplified Design Flow of RT Systems
(a picture from Ingo Sander)
Task assignment
Task Assignment
• In the design flow:
– First, the application is partitioned into tasks or task graphs.
– At some stage, the execution times, communication costs, and data and control dependencies of all the tasks become known.
• Task assignment determines
– how many processors are needed (a bin-packing problem)
– on which processor each task executes
• This is a very complex problem (NP-hard)
– Often done off-line
– Often heuristics work well
Task Assignment
• The task models used in task assignment vary in complexity, depending on what is considered or ignored:
– Communication costs
– Data and control dependencies
– Resource requirements, e.g. WCET, memory, etc.
• In multicore platforms with shared caches, communication costs for tasks on the same chip may be very small
• It is often meaningful to consider the execution-time requirement (WCET) and ignore communication in an early design phase
Multiprocessor scheduling
• "Given a set J of jobs where job ji has length li, and a number of processors m, what is the minimum possible time required to schedule all jobs in J on m processors such that none overlap?" – Wikipedia
– That is, design a schedule such that the response time of the last task is minimized
(Alternatively: given M processors and N tasks, find a mapping from tasks to processors such that all the tasks are schedulable)
• The problem is NP-complete
• It is also known as the “load balancing problem”
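Since the problem is NP-complete, greedy heuristics are used in practice. A sketch of the classic Longest-Processing-Time-first heuristic (a standard textbook method, not taken from these slides): sort the jobs by decreasing length and always give the next job to the least-loaded processor.

```python
import heapq

def lpt_makespan(job_lengths, m):
    """Longest Processing Time first: greedy heuristic for minimizing the
    makespan of independent jobs on m identical processors. Its makespan is
    provably within a factor 4/3 - 1/(3m) of the optimum (Graham, 1969)."""
    heap = [(0.0, p) for p in range(m)]   # (current load, processor id)
    heapq.heapify(heap)
    for length in sorted(job_lengths, reverse=True):
        load, p = heapq.heappop(heap)     # least-loaded processor so far
        heapq.heappush(heap, (load + length, p))
    return max(load for load, _ in heap)

# Greedy result is 10; the optimum is 9 ({5, 4} vs {3, 3, 3}):
print(lpt_makespan([5, 4, 3, 3, 3], 2))   # -> 10.0
```

The example also shows why this is only a heuristic: the greedy schedule is close to, but not always equal to, the optimum.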
Multiprocessor scheduling of “periodic/sporadic real-time tasks”
• Partitioned scheduling
– Static allocation – task assignment
• Each task may only execute on a fixed processor
• No task migration
• Semi-partitioned scheduling
– Static allocation – task assignment
• Each instance (or part of an instance) of a task is assigned to a fixed processor
• A task instance, or part of it, may migrate
• Global scheduling
– Dynamic allocation – (dynamic) task assignment
• Any instance of any task may execute on any processor
• Task migration allowed
Multiprocessor Scheduling
(figures: under global scheduling, one shared waiting queue feeds all CPUs; under partitioned scheduling, each CPU has its own queue and task instances never move; under semi-partitioned scheduling, most tasks are fixed to a CPU but some instances are split across CPUs)
Partitioned scheduling
• Two steps:
– Determine a mapping of tasks to processors
– Perform run-time scheduling
• Example: Partitioned with EDF
– Assign tasks to the processors such that no processor’s capacity is exceeded (utilization bounded by 1.0)
– Schedule each processor using EDF
Bin-packing algorithms
• The problem concerns packing objects of varying sizes into boxes (“bins”) with the objective of minimizing the number of bins used.
– Heuristic solutions: Next Fit and First Fit
• Application to multiprocessor systems:
– Bins are represented by processors, objects by tasks.
– The decision whether a processor is “full” is derived from a utilization-based schedulability test.
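A minimal sketch of the First Fit variant for partitioned EDF, using utilization as the schedulability test (a processor is “full” once its total utilization would exceed 1.0; the utilization values are invented for illustration):

```python
def first_fit_edf(utilizations):
    """First Fit bin packing for partitioned EDF: place each task on the
    first processor whose total utilization stays at or below 1.0,
    opening a new processor when no existing one fits."""
    processors = []          # one utilization sum per processor
    assignment = []          # processor index chosen for each task
    for u in utilizations:
        for i, load in enumerate(processors):
            if load + u <= 1.0:
                processors[i] = load + u
                assignment.append(i)
                break
        else:                # no existing processor fits: open a new one
            processors.append(u)
            assignment.append(len(processors) - 1)
    return processors, assignment

loads, where = first_fit_edf([0.6, 0.5, 0.3, 0.3, 0.1])
print(len(loads))   # -> 2 : two processors suffice
print(where)        # -> [0, 1, 0, 1, 0]
```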
Rate-Monotonic First Fit (RMFF) [Dhall and Liu, 1978]
• First, sort the tasks in order of increasing periods.
• Task assignment
– All tasks are assigned in First Fit manner, starting from the task with the highest priority
– A task can be assigned to a processor if all tasks assigned to that processor remain RM-schedulable, i.e.
• the total utilization of the tasks assigned to that processor is bounded by n(2^(1/n) − 1), where n is the number of tasks assigned.
(One may also use the precise test to get a better assignment!)
– Add a new processor if needed to pass the RM test.
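The steps above can be sketched as follows (the task parameters are made up for illustration):

```python
def ll_bound(n):
    """Liu & Layland bound n(2^(1/n) - 1), used as the RM schedulability test."""
    return n * (2 ** (1.0 / n) - 1)

def rmff(tasks):
    """Rate-Monotonic First Fit, sketched: tasks are (C, T) pairs.
    Sort by period (RM priority order), then assign First Fit, opening a
    new processor when the Liu & Layland test would fail everywhere."""
    tasks = sorted(tasks, key=lambda ct: ct[1])   # increasing period
    processors = []                               # list of task lists
    for c, t in tasks:
        for assigned in processors:
            n = len(assigned) + 1
            if sum(ci / ti for ci, ti in assigned) + c / t <= ll_bound(n):
                assigned.append((c, t))
                break
        else:
            processors.append([(c, t)])           # open a new processor
    return processors

procs = rmff([(1, 4), (2, 5), (1, 10), (4, 20)])
print(len(procs))   # -> 2
```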
Partitioned scheduling
• Advantages:
– Most techniques for single-processor scheduling are also applicable here
• Partitioning of tasks can be automated
– by solving a bin-packing problem
• Disadvantages:
– Cannot exploit/share all unused processor time
Partitioned Scheduling to Minimize Communication Costs
• Communication between tasks that run on the same processor is usually significantly cheaper than between tasks on different processors
– Cij = communication cost between tasks i and j (e.g. amount of data to send)
• When communication costs are considered, the objective of task assignment is twofold:
– Find the minimum number of processors needed
– Minimize the communication costs for this number of processors
• The problem may be solved using optimization techniques such as integer/linear programming.
Example of Task Partitioning
(figure: a task graph partitioned across processors P1 and P2)
Semi-Partitioned scheduling
• High resource utilization
• High overhead (due to task migration)
Example: a (semi-)partitioned scheduling algorithm
• Sort all tasks in priority order: task 1 has the highest priority, task 7 the lowest
• Take the tasks one at a time and assign each to the processor (P1, P2 or P3) on which the assigned utilization is currently the lowest
(figure sequence: tasks are placed one by one on the least-utilized processor; when a remaining task no longer fits entirely on any single processor, its instance is split into parts that are assigned to different processors; this splitting is the “semi-partitioned” step)
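The walk-through above can be sketched as code. This is a hypothetical helper illustrating worst-fit assignment with splitting, not a specific published algorithm; the utilization values are invented for illustration:

```python
def semi_partition(utilizations, m, capacity=1.0):
    """Assign each task (in given priority order) to the processor with the
    lowest assigned utilization; if it does not fit entirely, split the
    remainder onto the next least-loaded processor(s).
    Returns {processor: [(task_id, utilization share), ...]}."""
    loads = [0.0] * m
    placement = {p: [] for p in range(m)}
    for task_id, u in enumerate(utilizations):
        remaining = u
        while remaining > 1e-9:
            p = min(range(m), key=lambda i: loads[i])    # least-loaded processor
            share = min(remaining, capacity - loads[p])  # what fits here
            if share <= 0:
                raise ValueError("task set does not fit on m processors")
            loads[p] += share
            placement[p].append((task_id, share))
            remaining -= share
    return placement

placement = semi_partition([0.7, 0.7, 0.7, 0.5], 3)
# Task 3 is split: one part stays on a processor, the rest migrates to another.
print([[tid for tid, _ in placement[p]] for p in range(3)])  # -> [[0, 3], [1, 3], [2]]
```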
Global scheduling
• All ready tasks are kept in a global queue
• When selected for execution, a task can be dispatched to any processor, even after being preempted
Algorithms for Global Scheduling
• EDF – unfortunately not optimal!
– No simple exact schedulability test known (only sufficient ones)
• Fixed-priority scheduling, e.g. RM
– Difficult to find the optimal priority order
– Difficult to check schedulability
• Any algorithm for single-processor scheduling may work, but schedulability analysis is non-trivial.
Example Priority Assignment
Schedulability Test
Note that U/m is close to 1/3 when m is large enough.
This means that, regardless of the number of tasks and the number of processors, the task set is schedulable if no more than about 33% of the total processing capacity is required. For traditional RM, this is not true.
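A small check of the bound behind this statement, assuming it is the RM-US(m/(3m−2)) test of Andersson, Baruah and Jonsson: a task set is schedulable on m processors if its total utilization is at most m²/(3m−2), so the per-processor share m/(3m−2) tends to 1/3 as m grows.

```python
def rm_us_total_bound(m):
    """Total-utilization bound m^2 / (3m - 2) for global RM-US(m/(3m-2))."""
    return m * m / (3 * m - 2)

def rm_us_schedulable(utilizations, m):
    """Sufficient (not necessary) utilization test for global RM-US."""
    return sum(utilizations) <= rm_us_total_bound(m)

for m in (2, 4, 16, 64):
    # The per-processor share m/(3m-2) approaches 1/3 as m grows:
    print(m, round(rm_us_total_bound(m) / m, 3))
```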
Priority Assignment
Poor Resource Utilization
Known Results
(figure: best known utilization bounds for multiprocessor scheduling, as a percentage of total processing capacity)
– Global scheduling, fixed priority: 38%
– Global scheduling, dynamic priority: 50%
– Partitioned scheduling, fixed priority: 50%
– Partitioned scheduling, dynamic priority: 50%
– Semi-partitioned scheduling, fixed priority: 65%
– Semi-partitioned scheduling, dynamic priority: 66%
(for comparison, Liu and Layland’s single-processor utilization bound is 69.3%)
Global Scheduling: + and -
• Advantages:
– Supported by most multiprocessor operating systems
• Windows NT, Solaris, Linux, ...
– Effective use of processing resources (if it works)
• Unused processor time can easily be reclaimed at run-time (e.g. mixing hard and soft RT tasks to optimize resource utilization)
• Disadvantages:
– Few results from single-processor scheduling can be reused
– No “optimal” algorithms known, except under idealized assumptions (Pfair scheduling)
– Poor resource utilization for hard timing constraints
• No more than 50% resource utilization can be guaranteed for hard RT tasks
– Suffers from scheduling anomalies
• Adding processors or reducing computation times and other parameters can actually decrease performance in some scenarios!
Underlying causes
– The ”root of all evil” in global scheduling (Liu, 1969):
“The simple fact that a task can use only one processor, even when several processors are free at the same time, adds a surprising amount of difficulty to the scheduling of multiple processors.”
– Dhall’s effect: with RM, DM and EDF, some low-utilization task sets can be unschedulable regardless of how many processors are used.
– Hard-to-find critical instant: a critical instant does not always occur when a task arrives at the same time as all its higher-priority tasks.
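Dhall’s effect can be made concrete with the classic construction (an illustrative instance, not taken from these slides): on m processors, take m light tasks with C = 2ε, T = 1 and one heavy task with C = 1, T = 1 + ε. Under global RM the light tasks have the shorter period, seize all m processors at time 0, and leave the heavy task only 1 − ε time units before its deadline, so it misses; yet the total utilization tends to 1 however many processors there are.

```python
def dhall_total_utilization(m, eps):
    """Utilization of the classic Dhall task set: m light tasks of
    utilization 2*eps each, plus one heavy task of utilization 1/(1+eps)."""
    return m * 2 * eps + 1.0 / (1.0 + eps)

# On 4 processors (total capacity 4.0) the set is unschedulable under
# global RM, although its total utilization approaches 1 as eps shrinks:
for eps in (0.1, 0.01, 0.001):
    print(round(dhall_total_utilization(4, eps), 3))  # -> 1.709, 1.07, 1.007
```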
How to avoid Dhall’s effect
• Problem: RM, DM and EDF only account for task periods!
– Actual computation demands are not accounted for.
• Solution: Dhall’s effect can easily be avoided by giving tasks with high utilization higher priority.
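A hedged sketch of that idea (the 0.5 threshold and the task parameters below are illustrative, not from the slides): tasks whose utilization exceeds the threshold get top priority, and the rest keep their RM order.

```python
def utilization_aware_priorities(tasks, threshold=0.5):
    """tasks: list of (C, T) pairs. Returns tasks in decreasing priority
    order: heavy tasks (utilization >= threshold) first, the rest in RM
    (shortest-period-first) order."""
    heavy = [t for t in tasks if t[0] / t[1] >= threshold]
    light = [t for t in tasks if t[0] / t[1] < threshold]
    return heavy + sorted(light, key=lambda ct: ct[1])

tasks = [(1, 10), (2, 10), (9, 10), (1, 5)]
# The heavy task (9, 10) jumps ahead of the short-period tasks:
print(utilization_aware_priorities(tasks))  # -> [(9, 10), (1, 5), (1, 10), (2, 10)]
```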
MP Scheduling Anomalies
• Adding processors, reducing computation times, or “improving” other parameters can actually decrease performance in some scenarios!
• Doubling the processor speed (or increasing the number of processors) may cause a deadline miss.
Anomalies under Resource constraints
• Task set J of five tasks on two processors.
• Tasks 2 and 4 share the same resource in exclusive mode.
• Static allocation: P1 runs (J1, J2) and P2 runs (J3, J4, J5).
• Reducing the computation time of task 1 will increase the overall schedule length!
(figure: two Gantt charts for P1 and P2 over time 0-22, with critical sections and blocking marked; with the reduced computation time of task 1, the changed blocking pattern on the shared resource makes the schedule finish later)