CPU Scheduling
CS 519: Operating System Theory
Computer Science, Rutgers University
Instructor: Thu D. Nguyen
TA: Xiaoyan Li
Spring 2002
Computer Science, Rutgers CS 519: Operating System Theory
What and Why?
What is processor scheduling?
Why? At first, to share an expensive resource: multiprogramming
Now, to perform concurrent tasks, because the processor is so powerful
Future looks like past + now
Rent-a-computer approach – large data/processing centers use multiprogramming to maximize resource utilization
Systems still powerful enough for each user to run multiple concurrent tasks
Assumptions
Pool of jobs contending for the CPU. The CPU is a scarce resource
Jobs are independent and compete for resources (this assumption is not always used)
Scheduler mediates between jobs to optimize some performance criteria
Types of Scheduling
We’re mostly concerned with short-term scheduling
What Do We Optimize?
System-oriented metrics:
Processor utilization: percentage of time the processor is busy
Throughput: number of processes completed per unit of time
User-oriented metrics:
Turnaround time: interval of time between submission and termination (including any waiting time). Appropriate for batch jobs
Response time: for interactive jobs, time from the submission of a request until the response begins to be received
Deadlines: when process completion deadlines are specified, the percentage of deadlines met should be maximized
Design Space
Two dimensions:
Selection function
Which of the ready jobs should be run next?
Preemption
Preemptive: currently running job may be interrupted and moved to Ready state
Non-preemptive: once a process is in Running state, it continues to execute until it terminates or it blocks for I/O or system service
Job Behavior
I/O-bound jobs: jobs that perform lots of I/O
Tend to have short CPU bursts
CPU-bound jobs: jobs that perform very little I/O
Tend to have very long CPU bursts
Distribution tends to be hyper-exponential
Very large number of very short CPU bursts
A small number of very long CPU bursts
[Figure: job execution alternating between CPU bursts and disk I/O]
Histogram of CPU-burst Times
Example Job Set
Process   Arrival Time   Service Time
   1           0              3
   2           2              6
   3           4              4
   4           6              5
   5           8              2
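The following slides compare scheduling policies on this job set. As a quick illustration of the turnaround-time metric, here is a small sketch (Python, not part of the original slides) that runs the job set above under FCFS:

```python
# Turnaround times for the example job set under FCFS (a sketch;
# the process/arrival/service values come from the table above).
jobs = [(1, 0, 3), (2, 2, 6), (3, 4, 4), (4, 6, 5), (5, 8, 2)]

def fcfs(jobs):
    """Run jobs in arrival order; return {pid: (finish_time, turnaround)}."""
    clock, result = 0, {}
    for pid, arrival, service in sorted(jobs, key=lambda j: j[1]):
        clock = max(clock, arrival) + service   # idle until arrival if needed
        result[pid] = (clock, clock - arrival)  # turnaround = finish - arrival
    return result

sched = fcfs(jobs)
# e.g., process 2 finishes at t = 9 with turnaround 9 - 2 = 7
```

The average turnaround here is (3 + 7 + 9 + 12 + 12) / 5 = 8.6; note how the short, late-arriving process 5 waits behind every earlier job.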
Behavior of Scheduling Policies
Multilevel Queue
Ready queue is partitioned into separate queues:
foreground (interactive)
background (batch)
Each queue has its own scheduling algorithm:
foreground – RR
background – FCFS
Scheduling must also be done between the queues:
Fixed priority scheduling, i.e., serve all from foreground, then from background. Possibility of starvation.
Time slice: each queue gets a certain amount of CPU time which it can schedule among its processes, e.g., 80% to foreground in RR
20% to background in FCFS
Multilevel Queue Scheduling
Multilevel Feedback Queue
A process can move between the various queues; aging can be implemented this way.
Multilevel-feedback-queue scheduler defined by the following parameters:
number of queues
scheduling algorithms for each queue
method used to determine when to upgrade a process
method used to determine when to demote a process
method used to determine which queue a process will enter when that process needs service
Multilevel Feedback Queues
Example of Multilevel Feedback Queue
Three queues: Q0 – time quantum 8 milliseconds
Q1 – time quantum 16 milliseconds
Q2 – FCFS
Scheduling:
A new job enters queue Q0, which is served FCFS. When it gains the CPU, the job receives 8 milliseconds. If it does not finish within 8 milliseconds, it is moved to queue Q1.
At Q1 the job is again served FCFS and receives 16 additional milliseconds. If it still does not complete, it is preempted and moved to queue Q2.
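This three-queue example can be sketched in a few lines of Python (a minimal sketch: the job names and burst lengths below are hypothetical, and new arrivals during execution are ignored for brevity):

```python
from collections import deque

def mlfq(bursts, quanta=(8, 16)):
    """Three queues: Q0 (8 ms quantum), Q1 (16 ms), Q2 (FCFS, run to end).
    bursts: {name: total CPU ms needed}. Returns a (name, queue, ms) trace."""
    queues = [deque(bursts.items()), deque(), deque()]
    trace = []
    while any(queues):
        level = next(i for i, q in enumerate(queues) if q)  # highest non-empty
        name, left = queues[level].popleft()
        slice_ = left if level == 2 else min(left, quanta[level])
        trace.append((name, level, slice_))
        left -= slice_
        if left > 0:                    # quantum expired: demote one level
            queues[min(level + 1, 2)].append((name, left))
    return trace

trace = mlfq({"A": 30, "B": 5})
# A runs 8 ms in Q0, then 16 ms in Q1, then its last 6 ms in Q2;
# B finishes within its first Q0 quantum
```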
Traditional UNIX Scheduling
Multilevel feedback queues
128 priorities possible (0-127)
1 Round Robin queue per priority
At every scheduling event, the scheduler picks the non-empty queue with the lowest priority value (which in UNIX means the highest priority) and runs its jobs round-robin
Scheduling events: Clock interrupt
Process does a system call
Process gives up the CPU, e.g., to do I/O
Traditional UNIX Scheduling
All processes assigned a baseline priority based on the type and current execution status:
swapper 0
waiting for disk 20
waiting for lock 35
user-mode execution 50
At scheduling events, all processes’ priorities are adjusted based on the amount of CPU used, the current load, and how long the process has been waiting.
Most processes are not running at any given moment, so many computational shortcuts are used when recomputing priorities.
UNIX Priority Calculation
Every 4 clock ticks, a process’s priority is updated:
The utilization is incremented every clock tick by 1.
The niceFactor allows some control of job priority. It can be set from –20 to 20.
Jobs using a lot of CPU increase the priority value. Interactive jobs not using much CPU will return to the baseline.
P = BASELINE + utilization/4 + 2 × niceFactor
UNIX Priority Calculation
Without decay, very long-running CPU-bound jobs would get “stuck” at the highest priority value (i.e., the lowest priority).
A decay function is therefore used to weight utilization toward recent CPU usage.
A process’s utilization at time t is decayed every second:
u_t = (2 × load) / (2 × load + 1) × u_{t-1} + niceFactor
The system-wide load is the average number of runnable jobs during the last second.
UNIX Priority Decay
Assume one job on the CPU, so load = 1 and the decay factor is 2/(2 × 1 + 1) = 2/3; assume niceFactor is 0.
Compute utilization over successive seconds, starting from u_0:
+1 second: u_1 = (2/3) u_0
+2 seconds: u_2 = (2/3) u_1 = (2/3)^2 u_0
+n seconds: u_n = (2/3) u_{n-1} = (2/3)^n u_0
Utilization thus decays geometrically, so CPU usage from the distant past is eventually forgotten.
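The priority and decay calculations can be combined into a small sketch. This assumes the 4.3BSD-style formulas P = BASELINE + utilization/4 + 2 × niceFactor and u ← (2 × load)/(2 × load + 1) × u + niceFactor; the exact constants vary across UNIX versions.

```python
BASELINE = 50   # user-mode execution baseline from the table above

def priority(utilization, nice_factor):
    """Recomputed every 4 clock ticks; a higher value means lower priority."""
    return BASELINE + utilization // 4 + 2 * nice_factor

def decay(utilization, load, nice_factor=0):
    """Applied once per second so old CPU usage is gradually forgotten."""
    return (2 * load) / (2 * load + 1) * utilization + nice_factor

u = 100.0
for _ in range(3):          # with load = 1 the decay factor is 2/3
    u = decay(u, load=1)
# u is now 100 * (2/3)**3, about 29.6
```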
Scheduling Algorithms
FIFO is simple but leads to poor average response times. Short processes are delayed by long processes that arrive before them
RR eliminates this problem, but favors CPU-bound jobs, which have longer CPU bursts than I/O-bound jobs
SJN, SRT, and HRRN alleviate the problem with FIFO, but require information on the length of each process. This information is not always available (although it can sometimes be approximated based on past history or user input)
Feedback is a way of alleviating the problem with FIFO without information on process length
It’s a Changing World
The assumption of a bi-modal (I/O-bound vs. CPU-bound) workload no longer holds
Interactive continuous media applications are sometimes processor-bound but require good response times
The new computing model requires more flexibility:
How to match priorities of cooperating jobs, such as client/server jobs?
How to balance execution between multiple threads of a single process?
Lottery Scheduling
Randomized resource allocation mechanism
Resource rights are represented by lottery tickets
Scheduling proceeds in lottery rounds
In each round, the winning ticket (and therefore the winner) is chosen at random
Your chance of winning depends directly on the number of tickets you hold:
P[winning] = t/T, where t = your number of tickets and T = total number of tickets
Lottery Scheduling
After n rounds, your expected number of wins is E[win] = n × P[winning]
The expected number of lotteries a client must wait before its first win is
E[wait] = 1/P[winning]
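Both expectations follow from the geometric distribution; a quick Monte Carlo sanity check (a sketch, not from the original material):

```python
import random

def lotteries_until_win(p, rng):
    """Number of lotteries up to and including the first win."""
    n = 1
    while rng.random() >= p:    # lose this round with probability 1 - p
        n += 1
    return n

rng = random.Random(42)
p = 0.25
mean = sum(lotteries_until_win(p, rng) for _ in range(20000)) / 20000
# mean comes out close to 1/p = 4
```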
Lottery scheduling implements proportional-share resource management
Ticket currencies allow isolation between users, processes, and threads
OK, so how do we actually schedule the processor using lottery scheduling?
Implementation
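The original slide’s implementation figure is not reproduced here, but the core mechanism can be sketched as: draw a winning ticket uniformly at random, then walk the client list to find its holder (a minimal sketch; client names below are hypothetical):

```python
import random

def hold_lottery(clients, rng=random):
    """clients: list of (name, tickets) pairs. Returns the winner's name."""
    total = sum(tickets for _, tickets in clients)
    winner = rng.randrange(total)        # winning ticket number in [0, total)
    for name, tickets in clients:
        if winner < tickets:             # ticket falls in this client's range
            return name
        winner -= tickets

clients = [("A", 75), ("B", 25)]
# A holds 75 of 100 tickets, so P[A wins] = 0.75 each round
```

Walking the list is linear in the number of clients; more efficient lookup structures are possible for large client sets.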
Performance
Allocated and observed execution ratios between two tasks running the Dhrystone benchmark. With the exception of the 10:1 allocation ratio, all observed ratios are close to the allocations.
Short-term Allocation Ratio
Isolation
Five tasks running the Dhrystone benchmark. Let amount.currency denote a ticket allocation of amount denominated in currency. Tasks A1 and A2 have allocations 100.A and 200.A, respectively. Tasks B1 and B2 have allocations 100.B and 200.B, respectively. Halfway through the experiment, B3 is started with allocation 300.B. This inflates the number of tickets in B from 300 to 600. There is no effect on tasks in currency A or on the aggregate iteration ratio of A tasks to B tasks. Tasks B1 and B2 slow to half their original rates, corresponding to the factor-of-2 inflation caused by B3.
Borrowed-Virtual-Time (BVT) Scheduling
Current scheduling in general purpose systems does not support rapid dispatch of latency-sensitive applications
Examples include continuous media applications such as teleconferencing, playing movies, voice-over-IP, etc.
What’s the problem with the traditional Unix scheduler?
The beauty of BVT is its simplicity.
Corollary: not that much to say
Tricky part is figuring out the appropriate parameters
½ of the paper is on this (which I’m going to skip)
BVT Scheduling: Basic Idea
Scheduling is done based on virtual time
Each thread has
EVT (effective virtual time)
AVT (actual virtual time)
W (warp factor)
warpBack (whether warp is on or not)
EVT of thread is computed as
Threads accumulate virtual time as they run
Thread with earliest EVT is scheduled next
E = A - (warpBack ? W : 0)
BVT Scheduling: Details
Can only switch every C time units to prevent thrashing
Threads can accumulate virtual time at different rates
Allow for weighted fair sharing of CPU
To make sure that latency-sensitive threads are scheduled right away, give these threads high warp values
There are limits on how much and for how long a thread can warp, to prevent abuse
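The dispatch rule can be sketched as follows (a minimal sketch: field names are mine, and the weighted accumulation rates and context-switch allowance C are omitted):

```python
# Each thread tracks actual virtual time A, warp W, and a warpBack flag;
# the scheduler runs the thread with the earliest effective virtual time.
class Thread:
    def __init__(self, name, A=0.0, W=0.0, warp_back=False):
        self.name, self.A, self.W, self.warp_back = name, A, W, warp_back

    def evt(self):
        """E = A - (W if warped else 0): warping moves a thread earlier."""
        return self.A - (self.W if self.warp_back else 0.0)

def pick_next(threads):
    return min(threads, key=Thread.evt)

threads = [Thread("batch", A=10.0),
           Thread("video", A=12.0, W=5.0, warp_back=True)]
# video's EVT = 12 - 5 = 7 beats batch's 10, so the warped
# latency-sensitive thread is dispatched first
```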
BVT Scheduling: Performance
BVT vs. Lottery
How do the two compare?
Parallel Processor Scheduling
Simulating Ocean Currents
Model as two-dimensional grids
Discretize in space and time
finer spatial and temporal resolution => greater accuracy
Many different computations per time step
set up and solve equations
Concurrency across and within grid computations
[Figure: (a) cross sections; (b) spatial discretization of a cross section]
Case Study 2: Simulating Galaxy Evolution
Simulate interactions of many stars evolving over time
Computing forces is expensive: O(n^2) brute-force approach
Hierarchical methods take advantage of the force law: F = G m1 m2 / r^2
[Figure: a star on which forces are being computed; a star too close to approximate; a small group far enough away to approximate to its center of mass; a large group far enough away to approximate]
Case Study 2: Barnes-Hut
Many time steps, plenty of concurrency across stars within each
Locality goal: particles close together in space should be on the same processor
Difficulties: Nonuniform, dynamically changing
[Figure: the spatial domain and its quad-tree decomposition]
Case Study 3: Rendering Scenes by Ray Tracing
Shoot rays into scene through pixels in projection plane
Result is color for pixel
Rays shot through pixels in projection plane are called primary rays
Reflect and refract when they hit objects
Recursive process generates ray tree per primary ray
Tradeoffs between execution time and image quality
[Figure: viewpoint, projection plane, and 3D scene, showing a ray from the viewpoint to the upper-right corner pixel and a dynamically generated ray]
Partitioning
Need dynamic assignment
Use contiguous blocks to exploit spatial coherence among neighboring rays, plus tiles for task stealing
[Figure: a block, the unit of assignment; a tile, the unit of decomposition and stealing]
Sample Speedups
Coscheduling (Gang)
Cooperating processes may interact frequently. What problem does this lead to?
Fine-grained parallel applications have a process working set
Two things needed:
Identify the process working set
Coschedule them
Assumption: explicitly identified process working set
Some good recent work has shown that it may be possible to dynamically identify process working set
Coscheduling
What is coscheduling?
Coordinating across nodes to make sure that processes belonging to the same process working set are scheduled simultaneously
How might we do this?
Impact of OS Scheduling Policies and Synchronization on Performance
Consider performance for a set of applications under:
Feedback priority scheduling
Spinning
Blocking
Spin-and-block
Block-and-hand-off
Block-and-affinity
Gang scheduling (time-sharing coscheduling)
Process control (space-sharing coscheduling)
Applications
“Normal” Scheduling with Spinning
“Normal” Scheduling with Blocking Locks
Gang Scheduling
Process Control (Space-Sharing)
Multiprocessor Scheduling
Load sharing: poor locality; poor synchronization behavior; simple; good processor utilization. Affinity or per-processor queues can improve locality.
Gang scheduling: central control; fragmentation – unnecessary processor idle times (e.g., two applications with P/2+1 threads each); good synchronization behavior; if careful, good locality.
Hardware partitions: poor utilization for I/O-intensive applications; fragmentation – unnecessary processor idle times when the leftover partitions are small; good locality and synchronization behavior.