CSE 160/Berman Mapping and Scheduling W+A: Chapter 4


Page 1: CSE 160/Berman Mapping and Scheduling W+A: Chapter 4

CSE 160/Berman

Mapping and Scheduling

W+A: Chapter 4

Page 2

Outline

• Mapping and Scheduling
• Static Mapping Strategies
• Dynamic Mapping Strategies
• Scheduling

Page 3

Mapping and Scheduling Models

• Basic Models:
  – Program model is a task graph with dependencies
  – Platform model is a set of processors with an interconnection network

Page 4

Mapping and Scheduling

• Mapping and scheduling involve the following activities:
  – Select a set of resources on which to schedule the task(s) of the application.
  – Assign application task(s) to compute resources.
  – Distribute data or co-locate data and computation.
  – Order tasks on compute resources.
  – Order communication between tasks.

Page 5

Mapping and Scheduling Terminology

1. Select a set of resources on which to schedule the task(s) of the application.

2. Assign application task(s) to compute resources.

3. Distribute data or co-locate data and computation.

4. Order tasks on compute resources.

5. Order communication between tasks.

• 1 = resource selection
• 1–3: generally termed mapping
• 4–5: generally termed scheduling
• For many researchers, scheduling is also used to describe activities 1–5.
• Mapping is an assignment of tasks in space
• Scheduling focuses on ordering in time

Page 6

Goals

• Want the mapping and scheduling algorithms and models to promote the assignment/ordering with the smallest execution time

• Accuracy vs. Ranking

[Figure: two "Model" vs. "Real Stuff" diagrams showing candidate mappings A and B, their real counterparts A' and B', and the optimum]

Page 7

What is the best mapping?

[Figure: a task graph with task weights 3, 4, 17, and 2 and communication costs 1, 1, 1, 2, 2, shown with two candidate mappings onto processors P1 and P2]

Page 8

Static and Dynamic Mapping Strategies

• Static methods generate the partitioning prior to execution

• Static mapping strategies work well when we can reasonably predict the time to perform application tasks during execution

• When it is not easy to predict task execution time, dynamic strategies may be more performance-efficient

• Dynamic methods generate the partitioning during execution
  – For example, workqueue and master/slave (M/S) are dynamic methods
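
The difference between the two strategies can be seen in a small simulation (a toy sketch, not from the book; the function names are mine): a dynamic work queue hands each task to the next idle processor, while a static block partition fixes the assignment up front.

```python
import heapq

def workqueue_makespan(task_times, num_procs):
    """Dynamic work-queue mapping: each processor grabs the next task
    from a shared queue as soon as it becomes idle. Returns the overall
    finish time (makespan)."""
    # Min-heap of (time_when_free, processor_id)
    procs = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(procs)
    for t in task_times:            # tasks handed out in queue order
        free_at, p = heapq.heappop(procs)
        heapq.heappush(procs, (free_at + t, p))
    return max(free_at for free_at, _ in procs)

def static_block_makespan(task_times, num_procs):
    """Static mapping: tasks pre-assigned to processors in contiguous blocks."""
    block = (len(task_times) + num_procs - 1) // num_procs
    return max(sum(task_times[i:i + block])
               for i in range(0, len(task_times), block))

# One unpredictably long task: the dynamic strategy absorbs it,
# the static blocks do not.
tasks = [10, 1, 1, 1, 1, 1, 1, 1]
dynamic = workqueue_makespan(tasks, 2)       # 10
static = static_block_makespan(tasks, 2)     # 13
```

When task times are predictable the static partition can be just as good with no queue overhead; the work queue wins when they are not.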

Page 9

Static Mapping

• Static mapping can involve
  – partitioning of tasks (functional decomposition)
    • Sieve of Eratosthenes is an example
  – partitioning of data (data decomposition)
    • Fixed decomposition of Mandelbrot (k blocks per processor) is an example of this

[Figure: sieve pipeline with one process per prime: P2 P3 P5 P7 P11 P13 P17]

Page 10

Load Balancing

• Load Balancing = strategy to partition an application so that
  – All processors perform an equivalent amount of work
  – (All processors finish in an equivalent amount of time. This is really time-balancing)
• It may take different amounts of time to do equivalent amounts of work
• Load balancing is an important technique in parallel processing
  – Many ways to achieve a balanced load
  – Both dynamic and static load balancing techniques exist

Page 11

Static and Dynamic Mapping for the N-body Problem

• The N-body problem: Given n bodies in 3D space, determine the gravitational force F between them at any given point in time:

  F = G * m_a * m_b / r^2

  where G is the gravitational constant, r is the distance between the bodies, and m_a and m_b are the masses of the bodies.

Page 12

Exact N-body serial pseudo-code

• At each time t, velocity v and position x of body i may change
• The real problem is a bit more complicated than this. See 4.2.3 in the book.

  for (t = 0; t < tmax; t++) {
    for (i = 0; i < N; i++) {
      F = Force_routine(i);
      v[i]_new = v[i] + F*dt;
      x[i]_new = x[i] + v[i]_new*dt;
    }
    for (i = 0; i < N; i++) {
      x[i] = x[i]_new;
      v[i] = v[i]_new;
    }
  }
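
The pseudo-code above can be turned into a minimal runnable Python sketch (restricted to 2D for brevity; the variable names are mine, and Force_routine becomes the inner pairwise loop):

```python
import math

G = 6.674e-11  # gravitational constant

def nbody_step(mass, pos, vel, dt):
    """One time step of the exact O(n^2) N-body update: compute the force
    on each body from all others using the OLD positions, then update
    velocities and positions, exactly as in the two-loop pseudo-code."""
    n = len(mass)
    new_pos, new_vel = [], []
    for i in range(n):
        fx = fy = 0.0
        for j in range(n):              # the "Force_routine(i)" part
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r = math.hypot(dx, dy)
            f = G * mass[i] * mass[j] / r**2   # F = G m_a m_b / r^2
            fx += f * dx / r                   # resolve F along the axes
            fy += f * dy / r
        vx = vel[i][0] + fx / mass[i] * dt     # v_new = v + (F/m) dt
        vy = vel[i][1] + fy / mass[i] * dt
        new_vel.append((vx, vy))
        new_pos.append((pos[i][0] + vx * dt,   # x_new = x + v_new dt
                        pos[i][1] + vy * dt))
    return new_pos, new_vel

# Two equal masses one unit apart attract each other symmetrically.
pos, vel = nbody_step([1e10, 1e10], [(0.0, 0.0), (1.0, 0.0)],
                      [(0.0, 0.0), (0.0, 0.0)], dt=1.0)
```

Note that new values are written to separate arrays and installed only after all forces are computed, matching the second loop of the pseudo-code.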

Page 13

Exact N-body and static partitioning

• We can parallelize n-body by tagging velocity and position for each body and updating bodies using correctly tagged information.
• This can be implemented as a data parallel algorithm. What is the worst-case complexity of a single iteration?
• How should we partition this?
  – Static partitioning can be a bad strategy for the n-body problem.
  – Load can be very unbalanced for some configurations.

Page 14

Improving the complexity of the N-body code

• Complexity of serial n-body algorithm very large: O(n^2) for each iteration.

• Communication structure not local – each body must gather data from all other bodies.

• The most interesting problems arise when n is large – it is not feasible to use the exact method in this case

• Barnes-Hut algorithm is well-known approximation to exact n-body problem and can be efficiently parallelized.

Page 15

Barnes-Hut Approximation

• The Barnes-Hut algorithm is based on the observation that a cluster of distant bodies can be approximated as a single distant body
  – Total mass = aggregate of bodies in cluster
  – Distance to cluster = distance to center of mass of the cluster
• This clustering idea can be applied recursively

Page 16

Barnes-Hut idea

• Dynamic divide and conquer approach:
  – Each region (cube) of space is divided into 8 subcubes
  – If a subcube contains more than 1 body, it is recursively subdivided
  – If a subcube contains no bodies, it is removed from consideration
• 2D example on right – each 2D region divided into 4 subregions

Page 17

Barnes-Hut idea

• For a 3D decomposition, the result is an octtree
• For a 2D decomposition, the result is a quadtree (pictured below)

Page 18

Barnes-Hut Pseudo-code

  for (t = 0; t < tmax; t++) {
    Build octtree;
    Compute total mass and center of mass;
    Traverse the tree, computing the forces;
    Update the position and velocity of all bodies;
  }

• Notes:
  – Total mass and center of mass of each subcube are stored at its root
  – Tree traversal stops at a node when the clustering approximation can be used for a particular body
  – In the gravitational n-body problem described here, this can happen when r ≥ d/c, where r is the distance to the center of mass of a subcube of side d and c is a constant.
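
The "Build octtree" step can be illustrated with a minimal Python sketch of the 2D (quadtree) version, storing the total mass and center of mass at each node as the notes describe (the class and function names are mine, not from the book):

```python
class Node:
    """A quadtree node: either a leaf holding one body, or an internal
    node whose root stores the total mass and center of mass of its
    square subregion."""
    def __init__(self, x, y, size):
        self.x, self.y, self.size = x, y, size   # lower-left corner, side
        self.children = []                       # empty for leaves
        self.mass = 0.0
        self.cx = self.cy = 0.0

def build(node, bodies, min_size=1e-9):
    """Recursively build the tree from a list of (mass, x, y) bodies.
    Empty subregions are discarded; regions with more than one body
    are subdivided into four subsquares."""
    if not bodies:
        return None
    node.mass = sum(m for m, _, _ in bodies)
    node.cx = sum(m * bx for m, bx, _ in bodies) / node.mass
    node.cy = sum(m * by for m, _, by in bodies) / node.mass
    if len(bodies) > 1 and node.size > min_size:
        half = node.size / 2
        for ox in (0.0, half):
            for oy in (0.0, half):
                sub = Node(node.x + ox, node.y + oy, half)
                inside = [b for b in bodies
                          if sub.x <= b[1] < sub.x + half
                          and sub.y <= b[2] < sub.y + half]
                child = build(sub, inside, min_size)
                if child:
                    node.children.append(child)
    return node

bodies = [(1.0, 0.1, 0.1), (2.0, 0.9, 0.9), (1.0, 0.6, 0.2)]
root = build(Node(0.0, 0.0, 1.0), bodies)
```

During force computation the traversal would stop at any node whose stored (mass, cx, cy) satisfies the clustering criterion for the body in question.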

Page 19

Barnes-Hut Complexity

• Partitioning is dynamic: Whole octtree must be reconstructed for each time step because bodies will have moved.

• Constructing the tree can be done in O(n log n)
• Computing the forces can be done in O(n log n)
• Barnes-Hut for one iteration is O(n log n)

[compare to O(n^2) for one iteration with the exact solution]

Page 20

Generalizing the Barnes-Hut approach

• The approach can be used for applications which repeatedly perform some calculation on particles/bodies/data indexed by position.
• Recursive Bisection:
  – Divide the region in half so that the particles are balanced each time
  – Map rectangular regions onto processors so that load is balanced
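
A toy Python sketch of the bisection step (the names and the alternating-axis choice are my own): cut at the median coordinate so that each half holds the same number of particles, alternating the cut axis at each level.

```python
def recursive_bisection(points, depth):
    """Recursively split a list of (x, y) points into 2**depth groups of
    (nearly) equal size. Each group would then be mapped to one processor."""
    if depth == 0:
        return [points]
    axis = depth % 2                      # alternate x- and y-cuts
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                   # median cut balances the halves
    return (recursive_bisection(pts[:mid], depth - 1) +
            recursive_bisection(pts[mid:], depth - 1))

# 16 scattered points split into 4 balanced rectangular regions
points = [(i * 0.37 % 1.0, i * 0.61 % 1.0) for i in range(16)]
regions = recursive_bisection(points, 2)
```

Because each cut is a median, the particle counts stay balanced regardless of how the bodies are clustered in space, which is exactly the property static block decompositions lack.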

Page 21

Recursive Bisection Programming Issues

• How do we keep track of the regions mapped to each processor?

• What should the density of each region be? [granularity!]

• What is the complexity of performing the partitioning? How often should we repartition to optimize the load balance?

• How can locality of communication or processor configuration be leveraged?

Page 22

Scheduling

• Application scheduling: ordering and allocation of tasks/communication/data to processors
  – Application-centric performance measure, e.g. minimal execution time
• Job scheduling: ordering and allocation of jobs on an MPP
  – System-centric performance measure, e.g. processor utilization, throughput

Page 23

Job Scheduling Strategies

• Gang scheduling
• Batch scheduling using backfilling

Page 24

Gang scheduling

• Gang scheduling is a technique for allocating a collection of jobs on an MPP
  – One or more jobs clustered as a gang
  – Gangs share time slices on the whole machine
• The strategy combines time-sharing (gangs get time slices) and space-sharing (gangs partition space) approaches
• There are many flavors of gang scheduling in the literature

Page 25

Gang Scheduling

• Formal definition from Dror Feitelson: Gang scheduling is a scheme that combines three features:
  – The threads of a set of jobs are grouped into gangs, with the threads in a single job considered to be a single gang.
  – The threads in each gang execute simultaneously on distinct PEs, using a 1-1 mapping.
  – Time slicing is used, with all the threads in a gang being preempted and rescheduled at the same time.
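
The three features can be illustrated with a toy round-robin simulation (a sketch under my own naming, not a real scheduler): in every time slice exactly one gang runs, each of its threads on a distinct PE, and the whole gang is preempted at the slice boundary.

```python
from itertools import cycle

def gang_schedule(gangs, num_pes, num_slices):
    """Round-robin gang scheduling sketch. `gangs` maps a gang name to
    its thread count (must be <= num_pes for the 1-1 mapping). Returns,
    for each time slice, the gang occupying each PE (None = idle)."""
    schedule = []
    runnable = cycle(gangs.items())        # simple round-robin order
    for _ in range(num_slices):
        name, threads = next(runnable)
        assert threads <= num_pes          # threads map 1-1 onto PEs
        # Thread k of the gang runs on PE k; remaining PEs sit idle.
        slice_map = [name if pe < threads else None
                     for pe in range(num_pes)]
        schedule.append(slice_map)         # whole gang preempted together
    return schedule

sched = gang_schedule({"A": 4, "B": 2, "C": 3}, num_pes=4, num_slices=6)
```

The idle PEs in the B and C slices show why real gang schedulers also space-share: packing small gangs side by side recovers that lost capacity.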

Page 26

Why gang scheduling?

• Gang scheduling promotes efficient performance of individual jobs as well as efficient utilization and fair allocation of machine resources.
• Gang scheduling leads to two desirable properties:
  – It promotes efficient fine-grain interactions among the threads of a gang, since they are executing simultaneously.
  – Periodic preemption prevents long jobs from monopolizing system resources.
    • (The overhead of preemption can reduce performance, so preemption must be implemented efficiently.)
• Used as the scheduling policy for the CM-5, Meiko CS-2, Paragon, etc.

Page 27

Batch Job Scheduling

• Problem: How to schedule jobs waiting in a queue to run on a multicomputer?
  – Each job requests some number n of nodes and some time t to run
• Goal: promote utilization of the machine, fairness to jobs, short queue wait times

Page 28

One approach: Backfilling

• Main idea: pack the jobs in the processor/time space
  – Allow the job at the head of the queue to be scheduled in the first available slot.
  – If other jobs in the queue can run without changing the start time of previous jobs in the queue, schedule them.
  – Promote jobs if they can start earlier.
• Many versions of backfilling:
  – EASY: promote jobs as long as they don't delay the start time of the first job in the queue
  – Conservative: promote jobs as long as they don't delay the start time of any job in the queue
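
A toy sketch of the conservative variant over discrete time steps (my own simplified model, not a real scheduler implementation): each job is placed at the earliest time it fits, so later jobs can slide into idle slots without delaying any job already placed.

```python
def conservative_backfill(jobs, num_procs):
    """Conservative backfilling sketch. `jobs` is a queue-ordered list of
    (name, procs, runtime) requests. Each job is placed at the earliest
    time where enough processors are free for its whole runtime, which
    by construction never delays a previously placed job."""
    horizon = sum(t for _, _, t in jobs)       # worst-case schedule length
    used = [0] * (horizon + 1)                 # procs busy at each time step
    starts = {}
    for name, procs, runtime in jobs:
        t = 0
        while any(used[t + k] + procs > num_procs for k in range(runtime)):
            t += 1                             # slide right until the job fits
        for k in range(runtime):               # reserve the slot
            used[t + k] += procs
        starts[name] = t
    return starts

# J1 fills half a 4-processor machine; J2 needs the whole machine and must
# wait; small J3 backfills into the idle half without delaying J2.
starts = conservative_backfill(
    [("J1", 2, 3), ("J2", 4, 2), ("J3", 2, 3)], num_procs=4)
```

Under EASY backfilling the check would only protect the first queued job's start time; here every reservation is protected, which is why placements never move once made.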

Page 29

Backfilling Example

• Submitting five requests…

[Figure: schedule diagram, processors vs. time]

Page 30

Backfilling Example

• Submitting five requests…
• Using backfilling...

[Figure: schedule diagrams (processors vs. time), before and after backfilling]

Page 31

Backfilling Example

[Figure: schedule diagrams (processors vs. time)]

Page 32

Backfilling Example

[Figure: schedule diagrams (processors vs. time)]

Page 33

Backfilling Example

• Existing job finishes
• Backfilling promotes the yellow job and then schedules the purple job

[Figure: schedule diagrams (processors vs. time) showing the promoted jobs]

Page 34

Backfilling Scheduling

• Backfilling is used in the Maui Scheduler at SDSC on the SP-2, PBS at NASA, the Computing Condominium Scheduler at Penn State, etc.
• Backfilling issues:
  – What if the processors of the platform have different capacities (are not homogeneous)?
  – What if some jobs get priority over others?
  – Should parallel jobs be treated separately from serial jobs?
  – If multiple queues are used, how should they be administered?
  – Should users be charged to wait in the queue as well as to run on the machine?

Page 35

Optimizing Application Performance

• Backfilling and MPP scheduling strategies typically optimize for throughput

• Optimizing throughput and optimizing application performance (e.g. execution time) can often conflict

• How can applications optimize performance in an MPP environment?

• Moldable jobs = jobs which can run with more than one partition size

• Question: What is the optimal partition size for moldable jobs?

• We can answer this question when the MPP scheduler runs a conservative backfilling strategy and publishes the list of available nodes.

Page 36

Optimizing Applications targeted to a Batch-scheduled MPP

• SA = generic AppLeS scheduler developed for jobs submitted to a backfilling MPP
  – uses the availability list of the MPP scheduler to determine the size of the partition to be requested by the application
  – Speedup curve known for Gas applications
• Static = jobs submitted without SA
• Workload taken from KTH (Swedish Royal Institute of Technology)
• Experiments developed by Walfredo Cirne

Mean Turn-Around Time

replacement   SA        Static
5%            4457.15   8078.23
10%           4388.35   8030.74
15%           4459.20   8303.79
20%           4576.61   8689.26
25%           4500.23   8410.98
30%           4642.08   8250.78
35%           4863.60   7939.14
40%           4540.10   7864.41
45%           4586.62   7424.85
50%           4307.59   7482.15