
Multicore Challenge Conference 2013

UWE, Bristol

Static and Dynamic Load Balancing for Heterogeneous Systems

Greg Michaelson

Turkey Al Salkini, Khari Armih, Deng Xiao Yan

Heriot-Watt University

Phil Trinder

University of Glasgow


Context

• explosive growth in deployment of heterogeneous systems

– desktop/laptop/tablet with many-core + GPU

• HPC with clusters of nodes of many-core + GPU

• upgrade/replace/remove cluster components as opportune rather than replacing whole cluster

• newer components typically higher spec than older components

• hard to allocate and balance load


Context

• typical heterogeneous system

• colours indicate different technologies


[Diagram: two clusters connected by a network; each node has cores, memory, and a GPU on a shared bus.]

Context

• static work allocation for regular data parallelism

• simple data parallel skeleton with master & identical workers

• master sends work to workers and receives results (see the sketch below)

• assume: master on core; workers on cores or GPU


[Diagram: master connected to workers worker1, worker2, ..., workerM.]
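A minimal sketch of such a farm skeleton in Python (illustrative only: the name `farm` and the use of multiprocessing are assumptions, not the authors' implementation):

```python
from multiprocessing import Pool

def farm(work_fn, items, n_workers):
    # master: send work to identical workers, receive results in order
    with Pool(processes=n_workers) as pool:
        return pool.map(work_fn, items)

def square(x):
    return x * x

if __name__ == "__main__":
    print(farm(square, range(10), n_workers=4))
```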

Homogeneity

• cluster of identical single core nodes with no GPU


[Diagram: identical single-core nodes, each with one core and its memory on a bus, connected by a network.]

Amdahl’s Law

• T = total sequential time



Amdahl’s Law

• T = total sequential time

• Tseq = necessarily sequential time

• Tpar = potentially parallelisable time

• i.e. T = Tseq + Tpar


[Figure: a bar of length T split into Tseq and Tpar.]

Amdahl’s Law

• T = total sequential time

• Tseq = necessarily sequential time

• Tpar = potentially parallelisable time

• i.e. T = Tseq + Tpar

• N = number of nodes

• allocate the same share of Tpar to each node


[Figure: the bar T split into Tseq and Tpar, with Tpar divided into equal chunks Tpar1, Tpar2, ..., TparN.]

Amdahl’s Law

• T = total sequential time

• Tseq = necessarily sequential time

• Tpar = potentially parallelisable time

• i.e. T = Tseq + Tpar

• N = number of nodes

• allocate the same share of Tpar to each node

• Tmin = minimum time

• Tmin = Tseq + Tpar/N


[Figure: the sequential schedule Tseq + Tpar compared with the parallel schedule, Tseq followed by Tpar1, ..., TparN running together; the difference is the saving.]
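A worked example of the formula (the numbers are illustrative, not from the talk):

```python
def t_min(t_seq, t_par, n):
    # Amdahl's Law: best achievable time on n nodes
    return t_seq + t_par / n

t_seq, t_par = 10.0, 90.0   # 10% necessarily sequential
for n in (1, 4, 16, 64):
    t = t_min(t_seq, t_par, n)
    print(f"N={n:3d}  Tmin={t:6.2f}  speedup={(t_seq + t_par) / t:.2f}x")
```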

Communication cost

• Tsend = time to send data to processors

• Trec = time to receive results from processors

• for any parallel benefit, need:

• Tsend+Trec+Tpar/N < Tpar


[Figure: the parallel schedule with Tsend before and Trec after the Tpar1, ..., TparN phase; communication cost reduces the saving.]
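A quick check of the benefit condition (illustrative numbers):

```python
def parallel_benefit(t_send, t_rec, t_par, n):
    # worthwhile only if communication plus the parallel share beats Tpar
    return t_send + t_rec + t_par / n < t_par

print(parallel_benefit(t_send=5.0, t_rec=5.0, t_par=90.0, n=16))    # True
print(parallel_benefit(t_send=50.0, t_rec=50.0, t_par=90.0, n=16))  # False
```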

Heterogeneity 1

• different number of cores/amount of RAM on each node

• same core/RAM technologies


[Diagram: nodes with different numbers of cores (Cores1, Cores2) and amounts of memory (Mem1, Mem2), built from the same technologies, connected by a network.]

Heterogeneity 1

• Coresi = number of cores on node i

• P = number of processors = Cores1+...+CoresN

• each core gets:

Tpar/P

• node i gets:

Coresi*Tpar/P

• haven’t taken memory into account...?
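A sketch of the cores-proportional allocation just described (function and variable names are illustrative):

```python
def allocate_by_cores(t_par, cores):
    # share Tpar across nodes in proportion to their core counts
    p = sum(cores)                        # P = Cores1 + ... + CoresN
    return [c * t_par / p for c in cores]

print(allocate_by_cores(90.0, cores=[4, 8, 16]))  # [12.86, 25.71, 51.43]
```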


Heterogeneity 2

• different technologies on each node

• no GPU


[Diagram: nodes built from different core and memory technologies (Cores1/Mem1, Cores2/Mem2), without GPUs, connected by a network.]

Heterogeneity 2

• characterise node strength

– coarse-grain, relative measure

• for node i:

Speedi = clock speed of each core

Strengthi = Coresi*Speedi

i.e. strength increases with number of cores & core speed

• whole system: Strength = Strength1+...+StrengthN

• node i gets: Tpar*Strengthi/Strength

• each node i core gets: (Tpar*Strengthi/Strength)/Coresi
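A sketch of strength-based allocation under these definitions (numbers are illustrative):

```python
def allocate_by_strength(t_par, strengths):
    # share Tpar across nodes in proportion to node strength
    total = sum(strengths)                        # system strength
    return [t_par * s / total for s in strengths]

# Strengthi = Coresi * Speedi  (cores, GHz)
nodes = [(4, 2.0), (8, 2.5), (16, 3.0)]
strengths = [cores * speed for cores, speed in nodes]
shares = allocate_by_strength(90.0, strengths)
per_core = [s / cores for s, (cores, _) in zip(shares, nodes)]
print(shares)    # load per node
print(per_core)  # load per core on each node
```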


Heterogeneity 2

• still doesn’t take memory into account?

• top level shared cache (i.e. L2 or L3) most important

– determines the frequency of access clashes and the amount of caching

• for node i:

Cachei = shared cache size

Strengthi = Coresi*Speedi*Cachei

i.e. strength also increases with cache size

• use previous equations for system strength, and node and core load
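Extending the earlier sketch with the cache term (illustrative values):

```python
# Strengthi = Coresi * Speedi * Cachei  (cores, GHz, MB)
nodes = [(4, 2.0, 4.0), (8, 2.5, 8.0), (16, 3.0, 20.0)]
strengths = [cores * speed * cache for cores, speed, cache in nodes]
total = sum(strengths)                  # system strength
for (cores, _, _), s in zip(nodes, strengths):
    node_share = 90.0 * s / total       # Tpar * Strengthi / Strength
    print(f"node: {node_share:6.2f}   per core: {node_share / cores:5.2f}")
```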



Heterogeneity 3

• different node technologies

• with GPU


[Diagram: nodes with different core, memory, and GPU technologies (Cores1/Mem1/GPU1, Cores2/Mem2/GPU2), connected by a network.]

Heterogeneity 3

• wish to treat GPU as additional general compute resource - GP-GPU

• but cannot directly adopt current algorithm independent model

– very hard to characterise GP-GPU succinctly from static information

• base GPU strength measure on algorithm specific profiling

• use cost model for core strength

• normalise strength relative to one core

Multicore Challenge Conference 2013 19

Heterogeneity 3

TCore0 = time for algorithm on base core

• for node i:

TGPUi = time for algorithm on GPU

• strength of GPUi = TCore0/TGPUi

(so a GPU that runs the algorithm faster than the base core has strength greater than 1)

• node i core strength = (Speedi*Cachei)/(Speed0*Cache0)

• assume GPU controlled by one core

Strengthi = TCore0/TGPUi + (Coresi-1)*(Speedi*Cachei)/(Speed0*Cache0)
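A sketch of this profiled strength measure (all profiling numbers are invented for illustration):

```python
def node_strength(t_core0, t_gpu, cores, speed, cache, speed0, cache0):
    # GPU strength comes from algorithm-specific profiling; the remaining
    # cores use the cost model, normalised against the base core. One core
    # is assumed to be dedicated to driving the GPU.
    gpu = t_core0 / t_gpu
    core = (speed * cache) / (speed0 * cache0)
    return gpu + (cores - 1) * core

# base core: 2.0 GHz, 4 MB cache, 120 s for the algorithm; GPU: 6 s
print(node_strength(t_core0=120.0, t_gpu=6.0,
                    cores=8, speed=2.5, cache=8.0,
                    speed0=2.0, cache0=4.0))   # 20 + 7 * 2.5 = 37.5
```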



Task mobility

• static model assumes program has sole use of system

– no load changes at run-time

• if system shared then load changes unpredictably at run-time

• base system design on threads?

• hope operating system will manage load?

– optimising movement of arbitrary threads across nodes is a black art

• decentralise mobility

• let master decide when to move work


Task mobility

[Diagram: master M with workers W1, W2, W3; colour shows the amount of load on each processor.]

Task mobility

[Diagram: W3 becomes overloaded; the load should be balanced by moving work to a lightly loaded processor.]

Task mobility

[Diagram: after the move, the load is re-balanced across the workers.]

Mobility decision

Ti = time to complete task on location i

Tmove = time to move task from current to new location

Overhead = acceptable completion overhead, as a fraction, e.g. 5% = 0.05

• to justify moving work from location i to location j, the move must pay off by more than the overhead margin:

Tj + Tmove < Ti*(1 - Overhead)
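A sketch of the decision (the 5% margin and the timings are illustrative):

```python
def should_move(t_i, t_j, t_move, overhead=0.05):
    # move work from i to j only if it beats staying put
    # by more than the acceptable overhead margin
    return t_j + t_move < t_i * (1.0 - overhead)

print(should_move(t_i=100.0, t_j=60.0, t_move=20.0))  # True: 80 < 95
print(should_move(t_i=100.0, t_j=90.0, t_move=8.0))   # False: 98 > 95
```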


Mobility decision

• homogeneous case

TBase = time to complete whole task on unloaded core

Loadi,t = load on location i at time t

• assume we inspect load having completed P% of task

Ti = TBase*(100-P)/100 * f(Loadi,t, Loadi,0)

where f scales the remaining work by how the load at location i has changed since the task started
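A sketch of this estimate, assuming a simple linear form for f (the slides leave f abstract):

```python
def remaining_time(t_base, p_done, load_now, load_start):
    # estimate time to finish the remaining (100 - P)% of the task,
    # scaled by relative load (an assumed linear choice of f)
    f = load_now / load_start if load_start else 1.0
    return t_base * (100 - p_done) / 100 * f

# 40% done, load has doubled since the task started
print(remaining_time(t_base=100.0, p_done=40, load_now=2.0, load_start=1.0))  # 120.0
```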



Evaluation


[Figure: two load heat maps of 16 workers + 10 tasks, without mobility (left) and with mobility (right).]

• 7,000*7,000 matrix multiplication
• Poisson load
• darker == more heavily loaded

Conclusions

• simple models + relative measures suffice for static & dynamic load balancing

• data parallel/regular parallelism only

• small scale experiments

• next:

– larger scale experiments

– explore more complex algorithms

– extend mobility to many-core


References

• T. Al Salkini and G. Michaelson, 'Dynamic Farm Skeleton Task Allocation Through Task Mobility', 18th International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'12), Las Vegas, July 16-19, 2012

• K. Armih, G. Michaelson and P. Trinder, 'Cache size in a cost model for heterogeneous skeletons', Proceedings of HLPP 2011: 5th ACM SIGPLAN Workshop on High-Level Parallel Programming and Applications, Tokyo, September 2011

• X. Y. Deng, P. Trinder and G. Michaelson, 'Cost-Driven Autonomous Mobility', Computer Languages, Systems and Structures, 36(1), April 2010, pp 34-51

