33
Multicore Challenge Conference 2013 UWE, Bristol Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao Yan Heriot-Watt University Phil Trinder University of Glasgow 1 Multicore Challenge Conference 2013

Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

  • Upload
    others

  • View
    39

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Multicore Challenge Conference 2013

UWE, Bristol

Static and Dynamic Load Balancing for Heterogeneous Systems

Greg Michaelson

Turkey Al Salkini, Khari Armih, Deng Xiao Yan

Heriot-Watt University

Phil Trinder

University of Glasgow

1 Multicore Challenge Conference 2013

Page 2: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Context

• explosive growth in deployment of heterogeneous systems

– desktop/laptop/tablet with many-core + GPU

• HPC with clusters of nodes of many-core + GPU

• upgrade/replace/remove cluster components as opportune rather than replacing whole cluster

• newer components typically higher spec than older components

• hard to allocate and balance load

Multicore Challenge Conference 2013 2

Page 3: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Context

• typical heterogeneous system

• colours indicate different technologies

Multicore Challenge Conference 2013 3

network

bus

Cores Mem GPU

bus

Cores Mem GPU

cluster cluster

Page 4: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Context

• static work allocation for regular data parallelism

• simple data parallel skeleton with master & identical workers

• masters sends work to workers and receives results

• assume: master on core; workers on cores or GPU

Multicore Challenge Conference 2013 4

master

worker1 worker2 workerM ...

Page 5: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Homogeneity

• cluster of identical single core nodes with no GPU

Multicore Challenge Conference 2013 5

network

bus

Core Mem

bus

Core Mem

Page 6: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Amdahl’s Law

• T = total sequential time

Multicore Challenge Conference 2013 6

T

Page 7: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Amdahl’s Law

• T = total sequential time

• Tseq = necessarily sequential time

• Tpar = potentially parallelisable time

• i.e. T = Tseq + Tpar

Multicore Challenge Conference 2013 7

Tseq Tpar

T

Page 8: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Amdahl’s Law

• T = total sequential time

• Tseq = necessarily sequential time

• Tpar = potentially parallelisable time

• i.e. T = Tseq + Tpar

• N = number of nodes

• allocate same amounts of Tpar

to each node

Multicore Challenge Conference 2013 8

Tseq Tpar

T

Tpar1 Tpar2 TparN ... Tseq

Page 9: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Amdahl’s Law

• T = total sequential time

• Tseq = necessarily sequential time

• Tpar = potentially parallelisable time

• i.e. T = Tseq + Tpar

• N = number of nodes

• allocate same amounts of Tpar

to each node

• Tmin = minimum time

• Tmin = Tseq + Tpar/N

Multicore Challenge Conference 2013 9

Tseq Tpar

T

Tpar1 Tpar2 TparN ... Tseq

Tseq Tpar1

Tpar2

...

TparN

saving

Page 10: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Communication cost

• Tsend = time to send data to processors

• Trec = time to receive results from processors

• for any parallel benefit, need:

• Tsend+Trec+Tpar/N < Tpar

Multicore Challenge Conference 2013 10

T

Tseq Tpar1

Tpar2

...

TparN

Tseq Tpar

Tsend Trec

saving

Page 11: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Heterogeneity 1

• different number of cores/amount of RAM on each node

• same core/RAM technologies

Multicore Challenge Conference 2013 11

network

bus

Cores1 Mem1

bus

Cores2 Mem2

Page 12: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Heterogeneity 1

• Coresi = number of cores on node i

• P = number of processors = Cores1+...+CoresN

• each core gets:

Tpar/P

• node i gets:

Coresi*Tpar/P

• haven’t taken memory into account...?

Multicore Challenge Conference 2013 12

Page 13: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Heterogeneity 2

• different technologies on each node

• no GPU

Multicore Challenge Conference 2013 13

network

bus

Cores1 Mem1

bus

Cores2 Mem2

Page 14: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Heterogeneity 2

• characterise node strength

– coarse-grain, relative measure

• for node i:

Speedi = clock speed of each core

Strengthi = Coresi*Speedi

i.e. strength increases with number of cores & core speed

• whole system: Strength = Strength1+...+StrengthN

• node i gets: Tpar*Strengthi/Strength

• each node i core gets: (Tpar*Strengthi/Strength)/Coresi

Multicore Challenge Conference 2013 14

Page 15: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Heterogeneity 2

• still doesn’t take memory into account?

• top level shared cache (i.e. L2 or L3) most important

– determines frequencies of access clashes & caching

• for node i:

Cachei = shared cache size

Strengthi = Coresi*Speedi*Cachei

i.e. strength also increases with cache size

• use previous equations for system strength, and node and core load

Multicore Challenge Conference 2013 15

Page 16: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Heterogeneity 2

Multicore Challenge Conference 2013 16

Page 17: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Heterogeneity 2

Multicore Challenge Conference 2013 17

Page 18: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Heterogeneity 3

• different node technologies

• with GPU

Multicore Challenge Conference 2013 18

network

bus

Cores1 Mem1 GPU1

bus

Cores2 Mem2 GPU2

Page 19: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Heterogeneity 3

• wish to treat GPU as additional general compute resource - GP-GPU

• but cannot directly adopt current algorithm independent model

– very hard to characterise GP-GPU succinctly from static information

• base GPU strength measure on algorithm specific profiling

• use cost model for core strength

• normalise strength relative to one core

Multicore Challenge Conference 2013 19

Page 20: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Heterogeneity 3

TCore0 = time for algorithm on base core

• for node i:

TGPUi = time for algorithm on GPU

• strength of GPUi = TGPUi/TCore0

• node i core strength = (Speedi*Cachei)/(Speed0*Cache0)

• assume GPU controlled by one core

Strengthi = TGPUi/TCore0+

(Corei-1)*

(Speedi*Cachei)/(Speed0*Cache0)

Multicore Challenge Conference 2013 20

Page 21: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Heterogeneity 3

Multicore Challenge Conference 2013 21

Page 22: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Heterogeneity 3

Multicore Challenge Conference 2013 22

Page 23: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Heterogeneity 3

Multicore Challenge Conference 2013 23

Page 24: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Task mobility

• static model assumes program has sole use of system

– no load changes at run-time

• if system shared then load changes unpredictably at run-time

• base system design on threads?

• hope operating system will manage load?

– optimising movement of arbitrary threads across nodes is a black art

• decentralise mobility

• let master decide when to move work

Multicore Challenge Conference 2013 24

Page 25: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Task mobility

Multicore Challenge Conference 2013 25

W2

M

W1

W3

M: Master, Wi: Worker, Colour: Amount of load on processor.

Page 26: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Task mobility

Multicore Challenge Conference 2013 26

W2

M

W1

W3

W3 becomes overloaded. The load should be balanced by moving the

work to a lightly loaded processor.

Page 27: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Task mobility

Multicore Challenge Conference 2013 27

W2

W3

M

W1

The load is re-balanced.

Page 28: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Mobility decision

Ti = time to complete task on location i

Tmove = time to move task from current to new location

Overhead = acceptable completion overhead e.g. 5%

• to justify moving work from location i to location j:

Tj+Tmove < Ti*Overhead

Multicore Challenge Conference 2013 28

Page 29: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Mobility decision

• homogeneous case

TBase = time to complete whole task on unloaded core

Loadi,t = load on location i at time t

• assume we inspect load having completed P% of task

Ti = TBase*(100-P)/100*f(Loadi,t,Loadi,0)

Multicore Challenge Conference 2013 29

Page 30: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Multicore Challenge Conference 2013 30

Page 31: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Evaluation

Multicore Challenge Conference 2013 31

16 workers + 10 tasks, no mobility 16 workers + 10 tasks, mobility

• 7,000*7,000 matrix multiplication • Poisson load • darker == more heavily loaded

Page 32: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

Conclusions

• simple models + relative measures suffice for static & dynamic load balancing

• data parallel/regular parallelism only

• small scale experiments

• next:

– larger scale experiments

– explore more complex algorithms

– extend mobility to many core

Multicore Challenge Conference 2013 32

Page 33: Static and Dynamic Load Balancing for Heterogeneous Systems · Static and Dynamic Load Balancing for Heterogeneous Systems Greg Michaelson Turkey Al Salkini, Khari Armih, Deng Xiao

References

• T. Al Salkini and G. Michaelson, `Dynamic Farm Skeleton Task Allocation Through Task Mobility’, 18th International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'12), Las Vegas, July 16-19, 2012

• K. Armih, G. Michaelson and P. Trinder, `Cache size in a cost model for heterogeneous skeletons’, Proceedings of HLPP 2011 : 5th ACM SIGPLAN Workshop on High-Level Parallel Programming and Applications, Tokyo, September. 2011

• X. Y. Deng, P. Trinder and G. Michaelson, `Cost-Driven Autonomous Mobility', Computer Languages, Systems and Structures, 36(1) (April 2010) pp 34-51, 2010

Multicore Challenge Conference 2013 33