MATLAB Tutorial Series
Parallel Computing Toolbox

Kadin Tseng
Scientific Computing and Visualization Group
Boston University

Spring 2012

Hands On Practice

• Log in to the thin client with your BU userid and Kerberos password

• Start MATLAB

• Copy example codes with

>> copyfile('T:\kadin\pct', '.\')

http://scv.bu.edu/images/scv.mpg

1. What is the PCT?

• The Parallel Computing Toolbox (PCT) is a MATLAB toolbox.

• It provides parallel utility functions that let users run the MATLAB operations or procedures in their code in parallel to reduce processing time.

2. Where To Run The PCT?

• Run on a desktop or laptop
  – MATLAB must be available on the local machine
  – Starting with R2011b, up to 12 processors can be used; up to 8 processors for older versions
  – Must have a multi-core processor to gain speedup. The thin client you are using has a dual-core processor
  – Requires BU userid (to access MATLAB and PCT licenses)

• Run on a Katana Cluster node (as if it were a multi-core desktop)
  – Requires SCV userid

• Run on multiple Katana nodes (for up to 32 processors)
  – Requires SCV userid

3. Types of Parallel Jobs

There are two types of parallel applications.

3.1 Distributed Jobs – task parallel
3.2 Parallel Jobs – data parallel

3.1 Distributed Jobs

A distributed job is:

• Multiple tasks running independently on multiple workers with no information passed among them.
• On Katana, a distributed job is a series of single-processor batch jobs.
• This is also known as a task-parallel, or "embarrassingly parallel", job.
• Examples of distributed jobs: Monte Carlo simulations, image processing.

• Parallel utility function: dfeval

3.2 Parallel Jobs

A parallel job is:

• A single task running concurrently on multiple workers that may communicate with each other.
• On Katana, this results in one batch job with multiple processors.
• This is also known as a data-parallel job.

Examples of parallel jobs include many linear algebra applications: matrix multiply, solvers for linear algebraic systems of equations, and eigensolvers. Some may run efficiently in parallel and others may not; it depends on the underlying algorithms and operations. This category also includes jobs that mix serial and parallel processing.

Parallel utility functions: spmd, drange, parfor, . . .

4. How much work to parallelize my code?

• Code dependent

• Many MATLAB functions are overloaded to handle vector or parallel operations based on the variables' data type, i.e., scalars, vectors, or distributed arrays. For some applications, you may just have to turn on the parallel capability. [No effort to a few lines of code modification]

• If arrays have been declared distributed (i.e., parallel), subsequent operations on these arrays using built-in or user functions will be processed in parallel without additional instructions. Arrays not declared as parallel will also be promoted to parallel if they are used with distributed arrays. [Moderate effort: code, algorithm change]

• For custom parallelization, MPI-like utilities are available. [More extensive effort: code, algorithm]

5.1 Distributed Jobs – dfeval

Example: run a dfeval job "interactively". It computes 1x4 and 3x2 random arrays locally or on Katana (with the pre-defined batch configuration 'SGE').

>> y = dfeval(@rand, {1 3}, {4 2}, 'Configuration', 'local')
Submitting task 1
Job output will be written to: /usr1/scv/kadin/Job6_Task1.out
QSUB output: Your job 211908 ("Job6.1") has been submitted

Submitting task 2
Job output will be written to: /usr1/scv/kadin/Job6_Task2.out
QSUB output: Your job 211909 ("Job6.2") has been submitted

y = [1x4 double] [3x2 double]

• @rand is the application m-file.
• {1 3} and {4 2} are the inputs to rand: task 1 computes rand(1,4) and task 2 computes rand(3,2).
• The configuration may be 'local' or 'SGE'.
• The job runs in batch or in the background; output returns to the client workspace.


5.2 Distributed Jobs – SCV script

For task-parallel applications on the Katana Cluster, we strongly recommend the use of an SCV script instead of dfeval. This script does not use the PCT. For details, see

http://www.bu.edu/tech/research/computation/linux-cluster/katana-cluster/runningjobs/multiple_matlab_tasks/

If your application fits the description of a distributed job, you don't need to know any more beyond this point . . .


6. How Do I Run Parallel Jobs?

There are two ways to run parallel jobs: pmode or matlabpool. Starting (closing) either one turns parallelism on (off) and allocates (deallocates) resources.

pmode is a special mode of application; it is useful for learning the PCT and for prototyping parallel programs interactively.

matlabpool is the general mode of application; it can be used for interactive and batch processing.


6.1 pmode

>> pmode start local % use 2 workers on the thin clients

(Screenshot: a MATLAB window on a Katana node.) A separate Parallel Command Window is spawned. (Similar on Windows.)

6.1 pmode

>> pmode start local % use 2 workers in parallel mode

• Any command issued at the "P>>" prompt is executed on all workers. Enter "labindex" to query the workers' IDs.
• Use an if conditional with labindex to issue instructions to specific workers, like this:
    if labindex==1, numlabs, end

(Screenshot of the Parallel Command Window; callouts: worker number, command entered, enter command here.)

PCT terminology:
  worker = processor
  labindex: processor number
  numlabs: number of processors


pmode

• Replicated array
  P>> A = magic(3); % A is replicated on every worker

• Variant array
  P>> A = magic(3) + labindex - 1; % labindex=1,2,3,4

  LAB 1      | LAB 2      | LAB 3      | LAB 4
  8  1  6    | 9  2  7    | 10  3  8   | 11  4  9
  3  5  7    | 4  6  8    |  5  7  9   |  6  8 10
  4  9  2    | 5 10  3    |  6 11  4   |  7 12  5

• Private array
  P>> if labindex==2, A = magic(3) + labindex - 1; end

  LAB 1      | LAB 2      | LAB 3      | LAB 4
             | 9  2  7    |            |
  undefined  | 4  6  8    | undefined  | undefined
             | 5 10  3    |            |
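The variant-array case can be reproduced without a parallel session. The following is only a serial sketch: it assumes 4 workers, and the loop variable lab and the cell array A are illustrative stand-ins for labindex and the per-worker copies of A, not PCT objects.

```matlab
% Serial emulation of a variant array: one copy of A per (assumed) worker.
nlabs = 4;                        % assumed number of workers
A = cell(1, nlabs);               % A{k} stands in for the copy of A on lab k
for lab = 1:nlabs                 % lab plays the role of labindex
    A{lab} = magic(3) + lab - 1;  % same statement, different value per lab
end
disp(A{2})                        % the copy that lab 2 would hold
```

Each cell holds the matrix that the corresponding lab would hold in pmode; in a real session the copies live on the workers rather than in one cell array.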


Switch from pmode to matlabpool

If you are running MATLAB pmode, exit it.

P>> exit

You can also close pmode from the MATLAB window:

>> pmode close

Now, open matlabpool:

>> matlabpool open local % rely on default worker size

• MATLAB allows only one parallel environment at a time.
• pmode may be started with the keyword open or start.
• matlabpool can only be started with the keyword open.


6.2 matlabpool

matlabpool is the general paradigm for MATLAB parallel computing; it can be used for interactive and batch jobs.

>> matlabpool open local % open 2 workers on thin client
>> %
>> % serial or parallel applications ...
>> %
>> matlabpool close % ends matlabpool properly

Parallel Methods – parfor

parfor is a parallel for-loop. The workload is distributed according to the loop index. Computations for different loop indices must be independent. Operates in matlabpool.

matlabpool open % use default number of workers ( 2 )
s = 0;
parfor i=1:10
  x(i) = sin(2*pi*i/10); % each slice of 5 computed by a worker
  s = s + i; % summation; a reduction operation
end
matlabpool close

Does parfor work for all kinds of operations? No. As an example, a reduction operation (such as addition)
• must satisfy the associative rule x ◊ (y ◊ z) = (x ◊ y) ◊ z
• Plus (+) and multiply (*) operators satisfy the rule.
• Subtract (-) and divide (/) operators fail the rule; their results are nondeterministic.
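The associative rule can be checked with ordinary arithmetic, no workers required. This small sketch uses arbitrary values for x, y and z (chosen only for illustration) and compares the two groupings for addition and for subtraction.

```matlab
% Regrouping leaves a sum unchanged but changes a chained subtraction.
x = 8; y = 4; z = 2;            % arbitrary illustrative values
plusLeft   = (x + y) + z;       % grouping from the left
plusRight  = x + (y + z);       % grouping from the right
minusLeft  = (x - y) - z;
minusRight = x - (y - z);
fprintf('plus : %d vs %d\n', plusLeft, plusRight);
fprintf('minus: %d vs %d\n', minusLeft, minusRight);
```

Because parfor does not fix the order in which partial results are combined, only operations whose two groupings agree give a repeatable answer.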

Parallel Methods – parfor

Try this:

s = 0; parfor i=1:10, s = s + i; end, s % s = 55
                                        % (s = 1 + 2 + 3 + . . . + n; s = n(n+1)/2)
s=1000; parfor i=1:5, s=s*i; end, s % do a few times
s=1000; parfor i=1:5, s=s/i; end, s % do a few times

The + and * operators above satisfy the associative rule. The - and / operators fail the rule; they may or may not produce the same (correct) result each time.

Some other operational rules for parfor:
1. Loop index must be consecutive integers.
2. Loop index must not be altered within the loop.
3. Loop count need not be divisible by the number of workers.
4. No need to distribute arrays.
5. Solution remains in the MATLAB workspace.
6. Does not work in spmd; can't use numlabs, labindex.
7. No graphics can be displayed within parfor.
8. More rules depending on operations; consult the PCT doc.

Integration Example

• An integration of the cosine function between 0 and π/2
• The integration scheme is the mid-point rule, for simplicity.
• Several parallel methods will be demonstrated.

\[ \int_a^b \cos(x)\,dx = \sum_{i=1}^{p}\sum_{j=1}^{n}\int_{a_{ij}}^{a_{ij}+h}\cos(x)\,dx \approx \sum_{i=1}^{p}\sum_{j=1}^{n}\cos\left(a_{ij}+\tfrac{h}{2}\right)h \]

  a = 0; b = pi/2; % range
  m = 8; % # of increments
  h = (b-a)/m; % increment
  p = numlabs;
  n = m/p; % inc. / worker
  ai = a + (i-1)*n*h;
  aij = ai + (j-1)*h;

(Figure: cos(x) from x=a to x=b, divided among Worker 1 to Worker 4; each increment has width h and is evaluated at its mid-point.)
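The double sum over workers i and increments j can be evaluated serially to see how the index arithmetic works. In this sketch p is fixed at 4 as an assumed stand-in for numlabs, and the variable total is introduced only for illustration; in a parallel run each value of i would belong to one worker.

```matlab
a = 0; b = pi/2;          % range of integration
m = 8;                    % total number of increments
p = 4;                    % assumed number of workers (stands in for numlabs)
h = (b - a)/m;            % increment
n = m/p;                  % increments per worker
total = 0;
for i = 1:p               % in parallel, each i would be handled by one worker
    ai = a + (i-1)*n*h;   % start of worker i's range
    for j = 1:n
        aij = ai + (j-1)*h;                % start of increment j
        total = total + cos(aij + h/2)*h;  % mid-point rule on one increment
    end
end
fprintf('integral = %.6f (exact value is 1)\n', total);
```

With only 8 increments the result is close to, but not exactly, the exact value of 1; the later examples use m = 10000.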

Integration Example – Serial Integration

% serial integration (with for-loop)
tic
  m = 10000;
  a = 0; % lower limit of integration
  b = pi/2; % upper limit of integration
  dx = (b - a)/m; % increment length
  intSerial = 0; % initialize intSerial
  for i=1:m
    x = a+(i-0.5)*dx; % mid-point of increment i
    intSerial = intSerial + cos(x)*dx;
  end
toc

(Figure: increments of width dx; the first mid-point is x(1) = a + dx/2 and the last is x(m) = b - dx/2.)

Integration Example – Serial Integration

% serial integration (with vector form)
tic
  m = 10000;
  a = 0; % lower limit of integration
  b = pi/2; % upper limit of integration
  dx = (b - a)/m; % increment length
  x = a+dx/2:dx:b-dx/2; % mid-points of m increments
  intSerial = sum(cos(x)*dx);
toc

(Figure: increments of width dx; the first mid-point is x(1) = a + dx/2 and the last is x(m) = b - dx/2.)


Integration Example – spmd

This example performs parallel integration in spmd. It uses labindex and numlabs to enable all workers to run the same command ("sp") with different data ("md").

matlabpool open local
tic % includes the overhead cost of spmd
spmd
  m = 10000; a = 0; b = pi/2;
  n = m/numlabs; % # of increments per lab
  deltax = (b - a)/numlabs; % length per lab
  ai = a + (labindex - 1)*deltax; % local integration range
  bi = a + labindex*deltax;
  dx = deltax/n; % increment length for lab
  x = ai+dx/2:dx:bi-dx/2; % mid-points of n increments per worker
  intSPMD = sum(cos(x)*dx); % integral sum per worker
  intSPMD = gplus(intSPMD,1); % global sum over all workers
end % spmd
toc
matlabpool close

Integration Example – parfor

This example performs parallel integration with parfor.

matlabpool open 4
tic
  m = 10000;
  nworkers = matlabpool('size'); % returns # of workers assigned
  n = m/nworkers; % number of increments per worker
  a = 0; b = pi/2;
  deltax = (b - a)/nworkers; % increment length per worker
  dx = deltax/n;
  intParfor1 = 0;
  parfor i=1:nworkers
    ai = a + (i - 1)*deltax;
    bi = a + i*deltax;
    x = ai+dx/2:dx:bi-dx/2; % mid-points of n increments per worker
    intParfor1 = intParfor1 + sum(cos(x)*dx);
  end
toc
matlabpool close


Integration Example – parfor

This example performs parallel integration with parfor.

matlabpool open 4
tic
  m = 10000;
  a = 0; b = pi/2;
  dx = (b - a)/m; % increment length
  intParfor2 = 0;
  parfor i=1:m
    intParfor2 = intParfor2 + cos(a+(i-0.5)*dx)*dx;
  end
toc
matlabpool close


Integration Example – drange

Similar to parfor but used within spmd.

matlabpool open 4
tic
m = 10000; a = 0; b = pi/2;
spmd
  n = m/numlabs; % number of increments per lab (worker)
  deltax = (b - a)/numlabs; % increment length per worker
  for i=drange(1:numlabs)
    ai = a + (i - 1)*deltax;
    bi = a + i*deltax;
    dx = deltax/n;
    x = ai+dx/2:dx:bi-dx/2; % mid-points of n increments
    intDrange1 = sum(cos(x)*dx);
  end
  intDrange1 = gplus(intDrange1, 1); % send global sum to lab 1
end % spmd
intDrange1 = intDrange1{1}; % send integral to client
toc
matlabpool close

Integration Example – drange

Similar to parfor but used within spmd.

matlabpool open 4
tic
m = 10000; a = 0; b = pi/2;
spmd
  dx = (b - a)/m; % increment length per worker
  intDrange2 = 0;
  for i=drange(1:m)
    intDrange2 = intDrange2 + cos(a+(i-0.5)*dx)*dx;
  end
  intDrange2 = gplus(intDrange2, 1); % send global sum to lab 1
end % spmd
intDrange2 = intDrange2{1}; % send integral to client
toc
matlabpool close


Integration Example Benchmarks

• Timings (seconds) obtained on a quad-core Xeon X5570
• Computation is linearly proportional to the number of increments.
• serial and serialv are times by loop and vector, respectively
• parfor1 and drange1 distribute work over workers; local work is performed in vector form.
• parfor2 and drange2 distribute m over workers in chunks.
• FORTRAN and C timings are an order of magnitude faster.

# increments   serial   serialv   spmd   parfor1   drange1   parfor2   drange2
10000          0.01     0.0003    0.16   0.11      0.16      0.12      0.14
20000          0.03     0.0006    0.15   0.10      0.16      0.11      0.14
40000          0.06     0.0009    0.15   0.11      0.16      0.11      0.15
80000          0.12     0.0018    0.15   0.10      0.16      0.11      0.17
160000         0.23     0.0030    0.15   0.10      0.16      0.11      0.21


Array Distributions

• The purpose is to distribute data (e.g., an array) among workers to reduce the memory usage and workload on each worker in return for a smaller wall clock time.

• For some parallel applications, creating a distributed array is often the only thing you need to do to make your application run in parallel (e.g., due to function overloading).

• Some operations distribute data automatically while others require manual distribution.


How To Distribute Arrays

Utilities to distribute arrays:
• distributed – distribute data from client; convenient but with restrictions
• codistributed – used in spmd to distribute data on backend
• Composite – distribute data on backend; access on client

Methods to distribute data:
• Partitioning a larger array.
• Building from smaller arrays.
• Creating with a MATLAB constructor function (rand, zeros, . . .).
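The first method, partitioning a larger array, can be pictured with a serial sketch that splits the rows of a matrix into contiguous blocks, one per worker. This is a hand-written emulation under assumed sizes (n = 10 rows, p = 4 workers); the names counts and blocks are illustrative, and the PCT may choose its own partition.

```matlab
% Split the rows of an n-by-n matrix into contiguous blocks, one per worker.
n = 10; p = 4;                        % assumed sizes for illustration
A = magic(n);
counts = floor(n/p) * ones(1, p);     % base number of rows per worker
r = mod(n, p);                        % rows left over after an even split
counts(1:r) = counts(1:r) + 1;        % give one extra row to the first r workers
blocks = mat2cell(A, counts, n);      % blocks{k} stands in for worker k's rows
disp(counts)
```

Stacking the blocks back together with vertcat(blocks{:}) recovers A, which is the sense in which a distributed array is still one array.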


Data Parallel Example – Matrix Multiply

>> matlabpool open 4
>> n = 3000; A = rand(n); B = rand(n);
>> C = A * B; % run with 4 threads
>> maxNumCompThreads(1); % set threads to 1
>> C1 = A * B; % run on single thread
>> a = distributed(A); % distributes A, B from client
>> b = distributed(B); % a, b on workers; accessible from client
>> c = a * b; % run on workers; c is distributed
>> matlabpool close

Wall clock time, in seconds, for the above operations:

Operation                                  Time (s)
C1 = A * B (1 thread)                      12.06
C = A * B (4 threads)                      3.25
a = distributed(A), b = distributed(B)     2.15
c = a * b (4 workers)                      3.89

• The cost of distributing the matrices is recorded separately, as it is incurred only once over the life of the distributed matrices.

• Time required to distribute matrices is not significantly affected by matrix size.
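Wall clock times like these can be collected with tic and toc. The sketch below times only the plain multiply, uses a much smaller matrix size than the slide so that it runs quickly, and introduces the names t0 and elapsed for illustration; it does not reproduce the numbers reported above.

```matlab
n = 500;                        % assumed small size so the sketch runs quickly
A = rand(n); B = rand(n);
t0 = tic;                       % start a timer
C = A * B;                      % operation being timed
elapsed = toc(t0);              % wall clock seconds since t0
fprintf('C = A*B took %.4f s for n = %d\n', elapsed, n);
```

Timing the distribution step and the multiply in separate tic/toc pairs is what allows the two costs to be reported separately.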


Additional Ways to Distribute Matrices

There are alternative ways to distribute matrices.

>> matlabpool open 4
>> n = 3000; A = rand(n); B = rand(n);
>> spmd
     p = rand(n, codistributor1d(1)); % 2 ways to directly create
     q = codistributed.rand(n); % distributed random array
     s = p * q; % run on workers; s is distributed
     % distribute matrix after it is created
     u = codistributed(A, codistributor1d(1)); % by row
     v = codistributed(B, codistributor1d(2)); % by column
     w = u * v; % run on workers; w is distributed
   end
>> matlabpool close

Preferred Way to Distribute Matrices?

For matrix-matrix multiply, there are 4 combinations of how to distribute the 2 matrices (by row or by column). While all 4 ways lead to the correct solution, some perform better than others.

n = 3000; A = rand(n); B = rand(n);
spmd
  ar = codistributed(A, codistributor1d(1)); % distributed by row
  ac = codistributed(A, codistributor1d(2)); % distributed by column
  br = codistributed(B, codistributor1d(1)); % distributed by row
  bc = codistributed(B, codistributor1d(2)); % distributed by column
  crr = ar * br; crc = ar * bc;
  ccr = ac * br; ccc = ac * bc;
end

Wall clock times of the four ways to distribute A and B:

C (row x row)   C (row x col)   C (col x row)   C (col x col)
2.44            2.22            3.95            3.67

The above is true for on-node or across-nodes runs. Across-nodes array distribution and runs are more likely to be slower than on-node ones due to slower communications. Specifically for MATLAB, Katana uses 100-Mbit Ethernet for across-nodes communications instead of Gigabit InfiniBand.


Linear algebraic system Example: Ax = b

matlabpool open 4
% serial operations
  n = 3000; M = rand(n); x = ones(n,1);
  [A, b] = linearSystem(M, x);
  u = A\b; % solves Au = b; u should equal x
  clear A b
% parallel operations in spmd
  spmd
    m = codistributed(M, codistributor('1d',2)); % by column
    y = codistributed(x, codistributor('1d',1)); % by row
    [A, b] = linearSystem(m, y);
    v = A\b;
  end
  clear A b m y
% parallel operations from client
  m = distributed(M); y = distributed(x);
  [A, b] = linearSystem(m, y);
  W = A\b;
matlabpool close

function [A, b] = linearSystem(M, x)
% Returns A and b of linear system Ax = b
  A = M + M'; % A is real and symmetric
  b = A * x; % b is the RHS of linear system

Task Parallel vs. Data Parallel

matlabpool open 4
  n = 3000; M = rand(n); x = ones(n,1);
% Solves 4 cases of Ax=b sequentially, each with 4 workers
  for i=1:4
    m = distributed(M); y = distributed(x*i);
    [A, b] = linearSystem(m, y); % computes with 4 workers
    u = A\b; % solves each case with 4 workers
  end
  clear A b m y
% solves 4 cases of Ax=b concurrently (with parfor)
  parfor i=1:4
    [A, b] = linearSystem(M, x*i); % computes on 1 worker
    v = A\b; % solves with 1 worker
  end
% solves 4 cases of Ax=b concurrently (with drange)
  spmd
    for i=drange(1:4)
      [A, b] = linearSystem(M, x*i); % computes on 1 worker
      w = A\b; % 1 worker
    end
  end
matlabpool close

function [A, b] = linearSystem(M, x)
% Returns A and b of linear system Ax = b
  A = M + M'; % A is real and symmetric
  b = A * x; % b is the RHS of linear system


7. How Do I Parallelize My Code?

1. Profile serial code with profile.
2. Identify the section of code, or function within the code, using the most CPU time.
3. Look for ways to improve the performance of that code section.
4. See if the section is parallelizable and worth parallelizing.
5. If warranted, research and choose a suitable parallel algorithm and parallel paradigm for the code section.
6. Parallelize the code section with the chosen parallel algorithm.
7. To improve performance further, work on the next most CPU-time-intensive section. Repeat steps 2 to 7.
8. Analyze the parallel efficiency to find the sweet spot, i.e., given the implementation and the platforms on which the code is intended to run, the minimum number of workers that gives the fastest turnaround (see the Amdahl's Law page).
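Step 1 can be sketched with the built-in profiler. The loop below is an invented stand-in for a user's serial code and the variable names are illustrative; only the profile commands themselves are the point.

```matlab
profile on                       % start collecting timing data
x = linspace(0, pi/2, 1e5);
s = 0;
for k = 1:numel(x)               % deliberately slow loop for the profiler to measure
    s = s + cos(x(k));
end
profile off                      % stop collecting
stats = profile('info');         % structure holding the collected statistics
% profile viewer                 % uncomment in an interactive session to see the report
```

The report ranks functions and lines by time spent, which is the information steps 2 and 7 rely on.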

8. How Well Does the PCT Scale?

• Task-parallel applications generally scale linearly.

• The parallel efficiency of data-parallel applications depends on the individual code and the algorithm used.

• My personal experience is that the runtime of a reasonably well-tuned MATLAB program running on a single processor or on multiple processors is at least an order of magnitude longer than that of an equivalent C/FORTRAN code.

• Additionally, the PCT's communication is based on Ethernet. MPI on Katana uses InfiniBand and is faster than Ethernet. This further disadvantages the PCT if your code is communication-bound.

Speedup Ratio and Parallel Efficiency

S is the ratio of T1 over TN, the elapsed times with 1 and N workers. f is the fraction of T1 due to code sections that are not parallelizable.

\[ S = \frac{T_1}{T_N} = \frac{T_1}{\left(f + \frac{1-f}{N}\right)T_1} \to \frac{1}{f} \quad \text{as } N \to \infty \]

Amdahl's Law above states that a code whose parallelizable component comprises 90% of the total computation time can at best achieve a 10X speedup with lots of workers. A code that is 50% parallelizable speeds up at best two-fold with lots of workers.

The parallel efficiency is E = S / N. A program that scales linearly (S = N) has parallel efficiency 1. A task-parallel program is usually more efficient than a data-parallel program. Parallel codes can sometimes achieve super-linear behavior due to efficient cache usage per worker.
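The speedup and efficiency definitions can be evaluated directly for a range of worker counts. This sketch takes f = 0.1 (the 90% parallelizable case) and a list of worker counts N chosen only for illustration.

```matlab
f = 0.1;                         % non-parallelizable fraction (90% parallelizable)
N = [1 2 4 8 16 32];             % worker counts to evaluate (illustrative)
S = 1 ./ (f + (1 - f)./N);       % speedup ratio from Amdahl's Law
E = S ./ N;                      % parallel efficiency
disp([N; S; E]')                 % columns: N, S, E
fprintf('upper bound on speedup: %g\n', 1/f);
```

The speedup rises toward, but never reaches, 1/f, while the efficiency falls as workers are added; this trade-off is what the search for a sweet spot weighs.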


Example of Speedup Ratio & Parallel Efficiency

Batch Processing On Katana

MATLAB provides various ways to submit and run jobs in the background or in batch. SCV recommends a simple and effective method for all PCT batch processing on Katana.

Cut-and-paste the lines below into a file named, say, mbatch:

#!/bin/csh -f
# MATLAB script for running serial or parallel background jobs
# For MATLAB applications that contain MATLAB Parallel Computing Toolbox
# parallel operations request (matlabpool or dfeval), a batch job
# will be queued and run in batch (using the SGE configuration)
# Script name: mbatch (you can change it)
# Usage: katana% mbatch <m-file> <output-file>
# <m-file>: name of m-file to be executed, DO NOT include .m ($1)
# <output-file>: output file name; may include path ($2)
nohup matlab -nodisplay -r "$1, exit" >! $2 &

Give mbatch the execute attribute:

katana% chmod +x mbatch

Run it:

katana% mbatch my_mfile myOutput

Do not start MATLAB with -nojvm. PCT requires Java.

Use 'local' Config. on Katana

You can use the Katana Cluster as if it were 'local' with 4 workers for interactive use:

1. Request 4 processors on the same node (you need x-win32 on your client computer). This requests a node with 4 processors.
     Katana% qsh -pe omp 4
2. In the new X-window, run matlab
     Katana% matlab
3. Request workers in the MATLAB window. Only a PCT license is needed; no worker licenses are needed.
     >> matlabpool open local 4
   or
     >> pmode start local 4

Generally, 'SGE' is the default configuration. If 'local' is not specified, matlabpool would request 4 workers using the 'SGE' configuration, and hence the processors allocated with the qsh interactive batch shell would not be used.

Communications Among Workers, Client

• Communications between workers and MATLAB client
    pmode: lab2client, client2lab
    matlabpool: A = a{1} % from lab 1 to client

• Collective communications among workers
    gather, gop, gplus, gcat

MPI Point-to-Point Communications

• MPI point-to-point communication among workers: labSend and labReceive

% Example: each lab sends its lab # to lab 1 and sum data on lab 1
matlabpool open 4
spmd % MPI requires spmd
  a = labindex; % define a on workers
  if labindex == 1 % on worker 1 . . .
    % lab 1 is designated to accumulate sum on a
    s = a; % sum s starts with a on worker 1
    for k = 2:numlabs % loop over remaining workers
      s = s + labReceive(k); % receive a from worker k, then add to s
    end
  else % for all other workers . . .
    labSend(a,1); % send a on workers 2 to 4 to worker 1
  end
end % spmd
indexSum = s{1}; % copy s on lab 1 to client
matlabpool close

It is illegal for a lab to communicate with itself.


Useful SCV Info

Please help us do better in the future by participating in a quick survey: http://scv.bu.edu/survey/tutorial_evaluation.html

• SCV home page (www.bu.edu/tech/research)
• Resource Applications (www.bu.edu/tech/accounts/special/research/accounts)
• Help
  – System
    • [email protected]
  – Web-based tutorials (www.bu.edu/tech/research/training/tutorials) (MPI, OpenMP, MATLAB, IDL, Graphics tools)
  – HPC consultations by appointment
    • Kadin Tseng ([email protected])
    • Doug Sondak ([email protected])
