

PARUS: a parallel programming framework for heterogeneous multiprocessor systems
Alexey N. Salnikov (salnikov@cs.msu.su)
Moscow State University

Faculty of Computational Mathematics and Cybernetics

PARUS is intended for automatically building a parallel program from a data-dependency graph and fragments of C source code. The algorithm is represented as a directed graph whose vertices are marked with series of instructions and whose edges correspond to data dependencies. Vertices are assigned to MPI processes according to the performance of both the processors and the communication system.

Brief description

Steps to a parallel program:

• source fragments in the C programming language
• data dependency graph
• graph to C++ with MPI code converter (graph2c++)
• C++ source code
• compiling, linking with libparus, and execution

With PARUS, the programmer declares code fragments in C and then declares the data dependencies between the variables and array fragments used in them. Information about vertices and data dependencies (edges) is stored in a text file. The graph2c++ utility generates C++ code with MPI calls from the data dependency graph. This code is compiled with the mpiCC compiler and linked with libparus; the result is a binary executable for a multiprocessor system. The MPI processes execute the vertices of the data dependency graph according to a schedule. PARUS supports two types of scheduling: static (built by graph2sch) and dynamic (built at runtime by libparus). Both types of schedule are generated taking into account the performance of both the processors and the communication system.
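As an illustration of this schedule-driven execution model, here is a minimal sketch in C with MPI. It is not the code actually emitted by graph2c++ or implemented in libparus; the schedule array, vertex count, and execute_vertex function are hypothetical placeholders.

/* Hypothetical sketch: each MPI rank walks a static schedule and
 * executes only the graph vertices assigned to it. */
#include <mpi.h>
#include <stdio.h>

#define NUM_VERTICES 3

/* Hypothetical static schedule: vertex v is executed by rank schedule[v]. */
static const int schedule[NUM_VERTICES] = {0, 1, 0};

static void execute_vertex(int v)
{
    /* In PARUS the vertex body is a user-supplied C fragment;
     * here we only report which vertex is executed. */
    printf("vertex %d executed\n", v);
}

int main(int argc, char **argv)
{
    int rank, v;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (v = 0; v < NUM_VERTICES; v++) {
        if (schedule[v] == rank)
            execute_vertex(v);
        /* A real run would also transfer data along graph edges between
         * the producing and consuming ranks (e.g. MPI_Send/MPI_Recv);
         * that part is omitted in this sketch. */
    }

    MPI_Finalize();
    return 0;
}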

System architecture

Kernel of the system: libparus

Utilities:
• ntr_viewer: test results visualizer
• graph_editor: GUI for graph editing
• network_test_noise: noise level meter (in communications)
• network_test: communications performance tester
• proc_test: processors performance tester
• graph_touch: graph verifier
• graph2c++: graph to C++ with MPI calls converter
• graph2sch: schedule generator
• parser: C source code data dependency analyzer

Results

We successfully ran PARUS on the following multiprocessor machines: MVS-1000m, MVS-15000 (http://www.jscc.ru), IBM eServer pSeries 690 Regatta, and Fujitsu Sun PRIMEPOWER 850. Several popular scientific tasks have been implemented with PARUS and showed competitive performance.

(Figure: speedup of an associative operation applied to all elements of an array on MVS-1000m; number of processors on the x-axis, speedup on the y-axis.)

(Figure: speedup of a three-layer perceptron on 16 processors of Regatta as a function of the number of perceptron inputs, from 500 to 18500; two perceptron layers are projected onto a set of processors.)

SourceForge

PARUS is registered on SourceForge (http://parus.sf.net), which facilitates the developers' work and lets anyone easily download the latest source code and documentation.

Implemented tasks:

• a three-layer perceptron with a maximum of 18500 neurons
• a recursive algorithm that computes the result of an associative operation applied to all elements of an array of 10⁹ elements (a minimal MPI reduction sketch of this kind of task follows below)
• a multiple sequence alignment algorithm (http://monkey.genebee.msu.ru/~salnikov)
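As a plain illustration of the associative-operation task (with addition as the operation), here is a minimal MPI reduction sketch. It is independent of the PARUS graph implementation, and the per-process element count is an arbitrary assumption.

/* Sketch: each rank reduces its local part of the array, then MPI_Reduce
 * combines the partial results on rank 0. */
#include <mpi.h>
#include <stdio.h>

#define LOCAL_N 1000000   /* elements per process (assumption) */

int main(int argc, char **argv)
{
    static double part[LOCAL_N];
    double local = 0.0, total = 0.0;
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < LOCAL_N; i++) part[i] = 1.0;   /* fill local data */
    for (i = 0; i < LOCAL_N; i++) local += part[i]; /* local reduction */

    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("result = %g\n", total);

    MPI_Finalize();
    return 0;
}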

PARUS consists of a kernel and a set of utilities. The kernel controls program execution. The utilities manipulate the data dependency graph and gather information about the behavior of the multiprocessor's components.

Graph structure

(Diagram: source vertices, internal vertices, and drain vertices of a data dependency graph.)

The algorithm is represented as a directed graph. Each edge is directed from the vertex the data are sent from to the vertex that receives the data. Thereby, the program may be represented as a network that has three kinds of vertices (see the list below).

The graph description is stored in a text file with an HTML-like structure.

(Diagram: arrays such as int a[], int b[], and int c[] are transmitted in chunks from a sender vertex to receiver vertices.)

• Arrays to be transmitted between vertices are split into sets of non-overlapping chunks

• The receiving vertices can join the chunks in any order (a minimal MPI sketch of this chunked transfer follows below)

• Any vertex is capable of both receiving and sending simultaneously

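The following is a minimal sketch, in C with MPI, of the chunked transfer idea described above: a sender splits an array into non-overlapping chunks, and a receiver places each chunk where it belongs as it arrives, using the message tag as the chunk index. The two-rank setup, array size, and chunk size are assumptions for illustration; this is not the libparus transfer code.

#include <mpi.h>
#include <stdio.h>

#define N 8
#define CHUNK 2
#define NCHUNKS (N / CHUNK)

int main(int argc, char **argv)
{
    double a[N], b[N];
    int rank, i, c;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                      /* sender vertex */
        for (i = 0; i < N; i++) a[i] = (double)i;
        for (c = 0; c < NCHUNKS; c++)
            MPI_Send(&a[c * CHUNK], CHUNK, MPI_DOUBLE, 1, c, MPI_COMM_WORLD);
    } else if (rank == 1) {               /* receiver vertex */
        for (c = 0; c < NCHUNKS; c++) {
            MPI_Status st;
            /* Probe for any chunk, then place it by its tag (chunk index). */
            MPI_Probe(0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            MPI_Recv(&b[st.MPI_TAG * CHUNK], CHUNK, MPI_DOUBLE, 0,
                     st.MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        printf("rank 1 assembled %d chunks\n", NCHUNKS);
    }

    MPI_Finalize();
    return 0;
}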

Performance data acquisition subsystem

Processor and communication performance data are acquired in order to spread the load evenly across the components of a multiprocessor system while a program is executing. We determine processor performance by the time taken to solve a particular scientific task of fixed dimension. The next step is to analyze the state of the communication system. This is done via several MPI tests that measure the duration of data transfers between MPI processes. All the communication tests within this subsystem build a set of matrices, where each matrix corresponds to a message length. A position in such a matrix corresponds to the delay of a data transfer between a pair of MPI processes. The information gathered by these tests is then used by the kernel and by graph2sch. Some communication tests:
• one_to_one
• async_one_to_one
• send_recv_and_recv_send
• all_to_all
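Below is a minimal sketch, in C with MPI, of a one_to_one style measurement: rank 0 times a fixed-length message exchange with every other rank, producing one row of the delay matrix for that message length. The message length and the halving of the round-trip time are illustrative assumptions; this is not the actual network_test code.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_LEN 1024   /* message length in bytes (one matrix per length) */

int main(int argc, char **argv)
{
    int rank, size, peer;
    char *buf = malloc(MSG_LEN);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (peer = 1; peer < size; peer++) {
        if (rank == 0) {
            double t0 = MPI_Wtime();
            MPI_Send(buf, MSG_LEN, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_LEN, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            /* Half of the round-trip time approximates the one-way delay
             * between rank 0 and rank `peer` for this message length. */
            printf("delay[0][%d] = %g s\n", peer, (MPI_Wtime() - t0) / 2.0);
        } else if (rank == peer) {
            MPI_Recv(buf, MSG_LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}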

Structure of vertices and edges

(Diagram: a vertex consists of head, body, and tail code fragments; edges are described in terms of data packing and data reception for the transmitted data.)

<NODE_BEGIN>
  number 1
  type 1
  weight 100
  layer 2
  num_input_edges 1
  edges ( 1 )
  num_output_edges 1
  edges ( 2 )
  head "head"
  body "node"
  tail ""
<NODE_END>
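The block above is an example vertex description from the graph file; the head, body, and tail fields name files containing C code fragments. Purely as a hypothetical illustration, the file "node" referenced as the body might contain a fragment like the following (the function name, signature, and operation are invented for this example and are not taken from PARUS):

/* Hypothetical vertex body fragment: plain C code operating on data
 * delivered along the vertex's incoming edge. */
void node_body(double *data, int n)
{
    int i;
    for (i = 0; i < n; i++)
        data[i] = data[i] * data[i];   /* process the received array */
}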


Three kinds of vertices:
• source vertices (they usually serve for reading input files),
• internal vertices (where the data are processed),
• drain vertices (where the data are saved to output files and the execution terminates).