Adaptive-grain parallel algorithms:
some examples
MOAIS Project (www-id.imag.fr/MOAIS), Laboratoire ID-IMAG (CNRS-INRIA INPG-UJF)
MOAIS
Multi-programmation et Ordonnancement pour les Applications Interactives de Simulation
Programming and Scheduling Design of Interactive Simulation Applications on Distributed Resources
ID Research activities
• Adaptive middleware
 – Resource management and scheduling systems
 – Resource brokering based on prediction tools
 – 2nd-generation Open Grid Service Architecture with a P2P approach
 – Operational aspects of P2P systems
 – Deployability of P2P services (memberships)
• Network operating systems
 – Open-source Grid-aware OS (extensions of Linux)
• Distributed algorithms
 – Dependable and adaptive
• Programming models & languages
 – High-performance component models
 – Lightweight component platforms
 – QoS-aware self-organizing component platforms with dynamic reconfiguration
 – Automatic exploitation of coarse-grain algorithms
• Communication models
 – Generic framework
• Computational models
• Novel algorithms & applications
 – Need for grid-aware algorithms and applications
• Large-scale data management
 – Shared objects, persistence, coherency
• Security / Accountability
 – P2P services in an unfriendly world
• Tools
 – Performance analysis & prediction
• Application testbeds
 – Bioinformatics
 – Engineering applications
ID Research activities
• Two INRIA projects:
 – MOAIS [contact: [email protected]]
  • Programming & scheduling
  • Adaptive and interactive applications
  • 2 full-time researchers [INRIA], 4 [assistant| ] professors [3 INPG, 1 UJF], 14 PhD students
 – MESCAL [contact: [email protected]]
  • Dynamic resource management
  • Performance evaluation and dimensioning
  • 2 full-time researchers [INRIA, CNRS], 6 [assistant| ] professors [2 INPG, 4 UJF], 13 PhD students
Objective: programming of applications where performance is a matter of resources:
take advantage of more resources, and adapt to fewer
 – e.g. a global computing platform (P2P)
Application code: "independent" from the resources, and adaptive
Target applications: interactive simulation – a "virtual observatory"
MOAIS
interaction
simulation
rendering
Performance is related to #resources:
 – simulation: precision = size/order → #procs, memory space
 – rendering: image wall, sounds, ... → #video-projectors, #speakers, ...
 – interaction: acquisition peripherals → #cameras, #haptic sensors, ...
GRIMAGE platform
• 2003 : 11 PCs, 8 projectors and 4 cameras
First demo : 12/03
• 2005: 30 PCs, 16 projectors and 20 cameras
 – A display wall:
  • Surface: 2 x 2.70 m
  • Resolution: 4106 x 3172 pixels
  • Very bright: usable in daylight
Commodity components
[B. Raffin]
Video [J Allard, C Menier]
Description of potential parallelism + synchronizations
Application [macro dataflow graph], "Athapascan-like"
MOAIS abstraction technology: "almost non"-preemptive scheduling — MOAIS adapts parallelism by scheduling
Middleware: process creation, communications, discovery / resilience of resources (dynamic architecture)
Local preemptive scheduling / time sharing; primitives for synchronization
How to adapt the application?
• By minimizing communications
 • e.g. amortizing synchronizations in the simulation [Beaumont, Daoudi, Maillard, Manneback, Roch - PMAA 2004] → adaptive granularity
• By controlling latency (interactivity constraints):
 • FlowVR [Allard, Menier, Raffin] → overhead
• By managing node failures and resilience [checkpoint/restart] [checkers]
 • FlowCert [Jafar, Krings, Leprevost; Roch, Varrette]
• By adapting granularity
 • malleable tasks [Trystram, Mounié]
 • dataflow cactus-stack: Athapascan/Kaapi [Gautier]
 • recursive parallelism by "work-stealing" [Blumofe-Leiserson 98, Cilk, Athapascan, ...]
 • self-adaptive grain algorithms: dynamic extraction of parallelism [Daoudi, Gautier, Revire, Roch - J. TSI 2004]
Adaptive-grain parallel algorithms: some examples
• Scheduling of fine-grain parallel programs: work-stealing and efficiency
• Adaptive-grain algorithms: principle of a dynamic "cascade"
• Examples
In "practice": coarse granularity — splitting into p = #resources. Drawback: heterogeneous, dynamic architectures.
In "theory": fine granularity — maximal parallelism. Drawback: overhead of task management.
How to choose/adapt the granularity?
[Example dataflow graph on inputs a and b, with tasks F(2,a), G(a,b), H(a), H(b), O(b,7): a high potential degree of parallelism]
Parallelism and efficiency
• Efficient scheduling policy (close to optimal): difficult in general (coarse grain), but easy if T∞ is small (fine grain):
 Tp ≤ T1/p + T∞ [greedy scheduling, Graham 69]
• Control (realization) of the policy: expensive in general (fine grain), but small overhead with coarse grain.
Problem: how to adapt the potential parallelism to the resources?
• "Depth": parallel time on unbounded resources, T∞ = #operations on a critical path
• "Work": sequential time, T1 = #operations
=> goal: keep T∞ small, with coarse-grain control
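The greedy bound quoted above [Graham 69] follows from a short counting argument, sketched here for completeness:

```latex
% Classify the steps of a greedy schedule on p processors:
%  - "full" steps: all p processors are busy. Each consumes p of the
%    T_1 operations, so there are at most T_1/p such steps.
%  - "idle" steps: some processor is free, hence every ready task is
%    running; in particular the next task on the current critical path
%    executes, so each idle step shortens the remaining critical path.
%    There are therefore at most T_\infty idle steps.
% Summing both kinds of steps gives the bound:
T_p \;\le\; \frac{T_1}{p} + T_\infty
```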
Work-stealing scheduling of a parallel recursive fine-grain algorithm
• Work-stealing scheduling
 • an idle processor steals the oldest ready task
 • Interests:
  => #successful steals < p.T∞ [Blumofe 98, Narlikar 01, ...]
  => suited to heterogeneous architectures [Bender-Rabin 03, ...]
• Hypotheses for efficient parallel execution:
 • the parallel algorithm is "work-optimal"
 • T∞ is very small (recursive parallelism)
 • a "sequential" execution of the parallel algorithm is valid
  • e.g.: search trees, Branch & Bound, ...
• Implementation: work-first principle [Multilisp, Cilk, ...]
 • overhead of task creation only upon a steal request: sequential degeneration of the parallel algorithm
 • cactus-stack management
• Advantage: "statically" fine grain, but dynamic control
• Drawback: possible overhead of the parallel algorithm [e.g. prefix computation]
Implementation of work-stealing
Hypothesis: a sequential schedule is valid.
[Diagram: processor P runs f1() { ...; fork f2; ... } on its stack; an idle processor P' steals f2, the oldest ready task of P]
+ non-preemptive execution of ready tasks
Experimentation: knary benchmark

SMP architecture: Origin 3800 (32 procs), Cilk / Athapascan
Distributed architecture: iCluster, Athapascan

#procs  Speed-up
8       7.83
16      15.6
32      30.9
64      59.2
100     90.1

Ts = 2397 s, T1 = 2435 s
How to obtain an efficient fine-grain algorithm?
• Hypotheses for efficiency of work-stealing:
 • the parallel algorithm is "work-optimal"
 • T∞ is very small (recursive parallelism)
• Problem: fine-grain (T∞ small) parallel algorithms may involve a large overhead with respect to an efficient sequential algorithm:
 • overhead due to parallelism creation and synchronization
 • but also arithmetic overhead

Example: prefix computation
• Sequential algorithm: P[0] = a[0]; for (i = 1; i <= n; i++) P[i] = P[i-1] * a[i];  T1 = n
• Parallel algorithm: T∞ = 2 log n BUT T1 = 2n
Indeed, parallelism often has a cost...
[Diagram: parallel prefix on a0, a1, ..., an — pairwise products feed a recursive Prefix(n/2) whose results give P1, P3, ...; one extra product per even position yields P0, P2, ..., Pn]
Adaptive-grain parallel algorithms: some examples
• Scheduling of fine-grain parallel programs: work-stealing and efficiency
• Adaptive-grain algorithms: principle of a dynamic "cascade"
• Examples
Self-adaptive grain algorithm
• Principle: avoid parallelism overhead by privileging a sequential algorithm:
 => use the parallel algorithm only if a processor becomes idle, by extracting parallelism from a sequential computation
• Hypothesis: two algorithms:
 – 1 sequential: SeqCompute
 – 1 parallel: LastPartComputation => at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
[Diagram: SeqCompute runs; Extract_par spawns LastPartComputation, which itself couples a new SeqCompute with a new Extract_par]
Generic self-adaptive grain algorithm

Illustration: f(i), i = 1..100
• SeqComp(w) runs on CPU A and processes f(1), f(2), ..., shrinking the remaining work to W = 2..100, then W = 3..100, ...
• When CPU B becomes idle, it executes LastPart(w): the remaining work W = 3..100 is split; CPU A keeps W = 3..51 and CPU B starts the same coupling, SeqComp(w') + LastPart(w'), on W' = 52..100.
• CPU B in turn processes f(52), ..., with its own remainder W' = 53..100 available for further extraction.
Cascading a parallel and a sequential algorithm
• In general, two different algorithms may be used:
 • sequential recursive algorithm: Ts operations; algo(n) = { ...; algo(n-1); ... }
 • parallel algorithm: T∞ small but T1 >> Ts
• Work-preserving speed-up [Bini-Pan 94] = cascading technique [Jaja 92]:
 careful interplay of both algorithms to build one with both T∞ small and T1 = O(Ts)
 • But fine grain: divide the sequential algorithm into blocks; each block is computed with the (non-optimal) parallel algorithm.
• Adaptive grain: the dual approach: parallelism is extracted from any sequential task.
E.g. triangular system solving A.x = b (A lower triangular)
• Sequential algorithm: T1 = n²/2; T∞ = n (fine grain):
 1/ x1 = -b1 / a11
 2/ for k = 2..n: bk = bk - ak1.x1
 => the system of dimension n is reduced to a system of dimension n-1
E.g. triangular system solving A.x = b
• Sequential algorithm: T1 = n²/2; T∞ = n (fine grain)
• Using parallel matrix inversion: T1 = n³; T∞ = log² n (fine grain):

  A = [ A11   0  ]      A⁻¹ = [ A11⁻¹    0   ]   with S = -A22⁻¹.A21.A11⁻¹
      [ A21  A22 ]            [   S    A22⁻¹ ]   and x = A⁻¹.b

• Self-adaptive granularity algorithm: T1 = n²; T∞ = n.log n
 [Diagram: ExtractPar couples a self-adaptive sequential algorithm with self-adaptive matrix inversion and a self-adaptive scalar product; choice of h = m]
Adaptive-grain parallel algorithms: some examples
• Scheduling of fine-grain parallel programs: work-stealing and efficiency
• Adaptive-grain algorithms: principle of a dynamic "cascade"
• Examples
 • Iterated product, prefix
 • gzip compression
 • Triangular system inversion
 • 3D vision / oct-tree computation
Iterated product: sequential, parallel, adaptive [Davide Vernizzi]
• Sequential:
 • Input: array of n values
 • Output: res = Σ_{i=1..n} f(x_i)
 • C/C++ code:
   for (i = 0; i < n; i++)
     res += atoi(x[i]);
• Parallel algorithm:
 • recursive computation by blocks (binary tree with merge)
 • block size = pagesize
 • Kaapi code: Athapascan API
Experimentation: parallel <=> adaptive

Variant: sum of pages
• Input: a set of n pages; each page is an array of values
• Output: one page where each element is the sum of the elements with the same index across all pages:
 res[j] = Σ_{i=0..n-1} f(page_i, j)
• C/C++ code:
  for (i = 0; i < n; i++)
    for (j = 0; j < pageSize; j++)
      res[j] += f(pages[i][j]);

Experimentation:
 – the parallel algorithm costs about twice as much as the sequential algorithm
 – the adaptive algorithm has an efficiency close to 1
Demonstration on ensibull

Script: [vernizzd@ensibull demo]$ more go-tout.sh
#!/bin/sh
./spg /tmp/data &
./ppg /tmp/data 1 --a1 -thread.poolsize 3 &
./apg /tmp/data 1 --a1 -thread.poolsize 3 &

Result: [vernizzd@ensibull demo]$ ./go-tout.sh
Page size: 4096
ADAPTIVE (3 procs):   res = -2.048e+07, time = 0.408 s, threads created: 54
PARALLEL (3 procs):   res = -2.048e+07, time = 0.964 s, #fork = 7497
SEQUENTIAL (1 proc):  res = -2.048e+07, time = 1.152 s
Where does the difference come from? ... The program sources.

Source code for the page sum:
 – parallel / binary tree
 – adaptive, by coupling:
  – sequential + Fork<LastPartComp>
  – LastPartComp: (recursive) generation of 3 tasks

struct Iterated {
  void operator() (a1::Shared_w<Page> res, int start, int stop) {
    if ((stop - start) < 2) {
      // If max num of pages is reached, sequential algorithm
      Page resLocal(pageSize);
      IteratedSeq(start, resLocal);
      res.write(resLocal);
    } else {
      // If max num of pages is not reached
      int half = (start + stop) / 2;
      a1::Shared<Page> res1;                    // First thread result
      a1::Shared<Page> res2;                    // Second thread result
      a1::Fork<Iterated>() (res1, start, half); // First thread
      a1::Fork<Iterated>() (res2, half, stop);  // Second thread
      a1::Fork<Merge>() (res, res1, res2);      // Merging results
    }
  }
};

Parallel algorithm
Adaptive parallelization
• Computation by blocks on inputs split into k blocks:
 • 1 block = pagesize
 • independent execution of the k tasks
 • merging of the results
Adaptive algorithm (1/3)
• Hypothesis: non-preemptive, work-stealing-style scheduling
• Adaptive sequential coupling:

void Adaptative (a1::Shared_w<Page> *resLocal, DescWork dw) {
  a1::Shared<Page> resLPC;
  a1::Fork<LPC>() (resLPC, dw);
  Page resSeq(pageSize);
  AdaptSeq(dw, &resSeq);
  a1::Fork<Merge>() (resLPC, *resLocal, resSeq);
}
Adaptive algorithm (2/3)
• Sequential side:

void AdaptSeq (DescWork dw, Page *resSeq) {
  DescLocalWork w;
  Page resLoc(pageSize);
  double k;
  while (!dw.desc->extractSeq(&w)) {
    for (int i = 0; i < pageSize; i++) {
      k = resLoc.get(i) + (double) buff[w*pageSize + i];
      resLoc.put(i, k);
    }
  }
  *resSeq = resLoc;
}
Adaptive algorithm (3/3)
• Extraction side = parallel algorithm:

struct LPC {
  void operator() (a1::Shared_w<Page> resLPC, DescWork dw) {
    DescWork dw2;
    dw2.Allocate();
    dw2.desc->l.initialize();
    if (dw.desc->extractPar(&dw2)) {
      a1::Shared<Page> res2;
      a1::Fork<AdaptativeMain>() (res2, dw2.desc->i, dw2.desc->j);
      a1::Shared<Page> resLPCold;
      a1::Fork<LPC>() (resLPCold, dw);
      a1::Fork<MergeLPC>() (resLPCold, res2, resLPC);
    }
  }
};
Adaptive parallelization
• A single computation task is started for all the inputs
• The remaining work is divided only when a processor becomes idle
• Fewer tasks, fewer merges
Example 2: parallelization of gzip
• gzip:
 • widely used (web) and costly, although of linear complexity
 • source code: 10000 lines of C, complex data structures
 • principle: LZ77 + Huffman tree
• Why gzip?
 • a P-complete problem, but practical parallelization is possible
 • drawback: every (known) parallelization incurs an overhead
  -> loss of compression ratio
How to parallelize gzip?
• Sequential algorithm: on-the-fly compression, input file -> compressed file
• Static partition into blocks => parallel compression => compressed blocks
 • "Easy" parallelization, 100% compatible with gzip/gunzip
 • Problems: loss of compression ratio, grain depends on the machine, overhead
Adaptive-grain gzip parallelization
• On-the-fly compression (SeqComp) coupled with LastPartComputation
• Dynamic partition into blocks: input file -> parallel compression -> output compressed blocks -> output compressed file
Overhead in compressed-file size:

File size   gzip       Adaptive (2 procs)  Adaptive (8 procs)  Adaptive (16 procs)
0.86 MB     272573     275692              280660              280660
5.2 MB      1.023 MB   1.027 MB            1.05 MB             1.08 MB
9.4 MB      6.60 MB    6.62 MB             6.73 MB             6.79 MB
10 MB       1.12 MB    1.13 MB             1.14 MB             1.17 MB

Gain in time:

File size   2 procs   8 procs   16 procs
5.2 MB      3.35 s    0.96 s    0.55 s
9.4 MB      7.67 s    6.73 s    6.79 s
10 MB       6.79 s    1.71 s    0.88 s
Performances
[Figure: compression time (0-90 s) vs file size (about 1100 to 21900 KB) on a 4-processor machine, Pentium 4x200 MHz: sequential gzip vs Athapascan gzip]
Conclusion

• Adaptive grain: dynamic recursive cascade of 2 algorithms, 1 sequential and 1 parallel; parallelism is generated only when a resource is idle
 -> basic operator: ExtractPar on the sequential work in progress
• Generic programming, ... and simple!??
• Benefit: reduction of the overhead related to parallelism:
 – task creation, scheduling
 – intrinsic arithmetic overhead
 – practical gain: PL code for probabilistic inference [Mazer, SHARP]
• Perspectives:
 – SMP and distributed experiments: gzip, prefixes, ...
 – extension to the distributed and heterogeneous case: addition/resilience of resources
 – extensions to other algorithms [IMAG-INRIA AHA action]: 3D vision, computer algebra, ...
APACHE/MOAIS + EVASION [J Allard, C Menier, R Revire, F Zara] Video
Questions ?
Performance: NUGENT 22 (threshold = 8, #tasks = 209406)
[Figure: execution time (0-2000 s) vs #processors (20, 40, 60, 120), comparing: without checkpoints, SEL, CIC (period = 1 s), CIC (period = 20 s)]
[Figure: Tp - Ts/p (0-100) vs #processors = p, same four configurations]
[Figure: p*Tp - Ts (0-9000) vs #processors = p, same four configurations]
Performance on SMP
[Figure: search time in large local DNA files — duration (0-12 s) vs file size (31924544 and 63849088 bytes), sequential vs parallel (12 threads), Pentium 4x200 MHz]
Performance in the distributed case
[Figure: search time across two non-local disks — duration (0-1200 s) vs directory size (674021 and 1228427 bytes), sequential vs 1 node/4 threads vs 2 nodes/4 threads]
• Sequential: Pentium 4x200 MHz
• SMP: Pentium 4x200 MHz
• Distributed architecture: Myrinet, Pentium 4x200 MHz + 2x333 MHz
Distributed search in 2 directories of the same size, each on a remote disk (NFS)