Adaptive-grain parallel algorithms:
some examples
MOAIS Project (www-id.imag.fr/MOAIS), Laboratoire ID-IMAG (CNRS-INRIA INPG-UJF)
MOAIS
Multi-programmation et Ordonnancement pour les Applications Interactives de Simulation
Programming and Scheduling Design of Interactive Simulation Applications on Distributed Resources
ID Research activities
• Adaptive middleware
 – Resource management and scheduling systems
 – Resource brokering based on prediction tools
 – 2nd-generation Open Grid Service Architecture with a P2P approach
 – Operational aspects of P2P systems
 – Deployability of P2P services (memberships)
• Network operating systems
 – Open-source Grid-aware OS (extensions of Linux)
• Distributed algorithms
 – Dependable and adaptive
• Programming models & languages
 – High-performance component models
 – Lightweight component platforms
 – QoS-aware self-organizing component platforms with dynamic reconfiguration
 – Automatic exploitation of coarse-grain algorithms
• Communication models
 – Generic framework
• Computational models
• Novel algorithms & applications
 – Need for grid-aware algorithms and applications
• Large-scale data management
 – Shared objects, persistence, coherency
• Security / Accountability
 – P2P services in an unfriendly world
• Tools
 – Performance analysis & prediction
• Application testbeds
 – Bioinformatics
 – Engineering applications
ID Research activities
• Two INRIA projects:
 – MOAIS [contact: [email protected]]
  • Programming & scheduling
  • Adaptive and interactive applications
  • 2 full-time researchers [INRIA], 4 [assistant| ] professors [3 INPG, 1 UJF], 14 PhD students
 – MESCAL [contact: [email protected]]
  • Dynamic resource management
  • Performance evaluation and dimensioning
  • 2 full-time researchers [INRIA, CNRS], 6 [assistant| ] professors [2 INPG, 4 UJF], 13 PhD students
Objective: programming of applications where performance is a matter of resources:
take advantage of more resources, and adapt to fewer
 – e.g. a global computing platform (P2P)
Application code: "independent" from the resources, and adaptive
Target applications: interactive simulation – a "virtual observatory"
MOAIS
interaction
simulation
rendering
Performance is related to #resources:
 – simulation: precision = size/order → #procs, memory space
 – rendering: image wall, sounds, ... → #video-projectors, #speakers, ...
 – interaction: acquisition peripherals → #cameras, #haptic sensors, ...
GRIMAGE platform
• 2003 : 11 PCs, 8 projectors and 4 cameras
First demo : 12/03
• 2005: 30 PCs, 16 projectors and 20 cameras
 – A display wall:
  • Surface: 2 x 2.70 m
  • Resolution: 4106 x 3172 pixels
  • Very bright: usable in daylight
Commodity components
[B. Raffin]
Video [J Allard, C Menier]
Description of potential parallelism + synchronizations
Application [macro dataflow graph], "Athapascan-like"
MOAIS abstraction technology: "almost non"-preemptive scheduling — MOAIS adapts parallelism by scheduling
Middleware: process creation, communications, discovery / resilience of resources (dynamic architecture)
Local preemptive scheduling / time sharing; primitives for synchronization
How to adapt the application?
• By minimizing communications
 • e.g. amortizing synchronizations in the simulation [Beaumont, Daoudi, Maillard, Manneback, Roch - PMAA 2004] → adaptive granularity
• By controlling latency (interactivity constraints):
 • FlowVR [Allard, Menier, Raffin] → overhead
• By managing node failures and resilience [checkpoint/restart] [checkers]
 • FlowCert [Jafar, Krings, Leprevost; Roch, Varrette]
• By adapting granularity
 • malleable tasks [Trystram, Mounié]
 • dataflow cactus-stack: Athapascan/Kaapi [Gautier]
 • recursive parallelism by "work-stealing" [Blumofe-Leiserson 98, Cilk, Athapascan, ...]
 • self-adaptive grain algorithms: dynamic extraction of parallelism [Daoudi, Gautier, Revire, Roch - J. TSI 2004]
Adaptive-grain parallel algorithms: some examples
• Scheduling of fine-grain parallel programs: work-stealing and efficiency
• Adaptive-grain algorithms: principle of a dynamic "cascade"
• Examples
In "practice": coarse granularity — splitting into p = #resources. Drawback: heterogeneous, dynamic architectures.
In "theory": fine granularity — maximal parallelism. Drawback: overhead of task management.
How to choose/adapt the granularity?
[Example dataflow graph on inputs a and b, with tasks F(2,a), G(a,b), H(a), H(b), O(b,7): a high potential degree of parallelism]
Parallelism and efficiency
• Efficient scheduling policy (close to optimal): difficult in general (coarse grain), but easy if T∞ is small (fine grain):
 Tp ≤ T1/p + T∞ [greedy scheduling, Graham 69]
• Control (realization) of the policy: expensive in general (fine grain), but small overhead with coarse grain.
Problem: how to adapt the potential parallelism to the resources?
• "Depth": parallel time on unbounded resources, T∞ = #operations on a critical path
• "Work": sequential time, T1 = #operations
=> goal: keep T∞ small, with coarse-grain control
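The greedy bound quoted above [Graham 69] follows from a short counting argument, sketched here for completeness:

```latex
% Classify the steps of a greedy schedule on p processors:
%  - "full" steps: all p processors are busy. Each consumes p of the
%    T_1 operations, so there are at most T_1/p such steps.
%  - "idle" steps: some processor is free, hence every ready task is
%    running; in particular the next task on the current critical path
%    executes, so each idle step shortens the remaining critical path.
%    There are therefore at most T_\infty idle steps.
% Summing both kinds of steps gives the bound:
T_p \;\le\; \frac{T_1}{p} + T_\infty
```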
Work-stealing scheduling of a parallel recursive fine-grain algorithm
• Work-stealing scheduling
 • an idle processor steals the oldest ready task
 • Interests:
  => #successful steals < p.T∞ [Blumofe 98, Narlikar 01, ...]
  => suited to heterogeneous architectures [Bender-Rabin 03, ...]
• Hypotheses for efficient parallel execution:
 • the parallel algorithm is "work-optimal"
 • T∞ is very small (recursive parallelism)
 • a "sequential" execution of the parallel algorithm is valid
  • e.g.: search trees, Branch & Bound, ...
• Implementation: work-first principle [Multilisp, Cilk, ...]
 • overhead of task creation only upon a steal request: sequential degeneration of the parallel algorithm
 • cactus-stack management
• Advantage: "statically" fine grain, but dynamic control
• Drawback: possible overhead of the parallel algorithm [e.g. prefix computation]
Implementation of work-stealing
Hypothesis: a sequential schedule is valid.
[Diagram: processor P runs f1() { ...; fork f2; ... } on its stack; an idle processor P' steals f2, the oldest ready task of P]
+ non-preemptive execution of ready tasks
Experimentation: knary benchmark

SMP architecture: Origin 3800 (32 procs), Cilk / Athapascan
Distributed architecture: iCluster, Athapascan

#procs  Speed-up
8       7.83
16      15.6
32      30.9
64      59.2
100     90.1

Ts = 2397 s, T1 = 2435 s
How to obtain an efficient fine-grain algorithm?
• Hypotheses for efficiency of work-stealing:
 • the parallel algorithm is "work-optimal"
 • T∞ is very small (recursive parallelism)
• Problem: fine-grain (T∞ small) parallel algorithms may involve a large overhead with respect to an efficient sequential algorithm:
 • overhead due to parallelism creation and synchronization
 • but also arithmetic overhead

Example: prefix computation
• Sequential algorithm: P[0] = a[0]; for (i = 1; i <= n; i++) P[i] = P[i-1] * a[i];  T1 = n
• Parallel algorithm: T∞ = 2 log n BUT T1 = 2n
Indeed, parallelism often has a cost...
[Diagram: parallel prefix on a0, a1, ..., an — pairwise products feed a recursive Prefix(n/2) whose results give P1, P3, ...; one extra product per even position yields P0, P2, ..., Pn]
Adaptive-grain parallel algorithms: some examples
• Scheduling of fine-grain parallel programs: work-stealing and efficiency
• Adaptive-grain algorithms: principle of a dynamic "cascade"
• Examples
Self-adaptive grain algorithm
• Principle: avoid parallelism overhead by privileging a sequential algorithm:
 => use the parallel algorithm only if a processor becomes idle, by extracting parallelism from a sequential computation
• Hypothesis: two algorithms:
 – 1 sequential: SeqCompute
 – 1 parallel: LastPartComputation => at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
[Diagram: SeqCompute runs; Extract_par spawns LastPartComputation, which itself couples a new SeqCompute with a new Extract_par]
Generic self-adaptive grain algorithm

Illustration: f(i), i = 1..100
• SeqComp(w) runs on CPU A and processes f(1), f(2), ..., shrinking the remaining work to W = 2..100, then W = 3..100, ...
• When CPU B becomes idle, it executes LastPart(w): the remaining work W = 3..100 is split; CPU A keeps W = 3..51 and CPU B starts the same coupling, SeqComp(w') + LastPart(w'), on W' = 52..100.
• CPU B in turn processes f(52), ..., with its own remainder W' = 53..100 available for further extraction.
Cascading a parallel and a sequential algorithm
• In general, two different algorithms may be used:
 • sequential recursive algorithm: Ts operations; algo(n) = { ...; algo(n-1); ... }
 • parallel algorithm: T∞ small but T1 >> Ts
• Work-preserving speed-up [Bini-Pan 94] = cascading technique [Jaja 92]:
 careful interplay of both algorithms to build one with both T∞ small and T1 = O(Ts)
 • But fine grain: divide the sequential algorithm into blocks; each block is computed with the (non-optimal) parallel algorithm.
• Adaptive grain: the dual approach: parallelism is extracted from any sequential task.
E.g. triangular system solving A.x = b (A lower triangular)
• Sequential algorithm: T1 = n²/2; T∞ = n (fine grain):
 1/ x1 = -b1 / a11
 2/ for k = 2..n: bk = bk - ak1.x1
 => the system of dimension n is reduced to a system of dimension n-1
E.g. triangular system solving A.x = b
• Sequential algorithm: T1 = n²/2; T∞ = n (fine grain)
• Using parallel matrix inversion: T1 = n³; T∞ = log² n (fine grain):

  A = [ A11   0  ]      A⁻¹ = [ A11⁻¹    0   ]   with S = -A22⁻¹.A21.A11⁻¹
      [ A21  A22 ]            [   S    A22⁻¹ ]   and x = A⁻¹.b

• Self-adaptive granularity algorithm: T1 = n²; T∞ = n.log n
 [Diagram: ExtractPar couples a self-adaptive sequential algorithm with self-adaptive matrix inversion and a self-adaptive scalar product; choice of h = m]
Adaptive-grain parallel algorithms: some examples
• Scheduling of fine-grain parallel programs: work-stealing and efficiency
• Adaptive-grain algorithms: principle of a dynamic "cascade"
• Examples
 • Iterated product, prefix
 • gzip compression
 • Triangular system inversion
 • 3D vision / oct-tree computation
Iterated product: sequential, parallel, adaptive [Davide Vernizzi]
• Sequential:
 • Input: array of n values
 • Output: res = Σ_{i=1..n} f(x_i)
 • C/C++ code:
   for (i = 0; i < n; i++)
     res += atoi(x[i]);
• Parallel algorithm:
 • recursive computation by blocks (binary tree with merge)
 • block size = pagesize
 • Kaapi code: Athapascan API
Experimentation: parallel <=> adaptive

Variant: sum of pages
• Input: a set of n pages; each page is an array of values
• Output: one page where each element is the sum of the elements with the same index across all pages:
 res[j] = Σ_{i=0..n-1} f(page_i, j)
• C/C++ code:
  for (i = 0; i < n; i++)
    for (j = 0; j < pageSize; j++)
      res[j] += f(pages[i][j]);

Experimentation:
 – the parallel algorithm costs about twice as much as the sequential algorithm
 – the adaptive algorithm has an efficiency close to 1
Demonstration on ensibull

Script: [vernizzd@ensibull demo]$ more go-tout.sh
#!/bin/sh
./spg /tmp/data &
./ppg /tmp/data 1 --a1 -thread.poolsize 3 &
./apg /tmp/data 1 --a1 -thread.poolsize 3 &

Result: [vernizzd@ensibull demo]$ ./go-tout.sh
Page size: 4096
ADAPTIVE (3 procs):   res = -2.048e+07, time = 0.408 s, threads created: 54
PARALLEL (3 procs):   res = -2.048e+07, time = 0.964 s, #fork = 7497
SEQUENTIAL (1 proc):  res = -2.048e+07, time = 1.152 s
Where does the difference come from? ... The program sources.

Source code for the page sum:
 – parallel / binary tree
 – adaptive, by coupling:
  – sequential + Fork<LastPartComp>
  – LastPartComp: (recursive) generation of 3 tasks

struct Iterated {
  void operator() (a1::Shared_w<Page> res, int start, int stop) {
    if ((stop - start) < 2) {
      // If max num of pages is reached, sequential algorithm
      Page resLocal(pageSize);
      IteratedSeq(start, resLocal);
      res.write(resLocal);
    } else {
      // If max num of pages is not reached
      int half = (start + stop) / 2;
      a1::Shared<Page> res1;                    // First thread result
      a1::Shared<Page> res2;                    // Second thread result
      a1::Fork<Iterated>() (res1, start, half); // First thread
      a1::Fork<Iterated>() (res2, half, stop);  // Second thread
      a1::Fork<Merge>() (res, res1, res2);      // Merging results
    }
  }
};

Parallel algorithm
Adaptive parallelization
• Computation by blocks on inputs split into k blocks:
 • 1 block = pagesize
 • independent execution of the k tasks
 • merging of the results
Adaptive algorithm (1/3)
• Hypothesis: non-preemptive, work-stealing-style scheduling
• Adaptive sequential coupling:

void Adaptative (a1::Shared_w<Page> *resLocal, DescWork dw) {
  a1::Shared<Page> resLPC;
  a1::Fork<LPC>() (resLPC, dw);
  Page resSeq(pageSize);
  AdaptSeq(dw, &resSeq);
  a1::Fork<Merge>() (resLPC, *resLocal, resSeq);
}
Adaptive algorithm (2/3)
• Sequential side:

void AdaptSeq (DescWork dw, Page *resSeq) {
  DescLocalWork w;
  Page resLoc(pageSize);
  double k;
  while (!dw.desc->extractSeq(&w)) {
    for (int i = 0; i < pageSize; i++) {
      k = resLoc.get(i) + (double) buff[w*pageSize + i];
      resLoc.put(i, k);
    }
  }
  *resSeq = resLoc;
}
Adaptive algorithm (3/3)
• Extraction side = parallel algorithm:

struct LPC {
  void operator() (a1::Shared_w<Page> resLPC, DescWork dw) {
    DescWork dw2;
    dw2.Allocate();
    dw2.desc->l.initialize();
    if (dw.desc->extractPar(&dw2)) {
      a1::Shared<Page> res2;
      a1::Fork<AdaptativeMain>() (res2, dw2.desc->i, dw2.desc->j);
      a1::Shared<Page> resLPCold;
      a1::Fork<LPC>() (resLPCold, dw);
      a1::Fork<MergeLPC>() (resLPCold, res2, resLPC);
    }
  }
};
Adaptive parallelization
• A single computation task is started for all the inputs
• The remaining work is divided only when a processor becomes idle
• Fewer tasks, fewer merges
Example 2: parallelization of gzip
• gzip:
 • widely used (web) and costly, although of linear complexity
 • source code: 10000 lines of C, complex data structures
 • principle: LZ77 + Huffman tree
• Why gzip?
 • a P-complete problem, but practical parallelization is possible
 • drawback: every (known) parallelization incurs an overhead
  -> loss of compression ratio
How to parallelize gzip?
• Sequential algorithm: on-the-fly compression, input file -> compressed file
• Static partition into blocks => parallel compression => compressed blocks
 • "Easy" parallelization, 100% compatible with gzip/gunzip
 • Problems: loss of compression ratio, grain depends on the machine, overhead
Adaptive-grain gzip parallelization
• On-the-fly compression (SeqComp) coupled with LastPartComputation
• Dynamic partition into blocks: input file -> parallel compression -> output compressed blocks -> output compressed file
Overhead in compressed-file size:

File size   gzip       Adaptive (2 procs)  Adaptive (8 procs)  Adaptive (16 procs)
0.86 MB     272573     275692              280660              280660
5.2 MB      1.023 MB   1.027 MB            1.05 MB             1.08 MB
9.4 MB      6.60 MB    6.62 MB             6.73 MB             6.79 MB
10 MB       1.12 MB    1.13 MB             1.14 MB             1.17 MB

Gain in time:

File size   2 procs   8 procs   16 procs
5.2 MB      3.35 s    0.96 s    0.55 s
9.4 MB      7.67 s    6.73 s    6.79 s
10 MB       6.79 s    1.71 s    0.88 s
Performances
[Figure: compression time (0-90 s) vs file size (about 1100 to 21900 KB) on a 4-processor machine, Pentium 4x200 MHz: sequential gzip vs Athapascan gzip]
Conclusion

• Adaptive grain: dynamic recursive cascade of 2 algorithms, 1 sequential and 1 parallel; parallelism is generated only when a resource is idle
 -> basic operator: ExtractPar on the sequential work in progress
• Generic programming, ... and simple!??
• Benefit: reduction of the overhead related to parallelism:
 – task creation, scheduling
 – intrinsic arithmetic overhead
 – practical gain: PL code for probabilistic inference [Mazer, SHARP]
• Perspectives:
 – SMP and distributed experiments: gzip, prefixes, ...
 – extension to the distributed and heterogeneous case: addition/resilience of resources
 – extensions to other algorithms [IMAG-INRIA AHA action]: 3D vision, computer algebra, ...
APACHE/MOAIS + EVASION [J Allard, C Menier, R Revire, F Zara] Video
Questions ?
Performance: NUGENT 22 (threshold = 8, #tasks = 209406)
[Figure: execution time (0-2000 s) vs #processors (20, 40, 60, 120), comparing: without checkpoints, SEL, CIC (period = 1 s), CIC (period = 20 s)]
[Figure: Tp - Ts/p (0-100) vs #processors = p, same four configurations]
[Figure: p*Tp - Ts (0-9000) vs #processors = p, same four configurations]
Performance on SMP
[Figure: search time in large local DNA files — duration (0-12 s) vs file size (31924544 and 63849088 bytes), sequential vs parallel (12 threads), Pentium 4x200 MHz]
Performance in the distributed case
[Figure: search time across two non-local disks — duration (0-1200 s) vs directory size (674021 and 1228427 bytes), sequential vs 1 node/4 threads vs 2 nodes/4 threads]
• Sequential: Pentium 4x200 MHz
• SMP: Pentium 4x200 MHz
• Distributed architecture: Myrinet, Pentium 4x200 MHz + 2x333 MHz
Distributed search in 2 directories of the same size, each on a remote disk (NFS)