Parallel Algorithms in STAPL: Implementation and Evaluation
Jeremy Vu, Mauro Bianco, Nancy [email protected]
Parasol Lab, Department of Computer Science and Engineering, Texas A&M University, http://parasol.tamu.edu/
Oil well logging simulation
STAPL Overview
STAPL is a framework for developing parallel C++ code. Its core is a library of C++ components with interfaces similar to the (sequential) C++ Standard Template Library (STL).
Standard Template Adaptive Parallel Library
Applications using STAPL
• Particle Transport Computation
Efficient Massively Parallel Implementation of Discrete Ordinates Particle Transport Calculation.
• Protein & RNA Folding
Probabilistic Roadmap Methods from motion planning adapted to protein and RNA folding.
• Seismic Ray Tracing
Simulation of propagation of seismic rays in Earth’s crust.
[Figure: Prion protein, normal and misfolded forms.]
References
• “A Framework for Adaptive Algorithm Selection in STAPL,” N. Thomas, G. Tanase, O. Tkachyshyn, J. Perdue, N. M. Amato, L. Rauchwerger. Symposium on Principles and Practice of Parallel Programming (PPOPP), June 2005.
• “Parallel Protein Folding with STAPL,” S. Thomas, G. Tanase, L. Dale, J. Moreira, L. Rauchwerger, N. M. Amato. Journal of Concurrency and Computation: Practice and Experience, 2005.
• “ARMI: An Adaptive, Platform Independent Communication Library,” S. Saunders, L. Rauchwerger. Symposium on Principles and Practice of Parallel Programming (PPOPP), June 2003.
• “STAPL: An Adaptive, Generic Parallel C++ Library,” P. An, A. Jula, S. Rus, S. Saunders, T. Smith, G. Tanase, N. Thomas, N. M. Amato, L. Rauchwerger. Languages and Compilers for Parallel Computing (LCPC), Aug. 2001.
This research was supported in part by NSF Grants EIA-0103742, ACR-0081510, ACR-0113971, CCR-0113974, ACI-0326350, CRI-0551685, and by the DOE, IBM, and Intel. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
Project Goals
• Ease of use
The Shared Object Programming Model provides a consistent interface across shared or distributed memory systems.
• Efficiency
Application building blocks are based on C++ STL constructs, extended and automatically tuned for parallel execution.
• Portability
The ARMI runtime system hides machine-specific details and provides an efficient, uniform communication interface.
[Figure: STAPL software stack. Components shown: User Application Code; pAlgorithms; Views; pContainers; pRange; Adaptive Framework; Run-time System (ARMI Communication Library, Scheduler, Executor, Performance Monitor); Pthreads, OpenMP, MPI, Native, …]
• “Design for Interoperability in STAPL: pMatrices and Linear Algebra Algorithms,” A. Buss, G. Tanase, N. Thomas, T. Smith, M. Bianco, N. M. Amato, L. Rauchwerger. Languages and Compilers for Parallel Computing (LCPC), LNCS 5335, pp. 304-315, August 2008.
• “Associative Parallel Containers in STAPL,” G. Tanase, C. Raman, M. Bianco, N. M. Amato, L. Rauchwerger. Languages and Compilers for Parallel Computing (LCPC), LNCS 5234, pp. 156-171, 2008.
• “The STAPL pArray,” G. Tanase, M. Bianco, N. M. Amato, L. Rauchwerger. MEmory performance: DEaling with Applications, systems and architecture (MEDEA), 2007.
pRange
pContainers and Views
Performance Evaluation
pContainer: a distributed collection of generic elements, with methods to access and maintain the collection, that provides a shared-object view to the user.
The shared-object view provides uniform access to data independent of the physical location where it is stored. pContainer interfaces are equivalent to those of their STL counterparts (e.g., pVector and STL vector).
View: an abstract data type that allows decoupling of a container interface from the underlying storage.
Views allow a data set to be filtered and traversed in multiple ways, for example row-wise or column-wise traversal of a pMatrix.
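To make the idea of a view concrete, here is a minimal sequential sketch. The types `Matrix`, `RowView`, and `ColumnView` and the helper `sum_view` are hypothetical names invented for this illustration; they are not STAPL classes, and STAPL's real views also account for data distribution. The point is only that one storage layout can be traversed through two different interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only; these are not STAPL classes.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<int> data;  // row-major storage, rows * cols elements
};

// Presents one row of the matrix as a 1-D sequence.
struct RowView {
    const Matrix& m;
    std::size_t row;
    std::size_t size() const { return m.cols; }
    int operator[](std::size_t k) const {
        assert(row < m.rows && k < m.cols);
        return m.data[row * m.cols + k];
    }
};

// Presents one column of the same storage as a 1-D sequence.
struct ColumnView {
    const Matrix& m;
    std::size_t col;
    std::size_t size() const { return m.rows; }
    int operator[](std::size_t k) const {
        assert(col < m.cols && k < m.rows);
        return m.data[k * m.cols + col];
    }
};

// An algorithm written against the view interface, not the storage.
template <typename View>
int sum_view(const View& v) {
    int total = 0;
    for (std::size_t i = 0; i < v.size(); ++i) total += v[i];
    return total;
}
```

The same `sum_view` works on either view because it relies only on `size()` and `operator[]`, never on how the matrix is stored.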
Example of pArray on 2 processors
pArray<int> p_array(6);
- Domain = [0,6)
- blocked_partitioned(2)
- cyclic_mapping
[Figure: Component 0 (Domain={0,1}), Component 1 (Domain={2,3}), and Component 2 (Domain={4,5}) placed on Location 0 and Location 1. Labeled elements: p_array[2], p_array[5].]
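The index arithmetic behind this example can be sketched as follows. The helpers `component_of` and `location_of` are hypothetical and are not STAPL API. The sketch assumes that a blocked partition with block size 2 assigns index i to component i / 2, and that a cyclic mapping deals components to locations round-robin; the poster does not spell out either rule.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers for illustration only; not STAPL API.

// Assumed blocked partition: consecutive blocks of block_size indices
// form components 0, 1, 2, ...
std::size_t component_of(std::size_t index, std::size_t block_size) {
    assert(block_size > 0);
    return index / block_size;
}

// Assumed cyclic mapping: components are dealt to locations round-robin.
std::size_t location_of(std::size_t component, std::size_t num_locations) {
    assert(num_locations > 0);
    return component % num_locations;
}
```

Under these assumptions, p_array[2] falls in Component 1 and p_array[5] in Component 2, which agrees with the component domains listed above. The placement of components on locations follows only from the assumed round-robin rule.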
[Figure: data distributed over Locations 0 to 3. Row Based View: aligned with distribution. Column Based View: not aligned with distribution.]
pRange: a dynamic, composable task graph of a parallel computation. A graph vertex is a work function together with the data (represented by a partition of a view) to be processed. Tasks that need to access the same data in a particular order have a graph edge between them, which enforces the execution order when the graph is processed by the Executor in the runtime.
pRanges constructed using Factories that describe the pattern of the computation (e.g., Map-Reduce) are data-parallel.
pRanges can be composed to form task-parallel computations.
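The shape of a Map-Reduce task graph can be sketched in plain C++. The function `map_reduce` below is a hypothetical illustration, not the STAPL factory interface, and it runs the tasks one after another; the comments mark which tasks have no edges between them and could therefore run concurrently under an executor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch, not STAPL API. `identity` must be the identity
// element of reduce_fn (e.g. 0 for addition), because it seeds every task.
template <typename Map, typename Reduce>
int map_reduce(const std::vector<int>& data, std::size_t num_tasks,
               Map map_fn, Reduce reduce_fn, int identity) {
    assert(num_tasks > 0);
    const std::size_t n = data.size();
    std::vector<int> partial(num_tasks, identity);

    // Map stage: one task per contiguous partition of the data.
    // The tasks touch disjoint partitions, so no edges connect them.
    for (std::size_t t = 0; t < num_tasks; ++t) {
        const std::size_t begin = t * n / num_tasks;
        const std::size_t end = (t + 1) * n / num_tasks;
        for (std::size_t i = begin; i < end; ++i)
            partial[t] = reduce_fn(partial[t], map_fn(data[i]));
    }

    // Reduce stage: a single task with an incoming edge from every map task.
    int result = identity;
    for (int p : partial) result = reduce_fn(result, p);
    return result;
}
```

For instance, mapping x to x * x and reducing with addition computes a sum of squares; the result does not depend on `num_tasks` as long as `reduce_fn` is associative.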
[Figure: Example of Map-Reduce Task Graph. Each Task pairs a View with a Work Function.]
p_unique Scalability
• 10 million integers in the container.
• Results obtained on an IBM Cluster 1600 here at TAMU.
• Initial set of results, improvements planned.
pAlgorithm: p_unique
Input: A sequence of elements and a binary relation.
Output: A sequence of elements consisting of only the first element of each group of consecutive duplicate elements.
• The binary relation is used to determine the duplicate elements. ex: ‘=’, {1, 1, 2, 2, 3, 3, 4, 4} ---> {1, 2, 3, 4}
• Sequentially, there’s only one way to do this; in parallel, there are multiple cases.
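As a reference point, the sequential behavior can be sketched as below. The function name `unique_copy_sequential` is hypothetical. The sketch assumes that an element counts as a duplicate when the relation holds between the most recently kept element and that element; this reading is consistent with the ‘=’ example above, but it is an interpretation rather than the STAPL specification.

```cpp
#include <functional>
#include <vector>

// Hypothetical sequential reference, not STAPL code.
// Assumption: x is a duplicate when rel(last kept element, x) is true.
template <typename T, typename Relation>
std::vector<T> unique_copy_sequential(const std::vector<T>& in, Relation rel) {
    std::vector<T> out;
    for (const T& x : in) {
        // Keep x if it starts a new group, i.e. it is not related to
        // the first element of the current group.
        if (out.empty() || !rel(out.back(), x)) out.push_back(x);
    }
    return out;
}
```

With `std::equal_to<int>{}` as the relation, the input {1, 1, 2, 2, 3, 3, 4, 4} yields {1, 2, 3, 4}.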
Case 1, symmetric + transitive (ex: ‘=’):
• Compare, in parallel, each element to the next element in the sequence using the relation and keep or remove it based on the result.
• If we have enough processors, all of the comparisons can happen simultaneously. Constant number of remote accesses.
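A minimal sketch of the Case 1 strategy follows. The name `unique_copy_adjacent` is hypothetical, and the loops run sequentially here. What matters is that each keep/remove flag reads only two neighboring elements, so all flags could be computed at the same time.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sketch of Case 1, not STAPL code.
template <typename T, typename Relation>
std::vector<T> unique_copy_adjacent(const std::vector<T>& in, Relation rel) {
    const std::size_t n = in.size();
    std::vector<char> keep(n, 1);  // the first element is always kept

    // Flag stage: iteration i reads only in[i - 1] and in[i], so the
    // iterations are independent and could all run in parallel.
    for (std::size_t i = 1; i < n; ++i)
        keep[i] = rel(in[i - 1], in[i]) ? 0 : 1;

    // Compaction stage: copy the surviving elements in order.
    std::vector<T> out;
    for (std::size_t i = 0; i < n; ++i)
        if (keep[i]) out.push_back(in[i]);
    return out;
}
```

This neighbor-only comparison is not enough for a non-symmetric relation: with `std::less<int>{}` on {60, 100, 70, 20, 50, 40, 30, 10} it returns {60, 70, 20, 40, 30, 10}, because 70 is compared with 100 rather than with the first element of its group.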
Case 2, transitive + not symmetric (ex: ‘<’):
• {60, 100, 70, 20, 50, 40, 30, 10} ---> {60, 20, 10}
• Requires a parallel prefix algorithm before decisions. We “prefix” each comparison with the appropriate initial element, then we identify the duplicate elements.
• log p remote accesses.
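One way to read the prefix step is sketched below for the ‘<’ example. The name `unique_copy_prefix` is hypothetical, and `std::partial_sum` stands in for a parallel prefix. The sketch assumes that the "initial element" for each position is the first element of the group it would join, obtained by scanning with the operation "keep a if rel(a, b), otherwise b". That operation is associative for ‘<’ on integers, where it is a running minimum; this is not claimed for arbitrary transitive relations.

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical sketch of Case 2, not STAPL code.
template <typename T, typename Relation>
std::vector<T> unique_copy_prefix(const std::vector<T>& in, Relation rel) {
    const std::size_t n = in.size();

    // Prefix stage: leader[i] is the first element of the group that
    // position i belongs to. A sequential scan stands in for the
    // parallel prefix described on the poster.
    std::vector<T> leader(n);
    std::partial_sum(in.begin(), in.end(), leader.begin(),
                     [&rel](const T& a, const T& b) { return rel(a, b) ? a : b; });

    // Decision stage: each test reads leader[i - 1] and in[i] only,
    // so the tests are independent once the prefix is known.
    std::vector<T> out;
    for (std::size_t i = 0; i < n; ++i)
        if (i == 0 || !rel(leader[i - 1], in[i])) out.push_back(in[i]);
    return out;
}
```

With `std::less<int>{}`, the leaders for {60, 100, 70, 20, 50, 40, 30, 10} are {60, 60, 60, 20, 20, 20, 20, 10}, and the kept elements are {60, 20, 10}, matching the example above.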
Case 3, not transitive (ex: ‘a is the father of b’):
• Sequential; comparisons must be done in order, one at a time.
p_unique_copy, Case 1 (symmetric and transitive), 100% of elements copied
p_unique_copy, Case 2 (transitive but not symmetric), 100% of elements copied
p_unique_copy, Case 2, 0% of elements copied
p_unique_copy, Case 1, 50% of elements copied. Slowdown occurs as the amount of communication increases due to compaction of elements.