

Parallel Algorithms in STAPL: Implementation and Evaluation

Jeremy Vu, Mauro Bianco, Nancy Amato
stapl-support@tamu.edu

Parasol Lab, Department of Computer Science and Engineering, Texas A&M University, http://parasol.tamu.edu/

STAPL Overview

STAPL (Standard Template Adaptive Parallel Library) is a framework for developing parallel C++ code. Its core is a library of C++ components with interfaces similar to the (sequential) C++ Standard Template Library (STL).

Applications using STAPL

• Particle Transport Computation
Efficient massively parallel implementation of discrete ordinates particle transport calculation.

• Protein & RNA Folding
Probabilistic Roadmap Methods from motion planning adapted to protein and RNA folding.

• Seismic Ray Tracing
Simulation of propagation of seismic rays in the earth's crust.

[Figures: oil well logging simulation; prion protein, normal and misfolded forms]


This research is supported in part by NSF Grants EIA-0103742, ACR-0081510, ACR-0113971, CCR-0113974, ACI-0326350, CRI-0551685, and by the DOE, IBM, and Intel. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Project Goals

• Ease of use
The Shared Object Programming Model provides a consistent interface across shared or distributed memory systems.

• Efficiency
Application building blocks are based on C++ STL constructs, extended and automatically tuned for parallel execution.

• Portability
The ARMI runtime system hides machine-specific details and provides an efficient, uniform communication interface.

[Figure: STAPL software architecture. Components shown: User Application Code; Adaptive Framework; pAlgorithms; Views; pContainers; pRange; Run-time System (ARMI Communication Library, Scheduler, Executor, Performance Monitor); Pthreads, OpenMP, MPI, Native, …]


pContainers and Views

pContainer: a distributed collection of generic elements, with methods to access and maintain the collection, that provides a shared-object view to the user. The shared-object view provides uniform access to data independent of the physical location where it is stored. pContainer interfaces are equivalent to their STL counterparts (e.g., pVector and STL vector).

View: an abstract data type that decouples a container interface from the underlying storage. Views allow a data set to be filtered and traversed in multiple ways, for example row-wise or column-wise traversal of a pMatrix.
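The decoupling that a view provides can be illustrated without STAPL itself. The sketch below is plain sequential C++ and does not use the STAPL API: `Matrix`, `row_view`, and `column_view` are hypothetical names, and each "view" is only a traversal order over the same unchanged storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a container: row-major storage of a rows x cols matrix.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<int> data;  // element (r, c) lives at data[r * cols + c]
};

// Row-wise traversal: follows the storage order.
std::vector<int> row_view(const Matrix& m) {
    assert(m.data.size() == m.rows * m.cols);
    std::vector<int> visited;
    for (std::size_t r = 0; r < m.rows; ++r)
        for (std::size_t c = 0; c < m.cols; ++c)
            visited.push_back(m.data[r * m.cols + c]);
    return visited;
}

// Column-wise traversal: same storage, different access order.
std::vector<int> column_view(const Matrix& m) {
    assert(m.data.size() == m.rows * m.cols);
    std::vector<int> visited;
    for (std::size_t c = 0; c < m.cols; ++c)
        for (std::size_t r = 0; r < m.rows; ++r)
            visited.push_back(m.data[r * m.cols + c]);
    return visited;
}
```

For a 2 x 3 matrix stored as 1 2 3 4 5 6, `row_view` visits 1, 2, 3, 4, 5, 6 and `column_view` visits 1, 4, 2, 5, 3, 6, without moving any data.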

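Example of pArray on 2 processors:

pArray<int> p_array(6);
  - Domain = [0,6)
  - blocked_partitioned(2)
  - cyclic_mapping

[Figure: the elements are grouped into Component 0 (Domain={0,1}), Component 1 (Domain={2,3}), and Component 2 (Domain={4,5}), and the components are placed on Location 0 and Location 1. Elements p_array[2] and p_array[5] are marked.]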
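The placement in this example can be reproduced with simple arithmetic. The sketch below rests on two assumptions about the labels, neither taken from STAPL's code: a blocked partition with block size 2 puts index i in component i / 2, and a cyclic mapping puts component c on location c mod (number of locations). `Placement` and `locate` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Where one element ends up: which component holds it, and on which location.
struct Placement {
    std::size_t component;
    std::size_t location;
};

// Assumed rules: blocked partition (index / block_size), then cyclic mapping
// of components onto locations (component % num_locations).
Placement locate(std::size_t index, std::size_t block_size, std::size_t num_locations) {
    assert(block_size > 0 && num_locations > 0);
    const std::size_t component = index / block_size;
    const std::size_t location = component % num_locations;
    return {component, location};
}
```

Under these assumptions, with block size 2 and 2 locations, p_array[2] falls in Component 1 on Location 1, and p_array[5] falls in Component 2 on Location 0.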

[Figure: data distributed across Location 0 through Location 3, shown under two views. Row Based View: aligned with the distribution. Column Based View: not aligned with the distribution.]

pRange

A pRange is a dynamic, composable task graph of a parallel computation. A graph vertex is a work function and the data (represented by a partition of a view) to be processed. Tasks that need to access the same data in a particular order have a graph edge between them to enforce the execution order when the graph is processed by the Executor in the runtime.

pRanges that are constructed using Factories that describe the pattern of the computation (e.g., Map-Reduce) are data-parallel. pRanges can be composed to form task-parallel computations.

[Figure: Example of Map-Reduce Task Graph. Each task pairs a view with a work function.]
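The task-graph idea can be sketched in plain C++ as follows. This is a toy model under stated assumptions, not the STAPL pRange or Executor: `Task`, `execute`, and `map_reduce_sum` are hypothetical names, the executor is sequential, and each partition is a plain vector rather than a partition of a view.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical task: a work function plus the tasks that must finish first (graph edges).
struct Task {
    std::function<void()> work;
    std::vector<std::size_t> depends_on;
};

// Toy sequential executor: repeatedly runs every task whose predecessors are done.
void execute(std::vector<Task>& graph) {
    std::vector<bool> done(graph.size(), false);
    std::size_t remaining = graph.size();
    while (remaining > 0) {
        bool progressed = false;
        for (std::size_t t = 0; t < graph.size(); ++t) {
            if (done[t]) continue;
            bool ready = true;
            for (std::size_t d : graph[t].depends_on) ready = ready && done[d];
            if (!ready) continue;
            graph[t].work();
            done[t] = true;
            --remaining;
            progressed = true;
        }
        assert(progressed && "dependency cycle in task graph");
    }
}

// Map-reduce as a graph: one map task per partition, one reduce task that depends on all of them.
int map_reduce_sum(const std::vector<std::vector<int>>& partitions) {
    std::vector<int> partial(partitions.size(), 0);
    int total = 0;

    std::vector<Task> graph(1);  // slot 0 is the reduce task, filled in below
    for (std::size_t p = 0; p < partitions.size(); ++p) {
        Task map_task;
        map_task.work = [&partitions, &partial, p] {
            partial[p] = std::accumulate(partitions[p].begin(), partitions[p].end(), 0);
        };
        graph.push_back(map_task);
        graph[0].depends_on.push_back(graph.size() - 1);  // edge: map task -> reduce task
    }
    graph[0].work = [&partial, &total] {
        total = std::accumulate(partial.begin(), partial.end(), 0);
    };

    execute(graph);
    return total;
}
```

The reduce task is deliberately stored first, so its position does not decide when it runs: the edges do. The map tasks have no edges between them, which is what makes that stage data-parallel.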

Performance Evaluation: p_unique Scalability

• 10 million integers in the container.
• Results obtained on an IBM Cluster 1600 at TAMU.
• Initial set of results; improvements planned.

pAlgorithm: p_unique

Input: A sequence of elements and a binary relation.
Output: A sequence of elements consisting of only the first element of each group of consecutive duplicate elements.

• The binary relation is used to determine the duplicate elements.
• Example with '=': {1, 1, 2, 2, 3, 3, 4, 4} ---> {1, 2, 3, 4}
• Sequentially, there is only one way to do this; in parallel, there are multiple cases, depending on the properties of the relation.
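The sequential behavior can be written as a single pass that compares each element against the most recently kept one. This is a minimal sketch of those semantics in standard C++, not STAPL code; `unique_copy_sequential` is a hypothetical name. For equivalence relations, `std::unique_copy` from the standard library gives the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Keeps the first element of each group of consecutive duplicates.
// rel(kept, x) == true means "x duplicates the most recently kept element".
template <typename T, typename Relation>
std::vector<T> unique_copy_sequential(const std::vector<T>& in, Relation rel) {
    std::vector<T> out;
    for (const T& x : in) {
        // Each decision depends on the previous one, which is what makes this pass sequential.
        if (out.empty() || !rel(out.back(), x))
            out.push_back(x);
    }
    assert(out.size() <= in.size());
    return out;
}
```

With '=' as the relation, the input {1, 1, 2, 2, 3, 3, 4, 4} yields {1, 2, 3, 4}.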

• Case 1, symmetric + transitive (e.g., '='):
  • Compare, in parallel, each element to the next element in the sequence using the relation, and keep or remove it based on the result.
  • If we have enough processors, all of the comparisons can happen simultaneously. Constant number of remote accesses.
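A sketch of Case 1, with blocks standing in for processors. It is an illustration in sequential C++ rather than the STAPL implementation: `unique_copy_neighbor` and `num_blocks` are hypothetical, and the outer loop only simulates what independent processors would do at the same time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Case 1 sketch: every keep/remove decision needs only the element's predecessor.
template <typename T, typename Relation>
std::vector<T> unique_copy_neighbor(const std::vector<T>& in, Relation rel,
                                    std::size_t num_blocks) {
    assert(num_blocks > 0);
    const std::size_t n = in.size();
    std::vector<char> keep(n, 0);

    // Phase 1: each block flags its own elements. Iterations of the outer loop
    // are independent of each other, so they could run simultaneously.
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const std::size_t begin = b * n / num_blocks;
        const std::size_t end = (b + 1) * n / num_blocks;
        for (std::size_t i = begin; i < end; ++i) {
            // For i == begin (and i > 0), in[i - 1] belongs to the previous block:
            // this is the single non-local read per block.
            keep[i] = (i == 0) || !rel(in[i - 1], in[i]);
        }
    }

    // Phase 2: compaction of the kept elements into the output.
    std::vector<T> out;
    for (std::size_t i = 0; i < n; ++i)
        if (keep[i]) out.push_back(in[i]);
    return out;
}
```

Each block reads at most one element it does not own, which matches the constant number of remote accesses. The compaction phase is where elements have to move.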

• Case 2, transitive + not symmetric (e.g., '<'):
  • {60, 100, 70, 20, 50, 40, 30, 10} ---> {60, 20, 10}
  • Requires a parallel prefix algorithm before decisions. We "prefix" each comparison with the appropriate initial element, then we identify the duplicate elements.
  • log p remote accesses.
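A sketch of the prefix idea for the specific relation '<'. For this relation the "appropriate initial element" at each position is the minimum of all earlier elements, so a prefix-minimum pass supplies it. This is an illustration only, not necessarily how STAPL handles general transitive relations: `unique_copy_less` is a hypothetical name, and `std::partial_sum` is a sequential stand-in for a parallel prefix.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Case 2 sketch for '<': x is a duplicate when (last kept element) < x.
std::vector<int> unique_copy_less(const std::vector<int>& in) {
    const std::size_t n = in.size();

    // Prefix step: running_min[i] = min(in[0], ..., in[i]).
    // A parallel prefix (scan) would take the place of this call.
    std::vector<int> running_min(n);
    std::partial_sum(in.begin(), in.end(), running_min.begin(),
                     [](int a, int b) { return std::min(a, b); });

    // Decision step: with the prefix available, each element is decided independently.
    std::vector<int> out;
    for (std::size_t i = 0; i < n; ++i) {
        if (i == 0 || !(running_min[i - 1] < in[i]))
            out.push_back(in[i]);
    }
    assert(out.size() <= n);
    return out;
}
```

On {60, 100, 70, 20, 50, 40, 30, 10}, `running_min` is {60, 60, 60, 20, 20, 20, 20, 10} and the function returns {60, 20, 10}.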

• Case 3, not transitive (e.g., 'a is the father of b'): Sequential; comparisons must be done in order, one at a time.
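A small example shows why independent neighbor comparisons stop working once transitivity is lost. The relation below, "b is the successor of a", is a hypothetical stand-in for 'a is the father of b', and both function names are made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Non-transitive relation: 1 -> 2 and 2 -> 3 hold, but 1 -> 3 does not.
bool is_successor(int a, int b) { return b == a + 1; }

// In-order version: each element is compared against the most recently kept one.
std::vector<int> unique_in_order(const std::vector<int>& in) {
    std::vector<int> out;
    for (int x : in)
        if (out.empty() || !is_successor(out.back(), x)) out.push_back(x);
    assert(out.size() <= in.size());
    return out;
}

// Neighbor-only version: decisions are independent of each other, which
// gives a different (wrong) answer for a non-transitive relation.
std::vector<int> unique_neighbor_only(const std::vector<int>& in) {
    std::vector<int> out;
    for (std::size_t i = 0; i < in.size(); ++i)
        if (i == 0 || !is_successor(in[i - 1], in[i])) out.push_back(in[i]);
    return out;
}
```

For {1, 2, 3}, the in-order version returns {1, 3}: 2 is removed as a successor of 1, and 3 is then compared with 1, not with 2. The neighbor-only version returns {1}, because it compares 3 with the already removed 2.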

[Figures: scalability plots, with the following captions]

• p_unique_copy, Case 1 (symmetric and transitive), 100% of elements copied.
• p_unique_copy, Case 2 (transitive but not symmetric), 100% of elements copied.
• p_unique_copy, Case 2, 0% of elements copied.
• p_unique_copy, Case 1, 50% of elements copied. Slowdown occurs as the amount of communication increases due to compaction of elements.