Revisiting a slide from the syllabus: CS 525 will cover

• Parallel and distributed computing architectures
  – Shared memory processors
  – Distributed memory processors
  – Multi-core processors
• Parallel programming
  – Shared memory programming (OpenMP)
  – Distributed memory programming (MPI)
  – Thread-based programming
• Parallel algorithms
  – Sorting
  – Matrix-vector multiplication
  – Graph algorithms
• Applications
  – Computational science and engineering
  – High-performance computing
This lecture hopes to touch on each of the highlighted aspects via one running example.
Intel Nehalem: an example of a multi-core shared-memory architecture
A few of the features…
• 2 sockets
• 4 cores per socket
• 2 hyper-threads per core ⇒ 16 threads in total
• Processor speed: 2.5 GHz
• 24 GB total memory
• Cache:
  – L1 (32 KB, on-core)
  – L2 (2x6 MB, on-core)
  – L3 (8 MB, shared)
• Memory: virtually globally shared (from the programmer’s point of view)
• Multithreading: simultaneous (multiple instructions from ready threads executed in a given cycle)
Block diagram
Overview of Programming Models
• Programming models provide support for expressing concurrency and synchronization
• Process-based models assume that all data associated with a process is private by default, unless otherwise specified
• Lightweight processes and threads assume that all memory is global
  – A thread is a single stream of control in the flow of a program
• Directive-based programming models extend the threaded model by facilitating the creation and synchronization of threads
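To make the “all memory is global” property of the thread model concrete, here is a minimal sketch (not from the slides) using POSIX threads: two threads increment one shared global counter, with a mutex providing the synchronization. The function and constant names are illustrative assumptions.

```c
/* Hypothetical illustration: two POSIX threads incrementing a shared
 * global counter.  Unlike processes, the threads see the same memory
 * by default; the mutex synchronizes their access to it. */
#include <pthread.h>

#define INCREMENTS 100000

static long counter = 0;                          /* shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&lock);                /* enter critical section */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Runs two workers to completion and returns the final counter value. */
long run_two_threads(void)
{
    pthread_t t1, t2;
    counter = 0;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```

Without the mutex the two unsynchronized increments would race and the final count could be anything up to 2 × INCREMENTS; with it, the result is deterministic.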
Advantages of Multithreading (threaded programming)

• Threads provide software portability
• Multithreading enables latency hiding
• Multithreading takes scheduling and load-balancing burdens away from the programmer
• Multithreaded programs are significantly easier to write than message-passing programs
  – Becoming increasingly widespread in use due to the abundance of multicore platforms
Shared Address Space Programming APIs
• The POSIX Threads (Pthreads) API
  – Has emerged as the standard threads API
  – Low-level primitives (relatively difficult to work with)
• OpenMP
  – A directive-based API for programming shared address space platforms (has become a standard)
  – Used with Fortran, C, and C++
  – Directives provide support for concurrency, synchronization, and data handling without having to explicitly manipulate threads
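As a small illustration of the directive style (my example, not from the slides): a dot product parallelized with a single OpenMP pragma. Compiled with `-fopenmp` the loop iterations are split across a team of threads; compiled without it, the pragma is ignored and the loop runs serially with the same result.

```c
/* Sketch: OpenMP parallelizes the loop via a directive; no explicit
 * thread creation, scheduling, or joining appears in the code. */
#include <stddef.h>

double dot(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    /* The directive forks a thread team, divides the iterations among
     * the threads, and combines the per-thread partial sums (the
     * reduction clause handles the data-race on sum). */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

This is exactly the contrast the slide draws with Pthreads: concurrency, synchronization, and data handling are expressed declaratively rather than by manipulating threads.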
Parallelizing Graph Algorithms
• Challenges:
  – Runtime is dominated by memory latency rather than processor speed
  – Little work is done while visiting a vertex or an edge
    • Little computation to hide memory access cost
  – Access patterns are determined only at runtime
    • Prefetching techniques are inapplicable
  – Poor data locality
    • Difficult to obtain good memory system performance
• For these reasons, parallel performance
  – on distributed memory machines is often poor
  – on shared memory machines is often better

We consider here graph coloring as an example of a graph algorithm to parallelize on shared memory machines.
Graph coloring

• Graph coloring is an assignment of colors (positive integers) to the vertices of a graph such that adjacent vertices get different colors
• The objective is to find a coloring with the least number of colors
• Examples of applications:
  – Concurrency discovery in parallel computing (illustrated in the figure to the right)
  – Sparse derivative computation
  – Frequency assignment
  – Register allocation, etc.
A greedy algorithm for coloring
• Graph coloring is NP-hard to solve optimally (and even to approximate)
• The following greedy algorithm gives very good solutions in practice

Complexity of GREEDY: O(|E|) (thanks to the way the array forbiddenColors is used)

color is a vertex-indexed array that stores the color of each vertex; forbiddenColors is a color-indexed array used to mark colors that are impermissible for a vertex.
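The GREEDY algorithm itself appears as a figure in the original slides; the following is a sketch of it in C under an assumed CSR adjacency-list layout (the array names color and forbiddenColors follow the slides, the rest is illustrative). The key trick behind the O(|E|) bound: forbiddenColors is stamped with the current vertex id, so it never needs to be reset between vertices.

```c
/* GREEDY coloring sketch.  adj[ptr[v] .. ptr[v+1]-1] lists the
 * neighbors of vertex v; color[] must be zero-initialized on entry
 * and receives colors in {1, 2, ...}. */
#include <stdlib.h>

void greedy_color(int n, const int *ptr, const int *adj, int *color)
{
    /* forbiddenColors[c] == v means color c is used by a neighbor of v;
     * at most n colors are ever needed, so size n+1 suffices. */
    int *forbiddenColors = malloc((size_t)(n + 1) * sizeof(int));
    for (int c = 0; c <= n; c++)
        forbiddenColors[c] = -1;

    for (int v = 0; v < n; v++) {
        /* mark the colors of already-colored neighbors with stamp v */
        for (int e = ptr[v]; e < ptr[v + 1]; e++) {
            int w = adj[e];
            if (color[w] > 0)
                forbiddenColors[color[w]] = v;
        }
        /* assign the smallest color not marked for this vertex */
        int c = 1;
        while (forbiddenColors[c] == v)
            c++;
        color[v] = c;
    }
    free(forbiddenColors);
}
```

Each edge is touched a constant number of times and the stamping avoids any per-vertex reset of forbiddenColors, which is what gives O(|E|) overall.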
Parallelizing Greedy Coloring
• Desired goal: parallelize GREEDY such that
  – Parallel runtime is roughly O(|E|/p) when p processors (threads) are used
  – Number of colors used is nearly the same as in the serial case
• Difficult to achieve since GREEDY is inherently sequential
• Challenge: come up with a way to create concurrency in a nontrivial way
A potentially “generic” parallelization technique
• “Standard” partitioning
  – Break up the given problem into p independent subproblems of almost equal sizes
  – Solve the p subproblems concurrently
  The main work lies in the decomposition step, which is often no easier than solving the original problem
• “Relaxed” partitioning
  – Break up the problem into p, not necessarily entirely independent, subproblems of almost equal sizes
  – Solve the p subproblems concurrently
  – Detect inconsistencies in the solutions concurrently
  – Resolve any inconsistencies
  Can potentially be used successfully if the resolution in the fourth step involves only local adjustments
“Relaxed partitioning” applied to parallelizing Greedy coloring

• Speculation and iteration:
  – Color as many vertices as possible concurrently, tentatively tolerating potential conflicts
  – Detect and resolve the conflicts afterwards (iteratively)
Parallel Coloring on Shared Memory Platforms (using Speculation and Iteration)
Lines 4 and 9 can be parallelized using the OpenMP directive
#pragma omp parallel for
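Algorithm 2 itself appears as a listing in the original slides; the following is a sketch of the speculate-and-iterate scheme it describes, under the same assumed CSR layout as before (helper names are illustrative). Round by round: tentatively color all remaining vertices in parallel, then detect conflicting adjacent pairs and mark one endpoint of each for recoloring. The coloring loop carries the `#pragma omp parallel for` from the slide; without `-fopenmp` it simply runs serially and finishes in one round.

```c
/* Speculative round: greedily color every vertex in rem[0..m-1],
 * tolerating conflicts between vertices colored concurrently. */
#include <stdlib.h>
#include <string.h>

static void color_round(int m, const int *rem, const int *ptr,
                        const int *adj, int *color, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < m; i++) {
        int v = rem[i];
        /* per-vertex forbidden array: fine for a sketch, a real
         * implementation would reuse one array per thread */
        char *forbidden = calloc((size_t)n + 2, 1);
        for (int e = ptr[v]; e < ptr[v + 1]; e++)
            if (color[adj[e]] > 0)
                forbidden[color[adj[e]]] = 1;
        int c = 1;
        while (forbidden[c])
            c++;
        color[v] = c;                 /* speculative: may conflict */
        free(forbidden);
    }
}

/* Iterate rounds until no conflicts remain; color[] must start at 0. */
void iterative_color(int n, const int *ptr, const int *adj, int *color)
{
    int *rem  = malloc((size_t)n * sizeof(int));
    int *next = malloc((size_t)n * sizeof(int));
    int m = n;
    for (int i = 0; i < n; i++)
        rem[i] = i;

    while (m > 0) {
        color_round(m, rem, ptr, adj, color, n);
        /* conflict detection: for each conflicting edge, uncolor the
         * lower-numbered endpoint and queue it for the next round */
        int k = 0;
        for (int i = 0; i < m; i++) {
            int v = rem[i];
            for (int e = ptr[v]; e < ptr[v + 1]; e++) {
                int w = adj[e];
                if (color[w] == color[v] && v < w) {
                    next[k++] = v;
                    color[v] = 0;
                    break;
                }
            }
        }
        memcpy(rem, next, (size_t)k * sizeof(int));
        m = k;
    }
    free(rem);
    free(next);
}
```

Because only the smaller endpoint of each conflicting edge is recolored, every round strictly shrinks the remaining set, so the iteration terminates; in practice very few vertices conflict, which is why the speculative scheme stays close to O(|E|/p).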
Sample experimental results of Algorithm 2 (Iterative) on Nehalem: I
Graph (RMAT-G): 16.7M vertices; 133.1M edges; degree (avg = 16, max = 1,278, variance = 416)

Sample experimental results of Algorithm 2 (Iterative) on Nehalem: II
Graph (RMAT-B): 16.7M vertices; 133.7M edges; degree (avg = 16, max = 38,143, variance = 8,086)
Recap
• Saw an example of a multi-core architecture
  – Intel Nehalem
• Took a bird’s-eye view of parallel programming models
  – OpenMP (highlighted)
• Saw an example of the design of a graph algorithm
  – Graph coloring
• Saw some experimental results