Presented by PLASMA (Parallel Linear Algebra for Scalable Multicore Architectures) The Innovative Computing Laboratory University of Tennessee Knoxville


Dongarra_PLASMA_SC07

Slide 2: Why Multicore?
Three walls drive the move to multicore: the ILP wall, the memory wall, and the power wall.
- ILP is expensive and does not generate much concurrency; TLP provides higher concurrency.
- On a single core, latency is completely exposed; on a multicore, latency can be hidden using multithreading.
- Power consumption grows with clock frequency (MHz), but only linearly with the number of transistors.

Slide 3: Programming for Multicores
Parallel software for multicores should have two characteristics:
- Fine granularity: a high degree of parallelism is needed, and cores are (and probably will be) associated with relatively small local memories.
- Asynchronicity: a high degree of parallelism makes synchronization a bigger bottleneck, so latency must be hidden.

Slide 4: Developing Parallel Algorithms
Two approaches: LAPACK on top of a threaded BLAS, or PThreads/OpenMP parallelism on top of a sequential BLAS.

Slide 5: Developing Parallel Algorithms: Why?
BLAS-2 operations cannot be efficiently parallelized because they are bandwidth bound. Strict synchronizations lead to poor parallelism and poor scalability.

Slide 6: Tiled Cholesky Factorization
In some cases it is possible to use the LAPACK algorithm while breaking the elementary operations into tiles. Cholesky, at each step:
- do DPOTF2 on the diagonal tile;
- for all tiles below it, do DTRSM;
- for all tiles of the trailing submatrix, do DGEMM.

Slide 7: Tiled LU Factorization
In many cases different algorithms are needed, which must be invented or can be found in the literature. Tiled LU (and QR) uses four kernels: DGETRF, DGESSM, DTSTRF, DSSSSM.

Slide 8: Tiled LU Factorization (continued)
In many cases different algorithms are needed, which must be invented or can be found in the literature.
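The tiled Cholesky loop described above (DPOTF2 on the diagonal tile, DTRSM on the panel below it, DGEMM/DSYRK on the trailing submatrix) can be sketched with NumPy standing in for the LAPACK/BLAS kernels. This is a minimal illustration, not the PLASMA implementation; the tile size `nb` and the assumption that it divides the matrix order are chosen for simplicity.

```python
import numpy as np

def tiled_cholesky(A, nb):
    """Right-looking tiled Cholesky; overwrites the lower part of A with L.

    A  : symmetric positive definite matrix (modified in place)
    nb : tile size (assumed to divide A.shape[0] for simplicity)
    """
    n = A.shape[0]
    nt = n // nb                                   # tiles per dimension
    T = lambda i, j: A[i*nb:(i+1)*nb, j*nb:(j+1)*nb]  # view of tile (i, j)

    for k in range(nt):
        # DPOTF2: factor the diagonal tile A_kk = L_kk L_kk^T
        T(k, k)[:] = np.linalg.cholesky(T(k, k))
        Lkk = T(k, k)
        for i in range(k + 1, nt):
            # DTRSM: panel solve, L_ik = A_ik L_kk^{-T}
            T(i, k)[:] = np.linalg.solve(Lkk, T(i, k).T).T
        for i in range(k + 1, nt):
            for j in range(k + 1, i + 1):
                # DGEMM/DSYRK: trailing update, A_ij -= L_ik L_jk^T
                T(i, j)[:] -= T(i, k) @ T(j, k).T
    return A
```

Each kernel touches only whole tiles, which is exactly what allows the tasks to be expressed as nodes of a DAG later in the talk.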
LU and QR:

Slide 9: Tiled LU Factorization, step by step (3x3 tile matrix)
k=1: DGETRF
k=1, j=2: DGESSM;  k=1, j=3: DGESSM
k=1, i=2: DTSTRF;  k=1, i=2, j=2: DSSSSM;  k=1, i=2, j=3: DSSSSM
k=1, i=3: DTSTRF;  k=1, i=3, j=2: DSSSSM;  k=1, i=3, j=3: DSSSSM

Slide 10: Block Data Layout
Fine granularity may require novel data formats to overcome the limitations of BLAS on small chunks of data: column-major storage versus block data layout.

Slide 11: Graph Driven Asynchronous Execution
The whole factorization can be represented as a DAG:
- nodes: tasks that operate on tiles (DGETRF, DGESSM, DTSTRF, DSSSSM);
- edges: dependencies among tasks.
Tasks can be scheduled asynchronously and in any order as long as dependencies are not violated.

Slide 12: Graph Driven Asynchronous Execution
Priorities: a critical path can be defined as the shortest path that connects all the nodes with the highest number of outgoing edges.

Slide 13: Fork-Join vs. Asynchronous
Fork-join execution leaves cores idle at every synchronization point; asynchronous execution fills that idle time.

Slides 14-16: Performance plots for Cholesky, QR, and LU.

Slide 17: Contacts
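The column-major versus block data layout contrast can be made concrete with a toy address computation: in block data layout each tile is stored contiguously, so a BLAS kernel operating on one tile touches one contiguous chunk of memory. The functions and the tile size below are illustrative, not PLASMA's actual addressing code.

```python
def colmajor_offset(i, j, n):
    # Column-major (LAPACK) layout: elements of a column are contiguous,
    # but a tile's elements are scattered across n-strided columns.
    return j * n + i

def block_offset(i, j, n, nb):
    # Block data layout: each nb-by-nb tile is stored contiguously
    # (column-major inside the tile), and tiles are themselves ordered
    # column-major. Assumes nb divides n for simplicity.
    ti, tj = i // nb, j // nb        # tile coordinates
    ii, jj = i % nb, j % nb          # position inside the tile
    nt = n // nb                     # tiles per dimension
    return (tj * nt + ti) * nb * nb + jj * nb + ii
```

For a 4x4 matrix with 2x2 tiles, the four elements of tile (0, 0) occupy offsets 0-3 in block layout, whereas in column-major layout they sit at offsets 0, 1, 4, and 5.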
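The graph-driven execution model can be sketched with a toy scheduler: a task becomes ready once all of its predecessors in the DAG have completed, and any ready task may run next. This sequential sketch only demonstrates the dependency-respecting ordering; the task names in the usage example mirror the LU kernels but the scheduler itself is illustrative, not PLASMA's runtime.

```python
from collections import deque

def run_dag(tasks, deps):
    """Execute tasks in some dependency-respecting order.

    tasks : dict mapping task name -> callable (the kernel to run)
    deps  : dict mapping task name -> set of prerequisite task names
    """
    remaining = {t: set(deps.get(t, ())) for t in tasks}   # unmet deps
    dependents = {t: [] for t in tasks}                    # reverse edges
    for t, pre in remaining.items():
        for p in pre:
            dependents[p].append(t)

    ready = deque(t for t, pre in remaining.items() if not pre)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()                      # run the kernel (DGETRF, DGESSM, ...)
        order.append(t)
        for d in dependents[t]:         # release tasks that depended on t
            remaining[d].discard(t)
            if not remaining[d]:
                ready.append(d)
    return order
```

For the first LU step on a 2x2 tile matrix, DGETRF on tile (1,1) unlocks DGESSM on (1,2) and DTSTRF on (2,1), and DSSSSM on (2,2) runs only after both of those finish; the scheduler is free to interleave the independent tasks in any order.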