
Automating and Optimizing Data Transfers for Many-core Coprocessors

Student: Bin Ren, Advisor: Gagan Agrawal, NEC Intern Mentor: Nishkam Ravi, Yi Yang Other Collaborators: Min Feng, Srimat Chakradhar

CSE Poster Event 2014

Motivation

The Goal of This Work

Many-core coprocessors commonly have their own memory hierarchy, separate from host memory:

– Intel Xeon Phi

– NVIDIA GPUs

Programming Challenges

Experimental Results

CPU: Intel Xeon E5-2609 (8-core)
Coprocessor: Intel Xeon Phi (61-core) -- MIC
Compiler: ICC

Contributions

Static Mechanism and Runtime Mechanism

Programming with LEO/OpenACC

Design dynamic (runtime library) and static (code transformation) methods that automatically manage and optimize data communication between the CPU and many-core coprocessors for multi-dimensional arrays and multi-level pointers (a concrete sketch of the challenge follows the list below):

– Minimize redundant data transfers
– Utilize Direct Memory Access (DMA)
– Reduce memory allocation on the coprocessor
– Preserve compiler optimizations on the coprocessor
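To make the challenge concrete, the following minimal sketch (names and types are illustrative, not from the poster) shows a multi-dimensional array built from multi-level pointers; because each row is a separate heap allocation, the structure is not bit-wise copyable and cannot reach the coprocessor in a single transfer:

#include <stdlib.h>

/* Illustrative only: a 2D matrix as a multi-level pointer (float **).
 * The n row buffers are scattered across the heap, so no single
 * bit-wise copy can move the whole structure over PCIe. */
float **alloc_matrix(int n, int m) {
    float **A = (float **) malloc(n * sizeof(float *));
    for (int i = 0; i < n; i++)
        A[i] = (float *) malloc(m * sizeof(float)); /* one allocation per row */
    return A;
}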

State of the Art

Comparison of best CPU+MIC and CPU

Speedup of best CPU+MIC over the 8-core CPU

Study the performance bottlenecks of state-of-the-art dynamic and static methods

Design two novel heap linearization algorithms and an optimized MYO method to improve communication performance (a rough sketch of linearization follows this list)

Implement a static source-to-source code transformer with the Partial Linearization with Pointer Reset design

Evaluate and analyze both dynamic and static approaches on multiple benchmarks to show the efficacy of our Partial Linearization with Pointer Reset method
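As a rough illustration of what heap linearization does (a sketch under assumed types, not the exact algorithm from the poster), the scattered rows of a two-level matrix can be copied into one dense buffer, after which access sites are rewritten to flat offsets:

#include <stdlib.h>

/* Complete Linearization, sketched: copy a multi-level matrix into one
 * dense buffer so a single, DMA-friendly transfer can move it. Access
 * sites are then rewritten from A[i][j] to A_lin[i * m + j]. */
float *linearize(float **A, int n, int m) {
    float *A_lin = (float *) malloc((size_t) n * m * sizeof(float));
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            A_lin[i * m + j] = A[i][j];
    return A_lin;
}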

Data Transfer

[Diagram: CPU host (8-core) connected over PCIe to a many-core coprocessor (60+ cores, e.g., Intel MIC or NVIDIA GPU); data transfers cross the PCIe link]

…
//Change malloc site to split pointers and real data
#pragma offload target(mic) in(A_data, B_data, C_data: length(m*n) REUSE)
{}
#pragma offload target(mic) nocopy(A, B, C: length(n) ALLOC)
{
  //Connect A, B, C with A_data, B_data, C_data
}
#pragma offload target(mic) nocopy(A, B, C: length(n))
{
  #pragma omp parallel for private(i, j)
  for (i = 0; i < n; i++)
    for (j = 0; j < m; j++)
      A[i][j] = B[i][j] * C[i][j];
}
#pragma offload target(mic) out(A_data: length(m*n) FREE)
…
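The malloc-site change referenced in the first comment above is not spelled out on the poster; a plausible sketch (the float element type and helper name are assumptions) keeps the two-level pointer array A but points every row into the single dense buffer A_data that the pragmas transfer:

#include <stdlib.h>

/* Hypothetical split malloc site: the real data lives in one dense
 * buffer (*data_out), while the returned A keeps its two-level shape
 * so the access site A[i][j] needs no modification. */
float **alloc_split(int n, int m, float **data_out) {
    float *A_data = (float *) malloc((size_t) n * m * sizeof(float));
    float **A = (float **) malloc(n * sizeof(float *));
    for (int i = 0; i < n; i++)
        A[i] = &A_data[i * m]; /* row pointers into the dense buffer */
    *data_out = A_data;        /* A_data is what the offload pragmas move */
    return A;
}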

Current approaches to managing data transfer between the CPU and coprocessor trade productivity against performance:

– Virtual Shared Memory (MYO). Pros: easy programming; supports complex data structures. Cons: slow, due to unnecessary synchronization.

– Explicit Message Passing. Pros: fast. Cons: users must manage the data offload themselves, and only bit-wise copyable data is supported.
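For reference, the MYO style in ICC looks roughly like the following sketch (Intel's _Cilk_shared/_Cilk_offload keywords and _Offload_shared_malloc are real LEO constructs; the scale example itself is hypothetical):

#include <offload.h>

/* Virtual shared memory (MYO) sketch: entities marked _Cilk_shared are
 * visible on both host and coprocessor; the runtime synchronizes shared
 * data around each offload, which is convenient but costs performance. */
_Cilk_shared int n;
int *_Cilk_shared data; /* shared pointer into the shared heap */

_Cilk_shared void scale(void) {
    for (int i = 0; i < n; i++)
        data[i] *= 2; /* runs on the MIC without explicit copies */
}

int main(void) {
    n = 1024;
    data = (int *) _Offload_shared_malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) data[i] = i; /* initialize on the host */
    _Cilk_offload scale(); /* shared data synchronized at this boundary */
    _Offload_shared_free(data);
    return 0;
}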

[Diagram: MYO shared address space holding multi-level pointers, e.g., int *a and int **b]

Our Static Mechanism

Our Combined Mechanism

Summary of Benchmarks

Comparison of Static Methods (Linearization) and OPT-Runtime (MYO)

Speedup of Static over OPT-Runtime
Data Transfer Size of Static over OPT-Runtime

Comparison of OPT-Runtime and Runtime (MYO)

Speedup of OPT-Runtime over Runtime
Data Transfer Size of OPT-Runtime over Runtime

Comparison of OPT-Complete Linearization and Complete Linearization

Speedup of OPT-CL over CL for MG
Data Transfer Size of OPT-CL over CL for MG

Partial Linearization with Pointer Reset (PR)

High-Dimensional Array Addition

Struct and Non-unit Stride Access

– No modification to the access site: preserves potential compiler optimizations and reduces the possibility of introducing bugs
– Reduced communication overhead: only linearized data is transferred, minimizing the number of offloads
– DMA utilization: linearized data resides in a dense memory buffer (the device-side pointer reset is sketched below)
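The pointer-reset step itself (the "Connect" comment in the offload code above) can be sketched as follows, again with assumed types: the dense buffer A_data crosses PCIe in one DMA-friendly transfer, and the row pointers are rebuilt on the coprocessor so they hold valid device addresses:

/* Pointer Reset, sketched: A_data arrives as one dense buffer; the
 * row-pointer array A is rebuilt on the MIC so the unmodified access
 * site A[i][j] dereferences coprocessor addresses. */
#pragma offload target(mic) in(A_data: length(n*m) REUSE) nocopy(A: length(n) ALLOC)
{
    for (int i = 0; i < n; i++)
        A[i] = &A_data[i * m]; /* device-side pointers into the device buffer */
}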
