Enhancing the Experimental "MATLAB on the TeraGrid" Resource

Project Description

The "MATLAB on the TeraGrid" experimental resource has proven to be an important and unique parallel resource on the TeraGrid for computational science and data analysis. It attracted many users new to TeraGrid and encouraged them to scale up their research problems. The resource provided seamless parallel MATLAB computational services to remote Linux, Mac, or Windows desktops (http://www.cac.cornell.edu/matlab) and Science Gateway users (https://hubzero.org/resources/495) with complex analytic and fast simulation requirements. In a new research collaboration with NVIDIA, Dell, and MathWorks, Cornell is testing the performance of general-purpose GPUs with MATLAB applications. MATLAB GPU computing capabilities include data manipulation on NVIDIA GPUs and the use of multiple GPUs on the desktop via the Parallel Computing Toolbox and on a computer cluster via MATLAB Distributed Computing Server. Testing is occurring on Dell C6100 servers with the C410x PCIe expansion chassis, which supports server connections to NVIDIA Tesla M2070 GPUs. In this poster, we share system configuration information, tips for testing and adapting codes for use with GPUs, and GPU test results for six case studies.

GPU Test Results

Six case studies were initiated to examine the process of adapting existing MATLAB codes to use the new GPU capabilities available in MATLAB 2011a. Four of the six case studies yielded speedups of 2 to 14.7 times over the original MATLAB code. The MATLAB codes selected for analysis included audio signal processing, medical image processing, Monte Carlo methods, and finite element methods. Each case study involved profiling the code to identify candidates for GPU optimization and then applying one or more of the three methods MATLAB 2011a offers for using GPU hardware. The effort required varied from one hour to achieve a 4-times speedup to two weeks to vectorize an existing code and develop custom CUDA kernels, resulting in a 13-times speedup. The case studies demonstrate that MATLAB 2011a provides an excellent framework for helping researchers leverage GPU hardware with a relatively modest amount of effort. Future work will focus on exploring the benefits of using multiple GPUs simultaneously and on developing a set of best practices to help researchers make the best use of the new GPU capability of MATLAB.

Cornell System Configuration

The Cornell system configuration comprises multiple servers: a web server, a Windows HPC Server 2008 head node and compute nodes, a SQL Server, a MyProxy server, and a GridFTP server. These are all connected to DataDirect Networks storage, with 8TB dedicated to this project. A Dell PowerEdge C410x hosts the GPUs, which are connected to Dell PowerEdge C6100s. The GPUs are NVIDIA Tesla M2070s. Authentication and access are through X.509 certificates. Users can seamlessly switch from using their desktop for MATLAB multi-core processes to the cluster using either multi-core or multi-node processing. Currently the software stack includes Windows HPC Server 2008 x64, MATLAB R2011a with the Parallel Computing Toolbox (PCT), CUDA Toolkit, HPC Pack 2008, ActivePerl 5.12.3, Microsoft SDK, Microsoft Visual C++ 2010 SP1 Redistributable Package (x64), and a 3D Video Controller on the GPU compute nodes.

Machine learning and signal analysis techniques may automatically identify species such as warblers from their flight calls. (Image courtesy of the McGill Bird Observatory)

Case Study: Theo Damoulas, a research associate with the NSF-established Institute for Computational Sustainability (ICS) directed by Prof. Carla Gomes, benefited from a 12-times speedup in Dynamic Time Warping (DTW) computation by using a combination of built-in MATLAB GPU functions and CUDA code. DTW is the computationally expensive part of the code, which uses machine learning and signal analysis techniques to automatically identify bird species from their flight calls. Automatic flight call classification is much faster and arguably more accurate than manual classification, and it is the first step in creating large-scale networks of recording stations that can provide a detailed understanding of the migration patterns of individual species. This project is representative of the research of the ICS, whose aim is to provide solutions for balancing environmental, economic, and societal needs for a sustainable future by bringing computational thinking to sustainability research. The ICS is a joint venture involving scientists from Cornell University, Bowdoin College, the Conservation Fund, Howard University, Oregon State University, and the Pacific Northwest National Laboratory.

David Lifka, Eric Chen, Lucia Walle, Susan Mehringer, Steven Lantz, Steven Clark, Pascal Meunier

GPU Technical Specifications

8x NVIDIA Tesla M2070 GPUs
•  All 8 housed in a single Dell C410x PCIe expansion chassis
•  Reconfigurable: 1 to 8 GPUs can be mapped to any of the servers
•  6GB RAM per GPU

2x Dell C6100 = 8 servers in total, each with:
•  2x Intel 5620 Westmere processors = 8 cores per server
•  24GB RAM
•  1x 250GB hard drive
•  Gigabit Ethernet

GPU Peak Rates

8x NVIDIA Tesla M2070 GPUs
•  Single precision total: 8 Tflop/s
•  Double precision total: 4 Tflop/s

64x Intel 5620 Westmere cores
•  Clock rate = 2.4 GHz
•  SSE4 multiply-add = 8 flop/core/cycle for SP, or 4 for DP
•  Single precision total: 1.2 Tflop/s
•  Double precision total: 0.6 Tflop/s

Full System
•  Single precision total: 9.2 Tflop/s
•  Double precision total: 4.6 Tflop/s
•  Nearly equivalent to a 512-core CPU-based system
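The CPU and full-system totals above follow from the quoted core count, clock rate, and flops per cycle. The short sketch below reproduces that arithmetic; the variable names are illustrative, and the GPU totals are simply the poster's rounded figures rather than values derived from the GPU hardware specifications.

```matlab
% Peak-rate arithmetic using only the figures quoted in this section.
cores        = 64;    % 8 servers x 8 cores per server
clockGHz     = 2.4;   % Westmere clock rate
flopPerCycSP = 8;     % SSE4 multiply-add, single precision
flopPerCycDP = 4;     % SSE4 multiply-add, double precision

% CPU peak in Tflop/s: cores x GHz x flop/cycle gives Gflop/s, so divide by 1000.
cpuSP = cores * clockGHz * flopPerCycSP / 1000;
cpuDP = cores * clockGHz * flopPerCycDP / 1000;

% GPU totals as stated on the poster (rounded), in Tflop/s.
gpuSP = 8;
gpuDP = 4;

totalSP = cpuSP + gpuSP;
totalDP = cpuDP + gpuDP;

% Number of such CPU cores that would match the full-system SP peak.
perCoreSP  = clockGHz * flopPerCycSP / 1000;
equivCores = totalSP / perCoreSP;

fprintf('CPU peak:  %.1f SP / %.1f DP Tflop/s\n', cpuSP, cpuDP);
fprintf('Full peak: %.1f SP / %.1f DP Tflop/s\n', totalSP, totalDP);
fprintf('Equivalent CPU cores (SP): about %.0f\n', equivCores);
```

With these inputs the equivalent core count comes out at roughly 480, which is consistent with the "nearly equivalent to a 512-core" statement.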

NVIDIA Tesla GPUs are being used to design the computer-aided diagnosis of breast cancer cells. (Image courtesy of Constantin Friedman, MD and Victor Brodsky, MD, Weill Cornell Medical College)

Case Study: Researchers from Weill Cornell Medical Center, University of Michigan Health System, and Rutgers Laboratory for Computational Imaging and Bioinformatics are currently using the NVIDIA GPUs and MATLAB to accelerate and improve the diagnosis of cancer cells using template matching. Using MATLAB's built-in GPU functions, the researchers experienced a 14.7-times speedup in code processing time (from 86.9 seconds to 5.9 seconds). That is a significant improvement for pathologists who would like to process many large-scale images each day. By comparison, MATLAB code running on GPUs ran 4.8 times faster than code that was implemented in C++ without GPUs. And, because MATLAB is optimized for use with GPUs, users can take advantage of the GPUs' compute power without needing to learn another programming language or leave the MATLAB environment.
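The speedup quoted in this case study can be checked directly from the two timings. The sketch below does that, and also estimates the C++ run time implied by the 4.8-times comparison; that estimate assumes the C++ comparison used the same workload as the 5.9-second GPU run, which the poster does not state.

```matlab
% Timings quoted in the case study, in seconds.
tOriginal = 86.9;   % original MATLAB code
tGpu      = 5.9;    % MATLAB with built-in GPU functions

speedup = tOriginal / tGpu;

% ASSUMPTION: the C++ comparison refers to the same workload as tGpu.
cppFactor   = 4.8;
tCppImplied = tGpu * cppFactor;

fprintf('GPU speedup over original MATLAB: %.1fx\n', speedup);
fprintf('Implied C++ run time (assumption): %.1f s\n', tCppImplied);
```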

MATLAB -> MATLAB + GPU

MATLAB now offers three methods for using an NVIDIA GPU to boost the performance of MATLAB code. The following outlines how to identify MATLAB code that is well suited to GPU optimization and the steps involved in enabling GPU functionality:

1.  Profile code
    MATLAB provides a built-in profile command that creates a visual representation of the bottlenecks in MATLAB code.
2.  Optimize code
    Before using GPU functions, it is best to vectorize code bottlenecks. The provided GPU functions work best when code has already been optimized.
3.  Utilize GPU functions
    There are three methods for using a GPU with MATLAB:
    •  Built-in GPUArray methods
    •  ArrayFun
    •  Executing a CUDA kernel

Built-in GPUArray method: simple demo of an FFT of 100 million random numbers on CPU vs. GPU
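The poster's FFT demo code did not survive extraction, so the following is only a minimal sketch of what a CPU vs. GPU FFT comparison with the built-in GPUArray methods might look like. It assumes the Parallel Computing Toolbox and a supported NVIDIA GPU; the variable names and the timing approach (which includes host-to-GPU transfers) are assumptions, not the authors' code.

```matlab
% Sketch: FFT on the CPU vs. the GPU using built-in gpuArray methods.
% ASSUMPTION: Parallel Computing Toolbox and a supported NVIDIA GPU are present.
N = 1e8;                   % 100 million points; try 1e7 on GPUs with less memory
x = rand(N, 1);            % random input in host memory

tic;
yCpu = fft(x);             % ordinary CPU FFT
tCpu = toc;

tic;
xGpu  = gpuArray(x);       % copy the input to GPU memory
yGpu  = fft(xGpu);         % fft is overloaded for gpuArray and runs on the GPU
yBack = gather(yGpu);      % copy the result back; this also waits for completion
tGpu  = toc;

% Relative difference between the two results as a sanity check.
relErr = norm(yCpu - yBack) / norm(yCpu);

fprintf('CPU: %.2f s, GPU (incl. transfers): %.2f s\n', tCpu, tGpu);
fprintf('Relative difference: %.2e\n', relErr);
```

Because the timed GPU section includes both transfers, the measured ratio will understate the speedup of the FFT itself. Keeping data on the GPU across several operations is the usual way to amortize that cost.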

[Figure panels: "Bottleneck!", "Original", "Vectorized"]
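The profile-then-vectorize steps can be illustrated with a small sketch. The loop below is a hypothetical bottleneck chosen for illustration (it is not taken from any of the case studies); the vectorized line computes the same quantity with a single array expression.

```matlab
% Hypothetical bottleneck: a sum of products written as an explicit loop.
n = 1e6;
a = rand(n, 1);
b = rand(n, 1);

profile on                     % start collecting profiling data

% Original: element-by-element loop.
tic;
loopSum = 0;
for k = 1:n
    loopSum = loopSum + a(k) * b(k);
end
tLoop = toc;

% Vectorized: one array expression replaces the loop.
tic;
vecSum = sum(a .* b);
tVec = toc;

profile off                    % stop profiling
% Run "profile viewer" interactively to see the visual report.

fprintf('Loop: %.4f s, vectorized: %.4f s\n', tLoop, tVec);
fprintf('Difference in results: %.2e\n', abs(loopSum - vecSum));
```

Once code is in the vectorized form, moving it to the GPU is often just a matter of creating the inputs as GPU arrays.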

Research Project Title                                     Built-in  ArrayFun  CUDA  Speed-up
Spatially-Invariant Vector Quantization (SIVQ)             Yes       Yes       No    14.7x
Nirfast                                                    Yes       No        Yes   13x
Automated Flight Call Classification                       Yes       No        Yes   12x
Array Process of Ambient Noise for Geophysical Inversion   Yes       No        No    2x
White Matter Tracts                                        No        No        No    0x
Electron Trajectory Simulation in Hall-Effect Thrusters    No        No        No    0x
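The ArrayFun column refers to applying a scalar MATLAB function element by element on the GPU. The sketch below shows the general pattern under stated assumptions: the function scaledNorm and all variable names are hypothetical, the code is meant to be saved as a function file, and it requires the Parallel Computing Toolbox with a supported NVIDIA GPU. It is not code from any of the projects listed.

```matlab
function maxDiff = gpu_arrayfun_sketch()
% Sketch of the arrayfun method: element-wise scalar code executed on the GPU.
    n = 1e6;
    x = gpuArray(rand(n, 1));            % inputs live in GPU memory
    y = gpuArray(rand(n, 1));

    zGpu = arrayfun(@scaledNorm, x, y);  % scalar function applied per element on the GPU
    z    = gather(zGpu);                 % copy the result back to host memory

    % Reference result computed on the CPU with ordinary vectorized code.
    zRef    = 2 .* sqrt(gather(x).^2 + gather(y).^2);
    maxDiff = max(abs(z - zRef));
    fprintf('Max abs difference vs. CPU: %.2e\n', maxDiff);
end

function r = scaledNorm(a, b)
% Hypothetical scalar kernel: written for one element at a time.
    r = 2 * sqrt(a * a + b * b);
end
```

The CUDA column refers to running hand-written kernels, which MATLAB exposes through parallel.gpu.CUDAKernel and feval. That path needs a compiled PTX file, so it is not sketched here.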

[Figure: system configuration diagram. Labels: GridFTP Server; MyProxy Server; Web Server; SQL Server; Compute Nodes; NVIDIA Tesla M2070s; Head Node; Network Interconnect; GPU Nodes attached to Dell C410x; DDN Storage]