

Speeding Up Computation
Tips Geared Towards R

Adam J. Suarez
Department of Statistics
North Carolina State University

April 10, 2015


Advantages of R

• Because R is an interpreted and interactive language, developing an algorithm in R can be done very quickly.
• The main sacrifice is speed.
• As-is, R is better suited for prototyping, where the final program will eventually be run in a lower-level language like C or Fortran.
• However, the potential exists to speed up much of the computation.
• R code should be seen as modular, where individual components can eventually be swapped out for faster versions when it is time for final runs or producing packages.
• This approach allows the best of both worlds, where R's excellent graphical abilities and user-contributed packages can still be used.


Hardware Considerations

• Faster hardware is the most straightforward way to speed up your computation.
• Statistical/scientific computation can have some special considerations compared to general computing.
• Of special importance to statistical computing is floating-point performance, both single and double precision.
• When choosing hardware, one important factor to consider is theoretical peak FLOPS/sec (FLOPS = FLoating Point OperationS).
• Theoretical FLOPS/sec = (FLOPS/processor cycle) * (processor cycles/sec) * (# of processors); see the sketch below.
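A minimal sketch of this arithmetic in R, using the Intel Haswell figures that appear two slides below (16 DP FLOPS/cycle per core, 6 cores, 4.0 GHz):

    flops_per_cycle <- 16      # DP FLOPS per cycle, per core (Haswell)
    cycles_per_sec  <- 4.0e9   # 4.0 GHz
    n_processors    <- 6       # physical cores
    (flops_per_cycle * cycles_per_sec * n_processors) / 1e9   # 384 DP GFLOPS/sec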


Hardware Considerations - CPU

• Intel and AMD use very different CPU architectures.
• AMD Bulldozer/Piledriver/Steamroller cores are paired and share some resources.
• The important fact for us is that they share the FPU (floating-point unit).
• So, when performing floating-point operations, an 8-core AMD processor acts like a 4-core processor.
• Integer operations are more independent.
• Intel's "cores" are completely independent.


Hardware Considerations - CPU

• Intel and AMD also differ in FLOPS/cycle:
  - Intel Haswell: 16 DP FLOPS/cycle, 32 SP FLOPS/cycle (per core).
  - AMD Bulldozer/Piledriver/Steamroller: 8 DP FLOPS/cycle, 16 SP FLOPS/cycle (per module).
• Example: an Intel i7-5820K 6-core Haswell @ 4.0 GHz has a theoretical peak of 384 DP GFLOPS/sec.
• Example: an AMD FX-8150 8-core Bulldozer @ 4.0 GHz has a theoretical peak of 128 DP GFLOPS/sec.
• Theoretical peak can be much higher than actual peak performance, depending on the problem and implementation.


Hardware Considerations - GPU

• The two main GPU competitors are AMD and NVIDIA.
• In general, GPUs do not have CPUs' 2:1 ratio of single:double precision performance.
• Consumer-level GPUs are often purposely crippled for DP to encourage purchases of workstation-grade parts.
• The top DP performers from each company (consumer lines, single chip):
  - NVIDIA: GTX Titan Black ~ 1700 DP GFLOPS/sec
  - AMD: Radeon R9 280X ~ 870 DP GFLOPS/sec
• The main selling point for the workstation cards is ECC RAM.


Hardware Considerations - GPU

• NVIDIA is much more popular in the supercomputing world, and the libraries for their platform are more developed.
• Computing on an AMD GPU is typically done using OpenCL, which aims to be more general than NVIDIA's CUDA libraries.
• NVIDIA GPUs tend to be more expensive, but are much more user-friendly and more widely supported.
• For large problems, GFLOPS come much less expensively with GPUs than with CPUs, and libraries can now take advantage of multiple GPUs in a system (e.g., cuBLAS-XT).


BLAS Libraries

• By default, R comes with a basic version of BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage).
• It is a good idea to never use these stock shared libraries!
• There are many optimized versions available that can easily be interfaced with R:
  - OpenBLAS (free)
  - Intel MKL (free for students)
  - AMD ACML (free, GPU accelerated)
  - Many others...
• These all have BLAS and LAPACK libraries built in.


BLAS Libraries

• R can be compiled from source and told to build with an external library (recommended for MKL).
• You can also build R with a shared BLAS library and then "drop in" another library. The shared libraries are:

      /R/lib/libRblas.so
      /R/lib/libRlapack.so

• Either replace them (backing up the originals) by copying, or use update-alternatives to switch between them easily.
• If using the alternatives option, make sure that the new library is in the run-time library load path; a quick timing check like the sketch below verifies that the swap took effect.
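A minimal sketch (not from the slides) for verifying a drop-in BLAS: time a BLAS-heavy operation before and after the swap. An optimized, multithreaded BLAS should be dramatically faster than R's reference libRblas.so; the matrix size here is arbitrary.

    n <- 4000
    A <- matrix(rnorm(n * n), n, n)
    system.time(crossprod(A))   # t(A) %*% A, backed by the BLAS (DSYRK/DGEMM)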


BLAS Libraries

• OpenBLAS is the most user-friendly of the shared libraries.
• Intel MKL seems to need the Intel C compiler (icc) to function well.
• ACML 5.3.1 did not have GPU acceleration, and is much faster than ACML 6.1 for non-accelerated calls.
• ACML 6.1 uses .lua scripts to determine whether to offload to the GPU.
• This seems to significantly slow the non-offloaded calls.
• ACML 6.1 also produced errors for me when calling svd in R for large matrices.


BLAS Libraries - NVBLAS

• NVIDIA distributes NVBLAS as part of their CUDA Toolkit.
• It works in a qualitatively different way than the other BLAS libraries.
• It "intercepts" certain level-3 BLAS calls and runs them on the GPU using cuBLAS:
  GEMM, SYRK, HERK, SYR2K, HER2K, TRSM, TRMM, SYMM, HEMM
• It uses a unified memory approach, which means the data is never fully offloaded to the GPU RAM.
• Don't use it on PCIe 2.0!
• You can use any full BLAS library as a default for when it decides not to offload.
• It does not include a LAPACK library (CULA is a separate project).


BLAS Libraries - NVBLAS

• Example NVBLAS configuration file (nvblas.conf):

      NVBLAS_LOGFILE nvblas.log
      NVBLAS_CPU_BLAS_LIB libmkl_rt.so
      NVBLAS_GPU_LIST ALL
      NVBLAS_AUTOPIN_MEM_ENABLED
      # NVBLAS_CPU_RATIO_SGEMM 0.10

• To start R using NVBLAS, you could use the following:

      env LD_PRELOAD=libnvblas.so R
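A minimal sketch for checking that interception is working, assuming R was started with LD_PRELOAD as above: a large matrix multiply maps to DGEMM, one of the intercepted routines, so it should speed up sharply and show activity in nvblas.log.

    n <- 8192
    A <- matrix(rnorm(n * n), n, n)
    system.time(A %*% A)   # DGEMM: intercepted by NVBLAS and run via cuBLAS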


BLAS Libraries

[Figures: performance comparison of the BLAS libraries]


Calling C from R

• There are two ways to call C functions from R: .C and .Call/.External.
• .C requires void functions with pointer arguments. Example:

      void my_function(int *a, double *b)

• .Call requires functions that both take and return SEXP values (S EXpression Pointer). Example:

      SEXP my_function(SEXP a, SEXP b)

• .Call is much faster than .C and is more in the style of "hacking" R.
• It allows you to create and use R objects directly; the sketch below shows how each interface is invoked from R.
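A minimal sketch of the R side of each interface (names hypothetical), assuming my_function has been compiled into a shared library mylib.so:

    dyn.load("mylib.so")
    # .C copies each argument and returns the (possibly modified) copies as a list:
    res <- .C("my_function", a = as.integer(1), b = as.double(0))
    # .Call passes the arguments as SEXPs and returns whatever the C function returns:
    out <- .Call("my_function", 1L, 0.0)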


Calling C from R - .Call

• Any SEXP values that you create must be protected from R's garbage collector using PROTECT.
• At the end, you need to use UNPROTECT(N);, where N is the number of previous protecting statements.
• For a single number, convert from an SEXP value using the functions asInteger, asReal, etc.
• To get the pointer to the numeric part of an SEXP value, use the functions REAL, INTEGER, etc.
• If you don't want to return a value, return R_NilValue.
• Don't alter SEXP objects passed as arguments! Use duplicate first, then alter the copy.


GSL/OpenMP Example

• The goal of this function is to take advantage of a multicore processor when generating random variables, in this case standard normals.
• For random number generation, we will use the GSL (GNU Scientific Library) and its ziggurat implementation.
• We will also use OpenMP directives to easily parallelize our method.
• The main issue that needs care is creating a separate RNG for each possible thread, to ensure that the sequences still appear random.


GSL/OpenMP Example - Initializing RNG

    /* Global state: one GSL RNG per OpenMP thread */
    const gsl_rng_type *GSL_rng_t;
    gsl_rng **GSL_rng;
    int GSL_nt;

    SEXP INIT_GSL_RNG(SEXP SEED) {
        int j, seed = asInteger(SEED), i;
        GSL_nt = omp_get_max_threads();
        gsl_rng_env_setup();
        GSL_rng_t = gsl_rng_mt19937;          /* Mersenne Twister */
        GSL_rng = (gsl_rng **) malloc(GSL_nt * sizeof(gsl_rng *));
        omp_set_num_threads(GSL_nt);
        ...


GSL/OpenMP Example - Initializing RNG (continued)

        ...
        /* Allocate and seed one RNG per thread, with distinct seeds */
        #pragma omp parallel for private(i) shared(GSL_rng, GSL_rng_t) schedule(static, 1)
        for (j = 0; j < GSL_nt; j++) {
            i = omp_get_thread_num();
            GSL_rng[i] = gsl_rng_alloc(GSL_rng_t);
            gsl_rng_set(GSL_rng[i], seed + i);
        }
        return R_NilValue;
    }


GSL/OpenMP Example - Core Function

    void generate_normal(double *out_v, int n, int nt) {
        int j;
        /* Each thread draws from its own RNG stream */
        #pragma omp parallel for shared(out_v, GSL_rng) num_threads(nt)
        for (j = 0; j < n; j++) {
            out_v[j] = gsl_ran_gaussian_ziggurat(GSL_rng[omp_get_thread_num()], 1.0);
        }
    }


GSL/OpenMP Example - Wrapper for .Call

    SEXP rnorm_gsl(SEXP N, SEXP NT)
    {
        int n = asInteger(N), nt = asInteger(NT);
        SEXP result = PROTECT(allocVector(REALSXP, n));
        double *out_v = REAL(result);

        generate_normal(out_v, n, nt);

        UNPROTECT(1);
        return result;
    }


GSL/OpenMP Example - Typical R Call

• The shared library can be compiled using a command like:

      gcc -fPIC -shared -fopenmp -O3 -march=native gslrand.c -o gslrand.so -lgsl

• The function can now be used in R:

      dyn.load("gslrand.so")
      .Call("INIT_GSL_RNG", 1)
      a <- .Call("rnorm_gsl", 20, 1)
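To see the benefit, a minimal sketch is to time the parallel generator against base R's rnorm (the thread count here is arbitrary; exact numbers depend on the machine):

    n <- 1e7
    system.time(rnorm(n))                    # base R, single-threaded
    system.time(.Call("rnorm_gsl", n, 4))    # GSL ziggurat on 4 threads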


GSL/OpenMP Example - Performance

[Figure: timing comparison for the parallel normal generator]


CUDA Example

• The goal of this function is to use the GPU to generate multivariate normal random variables.
• Simply generating univariate normals on the GPU is not nearly as efficient as the previous example; but if we also do the Cholesky factorization and matrix multiplication on the GPU, we see benefits at lower dimensions more quickly.
• NVIDIA includes cuRAND in the CUDA Toolkit for random number generation.
• CULA is a LAPACK library built using CUDA that can be obtained freely for academic use (requires registration).


CUDA Example - Initialization

    curandGenerator_t CURAND_gen;
    cublasHandle_t handle;

    SEXP INIT_CURAND_RNG(SEXP SEED) {
        /* Seeded Mersenne Twister (MTGP32) generator on the GPU */
        curandCreateGenerator(&CURAND_gen, CURAND_RNG_PSEUDO_MTGP32);
        curandSetPseudoRandomGeneratorSeed(CURAND_gen, asInteger(SEED));

        culaInitialize();           /* CULA: LAPACK on CUDA */
        cublasCreate_v2(&handle);   /* cuBLAS context */

        return R_NilValue;
    }


CUDA Example - Core Function

    SEXP rmvnorm_cuda(SEXP N, SEXP M, SEXP SIGMA)
    {
        size_t n = asInteger(N), m = asInteger(M);
        double *devData, *dev_sigma;
        cudaMalloc((void **) &devData, n * m * sizeof(double));
        cudaMalloc((void **) &dev_sigma, m * m * sizeof(double));
        SEXP result = PROTECT(allocMatrix(REALSXP, n, m)),
             SIGMA2 = PROTECT(duplicate(SIGMA));   /* don't alter the argument */
        double *hostData = REAL(result), *sigma = REAL(SIGMA2), alpha = 1.0;
        cudaMemcpy(dev_sigma, sigma, m * m * sizeof(double), cudaMemcpyHostToDevice);
        ...


CUDA Example - Core Function (continued)

        ...
        /* Cholesky factor (lower triangle) of Sigma, in place on the GPU */
        culaDeviceDpotrf('L', m, dev_sigma, m);
        /* n*m standard normal draws */
        curandGenerateNormalDouble(CURAND_gen, devData, n * m, 0.0, 1.0);
        /* result = Z * L^T, so each row has covariance Sigma */
        cublasDtrmm_v2(handle, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                       CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT, n, m, &alpha,
                       dev_sigma, m, devData, n, devData, n);
        cudaMemcpy(hostData, devData, n * m * sizeof(double), cudaMemcpyDeviceToHost);
        cudaFree(devData);
        cudaFree(dev_sigma);
        UNPROTECT(2);
        return result;
    }


CUDA Example - Typical R Call

• The shared library can be compiled using a command like:

      gcc -fPIC -shared -O3 -march=native curand.c -o cudanorm.so -lcudart -lcublas -lcurand -lcula_lapack

• The function can now be used in R:

      dyn.load("cudanorm.so")
      .Call("INIT_CURAND_RNG", 1)
      a <- .Call("rmvnorm_cuda", N, nrow(Sigma), Sigma)
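For comparison, a minimal sketch of the equivalent CPU-side draw in plain R (dimensions arbitrary): Z %*% chol(Sigma) has covariance Sigma, since chol returns the upper Cholesky factor.

    N <- 10000
    Sigma <- diag(500)
    system.time(x_cpu <- matrix(rnorm(N * nrow(Sigma)), N) %*% chol(Sigma))
    system.time(x_gpu <- .Call("rmvnorm_cuda", N, nrow(Sigma), Sigma))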


CUDA Example

[Figure: timing results for the CUDA multivariate normal generator]


    Thank You for Listening!

    I will try to make my C code available on the SLG website, along with this presentation.
