

Speeding Up Computation
Tips Geared Towards R

Adam J. Suarez
Department of Statistics
North Carolina State University

April 10, 2015


Advantages of R

• Because R is an interpreted and interactive language, developing an algorithm in R can be done very quickly.
• The main sacrifice is speed.
• As-is, R is better suited for prototyping, where the final program will eventually be run in a lower-level language like C or Fortran.
• However, the potential exists to speed up much of the computation.
• R code should be seen as modular, where individual components can eventually be swapped out for faster versions when it is time for final runs or producing packages.
• This approach allows the best of both worlds, where R's excellent graphical abilities and user-contributed packages can still be used.


Hardware Considerations

• Faster hardware is the most straightforward way to speed up your computation.
• Statistical/scientific computation can have some special considerations compared to general computing.
• Of special importance to statistical computing is floating-point performance, both single and double precision.
• When choosing hardware, one important factor to consider is theoretical peak FLOPS/sec (FLOPS = FLoating Point OperationS).
• Theoretical FLOPS/sec = (FLOPS/processor cycle) * (processor cycles/sec) * (# of processors); see the sketch below.
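A minimal sketch of this arithmetic in R, using the Intel Haswell figures that appear two slides below (16 DP FLOPS/cycle per core, 6 cores, 4.0 GHz):

    flops_per_cycle <- 16      # DP FLOPS per cycle, per core (Haswell)
    cycles_per_sec  <- 4.0e9   # 4.0 GHz
    n_processors    <- 6       # physical cores
    (flops_per_cycle * cycles_per_sec * n_processors) / 1e9   # 384 DP GFLOPS/sec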


Hardware Considerations - CPU

• Intel and AMD use very different CPU architectures.
• AMD Bulldozer/Piledriver/Steamroller cores are paired and share some resources.
• The important fact for us is that they share the FPU (floating-point unit).
• So, when performing floating-point operations, an 8-core AMD processor acts like a 4-core processor.
• Integer operations are more independent.
• Intel's "cores" are completely independent.


Hardware Considerations - CPU

• Intel and AMD also differ in FLOPS/cycle:
  - Intel Haswell: 16 DP FLOPS/cycle, 32 SP FLOPS/cycle (per core).
  - AMD Bulldozer/Piledriver/Steamroller: 8 DP FLOPS/cycle, 16 SP FLOPS/cycle (per module).
• Example: an Intel i7-5820K 6-core Haswell @ 4.0 GHz has a theoretical peak of 384 DP GFLOPS/sec.
• Example: an AMD FX-8150 8-core Bulldozer @ 4.0 GHz has a theoretical peak of 128 DP GFLOPS/sec.
• Theoretical peak can be much higher than actual peak performance, depending on the problem and implementation.


Hardware Considerations - GPU

• The two main GPU competitors are AMD and NVIDIA.
• In general, GPUs do not have CPUs' 2:1 ratio of single:double precision performance.
• Consumer-level GPUs are often purposely crippled for DP to encourage purchases of workstation-grade parts.
• The top DP performers from each company (consumer lines, single chip):
  - NVIDIA: GTX Titan Black ~ 1700 DP GFLOPS/sec
  - AMD: Radeon R9 280X ~ 870 DP GFLOPS/sec
• The main selling point for the workstation cards is ECC RAM.


Hardware Considerations - GPU

• NVIDIA is much more popular in the supercomputing world, and the libraries for their platform are more developed.
• Computing on an AMD GPU is typically done using OpenCL, which aims to be more general than NVIDIA's CUDA libraries.
• NVIDIA GPUs tend to be more expensive, but are much more user-friendly and more widely supported.
• For large problems, GFLOPS come much less expensively with GPUs than with CPUs, and libraries can now take advantage of multiple GPUs in a system (e.g., cuBLAS-XT).


BLAS Libraries

• By default, R comes with a basic version of BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage).
• It is a good idea to never use these stock shared libraries!
• There are many optimized versions available that can easily be interfaced with R:
  - OpenBLAS (free)
  - Intel MKL (free for students)
  - AMD ACML (free, GPU accelerated)
  - Many others...
• These all have BLAS and LAPACK libraries built in.


BLAS Libraries

• R can be compiled from source and told to build with an external library (recommended for MKL).
• You can also build R with a shared BLAS library and then "drop in" another library. The shared libraries are:

      /R/lib/libRblas.so
      /R/lib/libRlapack.so

• Either replace them (backing up the originals) by copying, or use update-alternatives to switch between them easily.
• If using the alternatives option, make sure that the new library is in the run-time library load path; a quick timing check like the sketch below verifies that the swap took effect.
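A minimal sketch (not from the slides) for verifying a drop-in BLAS: time a BLAS-heavy operation before and after the swap. An optimized, multithreaded BLAS should be dramatically faster than R's reference libRblas.so; the matrix size here is arbitrary.

    n <- 4000
    A <- matrix(rnorm(n * n), n, n)
    system.time(crossprod(A))   # t(A) %*% A, backed by the BLAS (DSYRK/DGEMM)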


BLAS Libraries

• OpenBLAS is the most user-friendly of the shared libraries.
• Intel MKL seems to need the Intel C compiler (icc) to function well.
• ACML 5.3.1 did not have GPU acceleration, and is much faster than ACML 6.1 for non-accelerated calls.
• ACML 6.1 uses .lua scripts to determine whether to offload to the GPU.
• This seems to significantly slow the non-offloaded calls.
• ACML 6.1 also produced errors for me when calling svd in R for large matrices.


BLAS Libraries - NVBLAS

• NVIDIA distributes NVBLAS as part of their CUDA Toolkit.
• It works in a qualitatively different way than the other BLAS libraries.
• It "intercepts" certain level-3 BLAS calls and runs them on the GPU using cuBLAS:
  GEMM, SYRK, HERK, SYR2K, HER2K, TRSM, TRMM, SYMM, HEMM
• It uses a unified memory approach, which means the data is never fully offloaded to the GPU RAM.
• Don't use it on PCIe 2.0!
• You can use any full BLAS library as a default for when it decides not to offload.
• It does not include a LAPACK library (CULA is a separate project).


BLAS Libraries - NVBLAS

• Example NVBLAS configuration file (nvblas.conf):

      NVBLAS_LOGFILE nvblas.log
      NVBLAS_CPU_BLAS_LIB libmkl_rt.so
      NVBLAS_GPU_LIST ALL
      NVBLAS_AUTOPIN_MEM_ENABLED
      # NVBLAS_CPU_RATIO_SGEMM 0.10

• To start R using NVBLAS, you could use the following:

      env LD_PRELOAD=libnvblas.so R
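A minimal sketch for checking that interception is working, assuming R was started with LD_PRELOAD as above: a large matrix multiply maps to DGEMM, one of the intercepted routines, so it should speed up sharply and show activity in nvblas.log.

    n <- 8192
    A <- matrix(rnorm(n * n), n, n)
    system.time(A %*% A)   # DGEMM: intercepted by NVBLAS and run via cuBLAS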


BLAS Libraries

[Figures: performance comparison of the BLAS libraries]


Calling C from R

• There are two ways to call C functions from R: .C and .Call/.External.
• .C requires void functions with pointer arguments. Example:

      void my_function(int *a, double *b)

• .Call requires functions that both take and return SEXP values (S EXpression Pointer). Example:

      SEXP my_function(SEXP a, SEXP b)

• .Call is much faster than .C and is more in the style of "hacking" R.
• It allows you to create and use R objects directly; the sketch below shows how each interface is invoked from R.
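A minimal sketch of the R side of each interface (names hypothetical), assuming my_function has been compiled into a shared library mylib.so:

    dyn.load("mylib.so")
    # .C copies each argument and returns the (possibly modified) copies as a list:
    res <- .C("my_function", a = as.integer(1), b = as.double(0))
    # .Call passes the arguments as SEXPs and returns whatever the C function returns:
    out <- .Call("my_function", 1L, 0.0)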


Calling C from R - .Call

• Any SEXP values that you create must be protected from R's garbage collector using PROTECT.
• At the end, you need to use UNPROTECT(N);, where N is the number of previous protecting statements.
• For a single number, convert from an SEXP value using the functions asInteger, asReal, etc.
• To get the pointer to the numeric part of an SEXP value, use the functions REAL, INTEGER, etc.
• If you don't want to return a value, return R_NilValue.
• Don't alter SEXP objects passed as arguments! Use duplicate first, then alter the copy.


GSL/OpenMP Example

• The goal of this function is to take advantage of a multicore processor when generating random variables, in this case standard normals.
• For random number generation, we will use the GSL (GNU Scientific Library) and its ziggurat implementation.
• We will also use OpenMP directives to easily parallelize our method.
• The main issue that needs care is creating a separate RNG for each possible thread, to ensure that the sequences still appear random.


GSL/OpenMP Example - Initializing RNG

    /* Global state: one GSL RNG per OpenMP thread */
    const gsl_rng_type *GSL_rng_t;
    gsl_rng **GSL_rng;
    int GSL_nt;

    SEXP INIT_GSL_RNG(SEXP SEED) {
        int j, seed = asInteger(SEED), i;
        GSL_nt = omp_get_max_threads();
        gsl_rng_env_setup();
        GSL_rng_t = gsl_rng_mt19937;          /* Mersenne Twister */
        GSL_rng = (gsl_rng **) malloc(GSL_nt * sizeof(gsl_rng *));
        omp_set_num_threads(GSL_nt);
        ...


GSL/OpenMP Example - Initializing RNG (continued)

        ...
        /* Allocate and seed one RNG per thread, with distinct seeds */
        #pragma omp parallel for private(i) shared(GSL_rng, GSL_rng_t) schedule(static, 1)
        for (j = 0; j < GSL_nt; j++) {
            i = omp_get_thread_num();
            GSL_rng[i] = gsl_rng_alloc(GSL_rng_t);
            gsl_rng_set(GSL_rng[i], seed + i);
        }
        return R_NilValue;
    }


GSL/OpenMP Example - Core Function

    void generate_normal(double *out_v, int n, int nt) {
        int j;
        /* Each thread draws from its own RNG stream */
        #pragma omp parallel for shared(out_v, GSL_rng) num_threads(nt)
        for (j = 0; j < n; j++) {
            out_v[j] = gsl_ran_gaussian_ziggurat(GSL_rng[omp_get_thread_num()], 1.0);
        }
    }


GSL/OpenMP Example - Wrapper for .Call

    SEXP rnorm_gsl(SEXP N, SEXP NT)
    {
        int n = asInteger(N), nt = asInteger(NT);
        SEXP result = PROTECT(allocVector(REALSXP, n));
        double *out_v = REAL(result);

        generate_normal(out_v, n, nt);

        UNPROTECT(1);
        return result;
    }


GSL/OpenMP Example - Typical R Call

• The shared library can be compiled using a command like:

      gcc -fPIC -shared -fopenmp -O3 -march=native gslrand.c -o gslrand.so -lgsl

• The function can now be used in R:

      dyn.load("gslrand.so")
      .Call("INIT_GSL_RNG", 1)
      a <- .Call("rnorm_gsl", 20, 1)
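To see the benefit, a minimal sketch is to time the parallel generator against base R's rnorm (the thread count here is arbitrary; exact numbers depend on the machine):

    n <- 1e7
    system.time(rnorm(n))                    # base R, single-threaded
    system.time(.Call("rnorm_gsl", n, 4))    # GSL ziggurat on 4 threads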


GSL/OpenMP Example - Performance

[Figure: timing comparison for the parallel normal generator]


CUDA Example

• The goal of this function is to use the GPU to generate multivariate normal random variables.
• Simply generating univariate normals on the GPU is not nearly as efficient as the previous example; but if we also do the Cholesky factorization and matrix multiplication on the GPU, we see benefits at lower dimensions more quickly.
• NVIDIA includes cuRAND in the CUDA Toolkit for random number generation.
• CULA is a LAPACK library built using CUDA that can be obtained freely for academic use (requires registration).


CUDA Example - Initialization

    curandGenerator_t CURAND_gen;
    cublasHandle_t handle;

    SEXP INIT_CURAND_RNG(SEXP SEED) {
        /* Seeded Mersenne Twister (MTGP32) generator on the GPU */
        curandCreateGenerator(&CURAND_gen, CURAND_RNG_PSEUDO_MTGP32);
        curandSetPseudoRandomGeneratorSeed(CURAND_gen, asInteger(SEED));

        culaInitialize();           /* CULA: LAPACK on CUDA */
        cublasCreate_v2(&handle);   /* cuBLAS context */

        return R_NilValue;
    }


CUDA Example - Core Function

    SEXP rmvnorm_cuda(SEXP N, SEXP M, SEXP SIGMA)
    {
        size_t n = asInteger(N), m = asInteger(M);
        double *devData, *dev_sigma;
        cudaMalloc((void **) &devData, n * m * sizeof(double));
        cudaMalloc((void **) &dev_sigma, m * m * sizeof(double));
        SEXP result = PROTECT(allocMatrix(REALSXP, n, m)),
             SIGMA2 = PROTECT(duplicate(SIGMA));   /* don't alter the argument */
        double *hostData = REAL(result), *sigma = REAL(SIGMA2), alpha = 1.0;
        cudaMemcpy(dev_sigma, sigma, m * m * sizeof(double), cudaMemcpyHostToDevice);
        ...


CUDA Example - Core Function (continued)

        ...
        /* Cholesky factor (lower triangle) of Sigma, in place on the GPU */
        culaDeviceDpotrf('L', m, dev_sigma, m);
        /* n*m standard normal draws */
        curandGenerateNormalDouble(CURAND_gen, devData, n * m, 0.0, 1.0);
        /* result = Z * L^T, so each row has covariance Sigma */
        cublasDtrmm_v2(handle, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                       CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT, n, m, &alpha,
                       dev_sigma, m, devData, n, devData, n);
        cudaMemcpy(hostData, devData, n * m * sizeof(double), cudaMemcpyDeviceToHost);
        cudaFree(devData);
        cudaFree(dev_sigma);
        UNPROTECT(2);
        return result;
    }


CUDA Example - Typical R Call

• The shared library can be compiled using a command like:

      gcc -fPIC -shared -O3 -march=native curand.c -o cudanorm.so -lcudart -lcublas -lcurand -lcula_lapack

• The function can now be used in R:

      dyn.load("cudanorm.so")
      .Call("INIT_CURAND_RNG", 1)
      a <- .Call("rmvnorm_cuda", N, nrow(Sigma), Sigma)
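For comparison, a minimal sketch of the equivalent CPU-side draw in plain R (dimensions arbitrary): Z %*% chol(Sigma) has covariance Sigma, since chol returns the upper Cholesky factor.

    N <- 10000
    Sigma <- diag(500)
    system.time(x_cpu <- matrix(rnorm(N * nrow(Sigma)), N) %*% chol(Sigma))
    system.time(x_gpu <- .Call("rmvnorm_cuda", N, nrow(Sigma), Sigma))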


CUDA Example

[Figure: timing results for the CUDA multivariate normal generator]


    Thank You for Listening!

    I will try to make my C code available on the SLG website, along with this presentation.
