R and C++ [email protected] @romainfrancois Romain François


R and C++. Slides from my talk at the R meetup in Copenhagen.


Page 1: R and C++

R and C++

[email protected] @romainfrancois

Romain François

Page 2: R and C++

Topics

• Rcpp

• dplyr

• Rcpp98, Rcpp11

Page 3: R and C++

Rcpp

Page 4: R and C++

54 releases since 2008

Page 5: R and C++

0.10.6 currently

0.10.7 out soon; it may instead be called 0.11.0, or perhaps 1.0.0

Page 6: R and C++

172 CRAN packages directly depend* on it

Page 7: R and C++

97 163 lines of code (*.cpp + *.h)

Page 8: R and C++
Page 9: R and C++

int add( int a, int b){ return a + b ; }

Page 10: R and C++
Page 11: R and C++

    #include <Rcpp.h>

    // [[Rcpp::export]]
    int add( int a, int b){
        return a + b ;
    }

Page 12: R and C++

A bridge between R and C++

Page 13: R and C++

sourceCpp

    #include <Rcpp.h>

    // [[Rcpp::export]]
    int add( int a, int b){
        return a + b ;
    }

    > sourceCpp( "add.cpp" )
    > add( 1, 2 )
    [1] 3

Page 14: R and C++

R data

• vectors: NumericVector, IntegerVector, …

• lists : List

• functions: Function

• environments: Environment

Page 15: R and C++

Key design decision

Rcpp objects are proxy objects to the underlying R data structure

No additional memory
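The proxy idea can be sketched in plain C++ without R. This is only an illustration under stated assumptions: `NumericProxy` and `double_in_place` are hypothetical names, and a `std::vector` stands in for the R-owned data that an Rcpp object would wrap.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a proxy type: it stores a pointer to data owned
// elsewhere (here a std::vector, in Rcpp the R object) and never copies it.
class NumericProxy {
public:
    explicit NumericProxy(std::vector<double>& storage)
        : data_(storage.data()), size_(storage.size()) {}

    std::size_t size() const { return size_; }

    // Element access reads and writes the original storage directly.
    double& operator[](std::size_t i) {
        assert(i < size_);
        return data_[i];
    }

    const double* data() const { return data_; }

private:
    double* data_;      // borrowed, not owned
    std::size_t size_;
};

// Doubles every element through the proxy; the caller's vector is modified
// even though the proxy itself is passed by value.
inline void double_in_place(NumericProxy x) {
    for (std::size_t i = 0; i < x.size(); ++i) x[i] *= 2.0;
}
```

Copying the proxy copies a pointer and a length, not the elements, which is the sense in which no additional memory is needed.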

Page 16: R and C++

Example: Vector

    // [[Rcpp::export]]
    double sum( NumericVector x){
        int n = x.size() ;

        double res = 0.0 ;
        for( int i=0; i<n; i++){
            res += x[i] ;
        }

        return res ;
    }

Page 17: R and C++

Example: List

    List res = List::create( _["a"] = 1, _["b"] = "foo" ) ;
    res.attr( "class" ) = "myclass" ;

    int a = res["a"] ;
    res["b"] = 42 ;

Page 18: R and C++

Example: Function

    Function rnorm( "rnorm" ) ;
    NumericVector x = rnorm( 10, _["mean"] = 30, _["sd"] = 100 ) ;

Page 19: R and C++

Benchmark

    n <- length(x)
    m <- 0.0
    for( i in 1:n ){
        m <- m + x[i]^2 / n
    }

Page 20: R and C++

Benchmark

    m <- mean( x^2 )

Page 21: R and C++

Benchmark

    #include <Rcpp.h>
    using namespace Rcpp ;

    double square(double x){ return x*x ; }

    // [[Rcpp::export]]
    double fun( NumericVector x){
        int n = x.size() ;
        double res = 0.0 ;
        for( int i=0; i<n; i++){
            res += square(x[i]) / n ;
        }
        return res ;
    }

Page 22: R and C++

Benchmark

Execution times (microseconds)

                  10 000    100 000    1 000 000
    Dumb R          1008     10 214      104 000
    Vectorized R      24        125        1 021
    C++               13         80          709

Page 23: R and C++

Benchmark

    m <- mean( x^2 )

Page 24: R and C++

C++ data structures Modules

Page 25: R and C++

The usual bank account example

    class Account {
    private:
        double balance ;

    public:
        Account( ) : balance(0){}

        double get_balance(){ return balance ; }

        void withdraw(double x){ balance -= x ; }

        void deposit(double x ){ balance += x ; }
    } ;

    RCPP_MODULE(BankAccount){
        class_<Account>( "Account" )
            .constructor()

            .property( "balance", &Account::get_balance )

            .method( "deposit", &Account::deposit)
            .method( "withdraw", &Account::withdraw)
        ;
    }

    account <- new( Account )
    account$deposit( 1000 )
    account$balance
    account$withdraw( 200 )
    account$balance
    account$balance <- 200

Page 26: R and C++

Packages

Rcpp.package.skeleton

compileAttributes

devtools::load_all

Page 27: R and C++

Rcpp.package.skeleton

Extension of package.skeleton

Adds Rcpp-specific artefacts and code examples

> Rcpp.package.skeleton( "cph" )

Page 28: R and C++

Edit your .cpp files

    // [[Rcpp::export]]
    int add( int a, int b){
        return a + b ;
    }

Then devtools::load_all

This updates the generated C++ and R code

Page 29: R and C++

dplyr

Page 30: R and C++

dplyr

• Package by Hadley Wickham

• plyr specialised for data frames: faster & with remote data stores

• Great design and syntax

• Great performance thanks to C++

Page 31: R and C++

arrange

ex: Arrange by year within each player

    arrange(Batting, playerID, yearID)

    Unit: milliseconds
       expr       min        lq   median        uq       max neval
         df 186.64016 188.48495 190.8989 192.42140 195.36592    10
         dt 349.25496 352.12806 357.4358 403.45465 405.30055    10
        cpp  12.20485  13.85538  14.0081  16.72979  23.95173    10
       base 181.68259 182.58014 184.6904 186.33794 189.70377    10
     dt_raw 166.94213 170.15704 170.6418 220.89911 223.42155    10

Page 32: R and C++

filter

Find the year for which each player played the most games

    filter(Batting, G == max(G))

    Unit: milliseconds
     expr       min        lq    median        uq      max neval
       df 371.96066 375.98652 380.92300 389.78870 430.2898    10
       dt  47.37897  49.39681  51.23722  52.79181  95.8757    10
      cpp  34.63382  35.27462  36.48151  38.30672 106.2422    10
     base 141.81983 144.87670 147.36940 148.67299 173.8763    10

Page 33: R and C++

summarise

Compute the average number of at bats for each player

    summarise(x, ab = mean(AB))

    Unit: microseconds
       expr        min         lq     median         uq        max neval
         df 470726.569 475168.481 495500.076 498223.152 502601.494    10
         dt  23002.422  23923.691  25888.191  28517.318  28683.864    10
        cpp    756.265    820.921    838.529    864.624    950.079    10
       base 253189.624 259167.496 263124.650 273097.845 326663.243    10
     dt_raw  22462.560  23469.528  24438.422  25718.549  28385.158    10

Page 34: R and C++

Vector Visitor

    class VectorVisitor {
    public:
        virtual ~VectorVisitor(){}

        /** hash the element of the visited vector at index i */
        virtual size_t hash(int i) const = 0 ;

        /** are the elements at indices i and j equal */
        virtual bool equal(int i, int j) const = 0 ;

        /** creates a new vector, of the same type as the visited vector, by
         *  copying elements at the given indices
         */
        virtual SEXP subset( const Rcpp::IntegerVector& index ) const = 0 ;
    } ;

Traversing an R vector of any type with the same interface

Page 35: R and C++

Vector Visitor

    inline VectorVisitor* visitor( SEXP vec ){
        switch( TYPEOF(vec) ){
            case INTSXP:
                if( Rf_inherits(vec, "factor" ))
                    return new FactorVisitor( vec ) ;
                return new VectorVisitorImpl<INTSXP>( vec ) ;
            case REALSXP:
                if( Rf_inherits( vec, "Date" ) )
                    return new DateVisitor( vec ) ;
                if( Rf_inherits( vec, "POSIXct" ) )
                    return new POSIXctVisitor( vec ) ;
                return new VectorVisitorImpl<REALSXP>( vec ) ;
            case LGLSXP: return new VectorVisitorImpl<LGLSXP>( vec ) ;
            case STRSXP: return new VectorVisitorImpl<STRSXP>( vec ) ;
            default: break ;
        }
        // should not happen
        return 0 ;
    }
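To show why a single interface is useful, here is a standalone sketch of the same pattern in plain C++. It is not dplyr's code: `Visitor`, `VisitorImpl` and `count_distinct` are hypothetical names, `std::vector<T>` stands in for R vectors, and the R-specific `subset` method is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Same interface idea as the slide, minus the R-specific subset() method.
class Visitor {
public:
    virtual ~Visitor() {}
    virtual std::size_t hash(int i) const = 0;
    virtual bool equal(int i, int j) const = 0;
    virtual int size() const = 0;
};

// One implementation per element type, generated from a template.
template <typename T>
class VisitorImpl : public Visitor {
public:
    explicit VisitorImpl(const std::vector<T>& v) : v_(v) {}
    std::size_t hash(int i) const override { return std::hash<T>()(v_[i]); }
    bool equal(int i, int j) const override { return v_[i] == v_[j]; }
    int size() const override { return static_cast<int>(v_.size()); }
private:
    const std::vector<T>& v_;   // borrowed: the caller keeps the vector alive
};

// Type-agnostic algorithm: counts distinct elements using only the interface.
// The set stores indices; hashing and comparison are delegated to the visitor.
inline int count_distinct(const Visitor& vis) {
    assert(vis.size() >= 0);
    auto hasher = [&vis](int i) { return vis.hash(i); };
    auto same = [&vis](int i, int j) { return vis.equal(i, j); };
    std::unordered_set<int, decltype(hasher), decltype(same)> seen(16, hasher, same);
    for (int i = 0; i < vis.size(); ++i) seen.insert(i);
    return static_cast<int>(seen.size());
}
```

`count_distinct` is written once and works for integers, strings, or any other element type that has a visitor, which is the property grouping and joining code relies on.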

Page 36: R and C++

Chunked evaluation

• R expression to evaluate: mean(Sepal.Length)

• Sepal.Length ∊ iris

• dplyr knows mean.

• fast and memory efficient algorithm

    ir <- group_by( iris, Species)
    summarise(ir, Sepal.Length = mean(Sepal.Length) )
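A rough idea of what a fast, memory-efficient grouped mean can look like, sketched in plain C++. This is an assumption-laden illustration rather than dplyr's implementation: `grouped_mean` is a hypothetical helper, and two `std::vector`s stand in for the value column and the grouping column.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Computes the mean of `values` within each group in a single pass, reading
// the column in place: no per-group copy of the data is made.
// grouped_mean is a hypothetical helper, not a dplyr function.
inline std::map<std::string, double> grouped_mean(
    const std::vector<double>& values,
    const std::vector<std::string>& groups) {
    assert(values.size() == groups.size());

    std::map<std::string, double> sums;
    std::map<std::string, std::size_t> counts;
    for (std::size_t i = 0; i < values.size(); ++i) {
        sums[groups[i]] += values[i];
        counts[groups[i]] += 1;
    }

    std::map<std::string, double> means;
    for (const auto& kv : sums) {
        means[kv.first] = kv.second / static_cast<double>(counts[kv.first]);
    }
    return means;
}
```

The contrast is with evaluating the R expression once per group on a freshly subsetted copy of the column, which allocates memory for every group.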

Page 37: R and C++

Hybrid evaluation

    myfun <- function(x) x+x
    ir <- group_by( iris, Species)
    summarise(ir, xxx = mean(Sepal.Length) + min(Sepal.Width) - myfun(Sepal.Length) )

#1: fast evaluation of mean(Sepal.Length). 5.006 + min(Sepal.Width) - myfun(Sepal.Length)

#2: fast evaluation of min(Sepal.Width). 5.006 + 3.428 - myfun(Sepal.Length)

#3: fast evaluation of 5.006 + 3.428. 8.434 - myfun(Sepal.Length)

#4: R evaluation of 8.434 - myfun(Sepal.Length).
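The four steps above can be mimicked with a small expression tree in plain C++. This is a sketch under stated assumptions, not dplyr's hybrid evaluator: `Expr`, `constant`, `column`, `call` and `reduce` are hypothetical names, and only `mean`, `min`, `+` and `-` are treated as known.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <memory>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

struct Expr;
typedef std::shared_ptr<Expr> ExprPtr;
typedef std::map<std::string, std::vector<double>> Columns;

// Hypothetical expression node: a number, a column reference, or a call.
struct Expr {
    enum Kind { Constant, Column, Call };
    Kind kind;
    double value;               // used by Constant
    std::string name;           // column name or function name
    std::vector<ExprPtr> args;  // used by Call
};

inline ExprPtr constant(double v) {
    return std::make_shared<Expr>(Expr{Expr::Constant, v, "", {}});
}
inline ExprPtr column(const std::string& n) {
    return std::make_shared<Expr>(Expr{Expr::Column, 0.0, n, {}});
}
inline ExprPtr call(const std::string& f, std::vector<ExprPtr> a) {
    return std::make_shared<Expr>(Expr{Expr::Call, 0.0, f, std::move(a)});
}

// Folds every sub-expression the C++ side understands into a constant and
// leaves anything else (such as a user-defined R function) untouched.
inline ExprPtr reduce(const ExprPtr& e, const Columns& data) {
    if (e->kind != Expr::Call) return e;

    std::vector<ExprPtr> args;
    for (const ExprPtr& a : e->args) args.push_back(reduce(a, data));

    // Known summaries of a single column.
    if (args.size() == 1 && args[0]->kind == Expr::Column) {
        const std::vector<double>& x = data.at(args[0]->name);
        assert(!x.empty());
        if (e->name == "mean")
            return constant(std::accumulate(x.begin(), x.end(), 0.0) / x.size());
        if (e->name == "min")
            return constant(*std::min_element(x.begin(), x.end()));
    }
    // Known arithmetic on two already reduced constants.
    if (args.size() == 2 && args[0]->kind == Expr::Constant &&
        args[1]->kind == Expr::Constant) {
        if (e->name == "+") return constant(args[0]->value + args[1]->value);
        if (e->name == "-") return constant(args[0]->value - args[1]->value);
    }
    // Not handled here: keep the call so that R can evaluate it later.
    return call(e->name, args);
}
```

Reducing `mean(a) + min(b) - myfun(a)` with this sketch leaves a `-` call whose first argument is a constant and whose second argument is the untouched `myfun(a)` call, matching step #4.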

Page 38: R and C++

Hybrid evaluation

• mean, min, max, sum, sd, var, n, +, -, /, *, <, >, <=, >=, &&, ||

• packages can register their own hybrid evaluation handler.

• See hybrid-evaluation vignette

Page 39: R and C++

Rcpp11

Page 40: R and C++

Rcpp11

• Using C++11 features

• Smaller

• More memory efficient

• Clean

Page 41: R and C++

C++11 : lambda

    // [[Rcpp::export]]
    NumericVector foo( NumericVector v){
        NumericVector res = sapply( v, [](double x){ return x*x; } ) ;
        return res ;
    }

Lambda: function defined where used. Similar to apply functions in R.
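The same idea works with the standard library alone, which may help readers who want to try lambdas without R. In this sketch `map_values` and `squares` are hypothetical names, and `std::transform` plays the role that `sapply` plays on the slide.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Applies `f` to every element and returns the results, in the spirit of
// R's sapply. map_values is a hypothetical helper name.
template <typename F>
std::vector<double> map_values(const std::vector<double>& v, F f) {
    std::vector<double> res(v.size());
    std::transform(v.begin(), v.end(), res.begin(), f);
    assert(res.size() == v.size());
    return res;
}

inline std::vector<double> squares(const std::vector<double>& v) {
    // The lambda is defined at the point of use, with no named helper needed.
    return map_values(v, [](double x) { return x * x; });
}
```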

Page 42: R and C++

C++11 : for each loop

C++98, C++03

    std::vector<double> v ;
    for( int i=0; i<v.size(); i++){
        double d = v[i] ;
        // do something with d
    }

C++11

    for( double d: v){
        // do stuff with d
    }

Page 43: R and C++

C++11 : init list

C++98, C++03

    NumericVector x = NumericVector::create( 1, 2 ) ;

C++11

    NumericVector x = {1, 2} ;

Page 44: R and C++

Other changes

• Move semantics : used under the hood in Rcpp11. Using less memory.

• Less code bloat. Variadic templates
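Both points can be illustrated with standard C++11, independent of Rcpp11. The names `make_numeric` and `move_reuses_buffer` are hypothetical and exist only for this sketch.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// One variadic template accepts any number of arguments, where C++98 code
// needed one hand-written or generated overload per argument count.
// make_numeric is a hypothetical name, not part of Rcpp11.
template <typename... Args>
std::vector<double> make_numeric(Args... args) {
    return std::vector<double>{static_cast<double>(args)...};
}

// Moving transfers ownership of the buffer instead of copying it.
// Returns true when the destination reuses the source's storage.
inline bool move_reuses_buffer(std::vector<double> source) {
    assert(!source.empty());
    const double* before = source.data();
    std::vector<double> target = std::move(source);
    return target.data() == before;
}
```

A single template replaces a family of near-identical overloads, which is where the reduction in code bloat comes from; the move avoids allocating a second buffer.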

Page 45: R and C++

Rcpp11 article

• I’m writing an article about C++11

• Explain the merits of C++11

• What’s next: C++14, C++17

• Goal is to make C++11 welcome on CRAN

• https://github.com/romainfrancois/cpp11_article

Page 46: R and C++

Questions

[email protected] @romainfrancois

Romain François