R SciDB Julia Mert Terzihan Zhixiong Chen

R SciDB Julia - Brown Universitycs.brown.edu/courses/cs195w/slides/r_julia.pdf · 2013. 10. 6. · 1. What is R In the 1970s, at Bell Labs, John Chambers developed a statistical programming

R, SciDB, Julia

Mert Terzihan, Zhixiong Chen

R

1. What is R

● In the 1970s, at Bell Labs, John Chambers developed a statistical programming language: S
  ○ The aim was to turn ideas into software, quickly and faithfully
  ○ R is an implementation of S, initially written by Robert Gentleman and Ross Ihaka in 1993
● R is a language and environment for statistical computing and graphics

2. Features

● Object oriented
  ○ Similar to Python
● Optimized for vector/matrix operations
  ○ Similar to Matlab
● Full support for statistical analysis
● Part of the GNU free software project
● Over 4300 user-contributed packages

3. Study Plan

● Scalar
● Vector
● Matrix
● Data Frame
● The apply Function
● Statistics
● Plot

Scalar

● Use R as a calculator
> 4 + 6
[1] 10
> x <- 6    # '<-' assigns the value 6 to the object x
> y <- 4
> x + y
[1] 10
> x <- "Hello world"    # strings are supported
> x
[1] "Hello world"

Vector

● Create a vector
> x <- c(5, 9, 1, 0)    # c() concatenates individual elements
> x
[1] 5 9 1 0
> x <- 1:10    # generate the numbers from 1 to 10
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> seq(1, 9, by = 2)    # step by 2 from 1 to 9
[1] 1 3 5 7 9
> seq(8, 20, length = 6)    # 6 evenly spaced numbers from 8 to 20, inclusive
[1]  8.0 10.4 12.8 15.2 17.6 20.0

Vector

● Access a vector: indexing starts from 1 and uses []
> x <- rep(1:3, 6)    # repeat the numbers 1 to 3 six times
> x
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
> x[1:9]    # elements at indices 1 to 9
[1] 1 2 3 1 2 3 1 2 3
> x[c(3, 6, 9)]    # elements at indices 3, 6, and 9
[1] 3 3 3
> x[-c(3, 6, 9)]    # '-' excludes the given indices
 [1] 1 2 1 2 1 2 1 2 3 1 2 3 1 2 3

Vector

● Access a vector: masking
> x
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
> mask = x == 3    # create a mask
> mask    # the mask is stored as a vector of logical (boolean) values
 [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
[16] FALSE FALSE  TRUE
> x[mask]
[1] 3 3 3 3 3 3
> x[!mask]    # '!' negates each logical value in the mask
 [1] 1 2 1 2 1 2 1 2 1 2 1 2
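
As a further illustration of masking, here is a minimal sketch in which the vector `v` is made up for the example. Masks can be combined with `&`, and `which()` turns a mask into the matching indices:

```r
v <- c(4, 8, 15, 16, 23, 42)      # hypothetical data, not from the slides
mask <- v > 5 & v %% 2 == 0       # TRUE where an element is > 5 and even
selected <- v[mask]               # logical indexing keeps the TRUE positions
positions <- which(mask)          # integer indices of the TRUE positions
print(selected)
print(positions)
```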

Matrix

● Create a matrix
> x <- c(5, 7, 9)
> y <- c(6, 3, 4)
> z <- cbind(x, y)    # bind two vectors as the columns of a matrix
> z
     x y
[1,] 5 6
[2,] 7 3
[3,] 9 4
> matrix(c(5, 7, 9, 6, 3, 4), nrow = 3)    # create a 3-row matrix from the vector
     [,1] [,2]
[1,]    5    6
[2,]    7    3
[3,]    9    4
> diag(3)    # identity matrix
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

Matrix

● Matrix operations: component-wise
> z <- matrix(c(5, 7, 9, 6, 3, 4), nrow = 3, byrow = TRUE)
> z
     [,1] [,2]
[1,]    5    7
[2,]    9    6
[3,]    3    4
> y <- matrix(c(1, 3, 0, 9, 5, -1), nrow = 3, byrow = TRUE)
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> y + z
     [,1] [,2]
[1,]    6   10
[2,]    9   15
[3,]    8    3
> y * z
     [,1] [,2]
[1,]    5   21
[2,]    0   54
[3,]   15   -4

Matrix

● Matrix operations: linear-algebra definitions
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> z <- matrix(c(3, 4, -2, 6), nrow = 2, byrow = TRUE)
> z
     [,1] [,2]
[1,]    3    4
[2,]   -2    6
> x <- c(5, 7)    # the 2-element vector that matches the product shown next
> y %*% x    # matrix multiplication
     [,1]
[1,]   26
[2,]   63
[3,]   18
> t(z)    # transpose
     [,1] [,2]
[1,]    3   -2
[2,]    4    6
> solve(z)    # inverse
           [,1]       [,2]
[1,] 0.23076923 -0.1538462
[2,] 0.07692308  0.1153846
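
The difference between `*` and `%*%` is easy to miss, so here is a small sketch that reuses the 2x2 matrix `z` from this slide. It contrasts the two products and checks that `solve(z)` really is the inverse:

```r
z <- matrix(c(3, 4, -2, 6), nrow = 2, byrow = TRUE)
z_inv <- solve(z)                 # matrix inverse
elementwise <- z * z              # squares each entry independently
matprod <- z %*% z                # linear-algebra product of z with itself
identity_check <- z %*% z_inv     # should equal the 2x2 identity up to rounding
print(elementwise)
print(matprod)
print(all.equal(identity_check, diag(2)))
```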

Matrix

● Access a matrix: indexing
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> y[1, 2]    # fetch a specific value
[1] 3
> y[1:2, ]    # fetch rows
     [,1] [,2]
[1,]    1    3
[2,]    0    9
> y[, 2]    # fetch a column
[1]  3  9 -1
> y[c(1, 2), ]    # index with a vector
     [,1] [,2]
[1,]    1    3
[2,]    0    9

Matrix

● Access a matrix: masking
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> mask <- y > 0
> mask
      [,1]  [,2]
[1,]  TRUE  TRUE
[2,] FALSE  TRUE
[3,]  TRUE FALSE
> y[mask]
[1] 1 5 3 9

Data Frame

● Create a data frame, which is like a table in a database
mydata <- data.frame(col1, col2, col3, ...)
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientID, age, diabetes, status)
> patientdata
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor

Data Frame

● Access a data frame
> patientdata
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor
> patientdata[1:3, ]    # treat it as a special matrix
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
> patientdata$patientID    # access a column by name
[1] 1 2 3 4
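
The masking idea from vectors and matrices carries over to data frames. The sketch below rebuilds the `patientdata` frame from the slides and selects rows with a logical condition; the thresholds are arbitrary choices for illustration:

```r
patientdata <- data.frame(
  patientID = c(1, 2, 3, 4),
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type1", "Type2", "Type1", "Type1"),
  status    = c("Poor", "Improved", "Excellent", "Poor")
)
older <- patientdata[patientdata$age > 30, ]                     # rows where age > 30
type1_ages <- patientdata$age[patientdata$diabetes == "Type1"]   # ages of Type1 patients
print(older)
print(mean(type1_ages))
```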

The apply Function

● Apply a function to the elements of a data structure
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> func <- function(x) {    # define func(x) = (x + 10) / 10 = 1 + 0.1*x
+   x = x + 10
+   return(x / 10)
+ }
> apply(y, c(1, 2), func)    # apply func to every element of matrix y
     [,1] [,2]
[1,]  1.1  1.3
[2,]  1.0  1.9
[3,]  1.5  0.9
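
The second argument of `apply` (the margin) selects what the function is applied over. A short sketch, reusing the matrix `y` from the slides, shows row-wise and column-wise use alongside the vectorized form of the element-wise example:

```r
y <- matrix(c(1, 3, 0, 9, 5, -1), nrow = 3, byrow = TRUE)
row_sums <- apply(y, 1, sum)   # margin 1: one result per row
col_max  <- apply(y, 2, max)   # margin 2: one result per column
scaled   <- (y + 10) / 10      # vectorized equivalent of applying (x + 10) / 10 to each cell
print(row_sums)
print(col_max)
print(scaled)
```

For simple arithmetic the vectorized form is usually preferable; `apply` is most useful when the function has no element-wise equivalent.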

Statistics

● Some handy distributions
> dnorm(c(3, 2), 0, 1)    # density of the standard normal distribution
[1] 0.004431848 0.053990967
> x <- seq(-5, 10, by = .1)
> dnorm(x, 3, 2)
 [1] 6.691511e-05 8.162820e-05 9.932774e-05 1.205633e-04 1.459735e-04 1.762978e-04
 [7] 2.123901e-04 2.552325e-04 3.059510e-04 3.658322e-04 ...
● Function name prefixes
  ○ d*: density function
  ○ p*: distribution function
  ○ q*: quantile function (the inverse distribution function)
● Examples: dnorm, pnorm, qnorm; dt, pt, qt
● Other distributions: binomial, exponential, Poisson, gamma
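
To make the d*/p*/q* naming concrete, the following sketch evaluates a few of these functions; the argument values are chosen only for illustration. It also checks that the quantile function undoes the distribution function:

```r
p <- pnorm(1.96)                        # P(Z <= 1.96) for a standard normal Z
q <- qnorm(0.975)                       # value with 97.5% of the mass below it
round_trip <- qnorm(pnorm(1.5, mean = 3, sd = 2), mean = 3, sd = 2)  # q* inverts p*
b <- dbinom(2, size = 4, prob = 0.5)    # P(X = 2) for X ~ Binomial(4, 0.5)
print(c(p, q, round_trip, b))
```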

Statistics

● Simulations
  ○ Randomly simulate 100 observations from N(3, 4), i.e. mean 3 and standard deviation 2
> rnorm(100, 3, 2)
 [1] 2.75259237 0.99932968 0.63348792 3.48292324 2.60880274 3.78258364 5.68923819
 [8] 0.08003764 1.93627124 2.53843236 3.52610754 5.31448617 2.73017110 3.35264165 ...
● Related generators: rnorm, rt, rpois
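
Note that the third argument of `rnorm` is the standard deviation, not the variance. A minimal sketch shows this: the seed and sample size are arbitrary choices, and the sample is made larger than on the slide so that the estimates settle near the true values:

```r
set.seed(42)                            # arbitrary seed so the run is repeatable
obs <- rnorm(10000, mean = 3, sd = 2)   # 10,000 draws from N(3, 4)
m <- mean(obs)                          # should be close to 3
s <- sd(obs)                            # should be close to 2 (variance close to 4)
print(c(mean = m, sd = s))
```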

Plot

● Plotting x*sin(x)
> f <- function(x) {    # define the function f(x) = x*sin(x)
+   return(x * sin(x))
+ }
> plot(f, -20*pi, 20*pi)    # plot f between -20*pi and 20*pi
> abline(0, 1, lty = 2)    # add a dashed line (lty = 2) with intercept 0 and slope 1
> abline(0, -1, lty = 2)    # add a dashed line with intercept 0 and slope -1

More?

● The help() function
● Refer to the official manuals
  ○ http://cran.r-project.org/manuals.html
● A wonderful 4-week online course
  ○ http://blog.revolutionanalytics.com/2012/12/coursera-videos.html
● A good book
  ○ ‘R in Action’ by Robert Kabacoff
● Google

4. Bonus

● Installation
  ○ Tested on Ubuntu 12.04: http://livesoncoffee.wordpress.com/2012/12/09/installing-r-on-ubuntu-12-04/
  ○ Errors such as “Unknown media type in type 'all/all'” can be ignored
● RStudio
  ○ A wonderful IDE for R programmers
  ○ http://www.rstudio.com/

Ricardo

Integrating R and Hadoop

Motivation

● Statistical software, such as R, provides rich functionality for data analysis and modeling, but can handle only limited amounts of data

● Data management systems, such as Hadoop, can handle large data, but provide insufficient analytical functionality

Union is strength!

Solution

● Ricardo decomposes data-analysis algorithms into
  ○ parts executed by the R statistical analysis system
  ○ parts handled by the Hadoop data management system
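
The decomposition idea can be sketched in plain R. This is not Ricardo's API: the chunks below merely stand in for Hadoop partitions, and the data and model are made up. Each partition returns only small summaries (the aggregation a data management system could do at scale), and R then fits a linear regression from the combined summaries:

```r
set.seed(1)
n <- 1000
x <- runif(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.1)                           # synthetic data
chunks <- split(data.frame(x, y), rep(1:4, length.out = n))   # stand-in for partitions

# "Data management" part: every partition emits small summary matrices only
partial <- lapply(chunks, function(d) {
  X <- cbind(1, d$x)                                   # design matrix with intercept
  list(xtx = crossprod(X), xty = crossprod(X, d$y))    # X'X and X'y for this chunk
})

# "R" part: add up the summaries and solve the normal equations
xtx <- Reduce(`+`, lapply(partial, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(partial, `[[`, "xty"))
beta <- solve(xtx, xty)                                # intercept and slope
print(beta)
```

The point of the split is that only the 2x2 and 2x1 summaries cross the boundary between the two systems, never the raw rows.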

Components

● R
  ○ The core of statistical analysis
● Large-Scale Data Management Systems
  ○ HDFS
  ○ Work with dirty, semi-structured or unstructured data
  ○ Massive data storage, manipulation and parallel processing
● Jaql
  ○ A JSON query language
  ○ The declarative interface to Hadoop for Ricardo
  ○ Similar to Pig and Hive

Architecture

Conclusion

● The current version has poor performance

Overview of SciDB

Large Scale Array Storage, Processing and Analysis

Context

1. Background and Motivation
2. Features and Functionality
3. Data Definition
4. Data Manipulation
5. Architecture

What is SciDB?

● Massively parallel storage manager
● Able to parallelize large-scale array processing algorithms

1. Background and Motivation

● Modern scientific data differs from business data in three important respects:
  ○ Sensor arrays consist of rectangular ‘arrays’ of individual sensors
  ○ Scientific analysis requires sophisticated data processing methods
    ■ Ex: Noisy data needs to be ‘cleaned’
  ○ Data generated by modern scientific instruments is extremely large
● The array data model is more desirable in scientific domains
  ○ With notions of adjacency or neighborhood
  ○ Ordering is fundamental
● The complexity of data processing calls for a much more flexible data management platform
  ○ A different kind of DBMS

2. Features and Functionality

● Collections of n-dimensional arrays
● Cells in arrays contain a tuple of values
● Values are associated with a distinguishing attribute name

3. Data Definition

● Create an array:

● Output:

3.1 Sparse Arrays

● Arrays in SciDB may be sparse
● Two ways to handle missing information:
  ○ Ignore it
  ○ Treat it depending on the operation

● Sparse array with jagged edges and holes:

4. Data Manipulation in SciDB

4.1 Slice()

● Projects an array along a particular index value in a single dimension

4.2 Subsample()

● Extracts a region of the array
● Generalization of Slice()
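
The semantics of Slice() and Subsample() can be mimicked on an ordinary in-memory R array. The sketch below is only an analogy, not SciDB syntax, and the array `a` is hypothetical:

```r
a <- array(1:24, dim = c(2, 3, 4))   # hypothetical 3-dimensional array
sliced <- a[, 2, ]                   # like a slice: fix the second dimension at index 2
region <- a[1:2, 2:3, 1:2]           # like a subsample: a rectangular sub-region
print(dim(sliced))                   # one dimension fewer than a
print(dim(region))                   # same number of dimensions as a
```

Seen this way, a slice is the special case of a subsample where one dimension is restricted to a single index and then dropped.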

4.3 SJoin()

● Combines attributes from two input arrays
  ○ Combines cells with the same index value
  ○ Input arrays need not have identical dimensions

4.4 Filter()

● Applies a predicate to the attribute values of the input
  ○ Produces an array of the same size
  ○ Cells where the predicate is false are set to empty
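
An R analogy for this behaviour (again not SciDB syntax) uses `NA` to play the role of an empty cell. The predicate `> 0` and the matrix are arbitrary choices for illustration:

```r
a <- matrix(c(1, 3, 0, 9, 5, -1), nrow = 3, byrow = TRUE)
filtered <- a
filtered[!(a > 0)] <- NA     # cells that fail the predicate become empty (NA here)
print(filtered)
print(dim(filtered))         # the shape is unchanged
```

This differs from masking with `a[a > 0]`, which returns a flat vector of the surviving values and loses their positions.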

4.5 Extensibility

● Provides Postgres-style UDT and UDF extensibility
  ○ New types inter-operate with SciDB’s own types and array operators
● Supports operator extensibility
  ○ Example: Gaussian smoothing
  ○ Weighted average of the cell’s neighborhood
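
To illustrate what "weighted average of the cell's neighborhood" means, here is a one-dimensional sketch in plain R using `stats::filter`. It is not a SciDB operator; the signal and the three-point kernel are invented for the example:

```r
signal <- c(0, 0, 10, 0, 0, 0, 10, 0, 0)   # hypothetical data with two spikes
weights <- c(0.25, 0.5, 0.25)              # small Gaussian-like kernel, sums to 1
smoothed <- stats::filter(signal, weights, sides = 2)  # centered weighted average
print(as.numeric(smoothed))                # the end cells are NA: no full neighborhood
```

Each output cell mixes its own value with those of its two neighbours, so every spike is spread over three cells.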

5. Architecture

● Shared-nothing design
● Centralized system catalog database
  ○ Information about nodes, data distribution and user-defined extensions
● Influenced by MapReduce
● Implements only the ‘A’ and ‘D’ of ACID
  ○ Atomicity
  ○ Durability

References

● SciDB web-site: http://www.scidb.org
● Publications on SciDB: http://www.scidb.org/about/publications.php
● Overview of SciDB, Large Scale Array Storage, Processing and Analysis, The SciDB Development team, SIGMOD'10, June 6-11, 2010, Indianapolis, Indiana, USA: http://www.scidb.org/Documents/sigmod691-brown.pdf

SciDB-R

Best Database for R

The R Programmers

● want their analytics to just work, on extremely large datasets as nimbly as on small ones.

● want to concentrate on the analytics, not parallelism, data formatting, and memory management.

Benefits

● Use SciDB to manage large data sets
  ○ As a storage backend
  ○ Filter and join data before performing analytics
● Use SciDB to share intensive computing load
  ○ Offload large computations to a cluster
  ○ Perform some analytical tasks
● Use SciDB to share data among multiple users

Example

● R code using SciDB to perform calculations
> library("scidb")    # load the scidb package in the current R session
> scidbconnect()    # may require host, port, username, password
> U <- scidb("Z")    # get array Z from SciDB as the SciDB array object U;
                     # U is an R representation of SciDB array Z, pretty much a data frame
> set.seed(1)    # set a seed for randomization
> x = cbind(rnorm(5))    # create a column vector with 5 rows
> y = U %*% x    # computed by SciDB, returning a SciDB array object
> y[, drop = FALSE]    # return the computed result to R, storing it to SciDB
                       # drop: if y has only one column and drop is TRUE,
                       # y is reduced to a plain vector without labels
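
The effect of `drop` can be tried without a SciDB server. The sketch below is a purely local analogy: an ordinary R matrix `U` (made up here) stands in for the SciDB array, and the product is computed in R rather than in the database:

```r
set.seed(1)
U <- matrix(1:15, nrow = 3)        # hypothetical 3 x 5 matrix standing in for array Z
x <- cbind(rnorm(5))               # 5 x 1 column vector, as on the slide
y <- U %*% x                       # 3 x 1 result
kept    <- y[, 1, drop = FALSE]    # stays a 3 x 1 matrix
dropped <- y[, 1]                  # default drop = TRUE: reduced to a plain vector
print(dim(kept))
print(is.null(dim(dropped)))
```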

References

● Official Website
  ○ http://www.paradigm4.com/scidb-r/
● SciDB-R package
  ○ https://github.com/Paradigm4
● Instructions and Manuals
  ○ http://cran.r-project.org/web/packages/scidb/

Julia Language

A Fast Dynamic Language for Technical Computing

Context

1. Motivation
2. Features
3. JIT Compiler and Performance Benchmarks
4. Example Codes
5. IJulia
6. Issue Tracking

1. Motivation

● Why do we need more?

*Viral B. Shah, Fifth Elephant Presentation, July 13 2013

1. What is Julia?

● High-level, high-performance dynamic programming language
  ○ Syntax familiar to Matlab users
● The library is written mostly in Julia

2. Some Features

● Open source with an MIT licensed core
● Easy installation
● Dynamically typed with fast user-defined types
● JIT compiler
● Distributed memory parallelism
● Call C, Fortran and Python libraries
● Unicode support
● Metaprogramming with Lisp-like macros

3. High-Performance JIT Compiler

● LLVM-based just-in-time (JIT) compiler
● Often matches the performance of C

Benchmark times relative to C (smaller is better)

3. Log-scale of Benchmark

Execution time relative to C++

Benchmarks: fib, parse_int, quicksort, mandel, pi_sum, rand_mat_stat, and rand_mat_mul

*Viral B. Shah, Fifth Elephant Presentation, July 13 2013

4. Example Julia Codes

4.1 Arrays and Vectors

4.2 Matrix Operations

4.3 Ternary Operators

4.4 Packages

● Pkg.add("Package_Name")
  ○ Cpp for calling C++ from Julia
  ○ Curl for a Julia HTTP curl library
  ○ Winston, Gadfly, Gaston or PyPlot for graphics and plotting
  ○ HDFS for a wrapper over the Hadoop HDFS library
  ○ LIBSVM for LIBSVM bindings for Julia
  ○ and many more
● http://docs.julialang.org/en/latest/packages/packagelist

4.5 Plotting

● Graphics in Julia are available through external packages
  ○ Use Winston.jl for Matlab-style plots
  ○ Use Gadfly for Wickham-Wilkinson style graphics

*Viral B. Shah, Fifth Elephant Presentation, July 13 2013

4.6 Sequential Buffon’s Needle

● We have a floor made of parallel strips of wood, each the same width, and we drop a needle onto the floor.
  ○ What is the probability that the needle will lie across a line between two strips?

*Viral B. Shah, Fifth Elephant Presentation, July 13 2013
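
The Julia code for this slide did not survive in the text, so the following is an independent Monte Carlo sketch in R rather than the presenter's implementation. It assumes the needle is no longer than the strip width, in which case the exact crossing probability is 2*len / (pi*width). The function name `buffon`, the seed and the sample size are all choices made for this illustration:

```r
buffon <- function(n, len = 1, width = 1) {
  center <- runif(n, 0, width / 2)        # distance from needle centre to the nearest line
  theta  <- runif(n, 0, pi / 2)           # acute angle between the needle and the lines
  mean(center <= (len / 2) * sin(theta))  # fraction of needles that cross a line
}

set.seed(123)                 # arbitrary seed so the run is repeatable
estimate <- buffon(200000)    # simulated crossing probability
exact <- 2 / pi               # closed form for len = width = 1
print(c(estimate = estimate, exact = exact))
```

A needle crosses a line exactly when its centre is closer to the line than half the needle's projection perpendicular to the strips, which is the comparison inside `mean()`.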

4.7 Parallel Buffon’s Needle

● @parallel (+) for loop
  ○ Assigns iterations to multiple processes
  ○ Combines them with a specified reduction (+)

*Viral B. Shah, Fifth Elephant Presentation, July 13 2013
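
The pattern behind `@parallel (+)` is to split the iterations into chunks, run each chunk independently, then add up the partial results. It can be sketched in R as follows. The chunks run sequentially here with `lapply`; on Unix-like systems `parallel::mclapply` could take its place to use several processes. The helper name `buffon_hits`, the chunk sizes and the seed are assumptions made for this example:

```r
buffon_hits <- function(n) {
  center <- runif(n, 0, 0.5)          # needle and strip width both equal to 1
  theta  <- runif(n, 0, pi / 2)
  sum(center <= 0.5 * sin(theta))     # number of crossings in this chunk
}

set.seed(7)                                   # arbitrary seed
chunk_sizes <- rep(50000, 4)                  # four chunks of iterations
hits <- lapply(chunk_sizes, buffon_hits)      # each chunk is independent of the others
total_hits <- Reduce(`+`, hits)               # combine partial counts with the (+) reduction
estimate <- total_hits / sum(chunk_sizes)
print(estimate)
```

Because each chunk returns a single count, the reduction step is cheap no matter how many iterations each chunk performs.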

4.8 Writing Low-Level Code

*Viral B. Shah, Fifth Elephant Presentation, July 13 2013

4.9 Using Python Libraries

● Pkg.add("PyCall")
  ○ using PyCall

*Viral B. Shah, Fifth Elephant Presentation, July 13 2013

5. IJulia

● By default, Julia is used from the command line
● IJulia combines Julia with IPython
● IPython provides a rich architecture for interactive computing
● Available on GitHub: https://github.com/JuliaLang/IJulia.jl

6. Issue Tracking

● Source code is on GitHub
  ○ https://github.com/JuliaLang/julia
● Issues can be opened in the Julia GitHub repository
  ○ https://github.com/JuliaLang/julia/issues
● Easy and quick bug fixes
  ○ No need to wait for another release

References

● Julia web-site: http://julialang.org
● C, Fortran, Julia, Python, Matlab, R and JavaScript code used in benchmarking: https://github.com/JuliaLang/julia/tree/master/test/perf/micro
● Publications on Julia: http://julialang.org/publications/
● Julia source code: https://github.com/JuliaLang/julia
● Viral B. Shah’s introductory slides, July 13, 2013: https://github.com/ViralBShah/julia-presentations/raw/master/Fifth-Elephant-2013/Fifth-Elephant-2013.pdf
● MIT IAP Julia Tutorial: http://www.youtube.com/user/JuliaLanguage
● Winston, 2D plotting for Julia: https://github.com/nolta/Winston.jl

Advantages of R over Julia

● Backed by GNU
● Older and more mature
● Large collection of libraries (CRAN)
● Rich development environments, e.g. RStudio
● Graphics and plotting
● Many great tutorials and books

Advantages of Julia over R

● Performance close to that of C
  ○ LLVM-based JIT compiler
● Supports low-level programming and functions that modify their arguments
● Active community promises a bright future for Julia
● R is single threaded, whereas Julia supports parallelism
  ○ R has techniques for large datasets, but they are not easy to use
● Julia provides fast development and fast execution

Syntax Differences

Operation                     R             Julia
Assignment                    <-            =
Element-wise multiplication   *             .*
Element-wise addition         +             + (.+)
Modulo                        %%            mod
Creating a vector             c(1,2,3,4)    [1,2,3,4]
Size of the array             dim           size
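
The R column of the comparison can be exercised directly. This short sketch uses made-up values and covers each row; the Julia column is not run here:

```r
v <- c(1, 2, 3, 4)            # creating a vector
prod_elem <- v * v            # element-wise multiplication
sum_elem  <- v + v            # element-wise addition
remainder <- 7 %% 3           # modulo
m <- matrix(1:6, nrow = 2)    # a 2 x 3 matrix
shape <- dim(m)               # size of the array
print(prod_elem)
print(sum_elem)
print(remainder)
print(shape)
```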

References

● Matlab, R, and Julia: Languages for Data Analysis, October 15 2012: http://strata.oreilly.com/2012/10/matlab-r-julia-languages-for-data-analysis.html

● Julia for R Programmers, Douglas Bates, July 18 2013: http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf

● An R Programmer Looks at Julia, Douglas Bates, April 7 2012: http://dmbates.blogspot.com/2012/04/r-programmer-looks-at-julia.html

● http://www.johnmyleswhite.com/notebook/2012/04/09/comparing-julia-and-rs-vocabularies/

Questions?