R, SciDB, Julia - Brown University, cs.brown.edu/courses/cs195w/slides/r_julia.pdf, 2013. 10. 6.

R, SciDB, Julia

Mert Terzihan, Zhixiong Chen

R

1. What is R

● In the 1970s, at Bell Labs, John Chambers developed a statistical programming language: S
  ○ The aim was to turn ideas into software, quickly and faithfully
  ○ R is an implementation of S, initially written by Robert Gentleman and Ross Ihaka in 1993.
● R is a language and environment for statistical computing and graphics

2. Features

● Object Oriented
  ○ similar to Python
● Optimized for Vector/Matrix operations
  ○ similar to Matlab
● Full statistical analysis support
● Part of the GNU free software project
● Over 4300 user-contributed packages

3. Study Plan

● Scalar
● Vector
● Matrix
● Data Frame
● The apply Function
● Statistics
● Plot

Scalar

● Use R as a calculator
> 4+6
[1] 10
> x<-6                # '<-' assigns the value 6 to object x
> y<-4
> x+y
[1] 10
> x<-"Hello world"    # string support
> x
[1] "Hello world"

Vector

● Create a vector
> x<-c(5,9,1,0)          # c() concatenates individual elements
> x
[1] 5 9 1 0
> x<-1:10                # generate the numbers from 1 to 10
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> seq(1,9,by=2)          # generate the numbers from 1 to 9, stepping by 2
[1] 1 3 5 7 9
> seq(8,20,length=6)     # generate 6 evenly spaced numbers from 8 to 20, inclusive
[1]  8.0 10.4 12.8 15.2 17.6 20.0

Vector

● Access a vector, indexing from 1 and using []
> x<-rep(1:3,6)     # repeat the numbers 1 to 3 six times
> x
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
> x[1:9]            # get the elements at indices 1 to 9
[1] 1 2 3 1 2 3 1 2 3
> x[c(3,6,9)]       # get the elements at indices 3, 6, and 9
[1] 3 3 3
> x[-c(3,6,9)]      # '-' excludes the given elements
 [1] 1 2 1 2 1 2 1 2 3 1 2 3 1 2 3

Vector

● Access a vector, masking
> x
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
> mask = x == 3     # create a mask
> mask              # mask is stored as a vector of logical (boolean) values
 [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
[16] FALSE FALSE  TRUE
> x[mask]
[1] 3 3 3 3 3 3
> x[!mask]          # '!' negates each logical value in the mask vector
 [1] 1 2 1 2 1 2 1 2 1 2 1 2

Matrix

● Create a matrix
> x<-c(5,7,9)
> y<-c(6,3,4)
> z<-cbind(x,y)     # bind two vectors as the columns of a matrix
> z
     x y
[1,] 5 6
[2,] 7 3
[3,] 9 4
> matrix(c(5,7,9,6,3,4),nrow=3)     # create a 3-row matrix from the vector, filled column by column
     [,1] [,2]
[1,]    5    6
[2,]    7    3
[3,]    9    4

> diag(3)           # identity matrix
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

Matrix

● Matrix Operations, component-wise
> z<-matrix(c(5,7,9,6,3,4),nrow=3,byrow=TRUE)
> z
     [,1] [,2]
[1,]    5    7
[2,]    9    6
[3,]    3    4
> y<-matrix(c(1,3,0,9,5,-1),nrow=3,byrow=TRUE)
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1

> y+z
     [,1] [,2]
[1,]    6   10
[2,]    9   15
[3,]    8    3

> y*z
     [,1] [,2]
[1,]    5   21
[2,]    0   54
[3,]   15   -4

Matrix

● Matrix Operations, based on definition
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> z<-matrix(c(3,4,-2,6),nrow=2,byrow=TRUE)
> z
     [,1] [,2]
[1,]    3    4
[2,]   -2    6
> x<-c(5,7)         # x must have length 2 here; these values reproduce the output below
> y%*%x             # matrix multiplication
     [,1]
[1,]   26
[2,]   63
[3,]   18

> t(z)              # transpose
     [,1] [,2]
[1,]    3   -2
[2,]    4    6

> solve(z)          # inverse
           [,1]       [,2]
[1,] 0.23076923 -0.1538462
[2,] 0.07692308  0.1153846
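As a quick check on the operators above, the following sketch reuses the slide's y and z and confirms that solve(z) really is the inverse. The variable names yz, identity_check, is_identity, b and v are illustrative and not from the slides.

```r
# Same matrices as on the slide
y <- matrix(c(1, 3, 0, 9, 5, -1), nrow = 3, byrow = TRUE)
z <- matrix(c(3, 4, -2, 6), nrow = 2, byrow = TRUE)

# %*% is the matrix product; * would multiply element by element
yz <- y %*% z                      # 3x2 times 2x2 gives a 3x2 matrix

# solve(z) returns the inverse, so z %*% solve(z) should be the identity
identity_check <- z %*% solve(z)
is_identity <- isTRUE(all.equal(identity_check, diag(2)))

# solve(a, b) solves a %*% v = b directly, without forming the inverse
b <- c(1, 2)
v <- solve(z, b)
```

Comparing with all.equal rather than == allows for floating-point rounding in the inverse.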

Matrix

● Access a matrix, indexing
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> y[1,2]            # fetch a specific value
[1] 3
> y[1:2,]           # fetch rows
     [,1] [,2]
[1,]    1    3
[2,]    0    9
> y[,2]             # fetch columns
[1]  3  9 -1

> y[c(1,2),]        # use vectors
     [,1] [,2]
[1,]    1    3
[2,]    0    9

Matrix

● Access a matrix, masking
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> mask<-y>0
> mask
      [,1]  [,2]
[1,]  TRUE  TRUE
[2,] FALSE  TRUE
[3,]  TRUE FALSE
> y[mask]
[1] 1 5 3 9

Data Frame

● Create, like a table in a database
mydata <- data.frame(col1, col2, col3,...)
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientID, age, diabetes, status)
> patientdata
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor

Data Frame

● Access a data frame
> patientdata
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor
> patientdata[1:3,]          # treat it as a special matrix
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
> patientdata$patientID      # access using the column name
[1] 1 2 3 4
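The masking idea shown earlier for vectors and matrices carries over to data frames. The sketch below rebuilds the slide's patientdata and selects rows with logical masks; the names older, subset_rows and type1_poor are illustrative only.

```r
patientdata <- data.frame(
  patientID = c(1, 2, 3, 4),
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type1", "Type2", "Type1", "Type1"),
  status    = c("Poor", "Improved", "Excellent", "Poor")
)

# A logical mask over rows, built from one column
older <- patientdata$age > 30

# Keep the masked rows and only two of the columns
subset_rows <- patientdata[older, c("patientID", "age")]

# Combine conditions with & (element-wise AND)
type1_poor <- patientdata[patientdata$diabetes == "Type1" &
                          patientdata$status == "Poor", ]
```

Leaving the column position empty after the comma, as in the last statement, keeps all columns.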

The apply Function

● Apply a function to data structure elements
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> func <- function(x){     # define a function func: 1 + 0.1*x
+   x = x+10
+   return (x/10)
+ }
> apply(y,c(1,2),func)     # apply func to every element of matrix y
     [,1] [,2]
[1,]  1.1  1.3
[2,]  1.0  1.9
[3,]  1.5  0.9
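The second argument of apply selects what the function is applied to. A short sketch on the same matrix y, with illustrative variable names, contrasts the three common choices:

```r
y <- matrix(c(1, 3, 0, 9, 5, -1), nrow = 3, byrow = TRUE)

row_sums <- apply(y, 1, sum)   # MARGIN = 1: one result per row
col_max  <- apply(y, 2, max)   # MARGIN = 2: one result per column

# MARGIN = c(1, 2) visits every cell, as on the slide
scaled <- apply(y, c(1, 2), function(x) (x + 10) / 10)

# Arithmetic is already vectorised, so this gives the same matrix
scaled_direct <- (y + 10) / 10
```

For simple arithmetic the vectorised form is usually preferable; apply is more useful when the function works on a whole row or column.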

Statistics

● Some handy distributions
> dnorm(c(3,2),0,1)     # normal density with mean 0 and standard deviation 1
[1] 0.004431848 0.053990967
> x<-seq(-5,10,by=.1)
> dnorm(x,3,2)
  [1] 6.691511e-05 8.162820e-05 9.932774e-05 1.205633e-04 1.459735e-04 1.762978e-04
  [7] 2.123901e-04 2.552325e-04 3.059510e-04 3.658322e-04
…...
d*: density function
p*: distribution function
q*: quantile function (the inverse distribution function)
dnorm, pnorm, qnorm
dt, pt, qt
binomial, exponential, Poisson, gamma
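The d*, p* and q* prefixes are related: the quantile function inverts the distribution function, and the distribution function accumulates the density. A small sketch with illustrative variable names:

```r
p <- pnorm(1.96)          # P(X <= 1.96) for a standard normal, about 0.975
q <- qnorm(p)             # the quantile function undoes pnorm

# The density integrates to the distribution function
area <- integrate(dnorm, lower = -Inf, upper = 1.96)$value

# The same prefixes work for other families, e.g. the binomial
pb <- pbinom(3, size = 10, prob = 0.5)          # P(X <= 3)
db <- sum(dbinom(0:3, size = 10, prob = 0.5))   # same value from the density
```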

Statistics

● Simulations
Randomly simulate 100 observations from N(3,4), i.e. mean 3 and variance 4 (standard deviation 2):
> rnorm(100,3,2)
  [1] 2.75259237 0.99932968 0.63348792 3.48292324 2.60880274 3.78258364 5.68923819
  [8] 0.08003764 1.93627124 2.53843236 3.52610754 5.31448617 2.73017110 3.35264165
……
rnorm, rt, rpois
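Simulated draws change on every call unless the random seed is fixed. The sketch below, with an arbitrary seed and a larger sample than the slide uses, makes the draws repeatable and checks them against the parameters of N(3,4):

```r
set.seed(42)                      # fix the seed so the draws are repeatable
obs <- rnorm(10000, mean = 3, sd = 2)

sample_mean <- mean(obs)          # should be near 3
sample_var  <- var(obs)           # should be near 4 (sd squared)

# Resetting the seed reproduces exactly the same draws
set.seed(42)
same_draws <- identical(obs, rnorm(10000, mean = 3, sd = 2))
```

Note that rnorm takes the standard deviation, not the variance, as its third argument.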

Plot

● Plotting x*sin(x)
> f <- function(x) {       # define the function f(x) = x*sin(x)
+   return (x*sin(x))
+ }
> plot(f,-20*pi,20*pi)     # plot f between -20*pi and 20*pi
> abline(0,1,lty=2)        # add a dashed line (lty = 2) with intercept 0 and slope 1
> abline(0,-1,lty=2)       # add a dashed line with intercept 0 and slope -1

More?

● The help() function
● Refer to the official manual
  ○ http://cran.r-project.org/manuals.html
● A wonderful 4-week long online course
  ○ http://blog.revolutionanalytics.com/2012/12/coursera-videos.html
● A good book
  ○ ‘R in Action’ by Robert Kabacoff
● Google




4. Bonus

● Installation
  ○ Tested on Ubuntu 12.04: http://livesoncoffee.wordpress.com/2012/12/09/installing-r-on-ubuntu-12-04/
  ○ Ignore errors such as “Unknown media type in type 'all/all'”
● RStudio
  ○ a wonderful IDE for R programmers
  ○ http://www.rstudio.com/


Ricardo

Integrating R and Hadoop

Motivation

● Statistical software, such as R, provides rich functionality for data analysis and modeling, but can handle only limited amounts of data

● Data management systems, such as Hadoop, can handle large data, but provide insufficient analytical functionality

Union is strength!

Solution

● Ricardo decomposes data-analysis algorithms into
  ○ parts executed by the R statistical analysis system
  ○ parts handled by the Hadoop data management system

Components

● R
  ○ The core of statistical analysis
● Large-Scale Data Management Systems
  ○ HDFS
  ○ Work with dirty, semi/un-structured data
  ○ Massive data storage, manipulation and parallel processing
● Jaql
  ○ A JSON Query Language
  ○ The declarative interface to Hadoop for Ricardo
  ○ Like Pig, Hive

Architecture

Conclusion

● The current version has poor performance

Overview of SciDB

Large Scale Array Storage, Processing and Analysis

Context

1. Background and Motivation
2. Features and Functionality
3. Data Definition
4. Data Manipulation
5. Architecture

What is SciDB?

● Massively parallel storage manager
● Able to parallelize large scale array processing algorithms

1. Background and Motivation

● Modern scientific data differs from business data in three important respects:
  ○ Sensor arrays consist of rectangular ‘arrays’ of individual sensors
  ○ Scientific analysis requires sophisticated data processing methods
    ■ Ex: Noisy data needs to be ‘cleaned’
  ○ Data generated by modern scientific instruments is extremely large
● Array Data Model is more desirable in scientific domains
  ○ With notions of adjacency or neighborhood
  ○ Ordering is fundamental
● Complexity of data processing needs a much more flexible data management platform
  ○ A different kind of DBMS

2. Features and Functionality

● Collections of n-dimensional arrays
● Cells in arrays contain a tuple of values
● Values are associated with a distinguishing attribute name

3. Data Definition

● Create an array:

● Output:

3.1 Sparse Arrays

● Arrays in SciDB may be sparse
● Two ways to handle missing information:
  ○ Ignore it
  ○ Treat it depending on the operation
● Sparse array with jagged edges and holes:

4. Data Manipulation in SciDB

4.1 Slice()

● Projects an array along a particular index value in a single dimension

4.2 Subsample()

● Extracts a region of the array
● Generalization of Slice()

4.3 SJoin()

● Combines attributes from two input arrays
  ○ Combines cells with the same index value
  ○ Input arrays need not have identical dimensions

4.4 Filter()

● Applies a predicate to the attribute values of the input
  ○ Produces an array of the same size
  ○ Cells where the predicate is false are set to empty
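To make the semantics of Slice(), Subsample() and Filter() concrete, here is an analogy on an ordinary in-memory R matrix. This is plain R indexing, not SciDB syntax; the matrix A and the use of NA to stand in for an empty cell are assumptions made for illustration.

```r
A <- matrix(1:12, nrow = 3)        # a small 3x4 array with one attribute

# Slice: fix one index value in one dimension (here row 2)
slice_row2 <- A[2, ]

# Subsample: keep a rectangular region (rows 2-3, columns 1-2)
region <- A[2:3, 1:2]

# Filter: same shape, cells failing the predicate become empty (NA here)
filtered <- A
filtered[!(A > 6)] <- NA
```

The slice has one dimension fewer than A, the subsample keeps the dimensionality but shrinks the extent, and the filter keeps both and only empties cells.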

4.5 Extensibility

● Provides Postgres-style user-defined type (UDT) and user-defined function (UDF) extensibility
  ○ New types will inter-operate with SciDB’s own types and array operators
● Supports operator extensibility
  ○ Gaussian Smoothing
  ○ Weighted average of the cell’s neighborhood

5. Architecture

● Shared nothing design
● Centralized system catalog database
  ○ Information about nodes, data distribution and user-defined extensions
● Influenced by MapReduce
● Implements only ‘A’ and ‘D’ of ACIDity
  ○ Atomicity
  ○ Durability

References

● SciDB web-site: http://www.scidb.org
● Publications on SciDB: http://www.scidb.org/about/publications.php
● Overview of SciDB, Large Scale Array Storage, Processing and Analysis, The SciDB Development team, SIGMOD'10, June 6-11, 2010, Indianapolis, Indiana, USA: http://www.scidb.org/Documents/sigmod691-brown.pdf






SciDB-R

Best Database for R


The R Programmers

● want their analytics to just work on extremely large datasets as nimbly as on small ones.

● want to concentrate on the analytics, not parallelism, data formatting, and memory management.

Benefits

● Use SciDB to manage large data sets
  ○ a storage backend
  ○ filter and join data before performing analytics
● Use SciDB to share intensive computing load
  ○ offload large computations to a cluster
  ○ perform some analytical tasks
● Use SciDB to share data among multiple users

Example

● R code using SciDB to perform calculations
> library("scidb")        # load the scidb package in the current R session
> scidbconnect()          # may require host, port, username, password
> U <- scidb("Z")         # get array Z from SciDB as the SciDB array object U;
                          # U is an R representation of SciDB array Z, pretty much a data frame
> set.seed(1)             # set a seed for randomization
> x = cbind(rnorm(5))     # create a column vector with 5 rows
> y = U %*% x             # computed by SciDB, returning a SciDB array object
> y[, drop = FALSE]       # return the computed result, stored in SciDB, to R
                          # drop: if y has only one column and drop is TRUE,
                          # the result is reduced to a plain vector without labels

References

● Official Website
  ○ http://www.paradigm4.com/scidb-r/
● SciDB-R package
  ○ https://github.com/Paradigm4
● Instructions and Manuals
  ○ http://cran.r-project.org/web/packages/scidb/


Julia Language

A Fast Dynamic Language for Technical Computing

Context

1. Motivation
2. Features
3. JIT Compiler and Performance Benchmarks
4. Example Codes
5. IJulia
6. Issue Tracking

1. Motivation

● Why do we need more?

*Viral B. Shah, Fifth Elephant Presentation, July 13 2013

1. What is Julia?

● High level, high performance dynamic programming language
  ○ Syntax familiar to Matlab users
● The library is written mostly in Julia

2. Some Features

● Open source with an MIT licensed core
● Easy installation
● Dynamically typed with fast user-defined types
● JIT compiler
● Distributed memory parallelism
● Call C, Fortran and Python libraries
● Unicode support
● Metaprogramming with Lisp-like macros

3. High-Performance JIT Compiler

● LLVM-based Just in Time Compiler
● Often matches the performance of C

Benchmark times relative to C (smaller is better)

3. Log-scale of Benchmark

Execution time relative to C++

Benchmarks: fib, parse_int, quicksort, mandel, pi_sum, rand_mat_stat, and rand_mat_mul


4. Example Julia Codes

4.1 Arrays and Vectors

4.1 Arrays and Vectors

4.2 Matrix Operations

4.3 Ternary Operators

4.4 Packages

● Pkg.add("Package_Name")
  ○ Cpp for calling C++ from Julia
  ○ Curl for Julia HTTP Curl library
  ○ Winston, Gadfly, Gaston or PyPlot for graphics and plotting
  ○ HDFS for a wrapper over Hadoop HDFS library
  ○ LIBSVM for LIBSVM bindings for Julia
  ○ and many more
● http://docs.julialang.org/en/latest/packages/packagelist


4.5 Plotting

● Graphics in Julia are available through external packages
  ○ Use Winston.jl for Matlab-style plots
  ○ Use Gadfly for Wickham-Wilkinson style of graphics


4.6 Sequential Buffon’s Needle

● We have a floor made of parallel strips of wood, each the same width, and we drop a needle onto the floor.
  ○ What is the probability that the needle will lie across a line between two strips?
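The question above can be answered by simulation. The following is a minimal Monte Carlo sketch written in R, to match the earlier examples in this deck rather than to reproduce the Julia code from the slides. The function name buffon and its parameters are illustrative; it assumes the needle is no longer than the strip width, in which case the exact probability is 2 * needle / (pi * strip).

```r
buffon <- function(n, needle = 1, strip = 1) {
  # distance from the needle's centre to the nearest line
  d <- runif(n, 0, strip / 2)
  # acute angle between the needle and the lines
  theta <- runif(n, 0, pi / 2)
  # the needle crosses a line when its half-projection reaches the line
  mean(d <= (needle / 2) * sin(theta))
}

set.seed(1)                  # arbitrary seed, for repeatability
p_hat   <- buffon(1e6)       # simulated crossing probability
p_exact <- 2 / pi            # exact value when needle == strip
```

Because the whole sample is drawn with vectorised calls to runif, no explicit loop is needed.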



4.7 Parallel Buffon’s Needle

● @parallel (+) for loop
  ○ Assigns iterations to multiple processes
  ○ Combines them with a specified reduction (+)


4.8 Writing Low-Level Code


4.9 Using Python Libraries

● Pkg.add("PyCall")
  ○ using PyCall


5. IJulia

● By default, Julia is used from the command line
● IJulia combines Julia with IPython
● IPython provides a rich architecture for interactive computing
● Available on GitHub: https://github.com/JuliaLang/IJulia.jl


6. Issue Tracking

● Source code is on GitHub
  ○ https://github.com/JuliaLang/julia
● Issues can be opened in the Julia GitHub repository
  ○ https://github.com/JuliaLang/julia/issues
● Easy and quick bug fixes
  ○ No need to wait for another release


References

● Julia web-site: http://julialang.org
● C, Fortran, Julia, Python, Matlab, R and JavaScript codes used in benchmarking: https://github.com/JuliaLang/julia/tree/master/test/perf/micro
● Publications on Julia: http://julialang.org/publications/
● Julia Source Code: https://github.com/JuliaLang/julia
● Viral B. Shah’s introductory slides on July 13, 2013: https://github.com/ViralBShah/julia-presentations/raw/master/Fifth-Elephant-2013/Fifth-Elephant-2013.pdf
● MIT IAP Julia Tutorial: http://www.youtube.com/user/JuliaLanguage
● Winston, 2D plotting for Julia: https://github.com/nolta/Winston.jl




Advantages of R over Julia

● Backed by GNU
● More mature and older
● Large collection of libraries, e.g. CRAN
● Rich development environment, e.g. RStudio
● Graphics and plotting
● Many great tutorials and books

Advantages of Julia over R

● Performance of Julia is close to that of C
  ○ JIT LLVM Compiler
● Supports low-level programming; functions can modify their arguments
● Active community promises a bright future for Julia
● R is single threaded, whereas Julia supports parallelism
  ○ R has techniques for large datasets, but they are not easy to use
● Julia provides fast development and fast execution

Syntax Differences

Operator                      R            Julia
Assignment                    <-           =
Element-wise multiplication   *            .*
Element-wise addition         +            + (.+)
Modulo                        %%           mod
Creating a vector             c(1,2,3,4)   [1,2,3,4]
Size of the array             dim          size

References

● Matlab, R, and Julia: Languages for Data Analysis, October 15 2012: http://strata.oreilly.com/2012/10/matlab-r-julia-languages-for-data-analysis.html

● Julia for R Programmers, Douglas Bates, July 18 2013: http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf

● An R Programmer Looks at Julia, Douglas Bates, April 7 2012: http://dmbates.blogspot.com/2012/04/r-programmer-looks-at-julia.html

● http://www.johnmyleswhite.com/notebook/2012/04/09/comparing-julia-and-rs-vocabularies/





Questions?