R, SciDB, Julia
Mert Terzihan, Zhixiong Chen
R
1. What is R
● In the 1970s, at Bell Labs, John Chambers developed a statistical programming language, S
○ The aim was to turn ideas into software, quickly and faithfully
○ R is an implementation of S, initially written by Robert Gentleman and Ross Ihaka in 1993
● R is a language and environment for statistical computing and graphics
2. Features
● Object oriented
○ similar to Python
● Optimized for vector/matrix operations
○ similar to Matlab
● Full support for statistical analysis
● Part of the GNU free software project
● Over 4300 user-contributed packages
3. Study Plan
● Scalar
● Vector
● Matrix
● Data Frame
● The apply Function
● Statistics
● Plot
Scalar
● Use R as a calculator
> 4+6
[1] 10
> x<-6    # '<-' assigns the value 6 to object x
> y<-4
> x+y
[1] 10
> x<-"Hello world"    # strings are supported
> x
[1] "Hello world"
Vector
● Create a vector
> x<-c(5,9,1,0)    # function c concatenates individual elements
> x
[1] 5 9 1 0
> x<-1:10    # generate the numbers from 1 to 10
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> seq(1,9,by=2)    # generate the numbers from 1 to 9 in steps of 2
[1] 1 3 5 7 9
> seq(8,20,length=6)    # generate 6 evenly spaced numbers from 8 to 20 inclusive
[1]  8.0 10.4 12.8 15.2 17.6 20.0
Vector
● Access a vector, indexing from 1 and using []
> x<-rep(1:3,6)    # repeat the numbers 1 to 3 six times
> x
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
> x[1:9]    # get the elements indexed from 1 to 9
[1] 1 2 3 1 2 3 1 2 3
> x[c(3,6,9)]    # get the elements at indices 3, 6, and 9
[1] 3 3 3
> x[-c(3,6,9)]    # '-' excludes the given elements
 [1] 1 2 1 2 1 2 1 2 3 1 2 3 1 2 3
Vector
● Access a vector, masking
> x
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
> mask = x == 3    # create a mask
> mask    # mask is stored as a vector of logical (boolean) values
 [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
[16] FALSE FALSE  TRUE
> x[mask]
[1] 3 3 3 3 3 3
> x[!mask]    # '!' negates each logical value in the mask vector
 [1] 1 2 1 2 1 2 1 2 1 2 1 2
Matrix
● Create a matrix
> x<-c(5,7,9)
> y<-c(6,3,4)
> z<-cbind(x,y)    # bind two vectors as the columns of a matrix
> z
     x y
[1,] 5 6
[2,] 7 3
[3,] 9 4
> matrix(c(5,7,9,6,3,4),nrow=3)    # create a 3-row matrix from the vector (filled column by column)
     [,1] [,2]
[1,]    5    6
[2,]    7    3
[3,]    9    4
> diag(3)    # identity matrix
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
Matrix
● Matrix Operations, component-wise
> z<-matrix(c(5,7,9,6,3,4),nrow=3,byrow=T)
> z
     [,1] [,2]
[1,]    5    7
[2,]    9    6
[3,]    3    4
> y<-matrix(c(1,3,0,9,5,-1),nrow=3,byrow=T)
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> y+z
     [,1] [,2]
[1,]    6   10
[2,]    9   15
[3,]    8    3
> y*z
     [,1] [,2]
[1,]    5   21
[2,]    0   54
[3,]   15   -4
Matrix
● Matrix Operations, based on definition
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> z<-matrix(c(3,4,-2,6),nrow=2,byrow=T)
> z
     [,1] [,2]
[1,]    3    4
[2,]   -2    6
> x<-c(5,7)    # x must have length 2 here; c(5,7) reproduces the result shown
> y%*%x    # matrix multiplication
     [,1]
[1,]   26
[2,]   63
[3,]   18
> t(z)    # transpose
     [,1] [,2]
[1,]    3   -2
[2,]    4    6
> solve(z)    # inverse
           [,1]       [,2]
[1,] 0.23076923 -0.1538462
[2,] 0.07692308  0.1153846
Matrix
● Access a matrix, indexing
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> y[1,2]    # fetch a specific value
[1] 3
> y[1:2,]    # fetch rows
     [,1] [,2]
[1,]    1    3
[2,]    0    9
> y[,2]    # fetch a column
[1]  3  9 -1
> y[c(1,2),]    # use a vector of row indices
     [,1] [,2]
[1,]    1    3
[2,]    0    9
Matrix
● Access a matrix, masking
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> mask<-y>0
> mask
      [,1]  [,2]
[1,]  TRUE  TRUE
[2,] FALSE  TRUE
[3,]  TRUE FALSE
> y[mask]    # selected values are returned in column-major order
[1] 1 5 3 9
Data Frame
● Create, like a table in a database
mydata <- data.frame(col1, col2, col3,...)
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientID, age, diabetes, status)
> patientdata
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor
Data Frame
● Access a data frame
> patientdata
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor
> patientdata[1:3,]    # treat it as a special matrix
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
> patientdata$patientID    # access a column by name
[1] 1 2 3 4
The apply Function
● Apply a function to data structure elements
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> func <- function(x){    # define a function func: 1 + 0.1*x
+ x = x+10
+ return (x/10)
+ }
> apply(y,c(1,2),func)    # apply func to every element of matrix y
     [,1] [,2]
[1,]  1.1  1.3
[2,]  1.0  1.9
[3,]  1.5  0.9
Statistics
● Some handy distributions
> dnorm(c(3,2),0,1)    # density of the normal distribution with mean 0 and sd 1
[1] 0.004431848 0.053990967
> x<-seq(-5,10,by=.1)
> dnorm(x,3,2)
 [1] 6.691511e-05 8.162820e-05 9.932774e-05 1.205633e-04 1.459735e-04 1.762978e-04
 [7] 2.123901e-04 2.552325e-04 3.059510e-04 3.658322e-04 …...
d*: density function
p*: distribution function
q*: quantile function (the inverse distribution function)
dnorm, pnorm, qnorm
dt, pt, qt
binomial, exponential, Poisson, gamma
Statistics
● Simulations
To randomly simulate 100 observations from N(3,4), i.e. mean 3 and variance 4 (sd 2):
> rnorm(100,3,2)
 [1] 2.75259237 0.99932968 0.63348792 3.48292324 2.60880274 3.78258364 5.68923819
 [8] 0.08003764 1.93627124 2.53843236 3.52610754 5.31448617 2.73017110 3.35264165……
rnorm, rt, rpois
Plot
● Plotting x*sin(x)
> f <- function(x) {    # define the function f(x)=x*sin(x)
+ return (x*sin(x))
+ }
> plot(f,-20*pi,20*pi)    # plot f between -20*pi and 20*pi
> abline(0,1,lty=2)    # add a dashed line (lty = 2) with intercept 0 and slope 1
> abline(0,-1,lty=2)    # add a dashed line with intercept 0 and slope -1
More?
● The help() function
● Refer to the official manual
○ http://cran.r-project.org/manuals.html
● A wonderful 4-week long online course
○ http://blog.revolutionanalytics.com/2012/12/coursera-videos.html
● A good book
○ ‘R in Action’ by Robert Kabacoff
4. Bonus
● Installation
○ Tested on Ubuntu 12.04: http://livesoncoffee.wordpress.com/2012/12/09/installing-r-on-ubuntu-12-04/
○ Ignore errors like “Unknown media type in type 'all/all'”
● RStudio
○ a wonderful IDE for R programmers
○ http://www.rstudio.com/
Ricardo
Integrating R and Hadoop
Motivation
● Statistical software, such as R, provides rich functionality for data analysis and modeling, but can handle only limited amounts of data
● Data management systems, such as Hadoop, can handle large data, but provide insufficient analytical functionality
Union is strength!
Solution
● Ricardo decomposes data-analysis algorithms into
○ parts executed by the R statistical analysis system
○ parts handled by the Hadoop data management system
Components
● R
○ The core of statistical analysis
● Large-Scale Data Management Systems
○ HDFS
○ Work with dirty, semi/un-structured data
○ Massive data storage, manipulation and parallel processing
● Jaql
○ A JSON Query Language
○ The declarative interface to Hadoop for Ricardo
○ Like Pig, Hive
Architecture
Conclusion
● The current version has poor performance
Overview of SciDB
Large Scale Array Storage, Processing and Analysis
Contents
1. Background and Motivation
2. Features and Functionality
3. Data Definition
4. Data Manipulation
5. Architecture
What is SciDB?
● Massively parallel storage manager
● Able to parallelize large scale array processing algorithms
1. Background and Motivation
● Modern scientific data differs from business data in three important respects:
○ Sensor arrays consist of rectangular ‘arrays’ of individual sensors
○ Scientific analysis requires sophisticated data processing methods
■ Ex: Noisy data needs to be ‘cleaned’
○ Data generated by modern scientific instruments is extremely large
● Array Data Model is more desirable in scientific domains
○ With notions of adjacency or neighborhood
○ Ordering is fundamental
● Complexity of data processing needs a much more flexible data management platform
○ A different kind of DBMS
2. Features and Functionality
● Collections of n-dimensional arrays
● Cells in arrays contain a tuple of values
● Each value is associated with a distinguishing attribute name
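The data model above can be pictured with a small in-memory stand-in. This is only a sketch in Python, not SciDB code: the dictionary layout, the attribute names (temperature, pressure) and the helper get_attribute are all assumptions made for illustration.

```python
# Hypothetical stand-in for a 2-D array; this is NOT the SciDB API.
# Keys are (i, j) cell coordinates; values are tuples of named attributes.
array = {
    (0, 0): {"temperature": 21.5, "pressure": 1013.0},
    (0, 1): {"temperature": 22.1, "pressure": 1012.4},
    (1, 0): {"temperature": 20.9, "pressure": 1013.8},
}


def get_attribute(arr, coords, name):
    """Return one named attribute of the cell at coords, or None if the cell is empty."""
    cell = arr.get(coords)
    return None if cell is None else cell[name]


pressure = get_attribute(array, (0, 1), "pressure")
missing = get_attribute(array, (1, 1), "temperature")  # cell (1, 1) was never stored
```

Storing only the populated cells also gives a rough picture of how an array can have holes.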
3. Data Definition
● Create an array:
● Output:
3.1 Sparse Arrays
● Arrays in SciDB may be sparse
● Two ways to handle missing information:
○ Ignore it
○ Treat it depending on the operation
● Sparse array with jagged edges and holes:
4. Data Manipulation in SciDB
4.1 Slice()
● Projects an array along a particular index value in single dimension
4.2 Subsample()
● Extracts a region of the array
● Generalization of Slice()
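A rough illustration of the two operators, written in Python rather than SciDB's own query language. The function names slice_ and subsample, the dictionary-based array and the inclusive range convention are assumptions of this sketch, not SciDB's actual signatures.

```python
def subsample(arr, ranges):
    """Keep the cells whose coordinates fall inside inclusive (low, high) ranges,
    one range per dimension."""
    return {
        coords: value
        for coords, value in arr.items()
        if all(low <= x <= high for x, (low, high) in zip(coords, ranges))
    }


def slice_(arr, dim, index):
    """Fix dimension `dim` at `index` and drop it, giving an array with one
    dimension fewer."""
    return {
        coords[:dim] + coords[dim + 1:]: value
        for coords, value in arr.items()
        if coords[dim] == index
    }


# A 3 x 4 grid whose cell (i, j) holds the value 10*i + j.
grid = {(i, j): 10 * i + j for i in range(3) for j in range(4)}

row1 = slice_(grid, 0, 1)                    # the row with i == 1
region = subsample(grid, [(0, 1), (1, 2)])   # rows 0..1, columns 1..2
```

In this sketch, subsampling with a range that covers a single index selects the same cells as a slice, but keeps the dimension.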
4.3 SJoin()
● Combines attributes from two input arrays
○ Combines cells with the same index value
○ Input arrays need not have identical dimensions
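The cell-matching idea can be sketched as follows. This is a simplified Python illustration under the assumption that both inputs use the same dimensions, so it does not cover joins between arrays of different dimensionality; the name sjoin and the attribute names are hypothetical.

```python
def sjoin(left, right):
    """Merge the attribute tuples of cells that exist at the same coordinates
    in both inputs (simplification: both arrays share the same dimensions)."""
    shared = left.keys() & right.keys()
    return {coords: {**left[coords], **right[coords]} for coords in shared}


temps = {(0,): {"temperature": 21.5}, (1,): {"temperature": 22.1}}
press = {(1,): {"pressure": 1012.4}, (2,): {"pressure": 1013.8}}

joined = sjoin(temps, press)  # only index 1 is present in both inputs
```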
4.4 Filter()
● Applies a predicate to the attribute values of the input
○ Produces an array of the same size
○ Cells where the predicate is false are set to empty
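A minimal Python sketch of this behaviour, assuming that an empty cell is represented by None. The name filter_cells and the sample readings are made up for illustration and are not part of SciDB.

```python
def filter_cells(arr, predicate):
    """Return an array with the same coordinates; cells that fail the
    predicate (or are already empty) become None."""
    return {
        coords: (value if value is not None and predicate(value) else None)
        for coords, value in arr.items()
    }


readings = {(0,): 5, (1,): -2, (2,): 7}
positive = filter_cells(readings, lambda v: v > 0)
```

The output keeps one entry per input cell, which mirrors the "same size" property stated above.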
4.5 Extensibility
● Provides Postgres style UDT and UDF extensibility
○ New types will inter-operate with SciDB’s own types and array operators
● Supports operator extensibility
○ Gaussian Smoothing
○ Weighted average of the cell’s neighborhood
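To make "weighted average of the cell's neighborhood" concrete, here is a one-dimensional Python sketch. The kernel (0.25, 0.5, 0.25) is an assumed small approximation of a Gaussian, and renormalising at the edges is a design choice of this sketch; a real user-defined operator would run inside SciDB over n-dimensional arrays.

```python
def smooth(values, weights=(0.25, 0.5, 0.25)):
    """Replace each cell by the weighted average of itself and its neighbours.
    At the edges, only the neighbours that exist are used and the weights
    are renormalised."""
    half = len(weights) // 2
    result = []
    for i in range(len(values)):
        total = 0.0
        weight_sum = 0.0
        for k, w in enumerate(weights):
            j = i + k - half
            if 0 <= j < len(values):
                total += w * values[j]
                weight_sum += w
        result.append(total / weight_sum)
    return result


signal = [0.0, 0.0, 4.0, 0.0, 0.0]
smoothed = smooth(signal)  # the spike is spread over its neighbours
```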
5. Architecture
● Shared nothing design
● Centralized system catalog database
○ Information about nodes, data distribution and user-defined extensions
● Influenced by MapReduce
● Implements only ‘A’ and ‘D’ of ACID
○ Atomicity
○ Durability
References
● SciDB web-site: http://www.scidb.org
● Publications on SciDB: http://www.scidb.org/about/publications.php
● Overview of SciDB, Large Scale Array Storage, Processing and Analysis, The SciDB Development team, SIGMOD'10, June 6-11, 2010, Indianapolis, Indiana, USA: http://www.scidb.org/Documents/sigmod691-brown.pdf
SciDB-R
Best Database for R
The R Programmers
● want their analytics to just work–on extremely large datasets as nimbly as on small ones.
● want to concentrate on the analytics, not parallelism, data formatting, and memory management.
Benefits
● Use SciDB to manage large data sets
○ a storage backend
○ filter and join data before performing analytics
● Use SciDB to share intensive computing load
○ offload large computations to a cluster
○ perform some analytical tasks
● Use SciDB to share data among multiple users
Example
● R code using SciDB to perform calculations
> library("scidb")    # load the scidb package in the current R session
> scidbconnect()    # may require host, port, username, password
> U <- scidb("Z")    # get 'array' Z from SciDB and store it in SciDB array object U;
                     # U is an R representation of SciDB array Z, pretty much a data frame
> set.seed(1)    # set a seed for randomization
> x = cbind(rnorm(5))    # create a column vector with 5 rows
> y = U %*% x    # this will be computed by SciDB, returning a SciDB array object
> y[, drop = FALSE]    # return the computed result to R, storing it to SciDB
drop: if data frame y has only one column and drop is true, y will be reduced to a plain vector without labels.
References
● Official Website
○ http://www.paradigm4.com/scidb-r/
● SciDB-R package
○ https://github.com/Paradigm4
● Instructions and Manuals
○ http://cran.r-project.org/web/packages/scidb/
Julia Language
A Fast Dynamic Language for Technical Computing
Contents
1. Motivation
2. Features
3. JIT Compiler and Performance Benchmarks
4. Example Codes
5. IJulia
6. Issue Tracking
1. Motivation
● Why do we need more?
*Viral B. Shah, Fifth Elephant Presentation, July 13 2013
1. What is Julia?
● High level, high performance dynamic programming language
○ Syntax familiar to Matlab users
● The library is written mostly in Julia
2. Some Features
● Open source with an MIT licensed core
● Easy installation
● Dynamically typed with fast user-defined types
● JIT compiler
● Distributed memory parallelism
● Call C, Fortran and Python libraries
● Unicode support
● Metaprogramming with Lisp-like macros
3. High-Performance JIT Compiler
● LLVM-based Just in Time Compiler
● Often matches the performance of C
Benchmark times relative to C (smaller is better)
3. Log-scale of Benchmark
Execution time relative to C++
Benchmarks: fib, parse_int, quicksort, mandel, pi_sum, rand_mat_stat, and rand_mat_mul
4. Example Julia Codes
4.1 Arrays and Vectors
4.2 Matrix Operations
4.3 Ternary Operators
4.4 Packages
● Pkg.add("Package_Name")
○ Cpp for calling C++ from Julia
○ Curl for Julia HTTP Curl library
○ Winston, Gadfly, Gaston or PyPlot for graphics and plotting
○ HDFS for a wrapper over Hadoop HDFS library
○ LIBSVM for LIBSVM bindings for Julia
○ and many more
● http://docs.julialang.org/en/latest/packages/packagelist
4.5 Plotting
● Graphics in Julia are available through external packages
○ Use Winston.jl for Matlab-style plots
○ Use Gadfly for the Wickham-Wilkinson style of graphics
4.6 Sequential Buffon’s Needle
● We have a floor made of parallel strips of wood, each the same width, and we drop a needle onto the floor.
○ What is the probability that the needle will lie across a line between two strips?
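For a needle of length l no longer than the strip width t, the probability is 2l/(πt). The original slide's Julia code is not available here, so the following is a stand-in Monte Carlo sketch in Python; the function name buffon_crossings, the parameter names and the sample size are assumptions.

```python
import math
import random


def buffon_crossings(n, needle=1.0, width=1.0, rng=None):
    """Drop n needles and count how many cross a line between two strips.

    Each drop is described by the distance d from the needle's centre to the
    nearest line and the acute angle theta between the needle and the lines.
    The needle crosses when d <= (needle / 2) * sin(theta).
    """
    rng = rng or random.Random()
    hits = 0
    for _ in range(n):
        d = rng.uniform(0.0, width / 2)
        theta = rng.uniform(0.0, math.pi / 2)
        if d <= (needle / 2) * math.sin(theta):
            hits += 1
    return hits


n = 200_000
estimate = buffon_crossings(n, rng=random.Random(0)) / n
exact = 2 / math.pi  # needle length equal to strip width
```

With needle length equal to strip width, the estimate should settle near 2/π, about 0.637.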
4.7 Parallel Buffon’s Needle
● @parallel (+) for loop
○ Assign iterations to multiple processes
○ Combine them with a specified reduction (+)
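The split-and-reduce pattern can be imitated in Python as below. This is not Julia's @parallel: it is a sketch that assumes a thread pool for portability (threads in CPython do not give a real speed-up for this kind of loop; separate processes would be needed for that). The helper names, the chunking scheme and the per-worker seeds are all made up.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def count_crossings(n, seed):
    """Count crossings among n needle drops (needle length = strip width = 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        d = rng.uniform(0.0, 0.5)
        theta = rng.uniform(0.0, math.pi / 2)
        if d <= 0.5 * math.sin(theta):
            hits += 1
    return hits


def parallel_crossings(total, workers=4):
    """Assign the iterations to several workers, then reduce the partial counts with +."""
    chunk = total // workers
    sizes = [chunk] * workers
    sizes[-1] += total - chunk * workers  # the last worker takes the remainder
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_crossings, sizes, range(workers))
        return sum(partials)


total = 200_000
estimate = parallel_crossings(total) / total
```

Each worker gets its own seed so that the partial counts are independent and the run is repeatable.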
4.8 Writing Low-Level Code
4.9 Using Python Libraries
● Pkg.add("PyCall")
○ using PyCall
5. IJulia
● Julia is used from the command line by default
● IJulia combines Julia with IPython
● IPython provides a rich architecture for interactive computing
● Available on GitHub: https://github.com/JuliaLang/IJulia.jl
6. Issue Tracking
● Source code is on GitHub
○ https://github.com/JuliaLang/julia
● Issues can be opened from the Julia GitHub repository
○ https://github.com/JuliaLang/julia/issues
● Easy and quick bug fixes
○ No need to wait for another release
References
● Julia web-site: http://julialang.org
● C, Fortran, Julia, Python, Matlab, R and JavaScript codes used in benchmarking: https://github.com/JuliaLang/julia/tree/master/test/perf/micro
● Publications on Julia: http://julialang.org/publications/
● Julia Source Code: https://github.com/JuliaLang/julia
● Viral B. Shah’s introductory slides on July 13, 2013: https://github.com/ViralBShah/julia-presentations/raw/master/Fifth-Elephant-2013/Fifth-Elephant-2013.pdf
● MIT IAP Julia Tutorial: http://www.youtube.com/user/JuliaLanguage
● Winston, 2D plotting for Julia: https://github.com/nolta/Winston.jl
Advantages of R over Julia
● Backed by GNU
● More mature and older
● Large collection of libraries, e.g. CRAN
● Rich development environment, e.g. RStudio
● Graphics and plotting
● Many great tutorials and books
Advantages of Julia over R
● Performance of Julia is close to that of C
○ JIT LLVM Compiler
● Supports low-level programming, e.g. functions can modify their arguments
● Active community promises a bright future for Julia
● R is single threaded, whereas Julia supports parallelism
○ R has techniques for large datasets, but they are not easy to use
● Julia provides fast development and fast execution
Syntax Differences
Operator                      R            Julia
Assignment                    <-           =
Element-wise multiplication   *            .*
Element-wise addition         +            + (.+)
Modulo                        %%           mod
Creating a vector             c(1,2,3,4)   [1,2,3,4]
Size of the array             dim          size
References
● Matlab, R, and Julia: Languages for Data Analysis, October 15 2012: http://strata.oreilly.com/2012/10/matlab-r-julia-languages-for-data-analysis.html
● Julia for R Programmers, Douglas Bates, July 18 2013: http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf
● An R Programmer look at Julia, Douglas Bates, April 7 2012: http://dmbates.blogspot.com/2012/04/r-programmer-looks-at-julia.html
● http://www.johnmyleswhite.com/notebook/2012/04/09/comparing-julia-and-rs-vocabularies/
Questions?