©2011 Oracle – All Rights Reserved
Oracle R Enterprise – Training Sessions
Session 6: Advanced Topics
Mark Hornick, Senior Manager, Development
Oracle Advanced Analytics
The following is intended to outline our general product direction.
It is intended for information purposes only, and may not be
incorporated into any contract. It is not a commitment to deliver
any material, code, or functionality, and should not be relied upon
in making purchasing decisions.
The development, release, and timing of any features or
functionality described for Oracle’s products remain at the sole
discretion of Oracle.
Oracle R Enterprise Training Sessions
Date / Time: Session, followed by its topics
• Thursday, December 1, 2:00 PM ET: Getting Started with Oracle R Enterprise (ORE)
– Oracle R Enterprise Overview
– Installation of R
– Installation of Oracle R Enterprise
– Connecting to Exadata from R
• Tuesday, December 6, 11:00 AM ET: Introduction to the R Language and Environment
– R Language Basics
– Producing Graphs in R
• Thursday, December 8, 1:00 PM ET: ORE Transparency Layer
– Interacting with Database Tables
– Manipulating and transforming data through ORE
• Tuesday, December 13, 11:00 AM ET: ORE Embedded R Script Execution
– Execution through R interface
– Execution through SQL interface
• Thursday, December 15, 2:00 PM ET: Operationalizing R Scripts
– From Analyst to Production
– Integration with OBIEE
– XML graph generation using SQL
• Tuesday, December 20, 11:00 AM ET: Advanced Topics
– Base SAS equivalent functionality
– ORE support for Hadoop and Map-Reduce
– Use of ORE in Exadata and BDA environments
Topics
• Base SAS equivalent functionality
• ORE support for Hadoop – Oracle R Hadoop Connector
• Use of ORE in Exadata and BDA environments
What is Oracle R Enterprise?
• R packages, database library, and SQL extensions that bring Oracle Database closer to Advanced Analytics
users in an Enterprise
– Transparency Layer
• Set of packages mapping R data types to Oracle Database objects
• Transparent SQL generation for R expressions on mapped data types
• Enables direct interaction with database-resident data using R language constructs
• Enables R users to work with data larger than desktop system memory
– Statistics Engine
• Set of statistical functions and procedures for commonly-used statistical libraries
• Executes in Oracle Database
– SQL extensions
• Enables database server execution of R code to facilitate embedding R in operational systems
• Eliminates client data loading and result write-back to Oracle Database
• Enables dataflow parallelism, generation of rich XML output, SQL access to R, and parallel simulations capability
– Hadoop Connector
• R packages that enable working directly with an Oracle Hadoop cluster using R
• Data resides in HDFS, Oracle Database, or local files
Oracle R Enterprise Statistics Engine Example Features
• Special Functions
– Gamma function
– Natural logarithm of the Gamma function
– Digamma function
– Trigamma function
– Error function
– Complementary error function
• Tests
– Chi-square, McNemar, Bowker
– Simple and weighted kappas
– Cochran-Mantel-Haenszel correlation
– Cramer's V
– Binomial, KS, t, F, Wilcoxon
• Base SAS equivalents
– Freq, Summary, Sort
– Rank, Corr, Univariate
• Density, Probability, and Quantile Functions
– Standard normal distribution
– Chi-square distribution
– Exponential distribution
– F-distribution
– Density Function
– Probability Function
– Quantile
– Gamma distribution
– Beta distribution
– Cauchy distribution
– Student's t distribution
– Weibull distribution
Statistical Tests – Examples
ore.create(iris, table="IRIS_TABLE")
IRIS_TABLE$PETALBINS = ifelse(IRIS_TABLE$Petal.Length < 2, 1, 2)
# Binomial test
binom.test(IRIS_TABLE$PETALBINS)
# Chi-square test
chisq.test(IRIS_TABLE$PETALBINS)
# One-sample K-S test against an exponential distribution with rate 4
ks.test(IRIS_TABLE$Petal.Length, "pexp", rate=4)
# Two-sample K-S test
ks.test(IRIS_TABLE$Petal.Length, IRIS_TABLE$Sepal.Length)
# t-test; the alternative hypothesis can be varied through the alternative argument
t.test(IRIS_TABLE$Petal.Length, alternative="two.sided", mu=0, conf.level=0.9)
# F test to compare variances
var.test(IRIS_TABLE$Petal.Length, IRIS_TABLE$Sepal.Length, ratio=0.75,
         alternative="two.sided", conf.level=0.9)
Statistical Tests – Results
Base SAS equivalent functionality
SAS PROCs and SAS Code
PROC SUMMARY DATA=EX1; CLASS CITYSIZE;
VAR QUANTITY AMOUNT;
OUTPUT OUT=A MAX=MAXQUAN MAXAMT
MAXID(QUANTITY(REGION) AMOUNT(PRODUCT))=REGNMXQ PRODMXA
MINID(QUANTITY(REGION)) = REGNMNQ;
RUN;
PROC PRINT; RUN;
proc sort data=account out=sorted;
by town descending debt accountnumber;
run;
proc print data=sorted;
var company town debt accountnumber;
title 'Customers with Past-Due Accounts';
title2 'Listed by Town, Amount, Account Number';
run;
ods graphics on;
proc freq data=Color order=data;
tables Hair / nocum chisq testp=(30 12 30 25 3)
plots(only)=deviationplot(type=dot);
weight Count;
by Region;
title 'Hair Color of European Children';
run;
ods graphics off;
SAS Characterization
• Client-side tool that requires data extracts from the database
• ETL limits data available to data analysts / statisticians
• Base SAS language similar in concept to R
– Includes math-like constructs, math functions and graphics
– Libraries with Base SAS similar to R packages that contain sophisticated
math/stat functions
• SAS applications mostly use Base SAS
– Compete with Oracle applications, e.g.,
• Financial Services: banking credit risk, anti-money laundering, etc.
• Health Sciences: clinical trials, drug safety
• Retail Planning: assortment, forecasting
Base SAS
• A programming language that contains statements, expressions, and functions
– Foundation for most SAS products
– DATA STEP is the most popular SAS statement; it allows manipulation of SAS data sets (select, project, join, filter, derived columns)
• Data analysis procedures (a.k.a. SAS PROCs)
– A PROC is a bundling of several statistics functions + data analysis options (grouping, sorting, filtering, ranking) that support a specific statistical analysis technique on a data set
– e.g., PROC CORR bundles statistics functions and data analysis options required for correlation analysis
• Standard and customized report generation capabilities
• 2D and 3D graphics integrated with SAS PROCs via ODS
• A data management facility which organizes data into persistent tables called SAS data sets
– A SAS data set is a file with concurrent read/write user access
– A SAS data set can also be mapped to a relational table/view
• In this case a SAS user is expected to craft SQL statements to send over a SAS ODBC connection
– When SAS data sets get large, they are managed by a SAS database product called Scalable Performance Data Server
SAS Customers using Oracle R Enterprise – Why?
• ORE statistics engine functions facilitate SAS user transition to ORE
– SAS PROCs widely used
– ORE provides similar syntax to enable reuse of SAS users' skills
• ORE key benefits over SAS
– ORE leverages Oracle Database as compute engine
– SAS requires extracts to middle tier file system
• Problems of latency, data security, extra infrastructure, complexity
– ORE makes operationalizing results easy
• Deploying SAS models is nontrivial
• Requires software engineers to work with SAS analysts to rewrite the SAS model in SQL
or C
• ORE eliminates that hand-off pain
– SAS software is rented annually
• Annual renegotiation with SAS
• Software simply stops working
Oracle R Enterprise “PROCs”
• SUMMARY / MEANS
• RANK
• SORT
• CROSSTAB
• FREQ
• CORR
• UNIVARIATE
ore.summary
• Provides descriptive statistics and extensive analysis of columns in
an ore.frame with flexible row aggregations
• Statistics – mean, min, max, mode, #missing values, sum, weighted sum
– corrected and uncorrected sum of squares, range of values,
stddev, stderr, variance
– t-test for testing hypothesis that the population mean is zero
– kurtosis, skew, Coefficient of Variation
– quantiles: p1,p5,p10,p25,p50,p75,p90,p95,p99,qrange
– 1 and 2 sided Confidence Limits for the mean: clm, rclm, lclm
– extreme value tagging
• Simple syntax abstracting complex SQL queries
ore.summary – Parameters
• data: ore.frame of data on which to compute descriptive statistics
• class: column(s) of data to aggregate (i.e., SQL group by)
• var: numeric column(s) of data on which to apply statistics functions (SQL select list)
• stats: list of statistics functions available to be applied on var columns -
mean,min,max,cnt,n,nmiss,css,uss,cv,sum,sumwgt,range,stddev,stderr,var,t,probt,kurt,
skew,p1,p5,p10,p25,p50,p75,p90,p95,p99,qrange,lclm,rclm,clm,mode
• weight: A column of data whose numeric values provide a multiplicative factor for var columns
• maxid, minid: for each group optionally list max/min value from other columns in data
• ways: restrict output to only certain grouping levels of the class variables
• group.by: column(s) of data to stratify summary results across
Returns an ore.frame as output in all cases except when the group.by clause is used.
When the group.by clause is used, returns a list of ore.frames, one per stratum.
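For readers without a database connection, the class / var / stats idea can be approximated on a local data frame. The following is a minimal base R sketch, not a call to ore.summary: the toy data frame `narrow` and the output column names are assumptions made for illustration only.

```r
# Local, in-memory analogue of class='GENDER', var='AGE', stats='mean,min,max'.
# `narrow` is hypothetical toy data standing in for the NARROW table.
narrow <- data.frame(
  GENDER = c("F", "F", "M", "M", "M"),
  AGE    = c(30, 40, 20, 30, 40)
)

# One row per class level (the SQL GROUP BY), several statistics per var column.
by_gender <- aggregate(
  AGE ~ GENDER, data = narrow,
  FUN = function(v) c(mean = mean(v), min = min(v), max = max(v))
)

# aggregate() returns the statistics as a matrix column; flatten it.
by_gender <- do.call(data.frame, by_gender)
names(by_gender) <- c("GENDER", "MEAN_AGE", "MIN_AGE", "MAX_AGE")
print(by_gender)
```

Unlike this sketch, which pulls all rows into R memory, ore.summary is described above as generating SQL so the aggregation runs in the database.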
ore.summary – Examples
# Default statistics – mean, min, max for columns AGE, CLASS, ROLLUP(GENDER)
ore.summary(NARROW, class='GENDER', var ='AGE,CLASS')
# Report just skew(AGE) as column A and probt(CLASS) as column B
ore.summary(NARROW, class='GENDER', var='AGE,CLASS', stats='skew(AGE)=A, probt(CLASS)=B')
# Weighted sum (var * weight)
ore.summary(NARROW, class='GENDER', var='AGE', stats='sum=X', weight='YRS_RESIDENCE')
# Separate group-bys: Group by GENDER, Group by Marital Status
ore.summary(NARROW, class='GENDER, MARITAL_STATUS', var='CLASS', ways=1)
# All possible group-bys: CUBE Gender, Marital Status
ore.summary(NARROW, class='GENDER, MARITAL_STATUS', var='CLASS', ways='nway')
ore.summary – Results
ore.summary – Examples
# Sort results by ascending frequency count
ore.summary(NARROW, class='GENDER', var ='AGE,CLASS', order='freq')
# Sort results by descending aggregation level
ore.summary(NARROW, class='GENDER', var ='AGE,CLASS', order='-level')
# Stratification by country: For each country report statistics on AGE, ROLLUP by GENDER, CLASS
ore.summary(NARROW, class='GENDER, CLASS', var ='AGE', group.by='COUNTRY')
# In addition to aggregated statistics on AGE, for each grouping of class variable, report country
# with the maximum and minimum value for YRS_RESIDENCE column and label the result as maxcol and
# mincol respectively
x = ore.summary(NARROW, class='GENDER', var='AGE',
maxid='COUNTRY/YRS_RESIDENCE=maxcol',
minid='COUNTRY/YRS_RESIDENCE=mincol')
ore.summary – Results
ore.summary - Examples
• Compute the mean, standard deviation, and count for arrival delay
for each destination airport
• Compute the maximum arrival and departure delay for each airline
and the corresponding destination airport
res <- ore.summary(data=ONTIME_S,
var='ARRDELAY',
class='DEST',
stats='mean=M,stddev=S,n=COUNT')
head(res)
ore.summary(data=ONTIME_S,
class='UNIQUECARRIER',
var='ARRDELAY,DEPDELAY',
stats='MAX(ARRDELAY)=MAXARRDELAY,
MAX(DEPDELAY)=MAXDEPDELAY',
maxid='DEST/ARRDELAY=DESTMAXARRDELAY,
DEST/DEPDELAY=DESTMAXDEPDELAY')
ore.summary – Results
ore.rank
• Enables investigation of distribution of values in numeric
columns of an ore.frame
• Highlights
– Ranking within groups
– Partition rows into groups based on rank tiles
– Cumulative percentages and percentiles
– Treatment of ties
– Calculation of normal scores from ranks
• Simple syntax abstracting complex SQL queries
ore.rank – Parameters
• data : ore.frame of the data to compute rankings on
• var : numeric columns in data to rank
• desc : ranks in descending (asc is default) order if TRUE
• groups : partition rows into #groups based on ranks. For percentiles, #groups=100; for deciles,
#groups=10; for quartiles, #groups=4.
• group.by : rank each group identified by group.by columns separately
• ties : specification of tie treatment. Assign largest of/smallest of/mean of corresponding ranks of
tied values
• fraction : rank of a column value ÷ # non-missing column values
• nplus1 : rank of a column value ÷ (# non-missing column values + 1)
– fraction and nplus1 options can be used to estimate the cumulative distribution function
• percent : (rank of a column value ÷ # non-missing column values) * 100
• Scoring Methods
– To compute exponential scores from ranks, use savage
– To compute normal scores, use one of blom, tukey, or vw (van der Waerden)
Returns an ore.frame as output in all cases
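The quantities listed above can be illustrated on a small local vector. This is a base R sketch under stated assumptions: the sample values are made up, the normal-score formulas are the textbook Blom, Tukey, and van der Waerden definitions, and the grouping rule floor(rank * groups / (n + 1)) is one common convention. Whether ore.rank uses exactly these formulas is not confirmed by the slides.

```r
age <- c(25, 31, 31, 40, 58)        # hypothetical values with one tie
n   <- sum(!is.na(age))             # number of non-missing values

# Tie treatment: mean of, smallest of, or largest of the tied ranks.
r_mean <- rank(age, ties.method = "average")
r_low  <- rank(age, ties.method = "min")
r_high <- rank(age, ties.method = "max")

# fraction, nplus1, and percent as defined in the parameter list.
fraction <- r_mean / n
nplus1   <- r_mean / (n + 1)
percent  <- 100 * r_mean / n

# Partition into 4 groups (quartiles); group numbers start at 0 here.
quartile_group <- floor(r_mean * 4 / (n + 1))

# Normal scores from ranks (textbook definitions, assumed here).
blom  <- qnorm((r_mean - 3/8) / (n + 1/4))
tukey <- qnorm((r_mean - 1/3) / (n + 1/3))
vw    <- qnorm(r_mean / (n + 1))

print(data.frame(age, r_mean, r_low, r_high, nplus1, quartile_group, blom))
```

The nplus1 variant keeps every estimate strictly below 1, which is why it is convenient as an empirical cumulative distribution estimate before applying qnorm.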
ore.rank – Examples
# Rank 2 columns and report them as derived columns
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass')
class(x)
y <- ore.sort(data=x, by='RankOfAge')
# Handling of ties
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', ties='low')
head(x,10)
# Rank within groups
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', group.by='COUNTRY')
head(x,10)
ore.rank – Examples
# Partition rows into groups, e.g., deciles
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', groups=10)
# Partition rows into groups, e.g., quartiles
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', groups=4)
# Estimating cumulative distribution function
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', nplus1=TRUE)
# Scores calculation
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass',
              score='savage', groups=100, group.by='COUNTRY')
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', score='blom')
ore.sort
• Enables flexible sorting of columns in a data frame
• Can be used with other data pre-processing functions
– Sorting happens in the database
– (Top k) results of sorting can be provided as input to R
visualization
• Supports database nls.sort option
ore.sort - Parameters
• data : ore.frame of the data to be sorted
• by : columns in data to sort
• stable : Allows relative order to be maintained within sorted groups
(TRUE/FALSE)
• reverse: Allows optional reversal of collation order for character variables
(TRUE/FALSE)
• unique.keys : Allows optional removal of rows with duplicate values in the
column(s) being sorted from appearing in the result (TRUE/FALSE)
• unique.data: Allows optional removal of duplicate rows from appearing in the
result (TRUE/FALSE)
Returns an ore.frame as output in all cases
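The by, unique.keys, and unique.data options have rough in-memory counterparts in base R. The sketch below is an illustration on a hypothetical data frame `flights`, not the ORE implementation, and it sorts in R memory rather than in the database.

```r
# Hypothetical toy data standing in for a table such as ONTIME_S.
flights <- data.frame(
  ID      = 1:5,
  CARRIER = c("AA", "AA", "DL", "DL", "DL"),
  DELAY   = c(10, 5, 7, 7, 3),
  stringsAsFactors = FALSE
)

# Analogue of by='-CARRIER,DELAY': carrier descending, delay ascending.
# A per-key `decreasing` vector requires method = "radix".
idx    <- order(flights$CARRIER, flights$DELAY,
                decreasing = c(TRUE, FALSE), method = "radix")
sorted <- flights[idx, ]

# Analogue of unique.keys=TRUE: keep one row per combination of sort keys.
unique_keys <- sorted[!duplicated(sorted[, c("CARRIER", "DELAY")]), ]

# Analogue of unique.data=TRUE: drop rows that are duplicated in every column.
unique_rows <- unique(sorted)

print(sorted)
```

Here the two DL rows with a delay of 7 differ only in ID, so unique_keys drops one of them while unique_rows keeps both.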
ore.sort – Examples
# Sort all specified columns in desc order
x <- ore.sort(data=NARROW, by='AGE,GENDER', reverse=TRUE)
# Sort AGE in desc order but GENDER in ascending order
x <- ore.sort(data=NARROW, by='-AGE,GENDER')
# Keep just 1 row per unique value of AGE
x <- ore.sort(data=NARROW, by='AGE', unique.keys=TRUE)
# Remove duplicate rows
x <- ore.sort(data=NARROW, by='AGE', unique.data=TRUE)
# Remove duplicate rows as well as rows with duplicate values for AGE
x <- ore.sort(data=NARROW, by='AGE', unique.data=TRUE, unique.keys=TRUE)
# Maintain relative order within sorted output
x <- ore.sort(data=NARROW, by='AGE', stable=TRUE)
ore.sort – Examples
• Sort the ONTIME_S data by airline descending and departure delay
ascending
• Sort ONTIME_S by airline and departure delay, but keep only one row for each
combination, i.e., unique keys
sortedOntime <- ore.sort(data=ONTIME_S, by='-UNIQUECARRIER,DEPDELAY')
head(sortedOntime[,c(11,18)], 20)
sortedOntime2 <- ore.sort(ONTIME_S, by='UNIQUECARRIER,DEPDELAY', unique.keys=TRUE)
head(sortedOntime2[,c(11,18)], 20)
ore.crosstab - Basics
Input data set
One way tables
2 way table
A general KxL table
ore.crosstab
• Enables cross column frequency analysis of an ore.frame
• A sophisticated variant of R's table() function for two variables
• Builds tables of frequency counts across columns of a data frame
• Required as a precursor to frequency analysis using ore.freq
• R translated to 100% SQL
ore.crosstab – Parameters
• expr: cross tabulation definition in the form
[COLUMN_SPEC] ~ COLUMN_SPEC [ * <WEIGHTING COLUMN>]
[ / <GROUPING COLUMN>]
[ ^ <STRATIFICATION COLUMN>]
[ | ORDER_SPEC]
COLUMN_SPEC is <column-name>[+COLUMN_SET][+COLUMN_RANGE]
COLUMN_SET is <column_name>[+COLUMN_SET]
COLUMN_RANGE is <FROM COLUMN>-<TO COLUMN>
ORDER_SPEC is one of [-]NAME, [-]DATA, [-]FREQ, or INTERNAL
• data: ore.frame of data to cross tabulate
• group.by: as many cross tabulations as unique values in grouping columns
• order: optional sorting of table data
[-] NAME: Sort by tabulation column names, [-]FREQ: Sort by frequency counts in the table
• weights : Column of data that indicates frequency of occurrence of the corresponding
row
• strata : Column name used to cluster, or group, the data in combination
Returns an ore.frame as output in all cases, except when multiple tables are created – in this case an ore.list is returned
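The one-way, two-way, and weighted forms of the expr grammar have close in-memory counterparts in base R's xtabs(). The sketch below uses a hypothetical data frame `survey` with a made-up weight column WT; it only illustrates what each form counts and does not reproduce the ore.crosstab syntax or its SQL generation.

```r
# Hypothetical toy data; WT plays the role of the weighting column.
survey <- data.frame(
  GENDER         = c("F", "F", "M", "M", "M", "F"),
  MARITAL_STATUS = c("Married", "NeverM", "Married", "Married", "NeverM", "Married"),
  WT             = c(1, 2, 1, 3, 1, 2)
)

# One-way table, comparable in spirit to ~GENDER.
one_way <- xtabs(~ GENDER, data = survey)

# Two-way table, comparable in spirit to MARITAL_STATUS ~ GENDER.
two_way <- xtabs(~ MARITAL_STATUS + GENDER, data = survey)

# Weighted two-way table: each row contributes WT instead of 1.
weighted <- xtabs(WT ~ MARITAL_STATUS + GENDER, data = survey)

print(two_way)
print(weighted)
```

A weight column is useful when the input is already aggregated, so that each row stands for several observations.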
ore.crosstab – Examples
# For comparison, look at R's table function on 2 columns
table(NARROW$MARITAL_STATUS, NARROW$GENDER)
# Corresponding ore.crosstab(), extensible to more than 2 columns
ore.crosstab(MARITAL_STATUS ~ GENDER, data=NARROW)
# MARITAL_STATUS x GENDER and MARITAL_STATUS x CLASS
x = ore.crosstab(MARITAL_STATUS ~ GENDER+CLASS, data=NARROW)
# One way table
ore.crosstab(~AGE, data=NARROW)
# Weight values in AGE and GENDER using values in CLASS
x = ore.crosstab(AGE~GENDER*CLASS, data=NARROW)
# Order rows of cross tab by frequency counts
x = ore.crosstab(AGE~GENDER|FREQ, data=NARROW)
ore.crosstab – Results
ore.crosstab – Examples
# Four 2-way cross tabs: (GENDER, AGE, MARITAL_STATUS, COUNTRY) ~ CLASS
x = ore.crosstab(GENDER-COUNTRY~CLASS, data=NARROW)
length(x)
# 1-way table with as many rows as unique values of COUNTRY for each unique value of AGE
x = ore.crosstab(~AGE/COUNTRY, data=NARROW)
# Same as above, but 2-way
x = ore.crosstab(AGE~GENDER/COUNTRY, data=NARROW)
# Post-process output: two 2-way tables, AGE x GENDER and AGE x CLASS
x = ore.crosstab(AGE ~ GENDER+CLASS, data=NARROW)
class(x)
class(x[[1]])
names(x[[1]])
z <- x[[1]][,c(1,2,3)]
ore.crosstab – Results
ore.crosstab – Examples
# Stratification: as many 2-way tables as unique values of CLASS
x <- ore.crosstab(AGE~GENDER^CLASS, data=NARROW)
# Custom binning and subsequent cross tabulation
NARROW$AGEBINS = ifelse(NARROW$AGE<20, 1,
                 ifelse(NARROW$AGE<30, 2,
                 ifelse(NARROW$AGE<40, 3, 4)))
ore.crosstab(GENDER~AGEBINS, NARROW)
ore.crosstab – Results
ore.extend
• Augments cross tabulation produced using ore.crosstab()
with 3 basic statistics
– ROW and COLUMN SUMs
crosstab = ore.extend.sum(crosstab)
– CUMULATIVE SUMs for each cell of the table
crosstab = ore.extend.cumsum(crosstab)
– TOTAL for the entire table
crosstab = ore.extend.total(crosstab)
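The three augmentations can be pictured on an ordinary R table. This is a minimal base R sketch with made-up counts; the column-major order used for the cumulative sums is an assumption, since the slides do not say in which cell order ore.extend.cumsum accumulates.

```r
# Hypothetical 2 x 2 table of counts.
ct <- as.table(matrix(
  c(2, 1, 2, 1), nrow = 2,
  dimnames = list(MARITAL_STATUS = c("Married", "NeverM"),
                  GENDER = c("F", "M"))
))

# Row and column sums (the bottom-right cell holds the grand total).
with_sums <- addmargins(ct)

# Cumulative sum for each cell, walking the cells in column-major order.
cumulative <- ct
cumulative[] <- cumsum(as.vector(ct))

# Total for the entire table.
total <- sum(ct)

print(with_sums)
print(cumulative)
```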
ore.crosstab – Augmenting cross-tabulations
• Adding row and column sums
x <- ore.crosstab(GENDER~AGEBINS, NARROW)
y <- ore.extend.sum(x)
• Adding cumulative sums for each cell of crosstab
x <- ore.crosstab(GENDER~AGEBINS, NARROW)
y <- ore.extend.cumsum(x)
• Adding total for the entire table
y <- ore.extend.total(x)
ore.crosstab – Results
ore.freq
• Operates on output of ore.crosstab() and automatically determines the techniques
relevant to the nature of the result
• 1-way cross tables
– Goodness of fit tests for equal proportions or specified null proportions, confidence
limits, and tests for equivalence
• 2-way cross tables
– Various statistics that describe relationships between columns in the cross tabulation
– Chi-square tests, Cochran-Mantel-Haenszel statistics, measures of association, strength
of association, risk differences, odds ratio and relative risk for 2x2 tables, tests for trend
• N-way cross tables
– N 2-way cross tables
– Statistics across and within strata
• Leverages database SQL functions when available
ore.freq - Parameters
• crosstab: ore.frame output from ore.crosstab()
• stats: List of statistics required
– ChiSquare: AJCHI, LRCHI, MHCHI, PCHISQ
– Kappa: KAPPA, WTKAP
– Lambda: LAMCR, LAMRC, LAMDAS
– Correlation: KENTB, PCORR, SCORR
– Stuart's Tau-c, Somers' D: STUTC, SMDCR, SMDRC
– Fisher's, Cochran's Q: FISHER, COCHQ
– Odds Ratio: OR, MHOR, LGOR
– Relative Risk: RR, MHRR, ALRR
– Others: MCNEM, PHI, CRAMV, CONTGY, TSYM, TREND, GAMMA
• params: Control parameters to the specific statistical function
– SCORE: TABLE|RANK|RIDIT|MODRIDIT
– ALPHA: <number>
– WEIGHTS: <number>
• skip.missing: (TRUE/FALSE) skip cells with missing values in the cross table
• skip.failed: (TRUE/FALSE) if a required statistical test fails on the cross table because it is found to be
inapplicable to the table, then return immediately
Returns an ORE data frame as output in all cases
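To make the 1-way and 2-way cases concrete, the following base R sketch computes a goodness-of-fit test, a Pearson chi-square test, and Cramer's V on made-up counts. It uses chisq.test from the stats package rather than ore.freq; how these results map onto the PCHISQ and CRAMV codes above is an assumption.

```r
# 1-way table: goodness of fit against equal proportions (hypothetical counts).
one_way <- c(A = 18, B = 22, C = 20)
gof <- chisq.test(one_way, p = rep(1/3, 3))

# 2-way table: Pearson chi-square without continuity correction.
two_way <- matrix(
  c(20, 30, 25, 25), nrow = 2,
  dimnames = list(CARRIER = c("AA", "DL"), DIVERTED = c("0", "1"))
)
pearson <- chisq.test(two_way, correct = FALSE)

# Cramer's V derived from the Pearson statistic.
chi2     <- unname(pearson$statistic)
cramer_v <- sqrt(chi2 / (sum(two_way) * (min(dim(two_way)) - 1)))

print(gof)
print(pearson)
print(cramer_v)
```

For a 2x2 table, Cramer's V equals the absolute value of the phi coefficient, which makes it a quick check on strength of association.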
ore.freq – Examples
• Compute cross tabulation for number of diverted flights for each airline.
Compute the Pearson CHISQ for the results.
• For each airline, compute cross tabulation for number of diverted flights
and day of week. Compute the Pearson CHISQ for each result.
ct <- ore.crosstab(UNIQUECARRIER~DIVERTED, data=ONTIME_S)
ct
freq <- ore.freq(ct)
freq
ct <- ore.crosstab(UNIQUECARRIER~DIVERTED+DAYOFWEEK,data=ONTIME_S)
ct
freq <- ore.freq(ct)
freq
ore.corr
• Correlation analysis across numeric columns in an ore.frame
• Supports partial correlations with a control column
• Enables aggregations prior to correlations
• Allows post-processing of results and integration into R code flow
• Output can be made to conform to output of R's cor() function and so
can be post-processed by any CRAN function or graphics
ore.corr – Parameters
• data: ore.frame of the data to compute correlation coefficients
• var: numeric column(s) of the data for which to build correlation matrix
• group.by: as many correlation matrices as unique values in group.by columns
• weight: A column of the data whose numeric values provide a multiplicative factor for var
columns
• partial: columns of data to use as control variables for partial correlation
• stats: pearson (default), spearman, kendall
Use OREeda:::.ore.corr.as.matrix() to convert into R's cor() compatible output format
Returns an ore.frame as output in all cases except when group.by is used in which case an ore.list object is returned
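Partial correlation with a single control variable can be checked locally in two equivalent ways: correlate the residuals after regressing each variable on the control, or apply the closed-form formula to the pairwise Pearson correlations. The sketch below uses simulated data and base R only; it illustrates the concept behind the partial argument and is not the ORE implementation.

```r
set.seed(1)
z <- rnorm(200)          # control variable (simulated)
x <- z + rnorm(200)      # both x and y depend on z
y <- z + rnorm(200)

# Ordinary Pearson correlation, inflated by the shared dependence on z.
plain <- cor(x, y)

# Partial correlation via residuals: remove the linear effect of z first.
partial_resid <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

# The same quantity from the closed-form formula.
r_xy <- cor(x, y); r_xz <- cor(x, z); r_yz <- cor(y, z)
partial_formula <- (r_xy - r_xz * r_yz) /
  sqrt((1 - r_xz^2) * (1 - r_yz^2))

print(c(plain = plain, partial = partial_resid))
```

The closed form is

$$ r_{xy\cdot z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}} $$

and it agrees with the residual approach up to floating-point error.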
ore.corr – Examples
# R's cor: project out non-numeric columns first
names(NARROW)
N <- ore.pull(NARROW[,c(3,8,9)])
cor(N, use="complete.obs")
cor(N, method='spearman', use="complete.obs")
cor(N, method='kendall', use="complete.obs")
# Corresponding ore.corr
x1 <- ore.corr(NARROW, var='AGE,YRS_RESIDENCE,CLASS')
x2 <- ore.corr(NARROW, var='AGE,YRS_RESIDENCE,CLASS', stats='spearman')
x3 <- ore.corr(NARROW, var='AGE,YRS_RESIDENCE,CLASS', stats='kendall')
cor_compatible_matrix = OREeda:::.ore.corr.as.matrix(x3)
class(cor_compatible_matrix)
ore.corr - Results
ore.corr – Examples
# Partial correlation
ore.corr(NARROW, var='AGE,YRS_RESIDENCE,CLASS', stats='spearman', partial='GENDER')
# Creating a number of correlation matrices, one per COUNTRY
x <- ore.corr(NARROW, var='AGE,YRS_RESIDENCE,CLASS',
              stats='pearson', partial='GENDER', group.by='COUNTRY')
class(x)
cor_compatible_matrix <- OREeda:::.ore.corr.as.matrix(x[[1]])
ore.corr Post-processing matrix using CRAN visualization
circle.corr( cor(mtcars), order = TRUE, bg = "gray50",
col = colorRampPalette(c("blue","white","red"))(100))
circle.corr(cor_compatible_matrix,
order = TRUE, bg = "gray50",
col = colorRampPalette(c("blue","white","red"))(100))
• http://addictedtor.free.fr/graphiques/RGraphGallery.php?graph=152
ore.univariate
• Enables distribution analysis of numeric variables in an ore.frame
• Statistics
– All statistics reported by ore.summary()
– Signed rank test, Student's t-test
– Extreme values reporting
• Graphics
– QQ plots
– Scatterplots
ore.univariate – Parameters
• data: ore.frame of the data whose columns are to be analyzed
• var: numeric column(s) of the data for which to compute statistics
• weight: A column of the data whose numeric values provide a multiplicative factor for var
columns
• stats: optional specification of a subset of statistics to be printed
moments – n,sumwgt,mean,sum,stddev,var,skew,kurt,uss,css,cv,stderr
measures – mean,stddev,median,var,mode,range,iqr
quantiles – p100,p99,p95,p90,p75,p50,p25,p10,p5,p1,p0
location – studentt,studentp,signt,signp,srankt,srankp
normality
loccount – loc<,loc>,loc!
extremes
Returns an ore.frame as output in all cases except when group.by is used in which case an R list object is returned
ore.univariate - Examples
# Default univariate statistics
ore.univariate(NARROW, var="AGE,YRS_RESIDENCE,CLASS")
# Compute location statistics on YRS_RESIDENCE
ore.univariate(NARROW, var="YRS_RESIDENCE",stats="location")
# Compute quantiles statistics on AGE and YRS_RESIDENCE
ore.univariate(NARROW, var="AGE,YRS_RESIDENCE",stats="quantiles")
ore.univariate – Results
Normal QQ Plot – Univariate Graphics Arrival and Departure Delay
ontime <- ONTIME_S
# Keep only carrier AA, then only the two delay columns
ontimeSubset <- ontime[ontime$UNIQUECARRIER == "AA", , drop = TRUE]
ontimeSubset <- ontimeSubset[, c("ARRDELAY", "DEPDELAY")]
# Pull the data to the client and take a 3% random sample for plotting
ontimeSubset <- ore.pull(ontimeSubset)
ontimeSubset <- ontimeSubset[sample(nrow(ontimeSubset), nrow(ontimeSubset)*.03),]
dat = ontimeSubset$ARRDELAY
type = "Arrival"
#dat = ontimeSubset$DEPDELAY
#type = "Departure"
qqnorm(dat, xlab="Normal Quantiles",
       ylab=paste(type, " Delay (min)", sep=""))
qqline(dat, col = 2)
# Avoid masking the mean() and sd() functions with variables of the same name
datMean <- mean(dat, na.rm=TRUE)
datSd <- sd(dat, na.rm=TRUE)
legend("top", paste("normal line mean= ", round(datMean,2),
       " sd= ", round(datSd,2), sep=""), lty=1, col="red")
Scatterplot Matrix Airline, Arrival Delay, Departure Delay, Distance
ontime <- ONTIME_S
ontimeSubset <- ontime[ontime$UNIQUECARRIER %in%
                       c("AA", "AS", "CO", "DL", "HP", "WN"), ,
                       drop = TRUE]
# Pull the data to the client and take a 3% random sample for plotting
ontimeSubset <- ore.pull(ontimeSubset)
ontimeSubset <- ontimeSubset[sample(nrow(ontimeSubset), nrow(ontimeSubset)*.03),]
pairs(~UNIQUECARRIER+ARRDELAY+DEPDELAY+DISTANCE, data=ontimeSubset,
      main="Simple Scatterplot Matrix")
# Same matrix with a histogram of each variable on the diagonal
os2 <- ontimeSubset[, c("UNIQUECARRIER", "ARRDELAY", "DEPDELAY", "DISTANCE")]
panel.hist <- function(x, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, breaks=20, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)   # scale bar heights to [0, 1]
  rect(breaks[-nB], 0, breaks[-1], y, col="cyan", ...)
}
pairs(os2, diag.panel=panel.hist, main="Scatterplot Matrix with Histograms")
ORE support for Hadoop
Oracle R Hadoop Connector
Transparently leverage Hadoop for High Performance Analytics to Big Data Appliance
Function push-down – data transformation & statistics
R workspace console
Oracle statistics engine
BI Publisher, OBIEE, Web Services
1. Interactive access to Hadoop infrastructure from R user's desktop as part of exploratory analysis
2. Lights out execution of production R calculations on Hadoop infrastructure
Hadoop Publicized Examples
• A9.com (Amazon)
– Product search indices
• Adobe
– Social services & structured data store
• eBay
– Search optimization & research
– User growth, page views, ad campaign analysis
• Journey Dynamics
– Forecast traffic speeds from GPS data
• Lineberger Comprehensive Cancer Center
– Analyzes Next Generation Sequence Data for Cancer Genome Atlas
• NAVTEQ Media Solutions
– Optimizes ad selection based on user
interactions
– Stores and processes Tweets
• Yahoo!
– Research into ad systems & web search
What is Hadoop?
• Originally sponsored by Yahoo!; an Apache project, with a distribution from Cloudera
• Based on Google's GFS and MapReduce papers
• Cloudera website states “Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File
System (HDFS) and high-performance parallel data processing using a technique called MapReduce.”
• Enables analysis of Big Data
– Can store huge volumes of unstructured data, e.g., weblogs, transaction data, social
media data
– Enables massive data aggregation
– Highly scalable and robust grid infrastructure
How is data stored and accessed?
• Data stored as flat files which are automatically distributed and replicated
across all Hadoop nodes
• Agreed upon delimiters “structure” the “unstructured” data
– <line feed> indicates end of row, generally
– Each row has (optional) key and value(s)
• Loading data to HDFS is equivalent to copying files on OS
– hadoop fs -put <local file name> <HDFS file name>
• Map-Reduce programs provide access to data on Hadoop
– Map data using keys for independent processing
– Reduce results of “mapper” functions to single result by “reducer” functions
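The map, group-by-key, and reduce steps can be mimicked in plain R on a handful of delimited rows. This is a single-process sketch for intuition only: the tab-delimited sample rows are made up, and nothing here talks to Hadoop or uses the Oracle R Connector for Hadoop.

```r
# Hypothetical rows as they might sit in a flat file: key <tab> value.
rows <- c("AA\t10", "DL\t5", "AA\t20", "DL\t15", "WN\t0")

# Map: turn each row into a (key, value) pair, independently of the others.
mapped <- lapply(strsplit(rows, "\t", fixed = TRUE), function(fields) {
  list(key = fields[1], value = as.numeric(fields[2]))
})

# Shuffle: bring together all values that share a key.
keys    <- vapply(mapped, function(kv) kv$key, character(1))
values  <- vapply(mapped, function(kv) kv$value, numeric(1))
grouped <- split(values, keys)

# Reduce: collapse each key's values to a single result (here, the mean).
reduced <- vapply(grouped, mean, numeric(1))
print(reduced)
```

Because each mapper call looks at one row only and each reducer call looks at one key only, both phases can be spread across many nodes, which is the property Hadoop exploits.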
Oracle R Connector for Hadoop
• Provide transparent access to Hadoop and HDFS-resident data
– Hadoop – a high-performance distributed computational system
– HDFS – a distributed high-availability file storage system
• R users not forced to learn new language or interface to work with Hadoop / HDFS
• Ability to leverage open source contributed R packages to work on HDFS-resident data
• Facilitate enterprise collaboration to share mappers and reducers to database R repository
• Transition work from lab to production deployment on a Hadoop cluster without requiring
knowledge of Hadoop internals, Hadoop CLI, or IT infrastructure
• Computation cluster administrators do not need to learn R to schedule R map-reduce
models in production
• Facilitate off-loading computationally intensive algorithms to commodity hardware
Oracle R Connector for Hadoop Concepts
• HDFS access API
– Exposes HDFS CLI API as R functions where HDFS datasets are represented as a special R object
– Enables all operations that can be performed with “hadoop fs <cmd>”
• HDFS transparency layer
– Allows all R functions that operate on data.frames to execute on HDFS-resident data transparently
• Hadoop execution engine
– Allows scheduling of map-reduce jobs from within the R environment to the Hadoop engine
• Hadoop job driver
– Controls data retrieval/translation, execution of map/reduce user functions, and data storage on the server side (e.g., Hadoop cluster)
• Database data loader
– Allows export of data from Oracle Database to HDFS and vice versa
• Data translation layer
– Communicates with HDFS and optionally Hive to store metadata for all data exported to and imported from HDFS
• HDFS data sampler
– Samples data that was originally resident in HDFS and has no metadata attached, and creates the missing metadata object
Oracle R Hadoop Cluster (ORHC) Architecture
[Architecture diagram: components by host]
• Client Host (e.g., laptop)
– R engine, ORE client packages, orhc
– Hadoop Cluster Software, Java VM
• Server Machine (e.g., Big Data Appliance)
– R engine, orhc-drv package, Java VM
• DBMS Machine (e.g., Exadata)
– R engine, ORE libraries, ORE packages, Oracle Database
• Hadoop Cluster
– MapReduce nodes: JobTracker, Task nodes
– HDFS nodes: Name node, Data nodes
Prerequisites for use
Client Side
• R engine and R package ORHC are installed on the client host machine
– Provides API to access Hadoop facilities within R
– Controls all interactions with Hadoop and HDFS cluster
• Hadoop client software (HCS) - Sqoop
– Must be deployed and configured on client host where ORHC is installed
for access to Hadoop cluster and CLI
– ORHC uses HCS to connect to HDFS cluster
– HCS handles authentication from client host to the Hadoop cluster
machine
– Client host does not participate in Hadoop computations or as HDFS file
system nodes
• Oracle R Enterprise packages
– DBI, ROracle, OREbase, OREeda, OREgraphics, OREstats, RToXmp, etc.
– Required for access to Oracle Database
– If not present
• ORHC client can operate only with in-memory R objects or local data files
• ORHC will not have access to advanced statistical algorithms
Server Side
• Big Data Appliance
• R engine as shared library with all standard R base
libraries deployed on each Hadoop node plus ORHC-DRV
R package with ORHC drivers libraries
• Cloudera’s Sqoop
• Java VM (Hotspot VM is preferred)
– Required by both Hadoop software and ORHC
– Hadoop is implemented in Java
– ORHC uses Hadoop CLI to interact with Hadoop cluster
therefore Java VM must be installed on both server and client
sides
Hadoop Cluster Nodes
• Hadoop client software
• R engine
• ORHC-DRV R package
• Java VM
Oracle R Enterprise and Hadoop
• Goal: Average the stopping distance where distance is greater than 30 feet for each speed
• Map function returns key-value pairs where column “dist” is greater than 30
• Reduce function takes the average of all the returned values
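# Store the 'cars' data set in HDFS, keyed by speed
cars.dfs <- hdfs.put(cars, key='speed')
res <- hadoop.run(
  cars.dfs,
  # Mapper: emit only rows whose stopping distance exceeds 30
  mapper = function(key, val) {
    if (val$dist > 30) {
      keyval(key, val)
    } else {
      NULL
    }
  },
  # Reducer: average the distances collected for each speed
  reducer = function(key, vals) {
    X <- 0
    for (x in vals) {
      X <- X + x$dist
    }
    X <- X / length(vals)
    keyval(key, X)
  }
)
res
hdfs.get(res)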
R> res
[1] "orhc2d12adf7"
attr(,"orhc.dfsid")
[1] TRUE
R> hdfs.get(res)
key val1
1 10 34.00000
2 13 38.00000
3 14 58.66667
4 15 54.00000
5 16 36.00000
6 17 40.66667
7 18 64.50000
8 19 50.00000
9 20 50.40000
10 22 66.00000
11 23 54.00000
12 24 93.75000
13 25 85.00000
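The result above can be cross-checked locally without Hadoop. The following sketch uses only base R and the cars data set that ships with R; the variable names are illustrative, and the column names key and val1 are set by hand to mirror the output shown above.

```r
# 'cars' ships with base R (package datasets): columns speed and dist.
far <- cars[cars$dist > 30, ]                              # "map": keep rows with dist > 30
check <- aggregate(dist ~ speed, data = far, FUN = mean)   # "reduce": mean distance per speed
names(check) <- c("key", "val1")                           # mirror the names in the output above
print(check)
```

Speeds with no stopping distance above 30 (for example, 11 and 12) drop out entirely, which is why the result has fewer rows than there are distinct speeds.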
Oracle R Enterprise and Hadoop
• Goal: take the average arrival delay for all flights to SFO.
ontime <- ore.pull(ONTIME_S[ONTIME_S$YEAR==2007,])
ontime.dfs <- hdfs.put(ontime, key='DEST')
res <- hadoop.run(
  ontime.dfs,
  mapper = function(key, ontime) {
    if (key == 'SFO') {
      keyval(key, ontime)
    } else {
      NULL
    }
  },
  reducer = function(key, vals) {
    sumAD <- 0; count <- 0
    for (x in vals) {
      if (!is.na(x$ARRDELAY)) {
        sumAD <- sumAD + x$ARRDELAY
        count <- count + 1
      }
    }
    res <- sumAD / count
    keyval(key, res)
  }
)
hdfs.get(res)
Oracle R Enterprise and Hadoop
• Goal: take the average arrival delay for flights to SFO by airline.
ontime <- ore.pull(ONTIME_S[ONTIME_S$YEAR==2007,])
ontime.dfs <- hdfs.put(ontime, key='UNIQUECARRIER')
res <- hadoop.run(
  ontime.dfs,
  mapper = function(key, ontime) {
    if (ontime$DEST == 'SFO') {
      keyval(key, ontime)
    } else {
      NULL
    }
  },
  reducer = function(key, vals) {
    # Note: NA delays are not filtered here, so any NA makes that carrier's average NA
    sumAD <- 0
    for (x in vals) sumAD <- sumAD + x$ARRDELAY
    res <- sumAD / length(vals)
    keyval(key, res)
  }
)
res
hdfs.get(res)
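The reducer in this example sums ARRDELAY without checking for missing values, unlike the previous example. In R, adding NA to a number yields NA, so a single missing delay makes a carrier's average NA. The effect can be seen in a small base-R sketch; the toy rows below are hypothetical and only stand in for the ONTIME_S data.

```r
# Hypothetical toy rows; the real ONTIME_S table is not reproduced here.
flights <- data.frame(
  UNIQUECARRIER = c("AA", "AA", "UA", "UA", "UA"),
  DEST          = c("SFO", "SFO", "SFO", "JFK", "SFO"),
  ARRDELAY      = c(10, NA, 5, 40, 15),
  stringsAsFactors = FALSE
)

# Keep only flights to SFO (the mapper's filter).
sfo <- flights[flights$DEST == "SFO", ]

# Average per carrier without and with NA handling (the reducer's job).
naive <- tapply(sfo$ARRDELAY, sfo$UNIQUECARRIER, mean)
safe  <- tapply(sfo$ARRDELAY, sfo$UNIQUECARRIER, mean, na.rm = TRUE)

print(naive)
print(safe)
```

Skipping NA values and dividing by the count of non-missing delays, as the earlier reducer does, corresponds to the na.rm = TRUE variant.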
Oracle R Enterprise and Hadoop
• Goal: create a file from the ONTIME_S ore.frame to illustrate loading data from a file. Take the average arrival
delay for all flights to SFO. Output is one key-value pair.
• Map function returns key-value pairs where column DEST is “SFO”
• Reduce function produces the mean of arrival delay
mydat <- "ONTIME_S_FILE.dat"
write.csv(ore.pull(ONTIME_S), row.names=FALSE, file=mydat)
ontime.dfs <- hdfs.upload(mydat, header=TRUE)
res <- hadoop.run(
  ontime.dfs,
  mapper = function(key, ontime) {
    …
    }
  },
  reducer = function(key, vals) {
    …
  }
)
print(readLines(hdfs.download(res)))
Database connectivity / interaction
• orhc.connect(host, user, sid, passwd, port, secure=T)
– Establishes connection from ORHC to Oracle Database
– Returns RDBMS connection object
• orhc.disconnect()
– Disconnects from Oracle Database
– Returns RDBMS connection object
• orhc.reconnect()
– Reconnects to Oracle RDBMS with previous credentials
– Faster than orhc.connect()
• orhc.which()
– Displays information about current RDBMS connection
• orhc.dbg.off()
– Turns off all debug output
• orhc.dbg.on('ERROR')
– Turns on error messages only
HDFS connectivity / interaction
• hdfs.connect(dfs.url)
– Establishes connection to Hadoop's HDFS
– Returns HDFS connection object
• hdfs.disconnect()
– Disconnects from Hadoop's HDFS
– Rolls back connection to the default as set up in local Hadoop client configuration
– Returns HDFS connection object
• hdfs.reconnect()
– Reconnects to the previously disconnected Hadoop HDFS
• hdfs.which()
– Displays information about current HDFS connection
HDFS connectivity / interaction
• hdfs.push(x, dfs.name, overwrite, driver, split.by)
– Copies ore.frame from RDBMS to HDFS.
– Returns HDFS object identifier used in HDFS/Hadoop function calls
• hdfs.pull(dfs.id, sep, db.table.name, overwrite, driver)
– Copies HDFS object to RDBMS
– Returns an ore.frame which points to new table
• hdfs.upload(filename, dfs.name, overwrite, split.size, header)
– Uploads local file to HDFS
– Simplest and fastest way to transfer data to HDFS from local storage
– Replicates the local file into HDFS directory
– By default, the HDFS directory gets a unique ID and the HDFS file(s) are named "part-0000x"
– If local file > split.size bytes, file automatically split into several parts
• hdfs.download(dfs.id, filename, overwrite)
– Downloads an HDFS file to the local file system
– Simplest and fastest way to transfer data from HDFS to local storage
– Replicates HDFS directory part-0000x files into the local file by combining all part-0000x files as one
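The part-file idea behind hdfs.upload and hdfs.download can be imitated on the local file system. The sketch below does not call ORHC or HDFS: it splits a local file into part files and recombines them in name order. Splitting by row count (rows.per.part) is an assumption made for simplicity, since hdfs.upload splits by size in bytes, and the part-file naming only follows the "part-0000x" pattern mentioned above.

```r
# Local stand-in for the part-file idea; this does not call ORHC or HDFS.
src <- tempfile(fileext = ".csv")
write.csv(cars, src, row.names = FALSE)
lines  <- readLines(src)
header <- lines[1]
body   <- lines[-1]

# "Upload": split the body into parts inside a fresh directory.
rows.per.part <- 20                      # hypothetical; ORHC splits by bytes (split.size)
idx      <- ceiling(seq_along(body) / rows.per.part)
chunks   <- split(body, idx)
part.dir <- tempfile("dfs-")
dir.create(part.dir)
for (i in seq_along(chunks)) {
  writeLines(chunks[[i]], file.path(part.dir, sprintf("part-%05d", i - 1)))
}

# "Download": concatenate the parts in name order into one local data set.
parts    <- sort(list.files(part.dir, full.names = TRUE))
combined <- c(header, unlist(lapply(parts, readLines)))
print(basename(parts))
```

With the 50-row cars data set and 20 rows per part, this produces three part files whose concatenation reproduces the original file.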
HDFS connectivity / interaction
• hdfs.put(x, key, dfs.name, overwrite)
– Copies data from R in-memory object (data.frame) to HDFS
– Column names, data types, etc. stored as metadata with data
– Differs from hdfs.push in that if x is an ore.frame, the data is first pulled into local R memory and then
loaded to HDFS
– If no dfs.name provided, random name generated and returned
• hdfs.get(dfs.id, sep)
– Copies data from HDFS into R in-memory object
– Metadata extracted and column names, data types, etc. restored if data originated from R
environment
– Otherwise generic reverse-engineered attributes (like val1, val2 for names) are assigned
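The difference between restored and reverse-engineered attributes can be pictured with a headerless delimited file read in base R. This is a sketch of the concept only, not ORHC's implementation: the sample rows are hypothetical, and the names key, val1, ... are assigned by hand to mirror the generic names described above.

```r
# A headerless delimited file, as data might look when it did not originate from R.
raw <- tempfile()
writeLines(c("10,34.0", "13,38.0", "14,58.7"), raw)

# Without metadata, column types must be inferred and names made up.
df <- read.table(raw, sep = ",", header = FALSE)
names(df) <- c("key", paste0("val", seq_len(ncol(df) - 1)))
str(df)
```

When the data was written from R with its metadata, the original column names and types can be restored instead of being guessed.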
HDFS connectivity / interaction
• hdfs.sample(dfs.id, lines, sep)
– Samples (partially copies) data in HDFS and returns an R data.frame in-memory object
– No guarantee that the rows returned are a random sample
• hdfs.attach(dfs.name)
– Brings an HDFS object into ORHC environment
– Attaches "unmanaged" HDFS data to ORHC framework and returns HDFS object identifier
• hdfs.rm(dfs.id)
– Removes data from HDFS
– Invalidates all HDFS object identifiers pointing to the data set
HDFS connectivity / interaction
• hdfs.exists(dfs.id)
– Checks if HDFS object exists
– Validates HDFS object identifier or existence of HDFS data with the specified name in HDFS
– Returns TRUE if data can be attached and used in hadoop.run() function
• hdfs.cd(dfs.path)
– Sets current URI path and optionally connection to an HDFS resource
• hdfs.ls(dfs.path)
– Returns name list of all HDFS data objects (directories) at currently set path
– Only directories containing data are listed
• hdfs.pwd()
– Returns current HDFS working directory
• hdfs.mkdir(dfs.name, cd)
– Creates a new sub-directory in HDFS relative to the current working directory
HDFS connectivity / interaction
• hdfs.rmdir(dfs.name)
– Deletes an existing sub-directory in HDFS relative to the current working directory
– All data objects stored in this directory will be deleted too and, therefore, all associated HDFS
object identifiers will be invalidated
• hdfs.size(dfs.id)
– Returns total size of the HDFS object in bytes
• hdfs.parts(dfs.id)
– Returns number of parts the HDFS object is divided into
Hadoop connectivity / interaction
• hadoop.connect(host, user, passwd, secure)
– Establishes connection to Hadoop's MapReduce
– Returns hadoop connection object if connection was successfully established
• hadoop.disconnect()
– Disconnects (drops connection) from Hadoop's MapReduce JobTracker
– Returns hadoop connection object of the previous settings
– Rolls back connection to default as set up in local Hadoop client configuration
• hadoop.reconnect(hmr.con)
– Reconnects to Hadoop's MapReduce with the hadoop connection object
– After Hadoop connection dropped, ORHC preserves all user credentials and connection attributes
• hadoop.which()
– Displays information about current Hadoop MapReduce connection
Hadoop connectivity / interaction
• hadoop.exec(dfs.id, mapper, reducer, combiner)
– Invokes Hadoop engine and sends mapper, reducer, and combiner R functions for execution on
the server side
– Provides core functionality for Hadoop MapReduce execution
• hadoop.run(dfs.id, mapper, reducer, combiner)
– Invokes Hadoop engine and sends mapper and reducer R functions for execution
– If input data not resident in HDFS…
• Data pushed to HDFS
• User map-reduce script prepared for execution
• Script sent to Hadoop
– On successful execution, results are pulled from HDFS to local R memory or to the database, depending
on input data location
– Internally, invokes hdfs.push() and hdfs.pull() APIs so their security considerations apply
Running jobs locally
• ORHC allows Hadoop jobs to be run locally
– Supports testing scripts before deploying to Hadoop Cluster
– Use the command assign('dry.run', T, ORHC:::.orhc.env)
– Next hadoop.run() command will be executed locally
• Data from HDFS is still accessed as before
Use of ORE in Exadata and BDA environments
Big Data Appliance
• An engineered system optimized for capturing
and integrating “low density” data into Exadata
• High-performance hardware
– Optimized for Hadoop and NoSQL workloads
– InfiniBand networking for integration with Exadata
• Software
– Oracle Hadoop
– Oracle R Hadoop Connector
– Oracle R Enterprise client (optional)
– Oracle NoSQL DB
– Oracle Data Integrator (Hadoop capabilities)
– Oracle Loader for Hadoop
18 Sun X4270 M2 Servers
48 GB memory per node = 864 GB memory
12 Intel cores per node = 216 cores
24 TB storage per node = 432 TB storage
40 Gb/sec InfiniBand
10 Gb/sec Ethernet
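The rack totals above follow directly from the per-node figures multiplied by the 18 servers. A short R check, with variable names chosen only for this illustration:

```r
# Per-node figures from the slide, multiplied by the number of servers.
nodes    <- 18
per.node <- c(memory_GB = 48, cores = 12, storage_TB = 24)
totals   <- nodes * per.node
print(totals)
```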
Big Data Appliance Usage Model
Where to run your job?
• BDA / Hadoop Cluster or Exadata/Oracle Database
• Consider data volumes
– If DBMS data volume dominates, then run in database
– If HDFS data volume dominates with DBMS providing
just coefficients or parameters then run in HDFS
– If both volumes are large, then use DBFS connect to
reach out to HDFS and use the database