©2011 Oracle – All Rights Reserved

Oracle R Enterprise – Training Sessions

Session 6: Advanced Topics

Mark Hornick, Senior Manager, Development

Oracle Advanced Analytics

2

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions.

The development, release, and timing of any features or functionality described for Oracle’s products remain at the sole discretion of Oracle.

3

Oracle R Enterprise Training Sessions

Date / Time, Session, and Topics

• Thursday, December 1, 2:00 PM ET: Getting Started with Oracle R Enterprise (ORE)
– Oracle R Enterprise Overview
– Installation of R
– Installation of Oracle R Enterprise
– Connecting to Exadata from R

• Tuesday, December 6, 11:00 AM ET: Introduction to the R Language and Environment
– R Language Basics
– Producing Graphs in R

• Thursday, December 8, 1:00 PM ET: ORE Transparency Layer
– Interacting with Database Tables
– Manipulating and transforming data through ORE

• Tuesday, December 13, 11:00 AM ET: ORE Embedded R Script Execution
– Execution through R interface
– Execution through SQL interface

• Thursday, December 15, 2:00 PM ET: Operationalizing R Scripts
– From Analyst to Production
– Integration with OBIEE
– XML graph generation using SQL

• Tuesday, December 20, 11:00 AM ET: Advanced Topics
– Base SAS equivalent functionality
– ORE support for Hadoop and Map-Reduce
– Use of ORE in Exadata and BDA environments


4

Topics

• Base SAS equivalent functionality

• ORE support for Hadoop – Oracle R Hadoop Connector

• Use of ORE in Exadata and BDA environments


5

What is Oracle R Enterprise?

• R packages, database library, and SQL extensions that bring Oracle Database closer to Advanced Analytics users in an Enterprise

– Transparency Layer

• Set of packages mapping R data types to Oracle Database objects

• Transparent SQL generation for R expressions on mapped data types

• Enables direct interaction with database-resident data using R language constructs

• Enables R users to work with data larger than desktop system memory

– Statistics Engine

• Set of statistical functions and procedures for commonly-used statistical libraries

• Executes in Oracle Database

– SQL extensions

• Enables database server execution of R code to facilitate embedding R in operational systems

• Eliminates client data loading and result write-back to Oracle Database

• Enables dataflow parallelism, generation of rich XML output, SQL access to R, and parallel simulations capability

– Hadoop Connector

• R packages that enable working directly with an Oracle Hadoop cluster using R

• Data resides in HDFS, Oracle Database, or local files


6

Oracle R Enterprise Statistics Engine Example Features

• Special Functions

– Gamma function

– Natural logarithm of the Gamma function

– Digamma function

– Trigamma function

– Error function

– Complementary error function

• Tests

– Chi-square, McNemar, Bowker

– Simple and weighted kappas

– Cochran-Mantel-Haenszel correlation

– Cramer's V

– Binomial, KS, t, F, Wilcoxon

• Base SAS equivalents

– Freq, Summary, Sort

– Rank, Corr, Univariate

• Density, Probability, and Quantile Functions

– Standard normal distribution

– Chi-square distribution

– Exponential distribution

– F-distribution

– Density Function

– Probability Function

– Quantile

– Gamma distribution

– Beta distribution

– Cauchy distribution

– Student's t distribution

– Weibull distribution
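
The special functions and the density, probability, and quantile functions listed above have direct counterparts in base R. The sketch below uses only base R on in-memory values, not the ORE statistics engine, whose own function names are not shown on this slide. The helpers erf and erfc are hypothetical names defined here from pnorm, since base R does not ship them.

```r
# Base R counterparts of the listed special functions (these are not ORE calls).
x <- 5
gamma(x)      # Gamma function: equals (x - 1)! for positive integers
lgamma(x)     # natural logarithm of the Gamma function
digamma(x)    # first derivative of lgamma
trigamma(x)   # second derivative of lgamma

# Hypothetical helper names: error function and its complement via pnorm.
erf  <- function(z) 2 * pnorm(z * sqrt(2)) - 1
erfc <- function(z) 2 * pnorm(z * sqrt(2), lower.tail = FALSE)

# Density (d*), probability (p*), and quantile (q*) functions share one naming scheme.
dnorm(0)                 # standard normal density at 0
pchisq(3.84, df = 1)     # chi-square cumulative probability
qexp(0.5, rate = 4)      # median of an exponential distribution
qt(0.975, df = 10)       # Student's t quantile
```

The same d/p/q prefixes apply to the other listed distributions (f, gamma, beta, cauchy, weibull).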


7

Statistical Tests – Examples

ore.create(iris, table="IRIS_TABLE")
IRIS_TABLE$PETALBINS = ifelse(IRIS_TABLE$Petal.Length < 2, 1, 2)

# Binomial test
binom.test(IRIS_TABLE$PETALBINS)

# Chi-square test
chisq.test(IRIS_TABLE$PETALBINS)

# One-sample K-S test against a given distribution
ks.test(IRIS_TABLE$Petal.Length, "pexp", rate=4)

# Two-sample K-S test
ks.test(IRIS_TABLE$Petal.Length, IRIS_TABLE$Sepal.Length)

# t-test with a choice of alternative hypotheses
t.test(IRIS_TABLE$Petal.Length, alternative="two.sided", mu=0, conf.level=0.9)

# F test to compare variances
var.test(IRIS_TABLE$Petal.Length, IRIS_TABLE$Sepal.Length, ratio=0.75,
         alternative="two.sided", conf.level=0.9)


8

Statistical Tests – Results


9

Base SAS equivalent functionality


10

SAS PROCs and SAS Code

PROC SUMMARY DATA=EX1; CLASS CITYSIZE;

VAR QUANTITY AMOUNT;

OUTPUT OUT=A MAX=MAXQUAN MAXAMT

MAXID(QUANTITY(REGION) AMOUNT(PRODUCT))=REGNMXQ PRODMXA

MINID(QUANTITY(REGION)) = REGNMNQ;

RUN;

PROC PRINT; RUN;

proc sort data=account out=sorted;

by town descending debt accountnumber;

run;

proc print data=sorted;

var company town debt accountnumber;

title 'Customers with Past-Due Accounts';

title2 'Listed by Town, Amount, Account Number';

run;

ods graphics on;

proc freq data=Color order=data;

tables Hair / nocum chisq testp=(30 12 30 25 3)

plots(only)=deviationplot(type=dot);

weight Count;

by Region;

title 'Hair Color of European Children';

run;

ods graphics off;


11

SAS Characterization

• Client-side tool that requires data extracts from the database

• ETL limits data available to data analysts / statisticians

• Base SAS language similar in concept to R

– Includes math-like constructs, math functions and graphics

– Libraries with Base SAS similar to R packages that contain sophisticated

math/stat functions

• SAS applications mostly use Base SAS

– Compete with Oracle applications, e.g.,

• Financial Services: banking credit risk, anti-money laundering, etc.

• Health Sciences: clinical trials, drug safety

• Retail Planning: assortment, forecasting


12

Base SAS

A programming language that contains statements, expressions, and functions.

Foundation for most SAS products

DATA STEP is the most popular SAS statement that allows manipulation of SAS data sets (select, project, join, filter, derived columns)

Data analysis procedures (a.k.a SAS PROCs)

A PROC is a bundling of several statistics functions + data analysis options (grouping, sorting, filtering, ranking) that support a specific statistical analysis technique on a data set

e.g. PROC CORR bundles statistics functions and data analysis options required for correlation analysis

Standard and customized Report generation capabilities.

2D and 3D Graphics integrated with SAS PROCs via ODS

A data management facility which organizes data into persistent tables called SAS data sets

A SAS data set is a file with concurrent read/write user access

A SAS data set can also be mapped to a relational table/view

In this case a SAS user is expected to craft SQL statements to send over SAS ODBC connection

When SAS data sets get large, they are managed by a SAS database product called Scalable Performance Data Server


13

SAS Customers using Oracle R Enterprise – Why?

• ORE statistics engine functions facilitate SAS user transition to ORE

– SAS PROCs widely used

– ORE provides similar syntax to enable reuse of SAS users' skills

• ORE key benefits over SAS

– ORE leverages Oracle Database as compute engine

– SAS requires extracts to middle tier file system

• Problems of latency, data security, extra infrastructure, complexity

– ORE makes operationalizing results easy

• Deploying SAS models is nontrivial

• Requires software engineers to work with SAS analysts to rewrite the SAS model in SQL or C

• ORE eliminates that hand-off pain

– SAS software is rented annually

• Annual renegotiation with SAS

• Software simply stops working


14

Oracle R Enterprise “PROCs”

• SUMMARY / MEANS

• RANK

• SORT

• CROSSTAB

• FREQ

• CORR

• UNIVARIATE


15

ore.summary

• Provides descriptive statistics and extensive analysis of columns in

an ore.frame with flexible row aggregations

• Statistics – mean, min, max, mode, #missing values, sum, weighted sum

– corrected and uncorrected sum of squares, range of values,

stddev, stderr, variance

– t-test for testing hypothesis that the population mean is zero

– kurtosis, skew, Coefficient of Variation

– quantiles: p1,p5,p10,p25,p50,p75,p90,p95,p99,qrange

– 1 and 2 sided Confidence Limits for the mean: clm, rclm, lclm

– extreme value tagging

• Simple syntax abstracting complex SQL queries


16

ore.summary – Parameters

• data: ore.frame of data on which to compute descriptive statistics

• class: column(s) of data to aggregate (i.e., SQL group by)

• var: numeric column(s) of data on which to apply statistics functions (SQL select list)

• stats: list of statistics functions available to be applied on var columns: mean, min, max, cnt, n, nmiss, css, uss, cv, sum, sumwgt, range, stddev, stderr, var, t, probt, kurt, skew, p1, p5, p10, p25, p50, p75, p90, p95, p99, qrange, lclm, rclm, clm, mode

• weight: A column of data whose numeric values provide a multiplicative factor for var columns

• maxid, minid: for each group optionally list max/min value from other columns in data

• ways: restrict output to only certain grouping levels of the class variables

• group.by: column(s) of data to stratify summary results across

Returns an ore.frame as output in all cases except when group.by clause is used.

When group.by clause is used returns a list of ore.frames, one per stratum
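
To make the roles of class, var, stats, and group.by concrete, here is a minimal base R sketch of the same kind of grouped summary. It is an analogy only: it runs on the in-memory mtcars data frame rather than an ore.frame, it does not call ore.summary, and the helper name stats_fn is hypothetical.

```r
# Base R stand-in for a grouped summary; mtcars replaces the ore.frame.
# class -> grouping column (cyl), var -> analysed column (mpg),
# stats -> the functions applied within each group.
stats_fn <- function(v) c(mean = mean(v), min = min(v), max = max(v), n = length(v))

res <- aggregate(mpg ~ cyl, data = mtcars, FUN = stats_fn)
res <- do.call(data.frame, res)   # flatten the matrix column that aggregate() returns
res

# group.by -> one summary per stratum, returned as a list (here stratified by am)
by_am <- lapply(split(mtcars, mtcars$am), function(d) {
  do.call(data.frame, aggregate(mpg ~ cyl, data = d, FUN = stats_fn))
})
names(by_am)
```

As with group.by, the stratified call returns a list with one data frame per stratum instead of a single data frame.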


17

ore.summary – Examples

# Default statistics – mean, min, max for columns AGE, CLASS, ROLLUP(GENDER)

ore.summary(NARROW, class='GENDER', var ='AGE,CLASS')

# Report just skew(AGE) as column A and probt(CLASS) as column B

ore.summary(NARROW, class='GENDER', var='AGE,CLASS', stats='skew(AGE)=A, probt(CLASS)=B')

# Weighted sum (var * weight)

ore.summary(NARROW, class='GENDER', var='AGE', stats='sum=X', weight='YRS_RESIDENCE')

# Separate group-bys: Group by GENDER, Group by Marital Status

ore.summary(NARROW, class='GENDER, MARITAL_STATUS', var='CLASS', ways=1)

# All possible group-bys: CUBE Gender, Marital Status

ore.summary(NARROW, class='GENDER, MARITAL_STATUS', var='CLASS', ways='nway')


18

ore.summary – Results


21

ore.summary – Examples

# Sort results by ascending frequency count

ore.summary(NARROW, class='GENDER', var ='AGE,CLASS', order='freq')

# Sort results by descending aggregation level

ore.summary(NARROW, class='GENDER', var ='AGE,CLASS', order='-level')

# Stratification by country: For each country report statistics on AGE, ROLLUP by GENDER, CLASS

ore.summary(NARROW, class='GENDER, CLASS', var ='AGE', group.by='COUNTRY')

# In addition to aggregated statistics on AGE, for each grouping of class variable, report country

# with the maximum and minimum value for YRS_RESIDENCE column and label the result as maxcol and

# mincol respectively

x = ore.summary(NARROW, class='GENDER', var='AGE',

maxid='COUNTRY/YRS_RESIDENCE=maxcol',

minid='COUNTRY/YRS_RESIDENCE=mincol')


25

ore.summary - Examples

• Compute the mean, standard deviation, and count of arrival delay for each destination airport

res <- ore.summary(data=ONTIME_S,
                   var='ARRDELAY',
                   class='DEST',
                   stats='mean=M,stddev=S,n=COUNT')
head(res)

• Compute the maximum arrival and departure delay for each airline and the corresponding destination airport

ore.summary(data=ONTIME_S,
            class='UNIQUECARRIER',
            var='ARRDELAY,DEPDELAY',
            stats='MAX(ARRDELAY)=MAXARRDELAY,MAX(DEPDELAY)=MAXDEPDELAY',
            maxid='DEST/ARRDELAY=DESTMAXARRDELAY,DEST/DEPDELAY=DESTMAXDEPDELAY')


27

ore.rank

• Enables investigation of distribution of values in numeric

columns of an ore.frame

• Highlights

– Ranking within groups

– Partition rows into groups based on rank tiles

– Cumulative percentages and percentiles

– Treatment of ties

– Calculation of normal scores from ranks

• Simple syntax abstracting complex SQL queries


28

ore.rank – Parameters

• data : ore.frame of the data to compute rankings on

• var : numeric columns in data to rank

• desc : ranks in descending (asc is default) order if TRUE

• groups : partition rows into #groups based on ranks. For percentiles, #groups=100, For deciles

#groups=10, For quartiles #groups=4.

• group.by : rank each group identified by group.by columns separately

• ties : specification of tie treatment. Assign largest of/smallest of/mean of corresponding ranks of

tied values

• fraction : rank of a column value ÷ # non missing column values

• nplus1 : rank of a column value ÷ # non missing column values + 1

– Fraction and nplus1 options can be used to estimate cumulative distribution function

• percent : (rank of a column value ÷ # non missing column values) * 100

• Scoring Methods

– To compute Exponential scores from ranks use savage

– To compute normal scores – Use one of blom, tukey or vw (Van Der Waerden)

Returns an ore.frame as output in all cases
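
The ranking options above can be illustrated on a small in-memory vector with base R. This is a sketch, not ORE code. The grouping rule floor(rank * k / (n + 1)) and the Blom and Van der Waerden formulas are the ones documented for SAS PROC RANK; that ore.rank uses exactly the same formulas is an assumption.

```r
# Base R sketch of the ranking options on a small vector (these are not ORE calls).
v <- c(10, 20, 20, 30, NA)
n <- sum(!is.na(v))                                        # non-missing values

r_mean <- rank(v, na.last = "keep")                        # ties: mean of tied ranks
r_low  <- rank(v, ties.method = "min", na.last = "keep")   # ties: smallest rank
r_high <- rank(v, ties.method = "max", na.last = "keep")   # ties: largest rank

fraction <- r_mean / n          # rank / number of non-missing values
nplus1   <- r_mean / (n + 1)    # rank / (number of non-missing values + 1)
percent  <- 100 * r_mean / n

# groups = 4 (quartiles), assuming the floor(rank * k / (n + 1)) rule
quartile <- floor(r_mean * 4 / (n + 1))

# Normal scores from ranks
blom <- qnorm((r_mean - 3 / 8) / (n + 1 / 4))
vw   <- qnorm(r_mean / (n + 1))                            # Van der Waerden
```

Missing values keep a missing rank here, and the divisors count only non-missing values, matching the definitions of fraction, nplus1, and percent above.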


29

ore.rank – Examples

# Rank 2 columns and report them as derived columns
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass')
class(x)
y <- ore.sort(data=x, by='RankOfAge')

# Handling of ties
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', ties='low')
head(x,10)

# Rank within groups
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', group.by='COUNTRY')
head(x,10)


30

ore.rank – Examples

# Partition rows into groups, e.g., deciles
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', groups=10)

# Partition rows into groups, e.g., quartiles
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', groups=4)

# Estimating cumulative distribution function
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', nplus1=TRUE)

# Scores calculation
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass',
              score='savage', groups=100, group.by='COUNTRY')
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', score='blom')


31

ore.sort

• Enables flexible sorting of columns in a data frame

• Can be used with other data pre-processing functions

– Sorting happens in the database

– (Top k) results of sorting can be provided as input to R

visualization

• Supports database nls.sort option


32

ore.sort - Parameters

• data : ore.frame of the data to be sorted

• by : columns in data to sort

• stable : Allows relative order to be maintained within sorted groups

(TRUE/FALSE)

• reverse: Allows optional reversal of collation order for character variables

(TRUE/FALSE)

• unique.keys : Allows optional removal of rows with duplicate values in the

column(s) being sorted from appearing in the result (TRUE/FALSE)

• unique.data: Allows optional removal of duplicate rows from appearing in the

result (TRUE/FALSE)

Returns an ore.frame as output in all cases
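
For readers more familiar with base R, the sort options above correspond roughly to order(), duplicated(), and unique() on a data frame. The sketch below is an in-memory analogy on made-up rows, not ORE code, and which row survives de-duplication in the database is not something this sketch can confirm.

```r
# Base R sketch of the sort options on a small data.frame (these are not ORE calls).
d <- data.frame(AGE    = c(30, 25, 30, 25, 30),
                GENDER = c("M", "F", "F", "F", "M"))

# by = '-AGE,GENDER': AGE descending, then GENDER ascending
s1 <- d[order(-d$AGE, d$GENDER), ]

# unique.keys: keep one row per distinct value of the sort key
s2 <- d[order(d$AGE), ]
s2 <- s2[!duplicated(s2$AGE), ]

# unique.data: drop rows that are duplicated across all columns
s3 <- unique(d[order(d$AGE), ])
```

order() keeps tied rows in their original relative order, which is the behaviour the stable option describes.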


33

ore.sort – Examples

# Sort all specified columns in desc order
x <- ore.sort(data=NARROW, by='AGE,GENDER', reverse=TRUE)

# Sort AGE in desc order but GENDER in ascending order
x <- ore.sort(data=NARROW, by='-AGE,GENDER')

# Keep just 1 row per unique value of AGE
x <- ore.sort(data=NARROW, by='AGE', unique.keys=TRUE)

# Remove duplicate rows
x <- ore.sort(data=NARROW, by='AGE', unique.data=TRUE)

# Remove duplicate rows as well as rows with duplicate values for AGE
x <- ore.sort(data=NARROW, by='AGE', unique.data=TRUE, unique.keys=TRUE)

# Maintain relative order within sorted output
x <- ore.sort(data=NARROW, by='AGE', stable=TRUE)


34

ore.sort – Examples

• Sort the ONTIME_S data by airline descending and departure delay

ascending

• Sort ONTIME_S by airline and departure delay, but select one of each

combination, e.g., unique key

sortedOntime <- ore.sort(data=ONTIME_S, by='-UNIQUECARRIER,DEPDELAY')

head(sortedOntime[,c(11,18)], 20)

sortedOntime2 <- ore.sort(ONTIME_S, by='UNIQUECARRIER,DEPDELAY', unique.keys=TRUE)

head(sortedOntime2[,c(11,18)], 20)


35

ore.crosstab - Basics

Input data set

One way tables

2 way table

A general KxL table


36

ore.crosstab

• Enables cross column frequency analysis of an ore.frame

• A sophisticated variant of R's table() function for two variables

• Builds tables of frequency counts across columns of a data frame

• Required as a precursor to frequency analysis using ore.freq

• R translated to 100% SQL


37

ore.crosstab – Parameters

• expr: cross tabulation definition in the form

[COLUMN_SPEC] ~ COLUMN_SPEC [ * <WEIGHTING COLUMN>] [ / <GROUPING COLUMN>] [ ^ <STRATIFICATION COLUMN>] [ | ORDER_SPEC]

where

COLUMN_SPEC is <column-name>[+COLUMN_SET][+COLUMN_RANGE]
COLUMN_SET is <column_name>[+COLUMN_SET]
COLUMN_RANGE is <FROM COLUMN>-<TO COLUMN>
ORDER_SPEC is one of [-]NAME, [-]DATA, [-]FREQ, or INTERNAL

• data: ore.frame of data to cross tabulate

• group.by: as many cross tabulations as unique values in grouping columns

• order: optional sorting of table data

[-] NAME: Sort by tabulation column names, [-]FREQ: Sort by frequency counts in the table

• weights : Column of data that indicates frequency of occurrence of the corresponding

row

• strata : Column name used to cluster, or group, the data in combination

Returns an ore.frame as output in all cases, except when multiple tables are created – in this case an ore.list is returned
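
The one-way, two-way, weighted, and stratified forms of the expr formula have rough base R analogues in table(), xtabs(), and split(). The sketch below uses a small made-up data frame (all column names are illustrative) and runs in memory; it does not call ore.crosstab.

```r
# Base R sketch of one-way, two-way, weighted, and stratified tables (not ORE calls).
d <- data.frame(GENDER = c("F", "M", "F", "M", "F"),
                STATUS = c("single", "single", "married", "married", "single"),
                REGION = c("N", "N", "S", "S", "S"),
                W      = c(2, 1, 1, 3, 1))

one_way  <- table(d$GENDER)                        # like ~GENDER
two_way  <- table(d$STATUS, d$GENDER)              # like STATUS ~ GENDER
weighted <- xtabs(W ~ STATUS + GENDER, data = d)   # each row counts W times
strata   <- lapply(split(d, d$REGION),             # one table per REGION value
                   function(s) table(s$STATUS, s$GENDER))
```

As with the stratified ore.crosstab forms, the last call yields a list of tables rather than a single table.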


38

ore.crosstab – Examples

# For comparison, look at R's table function on 2 columns
table(NARROW$MARITAL_STATUS, NARROW$GENDER)

# Corresponding ore.crosstab(), extensible to more than 2 columns
ore.crosstab(MARITAL_STATUS ~ GENDER, data=NARROW)

# MARITAL_STATUS x GENDER and MARITAL_STATUS x CLASS
x = ore.crosstab(MARITAL_STATUS ~ GENDER+CLASS, data=NARROW)

# One way table
ore.crosstab(~AGE, data=NARROW)

# Weight values in AGE and GENDER using values in the WEIGHT column
x = ore.crosstab(AGE~GENDER*WEIGHT, data=NARROW)

# Order rows of cross tab by frequency counts
x = ore.crosstab(AGE~GENDER|FREQ, data=NARROW)


39

ore.crosstab – Results


40

# 4, 2 way cross tabs (GENDER,AGE,MARITAL_STATUS,COUNTRY)~CLASS

x=ore.crosstab(GENDER-COUNTRY~CLASS, data=NARROW)

length(x)

# 1 way table with as many rows as unique values of COUNTRY for each unique value of AGE

x = ore.crosstab(~AGE/COUNTRY, data=NARROW)

# Same as above, but 2 way

x = ore.crosstab(AGE~GENDER/COUNTRY, data=NARROW)

# Post-process output – 2 2-way tables AGExGENDER and AGExCLASS

x=ore.crosstab(AGE ~ GENDER+CLASS, data=NARROW)

class(x)

class(x[[1]])

names(x[[1]])

z <- x[[1]][,c(1,2,3)]



43

# Stratification – As many 2 way tables as unique values of CLASS

x <- ore.crosstab(AGE~GENDER^CLASS, data=NARROW)

# Custom binning and subsequent cross tabulation

NARROW$AGEBINS=ifelse(NARROW$AGE<20, 1,

ifelse(NARROW$AGE<30,2,

ifelse(NARROW$AGE<40,3,4)))

ore.crosstab(GENDER~AGEBINS, NARROW)



45

ore.extend

• Augments cross tabulation produced using ore.crosstab()

with 3 basic statistics

– ROW and COLUMN SUMs

crosstab = ore.extend.sum(crosstab)

– CUMULATIVE SUMs for each cell of the table

crosstab = ore.extend.cumsum(crosstab)

– TOTAL for the entire table

crosstab = ore.extend.total(crosstab)
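
A base R analogy for these three augmentations, on an in-memory table built from mtcars, might look like the sketch below. It does not call the ore.extend functions, and accumulating along each row is an assumption, since the slide does not say in which direction the cumulative sums run.

```r
# Base R sketch of row/column sums, cumulative sums, and the grand total (not ORE calls).
tab <- table(mtcars$cyl, mtcars$gear)

with_sums <- addmargins(tab)            # appends a "Sum" row and a "Sum" column
cum_cells <- t(apply(tab, 1, cumsum))   # cumulative sums along each row (assumed direction)
total     <- sum(tab)                   # total for the entire table
```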


46

ore.crosstab – Augmenting cross-tabulations

• Adding row and column sums

x <- ore.crosstab(GENDER~AGEBINS, NARROW)
y <- ore.extend.sum(x)

• Adding cumulative sums for each cell of crosstab

x <- ore.crosstab(GENDER~AGEBINS, NARROW)
y <- ore.extend.cumsum(x)

• Adding total for the entire table

y <- ore.extend.total(x)


49

ore.freq

• Operates on output of ore.crosstab() and automatically determines the techniques

relevant to the nature of the result

• 1-way cross tables

– Goodness of fit tests for equal proportions or specified null proportions, confidence

limits, and tests for equivalence

• 2-way cross tables

– Various statistics that describe relationships between columns in the cross tabulation

– Chi-square tests, Cochran-Mantel-Haenszel statistics, measures of association, strength of association, risk differences, odds ratio and relative risk for 2x2 tables, tests for trend

• N-way cross tables

– N 2-way cross tables

– Statistics across and within strata

• Leverages database SQL functions when available


50

ore.freq - Parameters

• crosstab: ore.frame output from ore.crosstab()

• stats: List of statistics required

– Chi-square: AJCHI, LRCHI, MHCHI, PCHISQ

– Kappa: KAPPA, WTKAP

– Lambda: LAMCR, LAMRC, LAMDAS

– Correlation: KENTB, PCORR, SCORR

– Stuart's tau-c, Somers' D: STUTC, SMDCR, SMDRC

– Fisher's exact test, Cochran's Q: FISHER, COCHQ

– Odds Ratio: OR, MHOR, LGOR

– Relative Risk: RR, MHRR, ALRR

– Others: MCNEM, PHI, CRAMV, CONTGY, TSYM, TREND, GAMMA

• params: Control parameters to the specific statistical function

– SCORE: TABLE|RANK|RIDIT|MODRIDIT

– ALPHA: <number>

– WEIGHTS: <number>

• skip.missing: (TRUE/FALSE) skip cells with missing values in the cross table

• skip.failed: (TRUE/FALSE) if a required statistical test fails on the cross table because it is inapplicable to the table, return immediately

Returns an ORE data frame as output in all cases
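
Two of the listed statistics, the Pearson chi-square (PCHISQ) and Cramer's V (CRAMV), can be computed in base R as a sanity check on small data. The sketch below works on an in-memory table from mtcars and does not call ore.freq; Cramer's V is computed from its standard formula, and Fisher's exact test illustrates the 2x2 case.

```r
# Base R sketch: Pearson chi-square and Cramer's V on a two-way table (not ORE calls).
tab <- table(mtcars$cyl, mtcars$am)

pc <- suppressWarnings(chisq.test(tab, correct = FALSE))  # small expected counts warn
pc$statistic    # Pearson chi-square
pc$p.value

# Cramer's V = sqrt(chi-square / (n * (min(rows, cols) - 1)))
n <- sum(tab)
cramer_v <- sqrt(unname(pc$statistic) / (n * (min(dim(tab)) - 1)))

# 2x2 case: Fisher's exact test also reports an odds ratio estimate
fisher.test(table(mtcars$vs, mtcars$am))$estimate
```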


51

ore.freq – Examples

• Compute cross tabulation for number of diverted flights for each airline.

Compute the Pearson CHISQ for the results.

• For each airline, compute cross tabulation for number of diverted flights

and day of week. Compute the Pearson CHISQ for each result.

ct <- ore.crosstab(UNIQUECARRIER~DIVERTED, data=ONTIME_S)

ct

freq <- ore.freq(ct)

freq

ct <- ore.crosstab(UNIQUECARRIER~DIVERTED+DAYOFWEEK,data=ONTIME_S)

ct

freq <- ore.freq(ct)

freq


52

ore.corr

• Correlation analysis across numeric columns in an ore.frame

• Supports partial correlations with a control column

• Enables aggregations prior to correlations

• Allows post-processing of results and integration into R code flow

• Output can be made to conform to output of R's cor() function and so

can be post-processed by any CRAN function or graphics


53

ore.corr – Parameters

• data: ore.frame of the data to compute correlation coefficients

• var: numeric column(s) of the data for which to build correlation matrix

• group.by: as many correlation matrices as unique values in group.by columns

• weight: A column of the data whose numeric values provide a multiplicative factor for var

columns

• partial: columns of data to use as control variables for partial correlation

• stats: pearson (default), spearman, kendall

Use OREeda:::.ore.corr.as.matrix() to convert into R's cor() compatible output format

Returns an ore.frame as output in all cases except when group.by is used in which case an ore.list object is returned
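
The stats, group.by, and partial options have simple base R analogues, sketched below on mtcars. This is not ORE code. The helper name partial_cor is hypothetical, and it uses the residual method for a single control variable, which may differ in detail from how ore.corr computes partial correlations.

```r
# Base R sketch of the correlation options, using mtcars (these are not ORE calls).
vars <- mtcars[, c("mpg", "hp", "wt")]

cor(vars)                         # stats = pearson (default)
cor(vars, method = "spearman")
cor(vars, method = "kendall")

# group.by: one correlation matrix per group, returned as a list
by_cyl <- lapply(split(vars, mtcars$cyl), cor)

# partial: correlate the residuals left after regressing each variable on the control
partial_cor <- function(x, y, z) cor(resid(lm(x ~ z)), resid(lm(y ~ z)))
p <- partial_cor(mtcars$mpg, mtcars$hp, mtcars$wt)
```

The result p is the correlation of mpg and hp with the linear effect of wt removed from both.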


54

ore.corr – Examples

# R's cor: project out non-numeric columns first
names(NARROW)
N <- ore.pull(NARROW[,c(3,8,9)])
cor(N, use="complete.obs")
cor(N, method='spearman', use="complete.obs")
cor(N, method='kendall', use="complete.obs")

# Corresponding ore.corr
x1 <- ore.corr(NARROW, var='AGE,YRS_RESIDENCE,CLASS')
x2 <- ore.corr(NARROW, var='AGE,YRS_RESIDENCE,CLASS', stats='spearman')
x3 <- ore.corr(NARROW, var='AGE,YRS_RESIDENCE,CLASS', stats='kendall')
cor_compatible_matrix = OREeda:::.ore.corr.as.matrix(x3)
class(cor_compatible_matrix)


55

ore.corr - Results


56

ore.corr – Examples

# Partial correlation
ore.corr(NARROW, var='AGE,YRS_RESIDENCE,CLASS', stats='spearman', partial='GENDER')

# Creating a number of correlation matrices
x <- ore.corr(NARROW, var='AGE,YRS_RESIDENCE,CLASS',
              stats='pearson', partial='GENDER', group.by='COUNTRY')
class(x)
cor_compatible_matrix <- OREeda:::.ore.corr.as.matrix(x[[1]])


57

ore.corr Post-processing matrix using CRAN visualization

circle.corr( cor(mtcars), order = TRUE, bg = "gray50",

col = colorRampPalette(c("blue","white","red"))(100))

circle.corr(cor_compatible_matrix,

order = TRUE, bg = "gray50",

col = colorRampPalette(c("blue","white","red"))(100))

• http://addictedtor.free.fr/graphiques/RGraphGallery.php?graph=152


58

ore.univariate

• Enables distribution analysis of numeric variables in an ore.frame

• Statistics

– All statistics reported by ore.summary()

– Signed rank test, Student's t-test

– Extreme values reporting

• Graphics

– QQ plots

– Scatterplots


59

ore.univariate – Parameters

• data: ore.frame of the data whose columns are to be analyzed

• var: numeric column(s) of the data for which to compute statistics

• weight: A column of the data whose numeric values provide a multiplicative factor for var

columns

• stats: optional specification of a subset of statistics to be printed

moments – n,sumwgt,mean,sum,stddev,var,skew,kurt,uss,css,cv,stderr

measures – mean,stddev,median,var,mode,range,iqr

quantiles – p100,p99,p95,p90,p75,p50,p25,p10,p5,p1,p0

location – studentt,studentp,signt,signp,srankt,srankp

normality

loccount – loc<,loc>,loc!

extremes

Returns an ore.frame as output in all cases except when group.by is used in which case an R list object is returned
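
Several of the statistic groups above (quantiles, measures, location, normality) map onto familiar base R functions. The sketch below computes them for one in-memory column of mtcars; it does not call ore.univariate, and R's default quantile definition may differ from the one the database applies.

```r
# Base R sketch of the statistic groups for one numeric column (these are not ORE calls).
x <- mtcars$mpg

# quantiles: p0, p1, p5, ..., p99, p100
probs <- c(0, 0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99, 1)
q <- quantile(x, probs = probs)

# measures
measures <- c(mean = mean(x), stddev = sd(x), median = median(x),
              var = var(x), range = diff(range(x)), iqr = IQR(x))

# location tests against mu = 0: Student's t and Wilcoxon signed rank
tt <- t.test(x, mu = 0)
sr <- suppressWarnings(wilcox.test(x, mu = 0))   # tied values trigger a warning

# normality
sw <- shapiro.test(x)
```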


60

ore.univariate - Examples

# Default univariate statistics

ore.univariate(NARROW, var="AGE,YRS_RESIDENCE,CLASS")

# Compute location statistics on YRS_RESIDENCE

ore.univariate(NARROW, var="YRS_RESIDENCE",stats="location")

# Compute quantiles statistics on AGE and YRS_RESIDENCE

ore.univariate(NARROW, var="AGE,YRS_RESIDENCE",stats="quantiles")


61

ore.univariate – Results


62

Normal QQ Plot – Univariate Graphics

Arrival and Departure Delay


63

Normal QQ Plot – Univariate Graphics Arrival and Departure Delay

ontime <- ONTIME_S

ontimeSubset <- ontime[ontime$UNIQUECARRIER == "AA", , drop = TRUE]

# Select the delay columns from the filtered rows, keeping the "AA" filter
ontimeSubset <- ontimeSubset[, c("ARRDELAY", "DEPDELAY")]

ontimeSubset <- ore.pull(ontimeSubset)

ontimeSubset <-ontimeSubset[sample(nrow(ontimeSubset),nrow(ontimeSubset)*.03),]

dat = ontimeSubset$ARRDELAY

type = "Arrival"

#dat = ontimeSubset$DEPDELAY

#type = "Departure"

qqnorm(dat, xlab="Normal Quantiles",

ylab=paste(type, " Delay (min)",sep="" ));

qqline(dat, col = 2)

mean <- mean(dat, na.rm=TRUE)

sd <- sd(dat, na.rm=TRUE)

legend("top", paste("normal line mean= ",round(mean,2),

" sd= ",round(sd,2),sep=""), lty=1, col="red")


64

Scatterplot Matrix Airline, Arrival Delay, Departure Delay, Distance


65

Scatterplot Matrix Airline, Arrival Delay, Departure Delay, Distance

ontime <- ONTIME_S

ontimeSubset <- ontime[ontime$UNIQUECARRIER %in%

c("AA", "AS", "CO", "DL","HP","WN"), ,

drop = TRUE]

ontimeSubset <- ore.pull(ontimeSubset)

ontimeSubset <-ontimeSubset[sample(nrow(ontimeSubset),nrow(ontimeSubset)*.03),]

pairs(~UNIQUECARRIER+ARRDELAY+DEPDELAY+DISTANCE,data=ontimeSubset,

main="Simple Scatterplot Matrix")


attach(ontimeSubset)

os2 <- data.frame(UNIQUECARRIER, ARRDELAY, DEPDELAY, DISTANCE)

detach(ontimeSubset)

panel.hist <- function(x, ...) {

usr <- par("usr"); on.exit(par(usr))

par(usr = c(usr[1:2], 0, 1.5) )

h <- hist(x, breaks=20, plot = FALSE)

breaks <- h$breaks; nB <- length(breaks)

y <- h$counts; y <- y/max(y)

rect(breaks[-nB], 0, breaks[-1], y, col="cyan", ...)

}

pairs(os2, diag.panel=panel.hist, main="Scatterplot Matrix with Histograms")

66

ORE support for Hadoop


67

Oracle R Hadoop Connector

Transparently leverage Hadoop for High Performance Analytics to Big Data Appliance

Function push-down – data transformation & statistics

R workspace console

Oracle statistics engine

BI Publisher, OBIEE, Web Services


1. Interactive access to Hadoop infrastructure from R user's desktop as part of exploratory analysis

2. Lights out execution of production R calculations on Hadoop infrastructure

http://www.r-project.org/index.html

68

Hadoop Publicized Examples

• A9.com (Amazon)

– Product search indices

• Adobe

– Social services & structured data store

• EBay

– Search optimization & research

• Facebook

– User growth, page views, ad campaign analysis

• Journey Dynamics

– Forecast traffic speeds from GPS data

• Lineberger Comprehensive Cancer

Center

– Analyzes Next Generation Sequence Data for

Cancer Genome Atlas

• NAVTEQ Media Solutions

– Optimizes ad selection based on user

interactions

• Twitter

– Stores and processes Tweets

• Yahoo!

– Research into ad systems & web search


69

What is Hadoop?

• Originally sponsored by Yahoo!; now an Apache project, with a commercial distribution from Cloudera

• Based on Google's GFS and MapReduce papers

• Cloudera website states “Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File

System (HDFS) and high-performance parallel data processing using a technique called MapReduce.”

• Enables analysis of Big Data

– Can store huge volumes of unstructured data, e.g., weblogs, transaction data, social media data

– Enables massive data aggregation

– Highly scalable and robust grid infrastructure


70

How is data stored and accessed?

• Data stored as flat files which are automatically distributed and replicated

across all Hadoop nodes

• Agreed upon delimiters “structure” the “unstructured” data

– <line feed> indicates end of row, generally

– Each row has (optional) key and value(s)

• Loading data to HDFS is equivalent to copying files on OS

– hadoop fs -put <local file name> <HDFS file name>

• Map-Reduce programs provide access to data on Hadoop

– Map data using keys for independent processing

– Reduce results of “mapper” functions to single result by “reducer” functions
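
The map and reduce steps described above can be imitated in plain R on a handful of delimited rows. This sketch involves no Hadoop and none of the connector's functions; the rows and keys are made up for illustration, with a tab as the agreed delimiter between key and value.

```r
# Base R sketch of the map and reduce steps on delimited text rows (no Hadoop involved).
lines <- c("SFO\t12", "LAX\t5", "SFO\t-3", "LAX\t20", "SFO\t9")   # key <tab> value

# Map: parse each row independently into a (key, value) pair
pairs <- lapply(strsplit(lines, "\t", fixed = TRUE),
                function(f) list(key = f[1], val = as.numeric(f[2])))

# Shuffle: collect the values that share a key
keys    <- vapply(pairs, function(p) p$key, character(1))
vals    <- vapply(pairs, function(p) p$val, numeric(1))
grouped <- split(vals, keys)

# Reduce: collapse each key's values to a single result (here, the mean)
result <- vapply(grouped, mean, numeric(1))
result
```

Because each row is mapped independently and each key is reduced independently, both steps can be spread across nodes, which is what the Hadoop engine does at scale.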


76

Oracle R Connector for Hadoop

• Provide transparent access to Hadoop and HDFS-resident data

– Hadoop: a high performance distributed computational system

– HDFS: a distributed high-availability file storage

• R users not forced to learn new language or interface to work with Hadoop / HDFS

• Ability to leverage open source contributed R packages to work on HDFS-resident data

• Facilitate enterprise collaboration to share mappers and reducers to database R repository

• Transition work from lab to production deployment on a Hadoop cluster without requiring

knowledge of Hadoop internals, Hadoop CLI, or IT infrastructure

• Computation cluster administrators do not need to learn R to schedule R map-reduce

models in production

• Facilitate off-loading computationally intensive algorithms to commodity hardware


77

Oracle R Connector for Hadoop Concepts

• HDFS access API – Exposes HDFS CLI API as R functions where HDFS datasets are represented as a special R object

– Enables all operations that can be performed with “hadoop fs <cmd>”

• HDFS transparency layer – Allows all R functions that operate on data.frames to execute on HDFS resident data transparently

• Hadoop execution engine – Allows scheduling of map-reduce jobs from within R environment to Hadoop engine

• Hadoop job driver – Controls data retrieval/translation, execution of map/reduce user functions, and data storage on the server side (e.g. Hadoop cluster)

• Database data loader – Allows export of data from Oracle Database to HDFS and vice versa

• Data translation layer – Communicates with HDFS and optionally Hive to store metadata for all data exported to and imported from HDFS

• HDFS data sampler – Samples the data which was originally resident in HDFS and has no metadata attached and creates the missing metadata object


80

Oracle R Hadoop Cluster (ORHC) Architecture

[Architecture diagram. Components shown, in the order their labels appear:]

• Client Host (e.g., laptop): R engine, orhc, Hadoop Cluster Software, Java VM

• Server Machine (e.g., Big Data Appliance): R engine, orhc-drv package, Java VM

• DBMS Machine (e.g., Exadata): R engine, ORE libraries, Oracle Database, ORE packages

• Hadoop Cluster

– MapReduce nodes: JobTracker, Task nodes

– HDFS nodes: Name node, Data nodes

• Additional labels: ORE packages, ORE client packages, orhc


81

Prerequisites for use

Client Side

• R engine and R package ORHC are installed on client host machine

– Provides API to access Hadoop facilities within R

– Controls all interactions with Hadoop and HDFS cluster

• Hadoop client software (HCS) - Sqoop

– Must be deployed and configured on client host where ORHC is installed

for access to Hadoop cluster and CLI

– ORHC uses HCS to connect to HDFS cluster

– HCS handles authentication from client host to the Hadoop cluster

machine

– Client host does not participate in Hadoop computations or as HDFS file

system nodes

• Oracle R Enterprise packages

– DBI, ROracle , OREbase, OREeda, OREgraphics, OREstats, RToXmp,

etc.

– Required to access to Oracle Database

– If not present • ORHC client can operate only with in-memory R objects or local data files

• ORHC will not have access to advanced statistical algorithms

Server Side

• Big Data Appliance

• R engine as a shared library, with all standard R base libraries, deployed on each Hadoop node, plus the ORHC-DRV R package with ORHC driver libraries

• Cloudera’s Sqoop

• Java VM (Hotspot VM is preferred)

– Required by both Hadoop software and ORHC

– Hadoop is implemented in Java

– ORHC uses the Hadoop CLI to interact with the Hadoop cluster, so a Java VM must be installed on both the server and client sides

Hadoop Cluster Nodes

• Hadoop client software

• R engine

• ORHC-DRV R package

• Java VM



Oracle R Enterprise and Hadoop

• Goal: Average the stopping distance where distance is greater than 30 feet for each speed

• Map function returns key-value pairs where column “dist” is greater than 30

• Reduce function takes the average of all the returned values


cars.dfs <- hdfs.put(cars, key='speed')
res <- hadoop.run(
  cars.dfs,
  mapper = function(key, val) {
    if (val$dist > 30) {
      keyval(key, val)
    } else {
      NULL
    }
  },
  reducer = function(key, vals) {
    X <- 0
    for (x in vals) {
      X <- X + x$dist
    }
    X <- X / length(vals)
    keyval(key, X)
  }
)
res
hdfs.get(res)

R> res
[1] "orhc2d12adf7"
attr(,"orhc.dfsid")
[1] TRUE

R> hdfs.get(res)
   key     val1
1   10 34.00000
2   13 38.00000
3   14 58.66667
4   15 54.00000
5   16 36.00000
6   17 40.66667
7   18 64.50000
8   19 50.00000
9   20 50.40000
10  22 66.00000
11  23 54.00000
12  24 93.75000
13  25 85.00000
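
As a sanity check on the map-reduce result above, the same averages can be computed locally in base R with no Hadoop involved. This is only a comparison sketch: it uses the built-in cars data set and the standard subset() and aggregate() functions, not the ORHC API, and the variable names long_stops and check are illustrative.

```r
# Keep only rows with stopping distance over 30 feet (the mapper's filter),
# then average dist within each speed (the reducer's job).
long_stops <- subset(cars, dist > 30)
check <- aggregate(dist ~ speed, data = long_stops, FUN = mean)

# Rename columns to match what hdfs.get(res) shows on the slide.
names(check) <- c("key", "val1")
print(check)
```

The first row should agree with the slide's output (key 10, val1 34).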



• Goal: take the average arrival delay for all flights to SFO.


ontime <- ore.pull(ONTIME_S[ONTIME_S$YEAR==2007,])
ontime.dfs <- hdfs.put(ontime, key='DEST')
res <- hadoop.run(
  ontime.dfs,
  mapper = function(key, ontime) {
    # Emit only records whose destination key is SFO
    if (key == 'SFO') {
      keyval(key, ontime)
    } else {
      NULL
    }
  },
  reducer = function(key, vals) {
    # Average ARRDELAY, skipping missing values
    sumAD <- 0; count <- 0
    for (x in vals) {
      if (!is.na(x$ARRDELAY)) {
        sumAD <- sumAD + x$ARRDELAY
        count <- count + 1
      }
    }
    res <- sumAD / count
    keyval(key, res)
  }
)
hdfs.get(res)
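
To make the mapper/reducer contract in this example concrete, here is a minimal local simulation in plain R. It is a sketch under stated assumptions, not how ORHC works internally: local_mapreduce() and the local keyval() are hypothetical stand-ins written for this illustration, and the flights data frame contains made-up values in place of ONTIME_S.

```r
# Toy stand-in for ONTIME_S; the values are made up for illustration only.
flights <- data.frame(
  DEST     = c("SFO", "SFO", "LAX", "SFO", "BOS"),
  ARRDELAY = c(10, NA, 5, 20, -3),
  stringsAsFactors = FALSE
)

# Local stand-in for keyval(): bundles a key with its value.
keyval <- function(key, val) list(key = key, val = val)

# Hypothetical local driver (not part of ORHC): calls the mapper once per
# row, groups the emitted values by key, then calls the reducer once per key.
local_mapreduce <- function(df, key, mapper, reducer) {
  emitted <- list()
  for (i in seq_len(nrow(df))) {
    kv <- mapper(df[[key]][i], df[i, , drop = FALSE])
    if (!is.null(kv)) {
      k <- as.character(kv$key)
      emitted[[k]] <- c(emitted[[k]], list(kv$val))
    }
  }
  lapply(names(emitted), function(k) reducer(k, emitted[[k]]))
}

res_local <- local_mapreduce(
  flights, key = "DEST",
  mapper = function(key, ontime) {
    if (key == 'SFO') keyval(key, ontime) else NULL
  },
  reducer = function(key, vals) {
    sumAD <- 0; count <- 0
    for (x in vals) {
      if (!is.na(x$ARRDELAY)) {   # skip missing delays
        sumAD <- sumAD + x$ARRDELAY
        count <- count + 1
      }
    }
    keyval(key, sumAD / count)
  }
)
print(res_local[[1]])
```

With the toy data, the only emitted key is SFO, and the reducer averages the two non-missing delays (10 and 20).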



• Goal: take the average arrival delay for flights to SFO by airline.


ontime <- ore.pull(ONTIME_S[ONTIME_S$YEAR==2007,])
ontime.dfs <- hdfs.put(ontime, key='UNIQUECARRIER')
res <- hadoop.run(
  ontime.dfs,
  mapper = function(key, ontime) {
    # Key is the airline; emit only flights arriving at SFO
    if (ontime$DEST == 'SFO') {
      keyval(key, ontime)
    } else {
      NULL
    }
  },
  reducer = function(key, vals) {
    sumAD <- 0
    for (x in vals) sumAD <- sumAD + x$ARRDELAY
    res <- sumAD / length(vals)
    keyval(key, res)
  }
)
res
hdfs.get(res)



• Goal: create a file from the ONTIME_S ore.frame to illustrate loading data from a file. Take the average arrival delay for all flights to SFO. Output is one key-value pair.

• Map function returns key-value pairs where column DEST is “SFO”

• Reduce function produces the mean of arrival delay


mydat <- "ONTIME_S_FILE.dat"

write.csv(ore.pull(ONTIME_S),row.names=F,file=mydat)

ontime.dfs <- hdfs.upload(mydat, header=T)

res <- hadoop.run(

ontime.dfs,


…

}

},


…

}

)

print(readLines(hdfs.download(res)))


Database connectivity / interaction

• orhc.connect(host, user, sid, passwd, port, secure=T)

– Establishes connection from ORHC to Oracle Database

– Returns RDBMS connection object

• orhc.disconnect()

– Disconnects from Oracle Database

– Returns RDBMS connection object

• orhc.reconnect()

– Reconnects to Oracle RDBMS with previous credentials

– Faster than orhc.connect()

• orhc.which()

– Displays information about current RDBMS connection

• orhc.dbg.off()

– Turns off all debug output

• orhc.dbg.on('ERROR')

– Turns on error messages only



HDFS connectivity / interaction

• hdfs.connect(dfs.url)

– Establishes connection to Hadoop's HDFS

– Returns HDFS connection object

• hdfs.disconnect()

– Disconnects from Hadoop's HDFS

– Rolls back connection to the default as setup in local Hadoop client configuration

– Returns HDFS connection object

• hdfs.reconnect()

– Reconnects to the previously disconnected Hadoop HDFS

• hdfs.which()

– Displays information about current HDFS connection




• hdfs.push(x, dfs.name, overwrite, driver, split.by)

– Copies ore.frame from RDBMS to HDFS.

– Returns HDFS object identifier used in HDFS/Hadoop function calls

• hdfs.pull(dfs.id, sep, db.table.name, overwrite, driver)

– Copies HDFS object to RDBMS

– Returns an ore.frame which points to new table

• hdfs.upload(filename, dfs.name, overwrite, split.size, header)

– Uploads local file to HDFS

– Simplest and fastest way to transfer data to HDFS from local storage

– Replicates the local file into HDFS directory

– By default, the HDFS directory gets a unique ID and the HDFS file(s) are named "part-0000x"

– If local file > split.size bytes, file automatically split into several parts

• hdfs.download(dfs.id, filename, overwrite)

– Downloads an HDFS file to the local file system

– Simplest and fastest way to transfer data from HDFS to local storage

– Replicates HDFS directory part-0000x files into the local file by combining all part-0000x files as one




• hdfs.put(x, key, dfs.name, overwrite)

– Copies data from R in-memory object (data.frame) to HDFS

– Column names, data types, etc. stored as metadata with data

– Differs from hdfs.push in that if x is an ore.frame, the data is pulled into local R memory and then loaded to HDFS

– If no dfs.name provided, random name generated and returned

• hdfs.get(dfs.id, sep)

– Copies data from HDFS into R in-memory object

– Metadata extracted and column names, data types, etc. restored if data originated from R environment

– Otherwise generic reverse-engineered attributes (like val1, val2 for names) are assigned
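
The effect of missing metadata can be illustrated by analogy in plain R, without HDFS. The sketch below writes a data frame as bare values (no header), reads it back, and assigns generic names in the key/val1 style seen in the hdfs.get() output earlier. It is only an analogy: the file round trip through tempfile() stands in for HDFS storage, and the naming step is an assumption about the style of the generic names, not ORHC's actual code.

```r
# Illustrative data with meaningful column names.
df <- data.frame(speed = c(10, 13), avg_dist = c(34, 38))

# Write values only: no header, so names and types are not stored with the data.
tmp <- tempfile(fileext = ".csv")
write.table(df, tmp, sep = ",", row.names = FALSE, col.names = FALSE)

# Read back: the original names are gone, so generic ones must be assigned.
raw <- read.table(tmp, sep = ",", header = FALSE)
names(raw) <- c("key", paste0("val", seq_len(ncol(raw) - 1)))
print(raw)

unlink(tmp)  # clean up the temporary file
```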




• hdfs.sample(dfs.id, lines, sep)

– Samples (partially copies) data in HDFS and returns an R data.frame in-memory object

– No guarantee that the rows returned are random

• hdfs.attach(dfs.name)

– Brings an HDFS object into ORHC environment

– Attaches "unmanaged" HDFS data to ORHC framework and returns HDFS object identifier

• hdfs.rm(dfs.id)

– Removes data from HDFS

– Invalidates all HDFS object identifiers pointing to the data set




• hdfs.exists(dfs.id)

– Checks if HDFS object exists

– Validates HDFS object identifier or existence of HDFS data with the specified name in HDFS

– Returns TRUE if data can be attached and used in hadoop.run() function

• hdfs.cd(dfs.path)

– Sets current URI path and optionally connection to an HDFS resource

• hdfs.ls(dfs.path)

– Returns name list of all HDFS data objects (directories) at currently set path

– Only directories containing data are listed

• hdfs.pwd()

– Returns current HDFS working directory

• hdfs.mkdir(dfs.name, cd)

– Creates a new sub-directory in HDFS relative to the current working directory




• hdfs.rmdir(dfs.name)

– Deletes an existing sub-directory in HDFS relative to the current working directory

– All data objects stored in this directory will be deleted too and, therefore, all associated HDFS object identifiers will be invalidated

• hdfs.size(dfs.id)

– Returns total size of the HDFS object in bytes

• hdfs.parts(dfs.id)

– Returns number of parts the HDFS object is divided into



Hadoop connectivity / interaction

• hadoop.connect(host, user, passwd, secure)

– Establishes connection to Hadoop's MapReduce

– Returns hadoop connection object if connection was successfully established

• hadoop.disconnect()

– Disconnects (drops connection) from Hadoop's MapReduce JobTracker

– Returns hadoop connection object of the previous settings

– Rolls back connection to default as set up in local Hadoop client configuration

• hadoop.reconnect(hmr.con)

– Reconnects to Hadoop's MapReduce with the hadoop connection object

– After Hadoop connection dropped, ORHC preserves all user credentials and connection attributes

• hadoop.which()

– Displays information about current Hadoop MapReduce connection



Hadoop connectivity / interaction

• hadoop.exec(dfs.id, mapper, reducer, combiner)

– Invokes Hadoop engine and sends mapper, reducer, and combiner R functions for execution on the server side

– Provides core functionality for Hadoop MapReduce execution

• hadoop.run(dfs.id, mapper, reducer, combiner)

– Invokes Hadoop engine and sends mapper and reducer R functions for execution

– If input data not resident in HDFS…

• Data pushed to HDFS

• User map-reduce script prepared for execution

• Script sent to Hadoop

– If execution is successful, data is pulled from HDFS to local R memory or to the database, depending on input data location

– Internally, invokes hdfs.push() and hdfs.pull() APIs so their security considerations apply



Running jobs locally

• ORHC allows Hadoop jobs to be run locally

– Supports testing scripts before deploying to Hadoop Cluster

– Use the command assign('dry.run', T, ORHC:::.orhc.env)

– Next hadoop.run() command will be executed locally

• Data from HDFS is still accessed as before


Use of ORE in Exadata and BDA environments



Big Data Appliance

• An engineered system optimized for capturing and integrating “low density” data into Exadata

• High-performance Hardware

– Optimized for Hadoop and NoSQL workloads

– InfiniBand Networking for integration with Exadata

• Software:

– Oracle Hadoop

– Oracle R Hadoop Connector

– Oracle R Enterprise client (optional)

– Oracle NoSQL DB

– Oracle Data Integrator (Hadoop capabilities)

– Oracle Loader for Hadoop

18 Sun X4270 M2 Servers

– 48 GB memory per node = 864 GB memory

– 12 Intel cores per node = 216 cores

– 24 TB storage per node = 432 TB storage

– 40 Gb/sec InfiniBand

– 10 Gb/sec Ethernet
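
The aggregate figures follow directly from multiplying the per-node values by the 18 servers. A short R check of that arithmetic, using only the numbers on the slide (the variable names are illustrative):

```r
# Per-node figures from the slide, multiplied by the number of servers.
nodes <- 18
per_node <- c(memory_gb = 48, cores = 12, storage_tb = 24)
totals <- nodes * per_node
print(totals)
```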



Big Data Appliance Usage Model



Where to run your job?

• BDA / Hadoop Cluster or Exadata/Oracle Database

• Consider data volumes

– If DBMS data volume dominates, then run in database

– If HDFS data volume dominates with DBMS providing just coefficients or parameters, then run in HDFS

– If both volumes dominate, then use DBFS connect to reach out to HDFS and use the database
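
The guidance above can be written down as a small decision helper. This is a hypothetical sketch, not an Oracle tool: where_to_run() is an invented name, and the slide does not define "dominates", so the ratio of 10 is an arbitrary illustrative threshold. Treating "neither side dominates by that ratio" as the third case is also an assumption.

```r
# Hypothetical helper: db_gb and hdfs_gb are the data volumes on each side.
# 'ratio' is an arbitrary illustrative threshold for "dominates".
where_to_run <- function(db_gb, hdfs_gb, ratio = 10) {
  if (db_gb >= ratio * hdfs_gb) {
    "database"                          # DBMS data volume dominates
  } else if (hdfs_gb >= ratio * db_gb) {
    "HDFS"                              # HDFS data volume dominates
  } else {
    "database, reaching out to HDFS"    # both volumes are substantial
  }
}

print(where_to_run(db_gb = 5000, hdfs_gb = 20))
print(where_to_run(db_gb = 1, hdfs_gb = 8000))
print(where_to_run(db_gb = 3000, hdfs_gb = 4000))
```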
