Hello, and welcome to this online, self-paced lesson entitled “The ORE Transparency Layer.”

This session is part of an eight-lesson tutorial series on Oracle R Enterprise.

My name is Brian Pottle. I will be your guide for the next 45 minutes of interactive lectures and

review sessions on this lesson.

ORE Transparency Layer - 1


“The ORE Transparency Layer” is the fifth lesson of eight self-study sessions on Oracle R

Enterprise. Let’s take a look at the topics for this lesson.



As mentioned previously, the Transparency Layer is a set of packages that map R data types

to Oracle Database objects. This feature enables direct interaction with data in Oracle

Database while using R language constructs.

This lesson includes five topics:

• First, you’ll briefly review the basic functions that enable interaction between R and

Oracle Database.

• Second, you’ll learn about the packages that comprise the Transparency Layer, and

how to use some of the common Transparency Layer functions.

• Third, you’ll learn how to perform some common data transformations and use object

persistence with ORE.

• Fourth, you’ll learn about the ORE ordering framework, and how this framework applies

to in-database sampling and random partitioning.

• Finally, you’ll examine a simple investigative analysis case study that uses the sample

data set.

So first, a quick definition of the Transparency Layer and a review of the functions that enable

interaction with Oracle Database.


The Transparency Layer supports in-database data exploration, data preparation, and data analysis in

order to apply predictive analytics using a mix of in-database and CRAN techniques.

From a coding perspective, the Transparency Layer is a set of packages that map R data types to

Oracle Database objects. Therefore, for base R functionality, R users can write code as though working

with data frames, and transparently have that code operate on data in the database.

This enables R programmers to develop code in their familiar environment, while leveraging the

database server as a high-performance and scalable compute engine. There’s no need to learn a

different programming paradigm.

In addition, the Transparency Layer addresses two of the major limitations of R:

• The memory constraint of R on the local machine.

• The need to duplicate data on function invocation or modification. This second constraint can

quickly exhaust available memory, and dramatically reduce the size of data that can be analyzed.

By implicitly translating R functions to SQL, the Transparency Layer allows users to take advantage of

the database’s powerful query optimization, table indexes, and database parallel execution.

The Transparency framework also offers users SAS data step functionality in an R-based alternative.

With this framework, users can use R to perform the same SAS functionality interactively against

Oracle Database tables.


Recall that you can explicitly connect to Oracle Database by using the ore.connect() function. In this example, the all=TRUE parameter automatically invokes both ore.sync() and ore.attach() on the specified schema.

Remember that only one connection is active at a time.

You can create a database table or view from an existing table by using the ore.create() function.

• When creating a table, you list the source object as the first parameter. You can create the new table from an R data.frame or an ore.frame. An ore.frame is an ORE metadata object that maps to a database table or view.

• When creating a view, you must create it from an existing ore.frame, not from an R data.frame, because the data must be present in the database.

You can also drop tables or views by using the ore.drop() function.

In addition, ORE can store R objects as temporary database objects.

• With the ore.push() function, create a temporary database object that returns a handle to the object. Use this handle to access the temporary object from R.

• These temporary objects are available only for the current session. When the R session ends, these objects are automatically cleaned up.

As mentioned previously, the ore.sync() function synchronizes ORE proxy objects in R with tables and views in the database, on a per schema basis. In the examples shown here:

• The first use synchronizes objects for the currently connected database user.

• The second use synchronizes objects for a named schema - rquser.

• The third use synchronizes two named tables, ontime_s and narrow, in the currently connected schema.

• The fourth use synchronizes two named tables in a specified schema.

You can also find out if an object exists by using the ore.exists() function. This function returns TRUE if a named table or view exists in the schema.


This slide reviews several other functions that you commonly use when you work with Oracle Database connections.

The ore.ls() function has several uses:

• When ore.ls() has no argument, as shown in the first use, a list of all available database objects in the R environment is returned. However, you can restrict this list by adding arguments to the function.

• In the second use of the function, you see only those objects that are associated with the rquser schema.

• In the third use, you see only a list of those objects in the rquser schema that start with a period, such as internal functions or objects. If you use FALSE instead of TRUE in this usage, all objects whose names do not start with a period are returned.

• In the fourth use, only objects that start with the text string specified in the pattern option are returned.

If you no longer want to see the objects in a given schema, use the ore.detach() function. This function removes a named schema’s environment from the R object search path, such as the rquser schema being detached in the slide example.

The ore.attach() function makes database objects visible in the R environment.

• The first use attaches (or reattaches) the rquser schema.

• In the second use, the rquser schema is attached and placed at the second position in the environment search path.

The ore.get() function lets you obtain a handle to a named table in a named schema. In the example, during the current R session, you can use the handle with the name “t” to identify the ontime_s table in the rquser schema.

To remove a specific table or view, or a list of tables or views, from the R environment, use the ore.rm() function. This function differs from ore.detach(), because it removes only the specified objects, rather than all objects in a schema.

• In the first example, the table or view named df1 (in one of the attached schemas) is removed from the R environment.

• In the second example, two specified tables that are in the rquser schema are removed from the R environment.

When you are finished with the current database, use the ore.disconnect() function. ORE cleans up all associated R objects and temporary database schema objects.

Use the ore.exec() function to use SQL or PL/SQL within an ORE session.

• Because no return values are supported, this function is best suited for DDL statements.

• In the example, ore.exec() creates a table named F2 with all of the data from the ontime_s table.


The ontime airline sample data set is available for ORE users. This data set contains on-time arrival data for nonstop domestic flights by major air carriers. It contains 26 columns,

some of which include:

• Departure and arrival delays

• Origin and destination airports

• Flight numbers

• Airline names

The full data set contains 22 years of data, for 29 airlines. A sample of this data set, named ontime_s, contains a subset of this data.

We refer to both of these data sets in this lesson.


Let’s begin with a simple example of pulling data from the Oracle Database into R memory.

This example illustrates the difference between an ore.frame object and a standard R

data.frame object.

On the right, an R script contains commands and related output.

To retrieve data from Oracle Database, use the ore.pull() function. You can retrieve an entire

data set, assuming it fits in the memory of the user’s desktop, or just a small summary result.

To begin, use two functions to identify some information about the ontime_s data set:

• First, use the class() function to return the type of ontime_s data set. Notice that

ontime_s is an ore.frame object.

• The dim() function returns the data set’s dimensions.

Then, use the ore.pull() function. This function retrieves data from the ontime_s table in the

Oracle Database into an R object named “ontime.”

When you execute the function:

• The ORE Transparency Layer invokes the equivalent of the SQL select * from

ontime_s query.

• The query is then submitted to the database.

• Finally, the query result is returned and populated into an R object.

Now, check the type of the R ontime object by using the class() function. Notice that it is a

standard R data.frame object, not an ore.frame object. However, the data set dimensions are

the same.

So, you’ve gone from a proxy object, an ore.frame, to a standard R object, a data.frame.


Now, let’s contrast the previous example with the opposite flow of data by creating a database table from an R data.frame object. Users can reference the table as an ore.frame object for transparent access from R.

First, create a data.frame named df that contains two columns, one numbered 1 to 26, and the other a to z.

As in the previous example, we show the output of the functions in the R script on the right. As before, you can see the dimensions and type of the df object.

To create a database table from a data.frame, use the ore.create() function. For this example, specify the df object as the data source, and specify a table named test_df as the target. You can then reference the table by its name as an ore.frame object.

For example, you can:

• Search for it by using the ore.ls() function

• Return its object type by using the class() function

• Return its dimensions by using the dim() function

• View the first few rows of data by using the head() function

You can also drop the table by using the ore.drop() function.

So, the ore.create() function does two things:

• It creates an ore.frame object that serves as a proxy object for the database table, and

• It loads the R data into the database table, thereby enabling transparent access to the data in the database from R.

This approach provides an R user an easy transition to using the database:

• R users can push R environment data and file-based data from their laptop to the database.

• Then, they can work with the data without the memory constraints of the native R environment.


This section introduces the ORE Transparency Layer packages and examines some of the

Transparency Layer functions, including code examples.


This slide contains the list of packages that comprise Oracle R Enterprise:

• The ORE package is a top-level package that is used for the installation of the other

packages.

• The Transparency Layer consists of the next three packages: OREbase, OREstats, and

OREgraphics. These packages correspond to the R base, statistics, and graphics

packages, respectively. In this lesson, we’ll examine some of the functions in the

OREbase package, and list the supported classes in the OREstats and OREgraphics

packages.

• The next four packages: OREeda, OREdm, OREmodels, and OREpredict, comprise

predictive analysis functionality. These packages are examined in the last lesson in this

series.

• Finally, ORExml is an internal package for translation between R and Oracle Database.


ORE provides access to online documentation and demonstrations.

• The OREShowDoc() function provides access to the entire online documentation set for

ORE. You can also view documentation associated with a particular package or feature,

specified within the function. The first code example provides access to the entire doc

set.

• The R demo() function is overloaded to show prerecorded demonstrations on specified

content. For example:

- The first use of demo() provides a demonstration of the ORE package.

- The second use provides a demonstration of the aggregate() function in the ORE

package.


OREbase is the core package, enabling transparency between R and Oracle Database on

primitive and complex objects, such as vectors, characters, factors, frames, and matrixes.

As stated previously, OREbase maps to the R base package, and allows R to function as a

language with the database.

Next, let’s examine some of the functions that are associated with each class.


The as.ore* functions listed here enable you to convert R type objects to ORE type objects.

Simply use as.ore, and then append the associated object type, such as .vector or .frame, to

push an R object to the database. When you do this, a handle to the associated ORE object is

returned, enabling direct access to the object in the database.

Internally, these functions use ore.push(), which we covered previously. When you push an R

object to the database as an ORE object, remember that it enables in-database processing of

that object without the memory constraints of R.

In the code example, push an R data.frame named df to the database by using the as.ore() function.

Then, when you ask for the class of the resulting ORE object, you can see that it is, indeed, an ore.frame object.

In addition, the dim() and head() functions confirm that the ORE object has the same structure and data as the R object.


Here is a list of the functions in the ore.vector class.

Let’s briefly examine three of these vector functions that ORE overloads:

• split()

• table()

• sapply()

As always, you may invoke “?” on the function to learn more about how a function behaves.


Here, we look at an example that uses the split() and sapply() functions on data from the ontime_s sample data set. The purpose of the split() function is to divide vector data into groups, according to the values specified in a “factor vector.”

In this example, perform a split on arrival delay by airline. Then, create a box plot from the resulting data to visualize it. Finally, use the sapply() function to provide additional information about the data. As shown in the code example:

• The first assignment creates a vector named dat. It contains records from the ontime_s data set, where the arrival delay value is less than 100 and greater than -100.

• Then, use the split() function on the dat vector to create a list object named r. The

resulting list object contains all arrival delays by airline.

• Next, the boxplot() function generates a graph of the data. The table data remains in

Oracle Database. The overloaded boxplot() function uses the database to compute the

statistics necessary for the graph, which can be as few as five statistics for each box.

Only these statistics are retrieved from the database to produce the graphic display.

• Finally, use the sapply() function to compute two results:

- The number of values in each element of the list, and

- The average of those elements

Next, let’s look at the result of executing this script.


The resulting box plot, with the variable-width boxes, lets us see which airlines have the bulk

of the flights.

For example, compare American Airlines (AA) with AQ, a much smaller airline.

In addition, look at a portion of the script execution to see the results of the sapply() function used on the ad list.

• In the first use, the number of flights per airline is generated.

• In the second use, the mean function generates the average arrival delay per airline. When invoking sapply() with mean, we added the na.rm=TRUE parameter, which

removes missing values. Otherwise, many of the reported mean values would show

“NA.”
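To make the split() and sapply() steps concrete, here is a minimal sketch in base R. It uses a small made-up data.frame in place of ontime_s, and the carrier column name UNIQUECARRIER is an assumption for illustration. With ORE, the same calls on an ore.frame would be expected to run in the database.

```r
# Made-up rows standing in for a few ontime_s records (not real data)
flights <- data.frame(
  UNIQUECARRIER = c("AA", "AA", "AA", "AQ", "AQ"),  # assumed column name
  ARRDELAY      = c(10, -5, NA, 20, 40)
)

# Divide the arrival delays into one group per airline
by_carrier <- split(flights$ARRDELAY, flights$UNIQUECARRIER)

# Number of values in each group
counts <- sapply(by_carrier, length)

# Average delay per group; na.rm = TRUE drops the missing values first
means <- sapply(by_carrier, mean, na.rm = TRUE)

# Without na.rm = TRUE, any group that contains a missing value reports NA
means_with_na <- sapply(by_carrier, mean)

print(counts)
print(means)
print(means_with_na)
```

In this sketch, the AA group contains one missing delay, so its mean is NA unless na.rm=TRUE is supplied.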


The table() function builds a contingency table by using the factors that are specified within the parentheses. The result is an object of table type.

The code in this slide contains three examples:

• The first example invokes the table() function on the dayofweek element in ontime_s, and then plots the results.

• The second example creates a cross-table of results by using the dayofweek and

cancelled data elements to see the number of canceled flights for each week day.

• The third example creates a three-way cross-table that also indicates whether the flight

was diverted.
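The three kinds of table described above can be sketched in base R as follows. The data values are made up, and DIVERTED is an assumed column name. The sketch is only meant to show the shape of one-way, two-way, and three-way tables, not the slide's actual output.

```r
# Made-up rows standing in for ontime_s (not real data)
flights <- data.frame(
  DAYOFWEEK = c(1, 1, 2, 2, 2, 6),
  CANCELLED = c(0, 1, 0, 0, 1, 0),
  DIVERTED  = c(0, 0, 1, 0, 0, 0)   # assumed column name
)

# One-way table: number of flights per day of the week
by_day <- table(flights$DAYOFWEEK)

# Two-way cross-table: day of the week by cancellation flag
by_day_cancel <- table(flights$DAYOFWEEK, flights$CANCELLED)

# Three-way cross-table: adds the diverted flag as a third dimension
three_way <- table(flights$DAYOFWEEK, flights$CANCELLED, flights$DIVERTED)

print(by_day)
print(by_day_cancel)
print(dim(three_way))
```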

Next, let’s look at the results of these examples.


For the first example:

• The number of flights is fairly consistent through day 5, and then drops on days 6 and 7.

• The plot of this table provides a visual confirmation of these results.

The second example produces a cross-table of the counts of flights that were canceled per

day (in the second column), along with the total flights per day.

For the third example, the three-way tables show that our data is consistent. No canceled flights were also diverted, which is as it should be.

In all cases, the data remains in the database. Only the resulting small set of statistics is

retrieved. So, even if the underlying data set had a billion rows, the database would compute

the statistics, potentially leveraging parallelism, any available indexes, and query optimization.


ORE supports the character string functions shown in the slide.

In the example of the Character Replace function:

• A mixed-case alpha-numeric string is assigned to a variable by using the

as.ore.character() function.

• Then, two different uses of the chartr() function are shown to illustrate how individual

characters and ranges of characters may be replaced.

You can get help on these functions by typing “?” and the function name.
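As an illustration of character replacement, here is a short base R sketch of chartr() on an ordinary character string. The string and the replacement sets are chosen for illustration and are not necessarily those on the slide. With ORE, the string would first be converted with as.ore.character().

```r
# A mixed-case alphanumeric string (illustrative value)
x <- "MiXeD cAsE 123"

# Replace individual characters: i -> w, X -> h, s -> y
single <- chartr("iXs", "why", x)

# Replace a range of characters: a-c -> D-F, plus X -> w
ranged <- chartr("a-cX", "D-Fw", x)

print(single)  # "MwheD cAyE 123"
print(ranged)  # "MiweD FAsE 123"
```

Note that the replacement is case-sensitive, so the uppercase "A" is left unchanged by the a-c range.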


The ore.factor class has four functions. The code examples show two functions:

• First, use the levels() function on the cancellationcode column in the ontime_s

data set. The levels() function returns distinct values that comprise the specified vector. ORE calls these values “levels.” Notice that the cancellationcode column has four

unique values: A, B, C, and D.

• Second, use the summary() function on the same data element. The summary() function counts the number of times that each unique value occurs in the vector. It also reports the number of missing values.
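The behavior of levels() and summary() on a factor can be sketched in base R with a small made-up vector of cancellation codes. The values below are illustrative only.

```r
# Made-up cancellation codes; NA marks flights with no code recorded
codes <- factor(c("A", "B", NA, "A", "C", NA, "D"))

# Distinct values (levels) of the factor
code_levels <- levels(codes)

# Count per level, plus a separate count of missing values
code_summary <- summary(codes)

print(code_levels)
print(code_summary)
```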


Of all classes in the OREbase package, the ore.frame class has the most functions for

manipulating tables. You’ve already seen a few examples, such as head() and merge().

Next, let’s look at examples of the subset() and scale() functions.

Many math and summary functions in the ore.frame class are not listed in this slide.


First, let’s examine a code example that uses the subset() function, which returns a subset of

data from a vector, a matrix, or a data.frame that meets the specified criteria.

In the first example:

• Use the subset() function to restrict arrival delay to those between -200 and 200. This

subset provides a better scale, because there are mostly outliers beyond this range.

• Then, generate a histogram graph of the subset data with 100 breaks to see the shape

of the distribution.

In the second example:

• Further restrict arrival delay values to those between -100 and 100, and departure

delays of less than 100.

• Then, generate a box plot graph that provides a sharper view of the interquartile range

of the data for both arrival and departure delays.

When you use subset(), notice that you can reference each column without prefixing it with

the data.frame name.
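Here is a minimal base R sketch of the two subset() calls described above, using made-up delay values. DEPDELAY is an assumed column name for departure delay. The plotting steps are omitted.

```r
# Made-up delay values in minutes (not real data)
flights <- data.frame(
  ARRDELAY = c(-250, -20, 15, 90, 150, 300),
  DEPDELAY = c(-5, 0, 10, 120, 30, 280)   # assumed column name
)

# First restriction: arrival delays between -200 and 200
mid_range <- subset(flights, ARRDELAY > -200 & ARRDELAY < 200)

# Second restriction: tighter arrival range, plus departure delay under 100
tight <- subset(flights, ARRDELAY > -100 & ARRDELAY < 100 & DEPDELAY < 100)

print(mid_range)
print(tight)
```

Inside subset(), the column names ARRDELAY and DEPDELAY are written without the flights$ prefix.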


The scale() function centers or scales the columns of a numeric ore.frame. In data mining, we

refer to this as “normalizing” the data by using the z-score method.

Here are two examples of the scale() function that are used on the arrival delay and departure delay columns from the ontime_s data set.

As shown in the code examples, the scale() function takes a scale parameter.

In the first example:

• If you set the scale parameter to FALSE, the values are centered only by subtracting the

mean.

• Show the first few rows of the scaled results by using the head() function.

In the second example:

• If you set the scale parameter to TRUE, the values are scaled by dividing by the

standard deviation.

• Once again, results of this method are shown.
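The difference between the two settings can be sketched in base R on a small made-up data.frame. The sketch assumes the default center=TRUE in both calls, so the second call both centers and scales, which gives z-scores.

```r
# Made-up delay values (not real data)
delays <- data.frame(
  ARRDELAY = c(0, 10, 20),
  DEPDELAY = c(5, 5, 20)
)

# scale = FALSE: subtract the column mean only
centered <- scale(delays, center = TRUE, scale = FALSE)

# scale = TRUE: subtract the mean, then divide by the standard deviation
zscores <- scale(delays, center = TRUE, scale = TRUE)

print(head(centered))
print(head(zscores))
```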


The ore.matrix class contains common mathematical functions that you can use on matrixes

which are stored in Oracle Database.

In the next slide, you’ll see an example of the tabulate() function.

Again, use the “?” command on any of these functions to get help.


The tabulate() function counts the number of times a particular integer value occurs within a

vector. This function allows you to see the distribution of numbers that occur within your data.

This code example shows two uses of tabulate():

• The first example shows how many times each number between 1 and 7 occurs in the V

vector. Notice that the vector is pushed to the database, creating an ore.numeric vector

so that the tabulate() function is performed in the database.

• The second example uses tabulate() on the arrival delay column in ontime_s. Here,

you get a count of flights for each integer minute of reported arrival delay. In this case,

5898 flights had an arrival delay of one minute, 6257 flights had an arrival delay of 2

minutes, and so on.
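A small base R sketch shows how tabulate() counts integer values. Both vectors are made up for illustration. With ORE, the vector would first be pushed to the database so that the counting happens there.

```r
# Made-up vector of integers between 1 and 7
v <- c(2L, 3L, 5L, 7L, 3L, 3L, 1L)

# Position i of the result holds the number of times the value i occurs
v_counts <- tabulate(v, nbins = 7)

# Made-up arrival delays; tabulate() ignores values below 1
delays <- c(1L, 1L, 2L, 0L, -3L, 2L, 2L)
delay_counts <- tabulate(delays)

print(v_counts)      # 1 1 3 0 1 0 1
print(delay_counts)  # 2 3
```

Because tabulate() counts only positive integers, early arrivals and on-time arrivals (negative and zero delays) do not appear in the result.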


Here is a list of the functions in the OREgraphics package. A number of these functions were

examined in the earlier “Getting Started with ORE” lesson.

These functions enable in-database execution of statistics generation that supports graphing

at the R client. Where possible, only the statistics are retrieved from the database to enable

graph rendering.


The OREstats package contains a set of common statistical functions, which correspond to

the R stats package.

As an example, let’s look at the aggregate() function.


The R aggregate() function is an important function that is invoked on data sets. It is

overloaded by an ORE function of the same name so that, when presented with data from an ore.frame,

the function is translated to the appropriate SQL and then executed in the database.

In this example, the aggregate() function counts the number of rows associated with each value in the by

parameter. With ORE, the function maps to a group by query in SQL, which allows the

database to perform the heavy lifting. In this way, the Transparency Layer allows users to

specify the same code to work with standard R data.frames, and also with ore.frames.

On the left is an R script that uses the aggregate() function. In this script:

• The destination airport (ONTIME_S$DEST) is specified in the by parameter.

• An equivalent SQL query is generated, conceptually as shown on the right.

• The query is executed in the database, and the results are retrieved.

Much of base R is overridden so that the data doesn’t need to be moved to the client,

enabling both scalability and performance.

You see the results of the execution at the bottom of the code example.
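The following base R sketch shows the same kind of aggregate() call on a small made-up data.frame. The SQL in the comment is only a conceptual equivalent, written here as an assumption about the general shape of the generated query.

```r
# Made-up destinations standing in for ONTIME_S$DEST (not real data)
flights <- data.frame(
  DEST = c("SFO", "BOS", "SFO", "LAX", "SFO", "BOS"),
  stringsAsFactors = FALSE
)

# Count the rows for each destination.
# Conceptually similar to: SELECT dest, COUNT(*) FROM ontime_s GROUP BY dest
agg <- aggregate(flights$DEST, by = list(flights$DEST), FUN = length)

print(agg)
```

The result has one row per destination, with the group value in the Group.1 column and the count in the x column.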


In this section, you’ll learn how to perform a few common data transformations with ORE, and

also how to persist R objects in Oracle Database.


R provides a wide range of data manipulation capabilities. Two database-centric examples

are shown here: projecting columns and filtering rows.

The first example shows how to select columns from an ore.frame (table in the database):

• First, specify columns in a vector by name. Here, specify the year, dest, and

arrdelay columns in the ontime_s data set. The resulting data.frame (df) contains

the data for these three columns, as shown with the head() function.

• Second, provide a vector of column numbers that reference columns within the ore.frame. Here, select the first, fourth, and twenty-third columns in the ontime_s data set, and

show a subset of the data by using the head() function.

• Third, specify which columns to remove by using a “-” sign. Here, we specify all columns

except those from 1 to 22.

As shown in the second code example, you can filter rows from an ore.frame by specifying a

logical expression that is evaluated on each row. If TRUE, the row is included. If FALSE, the

row is filtered out of the result.

• In the df1 data.frame, all rows that include “SFO” in the DEST column are included.

• In the df2 data.frame, the same DEST condition is specified, but only the first and third

columns of the df data set are passed to df2, where this expression is TRUE.

• In the df3 data.frame, all rows that include “SFO” or “BOS” in the DEST column are

included, and all three columns of the df data set are passed to df3.
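The column and row selections described above can be sketched in base R on a small made-up data.frame. The four-column layout and the values are assumptions for illustration, so the column positions differ from those in ontime_s.

```r
# Made-up rows with only four columns (not the real ontime_s layout)
flights <- data.frame(
  YEAR     = c(1987, 1990, 2000, 2005),
  MONTH    = c(1, 6, 7, 12),
  DEST     = c("SFO", "BOS", "LAX", "SFO"),
  ARRDELAY = c(5, -3, 40, 12),
  stringsAsFactors = FALSE
)

# Column selection by name, by position, and by exclusion
df        <- flights[, c("YEAR", "DEST", "ARRDELAY")]
by_number <- flights[, c(1, 3, 4)]
excluded  <- flights[, -(1:2)]      # everything except columns 1 to 2

# Row selection with a logical expression evaluated on each row
df1 <- df[df$DEST == "SFO", ]
df2 <- df[df$DEST == "SFO", c(1, 3)]
df3 <- df[df$DEST == "SFO" | df$DEST == "BOS", 1:3]

print(head(df))
print(df1)
print(df2)
print(df3)
```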


Now, let’s compare the same column and row selections with the equivalent SQL syntax.

Column Selection

Column selection is a straightforward mapping. However, for the second and third R

examples, notice that column selection by number or exclusion is impossible with SQL. With

SQL, you must refer to the columns by name.

Row Selection

For row selection, the mapping is nearly identical. You simply place the filter criteria in a WHERE clause in the SQL query.

Notice that R syntax enables object name creation dynamically. SQL mimics this feature by

view creation. The primary difference is that the R objects disappear at the end of the session,

whereas the views persist across database sessions.

Fortunately, when you program using ORE, the correct SQL transformations are done for you

automatically.


In the same way that the R aggregate() function is overloaded for ORE, the R merge()

function is also overloaded to accept ore.frame objects.

Therefore, you can use merge() on data.frames, as shown on the left, and also on ore.frames,

as shown on the right:

• On the left, create two data.frame objects: df1 and df2. Each data frame has two

columns of data: x1, y1, and x2, y2, respectively.

• On the right, create two tables from these data frames. The tables are accessed as

ore.frame objects.

In both uses of the merge() function, the syntax is equivalent to merge the data sets. And, the

results are the same.

The R documentation for merge() includes parameters that enable natural, inner, and outer

joins, as well as other options.

As before, use the “?” command to get help on the merge function.
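Here is a minimal base R sketch of merge() on two small data.frames. The column names follow the description above, but the values are made up. It also shows the all.x parameter, which turns the default inner join into a left outer join.

```r
# Two made-up data.frames with key columns x1 and x2
df1 <- data.frame(x1 = 1:5, y1 = letters[1:5], stringsAsFactors = FALSE)
df2 <- data.frame(x2 = 5:1, y2 = letters[11:15], stringsAsFactors = FALSE)

# Inner join on x1 = x2
merged <- merge(df1, df2, by.x = "x1", by.y = "x2")

# Keep only part of df2 to show the difference between join types
df2_small <- df2[df2$x2 <= 3, ]
inner <- merge(df1, df2_small, by.x = "x1", by.y = "x2")
left  <- merge(df1, df2_small, by.x = "x1", by.y = "x2", all.x = TRUE)

print(merged)
print(inner)
print(left)
```

In the left outer join, rows of df1 that have no match in df2_small are kept, with NA in the y2 column.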


As you learned in a previous lesson, the designers of R incorporated the ability to save the

whole R workspace to disk for later reloading. In addition, you can save and reload individual

R objects. In both cases, you use the same R save() and load() functions. With these

functions, R objects are serialized and unserialized to and from disk files.

Using this feature, predictive models can be built in one session, and be scored in another.

To address this need for saving and restoring objects, ORE provides object persistence in the

database through an R Datastore table.

The name of a datastore is specified in the ore.save() function, and used in the ore.load()

function to restore the saved objects.

Each database schema has its own datastore table where R objects are saved as named

datastores. Referential integrity of the saved objects is maintained by the datastore. This

feature preserves ORE objects across R sessions, and enables the contents of a datastore to be

passed to embedded R functions as parameters for loading within that function.


Here, we show two examples of object persistence, first in R, and then in ORE.

In the R example:

• Two R objects, a linear model and a data frame, are saved to a file on the file system

during one R session.

• Then, these objects are reloaded into another R session. When the objects are restored,

they have the same name as when they were saved.

For this example to work, the entire R workspace would also need to be saved and restored,

so that the proxy object references are available in the new R session.
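The plain R side of this comparison can be sketched as follows. The object names and the temporary file are illustrative assumptions, and the sketch simulates the second session by removing the objects before reloading them.

```r
# Build two ordinary R objects: a linear model and a data.frame
fit <- lm(mpg ~ wt, data = mtcars)
dat <- head(mtcars)

# Serialize both objects to a file on the file system
path <- tempfile(fileext = ".RData")
save(fit, dat, file = path)

# Simulate a new session by removing the objects from the workspace
rm(fit, dat)

# load() restores the objects under their original names
restored <- load(path)
print(restored)
print(class(fit))

unlink(path)
```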

In the ORE example:

• Two ORE objects, a linear model and an ore.frame, are saved to a datastore table named ds1, during one R session.

• Then, in another R session, these objects are reloaded into R memory by calling the datastore by name in the ore.load() function.

In this example, there is no need for the entire R workspace to be saved and reloaded, since

datastore tables maintain the referential integrity of saved objects.


Here’s an example with ore.save().

In this example, we create several objects, including:

• DAT1, an ore.frame that includes three columns from the sample ONTIME_S data set

• An ore.lm() model that uses DAT1

• A standard R lm() model that uses the mtcars data set

• An Oracle Data Mining Naive Bayes model that uses ONTIME_S

We then invoke ore.save() for the three models and use the datastore name myModels.

The R objects are saved in the datastore, and any referenced tables for the Oracle Data

Mining model are kept in the database.

There are a number of arguments that may be used with ore.save(), as shown in the slide.


The ore.load() function restores the R objects to the R .GlobalEnv environment. As shown

in the example, ore.load() can be used to load all objects in the named datastore, or you can load specific objects by using the list argument. The environment into which these objects

should be loaded may also be specified.


Using ore.datastore(), users can list basic information about the contents of a datastore.

The resulting object, in this example dsinfo, is an R data.frame. The data.frame includes the

following columns and rows:

• Columns: datastore.name, object.count (# objects in datastore), size (in bytes),

creation.date, and description

• Rows include: one per datastore object in the schema

In the code example at the bottom, the ore.datastore() function is used twice. For this

example, we have previously created some other datastores.

In the first use, the default output is shown, including rows for all datastores found in the

schema.

In the second use, the pattern argument is used to show all datastores that include the

string “Mod”. The result shows only the myModels datastore.


Using ore.datastoreSummary(), users can list the R objects that are saved within the named

datastore.

The resulting object, in this example objinfo, is an R data.frame. The data.frame includes

the following columns and rows:

• Columns: object.name, class, size (in bytes), length (if a vector), row.count (if

data.frame), col.count (if data.frame)

• Rows include: one per datastore object in the schema

In the code example at the bottom, the ore.datastoreSummary() function is used twice. In the first use, the function is invoked on the datastore myModels. In the output, you can see the class of each model.

In the second use, the function is invoked on the datastore myIrisData, which is the ORE data set iris, stored as an ore.frame.


Of course, you can delete existing datastores using the ore.delete() function. This assumes

that some other user did not explicitly delete the same database objects using other means,

such as SQL or the ore.drop() function.


This section examines the ORE ordering framework, and shows how this framework applies

to in-database sampling and random partitioning.


To understand the rationale for an ordering framework in ORE, we first need to contrast data

ordering behavior in R and in Oracle Database. In R, ordering is taken for granted. However,

in the relational database, ordering is not expected unless explicitly asked for.

R's in-memory nature has implicit ordering of elements in vectors and data frames. By default,

R supports integer indexing. For example, a user can ask for the 3rd and 7th elements of a

vector, or for rows 3 through 7 in a data.frame, where 3 and 7 are integer subscripts for the

object, as shown in the examples. So, when we talk about R, the notion of “unordered” data

doesn’t really exist.
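
To make the R side of this contrast concrete, here is a minimal sketch in base R. The vector v and the data.frame df are made-up illustrations, not the objects shown on the slide.

```r
# Base R only: positions in a vector and rows in a data.frame are implicitly ordered,
# so integer subscripts always refer to the same elements.
v <- c(10, 20, 30, 40, 50, 60, 70, 80)   # hypothetical vector
third_and_seventh <- v[c(3, 7)]          # the 3rd and 7th elements

df <- data.frame(id = 1:10, value = (1:10)^2)   # hypothetical data.frame
rows_3_to_7 <- df[3:7, ]                 # rows 3 through 7

print(third_and_seventh)
print(rows_3_to_7)
```

Running the same subscripts again always returns the same elements, which is exactly the guarantee that a relational table does not give without an explicit ordering.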

Contrast this with database tables. By definition, a relational table does not define data order.

For example, executing a select statement multiple times may result in a different ordering of

the rows returned. One way to support ordering involves defining a primary key for tables and views, and then performing a sort using the ORDER BY clause in a SELECT statement.

However, this can be expensive from a performance standpoint.

Therefore, to map tables and views directly to R data structures, and ensure repeatable

results, additional measures need to be taken. This is the rationale for the ORE ordering

framework.


An important observation is that not all operations in R require ordering. For example, invoking

summary() on a data.frame to compute summary statistics does not depend on row order.

Therefore, the ORE ordering framework includes an intelligent component that helps determine whether

or not ordering is necessary for a given operation. To support this design, two types of ore.frame

objects are provided. These two types of ore frames are ordered and unordered.

An ordered ore.frame is produced when:

• A primary key is defined on the underlying table

• It is produced by certain functions, such as aggregate() and cbind()

• The row names of the ore.frame are set to unique values

• All input ore frames to relevant ORE functions are ordered

An unordered ore.frame is produced when:

• No primary key is defined on the underlying table

• A primary key is specified, but the ore.sync() parameter use.keys is set to FALSE

• No row names are specified for the ore.frame, or row names have been set to NULL

• For relevant ORE functions, one or more input ore.frames are unordered

Let’s take a look at some code examples to illustrate this framework.


Here’s an example that leverages a data set from the kernlab package known as spam.

We’re going to augment this data with two columns: the first is a unique ID for each row, and

the second is a numeric user ID. Using these new columns, we’ll create two variants of this

data set:

• SPAM_PK will define a primary key using USERID and TS, as shown in the SQL ALTER

TABLE statement.

• SPAM_NOPK will have no primary key defined.

When we look at the first few rows of each table, using the head() function, we notice that:

• SPAM_PK has row IDs that have values that correspond to both USERID and TS,

separated by a vertical bar.

• Whereas SPAM_NOPK just has row IDs 1-6.

ORE automatically knows that SPAM_PK has a key defined on two columns, and created an

ordered ore.frame, using the key columns to create row IDs. And, since SPAM_NOPK has no

primary key defined, it was created as an unordered ore.frame. Notice that the unordered

ore.frame output is followed by warning messages. The first warning is for the inner access of SPAM_NOPK, to subset the columns, and the other is for the invocation of head() on that

result.


Recall that the use.keys parameter to ore.sync() enables you to ignore a primary key if you set

it to FALSE. In this example, we set use.keys=FALSE, which reloads proxy objects as

unordered ore.frames.

Then, we view output from both tables defined in the previous example with the head() function. First, the output of SPAM_PK displays default sequential numbers as the row names.

It is now an unordered ore.frame, returning a warning. And, the same is true for SPAM_NOPK.

Now, let’s use ore.sync() again, with the default of use.keys=TRUE. As shown with both

tables, we’re back to the same result as in the previous example. The output of SPAM_PK

displays row names of TS and USERID values separated by a “|” character, while SPAM_NOPK

is still unordered, since it has no primary key specified.

Another bit of information about unordered ore.frames: the row.names parameter is set to

NULL. So, you can check if an ore.frame is ordered by using the is.null() function on the

row.names parameter. Here, the is.null() function will return FALSE, since SPAM_PK is an

ordered frame.


Here, we show a few more code examples to illustrate how row.names() works.

What happens when you use ore.push() to dynamically create an ore.frame? It automatically

creates an ordered frame, with default row names. Notice that row.names() values are

character strings.

If you try to get row.names on an unordered frame, for example, using SPAM_NOPK,

row.names() raises an error, since an unordered frame has no unique key.

However, you can get the row names of an ordered frame, as shown with SPAM_PK. Again,

the row names here consist of TS and USERID.

In the next example, we reassign row names with TS only. Now, we see that the row names

only correspond to the TS values. It is the same when viewing with the head() function.


Now, let’s look at character and integer indexing examples with ore.frames.

The first example shows that we can access the rows with the name “2060”.

We can also access a set of rows by supplying a vector of character row names. Here, we

index to a range of rows from “2060” to “2064”.

Alternatively, if we supply the actual integer values, we get different rows, as evidenced by the

different row names, and different column values.
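
The difference between character and integer indexing can be sketched with a plain R data.frame. This is an illustration only: the data and row names below are invented, and on an ordered ore.frame the same subscripts would be evaluated in the database rather than in memory.

```r
# Hypothetical data.frame whose row names deliberately differ from the row positions.
df <- data.frame(value = letters[1:10], row.names = as.character(10:1))

by_name     <- df["3", , drop = FALSE]              # the row NAMED "3" (8th position)
by_position <- df[3, , drop = FALSE]                # the row in the 3rd POSITION (named "8")
by_names    <- df[c("3", "2", "1"), , drop = FALSE] # several rows by character name

print(by_name)
print(by_position)
print(by_names)
```

The character subscript "3" and the integer subscript 3 select different rows, which is the same effect the slide shows with different row names and column values.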


Here, we illustrate how ordered and unordered frames interact with the merge() function,

which joins data from two frames.

First, we’ll work with the unordered frame SPAM_NOPK. We create two subsets of the data that

we can merge. Since these are created from an unordered data set, x and y are also

unordered. The merge() function is permitted here, since order is not required for merging

data. However, notice that warnings are issued, since the merged results are not ordered.

If we create the same two subsets of the data with the ordered frame SPAM_PK, we see the

merged result with no warnings. Also, the row names are a concatenation of the row names from x and y.


Just as R allows the setting of global options that affect behavior, ORE provides options for the ordering framework. The ore.warn.order option indicates whether ORE should issue

a warning when an operation that depends on ordering is applied to an unordered ore.frame. Turning these warnings off can clean up the output, if you know what to expect. With ore.warn.order, users can:

• See what the current setting is

• Turn warnings on

• Turn warnings off

In addition, the separator used for row name values can be set by using the ore.sep option.
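
As a rough sketch, these settings are read and changed through R's standard options() and getOption() functions. The option names ore.warn.order and ore.sep come from this lesson; the values chosen below are only examples, and without the ORE packages loaded the options are stored but have no effect.

```r
# Remember the current settings so they can be restored afterwards.
old <- options(ore.warn.order = TRUE, ore.sep = "|")   # turn warnings on, set a separator

warn_on <- getOption("ore.warn.order")   # see what the current setting is

options(ore.warn.order = FALSE)          # turn ordering warnings off
warn_off <- getOption("ore.warn.order")

options(ore.sep = "+")                   # change the row-name separator (example value)
sep_now <- getOption("ore.sep")

print(c(warn_on = warn_on, warn_off = warn_off))
print(sep_now)

options(old)                             # restore whatever was set before
```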

Finally, if you want a specific separator used as a one-time option, use the sep option in an

ore.pull() function, as shown here. Notice that the ID and AGE values are concatenated with a

“+” separator as row names for the output.


In summary, the ORE ordering framework:

• Enables integer and character indexing on ore.frames

• Identifies functions that require ordering and those that do not

• Provides the row.names() function for ordered ore.frames

• Enables creation of ordered frames by using ore.sync()


Ordering in the database is an expensive operation, since it normally involves sorting.

Fortunately, most operations in R do not require ordering.

Therefore, we recommend that you set use.keys = FALSE in the ore.sync() function,

unless you know that you need ordering. If you need integer indexing, set use.keys =

TRUE in the ore.sync() function.


One key area that leverages the ORE ordering framework is in-database sampling. This is

true because sampling uses integer indexing, which requires ordered ore frames.

Sampling is an important capability for statistical analysis. Normally with R-based sampling,

you must first load data into memory. However, if the data is too large, this isn’t possible. With

ORE, you can perform sampling on large data sets in Oracle Database by leveraging the

ordering framework’s integer row indexing.

In the next few slides, we’ll examine some sampling techniques that use this approach.


In the latest version of ORE, users no longer have to pull the data and then sample.

Instead, you can now sample directly in the database, and then pull only those records that

are part of the sample.

As shown in the slide, ORE now provides a wide range of sampling techniques for

in-database sampling.


With simple random sampling, we want to select rows at random. In this example, we create a small data.frame, and push it to the database to create an ore.frame named MYDATA. A

portion of the data set is shown in the output. Out of 20 rows, we want to sample 5. We use

the R sample() function to produce a random set of indices that will allow us to get our sample from MYDATA.

By using the class() function, we see that the sample is in an ore.frame. Then, we run the

sample and view the results.
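
The pattern can be sketched as follows. To keep the sketch runnable without a database connection, it uses a plain data.frame, so class() reports data.frame here. In the lesson, the data is first pushed with ore.push() and the sample is an ore.frame. The column names and the seed are assumptions.

```r
set.seed(1)                               # assumed seed, only for repeatability
N <- 20
mydata <- data.frame(a = 1:N, b = letters[1:N])   # hypothetical 20-row data set

sample_size <- 5
idx <- sample(seq_len(N), sample_size)    # random row indices, without replacement
simple_sample <- mydata[idx, ]            # integer row indexing selects the sample

print(class(simple_sample))
print(simple_sample)
```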


With split data sampling, we want to split our data into “train” and “test” data sets. This is

normally done in classification and regression models, so that you can tell how well a model

performs on data it hasn’t seen before.

The basic idea is to produce a sampled set of indices for the test data set. Then, we create a logical vector, which we call group, that is TRUE if the index is in the sample set, and FALSE

otherwise. Finally, we use row indexing to produce the:

• Training data set where the group is FALSE

• Test data set where the group is TRUE

Notice that the number of rows in the training set is 15, and in the test set it’s 5, as was

specified in the sample size.
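
A minimal sketch of this split, again on a plain data.frame so that it runs without ORE. The names mydata, train, and test_set are illustrative assumptions; only the logical vector name group comes from the lesson.

```r
set.seed(1)                               # assumed seed, only for repeatability
N <- 20
mydata <- data.frame(a = 1:N, b = letters[1:N])   # hypothetical 20-row data set

sample_size <- 5
ind <- sample(seq_len(N), sample_size)    # sampled indices for the test set
group <- seq_len(N) %in% ind              # TRUE if the row index is in the sample

train    <- mydata[!group, ]              # training data: group is FALSE
test_set <- mydata[group, ]               # test data: group is TRUE

print(c(train = nrow(train), test = nrow(test_set)))
```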


In systematic sampling, we select rows at regular intervals. The key step is creating a

sequence of values using the seq() function. The seq() invocation produces values 2, 5, 8, 11,

and so on, since we start at 2 and increment by 3.
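
Here is a sketch of that step. The start value 2 and the increment 3 come from the narration; the 20-row data.frame and its column names are assumptions made so the example runs in base R.

```r
N <- 20
mydata <- data.frame(a = 1:N, b = letters[1:N])   # hypothetical 20-row data set

start <- 2                                # first row to take
by    <- 3                                # interval between selected rows
idx <- seq(start, N, by = by)             # 2, 5, 8, 11, ... up to at most N
systematic_sample <- mydata[idx, ]

print(idx)
print(systematic_sample)
```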


In stratified sampling, the goal is to select rows within each group, where a group is

determined by the values of a particular column. We create our data set to have each row

assigned a group. The function rnorm() produces random, normal numbers. The parameter 4

is the desired mean for the distribution.

Here, we use the split() function to split the data according to the group, and then sample

proportionately from each partition. Lastly, we row-bind the list of subset ore.frames into a

single ore.frame.

The execution results are shown at the bottom.
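
The split, sample, and row-bind steps can be sketched as shown here. This is an in-memory approximation on a data.frame, not the slide's code: the data set size, the 10% sampling fraction, and the use of ceiling() to guarantee at least one row per group are all assumptions.

```r
set.seed(1)                               # assumed seed, only for repeatability
N <- 200
mydata <- data.frame(
  a     = 1:N,
  b     = round(rnorm(N), 2),
  group = round(rnorm(N, 4), 0)           # group label drawn around a mean of 4
)

frac <- 0.1                               # assumed sampling fraction per group
strata <- split(mydata, mydata$group)     # one data.frame per group value

sampled <- lapply(strata, function(y) {
  n_y <- nrow(y)
  # Sample proportionately within the group, keeping at least one row.
  y[sample(seq_len(n_y), ceiling(n_y * frac)), , drop = FALSE]
})

stratified_sample <- do.call(rbind, sampled)   # row-bind the per-group samples
print(table(stratified_sample$group))
```

Every group that appears in the data contributes rows to the sample, roughly in proportion to its size.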


In cluster sampling, we take whole groups of data at random. As before, rnorm() is used to

produce random, normal numbers. Then, we split the data according to group, sample among

the groups, and finally row-bind into a single ore.frame. We see that the resulting sample has

data from two clusters: 6 and 7.
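
A sketch of the same idea in base R follows. The data is randomly generated, so the clusters chosen here will generally not be 6 and 7. The data set size and the choice of two clusters are assumptions.

```r
set.seed(1)                               # assumed seed, only for repeatability
N <- 200
mydata <- data.frame(
  a     = 1:N,
  b     = round(rnorm(N), 2),
  group = round(rnorm(N, 4), 0)           # cluster label drawn around a mean of 4
)

clusters <- split(mydata, mydata$group)   # one data.frame per cluster
chosen <- sample(names(clusters), 2)      # pick two whole clusters at random
cluster_sample <- do.call(rbind, clusters[chosen])   # row-bind the chosen clusters

print(chosen)
print(table(cluster_sample$group))
```

Unlike stratified sampling, every row of a chosen cluster is kept, and unchosen clusters contribute nothing.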


Quota sampling is achieved by taking the first N records corresponding to your sample size.

For this, we simply use the head() function. The tail() function could also be used. The

resulting output is shown here.
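
In sketch form, assuming a hypothetical 20-row data.frame and a sample size of 5:

```r
N <- 20
mydata <- data.frame(a = 1:N, b = letters[1:N])   # hypothetical 20-row data set

sample_size <- 5
quota_first <- head(mydata, sample_size)  # the first N records
quota_last  <- tail(mydata, sample_size)  # alternatively, the last N records

print(quota_first)
print(quota_last)
```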


Here’s a summary of the sampling techniques we just examined, and how to realize them in

R.


Finally, let’s examine a simple case study for the sample ontime_s data set. We will perform

some investigatory analysis on the data to determine the answers to several questions.


During this lesson, the ontime_s sample data set has been used to show how some of

the Transparency Layer functions work. By way of the code examples, you’ve learned a few

things about airline and airport delays.

Now, let’s examine a simple case study that analyzes arrival delays across the 36 busiest

airports in 2007.


Recall that we examined a box plot graph in the lesson about producing graphs in R. Let’s

review how to interpret a box plot before looking at our case study example.

The basic elements of the box plot are identified in the slide. They include:

• The interquartile range (IQR), which measures the spread of the distribution between

the 1st and 3rd Quartiles.

• The Median position, indicating the skew of the data. In this example, the median

position is skewed toward the 3rd Quartile, which means that more of the data is at the

low end.

• The Notch, which provides a 95% confidence interval around the median.

• Outliers, which are those values outside 1½ times the IQR.
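
These elements can also be computed numerically. The following sketch uses base R's boxplot.stats() on made-up data. It assumes the default settings, in which the box edges are the hinges (which approximate the 1st and 3rd Quartiles), the notch is the median plus or minus 1.58 times the IQR divided by the square root of n, and outliers lie more than 1½ times the IQR beyond the box.

```r
set.seed(1)                               # assumed seed, only for repeatability
x <- c(rexp(200, rate = 1 / 10), 150)     # hypothetical skewed data plus one extreme value

s <- boxplot.stats(x)                     # default coef = 1.5
q1  <- s$stats[2]                         # lower hinge (about the 1st Quartile)
med <- s$stats[3]                         # median
q3  <- s$stats[4]                         # upper hinge (about the 3rd Quartile)
iqr <- q3 - q1                            # spread of the middle half of the data

notch    <- s$conf                        # lower and upper ends of the notch
outliers <- s$out                         # values beyond 1.5 * IQR from the box

print(c(q1 = q1, median = med, q3 = q3, iqr = iqr))
print(notch)
print(length(outliers))
```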

Now, let’s look at a graph that uses this representation.


Let’s determine the significance of the differences between the best and worst arrival delays

for the busiest airports. In this case, compare the busiest 36 airports in 2007.

We started with a box plot of each airport’s distribution of arrival delay, and included a min

and max confidence bound around the median (also called the Notch).

By comparing the median values, you can see that Pittsburgh (PIT), LaGuardia (LGA), and

John F. Kennedy (JFK) were generally the worst airports for delays.

However, Philadelphia (PHL) has the longest delays, followed by Newark (EWR), as

indicated by the long stretch between the median and the 3rd Quartile. In both cases, the

median position is skewed significantly toward the 1st Quartile, which means that more of the

data is at the high end of the interquartile range, and indicates longer delays. Also notice that

the whisker values for both airports are the largest, indicating a larger range of values from

the median.

Finally, the best airports in this group of 36 include Baltimore/Washington (BWI), Salt Lake

City (SLC), and Houston (IAH).

So, let’s examine the R script that produces this graph.


Here is the R script that creates the box plot. We’ll cover each line of code from the top down.

• In line 1, simply rename the ore.frame ontime_s, so that you can easily change to a different data set if you want to, without modifying the subsequent code.

• In line 2, use the aggregate() function to count the number of flights for each destination airport. Use the length() function to get the number of elements in each subset of the data. Because this aggregate() function uses an ore.vector object, the data remains in the database.

• In line 3, determine the 36 busiest airports from the aggregated data.

• In line 4, get the names of the airports from the set of data created in line 3. Because the vector of airport names is a factor, use the drop = TRUE parameter to indicate that you don’t want to include levels (values) that are not actually in the results.

• In lines 5-6, select the arrival delay values for the selected airports in 2007.

• In lines 7-8, do the same thing for the destination.

• In line 9, reorder the destinations according to the median value of the delay for each destination airport.

• In line 10, use the split() function to partition the delay data by destination.

• In lines 11-14, use the boxplot() function to graph the data appropriately.

• In line 15, get the labels from the levels (or values) of the destination data set. In this case, the labels are the unique airport code names.

• And finally, use the text() function to label the y-axis of the graph.
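
The data-preparation steps above can be sketched in base R as follows. This is not the script from the slide: it runs on a small, randomly generated data.frame with made-up airport codes, it assumes the column names DEST, YEAR, and ARRDELAY, it keeps the 4 busiest airports instead of 36, and it computes the box plot statistics with plot = FALSE rather than drawing and labeling the graph.

```r
set.seed(42)                              # assumed seed, only for repeatability
airports <- c("AAA", "BBB", "CCC", "DDD", "EEE", "FFF")   # hypothetical airport codes
n <- 3000
ontime_s <- data.frame(                   # synthetic stand-in for the sample data set
  DEST     = factor(sample(airports, n, replace = TRUE, prob = c(6, 5, 4, 3, 1, 1))),
  YEAR     = sample(2006:2007, n, replace = TRUE),
  ARRDELAY = round(rnorm(n, mean = 5, sd = 20))
)

ontime <- ontime_s                        # rename, so the data set is easy to swap
top_n <- 4                                # the lesson uses the 36 busiest airports

# Count flights per destination airport.
aggdata <- aggregate(ontime$DEST, by = list(ontime$DEST), FUN = length)
# Smallest flight count among the top_n busiest airports.
minx <- min(head(sort(aggdata$x, decreasing = TRUE), top_n))
# Airport names at or above that count, dropping unused factor levels.
busiest <- aggdata$Group.1[aggdata$x >= minx, drop = TRUE]

keep  <- ontime$DEST %in% busiest & ontime$YEAR == 2007
delay <- ontime$ARRDELAY[keep]            # arrival delays for those airports in 2007
dest  <- ontime$DEST[keep, drop = TRUE]   # matching destinations

dest <- reorder(dest, delay, FUN = median, na.rm = TRUE)  # order levels by median delay
bd   <- split(delay, dest)                # partition the delays by destination

bp <- boxplot(bd, notch = TRUE, plot = FALSE)   # box plot statistics only, no drawing
labels <- levels(dest)                    # airport codes, usable as axis labels
print(labels)
print(bp$stats)
```

Because reorder() sorts the factor levels by median delay, the boxes come out ordered from the lowest median to the highest, which is what makes the best and worst airports easy to compare.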



In this lesson, you learned:

• About the basic functions that enable interaction between R and Oracle Database.

• About the packages that comprise the Transparency Layer, and how to use some of the

common Transparency Layer functions.

• How to perform some common data transformations and how to use object persistence

with ORE.

• About the ORE ordering framework, and how this framework applies to in-database

sampling and random partitioning.

• Finally, you examined a simple analytic case study that uses the sample data set.


You’ve just completed “The ORE Transparency Layer”. Please move on to the next lesson in

the series: “Embedded R Execution: R Interface”.



