Technical Paper
Performance of SAS® In-Memory Statistics for Hadoop™ A Benchmark Study
Allison Jennifer Ames
Xiangxiang Meng
Wayne Thompson
Release Information Content Version: 1.0 May 20, 2014
Trademarks and Patents SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.
Contents
Executive Summary
Introduction
Construction of Proxy Data
Benchmark Methods
   Computing Environment
   Benchmark Tasks
Results
Conclusion
References
Executive Summary
In a recent benchmark study, Revolution Analytics made claims such as "ScaleR outperformed SAS on every task" and "ScaleR ran the tasks 42 times faster than SAS" (Dinsmore & Norton, 2014).
However, the study compared Revolution R Enterprise's (RRE) Parallel External Memory Algorithms, a distributed process, with SAS procedures that were not run in distributed mode.
To make a fairer comparison, this benchmark study ran the same tasks in a distributed analytic environment. That is, we constructed a data set of identical size to the one used in Revolution Analytics' benchmark and ran the same tasks using SAS® In-Memory Statistics for Hadoop™ (PROC IMSTAT) on a cluster with the same number of nodes as the hardware used in Revolution Analytics' benchmark.
Results indicate:

- With 5 million observations and 134 columns, PROC IMSTAT took a total of 12.56 seconds to complete all tasks. In comparison, RRE7 completed in 109.7 seconds; that is, Revolution Analytics' RRE7 took 8.7 times as long to run the same set of tasks as PROC IMSTAT.
- The individual tasks took from 2.8 to 40 times as long to run in RRE7 as with PROC IMSTAT.
- In all instances, PROC IMSTAT outperformed the RRE7 reported timings for both the 1 million and 5 million observation data sets.
- Scoring a 50 million observation data set completed in 1.34 seconds in PROC IMSTAT; the comparable task in RRE7 took 21.5 times as long.
Introduction
The context for this study begins at the Strata Conference on October 25, 2012, where the research and planning
division of a large insurance corporation presented various methods that they used to model 150 million observations
of insurance data. A summary of their presentation is available at:
http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html .
In this performance benchmark, Revolution Analytics asserted that their Parallel External Memory Algorithms (PEMA)
resulted in "vastly better performance for advanced analytics" (Dinsmore & Norton, 2014). However, several readers
questioned the methodology used and the validity of the claims made by Revolution Analytics, pointing out that the
Revolution Analytics tests were run on clustered computing environments while the SAS benchmark tests were not.
In March 2014, Revolution Analytics undertook a follow-up benchmark study intended to make a fairer comparison
by running the tests on the same hardware. For the 2014 benchmark, a SAS consultant was hired to review the
programs and enable them for Grid computing. The second Revolution Analytics benchmark findings included claims
such as "ScaleR outperformed SAS on every task" and "ScaleR ran the tasks 42 times faster than SAS" (Dinsmore &
Norton, 2014). However, Dinsmore and Norton (2014) deployed SAS Release 9.4 with Base SAS, SAS/STAT, and
SAS Grid Manager as the major components. They used a desktop machine running SAS Management Console and
SAS Enterprise Guide as the Grid client. Despite enabling the Grid, SAS procedures running on a single node were
compared to distributed Revolution Analytics algorithms. In the one instance in which a distributed SAS procedure
was used (PROC HPREG), the SAS High-Performance Analytics Server was not utilized; in this case, the benefits of
the High-Performance procedures cannot be fully realized.
While we applaud the attempt to make a fairer comparison between Revolution Analytics and SAS products, and
Revolution Analytics' transparency in posting the SAS code used to run the procedures (posted at
https://github.com/RevolutionAnalytics/Benchmark), the benchmark is still not an evaluation using comparable
computing environments.
The computing environments used in the 2014 Revolution Analytics benchmark remain dramatically different despite
the intention to provide a fairer comparison. Dinsmore and Norton (2014) concluded that SAS/STAT software
was slower than RRE because of the way in which SAS/STAT swaps data between memory and disk when a data
set is larger than memory, a process which can be slower than in-memory operations. In contrast, RRE uses Parallel
External Memory Algorithms (PEMA) to distribute operations over multiple machines in a clustered architecture.
When a data set is larger than memory on any single machine, rather than swap to disk, RRE distributes the data
across all available computing resources. This, Dinsmore and Norton (2014) claim, is the reason behind the vastly
different timings.
A more fruitful and fair comparison can be made by comparing SAS distributed procedures to RRE distributed
algorithms. The purpose of this benchmark is to make such a comparison. We generated a data set comparable to
the one described in the 2014 Revolution Analytics benchmark and performed a set of tests using SAS® LASR™
Analytic Server and SAS® In-Memory Statistics for Hadoop™. The remainder of the paper discusses the construction
of the proxy data, a description of the SAS® LASR™ Analytic Server and SAS® In-Memory Statistics for Hadoop™,
the benchmark procedures, results, and conclusions.
Construction of Proxy Data
Three data sets were generated to mimic the properties of those used in the Dinsmore and Norton (2014) study in
terms of row and column counts. The data sets contain 1 million, 5 million, and 50 million rows, respectively, and
each contains 134 columns. All data generation was performed using the IMSTAT procedure on the SAS® LASR™
Analytic Server.
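The paper does not publish the generation code, and the study generated its data inside the IMSTAT procedure itself. As a rough illustration of what such a proxy table looks like, the sketch below builds a comparable table (134 columns) with an ordinary DATA step; every name in it (work.proxy1M, x1-x133, c1) is hypothetical and not the study's actual schema.

```
/* Illustrative sketch only: all names and column definitions below    */
/* are hypothetical, not those used in the benchmark study.            */
data work.proxy1M;
   array x{133};                      /* 133 numeric columns            */
   length c1 $12;                     /* 1 character column, 134 total  */
   call streaminit(2014);             /* reproducible pseudo-random run */
   do i = 1 to 1000000;               /* 1-million-row table            */
      do j = 1 to dim(x);
         x{j} = rand('NORMAL');       /* standard normal predictors     */
      end;
      c1 = cats('LEVEL', ceil(10*rand('UNIFORM')));  /* 10-level factor */
      output;
   end;
   drop i j;
run;
```

A table built this way can then be loaded into the LASR Analytic Server through the SASIOLA engine, as shown in the example script later in this paper.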
Benchmark Methods
Computing Environment
SAS® LASR™ Analytic Server is a dedicated, multipass, in-memory analytics engine designed to address advanced
analytics in a scalable manner; it provides secure, multiuser, concurrent access to data of any size. The SAS®
In-Memory Statistics for Hadoop™ procedure (PROC IMSTAT) moves all of the data into dedicated memory, so all of
the data can be analyzed in the shortest amount of time. The software is optimized for distributed, multithreaded
architectures and scalable processing, so requests to run new scenarios or complex analytical computations are
handled very quickly. This benchmark demonstrates just how fast some common analytical procedures can be
performed.
PROC IMSTAT uses in-memory analytics technology to perform analyses that range from data exploration,
visualization, and descriptive statistics to model building with advanced statistical and machine learning algorithms
and scoring new data.
Revolution Analytics used a clustered computing environment consisting of five four-core machines running
CentOS, all networked using Gigabit Ethernet connections, with a separate NFS server. Revolution R Enterprise
Release 7 (RRE7) was installed on each node. To make a valid comparison, all tasks run within PROC IMSTAT on
the SAS® LASR™ Analytic Server also used five nodes (one name node and four data nodes).
Benchmark Tasks
The set of tasks included in the benchmark is provided in Table 1.
Task | RRE 7 Capability | SAS® PROC IMSTAT
Descriptive statistics (n, min, max, mean, std) on 1 numeric variable | rxSummary | summary
Median and deciles for 1 numeric variable | rxQuantile | percentile
Frequency distribution for 1 text variable | rxCube | frequency
Linear regression with 1 numeric response and 20 numeric predictors, with score code generated | rxLinMod | glm
Linear regression with 1 numeric response and 10 numeric predictors and 10 categorical predictors | rxLinMod | glm
Stepwise linear regression with 100 numeric predictors | rxLinMod | --
Logistic regression with 1 binary response variable and 20 numeric predictors | rxLogit | logistic
Generalized linear model with numeric response variable, 20 numeric predictors, gamma distribution and link function | rxGlm | genmodel
k-means clustering with 20 active variables | rxKmeans | cluster
k-means clustering with 100 active variables | rxKmeans | cluster
Table 1: Benchmark Tasks
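The IMSTAT statements named in the right-hand column of Table 1 can all be issued against a loaded in-memory table within a single PROC IMSTAT run. The sketch below is illustrative only: the variable names (x1-x100, c1, y, b) and the model specifications are hypothetical, and option syntax should be checked against the IMSTAT documentation.

```
proc imstat;
   table lasr.data5M;       /* hypothetical name for the 5M-row table   */
   summary x1;              /* n, min, max, mean, std for one variable  */
   percentile x1;           /* median and deciles                       */
   frequency c1;            /* frequency distribution for a text column */
   glm y = x1-x20;          /* linear regression, 20 numeric predictors */
   logistic b = x1-x20;     /* logistic regression, binary response b   */
   genmodel y = x1-x20;     /* gamma GLM; distribution and link options */
                            /* omitted here -- see the documentation    */
   cluster x1-x20;          /* k-means with 20 active variables         */
run;
```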
An example script for computing frequencies in PROC IMSTAT is shown below. For a more comprehensive
discussion of the SAS® LASR™ Analytic Server and SAS® In-Memory Statistics for Hadoop™, please see the SAS®
LASR™ Analytic Server reference guide and the PROC IMSTAT documentation (SAS Institute Inc., 2014).
/* Start a LASR Analytic Server with four data nodes */
proc lasr create port=&myport path="/tmp";
   performance nodes=4;
run;

/* Assign a libref to the server and load the 1-million-row table */
libname lasr sasiola port=&myport tag='WORK';

data lasr.data1M;
   set &data1M.;
run;

/* Compute the frequency distribution of one categorical variable */
proc imstat;
   table lasr.data1M;
   frequency DemTVReg;
run;
A DISTRIBUTIONINFO statement provides information about how the data are spread across the nodes. Table 2
below shows how the 5,000,175 rows of data are distributed across the four data nodes.
Node | Number of Partitions | Number of Records
node48 | 0 | 1250044
node49 | 0 | 1250044
node50 | 0 | 1250044
node51 | 0 | 1250043
Table 2: Distribution of 5 Million Observations Across 4 Nodes
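The partition summary in Table 2 comes from a single statement. A minimal sketch, assuming the 5-million-row table has already been loaded under the hypothetical name lasr.data5M:

```
proc imstat;
   table lasr.data5M;      /* hypothetical name for the 5M-row table   */
   distributioninfo;       /* partitions and record counts per node    */
run;
```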
Results
Table 3 shows the complete time-to-run results, in seconds, using the larger data set of 5 million records.
PROC IMSTAT took a total of 12.56 seconds to complete, compared to 109.7 seconds for RRE7. The RRE7 time is
the sum of all times reported in Dinsmore and Norton (2014) minus the time for the stepwise linear regression task,
because SAS® In-Memory Statistics for Hadoop™ has yet to implement stepwise regression. Thus, Revolution
Analytics' RRE7 took 8.73 times as long to run the same set of tasks as PROC IMSTAT.
The individual tasks took from 2.8 to 40 times as long to run in RRE7 as with PROC IMSTAT. In all instances,
PROC IMSTAT outperformed the RRE7 reported timings across a set of tasks representative of the end-to-end
analytics life cycle.
Task | RRE 7 | SAS® PROC IMSTAT | How Much Faster Is SAS?
Descriptive statistics (n, min, max, mean, std) on 1 numeric variable | 1.2 | 0.03 | 40x
Median and deciles for 1 numeric variable | 1.4 | 0.11 | 12.72x
Frequency distribution for 1 text variable | 0.8 | 0.03 | 26.7x
Linear regression with 1 numeric response and 20 numeric predictors, with score code generated | 6.8 | 2.43 | 2.8x
Linear regression with 1 numeric response and 10 numeric predictors and 10 categorical predictors | 7.3 | 0.55 | 13.2x
Stepwise linear regression with 100 numeric predictors | 13.9 | -- | --
Logistic regression with 1 binary response variable and 20 numeric predictors | 16.9 | 1.10 | 15.4x
Generalized linear model with numeric response variable, 20 numeric predictors, gamma distribution and link function | 32.7 | 5.49 | 6x
k-means clustering with 20 active variables | 10.1 | 0.64 | 15.8x
k-means clustering with 100 active variables | 32.5 | 2.18 | 14.9x
Table 3: Time to Run (Seconds)
Table 4 provides the overall time to run for both the 1 million and 5 million observation data sets.
Using the first linear regression model (with 20 numeric predictors), 50 million observations were scored using PROC
IMSTAT in 1.34 seconds. A comparable task in RRE7 took 28.8 seconds, over 21 times as long.
Data Set Size | Total Time for Tasks
1 million rows | 4.80
5 million rows | 12.56
Table 4: Total Time to Run (Seconds)
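As an illustration of the scoring step described above, the score code produced by the linear regression task can be applied to the 50-million-row table within PROC IMSTAT. The sketch below is an assumption about the workflow, not the study's actual program: the CODE option spelling and the SCORE statement syntax should be verified against the IMSTAT documentation, and all names are hypothetical.

```
proc imstat;
   table lasr.data5M;
   /* Fit the model and write DATA step score code to a file.       */
   /* The code=file(...) option shown here is an assumed spelling.  */
   glm y = x1-x20 / code=file("/tmp/glm_score.sas");

   /* Switch to the 50M-row table and apply the generated code.     */
   table lasr.data50M;
   score code=file("/tmp/glm_score.sas");
run;
```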
Conclusion
This study has attempted to make a benchmark comparison between SAS® In-Memory Statistics for Hadoop™, a
distributed computing environment, and Revolution Analytics' distributed computing environment. Results show that
the SAS® In-Memory Statistics for Hadoop™ times to run the reported tasks were all faster than the Revolution
Analytics counterparts.
These results contrast with those reported in the 2014 benchmark by Dinsmore and Norton (2014). One reason for
the conflicting results is that the Dinsmore and Norton (2014) benchmark used Revolution Analytics' distributed
computing environment, PEMA, but contrasted the results with (a) SAS High-Performance procedures not run on the
SAS High-Performance Analytics Server or (b) non-distributed procedures. This severely limited the comparability of
the procedures.
One limitation of this study is that we were able to use only a proxy for the data set used in the Revolution
Analytics benchmark. However, the data sizes (numbers of rows and columns) in the two studies were identical. A
next step may include ensuring that the exact data generated by Revolution Analytics is used.
Despite this, we feel that the results provided in this study offer a clearer comparison between the two analytics
solutions. If "speed matters," as claimed by Dinsmore and Norton (2014), then SAS® In-Memory Statistics for
Hadoop™ provides a clear advantage for advanced analytics customers.
We would like to thank the SAS Enterprise Excellence Center and Business Intelligence Research and Development
teams for their assistance in securing hardware assets and installing software for the tests performed in this
benchmark study.
References
Dinsmore, Thomas, and Norton, Derek. 2014. "Revolution R Enterprise: Faster than SAS." Available at http://www.revolutionanalytics.com/sites/default/files/revolution-analytics-sas-benchmark-whitepaper-mar2014.pdf.

SAS Institute Inc. 2014. SAS® LASR™ Analytic Server 2.3: Reference Guide. Cary, NC: SAS Institute Inc. Available at http://support.sas.com/documentation/cdl/en/inmsref/67306/PDF/default/inmsref.pdf.

SAS Institute Inc. 2014. IMSTAT Procedure (Analytics). Cary, NC: SAS Institute Inc. Available at http://support.sas.com/documentation/cdl/en/inmsref/67306/HTML/default/viewer.htm#n1l5k6bed95vzqn1a47vafe3q958.htm.

SAS Institute Inc. 2014. IMSTAT Procedure (Data and Server Management). Cary, NC: SAS Institute Inc. Available at http://support.sas.com/documentation/cdl/en/inmsref/67306/HTML/default/viewer.htm#p10dosb1fybvpzn1hw38gxuotopk.htm.

Smith, David. 2012. "Allstate Compares SAS, Hadoop and R for Big-Data Insurance Models." Available at http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html.
To contact your local SAS office, please visit: sas.com/offices
Copyright © 2014, SAS Institute Inc. All rights reserved.