Technical Paper
Performance of SAS® In-Memory Statistics for Hadoop™ A Benchmark Study
Allison Jennifer Ames
Xiangxiang Meng
Wayne Thompson
Release Information Content Version: 1.0 May 20, 2014
Trademarks and Patents SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.
Contents
Executive Summary
Introduction
Construction of Proxy Data
Benchmark Methods
   Computing Environment
   Benchmark Tasks
Results
Conclusion
References
Executive Summary
In a recent benchmark study, Revolution Analytics made claims such as "ScaleR outperformed SAS on every task" and "ScaleR ran the tasks 42 times faster than SAS" (Dinsmore & Norton, 2014).
However, the study compared Revolution R Enterprise's (RRE) Parallel External Memory Algorithms, a distributed process, with SAS procedures that were not run in distributed mode.
To make a fairer comparison, this benchmark study ran the same tasks in a distributed analytic environment. That is, we constructed a data set of identical size to the one used in Revolution Analytics' benchmark and ran the same tasks using SAS® In-Memory Statistics for Hadoop™ (PROC IMSTAT) on a cluster with the same number of nodes as the hardware used in Revolution Analytics' benchmark.
Results indicate:

- With 5 million observations and 134 columns, PROC IMSTAT took a total of 12.56 seconds to complete all tasks. In comparison, RRE7 completed in 109.7 seconds; that is, Revolution Analytics' RRE7 took 8.7 times as long to run the same set of tasks as PROC IMSTAT.
- The individual tasks took from 2.8 to 40 times as long to run in RRE7 as with PROC IMSTAT.
- In all instances, PROC IMSTAT outperformed the RRE7 reported timings for both the 1 million and 5 million observation data sets.
- Scoring a 50 million observation data set completed in 1.34 seconds in PROC IMSTAT; the comparable task in RRE7 took 21.5 times as long.
Introduction
The context for this study begins at the Strata Conference on October 25, 2012, where the research and planning
division of a large insurance corporation presented various methods that they used to model 150 million observations
of insurance data. A summary of their presentation is available at:
http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html .
In this performance benchmark, Revolution Analytics asserted that their Parallel External Memory Algorithms (PEMA)
resulted in "vastly better performance for advanced analytics" (Dinsmore & Norton, 2014). However, several readers
questioned the methodology used and the validity of the claims made by Revolution Analytics, pointing out that the
Revolution Analytics tests were run on clustered computing environments while the SAS benchmark tests were not.
In March 2014, Revolution Analytics undertook a follow-up benchmark study intended to make a fairer comparison
by running the tests on the same hardware. For the 2014 benchmark, a SAS consultant was hired to review the
programs and enable them for Grid computing. The second Revolution Analytics benchmark findings included claims
such as "ScaleR outperformed SAS on every task" and "ScaleR ran the tasks 42 times faster than SAS" (Dinsmore &
Norton, 2014). However, Dinsmore and Norton (2014) deployed SAS Release 9.4 with Base SAS, SAS/STAT, and
SAS Grid Manager as the major components. They used a desktop machine running SAS Management Console and
SAS Enterprise Guide as the Grid client. Despite enabling the Grid, SAS procedures running on a single node were
compared to distributed Revolution Analytics algorithms. In the one instance in which a distributed SAS procedure
was used (PROC HPREG), the SAS High-Performance Analytics Server was not utilized; in this case, the benefits of
the High-Performance procedures cannot be fully realized.
While we applaud the attempt to make a fairer comparison between Revolution Analytics and SAS products, and
Revolution Analytics' transparency in posting the SAS code used to run the procedures (posted at
https://github.com/RevolutionAnalytics/Benchmark), the benchmark is still not an evaluation using comparable
computing environments.
The computing environments used in the 2014 Revolution Analytics benchmark remain dramatically different despite
the intention to provide a fairer comparison. Dinsmore and Norton (2014) concluded that SAS/STAT software
was slower than RRE because of the way in which SAS/STAT swaps data between memory and disk when a data
set is larger than memory, a process which can be slower than in-memory operations. In contrast, RRE uses Parallel
External Memory Algorithms (PEMA) to distribute operations over multiple machines in a clustered architecture.
When a data set is larger than memory on any single machine, rather than swap to disk, RRE distributes the data
across all available computing resources. This, Dinsmore and Norton (2014) claim, is the reason behind the vastly
different timings.
A more fruitful and fair comparison can be made by comparing SAS distributed procedures to RRE distributed
algorithms. The purpose of this benchmark is to make such a comparison. We generated a data set comparable to
the one described in the 2014 Revolution Analytics benchmark and performed a set of tests using SAS® LASR™
Analytic Server and SAS® In-Memory Statistics for Hadoop™. The remainder of the paper discusses the construction
of the proxy data, a description of the SAS® LASR™ Analytic Server and SAS® In-Memory Statistics for Hadoop™,
the benchmark procedures, results, and conclusions.
Construction of Proxy Data
Three data sets were generated to mimic the properties of those used in the Dinsmore and Norton (2014) study in
terms of row and column counts. The data sets contain 1 million, 5 million, and 50 million rows, respectively, and
each contains 134 columns. All data generation was performed using the IMSTAT procedure on the SAS® LASR™
Analytic Server.
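The paper does not publish the generation code, and the study generated its data inside the IMSTAT procedure itself. As a rough illustration of what such a proxy table looks like, the sketch below builds a comparable table (134 columns) with an ordinary DATA step; every name in it (work.proxy1M, x1-x133, c1) is hypothetical and not the study's actual schema.

```
/* Illustrative sketch only: all names and column definitions below    */
/* are hypothetical, not those used in the benchmark study.            */
data work.proxy1M;
   array x{133};                      /* 133 numeric columns            */
   length c1 $12;                     /* 1 character column, 134 total  */
   call streaminit(2014);             /* reproducible pseudo-random run */
   do i = 1 to 1000000;               /* 1-million-row table            */
      do j = 1 to dim(x);
         x{j} = rand('NORMAL');       /* standard normal predictors     */
      end;
      c1 = cats('LEVEL', ceil(10*rand('UNIFORM')));  /* 10-level factor */
      output;
   end;
   drop i j;
run;
```

A table built this way can then be loaded into the LASR Analytic Server through the SASIOLA engine, as shown in the example script later in this paper.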
Benchmark Methods
Computing Environment
SAS® LASR™ Analytic Server is a dedicated, multipass, in-memory analytics engine designed to address advanced
analytics in a scalable manner; it provides secure, multiuser, concurrent access to data of any size. The SAS®
In-Memory Statistics for Hadoop™ procedure (PROC IMSTAT) moves all of the data into dedicated memory, so all of
the data can be analyzed in the shortest amount of time. The software is optimized for distributed, multithreaded
architectures and scalable processing, so requests to run new scenarios or complex analytical computations are
handled very quickly. This benchmark demonstrates just how fast some common analytical procedures can be
performed.
PROC IMSTAT uses in-memory analytics technology to perform analyses that range from data exploration,
visualization, and descriptive statistics to model building with advanced statistical and machine learning algorithms
and scoring new data.
Revolution Analytics used a clustered computing environment consisting of five four-core machines running
CentOS, all networked using Gigabit Ethernet connections, with a separate NFS server. Revolution R Enterprise
Release 7 (RRE7) was installed on each node. To make a valid comparison, all tasks run within PROC IMSTAT on
the SAS® LASR™ Analytic Server also used five nodes (one name node and four data nodes).
Benchmark Tasks
The set of tasks included in the benchmark is provided in Table 1.
Task | RRE 7 Capability | SAS® PROC IMSTAT
Descriptive statistics (n, min, max, mean, std) on 1 numeric variable | rxSummary | summary
Median and deciles for 1 numeric variable | rxQuantile | percentile
Frequency distribution for 1 text variable | rxCube | frequency
Linear regression with 1 numeric response and 20 numeric predictors, with score code generated | rxLinMod | glm
Linear regression with 1 numeric response and 10 numeric predictors and 10 categorical predictors | rxLinMod | glm
Stepwise linear regression with 100 numeric predictors | rxLinMod | --
Logistic regression with 1 binary response variable and 20 numeric predictors | rxLogit | logistic
Generalized linear model with numeric response variable, 20 numeric predictors, gamma distribution and link function | rxGlm | genmodel
k-means clustering with 20 active variables | rxKmeans | cluster
k-means clustering with 100 active variables | rxKmeans | cluster
Table 1: Benchmark Tasks
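The IMSTAT statements named in the right-hand column of Table 1 can all be issued against a loaded in-memory table within a single PROC IMSTAT run. The sketch below is illustrative only: the variable names (x1-x100, c1, y, b) and the model specifications are hypothetical, and option syntax should be checked against the IMSTAT documentation.

```
proc imstat;
   table lasr.data5M;       /* hypothetical name for the 5M-row table   */
   summary x1;              /* n, min, max, mean, std for one variable  */
   percentile x1;           /* median and deciles                       */
   frequency c1;            /* frequency distribution for a text column */
   glm y = x1-x20;          /* linear regression, 20 numeric predictors */
   logistic b = x1-x20;     /* logistic regression, binary response b   */
   genmodel y = x1-x20;     /* gamma GLM; distribution and link options */
                            /* omitted here -- see the documentation    */
   cluster x1-x20;          /* k-means with 20 active variables         */
run;
```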
An example script for computing frequencies in PROC IMSTAT is shown below. For a more comprehensive
discussion of the SAS® LASR™ Analytic Server and SAS® In-Memory Statistics for Hadoop™, please see the SAS®
LASR™ Analytic Server reference guide and the PROC IMSTAT documentation (SAS Institute Inc., 2014).
/* Start a LASR Analytic Server with four data nodes */
proc lasr create port=&myport path="/tmp";
   performance nodes=4;
run;

/* Assign a libref to the server and load the 1-million-row table */
libname lasr sasiola port=&myport tag='WORK';

data lasr.data1M;
   set &data1M.;
run;

/* Compute the frequency distribution of one categorical variable */
proc imstat;
   table lasr.data1M;
   frequency DemTVReg;
run;
A DISTRIBUTIONINFO statement provides information about how the data are spread across the nodes. Table 2
below shows how the 5,000,175 rows of data are distributed across the four data nodes.
Node | Number of Partitions | Number of Records
node48 | 0 | 1250044
node49 | 0 | 1250044
node50 | 0 | 1250044
node51 | 0 | 1250043
Table 2: Distribution of 5 Million Observations Across 4 Nodes
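The partition summary in Table 2 comes from a single statement. A minimal sketch, assuming the 5-million-row table has already been loaded under the hypothetical name lasr.data5M:

```
proc imstat;
   table lasr.data5M;      /* hypothetical name for the 5M-row table   */
   distributioninfo;       /* partitions and record counts per node    */
run;
```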
Results
Table 3 shows the complete time-to-run results, in seconds, using the larger data set of 5 million records.
PROC IMSTAT took a total of 12.56 seconds to complete, compared to 109.7 seconds for RRE7. The RRE7 time is
the sum of all times reported in Dinsmore and Norton (2014) minus the time for the stepwise linear regression task,
because SAS® In-Memory Statistics for Hadoop™ has yet to implement stepwise regression. Thus, Revolution
Analytics' RRE7 took 8.73 times as long to run the same set of tasks as PROC IMSTAT.
The individual tasks took from 2.8 to 40 times as long to run in RRE7 as with PROC IMSTAT. In all instances,
PROC IMSTAT outperformed the RRE7 reported timings across a set of tasks representative of the end-to-end
analytics life cycle.
Task | RRE 7 | SAS® PROC IMSTAT | How Much Faster Is SAS?
Descriptive statistics (n, min, max, mean, std) on 1 numeric variable | 1.2 | 0.03 | 40x
Median and deciles for 1 numeric variable | 1.4 | 0.11 | 12.72x
Frequency distribution for 1 text variable | 0.8 | 0.03 | 26.7x
Linear regression with 1 numeric response and 20 numeric predictors, with score code generated | 6.8 | 2.43 | 2.8x
Linear regression with 1 numeric response and 10 numeric predictors and 10 categorical predictors | 7.3 | 0.55 | 13.2x
Stepwise linear regression with 100 numeric predictors | 13.9 | -- | --
Logistic regression with 1 binary response variable and 20 numeric predictors | 16.9 | 1.10 | 15.4x
Generalized linear model with numeric response variable, 20 numeric predictors, gamma distribution and link function | 32.7 | 5.49 | 6x
k-means clustering with 20 active variables | 10.1 | 0.64 | 15.8x
k-means clustering with 100 active variables | 32.5 | 2.18 | 14.9x
Table 3: Time to Run (Seconds)
Table 4 provides the overall time to run for both the 1 million and 5 million observation data sets.
Using the first linear regression model (with 20 numeric predictors), 50 million observations were scored using PROC
IMSTAT in 1.34 seconds. A comparable task in RRE7 took 28.8 seconds, over 21 times as long.
Data Set Size | Total Time for Tasks
1 million rows | 4.80
5 million rows | 12.56
Table 4: Total Time to Run (Seconds)
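As an illustration of the scoring step described above, the score code produced by the linear regression task can be applied to the 50-million-row table within PROC IMSTAT. The sketch below is an assumption about the workflow, not the study's actual program: the CODE option spelling and the SCORE statement syntax should be verified against the IMSTAT documentation, and all names are hypothetical.

```
proc imstat;
   table lasr.data5M;
   /* Fit the model and write DATA step score code to a file.       */
   /* The code=file(...) option shown here is an assumed spelling.  */
   glm y = x1-x20 / code=file("/tmp/glm_score.sas");

   /* Switch to the 50M-row table and apply the generated code.     */
   table lasr.data50M;
   score code=file("/tmp/glm_score.sas");
run;
```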
Conclusion
This study has attempted to make a benchmark comparison between SAS® In-Memory Statistics for Hadoop™, a
distributed computing environment, and Revolution Analytics' distributed computing environment. Results show that
the SAS® In-Memory Statistics for Hadoop™ times to run the reported tasks were all faster than the Revolution
Analytics counterparts.
These results contrast with those reported in the 2014 benchmark by Dinsmore and Norton (2014). One reason for
the conflicting results is that the Dinsmore and Norton (2014) benchmark used Revolution Analytics' distributed
computing environment, PEMA, but contrasted the results with (a) SAS High-Performance procedures not run on the
SAS High-Performance Analytics Server or (b) non-distributed procedures. This severely limited the comparability of
the procedures.
One limitation of this study is that we were able to use only a proxy for the data set used in the Revolution
Analytics benchmark. However, the data sizes (numbers of rows and columns) in the two studies were identical. A
next step may include ensuring that the exact data generated by Revolution Analytics is used.
Despite this, we feel that the results provided in this study offer a clearer comparison between the two analytics
solutions. If "speed matters," as claimed by Dinsmore and Norton (2014), then SAS® In-Memory Statistics for
Hadoop™ provides a clear advantage for advanced analytics customers.
We would like to thank the SAS Enterprise Excellence Center and Business Intelligence Research and Development
teams for their assistance in securing hardware assets and installing software for the tests performed in this
benchmark study.
References
Dinsmore, Thomas, and Norton, Derek. 2014. "Revolution R Enterprise: Faster than SAS." Available at http://www.revolutionanalytics.com/sites/default/files/revolution-analytics-sas-benchmark-whitepaper-mar2014.pdf.

SAS Institute Inc. 2014. SAS® LASR™ Analytic Server 2.3: Reference Guide. Cary, NC: SAS Institute Inc. Available at http://support.sas.com/documentation/cdl/en/inmsref/67306/PDF/default/inmsref.pdf.

SAS Institute Inc. 2014. IMSTAT Procedure (Analytics). Cary, NC: SAS Institute Inc. Available at http://support.sas.com/documentation/cdl/en/inmsref/67306/HTML/default/viewer.htm#n1l5k6bed95vzqn1a47vafe3q958.htm.

SAS Institute Inc. 2014. IMSTAT Procedure (Data and Server Management). Cary, NC: SAS Institute Inc. Available at http://support.sas.com/documentation/cdl/en/inmsref/67306/HTML/default/viewer.htm#p10dosb1fybvpzn1hw38gxuotopk.htm.

Smith, David. 2012. "Allstate Compares SAS, Hadoop and R for Big-Data Insurance Models." Available at http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html.
To contact your local SAS office, please visit: sas.com/offices
Copyright © 2014, SAS Institute Inc. All rights reserved.