Enterprise Research Service

Research Brief

Eyes Wide Open: Open Source Analytics Software

August 2014

Written by Guest Contributor: David Macdonald

iianalytics.com

Copyright © 2014 International Institute for Analytics. Proprietary to subscribers. IIA research is intended for IIA members only and should not be distributed without permission from IIA. All inquiries should be directed to [email protected].


Key Findings

1. The total cost of owning and managing analytics technology consists of hardware (price per CPU, price per unit of storage), software (price per unit/license) and human capital (price per output) costs. Human capital costs are divided between line of business (LOB) users and IT support costs.

2. Transformational advances in data storage and compute power over the past 20 years have driven hardware costs so low that adoption is nearly universal. At the same time, managing these systems has become easier, resulting in lower human capital expense in the form of training time (LOB users) and maintenance and management costs (IT). Resilient and reliable storage and compute power is now a commodity.

3. Open source storage (Apache Hadoop) and operating system (Linux) options have proliferated over the past 3+ years, leading many firms to reliably experiment with low/no cost open source options to supplement or replace licensed commercial solutions.

4. In contrast, firms venturing down the open source analytics software path are not always seeing the expected cost reductions due to higher human capital expenses and increased risk that is introduced into the enterprise through open source software.

5. IIA recommends firms take a blended approach to software selection, matching the correct tool to analytics user type/role, and that firms recalculate total costs, specifically incorporating potential risks associated with open source tools, particularly in mission-critical applications.

Introduction & Background

Across a broad spectrum of information technology projects, the primary cost drivers have traditionally fallen into three categories: hardware, software and people. In a similar way, the costs of assembling the technology and expertise necessary for a robust analytics ecosystem inside your enterprise will fall into these same categories. The unique feature of a data and analytics program, however, is the growing cost of supporting end-users who are playing the largest role in how analytical answers are generated for complex business problems. The human capital costs are not simply confined to IT.

Additionally, significant efficiency gains in storage and computational power have driven acquisition costs down while open source software options have proliferated. Firms are now faced with two primary questions when building analytics platforms. First, how do I effectively evaluate the risks and rewards of the technology improvements and open source software options? Second, how do I accurately calculate the total cost of ownership (TCO)1, particularly when it comes to total labor costs?

This research brief begins with an outline of the major costs to consider when making an investment in analytics. The second section presents five modernization stages that analytics hardware and software have experienced, and the final section covers considerations for calculating the total cost of ownership of the analytics ecosystem.

Data and Analytics Cost Drivers

What is unique in the area of analytics is that significant human capital costs are spread across various line of business (LOB) users, whereas other technology initiatives are largely paid for, and managed by, IT. This uniqueness is particularly important today, as software tools and discrete, problem-specific applications are proliferating. LOB users need to be re-trained in how to operate and inter-operate these tools.

The hardware, software and human costs of a representative firm-level analytics ecosystem are outlined below.

1 For a full description of software total cost of ownership, and a TCO calculator, see http://www.softwareadvice.com/tco/.

Hardware

Hardware requirements for the analytics ecosystem consist of data storage capabilities and the processing environment to run analytics software. The traditional approach for firms has been to own hardware on premise for ease of use, governance, and security considerations. This will likely continue for most firms that have invested heavily in Enterprise Data Warehouses (Oracle, Teradata). However, cloud-based offerings that provision storage, the compute environment, and “rented” (Software as a Service) analytics software are increasingly being offered by new players such as Amazon Web Services.

Software

Analytics software is at the heart of any firm’s analytics platform. Current options divide between commercial and open source offerings. Increasingly, application-specific analytics tools are being designed and offered in the marketplace that deliver a discrete, single-purpose computation or an answer to a specific business problem. For example, retailers today commonly use a single solution to generate a next best offer result for consumers as they navigate a shopping website. These single-purpose applications use sophisticated analytical models, but the application is confined to solving a single business problem.

Human Capital (IT)

Technology projects are traditionally purchased and managed by IT staff. The majority of IT human capital costs in the analytics ecosystem stem from data warehousing, provisioning and management of software, and software maintenance. Increasingly, however, these costs are being shifted into the line of business, where data scientists and analysts embedded in the business units manage both the data and the analytics software tools used for analysis. This trend is moving IT staff away from their traditional role of overall system stewardship, centralized purchasing and governance, toward a more tactical support and training role.

Human Capital (Line of Business “LOB”)

At high performing analytical firms, the analytics function (data mining, predictive modeling, forecasting, text mining, optimization, etc.) is increasingly decentralized, sitting next to business decision makers. From a cost perspective, these firms now see their human capital expenses related to analytics moving from IT to be accounted for within divisional budgets. In fact, the majority of software license sales for the newly emerging visual analytics category are occurring directly within the lines of business rather than through IT.

This shift in “ownership” from IT to the LOB is causing some consternation on the part of IT, which has traditionally seen data management as one of its core competencies. As this tension plays out in firms, there is bound to be confusion around “who does what,” resulting in duplicated efforts and increased costs.

While the components of TCO for a firm’s analytics ecosystem resemble other technology project categories, they are also shifting due to changes in licensing costs and in where actual utilization of data and software sits.

Stages of Modernization of Commercial Data Storage, Compute Power

With the cost drivers of a firm’s analytics ecosystem in mind, this section will focus on the significant improvements in technology and delivery models that have tended to bring the overall costs of acquiring and managing commercial analytics environments down. Table 1 presents the five stages of maturity.

Stage | Era | Description | Brands
1. Legacy Platforms + Warehouse | Pre-2000 | PCs, servers, legacy environments | IBM, HP, Sun; Intel
2. Grid + Warehouse | Early 2000s | Move to distributed grid computing environments | Above + Oracle, Netezza, Teradata
3. Grid + HADOOP Storage | 2008-2012 | Hadoop as alternative storage environment to commercial solutions | Above + Apache Hadoop
4. Grid + HADOOP Storage & Compute | 2008-2012 | Hadoop as compute environment to offset other compute requirements | Above + Hortonworks, Cloudera, MapR
5. Grid + HADOOP Storage & Compute + In-Memory | 2010-2014 | Big Data, real-time decisions | Above + SAS, SAP, SPSS; Amazon Web Services

Table 1: Five stages of maturity

The progression of maturity over the past 15-20 years has been punctuated by (at least) three important inflection points of technology improvements. The first came with the transition from legacy single-machine processing platforms to grid computing environments in the early 2000s. Significant processing gains generated by grid computing led to lower cost per processing unit.

The next inflection point came more recently with the emergence of open source Apache Hadoop, first as an alternative storage environment, and then as an offsetting compute environment. The emergence of Hadoop as a storage platform allowed firms to get past concerns about storage limits and begin imagining the possibilities of Big Data applied to their most difficult questions. Further, Hadoop brought large-scale parallel processing power to bear, allowing users to answer petabyte-scale questions quickly.2

The final inflection point has come with the introduction of in-memory computing, which is now bringing significant processing power gains to real-time analysis of Big Data. In-memory computing involves storing data in the main random access memory (RAM) of specialized servers instead of in complex relational databases running on relatively slow disk drives. In-memory computing is faster because it avoids the need to constantly pull data from a database to perform analysis. Current data visualization tools rely heavily on in-memory computing to deliver a dynamic, visual presentation of data.

How does this maturation of technology impact each of the categories of costs outlined above? Most estimates show total cost of acquisition and management over the past 15-20 years effectively being cut in half.3

Processing and storage improvements that lower the cost per unit of compute or storage have accounted for the bulk of the savings, but human capital cost savings have also been realized both in the line of business and in IT. Software costs have come down with improvements, but not as fast as hardware costs. Many firms now measure the value created by these cost reductions on a “cost per insight” basis. That is, as the cost of software drops, and the capabilities of the software progress, the cost per valuable insight created drops.
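The “cost per insight” measure can be sketched as a simple ratio. The short Python example below is only an illustration under stated assumptions: the function name and every figure in it are hypothetical and do not come from IIA data, and what counts as a “valuable insight” is left to each firm to define.

```python
def cost_per_insight(total_cost: float, insights: int) -> float:
    """Return total analytics spend divided by the number of valuable insights."""
    if insights <= 0:
        raise ValueError("insights must be a positive count")
    return total_cost / insights


# Hypothetical figures for illustration only; none come from the brief.
# Spend falls while more capable software yields more insights per year.
before = cost_per_insight(total_cost=1_000_000, insights=50)
after = cost_per_insight(total_cost=800_000, insights=80)

print(f"Cost per insight before: {before:,.0f}")
print(f"Cost per insight after:  {after:,.0f}")
```

In this made-up case the ratio falls for two reasons at once: the numerator (total cost) shrinks and the denominator (insights produced) grows.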

Figure 1 presents conceptual estimates for how the composition of these costs has changed at each phase of the maturity progression.

2 Apache Hadoop consists of three components: the Hadoop Distributed File System, or HDFS, which handles storage; MapReduce, the distributed processing infrastructure, which handles the work of running analyses on data; and Common, which is a set of shared infrastructure that both HDFS and MapReduce need. Adapted from interview with Mike Olson, Chief Executive Officer at Cloudera via ODMS Industry Watch.

3 See Intel Best Practices for Implementing Apache Hadoop Software, published by Intel IT.

Figure 1: Effect of adoption and modernization of commercial analytics infrastructure on TCO

In sum, the technology improvements shown across this maturity cycle have led to measurable gains in scalability, accuracy and governance, all at increasingly lower costs. Resilient and reliable storage and compute power is now a commodity.

Open Source Options for the Analytics Platform

The inevitable emergence of open source options for data storage, operating systems and analytics software over the past four-plus years has made for a bewildering set of choices for enterprises. Data and tools have become democratized, and it seems that any statistical enthusiast can stand up a Hadoop cluster in the cloud running open source analytics tools. Similarly, enterprises can now find an open source option, a fully commercial option, or a commercial distribution of open source for every element of their analytics platform.

On the hardware storage side, commercial distributions of Apache Hadoop such as Cloudera and Hortonworks are popular alternatives to traditional commercial choices like Oracle and Teradata. As an alternative to on premise hardware storage and processing capabilities, most firms actively use cloud-based solutions like Amazon Web Services.

On the operating system side, open source alternatives like SUSE Linux Enterprise and Red Hat Enterprise Linux are considered suitable replacements for traditional commercial operating systems (UNIX, Windows, Mac OS).

The analytics software market has seen an explosion of open source alternatives challenging the long-standing commercial solutions offered by SAS, IBM (SPSS), SAP and others. Probably most popular among data scientists and modelers is the open source R programming language.4

Open Source Cost Considerations

By definition, the most appealing feature of any open source technology is the very low acquisition cost, both in the form of actual expense and the convenience of delivery and installation.

As has been noted, most firms expect the majority of cost savings to come from the elimination of hardware/software licensing fees. While these cost savings are quantifiable, it is increasingly difficult to estimate what will happen to human capital costs, particularly among line of business users.

Through working with a variety of practitioner research clients over the past four years, IIA has observed that open source analytics software tools have initially proliferated among data scientists who pride themselves on being able to provision large data sets at will, and build models quickly. Line of business (LOB) users, however, often find these open source tools require a steep learning curve for practical use.

Figure 2 below introduces a new curve to Figure 1 showing a firm’s “expected” cost reductions that could be realized at each stage of the maturity progression.

4 R is an open source statistical language and software for data miners and data scientists. A Rexer Analytics survey reports that the number of data miners using R is growing; 70 percent of respondents reported they now use R. Survey link: http://www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html.

Figure 2: Expected cost reductions from open source adoption

Enterprise business and technical leaders who are charged with keeping track of the total investment being made in enterprise analytics and calculating a true return on that investment should consider costs through the following five filters to estimate a risk-adjusted TCO under an open source scenario.

Support Filter

Complete reliance on open source software can be a risky proposition for an enterprise due to the lack of adequate support. Community-based support options certainly exist for most open source tools, but the time consumed to “self-support” coupled with the risk of inaccurate information should be included in the TCO calculation.

Reliability/Resilience Filter

Commercial software tends to be tested and validated for accuracy, reliability and resilience. If analytics engines, models, and data sets take on mission-critical status for an enterprise, having robust, validated software is vital to ensuring availability, accuracy and consistency, which are important both to internal productivity and to service level agreements with end customers.

Algorithm validation is just one challenge that introduces significant risk into running open source solutions. Without the transparency of documented testing and validation results of embedded algorithms, users are left to either blindly trust the validity of embedded algorithms or conduct testing and validation on their own. The self-validation path can burn up unanticipated time and resources (see Hidden Labor Costs below) that are avoided with commercial solutions where well-documented testing and validation results are generally available.

Scalability Filter

Most open source software is not designed with scalability as a top priority. The scale limitations generally relate to increased data volumes and the transition from single users to enterprise-wide deployments. As data volumes and user counts grow, individual productivity may suffer as open source tools don’t keep up with user requirements. Therefore, forecasting in advance the additional time and expense that may be needed to expand both data volumes (if even possible) and user counts is an important consideration that should be factored in, upfront, during initial deployment of open source analytics software.

Hidden Labor Costs Filter

Specific to analytics and modeling software, hidden labor costs associated with open source solutions can be significant. Training time and associated expenses top the list, while lost productivity due to time spent searching for answers is a close second. To illustrate the impact of training time for LOB users, imagine it takes one week of an LOB user’s time to train themselves to use an open source analytics solution. For a firm with 100 users, that is 100 weeks of training time, equating to approximately two years of FTE productivity. Is that training cost justifiable compared to the expense of incumbent commercial software?
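The training-time arithmetic above can be written out as a short Python sketch. The 100 users and one week per user come from the example in the text; the figure of roughly 50 working weeks per FTE-year and the loaded annual cost per FTE are assumptions added here purely for illustration.

```python
WEEKS_PER_FTE_YEAR = 50  # assumption: roughly 50 working weeks in a year


def training_fte_years(users: int, weeks_per_user: float,
                       weeks_per_fte_year: float = WEEKS_PER_FTE_YEAR) -> float:
    """Convert total self-training time into full-time-equivalent years."""
    return users * weeks_per_user / weeks_per_fte_year


# 100 users and one week each, as in the example in the text.
fte_years = training_fte_years(users=100, weeks_per_user=1)

# Hypothetical loaded annual cost per FTE, for illustration only.
loaded_annual_cost = 120_000
training_cost = fte_years * loaded_annual_cost

print(f"Training time: {fte_years:.1f} FTE-years")
print(f"Approximate training cost: {training_cost:,.0f}")
```

Expressing the result as a cost makes it directly comparable with the license fee of an incumbent commercial tool, which is the comparison the question above asks for.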

Regulatory Risk Filter

Finally, if the financial services industry is any guide, firms in regulated industries will increasingly be held accountable for compliance with data quality and accuracy standards. In response to the recent banking crisis, government authorities will inevitably mandate documented governance programs and the use of tested, validated software to generate critical analytical models and algorithms that are at the heart of global financial markets.

Taken together, these five filters can be combined to generate a “risk-adjusted” calculation of the true cost of open source analytics software. Unknowingly, firms may find themselves with much higher total operating costs associated with open source solutions than expected, even higher than the baseline commercial option.

Without question, accurate measurement of risk-adjusted costs using the five filters cited above can be challenging. Lost productivity and regulatory risk, for example, are difficult to measure and usually are estimated in a cost model based on anecdotal input from users.

Figure 3 below presents this potential cost increase as the “Open Source Abyss.”

Figure 3: The Open Source Abyss

By abyss, we mean the difference between expected total costs (savings) using open source tools, and the actual costs accounting for unanticipated labor costs and adjusting for the risks outlined above. Firms may find that a complete accounting of costs reveals a cost structure not only higher than expected, but also higher than the predictable costs and reliability of commercial solutions.
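One way to make the five filters and the resulting abyss concrete is a small risk-weighted cost model. The Python sketch below is an assumption about how such a calculation might be set up, not IIA's method: each filter is given a hypothetical annual cost and a likelihood weighting, the weighted costs are added to the expected open source cost, and the abyss is the gap between the two. Every number is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FilterEstimate:
    name: str
    annual_cost: float  # estimated cost if the effort or risk materializes
    probability: float  # likelihood weighting between 0 and 1


def risk_adjusted_tco(expected_cost: float, filters: list[FilterEstimate]) -> float:
    """Add the probability-weighted cost of each filter to the expected cost."""
    return expected_cost + sum(f.annual_cost * f.probability for f in filters)


# All figures below are hypothetical and chosen only to illustrate the idea.
filters = [
    FilterEstimate("support", 100_000, 1.0),
    FilterEstimate("reliability/resilience", 200_000, 0.5),
    FilterEstimate("scalability", 100_000, 0.5),
    FilterEstimate("hidden labor costs", 200_000, 1.0),
    FilterEstimate("regulatory risk", 400_000, 0.25),
]

expected_open_source = 300_000  # what the firm expects to spend
commercial_baseline = 800_000   # predictable cost of the commercial option

actual_open_source = risk_adjusted_tco(expected_open_source, filters)
abyss = actual_open_source - expected_open_source

print(f"Risk-adjusted open source cost: {actual_open_source:,.0f}")
print(f"Open source abyss: {abyss:,.0f}")
print(f"Exceeds commercial baseline: {actual_open_source > commercial_baseline}")
```

With these made-up inputs the risk-adjusted open source cost ends up above the commercial baseline, which is the scenario the abyss describes. Different weightings could just as easily leave open source cheaper, so the inputs matter more than the formula.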

Conclusion: A Path Forward

The proliferation of an open source option for just about every facet of a firm’s analytics environment may seem appealing from a cost standpoint. While efficiency gains are possible, IIA recommends a process for determining when and for whom open source tools make sense, and where the support, resilience and scalability are worth the commercial license costs.

Further, to avoid unknowingly missing the mark from a cost and risk perspective, executives must now develop new methods of calculating total cost of ownership that take into account unintended increases in human capital costs in the form of training, management and downtime, algorithm validation, and regulatory risk.

This requires re-thinking the TCO components outlined above by including costs that traditionally were accounted for elsewhere or not accounted for at all, and adjusting each of the components with a risk weighting. For firms in highly regulated industries, calculating the risk-adjusted cost of open source solutions can have a dramatic effect on the “true cost” calculation.

Most firms will likely settle on a hybrid approach that utilizes open source tools for redundancy, or for non-mission critical activities.

About the Author

David Macdonald leads the Financial Services Business Division for SAS United States and is responsible for sales, pre-sales and consulting for all SAS Financial Services customers. Macdonald has expertise working with institutions focused on retail and wholesale banking, consumer credit, capital markets, investment banking, life and P&C insurance, and institutional investments. Macdonald also helps direct the strategy, development and delivery of SAS solutions for regulatory compliance and customer intelligence.
