E-guide

Hadoop Big Data Platforms Buyer’s Guide – part 2 Your expert guide to Hadoop big data platforms

In this e-guide

What to consider when evaluating Hadoop vendors

Four factors for comparing the top Hadoop distributions

What to consider when evaluating Hadoop vendors

David Loshin, Knowledge Integrity Inc.

Before you evaluate specific Hadoop software or subscriptions,

examine what features the vendor distributions provide and how

they match your big data management needs.

Apache Hadoop is at the heart of many big data environments, supporting

large-scale, data-intensive applications. Its variety of open source software

components and related tools for capturing, processing, managing and

analyzing data, and the low overall cost of Hadoop clusters, appeal to many

organizations. But, as this series has examined, the open source Hadoop

framework only offers so much, and companies that need more robust

performance and functionality capabilities as well as maintenance and support

are turning to commercial Hadoop vendors.

Because Hadoop is a technology that's managed via The Apache Software

Foundation's open source process, the sales model of Hadoop vendors differs

from that of proprietary software development companies. The Hadoop source

code is open, meaning that it's available to anyone who wants to access it, so

product offerings have to be differentiated by what the vendors provide beyond

the openly accessible functionality.

Once you've determined that your organization could benefit from a commercial

Hadoop big data distribution, the next step is to explore some value-added

supplements to the code base and key features offered by Hadoop vendors and

determine how these offerings match your needs.

What are the Hadoop distribution vendors really selling?

IT teams can download Hadoop from the Apache website and deploy it on a

hardware cluster themselves, without any vendor involvement. But Hadoop

vendors are aware that the self-starter approach isn't for everyone, so they

provide prebuilt Hadoop distributions that can be downloaded from their

websites -- typically in both a free community edition and an enterprise edition

that adds more features and requires the purchase of a license. But if these

vendors are providing users with a product, what are they really selling? In other

words, what do you actually get when you engage and pay a Hadoop software

vendor?

Vendors offering commercial versions of open source technologies, such as

those providing big data management systems based on Hadoop, follow an

alternative system and services model in which customers effectively subscribe

to the enterprise edition of the product. Benefits of subscribing to an enterprise

edition include:

Access to enterprise features. The subscription relationship enables

customers to access versions of Hadoop that have features and

optimizations that haven't been openly released to the open source

community.

Release from restrictions. In some situations, the freely downloadable

Hadoop distributions have been built with restrictions, such as a limit to the

number of nodes on which the system can be run or the amount of data

that can be managed. Buying an enterprise subscription lifts these

restrictions.

Responsive technical support. Enterprise subscriptions provide

availability of resources for support with 24/7 telephone access and

response times that can be guaranteed under service-level agreements,

depending on the level of support purchased.

Advanced training. While all website visitors may have access to some

training materials and videos, enterprise subscribers typically are entitled to

more advanced and extensive training sessions.

Access to deployment experts. Hadoop vendors have professional

services teams that are experienced in big data management deployments

and can help jump-start a customer's implementation.

Key considerations for comparing Hadoop distribution vendors

The enterprise editions of vendor Hadoop distributions all provide the core

components of the Hadoop ecosystem stack, which include the Hadoop

Distributed File System (HDFS), the MapReduce programming and execution

environment for batch processing, and the YARN job scheduler and cluster

resource manager. They also commonly incorporate various other open source

technologies, such as the Spark data processing engine and HBase database.

But different vendors may support different releases of all those technologies,

and newer or more specialized tools may not be universally supported. If your

organization is looking to use a particular technology as part of a Hadoop

deployment, you should ensure that the distributions you're considering support

it and, if so, which release they're currently on.
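
The release check described above can be sketched in a few lines of Python. This is only an illustration: the vendor names, component lists and version numbers below are placeholders, not data from any real distribution, and the helper names are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted release string such as '1.6.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def missing_requirements(required: Dict[str, str], offered: Dict[str, str]) -> List[str]:
    """List every required component a distribution lacks or ships too old."""
    gaps = []
    for component, minimum in required.items():
        shipped = offered.get(component)
        if shipped is None:
            gaps.append(f"{component}: not included")
        elif parse_version(shipped) < parse_version(minimum):
            gaps.append(f"{component}: ships {shipped}, need {minimum}+")
    return gaps


# Placeholder requirements and vendor catalogs (assumed values for illustration).
required = {"spark": "1.6.0", "hbase": "1.1.0"}
catalog = {
    "Vendor A": {"spark": "1.6.1", "hbase": "1.1.2"},
    "Vendor B": {"spark": "1.5.2"},
}

for vendor, offered in catalog.items():
    gaps = missing_requirements(required, offered)
    print(vendor, "meets requirements" if not gaps else gaps)
```

In practice the catalog would be filled in from each vendor's current release notes, since supported releases change with every distribution update.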

Beyond these typical components, you should also compare and contrast how

each vendor provides the following:

Access to enterprise-class features. Some Hadoop vendors offer additional

tools that aren't part of the open source distribution for system configuration,

system performance, ongoing monitoring and administration. While these may

add value to the enterprise distribution, recognize that integration with

proprietary components may lock the customer into that vendor's product.

Infrastructure deployment alternatives. Your organization may choose to

adopt different underlying infrastructure options, such as running on-premises,

in the cloud or in virtualized environments. Consider how the Hadoop

distribution alternatives are adaptable to these infrastructure choices.

Interoperability with other data management systems. In most cases, an

organization will have existing data warehousing, business intelligence and

analytics systems in place. Hadoop typically doesn't fully replace these systems,

but rather augments and complements them. So it's critical that the adopted

Hadoop environment enable access and data exchange with existing data

management platforms such as DB2, Oracle, SQL Server, Teradata and others.

Integration with end-user tools. End users will want to continue using their

favorite tools for business intelligence, reporting, visualization and analytics.

Assess how well the Hadoop big data management vendor's distribution

supports integration with the tools used in your organization.

Security and data protection. The Apache Hadoop ecosystem is still maturing,

which means that not all of its components may meet enterprise expectations

for data security and protection. Many Hadoop vendors provide security

features as add-ons.

Support options. Consider what your support requirements are in terms of

availability and response times. Vendors offer different plans for support

availability as well as response windows.

Indemnification from litigation from use of open source technology. This

increasingly important concept ensures that vendors of open source

technologies protect their users from potential liabilities related to the use of the

product.

Optimized performance. Enterprise distributions may be augmented with

performance optimizations that enhance scalability and extensibility.

One additional consideration when comparing Hadoop distribution vendor

offerings relates to the approach that vendors are taking toward compatibility

within the open source community and interoperability between product

offerings from different companies. Ideally, this means ensuring that Hadoop

distributions will remain compatible with the open source versions of Hadoop

and other Apache technologies, even as vendors make code changes and

develop proprietary add-ons. That could help prevent vendor and version lock-

in, in which an organization becomes bound to a particular distribution of

Hadoop.

However, there's a lack of unanimity among Hadoop vendors on how best to

enable interoperability. Several have formed a group called the Open Data

Platform Initiative, set up within the Linux Foundation open source consortium,

to develop a common set of interoperability standards for Hadoop. But other

vendors have declined to join the group, saying that compatibility and

interoperability issues are already being sufficiently addressed within Apache.

Assuring alignment with the open source distribution as a standard is certainly

desirable in that it allows Hadoop users to maintain some flexibility in their

choice of vendors.

Prior to engaging vendors, it's also important to assess what types of

applications your company plans to develop and run using the Hadoop

ecosystem, and the required capabilities. Then determine which of these are

provided by the community open source versions of Hadoop and other

technologies and which require additional functions only provided by a specific

Hadoop software vendor.

Weighing all of these factors will help prepare your organization to move

forward and evaluate the available options. In our next article, we will assess

the similarities and differences between the leading Hadoop distributions.
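
One common way to weigh the considerations listed in this article is a weighted scoring matrix. The sketch below assumes a 1-to-5 score per criterion and team-chosen weights; the criteria names follow the list above, but the weights, scores and vendor names are invented placeholders rather than an assessment of any real product.

```python
from typing import Dict


def weighted_score(weights: Dict[str, float], scores: Dict[str, float]) -> float:
    """Return the weight-normalized average score; missing criteria count as 0."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[c] * scores.get(c, 0.0) for c in weights) / total_weight


# Assumed weights: how much each consideration matters to this organization.
weights = {
    "enterprise_features": 3,
    "deployment_options": 2,
    "interoperability": 2,
    "security": 3,
    "support": 2,
}

# Placeholder scores on a 1-5 scale, as an evaluation team might record them.
candidates = {
    "Vendor A": {"enterprise_features": 3, "deployment_options": 4,
                 "interoperability": 4, "security": 5, "support": 4},
    "Vendor B": {"enterprise_features": 4, "deployment_options": 5,
                 "interoperability": 3, "security": 2, "support": 3},
}

ranking = sorted(candidates,
                 key=lambda name: weighted_score(weights, candidates[name]),
                 reverse=True)
for name in ranking:
    print(f"{name}: {weighted_score(weights, candidates[name]):.2f}")
```

The value of the exercise lies less in the final number than in forcing agreement on the weights before any vendor is engaged.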

Four factors for comparing the top Hadoop distributions

David Loshin, Knowledge Integrity Inc.

By examining the key characteristics presented here -- along with

the top Hadoop distributions -- you can determine which subscription

is right for your organization.

Although the software components that constitute the Hadoop ecosystem stack

are open source technologies, there are numerous benefits to paying a vendor

for a subscription to use its commercial Hadoop platform. For example, a

subscription provides technical support and training, as well as access to

enterprise features not available to the open source community. While the

enterprise editions of vendor Hadoop distributions all provide the core

components of the Hadoop ecosystem stack, the key differentiators are what

these vendors offer beyond the openly accessible functionality.

Recent changes in the market have thinned the ranks of Hadoop vendors. Just

this month, for example, Pivotal Software pulled the plug on its own Hadoop

distribution and said it would start reselling Hortonworks' instead. But there's still

a diverse group of suppliers to consider, including independent Hadoop

specialists, cloud providers and two of the largest IT vendors.

To help you determine which Hadoop provider is right for your organization, this

article distinguishes the top Hadoop distributions based on several key

characteristics; these include deployment models, enterprise-class features,

security and data protection features, and support services.

Note that while the Hadoop big data management ecosystem is engineered to

support scalable data storage and high-performance distributed computing, your

actual performance may vary for several reasons, including the software

implementation. But many performance issues are dependent on the planned

applications themselves. To address this, we'll further examine how the Hadoop

product distributions are targeted to meet the business needs of user

organizations.

1. Hadoop deployment models

Most of the Hadoop vendors support a mix of deployment methods, but Hadoop

offerings from Microsoft and Amazon Web Services are deployed solely in cloud

environments. Microsoft leverages its Azure cloud infrastructure for HDInsight, a

managed service based on the Hortonworks Data Platform (HDP) -- the same

Hadoop distribution that Pivotal is now reselling. AWS uses its Amazon Elastic

Compute Cloud (EC2) platform and S3 data store to underpin Amazon Elastic

MapReduce (EMR), which bundles its Hadoop distribution with various other

tools and technologies. In addition, Amazon EMR provides the option of using

MapR's Hadoop distribution instead of the Amazon one.

The cloud deployment model provides a rapid yet low-effort means of

provisioning a Hadoop cluster, and both Microsoft and AWS enable users to

resize their environments on demand to handle dynamic computing and storage

capacity needs. This elasticity is desirable for organizations with computational

and storage needs that may vary over time.

While the other major Hadoop vendors -- Cloudera, Hortonworks, IBM and

MapR -- all offer cloud-based deployments, they aren't limited to that model.

They allow users to download distributions that can be deployed on-premises or

in private clouds on a variety of servers, including Linux and Windows systems.

In addition, Cloudera and MapR also provide sandbox versions that can be run

in a virtual environment such as VMware.

The bottom line: Consider whether your organization prefers to manage its big

data environment in-house or use a hosted service. In-house management

implies oversight and maintenance of the software environment and continuous

monitoring of the system, whether that environment is a physical platform on

premises or housed using a cloud-based service. The on-premises option may

be preferable if you have experienced staff and know the proper system sizing

characteristics, or if security concerns warrant managing the system behind a

trusted firewall.

The alternative is to use a vendor with a hosted services platform that will help

configure, launch, manage and monitor your operations. This may be preferable

if you aren't sure what size system you will need or expect that the system size

will grow based on increasing demand. The benefit of working with a cloud or

hosted service is that it will provide the necessary elasticity for both storage and

processing resources.
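
For teams leaning toward an on-premises deployment, a back-of-the-envelope sizing estimate helps show whether the "proper system sizing characteristics" are actually known. The sketch below is a rough storage-only calculation under stated assumptions: three-way replication (the HDFS default), 24 TB of raw disk per node and 70% of that disk usable for data. The disk size, usable fraction and growth rate are illustrative guesses, and the function ignores CPU, memory and workload characteristics entirely.

```python
import math


def estimate_data_nodes(raw_tb: float,
                        replication: int = 3,
                        disk_tb_per_node: float = 24.0,
                        usable_fraction: float = 0.7,
                        annual_growth: float = 0.0,
                        years: int = 0) -> int:
    """Estimate how many storage nodes are needed to hold the data set."""
    if raw_tb < 0 or replication < 1 or disk_tb_per_node <= 0:
        raise ValueError("sizes must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    # Project the data volume forward, then account for replicated copies.
    projected_tb = raw_tb * (1 + annual_growth) ** years
    stored_tb = projected_tb * replication
    usable_per_node = disk_tb_per_node * usable_fraction
    return max(1, math.ceil(stored_tb / usable_per_node))


print(estimate_data_nodes(100))                              # today's 100 TB
print(estimate_data_nodes(100, annual_growth=0.5, years=2))  # after two years of 50% growth
```

If the two numbers differ widely, or the growth rate is little more than a guess, that uncertainty is itself an argument for the elastic capacity of a cloud or hosted service.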

2. Enterprise-class features of the top Hadoop distributions

There are some notable differences in the development approaches of the three

independent Hadoop vendors. Cloudera often augments the Hadoop core with

internally developed add-on technologies -- for example, its Impala SQL-on-

Hadoop query engine; Cloudera Manager administration tools; and Kudu, an

alternative data store to the Hadoop Distributed File System (HDFS) for use in

real-time analytics applications. Typically, the company now open sources such

technologies after doing the initial development work itself. Hortonworks, on the

other hand, says that it's "innovating 100% of its software in the Apache

Hadoop community, and there are no proprietary extensions." Add-on

technologies that it's the driving force behind, such as the Ambari provisioning

and management software, are launched as open source projects from the

outset. In addition, Hortonworks has banded together with IBM and other

companies to form the Open Data Platform Initiative (ODPi), an organization

devoted to creating a common set of core technical specifications for Hadoop

platforms. ODPi members claim that this will improve interoperability and minimize

vendor lock-in.

MapR has taken a third path by developing its own file system, MapR-FS,

instead of using HDFS, as well as its own NoSQL database, MapR-DB, and

other foundational technologies in an effort to support deployments of large

clusters with enterprise-class performance needs. MapR also is increasingly

focusing on real-time and stream processing applications. In late 2015, the

company rebranded its product as the MapR Converged Data Platform, which

combines Hadoop and the MapR file system and database with the Apache

Spark processing engine and a new event streaming technology called MapR

Streams in order to handle both batch and real-time jobs.

From a features standpoint, the enterprise version of the Cloudera CDH

distribution provides tools for operational management and reporting and for

supporting business continuity. This includes such items as configuration history

and rollbacks, rolling updates and service restarts, and automated disaster

recovery. MapR's enterprise offering provides tools to better manage and

ensure the resiliency and reliability of data in Hadoop clusters, as well as multi-

tenancy and high availability capabilities. Hortonworks provides proactive

monitoring and maintenance with its HDP support subscriptions.

IBM, meanwhile, has adopted an analytics-oriented strategy on its BigInsights

for Apache Hadoop distribution, in keeping with its broader focus on selling

business intelligence and advanced analytics tools. IBM offers different value-

add modules with enterprise-grade features as part of BigInsights, including

separate Analyst and Data Scientist modules. Its Analyst module provides Big

SQL for federated SQL access to Hadoop and other data sources. BigSheets,

which is part of the Analyst module, allows users to explore, transform and

perform visualizations on large data sets stored in Hadoop, using an intuitive

spreadsheet-like interface. The BigInsights Data Scientist Module includes a

version of the R language, text analytics and a machine learning library called

SystemML that has been contributed to the open source community.

While its cloud platform is the primary calling card for Amazon EMR, AWS also

offers tools for monitoring and managing clusters and enabling application and

cluster interoperability as part of the Hadoop service.

Amazon EMR collects metrics that are used to track progress and measure the

health of a cluster. Cluster health metrics can be accessed through the

command line interface, software developer kits or APIs and can be viewed

through the EMR management console. Additionally, Amazon's CloudWatch

monitoring service can be used along with its implementation of the open source

Ganglia performance monitoring component to check the cluster and set alarms

on events triggered by these metrics.
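
As an illustration of setting an alarm on an EMR cluster metric, the sketch below builds the parameters for a CloudWatch alarm on HDFS utilization. It only constructs a dictionary and makes no AWS call. The cluster ID, SNS topic ARN, threshold and evaluation settings are placeholders, and the metric and dimension names (`HDFSUtilization`, `JobFlowId` in the `AWS/ElasticMapReduce` namespace) should be checked against the current EMR documentation before use.

```python
from typing import Any, Dict


def hdfs_utilization_alarm(cluster_id: str,
                           threshold_percent: float,
                           sns_topic_arn: str) -> Dict[str, Any]:
    """Build keyword arguments for a CloudWatch alarm on an EMR cluster's HDFS usage."""
    if not 0 < threshold_percent <= 100:
        raise ValueError("threshold_percent must be in (0, 100]")
    return {
        "AlarmName": f"emr-{cluster_id}-hdfs-utilization",
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "HDFSUtilization",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 300,              # seconds per data point (assumed)
        "EvaluationPeriods": 2,     # consecutive breaches before alarming (assumed)
        "Threshold": float(threshold_percent),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Placeholder identifiers; substitute real values from your own account.
params = hdfs_utilization_alarm(
    cluster_id="j-EXAMPLE12345",
    threshold_percent=80,
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:example-topic",
)
print(params["AlarmName"])
```

With the boto3 SDK installed and credentials configured, a dictionary like this could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.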

The bottom line: Choosing a vendor that provides value-add components as

part of its enterprise subscription may mean committing to a long-term

relationship -- especially if these components are tightly integrated with its

standard stack distribution. If you're concerned about vendor lock-in, consider

those vendors that are participating in the ODPi.

3. Security and protection offerings from the Hadoop vendors

Despite the expanding use of open source software for enterprise-class

applications, there remain suspicions about its suitability for production use from

a security and protection perspective. Several Hadoop vendors have taken

steps to alleviate some of this anxiety.

For example, Hortonworks has teamed up with other vendors and customers to

launch a Data Governance Initiative for Hadoop, with an initial focus on a new

Apache project called Atlas for managing shared metadata, data classification,

auditing, and security and policy management for data protection. It's also

working to integrate Atlas with Ranger, an open source security tool for

enforcing data access policies. Cloudera provides tools that enable users to

manage data security and governance for the CDH platform, supporting an

organization's need to meet compliance and regulatory requirements.

In addition, Hortonworks, Cloudera, MapR and IBM all provide data encryption.

Both Hortonworks and Cloudera support encryption of data at rest. MapR

provides encryption of data transmitted to, from and within a cluster. IBM offers

the product InfoSphere Guardium, which enforces data privacy as well as

provides encryption and masking of confidential data.

The bottom line: The Hadoop vendors provide different approaches to

authentication, role-based access control, security policy management and data

encryption. Carefully specify your security and protection requirements and

review how each vendor addresses those needs.

4. Support subscriptions for the top Hadoop distributions

The fundamental value proposition for the open source software model is the

bundling and simplification of system deployment with support and services.

One alternative for deploying Hadoop involves downloading the source code for

each component from the open source repository and then building and

integrating all the parts together. This takes both skill and effort, and is likely to

be an iterative process. Open source vendors have already done the heavy

lifting, providing preconfigured distributions and maintaining an up-to-date

integrated stack.

What differentiates the vendors to a large degree is their support models.

Hortonworks provides several models, ranging from its Jumpstart edition with

Web-based support during business hours and one-day response time to its

Enterprise edition with 24/7 support and much shorter response times

depending on the severity of the issue. Cloudera offers a support subscription

with one-hour and 24/7 support options for enterprise license holders. It also

offers premium support for organizations with the Flex or Data Hub edition

licenses that include a 15-minute response time for critical issues.

All AWS accounts include basic support, which provides 24/7 customer service,

access to community forums and documentation, as well as access to the AWS

Trusted Advisor application. Developer support includes one-hour response for

severe issues -- with 12- or 24-hour response times for most issues. Business-

level support provides 24/7 email access to cloud support engineers as well as

shortened response times based on severity. Enterprise-level support adds a

response time of less than 15 minutes for critical issues as well as a dedicated technical

account manager, plus additional launch and operation support benefits.

MapR offers a Premium support service that adds Web and email support,

custom portal, training, urgent bug fixes, follow-the-sun support and 24/7 phone

support for priority issues. The company's Premium+ Support adds priority

queuing of tickets and single point of contact support, and offers options for

onsite or remote dedicated support. IBM provides support for organizations that

purchase the licensed components -- also referred to as their value-add

modules -- that extend their Open Platform with Apache Hadoop.

The bottom line: If support services are the source of added value from the

vendor, the costs for the different support subscriptions should be aligned with

customer expectations. Subscriptions providing one-hour or even 15-minute

response times on a 24/7 basis with dedicated support staff will cost a lot more

than 24-hour response time from a Web-based interface during business hours.
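
The trade-off between response time and cost can be made explicit by picking the cheapest tier that still meets your requirements. The sketch below is a hypothetical model: the tier names and annual costs are invented placeholders, not vendor pricing, and real plans differ in many more dimensions than the two shown here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SupportTier:
    name: str
    response_minutes: int    # committed response time for critical issues
    round_the_clock: bool    # True if support is available 24/7
    annual_cost: float       # placeholder figure, not real pricing


def cheapest_adequate_tier(tiers: List[SupportTier],
                           max_response_minutes: int,
                           need_24x7: bool) -> Optional[SupportTier]:
    """Return the lowest-cost tier meeting the response and coverage needs."""
    adequate = [
        t for t in tiers
        if t.response_minutes <= max_response_minutes
        and (t.round_the_clock or not need_24x7)
    ]
    return min(adequate, key=lambda t: t.annual_cost) if adequate else None


# Hypothetical tiers for illustration only.
tiers = [
    SupportTier("Basic", 24 * 60, False, 10_000.0),
    SupportTier("Standard", 60, True, 40_000.0),
    SupportTier("Premium", 15, True, 90_000.0),
]

choice = cheapest_adequate_tier(tiers, max_response_minutes=60, need_24x7=True)
print(choice.name if choice else "no tier meets the requirement")
```

Writing down the required response time before comparing plans keeps the evaluation from drifting toward the most expensive option by default.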

Hadoop has transformed the business intelligence and analytics industry during

the past 10 years. But, as we've examined, the open source Hadoop framework

offers only so much, and companies that need more robust performance and

functionality capabilities as well as maintenance and support are turning to

commercial Hadoop software distributions. Hopefully, this information will help

you make a more informed choice when purchasing a Hadoop distribution.

About the author

David Loshin, managing director at Decisionworx, is a recognized thought

leader, speaker and expert consultant. He has written numerous books,

including Big Data Analytics: From Strategic Planning to Enterprise Integration

with Tools, Techniques, NoSQL and Graph. He can be reached through his

website, at www.decisionworx.com.

Follow us on Twitter: @BizAnalyticsTT.