
The Enterprise Use of Hadoop (v1)

Internet Research Group
November 2011

© 2011 Internet Research Group – all rights reserved

About The Internet Research Group

www.irg-intl.com

The Internet Research Group (IRG) provides market research and market strategy services to product and service vendors. IRG services combine the formidable and unique experience and perspective of the two principals: John Katsaros and Peter Christy, each an experienced industry veteran. The overarching mission of IRG is to help clients make faster and better decisions about product strategy, market entry, and market development. Katsaros and Christy published a book on high-tech business strategy, Getting It Right the First Time (Praeger, 2005; www.gettingitrightthefirsttime.com).


Table of Contents

1. Overview
2. Background
3. What Is Hadoop?
4. Why Is Embedded Processing So Important?
5. MapReduce Analytics
6. What is “Big” Data?
7. The Major Components of Hadoop
8. The Hadoop Application Ecology
9. Cloud Economics
10. Why Is Hadoop So Interesting?
11. What Are the Interesting Sources of Big Data?
12. How Important Is Big Data Analytics?
13. Things You Don’t Want to Do with Hadoop
14. Horizontal Hadoop Applications
15. Summary


1. Overview

The last decade has seen amazing, continuing progress in computer technology, systems and implementations, as evidenced by some of the remarkable Web and Internet systems that have been constructed, such as Google and Facebook. Although most enterprise CIOs yearn to take advantage of the performance and cost efficiencies that these pioneering Web systems deliver, the enterprise path to Cloud computing is intrinsically complex because of the need to bring forward existing applications and evolve organizational structure and skill sets, so achieving those economies will take some time.

Hadoop, an Apache Foundation Open Source project, represents a way for enterprise IT to take advantage of Cloud and Internet capabilities sooner when it comes to the storage and processing of huge (by enterprise IT standards) amounts of data. Hadoop provides a means of implementing storage systems with Internet economics and doing large-scale processing on that data. It is not a general replacement for existing enterprise data management and analysis systems, but for many companies it is an attractive complement to those systems, as well as a way of making use of the large-volume data sets that are increasingly available. The Yahoo! Hadoop team argues that in five years 50% of enterprise data will be stored in Hadoop – they might well be right.

2. Background

The last decade has been remarkable for the advances in computer technology and systems:

- There has been continuing, relentless “Moore’s Law” progress in semiconductor technology (CPUs, DRAM and now SSD).
- There has been even faster progress in disk price/performance improvement.
- Google demonstrated the remarkable performance and cost-effectiveness that could be achieved using mega-scale systems built from commodity technology, as well as pioneering the application and operational adaptations needed to take advantage of such systems.

The compounded impact of these improvements is seen most dramatically in various Cloud offerings (starting with Google or Amazon Web Services), where the cost of storage or computation is dramatically (orders of magnitude) cheaper than in typical enterprise computing.


Hadoop presents an opportunity for enterprises to take advantage of Cloud economics immediately, especially in terms of storage, as we will sketch below.

3. What Is Hadoop?

Hadoop builds on a massive file system (the Google File System, or GFS) and a parallel application model (MapReduce), originally developed at Google. Google has an unbelievable number of servers compared to typical large enterprises (in all likelihood more than a million). Search is a relatively easy task to parallelize: many search requests can be run in parallel because they only have to be loosely synchronized (the same search done at the same time doesn’t have to get exactly the same response). GFS was developed as a file system for applications running at this scale. MapReduce was developed as a means of performing data analysis using these resources.

Hadoop is an Open Source reimplementation of GFS and MapReduce. Google’s systems run a unique and proprietary software “stack,” so no one else could run Google’s MapReduce even if Google permitted it. Hadoop is designed to run on a conventional Linux stack. Google has encouraged the development of Hadoop, recognizing the value in a broader population of people trained in the methodology and tools. Much of the development of Hadoop has been driven by Yahoo!. Yahoo! is also a large Hadoop user, internally running more than 40,000 servers in Hadoop clusters.

Operationally we talk about a Hadoop “cluster”: a set of servers dedicated to a particular instance of Hadoop, which may range from just a few servers to the clusters of more than 4,000 servers in use at Yahoo!.

Today a typical Hadoop server might have two sockets with a total of 8 cores (two 4-core processors), 48 GB of DRAM, and 8-16 directly attached disks, typically cost-per-byte optimized (e.g., 2 or 3 TB 3.5” SATA drives). When implemented with high-volume commodity technology, the majority of the server cost is the disk drive complement, and each server will have 20-50 TB of storage.


4. Why Is Embedded Processing So Important?

A useful way of thinking about a Hadoop cluster is as a very high-capacity storage system built with “Cloud” economics (using inexpensive, high-capacity drives), with substantial, general-purpose, embedded processing power. The importance of having local processing capability becomes clear as soon as you realize that even when using the fastest LAN links (10 Gbits/sec), it takes 40 minutes to transfer the contents of a single 3 TB disk drive. Big data sets may be remarkably inexpensive to store, but they aren’t easy to move around, even within a data center using high-speed network connections.[1]
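The arithmetic behind that figure is simple: a 3 TB drive holds roughly 24,000 Gbits of data, and 24,000 Gbits divided by 10 Gbits/sec is 2,400 seconds, or about 40 minutes, just to read or transmit the contents of one drive.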

In the past we brought the data to the program: we ran a program on a server, opened a file on a network-based storage system, brought the file to the server, processed the data, and then probably wrote new data back out to the storage system.[2] With Hadoop, this is reversed, reflecting the fact that it’s much easier to move the program to the data than the data to the program. Modern servers and large-capacity disks enable affordable storage systems of enormous capacity, but you have to process the data in place when possible; you can’t move it.

Some “Cloud” storage applications require only infrequent access to the stored data. Almost all the activity in a Cloud-based backup service is writing the protected data to the disks; reading the stored data is done only infrequently (although being able to read a backup file when needed is the key value proposition). The same is true, to an only slightly lesser degree, when pictures, videos or music are stored in the Cloud. Only a small percentage of that data is ever accessed, and that small fraction can be (and is) cached on higher-performance, more expensive storage. Analysis is very different: data will be processed repeatedly as it is used to answer diverse questions. PC backup or picture storage are write-once/read-never applications; analysis is write-once/read-many.

[1] A modern SATA drive can transfer data between the disk and server at a sustained rate of about 1 Gbit/second. On a 12-disk node, the aggregate read rate could be up to about 10 Gbits/second. On a 50-node cluster the total aggregate read rate could approach 500 Gbits/second.

[2] A 10 MB file (100 Mbits) can be transmitted in about 0.1 second over a Gbit/second link.


5. MapReduce Analytics

The use of Hadoop has created a lot of interest in large-scale analytics (the MapReduce part of Hadoop). This kind of “divide and conquer” algorithmic methodology has been used for numerical analysis for many years as a way of dealing with problems that were known to be bigger than the biggest machine available. MapReduce is an elegant way of structuring this kind of algorithm: it isolates the analyst/programmer from the specific details of managing the pieces of work that get distributed to the available machines, and it provides an application architecture that doesn’t depend on any specific structuring of the data.

As Hadoop evolves, the basic ideas will be adapted to more computer system architectures than just the commodity scale-out systems used by the mega Web properties like Google and Yahoo!. A MapReduce computation cluster could also be used with data stored in a high-performance, high-bandwidth storage subsystem, which would make a lot of sense if the data was already stored there for other reasons. We expect many such variants of the original architecture to emerge over time.
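To make the programming model concrete, below is a minimal word-count job, the customary introductory MapReduce example, written against the standard Hadoop Java MapReduce API (the input and output HDFS paths are supplied on the command line; the class names are our own). The map function emits a (word, 1) pair for every word in its input split; the framework groups the pairs by word, and the reduce function sums the counts. Nothing in the code depends on how many nodes the cluster has.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each line of input, emit (word, 1) for every word on the line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: the framework delivers all counts for a given word together; sum them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The same jar runs unchanged on a 5-node or a 500-node cluster; only the cluster configuration differs.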

6. What is “Big” Data?

Google and Yahoo! use MapReduce for purposes that are unique to extremely large-scale systems (e.g., search optimization, ad delivery optimization). That fact notwithstanding, almost all companies have important sources of big data. For example:

- World-wide markets: The Internet enables any company, large or small, to interact with the billions of people world-wide who are connected. Modern logistics services such as UPS, FedEx and USPS let any company sell to global markets. A successful company has to think in terms of millions of people and build business systems capable of running at that scale. That’s big data.
- Machine-generated data: IT infrastructure (the stuff that all modern companies run on) comprises thousands of devices (PCs and mobile devices, servers, storage, network and security devices), all of which are capable of generating a stream of log data summarizing normal and abnormal activity. In aggregate this stream is a rich source of business-process, operational, security and regulatory-compliance analysis. That’s big data.

We’ll talk more later about how big data will impact enterprises over time.


7. The Major Components of Hadoop

The core of the Hadoop Open Source project is HDFS (the Hadoop Distributed File System) and MapReduce: reimplementations of the Google File System and of the MapReduce model as defined in the public documents Google has published. HDFS is the basic file storage, capable of storing a large number of large files. MapReduce is the programming model by which data is analyzed using the processing resources within the cluster.

HDFS has these goals (a short sketch of the file system API follows this list):

- Build very large data management systems from commodity parts, where component failure has to be assumed and dealt with as part of the basic design of the data system (in contrast to most enterprise storage, where great attention is paid to making the components reliable).
- A file system capable of storing files that are huge by historical standards (many files larger than 1 GB).
- A file system optimized on the assumption that files typically change by data being appended to them (e.g., additions to a log file) rather than by the modification of internal pieces of the file.
- A system where the file system APIs reflect the needs of these new applications.
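To give a flavor of the file system side, here is a minimal sketch that uses the Hadoop Java FileSystem API to copy a local log file into HDFS and list the result. The local and HDFS paths are hypothetical, and the cluster address is assumed to come from the standard Hadoop configuration files on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIngestSketch {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster address from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Copy a local log file into HDFS; the file is stored as large blocks,
    // each replicated across several nodes in the cluster.
    Path localLog = new Path("/var/log/webserver/access.log");  // hypothetical local path
    Path hdfsDir = new Path("/data/raw/");                      // hypothetical HDFS directory
    fs.copyFromLocalFile(localLog, hdfsDir);

    // List what is now stored under the target directory.
    for (FileStatus status : fs.listStatus(hdfsDir)) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
  }
}
```

Once the file is in HDFS it is available to every MapReduce task in the cluster, without any further copying.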

The motivation for MapReduce is more complicated. Today’s world of commodity servers and inexpensive disk drives is completely different from yesterday’s world of enterprise IT. Historically, analytics ran on expensive, high-end servers and used expensive, enterprise-class disk drives. Buying a new database server is a big decision and comes with software licensing costs, as well as incremental operational needs (e.g., a database administrator). In the Hadoop world, adding more nodes isn’t a major capital expense (< $10K per server) and doesn’t trigger new software licenses or additional administrators.

MapReduce was designed for such an environment, where adding more hardware is a perfectly reasonable approach to problem solving: progress is more easily made by adding hardware than by thinking hard about the problem and carefully crafting an optimized solution. MapReduce allows the scale of the solution to grow with minimal need for the analyst or programmer to adapt the program. The MapReduce infrastructure distributes the work among the available processors (the application programmer shouldn’t have to worry about how big the actual cluster is), monitors progress, restarts work that stalls or fails, and balances the work among the available nodes.

Using MapReduce is by no means simple, nor something that many business analysts would ever want to do directly (or be able to do, for that matter). Google has required all of its summer college interns to develop a MapReduce application; even though they are all excellent programmers with the benefit of experienced colleagues, they still found it difficult to do. Google has supported the Hadoop effort in part so that it could be used in education to train more knowledgeable individuals. This isn’t a reason why the impact of MapReduce will be limited, however; it’s the motivation for a software ecology built on top of HDFS and MapReduce that makes the capability usable by a broader population.

8. The Hadoop Application Ecology

It is useful to think of Hadoop as a platform, like Windows or Linux. Although Hadoop was developed based on the specific Google application model, the interest in Hadoop has spawned the creation of a set of related programs. The Apache Open Source project includes these:

- HBase – the Hadoop database
- Pig – a high-level language for data analysis programs
- Hive – a data warehouse system
- Mahout – a set of machine learning tools

There is other software that can be licensed for use with Hadoop, including:

- MapR – an alternative storage system
- Cloudera – management tools

Various database and BI vendors offer software for use with Hadoop, including connectors that make it easy to control an attached Hadoop system and import the output of Hadoop processing. Similarly, the “ETL” vendors offer connectors so that Hadoop can be a source (or sink) of data in that process.
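As one example of how this ecology broadens access, Hive lets an analyst query Hadoop-resident data with a SQL-like language instead of writing MapReduce directly; Hive compiles the query into MapReduce jobs behind the scenes. The sketch below issues a HiveQL query from Java through Hive’s JDBC driver. The driver class name, connection URL and the clickstream table are illustrative assumptions that depend on the Hive version and installation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // Driver class and URL are assumptions; they vary with the Hive release
    // and with the host/port of the installation.
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection conn =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");

    // Count page views per URL from a hypothetical clickstream table stored in HDFS.
    // Hive turns this query into MapReduce jobs that run on the cluster.
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery(
        "SELECT page_url, COUNT(*) AS views FROM clickstream GROUP BY page_url");
    while (rs.next()) {
      System.out.println(rs.getString("page_url") + "\t" + rs.getLong("views"));
    }
    rs.close();
    stmt.close();
    conn.close();
  }
}
```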

9. Cloud Economics

Now that we have introduced Hadoop and HDFS, we can explain in more detail what we mean by “Cloud economics.” If you walked into any modern large-scale data center (Google, Yahoo!, Facebook, Microsoft) you would see something that looks very different from an enterprise data center. The enterprise data center would be filled with top-of-the-line (“enterprise-class”) systems; the Web data center would be filled with something looking more like what you would find in a thrift shop: inexpensive “white box” servers and storage. As the cost of the hardware continues to decline, lots of other aspects of IT have to evolve as well (e.g., software licensing fees, operational costs) if the value of the hardware is to be exploited. The basic system and application design have to evolve as well.

Perhaps most importantly, Google recognized that in large-scale computing, failure and reliability had to be reconsidered. In large-scale systems, failure is the rule rather than the exception (with millions of disk drives, disk drive failure is ongoing). In large-scale systems, it makes more sense to achieve reliability and availability in the higher-level system (e.g., HDFS) and application (e.g., MapReduce) layers, not by using “enterprise-class” subsystems (e.g., RAID disk systems). HDFS is a very reliable data storage subsystem because the file data is replicated and distributed. MapReduce anticipates that individual tasks will fail on an ongoing basis (because of some combination of software and hardware failure) and manages the redistribution of work so that the overall job is completed in a timely manner.

Consider how this plays out with storage. In the enterprise data center, the data would likely be stored on a shared SAN (storage area network) system. Because this SAN system holds key data for multiple important applications, the performance, reliability and availability of the SAN system are critical:

- Redundant disks would be included and the data spread among multiple disks so that the loss of one or more of the disks wouldn’t result in the loss or unavailability of the data.
- Critical elements (the controller, SAN switches and links, power supplies, host adaptors) would all be replicated for availability.
- Because the SAN system supports multiple applications concurrently, performance is critical, so the fastest (and most expensive) disks would be used, with the fastest (and most expensive) connection to the controller. The controller would include substantial RAM memory for caching.

In contrast, a Hadoop cluster of 50 nodes has 500-1,000 high-capacity, low-cost disk drives:

- The disks are selected to be cost optimized – lowest cost per byte stored, least expensive attachment directly to a server (no storage network, no Fibre Channel attachment).
- The design has no redundancy at the disk level (no RAID configurations, for example). The HDFS file system assumes that disk failures are an ongoing issue and achieves high-availability data storage despite that.

Cloud economics of storage means cost-effective drives directly connected to a commodity server with the least expensive connection. In a typical Hadoop node, 70% of the cost of the node is the cost of the disk drives, and the disk drives are the most cost-effective possible. It can’t get any cheaper than that! A Hadoop cluster is a large data store built in the most cost-effective way possible.
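As a small sketch of how that reliability model is exposed, the replication factor in HDFS is a property of each file rather than of a RAID controller; it can be set through configuration or per file via the FileSystem API (the values and path below are illustrative).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Number of copies HDFS keeps of each block of a new file
    // (three is the usual default); no RAID hardware is involved.
    conf.setInt("dfs.replication", 3);
    FileSystem fs = FileSystem.get(conf);

    // Replication can also be changed per file after the fact, e.g. to keep an
    // extra copy of a heavily read data set (the path is hypothetical).
    fs.setReplication(new Path("/data/raw/access.log"), (short) 4);
  }
}
```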


10. Why Is Hadoop So Interesting?

As we noted earlier, big data is relevant to essentially all businesses, because of Internet markets and machine-generated log data if for no other reason. For dealing with big data, Hadoop is unquestionably a game changer:

- It enables the purchase and operation of very large-scale data systems at a much lower cost because it uses cost-optimized, commodity components. Adding 500 TB of Hadoop storage is clearly affordable; adding 500 TB to a conventional database system is often not.
- Hadoop is designed to move programs to data rather than the inverse. This basic paradigm change is required to deal with modern, high-volume disk drives.
- Because of the Open Source community, Hadoop software is available for free rather than at current database and data warehouse licensing fees. The use of Hadoop isn’t free, but the elimination of traditional license fees makes it much easier to experiment (for example).
- Because Hadoop is designed to deal with unstructured data and unconstrained analysis (in contrast to a data warehouse that is carefully schematized and optimized), it doesn’t require database-trained individuals (e.g., a DBA), although it clearly requires specialized expertise.
- The MapReduce model minimizes the parallel programming experience and expertise required. Programming MapReduce directly requires significant programming skills (Java and functional programming), but the basic Hadoop model is designed to use scaling (adding more nodes, especially as they get cheaper) as an alternative to careful parallel programming optimization.

Hadoop represents a quite dramatic rethinking of “data processing,” driven by the increasing volumes of data being processed and by the opportunity to follow the pioneering work of Google and others and use commodity system technology at a much lower price. The downside of taking a new approach is twofold:

- There is a lot of learning to do. Conventional data management and analysis is a large and well-established business. There are many analysts trained to use today’s tools, and a lot of technical people trained in the installation, operation and maintenance of those tools.
- The “whole” product still needs some fleshing out. A modern data storage and analysis product is complicated: tools to import data, tools to transform data, job and work management systems, data management and migration tools, and interfaces to and integration with popular analysis tools, for a beginning. By this standard Hadoop is still pretty young.

From a product perspective, the biggest deficiencies are probably the adaptation of Hadoop for operation in an IT shop rather than a large Web property, and the development of tools that let users with more diverse skill sets (e.g., business analysts) make productive use of Hadoop-stored data. All of this is being worked on, either within the Open Source community or as licensed proprietary software for use in conjunction with Hadoop. Companies providing Hadoop support and training services have discovered a vibrant and growing market. The usability of Hadoop (both operationally and as a data tool) is improving all the time, but it still has some distance to go.

11. What Are the Interesting Sources of Big Data?

There is no single answer. Different companies will have different data sets of interest. Some of the common ones are these:

- Integration of data from multiple data warehouses: Most big companies have multiple data warehouses, in part because each may have a particular divisional or departmental focus, and in part to keep each at an affordable and manageable level, since traditional data warehouses all tend to increase in cost rapidly beyond some capacity. Hadoop provides a tool by which multiple sources of data can be brought together and analyzed, and by which a bigger “virtual” data warehouse can be built at a more affordable price.
- Clickstream data: A Web server can record (in a log file) every interaction with a browser/user that it sees. This detailed record of use provides a wealth of information on the optimality of the Web site design, the Web system performance and, in many cases, the underlying business. For the large Web properties, clickstream analysis is the source of fundamental business analysis and optimization; for other businesses, the value depends on the importance of Web systems to the business.
- Log file data: Modern systems, subsystems, applications and devices can all be configured to log “interesting” events. This is potentially the source of a wealth of information ranging from security/attack analysis to design correctness and system utilization.
- Information scraped from the Web: Every year more information, and more valuable information, is captured on the Web. Much of it is free to use for the cost of finding it and recording it.
- Specific sources such as Twitter produce high-volume data streams potentially of value.

Where is all this information coming from? There are multiple sources, but to begin with, consider:

- The remarkable and continuing growth of the World Wide Web. The Web has become a remarkable repository of data to analyze, both in terms of all the contents of the Web and, for a Web site owner, the ability to analyze the use of the Web site in complete detail.
- The remarkable and growing use of mobile devices. The iPhone has only existed for the last five years (and the iPad for less), but this kind of mobile device has transformed how we deal with information. More and more of what we do is in text form (not written notes, faxes or phone calls) and available for analysis one way or another. Mobile devices also provide valuable (albeit frightening) information on where and when the data was created or read.
- The rise of “social sites.” There has been rapid growth in Facebook and LinkedIn, as well as in customer feedback on specific products (both at shared sites like Amazon and on vendor sites). Twitter provides remarkable volumes of data with possible value.
- The rise in customer self-service. Increasingly, companies look for ways for the community of their customers to help one another through shared Web sites. This is not only cost-effective but generally leads to the earlier identification and solution of problems, as well as providing a rich source of data by which to assess customer sentiment.
- Machine-generated data. Almost all “devices” are now implemented in software and are capable of providing log data (see above) if it can be used productively.

12. How Important Is Big Data Analytics?

The only reasonable answer is “it depends.” Big data evangelists note that analytics can be worth 5% on the bottom line, meaning that intelligent analysis of business data can have a significant impact on the financial performance of a company. Even if that is true, for most companies most of the value will come from the analysis of “small data,” not from the incremental analysis of data that is infeasible to store or analyze today.

At the same time, there are unquestionably companies for which the ability to do big data analytics is essential (Google and Facebook, for example). These companies depend on the analysis of huge data sets (clickstream data from large on-line user communities) that cannot practically be processed by conventional database and analytics solutions.

For most companies, big data analytics can provide incremental value, but the larger value will come from small data analytics. Over time, the value will clearly shift toward big data as more and more interesting data becomes available. There will almost always be value in the analysis of some very large data set. The more important question, from a business optimization perspective, is whether the highest-priority requirement is based on big data or on still-untapped, higher-value “small” data.


13. Things You Don’t Want to Do with Hadoop

The Hadoop source distribution is “free,” and a bright Java programmer can often “find” enough “underutilized” servers with which to stand up a small Hadoop cluster and do experiments. While it is true that almost every large company has real large-data problems of interest, to date much of the experimentation has been on problems that don’t really need this class of solution. Here is a partial list of some of the workloads that probably don’t justify going to Hadoop:

- Non-huge problems. Keep in mind that a relatively inexpensive server can easily have 10 cores and 200 GB of memory. 200 GB is a lot of data, especially in a compressed format (Microsoft PowerPivot – an Excel plugin – can process 100 million rows of compressed fact-table data in 5% of that storage). Having the data resident in DRAM makes a huge difference (PowerPivot can scan 1 trillion rows a minute with fewer than 5 cores). If a compressed version of the data can reside in a large commodity server’s memory, that is almost certain to be a better solution (there are various in-memory database tools available).
- Only for data storage. Although Hadoop is a good, very large storage system (HDFS), unless you want to do embedded processing there are often better storage solutions around.
- Only for parallel processing. If you just want to manage the parallel execution of a distributed Java program, there are simpler and better solutions.
- For HPC applications. Although a large Hadoop cluster (100 nodes) comprises a significant amount of processing power and memory, you wouldn’t want to run traditional HPC algorithms (e.g., FEA, CFD, geophysical data analysis) in Hadoop rather than in a more traditional computational grid.

14. Horizontal Hadoop Applications

With some very bright programmers, Hadoop can be applied wherever the functional model can be applied. One generic class of applications is characterized by data sets that are clearly too large to store economically in traditional enterprise storage systems (SAN and NAS) and clearly too large to analyze with traditional data warehouse systems.

Think of Hadoop as a place where you can now store the data economically, and use MapReduce to preprocess the data and extract results that can be fed into an existing data warehouse and analyzed, along with existing structured data, using existing analysis tools.
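As a sketch of that preprocessing pattern, the map-only step below parses raw Web-server log lines stored in HDFS and emits just the fields of interest (here a requested URL and an HTTP status code) as a compact, tab-delimited data set that a conventional data warehouse could load. The log format and field positions are illustrative assumptions; the class would be wired into a Job exactly as in the earlier word-count example, with the number of reduce tasks set to zero (job.setNumReduceTasks(0)).

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only extraction step: reads one raw log line at a time and writes a
// trimmed, tab-delimited record; lines that don't parse are simply dropped.
public class LogExtractMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  private final Text out = new Text();

  @Override
  public void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Assumed space-delimited log format: client, other fields, request URL, status, ...
    String[] fields = line.toString().split(" ");
    if (fields.length < 9) {
      return;  // malformed line; skip it
    }
    String url = fields[6];     // requested URL (position is illustrative)
    String status = fields[8];  // HTTP status code (position is illustrative)
    out.set(url + "\t" + status);
    context.write(out, NullWritable.get());
  }
}
```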


Alternatively, you can think of Hadoop as a way of “extending” the capacity of an existing storage and analysis system when the cost of that solution starts to grow faster than linearly as more capacity is required. As introduced above, Hadoop can also be used as a means of integrating data from multiple existing warehouse and analysis systems.

15. Summary

Technology progress and the increased use of the Internet are creating very large new data sets with increasing value to businesses, and they are making the processing power to analyze them affordable. The size of these data sets suggests that exploiting them may well require a new category of data storage and analysis systems, with different system architectures (parallel processing capability integrated with high-volume storage) and a different use of components (more exploitation of the same high-volume, commodity components that are used within today’s very large Web properties). Hadoop is a strong candidate for such a new processing tier. In addition to building on Google’s pioneering designs, the fact that it is today a vibrant Open Source effort suggests that additional disruptive impact on product pricing and the economics of use is possible.