An Overview of High Performance Data Mining and Big Data Analytics


High Performance Data Mining and Big Data Analytics1

Author: Khosrow Hassibi, PhD

    Contact: [email protected]

January 2012

Last Modified Date: April 17, 2012

1 This document is still a work in progress.


    Table of Contents

1 Summary
2 Introduction
3 State of Computing
  3.1 Microprocessors and Moore's Law
  3.2 Traditional Supercomputing
  3.3 High Performance Computing
  3.4 New Computing Delivery Models
  3.5 Key Future Trends in Computing
4 Exponential Data Growth and Big Data
  4.1 What defines a Big Data scenario?
  4.2 6Vs of Data
    4.2.1 Value
    4.2.2 Volume
    4.2.3 Variety
    4.2.4 Velocity
    4.2.5 Validity
    4.2.6 Volatility
  4.3 The Impact of 6Vs
5 Big Data Analytics
  5.1 Major Developments in Big Data Analytics
    5.1.1 Hadoop and MapReduce
    5.1.2 Scalable database
    5.1.3 Real-time stream processing
    5.1.4 The In-memory Big Data appliance
  5.2 High Performance Data Mining and Big Data
    5.2.1 Approaches to Achieve High Performance and Scalability in Data Mining
      5.2.1.1 Chunking or Data Partitioning
      5.2.1.2 Statistical Query Model (Machine Learning and Advanced Analytics)
      5.2.1.3 Serial Computing Environment
      5.2.1.4 Multiprocessor Computing Environment (Symmetric Multiprocessing or SMP)


      5.2.1.5 Distributed Computing Environment (Cluster Computing) with Shared Storage
      5.2.1.6 Shared-Nothing Distributed Computing Environment
      5.2.1.7 In-memory Distributed Computing Environment
      5.2.1.8 Side Note on Accelerated Processing Units (APUs or CPU/GPU combinations)
6 Applications of Big Data Analytics
  6.1 Applications of Big Data in Traditional Analytic Environments
    6.1.1 ETL
    6.1.2 Extracting Specific Events, Anomalies or Patterns
    6.1.3 Low End Analysis, Queries, and Visualization
    6.1.4 Data Mining and Advanced Analytics
  6.2 Big Data Mining and Sampling
    6.2.1 Structured Risk Minimization (SRM) and VC Dimension
    6.2.2 Properly Sampled Data is Good Enough for Most Modeling Tasks
    6.2.3 Where to Use all the Data?
7 Evolution of Business Analytics Environments
8 REFERENCES


    1 Summary

Due to the exponential growth of data, today there is an ever-increasing need to process and analyze Big Data. High performance computing architectures have been devised to address the needs of handling Big Data not only from a transaction processing standpoint but also from a tactical and strategic analytics viewpoint. The objective of this write-up is to provide a historical and comprehensive view of the recent trend toward high performance computing architectures, especially as it relates to Analytics and Data Mining. The article also emphasizes that Big Data requires a rethinking of every aspect of the analytics life cycle, from data management, to data mining and analysis, to deployment.


    2 Introduction

With the exponential growth of data comes an ever-increasing need to process and analyze the so-called Big Data. High performance computing architectures have been devised to address the needs of handling Big Data processing not only from a transaction processing viewpoint but also from an analytics perspective. The main goal of this paper is to provide the reader with a historical and comprehensive view of the recent trend toward high performance computing architectures, especially as it relates to Analytics and Data Mining.

There is a variety of separate readings on Big Data (and its characteristics), High Performance Computing for Analytics, Massively Parallel Processing (MPP) databases, algorithms for Big Data, In-memory Databases, implementations of machine learning algorithms for Big Data platforms, Analytics environments, etc. However, none gives a historical and comprehensive view of all these separate topics in a single document. This is the author's first attempt to bring as many of these topics together as possible and to portray an ideal analytic environment that is better suited to the challenges of today's analytics demands.

In Section 3, today's state of computing is examined, with special focus on the microprocessor Power Wall and the impact it has had on new processor designs and on the popularity of parallel and distributed processing architectures in our everyday lives. The section ends by enumerating a few key future trends in computing.

In Section 4, the ever-increasing exponential growth of data in today's world is examined under what is now called Big Data in marketing jargon. What is considered Big Data is described in terms of what I call the 6Vs of data: value, volume, velocity, variety, validity, and volatility. Each combination of these characteristics could require a different process and possibly a different infrastructure for data management and data analysis.

Section 5 starts with major developments in Big Data Analytics, including Hadoop and MapReduce, scalable databases, real-time stream processing, and the In-memory Big Data appliance. Algorithms that can be parallelized and that fall under the external memory and statistical query model definitions are also described. A variety of approaches to achieve high performance and scalability in data mining and analytics are covered, including parallel processing and distributed computing architectures.

Section 6 discusses the applications of Big Data at a high level. These include, but are not limited to, ETL, event and anomaly extraction, low-end analysis, and data mining. Sampling is then examined, and the question of whether all the data should be used in a data mining exercise is discussed, where I use a couple of topics from statistical learning theory to describe my viewpoint.

Section 7 portrays a picture of an ideal analytic environment in which all sorts of data can be processed and analyzed depending on a variety of business needs. These environments will be the fertile ground for many new innovations and drastic evolutions in Analytics and Data Mining practices.

Section 8 provides a list of references for further reading.


    3 State of Computing

3.1 Microprocessors and Moore's Law

Microprocessor performance has grown 1,000-fold over the past 20 years, driven by transistor speed and energy scaling, by microarchitecture advances that exploited the transistor density gains from Moore's Law2 (advanced microarchitecture techniques such as pipelining, branch prediction, out-of-order execution, and speculation), and by cache memory architectures. Several years ago, however, the top speed for most microprocessors peaked when their clocks hit about 3 gigahertz. The problem is not that the individual transistors themselves can't be pushed to run faster; they can. But doing so for the many millions of them found on a typical microprocessor would require that chip to dissipate impractical amounts of heat. Computer engineers call this the power wall. Given that obstacle, it's clear that all kinds of computers, including PCs, servers, and supercomputers, are not going to advance at nearly the rates they have in the past. This means faster single-threaded performance has hit a limit [2].

One technique to cope with the Power Wall has been the introduction of multi-core processor chips: keeping the clock frequency fixed but increasing the number of processing cores on a chip. One can maintain lower power this way while increasing the speed of many applications. This allows, for example, two threads to process twice as much data in parallel, but at the same speed at which they operated previously. As a result, there has been an industry-wide shift to multicore in the last few years.
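To make the shift from clock speed to core count concrete, here is a minimal, illustrative Python sketch (the workload and function names are hypothetical, not from this document) that splits a dataset into chunks and processes them on several cores with the standard library's ProcessPoolExecutor; each worker runs at the same clock rate, but the chunks are handled in parallel.

    # Minimal sketch of core-level parallelism: the same work split into
    # chunks that run on separate cores at an unchanged clock speed.
    # The workload (summing squares) is a hypothetical stand-in.
    from concurrent.futures import ProcessPoolExecutor

    def score_chunk(chunk):
        # Stand-in for a CPU-bound task applied to one data partition.
        return sum(x * x for x in chunk)

    def parallel_score(data, n_workers=4):
        size = (len(data) + n_workers - 1) // n_workers
        chunks = [data[i:i + size] for i in range(0, len(data), size)]
        with ProcessPoolExecutor(max_workers=n_workers) as pool:
            return sum(pool.map(score_chunk, chunks))

    if __name__ == "__main__":
        print(parallel_score(list(range(1_000_000))))

The point is not the arithmetic but the structure: throughput now comes from partitioning work across cores rather than from a faster clock.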

In the next two decades, diminishing transistor speed scaling and practical energy limits create new challenges for continued performance scaling. As a result, the frequency of operations will increase slowly, with energy the key limiter of performance, forcing designs to use large-scale parallelism, heterogeneous cores, and accelerators (e.g. GPUs, FPGAs) to achieve performance and energy efficiency. A software-hardware partnership to achieve efficient data orchestration is increasingly critical in the drive toward energy-proportional computing.

    3.2 Traditional Supercomputing

Traditional supercomputers are one of the greatest achievements of the digital age, even though yesterday's supercomputer is today's game console (Table 1), as far as performance goes.

2 Moore's Law states that the density of integrated circuits doubles every generation (every two years) for half the cost.


                         SANDIA Labs ASCI RED     Sony PlayStation 3
    Date Introduced      1997                     2006
    Peak Performance     1.8 Teraflops            1.8 Teraflops
    Size                 150 square meters        0.08 square meters
    Power Usage          800,000 Watts            < 200 Watts

    Table 1: Sony PlayStation 3 versus SANDIA Labs ASCI RED supercomputer.

Traditional supercomputers are very expensive and are employed for specialized applications that require immense amounts of mathematical calculation. During the past five decades these machines have driven some fascinating pursuits, for example weather forecasting, animated graphics, fluid dynamics calculations, nuclear energy research, designing new drugs, and petroleum exploration. Traditional supercomputing is based on groups of tightly interconnected microprocessors connected by a local high speed bus. Examples of these are the DOE supercomputers used for weather forecasting and nuclear energy research3. The design focus for these machines is to run compute-intensive tasks [1].

    3.3 High Performance Computing

In recent years, modern supercomputers have shaped our daily lives more directly. We now rely on them every time we do a Google search or try to find an old high school friend on Facebook, for example. And you can rarely watch a big-budget movie without seeing supercomputer-generated special effects. The design focus for these systems is to run compute-intensive tasks on huge amounts of data [2]. Data distribution and the colocation of data and computing are the keys to the successful application of these systems and algorithms.

Modern supercomputing comes in several architectural varieties that differ from traditional supercomputer architectures.

    Distributed Computing4

This is a method of computer processing in which different parts of a program are run simultaneously on two or more computers that are communicating with each other over a network. Distributed computing is a type of segmented or parallel computing, but the latter term is most commonly used to refer to processing in which different parts of a program run simultaneously on two or more processors that are part of the same computer and share memory (also called the Shared Memory parallel computing model).

3 The fastest supercomputer at the time of this writing is Japan's Fujitsu K Computer at 10.51 petaFLOPS.

4 http://en.wikipedia.org/wiki/Distributed_computing


Parallel computing may be seen as a particularly tightly-coupled form of distributed computing, and distributed computing may be seen as a loosely-coupled form of parallel computing. In distributed computing, each processor has its own private memory, i.e. the memory is distributed (also called the Distributed Memory parallel computing model) and local to each processor. Information is exchanged by passing messages between the processors, for example using MPI (Message Passing Interface). MPI is a standardized and portable message-passing protocol that was designed by a group of researchers from academia and industry in 1994 and fostered the development of a parallel software industry. The standard defines the syntax and semantics of a core set of library routines. While both types of processing require that a program be segmented (divided into sections that can run simultaneously), distributed computing also requires that the division of the program take into account the different environments on which the different sections of the program will be running. For example, two computers are likely to have different file systems and different hardware components. It is important to emphasize that Distributed Computing covers both Functional Partitioning (breaking a program into parts that can run in parallel) and Data Partitioning5, where very large data is broken into pieces and the same program runs on each piece (also called SPMD or Single Program Multiple Data), or a combination of both. True scalability for very large applications is only realized through Distributed Computing techniques.

    Cluster Computing6

Computer clusters (a form of distributed computing) emerged as a result of the convergence of a number of computing trends, including the availability of low cost microprocessors, high speed networks, and software for high performance distributed computing. A computer cluster is a group of linked computers working together closely so that in many respects they form a single computer. The components of a cluster are commonly, but not always, connected to each other through fast local area networks. Clusters are usually deployed to improve performance and/or availability over that provided by a single computer, while typically being much more cost-effective than single computers of comparable speed or availability.

    Grid Computing7

Grid computing is a term referring to the federation of computer resources from multiple administrative domains to reach a common goal. The GRID is a form of distributed computing with non-interactive workloads that involve a large number of files. What distinguishes GRID computing from conventional High Performance Computing systems such as Cluster Computing is that grids tend to be more loosely coupled, heterogeneous, and geographically dispersed.

5 This is of particular importance for Big Data processing.

6 http://en.wikipedia.org/wiki/Computer_cluster

7 http://en.wikipedia.org/wiki/Grid_Computing


Although a GRID can be dedicated to a specialized application, it is more common that a single grid will be used for a variety of different purposes. Grids are often constructed with the aid of general-purpose grid software libraries known as middleware8. One of the main strategies of GRID computing is to use this middleware to divide and apportion pieces of a program among several computers, sometimes up to many thousands9.

    In-memory Computing

With the advent of 64-bit computing in the last several years (the emergence of both 64-bit CPUs and 64-bit operating systems), the addressable memory for a program has become theoretically unlimited. The only practical limit on memory size is the physical memory that can be provided on the CPU board. This has been a blessing for developers of data-intensive applications that have to rely on I/O performance. The availability of multicore processing units, huge 64-bit addressable memory, and distributed computing/data paradigms has made full in-memory distributed processing possible, especially for data-intensive and I/O-bound applications such as data mining and large scale Analytics.

    Virtual Computing

Virtualization in computing is the creation of a virtual (rather than actual) version of a hardware platform, operating system, storage device, or network resource. The usual goal of virtualization is to centralize administrative tasks while improving scalability and overall hardware-resource utilization.

    3.4 New Computing Delivery Models

To deliver computing power and applications, there are new computing models in addition to the personal and traditional client/server computing models. These computing delivery models could democratize the use of supercomputing for a much larger pool of users10 [10].

8 There are open source and commercial GRID management software packages. Condor is an open source software framework for managing workload on a dedicated cluster of computers. Platform LSF (IBM) is a commercial job scheduler sold by Platform Computing that is used to execute batch jobs on networked Unix and Windows systems on many different architectures.

9 Today's 20-nanometer computer chip design and verification process is a good example where thousands of computers in a GRID may be needed to verify the logic and the layout. A typical process may require running hundreds of thousands of small jobs and handling millions of files.

10 Amazon's virtual supercomputer is capable of running 240 trillion calculations per second, or 240 teraflops, on 17,000 cores. While undoubtedly impressive, this pales in comparison to Fujitsu's K supercomputer, which hit 10 petaflops in November 2011, equating to 10 quadrillion (one quadrillion = 10^15; see here for an interesting comparison) calculations a second.


    Cloud Computing11

Cloud computing is a computing paradigm shift where computing is moved away from personal computers or an individual application server to a cloud of computers. Users of the cloud only need to be concerned with the computing service being asked for, as the underlying details of how it is achieved are hidden. This method of distributed computing is done by pooling all computer resources together and having them managed by software rather than by humans. The services being requested of a cloud are not limited to using web applications, but can also be IT management tasks such as requesting systems, a software stack, or a specific web appliance.

Utility Computing12

Conventional Internet hosting services have the capability to quickly arrange for the rental of individual servers, for example to provision a bank of web servers to accommodate a sudden surge in traffic to a web site. Utility computing usually envisions some form of virtualization so that the amount of storage or computing power available is considerably larger than that of a single time-sharing computer. Multiple servers are used on the back end to make this possible. These might be a dedicated computer cluster specifically built for the purpose of being rented out, or even an under-utilized supercomputer.

    3.5 Key Future Trends in Computing

    The following are the key future trends in computing:

1. Moore's Law continues but demands radical changes in architecture and software.
2. Architectures will go beyond homogeneous parallelism, embrace heterogeneity, and exploit the bounty of transistors to incorporate application-customized hardware [2].
3. Software must increase parallelism and exploit heterogeneous and application-customized hardware to deliver performance growth.
4. Where data is large, colocation of data and processing will be the key to truly improving performance. This scheme has been commercially employed by some data warehouse vendors for a long time for simple analytics, and on open source systems such as Hadoop.

11 http://en.wikipedia.org/wiki/Cloud_computing

12 http://en.wikipedia.org/wiki/Utility_computing


The key takeaway is that the same application software that, with no change, enjoyed a 20-30% year-over-year performance increase in the last two decades (due to consistent processor speed improvements) now has to be written differently and more intelligently to exploit parallelism, distributed data, and modern hardware architectures for speed-up. That means a lot more work for software engineers and application developers. This is particularly true for advanced analytics and data mining applications that rely on processing very large amounts of data.


    4 Exponential Data Growth and Big Data

With the exponential growth in data generation and acquisition, whether in business (CRM, web data, mobile data, social media, etc.) or scientific applications (next-generation telescopes, petascale scientific computing, or high-resolution sensors), it's an exciting time for discovery, analysis, and better decision making. As a result of these technological advances, the next decade will see even more significant impacts in fields such as commerce, medicine, astronomy, materials science, climate, etc. Interesting discoveries and better business decisions will likely be made possible with amounts of data previously unavailable, or available but not minable.

In the early 2000s, storage technologies were overwhelmed by the numerous terabytes of big data, to the point that IT faced a data scalability crisis. Then, yet again, storage technologies not only developed greater capacity, speed, and intelligence; they also fell in price. Enterprises went from being unable to afford or manage big data to lavishing budgets on its collection and analysis. And using advanced analytics, businesses could analyze big data at the detailed level to understand the current state of the business and track still-evolving aspects such as customer behavior.

An optimal approach to the overall data analysis problem requires close cooperation between domain experts and computer scientists. If we asked computational or data scientists today what would maximize progress in their field, most would still say more disk space and more CPU cycles. However, the emerging petabytes13 of data fundamentally change every aspect of business and scientific discovery: the tools (computer hardware and software), the techniques (algorithms and statistics), and as a result the cycle of the scientific method itself, which is greatly accelerated [9].

    4.1 What defines a Big Data scenario?

Big Data is a new marketing term14 that highlights the ever-increasing exponential growth of data in every aspect. The term Big Data15 originated from within the open source community, where there was an effort to develop analytics processes that were faster and more scalable than traditional data warehousing and could extract value from the vast amounts of unstructured data produced daily by web users. All vendors, including database/data warehousing vendors, hardware vendors, and analytics vendors, have jumped on the Big Data bandwagon.

13 In May 2011, IBM announced that it will invest $100 million in Big Data analysis research. This included tools and service offerings. They specifically mention the challenge of analyzing petabytes of data, which requires advances in software, systems, and services.

14 Very large databases are nothing new. For three decades at least, the term very large database, or VLDB, has been used in technical circles and defined as a database that contains an extremely high number of tuples (database rows) or occupies an extremely large physical file system storage space. The most common definition of VLDB is a database that occupies more than 1 terabyte or contains several billion rows, although naturally this definition changes over time; today petabytes may be more appropriate.

15 Due to the importance of Big Data, the US Government (Office of Science and Technology Policy at The White House) has announced the Big Data Research and Development Initiative with a $200 million commitment from six agencies. See here for details.


The most common definition of Big Data16 is datasets that grow so large that they become awkward to work with using available database management tools. Difficulties include the capture, storage, search, sharing, analytics, and visualization of such data.

The trend of working with ever larger datasets continues because, first and foremost, companies at the scale of Google, Facebook, LinkedIn, and Yahoo! have to capture and manage all the data on their sites at the most granular detail to provide the services they offer. Second, for more traditional companies in industries such as Finance, Telco, and Retail, there are potential benefits in analyzing big data that allow their data scientists to unravel business trends, detect novel customer behavior, detect fraud, combat crime, etc.

Though Big Data is always a moving target, current limits are on the order of terabytes and petabytes, where that is the size of a single dataset or combination of datasets that has to be analyzed for a specific analysis purpose at a time. Scientists have regularly encountered this problem in data mining, meteorology, genomics, connectomics, complex physics simulations, biological/environmental research, Internet search, and finance. Data sets also grow in size because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies, software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. The world's technological per capita capacity to store information has roughly doubled every 40 months since the 1980s (about every 3 years) [15].

    Data Inflation    Byte Size
    Megabyte (MB)     2^20
    Gigabyte (GB)     2^30 = 1,000 MB
    Terabyte (TB)     2^40 = 1,000 GB
    Petabyte (PB)     2^50 = 1,000 TB
    Exabyte (EB)      2^60 = 1,000 PB
    Zettabyte (ZB)    2^70 = 1,000 EB
    Yottabyte (YB)    2^80 = 1,000 ZB

    4.2 6Vs of Data

The exponential growth of data in the last decade can be looked at along six dimensions, which I call the 6Vs [5]: Value, Volume, Variety, Velocity, Validity, and Volatility.

16 Refer to Wikipedia.


    4.2.1 Value

Almost all big organizations, specifically in the last 15 years, have started to look at the analysis of the data they generate (interactions with their customers, vendors, machines, etc.) strategically. They treat that data as a strategic asset from an analysis standpoint. Detailed analysis of data has been proven to bring significant competitive advantage in Risk, Fraud, Marketing, and Sales. As such, the value of the data from a business standpoint has increased significantly, resulting in easier justification of storing and processing that data for analysis.

If the perception of the strategic value of data were not real, none of the other data dimensions would matter. The great success of data warehouse companies such as Teradata in the last two decades exemplifies this point. These companies have been able to create a separate market (in comparison with transaction processing databases) focused mainly on the value of strategic data analysis.

    4.2.2 Volume

With great advances in technology in the last two decades, many transactions between interested entities (human to human, human to machine, machine to machine) can be electronically captured and stored. New transactions that did not exist before are now captured: smart meters, mobile data, RFID data, and healthcare data are just a few of the newcomers. One can now request a cab from a cell phone without entering any address and then pay for it by tapping the phone on a reader. The volume of traditional transactions has also significantly increased year over year. For example, traditional payment transactions such as credit and debit have significantly increased in the last decade, after huge growth in the decade just before that. The reader can check Reference [15] for some interesting statistics.

We can look at data in terms of different metrics such as the number of rows and columns, files and tables, transactions, or terabytes. All of these metrics have significantly increased, resulting in higher volumes of data available for processing and analysis. For example, a decade ago a large bank may have looked at tens of variables to use in predictive models for risk or marketing purposes. Now they are looking at thousands of variables instead, and one or more of these variables by itself may provide great competitive advantage for a specific business question.

    4.2.3 Variety

Data is now collected in all shapes and forms and for all new purposes. It could be structured, unstructured (e.g. free text), or semi-structured (XML). A decade ago, the capture of unstructured data was only of interest to the pioneers in the private sector. Now it is the norm. The data could be static or slowly changing (customer master data), dynamic (transactional data), or highly volatile (social media data), etc. Data is even generated when one is not actively engaged, like cell phone location data or TV viewing. The important thing is that each variety of data often requires its own special storage, processing, and analysis.


    4.2.4 Velocity

With the advent of the Internet, Mobile, and Social Media, the velocity of transactions among entities has significantly increased. This has impacted data volume. For particular applications, a decision has to be made in real time, in sub-seconds, while other decisions can still be made in batch or near real time. The most famous and pioneering example (in use since 1993) is payment card fraud detection, where authorization decisions based on the latest transaction's fraud score must be made in milliseconds [6].

    4.2.5 Validity

Data validity and quality have always been a challenge. Historically, it was typical to expect that data validity/quality suffers as data volumes increase. In the last several years, with improvements in the tools and processes surrounding data quality and the attention given to the value of the data, the validity of the captured data has improved. This is particularly true for structured data that is captured into data warehouses for business analysis and decision making. The value of data as an asset cannot be realized unless there is trust that it represents the true state of reality in whatever application it is used.

    4.2.6 Volatility

One aspect of some of the newer data types that is often ignored is their volatility (or transient nature) and the ever-shrinking time windows to act upon them. Even though data can be captured and analyzed, its relevance may be short-lived for a given business purpose. Consider someone who is browsing different sites on the internet in search of buying a specific product. For the specific purchase he has in mind at that moment in time, the collected data has short utility. If one observes his click streams, compares them to the outcomes of others who have generated similar click streams, and provides the right offering to him quickly, a sale could be made. If this data is not used in a short time span, it is no longer useful for revenue generation but only good for some summary statistic in a report.

Twitter and some other social media data are good examples of the volatility of data. The best known example is automated short-term trading systems, where stock trends over seconds or less are crucial in making a profitable trade. In general, one would expect data volatility as a whole to increase regardless of the channel or medium, because people have shorter attention spans and their habits change more often than was the norm in the past. Cisco Systems Inc. estimates there are approximately 35 billion electronic devices that can connect to the Internet. Any electronic device can be connected (wired or wirelessly) to the Internet, and even automakers are building Internet connectivity into vehicles. Connected cars could become commonplace by 2012-13 and generate millions of new transient data streams17.

    4.3 The Impact of 6Vs

Each combination of these characteristics could require a different process and maybe a different infrastructure for data management and data analysis. I divide these six characteristics into two groups:

- Group 1 (the traditional 3Vs): Volume, Velocity, Variety
- Group 2: Value, Validity, Volatility

The size of data is directly related to any increase in Group 1 (the 3Vs), and this is well understood. Group 2, however, is indirectly related to the size of data.

As the Value perception of data increases, it creates the appetite for collecting more of it, often with more granularity, Velocity, and Variety. As an example, consider a department in an organization that has benefited from analysis of its customer data for a particular business need. This propels the curiosity and the drive to collect more data on the customer (internal or third party data) to add possibly more business benefits. Other departments in the organization can follow the same path, and all of this is translated into higher volumes of data for the enterprise to store and to analyze. That has been the story of the last two decades and of the growth of the data warehouse and analytics market.

Some data is known to be valuable but in its current form has low validity. This may have to do with the way the data is generated, transferred, or captured. Such data is deemed useless unless its validity can be improved and assured. As data validity improves, given its known value, it will directly affect volume. As an example, log data from web servers is known to be valuable for the analysis of online fraud, where information about IP address, time stamp, operating system, and browser is captured. If this information is not generated or captured accurately, it will be useless and not worth storing or analyzing. If the validity of this data improves, it affects the volume of data to be processed and analyzed for that specific business need. Note that validity is not equivalent to Data Quality. One can have valid data with low or medium quality which is still acceptable for data mining18 and Analytics. Data can always be cleaned, standardized, and enhanced through data quality procedures.

17 Analytics derived from wireless Intra-VAN (Intra Vehicle Area Network) and Inter-VAN (between vehicles) data networks are becoming necessary components of smart vehicles in the years ahead and could be another source of data for a slew of new applications [11].

18 Data mining methodologies and algorithms are inherently designed to be resistant to noise and can cope with lower quality data as long as it is valid and has some value.


High-volatility data will reduce the volume of data that needs to be stored and analyzed. Highly volatile data has a short useful life span that may not justify its storage and analysis outside an established time window. And if longer history outside that time window is ever needed, it could be captured and stored as summaries and not in granular detailed form.


    5 Big Data Analytics

Big Data analytics is more interesting and multifaceted than Big Data storage, but it is less understood, especially by IT organizations. The development of Big Data analytics processes has historically been driven by the need to cope with web-generated data. However, the rapid growth of applications for Big Data analytics is taking place in all major vertical industry segments and now represents a growth opportunity for vendors that's worth all the hype [9].

One of the challenges of Big Data Analytics is the sheer scale of the data: the data simply can't be moved for analysis. Therefore, the data must be analyzed in situ and/or one must develop methods for extracting a smaller set of relevant data.

Big Data analytics is an area of rapidly growing diversity. Therefore, trying to define it is probably not helpful. What is helpful, however, is identifying the characteristics that are common to the technologies now identified with Big Data Analytics [4]. These include:

- The perception that traditional data warehousing processes are too slow and limited in scalability,
- The ability to converge data from multiple data sources, both structured and unstructured,
- The realization that time to information is critical to extract value from data sources that include mobile devices, RFID, logs, the web, and a growing list of automated sensory technologies19,
- The accelerating trend of companies moving from annual budgets and monthly/daily reviews to instant responses. They need to know in real time what's going on in their markets, how to predict changes, and how to change their operations faster than their competition.

The key to the success of Big Data Analytics projects is the appropriate combination of analytics (advanced software tools combined with the right methodology), expertise (domain knowledge), and delivery platform (hardware architecture) around certain business problems, enabling outcomes and results which are simply not possible in traditional ways.

    5.1 Major Developments in Big Data Analytics

There are at least four major development segments that underline the diversity to be found within Big Data analytics. These segments are MapReduce & Hadoop, scalable databases, real-time stream processing, and the Big Data appliance [9].

19 The marketing jargon often used for this is "Analytics at the speed of thought."


    5.1.1 Hadoop and MapReduce

Apache Hadoop is a software framework whose concepts began with a paper published by Google in 2004 [3]. The paper described a programming model, called MapReduce, for parallelizing the processing of web-based data for Google's search engine implementation using the Google File System (GFS). MapReduce is a software framework to support distributed computing on large data sets on clusters of computers. Shortly thereafter, Apache Hadoop was born as an open source implementation of the MapReduce process. The community surrounding it is growing dramatically and producing add-ons that expand Apache Hadoop's usability within corporate data centers.

In summary, the Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. Performing computation on large volumes of data has been done before, usually in a distributed setting. What makes Hadoop unique is its simplified programming model20, which allows the user to quickly write and test distributed systems, and its efficient, automatic distribution of data and work across machines, which in turn utilizes the underlying parallelism of the CPU cores [13]. Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Apache Hadoop users typically build their own parallelized computing clusters from commodity servers, each with dedicated storage in the form of a small disk array or solid-state drive (SSD) for performance. Historically, such architectures have been called shared-nothing architectures21 and have been in use by some database vendors for some time. Traditional shared storage architectures use scalable and resilient storage-area network (SAN) and network-attached storage (NAS), but they typically lack the kind of I/O performance needed to rise above the capabilities of the standard data warehouse. Hadoop storage is direct-attached storage (DAS).

A typical issue with Hadoop is the growing list of sourcing choices, which range from pure open source to highly commercialized versions like Cloudera Inc. and MapR. Apache Hadoop and related tools are available for free at the Apache Hadoop site.

    There are a variety of tools available for Hadoop such as:

Hive (initiated by Facebook): This is data warehouse software with an SQL-like language that facilitates querying and managing large datasets residing in distributed storage,

20 Traditionally, the collaboration between multiple distributed compute nodes is managed with a communication system such as MPI.

21 Some data warehouse architectures use shared-nothing architectures; the most notable are Teradata and Aster Data. DB2 and Oracle use shared storage.


HBase (initiated by Powerset): One uses HBase for random, real-time read/write access to Big Data, where the goal is to deal with very large tables (billions of rows and millions of columns) atop clusters of commodity hardware.

Pig (initiated by Yahoo!) is an analysis platform that includes a high-level programming language,

Oozie provides users with the ability to define workflows (actions and the dependencies between them) and to schedule actions for execution when their dependencies are met.

Mahout22: Mahout is an open source and scalable Java library for Machine Learning and Data Mining. It is a framework of tools intended to be used and adapted by developers. Mahout aims to be the machine learning tool of choice when the data to be processed is very large, perhaps far too large for a single machine. In its current incarnation, these scalable implementations are written in Java, and some portions are built upon Apache's Hadoop distributed computation project.

Lucene is very popular for text search and predates Hadoop. It provides full-text indexing and search libraries for use in an application (Java, C++, Python, Perl, etc.).

Sqoop, Flume, or Chukwa allow users to procure the data to be ingested and place it in a Hadoop-based cluster.

    Figure 1: Current Hadoop Big Data Analytics software ecosystem.

22 As of this writing, Mahout is still a work in progress. Some of the algorithms are implemented on top of Apache Hadoop using the map/reduce paradigm. However, Mahout does not restrict contributions to just Hadoop implementations. Contributions that run on a single node or on a non-Hadoop cluster are also accepted. The core libraries are highly optimized to allow for good performance in non-distributed environments.


In the traditional corporate world, the typical analytics use cases mentioned for Hadoop focus on ETL and the simple processing and analysis of Big Data such as web and computer logs. The following link is an alphabetical list of organizations that are currently using Hadoop. For each organization, there is an explanation of the application and the size of the Hadoop cluster:

Example Uses of Hadoop: http://wiki.apache.org/hadoop/PoweredBy

One of the largest Hadoop implementations is at Yahoo!, where most of the user interactions are with data processed by Hadoop, including Content Optimization, Search Index, Ads Optimization, and Content Feed Processing. Yahoo! Hadoop clusters currently have more than 40,000 computers (100,000+ CPUs). The biggest cluster, used to support research for Web Search and Ad Systems, has 4,500 nodes (each node is a 2x4-CPU box with 4x1TB disks and 16GB RAM). 60% of the Hadoop jobs at Yahoo! are Pig jobs.

Some types of big data are too expensive for IT to store in any other way. These include:

- Unstructured data (search optimization, filtering, indexing),
- Web data: web logs, click streams,
- Telemetry and tracking data (e.g. smart meters, automobile tracking),
- Security logs,
- Application/server performance logs,
- Ad serving logs,
- Network security,
- Multi-channel detailed granular data,
- Media data (pictures, audio, video, ...),
- Social network data.

These data are too large, and storing them on sequential file systems makes their analysis long and tedious. Hadoop provides an efficient way to store them in a distributed fashion for much faster and more scalable analysis23. A typical ETL application may need to compute and extract some hand-picked useful variables from these large datasets and store them in a warehouse as structured fields for BI/reporting and modeling purposes. There are also archiving use cases for Hadoop, storing older data as a better alternative to tapes, whose processing is cumbersome and manually expensive.

Another use case could be to go through computer and network security logs24 and look for anything suspicious or abnormal: again, high volume and simple processing.

23 As an example, Yahoo! Search Assist, which uses three years of log data, used to take 26 days to process. With Hadoop and a 20-step MapReduce job, this time was reduced to 20 minutes (source: Google for "Hadoop and its Real-world Applications").

24 In big organizations, the variety of system logs generated by servers, networks, databases, etc. is massive in size. They need to be analyzed for a variety of purposes, including performance improvement of systems, security/fraud, prediction/provisioning, and reporting/profiling.


One other Hadoop example use by IT is to parse web logs and extract a number of fields into the warehouse for incorporation into existing analytic processing flows. Processing the data in Hadoop significantly reduces the ETL time.
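As a sketch of this kind of log-parsing ETL step, the mapper below could be run under Hadoop Streaming, which lets any executable that reads stdin and writes stdout serve as the map step. The web-log field layout and the filtering rule are illustrative assumptions, not details taken from the original text.

    #!/usr/bin/env python
    # Illustrative Hadoop Streaming mapper: read web-log lines from stdin,
    # keep a few hand-picked fields, and emit tab-separated records that a
    # later step can load into the warehouse. Assumes a common log format:
    #   ip - - [timestamp] "GET /url HTTP/1.1" status bytes
    import sys

    for line in sys.stdin:
        parts = line.split()
        if len(parts) < 9:
            continue                       # skip malformed lines
        ip, ts, url, status = parts[0], parts[3].strip("["), parts[6], parts[8]
        if status.isdigit():               # example filter: keep well-formed requests only
            print("\t".join([ip, ts, url, status]))

Such a script would be submitted with the Hadoop Streaming jar (paths and jar names vary by distribution), with a reducer added only if aggregation is needed.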

Both ETL vendors and data warehouse vendors have sensed a threat for some time and are realigning their products to incorporate Hadoop. Based on anecdotal information from corporate IT staff, from a storage standpoint Hadoop is anywhere from 20-100 times cheaper than a data warehouse storage option. In one case Hadoop cost $500 per terabyte versus $20,000 per terabyte for an alternative warehouse option. This does not, however, take into account the totally different skill sets and resources required to do data analysis with Hadoop, especially in the context of a more traditional business analytics environment.

In the corporate world, we do not see many applications (except the big, well-known ones at LinkedIn, Yahoo!, and the like) that perform any type of advanced analytics directly in Hadoop. However, many corporations are actively investigating Hadoop for specific internal advanced analytic applications that were not conceivable before. Easier-to-use tools and applications that leverage Hadoop more efficiently in the corporate world are being developed. Hadoop is a better deal for mass-volume unstructured data processing and mass-volume semi-structured data (such as XML). In all instances it still requires considerably more savvy programming, though many companies are working to make this programming much simpler.

    5.1.2 Scalable database

While Hadoop has grabbed most of the headlines because of its ability to process unstructured data in a data warehouse-like environment, there is much more going on in the Big Data analytics space.

Structured data is also getting plenty of attention. There is a rapidly growing community surrounding NoSQL25 databases. NoSQL is an open source, non-relational, distributed, and horizontally scalable (elastic scaling, or scale out instead of scale up) collection of database structures that address the need for web-scale databases designed for high-traffic websites and streaming media. Compared to their SQL counterparts, NoSQL databases have flexible data models and are designed from the ground up to require less maintenance (less tuning, automatic repair, etc.).

NoSQL databases are often categorized according to the way they store data, falling under categories such as key-value stores, BigTable implementations26, document store databases27, and graph databases. NoSQL database systems rose alongside major internet companies, such as

25 NoSQL is sometimes expanded to "Not Only SQL," and it is an important Big Data technology. In academic communities, these databases are referred to as Structured Storage, of which relational databases are a subset.

26 Examples of BigTable-based NoSQL databases are Cassandra and Hadoop HBase.

27 MongoDB, for example.


Google, Amazon, Twitter, and Facebook, which had significantly different data challenges that traditional RDBMS solutions could not cope with. With the rise of the real-time web, there was a need to extract information from large volumes of data that more or less followed similar horizontal structures. Available implementations include MongoDB (as in "humongous" DB), Cassandra28, Hadoop HBase, and Terrastore. Another analytics-oriented database emanating from the open source community is SciDB, which is being developed for use cases that include environmental observation and monitoring, radio astronomy, and seismology, among others.

Traditional commercial data warehouse vendors are not standing idly by. Oracle Corp. is building its next-generation Big Data platforms to leverage its analytical platform and In-memory computing for real-time information delivery. Teradata Corp. recently acquired Aster Data Systems Inc. to add Aster Data's SQL-MapReduce implementation to its product portfolio. EMC/Greenplum and a slew of others have been realigning their products to leverage these new technologies.

There are also faster implementations of Hadoop and similar technologies (some with real-time processing support). MapR, DataRush, HStreaming, HPCC, Platform Computing, DataStax, etc., are examples of technologies that can serve as faster alternatives to Apache Hadoop.

    5.1.3 Real-time stream processing

Applications that require real-time processing of high-volume data streams are pushing the limits of traditional data processing infrastructures. These stream-based applications include market feed processing29, electronic trading on Wall Street30, network and infrastructure monitoring, fraud detection, micro-sensors (RFID, LoJack), smart sensors, product recommendation on the Web, and command and control in military environments. Furthermore, as the sea change caused by cheap micro-sensor technology takes hold, one expects to see everything of material significance on the planet get sensor-tagged and report its state or location in real time. This sensorization of the real world will lead to a green field of novel monitoring and control applications with high-volume, low-latency processing requirements. In real-time stream processing, data has to be analyzed and acted upon in motion, not at rest. Current database management systems assume a pull-based model of data access, as opposed to the push-based model of streaming systems, in which users are passive and the data management system is active [21].

28 Apache Cassandra is an open source distributed database management system. It is designed to handle very large amounts of data spread across many commodity servers while providing a highly available service with no single point of failure. Cassandra was developed at Facebook to power their Inbox Search feature.

29 The Options Price Reporting Authority (OPRA), which aggregates all the quotes and trades from the options exchanges, estimated peak rates of 122,000 messages per second in 2005, with rates doubling every year. This dramatic escalation in feed volumes is stressing or breaking traditional feed processing systems.

30 In electronic trading, a latency of even one second is unacceptable, and the trading operation whose engine produces the most current results will maximize arbitrage profits. This is driving financial services companies to require very high volume processing of feed data with very low latency.


Traditionally, custom coding has been used to solve high-volume, low-latency stream processing problems. Even though the "design it yourself" approach is widely disliked because of its inflexibility, high cost of development and maintenance, and slow response to new feature requests, application developers had to resort to it because they had little success with traditional off-the-shelf software. In 2005, Stonebraker et al. [8] listed the eight requirements for real-time stream processing as follows:

(1) To process messages in-stream, without any requirement to store them in order to perform any operation or sequence of operations. Ideally the system should also use an active (i.e., non-polling) processing model.
(2) To support a high-level StreamSQL language with built-in, extensible stream-oriented primitives and operators.
(3) To have built-in mechanisms that provide resiliency against stream imperfections, including missing and out-of-order data, which are commonly present in real-world data streams.
(4) To guarantee predictable and repeatable outcomes.
(5) To have the capability to efficiently store, access, and modify state information, and to combine it with live streaming data. For seamless integration, the system should use a uniform language when dealing with either type of data.
(6) To ensure that applications are up and available, and that the integrity of the data is maintained at all times, despite failures.
(7) To have the capability to distribute processing across multiple processors and machines to achieve incremental scalability. Ideally, the distribution should be automatic and transparent.
(8) To have a highly optimized, minimal-overhead execution engine that delivers real-time response for high-volume applications.

In the last few years, several traditional software technologies, such as In-memory databases and rule engines31, have been repurposed and remarketed to address this application space. In addition, stream processing engines (SPEs) have emerged specifically to support high-volume, low-latency processing and analysis. SPEs perform SQL-style processing on incoming messages (or transactions) as they fly by, without necessarily storing them; where state must be stored, a conventional SQL database can be embedded in the system for efficiency. SPEs use specialized primitives and constructs (e.g., time windows) to express stream-oriented processing logic.
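As a toy illustration of the time-window primitive (a conceptual sketch, not any particular SPE's API), the following Python snippet keeps a sliding 60-second count of events per key as events arrive; the event format, key names, and window length are assumptions.

    # A minimal sliding time-window sketch: count events per key over the last 60 seconds.
    # Real SPEs compile logic like this into optimized, continuously running operators.
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60

    class SlidingWindowCounter:
        def __init__(self, window=WINDOW_SECONDS):
            self.window = window
            self.events = defaultdict(deque)   # key -> deque of event timestamps

        def on_event(self, key, timestamp):
            """Process one event in motion and return the current windowed count."""
            q = self.events[key]
            q.append(timestamp)
            # Expire events that have fallen out of the window.
            while q and q[0] <= timestamp - self.window:
                q.popleft()
            return len(q)

    counter = SlidingWindowCounter()
    stream = [("card_1234", 0), ("card_1234", 10), ("card_9999", 15), ("card_1234", 75)]
    for key, ts in stream:
        print(key, counter.on_event(key, ts))   # e.g., flag a key whose count spikes

Note that nothing is written to disk: the data is analyzed in motion, with only a small amount of per-key state kept in memory.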

The following table shows how the three technology approaches stack up against these requirements.

31 Rule engines date from the 1970s-80s, when systems such as PLANNER, Conniver, and Prolog were initially proposed by the artificial intelligence community.


Requirement                      In-memory DBMS    Rule Engine    Stream Processing
                                 (IMDBMS)          (RE)           Engine (SPE)
Keep the data moving             No                Yes            Yes
SQL on streams                   No                No             Yes
Handle stream imperfections      Hard              Possible       Possible
Predictable outcome              Hard              Possible       Possible
High availability                Possible          Possible       Possible
Stored and streamed data         No                No             Yes
Distribution and scalability     Possible          Possible       Possible
Instantaneous response           Possible          Possible       Possible

Table 2. This table shows how the three different technologies stack up against the 8 aforementioned requirements.

One early example of a real-time stream processing system is Falcon™ Payment Card Fraud Detection, which has been in operation since 1993 [6]. In addition to a real-time rule engine and a mechanism to store state for each individual entity (in this case a payment card user), this system also incorporated state-of-the-art predictive analytics algorithms. In short, it could process transactions on the fly and assign an account-level fraud score to each without ever storing any of the transactions. Though specifically designed and implemented for transactional payment fraud, it satisfied several of these requirements in the context of its specific application.

The ability to do real-time analytics32 on multiple data streams using StreamSQL has been available since 2003. Up until now, StreamSQL has only penetrated relatively small niche markets in financial services, surveillance, and telecommunications network monitoring. However, with the interest in all things Big Data, StreamSQL is bound to get more attention and find more market opportunities.

StreamSQL is an outgrowth of an area of computational research called Complex Event Processing (CEP), a technology for low-latency processing of real-world event data. Both IBM, with InfoSphere Streams, and StreamBase Systems Inc. have products in this space.

    5.1.4 The In-memory Big Data appliance

As the interest in Big Data analytics expands into enterprise and corporate data centers, the vendor community has seen an opportunity to put together Big Data appliances. These appliances are pre-configured: they integrate server (including software), networking, and storage gear into a single enclosure and run analytics software that accelerates information

32 Real-time analytics is finding more and more applications in business. It spans technologies such as in-database analytics, data warehouse MPP (massively parallel processing) appliances, and In-memory analytics.


delivery to users. They typically use in-memory computing schemes such as an In-memory database (IMDB), where all data is distributed across the database nodes' memory for much faster access. These appliances are targeted at enterprise buyers who value the ease of implementation and use inherent in Big Data appliances. Vendors in this space include EMC, with appliances built around the Greenplum database engine, IBM/Netezza, MapR's recently announced commercialized version of Hadoop, SAP (HANA), Oracle (Exalytics), and Teradata, with comparable pre-integrated systems.


    5.2 High Performance Data Mining and Big Data

Data mining is an interdisciplinary field of computer science that focuses on the process of discovering novel patterns from large data sets, using methods at the intersection of artificial intelligence, machine learning, pattern recognition, statistics, and database systems. The main defining characteristic of a data mining system is the large input data sets it must process. Novel patterns can be inferred from these data sets using computations ranging from simple and complex queries, to OLAP, to advanced analytics and machine learning algorithms. Query and OLAP are considered low-end analytics, since their focus is to aggregate, to slice and dice the data, or to compute statistics that, all in all, present new perspectives on the data. To cope with the huge size of the data, especially in analytics-focused data warehouses and applications, distributed computing techniques have been in use for a long time33.

The use of advanced analytics algorithms on very large data has been the core and the main differentiator of data mining. These algorithms and techniques provide the ability to discover, learn, and predict new and novel patterns beyond simple analysis of past and current data. They have enabled businesses to make smarter decisions at the level of individual customers (or entities34) in order to influence their behavior. Use of very large data (millions of records and gigabyte-range tables) is nothing new to data mining algorithms.

With the exponential growth of data, the new era of Big Data, and the continuous drop in CPU and disk prices, corporations can potentially capture petabytes and exabytes of data of all varieties from their interactions with customers and counterparties. A typical data mining analysis may require processing billions of rows and terabyte-range tables. Getting to this scale requires rethinking how existing data mining algorithms and applications are used and deployed. There is no question that data mining systems must do a much better job of separating signal from noise in this sea of data and discovering interesting patterns in a reasonable amount of time to satisfy the business requirements.

    5.2.1 Approaches to Achieve High Performance and Scalability in Data Mining

    Any successful data mining effort is built upon five pillars:

(a) Availability of relevant data,
(b) Business domain knowledge,
(c) Analytic methodology and know-how,
(d) Analytics software (including algorithms) to implement the methodology,
(e) The right platform for development and deployment of results.

33 Teradata, which delivered its first system in 1983, is probably the first commercial implementation of a Shared-Nothing distributed data computing architecture.

34 In the realm of data mining, an entity can be a customer, a household, a merchant, a device (ATM, POS, cell phone, ...), etc.


Until a few years ago, consistent improvements in processor speed, architecture, and memory kept pace with data growth. With the exponential growth of data, however, this equation no longer holds and the gap has been widening. Since the mid-1990s there have been efforts in academic circles to address this gap35. With the success of Hadoop and MapReduce, customers are now demanding that commercial vendors address the Big Data market with new product offerings. When Big Data is involved, it is essential for the solution to be massively scalable, which requires going beyond the normal and traditional ways of doing (c), (d), and (e) above.

The traditional ways of speeding up data mining applications have been:

(T1) Improvements in clock speed and processor performance in general,
(T2) Placement of all the data in memory,
(T3) Chunking, which is typically used to achieve data scalability (the ability to handle data files of any size).

The more modern approaches to increasing analysis speed and scalability are to use parallel and distributed computing techniques:

(N1) Functional partitioning across multiple processing units (cores with shared memory, or computer nodes in a cluster),
(N2) Data partitioning (distribution) across multiple computer nodes, with co-location of the program and the data.

    5.2.1.1 Chunking or Data Partitioning

In computer science, algorithms that are optimized to process external-memory data are referred to as External Memory (EM) algorithms. External memory typically refers to slower storage media such as disk. The problem domains include databases and data management (sorting, merging, transformation, new variable creation, permuting, ...), scientific computing (FFT, ...), machine learning algorithms (many clustering and decision tree algorithms, linear models, maximum likelihood and optimization algorithms, neural networks, ...), text and string processing, geographic information systems, graphs, and computational geometry.

Chunking, or data partitioning (T3 above), is a technique used by External Memory algorithms and was originally developed for single-threaded, single-processor applications. In chunking, a subset of the data is read into the available RAM and processed optimally in memory. This continues with other data subsets being read into RAM and processed sequentially. The results on the chunks are then aggregated to produce the final output of the algorithm. Handling

    35 Many of these efforts have been funded by government agencies.


large data (data scalability) is achieved this way. Many data mining and data analysis tools operating on very large data use this approach. Tools and applications that require loading the whole dataset into internal memory fail to scale (open source R is a good example).
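As a simple sketch of the chunking idea (illustrative only, not a specific product's implementation), the following Python snippet computes the mean of a numeric column in a file of arbitrary size by accumulating partial sums over fixed-size chunks; the file name, column position, and chunk size are assumptions.

    # Chunked (external-memory) computation of a column mean over an arbitrarily large CSV.
    # Only one chunk of rows is ever held in RAM at a time.
    import csv

    CHUNK_ROWS = 100000          # rows per chunk; tune to the available memory
    total, count = 0.0, 0

    with open("transactions.csv", newline="") as f:      # hypothetical input file
        reader = csv.reader(f)
        next(reader)                                      # skip the header row
        chunk = []
        for row in reader:
            chunk.append(float(row[2]))                   # assumed: amount in the third column
            if len(chunk) == CHUNK_ROWS:
                total += sum(chunk)                       # process the chunk in memory
                count += len(chunk)
                chunk = []
        if chunk:                                         # leftover partial chunk
            total += sum(chunk)
            count += len(chunk)

    print("mean amount:", total / count if count else float("nan"))

Because only running sums cross chunk boundaries, the memory footprint stays constant no matter how large the input file grows, which is exactly the data scalability property described above.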

    5.2.1.2 Statistical Query Model (Machine Learning and Advanced Analytics)

In 2006, Chu et al. [7] at Stanford developed a general and exact technique for parallel programming of a large class of machine learning algorithms on multi-core processors. Their central idea was to allow a programmer to speed up machine learning applications simply by throwing more cores at the problem, rather than searching for specialized optimizations of the software or hardware, or using programming languages with parallel constructs (like Orca, Occam, MPI, SNOW, PARLOG). They showed that:

(i) Any algorithm that fits the Statistical Query Model36 (SQM) may be written in a certain summation form without changing the underlying algorithm (so it is not an approximation);
(ii) The summation form can be expressed in the MapReduce framework;
(iii) The technique achieves essentially linear speedup with the number of cores.

They adapted Google's MapReduce paradigm to demonstrate this parallel speedup scheme. They listed the algorithms that satisfy these conditions, which include Linear Regression, Logistic Regression, Naive Bayes, Gaussian Discriminant Analysis, K-Means Clustering, Neural Networks, Principal Component Analysis, Independent Component Analysis, Expectation Maximization, and Support Vector Machines. Their experimental runs showed linear speedups in the number of cores for these machine learning algorithms.
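To illustrate the summation form (a sketch of the general idea, not the authors' code), consider linear regression by least squares: the fit only needs the sufficient statistics X^T X and X^T y, each of which is a sum over data records. Partial sums can therefore be computed independently on each data partition (the "map" step) and added together (the "reduce" step) before solving the normal equations. The synthetic data and the four-way split below are assumptions for illustration.

    # Summation-form linear regression: map computes partial sums per data partition,
    # reduce adds them, and the final solve uses only the aggregated statistics.
    import numpy as np

    def map_partition(X_part, y_part):
        """Per-partition sufficient statistics (could run on a separate core or node)."""
        return X_part.T @ X_part, X_part.T @ y_part

    def reduce_stats(partials):
        """Element-wise sum of the partial statistics from all partitions."""
        XtX = sum(p[0] for p in partials)
        Xty = sum(p[1] for p in partials)
        return XtX, Xty

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10000, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=10000)

    partitions = np.array_split(np.arange(len(y)), 4)     # pretend: 4 cores or nodes
    partials = [map_partition(X[idx], y[idx]) for idx in partitions]
    XtX, Xty = reduce_stats(partials)
    beta = np.linalg.solve(XtX, Xty)                      # identical to the serial fit
    print(beta)

Because the partial statistics add exactly, the parallel result is identical to the serial one; this is what makes the approach exact rather than an approximation.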

    5.2.1.3 Serial Computing Environment

In a serial computing environment (where the algorithm of interest is single-threaded and runs on a single core), placing all the data in memory will naturally speed up processing, especially if the algorithm requires multiple iterations through the data. However, this only works if the data fits completely into fast internal memory (RAM)37. For data mining applications, where datasets are almost always very large, this is rarely the case. Historically, any data mining tool requiring data to fully reside in memory has not been scalable in terms of its ability to handle data of any size. Scalable data mining tools have to continuously

36 Statistical query (SQ) learning is a natural restriction of probably approximately correct (PAC) learning, proposed by Leslie Valiant; it models algorithms that use statistical properties of a data set rather than individual examples.

37 64-bit processors, which have practically unlimited addressable memory space, are constrained only by the size of physical memory. On 32-bit processors, the maximum addressable space was at best 4 gigabytes (typically lower, depending on the OS), regardless of the size of physical memory.


access external memory. The resulting input/output (I/O) overhead between fast internal memory and slower external memory (such as disk) has always been their major performance bottleneck.

Chunking has been a technique for providing data scalability and also for optimizing I/O by taking advantage of locality (caching subsets of the data). Statistical Query Model algorithms can also be implemented in this way to obtain data scalability. Many data mining and data analysis tools operating on very large data use this approach. In summary, (T1) through (T3) are popular techniques that have been applied in these environments. Algorithms deployed in these environments historically benefited from consistent increases in processor speeds until recently. The Power Wall and Big Data are both forcing data mining software developers to use parallel and distributed approaches for further processing speedups.

5.2.1.4 Multiprocessor Computing Environment (Symmetric Multiprocessing or SMP)

Symmetric multiprocessing (SMP38) is a multiprocessor computer hardware architecture in which two or more identical processors are connected to a single shared main memory and are controlled by a single OS instance. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors.

In an SMP environment, which uses shared external storage (external memory), schemes (T1) through (T3) provide speedup39. The real benefit, however, is realized by leveraging the parallelism of the algorithm and mapping it onto the multicore environment using multi-threading40. The latter requires careful analysis and implementation of the algorithms in multi-threaded fashion. If the entire algorithm can be parallelized and multi-threaded, in theory it can be sped up linearly with the number of processors. Typically, however, not all parts of an algorithm are parallelizable, and the maximum speedup follows Amdahl's Law41.

38 NUMA or ccNUMA (cache-coherent Non-Uniform Memory Access) is an enhancement of the SMP architecture that surpasses the scalability of pure SMP. NUMA or ccNUMA could also be used in place of SMP in this environment.

39 There are In-memory relational databases, such as Oracle TimesTen, designed for low-latency, high-volume data, transaction, and event management. An In-memory database (IMDB) typically uses an SMP architecture and stores the whole database in shared memory. As such, the database size is limited by the size of RAM.

40 The term multi-threaded is used for software systems that create and manage multiple points of execution simultaneously. Compared to a process, a thread is lightweight because its state can be preserved by a set of CPU registers and a stack, making thread switching very efficient. Memory and other resources are shared among threads in the same process.

41 Amdahl's Law models the expected speedup of a parallelized implementation of an algorithm relative to the serial algorithm, under the assumption that the problem size remains the same when parallelized. More technically, the law is concerned with the speedup achievable from an improvement to a computation that affects a proportion P of that computation, where the improvement has a speedup of S. The overall speedup is then 1 / ((1 - P) + P/S).


It is important to realize that External Memory algorithms can be automatically parallelized and distributed. Chunking thus provides a way for data subsets to be processed in parallel on different cores at the same time: all cores access the same shared storage (typically disk) but operate on their own subset of data. Statistical Query Model algorithms are ideal candidates for this environment and can take advantage of parallel processing on multicore. Many programming frameworks exist for implementing Statistical Query Model algorithms; Google's MapReduce, which is a functional programming construct, is one way to implement them in a multicore environment. Refer to [7] for details.

Figure 2: Symmetric Multi-Processing (SMP) architecture, where all processor cores share all the resources, including memory and storage. These systems can be highly optimized in terms of hardware and software and can be very expensive.

5.2.1.5 Distributed Computing Environment (Cluster Computing) with Shared Storage

A computer cluster is a group of linked computers, each with its own local memory and OS instance, working together so closely that they resemble a single computer. Speeding up data mining algorithms in this environment is in essence similar to SMP, with the difference that each compute node is a separate computer with its own resources. The cluster nodes share only the external memory (disk), and any scheme used to parallelize an algorithm should manage and account for the communication between the cluster compute nodes. On each node, (T1) and (T2) can be leveraged for speedup. In this environment, data lives on shared storage and is not distributed at rest across the cluster compute nodes. Similar to SMP, cluster (or grid) computing is mainly geared toward functional partitioning (N1) (distributing programs across nodes and load balancing them), not data partitioning.



Figure 3: Distributed computing architecture using commodity computing nodes with shared storage. In this scheme each node has its own resources, with the exception of the shared storage where the data to be processed resides. Typically, one node is dedicated as the Master or Queen node and the rest are assigned as Slave or Worker nodes.

    5.2.1.6 Shared-Nothing Distributed Computing Environment

A Shared-Nothing (SN) architecture is a parallel, distributed computing architecture in which each compute node is independent and self-sufficient; that is, none of the nodes share any resources, including internal memory or external memory (disk storage). As such, there is no single point of contention across the system. SN architectures are highly scalable and have become prevalent in the data warehousing space42 and in web development, where they are popular because of their sheer scalability: they can scale almost without limit by adding inexpensive computers as nodes. With distributed partitioning of the data (this is called sharding), each compute node can operate in parallel on its own data subset, locally and independently. In this environment, data is distributed at rest. The key to optimal parallelism here is the co-location of the data and the program, which is ideal for EM and SQM algorithms. Since the mid-1990s, when the use of commodity servers as compute nodes became popular in academia, there has been a consistent trend toward clusters of commodity computers in commercial applications, including databases and data analysis applications.

42 There are computer cluster architectures for data warehouses that are "shared everything," meaning every node in the cluster has access to the resources of the other nodes. Sybase IQ (a part of SAP AG), which is a columnar data warehouse, is an example of such an architecture, where the nodes are connected through a full-mesh, high-speed interconnect and the storage is shared among all nodes.
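To make the sharding and co-location idea concrete, here is a small, purely illustrative Python sketch: records are hash-partitioned by key so that each "node" (simulated here by a worker process) holds and processes only its own shard, and the master simply merges the per-shard results. The record layout, key names, and the use of multiprocessing as a stand-in for real nodes are assumptions.

    # Sharding sketch: hash-partition records by key, process each shard locally, merge results.
    # Worker processes stand in for shared-nothing compute nodes that own their data at rest.
    from multiprocessing import Pool

    NUM_NODES = 4

    def shard_for(key):
        return hash(key) % NUM_NODES

    def local_aggregate(shard):
        """Runs 'on the node' that owns this shard: sum amounts per customer locally."""
        totals = {}
        for customer, amount in shard:
            totals[customer] = totals.get(customer, 0.0) + amount
        return totals

    if __name__ == "__main__":
        records = [("cust_a", 10.0), ("cust_b", 5.0), ("cust_a", 2.5), ("cust_c", 7.0)]

        # Distribute data at rest: each shard holds only the keys hashed to it.
        shards = [[] for _ in range(NUM_NODES)]
        for rec in records:
            shards[shard_for(rec[0])].append(rec)

        with Pool(NUM_NODES) as pool:
            partials = pool.map(local_aggregate, shards)

        # Because a given key lives on exactly one shard, merging is a simple union.
        merged = {}
        for part in partials:
            merged.update(part)
        print(merged)

The design point is that all per-key work happens where the data already resides, so no data moves during processing and the merge step at the master is trivially small.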


Figure 4: A depiction of a Shared-Nothing (SN) distributed computing architecture using commodity computing nodes. As in Figure 3, each node has its own local resources, including storage, and is in full control of those resources. In this scheme the data is also distributed across the nodes on their local storage, i.e., data is distributed at rest. Each node operates only on its local data and reports its processing results back to the Master or Queen node.

    5.2.1.7 In-memory Distributed Computing Environment

In a Shared-Nothing distributed computing environment, partitioned, distributed data can be placed in the local memory of each compute node instead of on its local disk, resulting in a significant improvement in processing speed. In-memory distributed databases are the latest innovation in database technology43.

IT organizations dedicate significant resources to building data structures (either through denormalized schemas or OLAP programming) that provide acceptable performance for analytics applications. Even then, maintaining these structures can introduce formidable challenges. The most obvious advantage of In-memory technology, then, is the improved performance of analytics applications. Equally beneficial is the simplicity of In-memory solutions, which eliminate arduous OLAP requirements: OLAP technologies require significant expertise in building dimensions, aggregates, and measures, and staff typically need skills in SQL development and multi-dimensional programming languages such as MDX. The result is the delivery of enterprise reporting and analytics with less complexity and simpler application maintenance compared to traditional OLAP solutions.

    In addition to low end analyt