
Proceedings of the 2012 International Conference on Machine Learning and Cybernetics, Xian, 15-17 July, 2012

REAL-TIME ANALYTICS PROCESSING WITH MAPREDUCE

CHENG-ZHANG PENG1, ZE-JUN JIANG1, XIAO-BIN CAI2, ZHI-KE ZHANG1

1 School of Computer Science and Technology, Northwestern Polytechnical University, Xi'an, China; 2 China Aerospace Science and Technology Corporation

E-MAIL: {abelard2009.zhangzhike}@[email protected]@gmail.com

Abstract:
Many real-time analytical applications over massive data streams are built by introducing a specific stream processing core (SPC). In general, these SPCs have not been adopted by enterprises as widely as MapReduce, even though real-time analytics applications are attracting more and more attention. To reverse this tide, we developed a new analytics system. Our system modifies the stock Hadoop MapReduce programming model and execution framework, uses a Chord overlay for temporary data, and uses Cassandra as persistent storage. With our system, data stream processing applications can be developed with the familiar MapReduce programming model.

Keywords: Real-time analytics; Data stream processing; MapReduce;

Cassandra; JOL

1. Introduction

Today, wireless terminal devices and personal computers are found everywhere, continuously connected to the Internet, and the character of user-generated data has changed: it has become more and more real-time. These changes have made people more willing to share their thoughts and discuss breaking news than in any previous age, and social networks such as Sina Weibo, Twitter, Google+ and Facebook are the best platforms for doing so. At the same time, these social networks easily accumulate tens of petabytes of data about hot blogs, user behavior, users' social context, and so forth. Analyzing this information in real time can yield huge commercial benefit and avoid commercial loss.

The above description implies that data stream management systems (DSMS) will again attract attention in the data analytics area. Several DSMS products (e.g., STREAM [1], Aurora [2], Borealis [3], TelegraphCQ [4]) were applied successfully to the following data streams: financial tickers, performance measurements in network monitoring and traffic management, log records or click-streams in web

978-1-4673-1487-9/12/$31.00 ©2012 IEEE

tracking and personalization, manufacturing processes, data feeds from sensor applications, call detail records in telecommunications, email messages, and others. But none of these systems completely meets these social networks' data analytics requirements, so MapReduce [5] is naturally considered a possible choice.

Of course, there have been many discussions about whether MapReduce can be applied to real-time analytics. The answer was very clear: the stock MapReduce implementations in Hadoop [6] and at Google are mainly suitable for static bulk-processing tasks.

After carefully studying the existing implementations of MapReduce, we found several reasons to think that they are unsuitable for a DSMS; these will be discussed in detail in Section 2.1. But if we modify the programming model and execution framework of MapReduce and choose an appropriate storage model for it, we obtain a new system satisfying the requirements for real-time analytics of social networks' data.

To verify this idea, we have developed a prototype system based on the MapReduce implementation in Hadoop.

Recently, several systems have emerged to support real-time analytics. Naiad [7] and Google Percolator [8] extend a distributed database, Google BigTable, but give no details about data input and processing, which are important for such a system; Twitter Storm [9] is an open-source distributed real-time computation system developed from scratch.

The remainder of this paper is organized as follows. In Section 2 we introduce background material about MapReduce, in particular its implementation in Hadoop, and discuss in Section 2.1 why it is ill-equipped for real-time analytics; we then describe our system architecture in Sections 2.2, 2.3 and 2.4. We summarize our work and outline future work in Section 3.



2. System

In this section we discuss the stock Hadoop MapReduce, mainly the aspects that make it ill-equipped for real-time analytics, in Section 2.1. The modified MapReduce programming model is described in Section 2.2. The Chord [10] implemented on JOL [12] for the execution framework and the persistent storage for the system are discussed in Sections 2.3 and 2.4, respectively.

2.1. Hadoop MapReduce

MapReduce has become an increasingly popular way to utilize the power of large clusters of computers. MapReduce allows developers to think in a data-oriented fashion: they focus on what computations need to be performed, as opposed to how those computations are actually carried out or how the data reach the processes that depend on them, so that the details of distributed execution, network communication, coordination and fault tolerance are handled by the MapReduce framework.

Hadoop MapReduce, implemented as a one-master, multi-slave distributed computation platform, is primarily designed for batch processing over large data sets. To the extent possible, all computations are organized into long streaming I/O operations that make the best use of the total bandwidth of the many disks in a cluster.

For a batch job, before the MapReduce application runs, the data to be processed are first stored in Hadoop HDFS, which is specifically adapted to large-data processing. At run time, the MapReduce master node first divides the data into multiple splits (aligned on 64 MB boundaries), produces key-value pairs, and records the generated meta-information. If the user does not set the number of map and reduce tasks, the master node automatically calculates it with a hash algorithm from this meta-information.

In the second step, the slave nodes, also called data nodes, apply user-specified map functions to analyze the data designated by the meta-information. After mapping, the intermediate data enter the shuffle and sort phase supplied by MapReduce; the sorted data for each key then enter the reduce phase, where user-defined reduce functions are called. Finally, the output of reduce is stored into the appropriate directory in HDFS.

Of course, complex jobs may need to be divided into multiple jobs, each of which must follow the above steps.
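The batch flow described above can be illustrated with a minimal single-process word-count sketch. This is not the Hadoop API; all class and method names are illustrative only, and the three functions stand in for the map, shuffle/sort, and reduce phases of one job.

```java
import java.util.*;

// Minimal single-process sketch of the batch map -> shuffle/sort -> reduce
// flow. NOT the Hadoop API; names are illustrative only.
public class BatchWordCount {

    // Map phase: emit (word, 1) for every token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.split("\\s+")) {
            if (!w.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return out;
    }

    // Shuffle/sort phase: group intermediate values by key, in sorted key order.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // Reduce phase: sum the grouped values for each key.
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> groups) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> inter = new ArrayList<>();
        inter.addAll(map("stream processing with mapreduce"));
        inter.addAll(map("mapreduce for stream analytics"));
        System.out.println(reduce(shuffle(inter)));
        // {analytics=1, for=1, mapreduce=2, processing=1, stream=2, with=1}
    }
}
```

In the real framework the three phases run on different nodes and the grouping is done by partitioned network transfer, but the data flow is the same.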

For real-time analytics, the data to be handled arrive continuously from external applications such as the repositories of Sina Weibo, Twitter and Facebook, network monitoring, and so on, and users are mainly interested in the results of analyzing the latest data. Accordingly, Hadoop MapReduce is unsuited to this kind of data stream processing in at least two respects: the data are generated dynamically, and only part of the data needs to be analyzed.

Furthermore, such real-time data streams are time-correlated, which means that values belonging to the same key cannot be aggregated if their time stamps differ. The stock MapReduce shuffle and sort phase therefore needs to be modified in our system.

2.2. Programming Model for Real-Time Processing

As discussed in Section 2.1, to support real-time analytical applications our system modifies Hadoop MapReduce's programming model as described below in detail. The modified model is shown in Figure 1.

In Figure 1 there is no shuffle and sort phase. We modify the stock programming model to push the intermediate data to the appropriate reduce task without shuffle-and-sort handling. In particular, we extended the input key-value model through the map and reduce phases as follows:

map: (<k1, t1>, v1) → list(<k2, t1>, v2)
reduce: (<k2, t1>, <r1, r2, ..., rn>) → (<k2, t1>, r)

The difference between map and reduce is that multiple maps will be executed in parallel for the same key (here <k1, t1>), while the execution of reduce has to be synchronous for the same key and time stamp to ensure a correct result.

The modified key-value model carries the time stamp needed for real-time data streams, so that in Figure 1 only values with the same key and the same time stamp are aggregated.
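The aggregation rule above can be sketched with a composite <key, timestamp> record: values are merged only when both components match. This is a toy sketch of the model, not the paper's implementation; all names are illustrative.

```java
import java.util.*;

// Sketch of the modified key-value model: intermediate pairs carry a
// (key, timestamp) composite key, and values are aggregated only when
// both the key AND the time stamp match. Illustrative names only.
public class TimestampedReduce {

    // Composite key <k, t> used through the map and reduce phases.
    record KeyTs(String key, long ts) {}

    // Group values strictly by (key, timestamp); there is no global sort phase.
    static Map<KeyTs, Integer> aggregate(List<Map.Entry<KeyTs, Integer>> pairs) {
        Map<KeyTs, Integer> sums = new HashMap<>();
        for (Map.Entry<KeyTs, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Map.Entry<KeyTs, Integer>> pairs = List.of(
            Map.entry(new KeyTs("clicks", 100L), 1),
            Map.entry(new KeyTs("clicks", 100L), 1),
            Map.entry(new KeyTs("clicks", 200L), 1)); // same key, later time stamp
        Map<KeyTs, Integer> out = aggregate(pairs);
        System.out.println(out.get(new KeyTs("clicks", 100L))); // 2
        System.out.println(out.get(new KeyTs("clicks", 200L))); // 1
    }
}
```

The same key with a different time stamp stays in a separate group, which is exactly the behavior the modified model requires.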

In addition, real-time analytics needs a push-style map. Hadoop MapReduce provides this function, so we did not modify the map function.

Figure 1. The modified MapReduce programming model



2.3. Execution Framework for Real-Time Processing

In the execution framework of the stock Hadoop MapReduce, HDFS provides a guarantee for massive-scale data sets. For real-time data processing, however, we must replace the storage provided by HDFS with a key-value store meeting the following requirements:

(1) The input data can be horizontally partitioned and replicated across the Hadoop MapReduce data nodes.
(2) The input, intermediate and output data can be stored in memory under user control.
(3) The key-value storage system can recover from node failures.
(4) Data consistency can be guaranteed through configuration.

The key-value storage system discussed above is mainly designed for the input and not-yet-handled data, which are stored in memory before map processing. This is unlike the stock Hadoop MapReduce, which generates key-value pairs for the data in HDFS before analytical processing. For a continuously arriving data stream, the best approach is to form the appropriate key-value pairs on arrival, ready for map processing.
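Forming key-value pairs on arrival can be sketched as a small handler that parses each incoming record and pushes it straight to a map-side consumer. The "timestamp,key,value" field layout is an assumption made for illustration; nothing in the paper fixes the wire format.

```java
import java.util.*;
import java.util.function.Consumer;

// Sketch of forming key-value pairs as stream items arrive, instead of
// pre-materializing them in HDFS. The record layout "timestamp,key,value"
// is an assumed format for illustration only.
public class ArrivalKeyValue {

    record Pair(String key, long ts, String value) {}

    // Parse one arriving record and push it straight to the map-side consumer.
    static void onArrival(String raw, Consumer<Pair> mapInput) {
        String[] f = raw.split(",", 3); // [timestamp, key, value]
        mapInput.accept(new Pair(f[1], Long.parseLong(f[0]), f[2]));
    }

    public static void main(String[] args) {
        List<Pair> delivered = new ArrayList<>();
        onArrival("1000,user42,clicked", delivered::add);
        onArrival("1001,user7,viewed", delivered::add);
        System.out.println(delivered.size()); // 2
    }
}
```

The point of the sketch is that no separate load-into-HDFS step exists: the pair is ready for map processing the moment the record arrives.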

We implemented Chord [10], a technique employed in many distributed storage systems, based on JOL [12], a Java implementation of Overlog, to meet the above requirements. Using JOL, we can express the input data as tuples in ordinary tables and event tables. When input data arrive, the following steps are executed. First, the nodes in the Chord ring are contacted to determine which node is selected, and the data are then replicated to the calculated nodes. Second, one tuple among the replicas is selected and inserted into the map-task event table implemented in JOL, which in turn triggers MapReduce to execute the map function.
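The node-selection step can be illustrated with a minimal consistent-hashing sketch: keys and nodes are hashed onto a ring, a key is placed on its clockwise successor, and replicas go to the next successors. This shows only the idea behind Chord [10]; it is not the paper's JOL implementation, and the ring size and hash are assumptions.

```java
import java.util.*;

// Minimal consistent-hashing sketch of Chord-style node selection.
// Illustrative only; not the paper's JOL-based implementation.
public class RingSelect {

    // Position on a small ring of size 2^16, derived from hashCode (assumed hash).
    static int pos(String id) {
        return Math.floorMod(id.hashCode(), 1 << 16);
    }

    // Return the primary node (clockwise successor of the key) and
    // (replicas - 1) further successors for replication.
    static List<String> nodesFor(String key, List<String> nodes, int replicas) {
        List<String> sorted = new ArrayList<>(nodes);
        sorted.sort(Comparator.comparingInt(RingSelect::pos));
        int k = pos(key);
        int i = 0;
        while (i < sorted.size() && pos(sorted.get(i)) < k) i++; // clockwise successor
        List<String> chosen = new ArrayList<>();
        for (int r = 0; r < replicas && r < sorted.size(); r++) {
            chosen.add(sorted.get((i + r) % sorted.size())); // wrap around the ring
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node-a", "node-b", "node-c", "node-d");
        System.out.println(nodesFor("tuple-123", nodes, 2)); // primary + 1 replica
    }
}
```

Real Chord also maintains finger tables for O(log n) lookup and handles churn; the sketch keeps only the placement rule the text relies on.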

Figure 2 shows our system architecture, which contains all the important components of any streaming application:

External data streams that enter the system, shown at the far left;

Processing requests, called queries, shown at bottom left and right;

External output destinations to which the system pushes results, shown as "Persistent Storage" at bottom left, or outside the system (not shown here).

Explaining every step in Figure 2 is beyond the scope of this paper; we discuss only the components and steps introduced for real-time data stream processing.

[Figure 2 diagram: a JobTracker exchanging heartbeats with TaskTracker nodes, job initialization and resource management, launching of map and reduce tasks, and the data stream and persistent storage system.]

Figure 2. MapReduce for real-time analytics

Data Stream Manager: all external data streams sink into this component. We used JOL to implement a Chord inside it; the incoming data streams are divided into two types, queries and data, which are horizontally partitioned and replicated onto nodes through Chord. Data that need to be stored persistently are placed into the "Persistent Storage" component. After this processing, the concrete query operations, rigorously ordered in time, are pushed into the map task queues. The query items in a map task queue are event-driven and executed synchronously.
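The routing step of the Data Stream Manager can be sketched as classifying each incoming item as a query or data, with queries held in time-stamp order before being pushed to the map task queue. The "Q:"/"D:" prefix convention and all names here are assumptions for illustration.

```java
import java.util.*;

// Sketch of the Data Stream Manager's routing step: incoming items are
// split into queries and data, and queries are kept rigorously ordered by
// time stamp. The "Q:"/"D:" prefixes are an assumed tagging convention.
public class StreamManager {

    record Item(long ts, String body) {}

    // Queries ordered by time stamp.
    final PriorityQueue<Item> queryQueue =
        new PriorityQueue<>(Comparator.comparingLong(Item::ts));
    final Deque<Item> dataQueue = new ArrayDeque<>();

    void route(long ts, String payload) {
        Item item = new Item(ts, payload.substring(2)); // strip the tag
        if (payload.startsWith("Q:")) queryQueue.add(item);
        else dataQueue.add(item);
    }

    public static void main(String[] args) {
        StreamManager mgr = new StreamManager();
        mgr.route(30, "Q:count clicks");
        mgr.route(10, "Q:top urls");
        mgr.route(20, "D:url=/home");
        System.out.println(mgr.queryQueue.poll().body()); // earliest query: top urls
    }
}
```

In the real system the two kinds of items are additionally partitioned and replicated through Chord before queueing; the sketch shows only the classification and time ordering.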

JOL is a Java implementation of Overlog, a descendant of Datalog consisting of a set of declarative rules and optional queries. A rule has the form p :- q1, q2, ..., qn, which informally reads "if q1 and q2 and ... and qn are true, then p is true." JOL compiles Overlog programs into pipelined dataflow graphs of operators, and provides meta-programming: each Overlog program is compiled into a representation that is captured in rows of tables. To implement Chord, we wrote 50 rules and fewer than 500 lines of Java code. All map and reduce tasks are distributed and replicated to the nodes as shown in Figure 3.
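The rule form p :- q1, q2, ..., qn can be made concrete with a toy forward-chaining step over propositional facts: a head is derived once every body fact is present, repeating until nothing new is derived. This is a pedagogical sketch of Datalog-style evaluation, not JOL's dataflow compilation.

```java
import java.util.*;

// Toy forward-chaining sketch of the rule form "p :- q1, q2, ..., qn":
// the head p is derived once every body fact qi is known. Not JOL.
public class RuleStep {

    record Rule(String head, List<String> body) {}

    // Apply all rules repeatedly until a fixpoint (no new facts) is reached.
    static Set<String> derive(Set<String> facts, List<Rule> rules) {
        Set<String> known = new HashSet<>(facts);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Rule r : rules) {
                if (known.containsAll(r.body) && known.add(r.head)) changed = true;
            }
        }
        return known;
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
            new Rule("p", List.of("q1", "q2")),
            new Rule("r", List.of("p"))); // chains off the derived p
        Set<String> out = derive(new HashSet<>(List.of("q1", "q2")), rules);
        System.out.println(out.contains("p") && out.contains("r")); // true
    }
}
```

JOL instead compiles such rules into pipelined operators and evaluates them incrementally over tables and events, but the derivation semantics the sketch shows is the same.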



2.4. Persistent Storage

In a distributed computation system node failures are not uncommon, but they can prevent some queries from being executed or cause some incoming data to be lost, so persistent storage is necessary. Our system chooses Cassandra as its persistent key-value storage; some

Figure 3. Chord for key-value storage

incoming data/queries, intermediate results and output data, as well as important state such as the windows of streaming computations, are stored into it according to the calculations of the "Data Stream Manager" component. When a single-node failure occurs, the system continues processing the unfinished queries or other procedures by retrieving the dependent data, queries or state on demand.
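The recovery idea can be sketched as write-through persistence: every accepted item is written to the persistent store before processing, and after a failure the items that were accepted but never finished are replayed. The in-memory map below stands in for Cassandra; all names are illustrative assumptions.

```java
import java.util.*;

// Sketch of write-through persistence with replay after a node failure.
// The LinkedHashMap stands in for Cassandra; illustrative names only.
public class WriteThroughRecovery {

    final Map<Long, String> persistent = new LinkedHashMap<>(); // stand-in for Cassandra
    final Set<Long> finished = new HashSet<>();

    // Persist first, process elsewhere; mark done only on completion.
    void accept(long id, String item) { persistent.put(id, item); }
    void markFinished(long id)        { finished.add(id); }

    // After a failure, recover the items that were accepted but never finished.
    List<String> replayUnfinished() {
        List<String> pending = new ArrayList<>();
        for (Map.Entry<Long, String> e : persistent.entrySet()) {
            if (!finished.contains(e.getKey())) pending.add(e.getValue());
        }
        return pending;
    }

    public static void main(String[] args) {
        WriteThroughRecovery node = new WriteThroughRecovery();
        node.accept(1, "query-A");
        node.accept(2, "query-B");
        node.markFinished(1);
        // A worker crash and restart: state survives in the persistent store.
        System.out.println(node.replayUnfinished()); // [query-B]
    }
}
```

A real deployment would also persist the completion marks and window state so that recovery survives the manager itself restarting.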

3. Conclusions and future work

In summary, we demonstrated a prototype real-time data stream analytics system built by modifying the stock Hadoop MapReduce programming model, implementing a Chord for distributing and replicating data and query events in the execution framework, and replacing HDFS with Cassandra as the key-value storage. In particular, the Chord was implemented on JOL, a rule-based language runtime. JOL adapts a powerful query language to a distributed context of data and messages, which are simply expressed as the tables and streams needed by our system as described above, and it builds an event-driven control mechanism and built-in networking procedures into every node.

Our prototype system is still under active development. Future work includes evaluating its performance with the popular Linear Road benchmark [13].

Acknowledgment

This paper is supported by the Aeronautical Science Foundation of China under Nos. 2010ZD53 and 2009ZD53044, the Natural Science Foundation of Shaanxi Province under Nos. 2009JM8017, 2009JQ021 and 2010JM8023, and the Seed Fund of Northwestern Polytechnical University under No. Z2011138.

REFERENCES

[1] Motwani, R., Widom, J., et al., "Query processing, resource management, and approximation in a data stream management system", Proceedings of the 2003 CIDR Conference, pp. 245-256, Jan 2003.

[2] Abadi, D. J., Carney, D., et al., "Aurora: A Data Stream Management System", In ACM SIGMOD Conference, June 2003.

[3] Abadi, D. J., Ahmad, Y., et al., "The design of the Borealis stream processing engine", In CIDR, pp. 277-289, 2005.

[4] Chandrasekaran, S., Cooper, O., et al., "TelegraphCQ: Continuous dataflow processing for an uncertain world", Proceedings of the 2003 CIDR Conference, pp. 269-280, 2003.

[5] Dean, J., Ghemawat, S., "MapReduce: Simplified data processing on large clusters", In OSDI, pp. 137-150, 2004.

[6] Welcome to Hadoop MapReduce! http://hadoop.apache.org/mapreduce/

[7] Grinev, M., Grineva, M., et al., "Analytics for the real-time web", In PVLDB, 4(12):1391-1394, September 2011.

[8] Peng, D. and Dabek, F., "Large-scale incremental processing using distributed transactions and notifications", In OSDI '10, pp. 251-264, 2010.

[9] Marz, N., "A Storm is coming", http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html, August 2011.

[10] Stoica, I., Morris, R., et al., "Chord: A scalable peer-to-peer lookup service for internet applications", In Proceedings of the ACM SIGCOMM Conference on Data Communication, 11(1):17-32, 2001.

[11] Condie, T., et al., "Online aggregation and continuous query support in MapReduce", In ACM SIGMOD Conf., pp. 1115-1118, 2010.

[12] Alvaro, P., Condie, T., et al., "BOOM Analytics: exploring data-centric, declarative programming for the cloud", In EuroSys '10: Proceedings of the 5th European Conference on Computer Systems, pp. 223-236, 2010.

[13] Arasu, A., Cherniack, M., Galvez, E., Maier, D., Maskey, A., Ryvkina, E., Stonebraker, M., and Tibbetts, R., "Linear Road: A Stream Data Management Benchmark", In VLDB Conference, Sept. 2004.