Approaches for Implementation of Data Analysis Using MapReduce on Mobile Platform: An Experimental Study
1 Madhavi Vaidya and 2 Shrinivas Deshpande
1 Department of Computer Science, VES College, Mumbai, India.
2 Department of Computer Science, HVPM, Amravati, India.
Abstract. The computing world is undergoing a significant change: from traditional non-centralized distributed architectures, in which parallel and pseudo-distributed nodes are scattered across different geographic areas, to centralized cloud computing architectures in which data transformation and computation may be carried out on any node. Data centres owned and maintained by a third party, or a cloud, can be built and maintained using a number of physical machines. These machines can be of different configurations, or can be virtual machines communicating with each other over a shared LAN. It has been reported in the literature that there is always a difference in performance when a MapReduce program is run on virtual machines versus physical bare-metal machines.
The same is elaborated here by considering a use case in which the data generated from a mobile platform is taken into consideration. In our case, to run the program, we configured a mini-cloud on a LAN. The outcome of analysing the data generated from mobile platforms with a MapReduce program has been tested on various infrastructures, namely a single virtual machine, a cluster of virtual machines, and the mini-cloud, over various distributed file systems.
International Journal of Pure and Applied Mathematics, Volume 118, No. 18, 2018, 1221-1232. ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version). url: http://www.ijpam.eu (Special Issue)
1. Introduction
Distributed file systems have been developed to support a style of data access that is write-once, high-throughput, and parallel-streaming. These include I/O subsystems such as Glusterfs, Zfs, Ceph and the Hadoop Distributed File System (HDFS).
In this paper, a study has been performed to show how these distributed file systems work on a mini-cloud maintained on a few machines in a small setup. A comparison is elaborated between running the program on a single virtual machine and on a cloud of a few machines arranged in a LAN.
The paper is organized as follows: a few distributed file systems are studied in this introductory section. Section II presents the literature study. Section III elaborates on the architecture and implementation issues of those distributed file systems. Section IV describes the working of the MapReduce architecture on distributed file systems and the role of MapReduce on those file systems. Section V provides the discussion, analysis and results. Finally, Section VI closes with concluding remarks.
The MapReduce software framework and its open-source implementation, Hadoop, are commonly deployed in the cloud to exploit the availability of virtualized resources and the pay-per-usage cost model of cloud computing. Instead of using a pay-per-use cloud model, we have developed a cloud on a few machines in a local area network. MapReduce is a software framework for parallel data-intensive computations popularized by Google [1]. Several implementations are available besides Google's original system, but the most popular one is Hadoop [2], developed by Apache [3].
We compare the results reported in the literature for a single VM, a cluster of VMs, and a cloud of physical bare-metal machines, performed experimentally on a LAN. The "computing machines" on which the operations are performed are either physical, i.e. bare-metal, machines or virtual machines (VMs) running on a physical machine. Many virtual machines can run as "guests" on a real "host" machine, using hypervisor software.
A cloud typically comprises inter-connected, virtualised computers coupled with security and a programming model. From an application developer's perspective, the programming model and its performance are undoubtedly the main criteria for selecting a cloud environment. A private cloud is a collection of virtualized physical hardware with added services, such as catalogues of software or defined platforms, that a customer can control.
The primary goal of virtualizing master and slave servers is the same: to maximize operability and utilize existing hardware. Virtualisation is a software technology that allows a machine to run different operating systems on the same physical computer. A computing environment consists of compute nodes, a network to connect them, and software to run on them. However, the strategy for virtualizing the Hadoop master servers differs from that for virtualizing Hadoop slave servers, depending on the settings applied to the machines [4][5]. Among the findings in the literature, EC2 also provides bigger VM sizes that have lower variance in I/O performance than the light VMs, possibly because they fully own a disk [6]. The same behaviour is experienced when the mini-cloud is formed on the machines in the LAN.
A virtualized environment has its own benefits [7]: virtualization can provide higher hardware utilization by consolidating multiple Hadoop machines into a cluster alongside other workloads on the same physical cluster. It has been noted in the literature that the development environment is a place where containers might work [8]; as they require fewer hardware resources as overhead, they typically cost less than VMs. Another benefit is that they provide great flexibility.
So the choice between a physical bare-metal environment and a virtual machine for a Hadoop installation is elaborated here, based on the performance of MapReduce programs for various types of files.
This study focuses on optimizing MapReduce programs on Hadoop on a single virtual machine, a cluster of "small" VMs, and physical bare-metal machines to obtain the best performance. It has been reported in the literature that optimizing Hadoop on "small" VMs can give the best performance [9].
As noted above, distributed file systems such as Glusterfs, Zfs, Ceph and the Hadoop Distributed File System (HDFS) support this write-once, high-throughput, parallel-streaming style of data access. Hadoop has been implemented on the mini-cloud of physical bare-metal machine nodes, and on a single node where Glusterfs, Ceph and Zfs serve as the underlying distributed file systems; Hadoop is connected to them using various plugins, such as the glusterfs-hadoop plugin [10] and the ceph-hadoop plugin [11] (cephfs-hadoop.jar), which have been configured appropriately. It has been observed from our implementation that Ceph works satisfactorily with both Hadoop 1.x and 2.x.
2. Literature Study
One scenario that works well is a highly static environment with a very large cluster. Workloads that require a few clusters with long life spans are good candidates for a bare-metal deployment, since the absence of virtualization overhead may provide performance benefits. Besides, all of the flexibility of a virtualized environment would go to waste [12]: when a VM is requested, its performance may vary from the performance obtained on physical data centres. This can be due to the configuration of the machines, or to other workloads.
Virtualization can provide higher hardware utilization by consolidating multiple Hadoop clusters and other workloads on the same physical cluster. Resource management for such a large number of nodes is quite difficult from the aspects of configuration, deployment and efficient resource utilization. By deploying virtual machines (VMs), Hadoop management becomes much easier. Amazon has already released Hadoop on a Xen-virtualized environment as Elastic MapReduce [13].
Virtual machines are requested on demand from the infrastructure: the machines can be allocated anywhere in the infrastructure, possibly on servers already running other VMs. Hadoop can thus be brought into virtualized infrastructures with ample benefits: a single image can be cloned, which lowers operation costs; clusters can be set up on demand; and the cluster size can be contracted or expanded on demand [14]. Hadoop is designed on the fundamental principle of distributing computing power to where the data is, rather than moving the data, as is done in traditional parallel and distributed computing systems using MPI (Message Passing Interface). Hadoop is an extremely scalable and highly fault-tolerant distributed infrastructure, and it automatically handles parallelization, load balancing and data distribution [15]. HDFS is capable of storing excessive volumes of data and providing fast access to the stored data. A Hadoop cluster works on a master-slave architecture in which one node is the master node and the remaining nodes are slave nodes. The master node, known as the NameNode, controls the slave nodes, known as DataNodes. The slave nodes hold all the data blocks and perform the map and reduce tasks [16].
3. Study of Architecture and
Implementation Issues
Hadoop
Hadoop is an Apache open source project. Its architecture supports the storage, transformation, and analysis of very large datasets. Hadoop is a massively parallel processing solution, characterized as a massive I/O platform. "Massive" is a relative term that changes from specification to specification and depends on the infrastructure used.
Two critical components of Hadoop are the Hadoop distributed file system
(HDFS)[17], and the MapReduce algorithm. Both the components allow
efficient parallel computation on very large datasets. HDFS is based on Google
File System i.e. GFS.
The various DFS mentioned below have also been installed with HDFS and
MapReduce programs were executed for various use cases.
Glusterfs
Glusterfs is an open source system. It is designed to provide a scale-out architecture for both performance and capacity, and is able to scale up or down. Glusterfs operates in user space, which makes installing and upgrading it significantly easier. Storage servers can be added or removed on the fly, with data automatically rebalanced across the cluster. Glusterfs locates data algorithmically on the basis of the path name and the file name: any storage node, and any client requiring read or write access to a file in a Glusterfs storage cluster, performs a mathematical operation that calculates the file location [18]. Experimental work on the installation and configuration of Glusterfs was found to be smooth [10].
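The idea of locating a file purely by computation can be illustrated with a simplified sketch. This is not Glusterfs's actual elastic hashing algorithm, whose details differ, but it shows why no central metadata server is needed: every node hashes the path and arrives at the same brick.

```python
import hashlib

def locate_brick(path, bricks):
    """Deterministically map a file path to a storage brick by hashing.
    Simplified illustration; not the real Glusterfs elastic hash."""
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return bricks[int(digest, 16) % len(bricks)]

bricks = ["server1:/brick1", "server2:/brick2", "server3:/brick3"]
# Every client and server computes the same location for the same path,
# so file lookup requires no metadata-server round trip.
print(locate_brick("/logs/trackster/2018-01-01.log", bricks))
```

Because the placement is a pure function of the path, rebalancing after adding or removing a brick is the only time data must move.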
The recent introduction of the Glusterfs Hadoop plugin (an HDFS interface) enables Hadoop applications to take advantage of unstructured file and object data hosted on Gluster volumes in place. On the Glusterfs cluster, a Glusterfs 3.7.17 build was installed, and on top of it the glusterfs-hadoop plugin [10]. In order to configure the plugin, we made a specific type of deployment. First, the Hadoop JobTracker and TaskTrackers were installed on the Glusterfs servers' trusted storage pool for a given Glusterfs volume. The JobTracker uses the plugin to query the job input files in Glusterfs and to determine the file placement as well as the distribution of file replicas across the cluster. The TaskTrackers use the plugin to control a local FUSE mount of the Glusterfs volume in order to access the data required for their tasks.
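Wiring Hadoop to an alternative file system of this kind is done in core-site.xml by registering the plugin's FileSystem class and pointing the default file system at the volume. The fragment below is only an illustrative sketch; the exact property names and values depend on the glusterfs-hadoop plugin version, and the server name is hypothetical.

```xml
<!-- Illustrative core-site.xml fragment (property names vary by
     glusterfs-hadoop plugin version; "server1" is a placeholder). -->
<configuration>
  <property>
    <name>fs.glusterfs.impl</name>
    <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>glusterfs://server1:9000</value>
  </property>
</configuration>
```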
Zfs
The Z file system (ZFS) was built in 2004 by Sun Microsystems as a free and open source file system and logical volume manager for use in their Solaris operating system [20].
In ZFS, the basic storage unit is a "storage pool" or zpool, which aggregates devices into a single virtual drive. The storage pool understands the physical details of the underlying storage hardware (devices, layout, redundancy, and so on) and presents itself to an operating system as a large "data store" that can support one or more file systems [20]. Our experience at setup time was that configuring ZFS requires a robust infrastructure; otherwise it might create a performance bottleneck. Running MapReduce on ZFS, however, went smoothly [21].
In this way the Hadoop Distributed File System was implemented over each of these systems. It has been observed that MapReduce works well on all the distributed file systems.
Cephfs
Ceph is a distributed file system: an object-based, open source, parallel file system [22][23] whose design is built on object-based storage, which divides the traditional file system architecture into two components, a client component and a storage component, thereby increasing scalability. A Ceph OSD is a process that runs on a cluster node and uses a local file system to store data objects. The CRUSH (Controlled Replication Under Scalable Hashing) [24] algorithm computes on demand where data should be written to or read from.
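A drastically simplified, hypothetical stand-in for CRUSH can convey the key property: replica placement is computed from the object name and the cluster membership, so any client derives the same answer without asking a metadata server. (Real CRUSH uses a weighted hierarchical cluster map and bucket algorithms; this sketch only mimics the deterministic-placement idea.)

```python
import hashlib

def place_replicas(object_id, osds, replicas=3):
    """Toy CRUSH-like placement: rank OSDs by a hash of (object, osd)
    and take the top `replicas`. Deterministic, so every client agrees."""
    ranked = sorted(
        osds,
        key=lambda osd: hashlib.sha1(f"{object_id}:{osd}".encode()).hexdigest(),
    )
    return ranked[:replicas]

osds = ["osd.0", "osd.1", "osd.2", "osd.3"]
# The same object id always maps to the same replica set.
print(place_replicas("obj-42", osds))
```

This rendezvous-hashing style also keeps most placements stable when an OSD is added or removed, which is the property CRUSH exploits during rebalancing.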
One monitor (mon) node and three Object Storage Device (OSD) nodes were installed on virtual machines. It has been found that during recovery, mon nodes and OSD nodes need significantly more RAM. If the user decides to implement Ceph on virtual machines, one must ensure that the other processes leave sufficient processing power for the Ceph daemons; additional CPU-intensive processes have to be run on separate hosts [25]. After implementation, it has been observed that data storage has to be planned carefully, otherwise the system slows down.
Once installation is complete, reaching the active+clean state for the placement groups is the final step, after which the user can start executing commands on Cephfs. As part of this implementation, Ceph version 9.2.1 was installed, and the cephfs-hadoop plugin [11] (cephfs-hadoop.jar) was configured on the underlying Ceph distributed file system. It has been observed from our implementation that Ceph works satisfactorily with both Hadoop 1.x and 2.x.
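As with Glusterfs, the Ceph binding is declared in core-site.xml. A minimal sketch, assuming the property names documented for the cephfs-hadoop plugin [11] (the monitor host name and port here are placeholders), might look like:

```xml
<!-- Illustrative core-site.xml fragment for cephfs-hadoop;
     check the plugin documentation for your version. "mon1" is a placeholder. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>ceph://mon1:6789/</value>
  </property>
  <property>
    <name>fs.ceph.impl</name>
    <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
  </property>
  <property>
    <name>ceph.conf.file</name>
    <value>/etc/ceph/ceph.conf</value>
  </property>
</configuration>
```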
It has been studied from the available literature which DFS gives good performance. Most distributed storage systems do not give adequate performance with large files, or sometimes do not give the same performance across both types of files. It has been reported that Glusterfs does a decent job for both small and large objects [26], and it has been mentioned in the literature that Hadoop can also be an alternative for storing large files.
Particulars of Implementation
For this article, the MapReduce algorithm was implemented on a system using:
Off-the-shelf commodity machines.
Hadoop 1.2.1, with 1 NameNode and 3 DataNodes.
4 GB RAM on the master node and 1 GB RAM on the slave nodes.
Eclipse IDE 3.0.
CentOS 7.0 / Ubuntu 8.2 or above.
4. MapReduce Implementation on DFS
Hadoop is an implementation of the abstract idea called "map-reduce": a programming model for a certain type of calculation that consists of a parallel "map" step, followed by some form of data gathering, the "reduce" step. MapReduce begins with the "map" step, in which a master node (acting as a coordinator) divides the initial input into smaller chunks, each to be used in its own separate sub-task, and distributes them to worker nodes. Each worker performs the tasks it was given, and reports its completion to the master.
Figure 1: Trackster on Plain Hadoop
In the next, "reduce", step, the master node schedules one or more tasks on workers to perform join operations on the map outputs, combining them to obtain the output, the answer to the initial problem. In the cloud environment, MapReduce relies on a shared, parallel file system at each step, such as the Google File System [27].
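The map-shuffle-reduce flow described above can be sketched as a single-process simulation; real Hadoop distributes exactly these three phases across nodes, with the shared file system carrying the intermediate data.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process simulation of the MapReduce model."""
    # "Map" step: each input record is turned into (key, value) pairs.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            intermediate[key].append(value)  # shuffle: group values by key
    # "Reduce" step: each key's grouped values are combined into one result.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

# Classic word count expressed as map and reduce functions:
counts = run_mapreduce(
    ["hadoop runs on dfs", "hadoop on zfs"],
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda word, ones: sum(ones),
)
print(counts["hadoop"])  # 2
```

The framework's job is everything except map_fn and reduce_fn: splitting the input, scheduling the workers, and moving the grouped intermediate pairs to the reducers.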
Figure 2: Trackster on Ceph
Figure 3: Trackster on Zfs
[Chart data for Figures 1, 2 and 4 (Trackster on Plain Hadoop, Ceph and Glusterfs respectively): each chart plots the running time for the whole process, the CPU time taken to execute the program, and the number of map input records (117).]
The following output is generated on the master node when the running processes are listed using the jps command.
Figure 4: Trackster on Glusterfs
There is no TaskTracker or JobTracker in the case of Hadoop 2.5.0; hence the output is displayed in the above manner.
Thus, Hadoop fits the problem of data analysis for various problem statements, viz. a semi-large sample text file, web logs, and the data generated from mobile apps. The use case we have taken into consideration generates and collates logs, as it has been studied in the literature that MapReduce works best for processing and analyzing logs, as mentioned in Fig. 5.
Figure 5: Count of Various Locations Visited by a Person, Computed Using MapReduce
[Chart data for Figure 3 (Trackster on Zfs): running time for the whole process, CPU time taken to execute the program, and number of map input records (117). Chart data for Figure 5: counts of locations visited per person.]
In our case, the Android program Trackster collects the data generated from the mobile platform when a person changes location. As attributes such as latitude and longitude change, the date and time, along with the area, city and zip code, are generated and collated. This log data is used in the comparative study made with the MapReduce program on the various DFS, viz. Ceph, Glusterfs and Zfs, on which Hadoop has been configured, as given in Figs. 1, 2 and 4. The program is written and executed on all the DFS to count how many times a person has visited a specific area. It is executed on a single virtual machine, on a cluster of VMs, and on physical bare-metal machines. We built a Hadoop performance model and examined how performance is affected by changing the VM configuration, changing the sizes of the unstructured data files, and the allocation of VMs over physical machines.
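The counting job itself can be sketched in a few lines. The CSV field layout below is hypothetical, since the paper does not list the exact Trackster log schema; the point is that the mapper emits a (person, area) key per log line and the reducer sums the ones.

```python
from collections import Counter

def count_visits(log_lines):
    """Count how many times each person visited each area.
    Assumed (hypothetical) CSV layout: date,time,lat,lon,area,city,zip,person"""
    visits = Counter()
    for line in log_lines:                 # map step: one key per record
        fields = line.strip().split(",")
        area, person = fields[4], fields[7]
        visits[(person, area)] += 1        # reduce step: sum of 1s per key
    return visits

logs = [
    "2018-01-05,10:02,19.07,72.87,Chembur,Mumbai,400071,user1",
    "2018-01-05,18:45,19.07,72.87,Chembur,Mumbai,400071,user1",
    "2018-01-06,09:10,20.93,77.75,Rathi Nagar,Amravati,444603,user2",
]
print(count_visits(logs)[("user1", "Chembur")])  # 2
```

In the Hadoop version, the same per-line extraction runs in parallel map tasks over the DFS blocks, and the per-key summation runs in the reduce tasks.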
5. Discussion and Analytical Results
There are overall 117 records which have been analysed. As explained earlier,
the analysis is done on the various DFS viz. Cephfs, Glusterfs and Zfs. The
HDFS has been configured on them using various plug-ins with the
implementation of MapReduce program on them. It has been found that, in all
DFS, the running time to execute the MapReduce program on all DFS and CPU
time is far more than on single VM than the VM Cluster. The CPU time taken
and total time taken on the mini cloud is much less than the single VM and VM
Cluster. By looking at the CPU time, it has been observed that there is a drastic
comparative change found between Zfs minicloud and Zfs single VM and VM
cluster as given in Fig.3. In all other DFS, the comparative difference is not
much. But it has been analysed that CPU execution time on minicloud of
physical bare metal machines is always reduced than the two other i.e. single
VM and VM cluster.
The aforementioned observations from our implementation and experimental setup suggest that when Hadoop is scheduling tasks, the nodes with better performance, those that return results faster, as on physical bare-metal machines where the nodes get better processing power, are assigned new tasks earlier than others, regardless of the configuration. It has been observed that a single virtual machine gives poorer results than a MapReduce program run on a cluster of virtual machines [28]. For I/O-intensive jobs, the best practice is to increase the number of VMs, to allocate the VMs widely over the physical servers, and to decrease the number of simultaneously executed jobs. Hadoop on VM clusters suffers degraded performance due to the overhead of virtualization; thus, it is important to minimize that overhead.
6. Conclusion
From the results, we can conclude that MapReduce is a framework that can be implemented smoothly on all the various DFS. The MapReduce program executed on physical bare-metal machines gives much better, more optimized performance than on a single VM or a cluster of VMs. Future work may include a comparative approach on the logs generated on the web and the MapReduce execution on all the DFS for all of them.
References
[1] Dean J., Ghemawat S., MapReduce: simplified data processing on large clusters, Communications of the ACM 51(1) (2008), 107-113.
[2] Apache Hadoop. See website at: http://hadoop.apache.org/
[3] Isard M., Budiu M., Yu Y., Birrell A., Fetterly D., Dryad: distributed data-parallel programs from sequential building blocks, ACM SIGOPS operating systems review 41(3) (2007), 59-72.
[4] Ghemawat S., Gobioff H., Leung S.T., The Google file system, ACM 37(5) (2003), 29-43.
[5] Costa F., Silva L., Dahlin M., Volunteer cloud computing: Map reduce over the internet, IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum (2011), 1855-1862.
[6] Ghoshal D., Canon R.S., Ramakrishnan L., I/o performance of virtualized cloud environments, Proceedings of the second international workshop on Data intensive computing in the clouds (2011), 71-80.
[7] Buzen J.P., Gagliardi U.O., The evolution of virtual machine architecture, Proceedings of the National Computer Conference and Exposition, ACM, New York, NY (1973), 291–299.
[8] https://docs.docker.com/opensource/project/set-up-dev-env/
[9] Zaharia M., Konwinski A., Joseph A.D., Katz R. H., Stoica I., Improving MapReduce performance in heterogeneous environments, Osdi 8(4) (2008), 1-7.
[10] https://github.com/gluster/glusterfs-hadoop.
[11] Source http://docs.ceph.com/docs/master/cephfs/hadoop
[12] Tom P., How to Hadoop: Choose the Right Environment for Your Needs (2015).
[13] Ishii M., Han J., Makino H., Design and performance evaluation for hadoop clusters on virtualized environment, IEEE International Conference on Information Networking (2013), 244-249.
[14] http://wiki.apache.org/hadoop/Virtual%20Hadoop
[15] Kim C., Kameda H., An algorithm for optimal static load balancing in distributed computer systems, IEEE Transactions on Computers 41(3) (1992), 381-384.
[16] Yahoo! Hadoop Tutorial,
[17] Shvachko K., Kuang H., Radia S., Chansler R., The hadoop distributed file system, IEEE 26th symposium on Mass storage systems and technologies (MSST) (2010), 1-10.
[18] An Introduction to Gluster Architecture, Cloud Storage for the Modern Data Centre, 2011.
[19] Rodeh O., Teperman A., zFS-a scalable distributed file system using object disks, 20th IEEE/11th Conference on Mass Storage Systems and Technologies (2003), 207-218.
[20] Teperman A., Weit A., Improving performance of a distributed file system using OSDs and cooperative cache, IBM Journal of Research and Development (2004), 1-8.
[21] Source- http://openzfs.org/wiki/Performance_tuning
[22] Weil S.A., Brandt S.A., Miller E.L., Long D.D., Maltzahn C., Ceph: A scalable, high-performance distributed file system, 7th symposium on Operating systems design and implementation (2006), 307-320.
[23] Maltzahn C., Molina-Estolano E., Khurana A., Nelson A.J., Brandt S.A., Weil S., Ceph as a scalable alternative to the hadoop distributed file system, login: The USENIX Magazine 35 (2010), 38-49.
[24] Weil S.A., Brandt S.A., Miller E.L., Maltzahn C., CRUSH: Controlled, scalable, decentralized placement of replicated data, ACM/IEEE conference on super computing (2006).
[25] Ceph Troubleshooting-http://docs.ceph.
[26] Glusterfs Trouble shooting-https://support.rackspace.com/how-to/glusterfs-troubleshooting/
[27] MapReduce Tutorial, https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
[28] Ishii M., Han J., Makino H., Design and performance evaluation for hadoop clusters on virtualized environment, IEEE International Conference on Information Networking (2013), 244-249.