Approaches for Implementation of Data Analysis Using MapReduce on Mobile Platform: An Experimental Study
1 Madhavi Vaidya and 2 Shrinivas Deshpande
1 Department of Computer Science, VES College, Mumbai, India.
2 Department of Computer Science, HVPM, Amravati, India.
Abstract. The computing world is undergoing a significant change: from traditional non-centralized distributed architectures, in which parallel and pseudo-distributed nodes are scattered across different geographic areas, to centralized cloud computing architectures in which data transformation and computation may be carried out on any node. Data centres owned and maintained by a third party, or a cloud, can be built and maintained using a number of physical machines. These machines can be of different configurations, or can be virtual machines communicating with each other over a shared LAN. It has been reported in the literature that there is always a difference in performance when a MapReduce program is run on virtual machines versus physical bare-metal machines.
The same is elaborated here by considering a use case in which the data generated from a mobile platform is taken into consideration. In our case, to run the program, we configured a mini-cloud on a LAN. The outcome of analysing the data generated from mobile platforms with a MapReduce program has been tested on various infrastructures, namely a single virtual machine, a cluster of virtual machines, and the mini-cloud, over various distributed file systems.
International Journal of Pure and Applied Mathematics, Volume 118, No. 18, 2018, 1221-1232. ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version). url: http://www.ijpam.eu (Special Issue)
1. Introduction
Distributed file systems have been developed to support a style of data access that is write-once, high-throughput, and parallel-streaming. These include I/O subsystems such as Glusterfs, Zfs, Ceph and the Hadoop Distributed File System (HDFS).
In this paper, a study has been performed to show how these distributed file systems work on a mini-cloud maintained on a few machines in a small setup. A comparison is elaborated between running the program on a single virtual machine and on a cloud of a few machines arranged in a LAN.
The paper is organized as follows: a few distributed file systems are studied in this introductory section. Section II presents the literature study. Section III elaborates on the architecture and implementation issues of those distributed file systems. Section IV describes the working of the MapReduce architecture on distributed file systems and the role of MapReduce on those file systems. Section V provides the discussion, analysis and results. Finally, Section VI closes with concluding remarks.
The MapReduce software framework and its open-source implementation, Hadoop, are commonly deployed in the cloud to exploit the availability of virtualized resources and the pay-per-usage cost model of cloud computing. Instead of using a pay-per-use cloud model, we have developed a cloud on a few machines in a local area network. MapReduce is a software framework for parallel data-intensive computations popularized by Google [1]. Several implementations are available besides Google's original system, but the most popular one is Hadoop [2], developed by Apache [3].
We compare the results reported in the literature for a single VM, a cluster of VMs, and a cloud of physical bare-metal machines, performed experimentally on a LAN. The "computing machines" on which the operations are performed are either physical, i.e. bare-metal, machines or virtual machines (VMs) running on a physical machine. Many virtual machines can run as "guests" on a real "host" machine, using hypervisor software.
A cloud typically comprises inter-connected, virtualised computers coupled with security and a programming model. From an application developer's perspective, the programming model and its performance are undoubtedly the main criteria for selecting a cloud environment. A private cloud is a collection of virtualized physical hardware with added services, such as catalogues of software or defined platforms, that a customer can control.
The primary goal of virtualizing master and slave servers is the same: to maximize operability and utilize existing hardware. Virtualisation is a software technology that allows a machine to run different operating systems on the same physical computer. A computing environment consists of compute nodes, a network to connect them, and software to run on them. However, the strategy for virtualizing the Hadoop master servers differs from that for virtualizing Hadoop slave servers, depending on the settings applied to the machines [4][5]. Among the findings in the literature, EC2 also provides bigger VM sizes that have lower variance in I/O performance than the light VMs, possibly because they fully own a disk [6]. The same behaviour is experienced when the mini-cloud is formed on the machines in the LAN.
A virtualized environment has its own benefits [7]: virtualization can provide higher hardware utilization by consolidating multiple Hadoop machines into a cluster alongside other workloads on the same physical cluster. It has been noted in the literature that the development environment is a place where containers might work [8]; as they require fewer hardware resources as overhead, they typically cost less than VMs. Another benefit is that they provide great flexibility.
So the choice between a physical bare-metal environment and a virtual machine for a Hadoop installation is elaborated here, based on the performance of MapReduce programs for various types of files.
This study focuses on optimizing MapReduce programs on Hadoop on a single virtual machine, a cluster of "small" VMs, and physical bare-metal machines to obtain the best performance. It has been reported in the literature that optimizing Hadoop on "small" VMs can give the best performance [9].
As noted above, distributed file systems such as Glusterfs, Zfs, Ceph and the Hadoop Distributed File System (HDFS) support this write-once, high-throughput, parallel-streaming style of data access. Hadoop has been implemented on the mini-cloud of physical bare-metal machine nodes, and on a single node where Glusterfs, Ceph and Zfs serve as the underlying distributed file systems; Hadoop is connected to them using various plugins, such as the glusterfs-hadoop plugin [10] and the ceph-hadoop plugin [11] (cephfs-hadoop.jar), which have been configured appropriately. It has been observed from our implementation that Ceph works satisfactorily with both Hadoop 1.x and 2.x.
2. Literature Study
One scenario that works well is a highly static environment with a very large cluster. Workloads that require a few clusters with long life spans are good candidates for a bare-metal deployment, since the absence of virtualization overhead may provide performance benefits. Besides, all of the flexibility of a virtualized environment would go to waste [12]: when a VM is requested, its performance may vary from the performance obtained on physical data centres. This can be due to the configuration of the machines, or to other workloads.
Virtualization can provide higher hardware utilization by consolidating multiple Hadoop clusters and other workloads on the same physical cluster. Resource management for such a large number of nodes is quite difficult from the aspects of configuration, deployment and efficient resource utilization. By deploying virtual machines (VMs), Hadoop management becomes much easier. Amazon has already released Hadoop on a Xen-virtualized environment as Elastic MapReduce [13].
Virtual machines are requested on demand from the infrastructure: the machines can be allocated anywhere in the infrastructure, possibly on servers already running other VMs. Hadoop can thus be brought into virtualized infrastructures with ample benefits: a single image can be cloned, which lowers operation costs; clusters can be set up on demand; and the cluster size can be contracted or expanded on demand [14]. Hadoop is designed on the fundamental principle of distributing computing power to where the data is, rather than moving the data, as is done in traditional parallel and distributed computing systems using MPI (Message Passing Interface). Hadoop is an extremely scalable and highly fault-tolerant distributed infrastructure, and it automatically handles parallelization, load balancing and data distribution [15]. HDFS is capable of storing excessive volumes of data and providing fast access to the stored data. A Hadoop cluster works on a master-slave architecture in which one node is the master node and the remaining nodes are slave nodes. The master node, known as the NameNode, controls the slave nodes, known as DataNodes. The slave nodes hold all the data blocks and perform the map and reduce tasks [16].
3. Study of Architecture and
Implementation Issues
Hadoop
Hadoop is an Apache open source project. Its architecture supports the storage, transformation, and analysis of very large datasets. Hadoop is a massively parallel processing solution, characterized as a massive I/O platform. "Massive" is a relative term that changes from specification to specification and depends on the infrastructure used.
Two critical components of Hadoop are the Hadoop distributed file system
(HDFS)[17], and the MapReduce algorithm. Both the components allow
efficient parallel computation on very large datasets. HDFS is based on Google
File System i.e. GFS.
The various DFS mentioned below have also been installed with HDFS and
MapReduce programs were executed for various use cases.
Glusterfs
Glusterfs is an open source system. It is designed to provide a scale-out architecture for both performance and capacity, and is able to scale up or down. Glusterfs operates in user space, which makes installing and upgrading it significantly easier. Storage servers can be added or removed on the fly, with data automatically rebalanced across the cluster. Glusterfs locates data algorithmically on the basis of the path name and the file name: any storage node, and any client requiring read or write access to a file in a Glusterfs storage cluster, performs a mathematical operation that calculates the file location [18]. Experimental work on the installation and configuration of Glusterfs was found to be smooth [10].
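The idea of locating a file purely by computation can be illustrated with a simplified sketch. This is not Glusterfs's actual elastic hashing algorithm, whose details differ, but it shows why no central metadata server is needed: every node hashes the path and arrives at the same brick.

```python
import hashlib

def locate_brick(path, bricks):
    """Deterministically map a file path to a storage brick by hashing.
    Simplified illustration; not the real Glusterfs elastic hash."""
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return bricks[int(digest, 16) % len(bricks)]

bricks = ["server1:/brick1", "server2:/brick2", "server3:/brick3"]
# Every client and server computes the same location for the same path,
# so file lookup requires no metadata-server round trip.
print(locate_brick("/logs/trackster/2018-01-01.log", bricks))
```

Because the placement is a pure function of the path, rebalancing after adding or removing a brick is the only time data must move.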
The recent introduction of the Glusterfs Hadoop plugin (an HDFS interface) enables Hadoop applications to take advantage of unstructured file and object data hosted on Gluster volumes in place. On the Glusterfs cluster, a Glusterfs 3.7.17 build was installed, and on top of it the glusterfs-hadoop plugin [10]. In order to configure the plugin, we made a specific type of deployment. First, the Hadoop JobTracker and TaskTrackers were installed on the Glusterfs servers' trusted storage pool for a given Glusterfs volume. The JobTracker uses the plugin to query the job input files in Glusterfs and to determine the file placement as well as the distribution of file replicas across the cluster. The TaskTrackers use the plugin to control a local FUSE mount of the Glusterfs volume in order to access the data required for their tasks.
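Wiring Hadoop to an alternative file system of this kind is done in core-site.xml by registering the plugin's FileSystem class and pointing the default file system at the volume. The fragment below is only an illustrative sketch; the exact property names and values depend on the glusterfs-hadoop plugin version, and the server name is hypothetical.

```xml
<!-- Illustrative core-site.xml fragment (property names vary by
     glusterfs-hadoop plugin version; "server1" is a placeholder). -->
<configuration>
  <property>
    <name>fs.glusterfs.impl</name>
    <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>glusterfs://server1:9000</value>
  </property>
</configuration>
```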
Zfs
The Z file system (ZFS) was built in 2004 by Sun Microsystems as a free and open source file system and logical volume manager for use in their Solaris operating system [20].
In ZFS, the basic storage unit is a "storage pool" or zpool, which aggregates devices into a single virtual drive. The storage pool understands the physical details of the underlying storage hardware (devices, layout, redundancy, and so on) and presents itself to an operating system as a large "data store" that can support one or more file systems [20]. Our experience at setup time was that configuring ZFS requires a robust infrastructure; otherwise it might create a performance bottleneck. Running MapReduce on ZFS, however, went smoothly [21].
In this way the Hadoop Distributed File System was implemented over each of these systems. It has been observed that MapReduce works well on all the distributed file systems.
Cephfs
Ceph is a distributed file system: an object-based, open source, parallel file system [22][23] whose design is built on object-based storage, which divides the traditional file system architecture into two components, a client component and a storage component, thereby increasing scalability. A Ceph OSD is a process that runs on a cluster node and uses a local file system to store data objects. The CRUSH (Controlled Replication Under Scalable Hashing) [24] algorithm computes on demand where data should be written to or read from.
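A drastically simplified, hypothetical stand-in for CRUSH can convey the key property: replica placement is computed from the object name and the cluster membership, so any client derives the same answer without asking a metadata server. (Real CRUSH uses a weighted hierarchical cluster map and bucket algorithms; this sketch only mimics the deterministic-placement idea.)

```python
import hashlib

def place_replicas(object_id, osds, replicas=3):
    """Toy CRUSH-like placement: rank OSDs by a hash of (object, osd)
    and take the top `replicas`. Deterministic, so every client agrees."""
    ranked = sorted(
        osds,
        key=lambda osd: hashlib.sha1(f"{object_id}:{osd}".encode()).hexdigest(),
    )
    return ranked[:replicas]

osds = ["osd.0", "osd.1", "osd.2", "osd.3"]
# The same object id always maps to the same replica set.
print(place_replicas("obj-42", osds))
```

This rendezvous-hashing style also keeps most placements stable when an OSD is added or removed, which is the property CRUSH exploits during rebalancing.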
One monitor (mon) node and three Object Storage Device (OSD) nodes were installed on virtual machines. It has been found that during recovery, mon nodes and OSD nodes need significantly more RAM. If the user decides to implement Ceph on virtual machines, one must ensure that the other processes leave sufficient processing power for the Ceph daemons; additional CPU-intensive processes have to be run on separate hosts [25]. After implementation, it has been observed that data storage has to be planned carefully, otherwise the system slows down.
Once installation is complete, reaching the active+clean state for the placement groups is the final step, after which the user can start executing commands on Cephfs. As part of this implementation, Ceph version 9.2.1 was installed, and the cephfs-hadoop plugin [11] (cephfs-hadoop.jar) was configured on the underlying Ceph distributed file system. It has been observed from our implementation that Ceph works satisfactorily with both Hadoop 1.x and 2.x.
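As with Glusterfs, the Ceph binding is declared in core-site.xml. A minimal sketch, assuming the property names documented for the cephfs-hadoop plugin [11] (the monitor host name and port here are placeholders), might look like:

```xml
<!-- Illustrative core-site.xml fragment for cephfs-hadoop;
     check the plugin documentation for your version. "mon1" is a placeholder. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>ceph://mon1:6789/</value>
  </property>
  <property>
    <name>fs.ceph.impl</name>
    <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
  </property>
  <property>
    <name>ceph.conf.file</name>
    <value>/etc/ceph/ceph.conf</value>
  </property>
</configuration>
```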
It has been studied from the available literature which DFS gives good performance. Most distributed storage systems do not give adequate performance with large files, or sometimes do not give the same performance across both types of files. It has been reported that Glusterfs does a decent job for both small and large objects [26], and it has been mentioned in the literature that Hadoop can also be an alternative for storing large files.
Particulars of Implementation
For this article, the MapReduce algorithm was implemented on a system using:
Off-the-shelf commodity machines.
Hadoop 1.2.1, with 1 NameNode and 3 DataNodes.
4 GB RAM on the master node and 1 GB RAM on the slave nodes.
Eclipse IDE 3.0.
CentOS 7.0 / Ubuntu 8.2 or above.
4. MapReduce Implementation on DFS
Hadoop is an implementation of the abstract idea called "map-reduce": a programming model for a certain type of calculation that consists of a parallel "map" step, followed by some form of data gathering, the "reduce" step. MapReduce begins with the "map" step, in which a master node (acting as a coordinator) divides the initial input into smaller chunks, each to be used in its own separate sub-task, and distributes them to worker nodes. Each worker performs the tasks it was given, and reports its completion to the master.
Figure 1: Trackster on Plain Hadoop
In the next, "reduce", step, the master node schedules one or more tasks on workers to perform join operations on the map outputs, combining them to obtain the output, the answer to the initial problem. In the cloud environment, MapReduce relies on a shared, parallel file system at each step, such as the Google File System [27].
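The map-shuffle-reduce flow described above can be sketched as a single-process simulation; real Hadoop distributes exactly these three phases across nodes, with the shared file system carrying the intermediate data.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process simulation of the MapReduce model."""
    # "Map" step: each input record is turned into (key, value) pairs.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            intermediate[key].append(value)  # shuffle: group values by key
    # "Reduce" step: each key's grouped values are combined into one result.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

# Classic word count expressed as map and reduce functions:
counts = run_mapreduce(
    ["hadoop runs on dfs", "hadoop on zfs"],
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda word, ones: sum(ones),
)
print(counts["hadoop"])  # 2
```

The framework's job is everything except map_fn and reduce_fn: splitting the input, scheduling the workers, and moving the grouped intermediate pairs to the reducers.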
Figure 2: Trackster on Ceph
Figure 3: Trackster on Zfs
[Chart data for Figures 1, 2 and 4 (Trackster on Plain Hadoop, Ceph and Glusterfs respectively): each chart plots the running time for the whole process, the CPU time taken to execute the program, and the number of map input records (117).]
The following output is generated on the master node when the running processes are listed using the jps command.
Figure 4: Trackster on Glusterfs
There is no TaskTracker or JobTracker in the case of Hadoop 2.5.0; hence the output is displayed in the above manner.
Thus, Hadoop fits the problem of data analysis for various problem statements, viz. a semi-large sample text file, web logs, and the data generated from mobile apps. The use case we have taken into consideration generates and collates logs, as it has been studied in the literature that MapReduce works best for processing and analyzing logs, as mentioned in Fig. 5.
Figure 5: Count of Various Locations Visited by a Person, Computed Using MapReduce
[Chart data for Figure 3 (Trackster on Zfs): running time for the whole process, CPU time taken to execute the program, and number of map input records (117). Chart data for Figure 5: counts of locations visited per person.]
In our case, the Android program Trackster collects the data generated from the mobile platform when a person changes location. As attributes such as latitude and longitude change, the date and time, along with the area, city and zip code, are generated and collated. This log data is used in the comparative study made with the MapReduce program on the various DFS, viz. Ceph, Glusterfs and Zfs, on which Hadoop has been configured, as given in Figs. 1, 2 and 4. The program is written and executed on all the DFS to count how many times a person has visited a specific area. It is executed on a single virtual machine, on a cluster of VMs, and on physical bare-metal machines. We built a Hadoop performance model and examined how performance is affected by changing the VM configuration, changing the sizes of the unstructured data files, and the allocation of VMs over physical machines.
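The counting job itself can be sketched in a few lines. The CSV field layout below is hypothetical, since the paper does not list the exact Trackster log schema; the point is that the mapper emits a (person, area) key per log line and the reducer sums the ones.

```python
from collections import Counter

def count_visits(log_lines):
    """Count how many times each person visited each area.
    Assumed (hypothetical) CSV layout: date,time,lat,lon,area,city,zip,person"""
    visits = Counter()
    for line in log_lines:                 # map step: one key per record
        fields = line.strip().split(",")
        area, person = fields[4], fields[7]
        visits[(person, area)] += 1        # reduce step: sum of 1s per key
    return visits

logs = [
    "2018-01-05,10:02,19.07,72.87,Chembur,Mumbai,400071,user1",
    "2018-01-05,18:45,19.07,72.87,Chembur,Mumbai,400071,user1",
    "2018-01-06,09:10,20.93,77.75,Rathi Nagar,Amravati,444603,user2",
]
print(count_visits(logs)[("user1", "Chembur")])  # 2
```

In the Hadoop version, the same per-line extraction runs in parallel map tasks over the DFS blocks, and the per-key summation runs in the reduce tasks.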
5. Discussion and Analytical Results
There are overall 117 records which have been analysed. As explained earlier,
the analysis is done on the various DFS viz. Cephfs, Glusterfs and Zfs. The
HDFS has been configured on them using various plug-ins with the
implementation of MapReduce program on them. It has been found that, in all
DFS, the running time to execute the MapReduce program on all DFS and CPU
time is far more than on single VM than the VM Cluster. The CPU time taken
and total time taken on the mini cloud is much less than the single VM and VM
Cluster. By looking at the CPU time, it has been observed that there is a drastic
comparative change found between Zfs minicloud and Zfs single VM and VM
cluster as given in Fig.3. In all other DFS, the comparative difference is not
much. But it has been analysed that CPU execution time on minicloud of
physical bare metal machines is always reduced than the two other i.e. single
VM and VM cluster.
The aforementioned observations from our implementation and experimental setup suggest that when Hadoop is scheduling tasks, the nodes with better performance, those that return results faster, as on physical bare-metal machines where the nodes get better processing power, are assigned new tasks earlier than others, regardless of the configuration. It has been observed that a single virtual machine gives poorer results than a MapReduce program run on a cluster of virtual machines [28]. For I/O-intensive jobs, the best practice is to increase the number of VMs, to allocate the VMs widely over the physical servers, and to decrease the number of simultaneously executed jobs. Hadoop on VM clusters suffers degraded performance due to the overhead of virtualization; thus, it is important to minimize that overhead.
6. Conclusion
From the results, we can conclude that MapReduce is a framework that can be implemented smoothly on all the various DFS. The MapReduce program executed on physical bare-metal machines gives much better, more optimized performance than on a single VM or a cluster of VMs. Future work may include a comparative approach on the logs generated on the web and the MapReduce execution on all the DFS for all of them.
References
[1] Dean J., Ghemawat S., MapReduce: simplified data processing on large clusters, Communications of the ACM 51(1) (2008), 107-113.
[2] Apache Hadoop. See website at: http://hadoop.apache.org/
[3] Isard M., Budiu M., Yu Y., Birrell A., Fetterly D., Dryad: distributed data-parallel programs from sequential building blocks, ACM SIGOPS operating systems review 41(3) (2007), 59-72.
[4] Ghemawat S., Gobioff H., Leung S.T., The Google file system, ACM 37(5) (2003), 29-43.
[5] Costa F., Silva L., Dahlin M., Volunteer cloud computing: Map reduce over the internet, IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum (2011), 1855-1862.
[6] Ghoshal D., Canon R.S., Ramakrishnan L., I/o performance of virtualized cloud environments, Proceedings of the second international workshop on Data intensive computing in the clouds (2011), 71-80.
[7] Buzen J.P., Gagliardi U.O., The evolution of virtual machine architecture, Proceedings of the National Computer Conference and Exposition, ACM, New York, NY (1973), 291–299.
[8] https://docs.docker.com/opensource/project/set-up-dev-env/
[9] Zaharia M., Konwinski A., Joseph A.D., Katz R. H., Stoica I., Improving MapReduce performance in heterogeneous environments, Osdi 8(4) (2008), 1-7.
[10] https://github.com/gluster/glusterfs-hadoop.
[11] Source http://docs.ceph.com/docs/master/cephfs/hadoop
[12] Tom P., How to Hadoop: Choose the Right Environment for Your Needs (2015).
[13] Ishii M., Han J., Makino H., Design and performance evaluation for hadoop clusters on virtualized environment, IEEE International Conference on Information Networking (2013), 244-249.
[14] http://wiki.apache.org/hadoop/Virtual%20Hadoop
[15] Kim C., Kameda H., An algorithm for optimal static load balancing in distributed computer systems, IEEE Transactions on Computers 41(3) (1992), 381-384.
[16] Yahoo! Hadoop Tutorial,
[17] Shvachko K., Kuang H., Radia S., Chansler R., The hadoop distributed file system, IEEE 26th symposium on Mass storage systems and technologies (MSST) (2010), 1-10.
[18] An Introduction to Gluster Architecture, Cloud Storage for the Modern Data Centre, 2011.
[19] Rodeh O., Teperman A., zFS-a scalable distributed file system using object disks, 20th IEEE/11th Conference on Mass Storage Systems and Technologies (2003), 207-218.
[20] Teperman A., Weit A., Improving performance of a distributed file system using OSDs and cooperative cache, IBM Journal of Research and Development (2004), 1-8.
[21] Source- http://openzfs.org/wiki/Performance_tuning
[22] Weil S.A., Brandt S.A., Miller E.L., Long D.D., Maltzahn C., Ceph: A scalable, high-performance distributed file system, 7th symposium on Operating systems design and implementation (2006), 307-320.
[23] Maltzahn C., Molina-Estolano E., Khurana A., Nelson A.J., Brandt S.A., Weil S., Ceph as a scalable alternative to the hadoop distributed file system, login: The USENIX Magazine 35 (2010), 38-49.
[24] Weil S.A., Brandt S.A., Miller E.L., Maltzahn C., CRUSH: Controlled, scalable, decentralized placement of replicated data, ACM/IEEE conference on super computing (2006).
[25] Ceph Troubleshooting-http://docs.ceph.
[26] Glusterfs Trouble shooting-https://support.rackspace.com/how-to/glusterfs-troubleshooting/
[27] MapReduce Tutorial, https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
[28] Ishii M., Han J., Makino H., Design and performance evaluation for hadoop clusters on virtualized environment, IEEE International Conference on Information Networking (2013), 244-249.