HPC, Grids, Clouds: A Distributed System from Top to Bottom
Group 15
Kavin Kumar Palanisamy, Magesh Khanna Vadivelu, Shivaraman Janakiraman, Vasumathi Sridharan
1. Introduction
1.1 Overview
This project involved the implementation of the PageRank algorithm on a cloud. To understand the
implementation and analyze the performance of the PageRank algorithm with respect to the various
technologies used in distributed systems, we started by parallelizing the PageRank algorithm using MPI
libraries. The parallelized PageRank algorithm was then tested on an academic cloud in order to produce
a performance report. This was followed by the implementation of a resource monitoring system, which
monitors and visualizes resource utilization across a distributed set of nodes using message-broker
middleware. Finally, we performed dynamic provisioning, which provides the ability to use on-demand
resources in a shared academic Cloud environment.
1.2 Technologies
The following are the technologies we used during the course of the project:
1.2.1 NaradaBrokering
NaradaBrokering is an open source technology supporting a suite of capabilities for reliable/robust
flexible messaging. It is aimed at providing for the transport of messages between services and between
services and clients. NaradaBrokering is designed around a scalable distributed network of cooperating
message routers and processors. NaradaBrokering is a content distribution infrastructure for voluminous
data streams. The substrate places no limits on the size, rate and scope of the information encapsulated
within these streams or on the number of entities within the system. NaradaBrokering provides support
for the scalable and efficient dissemination of these data streams. The substrate incorporates capabilities
to mitigate network-induced effects, and also to ensure that these streams are secure, reliable, ordered and
jitter-reduced. All components within the system utilize globally-synchronized timestamps.
To facilitate communications in a variety of network realms, NaradaBrokering incorporates
support for several communication protocols such as TCP, UDP, Multicast, HTTP, SSL, IPSec and
Parallel TCP. Support for enterprise messaging standards such as the Java Message Service, and a slew of
Web Service specifications such as SOAP, WS-Eventing, WS-ReliableMessaging and WS-Reliability are
also available.
Since NaradaBrokering is application-independent, it has been harnessed in a variety of domains such
as Earthquake Science, Environmental Monitoring, Particle Physics, Geosciences and Internet based
conferencing systems.
Figure 1: NaradaBrokering Architecture.
NaradaBrokering is an asynchronous messaging infrastructure with a publish/subscribe-based
architecture. Networks of collaborating brokers are arranged in a cluster topology, with a hierarchy of
clusters, super-clusters, and super-super-clusters. Each
broker is assigned a logical address within the network, which corresponds to its location and contains a
Broker Node Map (BNM) for the calculation of routes, based on broker hops. The NaradaBrokering
transport framework provides the capability for each link between brokers to implement a different
underlying protocol. The security framework incorporates an encryption key management structure,
supporting a variety of algorithms, for topics, publishers, and subscribers. A built-in performance
aggregation service can monitor links originating from a broker and typically displays values for the
average delay, latency, jitter, throughput, and loss rates. Audio/video conferencing is accomplished with
the aid of the Real-time Transport Protocol (RTP) and the Java Media Framework. Support for JXTA
peer-to-peer end-points communicating over a NaradaBrokering broker network is provided through a
proxy. NaradaBrokering also incorporates services for the compression/decompression and
fragmentation/coalescing of payloads/files; it also has the ability to bypass firewalls and proxies.
1.2.2 Eucalyptus
Eucalyptus is a software platform for the implementation of private cloud computing on computer
clusters. There is an enterprise edition and an open-source edition. Currently, it exports a user-facing
interface that is compatible with the Amazon EC2 and S3 services but the platform is modularized so that
it can support a set of different interfaces simultaneously. The development of Eucalyptus software is
sponsored by Eucalyptus Systems, a venture-backed start-up. Eucalyptus works with most currently
available Linux distributions including Ubuntu, Red Hat Enterprise Linux (RHEL), CentOS, SUSE Linux
Enterprise Server (SLES), openSUSE, Debian and Fedora. It can also host Microsoft Windows images.
Similarly, Eucalyptus can use a variety of virtualization technologies, including VMware, Xen and KVM
hypervisors, to implement the cloud abstractions it supports. Eucalyptus is an acronym for "Elastic Utility
Computing Architecture for Linking Your Programs to Useful Systems".
Eucalyptus implements IaaS (Infrastructure as a Service) style private and hybrid clouds. The platform
provides a single interface that lets users access computing infrastructure resources (machines, network,
and storage) available in private clouds, implemented by Eucalyptus inside an organization's existing
data center, and resources available externally in public cloud services. The software is designed with a
modular and extensible Web services-based architecture that enables Eucalyptus to export a variety of
APIs towards users via client tools. Currently, Eucalyptus implements the industry-standard Amazon
Web Services (AWS) API, which allows the interoperability of Eucalyptus with existing AWS services
and tools. Eucalyptus provides its own set of command line tools called Euca2ools, which can be used
internally to interact with Eucalyptus private cloud installations or externally to interact with public cloud
offerings, including Amazon EC2.
Eucalyptus includes these features:
Compatibility with Amazon Web Services API.
Installation and deployment from source or DEB and RPM packages.
Secure communication between internal processes via SOAP and WS-Security.
Support for Linux and Windows virtual machines (VMs).
Support for multiple clusters as a single cloud.
Elastic IPs and Security Groups.
Users and Groups Management.
Accounting reports.
Configurable scheduling policies and SLAs.
Figure 2: Eucalyptus Software architecture
The Eucalyptus cloud computing platform has five high-level components: Cloud Controller (CLC),
Cluster Controller (CC), Walrus, Storage Controller (SC) and Node Controller (NC). Each high-level
system component has its own Web interface and is implemented as a stand-alone Web service. This has
two major advantages: First, each Web service exposes a well-defined language-agnostic API in the form
of a WSDL document containing both the operations that the service can perform and the input/output
data structures. Second, Eucalyptus leverages existing Web-service features such as security policies
(WSS) for secure communication between components and relies on industry-standard web-services
software packages.
Eucalyptus Components
Cloud Controller (CLC) - The CLC is responsible for exposing and managing the underlying virtualized
resources (machines/servers, network, and storage) via user-facing APIs. Currently, the CLC exports a
well-defined, industry-standard API (Amazon EC2) as well as a Web-based user interface.
Walrus - Walrus implements scalable "put-get bucket storage." The current implementation of Walrus is
interface-compatible with Amazon's S3 (a get/put interface for buckets and objects), providing a
mechanism for persistent storage and access control of virtual machine images and user data.
Cluster Controller (CC) - The CC controls the execution of virtual machines (VMs) running on the nodes
and manages the virtual networking between VMs and between VMs and external users.
Storage Controller (SC) - The SC provides block-level network storage that can be dynamically attached
to VMs. The current implementation of the SC supports the Amazon Elastic Block Storage (EBS)
semantics.
Node Controller (NC) - The NC (through the functionality of a hypervisor) controls VM activities,
including the execution, inspection, and termination of VM instances.
1.2.3 Torque
The TORQUE Resource Manager is an open source distributed resource manager
providing control over batch jobs and distributed compute nodes. Its name stands for Terascale
Open-Source Resource and QUEue Manager. It is a community effort based on the original PBS
project and, with more than 1,200 patches, has incorporated significant advances in the areas of
scalability, fault tolerance, and feature extensions contributed by NCSA, OSC, USC, the US
DOE, Sandia, PNNL, UB, TeraGrid, and many other leading edge HPC organizations.
TORQUE can integrate with the open source Maui Cluster Scheduler or the commercial
Moab Workload Manager to improve overall utilization, scheduling and administration on a
cluster.
2. Architecture and Implementations
2.1. PageRank algorithm:
Figure 1: PageRank indicated as a percentage for 11 nodes
PageRank is defined as follows:
We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping
factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the
next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A
is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
PageRank forms a probability distribution over web pages, so the sum of all web pages' PageRank values
will be one. The PageRank process can be understood as a Markov chain [1], which requires iterative
calculation to converge.
Damping factor in Random surfer model:
PageRank can be considered a model of user behavior, where a surfer clicks on links at random with no
regard for content. The probability that the random surfer keeps clicking on links is given by the damping
factor d, which is set between 0 and 1. The higher d is, the more likely the random surfer is to keep
clicking links. Since the surfer jumps to a random page after they stop clicking links, this behavior is
modeled by the constant term (1-d) in the algorithm.
b. MPI PageRank:
Parallel PageRank works by partitioning the PageRank problem into N sub-problems so that N processes
each solve one sub-problem concurrently. One simple partitioning approach is vertex-centric: the
PageRank graph is divided into groups of vertices, and each group is processed by one process. In this
project we implemented parallel PageRank using this method. The resulting program was run in two
settings: bare metal and Eucalyptus VMs on multiple nodes using FutureGrid.
2.2 Running MPI PageRank on a cluster and Eucalyptus Cloud infrastructure
The aim of this portion of the project was to understand the efficiency of our PageRank algorithm by
measuring its performance in two different environments. The goal was to achieve speed up. We ran the
MPI PageRank program in two different modes:
a) Baremetal
b) Eucalyptus cloud
We obtained bare-metal nodes and Eucalyptus instances using our FutureGrid India account.
Speedup:
S = T1 / Tp, where
T1 is the execution time of the sequential PageRank algorithm (in our case, the execution time with 1
process), and
Tp is the execution time of the parallel PageRank algorithm with p processes.
In the ideal case, S equals p, indicating that the program scales perfectly with the number of processes.
We show that this is not the case in VM environments.
2.3 An MVC-based cluster monitoring system using pub/sub messaging middleware
We implemented a system that monitors CPU and memory utilization on two systems:
1) a local commercial laptop, and 2) a VM node running on a Eucalyptus cluster.
Monitoring information was collected and aggregated through the message broker, and the overall CPU
and memory utilization percentages were displayed using graphs.
Figure 2: MPI PageRank algorithm flowchart
2.3.1. NaradaBrokering
NaradaBrokering is a message broker middleware that we used to monitor the resource utilization
in a distributed set of nodes.
Architecture: There are three main components of this monitoring system: a Message Broker, Monitoring Daemons
running on nodes and a Monitoring UI.
• Message Broker: middleware that holds a series of messages with specific topics and waits for a front-
end subscriber to pick up the messages (e.g., NaradaBrokering, ActiveMQ). AIs had set up instances of
NaradaBrokering and ActiveMQ to be used by the students. Students were advised to prefix their topics
with their group number (e.g., G01_xyz) to avoid conflicts when sharing the same brokers.
• Monitoring Daemon: a background process that runs on each compute node and periodically captures
and publishes the node's resource utilization information (CPU and memory utilization) and other
important usage information to the Message Broker. This daemon should not interfere with the other
processes running on the compute node.
• Summarizer and Monitoring UI: the Summarizer listens to messages with specific topic(s) from the
Message Broker and summarizes the collected information. This summarized information (overall CPU
and memory utilization) is displayed as a cumulative graph of the targeted computing environment. The
summarizer and the UI can be separate applications that communicate with each other, or a single
application.
Figure 3: Overview of the architecture
2.4 Job submissions on a dynamic provisioning cluster
We automated the process of setting up the monitoring system and running MPI PageRank using PBS
job scripts on
Bare metal
Virtual clusters
We obtained a set of bare-metal machines through the Torque resource manager on FutureGrid and
booted up a set of virtual machines using India-Eucalyptus.
2.4.1 System Architecture
Based on the information received from the monitoring infrastructure, users can programmatically
switch/re-provision their nodes to another environment (e.g., from Linux to Linux VMs). Figure 4 shows
the interactions between the components within this system.
Figure 4: User interactions with the dynamic provisioning system
3. Experiments
3.1 Settings
Academic cloud and hardware: The cloud comprised a bare-metal cluster and Eucalyptus VMs. The
clients were Linux machines.
Languages used: We used C with OpenMPI for the parallel implementation of PageRank and Java to
implement the monitoring system.
Libraries and tools: We used NaradaBrokering as our pub/sub library, and the JFreeChart and Sigar
libraries for monitoring chart creation. We used Torque and Moab for dynamic provisioning and batch
processing.
3.2 Input
Data format:
The input data for the PageRank application is a web graph in adjacency matrix format [2], i.e., the web
graph is transformed into a simplified adjacency matrix. The following are the steps we used to construct
the adjacency matrix for the web graph in Fig. 1:
1) Construct a set of tuples that describe the web graph structure, each mapping a page to the pages it
links to: WebG = {(A, null), (B, C), (C, B), (D, A B), (E, B D F), (F, B E), (G1, B E), (G2, B E),
(G3, B E), (G4, E), (G5, E)}
2) Map letters to numbers: A->0, B->1, C->2, D->3, E->4, F->5, G1->6, G2->7, G3->8, G4->9, G5->10
3) Construct the simplified adjacency matrix based on the information in steps 1 and 2.
0
1 2
2 1
3 0 1
4 1 3 5
5 1 4
6 1 4
7 1 4
8 1 4
9 4
10 4
3.3 Output
3.3.1 PageRank results:
The PageRank program displays the top 10 URLs arranged in decreasing order of PageRank value. The
following output was obtained with these parameters:
a) Number of processes = 3
b) Threshold = 0.000001
c) Iteration count = 10
d) No. of URLs in the dataset: 1000
The top 10 URLs are:
Node    PR value
0 0.138430
34 0.124686
10 0.111501
20 0.077290
14 6 0.056927
2 0.047347
12 0.019615
14 0.017787
6 0.013246
16 0.012961
3.3.2 Performance charts of MPI PageRank running on bare metal vs. Eucalyptus
Fig. 3: Parameters used for the MPI PageRank algorithm

                        Baremetal       Eucalyptus
No. of worker nodes     4               3
Size of dataset         100K and 500K   100K and 500K
No. of processes        1 to 13         1 to 13
Threshold               0.000001        0.000001
Iteration setting       10              10

Figure 4: Performance analysis speedup charts on bare metal and Eucalyptus
3.3.3 Snapshots of monitoring system UI
Fig. 5: Performance index of a commercial laptop (left) compared to our UI
Fig. 6: Performance index on the cluster
Fig. 7: Performance index on bare metal (500K, 700K and 900K URLs)
Fig. 8: CPU and memory utilization (VM)
4. Analysis of results
4.1 Measurements of MPI PageRank on bare metal vs. Eucalyptus
a) Baremetal
As seen in Figure 4, graphs I and II, the MPI PageRank program achieved an overall speedup as the
number of processes increased, which is the expected trend for parallel algorithms.
b) Eucalyptus:
We found that we got a speedup whenever the number of processes was of the form 3n + 1, i.e., when
np = 4, 7, 10, 13, ... At all other times, we observed a slowdown in performance. The sudden spikes in
speedup could be due to the following:
1) We used 3 instances, and performance increased when the first instance was assigned more
processes than the rest of the instances.
2) The slowdown could be due to the absence of the InfiniBand capacity that is present on the
bare-metal nodes.
3) We can also attribute the slowdown to the communication delay between processes.
4.2 Dynamic switching overhead:
Whenever a VM booted, we noticed a spike in CPU and memory utilization, as shown in Figure 8.
5. Conclusion
a. Summary of Achievements
We successfully parallelized the PageRank algorithm with the help of MPI libraries. The performance of
the PageRank algorithm was analyzed, and a report was generated illustrating its performance on the
academic cloud. The resource monitoring system, which monitors and visualizes resource utilization
across a distributed set of nodes, was implemented using NaradaBrokering. We also implemented
dynamic provisioning, which provides the ability to use on-demand resources in a shared academic
Cloud environment.
As part of future work, we plan to implement a data classification tool in Hadoop that can be used in
shopping malls at the application level.
b. Findings
i. Computation vs. communication overhead of MPI PageRank
Since Eucalyptus runs over Ethernet, we had communication overhead. We observed a slowdown in
Eucalyptus, which we attribute to the communication delay between processes. In the case of bare metal,
however, there was no bandwidth problem, which resulted in a good speedup in performance in
correlation with the number of processes.
ii. Synchronization issues in a distributed system
Synchronizing the CPU and memory utilization gathered from multiple nodes was a challenge, as each
node could provide its information asynchronously. In order to get an accurate combined utilization
while running the MPI PageRank algorithm on bare metal as well as on VMs, it was imperative to
synchronize all the nodes.
6. Acknowledgements
We thank Professor Qiu and the FutureGrid team, especially Andrew J. Younge, Stephen Wu and Thilina
Gunarathne, for their continued support throughout the course of the projects.
7. References
[1] PageRank, http://en.wikipedia.org/wiki/PageRank
[2] Sigar resource monitoring API, http://www.hyperic.com/products/sigar,
http://sourceforge.net/projects/sigar/
[3] ActiveMQ, http://activemq.apache.org/
[4] JFreeChart, http://www.jfree.org/jfreechart/
[5] TORQUE Resource Manager, http://www.clusterresources.com/products/torque-resource-manager.php
[6] FutureGrid, http://portal.futuregrid.org
[7] NaradaBrokering, http://www.naradabrokering.org/