HPC, Grids, Clouds: A Distributed System from Top to Bottom
Group 15
Kavin Kumar Palanisamy, Magesh Khanna Vadivelu, Shivaraman Janakiraman, Vasumathi Sridharan
1. Introduction
1.1 Overview
This project involved the implementation of the PageRank algorithm on a cloud. To understand the
implementation and analyze the performance of the PageRank algorithm with respect to the various
technologies used in distributed systems, we started by parallelizing the PageRank algorithm using MPI
libraries. The parallelized PageRank algorithm was then tested on an academic cloud in order to produce
a performance report. This was followed by the implementation of a resource monitoring system, which
monitors and visualizes resource utilization across a distributed set of nodes using message-broker
middleware. Finally, we performed dynamic provisioning, which provides the ability to use on-demand
resources in a shared academic Cloud environment.
1.2 Technologies
The following are the technologies we used during the course of the project:
1.2.1 NaradaBrokering
NaradaBrokering is an open source technology supporting a suite of capabilities for reliable/robust
flexible messaging. It is aimed at providing for the transport of messages between services and between
services and clients. NaradaBrokering is designed around a scalable distributed network of cooperating
message routers and processors. NaradaBrokering is a content distribution infrastructure for voluminous
data streams. The substrate places no limits on the size, rate and scope of the information encapsulated
within these streams or on the number of entities within the system. NaradaBrokering provides support
for the scalable and efficient dissemination of these data streams. The substrate incorporates capabilities
to mitigate network-induced effects, and also to ensure that these streams are secure, reliable, ordered and
jitter-reduced. All components within the system utilize globally-synchronized timestamps.
To facilitate communications in a variety of network realms, NaradaBrokering incorporates
support for several communication protocols such as TCP, UDP, Multicast, HTTP, SSL, IPSec and
Parallel TCP. Support for enterprise messaging standards such as the Java Message Service, and a slew of
Web Service specifications such as SOAP, WS-Eventing, WS-ReliableMessaging and WS-Reliability are
also available.
Since NaradaBrokering is application-independent, it has been harnessed in a variety of domains such
as Earthquake Science, Environmental Monitoring, Particle Physics, Geosciences and Internet based
conferencing systems.
Figure 1: NaradaBrokering Architecture.
NaradaBrokering is an asynchronous messaging infrastructure with a publish/subscribe-based
architecture. Networks of collaborating brokers are arranged in a cluster topology, with a hierarchy of
clusters, super-clusters, and super-super-clusters. Each
broker is assigned a logical address within the network, which corresponds to its location and contains a
Broker Node Map (BNM) for the calculation of routes, based on broker hops. The NaradaBrokering
transport framework provides the capability for each link between brokers to implement a different
underlying protocol. The security framework incorporates an encryption key management structure,
supporting a variety of algorithms, for topics, publishers, and subscribers. A built-in performance
aggregation service can monitor links originating from a broker and typically displays values for the
average delay, latency, jitter, throughput, and loss rates. Audio/video conferencing is accomplished with
the aid of the Real-time Transport Protocol (RTP) and the Java Media Framework. Support for JXTA
peer-to-peer end-points communicating over a NaradaBrokering broker network is provided through a
proxy. NaradaBrokering also incorporates services for the compression/decompression and
fragmentation/coalescing of payloads/files; it also has the ability to bypass firewalls and proxies.
1.2.2 Eucalyptus
Eucalyptus is a software platform for the implementation of private cloud computing on computer
clusters. There is an enterprise edition and an open-source edition. Currently, it exports a user-facing
interface that is compatible with the Amazon EC2 and S3 services but the platform is modularized so that
it can support a set of different interfaces simultaneously. The development of Eucalyptus software is
sponsored by Eucalyptus Systems, a venture-backed start-up. Eucalyptus works with most currently
available Linux distributions including Ubuntu, Red Hat Enterprise Linux (RHEL), CentOS, SUSE Linux
Enterprise Server (SLES), openSUSE, Debian and Fedora. It can also host Microsoft Windows images.
Similarly, Eucalyptus can use a variety of virtualization technologies, including VMware, Xen and KVM
hypervisors, to implement the cloud abstractions it supports. Eucalyptus is an acronym for "Elastic Utility
Computing Architecture for Linking Your Programs to Useful Systems".
Eucalyptus implements IaaS (Infrastructure as a Service) style private and hybrid clouds. The platform
provides a single interface that lets users access computing infrastructure resources (machines, network,
and storage) available in private clouds, implemented by Eucalyptus inside an organization's existing
data center, and resources available externally in public cloud services. The software is designed with a
modular and extensible Web services-based architecture that enables Eucalyptus to export a variety of
APIs towards users via client tools. Currently, Eucalyptus implements the industry-standard Amazon
Web Services (AWS) API, which allows the interoperability of Eucalyptus with existing AWS services
and tools. Eucalyptus provides its own set of command line tools called Euca2ools, which can be used
internally to interact with Eucalyptus private cloud installations or externally to interact with public cloud
offerings, including Amazon EC2.
Eucalyptus includes these features:
Compatibility with Amazon Web Services API.
Installation and deployment from source or DEB and RPM packages.
Secure communication between internal processes via SOAP and WS-Security.
Support for Linux and Windows virtual machines (VMs).
Support for multiple clusters as a single cloud.
Elastic IPs and Security Groups.
Users and Groups Management.
Accounting reports.
Configurable scheduling policies and SLAs.
Figure 2: Eucalyptus Software architecture
The Eucalyptus cloud computing platform has five high-level components: Cloud Controller (CLC),
Cluster Controller (CC), Walrus, Storage Controller (SC) and Node Controller (NC). Each high-level
system component has its own Web interface and is implemented as a stand-alone Web service. This has
two major advantages: First, each Web service exposes a well-defined language-agnostic API in the form
of a WSDL document containing both the operations that the service can perform and the input/output
data structures. Second, Eucalyptus leverages existing Web-service features such as security policies
(WSS) for secure communication between components and relies on industry-standard web-services
software packages.
Eucalyptus Components
Cloud Controller (CLC) - The CLC is responsible for exposing and managing the underlying virtualized
resources (machines/servers, network, and storage) via user-facing APIs. Currently, the CLC exports a
well-defined, industry-standard API (Amazon EC2) as well as a Web-based user interface.
Walrus - Walrus implements scalable "put-get bucket storage." The current implementation of Walrus is
interface-compatible with Amazon's S3 (a get/put interface for buckets and objects), providing a
mechanism for persistent storage and access control of virtual machine images and user data.
Cluster Controller (CC) - The CC controls the execution of virtual machines (VMs) running on the nodes
and manages the virtual networking between VMs and between VMs and external users.
Storage Controller (SC) - The SC provides block-level network storage that can be dynamically attached
to VMs. The current implementation of the SC supports the Amazon Elastic Block Storage (EBS)
semantics.
Node Controller (NC) - The NC (through the functionality of a hypervisor) controls VM activities,
including the execution, inspection, and termination of VM instances.
1.2.3 Torque
The TORQUE Resource Manager is an open source distributed resource manager
providing control over batch jobs and distributed compute nodes. Its name stands for Terascale
Open-Source Resource and QUEue Manager. It is a community effort based on the original PBS
project and, with more than 1,200 patches, has incorporated significant advances in the areas of
scalability, fault tolerance, and feature extensions contributed by NCSA, OSC, USC, the US
DOE, Sandia, PNNL, UB, TeraGrid, and many other leading edge HPC organizations.
TORQUE can integrate with the open source Maui Cluster Scheduler or the commercial
Moab Workload Manager to improve overall utilization, scheduling and administration on a
cluster.
2. Architecture and Implementations
2.1. PageRank algorithm:
Figure 1: PageRank indicated as a percentage for 11 nodes
PageRank is defined as follows:
We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping
factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the
next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A
is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
PageRank forms a probability distribution over web pages, so the sum of all web pages' PageRank values
will be one. The PageRank process can be understood as a Markov chain [1], which requires iterative
calculation to converge.
Damping factor in Random surfer model:
PageRank can be considered a model of user behavior, where a surfer clicks on links at random with no
regard for content. The probability that the random surfer keeps clicking on links is given by the damping
factor d, which is set between 0 and 1. The higher d is, the more likely the random surfer is to keep
clicking links. Since the surfer jumps to a random page after they stop clicking links, this behavior is
modeled by the constant term (1-d) in the algorithm.
b. MPI PageRank:
Parallel PageRank works by partitioning the PageRank problem into N sub-problems so that N processes
each solve one sub-problem concurrently. One simple partitioning approach is vertex-centric: the
PageRank graph is divided into groups of vertices, and each group is processed by one process. In this
project we implemented parallel PageRank using this method. The resulting program was run in two
settings: bare metal and Eucalyptus VMs on multiple nodes using FutureGrid.
2.2 Running MPI PageRank on a cluster and Eucalyptus Cloud infrastructure
The aim of this portion of the project was to understand the efficiency of our PageRank algorithm by
measuring its performance in two different environments. The goal was to achieve speed up. We ran the
MPI PageRank program in two different modes:
a) Baremetal
b) Eucalyptus cloud
We obtained bare-metal nodes and Eucalyptus instances using our FutureGrid India account.
Speedup:
S = T1 / Tp, where
T1 is the execution time of the sequential PageRank algorithm (in our case, the execution time with 1
process), and
Tp is the execution time of the parallel PageRank algorithm with p processes.
In the ideal case, S equals p, indicating that the program scales perfectly with the number of processes.
We show that this is not the case in VM environments.
2.3 An MVC-based cluster monitoring system using pub/sub messaging middleware
We implemented a system that monitors CPU and memory utilization on two systems:
1) a local commercial laptop, and 2) a VM node running on a Eucalyptus cluster.
Monitoring information was collected and aggregated through the message broker, and the overall CPU
and memory utilization percentages were displayed using graphs.
Figure 2: MPI PageRank algorithm flowchart
2.3.1. NaradaBrokering
NaradaBrokering is a message broker middleware that we used to monitor the resource utilization
in a distributed set of nodes.
Architecture: There are three main components of this monitoring system: a Message Broker, Monitoring Daemons
running on nodes and a Monitoring UI.
• Message Broker: middleware that holds a series of messages with specific topics and waits for a front-
end subscriber to pick up the messages (e.g., NaradaBrokering, ActiveMQ). AIs had set up instances of
NaradaBrokering and ActiveMQ to be used by the students. Students were advised to prefix their topics
with their group number (e.g., G01_xyz) to avoid conflicts when sharing the same brokers.
• Monitoring Daemon: a background process that runs on each compute node and periodically captures
and publishes the node's resource utilization information (CPU and memory utilization) and other
important usage information to the Message Broker. This daemon should not interfere with the other
processes running on the compute node.
• Summarizer and Monitoring UI: the Summarizer listens to messages with specific topic(s) from the
Message Broker and summarizes the collected information. This summarized information (overall CPU
and memory utilization) is displayed as a cumulative graph of the targeted computing environment. The
summarizer and the UI can be separate applications that communicate with each other, or a single
application.
Figure 3: Overview of the architecture
2.4 Job submissions on a dynamic provisioning cluster
We automated the process of setting up the monitoring system and running MPI PageRank using PBS
job scripts on
Bare metal
Virtual clusters
We obtained a set of bare-metal machines through the Torque resource manager on FutureGrid and
booted up a set of virtual machines using India-Eucalyptus.
2.4.1 System Architecture
Based on the information received from the monitoring infrastructure, users can programmatically
switch/re-provision their nodes to another environment (e.g., from Linux to Linux VMs). Figure 4 shows
the interactions between the components within this system.
Figure 4: User interactions with the dynamic provisioning system
3. Experiments
3.1 Settings
Academic cloud and hardware: The cloud comprised a bare-metal cluster and Eucalyptus VMs. The
clients were Linux machines.
Languages used: We used C with OpenMPI for the parallel implementation of PageRank and Java to
implement the monitoring system.
Libraries and tools: We used NaradaBrokering as our pub/sub library, and the JFreeChart and Sigar
libraries for monitoring chart creation. We used Torque and Moab for dynamic provisioning and batch
processing.
3.2 Input
Data format:
The input data for the PageRank application is a web graph in adjacency matrix format [2], i.e., the web
graph is transformed into a simplified adjacency matrix. The following are the steps we used to construct
the adjacency matrix for the web graph in Fig. 1:
1) Construct a set of tuples that describe the web graph structure, each mapping a page to the pages it
links to: WebG = {(A, null), (B, C), (C, B), (D, A B), (E, B D F), (F, B E), (G1, B E), (G2, B E),
(G3, B E), (G4, E), (G5, E)}
2) Map letters to numbers: A->0, B->1, C->2, D->3, E->4, F->5, G1->6, G2->7, G3->8, G4->9, G5->10
3) Construct the simplified adjacency matrix based on the information in steps 1 and 2.
0
1 2
2 1
3 0 1
4 1 3 5
5 1 4
6 1 4
7 1 4
8 1 4
9 4
10 4
3.3 Output
3.3.1 PageRank results:
The PageRank program displays the top 10 URLs arranged in decreasing order of PageRank value. The
following output was obtained with these parameters:
a) Number of processes = 3
b) Threshold = 0.000001
c) Iteration count = 10
d) No. of URLs in the dataset: 1000
The top 10 URLs are:
Node    PR value
0 0.138430
34 0.124686
10 0.111501
20 0.077290
14 6 0.056927
2 0.047347
12 0.019615
14 0.017787
6 0.013246
16 0.012961
3.3.2 Performance charts of MPI PageRank running on bare metal vs. Eucalyptus
Fig. 3: Parameters used for the MPI PageRank algorithm

                        Baremetal       Eucalyptus
No. of worker nodes     4               3
Size of dataset         100K and 500K   100K and 500K
No. of processes        1 to 13         1 to 13
Threshold               0.000001        0.000001
Iteration setting       10              10

Figure 4: Performance analysis speedup charts on bare metal and Eucalyptus
3.3.3 Snapshots of monitoring system UI
Fig. 5: Performance index of a commercial laptop (left) compared to our UI
Fig. 6: Performance index on the cluster
Fig. 7: Performance index on bare metal (500K, 700K and 900K URLs)
Fig. 8: CPU and memory utilization (VM)
4. Analysis of results
4.1 Measurements of MPI PageRank on bare metal vs. Eucalyptus
a) Baremetal
As seen in Figure 4, graphs I and II, the MPI PageRank program achieved an overall speedup as the
number of processes increased, which is the expected trend for parallel algorithms.
b) Eucalyptus:
We found that we got a speedup whenever the number of processes was of the form 3n + 1, i.e., when
np = 4, 7, 10, 13, ... At all other times, we observed a slowdown in performance. The sudden spikes in
speedup could be due to the following:
1) We used 3 instances, and performance increased when the first instance was assigned more
processes than the rest of the instances.
2) The slowdown could be due to the absence of the InfiniBand capacity that is present on the
bare-metal nodes.
3) We can also attribute the slowdown to the communication delay between processes.
4.2 Dynamic switching overhead:
Whenever a VM booted, we noticed a spike in CPU and memory utilization, as shown in Figure 8.
5. Conclusion
a. Summary of Achievements
We successfully parallelized the PageRank algorithm with the help of MPI libraries. The performance of
the PageRank algorithm was analyzed, and a report was generated illustrating its performance on the
academic cloud. The resource monitoring system, which monitors and visualizes resource utilization
across a distributed set of nodes, was implemented using NaradaBrokering. We also implemented
dynamic provisioning, which provides the ability to use on-demand resources in a shared academic
Cloud environment.
As part of future work, we plan to implement a data classification tool in Hadoop that can be used in
shopping malls at the application level.
b. Findings
i. Computation vs. communication overhead of MPI PageRank
Since Eucalyptus runs over Ethernet, we had communication overhead. We observed a slowdown in
Eucalyptus, which we attribute to the communication delay between processes. In the case of bare metal,
however, there was no bandwidth problem, which resulted in a good speedup in performance in
correlation with the number of processes.
ii. Synchronization issues in a distributed system
Synchronizing the CPU and memory utilization gathered from multiple nodes was a challenge, as each
node could provide its information asynchronously. In order to get an accurate combined utilization
while running the MPI PageRank algorithm on bare metal as well as on VMs, it was imperative to
synchronize all the nodes.
6. Acknowledgements
We thank Professor Qiu and the FutureGrid team, especially Andrew J. Younge, Stephen Wu and Thilina
Gunarathne, for their continued support throughout the course of the projects.
7. References
[1] PageRank, http://en.wikipedia.org/wiki/PageRank
[2] Sigar resource monitoring API, http://www.hyperic.com/products/sigar,
http://sourceforge.net/projects/sigar/
[3] ActiveMQ, http://activemq.apache.org/
[4] JFreeChart, http://www.jfree.org/jfreechart/
[5] TORQUE Resource Manager, http://www.clusterresources.com/products/torque-resource-manager.php
[6] FutureGrid, http://portal.futuregrid.org
[7] NaradaBrokering, http://www.naradabrokering.org/