
Achieving High Performance Computing

CHAPTER 1

INTRODUCTION

1.1. Parallel Programming Paradigm

In the 1980s it was believed that computer performance was best improved by creating faster and more efficient processors. This idea was challenged by parallel processing, which in essence means linking together two or more computers to jointly solve a computational problem. Since the early 1990s there has been an increasing trend to move away from expensive and specialized proprietary parallel supercomputers (vector supercomputers and massively parallel processors) towards networks of computers (PCs, workstations, SMPs). Among the driving forces that have enabled this transition has been the rapid improvement in the availability of commodity high-performance components for PCs, workstations, and networks. These technologies are making a networked cluster of computers an appealing vehicle for cost-effective parallel processing, and this is consequently leading to low-cost commodity supercomputing.

Scalable computing clusters, ranging from clusters of (homogeneous or heterogeneous) PCs or workstations to SMPs, are rapidly becoming the standard platforms for high-performance and large-scale computing. The main attraction of such systems is that they are built from affordable, low-cost commodity hardware (such as Pentium PCs), fast LANs, and standard software components such as UNIX and MPI. These systems are scalable, i.e., they can be tuned to the available budget and computational needs, and they allow efficient execution of both demanding sequential and parallel applications.

1.2. Overview

We intend to present some of the main motivations for the widespread use of clusters in high-performance parallel computing. In the next section, we discuss a generic architecture of a cluster computer and a grid computer; the rest of the chapter focuses on the message passing interface, strategies for writing parallel programs, and the two main approaches to parallelism (implicit and explicit). We briefly summarize the whole spectrum of choices for exploiting parallel processing: message-passing libraries, distributed shared memory, and object-oriented programming. However, the main focus of this chapter is the identification and introduction of parallel programming paradigms in existing applications such as OpenFOAM. This approach offers some interesting advantages, for example the reuse of code, higher flexibility, and increased productivity for the parallel program developer.

1.3. Grid Network

Grid networking services are best presented within the context of the Grid and its

architectural principles. The Grid is a flexible, distributed, information technology environment

that enables multiple services to be created with a significant degree of independence from the

specific attributes of underlying support infrastructure. Advanced architectural infrastructure

design increasingly revolves around the creation and delivery of multiple ubiquitous digital

services. A major goal of information technology designers is to provide an environment within

which it is possible to present any form of information on any device at any location. The Grid

is an infrastructure that highly complements the era of ubiquitous digital information and

services.

These environments are designed to support services not as discrete infrastructure

components, but as modular resources that can be integrated into specialized blends of

capabilities to create multiple additional, highly customizable services. The Grid also allows

such services to be designed and implemented by diverse, distributed communities,

independently of centralized processes. Grid architecture represents an innovation that is

advancing efforts to achieve these goals.

Early Grid infrastructure was developed to support data and compute intensive science

projects. For example, the high-energy physics community was an early adopter of Grid

technology. This community must acquire extremely high volumes of data from specialized

instruments at key locations in different countries. They must gather, distribute, and analyze

those large volumes of data as a collaborative initiative with thousands of colleagues around the

world.


1.4. Message Passing Interface Cluster

Message-passing libraries allow efficient parallel programs to be written for distributed-memory systems. These libraries provide routines to initiate and configure the messaging environment as well as to send and receive packets of data. Currently, the most popular high-level message-passing system for scientific and engineering applications is MPI (Message Passing Interface), defined by the MPI Forum.

Currently, there are several implementations of MPI, including versions for networks of workstations, clusters of personal computers, distributed-memory multiprocessors, and shared-memory machines. Almost every hardware vendor supports MPI. This gives the user a comfortable feeling, since an MPI program can be executed on almost all existing computing platforms without the need to rewrite the program from scratch. The goals of portability, architecture transparency, and network transparency have been achieved with low-level communication libraries such as MPI. These libraries provide an interface for C and Fortran, with additional support for graphical tools.

However, these message-passing systems are still stigmatized as low-level because most tasks of the parallelization are left to the application programmer. When writing parallel applications using message passing, the programmer still has to develop a significant amount of software to manage the tasks of parallelization, such as: the communication and synchronization between processes, data partitioning and distribution, mapping of processes onto processors, and input/output of data structures. If the application programmer has no special support for these tasks, it becomes difficult to widely exploit parallel computing. The ease-of-use goal is not accomplished with a bare message-passing system and hence requires additional support.
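As a simple illustration (not taken from the report), the following C sketch shows the kind of explicit communication an MPI programmer has to manage by hand: process 0 sends a small buffer and process 1 receives it, and it is the programmer's responsibility to pair every send with a matching receive.

/* Hypothetical illustration: explicit point-to-point communication in MPI.
 * Compile with: mpicc send_recv.c -o send_recv   Run with: mpiexec -n 2 ./send_recv */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data[4] = {1, 2, 3, 4};

    MPI_Init(&argc, &argv);               /* initialize the messaging environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* which process am I? */

    if (rank == 0) {
        /* explicit data distribution: rank 0 ships the buffer to rank 1 */
        MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d %d %d %d\n", data[0], data[1], data[2], data[3]);
    }

    MPI_Finalize();
    return 0;
}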

1.5. OpenFOAM

The OpenFOAM® (Open Field Operation and Manipulation) CFD Toolbox is a free,

open source CFD software package produced by a commercial company, OpenCFD Ltd. It has a

large user base across most areas of engineering and science, from both commercial and


academic organisations. OpenFOAM has an extensive range of features to solve anything from

complex fluid flows involving chemical reactions, turbulence and heat transfer, to solid

dynamics and electromagnetics.

The core technology of OpenFOAM is a flexible set of efficient C++ modules. These are

used to build a wealth of: solvers, to simulate specific problems in engineering mechanics;

utilities, to perform pre- and post-processing tasks ranging from simple data manipulations to

visualisation and mesh processing; libraries, to create toolboxes that are accessible to the

solvers/utilities, such as libraries of physical models.

OpenFOAM is supplied with numerous pre-configured solvers, utilities and libraries and

so can be used like any typical simulation package. However, it is open, not only in terms of

source code, but also in its structure and hierarchical design, so that its solvers, utilities and

libraries are fully extensible. OpenFOAM uses finite volume numerics to solve systems of partial

differential equations ascribed on any 3D unstructured mesh of polyhedral cells. The fluid flow

solvers are developed within a robust, implicit, pressure-velocity, iterative solution framework,

although alternative techniques are applied to other continuum mechanics solvers.

One of the strengths of OpenFOAM is that new solvers and utilities can be created by its

users with some pre-requisite knowledge of the underlying method, physics and programming

techniques involved. OpenFOAM is supplied with pre- and post-processing environments. The

interface to the pre- and post-processing are themselves OpenFOAM utilities, thereby ensuring

consistent data handling across all environments. The overall structure of OpenFOAM is shown

in Figure 1.1.


1.6. Case Studies

1.6.1. Dense Matrix

Dense matrix multiplication is a core operation in scientific computing and has been a topic of interest for computer scientists for over forty years. Theoretical computer scientists have refined the time bounds of the problem, and the focus of implementations has shifted from the serial to the parallel computing model.

The lower bound for the problem of multiplying two square dense matrices of size n by n (henceforth referred to as matrix multiplication) has been known for some time to be Ω(n²), as every scalar element of the matrices must be examined. Until 1968, no improvements to the naive algorithm were known.

1.6.2. Computational Fluid Dynamics

A CFD (Computational Fluid Dynamics) code developed for the conjugate heat transfer problem is used for this case study. The term conjugate heat transfer refers to a heat transfer process involving an interaction of heat conduction within a solid body with free, forced, or mixed convection from its surface to a fluid (or to its surface from a fluid) flowing over it. An accurate analysis of such heat transfer problems necessitates the coupling of the problem of conduction in the solid with that of convection in the fluid by satisfying the conditions of continuity in temperature and heat flux at the solid–fluid interface.
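In standard notation (the report does not write these conditions out), continuity of temperature and of the normal heat flux at the interface can be expressed as

\[ T_s = T_f, \qquad k_s \frac{\partial T_s}{\partial n} = k_f \frac{\partial T_f}{\partial n} \quad \text{on the solid--fluid interface,} \]

where T_s and T_f are the solid and fluid temperatures, k_s and k_f the respective thermal conductivities, and n the direction normal to the interface.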

There are many engineering and practical applications in which conjugate heat transfer

occurs. One such area of application is in the thermal design of a fuel element of a nuclear

reactor. The energy released due to fission in the fuel element is first conducted to its lateral

surface, which in turn is dissipated to the coolant flowing over it so as to maintain the

temperature anywhere in the fuel element well within its allowable limit. If this energy generated

is not removed fast enough, the fuel elements and other components may heat up so much that

eventually a part of the core may melt. In fact, the limit to the power at which a reactor can be

operated is set by the heat transfer capacity of the coolant. Therefore, the knowledge of the


temperature field in the fuel element and the flow and thermal fields in the coolant is needed in

order to predict its thermal performance.

1.6.3. OpenFOAM

Fluid dynamics is a field of science which studies the physical laws governing the flow of

fluids under various conditions. Great effort has gone into understanding the governing laws and

the nature of fluids themselves, resulting in a complex yet theoretically strong field of research.


CHAPTER 2

TESTBED SETUP

2.1. Globus Toolkit

Globus is a community of users and developers who collaborate on the use and development of open source software, and associated documentation, for distributed computing and resource federation.

The middleware software itself, the Globus Toolkit, is a set of libraries and programs that address common problems that occur when building distributed system services and applications. Globus is also the infrastructure that supports this community: code repositories, email lists, a problem tracking system, and so forth, all accessible at globus.org.

The software itself provides a variety of components and capabilities, including

the following:

− A set of service implementations focused on infrastructure management.

− Tools for building new Web services, in Java, C, and Python.

− A powerful standards-based security infrastructure.

− Both client APIs (in different languages) and command line programs for accessing

these various services and capabilities.

− Detailed documentation on these various components, their interfaces, and how they

can be used to build applications.

GT4 makes extensive use of Web services mechanisms to define its interfaces and structure its components. Web services provide flexible, extensible, and widely adopted XML-based mechanisms for describing, discovering, and invoking network services; in addition, their document-oriented protocols are well suited to the loosely coupled interactions that many argue are preferable for robust distributed systems. These mechanisms facilitate the development of service-oriented architectures: systems and applications structured as communicating services, in which service interfaces are described, operations invoked, access secured, etc., all in uniform ways.


Figure 2 illustrates various aspects of GT4 architecture.

2.1.1. Prerequisites


The packages that need to be pre-installed on the system are:

- jdk-1_5_0_03-linux-i586.bin

- apache-ant-1.6.4-bin.tar

- gt4.2.0-all-source-installer.tar.gz

Java Installation:

[root@pace~]# cd /usr/local/


[root@pace local]#rpm -q zlib-devel

[root@pace local]#./jdk-1_5_0_03-linux-i586.bin

[root@pace local]# vi /etc/profile

add the following lines...

#GRID ENVIRONMENT VARIABLE SETTINGS....

JAVA_HOME=/usr/local/jdk1.5.0_03

PATH=$JAVA_HOME/bin:$PATH

CLASSPATH=$CLASSPATH:$JAVA_HOME/lib/tools.jar

export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE INPUTRC CLASSPATH JAVA_HOME

Note: Most Linux distributions come with Java built in, but Globus needs the Java distribution from Sun, so check which vendor's Java is installed.

Apache Ant Installation:

[root@pace]# tar -xvf /home/vkuser/gt4/software/apache-ant-1.6.4-bin.tar

[root@pace]# mv apache-ant-1.6.4 ant-1.6.4

[root@pace ant-1.6.4]# vi /etc/profile

add the following lines...

ANT_HOME=/usr/local/ant-1.6.4

PATH=$ANT_HOME/bin:$JAVA_HOME/bin:$PATH

Note:

Ant and Java are needed for compiling the Globus source code. We used Fedora 10, which had all the requirements, and installed Ant and Java. The installation steps may differ according to the version; refer to the installation guide in each package.


Globus installation

create the Globus user account

[root@pace]#adduser globus

[root@pace]#passwd xxxxxx

Copy the file gt4.2.0-all-source-installer.tar.gz to /usr/local and untar it.

[root@pace]$ tar xzf gt4.2.0-all-source-installer.tar.gz

Configure, compile, and change the ownership to the globus user and change the permissions:

[root@pace]#chown globus:globus gt4.2.0-all-source-installer.tar.gz

Now a directory will be created (e.g. gt4.2.0-all-source-installer); go into the directory and execute the configure script.

[root@pace]#./configure

[root@pace]#make

[root@pace]#make install

[root@pace]# chown -R globus:globus /usr/local/globus-4.2.0/

Note: Before starting the installation process, change the hostname of your system. The default hostname (localhost.localdomain) will create problems during certificate generation.

Now as a root user

[root@pace local]# vi /etc/profile

add the following lines...

GLOBUS_LOCATION=/usr/local/globus-4.2.0

PATH=$ANT_HOME/bin:$JAVA_HOME/bin:$LAM_HOME/bin:$LAM_HOME/sbin:$PATH:$GLOBUS_LOCATION/bin:$GLOBUS_LOCATION/sbin

export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE INPUTRC CLASSPATH GLOBUS_LOCATION


2.1.2. Setting up the first machine

2.1.2.1 SimpleCA configuration:

[globus@pace gt4.2.0-all-source-installer]$ source $GLOBUS_LOCATION/etc/globus-user-env.sh

[globus@pace~]$ $GLOBUS_LOCATION/setup/globus/setup-simple-ca

The following results are displayed on the terminal:

The unique subject name for this CA is:

cn=Globus Simple CA, ou=simpleCA-pace.grid, ou=GlobusTest, o=Grid

Do you want to keep this as the CA subject (y/n) [y]:y

Enter the email of the CA (this is the email where certificate

requests will be sent to be signed by the CA):[email protected]

The CA certificate has an expiration date. Keep in mind that

once the CA certificate has expired, all the certificates

signed by that CA become invalid. A CA should regenerate

the CA certificate and start re-issuing ca-setup packages

before the actual CA certificate expires. This can be done

by re-running this setup script. Enter the number of DAYS

the CA certificate should last before it expires.

[default: 5 years (1825 days)]: <enter>

Enter PEM pass phrase:xxxxxx

Verifying - Enter PEM pass phrase:123456

setup-ssl-utils: Complete

[root@pace~]# $GLOBUS_LOCATION/setup/globus_simple_ca_116a21a8_setup/setup-gsi-default


Running the above command causes the following output to be displayed:

setup-gsi: Configuring GSI security

Making /etc/grid-security...

mkdir /etc/grid-security

Making trusted certs directory: /etc/grid-security/certificates/

mkdir /etc/grid-security/certificates/

Installing /etc/grid-security/certificates//grid-security.conf.116a21a8...

Running grid-security-config...

Installing Globus CA certificate into trusted CA certificate directory...

Installing Globus CA signing policy into trusted CA certificate directory...

setup-gsi: Complete

[root@pace~]# source $GLOBUS_LOCATION/etc/globus-user-env.sh

[root@pace~]# grid-cert-request -host `hostname`

[root@pace~]#exit

[globus@pace ~]$ grid-ca-sign -in /etc/grid-security/hostcert_request.pem -out hostsigned.pem

To sign the request please enter the password for the CA key:xxxxxx

The new signed certificate is at: /home/globus/.globus/simpleCA//newcerts/01.pem

[root@pace ~]# cp /home/globus/hostsigned.pem /etc/grid-security/hostcert.pem

cp: overwrite `/etc/grid-security/hostcert.pem'? y

[root@pace ~]# cd /etc/grid-security/

[root@pace grid-security]# cp hostcert.pem containercert.pem

[root@pace grid-security]# cp hostkey.pem containerkey.pem

[root@pace grid-security]# chown globus:globus container*.pem


[root@pace grid-security]# exit

Now we'll get a usercert for guser01.

[globus@pace ~]$ su - guser01

[guser01@pace~]$ source $GLOBUS_LOCATION/etc/globus-user-env.sh

[guser01@pace ~]$ grid-cert-request

Generating a 1024 bit RSA private key

..........++++++

............++++++

writing new private key to '/home/guser01/.globus/userkey.pem'

Enter PEM pass phrase:xxxxxx

Verifying - Enter PEM pass phrase:xxxxxx

[guser01@pace ~]$ cp /home/guser01/.globus/usercert_request.pem /tmp/request.pem

[globus@pace ~]$ cp /tmp/request.pem /home/globus

[globus@pace ~]$ grid-ca-sign -in request.pem -out signed.pem

To sign the request please enter the password for the CA key:123456

The new signed certificate is at: /home/globus/.globus/simpleCA//newcerts/02.pem

[globus@pace ~]$ cp signed.pem /tmp/

[globus@pace ~]$ su - guser01

[guser01@pace ~]$ cp /tmp/signed.pem ~/.globus/usercert.pem

[guser01@pace~]$ grid-cert-info -subject

/O=Grid/OU=GlobusTest/OU=simpleCA-pace.grid/OU=grid/CN=grid user #01

[root@pace ~]# vi /etc/grid-security/grid-mapfile

add the following line..


"/O=Grid/OU=GlobusTest/OU=simpleCA- pace.grid/OU=grid/CN=grid" guser01

Environment variable setting for Credentials

[root@pace~]#vi /etc/profile

add the following lines...

GRID_SECURITY_DIR=/etc/grid-security

GRIDMAP=/etc/grid-security/grid-mapfile

export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE INPUTRC CLASSPATH GLOBUS_LOCATION JAVA_HOME GRIDMAP GRID_SECURITY_DIR

Validate certificate setup:

Note: login as guser01

[root@pace~]# openssl verify -CApath /etc/grid-security/certificates -purpose sslserver /etc/grid-security/hostcert.pem

/etc/grid-security/hostcert.pem: OK

2.1.2.2. Setting up GridFTP

[root@pace ~]# vim /etc/xinetd.d/gridftp

add the following lines...

service gsiftp

{

instances = 100

socket_type = stream

wait = no

user = root

env += GLOBUS_LOCATION=/usr/local/globus-4.2.0


env += LD_LIBRARY_PATH=/usr/local/globus-4.2.0/lib

server = /usr/local/globus-4.2.0/sbin/globus-gridftp-server

server_args = -i

log_on_success += DURATION

nice = 10

disable = no

}

[root@pace ~]# vim /etc/services

add the following line into bottom of the file.

# Local services

gsiftp 2811/tcp

[root@mitgrid ~]# /etc/init.d/xinetd reload

Reloading configuration: [ OK ]

[root@mitgrid ~]# netstat -an | grep 2811

tcp 0 0 0.0.0.0:2811 0.0.0.0:* LISTEN

Note:

Now the gridftp server is waiting for a request, so we'll run a client and transfer a file:

Testing:

[guser01@pace ~]$ grid-proxy-init -verify -debug

User Cert File: /home/guser01/.globus/usercert.pem

User Key File: /home/guser01/.globus/userkey.pem

Trusted CA Cert Dir: /etc/grid-security/certificates

Output File: /tmp/x509up_u502

Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA-mitgrid.grid/OU=grid/CN=grid

user #01

Enter GRID pass phrase for this identity:guser01


Creating proxy .............++++++++++++

..++++++++++++

Done

Proxy Verify OK

Your proxy is valid until: Sun Jan 29 01:12:48 2006

[guser01@mitgrid]$ globus-url-copy gsiftp://mitgrid.grid/etc/group file:///tmp/guser01.test.copy

[guser01@mitgrid ~]$ diff /tmp/guser01.test.copy /etc/group

Okay, so the GridFTP server works.

Starting the web services container configuration:

Now we'll setup an /etc/init.d entry for the web services container.

Note: login as globus

[globus@mitgrid ~]$ vim $GLOBUS_LOCATION/start-stop

add the following lines....

#! /bin/sh

set -e

export GLOBUS_OPTIONS="-Xms256M -Xmx512M"

. $GLOBUS_LOCATION/etc/globus-user-env.sh

cd $GLOBUS_LOCATION

case "$1" in start)

$GLOBUS_LOCATION/sbin/globus-start-container-detached -p 8443

;;

stop)

$GLOBUS_LOCATION/sbin/globus-stop-container-detached

;;

*)

echo "Usage: globus {start|stop}" >&2

exit 1


;;

esac

exit 0

[globus@mitgrid ~]$ chmod +x $GLOBUS_LOCATION/start-stop

Now, as root, we'll create an /etc/init.d script to call the globus user's start-stop script:

Note: login as root

[root@mitgrid ~]# vim /etc/init.d/globus-4.2.0

add the following lines...

#!/bin/sh -e

case "$1" in

start)

su - globus /usr/local/globus-4.2.0/start-stop start

;;

stop)

su - globus /usr/local/globus-4.2.0/start-stop stop

;;

restart)

$0 stop

sleep 1

$0 start

;;

*)

printf "Usage: $0 {start|stop|restart}\n" >&2

exit 1

;;

esac

exit 0


[root@pace ~]# chmod +x /etc/init.d/globus-4.2.0

[root@pace ~]# /etc/init.d/globus-4.2.0 start

Starting Globus container. PID: 19051

2.1.2.3. Grid Resource Allocation and Management (GRAM)

Now that we have GridFTP and RFT working, we can set up GRAM for resource management. First we have to set up sudo so that the globus user can start jobs as a different user.

[root@pace ~]# visudo

add the following lines at the bottom of the file (it is linked with /etc/sudoers)...

#Grid variable settings by VK@MITGRID

globus ALL=(guser01) NOPASSWD: /usr/local/globus-4.2.0/libexec/globus-gridmap-and-execute -g /etc/grid-security/grid-mapfile /usr/local/globus-4.2.0/libexec/globus-job-manager-script.pl *

globus ALL=(guser01) NOPASSWD: /usr/local/globus-4.2.0/libexec/globus-gridmap-and-execute -g /etc/grid-security/grid-mapfile /usr/local/globus-4.2.0/libexec/globus-gram-local-proxy-tool *

Note: login as guser01

[guser01@pace ~]$ globusrun-ws -submit -c /bin/true

Submitting job...Done.

Job ID: uuid:a9378900-8fed-11da-a691-000ffe3b1003

Termination time: 01/29/2006 11:03 GMT

Current job state: Active

Current job state: CleanUp

Current job state: Done

Destroying job...Done.

[guser01@mitgrid ~]$


[guser01@mitgrid ~]$ echo $?

0

MyProxy Server Setup and Configuration:

In order to create a MyProxy server, first we'll turn the pace.grid machine into a MyProxy server by following these instructions.

Note: Login as root

[root@pace ~]# cp $GLOBUS_LOCATION/etc/myproxy-server.config /etc

[root@pace~]# vim /etc/myproxy-server.config

Just uncomment the following lines...

Before modification

#

# Complete Sample Policy

#

# The following lines define a sample policy that enables all

# myproxy-server features. See below for more examples.

#accepted_credentials "*"

#authorized_retrievers "*"

#default_retrievers "*"

#authorized_renewers "*"

#default_renewers "none"

#authorized_key_retrievers "*"

#default_key_retrievers "none"

after modification:

#

# Complete Sample Policy

#

# The following lines define a sample policy that enables all

# myproxy-server features. See below for more examples.


accepted_credentials "*"

authorized_retrievers "*"

default_retrievers "*"

authorized_renewers "*"

default_renewers "none"

authorized_key_retrievers "*"

default_key_retrievers "none"

[root@pace ~]# cat $GLOBUS_LOCATION/share/myproxy/etc.services.modifications >> /etc/services

[root@mitgrid ~]# tail /etc/services

asp 27374/udp # Address Search Protocol

tfido 60177/tcp # Ifmail

tfido 60177/udp # Ifmail

fido 60179/tcp # Ifmail

fido 60179/udp # Ifmail

# Local services

gsiftp 2811/tcp

myproxy-server 7512/tcp # Myproxy server

[root@pace ~]# cp $GLOBUS_LOCATION/share/myproxy/etc.xinetd.myproxy /etc/xinetd.d/myproxy

[root@pace ~]# vim /etc/xinetd.d/myproxy

Modify the following lines....

service myproxy-server

{

socket_type = stream

protocol = tcp

wait = no


user = root

server = /usr/local/globus-4.2.0/sbin/myproxy-server

env = GLOBUS_LOCATION=/usr/local/globus-4.2.0

LD_LIBRARY_PATH=/usr/local/globus-4.2.0/lib

disable = no

}

[root@pace ~]# /etc/init.d/xinetd reload

Reloading configuration: [ OK ]

[root@pace ~]# netstat -an | grep 7512

tcp 0 0 0.0.0.0:7512 0.0.0.0:* LISTEN

Note: Login as guser01 @pace.grid

[guser01@pace~]$ grid-proxy-destroy

[guser01@pace ~]$ grid-proxy-info

ERROR: Couldn't find a valid proxy.

Use -debug for further information.

Note: Instead of grid-proxy we use Myproxy server.

[guser01@mitgrid ~]$ myproxy-init -s mitgrid

Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA-mitgrid.grid/OU=grid/CN=grid

user #01

Enter GRID pass phrase for this identity: guser01

Creating proxy ........................................... Done

Proxy Verify OK

Your proxy is valid until: Fri Feb 10 15:44:40 2006

Enter MyProxy pass phrase: globus


Verifying - Enter MyProxy pass phrase: globus

A proxy valid for 168 hours (7.0 days) for user guser01 now exists on mitgrid.

[guser01@pace ~]$ myproxy-logon -s pace.grid

Enter MyProxy pass phrase:guser01

A proxy has been received for user guser01 in /tmp/x509up_u503.

2.1.3. Setting up the Second machine

Install the Globus Toolkit [follow the steps specified in the prerequisites and for the first machine].

Installation of CA packages

To install the CA packages, log in to the CA host as a Globus user, invoke the setup-simple-ca script, and answer the prompts as appropriate:

[globus@ca]$ $GLOBUS_LOCATION/setup/globus/setup-simple-ca

WARNING: GPT_LOCATION not set, assuming:

GPT_LOCATION=/usr/local/globus-4.2.0

C e r t i f i c a t e A u t h o r i t y S e t u p

This script will setup a Certificate Authority for signing Globus users certificates. It will

also generate a simple CA package that can be distributed to the users of the CA. The CA information about the certificates it distributes will be kept in:

/home/globus/.globus/simpleCA/

/usr/local/globus-4.0.0/setup/globus/setup-simple-ca: line 250: test: res:

integer expression expected

The unique subject name for this CA is:

cn=Globus Simple CA, ou=simpleCA-ca.redbook.ibm.com, ou=GlobusTest, o=Grid

Do you want to keep this as the CA subject (y/n) [y]: y


Enter the email of the CA (this is the email where certificate requests will be sent to be signed by the CA): (type mail address)[email protected]. The CA certificate has an expiration date. Keep in mind that once the CA certificate has expired, all the certificates signed by that CA become invalid. A CA should regenerate the CA certificate and start re-issuing ca-setup packages before the actual CA certificate expires. This can be done by re-running this setup script. Enter the number of DAYS the CA certificate should last before it expires.

[default: 5 years (1825 days)]: (type the number of days)1825

Enter PEM pass phrase: (type ca certificate pass phrase)

Verifying - Enter PEM pass phrase: (type ca certificate pass phrase)

...(unrelated information omitted)

Setup security in each grid node. After performing the steps above, a package file has

been created that needs to be used on other nodes, as described in this section. In order to use

certificates from this CA in other grid nodes, you need to copy and install the CA setup package

to each grid node.

1. Log in to a grid node as a Globus user and obtain the CA setup package from the CA host. Then run the setup commands for configuration.

[globus@hosta]$ scp globus@ca:/home/globus/.globus/simpleCA/globus_simple_ca_(ca_hash)_setup-0.18.tar.gz .

[globus@hosta]$ $GLOBUS_LOCATION/sbin/gpt-build globus_simple_ca_(ca_hash)_setup-0.18.tar.gz gcc32dbg

[globus@hosta]$ $GLOBUS_LOCATION/sbin/gpt-postinstall

Note: A CA setup package is generated when you run the setup-simple-ca command. Keep in

mind that the name of the CA setup package includes a unique CA hash. As the root user, submit

the commands to configure the CA settings in each grid node. This script creates the /etc/grid-security directory. This directory contains the configuration files for security.

Configure CA in each grid node


[root@hosta]# $GLOBUS_LOCATION/setup/globus_simple_ca_[ca_hash]_setup/setup-gsi -default

Note: For the setup of the CA host, you do not need to run the setup-gsi script. This script creates

a directory that contains the configuration files for security. The CA host does not need this directory, because these configuration files are for the servers and users who use the CA.

In order to use some of the services provided by Globus Toolkit 4, such as GridFTP, you

need to have a CA-signed host certificate and host key in the appropriate directory. As root user,

request a host certificate with the command

[root@pace]# grid-cert-request -host `hostname`

Copy or send the /etc/grid-security/hostcert_request.pem file to the CA host. On the CA host, as a

Globus user, sign the host certificate by using the grid-ca-sign command.

[globus@ca]$ grid-ca-sign -in hostcert_request.pem -out hostcert.pem

To sign the request please enter the password for the CA key: (type CA passphrase)

The new signed certificate is at:

/home/globus/.globus/simpleCA//newcerts/01.pem

Copy the hostcert.pem back to the /etc/grid-security/ directory in the grid node.

In order to use the grid environment, a grid user needs to have a CA signed user

certificate and user key in the user's directory. As a user (auser1 in hosta), request a user certificate with the command

[auser1@pace1]$ grid-cert-request

Enter your name, e.g., John Smith: grid user 1 (type grid user name). A certificate request and private key are being created. You will be asked to enter a PEM pass phrase. This pass phrase is akin to your account password, and is used to protect your key file. If you forget your pass phrase, you will need to obtain a new certificate.


Generating a 1024 bit RSA private key

.....................................++++++

...++++++

writing new private key to '/home/auser1/.globus/userkey.pem'

Enter PEM pass phrase: (type pass phrase for grid user)

Verifying - Enter PEM pass phrase: (retype pass phrase for grid user)

...(unrelated information omitted)

Copy or send the (userhome)/.globus/usercert_request.pem file to the CA host. On the CA host, as a Globus user, sign the user certificate by using the grid-ca-sign command.

[globus@pace]$ grid-ca-sign -in usercert_request.pem -out usercert.pem

To sign the request

please enter the password for the CA key:

The new signed certificate is at:

/home/globus/.globus/simpleCA//newcerts/02.pem

Copy the created usercert.pem to the (userhome)/.globus/ directory on the grid node. Test the user certificate by typing grid-proxy-init -debug -verify as the auser1 user. With this command, you can see the location of the user certificate and key, the CA's certificate directory, the distinguished name of the user, and the expiration time. After you successfully execute grid-proxy-init, you have been authenticated and are ready to use the grid environment.

[auser1@pace1]$ grid-proxy-init -debug -verify

User Cert File: /home/auser1/.globus/usercert.pem

User Key File: /home/auser1/.globus/userkey.pem

Trusted CA Cert Dir: /etc/grid-security/certificates

Output File: /tmp/x509up_u511

Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA-ca.redbook.ibm.com/OU=redbook.ibm.com/CN=grid user 1


Enter GRID pass phrase for this identity:

Creating proxy .........++++++++++++

.................++++++++++++

Done

Proxy Verify OK

Your proxy is valid until: Thu Jun 9 22:16:28 200

Note: You may copy those user certificates to other grid nodes in order to access each grid node as a single grid user, but you may not copy a host certificate and a host key; a host certificate needs to be created on each grid node. Set mapping information between a grid user and a local user: Globus Toolkit 4 requires a mapping between an authenticated grid user and a local user. In order to map a user, you need to get the distinguished name of the grid user and map it to a local user. Get the distinguished name by invoking the grid-cert-info command.

[auser1@pace1]$ grid-cert-info -subject -f /home/auser1/.globus/usercert.pem

/O=Grid/OU=GlobusTest/OU=simpleCA-ca.redbook.ibm.com/OU=redbook.ibm.com/CN=grid user 1

As the root user, map the local user name to the distinguished name by using the grid-mapfile-add-entry command.

[root@pace1]# grid-mapfile-add-entry -dn \

"/O=Grid/OU=GlobusTest/OU=simpleCA-ca.redbook.ibm.com/OU=redbook.ibm.com/CN=grid user 1" -ln auser1

Modifying /etc/grid-security/grid-mapfile ...

/etc/grid-security/grid-mapfile does not exist... Attempting to create /etc/grid-security/grid-mapfile

New entry:

"/O=Grid/OU=GlobusTest/OU=simpleCA-ca.redbook.ibm.com/OU=redbook.ibm.com/CN=grid user 1" auser1


Note: The grid-mapfile-add-entry command creates and adds an entry to /etc/grid-security/grid-mapfile. You can manually add an entry by adding a line to this file. In order to see the mapping information, look at /etc/grid-security/grid-mapfile.

Example of /etc/grid-security/grid-mapfile:

"/O=Grid/OU=GlobusTest/OU=simpleCA-ca.redbook.ibm.com/OU=redbook.ibm.com/CN=grid user 1" auser1

For setting up the Java WS Core, GridFTP, and MyProxy server, follow the steps specified for the first machine.

Submitting grid-proxy-init command

Note: Login as auser1 @pace.grid

[auser1@pace~]$ grid-proxy-destroy

[auser1@pace ~]$ grid-proxy-info

ERROR: Couldn't find a valid proxy.

Use -debug for further information.

[guser01@pace ~]$

Note: Instead of grid-proxy we use Myproxy server.

[auser1@mitgrid ~]$ myproxy-init -s mitgrid

Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA-mitgrid.grid/OU=grid/CN=grid

auser #1

Enter GRID pass phrase for this identity: auser01

Creating proxy ........................................... Done

Proxy Verify OK

Your proxy is valid until: Fri Feb 10 15:44:40 2006

Enter MyProxy pass phrase: globus

Verifying - Enter MyProxy pass phrase: globus

A proxy valid for 168 hours (7.0 days) for user guser01 now exists on mitgrid.


[auser1@pace ~]$ myproxy-logon -s pace.grid

Enter MyProxy pass phrase:auser1

A proxy has been received for user guser01 in /tmp/x509up_u503.

2.2. Message Passing Interface

2.2.1. Setting up MPICH

Here are the steps from obtaining MPICH2 through running your own parallel program

on multiple machines.

1. Unpack the tar file for MPICH2 i.e. mpich2.tar.gz

2. Choose an installation directory (the default is /usr/local/bin):

It will be most convenient if this directory is shared by all of the machines where you intend to run processes. If not, you will have to duplicate it on the other machines after installation.

3. Choose a build directory. Building will proceed much faster if your build directory is on a file

system local to the machine on which the configuration and compilation steps are executed. It is

preferable that this also be separate from the source directory, so that the source directories remain clean and can be reused to build other copies on other machines.

4. Configure, build, and install MPICH2 using the following respective commands, specifying

the installation directory, and running the configure script in the source directory:

./configure

make

make install

7. Add the bin subdirectory of the installation directory to your path:

export PATH=/home/you/mpich2-install/bin:$PATH

8. For security reasons, MPD looks in your home directory for a file named .mpd.conf containing

the line


secretword=<secretword>

where <secretword> is a string known only to yourself. It should not be your normal Unix pass-

word. Set the file permissions as readable and writable only by you:

cd $HOME

touch .mpd.conf

chmod 600 .mpd.conf

Then use an editor to place a line like:

secretword=mr45-j9z

into the file (of course, use a different secret word than mr45-j9z). If you are the super user, then as root create the mpd.conf file as /etc/mpd.conf.

9. The first sanity check consists of bringing up a ring of one MPD on the local machine, testing

one MPD command, and bringing the “ring” down.

mpd &

mpdtrace

mpdallexit

The output of mpdtrace should be the hostname of the machine you are running on. The mpdallexit command causes the mpd daemon to exit.

10. The next sanity check is to run a non-MPI program using the daemon.

mpd &

mpiexec -n 1 /bin/hostname

mpdallexit

This should print the name of the machine you are running on.

11. Now we will bring up a ring of mpd’s on a set of machines. Create a file consisting of a list

of machine names, one per line. Name this file mpd.hosts. These hostnames will be used as targets for ssh or rsh, so include full domain names if necessary. Check to see if all the hosts you listed in mpd.hosts are in the output of mpdtrace, and if so move on to the next step.

12. Test the ring you have just created:

ssh login without password

First log in on A as user a and generate a pair of authentication keys. Do not enter a passphrase:

a@A:~> ssh-keygen -t rsa

Generating public/private rsa key pair.

Enter file in which to save the key (/home/a/.ssh/id_rsa):

Created directory '/home/a/.ssh'.

Enter passphrase (empty for no passphrase):

Enter same passphrase again:

Your identification has been saved in /home/a/.ssh/id_rsa.

Your public key has been saved in /home/a/.ssh/id_rsa.pub.

The key fingerprint is:

3e:4f:05:79:3a:9f:96:7c:3b:ad:e9:58:37:bc:37:e4 a@A

Now use ssh to create a directory ~/.ssh as user b on B. (The directory may already exist,

which is fine):

a@A:~> ssh b@B mkdir -p .ssh

b@B's password:

Finally append a's new public key to b@B:.ssh/authorized_keys and enter b's password one last

time:

a@A:~> cat .ssh/id_rsa.pub | ssh b@B 'cat >> .ssh/authorized_keys'

b@B's password:

From now on you can log into B as b from A as a without password:

a@A:~> ssh b@B hostname


13. Test that the ring can run a multiprocess job:

mpiexec -n <number> hostname

The number of processes need not match the number of hosts in the ring; if there are more, they

will wrap around. You can see the effect of this by getting rank labels on the stdout:

mpiexec -l -n 30 hostname

You probably didn’t have to give the full pathname of the hostname command because it is in

your path. If not, use the full pathname:

mpiexec -l -n 30 /bin/hostname

14. Now we will run an MPI job, using the mpiexec command as specified

mpiexec -n 5 examples/cpi

The number of processes need not match the number of hosts. The cpi example will tell

you which hosts it is running on. By default, the processes are launched one after the other on the

hosts in the mpd ring, so it is not necessary to specify hosts when running a job with mpiexec.
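For reference, a condensed C sketch in the spirit of the cpi example (not the exact source shipped with MPICH2) is shown below: each process integrates part of 4/(1+x²) over [0,1] and MPI_Reduce combines the partial sums on rank 0.

/* Hypothetical condensed version of the classic cpi example. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int n = 10000, rank, size;
    double h, sum = 0.0, mypi, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* all processes agree on n */
    h = 1.0 / (double)n;
    for (int i = rank + 1; i <= n; i += size) {     /* interleaved strips of the integral */
        double x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    mypi = h * sum;

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi is approximately %.16f\n", pi);

    MPI_Finalize();
    return 0;
}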

Troubleshooting:

It can be rather tricky to configure one or more hosts in such a way that they adequately

support client-server applications like mpd. In particular, each host must not only know its own

name, but must identify itself correctly to other hosts when necessary. Further, certain information must be readily accessible to each host. For example, each host must be able to map another

host’s name to its IP address. In this section, we will walk slowly through a series of steps that

will help to ensure success in running mpds on a single host or on a large cluster.

If you can ssh from each machine to itself, and from each machine to each other machine

in your set (and back), then you probably have an adequate environment for mpd. However,

there may still be problems. For example, if you are blocking all ports except the ports used by

ssh/sshd, then mpd will still fail to operate correctly.

To begin using mpd, the sequence of steps that we recommend is this:


1. get one mpd working alone on a first test node

2. get one mpd working alone on a second test node

3. get two new mpds to work together on the two test nodes

Following the steps

1. Install mpich2, and thus mpd.

2. Make sure the mpich2 bin directory is in your path. Below, we will refer to it as

MPDDIR.

3. Run a first mpd (alone on a first node). As mentioned above, mpd uses client-server

communications to perform its work. So, before running an mpd, let’s run a simpler

program (mpdcheck) to verify that these communications are likely to be successful.

Even on hosts where communications are well supported, sometimes there are problems associated with hostname resolution, etc. So, it is worth the effort to proceed a bit

slowly. Below, we assume that you have installed mpd and have it in your path.

Select a test node, let's call it n1. Log in to n1. First, we will run 'mpdcheck' as a server

and a client. To run it as a server, get into a window with a command-line and run this:

n1 $ mpdcheck -s

server listening at INADDR_ANY on: n1 1234

Now, run the client side (in another window if convenient) and see if it can find the server and

communicate. Be sure to use the same hostname and port number printed by the server (above:

n1 1234):

n1 $ mpdcheck -c n1 1234

server has conn on

<socket._socketobject object at 0x40200f2c>

from (’192.168.1.1’, 1234)

server successfully recvd msg from client:

hello_from_client_to_server

client successfully recvd ack from server:


ack_from_server_to_client

If the experiment failed, you have some network or machine configuration problem

which will also be a problem later when you try to use mpd.

If the experiment succeeded, then you should be ready to try mpd on this one host. To

start an mpd, you will use the mpd command. To run parallel programs, you will use the mpiexec

program. All mpd commands accept the -h or --help arguments, e.g.:

n1 $ mpd --help

n1 $ mpiexec --help

Try a few tests:

n1 $ mpd &

n1 $ mpiexec -n 1 /bin/hostname

n1 $ mpiexec -l -n 4 /bin/hostname

n1 $ mpiexec -n 2 PATH_TO_MPICH2_EXAMPLES/cpi

where PATH_TO_MPICH2_EXAMPLES is the path to the mpich2-1.0.3/examples directory.

To terminate the mpd:

n1 $ mpdallexit

Run a second mpd (alone on a second node). To verify that things are fine on a second

host (say n2 ), login to n2 and perform the same set of tests that you did on n1. Make sure that

you use mpdallexit to terminate the mpd so you will be ready for further tests.

Run a ring of two mpds on two hosts. Before running a ring of mpds on n1 and n2, we

will again use mpdcheck, but this time between the two machines. We do this because the two

nodes may have trouble locating each other or communicating between them, and it is easier to check this out with the smaller program.

First, we will make sure that a server on n1 can service a client from n2. On n1:

n1 $ mpdcheck -s

which will print a hostname (hopefully n1) and a port number (say 3333 here). On n2:


n2 $ mpdcheck -c n1 3333

Second, we will make sure that a server on n2 can service a client from n1. On n2:

n2 $ mpdcheck -s

which will print a hostname (hopefully n2) and a port number (say 7777 here). On n1:

n1 $ mpdcheck -c n2 7777

If both checks succeed, start an mpd on n1 (n1 $ mpd &) and run n1 $ mpdtrace -l, which prints the local hostname, a listening port (say 6789 here), and an IP address in parentheses. The 6789 is the port that the mpd is listening on for connections from other mpds wishing to enter the ring. We will use that port in a moment to get an mpd from n2 into the ring. The value in parentheses should be the IP address of n1.

On n2:

n2 $ mpd -h n1 -p 6789 &

where 6789 is the listening port on n1 (from mpdtrace above). Now try:

n2 $ mpdtrace -l

You should see both mpds in the ring. To run some programs in parallel:

n1 $ mpiexec -n 2 /bin/hostname

n1 $ mpiexec -n 4 /bin/hostname

n1 $ mpiexec -l -n 4 /bin/hostname

n1 $ mpiexec -l -n 4 PATH_TO_MPICH2_EXAMPLES/cpi

To bring down the ring of mpds:

n1 $ mpdallexit

If the output from any of mpdcheck, mpd, or mpdboot leads you to believe that one or

more of your hosts are having trouble communicating due to firewall issues, we can offer a few

simple suggestions. If the problems are due to an “enterprise” firewall computer, then we can

only point you to your local network admin for assistance. In other cases, there are a few quick

things that you can try to see if there are some common protections in place which may be causing your problems. Deactivate all firewalls in the running services window.


2.3. OpenFOAM

System requirements:

OpenFOAM is developed and tested on Linux, but should work with other POSIX systems. To check your system setup, execute the foamSystemCheck script in the bin/ directory of

the OpenFOAM installation.

Here is the output you should get:

[open@sham OpenFOAM-1.6]$ foamSystemCheck

Checking basic system...

-----------------------------------------------------------------------

Shell: /bin/bash

Host: sham.globus

OS: Linux version 2.6.27.5-117.fc10.i686

User: open

System check: PASS

==================

Continue OpenFOAM installation.

Installation:

Download and unpack the files in the $HOME/OpenFOAM directory as described in:

http://www.OpenFOAM.org/download.html

The environment variable settings are contained in files in an etc/ directory in the OpenFOAM

release. e.g. in

$HOME/OpenFOAM/OpenFOAM-1.6/etc/

Source the etc/bashrc file by adding the following line to the end of your $HOME/.bashrc file:


. $HOME/OpenFOAM/OpenFOAM-1.6/etc/bashrc

Then update the environment variables by sourcing the $HOME/.bashrc file by typing in the terminal:

. $HOME/.bashrc

Testing the installation:

To check your installation setup, execute the 'foamInstallationTest' script (in the bin/ directory of the OpenFOAM installation). If no problems are reported, proceed to getting started with OpenFOAM; otherwise, go back and check that you have installed the software correctly.

Getting Started

Create a project directory within the $HOME/OpenFOAM directory named <USER>-1.6 (e.g.

'chris-1.6' for user chris and OpenFOAM version 1.6) and create a directory named 'run' within

it, e.g. by typing:

mkdir -p $FOAM_RUN

Copy the 'tutorial' examples directory in the OpenFOAM distribution to the 'run' directory. If the

OpenFOAM environment variables are set correctly, then the following command will be correct:

+ cp -r $WM_PROJECT_DIR/tutorials $FOAM_RUN

Run the first example case of incompressible laminar flow in a cavity:

+ cd $FOAM_RUN/tutorials/incompressible/icoFoam/cavity

+ blockMesh

+ icoFoam

+ paraFoam


CHAPTER 3

CASE STUDIES

3.1. Case Study 1: Dense Matrix Multiplication

One way to implement the matrix multiplication algorithm is to allocate one processor to compute each row of the resultant matrix C; matrix B and one row of elements of A are needed by each processor. Using a master-slave approach, these elements can be sent from the master processor to the selected slave processors. Results are then collected back from each of the slaves and displayed by the master.

Steps taken to parallelize:

1) An MPI program for matrix multiplication is written using MPICH2.

2) MPD must be running on all the nodes.

Start the daemons "by hand" as follows:

mpd & # starts the local daemon

mpdtrace -l # makes the local daemon print its host

# and port in the form <host>_<port>

Then log into each of the other machines, put the install/bin directory in your path, and do:

mpd -h <hostname> -p <port> &

where the hostname and port belong to the original mpd that was started. From each

machine, after starting the mpd, mpdtrace is used to see which machines are in the ring.

3) The execution command is given in the master node.

MPI job is run using mpiexec command.

mpiexec -n <number of processes> <executable>

mpiexec -n 5 ./cpi

4) The system monitor is checked to verify that all the nodes are being utilized.


5) The partial result from each slave is given back to the master node.

Implementing:

There exist many ways of implementing matrix multiplication, and finding efficient code for it remains a long-standing challenge for the programming community. A sequential code is written using the ordinary matrix multiplication algorithm.

/* ordinary (naive) sequential matrix multiplication: mult = m1 * m2 */
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        mult[i][j] = 0;
        for (k = 0; k < n; k++)
            mult[i][j] += m1[i][k] * m2[k][j];
    }

This algorithm requires n³ multiplications and n³ additions, leading to a sequential time complexity of O(n³). Parallel matrix multiplication is usually based upon this direct sequential algorithm. Even a superficial look at the sequential code reveals that the computation in each iteration of the outer two loops does not depend on any other iteration, so each instance of the inner loop could be executed in parallel. Theoretically, with p = n² processors (one per element of the result), we can expect a parallel time complexity of O(n), and this is easily obtainable.

Direct implementation:

One way to implement the matrix multiplication algorithm is to allocate one processor to compute each column of the resultant matrix C; matrix A and one column of elements of B are needed by each processor. Using a master-slave approach, these elements can be sent from the master processor to the selected slave processors. Results are then collected back from each of the slaves and displayed by the master.
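A minimal sketch of the row-oriented master-slave scheme described at the start of this section is given below (a hypothetical illustration, not the exact code used for the measurements that follow): matrix B is broadcast to every process, the rows of A are scattered, each process computes its block of rows of C = A*B, and the blocks are gathered back on the master. It assumes N is divisible by the number of processes.

/* Hypothetical MPI sketch of master-slave dense matrix multiplication.
 * Compile: mpicc matmul.c -o matmul   Run: mpiexec -n 4 ./matmul */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000                            /* matrix dimension (assumed) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;                  /* rows of A handled by each process */
    double *A = NULL, *C = NULL;
    double *B = malloc((size_t)N * N * sizeof(double));
    double *Apart = malloc((size_t)rows * N * sizeof(double));
    double *Cpart = malloc((size_t)rows * N * sizeof(double));

    if (rank == 0) {                      /* master initializes the full matrices */
        A = malloc((size_t)N * N * sizeof(double));
        C = malloc((size_t)N * N * sizeof(double));
        for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
    }

    double t0 = MPI_Wtime();
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(A, rows * N, MPI_DOUBLE, Apart, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int i = 0; i < rows; i++)        /* each process computes its block of C */
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += Apart[i * N + k] * B[k * N + j];
            Cpart[i * N + j] = s;
        }

    MPI_Gather(Cpart, rows * N, MPI_DOUBLE, C, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("C[0][0] = %f, elapsed = %f s\n", C[0], MPI_Wtime() - t0);

    MPI_Finalize();
    return 0;
}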


Observations:

Observation of the Performance of Matrix Multiplication Using MPICH2:

Matrix Dimension    No. of cores    Time (seconds)

1000x1000           1                13.90
1000x1000           2                 8.33
1000x1000           3                 5.86
1000x1000           4                 5.07
1000x1000           5                 6.89

2000x2000           1               108.17
2000x2000           2                64.67
2000x2000           3                45.56
2000x2000           4                36.54
2000x2000           5                51.62

3000x3000           1               392.90
3000x3000           2               220.19
3000x3000           3               156.81
3000x3000           4               123.36
3000x3000           5               180.29

The graph below shows the experimentally observed execution times for the different matrix dimensions plotted against the number of processes.
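As a worked check on the tables above, the speedup on p cores is S_p = T_1 / T_p and the efficiency is E_p = S_p / p. For the 3000x3000 case, S_4 = 392.9 / 123.36 ≈ 3.19 and E_4 ≈ 0.80, while S_5 = 392.9 / 180.29 ≈ 2.18, which quantifies the performance drop discussed in the remarks below.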


Remarks:

1. Execution time starts to increase once the number of processes spawned exceeds a certain critical number.

2. The time spent in the communication described above sometimes adds overhead.

3. If the work is carefully divided among a number of processes that matches the machines and cores available, performance improves.

4. For example, with 2 machines of 2 cores each, performance increases as the number of processes grows from 1 to 4.

5. If the work is not divided according to the number of machines and cores available, performance decreases slightly.

6. For example, with 2 machines of 2 cores each, performance increases for 1 to 4 processes, but dividing the work among 5 processes slightly decreases performance, as shown in the graph.

7. The program should therefore be divided among a number of processes chosen according to the machines and cores available in order to achieve high performance.


3.2 Case Study 2: Computational Fluid Dynamics

1. The problem involves the interaction of heat conduction within a solid body with a fluid flowing over its surface.

2. Applications include the thermal design of a fuel element of a nuclear reactor.

3. Software was developed that deals with:

the study of the conjugate heat transfer problem associated with a rectangular nuclear fuel element washed by an upward-moving coolant;

a stream function-vorticity formulation;

the equations governing the steady, two-dimensional flow and thermal fields in the coolant, solved simultaneously with the steady, two-dimensional heat conduction equation in the solid.


Pre-analysis - Profiling:

gprof, GNU's open-source profiler, was used to profile the code. The output of gprof, which includes a flat profile and a call graph, was also used for code comprehension.
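As a sketch of the commands involved (the source file name cfd_solver.c below is assumed for illustration; the actual build procedure of the CFD code is not reproduced in this report), profiling with gprof consists of compiling and linking with the -pg flag, running the program once to produce gmon.out, and then passing the executable and gmon.out to gprof:

gcc -pg -o cfd_solver cfd_solver.c

./cfd_solver                              # writes gmon.out in the working directory

gprof cfd_solver gmon.out > profile.txt   # contains the flat profile and call graph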


Flat profile:


Call graph:

The following observations were made.

1. Values for the different input parameters were hardcoded into the code. Each change of

parameter value necessitated a recompilation of the code.


2. For the given set of parameters present in the code, the observed run time was about 23 minutes. The run time could get much larger (ranging from a few hours to days) with changed parameters. The large run time discouraged experimenting with a range of values for the computational grid that might have produced results with a finer resolution. Additionally, the range of parameter values that might compute results giving more insight into the physics of the problem could not be studied for the same reasons.

3. Multiple output files were used and data was being written to a large set of output files. File handling could have been done more efficiently to positively impact the total execution time.


4. As the program was executed serially, the execution of a few functions was delayed even though the parameters or values required to compute those functions were already available. A similar case was that of functions being called only after the complete execution of loops, although no data or control dependencies existed between the loop and the functions.

5. There was an excessive and sometimes unnecessary use of global variables.


6. Some loops were identified that could have been combined to bring down the size of the code (a generic sketch of such loop fusion follows this list).


7. None of the functions took any input parameters, nor did they return any values; global variables were used as a replacement for both.

8. Some of the defined functions exhibit nearly identical functionality, differing only in a few statements.
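As a generic illustration of observation 6 (the arrays a, b, c, d and the bound n are invented for this sketch and do not come from the CFD code), two adjacent loops over the same index range can be fused into a single pass when fusing them does not reorder any dependent accesses:

/* before: two separate passes over the same index range */
for (i = 0; i < n; i++) a[i] = b[i] + c[i];
for (i = 0; i < n; i++) d[i] = a[i] * 2.0;

/* after: a single fused loop; a[i] is still computed before it is used */
for (i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
    d[i] = a[i] * 2.0;
}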


Analyzing Data Dependencies:

The call graph was used to understand the initial working of the code, after which the data dependencies at the coarse, function-level granularity were analyzed. More specifically, the following data dependencies were identified.

1. Flow dependence - if variables modified in one function are passed to (used by) a subsequently called function, the execution order of the two functions must be preserved.

2. Anti-dependence - if a variable changed in one function is used by a previously called function, the order of these functions cannot be interchanged.

3. Output dependence - if two functions produce or write to the same output variable, they are said to be output dependent and their order cannot be changed.

4. I/O dependence - this dependence between two functions occurs when a file is read and written by both of these functions.
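The following toy fragment illustrates the first three dependence types in C (the function and variable names are invented for this sketch; they are not taken from the CFD code) and shows how such dependences arise when functions communicate through globals:

#include <stdio.h>

double t;       /* global temporary shared by the functions below */
double result;  /* global output                                  */

void compute_flux(void)  { t = 6.0; }           /* writes t                           */
void update_field(void)  { result = t + 1.0; }  /* reads t: flow dependence on
                                                   compute_flux                       */
void reset_temp(void)    { t = 0.0; }           /* rewrites t: anti-dependence with
                                                   update_field                       */
void write_result(void)  { result = -1.0; }     /* rewrites result: output dependence
                                                   with update_field                  */

int main(void)
{
    /* the only valid order: update_field must follow compute_flux, precede
       reset_temp, and keep its position relative to write_result            */
    compute_flux();
    update_field();
    reset_temp();
    write_result();
    printf("%f\n", result);
    return 0;
}

Only functions with no such dependences between them are candidates for parallel execution.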

Pre-parallelization Exercises:

Based on the analysis, it was noted that the following need to be completed before the start of the parallelization step:

1. The code needs to be changed to read in parameters from the command prompt or from an externally available input file. This would allow the code to be executed unchanged for different values of the parameters without the need for a recompilation (a minimal sketch of such parameter reading follows this list).


2. Reduction in the number of output files. If the files are genuinely required, the output sequence needs to be analyzed further; otherwise the data written to these files can be combined into a reduced set.

3. Functions identified as having no data dependencies between them are good candidates for parallel execution. Their execution time profiles and the computation-to-communication ratio need to be studied further to see whether parallelization will indeed produce a speedup.

4. The code needs to be rewritten to reduce the usage of global variables. This may involve changing all or most of the function signatures to read in input parameters and return results. This exercise may also involve the creation of more efficient data structures for parameter passing between functions.

5. Many functions can be eliminated by rewriting them to combine the functionality of two or more functions. This would considerably reduce the code size and result in more compact and better written code. However, code repeated in different places offers the advantage that it can be customized to the particular part of the program in which it lies; combining similar portions of code into a single generalized function, while offering other advantages, removes this flexibility. The trade-off needs to be deliberated before performing this exercise.

6. Loops that are temporally close need to be studied along with their indices to see whether they can be successfully combined. In addition to reducing the code size, this would reduce the effort of parallelization, as only a single loop then needs to be analyzed.
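As a minimal sketch of exercise 1 (the parameter names, the file format, and the file name input.dat are assumed for illustration and are not taken from the CFD code), the hardcoded values could be replaced by a small routine that reads them from an external file named on the command line:

#include <stdio.h>

/* assumed parameter set; the real code would have its own list */
static double reynolds, dx;
static int    nx, ny;

static int read_params(const char *fname)
{
    FILE *fp = fopen(fname, "r");
    if (!fp) return -1;
    /* expects one line of the form: Re dx nx ny, e.g. "100.0 0.01 200 100" */
    int n = fscanf(fp, "%lf %lf %d %d", &reynolds, &dx, &nx, &ny);
    fclose(fp);
    return (n == 4) ? 0 : -1;
}

int main(int argc, char *argv[])
{
    const char *fname = (argc > 1) ? argv[1] : "input.dat";
    if (read_params(fname) != 0) {
        fprintf(stderr, "could not read parameters from %s\n", fname);
        return 1;
    }
    /* ... the rest of the solver would use reynolds, dx, nx and ny here ... */
    return 0;
}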

Based on the identified dependencies, the code was statically restructured for a theoretical execution on a multiprocessor system. Initial analysis indicated a reduction of the execution time to about 13 minutes, indicating a theoretical speedup of 1.7, ignoring the communication overhead.


Flowchart

Flowchart for order of execution of program

The above flowchart shows the execution flow of the functions in the given CFD case problem. There are 31 functions in total, some of which are dependent and some independent. A single rectangular box with a number inside indicates the number of functions that have to be executed sequentially; a double rectangular box with a number inside indicates the number of functions that can be executed independently.


The problem took over 15 minutes to execute sequentially. After analyzing which parts of the code could be parallelized, a theoretical speedup of 1.71 was obtained by taking the maximum time among the functions that can be executed in parallel.

3.3 Case Study 3: OpenFOAM

bubbleFoam is one of the many cases in the OpenFOAM application, and it is the one we study here. Before starting with this case, some modification is needed. First, to generate the profiler output, the -pg option has to be added to the C and C++ compiler rules located in the following directory:

/home/open/OpenFOAM/OpenFOAM-1.6/wmake/rules/linuxGcc

After these modifications, the bubbleFoam solver needs to be recompiled by running the wmake command in the following directory:

/home/open/OpenFOAM/OpenFOAM-1.6/applications/solvers/multiphase/bubbleFoam
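For example, recompiling the solver after editing the compiler rules amounts to the following (paths as given above):

+ cd /home/open/OpenFOAM/OpenFOAM-1.6/applications/solvers/multiphase/bubbleFoam

+ wmake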

The case is located in

/home/open/OpenFOAM/OpenFOAM-1.6/tutorials/multiphase/bubbleFoam/bubbleColumn

+ blockMesh

+ bubbleFoam

The case executes in about 8 minutes.

To reduce the execution time, some observations were needed, so the gprof profiler was used to profile the case.

From the profile graph we observed that the functions


H() - which is located in the fvMatrix.C file - and

Foam::tmp<Foam::fvMatrix<Foam::Vector<double> > > Foam::fvm::div<Foam::Vector<double> >(Foam::GeometricField<double, Foam::fvsPatchField, Foam::surfaceMesh> const&, Foam::GeometricField<Foam::Vector<double>, Foam::fvPatchField, Foam::volMesh>&)

were consuming the most time. H() was called about 40,000 times, and the other function accounted for the larger share of the run time.

Running in Parallel:

This case was also run in parallel, with the mesh decomposed into 4 different parts. There is a dictionary associated with decomposePar, named decomposeParDict, which is located in the system directory of the tutorial case; also, as with many utilities, a default dictionary can be found in the directory of the source code of the specific utility, i.e. in

$FOAM_UTILITIES/parallelProcessing/decomposePar

The first entry is numberOfSubdomains, which specifies the number of subdomains into which the case will be decomposed, usually corresponding to the number of processors available for the case. The method of decomposition should be simple, and the corresponding simpleCoeffs should be edited according to the following criteria. The domain is split into pieces, or subdomains, in the x, y and z directions, the number of subdomains in each direction being given by the vector n. As this geometry is 2-dimensional, the 3rd direction, z, cannot be split, hence nz must equal 1. The nx and ny components of n split the domain in the x and y directions and must be specified so that the number of subdomains given by nx and ny equals the specified numberOfSubdomains, i.e. nx * ny = numberOfSubdomains. It is beneficial to keep the number of cell faces adjoining the subdomains to a minimum, so, for a square geometry, it is best to keep the split between the x and y directions fairly even. The delta keyword should be set to 0.001.
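As an illustrative sketch (only the entries discussed here are shown; a real decomposeParDict also contains the usual OpenFOAM file header and further optional keywords), a 2 x 2 x 1 split over four subdomains would be specified as:

numberOfSubdomains 4;

method          simple;

simpleCoeffs
{
    n           (2 2 1);
    delta       0.001;
}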

For example, let us assume we wish to run on 4 processors. We would set numberOfSubdomains to 4 and n = (2, 2, 1), as in the sketch above. When running decomposePar, we can see from the screen messages that the decomposition is distributed fairly evenly between the processors. The user has a choice of four methods of decomposition, specified by the method keyword as described below.

simple: simple geometric decomposition, in which the domain is split into pieces by direction, e.g. 2 pieces in the x direction, 1 in y, etc.

hierarchical: hierarchical geometric decomposition, which is the same as simple except that the user specifies the order in which the directional split is done, e.g. first in the y-direction, then the x-direction, etc.

manual: manual decomposition, where the user directly specifies the allocation of each cell to a particular processor.

For each method there is a set of coefficients specified in a sub-dictionary of decompositionDict, named <method>Coeffs, as shown in the dictionary listing. The decomposePar utility is executed in the normal manner by typing


decomposePar

On completion, a set of subdirectories will have been created in the case directory, one for each processor. The directories are named processorN, where N = 0, 1, ... represents a processor number, and each contains a time directory, containing the decomposed field descriptions, and a constant/polyMesh directory containing the decomposed mesh description.
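For the four-subdomain example used here, the case directory would therefore contain a structure of the following form (only the directories mentioned above are shown; the time directory is 0/ for a case starting at time zero):

<case>/
    processor0/
        0/                    decomposed fields for the start time
        constant/polyMesh/    decomposed mesh
    processor1/
    processor2/
    processor3/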

Running a decomposed case:

A decomposed OpenFOAM case is run in parallel using the openMPI implementation of MPI. openMPI can be run on a local multiprocessor machine very simply, but when running on machines across a network, a file must be created that contains the host names of the machines. The file can be given any name and located at any path. In the following description we shall refer to such a file by the generic name, including full path, <machines>.

The <machines> file contains the names of the machines, listed one machine per line. The names must correspond to a fully resolved hostname in the /etc/hosts file of the machine on which openMPI is run. The list must contain the name of the machine running openMPI. Where a machine node contains more than one processor, the node name may be followed by the entry cpu=n, where n is the number of processors openMPI should run on that node. For example, let us imagine a user wishes to run openMPI from machine machine1 on the following machines: machine1; machine2, which has 2 processors; and machine3.

The <machines> file would contain:

machine1

machine2 cpu=2

machine3

An application is run in parallel using mpirun:

mpirun --hostfile <machines> -np <nProcs> <foamExec> <otherArgs> -parallel > log &


where <nProcs> is the number of processors, <foamExec> is the executable, e.g. icoFoam, and the output is redirected to a file named log.

For example, if icoFoam is run on 4 nodes, specified in a file named machines, on the cavity tutorial in the $FOAM_RUN/tutorials/incompressible/icoFoam directory, then the following command should be executed:

mpirun --hostfile machines -np 4 icoFoam -parallel > log &

The solver ran for about 991 seconds when run with the 4 different meshes.


Chapter 4

Conclusion & Future work

The parallelization process is not easy, as it requires the application to be studied thoroughly. Many applications are written from a sequential execution point of view, which makes them easy to write and test. Porting matrix multiplication to the MPI cluster helped us in studying and understanding how to parallelize a sequential program, as it was used as a benchmark.

The CFD code for conjugate heat transfer could not be ported to the MPI or grid cluster because the program is poorly structured. The most efficient way to parallelize it would be to rewrite the code. However, it was observed that optimizing the code by eliminating dependencies, removing unnecessary references, and a bit of smart programming can produce better performance.

OpenFOAM is written in a highly object-oriented style, which is very difficult to understand. It is one of the older codes and has undergone continuous enhancement, with a separate group involved in writing it. The code is standard and optimized, and hence poses problems in the parallelization process.

Future work can be carried out to generalize the process of parallelization: port the bubbleFoam case to the MPI cluster for generating a single mesh, study how to port applications to a grid cluster, and port the OpenFOAM application to the same.

A thorough study of the OpenFOAM code will be the next major advancement in parallelizing this application. We tried to parallelize some part of the case and ran into errors. Even though the case was not running properly after being modified to run in parallel, we achieved parallelization of some part of that case. A little more time spent could produce the desirable results.


Chapter 5

References

[1] Joseph D. Sloan. High Performance Linux Clusters with OSCAR, Rocks, openMosix & MPI.

[2] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.

[3] Luis Ferreira, Viktors Berstis, Jonathan Armstrong, Mike Kendzierski, Andreas Neukoetter, Masanobu Takagi, Richa Bing-Wo, Adeeb Amir, Ryo Murakawa, Olegario Hernandez, James Magowan, and Norbert Bieberstein. Introduction to Grid Computing with Globus. IBM Redbooks. www.redbooks.ibm.com/redbooks/pdfs/sg246895.pdf

[4] Waseem Ahmed, Ramis M. K., Shamsheer Ahmed, Suma Bhat, and Mohammed Isham. Pre-Parallelization Exercises in Budget-Constrained HPC Projects: A Case Study in CFD. P. A. College of Engineering, Mangalore, India.

[5] www.annauniv.edu/care/soft.htm

[6] www.mcs.anl.gov/mpi/mpich1

[7] www.openfoam.com/docs

[8] www.nus.edu.sg/demo2a.html

[9] www.linuxproblem.org/art_9.html
