Achieving High Performance Computing
CHAPTER 1
INTRODUCTION
1.1. Parallel Programming Paradigm
In the 1980s it was believed computer performance was best improved by creating faster
and more efficient processors. This idea was challenged by parallel processing, which in essence
means linking together two or more computers to jointly solve a computational problem. Since
the early 1990s there has been an increasing trend to move away from expensive and specialized
proprietary parallel supercomputers (vector-supercomputers and massively parallel processors)
towards networks of computers (PCs, workstations, SMPs). Among the driving forces that have
enabled this transition has been the rapid improvement in the availability of commodity high-
performance components for PCs, workstations, and networks. These technologies are making a
network cluster of computers an appealing vehicle for cost-effective parallel processing and this
is consequently leading to low-cost commodity supercomputing.
Scalable computing clusters, ranging from a cluster of (homogeneous or heterogeneous)
PCs or workstations, to SMPs, are rapidly becoming the standard platforms for high-
performance and large-scale computing. The main attractiveness of such systems is that they are
built using affordable, low-cost commodity hardware (such as Pentium PCs), fast LANs, and
standard software components such as UNIX and MPI. These systems are scalable, i.e., they can
be tuned to available budget and computational needs and allow efficient execution of both
demanding sequential and parallel applications.
1.2. Overview
We intend to present some of the main motivations for the widespread use of clusters in
high-performance parallel computing. In the next section, we discuss a generic architecture of a
cluster computer and grid computer and the rest of the chapter focuses on message passing
interface, strategies for writing parallel programs, and the two main approaches to parallelism
(implicit and explicit). We briefly summarize the whole spectrum of choices to exploit parallel
processing: message-passing libraries, distributed shared memory, object-oriented
programming. However, the main focus of this chapter is identifying and introducing
parallel programming paradigms in existing applications such as OpenFOAM. This approach
presents some interesting advantages, for example, the reuse of code, higher flexibility, and the
increased productivity of the parallel program developer.
1.3. Grid Network
Grid networking services are best presented within the context of the Grid and its
architectural principles. The Grid is a flexible, distributed, information technology environment
that enables multiple services to be created with a significant degree of independence from the
specific attributes of underlying support infrastructure. Advanced architectural infrastructure
design increasingly revolves around the creation and delivery of multiple ubiquitous digital
services. A major goal of information technology designers is to provide an environment within
which it is possible to present any form of information on any device at any location. The Grid
is an infrastructure that highly complements the era of ubiquitous digital information and
services.
These environments are designed to support services not as discrete infrastructure
components, but as modular resources that can be integrated into specialized blends of
capabilities to create multiple additional, highly customizable services. The Grid also allows
such services to be designed and implemented by diverse, distributed communities,
independently of centralized processes. Grid architecture represents an innovation that is
advancing efforts to achieve these goals.
Early Grid infrastructure was developed to support data and compute intensive science
projects. For example, the high-energy physics community was an early adopter of Grid
technology. This community must acquire extremely high volumes of data from specialized
instruments at key locations in different countries. They must gather, distribute, and analyze
those large volumes of data as a collaborative initiative with thousands of colleagues around the
world.
1.4. Message Passing Interface Cluster
Message passing libraries allow efficient parallel programs to be written for distributed
memory systems. These libraries provide routines to initiate and configure the messaging
environment as well as sending and receiving packets of data. Currently, the most popular
high-level message-passing system for scientific and engineering applications is MPI (the Message
Passing Interface), defined by the MPI Forum.
Currently, there are several implementations of MPI, including versions for networks of
workstations, clusters of personal computers, distributed-memory multiprocessors, and shared-
memory machines. Almost every hardware vendor is supporting MPI. This gives the user a
comfortable feeling since an MPI program can be executed on almost all of the existing
computing platforms without the need to rewrite the program from scratch. The goals of
portability and architecture and network transparency have been achieved with low-level
communication libraries like MPI. These libraries provide an interface for C and Fortran, and
additional support for graphical tools.
However, these message-passing systems are still stigmatized as low-level because most
tasks of the parallelization are still left to the application programmer. When writing parallel
applications using message passing, the programmer still has to develop a significant amount of
software to manage some of the tasks of the parallelization, such as: the communication and
synchronization between processes, data partitioning and distribution, mapping of processes
onto processors, and input/output of data structures. If the application programmer has no
special support for these tasks, it then becomes difficult to widely exploit parallel computing.
The easy-to-use goal is not accomplished with a bare message-passing system, and hence
requires additional support.
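As a brief illustration of this programming model (a minimal sketch written for this report, not code taken from any particular library or from the applications discussed later), the following C program initialises the messaging environment, determines its own rank, and passes a single message between two processes:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, value = 0;

    MPI_Init(&argc, &argv);                     /* initialise the messaging environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* which process am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);       /* how many processes?  */

    if (rank == 0 && size > 1) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* send one int to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 of %d received %d from rank 0\n", size, value);
    }

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with mpiexec (as described in Chapter 2), the same source runs unchanged on networks of workstations, clusters of PCs, or shared-memory machines.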
1.5. OpenFOAM
The OpenFOAM® (Open Field Operation and Manipulation) CFD Toolbox is a free,
open source CFD software package produced by a commercial company, OpenCFD Ltd. It has a
large user base across most areas of engineering and science, from both commercial and
academic organisations. OpenFOAM has an extensive range of features to solve anything from
complex fluid flows involving chemical reactions, turbulence and heat transfer, to solid
dynamics and electromagnetics.
The core technology of OpenFOAM is a flexible set of efficient C++ modules. These are
used to build a wealth of: solvers, to simulate specific problems in engineering mechanics;
utilities, to perform pre- and post-processing tasks ranging from simple data manipulations to
visualisation and mesh processing; libraries, to create toolboxes that are accessible to the
solvers/utilities, such as libraries of physical models.
OpenFOAM is supplied with numerous pre-configured solvers, utilities and libraries and
so can be used like any typical simulation package. However, it is open, not only in terms of
source code, but also in its structure and hierarchical design, so that its solvers, utilities and
libraries are fully extensible. OpenFOAM uses finite volume numerics to solve systems of partial
differential equations ascribed on any 3D unstructured mesh of polyhedral cells. The fluid flow
solvers are developed within a robust, implicit, pressure-velocity, iterative solution framework,
although alternative techniques are applied to other continuum mechanics solvers.
One of the strengths of OpenFOAM is that new solvers and utilities can be created by its
users with some pre-requisite knowledge of the underlying method, physics and programming
techniques involved. OpenFOAM is supplied with pre- and post-processing environments. The
interface to the pre- and post-processing are themselves OpenFOAM utilities, thereby ensuring
consistent data handling across all environments. The overall structure of OpenFOAM is shown
in Figure 1.1.
1.6. Case Studies
1.6.1. Dense Matrix
Dense matrix multiplication is a core operation in scientific computing, and has been a
topic of interest for computer scientists for over forty years. Theoretical computer scientists have
refined the time bounds of the problem, and the focus of implementations has shifted from the
serial to the parallel computing model.
The lower bound of the problem of multiplying two square dense matrices of size N by N
(henceforth referred to as matrix multiplication) has been known for some time to be Ω(N²), as
every scalar element of the matrices must be examined. Until 1968, no improvements to the
naive algorithm were known.
1.6.2. Computational Fluid Dynamics
One of the CFD (Computational Fluid Dynamics) code developed for the Conjugate Heat
Transfer problem is used for the case study. The term Conjugate Heat Transfer refers to a heat
transfer process involving an interaction of heat conduction within a solid body with free,
forced, or mixed convection from its surface to a fluid (or to its surface from a fluid)
flowing over it. An accurate analysis of such heat transfer problems necessitates the coupling of
the problem of conduction in the solid with that of convection in the fluid by satisfying the
conditions of continuity in temperature and heat flux at the solid–fluid interface.
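Written out (a standard statement of these matching conditions, not taken from the code used in the later case study), with T_s and T_f the solid and fluid temperatures, k_s and k_f the thermal conductivities, and n the direction normal to the interface:

T_s = T_f   and   k_s (∂T_s/∂n) = k_f (∂T_f/∂n)   at the solid-fluid interface.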
There are many engineering and practical applications in which conjugate heat transfer
occurs. One such area of application is in the thermal design of a fuel element of a nuclear
reactor. The energy released due to fission in the fuel element is first conducted to its lateral
surface, which in turn is dissipated to the coolant flowing over it so as to maintain the
temperature anywhere in the fuel element well within its allowable limit. If this energy generated
is not removed fast enough, the fuel elements and other components may heat up so much that
eventually a part of the core may melt. In fact, the limit to the power at which a reactor can be
operated is set by the heat transfer capacity of the coolant. Therefore, the knowledge of the
temperature field in the fuel element and the flow and thermal fields in the coolant is needed in
order to predict its thermal performance.
1.6.3. OpenFOAM
Fluid dynamics is a field of science which studies the physical laws governing the flow of
fluids under various conditions. Great effort has gone into understanding the governing laws and
the nature of fluids themselves, resulting in a complex yet theoretically strong field of research.
CHAPTER 2
TESTBED SETUP
2.1. Globus Toolkit
Globus is a community of users and developers who collaborate on the use and
development of open source software, and associated documentation, for distributed computing
and resource federation.
The middleware software itself, the Globus Toolkit, is a set of libraries and programs
that address common problems that occur when building distributed system services and
applications. It is also the infrastructure that supports this community: code repositories, email
lists, a problem tracking system, and so forth, all accessible at globus.org.
The software itself provides a variety of components and capabilities, including
the following:
− A set of service implementations focused on infrastructure management.
− Tools for building new Web services, in Java, C, and Python.
− A powerful standards-based security infrastructure.
− Both client APIs (in different languages) and command line programs for accessing
these various services and capabilities.
− Detailed documentation on these various components, their interfaces, and how they
can be used to build applications.
GT4 makes extensive use of Web services mechanisms to define its interfaces and
structure its components. Web services provide flexible, extensible, and widely adopted XML-
based mechanisms for describing, discovering, and invoking network services; in addition, their
document-oriented protocols are well suited to the loosely coupled interactions that many argue
are preferable for robust distributed systems. These mechanisms facilitate the development of
service-oriented architectures: systems and applications structured as communicating services, in
which service interfaces are described, operations invoked, access secured, etc., all in uniform
ways.
Figure 2 illustrates various aspects of GT4 architecture.
2.1.1. Prerequisites
The following packages need to be pre-installed on the system:
- jdk-1_5_0_03-linux-i586.bin
- apache-ant-1.6.4-bin.tar
- gt4.2.0-all-source-installer.tar.gz
Java Installation:
[root@pace~]# cd /usr/local/
[root@pace local]#rpm -q zlib-devel
[root@pace local]#./jdk-1_5_0_03-linux-i586.bin
[root@pace local]# vi /etc/profile
add the following lines...
#GRID ENVIRONMENT VARIABLE SETTINGS....
JAVA_HOME=/usr/local/jdk1.5.0_03
PATH=$JAVA_HOME/bin:$PATH
CLASSPATH=$CLASSPATH:$JAVA_HOME/lib/tools.jar
export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE INPUTRC
CLASSPATH JAVA_HOME
Note: Most Linux distributions ship with a built-in Java, but Globus needs the Java from Sun, so check the
vendor of the installed Java.
Apache Ant Installation:
[root@pace]# tar -xvf /home/vkuser/gt4/software/apache-ant-1.6.4-bin.tar
[root@pace]# mv apache-ant-1.6.4 ant-1.6.4
[root@pace ant-1.6.4]# vi /etc/profile
add the following lines...
ANT_HOME=/usr/local/ant-1.6.4
PATH=$ANT_HOME/bin:$JAVA_HOME/bin:$PATH
Note:
Ant and Java are needed for compiling the Globus source code. We used Fedora 10, which had all the
requirements, and installed Ant and Java. The installation steps may differ according to the
version; refer to the installation guide in each package.
Globus installation
create the Globus user account
[root@pace]#adduser globus
[root@pace]#passwd xxxxxx
copy the file gt4.2.0-all-source-installer.tar.gz to /usr/local and untar it.
[root@pace]$ tar xzf gt4.2.0-all-source-installer.tar.gz
Configure, compile and change the ownership to globus user and change the permissions
[root@pace]#chown globus:globus gt4.2.0-all-source-installer.tar.gz
Now a directory will be created (e.g., gt4.2.0-all-source-installer); go into the directory and execute the
configure script.
[root@pace]#./configure
[root@pace]#make
[root@pace]#make install
[root@pace]# chown -R globus:globus /usr/local/globus-4.2.0/
Note: Before starting the installation process, change the hostname of your system. The default
hostname (localhost.localdomain) will create problems during certificate generation.
Now as a root user
[root@pace local]# vi /etc/profile
add the following lines...
GLOBUS_LOCATION=/usr/local/globus-4.2.0
PATH=$ANT_HOME/bin:$JAVA_HOME/bin:$LAM_HOME/bin:$LAM_HOME/sbin:
$PATH:$GLOBUS_LOCATION/bin:$GLOBUS_LOCATION/sbin
export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE INPUTRC
CLASSPATH GLOBUS_LOCATION
2.1.2. Setting up the first machine
2.1.2.1 SimpleCA configuration:
[globus@pace gt4.2.0-all-source-installer]$source
$GLOBUS_LOCATION/etc/globus-user-env.sh
[globus@pace~]$ $GLOBUS_LOCATION/setup/globus/setup-simple-ca
The following results are displayed on the terminal:
The unique subject name for this CA is:
cn=Globus Simple CA, ou=simpleCA-pace.grid, ou=GlobusTest, o=Grid
Do you want to keep this as the CA subject (y/n) [y]:y
Enter the email of the CA (this is the email where certificate
requests will be sent to be signed by the CA):[email protected]
The CA certificate has an expiration date. Keep in mind that
once the CA certificate has expired, all the certificates
signed by that CA become invalid. A CA should regenerate
the CA certificate and start re-issuing ca-setup packages
before the actual CA certificate expires. This can be done
by re-running this setup script. Enter the number of DAYS
the CA certificate should last before it expires.
[default: 5 years (1825 days)]: <enter>
Enter PEM pass phrase:xxxxxx
Verifying - Enter PEM pass phrase:123456
setup-ssl-utils: Complete
[root@pace~]#$GLOBUS_LOCATION/setup/globus_simple_ca_116a21a8_setup/setup-
gsi-default
Running the above command causes the following to be processed:
setup-gsi: Configuring GSI security
Making /etc/grid-security...
mkdir /etc/grid-security
Making trusted certs directory: /etc/grid-security/certificates/
mkdir /etc/grid-security/certificates/
Installing /etc/grid-security/certificates//grid-security.conf.116a21a8...
Running grid-security-config...
Installing Globus CA certificate into trusted CA certificate directory...
Installing Globus CA signing policy into trusted CA certificate directory...
setup-gsi: Complete
[root@pace~]# source $GLOBUS_LOCATION/etc/globus-user-env.sh
[root@pace~]# grid-cert-request -host `hostname`
[root@pace~]#exit
[globus@pace ~]$ grid-ca-sign -in /etc/grid-security/hostcert_request.pem -out
hostsigned.pem
To sign the request please enter the password for the CA key:xxxxxx
The new signed certificate is at: /home/globus/.globus/simpleCA//newcerts/01.pem
[root@pace ~]# cp /home/globus/hostsigned.pem /etc/grid-security/hostcert.pem
cp: overwrite `/etc/grid-security/hostcert.pem'? y
[root@pace ~]# cd /etc/grid-security/
[root@pace grid-security]# cp hostcert.pem containercert.pem
[root@pace grid-security]# cp hostkey.pem containerkey.pem
[root@pace grid-security]# chown globus:globus container*.pem
[root@pace grid-security]# exit
Now we'll get a usercert for guser01.
[globus@pace ~]$ su - guser01
[guser01@pace~]$ source $GLOBUS_LOCATION/etc/globus-user-env.sh
[guser01@pace ~]$ grid-cert-request
Generating a 1024 bit RSA private key
..........++++++
............++++++
writing new private key to '/home/guser01/.globus/userkey.pem'
Enter PEM pass phrase:xxxxxx
Verifying - Enter PEM pass phrase:xxxxxx
[guser01@pace ~]$ cp /home/guser01/.globus/usercert_request.pem /tmp/request.pem
[globus@pace ~]$ cp /tmp/request.pem /home/globus
[globus@pace ~]$ grid-ca-sign -in request.pem -out signed.pem
To sign the request
please enter the password for the CA key:123456
The new signed certificate is at: /home/globus/.globus/simpleCA//newcerts/02.pem
[globus@pace ~]$ cp signed.pem /tmp/
[globus@pace ~]$ su - guser01
[guser01@pace ~]$ cp /tmp/signed.pem ~/.globus/usercert.pem
[guser01@pace~]$ grid-cert-info -subject
/O=Grid/OU=GlobusTest/OU=simpleCA-pace.grid/OU=grid/CN=grid user #01
[root@pace ~]# vi /etc/grid-security/grid-mapfile
add the following line..
"/O=Grid/OU=GlobusTest/OU=simpleCA- pace.grid/OU=grid/CN=grid" guser01
Environment variable setting for Credentials
[root@pace~]#vi /etc/profile
add the following lines...
GRID_SECURITY_DIR=/etc/grid-security
GRIDMAP=/etc/grid-security/grid-mapfile
export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE INPUTRC
CLASSPATH GLOBUS_LOCATION
JAVA_HOME GRIDMAP GRID_SECURITY_DIR
Validate certificate setup:
Note: login as guser01
[root@pace~]# openssl verify -CApath /etc/grid-security/certificates -purpose sslserver /etc/grid-security/hostcert.pem
/etc/grid-security/hostcert.pem: OK
2.1.2.2. Setting up GridFTP
[root@pace ~]# vim /etc/xinetd.d/gridftp
add the following lines...
service gsiftp
{
instances = 100
socket_type = stream
wait = no
user = root
env += GLOBUS_LOCATION=/usr/local/globus-4.2.0
env += LD_LIBRARY_PATH=/usr/local/globus-4.2.0/lib
server = /usr/local/globus-4.2.0/sbin/globus-gridftp-server
server_args = -i
log_on_success += DURATION
nice = 10
disable = no
}
[root@pace ~]# vim /etc/services
add the following line into bottom of the file.
# Local services
gsiftp 2811/tcp
[root@mitgrid ~]# /etc/init.d/xinetd reload
Reloading configuration: [ OK ]
[root@mitgrid ~]# netstat -an | grep 2811
tcp 0 0 0.0.0.0:2811 0.0.0.0:* LISTEN
Note:
Now the gridftp server is waiting for a request, so we'll run a client and transfer a file:
Testing:
[guser01@pace ~]$ grid-proxy-init -verify -debug
User Cert File: /home/guser01/.globus/usercert.pem
User Key File: /home/guser01/.globus/userkey.pem
Trusted CA Cert Dir: /etc/grid-security/certificates
Output File: /tmp/x509up_u502
Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA-mitgrid.grid/OU=grid/CN=grid
user #01
Enter GRID pass phrase for this identity:guser01
Creating proxy .............++++++++++++
..++++++++++++
Done
Proxy Verify OK
Your proxy is valid until: Sun Jan 29 01:12:48 2006
[guser01@mitgrid]$ globus-url-copy gsiftp://mitgrid.grid/etc/group file:///tmp/guser01.test.copy
[guser01@mitgrid ~]$ diff /tmp/guser01.test.copy /etc/group
Okay, so the GridFTP server works.
Starting the web services container configuration:
Now we'll setup an /etc/init.d entry for the web services container.
Note: login as globus
[globus@mitgrid ~]$ vim $GLOBUS_LOCATION/start-stop
add the following lines....
#! /bin/sh
set -e
export GLOBUS_OPTIONS="-Xms256M -Xmx512M"
. $GLOBUS_LOCATION/etc/globus-user-env.sh
cd $GLOBUS_LOCATION
case "$1" in start)
$GLOBUS_LOCATION/sbin/globus-start-container-detached -p 8443
;;
stop)
$GLOBUS_LOCATION/sbin/globus-stop-container-detached
;;
*)
echo "Usage: globus {start|stop}" >&2
exit 1
;;
esac
exit 0
[globus@mitgrid ~]$ chmod +x $GLOBUS_LOCATION/start-stop
Now, as root, we'll create an /etc/init.d script to call the globus user's start-stop script:
Note: login as root
[root@mitgrid ~]# vim /etc/init.d/globus-4.2.0
add the following lines...
#!/bin/sh -e
case "$1" in
start)
su - globus /usr/local/globus-4.2.0/start-stop start
;;
stop)
su - globus /usr/local/globus-4.2.0/start-stop stop
;;
restart)
$0 stop
sleep 1
$0 start
;;
*)
printf "Usage: $0 {start|stop|restart}\n" >&2
exit 1
;;
esac
exit 0
[root@pace ~]# chmod +x /etc/init.d/globus-4.2.0
[root@pace ~]# /etc/init.d/globus-4.2.0 start
Starting Globus container. PID: 19051
2.1.2.3. Grid Resource Allocation and Management (GRAM)
Now that we have GridFTP and RFT working, we can set up GRAM for resource management.
First we have to set up sudo so the globus user can start jobs as a different user.
[root@pace ~]# visudo
add the following lines at the bottom of the file (visudo edits /etc/sudoers)...
#Grid variable settings by VK@MITGRID
#Grid variable settings by VK@MITGRID
globus ALL=(guser01) NOPASSWD: /usr/local/globus-4.2.0/libexec/globus-gridmap-and-execute -g /etc/grid-security/grid-mapfile /usr/local/globus-4.2.0/libexec/globus-job-manager-script.pl *
globus ALL=(guser01) NOPASSWD: /usr/local/globus-4.2.0/libexec/globus-gridmap-and-execute -g /etc/grid-security/grid-mapfile /usr/local/globus-4.2.0/libexec/globus-gram-local-proxy-tool *
Note: login as guser01
[guser01@pace ~]$ globusrun-ws -submit -c /bin/true
Submitting job...Done.
Job ID: uuid:a9378900-8fed-11da-a691-000ffe3b1003
Termination time: 01/29/2006 11:03 GMT
Current job state: Active
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
[guser01@mitgrid ~]$
[guser01@mitgrid ~]$ echo $?
0
MyProxy Server Setup and Configuration:
In order to create a MyProxy server, we will first turn the pace.grid machine into a MyProxy server by
following these instructions.
Note: Login as root
[root@pace ~]# cp $GLOBUS_LOCATION/etc/myproxy-server.config /etc
[root@pace~]# vim /etc/myproxy-server.config
Just uncomment the following lines...
Before modification
#
# Complete Sample Policy
#
# The following lines define a sample policy that enables all
# myproxy-server features. See below for more examples.
#accepted_credentials "*"
#authorized_retrievers "*"
#default_retrievers "*"
#authorized_renewers "*"
#default_renewers "none"
#authorized_key_retrievers "*"
#default_key_retrievers "none"
after modification:
#
# Complete Sample Policy
#
# The following lines define a sample policy that enables all
# myproxy-server features. See below for more examples.
accepted_credentials "*"
authorized_retrievers "*"
default_retrievers "*"
authorized_renewers "*"
default_renewers "none"
authorized_key_retrievers "*"
default_key_retrievers "none"
[root@pace ~]# cat
$GLOBUS_LOCATION/share/myproxy/etc.services.modifications >> /etc/services
[root@mitgrid ~]# tail /etc/services
asp 27374/udp # Address Search Protocol
tfido 60177/tcp # Ifmail
tfido 60177/udp # Ifmail
fido 60179/tcp # Ifmail
fido 60179/udp # Ifmail
# Local services
gsiftp 2811/tcp
myproxy-server 7512/tcp # Myproxy server
[root@pace ~]# cp $GLOBUS_LOCATION/share/myproxy/etc.xinetd.myproxy /etc/
xinetd.d/myproxy
[root@pace ~]# vim /etc/xinetd.d/myproxy
Modify the following lines....
service myproxy-server
{
socket_type = stream
protocol = tcp
wait = no
user = root
server = /usr/local/globus-4.2.0/sbin/myproxy-server
env = GLOBUS_LOCATION=/usr/local/globus-4.2.0
LD_LIBRARY_PATH=/usr/local/globus-4.2.0/lib
disable = no
}
[root@pace ~]# /etc/init.d/xinetd reload
Reloading configuration: [ OK ]
[root@pace ~]# netstat -an | grep 7512
tcp 0 0 0.0.0.0:7512 0.0.0.0:* LISTEN
Note: Login as guser01 @pace.grid
[guser01@pace~]$ grid-proxy-destroy
[guser01@pace ~]$ grid-proxy-info
ERROR: Couldn't find a valid proxy.
Use -debug for further information.
Note: Instead of grid-proxy we use the MyProxy server.
[guser01@mitgrid ~]$ myproxy-init -s mitgrid
Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA-mitgrid.grid/OU=grid/CN=grid
user #01
Enter GRID pass phrase for this identity: guser01
Creating proxy ........................................... Done
Proxy Verify OK
Your proxy is valid until: Fri Feb 10 15:44:40 2006
Enter MyProxy pass phrase: globus
Verifying - Enter MyProxy pass phrase: globus
A proxy valid for 168 hours (7.0 days) for user guser01 now exists on mitgrid.
[guser01@pace ~]$ myproxy-logon -s pace.grid
Enter MyProxy pass phrase:guser01
A proxy has been received for user guser01 in /tmp/x509up_u503.
2.1.3. Setting up the Second machine
Install the Globus Toolkit (follow the steps specified in the prerequisites and for the first machine).
Installation of CA packages
To install CA packages, log in to the CA host as the globus user, invoke the setup-simple-ca
script, and answer the prompts as appropriate:
[globus@ca]$ $GLOBUS_LOCATION/setup/globus/setup-simple-ca
WARNING: GPT_LOCATION not set, assuming:
GPT_LOCATION=/usr/local/globus-4.2.0
C e r t i f i c a t e A u t h o r i t y S e t u p
This script will setup a Certificate Authority for signing Globus users certificates. It will
also generate a simple CA package that can be distributed to the users of the CA.The CA
information about the certificates it distributes will be kept in:
/home/globus/.globus/simpleCA/
/usr/local/globus-4.0.0/setup/globus/setup-simple-ca: line 250: test: res:
integer expression expected
The unique subject name for this CA is:
cn=Globus Simple CA, ou=simpleCA-ca.redbook.ibm.com, ou=GlobusTest, o=Grid
Do you want to keep this as the CA subject (y/n) [y]: y
Enter the email of the CA (this is the email where certificate requests will be
sent to be signed by the CA): (type mail address)[email protected]. The CA
certificate has an expiration date. Keep in mind that once the CA certificate has expired,
all the certificates signed by that CA become invalid. A CA should regenerate the CA cer-
tificate and start re-issuing ca-setup packages before the actual CA certificate expires.
This can be done by re-running this setup script. Enter the number of DAYS the CA cer-
tificate should last before it expires.
[default: 5 years (1825 days)]: (type the number of days)1825
Enter PEM pass phrase: (type ca certificate pass phrase)
Verifying - Enter PEM pass phrase: (type ca certificate pass phrase)
...(unrelated information omitted)
Setup security in each grid node. After performing the steps above, a package file has
been created that needs to be used on other nodes, as described in this section. In order to use
certificates from this CA in other grid nodes, you need to copy and install the CA setup package
to each grid node.
1. Log in to a grid node as a Globus user and obtain a CA setup package from the CA host. Then
run the setup commands for configuration .
[globus@hosta]$ scp globus@ca:/home/globus/.globus/simpleCA
/globus_simple_ca_(ca_hash)_setup-0.18.tar.gz .
[globus@hosta]$ $GLOBUS_LOCATION/sbin/gpt-build globus_simple_ca_(ca_hash)_setup-0.18.tar.gz gcc32dbg
[globus@hosta]$ $GLOBUS_LOCATION/sbin/gpt-postinstall
Note: A CA setup package is generated when you run the setup-simple-ca command. Keep in
mind that the name of the CA setup package includes a unique CA hash. As the root user, submit
the commands to configure the CA settings in each grid node. This script creates the /etc/grid-security
directory. This directory contains the configuration files for security.
Configure CA in each grid node
[root@hosta]# $GLOBUS_LOCATION/setup/globus_simple_ca_[ca_hash]_setup/setup-gsi-default
Note: For the setup of the CA host, you do not need to run the setup-gsi script. This script creates
a directory that contains the configuration files for security. The CA host does not need this
directory, because these configuration files are for the servers and users who use the CA.
In order to use some of the services provided by Globus Toolkit 4, such as GridFTP, you
need to have a CA signed host certificate and host key in the appropriate directory. As the root user,
request a host certificate with the command
[root@pace]# grid-cert-request -host `hostname`
Copy or send the /etc/grid-security/hostcert_request.pem file to the CA host. In the CA host as a
Globus user, sign the host certificate by using the grid-ca-sign command.
[globus@ca]$ grid-ca-sign -in hostcert_request.pem -out hostcert.pem
To sign the request please enter the password for the CA key: (type CA passphrase)
The new signed certificate is at:
/home/globus/.globus/simpleCA//newcerts/01.pem
Copy the hostcert.pem back to the /etc/grid-security/ directory in the grid node.
In order to use the grid environment, a grid user needs to have a CA signed user
certificate and user key in the user’s directory. As a user (auser1 in hosta), request a user certifi-
cate with the command
[auser1@pace1]$ grid-cert-request
Enter your name, e.g., John Smith: grid user 1 (type grid user name). A certificate re-
quest and private key is being created.You will be asked to enter a PEM pass phrase.
This pass phrase is akin to your account password,and is used to protect your key file. If
you forget your pass phrase, you will need to obtain a new certificate.
Generating a 1024 bit RSA private key
.....................................++++++
...++++++
writing new private key to '/home/auser1/.globus/userkey.pem'
Enter PEM pass phrase: (type pass phrase for grid user)
Verifying - Enter PEM pass phrase: (retype pass phrase for grid user)
...(unrelated information omitted)
Copy or send the (userhome)/.globus/usercert_request.pem file to the CA host. In the CA host,
as a Globus user, sign the user certificate by using the grid-ca-sign command.
[globus@pace]$ grid-ca-sign -in usercert_request.pem -out usercert.pem
To sign the request
please enter the password for the CA key:
The new signed certificate is at:
/home/globus/.globus/simpleCA//newcerts/02.pem
Copy the created usercert.pem to the (userhome)/.globus/ directory on the grid node.
Test the user certificate by typing grid-proxy-init -debug -verify as the auser1 user. With this
command, you can see the location of a user certificate and a key, the CA's certificate directory, a
distinguished name for the user, and the expiration time. After you successfully execute grid-proxy-
init, you have been authenticated and are ready to use the grid environment.
[auser1@pace1]$ grid-proxy-init -debug -verify
User Cert File: /home/auser1/.globus/usercert.pem
User Key File: /home/auser1/.globus/userkey.pem
Trusted CA Cert Dir: /etc/grid-security/certificates
Output File: /tmp/x509up_u511
Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA-ca.redbook.ibm.com/OU=re-
book.ibm.com/CN=grid user 1
Enter GRID pass phrase for this identity:
Creating proxy .........++++++++++++
.................++++++++++++
Done
Proxy Verify OK
Your proxy is valid until: Thu Jun 9 22:16:28 200
Note: You may copy those user certificates to other grid nodes in order to access each grid node
as a single grid user. But you may not copy a host certificate and a host key; a host certificate
needs to be created in each grid node.
Set mapping information between a grid user and a local user: Globus Toolkit 4 requires a
mapping between an authenticated grid user and a local user. In
order to map a user, you need to get the distinguished name of the grid user, and map it to a local
user. Get the distinguished name by invoking the grid-cert-info command.
[auser1@pace1]$ grid-cert-info -subject -f /home/auser1/.globus/usercert.pem
/O=Grid/OU=GlobusTest/OU=simpleCA-ca.redbook.ibm.com/OU=redbook.ibm.com/
CN=grid user 1
As a root user, map the local user name with the distinguished name by using the grid-
mapfile-add-entry command.
[root@pace1]# grid-mapfile-add-entry -dn \
"/O=Grid/OU=GlobusTest/OU=simpleCA-ca.redbook.ibm.com/OU=redbook.ibm.com/
CN=grid user 1" -ln auser1
Modifying /etc/grid-security/grid-mapfile ...
/etc/grid-security/grid-mapfile does not exist... Attempting to create /etc/grid-security/
grid-mapfile
New entry:
"/O=Grid/OU=GlobusTest/OU=simpleCAca.redbook.ibm.com/OU=redbook.ibm.com/
CN=grid user 1" auser1
Note: The grid-mapfile-add-entry command creates and adds an entry to /etc/grid-security/grid-
mapfile. You can manually add an entry by adding a line into this file. In order to see the map-
ping information, look at /etc/grid-security/grid-mapfile
Example of /etc/grid-security/grid-mapfile
"/O=Grid/OU=GlobusTest/OU=simpleCAca.redbook.ibm.com/OU=redbook.ibm.com/CN=grid
user 1" auser1
For setting up the Java WS Core container, GridFTP, and the MyProxy server, follow the steps specified
for the first machine.
Submitting grid-proxy-init command
Note: Login as auser1 @pace.grid
[auser1@pace~]$ grid-proxy-destroy
[auser1@pace ~]$ grid-proxy-info
ERROR: Couldn't find a valid proxy.
Use -debug for further information.
[guser01@pace ~]$
Note: Instead of grid-proxy we use Myproxy server.
[auser1@mitgrid ~]$ myproxy-init -s mitgrid
Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA-mitgrid.grid/OU=grid/CN=grid
auser #1
Enter GRID pass phrase for this identity: auser01
Creating proxy ........................................... Done
Proxy Verify OK
Your proxy is valid until: Fri Feb 10 15:44:40 2006
Enter MyProxy pass phrase: globus
Verifying - Enter MyProxy pass phrase: globus
A proxy valid for 168 hours (7.0 days) for user guser01 now exists on mitgrid.
[auser1@pace ~]$ myproxy-logon -s pace.grid
Enter MyProxy pass phrase:auser1
A proxy has been received for user guser01 in /tmp/x509up_u503.
2.2. Message Passing Interface
2.2.1. Setting up MPICH
Here are the steps from obtaining MPICH2 through running your own parallel program
on multiple machines.
1. Unpack the tar file for MPICH2 i.e. mpich2.tar.gz
2. Choose an installation directory (the default is /usr/local/bin):
It will be most convenient if this directory is shared by all of the machines where you intend
to run processes. If not, you will have to duplicate it on the other machines after installation.
3. Choose a build directory. Building will proceed much faster if your build directory is on a file
system local to the machine on which the configuration and compilation steps are executed. It is
preferable that this also be separate from the source directory, so that the source directories re-
main clean and can be reused to build other copies on other machines.
4. Configure, build, and install MPICH2 using the following respective commands, specifying
the installation directory, and running the configure script in the source directory:
./configure --prefix=/home/you/mpich2-install
make
make install
7. Add the bin subdirectory of the installation directory to your path:
export PATH=/home/you/mpich2-install/bin:$PATH
8. For security reasons, MPD looks in your home directory for a file named .mpd.conf containing
the line
secretword=<secretword>
where <secretword> is a string known only to yourself. It should not be your normal Unix pass-
word. Set the file permissions as readable and writable only by you:
cd $HOME
touch .mpd.conf
chmod 600 .mpd.conf
Then use an editor to place a line like:
secretword=mr45-j9z
into the file (of course, use a different secret word than mr45-j9z). If you are the super user, then as
root create the file /etc/mpd.conf instead.
9. The first sanity check consists of bringing up a ring of one MPD on the local machine, testing
one MPD command, and bringing the “ring” down.
mpd &
mpdtrace
mpdallexit
The output of mpdtrace should be the hostname of the machine you are running on. The mp-
dallexit causes the mpd daemon to exit.
10. The next sanity check is to run a non-MPI program using the daemon.
mpd &
mpiexec -n 1 /bin/hostname
mpdallexit
This should print the name of the machine you are running on.
11. Now we will bring up a ring of mpd’s on a set of machines. Create a file consisting of a list
of machine names, one per line. Name this file mpd.hosts. These hostnames will be used as
targets for ssh or rsh, so include full domain names if necessary. Bring up mpds on these hosts (for
example with mpdboot -n <number of hosts> -f mpd.hosts), then check to see if all the hosts you
listed in mpd.hosts are in the output of mpdtrace and, if so, move on to the next step.
12. Test the ring you have just created:
ssh login without password
First log in on A as user a and generate a pair of authentication keys. Do not enter a passphrase:
a@A:~> ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/a/.ssh/id_rsa):
Created directory '/home/a/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/a/.ssh/id_rsa.
Your public key has been saved in /home/a/.ssh/id_rsa.pub.
The key fingerprint is:
3e:4f:05:79:3a:9f:96:7c:3b:ad:e9:58:37:bc:37:e4 a@A
Now use ssh to create a directory ~/.ssh as user b on B. (The directory may already exist,
which is fine):
a@A:~> ssh b@B mkdir -p .ssh
b@B's password:
Finally append a's new public key to b@B:.ssh/authorized_keys and enter b's password one last
time:
a@A:~> cat .ssh/id_rsa.pub | ssh b@B 'cat >> .ssh/authorized_keys'
b@B's password:
From now on you can log into B as b from A as a without password:
a@A:~> ssh b@B hostname
13. Test that the ring can run a multiprocess job:
mpiexec -n <number> hostname
The number of processes need not match the number of hosts in the ring; if there are more, they
will wrap around. You can see the effect of this by getting rank labels on the stdout:
mpiexec -l -n 30 hostname
You probably didn’t have to give the full pathname of the hostname command because it is in
your path. If not, use the full pathname:
mpiexec -l -n 30 /bin/hostname
14. Now we will run an MPI job, using the mpiexec command as specified
mpiexec -n 5 examples/cpi
The number of processes need not match the number of hosts. The cpi example will tell
you which hosts it is running on. By default, the processes are launched one after the other on the
hosts in the mpd ring, so it is not necessary to specify hosts when running a job with mpiexec.
Troubleshooting:
It can be rather tricky to configure one or more hosts in such a way that they adequately
support client-server applications like mpd. In particular, each host must not only know its own
name, but must identify itself correctly to other hosts when necessary. Further, certain informa-
tion must be readily accessible to each host. For example, each host must be able to map another
host’s name to its IP address. In this section, we will walk slowly through a series of steps that
will help to ensure success in running mpds on a single host or on a large cluster.
If you can ssh from each machine to itself, and from each machine to each other machine
in your set (and back), then you probably have an adequate environment for mpd. However,
there may still be problems. For example, if you are blocking all ports except the ports used by
ssh/sshd, then mpd will still fail to operate correctly.
To begin using mpd, the sequence of steps that we recommend is this:
1. get one mpd working alone on a first test node
2. get one mpd working alone on a second test node
3. get two new mpds to work together on the two test nodes
Follow these steps:
1. Install mpich2, and thus mpd.
2. Make sure the mpich2 bin directory is in your path. Below, we will refer to it as
MPDDIR.
3. Run a first mpd (alone on a first node). As mentioned above, mpd uses client-server
communications to perform its work. So, before running an mpd, let’s run a simpler
program (mpdcheck) to verify that these communications are likely to be successful.
Even on hosts where communications are well supported, sometimes there are problems
associated with hostname resolution, etc. So, it is worth the effort to proceed a bit
slowly. Below, we assume that you have installed mpd and have it in your path.
Select a test node; let's call it n1, and log in to n1. First, we will run mpdcheck as a server
and a client. To run it as a server, get into a window with a command line and run this:
n1 $ mpdcheck -s
server listening at INADDR_ANY on: n1 1234
Now, run the client side (in another window if convenient) and see if it can find the server and
communicate. Be sure to use the same hostname and port number printed by the server (above:
n1 1234):
n1 $ mpdcheck -c n1 1234
server has conn on
<socket._socketobject object at 0x40200f2c>
from (’192.168.1.1’, 1234)
server successfully recvd msg from client:
hello_from_client_to_server
client successfully recvd ack from server:
ack_from_server_to_client
If the experiment failed, you have some network or machine configuration problem
which will also be a problem later when you try to use mpd.
If the experiment succeeded, then you should be ready to try mpd on this one host. To
start an mpd, you will use the mpd command. To run parallel programs, you will use the mpiexec
program. All mpd commands accept the -h or --help arguments, e.g.:
n1 $ mpd --help
n1 $ mpiexec --help
Try a few tests:
n1 $ mpd &
n1 $ mpiexec -n 1 /bin/hostname
n1 $ mpiexec -l -n 4 /bin/hostname
n1 $ mpiexec -n 2 PATH_TO_MPICH2_EXAMPLES/cpi
where PATH_TO_MPICH2_EXAMPLES is the path to the mpich2-1.0.3/examples directory.
To terminate the mpd:
n1 $ mpdallexit
Run a second mpd (alone on a second node). To verify that things are fine on a second
host (say n2 ), login to n2 and perform the same set of tests that you did on n1. Make sure that
you use mpdallexit to terminate the mpd so you will be ready for further tests.
Run a ring of two mpds on two hosts. Before running a ring of mpds on n1 and n2, we
will again use mpdcheck, but this time between the two machines. We do this because the two
nodes may have trouble locating each other or communicating between them and it is easier to
check this out with the smaller program.
First, we will make sure that a server on n1 can service a client from n2. On n1:
n1 $ mpdcheck -s
which will print a hostname (hopefully n1) and a portnumber (say 3333 here). On n2:
n2 $ mpdcheck -c n1 3333
Second, we will make sure that a server on n2 can service a client from n1. On n2:
n2 $ mpdcheck -s
which will print a hostname (hopefully n2) and a portnumber (say 7777 here). On n1:
n1 $ mpdcheck -c n2 7777
Now start an mpd on n1 and note the port it is listening on:
n1 $ mpd &
n1 $ mpdtrace -l
The port printed by mpdtrace (say 6789 here) is the port that the mpd is listening on for connections
from other mpds wishing to enter the ring. We will use that port in a moment to get an mpd from n2
into the ring. The value in parentheses should be the IP address of n1.
On n2:
n2 $ mpd -h n1 -p 6789 &
where 6789 is the listening port on n1 (from mpdtrace above). Now try:
n2 $ mpdtrace -l
You should see both mpds in the ring.To run some programs in parallel:
n1 $ mpiexec -n 2 /bin/hostname
n1 $ mpiexec -n 4 /bin/hostname
n1 $ mpiexec -l -n 4 /bin/hostname
n1 $ mpiexec -l -n 4 PATH_TO_MPICH2_EXAMPLES/cpi
To bring down the ring of mpds:
n1 $ mpdallexit
If the output from any of mpdcheck, mpd, or mpdboot leads you to believe that one or
more of your hosts are having trouble communicating due to firewall issues, we can offer a few
simple suggestions. If the problems are due to an “enterprise” firewall computer, then we can
only point you to your local network admin for assistance. In other cases, there are a few quick
things that you can try to see if there some common protections in place which may be causing
your problems.Deactivate all firewalls in the running services window.
2.3. OpenFOAM
System requirements:
OpenFOAM is developed and tested on Linux, but should work with other POSIX sys-
tems. To check your system setup, execute the foamSystemCheck script in the bin/ directory of
the OpenFOAM installation.
Here is the output you should get
[open@sham OpenFOAM-1.6]$ foamSystemCheck
Checking basic system...
-----------------------------------------------------------------------
Shell: /bin/bash
Host: sham.globus
OS: Linux version 2.6.27.5-117.fc10.i686
User: open
System check: PASS
==================
Continue OpenFOAM installation.
Installation:
Download and unpack the files in the $HOME/OpenFOAM directory as described in:
http://www.OpenFOAM.org/download.html
The environment variable settings are contained in files in an etc/ directory in the OpenFOAM
release. e.g. in
$HOME/OpenFOAM/OpenFOAM-1.6/etc/
source the etc/bashrc file by adding the following line to the end of your $HOME/.bashrc
file:
. $HOME/OpenFOAM/OpenFOAM-1.6/etc/bashrc
Then update the environment variables by sourcing the $HOME/.bashrc file by typing in the ter-
minal:
. $HOME/.bashrc
Testing the installation:
To check your installation setup, execute the 'foamInstallationTest' script (in the bin/ di-
rectory of the OpenFOAM installation). If no problems are reported, proceed to getting started
with OpenFOAM; otherwise, go back and check you have installed the software correctly .
Getting Started
Create a project directory within the $HOME/OpenFOAM directory named <USER>-1.6 (e.g.
'chris-1.6' for user chris and OpenFOAM version 1.6) and create a directory named 'run' within
it, e.g. by typing:
mkdir -p $FOAM_RUN
Copy the 'tutorial' examples directory in the OpenFOAM distribution to the 'run' directory. If the
OpenFOAM environment variables are set correctly, then the following command will be cor-
rect:
cp -r $WM_PROJECT_DIR/tutorials $FOAM_RUN
Run the first example case of incompressible laminar flow in a cavity:
cd $FOAM_RUN/tutorials/incompressible/icoFoam/cavity
blockMesh
icoFoam
paraFoam
CHAPTER 3
CASE STUDIES
3.1. Case Study 1: Dense Matrix Multiplication
One way to implement the matrix multiplication algorithm is to allocate one processor to
compute each row of the resultant matrix C; the whole of matrix B and one row of elements of A
are needed by each processor. Using a master-slave approach, these elements could be sent from
the master processor to the selected slave processors. Results are then collected back from each of
the slaves and displayed by the master.
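A minimal sketch of this master/slave distribution is shown below. It is illustrative only (the MPICH code actually written for the case study is not reproduced here) and assumes, for simplicity, that the job is started with exactly N + 1 processes, one master plus one slave per row of C; error handling is omitted.

/* Hedged sketch of the row-wise master/slave decomposition described above. */
#include <stdio.h>
#include <mpi.h>

#define N 4   /* illustrative matrix order */

int main(int argc, char *argv[])
{
    double a[N][N], b[N][N], c[N][N], row[N], result[N];
    int rank, size, i, j, k;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != N + 1) {                           /* assumption: one slave per row */
        if (rank == 0) fprintf(stderr, "run with exactly %d processes\n", N + 1);
        MPI_Finalize();
        return 1;
    }

    if (rank == 0)                                 /* master initialises the matrices */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) { a[i][j] = i + j; b[i][j] = (i == j); }

    MPI_Bcast(b, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);   /* the whole of B to everyone */

    if (rank == 0) {
        for (i = 0; i < N; i++)                    /* one row of A to each slave */
            MPI_Send(a[i], N, MPI_DOUBLE, i + 1, 0, MPI_COMM_WORLD);
        for (i = 0; i < N; i++)                    /* collect the rows of C */
            MPI_Recv(c[i], N, MPI_DOUBLE, i + 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("c[0][0] = %f\n", c[0][0]);
    } else {
        MPI_Recv(row, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (j = 0; j < N; j++) {                  /* this slave computes one row of C */
            result[j] = 0.0;
            for (k = 0; k < N; k++)
                result[j] += row[k] * b[k][j];
        }
        MPI_Send(result, N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

With N = 4 it would be started with, for example, mpiexec -n 5 ./matmul (the executable name is a placeholder).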
Steps taken to parallelize:
1) A MPICH code for matrix multiplication is written.
2) MPD must be running on all the nodes.
Start the daemons "by hand" as follows:
mpd & # starts the local daemon
mpdtrace -l # makes the local daemon print its host
# and port in the form <host>_<port>
Then log into each of the other machines, put the install/bin directory in your path, and do:
mpd -h <hostname> -p <port> &
Where the hostname and port belong to the original mpd that has been started. From each
machine, after starting the mpd, mpdtrace is used to see which machines are in the ring.
3) The execution command is given on the master node.
The MPI job is run using the mpiexec command:
mpiexec -n <number of processes> <executable>
mpiexec -n 5 ./cpi
4) The system monitor is checked to verify that all the nodes are being utilized.
5) The result from each slave process is given back to the master node.
Implementing:
There exist many ways of implementing matrix multiplication, and finding an efficient
implementation is a long-standing challenge.
A sequential code is written for the ordinary matrix multiplication:
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        mult[i][j] = 0;
        for (k = 0; k < n; k++) {
            mult[i][j] += m1[i][k] * m2[k][j];
        }
    }
}
This algorithm requires n³ multiplications and n³ additions, leading to a sequential time
complexity of O(n³). Parallel matrix multiplication is usually based upon the direct sequential
matrix multiplication algorithm. Even a superficial look at the sequential code reveals that the
computation in each iteration of the outer two loops is not dependent upon any other iteration,
and each instance of the inner loop could be executed in parallel. Theoretically, with p = n²
processors (one per element of the result), we can expect a parallel time complexity of O(n), and
this is easily obtainable.
Direct implementation:
One way to implement the matrix multiplication algorithm is to allocate one processor to
compute each column of the resultant matrix C; the whole of matrix A and one column of elements
of B are needed by each processor. Using a master-slave approach, these elements could be sent
from the master processor to the selected slave processors. Results are then collected back from
each of the slaves and displayed by the master.
Observations:
Performance of matrix multiplication using MPICH2:

Matrix Dimension    No. of cores    Time (seconds)
1000x1000           1               13.90
1000x1000           2               8.33
1000x1000           3               5.86
1000x1000           4               5.07
1000x1000           5               6.89
2000x2000           1               108.17
2000x2000           2               64.67
2000x2000           3               45.56
2000x2000           4               36.54
2000x2000           5               51.62
3000x3000           1               392.9
3000x3000           2               220.19
3000x3000           3               156.81
3000x3000           4               123.36
3000x3000           5               180.29
The graph below shows the experimental observations of execution time versus the number of
processes for the different matrix dimensions.
Remarks:
1. Execution time increases once the number of processes spawned exceeds some critical number.
2. Time spent in the communication process described above also sometimes adds overhead.
3. If the program is carefully divided among a number of processes that matches the machines and
cores available, then there is an improvement in performance.
4. For example, if there are 2 machines with 2 cores each, then for 1, 2, 3, or 4 processes the
performance increases.
5. If the program is not carefully divided among the processes according to the machines and cores
available, then there is a slight decrease in performance.
6. For example, if there are 2 machines with 2 cores each, performance increases for 1, 2, 3, or 4
processes; but if the work is divided into 5 processes, the performance slightly decreases, as
shown in the graph.
7. The program should be carefully divided among the processes according to the machines and
cores available in order to achieve high performance.
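As a rough check of these remarks against the tables above, speedup S(p) = T(1)/T(p) and efficiency E(p) = S(p)/p for the 3000x3000 case are approximately:

S(4) = 392.9 / 123.36 ≈ 3.2,   E(4) ≈ 0.80
S(5) = 392.9 / 180.29 ≈ 2.2,   E(5) ≈ 0.44

so adding a fifth process on a testbed with four cores lowers both speedup and efficiency, which is the behaviour described in remarks 5 and 6.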
3.2. Case Study 2: Computational Fluid Dynamics
1. Involves the interaction of heat conduction within a solid body with convection from its surface
to a fluid flowing over it.
2. Applications include the thermal design of a fuel element of a nuclear reactor.
3. Software was developed that deals with:
- the study of the conjugate heat transfer problem associated with a rectangular nuclear
fuel element washed by an upward-moving coolant;
- a stream function-vorticity formulation;
- the equations governing the steady, two-dimensional flow and thermal fields in the
coolant, solved simultaneously with the steady, two-dimensional heat conduction
equation in the solid.
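For reference, a standard form of the steady, two-dimensional stream function-vorticity system that such a formulation solves is sketched below; the exact non-dimensional form and boundary conditions used in the code may differ.

∂²ψ/∂x² + ∂²ψ/∂y² = -ω,        u = ∂ψ/∂y,   v = -∂ψ/∂x
u ∂ω/∂x + v ∂ω/∂y = ν (∂²ω/∂x² + ∂²ω/∂y²)          (vorticity transport in the coolant)
u ∂T/∂x + v ∂T/∂y = α (∂²T/∂x² + ∂²T/∂y²)          (energy equation in the coolant)
∂²Ts/∂x² + ∂²Ts/∂y² + q/ks = 0                     (conduction in the fuel element, with volumetric heat generation q)

with temperature and heat flux matched at the solid-fluid interface as in Section 1.6.2.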
Pre-analysis - Profiling:
gprof, GNU's open-source profiler, was used to profile the code. The output of gprof, which
includes a flat profile and a call graph, was also used for code comprehension.
Flat profile:
Call graph:
The following observations were made.
1. Values for the different input parameters were hardcoded into the code. Each change of
parameter value necessitated a recompilation of the code.
2. For the given set of parameters present in the code the observed run time was about 23
minutes. The run time could get much larger (ranging from a few hours to days) with
changed parameters. The large system run time discouraged experimenting with a range
of values of the computational grid system that may have resulted in values with a finer
resolution. Additionally, the range of values of parameters that may compute results giv-
ing more insight into the physics of the problem could not be studied for the same rea-
sons.
3. Multiple output files were used and data was being written to a large set of output files.
File handling could have been more efficiently done to positively impact the total execu-
tion time.
4. As the program was serially executed, the execution of a few functions was delayed, even
though parameters or values required to compute that function were available. A similar
case was of functions being called after the complete execution of loops although no data
or control dependencies existed between the loop and/or the functions.
5. It was observed that there was an excessive and sometimes unnecessary use of global
variables.
6. Some loops were identified that could have been combined to bring down the size of the
code.
7. None of the functions took input parameters or returned values; global variables were used in place of both (a minimal before/after sketch is given after this list).
8. Some of the defined functions exhibit nearly identical functionality, differing only in a few statements.
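The sketch below illustrates observation 7 with hypothetical names that are not taken from the actual code; it contrasts a function that communicates only through globals with an equivalent version that takes a parameter and returns a value:
#include <iostream>

// Before: the function reads and writes global variables only.
double gridSpacing = 0.1;
double heatFlux = 0.0;                      // written as a side effect
void computeHeatFluxGlobal() { heatFlux = 42.0 * gridSpacing; }

// After: inputs and outputs are explicit in the signature.
double computeHeatFlux(double spacing) { return 42.0 * spacing; }

int main() {
    computeHeatFluxGlobal();
    std::cout << heatFlux << "\n";               // 4.2, via hidden global state
    std::cout << computeHeatFlux(0.1) << "\n";   // 4.2, with explicit parameter passing
    return 0;
}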
Analyzing Data Dependencies:
The call graph was used to understand the overall working of the code, after which the data dependencies were analyzed at a coarse, function-level granularity. More specifically, the following data dependencies were identified (a small illustrative sketch follows the list).
1. Flow dependence - if a variable modified in one function is subsequently used by another function, the second function must execute after the first; the order cannot be changed.
2. Anti-dependence - if a variable written by one function is read by a previously called function, the order of these two functions cannot be interchanged.
3. Output dependence - if two functions write to the same output variable, they are output dependent and their order cannot be changed.
4. I/O dependence - this dependence occurs between two functions when the same file is read and written by both of them.
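The following small sketch, with hypothetical function and variable names rather than those of the CFD code, shows how each kind of dependence constrains the order of execution:
#include <cstdio>

double vorticity   = 0.0;   // shared globals, as in the original code
double temperature = 0.0;

void computeFlow()    { vorticity = 1.5; }                 // writes vorticity
void computeThermal() { temperature = 2.0 * vorticity; }   // reads vorticity: flow dependence on computeFlow()
void updateFlow()     { vorticity = 0.5 * temperature; }   // rewrites vorticity: anti-dependence with computeThermal(),
                                                           // output dependence with computeFlow()
void writeField()   { if (std::FILE* f = std::fopen("out.dat", "a")) { std::fprintf(f, "%f\n", temperature); std::fclose(f); } }
void writeSummary() { if (std::FILE* f = std::fopen("out.dat", "a")) { std::fprintf(f, "T = %f\n", temperature); std::fclose(f); } }
                    // writeField() and writeSummary() append to the same file: I/O dependence

int main() {
    computeFlow();      // must run before computeThermal()
    computeThermal();   // must run before updateFlow()
    updateFlow();
    writeField();       // order relative to writeSummary() is fixed by the I/O dependence
    writeSummary();
    return 0;
}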
PRE-PARALLELIZATION EXERCISES:
Based on the analysis, it was noted that the following tasks need to be completed before the parallelization step begins:
1. The code needs to be changed to read in parameters from the command line or from an externally available input file. This would allow the code to be executed unchanged for different parameter values without the need for recompilation (a minimal sketch is given after this list).
2. Reduce the number of output files. If the files are genuinely required, the output sequence needs to be analyzed further; otherwise, the data written to these files can be combined into a smaller set.
3. Functions that have been identified as having no data dependencies between them are good candidates for parallel execution. Their execution-time profiles and their computation-to-communication ratios need to be studied further to see if parallelization will indeed produce a speedup.
4. The code needs to be rewritten to reduce the usage of global variables. This may involve
changing all or most of the function signatures to read in input parameters and return results.
This exercise may also involve the creation of more efficient data structures for parameter
passing between functions.
5. Many functions can be eliminated by rewriting them to combine the functionality of two or more functions. This would considerably reduce the code size and result in more compact, better-written code. However, code repeated in different places has the advantage that each copy can be customized to the part of the program in which it appears; combining similar portions of code into a single generalized function, while offering other advantages, removes this flexibility. The trade-off needs to be weighed before performing this exercise.
6. Loops that are temporally close need to be studied, along with their indices, to see if they can be combined. In addition to reducing the code size, this would reduce the parallelization effort, as only a single loop would need to be analyzed.
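As a starting point for exercise 1 above, the following minimal sketch reads parameters from an input file; the file format and parameter names are illustrative assumptions, not the actual parameters of the CFD code:
#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main(int argc, char* argv[]) {
    // The parameter file name comes from the command line, with a default.
    const std::string fileName = (argc > 1) ? argv[1] : "params.in";
    std::ifstream in(fileName.c_str());
    if (!in) {
        std::cerr << "Cannot open parameter file " << fileName << "\n";
        return 1;
    }
    // One "name value" pair per line, e.g. "reynolds 250.0".
    std::map<std::string, double> params;
    std::string name;
    double value;
    while (in >> name >> value) params[name] = value;
    std::cout << "Read " << params.size() << " parameters from " << fileName << "\n";
    // The values in params would then be passed to the solver routines,
    // so that changing a parameter no longer requires recompilation.
    return 0;
}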
Based on the identified dependencies, the code was statically restructured for a theoretical execution on a multiprocessor system. Initial analysis indicated a reduction of the execution time to about 13 minutes, i.e. a theoretical speedup of 1.7, ignoring the communication overhead.
Flowchart
Flowchart showing the order of execution of the program
The above flowchart shows the execution flow of the functions in the given CFD case problem. There are 31 functions in total, some dependent and some independent. A single rectangular box with a number inside indicates the number of functions that must be executed sequentially; a double rectangular box with a number inside indicates the number of functions that can be executed independently.
The problem took over 15 minutes to execute sequentially. After establishing that some parts of the code can be parallelized, a theoretical speedup of 1.71 was obtained by taking the maximum of the times of the functions that can be executed in parallel.
3.3 Case Study 3: OpenFOAM
bubbleFoam is the case, out of the many cases in the OpenFOAM application, that we will be studying. Before starting with this case, some modifications are needed. First, to generate profiling data, the -pg option has to be added to the C and C++ compiler rule files located in the following directory:
/home/open/OpenFOAM/OpenFOAM-1.6/wmake/rules/linuxGcc
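For example, the optimisation flag variables in these rule files can be extended roughly as shown below. This is a sketch based on the usual layout of OpenFOAM-1.6's cOpt and c++Opt rule files; the exact variable names and existing flags should be checked against the local files. Because the link command reuses the compiler flags, the -pg option is normally picked up at link time as well, which gprof requires.
# In wmake/rules/linuxGcc/c++Opt
c++OPT = -O3 -pg
# In wmake/rules/linuxGcc/cOpt
cOPT = -O3 -pg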
After these modifications, the bubbleFoam solver needs to be recompiled by running the wmake command in the following directory:
/home/open/OpenFOAM/OpenFOAM-1.6/applications/solvers/multiphase/bubbleFoam
The case itself is located in
/home/open/OpenFOAM/OpenFOAM-1.6/tutorials/multiphase/bubbleFoam/bubbleColumn
and is run by executing blockMesh followed by bubbleFoam. The case takes about 8 minutes to execute.
Now, to reduce the execution time, we needed to make some observations, so the gprof profiler was used to profile the case. From the profile graph we observed that the functions
'H()' - which is located in the fvMatrix.C file - and
Foam::tmp<Foam::fvMatrix<Foam::Vector<double> > > Foam::fvm::div<Foam::Vector<double> >(Foam::GeometricField<double, Foam::fvsPatchField, Foam::surfaceMesh> const&, Foam::GeometricField<Foam::Vector<double>, Foam::fvPatchField, Foam::volMesh>&)
were taking the most time. H() was called about 40,000 times, and the div function consumed a larger share of the total run time.
Running in Parallel:
This case was also run in parallel to create 4 different meshes at a time. There is a dictionary associated with decomposePar, named decomposeParDict, which is located in the system directory of the tutorial case; also, as with many utilities, a default dictionary can be found in the directory of the source code of the specific utility, i.e. in
$FOAM_UTILITIES/parallelProcessing/decomposePar
The first entry is numberOfSubdomains, which specifies the number of subdomains into which the case will be decomposed, usually corresponding to the number of processors available for the case. Here the method of decomposition is simple, and the corresponding simpleCoeffs sub-dictionary should be edited according to the following criteria. The domain is split into pieces, or subdomains, in the x, y and z directions, the number of subdomains in each direction being given by the vector n. As this geometry is two-dimensional, the third direction, z, cannot be split, hence nz must equal 1. The nx and ny components of n split the domain in the x and y directions and must be specified so that nx × ny = numberOfSubdomains. It is beneficial to keep the number of cell faces adjoining the subdomains to a minimum, so for a square geometry it is best to keep the split between the x and y directions fairly even. The delta keyword should be set to 0.001.
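For the four-processor example discussed below, the relevant entries of decomposeParDict would look roughly as follows; the FoamFile header is omitted, and the tutorial's dictionary may contain further optional entries:
numberOfSubdomains 4;

method          simple;

simpleCoeffs
{
    n           (2 2 1);
    delta       0.001;
}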
For example, let us assume we wish to run on 4 processors. We would set numberOfSubdomains to 4 and n = (2, 2, 1). When running decomposePar, we can see from the screen messages that the decomposition is distributed fairly evenly between the processors. The user has a choice of four methods of decomposition, specified by the method keyword as described below.
simple: Simple geometric decomposition, in which the domain is split into pieces by direction, e.g. 2 pieces in the x direction, 1 in y, etc.
hierarchical: Hierarchical geometric decomposition, which is the same as simple except that the user specifies the order in which the directional split is done, e.g. first in the y direction, then in the x direction, etc.
metis: METIS decomposition, which requires no geometric input from the user and attempts to minimise the number of processor boundaries.
manual: Manual decomposition, where the user directly specifies the allocation of each cell to a particular processor.
For each method there is a set of coefficients specified in a sub-dictionary of decompositionDict, named <method>Coeffs, as shown in the dictionary listing. The decomposePar utility is executed in the normal manner by typing
decomposePar
On completion, a set of subdirectories will have been created in the case directory, one for each processor. The directories are named processorN, where N = 0, 1, ... represents a processor number, and each contains a time directory, containing the decomposed field descriptions, and a constant/polyMesh directory containing the decomposed mesh description.
Running a decomposed case:
A decomposed OpenFOAM case is run in parallel using the openMPI implementation of
MPI. openMPI can be run on a local multiprocessor machine very simply but when running on
machines across a network, a file must be created that contains the host names of the machines.
The file can be given any name and located at any path. In the following description we shall refer to such a file by the generic name, including full path, <machines>.
The <machines> file contains the names of the machines, listed one machine per line. The names must correspond to fully resolved hostnames in the /etc/hosts file of the machine on which openMPI is run. The list must contain the name of the machine running openMPI.
Where a machine node contains more than one processor, the node name may be followed by the
entry cpu=n where n is the number of processors openMPI should run on that node. For example,
let us imagine a user wishes to run openMPI from machine machine1 on the following machines:
machine1; machine2, which has 2 processors; and machine3.
The <machines> file would contain:
machine1
machine2 cpu=2
machine3
An application is run in parallel using mpirun.
mpirun --hostfile <machines> -np <nProcs> <foamExec> <otherArgs> -parallel >
log &
where <nProcs> is the number of processors; <foamExec> is the executable, e.g. icoFoam; and the output is redirected to a file named log.
For example, if icoFoam is run on 4 nodes, specified in a file named machines, on the cavity tutorial in the $FOAM_RUN/tutorials/incompressible/icoFoam directory, then the following command should be executed:
mpirun --hostfile machines -np 4 icoFoam -parallel > log &
The bubbleFoam case ran for about 991 seconds when creating the 4 different meshes.
Chapter 4
Conclusion & Future work
The parallelization process is not easy, as it requires the application to be studied thoroughly. Many applications are written from a sequential execution point of view, which makes them easy to write and test. Porting matrix multiplication to the MPI cluster, which was used as a benchmark, helped us study and understand how to parallelize a sequential program.
The CFD code for conjugate heat transfer cannot be ported to the MPI or grid cluster as it stands because the program is poorly structured. The most efficient way to parallelize it is to rewrite the code. However, we observed that optimizing the code by eliminating dependencies, removing unnecessary references and applying some careful programming can produce better performance.
OpenFOAM is written in a highly object-oriented style, which is very difficult to understand. It is a long-established code base that has undergone continuous enhancement, and a separate group is involved in writing it. The code is already standardized and optimized, and this creates problems in the parallelization process.
Work can be carried out to generalize the process of parallelization: port the bubbleFoam case to the MPI cluster to generate a single mesh, study how to port applications to the grid cluster, and port the OpenFOAM application to it.
A thorough study of the OpenFOAM code will be the next major advancement in parallelizing this application. We tried to parallelize some parts of the case and ran into errors. Even though the case did not run properly after being modified to run in parallel, we did manage to parallelize part of it. A little more time spent on it could produce the desired results.
Chapter 5
References
[1] Joseph D. Sloan. High Performance Linux Clusters with OSCAR, Rocks, openMosix & MPI.
[2] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
[3] Luis Ferreira, Viktors Berstis, Jonathan Armstrong, Mike Kendzierski, Andreas Neukoetter, Masanobu Takagi, Richa Bing-Wo, Adeeb Amir, Ryo Murakawa, Olegario Hernandez, James Magowan, and Norbert Bieberstein. Introduction to Grid Computing with Globus. IBM Redbooks. www.redbooks.ibm.com/redbooks/pdfs/sg246895.pdf
[4] Waseem Ahmed, Ramis M. K., Shamsheer Ahmed, Suma Bhat, and Mohammed Isham. Pre-Parallelization Exercises in Budget-Constrained HPC Projects: A Case Study in CFD. P. A. College of Engineering, Mangalore, India.
[5] www.annauniv.edu/care/soft.htm
[6] www.mcs.anl.gov/mpi/mpich1
[7] www.openfoam.com/docs
[8] www.nus.edu.sg/demo2a.html
[9] www.linuxproblem.org/art_9.html