17th Sep 2007, Andrey Shevel

Computing cluster at NCG

– Introduction
– Past upgrades
– Current state of the cluster
– Problems with the cluster
– Where to find information about the cluster
– Conclusion

Introduction

The cluster appeared at the end of 1999.

– The people who first set up and tuned the cluster: Jerome Lauret and Andrey Shevel.
– Initially there were 33 machines, each with a 500 MHz CPU and 256 MB of main memory, and around 1 TB of disk space (all disks were connected to one RAID controller).
– The main machine was a Digital Alpha server.
– About 25 people were registered in the first year of operation (2000).

Past upgrades

Over time:

– Disk storage has been increased by a factor of 5.
– Computing power has been increased by a factor of at least 3.
– The Alpha server has been retired; the main computer is now an Intel-based server (ram11).
– All file systems are now on separate disk controllers.
– Many other improvements.

All of the above has allowed us to work for many years almost without support. I am proud to report this.

Current state of the cluster

               Nominal   Reality
Machines          34        31
Raid arrays        7         5
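The gap between the nominal and the real numbers in the table above can be expressed as a fraction of working units. The sketch below is only an illustration: it assumes that "Reality" means the number of units actually in working order, and the names `inventory` and `availability` are hypothetical, not part of any cluster software.

```python
# Nominal vs. real counts, copied from the table above.
# Assumption: "Reality" is the number of units actually working.
inventory = {
    "Machines": (34, 31),
    "Raid arrays": (7, 5),
}


def availability(nominal: int, working: int) -> float:
    """Return the fraction of nominal units that are working."""
    return working / nominal


for name, (nominal, working) in inventory.items():
    missing = nominal - working
    share = availability(nominal, working)
    print(f"{name}: {working}/{nominal} working ({share:.0%}), {missing} out of service")
```

Under that assumption, roughly 91% of the machines and 71% of the RAID arrays are in service.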

Computing cluster problems

– Liquid leaks from the floor above.
– The batteries in both UPSs have expired.
– The UPS automatic shutdown procedure is out of order.
– No redundancy for the central machine (this machine was affected by water several times in past years).
– The cluster needs to be checked almost every day (power, water, temperature, etc.).
– No remote access to the consoles of the machines.
– No remote control of electrical power.
– No policies (rules) on how to use the resources of the cluster.

Upcoming upgrades

– First, the cluster has to be physically moved to another place in the same room. (DONE)
– Install all the new machines (9 machines). (in progress)
– Prepare an automatic procedure for installing the software. (in progress)
– Upgrade the version of SL (Scientific Linux) to follow RACF (BNL). (in progress)

Where to find information about the cluster

– General information about the cluster: http://ram3.chem.sunysb.edu/ramdata/news.shtml
– User mailing list archive: https://ram3.chem.sunysb.edu/ramdata-news
– System mailing list archive: https://ram3.chem.sunysb.edu/ramdata-system

The cluster role

I think the role of the cluster is now even greater than at the beginning (more people are interested in how to use the cluster).

For those who need a relatively small amount of computing power, the cluster is enough. For those who need huge computing power on the largest remote clusters, the local cluster is a good gateway to them.

Conclusion

Several steps must be taken to improve the situation:

– Find one or two volunteers to watch the cluster.
– Find a funding agency to which a new request for financial support for the cluster upgrade can be submitted.
– Perhaps we need to discuss how to use the cluster as the department computing facility.