
Stephen Pickles <[email protected]>

http://www.realitygrid.org
http://www.realitygrid.org/TeraGyroid.html

UKLight Town Meeting, NeSC, Edinburgh, 9/9/2004

TeraGyroid

HPC Applications ready for UKLight


The TeraGyroid Project

Funded by EPSRC (UK) & NSF (USA) to join the UK e-Science Grid and US TeraGrid

– application from RealityGrid, a UK e-Science Pilot Project
– 3-month project including work exhibited at SC’03 and SC Global, Nov 2003
– thumbs up from TeraGrid mid-September; funding from EPSRC approved later

The main objective was to deliver high-impact science that would not have been possible without the combined resources of the US and UK grids

Study of defect dynamics in liquid crystalline surfactant systems using lattice-Boltzmann methods

– featured the world’s largest lattice-Boltzmann simulation
– 1024^3-cell simulation of the gyroid phase demands terascale computing
  • hence “TeraGyroid”


Networking

[Diagram: data flows between HPC engines, a visualization engine and storage. The flows are checkpoint files, visualization data, compressed video, and steering control and status messages, carried over realtime UDP, near-realtime TCP and non-realtime TCP.]
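As a rough illustration of the transport split in the diagram, here is a minimal POSIX-sockets sketch in C++: lossy but timely UDP for the realtime compressed video, reliable TCP for the bulk (non-realtime) checkpoint traffic. The addresses, ports and payloads are hypothetical, and this is not the project's actual transfer stack, which used Grid middleware such as GridFTP and Globus-IO.

```cpp
// Sketch only: UDP for realtime video frames, TCP for bulk checkpoint data.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    sockaddr_in dest{};
    dest.sin_family = AF_INET;
    dest.sin_port = htons(5004);                       // hypothetical video port
    inet_pton(AF_INET, "192.0.2.10", &dest.sin_addr);  // documentation address

    // Realtime flow: fire-and-forget UDP datagrams, occasional losses tolerated.
    int udp = socket(AF_INET, SOCK_DGRAM, 0);
    const char frame[] = "compressed video frame";
    sendto(udp, frame, sizeof frame, 0,
           reinterpret_cast<sockaddr*>(&dest), sizeof dest);
    close(udp);

    // Non-realtime flow: reliable TCP stream, every checkpoint byte must arrive.
    dest.sin_port = htons(6000);                       // hypothetical bulk port
    int tcp = socket(AF_INET, SOCK_STREAM, 0);
    if (connect(tcp, reinterpret_cast<sockaddr*>(&dest), sizeof dest) == 0) {
        const char chunk[] = "checkpoint file bytes";
        send(tcp, chunk, sizeof chunk, 0);
    }
    close(tcp);
    return 0;
}
```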


LB3D: 3-dimensional Lattice-Boltzmann simulations

LB3D code is written in Fortran90 and parallelized using MPI

Scales linearly on all available resources (Lemieux, HPCx, CSAR, Linux/Itanium II clusters)

Data produced during a single run ranges from hundreds of gigabytes to terabytes (rough estimate below)

Simulations require supercomputers

High-end visualization hardware (e.g. SGI Onyx, dedicated viz clusters) and parallel rendering software (e.g. VTK) needed for data analysis

3D datasets showing snapshots from a simulation of spinodal decomposition: a binary mixture of water and oil phase-separates. Blue areas denote high water densities; red areas visualize the interface between the two fluids.
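A back-of-envelope check of these volumes, assuming double-precision values and, for the checkpoint, a D3Q19 lattice with three fluid components (the slides do not state the exact model):

```latex
% One scalar output field on a 1024^3 grid:
\[ 1024^{3}\ \text{sites} \times 8\ \text{B} \approx 8.6\ \text{GB per field dump,} \]
% so a handful of fields written at many time steps reaches hundreds of GB.
% A full restart checkpoint must also store the distribution functions:
\[ 1024^{3} \times 3\ \text{components} \times 19\ \text{velocities} \times 8\ \text{B}
   \approx 0.49\ \text{TB,} \]
% consistent with the ~0.5 TB checkpoint figure quoted later for the
% 1024^3 run on Lemieux.
```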


Computational Steering of Lattice Boltzmann Simulations

LB3D instrumented for steering using the RealityGrid steering library.

Malleable checkpoint/restart functionality allows ‘rewinding’ of simulations and run-time job migration across architectures.

Steering reduces storage requirements because the user can adapt data dumping frequencies.

CPU time can be saved because users do not have to wait for jobs to finish if they can already see that nothing relevant is happening.

Instead of doing “task farming”, parameter searches are accelerated by “steering” through parameter space.

Analysis time is significantly reduced because less irrelevant data is produced.

Applied to the study of the gyroid mesophase of amphiphilic liquid crystals at unprecedented space and time scales
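The shape of such an instrumented main loop, as a minimal C++ sketch. The steering interface here (SteerCmd, steering_poll, emit_status) is a hypothetical stand-in, not the RealityGrid steering library's API; it only illustrates where monitoring, adaptive output, checkpointing and early termination hook into the time-step loop.

```cpp
// Illustrative sketch of a steered simulation loop (hypothetical interface).
#include <cstdio>

enum class SteerCmd { None, DumpNow, Checkpoint, Stop };

// Stubs: a real steered code would exchange these messages with a remote
// steering client via the steering library.
SteerCmd steering_poll(int /*step*/, int& /*dump_interval*/) { return SteerCmd::None; }
void emit_status(int step, double value) { std::printf("step %d: %g\n", step, value); }
void write_fields(int step)     { std::printf("dumping fields at step %d\n", step); }
void write_checkpoint(int step) { std::printf("checkpoint at step %d\n", step); }

int main() {
    int dump_interval = 100;            // steerable: the user can change this at run time
    for (int step = 0; step < 1000; ++step) {
        // ... advance the lattice-Boltzmann state by one time step here ...
        double order_parameter = 0.0;   // placeholder for a monitored diagnostic

        emit_status(step, order_parameter);                // monitoring
        SteerCmd cmd = steering_poll(step, dump_interval); // control from the client

        if (cmd == SteerCmd::DumpNow || step % dump_interval == 0)
            write_fields(step);          // adaptive output frequency saves storage
        if (cmd == SteerCmd::Checkpoint)
            write_checkpoint(step);      // basis for rewind and job migration
        if (cmd == SteerCmd::Stop)
            break;                       // end a run that is no longer interesting
    }
    return 0;
}
```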


Parameter space exploration

Initial condition: random water/surfactant mixture.

Self-assembly starts.

Rewind and restart from checkpoint.

Lamellar phase: surfactant bilayers between water layers.

Cubic micellar phase, low surfactant density gradient.

Cubic micellar phase, high surfactant density gradient.


Strategy

Aim: use federated resources of US TeraGrid and UK e-Science Grid to accelerate scientific process

Rapidly map out parameter space using a large number of independent “small” (128^3) simulations

– use job cloning and migration to exploit available resources and save equilibration time

Monitor their behaviour using on-line visualization

Hence identify parameters for high-resolution simulations on HPCx and Lemieux
– 1024^3 on Lemieux (PSC): takes 0.5 TB to checkpoint!
– create initial conditions by stacking smaller simulations with periodic boundary conditions (see the sketch after this list)

Selected 128^3 simulations were used for long-time studies

All simulations were monitored and steered by a geographically distributed team of computational scientists
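A minimal sketch of the stacking step, assuming a single scalar field stored in x-fastest order (LB3D's real multi-component checkpoint layout is not reproduced here): a small periodic box is replicated to fill the large grid, and the periodic boundary conditions of the small run keep the field continuous across tile seams.

```cpp
// Sketch: build a large periodic initial condition by tiling a smaller box.
#include <cstddef>
#include <vector>

std::vector<double> tile_periodic(const std::vector<double>& small, int n, int factor) {
    const int N = n * factor;                        // e.g. 128 * 8 = 1024
    std::vector<double> big(static_cast<std::size_t>(N) * N * N);
    for (int z = 0; z < N; ++z)
        for (int y = 0; y < N; ++y)
            for (int x = 0; x < N; ++x) {
                // Periodicity of the small box makes the tiled field continuous.
                const std::size_t src =
                    (static_cast<std::size_t>(z % n) * n + y % n) * n + x % n;
                const std::size_t dst =
                    (static_cast<std::size_t>(z) * N + y) * N + x;
                big[dst] = small[src];
            }
    return big;
}

int main() {
    // Demo at a small size; the TeraGyroid case tiled 128^3 up to 1024^3,
    // which needs terascale memory rather than a workstation.
    std::vector<double> box(16 * 16 * 16, 1.0);      // an equilibrated small field
    auto big = tile_periodic(box, 16, 4);            // 64^3 initial condition
    return big.size() == 64ull * 64 * 64 ? 0 : 1;
}
```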


The Architecture of Steering

[Diagram: the simulation and the visualization components each link against the steering library and publish a Steering Grid Service (GS) to a Registry; steering clients find the services in the Registry, bind to them and connect. Data transfer between simulation and visualization goes over Globus-IO, and one or more displays attach to each visualization.]

components start independently and attach/detach dynamically

remote visualization through SGI VizServer, Chromium, and/or streamed to Access Grid

multiple clients: Qt/C++, .NET on PocketPC, GridSphere Portlet (Java)

OGSI middle tier

• Computations run at HPCx, CSAR, SDSC, PSC and NCSA
• Visualizations run at Manchester, UCL, Argonne, NCSA, Phoenix
• Scientists at 4 sites steer calculations, collaborating via Access Grid
• Visualizations viewed remotely
• Grid services run anywhere
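The publish/find/bind pattern in the diagram, as a minimal in-process C++ sketch. Registry and ServiceHandle are hypothetical stand-ins for the OGSI registry and Steering Grid Services, not the actual middleware interfaces; the point is only that components register themselves on start-up and clients attach later by looking them up.

```cpp
// Sketch of publish / find / bind used to attach a steering client.
#include <iostream>
#include <map>
#include <optional>
#include <string>

struct ServiceHandle { std::string endpoint; };

class Registry {
    std::map<std::string, ServiceHandle> entries_;
public:
    void publish(const std::string& name, ServiceHandle h) { entries_[name] = h; }
    std::optional<ServiceHandle> find(const std::string& name) const {
        auto it = entries_.find(name);
        if (it == entries_.end()) return std::nullopt;
        return it->second;
    }
};

int main() {
    Registry registry;

    // The simulation publishes its Steering Grid Service when it starts.
    registry.publish("lb3d-gyroid-run-42", ServiceHandle{"https://example.org/steer/42"});

    // A steering client, started independently, finds the service and binds to it.
    if (auto handle = registry.find("lb3d-gyroid-run-42")) {
        std::cout << "binding to " << handle->endpoint << "\n";
        // ... attach, receive monitored parameters, send control commands ...
    }
    return 0;
}
```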


SC Global ’03 Demonstration


TeraGyroid Testbed

[Map: TeraGyroid testbed. Visualization and computation resources at Manchester, Daresbury and UCL in the UK and at PSC, ANL, NCSA, SDSC, Caltech and Phoenix in the US, interconnected via Starlight (Chicago) and Netherlight (Amsterdam) over BT-provisioned circuits, with UK connectivity over SJ4 and MB-NG. Legend: network PoPs, Access Grid nodes, service registry, production network, dual-homed systems; link capacities of 10 Gbps and 2 x 1 Gbps.]


Trans-Atlantic Network

Collaborators: Manchester Computing, Daresbury Laboratory Networking Group, MB-NG and UKERNA, UCL Computing Service, BT, SURFnet (NL), Starlight (US), Internet-2 (US)


TeraGyroid: Hardware Infrastructure

Computation (using more than 6000 processors) including:
– HPCx (Daresbury), 1280 procs IBM Power4 Regatta, 6.6 Tflops peak, 1.024 TB memory
– Lemieux (PSC), 3000 procs HP/Compaq, 3 TB memory, 6 Tflops peak
– TeraGrid Itanium2 cluster (NCSA), 256 procs, 1.3 Tflops peak
– TeraGrid Itanium2 cluster (SDSC), 256 procs, 1.3 Tflops peak
– Green (CSAR), SGI Origin 3800, 512 procs, 0.512 TB memory (shared)
– Newton (CSAR), SGI Altix 3700, 256 Itanium 2 procs, 384 GB memory (shared)

Visualization:
– Bezier (Manchester), SGI Onyx 300, 6x IR3, 32 procs
– Dirac (UCL), SGI Onyx 2, 2x IR3, 16 procs
– SGI loan machine, Phoenix, SGI Onyx, 1x IR4, 1x IR3, commissioned on site
– TeraGrid Visualization Cluster (ANL), Intel Xeon
– SGI Onyx (NCSA)

Service Registry:
– Frik (Manchester), Sony Playstation2

Storage:
– 20 TB of science data generated in project
– 2 TB moved to long-term storage for on-going analysis: Atlas Petabyte Storage System (RAL)

Access Grid nodes at Boston University, UCL, Manchester, Martlesham, Phoenix (4)


Network lessons

Less than three weeks to debug networks
– applications people and network people nodded wisely but didn’t understand each other
– middleware such as GridFTP is infrastructure to applications folk, but an application to network folk
– rapprochement necessary for success

Grid middleware not designed with dual-homed systems in mind
– HPCx, CSAR (Green) and Bezier are busy production systems
– had to be dual-homed on SJ4 and MB-NG
– great care with routing
– complication: we needed to drive everything from laptops that couldn’t see the MB-NG network

Many other problems encountered
– but nothing that can’t be fixed once and for all given persistent infrastructure


Measured Transatlantic Bandwidths during SC’03


TeraGyroid: Summary

Real computational science...
– Gyroid mesophase of amphiphilic liquid crystals
– Unprecedented space and time scales
– investigating phenomena previously out of reach

...on real Grids...
– enabled by high-bandwidth networks

...to reduce time to insight

[Figures: interfacial surfactant density; dislocations.]


TeraGyroid: Collaborating Organisations

Our thanks to hundreds of individuals at:...

Argonne National Laboratory (ANL)
Boston University
BT
BT Exact
Caltech
CSC
Computing Services for Academic Research (CSAR)
CCLRC Daresbury Laboratory
Department of Trade and Industry (DTI)
Edinburgh Parallel Computing Centre
Engineering and Physical Sciences Research Council (EPSRC)
Forschungszentrum Juelich
HLRS (Stuttgart)
HPCx
IBM
Imperial College London
National Center for Supercomputing Applications (NCSA)
Pittsburgh Supercomputer Center
San Diego Supercomputer Center
SCinet
SGI
SURFnet
TeraGrid
Tufts University, Boston
UKERNA
UK Grid Support Centre
University College London
University of Edinburgh
University of Manchester

http://www.realitygrid.org
http://www.realitygrid.org/TeraGyroid.html

The TeraGyroid Experiment

S. M. Pickles1, R. J. Blake2, B. M. Boghosian3, J. M. Brooke1, J. Chin4, P. E. L. Clarke5, P. V. Coveney4, N. González-Segredo4, R. Haines1, J. Harting4, M. Harvey4, M. A. S. Jones1, M. Mc Keown1, R. L. Pinning1, A. R. Porter1, K. Roy1, and M. Riding1.

1. Manchester Computing, University of Manchester
2. CLRC Daresbury Laboratory, Daresbury
3. Tufts University, Massachusetts
4. Centre for Computational Science, University College London
5. Department of Physics & Astronomy, University College London

New Application at AHM2004

Philip Fowler, Peter Coveney, Shantenu Jha and Shunzhou Wan

UK e-Science All Hands Meeting, 31 August – 3 September 2004

“Exact” calculation of peptide-protein binding energies by steered thermodynamic integration using high-performance computing grids.


Why are we studying this system?

Measuring binding energies is vital, e.g. for designing new drugs.

Calculating a peptide-protein binding energy can take weeks to months.

We have developed a grid-based method to accelerate this process

To compute ΔGbind during the AHM 2004 conference, i.e. in less than 48 hours

Using federated resources of UK National Grid Service and US TeraGrid


Thermodynamic Integration on Computational Grids

[Workflow diagram: from a starting conformation, successive simulations are seeded at a series of coupling-parameter values λ (0.1, 0.2, 0.3, ..., 0.9; 10 simulations, each 2 ns). Steering is used to launch, spawn and terminate the jobs, and each λ window runs as an independent job on the Grid. ∂H/∂λ is monitored over time and checked for convergence, then the windows are combined to calculate the integral.]
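The quantity being assembled is the standard thermodynamic-integration estimate of the binding free energy; the quadrature shown (trapezoidal rule over the λ windows) is an assumption, since the slide does not state which estimator was used.

```latex
\[
  \Delta G \;=\; \int_{0}^{1}
    \Bigl\langle \frac{\partial H(\lambda)}{\partial \lambda} \Bigr\rangle_{\lambda}\, d\lambda
  \;\approx\; \sum_{i} \frac{\lambda_{i+1}-\lambda_{i}}{2}
    \left[ \Bigl\langle \frac{\partial H}{\partial \lambda} \Bigr\rangle_{\lambda_{i}}
         + \Bigl\langle \frac{\partial H}{\partial \lambda} \Bigr\rangle_{\lambda_{i+1}} \right]
\]
% Each average comes from one of the independent 2 ns simulations at fixed
% lambda, which is what makes the windows trivially parallel across the Grid.
```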


[Screenshot: steering client, annotated with monitoring, checkpointing, and steering and control.]


We successfully ran many simulations…

This is the first time we have completed an entire calculation.
– Insight gained will help us improve the throughput.

The simulations were started at 5pm on Tuesday and the data was collated at 10am Thursday.

26 simulations were run

At 4.30pm on Wednesday, we had nine simulations in progress (140 processors)
– 1x TG-SDSC, 3x TG-NCSA, 3x NGS-Oxford, 1x NGS-Leeds, 1x NGS-RAL

We simulated over 6.8ns of classical molecular dynamics in this time
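A rough aggregate throughput implied by these figures (5pm Tuesday to 10am Thursday is about 41 hours of wall-clock time):

```latex
\[
  \frac{6.8\ \text{ns}}{41\ \text{h}} \approx 0.17\ \text{ns/hour}
  \approx 4\ \text{ns/day aggregated across the federated resources}
\]
```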


Very preliminary results

[Plot: “Thermodynamic Integrations”, dE/dλ versus λ for dppo; dE/dλ spans roughly -200 to 400 over λ from 0 to 1.]

We expect our value to improve with further analysis around the endpoints.

ΔG (kcal/mol):
– Experiment: -1.0 ± 0.3
– “Quick and dirty” analysis (as at 41 hours): -9 to -12


Conclusions

We can harness today’s grids to accelerate high-end computational science

On-line visualization and job migration require high bandwidth networks

Need persistent network infrastructure
– else set-up costs are too high

QoS: would like the ability to reserve bandwidth
– and processors, graphics pipes, AG rooms, virtual venues, nodops... (but that’s another story)

Hence our interest in UKLight