Summary of “Towards Petaflops” Workshop, 24—28 May 1999
A D Kennedy, D C Heggie, S P Booth
The University of Edinburgh
Programme (Monday 24th – Friday 28th May)

Daily themes:
– Monday 24th May: Computational Chemistry
– Tuesday 25th May: Particle Physics & Astronomy
– Wednesday 26th May: Biological Sciences
– Thursday 27th May: Materials, Soft & Hard
– Friday 28th May: Meteorology & Fluids

09:00–09:30 Registration
09:30 Dr N H Christ (Columbia University, USA); Dr J M Levesque (IBM Research, USA); Dr G Ackland (University of Edinburgh); Dr P Burton (UKMO, Bracknell)
10:00 Welcome: Prof A D Kennedy
10:15 Tea/coffee
10:30 Prof R Catlow (Royal Institution, London); Prof R D'Inverno (University of Southampton); Dr M F Guest (CLRC, Daresbury); Dr M Payne (Cambridge University); Dr T Arber (University of St Andrews)
11:15 Tea/coffee
11:30 Prof E De Schutter (University of Antwerp, Belgium); Dr R Tripiccione (INFN-Pisa, Italy); Dr N Topham (University of Edinburgh); Dr P Coveney (Queen Mary & Westfield College, London); Dr M Ruffert (University of Edinburgh)
12:30 Lunch
13:30 Discussion session (chair: Prof R Catlow); Dr A H Nelson (University of Cardiff); Prof M van Heel (Imperial College, London); Dr D Roweth (QSW, Bristol); Discussion session (chair: Dr T Arber)
14:15 Dr L G Pedersen (University of North Carolina, USA); Prof D C Heggie (University of Edinburgh); Mr M Woodacre (SGI/Cray); Dr M F O'Boyle (University of Edinburgh)
15:00 Tea/coffee
15:30 Dr M Wilson (University of Durham); Discussion session (chair: Prof N Christ); Discussion session (chair: Dr J Bard); Discussion session (chair: Dr M Payne)
16:15 Dr A R Jenkins (University of Durham); Dr C M Reeves (University of Edinburgh)
17:00 Close
Participants
Dr Graham Ackland — graeme@holyrood
Prof Tony Arber — tda@dcs.st-and.ac.uk
Dr Jonathan Bard — J.Bard@ed.ac.uk
Dr Stephen Booth — spb@epcc.ed.ac.uk
Dr Ken Bowler — kcb@ph.ed.ac.uk
Dr Paul Burton — pmburton@meto.gov.uk
Prof Mike Cates — m.cates@ph.ed.ac.uk
Prof Richard Catlow — richard@ricx.ri.ac.uk
Prof Norman Christ — nhc@phys.columbia.edu
Prof Peter Coveney — p.v.coveney@qmw.ac.uk
Prof Ray d'Inverno — rdi@maths.soton.ac.uk
Prof Erik De Schutter — erik@bbf.uia.ac.be
Dr Paul Durham — p.j.durham@dl.ac.uk
Mr Dietland Gerloff — gerloff@darwin.bru.ed.ac.uk
Mr Simon Glover — scog@roe.ac.uk
Dr Bruce Graham — B.Graham@ed.ac.uk
Dr Martyn F Guest — m.f.guest@dl.ac.uk
Prof Douglas C Heggie — d.c.heggie@ed.ac.uk
Dr Suhail A Islam — islam@icrf.icnet.uk
Dr Adrian R Jenkins — A.R.Jenkins@durham.ac.uk
Mr Bruce Jones — bruce@uk.nec-ess.com
Mr Balint Joo — bj@ph.ed.ac.uk
Prof Anthony D Kennedy — adk@ph.ed.ac.uk
Prof Richard Kenway — rdk@ph.ed.ac.uk
Dr Crispin Kneble — crck@sguk.reading.sgi.co.uk
Mr John M Levesque — levesque@watson.ibm.com
Dr Nick Maclaren — nmm1@cam.ac.uk
Mr Rod McAllister — rod@falkirkd.sgi.com
Dr Avery Meiksin — aam@roe.ac.uk
Dr Alistair Nelson — nelsona@cardiff.ac.uk
Dr Mike O'Boyle — mob@dcs.ed.ac.uk
Dr John Parkinson — johnny@chem.ed.ac.uk
Dr Mike C Payne — mcp1@phy.cam.ac.uk
Dr Lee G Pederson — pedersen@shabam.niehs
Dr Nilesh Raj — n-raj@hpcc.hitachi-eu.co.uk
Dr Federico Rapuano — Federico.Rapuano@roma1
Dr Clive M Reeves — c.m.reeves@ee.ed.ac.uk
Dr Duncan Roweth — duncan@quadrics.com
Dr Max Ruffert — max@maths.ed.ac.uk
Mr Vance Shaffer — vshaffer@us.ibm.com
Dr Doug Smith — douglas@epcced.ac.uk
Mr Philip Snowdon — philips@epcc.ed.ac.uk
Dr Nigel Topham — npt@dcs.ed.ac.uk
Dr Arthur Trew — arthur@epcc.ed.ac.uk
Dr Raffaele Tripiccione — lele@galileo.pi.infn.it
Prof Marin Van Heel — m.vanheel@ic.ac.uk
Mr Claudio Verdozzi — verdozzi@ph.ed.ac.uk
Dr Mark Wilson — mark.wilson@durham.ac.uk
Mr Stuart Wilson — stuart@epcc.ed.ac.uk
Mr Michael Woodacre — woodacre@reading.sgi.co.uk
Dr Andrea Zavanella — az@dcs.ed.ac.uk
Introduction
Objectives of this summary
– Summarise areas of general agreement
– Highlight areas of uncertainty or disagreement
– Concentrate on technology, architecture, & organisation
– For details of the science which might be done, see the slides of the individual talks
This summary expresses the views & opinions of its authors
– It does not necessarily represent a consensus or majority view...
– … but it tries to do so as far as possible
Devices & Hardware
Silicon CMOS will continue to dominate
– GaAs is still tomorrow's technology (and always will be?)
Moore's Law
– Performance increases exponentially, with a doubling time of 18 months
– Will continue for at least 5 years, and probably more
Trade-offs between density & speed
– Gb DRAM and GHz CPU in O(5 years) ...
– … but not both on the same chip
– Choice between speed & power
– 10 transistors per device by 2005, 10 by 2012
The most cost-effective technology is usually a generation behind the latest technology
Memory latency will increase
– More levels of cache hierarchy
– Implies a tree-like hierarchy of access speeds
– Not clear how scientific HPC applications map onto this
– Access to memory is becoming relatively as slow as access to a remote processor's cache
– Understanding of the memory architecture is required to achieve optimal performance (analogous to the use of virtual memory)
In the fairly near future
– Arithmetic will be almost free
– Pay for memory & communications bandwidth
Technology driven by the mass market
– Commodity parts
– "Intercepting technology": systems designed to use technology currently under development
– Cost & risk of the newest generation v. performance benefit
– "Sweet point" on the technology curve
PCs, workstations, DSPs … not designed for HPC
Level of integration
– HPC vendors will move from board-level to chip-level design
– Cost effective to produce O(10³—10) chips
– Silicon compilers
– Time scale?
Error rates will increase
– Fault tolerance required
– Implications for very large systems?
– Time scale?
Disks & I/O
– Increasing density
– Decreasing cost/bit
– Increasing relative latency
Architecture
Memory addressing
Flat memory (implicit communications)
– The model that naïve users want
– Does not really exist in hardware
– Dynamic hardware coherency mechanisms seem unlikely to work well enough in practice
Distributed memory (explicit communications)
– NUMA
– Protocols
• MPI, OpenMP,…
• SHMEM
– Scientific problems usually have a simple static communications structure, easily handled by get and put primitives
Single node performance
Fat nodes or thin nodes?
– Limited by communication network bandwidth?
– Limited by memory bandwidth (off-chip access)?
– "Sweet point" on the technology curve
Single node architecture
– VLIW
– Vectors
– Superscalar
– Multiple CPUs on a chip
Communications
What network topology?
– 2d, 3d, or 4d grid network, butterfly, hypercube, fat tree
– Crossbar switch
Bandwidth
Latency
– A major problem for coarse-grain machines
Packet size
– A problem for very fine-grain machines
MPP
Scalable for the right kind of problems
– up to technological limits
Commercial interconnects
– e.g., from QSW (http://www.quadrics.com/)
Flexibility v. price/performance
– Custom networks for a few well-understood problems which require high-end performance (e.g., QCD)
– More general networks for large but more general-purpose machines
SMP clusters
Limited scalability?
Appears to be what vendors want to sell to us
– IBM, SGI, Compaq,…
– Large market for general-purpose SMP machines
– Adding a cluster interconnect is cheap
Unclear whether large-scale scientific problems map well onto the tree-like effective network topology
How do we program such machines?
PC or workstation clusters
– Beowulf, Avalon, ...
– Cheap, but not tested for large machines
– Communication mechanisms unspecified
– "Farms" of PCs are a very cost-effective way to provide large capacity
Static v. dynamic scheduling
– Static (compiler) instruction scheduling is more appropriate than dynamic (hardware) scheduling for most large scientific applications
Languages & Tools
Efficiency
– New languages will not be widely used for HPC unless they can achieve performance comparable with low-level languages (assembler, C, Fortran)
Portability
– to different parallel architectures
– to the next generation of machines
– to different vendors' architectures
Reusability
– Object-oriented programming
– Current languages were not designed for HPC (C++, Java, …)
Optimisation
Compilers can handle local optimisation well
– register allocation
– instruction scheduling
Global optimisation will not be automatic
– choice of algorithms
– data layout
– memory hierarchy management
– re-computation v. memory use
– could be helped by better languages & tools
How to get scientists & engineers to use new languages & tools?
– Performance must be good enough
– Compilers & tools must be widely available
– Compilers & tools must be reliable
– Documentation & training
A new generation of scientists with interest and expertise in both computer science and applications is required
– Encouragement to work & publish in this area
– Usual problems for interdisciplinary work
• credit for software written is not on a par with publications
Models & Algorithms
Disciplines with simple well-established methods
– Models are "exact"
– Methods are well understood & stable
– Errors are under control (at least as well as for experiments)
– Leading-edge computation required for international competitiveness
Examples:
– Particle physics (QCD)
– Astronomy (N-body)
Disciplines with complex models
– Approximate models used for small-scale physics
– Is reliability limited by
• sophistication of the underlying models?
• scale of computation?
• availability of data for initial or boundary conditions?
– Many different calculations for different systems (capacity v. capability issues)
Examples:
– Meteorology
– Materials
Reliance on packages
– Commercial packages are not well tuned for large parallel machines
• algorithms may need changing
– The community is resistant to changing to new packages or to writing its own systems
Examples:
– Chemistry
– Engineering
Exploration
– HPC not widely used
– Access to machines and expertise is a big hurdle
– Exciting prospects for future progress
– Algorithms and models need development
Examples:
– Biology
Access & Organisation
Bespoke machines
– The best solution for a few special areas
– QCDSP, APE
Special-purpose machines
– GRAPE
Performance versus cost
[Figure omitted: performance v. cost of various machines; slide courtesy of Norman Christ (Columbia University). Diagonal lines are lines of fixed cost; note the dates of the various machines.]
Commercial machines
SMP
– Convenient, easy to use, but not very powerful
– Good for capacity as opposed to capability
SMP clusters
– Unclear how effective they are for large-scale problems
– Unclear how they will be programmed
Commercial interconnects (QSW,…)
Capacity v. capability
– A large machine is required to get the "final point on the graph"
– Cost-effectiveness
– International competitiveness
Shared v. dedicated resources
Systems management costs
– advantages of shared resources
• central large-scale data store
• backups
– disadvantages
• more reasonable requests if users have to pay for their implementation
• tendency for centres to invent software projects which are not the users' highest priority
Dedicated machines for consortia
Flexibility in scheduling
– Do not have to prioritise projects in totally different subjects
– Users know & can negotiate with each other
Ease of access for experimental projects
– consortia can be more flexible at allocating resources for promising new approaches
Sponsors
The workshop was supported by
– The University of Edinburgh Faculty of Science & Engineering
– Hitachi
– IBM
Glossary
API: Application Program Interface
– documented interface to a software subsystem so that its facilities can be used by application (user) programs
Butterfly Network: Network topology which allows the "perfect shuffle" required for FFTs to be carried out in parallel
– equivalent to Ω network, Fat tree, and Hypercube
Cache: Fast near-processor memory (usually SRAM) which contains a copy of the contents of parts of main memory
– often separate instruction & data caches
– commonly organised as a hierarchy of larger caches of increasing latency
– data is automatically fetched from memory when needed if it is not already in the cache
– an entire cache "line" is moved from/to memory even if only part of it is required/modified
– data is written back to memory when the cache "line" is needed for data from some other memory address, or when someone else needs the new value of the data
– one (direct mapped) or several (set associative) cache "lines" can be associated with a given memory address
Capability: The ability to solve one big problem in a given time
Capacity: The ability to solve many small problems in a given time
CISC: Complex Instruction Set Computer
– instructions combine memory and arithmetic operations
– instructions are not of uniform size or duration
– often implemented using microcode
CMOS: Complementary Metal Oxide Semiconductor
– by far the most widely used VLSI technology at present
– VLSI technology in which both p-type and n-type FETs are used
– the design methodology is that each output is connected by a low-impedance path to either the source or drain voltage, hence low static power dissipation: power is dissipated only when switching states
– NMOS technology requires fewer fabrication steps, but draws more current and is no longer in common use
– BiCMOS allows the construction of both FET and bipolar transistors on the same chip
• requires more fabrication steps and therefore a larger minimum feature size (and thus lower density) for an acceptable yield
• bipolar transistors can drive more current (for a given size and delay) than FETs
Coherency: Means of ensuring that the copies of data in memory and caches are consistent
– the illusion that a single value is associated with each address
Crossbar: Network topology allowing an arbitrary permutation in a single operation
Data Parallel: Programming model in which all nodes carry out the same operation on different data simultaneously
– may be implemented using SIMD or MIMD architectures
Delayed branches: several instructions following a branch are unconditionally executed before the branch is taken, allowing the instruction pipeline to remain filled
DRAM: Dynamic RAM
– each bit is stored as charge on a capacitor, accessed through an FET
– only one transistor is required to store each bit
– DRAM needs to be refreshed (read and rewritten) every few ms, before the charge leaks away
DSP: Digital Signal Processor
– low cost
– low power
– no cache
– used for embedded devices
Dynamic Scheduling: The order in which instructions are issued is determined on the basis of current activity
– dynamic branch prediction: instructions are prefetched along the path taken the last few times through this branch
– scoreboarding: instructions are delayed until the resources required (e.g., registers) are free
ECC: Error Correcting Codes
– mechanism for correcting bit errors in DRAM by using, e.g., Hamming codes
Fat Nodes: Fast & large processor nodes in a multiprocessor machine
– allow a relatively large sub-problem to live on each node
– permit coarse-grained communications (larger packets, but fewer of them)
– memory bandwidth is a potential problem
Fat Tree: Network topology which allows the "perfect shuffle" required for FFTs to be carried out in parallel
– equivalent to Butterfly, Ω network, and Hypercube
FET: Field Effect Transistor
– transistor in which the channel (source to drain) impedance is controlled by the charge applied to the gate
– no current flows from the gate to the channel
– c.f., a bipolar transistor, in which a current flows through the base from the emitter to the collector
FFT: Fast Fourier Transform
– O(n log n) algorithm for taking the Fourier transform of n data values
GaAs: Gallium Arsenide
– a semiconductor whose band-gap structure allows faster switching times than those for Si (silicon)
– fabrication technology lags that for Si
– larger minimum feature size for an acceptable yield
– VLSI speed is limited by path length, not by switching times
Generation: a level of technology used for chip fabrication, usually measured by the minimum feature size
GRAPE: special-purpose machine for solving the n-body problem
Hypercube: Network topology which allows the "perfect shuffle" required for FFTs to be carried out in parallel
– nodes live on the vertices of a d-dimensional hypercube
– communications links are the edges of the hypercube
– equivalent to Butterfly, Ω network, and Fat tree
Instruction prefetch: The ability to fetch & decode an instruction while the previous instruction is still executing
– allows "pipelining" of instruction execution stages
– requires special techniques to deal with conditional branches, where it is not yet known which instruction is "next"
Latency: The time between issuing a request for data and receiving it
Microcode: sequence of microinstructions used to implement a single machine instruction
– similar functionality to having simpler (RISC) instructions with an instruction cache, but less flexible
– stored in ROM
– reduces the complexity of the processor logic
– some similarities
• "vertical" microcode resembles RISC
• "horizontal" microcode resembles VLIW
MPI: Message Passing Interface
MPP: Massively Parallel Processor
– a collection of processor nodes, each with its own local memory
– explicit distributed memory architecture
– nodes connected by a network of some regular topology
MIMD: Multiple Instruction Multiple Data
– architecture in which each processor has its own instruction and data streams
NUMA: Non-Uniform Memory Access
Object-Oriented Programming: Programming paradigm in which data and procedures are encapsulated into objects
– objects are defined by their interfaces, i.e., what they do and not how they do it
– the way the data is represented within an object, and the way the methods which manipulate it are implemented, are hidden from the rest of the program
Ω Network: Network topology which allows the "perfect shuffle" required for FFTs to be carried out in parallel
– equivalent to Butterfly, Fat tree, and Hypercube
OpenMP: Shared-memory parallel programming API (compiler directives & library routines), not message passing
– See http://www.openmp.org for details
Packet Size: The amount of data that can (or has to) be transferred as an atomic unit
– there is an overhead associated with each packet (framing, headers,…)
– packet size must be small for a fine-grained machine, otherwise the available bandwidth cannot be used to send useful data
QCD: Quantum ChromoDynamics
– theory of the strong interaction, by which strongly interacting elementary particles are built from quarks and gluons
– non-perturbative QCD calculations use Monte Carlo methods on a lattice discretisation of space-time
RAM: Random Access Memory
RISC: Reduced Instruction Set Computer
– arithmetic operations only act on registers
– instructions are of uniform length and duration
• this rule is almost always violated by the inclusion of floating-point instructions
– memory access, instruction decoding, and arithmetic can be carried out by separate units in the processor
ROM: Read Only Memory
SDRAM: Synchronous DRAM
– a DRAM chip protocol allowing more overlap of memory access operations
SECDED: Single Error Correction, Double Error Detection
– the most common form of ECC used for DRAM memories
– usually uses 7 syndrome bits for 32-bit words or 8 syndrome bits for 64-bit words
Silicon Compilers: Translators from a Hardware Description Language (such as VHDL) into a set of masks from which a (semi-)custom chip can be made
SIMD: Single Instruction Multiple Data
– architecture in which all processors execute the same instruction at the same time on different data
– instructions are usually broadcast from a single copy of the program
– processors run in lock-step
SMP: Symmetric MultiProcessor
– set of processors connected to a shared memory by a common bus or switch
SRAM: Static RAM
– bit value stored as the state of a bistable "flip-flop"
– six transistors required per bit
– used for registers and on-chip caches
Static Scheduling: Scheduling of instructions (usually by a compiler) to optimise performance
– does not rely on hardware to analyse the dynamic behaviour of the program
– may make use of knowledge of the "average" behaviour of a program obtained from profiling
Superscalar: architecture having several functional units which can carry out several operations simultaneously
– load/store
– integer arithmetic
– branch
– floating-point pipelines
Thin Nodes: Relatively slow & small processors in a multiprocessor machine
– require fine-grained parallelism
Vector Instructions: Instructions which carry out the same operation on a whole stream of data values
– reduce memory bandwidth requirements by providing many data words for one address word
– reduce instruction bandwidth requirements by reducing the number of instructions (no gain if there is a reasonably large instruction cache)
– require more register space for temporaries
– some architectures use short vector instructions
• Intel Pentium MMX instructions
• Hitachi PA-RISC extensions
VLIW: Very Long Instruction Word
– instructions which allow software to control many functional units simultaneously
– hard to program by hand
– allows compilers opportunities for static scheduling
VLSI: Very Large Scale Integration
– technology in which a large number of transistors are fabricated on a single semiconductor chip