Experiences on the Edinburgh Blue Gene System

Joachim Hein
EPCC, The University of Edinburgh
[email protected]
+44 131 651 3390


06/10/2005 Experiences on the Edinburgh Blue Gene System

Outline

• What can be learned from “STREAMS”

• Fourier transforms

• Application performance
  – CASTEP
  – H2MOL
  – DL_POLY
  – NAMD
  – MDCASK
  – PCHAN
  – LUDWIG

Systems

• BlueSky: Single e-Server Blue Gene frame
  – 1024 dual core chips
  – 2048 PowerPC440 processors, 700 MHz
  – 512 MB of RAM per chip (distributed memory system), shared between the two cores
  – 4.713 TFLOP/s Linpack, joint No 58 on top500

• HPCx: 50 IBM e-Server p690+ frames
  – SMP cluster
  – 32 Power4 processors per frame, 1700 MHz
  – 32 GB of RAM per frame
  – Federation interconnect, 2 links (4 wires) per frame
  – 6.188 TFLOP/s Linpack, No 45 on top500

                   BG                     HPCx                    Performance ratio
Clock:             700 MHz                1700 MHz                17/7 = 2.43
L1 cache:          32kB, 32 byte line     32kB, 128 byte line
L2 cache:          mem-buffer             1.5MB/2 proc, 128 byte line
L3 cache:          4MB/chip, 128 byte line  512 MB/frame, 512 byte line
Memory:            512 MB/chip            32 GB/frame
Functional units:  1 double word FPU      2 single word FPU
                   1 double word LSU      2 single word LSU

Naïve perf-ratio: 1 (use double unit), 2 (use single unit)

Relative performance expectation: 2.43 (double unit), 4.86 (single unit)
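
The relative performance expectation appears to be the clock ratio multiplied by the naïve functional-unit ratio. A minimal Python sketch of that arithmetic follows; the variable names are illustrative and do not come from the slides.

```python
# Clock frequencies from the comparison table (MHz).
bg_clock_mhz = 700
hpcx_clock_mhz = 1700

clock_ratio = hpcx_clock_mhz / bg_clock_mhz  # 17/7

# Naive functional-unit ratio (HPCx : BG) from the table:
# 1 if the code uses BG's double unit, 2 if it only uses a single unit.
unit_ratio = {"double unit": 1, "single unit": 2}

# Expected HPCx : BG performance ratio per processor, rounded to 2 decimals.
expectation = {mode: round(clock_ratio * r, 2) for mode, r in unit_ratio.items()}
print(expectation)
```

The 4.86 figure is the one the closing summary uses as the naïve expectation.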

Alignment

• FPU units
  – 16 byte registers, holding two 8 byte floating point words
  – Instructions can operate simultaneously on both 8 byte parts

• 16 byte alignment
  – FPU must load two 8 byte words from L1 in a single instruction
  – Words must be consecutive and occupy either the first or the second half of an L1 cache line (32 bytes)

[Figure: word pairs within a cache line, marked OK and Not OK]
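
The alignment rule can be illustrated with byte offsets inside one 32-byte L1 line. This is a small sketch, not anything from the slides: the function name `pair_is_loadable` is hypothetical, and it only models the stated rule that a pair must fill the first or the second 16-byte half of a line.

```python
WORD = 8   # bytes in a double-precision word
QUAD = 16  # bytes moved by one double-FPU load
LINE = 32  # bytes in a BG L1 cache line

def pair_is_loadable(byte_offset):
    """True if the two consecutive 8-byte words starting at byte_offset
    fill exactly one 16-byte half of a cache line."""
    return byte_offset % QUAD == 0

# Check a pair starting at each of the four word positions in a line.
for offset in range(0, LINE, WORD):
    print(offset, pair_is_loadable(offset))
```

Pairs starting at offsets 0 and 16 satisfy the rule; pairs starting at 8 or 24 straddle a half-line boundary and do not.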

Streams tests:

Test:             Operation:             Streams (store + load):
copy              a(i) = b(i)            1 + 1
scale             a(i) = s*b(i)          1 + 1
add               a(i) = b(i)+c(i)       1 + 2
triad             a(i) = b(i)+s*c(i)     1 + 2
vsum (own test)   r = r + a(i)           0 + 1
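
The store and load counts determine how many bytes each loop iteration is credited with. The sketch below shows one way to turn a timing into a bandwidth figure. It assumes 8-byte words and MB = 10^6 bytes, and the timing in the example is made up purely for illustration.

```python
BYTES_PER_WORD = 8  # double precision

# (stores, loads) per loop iteration, as listed in the table.
STREAMS = {
    "copy":  (1, 1),
    "scale": (1, 1),
    "add":   (1, 2),
    "triad": (1, 2),
    "vsum":  (0, 1),
}

def bandwidth_mb_per_s(test, n, seconds):
    """Bandwidth in MB/s (10**6 bytes) for n iterations taking `seconds`."""
    stores, loads = STREAMS[test]
    return (stores + loads) * BYTES_PER_WORD * n / seconds / 1e6

# Hypothetical timing, for illustration only: 1,000,000 copies in 0.01 s.
print(bandwidth_mb_per_s("copy", 1_000_000, 0.01))
```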

Example Routine (copy)

subroutine copy_st(a,b,n)
  implicit none
  integer :: n
  real(kind(1.0d0)), dimension(n) :: a, b
  integer :: i
  ! Assert to the XL compiler that a(1) and b(1) are 16 byte aligned
  call alignx(16,a(1))
  call alignx(16,b(1))
  do i = 1, n
    a(i) = b(i)
  end do
end subroutine copy_st

• Assumes a(1) and b(1) are 16 byte aligned

Alignment

Size, first elements passed   440d (MB/s)   440 (MB/s)
16MB, a(1), b(1)              1330          1329
16MB, a(1), b(2)              34            1329
16MB, a(2), b(1)              31            1330
16MB, a(2), b(2)              16            1329
1MB, a(1), b(1)               3603          2030
1MB, a(1), b(2)               34            2030
1MB, a(2), b(1)               31            2030
1MB, a(2), b(2)               16            2030

– Performance benefit for L3 problem size, if the assertion is correct
– Incorrect alignment results in a substantial performance loss, caused by trapping of alignment exceptions

Streams on L1 cache

• Hardware limit:
  – 11200 MB/s (double-fpu)
  – 5600 MB/s (single-fpu)

• Copy: very satisfactory

• Vsum: not quite
  – Short of HW limit
  – -qhot & 440d very bad
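
The quoted hardware limits are consistent with one load or store per clock cycle at 700 MHz, moving 16 bytes with the double FPU or 8 bytes with a single FPU. That reading is an inference from the numbers rather than something the slide states. A short sketch of it:

```python
clock_mhz = 700  # BG core clock from the Systems slide

def l1_limit_mb_per_s(bytes_per_access, accesses_per_cycle=1):
    # Assumption: one load or store can be issued per clock cycle.
    return clock_mhz * bytes_per_access * accesses_per_cycle

print(l1_limit_mb_per_s(16))  # double-FPU, 16-byte accesses
print(l1_limit_mb_per_s(8))   # single-FPU, 8-byte accesses
```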

Streams on L3 cache

• About 1/3 of L1 performance

• 440d almost doubles performance

• vsum still not happy

Streams on main memory

• No difference between 440 and 440d

• Vsum: 440d without -qhot is OK, finally

dcbz

• ‘Hidden’ loads reduce performance
  – Loading data to cache prior to a write (L1 does write allocate)
  – dcbz instruction can be used to eliminate these loads by zeroing a cache line
  – Need to know 32-byte alignment

subroutine copy_st(a,b,n)
  ..........
  ! Assumes n is a multiple of 4 and a(1) is 32-byte aligned,
  ! so that a(i)..a(i+3) fill exactly one L1 cache line
  do i = 1, n, 4
!IBM* CACHE_ZERO(a(i))
    a(i)   = b(i)
    a(i+1) = b(i+1)
    a(i+2) = b(i+2)
    a(i+3) = b(i+3)
  end do
end subroutine copy_st

• Stream copy a(j) = b(j)

• 16MB, 440d, dp:
  – without DCBZ: 1316 MB/s
  – with DCBZ: 2299 MB/s

• Use with care!

• Need to control 32-byte alignment
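
The effect of the hidden loads on the copy figures can be estimated with a simple traffic count. This sketch assumes that with write allocate every stored word also costs one extra load, and that dcbz removes all of those loads. Both are simplifications, and the function name is illustrative.

```python
def actual_traffic_factor(stores, loads, write_allocate=True):
    """Ratio of bytes really moved to bytes counted by the benchmark."""
    counted = stores + loads
    hidden = stores if write_allocate else 0  # one extra load per store
    return (counted + hidden) / counted

reported_without_dcbz = 1316  # MB/s, from the slide
reported_with_dcbz = 2299     # MB/s, from the slide

# Copy has 1 store + 1 load per element.
print(reported_without_dcbz * actual_traffic_factor(1, 1))       # 1974.0
print(reported_with_dcbz * actual_traffic_factor(1, 1, False))   # 2299.0
```

Under these assumptions the memory system moves roughly 2.0 to 2.3 GB/s in both cases, so the dcbz gain comes mainly from no longer spending bandwidth on uncounted loads.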

Compare to gnu-compiler

• C-version of Streams (out of the box), no alignment declared

• Main memory

• xlC clearly outperforms the gnu compiler

Streams CO vs VN mode

Comparison to p690+

• L2 helps for intermediate sizes

• L1 and Memory: no factor of 2.43

• Mem: stable for single FPU

• Single task/frame

• Copy and add underperform

Strided Access

Cache Lines:

• BG
  – L1: 4 words
  – L3: 16 words

• p690+
  – L1: 16 words
  – L2: 16 words
  – L3: 4*16 words
  – TLB: 512 words (small pages)

Graph: Memory Copy
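
The line lengths matter because a strided access uses only part of each fetched line. A deliberately simplified model is sketched below: it ignores prefetching and TLB effects, and `useful_fraction` is a hypothetical helper, not a measured quantity.

```python
def useful_fraction(stride_words, line_words):
    """Fraction of each fetched cache line actually used when touching
    every stride_words-th word (stride_words >= 1)."""
    return 1 / min(stride_words, line_words)

# Line lengths in 8-byte words, taken from the list above.
lines = {"BG L1": 4, "BG L3": 16, "p690+ L1": 16, "p690+ L3": 64}
for name, words in lines.items():
    print(name, [useful_fraction(s, words) for s in (1, 4, 16, 64)])
```

In this model short lines waste less bandwidth once the stride exceeds the line length, which is one way to read the comparison between the two systems.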

Summary of Streams

• L1 performance (compiler) at the hardware limit (11 GB/s)

• L3 can sustain two cores (VN-mode), L1 obviously can

• Counting properly: about 2.2 GB/s from/to main memory

• Memory bandwidth almost halved in VN-mode

• L1 and Memory better balanced than Power4

• Intermediate problem size: Power4 is faster due to L2

• Large strides: no TLB pays off for BG

Fast Fourier Transform

• Measured benefits for large problem

• L1: HPCx
  – 2.0x Vienna
  – 2.8x comp

• Large problem: Full HPCx slower than BG

Application comparison

• Study the performance of a variety of scientific applications on the BG system and compare to the performance on HPCx (p690+ cluster)

• For some applications:
  – Reduced optimisation level for some/all routines, to make them work
  – Take the results as the present status

CASTEP

• Density functional theory application, Payne et al., 2002, Segall et al., 2002

• Widely used in the UK (largest user of HPCx)

• Web site: http://www.tcm.phy.cam.ac.uk/castep/

• Compiled on BG with -O3, apart from one routine

• Looked at a couple of benchmark configurations, here:
  – Titanium Nitride: 33 atoms, cell volume of 654 Å³
  – Al2O3: 120 atoms, cell volume of 2160 Å³

• Application benefits from MASS lib and FFTW

CASTEP: Titanium Nitride

• Good Scaling on both machines

• VN benefits

• HPCx about 5 times faster

CASTEP: Al2O3

• Benchmark too large for VN

• Again: similar scaling

• HPCx about 2.6 times faster

• Previous benchmark better helped by cache?

H2MOL

• Written by Ken Taylor and Daniel Dundas at Queen's University Belfast

• Solves the time-dependent Schrödinger equation

• Laser driven dissociation of H2-molecules

• Refines grid when increasing processor count, hence constant work/proc

H2MOL

• Writing of intermediate results is bad

• HPCx about 1.7x faster

DL_POLY3

• DL_POLY: a general purpose molecular dynamics package
  – Smith and Forester, 1996

• DL_POLY3 uses a distributed domain decomposition model

• Utilises its own FFT that maps directly to the decomposition used in DL_POLY
  – Sends a smaller number of larger messages than a traditional parallel FFT would

• No problems porting this application

• The benchmark: Gramicidin
  – a system of eight Gramicidin-A species (792,960 atoms)

DL_POLY3

• VN beneficial

• BG scales slightly better

• HPCx about 2.7x faster

• Benefits from MASS

NAMD

• NAMD: parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems
  – Kalé et al. 1999

• Relies on a series of packages
  – FFTW single precision: no double FPU
  – charm++

• charm++
  – Required quite a few changes to 5.8 (e.g. library locations, compilers), however a release of 5.9 fixed many of these problems

• NAMD 2.6b1 compiled with only minor problems
  – 2.5 was more complicated

• ApoA1 benchmark
  – 92224 atoms

NAMD 2.6b1

• xlc7 boosts performance on HPCx

• VN mode beneficial

• HPCx 4.2x faster (NAMD doesn’t use memory!)

MDCASK

• MDCASK: first developed as a molecular dynamics code to study radiation damage in metals

• Part of ASCI Purple benchmark suite

• Benchmark used: 1372000 atoms in Ti lattice

• Needs n³ processors
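
Reading the last requirement as "the processor count must be a perfect cube", the usable counts on a single Blue Gene frame can be listed with a short sketch. The helper name `cubic_counts` is illustrative; the 1024-chip limit comes from the Systems slide.

```python
def cubic_counts(max_procs):
    """Processor counts of the form n**3 that do not exceed max_procs."""
    counts = []
    n = 1
    while n ** 3 <= max_procs:
        counts.append(n ** 3)
        n += 1
    return counts

# 1024 is the number of chips in the BlueSky frame.
print(cubic_counts(1024))
```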

MDCASK

• HPCx about 4x faster

• Code speeds up, up to 512 processors

• BG better scaling

PCHAN

• Finite difference code for Turbulent Flow: shock/boundary layer interaction (SBLI)

• Developed at the University of Southampton

• PCHAN: simple turbulent channel flow benchmark using the SBLI code

• Communications: halo-exchanges between adjacent computational sub-domains

PCHAN

• Very good scaling on all systems; HPCx superscales

• Benefits from VN

• HPCx faster: 2.1x (32 processors), 2.8x (256 processors)

T2 (240x240x240)

LUDWIG

• LUDWIG: studying complex fluids (mixtures of fluids, solids/fluids)
  – Jean Christophe Desplat, Dublin Institute for Advanced Studies
  – Kevin Stratford, Mike Cates, The University of Edinburgh
  – Applications include personal care products, e.g. shampoo

• Blue Gene
  – Well suited to parallelisation
  – Being used to investigate the dynamics of mixtures involving colloidal particles and systems under shear

LUDWIG - performance

• Ludwig (Model size 384³)
  – VN beneficial
  – HPCx 1.4x faster
  – IBM p575: Power 5 processor, 1.9 GHz

[Figure: execution time (hours, axis 0 to 40) against BG chips / Power processors (64 and 256), for HPCx, BG (CO), BG (VN) and IBM p575]

LUDWIG

• Blue Gene gives better performance than the p690+ system (using VN mode, on a per chip (resource) basis)

• Key routines of LUDWIG are memory access bound: many strided accesses to main memory
  – Blue Gene memory access latency relatively smaller than the p690+ system
  – Similar bandwidths

• p575 gives excellent performance

Acknowledgements

• Lorna Smith, Mark Bull, Alan Gray: Editing material

• Bartosz Dobrzelecki, Alan Gray, Ian Bush (DL), Lorna Smith, Owain Kenway, Mike Ashworth (DL), Fiona Reid, Kevin Stratford, J-C Desplat (now Dublin): various Benchmark results

Summary

• One can learn a lot from streams (even if it is only to convince you that you can “double hum”)

• If 256 MB of memory per task is sufficient, VN mode is beneficial

• We observe good scaling for a wide variety of codes, however BG does not obviously “out-scale” HPCx

• In the light of the naïve expectations of 4.86: Serial performance on BG is good in comparison to Power4 - we are doing better than I thought
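
To put the last point in numbers, the HPCx-over-BG factors quoted on the application slides can be set against the naïve 4.86. The sketch below only collects those quoted figures; "about 5 times" for CASTEP Titanium Nitride is entered as 5.0, and the dictionary keys are labels chosen here.

```python
naive_single = 4.86  # expectation if BG uses only a single FPU

# HPCx speed-up over BG as quoted on the application slides.
observed = {
    "CASTEP (TiN)": 5.0,
    "CASTEP (Al2O3)": 2.6,
    "H2MOL": 1.7,
    "DL_POLY3": 2.7,
    "NAMD": 4.2,
    "MDCASK": 4.0,
    "PCHAN (32 p)": 2.1,
    "PCHAN (256 p)": 2.8,
    "LUDWIG": 1.4,
}

# Cases where BG does better than the naive expectation would suggest.
better_than_naive = sorted(k for k, v in observed.items() if v < naive_single)
print(better_than_naive)
```

By this count, every benchmark except CASTEP Titanium Nitride comes in below the naïve factor.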