
Next KEK machine

Shoji Hashimoto (KEK)

@ 3rd ILFT Network Workshop at Jefferson Lab., Oct. 3-6, 2005


KEK supercomputer

A leading computing facility of its time:
• 1985: Hitachi S810/10, 350 MFlops
• 1989: Hitachi S820/80, 3 GFlops
• 1995: Fujitsu VPP500, 128 GFlops
• 2000: Hitachi SR8000 F1, 1.2 TFlops
• 2006: ???


Formality

• “KEK Large Scale Simulation Program”: a call for proposals for projects to be performed on the supercomputer.
• Open to Japanese researchers working on high energy accelerator science (particle and nuclear physics, astrophysics, accelerator physics, and material science related to the Photon Factory).
• The Program Advisory Committee (PAC) decides on approval and machine-time allocation.


Usage

Lattice QCD is the dominant user:
• About 60-80% of the computer time is for lattice QCD.
  – Among that, ~60% is for the JLQCD collaboration.
  – Others include Hatsuda-Sasaki, Nakamura et al., Suganuma et al., Suzuki et al. (Kanazawa), …
• Simulation for accelerator design is another big user: beam-beam simulation for the KEK-B factory.


JLQCD collaboration

• 1995~ (on VPP500): continuum limit in the quenched approximation
  (Plots shown: BK; fB, fD; ms)


JLQCD collaboration

• 2000~ (on SR8000): dynamical QCD with the improved Wilson fermion
  (Plots shown: mV vs mPS^2; fB, fBs; Kl3 form factor)


Around the triangle (figure only)


The wall

• Chiral extrapolation: very hard to go beyond ms/2 toward lighter quark masses.
• A problem for every physical quantity.
• It may be solved by new algorithms and machines…

(Plot labels: JLQCD Nf=2 (2002); MILC coarse lattice (2004); new generation of dynamical QCD)


Upgrade

Thanks to Hideo Matsufuru (Computing Research Center, KEK) for his hard work.

• Upgrade scheduled for March 1st, 2006.
• Called for bids from vendors.
• At least 20x more computing power, measured mainly with the QCD codes.
• No restriction on architecture (scalar or vector, etc.), but some amount must be a shared-memory machine.
• The decision was made recently.


The next machine

A combination of two systems:

• Hitachi SR11000 K1, 16 nodes, 2.15 TFlops peak performance.

• IBM Blue Gene/L, 10 racks, 57.3 TFlops peak performance.

Hitachi Ltd. is the prime contractor.


Hitachi SR11000 K1

• POWER5+: 2.1 GHz, dual core, 2 simultaneous multiply/add per cycle (8.4 GFlops/core), 1.875 MB L2 (on chip), 36 MB L3 (off chip)

• 8.5 GB/s chip-memory bandwidth, hardware and software prefetch

• 16-way SMP (134.4 GFlops/node), 32 GB memory (DDR2 SDRAM).

• 16 nodes (2.15 TFlops)
• Interconnect: Federation switch, 8 GB/s (bidirectional)

The SR11000 K1 will be officially announced tomorrow.
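As a quick check of these peak figures, a minimal sketch in C (it only restates the numbers quoted above):

#include <stdio.h>

/* Peak-performance arithmetic for the SR11000 K1 figures above. */
int main(void)
{
    double ghz       = 2.1;           /* POWER5+ clock                            */
    double gf_core   = ghz * 2 * 2;   /* 2 multiply/add units x 2 flops = 8.4 GF  */
    double gf_node   = gf_core * 16;  /* 16-way SMP = 134.4 GF/node               */
    double gf_system = gf_node * 16;  /* 16 nodes = 2150.4 GF, i.e. ~2.15 TF      */
    printf("%.1f GFlops/core, %.1f GFlops/node, %.2f TFlops total\n",
           gf_core, gf_node, gf_system / 1000.0);
    return 0;
}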


SR11000 node (figure only)


16-way SMP (figure only)


High Density Module (figure only)


IBM Blue Gene/L

• Node: 2 PowerPC440 (dual core), 700 MHz, double FPU (5.6 GFlops/chip), 4MB on-chip L3 (shared), 512 MB memory.

• Interconnect: 3D torus, 1.4 Gbps/link (6 in + 6 out) from each node.

• Midplane: 8x8x8 = 512 nodes (2.87 TFlops); 1 rack = 2 midplanes

• 10 rack system

All the information in the following comes from the IBM Redbooks (ibm.com/redbooks) and from articles in the IBM Journal of Research and Development.
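The Blue Gene/L peak numbers follow the same arithmetic; another minimal check (only restating the slide's figures):

#include <stdio.h>

/* Peak-performance arithmetic for the Blue Gene/L figures above. */
int main(void)
{
    double ghz         = 0.7;                   /* PowerPC 440 clock                         */
    double gf_chip     = ghz * 2 * 2 * 2;       /* 2 cores x 2 FMA pipes x 2 flops = 5.6 GF  */
    double gf_midplane = gf_chip * 8 * 8 * 8;   /* 512 nodes = 2867.2 GF, i.e. ~2.87 TF      */
    double gf_10racks  = gf_midplane * 2 * 10;  /* 2 midplanes/rack, 10 racks, ~57.3 TF      */
    printf("%.1f GFlops/chip, %.2f TFlops/midplane, %.1f TFlops for 10 racks\n",
           gf_chip, gf_midplane / 1000.0, gf_10racks / 1000.0);
    return 0;
}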


BG/L system

10 racks (figure only)


BG/L node ASIC

• Double floating-point unit (FPU) added to each PPC440 core: 2 fused multiply-adds per cycle per core.
• Not a true SMP: L1 has no cache coherency; L2 has a snoop.
• Shared 4 MB L3.
• Communication between the two cores goes through the “multiported shared SRAM buffer”.

• Embedded memory controller and networks.


Compute node modes

• Virtual node mode: use both CPUs separately, running a different process on each core. Communication using MPI, etc. Memory and bandwidth are shared.

• Co-processor mode: use the secondary processor as a co-processor for communication. Peak performance is halved.

• Hybrid node mode: use the secondary processor also for computation. Needs special care because of the L1 cache incoherency. Used for Linpack.


QCD code optimization

Jun Doi and Hikaru Samukawa (IBM Japan):

• Use the virtual node mode

• Fully use the double FPU (hand-written assembly code)

• Use a low-level communication API


Double FPU

• SIMD extension of PPC440.

• 32 pairs of 64-bit FP registers; the two registers of a pair share one register address.
• Quadword load and store.
• Primary and secondary pipelines; fused multiply-add in each pipe.

• Cross operations possible; best suited for complex arithmetic.


Examples

Instruction                                               Mnemonic   Primary            Secondary
Load floating parallel double indexed                     lfpdx      PT = dw(EA)        ST = dw(EA+8)
Store floating parallel double indexed                    stfpdx     dw(EA) = PT        dw(EA+8) = ST
Floating parallel multiply-add                            fpmadd     PT = PA.PC + PB    ST = SA.SC + SB
Floating cross multiply-add                               fxmadd     PT = SA.PC + PB    ST = PA.SC + SB
Asymmetric cross copy-primary nsub-primary multiply-add   fxcpnpma   PT = -PA.PC + PB   ST = PA.SC + SB
Floating cross cmplx nsub-primary multiply-add            fxcxnpma   PT = -SA.SC + PB   ST = SA.PC + SB

(P/S = primary/secondary register of a pair; T = target, A, B, C = operands; dw(EA) = doubleword at effective address EA.)


SU(3) matrix*vector

y[0] = u[0][0] * x[0] + u[0][1] * x[1] + u[0][2] * x[2];

y[1] = u[1][0] * x[0] + u[1][1] * x[1] + u[1][2] * x[2];

y[2] = u[2][0] * x[0] + u[2][1] * x[1] + u[2][2] * x[2];

complex mult: u[0][0] * x[0]

FXPMUL (y[0],u[0][0],x[0])

FXCXNPMA (y[0],u[0][0],x[0],y[0])

+ u[0][1] * x[1] + u[0][2] * x[2];

FXCPMADD (y[0],u[0][1],x[1],y[0])

FXCXNPMA (y[0],u[0][1],x[1],y[0])

FXCPMADD (y[0],u[0][2],x[2],y[0])

FXCXNPMA (y[0],u[0][2],x[2],y[0])

The first two macros (FXPMUL and FXCXNPMA) expand to:

re(y[0]) = re(u[0][0])*re(x[0])
im(y[0]) = re(u[0][0])*im(x[0])
re(y[0]) += -im(u[0][0])*im(x[0])
im(y[0]) += im(u[0][0])*re(x[0])

These must be combined (interleaved) with the other rows to avoid pipeline stalls (a dependent result must wait 5 cycles).
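For reference, the same SU(3) matrix-vector multiplication written out in plain C with explicit complex arithmetic; a minimal sketch of what the Double-FPU sequence computes (the interleaved {re, im} layout is an assumption, not taken from the slide):

#include <stdio.h>

typedef double complex_t[2];   /* {re, im} */

/* y = u * x for a 3x3 complex (SU(3)) matrix u and a 3-component complex vector x. */
static void su3_mult(complex_t y[3], complex_t u[3][3], complex_t x[3])
{
    for (int i = 0; i < 3; i++) {
        double re = 0.0, im = 0.0;
        for (int j = 0; j < 3; j++) {
            /* complex multiply-accumulate: y[i] += u[i][j] * x[j] */
            re += u[i][j][0] * x[j][0] - u[i][j][1] * x[j][1];
            im += u[i][j][0] * x[j][1] + u[i][j][1] * x[j][0];
        }
        y[i][0] = re;
        y[i][1] = im;
    }
}

int main(void)
{
    complex_t u[3][3] = { {{1,0},{0,0},{0,0}}, {{0,0},{1,0},{0,0}}, {{0,0},{0,0},{1,0}} };
    complex_t x[3]    = { {1,2}, {3,4}, {5,6} };
    complex_t y[3];
    su3_mult(y, u, x);                              /* with u = identity, y equals x */
    printf("y[0] = %g + %gi\n", y[0][0], y[0][1]);
    return 0;
}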


Scheduling

• 32+32 registers can hold 32 complex numbers.

• 3x3 (= 9) registers for a gauge link; 3x4 (= 12) for a spinor: two spinors are needed, for input and output.
• Load the gauge link while computing, using 6+6 registers. Straightforward for y += U*x, but not so for y += conjg(U)*x.
• Use the gcc inline assembler; xlf and xlc have intrinsic functions.
• Early xlf/xlc was not good enough to produce such code, but it has improved recently.


Parallelization on BG/L

Example: a 24^3 x 48 lattice.
• Use the virtual node mode.
• For a midplane, divide the entire lattice onto 2x8x8x8 processors; for one rack, 2x8x8x16. (The factor 2 is within a node: the two cores in virtual node mode.)
• To use more than one rack, a 32^3 x 64 lattice is the minimum.
• Each processor then holds a 12x3x3x6 (or 12x3x3x3) local lattice, as in the sketch below.
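A minimal sketch of the decomposition arithmetic (lattice and processor-grid sizes are the ones quoted above; the helper function is only illustrative):

#include <stdio.h>

/* Divide a global 4D lattice over a 4D processor grid and print the local
   sub-lattice per process (virtual node mode: one process per core). */
static void local_lattice(const int global[4], const int procs[4])
{
    printf("%dx%dx%dx%d on %dx%dx%dx%d procs -> local %dx%dx%dx%d\n",
           global[0], global[1], global[2], global[3],
           procs[0], procs[1], procs[2], procs[3],
           global[0] / procs[0], global[1] / procs[1],
           global[2] / procs[2], global[3] / procs[3]);
}

int main(void)
{
    const int lat[4]      = {24, 24, 24, 48};
    const int midplane[4] = {2, 8, 8, 8};    /* 512 nodes x 2 cores  -> 12x3x3x6 per process */
    const int rack[4]     = {2, 8, 8, 16};   /* 1024 nodes x 2 cores -> 12x3x3x3 per process */
    local_lattice(lat, midplane);
    local_lattice(lat, rack);
    return 0;
}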


Communication

Communication is fast:
• 6 links to the nearest neighbors; 1.4 Gbps (bidirectional) for each link.
• Latency is 140 ns for one hop.

MPI is too heavy:
• It needs an additional buffer copy, which wastes cache and memory bandwidth.
• Multi-threading is not available in the virtual node mode.
• Overlapping computation and communication is not possible within MPI.


“QCD Enhancement Package”

A low-level communication API:
• Send/recv directly by accessing the torus interface FIFO; no copy to a memory buffer.
• Non-blocking send; blocking recv.
• Up to 224 bytes of data per send/recv (a spinor at one site is 192 bytes).
• Assumes nearest-neighbor communication.


An example

#define BGLNET_WORK_REG   30
#define BGLNET_HEADER_REG 30

BGLNetQuad* fifo;

// create the packet header (send toward +x)
BGLNet_Send_WaitReady(BGLNET_X_PLUS, fifo, 6);

for (i = 0; i < Nx; i++) {
    // put results to reg 24--29

    // put the packet header into the send buffer
    BGLNet_Send_Enqueue_Header(fifo);

    // put the data into the send buffer
    BGLNet_Send_Enqueue(fifo, 24);
    BGLNet_Send_Enqueue(fifo, 25);
    BGLNet_Send_Enqueue(fifo, 26);
    BGLNet_Send_Enqueue(fifo, 27);
    BGLNet_Send_Enqueue(fifo, 28);
    BGLNet_Send_Enqueue(fifo, 29);

    // kick!
    BGLNet_Send_Packet(fifo);
}


Benchmark

• Wilson solver (BiCGstab)
  – 24^3 x 48 lattice on a midplane (8x8x8 = 512 nodes, half a rack)
  – 29.2% of the peak performance
  – 32.6% if only the Dslash is measured

• Domain-wall solver (CG)
  – 24^3 x 48 lattice on a midplane; Ns = 16
  – Does not fit in the on-chip L3
  – ~22% of the peak performance
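In absolute terms (a quick conversion using only the midplane peak and the percentages above):

#include <stdio.h>

/* Convert the quoted fractions of peak into sustained TFlops on one midplane. */
int main(void)
{
    double peak_midplane = 2.87;   /* TFlops, from the Blue Gene/L slide */
    printf("Wilson solver:      %.2f TFlops\n", peak_midplane * 0.292);  /* ~0.84 */
    printf("Domain-wall solver: %.2f TFlops\n", peak_midplane * 0.22);   /* ~0.63 */
    return 0;
}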


Comparison

Compared with the BG/L numbers reported by Vranas at Lattice 2004, this is roughly a 50% improvement. (comparison figure omitted)


Physics target

“Future opportunities: ab initio calculations at the physical quark masses”

• Using dynamical overlap fermions
• Details are under discussion (actions, algorithms, etc.)
• A primitive code has been written; test runs are ongoing on the SR8000.

• Many things to do by March…


Summary

• The new KEK machine will be made available to the Japanese lattice community on March 1st, 2006.

• Hitachi SR11000 (2.15 TFlops) + IBM Blue Gene/L (57.3 TFlops)