
Next KEK machine

Shoji Hashimoto (KEK)

@ 3rd ILFT Network Workshop at Jefferson Lab., Oct. 3-6, 2005


KEK supercomputer

A leading computing facility of its time:
• 1985: Hitachi S810/10, 350 MFlops
• 1989: Hitachi S820/80, 3 GFlops
• 1995: Fujitsu VPP500, 128 GFlops
• 2000: Hitachi SR8000 F1, 1.2 TFlops
• 2006: ???


Formality

• “KEK Large Scale Simulation Program”: a call for proposals for projects to be performed on the supercomputer.
• Open to Japanese researchers working on high energy accelerator science (particle and nuclear physics, astrophysics, accelerator physics, and material science related to the Photon Factory).
• The Program Advisory Committee (PAC) decides on approval and machine-time allocation.


Usage

Lattice QCD is the dominant user:
• About 60-80% of the computer time is for lattice QCD.
  – Among that, ~60% is for the JLQCD collaboration.
  – Others include Hatsuda-Sasaki, Nakamura et al., Suganuma et al., Suzuki et al. (Kanazawa), …
• Simulation for accelerator design is another big user: beam-beam simulation for the KEK-B factory.


JLQCD collaboration

• 1995~ (on VPP500): continuum limit in the quenched approximation
  (Plots shown: BK; fB, fD; ms)


JLQCD collaboration

• 2000~ (on SR8000): dynamical QCD with the improved Wilson fermion
  (Plots shown: mV vs mPS^2; fB, fBs; Kl3 form factor)


Around the triangle (figure only)


The wall

• Chiral extrapolation: very hard to go beyond ms/2 toward lighter quark masses.
• A problem for every physical quantity.
• It may be solved by new algorithms and machines…

(Plot labels: JLQCD Nf=2 (2002); MILC coarse lattice (2004); new generation of dynamical QCD)


Upgrade

Thanks to Hideo Matsufuru (Computing Research Center, KEK) for his hard work.

• Upgrade scheduled for March 1st, 2006.
• Called for bids from vendors.
• At least 20x more computing power, measured mainly with the QCD codes.
• No restriction on architecture (scalar or vector, etc.), but some amount must be a shared-memory machine.
• The decision was made recently.


The next machine

A combination of two systems:

• Hitachi SR11000 K1, 16 nodes, 2.15 TFlops peak performance.

• IBM Blue Gene/L, 10 racks, 57.3 TFlops peak performance.

Hitachi Ltd. is the prime contractor.


Hitachi SR11000 K1

• POWER5+: 2.1 GHz, dual core, 2 simultaneous multiply/add per cycle (8.4 GFlops/core), 1.875 MB L2 (on chip), 36 MB L3 (off chip)

• 8.5 GB/s chip-memory bandwidth, hardware and software prefetch

• 16-way SMP (134.4 GFlops/node), 32 GB memory (DDR2 SDRAM).

• 16 nodes (2.15 TFlops)
• Interconnect: Federation switch, 8 GB/s (bidirectional)

The SR11000 K1 will be officially announced tomorrow.
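As a quick check of these peak figures, a minimal sketch in C (it only restates the numbers quoted above):

#include <stdio.h>

/* Peak-performance arithmetic for the SR11000 K1 figures above. */
int main(void)
{
    double ghz       = 2.1;           /* POWER5+ clock                            */
    double gf_core   = ghz * 2 * 2;   /* 2 multiply/add units x 2 flops = 8.4 GF  */
    double gf_node   = gf_core * 16;  /* 16-way SMP = 134.4 GF/node               */
    double gf_system = gf_node * 16;  /* 16 nodes = 2150.4 GF, i.e. ~2.15 TF      */
    printf("%.1f GFlops/core, %.1f GFlops/node, %.2f TFlops total\n",
           gf_core, gf_node, gf_system / 1000.0);
    return 0;
}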


SR11000 node (figure only)


16-way SMP (figure only)


High Density Module (figure only)


IBM Blue Gene/L

• Node: 2 PowerPC440 (dual core), 700 MHz, double FPU (5.6 GFlops/chip), 4MB on-chip L3 (shared), 512 MB memory.

• Interconnect: 3D torus, 1.4 Gbps/link (6 in + 6 out) from each node.

• Midplane: 8x8x8 = 512 nodes (2.87 TFlops); 1 rack = 2 midplanes

• 10 rack system

All the information in the following comes from the IBM Redbooks (ibm.com/redbooks) and from articles in the IBM Journal of Research and Development.
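The Blue Gene/L peak numbers follow the same arithmetic; another minimal check (only restating the slide's figures):

#include <stdio.h>

/* Peak-performance arithmetic for the Blue Gene/L figures above. */
int main(void)
{
    double ghz         = 0.7;                   /* PowerPC 440 clock                         */
    double gf_chip     = ghz * 2 * 2 * 2;       /* 2 cores x 2 FMA pipes x 2 flops = 5.6 GF  */
    double gf_midplane = gf_chip * 8 * 8 * 8;   /* 512 nodes = 2867.2 GF, i.e. ~2.87 TF      */
    double gf_10racks  = gf_midplane * 2 * 10;  /* 2 midplanes/rack, 10 racks, ~57.3 TF      */
    printf("%.1f GFlops/chip, %.2f TFlops/midplane, %.1f TFlops for 10 racks\n",
           gf_chip, gf_midplane / 1000.0, gf_10racks / 1000.0);
    return 0;
}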


BG/L system

10 racks (figure only)


BG/L node ASIC

• Double floating-point unit (FPU) added to each PPC440 core: 2 fused multiply-adds per cycle per core.
• Not a true SMP: L1 has no cache coherency; L2 has a snoop.
• Shared 4 MB L3.
• Communication between the two cores goes through the “multiported shared SRAM buffer”.

• Embedded memory controller and networks.


Compute node modes

• Virtual node mode: use both CPUs separately, running a different process on each core. Communication using MPI, etc. Memory and bandwidth are shared.

• Co-processor mode: use the secondary processor as a co-processor for communication. Peak performance is halved.

• Hybrid node mode: use the secondary processor also for computation. Needs special care because of the L1 cache incoherency. Used for Linpack.


QCD code optimization

Jun Doi and Hikaru Samukawa (IBM Japan):

• Use the virtual node mode

• Fully use the double FPU (hand-written assembly code)

• Use a low-level communication API


Double FPU

• SIMD extension of PPC440.

• 32 pairs of 64-bit FP registers; the two registers of a pair share one register address.
• Quadword load and store.
• Primary and secondary pipelines; fused multiply-add in each pipe.

• Cross operations possible; best suited for complex arithmetic.


Examples

Instruction                                               Mnemonic   Primary            Secondary
Load floating parallel double indexed                     lfpdx      PT = dw(EA)        ST = dw(EA+8)
Store floating parallel double indexed                    stfpdx     dw(EA) = PT        dw(EA+8) = ST
Floating parallel multiply-add                            fpmadd     PT = PA.PC + PB    ST = SA.SC + SB
Floating cross multiply-add                               fxmadd     PT = SA.PC + PB    ST = PA.SC + SB
Asymmetric cross copy-primary nsub-primary multiply-add   fxcpnpma   PT = -PA.PC + PB   ST = PA.SC + SB
Floating cross cmplx nsub-primary multiply-add            fxcxnpma   PT = -SA.SC + PB   ST = SA.PC + SB

(P/S = primary/secondary register of a pair; T = target, A, B, C = operands; dw(EA) = doubleword at effective address EA.)


SU(3) matrix*vector

y[0] = u[0][0] * x[0] + u[0][1] * x[1] + u[0][2] * x[2];

y[1] = u[1][0] * x[0] + u[1][1] * x[1] + u[1][2] * x[2];

y[2] = u[2][0] * x[0] + u[2][1] * x[1] + u[2][2] * x[2];

complex mult: u[0][0] * x[0]

FXPMUL (y[0],u[0][0],x[0])

FXCXNPMA (y[0],u[0][0],x[0],y[0])

+ u[0][1] * x[1] + u[0][2] * x[2];

FXCPMADD (y[0],u[0][1],x[1],y[0])

FXCXNPMA (y[0],u[0][1],x[1],y[0])

FXCPMADD (y[0],u[0][2],x[2],y[0])

FXCXNPMA (y[0],u[0][2],x[2],y[0])

The first two macros (FXPMUL and FXCXNPMA) expand to:

re(y[0]) = re(u[0][0])*re(x[0])
im(y[0]) = re(u[0][0])*im(x[0])
re(y[0]) += -im(u[0][0])*im(x[0])
im(y[0]) += im(u[0][0])*re(x[0])

These must be combined (interleaved) with the other rows to avoid pipeline stalls (a dependent result must wait 5 cycles).
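For reference, the same SU(3) matrix-vector multiplication written out in plain C with explicit complex arithmetic; a minimal sketch of what the Double-FPU sequence computes (the interleaved {re, im} layout is an assumption, not taken from the slide):

#include <stdio.h>

typedef double complex_t[2];   /* {re, im} */

/* y = u * x for a 3x3 complex (SU(3)) matrix u and a 3-component complex vector x. */
static void su3_mult(complex_t y[3], complex_t u[3][3], complex_t x[3])
{
    for (int i = 0; i < 3; i++) {
        double re = 0.0, im = 0.0;
        for (int j = 0; j < 3; j++) {
            /* complex multiply-accumulate: y[i] += u[i][j] * x[j] */
            re += u[i][j][0] * x[j][0] - u[i][j][1] * x[j][1];
            im += u[i][j][0] * x[j][1] + u[i][j][1] * x[j][0];
        }
        y[i][0] = re;
        y[i][1] = im;
    }
}

int main(void)
{
    complex_t u[3][3] = { {{1,0},{0,0},{0,0}}, {{0,0},{1,0},{0,0}}, {{0,0},{0,0},{1,0}} };
    complex_t x[3]    = { {1,2}, {3,4}, {5,6} };
    complex_t y[3];
    su3_mult(y, u, x);                              /* with u = identity, y equals x */
    printf("y[0] = %g + %gi\n", y[0][0], y[0][1]);
    return 0;
}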


Scheduling

• 32+32 registers can hold 32 complex numbers.

• 3x3 (= 9) registers for a gauge link; 3x4 (= 12) for a spinor: two spinors are needed, for input and output.
• Load the gauge link while computing, using 6+6 registers. Straightforward for y += U*x, but not so for y += conjg(U)*x.
• Use the gcc inline assembler; xlf and xlc have intrinsic functions.
• Early xlf/xlc was not good enough to produce such code, but it has improved recently.


Parallelization on BG/L

Example: a 24^3 x 48 lattice.
• Use the virtual node mode.
• For a midplane, divide the entire lattice onto 2x8x8x8 processors; for one rack, 2x8x8x16. (The factor 2 is within a node: the two cores in virtual node mode.)
• To use more than one rack, a 32^3 x 64 lattice is the minimum.
• Each processor then holds a 12x3x3x6 (or 12x3x3x3) local lattice, as in the sketch below.
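A minimal sketch of the decomposition arithmetic (lattice and processor-grid sizes are the ones quoted above; the helper function is only illustrative):

#include <stdio.h>

/* Divide a global 4D lattice over a 4D processor grid and print the local
   sub-lattice per process (virtual node mode: one process per core). */
static void local_lattice(const int global[4], const int procs[4])
{
    printf("%dx%dx%dx%d on %dx%dx%dx%d procs -> local %dx%dx%dx%d\n",
           global[0], global[1], global[2], global[3],
           procs[0], procs[1], procs[2], procs[3],
           global[0] / procs[0], global[1] / procs[1],
           global[2] / procs[2], global[3] / procs[3]);
}

int main(void)
{
    const int lat[4]      = {24, 24, 24, 48};
    const int midplane[4] = {2, 8, 8, 8};    /* 512 nodes x 2 cores  -> 12x3x3x6 per process */
    const int rack[4]     = {2, 8, 8, 16};   /* 1024 nodes x 2 cores -> 12x3x3x3 per process */
    local_lattice(lat, midplane);
    local_lattice(lat, rack);
    return 0;
}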


Communication

Communication is fast:
• 6 links to the nearest neighbors; 1.4 Gbps (bidirectional) for each link.
• Latency is 140 ns for one hop.

MPI is too heavy:
• It needs an additional buffer copy, which wastes cache and memory bandwidth.
• Multi-threading is not available in the virtual node mode.
• Overlapping computation and communication is not possible within MPI.


“QCD Enhancement Package”

A low-level communication API:
• Send/recv directly by accessing the torus interface FIFO; no copy to a memory buffer.
• Non-blocking send; blocking recv.
• Up to 224 bytes of data per send/recv (a spinor at one site is 192 bytes).
• Assumes nearest-neighbor communication.


An example

#define BGLNET_WORK_REG   30
#define BGLNET_HEADER_REG 30

BGLNetQuad* fifo;

// create the packet header (send toward +x)
BGLNet_Send_WaitReady(BGLNET_X_PLUS, fifo, 6);

for (i = 0; i < Nx; i++) {
    // put results to reg 24--29

    // put the packet header into the send buffer
    BGLNet_Send_Enqueue_Header(fifo);

    // put the data into the send buffer
    BGLNet_Send_Enqueue(fifo, 24);
    BGLNet_Send_Enqueue(fifo, 25);
    BGLNet_Send_Enqueue(fifo, 26);
    BGLNet_Send_Enqueue(fifo, 27);
    BGLNet_Send_Enqueue(fifo, 28);
    BGLNet_Send_Enqueue(fifo, 29);

    // kick!
    BGLNet_Send_Packet(fifo);
}


Benchmark

• Wilson solver (BiCGstab)
  – 24^3 x 48 lattice on a midplane (8x8x8 = 512 nodes, half a rack)
  – 29.2% of the peak performance
  – 32.6% if only the Dslash is measured

• Domain-wall solver (CG)
  – 24^3 x 48 lattice on a midplane; Ns = 16
  – Does not fit in the on-chip L3
  – ~22% of the peak performance
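In absolute terms (a quick conversion using only the midplane peak and the percentages above):

#include <stdio.h>

/* Convert the quoted fractions of peak into sustained TFlops on one midplane. */
int main(void)
{
    double peak_midplane = 2.87;   /* TFlops, from the Blue Gene/L slide */
    printf("Wilson solver:      %.2f TFlops\n", peak_midplane * 0.292);  /* ~0.84 */
    printf("Domain-wall solver: %.2f TFlops\n", peak_midplane * 0.22);   /* ~0.63 */
    return 0;
}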


Comparison

Compared with the BG/L numbers reported by Vranas at Lattice 2004, this is roughly a 50% improvement. (comparison figure omitted)


Physics target

“Future opportunities: ab initio calculations at the physical quark masses”

• Using dynamical overlap fermions
• Details are under discussion (actions, algorithms, etc.)
• A primitive code has been written; test runs are ongoing on the SR8000.

• Many things to do by March…


Summary

• The new KEK machine will be made available to the Japanese lattice community on March 1st, 2006.

• Hitachi SR11000 (2.15 TFlops) + IBM Blue Gene/L (57.3 TFlops)