ParCo'13 September 2013 1
Atomic computing - a different perspective on massively parallel problems
Andrew Brown, Rob Mills, Jeff Reeve, Kier Dugan - University of Southampton, UK
Steve Furber - University of Manchester, UK
Outline
• Machine architecture
• Programming model
• Atomic computing
 – Finite difference time marching
 – Neural simulation
 – Pointwise computing
• Where next?
Machine architecture
• Triangular mesh of nodes
• Connected as a 256 x 256 toroid
One SpiNNaker node
• 6 bi-directional comms links
• Core farm (1 Monitor)
• System: NoC, RAM, watchdogs
• Off-die SDRAM
128 Mbyte DDR SDRAM
Physical construction
48-node board: 48 nodes x 18 cores/node = 864 cores
Final machine: 256 x 256 nodes x 18 cores/node = 1,179,648 cores
SpiNNaker machines
• 10^3 machine: 864 cores, 1 PCB, 75 W
• 10^4 machine: 10,368 cores, 1 rack, 900 W (12 PCBs: largest configuration possible for operation without aircon)
• 10^5 machine: 103,680 cores, 1 cabinet, 9 kW
• 10^6 machine: 1M cores, 10 cabinets, 90 kW (largest configuration possible for operation with forced air, no water cooling)
Outline
• Machine architecture
• Programming model
• Atomic computing
 – Finite difference time marching
 – Neural simulation
 – Pointwise computing
• Where next?
Programming model
A conventional multi-processor program:
• Problem: represented as a network of programs with a certain behaviour...
• ...embodied as data structures and algorithms in code...
• ...compile, link...
• ...binary files loaded into instruction memory...
• ...run on an MPI farm (or similar) over Myrinet (or similar)
Messages are addressed at runtime from arbitrary process to arbitrary process. The interface presented to the application is a homogeneous set of processes of arbitrary size; process can talk to process by messages under application software control.
...on SpiNNaker
• The problem (Circuit under Simulation, CuS) is defined as a graph
• Torn into two components:
 – CuS topology: embodied as hardware route tables in the nodes
 – Circuit device behaviour: embodied as software event handlers running on cores
SpiNNaker execution
SpiNNaker:
• Problem: represented as a network of nodes with a certain behaviour...
• ...problem is split into two parts...
 – ...abstract problem topology... loaded into firmware routing tables...
 – ...behaviour of each node embodied as an interrupt handler in code... compile, link... binary files loaded into core instruction memory...
• Messages launched at runtime take a path defined by the firmware router
The code says "send message" but has no control over where the output message goes. It knows the original source of the message that woke up the handler, but not the path by which it was delivered.
Event-driven software
• Packet received interrupt: initiate DMA - fetch_connection_data();
• DMA completion interrupt (connection data ready): process inputs, insert device delay, generate outputs?
• Timer (millisecond) interrupt - real time: update_device_state(); update_stimulus();
The three interrupt sources are serviced at priorities 1, 2 and 3; between events the core sleeps (goto_sleep()) until the next event arrives.
Managing interrupts
• Interrupt requests arrive while a handler is executing
• Queueable requests are placed in a priority request queue (resides in DTCM) and pulled off the top of the queue in order...
• ...unless a non-queueable request jumps in and pushes the stack (also resides in DTCM, managed by a stack controller)
• Interrupts can be (un)masked by instruction
Design intention
• Packet propagation fast - 0.1 us/node hop
• Software handlers fast - 200 MHz ARM9, BUT code is small
• ==> Most of the time, most cores are idle; most packet queues lightly loaded
What's it cost? (0.1 us/node hop)
[Figure: wallclock send-receive time vs message size (bytes). Southampton Iridis cluster (800 nodes, 4 cores/node): core-core latency 3 us (intra node) / 12 us (inter node), bandwidth 4 Gbyte/s / 1.8 Gbyte/s. SpiNNaker: 0.1 us, 0.03 Gbyte/s.]
Outline
• Machine architecture
• Programming model
• Atomic computing
 – Finite difference time marching
 – Neural simulation
 – Pointwise computing
• Where next?
Atomic computing
• Anything that can be transformed into
 – A large number of simple processes
 – Asynchronous short-range communication
Examples:
 – Finite difference: one process/element
 – Discrete simulation: one process/device
 – Continuous simulation: one process/node, one process/connection
 – Neural simulation: one processor/10^3 neurons
 – Particle and field: one process/field point, one process/particle (field moves the particle; particle bends the field)
 – Ray tracing: one process/pixel
Mapping problem graph to compute mesh
• Multiple devices per core
• Connection topology (circuit) embodied in node route tables
• Device states stored locally in relevant cores
• Generic - independent of problem domain
[Figure: problem graph (circuit) of numbered devices and connections mapped onto the compute mesh (e.g. node 94, core 10, connection 10).]
Discrete simulation - conventional
• Simulation (single core):
 – Future events inserted into a central time-ordered queue as they are computed
 – Next event popped from the queue head
 – g2 and g4 inserted in any order, but the queue is ordered in time
[Figure: source S driving gates g1-g4 with delays δ=1, δ=8 and δ=2; queue snapshots at t=1, 2 and 4, and the resulting gate waveforms over time steps 1-11.]
Simulation - SpiNNaker
• Simulation (distributed):
 – Overhead: a complex choreography of synchronisation signals and anti-events to maintain causality
• Inter-core messages are:
 – Conventionally expensive
 – Cheap on SpiNNaker
Discrete simulation
• Simulation of a simulation - Iridis: 800 nodes, 4 cores/node
• Dynamic re-mapping of CuS devices onto physical cores during simulation: dynamic load balancing in discrete simulation
Finite difference time marching - conventional
At each time point... in every spatial dimension... for each grid point... {...}
• {...} may be simple...
• ...but there's a lot of it
Finite differences - SpiNNaker

void ihr() {
    Recv(val, port);             // React to neighbour value change
    ghost[port] = val;           // It WILL BE different
    oldtemp = mytemp;            // Store current state
    mytemp = fn(ghost);          // Compute new value
    if (oldtemp == mytemp) stop; // If nothing changed, keep quiet
    Send(mytemp);                // Broadcast changed state
}

Handler awoken by arrival of changed neighbour state; stencil updated; new state computed; if nothing has changed, keep quiet; otherwise tell the neighbours I've changed.
• One handler/mesh point
• Computation data driven
• Solution trajectories non-deterministic
• Steady state valid
• Convergence?
Finite differences
• Canonical 2D square grid: diagonal temperature profile vs iteration
Solution times
Reliable computing on unreliable computers
• Finite difference grid sites mapped to a faulty core
 – Algorithm 'self-heals' around the unresponsive core
Neural simulation
• SpiNNaker maps a user-defined graph onto the machine topology
• 10^6 (million-core) machine: 10^9 devices on 10^6 cores - 1000 neurons per processor
• It's just another discrete system?
Neural simulation - SpiNNaker
• Devices (neurons) represented by a differential equation
 – Integrated in real time
 – Integration timestep << equation time constants
• Therefore: solution correct and technique stable
Neural simulation - SpiNNaker
• Individual message frequencies < real-time clock
• Superposition of all inputs (s1...sn summed against the clock): exact timing = fn(neuron:core mapping), i.e. independent of the CuS (bad), BUT message latency << CuS time constants (so it doesn't matter)
• Change of neuron state derived locally, stored until the next (real) timestep
• Change of neuron state broadcast (or not) at the next (real) timestep
Brian validation
Apparent phase lag in timer ticks between simulations because of the reporting latency out of SpiNNaker.
Large simulations - SpiNNaker
• Simulation of large neural aggregates in real time
• Large: design intention is 10^9 neurons on the 10^6-core machine
Pointwise computing
• Matrix operations: 1 core/matrix element
• Complexity: trade off N operations for N cores
LU decomposition
One core per matrix element; each l and u value is launched as a message as soon as it is evaluated (e.g. the message containing l41 is evaluated after time step 2).
[Figure: 4 x 4 grids of element processes (i,j) showing the message wavefronts, each value labelled with the time step at which it becomes available, for factorisation (u11(1), l21(2), ...), forward substitution (y1(1), l21y1(2), ...) and back substitution (x4(1), u14x4(2), ...).]
Conjugate gradient
At every search trajectory inflection, an O(n) vector.matrix product needs to be computed.
Life ... the Universe, and everything
Outline
• Machine architecture
• Programming model
• Atomic computing
 – Finite difference time marching
 – Neural simulation
 – Pointwise computing
• Where next?
Where next?
• Neural simulation
 – Robotics
 – Modelling of auditory and visual systems
 – Cognitive disorders
• Physics applications
 – Computational fluid dynamics
 – Thermal modelling
 – Plasmas
 – Inverse field problems
 – Computational chemistry