ParCo'13 September 2013 1

Atomic computing - a different perspective on massively parallel problems

Andrew Brown, Rob Mills, Jeff Reeve, Kier Dugan
University of Southampton, UK
[email protected]

Steve Furber
University of Manchester, UK
[email protected]


ParCo'13 September 2013 2

Outline

• Machine architecture
• Programming model
• Atomic computing
  – Finite difference time marching
  – Neural simulation
  – Pointwise computing

• Where next?


ParCo'13 September 2013 3

Machine architecture

• Triangular mesh of nodes

• Connected as a 256 x 256 toroid


ParCo'13 September 2013 4

One SpiNNaker node

• 6 bi-directional comms links
• Core farm (+ 1 monitor core)
• System NoC, system RAM, watchdogs
• Off-die SDRAM


ParCo'13 September 2013 5

128 Mbyte DDR SDRAM


ParCo'13 September 2013 6

Physical construction

48-node board: 48 nodes x 18 cores/node = 864 cores

Final machine: 256 x 256 nodes x 18 cores/node = 1,179,648 cores


ParCo'13 September 2013 7

SpiNNaker machines

10^3 machine: 864 cores, 1 PCB, 75 W
10^4 machine: 10,368 cores, 1 rack, 900 W (12 PCBs: the largest configuration possible for operation without aircon)
10^5 machine: 103,680 cores, 1 cabinet, 9 kW
10^6 machine: 1M cores, 10 cabinets, 90 kW (the largest configuration possible for operation with forced air, no water cooling)


ParCo'13 September 2013 8

Outline

• Machine architecture
• Programming model
• Atomic computing
  – Finite difference time marching
  – Neural simulation
  – Pointwise computing

• Where next?


ParCo'13 September 2013 9

Programming model

A conventional multi-processor program:

• Problem: represented as a network of programs with a certain behaviour...
• ...embodied as data structures and algorithms in code...
• ...compile, link...
• ...binary files loaded into instruction memory...
• ...executed on an MPI farm (or similar) over Myrinet (or similar)

Messages are addressed at runtime from arbitrary process to arbitrary process. The interface presented to the application is a homogeneous set of processes of arbitrary size; process can talk to process by messages under application software control, as in the sketch below.
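For contrast, a minimal MPI fragment of this conventional model (illustrative only, not from the talk; run with at least two ranks): any rank can address any other rank at runtime, entirely under application control.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, val = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {
        val = 42;
        /* destination chosen at runtime by the application */
        MPI_Send(&val, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD);
    } else if (rank == size - 1) {
        MPI_Recv(&val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d received %d from rank 0\n", rank, val);
    }
    MPI_Finalize();
    return 0;
}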


ParCo'13 September 2013 10

...on SpiNNaker

• The problem (Circuit under Simulation, CuS) is defined as a graph
• Torn into two components:
  – CuS topology: embodied as hardware route tables in the nodes
  – Circuit device behaviour: embodied as software event handlers running on cores


ParCo'13 September 2013 11

SpiNNaker execution

SpiNNaker:

• Problem: represented as a network of nodes with a certain behaviour...
• ...the problem is split into two parts...
• ...the abstract problem topology is loaded into firmware routing tables...
• ...the behaviour of each node is embodied as an interrupt handler in code...
• ...compile, link...
• ...binary files loaded into core instruction memory...
• ...messages launched at runtime take a path defined by the firmware router

The code says "send message" but has no control over where the output message goes. It knows the original source of the message that woke up the handler, but not the path by which it was delivered.


ParCo'13 September 2013 12

Event-driven software

• Packet received interrupt
  – initiate DMA: fetch_connection_data();
• DMA completion interrupt (DMA of connection data complete)
  – Process inputs, insert device delay, generate outputs?: update_device_state();
• Real time: millisecond timer interrupt
  – update_stimulus();

The three interrupt sources run at three priority levels (priority 1, 2, 3); between events the core sleeps (goto_sleep();) until the next one arrives. A sketch of the dispatch structure follows.
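The event identifiers and wait_for_event() below are hypothetical scaffolding, not the SpiNNaker runtime API; the sketch only shows the shape of an event-driven core program.

/* Hypothetical event IDs mirroring the three interrupt sources above. */
typedef enum { EV_PACKET, EV_DMA_DONE, EV_TIMER } event_t;

extern event_t wait_for_event(void);      /* hypothetical: sleep until an interrupt fires */
extern void fetch_connection_data(void);  /* handlers named as on the slide */
extern void update_device_state(void);
extern void update_stimulus(void);

int main(void) {
    for (;;) {                                /* the core sleeps between events */
        switch (wait_for_event()) {
            case EV_PACKET:   fetch_connection_data(); break;  /* start DMA      */
            case EV_DMA_DONE: update_device_state();   break;  /* process inputs */
            case EV_TIMER:    update_stimulus();       break;  /* real time      */
        }
    }
}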


ParCo'13 September 2013 13

Managing interrupts

• An arriving interrupt request is either queueable or non-queueable
• Queueable requests go into a priority request queue (resides in DTCM)
• Queueable requests are pulled off the top of the priority queue in order, unless...
• ...a non-queueable request jumps in, pushing the executing handler onto the stack (which also resides in DTCM)
• A stack controller and (un)mask interrupt instructions manage the preemption


ParCo'13 September 2013 14

Design intention

• Packet propagation is fast: 0.1 us/node hop
• Software handlers are fast: 200 MHz ARM9, BUT the code is small
• Consequently:
  – Most of the time, most cores are idle
  – Most packet queues are lightly loaded

What does it cost? (wallclock send-receive time vs message size):

                                                 latency   bandwidth
Southampton Iridis cluster (800 nodes, 4 cores/node):
  core-core, intra-node                          3 us      4 Gbyte/s
  core-core, inter-node                          12 us     1.8 Gbyte/s
SpiNNaker (0.1 us/node hop)                      0.1 us    0.03 Gbyte/s


ParCo'13 September 2013 15

Outline

• Machine architecture
• Programming model
• Atomic computing
  – Finite difference time marching
  – Neural simulation
  – Pointwise computing

• Where next?


ParCo'13 September 2013 16

Atomic computing

• Anything that can be transformed into
  – a large number of simple processes
  – asynchronous short-range communication

• Finite difference: one process/element
• Discrete simulation: one process/device
• Continuous simulation: one process/node, one process/connection
• Particle and field: one process/field point, one process/particle (the field moves the particle; the particle bends the field)
• Neural simulation: one processor/10^3 neurons
• Ray tracing: one process/pixel


ParCo'13 September 2013 17

Mapping problem graph to compute mesh

• Multiple devices per core
• Connection topology (circuit) embodied in node route tables
• Device states stored locally in the relevant cores
• Generic - independent of problem domain

[Figure: devices of the problem graph (circuit) mapped onto nodes, cores and connections of the compute mesh]


ParCo'13 September 2013 18

Discrete simulation - conventional

• Simulation (single core):

[Figure: a stimulus S drives gates g1-g4 through delays δ=1, δ=8, δ=2; resulting waveforms shown over t = 1..11]

Future events are inserted into a central time-ordered queue as they are computed; the next event is popped from the queue head. Successive queue states:

t:=1   (g1,1) ... (g4,9)
t:=2   (g2,2) ... (g4,9)
t:=4   (g3,4) ... (g4,9)

g2 and g4 may be inserted in any order, but the queue is ordered in time. (A sketch of such a queue follows.)
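A minimal sketch of the central time-ordered queue (a sorted linked list here; a production simulator would use a heap). The gate names and times follow the slide; everything else is illustrative.

#include <stdio.h>
#include <stdlib.h>

typedef struct event {
    const char   *gate;   /* which device fires           */
    int           time;   /* simulated time of the event  */
    struct event *next;
} event;

static event *queue = NULL;

/* Insert keeping the queue ordered in time; equal-time events may
   land in any order, as noted above. */
static void schedule(const char *gate, int time) {
    event *e = malloc(sizeof *e);
    event **p = &queue;
    e->gate = gate;
    e->time = time;
    while (*p && (*p)->time <= time) p = &(*p)->next;
    e->next = *p;
    *p = e;
}

int main(void) {
    schedule("g1", 1); schedule("g4", 9);   /* inserted as computed... */
    schedule("g2", 2); schedule("g3", 4);   /* ...in any order         */
    while (queue) {                         /* pop from the queue head */
        event *e = queue;
        queue = e->next;
        printf("t=%d: %s fires\n", e->time, e->gate);
        free(e);
    }
    return 0;
}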


ParCo'13 September 2013 19

Simulation

• Simulation (distributed): each core keeps its own time-ordered event queue

Overhead: a complex choreography of synchronisation signals and anti-events to maintain causality.

Inter-core messages are:
• conventionally expensive
• cheap on SpiNNaker


ParCo'13 September 2013 20

Discrete simulation

• Simulation of a simulation
  – Iridis: 800 nodes, 4 cores/node
  – Dynamic re-mapping of CuS devices to physical cores during the simulation
  – Dynamic load balancing in discrete simulation


ParCo'13 September 2013 21

Finite difference time marching

Conventional:

At each time point...
  in every spatial dimension...
    for each grid point...
      {...}

• {...} may be simple...
• ...but there's a lot of it (a sketch follows)
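As a concrete instance, here is the conventional loop nest for an explicit 2D heat-diffusion stencil; the grid size, coefficient and step count are invented for illustration.

#include <string.h>

#define N 64
static double t_old[N][N], t_new[N][N];

void march(int steps, double alpha) {
    for (int s = 0; s < steps; s++) {        /* at each time point...  */
        for (int i = 1; i < N - 1; i++)      /* for each grid point... */
            for (int j = 1; j < N - 1; j++)
                /* the {...}: one simple 5-point stencil update */
                t_new[i][j] = t_old[i][j] + alpha *
                    (t_old[i-1][j] + t_old[i+1][j] +
                     t_old[i][j-1] + t_old[i][j+1] - 4.0 * t_old[i][j]);
        memcpy(t_old, t_new, sizeof t_old);  /* advance the time level */
    }
}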


ParCo'13 September 2013 22

Finite differences

SpiNNaker: one interrupt handler per mesh point.

void ihr() {
  Recv(val, port);       // Handler awoken by arrival of a changed neighbour state
  ghost[port] = val;     // Update the stencil: it WILL BE different
  oldtemp = mytemp;      // Store current state
  mytemp = fn(ghost);    // Compute new value
  if (oldtemp == mytemp) return; // If nothing changed, keep quiet
  Send(mytemp);          // Tell the neighbours I've changed
}

• One handler/mesh point
• Computation is data driven
• Solution trajectories are non-deterministic
• The steady state is valid
• Convergence?


ParCo'13 September 2013 23

Finite differences

• Canonical 2D square grid:

[Figure: diagonal temperature profile vs iteration]


ParCo'13 September 2013 24

Solution times


ParCo'13 September 2013 25

Reliable computing on unreliable computers

• Finite difference grid sites mapped to a faulty core
  – The algorithm 'self-heals' around the unresponsive core


ParCo'13 September 2013 26

Neural simulation

• SpiNNaker maps a user-defined graph onto the machine topology
• The 10^6 (million core) machine:
  – 10^9 devices
  – 10^6 cores
  – 1000 neurons per processor
• It's just another discrete system?


ParCo'13 September 2013 27

Neural simulation - SpiNNaker

• Devices (neurons) are represented by a differential equation
  – Integrated in real time
  – Integration timestep << equation time constants
• Therefore:
  – The solution is correct and the technique is stable (see the sketch below)
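For illustration, forward-Euler integration of a leaky integrate-and-fire neuron - a stand-in for whatever neuron model is actually deployed, with invented constants - where the timestep is kept well below the membrane time constant:

#include <stdio.h>

#define TAU      20.0e-3   /* membrane time constant: 20 ms (invented) */
#define DT        1.0e-3   /* integration timestep: 1 ms << TAU        */
#define V_REST   -65.0
#define V_THRESH -50.0

/* One Euler step of dv/dt = (V_REST - v + I) / TAU. */
double step(double v, double input_current) {
    v += DT * ((V_REST - v + input_current) / TAU);
    if (v >= V_THRESH)   /* threshold crossed: spike and reset */
        v = V_REST;
    return v;
}

int main(void) {
    double v = V_REST;
    for (int t = 0; t < 100; t++)   /* 100 ms of simulated time  */
        v = step(v, 20.0);          /* constant injected current */
    printf("v = %f\n", v);
    return 0;
}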


ParCo'13 September 2013 28

Neural simulation - SpiNNaker

[Figure: spike inputs s1...sn superposed (Σs) against a real-time clock]

• Individual message frequencies < the real-time clock
• Superposition of all inputs: exact timing = fn(neuron:core mapping), i.e. independent of the CuS (bad), BUT message latency << CuS time constants (so it doesn't matter)
• Change of neuron state is derived locally and stored until the next (real) timestep
• Change of neuron state is broadcast (or not) at the next (real) timestep
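The buffering just described can be sketched as two handlers (the names and helper functions are hypothetical): packet arrivals only superpose input, and state is advanced and broadcast on the real-time tick.

#define N_NEURONS 1000                 /* 1000 neurons per processor      */

static double input_sum[N_NEURONS];    /* superposition of arrived spikes */
static double state[N_NEURONS];

extern double integrate(double v, double in);  /* e.g. the Euler step above   */
extern int    spiked(double v);                /* hypothetical threshold test */
extern void   send_spike(int neuron);          /* hypothetical broadcast      */

/* Packet handler: just superpose the input. Exact arrival time within the
   tick does not matter, because message latency << CuS time constants. */
void on_spike(int neuron, double weight) {
    input_sum[neuron] += weight;
}

/* Timer handler: advance every neuron one (real) timestep, then broadcast
   (or not) the changed states. */
void on_tick(void) {
    for (int n = 0; n < N_NEURONS; n++) {
        state[n] = integrate(state[n], input_sum[n]);
        input_sum[n] = 0.0;
        if (spiked(state[n]))
            send_spike(n);
    }
}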


ParCo'13 September 2013 29

Brian validation

Apparent phase lag in timer ticks between simulations because of the reporting latency out of SpiNNaker


ParCo'13 September 2013 30

Large simulations - SpiNNaker

Simulation of large neural aggregates in real time. 'Large': the design intention is 10^9 neurons on the 10^6-core machine.


ParCo'13 September 2013 31

Pointwise computing

• Matrix operations:
  – 1 core/matrix element
• Complexity:
  – trade off N operations for N cores


ParCo'13 September 2013 32

LU decomposition

[Figure: a 4 x 4 matrix mapped one core per element. Each core (i,j) computes its l(i,j) or u(i,j) value and messages it onward; the annotation on each element is the timestep at which it is evaluated - e.g. the message containing l41 is evaluated after timestep 2. Two further grids show forward substitution (the y values) and back substitution (the x values) proceeding in the same pointwise fashion.]
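The per-element arithmetic behind the figure is ordinary Doolittle LU. A sequential sketch follows; on SpiNNaker each element's evaluation would be one core's handler, fired by the arrival of the l and u values it depends on.

#define N 4

/* Doolittle LU: each element needs only already-computed l and u values -
   the messages in the figure. No pivoting, for clarity. */
void lu(double a[N][N], double l[N][N], double u[N][N]) {
    for (int k = 0; k < N; k++) {
        for (int j = k; j < N; j++) {          /* row k of U */
            double s = 0.0;
            for (int m = 0; m < k; m++) s += l[k][m] * u[m][j];
            u[k][j] = a[k][j] - s;
        }
        l[k][k] = 1.0;
        for (int i = k + 1; i < N; i++) {      /* column k of L */
            double s = 0.0;
            for (int m = 0; m < k; m++) s += l[i][m] * u[m][k];
            l[i][k] = (a[i][k] - s) / u[k][k];
        }
    }
}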


ParCo'13 September 2013 33

Conjugate gradient

At every inflection of the search trajectory, an O(n) vector.matrix product needs to be computed.
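For reference, a compact dense form of the iteration (no preconditioner; the size and tolerance are invented), showing the matrix-vector product recomputed at each inflection:

#include <math.h>

#define N 4

static double dot(const double *x, const double *y) {
    double s = 0.0;
    for (int i = 0; i < N; i++) s += x[i] * y[i];
    return s;
}

/* Solve A x = b for symmetric positive-definite A. */
void cg(double A[N][N], const double *b, double *x) {
    double r[N], p[N], Ap[N];
    for (int i = 0; i < N; i++) { x[i] = 0.0; r[i] = b[i]; p[i] = b[i]; }
    double rs = dot(r, r);
    for (int it = 0; it < N && sqrt(rs) > 1e-10; it++) {
        for (int i = 0; i < N; i++) {          /* the vector.matrix product */
            Ap[i] = 0.0;
            for (int j = 0; j < N; j++) Ap[i] += A[i][j] * p[j];
        }
        double alpha = rs / dot(p, Ap);
        for (int i = 0; i < N; i++) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rs_new = dot(r, r);
        for (int i = 0; i < N; i++)            /* new search direction */
            p[i] = r[i] + (rs_new / rs) * p[i];
        rs = rs_new;
    }
}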


ParCo'13 September 2013 34

Life ... the Universe, and everything


ParCo'13 September 2013 35

Outline

• Machine architecture
• Programming model
• Atomic computing
  – Finite difference time marching
  – Neural simulation
  – Pointwise computing

• Where next?


ParCo'13 September 2013 36

Where next?

• Neural simulation
  – Robotics
  – Modelling of auditory and visual systems
  – Cognitive disorders
• Physics applications
  – Computational fluid dynamics
  – Thermal modelling
  – Plasmas
  – Inverse field problems
  – Computational chemistry