ParCo'13 September 2013 1
Atomic computing - a different perspective on massively parallel problems
Andrew Brown, Rob Mills, Jeff Reeve, Kier Dugan - University of Southampton, UK
Steve Furber - University of Manchester, UK
Outline
• Machine architecture
• Programming model
• Atomic computing
 – Finite difference time marching
 – Neural simulation
 – Pointwise computing
• Where next?
Machine architecture
• Triangular mesh of nodes
• Connected as a 256 x 256 toroid
One SpiNNaker node
• 6 bi-directional comms links
• Core farm (1 Monitor)
• System: NoC, RAM, watchdogs
• Off-die SDRAM
128 Mbyte DDR SDRAM
Physical construction
48-node board: 48 nodes x 18 cores/node = 864 cores
Final machine: 256 x 256 nodes x 18 cores/node = 1,179,648 cores
SpiNNaker machines
• 10^3 machine: 864 cores, 1 PCB, 75 W
• 10^4 machine: 10,368 cores, 1 rack, 900 W (12 PCBs: largest configuration possible for operation without aircon)
• 10^5 machine: 103,680 cores, 1 cabinet, 9 kW
• 10^6 machine: 1M cores, 10 cabinets, 90 kW (largest configuration possible for operation with forced air, no water cooling)
Outline
• Machine architecture
• Programming model
• Atomic computing
 – Finite difference time marching
 – Neural simulation
 – Pointwise computing
• Where next?
Programming model
A conventional multi-processor program:
• Problem: represented as a network of programs with a certain behaviour...
• ...embodied as data structures and algorithms in code...
• ...compile, link...
• ...binary files loaded into instruction memory...
• ...run on an MPI farm (or similar) over Myrinet (or similar)
Messages are addressed at runtime from arbitrary process to arbitrary process. The interface presented to the application is a homogeneous set of processes of arbitrary size; process can talk to process by messages under application software control.
...on SpiNNaker
• The problem (Circuit under Simulation, CuS) is defined as a graph
• Torn into two components:
 – CuS topology: embodied as hardware route tables in the nodes
 – Circuit device behaviour: embodied as software event handlers running on cores
SpiNNaker execution
SpiNNaker:
• Problem: represented as a network of nodes with a certain behaviour...
• ...problem is split into two parts...
 – ...abstract problem topology... loaded into firmware routing tables...
 – ...behaviour of each node embodied as an interrupt handler in code... compile, link... binary files loaded into core instruction memory...
• Messages launched at runtime take a path defined by the firmware router
The code says "send message" but has no control over where the output message goes. It knows the original source of the message that woke up the handler, but not the path by which it was delivered.
Event-driven software
• Packet received interrupt: initiate DMA - fetch_connection_data();
• DMA completion interrupt (connection data ready): process inputs, insert device delay, generate outputs?
• Timer (millisecond) interrupt - real time: update_device_state(); update_stimulus();
The three interrupt sources are serviced at priorities 1, 2 and 3; between events the core sleeps (goto_sleep()) until the next event arrives.
Managing interrupts
• Interrupt requests arrive while a handler is executing
• Queueable requests are placed in a priority request queue (resides in DTCM) and pulled off the top of the queue in order...
• ...unless a non-queueable request jumps in and pushes the stack (also resides in DTCM, managed by a stack controller)
• Interrupts can be (un)masked by instruction
Design intention
• Packet propagation fast - 0.1 us/node hop
• Software handlers fast - 200 MHz ARM9, BUT code is small
• ==> Most of the time, most cores are idle; most packet queues lightly loaded
What's it cost? (0.1 us/node hop)
[Figure: wallclock send-receive time vs message size (bytes). Southampton Iridis cluster (800 nodes, 4 cores/node): core-core latency 3 us (intra node) / 12 us (inter node), bandwidth 4 Gbyte/s / 1.8 Gbyte/s. SpiNNaker: 0.1 us, 0.03 Gbyte/s.]
Outline
• Machine architecture
• Programming model
• Atomic computing
 – Finite difference time marching
 – Neural simulation
 – Pointwise computing
• Where next?
Atomic computing
• Anything that can be transformed into
 – A large number of simple processes
 – Asynchronous short-range communication
Examples:
 – Finite difference: one process/element
 – Discrete simulation: one process/device
 – Continuous simulation: one process/node, one process/connection
 – Neural simulation: one processor/10^3 neurons
 – Particle and field: one process/field point, one process/particle (field moves the particle; particle bends the field)
 – Ray tracing: one process/pixel
Mapping problem graph to compute mesh
• Multiple devices per core
• Connection topology (circuit) embodied in node route tables
• Device states stored locally in relevant cores
• Generic - independent of problem domain
[Figure: problem graph (circuit) of numbered devices and connections mapped onto the compute mesh (e.g. node 94, core 10, connection 10).]
Discrete simulation - conventional
• Simulation (single core):
 – Future events inserted into a central time-ordered queue as they are computed
 – Next event popped from the queue head
 – g2 and g4 inserted in any order, but the queue is ordered in time
[Figure: source S driving gates g1-g4 with delays δ=1, δ=8 and δ=2; queue snapshots at t=1, 2 and 4, and the resulting gate waveforms over time steps 1-11.]
Simulation - SpiNNaker
• Simulation (distributed):
 – Overhead: a complex choreography of synchronisation signals and anti-events to maintain causality
• Inter-core messages are:
 – Conventionally expensive
 – Cheap on SpiNNaker
Discrete simulation
• Simulation of a simulation - Iridis: 800 nodes, 4 cores/node
• Dynamic re-mapping of CuS devices onto physical cores during simulation: dynamic load balancing in discrete simulation
Finite difference time marching - conventional
At each time point... in every spatial dimension... for each grid point... {...}
• {...} may be simple...
• ...but there's a lot of it
Finite differences - SpiNNaker

void ihr() {
    Recv(val, port);             // React to neighbour value change
    ghost[port] = val;           // It WILL BE different
    oldtemp = mytemp;            // Store current state
    mytemp = fn(ghost);          // Compute new value
    if (oldtemp == mytemp) stop; // If nothing changed, keep quiet
    Send(mytemp);                // Broadcast changed state
}

Handler awoken by arrival of changed neighbour state; stencil updated; new state computed; if nothing has changed, keep quiet; otherwise tell the neighbours I've changed.
• One handler/mesh point
• Computation data driven
• Solution trajectories non-deterministic
• Steady state valid
• Convergence?
Finite differences
• Canonical 2D square grid: diagonal temperature profile vs iteration
Solution times
Reliable computing on unreliable computers
• Finite difference grid sites mapped to a faulty core
 – Algorithm 'self-heals' around the unresponsive core
Neural simulation
• SpiNNaker maps a user-defined graph onto the machine topology
• 10^6 (million-core) machine: 10^9 devices on 10^6 cores - 1000 neurons per processor
• It's just another discrete system?
Neural simulation - SpiNNaker
• Devices (neurons) represented by a differential equation
 – Integrated in real time
 – Integration timestep << equation time constants
• Therefore: solution correct and technique stable
Neural simulation - SpiNNaker
• Individual message frequencies < real-time clock
• Superposition of all inputs (s1...sn summed against the clock): exact timing = fn(neuron:core mapping), i.e. independent of the CuS (bad), BUT message latency << CuS time constants (so it doesn't matter)
• Change of neuron state derived locally, stored until the next (real) timestep
• Change of neuron state broadcast (or not) at the next (real) timestep
Brian validation
Apparent phase lag in timer ticks between simulations because of the reporting latency out of SpiNNaker.
Large simulations - SpiNNaker
• Simulation of large neural aggregates in real time
• Large: design intention is 10^9 neurons on the 10^6-core machine
Pointwise computing
• Matrix operations: 1 core/matrix element
• Complexity: trade off N operations for N cores
LU decomposition
One core per matrix element; each l and u value is launched as a message as soon as it is evaluated (e.g. the message containing l41 is evaluated after time step 2).
[Figure: 4 x 4 grids of element processes (i,j) showing the message wavefronts, each value labelled with the time step at which it becomes available, for factorisation (u11(1), l21(2), ...), forward substitution (y1(1), l21y1(2), ...) and back substitution (x4(1), u14x4(2), ...).]
Conjugate gradient
At every search trajectory inflection, an O(n) vector.matrix product needs to be computed.
Life ... the Universe, and everything
Outline
• Machine architecture
• Programming model
• Atomic computing
 – Finite difference time marching
 – Neural simulation
 – Pointwise computing
• Where next?
Where next?
• Neural simulation
 – Robotics
 – Modelling of auditory and visual systems
 – Cognitive disorders
• Physics applications
 – Computational fluid dynamics
 – Thermal modelling
 – Plasmas
 – Inverse field problems
 – Computational chemistry