

Extreme computing
Infrastructure

Stratis D. Viglas

School of Informatics, University of Edinburgh


Outline

Infrastructure

Overview

Distributed file systems

Replication and fault tolerance

Virtualisation

Parallelism and parallel/concurrent programming

Services


The world is going parallel

• To be fair, the world was always parallel

• We just didn’t comprehend it, didn’t pay attention to it, or we were not exposed to the parallelism

• Parallelism used to be elective

• There were tools (both software and hardware) for parallelism and we could choose whether to use them or not

• Or, special problems that were a better fit for parallel solutions

• Parallelism is now enforced

• Massive potential for parallelism at the infrastructure level

• Application programmers are forced to think in parallel terms

• Multicore chips; parallel machines in your pocket


Implicit vs. explicit parallelism

• High-level classification of parallel architectures

• Implicit: there but you do not see it

• Most of the work is carried out at the architecture level

• Pipelined instruction execution is a typical example

• Explicit: there and you can (and had better) use it

• Parallelism potential is exposed to the programmer

• When applications are not implemented in parallel, we’re not making best use of the hardware

• Multicore chips and parallel programming frameworks (e.g., MPI, MapReduce)


Implicit parallelism through superscalar execution

• Issue a varying number of instructions per clock tick
• Static scheduling
  • Compiler techniques to identify potential
  • In-order execution (i.e., instructions are not reordered)
• Dynamic scheduling
  • Instruction-level parallelism (ILP)
  • Let the CPU (and sometimes the compiler) examine the next few hundreds of instructions and identify parallelism potential
  • Schedule them in parallel as operands/resources become available
  • There are plenty of registers; use them to eliminate dependencies
  • Execute instructions out of order
    • Change the order of execution to one that is more inherently parallel
  • Speculative execution
    • Say there is a branch instruction but which branch will be taken is unknown at issue time
    • Maintain statistics and start executing one branch


Pipelining and superscalar execution

[Figure: standard pipelining vs. a 2-issue superscalar architecture. Instructions i to i+4 flow through the IF (Instruction Fetch), ID (Instruction Decode), EX (Execute), and WB (Write Back) stages over cycles 1–8; the 2-issue version issues an integer and a floating-point instruction in each cycle.]


ILP, data dependencies, and hazards

• CPU and compiler must preserve program order: the order in which the instructions would be executed if the original program were executed sequentially
• Study the dependencies among the program’s instructions
• Data dependencies are important
  • Indicate the possibility of a hazard
  • Determine the order in which computations should take place
  • Most importantly: set an upper bound on how much parallelism can possibly be exploited
• Goal: exploit parallelism by preserving program order only where it affects the outcome of the program


Speculative execution

• Greater ILP: overcome control dependencies by speculating in hardware on the outcome of branches and executing the program as if the guess were correct
• Dynamic scheduling: fetch and issue instructions
• Speculation: fetch, issue, and execute a stream of instructions as if branch predictions were correct
• Different predictors
  • Branch predictor: outcome of branching instructions
  • Value predictor: outcome of certain computations
  • Prefetching: lock on to memory access patterns and fetch data/instructions
• But predictors make mistakes
  • Upon a mistake, we need to empty the pipeline(s) of all erroneously predicted data/instructions
• Power consumption: more circuitry on the chip means more power, which means more heat


Explicit parallelism

[Figure: an operation, its throughput per cycle, and its latency in cycles.]
• Parallelism potential is exposed to software
  • Both to the compiler and the programmer
• Various different forms
  • From loosely coupled multiprocessors to tightly coupled very long instruction word architectures
• Little’s law: parallelism = throughput × latency
  • To maintain a throughput of T operations per cycle when each operation has a latency of L cycles, we need to execute T × L independent operations in parallel (worked example below)
• To maintain fixed parallelism
  • Decreased latency results in increased throughput
  • Decreased throughput allows increased latency
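As a worked instance of Little’s law (the numbers are illustrative, not taken from the slides): to sustain a throughput of T = 4 operations per cycle when each operation has a latency of L = 5 cycles, we need parallelism = T × L = 4 × 5 = 20 independent operations in flight at any given time. Conversely, if only 10 independent operations are available, the achievable throughput drops to 10/5 = 2 operations per cycle.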


Types of parallelism

[Figure: four timelines, one per type of parallelism.]
• Pipelined parallelism: different operations in different stages
• Data-level parallelism: same computation over different data
• Instruction-level parallelism: no two threads performing the same computation at the same time
• Thread-level parallelism: different threads performing different computations


Single Instruction Multiple Data (SIMD) architecture

• Precursor to massively data-parallel architectures

• Central controller broadcasts instructions to multiple processing elements (PEs)

• One controller for the whole array

• One copy of the program

• All computations fully synchronised

[Figure: an array controller sends control to a row of processing elements (PEs), each with its own memory (Mem); the PEs exchange data over an inter-PE connection network.]


Massively Parallel Processors (MPP)

• Message passing between microprocessors (µP) through a network interface (NI)
• Architecture scales to thousands of nodes
• Data layout must be handled by software
• Any two nodes can exchange data, but only through explicit request/reply messages
• Message passing has high overhead
  • Invoke the OS on each message
  • Access to NI is also expensive
  • Sending is cheap, receiving is expensive (poll the NI or invoke the OS through an interrupt)

[Figure: microprocessors (µP), each with its own memory (Mem) and network interface (NI), connected by an interconnect network.]


Shared memory multiprocessors

• Just what the name suggests

• Will work with any data placement; any memory location is accessible by any processing node

• Standard load and store instructions used to communicate data

• No OS involvement

• Low software overhead

• Only some special synchronisation primitives

• Shared memory is usually implemented as physically distributed memory modules

• Each processor has its cache, so coherency is an issue

• Snooping: broadcast data requests to all processors

• Processors snoop to see if they have a copy of data and act accordingly

• Works well if there is a common bus

• Dominates in small-scale machines

• Directory-based: keep track of what is stored where in a directory

• Directory is usually distributed for scalability

• Point-to-point requests to handle inconsistencies

• Scales better than snooping


Multicore chips

[Figure: a multicore chip with four CPUs, each with its own L1 cache and CPU interface, a snooping unit and a shared L2 cache; an interrupt distributor delivers hardware interrupts as private IRQs, and the chip connects to the northbridge (memory I/O) and southbridge (peripherals I/O).]

• Also known as chip multiprocessors
• Multiple processors, their cache memories, and the interconnect on the same die
• Latencies for chip-to-chip communication drastically reduced


Enforced parallelism

• The era of programmers not caring what’s under the hood is over
• To gain performance, we need to understand the infrastructure
• Different software design decisions based on the architecture
• But without getting too close to the processor
  • Problems with portability
• Common set of problems faced, regardless of the architecture
  • Concurrency, synchronisation, type of parallelism


Concurrent programming

• Sequential program: a single thread of control that executes one instruction and, when it is finished, moves on to the next one
• Concurrent program: a collection of autonomous sequential threads executing logically in parallel (see the sketch after this list)
• The physical execution model is irrelevant
  • Multiprogramming: multiplexed threads on a uniprocessor
  • Multiprocessing: multiplexed threads on a multiprocessor system
  • Distributed processing: multiplexed processes on different machines
• Concurrency is not only parallelism
  • Difference between logical and physical models
  • Interleaved concurrency: logically simultaneous processing
  • Parallelism: physically simultaneous processing
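To make the definition concrete, here is a minimal Java sketch of a concurrent program as a collection of autonomous sequential threads; the class name, thread names, and loop bounds are purely illustrative. Whether the two threads run interleaved on one core (multiprogramming) or truly in parallel on several cores (multiprocessing) is decided by the runtime, not by the code.

```java
// Two autonomous sequential threads executing logically in parallel.
public class ConcurrentHello {
    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 3; i++) {
                // each thread is itself a sequential program
                System.out.println(Thread.currentThread().getName() + " step " + i);
            }
        };
        Thread t1 = new Thread(work, "thread-1");
        Thread t2 = new Thread(work, "thread-2");
        t1.start();           // start both threads ...
        t2.start();
        t1.join();            // ... and wait for them to finish
        t2.join();
    }
}
```

The interleaving of the printed lines differs from run to run, which is exactly the difference between the logical and the physical execution model noted above.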


Reasons for concurrent programming

• Natural application structure
  • The world is not sequential; easier to program in terms of multiple and independent activities
• Increased throughput and responsiveness
  • No reason to block an entire application due to a single event
  • No reason to block the entire system due to a single application
• Enforced by hardware
  • Multicore chips are the standard
• Inherent distribution of contemporary large-scale systems
  • Single application running on multiple machines
  • Client/server, peer-to-peer, clusters
• Concurrent programming introduces problems
  • Need multiple threads for increased performance without compromising the application
  • Need to synchronise concurrent threads for correctness and safety


Synchronisation

• To increase throughput we interleave threads
• Not all interleavings of threads yield acceptable and correct programs
• Correctness: same end effect as if all threads were executed sequentially
• Synchronisation serves two purposes
  • Ensure safety of shared-state updates
  • Coordinate actions of threads
• So we need a way to restrict the interleavings explicitly at the software level


Thread safety

• Multiple threads access a shared resource simultaneously
• Access is safe if and only if
  • Accesses have no effect on the state of the resource (for example, reading the same variable)
  • Accesses are idempotent (for example, y = x²)
  • Accesses are mutually exclusive
    • In other words, accesses are serialised

[Figure: the ‘too much milk’ timeline, 3:05 to 3:50. You arrive home, look in the fridge, see no milk, go to the supermarket, buy milk, and put it in the fridge; your roommate, slightly later, does exactly the same. Resources not protected: too much milk. A minimal code sketch of this race follows.]
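The milk scenario is a check-then-act race. Here is a minimal Java sketch of the unprotected version (the shared counter, the sleep standing in for the trip to the supermarket, and all names are illustrative):

```java
// Unsafe check-then-act: both threads can observe "no milk" before either
// has bought any, so both go shopping and the fridge ends up with two bottles.
public class TooMuchMilk {
    static int bottles = 0;                    // shared, unprotected state

    static void buyMilkIfNeeded(String who) {
        if (bottles == 0) {                    // check: look in the fridge
            try { Thread.sleep(10); }          // go to the supermarket
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            bottles++;                         // act: put milk in the fridge
            System.out.println(who + " bought milk");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread you = new Thread(() -> buyMilkIfNeeded("you"));
        Thread roommate = new Thread(() -> buyMilkIfNeeded("roommate"));
        you.start(); roommate.start();
        you.join(); roommate.join();
        System.out.println("bottles in fridge: " + bottles);   // often 2
    }
}
```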


Mutual exclusion and atomicity

• Prevent more than one thread from accessing a critical section at a given time
• Once a thread is in the critical section, no other thread can enter the critical section until the first thread has left the critical section

• No interleavings of threads within the critical section

• Serialised access

• Critical sections are executed as atomic units

• A single operation, executed to completion

• Can be arbitrary in size (e.g., a whole block of code, or an increment to a variable)

• Multiple ways to implement this

• Semaphores, mutexes, spinlocks, copy-on-write, . . .

• For instance, the synchronized keyword in Java ensures only one thread accesses the synchronised block (see the sketch after this list)

• At the end of the day, it’s all locks around a shared resource

• Shared or no lock to read resource (multiple threads can have it)

• Exclusive lock to update resource (there can be only one)
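A minimal sketch of how the Java synchronized keyword serialises access to a critical section, reusing the illustrative milk example from the previous slide; the lock object and all names are again made up for illustration:

```java
// The synchronized block is the critical section: once one thread is inside,
// any other thread blocks at the lock until the first has left, so the check
// and the update execute as a single atomic unit.
public class OneBottleOfMilk {
    private static final Object fridgeLock = new Object();
    private static int bottles = 0;

    static void buyMilkIfNeeded(String who) {
        synchronized (fridgeLock) {            // acquire the exclusive lock
            if (bottles == 0) {                // check ...
                bottles++;                     // ... and act, atomically
                System.out.println(who + " bought milk");
            }
        }                                      // lock released on exit
    }

    public static void main(String[] args) throws InterruptedException {
        Thread you = new Thread(() -> buyMilkIfNeeded("you"));
        Thread roommate = new Thread(() -> buyMilkIfNeeded("roommate"));
        you.start(); roommate.start();
        you.join(); roommate.join();
        System.out.println("bottles in fridge: " + bottles);   // always 1
    }
}
```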


Potential concurrency problems

• Deadlock: two or more threads stop and wait for each other
  • Usually caused by a cycle in the lock graph
• Livelock: two or more threads continue to execute but they make no real progress toward their goal
  • Example: each thread in a pair of threads undoes what the other thread has done
  • Real-world example: the awkward walking shuffle
    • Walking towards someone, you both try to get out of each other’s way but end up choosing the same path
• Starvation: some thread gets deferred forever
  • Each thread should get a fair chance to make progress
• Race condition: a possible interleaving of threads results in an undesired computation result
  • Two threads access a variable simultaneously and one access is a write
  • A mutex usually solves the problem; contemporary hardware has the compare-and-swap atomic operation (see the sketch after this list)
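The compare-and-swap primitive mentioned in the last bullet is exposed in Java through the java.util.concurrent.atomic classes. A minimal sketch of a lock-free counter built on a CAS retry loop (class name and iteration counts are illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

// compareAndSet(expected, updated) succeeds only if no other thread changed
// the value in between; on failure we simply re-read and retry, so no lock
// is ever taken and the increment never races.
public class CasCounter {
    private static final AtomicInteger counter = new AtomicInteger(0);

    static void increment() {
        int current;
        do {
            current = counter.get();
        } while (!counter.compareAndSet(current, current + 1));   // CAS loop
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> { for (int i = 0; i < 100_000; i++) increment(); };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(counter.get());   // always 200000
    }
}
```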


Parallel performance analysis

• Extent of parallelism in an algorithm
  • A measure of how parallelisable an algorithm is and what we can expect in terms of its ideal performance
• Granularity of parallelism
  • A measure of how well the data/work of the computation is distributed across the processors of the parallel machine
  • The ratio of computation over communication
  • Computation stages are typically separated from communication by synchronisation events
• Locality of computation
  • A measure of how much communication needs to take place


Parallelism extent: Amdahl’s law

[Figure: a program with a 25-second sequential part, a 50-second parallelisable part, and a final 25-second sequential part, 100 seconds in total; parallelising the middle part shrinks it to 10 seconds, giving a total of 60 seconds.]

• Program speedup is defined by the fraction of code that can be parallelised
• In the example, speedup = sequential time / parallel time = 100 seconds / 60 seconds ≈ 1.67
• speedup = 1 / ((1 − p) + p/n) (see the worked check below)
  • p: fraction of work that is parallelisable
  • n: number of parallel processors
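Checking the example against the formula: half of the 100 seconds is parallelisable, so p = 0.5, and the 50-second part shrinking to 10 seconds corresponds to spreading it over n = 5 processors. Then speedup = 1 / ((1 − 0.5) + 0.5/5) = 1 / 0.6 ≈ 1.67, which matches the 100 s / 60 s ratio above.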


Implications of Amdahl’s law

• Speedup tends to 1/(1 − p) as the number of processors tends to infinity

• Parallel programming is only worthwhile when programs have a lot of work that is parallel in nature (so-called embarrassingly parallel)

[Figure: speedup versus number of processors. Linear speedup is the reference line; typical speedup is less than linear; super-linear speedup is possible due to caching and locality effects.]


Fine vs. coarse granularity

Fine-grain parallelism
• Low computation-to-communication ratio
• Small amounts of computation between communication stages
• Less opportunity for performance enhancement
• High communication overhead

Coarse-grain parallelism
• High computation-to-communication ratio
• Large amounts of computation between communication stages
• More opportunity for performance enhancement
• Harder to balance efficiently


Load balancing

• Processors finishing early have to wait for the processor with the largest amount of work to complete; this leads to idle time and lower utilisation
• Static load balancing: the programmer makes decisions and fixes a priori the amount of work distributed to each processor
  • Works well for homogeneous parallel processing
    • All processors are the same
    • Each core has an equal amount of work
  • Not so well for heterogeneous parallel processing
    • Some processors are faster than others
    • Difficult to distribute work evenly
• Dynamic load balancing: when one processor finishes its allocated work, it takes work from the processor with the heaviest workload
  • Also known as task stealing (see the sketch after this list)
  • Ideal for uneven workloads and heterogeneous systems
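One place dynamic load balancing shows up in practice is Java’s ForkJoinPool, whose scheduler uses work stealing: idle workers take tasks from the queues of busy ones. A minimal sketch (the task, threshold, and data are illustrative, not part of the slides):

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sum an array by splitting it into chunks; the pool's work-stealing
// scheduler balances the chunks across workers dynamically.
class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000;   // illustrative chunk size
    private final long[] data;
    private final int lo, hi;

    SumTask(long[] data, int lo, int hi) { this.data = data; this.lo = lo; this.hi = hi; }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {                // small chunk: compute directly
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) / 2;
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                               // left half may be stolen by an idle worker
        long rightSum = right.compute();           // this worker keeps the right half
        return left.join() + rightSum;
    }
}

public class WorkStealingDemo {
    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        Arrays.fill(data, 1L);
        long total = ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
        System.out.println("total = " + total);    // prints total = 1000000
    }
}
```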


Communication modes

• Processors need to exchange data and/or control messages
  • Control messages are usually short, data messages longer
• Computation and communication are concurrent operations
  • Similar to instruction pipelining
• Synchronous communication: sender is notified when the message is received
• Asynchronous communication: sender only knows the message has been sent
• Blocking communication: sender waits until the message is fully transmitted, receiver waits until the message is fully received
• Non-blocking communication: processing continues even if the message has not been transmitted
• Depending on the platform, the programmer might be exposed to the communication mechanism and have to orchestrate it
  • Point-to-point, broadcast/reduce, all-to-all, scatter/gather
  • Radically different programming paradigms


Communication cost model

• C = f × (o + l + n/(m·B) + t − overlap) (see the illustrative calculation after this list)
  • C: cost of communication
  • f: frequency of message exchange
  • o: overhead per message
  • l: latency per message
  • n: total amount of data sent
  • m: number of messages exchanged
  • B: communication bandwidth
  • t: contention per message
  • overlap: latency hidden by overlapping communication with computation through concurrency
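A small Java helper that simply encodes the formula, with made-up parameter values to show how the terms combine (everything here is illustrative):

```java
// Illustrative encoding of the communication cost model above.
public class CommCost {
    // C = f * (o + l + n/(m*B) + t - overlap)
    static double cost(double f, double o, double l, double n,
                       double m, double B, double t, double overlap) {
        return f * (o + l + n / (m * B) + t - overlap);
    }

    public static void main(String[] args) {
        // e.g. 100 message exchanges per second, 2 us overhead and 5 us latency
        // per message, 1 MB of data in 100 messages over a 1 GB/s link,
        // 1 us contention per message, 3 us hidden by overlapping computation
        double c = cost(100, 2e-6, 5e-6, 1e6, 100, 1e9, 1e-6, 3e-6);
        System.out.printf("communication cost: %.6f s per second of execution%n", c);
    }
}
```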


Locality

• Ensure that each processor is close to the data it processes
  • Move processing to data, not data to processing
• Data transfer is either a known cost or a hidden cost
  • Known if we explicitly move data, hidden if data needs to be transferred due to a high-level programming interface
• A good task scheduler will minimise data movement
• Possible scenario: parallel computation is serialised due to bus contention and lack of bandwidth
  • Distribute data across processors to relieve contention and increase effective bandwidth
  • Also known as partitioning, or declustering


Example architectures

• Uniform access
  • Centrally located store, processors are equidistant
  • Typical example: the physical memory of a shared-memory multiprocessor system
• Non-uniform access
  • Physically partitioned data, but accessible by all
  • Processors see the same address space, but the placement of data affects performance
  • Typical example: Non-Uniform Memory Access (NUMA) architectures for multicore systems
