Welcome to COMP4300/8300 – Parallel Systems
UltraSPARC T2 (Niagara-2) multicore chip layout
(courtesy of T. Okazaki, Flickr)
COMP4300/8300 L1: Introduction to Parallel Systems 2017
Lecture Overview
• parallel computing concepts and scales
• sample application areas
• parallel programming's rise, decline and rebirth:
  – the role of Moore's Law and Dennard Scaling
  – the multicore revolution and a new design era
• the Top 500 supercomputers and challenges for 'petascale computing'
• why parallel programming is hard
• course contact
• assumed knowledge and assessment
Parallel Computing: Concept and Rationale
The idea:
• split your computation into bits that can be executed simultaneously
Motivation:
• Speed, Speed, Speed · · · at a cost-effective price
  – if we didn't want it to go faster we wouldn't bother with the hassles of parallel programming!
• reduce the time to solution to acceptable levels
  – no point in taking 1 week to predict tomorrow's weather!
  – simulations that take months are NOT useful in a design environment

Parallelism is when the different components of a computation execute simultaneously. It is a subset of concurrency, in which the components may execute in any order.
Parallelization
Split the program up and run the parts simultaneously on different processors
• on p computers, the time to solution should (ideally!) be reduced by a factor of p
• parallel programming: the art of writing the parallel code
• parallel computer: the hardware on which we run our parallel code
This course will discuss both!
Beyond raw compute power, other motivations may include:
• enabling more accurate simulations in the same time (finer grids)
• providing access to huge aggregate memories
• providing more and/or better input/output capacity
• hiding latency
Scales of Parallelism
• within a CPU/core: pipelined instruction execution, multiple instruction issue (superscalar), other forms of instruction-level parallelism, SIMD units*
• within a chip: multiple cores*, hardware multithreading*, accelerator units* (with multiple cores), transactional memory*
• within a node: multiple sockets* (CPU chips), interleaved memory access (multiple DRAM chips), disk block striping / RAID (multiple disks)
• within a SAN (system area network): multiple nodes* (clusters, typical supercomputers), parallel filesystems
• within the internet: grid computing*, distributed workflows*
*requires significant parallel programming effort
What programming paradigms are typically applied to each feature?
Sample Application Areas: Grand Challenge Problems
• fluid flow problems
  – weather prediction and climate change, ocean flows
  – aerodynamic modeling for cars, planes, rockets etc
• structural mechanics
  – building, bridge, car etc strength analysis
  – car crash simulation
• speech and character recognition, image processing
• visualization, virtual reality
• semiconductor design, simulation of new chips
• structural biology, design of drugs
• human genome mapping
• financial market analysis and simulation
• data mining, machine learning, games
Example: World Climate Modeling
• atmosphere divided into 3D regions or cells
• complex mathematical equations describe conditions in each cell, e.g. pressure, temperature, wind speed, etc
  – conditions in each cell change according to conditions in neighbouring cells
  – updates repeated many times to model the passage of time
  – cells are affected by more distant cells the longer the forecast
• assume:
  – cells are 1×1×1 mile to a height of 10 miles ⇒ 5×10^8 cells
  – 200 floating point operations (FLOPs) to update each cell ⇒ 10^11 FLOPs per timestep
  – a timestep represents 10 minutes and we want 10 days ⇒ a total of 10^15 FLOPs
• on a 100 MFLOPS computer this would require 100 days
• on a 1 TFLOPS computer this would take just 10 minutes
The (Rocky) Rise of Parallel Computing
• early ideas: 1946: Cellular Automata, John von Neumann; 1958: SOLOMON (1024 1-bit processors), Daniel Slotnick; 1962: Burroughs D825, a 4-CPU SMP
• 1967: Gene Amdahl proposes Amdahl's Law, debates with Slotnick at AFIPS Conf.
• 1970's: vector processors become the mainstream supercomputers (e.g. Cray-1); a few 'true' parallel computers are built
• 1980's: small-scale parallel vector processors dominant (e.g. Cray X-MP)

When a farmer needs more ox power to plow his field, he doesn't get a bigger ox, he gets another one. Then he puts the oxen in parallel. (Grace Hopper, 1985-7)

If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens? (Seymour Cray, 1986)

• 1988: Reevaluating Amdahl's Law, John Gustafson
• late 80's+: large-scale parallel computers begin to emerge: Computing Surface (Meiko), QCD machine (Columbia Uni), iPSC/860 hypercube (Intel)
• 90's–00's: shared and distributed memory parallel computers used for servers (small-scale) and supercomputers (large-scale)
Moore’s Law & Dennard Scaling
Two “laws” underpin the exponential increase in performance of (serial) processors from
the 1970’s
Moore's Law and Dennard Scaling Undermine Parallel Computing
• parallel computing looked promising in the 90's, but many companies failed due to the 'free lunch' from the combination of Moore's Law and Dennard Scaling
  – Why parallelize my codes? Just wait two years, and the processors will be 4 times faster!

On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap . . . (Ken Kennedy, CRPC Director, 1994)
• demography of Parallel Computing (mid 90's, origin unknown)

[Figure: a 2D diagram with axes processor speed (R) and degree of parallelism (P), classifying practitioners by whether they bet on increasing R, P, or both: Heretics, Luddites, Fanatics, Agnostics, True Believers and Luke-warm Believers]
Chip Size and Clock Speed
The 'free lunch' continued into the early 2000's, allowing the construction of faster and more complex (serial) processors: systems so fast that they could not communicate across the chip in a single clock cycle!
End of Dennard Scaling and Other Free Lunch Ingredients
• even with Dennard Scaling, we saw an exponential increase in the power density of chips 1985–2000
  – a 2000 Intel chip was equivalent to a hotplate, and would have ⇒ a rocket nozzle by 2010!
• then Dennard Scaling ceased around 2005!!
• instruction level parallelism (ILP) also reached its limits
The Multicore Revolution
• vendors began putting multiple CPUs (cores) on a chip, and stopped increasing clock frequency
  – 2004: Sun releases the dual-core Sparc IV, heralding the start of the multicore era
• (dynamic) power of a chip is given by P = QCV²f, where V is the voltage, Q is the number of transistors, C is a transistor's capacitance and f is the clock frequency
  – but on a given chip, f ∝ V, so P ∝ Qf³!
• (ideal) parallel performance for p cores is given by R = pf, but Q ∝ p
  – double p ⇒ double R, but also double P
  – double p, halve f ⇒ maintain R, but quarter P!
• doubling the number of cores is better than doubling clock speed!
• Moore's Law (increase Q at constant QC) is expected to continue (till ???), so we can gain in R at constant P
• cores can be much simpler (hence utilize less ILP), but there can be many
• chip design and testing costs are significantly reduced
• parallelism now must be exposed by software: The Free Lunch Is Over!
A New Chip Development Era
1960–2010                      2010–?
few transistors                no shortage of transistors
no power limitations           severe power limitations
maximize transistor utility    minimize energy
generalize                     customize
We are now seeing:
• (customized) accelerators, generally manycore with low clock frequency
  – e.g. Graphics Processing Units (GPUs), customized for fast numerical calculations
• 'dark silicon': the need to turn off parts of the chip to reduce power
• hardware-software codesign: speed via specialization
The Top 500 Most Powerful Computers: June 2015
The Top 500 provides an interesting window to these hardware trends and issues.
(http://www.top500.org/blog/slides-highlights-of-the-45th-top500-list/)
Top500: Performance Trends
(http://www.top500.org/blog/slides-highlights-of-the-45th-top500-list/)
The Top500: Multicore Emergence
(http://www.top500.org/blog/slides-highlights-of-the-45th-top500-list/)
Petascale and Beyond: Challenges and Opportunities
Level           Characteristic             Challenges/Opportunities
as a whole      sheer number of nodes      • programming language/environment
                (Tianhe-2 has the          • fault tolerance
                equivalent of 3M cores)
within a node   heterogeneity              • what to use when
                (Tianhe-2 uses CPUs        • co-location of data with the unit
                and GPUs)                    processing it
within a chip   energy minimization        • minimize data size and movement,
                (processors already have     including use of just enough precision
                frequency and voltage      • specialized cores
                scaling)

In RSCS we are working in all these areas.
Software: Why Parallel Programming is Hard
• writing (correct and efficient) parallel programs is hard!
  – hard to expose enough parallelism; hard to debug!
• getting (close to ideal) speedup is hard! Overheads include:
  – communicating shared data (e.g. cache line invalidations and resulting reloads)
  – synchronization (barriers and locks)
  – need for redundant computations; balancing load evenly
Also, not all of the application may be parallelizable:
Amdahl's Law: given a fraction f of 'fast' computation, at rate Rf, and Rs being the 'slow' computation rate, the overall rate is R = ((1−f)/Rs + f/Rf)^−1
  – interpreted for parallel execution with p processors: f is the fraction of non-serial computation, which (ideally) executes at the rate Rf = pRs
  – e.g. with f = 0.9, R = 10Rs at p = ∞!
  – counterargument (Gustafson's Law): 1−f is not fixed, but decreases with the data size N, e.g. 1−f ∝ N^(−1/2)
Health Warning!
• the course is run every other year (drop out this year and it won't be repeated until 2019)
• it's a 4000/8000-level course; it's supposed to:
  – be more challenging than a 3000-level course!
  – expose you (it will!) to 'bleeding edge' technologies
  – be less well structured
  – have a greater expectation on you
  – have more student participation
  – be fun!
• it assumes you have done courses in concurrency (e.g. COMP2310) and in 2000-level mathematics
• it will require strong programming skills – in C!
• Nathan Robertson, 2002 honours student: Parallel systems and thread safety at Medicare: 2/16 understood it - the other guy was a $70/hr contractor
• attendance at lectures and pracs is strongly recommended (even though not assessed)
Course Contact
• course web site: http://cs.anu.edu.au/courses/comp4300
  (we will use wattle only for lecture recordings)
• course coordinator & lecturer/tutor:
  Peter Strazdins, CSIT N217, 6125-5140, comp4300@cs (.anu.edu.au)
• discussion forum accessible via Piazza
• course schedule
  – note: practicals start in week 3! Register via StReAMs
Proposed Assessment Scheme and Texts
• see the assessment web page: 2 assignments (20% + 15%), mid-semester exam 20%, final exam 45%
• some reading will be essential for the course. Recommended texts:
  – Principles of Parallel Programming, Lin & Snyder. Available Bookshop, Short-Term Loan (2)
  – Introduction to Parallel Computing, 2nd Ed., Grama et al. Available from library.anu.edu.au
  – Introduction to High Performance Computing for Scientists and Engineers, Hager & Wellein. Available online.