Welcome to COMP4300/8300 – Parallel Systems
UltraSPARC T2 (Niagara-2) multicore chip layout
(courtesy of T. Okazaki, Flickr)
COMP4300/8300 L1: Introduction to Parallel Systems 2017
Lecture Overview
• parallel computing concepts and scales
• sample application areas
• parallel programming's rise, decline and rebirth:
  – the role of Moore's Law and Dennard Scaling
  – the multicore revolution and a new design era
• the Top 500 supercomputers and challenges for 'petascale computing'
• why parallel programming is hard
• course contact
• assumed knowledge and assessment
Parallel Computing: Concept and Rationale
The idea:
• split your computation into bits that can be executed simultaneously
Motivation:
• Speed, Speed, Speed · · · at a cost-effective price
  – if we didn't want it to go faster we wouldn't bother with the hassles of parallel programming!
• reduce the time to solution to acceptable levels
  – no point in taking 1 week to predict tomorrow's weather!
  – simulations that take months are NOT useful in a design environment

Parallelism is when the different components of a computation execute simultaneously. It is a subset of concurrency, in which the components may execute in any order.
Parallelization
Split the program up and run the parts simultaneously on different processors
• on p computers, the time to solution should (ideally!) be reduced by a factor of p
• parallel programming: the art of writing the parallel code
• parallel computer: the hardware on which we run our parallel code
This course will discuss both!
Beyond raw compute power, other motivations may include:
• enabling more accurate simulations in the same time (finer grids)
• providing access to huge aggregate memories
• providing more and/or better input/output capacity
• hiding latency
Scales of Parallelism
• within a CPU/core: pipelined instruction execution, multiple instruction issue (superscalar), other forms of instruction-level parallelism, SIMD units*
• within a chip: multiple cores*, hardware multithreading*, accelerator units* (with multiple cores), transactional memory*
• within a node: multiple sockets* (CPU chips), interleaved memory access (multiple DRAM chips), disk block striping / RAID (multiple disks)
• within a SAN (system area network): multiple nodes* (clusters, typical supercomputers), parallel filesystems
• within the internet: grid computing*, distributed workflows*
*requires significant parallel programming effort
What programming paradigms are typically applied to each feature?
Sample Application Areas: Grand Challenge Problems
• fluid flow problems
  – weather prediction and climate change, ocean flows
  – aerodynamic modeling for cars, planes, rockets etc
• structural mechanics
  – building, bridge, car etc strength analysis
  – car crash simulation
• speech and character recognition, image processing
• visualization, virtual reality
• semiconductor design, simulation of new chips
• structural biology, design of drugs
• human genome mapping
• financial market analysis and simulation
• data mining, machine learning, games
Example: World Climate Modeling
• atmosphere divided into 3D regions or cells
• complex mathematical equations describe conditions in each cell, e.g. pressure, temperature, wind speed, etc
  – conditions in each cell change according to conditions in neighbouring cells
  – updates repeated many times to model the passage of time
  – cells are affected by more distant cells the longer the forecast
• assume:
  – cells are 1×1×1 mile to a height of 10 miles ⇒ 5×10^8 cells
  – 200 floating point operations (FLOPs) to update each cell ⇒ 10^11 FLOPs per timestep
  – a timestep represents 10 minutes and we want 10 days ⇒ a total of 10^15 FLOPs
• on a 100 MFLOPS computer this would require 100 days
• on a 1 TFLOPS computer this would take just 10 minutes
The (Rocky) Rise of Parallel Computing
• early ideas: 1946: Cellular Automata, John von Neumann; 1958: SOLOMON (1024 1-bit processors), Daniel Slotnick; 1962: Burroughs D825, a 4-CPU SMP
• 1967: Gene Amdahl proposes Amdahl's Law, debates with Slotnick at AFIPS Conf.
• 1970's: vector processors become the mainstream supercomputers (e.g. Cray-1); a few 'true' parallel computers are built
• 1980's: small-scale parallel vector processors dominant (e.g. Cray X-MP)

When a farmer needs more ox power to plow his field, he doesn't get a bigger ox, he gets another one. Then he puts the oxen in parallel. (Grace Hopper, 1985-7)

If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens? (Seymour Cray, 1986)

• 1988: Reevaluating Amdahl's Law, John Gustafson
• late 80's+: large-scale parallel computers begin to emerge: Computing Surface (Meiko), QCD machine (Columbia Uni), iPSC/860 hypercube (Intel)
• 90's–00's: shared and distributed memory parallel computers used for servers (small-scale) and supercomputers (large-scale)
Moore’s Law & Dennard Scaling
Two “laws” underpin the exponential increase in performance of (serial) processors from
the 1970’s
Moore's Law and Dennard Scaling Undermine Parallel Computing
• parallel computing looked promising in the 90's, but many companies failed due to the 'free lunch' from the combination of Moore's Law and Dennard Scaling
  – Why parallelize my codes? Just wait two years, and the processors will be 4 times faster!

On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap . . . (Ken Kennedy, CRPC Director, 1994)
• demography of Parallel Computing (mid 90's, origin unknown)

[Figure: a 2D diagram with axes processor speed (R) and degree of parallelism (P), classifying practitioners by whether they bet on increasing R, P, or both: Heretics, Luddites, Fanatics, Agnostics, True Believers and Luke-warm Believers]
Chip Size and Clock Speed
The 'free lunch' continued into the early 2000's, allowing the construction of faster and more complex (serial) processors: systems so fast that they could not communicate across the chip in a single clock cycle!
End of Dennard Scaling and Other Free Lunch Ingredients
• even with Dennard Scaling, we saw an exponential increase in the power density of chips 1985–2000
  – a 2000 Intel chip was equivalent to a hotplate, and would have ⇒ a rocket nozzle by 2010!
• then Dennard Scaling ceased around 2005!!
• instruction level parallelism (ILP) also reached its limits
The Multicore Revolution
• vendors began putting multiple CPUs (cores) on a chip, and stopped increasing clock frequency
  – 2004: Sun releases the dual-core Sparc IV, heralding the start of the multicore era
• (dynamic) power of a chip is given by P = QCV²f, where V is the voltage, Q is the number of transistors, C is a transistor's capacitance and f is the clock frequency
  – but on a given chip, f ∝ V, so P ∝ Qf³!
• (ideal) parallel performance for p cores is given by R = pf, but Q ∝ p
  – double p ⇒ double R, but also double P
  – double p, halve f ⇒ maintain R, but quarter P!
• doubling the number of cores is better than doubling clock speed!
• Moore's Law (increase Q at constant QC) is expected to continue (till ???), so we can gain in R at constant P
• cores can be much simpler (hence utilize less ILP), but there can be many
• chip design and testing costs are significantly reduced
• parallelism now must be exposed by software: The Free Lunch Is Over!
A New Chip Development Era
1960–2010                      2010–?
few transistors                no shortage of transistors
no power limitations           severe power limitations
maximize transistor utility    minimize energy
generalize                     customize
We are now seeing:
• (customized) accelerators, generally manycore with low clock frequency
  – e.g. Graphics Processing Units (GPUs), customized for fast numerical calculations
• 'dark silicon': the need to turn off parts of the chip to reduce power
• hardware-software codesign: speed via specialization
The Top 500 Most Powerful Computers: June 2015
The Top 500 provides an interesting window to these hardware trends and issues.
(http://www.top500.org/blog/slides-highlights-of-the-45th-top500-list/)
Top500: Performance Trends
(http://www.top500.org/blog/slides-highlights-of-the-45th-top500-list/)
The Top500: Multicore Emergence
(http://www.top500.org/blog/slides-highlights-of-the-45th-top500-list/)
Petascale and Beyond: Challenges and Opportunities
Level           Characteristic             Challenges/Opportunities
as a whole      sheer number of nodes      • programming language/environment
                (Tianhe-2 has the          • fault tolerance
                equivalent of 3M cores)
within a node   heterogeneity              • what to use when
                (Tianhe-2 uses CPUs        • co-location of data with the unit
                and GPUs)                    processing it
within a chip   energy minimization        • minimize data size and movement,
                (processors already have     including use of just enough precision
                frequency and voltage      • specialized cores
                scaling)

In RSCS we are working in all these areas.
Software: Why Parallel Programming is Hard
• writing (correct and efficient) parallel programs is hard!
  – hard to expose enough parallelism; hard to debug!
• getting (close to ideal) speedup is hard! Overheads include:
  – communicating shared data (e.g. cache line invalidations and resulting reloads)
  – synchronization (barriers and locks)
  – need for redundant computations; balancing load evenly
Also, not all of the application may be parallelizable:
Amdahl's Law: given a fraction f of 'fast' computation, at rate Rf, and Rs being the 'slow' computation rate, the overall rate is R = ((1−f)/Rs + f/Rf)^−1
  – interpreted for parallel execution with p processors: f is the fraction of non-serial computation, which (ideally) executes at the rate Rf = pRs
  – e.g. with f = 0.9, R = 10Rs at p = ∞!
  – counterargument (Gustafson's Law): 1−f is not fixed, but decreases with the data size N, e.g. 1−f ∝ N^(−1/2)
Health Warning!
• the course is run every other year (drop out this year and it won't be repeated until 2019)
• it's a 4000/8000-level course; it's supposed to:
  – be more challenging than a 3000-level course!
  – expose you (it will!) to 'bleeding edge' technologies
  – be less well structured
  – have a greater expectation on you
  – have more student participation
  – be fun!
• it assumes you have done courses in concurrency (e.g. COMP2310) and in 2000-level mathematics
• it will require strong programming skills – in C!
• Nathan Robertson, 2002 honours student: Parallel systems and thread safety at Medicare: 2/16 understood it - the other guy was a $70/hr contractor
• attendance at lectures and pracs is strongly recommended (even though not assessed)
Course Contact
• course web site: http://cs.anu.edu.au/courses/comp4300
  (we will use wattle only for lecture recordings)
• course coordinator & lecturer/tutor:
  Peter Strazdins, CSIT N217, 6125-5140, comp4300@cs (.anu.edu.au)
• discussion forum accessible via Piazza
• course schedule
  – note: practicals start in week 3! Register via StReAMs
Proposed Assessment Scheme and Texts
• see the assessment web page: 2 assignments (20% + 15%), mid-semester exam 20%, final exam 45%
• some reading will be essential for the course. Recommended texts:
  – Principles of Parallel Programming, Lin & Snyder. Available Bookshop, Short-Term Loan (2)
  – Introduction to Parallel Computing, 2nd Ed., Grama et al. Available from library.anu.edu.au
  – Introduction to High Performance Computing for Scientists and Engineers, Hager & Wellein. Available online.