CPE 779 Parallel Computing - Spring 2012
CPE 779 Parallel Computing
http://www.ju.edu.jo/sites/Academic/abusufah/Material/cpe779_spr12/index.html
Lecture 1: Introduction
Walid Abu-Sufah
University of Jordan
Acknowledgment: Collaboration
This course is being offered in collaboration with
• The IMPACT research group at the University of Illinois http://impact.crhc.illinois.edu/
• The Computation-based Science and Technology Research Center (CSTRC) of the Cyprus Institute http://cstrc.cyi.ac.cy/
Acknowledgment: Slides
Some of the slides used in this course are based on slides by
• Jim Demmel, University of California at Berkeley & Horst Simon, Lawrence Berkeley National Lab (LBNL) http://www.cs.berkeley.edu/~demmel/cs267_Spr12/
• Wen-mei Hwu of the University of Illinois and David Kirk, NVIDIA Corporation http://courses.engr.illinois.edu/ece408/ece408_syll.html
• Kathy Yelick, University of California at Berkeley http://www.cs.berkeley.edu/~yelick/cs194f07
Course Motivation
In the last few years:
• Conventional sequential processors cannot get faster
  • Previously, clock speed doubled every 18 months
• All computers will be parallel
• >>> All programs will have to become parallel programs
  • Especially programs that need to run faster
Course Motivation (continued)
There will be a huge change in the entire computing industry
• Previously the industry depended on selling new computers by running their users' programs faster without the users having to reprogram them.
• Multi-/many-core chips have started a revolution in the software industry
Course Motivation (continued)
Large research activities to address this issue are underway
• Computer companies: Intel, Microsoft, Nvidia, IBM, etc.
  • Parallel programming is a concern for the entire computing industry
• Universities
  • Berkeley's ParLab (2008: $20 million grant)
Course Goals
Part 1 (~4 weeks)
• Focus on the techniques that are most appropriate for multicore programming and the use of parallelism to improve program performance. Topics include:
  • performance analysis and tuning
  • data techniques
  • shared data structures
  • load balancing and task parallelism
  • synchronization
Course Goals (continued - I)
Part 2 (~12 weeks)
• Learn how to program massively parallel processors and achieve
  • high performance
  • functionality and maintainability
  • scalability across future generations
• Acquire the technical knowledge required to achieve the above goals
  • principles and patterns of parallel algorithms
  • processor architecture features and constraints
  • programming API, tools and techniques
Outline of rest of lecture
• Why powerful computers must use parallel processors (all of them, including your laptops and handhelds)

• Examples of Computational Science and Engineering (CSE) problems which require powerful computers (commercial problems too)

• Why writing (fast) parallel programs is hard (but things are improving)
• Principles of parallel computing performance
• Structure of the course
What is Parallel Computing?
• Parallel computing: using multiple processors in parallel to solve problems (execute applications) more quickly than with a single processor
• Examples of parallel machines:
  • A cluster computer that contains multiple PCs combined together with a high speed network
  • A shared memory multiprocessor (SMP*), built by connecting multiple processors to a single memory system
  • A Chip Multi-Processor (CMP), which contains multiple processors (called cores) on a single chip
• Concurrent execution comes from the desire for performance
* Technically, SMP stands for "Symmetric Multi-Processor"
Units of Measure
• High Performance Computing (HPC) units are:
  • Flop: floating point operation
  • Flop/s: floating point operations per second
  • Bytes: size of data (a double precision floating point number is 8 bytes)
• Typical sizes are millions, billions, trillions…
  Mega:  Mflop/s = 10^6 flop/sec;  Mbyte = 2^20 = 1,048,576 ~ 10^6 bytes
  Giga:  Gflop/s = 10^9 flop/sec;  Gbyte = 2^30 ~ 10^9 bytes
  Tera:  Tflop/s = 10^12 flop/sec; Tbyte = 2^40 ~ 10^12 bytes
  Peta:  Pflop/s = 10^15 flop/sec; Pbyte = 2^50 ~ 10^15 bytes
  Exa:   Eflop/s = 10^18 flop/sec; Ebyte = 2^60 ~ 10^18 bytes
  Zetta: Zflop/s = 10^21 flop/sec; Zbyte = 2^70 ~ 10^21 bytes
  Yotta: Yflop/s = 10^24 flop/sec; Ybyte = 2^80 ~ 10^24 bytes
• Current fastest (public) machine ~ 11 Pflop/s
• Up-to-date list at www.top500.org
Technology Trends: Microprocessor Capacity
2X transistors/Chip Every 1.5 years
Called “Moore’s Law”
Microprocessors have become smaller, denser, and more powerful.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Slide source: Jack Dongarra
Impact of Device Shrinkage
• What happens when the feature size (transistor size) shrinks by a factor of x?
• Clock rate goes up by x because wires are shorter
  • actually less than x, because of power consumption
• Transistors per unit area go up by x^2
• Die size also tends to increase
  • typically by another factor of ~x
• Raw computing power of the chip goes up by ~x^4!
  • typically x^3 of this is devoted to either on-chip
    • parallelism: hidden parallelism such as ILP
    • locality: caches
• So most programs run x^3 times faster, without changing them
Power Density Limits Serial Performance
[Chart: power density (W/cm^2), on a log scale from 1 to 10,000, of Intel processors from the 4004 (circa 1970) through the 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6 (toward 2010). Power density rises toward levels comparable to a hot plate, a nuclear reactor, a rocket nozzle, and eventually the Sun's surface. Source: Patrick Gelsinger, Shenkar Bokar, Intel]
Scaling clock speed (business as usual) will not work
• High performance serial processors waste power
  - Speculation, dynamic dependence checking, etc. burn power
  - Implicit parallelism discovery
• More transistors, but not faster serial processors
• Concurrent systems are more power efficient
  - Dynamic power is proportional to V^2 f C
  - Increasing frequency (f) also increases supply voltage (V): a cubic effect
  - Increasing the number of cores increases capacitance (C), but only linearly
  - Save power by lowering clock speed
Revolution in Processors
• Chip density continues to increase ~2x every 2 years
• Clock speed is not increasing
• The number of processor cores may double instead
• Power is under control, no longer growing
Parallelism in 2012?
• These arguments are no longer theoretical
• All major processor vendors are producing multicore chips
  • Every machine will soon be a parallel machine
  • To keep doubling performance, parallelism must double
• Which (commercial) applications can use this parallelism?
  • Do they have to be rewritten from scratch?
• Will all programmers have to be parallel programmers?
  • A new software model is needed
  • Try to hide complexity from most programmers – eventually
  • In the meantime, we need to understand it
• The computer industry is betting on this big change, but does not have all the answers
  • Berkeley ParLab was established to work on this
Memory is Not Keeping Pace
Technology trends work against a constant or increasing memory per core:
• Memory density is doubling every three years; processor logic is doubling every two
• Storage costs (dollars/Mbyte) are dropping only gradually compared to logic costs
Source: David Turek, IBM
Cost of Computation vs. Memory
Question: Can you double concurrency without doubling memory?
• Strong scaling: fixed problem size, increase the number of processors
• Weak scaling: grow the problem size proportionally to the number of processors
Source: IBM
The TOP500 Project
• Listing the 500 most powerful computers in the world
• Yardstick: Rmax of Linpack
  • Solve Ax=b, a dense problem; the matrix is random
  • Dominated by dense matrix-matrix multiply
• Updated twice a year:
  • ISC'xy in June in Germany
  • SCxy in November in the U.S.
• All information available from the TOP500 web site at: www.top500.org
38th List: The TOP10
(Rank. Computer, manufacturer (system details); site, country; cores; Rmax; power)

1. K Computer, Fujitsu (SPARC64 VIIIfx 2.0GHz, Tofu Interconnect); RIKEN Advanced Institute for Computational Science, Japan; 795,024 cores; Rmax 10.51 Pflops; 12.66 MW
2. Tianhe-1A, NUDT (NUDT TH MPP, Xeon 6C, NVidia, FT-1000 8C); National SuperComputer Center in Tianjin, China; 186,368 cores; Rmax 2.566 Pflops; 4.04 MW
3. Jaguar, Cray (Cray XT5, HC 2.6 GHz); Oak Ridge National Laboratory, USA; 224,162 cores; Rmax 1.759 Pflops; 6.95 MW
4. Nebulae, Dawning (TC3600 Blade, Intel X5650, NVidia Tesla C2050 GPU); National Supercomputing Centre in Shenzhen, China; 120,640 cores; Rmax 1.271 Pflops; 2.58 MW
5. TSUBAME-2, NEC/HP (HP ProLiant, Xeon 6C, NVidia, Linux/Windows); GSIC, Tokyo Institute of Technology, Japan; 73,278 cores; Rmax 1.192 Pflops; 1.40 MW
6. Cielo, Cray (Cray XE6, 8C 2.4 GHz); DOE/NNSA/LANL/SNL, USA; 142,272 cores; Rmax 1.110 Pflops; 3.98 MW
7. Pleiades, SGI (SGI Altix ICE 8200EX/8400EX); NASA/Ames Research Center/NAS, USA; 111,104 cores; Rmax 1.088 Pflops; 4.10 MW
8. Hopper, Cray (Cray XE6, 6C 2.1 GHz); DOE/SC/LBNL/NERSC, USA; 153,408 cores; Rmax 1.054 Pflops; 2.91 MW
9. Tera 100, Bull (Bull bullx super-node S6010/S6030); Commissariat a l'Energie Atomique (CEA), France; 138,368 cores; Rmax 1.050 Pflops; 4.59 MW
10. Roadrunner, IBM (BladeCenter QS22/LS21); DOE/NNSA/LANL, USA; 122,400 cores; Rmax 1.042 Pflops; 2.34 MW
Performance Development
[Chart: TOP500 performance development over time, log scale from 100 Mflop/s to 100 Pflop/s, with curves for SUM, N=1, and N=500. On the first list: 59.7 GFlop/s (N=1), 400 MFlop/s (N=500), 1.17 TFlop/s (SUM). On the latest list: 10.51 PFlop/s (N=1), 50.9 TFlop/s (N=500), 74.2 PFlop/s (SUM)]
Projected Performance Development
[Chart: projected TOP500 performance development, extending the SUM, N=1, and N=500 trend lines on a log scale from 100 Mflop/s up to 1 Eflop/s]
Moore’s Law reinterpreted
• Number of cores per chip can double every two years
• Clock speed will not increase (possibly decrease)
• Need to deal with systems with millions of concurrent threads
• Need to deal with inter-chip parallelism as well as intra-chip parallelism
Outline
• Why powerful computers must be parallel processors
• Large CSE problems require powerful computers
• Why writing (fast) parallel programs is hard
• Structure of the course
Drivers for Change
• Continued exponential increase in computational power
  • Can simulate what theory and experiment can't do
• Continued exponential increase in experimental data
• Moore’s Law applies to sensors too
• Need to analyze all that data
Simulation: The Third Pillar of Science
• Traditional scientific and engineering method:
  (1) Do theory or paper design
  (2) Perform experiments or build a system
• Limitations:
  - Too difficult: build large wind tunnels
  - Too expensive: build a throw-away passenger jet
  - Too slow: wait for climate or galactic evolution
  - Too dangerous: weapons, drug design, climate experimentation
• Computational science and engineering paradigm:
  (3) Use computers to simulate and analyze the phenomenon
  • Based on known physical laws and efficient numerical methods
  • Analyze simulation results with computational tools and methods beyond what is possible manually
[Diagram: Simulation alongside Theory and Experiment as the three pillars of science]
Data Driven Science
• Scientific data sets are growing exponentially
  - The ability to generate data is exceeding our ability to store and analyze it
  - Simulation systems and some observational devices grow in capability with Moore's Law
• Petabyte (PB) data sets will soon be common:
  • Climate modeling: estimates for the next IPCC data are in 10s of petabytes
  • Genome: JGI alone will have 0.5 petabyte of data this year, doubling each year
  • Particle physics: the LHC is projected to produce 16 petabytes of data per year
  • Astrophysics: LSST and others will produce 5 petabytes/year (via a 3.2 gigapixel camera)
• Create scientific communities with "Science Gateways" to data
Some Particularly Challenging Computations
• Science
  • Global climate modeling
  • Biology: genomics; protein folding; drug design
  • Astrophysical modeling
  • Computational chemistry
  • Computational material sciences and nanosciences
• Engineering
  • Semiconductor design
  • Earthquake and structural modeling
  • Computational fluid dynamics (airplane design)
  • Combustion (engine design)
  • Crash simulation
• Business
  • Financial and economic modeling
  • Transaction processing, web services and search engines
• Defense
  • Nuclear weapons: test by simulations
  • Cryptography
Economic Impact of HPC
• Airlines:
  • System-wide logistics optimization systems on parallel systems
  • Savings: approx. $100 million per airline per year
• Automotive design:
  • Major automotive companies use large systems (500+ CPUs) for:
    • CAD-CAM, crash testing, structural integrity and aerodynamics
  • One company has a 500+ CPU parallel system
  • Savings: approx. $1 billion per company per year
• Semiconductor industry:
  • Semiconductor firms use large systems (500+ CPUs) for
    • device electronics simulation and logic validation
  • Savings: approx. $1 billion per company per year
• Energy:
  • Computational modeling improved the performance of current nuclear power plants, equivalent to building two new power plants
$5B World Market in Technical Computing in 2004
[Chart: stacked percentage breakdown of the technical computing market, 1998–2003, by segment: Other; Technical Management and Support; Simulation; Scientific Research and R&D; Mechanical Design/Engineering Analysis; Mechanical Design and Drafting; Imaging; Geoscience and Geo-engineering; Electrical Design/Engineering Analysis; Economics/Financial; Digital Content Creation and Distribution; Classified Defense; Chemical Engineering; Biosciences]
Source: IDC 2004, from NRC Future of Supercomputing Report
Principles of Parallel Computing
• Finding enough parallelism (Amdahl's Law)
• Granularity
• Locality
• Load balance
• Coordination and synchronization
• Performance modeling

All of these things make parallel programming harder than sequential programming.
“Automatic” Parallelism in Modern Machines
• Bit-level parallelism
  • within floating point operations, etc.
• Instruction-level parallelism (ILP)
  • multiple instructions execute per clock cycle
• Memory system parallelism
  • overlap of memory operations with computation
• OS parallelism
  • multiple jobs run in parallel on commodity SMPs

There are limits to all of these; for very high performance, the user must identify, schedule and coordinate parallel tasks.
Finding Enough Parallelism: Amdahl’s Law
T1 = execution time using 1 processor (serial execution time)
Tp = execution time using P processors
S = serial fraction of the computation (i.e., the fraction that can only be executed on 1 processor)
C = fraction of the computation that can be executed by P processors

Then S + C = 1 and Tp = S*T1 + (C*T1)/P = (S + C/P)*T1
Speedup = Ψ(P) = T1/Tp = 1/(S + C/P) <= 1/S

• The maximum speedup (i.e., when P = ∞) is 1/S; example: S = 0.05 gives a maximum speedup of 20
• Currently the fastest machine has ~705K processor cores; the 2nd fastest has ~186K cores plus GPUs
• Even if the parallel part speeds up perfectly, performance is limited by the sequential part
Speedup Barriers: (a) Overhead of Parallelism
• Given enough parallel work, overhead is a big barrier to getting desired speedup
• Parallelism overheads include:
  • cost of starting a thread or process
  • cost of communicating shared data
  • cost of synchronizing
  • extra (redundant) computation
• Each of these can be in the range of milliseconds (= millions of flops) on some systems
• Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work
Speedup Barriers: (b) Working on Non Local Data
• Large memories are slow, fast memories are small• Parallel processors, collectively, have large, fast cache
• the slow accesses to “remote” data we call “communication”• Algorithm should do most work on local data
[Diagram: conventional storage hierarchy. Each processor has its own cache and L2 cache; L3 caches and memories sit below, joined by potential interconnects]
Processor-DRAM Gap (latency)
[Chart: processor vs. DRAM performance, 1980–2000, log scale from 1 to 1000. CPU performance ("Moore's Law") improves ~60%/year while DRAM improves ~7%/year, so the processor-memory performance gap grows ~50%/year]
Goal: find algorithms that minimize communication, not necessarily arithmetic
Speedup Barriers: (c) Load Imbalance
• Load imbalance occurs when some processors in the system are idle due to
  • insufficient parallelism (during that phase)
  • unequal size tasks
• Algorithm needs to balance load
Outline
• Why powerful computers must be parallel processors
• Large CSE problems require powerful computers
• Why writing (fast) parallel programs is hard
• Structure of the course
Instructor (Sections 2 & 3)
• Instructor: Dr. Walid Abu-Sufah
• Office: CPE 10
• Email: [email protected]
• Office Hours: Monday 11-12, Tuesday 12-1, and by appointment
• Course web site: http://www1.ju.edu.jo/ecourse/abusufah/cpe779_spr12/index.html
Prerequisite
• CPE 432: Computer Design, and general C programming skills
Grading Policy
• Programming Assignments: 25%
  • Demo/knowledge: 25%
  • Functionality and Performance: 40%
  • Report: 35%
• Project: 35%
  • Design Document: 25%
  • Project Presentation: 25%
  • Demo/Functionality/Performance/Report: 50%
• Midterm: 15%
• Final: 25%
Bonus Days
• Each of you gets five bonus days
• A bonus day is a no-questions-asked one-day extension that can be used on most assignments
• You can’t turn in multiple versions of a team assignment on different days; all of you must combine individual bonus days into one team bonus day.
• You can use multiple bonus days on the same assignment
• Weekends/holidays don't count toward the number of days of extension (Thursday to Sunday counts as a one-day extension)
• Intended to cover illnesses, just needing more time, etc.
Using Bonus Days
• Bonus days are automatically applied to late projects
• The penalty for being late beyond bonus days is 10% of the possible points per day, again counting only weekdays
• Things you can't use bonus days on:
  • Final project design documents, final project presentations, final project demo, exam
Academic Honesty
• You are allowed and encouraged to discuss assignments with other students in the class. Getting verbal advice/help from people who’ve already taken the course is also fine.
• Any reference to assignments from web postings is unacceptable
• Any copying of non-trivial code is unacceptable
  • Non-trivial = more than a line or so
  • Includes reading someone else's code and then going off to write your own
Academic Honesty (cont.)
• Giving/receiving help on an exam is unacceptable
• Penalties for academic dishonesty:
  • Zero on the assignment for the first occasion
  • Automatic failure of the course for repeat offenses
Team Projects
• Work can be divided up between team members in any way that works for you
• However, each team member will demo the final checkpoint of each project individually, and will get a separate demo grade
  • This will include questions on the entire design
  • Rationale: if you don't know enough about the whole design to answer questions on it, you aren't involved enough in the project
Text/Notes
1. D. Kirk and W. Hwu, "Programming Massively Parallel Processors: A Hands-on Approach," Morgan Kaufmann Publishers, 2010, ISBN 978-0123814722
2. Cleve B. Moler, Numerical Computing with MATLAB, Society for Industrial Mathematics (January 1, 2004). Available for individual download at http://www.mathworks.com/moler/chapters.html
3. NVIDIA, NVidia CUDA C Programming Guide, version 4.0, NVidia, 2011 (reference book)
4. Lecture notes will be posted at the class web site
Rough List of Topics
• Basics of computer architecture, memory hierarchies, performance
• Parallel programming models and machines
  • Shared Memory and Multithreading (OpenMP)
  • Distributed Memory and Message Passing (MPI)
Rough List of Topics (continued)
• Programming NVIDIA processors using CUDA
  • Introduction to CUDA C
  • CUDA Parallel Execution Model with Fermi Updates
  • CUDA Memory Model with Fermi Updates
  • Tiled Matrix-Matrix Multiplication
  • Debugging and Profiling, Introduction to Convolution
  • Convolution, Constant Memory and Constant Caching
Rough List of Topics (continued)
• Programming NVIDIA processors using CUDA (continued)
  • Tiled 2D Convolution
  • Parallel Computation Patterns - Reduction Trees
  • Memory Bandwidth
  • Parallel Computation Patterns - Prefix Sum (Scan)
  • Floating Point Considerations
  • Atomic Operations and Histogramming
  • Data Transfers and GMAC
  • Multi-GPU Programming in CUDA and GMAC
  • MPI and CUDA Programming