CS 194 Parallel Programming
Why Program for Parallelism?
Katherine Yelick ([email protected])
http://www.cs.berkeley.edu/~yelick/cs194f07
What is Parallel Computing?
• Parallel computing: using multiple processors in parallel to solve problems more quickly than with a single processor
• Examples of parallel machines:
  • A cluster computer: multiple PCs combined together with a high-speed network
  • A shared memory multiprocessor (SMP*): multiple processors connected to a single memory system
  • A chip multi-processor (CMP): multiple processors (called cores) on a single chip
• Concurrent execution here comes from a desire for performance, unlike the inherent concurrency in a multi-user distributed system
• * Technically, SMP stands for "Symmetric Multi-Processor"
Why Parallel Computing Now?
• Researchers have been using parallel computing for decades:
  • Mostly used in computational science and engineering
  • Problems too large to solve on one computer; use 100s or 1000s
• There has been a graduate course in parallel computing (CS267) for over a decade
• Many companies in the 80s/90s "bet" on parallel computing and failed
  • Computers got faster too quickly for there to be a large market
• Why is Berkeley adding an undergraduate course now?
  • Because the entire computing industry has bet on parallelism
  • There is a desperate need for parallel programmers
• Let's see why…
Technology Trends: Microprocessor Capacity
Moore's Law: 2X transistors/chip every 1.5 years
Microprocessors have become smaller, denser, and more powerful.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Slide source: Jack Dongarra
Microprocessor Transistors and Clock Rate
[Two plots: growth in transistors per chip (i4004, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000; 1,000 to 100,000,000 transistors, 1970-2005) and increase in clock rate in MHz (0.1 to 1000 MHz, 1970-2000).]
Why bother with parallel programming? Just wait a year or two…
Limit #1: Power density
[Plot: power density in W/cm² for Intel processors from the 4004 through the P6 (1970-2010, log scale), with reference lines for a hot plate, nuclear reactor, rocket nozzle, and the Sun's surface. Source: Patrick Gelsinger, Intel]
Scaling clock speed (business as usual) will not work
"Can soon put more transistors on a chip than can afford to turn on." -- Patterson '07
Parallelism Saves Power
• Exploit explicit parallelism to reduce power

  Power = C * V² * F          Performance = Cores * F
  (C = capacitance, V = voltage, F = frequency)

• Using additional cores:
  • Increases density (= more transistors = more capacitance)
  • Can increase cores (2x) and performance (2x)
  • Or increase cores (2x) but decrease frequency (1/2): same performance at 1/4 the power

  Power = 2C * V² * F            Performance = 2Cores * F
  Power = 2C * (V²/4) * (F/2)    Performance = 2Cores * F/2
  Power = (C * V² * F) / 4       Performance = Cores * F

• Additional benefit: small/simple cores give more predictable performance
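To make the arithmetic concrete, here is a minimal Python sketch of the dynamic power model above (the function names and unit-free baseline values are illustrative, not from the slides):

    def power(C, V, F):
        # Dynamic power: capacitance * voltage^2 * frequency
        return C * V**2 * F

    def performance(cores, F):
        # Idealized performance: cores * frequency
        return cores * F

    C, V, F = 1.0, 1.0, 1.0                  # baseline chip, arbitrary units
    base_power, base_perf = power(C, V, F), performance(1, F)

    # Double the cores (2C), halve voltage and frequency:
    new_power = power(2 * C, V / 2, F / 2)   # = base_power / 4
    new_perf  = performance(2, F / 2)        # = base_perf

    print(new_perf / base_perf)              # 1.0  -- same performance
    print(new_power / base_power)            # 0.25 -- one quarter the power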
Limit #2: Hidden Parallelism Tapped Out

[Plot: uniprocessor performance (SPECint) relative to the VAX-11/780, 1978-2006, on a log scale: 25%/year through 1986, 52%/year through 2002, ??%/year since.]
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002

From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006

Application performance was increasing by 52% per year, as measured by the SPECint benchmarks here:
• ½ due to transistor density
• ½ due to architecture changes, e.g., Instruction-Level Parallelism (ILP)
Limit #2: Hidden Parallelism Tapped Out
• Superscalar (SS) designs were the state of the art; many forms of parallelism not visible to the programmer:
  • multiple instruction issue
  • dynamic scheduling: hardware discovers parallelism between instructions
  • speculative execution: look past predicted branches
  • non-blocking caches: multiple outstanding memory ops
• You may have heard of these in 61C, but you haven’t needed to know about them to write software
• Unfortunately, these sources have been used up
Performance Comparison
• Measure of success for hidden parallelism is Instructions Per Cycle (IPC)
• The 6-issue design has higher IPC than the 2-issue, but far less than 3x higher
• Reasons: waiting for memory (D- and I-cache stalls) and dependencies (pipeline stalls)

Graphs from: Olukotun et al., ASPLOS, 1996
Uniprocessor Performance (SPECint) Today

[Plot: the same SPECint data (performance vs. VAX-11/780, 1978-2006), now showing a roughly 3X shortfall relative to the 52%/year trend since 2002 -- perhaps 2x every 5 years?]
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
⇒ Sea change in chip design: multiple “cores” or processors per chip
Limit #3: Chip Yield
• Moore’s (Rock’s) 2nd law: fabrication costs go up
• Yield (% usable chips) drops
• Parallelism can help:
  • Smaller, simpler processors are easier to design and validate
  • Can use partially working chips; e.g., the Cell processor (PS3) is sold with 7 out of 8 cores "on" to improve yield
Manufacturing costs and yield problems limit use of density
Limit #4: Speed of Light (Fundamental)
• Consider a 1 Tflop/s sequential machine:
  • Data must travel some distance, r, to get from memory to the CPU.
  • To get 1 data element per cycle, data must make the trip 10^12 times per second. At the speed of light, c = 3x10^8 m/s, this requires r < c/10^12 = 0.3 mm.
• Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area:
  • Each bit occupies about 1 square Angstrom, the size of a small atom.
• No choice but parallelism

[Diagram: a 1 Tflop/s, 1 Tbyte sequential machine of radius r = 0.3 mm]
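A few lines of Python verify the slide's arithmetic (a sketch; the variable names are mine):

    c = 3e8                     # speed of light, m/s
    ops = 1e12                  # 1 Tflop/s: one data element needed per cycle
    r = c / ops                 # farthest the data can be, per cycle
    print(r)                    # 3e-4 m = 0.3 mm

    bits = 8e12                 # 1 Tbyte of storage, in bits
    area_per_bit = r**2 / bits  # pack the bits into an r x r square
    print(area_per_bit)         # ~1.1e-20 m^2, about 1 square Angstrom (1e-20 m^2)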
Revolution is Happening Now

• Chip density is continuing to increase ~2x every 2 years
  • Clock speed is not
  • Number of processor cores may double instead
• There is little or no hidden parallelism (ILP) left to be found
• Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
Multicore in Products
• “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing”
Paul Otellini, President, Intel (2005)
• All microprocessor companies switch to MP (2X CPUs / 2 yrs)
  ⇒ Procrastination penalized: 2X sequential perf. / 5 yrs
• And at the same time:
  • The STI Cell processor (PS3) has 8 cores
  • The latest NVIDIA Graphics Processing Unit (GPU) has 128 cores
  • Intel has demonstrated an 80-core research chip
Manufacturer/Year    AMD/'05   Intel/'06   IBM/'04   Sun/'07
Processors/chip         2          2          2         8
Threads/Processor       1          2          2        16
Threads/chip            2          4          4       128
Tunnel Vision by Experts
• “On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap reserved for promising technologies that never quite make it.”
  • Ken Kennedy, CRPC Director, 1994
• "640K [of memory] ought to be enough for anybody."
  • Bill Gates, chairman of Microsoft, 1981.
• “There is no reason for any individual to have a computer in their home”
• Ken Olson, president and founder of Digital Equipment Corporation, 1977.
• “I think there is a world market for maybe five computers.”
  • Thomas Watson, chairman of IBM, 1943.

Slide source: Warfield et al.
Why Parallelism (2007)?
• These arguments are no longer theoretical
  • All major processor vendors are producing multicore chips
  • Every machine will soon be a parallel machine
  • All programmers will be parallel programmers???
• New software model:
  • Want a new feature? Hide the "cost" by speeding up the code first
  • All programmers will be performance programmers???
• Some parallelism may eventually be hidden in libraries, compilers, and high-level languages
  • But a lot of work is needed to get there
• Big open questions:
  • What will be the killer apps for multicore machines?
  • How should the chips be designed, and how will they be programmed?
Outline
• Why powerful computers must be parallel processors (all of them, including your laptop)
• Why writing (fast) parallel programs is hard
• Principles of parallel computing performance
• Structure of the course
Why writing (fast) parallel programs is hard
Principles of Parallel Computing
• Finding enough parallelism (Amdahl's Law)
• Granularity
• Locality
• Load balance
• Coordination and synchronization
• Performance modeling

All of these things make parallel programming even harder than sequential programming.
Finding Enough Parallelism
• Suppose only part of an application seems parallel
• Amdahl's law:
  • Let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable
  • Let P = number of processors

  Speedup(P) = Time(1)/Time(P)
            <= 1/(s + (1-s)/P)
            <= 1/s

• Even if the parallel part speeds up perfectly, performance is limited by the sequential part (see the sketch below)
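A minimal Python sketch of the bound (the 95%-parallel example is mine, not from the slides):

    def amdahl_speedup(s, P):
        # Upper bound on speedup with sequential fraction s on P processors
        return 1.0 / (s + (1.0 - s) / P)

    # Even with 95% of the work parallelizable, speedup is capped at 1/s = 20:
    for P in (2, 16, 256, 65536):
        print(P, round(amdahl_speedup(0.05, P), 2))
    # prints roughly 1.9, 9.14, 18.62, 19.99 -- approaching 1/s = 20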
Overhead of Parallelism
• Given enough parallel work, this is the biggest barrier to getting desired speedup
• Parallelism overheads include:
  • cost of starting a thread or process
  • cost of communicating shared data
  • cost of synchronizing
  • extra (redundant) computation
• Each of these can be in the range of milliseconds (= millions of flops) on some systems (a rough measurement is sketched below)
• Tradeoff: algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work
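As a rough illustration, this Python sketch measures just one of these overheads, thread start/join cost; the absolute numbers vary widely across systems:

    import threading, time

    def noop():
        pass

    # Time 1000 thread create/start/join cycles:
    t0 = time.perf_counter()
    for _ in range(1000):
        t = threading.Thread(target=noop)
        t.start()
        t.join()
    elapsed = time.perf_counter() - t0
    print(f"~{elapsed / 1000 * 1e6:.0f} microseconds per thread start/join")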
Locality and Parallelism
• Large memories are slow, fast memories are small
• Storage hierarchies are large and fast on average
• Parallel processors, collectively, have large, fast caches
  • The slow accesses to "remote" data we call "communication"
• Algorithm should do most work on local data (see the sketch after the diagram)
[Diagram: a conventional storage hierarchy (processor with cache, L2 cache, L3 cache, memory), replicated per processor in a parallel machine, with potential interconnects between the nodes.]
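The effect is easy to demonstrate even on one processor. This Python sketch sums a flat N x N matrix row-by-row (consecutive addresses) and column-by-column (stride-N addresses); the names and sizes are illustrative, and the gap is far larger in C, where interpreter overhead does not mask memory behavior:

    import time
    from array import array

    N = 2000
    a = array('d', [1.0]) * (N * N)   # flat N x N matrix of doubles

    def sum_row_major():
        # Consecutive addresses: each cache line fetched is fully used.
        return sum(a[i * N + j] for i in range(N) for j in range(N))

    def sum_col_major():
        # Stride-N addresses: each access may touch a new cache line.
        return sum(a[i * N + j] for j in range(N) for i in range(N))

    for f in (sum_row_major, sum_col_major):
        t0 = time.perf_counter()
        f()
        print(f.__name__, round(time.perf_counter() - t0, 2), "s")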
Load Imbalance
• Load imbalance is the time that some processors in the system are idle due to:
  • insufficient parallelism (during that phase)
  • unequal size tasks
• Examples of the latter:
  • adapting to "interesting parts of a domain"
  • tree-structured computations
  • fundamentally unstructured problems
• Algorithm needs to balance load (static vs. dynamic scheduling is sketched below)
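To illustrate, here is a small Python sketch contrasting a static partition with dynamic scheduling when task sizes are unequal; the task costs and worker count are made up for the example:

    import time
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical uneven workload: 90 cheap tasks, then 10 expensive ones.
    tasks = [0.001] * 90 + [0.05] * 10

    def run(cost):
        time.sleep(cost)   # stand-in for real work

    def run_block(block):
        for cost in block:
            run(cost)

    with ThreadPoolExecutor(max_workers=4) as pool:
        # Static: each worker gets one contiguous quarter; the last quarter
        # holds all the expensive tasks, so three workers finish early and idle.
        t0 = time.perf_counter()
        list(pool.map(run_block, [tasks[i * 25:(i + 1) * 25] for i in range(4)]))
        print("static :", round(time.perf_counter() - t0, 3), "s")

        # Dynamic: tasks are handed out one at a time as workers become free.
        t0 = time.perf_counter()
        list(pool.map(run, tasks))
        print("dynamic:", round(time.perf_counter() - t0, 3), "s")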
Course Organization
Course Mechanics
• Expected background:
  • All of the 61 series
  • At least one upper-division software/systems course, preferably 162
• Work in course:
  • Homework with programming (~1/week for the first 8 weeks)
    • Parallel hardware in CS, from Intel, at LBNL
  • Final project of your own choosing: may use other hardware (PS3, GPUs, Niagara 2, etc.) depending on availability
  • 2 in-class quizzes, mostly covering lecture topics
• See course web page for tentative calendar, etc.:
  • http://www.cs.berkeley.edu/~yelick/cs194f07
• Grades: homework (30%), quizzes (30%), project (40%)
• Caveat: this is the first offering of this course, so things will change dynamically
Reading Materials
• Optional text:
  • Introduction to Parallel Computing, 2nd Edition, by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Addison-Wesley, 2003
• Some on-line texts (on high-performance scientific programming):
  • Demmel's notes from CS267 Spring 1999, which are similar to 2000 and 2001 (though they contain links to HTML notes from 1996)
    • http://www.cs.berkeley.edu/~demmel/cs267_Spr99/
  • Ian Foster's book, "Designing and Building Parallel Programs"
    • http://www-unix.mcs.anl.gov/dbpp/