CS 194 Parallel Programming
Why Program for Parallelism?
Katherine Yelick ([email protected])
http://www.cs.berkeley.edu/~yelick/cs194f07
What is Parallel Computing?
• Parallel computing: using multiple processors in parallel to solve problems more quickly than with a single processor
• Examples of parallel machines:
  • A cluster computer: multiple PCs combined together with a high-speed network
  • A shared memory multiprocessor (SMP*): multiple processors connected to a single memory system
  • A chip multi-processor (CMP): multiple processors (called cores) on a single chip
• Concurrent execution here comes from a desire for performance, unlike the inherent concurrency in a multi-user distributed system
• * Technically, SMP stands for "Symmetric Multi-Processor"
Why Parallel Computing Now?
• Researchers have been using parallel computing for decades:
  • Mostly used in computational science and engineering
  • Problems too large to solve on one computer; use 100s or 1000s
• There has been a graduate course in parallel computing (CS267) for over a decade
• Many companies in the 80s/90s "bet" on parallel computing and failed
  • Computers got faster too quickly for there to be a large market
• Why is Berkeley adding an undergraduate course now?
  • Because the entire computing industry has bet on parallelism
  • There is a desperate need for parallel programmers
• Let's see why…
Technology Trends: Microprocessor Capacity
Moore's Law: 2X transistors/chip every 1.5 years
Microprocessors have become smaller, denser, and more powerful.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Slide source: Jack Dongarra
Microprocessor Transistors and Clock Rate
[Two plots: growth in transistors per chip (i4004, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000; 1,000 to 100,000,000 transistors, 1970-2005) and increase in clock rate in MHz (0.1 to 1000 MHz, 1970-2000).]
Why bother with parallel programming? Just wait a year or two…
Limit #1: Power density
[Plot: power density in W/cm² for Intel processors from the 4004 through the P6 (1970-2010, log scale), with reference lines for a hot plate, nuclear reactor, rocket nozzle, and the Sun's surface. Source: Patrick Gelsinger, Intel]
Scaling clock speed (business as usual) will not work
"Can soon put more transistors on a chip than can afford to turn on." -- Patterson '07
Parallelism Saves Power
• Exploit explicit parallelism to reduce power

  Power = C * V² * F          Performance = Cores * F
  (C = capacitance, V = voltage, F = frequency)

• Using additional cores:
  • Increases density (= more transistors = more capacitance)
  • Can increase cores (2x) and performance (2x)
  • Or increase cores (2x) but decrease frequency (1/2): same performance at 1/4 the power

  Power = 2C * V² * F            Performance = 2Cores * F
  Power = 2C * (V²/4) * (F/2)    Performance = 2Cores * F/2
  Power = (C * V² * F) / 4       Performance = Cores * F

• Additional benefit: small/simple cores give more predictable performance
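To make the arithmetic concrete, here is a minimal Python sketch of the dynamic power model above (the function names and unit-free baseline values are illustrative, not from the slides):

    def power(C, V, F):
        # Dynamic power: capacitance * voltage^2 * frequency
        return C * V**2 * F

    def performance(cores, F):
        # Idealized performance: cores * frequency
        return cores * F

    C, V, F = 1.0, 1.0, 1.0                  # baseline chip, arbitrary units
    base_power, base_perf = power(C, V, F), performance(1, F)

    # Double the cores (2C), halve voltage and frequency:
    new_power = power(2 * C, V / 2, F / 2)   # = base_power / 4
    new_perf  = performance(2, F / 2)        # = base_perf

    print(new_perf / base_perf)              # 1.0  -- same performance
    print(new_power / base_power)            # 0.25 -- one quarter the power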
Limit #2: Hidden Parallelism Tapped Out

[Plot: uniprocessor performance (SPECint) relative to the VAX-11/780, 1978-2006, on a log scale: 25%/year through 1986, 52%/year through 2002, ??%/year since.]
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002

From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006

Application performance was increasing by 52% per year, as measured by the SPECint benchmarks here:
• ½ due to transistor density
• ½ due to architecture changes, e.g., Instruction-Level Parallelism (ILP)
Limit #2: Hidden Parallelism Tapped Out
• Superscalar (SS) designs were the state of the art; many forms of parallelism not visible to the programmer:
  • multiple instruction issue
  • dynamic scheduling: hardware discovers parallelism between instructions
  • speculative execution: look past predicted branches
  • non-blocking caches: multiple outstanding memory ops
• You may have heard of these in 61C, but you haven’t needed to know about them to write software
• Unfortunately, these sources have been used up
Performance Comparison
• Measure of success for hidden parallelism is Instructions Per Cycle (IPC)
• The 6-issue design has higher IPC than the 2-issue, but far less than 3x higher
• Reasons: waiting for memory (D- and I-cache stalls) and dependencies (pipeline stalls)

Graphs from: Olukotun et al., ASPLOS, 1996
Uniprocessor Performance (SPECint) Today

[Plot: the same SPECint data (performance vs. VAX-11/780, 1978-2006), now showing a roughly 3X shortfall relative to the 52%/year trend since 2002 -- perhaps 2x every 5 years?]
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
⇒ Sea change in chip design: multiple “cores” or processors per chip
Limit #3: Chip Yield
• Moore’s (Rock’s) 2nd law: fabrication costs go up
• Yield (% usable chips) drops
• Parallelism can help:
  • Smaller, simpler processors are easier to design and validate
  • Can use partially working chips; e.g., the Cell processor (PS3) is sold with 7 out of 8 cores "on" to improve yield
Manufacturing costs and yield problems limit use of density
Limit #4: Speed of Light (Fundamental)
• Consider a 1 Tflop/s sequential machine:
  • Data must travel some distance, r, to get from memory to the CPU.
  • To get 1 data element per cycle, data must make the trip 10^12 times per second. At the speed of light, c = 3x10^8 m/s, this requires r < c/10^12 = 0.3 mm.
• Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area:
  • Each bit occupies about 1 square Angstrom, the size of a small atom.
• No choice but parallelism

[Diagram: a 1 Tflop/s, 1 Tbyte sequential machine of radius r = 0.3 mm]
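A few lines of Python verify the slide's arithmetic (a sketch; the variable names are mine):

    c = 3e8                     # speed of light, m/s
    ops = 1e12                  # 1 Tflop/s: one data element needed per cycle
    r = c / ops                 # farthest the data can be, per cycle
    print(r)                    # 3e-4 m = 0.3 mm

    bits = 8e12                 # 1 Tbyte of storage, in bits
    area_per_bit = r**2 / bits  # pack the bits into an r x r square
    print(area_per_bit)         # ~1.1e-20 m^2, about 1 square Angstrom (1e-20 m^2)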
Revolution is Happening Now

• Chip density is continuing to increase ~2x every 2 years
  • Clock speed is not
  • Number of processor cores may double instead
• There is little or no hidden parallelism (ILP) left to be found
• Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
Multicore in Products
• “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing”
Paul Otellini, President, Intel (2005)
• All microprocessor companies switch to MP (2X CPUs / 2 yrs)
  ⇒ Procrastination penalized: 2X sequential perf. / 5 yrs
• And at the same time:
  • The STI Cell processor (PS3) has 8 cores
  • The latest NVIDIA Graphics Processing Unit (GPU) has 128 cores
  • Intel has demonstrated an 80-core research chip
Manufacturer/Year    AMD/'05   Intel/'06   IBM/'04   Sun/'07
Processors/chip         2          2          2         8
Threads/Processor       1          2          2        16
Threads/chip            2          4          4       128
Tunnel Vision by Experts
• “On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap reserved for promising technologies that never quite make it.”
  • Ken Kennedy, CRPC Director, 1994
• "640K [of memory] ought to be enough for anybody."
  • Bill Gates, chairman of Microsoft, 1981.
• “There is no reason for any individual to have a computer in their home”
• Ken Olson, president and founder of Digital Equipment Corporation, 1977.
• “I think there is a world market for maybe five computers.”
  • Thomas Watson, chairman of IBM, 1943.

Slide source: Warfield et al.
Why Parallelism (2007)?
• These arguments are no longer theoretical
  • All major processor vendors are producing multicore chips
  • Every machine will soon be a parallel machine
  • All programmers will be parallel programmers???
• New software model:
  • Want a new feature? Hide the "cost" by speeding up the code first
  • All programmers will be performance programmers???
• Some parallelism may eventually be hidden in libraries, compilers, and high-level languages
  • But a lot of work is needed to get there
• Big open questions:
  • What will be the killer apps for multicore machines?
  • How should the chips be designed, and how will they be programmed?
Outline
• Why powerful computers must be parallel processors (all of them, including your laptop)
• Why writing (fast) parallel programs is hard
• Principles of parallel computing performance
• Structure of the course
Why writing (fast) parallel programs is hard
Principles of Parallel Computing
• Finding enough parallelism (Amdahl's Law)
• Granularity
• Locality
• Load balance
• Coordination and synchronization
• Performance modeling

All of these things make parallel programming even harder than sequential programming.
Finding Enough Parallelism
• Suppose only part of an application seems parallel
• Amdahl's law:
  • Let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable
  • Let P = number of processors

  Speedup(P) = Time(1)/Time(P)
            <= 1/(s + (1-s)/P)
            <= 1/s

• Even if the parallel part speeds up perfectly, performance is limited by the sequential part (see the sketch below)
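A minimal Python sketch of the bound (the 95%-parallel example is mine, not from the slides):

    def amdahl_speedup(s, P):
        # Upper bound on speedup with sequential fraction s on P processors
        return 1.0 / (s + (1.0 - s) / P)

    # Even with 95% of the work parallelizable, speedup is capped at 1/s = 20:
    for P in (2, 16, 256, 65536):
        print(P, round(amdahl_speedup(0.05, P), 2))
    # prints roughly 1.9, 9.14, 18.62, 19.99 -- approaching 1/s = 20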
Overhead of Parallelism
• Given enough parallel work, this is the biggest barrier to getting desired speedup
• Parallelism overheads include:
  • cost of starting a thread or process
  • cost of communicating shared data
  • cost of synchronizing
  • extra (redundant) computation
• Each of these can be in the range of milliseconds (= millions of flops) on some systems (a rough measurement is sketched below)
• Tradeoff: algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work
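As a rough illustration, this Python sketch measures just one of these overheads, thread start/join cost; the absolute numbers vary widely across systems:

    import threading, time

    def noop():
        pass

    # Time 1000 thread create/start/join cycles:
    t0 = time.perf_counter()
    for _ in range(1000):
        t = threading.Thread(target=noop)
        t.start()
        t.join()
    elapsed = time.perf_counter() - t0
    print(f"~{elapsed / 1000 * 1e6:.0f} microseconds per thread start/join")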
Locality and Parallelism
• Large memories are slow, fast memories are small
• Storage hierarchies are large and fast on average
• Parallel processors, collectively, have large, fast caches
  • The slow accesses to "remote" data we call "communication"
• Algorithm should do most work on local data (see the sketch after the diagram)
[Diagram: a conventional storage hierarchy (processor with cache, L2 cache, L3 cache, memory), replicated per processor in a parallel machine, with potential interconnects between the nodes.]
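The effect is easy to demonstrate even on one processor. This Python sketch sums a flat N x N matrix row-by-row (consecutive addresses) and column-by-column (stride-N addresses); the names and sizes are illustrative, and the gap is far larger in C, where interpreter overhead does not mask memory behavior:

    import time
    from array import array

    N = 2000
    a = array('d', [1.0]) * (N * N)   # flat N x N matrix of doubles

    def sum_row_major():
        # Consecutive addresses: each cache line fetched is fully used.
        return sum(a[i * N + j] for i in range(N) for j in range(N))

    def sum_col_major():
        # Stride-N addresses: each access may touch a new cache line.
        return sum(a[i * N + j] for j in range(N) for i in range(N))

    for f in (sum_row_major, sum_col_major):
        t0 = time.perf_counter()
        f()
        print(f.__name__, round(time.perf_counter() - t0, 2), "s")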
Load Imbalance
• Load imbalance is the time that some processors in the system are idle due to:
  • insufficient parallelism (during that phase)
  • unequal size tasks
• Examples of the latter:
  • adapting to "interesting parts of a domain"
  • tree-structured computations
  • fundamentally unstructured problems
• Algorithm needs to balance load (static vs. dynamic scheduling is sketched below)
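To illustrate, here is a small Python sketch contrasting a static partition with dynamic scheduling when task sizes are unequal; the task costs and worker count are made up for the example:

    import time
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical uneven workload: 90 cheap tasks, then 10 expensive ones.
    tasks = [0.001] * 90 + [0.05] * 10

    def run(cost):
        time.sleep(cost)   # stand-in for real work

    def run_block(block):
        for cost in block:
            run(cost)

    with ThreadPoolExecutor(max_workers=4) as pool:
        # Static: each worker gets one contiguous quarter; the last quarter
        # holds all the expensive tasks, so three workers finish early and idle.
        t0 = time.perf_counter()
        list(pool.map(run_block, [tasks[i * 25:(i + 1) * 25] for i in range(4)]))
        print("static :", round(time.perf_counter() - t0, 3), "s")

        # Dynamic: tasks are handed out one at a time as workers become free.
        t0 = time.perf_counter()
        list(pool.map(run, tasks))
        print("dynamic:", round(time.perf_counter() - t0, 3), "s")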
Course Organization
Course Mechanics
• Expected background:
  • All of the 61 series
  • At least one upper-division software/systems course, preferably 162
• Work in course:
  • Homework with programming (~1/week for the first 8 weeks)
    • Parallel hardware in CS, from Intel, at LBNL
  • Final project of your own choosing: may use other hardware (PS3, GPUs, Niagara 2, etc.) depending on availability
  • 2 in-class quizzes, mostly covering lecture topics
• See course web page for tentative calendar, etc.:
  • http://www.cs.berkeley.edu/~yelick/cs194f07
• Grades: homework (30%), quizzes (30%), project (40%)
• Caveat: this is the first offering of this course, so things will change dynamically
Reading Materials
• Optional text:
  • Introduction to Parallel Computing, 2nd Edition, by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Addison-Wesley, 2003
• Some on-line texts (on high-performance scientific programming):
  • Demmel's notes from CS267 Spring 1999, which are similar to 2000 and 2001 (though they contain links to HTML notes from 1996)
    • http://www.cs.berkeley.edu/~demmel/cs267_Spr99/
  • Ian Foster's book, "Designing and Building Parallel Programs"
    • http://www-unix.mcs.anl.gov/dbpp/