50
Parallel numerical algorithms: Course overview Prof. Richard Vuduc Georgia Institute of Technology CSE/CS 8803 PNA, Spring 2008 [L.01] Tuesday, January 8, 2008 1

Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Parallel numerical algorithms:Course overview

Prof. Richard Vuduc

Georgia Institute of Technology

CSE/CS 8803 PNA, Spring 2008

[L.01] Tuesday, January 8, 2008

1

Page 2: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Sources for today’s material

CS 267 (Yelick @UCB)

David Keyes

Williams, et al. SC’07 (UCB)

Higham, Accuracy and Stability of Numerical Algorithms

2

Page 3: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Lecture: Patterson’s Law of Attention SpanD. Patterson (UC Berkeley)

Attention span

5 min 10 min 50 min

“In conclusion…”

3

Page 4: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Why study PNA today?

4

Page 5: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Why study PNA today?

Current and future apps are numerical, data-intensive

Parallel hardware widely available

Computing industry betting its future on it! (next lecture)

Need to program for parallelism and locality explicitly

Algorithmic costs changed: “mops” not flops; “accuracy”

“Winners” unclear

Interaction of applications, algorithms, and systems

4

Page 6: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

5

Page 7: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

5

Page 8: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Application example:Google’s PageRank

6

Page 9: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Google’s web search procedure

Step 1: Rank all web pages, given the web

Update every once in a while

Step 2: Return list of matching pages in order by rank, given query

At “query-time”

7

Page 10: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Google’s web search procedure

Step 1: Rank all web pages, given the web

Update every once in a while

Step 2: Return list of matching pages in order by rank, given query

At “query-time”

Compute ranking using a “random surfer” model

7

Page 11: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Start on page i.A “random surfer” model.

i

8

Page 12: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Jump to j with some probability.A “random surfer” model.

i

j

9

Page 13: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Repeat.A “random surfer” model.

i

j

k

10

Page 14: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Where will the surfer end up?A “random surfer” model.

i

j

k

11

Page 15: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Probability of being on page i at time t.

i

xt(i)

12

Page 16: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Transition probability (=0 if no link).

i

j

W (i, j)

13

Page 17: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Probability of being on page j at time t+1, with transition probability W(i,j).

i

j

i

i

xt+1(j) =n!

i=1

xt(i) · W (i, j)

14

Page 18: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Repeat matrix-vector multiply until convergence.

i

j

i

i

xt+1 = WT · xt

15

Page 19: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

“Power method” for computing the principal eigenvector of WT.

i

j

i

i

xt+1 = WT · xt = (WT )2 · xt!1 = · · ·= (WT )t+1 · x0

16

Page 20: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Applications and architectures

17

Page 21: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Units of measurement

“Flop”: Floating-point operation

“Flop/s”: Flops per second

“Bytes”: Size of data (double-precision float is 8 bytes)

Today’s peak speeds

Single processor core ~5 Gflop/s

STI-Cell (8-core) ~15 Gflop/s

NVIDIA Tesla (128-core) ~500 Gflop/s

Note: Achievable speeds often < 10% of peak

18

Page 22: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Algorithmic efficiency rivals architectural advances in scale.If n=64, flops reduced by ~16 M [6 mo. to 1 sec.]; Source: Keyes (2004)

Year Method Reference Storage Flops

1947

1950

1971

1984

GaussianElimination

Von Neumann & Goldstine

n5 n7

Optimal SOR Young n3 n4 log n

Conj. grad. Reid n3 n3.5 log n

Full multigrid Brandt n3 n3

∇2u=f 64

64 64

19

Page 23: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

year

relativespeedup

Algorithmic and architectural improvements go hand-in-hand.

20

Page 24: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Parallelism and algorithmic innovation outpace Moore’s Law

1

10

100

1000

10000

100000

1000000

1988 1990 1993 1995 1997 1999 2001 2003 2005 2007

Gflo

p/s

Year

Gordon Bell Winners

Moore’s Law Projection

105x improvement in ~20 years

100x

21

Page 25: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Source: SCaLeS report 2 (2004), via D. Keyes. http://pnl.gov/scales

22

Page 26: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

0

1

2

3

4

5

6

7

8

9

10

1980 1990 2000 2010

Calendar Year

Log

Effe

ctiv

e G

igaF

LOPS

High Order

Autocode

ARK integratorcomplex chem Higher

orderAMR

NERSCRS/6000

NERSCSP3

Cray 2

AMR

Low Mach

“Moore’s Law” for Combustion Simulations

Source: SCaLeS report 2 (2004), via D. Keyes. http://pnl.gov/scales

23

Page 27: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Intel, on their hardware strategy:

“We are dedicating all of our future product development to multicore designs. … This is a sea change in computing.” (2005)

24

Page 28: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

1

10

100

1000

10000

1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006

Per

form

ance

(vs.

VA

X-1

1/78

0)

25%/year

52%/year

??%/year

Uniprocessor performance not keeping up with Moore’s law.

3X

2x every5 years?

Source: Hennessey and Patterson, CA:AQA (4th edition), 2006

25

Page 29: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

What’s so hard about parallel programming?

26

Page 30: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

On p processors with fraction s serial work:

Speedup(p) =Time(1)Time(p)

! 1s + 1!s

p

! 1s

Finding enough parallelism.Amdahl’s Law: Maximum speedup limited by sequential part.

27

Page 31: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Parallelism incurs overheads.

Examples of overheads:

Cost of starting a thread or process

Cost of communicating shared data

Cost of synchronization

Cost of extra (redundant) computation

Costs may be milliseconds, which is millions of flops

Tradeoff: Granularity of task vs. amount of parallelism

28

Page 32: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

on-chipcacheregisters

datapath

control

processor

Secondlevelcache(SRAM)

Mainmemory

(DRAM)

Secondarystorage(Disk)

Tertiarystorage

(Disk/Tape)

TBGBMBKBBSize

10sec10ms100ns10ns1nsCost

Memory hierarchies:Non-uniform memory access cost.Better algorithms and implementations exploit locality.

29

Page 33: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

ProcCache

L2 Cache

L3 Cache

Memory

Conventional Storage Hierarchy

ProcCache

L2 Cache

L3 Cache

Memory

ProcCache

L2 Cache

L3 Cache

Memory

potentialinterconnects

NUMA in the parallel setting.Processors should minimize communication (remote data access).

30

Page 34: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

+NUMA/Affinity

Naïve Pthreads

Naïve Single Thread

0.0

0.5

1.0

1.5

2.0

2.5

3.0

Dense

Protein

FEM-Sphr

FEM-Cant

Tunnel

FEM-Har

QCD

FEM-Ship

Econom

Epidem

FEM-Accel

Circuit

Webbase LP

Median

GFlop/s

Opteron

DDR2 DRAM

Opteron

DDR2 DRAM

Opteron

DDR2 DRAM

Opteron

DDR2 DRAM

Opteron

DDR2 DRAM

Opteron

DDR2 DRAM

Single Thread

Multiple Threads,One memory controller

Multiple Threads,Both memory controllers

NUMA ⇒ Explicit

data placement.

2x

Sparse matrix-vector multiplyon 2x2-core AMD Opteron.

31

Page 35: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

y = AAT xt ! AT x

y ! At

NUMA ⇒ Seek algorithmic reuse.

For large A, “reads” A twice.

32

Page 36: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

AAT x = (a1 · · · an)

!

"#aT1...

aTn

$

%& x =n'

i=1

ai

(aT

i x)

Reorganize to improve reuse.

33

Page 37: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Algorithms and implementations must guard against load imbalance.

Load imbalance: Subset of processors becomes idle

Insufficient parallelism

Unequally sized tasks

Sources of imbalance

Local adaptation (adaptive mesh refinement)

Tree-structured computations

Fundamentally unstructured problem

34

Page 38: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

x̄ =1n

n!

i=1

xi

!(x) =1

n! 1

n!

i=1

(xi ! x̄)2

Let !̂(x) = computed !(x),and " = machine precision.then:

!̂(x)! !(x)!(x)

" (n + 3)" + O""2

#

Large-scale parallelism may raise numerical accuracy issues.

Finite-ness of precision matters more for larger problems

Ill-conditioning

Round-off

Example: Compute variance of data set using “two-pass” algorithm (right)

In single prec. (ε ~ 10-7), all digits lost when n ~ 10 million

35

Page 39: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Course organization

36

Page 40: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Student make-up of course?

Mostly CS students, so will emphasize algorithm-to-machine mapping more than new-algorithm-development

Work in teams of two and three; be interdisciplinary where possible!

37

Page 41: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Workload and grading

Grading

10% - Class participation and “scribe notes”

24% - Two homework/programming assignments

6% - Attend SIAM PP and write-up what you learned

60% - Course project

All homework during first 8 weeks; last 8 weeks for your project!

Collaboration encouraged, but don’t “cheat.”

No textbook, but will supplement lectures with readings

38

Page 42: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Schedule of topics(approximate)

Week 1: Overview and hardware trends

Week 2: Sources of parallelism & locality; performance modeling

Week 3: Basics of parallel programming [HW 1]

Week 4: Structured grids; dense linear algebra

Week 5-6: Dense and sparse linear algebra

Week 7: FFT ; floating-point issues

Week 8: Single-processor performance tuning [HW 2; project proposals]

Guest lecture by Hyesoon Kim on General-purpose GPU programming

39

Page 43: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Schedule of topics (2)

Week 9: Automatic performance tuning (autotuning)

Week 10: Load balancing; SIAM PP

Week 11-12: Particle methods; graph partitioning [Project checkpoint]

Week 13: PDEs

Week 14: Event-driven methods

Week 15: Volunteer computing ; parallel languages research

Week 16: Project presentations

Project write-up due on “final exam” day

40

Page 44: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Homework #0:Complete course survey

I will post all materials at GT T-Square site for “CSE-8803-PNA”

Go to: http://t-square.gatech.edu

Note: Cross-listed with CS-8803-PNA / CS-4803-PNA, but ignore these

Fill out survey questions under “Poll” tab

When you “submit” the assignment, briefly describe what you hope to get out of this course

Also: Use “Forum” to introduce yourselves

41

Page 45: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

“In conclusion…”

42

Page 46: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

y1

y2

y3

y4

y5

t1

t2

t3

t4

t5

x1

x2

x3

x4

x5

a11 a11

a12 a12

Need for locality and parallelism drives new algorithms.Example: y = A2*x

43

Page 47: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

y1

y2

y3

y4

y5

t1

t2

t3

t4

t5

x1

x2

x3

x4

x5

a11 a11

a12 a12

Need for locality and parallelism drives new algorithms.Example: y = A2*x

44

Page 48: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

y1

y2

y3

y4

y5

t1

t2

t3

t4

t5

x1

x2

x3

x4

x5

Need for locality and parallelism drives new algorithms.Example: y = A2*x. A new kernel implies a new algorithm… ?

45

Page 49: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Some questions raised by this example.

This algorithm has locality, but what about parallelism?

What is the relationship between locality and parallelism?

The “inner loop” of a solver based on this kernel is now Ak*x, not just A*x. How does this change the “outer loops?”

46

Page 50: Parallel numerical algorithms: Course overviewvuduc.org/teaching/cse8803-pna-sp08/slides/cse8803-pna-sp08-01.pdfParallelism and algorithmic innovation outpace Moore’s Law 1 10 100

Technical challenges of PNA

Precision, memory, and processing are finite resources

Curse of dimensionality: 3D space + time eats up Moore’s Law

Hardware changes quickly

Huge amount of relevant domain-knowledge exists: Collaborate!

Parallel programming is hard

47