
Page 1: Basic Cilk Programming

Basic Cilk Programming

Page 2: Basic Cilk Programming

Multithreaded Programming in Cilk
LECTURE 1

Adapted from
Charles E. Leiserson
Supercomputing Technologies Research Group
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology

Page 3: Basic Cilk Programming

Multi-core processors

[Figure: a shared-memory multiprocessor: processors (P), each with a cache ($), connected by a network to memory and I/O]

MIMD – shared memory

Page 4: Basic Cilk Programming

Cilk Overview

• Cilk extends the C language with just a handful of keywords: cilk, spawn, sync.
• Every Cilk program has a serial semantics.
• Not only is Cilk fast, it provides performance guarantees based on performance abstractions.
• Cilk is processor-oblivious.
• Cilk’s provably good runtime system automatically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling.
• Cilk supports speculative parallelism.

Page 5: Basic Cilk Programming

Fibonacci

C elision:

  int fib (int n) {
    if (n<2) return (n);
    else {
      int x, y;
      x = fib(n-1);
      y = fib(n-2);
      return (x+y);
    }
  }

Cilk code:

  cilk int fib (int n) {
    if (n<2) return (n);
    else {
      int x, y;
      x = spawn fib(n-1);
      y = spawn fib(n-2);
      sync;
      return (x+y);
    }
  }

Cilk is a faithful extension of C. A Cilk program’s serial elision is always a legal implementation of Cilk semantics. Cilk provides no new data types.

Page 6: Basic Cilk Programming

Basic Cilk Keywords

  cilk int fib (int n) {
    if (n<2) return (n);
    else {
      int x, y;
      x = spawn fib(n-1);
      y = spawn fib(n-2);
      sync;
      return (x+y);
    }
  }

cilk: identifies a function as a Cilk procedure, capable of being spawned in parallel.

spawn: the named child Cilk procedure can execute in parallel with the parent caller.

sync: control cannot pass this point until all spawned children have returned.

Page 7: Basic Cilk Programming

Dynamic Multithreading

  cilk int fib (int n) {
    if (n<2) return (n);
    else {
      int x, y;
      x = spawn fib(n-1);
      y = spawn fib(n-2);
      sync;
      return (x+y);
    }
  }

The computation dag unfolds dynamically.

Example: fib(4)

“Processor oblivious”

[Figure: the spawn dag for fib(4); nodes are labeled with their arguments: 4, 3, 2, 2, 1, 1, 1, 0, 0]

Page 8: Basic Cilk Programming

Multithreaded Computation

• The dag G = (V, E) represents a parallel instruction stream.
• Each vertex v ∈ V represents a (Cilk) thread: a maximal sequence of instructions not containing parallel control (spawn, sync, return).
• Every edge e ∈ E is either a spawn edge, a return edge, or a continue edge.

[Figure: a multithreaded computation dag from the initial thread to the final thread, with spawn, return, and continue edges marked]

Page 9: Basic Cilk Programming

Algorithmic Complexity MeasuresTP = execution time on P processors

Page 10: Basic Cilk Programming

Algorithmic Complexity MeasuresTP = execution time on P processors

T1 = work

Page 11: Basic Cilk Programming

Algorithmic Complexity MeasuresTP = execution time on P processors

T1 = work

T∞ = span*

* Also called critical-path length or computational depth.

Page 12: Basic Cilk Programming

Algorithmic Complexity Measures

TP = execution time on P processors

T1 = work
T∞ = span*

LOWER BOUNDS
• TP ≥ T1/P
• TP ≥ T∞

* Also called critical-path length or computational depth.

Page 13: Basic Cilk Programming

Speedup

Definition: T1/TP = speedup on P processors.

If T1/TP = Θ(P) ≤ P, we have linear speedup;
if T1/TP = P, we have perfect linear speedup;
if T1/TP > P, we have superlinear speedup, which is not possible in our model, because of the lower bound TP ≥ T1/P.

Page 14: Basic Cilk Programming

July 13, 2006 14

Parallelism

Because we have the lower bound TP ≥ T∞, the maximum possible speedup given T1 and T∞ is

T1/T∞ = parallelism = the average amount of work per step along the span.

Page 15: Basic Cilk Programming

Example: fib(4)

Assume for simplicity that each Cilk thread in fib() takes unit time to execute.

Work: T1 = 17
Span: T∞ = 8

[Figure: the fib(4) dag, with the eight threads on the critical path numbered 1–8]

Page 16: Basic Cilk Programming

Example: fib(4)

Assume for simplicity that each Cilk thread in fib() takes unit time to execute.

Work: T1 = 17
Span: T∞ = 8
Parallelism: T1/T∞ = 2.125

Using many more than 2 processors makes little sense.

Page 17: Basic Cilk Programming

Parallelizing Vector Addition

C:

  void vadd (real *A, real *B, int n) {
    int i;
    for (i=0; i<n; i++) A[i] += B[i];
  }

Page 18: Basic Cilk Programming

Parallelizing Vector Addition

Parallelization strategy:
1. Convert loops to recursion.

C:

  void vadd (real *A, real *B, int n) {
    if (n<=BASE) {
      int i;
      for (i=0; i<n; i++) A[i] += B[i];
    } else {
      vadd (A, B, n/2);
      vadd (A+n/2, B+n/2, n-n/2);
    }
  }

Page 19: Basic Cilk Programming

Parallelizing Vector Addition

Parallelization strategy:
1. Convert loops to recursion.
2. Insert Cilk keywords.

Cilk:

  cilk void vadd (real *A, real *B, int n) {
    if (n<=BASE) {
      int i;
      for (i=0; i<n; i++) A[i] += B[i];
    } else {
      spawn vadd (A, B, n/2);
      spawn vadd (A+n/2, B+n/2, n-n/2);
      sync;
    }
  }

Side benefit: D&C is generally good for caches!

Page 20: Basic Cilk Programming

Vector Addition

  cilk void vadd (real *A, real *B, int n) {
    if (n<=BASE) {
      int i;
      for (i=0; i<n; i++) A[i] += B[i];
    } else {
      spawn vadd (A, B, n/2);
      spawn vadd (A+n/2, B+n/2, n-n/2);
      sync;
    }
  }

Page 21: Basic Cilk Programming

Vector Addition Analysis

To add two vectors of length n, where BASE = Θ(1):

Work: T1 = Θ(n)
Span: T∞ = Θ(lg n)
Parallelism: T1/T∞ = Θ(n/lg n)

Page 22: Basic Cilk Programming

Another Parallelization

C:

  void vadd1 (real *A, real *B, int n) {
    int i;
    for (i=0; i<n; i++) A[i] += B[i];
  }
  void vadd (real *A, real *B, int n) {
    int j;
    for (j=0; j<n; j+=BASE) {
      vadd1(A+j, B+j, min(BASE, n-j));
    }
  }

Cilk:

  cilk void vadd1 (real *A, real *B, int n) {
    int i;
    for (i=0; i<n; i++) A[i] += B[i];
  }
  cilk void vadd (real *A, real *B, int n) {
    int j;
    for (j=0; j<n; j+=BASE) {
      spawn vadd1(A+j, B+j, min(BASE, n-j));
    }
    sync;
  }

Page 23: Basic Cilk Programming

Analysis…

To add two vectors of length n, where BASE = Θ(1):

Work: T1 = Θ(n)
Span: T∞ = Θ(n)
Parallelism: T1/T∞ = Θ(1)

Page 24: Basic Cilk Programming

Optimal Choice of BASE

To add two vectors of length n using an optimal choice of BASE to maximize parallelism:

Work: T1 = Θ(n)
Span: T∞ = Θ(BASE + n/BASE)
Choosing BASE = √n gives T∞ = Θ(√n).
Parallelism: T1/T∞ = Θ(√n)

Page 25: Basic Cilk Programming

Weird! Don’t we want to remove recursion?

Parallel Programming = Sequential Program + Decomposition + Mapping + Communication and synchronization

Page 26: Basic Cilk Programming

Scheduling

• Cilk allows the programmer to express potential parallelism in an application.
• The Cilk scheduler maps Cilk threads onto processors dynamically at runtime.
• Since on-line schedulers are complicated, we’ll illustrate the ideas with an off-line scheduler.

[Figure: the shared-memory multiprocessor again: processors (P) with caches ($), network, memory, and I/O]

Page 27: Basic Cilk Programming

Greedy Scheduling

IDEA: Do as much as possible on every step.

Definition: A thread is ready if all its predecessors have executed.

Page 28: Basic Cilk Programming

Greedy Scheduling

IDEA: Do as much as possible on every step.

Definition: A thread is ready if all its predecessors have executed.

Complete step (illustrated with P = 3):
• ≥ P threads ready.
• Run any P.

Page 29: Basic Cilk Programming

Greedy Scheduling

IDEA: Do as much as possible on every step.

Definition: A thread is ready if all its predecessors have executed.

Complete step (illustrated with P = 3):
• ≥ P threads ready.
• Run any P.

Incomplete step:
• < P threads ready.
• Run all of them.

Page 30: Basic Cilk Programming

Greedy-Scheduling Theorem

Theorem [Graham ’68 & Brent ’75]. Any greedy scheduler achieves

TP ≤ T1/P + T∞.

Proof.
• # complete steps ≤ T1/P, since each complete step performs P work.
• # incomplete steps ≤ T∞, since each incomplete step reduces the span of the unexecuted dag by 1. ■

Page 31: Basic Cilk Programming

Optimality of Greedy

Corollary. Any greedy scheduler achieves within a factor of 2 of optimal.

Proof. Let TP* be the execution time produced by the optimal scheduler. Since TP* ≥ max{T1/P, T∞} (lower bounds), we have

TP ≤ T1/P + T∞
   ≤ 2·max{T1/P, T∞}
   ≤ 2TP*. ■

Page 32: Basic Cilk Programming

Linear Speedup

Corollary. Any greedy scheduler achieves near-perfect linear speedup whenever P ≪ T1/T∞.

Proof. Since P ≪ T1/T∞ is equivalent to T∞ ≪ T1/P, the Greedy-Scheduling Theorem gives us

TP ≤ T1/P + T∞ ≈ T1/P.

Thus, the speedup is T1/TP ≈ P. ■

Definition. The quantity (T1/T∞)/P is called the parallel slackness.

Page 33: Basic Cilk Programming

Lessons

Work and span can predict performance on large machines better than running times on small machines can.

Focus on improving parallelism (i.e., maximize T1/T∞); this will let you use larger processor counts effectively.

Page 34: Basic Cilk Programming

Cilk Performance

● Cilk’s “work-stealing” scheduler achieves
  ■ TP = T1/P + O(T∞) expected time (provably);
  ■ TP ≈ T1/P + T∞ time (empirically).
● Near-perfect linear speedup if P ≪ T1/T∞.
● Instrumentation in Cilk allows the user to determine accurate measures of T1 and T∞.
● The average cost of a spawn in Cilk-5 is only 2–6 times the cost of an ordinary C function call, depending on the platform.

Page 35: Basic Cilk Programming

Cilk’s Work-Stealing Scheduler

Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack.

[Figure: four processors, each with its own deque]

Spawn!

Page 36: Basic Cilk Programming

Spawn!

Page 37: Basic Cilk Programming

Return!

Page 38: Basic Cilk Programming

Return!

Page 39: Basic Cilk Programming

When a processor runs out of work, it steals a thread from the top of a random victim’s deque.

Steal!

Page 40: Basic Cilk Programming

Steal!

Page 41: Basic Cilk Programming


Page 42: Basic Cilk Programming

Spawn!

Page 43: Basic Cilk Programming

Performance of Work-Stealing

Theorem: Cilk’s work-stealing scheduler achieves an expected running time of

TP = T1/P + O(T∞)

on P processors.

Pseudoproof. A processor is either working or stealing. The total time all processors spend working is T1. Each steal has a 1/P chance of reducing the span by 1. Thus, the expected cost of all steals is O(PT∞). Since there are P processors, the expected time is

(T1 + O(PT∞))/P = T1/P + O(T∞). ■

Page 44: Basic Cilk Programming

Space Bounds

Theorem. Let S1 be the stack space required by a serial execution of a Cilk program. Then, the space required by a P-processor execution is at most SP ≤ PS1.

Proof (by induction). The work-stealing algorithm maintains the busy-leaves property: every extant procedure frame with no extant descendants has a processor working on it. ■

[Figure: P = 3 processors, each working on a busy leaf, each using at most S1 stack space]

Page 45: Basic Cilk Programming

Linguistic Implications

Code like the following executes properly without any risk of blowing out memory:

  for (i=1; i<1000000000; i++) {
    spawn foo(i);
  }
  sync;

MORAL: Better to steal parents than children!

Page 46: Basic Cilk Programming

Summary

• Cilk is simple: cilk, spawn, sync
• Recursion, recursion, recursion, …
• Work & span, work & span, work & span, …

Page 47: Basic Cilk Programming

Sudoku

• A game where you fill in a grid with numbers:
  – A number cannot appear more than once in any column.
  – A number cannot appear more than once in any row.
  – A number cannot appear more than once in any “region”.
• Typically presented with a 9 × 9 grid, but for simplicity we’ll consider a 4 × 4 grid.

[Figure: a 4 × 4 Sudoku puzzle with 11 open positions; three steps of the solution are shown, with the callouts “Since 1 is the only number missing in this column”, “Since 3 already appears in this region”, and “Since 3 is the only number missing in this row”]

Page 48: Basic Cilk Programming

Sudoku Algorithm

• The two-dimensional Sudoku grid is flattened into a vector:
  – Unsolved locations are filled with zeros.
  – The first two rows of the initial 4 × 4 puzzle are shown below; the current working location is loc = 0, and the subgrid size is 2.
  – Initially call spawn solve(size=2, grid, loc=0).

  grid:  3 0 0 4 0 0 0 2 …

• The first location already has a solution (a given clue), so move to the next location:
  – Recursively call spawn solve(size=2, grid, loc=loc+1).

  grid:  3 0 0 4 0 0 0 2 …

Page 49: Basic Cilk Programming

Exhaustive Search

• The next location (loc = 1) has no given solution (a ‘0’ in the current cell), so:
  – Create 4 new grids and try each of the 4 possibilities (1, 2, 3, 4) concurrently.
  – Note: the search goes much faster if each guess is first tested to see if it is legal.
  – Spawn a new search tree for each guess k: call spawn solve(size=2, grid[k], loc=loc+1).

new grids:
  3 1 0 4 0 0 0 2 …
  3 2 0 4 0 0 0 2 …
  3 3 0 4 0 0 0 2 …   (illegal, since 3 is already in the same row)
  3 4 0 4 0 0 0 2 …   (illegal, since 4 is already in the same row)

Source: Mattson and Keutzer, UCB CS294

Page 50: Basic Cilk Programming

Cilk Sudoku solution (part 1 of 3)

  cilk int solve(int size, int* grid, int loc)
  {
    int i, k, solved, numGrids, solution[MAX_NUM];
    int* myGrid[MAX_NUM];
    int numNumbers = size*size;
    int Gridlen = numNumbers*numNumbers;

    if (loc == Gridlen) {
      /* maximum depth; reached the end of the puzzle */
      return check_solution(size, grid);
    }

    /* if this location has a solution (given by the puzzle), */
    /* move to the next location */
    if (grid[loc] != 0) {
      solved = spawn solve(size, grid, loc+1);
      sync;
      return solved;
    }

Page 51: Basic Cilk Programming

Cilk Sudoku solution (part 2 of 3)

    /* try each number (unique in its row, column, and square) */
    numGrids = 0;
    for (i = 0, k = 0; i < MAX_NUM; i++) {
      k = next_guess(size, k, loc, grid);
      if (k == 0) break;  /* no more legal guesses at this location */

      /* need a new grid to work with */
      myGrid[i] = new_grid(size, grid);
      myGrid[i][loc] = k;
      solution[i] = spawn solve(size, myGrid[i], loc+1);
      numGrids += 1;
    }

    sync;

Page 52: Basic Cilk Programming

Cilk Sudoku solution (part 3 of 3)

    /* check to see if there is a solution */
    solved = 0;
    for (i = 0; i < numGrids; i++) {
      if (solution[i] == 1) {
        int n;
        /* found a solution; copy the result to the parent */
        for (n = loc; n < Gridlen; n++) {
          grid[n] = (myGrid[i])[n];
        }
        solved = 1;
      }
      free(myGrid[i]);
    }

    return solved;
  }