© Ph. Tsigas 2003-2004

Algorithm Engineering of Parallel Algorithms and Parallel Data Structures

Philippas Tsigas


NOBLE: A Library of Non-Blocking Concurrent Data Structures

Philippas Tsigas

Results jointly with:

Håkan Sundell and Yi Zhang


Overview

Introduction
• Synchronization
• Non-blocking Synchronization

Is Non-blocking Synchronization performance-beneficial for Parallel Applications?

NOBLE: A Non-blocking Synchronization Interface. How can we make non-blocking synchronization accessible to the parallel programmer?

Lock-free Skip lists

Conclusions, Future Work


Systems: SMP

Cache-coherent distributed shared memory multiprocessor systems: UMA, NUMA


Synchronization

• Barriers
• Locks, semaphores, … (mutual exclusion)

“A significant part of the work performed by today’s parallel applications is spent on synchronization.”

...


Lock-Based Synchronization: Sequential


Non-blocking Synchronization

Lock-Free Synchronization: Optimistic approach

• Assumes it is alone and prepares the operation, which later takes effect (unless interfered with) in one atomic step, using hardware atomic primitives

• Interference is detected via shared memory

• Retries until no other operation interferes

• Can cause starvation
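The optimistic pattern in the bullets above can be sketched with C11 atomics. This is only an illustration: the function name `add_bounded` and the bounded-counter scenario are assumptions made for the example, not something taken from the slides.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical example: add `delta` to a shared total without a lock,
 * refusing updates that would push the total above `limit`.
 * Returns true if the update was applied. */
bool add_bounded(_Atomic long *total, long delta, long limit)
{
    long seen = atomic_load(total);      /* read the shared state */
    for (;;) {
        long wanted = seen + delta;      /* prepare the result privately */
        if (wanted > limit)
            return false;                /* nothing was published */
        /* Publish in one atomic step. If another thread changed the total
         * in the meantime, the CAS fails, `seen` is refreshed with the
         * current value, and the loop retries. */
        if (atomic_compare_exchange_weak(total, &seen, wanted))
            return true;
    }
}
```

Nothing bounds the number of retries of a single thread, which is where the starvation remark comes from: some thread always makes progress, but not necessarily this one.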


Example: Shared Queue

The usual approach is to implement operations using retry loops. Here’s an example:

    type Qtype = record v: valtype; next: pointer to Qtype end
    shared var Tail: pointer to Qtype;
    local var old, new: pointer to Qtype

    procedure Enqueue (input: valtype)
        new := (input, NIL);
        repeat old := Tail
        until CAS2(&Tail, &(old->next), old, NIL, new, new)

[Figure: the queue before and after the enqueue, with the pointers Tail, old and new]

Slide provided by Jim Anderson


Non-blocking Synchronization

Lock-Free Synchronization
• Avoids problems that locks have
• Fast
• Starvation? (not in the context of HPC)

Wait-Free Synchronization
• Always finishes in a finite number of its own steps
• Complex algorithms
• Memory consuming
• Less efficient on average than lock-free


Overview

Introduction
• Synchronization
• Non-blocking Synchronization

Is Non-blocking Synchronization performance-beneficial for Parallel Scientific Applications?

NOBLE: A Non-blocking Synchronization Interface. How can we make non-blocking synchronization accessible to the parallel programmer?

Conclusions, Future Work


Non-blocking Synchronisation

Synchronisation:
• An alternative approach to synchronisation, introduced 25 years ago
• Many theoretical results

Evaluation:
• Micro-benchmarks show better performance than mutual exclusion in real or simulated multiprocessor systems.


Practice

• Non-blocking synchronization is still not used in practical applications.

• Non-blocking solutions are often complex, with non-standard or unclear interfaces, which makes them impractical.


Practice

Question: “How is the performance of parallel scientific applications affected by the use of non-blocking synchronisation rather than lock-based synchronisation?”


Answers

The identification of the basic locking operations that parallel programmers use in their applications.

The efficient non-blocking implementation of these synchronisation operations.

The architectural implications on the design of non-blocking synchronisation.

Comparison of the lock-based and lock-free versions of the respective applications.

How is the performance of parallel scientific applications affected by the use of non-blocking synchronisation rather than lock-based synchronisation?


Applications

Ocean: simulates eddy currents in an ocean basin.

Radiosity: computes the equilibrium distribution of light in a scene using the radiosity method.

Volrend: renders 3D volume data into an image using a ray-casting method.

Water: evaluates forces and potentials that occur over time between water molecules.

Spark98: a collection of sparse matrix kernels. Each kernel performs a sequence of sparse matrix-vector product operations using matrices that are derived from a family of three-dimensional finite element earthquake applications.


Removing Locks in Applications

• Many locks are “simple locks”: CAS, FAA and LL/SC can be used to implement non-blocking versions.

• Many critical sections contain shared floating-point variables: floating-point synchronization primitives are needed. A Double-Fetch-and-Add primitive was designed.

• Large critical sections: efficient non-blocking implementations of large ADTs are used.


Experimental Results: Speedup

[Figure: speedup plots; panel labels 58P, 58P, 58P, 58P, 32P, 24P, 24P]


SPARK98

Before:

    spark_setlock(lockid);
    w[col][0] += A[Anext][0][0]*v[i][0] + A[Anext][1][0]*v[i][1] + A[Anext][2][0]*v[i][2];
    w[col][1] += A[Anext][0][1]*v[i][0] + A[Anext][1][1]*v[i][1] + A[Anext][2][1]*v[i][2];
    w[col][2] += A[Anext][0][2]*v[i][0] + A[Anext][1][2]*v[i][1] + A[Anext][2][2]*v[i][2];
    spark_unsetlock(lockid);

After:

    dfad(&w[col][0], A[Anext][0][0]*v[i][0] + A[Anext][1][0]*v[i][1] + A[Anext][2][0]*v[i][2]);
    dfad(&w[col][1], A[Anext][0][1]*v[i][0] + A[Anext][1][1]*v[i][1] + A[Anext][2][1]*v[i][2]);
    dfad(&w[col][2], A[Anext][0][2]*v[i][0] + A[Anext][1][2]*v[i][1] + A[Anext][2][2]*v[i][2]);
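The slides do not show how `dfad` is implemented. One plausible way to build a double-precision fetch-and-add from a word-sized CAS is sketched below; the names `dfad_sketch`, `dfad_store`, `dfad_load` and the type `atomic_double_bits` are assumptions for this illustration, and the sketch stores the double as a 64-bit atomic word instead of taking a plain `double *` as the slide's call does.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(double) == sizeof(uint64_t), "sketch assumes 64-bit doubles");

/* Hypothetical storage type: the bit pattern of a double in an atomic word. */
typedef _Atomic uint64_t atomic_double_bits;

static uint64_t double_to_bits(double d) { uint64_t b; memcpy(&b, &d, sizeof b); return b; }
static double   bits_to_double(uint64_t b) { double d; memcpy(&d, &b, sizeof d); return d; }

void   dfad_store(atomic_double_bits *t, double v) { atomic_store(t, double_to_bits(v)); }
double dfad_load(atomic_double_bits *t) { return bits_to_double(atomic_load(t)); }

/* Atomically performs *target += delta and returns the previous value. */
double dfad_sketch(atomic_double_bits *target, double delta)
{
    uint64_t old_bits = atomic_load(target);
    for (;;) {
        double old_val = bits_to_double(old_bits);
        uint64_t new_bits = double_to_bits(old_val + delta);
        /* On failure old_bits is refreshed with the current contents,
         * so the sum is recomputed from the latest value. */
        if (atomic_compare_exchange_weak(target, &old_bits, new_bits))
            return old_val;
    }
}
```

Each of the three updates in the "After" version is then atomic on its own, but the three together are no longer one critical section; that is acceptable here because the three sums are independent accumulations.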


Overview

Introduction
• Synchronization
• Non-blocking Synchronization

Is Non-blocking Synchronization beneficial for Parallel Scientific Applications?

NOBLE: A Non-blocking Synchronization Interface. How can we make non-blocking synchronization accessible to the parallel programmer?

Conclusions, Future Work


Practice

• Non-blocking synchronization is still not used in practical applications.

• Non-blocking solutions are often complex, with non-standard or unclear interfaces, which makes them impractical.


NOBLE: Brings Non-blocking closer to Practice

Create a non-blocking inter-process communication interface with the properties:
• Attractive functionality
• Programmer friendly
• Easy to adapt existing solutions
• Efficient
• Portable
• Adaptable for different programming languages


NOBLE Design: Portable

Exported definitions (identical for all platforms):

    Noble.h:            #define NBL... #define NBL... #define NBL...

Platform independent:

    QueueLF.c:          #include "Platform/Primitives.h" …
    StackLF.c:          #include "Platform/Primitives.h" …
    . . .

Platform dependent:

    SunHardware.asm:    CAS, TAS, Spin-Locks …
    IntelHardware.asm:  CAS, TAS, Spin-Locks ...
    . . .


Using NOBLE

• First create a global variable handling the shared data object, for example a stack:

    Globals:
        #include <noble.h>
        ...
        NBLStack* stack;

• Create the stack with the appropriate implementation:

    Main:
        stack=NBLStackCreateLF(10000);
        ...

• When some thread wants to do some operation:

    Threads:
        NBLStackPush(stack, item);
    or
        item=NBLStackPop(stack);


Using NOBLE

When the data structure is not in use anymore:

    Globals:
        #include <noble.h>
        ...
        NBLStack* stack;

    Main:
        stack=NBLStackCreateLF(10000);
        ...
        NBLStackFree(stack);


Using NOBLE

    Globals:
        #include <noble.h>
        ...
        NBLStack* stack;

    Main:
        stack=NBLStackCreateLB();
        ...
        NBLStackFree(stack);

    Threads:
        NBLStackPush(stack, item);
    or
        item=NBLStackPop(stack);

• To change the synchronization mechanism, only one line of code has to be changed!
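How a single interface can hide the choice of synchronization mechanism is sketched below. This is not NOBLE's implementation: every name here (`DemoStack`, `demo_stack_create_lf`, `demo_stack_create_lb`, and so on) is hypothetical, the lock-free variant is a plain CAS-based linked stack, and the lock-based variant uses a test-and-set spin lock. Allocation failures are not handled, and the lock-free variant deliberately never frees popped nodes so that it does not need a safe memory reclamation scheme.

```c
#include <stdatomic.h>
#include <stdlib.h>

typedef struct Node { void *item; struct Node *next; } Node;

typedef struct DemoStack DemoStack;
struct DemoStack {
    _Atomic(Node *) top;
    atomic_flag lock;                         /* used by the lock-based variant only */
    void  (*push)(DemoStack *, void *);       /* chosen once, at creation */
    void *(*pop)(DemoStack *);
};

/* Lock-free variant: retry loops around CAS on `top`. */
static void push_lf(DemoStack *s, void *item) {
    Node *n = malloc(sizeof *n);
    n->item = item;
    n->next = atomic_load(&s->top);
    while (!atomic_compare_exchange_weak(&s->top, &n->next, n)) { }
}
static void *pop_lf(DemoStack *s) {
    Node *old = atomic_load(&s->top);
    while (old != NULL && !atomic_compare_exchange_weak(&s->top, &old, old->next)) { }
    return old ? old->item : NULL;            /* node is leaked on purpose (see lead-in) */
}

/* Lock-based variant: the same operations inside a spin lock. */
static void spin_lock(DemoStack *s)   { while (atomic_flag_test_and_set(&s->lock)) { } }
static void spin_unlock(DemoStack *s) { atomic_flag_clear(&s->lock); }
static void push_lb(DemoStack *s, void *item) {
    Node *n = malloc(sizeof *n);
    n->item = item;
    spin_lock(s);
    n->next = atomic_load(&s->top);
    atomic_store(&s->top, n);
    spin_unlock(s);
}
static void *pop_lb(DemoStack *s) {
    spin_lock(s);
    Node *old = atomic_load(&s->top);
    if (old != NULL) atomic_store(&s->top, old->next);
    spin_unlock(s);
    if (old == NULL) return NULL;
    void *item = old->item;
    free(old);                                /* safe: no other thread can still hold it */
    return item;
}

static DemoStack *create(void (*push)(DemoStack *, void *), void *(*pop)(DemoStack *)) {
    DemoStack *s = malloc(sizeof *s);
    atomic_init(&s->top, NULL);
    atomic_flag_clear(&s->lock);
    s->push = push;
    s->pop = pop;
    return s;
}

/* The only call that differs between the two mechanisms. */
DemoStack *demo_stack_create_lf(void) { return create(push_lf, pop_lf); }
DemoStack *demo_stack_create_lb(void) { return create(push_lb, pop_lb); }

void  demo_stack_push(DemoStack *s, void *item) { s->push(s, item); }
void *demo_stack_pop(DemoStack *s)              { return s->pop(s); }

void demo_stack_free(DemoStack *s) {
    Node *n = atomic_load(&s->top);
    while (n != NULL) { Node *next = n->next; free(n); n = next; }
    free(s);
}
```

Threads only ever call `demo_stack_push` and `demo_stack_pop`, so switching from the lock-free to the lock-based variant means changing the one creation call, mirroring the point made on the slide.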


Design: Attractive functionality

Data structures for multi-threaded usage:
• FIFO Queues
• Priority Queues
• Dictionaries
• Stacks
• Singly linked lists
• Snapshots
• MWCAS
• ...

Clear specifications


Status

Multiprocessor support:
• Sun Solaris (Sparc)
• Win32 (Intel x86)
• SGI (Mips)
• Linux (Intel x86)

Available for academic use: http://www.noble-library.org/


Did our Work have any Impact?

1) Industry has initiated contact and uses a test version of NOBLE.

2) Freeware developers have shown interest.

3) Interest from research organisations. NOBLE is freely available for research and educational purposes.


A Lock-Free Skip list

H. Sundell, Ph. Tsigas: Fast and Lock-Free Concurrent Priority Queues for Multi-Thread Systems. 17th IEEE/ACM International Parallel and Distributed Processing Symposium (IPDPS ´03), May 2003 (TR 2002). Best Paper Award

A very similar skip list algorithm will be presented this August at the ACM Symposium on Principles of Distributed Computing (PODC 2004):

”Lock-Free Linked Lists and Skip Lists” Mikhail Fomitchev, Eric Ruppert


Randomized Algorithm: Skip Lists

William Pugh: “Skip Lists: A Probabilistic Alternative to Balanced Trees”, 1990

Layers of ordered lists with different densities achieve a tree-like behavior

Time complexity: O(log₂ N) (probabilistic!)

[Figure: skip list from Head through nodes 1 to 7 to Tail; the levels are labelled with densities 50%, 25%, …]
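The halving densities in the figure come from choosing each node's height at random. A minimal sketch of that choice follows; the cap `MAX_LEVEL`, the xorshift generator and the function names are assumptions for the illustration and are not taken from the slides.

```c
#include <stdint.h>

#define MAX_LEVEL 16   /* hypothetical cap on the height of a node */

/* Small xorshift32 generator so the sketch has no hidden global state.
 * The seed must be nonzero. */
static uint32_t next_random(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Every node is on level 1; it continues to the next level with
 * probability 1/2, so about 50% of the nodes reach level 2, 25% reach
 * level 3, and so on, up to MAX_LEVEL. */
int random_level(uint32_t *state)
{
    int level = 1;
    while (level < MAX_LEVEL && ((next_random(state) >> 16) & 1u))
        level++;
    return level;
}
```

With heights distributed this way, a search that starts at the top level skips about half of the remaining nodes at each level it descends, which is where the logarithmic expected cost comes from.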


Our Lock-Free Concurrent Skip List

Define node state to depend on the insertion status at lowest level as well as a deletion flag

Insert from lowest level going upwards

Set deletion flag. Delete from highest level going downwards

[Figure: skip list nodes 1 to 7, each with a deletion flag D; a node p with levels 1, 2, 3, shown without and with its deletion flag D set]


Concurrent Insert vs. Delete operations

Problem:

- both nodes are deleted!

Solution (Harris et al): Use bit 0 of pointer to mark deletion status

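[Figure: list with nodes 1, 2, 4 and a new node 3. Without marking: Delete (a) and Insert (b) overlap. With the deletion mark (*): steps a), b), c)]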
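The marking idea can be sketched in C as follows. The node layout and the helper names (`with_mark`, `mark_deleted`, `try_insert_after`) are assumptions for this illustration; the sketch relies on nodes being aligned to at least two bytes so that bit 0 of a node address is always zero and can carry the deletion mark.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ListNode {
    int key;
    _Atomic(struct ListNode *) next;   /* bit 0 set: this node is logically deleted */
} ListNode;

static inline ListNode *with_mark(ListNode *p)    { return (ListNode *)((uintptr_t)p | (uintptr_t)1); }
static inline ListNode *without_mark(ListNode *p) { return (ListNode *)((uintptr_t)p & ~(uintptr_t)1); }
static inline bool      is_marked(ListNode *p)    { return ((uintptr_t)p & (uintptr_t)1) != 0; }

/* Logical deletion: set the mark on the node's own next pointer.
 * Returns false if some other thread marked it first. */
bool mark_deleted(ListNode *node)
{
    ListNode *succ = atomic_load(&node->next);
    while (!is_marked(succ)) {
        if (atomic_compare_exchange_weak(&node->next, &succ, with_mark(succ)))
            return true;
    }
    return false;
}

/* One insertion attempt of `node` between `prev` and `succ`.
 * The CAS expects the unmarked pointer, so it fails once `prev` is marked. */
bool try_insert_after(ListNode *prev, ListNode *succ, ListNode *node)
{
    atomic_store(&node->next, succ);
    ListNode *expected = succ;
    return atomic_compare_exchange_strong(&prev->next, &expected, node);
}
```

Because the mark lives in the same word as the pointer, a single CAS checks both "the successor is unchanged" and "the predecessor is not being deleted", so an insert after a node that is being deleted fails and has to retry from a live predecessor instead of being lost.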


Dynamic Memory Management

Problem: System memory allocation functionality is blocking!

Solution (lock-free), IBM freelists: Pre-allocate a number of nodes, link them into a dynamic stack structure, and allocate/reclaim using CAS

[Figure: free-list Head → Mem 1 → Mem 2 → … → Mem n; Allocate takes a node from the list, Reclaim returns the node Used 1 to it]
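A free-list of pre-allocated nodes can be sketched as below. The pool size, the use of array indices instead of pointers, and all names are assumptions for this illustration. The head word also carries a 32-bit version tag, which is an addition that is not on the slide: it is included so that the sketch is not exposed to a recycled head value being mistaken for an unchanged one.

```c
#include <stdatomic.h>
#include <stdint.h>

#define POOL_SIZE 4              /* hypothetical number of pre-allocated nodes */
#define NIL_INDEX 0xFFFFFFFFu    /* marks the end of the free-list */

typedef struct {
    _Atomic uint32_t next;       /* index of the next free node */
    int payload;
} PoolNode;

static PoolNode pool[POOL_SIZE];
static _Atomic uint64_t free_head;   /* high 32 bits: version tag, low 32 bits: index */

static uint64_t pack(uint32_t tag, uint32_t index)
{
    return ((uint64_t)tag << 32) | index;
}

/* Link all nodes into one stack: 0 -> 1 -> ... -> POOL_SIZE-1 -> NIL. */
void pool_init(void)
{
    for (uint32_t i = 0; i < POOL_SIZE; i++)
        atomic_store(&pool[i].next, (i + 1 < POOL_SIZE) ? i + 1 : NIL_INDEX);
    atomic_store(&free_head, pack(0, 0));
}

/* Pop a node from the free-list; returns NIL_INDEX when the pool is empty. */
uint32_t pool_allocate(void)
{
    uint64_t head = atomic_load(&free_head);
    for (;;) {
        uint32_t index = (uint32_t)head;
        if (index == NIL_INDEX)
            return NIL_INDEX;
        uint64_t wanted = pack((uint32_t)(head >> 32) + 1, atomic_load(&pool[index].next));
        if (atomic_compare_exchange_weak(&free_head, &head, wanted))
            return index;
    }
}

/* Push a node back onto the free-list. */
void pool_reclaim(uint32_t index)
{
    uint64_t head = atomic_load(&free_head);
    for (;;) {
        atomic_store(&pool[index].next, (uint32_t)head);
        uint64_t wanted = pack((uint32_t)(head >> 32) + 1, index);
        if (atomic_compare_exchange_weak(&free_head, &head, wanted))
            return;
    }
}
```

Both operations are the same retry loop as a lock-free stack push and pop, so allocation never waits on a lock held by a pre-empted thread.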


The ABA problem

Problem: Because of concurrency (pre-emption in particular), the same pointer value does not always mean the same node (i.e. CAS succeeds although the node has been reclaimed and reused in the meantime)!

[Figure: Step 1 shows nodes 1, 6, 7 and 4; Step 2 shows nodes 2, 3, 7 and 4]
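The effect can be replayed deterministically on a single thread. The sketch below is an illustration only: the stack, the node names and the interleaving are assumptions chosen to make the problem visible, with the actions of the two threads written out in the order in which they would happen.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct AbaNode { int value; struct AbaNode *next; } AbaNode;

/* Replays an ABA interleaving on a stack a -> b -> c.
 * Returns true if the stale CAS succeeded and left `top` pointing at b,
 * a node that is no longer part of the stack. */
bool aba_demo(void)
{
    static AbaNode a = {1, NULL}, b = {2, NULL}, c = {3, NULL};
    a.next = &b; b.next = &c; c.next = NULL;

    _Atomic(AbaNode *) top;
    atomic_init(&top, &a);

    /* Thread 1 begins a pop: it reads top (a) and a's successor (b),
     * then is pre-empted before its CAS. */
    AbaNode *t1_old  = atomic_load(&top);
    AbaNode *t1_next = t1_old->next;

    /* Thread 2 runs in the meantime: pops a, pops b, pushes a back. */
    atomic_store(&top, &b);      /* pop a */
    atomic_store(&top, &c);      /* pop b; b is now reclaimed */
    b.next = NULL;               /* b's memory is reused for something else */
    a.next = &c;
    atomic_store(&top, &a);      /* push a: stack is now a -> c */

    /* Thread 1 resumes. top holds the same pointer value (a) as before,
     * so the CAS succeeds even though the stack has changed. */
    bool succeeded = atomic_compare_exchange_strong(&top, &t1_old, t1_next);

    /* top is now b, which was reclaimed; node c has been lost. */
    return succeeded && atomic_load(&top) == &b;
}
```

The CAS compares only the pointer value, so it cannot tell "a was never removed" from "a was removed and put back".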


The ABA problem

Solution (Valois et al.): Add reference counting to each node, in order to prevent nodes that are of interest to some thread from being reclaimed until all threads have left the node

[Figure: New Step 2, with reference counts on the nodes: CAS fails!]


Helping Scheme

Threads need to traverse safely

Need to remove marked-to-be-deleted nodes while traversing – Help!

Find the previous node, finish the deletion, and continue traversing from the previous node

[Figure: traversal alternatives on the list 1, 2 (marked *), 4]


Overlapping operations on shared data

Example: Insert operation
- which of 2 or 3 gets inserted?

Solution: Compare-And-Swap atomic primitive:

    CAS(p: pointer to word, old: word, new: word): boolean
    atomic do
        if *p = old then
            *p := new;
            return true;
        else
            return false;

[Figure: list 1 → 4 with the concurrent operations Insert 2 and Insert 3]
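The race between the two inserts can be written out with C11 atomics, where `atomic_compare_exchange_strong` plays the role of the CAS primitive defined above. The node type and the helper names `cas_next` and `try_link` are assumptions for this sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct SortedNode {
    int key;
    _Atomic(struct SortedNode *) next;
} SortedNode;

/* Same contract as the CAS above: if *p equals old, store new_node
 * and return true; otherwise leave *p unchanged and return false. */
bool cas_next(_Atomic(SortedNode *) *p, SortedNode *old, SortedNode *new_node)
{
    return atomic_compare_exchange_strong(p, &old, new_node);
}

/* One attempt to link `node` between `prev` and `succ`.
 * Fails if prev->next no longer points to succ. */
bool try_link(SortedNode *prev, SortedNode *succ, SortedNode *node)
{
    atomic_store(&node->next, succ);   /* prepare privately */
    return cas_next(&prev->next, succ, node);
}
```

If both inserts have read 1 → 4, exactly one of the two `try_link(&n1, &n4, ...)` calls succeeds. The other one fails, re-reads the list, and retries with the predecessor and successor it now observes, so neither node is lost.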


Experiments

1-30 threads on platforms with different levels of real concurrency

10000 Insert vs. DeleteMin operations by each thread. 100 vs. 1000 initial inserts

Compare with other implementations:
• Lotan and Shavit, 2000
• Hunt et al., “An Efficient Algorithm for Concurrent Priority Queue Heaps”, 1996


Full Concurrency


Medium Pre-emption


High Pre-emption


Lessons Learned

The Non-Blocking Synchronization Paradigm can be suitable and beneficial to large scale parallel applications.

Reproducible experimental work. Many results claimed by simulation are not consistent with what we observed.

Applications gave us nice problems to look at and do theoretical work on. (IPDPS 2003 Algorithmic Best Paper Award)

NOBLE helped programmers to trust our implementations.


Future Work

Extend NOBLE for loosely coupled systems.

Extend the set of data structures supported by NOBLE based on the needs of the applications.

Reactive-Synchronisation


Questions?

Contact Information:

Address: Philippas Tsigas, Computing Science, Chalmers University of Technology

Email: tsigas @ cs.chalmers.se

Web:
http://www.cs.chalmers.se/~tsigas
http://www.cs.chalmers.se/~dcs
http://www.noble-library.org


Pointers:

NOBLE: A Non-Blocking Inter-Process Communication Library. ACM Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers (LCR ´02).

Evaluating The Performance of Non-Blocking Synchronization on Shared Memory Multiprocessors. ACM SIGMETRICS 2001/Performance2001 Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2001).

Integrating Non-blocking Synchronization in Parallel Applications: Performance Advantages and Methodologies. ACM Workshop on Software and Performance (WOSP ´01).

A Simple, Fast and Scalable Non-Blocking Concurrent FIFO queue for Shared Memory Multiprocessor Systems, ACM Symposium on Parallel Algorithms and Architectures (SPAA ´01).

Fast and Lock-Free Concurrent Priority Queues for Multi-Thread Systems. 17th IEEE/ACM International Parallel and Distributed Processing Symposium (IPDPS ´03).

Fast, Reactive and Lock-free Multi-word Compare-and-swap Algorithms. 12th IEEE/ACM International Conference on Parallel Architectures and Compilation Techniques (PACT ´03).

Scalable and Lock-free Concurrent Dictionaries. Proceedings of the 19th ACM Symposium on Applied Computing (SAC ’04).