41
tbb::parallel_for template<typename TI, typename TF> void tbb::parallel_for( TI begin, TI end, const TF& f); Implements a parallel for loop for values in the range [begin,end) Each iteration may execute in parallel There are no guarantees HPCE / dt10 / 2012 / 7.1 There are no guarantees Iterations may be executed in any order Entirely legal to do the iterations in opposite order to normal Programmer’s job: make sure all iterations are independent TBB run-time’s job: try to execute iterations as fast as possible

tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

tbb::parallel_for

template<typename TI, typename TF>

void tbb::parallel_for( TI begin, TI end, const TF& f);

– Implements a parallel for loop for values in the range [begin,end)– Each iteration may execute in parallel

• There are no guarantees

HPCE / dt10 / 2012 / 7.1

• There are no guarantees

– Iterations may be executed in any order• Entirely legal to do the iterations in opposite order to normal

– Programmer’s job: make sure all iterations are independent– TBB run-time’s job: try to execute iterations as fast as possible

Page 2: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

Function object as worker

num_t ParallelFactor (num_t n)

{

num_t mx=1+ceil(sqrt(n));

num_t res =n;

tbb ::parallel_for(2, mx,

FactorWorker (n,&res)

);

struct FactorWorker

{

num_t n ;

num_t *res;

FactorWorker (num_t _n, num_t *_res)

{

n=_n;

res =_res;

HPCE / dt10 / 2012 / 7.2

return res;

}

res =_res;

}

template<class TR>

void operator()(const TR &r)

{

for(num_t i=r.begin();i<r.end();i++){

if( (n%i) ==0){

* res =i;

break;

}

}

}

};

Page 3: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

Installing TBB

• TBB can be downloaded from http://threadingbuildingblocks.org/– Follow “Downloads” -> “Stable Releases”– I’m using “tbb40_20111130oss”, but any tbb4 should work

• Windows: download “tbb40_20111130oss_win.zip” and unzip– Can be used from wherever you unzip it

HPCE / dt10 / 2012 / 7.3

– Can be used from wherever you unzip it– We’ll call that directory <TBB_DIR>

• Linux: I recommend building from sources – download tbb40_20111130oss_src.tgz– untar, and run “make” in the tbb directory– libraries will appear in the “build” directory

• Mac: ?– I have no idea how macs work, I assume it is straightforward

Page 4: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

Compiling TBB programs

• Two main things to worry about for compilation– Location of the header files: parallel_for.h etc.– Location of the import libraries: tbb.lib

• Header files are found in “<TBB_DIR>\include”

HPCE / dt10 / 2012 / 7.4

Page 5: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

Compiling TBB programs

• Two main things to worry about for compilation– Location of the header files: parallel_for.h etc.– Location of the import libraries: tbb.lib

• Header files are found in “<TBB_DIR>\include”• Import libraries are found in “<TBB_DIR>lib”

HPCE / dt10 / 2012 / 7.5

• Import libraries are found in “<TBB_DIR>lib”

Page 6: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

Compiling TBB programs

• Two main things to worry about for compilation– Location of the header files: parallel_for.h etc.– Location of the import libraries: tbb.lib

• Header files are found in “<TBB_DIR>\include”– Specify the include path to your compiler

HPCE / dt10 / 2012 / 7.6

– Specify the include path to your compiler• GCC : g++ <FILE>.cpp -I<TBB_DIR>/include

• MSVC : cl <FILE>.cpp /I<TBB_DIR>\include

• Import libraries are found in “<TBB_DIR>\lib”– Need to make sure you have the right ones for your compiler– e.g. Visual Studio 2008: “<TBB_DIR>\lib\ia32\vc8”

cl <FILE>.cpp /I<TBB_DIR>\include /link /LIBPATH:<TBB _DIR>\lib\ia32\vc8

– For linux they should automatically turn up in “<TBB_DIR>/lib”– g++ <FILE>.cpp -I<TBB_DIR>/include -L<TBB_DIR>/lib -ltbb

Page 7: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

Compiling gotchas

• Don’t forget to set optimisation flags– It’s pointless to time executables that the compiler didn’t optimise– Multiple flags, but “-O2” for g++ or “/O2” for cl are decent

HPCE / dt10 / 2012 / 7.7

• If you use assertions don’t forget to turn them off when timing– Use “-DNDEBUG=1” for g++ or “/DNDEBUG=1” for cl

Page 8: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

Running TBB programs

• TBB uses a dynamically linked library– At run-time it needs to load in code from a separate file– You need to make sure that the OS can find that library

• Windows: there are multiple solutions– Ok: add “<TBB_DIR>\bin\ia32\vc8” to the path

HPCE / dt10 / 2012 / 7.8

– Ok: add “<TBB_DIR>\bin\ia32\vc8” to the path• set PATH=%PATH%;“<TBB_DIR>\bin\ia32\vc8”

– Ok: copy tbb.dll and tbb_debug.dll to same directory as .exe– Poor: copy tbb.dll and tbb_debug.dll to windows directory

• Linux: depends on whether you are admin– Good: Put the .so files in /usr/lib or /usr/local/lib– Ok: add directory with shared libraries to LD_LIBRARY_PATH

• Mac: ?

Page 9: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

Implementing tbb::parallel_for

• How could we implement parallel_for?– Simple helper class for

describing a range

• Start with a sequential version

template<class TI>

class range

{

private:

TI m_begin , m_end;

public:

range (TI _begin, TI _end)

{ m_begin=_begin; m_end=_end; }

HPCE / dt10 / 2012 / 7.9

TI begin () const

{ return m_begin; }

TI end () const

{ return m_end; }

};

template<class TI,class TF>

void parallel_for(TI beg,TI end, const TF &f)

{

for(TI i=beg;i<end;i++){

range <TI> r(i, i+1);

f ( r );

}

}

Page 10: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

Implementing tbb::parallel_for

• How could we implement parallel_for?– Simple helper class for describing a range

• Start with a sequential version• What about a Cilk version?

– Q : What does the critical path look like?

HPCE / dt10 / 2012 / 7.10

– Q : What does the critical path look like?

template<class TI,class TF>

void parallel_for(TI begin, TI end, const TF &f)

{

for(TI i=begin;i<end;i++){

range <TI> r(i, i+1);

spawn f ( r );

}

}

Page 11: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

Implementing tbb::parallel_for

• How could we implement parallel_for?– Simple helper class for describing a range

• Start with a sequential version• What about a Cilk version?

– Let’s try again

HPCE / dt10 / 2012 / 7.11

– Let’s try again

template<class TI,class TF>

void parallel_for(TI begin, TI end, const TF &f)

{

if(begin+1==end){

range <TI> r(i, i+1);

spawn f ( r );

}else{

spawn parallel_for (begin, (begin+end)/2, f);

spawn parallel_for ((begin+end)/2, end, f);

}

}

Page 12: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

Implementing tbb::parallel_for

• This looks good, except what about cost of work!• TBB will automatically apply agglomeration

– Split range up into larger contiguous ranges– Hand ranges to function for processing in sequential loop

• How does it know what THRESH should be?– Dynamically varies based on time for last batch – auto-tune!

HPCE / dt10 / 2012 / 7.12

– Dynamically varies based on time for last batch – auto-tune!

template<class TI,class TF>

void parallel_for(TI begin, TI end, const TF &f)

{

if(end-begin < THRESH){

range <TI> r(begin, end);

spawn f ( r );

}else{

spawn parallel_for (begin, (begin+end)/2, f);

spawn parallel_for ((begin+end)/2, end, f);

}

}

Page 13: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

HPCE / dt10 / 2012 / 7.13

Page 14: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

HPCE / dt10 / 2012 / 7.14

Page 15: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

HPCE / dt10 / 2012 / 7.15

Page 16: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

References vs pointers• C++ has references as well as pointers

– Pointers use asterisk (*), references use ampersand (&)

• Some differences between pointers and references– References are guaranteed to be non-null– A reference always points at the same object

• Operations on the references happen to the original object

HPCE / dt10 / 2012 / 7.16

void MyPtrFunc(my_class *x)

{

x->wibble();

my_class *p=x;

* x = my_class(5);

my_class tmp ;

x=tmp;

}

my_class x;

MyPtrFunc(&x); // pass by pointer

void MyRefFunc(my_class &x)

{

x.wibble();

my_class *p=&x;

x = my_class(5);

my_class tmp ;

&x=&tmp;

}

my_class x;

MyRefFunc(x); // pass by ref

Page 17: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

Const methods

• Methods with the const modifier cannot change the object– Cannot modify the internal state of the object– Object is “the same” after any const method is called

• References with the const modifier cannot be changed– Can only call const methods and read object properties

HPCE / dt10 / 2012 / 7.17

– Can only call const methods and read object properties– Only methods marked const are considered const methods

• Methods which don’t change state are “non-const” unless marked

• Const references are very useful– Pass by reference makes it cheap to pass object to function– Const-ness means object can be safely used in parallel

• This is why parallel_for demands a const reference function

Page 18: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

Scheduling through reference counts

• Each task has a reference count and a successor task• The reference count identifies whether a task is blocked

– If the reference count is zero then the task could be run– But only if it has been given to the task scheduler– Legal to create a task and not give it to the scheduler

HPCE / dt10 / 2012 / 7.18

– Legal to create a task and not give it to the scheduler– Note the difference: “reference count” vs “C++ reference”

• The successor task identifies the task blocked by this task– Same concept as “parent” in Cilk, but slightly more general– When a task completes it decrements the count of its successor

Page 19: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

HPCE / dt10 / 2012 / 7.19

Page 20: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

HPCE / dt10 / 2012 / 7.20

Page 21: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

HPCE / dt10 / 2012 / 7.21

Page 22: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

HPCE / dt10 / 2012 / 7.22

Page 23: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

HPCE / dt10 / 2012 / 7.23

Page 24: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

HPCE / dt10 / 2012 / 7.24

Page 25: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

HPCE / dt10 / 2012 / 7.25

Page 26: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

HPCE / dt10 / 2012 / 7.26

Page 27: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

HPCE / dt10 / 2012 / 7.27

Page 28: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

HPCE / dt10 / 2012 / 7.28

Page 29: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

HPCE / dt10 / 2012 / 7.29

Page 30: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

HPCE / dt10 / 2012 / 7.30

Page 31: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

HPCE / dt10 / 2012 / 7.31

Page 32: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

HPCE / dt10 / 2012 / 7.32

Page 33: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

Managing reference counts

• What happens if we get the reference count wrong?

• Finishing task calls decrement_ref_count on successor– Automatically returns task to scheduler if count becomes zero

HPCE / dt10 / 2012 / 7.33

Page 34: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

HPCE / dt10 / 2012 / 7.34

Page 35: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

tbb::task

refcount

MyTask

0

start ...

end ...

successor -

Running

tbb::task

refcount

MyTask

0

start ...

end ...

Finished

tbb::task

refcount

MyTask

0

start ...

end ...

Ready

successor successor

HPCE / dt10 / 2012 / 7.35

tbb::task * MyTask::execute()

{

if(cond())

return 0;

set_ref_count(2);

MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2));

MyTask &t2=*new(allocate_child()) MyTask((start+end)/2), end);

spawn(t1);

spawn(t2);

DoSomethingFirst();

wait_for_all();

DoSomethingElse();

return 0;

}

Page 36: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

HPCE / dt10 / 2012 / 7.36

Page 37: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

HPCE / dt10 / 2012 / 7.37

Page 38: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

Some help is available

• TBB library comes in two forms: debug and release– release library does no error checking – all about speed– debug library will check reference counts at many points

• Choose library version at compilation and link stages

HPCE / dt10 / 2012 / 7.38

• Choose library version at compilation and link stages– Debug: #define TBB_USE_DEBUG=1 when compiling

• On microsoft compilers it will automatically link the correct library• On other compilers use “-ltbb ” vs “-ltbb_debug ”

– Usually maintain different release and debug settings• Debug: /DTBB_USE_DEBUG=1 /MDd• Release: /DNDEBUG=1 /O2

– Can setup in Visual Studio or in a makefile

Page 39: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

Data-parallelism vs task parallelism

• Two very broad types of parallelism we’ve seen so far

• Data-parallelism: do the same task lots of time in any order– The code for the task stays the same for each execution– The input to the task changes with each execution– There are no dependencies between different executions

HPCE / dt10 / 2012 / 7.39

– There are no dependencies between different executions• Often applied to elements of an array• Also described as “loop-parallelism”

• Task-parallelism: do many different tasks with dependencies– Each task has zero or more dependencies that must be met– Different tasks may have different code– More powerful than data-parallelism?

Page 40: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

Is our factorisation program correct?

• What if there are more than two factors?

HPCE / dt10 / 2012 / 7.40

• What if num_t is something more complex than an integer?

Page 41: tbb::parallel fordt10/teaching/2011/hpce/hpce-lec7-tbb-details.pdf · void tbb::parallel_for( TI begin, TI end, const TF& f); – Implements a parallel for loop for values in the

Coursework 1

• Now available on the course website• Due Mon 27th of Feb

• Acceleration of five algorithms using TBB• Code should be fairly straightforward

HPCE / dt10 / 2012 / 7.41

• Code should be fairly straightforward– Not designed to be difficult to parallelise– May be difficult to parallelise efficiently

• Standard plagiarism rules apply– You can work together– Code and report must be your own