
Chapter 24a: More Numerics and Parallelism

Nick Maclaren

This was written by me, not Bjarne Stroustrup

http://www.ucs.cam.ac.uk/docs/course-notes/unix-courses/CPLUSPLUS


Numeric Algorithms

These are only accumulate(), inner_product(), partial_sum() and adjacent_difference()
Not what numerical programmers call algorithms
I can't see any particular reason to use them

C++ developers rarely pay attention to numerical properties, or high performance, unlike Fortran ones
They are likely to be just the obvious code
The first three can be implemented much better

I recommend doing as I show in the exercises
BLAS, long double or compensated summation


Gaussian Elimination

The book teaches Gaussian elimination with pivoting as an example of a typical numeric algorithm
You may need to write such code, in other contexts
But DON'T just copy that code, for reasons I shall explain
I am NOT criticising the book or code – merely stressing the software reuse principle

The executive summary here is “use LAPACK”


Using Libraries

The first approach is to call a (good!) library

These usually have a Fortran or C interface
There are some C++ libraries around, too
They are of VERY mixed quality

NAG, LAPACK, FFTW are reliable
Netlib is patchy, but some of it is good
Numerical Recipes is NOT reliable


How to Write Them

Choose a numerically competent algorithm!

This is the key to accuracy and performance
Do NOT use Numerical Recipes as a guide
The NAG documentation is much better

When coding them, watch out for numeric errors
Typically accumulation and cancellation errors
For these, there are some adequate solutions

Or subtracting/dividing two nearly-equal numbers
This one is harder to resolve, and I shall skip it


Improving Accuracy

Often arises when using accumulate() or inner_product()

Only simple solution is to use long double for the accumulation
It's useful for the multiplication in inner_product(), too, but is not essential
This is left as an exercise (see later)
It may or may not help, for very complicated reasons

You can actually do a lot better (in accuracy)
But it's NOT a task for the non-expert
Both numerically and in the C++ and C languages
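A minimal sketch of the "simple solution" above, not taken from the course's specimen answers: the names sum_ld and dot_ld are made up for illustration. The accumulator type of accumulate() and inner_product() is the type of the initial value, so passing 0.0L makes them sum in long double. On platforms where long double has the same format as double this gains nothing, which is one of the reasons it "may or may not help".

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Sum in long double and round to double once, at the end.
// The initial value 0.0L is what selects the accumulator type.
double sum_ld(const std::vector<double>& v) {
    return static_cast<double>(std::accumulate(v.begin(), v.end(), 0.0L));
}

// Inner product accumulated in long double.  The second lambda widens one
// operand first, so each product is also formed in long double.
double dot_ld(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    const long double r = std::inner_product(
        x.begin(), x.end(), y.begin(), 0.0L,
        [](long double acc, long double term) { return acc + term; },
        [](double a, double b) { return static_cast<long double>(a) * b; });
    return static_cast<double>(r);
}
```

A common slip is to pass 0 or 0.0 as the initial value, which silently accumulates in int or double instead.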


Improving Accuracy

Do not, repeat NOT, simply code Kahan summation

A nightmare in C++, even for the VERY few experts
The problem is primarily the C and C++ standards

They don't specify what most people think they do
All compilers, versions and options will vary

Look in the specimen answers for this chapter on the local course Web site
fancy_accumulate.cpp and fancy_inner.cpp
And read the comments – they are not exaggerated!
Those work as they stand under gcc but not Intel

I can get them working under Intel, painfully


Doing Better

Rule number one is to look for a better algorithm

And at the highest level possible, too
It is tricky, but the potential gains are huge

You can extend the arithmetic's precision
Do that only when a few 'operations' are the problem
Addition/subtraction is the only easy case
I can and have done multiplication, math. functions etc.

It's anywhere from painful to fiendish or worse

The C/C++ standards are the real problem
It can often be easier in assembler :-(


BLAS and LAPACK

Always a good idea to use their interface

Have option of writing your own or calling them
Optimised libraries can be a LOT faster

Atlas, MKL, ACML etc. but not standard Linux ones
Mainly the level 3 BLAS, but can include level 1

E.g. xGEMM – matrix multiply
inner_product() is level 1 (DDOT, ZDOTU, ZDOTC)

The BLAS can increase accuracy, but generally don't

LAPACK generally uses the level 3 BLAS
Optimised ones include NAG, MKL, ACML
They are also numerically robust algorithms


Calling Them

Calling the BLAS and LAPACK interface:

The interface is usually Fortran 77
A vendor may provide a C one, or even a C++ one
The code may be in anything – it's not your problem

This is not a big deal, but needs care
Fortran 77 is to modern Fortran as C is to C++
And you can usually get between Fortran 77 and C

BLAS/LAPACK are unmodified Fortran 77
This can't be called entirely portably
The next slide gives the USUAL rules


Calling Fortran 77

Call via extern "C"
BLAS name DDOT becomes ddot_
A Fortran SUBROUTINE is a C void function
ALL arguments are passed as pointers
double and int carry across, including function results
complex and C character arrays are OK, with care

Do NOT call functions returning either of those (complex or character) as the result
Write a small Fortran subroutine and return via arguments

LOGICAL and character lengths are a bit of a problem
In Fortran subroutine, translate LOGICAL to int
For character strings, pass the length separately
Fortran character strings are not null-terminated
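The usual rules above, applied to DDOT, can be sketched as follows. This is an illustration under assumptions, not a portable recipe: it assumes the Fortran INTEGER is a C int and that the compiler appends one underscore. The wrapper name blas_dot and the macro USE_REAL_BLAS are hypothetical. So that the sketch links without a BLAS library, it supplies a stand-in ddot_ unless USE_REAL_BLAS is defined; the stand-in is not the reference implementation and handles positive increments only.

```cpp
#include <cassert>
#include <vector>

// Usual (not guaranteed) convention: Fortran DDOT is visible as ddot_,
// every argument is passed by pointer, and the DOUBLE PRECISION result
// comes back as a double.
extern "C" double ddot_(const int* n, const double* x, const int* incx,
                        const double* y, const int* incy);

#ifndef USE_REAL_BLAS  // hypothetical macro: define it and link with -lblas
// Stand-in with the same calling convention, for illustration only.
extern "C" double ddot_(const int* n, const double* x, const int* incx,
                        const double* y, const int* incy) {
    assert(*incx > 0 && *incy > 0);  // the real DDOT also accepts negative ones
    double s = 0.0;
    for (int i = 0; i < *n; ++i) s += x[i * *incx] * y[i * *incy];
    return s;
}
#endif

// C++ wrapper: callers never see the pointers or the trailing underscore.
double blas_dot(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());
    const int one = 1;  // unit stride; must be an object so it has an address
    return ddot_(&n, x.data(), &one, y.data(), &one);
}
```

Even literal constants such as the stride 1 need a named variable, because Fortran 77 receives addresses rather than values.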


Performance

It is possible to get array-handling C++ code to run as fast as Fortran (my specimen answers do, for example)
But it is MUCH harder to achieve
Quite a lot of that has to do with the last dimension varying fastest (row-major order)

The problems are mainly that most good array libraries are Fortran-based
This includes the BLAS and LAPACK
But there do seem to be some fundamental ones as well
E.g. find x such that b=A.x is more natural for column-major
Left solution (i.e. to find x such that b=x.A) fits row-major better
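To make the two storage orders concrete, here is a small sketch of the index arithmetic; the function names row_major and col_major are invented for this illustration. It also shows why the mismatch is a nuisance rather than a disaster: a row-major r x c array, read as column-major, is the c x r transpose.

```cpp
#include <cassert>
#include <cstddef>

// Row-major (C and C++ built-in arrays): the LAST index varies fastest.
inline std::size_t row_major(std::size_t i, std::size_t j,
                             std::size_t rows, std::size_t cols) {
    assert(i < rows && j < cols);
    return i * cols + j;
}

// Column-major (Fortran, BLAS, LAPACK): the FIRST index varies fastest.
inline std::size_t col_major(std::size_t i, std::size_t j,
                             std::size_t rows, std::size_t cols) {
    assert(i < rows && j < cols);
    return i + j * rows;
}
```

For a 3 x 4 array, element (1,2) sits at offset 6 in row-major order and at offset 7 in column-major order.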


Parallelism

Using multiple processes is easy

Distributed memory and message passing
Use MPI via C – see my MPI course for more
You will need to pack and unpack C++ classes

CilkPlus looks interesting – currently Intel only
I can't remember exactly which product, so it may cost
Intel are funding gcc to include it
I hope to investigate it and maybe write a course
It's a shared-memory C++ language extension


Shared Memory

Aargh! This area of POSIX is a nightmare area

Its specification often makes no sense
Its memory model isn't compatible with C99's
Its synchronisation doesn't cover program state

C++ 2011 threading isn't usable by mere mortals
Experts could use it to write higher-level primitives
But I have reason to believe it won't work reliably
I haven't had time to complete a test program


OpenMP

This is the leader for shared-memory parallelism
When the requirement is performance
My OpenMP course describes a defensive strategy

Its specification makes even POSIX's look good
And it doesn't fit well with C++

Realistically, you can parallelise only C-style code
That's a soluble problem, in most cases

You can use C++ in serial code, including <vector>
Theoretically, OpenMP supports a lot more of C++

In practice, I would expect truly foul problems
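A sketch of what "C-style code in the parallel part, C++ in the serial part" might look like; it is an illustration, not the defensive strategy from the OpenMP course, and the name sum_squares is made up. The parallel kernel sees only a raw pointer, an int loop index and a scalar reduction. Compiled without OpenMP support the pragma is skipped and the loop runs serially.

```cpp
#include <cassert>
#include <vector>

// C-style kernel: raw pointer, plain int index, scalar reduction variable.
double sum_squares(const double* a, int n) {
    assert(n >= 0);
    double sum = 0.0;
#ifdef _OPENMP
#pragma omp parallel for reduction(+ : sum)
#endif
    for (int i = 0; i < n; ++i) sum += a[i] * a[i];
    return sum;
}

// Serial C++ wrapper: <vector> stays outside the parallel region.
double sum_squares(const std::vector<double>& v) {
    return sum_squares(v.data(), static_cast<int>(v.size()));
}
```

With OpenMP enabled the order of the additions depends on the number of threads, so the rounding of the result can differ from run to run.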


Other Shared-Memory

There are Boost facilities, too – DON'T rely on them

The shared-memory problem is NOT about the calls
It's not even about synchronisation etc.
It's ALL about the memory consistency model
The question is whether the compiler agrees with Boost

And much the same applies to any other facilities
There are a zillion threading libraries, all dangerous
As all experts agree, this CAN'T be done by a library
Language and compiler support is CRITICAL


Exercises

Instead of exercise 10, look up Marsaglia's DIEHARD or Knuth TAOCP, vol. 2
Code one of the better tests – e.g. the runs test
Use realistic sample sizes – millions or more

Or use the spacings test, which I have done
Generate a U(0,1) sample of size N and sort into order
The spacings are negative exponential, mean 1/(N+1)
Test using Kolmogorov-Smirnov or otherwise
LOTS of simulations rely on adjacency properties
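The spacings test described above can be sketched as below. This is one possible reading, not the specimen answer: the function name spacings_ks is invented, the generator under test is assumed to be std::mt19937_64, and the N+1 spacings include the gaps to 0 and to 1. The function returns only the Kolmogorov-Smirnov distance; turning that into a significance level is left out, and the usual K-S tables assume independent observations, which spacings are not.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// K-S distance between the spacings of a sorted U(0,1) sample of size n
// and the exponential distribution with mean 1/(n+1).
double spacings_ks(std::size_t n, unsigned long long seed) {
    assert(n > 0);
    std::mt19937_64 gen(seed);
    std::uniform_real_distribution<double> u01(0.0, 1.0);

    std::vector<double> x(n);
    for (double& v : x) v = u01(gen);
    std::sort(x.begin(), x.end());

    // n+1 spacings: 0 to the first point, between neighbours, last point to 1.
    std::vector<double> d(n + 1);
    d[0] = x[0];
    for (std::size_t i = 1; i < n; ++i) d[i] = x[i] - x[i - 1];
    d[n] = 1.0 - x[n - 1];
    std::sort(d.begin(), d.end());

    const double rate = static_cast<double>(n + 1);
    const double m = static_cast<double>(d.size());
    double ks = 0.0;
    for (std::size_t i = 0; i < d.size(); ++i) {
        const double f = -std::expm1(-rate * d[i]);  // exponential CDF
        // Compare with the empirical CDF just before and just after d[i].
        ks = std::max(ks, std::max(f - i / m, (i + 1) / m - f));
    }
    return ks;
}
```

For a good generator the distance should shrink roughly like 1/sqrt(N) as the sample grows; a value that stays large is the warning sign.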


Exercises

The first two extra ones are about basic algorithms and accuracy, to give you a feel for that
One uses the BLAS, but it probably won't do much
Look at my code to see why I say what I do

The others are about using matrices
I use Cholesky as a basis, because it is simpler than Gaussian elimination
It is for positive definite real matrices ONLY, and needs no pivoting
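For orientation, a bare-bones hand-coded Cholesky factorisation might look like this. It is a sketch written for these notes, not the course's specimen answer: the name cholesky_lower and the flat column-major std::vector layout are assumptions. There is no pivoting; a non-positive diagonal value is the only failure check.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// In-place factorisation A = L * L^T of a symmetric positive definite
// n x n matrix stored column-major, element (i,j) at a[i + j*n].
// Only the lower triangle is read, and it is overwritten with L.
// Returns false if the matrix is found not to be positive definite.
bool cholesky_lower(std::vector<double>& a, std::size_t n) {
    assert(a.size() == n * n);
    for (std::size_t j = 0; j < n; ++j) {
        // Diagonal: L(j,j) = sqrt(A(j,j) - sum over k<j of L(j,k)^2)
        double d = a[j + j * n];
        for (std::size_t k = 0; k < j; ++k) d -= a[j + k * n] * a[j + k * n];
        if (!(d > 0.0)) return false;
        d = std::sqrt(d);
        a[j + j * n] = d;
        // Below the diagonal:
        // L(i,j) = (A(i,j) - sum over k<j of L(i,k)*L(j,k)) / L(j,j)
        for (std::size_t i = j + 1; i < n; ++i) {
            double s = a[i + j * n];
            for (std::size_t k = 0; k < j; ++k) s -= a[i + k * n] * a[j + k * n];
            a[i + j * n] = s / d;
        }
    }
    return true;
}
```

The inner loops over k stride through memory in steps of n, so this version is deliberately naive about performance.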


Exercises

Exercise 13. Take accumulate.cpp and complete it (see statements marked CHANGE)
It's completed (and more) in fancy_accumulate.cpp

Exercise 14. Do the same for inner.cpp
You will need -lblas to link it
It's completed (and more) in fancy_inner.cpp

These exercises are fairly easy
The point of my fancy coding is to show why I make the remarks I do – There be dragons!


Exercises

I recommend doing exercises 15-18 if you are going to need to do any serious n-D array handling
They are about the simplest realistic problem possible
Tackling a 'real' problem as a first step is insane
FAR WORSE, you are likely to do things in bad ways

They teach how to call the BLAS/LAPACK
And provide a proper interface to them!

They will expose some of the gotchas
Never underestimate the problems these can cause


Exercises

Exercise 15. Take Book_matrix_zero.cpp and complete it according to the instructions.
This will use the book's Matrix.h class to solve Cholesky by calling the BLAS/LAPACK, and by hand
There is a specimen answer in Book_matrix_one.h
Do not worry if the matrix multiply is very slow

Exercise 16. Attempt to optimise matmul()
Aim for same time as cholesky() on 1000x1000
Clue: transpose matrices to do all inner loops along fastest varying dimension – use slices when you can do that
There is a specimen answer in Book_matrix_two.cpp
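The clue for exercise 16 can be illustrated with flat row-major storage; this is a sketch under that assumption and does not use the book's Matrix.h class. The name matmul_transposed is made up. Transposing B first means the inner loop walks both operands along their fastest-varying dimension.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B for n x n matrices stored row-major, element (i,j) at [i*n + j].
std::vector<double> matmul_transposed(const std::vector<double>& a,
                                      const std::vector<double>& b,
                                      std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);

    // One O(n^2) pass to build B^T, so that column j of B becomes row j.
    std::vector<double> bt(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) bt[j * n + i] = b[i * n + j];

    std::vector<double> c(n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double s = 0.0;
            // Both a[i*n + k] and bt[j*n + k] are contiguous in k.
            for (std::size_t k = 0; k < n; ++k) s += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = s;
        }
    }
    return c;
}
```

The extra transpose costs O(n^2) work and memory, which is small next to the O(n^3) multiply for matrices of the size the exercise mentions.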


Exercises

Exercise 17. Change matmul(), cholesky() and solver() to use the BLAS and LAPACK
This will be a LOT faster if you use MKL, ACML etc., and faster (especially the solver) even with GNU versions

Be warned: this needs a clear head
I did it by comparing intermediate results with a working version on 3x3 matrices
The problem is storage order incompatibility


Exercises

Exercise 18. Take My_matrix_zero.cpp, and complete the program
Write a very simple 2-D double matrix class
Implement only what you need
Use first dimension varying fastest (column-major)
Complete the calls to the BLAS and LAPACK
Write a matmul()
There is a specimen answer in My_matrix_one.cpp
Do not try to be clever at this stage

This is a lot easier than you might think
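As a rough idea of the scale involved, a very simple column-major matrix class could be as small as the sketch below. It is not the contents of My_matrix_one.cpp: the class name Matrix2 and its members are invented here, and only element access, the dimensions and a raw data pointer are provided.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal 2-D double matrix, first dimension varying fastest (column-major).
class Matrix2 {
    std::size_t rows_, cols_;
    std::vector<double> data_;
public:
    Matrix2(std::size_t r, std::size_t c) : rows_(r), cols_(c), data_(r * c, 0.0) {}
    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }
    double& operator()(std::size_t i, std::size_t j) {
        assert(i < rows_ && j < cols_);
        return data_[i + j * rows_];
    }
    double operator()(std::size_t i, std::size_t j) const {
        assert(i < rows_ && j < cols_);
        return data_[i + j * rows_];
    }
    // Raw storage for passing to Fortran-style routines; the leading
    // dimension to pass alongside it is rows().
    double* data() { return data_.data(); }
    const double* data() const { return data_.data(); }
};

// Deliberately plain triple loop, with no attempt at optimisation.
Matrix2 matmul(const Matrix2& a, const Matrix2& b) {
    assert(a.cols() == b.rows());
    Matrix2 c(a.rows(), b.cols());
    for (std::size_t j = 0; j < b.cols(); ++j)
        for (std::size_t i = 0; i < a.rows(); ++i) {
            double s = 0.0;
            for (std::size_t k = 0; k < a.cols(); ++k) s += a(i, k) * b(k, j);
            c(i, j) = s;
        }
    return c;
}
```

Because the storage already matches Fortran's order, data() and rows() are all a BLAS or LAPACK call needs from the class.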


Exercises

Exercise 19. Take the program you wrote in exercise 18 and extend it to work better
Use the techniques in this chapter
The higher level code should use inner product calls and A += z*A, where A is a 1-D slice
Do NOT try to provide a proper interface for slices
Provide them solely for matmul(), cholesky() and solver()
Do support both row and column slices

Try to get matrix multiply to run faster
There is a specimen answer in My_matrix_two.h
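One way to support both row and column slices without a full interface is a pointer plus a stride, sketched below. Everything here is hypothetical (the Slice struct, row(), column(), dot() and axpy()), and it reads the slice update on the slide as the BLAS AXPY shape y += z*x, which is an assumption about what was intended.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A 1-D view into column-major storage: n elements, 'stride' doubles apart.
struct Slice {
    double* p;
    std::size_t n;
    std::size_t stride;
    double& operator[](std::size_t i) const {
        assert(i < n);
        return p[i * stride];
    }
};

// Column j of a rows x cols matrix: contiguous, stride 1.
Slice column(std::vector<double>& a, std::size_t rows, std::size_t cols,
             std::size_t j) {
    assert(a.size() == rows * cols && j < cols);
    return Slice{a.data() + j * rows, rows, 1};
}

// Row i of the same matrix: elements are 'rows' doubles apart.
Slice row(std::vector<double>& a, std::size_t rows, std::size_t cols,
          std::size_t i) {
    assert(a.size() == rows * cols && i < rows);
    return Slice{a.data() + i, cols, rows};
}

// Inner product of two slices of equal length.
double dot(const Slice& x, const Slice& y) {
    assert(x.n == y.n);
    double s = 0.0;
    for (std::size_t i = 0; i < x.n; ++i) s += x[i] * y[i];
    return s;
}

// y += z * x, element by element.
void axpy(double z, const Slice& x, const Slice& y) {
    assert(x.n == y.n);
    for (std::size_t i = 0; i < x.n; ++i) y[i] += z * x[i];
}
```

The pointer, length and stride triple is also exactly what the level 1 BLAS routines take, so these two functions could later be replaced by calls to them.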


Exercises

My_matrix_three.h uses a high-precision inner product (from my fancy answer to exercise 14)
It doesn't make very much difference, to time or accuracy
The solver is twice as slow and still much less than machine accuracy
Why? The time is in memory access, and the accuracy limit is in the mathematics – LAPACK is robust
But, occasionally, this technique can be necessary

Exercise 20. For extreme masochists only.
Try repeating these exercises with the <valarray> and <gslice> or Boost::multi_array


Next lecture

There is no next lecture!
We are at the end

