Interval arithmetic on graphics processing units
Sylvain Collange*, Jorge Flórez** and David Defour*
RNC'8, July 7–9, 2008
* ELIAUS, Université de Perpignan Via Domitia
** GILab, Universitat de Girona
How to draw implicit surfaces
Applications in computer graphics
Several approaches:
- Polygonal approximation: computationally efficient (GPU implementations exist), but inaccurate
- Ray tracing, searching for function roots using heuristics: more accurate, but not guaranteed and may still miss solutions (91 seconds)
- Ray tracing, searching for function roots using an interval Newton method: guaranteed (137 seconds); one Newton step is sketched below
Interval operations required: +, −, ×, ÷, √, x^i, e^x
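To make the guaranteed root search concrete, here is a minimal CPU-side sketch of one interval Newton step, under stated assumptions (x ≥ 0 and a derivative enclosure bounded away from zero); the interval type and helper names are ours, not the authors' library:

#include <cmath>
#include <algorithm>

struct I { double lo, hi; };

// Outward-rounded subtraction: widen each bound by one float step.
I isub(I a, I b) {
    return { std::nextafter(a.lo - b.hi, -INFINITY),
             std::nextafter(a.hi - b.lo,  INFINITY) };
}

// Outward-rounded division, assuming 0 < b.lo.
I idiv(I a, I b) {
    double q[] = { a.lo / b.lo, a.lo / b.hi, a.hi / b.lo, a.hi / b.hi };
    return { std::nextafter(*std::min_element(q, q + 4), -INFINITY),
             std::nextafter(*std::max_element(q, q + 4),  INFINITY) };
}

// Example function f(x) = x^2 - 2 and its derivative, as interval
// enclosures valid for x >= 0 (endpoints widened outward).
I f(I x)  { return { std::nextafter(x.lo * x.lo - 2.0, -INFINITY),
                     std::nextafter(x.hi * x.hi - 2.0,  INFINITY) }; }
I fp(I x) { return { 2.0 * x.lo, 2.0 * x.hi }; }

// One interval Newton step: N(X) = m - f([m,m]) / f'(X), intersected
// with X. Any root of f inside X is also inside the result; an empty
// result (lo > hi) proves there is no root in X.
I newton_step(I x) {
    double m = 0.5 * (x.lo + x.hi);
    I n = isub({ m, m }, idiv(f({ m, m }), fp(x)));
    return { std::max(x.lo, n.lo), std::min(x.hi, n.hi) };
}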
Outline
Interval arithmetic
Graphics Processing Units (GPUs)
Implementing directed rounding
Further optimizations
Concluding remarks
Interval arithmetic
Redefinition of the basic operators +, −, ×, /, ... on intervals, e.g. [1, 5] + [2, 3] = [3, 8]
Inclusion property: for X, Y, Z intervals and ° any operator,
Z = X ° Y  ⟹  ∀x ∈ X, ∀y ∈ Y, z = x ° y ⟹ z ∈ Z
Need to take rounding errors into account
May return a wider interval
Loss of variable dependency: X × (1 − X) ≠ 1/4 − (X − 1/2)², although the two expressions agree on real numbers; for X = [0, 1] the left side evaluates to [0, 1] while the right side gives the tight [0, 1/4] (demonstrated below)
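A tiny self-contained C++ demonstration of the dependency problem; the interval type and helpers are ours, with exact endpoint arithmetic so the effect is not caused by rounding:

#include <cstdio>
#include <algorithm>

struct I { double lo, hi; };

// k - X, computed on endpoints.
I sub(double k, I x) { return { k - x.hi, k - x.lo }; }

// X * Y: min and max over the four endpoint products.
I mul(I x, I y) {
    double p[] = { x.lo * y.lo, x.lo * y.hi, x.hi * y.lo, x.hi * y.hi };
    return { *std::min_element(p, p + 4), *std::max_element(p, p + 4) };
}

// X^2, tight: accounts for X straddling zero (unlike X * X).
I sqr(I x) {
    double a = x.lo * x.lo, b = x.hi * x.hi;
    bool straddles = x.lo <= 0 && x.hi >= 0;
    return { straddles ? 0.0 : std::min(a, b), std::max(a, b) };
}

int main() {
    I x = { 0.0, 1.0 };
    I left  = mul(x, sub(1.0, x));          // X * (1 - X)      -> [0, 1]
    I right = sub(0.25, sqr(sub(0.5, x)));  // 1/4 - (X - 1/2)^2 -> [0, 0.25]
    std::printf("[%g, %g] vs [%g, %g]\n", left.lo, left.hi, right.lo, right.hi);
    return 0;
}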
What is a GPU?
NVIDIA GeForce 8800
Diagram: eight clusters, each containing multiprocessors with computational units, registers, a memory access unit, shared memory and constant memory; memory controllers connect the clusters to off-chip graphics memory.
Single precision FP arithmetic on GPUs
NVIDIA GeForce 8000/9000/200 families, AMD Radeon HD 2000:
- Multiplication and addition correctly rounded to nearest; with CUDA, rounding toward zero is also available
- No directed rounding (toward ±infinity)
- Division and square root accurate to 2 units in the last place (ulp)
- Multiply-add with the multiply rounded toward zero on NVIDIA
AMD Radeon HD 3000, HD 4000:
- Division correctly rounded to nearest
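For illustration, the round-toward-zero mode is exposed in CUDA through per-operation intrinsics (__fadd_rz and __fmul_rz are real CUDA intrinsics; the kernel around them is a sketch of ours):

// Each thread computes one add and one multiply, both rounded toward zero.
__global__ void rz_demo(const float *a, const float *b,
                        float *sum, float *prod, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        sum[i]  = __fadd_rz(a[i], b[i]);  // single-precision add, round toward zero
        prod[i] = __fmul_rz(a[i], b[i]);  // single-precision multiply, round toward zero
    }
}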
SIMD execution
The same instruction is run on multiple threads.
When a branch instruction occurs:
- If all threads follow the same path, do a real branch
- Otherwise, execute both paths and use predication
Diverging branches are expensive (see the example below).
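A hypothetical kernel showing a diverging branch: within a warp, even- and odd-numbered threads take different paths, so the hardware must execute both sides under predication:

__global__ void divergent(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i & 1)
        x[i] = x[i] * 2.0f;  // odd-numbered threads take this path
    else
        x[i] = x[i] + 1.0f;  // even-numbered threads take this one
    // Every warp runs both bodies: the branch costs roughly their sum.
}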
How to program GPUs
NVIDIA CUDA:
- For GeForce 8000/9000/200 GPUs only
- Subset of the C++ language
- Templates not officially supported as of CUDA 2.0, but their use is encouraged [Harris07]; in practice, most C++ features are supported
AMD CAL/Brook+
Graphics API + shader language:
- Limited languages and features
- Remains the only portable way to do low-level arithmetic
We developed two libraries
CUDA version:
- In C++
- Adapted from the Boost Interval library
- Provides various levels of consistency checking (NaNs, Infs, ...) and precision
Cg version:
- Needs GPU-specific routines
We focus on the CUDA version in this talk.
Implementing directed rounding
We want round-down and round-up.
CUDA provides round-toward-zero, which is equivalent to either round-down or round-up depending on the sign:
Diagram: rounding toward zero coincides with round-down for positive numbers and with round-up for negative numbers.
Implementing directed rounding
For the other direction, two options (sketched in code below):
- Add one ulp to the magnitude of the round-toward-zero result: an integer addition on the FP representation does this, except in case of underflow/overflow
- Or multiply by (1 + EPS), EPS = 2^-23 for single precision: no special cases for overflows/NaNs
Diagram: a one-ulp corrective term applied to the round-toward-zero result produces the rounding in the opposite direction.
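A minimal sketch of both tricks in CUDA C for the positive case, assuming normalized, non-overflowing inputs; __fmul_rz, __float_as_int and __int_as_float are real CUDA intrinsics, while the function names are ours:

// Round-up multiplication of positive floats, variant 1: take the
// round-toward-zero product (= round-down here) and add one ulp by
// incrementing the integer representation. Breaks on underflow/overflow.
__device__ float mul_up_pos(float a, float b) {
    float rz = __fmul_rz(a, b);
    return __int_as_float(__float_as_int(rz) + 1);
}

// Variant 2: multiply the result by (1 + EPS), EPS = 2^-23.
// Slightly pessimistic, but needs no special cases for overflows/NaNs.
__device__ float mul_up_pos_eps(float a, float b) {
    const float EPS = 1.1920929e-7f;  // 2^-23
    return __fmul_rz(a, b) * (1.0f + EPS);
}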
Porting the Boost Interval library
CUDA on an NVIDIA GeForce 8500 GT; measured throughput in cycles/warp (1 warp = 32 threads):

Operation           Cycles
Scalar add          4
Interval add        24
Interval multiply   33 – 55 (*)
Interval square     20 – 25 (*)
Interval x^5        141 – 148 (*)

(*) Without divergent branching
Multiplication: existing algorithm

if (interval_lib::user::is_neg(xl))
    if (interval_lib::user::is_pos(xu))
        if (interval_lib::user::is_neg(yl))
            if (interval_lib::user::is_pos(yu))    // M * M
                return I(::min(rnd.mul_down_neg(xl, yu), rnd.mul_down_neg(xu, yl)),
                         ::max(rnd.mul_up_pos(xl, yl), rnd.mul_up_pos(xu, yu)), true);
            else                                   // M * N
                return I(rnd.mul_down_neg(xu, yl), rnd.mul_up_pos(xl, yl), true);
        else if (interval_lib::user::is_pos(yu))   // M * P
            return I(rnd.mul_down_neg(xl, yu), rnd.mul_up_pos(xu, yu), true);
        else                                       // M * Z
            return I(static_cast<T>(0), static_cast<T>(0), true);
    else if (interval_lib::user::is_neg(yl))
        if (interval_lib::user::is_pos(yu))        // N * M
            return I(rnd.mul_down_pos(xl, yu), rnd.mul_up_neg(xl, yl), true);
        else                                       // N * N
            return I(rnd.mul_down_pos(xu, yu), rnd.mul_up_pos(xl, yl), true);
    else if (interval_lib::user::is_pos(yu))       // N * P
        return I(rnd.mul_down_neg(xl, yu), rnd.mul_up_neg(xu, yl), true);
    else                                           // N * Z
        return I(static_cast<T>(0), static_cast<T>(0), true);
else if (interval_lib::user::is_pos(xu))
    if (interval_lib::user::is_neg(yl))
        if (interval_lib::user::is_pos(yu))        // P * M
            return I(rnd.mul_down_neg(xu, yl), rnd.mul_up_pos(xu, yu), true);
        else                                       // P * N
            return I(rnd.mul_down_neg(xu, yl), rnd.mul_up_neg(xl, yu), true);
    else if (interval_lib::user::is_pos(yu))       // P * P
        return I(rnd.mul_down_pos(xl, yl), rnd.mul_up_pos(xu, yu), true);
    else                                           // P * Z
        return I(static_cast<T>(0), static_cast<T>(0), true);
else                                               // Z * ?
    return I(static_cast<T>(0), static_cast<T>(0), true);
Rewriting multiplication
We do not want branches.
Starting from the general formula:

[a, b] × [c, d] = [min(ac, ad, bc, bd), max(ac, ad, bc, bd)]

we need to round all subproducts both up and down.
Cost: 4 × (2 mul + 1 min + 1 max) + 3 min + 3 max = 22 flops.
Or do we?

What is the sign of [a, b] × [c, d]?
- ac and bd are always positive when used
- ad and bc are always negative when used
So we know the rounding direction in advance; there is no need to select it dynamically.
Now 12 flops (a branch-free sketch follows).
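A branch-free sketch of the idea in CUDA C, ours rather than the paper's exact code: compute the four cross-products rounded toward zero, reduce with min/max, then apply the (1 + EPS)-style corrective term in the right direction (sound, but up to one ulp pessimistic; zero endpoints would need extra care):

// 4 rz-multiplications + 3 min + 3 max + 2 corrections = 12 flops.
__device__ float nudge_down(float x) {   // toward -infinity by ~one ulp
    const float EPS = 1.1920929e-7f;     // 2^-23
    return x * (1.0f - copysignf(EPS, x));
}
__device__ float nudge_up(float x) {     // toward +infinity by ~one ulp
    const float EPS = 1.1920929e-7f;
    return x * (1.0f + copysignf(EPS, x));
}

__device__ void interval_mul(float a, float b, float c, float d,
                             float *lo, float *hi) {
    float ac = __fmul_rz(a, c), ad = __fmul_rz(a, d);
    float bc = __fmul_rz(b, c), bd = __fmul_rz(b, d);
    float m = fminf(fminf(ac, ad), fminf(bc, bd));   // lower-bound candidate
    float M = fmaxf(fmaxf(ac, ad), fmaxf(bc, bd));   // upper-bound candidate
    *lo = nudge_down(m);   // compensate for the rounding toward zero
    *hi = nudge_up(M);
}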
Power to an integer: existing algorithm

T pow_dn(const T& x_, int pwr, Rounding& rnd)
// x and pwr are positive
{
    T x = x_;
    T y = (pwr & 1) ? x_ : 1;
    pwr >>= 1;
    while (pwr > 0) {
        x = rnd.mul_down(x, x);
        if (pwr & 1)
            y = rnd.mul_down(x, y);
        pwr >>= 1;
    }
    return y;
}

T pow_up(const T& x_, int pwr, Rounding& rnd)
// x and pwr are positive
{
    [idem using mul_up]
}

interval<T, Policies> pow(const interval<T, Policies>& x, int pwr)
{
    [...]
    if (interval_lib::user::is_neg(x.upper())) {          // [-2,-1]
        T yl = pow_dn(-x.upper(), pwr, rnd);
        T yu = pow_up(-x.lower(), pwr, rnd);
        if (pwr & 1)                                      // [-2,-1]^1
            return I(-yu, -yl, true);
        else                                              // [-2,-1]^2
            return I(yl, yu, true);
    } else if (interval_lib::user::is_neg(x.lower())) {   // [-1,1]
        if (pwr & 1) {                                    // [-1,1]^1
            return I(-pow_up(-x.lower(), pwr, rnd),
                     pow_up(x.upper(), pwr, rnd), true);
        } else {                                          // [-1,1]^2
            return I(0, pow_up(::max(-x.lower(), x.upper()), pwr, rnd), true);
        }
    } else {                                              // [1,2]
        return I(pow_dn(x.lower(), pwr, rnd),
                 pow_up(x.upper(), pwr, rnd), true);
    }
}
Improvements
The exponent is constant, so the loop can be unrolled:
- CUDA 1.0 cannot do loop unrolling
- CUDA 1.1 can, but without constant propagation and dead code removal
- CUDA 2.0 beta ignores the directive
- We use template metaprogramming instead (sketched below)
We can also compute pow_up from pow_dn: add a bound on the error by multiplying by (1 + n EPS).
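A minimal sketch of the template-metaprogramming unrolling, assuming a mul_down routine like the one sketched earlier; the names are ours, not the library's exact code:

// Directed-rounding multiply, assumed available (see earlier sketch).
__device__ float mul_down(float a, float b);

// x^N fully unrolled at compile time by recursive template instantiation:
// PowDn<5>::eval expands to straight-line square-and-multiply code with
// no loop and no branch.
template<int N>
struct PowDn {
    __device__ static float eval(float x) {
        float h = PowDn<N / 2>::eval(x);         // x^(N/2)
        float sq = mul_down(h, h);               // x^(2*(N/2))
        return (N & 1) ? mul_down(sq, x) : sq;   // extra factor when N is odd
    }
};
template<> struct PowDn<1> { __device__ static float eval(float x) { return x; } };
template<> struct PowDn<0> { __device__ static float eval(float)  { return 1.0f; } };

// pow_up can then be derived from the round-down result by multiplying
// by (1 + n EPS), where n bounds the number of rounded multiplications.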
Performance results
CUDA on an NVIDIA GeForce 8500 GT; measured throughput in cycles/warp:

Operation        Original          Optimized
Addition         24                24
Multiplication   33 – 55 (*)       36
x^5              141 – 148 (*)     23

(*) Without divergent branching

Multiplication now runs in constant time, and x^5 is as fast as an addition.
Application to ray tracing
Xeon 3 GHz vs GeForce 8800 GTX.
Table: CPU vs GPU rendering times in seconds for the Sphere, Kusner-Schmitt, Tangle and Gumdrop Torus surfaces.
Future directions
The NVIDIA GeForce GTX 280 now supports double precision:
- In all 4 IEEE rounding modes
- With no switching overhead
- More efficient for interval arithmetic than current CPUs!
These rounding modes are still not available in single precision.
We could use double precision for the last convergence steps of the interval Newton method.
Earlier GPUs
Non-compliant rounding for basic operations:
- NVIDIA GeForce 7, multiplication: faithful rounding
- AMD ATI Radeon X1x00, multiplication: not toward zero, not faithful
Memory access issues
Memory access units are shared among a SIMD block.
Fast memory accesses must obey specific patterns (see the sketch below):
- Broadcast: all threads access the same address
- Coalescing: all threads access consecutive addresses
In CUDA:
- On-chip shared memory, accessible in both modes
- On-chip constant memory, in broadcast mode
- Off-chip global memory, with coalescing capabilities; using the coalesced pattern is recommended
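For illustration, a sketch of the coalesced pattern in CUDA C (the kernels and names are ours, not from the talk): consecutive threads touch consecutive addresses, which the memory access unit can merge into one wide transaction.

// Coalesced: thread i reads and writes element i, so a warp's 32
// accesses fall in consecutive addresses and merge into one transaction.
__global__ void scale_coalesced(float *out, const float *in, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * in[i];
}

// Strided (not coalesced): thread i touches element i*stride, so the
// same accesses scatter across memory and are serviced separately.
__global__ void scale_strided(float *out, const float *in, float s,
                              int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i * stride] = s * in[i * stride];
}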