Efficient Run-Time Dispatching in Generic Programming with Minimal Code Bloat

Lubomir Bourdev
Advanced Technology Labs
Adobe Systems

Jaakko Järvi
Computer Science Department
Texas A&M University

Agenda

• Context & problem statement

• Background – previous approaches

• Our approach to code bloat reduction

• Code bloat reduction in run-time dispatch

• Results & conclusion

Context: Image Manipulation

• Images vary in many different ways:
  – planar vs. interleaved
  – channel depth (8-bit, 16-bit…)
  – channel order (RGB vs. BGR)
  – color space (RGB, CMYK…)
  – optional padding at the end of rows

• Writing generic and efficient image processing algorithms is challenging

Image Representations

• 4x3 image in which the second pixel is highlighted

• In interleaved form: [figure]

• In planar form: [figure]

Generic Image Library (GIL)

• Adobe’s Open Source Image Library: http://opensource.adobe.com/gil

• Abstracts image representations from algorithms on images

• Allows for writing the algorithm once & having it work on images of any representation, without loss of performance

Problem Statement

• How do we write image processing algorithms that are:
  – Generic
  – Efficient
  – Compact
  – Run-Time Flexible

Agenda

• Context & problem statement

• Background – previous approaches

• Our approach to code bloat reduction

• Code bloat reduction in run-time dispatch

• Results & conclusion

Image algorithms via inheritance & polymorphism

struct pixel { virtual void invert() = 0; };

struct rgb_pixel : public pixel {
    virtual void invert();
};

struct gray_pixel : public pixel {
    virtual void invert();
};

struct image {
    size_t size() const;           // number of pixels
    pixel* operator[](size_t i);   // pixel accessed through the base class
};

void invert(image& img) {
    for (size_t i = 0; i < img.size(); ++i)
        img[i]->invert();          // one virtual call per pixel
}

Generic X

Efficient X

Compact √

Run-Time Flexible √

Performance problem: dynamic dispatch once per pixel

Image Algorithms via Generic Programming

struct rgb_pixel {…};
struct gray_pixel {…};
void invert_pixel(rgb_pixel&) {…}
void invert_pixel(gray_pixel&) {…}

template <typename Pixel>
struct image {
    size_t size() const;
    Pixel& operator[](size_t i);
};

template <typename Image>
void invert(Image& img) {
    for (size_t i = 0; i < img.size(); ++i)
        invert_pixel(img[i]);   // overload chosen at compile time
}

Generic √

Efficient √

Compact √

Run-Time Flexible X
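The generic-programming version can be fleshed out into something compilable. The following is only a sketch: the pixel layouts, the `pixels` member and the use of `std::vector` are assumptions made for illustration, since the slide elides those details.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 8-bit pixel types; the slide elides their definitions.
struct rgb_pixel  { unsigned char r, g, b; };
struct gray_pixel { unsigned char v; };

// One overload per pixel type, selected at compile time.
inline void invert_pixel(rgb_pixel& p) {
    p.r = static_cast<unsigned char>(255 - p.r);
    p.g = static_cast<unsigned char>(255 - p.g);
    p.b = static_cast<unsigned char>(255 - p.b);
}
inline void invert_pixel(gray_pixel& p) {
    p.v = static_cast<unsigned char>(255 - p.v);
}

// Minimal image: a flat sequence of pixels of one static type.
template <typename Pixel>
struct image {
    std::vector<Pixel> pixels;
    std::size_t size() const { return pixels.size(); }
    Pixel& operator[](std::size_t i) {
        assert(i < pixels.size());
        return pixels[i];
    }
};

// Instantiated once per image type; no virtual call inside the loop.
template <typename Image>
void invert(Image& img) {
    for (std::size_t i = 0; i < img.size(); ++i)
        invert_pixel(img[i]);
}
```

Each image type used with `invert` gets its own instantiation, which is where the efficiency comes from. The image type must be known at compile time, which is the missing run-time flexibility.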

Generic Code Lacks Flexibility

• We need run-time flexibility:

typedef boost::mpl::vector<rgb8_image, gray8_image> images;
gil::any_image<images> runtime_image;

gil::jpeg_read_image(runtime_image, "test.jpg");
invert(runtime_image);

• How can we do that without loss of performance?
  – Variant construct (see boost::variant)
  – runtime_image holds:
      • index: index to the type of image
      • bits: buffer containing the currently instantiated image
  – To invoke an algorithm, go through a switch statement & cast
  – Efficient: invoke dynamic dispatch only once per algorithm
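The index-plus-bits idea can be sketched without any library. The code below is a hand-rolled illustration, not GIL's `any_image` or `boost::variant`: the `runtime_image` struct, the `image_kind` enum and the `channels` member are hypothetical, and unlike a real variant this one does not own its storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical image types standing in for rgb8_image and gray8_image.
struct rgb8_image  { std::vector<unsigned char> channels; };  // 3 bytes per pixel
struct gray8_image { std::vector<unsigned char> channels; };  // 1 byte per pixel

enum image_kind { kRGB8 = 0, kGRAY8 = 1 };

// Bare-bones variant: a type index plus an untyped pointer to the image.
struct runtime_image {
    int   index;
    void* bits;
};

// Statically typed algorithm: the inner loop has no dynamic dispatch.
template <typename Image>
void invert(Image& img) {
    for (std::size_t i = 0; i < img.channels.size(); ++i)
        img.channels[i] = static_cast<unsigned char>(255 - img.channels[i]);
}

// One switch per algorithm call picks the instantiation to run.
inline void invert(runtime_image& img) {
    switch (img.index) {
    case kRGB8:  invert(*static_cast<rgb8_image*>(img.bits));  break;
    case kGRAY8: invert(*static_cast<gray8_image*>(img.bits)); break;
    default:     assert(false && "unknown image type index");
    }
}
```

The cost of the run-time decision is paid once per call to `invert`, not once per pixel. The switch forces every listed instantiation to be generated, which is where the code bloat discussed next comes from.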

Variant invocation

void invert_image(void* bits, int index) {
    switch (index) {
    case kLAB: invert(*(image<lab_pixel>*)(bits)); break;
    case kRGB: invert(*(image<rgb_pixel>*)(bits)); break;
    }
}

Generic version:

template <typename Op>
void apply_operation(void* bits, int index, Op op) {
    switch (index) {
    case kLAB: op(*(image<lab_pixel>*)(bits)); break;
    case kRGB: op(*(image<rgb_pixel>*)(bits)); break;
    }
}

Generic √

Efficient √

Compact x

Run-Time Flexible √

Solution: Template Hoisting

• Define a class hierarchy:

template <int k> class k_channel_image {…};
class rgb_image : public k_channel_image<3> {};
class lab_image : public k_channel_image<3> {};

• Define the algorithm at the appropriate level of the hierarchy:

template <int k> void invert(k_channel_image<k>&) {…}

- enforces a specific hierarchy
- different algorithms may need different hierarchies
- switch statement overhead remains
- does not help when the function is inlined
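A compilable sketch of template hoisting follows. The storage layout, the `channel` accessor and the constructors are assumptions added for illustration; the slide elides the body of `k_channel_image`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Base class parameterized only by channel count; the algorithm targets it.
template <int K>
class k_channel_image {
public:
    explicit k_channel_image(std::size_t num_pixels) : data_(num_pixels * K) {}
    std::size_t size() const { return data_.size() / K; }
    unsigned char& channel(std::size_t pixel, int c) {
        assert(pixel < size() && c >= 0 && c < K);
        return data_[pixel * K + static_cast<std::size_t>(c)];
    }
private:
    std::vector<unsigned char> data_;  // zero-initialized channel values
};

// Both color spaces have three channels, so they share one base class.
class rgb_image : public k_channel_image<3> {
public:
    explicit rgb_image(std::size_t n) : k_channel_image<3>(n) {}
};
class lab_image : public k_channel_image<3> {
public:
    explicit lab_image(std::size_t n) : k_channel_image<3>(n) {}
};

// Hoisted algorithm: rgb_image and lab_image both deduce K = 3,
// so they share the single instantiation invert<3>.
template <int K>
void invert(k_channel_image<K>& img) {
    for (std::size_t p = 0; p < img.size(); ++p)
        for (int c = 0; c < K; ++c)
            img.channel(p, c) = static_cast<unsigned char>(255 - img.channel(p, c));
}
```

The sharing depends entirely on the inheritance relationship. An algorithm that needs a different grouping of types (by channel depth, say) would need a different hierarchy.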

Generic x

Efficient √

Compact

Run-Time Flexible √

Agenda

• Context & problem statement

• Background – previous approaches

• Our approach to code bloat reduction

• Code bloat reduction in run-time dispatch

• Results & conclusion

Type Reduction

• Every algorithm partitions the space of its argument types into a set of equivalence classes

• Members of an equivalence class result in the same assembly code when instantiated

• The algorithm is instantiated only with one representative from each equivalence class

Type Reduction Implementation

• Metafunction to define the partition:

template <typename Op, typename T>
struct reduce { typedef T type; };

• Generic algorithm invocation:

template <typename Op, typename T>
inline void apply_operation(const T& argument, Op op) {
    typedef typename reduce<Op,T>::type base_t;
    op(reinterpret_cast<const base_t&>(argument));
}

Example: The invert algorithm

• Define the algorithm as a function object:

struct invert_op {
    template <typename Image> void operator()(Image&) {…}
};

• Provide a function overload to invoke it:

template <typename Image>
inline void invert(Image& image) {
    apply_operation(image, invert_op());
}

• Inverting RGB and LAB images is assembly-level identical:

template <> struct reduce<invert_op, lab8_image_t> {
    typedef rgb8_image_t type;
};
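Putting the pieces of type reduction together gives the sketch below. The fixed-size image structs and the `static_assert` size guard are assumptions for illustration, and `apply_operation` takes a non-const reference here so the operation can modify the image. The `reinterpret_cast` between unrelated types is not sanctioned by the C++ standard; it is the step that makes the technique unsafe.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical two-pixel, three-channel 8-bit images with identical layout.
struct rgb8_image_t { unsigned char channels[6]; };
struct lab8_image_t { unsigned char channels[6]; };

// Default partition: every type is its own representative.
template <typename Op, typename T>
struct reduce { typedef T type; };

// Cast the argument to its representative and run the operation on that.
// The cast relies on T and its representative sharing a layout, which the
// compiler cannot verify beyond the size check below.
template <typename Op, typename T>
inline void apply_operation(T& argument, Op op) {
    typedef typename reduce<Op, T>::type base_t;
    static_assert(sizeof(base_t) == sizeof(T), "representative must match in size");
    op(reinterpret_cast<base_t&>(argument));
}

struct invert_op {
    template <typename Image>
    void operator()(Image& img) const {
        const std::size_t n = sizeof(img.channels);
        assert(n % 3 == 0);  // whole three-channel pixels only
        for (std::size_t i = 0; i < n; ++i)
            img.channels[i] = static_cast<unsigned char>(255 - img.channels[i]);
    }
};

// Inverting 8-bit LAB and RGB images is the same byte-wise loop,
// so LAB is mapped onto the RGB representative.
template <>
struct reduce<invert_op, lab8_image_t> { typedef rgb8_image_t type; };

template <typename Image>
inline void invert(Image& image) { apply_operation(image, invert_op()); }

static_assert(std::is_same<reduce<invert_op, lab8_image_t>::type, rgb8_image_t>::value,
              "invert on LAB reuses the RGB instantiation");
```

Calling `invert` on a `lab8_image_t` and on an `rgb8_image_t` both run `invert_op::operator()<rgb8_image_t>`, so only one copy of the loop body is instantiated.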

The technique generalizes to multiple dimensions

template <typename Op, typename T1, typename T2>
void apply_operation(T1& arg1, T2& arg2, Op op) {
    // unary pre-reduction of each argument
    typedef typename reduce<Op,T1>::type base1_t;
    typedef typename reduce<Op,T2>::type base2_t;
    // binary reduction of the pair of pre-reduced types
    typedef std::pair<base1_t*, base2_t*> pair_t;
    typedef typename reduce<Op,pair_t>::type base_pair_t;
    std::pair<void*,void*> p(&arg1,&arg2);
    op(reinterpret_cast<base_pair_t&>(p));
}

template <> struct reduce<copy_pixels_op, lab8_image_t> {…};

template <> struct reduce<copy_pixels_op, std::pair<lab8_image_t,lab8_image_t> > {…};

Defining Reduce Specializations

• Reduce dimensions separately, then combine:

template <typename Image>
struct reduce<invert_pixels_op, Image> {
    typedef typename reduce_cs<typename Image::color_space_t>::type cs;
    typedef typename reduce_ch<typename Image::channel_t>::type channel;
    typedef typename image_type<cs,channel,…>::type type;
};

• Reuse structures via metafunction forwarding:

template <typename T1, typename T2>
struct reduce<resample_pixels_op, std::pair<T1,T2> >
    : public reduce<copy_pixels_op, std::pair<T1,T2> > {};
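Reducing each dimension separately can be illustrated with a self-contained sketch. The tag types, the two-parameter `image` template and the particular `reduce_cs` mappings are hypothetical stand-ins chosen to mirror the slide; they are not GIL's actual definitions.

```cpp
#include <type_traits>

// Hypothetical tags for color spaces and channel depths.
struct rgb_t {};  struct lab_t {};  struct cmyk_t {};  struct rgba_t {};
struct bits8 {};  struct bits16 {};

// Hypothetical image type exposing its two dimensions as nested typedefs.
template <typename ColorSpace, typename Channel>
struct image {
    typedef ColorSpace color_space_t;
    typedef Channel    channel_t;
};

// Per-dimension reductions for an invert-like operation: color spaces with
// the same channel count collapse, channel depths stay distinct.
template <typename CS> struct reduce_cs         { typedef CS     type; };
template <>            struct reduce_cs<lab_t>  { typedef rgb_t  type; };
template <>            struct reduce_cs<cmyk_t> { typedef rgba_t type; };
template <typename Ch> struct reduce_ch         { typedef Ch     type; };

struct invert_pixels_op {};

// Default partition: every type is its own representative.
template <typename Op, typename T>
struct reduce { typedef T type; };

// Reduce each dimension on its own, then rebuild the image type.
template <typename Image>
struct reduce<invert_pixels_op, Image> {
    typedef typename reduce_cs<typename Image::color_space_t>::type cs;
    typedef typename reduce_ch<typename Image::channel_t>::type     channel;
    typedef image<cs, channel> type;
};

static_assert(std::is_same<reduce<invert_pixels_op, image<lab_t, bits8> >::type,
                           image<rgb_t, bits8> >::value,
              "lab8 reduces to rgb8");
```

With this layout, adding a new color space needs one `reduce_cs` specialization rather than one `reduce` specialization per channel depth.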

Agenda

• Context & problem statement

• Background – previous approaches

• Our approach to code bloat reduction

• Code bloat reduction in run-time dispatch

• Results & conclusion

Reduction in variants

Input: a variant of:
  input_types: [rgb8_image, lab8_image, cmyk16_image, rgba16_image]
  input_index: 2

• Step 1: Reduce each member of the vector:
  reduced_t: [rgb8_image, rgb8_image, rgba16_image, rgba16_image]

• Step 2: Remove duplicates:
  output_types_t: [rgb8_image, rgba16_image]

• Step 3: Create index vector from reduced_t to output_types_t:
  indices_t: [0, 0, 1, 1]

• Step 4: Use indices_t to map the input index to an output index:
  output_index = indices_t[input_index] = indices_t[2] = 1

Invoke the algorithm on a variant of:
  output_types_t: [rgb8_image, rgba16_image]
  output_index: 1
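Steps 2 to 4 can be traced with a small run-time model. This is only an illustration under loud assumptions: type names are represented as strings, and `reduction_map`, `build_reduction_map` and `map_index` are hypothetical names. The `_t` suffixes in the slide suggest the real computation works on compile-time type lists.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Result of steps 2 and 3: the deduplicated types and the index map.
struct reduction_map {
    std::vector<std::string> output_types;  // models output_types_t
    std::vector<std::size_t> indices;       // models indices_t
};

// Takes the already reduced list (step 1 output) and removes duplicates,
// recording for each input position where its representative ended up.
inline reduction_map build_reduction_map(const std::vector<std::string>& reduced) {
    reduction_map m;
    for (std::size_t i = 0; i < reduced.size(); ++i) {
        std::size_t pos = 0;
        while (pos < m.output_types.size() && m.output_types[pos] != reduced[i])
            ++pos;
        if (pos == m.output_types.size())
            m.output_types.push_back(reduced[i]);  // first occurrence of this type
        m.indices.push_back(pos);
    }
    return m;
}

// Step 4: translate the variant's run-time index into the reduced variant.
inline std::size_t map_index(const reduction_map& m, std::size_t input_index) {
    assert(input_index < m.indices.size());
    return m.indices[input_index];
}
```

Fed the `reduced_t` list from the slide, this yields the index vector `[0, 0, 1, 1]` and maps input index 2 to output index 1. The switch statement then only needs two cases instead of four.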

Binary reduction in variants

• Step 1: Perform unary pre-reduction on each argument:
  [A1, A2, A3, A4] with index 2 -> [A1, A3] with out_index1 = 1
  [B1, B2, B3] with index 3 -> [B1, B2] with out_index2 = 0

• Step 2: Compute a vector of the cross-products of types:
  [(A1,B1), (A1,B2), (A3,B1), (A3,B2)]

• Step 3: Apply unary reduction on it:
  output_types_t = [(A1,B1), (A1,B2), (A3,B2)]

• Step 4: Compute the index in the output vector:
  out_index = out_index1 * size(Vec1) + out_index2

Invoke the algorithm on a single variant of:
  output_types_t = [(A1,B1), (A1,B2), (A3,B2)]
  out_index
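The index arithmetic in Step 4 can be checked with a short helper. The sketch assumes the cross product is enumerated with the first argument varying slowest, as in the Step 2 listing, so the stride is the length of the second pre-reduced vector. In the slide's example both pre-reduced vectors have two elements, so this reading and the formula as written give the same value. The function name and parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Flattened position of the pair (first[out_index1], second[out_index2])
// in a cross product enumerated first-argument-major:
// (A1,B1), (A1,B2), (A3,B1), (A3,B2).
inline std::size_t cross_product_index(std::size_t out_index1, std::size_t out_index2,
                                       std::size_t size1, std::size_t size2) {
    assert(out_index1 < size1 && out_index2 < size2);
    return out_index1 * size2 + out_index2;
}
```

For the slide's values (`out_index1 = 1`, `out_index2 = 0`, two types on each side) the result is 2, which is the position of (A3,B1) in the Step 2 vector. That position is then mapped through the unary reduction of Step 3 in the same way as for a single variant.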

Agenda

• Context & problem statement

• Background – previous approaches

• Our approach to code bloat reduction

• Code bloat reduction in run-time dispatch

• Results & conclusion

Tests

• Test sets
  – Set A: 90 types (10 color spaces, 3 channel types, other variations)
  – Set B: 10 types (4 color spaces, other)
  – Set C: 12 types (3 color spaces, planar/interleaved, step/nonstep)

• Tests
  – Test 1: copy_pixels on Set B (inlined binary algorithm)
  – Test 2: copy_pixels on Set C (inlined binary algorithm)
  – Test 3: resample_pixels on Set B (non-inlined binary algorithm)
  – Test 4: resample_pixels on Set C (non-inlined binary algorithm)
  – Test 5: invert_pixels on Set A (inlined unary algorithm)

Results

Reduction in code bloat

         Visual Studio 8                           GCC 4.0
         No Reduce   Reduce   Percent reduction    No Reduce   Reduce   Percent reduction
Test 1   42.0        34.5     18%                  201.6       107.5    47%
Test 2   41.5        26.0     37%                  252.8       75.9     70%
Test 3   46.0        42.5     8%                   259.8       144.0    45%
Test 4   33.5        34.0     -1%                  318.7       98.8     69%
Test 5   24.0        16.5     31%                  62.2        31.2     50%

Effect on compile time

         VS 8.0   GCC 4.0
Test 1   106%     116%
Test 2   78%      97%
Test 3   87%      118%
Test 4   75%      103%
Test 5   194%     307%

Conclusion

• Drawbacks
  – Unsafe
  – Requires intimate knowledge of the types and the algorithm
  – Some compilers can optimize most of the code bloat

• Benefits
  – Works even when functions are inlined
  – Simplifies code generated by variants (especially double dispatch)
  – Does not impose class hierarchy (essential for generic code!)
  – Works when algorithms differ in requirements