k+-buffer: Fragment Synchronized k-buffer · k+-buffer: Fragment Synchronized k-buffer ... by a family of GPU-accelerated methods that capture the surface-intersections (fragments)

k+-buffer: Fragment Synchronized k-buffer

Andreas A. Vasilakis∗

Dept. of Computer Science & EngineeringUniversity of Ioannina

Ioannis Fudos†

Dept. of Computer Science & EngineeringUniversity of Ioannina

A

B AUB A-B A∩B

(a) (b) (c)

Figure 1: A large repertoire of multi-fragment effects can be supported from our framework: (a) Illustrating order-independent transparencyof an engine consisting of 195 random-painted components. (b) Rendering boolean operations between a head (model A) and a clipped sphere(model B) surfaces. (c) Detecting collision (highlighted with red color) between a twirl object moving towards a static clipped sphere.

Abstract

k-buffer facilitates novel approaches to multi-fragment renderingand visualization for developing interactive applications on the GPU.Various alternatives have been proposed to alleviate its memory haz-ards and to avoid completely or partially the necessity of geometrypre-sorting. However, that came with the burden of excessive mem-ory allocation and depth precision artifacts. We introducek+-buffer,a fast and accurate framework that simulates thek-buffer behav-ior by exploiting fragment culling and pixel synchronization. TwoGPU-accelerated data structures have been developed: (i) themax-array and (ii) themax-heap. These memory-bounded data struc-tures accurately maintain thek-foremost fragments per pixel in asingle geometry pass. The choice of the data structure depends onthe sizek (application-dependent). Without any software-redesign,the proposed scheme can be adapted to perform as a Z-buffer or an A-buffer capturing a single or all generated fragments, respectively. Amemory-friendly strategy is also proposed, extending the proposedpipeline to dynamically lessen the potential wasteful memory allo-cation. Finally, an extensive experimental evaluation is provideddemonstrating the advantages ofk+-buffer over all priork-buffervariants in terms of memory usage, performance cost and imagequality.

CR Categories: I.3.6 [Computer Graphics]: Methodology andTechniques—Graphics data structures and data types

Keywords: GPU, k-buffer, a-buffer, max-heap, multi-fragmenteffects, real-time, pixel synchronization

1 Introduction

Determining visibility when rasterizing objects in arbitrary or-der is a challenging task in terms of time and space for a hostof algorithms that simulate complex rendering effects in real-time. Many image-based techniques produce visually realistic

∗e-mail:[email protected]†e-mail:[email protected]

results at interactive speeds in games (order-independent trans-parency [Maule et al. 2011], shadows [Yang et al. 2010], hairrendering [Yu et al. 2012]) and other graphics applications (volumerendering [Callahan et al. 2005], collision detection [Jang and Han2008], constructive solid geometry [Rossignac et al. 2013]) enabledby a family of GPU-accelerated methods that capture the surface-intersections (fragments) when ray casting from the viewer positionthrough each screen pixel. Figure2 shows an illustrative exampleof the fragment generation process through per-pixel ray-surfaceintersection tests. Buffer-based techniques ensure the correctvisibility order on the level of fragments, avoiding drawbacks (e.g.geometry interpenetration, primitive splitting, dynamic scenes) thatoccur in object/primitive sorting techniques [Govindaraju et al.2005; Sintorn and Assarsson 2008].

Extending Z-buffer, where the closest-to-viewer fragment is stored,A-buffer [Carpenter 1984] was the first method to capture allfragments per pixel in a single geometry pass. Fragments arestored intovariable-length linked listsper pixel during geometryrendering, followed by a post-sorting process that correctly reordersfragments by their depth. Several variants and optimizations [Mauleet al. 2011] have been proposed recently to simulate the behaviorof the A-buffer architecture with reduced memory requirementsgenerating more or less approximate image outputs.

k-buffer [Bavoil et al. 2007] employed per-pixelfixed-size vectorsmaintaining ak-nearly sorted sequence of fragments in boundedGPU memory. When the number of generated fragments exceedsk,the rasterization order of the generated fragments affects the final re-sult due to memory read-modify-write (RMW) hazards. Heuristics(primitive pre-sorting and triangle batches) may alleviate most of theimage quality flaws with the burden of higher computational cost.A number of GPU-acceleratedk-buffer variants have been recentlyintroduced that avoid RMW hazards and geometry ordering prior torasterization by either initially building an A-buffer [Yu et al. 2012]or exploring a two-geometry-pass scheme based on atomic memoryoperations [Liu et al. 2010; Maule et al. 2013]. However, a signifi-cant increase of unbounded memory is required when A-buffer isused as temporary storage. On the other hand, atomic operations arerestricted to only work on 32-bit unsigned integer format resultingin (i) visual artifacts from lost precision when converting depthvalues and (ii) performance cost of an additional geometry pass.

We introducek+-buffer (K+B), a single-geometry-passk-bufferframework that overcomes the aforementioned performancebottlenecks, memory footprints and rasterization artifacts. Inspiredby [Crassin 2010b], a semaphore-basedspin-lock mechanismensures atomicity of the per-pixel fragment operations at the sharedmemory. Implementation details are also provided to easily switchto the hardware-implemented pixel syncing solution available on the

Haswell architecture [Salvi 2013]. To alleviate contention (busy-waiting) of distant fragments, we concurrently performcullingchecks that efficiently discard fragments that are further from allcurrently maintained fragments. Two array-based data structures arebuilt on the GPU to accurately store the closest per-pixel fragments:(i) max-array, an array where the maximum element is always storedat the first entry and (ii)max-heap, a complete binary tree in whichthe value of each internal node is greater than or equal to the valuesof the children of that node. Despite its linear complexity, the formerperforms faster than the latter when the problem size is sufficientlysmall. For example, order-independent transparency presentshigh approximation images even with a small core of capturedlayers (k ≤ 16). Conversely, plausible photorealistic appearance ofhair requires the contribution of a larger set of hair strands (k > 16).

To avoid the wasteful pre-allocated storage requirements ofpixels that contain less thank fragments (k-fragmentlesspixels,see Figure2), we have extended our framework by the S-bufferpipeline [Vasilakis and Fudos 2012]. An additional geometry pass isperformed for counting fragments per pixel, enabling us to allocatethe exact amount of memory that we actually need. Memory islinearly organized intovariable contiguous regionsfor each pixel,making it feasible to implement both proposed data structures. Toour knowledge, this is the firstk-buffer implementation with dy-namic and precise allocation of the required storage space. Withoutany adaptation, our framework can also support the functionalitiesof the Z-buffer and A-buffer by settingk to either one or largeenough to handle all possible per-pixel fragment cardinalities.

The structure of this paper is as follows: Section2 offers a briefoverview of prior art. Section3 introduces the algorithmic details ofourk+-buffer framework. Section4provides extensive comparativeresults for several of multi-fragment rendering alternatives. Finally,Section5offers conclusions and identifies future research directions.

x x

sorting

x x x x

view

direction

z0 z1

z0 z1 z3 z2

z2 z1 z0 z3

Figure 2: Illustrating the construction process of a row of a4-buffer (highlighted with blue at the top-right thumbnail), when raycasting the dragon model. A significant amount of memory spaceis wasted at pixels that consist of less than 4 fragments due to thepre-allocation of the same buffer length per pixel.

2 Related Work

There is an abundance of recent research on treating the problemof storing multiple fragments from different perspectives. A-buffer [Carpenter 1984] exploited an unbounded buffer for buildingdynamically created per pixel lists ofall fragments in submissionorder using a single rendering pass. Then, stored fragments are post-sorted according to their depth values. An actual GPU-acceleratedimplementation of A-buffer based on atomic memory operationswas recently introduced in [Yang et al. 2010] (LL). However, itsperformance degrades rapidly due to the heavy contention and the

random memory accesses when constructing and assembling theentire fragment list, respectively. A memory-friendly variationof this algorithm was described in [Crassin 2010b], wherepagedper-pixel linked-lists (LL-Paged) improve caching and data busoccupancy. On the other hand, [Vasilakis and Fudos 2013] improvedperformance by concurrently storing fragments into more than oneper-pixel linked lists. They have proposed to uniformly split thedepth range of the rasterized scene (see also [Liu et al. 2009b; Sintornand Assarsson 2009]) and assign one linked list in each subdivision.

FreePipe [Liu et al. 2010], a complete CUDA-based rasterizationpipeline, maintains multiple fragments using constant-size per pixelvectors. To ensure complete fragment extraction for all pixels, thebuffer length must be set accurately. FreePipe has been realizedusing modern OpenGL APIs [Crassin 2010a], thus avoiding switch-ing from the traditional graphics pipeline to a software rasterizer.Despite its high computation speed, FreePipe suffers from large andpotentially unnecessary memory consumption. Last but not least,S-buffer (SB) [Vasilakis and Fudos 2012], a two-geometry-passesA-buffer implementation on the GPU, has overcome the limitationsof both linked-lists and fixed-array techniques by taking advantageof the fragment distribution and the sparsity of the pixel-space.

Regardless of the data structure, the aforementioned approachessuffer from (i) memory overflows resulted from the unboundedbuffer needed to store all generated fragments, and (ii) performancebottlenecks arise when the number of per-pixel fragments to be post-sorted increases significantly. To this end,k-buffer (KB) [Callahanet al. 2005; Bavoil et al. 2007] reduced memory requirementscapturing thek-closest to the viewer fragments in a single geometryrasterization. However, it is susceptible to disturbing flickering arti-facts caused from RMW hazards during fragment insertion updates.[Liu et al. 2009a] extended this work to a multi-pass approach (KB-Multi) achieving robust rendering behavior with the trade-off of lowframe rates. Moreover, [Bavoil and Myers 2008] eliminated most ofthe memory conflicts by performingstencil routingoperations on amulti-sample anti-aliasing buffer (KB-SR). Finally, [Zhang 2013]explored amemory-hazard-awaresolution (KB-MHA) based ona depth-error correction coding scheme. In practice however, theycannot guarantee correct results in all cases. The image quality ofthis class of methods is highly dependent on a coarse CPU-basedpre-sorting in primitive-space which eliminates the arrival of out-of-order fragments. Multiple rendering iterations, are further requiredto provide an A-buffer output, due to the limited number of multiplerender targets on the GPU. This results in significant performancedowngrade. Conversely, [Wang and Xie 2013] proposes partitioningthe input scene into components with a few layers and then renderingthem individually in order to fit into the limited KB-SR buffer size.However, this comes with the burden of non-supporting animatedscenes and having limited order-dependent applicability.

Multi depth testscheme (KB-MDT), developed in both CUDA [Liuet al. 2010] and OpenGL [Maule et al. 2013] APIs, guarantees correctdepth order results by capturing and sorting fragments on the fly via32-bit atomic integer comparisons. Since 64-bit atomic operationsare not supported from available APIs to update the depth and colorbuffers simultaneously, a costly additional geometry pass is thereforesuggested. Furthermore, noisy images may be generated due to theprecision lost when converting floating depth values of close frag-ments. Similar to our method, [Salvi 2013] extended the originalk-buffer to avoid fragment racing by employing hardware-aware pixelsynchronization (KB-PS). However, this method is currently com-patible only with graphics cards based on the Haswell architecture.

Finally, [Yu et al. 2012] proposed two linked-list-aware solutions toaccurately compute thek-foremost fragments. The idea of the firstone is to capture all fragments by initially constructing A-buffer bythe method of [Yang et al. 2010], followed by a step that selects andsorts thek-nearest fragments (KB-ABLL). On the other hand, thesecond approach directly computes depth-ordered per-pixel linkedlists avoiding the unnecessary A-buffer construction (KB-LL).Despite the fact that the second approach requires less storage,fragments are sparsely stored in graphics memory causing theadditional allocation of contiguous blocks of memory.

Table1 presents a comparative overview of allk-buffer alternativeswith respect to memory requirements, rendering complexity,fragment extraction accuracy and sorting necessity.

3 Framework Overview

We propose an efficientk-buffer implementation on the GPU whichis free from: (i) geometry sorting prior rasterization, (ii) unboundedmemory necessity, (iii) RMW memory-hazards and (iv) depth pre-cision conversion artifacts. Contrary to most of the aforementionedk-buffer alternatives which store and sort the generated fragmentson the fly, we follow a faster strategy similar to the one used by theA-buffer construction: Thek-nearest fragments are captured in anunsorted sequence, followed by a post-sorting step that reordersthem by their depth. We explore a GPU-acceleratedspin-lockstrategy via pixel semaphores to ensure real-time synchronizedconstruction of the unsortedk-front fragments (Section3.1). Twobounded array-based data structures for fragment data are intro-duced which enable a low cost culling test that concurrently discardsoutlier fragments (Section3.2). A pipeline extension is furtherintroduced to dynamically and precisely handle graphics memoryallocation (Section3.3). k+-buffer can also be considered as anunified framework that successfully integrates the functionalitiesof Z-buffer, k-buffer and A-buffer (Section3.4). The overallframework is described by offering shader-like pseudocode and thefragment processing pipeline. Finally, we highlight features andtradeoffs of our framework, pointing out implementation details andlight-weight modifications that can be used to guide the decision ofwhich pipeline alternative to employ in a given setting.

3.1 Spin-lock Strategy

Per-pixel binary semaphoresare utilized as a synchronizationmechanism to ensure fragment exclusive use of the critical storagesection. Taking into account the possibility of simultaneous accessto the lock, which could cause race conditions, an implementation ofan atomictest-and-setoperation is explored. The calling process ob-tains the lock if the old value was0. It spins writing1 to the variableuntil this occurs. One way to implement spin-lock strategy employ-ing test-and-set into a pixel shader is shown in the Algorithm1.

Algorithm 1 TestAndSet (Texturet, Pixelp)

1: while truedo ⊲ spin until lock is free2: if imageAtomicExchange(t, p, 1) == 0 then3: {critical section} ⊲ exclusive use4: imageStore(t, p, 0); ⊲ release lock when finished5: discard; ⊲ exit shader6: end if7: end while

A 32-bit unsigned integer texture with internal pixel formatR 32UI is allocated to represent the per-pixel semaphores. Atfirst, a full-screen rendering (clear pass) is executed to initializetexture with zeros. Our method is enhanced by the OpenGL’simageAtomicExchange(texture lock, ivec2 P, uint V) functionwhich atomically replaces the value V of the atomic object with theargument into texel at coordinate P and returns its original value.Note that there is no need for an atomic operation to perform thelock release (since the running fragment has exited from the criticalsection) as opposed to the implementation of [Crassin 2010b] wherean additional costly atomic exchange is used.

Pixel Synchronization (PS) is a graphics extension that Intel hasimplemented for 4th Generation Intel Core processors with Irisand Iris Pro graphics based on Haswell architecture. PS provides aperformance-wise inexpensive mechanism which avoids fragmentconflicts in the critical section and ensures that RMW memoryoperations are performed in submission order [Salvi 2013]. Ourframework can be enhanced by the use of PS without remodelingthe proposed pipeline. Implementation-wise, a simple call ofbegin-FragmentShaderOrderingINTEL()function is necessary to provide

fragment serializability. Thus, the per-pixel semaphore-basedspin-lock strategy can be omitted (specifically, lines 18-19 and25-28 in Algorithm3). Avoiding the usage of per-pixel semaphoresalso results in reduced memory demand. While DirectX11+ andOpenGL extensions are available for Intel graphics cards, we expectthat these will be supported in the near future by all manufacturers.

3.2 Fragment Capturing

A geometry rendering (store pass) is initially carried out to capturethe closest fragment data per-pixel in a 64-bit floating point 3D arraybuffer with internal format of RG32F, (R for color and G for depth)andk length. Figure2 illustrates a k+-buffer which can hold up to400 fragments (screen size:10× 10, k = 4).

To alleviate the spinning ofn generated fragments that do notbelong to the closestk, a fast culling mechanism is performed.The idea is to efficiently discard each incoming fragmentfi,∀i ∈ {0, . . . , n − 1} that has equal or larger depth value (fi.z)from all currently maintained fragments, before trying to acquirethe semaphore. Note thati determines the submission order. LetKBi[:] = {KBi[j], j = 0 . . . k − 1} denotes the contents of thek-buffer when fragmentfi has been processed. Initially, we don’tdiscard any incoming fragment until the fragment storage bufferis full (∀i < k). Then, we discard all fragmentsfi such thatfi.z≥ max{KBi−1[:].z}. On the other hand, a fragment withfi.z< max{KBi−1[:].z} replaces the fragment of the KB with thelargest depth value. This strategy guarantees that thek-nearestfragments will always survive since:max{KBn−1[:].z} ≤ · · · ≤max{KBi−1[:].z} ≤ · · · ≤ max{KBk−1[:].z}.

Note that this process has no impact at the worst case scenario offragments arriving in descenting depth order. To achieve fragmentculling without traversing the entire pixel row for every incomingfragment, we have developed two array-based data structures on theGPU that both store the maximum element at the first array position:(i) max-array(K+B-Array) and (ii)max-heap(K+B-Heap). Thus,this operation is performed in constant time. Figure3 illustrateshow two incoming fragments are successfully discarded using thisformula when the buffer is completely full.

Max-array can be considered as an array where the fragment withthe largest depth value is always stored at the first location and therest are randomly positioned. When an incoming fragment obtainsa semaphore, it stores its information in the first empty entry (O(1)).In this case, a per-pixelcounter(32-bit unsigned integer texture withinternal pixel format R32UI) is utilized as index and incrementedafter a successful insertion. Per-pixel counters are initialized tozero during the clear full-screen rendering pass. If the array is full(counter == k), it takes the place of the fragment with the largerdepth value. Note that since culling mechanism resides outsidecritical section, an additional checking is mandatory to guaranteecorrect results. To keep max-array consistent after an insertion on acompletely filled array, we find the fragment with the largest depthvalue (O(k)) and swap it with the newly added fragment (exceptthe latter is the largest one). This process is implemented withoutthe use of any costly atomic memory operations since fragmentatomicity is guaranteed.

However when the problem size increases rapidly (k > 16),fragment data information is maintained in a max-heap datastructure. Max-heap is a complete binary tree (shapeproperty) inwhich all nodes are greater than or equal to each of its children (heapproperty). Max-heap can be implemented using a simplek-sizedarray without allocating any space for pointers: If the tree root is atindex0, then each element at indexi ∈ [0, k) has children at indices2i+1 and2i+2 and its parent is located at index⌊ (i−1)

2⌋. Since the

first node contains the largest element, the core pipeline followedby max-array is not altered. Both inserting operations to an emptyor a full heap modify the heap to conform to the shape propertyfirst, by adding nodes from the end of the heap or replacing theheap root (O(1)). Then, the heap property is restored by traversingup-heap or down-heap (O(log2 k)). Pseudocode for both insertion

12 20 5 7 10 11 9 25 1

counter > kcounter ≤ kfragments

arrival order 15 18

Max-Heap

20

9 7 10 11

12 15

5

20

10 11

15

22222000000000000000000000000022222200000000000000000022222220000000000000000000000000000001

15

10 110 1

1

15

9 7 10 1

12 11

0

1 2

3 4 5 6

5

7

150

121

112

93

74

105

16

57

15 15

12

Max-Array

- 15 12 20 5 7 10 11

15 1 12 9 5 7 10 11

20 15 12 9 5 7 10 11

0 1 2 3 4 5 6 7

- 15 12 20 5 7 10 -

- 15 - - - - - -

Figure 3: Overview of the insertion process of an arbitrary sequence of out-of-order fragments when (left) max-array and (right) max-heapdata structures withk = 8 are utilized. The incoming fragment in each step is highlighted with a glow effect.When the array is full, fragmentswith value larger than the maximum captured fragment (yellow-colored) are efficiently discarded (f8.z= 25 andf10.z= 18).

functions is shown in Algorithm2, where P(f) defines the parent ofa fragmentf and L(f) and R(f) its left and right children. Figure3illustrates how both data structures withk = 8 are constructedand updated from a number of out-of-order fragment insertions. Arepresentation comparison between max-array and max-heap nodepointers is also shown.

Algorithm 2 InsertToHeap (Heaph, Pixelp, Fragmentf , Int k)

1: procedure UP-HEAP(h, p, f, k)2: i := 0;3: h[p.counter] := f ; ⊲ Addf to the bottom level ofh4: while i++< log2(k) do ⊲ Iterate until leaves are reached5: if f .z> P(f).z then ⊲ Comparef with its parent6: swap(f, P(f)); ⊲ Swapf with its parent7: else8: break; ⊲ Correct depth order, exit9: end if

10: end while11: end procedure

12: procedure DOWN-HEAP(h, f, k)13: i := 0;14: h[0] := f ; ⊲ Replace root withf15: while i++< log2(k) do ⊲ Iterate until leaves are reached16: C(f) := max{L(f),R(f)}; ⊲ Findf ’s largest child17: if f .z< C(f).z then ⊲ Comparef with its largest child18: swap(f, C(f)); ⊲ Swapf with its largest child19: else20: break; ⊲ Correct depth order, exit21: end if22: end while23: end procedure

Finally, a sorting process is employed to reorder the fragments foreach pixel before generating the final image (resolve pass). Un-sorted fragments are initially copied into a local array before per-forming the depth sort, as it is relatively faster to perform read-writeoperations in the register space rather in the global graphics memory.Based on the number of captured fragments, a mechanism decideswhich sorting algorithm is applied to the pixel. Despite its quadraticcomplexity, insertion sortis faster for sorting small fragment se-quences (k ≤ 16). Whenk increases, O(k log k) sorting algorithms,such asshell sort, have better performance [Knowles et al. 2012].

3.3 Precise Memory Allocation

Similar to allk-buffer alternatives wherek is the same for all pixels,k+-buffer suffers from potentially large unused memory spaceallocation ofk-fragmentless pixels. For example, Figure2 illustratesthe wastefully allocated storage of a 4-buffer for (top) a pixel thatconsists of 2 fragments and (bottom) an empty-pixel. Note that thevalue ofk is not automatically adjusted based on the rasterized sceneand must be carefully set a priori by the user.

Inspired by S-buffer [Vasilakis and Fudos 2012], we introducea memory-awarek+-buffer implementation using two geometrypasses (K+B-SB). A precise allocation of the required memoryspace is achieved by performing an initial geometry rendering(count pass) which sums up the number of fragments covering eachpixel. Contrary to S-buffer where all fragments contribute to theper-pixel aggregation, we bound the number of fragments that affecta pixel byk whenf(p) > k, wheref(p) is the number of generatedfragments at pixel locationp[x, y]. For each incoming fragment,the per-pixel counter is atomically incremented. When the valueof the counter reachesk, the subsequently arriving fragments arediscarded. The total size of thek+-buffer is estimated by accumu-lating the bounded per-pixel fragmentsfk using hardware occlusionqueries. Then, the memory offset lookup table (referencing pass)is computed in parallel fashion exploiting sparsity in pixel space.For additional information of the algorithmic details and shaderimplementations of this full-screen rendering pass, we refer readerto the original paper of S-buffer. Finally, per-pixel counters arereinitialized to zero to guide the subsequent storing phase.

A geometry rasterization is employed to store the most significantfragments to a hybrid buffer scheme starting from the memoryoffsets computed for each pixel. Knowing its fragment cardinalitya priori, each pixel can efficiently choose the fastest way of storingits fragments in either a max-array or a max-heap storage. Notethat this is feasible since both data structures are implementedusing fixed-arrays. Since max-array structure inserts faster thanmax-heap when the capacity is not full andk stays small, we applythe following strategy: iff(p) > k andk > 16 then we pickmax-heap, otherwise we use the max-array data structure as storagebuffer (see also Section4).

In terms of performance, accessing global memory for concurrentlystoring all fragments becomes a significant bottleneck as opposedto the original single-passk+-buffer which benefits from thefast operations in the register memory space. Last but not least,the need of an additional geometry rendering step also adds atessellation-dependent computation cost.

The completek+-buffer framework, including the original andits memory-aware version, is shown in Figure4 and Algorithm3,wherep.name denotes the corresponding member value of pixelstructurep. Similarly, a[i].name defines information located ati position in the buffer arraya, andf .name denotes attributes ofeach running fragmentf . insertempty() and insertfull() are theabstract insertion functions. max-array and max-heap own versionsof both functions are available in the shader source code provided assupplementary material.

3.4 Support of Z-buffer and A-buffer

Without loss of generality,k+-buffer can also be considered as aunified framework that successfully integrates the functionality of Z-buffer,k-buffer and A-buffer by simply adjusting the value ofk. By

allocating a single entry per pixel (k = 1), our method ensures dis-playing the closest fragment to the viewer. However, this comes withthe additional expense of extra memory requirements and perfor-mance downgrade when compared to the hardware depth buffering.

On the other hand, users have to set the value ofk large enoughto avoid any fragment-overflow (k = maxp{f(p)}). Morespecifically, our framework can be considered as a hybrid schemethat correctly simulates the behavior of FreePipe (when K+B isused) or S-buffer (when K+B-SB is used). Despite the fact that ourframework is not restricted from (i) multiple render targets and (ii)samples of the anti-aliasing buffer, a multi-pass variation may berequired to achieve a memory-bounded A-buffer functionality. Inthis case,maxp{f(p)}/k rendering iterations have to be performed,resulting at a significant workload increase. Performance-wise,max-array structure should naturally be chosen as the fragmentstorage due to its constant insertion complexity when the array isnot full (∀p[x, y] : f(p) ≤ k).

Algorithm 3 k+-buffer (Arraya, Pixelp, Fragmentf , Int k)

1: procedure CLEAR(p) ⊲ full-screen pass2: p.counter:= 0;3: p.semaphore:= 0;4: end procedure

5: procedure COUNT(p, k) ⊲ geometry pass6: if p.counter< k then7: p.counter← p.counter+1; ⊲ bounded fragment accumulation8: else9: discard;

10: end if11: end procedure

12: procedure REFERENCING(p) ⊲ full-screen pass13: computepixel offset(p.counter);14: p.counter:= 0;15: end procedure

16: procedure STORE(a, p, f, k) ⊲ geometry pass17: if p.counter< k or f .z< a[0].z then ⊲ fragment culling18: while truedo19: if (p.semaphoreև 1) == 0 then20: if p.counter< k then ⊲ enter critical section21: insertempty(p.counter++);22: else if f .z< a[0].z then23: insertfull();24: end if ⊲ exit critical section25: p.semaphore:= 0;26: break;27: end if28: end while29: end if30: end procedure

31: procedure RESOLVE(p) ⊲ full-screen pass32: if p.counter≤ 16 then33: insertionsort(p.counter);34: else35: shellsort(p.counter);36: end if37: computeeffect(p.counter);38: end procedure

⊲ where{←,և} denote atomic{store, exchange} operations

Figure 4: Diagram of thek+-buffer pipeline. Each box representsa shader program. The blue boxes are executed per-pixel using afull-screen rendering pass, while the green ones are executed foreach geometry-rasterized fragment.

4 Results

We present an experimental analysis of ourk+-buffer approachagainst a set ofk-buffer and A-buffer realizations focusing on perfor-mance, robustness, and memory requirements under different testingconditions. We have measured performance in terms of framesper second (fps) and milliseconds (ms) and memory requirementsin terms of megabytes (MB). For the purposes of comparison, wehave developed two variations of KB-ABLL, where instead of usingper-pixel linked lists for the A-buffer construction, we have appliedeither fixed-length (KB-ABArray) or variable-length (KB-ABSB)arrays for each pixel. The shader source code from all testingmethods has been also provided as supplementary material. Allmethods are implemented using OpenGL4.3 API and performed ona NVIDIA GTX 480 graphics memory with1.5 GB memory.

Figure1 demonstrates the importance of accurately handling multi-fragments for several applications (transparency effects [Mauleet al. 2011], csg operations [Rossignac et al. 2013], and collisiondetection [Jang and Han 2008]). Table 1 presents a comparativeoverview of all k-buffer alternatives with respect to memoryrequirements, rendering complexity, and other features.

4.1 Performance Analysis

We have performed an experimental performance evaluation ofour methods against competing techniques using a collection ofscenes under several different configurations. Instead of renderingscenes under different image resolutions, we have used a854× 480viewport and perform zooming operations defining the percentageof image being rasterized. For a fair comparison, all methods aretested under artificially generated scenes that cover a percentage ofscreen size (or pixel density:pd) and producen = r · k randomlyarrived fragments per pixel, wherer ≥ 1.

4.1.1 k -buffer Comparison

Impact of k. Figure5 shows how the computation time, for eachrendering pass of a set ofk-buffer methods, scales by increasingthe value ofk = 4, . . . , 64 for a scene that consists ofn = 128fragments per-pixel. Except from KB-MDT which needs two passes(namely, Store(Z) and Store), the other memory-bounded methodsrequire to rasterize the scene only once. We observe that our K+B-Array and K+B-Heap perform better than the rest of the techniquesfor all k values. As expected, K+B-Heap performs better thanK+B-Array when moving from low to highk values. Note that theResolvestep is more expensive for A-buffer-aware methods, sinceit has to locate the closestk fragments from all captured ones beforesorting. KB-LL and KB-MDT do not perform depth reorderingsince they both store and sort fragments on the fly. Despite the goodperformance of KB-MDT for small buffer sizes, the computationcost of storing and sorting depth fragments exhibits a rapid increasefor largerk values. This leads to an important conclusion: a futuresingle-pass KB-MDT, which will be enhanced by the expected64-bit atomic updates on the graphics memory, will present an in-significant performance gain. Note that K+B is up to25× faster thanthe current implementation of KB-MDT whenk = 64. Finally, weobserve thatCountandResolvepasses of K+B-SB cost less in termsof computations as compared to the ones of KB-ABSB due to therestricted operations carried out by the former. However, slow frag-ment storing in global memory from K+B-SB increases fragmentspinning which subsequently results at a performance downgradewhen the rasterized fragments are significantly increased.

Impact of Sorting. Figure 6 illustrates performance compari-son of our bounded K+B against KB and KB-SR methods forvarying k values. All methods are tested under two scenar-ios wheren = k, . . . , 1024 fragments are generated for thepd = {25%, 75%} of all pixels. To accurately capture the closestfragments without RMW hazards from KB and KB-SR, the scene israsterized in depth order. K+B-Array performs slightly better thanK+B-Heap since the firstk fragments need O(1) time (compared

1

5

25

125

Count/Store(Z) Store Resolve

k=32 k=64 k=16 k=4 k=8

Figure 5: Performance evaluation in ms(log5 scale) ofk-buffer variants with varyingk on a scene with constant fragment complexity(128).

to O(log2 k)) to be inserted in the array. Due to the sorted arrivalfragment order, the remaining fragments do not affect performancesince they are successfully culled in both methods. A linear behavioris observed when moving from less ({0.21M, . . . , 104.93M})to more ({0.61M, . . . , 314.82M}) generated fragments. Ourmethods are superior when compared with KB for all scene con-figurations. Similar performance conclusions can be made for thenot-implemented KB-PS and KB-MHA methods, since they are sup-posed to perform slightly worse than KB. Despite its efficiency dueto the hardware-implemented stencil routing, KB-SR speed is fallingwhenk reaches higher values. An additional geometry pass must beemployed to successfully capture all fragments for KB (k > 16) andKB-SR (k > 32), resulting in a significant performance downgrade.

Impact of Memory. Figure7 illustrates the performance evaluationin terms of fps per MB for a testingk-buffer method set whenperformance and memory are of utmost importance. To constructk-fragmentless pixels, we permit pixels to be influenced by up ton =10 ·k fragments. Thus, we definefp as the probability of a generatedfragment to not be discarded. We observe that K+B-SB is preferredto be used for handling scenes with many empty pixels (pd = 25%)and small number of rasterized fragments (fp = 25%). Whenpixel and fragment densities increase (pd = 75%, fp = 75%),K+B-SB performs better than the rest memory-aware methods.However, K+B-SB behavior is normally worst than the boundedmethods since it theoretically performs slower (e.g. one extrapass, storing data at global memory) in conjunction with thesmall unused memory of the bounded methods. Despite the fastspeed of KB-ABSB on sparse scenes, performance is significantlyreduced when generated fragments blast off to high levels. Finally,KB-ABArray, KB-ABLL, KB-ABSB and KB-LL fail to work whenfragment allocation results in memory overflow (k = 64).

0.001 0.002 0.004 0.008 0.016 0.032 0.064 0.128 0.256 0.512 1.024 2.048 4.096 8.192

16.384 32.768

[2,20] [4,40] [8,80] [16,160] [32,320] [64,640] [2,20] [4,40] [8,80] [16,160] [32,320] [64,640]

K+B-Heap K+B-Array K+B-SB KB-AB(Array) KB-AB(SB) KB-AB(LL) KB-LL KB-MDT

[25%,25%] [75%,75%]

Figure 7: Performance evaluation in fps/MB(log2 scale) of allk-buffer variants when moving from a scene withr = 10 from smallnumber towards a large number of generated fragments.

Impact of Tessellation. Figure8 illustrates how the performancescales by moving from a low (1 level) to a high (64 level) tessellatedscene. A representative set of scenes are used to compare a numberof k-buffer methods that aim at capturing8 fragments. Threescenes are generated where the same number of fragments arerasterizedn = {16, 40, 80} for pd = 50% of the pixels. Differentconfigurations yield similar speed results. In all tests, a smallperformance impact is observed from the SB-aware methods asopposed to the linear behavior of the restk-buffers. This is due thefact that the counting geometry pass is not-tessellation dependent.

8

16

32

64

128

256

512

1 4 16 64

[8,16]

K+B-Heap K+B-Array K+B-SB KB-AB(Array) KB-AB(SB) KB-AB(LL) KB-LL KB-MDT

1 4 16 64

[8,40]

1 4 16 64

[8,80]

Figure 8: Performance evaluation in fps(log2 scale) of all k-bufferversions on a scene with varying tessellation resolution[1, 64] andincreasing per-pixel fragment complexityn : r = {2, 5, 10}, k = 8.

4.1.2 A-buffer Comparison

Figure9 illustrates performance comparison of our methods againstthe A-buffer alternatives for a scene with varying depth complexity.The same scene configurations with Figure7’s test have been used,with the difference thatk is set to the fragment cardinality, so thatK+B methods are able to capture all generated fragments. Weinitially observe that both bounded K+B methods perform betterfrom all memory-aware A-buffer variants and slightly worse thanFreePipe, the fastest A-buffer implementation so far. The unnec-essary culling mechanism is responsible for this burden. Similarwith previous tests, K+B-Array outperforms K+B-Heap, enhancedby its constant-time insertion process. On the other hand, K+B-SBdespite its smaller computational cost as compared to LL is worsethan SB in all cases. Except from the culling cost, the additionalcondition for each fragment at the count pass significantly affectsperformance. Finally, note that the performance gap exhibitedbetween K+B-SB and KB-ABSB is alleviated when the pixeldensity is increased (resulting at more rasterized fragments, from{0.21M, . . . , 13.12M} to{0.61M, . . . , 39.35M}).

1

4

16

64

256

1024

2 4 8 16 32 64 128 2 4 8 16 32 64 128

K+B-Array K+B-Heap FreePipe LL-Paged LL K+B-SB SB

25% 75%

Figure 9: Performance evaluation in fps(log2 scale) of A-bufferalternatives on a scene with varying maximum depth complexity.

4.2 Memory Allocation Analysis

Table 1 presents complexity in terms of memory consumptionfor all available methods that more or less simulate the behaviorof k-buffer. We initially observe that our K+B methods requireslightly more storage (8-byte) per-pixel than the rest of the memory

1

2

4

8

16

32

64

128

256

512

1024

2048

2 8 32 128 512 1,024 4 8 32 128 512 1,024 8 32 128 512 1,024 16 32 128 512 1,024 32 128 512 1,024 64 128 512 1,024

KB (75) KB-SR (75) K+B-Heap (75) K+B-Array (75) KB (25) KB-SR (25) K+B-Heap (25) K+B-Array (25)

Figure 6: Performance evaluation in fps(log2 scale) of our bounded K+B methods and the sorting-awarek-buffer methods for a large set ofk values. Pixel density is shown in brackets.

bounded methods due to the additional allocation of the counter andsemaphore textures. When moving to extreme screen resolutionsthis burden is noticeable. However, these methods need morestorage when data packing is explored (∀k > 1 : 4k > 2k + 2).K+B methods require less memory resources when compared tothe KB-SR (∀k > 2 : 3k > 2k + 2). Note that semaphore textureallocation is further avoided when pixel synchronization extension isemployed on Haswell hardware. On the other hand, video-memoryconsumption blasts off to high levels when A-buffer is constructed.Observe the increased memory requirements of KB-ABArray

due to its strategy to allocate the maximum memory per pixelp (n = maxp{f(p)} ≫ k). KB-ABLL, KB-LL, KB-AB SB

require less storage resources by dynamically allocating storageonly for non-empty pixels (f(p) ∈ [1, n]). Our memory-awaremethod K+B-SB requires equal (whenf(p) ≤ k) or less (whenf(p) > k) storage than the unbounded A-buffer-based methodsreducing the risk of a memory-overflow. Finally, an interestingobservation is that K+B and K+B-SB when extended to captureall fragments (k = n) require the same storage demands whencompared with the FreePipe and SB methods, respectively.

4.3 Image Quality Analysis

Figure 10 shows the image differences of KB, KB-MDT andKB-ABArray methods when compared with the ground truth onthree different scenarios: (a) Z-buffer: Aradial engineCAD modelis rendered using Gooch shading, (b)k-buffer: a transparenthairballmodel is visualized with red strips, and (c) A-buffer: a transparenttemplemodel is completely rasterized. Noticeable quality down-grade is observed at the first two image columns due to (center, left)RMW hazards of KB and (center, right) depth conversion artifactsof KB-MDT. To avoid memory overflow of KB-ABArray, we haveto allocate less storage than we actually need leading at (right) avisually information loss for a small set of pixels. Note that in the lastexample, K+B and KB-ABArray naturally produce the same image.

5 Conclusions

We have introducedk+-buffer, an improved GPU-accelerated k-buffer framework, which handles RMW memory-hazards and depthprecision conversion artifacts, and avoids geometry pre-sorting andthe requirement for unbounded memory. An additional geometryrasterization may be carried out to enable precise memory allocation.Implementation details and light-weight modifications are offeredto enable full support of our framework on Fermi and Haswell GPUarchitectures. Furthermore, Z-buffer and A-buffer functionalitiesare successfully integrated under the proposed framework. An ex-tensive experimental comparison has demonstrated the superiorityof our framework as compared to previousk-buffer alternatives withregards to storage requirements, performance and image quality.

Further directions may be explored for tackling the problem ofmulti-fragment rendering in rasterization architectures. Addi-tional experiments showing the performance trade-off betweensoftware (spin-lock) and hardware (pixel sync) implemented

KB KB-MDT KB-ABArray K+B

(a)

(b)

(c)

Figure 10: Color-coded differences between (left) the imagesgenerating using K+B against the outputs of (center-left to right)KB, KB-MDT and KB-ABArray.

semaphores on Haswell GPU should further be conducted. Consid-ering completeness, a dynamick-buffer technique, wherek is notthe same for all pixels, should be investigated. In cases, where thenumber of fragments varies when the camera or scene is animated,an interesting alternative is to capture thek% of the generatedfragments per-pixel. Finally, another challenge is to improve imagequality by making use of anti-aliasing algorithms.

Acknowledgements. Stanford dragonand hairball models aredownloaded fromMorgan McGuire’s Computer Graphics Archive.The models oftwirl object, clipped sphereandheadare courtesyof the AIM@SHAPE Shape Repository. We would like to thankGeorgios Papaioannouand Louis Bavoil for providing us theancientgreek templeandmotor enginemeshes, respectively. This researchhas been co-financed by the European Union (European Social FundESF) and Greek national funds through the Operational Program Ed-ucation and Lifelong Learning of the National Strategic ReferenceFramework (NSRF) - Research Funding Program: Heracleitus II.Investing in knowledge society through the European Social Fund.

References

BAVOIL , L., AND MYERS, K. 2008. Deferred rendering using astencil routed k-buffer.ShaderX6: Advanced Rendering Tech-niques, 189–198.

BAVOIL , L., CALLAHAN , S. P., LEFOHN, A., COMBA , J.A . L. D.,AND SILVA , C. T. 2007. Multi-fragment effects on the GPU usingthe k-buffer. InProceedings of the 2007 symposium on Interactive3D graphics and games, ACM, New York, NY, USA, I3D ’07,97–104.

http://graphics.cs.williams.edu/data

http://shapes.aim-at-shape.net/

http://graphics.cs.aueb.gr/graphics/downloads_tutorials.html

Performance

Acronym Description Rendering Passes on primitives on fragments Max k 32bit Float Precision Per Pixel Allocation Fixed

KB Initial k -buffer implementation [Callahan2005,Bavoil2007] 1 √ √ 2k; 4k

KB-Multi Multi-pass k -buffer [Liu2009a] 1 to k √ √ 2k; 4k

KB-SR Stencil routed k -buffer [Bavoil2008] 1 √ √ 32 3k

KB-PS k -buffer using pixel synchronization extension [Salvi2013] 1 x √ - 2k

K+B-Array k

+-buffer using max-array 1 x √ - 2k + 2

K+B-Heap k

+-buffer using max-heap 1 x √ - 2k + 2

KB-MDT Multi depth test scheme [Liu2010,Maule2013] 2 x x - 2k

KB-MHA Memory-hazard-aware k -buffer [Zhang2013] 1 √ √ 8; 16 2k; 4k

KB-ABArray k -buffer based on A-buffer (fixed-size arrays) 1 x √ - 2n + 1

KB-ABLL k -buffer based on A-buffer (dynamic linked lists) [Yu2012] 1 x √ - 3f + 1

KB-LL k -buffer based on linked lists [Yu2012] 1 x x - 3f + 6

KB-ABSB k -buffer based on S-buffer (variable-contigious regions) 2 x √ - 2f + 2

K+B-SB Memory-friendly variation of k

+-buffer 2 x √ - 2fk + 3

In A ; B, A denotes the layers/memory for the basic

method and B for the variation using attribute packing

f(p) = # fragments at pixel p[x,y]

fk(p) = (f(p) < k) ? f(p) : k fk(p) ≤ k

n = maxx,y{f(p)}

Memory

x

√

Peeling AccuracySorting need

8; 16

√

√

x

Algorithm

Table 1: Comprehensive comparison of the priork-buffer approaches and the introducedk+-buffer methods.

CALLAHAN , S. P., IKITS, M., COMBA , J. L., AND SILVA , C. T.2005. Hardware-assisted visibility sorting for unstructured vol-ume rendering.IEEE Transactions on Visualization and Com-puter Graphics 11, 3, 285–295.

CARPENTER, L. 1984. The A-buffer, an antialiased hidden surfacemethod. InProceedings of the 11th annual conference on Com-puter graphics and interactive techniques, ACM New York, NY,USA, vol. 18, 103–108.

CRASSIN, C., 2010. Icare3D Blog: Fast and accurate single-pass A-buffer, http://blog.icare3d.org/2010/06/fast-and-accurate-single-pass-buffer.html.

CRASSIN, C., 2010. Icare3D Blog: Linked lists of frag-ment pages, http://blog.icare3d.org/2010/07/opengl-40-abuffer-v20-linked-lists-of.html.

GOVINDARAJU, N. K., HENSON, M., L IN , M. C., ANDMANOCHA, D. 2005. Interactive visibility ordering and trans-parency computations among geometric primitives in complexenvironments. InProceedings of the 2005 symposium on Inter-active 3D graphics and games, ACM, New York, NY, USA, I3D’05, 49–56.

JANG, H., AND HAN , J. 2008. Fast collision detection using theA-buffer. Vis. Comput. 24, 7 (July), 659–667.

KNOWLES, P., LEACH, G., AND ZAMBETTA , F. 2012. Efficientlayered fragment buffer techniques. InOpenGL Insights, P. Cozziand C. Riccio, Eds. CRC Press, 279–292.

L IU , B., WEI, L.-Y., XU, Y.-Q., AND WU, E. 2009. Multi-layer depth peeling via fragment sort. In11th IEEE InternationalConference on Computer-Aided Design and Computer Graphics,2009., 452 –456.

L IU , F., HUANG, M.-C., LIU , X.-H., AND WU, E.-H. 2009. Effi-cient depth peeling via bucket sort. InProceedings of the Confer-ence on High Performance Graphics 2009, ACM, New York, NY,USA, HPG ’09, 51–57.

L IU , F., HUANG, M.-C., LIU , X.-H., AND WU, E.-H. 2010.FreePipe: a programmable parallel rendering architecture for ef-ficient multi-fragment effects. InProceedings of the 2010 ACMSIGGRAPH symposium on Interactive 3D Graphics and Games,ACM, New York, NY, USA, I3D ’10, 75–82.

MAULE , M., COMBA , J. L., TORCHELSEN, R. P.,AND BASTOS,R. 2011. A survey of raster-based transparency techniques.Com-puters & Graphics 35, 6, 1023 – 1034.

MAULE , M., COMBA , J. A ., TORCHELSEN, R., AND BASTOS,R. 2013. Hybrid transparency. InProceedings of the ACM

SIGGRAPH Symposium on Interactive 3D Graphics and Games,ACM, New York, NY, USA, I3D ’13, 103–118.

ROSSIGNAC, J., FUDOS, I., AND VASILAKIS , A. 2013. Direct ren-dering of boolean combinations of self-trimmed surfaces.Com-put. Aided Des. 45, 2 (Feb.), 288–300.

SALVI , M. 2013. Advances in real-time rendering in games: Pixelsynchronization: Solving old graphics problems with new datastructures. InACM SIGGRAPH 2013 courses, ACM, New York,NY, USA, SIGGRAPH ’13.

SINTORN, E.,AND ASSARSSON, U. 2008. Real-time approximatesorting for self shadowing and transparency in hair rendering. InProceedings of the 2008 symposium on Interactive 3D graphicsand games, ACM, New York, NY, USA, I3D ’08, 157–162.

SINTORN, E., AND ASSARSSON, U. 2009. Hair self shadowingand transparency depth ordering using occupancy maps. InPro-ceedings of the 2009 symposium on Interactive 3D graphics andgames, ACM, New York, NY, USA, I3D ’09, 67–74.

VASILAKIS , A., AND FUDOS, I. 2012. S-buffer: Sparsity-awaremulti-fragment rendering. InProceedings of Eurographics 2012Short Papers, 101–104.

VASILAKIS , A., AND FUDOS, I. 2013. Depth-fighting aware meth-ods for multifragment rendering.IEEE Transactions on Visual-ization and Computer Graphics 19, 6, 967–977.

WANG, W., AND X IE, G. 2013. Memory-efficient single-pass GPUrendering of multifragment effects.IEEE Transactions on Visu-alization and Computer Graphics 19, 8 (Aug.), 1307–1316.

YANG, J. C., HENSLEY, J., GRUN, H., AND THIBIEROZ, N. 2010.Real-time concurrent linked list construction on the GPU.Com-puter Graphics Forum 29, 4, 1297–1304.

YU, X., YANG, J. C., HENSLEY, J., HARADA , T., AND YU, J.2012. A framework for rendering complex scattering effects onhair. InProceedings of the ACM SIGGRAPH Symposium on In-teractive 3D Graphics and Games, ACM, New York, NY, USA,I3D ’12, 111–118.

ZHANG, N. 2013. Memory-hazard-aware k-buffer algorithm fororder-independent transparency rendering.IEEE Transactions onVisualization and Computer Graphics 99, PrePrints, 1.

Documents

k+-buffer: Fragment Synchronized k-buffer · k+-buffer: Fragment Synchronized k-buffer ... by a family of GPU-accelerated methods that capture the surface-intersections (fragments)