
Standalone Distributed Rendering For Supercomputers

William Tobin, [email protected]

Daniel Ibanez, [email protected]

May 2, 2013

Abstract

We present a method for rendering large distributed meshes on large distributed architectures that lack traditional graphics rendering capabilities and hardware. This method renders each distributed portion of the mesh independently on an individual process into a depth buffer and image buffer. After the individual renderings are taken, the Message Passing Interface is used to merge the individual rendering buffers into a final frame. We discuss our implementation on the Blue Gene/Q architecture and the performance of this implementation, which is fast enough for large-scale simulation monitoring at most resolutions, and at moderate and low resolutions is fast enough that real-time interaction may be possible.

1 Introduction

Recent state-of-the-art supercomputers such as the IBM Blue Gene/Q, or BGQ, are built using many tightly coupled processors. Designed for high-performance computing and simulation, these machines do not include graphics cards and rarely install software support for rendering. Traditionally, any graphical output from simulations running on the supercomputer is produced by a separate visualization cluster, which consists of more loosely coupled processors with graphics hardware and related software support. The utility gained by visualization capabilities motivates the implementation of an efficient rendering technique that does not introduce unnecessary computation delays into scientific HPC applications. This report explores the possibility of rendering using the BGQ hardware itself. We focus on implementing the traditional graphics pipeline for unstructured surface and volume meshes, which are frequently used in scientific computation as well as in graphical animation applications. The method we present is capable of rendering any mesh that will fit in the BGQ's RAM, with a runtime logarithmic in the number of processors used and proportional to the mesh surface per processor. We achieve this by rendering partial frames on each processor and merging the frames in a binary tree communication pattern to form a final frame. This method comes close to interactive speeds for some common video resolutions and mesh sizes. We implement this method as a C library and programs, making use of the SDL and libpng [1] libraries to output the resulting images. The rendering library is small enough to integrate into existing scientific applications, enabling in-situ visualization at a more closely coupled level than before. This is further motivated by the lack of support for simultaneous program execution or dynamic library linking on many supercomputer architectures. The system presented would enable visual interaction with scientific applications, and the merging algorithm is general enough to support purely graphical applications.

2 Related Work

The standard approach to massively parallel visualization in HPC applications involves the use of various in-situ data extraction techniques combined with inter-cluster data transfer to an independent visualization system.

The ParaView Coprocessing Library [2] is used in many HPC simulations as a data visualization tool. The library allows for the construction of pipelines which send visualization data to a coprocessing machine cluster (such as the cluster installed at the CCNI alongside the Blue Gene/Q) or to a client machine, where the data is used to produce visual renderings representing the state of the simulation on the supercomputer. The data is acquired through the use of adaptor code so that ParaView is not required to work with data formats internal to a particular simulation code.

ParaView has been coupled with several simulation codes, including the PHASTA simulation code, through work conducted at SCOREC.

Ellsworth et al. [3] developed a method sharing commonalities with the ParaView approach, initially intended for visualization of data from a massively parallel forecast model implemented and executed on the NASA Ames “Columbia” supercomputer. Their approach involves the simulation code allocating and copying simulation data to areas of shared memory on the supercomputer. This data is then accessed by a separate program executing on the same nodes and sent across an InfiniBand network to a separate cluster, similar to the ParaView approach. This cluster breaks the single data stream received from the simulation system into several separate streams, each of which eventually produces a separate visualization. This approach allows simultaneous visualization of multiple pieces of data. Each data stream is sent via the standard TCP protocol to a visualization system which renders the stream and encodes it using the MPEG video format before writing it to shared network storage. This filesystem is accessed via SSH by display cluster nodes, which stream the individual video files created by each data stream to individual displays.

Tu et al. [4] [5] developed a full simulation workflow including a visualization system for volume rendering of vector fields. Their system uses a spatial decomposition of the simulated domain (using an octree) to distribute rendering responsibilities among the nodes of the system. They use this octree structure to implement a ray-tracing system whereby each sub-volume is rendered independently. They make use of the SLIC [6] method for image compositing, which involves direct sends from nodes holding renderings of the sub-volumes to the node responsible for image composition. This parallel rendering method was described in [7].

3 Graphics Pipeline

Our implementation of a graphics pipeline consists of three stages. The highest level constructs a flexible mesh data structure and renders it by passing primitive shapes to the 3D viewing system. The 3D viewing system performs transformations, clipping, and perspective projection to obtain image-space vertex coordinates for each primitive, and then hands these over to the 2D raster system. The image-space primitives are rasterized into a depth-buffered image, which is either immediately output or given to the merging algorithm of Section 4.

We implemented a 2D raster system which can render points, lines, and triangles, resulting in software arrays representing an image and its corresponding per-pixel depth buffer. All color values are stored in a 24-bit RGB format. Image-space primitives are encoded as vectors, where each vector has integer X and Y coordinates, a double-precision floating-point depth, or Z, value, and a color value. The image structure stores the color values for pixels, and has support for I/O either on a window using SDL or as a PNG file using libpng. The drawing structure stores an image and a depth buffer, which are initialized to a background color and infinite depth distances before rendering.

Drawing a point is straightforward: if the depth value of the point is closer to the camera than the current depth buffer value at that pixel, assign the point's depth and color values to the pixel. To draw a line, we use Bresenham's algorithm to obtain a series of points given the endpoint vertices of a line, and draw each point. Intermediate Z and color values are interpolated linearly between the endpoints of the line. To draw a triangle, we invoke Bresenham's algorithm on each of the edges of the triangle. From the resulting vertices we can find the first and last points of the triangle in each row of the image on which the triangle appears. Each row of the triangle is then drawn as a line based on these first and last points.
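As a concrete illustration of the depth-buffered point write, the following minimal C sketch shows the test described above. The structure layout and names (color_t, drawing_t, draw_point) are illustrative and are not taken from the library's actual API.

```c
#include <float.h>

/* Illustrative image and depth buffer layout; the library's actual
   structures may differ. */
typedef struct { unsigned char r, g, b; } color_t;

typedef struct {
  int width, height;
  color_t *pixels; /* width * height colors */
  double *depth;   /* width * height depths, initialized to DBL_MAX */
} drawing_t;

/* Write a point only if it is closer than what is already stored. */
static void draw_point(drawing_t *d, int x, int y, double z, color_t c)
{
  if (x < 0 || x >= d->width || y < 0 || y >= d->height)
    return;               /* off-screen */
  int i = y * d->width + x;
  if (z < d->depth[i]) {  /* depth test: closer wins */
    d->depth[i] = z;
    d->pixels[i] = c;
  }
}
```

Lines and triangles then reduce to repeated calls of this point write, with Z and color interpolated along the way.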

Our general 3D viewing algorithm for primitives is as follows:

1. Transform the primitive into camera space

2. If the primitive is a triangle, perform flat shading and backface culling

3. Clip the primitive using an infinite pyramid of clipping planes in camera space

4. Transform the primitive into screen space using a perspective transformation

5. Rasterize the primitive in 2D screen space

In the 3D system, primitives are stored as a set of double-precision position vectors for vertices and a single color value.

The transformation from world space into camera space is stored as an affine structure, which is a 3×3 rotation matrix and a translation vector instead of the traditional 4×4 matrix. Flat shading and backface culling for triangles both make use of the dot product between the triangle normal and the unit vector from the eye point to the triangle's centroid. A positive dot product is used as a scaling factor for the color's RGB values, and a negative product culls the triangle. By clipping in camera space as opposed to perspective space, we avoid many of the issues which motivated the use of a near and far plane to create a viewing frustum. As such, we only use the remaining 4 planes to clip what will not appear in screen space.
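A minimal C sketch of this arrangement follows. The type and function names are assumptions for illustration; only the use of a 3×3 rotation plus a translation, and the dot-product shading/culling test, reflect the description above.

```c
#include <math.h>

typedef struct { double x, y, z; } vec3;

/* Affine transform: 3x3 rotation plus translation, rather than a 4x4 matrix. */
typedef struct {
  double r[3][3];
  vec3 t;
} affine_t;

static vec3 affine_apply(const affine_t *a, vec3 p)
{
  vec3 q;
  q.x = a->r[0][0] * p.x + a->r[0][1] * p.y + a->r[0][2] * p.z + a->t.x;
  q.y = a->r[1][0] * p.x + a->r[1][1] * p.y + a->r[1][2] * p.z + a->t.y;
  q.z = a->r[2][0] * p.x + a->r[2][1] * p.y + a->r[2][2] * p.z + a->t.z;
  return q;
}

/* Dot product of the triangle normal with the unit eye-to-centroid vector,
   both in camera space (the eye sits at the origin, so the eye-to-centroid
   vector is the centroid itself).  Per the rule above, a positive result is
   used to scale the triangle's RGB color and a negative result culls it. */
static double shade_factor(vec3 n, vec3 c)
{
  double nn = sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
  double cc = sqrt(c.x * c.x + c.y * c.y + c.z * c.z);
  double dot = n.x * c.x + n.y * c.y + n.z * c.z;
  return dot / (nn * cc);
}
```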

Clipping for lines and points is more or less trivial, but triangle clipping requires handling several cases. The algorithm for clipping a triangle using one plane outputs zero, one, or two triangles depending on the combination of points inside or outside the half-space. Figure 1 illustrates the non-trivial cases when the plane intersects two edges of the triangle. The red sub-triangle is output if the half-space is above the dotted line representing the plane; otherwise the two green triangles are output.

The triangle viewing function recursively applies single-plane clipping to the resulting triangles until all 4 planes have been considered.

Figure 1: Triangle clipping cases

This creates zero to 16 output triangles for the next step, although typically the number is two or less. Currently the clipping planes provide a viewing angle of 90°, which is to say that the left and right planes are orthogonal, as are the top and bottom planes.
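The single-plane step can be sketched in C as follows. The types and helper names are illustrative, color interpolation along the cut edges is omitted, and winding order is not preserved (it is irrelevant here, since backface culling happens before clipping); the actual library code may differ.

```c
typedef struct { double x, y, z; } vec3;
typedef struct { vec3 v[3]; } tri_t;

/* Signed distance from the plane; "inside" the half-space when >= 0. */
static double side(vec3 n, double d, vec3 p)
{
  return n.x * p.x + n.y * p.y + n.z * p.z - d;
}

/* Point where segment ab crosses the plane. */
static vec3 cross_point(vec3 n, double d, vec3 a, vec3 b)
{
  double da = side(n, d, a), db = side(n, d, b);
  double t = da / (da - db);
  vec3 p = { a.x + t * (b.x - a.x),
             a.y + t * (b.y - a.y),
             a.z + t * (b.z - a.z) };
  return p;
}

/* Clip one triangle against one plane; writes 0, 1, or 2 triangles to out. */
static int clip_triangle(tri_t in, vec3 n, double d, tri_t out[2])
{
  int is_in[3], nin = 0, k, i;
  vec3 a, b, c;
  for (i = 0; i < 3; ++i) {
    is_in[i] = side(n, d, in.v[i]) >= 0.0;
    nin += is_in[i];
  }
  if (nin == 0) return 0;                  /* fully outside */
  if (nin == 3) { out[0] = in; return 1; } /* fully inside  */
  /* rotate so that a is the lone vertex on its own side of the plane */
  for (k = 0; k < 3; ++k)
    if (is_in[k] == (nin == 1))
      break;
  a = in.v[k]; b = in.v[(k + 1) % 3]; c = in.v[(k + 2) % 3];
  if (nin == 1) {              /* keep the corner at a (the red case)      */
    out[0].v[0] = a;
    out[0].v[1] = cross_point(n, d, a, b);
    out[0].v[2] = cross_point(n, d, a, c);
    return 1;
  } else {                     /* keep the quad opposite a (the green case) */
    vec3 xab = cross_point(n, d, a, b);
    vec3 xac = cross_point(n, d, a, c);
    out[0].v[0] = b; out[0].v[1] = c;   out[0].v[2] = xac;
    out[1].v[0] = b; out[1].v[1] = xac; out[1].v[2] = xab;
    return 2;
  }
}
```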

The clipped primitives may be safely scaled by their Z value relative to the screen diameter and then rasterized in screen space. The screen diameter is the maximum screen dimension, either width or height. This ensures that the bi-unit square in which all perspective XY values lie is scaled to fit around the screen image.
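One plausible mapping consistent with this description is sketched below: with the 90° frustum, x/z and y/z lie in the bi-unit square, which is scaled by half the screen diameter. The centering and sign conventions here are assumptions, not taken from the code.

```c
typedef struct { double x, y, z; } vec3;
typedef struct { int x, y; double z; } screen_pt;

/* Map a camera-space point (assumed to have z > 0 after clipping) to
   integer pixel coordinates, keeping depth for the raster stage. */
static screen_pt to_screen(vec3 p, int width, int height)
{
  int diameter = width > height ? width : height;
  double half = diameter / 2.0;
  screen_pt s;
  s.x = (int)(width / 2.0 + (p.x / p.z) * half);
  s.y = (int)(height / 2.0 - (p.y / p.z) * half); /* image y grows downward */
  s.z = p.z;
  return s;
}
```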

In addition, we have implemented an intuitive user interface for manipulating the camera-to-object relationship using the mouse. We call it a “globe” interface because it is based on spinning and tilting the object like a globe before translating it to its final position. The affine transformation for the globe is obtained by combining the transformations for spinning, tilting, and translation given the spin and tilt angles and the translation vector. Left clicking and dragging on the screen in the X and Y directions manipulates the spin and tilt angles, respectively. Middle clicking and dragging on the screen adjusts the X and Y values of the translation vector, while right clicking and dragging converts Y mouse motion into Z translation, all relative to the current distance from the camera to the object-space origin. Distance-relative zooming and panning is quite useful for handling objects of widely varying scales elegantly, and does not require knowledge of the object size.
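One way the globe affine might be assembled is sketched below; the axis conventions (spin about Y, tilt about X) and the composition order are assumptions for illustration, not taken from the implementation.

```c
#include <math.h>

typedef struct { double r[3][3]; double t[3]; } affine_t;

/* Compose tilt-about-X after spin-about-Y, then translate. */
static affine_t globe_affine(double spin, double tilt, const double trans[3])
{
  double cs = cos(spin), ss = sin(spin);
  double ct = cos(tilt), st = sin(tilt);
  affine_t a = {{
    {  cs,      0.0,  ss      },
    {  ss * st, ct,  -cs * st },
    { -ss * ct, st,   cs * ct }
  }, { trans[0], trans[1], trans[2] }};
  return a;
}
```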

Currently only tetrahedral volume meshes and triangular surface meshes are supported by the mesh rendering system.

The flat-shading mesh visualization algorithm proceeds by iterating over all faces in the local partition of the volumetric mesh. If a face has two adjacent mesh region volumes then it is an interior face and is unimportant for a flat-shaded render. Any face determined to be on the boundary of the volume mesh has the three vertices adjacent to it retrieved and passed as an array to the triangle rendering function, which takes care of backface culling, clipping, and further rendering functions.
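The traversal can be sketched as follows; the mesh query functions (mesh_face_count, face_region_count, face_vertices) and render_triangle are hypothetical placeholders standing in for the actual mesh database interface.

```c
typedef struct { double x, y, z; } vec3;
struct mesh; /* opaque handle to the local mesh partition */

/* Hypothetical query and rendering interfaces, declared for illustration. */
int  mesh_face_count(const struct mesh *m);
int  face_region_count(const struct mesh *m, int face);
void face_vertices(const struct mesh *m, int face, vec3 out[3]);
void render_triangle(const vec3 tri[3]);

/* Flat-shaded traversal: draw only faces on the volume boundary. */
void render_mesh_surface(const struct mesh *m)
{
  int nfaces = mesh_face_count(m);
  for (int f = 0; f < nfaces; ++f) {
    if (face_region_count(m, f) == 2)
      continue;              /* two adjacent regions: interior face, skip */
    vec3 tri[3];
    face_vertices(m, f, tri); /* the three vertices adjacent to the face */
    render_triangle(tri);     /* culling, clipping, rasterization follow */
  }
}
```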

For wireframe visualization, all edges in the mesh are simply iterated over, their adjacent vertices retrieved, and passed on to the rendering system.

There is one known issue with this implementation of mesh rendering, which is that triangles sharing an edge will result in Z-fighting along that edge during rendering. This is due to the fact that pixels on this edge do not sample exactly the line between the points, and the distance from the line changes their depth values according to the surface orientation of the triangle. This issue is the origin of artifacts in the mesh renderings to be discussed in the Results section.

4 Image Merging

Parallel merging of the frames rendered on the individual processes was conducted using the MPI message passing system, which is the de facto standard in HPC parallel applications such as those implemented on our Blue Gene/Q target architecture. MPI provides support for general message passing patterns, and as such the construction of a general reduction algorithm using non-blocking sends and receives on the rendering nodes was considered. This approach would have the benefit of allowing processes that complete the render operation to move on to computational phases immediately following a non-blocking send. This operation would be conducted hierarchically in a tree-based communication pattern.

MPI has a blocking reduction operation already present which can be used to perform built-in reduction operations on standard MPI datatypes (such as finding the maximum value of a set of distributed integers). The communication pattern for this built-in reduce operator is very similar to our original conceptual parallel merge communication; however, it is collective and blocking, so all processes must call the operation prior to any being released.

MPI has provided the ability to define custom datatypes since the first standard. These datatypes can be used in all standard MPI communication routines, including the asynchronous sends and receives which would have been used in the implementation based on our initial concept. However, MPI also provides the ability to define custom reduction operations for the MPI_Reduce function (also since the initial standard). This provided us with the possibility of a much cleaner parallel frame merge operation: by defining a custom MPI_Datatype and MPI_Op and using MPI_Reduce, we obtain a cleaner and likely more efficient reduction communication pattern than we would be able to implement ourselves.

The parallel frame reduction / merge process takes place as follows: the color buffer, depth buffer, and frame dimensions are extracted and placed into a contiguous memory buffer. A new MPI_Datatype is created which is simply a series of contiguous bytes equivalent to the size of the memory buffer. A new MPI_Op is then created, which simply registers a pre-written reduction function with the correct function signature as the reduction operation. Finally the MPI_Reduce call is made, which takes care of hierarchically merging the individual frames. This reduction operation will be called ⌈lg(n)⌉ times on the root process (where the final frame is collected) in order to compose the final rendered frame.

The reduction operation takes two data buffers, the number of data elements being merged, and a pointer to the MPI_Datatype describing the type in the buffers. Offsets to the initial color buffer and depth buffer locations are then calculated. From there the frame dimensions are used to limit the iteration over the two image buffers in each data buffer, at each pixel determining which depth buffer has the lesser value, then (possibly) overwriting the result image and depth buffers with those values from the “closer” buffer.
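A condensed C sketch of this merge is given below. The frame layout, the fixed compile-time dimensions, and the helper names are illustrative assumptions (the real code packs the frame dimensions into the buffer); the MPI_Type_contiguous, MPI_Op_create, and MPI_Reduce calls reflect the mechanism described above.

```c
#include <mpi.h>
#include <string.h>

#define WIDTH  1920
#define HEIGHT 1080
#define PIXELS (WIDTH * HEIGHT)

/* Contiguous frame buffer: RGB bytes followed by per-pixel depths. */
typedef struct {
  unsigned char color[PIXELS][3];
  double depth[PIXELS];
} frame_t;

/* Pairwise merge: keep the closer pixel from the two partial frames. */
static void merge_frames(void *in_, void *inout_, int *len, MPI_Datatype *dt)
{
  frame_t *in = in_, *inout = inout_;
  (void)dt;
  for (int f = 0; f < *len; ++f)
    for (int p = 0; p < PIXELS; ++p)
      if (in[f].depth[p] < inout[f].depth[p]) {
        inout[f].depth[p] = in[f].depth[p];
        memcpy(inout[f].color[p], in[f].color[p], 3);
      }
}

/* Called on every process after local rendering; rank 0 receives the frame. */
void reduce_frame(frame_t *local, frame_t *final)
{
  MPI_Datatype frame_type;
  MPI_Op merge_op;

  MPI_Type_contiguous((int)sizeof(frame_t), MPI_BYTE, &frame_type);
  MPI_Type_commit(&frame_type);
  MPI_Op_create(merge_frames, 1 /* depth-min merge is commutative */, &merge_op);

  MPI_Reduce(local, final, 1, frame_type, merge_op, 0, MPI_COMM_WORLD);

  MPI_Op_free(&merge_op);
  MPI_Type_free(&frame_type);
}
```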

Reducing the data used in a merge may be possible by simply limiting the data sent by each node to the rectangular subsection of the frame in which the local part of the parallel mesh was rendered, but our results for the initial implementation do not suggest this is immediately necessary. This does, however, present a possible area for future work in the parallel reduction phase.

The tradeoff in our approach, of course, is the restriction that the MPI_Reduce operation is collective and blocking. The upcoming MPI-3 standard will likely add non-blocking collective calls that may be of use by allowing processes to immediately continue with analysis procedures after processing any intermediate merges assigned to the local process. The alternative is essentially an ad-hoc implementation of this functionality using the non-blocking point-to-point capabilities of MPI, which would likely result in sub-optimal frame merge times.

5 Results

To test the raster and viewing code, we begin by rendering primitives in several different cases. Figure 2 shows a combination of several different primitive cases rasterized in 2D:

1. A point and some overlapping lines

2. An axis-aligned blue triangle

3. A green triangle with zero width

4. A color-interpolated triangle intersecting the blue triangle

Figure 2: A set of primitive raster demonstrations

During implementation a bug was introduced causing only one of the RGB values to be used in the color interpolation function; this produced the unusual color interpolation results shown in Figure 3.

Figure 3: Bad color interpolation on triangles

This case is representative of much of the testing that was done on the raster system. These initial results allowed us to move into implementation of the 3D viewing system. Early tests using geometric primitives ironed out simple bugs in the implementation. Results become interesting in the context of triangle clipping, which is a complex operation and is key to our rendering implementation. Figure 4 highlights the clipping algorithm at work on a triangle by rendering outlines of sub-triangles. The triangle is first clipped by the bottom plane of the view frustum, breaking it into two triangles, and then clipped by the right frustum plane, which splits each of the existing sub-triangles, creating four separate triangles for rasterization.

Figure 4: A demonstration of the triangle clipping algorithm


The mesh rendering is best demonstrated through live interaction with the provided graphical interface. Figure 5 shows a rendering of the Stanford bunny using this interface, alongside a buggy rendering caused by inverted indexing of mesh vertices.

Figure 5: Forward and inverted vertex indexing of the Stanford bunny

Using the graphical interface, meshes on the order of 20000 triangles can be manipulated in real time, while triangle counts of 40000 or more begin to lag behind user input rates.

Note that although we have a graphical interface, it is just a front end to the rendering code. The functions and structures which produce images are completely decoupled from any dependencies, and can be compiled using just a modern C compiler. The SDL and libpng libraries simply provide different methods of output for the in-memory image structure.

Our most significant results come from experiments conducted on the IBM Blue Gene/Q at RPI's Computational Center for Nanotechnology Innovations. This BGQ system consists of two racks, each containing 1024 compute nodes. Each compute node, in turn, contains 16 cores and 16 GB of RAM. Each core runs at 1.6 GHz and supports up to 4-way symmetric multi-threading. For these experiments, we mapped mesh partitions to cores; that is, a single core performed the local rendering for one mesh partition and participated in the parallel image merge.

Two parallel meshes were available from previous work. The first is a 32-part mesh of some blood vessels, which we will call the arterial mesh. Each part of this mesh is made of 65000 tetrahedra, for a total of about 2 million tetrahedra. The second mesh is a 64-part mesh of the Stanford Linear Accelerator, which is called the SLAC mesh. Each part of the SLAC mesh is composed of 230000 tetrahedra, for a total of about 15 million tetrahedra in the entire SLAC mesh. The mesh description files for the SLAC mesh total over 2 GB of data.

The parallel tests proceed as follows: given a camera position, color selection, and image resolution, render the mesh using these parameters by locally rendering the surface triangles of one part and merging the images. The final image is written to file. During this process we keep track of the time required to render, the maximum number of surface triangles locally rendered, the time to merge the images, and the number of merge steps.

We begin by rendering the arterial mesh with an appropriate red and blue color scheme. The result is shown in Figure 8. We initially had trouble with missing triangles in this mesh, caused by improper backface culling, which can be seen in Figure 6. At that time triangles were judged by the dot product with the Z vector, but under perspective transformation the camera-to-triangle vector should be used.

Figure 6: Incorrect backface culling of the arterial mesh


Once that was corrected, Figure 8 was produced at a resolution of 1920 × 1080, also known as full HD or 1080p. The reproduction in this paper was cropped to single out the mesh. This initial run took 0.234 seconds to render locally and 1.66 seconds to merge the 32 images. When merging images for the arterial mesh, MPI has to do 5 levels of pairwise merging since 32 = 2^5. We carefully profiled the run from this point and tried to improve performance. Two major improvements were made: first, the merge operation was rewritten into a single optimized loop over pixels. Second, we noticed that variability in merge times was due to processor imbalance after local rendering, so we introduced a barrier before the merge operation so that our results reflect the true lower bound on merge time for synchronized processors. The result is that the same image is locally rendered in 0.189 seconds and merged in 0.803 seconds, of which 0.628 seconds were spent in the pairwise operator and the rest are due to message passing and memory management.

Careful readers will notice a few black pixels on an otherwise clear surface, which is due to the Z-fighting phenomenon discussed in the Graphics Pipeline section. In this case, such artifacts are caused by triangles which are not part of the object's surface but are on the boundary between parallel partitions. Notice that while there is Z-fighting between all adjacent triangles, only these interior triangles cause visual problems. There are well-established methods of classifying faces that would alleviate this problem by not rendering the interior faces. Implementing this solution is an immediate area of future work.

Satisfied at least that our merge implementation was close to optimal, we move on to the larger SLAC mesh. The first good image from the exterior of the SLAC mesh is actually Figure 12. The Stanford Linear Accelerator is a long apparatus, so there is less detail to see when the entire device is visible. Focusing on this section shows details of the mesh, since triangles are still individually distinguishable, and produces a more interesting frame. Figure 12 also shows the perspective effects near the far right of the frame. The full SLAC mesh rendered in 1080p resolution is shown in Figure 9. The maximum local rendering work done for the SLAC mesh is rendering 22660 surface triangles in 0.439738 seconds. The merging operation, used 6 times for 2^6 = 64 parts, took a total of 0.715846 seconds to run.

In order to stress the parallel rendering system to its fullest potential, we construct a rendering input sufficient to conduct a test using 1024 processes. Unlike before, we are forced to overcommit processes to cores due to machine partition limitations, which means we used 2 processes per core, 32 processes per node, and 32 nodes. In order to create enough mesh entities to render, we instance the SLAC mesh 16 times in a rectangular tiled pattern. Each of 16 groups of 64 processes renders an instance of the SLAC mesh, and the images from all 1024 processes are merged as before, again in 1080p resolution. The resulting image is shown in Figure 13. This figure took exactly the same time to locally render, since no process had more work than before. The merging required 10 binary steps for 2^10 = 1024 processes, and took a total of 2.657428 seconds to execute. Notice that in both this result and the single SLAC mesh result, individual triangles become smaller than a pixel, and the output produced is still quite smooth, so our method handles this situation well.

Finally, although the majority of these renderings used huge meshes and 1080p resolution, we would also like to examine the efficiency of this technique for fast rendering of low-resolution images, while still focusing on large meshes. We begin by confirming the predicted relationship between merge time and pixel count by rendering the SLAC mesh at all the standard resolutions with a 4:3 aspect ratio. Note that this is again using 64 processes and thus 6 serial pairwise merge operations. Figure 7 shows the timing results for this test.

For a resolution of 320 by 200 pixels, the merge operation requires only 0.027 seconds to merge 64 images across 4 BGQ nodes. This figure suggests that our merging technique uses approximately 76 nanoseconds per pairwise merge per pixel. A similar calculation using the 1024-process case estimates about 85 nanoseconds per pairwise merge per pixel. Using a conservative estimate of 100 nanoseconds per pairwise merge per pixel, we can estimate that a 200 by 200 pixel image could be created from 1024 local images in just 0.04 seconds.


Figure 7: Merge time over pixel count for standard resolutions

Likewise, we can estimate, based on the SLAC mesh renderings, that triangles are rendered on average in about 20 microseconds. Based on these estimates, we predict that a mesh partitioned to a maximum of 1000 triangles per process could be locally rendered in 0.02 seconds. Combining these, a 1 million triangle mesh partitioned into 1024 parts could be rendered in 0.06 seconds, which gives a frame rate close to 16 frames per second. Pushing the frame rate limit of our technique is an area of immediate future work.
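For clarity, the combined estimate can be written out explicitly, using the figures quoted above (roughly 20 µs per locally rendered triangle, a conservative 100 ns per pixel per pairwise merge, and ⌈lg 1024⌉ = 10 merge steps):

```latex
t_{\mathrm{frame}} \approx
  \underbrace{1000 \times 20\,\mu\mathrm{s}}_{\text{local render}}
  + \underbrace{10 \times (200 \times 200) \times 100\,\mathrm{ns}}_{\text{merge}}
  = 0.02\,\mathrm{s} + 0.04\,\mathrm{s} = 0.06\,\mathrm{s},
```

or roughly 16 frames per second.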

6 Contributions

This project was completed by the authors over the course of about one month. Dan Ibanez developed the 2D and 3D rendering system for primitives. Bill Tobin developed mesh rendering code based on primitive rendering, and implemented the parallel image merging algorithm using MPI.

7 Conclusion

We have presented an implementation of the graphics pipeline that executes directly on an IBM Blue Gene/Q, as well as an image merging technique which allows parallel rendering using a single reduction. This method has been tested with meshes composed of millions of interior tetrahedra and surface triangles partitioned across up to a thousand processes. The output images are smooth enough for many scientific and graphical applications. The rendering times are competitive with other methods of in-situ visualization, and this seems a promising foundation for interactive in-situ rendering on machines like the BGQ in the future.

There are many avenues of future work for this system. Most importantly, we would like to tune the system for high frame rates using meshes with fewer triangles per part. Given interactive frame rates, the next technical challenge is establishing a communication link to interact with the BGQ in near real time. The mesh rendering technique could also be improved, including using partition model classification to eliminate harmful Z-fighting and implementing more advanced shading for graphical applications.

References

[1] “libpng.txt - A description on how to use and modify libpng,” 2010.

[2] N. Fabian, K. Moreland, D. Thompson, A. Bauer, P. Marion, B. Geveci, M. Rasquin, and K. Jansen, “The ParaView coprocessing library: A scalable, general purpose in situ visualization library,” in Large Data Analysis and Visualization (LDAV), 2011 IEEE Symposium on, pp. 89–96, 2011.

[3] D. Ellsworth, B. Green, C. Henze, P. Moran, and T. Sandstrom, “Concurrent visualization in a production supercomputing environment,” IEEE Transactions on Visualization and Computer Graphics, vol. 12, pp. 997–1004, Sept. 2006.

[4] T. Tu, H. Yu, L. Ramirez-Guzman, J. Bielak, O. Ghattas, K.-L. Ma, and D. R. O’Hallaron, “From mesh generation to scientific visualization: an end-to-end approach to parallel supercomputing,” in Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC ’06, (New York, NY, USA), ACM, 2006.

[5] H. Yu, K.-L. Ma, and J. Welling, “A parallel visualization pipeline for terascale earthquake simulations,” in Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, SC ’04, (Washington, DC, USA), pp. 49–, IEEE Computer Society, 2004.

[6] A. Stompel, K.-L. Ma, E. B. Lum, J. Ahrens, and J. Patchett, “SLIC: scheduled linear image compositing for parallel volume rendering,” in Proceedings of the 2003 IEEE Symposium on Parallel and Large-Data Visualization and Graphics, p. 6, IEEE Computer Society, 2003.

[7] K.-L. Ma, J. S. Painter, C. D. Hansen, and M. F. Krogh, “Parallel volume rendering using binary-swap compositing,” Computer Graphics and Applications, IEEE, vol. 14, no. 4, pp. 59–68, 1994.


8 Figures

Figure 8: A high-resolution rendering of the arterial mesh

Figure 9: A high-resolution rendering of the entire SLAC mesh


Figure 10: A rendering of the arterial mesh colored by processor

Figure 11: A rendering of the SLAC mesh colored by processor


Figure 12: A high-resolution closeup of the SLAC mesh


Figure 13: 16 tiled instances of the SLAC mesh, 1024 processes
