Multi-screen Tiled Displayed, Parallel Rendering System for a Large Terrain Dataset

Ping Yin, Xiaohong Jiang, Jiaoying Shi and Ran Zhou

The International Journal of Virtual Reality, 2006, 5(4):47-54

Abstract—Real-time terrain rendering plays a very important role in many fields, such as GIS, virtual reality, and military simulation. With the rapid development of PC-level hardware, recent interest in parallel rendering systems has focused on PC clusters, which are much cheaper than expensive graphics workstations. In this paper a parallel rendering system is proposed for a large terrain dataset, and a multi-screen tiled display is used to show the rendering result. The retained-mode parallel architecture reduces the frequency of data transmission over the network, and the sort-first architecture makes task partitioning between PC nodes convenient. Terrain data are preprocessed and organized as a static quadtree that is GPU optimized. In order to process an out-of-core terrain dataset and meet real-time rendering requirements, several techniques are used, such as an LOD algorithm, view-frustum culling, and out-of-core data management. Experimental results show that the proposed parallel rendering system works well for a terrain dataset of more than one Gigabyte. In addition, both geometric and color calibrations are performed on the multi-screen tiled display, generating a truly seamless display with a large dynamic range.

Index Terms—Multi-screen tiled display, PC cluster, retained-mode, sort-first, terrain parallel rendering system.

I. INTRODUCTION

Nowadays the rapid growth of 3D data places higher demands on processing speed, data quantity, display size, refresh rate, etc. Traditionally, expensive specialized graphics machines are used to render extremely large datasets, but these machines are too expensive for popular use. On the other hand, the performance of PCs improves consistently, and in particular the development of commercial graphics cards even exceeds Moore's Law. Thus a PC cluster, which uses cost-effective PCs and a high-speed network as its hardware platform, is used more and more often as a substitute for an expensive specialized graphics machine. In addition, because of the limited resolution of a single projector, multiple projectors driven by a PC cluster can be used to construct a multi-screen tiled display that provides a high-resolution and extensible visual field [1].

Manuscript Received on October 8, 2006. This work is supported by the National Grand Fundamental Research 973 Program of China under Grant No. 2002CB312105 and the key NSFC project "Digital Olympic Museum" under Grant No. 60533080.

Ping Yin was a master student at the College of Computer Science, Zhejiang University, China 310027. (phone: 86-010-5896-3037; e-mail: [email protected])

Xiaohong Jiang is an associate professor at the College of Computer Science, Zhejiang University, China 310027. (e-mail: [email protected])

Jiaoying Shi is a professor and Ph.D. advisor at the State Key Lab of CAD&CG, Zhejiang University, China 310027. (e-mail: [email protected])

Ran Zhou was a master student at the College of Computer Science, Zhejiang University. (e-mail: ranzhou@microsoft.com)

Real-time rendering of a large terrain is an important aspect of Geographic Information Systems (GIS) as well as outdoor virtual applications, military simulation, etc. With current acquisition technology, such as satellite imaging, the size of an acquired terrain dataset goes far beyond the capacity of main memory, so we have to depend on secondary storage. Consequently, a parallel rendering system for large terrain data must be able to handle data stored in secondary storage at real-time speeds. In our system multiple techniques are used to achieve real-time performance, e.g. multi-resolution representation, view frustum culling, and out-of-core management.

There are two essential processes in a parallel rendering architecture: geometry processing and rasterization. Each geometric primitive must be assigned to a rendering node following some distribution algorithm; this distribution process is called a sort. Molnar [2] divided parallel rendering systems into three classes: sort-first, sort-middle, and sort-last. Under the sort-first architecture the display screen is partitioned into multiple tiles, with each tile assigned to one rendering node. During rendering, scene data are partitioned and sent to the render nodes associated with the designated tiles, and all the rendered pieces are combined to generate the final display. The advantage of sort-first is that each rendering pipeline is complete and independent, requiring little communication between render nodes. Thus the sort-first architecture is very appropriate for constructing a parallel rendering system on a PC cluster.

The sort-middle architecture redistributes the intermediate results of the rendering pipeline, which is seldom supported by the consumer-grade graphics cards of PCs. Thus implementing a sort-middle parallel rendering system on a PC cluster is not easy.

The sort-last architecture does not partition the screen into tiles. Rather, the scene dataset is partitioned into sub-datasets of nearly equal size, and each sub-dataset is assigned to a rendering node. The output of each rendering node is a full screen-size image with its own depth information, and all the output images are depth-composited to generate one final result. The workflow of sort-last parallel rendering is simple and seldom suffers from load imbalance. However, it requires a composition stage between the rendering stage and the display stage that tends to become a system bottleneck, and performance falls dramatically with increasing screen size.
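The sort-first distribution described above can be sketched as a routing step that buckets each primitive by the screen tiles its projected bounding box overlaps. This is a minimal illustration; the tile layout and primitive format below are hypothetical, not taken from the paper:

```python
def assign_to_tiles(primitives, tiles):
    """Sort-first routing: send each primitive to every render node
    whose screen tile overlaps the primitive's 2D bounding box."""
    buckets = {tile_id: [] for tile_id in tiles}
    for name, (xmin, ymin, xmax, ymax) in primitives.items():
        for tile_id, (tx0, ty0, tx1, ty1) in tiles.items():
            # standard axis-aligned rectangle overlap test
            if xmin < tx1 and xmax > tx0 and ymin < ty1 and ymax > ty0:
                buckets[tile_id].append(name)
    return buckets

# a hypothetical 2x2 tiled display wall of 800x600 pixels
tiles = {"tl": (0, 0, 400, 300), "tr": (400, 0, 800, 300),
         "bl": (0, 300, 400, 600), "br": (400, 300, 800, 600)}
```

Note that a primitive straddling a tile border is routed to several nodes and rendered more than once; this duplication is the classic overhead of sort-first systems.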

Parallel rendering systems can also be divided into two classes: immediate-mode and retained-mode [3]. Originally, immediate-mode and retained-mode were aspects of graphics libraries for a single PC. In immediate-mode architecture render nodes do not save scene data, and the application has to send scene data to the render nodes during rendering. This makes an immediate-mode system highly dependent on a high-speed network and limits the size of data it can handle. WireGL and Chromium are good examples of this kind of system. In a retained-mode parallel rendering system, the render nodes keep the scene data transferred from the application. Thus when rendering a new frame, only the changed data need to be transferred over the network, and the network bandwidth requirement is greatly reduced.

Our parallel rendering system for a terrain dataset aims at processing a large out-of-core dataset at real-time speed; hence the data transferred between PC nodes must be reduced as much as possible. Moreover, terrain data have low depth complexity, so load balance is not a serious problem. Therefore we adopt a sort-first, retained-mode parallel rendering architecture for real-time terrain rendering. An entire copy of the terrain dataset is stored in the secondary storage of each PC node, and during rendering each node needs only to process the terrain data that fall within its assigned tile of the entire display.

II. RELATED WORKS

WireGL [3] was the first sort-first parallel rendering system for a PC cluster, and it provides solutions to key problems of parallel rendering architecture. With WireGL, graphics applications can easily be run in parallel. Chromium [4], developed at Stanford University, is the successor of WireGL. It modularizes a parallel rendering system with SPUs (Stream Processing Units). Chromium is open source, and users can define new SPUs to extend the SPU library. Moreover, Chromium can easily construct different parallel rendering architectures, including sort-first, sort-last, and hybrid. Both WireGL and Chromium are immediate-mode systems and need to transfer data frequently over the network, limiting their parallel capability. Samanta et al. [5] designed a hybrid sort-first and sort-last parallel algorithm based on retained mode. The algorithm dynamically distributes 3D models to render nodes according to their view-dependent screen projections, thereby achieving a certain load balance. The algorithm is complex, and its distribution stage tends to be a system bottleneck. Peng et al. [6] simplified this algorithm and improved its efficiency. AnyGL [7] also implemented a hybrid sort-first and sort-last architecture that extends the distributed graphics system.

Real-time rendering algorithms for terrain datasets can be divided into three categories according to the time periods when they were developed:

1) In-core LOD (Level of Detail) algorithms: Most of these algorithms were proposed in the late 1990s, when the limitations of computer hardware restricted them to data resident in main memory. There are two main methods of terrain data construction. The first is a regular hierarchical structure [8, 9, 10] generated by a regular refinement scheme. The other is an irregular mesh [11, 12], which provides a better approximation for a given number of faces but is more difficult to manage.

2) Out-of-core LOD algorithms: With the improvement of computer capability and the growth of acquired datasets, more and more algorithms have focused on processing out-of-core datasets. In 2001 Lindstrom proposed the first out-of-core terrain visualization algorithm [13]. Since then, more out-of-core LOD algorithms have been proposed [14, 15]. However, most of these algorithms are not GPU optimized.

3) Graphics hardware optimized LOD algorithms: Most of the LOD algorithms in the previous two categories are too CPU-intensive and cause an unbalanced workload between the GPU and the CPU. Algorithms of this era therefore aim to make use of the GPU as much as possible [16, 17, 18, 19, 20]. Most are based on a preprocessed dataset. In our parallel rendering system, terrain data are constructed as a quadtree [21] in a preprocessing stage, which not only expedites GPU processing but also facilitates management of the terrain dataset.

III. HARDWARE

Fig. 1. The PC-cluster for computer graphics: 32 PC nodes.

TABLE 1: THE CONFIGURATION OF THE CLUSTER

  CPU            Intel XEON 2.4 GHz
  Main memory    Kingston 1 GB DDR
  Motherboard    Supermicro X5DAL-G
  Graphics card  nVidia GeForce FX5950
  Network card   Intel Pro 1000MT dual network card, plus an integrated Gigabit network card on the motherboard
  Switch         Enterasys Matrix N7

The State Key Lab of CAD&CG at Zhejiang University developed a PC cluster consisting of 32 PC nodes, as shown in Fig. 1. For the display we use back projection and build a super-high-resolution display system consisting of 15 LCD projectors. Fig. 2(a) shows the array of our projectors, and Fig. 2(b) shows the projected image before calibration.


IV. RETAINED-MODE PARALLEL ARCHITECTURE

The architecture of our system is shown in Fig. 3, including both the control node and the render nodes, which are connected by Gigabit Ethernet. Each render node is connected to a projector that projects its result onto a given tile of the display wall.

Fig. 2(a). The display array of projectors.

Fig. 2(b). The projection of the projector array without calibration.

The control node interacts with users and sends command instructions to the render nodes. Initially, the control node partitions the display wall into tiles and assigns each tile to a render node; after this partition the assignment does not change. The command instructions sent by the control node include both user instructions and synchronizing instructions.

Each render node is connected to a projector and stores a copy of the whole terrain dataset in its secondary storage. During rendering each render node only needs to process the data related to its assignment and project the result onto its assigned tile of the display wall. All of the projected tiles collectively generate the final high-resolution image.

Fig. 3 shows the projected image of four render nodes; our actual system has 15 render nodes and 15 projectors. Since the 15 render nodes are independent, synchronization is needed in order to generate a visually unified projected image. There are usually two synchronizing strategies: using CPU time, or using synchronizing command instructions issued by the control node. Since CPU time is not exactly the same on different PCs and the difference accumulates, CPU time is not the best choice for synchronization. In our system we use synchronizing command instructions. Each render node sends an instruction to the control node when it finishes rendering a frame. When the control node has received all 15 instructions, it sends a command instruction to the render nodes to begin rendering the next frame.
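The synchronizing-instruction protocol can be sketched as a simple barrier kept by the control node. This is a sketch only; message passing over the actual cluster network is abstracted away, and the class name is ours:

```python
class FrameSync:
    """Control-node barrier: collect a 'frame done' instruction from
    every render node, then release all nodes for the next frame."""
    def __init__(self, n_nodes):
        self.n_nodes = n_nodes
        self.finished = set()
        self.frame = 0

    def report_done(self, node_id):
        """A render node reports that it finished the current frame.
        Returns True when the control node should broadcast the
        'begin next frame' command instruction."""
        self.finished.add(node_id)
        if len(self.finished) == self.n_nodes:
            self.finished.clear()
            self.frame += 1
            return True
        return False
```

With 15 render nodes, the control node would call `report_done` for each incoming instruction and broadcast the next-frame command when it returns True.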

Fig. 3. The architecture of the parallel rendering system.

The main bottleneck of our parallel rendering system lies in the performance of the render nodes, because each render node needs to handle large amounts of terrain data at real-time speeds. In order to process data that exceed main memory, we organize both terrain geometry and texture data as a static quadtree in a preprocessing stage. The root of the quadtree holds low-resolution data for the entire terrain, and the leaf nodes hold the highest-resolution data. Each node of the static quadtree is organized as a triangle strip. Constructing the data as a static quadtree not only optimizes the terrain data for the GPU, but also facilitates data management. To further improve real-time rendering, several techniques are exploited in our parallel rendering system, including multi-resolution representation, view frustum culling, and out-of-core management. We now elaborate on each of these three techniques.
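The static quadtree organization can be sketched as follows. Field names are illustrative; the real nodes would also carry vertex buffers, triangle strips, and error bounds:

```python
class QuadNode:
    """One tile of the static terrain quadtree: the root holds the
    coarsest resolution of the whole terrain, leaves the finest."""
    def __init__(self, level, x, y):
        self.level, self.x, self.y = level, x, y
        self.children = []            # four children, empty at leaves

def build_quadtree(level, max_level, x=0, y=0):
    node = QuadNode(level, x, y)
    if level < max_level:
        # each child covers one quadrant at twice the resolution
        node.children = [build_quadtree(level + 1, max_level, 2*x + dx, 2*y + dy)
                         for dy in (0, 1) for dx in (0, 1)]
    return node

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children)
```

Because the tree is built once in preprocessing, node addresses are stable, which is what makes the simple (level, x, y) keys usable for out-of-core paging.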


1) Multi-resolution: Given the display resolution of each projector, we use high-resolution data near the viewer and lower-resolution data farther away. Screen-space error is also a factor in choosing the data resolution. Using a multi-resolution representation greatly cuts down the amount of data rendered. In order to eliminate the visual popping artifacts caused by switching between resolution levels, geomorphing is adopted. In addition, the "skirt" technique is used to address the T-crack problem.
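The level-of-detail decision can be sketched by projecting a node's geometric error into screen space and refining while the projected error exceeds the tolerance. The formulas below are a common textbook form rather than the paper's exact ones, and the field of view and viewport width are hypothetical:

```python
import math

def screen_space_error(geom_error, distance, fov_deg=60.0, viewport_w=1024):
    """Project a node's geometric error (in world units) into pixels,
    using the perspective scaling factor of the viewing frustum."""
    k = viewport_w / (2.0 * math.tan(math.radians(fov_deg) / 2.0))
    return geom_error * k / max(distance, 1e-6)

def needs_refinement(geom_error, distance, tolerance_px=2.0):
    """Descend to the node's children while the projected error is
    still larger than the allowed screen-space error."""
    return screen_space_error(geom_error, distance) > tolerance_px
```

The same projected error can drive geomorphing: blending vertex positions between two levels as the error approaches the tolerance removes the popping that a hard switch would cause.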

2) View frustum culling: In a rendered terrain scene, one perceives only a limited field of view, so it is not necessary for each render node to output the entire terrain scene. Rather, only the image within the given field of view is required, and unrelated terrain data are culled. View frustum culling also greatly reduces the amount of data to render. In general, culling can be divided into two categories: view-frustum culling and occlusion culling. Occlusion culling is usually used for scenes with high depth complexity; since terrain data have low depth complexity, we only use view frustum culling in our parallel rendering system.
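Frustum culling of quadtree nodes can be sketched with the usual plane/AABB rejection test; this is a generic sketch, as the paper does not give its exact test:

```python
def aabb_outside_plane(box_min, box_max, plane):
    """True when the box lies entirely in the plane's negative half-space."""
    a, b, c, d = plane
    # pick the box corner farthest along the plane normal (the 'p-vertex')
    px = box_max[0] if a >= 0 else box_min[0]
    py = box_max[1] if b >= 0 else box_min[1]
    pz = box_max[2] if c >= 0 else box_min[2]
    return a*px + b*py + c*pz + d < 0

def frustum_cull(box_min, box_max, planes):
    """Cull the terrain node when its bounding box is fully outside
    any of the frustum planes (typically six)."""
    return any(aabb_outside_plane(box_min, box_max, p) for p in planes)
```

Applied during the quadtree traversal, a culled node also prunes its entire subtree, which is where most of the data reduction comes from.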

3) Out-of-core data management: Since the size of the terrain data exceeds the capacity of main memory, we have to resort to secondary storage. Data exchange between main memory and secondary storage is the bottleneck of the real-time parallel rendering system. To reduce its effect, we need to manage terrain data loading and unloading efficiently, which we do by data prefetching. The organization of the terrain data as a static quadtree facilitates out-of-core data management. When rendering a new frame, we traverse the quadtree from the root and compute the desired resolution level according to the distance from the viewpoint and the screen-space error. If the desired-resolution data are in main memory, we put them into the graphics pipeline. If not, a request to load the data into main memory is issued. However, we do not suspend the rendering thread until the required data arrive; instead, the parent or ancestor node cached in main memory is used for rendering. In addition, speculative prefetching is performed using temporal and spatial coherence: when we visit a parent tree node, it is likely that we shall next visit its children or its neighbors, so prefetching loads these data into main memory in advance. Both data loading and data prefetching are implemented in a separate thread, which makes the best use of the dual CPUs in our PC cluster.
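The render-from-ancestor fallback and the load queue can be sketched with a small resident-tile cache. The class, the LRU policy, and the (level, x, y) keys are our illustrative choices, not details from the paper:

```python
from collections import OrderedDict

class TileCache:
    """In-memory cache of terrain tiles with coarse-to-fine fallback:
    if the requested tile is not resident, queue a load request for the
    I/O thread and render from the nearest resident ancestor instead."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()   # tile key -> data, in LRU order
        self.pending = []               # load requests for the I/O thread

    def load(self, key, data):
        self.resident[key] = data
        self.resident.move_to_end(key)
        while len(self.resident) > self.capacity:
            self.resident.popitem(last=False)   # evict least recently used

    def fetch(self, key, parent_of):
        node = key
        while node is not None:
            if node in self.resident:
                self.resident.move_to_end(node)
                if node != key:
                    self.pending.append(key)    # schedule the real target
                return self.resident[node]
            node = parent_of(node)
        self.pending.append(key)
        return None

def parent_of(key):
    """Quadtree parent of a (level, x, y) tile key."""
    level, x, y = key
    return (level - 1, x // 2, y // 2) if level > 0 else None
```

A separate I/O thread would drain `pending`, call `load`, and could also push children and neighbors of recently fetched tiles as the speculative prefetch.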

V. CALIBRATION

In the multi-screen tiled display wall the most important issue is geometric calibration. Each projector should be calibrated for position and orientation, so that the whole display wall is seamless and resembles a single display device. The left and right pictures in Fig. 4 show the projected result on a 2x2 display wall before and after calibration. On the left, because the projection is distorted, it is hard to make each projected tile a regular rectangle; moreover, a highlight exists in the overlap region between adjoining projectors. The right side of Fig. 4 shows the result after calibration.

For the calibration we present a geometric calibration algorithm using multiple matrices. First, we use a high-resolution digital camera to capture a series of feature points. Because the resolution of the digital camera is lower than that of the display wall, we use feature circles 5 pixels in diameter instead of points. Second, we use these feature circles to calculate the effective region on the display wall for each projector. Here we again use a 2x2 display wall for illustration: in Fig. 5, green, yellow, blue, and red lines form the boundaries of the regions covered by individual projectors, and the purple lines bound the effective region of each projector. Finally, we calculate the mapping matrices and map each effective region to the frame buffer.

Fig. 4. The result before and after calibration.

Fig. 5. The effective region calculated by feature circles.

We initially implemented single-matrix geometric calibration [23] on our 5x3 display wall, but the result contained pincushion distortion, and there was a small gap between adjoining projectors. We therefore use multiple matrices to map the image in the frame buffer to the display wall. Each effective region is divided into several smaller regions, and each small region uses its own mapping matrix, as in Fig. 6. The more small regions, the higher the precision; in our experiments, 16x16 small regions achieved good results. In such a multiple-matrix geometric calibration algorithm, every two adjoining projections should be seamless with no overlap. Such an algorithm is called a seamless calibration algorithm (Fig. 7), but a dark thin line may exist between two adjoining projections. The reason is that the edge of an effective region becomes a curve in the frame buffer, and the anti-aliasing performed by the GPU smooths the distortion, so some pixels become darker, creating the dark thin lines.
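One way to realize a per-region mapping matrix is to fit a small transform to each region from measured feature correspondences. The sketch below fits an affine transform from three point pairs via Cramer's rule; this is an illustration of the idea, and the paper's actual matrices may be projective rather than affine:

```python
def affine_from_3pts(src, dst):
    """Fit u = a*x + b*y + c (and likewise v) to three point pairs,
    solving the 3x3 system with Cramer's rule; one such matrix would
    be fitted per small screen region."""
    (x1, y1), (x2, y2), (x3, y3) = src
    det = x1*(y2 - y3) - y1*(x2 - x3) + (x2*y3 - x3*y2)

    def solve(u1, u2, u3):
        a = (u1*(y2 - y3) - y1*(u2 - u3) + (u2*y3 - u3*y2)) / det
        b = (x1*(u2 - u3) - u1*(x2 - x3) + (x2*u3 - x3*u2)) / det
        c = (x1*(y2*u3 - y3*u2) - y1*(x2*u3 - x3*u2) + u1*(x2*y3 - x3*y2)) / det
        return a, b, c

    return solve(*[p[0] for p in dst]), solve(*[p[1] for p in dst])

def apply_affine(m, pt):
    (a, b, c), (d, e, f) = m
    x, y = pt
    return a*x + b*y + c, d*x + e*y + f
```

With 16x16 regions per projector, each region's three (or four) measured feature circles determine its matrix, and the GPU can apply the per-region transforms as textured quads.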

Fig. 6 (a). Divide the effective region into more small ones.

Fig. 6 (b). The shape of the effective region in the frame buffer.

Fig. 7. The rendering result of seamless calibration algorithms. See Color Plate 16.

In order to solve this problem, an overlap calibration algorithm is needed. In such an algorithm adjoining regions are made to overlap, and highlighted regions appear, as in Fig. 8(a). We then use alpha blending in the highlighted regions to make the transition between adjoining images natural, as in Fig. 8(b). In our system both the geometric calibration and the alpha blending are implemented on the GPU, so performance is almost unaffected.
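The alpha blending in the overlap region can be sketched as a linear ramp whose weights for the two adjoining projectors always sum to one. This is a sketch; real projector blending typically also compensates for the projector's gamma, which we omit:

```python
def overlap_alpha(x, overlap_start, overlap_end):
    """Linear alpha ramp across the overlap: 1 inside the projector's
    own region, falling to 0 at its far edge."""
    if x <= overlap_start:
        return 1.0
    if x >= overlap_end:
        return 0.0
    return (overlap_end - x) / (overlap_end - overlap_start)

def blended_intensity(left_pixel, right_pixel, x, o0, o1):
    """Combine the two projectors' contributions in the overlap; the
    complementary weights remove the highlight of doubled intensity."""
    a = overlap_alpha(x, o0, o1)
    return a * left_pixel + (1.0 - a) * right_pixel
```

Evaluated per pixel in a fragment program, the ramp costs one multiply-add, which is why the performance impact is negligible.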

VI. EXPERIMENTAL RESULTS

A parallel rendering system for a large terrain dataset is implemented on our PC cluster. One control node and 15 render nodes are connected by Gigabit Ethernet, and each render node is connected to an LCD projector. The configuration of the PC cluster is presented in Table 1. The operating system on each PC node is Windows Server 2003. During initialization the control node sends a configuration file, including the job assignment and the calibration parameters, to the render nodes. After that the user can interact with the control node and navigate through the rendered terrain scene.

The rendering performance may be affected by many factors, including the maximum allowed screen-space error, the location of the viewpoint, and the resolution of the terrain data. For example, under otherwise identical conditions, the smaller the tolerated screen-space error, the higher the resolution of the terrain data rendered, and hence the more triangles. Fig. 9 shows the rendering results for two different screen-space errors; the lines in the picture show the bounding boxes of the terrain tree nodes, each containing nearly the same number of triangles. Table 2 shows the experimental results for the four terrain models when the maximum allowed screen-space error is 2.0. Here frames per second are measured as an average. More rendered images are shown in Fig. 10, Fig. 11, and Fig. 12, for the terrain models Puget, Island, and Kauai, respectively. The rendered image of the terrain model Catcliff is shown in Fig. 8.

Fig. 8(a). The highlighted regions of overlap.

Fig. 8(b). The rendering result after alpha blending.

TABLE 2: THE EXPERIMENTAL RESULTS OF FOUR TERRAIN MODELS

VII. CONCLUSION

In this paper we investigated various real-time terrain rendering algorithms and parallel rendering system architectures, and proposed a parallel rendering system for a large terrain dataset in a PC cluster environment. The technique of multi-screen tiled display is used to show the rendering results. The retained-mode parallel architecture reduces the frequency of data transmission over the network, and the sort-first architecture is convenient for task partitioning between PC nodes. Terrain data are preprocessed and organized as a static quadtree that is GPU optimized. In order to process an out-of-core terrain dataset and achieve real-time rendering, several techniques were used, e.g. LOD algorithms, view frustum culling, and out-of-core data management.

Experimental results show that the proposed parallel rendering system achieves real-time performance for a terrain dataset larger than one Gigabyte.

Fig. 9(a). The rendering result of a single render node when the maximum allowed screen space error is 0.5f. See Color Plate 17.

Fig. 9(b). The rendering result of a single render node when the maximum allowed screen space error is 4.0f. See Color Plate 17.

Fig. 10. The rendering result of terrain model Puget.

Fig. 11. The rendering result of terrain model Island.

Fig. 12. The rendering result of terrain model Kauai. See Color Plate 18.

  Terrain model   Geometric data (MBytes)   Texture data (MBytes)   Frames/sec
  Catcliff        111.610                   12.289                  20
  Kauai           503.862                   12.289                  20
  Island          1873.532                  49.085                  30
  Puget           1909.345                  49.085                  20


Geometric calibration and color calibration performed on the multi-screen tiled display generate a truly seamless display with high resolution. In the future we would like to expand our system to handle scenes with high depth complexity and to implement real-time load balancing.

REFERENCES

[1] R. Zhou, H. Lin and J. Shi. Geometric Calibration on Tiled Display Using Multiple Matrices. Poster, Pacific Graphics 2005.

[2] S. Molnar et al. A Sorting Classification of Parallel Rendering. IEEE Computer Graphics and Applications, vol. 14, no. 4, pp. 23-32, 1994.

[3] G. Humphreys and M. Eldridge. WireGL: A Scalable Graphics System for Clusters. In: Computer Graphics Proceedings, Annual Conference Series, ACM SIGGRAPH, pp. 129-140, Los Angeles, California, 2001.

[4] G. Humphreys et al. Chromium: A Stream Processing Framework for Interactive Rendering on Clusters. Computer Graphics Proceedings, Annual Conference Series, ACM SIGGRAPH, 2002.

[5] R. Samanta, T. Funkhouser and K. Li. Sort-First Parallel Rendering with a Cluster of PCs. Sketch at SIGGRAPH 2000, New Orleans, Louisiana, July 2000.

[6] H. Peng, Z. Jin and J. Shi. An In-the-Core Parallel Graphics Rendering System for Extremely Large Data Sets Based on Retained-mode. Journal of Software, supplement, pp. 222-229, 2004.

[7] C. J. Kaufman. Rocky Mountain Research Lab., Boulder, CO, private communication, May 1995.

[8] P. Lindstrom, D. Koller, W. Ribarsky, L. Hodges and N. Faust. Real-time, Continuous Level of Detail Rendering of Height Fields. ACM SIGGRAPH, pp. 109-118, 1996.

[9] M. Duchaineau, M. Wolinsky, D. Sigeti, M. Miller, C. Aldrich and M. Mineev-Weinstein. ROAMing Terrain: Real-time Optimally Adapting Meshes. Proceedings of IEEE Visualization, pp. 81-88, 1997.

[10] R. Pajarola. Large Scale Terrain Visualization Using the Restricted Quadtree Triangulation. Proceedings of Visualization, pp. 19-26, 1998.

[11] P. Cignoni, E. Puppo and R. Scopigno. Representation and Visualization of Terrain Surfaces at Variable Resolution. The Visual Computer, vol. 13, no. 5, pp. 199-217, 1997.

[12] H. Hoppe. View-dependent Refinement of Progressive Meshes. ACM SIGGRAPH, pp. 189-198, 1997.

[13] P. Lindstrom. Visualization of Large Terrains Made Easy. Proceedings of IEEE Visualization, California, 2001.

[14] P. Lindstrom and V. Pascucci. Terrain Simplification Simplified: A General Framework for View-dependent Out-of-core Visualization. IEEE Transactions on Visualization and Computer Graphics, vol. 8, no. 3, pp. 239-254, 2002.

[15] X. Bao and R. Pajarola. LOD-based Clustering Techniques for Optimizing Large-scale Terrain Storage and Visualization. Proceedings of the SPIE Conference on Visualization and Data Analysis, pp. 225-235, 2003.

[16] B. D. Larsen and N. J. Christensen. Real-time Rendering Using Smooth Hardware Optimized Level of Detail. WSCG 2003.

[17] P. Cignoni, F. Ganovelli, E. Gobbetti, F. Marton, F. Ponchio and R. Scopigno. BDAM - Batched Dynamic Adaptive Meshes for High Performance Terrain Visualization. Computer Graphics Forum, vol. 22, no. 3, 2003.

[18] P. Cignoni, F. Ganovelli, E. Gobbetti, F. Marton, F. Ponchio and R. Scopigno. Planet-sized Batched Dynamic Adaptive Meshes (P-BDAM). Proceedings of IEEE Visualization, 2003.

[19] F. Losasso and H. Hoppe. Geometry Clipmaps: Terrain Rendering Using Nested Regular Grids. ACM SIGGRAPH, pp. 769-776, 2004.

[20] C. Dachsbacher and M. Stamminger. Rendering Procedural Terrain by Geometry Image Warping. Eurographics Symposium on Rendering, 2004.

[21] P. Yin and J. Shi. Cluster Based Real-time Rendering System for Large Terrain Dataset. Proceedings of Computer Aided Design and Computer Graphics, pp. 365-370, 2005.

[22] T. Ulrich. Rendering Massive Terrains Using Chunked Level of Detail Control. Course notes, "Super-size it! Scaling up to Massive Virtual Worlds", SIGGRAPH 2002.

[23] C. Li, H. Lin and J. Shi. Multi-Projector Tiled Display Wall Calibration with a Camera. Proceedings of the IS&T/SPIE 17th Annual Symposium on Electronic Imaging, San Jose, California, USA, 2005.

Ping Yin received her Bachelor's and Master's degrees from the Department of Computer Science and Technology at Zhejiang University in 2004 and 2006, respectively. Her research interests as a graduate student were computer graphics and computer architecture. She currently works in the Search Technology Group at Microsoft.

Xiaohong Jiang received the B.Sc. and M.Eng. degrees in computer science from Nanjing University in 1988 and 1991, respectively, and the Ph.D. degree in computer science from Zhejiang University in 2003. She has worked in the Department of Computer Science and Engineering at Zhejiang University since August 1991 and is currently an associate professor at the College of Computer Science. Her main research interests are distributed systems, image processing, and computer graphics.

Jiaoying Shi is a professor and Ph.D. advisor in the College of Computer Science and Engineering at Zhejiang University. He is the director of the Academic Committee of the State Key Lab of Computer Aided Design and Computer Graphics, deputy chairman of the China Image and Graphics Association, and deputy chairman of the China CAD and Graphics Society under the China Computer Federation. He is the Asia representative on the Education Committee of ACM SIGGRAPH. Since 1990 his work has concentrated on computer graphics, visualization in scientific computing, and virtual environments. He has published more than 100 papers and four books.

Ran Zhou received his Bachelor's degree from the Chu Kochen Honors College of Zhejiang University in 2004 and his Master's degree from the Department of Computer Science and Technology in 2006. His research interests as a graduate student were computer graphics and image processing. He currently works on the Exchange Team at Microsoft.