10
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 55, NO. 7, AUGUST 2008 1929 An Area Efficient Early -Test Method for 3-D Graphics Rendering Hardware Chang-Hyo Yu, Member, IEEE, Donghyun Kim, Member, IEEE, and Lee-Sup Kim, Senior Member, IEEE Abstract—In this paper, we propose a new early -test which re- quires a minimized internal memory while removing redundant and color reads as well as texture reads. The proposed method determines whether a pixel is screened by a certain mask plane which is containing the history of a pixel’s appearance in front of it. If a pixel is screened by the plane, it can be removed. Given an initial position, the method adaptively updates the plane position to maximize the rejected pixels. As a result, on average 39.9% of the memory bandwidth and 21.2% of total power consumption is saved with only a 256 B on-chip memory. The proposed method was implemented into a multimedia system-on-chip with 61 k ap- plication-specific integrated circuit (ASIC) gates using a 0.13- m CMOS process technology. Index Terms—3-D graphics hardware, depth test, early -test, rasterization engine (RE), rendering hardware, -test. I. INTRODUCTION G RAPHICS rendering processes for 3-D require lots of memory operations, including texture mapping, stencil tests, tests, color blending, and vertex data fetching. Be- cause of the large memory accesses, the performance of the rendering engine greatly depends on the bandwidth of the memory system. Among the memory operations, most of the operations are required for 3-D rasterization engine (RE) due to the per-pixel processing characteristics. In conventional high-performance rendering processors, texture mapping is performed before the -test [1], [2]. This architecture properly supports the semantics of the standard APIs, such as OpenGL [3], but the major disadvantage is the unnecessary memory operations of texture mapping and -test for hidden pixels, which waste a great deal of the memory bandwidth. Several studies have been conducted to resolve the memory bandwidth problem. A hierarchical -test [4], ATI’s architec- ture for their commercial products, keeps a reduced resolution map of the external -buffer on an on-chip memory and removes hidden pixels as early as possible; such pixels are removed away Manuscript received April 18, 2006; revised June 29, 2007. First published February 8, 2008; last published August 13, 2008 (projected). This work was supported in part by the University IT Research Center Project and Samsung Electronics. This paper was recommended by Associate Editor I. Verbauwhede. C.-H Yu was with the Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology (KAIST), Dae- jeon 305-701, Korea. He is now with Samsung Electronics, Gyeonggi-Do 446- 711, Korea. (e-mail: [email protected]). D. Kim was with the Department of Electrical Engineering and Computer Sci- ence, KAIST, Daejeon 305-701, Korea. He is now with Qualcomm, San Diego, CA 92121, USA (e-mail: [email protected]). L.-S. Kim is with the Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology (KAIST), Dae- jeon 305-701, Korea (e-mail: [email protected]). Digital Object Identifier 10.1109/TCSI.2008.918078 from the pipeline before texture mapping and other per-pixel processes. However, it requires a large on-chip memory to store the hierarchical -buffer [5]. Maintaining the on-chip buffer for every external frame-buffer update may also bring an excessive burden to the memory bandwidth. Mid-texturing [6] uses a two-stage -test operation. Mid- texturing focuses on reducing the texture operation only. The method uses the first -test before the texture mapping and the second -test after the texture mapping. Its two-stage -test shares the -buffer in the frame-buffer. Unlike the ATI’s method, midtexturing does not need an auxiliary on-chip -buffer for the first -test. However, the midtexturing method still has the overhead, which is caused by the long distance (about 50 cycles or more of pipeline depth [7]) between the two -tests. The distance can increase the miss-rate of the pixel ( ) cache. The ATI’s method tries to reduce the both texture reads and reads, while the midtexturing method reduces only the tex- ture reads. As the advanced techniques [8], [9] of the texturing methods are developed, the waste portion of the reads become serious as well as the texture reads. Therefore, the ATI’s ap- proach would be more effective as the 3-D graphics rendering hardware rapidly develops. In this paper, we present a method to reduce color reads as well as both texture and reads with a minimized buffer which requires only 1 or 2 bits per pixel for the storage, unlike the conventional 24 or 32 bits per pixel. We previously presented the basic algorithm [10], [11] conceptually. In this paper, we completed our algorithm and implemented it into a system-on- chip (SoC) [7], which integrated a full 3-D graphics pipeline. II. CONVENTIONAL -TEST AND EARLY -TEST A visibility problem, caused when a nearby pixel is eclipsed by other distant pixels in the same screen coordinate due to the rendering sequence, is solved by a hardwired -test [4], [6], [7], [12]. The -test method determines which pixel is visible by the comparison of -values. In other words, the method stores the information of the closest pixel from the eye-point, and the other pixels are removed. However, all the removed pixels are also passed through the RE operations till the -test. This wastes lots of the memory bandwidth for the hidden pixels. The opera- tions (texture read, read, and the processing resources for their operations) for those pixels are not necessary if we had known earlier that the pixel would be removed. To resolve this problem, there have been several studies to re- ject the hidden pixels as early as possible. Fig. 1 shows the RE architecture of the conventional -test and the previous works. The previous methods put an additional -test on the rasterizer, 1549-8328/$25.00 © 2008 IEEE

An Area Efficient Early ${Z}$-Test Method for 3-D Graphics Rendering Hardware

  • Upload
    ngothuy

  • View
    212

  • Download
    0

Embed Size (px)

Citation preview

Page 1: An Area Efficient Early ${Z}$-Test Method for 3-D Graphics Rendering Hardware

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 55, NO. 7, AUGUST 2008 1929

An Area Efficient Early �-Test Methodfor 3-D Graphics Rendering Hardware

Chang-Hyo Yu, Member, IEEE, Donghyun Kim, Member, IEEE, and Lee-Sup Kim, Senior Member, IEEE

Abstract—In this paper, we propose a new early -test which re-quires a minimized internal memory while removing redundant

and color reads as well as texture reads. The proposed methoddetermines whether a pixel is screened by a certain mask planewhich is containing the history of a pixel’s appearance in front ofit. If a pixel is screened by the plane, it can be removed. Given aninitial position, the method adaptively updates the plane positionto maximize the rejected pixels. As a result, on average 39.9% ofthe memory bandwidth and 21.2% of total power consumption issaved with only a 256 B on-chip memory. The proposed methodwas implemented into a multimedia system-on-chip with 61 k ap-plication-specific integrated circuit (ASIC) gates using a 0.13- mCMOS process technology.

Index Terms—3-D graphics hardware, depth test, early -test,rasterization engine (RE), rendering hardware, -test.

I. INTRODUCTION

G RAPHICS rendering processes for 3-D require lots ofmemory operations, including texture mapping, stencil

tests, tests, color blending, and vertex data fetching. Be-cause of the large memory accesses, the performance of therendering engine greatly depends on the bandwidth of thememory system. Among the memory operations, most of theoperations are required for 3-D rasterization engine (RE) dueto the per-pixel processing characteristics. In conventionalhigh-performance rendering processors, texture mapping isperformed before the -test [1], [2]. This architecture properlysupports the semantics of the standard APIs, such as OpenGL[3], but the major disadvantage is the unnecessary memoryoperations of texture mapping and -test for hidden pixels,which waste a great deal of the memory bandwidth.

Several studies have been conducted to resolve the memorybandwidth problem. A hierarchical -test [4], ATI’s architec-ture for their commercial products, keeps a reduced resolutionmap of the external -buffer on an on-chip memory and removeshidden pixels as early as possible; such pixels are removed away

Manuscript received April 18, 2006; revised June 29, 2007. First publishedFebruary 8, 2008; last published August 13, 2008 (projected). This work wassupported in part by the University IT Research Center Project and SamsungElectronics. This paper was recommended by Associate Editor I. Verbauwhede.

C.-H Yu was with the Department of Electrical Engineering and ComputerScience, Korea Advanced Institute of Science and Technology (KAIST), Dae-jeon 305-701, Korea. He is now with Samsung Electronics, Gyeonggi-Do 446-711, Korea. (e-mail: [email protected]).

D. Kim was with the Department of Electrical Engineering and Computer Sci-ence, KAIST, Daejeon 305-701, Korea. He is now with Qualcomm, San Diego,CA 92121, USA (e-mail: [email protected]).

L.-S. Kim is with the Department of Electrical Engineering and ComputerScience, Korea Advanced Institute of Science and Technology (KAIST), Dae-jeon 305-701, Korea (e-mail: [email protected]).

Digital Object Identifier 10.1109/TCSI.2008.918078

from the pipeline before texture mapping and other per-pixelprocesses. However, it requires a large on-chip memory to storethe hierarchical -buffer [5]. Maintaining the on-chip buffer forevery external frame-buffer update may also bring an excessiveburden to the memory bandwidth.

Mid-texturing [6] uses a two-stage -test operation. Mid-texturing focuses on reducing the texture operation only. Themethod uses the first -test before the texture mapping andthe second -test after the texture mapping. Its two-stage

-test shares the -buffer in the frame-buffer. Unlike the ATI’smethod, midtexturing does not need an auxiliary on-chip

-buffer for the first -test. However, the midtexturing methodstill has the overhead, which is caused by the long distance(about 50 cycles or more of pipeline depth [7]) between the two

-tests. The distance can increase the miss-rate of the pixel ( )cache.

The ATI’s method tries to reduce the both texture reads andreads, while the midtexturing method reduces only the tex-

ture reads. As the advanced techniques [8], [9] of the texturingmethods are developed, the waste portion of the reads becomeserious as well as the texture reads. Therefore, the ATI’s ap-proach would be more effective as the 3-D graphics renderinghardware rapidly develops.

In this paper, we present a method to reduce color reads aswell as both texture and reads with a minimized buffer whichrequires only 1 or 2 bits per pixel for the storage, unlike theconventional 24 or 32 bits per pixel. We previously presentedthe basic algorithm [10], [11] conceptually. In this paper, wecompleted our algorithm and implemented it into a system-on-chip (SoC) [7], which integrated a full 3-D graphics pipeline.

II. CONVENTIONAL -TEST AND EARLY -TEST

A visibility problem, caused when a nearby pixel is eclipsedby other distant pixels in the same screen coordinate due to therendering sequence, is solved by a hardwired -test [4], [6], [7],[12]. The -test method determines which pixel is visible bythe comparison of -values. In other words, the method storesthe information of the closest pixel from the eye-point, and theother pixels are removed. However, all the removed pixels arealso passed through the RE operations till the -test. This wasteslots of the memory bandwidth for the hidden pixels. The opera-tions (texture read, read, and the processing resources for theiroperations) for those pixels are not necessary if we had knownearlier that the pixel would be removed.

To resolve this problem, there have been several studies to re-ject the hidden pixels as early as possible. Fig. 1 shows the REarchitecture of the conventional -test and the previous works.The previous methods put an additional -test on the rasterizer,

1549-8328/$25.00 © 2008 IEEE

Page 2: An Area Efficient Early ${Z}$-Test Method for 3-D Graphics Rendering Hardware

1930 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 55, NO. 7, AUGUST 2008

Fig. 1. RE pipeline: (a) of the conventional �-test, (b) the midtexturing, and (c) the hierarchical �-test.

which is the first stage of the RE. These methods are called Early-Test (EZT) because of their additional -test at the first stage

of the per-pixel operations. The midtexturing method [6] con-sists of the two stage -tests. The midtexturing uses an iden-tical -test for the first stage -test on the rasterizer. It requires amemory access to read a -value from the external -buffer, butcannot write back to the -buffer at the first -test due to the con-sistency problem. As shown in Fig. 1(b), texture-reads can be re-duced by the method. The hierarchical -test [4] uses a reducedresolution map of the -buffer on the external frame-buffer. Themap is a hierarchical -buffer made up of lower resolutions ofthe original -buffer. The method enables the -test with lowerresolution in the earlier pipeline stage and reduces both textureand -reads as shown in Fig. 1(c).

III. PROPOSED EARLY -TEST

The proposed method also uses a hierarchical conceptwherein a new filtering block performs an EZT in the rasterizer,but the method does EZT by a spatial plane (screen) not areduced resolution map of the -buffer.

A. Basic Algorithm [10]

The algorithm of our method is that a pixel can be rejected ifthe pixel is screened out by a plane. The plane has the informa-tion of that the pixel’s existence in front of the plane’s position.Fig. 2 represents a conceptual image of the RE pipelines withthe planes. Triangles received from the geometry engine (GE)are filled with pixels by the rasterizer at the first stage of the REpipelines. A set of pixels, generated from triangles A and B, arerendered in consecutive order. The pixels of the triangle A arein front of the plane, and the other pixels are behind the plane.In the case of Line1, the pixel of triangle A is tested by the EZTand passed to the next pipeline because the pixel is in front ofthe plane. The plane information is then updated. Then the pixelof triangle B is tested by the EZT and removed away from the

RE pipelines. The pixel of triangle B on Line2, however, failsthe EZT because there have never been pixels in front of theplane, and the plane information is not updated because the cur-rent pixel is behind the plane. If the -value of the current pixelis smaller than (a position of the plane), the value of that co-ordinate in the EZT buffer is set to “1.” This means the pixel iscloser to the eye (near plane) than to the . When the -valueof the pixel is larger than , the EZT buffer is not updated. Allthe values of the EZT buffer are set to be “0” initially.

The one-plane (1P) system, as shown in Fig. 2(a), only re-quires one bit per screen coordinate, and if the EZT buffer occu-pies two bits per coordinate, the EZT can have up to three planes.If we set the third plane of the 3P system as a special purposemask plane, right on the far plane (1.0) as shown in Fig. 2(b), theEZT has an additional effect. This is called two-plane skipping

-reads (2P-SZR) system. The special plane contains informa-tion on whether a pixel had been rendered or not. A conventional

-test always requires a -value from the external -buffer, evenwhen a pixel is rendered for the first time. In this case, the

-read operation is not necessary. The RE pipeline only writesa -value to the -buffer without reading the -value due to theinformation. Moreover, since color-reads take place with the

-writes, the color-reads are also reduced by the method.

B. Adaptive Updating Method [11]

The proposed EZT uses the mask plane to remove the pixels;therefore, the plane should be located at an optimal position soas to yield a maximum number of rejected pixels.

1) Characteristics of Pixels: 3-D object data are distributedby users at any region between the near-plane and the far-plane[0.0, 1.0], and there is no closed form of the pixel distribution.Therefore, we cannot have a deterministic distribution functionof pixels. The cumulative distribution function of the pixels,however, has a regular characteristic. The normalized densityfunction ( ) of an example input vector is shown in Fig. 3(a).In Fig. 3(b), there are two groups of pixels: front-side pixels

Page 3: An Area Efficient Early ${Z}$-Test Method for 3-D Graphics Rendering Hardware

YU et al.: AREA EFFICIENT EARLY -TEST METHOD 1931

Fig. 2. Conceptual picture of the proposed EZT method. (a) and (b) depict the 1P and 3P EZT systems, respectively.

Fig. 3. Pixel characteristics. (a) Pixel distribution from an example model. (b) Pixels classified into two groups by the plane position. (c) � can be classifiedinto � and � . � � � has a minimum near the intersection point of the � and � .

( ) and back-side pixels ( ), according to the plane posi-tion.

Let be an arbitrary position of -coordinates

(1)

According to (1), and always have a mono-tonic function of because there is no negative value of . Ifa plane is on , pixels can be divided into three parts, ,

, and as shown in Fig. 3(c). The , fartherthan , are divided into two classes by the plane. The first classis the sum of the surviving pixels ( ). These pixels have beentested by the EZT but could not be rejected. The second class isthe sum of the rejected pixels ( ) by the proposed EZT.

2) An Optimal Plane Position: The eventual purpose of ourwork is to obtain the largest .

(2)where, ,

.According to (2), the maximum can be obtained by

the minimum value of ( ), and the two pixelgroups, and , have the special characteristics according

to the plane position. The following two properties are derivedfrom the characteristics of these two pixel groups.

• Property 1: As the plane moves to the near-plane (), the EZT has more candidates due to the larger, but the plane has less information due to the less .

• Property 2: As the plane moves to the far-plane ( ),the plane has more information due to the larger but theEZT has less candidates due to the less .

Since these properties are mutually contradictory, and there isno deterministic equation of these two properties, we approxi-mate the value of so as to have equal probability. The positionis where

(3)

The is the point in which the and are thesame. Based on our simulation results, the approximate optimalposition, , closes to the ideal position with about 95.6% ofthe ideal rejection rate.

IV. HARDWARE ARCHITECTURE

The proposed EZT was implemented into an application [7]for an embedded 3-D graphics system. The method consists of

Page 4: An Area Efficient Early ${Z}$-Test Method for 3-D Graphics Rendering Hardware

1932 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 55, NO. 7, AUGUST 2008

Fig. 4. RE block diagram with the proposed EZT method.

Fig. 5. Block diagram of the proposed EZT main hardware.

two main blocks: the EZT main block and an adaptation block.Each block was implemented into the rasterizer and conven-tional -test block in the 3-D RE pipeline, respectively.

A. Early -Test Block in the Rasterizer

The proposed EZT main block is located at the rasterizer ofthe 3-D RE as shown in Fig. 4. The rasterizer generates fourpixels per cycle from the input triangle. The EZT block receivesthese four pixels each cycle and then processes the EZT. If somepixels from those four pixels are verified that they are screenedby the planes, the corresponding valid bits of the pixels are set tobe “invalid.” The valid bit of each pixel indicates the result of theEZT. The EZT main block consists of two major parts: filter andcache as shown in Fig. 5. The filter operates the main process ofthe EZT. The block reads a value of the plane and compares theincoming pixel’s -value with the plane position. According to

the result, the block determines whether transfer the pixels to thenext pipeline or not and updates the current plane information.The cache is designed with a fully set-associative eight-slot ar-chitecture, and manipulates the slots in a least-frequently-usedreplacement strategy. The ideal size of the on-chip buffer for theplane information is 75 kB at the 640 480 display resolution.The actual hardware condition, however, is not adequate for thearea overhead even for the higher resolution applications whichrequire more size of the on-chip buffer to store a single frame.Moreover, since the target system of our work is an embeddedsystem, we adopted a cache architecture.

B. Adaptation Block in the -Test Block

The adaptation unit estimates the next position of the plane inevery end of a frame. The plane tracks the point where theand the have the same numbers on the conventional -test

Page 5: An Area Efficient Early ${Z}$-Test Method for 3-D Graphics Rendering Hardware

YU et al.: AREA EFFICIENT EARLY -TEST METHOD 1933

TABLE ITABLE FOR ESTIMATING THE INPUT PIXEL DENSITY

Fig. 6. Region divided into four sections, two outer regions and two inner re-gions, according to ����� �.

block. The adaptation process is achieved by the following twostages.

• Determining the current density of the incoming pixels toadjust the variation of the plane position.

• Updating the next plane position by the relationships of theand with the variation.

1) Estimation of the Density: Since the various distributionsof pixels can cause an incorrect updating of the plane position,the adaptation block estimates the density of the incoming pixelsfirst. The normalized difference, as shown in (4), of the twopixel groups indicates the difference between the current planeposition and the position where indirectly

(4)

(5)

According to (4) and (5), we divide a region into two inner re-gions and two outer regions. The outer regions are far from theintersection point of and , and the inner regions are closeto the intersection point. Fig. 6 depicts the classified regions ofan example pixel distribution. A change from the previous re-gion to the current region gives us density information as shownin Table I.

2) Updating the Plane Position: The amount of the variationis decided in two ways:

(6)The first way is that the variation is a given large value,

, for the outer regions. The second way is that thevariation is calculated with a given small value, , and

the result of (4) for the inner regions. The inner regions usethe small variation, , with for fine tuning, while theouter regions use only the large variation, , for fast tracking.Since the distribution of the pixels changes each frame andthe 3-D object data can change abruptly, it is necessary toadjust the variation so as not only to shrink it but also to scaleit up. The adaptive block, however, uses the density of thedistribution only to shrink the variation. Instead of scaling upthe variation, we initialize the variation to the predefined initialvalue every 20 frames. This method may incur an unwantedinitiation of the variation to the predefined variation when thesystem tracks well into the capture range (inner regions) ofand . In this case, however, the calculated variation, whichis determined by the density and of the current and

, is already small and thus, the initialized variation does notaffect it significantly. Since the variation of the plane positionis not scaled up in any other cases, there is no divergence of theadaptation process.

Fig. 7 shows a block diagram of the EZT adapta-tion hardware architecture. The region classifier is im-plemented with two incrementors (for and ),two integer adders, and two leading-one detectors. In-stead of a divider, we use the leading-one detectors:

. Thedensity check block provides the current pixel density and re-gion information for the variation variables and the multiplexerbased on Table I, respectively. The lfactor and sfactor are thepredefined constants to shrink the large and small variations.The C indicates a comparator to determine the direction.

Fig. 8 shows the trace of the plane position from a test bench.The plane positions of the 2P-SZR and 3P EZTs are depictedwith their only one plane position: the second plane’s positionof their systems. To show the adaptation into a certain position,we rendered the model data on a position in ten frames as shownin Fig. 8(a). The plane converges to the optimal position in threeframes. In a dynamic rendering situation, a normal case of theconsecutive 360 frames is shown in Fig. 8(b), the EZT planefollows the optimal position throughout the whole frames. Theother planes of the 2P-SZR and 3P EZTs are placed in a pre-defined distance from the center plane which is determined bythe adaptation process. The predefined distance also is shrunkby the input pixel’s density.

V. PERFORMANCE RESULTS

A. Evaluation Environment

To evaluate the performance achieved in the 3-D graphicspipeline by using the proposed method, we used the embedded

Page 6: An Area Efficient Early ${Z}$-Test Method for 3-D Graphics Rendering Hardware

1934 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 55, NO. 7, AUGUST 2008

Fig. 7. Block diagram of the proposed EZT adaptation hardware.

Fig. 8. Plane position of the test model. (a) The static represents the plane position when the identical 10 frames are rendered. (b) The dynamic shows the case ofa general moving objects in 360 frames.

full 3-D graphics evaluation chip’s environment [7]. The chipintegrates the proposed EZT on its pipeline and caches for theexternal memory accesses. There are four caches to supporttexture, color, , and EZT operations. The specifications ofthe caches and the other evaluation environment are shown inTable II. The test models are shown in Fig. 9. Model (a) is themost complex model in which a coordinate on the screen has3.49 overlapping pixels on average, and the model (b) has 2.54.Model (c) is a commercial game software, Quake 3[13]. Finally,to test the worst case situation, model (d) has no overlappingpixels. It is just a large planar polygon textured by an image.

B. Overhead of the Proposed Method

To achieve as higher rejection rate as possible while main-taining the lowest overhead, the proposed method takes thecache architecture. Therefore, there is an overhead due to themiss penalty of the EZT cache which is used for reducing theon-chip buffer size. To find an appropriate slot number andcache-line size, we simulated the various configurations ofthe EZT cache. The miss rate and the burst length of the busarchitecture [14] lead us to determine the proper size. Fig. 10shows the simulation result of the various organizations of the

Page 7: An Area Efficient Early ${Z}$-Test Method for 3-D Graphics Rendering Hardware

YU et al.: AREA EFFICIENT EARLY -TEST METHOD 1935

Fig. 9. Test models and their depth (�) complexities.

TABLE IISPECIFICATIONS OF THE EVALUATION CHIP’S ENVIRONMENT

Fig. 10. Required memory bandwidth of various configurations of the on-chipEZT cache and its miss rate.

on-chip EZT cache. The required memory bandwidth (RMB)of the cache-miss operation can be calculated as follows:

(7)

A 166 Mpixels/s of pixel fill-rate is used for the metric. Thefill-rate comes from the maximum RE performance of the evalu-ation chip with a tri-linear texture filter. is the miss rate,and is the burst length of a cache line. The is thewrite back length of the cache replacement for a cache line. The128 pixels cache-line with the eight-slot combination shows thesmallest overhead at the 256 bytes of the on-chip cache size asshown in Fig. 10.

C. Analysis and Comparison of the EZTs

is the rejection rate of the EZT and the conventional-test with respect to the total number of pixels, and rep-

resents that how many pixels are rejected early on by the EZTamong the pixels that will be rejected in the end. The followingshows the detail components of the RMB of the EZT methods.

(8)Fig. 11 shows the performance results of the three types of

the proposed EZT. As the -complexity and the portion of thepixels which are rendered in front-to-back sequence increase,the bandwidth gain becomes higher. In the case of model (d),there is no overlapping pixel. It means that there is no bandwidthgain from any kinds of EZTs. Even in this case, the proposed2P-SZR reduces 43.6% of the memory bandwidth by elimi-nating the unnecessary and color reads from the RE pipeline.Even though the 3P system has a higher rejection rate, the max-imum bandwidth gain is achieved by the 2P-SZR system due tothe additional bandwidth gain from the unnecessary and colorreads. The 2P-SZR reduces the memory bandwidth up to 57.3%of the conventional -test as shown in Table III.

To compare our system with the previous systems, we useda widely known 3-D game, Quake 3, because all the previousworks used this model to test their performance. Fig. 12 showsa comparison of the three kinds of EZTs with the conventional

-test. The midtexturing performs well as the size of the cacheincreases, but the method reduces only the texture-reads andshows a less performance improvement when the cache is

Page 8: An Area Efficient Early ${Z}$-Test Method for 3-D Graphics Rendering Hardware

1936 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 55, NO. 7, AUGUST 2008

TABLE IIITHE EZT PERFORMANCE AND BANDWIDTH SAVING PORTION OF THE PROPOSED EZTS.

Fig. 11. The required memory bandwidth of the three types of the proposedEZT and the conventional �-test with a tri-linear texture filtering. (a) showsthe read accesses, and (b) shows the whole memory accesses in the 3-D REpipelines.

small. There is also an overhead which comes from the wideseparation between the first and second -reads, which causeeven worse performance degradation when the texture block re-quires a higher quality of filtering because the pipeline depth be-tween the two -reads becomes longer. The hierarchical -testalso performs well because both and texture reads are reducedby the EZT, but the method requires a large on-chip buffer, andwe did not consider the additional memory bandwidth over-head incurred by the on-chip buffer updating. The proposed2P-SZR shows the best performance in spite of the smallest areaof the and EZT caches. Since the -reads and color-reads oc-cupy 48.5%, 57.9%, and 67.8% of the memory bandwidth in theconventional -test pipeline, when the texture filtering methodsare tri-linear, bi-linear, and nearest, respectively, the proposed2P-SZR results in a significant reduction on every read accesses.Fig. 13 shows a comparison of the read-access components inthe RE memory operations. The midtexturing reduces only tex-ture-reads, and the hierarchical -test and the proposed 1P and

Fig. 12. The required memory bandwidth of the proposed EZTs and the pre-vious methods in Quake3 with a tri-linear texture filtering.

Fig. 13. The required memory bandwidth of the read-accesses in the REpipeline.

3P methods reduce both texture-reads and -reads. The pro-posed 2P-SZR reduces all of the read-accesses in the memoryoperations.

Page 9: An Area Efficient Early ${Z}$-Test Method for 3-D Graphics Rendering Hardware

YU et al.: AREA EFFICIENT EARLY -TEST METHOD 1937

TABLE IVTHE HARDWARE COSTS AND AVERAGE POWER CONSUMPTION OF THE PROPOSED 2P-SZR EZT.

Fig. 14. (a) The power consumption of the previous works and the proposed2P-SZR on Quake3 with the external memory bandwidth normalized to the con-ventional �-test. (b) Average power consumption of the proposed EZT config-urations with the required memory bandwidth.

D. Hardware Size and Power Dissipation

The midtexturing requires only one additional read-port onthe pixel ( ) cache, but it needs larger cache storage to over-come the degradation of hit rates when the cache size is small.The hierarchical -test requires about 14 kB of on-chip buffersize for the 640 480 screen resolution. The proposed EZT re-quires only a 256 B on-chip buffer while providing a compa-rable or better performance than the other methods. The pro-posed EZT method requires about 1.78% of the on-chip buffersize compared to the hierarchical -test. Table IV summarizesthe hardware size and power consumption of the proposed EZTcomponents. The proposed EZT accounts for 3.6% of the 3-Dgraphics core in hardware size and consumes additional powerof about 2.3% of the 3-D graphics core.

To compare the efficiency in the point of overall power con-sumption, the power dissipation of the 3-D graphics core andexternal memories is depicted in Fig. 14. The evaluation plat-form of the embedded 3-D graphics system consists of severalperipherals, but we measure only the 3-D graphics core amongthe full chip and external memories among the peripherals ofthe evaluation platform. There are two 512 Mbits DDR SDRAM[15] modules accessed as a pair of 32-bit bus width. Fig. 14(a)shows the overall power consumption of the previous worksand the proposed method in Quake3 with a tri-linear filteringalong with the reduction on the external memory bandwidth, andFig. 14(b) shows average results of the several configurations of

the proposed EZT method in the test models also with the re-sults of the memory bandwidth normalized to the conventional

-test method. The decreased occupation on the bus bandwidthreduces the accesses to external memories which leads to low-ered power dissipation, and on the other hand, the effectivelyexpanded bus bandwidth resolves the bottlenecks occurred bythe limited bus bandwidth.

VI. CONCLUSION

As more complex models become commonplace and displayresolutions become larger in the embedded 3-D system, itbecomes increasingly important to remove unnecessary opera-tions that waste memory bandwidth. In this paper, we presenta method that saves memory bandwidth in the 3-D graphicspipeline with a minimized area overhead. The proposed EZTis a spatial filter moving along the -space not a reducedresolution map of the screen area. The method reduces theunnecessary and color reads as well as the texture reads.The results show that 39.9% of the total memory bandwidthof the 3-D RE and 21.2% of the total power consumption canbe saved by the proposed EZT on average. It shows a verypromising result for embedded environments, where all thecomponents on the SoC must share the limited bus bandwidthand the power budget. The proposed EZT was implementedwith only 61 k ASIC gates in a 0.13- m CMOS technologyprocess and consumes 9.5 mW of power as a part of the 3-Dgraphics rendering engine.

REFERENCES

[1] L. Garber, “The wild world of 3-D graphics chips,” Computer, vol. 33,no. 9, pp. 12–16, Sep. 2000.

[2] T. Ikedo and J. Ma, “The Truga001: A scalable rendering processor,”IEEE Comput. Graph. Appl., vol. 18, no. 2, pp. 59–79, Mar./Apr. 1998.

[3] R. Kempf and C. Frazier, OpenGL Reference Manual. Reading , MA:Addison Wesley, 1996.

[4] S. Morein, “ATI Radeon – HyperZ technology,” in Proc. Hot3D Session2000 Eurograph. Workshop Comput. Graph. Hardw., Aug. 2000.

[5] N. Greene, M. Kass, and G. Miller, “Hierarchical �-buffer visibility,”in Proc. ACM SIGGRAPH ’93, Aug. 1993, pp. 231–238.

[6] W. C. Park, K. W. Lee, I. S. Kim, T. D. Han, and S. B. Yang, “An ef-fective pixel rasterization pipeline architecture for 3-D rendering pro-cessors,” IEEE Trans. Comput., vol. 52, pp. 1501–1508, Nov. 2003.

[7] D. Kim, K. Chung, C. H. Yu, C. H. Kim, I. Lee, J. Bae, J. Y. Kim,J. H. Park, S. Kim, Y. H. Park, N. H. Seong, J. A. Lee, J. Park, O.Stephen, S. W. Jeong, and L. S. Kim, “An SoC with 1.3 Gtexels/sec3-D graphics full pipeline engine for consumer applications,” IEEE J.Solid-State Circuits, vol. 41, no. 1, pp. 71–84, Jan. 2006.

[8] C. H. Kim, Y. H. Iim, and L. S. Kim, “Miss-rate reduction in texturecache by adaptive cache indexing,” Electron. Lett., vol. 40, no. 10, pp.597–598, May 2004.

[9] M. Cox, N. Bhandari, and M. Shantz, “Multi-level texture caching for3-D graphics hardware,” in Proc. IEEE Int. Symp. Comput. Arch. , Jul.1998, pp. 86–97.

Page 10: An Area Efficient Early ${Z}$-Test Method for 3-D Graphics Rendering Hardware

1938 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 55, NO. 7, AUGUST 2008

[10] C. H. Yu and L. S. Kim, “A hierarchical depth buffer for minimizingmemory bandwidth in 3-D rendering engine: Depth filter,” in Proc.IEEE Int. Symp. Circuits Syst., May 2003, vol. 2, pp. 724–727.

[11] C. H. Yu and L. S. Kim, “An adaptive spatial filter for early depth test,”in Proc. IEEE Int. Symp. Circuits Syst., May 2004, vol. 2, pp. 137–140.

[12] M. Imai, T. Nagasaki, J. Sakamoto, H. Takeuchi, H. Nagano, S.Iwasaki, M. Hatakenaka, J. Fujita, K. Keino, T. Motomura, T. Ueda,T. Niki, and H. Tomikawa, “A 109.5 mW 1.2 V 600 M texels/s3-D graphics engine,” in Proc. IEEE Int. Solid-State Circuits Conf.(ISSCC’04), Feb. 2004, vol. 1, pp. 332–333.

[13] Quake3, , 2001 [Online]. Available: http://www.idsoftware.com/games/quake/quake3-arena/

[14] “AMBA Specification (Rev 2.0),” ARM Ltd., Arlington Heights,IL, 1999 [Online]. Available: http://www.arm.com/products/solu-tions/AMBA_Spec.html

[15] “MT46V32M16,” Micron Technology, Inc., Boise, ID [Online]. Avail-able: http://download.micron.com/pdf/datasheets/dram/ddr/512MB-DDRx4x8x16.pdf

Chang-Hyo Yu (S’04-M’07) received the B.S.,M.S., degrees in electrical engineering and computerScience, and the and Ph.D. degree from KoreaAdvanced Institute of Science and Technology(KAIST), Daejon, Korea, in 2001, 2003, and 2007,respectively.

He is currently working on 3-D graphics acceler-ator architecture at Samsung Electronics, Giheung,Korea. His research interests include 3-D graphicshardware design and multimedia programmable pro-cessor design.

Donghyun Kim (S’04-M’07) received the B.S. andM.S. degrees in electrical engineering and the Ph.D.degree from Korea Advanced Institute of Scienceand Technology (KAIST), Daejeon, Korea, in 2000,2002, and 2006, respectively.

He worked on the design of three chips for 3-Dgraphics during his doctoral studies. He has studiedon floating point data-paths for 3-D graphics as apost-doctoral researcher in KAIST in 2007, and he isnow working on mobile 3-D graphics hardware de-sign at Qualcomm Inc. His research interests include

programmable 3-D graphics processor, floating point data-path, rasterizationand memory bandwidth reduction methodology.

Lee-Sup Kim (M’89-SM’05) received the B.S. de-gree in electronics engineering from Seoul NationalUniversity, Seoul, Korea, in 1982, and the M.S. andPh.D. degrees in electrical engineering from StanfordUniversity, Stanford, CA, in 1986 and 1990, respec-tively.

He was a Postdoctoral fellow at Toshiba Corpora-tion, Kawasaki, Japan, during 1990–1993, where hewas involved in the design of the high-performanceDSP and single-chip MPEG2 decoder. Since March1993, he has been with the Korea Advanced Institute

of Science and Technology (KAIST), Daejeon, Korea. In September 2002, hebecame a Professor. During the year of 1998, he was on the sabbatical leave withChromatic Research and SandCraft Inc. in Silicon Valley, CA. His research in-terests are 3-D graphics processing unit design and high-speed/low-power dig-ital integrated circuit design.