

Y.-K. Lai et al.: 3-D Graphics Processor Unit with Cost-Effective Rasterization Using Valid Screen Space Region 705

Contributed Paper Manuscript received 07/01/13 Current version published 09/25/13 Electronic version published 09/25/13. 0098 3063/13/$20.00 © 2013 IEEE

3-D Graphics Processor Unit with Cost-Effective Rasterization Using Valid Screen Space Region

Yeong-Kang Lai, Senior Member, IEEE, and Yu-Chieh Chung, Student Member, IEEE

Abstract —In order to render 3-D graphics efficiently, rasterization techniques have been developed. Traditional clipping techniques using the six planes of the view volume are complicated and not cost-effective. This paper develops a novel cost-effective strategy for primitive clipping in rasterization. Throughout the process, no expensive clipping action is required and no extra clipping-derived polygons are produced. It also presents the architecture of a 200-MHz multi-core, multi-thread 3-D graphics SoC in a 65nm 1P9M process with a core size of 4.97mm2 and a power consumption of 153.3mW. The proposed clip-less architecture in rasterization processes the valid screen space region of each primitive in eight cycles, with a gate-count of only 20k. In addition, the throughput can achieve up to 25 M Triangles/Sec.1

Index Terms — 3-D graphics processor, rasterization.

I. INTRODUCTION

In recent years, 3-D graphics applications have become increasingly popular, particularly in the consumer electronics market for 3-D graphics gaming applications on mobile devices. Notably, 3-D graphics-intensive applications are predicted to become widely available on a variety of portable mobile devices ranging from tablets to notebooks and smart phones [1]. These devices feature multi-core SoCs, higher display resolutions and tighter power budgets. Consequently, low-power design is an important strategy for remaining competitive in the arena of portable mobile consumer electronics.

OpenGL ES [2] aims at creating a new, compact and efficient application programming interface (API) that is mostly a subset of OpenGL, eliminating expensive functionality in order to meet the stringent requirements of a tight die-area budget, limited memory, short battery life and real-time execution performance. In OpenGL ES 2.0, shaders are the programmable components of the rendering pipeline. There are two types of shaders: vertex and fragment shaders. The vertex shader executes general operations on vertices, such as model-view transformations, texture coordinate generation, per-vertex lighting and the computation of values used to light each pixel. The fragment shader (sometimes called the pixel shader) executes general operations on fragments, such as texture blending, fog, alpha testing, dependent textures, pixel discard, bump and environment mapping. In sum, shaders create executable binary objects that later have direct influence over the OpenGL ES 2.0 pipeline. The remaining components of the rendering pipeline are fixed functions. There are two types of processing modules: the geometry module and the rendering module. The geometry module supports the primitive assembly, back-face culling, pre-clipping and clipping functions. The rendering module supports the rasterizer, the (alpha, depth and stencil) tests, color buffer blending and the dither process. In order to render 3-D graphics efficiently, rasterization techniques have been developed.

Traditional clipping techniques that use the six planes of the view volume to split off the outside part of a primitive are complicated and not cost-effective (30 cycles per clip plane) [3]. For high-resolution 3-D graphics gaming applications, cost-effective mobile device hardware is crucial, even necessary. This paper develops a novel cost-effective strategy for primitive clipping in rasterization. In the entire process, no expensive clipping action is involved and no extra clipping-derived polygons are produced. The proposed clip-less architecture in rasterization processes the valid screen space region of each primitive in eight cycles, and the gate-count is only 20k in a 65nm 1P9M process. Additionally, the throughput can reach up to 25 M Triangles/Sec.

1 This work was supported in part by the National Science Council, R.O.C., under Grant NSC 101-2220-E-005-004.

Yeong-Kang Lai is with the Department of Electrical Engineering, National Chung Hsing University, No. 250, Kuo Kuang Road, Taichung 402, Taiwan, R.O.C. (e-mail: [email protected]).

Yu-Chieh Chung is with the Department of Electrical Engineering, National Chung Hsing University, No. 250, Kuo Kuang Road, Taichung 402, Taiwan, R.O.C. (e-mail: [email protected]).

This enhanced paper uses conversions between different coordinate spaces to give a detailed derivation of the proposed algorithm, with geometric diagrams and formulas to support it [4]. Furthermore, a detailed introduction to the internal block diagram of the GPU hardware architecture and its internal functions is provided. In addition, this paper introduces the entire system's verification environment and the hardware/software co-simulation methods and verification processes. Finally, there is a discussion of hardware-rendered 3-D graphics showing the results of the implementation. This enhanced paper also introduces the proposed GPU architecture methodology for consumer handheld device 3-D applications. The remainder of this paper is organized as follows. Related work on clipping algorithms is discussed in Section II, and the concept creation and concept flow of the new clip-less algorithm follows in Section III. The proposed clip-less algorithm and its data flow are described in Section IV. The proposed 200-MHz multi-core, multi-thread 3-D graphics SoC architecture is discussed in Section V. Consumer applications that embed the 3-D graphics processor into a mobile device are discussed in Section VI. The implementation results of this 3-D graphics SoC are shown in Section VII. Finally, a summary is provided in Section VIII.

706 IEEE Transactions on Consumer Electronics, Vol. 59, No. 3, August 2013

II. RELATED WORKS

The OpenGL ES system often involves a hardware/software co-design methodology to reach full optimization goals. The research work of [5] aims at an SoC design geared in this direction. GPU architecture design is widely investigated in both academic research and industrial applications. The Larrabee architecture integrates high-performance computing and graphics applications into a single processor [6]. The GRAMPS project at Stanford develops a programming model allowing user-defined 3-D graphics pipelines [7]. Clipping is an important task in the projection process but often incurs considerable computation overhead. Notably, there are two main steps to clipping [8]. The first step (pre-clipping) divides triangles into three categories: Triangle Accept Case (TAC), Triangle Reject Case (TRC) and Triangle Clip Case (TCC). After the pre-clipping hardware filters out those triangles that are completely outside the visible range (the TRC), the next main step (clipping) need only deal with triangles that partially intersect the visible range (the TCC), while triangles completely inside the visible range (the TAC) can be output directly to the rendering module.
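The two-step pre-clipping classification described above can be sketched with the classic outcode test against the view volume |x|, |y|, |z| <= w. This is an illustrative software model, not the paper's hardware interface; the function and bit assignments are assumptions.

```python
# Hypothetical sketch of TAC/TRC/TCC pre-clipping classification.
# Bit assignments and names are illustrative only.

def outcode(x, y, z, w):
    """6-bit outcode: one bit per clip plane of the view volume."""
    code = 0
    if x < -w: code |= 1   # left
    if x >  w: code |= 2   # right
    if y < -w: code |= 4   # bottom
    if y >  w: code |= 8   # top
    if z < -w: code |= 16  # near
    if z >  w: code |= 32  # far
    return code

def classify_triangle(v0, v1, v2):
    """Return 'TAC' (fully inside), 'TRC' (trivially rejected),
    or 'TCC' (straddles a clip plane and needs clipping)."""
    c0, c1, c2 = (outcode(*v) for v in (v0, v1, v2))
    if c0 | c1 | c2 == 0:
        return 'TAC'          # all vertices inside: pass through
    if c0 & c1 & c2 != 0:
        return 'TRC'          # all outside one common plane: reject
    return 'TCC'              # mixed: send to the clipping stage
```

Only the TCC triangles reach the expensive clipping stage; the TAC triangles flow straight to the rendering module.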

The most famous polygon clipping algorithm is that of Sutherland-Hodgman (SH). The main concept of SH's algorithm [9] is divide-and-conquer: the polygon is clipped against one visual boundary at a time, and whenever an edge crosses that boundary the intersection point must be computed. This design is well suited to hardware implementation as a pipeline architecture. Fig. 1 shows an example of the SH algorithm in two-dimensional space. PE1 clips the polygon against the right visual boundary and computes the two intersection points with that boundary; PE2 clips against the top visual boundary and computes its two intersection points. Similarly, PE3 clips the polygon against the left visual boundary, and PE4 clips against the bottom visual boundary and computes its two intersection points. Considerable research effort [8][11][12][13] has been spent on improving clipping performance. J. Bae [11] proposed a vertex location tree to manage multiple vertices, where the clipping algorithm works on a complete triangle and each clip action handles only one visual boundary. The polygon must be tested at each vertex to determine whether it lies inside or outside the clipping boundary, which increases the processing time. J.-H. Kim [12] focuses on preserving performance while the clipping ratio changes. This architecture separates the data-paths into clip data-paths and Perspective Division & Viewport Mapping (PDVM) data-paths, which makes it possible for the clipping engine to handle two triangles at a time. However, this engine clips triangles against only two planes, -z and +z, to reduce hardware area and processing time. S.-F. Hsiao [13] proposed a modified SH algorithm whose input/output format at each PE is a single vertex, enabling a smooth pipeline in hardware implementation. This work also modified each PE to judge two parallel visual boundaries simultaneously (such as Xmax and Xmin, or Ymax and Ymin). K.-H. Lin [8] gives an improved and modified version of [13], whose input/output format at each PE is two vertices. This work also adopts two parallel visual boundaries for simultaneous judgment in each PE. Moreover, the data flow is a dual path, which reduces the time taken to find the intersection points and yields better efficiency.
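The one-boundary-per-PE structure of the SH pipeline can be sketched as follows; each call to the clip routine plays the role of one PE stage. This is a minimal 2-D software sketch under assumed names, not the hardware PE design.

```python
# Minimal 2-D Sutherland-Hodgman sketch: each call plays the role of
# one PE stage, clipping the polygon against a single visual boundary.

def clip_against(poly, inside, intersect):
    """poly: list of (x, y); inside(p) -> bool; intersect(p, q) -> point."""
    out = []
    for i, p in enumerate(poly):
        q = poly[(i + 1) % len(poly)]        # edge p -> q
        if inside(p):
            out.append(p)
            if not inside(q):
                out.append(intersect(p, q))  # leaving: emit crossing
        elif inside(q):
            out.append(intersect(p, q))      # entering: emit crossing
    return out

def clip_right(poly, xmax):
    """One PE: clip against the right visual boundary x = xmax."""
    def inside(p): return p[0] <= xmax
    def intersect(p, q):
        t = (xmax - p[0]) / (q[0] - p[0])
        return (xmax, p[1] + t * (q[1] - p[1]))
    return clip_against(poly, inside, intersect)
```

Chaining four such stages (right, top, left, bottom) reproduces the PE1-PE4 pipeline of Fig. 1; note how a clipped edge produces the extra intersection vertices that the proposed clip-less method avoids.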

III. CONCEPT CREATION AND FLOW

This paper adopts a new clip-less method for primitive clipping in the 3-D graphics pipeline; the concept originates from the top view of normalizing the viewing frustum to a cube, i.e., a canonical view volume, together with Marc Olano's [14] model. Fig. 2 shows the top view of the model, defined so that the near and far planes lie at z = f0 = -near and z = f1 = -far, respectively. The points at infinity are normalized and mapped to the plane z = f0 + f1. Notice in particular that the region labeled J, K, L on the left is normalized and mapped to the regions L, K, J on the right. This motivates the basic concept of performing linear extrapolation for the new clip-less edge equation. Because the traditional process sequentially clips against the six planes of the view volume, implementing a clipping function in low-cost hardware is challenging. When one or two vertices are located behind the camera, the projection flips and the wrong region appears in the screen space. To understand the connection between 2-D homogeneous triangles and 3-D triangles, first look at a single 3-D triangle and its projection onto the screen in Marc Olano's model. To project the 3-D point (X, Y, Z), set x = X, y = Y and w = Z to get the 2-D homogeneous point (x, y, w). Fig. 3 shows one vertex with Z < 0 normalized and mapped to the screen space: this vertex has W < 0, and the valid region is marked with shaded slashes. Fig. 4 shows two vertices with Z < 0 normalized and mapped to the screen space: these two vertices have W < 0, and the valid region is again marked with shaded slashes.

Fig. 1. The pipelined hardware uses the Sutherland-Hodgman algorithm in two-dimensional space.

Fig. 2. Top view of the normalized viewing frustum to a cube and canonical view volume.

Fig. 3. Marc Olano's [14] model: a triangle with one vertex's Z < 0 in canonical eye space is normalized and mapped to the screen space. One vertex has W < 0, and the valid region is marked with shaded slashes.

Fig. 4. Marc Olano's [14] model: a triangle with two vertices' Z < 0 in canonical eye space is normalized and mapped to the screen space. Two vertices have W < 0, and the valid region is marked with shaded slashes.
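The projection flip described above can be seen directly from the w-division: for a point behind the camera, w = Z is negative, so dividing by it mirrors the point to the wrong side of the screen. A tiny sketch in ordinary floating point:

```python
# Sketch of the projection flip described above: a vertex behind the
# camera (Z < 0, hence w = Z < 0) lands on the wrong side of the
# screen once the w-division is applied.

def project(X, Y, Z):
    """Map a 3-D point to 2-D screen coordinates per Olano's model."""
    x, y, w = X, Y, Z          # set x = X, y = Y, w = Z
    return (x / w, y / w)      # w-division to screen coordinates

print(project(1.0, 1.0,  2.0))   # in front of camera: (0.5, 0.5)
print(project(1.0, 1.0, -2.0))   # behind: flips to (-0.5, -0.5)
```

This mirrored result is exactly why naively rasterizing a triangle with a negative-w vertex fills the wrong screen region, and why the valid region must be recovered separately.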

Fig. 5 shows the concept flow for a triangle with vertex A at Z < 0 while vertices B and C are at Z > 0: via the projection transform and the W-division it is mapped to the normalized device space. The top view of the normalized device space is shown below, and the view-port transform then projects it onto the screen space. This yields vertex A with W < 0 while vertices B and C have W > 0.

Fig. 6 shows the concept flow for a triangle with vertices B and C at Z < 0 while vertex A is at Z > 0: via the projection transform and the W-division it is mapped to the normalized device space. The top view of the normalized device space is shown below, and the view-port transform then projects it onto the screen space. This yields vertices B and C with W < 0 while vertex A has W > 0.

If all vertices of the triangle are at Z > 0, the projection transform and W-division map it to the normalized device space, and the view-port transform then projects it onto the screen space; all vertices have W > 0 and no clipping is needed. If all vertices are at Z < 0, the same transforms yield all vertices with W < 0 and the triangle is discarded.
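The case analysis above reduces to counting vertices with W > 0 after projection. A hedged sketch (the function name and labels are illustrative, not the paper's interface):

```python
# Illustrative sketch of the W-sign case analysis above: count the
# vertices with W > 0 and decide how the primitive is handled.

def wsign_case(w0, w1, w2):
    positives = sum(w > 0 for w in (w0, w1, w2))
    if positives == 3:
        return 'pass'          # all W > 0: rasterize directly
    if positives == 0:
        return 'discard'       # all W < 0: triangle is dropped
    return 'valid-region'      # mixed signs: build the valid screen
                               # space region instead of clipping
```

The mixed-sign cases are precisely the ones Figs. 5 and 6 illustrate and the ones the clip-less valid region handles.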

Fig. 7 shows that the scale of Y is proportional to the depth Z in the side view; similarly, the scale of X is proportional to the depth Z in the top view. The process can use these two cube spaces, with their normalized mapping and projection properties, together with the near clipping plane to determine a valid region in the screen space.

Fig. 8 shows the ratio from the top view of the normalized device space. Using similar-triangle theory, a linear extrapolation of the position value can be performed via the ratio of the depth Z. This paper uses this proportional relationship to derive the proposed algorithm in the next section.

IV. PROPOSED ALGORITHM

This paper uses the top view of the normalized device space and the proportionality of position to depth Z to derive a linear extrapolation of the position value (x or y), in order to find an effective vertex on the near plane that bounds a valid region. The proposed algorithm in the Rasterizer consists of two main steps: build a valid 2-D screen space region, and revise the parameters for the edge-equation-based tile traversal. This paper builds the clip-less process in a screen space with some number of vertices having a negative W value. Using the three projection edge equations in the screen space, if one vertex has a negative W, the valid region lies on the other side of the edge intersection. To obtain the valid region in screen space, the first step determines a test vertex inside the valid area and then decides whether the signs of the three projection edge functions defined by the three projected vertices should be changed. The second step performs an extrapolation against the near plane from a vertex with negative W relative to a vertex with positive W, yielding a new effective vertex on the near plane. Fig. 8 derives the Linear Extrapolation (LE) using similar-triangle theory, performing a linear extrapolation on the position value via the ratio of the depth Z.

Fig. 5. Concept flow: a triangle with one vertex's Z < 0 is projection-transformed, normalized and mapped to the normalized device space; a view-port transform then projects it onto the screen space.

Fig. 6. Concept flow: a triangle with two vertices' Z < 0 is projection-transformed, normalized and mapped to the normalized device space; a view-port transform then projects it onto the screen space.

Fig. 7. Camera space: the scale of Y is proportional to the depth Z (side view) and the scale of X is proportional to the depth Z (top view).

Fig. 8. Use the top view of the normalized device space to derive the position C' on the near plane (triangle A-B-C with a vertex of negative W).
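The first step above, orienting the three projection edge functions against a test vertex known to lie in the valid area, can be sketched as follows. The edge-function form and helper names are assumptions for illustration, not the paper's hardware design.

```python
# Illustrative sketch of step one: set up the three projection edge
# functions and flip their signs so that a test vertex known to lie
# in the valid area evaluates positive for all three.

def edge_fn(p, q):
    """Coefficients (a, b, c) of the line a*x + b*y + c through p, q."""
    a = p[1] - q[1]
    b = q[0] - p[0]
    c = p[0] * q[1] - p[1] * q[0]
    return (a, b, c)

def oriented_edges(v0, v1, v2, test):
    """Return the three edge functions, sign-corrected against `test`."""
    edges = [edge_fn(v0, v1), edge_fn(v1, v2), edge_fn(v2, v0)]
    fixed = []
    for a, b, c in edges:
        if a * test[0] + b * test[1] + c < 0:
            a, b, c = -a, -b, -c       # change the sign of the edge
        fixed.append((a, b, c))
    return fixed

def inside(edges, x, y):
    """A point is in the valid region if all edge functions are >= 0."""
    return all(a * x + b * y + c >= 0 for a, b, c in edges)
```

With the signs corrected this way, the same inside-test drives the tile traversal regardless of which vertices had negative W.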

In the normalized device space, extend both line AC and line AB to the near plane to obtain the intersection points C' and B', forming the similar triangle A-B'-C'. This yields a similar-triangle ratio equation in which the position value is proportional to the depth Z:

(z1 - z2) / (z2 - z3) = (p1 - p2) / (p2 - p3)   (1)

where z is the depth value (z) and p is the position value (x or y).

After reordering the above identity equation, the linear extrapolation equation for p3 in the normalized device space is:

LE(p1, z1, p2, z2, z3) = p3 = p2 - (p1 - p2)(z2 - z3) / (z1 - z2)   (2)

where p1 is the position value (x or y) of the vertex with negative w and z1 is its depth value (z); p2 is the position value (x or y) of the vertex with positive w and z2 is its depth value (z); p3 is the position value (x or y) of the vertex on the near plane and z3 is the depth value of the near plane.
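Eq. (2) transcribes directly into a one-line function; the sketch below is in ordinary floating point rather than the paper's pipelined FMUL/FADD datapath.

```python
# Direct transcription of the linear extrapolation LE of Eq. (2).

def LE(p1, z1, p2, z2, z3):
    """Extrapolate position p3 at depth z3 from (p1, z1) and (p2, z2).
    p1/z1 belong to the vertex with negative w, p2/z2 to the vertex
    with positive w, and z3 is the depth of the near plane."""
    return p2 - (p1 - p2) * (z2 - z3) / (z1 - z2)
```

As a sanity check, for collinear samples with p = 2z, e.g. (p1, z1) = (8, 4) and (p2, z2) = (4, 2), extrapolating to z3 = 1 gives p3 = 2, as expected.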

The viewport transformation is illustrated in the corresponding figure. The parameters for this mapping are the coordinates sx and sy of the lower-left corner of the viewport (the rectangle of the screen that is rendered), its width w and height h, and the depths n and f of the near and far clipping planes, where (x, y, z) are the normalized device coordinates and (x', y', z') are the screen (window) coordinates. The matrix of the viewport transformation is shown below:

[x']   [w/2   0     0         sx + w/2] [x]
[y'] = [0     h/2   0         sy + h/2] [y]
[z']   [0     0     (f-n)/2   (f+n)/2 ] [z]
[1 ]   [0     0     0         1       ] [1]   (3)
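The matrix of Eq. (3) amounts to three independent affine maps, which can be sketched as plain arithmetic:

```python
# A plain sketch of the viewport transformation of Eq. (3): NDC
# coordinates (x, y, z) to window coordinates, with viewport corner
# (sx, sy), size (w, h) and depth range [n, f].

def viewport(x, y, z, sx, sy, w, h, n, f):
    xw = (w / 2) * x + (sx + w / 2)
    yw = (h / 2) * y + (sy + h / 2)
    zw = ((f - n) / 2) * z + (f + n) / 2
    return (xw, yw, zw)
```

For a 640x480 viewport at the origin with depth range [0, 1], the NDC corners (-1, -1, -1) and (1, 1, 1) map to (0, 0, 0) and (640, 480, 1), matching the intent of Eq. (3).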

Apply the viewport transform to both sides of (1). For the depth terms, the z component gives:

(z1' - z2') / (z2' - z3')
= [((f-n)/2)z1 + (f+n)/2 - ((f-n)/2)z2 - (f+n)/2] / [((f-n)/2)z2 + (f+n)/2 - ((f-n)/2)z3 - (f+n)/2]
= (z1 - z2) / (z2 - z3)   (4)

For the position terms, the x component gives:

(x1' - x2') / (x2' - x3')
= [(w/2)x1 + sx + w/2 - (w/2)x2 - sx - w/2] / [(w/2)x2 + sx + w/2 - (w/2)x3 - sx - w/2]
= (x1 - x2) / (x2 - x3)   (5)

Similarly, the y component gives:

(y1' - y2') / (y2' - y3') = (y1 - y2) / (y2 - y3)   (6)

Combining the identities (4)-(6), the ratios in (1) are preserved by the viewport transform, so the linear extrapolation equation for p3 keeps the same form in the screen (window) space:

LE(p1', z1', p2', z2', z3') = p3' = p2' - (p1' - p2')(z2' - z3') / (z1' - z2')   (7)

where p1', p2', p3' are the position values (x' or y') of the vertices: p1' belongs to the vertex with negative w, while p2' and p3' belong to vertices with positive w.
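Putting Eq. (7) to work, the valid-region construction of Fig. 9 (one vertex with negative w) can be sketched as follows; the vertex layout and function names are assumptions for illustration.

```python
# Sketch of the valid-region construction of Fig. 9 using the
# screen-space extrapolation of Eq. (7): vertex A has negative w, so
# B' and C' are extrapolated from A toward B and C onto the plane at
# the projected near-plane depth z_near.

def LE(p1, z1, p2, z2, z3):
    """Screen-space linear extrapolation of Eq. (7)."""
    return p2 - (p1 - p2) * (z2 - z3) / (z1 - z2)

def valid_region(A, B, C, z_near):
    """A, B, C are (x', y', z') screen-space vertices, A with w < 0.
    Returns the quadrilateral B-C-C'-B' bounding the valid region."""
    Bp = (LE(A[0], A[2], B[0], B[2], z_near),
          LE(A[1], A[2], B[1], B[2], z_near), z_near)
    Cp = (LE(A[0], A[2], C[0], C[2], z_near),
          LE(A[1], A[2], C[1], C[2], z_near), z_near)
    return [B, C, Cp, Bp]
```

Two LE evaluations per new vertex, four in total, mirror the four extrapolations the hardware completes in eight cycles.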

In the screen space, Fig. 9 shows that when A(w) < 0, the original vertex A is used relative to vertices B and C to perform a linear extrapolation and find new effective vertices B' and C' on the plane onto which the near plane is projected; the quadrilateral B-C-C'-B' is the valid region of this primitive.

In the other case, Fig. 10 shows that when both B(w) < 0 and C(w) < 0, the original vertices B and C are used relative to vertex A to perform a linear extrapolation and find new effective vertices B' and C' on the plane onto which the near plane is projected. The triangle A-B'-C' is the valid region of this primitive. The dataflow of the proposed algorithm is shown in Fig. 11.

Fig. 9. Screen space: when A(w) < 0, the original vertex A relative to vertices B and C is used to perform linear extrapolation and find new effective vertices B' and C' on the projected near plane; B-C-C'-B' is the valid region of this primitive.

Fig. 10. Screen space: when both B(w) < 0 and C(w) < 0, the original vertices B and C relative to vertex A are used to perform linear extrapolation and find new effective vertices B' and C' on the projected near plane; A-B'-C' is the valid region of this primitive.

Fig. 11. Dataflow of the proposed algorithm.

V. PROPOSED ARCHITECTURE

Fig. 12 shows a block diagram of the integrated mobile platform using the proposed 3-D graphics processor. It integrates a dedicated graphics IP to accelerate the 3-D graphics pipeline, implemented according to the specifications of OpenGL ES 2.0, a standard API for embedded 3-D graphics.

A. System Architecture of 3-D Graphics Processor

The 3-D graphics IP has a scalable shader architecture which adopts both high-precision and medium-precision shader designs. Here, vertex shading is handled only by the high-precision shader, while pixel shading can run on either the high- or the medium-precision shader. This high/medium-precision shader architecture is designed for pixel-bounded cases. In order to avoid excessive context switching and to improve shader core cache performance, this paper uses a relatively simple approach based on sub-draw-call units: a high-precision shader handles all of the vertex shading until the entire sub-draw-call's vertices are processed, and then switches to handle pixel shading operations. Notably, the medium-precision shader deals with pixel shading operations only.

This architecture satisfies the vertex and pixel shader ESSL specifications. The system is also expandable in terms of both concurrent threads and processor cores. The motivation behind concurrent threads is to hide texture image access latencies, while the multi-core structure boosts execution efficiency by increasing throughput. In addition to the processor architecture, fixed functions such as multiple texture units, a rasterization unit and a raster operation (ROP) unit are essential components contributing to a smooth rendering flow. A centralized scheduler is devised to dynamically monitor and balance the workload of the hardware units. The scheduler is notified of the receipt of data and informs the shaders to fetch and process it.

The function block diagram of the developed Rasterizer IP is depicted in Fig. 13. It integrates several dedicated modules to accelerate the fixed functions of the 3-D rendering pipeline. The key IP of the Rasterizer consists of two major functional units: a geometry engine and a pixel render engine. Specifically, the internal blocks are the main processing sub-modules, and they can be partitioned into six large pipeline stages.

In the geometry engine, the first stage, the Command FIFO, receives a job from the vertex scheduler and sends the required control signals to the data loader for vertex data loading from the vertex buffer. The vertex buffer stores the vertex transform-and-lighting (T&L) varyings generated by the vertex shader core. The second stage involves line/point preprocessing, which partitions each line or point into two triangles based on the line size or point size. The third stage can be divided into two parallel parts: one processes the viewport transform, where the polygon is offset to resolve the Z-fighting problem; the other is Stage 1 of the barycentric coefficient (vertex level) or LOD calculation, supporting texture mip-map and cube-map level selection. The fourth stage can also be divided into two parallel parts: one processes back-face culling to reject back-facing polygons before the raster pipeline, together with the clip-less edge equation and bounding-box finding for cost-effective design; the other is Stage 2 of the barycentric coefficient (vertex level) or LOD calculation for texture mip-map and cube-map level selection.

Then, the process enters the pixel render engine stage. When the primitive type is a point, the engine only checks whether or not this point/pixel is in the view volume and then sends the information to an un-shaded pixel buffer; a point does not need barycentric coordinates. If the primitive type is a triangle, there are two stages: the first involves tile traversal and sends available tiles to the next stage using a special traversal algorithm. The second stage generates 4 pixels per cycle in parallel and performs the pixel-level barycentric calculation; it tests all pixels in the tile and then outputs the valid pixels.
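The 4-pixel-per-cycle stage can be sketched as evaluating the sign-corrected edge functions over a 2x2 pixel quad; this is an illustrative software model of the parallel test, not the actual datapath.

```python
# Hedged sketch of the 4-pixel-per-cycle stage: evaluate the three
# edge functions for a 2x2 pixel quad and emit only the covered pixels.

def quad_coverage(edges, x0, y0):
    """edges: list of (a, b, c) sign-corrected edge functions.
    Returns the pixels of the 2x2 quad at (x0, y0) inside all edges."""
    valid = []
    for dy in (0, 1):
        for dx in (0, 1):        # the four pixels tested in parallel
            x, y = x0 + dx, y0 + dy
            if all(a * x + b * y + c >= 0 for a, b, c in edges):
                valid.append((x, y))
    return valid
```

In hardware the four evaluations share incremental edge-function updates as the traversal moves from tile to tile; the sketch simply recomputes them for clarity.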

The texture unit assists the shader core in processing texture load instructions and accesses the external memory (Ext-MEM) to load the corresponding texels for shader core computation. In addition to completing the fixed-circuit calculations, the functions of this module include converting coordinates into memory addresses for fetching, decompression and filtering. Here, the most important concern is reducing data latency, so the cache design in this module becomes a very important design feature. In addition to the texture unit's own design, the shader core's texture latency hiding mechanisms must be considered, as they affect the design and overall system performance.

Fig. 12. Block diagram of the integrated mobile platform using the proposed 3-D Graphics Processor.

Fig. 13. Block diagram of the Rasterizer function.

B. Rasterizer Geometry Engine with Clip-less Architecture

Fig. 14 shows the geometry engine in the Rasterizer, which links triangle vertices in the 3-D coordinate system to pixels on the 2-D display. It computes the triangle parameters required for the rasterization of the triangles. The PE pipeline FSM and the operand scheduler implement the function blocks of the geometry engine via the PE floating-point ALU, while the reduced-level edge equation needed for the TAC is generated via the PE fixed-point ALU.

The PE control input module has a command FIFO which accepts jobs pushed from the scheduler and loads the configuration from the global register. The setup varying-fetch module first receives the three vertex indexes of a triangle from the scheduler and converts them to physical addresses in the vertex buffer. The varying fetch FSM enters the FETC state after receiving the PrimeRdy signal from the vertex scheduler, then reads the varyings of the three vertexes from the vertex buffer in parallel as vector-4 data, in varying order 0 to 7. The global register configures the functions of the Rasterizer geometry engine: primitive mode (triangle, line or point), front-face orientation, depth range, varying type, viewport, visible region, texture width, texture height and function-enable signals.

The finite state machine (FSM) of the PE pipeline controls the six major pipeline stages, including the clip-less edge-equation stage; it generates signals for each state to control the input muxes that select the correct ALU inputs. The operand scheduler handles all operand generation for each rendering pipeline stage in the geometry engine and also shares resources when operations produce the same calculation results, to obtain the best utilization of the general-purpose registers. The advantage of the operand scheduler is that it reduces redundant hardware area and eliminates unnecessary calculation cycles.

Fig. 15 shows the hardware architecture of the proposed clip-less engine in the Rasterizer. To obtain the clip-less valid region for the position value (x, y), new effective vertices B' and C' on the projected near plane are determined by linear extrapolation. This stage requires a total of four linear extrapolations, computed in eight cycles, following the clip-less Eqs. (2) and (7). The architecture uses four 32-bit floating-point multipliers (FMUL), three 32-bit floating-point adders (FADD) and two SOP kernel modules, and selects the operands from the operand scheduler in each cycle to implement the proposed clip-less architecture. The SOP kernel modules perform the special floating-point operations: reciprocal, base-2 logarithm and base-2 exponential.
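Since Eqs. (2) and (7) are not reproduced in this section, the following is only a generic sketch of one such linear extrapolation: sliding a vertex along its triangle edge until it lies on the near plane. The function and variable names are assumptions:

```python
def near_plane_intersect(a, b, z_near):
    """Linearly interpolate along segment a->b to the point where z == z_near.
    a and b are (x, y, z) tuples assumed to lie on opposite sides of the
    near plane, so the interpolation parameter t is well defined."""
    ax, ay, az = a
    bx, by, bz = b
    t = (z_near - az) / (bz - az)           # fraction of the way from a to b
    return (ax + t * (bx - ax), ay + t * (by - ay), z_near)
```

In the proposed engine, four of these interpolations (for B' and C' in x and y) are scheduled onto the shared FMUL/FADD array over the eight-cycle window.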

VI. EMBEDDING THE 3-D GRAPHIC PROCESSOR INTO A MOBILE DEVICE

In this section, the design analysis for embedding 3-D graphic processor into a mobile platform is described. Then, the advantages of applying the proposed 3-D graphic processor engine to high-end mobile devices are offered.

Fig. 14. Architecture of the geometry engine in the Rasterizer.


Fig. 15. Hardware architecture of the proposed clip-less engine in the Rasterizer.

Fig. 16. Partition for draw call with vertex pre-load buffer scheme.


A. Design analysis for embedding 3-D graphic processor into mobile platform

3-D graphic processors are used in embedded systems, mobile phones, personal computers, workstations and game consoles to accelerate the memory-intensive work of texture mapping and polygon rendering. To improve the efficiency of the 3-D graphics pipeline on a mobile platform, careful design analysis of the cache architecture is recommended, and cost/performance trade-offs are necessary on a platform with limited resources. Data scheduling and the effective use of resources are also important considerations in a multi-threaded, multi-core SoC chip design.

1) Data Fetch with un-cached data in shared L2 strategy

Data fetch is the GPU front-end unit; its main task is to move the required computing data into the GPU's internal caches and buffers so that shader operations can quickly access the information they need, improving overall performance. The data to be moved comprises the vertex/pixel shader instructions, vertex/pixel shader constants and vertex data (attributes). The mobile system specifications are used here for the input-stage architecture design and analysis. The main function of the data fetch is to fetch the shader's data, which includes vertex data and shader data: vertex data mainly comprises vertex attributes, while shader data comprises instructions and constant data (each divided into vertex shader and pixel shader). From the pipeline perspective, the data fetch embodies the first pipeline stage, and its design focuses on how quickly it can provide data for GPU computing. Of course, the sooner the better, but considering cost, there must be some trade-offs.

Fig. 17 shows the block diagram of the data fetch and the cache buffer. The L2 cache temporarily holds texture data and ROP data, in order to reduce data latency and increase bus-transaction utilization, thereby reducing power consumption. Texture data and frame data have similar spatial characteristics and both are accessed in blocks to improve bus-request efficiency (using burst transfers). Specifically, the data are blocked into units per cache line, where a feasible block size is 2×2 or 4×4; hence, the cache bus is 256 or 512 bits wide.
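A minimal sketch of such block-based addressing follows, assuming a 4×4 block of 4-byte texels so that one block fills a 64-byte (512-bit) cache line. The layout and names are illustrative assumptions, not the paper's exact mapping:

```python
def block_linear_address(x, y, tex_width, block=4, texel_bytes=4):
    """Map texel (x, y) to a byte address under a block-linear layout:
    the texture is tiled into block x block squares stored contiguously,
    so one cache line holds one spatially coherent block."""
    blocks_per_row = tex_width // block
    bx, by = x // block, y // block          # which block the texel falls in
    block_index = by * blocks_per_row + bx
    ox, oy = x % block, y % block            # offset inside the block
    return (block_index * block * block + oy * block + ox) * texel_bytes
```

With this layout, a 2×2 bilinear filter footprint usually touches a single cache line instead of two or more rows of a linear texture, which is what makes burst transfers effective.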

Consider that the vertex data format is not only complicated but also includes a stride mode. A block-format cache line would therefore suffer internal fragmentation and waste space, so the vertex data uses a separate, independent pre-load buffer to improve efficiency. As for the shader data, it fits completely into the I-Cache and D-Cache with about 90% probability, so there is no need to place it in the shared L2 cache; this avoids crowding out texture and raster-operation data.

2) Draw call partition with vertex pre-load buffer scheme

The draw-call partition, supported in the driver, helps process the vertex data in grouped mode and increases the locality of data accesses in order to reduce the bus traffic load. The draw-call partition requires supporting schemes in the data fetch and vertex-buffer design. Fig. 16 shows the partition for a draw call with the vertex pre-load buffer scheme. While the vertex shader is still processing the 1st draw call's vertex data, the data fetch begins processing the 2nd draw call. The first step is an index fetch: the data fetch module reads the indexes out of the index buffer in order to know where to fetch the vertex attributes, and the sub-draw-call partition is also performed in this module. Next comes the need-to-read calculator, which compares against the previous sub draw-calls and identifies repeated vertices, since a repeated vertex theoretically need not be calculated more than once.

After this stage, the data fetch reads the attribute data out of the L2 cache, and the index data and attributes are written into the pre-load buffer. Once all data are stored in the pre-load buffer, the data fetch triggers it and moves the deposited data into the vertex buffer before the vertex shader begins to process the 2nd draw call's vertex data from the vertex buffer. This scheme increases the overlap, and hence the performance, between the data fetch and the vertex shader.
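The need-to-read step described above amounts to a first-occurrence filter over the index stream. A minimal sketch (in the actual hardware the comparison window spans the previous sub draw-calls, which is simplified away here):

```python
def need_to_read(indices):
    """Need-to-read filter: a repeated index is shaded only once, so only the
    first occurrence of each vertex index triggers a fetch/compute."""
    seen = set()
    reads = []
    for idx in indices:
        if idx not in seen:
            seen.add(idx)
            reads.append(idx)      # this vertex must be read from memory
    return reads
```

For typical triangle-strip-like meshes, where most vertices are shared by several triangles, this filter removes a large fraction of redundant fetches.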

B. Towards a new generation of mobile device

With the advances in mobile panel technology, the image resolution of the new generation of mobile devices keeps increasing. The memory-intensive work of texture mapping and polygon rendering therefore grows much greater in scope, and the graphics processor unit must also accelerate geometric calculations such as rotation and translation. When computation power is lacking, consumers must wait longer to see the images.

1) Reduce the transactions and computations

In the multi-core architecture, competition for access to infrequently used data occurs frequently, and the cache hit rate becomes extremely low. The developed ESSL compiler records the infrequently used data in an LUT to inform the centralized scheduler, which must transfer the data to the shader's local cache before the shader can use it. The hit rate thus reaches twice the efficiency of the conventional cache scheme, reducing redundant memory accesses and saving power.

Fig. 17. Block diagram of Data Fetch and Cache Buffer.

Fig. 18. Data analysis for the centralized scheduler in multi-core render state pipeline.

Moreover, regarding computational power consumption, embedding the proposed cost-effective rasterization architecture into a mobile engine requires no expensive clipping action and produces no extra clipping-derived polygons. This is important for high-resolution mobile devices in 3-D rendering applications, because the proposed architecture prevents the generation of extra clipping-derived polygons.

2) Energy efficiency scheduler

On the other hand, in high-end smart phones and mobile devices, this work aims to enhance the user experience of 3-D mobile games. Notably, the proposed multi-core, multi-thread 3-D mobile graphics SoC can handle user requirements for complex applications. Fig. 18 shows the centralized scheduler data analysis in the multi-core render-state pipeline. The centralized scheduler connects all computation modules and internal buffers. During system execution, it optimizes the workload between the multi-core shaders and the rendering-pipeline modules, dispatching each task depending on the shader execution complexity and on the status of the pertinent rendering-pipeline module and buffers. The centralized scheduler uses task FIFOs with render-state controllers to perform the task dispatch; this technique reduces total execution time by 81% compared to native complex 3-D execution. GPU computing offers unprecedented application performance by offloading the computation-intensive portions of the application to the GPU while the remainder of the code still runs on the CPU. From the user's perspective, applications execute very smoothly.
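The dispatch policy described above can be sketched as a least-loaded selection over per-core task FIFOs. The class and method names, and the use of a cycle-estimate load metric, are assumptions for illustration rather than the paper's implementation:

```python
from collections import deque

class CentralScheduler:
    """Sketch of a centralized scheduler: one task FIFO per shader core,
    with each task dispatched to the currently least-loaded core
    (load measured as the sum of queued cycle estimates)."""

    def __init__(self, n_cores):
        self.fifos = [deque() for _ in range(n_cores)]
        self.load = [0] * n_cores

    def dispatch(self, task_id, est_cycles):
        """Enqueue a task on the least-loaded core; return the chosen core."""
        core = min(range(len(self.load)), key=self.load.__getitem__)
        self.fifos[core].append(task_id)
        self.load[core] += est_cycles
        return core
```

A real render-state pipeline must additionally respect state-change ordering within each FIFO, which is omitted here.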

VII. IMPLEMENTATION RESULTS

To propose a cost-effective strategy for primitive clipping in rasterization, this paper uses the clip-less algorithm and moves the clipping stage to the screen space in the graphics pipeline. The proposed architecture processes the valid screen-space region of each primitive in eight cycles, and its gate count is only 20k in the 65nm 1P9M process. Furthermore, the throughput reaches 25 M Triangles/sec. Table I compares the performance, throughput and gate counts of the clipping engines [8], [11], [12], [13]. Clipping engines [11] and [13] both use a single-path data flow and support the X, Y, Z planes. Clipping engine [11] includes the Outcode Unit, PVPDVM Unit, Interpolation Unit, Vertex Register Files and Control Unit. Clipping engine [13] uses just three PEs to handle the six clip planes; despite the extra state-judgment logic, this design halves the number of PEs. Clipping engines [12] and [8] both use a dual-path data flow. Clipping engine [12] supports only the Z plane; its architecture not only separates the data path into the clip data path and the Perspective Division & Viewport Mapping (PDVM) data path, but also uses a dual-thread architecture for the dual-path process. Clipping engine [8] uses just one PE to handle the six clip planes and also modifies [13]'s input/output format so that each PE processes two vertices simultaneously, supporting the X, Y and Z planes. The algorithm proposed in this paper can be implemented in parallel with the edge-equation parameter setup stage and is suitable for cost-effective hardware design; thus, it is called the novel clip-less algorithm. This work has tested the PVR OpenGL ES 2 AP, the third-party OGL AP, and several in-house APs via the integrated verification environment. Results are shown in Figs. 19-21.
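The quoted peak throughput is consistent with the stated cycle count: at the 200-MHz core clock, processing one primitive every eight cycles yields 25 M triangles per second. As a quick check:

```python
clock_hz = 200e6            # 200-MHz core clock
cycles_per_primitive = 8    # valid-region computation per primitive
throughput = clock_hz / cycles_per_primitive
assert throughput == 25e6   # 25 M Triangles/sec
```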

Fig. 21. Rendering results for the PVR demo test case.

Fig. 20. Rendering results for the third party OGL sample test case.

Fig. 19. Rendering results for the PVR test case.

TABLE I
PERFORMANCE, THROUGHPUT AND GATE COUNT COMPARISON

Clipping Engine | Algorithm                            | Area (gate counts) | Supported Clip Planes | Throughput (M Triangles/sec) | Process / Cycle Time
[8]             | Dual Path                            | 165.17k            | X, Y, Z planes        | 10.9                         | 90nm / 6 ns
[11]            | Single Path                          | 155.89k            | X, Y, Z planes        | 11                           | 0.13um / 6 ns
[12]            | Dual Path                            | 79k + 3k SRAM      | Z plane               | none                         | 0.18um / 6 ns
[13]            | Single Path                          | 213.2k             | X, Y, Z planes        | none                         | 0.18um / 6 ns
This Work       | Finding valid region in screen space | 20k                | X, Y, Z planes        | 25                           | 65nm / 5 ns


This work also presents the architecture of a 200-MHz multi-core, multi-thread 3-D graphics SoC in a 65nm 1P9M process with a core size of 4.97mm2. This performance is achieved by the proposed centralized scheduler and the draw-call partition with a vertex pre-load buffer scheme, at a power consumption of 153.3mW. An integrated verification environment has been developed for C-model simulation, RTL-model simulation, and FPGA testing and debugging, as shown in Fig. 22. The framework permits combinations of different models for fast prototyping, shortens the development cycle, speeds up concept validation, and helps reveal design mistakes at early stages. The rendering results of the hardware and the software reach bit-accuracy goals.

VIII. CONCLUSION

Primitive clipping is an important step in the entire 3-D graphics (rendering) pipeline, but traditional polygon clipping substantially increases hardware cost and is unsuitable for a low-power mobile GPU design. This paper proposes a novel cost-effective strategy for primitive clipping in rasterization. In the proposed process, no expensive clipping action is involved and no extra clipping-derived polygons are produced; as such, the implementation cost is significantly reduced. Based on the integration framework, OpenGL ES 2 conformance and a few publicly available benchmarks have been verified on the SoC platform. The 3-D Graphics SoC team continues to seek benchmarks with complex graphics effects to sharpen and prove the robustness of the system.

REFERENCES
[1] Khronos Group, Open Standards for Media Authoring and Acceleration.
[2] M. Segal and K. Akeley, "The OpenGL Graphics System: A Specification", Ver. 2.0, Oct. 2004.
[3] D. Kim, K. Chung, C. H. Yu, C. H. Kim, I. Lee, J. Bae, Y. J. Kim, J. H. Park, S. Kim, Y. H. Park, N. H. Seong, J. A. Lee, J. Park, S. Oh, S. W. Jeong, and L. S. Kim, "An SoC with 1.3 Gtexels/s 3-D graphics full pipeline for consumer applications", IEEE J. Solid-State Circuits, vol. 41, no. 1, Jan. 2006.
[4] Y.-K. Lai and Y.-C. Chung, "Cost-effective rasterization using valid screen space region", IEEE Int. Conf. Consumer Electronics, pp. 143-144, Jan. 2013.
[5] R. T. Gu, T. C. Yeh, W. S. Hunag, T. Y. Huang, C. H. Tsai, C. N. Lee, M. C. Chiang, S. F. Hsiao, Y. N. Chang, and I. J. Huang, "A low cost tile-based 3-D graphics full pipeline with real-time performance monitoring support for OpenGL ES in consumer electronics", IEEE Symp. on Consumer Electronics, pp. 1-6, Jun. 2007.
[6] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan, "Larrabee: a many-core x86 architecture for visual computing", ACM Trans. on Graphics, 2008.
[7] J. Sugerman, K. Fatahalian, S. Boulos, K. Akeley, and P. Hanrahan, "GRAMPS: a programming model for graphics pipelines", ACM Trans. on Graphics, Jan. 2009.
[8] K. H. Lin, "Design of an efficient clipping engine for OpenGL-ES 2.0 vertex shaders in 3-D graphics systems", master's thesis, Dept. Computer Science and Eng., National Sun Yat-Sen University, Aug. 2009.
[9] I. Sutherland and G. Hodgman, "Re-entrant polygon clipping", Communications of the ACM, vol. 17, no. 1, pp. 32-42, Jan. 1974.
[10] S. Laine, T. Aila, T. Karras, and J. Lehtinen, "Clipless dual-space bounds for faster stochastic rasterization", ACM Trans. on Graphics, 2011.
[11] J. Bae, D. Kim, and L.-S. Kim, "An 11M-triangles/sec 3-D graphics clipping engine for triangle primitives", in Proc. IEEE Int. Symp. Circuits Syst., vol. 5, May 2005.
[12] J.-H. Kim, et al., "Clipping-ratio-independent 3D graphics clipping engine by dual-thread algorithm", in Proc. IEEE Int. Symp. Circuits Syst., May 2008.
[13] S. F. Hsiao and T. C. Tien, "Hardware design and verification of clipping algorithm in 3D graphics geometry engine", master's thesis, Dept. Computer Science and Eng., National Sun Yat-Sen University, Jul. 2008.
[14] M. Olano and T. Greer, "Triangle scan conversion using 2D homogeneous coordinates", in Proc. ACM SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, 1997.
[15] S. S. Muchnick, Advanced Compiler Design and Implementation, Morgan Kaufmann Publishers, 1997.
[16] M. Poletto and V. Sarkar, "Linear scan register allocation", ACM Trans. on Programming Languages and Systems, pp. 895-913, Sep. 1999.

BIOGRAPHIES

Yeong-Kang Lai (SM’13) was born in Taipei, Taiwan, R.O.C., in 1966. He received the B.S. degree in electrical engineering from Tamkang University, Taipei, Taiwan, in 1988, and the M.S. and Ph.D. degrees from National Taiwan University in 1990 and 1997, respectively.

From 1992 to 1993, he was with the Institute of Information Science, Academia Sinica, Taiwan, R.O.C., where he worked on video conferencing systems. In 1997, he joined the Department of Electrical Engineering, Chang Gung University, Taoyuan, Taiwan, as an Assistant Professor. From 1998 to 2001, he was with the Department of Computer Science and Information Engineering, National Dong Hwa University, Hualien, Taiwan, as an Assistant Professor. Since 2001, he has been with National Chung Hsing University, Taichung, Taiwan, where he is currently a Professor in the Department of Electrical Engineering. His research interests include 3-D display, 3-D video, video compression, DSP architecture design, video signal processor design, and VLSI signal processing.

He is a member of the honor society Phi Tau Phi. In 2011, he received the Outstanding Teaching Professor Award of National Chung Hsing University. In 2010, he also received the Best Paper Award of the International SoC Design Conference. He is a member of the Technical Program Committee of IEEE International Conference on Consumer Electronics.

Yu-Chieh Chung (S’13) was born in Taipei, Taiwan, R.O.C., in 1968. He received the M.S. degree in electrical engineering from National Chung Hsing University, Taichung, Taiwan, R.O.C., in 2005. He is currently pursuing the Ph.D. degree in the Department of Electrical Engineering at National Chung Hsing University. He has also been a technology leader at the Institute for Information Industry. His major research interests include 3-D graphics, image and video processing, VLSI architecture design of image and video codecs, VLSI design for digital signal processing, and reconfigurable computing architecture.

Fig. 22. Implementation on an FPGA Platform.