Efficient Vector Graphics Rasterization Accelerator Using Optimized Scan-Line Buffer




    Efficient Vector Graphics Rasterization Accelerator Using Optimized Scan-Line Buffer

    Ting-Chi Tong and Yun-Nan Chang

    Abstract: This paper presents a small and fast VLSI architecture of a vector graphics rasterization accelerator. To decide the filling regions of a graphics object, a large on-chip scan-line buffer (SB) is very often used and frequently accessed to derive the pixels' winding counts. This paper first proposes a special 2-bit coding scheme for buffer entries, along with active-edge-table rescan, to record the intersection information of scan lines and the object paths. Second, for anti-aliasing (AA) rendering applications, a coverage buffer is proposed to avoid the duplication of SBs. Compared with the conventional approach, the required buffer size can be reduced by up to 89%. Besides buffer reduction, this paper also proposes a hierarchical SB architecture in which the upper-level buffer indicates which scan-line sections have intersected with objects, so that accesses to successive buffer entries can be skipped. The same technique, along with the differential coverage transformation, can also be applied to the coverage buffer. Our experimental results show that more than 87% of memory accesses can be eliminated, which saves 66.4% of clock cycles in a practical hardware implementation. The gate count of the proposed rasterization accelerator is only about 32,232, and it can run at 250 MHz under UMC 90-nm technology for HDTV applications.

    Index Terms: 2-D graphics, anti-aliasing (AA), OpenVG, rasterization, scan-line buffer (SB), vector graphics.


    I. INTRODUCTION

    Vector graphics applications have been widely used in various consumer electronic devices such as electronic books, electronic maps, and mobile phones. In contrast to bitmap graphics, vector graphics uses mathematical equations to describe objects, so the amount of data representing the objects can be reduced. By evaluating the mathematical equations according to the given screen resolution, the objects can be rendered without losing image quality. Therefore, vector graphics is very suitable for devices with limited memory capacity, such as embedded systems, and for image viewers with zoom-in or zoom-out capability, such as smart phones.

    While more and more products equipped with vector graphics applications are developed by different manufacturers, a standard for vector graphics needs to be complied with in order to make applications compatible across diverse platforms. SVG [1] and OpenVG [2] are the two most popular vector graphics standards, especially OpenVG, which was released several years ago and aimed primarily at hand-held devices that require portable acceleration of high-quality graphics. This standard suggests a pipeline rendering flow that can benefit and facilitate both software and hardware acceleration developers. Many schemes related to the implementation of vector graphics systems have since been proposed [3]–[12]. Although the rendering of vector graphics can also be accelerated by using the existing programmable [13] or dedicated 3-D graphics processing units [14], the cost or power efficiency of these solutions may not suit all embedded applications. Therefore, how to develop dedicated vector graphics rendering circuits is an important issue that needs to be addressed.

    Manuscript received December 27, 2011; revised April 21, 2012; accepted June 12, 2012. Date of publication August 9, 2012; date of current version June 21, 2013. This work was supported in part by the National Science Council of Taiwan under Grant NSC 100-2221-E-110-053-MY3.

    The authors are with the Department of Computer Science and Engineering, National Sun Yat-Sen University, Kaohsiung 80424, Taiwan (e-mail: d973040003@student.nsysu.edu.tw; ynchang@cse.nsysu.edu.tw).

    Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

    Digital Object Identifier 10.1109/TVLSI.2012.2207413

    Fig. 1. OpenVG standard rendering pipeline flow.

    Fig. 1 shows the rendering pipeline flow specified in the OpenVG standard. Among these eight processing stages, rasterization consumes more than 90% and 40% of the total execution time, respectively, in the pure reference software implementation (RI [15]) and in a dedicated hardware design reported in [10], for the rendering of the tiger object [3]. How to accelerate the operations of this stage by adopting more clever algorithms [4], [5] or by developing efficient hardware rasterization accelerators has attracted much attention [7]–[10]. Rasterization is a procedure used to convert the object outline into a series of pixels. Whether a pixel belongs to the interior of an object depends on its winding count (WC) and the selected filling rule. The WC of a pixel is derived by counting the number of directional edges crossed by a straight test line from a very distant point to this pixel. To facilitate this counting procedure, pixels with the same vertical coordinate value are tested together based on the same scan line that passes through them. In addition, an alternative number called the fractional WC (FWC) is calculated first for each pixel. The FWC of a pixel represents the number of directional edges crossing the region of the scan line between this pixel and its left-hand neighboring pixel. For a downward edge, the FWC is increased by 1; for an upward edge, it is decreased by 1. After the FWC of all pixels in a scan line has been produced, the WC can be computed simply by accumulating the FWC from left to right along the scan line.
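The relation between FWC, WC, and the filling rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 8-pixel scan line, and the crossing positions are hypothetical and do not come from the paper; only the sign convention (+1 for a downward edge, -1 for an upward edge) and the two filling rules follow the text.

```python
from itertools import accumulate


def winding_counts(fwc):
    """Accumulate per-pixel FWC values from left to right to obtain WC."""
    return list(accumulate(fwc))


def is_inside(wc, rule="nonzero"):
    """Fill decision for one pixel under the selected filling rule."""
    if rule == "nonzero":
        return wc != 0
    if rule == "evenodd":
        return wc % 2 != 0  # Python's % is non-negative here, so -1 counts as odd
    raise ValueError(f"unknown filling rule: {rule}")


# Hypothetical scan line, 8 pixels wide: a downward edge is crossed at x = 2
# (FWC +1) and an upward edge at x = 6 (FWC -1).
fwc = [0, 0, 1, 0, 0, 0, -1, 0]
wc = winding_counts(fwc)             # [0, 0, 1, 1, 1, 1, 0, 0]
inside = [is_inside(w) for w in wc]  # pixels 2..5 are filled
```

Because the WC is a running sum, each pixel needs only its own FWC and the WC of its left neighbor, which is why a single left-to-right pass over the scan line suffices.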

    There are three major schemes used in vector graphics rasterization systems to produce the WC, as shown in Fig. 2. The first two schemes receive the edges produced by tessellating the incoming object paths in the previous stages, and organize the edge information into a global edge table (GET). The GET normally maintains a linked list for each scan line. The associated edges are linked together according to the first scan line they cross in the upward direction. Each list entry records information about one edge, including the x-coordinate value of the lowest crossing point, the maximum y-coordinate value of the two edge endpoints, the reciprocal slope value, and the direction of the edge. After the GET has been built, the rasterization flow moves on to compute the WC of the pixels, starting from the very bottom scan line. Those edges that cross the currently processed scan line are often referred to as the active edges (AEs); their data are retrieved from the GET and moved to another table called the AE table (AET). In order to compute the WC of pixels in one scan line, the first scheme in Fig. 2 has to sort the AEs according to the x-coordinate of their intersection points in ascending order [16], [17]. To avoid sorting the AEs, the second scheme employs a screen-wide buffer called the scan-line buffer (SB) to record the FWC [11]. The final scheme shown in Fig. 2 can further avoid the use of the GET by rendering the paths directly into a frame-size buffer, recording the FWC of all pixels in a frame in the same phase.

    1063-8210/$31.00 © 2012 IEEE

    Fig. 2. Process flow of different WC generation schemes.
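The GET and AET bookkeeping described above can be sketched as follows. This is a minimal illustration, not the paper's hardware design: the names `EdgeEntry`, `build_get`, and `scan` are hypothetical, scan lines are assumed to sit at integer y values with y increasing upward, an edge is assumed to be active on scan lines in the half-open range from its lowest crossing up to (but excluding) its maximum y, and a Python list stands in for each linked list.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EdgeEntry:
    x: float          # x-coordinate where the edge crosses its current scan line
    y_max: float      # maximum y-coordinate of the two edge endpoints
    inv_slope: float  # reciprocal slope dx/dy
    direction: int    # +1 for a downward edge, -1 for an upward edge


def build_get(edges):
    """Group directed edges ((x0, y0), (x1, y1)) by the first scan line they cross."""
    table = defaultdict(list)
    for (x0, y0), (x1, y1) in edges:
        if y0 == y1:
            continue  # a horizontal edge never crosses a scan line
        direction = -1 if y1 > y0 else 1
        (xl, yl), (xh, yh) = ((x0, y0), (x1, y1)) if y0 < y1 else ((x1, y1), (x0, y0))
        inv_slope = (xh - xl) / (yh - yl)
        y_first = math.ceil(yl)
        if y_first >= yh:
            continue  # the edge lies between two scan lines
        x_first = xl + (y_first - yl) * inv_slope
        table[y_first].append(EdgeEntry(x_first, yh, inv_slope, direction))
    return table


def scan(table, height):
    """Yield (y, [(x, direction), ...]) for each scan line, bottom to top."""
    aet = []
    for y in range(height):
        aet.extend(table.get(y, []))             # move newly active edges into the AET
        aet = [e for e in aet if y < e.y_max]    # retire edges that have ended
        yield y, [(e.x, e.direction) for e in aet]
        for e in aet:
            e.x += e.inv_slope                   # step each edge to the next scan line


# Hypothetical triangle (0,0) -> (4,0) -> (2,4) -> (0,0).
tri = [((0, 0), (4, 0)), ((4, 0), (2, 4)), ((2, 4), (0, 0))]
rows = dict(scan(build_get(tri), 5))
```

Note that the crossings of one scan line come out in edge order rather than in x order. The first scheme has to sort them, whereas the SB of the second scheme absorbs them in any order.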

    Among the aforementioned three schemes, realizing the first one in dedicated hardware requires an additional sorting module to sort the AEs of each scan line, which increases not only the overall gate count but also the number of computation cycles. On the other hand, the third scheme needs a huge frame-size buffer to store the FWC of all pixels on the screen. Unless the screen resolution is very low, this buffer can hardly be realized with on-chip memory. Instead, part of the off-chip system memory has to be allocated for this buffer, which may significantly increase the bus bandwidth required by the FWC generation and WC accumulation operations. In contrast, the SB of the second scheme is much smaller, so it can be placed on chip. In addition, it does not require sorting of the AET. Therefore, it is the most popular scheme adopted in the literature [6], [10]–[12].

    In order to achieve better rendering quality, an anti-aliasing (AA) function is supported in most graphics rendering systems by upsampling the rendering resolution so that the coverage percentages of pixels encompassed by a path can be determined. This multisampling technique increases the SB size because one pixel contains more than one sample point. Therefore, [12] proposed an approximate FWC summation technique to reduce the usage of SB. However, if a path is self-intersecting or contains multiple subpaths, their approach may produce incorrect AA colors. To shorten the rendering time, [10] proposed a dual SB approach in which two scan lines can be processed in parallel. This paper proposes an efficient rasterization accelerator design based on a new SB organization. Not only the hardware cost of the buffer but also the overall number of rendering cycles can be reduced.

    The rest of this paper is organized as follows. Section II first briefly explains the conventional SB-based AA rasterization process. Section III reviews the technique used in previous works to reduce the required SB size. Our proposed SB reduction scheme is discussed in Section IV. For both non-AA and AA rendering applications, we propose our respective optimization methods. Section V presents a new buffer architecture that can reduce the rendering cycles by saving many buffer access operations. The detailed circuit-level implementation of the rasterization accelerator is addressed in Section VI. Finally, Section VII provides some experimental and comparison results, followed by some conclusions in Section VIII.


    Vector graphics is a form of figures represented by primitive segments, defined by mathematical equations, that describe the outline of graphics objects. The primitive segment types include lines, Bézier curves, and elliptical arcs. To render a figure in an efficient and unified way, all equations are transformed into edges through tessellation. Then, the points where the edges intersect each scan line have to be found and sorted in the horizontal direction so that the rasterization stage can determine the filling regions. As shown in Fig. 2, the sorting procedure can be realized directly by hardware sorters consisting of compare-swap arithmetic units, or by annotating the intersection information in an additional buffer. The use of an SB is the most popular scheme in vector graphics rendering systems.

    In addition to rendering efficiency, image quality is also an important issue in vector graphics rendering. Artifacts are a common problem that arises on finite-resolution displays and can be alleviated by AA processing. The AA function in OpenVG is realized by first calculating the coverage value (CV) of each pixel, which is then used as a blending factor to generate the compensated color for the associated pixel. The CV of a pixel represents the percentage of the region associated with this pixel that is encompassed by the object paths. However, it is very difficult to compute the actual CV since it may involve complex area calculation of a region; therefore, the CV is often approximated by sampling this region with many subpixels and counting the number of sample points that belong to the object interior. The OpenVG RI [15] adopts the 8-queen sampling pattern as its fast AA solution. Based on this pattern, the CV of a pixel is determined by eight sampling points, which are distributed over eight sub-scan lines (SSs). The flowchart of the conventional rasterization approach based on the 8-queen multisampling approach is shown in Fig. 3.


    Fig. 3. Conventional SB-based AA rasterization flowchart.

    In Fig. 3, the calculation of CV does not start until the WC of all the sampling points in the eight SSs has been computed. Therefore, a total of eight lines of SB are used because each SS requires one SB. Let $\mathrm{SB}_n[i]$ represent the content of the $i$th entry of the $n$th SB, where $1 \le n \le 8$, $0 \le i < W$, and $W$ denotes the screen width. At the FWC generation stage, the FWC of each sampling point is generated and stored in the SB. Assume that the currently processed SS number is $n$ and that the incoming AE $E_i$ intersects this SS at $x = a$; the following SB update operation is then executed:

    $$\mathrm{SB}_n[a] = \mathrm{SB}_n[a] + \mathrm{direct}(E_i) \tag{1}$$

    where the function $\mathrm{direct}(E_i)$ returns $-1$ if $E_i$ goes upward; otherwise, it returns $1$. Fig. 4 shows a small portion of an object outline and illustrates how it intersects the current eight SSs. Fig. 5(a) shows the content of each SB after the FWC of all samples of the eight SSs has been generated. The next step of rasterization is to calculate the WC of each sample by aggregating the FWC values stored in the SB from left to right, and storing the accumulated results back to the SB as follows:

    $$\begin{aligned}
    &\textbf{for } i = 0 \textbf{ to } W - 1\\
    &\quad \textbf{for } n = 1 \textbf{ to } 8\\
    &\qquad \mathrm{SB}_n[i] = \mathrm{SB}_n[i-1] + \mathrm{SB}_n[i].
    \end{aligned} \tag{2}$$

    The SB content for the example of Fig. 4 after WC accumulation is shown in Fig. 5(b). Now, $\mathrm{SB}_n[i]$ stores the WC value of the $i$th sample in the $n$th SS. To calculate the final CV of each pixel in the current scan line, the WC of the corresponding eight samples is fetched from the SB and passed through the fill decision to generate the in-out decision bit (DB) according to the filling rule (even-odd or nonzero). The DBs of the eight samples are summed in the CV calculation stage to produce the final CV output.

    Fig. 4. Detailed intersection plot of a self-intersecting path and eight SSs of one scan line.

    Fig. 5. Intermediate results at different processing stages of the conventional rasterization approach for the example of Fig. 4. (a) SB content after FWC generation. (b) SB content after WC accumulation. (c) DB values after fill decision. (d) CV values after CV calculation.

    The operations in the fill decision and CV calculation stages can be described as follows:

    $$\begin{aligned}
    &\textbf{for } i = 0 \textbf{ to } W - 1\\
    &\quad \textbf{for } n = 1 \textbf{ to } 8\ \{\\
    &\qquad \mathrm{DB} = \mathrm{FillingRule}(\mathrm{SB}_n[i])\\
    &\qquad \mathrm{CV}[i] = \mathrm{CV}[i] + \mathrm{DB}/8\\
    &\quad \}
    \end{aligned}$$


    where the DB is obtained by applying the FillingRule() function to the WC of the current sample. The filling-rule judge function FillingRule() returns either 1 or 0 to indicate whether the sample is inside the object or not.
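The three stages of the conventional flow (FWC generation as in (1), WC accumulation as in (2), and fill decision with CV calculation) can be put together in a short Python sketch. Several things here are assumptions made for illustration only: the function name `rasterize_scan_line`, the 0-based SS index, the `(n, a, d)` crossing tuples with the sample position `a` already quantized to an integer, the treatment of the entry to the left of `i = 0` as zero, and the 6-pixel example. The actual 8-queen sample offsets inside a pixel are not modeled.

```python
def rasterize_scan_line(crossings, width, rule="nonzero", n_ss=8):
    """Return the coverage value of each pixel of one scan line.

    crossings: iterable of (n, a, d) with sub-scan line index n (0-based),
    sample position a, and edge direction d (+1 downward, -1 upward).
    """
    sb = [[0] * width for _ in range(n_ss)]  # one SB per sub-scan line

    # FWC generation, as in (1): SB_n[a] = SB_n[a] + direct(E_i)
    for n, a, d in crossings:
        sb[n][a] += d

    # WC accumulation, as in (2); each SB is independent, so the loop
    # order does not matter. The entry left of i = 0 is taken as 0.
    for n in range(n_ss):
        for i in range(1, width):
            sb[n][i] += sb[n][i - 1]

    # Fill decision and CV calculation: each inside sample adds 1/n_ss
    cv = [0.0] * width
    for i in range(width):
        for n in range(n_ss):
            wc = sb[n][i]
            db = (wc != 0) if rule == "nonzero" else (wc % 2 != 0)
            cv[i] += db / n_ss
    return cv


# Hypothetical example: every SS enters the object at x = 1; the first four
# SSs leave it at x = 4 and the remaining four at x = 5.
crossings = [(n, 1, 1) for n in range(8)]
crossings += [(n, 4 if n < 4 else 5, -1) for n in range(8)]
cv = rasterize_scan_line(crossings, width=6)  # [0.0, 1.0, 1.0, 1.0, 0.5, 0.0]
```

Pixel 4 receives a CV of 0.5 because only four of its eight samples are still inside the object, which is the partial coverage that the blending stage uses to smooth the edge. The sketch also makes the memory cost visible: `sb` holds eight full-width lines.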

    III. PREVIOUS SB REDUCTION METHOD

    From the aspect of hardware implementation, the use of SB costs considerable on-chip memory capacity. The OpenVG specification stipulates that there be at most 255 edge crossings in one scan line, which means that at least 9 bits, including a sign bit, are required to represent the possible FWC and WC values stored in each SB entry. If the 8-queen multisampling technique is realized, the total SB size will be up to 77,760 bits (1080 entries × 9 bits × 8 SBs) for the 1080 × 1920 screen resolution. Consequently, [12] proposed a low-cost rasterization design to reduce the required buffer size. Fig. 6 shows the detailed flowchart of their approach, where minx and maxx...

