
612 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 37, NO. 5, MAY 2002

A Reconfigurable Multilevel Parallel Texture Cache Memory With 75-GB/s Parallel Cache Replacement Bandwidth

Se-Jeong Park, Jeong-Su Kim, Ramchan Woo, Se-Joong Lee, Kang-Min Lee, Tae-Hum Yang, Jin-Yong Jung, and Hoi-Jun Yoo

Abstract—Recently, the level of realism in PC graphics applications has been approaching that of high-end graphics workstations, necessitating a more sophisticated texture data cache memory to overcome the finite bandwidth of the AGP or PCI bus. This paper proposes a multilevel parallel texture cache memory to reduce the required data bandwidth on the AGP or PCI bus and to accelerate the operations of parallel graphics pipelines in PC graphics cards. The proposed cache memory is fabricated by 0.16-µm DRAM-based SOC technology. It is composed of four components: an 8-MB DRAM L2 cache, 8-way parallel SRAM L1 caches, pipelined texture data filters, and a serial-to-parallel loader. For high-speed parallel L1 cache data replacement, the internal bus bandwidth has been maximized up to 75 GB/s with a newly proposed hidden double data transfer scheme. In addition, the cache memory has a reconfigurable architecture in its line size for optimal caching performance in various graphics applications from three-dimensional (3-D) games to high-quality 3-D movies. This architecture also leads to optimal power consumption with an adaptive sub-wordline activation scheme. The pipelined texture data filters and the dedicated structure of the L1 caches implemented by the DRAM peripheral transistors show the potential of DRAM-based SOC design with better performance-to-cost ratio.

Index Terms—3-D graphics, DRAM-based SOC, DRAM L2 cache, multilevel parallel cache, texture cache.

I. INTRODUCTION

AS THREE-DIMENSIONAL (3-D) graphics become one of the most important components of multimedia applications today, 3-D graphics processing power has become a major performance index of multimedia systems, such as PCs or portable information terminals [1]–[5]. At present, the level of realism of PC graphics scenes is comparable to that of high-end graphics workstations a few years ago [6], [7]. For the generation of realistic scenes, texture mapping has been frequently used in 3-D graphics [8]. This technique enhances the realism of 3-D graphics scenes by wrapping 3-D model surfaces with two-dimensional (2-D) texture images obtained by scanning the surface of the 3-D objects in real space. By the texture mapping operation, surface details such as color and roughness can easily

Manuscript received July 28, 2001; revised November 9, 2001.

S.-J. Park, J.-S. Kim, R. Woo, S.-J. Lee, K.-M. Lee, and H.-J. Yoo are with the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, Taejon 305-701, Korea (e-mail: [email protected]).

T.-H. Yang and J.-Y. Jung are with the Advanced Circuit Design Team, Memory R&D Division, Hynix Semiconductor, Ichon, Korea (e-mail: [email protected]).

Publisher Item Identifier S 0018-9200(02)03670-3.

be represented by simply attaching the original scanned surface images to the 3-D graphics model surfaces, as shown in Fig. 1.

However, the texture mapping operation requires intensive system memory access because texture data is usually stored in a PC’s system memory, and then loaded into a graphics card through the AGP or PCI bus on demand. This is called pull architecture, which is better in terms of memory utilization as compared to push architecture [9]. In push architecture, all the texture data is loaded into the PC graphics memory before starting rendering. This limits the size of the texture data to the graphics memory size and leads to low memory utilization because the graphics memory can only be used when 3-D graphics applications are running. Therefore, in most PC graphics cards, pull architecture is preferred from the viewpoint of memory utilization. Although Intel’s AGP bus has been developed for more efficient texture data loading with a small L1 texture cache within a pull architecture [10], more efficient use of the finite bus bandwidth by a more sophisticated texture cache memory is still required to increase the graphics realism with interactive frame rate. Furthermore, due to the widespread use of parallel rendering architectures even in PC graphics cards, it is important to support parallel graphics pipelines without texture cache memory access conflicts among the parallel graphics pipelines [11].

With consideration of the aforementioned requirements, the new cache memory design has two goals. The first is to reduce the required bandwidth on the AGP or PCI bus for loading texture image data, and the second is to support parallel graphics pipelines for maximum speed operations. Fig. 2 shows a block diagram of the proposed cache memory architecture. It is composed of four components: an 8-MB DRAM L2 cache memory, 8-way parallel SRAM L1 cache memories, eight pipelined texture filter modules, and a serial-to-parallel loader. All of these components are integrated on a single chip and fabricated using 0.16-µm DRAM-based SOC technology.

The large DRAM L2 cache memory reduces the required data bandwidth on the AGP or PCI bus by exploiting the interframe texture data coherency, which is similar to that found in MPEG algorithms [9]. Since most of the texture data for rendering the current graphics frame is reused for rendering the next frame, the large DRAM L2 cache memory can reduce the required data bandwidth on the AGP or PCI bus by 20 times for 1024 × 768 screen resolution. This will be explained in detail in the performance analysis section of this paper. The 8-way parallel SRAM L1 cache memories can independently support parallel graphics



Fig. 1. Realistic image generation by texture mapping.

Fig. 2. Block diagram of the multilevel parallel texture cache.

pipelines with dedicated texture data filter modules. The independent L1 cache memories remove the cache access conflicts by parallel graphics pipelines, and enable each graphics pipeline to run at its maximum speed.

For maximizing the advantages of the proposed architecture, wide data bandwidth between the L2 and L1 cache memories is crucial for smoothing parallel L1 cache refill operations. For this goal, a wide internal bus (IBUS) has been adopted, and a newly proposed hidden double data transfer scheme maximizes the IBUS bandwidth up to 75 GB/s. This wide IBUS bandwidth enables eight L1 caches to be serviced by a large DRAM L2 cache memory without starvation, which is unfeasible in PCB-level design. In addition, the cache line sizes of the L2 and L1 caches can be reconfigured in the range of 4 × 4, 8 × 8, and 16 × 16 pixel areas to keep optimal caching performance for various graphics applications from 3-D games to high-quality 3-D movies [12]. Furthermore, the dedicated SRAM L1 cache and the pipelined filter structure based on the texture mapping algorithm show good performance in spite of using low-speed DRAM peripheral transistors in the DRAM-based SOC design. This results in a more cost-effective design than that with expensive merged DRAM logic (MDL) technology [13]. Although the use of embedded DRAM to store texture image data has been studied by other researchers [14], their architecture was based on push architecture, which has limits in texture image size, and there was no actual VLSI implementation.

In Section II, the texture mapping and filtering algorithm by trilinear interpolation is introduced. In Section III, details of the cache architecture are described. In Sections IV–VII, details of the sub-blocks and the adopted circuit techniques are explained. In Section VIII, the performance improvement is demonstrated by running real 3-D graphics applications, and the chip implementation results are shown. Finally, conclusions are presented in Section IX.

II. TEXTURE MAPPING AND TRILINEAR INTERPOLATION

The conceptual texture mapping operation is composed of two steps. The first is wrapping 3-D graphics object surfaces with 2-D scanned texture images, as shown in Fig. 1. The second step is projecting the textured 3-D graphics objects onto a 2-D screen. Thus, it incorporates two-stage floating-point matrix calculations (2-D texture image space → 3-D object space → 2-D screen space). However, the actual texture mapping process takes reverse calculation steps to remove unnecessary calculations for hidden objects in 3-D space [8]. Therefore, inverse transform matrix calculations are performed on each pixel of the 2-D screen. Although this inverse mapping process reduces the required processing power, it brings another problem of aliasing artifacts in the generated graphics scenes [8]. Since the screen pixel is not a mathematically defined point, but rather an area, the corresponding portion in the texture image is also an area. Furthermore, although the size of the screen pixel is fixed, the size of the corresponding area in the texture image space varies according to the geometrical relationships between the 3-D object and the 2-D screen in 3-D space [8]. Thus, to find a representative pixel value in the corresponding texture area, another calculation step, called filtering, is required. Although a pixel in the texture image is called a “texel” in 3-D graphics, in this paper we simply call it a “pixel” for nonspecialists in 3-D graphics.
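To make the inverse-mapping step concrete, the following C++ sketch maps a single screen pixel back into texture-image space. It assumes a 3 × 3 projective inverse-mapping matrix for one textured surface; the struct and function names are illustrative only and are not taken from the paper.

#include <array>

// Sketch of inverse texture mapping: instead of projecting texels forward
// onto the screen, each screen pixel is transformed back into texture space.
struct TexCoord { float u, v; };

// M is assumed to hold the inverse (projective) mapping for one textured surface.
TexCoord screenToTexture(const std::array<std::array<float, 3>, 3>& M,
                         float x, float y) {
    // Homogeneous transform of the screen pixel (x, y, 1).
    float u = M[0][0] * x + M[0][1] * y + M[0][2];
    float v = M[1][0] * x + M[1][1] * y + M[1][2];
    float w = M[2][0] * x + M[2][1] * y + M[2][2];
    // Perspective divide yields floating-point texture coordinates; because a
    // screen pixel covers an area, its footprint around (u, v) varies with the
    // geometry, which is why the filtering step described above is needed.
    return { u / w, v / w };
}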

The proposed cache architecture is specially designed for the texture filtering method based on trilinear interpolation with mipmap texture images [8], which is adopted by most of today’s 3-D graphics hardware. Trilinear interpolation is a kind of 2-D image filtering operation to reduce the aliasing artifacts in 3-D graphics scenes. This filtering algorithm first makes the mipmap. It is made by prefiltering an original texture image and resampling it to make a half-size image, and repeating these operations until the texture image size becomes 2 × 2,


Fig. 3. Mipmap and trilinear interpolation.

Fig. 4. Conceptual structure of the proposed multilevel parallel texture cache memory.

as shown in Fig. 3. The original texture image is named the level-of-detail 0 (LOD 0) image, and the smallest image is named the LOD n image. The trilinear interpolation selects two neighboring LOD images which have the closest 1 : 1 area relationship with the screen pixel area [15], [16]. Then, this method reads eight pixel values, four from even LOD levels and four from odd LOD levels, and weights them to evaluate a final pixel value to be mapped onto the screen pixel. This trilinear interpolation method using the mipmap can evaluate the final filtered pixel value in a fixed cycle time regardless of the size variation of the texture image portion to be filtered in the original texture image (LOD 0 image).
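The mipmap construction and LOD-pair selection described above can be summarized in a short C++ behavioral sketch. The box filter, single-channel texels, and helper names are simplifying assumptions, not the paper's implementation.

#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

struct Image {
    int size;                   // square texture, power of two
    std::vector<float> texels;  // one color channel, for brevity
    float at(int x, int y) const { return texels[y * size + x]; }
};

// Build LOD 0 .. LOD n by repeated 2 x 2 box filtering until the image is 2 x 2.
std::vector<Image> buildMipmap(const Image& lod0) {
    std::vector<Image> chain{lod0};
    while (chain.back().size > 2) {
        const Image& src = chain.back();
        int half = src.size / 2;
        Image dst{half, std::vector<float>(static_cast<size_t>(half) * half)};
        for (int y = 0; y < half; ++y)
            for (int x = 0; x < half; ++x)
                dst.texels[y * half + x] =
                    0.25f * (src.at(2 * x, 2 * y)     + src.at(2 * x + 1, 2 * y) +
                             src.at(2 * x, 2 * y + 1) + src.at(2 * x + 1, 2 * y + 1));
        chain.push_back(std::move(dst));
    }
    return chain;
}

// The floating-point LOD value picks the two neighboring integer levels whose
// texel area is closest to a 1:1 relation with the screen pixel; four texels
// are then read from each of the two levels, eight in total.
std::pair<int, int> selectLodPair(float lod, int levels) {
    int lo = std::max(0, std::min(levels - 2, static_cast<int>(std::floor(lod))));
    return {lo, lo + 1};
}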

In an even or odd LOD level image of the two selected LOD level images, four pixel values at four neighboring integer coordinates are read and interpolated to find a representative pixel value of the four neighboring pixel values by weighting the distances between the four integer coordinates and the transformed floating-point coordinates from a pixel coordinate in the screen space by the transform matrix. This process is illustrated in Fig. 3. It is called bilinear interpolation. Since the calculated LOD value is also a floating-point value, and is between the two selected integer LOD levels, the two bilinear interpolated pixel values from the two selected integer LOD level images are interpolated again by weighting the fractional value of the floating-point LOD value. The final value becomes a trilinear interpolated pixel value to be mapped onto a screen pixel. Therefore, the texture cache based on the trilinear interpolation method using mipmaps stores portions of the mipmap texture images in different LOD levels and in different texture images.
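Continuing the sketch above (reusing the Image struct), the bilinear and trilinear weighting just described can be written compactly in floating point. This is the conceptual computation only, not the chip's fixed-point datapath, and clamping at image borders is omitted for brevity.

#include <algorithm>
#include <cmath>
#include <vector>

// Bilinear interpolation inside one LOD image: the four texels at the
// neighboring integer coordinates are weighted by the fractional parts of (u, v).
float bilinear(const Image& img, float u, float v) {
    int x = static_cast<int>(u), y = static_cast<int>(v);
    float fu = u - x, fv = v - y;
    float top = (1 - fu) * img.at(x, y)     + fu * img.at(x + 1, y);
    float bot = (1 - fu) * img.at(x, y + 1) + fu * img.at(x + 1, y + 1);
    return (1 - fv) * top + fv * bot;
}

// Trilinear interpolation: bilinear in the two selected LOD levels, then a
// final blend by the fractional part of the floating-point LOD value.
float trilinear(const std::vector<Image>& mip, float u, float v, float lod) {
    int lo = std::max(0, std::min(static_cast<int>(mip.size()) - 2,
                                  static_cast<int>(std::floor(lod))));
    int hi = lo + 1;
    // Coordinates shrink by a factor of two per LOD level.
    float cLo = bilinear(mip[lo], u / (1 << lo), v / (1 << lo));
    float cHi = bilinear(mip[hi], u / (1 << hi), v / (1 << hi));
    float f = lod - lo;  // fractional LOD weight
    return (1 - f) * cLo + f * cHi;
}

The fixed-point, pipelined form of this computation, as used by the texture filter, is revisited in Section VII.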


Fig. 5. Memory cell arrangement and sense amplifier structure.

III. PROPOSED CACHE ARCHITECTURE

Fig. 4 shows the conceptual structure of the proposed multilevel parallel texture cache memory, assuming that each memory cell corresponds to 32-bit color information (R, G, B, and A) of a pixel. It is composed of two parts: the left-hand part for even LOD texture images and the right-hand part for odd LOD texture images. In each LOD part, the texture data flows in four stages: a serial-to-parallel latch (SP_LATCHE, SP_LATCHO), a DRAM L2 cache, parallel SRAM L1 caches, and trilinear interpolators. The wide internal bus (IBUSE [255:0], IBUSO [255:0]) runs vertically and connects the serial-to-parallel latch to the DRAM L2 cache and the DRAM L2 cache to the 8-way parallel SRAM L1 caches.

The DRAM L2 cache size is 8 MB, which is optimal for 1024 × 768 screen resolution in most graphics applications [9], and the optimal size of each SRAM L1 cache is 16 kB [12]. Each SRAM L1 cache has parallel output data paths for transferring eight pixel values simultaneously to its trilinear interpolator. The pipelined trilinear interpolator (texture filter) generates a final aliasing-free pixel value in each clock cycle with a latency corresponding to the number of pipeline stages. The parallel data path of the SRAM L1 cache and the pipelined trilinear interpolator allow cost-effective system performance in spite of using DRAM-based SOC technology, which is lower in speed than the expensive MDL technology. The cache lines of the L2 and L1 caches are mapped onto 2-D texture image blocks, not on a one-dimensional line, for a lower cache miss rate [12]. Furthermore, the cache line size is reconfigurable in the range of 4 × 4, 8 × 8, and 16 × 16 pixel areas to maintain optimal caching performance for various graphics applications [12].
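The 2-D block mapping can be illustrated with a small addressing sketch; the field split below is an assumption made for illustration, since the chip's exact tag/index arrangement is not given here.

#include <cstdint>

// A cache line holds a square pixel block (4 x 4, 8 x 8, or 16 x 16) rather
// than a 1-D run of texels.
struct BlockAddress {
    uint32_t blockX, blockY;    // which tile of the texture image
    uint32_t offsetX, offsetY;  // texel position inside the tile
};

BlockAddress mapTexel(uint32_t x, uint32_t y, uint32_t blockSize /* 4, 8, or 16 */) {
    return { x / blockSize, y / blockSize, x % blockSize, y % blockSize };
}

Because a bilinear footprint is only a 2 x 2 texel neighborhood, square blocks keep most footprints inside a single cache line, which is why the 2-D mapping gives a lower miss rate than one-dimensional lines.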

IV. HIDDEN DOUBLE DATA TRANSFER

For the parallel SRAM L1 caches to operate with sufficient cache refill bandwidth, the IBUS bandwidth has been maximized by a hidden double data transfer scheme. This scheme is similar to that of the page mode operation in conventional DRAMs. However, the application is different in that it is for maximizing the bus bandwidth on the wide data bus, where the bus width normally cannot be maximized due to the size difference between the small DRAM cell and the large logic I/O pitch in SOC designs. This scheme transfers 2-bit data through a single-bit IBUS pair during a single DRAM read/write cycle. Fig. 5(a) shows a single-bit pair of the IBUS

and the related memory cells. Eight DRAM cells and two SRAM cells reside under the IBUS pair because of the size differences of the memory cells. The eight DRAM cells are divided into four logically different row address groups, and the two DRAM cells having the same logical row address are accessed during a single DRAM read/write cycle. Each sense amplifier has independent read/write gating signals, as shown in Fig. 5(b), and the eight sense amplifiers are connected to the IBUS line by multiplex control signals (WR_U [0:3], RD_U [0:3], WR_D [0:3], RD_D [0:3]).

During a DRAM write cycle, the 2-bit data in SP_LATCH_L and SP_LATCH_R, for example, are written into the two DRAM cells having the logically same row address 0 (0L, 0R). These 2-bit data correspond to the 2-bit color information of a pixel. The detailed operations of the write cycle are explained in the second write cycle (Write Cycle 2) of Fig. 6. During the W4 clock cycle, a sub-wordline (SWL) is activated, and two bitline signals (BL_D0, BL_U0) are developed by the lower and upper sense amplifiers (SA_D0, SA_U0). In the next write


Fig. 6. Read/write cycles of the DRAM L2 cache.

clock cycle (W5), the cell data in SP_LATCH_L is written into the left DRAM cell (0L) through the lower sense amplifier (SA_D0), assuming that the data in SP_LATCH_L is 1. In the W6 and W7 clock cycles, the cell data in SP_LATCH_R is written into the right DRAM cell (0R) through the upper sense amplifier (SA_U0), assuming that the data in SP_LATCH_R is 0. Since the DRAM cell write operation takes at least two clock cycles, the left DRAM cell (0L) write operation is performed during the W5 and W6 clock cycles, and the write operation of the right DRAM cell (0R) during the W6 and W7 clock cycles. Therefore, the W6 clock cycle hides another memory cell data write operation, reducing one clock cycle in the writing operation.

More clock cycles can be reduced in the DRAM read cycle (Read Cycle 1), as shown in Fig. 6. The data in the two DRAM cells (0L, 0R) is read in a similar manner as in the DRAM write cycle. However, data transfer from the two DRAM cells occurs during the DRAM cell data restoring cycles (R1, R2) for both of the DRAM cells through a single-bit IBUS pair, resulting in double data transfers during a single cell data restoration time. Therefore, the peak bandwidth from the L2 cache to the L1 caches is 75 GB/s when the cache line sizes of the L2 and L1 caches are configured for 16 × 16 pixel blocks at a clock speed of 150 MHz (75 GB/s = 2 (even and odd LOD parts) × (IBUS pairs in a cell matrix) × 2 (double data transfers) × 4 (MPTC_Bank A, B, C, D) ÷ 8 (bits/byte) × 4 (bytes/pixel for R, G, B, A) × 150 MHz ÷ 4 (three read clock cycles + one precharge clock cycle)).
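The peak-bandwidth arithmetic can be checked with a short calculation. The per-cell-matrix IBUS pair count is not recoverable from the text above and is therefore an explicit assumption here (256 pairs per LOD part, consistent with the IBUSE [255:0]/IBUSO [255:0] naming); the remaining factors follow the formula above.

#include <cstdio>

// Rough check of the IBUS peak replacement bandwidth.
int main() {
    const double ibusPairs      = 256;   // ASSUMPTION: pairs per cell matrix
    const double lodParts       = 2;     // even and odd LOD parts
    const double doubleTransfer = 2;     // hidden double data transfer per pair
    const double banks          = 4;     // MPTC_Bank A, B, C, D
    const double bitsPerByte    = 8;
    const double bytesPerPixel  = 4;     // R, G, B, A
    const double clockHz        = 150e6;
    const double cyclesPerRead  = 4;     // three read cycles + one precharge cycle

    double bitsPerReadCycle = lodParts * ibusPairs * doubleTransfer * banks;
    double gbPerSecond = bitsPerReadCycle / bitsPerByte * bytesPerPixel
                       * clockHz / cyclesPerRead / 1e9;
    std::printf("peak IBUS bandwidth ~ %.1f GB/s\n", gbPerSecond);
    return 0;
}

With these assumptions the program prints roughly 76.8 GB/s, on the order of the 75-GB/s peak quoted above.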

V. RECONFIGURABLE CACHE LINE SIZE

In order to obtain better caching performance for various graphics applications, the proposed cache has reconfigurable

Fig. 7. Reconfigurable cache line architecture.

architecture in its cache line size. Since the optimal cache line size varies according to the characteristics of incoming graphics applications [12], changing the cache line size to its optimal value results in a lower miss rate. It also reduces power consumption by removing unnecessary sub-wordline activation [17].

The L2 and L1 cache memory cells are divided into 16 sub-groups on a main wordline in each LOD part, as shown in Fig. 7. Each sub-group corresponds to one sub-wordline for partial activation. In the L1 cache, each sub-wordline has 32 SRAM cells, which cover a 2-D-mapped 4 × 4 pixel area in the texture image space with 2-bit color data for each pixel. The L1 and L2 block selectors adaptively activate 1, 4, or 16 sub-groups simultaneously to change cache line sizes in both L1 and L2 caches. The cache line sizes can be configured to 4 × 4, 8 × 8, or 16 × 16 pixel blocks in both L1 and L2 caches by external control inputs (L1_BK [1:0], L2_BK [1:0]).
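The relation between the configured line size and the amount of activated array can be stated as a tiny sketch; the enum is illustrative, and the encoding of the L1_BK [1:0]/L2_BK [1:0] inputs is not specified here.

// Number of sub-wordlines (sub-groups) activated per access for each line size.
enum class LineSize { Px4x4, Px8x8, Px16x16 };

int activatedSubGroups(LineSize s) {
    switch (s) {
        case LineSize::Px4x4:   return 1;   // one 4 x 4 sub-group (32 SRAM cells in L1)
        case LineSize::Px8x8:   return 4;   // four sub-groups form an 8 x 8 block
        case LineSize::Px16x16: return 16;  // all 16 sub-groups on the main wordline
    }
    return 1;
}

Activating only the needed sub-groups is what saves power when a smaller line size is optimal for the running application.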


Fig. 8. Parallel access of the SRAM L1 cache.

One sub-wordline of the L2 cache contains 128 DRAM cells, which are divided into four logically different row address groups. The 32 DRAM cells that make up a logical group have the same logical address, and only one logical group is involved in an L2 cache read/write cycle. This logically divided row addressing scheme can increase the DRAM cell efficiency by assigning more DRAM cells to a sub-wordline.

VI. SRAM L1 CACHE WITH SCALABLE PARALLEL 2-D COLUMN DECODER

To achieve sufficient L1 cache access speed in spite of using low-speed DRAM peripheral transistors in DRAM-based SOC technology, each SRAM L1 cache has parallel output data paths to its trilinear interpolator, supplying eight pixel data simultaneously in a single clock cycle. With this parallel L1 cache data path, eight times wider L1 cache access bandwidth has been achieved at a clock speed of 150 MHz. This simultaneous pixel data access enables the trilinear interpolator to generate a filtered pixel value in every clock cycle after an initial latency.

Fig. 8 shows the cell matrix of the SRAM L1 cache containing 2-bit color information with a parallel output data path. There are four SRAM cell matrices for 8-bit color information in the four MPTC banks (MPTC_Bank A, B, C, D), as shown in the layout photograph of Fig. 17. In an SRAM cell matrix, one sub-wordline contains 16 pixels of data for a 4 × 4 pixel area with 2-bit color information per pixel. The column decoder only receives the address of the upper-left pixel among the four neighboring target pixels, which are necessary for texture filtering in each LOD part, to reduce the pin count of the input address, as shown in Fig. 4. The other three neighboring pixels are automatically selected by a new column decoder. Fig. 8 shows the unit block of the column decoder, which simultaneously generates four selection signals from the single input address in each LOD part. The column decoder can also change its decoding range from 4 × 4 to 16 × 16 for the reconfigurable cache block size.

To meet the functional requirements of the column decoder, a scalable parallel 2-D column decoder has been newly designed. It has the ability to simultaneously select four neighboring target pixel data, and it also possesses a scalable architecture for a variable decoding range by merging multiple unit column

Fig. 9. Structure of the scalable parallel 2-D column decoder.

decoders. The scalable parallel 2-D column decoder is composed of two blocks: a unit column decoder (Unit CDEC) array and a propagation channel, as shown in Fig. 9. The unit CDEC covers a 4 × 4 pixel area, and acts as an independent 2-D column decoder when the L1 cache line size is configured to a 4 × 4 pixel block. However, when the L1 cache line size is configured to an 8 × 8 or 16 × 16 pixel block, the propagation channel bridges the unit CDECs to provide a wider decoding range.

For the multiple unit CDECs to be merged as a single CDEC, boundary problems should be solved. These occur when the four neighboring target pixels reside on the edges of different 4 × 4 pixel blocks. The propagation channel solves this problem by transferring propagation signals to neighboring unit CDECs in the texture image space. Fig. 10 shows cases when the propagation signals are generated, assuming that the cache block size is increased from a 4 × 4 to an 8 × 8 pixel block. Fig. 10(a) shows the memory cell mapping relationship between a 4 × 4 pixel block and 16 memory cells. Fig. 10(b) shows a case when the decoding range is enlarged to an 8 × 8 pixel block and four unit CDECs cooperate as a single large CDEC by exchanging propagation signals. The propagation signals can be classified into four cases as follows.

Case 1) (no propagation): the propagation channel acts as a simple block decoder.

Case 2) (horizontal propagation): two pixels reside on one block and the other two pixels reside on the horizontally adjacent block.

Case 3) (vertical propagation): two pixels reside on one block and the other two pixels reside on the vertically adjacent block.

Case 4) (both propagations): the four pixels reside on four different neighboring blocks.

Case 1 means that no propagation signal is generated. In Case 2, a propagation signal is transferred from the unit CDEC for the left block to the unit CDEC for the right block, notifying it to generate the output signals for the right two pixels. In Case 3, a propagation signal is generated for the bottom two pixels. In Case 4, propagation signals are generated for the upper-right and lower-left pixels, and then, for the lower-right pixel, propagation signals from the unit CDECs for the upper-right and lower-left blocks are generated. Although the propagation channel delay can lower the operation speed, in this design it was not critical because of the low chip-level clock speed of 150 MHz.
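The decoder's behavior can be modeled in a few lines of C++. The sketch below takes the upper-left target-pixel coordinate and returns the four selected positions, flagging when the selection crosses into a horizontally, vertically, or diagonally adjacent 4 × 4 unit block (the propagation cases above); it is a behavioral model with assumed names, not the circuit.

#include <utility>

struct Selection {
    std::pair<int, int> pixels[4];       // (x, y) of the four target pixels
    bool propagateRight, propagateDown;  // selection crosses a 4 x 4 unit boundary
};

Selection decode(int x, int y) {
    Selection s;
    s.pixels[0] = {x, y};
    s.pixels[1] = {x + 1, y};
    s.pixels[2] = {x, y + 1};
    s.pixels[3] = {x + 1, y + 1};
    // A unit CDEC covers a 4 x 4 area; if the 2 x 2 footprint straddles its
    // right or bottom edge, a propagation signal notifies the adjacent unit CDEC.
    s.propagateRight = (x % 4) == 3;     // Case 2 (and part of Case 4)
    s.propagateDown  = (y % 4) == 3;     // Case 3 (and part of Case 4)
    // Case 1: neither flag set.  Case 4: both set, so the diagonal neighbor is
    // also involved for the lower-right pixel.
    return s;
}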


Fig. 10. Operation examples of the scalable parallel 2-D column decoder. (a) Block size: 4 × 4. (b) Block size: 8 × 8.

Fig. 11. Scaling to an 8 × 8 2-D column decoder by merging four unit CDECs.

Fig. 11 shows the detailed interconnections through the propagation channel among the four unit CDECs for an 8 × 8 pixel block. All the unit CDECs share part of the input addresses, and each unit CDEC is enabled by the combination of the remaining address inputs and the enable signal (ENA). Propagation paths shown as dotted lines are shorted by the BK[0] signals when the cache block size is configured for the 8 × 8 pixel block. These four unit CDECs can also act as a unit CDEC recursively when 16 unit CDECs are configured for a 16 × 16 pixel block.

Fig. 12 shows the schematic of the unit CDEC. It has four basic inputs for the upper-left pixel position of the four neighboring target pixels, and has 16 outputs. It also has a decoder enable input (ENA) and propagation signal I/Os to communicate with neighboring unit CDECs. Four major operations according to the propagation inputs and the ENA input are tabulated in Fig. 12. The table also shows the internal node signals and their operations.


Fig. 12. Schematic of the unit column decoder and its operations.

Fig. 13. Operations of the three-stage pipelined trilinear interpolator.

VII. PIPELINED TRILINEAR INTERPOLATOR

The eight pixel data simultaneously transferred from the L1 cache are processed in the three-stage pipelined interpolator, which generates a filtered output value in each clock cycle after an initial latency. The trilinear interpolator uses the 4-bit fractional parts of the physical L1 cache input addresses as the weighting factors, as shown in Fig. 13(a). These 4-bit addresses divide a 1 × 1 integer pixel block into 16 sub-blocks, as shown in Fig. 13(a), and the 4-bit addresses are used as the weighting factors for the four neighboring pixel values at the integer coordinates.


Fig. 14. (a) I/O waveforms for the L1 cache access. (b) Stored L1 cache data and its calculated outputs.

The target pixel coordinates in the two LOD parts are different because their image sizes are different in the mipmap. Therefore, the physical L1 cache address is used in its original form in one LOD part, and a 1-bit right-shifted physical L1 cache address is used as the target pixel coordinate in the other LOD part.

Fig. 13(b) shows the interpolation steps in the three-stage pipeline. In the first stage, it interpolates the eight pixel values along the horizontal direction by the 4-bit fractional part of the corresponding input address in each LOD part, which gives four interpolated values. In the second stage, interpolations are performed along the vertical direction by the 4-bit fractional part of the other address, resulting in two interpolated pixel values. Finally, in the third stage, the two interpolated pixel values from the even and odd LOD parts are interpolated by the 4-bit fractional value of the LOD value, which also divides the distance between two integer LOD values into 16 steps. Fig. 14 shows the operations of the L1 cache and the three-stage pipelined trilinear interpolator. Three input addresses are sampled at the first three rising clock edges, and proceed into the four-stage pipeline including the L1 cache access stage. Trilinear interpolated pixel values are obtained from the sixth clock cycle, assuming that the stored pixel values in the L1 cache are as shown in Fig. 14(b).
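A fixed-point behavioral sketch of the three-stage weighting, using 4-bit fractional weights as described above, is given below. The bit widths of intermediate values and the argument layout are illustrative assumptions, not the actual datapath; separate fractions are passed for the even and odd LOD parts because, per the text, one part uses the 1-bit right-shifted address.

#include <cstdint>

// 4-bit fractional weight w in 0..15 represents w/16.
static uint32_t lerp4(uint32_t a, uint32_t b, uint32_t w) {
    return (a * (16 - w) + b * w) >> 4;
}

// p[0..3]: four texels from the even LOD part, p[4..7]: four texels from the
// odd LOD part, each ordered {upper-left, upper-right, lower-left, lower-right}.
uint32_t trilinearFixed(const uint32_t p[8],
                        uint32_t fuEven, uint32_t fvEven,   // fractions, even LOD
                        uint32_t fuOdd,  uint32_t fvOdd,    // fractions, odd LOD
                        uint32_t fLod) {                    // fraction of the LOD value
    // Stage 1: interpolate along one axis (four results).
    uint32_t e0 = lerp4(p[0], p[1], fuEven), e1 = lerp4(p[2], p[3], fuEven);
    uint32_t o0 = lerp4(p[4], p[5], fuOdd),  o1 = lerp4(p[6], p[7], fuOdd);
    // Stage 2: interpolate along the other axis (two results).
    uint32_t even = lerp4(e0, e1, fvEven);
    uint32_t odd  = lerp4(o0, o1, fvOdd);
    // Stage 3: blend the even and odd LOD results by the LOD fraction.
    return lerp4(even, odd, fLod);
}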

VIII. PERFORMANCE GAIN AND CHIP IMPLEMENTATION

For the architectural analysis, a software graphics pipeline and a multilevel parallel texture cache model have been implemented using C++. This architecture analysis environment allows clock-level performance analysis. As a test model, the Quake III Arena computer game has been used [11]. Fig. 15


Fig. 15. Test model and its data characteristics.


Fig. 16. (a) Bandwidth reduction by the DRAM L2 cache. (b) Parallel speedup by the wide IBUS bandwidth.

shows a sample graphics frame and its model data characteristics. An external SRAM tag memory was assumed to be used, and a prefetching technique with a latency first-in-first-out (FIFO) was used to hide the tag memory access latency [18].

Fig. 16(a) shows the required bandwidth for the L2 and L1 cache replacement when rendering 50 consecutive frames in the Quake III Arena game. The upper graph shows the required bandwidth on the IBUS for replacing the 8-way parallel SRAM L1 caches, and the lower graph for the DRAM L2 cache re-

Fig. 17. Die photograph of the multilevel parallel texture cache memory.

placement. The required average bandwidths for the L2 and L1 caches are 210 kB/frame and 4.7 MB/frame, respectively. Thus, the L2 cache has reduced the required bandwidth on the AGP or PCI bus to about one twentieth of that without it. Fig. 16(b) shows the parallel speedup by the parallel L1 caches with wide IBUS bandwidth. Compared with the PCB-level parallel cache design (assuming the L1 cache block transfer time for a 16 × 16 block size to be 200 ns), this single-chip architecture can achieve a parallel speedup of eight without parallel speedup saturation. This results in a sustained texture data access speed of 6.6 Gpixels/s, and a trilinear interpolated pixel rate of 825 Mpixels/s.
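These figures are mutually consistent, as a quick check shows; the numbers are taken directly from the text above.

#include <cstdio>

int main() {
    // Per-frame replacement traffic from Fig. 16(a).
    const double l1RefillPerFrame = 4.7e6;   // bytes, supplied by the on-chip L2
    const double l2RefillPerFrame = 210e3;   // bytes, fetched over the AGP/PCI bus
    std::printf("bandwidth reduction ~ %.0fx\n",
                l1RefillPerFrame / l2RefillPerFrame);   // ~22x, i.e., about 20 times

    // Eight texels are read per trilinear-filtered pixel (four from each of
    // the two LOD levels), so the texel rate is eight times the pixel rate.
    const double texelRate = 6.6e9;          // sustained texels/s
    std::printf("trilinear pixel rate ~ %.0f Mpixels/s\n",
                texelRate / 8 / 1e6);        // ~825 Mpixels/s
    return 0;
}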

A prototype chip has been fabricated by 0.16-µm DRAM-based SOC technology using one polysilicon and three metal layers (one tungsten and two aluminum). Fig. 17 shows its die photograph. This chip is for only one color component (R, G, B, or A), and has four parallel SRAM L1 caches as an experimental version. Increasing the number of the L1 caches leads to long IBUS lines, which can reduce the operating frequency of the chip. Since this chip has large line drivers for the IBUS lines, up to eight SRAM L1 caches can be attached on the IBUS for peak performance without lowering the operation frequency. The chip is vertically divided into two parts, one for even LOD level images and another for odd LOD level images. There are four memory banks (MPTC_Bank A, B, C, D), each containing 2-bit color information. The die size of the prototype chip is 15.6 mm × 7.5 mm, and the operation frequency of the filter and the SRAM is 150 MHz. The die size of the prototype chip is large because it was designed to have a large operation margin as a prototype chip and uses DRAM peripheral transistors for the logic circuits, such as SRAMs and texture filters. The die size is anticipated to shrink by 60% of the prototype chip with a more optimal design using MDL technology, in which logic circuits can be implemented with a smaller die area. Using the DRAM peripheral transistors in the SRAM and the texture filters lowers the operation frequency; however, the parallel data path between the SRAM and the texture filters compensates for the low operation frequency, resulting in pixel throughput comparable to that when using a logic-optimized process.

The voltages for the DRAM cores, the logic circuits including SRAMs, and the I/O circuits are 2.0, 2.3, and 3.3 V, respectively.


Fig. 18. Shmoo plot of voltage versus cycle time.

Fig. 19. Graphics system using four prototype chips.

Average power consumption is 89 mW when the line sizes of the L2 and L1 cache are configured to 16 × 16 and 4 × 4, respectively, which are known as the optimal cache line sizes in most PC graphics game applications. Fig. 18 shows the shmoo plot of the prototype chip. Fig. 19 shows the experimental graphics board for validating the operation of the prototype chip. There are four prototype chips on the graphics board, each one for one color component, R, G, B, or A. The graphics board has four DSP processors, each having 1-Gflops processing capability, to satisfy the high throughput of the prototype chip. The graphics board with the prototype chip showed the same performance as measured in the architecture simulation, except for the reduction of total pixel rate due to the reduced number of SRAM L1 caches. This is because the simulation results are based on a clock-level simulation, which exactly models the operations of the prototype chip.

IX. CONCLUSION

For greater realism of 3-D graphics scenes in PCs with interactive frame rate, a dedicated single-chip multilevel parallel texture cache memory has been proposed and fabricated by 0.16-µm DRAM-based SOC process technology. The integrated large DRAM L2 cache has solved the bandwidth bottleneck problem on the AGP or PCI bus, and the eight independent SRAM L1 caches accelerate the operations of the parallel graphics pipelines without L1 texture cache access conflicts. The maximized IBUS bandwidth by the hidden double data transfer scheme smoothes parallel L1 cache replacement operations, even in 8-way parallel SRAM L1 caches. Furthermore, by the use of the reconfigurable cache line architecture, optimal cache miss rate and lower power consumption have been achieved in compliance with various graphics application characteristics. The SRAM L1 caches and the pipelined texture filter architecture implemented using DRAM peripheral transistors allowed a more cost-effective design than that using expensive MDL process technology.

REFERENCES

[1] C.-W. Yoon, R. Woo, J. Kook, S.-J. Lee, K. Lee, B. Young-Don, P. In-Cheol, and Y. Hoi-Jun, “An 80/20-MHz 160-mW multimedia processor integrated with embedded DRAM MPEG-4 accelerator and 3-D rendering engine for mobile applications,” in IEEE Int. Solid-State Circuits Conf. (ISSCC’01) Dig. Tech. Papers, 2001, pp. 142–143.

[2] N. Ide et al., “2.44-GFLOPS 300-MHz floating-point vector-processing unit for high-performance 3-D graphics computing,” IEEE J. Solid-State Circuits, vol. 35, pp. 1025–1033, July 2000.

[3] S.-J. Park, J.-S. Kim, R. Woo, S.-J. Lee, K.-M. Lee, T.-H. Yang, J.-Y. Jung, and H.-J. Yoo, “A reconfigurable multilevel parallel graphics cache memory with 75-GB/s parallel cache replacement bandwidth,” in Symp. VLSI Circuits Dig. Tech. Papers, 2001, pp. 233–236.

[4] H. Kubosawa et al., “A 2.5-GFLOPS, 6.5-million polygons per second, four-way VLIW geometry processor with SIMD instructions and a software bypass mechanism,” IEEE J. Solid-State Circuits, vol. 34, pp. 1619–1626, Nov. 1999.

[5] K. Inoue et al., “A 10 MB frame buffer memory with Z-compare and A-blend units,” IEEE J. Solid-State Circuits, vol. 30, pp. 1563–1568, Dec. 1995.

[6] K. Akeley, “RealityEngine graphics,” in Proc. 20th Annu. Conf. Computer Graphics (SIGGRAPH’93), 1993, pp. 109–116.

[7] J. S. Montrym, D. R. Baum, D. L. Dignam, and C. J. Migdal, “InfiniteReality: A real-time graphics system,” in Proc. 24th Annu. Conf. Computer Graphics (SIGGRAPH’97), 1997, pp. 293–302.

[8] P. S. Heckbert, “Survey of texture mapping,” IEEE Comput. Graph. Applicat., vol. 6, pp. 56–67, Nov. 1986.

[9] M. Cox, N. Bhandari, and M. Shantz, “Multilevel texture caching for 3-D graphics hardware,” in Proc. 25th Annu. Int. Symp. Computer Architecture (ISCA’98), 1998, pp. 86–97.

[10] “Accelerated Graphics Port Interface Specification Revision 1.0,” Intel Corp., 1996.

[11] H. Igehy, M. Eldridge, and P. Hanrahan, “Parallel texture caching,” in Proc. EUROGRAPHICS/SIGGRAPH Workshop Graphics Hardware, 1999, pp. 95–106.

[12] Z. S. Hakura and A. Gupta, “The design and analysis of a cache architecture for texture mapping,” in Proc. 24th Annu. Int. Symp. Computer Architecture (ISCA’97), 1997, pp. 108–120.

[13] S. P. Cunningham and J. G. Shanthikumar, “Empirical results on the relationship between die yield and cycle time in semiconductor wafer fabrication,” IEEE Trans. Semiconduct. Manufact., vol. 9, pp. 273–277, 1996.

[14] A. Schilling et al., “Texram: A smart memory for texturing,” IEEE Comput. Graph. Applicat., vol. 16, pp. 32–40, May 1996.

[15] J. Foley, A. van Dam, S. Feiner, and J. Hughes, Computer Graphics: Principles and Practice, 2nd ed. Reading, MA: Addison-Wesley, 1990.

[16] J. P. Ewins, M. D. Waller, M. White, and P. F. Lister, “Mip-map level selection for texture mapping,” IEEE Trans. Visualization Comput. Graph., vol. 4, pp. 317–329, 1998.

[17] N. Nakamura et al., “A 29-ns 64-MB DRAM with hierarchical array architecture,” IEEE J. Solid-State Circuits, vol. 31, pp. 1302–1307, 1996.

[18] H. Igehy, M. Eldridge, and K. Proudfoot, “Prefetching in a texture cache architecture,” in Proc. EUROGRAPHICS/SIGGRAPH Workshop Graphics Hardware, 1998, pp. 133–142.


Se-Jeong Park received the B.S. degree in electrical engineering from Han-Yang University, Seoul, Korea, in 1994, and the M.S. and Ph.D. degrees in electrical engineering from Korea Advanced Institute of Science and Technology (KAIST), Taejon, Korea, in 1997 and 2002, respectively. He majored in high-performance parallel 3-D computer graphics architecture and its SOC (system-on-a-chip) implementation for mobile applications.

His research ranges from 3-D graphics system architecture to VLSI implementation using merged DRAM logic (MDL) technology. His research also includes disk cache memory design for massive digital video storage systems for HDTV.

Jeong-Su Kim was born in Korea on March 30, 1975. He received the B.S. degree in electronic and electrical engineering from Kyungpook National University, Taegu, Korea, in 1998. He is currently working toward the M.S. degree in electrical engineering at Korea Advanced Institute of Science and Technology (KAIST), Taejon, Korea.

His research interests include high-performance DRAM design with merged DRAM logic (MDL) technology and display driver ICs.

Ramchan Woo was born on January 1, 1978, in Korea. He received the B.S. (summa cum laude) and M.S. degrees in electrical engineering from Korea Advanced Institute of Science and Technology (KAIST) in 1999 and 2001, respectively. He is currently working toward the Ph.D. degree in electrical engineering at KAIST.

In 1999, he joined the Semiconductor System Laboratory (SSL) at KAIST as a Research Assistant. His research interests include low-power high-performance circuits and portable multimedia system design with specific interest in portable 3-D computer graphics architecture and its implementation with merged-DRAM technology.

Se-Joong Lee was born on January 9, 1978, in Korea. He received the B.S. and M.S. degrees in electrical engineering and computer science from the Korea Advanced Institute of Science and Technology (KAIST), Taejon, Korea, in 1999 and 2001, respectively, where he is currently working toward the Ph.D. degree.

Since 1999, he has been a Research Assistant at KAIST. His research activities are related to network switches and high-speed circuit techniques, especially network processor design using embedded memory logic (EML) technology.

Kang-Min Lee was born on December 11, 1978, in Korea. He received the B.S. degree in electrical engineering and computer science from Korea Advanced Institute of Science and Technology (KAIST) in 2000, where he is currently working toward the M.S. degree.

In 2000, he joined the Semiconductor System Laboratory (SSL) at KAIST as an active Researcher. His research concerns the theory, architecture, and implementation of high-speed network routers and switches, especially high-performance network switch design using embedded memory logic (EML) technology.

Tae-Hum Yang received the B.S.E.E. and M.S.E.E. degrees from Seoul National University, Seoul, Korea, in 1992 and 1994, respectively.

Since 1994, he has worked for Hynix Semiconductor Inc., Ichon, Korea. From 1994 to 1996, he worked in the Flash EEPROM Device Team, involved in device characterization and process integration in 0.8-µm and 0.35-µm technology. In 1997, he joined the Flash EEPROM Design Team and developed low-voltage products (3.3 V to 2.0 V). In 1999, he moved to the DRAM Design Part, responsible for the development of the 256-Mb SDR SDRAM and 128-Mb DDR SDRAM with 0.16-µm technology. His current interest is in the development of high-speed, low-voltage, and embedded memory products.

Jin-Young Jung received the B.S.E.E. degree from Seoul National University, Seoul, Korea, in 1974 and the M.S.E.E. degree from Korea Advanced Institute of Science and Technology, Taejon, Korea, in 1976.

From 1976 to 1978, he worked for Korea Semiconductor Inc., which later became the Semiconductor Business Unit of Samsung Electronics, where he was involved in the design of timepieces and custom CMOS chip designs. Since 1979, he has been involved in memory design, and has worked for various companies, including National Semiconductor, Synertek, and Vitelic. He has developed CMOS SRAMs, from 4 K to 64 K, mask ROMs, and CMOS DRAMs. In 1987, he joined LG Semiconductor, Korea, where he developed 256-K to 16-M DRAMs and other standard logic products. In 1992, he joined Mosel-Vitelic, where he developed high-speed DRAMs, and the 256-K × 8 high-speed DRAM became the first semistandard DRAM, which helped the company to go public. Since 1996, he has worked for Hynix Semiconductor Inc., Ichon, Korea, as a Senior Vice President and Chief Architect in Memory R&D. His current interest is in the development of ultra-high-speed super-low-voltage and low-power memory products, novel device research in ferroelectric and magnetic memories, and new generation 3-D devices.

Hoi-Jun Yoo graduated from the Electronics Department of Seoul National University, Seoul, Korea, in 1983 and received the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Seoul, in 1985 and 1988, respectively. His Ph.D. work concerned the fabrication process for GaAs vertical optoelectronic integrated circuits.

From 1988 to 1990, he was with Bell Communications Research, Red Bank, NJ, and invented the two-dimensional phase-locked VCSEL array, the front-surface-emitting laser, and the high-speed lateral HBT. In 1991, he became Manager of a DRAM design group at Hyundai Electronics and designed a family of fast 1-M DRAMs and synchronous DRAMs including 256-M SDRAM. From 1995 to 1997, he was a faculty member of Kangwon National University, Kangwon, Korea. In 1998, he joined the faculty of the Department of Electrical Engineering at KAIST and currently leads a project team on RAMP (RAM Processor). In 2001, he founded a national research center, SIPAC (System Integration and IP Authoring Research Center), funded by the Korean government to promote worldwide IP authoring and its SOC application. His current interests are SOC design, IP authoring, high-speed and low-power memory circuits and architectures, design of embedded memory logic, opto-electronic integrated circuits, and novel devices and circuits. He is the author of the books DRAM Design (in Korean, 1996) and High Performance DRAM (in Korean, 1999).

Dr. Yoo received the 1994 Electronic Industrial Association of Korea Award for his contribution to DRAM technology.