Status - Week 226
Victor Moya
Summary
  Recursive descent.
  Hierarchical Z Buffer.
Recursive Rasterization
[Figure: block diagram. Boxes: Triangle Setup, Tile FIFO, Tile Eval (x4), HZ Test, Fragment FIFO.]
Recursive Rasterization
  Tile FIFO:
    Stores the start position of tiles to test/split.
    Each tile has the following information:
      4 c values (3 edge, 1 z/w): 4 x 32 bits.
      Tile level/size: log2(max(maxH, maxV)) bits.
        - Ex: for 2048x2048, 12 bits.
      Expand bit: whether the tile must be expanded.
    For N tile evaluators it could be arranged as an NxM queue.
    Triangle setup could add 1 tile (the full viewport) or N tiles (reduces the traversal depth by 1).
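The tile record listed above can be sketched as a small data structure. This is only an illustration: the names `TileEntry` and `size_field_bits` are hypothetical, and the sketch assumes the size field stores the tile side itself (1 to 2048), which is one reading that yields the 12 bits of the example (log2(2048) alone would give 11).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class TileEntry:
    """One Tile FIFO entry (hypothetical layout, following the slide's list)."""
    edge_c: tuple   # three 32-bit edge equation values at the tile start sample
    zw_c: int       # 32-bit z/w equation value at the tile start sample
    level: int      # tile side is (1 << level) pixels
    expand: bool    # expand bit


def size_field_bits(max_h, max_v):
    """Bits needed to store any tile side from 1 up to max(max_h, max_v)."""
    return max(max_h, max_v).bit_length()


# Triangle setup pushing a single tile that covers a 2048x2048 viewport.
tile_fifo = deque()
tile_fifo.append(TileEntry(edge_c=(0, 0, 8), zw_c=1, level=11, expand=False))
```

With N tile evaluators, the same record could be kept in N such queues, one per evaluator.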
Recursive Rasterization
[Figure: Expand Tile: new tiles generated. Legend: start sample, generated sample. Tile labels: "level, no expand" (x3) and "level - 1, expand".]
Recursive Rasterization
[Figure: No Expand Tile: new tiles. Legend: start sample, generated sample. Tile label: "level - 1, expand".]
Recursive Rasterization
  Tile evaluator: N = 1
    1x1 tiles (no subtiles tested at tile evaluator).
    Calculates three new sample positions.
    Tests if any triangle fragment is inside the tile:
      The tile is rejected if all 4 tile corners are negative (outside) for any of the edge equations.
    Performs HZ test. Options:
      Only top level.
      Top level and N middle levels.
      All levels.
    Generates a 2x2 fragment stamp.
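The corner test above can be sketched as follows. The function names and the sample triangle are illustrative assumptions, not the simulator's code. The test is conservative: a tile that is not rejected may still contain no fragment.

```python
def corner_values(a, b, c, x, y, size):
    """Evaluate u = a*x + b*y + c at the four corners of a size x size tile."""
    return [a * cx + b * cy + c for cy in (y, y + size) for cx in (x, x + size)]


def tile_outside(edges, x, y, size):
    """True if all four corners are negative for at least one edge equation."""
    return any(
        all(v < 0 for v in corner_values(a, b, c, x, y, size))
        for (a, b, c) in edges
    )


# Hypothetical triangle with vertices (0,0), (8,0), (0,8):
# inside means x >= 0, y >= 0 and 8 - x - y >= 0.
triangle = [(1, 0, 0), (0, 1, 0), (-1, -1, 8)]
```

For this triangle, the 4x4 tile at (10, 10) is rejected by the third edge, while the 4x4 tile at (0, 0) is kept.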
Recursive Rasterization
  Tile Evaluator: N = 1
    3 x 4 equation evaluators:
      Linear equations: u = ax + by + c
      3 edge equations.
      1 z/w parameter equation.
      Incremental update:
        - c_new = c_start + (a << level) + (b << level).
    4 x 4 x (e >= 0) tests:
      Sample inside/outside triangle.
    4 x Z tests:
      Against the proper HZ level.
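A minimal sketch of the incremental update, assuming integer coefficients and a tile side of `1 << level`. The slide gives the formula for the diagonal sample only; the sketch assumes the other two new samples each use a single shifted term. `new_samples` and `evaluate` are hypothetical names.

```python
def evaluate(a, b, c, x, y):
    """Direct evaluation of the linear equation u = a*x + b*y + c."""
    return a * x + b * y + c


def new_samples(c_start, a, b, level):
    """Values at the three new tile corners, derived from the start sample.

    c_start is the equation value at the tile start position (x, y).
    Returns the values at (x + s, y), (x, y + s) and (x + s, y + s)
    for s = 1 << level, using only shifts and additions.
    """
    dx = a << level
    dy = b << level
    return c_start + dx, c_start + dy, c_start + dx + dy


# Example: a = 3, b = -2, c = 5, tile start (8, 16), level 2 (4x4 tile).
start = evaluate(3, -2, 5, 8, 16)
right, down, diagonal = new_samples(start, 3, -2, 2)
```

The same function serves the three edge equations and the z/w equation, which is where the 3 x 4 evaluator count comes from.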
Recursive Rasterization
[Figure: Equation Evaluator. Inputs C, A, B; two shifters (<<) and an adder (+). Annotations: "Only for N = 2", "Samples NxN".]
Recursive Rasterization
  Tile Evaluator: N = 2
    2x2 tile (subtiles/fragments).
    8 new samples generated.
    8 x 4 equation evaluators.
    8 x (e >= 0) tests.
    8 x Z tests.
    Generates 2x2 or 3x3 fragment stamps.
    EXPAND TILES ARE NO LONGER REQUIRED.
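The sample counts can be checked with a short sketch. It assumes the samples sit on the corners of the NxN subtile grid, so a tile needs (N+1) x (N+1) samples, one of which (the start sample) is already known. That assumption reproduces the 3 new samples for N = 1 and the 8 new samples for N = 2; `new_sample_offsets` is a hypothetical helper.

```python
def new_sample_offsets(n):
    """Grid offsets (in subtile units) of the samples an NxN evaluator must generate.

    The start sample at offset (0, 0) comes from the Tile FIFO and is excluded.
    """
    return [(i, j) for j in range(n + 1) for i in range(n + 1) if (i, j) != (0, 0)]


EQUATIONS = 4  # 3 edge equations + 1 z/w equation


def evaluator_count(n):
    """Number of equation evaluators: one per new sample and equation."""
    return len(new_sample_offsets(n)) * EQUATIONS
```

`evaluator_count(1)` gives 12 (the 3 x 4 of the N = 1 case) and `evaluator_count(2)` gives 32 (8 x 4).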
Recursive Rasterization
[Figure: tile HZ test. Tile Start Position and Level select HZ Level i (only one value read); 4 x Z/W values go to 4 comparators (CMP); an AND of the results gives Tile Passes.]
Recursive Rasterization
  Tile evaluator critical path?
    Z/W parameter evaluation | HZ access.
    Z Compare/Test (>).
  But could be pipelined:
    Same throughput.
    Longer latency.
    Larger tile queue?
Recursive Rasterization
  Each tile evaluator could also work on more than one triangle at a time.
  Tile evaluator for 2 triangles: N = 1
    2 x 3 x 4 equation evaluators.
    2 x 3 x 4 (e >= 0) tests.
    2 x 4 Z tests.
    Generates two 2x2 fragment stamps.
Recursive Rasterization
  Benefits:
    Single access to the HZ hierarchy.
    Increases throughput.
    Shares latency for first fragment.
  Problems:
    Overlapping triangles.
    Produces more tiles.
    Produces more fragments per cycle (but that would also happen with N = 2).
Recursive Rasterization
[Figure: Simulator Boxes. Boxes: TriangleSetup, RecursiveRasterization, HierarchicalZ, FragmentFIFO.]
Recursive Rasterization
[Figure: Recursive Rasterization Box and Signals. Box: RecursiveRasterization. Signals: newTile, newTriangle, newFragment, HZ Update (Tile level).]
Hierarchical Z Buffer
  Multiple levels.
    Level 0:
      1 to 16 registers.
    Level 1:
      Fixed size ~64 KB.
      Maps to 4x4, 8x8 or 16x16 tiles (relative to viewport size).
    Z-Buffer.
  Add additional level(s) between 0 and 1.
    More memory.
    Latency for updates?
    Comparators?
Hierarchical Z Buffer
  Z-Buffer could be accessed at the fragment level.
    Good for prefetching?
    But would require N reads.
  Provide the full Z cache line.
    Fragment stamps (NxM) map to a single Z cache line.
Hierarchical Z Buffer
  Update mechanism:
    Write to Z-Buffer.
    On a cache miss, pack/compress the cache line and calculate the largest Z value for that line.
    Propagate the Z value of the line upwards.
    Could require a lot of comparisons for top levels.
    Expensive?
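The cost concern can be illustrated with a naive two-level sketch. It assumes a square grouping of `factor` x `factor` fine blocks per coarse block and that larger Z means farther. `update_coarse` is a hypothetical name; this is the brute-force scheme whose comparison count is the problem, not a proposed design.

```python
def update_coarse(fine, coarse, bx, by, new_z, factor):
    """Write new_z into the fine HZ grid and recompute the covering coarse entry.

    The coarse entry is the maximum of all factor*factor fine entries below it.
    Returns the number of comparisons a max-reduction over those entries needs.
    """
    fine[by][bx] = new_z
    cx, cy = bx // factor, by // factor
    children = [
        fine[cy * factor + j][cx * factor + i]
        for j in range(factor)
        for i in range(factor)
    ]
    coarse[cy][cx] = max(children)
    return len(children) - 1


# Tiny example: 4x4 fine grid under a 2x2 coarse grid, everything cleared to 1.0.
fine = [[1.0] * 4 for _ in range(4)]
coarse = [[1.0] * 2 for _ in range(2)]
cost = update_coarse(fine, coarse, 0, 0, 0.5, 2)
```

Here one update costs 3 comparisons and the coarse value stays at 1.0 until all four blocks under it have been lowered. With 128x128 fine blocks under one coarse block, the same scheme would need 16383 comparisons per update.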
Hierarchical Z Buffer
  Access:
    Tiles:
      At tile evaluators.
    For fragments:
      At tile evaluators:
        - Shares hardware.
        - Larger latency for tile evaluators at fragment level.
        - Use HZ Level 1 for tiles larger than a stamp but smaller than an HZ level 1 block.
      At an HZ test stage before Fragment FIFO:
        - Smaller latency for tile evaluators.
        - Access to Z buffer.
Hierarchical Z Buffer
  Location:
    Level 0:
      At tile evaluators (duplicated):
        - Better latency.
        - Broadcast for updates.
      At HZ separated memory:
        - Worse latency?
        - Shared => multiported!
Hierarchical Z Buffer
  Location:
    Level 1:
      At tile evaluators:
        - Too large.
      At HZ separated on-die memory:
        - Better solution?
        - Must be multiported:
            1 access per tile evaluator.
            Multiple sets?
            Set conflict?
      At video memory:
        - For very large HZ buffers or very small precision?
        - Access time?
        - HZ cache?
Hierarchical Z Buffer
  Location:
    Z-Buffer:
      On video memory.
        - Compressed.
        - Reduce read/write bandwidth usage.
      Cache on die.
        - Uncompressed.
      Hardware packer/unpacker:
        - Used also for the HZ update.
Hierarchical Z Buffer
  Sizes
    Level 0:
      Register-like access.
      1-2 cycles max.
      1-32 values.
      Block size depends on the viewport resolution:
        - 2048x2048
            1 register: 2048x2048 block.
            16 registers: 128x128 blocks.
            32 registers: 64x64 blocks.
Hierarchical Z Buffer
  Level 1:
    On-die memory:
      - Fixed size.
      - Limited by technology.
      - Estimation:
          2048x2048 viewport.
          8x8 blocks.
          16-bit Z value.
          128 KB.
    Video memory:
      - Unlimited?
      - Variable size.
      - But larger access time!
      - Requires cache.
      - Update!
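The on-die estimate follows from one HZ value per block. A small sketch of the arithmetic, with `hz_level1_bytes` as a hypothetical helper name:

```python
def hz_level1_bytes(width, height, block, z_bits):
    """Storage for one Z value of z_bits per block x block pixel block."""
    blocks = (width // block) * (height // block)
    return blocks * z_bits // 8


KB = 1024
estimate = hz_level1_bytes(2048, 2048, 8, 16)  # 256 x 256 blocks, 2 bytes each
```

`estimate` is 131072 bytes, that is 128 KB. Under the same assumptions, a 4096x4096 viewport with 16x16 blocks needs the same 128 KB, which is why a fixed-size memory can be kept if the block size scales with the viewport.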
Hierarchical Z Buffer
  Z-Buffer:
    Video memory.
    Unlimited size.
    Large access time.
    32 bits (with stencil) per value.
    Cache line size:
      - HZ block 8x8 (tiled): 2048 bits.
      - HZ block row/column 8: 256 bits.
      - Less than an HZ block row/column.
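The two line sizes are the value count times 32 bits. As a quick check, with `line_bits` as a hypothetical helper:

```python
Z_BITS = 32  # depth plus stencil, per the slide


def line_bits(values_per_line, bits_per_value=Z_BITS):
    """Size in bits of a Z cache line holding values_per_line entries."""
    return values_per_line * bits_per_value


tiled_line = line_bits(8 * 8)  # one full 8x8 HZ block
row_line = line_bits(8)        # one row or column of an 8x8 HZ block
lines_per_block = tiled_line // row_line
```

With row-sized lines, eight lines cover one 8x8 block, so the HZ value of a block can only be updated after eight line write-backs have been combined.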
Hierarchical Z Buffer
  Packer/Unpacker:
    Differential ("whatever it is called") compression.
    Calculate max Z value in cache line.
    Z block type:
      00: cleared Z line.
      01: uncompressed.
      10: compression 1.
      11: compression 2.
    Two compression levels:
      4 bits per value.
      16 bits per value.
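A rough sketch of how the block type could be chosen. The slides do not define the differential scheme, so everything specific here is an assumption: deltas are taken against the smallest value in the line, type 10 is mapped to 4-bit deltas and type 11 to 16-bit deltas, and the payload is kept as a Python list rather than a packed bit string. `pack_line` and `unpack_line` are hypothetical names.

```python
CLEARED, UNCOMPRESSED, COMP_4BIT, COMP_16BIT = 0b00, 0b01, 0b10, 0b11


def pack_line(z_values, clear_z):
    """Return (block_type, max_z, base, payload) for one Z cache line.

    max_z is the value that would be propagated to the HZ buffer.
    """
    max_z = max(z_values)
    if all(z == clear_z for z in z_values):
        return CLEARED, max_z, None, []
    base = min(z_values)                      # assumed reference value
    deltas = [z - base for z in z_values]
    widest = max(deltas).bit_length()         # bits needed by the largest delta
    if widest <= 4:
        return COMP_4BIT, max_z, base, deltas
    if widest <= 16:
        return COMP_16BIT, max_z, base, deltas
    return UNCOMPRESSED, max_z, None, list(z_values)


def unpack_line(block_type, base, payload, clear_z, count):
    """Rebuild the Z values of a line from its packed form."""
    if block_type == CLEARED:
        return [clear_z] * count
    if block_type == UNCOMPRESSED:
        return list(payload)
    return [base + d for d in payload]
```

Because the packer already scans the whole line, computing the line maximum for the HZ update comes at little extra cost, which matches the idea of sharing this hardware with the HZ update.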
Hierarchical Z Buffer
  HZ update hardware.
    If a cache line has the size of an HZ level 1 block:
      Nothing, just store.
    If a cache line is smaller than an HZ level 1 block:
      Must be stored in a combine cache.
        - Stores the current largest value for the HZ level 1 block.
        - Stores whether the HZ block has been fully updated.
      When a combine cache line is full the HZ level 1 can be updated.
      Use a FIFO policy for the combining cache (only spatial locality).
      Combining cache size?
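The combine cache described above can be sketched in software as follows. This is a behavioural illustration, not the hardware design: the class and method names are hypothetical, and dropping a partially filled entry on eviction is an assumption (it leaves the old, larger HZ value in place, which is conservative as long as Z writes only move values closer).

```python
from collections import OrderedDict


class CombineCache:
    """Combines per-line max Z values until a whole HZ level 1 block is covered."""

    def __init__(self, entries, lines_per_block):
        self.entries = entries
        self.full_mask = (1 << lines_per_block) - 1
        self.lines = OrderedDict()  # block id -> [tmp_z, bitmask], oldest first

    def write_line(self, block, line_index, line_max_z):
        """Record one written-back Z cache line.

        Returns (block, max_z) when every line of the block has been seen,
        meaning HZ level 1 can be updated; otherwise returns None.
        """
        if block not in self.lines:
            if len(self.lines) == self.entries:
                self.lines.popitem(last=False)  # FIFO: evict the oldest entry
            self.lines[block] = [line_max_z, 0]
        entry = self.lines[block]
        entry[0] = max(entry[0], line_max_z)    # one Z comparator per entry
        entry[1] |= 1 << line_index             # mark the line as written
        if entry[1] == self.full_mask:
            del self.lines[block]
            return block, entry[0]
        return None
```

For example, with 8 lines per block, the first seven `write_line` calls for a block return `None` and the eighth returns the block id together with the largest line Z seen.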
Hierarchical Z Buffer
  Hierarchical Z Buffer.
    How to update HZ level 0?
      For a 2048x2048 viewport, each block of a 2x2 level 0 covers 128x128 HZ level 1 blocks.
      Comparing the new HZ level 1 value against all the other HZ level 1 values would require too much hardware or too much time.
    Combining cache for level 0:
      4 combining lines (2x2).
      Stores the furthest Z value written for the level 0 block.
      Update when the full level 0 block is written.
        - !!! Could never happen !!!
Hierarchical Z Buffer
  L0 Combine Cache.
    Size:
      2x2 HZ Level 0 buffer.
      256x256 HZ Level 1 buffer.
        - At 2048x2048 the HZ L1 block is 8x8.
      128x128 L1 blocks per L0 block.
        - 16 Kbit mask for each L0 combine cache entry!
Hierarchical Z Buffer
  Solution:
    - Add an L0 line combine cache.
        128-bit mask per line combine cache entry.
        Size: 4 entries?
        Update the L0 combine cache when a line is full.
        FIFO?
    - 128-bit mask per L0 combine cache entry.
        1 Z value per entry, 1 Z comparator per entry.
        No replacement! Only update.
        1 Kbit for bitmasks.
        128 bits for Z values (16 bits).
        8 Z comparators (16 bits).
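The totals can be reproduced with a short calculation. It assumes that the 8 comparators correspond to 4 line combine entries plus the 4 entries of the 2x2 L0 combine cache; the slides do not state this split explicitly. `l0_combine_budget` is a hypothetical helper.

```python
def l0_combine_budget(viewport=2048, l1_block=8, l0_side=2, line_entries=4, z_bits=16):
    """Storage and comparator counts for the two-stage L0 combine scheme."""
    l1_per_l0_side = viewport // l1_block // l0_side   # L1 blocks along one L0 block side
    entries = line_entries + l0_side * l0_side          # line entries + L0 entries (assumed)
    return {
        "flat_mask_bits": l1_per_l0_side ** 2,          # one bit per L1 block, single stage
        "mask_bits_per_entry": l1_per_l0_side,          # one bit per L1 block in a row, or per row
        "total_mask_bits": entries * l1_per_l0_side,
        "total_z_bits": entries * z_bits,
        "comparators": entries,
    }


budget = l0_combine_budget()
```

The single-stage mask needs 16384 bits per entry, while the two-stage scheme needs 128 bits per entry, 1024 bits of masks in total, 128 bits of Z values and 8 comparators.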
Hierarchical Z Buffer
  L1 Combine Cache:
    Size:
      Z cache line: 8 Z values (256 bits).
      2048x2048 viewport:
        - 8x8 L1 blocks.
        - 8-bit bitmask per L1 combine cache entry.
      4096x4096 viewport:
        - 16x16 L1 blocks.
        - 16-bit bitmask per L1 combine cache entry.
      1 Z value and 1 Z comparator per entry.
      FIFO replacement policy?
Hierarchical Z Buffer
  Number of entries: 256
    - 4 Kbits for bitmask.
    - 4 Kbits for Z values (16-bit).
    - 256 Z comparators.
  Fully associative.
Hierarchical Z Buffer
[Figure: Combine Cache. Labels: Z cache line max Z, Level 1 combine cache, Level 1 block Z, HZ Level 1 Buffer, Level 0 combine cache, Level 0 block Z, HZ Level 0 Buffer; annotations: lines, blocks.]
Hierarchical Z Buffer
[Figure: Combine cache line with two fields, TmpZ and BitMask.]
  - TmpZ stores the max Z value written in the block.
  - BitMask stores a mask of the written positions in the block.
Hierarchical Z Buffer
  Optimized access to HZ from the Tile Evaluators.
    Child tiles can reuse the HZ Z value read for parent tiles.
    Add a Z value to each tile in the Tile Buffer.
    Initialized with the HZ L0 Z value at the proper tile level.
    Reuse the tile Z value until the tile level is the same as HZ L1.
      At 8x8, for example, for 8x8 HZ L1 blocks.
    Final fragment stamps can be smaller than HZ L1 blocks.
Hierarchical Z Buffer
  Optimized access from Tile Evaluators:
    Fetch of HZ L1 is done at 8x8 tile level and reused for the fragment stamp.
    That implies that there is no latency penalty for fragments.
    In fact it doesn't have to access the HZ L1 buffer at stamp level.
    0 cycles for fragment stamp HZ test.
    Reduces accesses to the HZ buffer.
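The reuse of the fetched Z value can be illustrated with a traversal-only sketch. It omits the edge and Z tests and just counts HZ L1 reads. The level numbering (tile side = `1 << level`), the 8x8 HZ L1 blocks, the 2x2 stamps and the name `descend` are assumptions made for the example.

```python
L1_LEVEL = 3     # 8x8 tiles match the 8x8 HZ L1 blocks (assumed)
STAMP_LEVEL = 1  # 2x2 fragment stamps (assumed)


def descend(x, y, level, tile_z, hz_l1, out):
    """Split a tile recursively, carrying the HZ Z value down to its children."""
    if level == L1_LEVEL:
        # The only HZ L1 access: one read per 8x8 tile.
        tile_z = hz_l1[y >> L1_LEVEL][x >> L1_LEVEL]
        out["l1_reads"] += 1
    if level == STAMP_LEVEL:
        # The stamp HZ test would use the inherited tile_z, with no extra read.
        out["stamps"].append((x, y, tile_z))
        return
    half = 1 << (level - 1)
    for dy in (0, half):
        for dx in (0, half):
            descend(x + dx, y + dy, level - 1, tile_z, hz_l1, out)


# A 16x16 region covered by 2x2 HZ L1 blocks; 1.0 stands in for the L0 value.
hz_l1 = [[0.25, 0.5], [0.75, 1.0]]
out = {"l1_reads": 0, "stamps": []}
descend(0, 0, 4, 1.0, hz_l1, out)
```

For this region the traversal produces 64 stamps with only 4 HZ L1 reads, instead of one read per stamp.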
Rasterization