Anatomy of a Texture Fetch

Anatomy of aTexture Fetch

Mark J. Kilgard

[email protected]

presented for Dr. Bajaj’s graphics classUniversity of Texas, Austin

April 29, 2010

About Me

Principal System Software EngineerNVIDIA Distinguished Inventor

Native Texan, living in AustinBut instead of UT, attended Rice (C.S. 1991)

Yet I root for Texas in football (unless playing Rice)

Interested inProgrammable shading languages for GPUs

Novel GPU-accelerated rendering paradigms

OpenGL & GPU Architecture Improvements

Our Topic: the Texture Fetch Dissected

Texture FetchAll modern 3D games rely on texture-rich rendering

Basis for all sorts of richness and detail

More than graphics—GPU compute supports texture fetches

ConceptuallyAn image load with built-in re-sampling

GeForce 480 capable of 42,000,000,000 per second! *

* 700 Mhz clock × 15 Streaming Multiprocessors × 4 fetches per clock

A Texture Fetch Simplified

Seems pretty simple…

Given1. An image

2. A position

Return thecolor of image at position

Fetch at

(s,t) = (0.6, 0.25)

T a

xis

S axis0.0

1.0

0.0

1.0

RGBA Result is

0.95,0.4,0.24,1.0)

Texture Supplies Detail to Rendered Scenes

Without texture

With texture

Textures Make Graphics Pretty

Texture → detail,

detail → immersion,

immersion → fun

Unreal Tournament

Microsoft Flight Simulator X

Sacred 2

Shaders Combine Multiple Textures

(modulate)

=

lightmaps onlylightmaps only

decal onlydecal only

combined scenecombined scene* Id Software’s Quake 2

circa 1997

Projected Texturing for Shadow Mapping

Depth map from light’s point of viewis re-used as a texture andre-projected into eye’s view to generate shadows

lightposition

without

shadows

with

shadows

“what thelight sees”

Shadow Mapping Explained

Planar distance from lightPlanar distance from light Depth map projected onto sceneDepth map projected onto scene

≤≤ ==lesslessthanthan

True “un-shadowed” True “un-shadowed” region shown greenregion shown green

equalsequals

Volume rendering of fluid turbulence,

Lawrence Berkley National Lab

Automotive design, RTT

Seismic visualization, Landmark Graphics

Texture’s Not All Fun and Games

What’s so hard?

FilteringPoor quality results if you just return the closest color sample in the imageBilinear filtering + mipmapping needed

ComplicationsWrap modes, formats, compression, color spaces, other dimensionalities (1D, 3D, cube maps), etc.

Gotta be quickApplications desire billions of fetches per secondWhat’s done per-fragment in the shader, must be done per-texel in the texture fetch—so 8x times as much work!

Essentially a miniature, real-time re-sampling kernel

Anatomy of a Texture Fetch

Filteredtexel vectorTexel

SelectionTexel

Combination

Texel offsets

Texel data

Texture images

Combination parameters

Texture coordinate

vector

Texture parameters

Texture Fetch Functionality (1)

Texture coordinate processingProjective texturing (OpenGL 1.0)Cube map face selection (OpenGL 1.3)Texture array indexing (OpenGL 2.1)Coordinate scale: normalization (ARB_texture_rectangle)

Level-of-detail (LOD) computationLog of maximum texture coordinate partial derivative (OpenGL 1.0)LOD clamping (OpenGL 1.2)LOD bias (OpenGL 1.3)Anisotropic scaling of partial derivatives (SGIX_texture_lod_bias)

Wrap modesRepeat, clamp (OpenGL 1.0)Clamp to edge (OpenGL 1.2), Clamp to border (OpenGL 1.3)Mirrored repeat (OpenGL 1.4)Fully generalized clamped mirror repeat (EXT_texture_mirror_clamp)Wrap to adjacent cube map faceRegion clamp & mirror (PlayStation 2)

Arrays of 2D Textures

Multiple skins packed in texture arrayMotivation: binding to one multi-skin texture array avoids texture bind per object

Texture array index

0 1 2 3 4

0

1

234

Mip

map

lev

el i

nd

ex


Filter modesMinification / magnification transition (OpenGL 1.0)

Nearest, linear, mipmap (OpenGL 1.0)

1D & 2D (OpenGL 1.0), 3D (OpenGL 1.2), 4D (SGIS_texture4D)

Anisotropic (EXT_texture_filter_anisotropic)

Fixed-weights: Quincunx, 3x3 Gaussian

Used for multi-sample resolves

Detail texture magnification (SGIS_detail_texture)

Sharpen texture magnification (SGIS_sharpen_texture)

4x4 filter (SGIS_texture_filter4)

Sharp-edge texture magnification (E&S Harmony)

Floating-point texture filtering (ARB_texture_float, OpenGL 3.0)


Texture formatsUncompressed

Packing: RGBA8, RGB5A1, etc. (OpenGL 1.1)Type: unsigned, signed (NV_texture_shader)Normalized: fixed-point vs. integer (OpenGL 3.0)

CompressedDXT compression formats (EXT_texture_compression_s3tc)4:2:2 video compression (various extensions)1- and 2-component compression (EXT_texture_compression_latc,OpenGL 3.0)Other approaches: IDCT, VQ, differential encoding, normal maps, separable decompositions

Alternate encodingsRGB9 with 5-bit shared exponent (EXT_texture_shared_exponent)Spherical harmonicsSum of product decompositions


Pre-filtering operationsGamma correction (OpenGL 2.1)

Table: sRGB / arbitraryShadow map comparison (OpenGL 1.4)

Compare functions: LEQUAL, GREATER, etc. (OpenGL 1.5)Needs “R” depth value per texel

Palette lookup (EXT_paletted_texture)Thresh-holding

Color keyGeneralized thresh-holding

Delicate Color Fidelity with sRGB

Problem: PC display devices have a legacy non-linear (sRGB) display gamut—delicate color shading looks wrong

Conventional

rendering(uncorrect

ed color)

Gamma correct(sRGB rendered)

Softer and more natural

Unnaturallydeep facial

shadows

NVIDIA’s Adriana GeForce 8 Launch Demo

Implication for texture hardware: Should perform sRGB-to-linear color space convert per-texel, so 24 scalar conversions—or more—per fetch


OptimizationsLevel-of-detail weighting adjustments

Mid-maps (extra pre-filtered levels in-between existing levels)

Unconventional usesBitmap textures for fonts with large filters (Direct3D 10)

Rip-mapping

Non-uniform texture border color

Clip-mapping (SGIX_clipmap)

Multi-texel borders

Silhouette maps (Pardeep Sen’s work)

Shadow mapping

Sharp piecewise linear magnification

Phased Data Flow

Must hide long memory read latency between Selection and Combination phases

Memoryreads for samples

FIFOing of combination

parameters

Filteredtexel vector

TexelSelection

TexelCombination

Texel offsets

Texel data

Texture images


Texture parameters

Texture coordinate

vector

What really happens?

Let’s consider a simple tri-linear mip-mapped 2D projective texture fetch

Logically just one instructionfloat4 color = tex2Dproj(decalSampler, st);

TXP o[COLR], f[TEX3], TEX2, 2D;

LogicallyTexel selection

Texel combination

How many operations are involved?

Assembly instruction

(NV_fragment_program)

High-level language

statement (Cg/HLSL)

Medium-Level Dissectionof a Texture Fetch

Converttexel

coordsto

texeloffsets

integer /fixed-point

texelcombination

texel offsets

texel data

texture images

combinationparameters

interpolatedtexture coords

vector

texture parameters

Converttexturecoords

totexel

coords

filteredtexelvector

texel coords

floor /frac

integercoords & fractional

weights floating-pointscaling

andcombination

integer /fixed-pointtexelintermediates

Interpolation

First we need to interpolate (s,t,r,q)This is the f[TEX3] part of the TXP instructionProjective texturing means we want (s/q, t/q)

And possible r/q if shadow mapping

In order to correct for perspective, hardware actually interpolates(s/w, t/w, r/w, q/w)

If not projective texturing, could linearly interpolate inverse w (or 1/w)Then compute its reciprocal to get w

Since 1/(1/w) equals w Then multiply (s/w,t/w,r/w,q/w) times w

To get (s,t,r,q)

If projective texturing, we can insteadCompute reciprocal of q/w to get w/qThen multiple (s/w,t/w,r/w) by w/q to get (s/q, t/q, r/q)

Observe projective texturing is same cost as perspective correction

Interpolation Operations

Ax + By + C per scalar linear interpolation2 MADs

One reciprocal to invert q/w for projective texturingOr one reciprocal to invert 1/w for perspective texturing

Then 1 MUL per component for s/w * w/qOr s/w * w

For (s,t) means4 MADs, 2 MULs, & 1 RCP(s,t,r) requires 6 MADs, 3 MULs, & 1 RCP

All floating-point operations

Texture Space Mapping

Have interpolated & projected coordinates

Now need to determine what texels to fetch

Multiple (s,t) by (width,height) of texture base levelCould convert (s,t) to fixed-point first

Or do math in floating-point

Say based texture is 256x256 so

So compute (s*256, t*256)=(u,v)

Mipmap Level-of-detail Selection

Tri-linear mip-mapping means compute appropriate mipmap levelHardware rasterizes in 2x2 pixel entities

Typically called quad-pixels or just quadFinite difference with neighbors to get change in u and v with respect to window space

Approximation to ∂u/∂x, ∂u/∂y, ∂v/∂x, ∂v/∂yMeans 4 subtractions per quad (1 per pixel)

Now compute approximation to gradient lengthp = max(sqrt((∂u/∂x)2+(∂u/∂y)2), sqrt((∂v/∂x)2+(∂v/∂y)2))

one-pixel separation

Level-of-detail Bias and Clamping

Convert p length to power-of-two level-of-detail and apply LOD bias

λ = log2(p) + lodBias

Now clamp λ to valid LOD rangeλ’ = max(minLOD, min(maxLOD, λ))

Determine Mipmap Levels andLevel Filtering Weight

Determine lower and upper mipmap levelsb = floor(λ’)) is bottom mipmap level

t = floor(λ’+1) is top mipmap level

Determine filter weight between levelsw = frac(λ’) is filter weight

Determine Texture Sample Point

Get (u,v) for selected top and bottom mipmap levelsConsider a level l which could be either level t or b

With (u,v) locations (ul,vl)

Perform GL_CLAMP_TO_EDGE wrap modesuw = max(1/2*widthOfLevel(l), min(1-1/2*widthOfLevel(l), u))

vw = max(1/2*heightOfLevel(l), min(1-1/2*heightOfLevel(l), v))

Get integer location (i,j) within each level(i,j) = ( floor(uw* widthOfLevel(l)), floor(vw* ) )

border

edge

s

t

Determine Texel Locations

Bilinear sample needs 4 texel locations(i0,j0), (i0,j1), (i1,j0), (i1,j1)

With integer texel coordinatesi0 = floor(i-1/2)

i1 = floor(i+1/2)

j0 = floor(j-1/2)

j1 = floor(j+1/2)

Also compute fractional weights for bilinear filteringa = frac(i-1/2)

b = frac(j-1/2)

Determine Texel Addresses

Assuming a texture level image’s base pointer, compute a texel address of each texel to fetch

Assume bytesPerTexel = 4 bytes for RGBA8 texture

Exampleaddr00 = baseOfLevel(l) + bytesPerTexel*(i0+j0*widthOfLevel(l))addr01 = baseOfLevel(l) + bytesPerTexel*(i0+j1*widthOfLevel(l))addr10 = baseOfLevel(l) + bytesPerTexel*(i1+j0*widthOfLevel(l))addr11 = baseOfLevel(l) + bytesPerTexel*(i1+j1*widthOfLevel(l))

More complicated address schemes are needed for good texture locality!

Initiate Texture Reads

Initiate texture memory reads at the 8 texel addressesaddr00, addr01, addr10, addr11 for the upper level

addr00, addr01, addr10, addr11 for the lower level

Queue the weights a, b, and wLatency FIFO in hardware makes these weights available when texture reads complete

Phased Data Flow

Must hide long memory read latency between Selection and Combination phases

Memoryreads for samples

FIFOing of combination

parameters

Filteredtexel vector

TexelSelection

TexelCombination

Texel offsets

Texel data

Texture images


Texture parameters

Texture coordinate

vector

Texel Combination

When texels reads are returned, begin filteringAssume results are

Top texels: t00, t01, t10, t11Bottom texels: b00, b01, b10, b11

Per-component filtering math is tri-linear filterRGBA8 is four components

result = (1-a)*(1-b)*(1-w)*b00 + (1-a)*b*(1-w)*b*b01 + a*(1-b)*(1-w)*b10 + a*b*(1-w)*b11 + (1-a)*(1-b)*w*t00 + (1-a)*b*w*t01 + a*(1-b)*w*t10 + a*b*w*t11;24 MADs per component, or 96 for RGBA

Lerp-tree could do 14 MADs per component, or 56 for RGBA

Total Texture Fetch Operations

Interpolation6 MADs, 3 MULs, & 1 RCP (floating-point)

Texel selectionTexture space mapping

2 MULs (fixed-point)LOD determination (floating-point)

1 pixel difference, 2 SQRTs, 4 MULs, 1 LOG2LOD bias and clamping (fixed-point)

1 ADD, 1 MIN, 1 MAXLevel determination and level weighting (fixed-point)

1 FLOOR, 1 ADD, 1 FRACTexture sample point

4 MAXs, 4 MINs, 2 FLOORs (fixed-point)Texel locations and bi-linear weights

8 FLOORs, 4 FRACs, 8 ADDs (fixed-point)Addressing

16 integer MADs (integer)Texel combination

56 fixed-point MADs (fixed-point)

Assuming a fixed-point RGBAtri-linear mipmap filteredprojective texture fetch

Observations about the Texture Fetch

Lots of ways to implement the mathLots of clever ways to be efficient

Lots more texture operations not considered in this analysis

Compression

Anisotropic filtering

sRGB

Shadow mapping

Arguably TEX instructions are “world’s most CISC instructions”Texture fetches are incredibly complex instructions

Good deal of GPU’s superiority at graphics operations over CPUs is attributable to TEX instruction efficiency

Good for compute too

Take Away Information

The GPU texture fetch is about two orders of magnitude more complex than the most complex CPU instruction

And texture fetches are extremely common

Dozens of billions of texture fetches are expected by modern GPU applications

Not just a graphics thingUsing CUDA, you can access textures from within your compute- and bandwidth-intensive parallel kernels

GPU Technology Conference 2010Monday, Sept. 20 - Thurs., Sept. 23, 2010San Jose Convention Center, San Jose, California

The most important event in the GPU ecosystemLearn about seismic shifts in GPU computing

Preview disruptive technologies and emerging applications

Get tools and techniques to impact mission critical projects

Network with experts, colleagues, and peers across industries

Opportunities

Call for SubmissionsSessions & posters

Sponsors / ExhibitorsReach decision makers

“CEO on Stage” Showcase for Startups

Tell your story to VCs and analysts

Opportunities

Call for SubmissionsSessions & posters

Sponsors / ExhibitorsReach decision makers

“CEO on Stage” Showcase for Startups

Tell your story to VCs and analysts