Upload
mark-kilgard
View
7.722
Download
15
Embed Size (px)
DESCRIPTION
Presentation for Dr. Bajaj's computer graphics class (CS 354) at University of Texas, Austin (April 29, 2010).
Citation preview
Anatomy of aTexture Fetch
Mark J. Kilgard
presented for Dr. Bajaj’s graphics classUniversity of Texas, Austin
April 29, 2010
About Me
Principal System Software EngineerNVIDIA Distinguished Inventor
Native Texan, living in AustinBut instead of UT, attended Rice (C.S. 1991)
Yet I root for Texas in football (unless playing Rice)
Interested inProgrammable shading languages for GPUs
Novel GPU-accelerated rendering paradigms
OpenGL & GPU Architecture Improvements
Our Topic: the Texture Fetch Dissected
Texture FetchAll modern 3D games rely on texture-rich rendering
Basis for all sorts of richness and detail
More than graphics—GPU compute supports texture fetches
ConceptuallyAn image load with built-in re-sampling
GeForce 480 capable of 42,000,000,000 per second! *
* 700 Mhz clock × 15 Streaming Multiprocessors × 4 fetches per clock
A Texture Fetch Simplified
Seems pretty simple…
Given1. An image
2. A position
Return thecolor of image at position
Fetch at
(s,t) = (0.6, 0.25)
T a
xis
S axis0.0
1.0
0.0
1.0
RGBA Result is
0.95,0.4,0.24,1.0)
Texture Supplies Detail to Rendered Scenes
Without texture
With texture
Textures Make Graphics Pretty
Texture → detail,
detail → immersion,
immersion → fun
Unreal Tournament
Microsoft Flight Simulator X
Sacred 2
Shaders Combine Multiple Textures
(modulate)
=
lightmaps onlylightmaps only
decal onlydecal only
combined scenecombined scene* Id Software’s Quake 2
circa 1997
Projected Texturing for Shadow Mapping
Depth map from light’s point of viewis re-used as a texture andre-projected into eye’s view to generate shadows
lightposition
without
shadows
with
shadows
“what thelight sees”
Shadow Mapping Explained
Planar distance from lightPlanar distance from light Depth map projected onto sceneDepth map projected onto scene
≤≤ ==lesslessthanthan
True “un-shadowed” True “un-shadowed” region shown greenregion shown green
equalsequals
Volume rendering of fluid turbulence,
Lawrence Berkley National Lab
Automotive design, RTT
Seismic visualization, Landmark Graphics
Texture’s Not All Fun and Games
What’s so hard?
FilteringPoor quality results if you just return the closest color sample in the imageBilinear filtering + mipmapping needed
ComplicationsWrap modes, formats, compression, color spaces, other dimensionalities (1D, 3D, cube maps), etc.
Gotta be quickApplications desire billions of fetches per secondWhat’s done per-fragment in the shader, must be done per-texel in the texture fetch—so 8x times as much work!
Essentially a miniature, real-time re-sampling kernel
Anatomy of a Texture Fetch
Filteredtexel vectorTexel
SelectionTexel
Combination
Texel offsets
Texel data
Texture images
Combination parameters
Texture coordinate
vector
Texture parameters
Texture Fetch Functionality (1)
Texture coordinate processingProjective texturing (OpenGL 1.0)Cube map face selection (OpenGL 1.3)Texture array indexing (OpenGL 2.1)Coordinate scale: normalization (ARB_texture_rectangle)
Level-of-detail (LOD) computationLog of maximum texture coordinate partial derivative (OpenGL 1.0)LOD clamping (OpenGL 1.2)LOD bias (OpenGL 1.3)Anisotropic scaling of partial derivatives (SGIX_texture_lod_bias)
Wrap modesRepeat, clamp (OpenGL 1.0)Clamp to edge (OpenGL 1.2), Clamp to border (OpenGL 1.3)Mirrored repeat (OpenGL 1.4)Fully generalized clamped mirror repeat (EXT_texture_mirror_clamp)Wrap to adjacent cube map faceRegion clamp & mirror (PlayStation 2)
Arrays of 2D Textures
Multiple skins packed in texture arrayMotivation: binding to one multi-skin texture array avoids texture bind per object
Texture array index
0 1 2 3 4
0
1
234
Mip
map
lev
el i
nd
ex
Texture Fetch Functionality (2)
Filter modesMinification / magnification transition (OpenGL 1.0)
Nearest, linear, mipmap (OpenGL 1.0)
1D & 2D (OpenGL 1.0), 3D (OpenGL 1.2), 4D (SGIS_texture4D)
Anisotropic (EXT_texture_filter_anisotropic)
Fixed-weights: Quincunx, 3x3 Gaussian
Used for multi-sample resolves
Detail texture magnification (SGIS_detail_texture)
Sharpen texture magnification (SGIS_sharpen_texture)
4x4 filter (SGIS_texture_filter4)
Sharp-edge texture magnification (E&S Harmony)
Floating-point texture filtering (ARB_texture_float, OpenGL 3.0)
Texture Fetch Functionality (3)
Texture formatsUncompressed
Packing: RGBA8, RGB5A1, etc. (OpenGL 1.1)Type: unsigned, signed (NV_texture_shader)Normalized: fixed-point vs. integer (OpenGL 3.0)
CompressedDXT compression formats (EXT_texture_compression_s3tc)4:2:2 video compression (various extensions)1- and 2-component compression (EXT_texture_compression_latc,OpenGL 3.0)Other approaches: IDCT, VQ, differential encoding, normal maps, separable decompositions
Alternate encodingsRGB9 with 5-bit shared exponent (EXT_texture_shared_exponent)Spherical harmonicsSum of product decompositions
Texture Fetch Functionality (4)
Pre-filtering operationsGamma correction (OpenGL 2.1)
Table: sRGB / arbitraryShadow map comparison (OpenGL 1.4)
Compare functions: LEQUAL, GREATER, etc. (OpenGL 1.5)Needs “R” depth value per texel
Palette lookup (EXT_paletted_texture)Thresh-holding
Color keyGeneralized thresh-holding
Delicate Color Fidelity with sRGB
Problem: PC display devices have a legacy non-linear (sRGB) display gamut—delicate color shading looks wrong
Conventional
rendering(uncorrect
ed color)
Gamma correct(sRGB rendered)
Softer and more natural
Unnaturallydeep facial
shadows
NVIDIA’s Adriana GeForce 8 Launch Demo
Implication for texture hardware: Should perform sRGB-to-linear color space convert per-texel, so 24 scalar conversions—or more—per fetch
Texture Fetch Functionality (5)
OptimizationsLevel-of-detail weighting adjustments
Mid-maps (extra pre-filtered levels in-between existing levels)
Unconventional usesBitmap textures for fonts with large filters (Direct3D 10)
Rip-mapping
Non-uniform texture border color
Clip-mapping (SGIX_clipmap)
Multi-texel borders
Silhouette maps (Pardeep Sen’s work)
Shadow mapping
Sharp piecewise linear magnification
Phased Data Flow
Must hide long memory read latency between Selection and Combination phases
Memoryreads for samples
FIFOing of combination
parameters
Filteredtexel vector
TexelSelection
TexelCombination
Texel offsets
Texel data
Texture images
Combination parameters
Texture parameters
Texture coordinate
vector
What really happens?
Let’s consider a simple tri-linear mip-mapped 2D projective texture fetch
Logically just one instructionfloat4 color = tex2Dproj(decalSampler, st);
TXP o[COLR], f[TEX3], TEX2, 2D;
LogicallyTexel selection
Texel combination
How many operations are involved?
Assembly instruction
(NV_fragment_program)
High-level language
statement (Cg/HLSL)
Medium-Level Dissectionof a Texture Fetch
Converttexel
coordsto
texeloffsets
integer /fixed-point
texelcombination
texel offsets
texel data
texture images
combinationparameters
interpolatedtexture coords
vector
texture parameters
Converttexturecoords
totexel
coords
filteredtexelvector
texel coords
floor /frac
integercoords & fractional
weights floating-pointscaling
andcombination
integer /fixed-pointtexelintermediates
Interpolation
First we need to interpolate (s,t,r,q)This is the f[TEX3] part of the TXP instructionProjective texturing means we want (s/q, t/q)
And possible r/q if shadow mapping
In order to correct for perspective, hardware actually interpolates(s/w, t/w, r/w, q/w)
If not projective texturing, could linearly interpolate inverse w (or 1/w)Then compute its reciprocal to get w
Since 1/(1/w) equals w Then multiply (s/w,t/w,r/w,q/w) times w
To get (s,t,r,q)
If projective texturing, we can insteadCompute reciprocal of q/w to get w/qThen multiple (s/w,t/w,r/w) by w/q to get (s/q, t/q, r/q)
Observe projective texturing is same cost as perspective correction
Interpolation Operations
Ax + By + C per scalar linear interpolation2 MADs
One reciprocal to invert q/w for projective texturingOr one reciprocal to invert 1/w for perspective texturing
Then 1 MUL per component for s/w * w/qOr s/w * w
For (s,t) means4 MADs, 2 MULs, & 1 RCP(s,t,r) requires 6 MADs, 3 MULs, & 1 RCP
All floating-point operations
Texture Space Mapping
Have interpolated & projected coordinates
Now need to determine what texels to fetch
Multiple (s,t) by (width,height) of texture base levelCould convert (s,t) to fixed-point first
Or do math in floating-point
Say based texture is 256x256 so
So compute (s*256, t*256)=(u,v)
Mipmap Level-of-detail Selection
Tri-linear mip-mapping means compute appropriate mipmap levelHardware rasterizes in 2x2 pixel entities
Typically called quad-pixels or just quadFinite difference with neighbors to get change in u and v with respect to window space
Approximation to ∂u/∂x, ∂u/∂y, ∂v/∂x, ∂v/∂yMeans 4 subtractions per quad (1 per pixel)
Now compute approximation to gradient lengthp = max(sqrt((∂u/∂x)2+(∂u/∂y)2), sqrt((∂v/∂x)2+(∂v/∂y)2))
one-pixel separation
Level-of-detail Bias and Clamping
Convert p length to power-of-two level-of-detail and apply LOD bias
λ = log2(p) + lodBias
Now clamp λ to valid LOD rangeλ’ = max(minLOD, min(maxLOD, λ))
Determine Mipmap Levels andLevel Filtering Weight
Determine lower and upper mipmap levelsb = floor(λ’)) is bottom mipmap level
t = floor(λ’+1) is top mipmap level
Determine filter weight between levelsw = frac(λ’) is filter weight
Determine Texture Sample Point
Get (u,v) for selected top and bottom mipmap levelsConsider a level l which could be either level t or b
With (u,v) locations (ul,vl)
Perform GL_CLAMP_TO_EDGE wrap modesuw = max(1/2*widthOfLevel(l), min(1-1/2*widthOfLevel(l), u))
vw = max(1/2*heightOfLevel(l), min(1-1/2*heightOfLevel(l), v))
Get integer location (i,j) within each level(i,j) = ( floor(uw* widthOfLevel(l)), floor(vw* ) )
border
edge
s
t
Determine Texel Locations
Bilinear sample needs 4 texel locations(i0,j0), (i0,j1), (i1,j0), (i1,j1)
With integer texel coordinatesi0 = floor(i-1/2)
i1 = floor(i+1/2)
j0 = floor(j-1/2)
j1 = floor(j+1/2)
Also compute fractional weights for bilinear filteringa = frac(i-1/2)
b = frac(j-1/2)
Determine Texel Addresses
Assuming a texture level image’s base pointer, compute a texel address of each texel to fetch
Assume bytesPerTexel = 4 bytes for RGBA8 texture
Exampleaddr00 = baseOfLevel(l) + bytesPerTexel*(i0+j0*widthOfLevel(l))addr01 = baseOfLevel(l) + bytesPerTexel*(i0+j1*widthOfLevel(l))addr10 = baseOfLevel(l) + bytesPerTexel*(i1+j0*widthOfLevel(l))addr11 = baseOfLevel(l) + bytesPerTexel*(i1+j1*widthOfLevel(l))
More complicated address schemes are needed for good texture locality!
Initiate Texture Reads
Initiate texture memory reads at the 8 texel addressesaddr00, addr01, addr10, addr11 for the upper level
addr00, addr01, addr10, addr11 for the lower level
Queue the weights a, b, and wLatency FIFO in hardware makes these weights available when texture reads complete
Phased Data Flow
Must hide long memory read latency between Selection and Combination phases
Memoryreads for samples
FIFOing of combination
parameters
Filteredtexel vector
TexelSelection
TexelCombination
Texel offsets
Texel data
Texture images
Combination parameters
Texture parameters
Texture coordinate
vector
Texel Combination
When texels reads are returned, begin filteringAssume results are
Top texels: t00, t01, t10, t11Bottom texels: b00, b01, b10, b11
Per-component filtering math is tri-linear filterRGBA8 is four components
result = (1-a)*(1-b)*(1-w)*b00 + (1-a)*b*(1-w)*b*b01 + a*(1-b)*(1-w)*b10 + a*b*(1-w)*b11 + (1-a)*(1-b)*w*t00 + (1-a)*b*w*t01 + a*(1-b)*w*t10 + a*b*w*t11;24 MADs per component, or 96 for RGBA
Lerp-tree could do 14 MADs per component, or 56 for RGBA
Total Texture Fetch Operations
Interpolation6 MADs, 3 MULs, & 1 RCP (floating-point)
Texel selectionTexture space mapping
2 MULs (fixed-point)LOD determination (floating-point)
1 pixel difference, 2 SQRTs, 4 MULs, 1 LOG2LOD bias and clamping (fixed-point)
1 ADD, 1 MIN, 1 MAXLevel determination and level weighting (fixed-point)
1 FLOOR, 1 ADD, 1 FRACTexture sample point
4 MAXs, 4 MINs, 2 FLOORs (fixed-point)Texel locations and bi-linear weights
8 FLOORs, 4 FRACs, 8 ADDs (fixed-point)Addressing
16 integer MADs (integer)Texel combination
56 fixed-point MADs (fixed-point)
Assuming a fixed-point RGBAtri-linear mipmap filteredprojective texture fetch
Observations about the Texture Fetch
Lots of ways to implement the mathLots of clever ways to be efficient
Lots more texture operations not considered in this analysis
Compression
Anisotropic filtering
sRGB
Shadow mapping
Arguably TEX instructions are “world’s most CISC instructions”Texture fetches are incredibly complex instructions
Good deal of GPU’s superiority at graphics operations over CPUs is attributable to TEX instruction efficiency
Good for compute too
Take Away Information
The GPU texture fetch is about two orders of magnitude more complex than the most complex CPU instruction
And texture fetches are extremely common
Dozens of billions of texture fetches are expected by modern GPU applications
Not just a graphics thingUsing CUDA, you can access textures from within your compute- and bandwidth-intensive parallel kernels
GPU Technology Conference 2010Monday, Sept. 20 - Thurs., Sept. 23, 2010San Jose Convention Center, San Jose, California
The most important event in the GPU ecosystemLearn about seismic shifts in GPU computing
Preview disruptive technologies and emerging applications
Get tools and techniques to impact mission critical projects
Network with experts, colleagues, and peers across industries
Opportunities
Call for SubmissionsSessions & posters
Sponsors / ExhibitorsReach decision makers
“CEO on Stage” Showcase for Startups
Tell your story to VCs and analysts
Opportunities
Call for SubmissionsSessions & posters
Sponsors / ExhibitorsReach decision makers
“CEO on Stage” Showcase for Startups
Tell your story to VCs and analysts