Feb 15, 2005
Size of market
• Many millions of GPUs ship per month.
• The 3D market is entertainment (games).
• Each new generation of GPU adds enough performance to support a new version of a game.
• Each time a game is released, players have to replace hardware to run the game.
• The game industry is larger than Hollywood.
Technology view
[Diagram: GPU and CPU placed on three scales]
• Performance / function: not enough, OK, too good
• Architecture: proprietary to commodity
• Interfaces: mutable to locked down
How much headroom
• Pixar uses 100,000 minutes of compute per minute of image.
• GPUs are real time, so the 100,000× gap is roughly 17 to 20 doublings.
• Most optimistic marketing version of Moore’s law: performance doubles every 6 months.
• So there are roughly 10 years to go.
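The headroom estimate can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and the 6-month doubling period is the slide's "most optimistic" assumption, not a measured rate.

```python
import math


def doublings_needed(gap: float) -> int:
    """Whole number of 2x performance steps needed to cover a gap."""
    return math.ceil(math.log2(gap))


def years_to_close(doublings: int, months_per_doubling: float = 6.0) -> float:
    """Years required for that many doublings at the assumed rate."""
    return doublings * months_per_doubling / 12.0


exact = doublings_needed(100_000)      # 17, since 2**17 = 131,072
print(exact, years_to_close(exact))    # 17 8.5
print(years_to_close(20))              # 10.0, the rounder figure on the slide
```

Rounding up to 20 doublings leaves about a factor of ten of margin (2**20 is about 1,000,000).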
Application space
• Problems are embarrassingly parallel.
• Problems are big: the screen is 1000 x 1000 and the program runs per pixel, including pixels that end up behind others, so about 10 * 1000 * 1000 calls per frame at 20-60 frames per second.
• The same program runs over and over, so GPUs are SIMD machines.
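To get a feel for the scale, the per-frame count can be multiplied out. A minimal sketch; the parameter name `overdraw` is an illustrative label for the slide's factor of 10, not a term the slide uses.

```python
def shader_calls_per_second(width: int = 1000, height: int = 1000,
                            overdraw: int = 10, fps: int = 60) -> int:
    """Pixel-program invocations per second for a given screen and frame rate."""
    return width * height * overdraw * fps


low = shader_calls_per_second(fps=20)    # 200,000,000
high = shader_calls_per_second(fps=60)   # 600,000,000
print(f"{low:,} to {high:,} calls per second")
```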
SIMD
• There are many units executing in parallel
– These run in lock-step, executing the same instruction on different pixels/vertices at the same time
– Dynamic flow control can cause inefficiencies in such an architecture, since different pixels/vertices can take different code paths
– Dynamic branching is not always a performance win
– For an if…then…else, the hardware needs to execute both sides, turning processors on and off
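The cost of divergence can be illustrated with a toy lock-step model. This is a sketch, not a description of any particular chip: `simd_if_else` is a hypothetical helper, and skipping a side that no lane needs is an assumption about how a scheduler might behave.

```python
def simd_if_else(values, cond, then_fn, else_fn):
    """Run an if/else over all lanes in lock-step using an execution mask.

    Returns the per-lane results and the number of passes executed.
    """
    mask = [cond(v) for v in values]
    out = list(values)
    passes = 0
    if any(mask):
        # "then" side: lanes whose mask is False are switched off
        passes += 1
        out = [then_fn(v) if m else o for v, m, o in zip(values, mask, out)]
    if not all(mask):
        # "else" side: lanes whose mask is True are switched off
        passes += 1
        out = [else_fn(v) if not m else o for v, m, o in zip(values, mask, out)]
    return out, passes


def is_even(v):
    return v % 2 == 0


# Divergent lanes: both sides run, so the branch costs two passes.
print(simd_if_else([1, 2, 3, 4], is_even, lambda v: v * 10, lambda v: -v))
# Coherent lanes: only one side is needed.
print(simd_if_else([2, 4, 6, 8], is_even, lambda v: v * 10, lambda v: -v))
```

When neighboring pixels agree on the branch, the second pass disappears; when they disagree, every lane pays for both sides.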
Application space
• Many values are coherent: values in neighboring pixels are close.
• Compute coherent variables at selected points and use interpolation to find the intermediate values.
• Today the programmer specifies which variables are coherent by splitting the program in two.
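As a rough illustration of the idea, a value computed only at the two ends of a span can be filled in for the pixels between them by linear interpolation. The function names here are hypothetical, and real rasterizers interpolate across triangles rather than a single span.

```python
def lerp(a: float, b: float, t: float) -> float:
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def interpolate_span(v0: float, v1: float, n_pixels: int) -> list:
    """Values for n_pixels evenly spaced pixels between two computed endpoints."""
    if n_pixels == 1:
        return [v0]
    return [lerp(v0, v1, i / (n_pixels - 1)) for i in range(n_pixels)]


# The expensive computation runs twice (at the endpoints);
# the five pixel values come from interpolation.
print(interpolate_span(0.0, 1.0, 5))
```

In the two-part split, the first program produces the endpoint values and the second program receives the interpolated ones.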
Application space
• A common subproblem is texture filtering
– Evaluate some array of memory around a stencil and combine
– Provide a small fixed set of stencil patterns in hardware
– You could think of this as slightly smart memory
– Hardware support for 1-D to 3-D arrays and several filtering functions
– Exact stencil patterns and combining operations are proprietary (some look better than others)
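Bilinear filtering is one widely known example of a stencil-and-combine operation: it reads a 2x2 neighborhood and blends it by distance. The sketch below is a textbook version with clamped edges, assumed for illustration; it is not the proprietary pattern of any vendor.

```python
def bilinear_sample(texture, u: float, v: float) -> float:
    """Sample a 2-D texture (list of rows) at texel coordinates (u, v)."""
    height, width = len(texture), len(texture[0])
    # Clamp the coordinates to the texture so the 2x2 stencil stays in range.
    u = min(max(u, 0.0), width - 1)
    v = min(max(v, 0.0), height - 1)
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = u - x0, v - y0
    # Combine: blend horizontally on both rows, then blend the rows.
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


tex = [[0.0, 1.0],
       [2.0, 3.0]]
print(bilinear_sample(tex, 0.5, 0.5))  # 1.5, the average of all four texels
```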
Application space
• Little communication between processing elements
• Approximate spatial derivative by 2x2 difference operator
• Forces all machine designs to work on multiples of four pixels
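A 2x2 difference operator can be sketched as follows. The layout of the quad and the choice of which differences to take are assumptions for illustration; actual hardware may use coarser or finer variants.

```python
def quad_derivatives(quad):
    """Approximate d/dx and d/dy of a value from one 2x2 block of pixels.

    quad is [[top_left, top_right], [bottom_left, bottom_right]].
    """
    (top_left, top_right), (bottom_left, _bottom_right) = quad
    ddx = top_right - top_left      # change across one pixel horizontally
    ddy = bottom_left - top_left    # change across one pixel vertically
    return ddx, ddy


def f(x, y):
    """Example value that varies linearly over the screen."""
    return 3 * x + 2 * y


quad = [[f(4, 6), f(5, 6)],
        [f(4, 7), f(5, 7)]]
print(quad_derivatives(quad))  # (3, 2)
```

Because each pixel needs its three neighbors in the block, the four pixels have to be processed together, which is why designs work on multiples of four.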
Application space
• Throughput is important
• Use threading to cover latency
• The chips can support hundreds of threads, and can switch from thread to thread every cycle
– No thread switch overhead
– Hardware scheduler and thread system
– Compiler knows about threads and splits resources over threads
• Caches are very different: they can only cover spatial locality
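A toy model shows why many threads with free switching hide latency. The assumptions are loud ones: each thread does one cycle of work and then waits a fixed number of cycles for memory, and the scheduler can pick any ready thread at no cost. The function name and the numbers are illustrative only.

```python
def alu_utilization(n_threads: int, latency: int, cycles: int = 10_000) -> float:
    """Fraction of cycles in which some thread was ready to issue."""
    ready_at = [0] * n_threads          # cycle at which each thread may issue again
    busy = 0
    for now in range(cycles):
        for t in range(n_threads):
            if ready_at[t] <= now:
                busy += 1               # issue one instruction from thread t
                ready_at[t] = now + 1 + latency
                break                   # one issue slot per cycle
    return busy / cycles


print(alu_utilization(1, 99))    # 0.01: one thread mostly waits
print(alu_utilization(50, 99))   # 0.5
print(alu_utilization(100, 99))  # 1.0: latency fully covered
```

Under these assumptions, roughly latency + 1 threads are enough to keep the unit busy every cycle.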
Programming model
• Performance is much less than users want
• At least 100,000 times less
• Most developers write each program at least four times
– Xbox
– PlayStation
– ATI top machine
– NVIDIA top machine
• Programs are in two parts: vertex and pixel shaders.
Programming model 2
• Programs can be written in a high-level, C-like language (HLSL / OpenGL 2.0)
• Or in a virtual assembly language (DirectX, …)
– Almost one dialect per chip
– The languages are virtual, but the resources are physical
• Developers review virtual machine listings for performance
• Developers ship virtual assembly language
Programming model 3
• At game startup, the virtual assembly language is JIT compiled to real machine language
– Drastic change in resource requirements
– Somewhat hard to debug
– Hard to identify performance bottlenecks
• Even though applications could build code on the fly, developers pretest everything: they want the most performance to get the best-looking image, and even that only approximates what they really want.
Programmable Pipeline
[Diagram: graphics pipeline with fixed-function and programmable paths]
• Geometry stage: Vertex Data (model space), then either Fixed Function Transform and Lighting or the Vertex Shader, then Clipping and Viewport Mapping
• Rasterizer stage: either Texture Stages or the Pixel Shader, then Fog, Alpha, Stencil and Depth Testing
Vertex Processing Flow
[Diagram: inputs and outputs of the Vertex Shader Engine]
• Per-vertex data (from the triangle mesh): position, normal, texture coordinates, etc.
• Constants: view matrix, projection matrix, skin/bone matrices, light positions, etc.
• Temporary registers
• Vertex shader instructions
• Output: position, “texture” coordinates, color(s)
Vertex Shader
• Input:
– Program specifies vertex data
  • Position
  • Normal
  • Vertex color
  • Texture coordinate(s)
  • …
– Data is sent to the graphics card and processed by the vertex shader
• Output
– Vertex shader computes output quantities
  • Position
  • Vertex color: diffuse and specular
  • Texture coordinate(s)
– Sent to rasterizer via interpolators
17
Pixel Processing FlowPixel Processing Flow
Temporary Registers
“Texture” CoordinatesColor(s)
Light ColorsAmbient Lighting ColorsEtc.
Constants
Pixel Shader
Instructions
Interpolated Values
Textures
Pixel Shader
Engine
Color Multi-Render Target
Program sizes
• Most programs are very small
• 100 virtual instructions would be a large program
• Basic data type is a four-element vector of floats
• Integer data types are not yet available
• Dynamic branching is new
• Small amount of nesting allowed
Polygons
• Polygon Budget
– Ruby: 75,000
– Optico: 50,000
– Ninja: 25,000
– Environment: 100,000
– Props: 50,000
• Lighting Limits
– 3 dynamic lights per shot (1 shadow casting)
– Lightmaps used for set
• Animation Limits
– 35 total blend shapes
– 5 simultaneous blend shapes
– 4 weighted bones per vertex
– Number of on-screen characters limited to 4 at once
Shader Breakdown
• Depth of Field
• Hair
• Skin
Depth Of Field
Shader Breakdown
• Glows
• Motion Blur
• Reflections
Glows
Motion Blur
Reflections
Hardware view
• X1900
• Xbox 360
• Both machines are current
X1900
[Block diagram: Radeon X1900 architecture]
Blocks shown: Vertex Data; Vertex Shader Engine; Backface Cull, Perspective Divide, Clip, Viewport Transform; Setup Engine; Geometry Assembly; Rasterization; Hierarchical Z Test; Interpolators; Ultra-Threading Dispatch Processor; General Purpose Register Arrays; Pixel Shader Engine; Texture Units; Texture Cache; Z/Stencil Buffer Cache; Compress/Decompress units.
Quad Pixel Shader Core
[Block diagram: quad pixel shader core shown within the X1900 pipeline]
• Each pixel shader processor contains Vector ALU 1, Vector ALU 2, Scalar ALU 1, Scalar ALU 2 and a Branch Execution Unit
• Pixel Shader Processor, per clock cycle:
– 1 vec3 ADD + input modifier
– 1 scalar ADD + input modifier
– 1 vec3 ADD/MUL/MADD
– 1 scalar ADD/MUL/MADD
– 1 flow control instruction
• Texture Address Units 1 to 4
– 1 texture address instruction per unit per clock cycle
Vertex Engine
• Upgraded to support SM3.0
– Dynamic flow control
– 1,024 instructions (practically unlimited with flow control)
– More temporary registers
• 8 Vertex Shader Processors
– Each can handle 2 shader instructions per clock
– 10 billion instructions per second
[Diagram labels: Vertex Data; 128-bit Vector ALU; 32-bit Scalar ALU; Flow Control; Backface Cull; Perspective Divide; Clip; Viewport Transform; To Setup Engine]
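The throughput figures on this slide imply a clock rate that the slide itself does not state. The sketch below simply divides the numbers out; the function name is hypothetical and the resulting clock is a derived value, not a quoted specification.

```python
def implied_clock_hz(instr_per_second: float, processors: int,
                     instr_per_clock: int) -> float:
    """Clock rate implied by a peak instruction rate."""
    return instr_per_second / (processors * instr_per_clock)


# 8 processors x 2 instructions per clock = 16 instructions per clock
clock = implied_clock_hz(10e9, processors=8, instr_per_clock=2)
print(clock / 1e6, "MHz")  # 625.0 MHz
```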
Ring Bus Memory Controller
• Supports today’s fastest graphics memory devices
– GDDR3, 48+ GB/sec
– GDDR4, the future
• 512-bit Ring Bus
– Simplifies layout and enables extreme memory clock scaling
• New Cache Design
– Fully associative for more optimal performance
• Improved Hyper Z
– Better compression and hidden surface removal
• Programmable Arbitration Logic
– Maximizes memory efficiency
– Can be upgraded via software
Memory Channels - 4x Improvement in Random Access over X850
[Diagram: memory controller connected to memory devices over a 256-bit interface]
• Radeon X850: 4 x 64-bit channels, 4 banks per DRAM
• Radeon X1900: 8 x 32-bit channels, 8 banks per DRAM
Cache Design
[Diagram: graphics memory mapped to a direct mapped cache vs. a fully associative cache]
• Fully Associative Caches
– Cache lines can map to any location in external memory
– Earlier designs used Direct Mapped & N-Way Associative Caches
– Could only access limited blocks of external memory
• Texture, Color, Z & Stencil caches are all now fully associative
– Reduces memory bandwidth requirements
– Minimizes cache contention stalls
– Optimized game performance
– Gains up to 25% clock for clock in fill/bandwidth bound cases
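The contention problem can be reproduced with two tiny cache models. This is a sketch under stated assumptions: addresses are block numbers, the direct mapped cache uses a simple modulo mapping, and the fully associative cache uses LRU replacement. None of these details come from the slide.

```python
from collections import OrderedDict


def direct_mapped_misses(addresses, n_lines: int) -> int:
    """Each block can live in exactly one line: block number modulo n_lines."""
    lines = [None] * n_lines
    misses = 0
    for a in addresses:
        slot = a % n_lines
        if lines[slot] != a:
            misses += 1
            lines[slot] = a
    return misses


def fully_associative_misses(addresses, n_lines: int) -> int:
    """Any block can live in any line; evict the least recently used."""
    cache = OrderedDict()
    misses = 0
    for a in addresses:
        if a in cache:
            cache.move_to_end(a)
        else:
            misses += 1
            if len(cache) >= n_lines:
                cache.popitem(last=False)
            cache[a] = True
    return misses


# Blocks 0 and 8 collide in an 8-line direct mapped cache.
pattern = [0, 8] * 4
print(direct_mapped_misses(pattern, 8))      # 8: every access evicts the other block
print(fully_associative_misses(pattern, 8))  # 2: only the first touches miss
```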
Xbox 360
• 3.2GHz Custom IBM Central Processor
– Three CPU Cores
– Two Threads Per Core
– VMX Unit Per Core
– 128 VMX Registers Per Thread
– 1MB L2 Cache (Lockable by Graphics Processor)
• 500MHz Custom ATI Graphics Processor
– Unified Shader Core
– 48 ALUs for Vertex or Pixel Shader processing
– 16 Filtered & 16 Unfiltered Texture samples per clock
– 10MB eDRAM Framebuffer
• 512MB System RAM
– Unified Memory Architecture (UMA)
– 128-bit interface
– 700MHz GDDR3 RAM
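The system memory figures translate into a peak bandwidth with one multiplication. The slide gives only the interface width and the clock; the factor of two transfers per clock (GDDR3 is a double data rate memory) and the function name are additions made here for illustration.

```python
def peak_bandwidth_gb_per_s(bus_bits: int, clock_hz: float,
                            transfers_per_clock: int = 2) -> float:
    """Peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_bits // 8
    return bytes_per_transfer * clock_hz * transfers_per_clock / 1e9


# 128-bit interface, 700MHz GDDR3, two transfers per clock (assumed)
print(peak_bandwidth_gb_per_s(128, 700e6))  # 22.4
```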
Architecture
[Block diagram: Xbox 360 graphics processor]
Blocks shown: Command Processor; Memory Hub; Vertex Grouper; Primitive Assembly; Scan Converter; Shader Interp (two); Sequencer; Shader Pipe (x16) (three groups); Vertex Cache; Texture Pipe (four); Texture Cache; Pipe Comm; Z/Alpha/Stencil Processors; 10MB DRAM; a link labeled 256 GB/sec.
Adaptive Shader Array
• Unified shader architecture
• One processor type
• Dynamic load balancing
• Pixel and vertex processing where and when they’re needed
– 48 shaders
• 120 billion operations per second
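The benefit of dynamic load balancing can be seen in an idealized model. Everything here is an assumption for illustration: work is measured in abstract units, balancing is perfect and free, and the fixed 8/40 split is a made-up comparison point rather than any real chip's configuration.

```python
def frame_time_fixed(vertex_work: float, pixel_work: float,
                     vertex_units: int, pixel_units: int) -> float:
    """Dedicated units: the slower of the two pools sets the frame time."""
    return max(vertex_work / vertex_units, pixel_work / pixel_units)


def frame_time_unified(vertex_work: float, pixel_work: float, units: int) -> float:
    """One pool: any unit can take either kind of work."""
    return (vertex_work + pixel_work) / units


# Vertex-heavy frame: the 8 vertex units become the bottleneck.
print(frame_time_fixed(2400, 2400, 8, 40))   # 300.0
print(frame_time_unified(2400, 2400, 48))    # 100.0

# Pixel-heavy frame: the fixed split is closer, but still behind.
print(frame_time_fixed(480, 4320, 8, 40))    # 108.0
print(frame_time_unified(480, 4320, 48))     # 100.0
```

A fixed split only matches the unified pool when the workload happens to have the same ratio as the hardware.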
Some interesting problems
• Coherence (branch prediction?)
• What are the right instructions?
• Can you do non-graphics applications?
• Programming language
• Threading by compiler
• Off-line compile?
Implications for programming languages
• GPU: you can convince people to use a new language if you can prove it is faster, even if it means lots of changes.
• Desktop CPU: you have to prove it can meet some other (non-performance/function) need.
• The top-of-the-line GPU price is going up while the top-of-the-line desktop CPU price is going down, so there is lots of chance to do cool design.
• Less need to be backward compatible.
More info
• http://www.ati.com/developer/index.html