Compiling Metaprogrammed Shaders to Stream GPUs
Michael D. McCool
Computer Graphics Lab, University of Waterloo
Graphics Hardware 2003
Topics
  GPUs are “Stream Processors”… But what does that mean, exactly?
  Can general programs be compiled to GPUs?
  Can they run efficiently on GPUs?
  How can GPUs be evolved to support more powerful programming models without negatively impacting performance?
  What abstractions should programming languages for GPUs support?
Imagine Stream Processor
  SIMD kernel processing on streams containing homogeneous records
  Memory hierarchy
    Local registers
    Stream register file
    External memory
  Streaming external memory access
  Conditional read and write
Stream GPU Architecture
[Figure: pipeline diagram with stages Vertex Shader, Rasterizer, Fragment Shader, Compositor, and Display. The legend distinguishes New and Optional data paths.]
Stream GPU Architecture
  Stream input to vertex unit
  Array inputs to fragment unit
  At least two stream outputs from fragment unit supporting conditional writes
  Array output from fragment unit via compositor
Sh Metaprogramming Library
  Embedded metaprogramming
  Both a library and a high-level programming language
  Available from SourceForge: http://libsh.sourceforge.net
  Currently semantically “Cg-equivalent”
  Adding control constructs, stream algebra in next phase…
Julia Set: Sh Example
ShAttrib1f julia_max_iter = 20.0;
ShAttrib1f julia_scale = 0.05;
ShAttrib2f julia_c(1.0, -0.3);
ShTexture2D<ShColor3f> julia_map(32,32);
. . .
ShProgram julia0 = SH_BEGIN_VERTEX_SHADER {
  ShInputTexCoord2f ui;
  ShInputPosition3f pm;
  ShOutputTexCoord2f uo(ui);
  ShOutputPosition4f pd;
  pd = (perspective | modelview) | pm;
} SH_END_SHADER;
ShProgram julia1 = SH_BEGIN_FRAGMENT_SHADER {
  ShInputTexCoord2f u;
  ShInputPosition2f pdxy;
  ShOutputColor3f fc;
  ShAttrib1f i = 0.0;
  // Loop while the iterate u is inside the bound and iterations remain.
  // The product of the two tests acts as a logical AND.
  SH_WHILE(((u|u) < 2.0) * (i < julia_max_iter)) {
    ShTexCoord2f v;
    v(0) = u(0)*u(0) - u(1)*u(1);  // real part of u*u
    v(1) = 2.0*u(0)*u(1);          // imaginary part of u*u
    u = v + julia_c;
    i++;
  } SH_ENDWHILE;
  ShTexCoord2f lookup(0.0,0.0);
  lookup(0) = julia_scale * i;     // map iteration count to a texture coordinate
  fc = julia_map(lookup);
} SH_END_SHADER;
Compiler: Control Flow Graph
Control Graph
  Control flow graph from compiler also describes multipass stream program!
  Need conditional write to avoid accumulation of “garbage records”
  Iteration and conditionals may scramble order of records, but can always sort by ID later if necessary.
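The points above can be illustrated with a CPU-side sketch, assuming a loop of the same shape as the Julia set example. Each pass over the active stream conditionally writes every record to exactly one of two outputs (loop again, or finished), so no garbage records accumulate, and the finished stream is sorted by ID at the end. The names `Record` and `run_while_multipass` are hypothetical and are not part of Sh; the containers stand in for GPU stream buffers.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One stream record (hypothetical layout for this sketch).
struct Record {
    int id;      // position in the original input stream
    float x, y;  // current iterate z = (x, y)
    int iter;    // iterations completed so far
};

// Runs "while (|z|^2 < bound && iter < max_iter) z = z*z + c" as repeated
// passes over a stream. Each pass writes a record either to `next`
// (loop again) or to `done` (leave the loop), never to both.
std::vector<Record> run_while_multipass(std::vector<Record> active,
                                        float cx, float cy,
                                        float bound, int max_iter) {
    assert(max_iter >= 0);
    std::vector<Record> done;
    while (!active.empty()) {
        std::vector<Record> next;  // holds only live records
        for (const Record& r : active) {
            const bool again =
                (r.x * r.x + r.y * r.y < bound) && (r.iter < max_iter);
            if (again) {
                next.push_back(Record{r.id,
                                      r.x * r.x - r.y * r.y + cx,
                                      2.0f * r.x * r.y + cy,
                                      r.iter + 1});
            } else {
                done.push_back(r);
            }
        }
        active.swap(next);
    }
    // Records leave the loop in order of iteration count, not input order,
    // so restore the input order by sorting on the ID.
    std::sort(done.begin(), done.end(),
              [](const Record& a, const Record& b) { return a.id < b.id; });
    return done;
}
```

In this sketch the active stream shrinks on every pass, which is the saving that conditional writes are meant to provide on hardware.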
Julia Set: Control Graph
[Figure: control graph with kernels Rasterize, Iterator, and Render. Stream sizes labeled in the figure: 56.25 Kwords (800 tris), 197.48 Kwords (three labels), and 9450 Kwords.]
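Adaptive Tessellation: Control Graph
[Figure: control graph with kernels Oracle, Tess4, Tess3, Tess2, Bump, and Split, plus a stack arc. Stream sizes labeled in the figure: 4010 Kwords (42771 tris), 5748 Kwords, 5661 Kwords, 3.46 Kwords, 368.6 Kwords, 1137 Kwords, and 86.71 Kwords (800 tris).]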
Scheduler
  Local arcs are system-allocated stream buffers (ideally stream registers)
  System picks kernel to run:
    Has enough input data
    Space available in output buffer
    Picks kernel that maximizes throughput
  Repeat until no more data in input stream
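The scheduling rule above can be sketched on the CPU as follows. This is a simplified model under stated assumptions: every kernel consumes one record and produces one record, "throughput" is taken to be the number of records a kernel can move in one launch, and the graph is a linear pipeline. `Buffer`, `Kernel`, `runnable`, and `schedule` are hypothetical names, not part of Sh.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// A bounded stream buffer standing in for an arc of the control graph.
struct Buffer {
    std::deque<int> data;
    std::size_t capacity = 0;
    std::size_t free_space() const {
        assert(data.size() <= capacity);
        return capacity - data.size();
    }
};

// A kernel reads from buffer `in` and writes to buffer `out` (1:1 records).
struct Kernel {
    std::size_t in;
    std::size_t out;
    std::function<int(int)> fn;
};

// Records the kernel can process now: limited by available input
// and by free space in its output buffer.
std::size_t runnable(const Kernel& k, const std::vector<Buffer>& bufs) {
    return std::min(bufs[k.in].data.size(), bufs[k.out].free_space());
}

// Repeatedly launches the kernel that can move the most records.
// Stops when no kernel can make progress. Returns the number of launches.
std::size_t schedule(const std::vector<Kernel>& kernels,
                     std::vector<Buffer>& bufs) {
    std::size_t launches = 0;
    for (;;) {
        const Kernel* best = nullptr;
        std::size_t best_n = 0;
        for (const Kernel& k : kernels) {
            const std::size_t n = runnable(k, bufs);
            if (n > best_n) { best_n = n; best = &k; }
        }
        if (best == nullptr) break;  // no input left, or every output is full
        for (std::size_t i = 0; i < best_n; ++i) {
            const int rec = bufs[best->in].data.front();
            bufs[best->in].data.pop_front();
            bufs[best->out].data.push_back(best->fn(rec));
        }
        ++launches;
    }
    return launches;
}
```

A small intermediate buffer forces the scheduler to alternate between producer and consumer, which is why the deck prefers arcs that fit in on-chip stream registers.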
Observations:
  True conditionals and iteration:
    Implementable with conditional write to stream output
    NEED NULL COMPRESSION!
    Multiple stream outputs also desirable
  Fragment scatter:
    Implementable with render-to-vertex-array
    F-buffer feedback also desirable
Simulating Null Compression
  Want conditional write to stream
    No space wasted for nullified records
  Can simulate on current GPUs:
    Write to array
    Use occlusion test to count number of non-null records
    Sort array by mark bit (use depth channel to mark)
    Discard null records (now at end of array)
  Expensive, perhaps other ways…
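The simulation steps listed above can be mirrored on the CPU as a rough sketch. Here a `live` flag stands in for the mark bit kept in the depth channel, `std::count_if` stands in for the occlusion test, and `std::stable_sort` stands in for the GPU sort. The stable sort is an assumption made so that surviving records keep their relative order; the slide does not say whether the GPU sort preserves it. `Marked`, `count_live`, and `compress_nulls` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// An array element with a mark bit: live records are kept, null ones dropped.
struct Marked {
    float value;
    bool live;
};

// Stand-in for the occlusion test: counts the non-null records.
std::size_t count_live(const std::vector<Marked>& a) {
    return static_cast<std::size_t>(
        std::count_if(a.begin(), a.end(),
                      [](const Marked& m) { return m.live; }));
}

// Sorts by mark bit so null records end up at the end, then discards them.
std::vector<Marked> compress_nulls(std::vector<Marked> a) {
    const std::size_t live = count_live(a);
    assert(live <= a.size());
    std::stable_sort(a.begin(), a.end(),
                     [](const Marked& x, const Marked& y) {
                         return x.live && !y.live;  // live records first
                     });
    a.erase(a.begin() + static_cast<std::ptrdiff_t>(live), a.end());
    return a;
}
```

The sort touches every record, including the null ones, which gives a sense of why the deck calls this approach expensive.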
HW Stream Null Compression
[Figure: diagram with eight units labeled fsh and stream elements numbered 1 to 34. The layout is not recoverable from the extracted text.]
Stream Algebra
ShProgram p;
(a,b) = p(d,e,f);
(a,b) = p << (d,e,f);
(a,b) = p << d << e << f;
(a,b) = p << q << (d,e,f);
(a,b,u,v,w) = (p ** q) << (d,e,f,j,k,l);
fb += p << r << q << (c,n,v)[i];
ShStream cq = optimize(q << (c,n,v)[i]);
fb += p << r << cq;
a += s * t;
ShCampaign k = . . .
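One way to read the operators above is as function composition and application. The following is a minimal sketch under that reading, not the Sh implementation: `p << q` feeds the outputs of `q` into `p`, `p << data` applies `p` to a tuple, and the `**` operator from the slide is modeled by a named function `combine` that runs two programs side by side on concatenated inputs (C++ has no `**` token). `Tuple`, `Program`, and `combine` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Tuple = std::vector<float>;

// A program maps a tuple of n_in values to a tuple of n_out values.
struct Program {
    std::size_t n_in;
    std::size_t n_out;
    std::function<Tuple(const Tuple&)> fn;

    Tuple operator()(const Tuple& in) const {
        assert(in.size() == n_in);
        Tuple out = fn(in);
        assert(out.size() == n_out);
        return out;
    }
};

// p << q: connect the outputs of q to the inputs of p.
Program operator<<(const Program& p, const Program& q) {
    assert(q.n_out == p.n_in);
    return Program{q.n_in, p.n_out,
                   [p, q](const Tuple& in) { return p(q(in)); }};
}

// p << data: apply the program to an input tuple.
Tuple operator<<(const Program& p, const Tuple& in) {
    return p(in);
}

// Stand-in for p ** q: p consumes the first p.n_in inputs, q the rest,
// and the two output tuples are concatenated.
Program combine(const Program& p, const Program& q) {
    return Program{p.n_in + q.n_in, p.n_out + q.n_out,
                   [p, q](const Tuple& in) {
                       const auto split =
                           in.begin() + static_cast<std::ptrdiff_t>(p.n_in);
                       Tuple out = p(Tuple(in.begin(), split));
                       const Tuple rest = q(Tuple(split, in.end()));
                       out.insert(out.end(), rest.begin(), rest.end());
                       return out;
                   }};
}
```

With these definitions, `p << q << Tuple{d, e, f}` groups as `(p << q) << Tuple{d, e, f}`, matching the left-to-right chaining on the slide.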
Targets
  GPUs (via Cg, OGL Slang, etc.)
  SIMD Multithreaded MIMD
  SSE, SSE2 (via Intel compiler)
  Cluster computers
  Shared-mem computers
  PS2, PS3
Issues:
  Null compression can be simulated with sparse texture compression, but slow. H/W support would be useful.
  On-chip stream registers…
  Off-chip stream buffer compression…
  On-GPU scheduler…
  Compilation of recursive algorithms?
  Virtualization: registers, stream record size, stream length, textures, array read-write, synchronization, etc.
  Abstractions: streams, sequences, sets, indexes, arrays, programs, campaigns, shapes, etc.
Material Mapping: Control Graph
[Figure: control graph with kernels Rast, Split, HF, Wood, and HF + Wood.]
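Control Construct Templates
WHILE (C) {
  A
}
[Figure: control graph template for WHILE with nodes Predecessor, C, A, SP, and Successor.]
IF (C) {
  A
} ELSE {
  B
}
[Figure: control graph template for IF with nodes Predecessor, C, A, B, SP, and Successor.]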
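The IF template can be sketched on the CPU as a kernel with two conditionally written stream outputs, assuming the condition C routes each record to the stream for A or the stream for B and the successor reads both. The result comes out with the A records before the B records, so each record carries its ID in case the original order is needed later. `Rec` and `run_if` are hypothetical names, not part of Sh.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// A stream record carrying the ID of its position in the input stream.
struct Rec {
    int id;
    float v;
};

// IF (C) { A } ELSE { B } as a stream program:
// C writes each record to exactly one of two output streams,
// A and B each process their own stream, and the successor sees both.
std::vector<Rec> run_if(const std::vector<Rec>& in,
                        const std::function<bool(float)>& cond,
                        const std::function<float(float)>& a,
                        const std::function<float(float)>& b) {
    std::vector<Rec> then_stream, else_stream;
    for (const Rec& r : in) {
        (cond(r.v) ? then_stream : else_stream).push_back(r);
    }
    for (Rec& r : then_stream) r.v = a(r.v);  // kernel A
    for (Rec& r : else_stream) r.v = b(r.v);  // kernel B

    // The successor receives A's output followed by B's output.
    std::vector<Rec> out = then_stream;
    out.insert(out.end(), else_stream.begin(), else_stream.end());
    assert(out.size() == in.size());  // every record took exactly one branch
    return out;
}
```

Neither branch stream contains null records here, which is the behavior that conditional writes with null compression would give on hardware.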