Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Optimized Effects for Mobile Devices
Ed Plowman, Director of Performance Analysis, ARM
Stacy Smith, Senior Software Engineer, ARM
Grab Your Crystal Balls
Top 3 questions I get asked:
Q. What does the future of mobile content look like?
A. That depends on how much GPU capability you have?
Q. How much performance will content developers need?
A. As much as you can give them!
Q. When will the mobile reach console quality?
Well, lets take a look at that and see if we can answer the others along the way…
Mobile GPU Compute Year On Year
0
1000
2000
3000
4000
5000
6000
2006 2008 2010 2012 2014 2016 2018
GF
LO
PS
/Sec
How long before Desktop GPU compute is in Mobile?
PS3 Xbox 360
0.000
50.000
100.000
150.000
200.000
250.000
300.000
350.000
400.000
2006 2008 2010 2012 2014 2016 2018
Gig
a b
yte
s/se
c
How long before Desktop GPU Bandwidth is seen in Mobile?
Mobile GPU BW Growth Year on Year
PS3 Xbox 360
Sate of the Art
Mobile
Why is BW Not Progressing as Fast?
Simple… Power!
I did the graph, but desktop GPU power is too horrific!
Desktop = 170 Watts to >300 Watts… that’s just the GPU!
Console = 80-100 Watts (CPU/GPU/WiFi/Network)
Mobile Platform = 3 - 7 Watts (CPU/GPU/Modem/WiFi)!
How to Get 100W of Work from 3Watts?
“I believe the sign of maturity is accepting
deferred gratification.” - Peggy Cahn
Five main suppliers of GPU tech in mobile
Three are deferred renderers
And those three make up >90% of the volume
This is not a coincidence!
Deferred rendering is most efficient GPU tech for Mobile
Efficiency of BW, HW resource and Power
Getting the most from it requires slightly different thinking…
Thinking in a Deferred World…
Minimize draw calls and state changes
Draw Calls/API calls are not free use them wisely
Grouping draw calls with like state = good
But… don’t go crazy
Large object batches with high potential occlusion can be costly
Remember those vertices still need processing
Draw Target Bind/unbind on each draw call = bad
Seen (disappointingly) in a lot of commercial engines
Can cause flush and reload cycles of tile/cache memory
Bind it once, issue all draw calls, unbind it…
Hint: Take a look at the use of glDiscardFramebufferEXT()
Indicates to driver that render attachment is done with/complete
Thinking in a Deferred World…
Use Vertex Buffer Objects
Client side vertex buffers use copy on write (CoW) on each Draw Call
VBO’s don’t, so they provide a considerable performance increase
Avoid dynamic VBO or IBO updates using glBufferSubData()
Multiple Render Targets (New for OpenGL® ES 3.0)
Very efficient on deferred GPU
Make sure sum of bits/frag is “do-able” “in tile” for max performance
Different criteria for each GPU provider
Avoiding Blocking Behaviours
Deferred GPU’s use a pipeline
glReadPixels(), glCopyTexImage(), glTexSubImage() = bad…
If you must use glReadPixels use PBO’s
Use FBO instead of glCopyTexImage()
Also Occlusion Query (OpenGL ES 3.0) - Results delayed by 1-2 frames
Busy waiting on OQ bad idea!
Build Command Vertex Shading Fragment Shading
Frame N [-----] [-----]
Frame N+1 Frame N [-----]
Frame N+2 Frame N+1 Frame N
Frame N+1 [-----] [-----]
Make Every Access Count
Think about “cacheability” of data
De-interleave vertex data
Think about representation
Do you really need a FP32/component for a texture coordinates accessing a 512x512 texture?
X Y Z W RGB TexCord Vertex =
Vert 0 Vert 1
Vert 2 Vert 3
Cache line 1 =
Cache line 2 =
Vert 0
(XYZW) (R
GB
) Cache line 1 =
Cache line 2 =
Vert 1
(XYZW)
Vert 2
(XYZW)
Vert 3
(XYZW)
Tex
Cord Cache line 3 =
(RG
B)
(RG
B)
(RG
B)
(RG
B)
(RG
B)
(RG
B)
(RG
B)
(RG
B)
(RG
B)
(RG
B)
(RG
B)
(RG
B)
(RG
B)
(RG
B)
(RG
B)
Tex
Cord
Tex
Cord
Tex
Cord
Tex
Cord
Tex
Cord
Tex
Cord
Tex
Cord
Compress, Compress, Compress!
25
35
45
55
8 5.12 3.56 2 1.28 0.89
PS
NR
(d
B)
Compression Rate (bpp)
ASTC = Adaptive Scalable Texture Compression
New texture compression standard developed by ARM, adopted by Khronos
KHR_texture_compression_astc_ldr for OpenGL ES and Open GL
Increased quality and fidelity at low bit-rates
Expansive range of input formats offers complete flexibility
Choice of base format, 2D and 3D plus addition of HDR formats
Compression in the Pre-ASTC World
L
LA
X+Y
HDR L
RGB
XY+Z
RBGA
RGB+A
HDR X+Y
HDR RGB
HDR XY+Z
HDR RGBA
HDR RGB+A
1 2 3 4 5 6 7 8 Compressed bits/pixel
Input
Colo
r Fo
rmat
s
8
16
16
16
24
24
32
32
32
48
48
64
64
Input
bits/
pix
el
ETC, BC5
ETC, BC4
BC7
All Major Players
PVRTC PVRTC
ETC, BC1
BC6
ETC, BC2
BC3, BC7
ASTC Choices
L
LA
X+Y
HDR L
RGB
XY+Z
RBGA
RGB+A
HDR X+Y
HDR RGB
HDR XY+Z
HDR RGBA
HDR RGB+A
1 2 3 4 5 6 7 8 Compressed bits/pixel
Input
Colo
r Fo
rmat
s
16
24
24
32
32
32
48
48
64
64
Input
bits/
pix
el
All ASTC
8
16
16
Look it up or Calculate it? You would be surprised what you can get done in a cycle…
precision mediump float;
varying vec4 detailtc_envtc, bumptrans;
uniform sampler2D dettex, envtex, colormap;
uniform float color_param, bumpstrength;
void main()
{
vec4 bt = bumptrans;
vec2 bt_crossmul = bt.xy * bt.wz;
float diffuse = max(0.0, bt_crossmul.x-bt_crossmul.y);
vec4 bump_cr = texture2D(dettex,detailtc_envtc.xy);
vec4 tbump = bt * bumpstrength * bump_cr.xyxy;
vec2 envtc = tbump.xy + tbump.zw + detailtc_envtc.zw;
vec4 col = texture2D(colormap, vec2(bump_cr.z, color_param));
vec4 env = texture2D(envtex,envtc);
gl_FragColor = col * diffuse + env * bump_cr.w;
}
Shader Features:
•Paletted color mapping
•Environment mapping
•Bump mapping
•Variable reflectance mapping
•Diffuse falloff of the texture color
•Adjustable bump map strength
•Adjustable color table
Mali-T600 series = 3 Cycles
GDC 2012 Demo: Timbuktu
Asset Conditioning
Cross platform - desktop & mobile
Desktop build - caching
Mobile build - loads caches
Asset pipeline - utility functions
Batching Deferred immediate mode rendering
glDrawElements and glDrawArrays have an overhead
Less draw calls, less overhead.
DrawCall class stitches multiple objects into one draw
Macro functions in shaders make batching as simple as:
vec4 pos=transform[getInstance()]*getPosition();
Batching
Batching
uniform mat4 transforms[4];
Object Instancing Multiple geometries, or single object instances:
for(int i=0;i<50;i++)
drawbuilder.addGeometry(geo1);
drawbuilder.Build()
Can implement LOD switching, when objects are sorted front to
back and correctly culled.
Seen in TrueForce
Special Effects
Special Effects: Bloom
Special Effects: Bloom
When considering uses of greater colour resolution the first
thought was HDR and Bloom.
But how to do bloom without the HDR images?
Special Effects: Bloom
Special Effects: Bloom
Render to low res FBO mapped to texture
Value filter and blur in 1st
post- processing pass, onto
second FBO texture
Sample vertical blur in
second pass then apply to
full resolution frame buffer
Special Effects: Depth of Field
Special Effects: Depth of Field
16bit depth buffers as textures opened the possibility of a
variable blur for depth of field
But how to do it without 16bit textures?
Special Effects: Depth of Field
Special Effects: Depth of Field
Additive
Mix
Bloom
Special Effects: Terrain Mapping
Special Effects: Terrain Mapping
Uniform Buffers and Vertex IDs can be used to implement
tessellated mesh subdivision
But how can this be approximated without the buffers or IDs?
Special Effects: Terrain Mapping
Special Effects: Terrain Mapping
SIGGRAPH 2012 Demo: Timbuktu 2
Timbuktu 2: Extended features
OpenGL® ES 3.0!
3D textures
Shadow comparison
16 bit depth textures
HDR lighting
3D Textures
Give more definition to deformed track
3D Textures mipmap in all 3 dimensions
Instead used 2D Texture arrays
3D Textures
floor(z)
fract(z)
ceil(z)
texture2DArray
mix(t1, t2, frac)
Shadow Mapping
Depth rendered to FBO from viewpoint of light
Projected onto scene to compare to distance
Doing this in the shader yields some interesting results
Shadow Comparison Texture
Compare
Interpolate
Compare
Compare
Compare
Compare
Interpolate
OpenGL ES 2.0 Texture Compare:
OpenGL ES 3.0 Shadow Texture Compare:
16 Bit Depth Buffers Also used for:
Soft Particles
Better fidelity of DOF
Particle Lighting Displacement Mapping:
Texture offset to X and Y coords
Distortion strength increases over time
HDR Lighting RGBA 10 10 10 2 format used
Everything gets normalised
Bright spots need to stand out
Make everything else darker!
Set exposure in post processing
Main conference ARM Sponsored Sessions
Thursday, March 28th
10:00 – 11:00am
Room 3022
Optimized Effects for Mobile Devices Stacy Smith, Senior Software Engineer, ARM
Ed Plowman, Director of Performance Analysis, ARM
1:00 – 2:00pm
Room 3016
The Future of Mobile Gaming
PANEL SESSION Moderator: Jason Della Rocca, Co-founder of Execution Labs
Baudouin Corman, Vice President of Publishing, Americas for Gameloft
David Helgason, CEO, Unity
Dr Chris Doran, Founder & COO, Geomerics
Niccolo De Masi, CEO, Glu
Jasper Smith, CEO, PlayJam
Nizar Romdhane, Director of Ecosystem, ARM
ARM #1124 in-booth Educational Theater
Over 30 talks from ARM and partners such as EpicGames, Havok, PlayJam,
Geomerics, Softkinetics, Metaio, Marmalade
20 minute length talks with Q&A at the end
An Android tablet prize draw at each session
Summary and videos of all Educational Theater Talks at
http://malideveloper.arm.com/gdc2013
malideveloper.arm.com
Thank you
Any questions?