Sample Tweaker

Ocean Fog

Overview

This paper discusses how we optimized an existing graphics demo, Ocean Fog, for our latest processors with Intel® Integrated Graphics. We achieved a 4x boost in performance (40 FPS to 160 FPS) with little to no fidelity loss by applying techniques such as reducing texture sizes and lowering precision. These optimization techniques are not revolutionary by any means, but knowing when to apply them can be more involved. To help us identify where we might be able to optimize, we used Intel’s graphics profiler, Intel® Graphics Performance Analyzers (Intel® GPA for short).

We will use screenshots of Intel GPA to show how we identified a graphics bottleneck and then detail how we tried to optimize or fix those problem areas. Understanding the architecture that you are optimizing for can really help you decide how to fix problem areas. Intel GPA allows you to run different tests against problem areas to help identify the problems and possible fixes without an intimate knowledge of the architecture. In this paper, you will see that our tests are labeled as 2x2 textures or simple pixel shader. Those tests are built into Intel GPA; you do not have to modify the existing application yourself to run them.

The purpose of the original Ocean Fog project was to investigate how to effectively render a realistic ocean scene on differing graphics solutions while trying to provide a good, current, working class set of data to the graphics community. The ocean was rendered using a projected grid that is displayed orthogonally to the viewer. The vertices of the grid are displaced using a height field. Perlin noise was used to generate wave motion. In the original paper, the author notes that computing Perlin noise was less CPU-intensive than other methods. However, other methods such as Navier-Stokes work better on the GPU side, and the author mentions that they are worth further investigation. Snell’s law was used for reflection and refraction. For more information on how the water was rendered, please see Claes Johanson’s Master’s thesis, Real-time water rendering - Introducing the projected grid concept.

The fog was also generated using Perlin noise. The processing for the fog was also done on the CPU side. This was done

by sampling points in the 3D texture space.

There are two lights in the scene: one infinite (directional) light, and one spotlight casting from the lighthouse.

For further discussion please see: Ocean Fog using Direct3D 10.

Optimization Summary

The original application was running at 40 FPS on our test hardware¹; after all optimizations, it was running 4x faster at 160 FPS. CPU utilization went from 8% to 84%, GPU active time from 32% to 85%, and GPU stall time from 53% to 9%.

¹ We used 3 systems. First, an Intel® microarchitecture codename Sandy Bridge processor-based platform with a 2.4 GHz processor running 64-bit Microsoft Windows* 7, 4 GB of memory, and an 80 GB solid-state disk. Second, an Intel® Core™ i5 640 processor-based platform with a 3.2 GHz processor running 32-bit Microsoft Windows Vista*, 2 GB of memory, and a Seagate* 7200 RPM 500 GB disk. Third, an Intel® Core™ 2 Duo T7700 processor-based system with a 2.4 GHz processor running 64-bit Microsoft Windows* 7, 4 GB of memory, an 80 GB solid-state disk, and an NVIDIA Quadro* FX-570 graphics card.

The largest slowdown in the application was the generation and shading of the water. Optimizations included normal map size and depth reduction, reflection and refraction size and depth reduction, code and resource cleanup, and rendering fixes.

Output

Figure 1 - Ocean Fog

Results

Final Results (msec/frame)

Build      Sandy Bridge  Intel® Core™ i5-661 with Intel HD Graphics  Intel® Core™ 2 Duo T7700 with an NVIDIA Quadro* FX 570M card
Original   25            50                                          20
Optimized  6.289         10.870                                      6.452

Intel® microarchitecture codename Sandy Bridge showed 4x improvement over original.
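The table reports frame time, while the rest of the paper quotes frames per second. The following is a minimal Python sketch of the conversion, using only the numbers from the table above. The function and dictionary names are illustrative and are not part of the Ocean Fog sample or of Intel GPA.

```python
# Frame times (msec/frame) copied from the Final Results table.
# Keys and helper names are illustrative, not from the sample's source code.
RESULTS_MS = {
    "Sandy Bridge": (25.0, 6.289),
    "Core i5-661 with Intel HD Graphics": (50.0, 10.870),
    "Core 2 Duo T7700 with Quadro FX 570M": (20.0, 6.452),
}


def fps(ms_per_frame):
    """Convert a frame time in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame


def speedup(original_ms, optimized_ms):
    """Ratio of original to optimized frame time."""
    return original_ms / optimized_ms


if __name__ == "__main__":
    for system, (orig_ms, opt_ms) in RESULTS_MS.items():
        print(f"{system}: {fps(orig_ms):.0f} FPS -> {fps(opt_ms):.0f} FPS "
              f"({speedup(orig_ms, opt_ms):.2f}x)")
```

For the Sandy Bridge column this gives 40 FPS before and roughly 159 FPS after, which matches the 40 FPS to 160 FPS figure quoted in the summary.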

Below are side-by-side screenshots of the Intel® GPA System Analyzer application before and after optimizations. The

application has several line charts that show the activity levels of the CPU and GPU. The first line chart (at the top)

shows the frame rate. The second chart down shows the CPU utilization by processor. The third chart down shows the

GPU EU active % time, which is roughly the utilization of the GPU. The fourth chart down is the GPU % busy time, which

is the GPU active time plus the GPU stalled time. The fifth chart down is the GPU % stalled time. Stalls are a general

bucket that could refer to many conditions. For example, you could be stalled on a texture fetch from the sampler, or a

vertex fetch from main memory.
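The relationship between the GPU charts can be written down directly. The sketch below applies it to the before and after percentages quoted in the Optimization Summary. The "stalled share" value is only an illustrative derived number, not a counter reported by Intel GPA, and the function names are assumptions made for this example.

```python
def gpu_busy(active_pct, stalled_pct):
    """GPU busy time as described above: active time plus stalled time."""
    return active_pct + stalled_pct


def stalled_share(active_pct, stalled_pct):
    """Fraction of busy time spent stalled (illustrative, not a GPA counter)."""
    return stalled_pct / gpu_busy(active_pct, stalled_pct)


if __name__ == "__main__":
    # Percentages quoted in the Optimization Summary.
    profiles = {"original": (32.0, 53.0), "optimized": (85.0, 9.0)}
    for label, (active, stalled) in profiles.items():
        print(f"{label}: busy {gpu_busy(active, stalled):.0f}%, "
              f"stalled share of busy time {stalled_share(active, stalled):.0%}")
```

Seen this way, the original build spent well over half of its busy time waiting, while the optimized build spends most of its busy time doing work.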

As you can see here, the application is stalled on the GPU a good percentage of the time and is hardly utilizing the CPU or GPU. That combination is typically a good sign that there is room for optimization.

Original vs. Optimized

Original – No Overrides Optimized – No Overrides

Significant increase in GPU percent active time and significant decrease in GPU percent stalled time.

Below are side-by-side screenshots of the Intel® GPA Frame Analyzer tool. The tool shows the performance of each individual draw call (called an erg in the tool) and lets you try experiments to see how you might improve that particular draw call. Ergs that take longer are represented by taller bars in the bar chart, so those are generally a good place to start. It is hard to tell, but if you look at the Y-scale numbers of the bar chart, you will notice that the biggest erg went from over 12,000 microseconds to around 2,500 microseconds.

Original vs. Optimized

Original Optimized

The two largest ergs went from 62.5% of scene time to 35.7% of the scene time.

Next is another side-by-side screenshot of Intel GPA System Analyzer. This time we compare the original profile against 2x2 textures, an override in the tool that replaces every texture used by the GPU with a simple 2x2 texture. This is a good way to find out whether you are stalled because of the size of your textures. You’ll notice below that the GPU stalled time (bottom chart) goes from around 57% down to 17% with 2x2 textures. This tells us that our textures are too big for our sampler cache to hold efficiently.

Original No Overrides 2x2 Textures

Note: The Core i5-661 processor-based system with Intel HD Graphics showed similar results

Performance Analysis

Overview

Intel GPA System Analyzer showed nearly 60% stall time initially. Using the 2x2 textures override showed the greatest reduction of GPU stall time, to about 17%.

Normal Map Generation

The pixel shader that calculates the water normal map (1024x2048, RGBA 32 bit) takes substantially more time than the other ergs.

Optimizations

Scaling water normal map

The water normal map took about 32 MB, and Intel GPA Frame Analyzer showed that generating it was the most costly erg. Remember the tall yellow bar that went from 12,000 microseconds to 2,500 microseconds in the side-by-side screenshot of Intel GPA Frame Analyzer? This is what it actually represents in the scene. We experimented by changing the size of the normal map to see how it affected performance and visual fidelity. Our goal throughout optimization was to preserve visual fidelity as much as possible. In the table below, you will notice that reducing the size of the normal map greatly reduced the time of that erg. Looking at the 3 side-by-side screenshots of the water, you will notice some falloff in visual fidelity based on the reduction in size.

Normal Map

Build      Resolution  msec/frame
Original   1024x2048   25.000
Optimized  1024x2048   23.810
Optimized  512x1024    11.111
Optimized  256x512     7.407
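To see how frame time scales with normal map size, the sketch below compares each optimized build with the next smaller one, using only the numbers in the table above. The names are illustrative. Each step cuts the pixel count by 4x, but the whole-frame time shrinks by less, which is expected because the rest of the scene still has to be rendered.

```python
# (resolution, pixel count, msec/frame) for the optimized builds in the table.
BUILDS = [
    ("1024x2048", 1024 * 2048, 23.810),
    ("512x1024", 512 * 1024, 11.111),
    ("256x512", 256 * 512, 7.407),
]


def step_ratios(rows):
    """For each step down in size, return (name, pixel ratio, frame time ratio)."""
    ratios = []
    for (_, px_big, ms_big), (name, px_small, ms_small) in zip(rows, rows[1:]):
        ratios.append((name, px_big / px_small, ms_big / ms_small))
    return ratios


if __name__ == "__main__":
    for name, px_ratio, ms_ratio in step_ratios(BUILDS):
        print(f"down to {name}: {px_ratio:.0f}x fewer pixels, "
              f"{ms_ratio:.2f}x shorter frame time")
```

The first halving of each dimension buys more than the second, so the returns diminish as the normal map stops being the dominant cost.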

Figure 2 - Normal map sizes: 1024, 512, 256

In conjunction with other water noise settings (falloff, scale), we were able to get the quality to look almost as good as

the original 1024x2048 normal map without the frame rate penalty while maintaining the crisp water effect.

Figure 3 - Normal map: 1024 vs. 256 Settings Tweaked

Water Shading

The Intel GPA Frame Analyzer side-by-side screenshot below shows the before and after of applying the 2x2 textures experiment within the tool. This erg is definitely texture-bound.

2x2 Textures Experiment

Before After

After 2x2 Textures experiment, there was 32.2% less GPU time

Intel GPA Frame Analyzer’s 2x2 Textures experiment showed a 32.2% improvement in GPU time. We also inspected the textures in the pipeline: the reflection, refraction, and normal maps were 32-bit textures. The optimizations here are therefore to reduce the size and depth of those textures without losing much fidelity.

Figure 4 - Graphics Pipeline showing 32-bit textures

Water.fx sampled textures in pixel shader, Original Build

Type         Description  Resolution       Depth
TextureCube  Environment  1024 x 1024 x 6  8 bit
Texture3D    Fog          50 x 50 x 50     8 bit
Texture2D    Fresnel      256 x 1          8 bit
Texture2D    Normal       1024 x 2048      32 bit
Texture2D    Refraction   512 x 512        32 bit
Texture2D    Reflection   512 x 512        32 bit
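The memory footprint of these textures can be estimated from the table. The sketch below assumes that "Depth" means bits per channel and that each texture has four channels (RGBA). The paper does not state this outright, but the assumption reproduces two figures it does give: about 32 MB for the normal map and 24 MB for the skybox cubemap. The fog and Fresnel textures are left out because their channel count is not given. All names are illustrative.

```python
MIB = 1024 * 1024


def texture_bytes(width, height, bits_per_channel, channels=4, layers=1):
    """Uncompressed size of the top MIP level, in bytes.

    Assumes `channels` channels of `bits_per_channel` bits each (an assumption
    about how the table's Depth column should be read).
    """
    return width * height * layers * channels * bits_per_channel // 8


# Entries taken from the table above; cube maps have six faces.
TEXTURES = {
    "Environment": texture_bytes(1024, 1024, 8, layers=6),
    "Normal": texture_bytes(1024, 2048, 32),
    "Refraction": texture_bytes(512, 512, 32),
    "Reflection": texture_bytes(512, 512, 32),
}

if __name__ == "__main__":
    for name, size in TEXTURES.items():
        print(f"{name}: {size / MIB:.1f} MiB")
    # Same normal map at 16 bits per channel, and a 256x256 16-bit reflection map.
    print(f"Normal, 16 bit: {texture_bytes(1024, 2048, 16) / MIB:.1f} MiB")
    print(f"Reflection, 256x256 16 bit: {texture_bytes(256, 256, 16) / MIB:.1f} MiB")
```

Under these assumptions, halving the bit depth halves the footprint and halving both dimensions quarters it, which is why size and depth reduction are the natural experiments for a texture-bound erg.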

Optimizations

Reflection, refraction, and normal maps were changed from 32-bit to 16-bit textures, and the effect on the frame rate was noted. Reducing the dimensions of the reflection and refraction maps showed the greatest improvement, followed by using 16-bit depth.

Texture depth changes (msec/frame)

Format       Reflection/Refraction Map  Normal Map
RGBA 32 bit  16.667                     16.529
RGBA 16 bit  14.388                     15.038

Reflection/Refraction Map Dimension change (msec/frame)

Configuration         Reflection/Refraction Map
512x512, RGBA 32 bit  16.667
256x256, RGBA 32 bit  13.072
256x256, RGBA 16 bit  11.111

Combining the map reduction with 16-bit depth provided a 1.5x improvement (16.667 to 11.111 msec/frame).

The depth change from 32-bit to 16-bit produced a slightly grainier normal map. Reducing the reflection and refraction maps to 256x256 produced a pixelated reflection/refraction map only when there was zero wave amplitude, and thus no water distortion. With any wave amplitude or water distortion, the pixelation could not be seen, and with the addition of fog, the difference in image fidelity was no longer visible.

Figure 5 - Reflection/Refraction Map - 32 bit vs. 16 bit

Figure 6 - Reflection/Refraction Map - 256x256 vs. 512x512

Next, the skybox (1024x1024x6) was replaced with a smaller version (256x256x6), and because of the gradient and

unfocused nature of the texture, there was no change in fidelity. There was about a 3% FPS increase with all other

objects turned off in the scene.

Figure 7 - 1024x1024 vs 256x256 Cubemap

Miscellaneous Optimizations

Removing unnecessary clear calls

Clearing of the reflection and refraction render targets was disabled when they were unchecked in the GUI. This boosted the frame rate from 52 FPS (water render only) to about 73 FPS. When everything else in the scene was rendered, the frame rate dropped when reflection was disabled; however, this behavior was also observed with the original build. While the maps are in use, clearing must be done every frame because the reflection and refraction maps must change as the camera moves.

MIP Generation

Generating the additional MIP levels for the normal map did not show a significant change in FPS. We thought it might help, so we ran the experiment anyway.

MIP generation (msec/frame)

                  Normal Map Size
MIPs              1024x2048, 32 bit  1024x2048, 16 bit  256x512, 16 bit
One               37.736             18.519             7.168
All eight levels  37.736             18.018             7.220
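For context on what the extra levels cost in memory, the sketch below sums the size of a MIP chain in which each level halves both dimensions. It uses the eight levels named in the table and assumes 16 bytes per pixel for the 32-bit RGBA normal map (four channels of 32 bits each), which is an assumption and not something the paper states directly. The function name is illustrative.

```python
def mip_chain_bytes(width, height, bytes_per_pixel, levels):
    """Total size of `levels` MIP levels, halving each dimension per level."""
    total = 0
    for level in range(levels):
        w = max(1, width >> level)   # dimensions never drop below one pixel
        h = max(1, height >> level)
        total += w * h * bytes_per_pixel
    return total


if __name__ == "__main__":
    base = mip_chain_bytes(1024, 2048, 16, levels=1)
    chain = mip_chain_bytes(1024, 2048, 16, levels=8)
    print(f"top level only: {base / 2**20:.1f} MiB")
    print(f"eight levels:   {chain / 2**20:.1f} MiB ({chain / base:.3f}x)")
```

The chain adds roughly one third on top of the base level, so generating MIPs is a modest memory cost in exchange for smaller levels being sampled when the water is far from the camera.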

Offloading GPU work

One possible optimization would be moving more work from the GPU to the CPU, because the CPU is not being fully utilized. The two largest shaders, which compute the height and normal maps, were disabled, which showed about a 14% increase in frame rate. One possible implementation would be to pass normal and height map information, generated on the CPU, along with the rest of the vertex data. This could also greatly reduce frame time; it would be an interesting experiment, but we did not have time to try it.

Disabling Shader work (msec/frame)

             Normal Map Size
             256x512  1024x2048
GPU Work     11.050   21.978
No GPU Work  9.709    10.582

Summary

We achieved a 4x performance improvement in this application by using Intel GPA to help us identify the GPU bottlenecks and possible solutions. This application was stalled mainly on textures, so reducing their size and precision gave us a substantial performance gain. In doing so, we lost some minor visual fidelity, and in some cases we mitigated that loss by varying other simulation parameters. The optimizations that we made should be considered on a case-by-case basis, because sometimes you might need that extra precision or fidelity to convey to the user what is visually important in your application or game. The techniques and tools we used can be applied to any graphics application to troubleshoot performance problems, so you should consider them on your next optimization adventure.

About the Author

Jeff Laflam is a software engineer in the Intel Software and Services Group, where he supports Intel graphics solutions in the Visual Computing Software Division.

Optimization Notice

Refer to our Optimization Notice for more information regarding performance and optimization choices in Intel software products.

Appendix

General Procedure

Various application overrides were created, such as disabling the water, island, lighthouse spot light, clouds, and sky (Fig. 12). Disabling the water showed the greatest improvement in frame rate. The application overrides were used in conjunction with Intel GPA. Slowdowns were narrowed down further through Intel GPA System Analyzer, which showed that the 2x2 textures override gave the greatest frame rate improvement. Finally, Intel GPA Frame Analyzer was used to pinpoint the exact ergs in which the slowdowns occurred. Intel GPA Frame Analyzer also provided the 2x2 textures experiment and texture information for the pipeline, which allowed better measurement of what was being used in the erg.

Figure 8 - Application Overrides

Final Results – Miscellaneous

Original 2x2 Textures vs. Optimized

Original – 2x2 Textures Optimized - No Overrides

Optimized version shows better results than the original’s 2x2 textures override.

Textures

There are 27 textures (29.5 MB) in the textures folder. The skybox, “cubemap-NEWER.dds”, is 24 MB of the total. Additional textures are procedurally generated at startup. The first is the fog texture, which is a 50x50x50 8-bit texture. The application takes up about 280-310 MB of memory before starting (before optimizations). This may come from file reads, texture generation, and various inefficiencies. Other textures need to be adjusted for proper use, as some are too large or too small for their surface.