35
Intel, the Intel logo, Intel® Xeon Phi™, Intel® Xeon® Processor are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. See Trademarks on intel.com for full list of Intel trademarks. Local Shading Coherence Extraction for SIMD-Efficient Path Tracing on CPUs Attila Áfra, Carsten Benthin, Ingo Wald, Jacob Munkberg Intel Corporation

Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

Intel, the Intel logo, Intel® Xeon Phi™, Intel® Xeon® Processor are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. See Trademarks on intel.com for full list of Intel trademarks.

Local Shading Coherence Extraction for SIMD-Efficient Path Tracing on CPUs Attila Áfra, Carsten Benthin, Ingo Wald, Jacob Munkberg

Intel Corporation

Page 2: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Legal Disclaimer and Optimization Notice

2

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © 2016, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Page 3: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Path tracing

Standard method for production rendering

Main steps:

Ray traversal

Shading

Usually more than half of the rendering time

3

Page 4: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Path tracing

4

Single-ray tracing

SIMD single-ray traversal

Scalar shading

Packet tracing

SIMD single-ray traversal (or packet traversal)

SoA-based SIMD packet shading

Page 5: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Path tracing

Random rays incoherent traversal and shading

Low utilization of vector units

The wide (8-32) vector units of modern CPUs and GPUs are wasted

Incoherent memory accesses

5

Page 6: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Incoherent shading

6

SIMD divergence

Many different shaders are evaluated within a SIMD batch

Low SIMD utilization

Incoherent texture access

Non-cached reads form memory, disk, or network

0 1 2 3 4 5 6 7

0 4

1 2 6

3 7

5

Page 7: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Coherence extraction

7

Performance can be improved by extracting coherence

Find batches of similar rays and process them together

Most previous research focused on traversal

Stream shading

Trace streams of rays and sort them on various criteria (e.g., material)

Previous methods operate on a single large, global stream (millions of rays):

Wavefront path tracing on GPUs [Laine et al. 2013]

Sorted deferred shading for production path tracing [Eisenacher et al. 2013]

Page 8: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Our algorithm

8

Traces and sorts small local streams independently on each CPU thread

2K-8K rays per stream

Enables efficient SIMD shading with low overhead

Why local?

Has much lower overhead than global!

Cache-friendly: the streams fit into the CPU’s last-level cache (LLC)

Avoids expensive cross-core communication

Very fast (and simple) ray sorting

Sufficient for high (> 90%) SIMD utilization

Page 9: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Path tracing integrator

9

Unidirectional path tracer with next event estimation

Cast a ray from the camera

Evaluate the material at the hit point

Material ID

Material shader which constructs a BSDF

Cast a shadow ray toward a light source

Cast an extension ray and repeat

Page 10: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Stream tracing

10

Two ray streams:

Extension ray stream

Shadow ray stream

SoA memory layout

SIMD-friendly

Compact

No gaps (inactive rays)

Algorithm consists of stages

Each stage involves a stream iteration

Sorting

Ray generation

Ray intersection

Accumulation

Shadow ray intersection

Material evaluation

Page 11: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Stream tracing

11

Ray generation

Generate primary rays from an image tile

e.g., 16x16 pixels, 8 samples per pixel

Sorting

Ray generation

Ray intersection

Accumulation

Shadow ray intersection

Material evaluation

Page 12: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Stream tracing

12

Ray generation

Generate primary rays from an image tile

e.g., 16x16 pixels, 8 samples per pixel

Ray intersection

Intersect all extension rays in the stream

Single-ray traversal

Stream traversal

DRST [Barringer & Akenine-Möller 2014]

ORST [Fuetterling et al. 2015]

Sorting

Ray generation

Ray intersection

Accumulation

Shadow ray intersection

Material evaluation

Page 13: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Stream tracing

13

Sorting

Sort ray IDs by material ID

Counting sort fast!

Sorting

Ray generation

Ray intersection

Accumulation

Shadow ray intersection

Material evaluation

Page 14: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Stream tracing

14

Sorting

Sort ray IDs by material ID

Counting sort fast!

Material evaluation

Iterate over the sorted ray IDs

Execute shaders for coherent SIMD batches

Generate extension and shadow rays

Append to new streams using pack-stores

Filter out terminated paths

Double buffering

Sorting

Ray generation

Ray intersection

Accumulation

Shadow ray intersection

Material evaluation

Page 15: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Stream tracing

15

Shadow ray intersection

Test all shadow rays for occlusion

Sorting

Ray generation

Ray intersection

Accumulation

Shadow ray intersection

Material evaluation

Page 16: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Stream tracing

16

Shadow ray intersection

Test all shadow rays for occlusion

Accumulation

For unoccluded shadow rays, add direct light

For terminated paths, accumulate to image

Sorting

Ray generation

Ray intersection

Accumulation

Shadow ray intersection

Material evaluation

Page 17: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Stream tracing

17

Shadow ray intersection

Test all shadow rays for occlusion

Accumulation

For unoccluded shadow rays, add direct light

For terminated paths, accumulate to image

Path regeneration (optional)

Append new primary rays to the stream

Replace terminated paths

Sorting

Ray generation

Ray intersection

Accumulation

Shadow ray intersection

Material evaluation

Page 18: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

SIMD stream shading example

18

Sorted ray ID array:

Input array:

Output array: 0 4 11

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0 4 11 1 2 6 9 15 5 10 12 13 3 7 8 14 D

ou

ble

bu

ffers

Page 19: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

SIMD stream shading example

19

Input array:

Output array: 0 4 11

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0 4 11 1 2 6 9 15 5 10 12 13 3 7 8 14 D

ou

ble

bu

ffers

Sorted ray ID array:

Page 20: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

SIMD stream shading example

20

Input array:

Output array: 0 4 11

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0 4 11 1 2 6 9 15 5 10 12 13 3 7 8 14 D

ou

ble

bu

ffers

Sorted ray ID array:

Page 21: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

SIMD stream shading example

21

Input array:

SIMD register:

Output array: 0 4 11

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0 4 11 1 2 6 9 15 5 10 12 13 3 7 8 14 D

ou

ble

bu

ffers

Sorted ray ID array:

Page 22: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

SIMD stream shading example

22

Input array:

SIMD register:

Output array: 0 4 11

1 2 6 9

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0 4 11 1 2 6 9 15 5 10 12 13 3 7 8 14 D

ou

ble

bu

ffers

Sorted ray ID array:

Gather

Page 23: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

SIMD stream shading example

23

Input array:

SIMD register:

Output array:

SIMD register:

0 4 11

1 2 9

1 2 6 9

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0 4 11 1 2 6 9 15 5 10 12 13 3 7 8 14

Shading

Gather

Do

ub

le b

uffe

rs Sorted ray ID array:

Page 24: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

SIMD stream shading example

24

Input array:

SIMD register:

Output array:

SIMD register:

0 4 11 1 2 9

1 2 9

1 2 6 9

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0 4 11 1 2 6 9 15 5 10 12 13 3 7 8 14

Pack-store

Shading

Gather

Do

ub

le b

uffe

rs Sorted ray ID array:

Page 25: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Results

25

Stream tracing (Our)

Stream size: 2K rays (376 KB/thread)

Single-ray tracing w/ scalar shading

Packet tracing w/ SIMD shading

Same SIMD single-ray traversal kernel

8-wide SIMD, AVX2 instruction set

Hardware: dual-socket Xeon E5-2699 v3

36 cores, 72 threads, 90 MB LLC (30% used for streams)

Page 26: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Test scenes

26

Art Deco / 111 materials Mazda / 76 materials Villa / 97 materials

Conference / 36 materials complex procedural shaders

Dragon / 5 materials simple shaders

Page 27: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Path tracing performance (Mray/s)

27

0

20

40

60

80

100

120

Art Deco Mazda Villa Conference Dragon

Mra

y/s

Single

Packet

Our

Page 28: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Path tracing performance (Mray/s)

28

0

20

40

60

80

100

120

Art Deco Mazda Villa Conference Dragon

Mra

y/s

Single

Packet

Our

3× speedup

Page 29: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

SIMD utilization for shading (%)

29

0

10

20

30

40

50

60

70

80

90

100

Art Deco Mazda Villa Conference Dragon

SIM

D u

tili

zati

on

(%)

Single

Packet

Our

Page 30: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Rendering time breakdown

30

0

10

20

30

40

50

60

70

80

90

100

Packet Our Packet Our Packet Our Packet Our Packet Our

Art Deco Mazda Villa Conference Dragon

No

rma

lize

d t

ime

Unaccounted

Sorting

Shading

Traversal

Page 31: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

0

10

20

30

40

50

60

70

80

90

100

8 32 128 512 2048 8192 32768

SIM

D u

tili

zati

on

(%)

Stream size

Art Deco

Mazda

Villa

Conference

Dragon

SIMD utilization vs. stream size

31

Page 32: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Conclusion

32

Achieves much higher SIMD utilization than single-ray and packet shading

Reduces shading time by 2-3× for complex scenes with hundreds of shaders

Could perform even better with production-quality shaders

Scales well to hundreds of CPU cores, wider SIMD (16), and bigger caches

Future work:

Additional sorting steps (e.g., textures)

Bidirectional path tracing

Page 33: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

Questions?

33

Page 34: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream
Page 35: Local Shading Coherence Extraction for SIMD …© 2016 Intel Corporation Our algorithm 8 Traces and sorts small local streams independently on each CPU thread 2K-8K rays per stream

© 2016 Intel Corporation

SIMD utilization vs. number of materials

35

0

10

20

30

40

50

60

70

80

90

100

2 4 8 16 32 64 128 256 512 1024

SIM

D u

tili

zati

on

(%)

Number of materials

Packet

Stream 2K

Stream 8K

Stream 32K