Vertex Culling illustration at SBR07

SBR07

Faster ray packets through

vertex culling

Illustration and implementation by

Syoyo Fujita


Faster ray packets through

vertex culling

Alexander Reshetov

IEEE/EG Symposium on

Interactive Ray Tracing 2007

Faster Ray Packets - Triangle Intersection through Vertex Culling

Alexander Reshetov

Intel Corporation

ABSTRACT

Acceleration structures are used in ray tracing to sharply reduce the number of ray-triangle intersection tests at the expense of traversing such structures. Bigger structures eliminate more tests, but their traversal becomes less efficient, especially for ray packets, for which the number of inactive rays increases at the lower levels of the acceleration structures. For dynamic scenes, building or updating acceleration structures is one of the major performance impediments.

We propose a new way to reduce the total number of tests by creating a special transient frustum every time a leaf is traversed by a packet of rays. This frustum contains intersections of active rays with a leaf node and eliminates over 90% of all potential tests. It allows a tenfold reduction in the size of the acceleration structure whilst still achieving better performance.

Keywords: computational complexity, ray tracing, ray-triangle intersection.

1 INTRODUCTION

Finding the intersection of a ray and a triangle is equivalent to solving the linear system of three equations

$o + t\,d = p_0 + u\,(p_1 - p_0) + v\,(p_2 - p_0)$    (1)

with five additional requirements

$0 \le t \le t_{\mathrm{prev}}$    (2)

$0 \le u, \quad 0 \le v, \quad u + v \le 1$    (3)

The left part of the system (1) defines a ray with the origin $o$ and the direction $d$; the right part, a point inside the triangle with vertices $p_0$, $p_1$, and $p_2$. The unknown variables are: $t$, the distance to the intersection point from the ray's origin, and $u$, $v$, the barycentric coordinates of the point inside the triangle. It is required that the found intersection point be closer to the ray's origin than the previously found one, at distance $t_{\mathrm{prev}}$ (2), and within the triangle's boundaries (3).

One way to solve the system (1) is to use Cramer's rule, which allows finding numerical values directly for all three variables. This results in implementations similar to Möller-Trumbore [13]. On the other hand, the system (1) could also be solved using Gaussian elimination. It is computationally efficient only if the distance $t$ is calculated first (which, of course, could also be computed directly by using the well-known expression for the distance to a plane along a ray). By substituting the found value of $t$ in any two of the remaining equations, the $u$ and $v$ variables could be found. This substitution geometrically corresponds to projecting the triangle and the intersection point to the plane defined by the coordinates of the two chosen equations and leads to algorithms comparable with the one given by Badouel [1]. The SSE implementation for groups of four rays, proposed by Wald [21], is based on the equivalent 2D projection. An example of the SSE-based 3D approach (formulated in terms of Plücker coordinates) can be found in Benthin's Ph.D. thesis [2].

For vector implementations, conditions (2-3) are usually converted to masks used to selectively update the $t$, $u$, and $v$ values. It makes sense to use packets bigger than the intrinsic SIMD width of the targeted architecture, as this amortizes per-triangle computations and saves bandwidth. It also allows decisions to be made summarily for the whole packet without processing individual rays, for example using frustum or interval arithmetic ([5], [7], [11], [22], [23]).

There is no one universal way of solving the system (1) that is suitable for all situations and architectures. In particular, CPU implementations could use conditions (2-3) to exit from the test early if any one of the five conditions is false. For implementations on GPU [14] or Cell [3], branchless algorithms are more efficient.

e-mail: alexander.reshetov@intel.com

Accepted to IEEE/EG Symposium on Interactive Ray Tracing 2007, http://www.uni-ulm.de/rt07/RT07.html, Ulm, Germany, September 10-12, 2007.
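As one concrete reading of the Cramer's rule route for system (1), here is a minimal single-ray sketch in the spirit of Möller-Trumbore. The function name `intersect`, the parameter `t_prev` (distance of the previously found hit) and the tolerance `eps` are illustrative assumptions, not names from the paper. It uses the early-exit style that the text attributes to CPU implementations.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def intersect(o, d, p0, p1, p2, t_prev=float("inf"), eps=1e-12):
    """Solve o + t d = p0 + u (p1 - p0) + v (p2 - p0) with Cramer's rule.

    Returns (t, u, v), or None when the system is singular or when one of
    the five conditions (0 <= t <= t_prev, 0 <= u, 0 <= v, u + v <= 1) fails.
    """
    e1 = sub(p1, p0)
    e2 = sub(p2, p0)
    pvec = cross(d, e2)
    det = dot(e1, pvec)
    if abs(det) < eps:            # ray parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    tvec = sub(o, p0)
    u = dot(tvec, pvec) * inv_det
    if u < 0:                     # early exit, condition (3)
        return None
    qvec = cross(tvec, e1)
    v = dot(d, qvec) * inv_det
    if v < 0 or u + v > 1:        # early exit, condition (3)
        return None
    t = dot(e2, qvec) * inv_det
    if t < 0 or t > t_prev:       # condition (2)
        return None
    return t, u, v


hit = intersect((0.25, 0.25, -1.0), (0.0, 0.0, 1.0),
                (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
print(hit)
```

A branchless variant would compute all three values and combine the five comparisons into a mask instead of returning early.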

For most scenes, the great majority of ray-triangle intersection tests can be eliminated by creating special acceleration structures such as kd-trees, grids, and Bounding Volume Hierarchies (BVH), which exploit the spatial coherency of a scene ([2], [7], [22], [23]). During rendering, the acceleration structure is traversed and all triangles in visited leaf nodes are tested for intersections. For structures which utilize spatial subdivision (kd-trees, grids), it is possible to create an instance of this type for which the total number of tests will be just slightly more than the number of ray segments (one test per ray), but this may not result in the best performance. The optimal size of the acceleration structure depends on how fast it can be traversed compared with the average speed of the ray-triangle intersection test used. For hierarchical acceleration structures, the higher levels of the hierarchy are usually the most effective in reducing the number of potential tests. Going down the spatial hierarchy, nodes become smaller and smaller and some of the rays in the packet may miss them. This negatively affects the utilization of SIMD units [15].

In recent years, the focus of ray tracing research has shifted to dynamic scenes ([8], [12], [17], [19], [22], [23]), for which it is necessary to optimize for the total execution time (build/update time plus rendering time). Frequently, to improve the build time, only axis-aligned bounding boxes (AABB) of triangles are used, even for kd-trees ([8], [17]). For this reason, it is important to analyze the performance of ray-triangle intersection tests in the situation when the intersection could not be eliminated by testing a ray against the AABB of a triangle. This approach was chosen by Kensler and Shirley [10], who analyzed different 3D versions of intersection algorithms and found the best one via genetic optimization of the fitness function.

Even when a ray does intersect the AABB, the probability of it intersecting the triangle is only about 20%. It is important to swiftly reject the remaining 80%. Amongst the approaches analyzed in the literature [4] are: use of interval arithmetic; testing the intersection of a frustum containing rays (typically represented as 4 corner rays) with a triangle [5]; and culling of the AABB of the individual triangle against the frustum. In these approaches, per-packet data structures are computed before the traversal and then used to eliminate unnecessary tests. The culling techniques could also be used for individual rays, as first proposed by Snyder and Barr in their 1987 paper [18], in which a ray was first tested against the box formed by the intersection of the object's bounding box and a visited cell.

A few years ago, ray-tracing researchers were mostly interested in rendering static scenes, for which persistent data structures were created, optimized for all possible camera positions. Eventually, the focus shifted to dynamic scenes, for which data structures are created (or updated) every frame. There are also initial studies on how to create kd-trees lazily, deeply subdividing only those areas of space that are visited during the rendering of a particular frame [9]. In a sense, we pursue this trend to the extreme, where transient data structures are created for every packet and every visited leaf node. This allows very tight structures, optimized for a given packet and cell. Surprisingly, these fleeting structures can be created and handled very efficiently, building on top of the clipping algorithms characteristic of modern packet traversal techniques.

Figure 1. The transient frustum is created every time a ray packet visits a leaf node (solid/dashed lines correspond to active/inactive rays). Only active rays are used to compute the frustum. Left: general rays. Right: primary rays (common origin).


Demo


Key contribution

Culling rate: 90%

Spatial data structure size: 1/10

Ray-triangle intersection cost

1 cycle (on average)

It’s extremely

fast!


Algorithm


What does VC do? (VC = Vertex Culling)

Every time a packet enters a leaf node, do the following:

Create packet

Precompute coeff

for each triangle:
    triangle-packet cull
    triangle-ray intersect, if the cull failed
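The outline above can be sketched as a small driver loop. This is only a control-flow illustration: `process_leaf` and its three callables (`precompute`, `cull`, `intersect`) are hypothetical hooks, not functions from the original or from my implementation.

```python
def process_leaf(triangles, active_rays, precompute, cull, intersect):
    """One pass of the per-leaf outline; the three callables are stand-ins."""
    coeff = precompute(active_rays)      # create packet + precompute coeff
    hits = []
    for tri in triangles:
        if cull(coeff, tri):             # triangle-packet cull succeeded: skip
            continue
        for ray in active_rays:          # cull failed: per-ray intersection
            hit = intersect(ray, tri)
            if hit is not None:
                hits.append((ray, tri, hit))
    return hits


# Toy usage: "triangles" are plain numbers and odd ones count as culled.
demo = process_leaf(
    triangles=[1, 2, 3, 4],
    active_rays=["r0", "r1"],
    precompute=lambda rays: len(rays),
    cull=lambda coeff, tri: tri % 2 == 1,
    intersect=lambda ray, tri: tri * 10 if ray == "r0" else None,
)
print(demo)
```

The point of the structure is that the expensive inner loop over rays only runs for triangles that survive the packet-level cull.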


Create packet from active rays

Active ray

Inactive ray
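A minimal sketch of building the frustum from the active rays only. It assumes x is the dominant axis and that the rays are bounded between two planes x = x_near and x = x_far of the leaf; the function name `frustum_coeffs`, the plane choice and the lane order are my assumptions for illustration, chosen so that the result plugs into the form d = [Vz, -Vy, -Vz, Vy] + Vx * q1 + q0 used later in these slides.

```python
import math


def frustum_coeffs(rays, active, x_near, x_far):
    """Bound the active rays between the planes x = x_near and x = x_far.

    rays   : list of (origin, direction) 3-tuples; direction x must be non-zero
    active : list of bool, one flag per ray (at least one must be True)
    Returns (q0, q1) such that, for a point (x, y, z), the lane-wise value
        d = [z, -y, -z, y] + x * q1 + q0
    is >= 0 in all four lanes when the point lies inside the frustum.
    """
    lo = [[math.inf, math.inf], [math.inf, math.inf]]      # [plane][y, z] minima
    hi = [[-math.inf, -math.inf], [-math.inf, -math.inf]]  # [plane][y, z] maxima
    for (o, d), is_active in zip(rays, active):
        if not is_active:
            continue                  # inactive rays do not widen the frustum
        for k, x_plane in enumerate((x_near, x_far)):
            t = (x_plane - o[0]) / d[0]
            for axis in (1, 2):
                c = o[axis] + t * d[axis]
                lo[k][axis - 1] = min(lo[k][axis - 1], c)
                hi[k][axis - 1] = max(hi[k][axis - 1], c)

    def line(a_near, a_far):
        # Coefficients (s, c) of the bound a(x) = s * x + c through both planes.
        s = (a_far - a_near) / (x_far - x_near)
        return s, a_near - s * x_near

    s_ymin, c_ymin = line(lo[0][0], lo[1][0])
    s_ymax, c_ymax = line(hi[0][0], hi[1][0])
    s_zmin, c_zmin = line(lo[0][1], lo[1][1])
    s_zmax, c_zmax = line(hi[0][1], hi[1][1])
    q1 = [-s_zmin, s_ymax, s_zmax, -s_ymin]
    q0 = [-c_zmin, c_ymax, c_zmax, -c_ymin]
    return q0, q1


origin = (0.0, 0.0, 0.0)
rays = [(origin, (1.0, 0.5, 0.5)), (origin, (1.0, -0.5, 0.5)),
        (origin, (1.0, 0.5, -0.5)), (origin, (1.0, -0.5, -0.5)),
        (origin, (1.0, 5.0, 5.0))]          # the last ray is inactive
q0, q1 = frustum_coeffs(rays, [True, True, True, True, False], 1.0, 2.0)
print(q0, q1)
```

In this toy packet the inactive ray points far away from the others; leaving it out keeps the frustum tight around the four active rays.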

Think about the triangle-packet test

Axis aligned


Frustum plane

v, o, n

(v - o) . n < 0 = outside

v: any of dot(v, n) < 0 = outside of the frustum

v: all of dot(v, n) >= 0 = inside of the frustum

v0, v1, v2, o, n

all of (v . n) < 0 = triangle is outside of the plane

v0, v1, v2

Finally, we get:

for every frustum plane, any of (v . n) >= 0 = frustum and triangle (possibly) intersect
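The plane tests above can be sketched in scalar form as follows. Planes are given as (o, n) pairs with n pointing into the frustum, matching the (v - o) . n < 0 = outside convention; the function names and the toy planes are assumptions for illustration only.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def outside(v, plane):
    """(v - o) . n < 0 means the vertex is outside this plane."""
    o, n = plane
    return dot(sub(v, o), n) < 0


def triangle_culled(tri, planes):
    """Cull when some single plane has all three vertices on its outer side."""
    return any(all(outside(v, p) for v in tri) for p in planes)


# Toy "frustum": the slab 0 <= x <= 1, 0 <= y <= 1 (normals point inward).
planes = [((0, 0, 0), (1, 0, 0)), ((1, 0, 0), (-1, 0, 0)),
          ((0, 0, 0), (0, 1, 0)), ((0, 1, 0), (0, -1, 0))]

far_away = [(2, 0, 0), (3, 0, 0), (2, 1, 0)]
straddling = [(0.5, 0.5, 0), (2, 0.5, 0), (0.5, 2, 0)]
near_corner = [(1.6, 0.5, 0), (0.5, 1.6, 0), (2, 2, 0)]
print(triangle_culled(far_away, planes),
      triangle_culled(straddling, planes),
      triangle_culled(near_corner, planes))
```

The test is conservative: `near_corner` does not touch the slab, yet no single plane separates all three of its vertices, so it is not culled and falls through to the later stages.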

The inside/outside test of 1 vertex against 4 planes can be reduced to

d = [Vz, -Vy, -Vz, Vy] + [Vx, Vx, Vx, Vx] * q1 + q0

q1, q0 = SIMD variables, precomputable when the packet enters the leaf node.
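A scalar emulation of the four-lane formula, to make the lane layout explicit. Plain Python lists stand in for SIMD registers here, and the q0/q1 values are hypothetical coefficients for the frustum |y| <= 0.5 * x, |z| <= 0.5 * x; a real implementation would use one SIMD multiply and two SIMD adds.

```python
def plane_distances(v, q0, q1):
    """d = [Vz, -Vy, -Vz, Vy] + [Vx, Vx, Vx, Vx] * q1 + q0, one lane per plane."""
    vx, vy, vz = v
    swizzled = (vz, -vy, -vz, vy)
    return [s + vx * a + b for s, a, b in zip(swizzled, q1, q0)]


def outside_flags(v, q0, q1):
    """True in lane i when the vertex is outside frustum plane i (d < 0)."""
    return [d < 0 for d in plane_distances(v, q0, q1)]


# Hypothetical coefficients for the frustum |y| <= 0.5 * x, |z| <= 0.5 * x.
q0 = [0.0, 0.0, 0.0, 0.0]
q1 = [0.5, 0.5, 0.5, 0.5]

inside_vertex = (2.0, 0.0, 0.0)
above_vertex = (2.0, 0.0, 3.0)      # z is larger than 0.5 * x
print(outside_flags(inside_vertex, q0, q1))
print(outside_flags(above_vertex, q0, q1))
```

With this layout each plane normal has one component fixed at +1 or -1 and one at zero, which is why a single multiply by Vx is enough.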

Culling cost per vertex:

1 SIMD mul + 2 SIMD add

Very efficient


This is the core of VC


Just apply efficient vertex-frustum

culling for each triangle vertex

Culling cost per triangle per packet = 3 (vtx) * 3 (1 mul + 2 add) + compare

First, test the corner rays against the triangle.

Such a case cannot be culled by VC.

u, v

all(u) < 0 or all(v) < 0 or all(u+v) > 1 = no ray in the packet intersects the triangle.

If all culling tests fail, do the ray-triangle test for each ray in the packet.
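The corner-ray test can be sketched like this. It assumes the barycentric (u, v) pairs of the corner rays on the triangle's plane have already been computed and that the corner rays bound the packet; `packet_misses` is a hypothetical name.

```python
def packet_misses(uv):
    """uv: (u, v) barycentric pairs, one per corner ray of the packet.

    Returns True when all corner rays fail the same triangle condition,
    in which case no ray in the packet is taken to hit the triangle.
    """
    return (all(u < 0 for u, _ in uv)
            or all(v < 0 for _, v in uv)
            or all(u + v > 1 for u, v in uv))


all_left = [(-0.1, 0.2), (-0.3, 0.5), (-0.2, 0.1), (-0.5, 0.9)]
mixed = [(-0.1, 0.2), (0.3, 0.5), (1.2, 0.1), (0.5, -0.9)]
beyond = [(0.8, 0.9), (1.5, 0.2), (0.6, 0.6), (2.0, 2.0)]
print(packet_misses(all_left), packet_misses(mixed), packet_misses(beyond))
```

Only when this returns False does the packet fall back to the per-ray test.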

That's all!

Benefit of VC

Culling rate: 90%

Spatial data structure size: 1/10

Less traversal, more VC tests

More traversal, fewer VC tests

Right has 12x more nodes than left,

but right is just 30% faster than left.


Spatial data structure: Coarse vs. Fine

Coarse: sequential access, more isects, fewer traversals, less memory

Fine: random access, fewer isects, more traversals, more memory

Shown on this axis: BVH, BIH, kd-tree

Spatial data structure: Coarse vs. Fine

Coarse: sequential access, more isects, fewer traversals, less memory

Fine: random access, fewer isects, more traversals, more memory

Shown on this axis: VC

The strongest acceleration

structure on the earth?


Implementation


Here is my implementation of

VC + BVH

Original: C++, for VC

My impl.: C, for gcc

Performance &

Profiling

512x512, 10k tris, SAH-BVH

Compile flags: -O3 -ffast-math -mfpmath=sse -msse2

4.7M rays/sec

@ 2.16 GHz Core2 Mac


= 460 Cycles/ray

Hmm...
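The cycles-per-ray figure is just the clock rate divided by the ray throughput. A quick check with the two numbers from the slides (variable names are mine):

```python
clock_hz = 2.16e9        # 2.16 GHz Core2
rays_per_sec = 4.7e6     # 4.7M rays/sec

cycles_per_ray = clock_hz / rays_per_sec
print(round(cycles_per_ray))
```

This rounds to the 460 cycles/ray quoted above.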

Statistics

nPackets : 8192
nTraversedLeafs(per packet) : 43066 (5.257080)
nTraversedInnerNodes(per packet) : 109922 (13.418213)
nPacketBBoxTests(per packet) : 228036 (27.836426)
nBVHAnyHits : 131968 (57.871564 %)
nBVHAllMisses : 61462 (26.952762 %)
nBVHLastResorts : 34606 (15.175674 %)
nBVHLastResortRayBBoxTests : 322288 (per last resort tests) : 9.313067
nTestedTriangles(per packet) : 314650 (38.409424)
nCulledByBeams(rate) : 223618 (71.068807 %)
nCulledByUV(rate) : 18536 (5.890990 %)
nActuallyTestedTrianglesPerRay : 2.060669
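For reading the statistics, the derived numbers in parentheses can be reproduced from the raw counters. This is a small sketch; the variable names are mine and the counter values are copied from the output above.

```python
n_packets = 8192
n_traversed_leafs = 43066
n_tested_triangles = 314650
n_culled_by_beams = 223618
n_culled_by_uv = 18536

leafs_per_packet = n_traversed_leafs / n_packets
beam_cull_rate = 100.0 * n_culled_by_beams / n_tested_triangles
uv_cull_rate = 100.0 * n_culled_by_uv / n_tested_triangles

# Triangles that survive both culling stages and reach the per-ray test.
n_survivors = n_tested_triangles - n_culled_by_beams - n_culled_by_uv

print(leafs_per_packet, beam_cull_rate, uv_cull_rate, n_survivors)
```

Together the two culling stages remove roughly 77% of the tested triangles in this run, which is below the "over 90%" figure quoted from the paper.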


Core2 Mac


VC

BVH


Redundant calcs...

Non-SIMDized code...

Yeah, I have to optimize BVH code before

optimizing VC :-)


Implementation period

packet BVH, Vertex Culling

2 Days, 6 Days

Very easy to implement!

MLRTA took half a year in my case.

TODO


Much more optimization.

Try to use the VC technique for incoherent rays

Here is my VC + BVH implementation:

http://lucille.svn.sourceforge.net/svnroot/lucille/angelina/eleonore/

?


Thank you!
