SBR07
Faster ray packets through
vertex culling
Illustration and implementation by
Syoyo Fujita
Faster ray packets through
vertex culling
Alexander Reshetov
IEEE/EG Symposium on
Interactive Ray Tracing 2007
Faster Ray Packets - Triangle Intersection through Vertex Culling
Alexander Reshetov
Intel Corporation
ABSTRACT
Acceleration structures are used in ray tracing to sharply reduce the number
of ray-triangle intersection tests at the expense of traversing such
structures. Bigger structures eliminate more tests, but their traversal
becomes less efficient, especially for ray packets, for which the number of
inactive rays increases at the lower levels of the acceleration structures.
For dynamic scenes, building or updating acceleration structures is one
of the major performance impediments.
We propose a new way to reduce the total number of tests by creating a
special transient frustum every time a leaf is traversed by a packet of rays.
This frustum contains intersections of active rays with a leaf node and
eliminates over 90% of all potential tests. It allows a tenfold reduction in
the size of the acceleration structure whilst still achieving better performance.
Keywords: computational complexity, ray tracing, ray-triangle intersection
1 INTRODUCTION
Finding the intersection of a ray and a triangle is equivalent to solving
the linear system of three equations
$o + t\,d = p_0 + u\,(p_1 - p_0) + v\,(p_2 - p_0)$ (1)
with five additional requirements
$0 \le t \le t_{max}$ (2)
$0 \le u, \quad 0 \le v, \quad u + v \le 1$ (3)
The left part of the system (1) defines a ray with the origin $o$ and the
direction $d$; the right part, a point inside the triangle with vertices $p_0$, $p_1$,
and $p_2$. The unknown variables are: $t$, the distance to the intersection point
from the ray's origin, and $u$, $v$, the barycentric coordinates of the point
inside the triangle. It is required that the found intersection point be
closer to the ray's origin than the previously found one (2) and within
the triangle's boundaries (3).
One way to solve the system (1) is to use Cramer's rule, which allows
finding numerical values directly for all three variables. This will result
in implementations similar to Möller-Trumbore [13]. On the other
hand, the system (1) could also be solved using Gaussian elimination.
It is computationally efficient only if the distance $t$ is calculated first
(which, of course, could also be computed directly by using the
well-known expression for the distance to a plane along a ray). By
substituting the found value of $t$ in any two of the remaining equations,
the $u$ and $v$ variables could be found. This substitution geometrically
corresponds to projecting the triangle and the intersection point to the
plane defined by the coordinates of the two chosen equations and leads
to algorithms comparable with the one given by Badouel [1]. The SSE
implementation for groups of four rays, proposed by Wald [21], is based
on the equivalent 2D projection. An example of the SSE-based 3D
approach (formulated in terms of Plücker coordinates) can be found in
Benthin's Ph.D. thesis [2]. For vector implementations, conditions (2-3) are
usually converted to masks used to selectively update the $t$, $u$, and $v$
values. It makes sense to use packets bigger than the intrinsic SIMD
width of the targeted architecture as it amortizes per-triangle
computations and saves bandwidth. It also allows decisions to be made
summarily for the whole packet without processing individual rays, for
example using frustum or interval arithmetic ([5], [7], [11], [22], [23]).
There is no one universal way of solving the system (1) which will be
suitable for all situations and architectures. In particular, CPU
implementations could use conditions (2-3) for exiting from the test
early if any one of the five conditions is false. For implementations on
GPU [14] or Cell [3], branchless algorithms are more efficient.
Accepted to IEEE/EG Symposium on Interactive Ray Tracing 2007, http://www.uni-ulm.de/rt07/RT07.html, Ulm, Germany, September 10-12, 2007.
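As a rough illustration of the Cramer's-rule route with early exits, here is a minimal single-ray sketch in the style of Möller-Trumbore. It is not the code from the paper or from the slides; the function name `intersect`, the helper names, and the `eps` tolerance are assumptions made for this example.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def intersect(o, d, p0, p1, p2, t_max=float("inf"), eps=1e-12):
    """Solve o + t d = p0 + u (p1 - p0) + v (p2 - p0).

    Returns (t, u, v), or None as soon as one requirement fails.
    eps is an assumed tolerance for near-parallel rays.
    """
    e1, e2 = sub(p1, p0), sub(p2, p0)
    h = cross(d, e2)
    det = dot(e1, h)
    if abs(det) < eps:
        return None              # ray is parallel to the triangle plane
    inv = 1.0 / det
    s = sub(o, p0)
    u = dot(s, h) * inv
    if u < 0.0:
        return None              # violates 0 <= u
    q = cross(s, e1)
    v = dot(d, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None              # violates 0 <= v or u + v <= 1
    t = dot(e2, q) * inv
    if t < 0.0 or t > t_max:
        return None              # violates 0 <= t <= t_max
    return (t, u, v)
```

Each `return None` corresponds to leaving the test early when one of the five requirements is false; a branchless variant would instead compute all three values and combine the comparisons into a mask.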
For most scenes, the great majority of ray-triangle intersection tests can be
eliminated by creating special acceleration structures such as kd-trees,
grids, Bounding Volume Hierarchies (BVH), which exploit spatial
coherency of a scene ([2], [7], [22], [23]). During rendering, the
acceleration structure is traversed and all triangles in visited leaf nodes are
tested for intersections. For structures which utilize spatial subdivision
(kd-trees, grids), it is possible to create an instance of this type for which
total number of tests will be just slightly more than the number of ray-
segments (one test per ray), but this may not result in the best
performance. The optimal size of the acceleration structure is dependent
on how fast it could be traversed compared with the average speed of the
used ray-triangle intersection test. For hierarchical acceleration structures,
the higher levels of the hierarchy are usually the most effective in reducing
the number of potential tests. Going down in the spatial hierarchy, nodes
are becoming smaller and smaller and some of the rays in the packet may
miss them. This negatively affects the utilization of SIMD units [15].
In recent years, the focus of ray tracing research has shifted to dynamic
scenes ([8], [12], [17], [19], [22], [23]), for which it is necessary to
optimize for the total execution time (build/update time plus rendering
time). Frequently, to improve the build time, only axis-aligned bounding
boxes (AABB) of triangles are used even for kd-trees ([8], [17]). For
this reason, it is important to analyze the performance of ray-triangle
intersection tests in the situation when intersection could not be
eliminated by testing a ray against the AABB of a triangle. This
approach was chosen by Kensler and Shirley [10], in which different 3D
versions of intersection algorithms were analyzed and the best one was
found via genetic optimization of the fitness function.
Even when a ray does intersect the AABB, the probability of it
intersecting the triangle is only about 20%. It is important to swiftly
reject the remaining 80%. Amongst the approaches analyzed in the
literature [4] are: use of interval arithmetic; testing the intersection of a
frustum containing rays (typically represented as 4 corner rays) with a
triangle [5]; and culling of the AABB of the individual triangle against
the frustum. In these approaches per-packet data structures are computed
before the traversal and then used to eliminate unnecessary tests. The
culling techniques could also be used for individual rays, as
first proposed by Snyder and Barr in their 1987 paper [18], in which a ray was
first tested against the box formed by the intersection of the object's
bounding box and a visited cell.
A few years ago, ray-tracing researchers were mostly interested in
rendering static scenes for which persistent data structures were created,
optimized for all possible camera positions. Eventually, the focus shifted
to dynamic scenes, for which data structures are created (or updated)
every frame. There are also initial studies in how to create kd-trees lazily,
deeply subdividing only those areas of space that are visited during
rendering of a particular frame [9]. In a sense, we pursue this trend to the
extreme, when transient data structures are created for every packet and
every visited leaf node. This allows very tight structures, optimized for a
given packet and a cell. Surprisingly, these fleeting structures could be
created and handled very efficiently, building on top of the clipping
algorithms, characteristic for the modern packet traversal techniques.
Figure 1. The transient frustum is created every time a ray packet visits a leaf node (solid/dashed lines correspond to active/inactive rays). Only active rays are used to compute the frustum. Left: general rays. Right: primary rays (common origin).
Demo
Key contribution
Culling rate: 90%
Spatial data structure size: 1/10
Ray-triangle intersection cost
1 cycle (on average)
It's extremely
fast!
Algorithm
What does VC do? (VC = Vertex Culling)
Every time a packet enters
a leaf node, do the following:
Create packet
Precompute coeff
for each triangle: triangle-packet cull
triangle-ray intersect if the cull failed
Create packet from active rays
Active ray
Inactive ray
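One possible way to picture this step: collect the points where the active rays enter and leave the leaf and bound them with two axis-aligned rectangles, which then define the frustum. The sketch below assumes x is the packet's dominant axis and that the entry and exit points are already known; `active_bounds` and its parameters are hypothetical names, not taken from the slides or the paper.

```python
def active_bounds(entry_points, exit_points, active):
    """Bound the active rays' entry and exit points in y and z.

    entry_points, exit_points: one (x, y, z) tuple per ray.
    active: one boolean per ray; inactive rays are ignored.
    Returns ((ymin, ymax, zmin, zmax) at entry, same at exit),
    or None if no ray is active.
    """
    if not any(active):
        return None

    def rect(points):
        ys = [p[1] for p, a in zip(points, active) if a]
        zs = [p[2] for p, a in zip(points, active) if a]
        return (min(ys), max(ys), min(zs), max(zs))

    return rect(entry_points), rect(exit_points)
```

Because inactive rays are skipped, the resulting frustum can be much tighter than one built from the whole packet.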
Think about the
triangle-packet test
Axis aligned
Frustum plane
Labels: v (vertex), o (point on the plane), n (plane normal)
(v - o) . n < 0
= outside
v: any of dot(v, n) < 0 = outside of the frustum
v: all of dot(v, n) >= 0 = inside of the frustum
Labels: v0, v1, v2, o, n
all of (v_i - o) . n < 0 (i = 0, 1, 2) = triangle is outside
of the plane
Finally, we get:
for every plane, any of (v_i - o) . n >= 0 = frustum and triangle (possibly) intersect
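The plane tests above can be sketched as a conservative triangle-versus-frustum cull. This is an illustrative sketch only: the name `triangle_outside_frustum` and the representation of a plane as an `(o, n)` pair with `n` pointing into the frustum are assumptions for this example.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def triangle_outside_frustum(vertices, planes):
    """Return True if some plane has all three vertices on its outer side.

    vertices: three (x, y, z) tuples.
    planes: (o, n) pairs, o a point on the plane, n the inward normal.
    A False result only means the triangle possibly intersects the frustum.
    """
    for o, n in planes:
        if all(dot(sub(v, o), n) < 0.0 for v in vertices):
            return True
    return False
```

The test is conservative: a triangle whose vertices lie outside different planes is not culled even if it misses the frustum, which is why a further test is still needed afterwards.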
1 vertex, 4 plane inside/outside test computation
can be reduced to
d = [Vz, -Vy, -Vz, Vy] + [Vx, Vx, Vx, Vx] * q1 + q0
q1, q0 = SIMD variables, pre-computable
when the packet enters the leaf node.
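To make the formula concrete, here is a scalar sketch that emulates the four SIMD lanes with Python lists. It assumes each plane normal has been scaled so that its y/z part is (0, 1), (-1, 0), (0, -1), (1, 0) for lanes 0 to 3, matching the [Vz, -Vy, -Vz, Vy] ordering on the slide; the names `precompute`, `plane_distances`, and `YZ` are hypothetical.

```python
# Assumed (ny, nz) part of each scaled plane normal, one entry per lane.
YZ = [(0.0, 1.0), (-1.0, 0.0), (0.0, -1.0), (1.0, 0.0)]

def precompute(nx, points):
    """Build q0 and q1 once per packet and leaf.

    nx: x component of each scaled normal (4 values).
    points: one point lying on each plane (4 tuples).
    """
    q1 = list(nx)
    q0 = [-(nx[i] * p[0] + YZ[i][0] * p[1] + YZ[i][1] * p[2])
          for i, p in enumerate(points)]
    return q0, q1

def plane_distances(v, q0, q1):
    """Signed distances of vertex v to the four planes:
    one multiply and two adds per lane."""
    vx, vy, vz = v
    base = [vz, -vy, -vz, vy]
    return [base[i] + vx * q1[i] + q0[i] for i in range(4)]
```

A negative entry means the vertex is outside that plane. In a real SIMD version each list would be a single four-wide register, so the per-vertex work is one multiply and two adds.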
Culling cost per vertex:
1 SIMD mul + 2 SIMD add
Very efficient
This is the core of VC
Just apply efficient vertex-frustum
culling for each triangle vertex
Culling cost per triangle per packet =
3 (vtx) * 3 (1 mul + 2 add) +
compare
First, do the corner rays and
triangle test
Such a case cannot be culled by VC.
u, v (barycentric coordinates of the corner rays)
all(u) < 0 or all(v) < 0 or all(u+v) > 1
= All rays in the packet
miss the triangle.
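The condition above can be written down directly. In this sketch the barycentric coordinates of the corner rays at the triangle's plane are assumed to be already computed; `corner_rays_reject` is a hypothetical name.

```python
def corner_rays_reject(uv):
    """Return True if the corner rays prove that the whole packet misses.

    uv: (u, v) pairs, one per corner ray, measured in the triangle's plane.
    """
    return (all(u < 0.0 for u, _ in uv)
            or all(v < 0.0 for _, v in uv)
            or all(u + v > 1.0 for u, v in uv))
```

Each of the three conditions describes a half-plane on the outer side of one triangle edge. If every corner ray lands in the same half-plane, the rays bounded by those corners land there as well, so none of them can hit the triangle.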
If all culling tests fail, do the
ray-triangle test for each ray in
the packet.
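Putting the steps together, the per-leaf control flow might look like the sketch below. The three callbacks stand in for the vertex culling test, the corner-ray test, and the single-ray intersection; `intersect_leaf` and all parameter names are assumptions for illustration, not the structure of the actual implementation.

```python
def intersect_leaf(triangles, rays, frustum_cull, corner_cull, ray_test):
    """Run the per-leaf loop: cull per packet first, test per ray last.

    frustum_cull(tri) and corner_cull(tri) return True when the triangle
    can be skipped for the whole packet; ray_test(ray, tri) returns a hit
    record or None.
    """
    hits = []
    for tri in triangles:
        if frustum_cull(tri):
            continue                  # rejected by vertex culling
        if corner_cull(tri):
            continue                  # rejected by the corner-ray test
        for ray in rays:              # last resort: test every ray
            hit = ray_test(ray, tri)
            if hit is not None:
                hits.append((ray, tri, hit))
    return hits
```

The cheap per-packet tests run first so that the per-ray loop is reached only for the triangles that survive both.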
That's all!
Benefit of VC
Culling rate: 90%
Spatial data structure size: 1/10
less traversal, more VCs
more traversal, fewer VCs
Right has 12x more nodes than left,
but right is just 30% faster than left.
Spatial data structure: Coarse vs. Fine
Coarse: sequential access, more isects, fewer traversals, less memory
Fine: random access, fewer isects, more traversals, more memory
Shown on this scale: BVH, BIH, kd-tree
Shown on the same scale: VC
The strongest acceleration
structure on the earth?
Implementation
Here is my implementation of
VC + BVH
Original: C++, for VC
My impl.: C, for gcc
Performance &
Profiling
512x512, 10k tris, SAH-BVH
Compile flags: -O3 -ffast-math -mfpmath=sse -msse2
4.7M rays/sec
@ 2.16 GHz Core2 Mac
= 460 Cycles/ray
Hmm...
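The cycles-per-ray figure follows from dividing the clock rate by the ray throughput. A tiny check, with `cycles_per_ray` as a hypothetical helper name:

```python
def cycles_per_ray(clock_hz, rays_per_sec):
    """Average clock cycles spent per ray on one core."""
    return clock_hz / rays_per_sec

# 2.16 GHz clock and 4.7M rays/sec, as quoted on the slides.
estimate = cycles_per_ray(2.16e9, 4.7e6)
```

This assumes a single core running at the nominal clock for the whole render.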
Statistics
nPackets : 8192
nTraversedLeafs(per packet) : 43066 (5.257080)
nTraversedInnerNodes(per packet) : 109922 (13.418213)
nPacketBBoxTests(per packet) : 228036 (27.836426)
  nBVHAnyHits : 131968 (57.871564 %)
  nBVHAllMisses : 61462 (26.952762 %)
  nBVHLastResorts : 34606 (15.175674 %)
  nBVHLastResortRayBBoxTests : 322288
  (per last resort tests) : 9.313067
nTestedTriangles(per packet) : 314650 (38.409424)
  nCulledByBeams(rate) : 223618 (71.068807 %)
  nCulledByUV(rate) : 18536 (5.890990 %)
nActuallyTestedTrianglesPerRay : 2.060669
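The rates in the statistics can be recomputed from the raw counters. The sketch below does that and also adds up the two culling stages; it assumes that `nCulledByBeams` counts triangles rejected by the frustum test and `nCulledByUV` those rejected by the corner-ray test, which the counter names suggest but the slides do not spell out.

```python
# Raw counters copied from the statistics above.
stats = {
    "nPackets": 8192,
    "nTraversedLeafs": 43066,
    "nTestedTriangles": 314650,
    "nCulledByBeams": 223618,
    "nCulledByUV": 18536,
}

def percent(part, whole):
    return 100.0 * part / whole

culled = stats["nCulledByBeams"] + stats["nCulledByUV"]
beam_rate = percent(stats["nCulledByBeams"], stats["nTestedTriangles"])
uv_rate = percent(stats["nCulledByUV"], stats["nTestedTriangles"])
total_rate = percent(culled, stats["nTestedTriangles"])
leafs_per_packet = stats["nTraversedLeafs"] / stats["nPackets"]
survivors_per_packet = (stats["nTestedTriangles"] - culled) / stats["nPackets"]
```

With these counters the two stages together reject about 77% of the tested triangles in this run, leaving roughly nine triangles per packet for the per-ray test.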
Core2 Mac
VC
BVH
Redundant calcs...
Non-SIMDized code...
Yeah, I have to optimize the BVH code before
optimizing VC :-)
Implementation period
packet BVH    Vertex Culling
2 days        6 days
Very easy to implement!
MLRTA required half a year in my case.
TODO
Much more optimization.
Try to use the VC technique for
incoherent rays
Here is my VC + BVH
implementation:
http://lucille.svn.sourceforge.net/svnroot/lucille/angelina/eleonore/
?
Thank you!