Cache-Aware Hybrid Sorter
Manny Ko
Outline
• Sorting in CG
• Quick radix sort refresher
• Issues with radix sort – Incoherent memory access during parts of it
– Originally only for integers
• Two-phase sort – Cache-aware stream splitting
– Cache friendly merge using Loser Tree
– A lot faster than STL sort (several times)
Sorting in CG
• Depth-sort for transparency Patney [2010]
• Better Z-cull
• Collision detection [Lin 2000]
• Minimizing state-changes
• Ray coherency Garanzha & Loop [2010]
• HPC to handle irregular workloads
• PBGI ?
Inspirations
• Out-of-core sorts, e.g. AlphaSort Nyberg[95]
• GPU based stream processing
• Cache-aware algorithms
• Came out of my work on fast kd-tree builder
Importance of Memory
• GPU and CPU cores keep getting faster
• Tons of cores, and more are coming
• For GFLOPS, Moore’s Law still holds
• NOT for bandwidth to memory
– While GFLOPS double or triple every 18 months
– Bandwidth barely moves (~15%)
• Bandwidth equals power: it costs energy to push electrons
Real-time Rendering
• I have been focusing on cache and memory access patterns for a while
• CG researchers like Ingo Wald et al. have tackled this in ray tracing
STL Sort
• Quicksort based
– Memory access pattern less than ideal
– Not sequential and lots of branching
• Will not dwell too much on this
Radix Sort
• The only practical O(dN) sort algorithm
– d is the number of radix digits, e.g. for a 32-bit word and 1 bit per pass, d is 32
• No branching (almost), at least for integers
Counting Sort – Pass 1
• For radix = 2 we allocate two counters
• Each pass we go through the input and count the number of inputs that have a 0 or a 1 in the current digit
• Extract the digit (1 bit) and use it as the index to increment the right counter – no branching
• d is a key design parameter
Pass 2 - Scatter
• At the end of the pass, the counter for 0s gives us the offset at which to insert the 1s
• We go through the input again, using the counters to guide where to scatter into the output buffer
Number of Passes
• In the original radix sort, each radix digit requires 1 counting pass through the input and 1 scatter pass
• Swap input and output; repeat d times
• Each pass is a stable sort
Prefix-Sum
• Radix-2 is simple; in general we have to compute the prefix-sum for the counters
• Key building block for GPU computing
• A big topic on its own
• Our array is only 256 entries long, so we didn’t use a fancy SIMD method
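For a 256-entry counter array, the scatter offsets are just the exclusive prefix-sum of the counters, and a plain serial scan is enough (a sketch; the function name is mine):

```cpp
#include <cstddef>
#include <cstdint>

// Exclusive prefix-sum over the 256 radix counters: offsets[i] is where
// the first key whose digit equals i will be scattered.
void exclusivePrefixSum(const uint32_t counters[256], uint32_t offsets[256])
{
    uint32_t sum = 0;
    for (size_t i = 0; i < 256; ++i) {
        offsets[i] = sum;          // start of digit i's output range
        sum += counters[i];
    }
}
```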
Access Patterns
• Pass 1 – pure sequential read. Good
– Very parallelizable too.
• Pass 2 – random scatter. Not so good
• Each pass requires one complete round trip from and to memory
Random Scatter
• Idea: utilize the cache
• Split the input into sub-streams
• Sub-streams defined by cache size/fast memory
Cache Resident Passes
• When we swap input and outputs
– Output from previous pass still in cache
Stream Merging
• Sorted sub-streams will be merged
• Merge is streaming friendly:
– Input are read sequentially
– Output is generated sequentially
• This is where the fun is
• We will get back to this. I promise.
Cache-Aware Hybrid Sort
• Cache-aware because we use the actual cache size of the machine to split the input
• Hybrid: radix sort sub-streams then merge
Cache sizing
• cpuid instruction
• code in the book ‘Game Engine Gems II’, AK Peters 2011.
Stream Splitting
• Depends on # of threads
• General strategy is to keep the output of each scatter pass completely within the cache
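A minimal sizing sketch (the function name and the headroom constant are my assumptions, not the author’s code): pick a substream length so that an input block and its scatter output both stay within one core’s cache.

```cpp
#include <cstddef>

// Choose a substream length so one input block plus its scatter output
// fit in a per-core cache of 'cacheBytes' (assumed well above 8 KB).
size_t substreamLength(size_t cacheBytes, size_t elemSize)
{
    size_t headroom = 4096;                    // room for counters, stack data
    size_t budget = cacheBytes / 2 - headroom; // half for input, half for output
    return budget / elemSize;
}
```

For a 256 KB L2 and 4-byte floats this gives roughly 31k elements per substream.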
Substream Sorting
• Each byte is a digit
• Radix-256 sort – allocate 256 counters
– 1 KB, or 2 KB with 64-bit counters; fits in the L1 cache
– Actually we allocate 4 sets of counters
• d is logically 4, but we count all 4 digits in 1 pass
• Form the 4 sets of prefix-sums
• 4 scatter passes
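The single counting pass can be sketched like this (my reconstruction, not the author’s code): with one byte per digit, all four 256-entry histograms are filled in one read of the input, and the 4 × 256 counters fit comfortably in L1.

```cpp
#include <cstdint>
#include <cstring>

// One pass over the keys fills the histograms for all 4 byte digits.
void histogram4(const uint32_t* keys, size_t n, uint32_t hist[4][256])
{
    std::memset(hist, 0, 4 * 256 * sizeof(uint32_t));
    for (size_t i = 0; i < n; ++i) {
        uint32_t k = keys[i];
        ++hist[0][ k        & 0xff];   // digit 0: low byte
        ++hist[1][(k >>  8) & 0xff];
        ++hist[2][(k >> 16) & 0xff];
        ++hist[3][(k >> 24) & 0xff];   // digit 3: high byte
    }
}
```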
Floats
• Radix sort was originally designed for ints
• What if we treat a float as an int? Casting?
• Almost works, if all the floats are positive
• IEEE floats are sign-exponent-mantissa
• The sign bit makes all negative numbers appear larger than the positive ones
Float example
2.0 is 0x40000000
-2.0 is 0xc0000000
-4.0 is 0xc0800000
Which implies -4.0 > -2.0 > 2.0,
just the opposite of what we want
Terdiman’s Solution
• The usual solution [Terdiman 2000] treats the high byte specially and uses a test in the inner loop
• Modern CPUs do not like branching
• GPUs like it even less
Herf’s Hack
1. Always invert the sign bit
2. If the sign bit was set, also invert the exponent and mantissa
2.0 is 0x40000000 -> 0xc0000000
-2.0 is 0xc0000000 -> 0x3fffffff
-4.0 is 0xc0800000 -> 0x3f7fffff
We get the correct ordering
Herf’s FloatFlip
U32 FloatFlip(U32 f)
{
    U32 mask = -int32(f >> 31) | 0x80000000;
    return (f ^ mask);
}
My Version
int32 mask = (int32(f) >> 31) | 0x80000000;
Utilizes sign extension when shifting signed numbers; generates better code.
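Here is a self-contained round trip of the flip (a sketch: FloatFlip is restated with standard types, and IFloatFlip is my reconstruction of the inverse following the same construction, not quoted code):

```cpp
#include <cstdint>
#include <cstring>

// Map float bits to an unsigned integer whose ordering matches the floats.
static uint32_t FloatFlip(uint32_t f)
{
    uint32_t mask = uint32_t(int32_t(f) >> 31) | 0x80000000u; // sign-extend
    return f ^ mask;
}

// Inverse: the top bit now tells us whether the original sign bit was set.
static uint32_t IFloatFlip(uint32_t f)
{
    uint32_t mask = ((f >> 31) - 1u) | 0x80000000u;
    return f ^ mask;
}

// Reinterpret a float's bits as uint32_t without violating aliasing rules.
static uint32_t bits(float x) { uint32_t u; std::memcpy(&u, &x, 4); return u; }
```

After flipping, plain unsigned comparison orders -4.0 < -2.0 < 2.0, as the slide’s example shows.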
Parallel Sorting
• Each substream can be sorted in parallel
• We allocate 1 core per substream
• We size the substreams so that each fits into a core’s L2 or L1 cache (or GPU shared memory)
• At the end of substream sort phase we have read the input from memory (disk) twice
/*! RadixSorter: a builder class to aid with the use of the radix sorter.
    It splits the input stream into substreams that fit into cache.
    Mostly it holds the indices and temporaries for reuse.
    It currently only supports sorting of <key,index> pairs. The caller can
    either request the sorted indices or request the original values to be
    moved. */
class RadixSorter {
    typedef size_t* Indices;
    static const size_t kStreams = 4;
public:
    static const size_t kNumThreads = 4;    // # of threads

    RadixSorter( int count );
    ~RadixSorter();

    /// reallocate internal storage to prepare for a stream of length 'count':
    void Resize( int count );
    /// deallocate all storage:
    void Clear();
    /// initialize the sorter for 'values':
    void SortInit( float* values, int count );
    /// sort 'values':
    void Sort( float* values, int count );
    /// sort sub-stream 's':
    void SortStream( int s );
    void MergeStreams();

public:
    size_t m_blockSizes[kStreams];  //!< size of each sub-stream
    float* m_streams[kStreams];     //!< our sub-streams of work
    float* m_temp[kNumThreads];     //!< working buffers carved from the output buffer
    float* m_outbuf;
    int    m_count;                 //!< max size of the input sequence
    bool   m_inited;
};
Stream Merging
• Usually performed using a priority queue, most likely a heap-based PQ
• I tried to find the best PQ implementation
• Disappointingly, the gain from the radix sort was almost negated by the merge phase
Loser-Tree
• Comes to the rescue
• Thanks Knuth
– The Art of Computer Programming Vol. 3
• Almost forgotten and I am a Knuth fan
• It is a kind of tournament-tree
Tournament Tree
• Single elimination
• Loser-tree is a tournament tree where the loser is kept in each round
• Winner moves on (in a register)
Our Tree
• Each node consists of a float key and a stream_id payload
• Linearized binary-tree, no pointers
– Navigation up and down is by shifts and adds
• Initialized by inserting the head of each substream into the tree
– Size_of_tree = 2 x # of substreams
• Let the play begin!
Winner
• Winner rises to the top
– We remove the winner and output the key
• We use the winner’s stream_id to pull the next item from the stream
• Key idea: new winner can only come from players that had faced the previous winner – i.e. the path from the root to the original position of the winner
Repeat Matches
• Repeat those matches, a new winner emerges
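The slides above can be condensed into a small sketch (my reconstruction, not the author’s code): a linearized loser tree where each internal node keeps the losing stream’s id, the winner stays in a local variable, and after each output only the matches on the root-to-leaf path are replayed. For simplicity it keeps only float keys and assumes the number of streams is a power of two.

```cpp
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Merge k sorted streams with a loser tree; k must be a power of two >= 2.
std::vector<float> loserTreeMerge(const std::vector<std::vector<float>>& streams)
{
    const int k = (int)streams.size();
    const float kInf = std::numeric_limits<float>::infinity();

    std::vector<size_t> pos(k, 0);            // read cursor per stream
    std::vector<float> key(k);                // current head key per stream
    for (int s = 0; s < k; ++s)
        key[s] = streams[s].empty() ? kInf : streams[s][0];

    // Build bottom-up: winner[] is scratch, tree[] keeps the losers.
    std::vector<int> winner(2 * k), tree(k);
    for (int s = 0; s < k; ++s) winner[k + s] = s;   // leaves = streams
    for (int t = k - 1; t >= 1; --t) {
        int a = winner[2 * t], b = winner[2 * t + 1];
        winner[t] = (key[a] <= key[b]) ? a : b;      // winner moves up
        tree[t]   = (key[a] <= key[b]) ? b : a;      // loser stays in the node
    }
    int champ = winner[1];                           // overall winner

    size_t total = 0;
    for (const auto& st : streams) total += st.size();

    std::vector<float> out;
    out.reserve(total);
    while (out.size() < total) {
        out.push_back(key[champ]);                   // output the winner's key
        // Pull the next item from the winner's stream (+inf when exhausted).
        ++pos[champ];
        key[champ] = (pos[champ] < streams[champ].size())
                   ? streams[champ][pos[champ]] : kInf;
        // Replay only the matches from this leaf back to the root.
        int s = champ;
        for (int t = (k + s) / 2; t >= 1; t /= 2)
            if (key[tree[t]] < key[s]) std::swap(s, tree[t]);
        champ = s;
    }
    return out;
}
```

Each replay touches only log2(k) nodes, and the whole tree is small enough to stay cache-resident during the merge.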
Access Pattern of Merge
• Each substream is accessed sequentially
• Output is written sequentially
• Modern CPUs and GPUs like these sorts of patterns due to their pre-fetch, write-coalesce and caching logic
• Tree is small and fits into the L1 cache or even register file
Performance (1 core)
• Serially sort all substreams
• Merge using Loser-Tree on same thread
• Small data sets: 2.1 to 3.5 times faster than STL
– The poor access pattern of quicksort is less problematic when everything fits into cache
Scalability (4-cores)
Times in ms:
              Threaded (Q6600)   Serial (Q6600)   Threaded (i7)   Serial (i7)
1 stream            5.12               5.00             3.89           3.62
2 streams           6.90              10.04             4.20           7.10
3 streams           8.08              15.07             4.56          10.69
4 streams          10.97              20.55             4.86          14.20
4 + merge          16.40              26.01             9.61          19.00
Multi-Core Performance
• One million entries: Q6600
– STL took 76 ms
– radix sort: 28 ms
– 4-core: 16.4 ms, of which 5 to 6 ms is in the merge
• One million entries: I7
– STL (58ms),
– hybrid (9.6ms)
• 6 times faster than STL
Threading Overhead
• The 1-stream vs. serial time is 5.12 vs. 5.00 ms
– So only ~0.12 ms of threading overhead
Related Work
• Funnel-Sort, Brodal [2008]
• GPU radix-sort, Satish [2009]