
    GPU Scheduling

    Navneet Jha

    Arizona State University, Tempe, AZ

    1-(480)-207-0546 [email protected]

    ABSTRACT It is well known that the GPU cache is not utilized as effectively as a CPU cache. This is because the GPU has several threads running on the same Streaming Multiprocessor at a given time, which leads to severe contention among the threads for the shared L1 cache. If the threads are scheduled in a round robin manner, then even if every warp shows both spatial and temporal locality, the cache performance might not be as good as expected. The data brought into the cache for one thread might be evicted before the thread can be scheduled again if the Streaming Multiprocessor generates a high number of memory requests. So, even if a thread uses the same data again and again (an extreme case of spatial and temporal locality), it might suffer from a cache miss every time it accesses the cache.

    Cache Conscious Wavefront Scheduling [1] is a scheduling algorithm proposed by Timothy G. Rogers, Mike O'Connor and Tor M. Aamodt to address this issue. The idea is to exploit intra-warp spatial and temporal locality: it detects the warps suffering from misses due to high contention for the L1 cache and schedules them more often.

    We also explore Dynamic Warp Formation [2], as proposed by Wilson W. L. Fung, Ivan Sham, George Yuan and Tor M. Aamodt. The idea helps reduce branch divergence by dynamically grouping together threads that execute the same code sequence after a branch.

    Keywords GP-GPU (General Purpose Graphics Processing Unit), GPGPU-sim (GPGPU simulator), CCWS (Cache Conscious Wavefront Scheduling), DWF (Dynamic Warp Formation)

    1. INTRODUCTION With the performance of single-core CPUs saturating, we need to look at multiple cores for further performance enhancement. GPUs, formerly used only for graphics applications, have recently found a place in general purpose computing owing to their huge computing capabilities. The use of GPUs for general-purpose computation and the introduction of the CUDA programming language gave a new dimension to the use of GPUs. As much as this provides scope for faster computation, general purpose program flow does not always align with the ideal workflow of a GPU. Efficient general purpose computation is mainly limited by the issues of memory latency, cache misses and branch divergence.

    The effect of memory latency can be hidden by increasing the number of threads on a Streaming Multiprocessor. With more threads running at any given time, the Streaming Multiprocessor can switch to another thread whenever one thread issues a memory request. The threads can be scheduled using a Round Robin scheduling algorithm, which helps ensure that the memory request issued by a thread has completed by the time the thread is scheduled next. This increases the overall throughput of the system, but increases the execution time of a single thread.
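
    To make the latency-hiding policy above concrete, the following is a minimal C++ sketch of a loose round-robin issue loop that skips warps stalled on outstanding memory requests. The Warp structure, its fields and the readiness check are hypothetical simplifications for illustration, not GPGPU-Sim's actual classes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified warp state; real hardware also tracks scoreboards,
// barriers and instruction buffers.
struct Warp {
    bool finished = false;
    bool waiting_on_memory = false;  // set while a load/store miss is in flight
    void issue_next_instruction() { /* send one instruction down the pipeline */ }
};

// Loose round-robin: each cycle, start from the warp after the one issued last
// and pick the first warp that is neither finished nor stalled on memory.
class RoundRobinScheduler {
public:
    explicit RoundRobinScheduler(std::vector<Warp>* warps) : warps_(warps) {}

    void issue_cycle() {
        const std::size_t n = warps_->size();
        for (std::size_t i = 0; i < n; ++i) {
            std::size_t idx = (last_issued_ + 1 + i) % n;
            Warp& w = (*warps_)[idx];
            if (!w.finished && !w.waiting_on_memory) {
                w.issue_next_instruction();  // other warps' memory latency is hidden here
                last_issued_ = idx;
                return;
            }
        }
        // No warp is ready: the Streaming Multiprocessor stalls this cycle.
    }

private:
    std::vector<Warp>* warps_;
    std::size_t last_issued_ = 0;
};
```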

    One of the problems of having a large number of threads running simultaneously on a Streaming Multiprocessor is that it tends to reduce the effective utilization of the Streaming Multiprocessor's shared L1 cache. The more threads we have, the higher the contention for the L1 cache. To address this problem we explore a new scheduling algorithm that reduces the contention for the L1 cache. The Cache Conscious Wavefront Scheduling [1] algorithm provides an excellent solution and helps reduce thrashing of the L1 cache.

    Another issue that limits general purpose computation efficiency is Branch Divergence. Branch divergence occurs when a warp encounters a branch instruction and some of its threads follow a different path through the code than the others, which hampers overall performance. We explore the idea of Dynamic Warp Formation [2] to reduce the effect of branch divergence by grouping together the threads that execute the same code after the branch. However, care should be taken that Dynamic Warp Formation does not affect intra-warp locality: if the threads in a warp access contiguous memory addresses, we may not want to assign those threads to a different warp, as this might cause memory divergence, which can negatively affect the overall performance of the system.

    2. BASELINE ARCHITECTURE The workloads are executed using the GPGPU-Sim environment. The baseline GPU architecture used for the work is illustrated in Figure 1. Table 1 illustrates the configuration of the baseline architecture used in our work.

    Table 1. Baseline Architecture Specifications

    Num. of SMs 15

    Max. # of Warps per SM 48

    Max. # of Blocks per SM 8

    # of Schedulers per SM 2

    # of Registers per SM 32768

    Shared Memory 48KB

    L1 Data Cache 16KB per SM (128B lines/4-ways)

    L2 Cache 768KB unified cache (128B lines/8-ways/12-banks)

    The work uses GPGPU-Sim to evaluate the performance of the workloads. GPGPU-Sim is a simulator that can run CUDA and OpenCL workloads. The workload used to evaluate the performance of Cache Conscious Wavefront Scheduling is a CUDA application for basic matrix multiplication, chosen for its high cache sensitivity. To evaluate the performance of Dynamic Warp Formation we chose a CUDA application performing Breadth First Search, as it shows a high degree of branch divergence.

    3. GPGPU-Sim GPGPU-Sim [3] is a cycle-accurate GPU simulator developed by W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt. According to reference [3], GPGPU-Sim models various aspects of a massively parallel architecture with highly programmable pipelines similar to contemporary GPU architectures. GPGPU-Sim has also been extended to support the CUDA PTX instruction set, which means that CUDA applications can be run on GPGPU-Sim seamlessly, simply by providing the proper path to its runtime libraries. This is done by the setup_environment shell script that comes with the GPGPU-Sim source code. The simulated GPU can be configured using the file gpgpusim.config, which must be present in the working directory when a CUDA workload is run against the GPGPU-Sim libraries. This work uses the simulator to model a GPU and run CUDA workloads on it. GPGPU-Sim can be used to evaluate the performance of CUDA and OpenCL applications, and it reports detailed runtime performance statistics for the workloads run on it. A modified version of GPGPU-Sim, available at https://www.ece.ubc.ca/~tgrogers/ccws.html, was used to evaluate the performance of Cache Conscious Wavefront Scheduling.

    4. CACHE CONSCIOUS WAVEFRONT SCHEDULING Cache Conscious Wavefront Scheduling [1] is a scheduling algorithm used to preserve and exploit intra-warp locality. It helps to improve the utilization of caches in GPUs. The basic idea of the algorithm is to more frequently schedule the warps that suffer cache misses due to contention with other warps. This gives them more exclusive access to the cache, thereby reducing the number of misses they suffer.

    4.1 Modified Architecture The modified architecture used to evaluate the performance of the Cache Conscious Wavefront Scheduling algorithm can be understood by looking at Figure 2. The implementation of Cache Conscious Wavefront Scheduling requires modification of the Wavefront Issue Arbiter and the Memory Unit of the baseline GPU architecture.

    The Memory Unit now contains a modified L1 cache, which stores, for every cache line, the Wavefront ID (WID) of the Wavefront that brought the line into the cache, along with the tag and data information. The Memory Unit also contains a Lost Locality Detector (LLD), which detects the Wavefronts suffering from cache misses due to contention with other Wavefronts. The LLD consists of a Victim Tag Array, which contains a tag array corresponding to every Wavefront assigned to this Streaming Multiprocessor. The tag array for each Wavefront holds the tags of recently evicted cache lines that were brought into the cache by that Wavefront.

    The Wavefront Issue Arbiter now contains a Locality Scoring System (LSS) along with the baseline priority logic to schedule the Wavefronts. Each Wavefront has a Lost Locality Score (LLS) associated with it, stored in the form of a max-sorted heap in the Locality Scoring System. The Locality Scoring System also contains the Lost Locality Score update logic, which periodically updates the Lost Locality Score of the Wavefronts, as well as a Lost Locality Score Cutoff Test, which can deny a Wavefront access to the L1 cache and stall it if it has a low Lost Locality Score.

    Figure 1. Baseline Architecture (Courtesy: [1])

    4.2 Lost Locality Detector The Lost Locality Detector consists of a Victim Tag Array, which contains a tag array corresponding to every Wavefront assigned to the Streaming Multiprocessor. The LLD is required to detect the Wavefronts that are suffering from cache misses due to contention with other Wavefronts for the L1 cache. Whenever there is a cache miss or an eviction in the L1 cache, the tag of the cache line along with its WID is sent to the Victim Tag Array (VTA). On an eviction, the Victim Tag Array stores the tag in the tag array corresponding to the Wavefront with id WID. On a miss, the Victim Tag Array checks whether the tag is present in the tag array corresponding to the Wavefront with id WID. If it is present (a Victim Tag Hit), the Lost Locality Detector signals the Locality Scoring System that a Victim Tag Hit has occurred. In essence, a hit in the Victim Tag Array signifies that the Wavefront suffered a cache miss due to a recently evicted cache line; the miss could have been avoided if the Wavefront had more exclusive access to the L1 cache. The Locality Scoring System, when signaled about a Victim Tag Hit, updates the Lost Locality Score of the Wavefront that suffered the cache miss.
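
    As a concrete illustration of this bookkeeping, the following is a minimal C++ sketch of a per-wavefront victim tag store with the on-eviction and on-miss behavior described above. The FIFO replacement and the per-wavefront capacity are assumptions made for clarity; the paper organizes the VTA as a small set-associative tag store.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <vector>

// Minimal sketch of the Lost Locality Detector described in Section 4.2.
class LostLocalityDetector {
public:
    LostLocalityDetector(unsigned num_wavefronts, unsigned entries_per_wavefront)
        : victim_tags_(num_wavefronts), capacity_(entries_per_wavefront) {}

    // Called by the L1 when a line brought in by wavefront `wid` is evicted.
    void on_eviction(unsigned wid, std::uint64_t tag) {
        auto& tags = victim_tags_[wid];
        tags.push_back(tag);
        if (tags.size() > capacity_) tags.pop_front();  // drop the oldest victim tag
    }

    // Called by the L1 on a miss by wavefront `wid`. Returns true on a VTA hit,
    // i.e. the miss was to a line this wavefront recently lost to contention,
    // in which case the Locality Scoring System should be signaled.
    bool on_miss(unsigned wid, std::uint64_t tag) const {
        const auto& tags = victim_tags_[wid];
        return std::find(tags.begin(), tags.end(), tag) != tags.end();
    }

private:
    std::vector<std::deque<std::uint64_t>> victim_tags_;  // one tag array per wavefront
    unsigned capacity_;
};
```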

    4.3 Locality Scoring System The Locality Scoring System is the heart of the Cache Conscious Wavefront Scheduling algorithm. It contains the Lost Locality Scores of all the Wavefronts issued to the Streaming Multiprocessor. The Lost Locality Scores are updated periodically, and the updated value depends on whether the Wavefront encountered a Victim Tag Array hit. The LLS update logic can be better understood by looking at Figure 3.

    The example illustrated in Figure 3 shows a situation where four Wavefronts are issued to the Streaming Multiprocessor. All the Wavefronts are assigned a constant BaseLocalityScore at the beginning of execution. We also define a variable Cumulative_LLS_Cutoff and give it a value of (NumActiveWaves * BaseLocalityScore), where NumActiveWaves is the number of currently active Wavefronts in the Streaming Multiprocessor. The Wavefronts are stored in a sorted max heap, as shown in Figure 3: the Wavefronts with a large Lost Locality Score fall to the bottom of the heap, and the ones with a small Lost Locality Score are pushed to the top. Whenever the Locality Scoring System is signaled of a Victim Tag Array hit, it increases the score of the corresponding Wavefront. If no Victim Tag Array hit occurs for a Wavefront, its Lost Locality Score is reduced by one point, until it reaches the minimum of BaseLocalityScore; the score is not reduced further once it equals the BaseLocalityScore.

    Figure 2. Modified GPU Core Architecture (Courtesy: [1])

    Figure 3. Locality Scoring System LLS update example (Courtesy: [1])

    The value of the updated Lost Locality Score when the Locality Scoring System is signaled of a Victim Tag Array hit is given by the following formula:

    LLS = (VTAHitsTotal / InstIssuedTotal) * KThrottle * Cumulative_LLS_Cutoff

    where the symbols represent the following:

    LLS: Updated Lost Locality Score
    VTAHitsTotal: Total number of VTA hits in the SM
    InstIssuedTotal: Total number of instructions issued by the SM
    KThrottle: Throttling constant
    Cumulative_LLS_Cutoff: Cutoff value for the LLS

    We can observe that the updated Lost Locality Score is a function of the total number of VTA hits across all of this compute unit's Wavefronts and of all the instructions this compute unit has issued. The constant KThrottle can be tuned to control the amount of throttling applied when a Wavefront loses locality. As shown in Figure 3, an increase in the Lost Locality Score of one Wavefront can result in other Wavefronts being pushed above the Cumulative_LLS_Cutoff (at time T1). The Wavefronts pushed above the Cumulative_LLS_Cutoff are not allowed to access the L1 cache. Thus, the Locality Scoring System offers more exclusive access to the Wavefronts having a high Lost Locality Score. The Lost Locality Score is reduced if there is no VTA hit, which causes the Wavefronts pushed over the Cumulative_LLS_Cutoff to fall below it again and regain access to the L1 cache (at time T3). Thus the algorithm dynamically controls the Wavefronts' access to the L1 cache, depending on the amount of locality being lost. The LLS cutoff test checks whether a Wavefront is trying to execute a load or store instruction; if it is, the test checks whether it can be given access to the L1 cache. The highest priority Wavefront that passes the LLS cutoff test is then scheduled to execute on the Streaming Multiprocessor.
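
    To tie the formula and the cutoff test together, the following is an illustrative C++ sketch of the scoring logic. The flat array of scores (instead of a sorted max heap), the decay step and the exact cutoff comparison are simplifying assumptions, not the paper's exact hardware organization.

```cpp
#include <algorithm>
#include <vector>

// Illustrative sketch of the Locality Scoring System update and cutoff test.
struct LocalityScoringSystem {
    double base_locality_score = 1.0;
    double k_throttle = 1.0;          // KThrottle from the formula above
    unsigned vta_hits_total = 0;      // VTA hits across all wavefronts on this SM
    unsigned inst_issued_total = 1;   // instructions issued by this SM (kept non-zero)
    std::vector<double> lls;          // one Lost Locality Score per wavefront

    double cumulative_cutoff() const {
        return lls.size() * base_locality_score;  // NumActiveWaves * BaseLocalityScore
    }

    // On a Victim Tag Array hit, raise the wavefront's score using the formula.
    void on_vta_hit(unsigned wid) {
        ++vta_hits_total;
        double updated = (double(vta_hits_total) / inst_issued_total)
                         * k_throttle * cumulative_cutoff();
        lls[wid] = std::max(lls[wid], updated);
    }

    // Periodic decay: with no VTA hits a score drops by one point per update,
    // but never below the BaseLocalityScore.
    void decay_all() {
        for (double& s : lls)
            s = std::max(base_locality_score, s - 1.0);
    }

    // Cutoff test for a wavefront about to issue a load or store: it may access
    // the L1 only if the scores of all wavefronts scoring at least as high as it
    // (everything below it in the sorted stack, plus itself) fit under the
    // cumulative cutoff. High-scoring wavefronts therefore keep cache access
    // while low-scoring ones are throttled.
    bool passes_cutoff(unsigned wid) const {
        double stacked = 0.0;
        for (double s : lls)
            if (s >= lls[wid]) stacked += s;
        return stacked <= cumulative_cutoff();
    }
};
```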

    4.4 Results The performance of Cache Conscious Wavefront Scheduling was analyzed by running a CUDA workload (Matrix Multiplication) on GPGPU-Sim. The performance statistics of the workload were recorded with warp scheduling done first by the Greedy Round Robin scheduling algorithm and then by the Cache Conscious Wavefront Scheduling algorithm. We observed a miss rate reduction of around twenty percent, which resulted in nearly doubled performance. Performance was measured as Instructions Per Clock (IPC) for the workload. The results obtained can be seen in Figure 4 and Figure 5.

    Figure 4. Miss Rate comparison between Round Robin scheduling and Cache Conscious Wavefront Scheduling

    Figure 5: IPC comparison between Round Robin Scheduling and Cache Conscious Wavefront Scheduling

    5. DYNAMIC WARP FORMATION Dynamic Warp Formation [2] can be used to decrease the amount of branch divergence in CUDA workloads. The basic idea is to detect the threads that have the same outcome for a branch instruction and group them together into one warp dynamically. This should reduce the amount of branch divergence in the workload, resulting in better performance.

    5.1 Baseline Architecture The baseline architecture used to evaluate the performance of Dynamic Warp Formation is very similar to the one used to evaluate Cache Conscious Wavefront Scheduling, and we use the same configuration for the simulated hardware. The baseline architecture and the simulated architecture specifications can be seen in Figure 1 and Table 1 respectively.

    5.2 Lane Aware Dynamic Warp Formation To support a large number of threads, GPUs usually divide the register file into banks, each bank accessible from a single lane. This requires a thread to run on the same SIMD lane whenever it is scheduled. Traditionally, this is done by statically assigning threads to warps, such that each thread executes on the same SIMD lane every time its warp is scheduled (Figure 6(c)). If the SIMD lane of a thread is changed when warps are formed dynamically, we might need to introduce a crossbar switch into the register file (Figure 6(b)). To avoid the crossbar, we can use Lane Aware Dynamic Warp Formation: a thread is scheduled to the same lane in any warp it is assigned to (Figure 6(d)). This means that a thread is assigned to a warp only if the warp does not already contain a thread in the same lane.

    5.3 Locality Aware Dynamic Warp Formation While forming the warps dynamically, we also need to make sure that we do not affect the intra-warp locality exhibited by the workload, if any. If the workload shows spatial locality within the statically formed warps, we do not want to reassign their threads, as this will cause memory divergence to occur. Memory divergence occurs when the memory requests from different threads in a warp cannot be coalesced into one request. This can negatively affect performance, which means that warps should be formed dynamically only if intra-warp locality is not affected.
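
    As a rough illustration of the coalescing condition behind memory divergence, the sketch below checks whether all addresses touched by a warp fall inside one 128-byte segment (the L1 line size from Table 1). Real coalescing hardware is more permissive, so this is only an assumption-laden approximation of when memory divergence occurs.

```cpp
#include <cstdint>
#include <vector>

// Rough coalescing check: a warp's accesses merge into one memory request only
// if every address falls inside a single 128-byte segment.
constexpr std::uint64_t kSegmentBytes = 128;

bool coalesces_into_one_request(const std::vector<std::uint64_t>& lane_addresses) {
    if (lane_addresses.empty()) return true;
    const std::uint64_t segment = lane_addresses.front() / kSegmentBytes;
    for (std::uint64_t addr : lane_addresses) {
        if (addr / kSegmentBytes != segment)
            return false;  // memory divergence: extra memory requests are needed
    }
    return true;
}
```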

    5.4 Hardware Implementation The hardware implementation of the idea requires a few changes in the design of the scheduler used in the baseline architecture. The idea can be well understood by referring to Figure 7.

    The last stage of the SIMD pipeline sends each thread id along with its next PC to the scheduler (Figure 7(a)). The scheduler contains a look-up table, called the PC-Warp LUT, which maintains one entry for each program counter that will be accessed by the threads in the next cycle. A current warp is formed from the threads that will be executing the same PC. Each entry contains the PC, the thread ids (IDX) and an occupancy vector (OCC), used to track which lanes of the current warp are free. This vector is compared against the request vector (REQ) of a thread, which indicates the lane required by the thread being assigned to this warp. If the lane is already occupied, a new warp is formed, and all the threads with lane conflicts are assigned to the newly formed warp.
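
    The following C++ sketch puts the PC-Warp LUT and the OCC/REQ check into code. The 32-wide warp, the home lane of (tid % 32) and the per-PC list of forming warps are illustrative assumptions, not the exact hardware tables.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Sketch of lane-aware dynamic warp formation using the PC-Warp LUT idea.
constexpr unsigned kWarpWidth = 32;

struct FormingWarp {
    std::uint32_t occ = 0;         // occupancy vector: bit i set if lane i is taken
    std::vector<int> thread_ids;   // threads grouped into this warp so far
};

class DynamicWarpFormer {
public:
    // A thread leaving the last pipeline stage reports its id and next PC.
    // Its home lane must stay fixed so the per-lane register banks can be read
    // without a crossbar (lane awareness).
    void add_thread(int tid, std::uint64_t next_pc) {
        const std::uint32_t req = 1u << (tid % kWarpWidth);  // request vector
        auto& warps = pc_warp_lut_[next_pc];
        for (FormingWarp& w : warps) {
            if ((w.occ & req) == 0) {        // lane is free in this forming warp
                w.occ |= req;
                w.thread_ids.push_back(tid);
                return;
            }
        }
        // Lane conflict with every warp at this PC: start a new warp.
        FormingWarp fresh;
        fresh.occ = req;
        fresh.thread_ids.push_back(tid);
        warps.push_back(std::move(fresh));
    }

private:
    // One list of forming warps per program counter.
    std::unordered_map<std::uint64_t, std::vector<FormingWarp>> pc_warp_lut_;
};
```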

    5.5 Issue Heuristics To utilize the SIMD lanes fully, we need to make sure that the set of Program Counters currently being executed is small. This requires the warps to progress at a similar rate, and the order in which the warps are issued has a critical effect on this [2]. We therefore need an effective issue heuristic to keep the number of Program Counters currently being executed low. A number of issue heuristics were explored in [2]; we assume the Majority issue logic, which works by choosing the most common PC among all the existing warps and issuing all warps at this PC before choosing a new PC.
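
    A minimal sketch of the Majority selection step is shown below; the ReadyWarp structure and the thread-count weighting are assumptions made for illustration.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Majority issue heuristic: pick the PC shared by the most threads among the
// warps waiting to issue, then keep issuing warps at that PC before choosing
// a new one.
struct ReadyWarp {
    std::uint64_t pc;        // next PC of this warp
    unsigned thread_count;   // active threads in the warp
};

std::uint64_t select_majority_pc(const std::vector<ReadyWarp>& ready) {
    std::unordered_map<std::uint64_t, unsigned> votes;
    for (const ReadyWarp& w : ready)
        votes[w.pc] += w.thread_count;       // each active thread votes for its PC

    std::uint64_t best_pc = 0;
    unsigned best_votes = 0;
    for (const auto& entry : votes) {
        if (entry.second > best_votes) {
            best_votes = entry.second;
            best_pc = entry.first;
        }
    }
    return best_pc;  // warps at this PC are issued first
}
```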

    Figure 6. Register file configuration for (a) ideal dynamic warp formation and MIMD, (b) naive dynamic warp formation, (c) static warp formation, (d) lane-aware dynamic warp formation. (Courtesy [2])

    Figure 7. Implementation of dynamic warp formation and scheduling. (Courtesy [2])

    5.6 Results The evaluation of Dynamic Warp Formation requires us to first find a workload that suffers from branch divergence and does not exhibit intra-warp spatial locality. We explored several workloads and finally settled on the BFS benchmark, which has very low warp occupancy and also exhibits memory divergence. Our measurements of warp occupancy and memory divergence can be seen in Figure 8 and Figure 9 respectively.

    We further investigated the amount of memory divergence in every iteration of BFS. The results for loads and stores can be seen in Figure 10.

    Further work is required to fully implement and evaluate Dynamic Warp Formation in the simulator.

    6. CONCLUSION We conclude by stating that Cache Conscious Wavefront Scheduling significantly improves GPU performance, especially for highly cache-sensitive workloads. As the number of hardware threads in a GPU continues to increase, the importance of an intelligent hardware scheduler like CCWS cannot be ignored. Dynamic Warp Formation also shows a lot of promise: it can greatly improve performance by decreasing the amount of branch divergence, which is one of the key reasons for performance degradation when we run a general purpose workload on a GPU. However, while forming the warps dynamically, care must be taken that we do not introduce memory divergence in the warps while trying to eliminate the branch divergence, as this might actually degrade performance.

    7. FUTURE WORK In the future, we can try to implement both Cache Conscious Wavefront Scheduling and Dynamic Warp Formation together on a single machine. An intelligent scheduler that preserves intra-warp locality using Cache Conscious Wavefront Scheduling while reducing branch divergence across warps using Dynamic Warp Formation could help improve performance even further.

    8. ACKNOWLEDGMENTS Firstly, I would like to thank our professor Dr. Carole-Jean Wu for her guidance and support throughout the project. I would also like to thank Mr. Akhil Arunkumar for his valuable inputs to the project, and my project partner Vignesh Soundararajan for his effort and contributions. The project would not have been possible without the contributions of Rogers, O'Connor, Aamodt and Bakhoda to the research community. I would like to thank them as well for proposing and publishing their novel ideas and for their contributions to the development of GPGPU-Sim, which was used to obtain all the results in this project.

    9. REFERENCES

    [1] Cache-Conscious Wavefront Scheduling. Rogers, O'Connor, and Aamodt. MICRO 2012.

    [2] Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. Fung, Sham, Yuan, and Aamodt. MICRO 2007.

    [3] Analyzing CUDA Workloads Using a Detailed GPU Simulator. Bakhoda et al. ISPASS 2009.

    [4] GPGPU-Sim Manual. (http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual)

    Figure 8: Warp occupancy for BFS, NW and MST

    Figure 9: Memory Coalescing in BFS

    Figure 10: Memory Coalescing for Kernel Loads (above) and Stores (below) in BFS