
Region-based Techniques for Modeling and Enhancing Cluster

OpenMP Performance

Jie Cai

August 2011

A thesis submitted for the degree of Doctor of Philosophy of the Australian National University


© Jie Cai 2011

This document was produced using TeX, LaTeX and BibTeX.


For my wife, Ruru, who greatly supported my PhD research... and my loving parents.


Declaration

I declare that the work in this thesis is entirely my own and that, to the best of my knowledge, it does not contain any materials previously published or written by another person, except where otherwise indicated.

Jie Cai


Acknowledgements

During my PhD, many people offered kind help and generous support. I would like to thank them and express my appreciation here.

Supervisors

Dr. Peter Strazdins, for the guidance and advice during my whole doctoral research; Dr. Alistair Rendell, for the time we spent together before paper deadlines; and Dr. Eric McCreath, for the useful comments.

Readers

For reading my thesis and providing valuable feedback, thank you: Warren Armstrong, Muhammad Atif, Michael Chapman, Pete Janes, Josh Milthorpe, Peter Strazdins, and Jin Wong.

Computer System Group Members

For the cheerful four years of my PhD, thank you, “geeks”: Joseph Anthony, Ting Cao, Elton Tian, Xi Yang, Fangzhou Xiao, and more...

Industry Partners

For their generous financial contribution to support my research: Australian Research Council, Intel, and Sun Microsystems (Oracle).

Last but definitely not least

For being so supportive, NCI NF colleagues: Ben Evans, Robin Humble, Judy Jenkinson, and David Singleton.


Abstract

Cluster OpenMP enables the use of the OpenMP shared memory programming model on distributed memory cluster environments. Intel has released a cluster OpenMP implementation called Intel Cluster OpenMP (CLOMP). While this offers better programmability than message passing alternatives such as the Message Passing Interface (MPI), the convenience comes with overheads that arise from maintaining the consistency of the underlying shared memory abstraction. CLOMP is no exception. This thesis introduces models for understanding these overheads in cluster OpenMP implementations like CLOMP, and proposes techniques for enhancing their performance.

Cluster OpenMP systems are usually implemented using page-based software distributed shared memory (sDSM) systems, which create and maintain a virtual global shared memory space in pages. A key issue for such systems is maintaining the consistency of this shared memory space. This is a major source of overhead, and it is driven by detecting and servicing page faults.

To investigate and understand these systems, we evaluate their performance with different OpenMP applications, and we also develop a benchmark, called MCBENCH, to characterize the memory consistency costs. Using MCBENCH, we discover that this overhead is proportional to the number of writers to the same shared page and to the number of shared pages.
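
As a concrete illustration, the Change/Read pattern that MCBENCH uses (described in detail with Figure 3.7) can be sketched in C/OpenMP as follows. This is a minimal sketch only, with illustrative sizes and names (A_SIZE, C_SIZE, mcbench), not the actual benchmark source:

    #include <omp.h>
    #include <string.h>

    #define A_SIZE (4 * 1024 * 1024)   /* array size a, e.g. 4 MB          */
    #define C_SIZE 4096                /* chunk size c, e.g. one 4 KB page */
    #define ITERS  10

    static char shared_array[A_SIZE];  /* lives in the sDSM shared space   */

    void mcbench(void)
    {
        long nchunks = A_SIZE / C_SIZE;
        #pragma omp parallel
        {
            int tid      = omp_get_thread_num();
            int nthreads = omp_get_num_threads();
            for (int it = 0; it < ITERS; it++) {
                /* Change phase: each thread writes the chunks it read in
                 * the previous iteration (round-robin, shifted by it).   */
                int w = (tid + it) % nthreads;
                for (long k = w; k < nchunks; k += nthreads)
                    memset(shared_array + k * C_SIZE, it, C_SIZE);
                #pragma omp barrier
                /* Read phase: read the chunks the neighbour just wrote,
                 * forcing consistency traffic on every shared page read. */
                int r = (tid + it + 1) % nthreads;
                volatile char sink = 0;
                for (long k = r; k < nchunks; k += nthreads)
                    for (long b = 0; b < C_SIZE; b++)
                        sink = shared_array[k * C_SIZE + b];
                (void)sink;
                #pragma omp barrier
            }
        }
    }

Because every chunk read in the Read phase was freshly written by another thread, each iteration exercises the sDSM consistency machinery on the whole array, which is what makes the cost proportional to the number of shared pages and writers.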

Furthermore, we divide an OpenMP program into separate parallel and serial regions. Based on these regions, we develop two region-based models to rationalize the numbers and types of the page faults and their associated performance costs. The models highlight the fact that the major overhead lies in servicing those page faults that require data (a page or its modifications, known as diffs) to be transferred across the network.
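
As a rough illustration of the shape such a model takes (the actual SDP formulations and their measured coefficients are developed in Chapter 4; the local/remote split of the fault count below is illustrative notation only):

    T_{segv} \approx N_f^{local}\, T_{segv,local}
              \;+\; N_f^{remote} \left( T_{segv,local} + T_{comm} \right)

where $N_f^{local}$ counts faults serviced without communication, $N_f^{remote}$ counts faults whose service requires a page or diff from another node, and $T_{comm}$ is the network transfer time; the second term is the dominant one identified by the models.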

With this understanding, we have developed three region-based prefetch (ReP) techniques based on the execution history of each parallel and sequential region. The first ReP technique (TReP) considers temporal page faulting behaviour between consecutive executions of the same region. The second (HReP) considers both the temporal page faulting behaviour between consecutive executions of the same region and the spatial paging behaviour within an execution of a region. The last (DReP) utilizes our proposed novel stride-augmented run-length encoding (sRLE) method to address both the temporal and the spatial page faulting behaviour between consecutive executions of the same region. These techniques effectively reduce the number of page faults and aggregate data (pages and diffs) into larger transfers, which better leverages the network bandwidth provided by interconnects.
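
To illustrate the sRLE idea concretely, the two-level record format described with Figure 5.7 could be declared in C as follows; the type and field names are hypothetical, not CLOMP's actual data structures (those are discussed in Chapter 6):

    /* Hypothetical two-level sRLE record layout (cf. Figure 5.7); these
     * are not CLOMP's actual structures.                                */
    typedef struct {
        long start_page_id;   /* ID of the first missed page in the run     */
        long common_stride;   /* common stride between consecutive page IDs */
        long run_length;      /* number of pages covered by this record     */
    } srle_level1_t;

    typedef struct {
        srle_level1_t first;  /* first first-level record of the run        */
        long common_stride;   /* stride between start pages of the records  */
        long run_length;      /* number of first-level records in the run   */
    } srle_level2_t;

Under such a scheme, a run of missed pages 2, 4, 6, 8 compresses to a single first-level record (2, 2, 4), and regularly spaced first-level records compress again at the second level.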

All three ReP techniques are implemented in the runtime libraries of CLOMP to enhance its performance. Both the original and the enhanced CLOMP are evaluated using the NAS Parallel Benchmark OpenMP (NPB-OMP) suite and two LINPACK OpenMP benchmarks on different hardware platforms, including two clusters connected with Ethernet and InfiniBand interconnects. The performance data is quantitatively analyzed and modeled. MCBENCH is also used to evaluate the impact of the ReP techniques on memory consistency cost.

The evaluation results demonstrate that, on average, CLOMP spends 75% and 55% of the overall elapsed time of the NPB-OMP benchmarks on the Gigabit Ethernet and double data rate InfiniBand networks respectively. With the ReP techniques implemented in the CLOMP runtime, these ratios are effectively reduced by ∼60% and ∼40% respectively. For the LINPACK benchmarks, with the assistance of sRLE, DReP significantly outperforms the other ReP techniques, effectively reducing page fault handling costs by 50% and 58% on the Gigabit Ethernet and InfiniBand networks respectively.


Contents

Declaration v
Acknowledgements vii
Abstract ix

I Introduction and Background 1

1 Introduction 3
  1.1 Motivation 4
    1.1.1 Research Objectives 5
  1.2 Contributions 6
    1.2.1 Performance Evaluation of CLOMP 6
    1.2.2 Region-based Performance Models 7
    1.2.3 Region-based Prefetch Techniques 8
  1.3 Thesis Structure 9

2 Background 11
  2.1 OpenMP 12
    2.1.1 OpenMP Directives 12
    2.1.2 Synchronization Operations 16
  2.2 Cluster OpenMP Systems 17
    2.2.1 Relaxed Memory Consistency 18
    2.2.2 Software Distributed Shared Memory Systems 19
    2.2.3 Intel Cluster OpenMP 23
    2.2.4 Alternative Approaches to sDSMs 26
  2.3 Related Work 29
    2.3.1 Performance Models 29
    2.3.2 Prefetch Techniques for sDSM Systems 31
    2.3.3 Run-Length Encoding Methods 35
  2.4 Summary 37

II Performance Issues of Intel Cluster OpenMP 39

3 Performance of Original Intel Cluster OpenMP System 41
  3.1 Hardware and Software Setup 42
  3.2 Performance of CLOMP 43
    3.2.1 NPB OpenMP Benchmarks Sequential Performance 44
    3.2.2 Comparison of CLOMP and Intel Native OpenMP on a Single Node 44
    3.2.3 CLOMP with Single Thread per Compute Node 48
    3.2.4 CLOMP with Multiple Threads per Compute Node 48
    3.2.5 Elapsed Time Breakdown for NPB-OMP Benchmarks 53
  3.3 Memory Consistency Cost of CLOMP 55
    3.3.1 Memory Consistency Cost Micro-Benchmark – MCBENCH 56
    3.3.2 MCBENCH Evaluation of CLOMP 57
  3.4 Summary 60

4 Region-Based Performance Models 65
  4.1 Regions of OpenMP Programs 66
  4.2 SIGSEGV Driven Performance (SDP) Models 67
    4.2.1 Critical Path Model 68
    4.2.2 Aggregated Model 70
    4.2.3 Coefficient Measurement 71
  4.3 SDP Model Verification 72
    4.3.1 Critical Path Model Estimates 73
    4.3.2 Aggregate Model Estimates 74
  4.4 Summary 75

III Optimizations: Design, Implementation and Evaluation 79

5 Region-Based Prefetch Techniques 81
  5.1 Limitations of Current Prefetch Techniques for sDSM Systems 82
    5.1.1 Parallel Application Examples 82
    5.1.2 Limitations 85
    5.1.3 Prefetch Technique Design Assumptions 88
  5.2 Evaluation Metrics of Prefetch Techniques 89
  5.3 Temporal ReP (TReP) Technique 90
  5.4 Hybrid ReP (HReP) Technique 90
  5.5 ReP Technique for Dynamic Memory Accessing Applications (DReP) 93
    5.5.1 Stride-augmented Run-length Encoded Page Fault Records 93
    5.5.2 Page Miss Prediction 95
  5.6 Offline Simulation 97
    5.6.1 Simulation Setup 97
    5.6.2 Simulation Results and Discussions 98
  5.7 Summary 106

6 Implementation and Evaluation 111
  6.1 ReP Prefetch Techniques Implementation Issues 112
    6.1.1 Data Structures 113
    6.1.2 New Region Notification 114
    6.1.3 Record Encoding and Flush Filter enabled Decoding 116
    6.1.4 Prefetch Page Prediction 116
    6.1.5 Prefetch Request and Event Handling 117
    6.1.6 Page State Transition 118
    6.1.7 Garbage Collection Mechanism 119
  6.2 Theoretical Performance of the ReP Enhanced CLOMP 120
  6.3 Performance Evaluation of the ReP Enhanced CLOMP 123
    6.3.1 MCBENCH 123
    6.3.2 NPB OpenMP Benchmarks 130
    6.3.3 LINPACK Benchmarks 138
    6.3.4 ReP Techniques with Multiple Threads per Process 142
  6.4 Summary 143

IV Conclusions and Future Work 147

7 Conclusions and Future Work 149
  7.1 Conclusions 150
    7.1.1 Performance Evaluation of CLOMP 150
    7.1.2 SIGSEGV Driven Performance Models 152
    7.1.3 Performance Enhancement by RePs 152
  7.2 Future Directions 155
    7.2.1 Performance Evaluation 156
    7.2.2 Performance Optimizations 156
    7.2.3 Adapting ReP Techniques to the Latest Technologies 156
    7.2.4 Potential Use of sRLE 157

V Appendices 159

A Algorithms Used in DReP 161
  A.1 Stride-augmented Run-length Encoding Algorithms 162
    A.1.1 Algorithm 1: Page Fault Record Reconstruction Step (a) 162
    A.1.2 Algorithm 2: Page Fault Record Reconstruction Step (b) 162
    A.1.3 Algorithm 3: Page Fault Record Reconstruction Step (c) 163
  A.2 Algorithm 4: DReP Predictor 163

B $T_{segv,local}$ and $N_{f,total}$ for Theoretical ReP Speedup Calculation 165
  B.1 NPB-OMP Benchmarks Datasheet 166
  B.2 LINPACK Benchmarks Datasheet 166

C TReP and DReP Performance Results of the NPB-OMP Benchmarks on a 4-node Intel Cluster 169
  C.1 Experimental Setup 170
  C.2 Sequential Elapsed Time 170
  C.3 TReP and DReP Evaluation 170
    C.3.1 Elapsed Time over Gigabit Ethernet 170
    C.3.2 Elapsed Time over DDR InfiniBand 173

D MultiRail Networks Optimization for the Communication Layer 177
  D.1 Introduction 178
  D.2 Micro-Benchmarks 178
    D.2.1 Design Issues 178
    D.2.2 Single-Rail Benchmark 179
    D.2.3 Multirail Benchmark 180
  D.3 Bandwidth and Latency Experiments 181
    D.3.1 Experimental Setup 182
    D.3.2 Latency 184
    D.3.3 Uni-directional Bandwidth 185
    D.3.4 Bi-directional Bandwidth 186
    D.3.5 Elapsed Time Breakdown 188
  D.4 Related Work on Multirail InfiniBand Network 190
  D.5 Challenge and Conclusion 191

E Performance of CAL 193
  E.1 Bandwidth and Latency of CAL 194
  E.2 Comparison Between OpenMPI and CAL 194

Bibliography 197


List of Figures

2.1 OpenMP fork-join multi-threading parallelism mechanism [93]. 13
2.2 OpenMP parallel directives and associated clauses in C and C++. 13
2.3 OpenMP for directives and associated clauses in C and C++. 14
2.4 An example OpenMP program in C using parallel for directives. 15
2.5 OpenMP synchronization directives in C and C++: (a) barrier, and (b) flush. 15
2.6 OpenMP threadprivate directive in C and C++. 16
2.7 Processes and threads in CLOMP. 23
2.8 State machine of CLOMP (derived from [47], [38], and experimental observation). 25
2.9 Illustration of two prefetch modes for the Adaptive++ technique. 34
3.1 Comparison of performance between native Intel OpenMP and CLOMP on an XE compute node. 45
3.2 Comparison of performance between native Intel OpenMP and CLOMP on a VAYU compute node. 46
3.3 Performance of CLOMP on XE with a single thread per compute node. 49
3.4 Performance of CLOMP on VAYU with a single thread per compute node. 50
3.5 Performance of CLOMP on XE with multiple threads per compute node. 52
3.6 Performance of CLOMP on VAYU with multiple threads per compute node. 53
3.7 MCBENCH – An array of a bytes is divided into chunks of c bytes. The benchmark consists of Change and Read phases that can be repeated for multiple iterations. Entering the Change phase of the first iteration, the chunks are distributed to the available threads (four in this case) in round-robin fashion. In the Read phase after the barrier, each thread reads from the chunk that its neighbour has written to. This is followed by a barrier, which ends the first iteration. For the subsequent iteration, the chunks to Change are the same as in the previous Read phase; that is, the shifting of the chunk distribution only takes place when moving from the Change to the Read phase. 57
4.1 Illustration of regions in an OpenMP parallel program. 67
4.2 Schematic illustration of the timing breakdown for a parallel region using the SDP model. 69
4.3 The algorithm used to determine the SDP coefficients. The code shown is in a parallel region. R is a private array while S is a shared one. Variables Dw and Dr represent reference times for accessing the private array R. 71
5.1 Pseudo code demonstrating the memory access patterns of the naive LINPACK OpenMP benchmark implementation for an n×n column-major matrix A with blocking factor nb. 83
5.2 Naive OpenMP LINPACK program with an n×n matrix: (a) memory access areas for different iterations; (b) page fault areas for different iterations. 84
5.3 Pseudo code demonstrating the memory access patterns of the optimized LINPACK OpenMP benchmark implementation for an n×n column-major matrix A with blocking factor nb. 86
5.4 Optimized OpenMP LINPACK program: (a) memory access areas for different iterations, illustrated on an n×n matrix panel; (b) page fault areas for different iterations, illustrated on the n×n matrix panel. 87
5.5 The page fault record entry for the TReP and HReP prefetch techniques. 90
5.6 A flowchart of the HReP predictor. 92
5.7 Two levels of the stride-augmented run-length encoding (sRLE) method: (a) based on the strides between consecutive pages, sorted missed pages are broken into small sub-arrays, and consecutive pages with the same stride are stored in the same sub-array; (b) the sub-arrays are compressed into first-level sRLE records in a (StartPageID, CommonStride, RunLength) format; (c) based on the stride between the start pages of consecutive first-level sRLE records, they are further compressed into the second-level sRLE format, (FirstLevelRecord, CommonStride, RunLength) (more details in Section 5.5.1). 94
5.8 Page fault record of a region execution reconstructed via the run-length encoding method. 95
5.9 The effective page miss rate reduction for different prefetch techniques on 2 threads (a), 4 threads (b) and 8 threads (c). 103
6.1 Intel Cluster OpenMP runtime structure. 112
6.2 Data structure for stride-augmented run-length encoded page fault records. 114
6.3 ReP prefetch record data structure. 115
6.4 User interactive interface of new region notification. 115
6.5 The round-robin prefetch request communication pattern. 118
6.6 New page state machine after introducing the Prefetched diff and Prefetched page states. 119
6.7 RePs vs. original CLOMP: MCBENCH with 4B chunk size over both the GigE and IB networks. (a) 64KB array size, (b) 4MB array size, (c) 8MB array size. 125
6.8 RePs vs. original CLOMP: MCBENCH with 2KB chunk size over both the GigE and IB networks. (a) 64KB array size, (b) 4MB array size, (c) 8MB array size. 127
6.9 RePs vs. original CLOMP: MCBENCH with 4KB chunk size over both the GigE and IB networks. (a) 64KB array size, (b) 4MB array size, (c) 8MB array size. 130
6.10 RePs vs. original CLOMP: BT speedup comparison on both GigE and IB networks. 133
6.11 RePs vs. original CLOMP: the naive LINPACK evaluation results comparison using an N×N matrix (N = 4096) with blocking factor NB = 64, via both GigE and IB. 140
6.12 RePs vs. original CLOMP: the optimized LINPACK evaluation results comparison using an N×N matrix (N = 8192) with blocking factor NB = 64, via both GigE and IB. 141
6.13 DReP vs. original CLOMP: the optimized LINPACK benchmark (N = 8192 and NB = 64) results comparison with multiple threads per process via both GigE and IB. (a) 2 threads per process, (b) 4 threads per process. 143
C.1 Speedup of the BT and CG benchmarks over Gigabit Ethernet. 171
C.2 Speedup of the IS and LU benchmarks over Gigabit Ethernet. 172
C.3 Speedup of the BT and CG benchmarks over DDR InfiniBand. 174
C.4 Speedup of the IS and LU benchmarks over DDR InfiniBand. 175
D.1 Single-rail bandwidth benchmark. 179
D.2 Multirail communication memory access pattern. 180
D.3 Non-threaded multirail bandwidth benchmark. 182
D.4 Threaded multirail benchmark design. 183
D.5 RDMA write latency comparison. 184
D.6 Uni-directional multi-port bandwidth. 185
D.7 Uni-directional multi-HCA bandwidth. 186
D.8 Bi-directional multi-port bandwidth. 187
D.9 Bi-directional multi-HCA bandwidth. 188
D.10 Benchmarks elapsed time breakdown for 512-byte messages. 188
D.11 Benchmarks elapsed time breakdown for 4KB messages. 189
D.12 Different ways to configure an InfiniBand multirail network [62]. 190


List of Tables

2.1 OpenMP synchronization operations. 17
3.1 Evaluation experimental hardware platforms. 43
3.2 Sequential elapsed time (sec) of NPB with CLOMP. 44
3.3 Page fault handling cost (SEGV cost) of CLOMP for the NPB benchmarks, as a ratio to the corresponding elapsed time, with a single thread per process on XE. 62
3.4 Page fault handling cost breakdown for CLOMP for the class A NPB benchmarks with multiple threads per process on XE. “SEGV” represents the ratio of page fault handling cost to the corresponding elapsed time; “SEGV Lock” in turn represents the ratio of pthread mutex cost within “SEGV”. 63
4.1 Critical path page fault counts for the NPB-OMP benchmarks run using CLOMP. 73
4.2 Comparison between observed and estimated speedup for running NPB class A and C on the AMD cluster with CLOMP. 77
4.3 Average relative errors for the predicted NPB speedups, evaluated using the critical path and aggregate (f = 0) SDP models and data from Table 4.2. 78
5.1 Threshold effects of ReP techniques for the naive LINPACK benchmark. 98
5.2 Simulation prefetch efficiency (E) and coverage ($N_u/N_f$) for the Adaptive++, TODFCM (1 page), TReP, HReP and DReP techniques. 108
5.3 Breakdown of prefetches issued by the different prefetch modes and the chosen list deployed in HReP. 109
5.4 Comparison of F-HReP and HReP with the LU benchmark. 109
6.1 Bandwidth and latency measured by the communication layer (CAL) of CLOMP. 121
6.2 ReP techniques prefetch efficiency and coverage for MCBENCH with a 4MB array. 123
6.3 Message transfer counts (×1000) comparison between the RePs-enhanced CLOMP and the original CLOMP for MCBENCH with 4B chunks. 126
6.4 Message transfer counts (×1000) comparison between the RePs-enhanced CLOMP and the original CLOMP for MCBENCH with 2KB chunks. 128
6.5 Message transfer counts (×1000) comparison between the RePs-enhanced CLOMP and the original CLOMP for MCBENCH with 4KB chunks. 129
6.6 Page fault handling costs comparison for the BT benchmark among the original CLOMP, the theoretical, and the ReP-techniques-enhanced CLOMP. The computation part of the elapsed time is common to all compared items. The page fault handling cost of the original CLOMP is presented in seconds, and those of the others are presented as a reduction ratio (e.g. $(Orig - TReP)/Orig$). 132
6.7 Page fault handling costs reduction ratio ($(T^{orig}_{segv} - T_{segv})/T^{orig}_{segv}$) comparison for the other NPB benchmarks. 135
6.8 Detailed $T_{segv}$ breakdown analysis of the IS class A benchmark for the ReP techniques. Overall $T_{segv}$ stands for overall CLOMP overhead. “TMK Comm” stands for the communication time spent by TMK for data transfer. “TMK local” stands for the local software overhead of the TMK layer. “ReP Comm” stands for the communication time spent on prefetching data. “ReP local” stands for the local software overhead introduced by using the ReP prefetch techniques. $T_{segv}$ is presented in seconds and its components are presented as a ratio to the overall $T_{segv}$. 136
6.9 Detailed $T_{segv}$ breakdown analysis of the IS class C benchmark for the ReP techniques. Overall $T_{segv}$ stands for overall CLOMP overhead. “TMK Comm” stands for the communication time spent by TMK for data transfer. “TMK local” stands for the local software overhead of the TMK layer. “ReP Comm” stands for the communication time spent on prefetching data. “ReP local” stands for the local software overhead introduced by using the ReP prefetch techniques. $T_{segv}$ is presented in seconds and its components are presented as a ratio to the overall $T_{segv}$. 137
6.10 Sequential elapsed time for the LINPACK benchmarks. 138
6.11 Page fault handling costs comparison for the LINPACK benchmarks among the original CLOMP, the theoretical, and the ReP-techniques-enhanced CLOMP. The computation part of the elapsed time is common to all compared items. The page fault handling cost of the original CLOMP is presented in seconds, and those of the others are presented as a reduction ratio (e.g. $(Orig - TReP)/Orig$). 139
6.12 Page fault handling cost comparison between DReP and the original CLOMP for the optimized LINPACK benchmark with multiple threads per process. “SEGV” represents the ratio of page fault handling cost to the corresponding elapsed time; “SEGV Lock” in turn represents the ratio of pthread mutex cost within “SEGV”. 142
B.1 $T_{segv,local}$ (sec) for some NPB-OMP benchmarks with different numbers of processes. 166
B.2 $N_{f,total}$ for some NPB-OMP benchmarks with different numbers of processes. 167
B.3 $T_{segv,total}$ (sec) for the LINPACK benchmarks with different numbers of processes. 167
B.4 $N_{f,total}$ for the LINPACK benchmarks with different numbers of processes. 167
C.1 Elapsed time (sec) of some NPB-OMP benchmarks on one thread. 170
E.1 Complete bandwidth and latency measured by the communication layer (CAL) of CLOMP on XE. 194
E.2 Comparison of CAL and OpenMPI: bandwidth and latency measured on XE via GigE. 195
E.3 Comparison of CAL and OpenMPI: bandwidth and latency measured on XE via DDR IB. 195
