COOPERATIVE BATCH SCHEDULING FOR HPC SYSTEMS
BY
XU YANG
Submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer Science in the Graduate College of the Illinois Institute of Technology
Approved
Advisor
Chicago, Illinois
May 2017
ACKNOWLEDGMENT
I give my greatest gratitude to my thesis advisor, Professor Zhiling Lan. It has
been a really enjoyable research experience working with her over the past five years.
She gave me her relentless help when I first entered the PhD program searching
for research topics. I also appreciate the freedom she gave me for exploring new
research areas, even when she was aware of the possibility of failure. It would
have been impossible for me to finish the thesis work without her help and guidance. Her
motivation, devotion, and enthusiasm always inspire me in my PhD study. Moreover,
I would like to give my sincere gratitude to my thesis committee: Professor Ioan
Raicu, Professor Dong Jin, and Professor Jia Wang. They have given me great help
in finishing my thesis work. I would also like to thank the people I worked with
at Argonne National Laboratory. I want to thank Dr. Robert B. Ross, who gave
me the great opportunity to work on some exciting research topics in his group at
ANL. I also want to thank Dr. John Jenkins and Dr. Misbah Mubarak, who helped me
polish my ideas and work at ANL. I have learned a lot from them!
I really appreciate the companionship of my fellow colleagues at Illinois Institute
of Technology. I would like to thank all the members of the SPEAR group, SCS group,
Datasys group, and JinLab, for the countless nights we worked together in the lab, for
the thought-provoking discussions, and for all the fun we had in the past five years.
My deepest gratitude goes to my family. My parents and my brother gave
me their unconditional love and support throughout my life. Their motivation and
encouragement helped me survive some tough days in the past five years. The most
beautiful thing that happened to me in the past five years was meeting Dr.
Xingye Kan, whom I married and who became my soulmate. I thank her for the immense
sacrifice she made to support my research. Her wisdom, persistence, and unflinching
courage will always be my strongest support in the journey ahead of us.
TABLE OF CONTENTS
Page
ACKNOWLEDGMENT . . . . . . . . . . . . . . . . . . . . . . . . . iii
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
CHAPTER
1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . 1
1.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2. Contributions . . . . . . . . . . . . . . . . . . . . . . . 5
1.3. Outline . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2. BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . 13
2.1. HPC Systems . . . . . . . . . . . . . . . . . . . . . . . 13
2.2. Batch Scheduler . . . . . . . . . . . . . . . . . . . . . 14
2.3. Workload Trace . . . . . . . . . . . . . . . . . . . . . . 15
2.4. Application Communication Trace . . . . . . . . . . . . . 15
2.5. Simulation Tool . . . . . . . . . . . . . . . . . . . . . 15
3. ENERGY COST AWARE SCHEDULING DRIVEN BY DYNAMIC PRICING ELECTRICITY . . . . . . . . . . . 17
3.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2. Related Work . . . . . . . . . . . . . . . . . . . . . . . 20
3.3. Problem Description . . . . . . . . . . . . . . . . . . . 23
3.4. Methodology . . . . . . . . . . . . . . . . . . . . . . . 26
3.5. Evaluation Methodology . . . . . . . . . . . . . . . . . . 30
3.6. Experiment Results . . . . . . . . . . . . . . . . . . . . 33
3.7. Case Study . . . . . . . . . . . . . . . . . . . . . . . . 40
3.8. Summary . . . . . . . . . . . . . . . . . . . . . . . . . 42
4. LOCALITY AWARE SCHEDULING ON TORUS CONNECTED SYSTEMS . . . . . . . . . . . . . . . . . . 44
4.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2. Related Work . . . . . . . . . . . . . . . . . . . . . . . 47
4.3. Design Overview . . . . . . . . . . . . . . . . . . . . . 49
4.4. Scheduling Strategy . . . . . . . . . . . . . . . . . . . 51
4.5. Evaluation Methodology . . . . . . . . . . . . . . . . . . 58
4.6. Experiment Results . . . . . . . . . . . . . . . . . . . . 61
4.7. Summary . . . . . . . . . . . . . . . . . . . . . . . . . 66
5. JOB INTERFERENCE ANALYSIS ON TORUS CONNECTED SYSTEMS . . . . . . . . . . . . . . . . . . 68
5.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2. Application Study . . . . . . . . . . . . . . . . . . . . 70
5.3. Research Vehicle . . . . . . . . . . . . . . . . . . . . . 73
5.4. Interference Analysis . . . . . . . . . . . . . . . . . . 74
5.5. Discussion . . . . . . . . . . . . . . . . . . . . . . . . 81
5.6. Related Work . . . . . . . . . . . . . . . . . . . . . . . 82
5.7. Conclusions . . . . . . . . . . . . . . . . . . . . . . . 84
6. JOB INTERFERENCE ANALYSIS ON DRAGONFLY CONNECTED SYSTEMS . . . . . . . . . . . . . . . . 86
6.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.2. Background . . . . . . . . . . . . . . . . . . . . . . . . 88
6.3. Methodology . . . . . . . . . . . . . . . . . . . . . . . 91
6.4. Study of Parallel Workload I . . . . . . . . . . . . . . . 95
6.5. Study of Parallel Workload II . . . . . . . . . . . . . . 101
6.6. Hybrid Job Placement . . . . . . . . . . . . . . . . . . . 104
6.7. Other Placement Policies . . . . . . . . . . . . . . . . . 106
6.8. Related Work . . . . . . . . . . . . . . . . . . . . . . . 110
6.9. Summary . . . . . . . . . . . . . . . . . . . . . . . . . 112
7. CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . 115
7.1. Summary of Contributions . . . . . . . . . . . . . . . . . 115
7.2. Future Research . . . . . . . . . . . . . . . . . . . . . 117
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
LIST OF TABLES
Table Page
3.1 Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Electricity bill savings obtained by our scheduling policies on ANL-BGP. In each cell, the top number is the electricity bill saving obtained by Greedy and the bottom number is the electricity bill saving obtained by Knapsack . . . . . . 36
3.3 Electricity bill savings obtained by our scheduling policies on SDSC-BLUE. In each cell, the top number is the electricity bill saving obtained by Greedy and the bottom number is the electricity bill saving obtained by Knapsack . . . . . . 37
3.4 Electricity bill savings obtained by our scheduling policies under different scheduling frequencies. In each cell, the top number is on ANL-BGP and the bottom number is on SDSC-BLUE. . . . . . . 38
3.5 System utilization rate under different scheduling frequencies. In each cell, the top number is on ANL-BGP and the bottom number is on SDSC-BLUE. . . . . . . 39
4.1 System utilization improvement obtained by our design using B&B and Greedy as against the Default scheduler. In each cell, the number on top is the improvement achieved by using B&B, the bottom number is the improvement achieved by using Greedy. . . . . . . 62
4.2 Average job wait time improvement obtained by our design using B&B and Greedy as against the Default scheduler. In each cell, the number on top is the improvement achieved by using B&B, the bottom number is the improvement achieved by using Greedy. . . . . . . 63
4.3 Average job response time improvement obtained by our design using B&B and Greedy as against the Default scheduler. In each cell, the number on top is the improvement achieved by using B&B, the bottom number is the improvement achieved by using Greedy. . . . . . . 64
4.4 System utilization improvement obtained by our design using B&B and Greedy as against the Default scheduler. In each cell, the number on top is the improvement achieved by using B&B, the bottom number is the improvement achieved by using Greedy. . . . . . . 65
4.5 Average job wait time improvement obtained by our design using B&B and Greedy as against the Default scheduler. In each cell, the number on top is the improvement achieved by using B&B, the bottom number is the improvement achieved by using Greedy. . . . . . . 66
4.6 Average job response time improvement obtained by our design using B&B and Greedy as against the Default scheduler. In each cell, the number on top is the improvement achieved by using B&B, the bottom number is the improvement achieved by using Greedy. . . . . . . 67
6.1 Nomenclature for different placement and routing configurations . . 93
6.2 Summary of Applications . . . . . . . . . . . . . . . . . . . . . 94
6.3 Three different random placement and routing configurations . . . 107
LIST OF FIGURES
Figure Page
1.1 Problems and proposed solutions . . . . . . . . . . . . . . . . 6
1.2 Batch scheduling system for HPC machines . . . . . . . . . . . 9
3.1 Job Power Distribution on BGQ . . . . . . . . . . . . . . . . . 18
3.2 Job scheduling using FCFS (left) and our job power aware design at on-peak time (top right) and off-peak time (bottom right). For each job, its color represents its power profile, where dark color indicates power expensive and light color indicates power efficient. . . . . . . 25
3.3 Overview of Job Power Aware Scheduling . . . . . . . . . . . . 27
3.4 Job size distribution of ANL-BGP (A) and SDSC-BLUE (B) . . . 31
3.5 Baseline Results . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.6 Cost saving for SDSC-BLUE and ANL-BGP. . . . . . . . . . . 34
3.7 Wait time improvement for SDSC-BLUE and ANL-BGP workloads. 35
3.8 Job characteristics in December of 2012 on the 48-rack Mira machine. Each red point indicates a job submission . . . . . . . 40
3.9 The average daily system utilization . . . . . . . . . . . . . . . 42
3.10 The average daily power consumption . . . . . . . . . . . . . . 43
4.1 Typical job scheduling uses the First-Come First-Serve (FCFS) scheduling policy. Jobs are removed from the wait queue and assigned free nodes one by one. The grey squares represent busy nodes occupied by running jobs. The green squares represent free nodes. . . . 46
4.2 Overview of our window-based locality-aware scheduling design. The job prioritizing module maintains a “window” of jobs retrieved from the wait queue, and the resource management module keeps a list of slots. Each slot represents a contiguous set of available nodes. Our scheduling design allocates a “window” of jobs to a list of slots at a time. . . . . . . 50
4.3 Decision tree generated for finding the optimal solution by using the Branch and Bound algorithm. There are 2 knapsacks and 3 jobs (m = 2, n = 3). . . . . . . 54
4.4 Scheduling result comparison between the default scheduler and our design. The default scheduler (Subfigure A) makes the job prioritizing sequence 〈A,B,C,D〉, and the allocations for jobs A and B are fragmented; node 20 is left idle. Our design (Subfigure B) can optimize so that every job gets a compact allocation and no node is left idle. The prioritizing sequence obtained by our design is 〈C,A,B,E〉. . . . . . . 58
4.5 Job size distribution of ANL-Intrepid and SDSC-BLUE . . . . . 60
5.1 Multiple jobs running concurrently with different allocations. Each job is represented by a specific color. a) shows the effect of contiguous allocation, which reduces the inter-job interference. b) shows non-contiguous allocation, which may introduce both intra- and inter-job interference. . . . . . . 69
5.2 AMG communication matrix. The labels of both the x and the y axis are the indices of MPI ranks in AMG. The legend bar on the right indicates the data transfer amount between ranks. . . . . . . 71
5.3 Crystal Router communication matrix. The labels of both the x and the y axis are the indices of MPI ranks in CrystalRouter. The legend bar on the right indicates the data transfer amount between ranks. . . . 72
5.4 MultiGrid communication matrix. The labels of both the x and the y axis are the indices of MPI ranks in MultiGrid. The legend bar on the right indicates the data transfer amount between ranks. . . . 73
5.5 Contiguous allocation in three different shapes. Red is a 3D balancedcube, green a 3D unbalanced cube, and blue a 2D mesh. . . . . . 76
5.6 Data transfer time of AMG, Crystal Router, and MultiGrid on 2D mesh, 3D unbalanced, and 3D balanced allocation. . . . . . . 77
5.7 Data transfer time of AMG, Crystal Router, and MultiGrid on 3D balanced allocation using different mapping strategies. . . . . . . 78
5.8 Noncontiguous allocation. Each job is represented by a specific color. The nodes assigned to different jobs are interleaved; the sizes of the allocation units are 16, 8, and 2. . . . . . . 79
5.9 Interjob interference study: “cont” indicates three applications running side by side concurrently on the same network with contiguous allocation. To study the impact of noncontiguous allocation on inter-job interference, applications are run concurrently with interleaved allocations of different unit sizes, namely, 16 node, 8 node, and 2 node. . . . . . . 80
6.1 Five-group slice of a 19-group dragonfly network. Job J1 is allocated using random placement, while Job J2 is allocated using contiguous placement. . . . . . . 89
6.2 Aggregate traffic and saturation time for Workload I under the configurations listed in Table 6.3. “CA” and “CPA” have equivalent behavior. . . . . . . 95
6.3 Communication time distribution across application ranks in Workload I. . . . . . . 97
6.4 Aggregate workload traffic for routers serving specific applications. “CA” and “CPA” have equivalent behavior. More routers are involved in serving each application when random placement is in use, compared to contiguous placement. . . . . . . 98
6.5 Aggregate traffic and saturation time for Workload II under the configurations listed in Table 6.3. . . . . . . 102
6.6 Communication time distribution across application ranks in Workload II. The “bully”, sAMG, benefits from random placement and adaptive routing, while the “bullied”, MultiGrid and CrystalRouter, suffer performance degradation. . . . . . . 102
6.7 Aggregate workload traffic for routers serving specific applications. More routers are involved in serving each application when random placement is in use, compared to contiguous placement. . . . . . . 104
6.8 Application communication time. Workload I is running with all placement and routing configurations. Methods prefixed with “H” represent the hybrid allocation approach. . . . . . . 105
6.9 Application communication time. Workload I is running with three different random placement policies coupled with three routing configurations. . . . . . . 108
ABSTRACT
The batch scheduler is an important piece of system software serving as the interface
between users and HPC systems. Users submit their jobs via the batch scheduling portal,
and the batch scheduler makes a scheduling decision for each job based on its request
for system resources and on system availability. Jobs submitted to HPC systems are
usually parallel applications whose lifecycle consists of multiple running phases,
such as computation, communication, and input/output. The running
of such parallel applications can therefore involve various system resources, such as power,
network bandwidth, I/O bandwidth, and storage, most of which
are shared among concurrently running jobs. However, today's batch schedulers
do not take the contention and interference between jobs over these resources into
consideration when making scheduling decisions, which has been identified as one of the
major culprits for both system and application performance variability.
In this work, we propose a cooperative batch scheduling framework for HPC
systems. The motivation of our work is to take important factors about jobs and the
system, such as job power, job communication characteristics, and network topology,
into account when making orchestrated scheduling decisions, so as to reduce the contention between con-
currently running jobs and to alleviate performance variability. Our contributions
are the design and implementation of several coordinated scheduling models and algo-
rithms for addressing some chronic issues in HPC systems. The proposed models and
algorithms have been evaluated by means of simulation using work-
load traces and application communication traces collected from production HPC
systems. Preliminary experimental results show that our models and algorithms can
effectively improve application and overall system performance, reduce HPC facilities'
operation costs, and alleviate the performance variability caused by job interference.
CHAPTER 1
INTRODUCTION
1.1 Motivation
High performance computing (HPC) systems comprise hundreds of thou-
sands of compute nodes connected by large-scale interconnection networks. The insa-
tiable demand for computing power from many scientific areas continues to drive the
evolution of ever-growing HPC systems. Exascale HPC systems, projected to arrive
by 2023, are expected to face many challenges [1]. Some of the most prominent
challenges include ever-increasing power consumption and energy cost, network
contention and job interference, and concurrency and locality. These challenges demand
significant changes and great technical breakthroughs in many aspects of the current
HPC system software and hardware stack. In order to harness the great potential
of large-scale HPC systems, many research studies have focused on bringing
solutions to these challenges at different layers of the system, such as new hardware
designs, storage hierarchies, operating systems, and all kinds of system software.
The batch scheduler, serving as the interface between users and the HPC sys-
tem, is an integral part of the HPC system software stack. Users submit their
applications (jobs) via the batch scheduling portal; their jobs are scheduled and
dispatched by the batch scheduler to the HPC system for execution, and the
results are returned to the users. Jobs submitted to HPC systems are usually parallel applications
whose lifecycle consists of multiple running phases, such as computation, commu-
nication, and input/output. The running of such parallel applications can therefore
involve various system resources, such as power [2, 3], network bandwidth [4–11],
I/O bandwidth [12, 13], and storage [14, 15], most of which
are shared among concurrently running jobs. It has been identified that the con-
tention and interference among concurrently running jobs over these shared resources
are the major culprits for both the system and application performance variability.
The motivation of this work is to explore possible batch scheduling algorithms and
methodologies to alleviate the performance variability introduced by job interference.
Traditional HPC batch schedulers make scheduling decisions based
on the number of required nodes and the expected run time of each job; they do
not take the contention and interference between jobs over shared resources into con-
sideration. Making scheduling decisions without coordination can lead to severe
contention and interference among concurrently running jobs, which in turn causes
performance loss and variability. For example, when the batch scheduler allocates
computing nodes without any knowledge about jobs' communication patterns, there
may be network contention between concurrently running jobs due to overlapped
communication paths. The network contention introduces interference between those
jobs, causing severe performance degradation. We believe it is urgent to improve
today's batch schedulers with a more orchestrated scheduling methodology. The de-
sign of such an orchestrated batch scheduling framework requires a deep understanding
of the problems we target to solve. In this work, we provide an in-depth study
of three major problems that exascale HPC systems are likely to face.
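As a point of reference, the traditional scheduling model described above can be sketched as a queue ordered by arrival time and matched against the free-node count only, with no awareness of power, communication, or topology. The job records below are illustrative, not from the thesis:

```python
# Minimal sketch of a traditional FCFS batch scheduler: jobs are
# described only by (job_id, nodes_requested), and the scheduler
# starts jobs strictly in arrival order while the head job fits.
from collections import deque

def fcfs_schedule(queue, free_nodes):
    """queue: deque of (job_id, nodes_requested). Mutates the queue;
    returns (list of started job ids, remaining free nodes)."""
    started = []
    while queue and queue[0][1] <= free_nodes:
        job_id, nodes = queue.popleft()
        free_nodes -= nodes
        started.append(job_id)
    # Head-of-line blocking: if the front job does not fit, everything
    # behind it waits, no matter how small those jobs are.
    return started, free_nodes

queue = deque([("J1", 64), ("J2", 256), ("J3", 32)])
started, free = fcfs_schedule(queue, 128)  # J2 blocks J3
```

Note how J3 (32 nodes) waits behind J2 even though it would fit; this head-of-line blocking is one reason production schedulers add backfilling and, as argued here, richer coordination.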
1. Energy Cost. As HPC systems continue to grow, so does their energy con-
sumption. The cost of power consumption is now a leading component
of the total cost of ownership (TCO) of HPC systems. A typical current petas-
cale system on average consumes 2-7 MW of power [16]. Case in point: the
Argonne Leadership Computing Facility (ALCF) budgets approximately $1M
annually for electricity to operate its primary supercomputer [2, 17]. Based on
current projections, exascale supercomputers will consume 60-130 MW, which
will prove to be an unbearable burden for any facility. Therefore, energy cost
savings are crucial for reducing the operational cost of exascale systems.
There is a significant body of research on improving energy ef-
ficiency for HPC systems, most of it focusing on the following top-
ics: energy-efficient or energy-proportional hardware, dynamic voltage and fre-
quency scaling (DVFS) techniques, shutting down hardware components at low
system utilization, power capping, and thermal management. Orthogo-
nal to existing studies, our research focuses on reducing the electricity bill of HPC
systems.
2. Topology-aware Resource Allocation. As the scale of supercomputers in-
creases, so does that of their interconnection networks. Torus interconnects are widely
used in HPC systems, such as the Cray XT/XE and IBM Blue Gene series sys-
tems [18, 19], due to their linear per-node cost scaling and their competitive
overall performance. A growing network means an increasing network diameter
(i.e., the maximum distance between a pair of nodes) and a decreasing bisection
bandwidth relative to the number of nodes. Consequently, applications running
on torus-connected systems suffer great performance variability caused by the
increasing network scale. The traditional batch scheduler makes job placements
without considering jobs' communication characteristics and system topology.
Currently, two allocation strategies are commonly used on torus-connected sys-
tems. One is the so-called partition-based approach, where the scheduler assigns
each user job a compact and contiguous set of computing nodes. IBM Blue
Gene series systems fall into this category [18]. This strategy favors ap-
plication performance by preserving the locality of allocated nodes and reducing
network contention caused by concurrently running jobs sharing network band-
width. However, this strategy can cause internal fragmentation (when more
nodes are allocated to a job than it requests) and external fragmentation (when
sufficient nodes are available for a request, but they cannot be allocated con-
tiguously), therefore leading to poor system performance (e.g., low system uti-
lization and high job response time) [20]. The other is non-contiguous allocation,
where free nodes are assigned to user jobs regardless of whether they are
contiguous. Cray XT/XE series systems fall into this category [19]. Non-
contiguous allocation eliminates the internal and external fragmentation seen in
partition-based systems, thereby leading to high system utilization. Neverthe-
less, it introduces other problems, such as scattering application processes all
over the system. Non-contiguous node allocation can make inter-process
communication less efficient and cause network contention among concurrently
running jobs [21], thereby resulting in poor job performance, especially for
communication-intensive jobs. Partition-based allocation achieves good job
performance by sacrificing system performance (e.g., poor system utilization),
whereas non-contiguous allocation can result in better system performance but
could severely degrade the performance of user jobs (e.g., prolonged wait time and
run time). As systems continue growing in size, a fundamental problem arises:
how to effectively balance job performance with system performance on torus-
connected machines?
3. Network Contention and Job Interference. Supercomputers are usually
employed as a shared resource accommodating many parallel applications (jobs)
running concurrently. These parallel jobs share system infrastructure such
as network and I/O bandwidth, and inevitably there is contention over these
shared resources. As supercomputers continue to evolve, these shared resources
are increasingly the bottleneck for performance. Typically, multiple jobs
run concurrently on the system, resulting in the shared use of resources,
particularly network links. A prominent problem with network sharing is the re-
sulting contention, which can cause communication variability and performance
degradation in affected jobs [4]. This performance degradation can propagate
into the queueing time of subsequently submitted jobs, thus leading to lower
system throughput and utilization [22].
On the widely used torus-connected HPC systems [23–25], two allocation strate-
gies are commonly used. The contiguous allocation strategy assigns to each job
a compact and contiguous set of computing nodes. The partition-based alloca-
tion approach used in Blue Gene series systems is an example of such a strategy
[26]. The contiguous strategy favors application performance through isolated
networking within a partition and the locality that it implies. However, this strat-
egy can cause both internal fragmentation (when more nodes are allocated to
a job than it requests) and external fragmentation (when sufficient nodes are
available for a request, but they cannot be allocated contiguously), therefore
leading to lower system utilization than is otherwise possible. On the other side
of the coin, the non-contiguous allocation strategy, used by the Cray XT/XE
series [27], assigns free nodes to jobs regardless of contiguity, though of course
efforts are made to maximize locality. While eliminating the internal and external
fragmentation seen in contiguous allocation systems, non-contiguous
allocations in return introduce network contention between jobs due to the interleaving
of job nodes. The non-contiguous placement policy can significantly reduce job
performance, especially for communication-intensive jobs [4, 9, 28]. Network
contention between concurrently running jobs also exists on current HPC
systems with dragonfly networks [5, 8]. How to intelligently place jobs
on exascale HPC systems to avoid network contention remains a
challenging problem.
1.2 Contributions
In this dissertation, we present a series of novel batch scheduling algorithms
and methodologies to solve the problems identified above. For each problem, we
have proposed a dedicated solution, as shown in Figure 1.1. The contributions of this
dissertation are the design and implementation of each dedicated solution.
[Figure 1.1 diagram: origins of problems (computing node, network) mapped to problems (energy cost, fragmentation, network contention, job interference) and to the proposed solutions (energy cost aware scheduling, locality aware scheduling, topology aware scheduling & allocation).]
Figure 1.1. Problems and proposed solutions
1. Energy Cost-aware Scheduling. There is a significant body of research
on improving energy efficiency for HPC systems, most of it
focusing on the following topics: energy-efficient or energy-proportional hard-
ware, dynamic voltage and frequency scaling (DVFS) techniques, shutting down
hardware components at low system utilization, power capping, and thermal
management. Orthogonal to existing studies, we focus on reducing the
electricity bill of HPC systems via a smart job scheduling mechanism. The
rationale is based on a key observation about HPC jobs: parallel jobs have distinct
power consumption profiles [29, 17]. We also notice that dynamic electricity
pricing policies have been widely adopted in Europe, North America, Oceania,
and parts of Asia. For example, in the U.S., wholesale electricity prices vary
by as much as a factor of 10 from one hour to the next [30]. Under dynamic
pricing, the power grid alternates within a day between on-peak time (when it bears a heavier burden and
consequently the electricity price is higher) and off-peak time (when there is
less demand for electricity and the price is lower). The
novelty of our energy cost-aware scheduling mechanism is that it can reduce the sys-
tem's electricity bill by scheduling and dispatching jobs according to their power
profiles and the real-time electricity price, while causing negligible impact on
the system's utilization and scheduling fairness. Preferentially, it dispatches the
jobs with higher power consumption during the off-peak period, and the jobs
with lower power consumption during the on-peak period. We formalize this
scheduling problem as a standard 0-1 knapsack model, based on which we
apply dynamic programming to efficiently solve the scheduling problem. The
derived 0-1 knapsack model enables us to reduce energy cost during high elec-
tricity pricing periods with no or limited impact on system utilization.
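To make the 0-1 knapsack formulation concrete, the following sketch picks, during an on-peak period, the subset of ready jobs that maximizes the node count dispatched (a proxy for utilization) subject to a total power budget, via the standard dynamic program. The job tuples, power budget, and objective are illustrative assumptions, not the thesis's exact model:

```python
# 0-1 knapsack via dynamic programming: items are jobs, weight is
# per-job power draw (kW, integer), value is requested node count,
# capacity is the on-peak power budget.

def knapsack_dispatch(jobs, power_budget):
    """jobs: list of (job_id, nodes, power_kw). Returns
    (best total nodes, list of selected job ids)."""
    n = len(jobs)
    # dp[p] = best total nodes achievable with total power <= p
    dp = [0] * (power_budget + 1)
    choice = [[False] * (power_budget + 1) for _ in range(n)]
    for i, (_, nodes, power) in enumerate(jobs):
        # Iterate capacities downward so each job is used at most once.
        for p in range(power_budget, power - 1, -1):
            if dp[p - power] + nodes > dp[p]:
                dp[p] = dp[p - power] + nodes
                choice[i][p] = True
    # Backtrack to recover which jobs were selected.
    selected, p = [], power_budget
    for i in range(n - 1, -1, -1):
        if choice[i][p]:
            selected.append(jobs[i][0])
            p -= jobs[i][2]
    return dp[power_budget], selected[::-1]

# Hypothetical ready queue: (id, nodes, power in kW), budget 120 kW.
jobs = [("A", 512, 60), ("B", 256, 40), ("C", 1024, 90), ("D", 128, 20)]
best_nodes, picked = knapsack_dispatch(jobs, 120)
```

With these made-up numbers the DP prefers C and D (1152 nodes at 110 kW) over A and B (768 nodes at 100 kW), illustrating how power-heavy but high-value jobs can still be chosen when they fit the budget.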
2. Locality-aware Scheduling.
Partition-oriented schedulers achieve good job performance by sacri-
ficing system performance (e.g., poor system utilization), whereas non-partition-
oriented schedulers can result in better system performance but could severely
degrade the performance of user jobs (e.g., prolonged wait time and run time) [26, 27].
We present a new scheduling design combining the merits of partition-based and non-
contiguous scheduling for torus-connected machines. In this scheduling
mechanism, the batch scheduler takes a “window” of jobs (i.e., multiple jobs)
into consideration when making prioritizing and allocation decisions, to prevent
short-term decisions from obfuscating future optimization. The job prioritizing
module maintains a “window” of jobs, and jobs are placed into the win-
dow so as to maintain job fairness (e.g., through FCFS). Rather than allocating jobs
one by one from the head of the wait queue as existing schedulers do, we make
scheduling decisions for a “window” of jobs at a time. The resource allocation
module takes a contiguous set of nodes as a slot and maintains a list of such
slots. These slots have different sizes, and each may accommodate one or more
jobs. The allocation of the jobs in the window onto the slot list is conducted in
such a way as to maximize system utilization. We formalize the allocation of a
window of jobs to a list of slots as a 0-1 Multiple Knapsack Problem (MKP) and
present two algorithms, namely Branch&Bound and Greedy, to solve the MKP.
A series of trace-based simulations using job logs collected from production su-
percomputers indicates that this new scheduling design has real potential and
can effectively balance job performance and system performance.
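An illustrative greedy heuristic for the window-to-slots MKP might look as follows: take the window's jobs largest-first and best-fit each into the smallest contiguous slot that still holds it, maximizing the number of nodes put to use. The data layout and tie-breaking here are assumptions for illustration; the thesis's Greedy and Branch&Bound algorithms may differ in detail:

```python
# Greedy sketch for the 0-1 Multiple Knapsack Problem (MKP):
# place a window of jobs into a list of slots (contiguous free-node
# blocks), maximizing total nodes allocated.

def greedy_mkp(window, slots):
    """window: list of (job_id, nodes); slots: list of slot sizes.
    Returns (placement dict job_id -> slot index, total nodes used)."""
    remaining = list(slots)
    placement, used = {}, 0
    # Largest jobs first, so big requests get first pick of slots.
    for job_id, nodes in sorted(window, key=lambda j: -j[1]):
        # Best-fit: the smallest slot that still fits this job.
        best = None
        for idx, free in enumerate(remaining):
            if free >= nodes and (best is None or free < remaining[best]):
                best = idx
        if best is not None:
            placement[job_id] = best
            remaining[best] -= nodes
            used += nodes
    return placement, used

# Hypothetical window of jobs and two contiguous free regions.
window = [("A", 8), ("B", 4), ("C", 16), ("D", 2)]
slots = [16, 10]
placement, used = greedy_mkp(window, slots)
```

In this toy instance job B cannot be placed this round (both slots are too fragmented after C, A, and D land), showing why an exact Branch&Bound search over the same window can sometimes beat the greedy packing.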
3. Topology-aware Scheduling & Allocation.
We envision that future HPC schedulers will adopt a flexible job scheduling
and allocation mechanism that combines the best of both contiguous and
non-contiguous allocation strategies. Such a flexible mechanism would take
jobs' shared resource needs (e.g., network) into account when making allocation decisions.
With knowledge and analysis of job communication patterns, we
can identify which jobs require network isolation and locality, and to what
degree. Then, rather than allocating each job in a “know-nothing” manner,
one may specialize allocation so that, for example, only the jobs with stringent
network needs are given compact, isolated allocations, resulting in maximized
utilization and minimized perceivable resource contention effects.
We provide an in-depth analysis of intra- and inter-job communication in-
terference with different job allocation strategies on both torus and dragon-
fly connected HPC systems. We selected three signature applications from
the DOE Design Forward Project [31] as examples to conduct a detailed study
of their communication patterns. We use a sophisticated simulation toolkit
named CODES (standing for Co-Design of Multi-layer Exascale Storage Archi-
tectures) [32] as a research vehicle to evaluate the performance of these appli-
cations with various allocations in a controlled environment. We then analyze
the intra- and inter-job interference by simulating these applications running
exclusively and concurrently with different job placement policies. The insights
presented in this work can be very useful for the design of future HPC batch
job schedulers and resource managers.
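As a concrete illustration of the window-and-slots idea in the second contribution, a greedy best-fit pass over the window can be sketched in a few lines of Python. This is a simplified stand-in for the Greedy MKP algorithm described above; the `Job` and `Slot` classes and `greedy_allocate` are illustrative names, not the dissertation's actual implementation.

```python
# Hypothetical sketch of allocating a "window" of jobs onto a list of
# contiguous slots; names are illustrative, not the real scheduler code.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    size: int      # number of contiguous nodes requested

@dataclass
class Slot:
    free: int      # contiguous nodes still available in this slot

def greedy_allocate(window, slots):
    """Place jobs largest-first into the tightest-fitting slot (best fit),
    aiming to maximize utilization; unplaced jobs stay in the window."""
    placement = {}
    for job in sorted(window, key=lambda j: j.size, reverse=True):
        fitting = [s for s in slots if s.free >= job.size]
        if not fitting:
            continue                        # job waits for a later pass
        best = min(fitting, key=lambda s: s.free)
        best.free -= job.size
        placement[job.name] = best
    return placement

window = [Job("A", 4), Job("B", 2), Job("C", 3)]
slots = [Slot(5), Slot(4)]
print(sorted(greedy_allocate(window, slots)))   # ['A', 'B', 'C']
```

The Branch&Bound variant explores job/slot assignments exhaustively with pruning instead of this single greedy pass; both solve the same 0-1 MKP formulation.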
Figure 1.2. Batch scheduling system for HPC machines
We design a cooperative batch scheduling framework that integrates the
proposed solutions. Figure 1.2 shows the components of our framework, which
consists of four major subsystems. The queuing module maintains the waiting
queue for submitted jobs and passes detailed job information to the scheduler.
The resource management module organizes the available system resources,
monitors the system status, and provides feedback, such as node availability and
perceived network hot-spots, to the scheduler. The scheduling module makes
prioritizing and allocation decisions based on job requirements and system
feedback, and dispatches jobs to the system for execution. The logging module
collects information about job execution and system running status; such
historical information can be used by the scheduler to optimize its scheduling
decisions for future workloads. The logging module is also responsible for off-line
log analysis. Our cooperative batch scheduling framework is equipped with
different scheduling policies designed to schedule jobs and allocate resources in
a coordinated way. Thus, the framework is capable of making orchestrated
scheduling decisions with regard to the contention and interference among
concurrently running jobs over shared resources, and can significantly improve
system performance while reducing performance variability.
The contributions in this dissertation have led to 12 peer-reviewed publications
and two publications that are under review.
• Xu Yang, Zhou Zhou, Sean Wallace, Zhiling Lan, Wei Tang, Susan Coghlan,
Michael E Papka, Integrating Dynamic Pricing of Electricity into Energy Aware
Scheduling for HPC Systems, Proc. of SC’13, 2013.
• Xu Yang, John Jenkins, Misbah Mubarak, Robert B. Ross, Zhiling Lan, Watch
Out for the Bully! Job Interference Study on Dragonfly Network, Proc. of SC'16,
2016.
• Xu Yang, John Jenkins, Misbah Mubarak, Xin Wang, Robert B. Ross, Zhiling
Lan, Study of Intra- and Inter-job Interference on Torus Networks, Proc. of the
2016 IEEE 22nd International Conference on Parallel and Distributed Systems
(ICPADS).
• Xu Yang, Zhou Zhou, Wei Tang, Xingwu Zheng, Jia Wang, Zhiling Lan, Bal-
ancing Job Performance with System Performance via Locality-Aware Scheduling
on Torus-Connected Systems, Proc. of the 2014 IEEE International Conference
on Cluster Computing (CLUSTER).
• Sean Wallace, Xu Yang, Venkatram Vishwanath, William E Allcock, Susan
Coghlan, Michael E Papka, Zhiling Lan, A Data Driven Scheduling Approach
for Power Management on HPC Systems, Proc. of SC'16, 2016.
• Zhou Zhou, Xu Yang, Zhiling Lan, Paul Rich, Wei Tang, Vitali Morozov,
Narayan Desai, Improving Batch Scheduling on Blue Gene/Q by Relaxing 5D
Torus Network Allocation Constraints, Proc. of IPDPS’15, 2015.
• Zhou Zhou, Xu Yang, Dongfang Zhao, Paul Rich, Wei Tang, Jia Wang, Zhiling
Lan, I/O-Aware Batch Scheduling for Petascale Computing Systems, Proc. of
the 2015 IEEE International Conference on Cluster Computing (CLUSTER).
• Zhou Zhou, Xu Yang, Zhiling Lan, Paul Rich, Wei Tang, Vitali Morozov,
Narayan Desai, Improving Batch Scheduling on Blue Gene/Q by Relaxing 5D
Torus Network Allocation Constraints, accepted by IEEE Trans. on Parallel
and Distributed Systems (TPDS), 2016.
• Zhou Zhou, Xu Yang, Dongfang Zhao, Paul Rich, Wei Tang, Jia Wang,
Zhiling Lan, I/O-Aware Bandwidth Allocation for Petascale Computing Systems,
Journal of Parallel Computing (ParCo), 2016.
• Dongfang Zhao, Xu Yang, Iman Sadooghi, Gabriele Garzoglio, Steven Timm,
Ioan Raicu, High-Performance Storage Support for Scientific Applications on
the Cloud, Proc. of the 6th Workshop on Scientific Cloud Computing.
• Xingwu Zheng, Zhou Zhou, Xu Yang, Zhiling Lan, Jia Wang, Exploring Plan-
Based Scheduling for Large-Scale Computing Systems, Proc. of the 2016 IEEE
International Conference on Cluster Computing (CLUSTER).
• Jiaqi Yan, Xu Yang, Dong Jin, Zhiling Lan, Cerberus: A Three-Phase Burst-
Buffer-Aware Batch Scheduler for High Performance Computing, Proc. of SC’16,
Technical Program Posters.
1.3 Outline
The rest of this dissertation is organized as follows. Chapter 2 provides
background on high performance computing systems, batch schedulers, workload
traces, application communication traces, and the simulation tools used in our
work. Chapter 3 presents our work on reducing energy cost for HPC systems.
Chapter 4 presents topology-aware scheduling for torus-connected HPC systems.
Chapter 5 presents in-depth analyses of intra- and inter-job interference on torus
networks. Chapter 6 presents our study of network contention and job interference
on dragonfly networks. Chapter 7 concludes the dissertation by summarizing the
contributions and discussing future work on extending the current scheduling
framework.
CHAPTER 2
BACKGROUND
2.1 HPC Systems
The target platform of this dissertation is the high performance computing
(HPC) system, usually referred to as a supercomputer. Today's supercomputers
consist of hundreds of thousands of compute nodes connected by high-bandwidth
interconnect networks. We introduce two HPC systems with different network
topologies; our work is based on abstracted models of both.
As the scale of supercomputers increases, so does the scale of their interconnect
networks. Torus interconnects are widely used in HPC systems, such as the Cray
XT/XE and IBM Blue Gene series [18][19], due to their linear per-node cost scaling
and competitive overall performance. Mira [33], a 10 PFLOPS (peak) Blue Gene/Q
system at Argonne National Laboratory, is a torus-connected HPC system. The
compute nodes in Mira are grouped into midplanes; each midplane contains 512
nodes in a 4× 4× 4× 4× 2 sub-torus/mesh structure. Mira has 48 racks arranged
in three rows of sixteen. Each rack holds two such midplanes, i.e., 1024 sixteen-core
nodes, for a total of 16384 cores per rack and 786432 cores in the system. Mira was
ranked fifth in the latest Top500 list [34].
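The Mira figures quoted above are internally consistent, as a quick arithmetic check shows (all values come from the text itself):

```python
# Checking the Mira numbers quoted above.
nodes_per_midplane = 4 * 4 * 4 * 4 * 2      # 512-node sub-torus/mesh
nodes_per_rack = 2 * nodes_per_midplane     # two midplanes per rack
cores_per_rack = 16 * nodes_per_rack        # sixteen-core nodes
total_cores = 48 * cores_per_rack           # 48 racks in the system
print(nodes_per_midplane, cores_per_rack, total_cores)
# 512 16384 786432
```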
The high-radix, low-diameter dragonfly topology can lower the overall cost of
the interconnect, improve network bandwidth and reduce packet latency [35], making
it a very promising choice for building supercomputers with millions of cores. The
dragonfly is a two-level hierarchical topology, consisting of several groups connected
by all-to-all links. Each group consists of a routers connected via all-to-all local chan-
nels. For each router, p compute nodes are attached to it via terminal links, while h
links are used as global channels for intergroup connections. The resulting radix of
each router is k = a + h + p − 1. Different computing centers may choose different
values of a, h, and p when deploying their dragonfly networks; the choice involves
many factors, such as system scale, building cost, and workload characteristics.
One implementation of the dragonfly topology is Cori, a Cray Cascade system
(Cray XC30) [36] deployed at NERSC, Lawrence Berkeley National Laboratory.
The building block of Cori is the Aries router, with four terminal ports and 30
network ports. Each router has four compute nodes attached through its terminal
ports. Sixteen routers form a chassis, and six chassis make up a group. Each router
has 20 ports for local channels: 15 connect to the other routers in the same chassis,
and 5 more connect to routers in other chassis. Each router also has 10 ports for
global channels connecting to other groups.
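Given the radix formula above, the standard dragonfly sizing rules (one global link between each pair of groups, so at most a·h + 1 groups) bound the system scale for a given a, h, p. The sketch below uses illustrative parameter values, not Cori's actual configuration:

```python
def dragonfly_size(a, h, p):
    """Standard dragonfly sizing from the parameters in the text:
    a routers per group, h global links per router, p terminals per
    router. Router radix k = a + h + p - 1; with one global link per
    group pair, at most g = a*h + 1 groups and N = a*p*g nodes."""
    k = a + h + p - 1
    g = a * h + 1
    return k, g, a * p * g

# A small balanced configuration (a = 2p = 2h); values are examples only.
k, g, n = dragonfly_size(a=8, h=4, p=4)
print(k, g, n)   # 15 33 1056
```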
2.2 Batch Scheduler
There are a number of resource management and job scheduling tools dedicated
to HPC systems. Those commonly used by computing facilities include Moab [37]
from Adaptive Computing, PBS [38] from Altair, SLURM [39] from SchedMD,
and Cobalt [40] from Argonne National Laboratory. Moab and PBS are commercial
products, while SLURM and Cobalt are open source projects.
We have developed a simulator named CQSim to evaluate our design at scale.
The simulator is written in Python and is composed of several modules, such as a
job module, node module, and scheduling policy module. Each module is
implemented as a class, and the design principles are reusability, extensibility, and
efficiency. The simulator takes job events from a trace; an event can be a job
submission, start, end, or another event type. Based on these events, the simulator
emulates job submission, allocation, and execution under specific scheduling and
allocation policies. CQSim is open source and available to the community [41].
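As a rough, hypothetical sketch of what such a trace-driven, event-based scheduling loop looks like (this is not CQSim's actual code or API — the function name, tuple layout, and event kinds are ours):

```python
# Minimal event-driven FCFS scheduling loop over a job trace; a toy
# stand-in for the kind of simulation CQSim performs, not its real code.
import heapq

def simulate_fcfs(trace, total_nodes):
    """trace: list of (submit_time, runtime, size). Returns start times."""
    events = [(t, "submit", i) for i, (t, _, _) in enumerate(trace)]
    heapq.heapify(events)
    queue, free, starts = [], total_nodes, {}
    while events:
        now, kind, i = heapq.heappop(events)
        if kind == "submit":
            queue.append(i)
        else:                       # "end": job i releases its nodes
            free += trace[i][2]
        # FCFS: start jobs from the head of the queue while they fit
        while queue and trace[queue[0]][2] <= free:
            j = queue.pop(0)
            free -= trace[j][2]
            starts[j] = now
            heapq.heappush(events, (now + trace[j][1], "end", j))
    return starts

trace = [(0, 10, 4), (1, 5, 4), (2, 5, 2)]   # (submit, runtime, size)
print(simulate_fcfs(trace, total_nodes=8))   # {0: 0, 1: 1, 2: 6}
```

Job 2 must wait even though two nodes sit idle at t = 2: strict FCFS without backfilling starts jobs only in arrival order, which is exactly the inefficiency that backfilling and the window-based policies later in this dissertation address.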
2.3 Workload Trace
An HPC system accommodates multiple users running their jobs simultaneously.
The system records, in chronological order, all kinds of events regarding each job,
such as its submission, entry into the waiting queue, start, failure, and completion.
The collection of all these events recorded by the system is referred to as a
workload trace. In such a trace file, each record is basically composed of a
timestamp, event type, executable filename, job size, location, wait time, running
time, etc. The workload traces used in our study are archived in the Parallel
Workloads Archive [42].
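Traces in the Parallel Workloads Archive use the Standard Workload Format (SWF): one whitespace-separated record of 18 fields per job, with header lines starting with ";". A minimal parser for the first five fields might look like this (the field subset and dictionary keys are our own choice, not a CQSim interface):

```python
# Sketch of parsing one Standard Workload Format (SWF) record; only
# the first five of the 18 SWF fields are decoded here.
def parse_swf_line(line):
    if line.startswith(";"):          # SWF comment/header lines
        return None
    f = line.split()
    return {
        "job_id":  int(f[0]),
        "submit":  int(f[1]),   # seconds since trace start
        "wait":    int(f[2]),   # seconds spent in the queue
        "runtime": int(f[3]),   # seconds of execution
        "procs":   int(f[4]),   # allocated processors
    }

rec = parse_swf_line("1 0 120 3600 512 -1 -1 512 3600 -1 1 1 1 1 1 -1 -1 -1")
print(rec["runtime"], rec["procs"])   # 3600 512
```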
2.4 Application Communication Trace
A parallel application usually conforms to a combination of several basic
communication patterns [43]. At different execution phases, an application's
communication behavior may follow different basic patterns. There are many
profiling tools available to capture information regarding communication patterns of
parallel applications [44–48]. In this work, we select three representative applications
from the DOE Design Forward Project. Each application exhibits a distinctive com-
munication pattern that is commonly seen in HPC applications. We believe that the
communication patterns of these applications are representative of a wide array of
applications running on leadership-class machines. Specifically, we study the Alge-
braic MultiGrid Solver (AMG), Geometric MultiGrid (MultiGrid) and CrystalRouter
MiniApps. For each application, we collect its communication trace generated by SST
DUMPI [47]. Details about the DUMPI traces and the communication patterns of
these applications will be introduced in a later section.
2.5 Simulation Tool
The CODES simulation toolkit enables the exploration of different HPC networks
with high-fidelity simulation [32][49]. CODES is built on top of the Rensselaer
Optimistic Simulation System (ROSS) parallel discrete-event simulator, which is
capable of processing billions of events per second on leadership-class
supercomputers [50]. CODES supports both torus and dragonfly networks with
high-fidelity flit-level simulation. CODES also includes a network workload
component capable of conducting trace-driven simulations: it can take real MPI
application traces generated by SST DUMPI [47] to drive the CODES network
models.
CHAPTER 3
ENERGY COST AWARE SCHEDULING DRIVEN BY DYNAMIC PRICING OF ELECTRICITY
3.1 Overview
The research literature to date has mainly aimed at reducing energy consumption
in HPC environments. Orthogonal to existing studies, we propose a job power
aware scheduling mechanism to reduce an HPC system's electricity bill without
degrading system utilization. The rationale is based on a key observation about
HPC jobs: parallel jobs have distinct power consumption profiles. In our recent
work [3], we provided an analysis of a one-month workload on Mira, the 48-rack
IBM Blue Gene/Q (BGQ) system at Argonne National Laboratory (Figure 3.1).
The histogram displays percentages partitioned by job size, ranging from single-rack
jobs to full-system runs. The power consumption of these jobs varies from around
40 kW/rack to 90 kW/rack.
We hypothesize that it is possible to save a significant amount on an electric
bill by exploiting a dynamic electricity pricing policy. To date, dynamic electricity
pricing policies have been widely adopted in Europe, North America, Oceania, and
parts of Asia. For example, in the U.S., wholesale electricity prices vary by as much
as a factor of 10 from one hour to the next [30]. Under dynamic pricing, the power
grid alternates within a day between on-peak time (when it bears a heavier burden
and consequently the electricity price is higher) and off-peak time (when there is
less demand for electricity and the price is lower).
In this work, we develop a job power aware scheduling mechanism. Its novelty
is that it can reduce a system's electricity bill by scheduling and dispatching jobs
according to their power profiles and the real-time electricity price, while causing
negligible impact on system utilization and scheduling fairness. Preferentially, it
dispatches jobs with higher power consumption during the off-peak period and jobs
with lower power consumption during the on-peak period.
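As a toy illustration of why this ordering saves money, consider a two-level tariff; the hours, prices, and function below are made-up example values, not figures from this dissertation:

```python
# Illustrative cost model for a two-level on-/off-peak tariff.
# Hours and $/kWh prices are hypothetical example values.
ON_PEAK_HOURS = range(8, 20)          # assume 8:00-19:59 is on-peak
PRICE = {"on": 0.12, "off": 0.04}     # $/kWh, hypothetical tariff

def job_cost(power_kw, runtime_h, start_hour):
    """Charge each whole hour of the job at that hour's tariff."""
    cost = 0.0
    for h in range(int(runtime_h)):
        period = "on" if (start_hour + h) % 24 in ON_PEAK_HOURS else "off"
        cost += power_kw * PRICE[period]
    return round(cost, 2)

# The same 90 kW job is 3x cheaper when shifted to off-peak hours:
print(job_cost(90, 4, start_hour=10))   # 43.2
print(job_cost(90, 4, start_hour=22))   # 14.4
```

Since system utilization must not drop, the scheduler cannot simply delay power-hungry jobs to off-peak hours; it instead reorders which queued jobs fill the currently free nodes, which is what the window-based policies below formalize.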
A key challenge in HPC scheduling is that system utilization should not be
impacted. HPC systems require a tremendous capital investment, hence taking full
advantage of these expensive resources is of great importance to HPC centers.
Unlike Internet data centers, whose utilization fluctuates around only 20%, systems
at HPC centers are heavily used, with typical utilization around 50%-80% [42]. To
address this challenge, we propose a novel window-based scheduling mechanism [51].
Rather than allocating jobs one by one from the front of the wait queue as existing
schedulers do, we schedule and dispatch a “window” of jobs at a time. The jobs
placed into the window are chosen to maintain job fairness, and the allocation of
these jobs onto system resources is done in such a way as to minimize the electricity
bill. Two scheduling algorithms, namely a Greedy policy and a 0-1 Knapsack based
policy, are presented in this work for decision making.
We evaluate our job power aware scheduling design via extensive trace-based
simulations and a case study of Mira. We present a series of experiments comparing
our design against the popular first-come, first-served (FCFS) scheduling policy
with backfilling, in three different respects: electricity bill savings, scheduling
performance, and job performance. Our preliminary results demonstrate that our
design can cut the electricity bill by up to 23% without impacting overall system
utilization. Considering that HPC centers often spend millions of dollars on even
the least expensive energy contracts, such savings can translate into a reduction of
hundreds of thousands of dollars in total cost of ownership.
The structure of this chapter is as follows. In Section 3.2, we discuss existing
work on energy reduction for HPC systems. Section 3.3 describes the job power
aware scheduling problem. Section 3.4 gives a detailed description of our job power
aware scheduling design. Sections 3.5-3.7 present our evaluation methodology,
trace-based simulations, and a case study comparing our design against the
widely-used FCFS scheduling policy under a variety of configurations. Our findings
are presented in Section 6.9.
3.2 Related Work
First, we give a brief survey of dynamic electricity pricing policies in different
countries to demonstrate the applicability of this work. The European Exchange
Market (EEX) in Germany, PowerNext in France, and APX in the Netherlands and
Iberian market all vary their cost of electricity on an hourly basis [52]. In [53], the
author stated that dynamic electricity pricing policies are well adopted in Nordic
countries such as Norway, Finland, Sweden, and Denmark. Other power markets,
such as England, New Zealand, and Australia, have similar policies. Since the
electricity crisis in 2011-2012, the Japanese Ministry of Economy, Trade and
Industry (METI) has initiated the Smart Community Pilot Projects in four
Japanese cities (Yokohama, Toyota, Kyoto, and Kitakyushu) to investigate the
effect of dynamic pricing and
smart energy equipment on residential electricity demand [54]. In China, major cities
such as Beijing, Shanghai, Guangzhou have initiated dynamic electricity pricing for
both domestic and industrial use since 2006. Dynamic electricity pricing has also
been carried out in several provinces such as Zhejiang, Jiangsu, and Guangdong.
As we can see from the survey above, the dynamic electricity pricing policy
has already been carried out in power markets in Europe, North America, Oceania,
and China. Japan has initiated preliminary tests in some major cities to see the effect
of this dynamic electricity pricing policy on the reduction of electricity consumption.
While in this study we evaluate our design based on the on-/off-peak electricity
pricing in the U.S., we believe our design is applicable to other countries (e.g.,
those listed above) for cutting the electricity bills of their HPC systems.
Although there is no known effort to provide job power aware scheduling
support in the field of HPC, there is a large body of related work. Due to space
limitations, in this section we discuss some closely related studies and point out
the key differences among them.
From the hardware perspective, hardware vendors are dedicated to producing
energy-efficient devices. For instance, Barroso and Holzle argued that the power
consumption of a machine should be proportional to its workload, i.e., it should
consume no power in idle state, almost no power when the workload is very low,
and eventually more power when the workload is increased [55]. Ideally, an energy
proportional system could save half of the energy used in data center operations. Li
et al. optimized the power/ground grid to make the power supply more efficient for
the chip [56].
Since processor power consumption is a significant portion of the total sys-
tem power (roughly 50% under load [57]), DVFS is widely used for controlling CPU
power [58]. By running a processor at a lower frequency/voltage, energy savings can
be achieved at the expense of increased job execution time. In order to meet users'
SLAs (Service Level Agreements), DVFS is typically applied during periods of low
system activity. Some research studies on this topic can be found in [59][60][61][62].
In a typical HPC system, nodes often consume considerable energy in the idle
state without running any application. For example, an idle Blue Gene/P rack still has a
DC power consumption of about 13 kW [29]. During low system utilization, some
nodes or their components could be shut down or switched to a low-power state.
This strategy tries to minimize the number of active nodes of a system while still
satisfying incoming application requests. Since this approach is highly dependent on
system workload, the challenge is to determine when to shut down components and
how to provide a suitable job slowdown value based on the availability of nodes.
Hikita et al. [57] performed an empirical study by implementing an energy-
aware scheduler for an HPC system. The operation of the scheduler is simple: if
a node is inactive for 30 minutes, it is powered off; when the node is required for
job execution, it is powered on and moved to an active state. Powering up a node
on their system takes approximately 45 minutes, which is substantial. This strategy
can improve power efficiency by 39% at best. However, rebooting a node may
consume significant time, which leads to performance degradation if it happens
during a peak job-request period when more nodes are required than are currently
active.
Pinheiro et al. [59] presented a mechanism that dynamically turns cluster nodes
on and off. This approach uses load consolidation to transfer workload onto fewer
nodes so that idle nodes can be turned off. The experimental tests on a static 8-node
cluster indicate a 19% saving in energy. It takes about 100 seconds to power on a
server and 45 seconds to shut it down on the cluster. The degradation in performance
is approximately 20%.
Thermal management techniques are another method frequently discussed in
the literature [60][62]. The rationale is that higher temperatures have a large impact
on system reliability and can also increase cooling costs. By using thermal manage-
ment, system workload is adjusted according to a predefined temperature threshold:
if the temperature on a server rises above that threshold, its workload is reduced.
Disadvantages of thermal management are delayed response, high risk of overheating,
excessive cooling and recursive cycling [62].
Many data centers use power capping or power budgeting to reduce total
power consumption. The operator can set a power consumption threshold to
ensure that the actual power of the data center does not exceed it [63]. This
prevents sudden rises in power supply and keeps total power consumption under a
predefined budget. Power consumption can basically be reduced by rescheduling
tasks or CPU throttling. For example, Etinski et al. proposed a parallel job scheduling
policy based on integer linear programming under a given power profile [61]. Lefurgy
et al. presented a technique for high density servers that controls the peak power
consumption by implementing a feedback controller [64].
This work differs from our previous work [65] in two major ways. First, our
previous work targets Blue Gene/P systems, which have a special requirement on
job scheduling: available nodes must be connected in a job-specific shape before
they can be allocated to a job [51][66]. This work intends to provide a generic job
power aware scheduling mechanism for various HPC systems. Second, our previous
work relies on a power budget (similar to power capping) for energy cost saving,
which slightly degrades system utilization during on-peak electricity price periods.
The scheduling policies presented in this work do not use a power budget; they
minimize the electricity bill without impacting system utilization during both
on-peak and off-peak electricity pricing periods.
3.3 Problem Description
Typically, user jobs are submitted to an HPC system through a batch
scheduler and then wait in a queue for the requested amount of system resources to
become available. There may be one or multiple job queues with different priorities.
A job is generally defined by its arrival time, its estimated runtime, the number of
computing nodes requested, etc. The scheduler is responsible for assigning
computing nodes to the jobs in the queues. FCFS with backfilling is a commonly
used scheduling policy in HPC [67]. Under this policy, the scheduler picks a job
from the head of the wait queue and dispatches it to the available system resources.
The nodes assigned to a job remain unavailable until the job completes (i.e., space
sharing).
As mentioned in Section 3.1, our work is based on two key observations in
HPC: (1) the electricity price changes dynamically within a day; and (2) HPC jobs
have distinct power consumption profiles. We shall point out that HPC jobs also
tend to be repetitive. These repetitive jobs can be easily identified by user ID,
project, expected runtime, etc., and a batch scheduler can extract job power
profiles from historical data and use them for power aware scheduling. For
simplicity of description, we assume that the daily electricity price is divided into
on-peak and off-peak periods, where the on-peak period refers to the time when
more electricity is demanded (e.g., during the daytime). Exploiting these two
observations, the basic idea of our work is to allocate jobs with lower power
consumption profiles during on-peak time and jobs with higher power consumption
profiles during off-peak time. Furthermore, the allocation is made under the
constraint that there is no impact on system utilization, meaning that a situation
where a job waits in the queue while a sufficient number of idle nodes is available is
not allowed.
Obtaining job power consumption profiles is feasible on today's
supercomputers. Most production HPC systems are deployed with built-in sensors
that monitor the health status of their hardware components. These sensors,
deployed in various locations inside the system, report environmental conditions,
such as temperature, voltage, and fan speed, for the motherboard, CPU, GPU,
hard disk, and other peripherals. A number of software tools/interfaces are
publicly available for users to access these sensor readings [68][18][19][3].
Figure 3.2 pictorially illustrates an example highlighting the key idea of our
design as compared to conventional FCFS. Suppose five jobs J0, J1, J2, J3, J4 are
submitted to a 12-node system. Each job is associated with several parameters,
such as the number of nodes needed and the estimated runtime. Further, each job
is also associated with a power consumption profile pi, which can be determined
from historical data [3]. Suppose these jobs have the following parameters:

Job   Power Profile (W/node)   Job Size
J0    50                       6
J1    20                       3
J2    40                       3
J3    30                       3
J4    10                       6
Under the conventional FCFS policy, the scheduling sequence is always
< J0, J1, J2 >, regardless of the scheduling time. Our scheduling mechanism
produces different scheduling sequences depending on the dynamic electricity price
and the jobs' power profiles. More specifically, our scheduling algorithm allocates
< J4, J1, J3 > during the on-peak period and < J0, J2, J3 > during the off-peak
period. Comparing the total power consumption of FCFS with that of our design,
it is clear that our design reduces the accumulated power consumption during
on-peak time while increasing it during off-peak time, hence reducing the overall
electricity bill.
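The scheduling sequences above can be reproduced in a few lines of Python, dispatching jobs in power order until the 12 nodes are exhausted (the `dispatch` helper is ours, written only to verify the example):

```python
# Reproducing the Figure 3.2 example: jobs mapped to (power W/node,
# size) on a 12-node system, dispatched in order until nodes run out.
jobs = {"J0": (50, 6), "J1": (20, 3), "J2": (40, 3),
        "J3": (30, 3), "J4": (10, 6)}

def dispatch(order, free=12):
    started = []
    for name in order:
        if jobs[name][1] <= free:      # job fits on the free nodes
            free -= jobs[name][1]
            started.append(name)
    return started

fcfs     = dispatch(["J0", "J1", "J2", "J3", "J4"])
on_peak  = dispatch(sorted(jobs, key=lambda j: jobs[j][0]))   # low power first
off_peak = dispatch(sorted(jobs, key=lambda j: jobs[j][0], reverse=True))
print(fcfs, on_peak, off_peak)
# ['J0', 'J1', 'J2'] ['J4', 'J1', 'J3'] ['J0', 'J2', 'J3']
```

All three sequences fill exactly 12 nodes, so utilization is identical; only the accumulated power draw (300 W/node-weighted vs. 60 on-peak) differs.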
Figure 3.2. Job scheduling using FCFS (left) and our job power aware design at on-peak time (top right) and off-peak time (bottom right). For each job, its color represents its power profile, where dark indicates power expensive and light indicates power efficient.
Table 3.1. Nomenclature

Symbol   Description
ni       job size, i.e., the number of nodes requested by Job i
pi       the average power consumption per node of Job i
Nt       the number of available nodes in the system at time t
N        system size, i.e., the number of nodes in the system
T        the time span from the start of the first job to the end of the last job
w        scheduling window size, i.e., the number of jobs in the window
3.4 Methodology
In Table 3.1, we list the nomenclature used in the rest of this chapter.
Figure 3.3 gives an overview of our job power aware scheduling design. Our design
contains two key techniques: the use of a scheduling window to take job features
such as fairness into consideration, and the scheduling policy to balance energy
usage and scheduling performance.
3.4.1 Scheduling Window. Balancing fairness and system performance is always
a major concern for a scheduling algorithm. The simplest way to ensure fairness
and high system performance is to use a strict FCFS policy combined with
backfilling, where jobs are started in the order of their arrival [67].
In this work, rather than allocating jobs one by one from the front of the
wait queue, we propose a novel window-based scheduling mechanism that allocates
a window of jobs. The selection of jobs into the window is based on certain
user-centric metrics, such as job fairness, while the allocation of these jobs onto
system resources is determined by certain system-centric metrics, such as system
utilization and energy consumption. By doing so, we are able to balance different
metrics, representing both user satisfaction and system performance.
Figure 3.3. Overview of Job Power Aware Scheduling
Given that FCFS is commonly used by production batch schedulers in HPC,
we now describe how our window-based scheduling works with FCFS. We maintain
a scheduling window in front of the job queue; submitted jobs enter the job queue
first and then the scheduling window. The selection of jobs is based on job arrival
times, thereby guaranteeing job fairness; the allocation of the jobs from the window
to the available system nodes is based on job power profiles, as described later.

Typically, the window size should be determined based on the system
workload, such that a larger window is preferred under high workload. For typical
workloads at production supercomputers, we find that a window size of 10 to 30
jobs achieves reasonable electricity bill savings.
3.4.2 Job Power Aware Scheduling Policies. In this work, we develop two
power aware scheduling policies. The first is a Greedy policy, where jobs are allocated
entirely based on the values of their power profiles. The second is a 0-1 Knapsack
based policy, where both job power profile and system utilization are taken into
consideration during decision making.
Greedy Policy. In the Greedy policy, all the jobs in the scheduling window are
first sorted by their power profiles. During the on-peak electricity price period,
the jobs are sorted in increasing order of their power profiles; conversely, they
are sorted in decreasing order during the off-peak period. After sorting, the
scheduler dispatches the ordered jobs out of the scheduling window. The Greedy
policy is simple and fast: if the number of jobs in the window is n, the complexity
of the algorithm is O(n lg n).
0-1 Knapsack based Policy. In the 0-1 Knapsack policy, the only difference
between on-peak and off-peak scheduling is the objective on aggregated power
consumption: during the on-peak period, the goal is to minimize it; during the
off-peak period, to maximize it. In the following, we present how the 0-1
Knapsack based policy works at off-peak time.
Suppose there are $N_t$ available nodes in the system, the scheduling window
holds $w$ jobs $\{J_i \mid 1 \le i \le w\}$, and each job $J_i$ requires $n_i$ nodes
with a power profile of $p_i$. The scheduling problem can now be formalized as
follows:

Problem 1. Select a subset $S$ of $\{J_i \mid 1 \le i \le w\}$ from the scheduling
window such that the aggregate node count $\sum_{i \in S} n_i$ is no more than
$N_t$, with the objective of maximizing the aggregated power consumption
$\sum_{i \in S} n_i \cdot p_i$.
The above problem maps onto a 0-1 Knapsack problem. We set the number of
available nodes $N_t$ as the knapsack's capacity and consider the jobs in the
scheduling window as the objects to be put into the knapsack. For each job $J_i$,
its aggregate power consumption $n_i \cdot p_i$ (with $p_i$ measured in W/node or
kW/rack) is its value, and the number of required nodes $n_i$ is its weight. Hence
we can further transform Problem 1 into a standard 0-1 Knapsack model.
Problem 2. Determine a binary vector $X = \{x_i \mid 1 \le i \le w\}$ such that:

$$\text{maximize} \quad \sum_{1 \le i \le w} x_i \cdot n_i \cdot p_i, \qquad x_i \in \{0, 1\}$$
$$\text{subject to} \quad \sum_{1 \le i \le w} x_i \cdot n_i \le N_t \tag{3.1}$$
The standard 0-1 Knapsack model can be solved in pseudo-polynomial time
by using dynamic programming [69]. To avoid redundant computation, our imple-
mentation uses the tabular approach, defining a 2-D table G where G[k, w] denotes
the maximum gain achievable by selecting among the first k jobs {Ji | 1 ≤ i ≤ k}
using no more than w computing nodes, for 1 ≤ k ≤ J and 0 ≤ w ≤ Nt. G[k, w]
satisfies the following recurrence:

    G[k, w] =
        0                                              if k = 0 or w = 0
        G[k−1, w]                                      if nk > w
        max(G[k−1, w], nk · pk + G[k−1, w − nk])       if nk ≤ w
                                                               (3.2)

The solution G[J, Nt] and its corresponding binary vector X determine the
selection of jobs scheduled to run. The computation complexity of Equation 3.2 is
O(J · Nt).
During on-peak time, the 0-1 Knapsack based policy is modified by changing
the selection criterion to minimizing the total value of the objects in the knapsack,
under the same knapsack-size constraint.
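As a sketch, the tabular dynamic program of Equation 3.2 for the off-peak case can be written as follows; the job-tuple layout and function name are our own, not CQSim's actual code:

```python
def knapsack_select(jobs, free_nodes):
    """Off-peak 0-1 Knapsack selection (Equation 3.2).

    `jobs` is a list of (nodes, watts_per_node) pairs; a job's value is
    its aggregate power draw nodes * watts_per_node, and its weight is
    its node request. Returns the indices of the selected jobs, which
    maximize aggregate power without exceeding `free_nodes` nodes.
    """
    J = len(jobs)
    # G[k][w]: best achievable value using the first k jobs and w nodes.
    G = [[0.0] * (free_nodes + 1) for _ in range(J + 1)]
    for k in range(1, J + 1):
        n_k, p_k = jobs[k - 1]
        v_k = n_k * p_k
        for w in range(free_nodes + 1):
            G[k][w] = G[k - 1][w]                      # skip job k
            if n_k <= w:                               # or take it
                G[k][w] = max(G[k][w], v_k + G[k - 1][w - n_k])
    # Recover the binary vector X by walking the table backwards.
    selected, w = [], free_nodes
    for k in range(J, 0, -1):
        if G[k][w] != G[k - 1][w]:
            selected.append(k - 1)
            w -= jobs[k - 1][0]
    return selected[::-1]
```

Filling the table costs O(J · Nt), as stated above; the on-peak variant changes the objective to minimization, as described in the text.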
3.5 Evaluation Methodology
We conduct a series of experiments using trace-based simulations, comparing
our design against the well-known FCFS scheduling policy [67]. In the remainder of
this chapter, we simply use Greedy, Knapsack, and FCFS to denote our two scheduling
policies and the conventional batch scheduling policy. This section describes our
evaluation methodology; the experimental results are presented in the next section.
3.5.1 CQSim: Trace-based Scheduling Simulator. Simulation is an integral
part of our evaluation of various scheduling policies, as well as of their aggregate effect
on performance and power consumption. We have developed a simulator named CQSim
to evaluate our design at scale. The simulator is written in Python and is composed
of several modules, such as a job module, a node module, and a scheduling policy
module; each module is implemented as a class. The design principles are reusability,
extensibility, and efficiency. The simulator takes job events from a trace, where an
event may be a job submission, job start, job end, and so on. Based on these events,
the simulator emulates job submission, allocation, and execution under a specific
scheduling policy. CQSim is open source and is available to the community [41].
3.5.2 Job Traces. In this work, we use two real workload traces collected from
production supercomputers to evaluate our design. The objective of using multiple
traces is to quantify the impact of different factors on electricity bill saving. The first
trace we used is from a machine named Blue Horizon at the San Diego Supercomputer
Center (denoted as SDSC-BLUE in this chapter), which ran 144,830 jobs in 2001.
Figure 3.4. Job size distribution of ANL-BGP (A) and SDSC-BLUE (B)
The second trace we used is from two racks of the IBM Blue Gene/P machine
named Intrepid at Argonne (denoted as ANL-BGP in this chapter) [70][71]. This
trace contains 26,012 jobs. Since this trace is extracted from the original 40-rack
workload, the utilization rate is relatively low. A well-known approach to remedy this
problem is to decrease job arrival intervals by a certain rate [72]. After we decrease
job arrival intervals by 40%, the trace becomes five months long, with the utilization
rate ranging between 39% and 88%. Figure 3.4 summarizes the job size distribution
of these traces.
ANL-BGP is used to represent capability computing, where the computing power is
exploited to solve larger problems, whereas SDSC-BLUE is used to represent capacity
computing, where the computing power is utilized to solve a large number of small
problems.
3.5.3 Dynamic Electricity Price. In our experiments, we set two different
electricity prices: on-peak and off-peak pricing. We set the price in on-peak time
(from 12pm to 12am) higher than the off-peak time (from 12am until 12pm). This is
done to simplify our calculation and statistical analysis. Indeed, we are not concerned
about the absolute value of the electricity price; rather, the ratio of on-peak to off-
peak price is what matters. According to [30], the most common ratio of off-peak
to on-peak pricing varies from 1:2 to 1:5. Hence we set the default ratio to 1:3.
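A minimal sketch of this two-tier price schedule, with an arbitrary base price (only the ratio matters for the experiments):

```python
def electricity_price(hour, off_peak_price=1.0, ratio=3.0):
    """Relative electricity price at a given hour of the day (0-23).

    On-peak runs from 12pm to 12am, off-peak from 12am to 12pm. The
    base price and function name are illustrative; only the default
    1:3 off-peak/on-peak ratio reflects the experimental setting.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    return off_peak_price * ratio if hour >= 12 else off_peak_price
```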
3.5.4 Job Power Profile. Since job power profiles are not included in the original
traces, we assign each job a power profile between 20 and 60 W per node using a
normal distribution, according to the power profile distribution presented in Figure
3.1. Similarly, we are not concerned about the absolute power profile values; rather,
the ratio of the maximum to the minimum power profile is what matters. The default
ratio is set to 1:3.
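The profile assignment can be sketched as below; the function name and the choice of clamping to the range are our assumptions, since the text only specifies a normal distribution over the 20–60 W range:

```python
import random

def assign_power_profile(lo=20.0, hi=60.0):
    """Draw a synthetic per-node power profile (watts) from a normal
    distribution centred on the middle of [lo, hi], clamped to that
    range. Only the 1:3 max/min ratio is meaningful, not the absolute
    values; sigma is chosen so ~99.7% of draws land inside the range.
    """
    mean = (lo + hi) / 2.0
    sigma = (hi - lo) / 6.0
    return min(hi, max(lo, random.gauss(mean, sigma)))
```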
3.5.5 Evaluation Metrics. In this work, we use three metrics to evaluate our
design against the conventional FCFS.
Electricity Bill Saving. We calculate the relative difference between the electricity
bill using our design and FCFS to measure the electricity bill savings achieved
by our design. The simulator sums up electricity bill on a daily basis for the
calculation of this metric.
System Utilization Rate. This metric denotes the ratio of the node-hours used
for useful computation to the elapsed system node-hours. Specifically, let T
be the total elapsed time for J jobs, N be the total number of nodes in the
system, ci be the completion time of job i, si its start time, and ni its size;
then the system utilization rate is calculated as

    ∑_{1≤i≤J} (ci − si) · ni / (N · T)                         (3.3)
Average Job Wait Time. For each job, its wait time refers to the time elapsed
between the moment it is submitted and the moment it is allocated to run. This
metric is calculated as the average across all jobs submitted to the system. It
is a user-centric metric, measuring scheduling performance from the user's
perspective.
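The latter two metrics can be computed directly from per-job records; this sketch assumes an illustrative (submit, start, completion, nodes) tuple per job rather than any particular trace format:

```python
def utilization(jobs, total_nodes, elapsed_time):
    """System utilization per Equation 3.3: node-hours spent on useful
    computation divided by elapsed system node-hours. Each job is a
    hypothetical (submit, start, completion, nodes) tuple.
    """
    used = sum((c - s) * n for _, s, c, n in jobs)
    return used / (total_nodes * elapsed_time)

def avg_wait_time(jobs):
    """Average job wait time: mean of (start - submit) over all jobs."""
    return sum(s - sub for sub, s, _, _ in jobs) / len(jobs)
```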
3.6 Experiment Results
We conduct four sets of experiments on the traces described in Section 3.5.2
to evaluate our design against FCFS.
3.6.1 Baseline Results. Baseline results are presented in Figures 3.5(a) - 3.7(b),
where we use the default setting described in Sections 3.5.3-3.5.4, meaning that job
power profile ratio is 1:3 and the off-peak/on-peak pricing ratio is 1:3. On production
supercomputers, the scheduling frequency is typically on the order of 10 to 30 seconds.
Hence, the simulator is set to make a scheduling decision every 10 seconds.
Since our evaluation focuses on the relative reduction of electricity bills, the
absolute value of idle power consumption, which is set to 0 in our experiments, does
not impact the results.
(a) System Utilization of SDSC-BLUE (b) System Utilization of ANL-BGP
Figure 3.5. Baseline Results
Figure 3.5(a) and Figure 3.5(b) compare the system utilization rates achieved
by different job scheduling policies. It is clear that the utilization degradation
introduced by our design is always less than 5%, no matter which scheduling policy we
choose (Greedy or Knapsack). Moreover, in the 3rd and 5th months of the SDSC-BLUE
trace, both of our scheduling policies achieve higher system utilization than
FCFS. These results clearly demonstrate that our scheduling design has negligible
impact on system utilization, which is critical to HPC systems.
(a) SDSC-BLUE (b) ANL-BGP
Figure 3.6. Cost saving for SDSC-BLUE and ANL-BGP.
Figure 3.6(a) and Figure 3.6(b) present the electricity bill savings obtained by
our design against FCFS. In general, the monthly electricity bill saving ranges
from 0.5% to 10% using Greedy, and from 2% to 10% using Knapsack. The average
electricity bill savings obtained by the Greedy and Knapsack scheduling policies are
4.33% and 3.16%, respectively, for SDSC-BLUE; for ANL-BGP, Greedy saves 5.06%
and Knapsack saves 5.53%.
The average electricity bill saving is 3.16%–5.53%. We also make two inter-
esting observations. The first is that Greedy achieves more electricity bill savings on
SDSC-BLUE, whereas Knapsack brings more cost savings on ANL-BGP. Second,
more savings are obtained from ANL-BGP. As we can see from Figure 3.4, these
traces have distinctive job characteristics in terms of job size. In the ANL-BGP
trace, 38% of jobs request 512 nodes, 19% request 1024 nodes, and 8% request 2048
nodes. Given that the system size is 2,048 nodes, this means 65% of jobs are relatively
large in the sense that they request more than a quarter of the system resources.
In contrast, SDSC-BLUE has different characteristics; most jobs are relatively small:
71% of the jobs request fewer than 32 nodes, whereas the system size is 1,152. In
other words, ANL-BGP represents big capability computing and SDSC-BLUE represents
small capacity computing. The results indicate that our design provides more
benefits for big capability computing.
(a) SDSC-BLUE (b) ANL-BGP
Figure 3.7. Wait time improvement for SDSC-BLUE and ANL-BGP workloads.
Figure 3.7(a) and Figure 3.7(b) show the average job wait times introduced by
our design and FCFS. In general, job wait time is influenced by many factors, such
as job arrival rate, job size, etc. Hence, the average job wait time varies from month
to month. While our scheduling policies might affect the average job wait time, on
both traces the maximum change in this metric caused by our design is less than
10 seconds compared to FCFS. This implies that our design does not degrade
scheduling performance from the user's perspective.
3.6.2 Impacts of Electricity Prices and Job Power Profiles. In this set of
experiments, we conduct a sensitivity study to investigate the amount of electricity
bill savings that could be achieved by our design under different combinations of power
and pricing ratios. We set three different job power profile ratios, namely 1:2
(e.g., 20 W per node as the lowest profile and 40 W per node as the highest),
1:3, and 1:4. We also set three off-peak versus on-peak pricing ratios (i.e., 1:3, 1:4,
and 1:5). The results are summarized in Tables 3.2 and 3.3.
Tables 3.2 and 3.3 present the electricity bill savings obtained by our schedul-
ing policies on ANL-BGP and SDSC-BLUE, respectively, under different pricing and
power profile combinations. As the job power profile ratio increases, so does the
electricity bill saving obtained by both Greedy and Knapsack; the same holds as the
pricing ratio goes up. The highest electricity bill saving is achieved when the job
power profile ratio is set to 1:4 and the off-peak/on-peak pricing ratio is set to 1:5.
This is quite reasonable: the greater the job power profile ratio, the more
power consumption savings our design can obtain, and with a higher off-peak/on-peak
price ratio, the same amount of power savings yields more electricity bill savings.
Table 3.2. Electricity bill savings obtained by our scheduling policies on ANL-BGP.
In each cell, the top number is the saving obtained by Greedy and the bottom
number is the saving obtained by Knapsack.

                    Pricing Ratio
Power Ratio     1:3       1:4       1:5
1:2             3.54%     4.33%     4.79%
                4.18%     5.07%     5.64%
1:3             5.06%     6.13%     6.85%
                5.35%     6.48%     7.25%
1:4             6.27%     7.58%     8.40%
                7.21%     8.52%     9.86%
From Tables 3.2 and 3.3, we can observe that for the ANL-BGP trace,
Knapsack outperforms Greedy under all power and pricing ratio combinations, while
for the SDSC-BLUE trace the situation is just the opposite. As mentioned earlier, the
Table 3.3. Electricity bill savings obtained by our scheduling policies on SDSC-BLUE.
In each cell, the top number is the saving obtained by Greedy and the bottom
number is the saving obtained by Knapsack.

                    Pricing Ratio
Power Ratio     1:3       1:4       1:5
1:2             3.84%     4.84%     6.19%
                2.39%     3.01%     3.85%
1:3             4.33%     5.46%     6.98%
                3.16%     3.98%     5.10%
1:4             5.55%     6.98%     8.95%
                3.05%     3.84%     4.92%
ANL-BGP trace contains a large percentage of large jobs. During the on-peak period,
the Greedy policy always selects the job with the smallest power profile, whereas the
Knapsack policy often picks a job with a small power profile under the constraint of
the available system resources. While the Knapsack policy may not schedule the job
with the smallest power profile, it is capable of identifying the jobs that consume the
least aggregate power across all their allocated nodes.
3.6.3 Impact of Scheduling Frequencies. Typically batch schedulers make
allocation decisions periodically. On production supercomputers, the scheduling fre-
quency is generally on the order of 10 to 30 seconds. Hence, in this set of experiments,
we evaluate the impact of different scheduling intervals (i.e., 10 seconds, 20 seconds,
and 30 seconds) on the amount of electricity bill savings.
Table 3.4 shows the average electricity bill savings obtained by our design
compared to the conventional FCFS with three different scheduling periods on ANL-
BGP and SDSC-BLUE. As we can see, the longer the scheduling period is, the
more electricity bill savings our design obtains. This is because, with a relatively high
job arrival rate, a longer scheduling period means more system nodes can be accumulated
for job allocation at a time. As such, our design is able to allocate more low-power-profile
jobs during the on-peak period, or more high-power-profile jobs during the off-peak
period, resulting in more electricity bill savings.
Table 3.4. Electricity bill savings obtained by our scheduling policies under different
scheduling frequencies. In each cell, the top number is on ANL-BGP and the
bottom number is on SDSC-BLUE.

                 Scheduling Policy
Frequency     Greedy     Knapsack
10-Second     7.49%      7.13%
              4.33%      3.16%
20-Second     10.07%     8.91%
              9.70%      9.80%
30-Second     17.52%     22.43%
              19.69%     23.07%
Table 3.5 shows the average system utilization of the ANL-BGP and SDSC-BLUE
traces under different scheduling frequencies for our design and FCFS. As we can
see, both the Greedy and the 0-1 Knapsack scheduling policies have almost no impact
on the system utilization rate when the scheduling period is 10 seconds. When the
scheduling period is increased to 30 seconds, some available nodes have to wait a
relatively long time before being assigned to new jobs. Thus the system utilization
rate suffers slightly; however, the degradation is always less than 3%. Combining
the results shown in Tables 3.4 and 3.5, we observe that a longer scheduling period
brings more electricity bill savings, at the cost of slightly degraded system utilization
(less than 3%).
3.6.4 Impact of Scheduling Window. The use of a scheduling window, rather
Table 3.5. System utilization rate under different scheduling frequencies. In each cell,
the top number is on ANL-BGP and the bottom number is on SDSC-BLUE.

                 Scheduling Policy
Frequency     FCFS       Greedy     Knapsack
10-Second     70.00%     69.70%     69.07%
              69.59%     69.53%     69.50%
20-Second     68.56%     69.03%     65.97%
              68.56%     69.25%     65.06%
30-Second     63.77%     60.42%     60.84%
              67.38%     68.85%     66.21%
than the one-by-one job scheduling adopted by conventional batch scheduling, is
a key technique of our design. Typically, the optimal window size is influenced by
many factors, in particular the job arrival rate. In general, a larger window means more
opportunities for our design to make an optimal decision. However, a large window size
can result in high scheduling overhead, especially for the Knapsack policy, since the
window size is a dominant factor in its computational complexity.
We conduct a sensitivity study of the scheduling window by varying its size from
10 to 200. For both traces, our results show that the variations of all three metrics are
not substantial (e.g., within 5%). More importantly, our results indicate that when
the window size is set between 10 and 30, for both traces, the variations of all three
metrics are negligible. Given the high computational overhead introduced by a large
window size, a window size of 10–30 jobs is preferable for typical workloads at
production systems.
3.6.5 Result Summary. In summary, our trace-based experiments have shown
the following:
• Workload characteristics can impact the performance of our design in terms of
Figure 3.8. Job characteristics in December of 2012 on the 48-rack Mira machine.
Each red point indicates a job submission.
electricity bill savings. In particular, our design can achieve more savings on
big capability computing systems than on small capacity computing systems.
• Both the Knapsack policy and the Greedy policy are capable of reducing the
electricity bill with little or no impact on system utilization, compared to FCFS.
Further, the Knapsack policy seems to outperform the Greedy policy for capa-
bility computing.
• The amount of electricity bill savings is also influenced by scheduling frequency:
the longer the scheduling period is, the more electricity bill savings are achieved
by our job power profile aware scheduling.
• A higher power profile ratio and a higher electricity pricing ratio lead to greater
electricity bill savings from our design.
• For typical workload at production systems, a scheduling window of 10-30 jobs
is sufficient.
3.7 Case Study
In this section, we present a case study of using our job power aware scheduler
on Mira. We collected the job trace from the machine for December 2012. During
that month, the first half was used for acceptance testing (hence most jobs were
large) and the second half for early science applications from users (hence most jobs
were small, e.g., single rack). In total, 3,333 jobs were executed on the machine
during the month. A summary of these jobs is shown in
Figure 3.8. For each job, its power profile is extracted from the environmental log [3],
and the distribution of job power profiles is presented in Figure 3.1.
In this case study, we compare the Knapsack policy to FCFS. We apply two
scheduling frequencies, i.e., 10-second and 30-second. The scheduling window size is
set to 10 jobs.
Figure 3.9 presents the average utilization within a day. Here system uti-
lization at each time point is calculated as the average over the month. During
the off-peak time, the system utilization achieved by our scheduler is higher than that
achieved by FCFS. This is because, during the off-peak time, our design attempts
to allocate as many jobs with high power profiles as possible, taking advantage
of low electricity pricing. During the on-peak time, FCFS achieves slightly better
system utilization than our design, because our design tends to schedule large jobs
with low power profiles, leaving idle nodes that are insufficient for other jobs.
Nevertheless, despite minor variation at any time instant, the daily system utilization
is not degraded by our design.
Figure 3.10 presents the average power consumption within a day. Here power
consumption at each time point is calculated as the average over the month. During
the off-peak time, the amount of power consumed by our design is higher than that
consumed by FCFS. This phenomenon is more obvious when we switch the scheduling
frequency from 10 seconds to 30 seconds. This is reasonable as our design aims to
Figure 3.9. The average daily system utilization
increase power consumption by taking advantage of low electricity pricing during off-
peak time. During the on-peak time, while our design is supposed to decrease the
overall power consumption to avoid high electricity cost, the figure does not show such
a pattern. As mentioned earlier, the second half of the month in which the trace was
collected was mainly used for early science testing (i.e., most jobs submitted were
small, e.g., single rack). As presented in Figure 3.1, small jobs have similar power
profiles. Due to these unique characteristics of the job trace (i.e., same-sized jobs
with similar power profiles), our design ends up with the same scheduling sequence
as FCFS, hence similar power consumption during the on-peak time. We believe our
design is capable of providing more electricity bill savings when the machine is used
in production.
The monthly electricity bill saving obtained by our design over FCFS is
5.4% with the 10-second scheduling frequency and 9.98% with the 30-second frequency.
This is substantial, given the approximately $1M annual electricity bill for powering
this machine at Argonne.
3.8 Summary
Figure 3.10. The average daily power consumption
In this chapter, we have proposed a novel job power aware scheduling design,
with the objective to reduce the electricity bill of HPC systems. Our design is based on
the facts that HPC jobs have different individual power profiles and that electricity
prices vary throughout a day. By scheduling jobs with high power profiles during
low electricity pricing period and jobs with low power profiles during high electricity
pricing period, our scheduler is capable of cutting the electricity bill of HPC systems
by up to 23% without impacting system utilization, which is critical to HPC systems.
To our knowledge, this is the first electricity bill study of large-scale HPC
systems using real job traces and job power profiles from production systems. Our key
findings and contributions are: (1) a job power aware scheduling mechanism and two
scheduling policies designed to cut the electricity bill of HPC systems; (2) a scheduling
policy with real potential to substantially reduce the electricity bill for HPC systems
by exploiting distinct job power profiles and varying daily electricity prices; and (3)
a trace-based scheduling simulator named CQSim for evaluating various scheduling
policies at scale, which is available online.
CHAPTER 4
LOCALITY AWARE SCHEDULING ON TORUS-CONNECTED SYSTEMS
4.1 Overview
Two scheduling strategies are commonly used on today's torus-connected sys-
tems. The first is partition-based scheduling, where the scheduler assigns each
user job a compact and contiguous set of computing nodes. IBM Blue Gene se-
ries systems fall into this category [18]. This strategy favors application
performance by preserving the locality of allocated nodes and reducing the network
contention caused by concurrently running jobs sharing network bandwidth. However,
this strategy can cause internal fragmentation (when more nodes are allocated to a
job than it requests) and external fragmentation (when sufficient nodes are available
for a request but cannot be allocated contiguously), therefore leading to poor
system performance (e.g., low system utilization and high job response time) [20].
The other is non-contiguous allocation, where free nodes are assigned to a user
job regardless of whether they are contiguous. Cray XT/XE series systems fall
into this category [19]. Non-contiguous allocation eliminates the internal and external
fragmentation seen in partition-based systems, thereby leading to high system
utilization. Nevertheless, it introduces other problems, such as scattering applica-
tion processes all over the system. Non-contiguous node allocation can make
inter-process communication less efficient and cause network contention among con-
currently running jobs [21], thereby resulting in poor job performance, especially for
communication-intensive jobs.
Partition-based allocation achieves good job performance by sacrificing sys-
tem performance (e.g., poor system utilization), whereas non-contiguous allocation
can result in better system performance but may severely degrade the performance of
user jobs (e.g., prolonged wait time and run time). As systems continue growing in
size, a fundamental problem arises: how can job performance be effectively balanced
with system performance on torus-connected machines? In this work, we present a
new scheduling design combining the merits of partition-based scheduling and non-
contiguous scheduling for torus-connected machines.
Strictly speaking, a job scheduling system contains two parts, namely job
prioritizing and resource allocation. Job prioritizing decides the order in which jobs
are allowed to run; the decision is based on many factors, such as job size, job
run-time, and job priority. Resource allocation decides the set of nodes allocated to
each incoming job. Figure 4.1 illustrates a typical job scheduling system, where each
job is retrieved from the wait queue and computing nodes are assigned to it one by
one. Many supercomputers use this kind of First-Come First-Serve (FCFS) scheduling
policy. As we can see, every job leaves the wait queue and free nodes are assigned
to it according to node identifiers. Often the topological characteristics and locality
of system nodes are ignored, and numerically sequential nodes may be far apart in
physical space. A well-known approach to address this problem is processor ordering,
which usually uses a space filling curve, such as the Hilbert curve, to map the nodes
of the torus onto a 1-dimensional list that preserves locality information [27][73].
While processor ordering works well at the beginning of scheduling, job allocation
and deallocation eventually fragment this 1-dimensional list, making it less efficient
as time goes by.
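For illustration, the standard 2-D Hilbert-curve index used by such processor-ordering schemes can be computed as follows (production schedulers order nodes in three dimensions, so this 2-D sketch is only indicative of the idea):

```python
def hilbert_index(n, x, y):
    """Map node coordinates (x, y) on an n x n mesh (n a power of two)
    to a 1-D Hilbert-curve index. Nodes that are consecutive on the
    curve are always adjacent on the mesh, which is exactly the
    locality that processor ordering tries to preserve.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate the quadrant to keep orientation consistent
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

# Order the 16 nodes of a 4x4 mesh along the curve: every consecutive
# pair of nodes on the curve is a neighboring pair on the mesh.
order = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda p: hilbert_index(4, *p))
assert all(abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
           for a, b in zip(order, order[1:]))
```

The fragmentation problem described above is precisely that, after many allocations and deallocations, the free entries of this 1-D ordering no longer form contiguous runs.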
In this work, we present a window-based locality-aware scheduling design. Our
design is based on two key observations. First, as shown in Figure 4.1, existing
scheduling systems make decisions in a per-job manner: each job is dispatched to
system resources without considering subsequent jobs. While making isolated per-job
decisions may provide good short-term optimization, it is likely to result in poor
Figure 4.1. Typical job scheduling uses a First-Come First-Serve (FCFS) scheduling
policy. Jobs are removed from the wait queue and assigned free nodes one by one.
The grey squares represent busy nodes occupied by running jobs; the green squares
represent free nodes.
performance in the long term. Second, existing scheduling systems maintain a list
of free nodes for resource allocation. While special processor ordering (e.g., using a
space filling curve) is often adopted to preserve node locality in the list, the node
list becomes fragmented as time goes by, and subsequent jobs inevitably receive
dispersed node allocations due to the lack of contiguous segments in the list.
Rather than scheduling jobs one by one, our design takes a “window” of jobs
(i.e., multiple jobs) into consideration when making prioritizing and allocation decisions,
to prevent short-term decisions from obfuscating future optimization. Our job
prioritizing module maintains a “window” of jobs; jobs are placed into
the window in a way that maintains job fairness (e.g., through FCFS). Rather than
allocating jobs one by one from the head of the wait queue as existing schedulers do,
we make scheduling decisions for a “window” of jobs at a time. Our resource allocation
module treats a contiguous set of nodes as a slot and maintains a list of such slots.
These slots have different sizes, and each may accommodate one or more jobs. The
allocation of the jobs in the window onto the slot list is conducted so as to maximize
system utilization. We formalize the allocation of a window of jobs to a list of slots
as a 0-1 Multiple Knapsack Problem (MKP) and present two algorithms, namely
Branch&Bound and Greedy, to solve it.
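To make the Greedy direction concrete, here is a best-fit sketch of packing a window of jobs into slots; the data layout and tie-breaking rule are our assumptions, and the actual design also includes an exact Branch&Bound solver:

```python
def greedy_mkp(window, slots):
    """Greedy heuristic for the 0-1 Multiple Knapsack packing of a
    window of jobs into contiguous-node slots: consider jobs largest
    first and place each into the smallest slot that still fits (best
    fit), preserving large contiguous slots for later jobs.

    `window` is a list of (job_id, nodes) pairs and `slots` a list of
    slot capacities. Returns a job_id -> slot-index mapping for the
    jobs that could be placed; unplaced jobs stay in the wait queue.
    """
    remaining = list(slots)  # free capacity left in each slot
    placement = {}
    for job_id, size in sorted(window, key=lambda j: j[1], reverse=True):
        fits = [i for i, cap in enumerate(remaining) if cap >= size]
        if fits:
            best = min(fits, key=lambda i: remaining[i])  # best fit
            remaining[best] -= size
            placement[job_id] = best
    return placement
```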
We evaluate our design via extensive trace-based simulations. In this chapter,
we conduct a series of experiments comparing our design against the commonly used
FCFS/EASY backfilling scheduling enhanced with processor ordering. Our
preliminary results demonstrate that our design can reduce average job wait time
by up to 28% and average job response time by 30%, with a slight improvement in
overall system utilization.
The structure of the chapter is as follows. Section 4.3 gives a detailed de-
scription of our window-based locality-aware scheduling design. Section 4.4 describes
the problem formalization and two scheduling algorithms implemented in our design.
Sections 4.5–4.6 present our evaluation methodology and trace-based simulations
comparing our design against the FCFS/EASY backfilling scheduling scheme. In
Section 4.2, we discuss existing work on job scheduling and allocation for HPC systems.
Our conclusion is presented in Section 4.7.
4.2 Related Work
The most commonly used scheduling policy is FCFS combined with EASY
backfilling [67]. Under this policy, the scheduler picks a job from the head of the
wait queue and dispatches it to the available system resources. Many studies seek
to improve the performance of this classic scheduling paradigm. Tang et al. refined
users' estimated job runtimes in order to make backfilling more efficient [66][70].
They also designed a walltime-aware job allocation strategy that packs jobs finishing
around the same time adjacently, in order to minimize the resource fragmentation
caused by job length discrepancy [51]. Other variations of FCFS/EASY backfilling
have been proposed to optimize system performance in terms of power consumption
and energy cost [65][74]. However, none of them takes allocation locality into
consideration when making scheduling decisions.
There are several studies focusing on allocation algorithms to minimize system
fragmentation. Lo et al. presented a non-contiguous allocation scheme named Multi-
ple Buddy Strategy (MBS) [75]. MBS preserves locality by allocating each job a set of
“blocks” to reduce interference between jobs, with the advantage of eliminating both
internal and external fragmentation. Each “block” consists of 2^n nodes adjacent
to each other (with n varying according to the block size). However, the distance
between “blocks” can be too long, making communication between processes
of the same application less efficient. MBS also needs to partition the system
into a fixed number of “blocks” in advance, which is time consuming and inefficient
for large-scale systems.
Leung et al. presented allocation strategies based on space filling curves and
one dimensional packing [73]. They implemented these strategies using 2-dimensional
Hilbert curves and had an integer program for general networks. Their preliminary
experimental results show that processor locality can be preserved in massively paral-
lel supercomputers using one-dimensional allocation strategies based on space filling
curve. However, space filling curves have the limitation that they can only be applied
to systems with 2^n nodes in each dimension.
Albing et al. conducted study about the allocation strategies that the Cray
Application Level Placement Scheduler (ALPS) used [27]. The job allocation in Cray
Linux Environment (CLE) operating system is managed by ALPS, which works from
a list of available nodes and assigns nodes from this list to jobs in sequence.
However, ALPS does not make changes or calculations when making allocation deci-
sions; it simply works off the ordered list, however that list is ordered. They note
that the ordered list can be obtained either by using a Hilbert curve or by simply
sorting the nodes based on their spatial coordinates in the system.
There are other studies focusing on allocation algorithms to improve the per-
formance of user jobs. Pascual et al. proposed an allocation strategy aiming to
assign a contiguous allocation to each job, in order to improve communication per-
formance [21]. However, this strategy results in severe scheduling inefficiency due to
increased system fragmentation; they reduce this adverse effect with a relaxed
version called the quasi-contiguous allocation strategy.
Another related line of work is online bin packing. In the bin packing problem,
the objective is to pack a set of items of given sizes into bins. Each bin has a fixed
capacity and cannot be assigned items whose total size exceeds that capacity; the
goal is to minimize the number of bins used. The offline version is NP-hard [76], and
bin packing was one of the first problems studied in terms of both online and
offline approximability [77][78].
This work has two major differences compared to the above literature. First,
rather than scheduling jobs one by one as most existing schedulers do [27][75][21],
our design takes a “window” of jobs (i.e., multiple jobs) into consideration for job
prioritizing and resource allocation. Second, our resource management module treats
a contiguous set of nodes as a slot and maintains a list of such slots. The slot list
is updated dynamically as jobs are allocated and deallocated in the system. The
allocation of the jobs in the window onto the slot list is conducted so as to
maximize system utilization. This differs from existing job allocation schemes
that work off an ordered list, which loses the spatial information of the torus-connected
system.
4.3 Design Overview
Figure 4.2 gives an overview of our window-based locality-aware job scheduling design. Our design contains two key parts. The job prioritizing module maintains a “window” of jobs that are retrieved from the wait queue to ensure job fairness. Rather
Figure 4.2. Overview of our window-based locality-aware scheduling design. The job prioritizing module maintains a “window” of jobs retrieved from the wait queue, and the resource management module keeps a list of slots. Each slot represents a contiguous set of available nodes. Our scheduling design allocates a “window” of jobs to a list of slots at a time.
than dispatching jobs one by one as existing schedulers do, we dispatch multiple jobs at a time. Unlike existing scheduling designs, the resource management module is responsible for organizing the available nodes into a set of slots. Each slot contains a contiguous set of free nodes; here, contiguity refers to the adjacency of the nodes' original positions in the torus-connected system. These slots may have different sizes. A new slot appears when a job releases the nodes it was assigned: the newly released nodes merge with free neighboring nodes to form a larger slot. A slot disappears when it (or part of it) is assigned to a job; if the job takes only part of a slot, the remaining part becomes a new slot. The slot list is updated whenever a job is allocated or deallocated, so the resource management module gets feedback from the system immediately after each allocation or deallocation to update the status of the slot list. The allocation of the jobs in the window onto the slot list is conducted in such a way as to maximize system utilization.
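The slot bookkeeping described above can be sketched as follows. This is a simplified 1-D illustration (the thesis targets a torus, where the merge test would use torus adjacency rather than consecutive indices); the class and method names are illustrative, not taken from the actual implementation.

```python
class SlotList:
    """Maintain contiguous runs of free nodes as (start, end) slots; 1-D sketch."""

    def __init__(self, slots):
        # slots: list of (start, end) inclusive ranges of free node indices
        self.slots = sorted(slots)

    def allocate(self, size):
        """Carve `size` nodes out of the first slot large enough; return them."""
        for i, (start, end) in enumerate(self.slots):
            if end - start + 1 >= size:
                nodes = list(range(start, start + size))
                if start + size <= end:
                    # the remaining part of the slot becomes a new, smaller slot
                    self.slots[i] = (start + size, end)
                else:
                    del self.slots[i]  # slot fully consumed: it disappears
                return nodes
        return None  # no single slot can hold the job

    def release(self, nodes):
        """Return nodes to the free pool, merging with adjacent slots."""
        self.slots.append((min(nodes), max(nodes)))
        self.slots.sort()
        merged = [self.slots[0]]
        for start, end in self.slots[1:]:
            if start <= merged[-1][1] + 1:  # adjacent: merge into a larger slot
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        self.slots = merged
```

Releasing nodes 4 and 5 into a list holding slots {1–3} and {6–11}, for instance, would merge them into a single slot {1–5} adjacent to {6–11} only if the indices touch, mirroring the merge-on-release behavior described above.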
The window-based design balances job fairness and system performance. In our design, rather than allocating jobs one by one from the front of the wait queue, the scheduler takes all the jobs in the window and makes a prioritizing decision for them at once. The selection of jobs from the wait queue into the window follows a scheduling rule (e.g., job arrival time in the case of the FCFS policy), thereby guaranteeing job fairness. With the information of both jobs and slots, the scheduler aims to make an optimal decision in terms of allocating the jobs to the slots. In the following section, we present our detailed scheduling strategy.
4.4 Scheduling Strategy
Our scheduling strategy contains two parts. We first formalize the resource allocation problem as a 0-1 Multiple Knapsack Problem (MKP), and then present two algorithms to solve the MKP.
4.4.1 MKP Formalization. We consider each slot as a knapsack and the jobs in the window as items waiting to be put into the knapsacks. Suppose J = {J_1, J_2, J_3, ..., J_n} is the set of n jobs in the window. Each job J_j has weight w_j and profit p_j. K = {K_1, K_2, K_3, ..., K_m} is a set of m knapsacks, each knapsack K_i with capacity C_i. We want to select m disjoint subsets of jobs so that the total profit of the selected jobs is maximized and each subset can be put into a different knapsack whose capacity is no less than the total weight of the jobs in the subset.
Formally,

\max z = \sum_{i=1}^{m} \sum_{j=1}^{n} p_j \cdot x_{ij}   (4.1)

which is subject to the following constraints:

\sum_{j=1}^{n} x_{ij} \cdot w_j \le C_i, \quad i \in \{1, 2, ..., m\}   (4.2)

\sum_{i=1}^{m} x_{ij} \le 1, \quad j \in \{1, 2, ..., n\}   (4.3)

x_{ij} \in \{0, 1\}, \quad i \in \{1, 2, ..., m\};\; j \in \{1, 2, ..., n\}   (4.4)

where

x_{ij} = \begin{cases} 1 & \text{if job } j \text{ is put into knapsack } i; \\ 0 & \text{otherwise} \end{cases}   (4.5)
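The formulation can be read literally as a search over assignments x_ij. The brute-force sketch below (not one of the two algorithms presented later) enumerates every assignment of each job to a knapsack or to none, keeps the best assignment satisfying constraint (4.2), and is practical only for the tiny instances that a window of about five jobs produces.

```python
from itertools import product

def solve_mkp(weights, profits, capacities):
    """Exhaustively solve the 0-1 MKP: maximize total profit of packed jobs.
    Each job goes into exactly one knapsack or stays out (index None),
    which enforces constraint (4.3) by construction."""
    n, m = len(weights), len(capacities)
    best_profit, best_assign = 0, [None] * n
    # assignment[j] in {None, 0, ..., m-1}: the knapsack job j is put into
    for assignment in product([None] + list(range(m)), repeat=n):
        load = [0] * m
        profit = 0
        feasible = True
        for j, i in enumerate(assignment):
            if i is not None:
                load[i] += weights[j]
                profit += profits[j]
                if load[i] > capacities[i]:  # violates capacity constraint (4.2)
                    feasible = False
                    break
        if feasible and profit > best_profit:
            best_profit, best_assign = profit, list(assignment)
    return best_profit, best_assign
```

On the five-job, three-slot example given later in Section 4.4.3 (weights and profits both equal to the job sizes 6, 4, 3, 1, 2, capacities 3, 6, 6), the optimum packs every job except the size-1 job for a total profit of 15.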
When m = 1, the Multiple Knapsack Problem reduces to the 0-1 knapsack problem. In our model, we can assume that

p_j \text{ and } C_i \text{ are positive integers,}   (4.6)

w_j \le \max_{i \in \{1,...,m\}} C_i, \quad \forall j \in \{1, 2, ..., n\},   (4.7)

C_i \ge \min_{j \in \{1,...,n\}} w_j, \quad \forall i \in \{1, 2, ..., m\},   (4.8)

\sum_{j=1}^{n} w_j > C_i, \quad \forall i \in \{1, 2, ..., m\}.   (4.9)
In our model, assumption 4.6 cannot be violated, since a job's profit and weight are both defined as its size, and the capacity of a knapsack is the size of the corresponding slot, which can never be negative. If a job j violates assumption 4.7, meaning it requires more nodes than any knapsack can accommodate, it is held until a slot containing enough free nodes appears. In our experiments, we found that this prolongs big jobs' wait time by less than 10%. If a knapsack violates assumption 4.8, it is counted as system fragmentation. Finally, observe that if m > n, then the (m − n) knapsacks of smallest capacity are not included in this formalization.
The window size should be set based on the system's workload: a large window is preferred when the job arrival rate is high. For typical workloads at production supercomputers, we find that a window of size 5 makes a good tradeoff between scheduling quality and scheduling overhead.
4.4.2 Algorithms. In this work, we develop two scheduling algorithms: a Branch and Bound algorithm and a Greedy algorithm. We discuss the details of both in this section.

Branch and Bound (B&B) is a general algorithm for finding optimal solutions of discrete and combinatorial optimization problems [79]. In our model, we can control the window size to limit the number of jobs in the MKP model so that the overhead of the B&B algorithm remains acceptable in terms of time and space complexity. In our experiments, we set the window size to 5, which means only the first 5 jobs in the wait queue are considered in our MKP model. The computation time for solving the MKP with 5 jobs is affordable for the scheduler. (The scheduler makes scheduling decisions periodically, with an interval of 10 to 30 seconds.)
In this algorithm, successive levels of the branch-decision tree are constructed
Figure 4.3. Decision tree generated for finding the optimal solution by using the Branch and Bound algorithm. There are 2 knapsacks and 3 jobs (m = 2, n = 3).
by selecting a job and putting it into each knapsack in turn. Once a job has been selected for branching, it is put into the knapsacks in increasing order of the knapsacks' indices. After all the knapsacks have been considered, the job is excluded from the current solution. Figure 4.3 gives an example of how the optimal solution is found using B&B. Each circle represents a state, with two arrays indicating the current jobs J and knapsacks K. Here j1 = 2 means job j1 requires 2 nodes, and k1 = 5 means the capacity of knapsack k1 is 5. These two arrays are updated after each decision is made, which generates the search tree shown in Figure 4.3. The circle at the top is the initial state, with three jobs and two knapsacks. The B&B algorithm systematically enumerates candidate solutions by using upper and lower estimated bounds on the quantity being optimized. Here we use depth-first search in B&B. As we can see, two candidate solutions are found at the bottom left of the search tree when we first put job j1 into knapsack k1. The solution circled by the red line is clearly better: all jobs get allocated and no space in the knapsacks is left idle. The upper bound on the space left in all knapsacks is also set to 0. Based on this bound, we can discard other possible decisions, such as allocating j1 to knapsack k2 or not selecting job j2. Algorithms 4.1 and 4.2 show the pseudo code of this Branch and Bound algorithm.
Algorithm 4.1 Branch & Bound

E = new(node), the dummy start node
H = new(Heap), a max-heap for our maximization problem
while true do
if E is a final leaf then
E is an optimal solution;
print out the path from E to the root;
return;
end if
Branch(E);
if H is empty then
No solution;
return;
end if
E = delete-top(H);
end while

Algorithm 4.2 Branch

Generate all the children of E;
Compute the approximate cost value Ci of each child;
Insert each child into the heap H;

The Branch and Bound algorithm guarantees an optimal solution, at an exponential computational complexity of O(n^m). This is feasible due to the small window size. For example, with a window size of 5, the algorithm examines 3,125 candidate solutions, which can be evaluated within a few seconds.
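Algorithm 4.1 drives the search with a max-heap (best-first order); an equivalent depth-first sketch of branch and bound for this model is shown below. The bound used here, the current profit plus the profits of all not-yet-branched jobs, is an illustrative choice of our own; the thesis does not specify its bound function.

```python
def branch_and_bound(weights, profits, capacities):
    """Depth-first branch and bound for the 0-1 MKP of Section 4.4.1 (a sketch)."""
    n, m = len(weights), len(capacities)
    best = {"profit": 0, "assign": None}

    def dfs(j, remaining, assign, profit):
        # Upper bound: even if every remaining job were packed, could this
        # branch beat the incumbent? If not, prune it.
        if profit + sum(profits[j:]) <= best["profit"]:
            return
        if j == n:  # all jobs branched on: record the new incumbent
            best["profit"], best["assign"] = profit, assign[:]
            return
        for i in range(m):  # try knapsacks in increasing index order
            if weights[j] <= remaining[i]:
                remaining[i] -= weights[j]
                assign[j] = i
                dfs(j + 1, remaining, assign, profit + profits[j])
                remaining[i] += weights[j]  # backtrack
        assign[j] = None  # finally, exclude job j from the current solution
        dfs(j + 1, remaining, assign, profit)

    dfs(0, list(capacities), [None] * n, 0)
    return best["profit"], best["assign"]
```

On the Section 4.4.3 instance (job sizes 6, 4, 3, 1, 2 with profit equal to weight, slot capacities 3, 6, 6), the search finds the optimal profit of 15, leaving out only the size-1 job.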
When the window size grows, the Branch and Bound algorithm becomes expensive; in that case we can use the polynomial-time approximate Greedy algorithm. It obtains a feasible solution by applying the greedy algorithm for the classic 0-1 knapsack problem to the first knapsack, then to the second one using the remaining jobs, and so on. This is obtained by calling Algorithm 4.3 m times, given the residual capacity C̄i = Ci of the current knapsack and the current solution of value z, stored, for j = 1, ..., n, in yj (defined in 4.10).
Algorithm 4.3 GREEDY
Input: n, (pj), (wj), z, (yj), i, Ci
Output: z, (yj)
for j = 1 to n do
if yj = 0 and wj ≤ Ci then
yj = i;
Ci = Ci − wj;
z = z + pj;
end if
end for
The solution obtained by calling GREEDY m times is not necessarily optimal. Martello and Toth proposed local exchange techniques that can improve this solution [79]. To implement their techniques in our model, we do the following. First, we consider all pairs of jobs assigned to different knapsacks and try to interchange them if doing so allows the insertion of a new job into the solution. When all pairs have been considered, we try to exclude in turn each job currently in the solution and replace it with one or more jobs not in the solution so that the total profit is increased. The greedy algorithm has linear time complexity, i.e., O(n), and the interchange step also takes O(n), since it only happens when a new job enters the solution. Hence, using GREEDY with these improvements costs no more than O(n^2) time.
y_j = \begin{cases} 0 & \text{if job } j \text{ is currently unassigned;} \\ i & \text{if job } j \text{ is assigned to knapsack } i. \end{cases}   (4.10)
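Algorithm 4.3, called once per knapsack as described above, can be written out as follows. This is a sketch: in the thesis model p_j = w_j (both equal the job size), so the profit-density ordering of the classic greedy is a no-op and jobs are simply scanned in window order.

```python
def greedy_mkp(weights, profits, capacities):
    """Call the single-knapsack greedy (Algorithm 4.3) once per knapsack."""
    n = len(weights)
    y = [0] * n   # y[j] = 0 if job j unassigned, else 1-based knapsack index (4.10)
    z = 0         # total profit of the current solution
    for i, cap in enumerate(capacities, start=1):
        c_bar = cap                      # residual capacity of knapsack i
        for j in range(n):
            if y[j] == 0 and weights[j] <= c_bar:
                y[j] = i                 # assign job j to knapsack i
                c_bar -= weights[j]
                z += profits[j]
    return z, y
```

On the Section 4.4.3 instance (job sizes 6, 4, 3, 1, 2, capacities 3, 6, 6) this plain greedy achieves a profit of 14, one short of the optimum of 15, illustrating why the exchange step discussed above can pay off.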
4.4.3 An Example. The following example illustrates the difference between our design and the default scheduler using FCFS/EASY backfilling. Assume five jobs A, B, C, D, E (of sizes 6, 4, 3, 1, and 2, respectively) are submitted and waiting in the queue. The system consists of 20 nodes in total, and the currently available nodes are 1, 2, 3, 6, 7, 8, 9, 10, 11, 15, 16, 17, 18, 19, 20. The nodes that do not appear in this list (4, 5, 12, 13, 14) are occupied by jobs that are still running. The FCFS/EASY backfilling scheme used by the default scheduler cuts a chunk of six nodes from the start of this list for job A, which means nodes 1, 2, 3, 6, 7, 8 are assigned to job A. Then, sequentially, nodes 9, 10, 11, 15 are assigned to B; 16, 17, 18 to C; and 19 to D.

Apparently, this scheme does not deliver the best allocation. First, jobs A and B get non-contiguous node sets, which means their allocations are fragmented. Node 20 is left idle, and job E is not satisfied, since the remaining available nodes are not enough to meet its requirement. Under this default scheduling scheme, the scheduling sequence is always 〈A, B, C, D, E〉, regardless of the current status of the node list and each job's size.
Our design first puts these five jobs into the window according to their arrival order, A, B, C, D, E. It then checks the status of the slot list (formed based on system node contiguity) and finds out how many slots are available. In our example, there are three such slots: {1, 2, 3}, {6, 7, 8, 9, 10, 11}, and {15, 16, 17, 18, 19, 20}. Based on the sizes of these slots and of each job, our design uses the B&B or Greedy algorithm to make the following scheduling decision. First, it puts job C into the first slot (assigning nodes 1, 2, 3 to C); then it puts job A into the second slot (6, 7, 8, 9, 10, 11 to A); nodes 15, 16, 17, 18 go to B; and 19, 20 to E. Our design thus guarantees that every job gets a compact allocation while maintaining high system utilization.
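The default list-cutting behavior in this example is easy to reproduce. The sketch below is a simplification (a real EASY backfilling scheduler also handles reservations and backfill candidates); the function name is illustrative.

```python
def list_cut_allocate(free_nodes, job_sizes):
    """Default scheme: assign each job the next chunk of the ordered free list,
    skipping a job when not enough nodes remain."""
    allocations, pos = {}, 0
    for name, size in job_sizes:
        if pos + size <= len(free_nodes):
            allocations[name] = free_nodes[pos:pos + size]
            pos += size
    return allocations

# Free-node list and job sizes from the example above
free = [1, 2, 3, 6, 7, 8, 9, 10, 11, 15, 16, 17, 18, 19, 20]
alloc = list_cut_allocate(free, [("A", 6), ("B", 4), ("C", 3), ("D", 1), ("E", 2)])
```

Running this reproduces the fragmented outcome described above: A spans the 3-to-6 gap, B spans the 11-to-15 gap, node 20 is stranded, and E (size 2) is left unallocated.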
Figure 4.4 pictorially illustrates this example to highlight the difference between our design and the default scheduler using the FCFS policy. As it shows, there are 20 nodes in the node list; the grey ones are occupied by currently running jobs. Subfigure A in Figure 4.4 shows the scheduling result obtained by the default scheduler: it fragments the allocations of job A (with size 6) and job B (with size 4), leaving node 20 idle and job E (with size 2) unallocated. In Subfigure B, our design guarantees that every job gets a compact allocation.
Figure 4.4. Scheduling result comparison between the default scheduler and our design. The default scheduler (Subfigure A) produces the prioritizing sequence 〈A, B, C, D〉; the allocations for jobs A and B are fragmented, and node 20 is left idle. Our design (Subfigure B) optimizes the allocation so that every job gets a compact allocation and no node is left idle. The prioritizing sequence obtained by our design is 〈C, A, B, E〉.
4.5 Evaluation Methodology
We conduct a series of experiments using the traces described in Section 4.5.2 to evaluate our design against the default scheduler using FCFS/EASY backfilling, the most commonly used scheduling policy on production supercomputers [66][51]. In the rest of this chapter, we use B&B, Greedy, and Default to denote our two algorithms and the default one. This section describes our evaluation methodology; the experimental results are presented in the next section.
4.5.1 CQSim: Trace-based Scheduling Simulator. Simulation is an integral part of our evaluation of the various allocation policies as well as their aggregate effects on system utilization, job wait time, and response time. We have developed a simulator named CQSim to evaluate our design at scale. The simulator is written in Python and is composed of several modules, such as a job module, a node module, and a scheduling policy module. Each module is implemented as a class. The design principles are reusability, extensibility, and efficiency. The simulator takes job events from the trace; an event can be a job submission, start, end, or other event. Based on these events, the simulator emulates job submission, allocation, and execution under specific scheduling and allocation policies. CQSim is open source and is available to the community [41].
4.5.2 Job Traces. In this work, we use two real workload traces collected from production supercomputers to evaluate our design. The objective of using multiple traces is to quantify the performance of our design when dealing with jobs and systems with different characteristics. The first trace is from a machine named Blue Horizon at the San Diego Supercomputer Center (denoted SDSC-BLUE in this chapter) and contains 4,830 jobs. The second trace is from an IBM Blue Gene/P system named Intrepid at Argonne National Laboratory (denoted ANL-Intrepid) [42]; it contains 2,612 jobs. Figure 4.5 summarizes the job size distribution of these traces. ANL-Intrepid represents capability computing, where jobs require a large number of computing nodes for solving large-scale problems, whereas SDSC-BLUE represents capacity computing, where the system is utilized to solve a large number of small-sized problems.
Figure 4.5. Job size distribution of ANL-Intrepid and SDSC-BLUE
4.5.3 Evaluation Metrics. We use three scheduling metrics for evaluation.
• System Utilization Rate. This metric denotes the ratio of the node-hours used by jobs to the total elapsed system node-hours. Specifically, let T be the total elapsed time for the J jobs, N be the total number of nodes in the system, and, for each job i, let c_i be its completion time, s_i its start time, and n_i its size. The system utilization rate is then calculated as

\frac{\sum_{0 \le i \le J} (c_i - s_i) \cdot n_i}{N \cdot T}   (4.11)
• Average Job Wait Time. For each job, the wait time refers to the time elapsed between the moment it is submitted and the moment it is allocated to run. The metric is calculated as the average across all the jobs submitted to the system. It is a user-centric metric, measuring scheduling performance from the user's perspective.

T_{wait} = t_{start} - t_{arrive}   (4.12)
• Average Job Response Time. The response time refers to the time elapsed from when a job is submitted until it ends, which equals its wait time plus its run time.

T_{response} = T_{run} + T_{wait}   (4.13)
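The three metrics can be computed directly from a job log, as the sketch below shows. The function name and the tuple layout `(arrive, start, complete, size)` are assumptions for illustration; here T is taken as the span from the earliest arrival to the latest completion, matching "total elapsed time" in the utilization definition.

```python
def scheduling_metrics(jobs, num_nodes):
    """Compute utilization (4.11), average wait (4.12), average response (4.13).
    jobs: list of (arrive, start, complete, size) tuples; num_nodes: machine size N."""
    # T: total elapsed time covering all J jobs
    T = max(c for _, _, c, _ in jobs) - min(a for a, _, _, _ in jobs)
    # numerator of (4.11): node-hours actually consumed by jobs
    used = sum((c - s) * n for _, s, c, n in jobs)
    utilization = used / (num_nodes * T)
    avg_wait = sum(s - a for a, s, _, _ in jobs) / len(jobs)       # (4.12)
    avg_response = sum(c - a for a, _, c, _ in jobs) / len(jobs)   # (4.13)
    return utilization, avg_wait, avg_response
```

For example, two 4-node jobs on an 8-node machine, one running over [0, 10] and one submitted at 0 but started at 5 and finished at 10, yield a utilization of 0.75, an average wait of 2.5, and an average response time of 10.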
4.6 Experiment Results
We conduct a series of experiments on the traces described in Section 4.5.2 to
evaluate our design as against the default scheduler which uses FCFS/EASY back-
filling policy.
In our experiments, the scheduler makes scheduling decisions every 30 seconds. To make the scheduler responsive, the window size is set to 5 so that the time cost for the B&B algorithm to solve the MKP is affordable (i.e., a few seconds). We assume that β percent of the jobs in each trace are sensitive to the contiguity of allocation. Existing studies show that job runtime is influenced by resource allocation, and the variability introduced by different allocations can be as high as 70% [4][80]. In our experiments, we use a parameter α to denote this impact: for a job that is sensitive to the contiguity of resource allocation, we assume its runtime on a contiguous allocation is about α percent shorter than on a non-contiguous allocation. We conduct a series of sensitivity studies to evaluate our design under a variety of configurations. In the following experiments, both β and α are set to 10%, 20%, 30%, 40%, and 50%.
4.6.1 Evaluation with SDSC-BLUE Trace. The evaluation results for the SDSC-BLUE trace are presented in Tables 4.1, 4.2, and 4.3. Table 4.1 shows the system utilization improvement obtained by our design using B&B and Greedy over the default scheduler using FCFS/EASY backfilling. In general, our design outperforms the default scheduler by about 1% to 3%. As the impact parameter α and the job percentage β increase, this improvement grows slowly. Apparently, the system's throughput is not sensitive to the growth in the affected jobs' running time, since those jobs require only a very small portion of the system's nodes.
Table 4.1. System utilization improvement obtained by our design using B&B and Greedy as against the Default scheduler. In each cell, the number on top is the improvement achieved by using B&B; the bottom number is the improvement achieved by using Greedy.
Impact Parameter α
Job Percentage β 10% 20% 30% 40% 50%
10% 0.98% 1.37% 1.69% 1.66% 1.92%
1.15% 1.42% 1.52% 1.68% 1.84%
20% 1.28% 1.09% 1.46% 1.50% 1.77%
1.11% 1.12% 1.54% 1.63% 1.69%
30% 2.23% 2.23% 2.35% 2.48% 2.54%
2.18% 2.21% 2.35% 2.43% 2.67%
40% 2.36% 2.49% 2.64% 2.84% 3.03%
2.28% 2.49% 2.47% 2.84% 2.92%
50% 3.10% 3.14% 3.31% 3.72% 3.83%
3.07% 3.20% 3.48% 3.68% 3.83%
Table 4.2 shows the average job wait time improvement obtained by our design using B&B and Greedy over the default scheduler. As the impact parameter α and the job percentage β increase, the improvement reaches as much as 27%. This result indicates that when a large portion of jobs suffer the adverse impact of inappropriate node allocation, our design using B&B or Greedy greatly outperforms the default scheduler.
Table 4.2. Average job wait time improvement obtained by our design using B&B and Greedy as against the Default scheduler. In each cell, the number on top is the improvement achieved by using B&B; the bottom number is the improvement achieved by using Greedy.
Impact Parameter α
Job Percentage β 10% 20% 30% 40% 50%
10% 11.27% 11.71% 12.95% 14.57% 17.17%
11.52% 11.42% 12.62% 14.46% 18.31%
20% 11.36% 12.54% 13.99% 16.50% 18.20%
12.07% 12.44% 14.55% 16.22% 19.05%
30% 12.98% 13.87% 14.04% 16.16% 19.29%
12.55% 13.81% 12.98% 16.66% 20.31%
40% 14.61% 15.86% 16.22% 19.48% 22.36%
15.30% 15.57% 16.48% 19.31% 21.70%
50% 19.35% 21.98% 23.13% 25.29% 27.33%
18.52% 21.70% 23.84% 25.05% 26.86%
Table 4.3 shows the improvement in average job response time obtained by our design using B&B and Greedy over the default scheduler. This metric has the same trend as average job wait time and reaches even higher values. This is because a job's run time is usually longer than its wait time in the SDSC-BLUE trace, so for most jobs the runtime dominates the total response time. Thus, the impact parameter α has a much greater influence on jobs' response time than on their wait time.
4.6.2 Evaluation with ANL-Intrepid Trace. The evaluation results for the ANL-Intrepid trace are presented in Tables 4.4, 4.5, and 4.6. Table 4.4 shows that the system
Table 4.3. Average job response time improvement obtained by our design using B&B and Greedy as against the Default scheduler. In each cell, the number on top is the improvement achieved by using B&B; the bottom number is the improvement achieved by using Greedy.
Impact Parameter α
Job Percentage β 10% 20% 30% 40% 50%
10% 10.47% 10.14% 10.28% 11.11% 12.62%
9.27% 10.29% 10.74% 11.20% 12.32%
20% 11.12% 12.29% 12.93% 13.33% 14.93%
11.07% 12.23% 13.10% 13.37% 14.78%
30% 12.63% 13.73% 15.01% 16.20% 17.79%
12.56% 13.68% 15.25% 16.12% 17.32%
40% 16.17% 16.81% 17.09% 17.90% 18.90%
16.03% 16.98% 17.26% 17.93% 18.76%
50% 18.81% 22.11% 23.39% 25.56% 28.83%
18.17% 22.00% 23.45% 24.98% 28.41%
utilization improvement obtained by our design on the ANL-Intrepid trace can be as much as 4.8%, which is slightly higher than that on SDSC-BLUE. This is because the two traces have different job size distributions, as shown in Section 4.5.2. The variation of job size in the ANL-Intrepid trace is greater than in SDSC-BLUE: the smallest jobs in ANL-Intrepid require 256 nodes, while the biggest require 8K nodes. When a big job is released from the system, it vacates a very big slot, with great potential for our design to optimize. Hence, the system utilization improvement obtained on the ANL-Intrepid trace is higher than on the SDSC-BLUE trace.
The average job wait time and response time improvements obtained by our design on the ANL-Intrepid trace are not as prominent as on SDSC-BLUE. As shown in Figure 4.5, more than 50% of the jobs in the ANL-Intrepid trace are of size 1K, 2K,
Table 4.4. System utilization improvement obtained by our design using B&B and Greedy as against the Default scheduler. In each cell, the number on top is the improvement achieved by using B&B; the bottom number is the improvement achieved by using Greedy.
Impact Parameter α
Job Percentage β 10% 20% 30% 40% 50%
10% 0.42% 0.53% 0.61% 0.89% 1.04%
0.54% 0.56% 0.63% 0.84% 1.12%
20% 1.02% 1.28% 1.47% 1.78% 2.20%
1.18% 1.24% 1.32% 1.83% 2.28%
30% 3.11% 3.21% 3.35% 3.35% 3.34%
3.20% 3.20% 3.32% 3.30% 3.35%
40% 3.00% 3.14% 3.23% 3.23% 3.24%
3.02% 3.10% 3.23% 3.25% 3.31%
50% 4.12% 4.35% 4.38% 4.42% 4.84%
4.12% 3.35% 4.23% 4.57% 4.69%
or 4K, which means the job size variation within these jobs is relatively small. These jobs also have much longer running times than the small jobs in SDSC-BLUE, which makes their wait time less sensitive to the impact parameter α than that of jobs in the SDSC-BLUE trace. The average wait time and response time improvements obtained by our design on the ANL-Intrepid trace are about 10%, as shown in Tables 4.5 and 4.6.
4.6.3 Result Summary. In summary, our trace-based experiments have shown
the following:
• Our window-based locality-aware scheduling design can guarantee compact job
allocation while maintaining high system utilization. The experimental results
also demonstrate that our design is capable of reducing average job wait time
Table 4.5. Average job wait time improvement obtained by our design using B&B and Greedy as against the Default scheduler. In each cell, the number on top is the improvement achieved by using B&B; the bottom number is the improvement achieved by using Greedy.
Impact Parameter α
Job Percentage β 10% 20% 30% 40% 50%
10% 6.32% 7.12% 7.39% 7.54% 8.84%
6.10% 7.18% 7.60% 7.51% 7.92%
20% 6.20% 7.87% 8.87% 8.10% 9.12%
6.16% 7.37% 8.59% 8.60% 8.94%
30% 7.04% 7.89% 8.33% 9.58% 10.69%
6.97% 7.60% 8.46% 9.73% 10.77%
40% 8.27% 8.81% 9.62% 10.06% 10.89%
8.34% 8.86% 9.60% 9.98% 10.76%
50% 8.45% 9.09% 10.13% 10.65% 11.68%
8.62% 9.15% 10.16% 10.62% 11.54%
and job response time.
• Our design can deliver up to a 27% reduction in average job wait time and response time, and a 4% improvement in system utilization. The amount of improvement varies depending on workload features such as job size and job running time.
• Both B&B and GREEDY algorithms can deliver comparable performance in
our case studies. Considering the computational overhead, we recommend the
use of GREEDY due to its low computational complexity.
4.7 Summary
In this chapter, we have presented a window-based locality-aware job schedul-
Table 4.6. Average job response time improvement obtained by our design using B&B and Greedy as against the Default scheduler. In each cell, the number on top is the improvement achieved by using B&B; the bottom number is the improvement achieved by using Greedy.
Impact Parameter α
Job Percentage β 10% 20% 30% 40% 50%
10% 3.81% 4.09% 4.87% 5.05% 5.76%
4.00% 4.26% 4.76% 4.93% 5.81%
20% 5.16% 5.53% 5.89% 6.76% 8.91%
4.03% 4.46% 5.76% 6.76% 8.80%
30% 5.03% 5.52% 6.76% 7.90% 8.93%
5.16% 5.81% 6.86% 7.67% 9.00%
40% 6.08% 7.52% 7.73% 8.89% 9.54%
5.70% 7.35% 7.58% 8.74% 9.26%
50% 6.16% 7.09% 9.12% 9.90% 10.85%
6.05% 7.10% 9.23% 9.85% 10.68%
ing design for torus-connected system. Our goal is to balance job performance with
system performance. Our design has three novel features. First, rather than one-by-one job scheduling, our design takes a “window” of jobs (i.e., multiple jobs) into consideration for job prioritizing and resource allocation. Second, our design maintains a list of slots to preserve node contiguity information for resource allocation. Finally, we formulate scheduling decision making as a 0-1 Multiple Knapsack Problem and present two algorithms to solve it. Preliminary results from trace-based simulation demonstrate that our design can reduce average job wait time by up to 28% and average job response time by up to 30%, with a slight improvement in overall system utilization.
CHAPTER 5
JOB INTERFERENCE ANALYSIS ON TORUS-CONNECTED SYSTEMS
5.1 Overview
Two job placement policies are commonly used on the widely adopted torus network topology [23–25]. In contiguous placement, each job gets a compact and contiguous set of computing nodes, as shown in Figure 5.1(a); the partition-based placement adopted in the Blue Gene series of supercomputers is a prominent example [26]. Contiguous placement favors application performance through exclusive resource allocation and the locality it implies. However, contiguous placement can cause both internal fragmentation (when more nodes are allocated to a job than it requests) and external fragmentation (when sufficient nodes are available for a request but cannot be allocated contiguously), therefore leading to lower system utilization than is otherwise possible. On the other hand, noncontiguous placement, adopted by the Cray XT/XE series [27], assigns free nodes to jobs regardless of contiguity, although of course efforts are made to maximize locality. Figure 5.1(b) shows an example of noncontiguous placement. While eliminating the internal and external fragmentation seen in contiguous placement systems, noncontiguous placement introduces contention among jobs due to the interleaving of the jobs' nodes, particularly for communication-intensive jobs [4].
We envision that future HPC systems will be equipped with flexible job place-
ment mechanisms that combine the best of both contiguous and noncontiguous poli-
cies. Such a mechanism would take into account the shared resource needs (e.g.,
network resources) of jobs when making scheduling and allocation decisions. With
the knowledge and analysis of job communication patterns, one can identify which
jobs require exclusive network and compact allocation and to what degree. Then,
rather than allocating each job in a “know-nothing” manner, one may customize
the job placement policy so that, for example, only the jobs with stringent network
needs are given compact, isolated allocations, resulting in maximized utilization and
minimized perceivable resource contention effects.
(a) Contiguous (b) Non-contiguous
Figure 5.1. Multiple jobs running concurrently with different allocations. Each job is represented by a specific color. (a) shows the effect of contiguous allocation, which reduces inter-job interference. (b) shows non-contiguous allocation, which may introduce both intra- and inter-job interference.
In this chapter, we focus on an in-depth analysis of intra- and interjob commu-
nication interference with different job placements on torus-connected HPC systems.
Although our analysis is based on torus networks, the ideas conveyed in this work are
applicable to networks with different topologies.
We selected three signature applications from the DOE Design Forward Project [31] to conduct a detailed study of the communication behavior of parallel applications. We use a sophisticated simulation toolkit named CODES (Co-Design of Multi-layer Exascale Storage Architectures) [32] as a research vehicle to evaluate the performance of these applications with various allocations in a controlled environment. We then analyze the intra- and interjob interference by simulating these applications running concurrently with different allocations. We believe the insights presented in this work can be useful for the design of future HPC batch schedulers
and resource managers.
The rest of this chapter is organized as follows. Section 5.2 describes the three
representative applications chosen from the DOE Design Forward Project for our
study. Section 5.3 discusses the use of CODES as a research vehicle for our work.
Section 5.4 provides detailed analysis of the intra- and interjob interference among
the applications on a torus network with different allocations. Section 5.5 introduces
a path toward communication-pattern-aware allocation strategies, given the results of
our analysis. Section 5.6 discusses related work. Section 5.7 presents our conclusions.
5.2 Application Study
For this work, we select three applications from the DOE Design Forward
Project. Each application exhibits a distinctive communication pattern that is com-
monly seen in HPC applications. We believe that the communication patterns of these
applications are representative of a wide array of applications running on leadership-class machines. Specifically, we study the algebraic multigrid solver (AMG), the Crystal Router miniapp, and geometric multigrid (MultiGrid).1
5.2.1 AMG. The algebraic multigrid solver, or AMG, is a parallel algebraic multi-
grid solver for linear systems arising from problems on unstructured mesh physics
packages. It has been derived directly from the BoomerAMG solver that is being de-
veloped in the Center for Applied Scientific Computing (CASC) at LLNL [82]. The
dominant communication pattern is regional communication with decreasing message
size for different parts of the multigrid v-cycle.
Figure 5.2 shows the communication matrix of a small-scale AMG execution
1The communication matrices of each application, presented in Figures 5.2, 5.3, and 5.4, respectively, are generated from the IPM data collected by the DOE Design Forward Project [81].
Figure 5.2. AMG communication matrix. The labels of both the x and y axes are the indices of the MPI ranks in AMG. The legend bar on the right indicates the amount of data transferred between ranks.
with 216 MPI ranks. Note that the dominant communication pattern of the applica-
tion does not change with scale. We observe that AMG’s dominant communication
pattern is 3D nearest neighbor: each rank has intensive communication with up to six
neighbors, depending on rank boundaries. Applications with similar patterns include
PARTISN [83] and SNAP [84].
5.2.2 Crystal Router. The second miniapp studied is Crystal Router, an extracted
communication kernel of the full Nek5000 code [85], which is a CFD application devel-
oped at Argonne National Laboratory. It features spectral-element multigrid solvers
coupled with a highly scalable, parallel coarse-grid solver that is widely used for
projects including ocean current modeling, thermal hydraulics of reactor cores, and
spatiotemporal chaos. Crystal Router demonstrates the many-to-many communica-
tion pattern through a scalable multistage communication process.
The collective communication in Crystal Router utilizes a recursive doubling
approach. Ranks in Crystal Router conform to an n-dimensional hypercube and
Figure 5.3. Crystal Router communication matrix. The label of both the x and the y axis is the index of MPI rank in Crystal Router. The legend bar on the right indicates the data transfer amount between ranks.
recursively split into (n-1)-dimensional hypercubes, with communication occurring
along the splitting plane. The pattern of this communication is shown in Figure 5.3.
As a result of the logarithmic splitting process, a substantial portion of the commu-
nication occurs in small neighborhoods of ranks. Crystal Router represents a group
of applications whose dominant communication is a hybrid of multistage local and
hierarchical global communication and shares similarities with most MPI collective
communication implementations.
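The recursive-doubling schedule described above can be sketched in a few lines. This is an illustrative sketch, not code from the thesis or from Crystal Router itself: at stage k, rank r exchanges data with rank r XOR 2^k, splitting the n-dimensional hypercube into (n-1)-dimensional halves one dimension at a time.

```python
# Sketch (ours, not Crystal Router's source): the partner sequence a rank
# follows under recursive doubling on a hypercube of ranks.

def recursive_doubling_partners(rank, n_ranks):
    """Return the sequence of partner ranks for one rank.

    Assumes n_ranks is a power of two (a full hypercube).
    """
    partners = []
    stage = 1
    while stage < n_ranks:
        partners.append(rank ^ stage)  # flip one hypercube dimension
        stage <<= 1
    return partners

# Example: in an 8-rank (3D) hypercube, rank 0 talks to ranks 1, 2, 4.
print(recursive_doubling_partners(0, 8))   # [1, 2, 4]
print(recursive_doubling_partners(5, 8))   # [4, 7, 1]
```

Because early stages flip low-order bits, much of the traffic stays within small neighborhoods of ranks, which matches the heavy diagonal in Figure 5.3.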
5.2.3 MultiGrid. MultiGrid is a geometric multigrid v-cycle from the production
elliptic solver BoxLib, a software framework for massively parallel, block-structured
adaptive mesh refinement (AMR) codes [86]. MultiGrid conforms to a many-to-many
communication pattern with decreasing message size and collectives for different parts
of the multigrid v-cycle. It is widely used for structured grid physics packages.
Figure 5.4 shows the communication matrix of MultiGrid with 125 ranks. We
can see intensive communication along the diagonal that resembles nearest-neighbor
Figure 5.4. MultiGrid communication matrix. The label of both the x and the y axis is the index of MPI rank in MultiGrid. The legend bar on the right indicates the data transfer amount between ranks.
communication, similar to that of AMG. However, the communication topology leads
to a greater “spread” of communication across the set of ranks, challenging the max-
imization of communication locality with respect to ranks. In this sense, it can be
considered a many-to-many pattern. Applications with similar dominant commu-
nication patterns include FillBoundary, another PDE solver code in [86].
5.3 Research Vehicle
Experimenting accurately and flexibly with concurrently running jobs is diffi-
cult in an HPC context. One reason is that the allocation strategy used on production
machines is part of the system software and cannot be changed by users. Even sys-
tem administrators may not be able to make such changes. Another reason is that
it is unrealistic to reserve the system exclusively to run the same job with the de-
sired allocation without interference and then compare the results with those in the
presence of interference. Therefore, we resort to simulation for this work.
Specifically, we use the toolkit CODES, which enables simulating both torus
and Dragonfly networks at the flit-level with high fidelity [32, 49]. CODES is built
on top of the Rensselaer Optimistic Simulation System (ROSS) parallel discrete-
event simulator [50], which is capable of processing billions of events per second on
leadership-class supercomputers. CODES additionally has the ability to replay MPI
application traces, gathered via the SST DUMPI profiler [47].
5.4 Interference Analysis
Intrajob interference refers to the network contention between the ranks
within each application. Interjob interference is introduced by concurrently run-
ning jobs sharing network resources. Communication variability due to such interfer-
ence can cause application performance degradation.
In this section we study both kinds of interference on a torus network. The
current generation of IBM Blue Gene/Q (BG/Q) supercomputers, such as Mira at
Argonne National Laboratory and Sequoia at Lawrence Livermore National Labora-
tory, has the compute nodes connected by a 5D torus network [23]. The K computer
from Japan uses the “Tofu” interconnect, which has a 6D mesh/torus topol-
ogy [24]. Titan, a Cray XK7 supercomputer at the Oak Ridge Leadership Computing
Facility (OLCF), has nodes connected in a 3D torus within the compute partition [25].
Since an application’s communication patterns do not change with scale, we per-
form our experiments at modest scale (relative to leadership-class systems), simulating
a 3D torus network with 2,048 nodes (16× 16× 8) to simplify analysis.
5.4.1 Intrajob Interference Analysis. We design two sets of experiments
to study the intrajob interference of each application. In the first, we assign each
application with allocations in three different shapes. In the second, we study the
intrajob interference by using different mapping strategies for each application with
a given allocation shape.
5.4.1.1 Allocation Shapes Study. In the allocation shapes experiment, we
select three shapes commonly seen on the 3D torus network: 3D balanced cube, 3D
unbalanced cube, and 2D mesh, as shown in Figure 5.5.
The 3D balanced cube, shown in red in Figure 5.5, can guarantee the minimum
average pairwise distance within the allocation. Some research studies [4,73] indicate
that compact allocation can provide jobs with better performance. To evaluate the
compactness of the allocation, they use a variety of metrics such as average pairwise
distance, diameter, and contiguity. In this work, we select the 3D balanced cube as
the most compact allocation on a 3D torus network.
The 3D unbalanced cube, shown in green in Figure 5.5, is a rectangular prism,
which is a possible allocation shape on systems with asymmetric networks. For exam-
ple, Cray XE6/XK7 systems are 3D tori with Gemini routers. The network connec-
tions in the y-direction have only half the bandwidth of the cables used in the x and z
directions. In order to take advantage of the faster links in the x and z directions, job
allocation favors the X-Z plane [87]. Our torus network in this study is symmetric.
The 2D mesh, shown in blue in Figure 5.5, can be cut out from a single layer of
the 3D torus. The 2D mesh is a common allocation shape on torus networks for both
contiguous and noncontiguous placement policies. For example, Cray’s Application
Level Placement Scheduler (ALPS) indexes all compute nodes in the torus into a
list and allocates by simply going through that list [27]. When the list is obtained
by sorting the nodes based on their spatial coordinates in the torus, the resulting
allocations form 2D meshes. The IBM Blue Gene/Q supercomputer Mira at the
Argonne Leadership Computing Facility also allows its allocation partition to be
configured as a mesh [88].
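How a list-based allocator produces 2D meshes can be sketched directly. The code below is a hypothetical illustration of the ALPS-style scheme described above, not Cray's implementation: nodes are indexed by sorting on their spatial coordinates, and an allocation is simply the first k entries of the list.

```python
# Hypothetical sketch of list-based allocation: sort nodes by (z, y, x) so
# consecutive list entries fill one x-y layer of the torus before the next.
# A small allocation drawn from the front of the list then forms a 2D mesh.

from itertools import product

def sorted_node_list(dims):
    X, Y, Z = dims
    return sorted(product(range(X), range(Y), range(Z)),
                  key=lambda n: (n[2], n[1], n[0]))  # z-major ordering

def allocate(dims, k):
    """Hand out the first k nodes of the sorted list."""
    return sorted_node_list(dims)[:k]

# On a 16x16x8 torus, a 256-node request lands entirely in the z = 0 layer:
alloc = allocate((16, 16, 8), 256)
print(all(z == 0 for (_, _, z) in alloc))  # True
```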
Figure 5.6(a) shows that AMG with a 3D nearest-neighbor communication
Figure 5.5. Contiguous allocation in three different shapes. Red is a 3D balanced cube, green a 3D unbalanced cube, and blue a 2D mesh.
pattern takes slightly less time when running with a 3D unbalanced allocation than
with a 3D balanced allocation. MultiGrid, by contrast, performs best (shortest data
transfer time) when running with a 3D balanced allocation, as shown in Figure 5.6(c).
Since MultiGrid’s communication pattern is many-to-many dominant, the 3D balanced
allocation, being the most compact with the shortest pairwise distance between nodes,
reduces the aggregate hop count for transferring messages among MultiGrid’s ranks.
Since Crystal Router exhibits both local and global rank-to-rank data transfers, the
3D balanced allocation is also the best for it, but its advantage over the 3D unbalanced
allocation is not as pronounced as it is for MultiGrid, as shown in Figure 5.6(b).
A number of studies have designed complex placement algorithms to provide
applications with the most compact allocation [73,75]. As shown in our experiments,
however, providing such allocation without considering the application’s communica-
tion pattern may not guarantee the best performance for every application. Compact
allocation should be provided to applications with intensive global data transfer, such
as those exhibiting a many-to-many pattern.
(a) AMG (b) Crystal Router (c) MultiGrid
Figure 5.6. Data transfer time of AMG, Crystal Router, and MultiGrid on 2D mesh, 3D unbalanced, and 3D balanced allocations.
5.4.1.2 Mapping Strategy Study. The rank-to-node mapping of parallel appli-
cations on HPC systems can greatly impact the performance. However, finding the
optimal mapping solution for a given application is outside the scope of this work. Our
experiments aim to show how mapping strategies impact the intrajob interference of
applications with specific communication patterns.
We provide AMG, Crystal Router, and MultiGrid with a 3D balanced allo-
cation and use three mapping strategies to do the rank-to-node mapping. “Linear”
mapping, which we used in the allocation shapes study, maps each rank according
to the dimensional ordering of compute nodes. The “Cube” mapping assigns ranks
into consecutive 2×2×2 cubes. The “Random” mapping assigns ranks randomly within
the allocation.
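The three strategies can be sketched as follows. This is our own illustrative code (the function names are ours, not the simulator's); rank i runs on the i-th node of the returned list, and the "Cube" variant fills consecutive 2×2×2 blocks as described above.

```python
# Illustrative sketch of the three rank-to-node mapping strategies:
# "Linear" follows the dimensional ordering of nodes, "Cube" fills
# consecutive 2x2x2 blocks, and "Random" permutes the node list.

import random
from itertools import product

def linear_mapping(dims):
    X, Y, Z = dims
    return [(x, y, z) for x in range(X) for y in range(Y) for z in range(Z)]

def cube_mapping(dims):
    X, Y, Z = dims
    nodes = []
    # visit the allocation as 2x2x2 blocks, filling each block before moving on
    for bx, by, bz in product(range(0, X, 2), range(0, Y, 2), range(0, Z, 2)):
        for dx, dy, dz in product(range(2), repeat=3):
            nodes.append((bx + dx, by + dy, bz + dz))
    return nodes

def random_mapping(dims, seed=0):
    nodes = linear_mapping(dims)
    random.Random(seed).shuffle(nodes)
    return nodes

# Under "Cube", ranks 0-7 share one 2x2x2 block of a 4x4x4 allocation:
print(cube_mapping((4, 4, 4))[:8])
```

All three mappings are bijections onto the same allocation; only the rank ordering, and hence the locality of neighboring ranks, differs.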
AMG behavior remains roughly the same when it is mapped by “Linear”
and “Cube,” as shown in Figure 5.7(a). Although the “Linear” and “Cube” mapping
strategies cause some routing overlap for AMG’s communication, both still preserve
the locality of AMG’s 3D nearest-neighbor communication pattern. The “Random”
mapping disrupts AMG’s communication pattern and results in intrajob interference
among the ranks. The performance degradation is as much as 90%.
The “Cube” mapping improves Crystal Router’s performance over the “Lin-
ear” mapping by up to 10% on average, as shown in Figure 5.7(b). This improvement is
due to the fact that the global data transfer in Crystal Router takes fewer hops with
the “Cube” mapping. The “Random” mapping for Crystal Router results in poor
locality across all ranks on average and makes their communication less efficient. The
“Cube” mapping benefits MultiGrid’s many-to-many communication. Because of the
small amount of data transferred among ranks in MultiGrid, however, the “Cube”
mapping fails to exhibit a significant advantage. The “Random” mapping causes
little degradation for the same reason, as shown in Figure 5.7(c).
(a) AMG (b) Crystal Router (c) MultiGrid
Figure 5.7. Data transfer time of AMG, Crystal Router, and MultiGrid on 3D balanced allocation using different mapping strategies.
5.4.2 Interjob Interference Analysis. Interjob interference has been identified as
one of the major culprits responsible for an application’s performance variability [4,89].
Interjob interference is a more prominent issue for systems adopting noncontiguous
placement policies than for systems with a contiguous policy. Application communi-
cation times have been demonstrated to vary from 36% less to 69% more as a result
of job interference when applications are running concurrently with noncontiguous
allocations [4].
We allocate each application with a noncontiguous policy and run them con-
currently on the same network. The compute nodes belonging to different jobs are
interleaved. To study the impact of different allocation unit sizes on applications’
interjob interference, we conduct experiments with unit sizes of 16, 8, and 2, as shown
respectively in Figure 5.8(a), 5.8(b), and 5.8(c). Figure 5.9 shows the results of each
application data transfer time with different allocation unit sizes.
(a) (b) (c)
Figure 5.8. Noncontiguous allocation. Each job is represented by a specific color. The nodes assigned to different jobs are interleaved; the allocation unit sizes are 16, 8, and 2.
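The interleaving in Figure 5.8 can be sketched as follows. This is a minimal sketch of ours (assumption: round-robin interleaving of fixed-size units): the node list is cut into units of `unit_size` nodes, and the units are dealt out to the concurrent jobs in turn.

```python
# Sketch of noncontiguous, interleaved allocation: unit i of the node list
# goes to job i mod n_jobs, so each job receives every n_jobs-th unit.

def interleaved_allocation(node_list, n_jobs, unit_size):
    """Return one node list per job."""
    jobs = [[] for _ in range(n_jobs)]
    units = [node_list[i:i + unit_size]
             for i in range(0, len(node_list), unit_size)]
    for i, unit in enumerate(units):
        jobs[i % n_jobs].extend(unit)
    return jobs

# 24 nodes, 3 jobs, unit size 2: job 0 gets nodes 0,1, then 6,7, and so on.
jobs = interleaved_allocation(list(range(24)), 3, 2)
print(jobs[0])  # [0, 1, 6, 7, 12, 13, 18, 19]
```

Growing `unit_size` keeps more of each job's neighborhood intact, which is exactly the knob the experiments below vary.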
The data transfer time of AMG in Figure 5.9(a) remains stable with allocation
unit sizes of 16 and 8 because of the nearest-neighbor pattern of AMG. When the
unit size is reduced to 2, AMG suffers prolonged data transfer time by about 10% on
average. Crystal Router is more sensitive to the allocation unit size. Figure 5.9(b)
shows that unit sizes of 16 and 8 can guarantee the same average data transfer time,
while some ranks spend more time with allocation unit size 8 than 16. When the unit
size is reduced to 2, the communication becomes less efficient and takes about 15%
more time on average for transferring data.
The data transfer time of MultiGrid with different allocation unit sizes does
not show obvious variability across unit sizes in Figure 5.9(c). The reason is that even
a large allocation unit size such as 16 still fails to preserve MultiGrid’s many-to-many
pattern: compared with the contiguous case, the data transfer time nearly doubles
when MultiGrid runs concurrently with an allocation unit size of 16, as shown in
Figure 5.9(c). Further decreases in unit size
result in roughly similar average communication times, however.
When choosing the proper unit size in a noncontiguous placement policy, one
should consider the application’s communication patterns. Interjob interference is
inevitable in noncontiguous-based systems, but unit sizes big enough to preserve the
neighborhood communication of the application will alleviate such interference and
improve job performance.
(a) AMG (b) Crystal Router (c) MultiGrid
Figure 5.9. Interjob interference study: “cont” indicates three applications running side by side concurrently on the same network with contiguous allocation. To study the impact of noncontiguous allocation on interjob interference, applications are run concurrently with interleaved allocations of different unit sizes, namely, 16 nodes, 8 nodes, and 2 nodes.
5.4.3 Results Summary. Based on our simulation study, we make the following
observations.
• Compact allocation may not be necessary for every application.
• The applications dominated by nearest-neighbor communications exhibit rela-
tively stable performance under different allocation shapes as long as the allo-
cation exhibits some degree of locality.
• The applications dominated by many-to-many communication exhibit better
performance with more compact allocations (e.g., 3D balanced).
• A good rank-to-node mapping strategy can greatly improve an application’s
performance when a specific allocation is given.
• An optimal size for allocation units should be determined according to an ap-
plication’s dominant communication pattern. In general, a unit size should be
large enough to accommodate neighboring communication in the application.
• Interjob interference is inevitable with a noncontiguous allocation. However,
choosing the proper allocation unit size with communication pattern awareness
can help alleviate the resulting negative effects.
5.5 Discussion
The results shown in the preceding sections provide insights for the design of a
smart and flexible job placement policy. By scrutinizing the communication behavior
of jobs, we can identify their dominant communication patterns and pinpoint locality
needs. With such knowledge about communication patterns, we can analyze the
possible interference between jobs and take precautions to alleviate interference when
making placement decisions.
When an application with intensive many-to-many communication is submit-
ted to the system, the scheduler should provide it with a compact node allocation and
exclusive network resources. Compact allocation can guarantee the shortest pairwise
distance between all the ranks. Additionally, the exclusive network provision will
prevent other jobs from sharing network resources, thus eliminating the performance
degradation due to interference.
Arguably, however, not every application requires compact allocation and ex-
clusive network provisioning. As we demonstrate in our study, applications whose
dominant communication pattern contains intensive “neighborhood communication”
such as nearest neighbor may not benefit from compact allocation and exclusive net-
work provision. Applications such as AMG can run with a noncontiguous node al-
location without significant performance degradation, as long as the allocation unit
can accommodate their rank-rank locality.
The size of the allocation unit should not be fixed. Instead, the scheduler should
choose a proper unit size based on the granularity of communication locality of each
job. Large unit sizes cannot be fully utilized and thus cause fragmentation. Small
unit sizes will not be able to accommodate the communication locality, resulting in
less efficient intrajob communication.
The advantage of considering job communication patterns when performing
job placement is clear. Compared with a contiguous placement policy, schedulers with
communication pattern awareness can be more flexible, relaxing the need to provide
contiguous partitions to accommodate the whole application, and avoiding fragmen-
tation issues inherent in contiguous placement. Indeed, smaller compact node sets
(allocation units) provided for an application’s “local communication” can help pre-
serve the communication performance sufficiently. The design of such a new scheduler
with job communication pattern awareness is part of our future work.
5.6 Related Work
Many tools are available for system monitoring and application profiling. Tools
such as TAU (Tuning and Analysis Utilities) [44] and mpiP [45] can capture application
runtime information, keeping records in event traces. Recognizing communication
patterns from those traces requires substantial effort, however. A number of studies have
been conducted on the recognition and characterization of parallel application com-
munication patterns. Oak Ridge National Laboratory has an ongoing project involv-
ing development of the toolset Oxbow, which can characterize the computation and
communication behavior of scientific applications and benchmarks [48]. In a recent
work [43], Roth et al. demonstrate a new approach to automatically characterizing
the communication behaviors of parallel applications.
Many research efforts have been conducted to characterize scientific applica-
tions. For instance, the DOE Design Forward Project aims to identify the com-
putation and communication characteristics of a collection of relevant miniapps de-
veloped at a number of exascale co-design centers [31]. In this project, the com-
munication patterns of several DOE full applications and associated miniapps are
studied in order to provide a more complete snapshot of DOE workloads. A joint
project named CORAL, involving Oak Ridge, Argonne, and Lawrence Livermore Na-
tional Laboratories, provides a series of benchmarks to represent DOE workloads and
technical requirements [90]. The CORAL project includes scalable science bench-
marks, throughput benchmarks, data-centric benchmarks, skeleton benchmarks, and
microbenchmarks.
The interference among concurrently running jobs on HPC systems has been
identified as a major culprit responsible for a job’s performance variability. Bhatele
et al. found that concurrently running applications interfere with each other and
cause their communication time to vary from 36% shorter to 69% longer on different
HPC systems [4]. Skinner et al. found a 2–3 times slowdown in MPI Allreduce due
to network contention from other concurrently running jobs [89].
Several research studies focus on optimizing job allocation on HPC systems to
alleviate the interference between concurrently running jobs. Hoefler et al. proposed
using performance-modeling techniques to analyze factors that impact the perfor-
mance of parallel scientific applications [91]. As the scale of HPC systems continues to
grow, however, the interference among concurrently running jobs is getting worse and
is hard to quantify with performance-profiling tools alone. Bogdan et al. provided
a set of guidelines on how to configure a Dragonfly network for workloads with a
nearest-neighbor communication pattern [92]. Dong et al. developed simple bench-
marks that conform to four different communication patterns—ping-pong, nearest
neighbor, broadcast, and all reduce—to demonstrate the effectiveness of 5D torus
networks [93].
We differentiate our work from these activities in the following ways. First, we
focus on the dominant communication patterns rather than any specific application.
We believe this focus can provide a guideline for other research work. Second, we ex-
plore both intra- and interjob interference between concurrently running jobs, whereas
similar work such as [4] focuses on a single application’s performance degradation due
to network contention. Third, we analyze the impact of different placement strate-
gies on a job’s communication behaviors, identifying preferred placement strategies
for each application with a specific dominant communication pattern. Based on our
study, we claim that future batch schedulers should take job communication patterns
into consideration for placement decision making.
5.7 Conclusions
In this chapter, we have studied the communication behavior of three parallel
applications: AMG, Crystal Router, and MultiGrid. Each application has a distinc-
tive communication pattern, which can be representative of a whole range of jobs
commonly seen in HPC environments. We have used the CODES toolkit to simulate
the running of these applications on a torus network.
We have analyzed the intra- and interjob interference by simulating three ap-
plications running both independently and concurrently. Based on our comprehensive
experiments, we make six observations.
1. Compact allocation may not be necessary for every application.
2. The applications dominated by nearest-neighbor communications exhibit rela-
tively stable performance under different allocation shapes as long as the allo-
cation exhibits some degree of locality.
3. The applications dominated by many-to-many communication exhibit better
performance with more compact allocations (e.g., 3D balanced).
4. A good rank-to-node mapping strategy can greatly improve an application’s
performance when a specific allocation is given.
5. An optimal size for allocation units should be determined according to an ap-
plication’s dominant communication pattern. In general, a unit size should be
large enough to accommodate neighboring communication in the application.
6. Interjob interference is inevitable in noncontiguous allocation. However, choos-
ing the proper allocation unit size with communication pattern awareness can
help alleviate the resulting negative effects.
We believe that the findings in this work can provide valuable guidance for
HPC batch schedulers and resource managers to make flexible job allocations. Rather
than using predefined partitions or a noncontiguous placement policy, future HPC
systems should assign resources to each job based on job communication patterns.
CHAPTER 6
JOB INTERFERENCE ANALYSIS ON DRAGONFLY-CONNECTED SYSTEMS
6.1 Overview
Low-latency and high-bandwidth interconnect networks play a critical role in
ensuring HPC system performance. The high-radix, low-diameter dragonfly topology
can lower the overall cost of the interconnect, improve network bandwidth and reduce
packet latency [35], making it a very promising choice for building supercomputers
with millions of cores. Even with such powerful networks, however, intelligent job
placement is of paramount importance to the efficient use of dragonfly-connected
systems, playing a critical role in realizing the full potential of such high-performance
networks [5,7].
In this chapter, we study the implications of contention for shared network
links in the context of multiple HPC applications running on dragonfly systems when
different job placement and routing configurations are in use. Our analyses focus
on the overall network performance as well as the performance of concurrently ex-
ecuting applications in the presence of network contention. We use the same three
applications discussed in Section 5.2 and analyze the interference among them. For
each application, we first examine its performance with two job placement policies
and three routing policies, and we make the following observations through extensive
simulations.
• Concurrently running applications on a dragonfly network interfere with each
other when they share network resources. Communication-intensive applica-
tions “bully” their less intensive peers and obtain performance improvement at
the expense of less intensive ones.
• Random placement of application processes in the dragonfly can improve the
performance of communication-intensive applications by enabling network re-
source sharing, though it introduces interference causing performance degrada-
tion to the less intensive applications.
• Contiguous placement can be beneficial to the consistent performance of less
communication-intensive applications by minimizing network resource sharing,
because it reduces the opportunities for traffic from other applications to be
loaded on links that serve as minimal routes for the less intensive application.
However, this comes with the downside of reduced system performance due to
load imbalance.
Based on the aforementioned key observations, one would expect that an ideal
job placement policy on dragonfly systems would take relative communication in-
tensity into account, and mix contiguous and non-contiguous placement based on
application needs. To explore this expectation, we investigate a hybrid job place-
ment policy, which assigns random allocations to communication-intensive applica-
tions and contiguous allocations to less intensive ones. Initial experimentation shows
that hybrid job placement aids in reducing the worst-case performance degrada-
tion for less communication-intensive applications while retaining the performance
of communication-intensive applications, though without eliminating the problem
entirely. Furthermore, we explore two other possible placement policies, random router
and random partition, that have been discussed in the existing literature. The
experimental results show that random router and random partition placement bring
substantial performance improvement for the less intensive applications at the cost of
performance degradation for the intensive ones. Unfortunately, neither can completely prevent
the “bully” behavior from happening.
The rest of this chapter is organized as follows. Section 6.2 describes an implementation
of the dragonfly network and introduces the placement and routing policies.
Section 6.3 discusses the use of CODES as a research vehicle and three representative
applications from the DOE Design Forward Project. Section 6.4 presents
the observations and analysis of three applications running on a dragonfly network
with different placement and routing configurations. Section 6.5 validates the observations
obtained from the previous section. Section 6.6 presents an alternative placement
policy for the dragonfly network. Section 6.7 explores two other possible job placement
policies that have been discussed in previous literature. Section 6.8 discusses
the related work. Finally, the conclusion is presented in Section 6.9.
6.2 Background
In this section, we review the dragonfly topology, including the placement and
routing policies examined in previous work.
6.2.1 Dragonfly Network. The dragonfly is a two-level hierarchical topology,
consisting of several groups connected by all-to-all links [94]. Each group consists of
a routers connected via all-to-all local channels. For each router, p compute nodes
are attached to it via terminal links, while h links are used as global channels for
intergroup connections. The resulting radix of each router is k = a + h + p − 1.
Different computing centers could choose different values for a, h, p when deploying
their dragonfly networks. The choice of proper a, h, p values involves many factors,
such as system scale, building cost, and workload characteristics.
It is recommended that, for load-balancing purposes, a proper dragonfly con-
figuration should follow a = 2p = 2h [94]. Under this configuration, the total
number of groups, denoted as g, is g = a × h + 1, and the total number of compute
nodes, denoted as N, is N = p × a × g.
In this work, we focus on dragonfly topologies that follow this configuration. An
example dragonfly network is illustrated in Figure 6.1. There are six routers in each
group (a = 6), three compute nodes per router (p = 3), and three global channels per
router (h = 3). This dragonfly network consists of 19 groups and 342 nodes in total.
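The sizing arithmetic above can be checked with a small sketch of ours (the function name is hypothetical) that applies the balanced rules a = 2p = 2h, k = a + h + p − 1, g = a·h + 1, and N = p·a·g:

```python
# Worked sketch of balanced dragonfly sizing (a = 2p = 2h).

def dragonfly_size(a):
    assert a % 2 == 0, "balanced configuration needs a = 2p = 2h"
    p = h = a // 2
    k = a + h + p - 1        # router radix: local + global + terminal links
    g = a * h + 1            # number of groups
    N = p * a * g            # total compute nodes
    return k, g, N

# The Figure 6.1 example: a = 6 routers per group, so p = h = 3.
print(dragonfly_size(6))   # (11, 19, 342)
```

The result reproduces the 19 groups and 342 nodes of the example network.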
Figure 6.1. Five-group slice of a 19-group dragonfly network. Job J1 is allocated using random placement, while Job J2 is allocated using contiguous placement.
6.2.2 Routing on Dragonfly. The routing policy refers to the strategy adopted
to route packets from the source router to the destination router. Previously studied
routing policies for dragonfly networks include minimal routing, adaptive routing [35],
progressive adaptive routing [95] and variations thereof [96]. In this work we study
three alternative routing policies considered by the community for dragonfly networks.
Minimal: In this policy, a packet takes the minimal (shortest) path from the
source to the destination. The packet first routes locally from the source node to the
global channel leading to the destination group. It traverses the global channel to
the destination group and routes locally to the destination node. Minimal routing
can guarantee the minimum hops a packet takes from the source to the destination.
However, it usually results in congestion along the minimal paths.
Adaptive: In this policy, the path a packet takes will be adaptively chosen
between minimal and non-minimal paths, depending on the congestion situation along
those paths. For the non-minimal path, an intermediate router in a separate group will
be randomly chosen. The packet is forwarded to the intermediate router, connecting
the source and destination groups through two separate minimal paths. Adaptive
routing can avoid hot-spots in the presence of congestion and collapses to minimal
routing otherwise.
Progressive Adaptive: As opposed to adaptive routing, the decision to
adaptively route a packet is continually re-evaluated within the source group un-
til a non-minimal route is chosen; the re-evaluation does not occur in intermediate
groups [95]. Progressive adaptive routing is capable of handling scenarios where the
minimal route is congested but the source router has not been informed yet.
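The decision logic of the three policies can be contrasted in a simplified sketch. The code below is our own illustration (the congestion signal is abstracted into a boolean), not the CODES router implementation.

```python
# Simplified sketch of per-router path selection under the three policies.

def route_decision(policy, minimal_congested, in_source_group,
                   chose_nonminimal=False):
    """Return True to take a Valiant-style non-minimal hop, False for minimal."""
    if policy == "minimal":
        return False                     # always the shortest path
    if policy == "adaptive":
        # one decision, made at the source, based on observed congestion
        return minimal_congested
    if policy == "progressive-adaptive":
        if chose_nonminimal:
            return True                  # sticks once non-minimal is chosen
        # otherwise keep re-evaluating, but only inside the source group
        return in_source_group and minimal_congested
    raise ValueError(f"unknown policy: {policy}")

# Minimal routing ignores congestion entirely:
print(route_decision("minimal", minimal_congested=True, in_source_group=True))
# -> False
# Progressive adaptive can still divert while inside the source group:
print(route_decision("progressive-adaptive", True, in_source_group=True))
# -> True
```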
6.2.3 Job Placement on Dragonfly. For a parallel application requiring mul-
tiple compute nodes, the job placement policy refers to the way of assigning the
required number of nodes to the application by system software such as the batch
scheduler [97]. In this work, we study two alternative placement policies considered
by the community for dragonfly systems:
Random Placement: In this policy, an application gets the required num-
ber of nodes randomly from the available nodes in the system. As illustrated in
Figure 6.1, J1 is randomly allocated nodes attached to different routers in differ-
ent groups. Routers may be shared by different applications and more routers are
involved in serving each application when random placement is in use. Random place-
ment can distribute the tasks of an application uniformly across the network to avoid
the possible local congestion.
Contiguous Placement: In this policy, the compute nodes are assigned to
an application consecutively. The assignment first fills up a group, then crosses group
boundaries as necessary. As illustrated in Figure 6.1, J2 is allocated eight nodes by
contiguous placement. Contiguous placement confines the tasks of an application into
the same group and uses the minimum number of routers to serve each application,
which may result in local network congestion and increase the possibility of hot-spots.
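The two policies reduce to a few lines each. This is a minimal sketch of ours (the API is hypothetical): with nodes indexed so that consecutive indices share a router and fill a group before crossing group boundaries, contiguous placement walks the node list in order, while random placement samples uniformly from the free pool.

```python
# Sketch of the two placement policies over a free-node pool.

import random

def contiguous_placement(free_nodes, n):
    """Assign the first n free nodes in index order (fills a group first)."""
    return sorted(free_nodes)[:n]

def random_placement(free_nodes, n, seed=None):
    """Assign n nodes drawn uniformly from the free pool."""
    return random.Random(seed).sample(sorted(free_nodes), n)

# The 342-node example network (p = 3, a = 6): nodes 0-17 form group 0,
# so an 8-node contiguous job stays inside the first group.
free = set(range(342))
print(contiguous_placement(free, 8))  # [0, 1, 2, 3, 4, 5, 6, 7]
```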
6.3 Methodology
Configurable dragonfly networks that would allow the exploration presented in this work are hard to come by for the time being. Even with access to systems with such networks, job placement and routing policies are part of the system configuration, which users cannot change at will [5, 6, 10, 98]. Therefore, we resort to simulation in our work.
6.3.1 Simulation Tool. We utilize the CODES simulation toolkit (Co-Design of
Multilayer Exascale Storage Architectures) [99], which builds upon the ROSS parallel
discrete event simulator [100,101] to enable exploratory study of large scale systems of
interest to the HPC community. CODES supports dragonfly [99,102], torus [103,104],
and Slim Fly [105] networks with flit-level high-fidelity simulation. It can drive these
models through an MPI simulation layer utilizing traces generated by the DUMPI
MPI trace library available as part of the SST macro toolkit [47]. The behavior and
performance of the CODES dragonfly network model has been validated by Mubarak
et al. [102] against BookSim, a serial cycle-accurate interconnection network simula-
tor [106].
6.3.2 Parallel Applications. We use a trace-driven approach to workload gen-
eration, choosing in particular three parallel application traces gathered to represent
exascale workload behavior as part of the DOE Design Forward Project [81, 107].
Specifically, we study communication traces representing the Algebraic MultiGrid
Solver (AMG), Geometric Multigrid V-Cycle from Production Elliptic Solver (Multi-
Grid) and Crystal Router MiniApp (CrystalRouter). The details about the commu-
nication pattern of these applications have been discussed in section 5.2.
6.3.3 System Configuration. The dragonfly network topology was originally
envisioned by Kim et al. [94]. The parameters for building the dragonfly network
studied in our work are chosen based on the model proposed in [94]. In our dragonfly
network, each group consists of a = 8 routers connected via all-to-all local channels.
For each router, there are p = 4 compute nodes attached to it via terminal links.
Each router also has h = 4 global channels used for intergroup connections. The
radix of each router is hence k = a + h + p − 1 = 15. The total number of groups is g = a ∗ h + 1 = 33 and the total number of compute nodes is N = p ∗ a ∗ g = 1056. Unlike the Cray XC systems, which have multiple global channels connecting each pair of groups [36], the dragonfly network simulated in this work uses a single global channel between groups. For that reason, we use a higher bandwidth for the global channels and a relatively low bandwidth for the local and terminal channels. The links in our dragonfly network are asymmetric: 2 GiB/s for the local and terminal router ports and 4 GiB/s for the global ports. In this work, we simulate the network performance and job interference across six different job placement and routing policy combinations, which are summarized in Table 6.1. 2
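The network sizing above follows directly from the dragonfly construction and can be checked in a few lines:

```python
# Sanity check of the dragonfly sizing used in this study (balanced
# design of Kim et al.).
a, p, h = 8, 4, 4      # routers per group, nodes per router, global channels per router

k = a + h + p - 1      # router radix: (a-1) local + p terminal + h global ports
g = a * h + 1          # number of groups, one global channel per group pair
N = p * a * g          # total compute nodes

print(k, g, N)         # 15 33 1056
```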
We analyze both the overall network performance and the performance of each
2 With respect to random placement, we experiment with 50 distinct allocations generated randomly. We report the median of the corresponding experimental results, which is intended to mitigate run-to-run variation.
Table 6.1. Nomenclature for different placement and routing configurations
Routing Policies
Placement Policies Minimal Adaptive Progressive Adaptive
Contiguous cont-min cont-adp cont-padp
Random rand-min rand-adp rand-padp
application. Our analysis focuses on the following metrics:
• Network Traffic: The traffic refers to the amount of data in bytes going
through each router. We analyze the traffic on each terminal and on the local
and global channels of each router. The network reaches optimal performance
when the traffic is uniformly distributed and no particular network link is over-
loaded.
• Network Saturation Time: The saturation time refers to the time period
when the buffer of a certain port in the router is full. We analyze the saturation
time of ports corresponding to terminal links, local and global channels. The
saturation time indicates the congestion level of routers.
• Communication Time: The communication time of each MPI rank refers to
the time it spends in completing all its message exchanges with other ranks. Due
to the use of simulation, we are able to measure the absolute (simulated) time
a message takes to reach its destination. The performance of each application
is measured by the communication time distribution across all its ranks.
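As a toy illustration of the first two metrics (this is not the CODES implementation; the sampled buffer-occupancy representation is an assumption for illustration):

```python
def port_traffic(packet_sizes):
    """Traffic through a port: total bytes forwarded."""
    return sum(packet_sizes)

def saturation_time(occupancy_samples, buffer_capacity, dt):
    """Total simulated time a port's buffer is full, given occupancy
    samples taken every dt time units."""
    return sum(dt for occ in occupancy_samples if occ >= buffer_capacity)
```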
Note that we do not model computation for each MPI rank, due both to the complexities inherent in performance prediction across different parallel architectures and to the emphasis of the Design Forward traces on communication behavior rather than compute representation; users are instructed to treat the traces as if they came from a one-rank-per-node configuration, despite their being gathered using a rank-per-core approach. We follow the recommended interpretation in our simulations.
6.3.4 Workload Summary. Two sets of parallel workloads are used in this study.
Workload I consists of AMG, MultiGrid and CrystalRouter. As shown in Table 6.2,
AMG has the least amount of data transfer, making it the least communication-
intensive job in the workload. CrystalRouter has the most amount of data transfer,
which means it is the most communication-intensive job in Workload I. Workload II
consists of sAMG, MultiGrid and CrystalRouter. sAMG, a synthetic version of AMG,
is generated by increasing the data transferred in AMG’s MPI calls by a factor of 100,
making it the most communication-intensive job in the workload. We add sAMG for
reasons that will become clear in the following sections.
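A sketch of how a scaled trace such as sAMG can be derived: every message payload in the (already parsed) AMG trace is multiplied by a constant factor. The dictionary-based record format here is an assumption for illustration, not the actual DUMPI trace layout:

```python
SCALE = 100  # sAMG = AMG with per-message data volume scaled by 100x

def scale_trace(events, factor=SCALE):
    """events: list of dicts like {"op": "MPI_Isend", "bytes": 4096, ...}.
    Returns a copy with every payload size multiplied by `factor`;
    records without a payload (e.g. barriers) pass through unchanged."""
    scaled = []
    for ev in events:
        ev = dict(ev)  # shallow copy so the input trace is untouched
        if "bytes" in ev:
            ev["bytes"] *= factor
        scaled.append(ev)
    return scaled
```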
As a significant portion of our experiments relies on nondeterministic behavior
(random application allocation), we ran each configuration a total of 50 times with
differing random seeds. We then chose a representative execution for presentation
based on the median performance of each application. While there is variation in
repeated runs of the following experiments, the resulting trends and observations are
representative of the full suite of experimentation.
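The median-based selection described above can be sketched as follows; `representative_run` and the per-run dictionaries are hypothetical names for illustration, not part of the actual tooling:

```python
import statistics

def representative_run(runs, key_app):
    """runs: one dict per seeded execution, mapping application name to
    its mean communication time. Pick the run whose key application's
    time is closest to the median across all runs."""
    med = statistics.median(r[key_app] for r in runs)
    return min(runs, key=lambda r: abs(r[key_app] - med))
```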
Table 6.2. Summary of Applications
App Name Num. Rank Avg. Data/Rank Total Data
AMG 216 0.6MB 130MB
MultiGrid 125 5MB 625MB
CrystalRouter 100 35MB 3500MB
sAMG 216 60MB 13000MB
6.4 Study of Parallel Workload I
The study of Workload I consists of two parts. First, we analyze the over-
all network performance when Workload I is running under different placement and
routing configurations. Second, we isolate each application from the workload and
analyze its performance on both a per-rank basis as well as by considering router
traffic resident to application ranks. The analysis allows us to identify the “bully” in
Workload I.
6.4.1 Network Performance Analysis. We first study the network performance
at the system level by analyzing the degree of traffic and saturation seen at each
router.
(a) GC Traffic (b) LC Traffic (c) TL Traffic
(d) GC Saturation Time (e) LC Saturation Time (f) TL Saturation Time
Figure 6.2. Aggregate traffic and saturation time for Workload I under the configurations listed in Table 6.1. "CA" and "CPA" have equivalent behavior.
Figure 6.2 shows the aggregate traffic for terminal links, local and global chan-
nels, as well as the corresponding saturation time for Workload I under the placement
and routing configurations summarized in Table 6.1. When contiguous placement is
coupled with minimal routing (CM), application traffic is confined within the con-
secutively allocated groups, causing congestion on some routers along minimal paths
to and from application nodes. Both local and global channels experience signifi-
cant congestion, as applications span multiple groups. Similarly, the saturation time
for both local and global channels is the highest among all configurations. When contiguous placement is coupled with adaptive (CA) and progressive
adaptive (CPA) routing, traffic is able to take non-minimal paths via intermediate
routers, helping to alleviate congestion along the minimal paths. The resulting traffic
through the most utilized local and global channels is greatly reduced, as shown in
Figures 6.2(a) and 6.2(b). Similarly, the corresponding saturation time on local and
global channels is also reduced significantly, demonstrating the efficacy of adaptive
routing in this case. For contiguous placement, we see no perceptible difference in
behavior between adaptive and progressive adaptive routing.
In most cases, the random placement policy behaves similarly across routing
policies. Random placement uniformly distributes MPI ranks over the network, bal-
ancing the resulting traffic load. As shown in Figures 6.2(a) and 6.2(b), no router
experiences an exceptionally high volume of traffic on its local and global channels.
When random placement is coupled with minimal routing (RM), less traffic is gen-
erated on account of the packets avoiding intermediate forwarding. At the same
time, there is still significant congestion on local channels because packets cannot traverse non-minimal routes, falling into the same trap as the contiguous-minimal configuration. Coupled with (progressive) adaptive routing, saturation times
are effectively minimized on both global and local channels when random placement
is in use, as shown in Figure 6.2(d) and 6.2(e). Further, in comparison to contigu-
ous allocations, random allocations result in a more evenly distributed load on the
resulting channels, as expected.
Figures 6.2(c) and 6.2(f) are presented for the purpose of symmetry, showing
the traffic per terminal link as well as the saturation time experienced at each termi-
nal. The terminal traffic distribution corresponds directly to application traffic, as we
are using one MPI rank per node (terminal). However, saturation times are different,
resulting from the aforementioned network behavior. In particular, contiguous allocations coupled with minimal routing result in a "long-tail" distribution of saturation times.
6.4.2 Individual Application Analysis.
(a) MultiGrid (b) CrystalRouter (c) AMG
Figure 6.3. Communication time distribution across application ranks in Workload I.
Now that the system-level view has been analyzed, we turn to evaluate the
behavior of each application within Workload I. Figure 6.3 shows the communica-
tion time distribution across application ranks for different placement and routing
configurations.
The relative behavior of contiguous allocations is roughly similar in all three
applications. Contiguous placement with minimal routing results in poor relative
performance across the board compared to the adaptive routing alternatives. Given
the analyses in Section 6.4.1, this is to be expected – the contiguous-minimal config-
uration results in significant congestion.
For the MultiGrid and CrystalRouter applications (Figures 6.3(a) and 6.3(b),
(a) MultiGrid LC Traffic (b) CrystalRouter LC Traffic (c) AMG LC Traffic
(d) MultiGrid GC Traffic (e) CrystalRouter GC Traffic (f) AMG GC Traffic
Figure 6.4. Aggregate workload traffic for routers serving specific applications. "CA" and "CPA" have equivalent behavior. More routers are involved in serving each application when random placement is in use, compared to contiguous placement.
respectively), using random allocation with any routing method results in perfor-
mance improvements over contiguous allocations, which is largely in agreement with
the literature (see Section 6.8). The high-radix nature of the network topology en-
sures that the benefits from the resulting load balancing outweigh the costs of extra
hops for point-to-point messages.
The AMG application (Figure 6.3(c)), however, shows markedly different be-
havior when using random allocation. Random allocation with minimal routing re-
sults in worse performance than contiguous-adaptive configurations, while using adap-
tive routing results in significant performance regressions. As this is a counterintuitive
result not discussed in other works, we investigate further.
We step back to a network-level system view to identify the culprit behind
AMG’s abnormal behavior with random placement. This time, however, we identify
the compute nodes that each MPI rank resides on and the routers that are serving
each application, and analyze the traffic on a per-application basis. The results of this
experimentation are presented in Figure 6.4. Note that different numbers of routers
are used in the contiguous and random allocation configurations, as each router serves
multiple terminals.
The system behavior with respect to the CrystalRouter application arguably
best matches expectations. Use of contiguous allocations results in a subset of chan-
nels with a significant traffic load while a significant portion are unused. Use of ran-
dom allocations results in a comparatively smoother traffic distribution, with some
variation on the margins due to the randomness.
MultiGrid shows roughly similar behavior for contiguous allocations, but dif-
ferent behavior along the local channels. There is a significant variation in the traffic
distribution on local channels, even with adaptive routing, which nevertheless has the
net effect of reducing the maximal traffic load.
AMG shows a similar level of variation to MultiGrid in this case. However, it
is the least communication-intensive application of the three by a significant factor.
As evidenced by the wide gap between the router traffic in contiguous and random
placement configurations, the routers serving the AMG application are being utilized
by MultiGrid and CrystalRouter, resulting in AMG traffic contending with traffic of
other applications. The net effect, as shown in Figure 6.3(c), is significant slowdowns
for AMG. We refer to this phenomenon as AMG being “bullied” by MultiGrid and
CrystalRouter.
6.4.3 Key Observations. In summary, based on the simulations presented in
Section 6.4.1 and 6.4.2, we make the following observations.
System-level performance is significantly improved when random placement
and adaptive routing are in use. Random placement can uniformly distribute MPI
ranks of an application over the network, and adaptive routing can redirect the traffic
from congested routers to other less busy routers. The combination of the two min-
imizes hot-spots and promotes load-balanced distribution. The resulting increased
number of hops per message was not a significant detriment in comparison. This
matches what is seen in the literature.
The performance of communication-intensive jobs in the system improves through
use of random allocation policies. Both CrystalRouter and MultiGrid, the two most
communication-intensive jobs, saw improved distributions of communication perfor-
mance when moving to a random allocation. Again, this matches what is seen in the
literature.
The performance of less communication-intensive jobs in the workload regresses
when random placement and adaptive routing are in use. AMG in Workload I is “bul-
lied” by its concurrently running communication-intensive peers, MultiGrid and Crys-
talRouter. AMG shares routers and groups with MultiGrid and CrystalRouter under
random placement. The traffic from MultiGrid and CrystalRouter is (re)directed to
the routers that serve AMG, slowing down AMG’s communication. 3
In contrast, performance consistency of each application is achieved only when
contiguous placement and minimal routing are in use. As a corollary to the previous observation, router and group sharing among applications is prevented when using contiguous placement with minimal routing (sharing of spare
nodes within a group notwithstanding). This renders the “bully” behavior moot,
though with the downside of significant performance degradation, so such approaches
3 We have tried three different congestion-sensing schemes from the literature for adaptive routing [96]. Although there are some variations among the results, none of the congestion-sensing schemes prevents the "bully" behavior.
must be carefully considered.
6.5 Study of Parallel Workload II
In this section, we use a different experimental configuration to verify and
explore the observations made in the previous section. Specifically, we conduct the
same sets of experiments through Workload II, which consists of sAMG, MultiGrid
and CrystalRouter. By replacing AMG with sAMG, we turn the “bully” into the
“bullied”.
6.5.1 Network Performance Analysis. Figure 6.5 shows aggregate traffic and
saturation times for Workload II, corresponding to Figure 6.2 in Workload I. Replac-
ing AMG with sAMG results in greater aggregate traffic as well as more saturation
than in Workload I, but regardless, similar patterns can be observed. Contiguous
placement with minimal routing results in load imbalance with respect to both traffic
and saturation. The addition of adaptive routing alleviates these effects to some de-
gree, particularly with respect to global channel usage, while trading off saturation in
global channels for saturation in the local channels. Using random placement again
shows roughly similar performance characteristics across routing configurations, with
adaptive routing helping to balance aggregate load while increasing the aggregate
traffic due to the related indirection.
6.5.2 Individual Application Analysis. We study the performance of each
application individually in the same manner as in Section 6.4.2. Figure 6.6 shows the
communication time distributions of the ranks of the three applications, running con-
currently in Workload II. The “bully” in Workload I becomes the “bullied” in Work-
load II. MultiGrid and CrystalRouter are in this instance the less communication-
intensive applications. With random placement and (progressive) adaptive routing,
both MultiGrid and CrystalRouter experience prolonged communication time, as
(a) GC Traffic (b) LC Traffic (c) TL Traffic
(d) GC Saturation Time (e) LC Saturation Time (f) TL Saturation Time
Figure 6.5. Aggregate traffic and saturation time for Workload II under the configurations listed in Table 6.1.
shown in Figure 6.6(a), 6.6(b). On the other hand, sAMG (Figure 6.6(c)) bene-
fits from those configurations in a similar manner to CrystalRouter in Workload I.
Contiguous placement coupled with minimal routing, while preventing the “bully” be-
havior, results in poor performance for all of the applications except CrystalRouter,
which we expect is due to a higher degree of network isolation.
(a) MultiGrid (b) CrystalRouter (c) sAMG
Figure 6.6. Communication time distribution across application ranks in Workload II. The "bully", sAMG, benefits from random placement and adaptive routing, while the "bullied", MultiGrid and CrystalRouter, suffer performance degradation.
Once again, we look at the network-level system view to scrutinize the traffic
through the routers serving each application. The routers serving sAMG have a
high volume of traffic on both local and global channels when contiguous placement
is in use, as shown in Figures 6.7(c) and 6.7(f). As in previous results, the use
of random placement alleviates the local congestion by uniformly distributing the
traffic of sAMG over the network, getting more routers involved in serving sAMG.
In this case, a majority of those less busy routers are also serving MultiGrid and
CrystalRouter.
MultiGrid in Workload II is similar to AMG in Workload I when considering
resident channel behavior, as shown in Figures 6.7(a) and 6.7(d). There is a large
gap in traffic volume between the contiguous and random placement approaches, due
to other applications (sAMG in particular) utilizing the same links. CrystalRouter
in Workload II additionally experiences more load on its channels under random allocation configurations, as shown in Figures 6.7(b) and 6.7(e). However, the maximal
load under random allocation is closer to that observed in contiguous allocations as
compared to MultiGrid.
6.5.3 Key Observations. Revisiting the observations in Section 6.4.3, we find that those observations hold under this separate configuration. System-level performance is still much improved in terms of load balancing with random placement.
sAMG, being far and away the most communication-intensive application in Work-
load II, benefits greatly from random placement, whereas in Workload I AMG was
effectively penalized for being less communication-intensive. CrystalRouter, being the
comparatively less communication-intensive application in Workload II, experiences
performance regressions in Workload II under random and adaptive policies.
Interestingly, in both Workload I and II, MultiGrid experiences a more subtle
performance variation than the significant swings in performance observed in the
(a) MultiGrid LC Traffic (b) CrystalRouter LC Traffic (c) sAMG LC Traffic
(d) MultiGrid GC Traffic (e) CrystalRouter GC Traffic (f) sAMG GC Traffic
Figure 6.7. Aggregate workload traffic for routers serving specific applications. More routers are involved in serving each application when random placement is in use, compared to contiguous placement.
other applications. These behaviors persisted across multiple runs with different
random seeds. Additionally, CrystalRouter in Workload II has less drastic changes
in maximal load, but still experiences performance regressions. We are continuing
to work towards understanding the root causes and implications of this behavior,
for which we expect application-specific communication patterns to be an important
factor.
6.6 Hybrid Job Placement
Based on our experiments with Workloads I and II, the “bully” behavior is
exhibited when the dragonfly network is configured with random placement and (pro-
gressive) adaptive routing, and there is a large gap between the communication inten-
sity of applications running on the network. As shown through our experimentation,
contiguous placement policies give up too much in terms of congestion and load bal-
ance, hence being an impractical solution. Further, running each job with a dedicated
routing policy is unrealistic, since the routing policy is part of the system configuration and cannot be changed on the fly at job submission time.
As a natural extension of our observations, one question that arises is whether
we can combine the merits of the random and contiguous placement policies so that each application receives the performance benefits of system load balancing while avoiding the "bully" behavior. As an initial exploration of this question, we set up a mock hybrid
job placement policy, in which less communication-intensive jobs receive contiguous
allocations to avoid the “bully” effect, while the communication-intensive jobs are
allocated randomly in order to distribute the communication load. For Workload I,
this means AMG gets a contiguous allocation while MultiGrid and CrystalRouter get
random allocations. Note that we do not consider challenges inherent in designing an
allocation policy for production usage, such as backfilling, reserving large contiguous
sets of nodes, determining a metric for communication intensity, etc., preferring a
restricted-scope experiment looking at the design space of dragonfly allocation policies
in light of our experimental observations.
(a) MultiGrid (b) CrystalRouter (c) AMG
Figure 6.8. Application communication time. Workload I is running with all placement and routing configurations. Methods prefixed with "H" represent the hybrid allocation approach.
For the purpose of brevity, we only present the communication time distribu-
tion of each application under all placement and routing configurations, including the
hybrid placement method. These results are presented in Figure 6.8. As shown in
Figure 6.8(a) and 6.8(b), MultiGrid and CrystalRouter exhibit similar performance in
both hybrid and random placement, as nodes are being placed randomly in each case.
While AMG under hybrid placement, shown in Figure 6.8(c), still exhibits significant communication interference from the other applications relative to the best contiguous placement policies, the effects are significantly reduced compared to a random-adaptive policy. We believe this to be a
result of more AMG-specific traffic occupying a smaller set of routers/groups, both
reducing the probability of traffic entering them through adaptive routing and in-
creasing the relative proportion of link utilization by AMG. Of course, this comes
with the costs associated with contiguous allocation, in which AMG’s traffic is less
likely to load balance across multiple dragonfly groups.
These initial experiments demonstrate some degree of benefit derived from us-
ing a hybrid approach, helping to alleviate the “bully” effect while retaining the per-
formance of communication-intensive applications. However, the behavior is still not
ideal in this case – AMG’s communications still experience performance degradation
versus the contiguous configurations. Hence, more work in this area is needed to fully
understand the intricate relationships between job scheduling and system/application
communication behavior to achieve optimal network utilization and application per-
formance in high-radix networks.
6.7 Other Placement Policies
The fact that hybrid job placement cannot eliminate the adverse effect of the "bully" behavior motivates us to explore other job placement policies. Random Router placement, which has been studied by Jain et al. [5], is one possible job placement policy for future dragonfly-connected systems. In the Random Router placement
policy, a router is identified as “idle” if all its attached nodes are available. Idle
routers are randomly picked over the network, and all the compute nodes attached to each router are assigned to a single job exclusively. Thus, an allocation made by the Random Router placement policy preserves locality within each router. Compared with the random placement discussed in Section 6.2.3, Random Router placement makes a trade-off between randomness and locality.
The Random Partition job placement policy is another possible solution for future dragonfly systems. Partition-based job placement has been adopted on torus-connected HPC systems such as the IBM Blue Gene/P and Blue Gene/Q machines to accommodate their capability computing workloads, and has been well studied by Zhou et al. and Yang et al. [9, 10, 28]. Similarly, a partition could be configured as a portion of a group in a dragonfly network. For example, in our simulated dragonfly network, we define a partition as half a group, which consists of four routers; each partition thus contains 16 compute nodes. The Random Partition job placement policy assigns each job a set of randomly picked partitions. Random Partition placement preserves more locality and allows less randomness than Random Router placement.
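A minimal sketch of the two policies under this chapter's topology (p = 4 nodes per router; a partition is 4 routers, hence 16 nodes); the helper names and the (router, slot) node encoding are hypothetical:

```python
import random

P = 4                        # compute nodes attached to each router
ROUTERS_PER_PARTITION = 4    # a partition = half a group

def random_router_placement(idle_routers, n_nodes, seed=None):
    """Randomly pick idle routers and give the job all P nodes of each,
    preserving router-level locality."""
    need = -(-n_nodes // P)  # ceiling division: routers required
    chosen = random.Random(seed).sample(sorted(idle_routers), need)
    return [(r, slot) for r in chosen for slot in range(P)][:n_nodes]

def random_partition_placement(idle_partitions, n_nodes, seed=None):
    """Randomly pick idle partitions (16 nodes each), preserving
    partition-level locality."""
    per_part = P * ROUTERS_PER_PARTITION
    need = -(-n_nodes // per_part)
    chosen = random.Random(seed).sample(sorted(idle_partitions), need)
    return [(pt, slot) for pt in chosen for slot in range(per_part)][:n_nodes]
```

The two sketches differ only in granularity: the router-level policy scatters a job over many routers chosen at random, while the partition-level policy keeps each 16-node chunk inside a single group.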
Table 6.3. Three different random placement and routing configurations
Routing
Placement Minimal Adaptive Prog Adaptive
Random rand-min rand-adp rand-padp
RandomRouter randR-min randR-adp randR-padp
RandomPartition randP-min randP-adp randP-padp
6.7.1 Individual Application Analysis. We present the communication time
distribution of each application in Workload I under three different random place-
ment policies and routing configurations, including the random placement policy we
discussed in section 6.2.3. These results are presented in Figure 6.9.
(a) MultiGrid (b) CrystalRouter (c) AMG
Figure 6.9. Application communication time. Workload I is running with three different random placement policies coupled with three routing configurations. The y-axes are in milliseconds.
As we discussed in Section 6.4, MultiGrid and CrystalRouter prefer random placement because their traffic can be evenly distributed across the network. As shown in Figures 6.9(a) and 6.9(b), MultiGrid and CrystalRouter exhibit a similar pattern across the three random placement policies. Both applications suffer performance loss when switching from random (rand) to random router (randR) and random partition (randP) placement. Random router placement preserves some locality by assigning all the compute nodes attached to a router to the same job, so some MPI ranks reside on adjacent nodes. Random partition placement goes further by grouping four routers together, preserving even more locality. Local congestion is likely to occur among those adjacent ranks in both applications, for the same reason we discussed for contiguous placement in Section 6.4. When random partition placement is coupled with minimal routing, communication in both MultiGrid and CrystalRouter suffers additional delay due to congestion on minimal paths. (Progressive) adaptive routing can alleviate that local congestion by routing packets through both minimal and non-minimal paths adaptively.
Due to its relatively small traffic volume, AMG favors an allocation with locality to avoid being "bullied" by the other applications. Random router placement provides allocations with router-level locality. However, the four compute nodes attached to each router cannot accommodate the 3D nearest-neighbor communication pattern of AMG. Thus, we observe only a slight performance improvement when AMG switches from random to random router placement in Figure 6.9(c). When random partition placement is coupled with (progressive) adaptive routing, the performance of AMG is greatly boosted, as indicated by the sharp drop in AMG's communication time. Random partition placement makes a better trade-off between locality and randomness for AMG by assigning it all the compute nodes attached to four routers in the same group. Although random partition placement coupled with (progressive) adaptive routing can greatly improve AMG's performance, this does not mean that it reduces the "bully" effect. In fact, the improvement for AMG is achieved by introducing performance degradation to MultiGrid and CrystalRouter.
6.7.2 Key Observations. Based on the experiments presented in Section 6.7.1, we make the following observations.
Neither random router nor random partition placement can eliminate the "bully" effect. Neither placement policy can prevent AMG from sharing the network with MultiGrid and CrystalRouter, which is the root cause of the "bully" effect. Traffic from each application still needs to share local and global channels when traversing from the source to the destination router.
MultiGrid and CrystalRouter prefer randomness over locality on the dragonfly network. Due to their high traffic volumes, MultiGrid and CrystalRouter prefer random placement, which evenly distributes their traffic. Random router placement reduces the randomness by preserving router-level locality, introducing potential local congestion and performance degradation for both applications. Random partition placement makes things worse for MultiGrid and CrystalRouter, as partition-level locality is preserved.
AMG benefits from allocations with a high degree of locality. As the least communi-
cation-intensive application in Workload I, AMG prefers exclusive network resources,
without sharing with other applications. The locality preserved by the random router
and random partition placement policies reduces the sharing of routers among dif-
ferent applications. The performance of AMG is greatly improved when random
partition placement is applied.
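The distinction among these placement policies can be sketched as node-selection strategies over an abstract dragonfly machine. The Python sketch below is illustrative only: the machine dimensions, the helper names, and the partition size of four routers per group are assumptions for the example, not the simulated configuration.

```python
import random

# Hypothetical dragonfly dimensions: groups, routers per group, and
# compute nodes per router (illustrative, not the simulated system).
GROUPS, ROUTERS_PER_GROUP, NODES_PER_ROUTER = 9, 8, 4

def all_nodes():
    """Enumerate compute nodes as (group, router, node) triples."""
    return [(g, r, n)
            for g in range(GROUPS)
            for r in range(ROUTERS_PER_GROUP)
            for n in range(NODES_PER_ROUTER)]

def random_node_placement(free, count):
    """Fully random: pick individual nodes anywhere, maximizing spread."""
    return random.sample(free, count)

def random_router_placement(free, count):
    """Router-level locality: pick whole routers at random, then use
    all free nodes attached to each chosen router."""
    routers = list({(g, r) for (g, r, _) in free})
    random.shuffle(routers)
    chosen = []
    for g, r in routers:
        chosen += [t for t in free if t[:2] == (g, r)]
        if len(chosen) >= count:
            return chosen[:count]
    return chosen

def random_partition_placement(free, count, partition_size=4):
    """Partition-level locality: pick random groups, then allocate the
    nodes of `partition_size` routers within the same group at a time."""
    groups = list({g for (g, _, _) in free})
    random.shuffle(groups)
    chosen = []
    for g in groups:
        routers = sorted({r for (gg, r, _) in free if gg == g})[:partition_size]
        chosen += [t for t in free if t[0] == g and t[1] in routers]
        if len(chosen) >= count:
            return chosen[:count]
    return chosen
```

Under this sketch, an 8-node request served by router-level placement touches exactly two routers, while a 16-node request served by partition-level placement stays inside a single group, which is the locality behavior discussed above.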
6.8 Related Work
The impact of job placement on system behavior and application performance
has been the subject of many studies. We focus on the HPC-centric studies here.
Skinner et al. identified significant performance variability due to network contention [108].
They found that performance variability is inevitable on either torus or fat-tree
networks when network sharing among concurrently running applications is allowed.
Bhatele et al. studied the performance variability of a specific application, pF3D, run-
ning on different HPC production systems with torus network topologies [4]. They
observed consistent performance when network resources were allocated compactly
and exclusively, and wide variability otherwise. Jokanovic et al. studied the impact
of job placement on the workload and claimed that the key to reducing performance
variability is to avoid network sharing [22].
Zhou et al. investigated the potential of relaxing network resource allocation
by utilizing application communication features [10]. They studied the performance of
a number of parallel benchmarks when assigned either mesh- or torus-connected node
allocations. Based on their observations, they proposed a communication-aware
scheduling policy that selectively allocates network resources to user jobs according
to their communication characteristics. Yang et al. proposed a window-based
locality-aware job scheduling design for HPC systems [28]. The objective of their
design is to preserve locality in node allocation while maintaining high system
utilization. In another recent work, Yang et al. investigated the performance of
applications with different communication characteristics under different allocation
and mapping strategies on torus-connected HPC systems [9].
Recently, several researchers have investigated job placement and routing al-
gorithms on dragonfly networks. Prisacari et al. proposed a communication-matrix-
based analytic modeling framework for mapping application workloads onto network
topologies [109]. They found that, in the context of dragonfly networks, optimizing
for throughput rather than workload completion time is often misleading, and that
the notion of system balance cited as a dragonfly design parameter is not always
directly applicable to all workloads. Jain et al. conducted a comprehensive analysis
of various job placement and routing policies with regard to network link throughput
on dragonfly networks [5]. Their work is based on an analytical model and synthetic
workloads. Bhatele et al. used coarse-grained simulation to study the performance of
synthetic workloads under different task mapping and routing policies on two-level
direct networks [6]. Mubarak et al. focused on modeling large-scale dragonfly networks
with parallel discrete-event simulation. The dragonfly network model for million-node
configurations presented in their work strongly scales when going from 1,024 to 65,536
MPI tasks on IBM Blue Gene/P and IBM Blue Gene/Q systems [102]. The dragonfly
model used in this chapter is from their work.
Our work complements the literature in the following aspects. First, our
simulations are driven by real application traces intended to be representative of
production-scale application patterns. Second, we study a variety of job placement
and routing policies that could be used on HPC systems with dragonfly networks;
this study can serve as a guideline for future system design and implementation.
Third, we holistically examine network behavior at both the overall system level
and the individual application level, though we do not consider communication-
pattern-specific application mappings as Prisacari et al. did. Last but not least,
with the CODES simulation toolkit and related network models [102,110], we are
able to simulate and examine system and application behavior at a very fine grain,
collecting data at the dragonfly link level with packet-level fidelity. We believe these
differences allowed us to uncover the “bully” behavior, which to our knowledge is
unreported in the literature. That said, Prisacari et al.’s work suggests these types
of behaviors as possibilities deriving from the balance-first design rationale for the
dragonfly.
6.9 Summary
In this chapter, we have conducted extensive studies of system and application
behavior using various job placement and routing configurations on a dragonfly net-
work. We took a simulation-based approach, utilizing the CODES simulation toolkit
and related models for high-fidelity dragonfly simulation, driving the network with
three production-representative scientific application traces. We found that, under
the prevailing recommendation of random job placement and adaptive routing, net-
work traffic can be well distributed to achieve a balanced load and strong overall
performance, at the cost of impairing jobs with less intensive communication pat-
terns. We denote this as the “bully” effect. On the other hand, contiguous process
placement prevents such effects while exacerbating local congestion, though this can
be mitigated through the addition of adaptive routing. Finally, we performed initial
experiments exploring a mock “hybrid” contiguous/random job placement policy
and two other random placement policies. Our preliminary study demonstrates
the need for specialized job placement strategies based on application communication
characteristics.
To the best of our knowledge, using real application traces from production
systems for the study of job interference on dragonfly networks has not been reported
in the literature so far. We believe the observations and new placement policy pre-
sented in this chapter are valuable to a number of communities, including HPC
computing facilities, system software developers, system administrators, and HPC
users. Computing facilities should take network resource sharing into consideration
when choosing configurations for their dragonfly networks. System software developers
could design better scheduling algorithms for jobs with distinct communication
characteristics. System administrators could make more accurate predictions about
system availability based on job running status when the system is configured with
different placement and routing policies. Users can provide detailed information
about their applications so that the batch scheduler can make informed placement
decisions.
CHAPTER 7
CONCLUSION
The batch scheduler plays an important role in managing and utilizing large-
scale HPC systems. It serves as the interface between users and the HPC system,
deciding when and where to dispatch submitted jobs for execution on the system.
As existing batch scheduling designs are challenged by emerging issues in the exascale
computing era, our research explores new scheduling algorithms and methodologies
to address those issues. The specific issues we identified include increasing energy
cost, network contention, and job interference. In this dissertation, we have addressed
these issues in an orchestrated way by building a cooperative batch scheduling
framework that integrates novel batch scheduling algorithms and methodologies.
7.1 Summary of Contributions
We proposed new batch scheduling algorithms and methodologies and inte-
grated them into our new cooperative batch scheduling framework to address three
major challenges for HPC systems in the exascale era. We made the following
contributions:
• We propose a novel job power-aware scheduling design, with the objective
of reducing the ever-increasing electricity bill for HPC systems. Our design is
based on the facts that HPC jobs have different individual power profiles and
that electricity prices vary throughout the day. By scheduling jobs with high
power profiles during low electricity pricing periods and jobs with low power
profiles during high pricing periods, our scheduler is capable of cutting
the electricity bill of HPC systems by up to 23% without impacting system
utilization, which is critical to HPC systems.
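The core idea of this design can be sketched as a price-aware job selection rule. The snippet below is a minimal illustration assuming a two-period (low/high) pricing model and a per-node power-profile field; the job fields and function name are hypothetical, not the dissertation's implementation.

```python
# A minimal sketch of the power-aware idea: during a low-price period,
# prefer jobs with high power profiles; during a high-price period,
# prefer low-power jobs. The two-period price model and the job
# dictionary layout are illustrative assumptions.

def pick_next_job(queue, price_is_low):
    """Select the next job from the wait queue based on the current
    electricity pricing period."""
    if not queue:
        return None
    # Rank by power profile (watts per node); run power-hungry jobs
    # while electricity is inexpensive, and vice versa.
    key = lambda job: job["power_per_node"]
    return max(queue, key=key) if price_is_low else min(queue, key=key)

queue = [
    {"name": "climate_sim", "power_per_node": 95.0},
    {"name": "genome_idx",  "power_per_node": 40.0},
    {"name": "cfd_solver",  "power_per_node": 70.0},
]

assert pick_next_job(queue, price_is_low=True)["name"] == "climate_sim"
assert pick_next_job(queue, price_is_low=False)["name"] == "genome_idx"
```

A real scheduler would of course also weigh job priority, wait time, and node availability; the sketch isolates only the price/power matching idea.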
• In order to balance job performance with system performance, we design and im-
plement a window-based locality-aware job scheduling methodology for torus-
connected systems. Unlike a traditional batch scheduler, our design has three
novel features. First, rather than scheduling jobs one by one, our design takes
a “window” of jobs (i.e., multiple jobs) into consideration for job prioritization
and resource allocation. Second, our design maintains a list of slots to preserve
node-contiguity information for resource allocation. Finally, we formulate
scheduling decision making as a 0-1 Multiple Knapsack problem and present
two algorithms to solve it. Comprehensive trace-based simulations demonstrate
that our design can reduce average job wait time by up to 28% and average job
response time by up to 30%, with a slight improvement in overall system
utilization.
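The knapsack view of window-based scheduling can be illustrated with a small greedy heuristic: queued jobs are items whose weight is the node request and whose value is the priority, and contiguous slots are knapsacks whose capacity is their free-node count. The best-fit sketch below is only illustrative; it is not one of the two algorithms presented in the dissertation.

```python
def schedule_window(jobs, slots):
    """Greedy heuristic for the 0-1 multiple knapsack formulation:
    jobs (items) carry a node request (weight) and a priority (value);
    slots (knapsacks) carry a free-node capacity. Each job is placed
    in at most one slot."""
    remaining = dict(slots)             # slot id -> free nodes
    placement = {}                      # job name -> slot id
    # Consider highest-priority (and, tie-broken, largest) jobs first.
    for job in sorted(jobs, key=lambda j: (-j["priority"], -j["nodes"])):
        # Best-fit: the slot that leaves the least leftover capacity.
        fits = [(cap - job["nodes"], sid)
                for sid, cap in remaining.items() if cap >= job["nodes"]]
        if fits:
            _, sid = min(fits)
            remaining[sid] -= job["nodes"]
            placement[job["name"]] = sid
    return placement

jobs = [{"name": "A", "nodes": 512, "priority": 3},
        {"name": "B", "nodes": 256, "priority": 2},
        {"name": "C", "nodes": 128, "priority": 1}]
slots = {"slot0": 512, "slot1": 384}
print(schedule_window(jobs, slots))
# -> {'A': 'slot0', 'B': 'slot1', 'C': 'slot1'}
```

Because a job is assigned to one slot or none, the 0-1 constraint holds; jobs that fit nowhere simply wait for the next scheduling window.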
• We study the communication behavior of three parallel applications, AMG,
CrystalRouter, and MultiGrid, on torus networks with different placement poli-
cies and mapping strategies. We analyze intra- and interjob interference
by simulating the three applications running both independently and concurrently.
Based on our comprehensive experiments, we make several valuable observations
with regard to the relation between an application’s communication pattern and its
performance under different placement and mapping strategies. We be-
lieve these observations can provide valuable guidance for HPC batch schedulers
and resource managers to make flexible job allocations. We claim that rather
than using predefined partitions or a noncontiguous placement policy, future
HPC systems should assign resources to each job based on its communication
patterns.
• We conduct extensive studies of system and application behavior using vari-
ous job placement and routing configurations on a dragonfly network. We find
that, under the prevailing recommendation of random job placement and adap-
tive routing, network traffic can be well distributed to achieve a balanced load
and strong overall performance, at the cost of impairing jobs with less intensive
communication patterns. We denote this as the “bully” effect. On the other
hand, contiguous process placement prevents such effects while exacerbating
local congestion, though this can be mitigated through the addition of adaptive
routing. We explore a series of other possible job placement policies that can be
applied to dragonfly-connected systems. Our study demonstrates the effective-
ness of specialized job placement strategies based on application communication
characteristics.
7.2 Future Research
A natural extension of our accomplished work is to enhance cooperative
batch scheduling with a deeper understanding of application communication charac-
teristics, as well as new network topologies.
7.2.1 Communication Characteristics. A thorough understanding of parallel
applications’ communication patterns is of great importance for the design of future
batch scheduling frameworks. However, a communication pattern is a high-level concept
covering many aspects of a parallel application’s communication characteristics. Many
factors need to be considered to accurately analyze the communication pattern of a
parallel application, such as communication topology (i.e., the communication matrix),
intensity, frequency, and operation dependence. So far we have focused mainly on
the communication topology and intensity of applications; however, we believe
other factors, such as communication frequency, are also critical to applications’
communication performance on HPC systems. One possible direction for continuing
the study of application communication patterns is to analyze the
communication frequency and its impact on the job placement policy.
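As a concrete example of one such factor, the communication topology can be captured as a communication matrix accumulated from point-to-point send records, with intensity measured as total bytes per rank pair. The sketch below assumes a simplified (src, dst, bytes) record format rather than a real trace format such as DUMPI.

```python
def build_comm_matrix(events, nranks):
    """Accumulate a communication matrix M where M[src][dst] is the
    total bytes sent from rank src to rank dst. The (src, dst, bytes)
    tuples stand in for records parsed from a real MPI trace; this
    record format is a simplifying assumption for illustration."""
    M = [[0] * nranks for _ in range(nranks)]
    for src, dst, nbytes in events:
        M[src][dst] += nbytes
    return M

# A 4-rank ring exchange: each rank sends 1 KiB to its right neighbor.
events = [(r, (r + 1) % 4, 1024) for r in range(4)]
M = build_comm_matrix(events, 4)
total_bytes = sum(map(sum, M))                       # intensity: 4096
pairs = sum(1 for row in M for v in row if v > 0)    # topology: 4 pairs
print(total_bytes, pairs)
# -> 4096 4
```

Extending such a matrix with per-pair message counts and timestamps would expose the communication frequency dimension discussed above.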
7.2.2 Fat-tree Network Topology. Many HPC computing facilities will deploy
their next-generation supercomputers with a fat-tree network topology. One of the
most prominent deployments of such a system is the Summit supercomputer at Oak
Ridge National Laboratory [111]. The batch scheduler needs to make topology-aware
job scheduling decisions on the fat-tree network, and the interference between jobs
running concurrently on a fat-tree network is also an open question. In order to design
better batch scheduling for fat-tree-connected systems, considerable effort is required
for a comprehensive study on this topic.
BIBLIOGRAPHY
[1] M. Wright et al., The opportunities and challenges of exascale computing. http://science.energy.gov/, U.S. Department of Energy, 2010.
[2] X. Yang, Z. Zhou, S. Wallace, Z. Lan, W. Tang, S. Coghlan, and M. E. Papka, “Integrating dynamic pricing of electricity into energy aware scheduling for hpc systems,” in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’13, (New York, NY, USA), pp. 60:1–60:11, ACM, 2013.
[3] S. Wallace, V. Vishwanath, S. Coghlan, Z. Lan, and M. E. Papka, “Measuring power consumption on ibm blue gene/q,” in Parallel and Distributed Processing Symposium Workshops and PhD Forum (IPDPSW), 2013 IEEE 27th International, pp. 853–859, May 2013.
[4] A. Bhatele, K. Mohror, S. H. Langer, and K. E. Isaacs, “There goes the neighborhood: Performance degradation due to nearby jobs,” in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’13, (New York, NY, USA), pp. 41:1–41:12, ACM, 2013.
[5] N. Jain, A. Bhatele, X. Ni, N. J. Wright, and L. V. Kale, “Maximizing throughput on a dragonfly network,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’14, (Piscataway, NJ, USA), pp. 336–347, IEEE Press, 2014.
[6] A. Bhatele, W. D. Gropp, N. Jain, and L. V. Kale, “Avoiding hot-spots on two-level direct networks,” in High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for, pp. 1–11, Nov 2011.
[7] A. Bhatele, N. Jain, Y. Livnat, V. Pascucci, and P.-T. Bremer, “Analyzing network health and congestion in dragonfly-based supercomputers,” in Proceedings of the IEEE International Parallel & Distributed Processing Symposium, IPDPS ’16 (to appear), IEEE Computer Society, May 2016. LLNL-CONF-678293.
[8] X. Yang, J. Jenkins, M. Mubarak, R. B. Ross, and Z. Lan, “Watch out for the bully!: Job interference study on dragonfly network,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’16, (Piscataway, NJ, USA), pp. 64:1–64:11, IEEE Press, 2016.
[9] X. Yang, J. Jenkins, M. Mubarak, R. B. Ross, and Z. Lan, “Study of intra- and interjob interference on torus networks,” in 2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS), pp. 239–246, Dec 2016.
[10] Z. Zhou, X. Yang, Z. Lan, P. Rich, W. Tang, V. Morozov, and N. Desai, “Improving batch scheduling on blue gene/q by relaxing 5d torus network allocation constraints,” in Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, pp. 439–448, May 2015.
[11] Z. Zhou, X. Yang, Z. Lan, P. Rich, W. Tang, V. Morozov, and N. Desai, “Improving batch scheduling on blue gene/q by relaxing 5d torus network allocation constraints,” IEEE Transactions on Parallel and Distributed Systems (To Appear), 2016.
[12] Z. Zhou, X. Yang, D. Zhao, P. Rich, W. Tang, J. Wang, and Z. Lan, “I/O-aware batch scheduling for petascale computing systems,” in Cluster Computing (CLUSTER), 2015 IEEE International Conference on, pp. 254–263, Sept 2015.
[13] Z. Zhou, X. Yang, D. Zhao, P. Rich, W. Tang, J. Wang, and Z. Lan, “I/O-aware bandwidth allocation for petascale computing systems,” Parallel Computing.
[14] D. Zhao, X. Yang, I. Sadooghi, G. Garzoglio, S. Timm, and I. Raicu, “High-performance storage support for scientific applications on the cloud,” in Proceedings of the 6th Workshop on Scientific Cloud Computing, ScienceCloud ’15, (New York, NY, USA), pp. 33–36, ACM, 2015.
[15] N. Liu, J. Cope, P. Carns, C. Carothers, R. Ross, G. Grider, A. Crume, and C. Maltzahn, “On the role of burst buffers in leadership-class storage systems,” in 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1–11, April 2012.
[16] C. Patel, R. Sharma, C. Bash, and S. Graupner, “Energy aware grid: Global workload placement based on energy efficiency,” in ASME 2003 International Mechanical Engineering Congress and Exposition, pp. 267–275, American Society of Mechanical Engineers, 2003.
[17] S. Wallace, X. Yang, V. Vishwanath, W. E. Allcock, S. Coghlan, M. E. Papka, and Z. Lan, “A data driven scheduling approach for power management on hpc systems,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’16, (Piscataway, NJ, USA), pp. 56:1–56:11, IEEE Press, 2016.
[18] IBM Redbooks, IBM System Blue Gene Solution: Blue Gene/Q System Administration. Vervante, 2012.
[19] C. Document, “Managing system software for cray xe and cray xt systems,” 2012.
[20] P. Krueger, T.-H. Lai, and V. A. Dixit-Radiya, “Job scheduling is more important than processor allocation for hypercube computers,” IEEE Transactions on Parallel and Distributed Systems, vol. 5, pp. 488–497, May 1994.
[21] J. A. Pascual, J. Navaridas, and J. Miguel-Alonso, “Job scheduling strategies for parallel processing,” ch. Effects of Topology-Aware Allocation Policies on Scheduling Performance, pp. 138–156, Berlin, Heidelberg: Springer-Verlag, 2009.
[22] A. Jokanovic, J. Sancho, G. Rodriguez, A. Lucero, C. Minkenberg, and J. Labarta, “Quiet neighborhoods: Key to protect job performance predictability,” in Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, pp. 449–459, May 2015.
[23] D. Chen, N. Eisley, P. Heidelberger, R. Senger, Y. Sugawara, S. Kumar, V. Salapura, D. Satterfield, B. Steinmacher-Burow, and J. Parker, “The ibm blue gene/q interconnection fabric,” IEEE Micro, vol. 32, pp. 32–43, Jan. 2012.
[24] Y. Ajima, T. Inoue, S. Hiramoto, Y. Takagi, and T. Shimizu, “The tofu interconnect,” IEEE Micro, vol. 32, pp. 21–31, Jan. 2012.
[25] ORNL, “Titan system overview,” Accessed October 14, 2015. Available online https://www.olcf.ornl.gov/kbarticles/titan-system-overview.
[26] A. Gara, M. A. Blumrich, D. Chen, G.-T. Chiu, P. Coteus, M. E. Giampapa, R. A. Haring, P. Heidelberger, D. Hoenicke, G. V. Kopcsay, et al., “Overview of the blue gene/l system architecture,” IBM Journal of Research and Development, vol. 49, no. 2.3, pp. 195–212, 2005.
[27] C. Albing, N. Troullier, S. Whalen, R. Olson, J. Glenski, H. Pritchard, and H. Mills, Scalable Node Allocation for Improved Performance in Regular and Anisotropic 3D Torus Supercomputers, pp. 61–70. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011.
[28] X. Yang, Z. Zhou, W. Tang, X. Zheng, J. Wang, and Z. Lan, “Balancing job performance with system performance via locality-aware scheduling on torus-connected systems,” in Cluster Computing (CLUSTER), 2014 IEEE International Conference on, pp. 140–148, Sept 2014.
[29] M. Hennecke, W. Frings, W. Homberg, A. Zitz, M. Knobloch, and H. Bottiger, “Measuring power consumption on ibm blue gene/p,” Comput. Sci., vol. 27, pp. 329–336, Nov. 2012.
[30] A. Qureshi, R. Weber, H. Balakrishnan, J. Guttag, and B. Maggs, “Cutting the electric bill for internet-scale systems,” in Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, SIGCOMM ’09, (New York, NY, USA), pp. 123–134, ACM, 2009.
[31] DOE, “Design Forward - Exascale Initiative,” Accessed October 14, 2015. Available online http://portal.nersc.gov/project/CAL/trace.htm.
[32] J. Cope, N. Liu, S. Lang, C. D. Carothers, and R. B. Ross, “Codes: Enabling co-design of multi-layer exascale storage architectures,” in Workshop on Emerging Supercomputing Technologies 2011 (WEST 2011), (Tucson, AZ), 2011.
[33] Mira Supercomputer, “https://www.alcf.anl.gov/mira,” Accessed April 15, 2017.
[34] TOP, “Top500 supercomputing web site,” Accessed October 14, 2015. Available online http://www.top500.org.
[35] J. Kim, W. Dally, S. Scott, and D. Abts, “Technology-driven, highly-scalable dragonfly topology,” in Computer Architecture, 2008. ISCA ’08. 35th International Symposium on, pp. 77–88, June 2008.
[36] G. Faanes, A. Bataineh, D. Roweth, T. Court, E. Froese, B. Alverson, T. Johnson, J. Kopnick, M. Higgins, and J. Reinhard, “Cray cascade: A scalable hpc system based on a dragonfly network,” in High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, pp. 1–9, Nov 2012.
[37] Moab, “http://www.adaptivecomputing.com/,” Accessed May 5, 2016.
[38] Portable Batch System, “http://www.pbsworks.com/,” Accessed May 5, 2016.
[39] Slurm Workload Manager by SchedMD, “http://slurm.schedmd.com/,” Accessed May 5, 2016.
[40] Cobalt Resource Manager, “https://trac.mcs.anl.gov/projects/cobalt,” Accessed May 5, 2016.
[41] “An event-driven simulator.” Available online http://bluesky.cs.iit.edu/cqsim.
[42] “Parallel workload archive.” Available online http://www.cs.huji.ac.il/labs/parallel/workload.
[43] P. C. Roth, J. S. Meredith, and J. S. Vetter, “Automated characterization of parallel application communication patterns,” in Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’15, (New York, NY, USA), pp. 73–84, ACM, 2015.
[44] S. S. Shende and A. D. Malony, “The tau parallel performance system,” Int. J. High Perform. Comput. Appl., vol. 20, pp. 287–311, May 2006.
[45] M. Michael, J. Curt, C. Mike, B. Jim, R. Philip, and M. Tushar, “mpiP: Lightweight, Scalable MPI profiling,” Accessed October 14, 2015. Available online at http://mpip.sourceforge.net.
[46] X. Wu, F. Mueller, and S. Pakin, “Automatic generation of executable communication specifications from parallel applications,” in Proceedings of the International Conference on Supercomputing, ICS ’11, (New York, NY, USA), pp. 12–21, ACM, 2011.
[47] SNL, “SST DUMPI trace library,” Accessed October 14, 2015. Available online http://sst.sandia.gov/usingdumpi.html.
[48] J. Vetter, S. Lee, D. Li, G. Marin, C. McCurdy, J. Meredith, P. Roth, and K. Spafford, “Quantifying architectural requirements of contemporary extreme-scale scientific applications,” in High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation (S. A. Jarvis, S. A. Wright, and S. D. Hammond, eds.), vol. 8551 of Lecture Notes in Computer Science, pp. 3–24, Springer International Publishing, 2014.
[49] M. Mubarak, C. Carothers, R. Ross, and P. Carns, “Modeling a million-node dragonfly network using massively parallel discrete-event simulation,” in High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, pp. 366–376, Nov 2012.
[50] P. D. Barnes, Jr., C. D. Carothers, D. R. Jefferson, and J. M. LaPre, “Warp speed: Executing time warp on 1,966,080 cores,” in Proceedings of the 1st ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, SIGSIM PADS ’13, (New York, NY, USA), pp. 327–336, ACM, 2013.
[51] W. Tang, Z. Lan, N. Desai, D. Buettner, and Y. Yu, “Reducing fragmentation on torus-connected supercomputers,” in Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, pp. 828–839, May 2011.
[52] C. Garcia-Martos, J. Rodriguez, and M. J. Sanchez, “Forecasting electricity prices by extracting dynamic common factors: application to the iberian market,” IET Generation, Transmission Distribution, vol. 6, pp. 11–20, January 2012.
[53] J. Lundgren, J. Hellstrom, and N. Rudholm, “Multinational electricity market integration and electricity price dynamics,” in 2008 5th International Conference on the European Electricity Market, pp. 1–6, May 2008.
[54] T. Ida, K. Ito, and M. Tanaka, “Using dynamic electricity pricing to address energy crises: Evidence from randomized field experiments,” 2013.
[55] L. A. Barroso and U. Hölzle, “The case for energy-proportional computing,” Computer, vol. 40, pp. 33–37, Dec 2007.
[56] P. Li, “Variational analysis of large power grids by exploring statistical sampling sharing and spatial locality,” in ICCAD-2005. IEEE/ACM International Conference on Computer-Aided Design, 2005, pp. 645–651, Nov 2005.
[57] J. Hikita, A. Hirano, and H. Nakashima, “Saving 200kw and $200 k/year by power-aware job/machine scheduling,” in Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pp. 1–8, April 2008.
[58] Z. Cao, L. T. Watson, K. W. Cameron, and R. Ge, “A power aware study for vtdirect95 using dvfs,” in Proceedings of the 2009 Spring Simulation Multiconference, SpringSim ’09, (San Diego, CA, USA), pp. 107:1–107:6, Society for Computer Simulation International, 2009.
[59] E. Pinheiro, R. Bianchini, E. V. Carrera, and T. Heath, “Load balancing and unbalancing for power and performance in cluster-based systems,” 2001.
[60] Y. Liu and H. Zhu, “A survey of the research on power management techniques for high-performance systems,” Softw. Pract. Exper., vol. 40, pp. 943–964, Oct. 2010.
[61] M. Etinski, J. Corbalan, J. Labarta, and M. Valero, “Parallel job scheduling for power constrained hpc systems,” Parallel Comput., vol. 38, pp. 615–630, Dec. 2012.
[62] E. K. Lee, I. Kulkarni, D. Pompili, and M. Parashar, “Proactive thermal management in green datacenters,” J. Supercomput., vol. 60, pp. 165–195, May 2012.
[63] X. Fan, W.-D. Weber, and L. A. Barroso, “Power provisioning for a warehouse-sized computer,” in Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA ’07, (New York, NY, USA), pp. 13–23, ACM, 2007.
[64] C. Lefurgy, X. Wang, and M. Ware, “Power capping: A prelude to power shifting,” Cluster Computing, vol. 11, pp. 183–195, June 2008.
[65] Z. Zhou, Z. Lan, W. Tang, and N. Desai, Reducing Energy Costs for IBM Blue Gene/P via Power-Aware Job Scheduling, pp. 96–115. Berlin, Heidelberg: Springer Berlin Heidelberg, 2014.
[66] W. Tang, N. Desai, D. Buettner, and Z. Lan, “Analyzing and adjusting user runtime estimates to improve job scheduling on the blue gene/p,” in Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pp. 1–11, April 2010.
[67] D. G. Feitelson and A. M. Weil, “Utilization and predictability in scheduling the ibm sp2 with backfilling,” in Parallel Processing Symposium, 1998. IPPS/SPDP 1998. Proceedings of the First Merged International ... and Symposium on Parallel and Distributed Processing 1998, pp. 542–546, Mar 1998.
[68] J. M. Brandt, A. C. Gentile, D. J. Hale, and P. P. Pebay, “Ovis: a tool for intelligent, real-time monitoring of computational clusters,” in Proceedings 20th IEEE International Parallel Distributed Processing Symposium, pp. 8 pp.–, April 2006.
[69] T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson, Introduction to Algorithms. McGraw-Hill Higher Education, 2nd ed., 2001.
[70] W. Tang, N. Desai, D. Buettner, and Z. Lan, “Job scheduling with adjusted runtime estimates on production supercomputers,” Journal of Parallel and Distributed Computing, vol. 73, no. 7, pp. 926–938, 2013. Best Papers: International Parallel and Distributed Processing Symposium (IPDPS) 2010, 2011 and 2012.
[71] Z. Zheng, L. Yu, W. Tang, Z. Lan, R. Gupta, N. Desai, S. Coghlan, and D. Buettner, “Co-analysis of ras log and job log on blue gene/p,” in Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, pp. 840–851, May 2011.
[72] D. Tsafrir, K. Ouaknine, and D. G. Feitelson, “Reducing performance evaluation sensitivity and variability by input shaking,” in 2007 15th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pp. 231–237, Oct 2007.
[73] V. J. Leung, E. M. Arkin, M. Bender, D. Bunde, J. Johnston, A. Lal, J. S. B. Mitchell, C. Phillips, and S. S. Seiden, “Processor allocation on cplant: achieving general processor locality using one-dimensional allocation strategies,” in Cluster Computing, 2002. Proceedings. 2002 IEEE International Conference on, pp. 296–304, 2002.
[74] X. Yang, Z. Zhou, S. Wallace, Z. Lan, W. Tang, S. Coghlan, and M. E. Papka, “Integrating dynamic pricing of electricity into energy aware scheduling for hpc systems,” in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’13, (New York, NY, USA), pp. 60:1–60:11, ACM, 2013.
[75] V. Lo, K. J. Windisch, W. Liu, and B. Nitzberg, “Noncontiguous processor allocation algorithms for mesh-connected multicomputers,” IEEE Trans. Parallel Distrib. Syst., vol. 8, pp. 712–726, July 1997.
[76] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. New York, NY, USA: W. H. Freeman & Co., 1979.
[77] D. Johnson, Near-optimal Bin Packing Algorithms. Massachusetts Institute of Technology, Project MAC, Massachusetts Institute of Technology, 1973.
[78] D. S. Johnson, “Fast algorithms for bin packing,” Journal of Computer and System Sciences, vol. 8, no. 3, pp. 272–314, 1974.
[79] S. Martello and P. Toth, “Heuristic algorithms for the multiple knapsack problem,” Computing, vol. 27, no. 2, pp. 93–112, 1981.
[80] A. Bhatele and L. V. Kale, “Application-specific topology-aware mapping for three dimensional topologies,” in Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pp. 1–8, April 2008.
[81] Department of Energy, “Characterization of the DOE Mini-apps,” AccessedApril 2, 2016. Available online http://portal.nersc.gov/project/CAL/designforward.htm.
[82] V. E. Henson and U. M. Yang, “Boomeramg: A parallel algebraic multigridsolver and preconditioner,” Applied Numerical Mathematics, vol. 41, no. 1,pp. 155 – 177, 2002. Developments and Trends in Iterative Methods for LargeSystems of Equations - in memorium Rudiger Weiss.
[83] R. E. Alcouffe, R. S. Baker, J. A. Dahl, S. A. Turner, and R. Ward, “PARTISN:A time-dependent, parallel neutral particle transport code system,” Los AlamosNational Laboratory, LA-UR-05-3925 (May 2005), 2005.
[84] Z. Joe and B. Randal, “SNAP: SN (Discrete Ordinates) Application Proxy,”Accessed October 14, 2015. Available online https://github.com/losalamos/SNAP.
[85] P. Fischer, A. Obabko, E. Merzari, and O. Marin, “Nek5000: Computationalfluid dynamics code,” Accessed October 14, 2015. Available online http://nek5000.mcs.anl.gov.
[86] J. Bell, A. Almgren, V. Beckner, M. Day, M. Lijewski, A. Nonaka, andW. Zhang, “Boxlib users guide,” tech. rep., Technical Report, CCSE,Lawrence Berkeley National Laboratory. Available at: https://ccse. lbl.gov/BoxLib/BoxLibUsersGuide. pdf, 2012.
[87] R. Fiedler and W. Stephen, “Improving task placement for applications with2d, 3d, and 4d virtual cartesian topologies on 3d torus networks with servicenodes,” in Proceedings of Cray User Group, 2013.
[88] Z. Zhou, X. Yang, Z. Lan, P. Rich, W. Tang, V. Morozov, and N. Desai, “Im-proving batch scheduling on blue gene/q by relaxing 5d torus network allocationconstraints,” in Parallel and Distributed Processing Symposium (IPDPS), 2015IEEE International, pp. 439–448, May 2015.
[89] D. Skinner and W. Kramer, “Understanding the causes of performance variabil-ity in hpc workloads,” in Workload Characterization Symposium, 2005. Proceed-ings of the IEEE International, pp. 137–149, Oct 2005.
[90] CORAL, “Collaboration benchmark codes,” Accessed October 14, 2015. Avail-able online https://asc.llnl.gov/CORAL-benchmarks.
[91] T. Hoefler, W. Gropp, W. Kramer, and M. Snir, “Performance modeling forsystematic performance tuning,” in High Performance Computing, Networking,Storage and Analysis (SC), 2011 International Conference for, pp. 1–12, Nov2011.
[92] B. Prisacari, G. Rodriguez, P. Heidelberger, D. Chen, C. Minkenberg, and T. Hoefler, "Efficient task placement and routing of nearest neighbor exchanges in dragonfly networks," in Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, HPDC '14, (New York, NY, USA), pp. 129–140, ACM, 2014.
[93] D. Chen, N. Eisley, P. Heidelberger, R. Senger, Y. Sugawara, S. Kumar, V. Salapura, D. Satterfield, B. Steinmacher-Burow, and J. Parker, "The ibm blue gene/q interconnection network and message unit," in High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for, pp. 1–10, Nov 2011.
[94] J. Kim, W. Dally, S. Scott, and D. Abts, "Cost-efficient dragonfly topology for large-scale systems," IEEE Micro, vol. 29, pp. 33–40, Jan. 2009.
[95] N. Jiang, J. Kim, and W. J. Dally, "Indirect adaptive routing on large scale interconnection networks," in Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA '09, (New York, NY, USA), pp. 220–231, ACM, 2009.
[96] J. Won, G. Kim, J. Kim, T. Jiang, M. Parker, and S. Scott, "Overcoming far-end congestion in large-scale networks," in High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on, pp. 415–427, IEEE, 2015.
[97] D. Tsafrir, Y. Etsion, and D. G. Feitelson, "Backfilling using system-generated predictions rather than user runtime estimates," IEEE Transactions on Parallel and Distributed Systems, vol. 18, pp. 789–803, June 2007.
[98] A. Jokanovic, J. C. Sancho, G. Rodriguez, A. Lucero, C. Minkenberg, and J. Labarta, "Quiet neighborhoods: Key to protect job performance predictability," in Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, pp. 449–459, May 2015.
[99] M. Mubarak, C. D. Carothers, R. B. Ross, and P. Carns, "Enabling parallel simulation of large-scale hpc network systems," IEEE Transactions on Parallel and Distributed Computing, 2015.
[100] C. D. Carothers, D. Bauer, and S. Pearce, "ROSS: A high-performance, low-memory, modular Time Warp system," Journal of Parallel and Distributed Computing, vol. 62, pp. 1648–1669, Nov. 2002.
[101] P. D. Barnes, Jr., C. D. Carothers, D. R. Jefferson, and J. M. LaPre, "Warp Speed: Executing Time Warp on 1,966,080 Cores," in Proceedings of the 1st ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, SIGSIM PADS '13, (New York, NY, USA), pp. 327–336, ACM, 2013.
[102] M. Mubarak, C. D. Carothers, R. Ross, and P. Carns, "Modeling a million-node dragonfly network using massively parallel discrete-event simulation," in High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion:, pp. 366–376, Nov 2012.
[103] M. Mubarak, C. D. Carothers, R. B. Ross, and P. Carns, "A case study in using massively parallel simulation for extreme-scale torus network codesign," in Proceedings of the 2nd ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, SIGSIM PADS '14, (New York, NY, USA), pp. 27–38, ACM, 2014.
[104] N. Liu and C. D. Carothers, "Modeling billion-node torus networks using massively parallel discrete-event simulation," in Proceedings of the 2011 IEEE Workshop on Principles of Advanced and Distributed Simulation, PADS '11, (Washington, DC, USA), pp. 1–8, IEEE Computer Society, 2011.
[105] N. Wolfe, C. Carothers, M. Mubarak, R. Ross, and P. Carns, "Modeling a million-node slim fly network using parallel discrete-event simulation," in Proceedings of the 4th ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, SIGSIM PADS '16.
[106] N. Jiang, J. Balfour, D. U. Becker, B. Towles, W. J. Dally, G. Michelogiannakis, and J. Kim, "A detailed and flexible cycle-accurate network-on-chip simulator," in Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on, pp. 86–96, April 2013.
[107] Department of Energy, "Exascale Initiative," Accessed April 2, 2016. Available online http://www.exascaleinitiative.org/design-forward.
[108] D. Skinner and W. Kramer, "Understanding the causes of performance variability in hpc workloads," in Workload Characterization Symposium, 2005. Proceedings of the IEEE International, pp. 137–149, Oct 2005.
[109] B. Prisacari, G. Rodriguez, P. Heidelberger, D. Chen, C. Minkenberg, and T. Hoefler, "Efficient task placement and routing of nearest neighbor exchanges in dragonfly networks," in Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, HPDC '14, (New York, NY, USA), pp. 129–140, ACM, 2014.
[110] J. Cope, N. Liu, S. Lang, C. D. Carothers, and R. B. Ross, "CODES: Enabling co-design of multi-layer exascale storage architectures," in Workshop on Emerging Supercomputing Technologies 2011 (WEST 2011), (Tucson, AZ), 2011.
[111] Summit Supercomputer, "https://www.olcf.ornl.gov/summit/," Accessed April 15, 2017.