Office of Science
U.S. Department of Energy
Evaluating Checkpoint/Restart on the IBM SP
Outline
• Motivation for Checkpoint/Restart (CPR)
• CPR considerations
• CPR on the IBM SP
• Evaluation of CPR on the IBM SP
• Results
• Putting CPR into production
Motivation for Checkpoint/Restart
• Large HPC systems typically run large parallel or long-running jobs
• The running state of such jobs should be saved periodically so that an interruption does not lose too much work
• To decrease the impact of single-node failures on the overall usability of the machine
• To be able to perform maintenance on the system with minimal impact to running jobs
• Better utilization of resources
Checkpoint/Restart considerations
• User initiated (not from within the program)
• System (administrator) initiated
• Use of HPC systems is usually via a batch system (such as LoadLeveler)
• Both serial and parallel jobs are run on the machine
• Parallel jobs use message passing and we should be able to checkpoint these as well
• The CPR mechanism should be usable from within the code as well as externally
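To make the "internal to code" case concrete, here is a minimal sketch of a program that checkpoints itself. It is a generic illustration, not the LoadLeveler mechanism: the file format (JSON), the function names, and the `stop_at` parameter used to simulate an interruption are all assumptions made for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: write a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh starting state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "total": 0}


def run(path, last_step, checkpoint_every=10, stop_at=None):
    """Sum the integers 1..last_step, checkpointing every few steps.

    stop_at (hypothetical) simulates an interruption after that step completes.
    """
    state = load_checkpoint(path)
    while state["step"] < last_step:
        state["step"] += 1
        state["total"] += state["step"]
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
        if stop_at is not None and state["step"] == stop_at:
            return None  # interrupted; work since the last checkpoint is lost
    return state["total"]
```

If the run is interrupted at step 57, the next call resumes from the checkpoint taken at step 50 and only steps 51 to 57 are repeated, which is the "don't lose too much work" trade-off from the motivation slide.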
Checkpoint/Restart Users
• System administrators and operators: checkpoint is used to clear a node for maintenance work.
• End users of HPC systems (scientists, students, researchers)
• Programmers writing code that uses CPR mechanism internally (or utility programs to use CPR functionality for the system)
Checkpoint/Restart mechanism
For parallel programs
• Stop and discard mechanism (K. Z. Meth and W. G. Tuel): on receiving a checkpoint request, the task stops sending messages and is checkpointed. In-transit message information is saved, so it is known which messages have been sent but not acknowledged. These messages are resent on restart.
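The bookkeeping described above can be sketched as a toy model. This is not the IBM implementation: the `Task` class, the sequence numbers, and the list standing in for the network are assumptions for illustration, and clearing in-flight traffic at checkpoint time is one reading of the word "discard".

```python
class Task:
    """Toy model of one message-passing task (hypothetical, for illustration)."""

    def __init__(self):
        self.next_seq = 0
        self.unacked = {}    # seq -> payload, sent but not yet acknowledged
        self.sending = True

    def send(self, payload, network):
        """Send a message and remember it until it is acknowledged."""
        if not self.sending:
            raise RuntimeError("task is stopped for checkpoint")
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload
        network.append((seq, payload))
        return seq

    def acknowledge(self, seq):
        """The receiver confirmed delivery, so the message can be forgotten."""
        self.unacked.pop(seq, None)

    def checkpoint(self, network):
        """Stop sending, discard in-flight traffic, save the unacknowledged set."""
        self.sending = False
        network.clear()
        return {"next_seq": self.next_seq, "unacked": dict(self.unacked)}

    @classmethod
    def restart(cls, saved, network):
        """Rebuild the task and resend every message that was never acknowledged."""
        task = cls()
        task.next_seq = saved["next_seq"]
        task.unacked = dict(saved["unacked"])
        for seq in sorted(task.unacked):
            network.append((seq, task.unacked[seq]))
        return task
```

In this model a message that was sent but not acknowledged before the checkpoint is the only kind that needs special handling; acknowledged messages are already safely delivered and are not part of the saved state.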
Checkpoint/Restart methods
• Utility program as part of system software
• CPR API via system calls (ll_init_ckpt, etc.)
• Batch system software can use the API to implement CPR mechanism.
CPR on the IBM SP
• Done via LL command (llckpt)
• Once a process is checkpointed:
1. Process can continue running.
2. Process is killed.
• Within LL:
1. Job can be deleted from the queuing system.
2. Job can be resubmitted for consideration by the scheduler.
3. Job can be resubmitted and “held”.
Checkpoint/Restart on the IBM SP
Job command file keywords:
In order to be able to checkpoint a LL job:
#@ checkpoint = [yes|no|interval]
#@ ckpt_time_limit = [time to checkpoint]
#@ ckpt_dir = [path to checkpoint files]
#@ ckpt_file = [basename of checkpoint files]
In order to be able to restart a LL job:
#@ checkpoint = [yes|no|interval]
#@ ckpt_dir = [path to checkpoint files]
#@ ckpt_file = [basename of checkpoint files]
#@ restart_from_ckpt = [yes|no]
#@ restart_on_same_nodes = [yes|no]
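As an illustration of how these keywords fit together, the sketch below renders the `#@` lines for a job command file. Only the keyword names come from the slide; the helper name `checkpoint_keywords`, the directory, and the file basename are hypothetical, and the value format for `ckpt_time_limit` is left to the LL manual.

```python
def checkpoint_keywords(ckpt_dir, ckpt_file, checkpoint="yes",
                        ckpt_time_limit=None, restart_from_ckpt=None,
                        restart_on_same_nodes=None):
    """Return the '#@' keyword lines listed on the slide, skipping unset ones."""
    settings = [
        ("checkpoint", checkpoint),
        ("ckpt_time_limit", ckpt_time_limit),
        ("ckpt_dir", ckpt_dir),
        ("ckpt_file", ckpt_file),
        ("restart_from_ckpt", restart_from_ckpt),
        ("restart_on_same_nodes", restart_on_same_nodes),
    ]
    return [f"#@ {key} = {value}" for key, value in settings if value is not None]


if __name__ == "__main__":
    # Hypothetical path and basename; substitute site-specific values.
    for line in checkpoint_keywords("/path/to/ckpt", "myjob",
                                    restart_from_ckpt="yes",
                                    restart_on_same_nodes="no"):
        print(line)
```

A real job command file needs further keywords (executable, queue, and so on) that are outside the scope of this slide.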
CPR Evaluation on the IBM SP
We evaluated the use of C/R with LoadLeveler on the SP using both a 4-node development system (dev2) and the 416-node production system (seaborg). We evaluated:
(a) System requirements
(b) Configuration changes
(c) Viability/Ease of Use
CPR Evaluation on the IBM SP
Two kinds of programs:
• Serial code that allocates a certain amount of memory (an integer array) and initializes the array
• MPI code that starts up a certain number of processes, allocates a certain amount of memory, and does simple message passing
User checkpoint:
• Submit a job using llsubmit, let it run, use llckpt -u to checkpoint, and resume the job using llhold -r
• User can also use llckpt -k and resubmit the job
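The two user checkpoint workflows can be written out as command sequences. The sketch below only builds the argument lists and does not execute anything. The commands and flags (`llsubmit`, `llckpt -u`, `llckpt -k`, `llhold -r`) are the ones named on the slide; passing the job step ID as the final argument, and the placeholder values themselves, are assumptions to be checked against the LL manual.

```python
def user_hold_workflow(job_command_file, step_id):
    """Submit, checkpoint with -u, then resume the job with llhold -r."""
    return [
        ["llsubmit", job_command_file],
        ["llckpt", "-u", step_id],   # argument position is an assumption
        ["llhold", "-r", step_id],   # argument position is an assumption
    ]


def kill_and_resubmit_workflow(job_command_file, step_id):
    """Submit, checkpoint with -k, then resubmit the job."""
    return [
        ["llsubmit", job_command_file],
        ["llckpt", "-k", step_id],   # argument position is an assumption
        ["llsubmit", job_command_file],
    ]


if __name__ == "__main__":
    # "job.cmd" and "STEP_ID" are placeholders, not real identifiers.
    for argv in user_hold_workflow("job.cmd", "STEP_ID"):
        print(" ".join(argv))
```

In practice the step ID is only known after the first llsubmit returns, so a real script would capture it from the submit output rather than take it as a parameter.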
Results – Dev2
[Figure: checkpoint time (secs) versus processes per node, with one series each for 1, 2, 3, and 4 nodes.]
Each task uses approximately 200 MB memory
Results – Dev2
[Figure: time to checkpoint (secs) versus number of nodes (1 to 4), with one series each for 1 through 8 tasks per node.]
Each task uses approximately 200 MB memory
Results – Dev2
Serial job
[Figure: checkpoint time (secs) versus size of job (MB, axis ticks from 1 to 100000 in powers of ten), with series for 64-bit and 32-bit.]
Results – Dev2
Serial job
[Figure: disk space used (MB) versus size of job (MB, axis ticks from 1 to 100000 in powers of ten), with series for 64-bit and 32-bit.]
Results – Dev2
[Figure: total checkpoint file sizes (GB) versus tasks per node, with one series each for 1, 2, 3, and 4 nodes.]
Each task uses approximately 200 MB memory
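For a sense of scale, the aggregate memory footprint of a run can be estimated from the configuration alone. The sketch below is plain arithmetic, not a measured result: it assumes every task uses the stated approximate amount of memory, and it says nothing about how large the checkpoint files actually were.

```python
def aggregate_memory_mb(nodes, tasks_per_node, mb_per_task):
    """Total memory in MB across all tasks, assuming a uniform per-task footprint."""
    return nodes * tasks_per_node * mb_per_task


if __name__ == "__main__":
    # Largest dev2 configuration on the slide: 4 nodes, 8 tasks per node, ~200 MB each.
    print(aggregate_memory_mb(4, 8, 200))  # 6400
```

So the largest dev2 configuration holds roughly 6400 MB of task memory, which gives a rough lower bound on the state a full checkpoint has to capture.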
Results – Dev2
[Figure: MPI data file sizes (MB) versus tasks per node, with one series each for 1, 2, 3, and 4 nodes.]
Each task uses approximately 200 MB memory
Results – Seaborg
16 tasks per node; each task uses approximately 260 MB memory
[Figure: checkpoint time (secs) versus number of nodes (axis from 0 to 120).]
Results – Seaborg
[Figure: checkpoint time (secs) versus tasks per node (axis from 0 to 16), with one series each for 8, 12, and 24 nodes.]
Each task uses approximately 260 MB memory
Using CPR
• What about restart? Times to restart are on the order of the time to checkpoint.
• Disk usage and user quotas need to be considered (checkpoint files are owned by the job owner).
• The #@ restart = yes keyword is implied if checkpoint = yes.
• Priority issues: checkpointed and held jobs retain their priority.
• Not all jobs can be checkpointed. The list of exceptions is documented in the LL manual.
Acknowledgements:
• NERSC SP Systems Staff (N. Cardo, D. Paul, T. Stone)
• IBM Staff (S. Burrow)
• NERSC USG Staff (D. Skinner)
• NERSC ASG Staff (A. Wong)