Reliable and Scalable Checkpointing Systems for Distributed Computing Environments

Final exam of
Tanzima Zerin Islam
School of Electrical & Computer Engineering
Purdue University
West Lafayette, IN
Date: April 8, 2013
Distributed Computing Environments
[Figure: clusters at Purdue, Notre Dame, and Indiana U. connected over the Internet]
High Performance Computing (HPC):
- Projected MTBF of 3-26 minutes at exascale
- Failures: hardware, software

Grid:
- Cycle-sharing system
- Highly volatile environment
- Failures: eviction of guest jobs
Fault-tolerance with Checkpoint-Restart
Checkpoints are execution states:
- System-level: memory state; compressible
- Application-level: selected variables; hard to compress
struct ToyGrp {
    float Temperature[1024];
    int Pressure[20][30];
};
Challenges in Checkpointing Systems
[Figure: the same Grid setting: clusters at Purdue, Notre Dame, and Indiana U. connected over the Internet]
HPC: scalability of checkpointing systems
Grid: use of dedicated checkpoint servers
Contributions of This Thesis
- FALCON: Reliable checkpointing system in Grid [Best Student Paper Nomination, SC'09]
- Compression on multi-core [2nd Place, ACM Student Research Competition'10]
- MCRENGINE: Scalable checkpointing system in HPC [Best Student Paper Nomination, SC'12]
- MCRCLUSTER: unpublished

[Timeline: 2007-2009, 2009-2010, 2010-2012, 2012-2013, with the prelim exam marked]
Agenda
[MCRENGINE] Scalable checkpointing system for HPC
[MCRCLUSTER] Benefit-aware clustering
Future directions
A Scalable Checkpointing System using Data-Aware Aggregation and Compression
Collaborators: Kathryn Mohror, Adam Moody, Bronis de Supinski
Big Picture of HPC

[Figure: compute nodes and gateway nodes of the Hera and Atlas clusters share a parallel file system; network contention arises from contention for shared file system resources and from other clusters]
Checkpointing in HPC
MPI applications take globally coordinated checkpoints asynchronously.
Application-level checkpoints use a high-level data format for portability:
- HDF5, ADIOS, netCDF, etc.
Checkpoint writing stack: Application → I/O Library → Data-Format API (HDF5, netCDF)

struct ToyGrp {
    float Temperature[1024];
    short Pressure[20][30];
};

HDF5 checkpoint {
  Group "/" {
    Group "ToyGrp" {
      DATASET "Temperature" {
        DATATYPE H5T_IEEE_F32LE
        DATASPACE SIMPLE {(1024) / (1024)}
      }
      DATASET "Pressure" {
        DATATYPE H5T_STD_U8LE
        DATASPACE SIMPLE {(20,30) / (20,30)}
      }
    }
  }
}
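Below is a minimal sketch of how an application might emit such a checkpoint through the HDF5 C API. The file name, the absence of error handling, and the H5T_STD_I16LE type chosen for the short array are illustrative assumptions, not the exact code behind these slides.

/* Sketch: writing the ToyGrp variables as an application-level HDF5
 * checkpoint. Each variable becomes a dataset annotated with type and
 * shape, which is exactly the metadata mcrEngine later inspects. */
#include <hdf5.h>

int write_checkpoint(const float *temperature, const short *pressure)
{
    hsize_t tdims[1] = {1024};
    hsize_t pdims[2] = {20, 30};

    hid_t file  = H5Fcreate("ckpt.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t group = H5Gcreate2(file, "/ToyGrp", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hid_t tspace = H5Screate_simple(1, tdims, NULL);
    hid_t tset   = H5Dcreate2(group, "Temperature", H5T_IEEE_F32LE, tspace,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(tset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, temperature);

    hid_t pspace = H5Screate_simple(2, pdims, NULL);
    hid_t pset   = H5Dcreate2(group, "Pressure", H5T_STD_I16LE, pspace,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(pset, H5T_NATIVE_SHORT, H5S_ALL, H5S_ALL, H5P_DEFAULT, pressure);

    H5Dclose(tset); H5Sclose(tspace);
    H5Dclose(pset); H5Sclose(pspace);
    H5Gclose(group); H5Fclose(file);
    return 0;
}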
Checkpoint transfer schemes to the Parallel File System (PFS):
- N→1 (Funneled): not scalable
- N→N (Direct): easiest, but causes contention on the PFS
- N→M (Grouped): best compromise, but complex (see the mapping sketch below)
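As a concrete illustration of the grouped scheme, here is a sketch of mapping ranks to aggregators. It assumes rank-ordered groups with the lowest rank in each group acting as the aggregator; that assignment rule is an assumption for illustration.

/* Sketch: assigning each MPI rank to an aggregator under the grouped
 * (N->M) scheme with rank-ordered groups of fixed size. group_size is
 * a tunable parameter (the experiments here use 32). */
int aggregator_of(int rank, int group_size)
{
    /* The lowest rank in each contiguous group acts as the aggregator. */
    return (rank / group_size) * group_size;
}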
Impact of Load on PFS at Large Scale
IOR, Direct (N→N): 78MB per process

Observations:
(−) Large average write time → less frequent checkpointing
(−) Large average read time → poor application performance
[Figure: average read time (s) and average write time (s) vs. number of processes (N), from 128 to 8192]
What is the Problem?
Today's checkpoint-restart systems will not scale:
- Increasing number of concurrent transfers
- Increasing volume of checkpoint data
Our Contributions
Data-aware aggregation:
- Reduces the number of concurrent transfers
- Improves compressibility of checkpoints by using semantic information

Data-aware compression:
- Improves compression ratio by 115% compared to concatenation with general-purpose compression

Design and development of mcrEngine:
- A grouped (N→M) checkpointing system
- Improves checkpointing frequency
- Improves application performance
Naïve Solution: Data-Agnostic Compression

- Agnostic scheme: concatenate checkpoints
- Agnostic-block scheme: interleave fixed-size blocks

Observations:
(+) Easy
(−) Low compression ratio
[Figure: in both schemes, the merged stream from checkpoints C1 and C2 passes through pGzip to the PFS in a single (first) compression phase]
Our Solution: [Step 1] Identify Similar Variables Across Processes

P0:
Group ToyGrp {
    float Temperature[1024];
    int Pressure[20][30];
};

P1:
Group ToyGrp {
    float Temperature[100];
    int Pressure[10][50];
};

Meta-data used for matching:
1. Name
2. Data-type
3. Class: Array, Atomic

[Step 2] Merging Scheme I: Aware Scheme
Concatenate similar variables: C1.T C2.T | C1.P C2.P

[Step 2] Merging Scheme II: Aware-Block Scheme
Interleave similar variables: first 'B' bytes of Temperature from each checkpoint, then the next 'B' bytes, and likewise for Pressure
[Step 3] Data-Aware Aggregation & Compression
- Aware scheme: concatenate similar variables
- Aware-block scheme: interleave similar variables
(both merging operations are sketched in code below the figure)
[Figure: the merged variable buffers (T, P, H, D) pass through data-type-aware compressors (FPC, Lempel-Ziv) in the first phase; the concatenated output buffer then passes through pGzip to the PFS in the second phase]
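The two merging operations on raw byte buffers can be sketched as follows; the buffer names and the fixed block size B are illustrative assumptions, not values taken from the thesis.

/* Sketch of the two merging schemes for one pair of similar variables. */
#include <string.h>

#define B 4096  /* interleaving block size, in bytes (illustrative) */

/* Aware: concatenate similar variables back to back. */
size_t merge_aware(char *out, const char *c1, size_t n1,
                   const char *c2, size_t n2)
{
    memcpy(out, c1, n1);
    memcpy(out + n1, c2, n2);
    return n1 + n2;
}

/* Aware-block: interleave B-byte blocks of the two variables so that
 * similar bytes land near each other in the compressor's window. */
size_t merge_aware_block(char *out, const char *c1, size_t n1,
                         const char *c2, size_t n2)
{
    size_t o = 0, i = 0, j = 0;
    while (i < n1 || j < n2) {
        if (i < n1) {
            size_t t = (n1 - i < B) ? n1 - i : B;
            memcpy(out + o, c1 + i, t); o += t; i += t;
        }
        if (j < n2) {
            size_t t = (n2 - j < B) ? n2 - j : B;
            memcpy(out + o, c2 + j, t); o += t; j += t;
        }
    }
    return o;
}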
How MCRENGINE Works
[Figure: compute node components (CNCs) in rank-ordered groups send their meta-data to an aggregator; each aggregator identifies "similar" variables, requests them from its group (e.g., T and P from one group, H and D from another), applies data-aware aggregation and compression (pGzip), and writes to the PFS using grouped (N→M) transfer]

CNC: compute node component
ANC: aggregator node component
Evaluation
Applications:
- ALE3D: 4.8GB per checkpoint set
- Cactus: 2.41GB per checkpoint set
- Cosmology: 1.1GB per checkpoint set
- Implosion: 13MB per checkpoint set

Experimental test-bed:
- LLNL's Sierra: 261.3 TFLOP/s Linux cluster
- 23,328 cores, 1.3 petabyte Lustre file system

Compression algorithms (a dispatch sketch follows the list):
- FPC [1] for double-precision floats
- fpzip [2] for single-precision floats
- Lempel-Ziv for all other data types
- pGzip for general-purpose compression
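The first-phase compressor selection by data type can be pictured as below. The compress_* functions are stand-in passthroughs for calls into FPC, fpzip, and an LZ-family library; they are not those libraries' real APIs.

#include <string.h>

typedef enum { TYPE_DOUBLE, TYPE_FLOAT, TYPE_OTHER } var_type_t;

/* Stand-ins: a real build would link the actual compressors. */
static size_t compress_fpc(const void *in, size_t n, void *out)
{ memcpy(out, in, n); return n; }
static size_t compress_fpzip(const void *in, size_t n, void *out)
{ memcpy(out, in, n); return n; }
static size_t compress_lz(const void *in, size_t n, void *out)
{ memcpy(out, in, n); return n; }

size_t first_phase_compress(var_type_t t, const void *in, size_t n, void *out)
{
    switch (t) {
    case TYPE_DOUBLE: return compress_fpc(in, n, out);   /* FPC [1]   */
    case TYPE_FLOAT:  return compress_fpzip(in, n, out); /* fpzip [2] */
    default:          return compress_lz(in, n, out);    /* LZ family */
    }
}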
Evaluation Metrics
Effectiveness of data-aware compression:
- What is the benefit of multiple compression phases?
- How does group size affect compression ratio?

Performance of mcrEngine:
- Overhead of the checkpointing phase
- Overhead of the restart phase

Compression ratio = Uncompressed size / Compressed size
No Benefit with Data-Agnostic Double Compression

[Figure: compression ratio (0-4) of the first and second compression phases for ALE3D, Cactus, Cosmology, and Implosion, under data-agnostic and data-aware schemes]

- Data-agnostic double compression is not beneficial, because the data format is non-uniform and incompressible
- Data-type-aware compression improves compressibility: the first phase changes the underlying data format
- Take-away: multiple phases of data-aware compression are beneficial
Impact of Group Size on Compression Ratio

[Figure: compression ratio vs. group size (1 to 32) for ALE3D and Cactus, comparing the Aware-Block and Aware schemes]

- Different merging schemes are better for different applications
- A larger group size is beneficial for certain applications
- ALE3D: improvement of 8% from group size 2 to group size 32
Data-Aware Technique Always Wins over Data-Agnostic
[Figure: compression ratio vs. group size (1 to 32) for ALE3D and Cactus; the Aware-Block and Aware schemes outperform Agnostic-Block and Agnostic by 98-115%]

The data-aware techniques always yield a better compression ratio than the data-agnostic techniques.
Summary of Effectiveness Study
- Data-aware compression always wins; it reduces gigabytes of data for Cactus
- Larger group sizes may improve compression ratio
- Different merging schemes suit different applications
- Compression ratio follows the course of the simulation
Impact of Data-Aware Compression on Latency
IOR with grouped (N→M) transfer, groups of 32 processes
Data-aware: 1.2GB; data-agnostic: 2.4GB

Data-aware compression improves I/O performance at large scale:
- Improvement during write: 43%-70%
- Improvement during read: 48%-70%
[Figure: average transfer time (sec) vs. number of processes (N) for agnostic and aware reads and writes]
Impact of Aggregation & Compression on Latency
Used IOR:
- Direct (N→N): 87MB per process
- Grouped (N→M): group size 32, 1.21GB per aggregator
[Figure: average write time and average read time (sec) vs. number of processes (N), 128 to 8192, comparing N→N and N→M transfers]
End-to-End Checkpointing Overhead

[Figure: total checkpointing overhead (sec), split into transfer overhead and CPU overhead, for ALE3D and Cactus under Direct (no compression, individual compression) and Grouped (no compression, agnostic, aware) schemes; checkpointing overhead is reduced by 87% and 51%]

- 15,408 processes; group size of 32 for the N→M schemes; each process takes a checkpoint
- Converts a network-bound operation into a CPU-bound one
End-to-End Restart Overhead

[Figure: total recovery overhead (sec), split into transfer overhead and CPU overhead, for ALE3D and Cactus under Direct (no compression, individual compression) and Grouped (no compression, agnostic, aware) schemes; I/O overhead is reduced by 43% and 71%, and recovery overhead by 62% and 64%]

- Reduced overall restart overhead
- Reduced network load and transfer time
Summary of Scalable Checkpointing System
- Developed a data-aware checkpoint compression technique; relative improvement in compression ratio of up to 115%
- Investigated different merging techniques; demonstrated effectiveness using real-world applications
- Designed and developed MCRENGINE, which:
  - Reduces recovery overhead by more than 62%
  - Reduces checkpointing overhead by up to 87%
  - Improves scalability of checkpoint-restart systems
Benefit-Aware Clustering of Checkpoints from Parallel Applications
Collaborators: Todd Gamblin, Kathryn Mohror, Adam Moody, Bronis de Supinski
Our Goal & Contributions
Goal: can suitably grouping checkpoints increase compressibility?

Contributions:
- Design a new metric for the "similarity" of checkpoints
- Use this metric for clustering checkpoints
- Evaluate the benefit of the clustering on checkpoint storage
Different Clustering Schemes
[Figure: 16 checkpoints grouped under three clustering schemes: random, rank-wise, and data-aware (our solution)]
Research Questions
How to cluster checkpoints?
Does clustering improve compression ratio?
Benefit-Aware Clustering
- Similarity metric: improvement in reduction
- Goal: minimize the total compressed size

[Figure: benefit matrix β of Cactus]
Novel Dissimilarity Metric
Two factors determine the dissimilarity between checkpoints i and j:

    Δ(i, j) = (1 / β(i, j)) × Σ_{k=1}^{N} [β(i, k) − β(j, k)]²

The summation compares how i and j each benefit from pairing with every other checkpoint k, and the 1/β(i, j) factor makes pairs with a large direct mutual benefit less dissimilar.
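A direct transcription of this metric into code, assuming a row-major N×N benefit matrix (the layout is an assumption of this sketch):

/* Sketch: dissimilarity of checkpoints i and j from an N x N benefit
 * matrix beta, following the formula above. */
double dissimilarity(const double *beta, int N, int i, int j)
{
    double sum = 0.0;
    for (int k = 0; k < N; k++) {
        double d = beta[i * N + k] - beta[j * N + k];
        sum += d * d;   /* checkpoints that benefit similarly with every
                           partner k are likely to compress well together */
    }
    /* Scale by the direct pairwise benefit: a large beta(i,j) makes the
     * pair less dissimilar. */
    return sum / beta[i * N + j];
}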
How Benefit-Aware Clustering Works

Pipeline: Filter → Order → Similarity → Cluster
[Figure: the pipeline applied to checkpoints from processes P1-P5. Filter drops small variables (e.g., from {T[3000], V[10], P[5000], D[4000], R[100]}, only T, P, and D survive); Order sorts the surviving variables consistently across checkpoints; Similarity reduces the data by chunking or wavelet sampling and computes benefit values β; Cluster groups the checkpoints (here, Cluster 1: P1, P3, P4 and Cluster 2: P2, P5)]
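The chunking sampler can be pictured as below. The slides do not give mcrCluster's exact chunk size, stride, or selection policy, so this is one plausible reading, not the thesis implementation.

/* Sketch: keep one fixed-size chunk out of every `stride` chunks so the
 * similarity computation sees only a fraction of each variable. */
#include <string.h>

size_t sample_chunks(const char *in, size_t n, char *out,
                     size_t chunk, size_t stride)
{
    size_t o = 0;
    for (size_t i = 0; i + chunk <= n; i += chunk * stride) {
        memcpy(out + o, in + i, chunk);
        o += chunk;
    }
    return o;  /* number of sampled bytes */
}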
Structure of MCRCLUSTER
[Figure: each compute node runs the Filter-Order-Similarity-Cluster (F-O-S-C) pipeline on its checkpoints; processes P1-P5 feed aggregators A1 and A2, which write to the PFS]
Evaluation
Applications:
- IOR (synthetic checkpoints)
- Cactus

Experimental test-bed:
- LLNL's Sierra: 261.3 TFLOP/s Linux cluster
- 23,328 cores, 1.3 petabyte Lustre file system

Evaluation metrics:
- Macro benchmark: effectiveness of clustering
- Micro benchmark: effectiveness of sampling
Effectiveness of MCRCLUSTER
- IOR: 32 checkpoints; odd processes write 0, even processes write <rank> | 1234567
- 29% more compression compared to rank-wise grouping, 22% more compared to random grouping
Effectiveness of Sampling
- X axis: each variable
- Y axis: range of benefit values
- Take-away: the chunking method preserves benefit relationships the closest

[Figure: benefit-value ranges per variable under chunking vs. wavelet-transform sampling]
Contributions of MCRCLUSTER
- Designed similarity and distance metrics
- Demonstrated significant results on synthetic data: 22% and 29% improvement compared to random and rank-wise clustering, respectively
- Future directions for a first-year Ph.D. student: study the impact on real applications; design a scalable clustering technique
Applicability of My Research
Condor systems
Compression for scientific data
Conclusions
This thesis addresses the reliability of checkpointing-based recovery in large-scale computing.

It proposes three novel systems:
- FALCON: distributed checkpointing system for Grids
- MCRENGINE: "data-aware compression" and a scalable checkpointing system for HPC
- MCRCLUSTER: "benefit-aware clustering"

It provides a good foundation for further research in this field.
Questions?
Future Directions: Reliability
Reliability: similarity-based process grouping for better compression
- Group processes based on similarity instead of rank [ongoing]
- Analytical solution to group-size selection
- Variable streaming
- Integrating mcrEngine with SCR
Future Directions: Performance
Cache usage analysis and optimization: developed a user-level tool for analyzing cache utilization [Summer'12]

Short-term goals:
- Apply the tool to real applications
- Automate the analysis

Long-term goals:
- Suggest potential code optimizations
- Automate application tuning
Contact Information
Tanzima Islam ([email protected])
Website: web.ics.purdue.edu/~tislam
Effectiveness of mcrCluster
[Backup Slide] Failures in HPC
“A Large-scale Study of Failures in High-performance Computing Systems”, by Bianca Schroeder, Garth Gibson
[Figure: breakdown of root causes of failures; breakdown of downtime into root causes]
[Backup Slide] Failures in HPC
"Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm", by Laxmikant Kalé et al.
Disparity between network bandwidth and memory size
[Backup Slides] Falcon
[Backup Slide] Breakdown of Overheads
- Performance scales with checkpoint sizes
- Lower network transfer overhead
[Figure: checkpointing overhead and recovery overhead (sec, 0-180) for Falcon (F) and dedicated-server (D) configurations at checkpoint sizes 500MB, 946MB, and 1677MB]
[Backup Slide] Parallel Falcon
67% improvement in CPU time
[Figure: checkpoint storing overhead and recovery overhead (sec, 0-180) for Falcon (F), parallel Falcon (PF), and dedicated-server (D) configurations at checkpoint sizes 500MB, 946MB, and 1677MB]
[Backup Slides] mcrEngine
[Backup Slide] How to Find Similarity
P0:
Group ToyGrp {
    float Temperature[1024];
    short Pressure[20][30];
    int Humidity;
};

P1:
Group ToyGrp {
    float Temperature[50];
    short Pressure[2][6];
    double Unit;
    int Humidity;
};
Inside a checkpoint, variables are annotated with metadata. Inside the source code, variables are members of a group; a group can be thought of as the "struct" construct in C. A hash key is generated from the metadata for matching.

P0 metadata → generated hash keys:
- Var "ToyGrp/Temperature", Type F32LE, Array1D[1024] → ToyGrp/Temperature_F32LE_Array1D
- Var "ToyGrp/Pressure", Type S8LE, Array2D[20][30] → ToyGrp/Pressure_S8LE_Array2D
- Var "ToyGrp/Humidity", Type I32LE, Atomic → ToyGrp/Humidity_I32LE_Atomic

P1 metadata → generated hash keys:
- Var "ToyGrp/Temperature", Type F32LE, Array1D[50] → ToyGrp/Temperature_F32LE_Array1D
- Var "ToyGrp/Pressure", Type S8LE, Array2D[2][6] → ToyGrp/Pressure_S8LE_Array2D
- Var "ToyGrp/Unit", Type F64LE, Atomic → ToyGrp/Unit_F64LE_Atomic (no match in P0)
- Var "ToyGrp/Humidity", Type I32LE, Atomic → ToyGrp/Humidity_I32LE_Atomic
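Building such a key is a one-liner; here is a sketch, with the function name and buffer handling as illustrative assumptions:

/* Sketch: building the match key "<group>/<name>_<type>_<class>" that
 * pairs similar variables across checkpoints, as in the examples above. */
#include <stdio.h>

void make_key(char *key, size_t len, const char *group,
              const char *name, const char *type, const char *class_)
{
    /* e.g. make_key(k, sizeof k, "ToyGrp", "Temperature", "F32LE",
     * "Array1D") yields "ToyGrp/Temperature_F32LE_Array1D". */
    snprintf(key, len, "%s/%s_%s_%s", group, name, type, class_);
}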
[Backup Slide] Compression Ratio Follows Course of Simulation
Data-aware technique always yields better compression
[Figure: compression ratio over simulation time-steps for Cactus, Cosmology, and Implosion, comparing the Aware-Block, Aware, Agnostic-Block, and Agnostic schemes]
[Backup Slide] Relative Improvement in Compression Ratio Compared to Data-Agnostic Scheme
Application | Total size (GB) | Data types (%): DP float / SP float / Int | Aware-Block (%) | Aware (%)
ALE3D       | 4.8             | 88.8 / ~0 / 11.2                          | 6.6 - 27.7      | 6.6 - 12.7
Cactus      | 2.41            | 33.94 / 0 / 66.06                         | 10.7 - 11.9     | 98 - 115
Cosmology   | 1.1             | 24.3 / 67.2 / 8.5                         | 20.1 - 25.6     | 20.6 - 21.1
Implosion   | 0.013           | 0 / 74.1 / 25.9                           | 36.3 - 38.4     | 36.3 - 38.8
References
[1] M. Burtscher and P. Ratanaworabhan, "FPC: A High-Speed Compressor for Double-Precision Floating-Point Data".
[2] P. Lindstrom and M. Isenburg, "Fast and Efficient Compression of Floating-Point Data".
[3] L. Reinhold, "QuickLZ".
Reliable and Efficient System for Storing Checkpoints in Grid
Execution Environment: Grid
State-of-the-Art: Checkpointing in Grid with Dedicated Storage
[Figure: a submitter at Purdue dispatches guest jobs to Notre Dame and Indiana U. over the Internet; all checkpoints are stored on a dedicated storage server]

Problems:
(−) High transfer latency
(−) Contention on servers
(−) Stress on shared network resources
Research Question
Can we improve the performance of applications by storing checkpoints on the grid resources themselves?
Overview of Our Solution: Checkpointing in Grid with Distributed Storage
[Figure: the submitter's checkpoints are stored in a distributed fashion across grid resources at Purdue, Notre Dame, and Indiana U.]

Q1. Which storage nodes?
Q2. How to balance load?
Q3. How to efficiently store & retrieve?
Constraint: all components must be user-level
Answer to Q1: Storage Host Selection
- Build a failure model for storage resources: compute correlated temporal reliability based on historical data
- Rank machines based on reliability, load, and network overhead
- Objective function: checkpoint storing overhead − benefit from restart (this ranking also addresses Q2; see the sketch below)
- Output: (m+k) storage hosts

[Figure: correlated availability timelines of a compute host and two storage hosts, with overlapping "down" intervals]
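A sketch of ranking hosts by that objective follows. The struct fields are hypothetical placeholders; Falcon's actual model combines reliability, load, and network overhead derived from history, which is not reproduced here.

#include <stdlib.h>

typedef struct {
    double store_overhead;  /* predicted cost of writing a fragment */
    double restart_benefit; /* expected savings if we restart from it */
} host_t;

static int by_score(const void *a, const void *b)
{
    const host_t *x = a, *y = b;
    double sx = x->store_overhead - x->restart_benefit;
    double sy = y->store_overhead - y->restart_benefit;
    return (sx > sy) - (sx < sy);  /* ascending: smaller score is better */
}

void rank_hosts(host_t *hosts, size_t n)
{
    qsort(hosts, n, sizeof *hosts, by_score);  /* pick the first m+k */
}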
Checkpoint-Recovery Scheme
Checkpoint storing phase: original checkpoint → compression → erasure encoding into (m+k) fragments → fragments 1 .. m+k written to the storage hosts' disks

Recovery phase: any m fragments read from the storage hosts → erasure decoding → decompression → original checkpoint
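A full (m+k) erasure coder is beyond a slide-sized sketch, so below is the simplest k = 1 case, plain XOR parity, as a stand-in that conveys the idea: any one lost fragment can be rebuilt from the other m. Falcon's real coder tolerates k failures.

#include <stddef.h>

/* Compute one parity fragment as the byte-wise XOR of m data fragments. */
void encode_parity(const char *frag[], size_t m, size_t flen, char *parity)
{
    for (size_t b = 0; b < flen; b++) {
        char p = 0;
        for (size_t f = 0; f < m; f++)
            p ^= frag[f][b];
        parity[b] = p;
    }
}

/* Rebuild a lost fragment by XOR-ing the parity with the survivors. */
void decode_one(const char *frag[], size_t m, size_t flen,
                const char *parity, size_t lost, char *out)
{
    for (size_t b = 0; b < flen; b++) {
        char p = parity[b];
        for (size_t f = 0; f < m; f++)
            if (f != lost)
                p ^= frag[f][b];
        out[b] = p;
    }
}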
Evaluation Setup
2 different applications with 4 input sets:
- MCF (SPEC CPU 2006)
- TIGR (BioBench)

System-level checkpoints.

Macro benchmark experiment:
- Average job makespan

Micro benchmark experiments:
- Efficiency of checkpoint and restart
- Efficiency in handling simultaneous clients
- Efficiency in handling multiple failures
Checkpoint Storing & Recovery Overhead

[Figure: checkpointing overhead and recovery overhead (sec, 0-180), split into transfer overhead and CPU overhead, for dedicated-server (D) and Falcon (F) configurations at checkpoint sizes 500MB, 946MB, and 1677MB]

- Performance scales with checkpoint sizes
- Lower network transfer overhead
Overall Performance Comparison
Performance improvement between 11% and 44%
[Figure: average makespan time (min) for the mcf and tigr benchmark applications under a remote dedicated server, a local dedicated server, and Falcon with distributed storage]
Summary of Reliable Checkpointing System
Developed FALCON, a reliable checkpoint-recovery system that:
- Selects reliable storage hosts, preferring lightly loaded ones
- Compresses and encodes checkpoints
- Stores and retrieves them efficiently

Ran experiments with FALCON on DiaGrid: performance improvement between 11% and 44%
2-D vs N-D Compression
Challenge in Extreme-Scale: Increase in Failure-Rate
[Figure: projected performance growth from 1 Gflop/s to 1 Eflop/s for the top-ranked (N=1) and 500th-ranked (N=500) systems]
Towards Online Clustering
Reduce the dimension of β:
- Reduce the number of variables: keep representative data types whose element counts exceed a threshold [example: 100 double-type variables cover 80% of the data]; see the sketch below
- Reduce the amount of data: sampling (random, chunking, and wavelet)
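A sketch of that selection, under the assumption that "representative" means the largest variables that together cover a size threshold (the slide's 80% example); the names are illustrative.

#include <stdlib.h>

typedef struct { const char *name; size_t bytes; } var_t;

static int by_size_desc(const void *a, const void *b)
{
    size_t x = ((const var_t *)a)->bytes, y = ((const var_t *)b)->bytes;
    return (x < y) - (x > y);  /* descending by size */
}

/* Keep the largest variables until `threshold` of total bytes is covered;
 * returns how many variables survive for the clustering step. */
size_t pick_representatives(var_t *vars, size_t n, double threshold)
{
    size_t total = 0, covered = 0, kept = 0;
    for (size_t i = 0; i < n; i++) total += vars[i].bytes;
    qsort(vars, n, sizeof *vars, by_size_desc);
    while (kept < n && covered < (size_t)(threshold * total))
        covered += vars[kept++].bytes;
    return kept;  /* vars[0..kept) are clustered; the rest are skipped */
}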