Reliable and Scalable Checkpointing Systems for Distributed Computing Environments

Tanzima Zerin Islam

School of Electrical & Computer Engineering
Purdue University
West Lafayette, IN
Date: April 8, 2013

Final exam

Distributed Computing Environments

[Figure: grid sites @Purdue, @Notre Dame, and @Indiana U. connected through the Internet]

High Performance Computing (HPC):
  Projected MTBF of 3-26 minutes at exascale
  Failure: hardware, software

Grid:
  Cycle sharing system
  Highly volatile environment
  Failure: eviction of guest jobs


Fault-tolerance with Checkpoint-Restart

Checkpoints are execution states

System-level
  Memory state
  Compressible

Application-level
  Selected variables
  Hard to compress

  struct ToyGrp {
      float Temperature[1024];
      int Pressure[20][30];
  };


Challenges in Checkpointing Systems


[Figure: grid sites @Purdue, @Notre Dame, and @Indiana U. connected through the Internet]

HPC: Scalability of checkpointing systems

Grid: Use of dedicated checkpoint servers


Contributions of This Thesis


FALCON: Reliable Checkpointing System in Grid [Best Student Paper Nomination, SC’09]

Compression on Multi-core [2nd Place, ACM Student Research Competition’10]

MCRENGINE: Scalable Checkpointing System in HPC [Best Student Paper Nomination, SC’12]

MCRCLUSTER [Unpublished]

Timeline: 2007-2009, 2009-2010, 2010-2012, 2012-2013 (prelim marked on the timeline)


Agenda

[MCRENGINE] Scalable checkpointing system for HPC

[MCRCLUSTER] Benefit-aware clustering

Future directions


A Scalable Checkpointing System using Data-Aware Aggregation and Compression

Collaborators: Kathryn Mohror, Adam Moody, Bronis de Supinski


Big Picture of HPC

[Figure: compute nodes of several clusters (Hera, Atlas) reach the parallel file system through gateway nodes; callouts: network contention, contention for shared file system resources, contention from other clusters]


Checkpointing in HPC

MPI applications
  Take globally coordinated checkpoints asynchronously

Application-level checkpoint
  High-level data format for portability
  HDF5, Adios, netCDF, etc.

Checkpoint writing

Layers: Application, Data-Format API (NetCDF, HDF5), I/O Library

  struct ToyGrp {
      float Temperature[1024];
      short Pressure[20][30];
  };

  HDF5 checkpoint {
    Group “/” {
      Group “ToyGrp” {
        DATASET “Temperature” {
          DATATYPE H5T_IEEE_F32LE
          DATASPACE SIMPLE {(1024) / (1024)}
        }
        DATASET “Pressure” {
          DATATYPE H5T_STD_U8LE
          DATASPACE SIMPLE {(20,30) / (20,30)}
        }
      }
    }
  }

Transfer modes to the Parallel File System (PFS):
  N->1 (Funneled): Not scalable
  N->N (Direct): Easiest, but contention on PFS
  N->M (Grouped): Best compromise, but complex



Impact of Load on PFS at Large Scale

IOR
  Direct (N->N): 78MB per process

[Figure: average write time (s) and average read time (s) versus number of processes (N), for N from 128 to 15408]

Observations:
(−) Large average write time -> less frequent checkpointing
(−) Large average read time -> poor application performance



What is the Problem?

Today’s checkpoint-restart systems will not scale
  Increasing number of concurrent transfers
  Increasing volume of checkpoint data



Our Contributions

Data-aware aggregation
  Reduces the number of concurrent transfers
  Improves compressibility of checkpoints by using semantic information

Data-aware compression
  Improves compression ratio by up to 115% compared to concatenation and general-purpose compression

Design and develop mcrEngine
  Grouped (N->M) checkpointing system
  Improves checkpointing frequency
  Improves application performance



Naïve Solution: Data-Agnostic Compression

Agnostic scheme: concatenate checkpoints
  C1 | C2 -> pGzip -> PFS

Agnostic-block scheme: interleave fixed-size blocks
  C1[1-B] | C2[1-B] | C1[B+1-2B] | C2[B+1-2B] -> pGzip -> PFS

Observations:
(+) Easy
(−) Low compression ratio
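The two data-agnostic schemes can be sketched as follows. This is a minimal illustration rather than the thesis implementation: `zlib` stands in for pGzip, and the function names and the tiny block size are hypothetical.

```python
import zlib


def merge_agnostic(checkpoints):
    """Agnostic scheme: concatenate whole checkpoints back to back."""
    return b"".join(checkpoints)


def merge_agnostic_block(checkpoints, block_size):
    """Agnostic-block scheme: interleave fixed-size blocks across checkpoints."""
    merged = bytearray()
    longest = max((len(c) for c in checkpoints), default=0)
    for offset in range(0, longest, block_size):
        for ckpt in checkpoints:
            # The slice is empty once a shorter checkpoint is exhausted.
            merged += ckpt[offset:offset + block_size]
    return bytes(merged)


def compressed_size(buffer):
    """Stand-in for the pGzip step: size after general-purpose compression."""
    return len(zlib.compress(buffer))


# Two toy checkpoints C1 and C2 (assumed data, for illustration only).
c1 = b"AAAABBBB"
c2 = b"aaaabbbb"
concatenated = merge_agnostic([c1, c2])
interleaved = merge_agnostic_block([c1, c2], block_size=4)
```

Neither function looks at what the bytes mean, which is why both are easy to implement but leave compressibility to chance.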



Our Solution: [Step 1] Identify Similar Variables Across Processes

P0:
  Group ToyGrp {
    float Temperature[1024];
    int Pressure[20][30];
  };

P1:
  Group ToyGrp {
    float Temperature[100];
    int Pressure[10][50];
  };

Checkpoint layout: C1.T C1.P C2.T C2.P

Meta-data:
  1. Name
  2. Data-type
  3. Class: Array, Atomic

[Step 2] Merging Scheme I: Aware Scheme

Concatenating similar variables: C1.T C2.T C1.P C2.P

[Step 2] Merging Scheme II: Aware-Block Scheme

Interleaving similar variables:
  First ‘B’ bytes of Temperature
  Next ‘B’ bytes of Temperature
  Interleave Pressure
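A minimal sketch of the two data-aware merging schemes, assuming each checkpoint is available as a mapping from a (name, data-type, class) meta-data key to the variable's raw bytes. The dictionary layout, the function names, and the toy payloads are assumptions made for illustration.

```python
def group_similar(checkpoints):
    """Collect payloads of variables that share the same meta-data key."""
    groups = {}
    for ckpt in checkpoints:
        for key, payload in ckpt.items():
            groups.setdefault(key, []).append(payload)
    return groups


def merge_aware(checkpoints):
    """Aware scheme: concatenate similar variables from all checkpoints."""
    groups = group_similar(checkpoints)
    return b"".join(b"".join(parts) for parts in groups.values())


def merge_aware_block(checkpoints, block_size):
    """Aware-block scheme: interleave fixed-size blocks of similar variables."""
    merged = bytearray()
    for parts in group_similar(checkpoints).values():
        longest = max(len(p) for p in parts)
        for offset in range(0, longest, block_size):
            for part in parts:
                merged += part[offset:offset + block_size]
    return bytes(merged)


# Keys follow the slide's meta-data: name, data-type, class.
T = ("ToyGrp/Temperature", "float", "Array")
P = ("ToyGrp/Pressure", "int", "Array")
c1 = {T: b"T1T1", P: b"P1P1"}
c2 = {T: b"T2T2", P: b"P2P2"}

aware = merge_aware([c1, c2])
aware_block = merge_aware_block([c1, c2], block_size=2)
```

Array dimensions are deliberately left out of the key, so Temperature[1024] on P0 and Temperature[100] on P1 still count as similar.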



[Step 3] Data-Aware Aggregation & Compression

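Aware scheme: concatenate similar variables
Aware-block scheme: interleave similar variables

First phase: data-type aware compression (FPC, Lempel-Ziv) of the merged variables (T, P, H, D) into an output buffer
Second phase: pGzip over the output buffer, then write to PFS

[Figure: merged variables C1.T C2.T, C1.P C2.P, C1.H C2.H, C1.D C2.D passing through the first and second phases]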
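The two compression phases can be sketched as below. This is an illustration under stated assumptions: `zlib` is used as a placeholder for every compressor (FPC, fpzip, Lempel-Ziv, and pGzip are not called here), and the length-prefixed output buffer is a hypothetical layout chosen so the sketch can be round-tripped.

```python
import zlib


def compress_float_standin(payload):
    """Placeholder for a floating-point compressor such as FPC or fpzip."""
    return zlib.compress(payload, 1)


def compress_lz_standin(payload):
    """Placeholder for the Lempel-Ziv compressor used for other data types."""
    return zlib.compress(payload, 6)


FIRST_PHASE = {"float": compress_float_standin, "double": compress_float_standin}


def two_phase_compress(merged_variables):
    """merged_variables: list of (dtype, merged_payload), one per variable."""
    output_buffer = bytearray()
    for dtype, payload in merged_variables:
        compressor = FIRST_PHASE.get(dtype, compress_lz_standin)
        chunk = compressor(payload)
        # Length prefix so the chunks can be separated again on restart.
        output_buffer += len(chunk).to_bytes(8, "little")
        output_buffer += chunk
    # Second phase: general-purpose compression over the whole buffer.
    return zlib.compress(bytes(output_buffer))


def two_phase_decompress(blob):
    """Inverse of two_phase_compress; valid because all stand-ins are zlib."""
    buffer = zlib.decompress(blob)
    payloads, pos = [], 0
    while pos < len(buffer):
        size = int.from_bytes(buffer[pos:pos + 8], "little")
        pos += 8
        payloads.append(zlib.decompress(buffer[pos:pos + size]))
        pos += size
    return payloads


variables = [("float", b"\x00" * 64), ("int", b"abc" * 10)]
blob = two_phase_compress(variables)
restored = two_phase_decompress(blob)
```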



How MCRENGINE Works

CNC: Compute node component
ANC: Aggregator node component
Rank-order groups, Grouped (N->M) transfer

[Figure: three groups, each with several CNCs and one aggregator, writing to the PFS]

Steps shown in the figure:
  1. The CNCs in a group send meta-data to their aggregator
  2. The aggregator identifies “similar” variables
  3. The aggregator requests variables from the CNCs (Request T, P; then Request H, D)
  4. The aggregator applies data-aware aggregation and compression, followed by pGzip
  5. The compressed output (T, P, H, D) is written to the PFS
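Rank-order grouping for the Grouped (N->M) transfer can be sketched as follows, assuming consecutive ranks are assigned to the same aggregator. The function names are hypothetical; the process count and group size are the ones used in the evaluation.

```python
def aggregator_of(rank, group_size):
    """Rank-order grouping: consecutive ranks share one aggregator."""
    return rank // group_size


def build_groups(num_processes, group_size):
    """Map each aggregator index to the list of ranks it serves."""
    groups = {}
    for rank in range(num_processes):
        groups.setdefault(aggregator_of(rank, group_size), []).append(rank)
    return groups


small = build_groups(8, 4)
large = build_groups(15408, 32)
```

With 15,408 processes and groups of 32, the number of concurrent writers to the PFS drops from N to M = `len(large)` aggregators.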



Evaluation

Applications
  ALE3D: 4.8GB per checkpoint set
  Cactus: 2.41GB per checkpoint set
  Cosmology: 1.1GB per checkpoint set
  Implosion: 13MB per checkpoint set

Experimental test-bed
  LLNL’s Sierra: 261.3 TFLOP/s Linux cluster
  23,328 cores, 1.3 Petabyte Lustre file system

Compression algorithms
  FPC [1] for double-precision float
  Fpzip [2] for single-precision float
  Lempel-Ziv for all other data-types
  pGzip for general-purpose compression



Evaluation Metrics

Effectiveness of data-aware compression
  What is the benefit of multiple compression phases?
  How does group size affect compression ratio?

Performance of mcrEngine
  Overhead of the checkpointing phase
  Overhead of the restart phase

Compression ratio = Uncompressed size / Compressed size
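A small worked example of the metric. The relative-improvement formula is an assumption about how percentages such as "115%" are computed (gain of one ratio over a baseline ratio), and the sizes and ratios below are made-up inputs, not measurements from the evaluation.

```python
def compression_ratio(uncompressed_size, compressed_size):
    """Compression ratio = uncompressed size / compressed size."""
    return uncompressed_size / compressed_size


def relative_improvement(ratio_new, ratio_base):
    """Assumed definition: percentage gain of ratio_new over ratio_base."""
    return 100.0 * (ratio_new - ratio_base) / ratio_base


# Hypothetical sizes in MB, chosen only to exercise the formulas.
ratio = compression_ratio(2400, 1200)
gain = relative_improvement(4.3, 2.0)
```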



Multiple Phases of Data-Aware Compression are Beneficial

[Figure: compression ratio (0 to 4) after the first and second compression phases for ALE3D, Cactus, Cosmology, and Implosion; series: Data-Agnostic, Data-Aware]

No benefit with data-agnostic double compression
  Data-agnostic double compression is not beneficial because the data format is non-uniform and uncompressible

Data-type aware compression improves compressibility
  The first phase changes the underlying data format



Impact of Group Size on Compression Ratio

[Figure: compression ratio versus group size (1, 2, 4, 8, 16, 32) for ALE3D and Cactus; series: Aware-Block, Aware]

Different merging schemes are better for different applications
Larger group size is beneficial for certain applications
  ALE3D: Improvement of 8% from group size 2 to 32



Data-Aware Technique Always Wins over Data-Agnostic

[Figure: compression ratio versus group size (1, 2, 4, 8, 16, 32) for ALE3D and Cactus; series: Aware-Block, Aware, Agnostic-Block, Agnostic; annotated improvement: 98-115%]

The data-aware technique always yields a better compression ratio than the data-agnostic technique


Summary of Effectiveness Study

Data-aware compression always wins
  Reduces gigabytes of data for Cactus

Larger group sizes may improve compression ratio

Different merging schemes for different applications

Compression ratio follows course of simulation



Impact of Data-Aware Compression on Latency

IOR with Grouped (N->M) transfer, groups of 32 processes
  Data-aware: 1.2GB, data-agnostic: 2.4GB

Data-aware compression improves I/O performance at large scale
  Improvement during write: 43% - 70%
  Improvement during read: 48% - 70%

[Figure: average transfer time (sec) versus number of processes (N), for N from 128 to 28672; series: Agnostic-Read, Agnostic-Write, Aware-Read, Aware-Write]



Impact of Aggregation & Compression on Latency

Used IOR
  Direct (N->N): 87MB per process
  Grouped (N->M): Group size 32, 1.21GB per aggregator

[Figure: average write time (sec) and average read time (sec) versus number of processes (N), for N from 128 to 15408; series: N->N Write, N->M Write, N->N Read, N->M Read]



End-to-End Checkpointing Overhead

15,408 processes
Group size of 32 for N->M schemes
Each process takes a checkpoint

[Figure: total checkpointing overhead (sec), split into transfer overhead and CPU overhead, for ALE3D and Cactus; configurations: Direct (No Comp., Indiv. Comp) and Grouped (No Comp., Agnostic, Aware); annotated reduction in checkpointing overhead: 87% and 51%]

Converts a network-bound operation into a CPU-bound one



End-to-End Restart Overhead

Reduced overall restart overhead
Reduced network load and transfer time

[Figure: total recovery overhead (sec), split into transfer overhead and CPU overhead, for ALE3D and Cactus; configurations: Direct (No Comp., Indiv. Comp) and Grouped (No Comp., Agnostic, Aware); annotated reduction in I/O overhead: 43% and 71%; annotated reduction in recovery overhead: 62% and 64%]


Summary of Scalable Checkpointing System

Developed data-aware checkpoint compression technique
  Relative improvement in compression ratio up to 115%

Investigated different merging techniques
Demonstrated effectiveness using real-world applications

Designed and developed MCRENGINE
  Reduces recovery overhead by more than 62%
  Reduces checkpointing overhead by up to 87%
  Improves scalability of checkpoint-restart systems


Benefit-Aware Clustering of Checkpoints from Parallel Applications

Collaborators: Todd Gamblin, Kathryn Mohror, Adam Moody, Bronis de Supinski


Our Goal & Contributions

Goal: Can suitably grouping checkpoints increase compressibility?

Contributions:
  Design new metric for “similarity” of checkpoints
  Use this metric for clustering checkpoints
  Evaluate the benefit of the clustering on checkpoint storage



Different Clustering Schemes


[Figure: 16 checkpoints grouped under three clustering schemes: Random, Rank-wise, and Data-aware (our solution)]


Research Questions

How to cluster checkpoints?

Does clustering improve compression ratio?



Benefit-Aware Clustering

Similarity metric: Improvement in reduction
Goal: Minimize the total compressed size

[Figure: benefit matrix β of Cactus]
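One way to picture a benefit matrix is sketched below. The slides only describe the benefit as an "improvement in reduction", so the exact definition used here is an assumption: β(i, j) is taken as the number of bytes saved by compressing checkpoints i and j together instead of separately, with `zlib` as a stand-in compressor and seeded random bytes as toy checkpoints.

```python
import random
import zlib


def compressed_size(data):
    """Size in bytes after general-purpose compression (zlib stand-in)."""
    return len(zlib.compress(data))


def benefit(ckpt_i, ckpt_j):
    """Assumed β(i, j): bytes saved by compressing i and j together."""
    separate = compressed_size(ckpt_i) + compressed_size(ckpt_j)
    together = compressed_size(ckpt_i + ckpt_j)
    return separate - together


def benefit_matrix(checkpoints):
    """N x N matrix of pairwise benefits."""
    n = len(checkpoints)
    return [[benefit(checkpoints[i], checkpoints[j]) for j in range(n)]
            for i in range(n)]


# Toy checkpoints: two unrelated buffers of seeded pseudo-random bytes.
rng = random.Random(0)
a = bytes(rng.randrange(256) for _ in range(2000))
b = bytes(rng.randrange(256) for _ in range(2000))
beta = benefit_matrix([a, b])
```

Under this assumed definition, pairing a checkpoint with an identical copy yields a much larger benefit than pairing it with unrelated data, which is the signal a clustering step would exploit.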


Novel Dissimilarity Metric


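Two factors determine the dissimilarity between two checkpoints i and j: how differently they benefit from every other checkpoint k, and the inverse of their mutual benefit β(i, j).

$$\Delta(i, j) = \frac{1}{\beta(i, j)} \times \sum_{k=1}^{N} \left[\beta(i, k) - \beta(j, k)\right]^{2}$$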
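A minimal sketch of the dissimilarity computation, reading the slide's formula as Δ(i, j) = (1 / β(i, j)) × Σ over k of [β(i, k) − β(j, k)]². The benefit matrix below is made-up data, and returning infinity when β(i, j) is zero is an assumption, since the slides do not say how that case is handled.

```python
import math


def dissimilarity(beta, i, j):
    """Δ(i, j) computed from an N x N benefit matrix beta."""
    if beta[i][j] == 0:
        return math.inf  # assumed handling: no mutual benefit at all
    spread = sum((beta[i][k] - beta[j][k]) ** 2 for k in range(len(beta)))
    return spread / beta[i][j]


# Hypothetical benefit matrix for three checkpoints.
beta = [
    [4, 2, 1],
    [2, 4, 1],
    [1, 1, 4],
]
d01 = dissimilarity(beta, 0, 1)
d02 = dissimilarity(beta, 0, 2)
```

Checkpoints 0 and 1 have a higher mutual benefit and more similar benefit rows than checkpoints 0 and 2, so `d01` comes out smaller than `d02`.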


How Benefit-Aware Clustering Works

Stages: Filter -> Order -> Similarity -> Cluster

Filter
  Before: double T[3000]; double V[10]; double P[5000]; double D[4000]; double R[100];
  After: double D[4000]; double P[5000]; double T[3000];

Order
  double T[3000]; double P[5000]; double D[4000];

Similarity
  Sample the ordered variables (chunking, wavelet) and compute β, e.g. β(1, 4)

Cluster (processes P1 to P5)
  Cluster 1: P1, P3, P4
  Cluster 2: P2, P5


Structure of MCRCLUSTER


[Figure: on the compute nodes, each process (P1 to P5) runs the Filter, Order, Similarity, and Cluster (F O S C) stages; aggregators A1 and A2 write to the PFS]


Evaluation


Application
  IOR (synthetic checkpoints)
  Cactus

Experimental test-bed
  LLNL’s Sierra: 261.3 TFLOP/s Linux cluster
  23,328 cores, 1.3 Petabyte Lustre file system

Evaluation metric:
  Macro benchmark: Effectiveness of clustering
  Micro benchmark: Effectiveness of sampling


Effectiveness of MCRCLUSTER

IOR: 32 checkpoints
  Odd processes write 0
  Even processes write: <rank> | 1234567

29% more compression compared to rank-wise grouping, 22% more compared to random grouping



Effectiveness of Sampling

[Figure: range of benefit values (Y axis) for each variable (X axis); panels: Chunking, Wavelet Transform]

Take away: The chunking method preserves benefit relationships the closest


Contributions of MCRCLUSTER

Design similarity and distance metric

Demonstrate significant result on synthetic data
  22% and 29% improvement compared to random and rank-wise clustering, respectively

Future directions for a first-year Ph.D. student
  Study impact on real applications
  Design scalable clustering technique



Applicability of My Research

Condor systems

Compression for scientific data



Conclusions

This thesis addresses:
  Reliability of checkpointing-based recovery in large-scale computing

Proposed three novel systems:
  FALCON: Distributed checkpointing system for Grids
  MCRENGINE: “Data-Aware Compression” and scalable checkpointing system for HPC
  MCRCLUSTER: “Benefit-Aware Clustering”

Provides a good foundation for further research in this field



Questions?


Future Directions: Reliability

Reliability: Similarity-based process grouping for better compression

Group processes based on similarity instead of rank [Ongoing]
Analytical solution to group size selection
Variable streaming

Integrating mcrEngine with SCR



Future Directions: Performance

Cache usage analysis and optimization
  Developed user-level tool for analyzing cache utilization [Summer’12]

Short-term goals:
  Apply to real applications
  Automate analysis

Long-term goals:
  Suggest potential code optimizations
  Automate application tuning



Contact Information

Tanzima Islam ([email protected])
Website: web.ics.purdue.edu/~tislam


Effectiveness of mcrCluster



Backup Slides



[Backup Slide] Failures in HPC

“A Large-scale Study of Failures in High-performance Computing Systems”, by Bianca Schroeder, Garth Gibson

[Figure: breakdown of root causes of failures; breakdown of downtime into root causes]



[Backup Slide] Failures in HPC

“Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm”, by Laxmikant Kalé et al.

Disparity between network bandwidth and memory size



[Backup Slides] Falcon



[Backup Slide] Breakdown of Overheads


Performance scales with checkpoint sizes
Lower network transfer overhead

[Figure: checkpointing overhead (sec) and recovery overhead (sec), 0 to 180, for F and D at checkpoint sizes 500MB, 946MB, and 1677MB]


[Backup Slide] Parallel Falcon


67% improvement in CPU time

[Figure: recovery overhead (sec) and checkpoint storing overhead (sec), 0 to 180, for F, PF, and D at checkpoint sizes 500MB, 946MB, and 1677MB]


[Backup Slides] mcrEngine



[Backup Slide] How to Find Similarity

Inside source code: variables are represented as members of a group. A group can be thought of as the construct “struct” in C.

P0:
  Group ToyGrp {
    float Temperature[1024];
    short Pressure[20][30];
    int Humidity;
  };

P1:
  Group ToyGrp {
    float Temperature[50];
    short Pressure[2][6];
    double Unit;
    int Humidity;
  };

Inside a checkpoint: variables are annotated with metadata, from which a hash key is generated for matching.

P0:
  Var: “ToyGrp/Temperature”, Type: F32LE, Array1D [1024] -> ToyGrp/Temperature_F32LE_Array1D
  Var: “ToyGrp/Pressure”, Type: S8LE, Array2D [20][30] -> ToyGrp/Pressure_S8LE_Array2D
  Var: “ToyGrp/Humidity”, Type: I32LE, Atomic -> ToyGrp/Humidity_I32LE_Atomic

P1:
  Var: “ToyGrp/Temperature”, Type: F32LE, Array1D [50] -> ToyGrp/Temperature_F32LE_Array1D
  Var: “ToyGrp/Pressure”, Type: S8LE, Array2D [2][6] -> ToyGrp/Pressure_S8LE_Array2D
  Var: “ToyGrp/Unit”, Type: F64LE, Atomic -> ToyGrp/Unit_F64LE_Atomic (no match)
  Var: “ToyGrp/Humidity”, Type: I32LE, Atomic -> ToyGrp/Humidity_I32LE_Atomic
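The key generation and matching can be sketched as follows. The key format mirrors the keys on the slide (name, data-type, and class joined by underscores); the function names and the use of set operations for matching are assumptions made for illustration.

```python
def hash_key(name, dtype, cls):
    """Build a matching key such as ToyGrp/Temperature_F32LE_Array1D."""
    return f"{name}_{dtype}_{cls}"


def match_variables(proc_a, proc_b):
    """Return (keys present in both processes, keys present in only one)."""
    keys_a = {hash_key(*var) for var in proc_a}
    keys_b = {hash_key(*var) for var in proc_b}
    return sorted(keys_a & keys_b), sorted(keys_a ^ keys_b)


# Variables from the slide; array dimensions are not part of the key.
p0 = [
    ("ToyGrp/Temperature", "F32LE", "Array1D"),
    ("ToyGrp/Pressure", "S8LE", "Array2D"),
    ("ToyGrp/Humidity", "I32LE", "Atomic"),
]
p1 = [
    ("ToyGrp/Temperature", "F32LE", "Array1D"),
    ("ToyGrp/Pressure", "S8LE", "Array2D"),
    ("ToyGrp/Unit", "F64LE", "Atomic"),
    ("ToyGrp/Humidity", "I32LE", "Atomic"),
]
matched, unmatched = match_variables(p0, p1)
```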



[Backup Slide] Compression Ratio Follows Course of Simulation

Data-aware technique always yields better compression

[Figure: compression ratio over simulation time-steps for Cactus, Cosmology, and Implosion; series: Aware-Block, Aware, Agnostic-Block, Agnostic]



[Backup Slide] Relative Improvement in Compression Ratio Compared to Data-Agnostic Scheme

Application   Total Size (GB)   Data Types (%)            Aware-Block (%)   Aware (%)
                                DF       SF      Int
ALE3D         4.8               88.8     ~0      11.2     6.6 - 27.7        6.6 - 12.7
Cactus        2.41              33.94    0       66.06    10.7 - 11.9       98 - 115
Cosmology     1.1               24.3     67.2    8.5      20.1 - 25.6       20.6 - 21.1
Implosion     0.013             0        74.1    25.9     36.3 - 38.4       36.3 - 38.8



References

1. M. Burtscher and P. Ratanaworabhan, “FPC: A High-speed Compressor for Double-Precision Floating-Point Data”.

2. P. Lindstrom and M. Isenburg, “Fast and Efficient Compression of Floating-Point Data”.

3. L. Reinhold, “QuickLZ”.



Reliable and Efficient System for Storing Checkpoints in Grid


Execution Environment: Grid


State-of-the-Art: Checkpointing in Grid with Dedicated Storage

[Figure: a submitter and grid sites @Purdue, @Notre Dame, and @Indiana U. connected through the Internet to a dedicated storage server]

Problems:
(−) High transfer latency
(−) Contention on servers
(−) Stress on shared network resources


Research Question

Can we improve the performance of applications by storing checkpoints on the grid resources?



Overview of Our Solution: Checkpointing in Grid with Distributed Storage

[Figure: a submitter and grid sites @Purdue, @Notre Dame, and @Indiana U. connected through the Internet]

Q1. Which storage nodes?
Q2. How to balance load?
Q3. How to efficiently store and retrieve?
Constraint: All components must be user-level


Answer to Q1: Storage Host Selection

Build failure model for storage resources
  Compute correlated temporal reliability
  Based on historical data

Rank machines
  Based on: reliability, load, and network overhead

Output: (m+k) storage hosts

Objective function: checkpoint storing overhead - benefit from restart

Addresses Q2

[Figure: availability timelines of a compute host and storage hosts 1 and 2, with "down" intervals]


Checkpoint-Recovery Scheme

Checkpoint Storing Phase
  Original checkpoint -> Compression -> Compressed checkpoint -> Erasure encoding (m+k) -> Fragments 1, 2, ..., m+k -> Disks of the storage hosts

Recovery Phase
  Fragments from the storage hosts -> Erasure decoding (m) -> Compressed checkpoint -> Decompression -> Original checkpoint
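The storing and recovery phases can be illustrated with a deliberately simplified sketch. It assumes a single XOR parity fragment (k = 1) in place of a general (m+k) erasure code, uses `zlib` for the compression step, and keeps fragments in a Python list instead of on remote storage hosts; all function names are hypothetical.

```python
import zlib
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length fragments."""
    return bytes(x ^ y for x, y in zip(a, b))


def store(checkpoint, m):
    """Compress, split into m fragments, and append one XOR parity fragment."""
    compressed = zlib.compress(checkpoint)
    frag_len = -(-len(compressed) // m)  # ceiling division
    padded = compressed.ljust(frag_len * m, b"\0")
    fragments = [padded[i * frag_len:(i + 1) * frag_len] for i in range(m)]
    parity = reduce(xor_bytes, fragments)
    return fragments + [parity], len(compressed)


def recover(fragments, compressed_len):
    """Rebuild the checkpoint when at most one fragment slot is None."""
    missing = [i for i, frag in enumerate(fragments) if frag is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one lost fragment")
    fragments = list(fragments)
    if missing:
        present = [frag for frag in fragments if frag is not None]
        # XOR of the surviving fragments reproduces the lost one.
        fragments[missing[0]] = reduce(xor_bytes, present)
    data = b"".join(fragments[:-1])[:compressed_len]
    return zlib.decompress(data)


original = b"checkpoint data " * 100
frags, n = store(original, m=4)
damaged = list(frags)
damaged[2] = None  # simulate one storage host being down
restored = recover(damaged, n)
```

A code that tolerates k > 1 simultaneous host failures needs a stronger scheme, such as Reed-Solomon, which this sketch does not implement.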



Evaluation Setup

2 different applications with 4 input sets
  MCF (SPEC CPU 2006)
  TIGR (BioBench)

System-level checkpoints

Macro benchmark experiment
  Average job makespan

Micro benchmark experiments
  Efficiency of checkpoint and restart
  Efficiency in handling simultaneous clients
  Efficiency in handling multiple failures



Checkpoint Storing & Recovery Overhead

Performance scales with checkpoint sizes
Lower network transfer overhead

[Figure: checkpointing overhead (sec) and recovery overhead (sec), 0 to 180, split into transfer overhead and CPU overhead, for D and F at checkpoint sizes 500MB, 946MB, and 1677MB]


Overall Performance Comparison

Performance improvement between 11% and 44%


[Figure: average makespan time (min), 0 to 120, for benchmark applications mcf and tigr; series: Remote Dedicated Server, Local Dedicated Server, Falcon with Distributed Server]


Summary of Reliable Checkpointing System

Developed a reliable checkpoint-recovery system, FALCON
  Select reliable storage hosts
  Prefer lightly loaded ones
  Compress and encode
  Store and retrieve efficiently

Ran experiments with FALCON in DiaGrid
  Performance improvement between 11% and 44%



Checkpointing in HPC

[Figure: compute nodes of several clusters (Hera, Atlas) reach the parallel file system through gateway nodes; callouts: network contention, contention for shared file system resources, contention from other clusters for the file system]


2-D vs N-D Compression




Challenge in Extreme-Scale: Increase in Failure-Rate


[Figure: performance scale from 1 Gflop/s to 1 Eflop/s; series: N=1, N=500]


Towards Online Clustering

Reduce dimension of β

Reduce the number of variables
  Representative data-type
  Number of elements greater than a threshold [Example: 100 variables of double type cover 80% of data]

Reduce the amount of data
  Sampling: random, chunking, and wavelet

