Tanzima Zerin Islam School of Electrical & Computer Engineering Purdue University West Lafayette, IN Date: April 8, 2013 Reliable and Scalable Checkpointing Systems for Distributed Computing Environments Final exam of

Reliable and Scalable Checkpointing Systems for Distributed Computing Environments


Page 1: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Tanzima Zerin Islam

School of Electrical & Computer Engineering, Purdue University, West Lafayette, IN
Date: April 8, 2013

Reliable and Scalable Checkpointing Systems for Distributed Computing Environments

Final exam of

Page 2: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Reliable & Scalable Checkpointing Systems 2

Distributed Computing Environments

Tanzima Islam ([email protected])

@Purdue

@Notre Dame

@Indiana U.

Internet

High Performance Computing (HPC):
  Projected MTBF of 3-26 minutes at exascale
  Failure: hardware, software

Grid:
  Cycle-sharing system
  Highly volatile environment
  Failure: eviction of guest jobs

Page 3: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Fault-tolerance with Checkpoint-Restart

Checkpoints are execution states

System-level
  Memory state
  Compressible

Application-level
  Selected variables
  Hard to compress

struct ToyGrp {
    float Temperature[1024];
    int Pressure[20][30];
};

Page 4: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Challenges in Checkpointing Systems

[Diagram labels: @Purdue, @Notre Dame, @Indiana U., Internet]

HPC: Scalability of checkpointing systems

Grid: Use of dedicated checkpoint servers

Page 5: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Contributions of This Thesis

2007-2009: FALCON, reliable checkpointing system in Grid [Best Student Paper Nomination, SC'09]
2009-2010: Compression on multi-core [2nd Place, ACM Student Research Competition'10]
2010-2012: MCRENGINE, scalable checkpointing system in HPC [Best Student Paper Nomination, SC'12]
2012-2013: MCRCLUSTER [Unpublished]

Prelim

Page 6: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments


Agenda

[MCRENGINE] Scalable checkpointing system for HPC

[MCRCLUSTER] Benefit-aware clustering

Future directions


Page 7: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

A Scalable Checkpointing System using Data-Aware Aggregation and Compression

Collaborators: Kathryn Mohror, Adam Moody, Bronis de Supinski

Page 8: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Big Picture of HPC

[Diagram labels: Compute Nodes, Gateway Nodes, Parallel File System; clusters: Hera, Atlas]

Network contention
Contention for shared file system resources
Contention with other clusters

Page 9: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Checkpointing in HPC

MPI applications
  Take globally coordinated checkpoints asynchronously

Application-level checkpoint
  High-level data format for portability
  HDF5, Adios, netCDF, etc.

Checkpoint writing
  [Diagram labels: Application, Data-Format API (NetCDF, HDF5), I/O Library]

struct ToyGrp {
    float Temperature[1024];
    short Pressure[20][30];
};

HDF5 checkpoint {
  Group "/" {
    Group "ToyGrp" {
      DATASET "Temperature" {
        DATATYPE H5T_IEEE_F32LE
        DATASPACE SIMPLE {(1024) / (1024)}
      }
      DATASET "Pressure" {
        DATATYPE H5T_STD_U8LE
        DATASPACE SIMPLE {(20,30) / (20,30)}
      }
    }
  }
}

Transfer to the Parallel File System (PFS):
  N->1 (Funneled): not scalable
  N->N (Direct): easiest, but contention on PFS
  N->M (Grouped): best compromise, but complex
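The three transfer patterns differ mainly in how many processes write to the PFS at once. The sketch below is only an illustration of that count, assuming rank-order groups of a fixed size; the function names `aggregator_for` and `concurrent_writers` are hypothetical and do not come from the slides.

```python
def aggregator_for(rank: int, group_size: int) -> int:
    """Rank-order grouping (assumed): ranks 0..G-1 map to aggregator 0, and so on."""
    return rank // group_size


def concurrent_writers(num_procs: int, scheme: str, group_size: int = 32) -> int:
    """Number of processes that write to the PFS under each transfer scheme."""
    if scheme == "N->1":   # funneled: a single writer
        return 1
    if scheme == "N->N":   # direct: every process writes its own file
        return num_procs
    if scheme == "N->M":   # grouped: one aggregator per group (ceiling division)
        return -(-num_procs // group_size)
    raise ValueError(f"unknown scheme: {scheme}")
```

With 15,408 processes and groups of 32, the grouped scheme leaves 482 writers instead of 15,408.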

Page 10: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Impact of Load on PFS at Large Scale

IOR, Direct (N->N): 78MB per process

Observations:
(−) Large average write time, leading to less frequent checkpointing
(−) Large average read time, leading to poor application performance

[Figure: average write time (s) and average read time (s) versus number of processes (N), from 128 to 15,408]

Page 11: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

What is the Problem?

Today's checkpoint-restart systems will not scale
  Increasing number of concurrent transfers
  Increasing volume of checkpoint data

Page 12: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Our Contributions

Data-aware aggregation
  Reduces the number of concurrent transfers
  Improves compressibility of checkpoints by using semantic information

Data-aware compression
  Improves compression ratio by up to 115% compared to concatenation and general-purpose compression

Design and develop mcrEngine
  Grouped (N->M) checkpointing system
  Improves checkpointing frequency
  Improves application performance

Page 13: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Naïve Solution: Data-Agnostic Compression

Agnostic scheme: concatenate checkpoints
  C1 | C2 -> pGzip -> PFS

Agnostic-block scheme: interleave fixed-size blocks
  C1[1-B] | C2[1-B] | C1[B+1-2B] | C2[B+1-2B] -> pGzip -> PFS

[Diagram label: First Phase]

Observations:
(+) Easy
(−) Low compression ratio
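The two data-agnostic layouts can be sketched as follows. This is a minimal illustration, not the author's implementation: checkpoints are treated as opaque byte strings, the function names are hypothetical, and Python's `zlib` stands in for pGzip.

```python
import zlib


def agnostic(checkpoints):
    """Agnostic scheme: concatenate whole checkpoints back to back."""
    return b"".join(checkpoints)


def agnostic_block(checkpoints, block_size):
    """Agnostic-block scheme: interleave fixed-size blocks across checkpoints."""
    longest = max(len(c) for c in checkpoints)
    out = bytearray()
    for start in range(0, longest, block_size):
        for c in checkpoints:
            out += c[start:start + block_size]  # shorter checkpoints contribute nothing here
    return bytes(out)


def compress(buffer):
    """General-purpose compression step (zlib used as a stand-in for pGzip)."""
    return zlib.compress(buffer)
```

Neither layout looks at variable names or types, which is why both are easy to build but leave the compressor with mixed data.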

Page 14: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Our Solution: [Step 1] Identify Similar Variables Across Processes

P0:
Group ToyGrp {
    float Temperature[1024];
    int Pressure[20][30];
};

P1:
Group ToyGrp {
    float Temperature[100];
    int Pressure[10][50];
};

Meta-data:
1. Name
2. Data-type
3. Class: Array, Atomic

Checkpoint layout before merging: C1.T C1.P C2.T C2.P

[Step 2] Merging Scheme I: Aware Scheme
  Concatenating similar variables: C2.T C1.T C2.P C1.P

[Step 2] Merging Scheme II: Aware-Block Scheme
  Interleaving similar variables: first 'B' bytes of Temperature, next 'B' bytes of Temperature, then interleave Pressure
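A rough sketch of the two data-aware merging schemes is given below. It assumes each checkpoint is a dictionary keyed by the meta-data tuple (name, data-type, class) with raw bytes as values; that representation and the function names are illustrative assumptions, not mcrEngine's actual data structures.

```python
from collections import defaultdict


def aware_merge(checkpoints):
    """Aware scheme: concatenate the payloads of similar variables."""
    merged = defaultdict(bytearray)
    for ckpt in checkpoints:
        for key, payload in ckpt.items():
            merged[key] += payload
    return {key: bytes(buf) for key, buf in merged.items()}


def aware_block_merge(checkpoints, block_size):
    """Aware-block scheme: interleave B-byte blocks of similar variables."""
    keys = []
    for ckpt in checkpoints:
        for key in ckpt:
            if key not in keys:
                keys.append(key)  # preserve first-seen variable order
    merged = {}
    for key in keys:
        parts = [c[key] for c in checkpoints if key in c]
        longest = max(len(p) for p in parts)
        out = bytearray()
        for start in range(0, longest, block_size):
            for p in parts:
                out += p[start:start + block_size]
        merged[key] = bytes(out)
    return merged
```

Both functions return one merged buffer per variable, so a type-specific compressor can then be applied to each buffer separately.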

Page 15: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

[Step 3] Data-Aware Aggregation & Compression

Aware scheme: concatenate similar variables
Aware-block scheme: interleave similar variables

First Phase: data-type aware compression (FPC, Lempel-Ziv) of the merged variables into the output buffer
  C2.T C1.T, C2.P C1.P, C2.H C1.H, C2.D C1.D

Second Phase: pGzip over the output buffer (T P H D), then write to the PFS

Page 16: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

How MCRENGINE Works

CNC: Compute node component
ANC: Aggregator node component
Rank-order groups, Grouped (N->M) transfer

[Diagram: within each group, the CNCs send meta-data to the aggregator. The aggregator identifies "similar" variables, requests them from the CNCs (Request T, P, then Request H, D), applies data-aware aggregation and compression, runs pGzip, and writes to the PFS.]

Page 17: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Evaluation

Applications
  ALE3D: 4.8GB per checkpoint set
  Cactus: 2.41GB per checkpoint set
  Cosmology: 1.1GB per checkpoint set
  Implosion: 13MB per checkpoint set

Experimental test-bed
  LLNL's Sierra: 261.3 TFLOP/s, Linux cluster
  23,328 cores, 1.3 Petabyte Lustre file system

Compression algorithm
  FPC [1] for double-precision float
  Fpzip [2] for single-precision float
  Lempel-Ziv for all other data-types
  pGzip for general-purpose compression

Page 18: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Evaluation Metrics

Effectiveness of data-aware compression
  What is the benefit of multiple compression phases?
  How does group size affect compression ratio?

Performance of mcrEngine
  Overhead of the checkpointing phase
  Overhead of the restart phase

Compression ratio = Uncompressed size / Compressed size
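For reference, the metric above and the "relative improvement" figures quoted elsewhere in the talk can be computed as in this small sketch. The relative-improvement formula is the usual percentage change and is an assumption about how the slides' percentages were derived.

```python
def compression_ratio(uncompressed_size, compressed_size):
    """Compression ratio = uncompressed size / compressed size."""
    return uncompressed_size / compressed_size


def relative_improvement(ratio_new, ratio_base):
    """Percentage improvement of one compression ratio over a baseline ratio."""
    return 100.0 * (ratio_new - ratio_base) / ratio_base
```

Under this definition, an improvement above 100% means the compression ratio more than doubled relative to the baseline.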

Page 19: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

No Benefit with Data-Agnostic Double Compression
Multiple Phases of Data-Aware Compression are Beneficial

Data-agnostic double compression is not beneficial
  Because the data format is non-uniform and incompressible

Data-type aware compression improves compressibility
  First phase changes underlying data format

[Figure: compression ratio (0 to 4) after the first and second compression phases for ALE3D, Cactus, Cosmology, and Implosion; series: Data-Agnostic, Data-Aware]

Page 20: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Impact of Group Size on Compression Ratio

Different merging schemes are better for different applications
Larger group size is beneficial for certain applications
  ALE3D: improvement of 8% from group size 2 to 32

[Figure: compression ratio versus group size (1, 2, 4, 8, 16, 32) for ALE3D and Cactus; series: Aware-Block, Aware]

Page 21: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Data-Aware Technique Always Wins over Data-Agnostic

Data-aware technique always yields better compression ratio than data-agnostic technique (annotated improvement: 98-115%)

[Figure: compression ratio versus group size (1, 2, 4, 8, 16, 32) for ALE3D and Cactus; series: Aware-Block, Aware, Agnostic-Block, Agnostic]

Page 22: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Summary of Effectiveness Study

Data-aware compression always wins
  Reduces gigabytes of data for Cactus

Larger group sizes may improve compression ratio

Different merging schemes for different applications

Compression ratio follows course of simulation

Page 23: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Impact of Data-Aware Compression on Latency

IOR with Grouped (N->M) transfer, groups of 32 processes
Data-aware: 1.2GB, data-agnostic: 2.4GB

Data-aware compression improves I/O performance at large scale
  Improvement during write: 43% - 70%
  Improvement during read: 48% - 70%

[Figure: average transfer time (sec) versus number of processes (N), from 128 to 28,672; series: Agnostic-Read, Agnostic-Write, Aware-Read, Aware-Write]

Page 24: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Impact of Aggregation & Compression on Latency

Used IOR
  Direct (N->N): 87MB per process
  Grouped (N->M): group size 32, 1.21GB per aggregator

[Figure: average write time (sec) and average read time (sec) versus number of processes (N), from 128 to 15,408; series: N->N Write, N->M Write, N->N Read, N->M Read]

Page 25: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

End-to-End Checkpointing Overhead

15,408 processes
Group size of 32 for N->M schemes
Each process takes a checkpoint

Converts a network-bound operation into a CPU-bound one

[Figure: total checkpointing overhead (sec), split into transfer overhead and CPU overhead, for ALE3D and Cactus. Direct: No Comp., Indiv. Comp; Grouped: No Comp., Agnostic, Aware. Annotated reduction in checkpointing overhead: 87% and 51%]

Page 26: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

End-to-End Restart Overhead

Reduced overall restart overhead
Reduced network load and transfer time

[Figure: total recovery overhead (sec), split into transfer overhead and CPU overhead, for ALE3D and Cactus. Direct: No Comp., Indiv. Comp; Grouped: No Comp., Agnostic, Aware. Annotated reduction in I/O overhead: 43% and 71%; reduction in recovery overhead: 62% and 64%]

Page 27: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Summary of Scalable Checkpointing System

Developed data-aware checkpoint compression technique
  Relative improvement in compression ratio up to 115%

Investigated different merging techniques
Demonstrated effectiveness using real-world applications

Designed and developed MCRENGINE
  Reduces recovery overhead by more than 62%
  Reduces checkpointing overhead by up to 87%
  Improves scalability of checkpoint-restart systems

Page 28: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Benefit-Aware Clustering of Checkpoints from Parallel Applications

Collaborators: Todd Gamblin, Kathryn Mohror, Adam Moody, Bronis de Supinski

Page 29: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Our Goal & Contributions

Goal: Can suitably grouping checkpoints increase compressibility?

Contributions:
  Design a new metric for "similarity" of checkpoints
  Use this metric for clustering checkpoints
  Evaluate the benefit of the clustering on checkpoint storage

Page 30: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Different Clustering Schemes

[Figure: 16 checkpoints (numbered 1 to 16) grouped under three schemes: Random, Rank-wise, and Data-aware (our solution)]

Page 31: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments


Research Questions

How to cluster checkpoints?

Does clustering improve compression ratio?


Page 32: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Benefit-Aware Clustering

Similarity metric: Improvement in reduction
Goal: Minimize the total compressed size

[Figure: benefit matrix β of Cactus]
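One way to read "improvement in reduction" is as the number of bytes saved by compressing two checkpoints together rather than separately. The sketch below builds a benefit matrix under that assumed definition, with `zlib` as a stand-in compressor; the slides do not give the exact formula, so treat this purely as an illustration.

```python
import zlib


def compressed_size(buffer):
    """Size in bytes after general-purpose compression (zlib as a stand-in)."""
    return len(zlib.compress(buffer))


def benefit(a, b):
    """Assumed benefit: bytes saved by compressing a and b together vs. separately."""
    return compressed_size(a) + compressed_size(b) - compressed_size(a + b)


def benefit_matrix(checkpoints):
    """Pairwise benefit values beta[i][j] for a list of checkpoint byte strings."""
    n = len(checkpoints)
    return [[benefit(checkpoints[i], checkpoints[j]) for j in range(n)]
            for i in range(n)]
```

Checkpoints with shared content yield a large benefit when paired, while unrelated checkpoints yield a benefit near zero.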

Page 33: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Novel Dissimilarity Metric

Two factors for the dissimilarity between two checkpoints

$$\Delta(i, j) = \frac{1}{\beta(i, j)} \times \sum_{k=1}^{N} \left[ \beta(i, k) - \beta(j, k) \right]^2$$
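The metric can be evaluated directly from a benefit matrix. The sketch below assumes the form Δ(i, j) = (1 / β(i, j)) × Σ_k [β(i, k) − β(j, k)]², which is a reading of the slide's layout rather than a confirmed formula, and it assumes β(i, j) is non-zero.

```python
def dissimilarity(beta, i, j):
    """Delta(i, j): squared distance between benefit rows, scaled by 1 / beta[i][j]."""
    n = len(beta)
    spread = sum((beta[i][k] - beta[j][k]) ** 2 for k in range(n))
    return spread / beta[i][j]  # raises ZeroDivisionError if the pair has no benefit
```

The first factor is small when two checkpoints relate to all others in the same way, and the second is small when pairing them directly gives a large benefit, so a low value suggests the two belong in the same cluster.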

Page 34: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

How Benefit-Aware Clustering Works

Stages: Filter, Order, Similarity, Cluster

Variables in a checkpoint:
double T[3000];
double V[10];
double P[5000];
double D[4000];
double R[100];

Variable lists shown at later stages:
double D[4000];
double P[5000];
double T[3000];

double T[3000];
double P[5000];
double D[4000];

Sampling: Chunking, Wavelet

[Diagram: benefit values β are computed between the checkpoints of processes P1 to P5, which are then clustered. Cluster 1: P1, P3, P4. Cluster 2: P2, P5]

Page 35: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Structure of MCRCLUSTER

[Diagram: on the compute nodes, processes P1 to P5 each run the F, O, S, C stages (Filter, Order, Similarity, Cluster); aggregators A1 and A2 write to the PFS]

Page 36: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Evaluation

Application
  IOR (synthetic checkpoints)
  Cactus

Experimental test-bed
  LLNL's Sierra: 261.3 TFLOP/s, Linux cluster
  23,328 cores, 1.3 Petabyte Lustre file system

Evaluation metric:
  Macro benchmark: Effectiveness of clustering
  Micro benchmark: Effectiveness of sampling

Page 37: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Effectiveness of MCRCLUSTER

IOR: 32 checkpoints
  Odd processes write 0
  Even processes write: <rank> | 1234567

29% more compression compared to rank-wise grouping, 22% more compared to random grouping

Page 38: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Effectiveness of Sampling

X axis: Each variable
Y axis: Range of benefit values
Takeaway: Chunking method preserves benefit relationships the closest

[Figure panels: Chunking, Wavelet Transform]

Page 39: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Contributions of MCRCLUSTER

Design similarity and distance metric

Demonstrate significant result on synthetic data
  22% and 29% improvement compared to random and rank-wise clustering, respectively

Future directions for a first year Ph.D. student
  Study impact on real applications
  Design scalable clustering technique

Page 40: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments


Applicability of My Research

Condor systems

Compression for scientific data


Page 41: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Conclusions

This thesis addresses:
  Reliability of checkpointing-based recovery in large-scale computing

Proposed three novel systems:
  FALCON: Distributed checkpointing system for Grids
  MCRENGINE: "Data-Aware Compression" and scalable checkpointing system for HPC
  MCRCLUSTER: "Benefit-Aware Clustering"

Provides a good foundation for further research in this field

Page 42: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments


Questions?

Page 43: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Future Directions: Reliability

Reliability: Similarity-based process grouping for better compression
  Group processes based on similarity instead of rank [ongoing]
  Analytical solution to group size selection
  Variable streaming

Integrating mcrEngine with SCR

Page 44: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Future Directions: Performance

Cache usage analysis and optimization
  Developed user-level tool for analyzing cache utilization [Summer'12]

Short-term goals:
  Apply to real applications
  Automate analysis

Long-term goals:
  Suggest potential code optimizations
  Automate application tuning

Page 45: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Contact Information

Tanzima Islam ([email protected])
Website: web.ics.purdue.edu/~tislam

Page 46: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments


Effectiveness of mcrCluster


Page 47: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments


Backup Slides


Page 48: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

[Backup Slide] Failures in HPC

“A Large-scale Study of Failures in High-performance Computing Systems”, by Bianca Schroeder, Garth Gibson

[Figure panels: breakdown of root causes of failures; breakdown of downtime into root causes]

Page 49: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

[Backup Slide] Failures in HPC

“Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm”, by Laxmikant Kalé et al.

Disparity between network bandwidth and memory size

Page 50: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments


[Backup Slides] Falcon


Page 51: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

[Backup Slide] Breakdown of Overheads

Performance scales with checkpoint sizes
Lower network transfer overhead

[Figure: checkpointing overhead (sec) and recovery overhead (sec), 0 to 180, for F and D at checkpoint sizes 500MB, 946MB, and 1677MB]

Page 52: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

[Backup Slide] Parallel Falcon

67% improvement in CPU time

[Figure: recovery overhead (sec) and checkpoint storing overhead (sec), 0 to 180, for F, PF, and D at checkpoint sizes 500MB, 946MB, and 1677MB]

Page 53: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments


[Backup Slides] mcrEngine


Page 54: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

[Backup Slide] How to Find Similarity

Inside source code: variables are represented as members of a group. A group can be thought of as the "struct" construct in C.

P0:
Group ToyGrp {
    float Temperature[1024];
    short Pressure[20][30];
    int Humidity;
};

P1:
Group ToyGrp {
    float Temperature[50];
    short Pressure[2][6];
    double Unit;
    int Humidity;
};

Inside a checkpoint: variables are annotated with metadata, from which a hash key is generated for matching.

P0:
  Var: "ToyGrp/Temperature", Type: F32LE, Array1D [1024] -> ToyGrp/Temperature_F32LE_Array1D
  Var: "ToyGrp/Pressure", Type: S8LE, Array2D [20][30] -> ToyGrp/Pressure_S8LE_Array2D
  Var: "ToyGrp/Humidity", Type: I32LE, Atomic -> ToyGrp/Humidity_I32LE_Atomic

P1:
  Var: "ToyGrp/Temperature", Type: F32LE, Array1D [50] -> ToyGrp/Temperature_F32LE_Array1D
  Var: "ToyGrp/Pressure", Type: S8LE, Array2D [2][6] -> ToyGrp/Pressure_S8LE_Array2D
  Var: "ToyGrp/Unit", Type: F64LE, Atomic -> ToyGrp/Unit_F64LE_Atomic (no match)
  Var: "ToyGrp/Humidity", Type: I32LE, Atomic -> ToyGrp/Humidity_I32LE_Atomic
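The matching step on this slide can be sketched as below. The key format follows the examples shown (name, data-type, and class joined by underscores); representing each variable as a tuple and matching with plain set operations are assumptions made for illustration, since the slides only say a hash key is generated.

```python
def match_key(name, dtype, cls):
    """Build the matching key, e.g. ToyGrp/Temperature_F32LE_Array1D."""
    return f"{name}_{dtype}_{cls}"


def match_variables(vars_a, vars_b):
    """Return (matched keys, unmatched keys) for two processes' variable lists."""
    keys_a = {match_key(*v) for v in vars_a}
    keys_b = {match_key(*v) for v in vars_b}
    return sorted(keys_a & keys_b), sorted(keys_a ^ keys_b)


# Variable descriptions taken from the slide; array extents are left out of the key.
P0 = [("ToyGrp/Temperature", "F32LE", "Array1D"),
      ("ToyGrp/Pressure", "S8LE", "Array2D"),
      ("ToyGrp/Humidity", "I32LE", "Atomic")]
P1 = [("ToyGrp/Temperature", "F32LE", "Array1D"),
      ("ToyGrp/Pressure", "S8LE", "Array2D"),
      ("ToyGrp/Unit", "F64LE", "Atomic"),
      ("ToyGrp/Humidity", "I32LE", "Atomic")]
```

Because the array extents are not part of the key, Temperature[1024] on P0 and Temperature[50] on P1 still match, while Unit exists only on P1 and is left unmatched.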

Page 55: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

[Backup Slide] Compression Ratio Follows Course of Simulation

Data-aware technique always yields better compression

[Figure: compression ratio versus simulation time-steps for Cactus, Cosmology, and Implosion; series: Aware-Block, Aware, Agnostic-Block, Agnostic]

Page 56: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

[Backup Slide] Relative Improvement in Compression Ratio Compared to Data-Agnostic Scheme

Application  Total Size (GB)  Data Types (%)         Aware-Block (%)  Aware (%)
                              DF      SF     Int
ALE3D        4.8              88.8    ~0     11.2    6.6 - 27.7       6.6 - 12.7
Cactus       2.41             33.94   0      66.06   10.7 - 11.9      98 - 115
Cosmology    1.1              24.3    67.2   8.5     20.1 - 25.6      20.6 - 21.1
Implosion    0.013            0       74.1   25.9    36.3 - 38.4      36.3 - 38.8

Page 57: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments


References

1. M. Burtscher and P. Ratanaworabhan, “FPC: A High-speed Compressor for Double-Precision Floating-Point Data”.

2. P. Lindstrom and M. Isenburg, “Fast and Efficient Compression of Floating-Point Data”.

3. L. Reinhold, “QuickLZ”.


Page 58: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments


Reliable and Efficient System for Storing Checkpoints in Grid


Execution Environment: Grid

Page 59: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

State-of-the-Art: Checkpointing in Grid with Dedicated Storage

[Diagram labels: Submitter, @Purdue, @Notre Dame, @Indiana U., Internet, Dedicated Storage Server]

Problems:
(−) High transfer latency
(−) Contention on servers
(−) Stress on shared network resources

Page 60: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Research Question

Can we improve the performance of applications by storing checkpoints on the grid resources?

Page 61: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Overview of Our Solution: Checkpointing in Grid with Distributed Storage

[Diagram labels: Submitter, @Purdue, @Notre Dame, @Indiana U., Internet]

Q1. Which storage nodes?
Q2. How to balance load?
Q3. How to efficiently store and retrieve?
Constraint: all components must be user-level

Page 62: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Answer to Q1: Storage Host Selection

Build failure model for storage resources
  Compute correlated temporal reliability
  Based on historical data

Rank machines
  Based on: reliability, load, and network overhead
  Addresses Q2

Output: (m+k) storage hosts

Objective function: checkpoint storing overhead - benefit from restart

[Diagram: availability timelines of the compute host, storage host 1, and storage host 2, with "down" intervals marked]

Page 63: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Checkpoint-Recovery Scheme

Checkpoint Storing Phase
  Original checkpoint -> Compression -> Compressed checkpoint -> Erasure Encoding (m+k) -> Fragments 1, 2, ..., m+k -> Storage hosts

Recovery Phase
  Storage hosts -> Fragments 1, 2, ..., m+k -> Erasure Decoding (m) -> Compressed checkpoint -> Decompression -> Original checkpoint

[Diagram labels also include: Disk]
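The store and recover pipeline can be illustrated with a deliberately simplified sketch. It compresses with `zlib` and uses a single XOR parity fragment, which corresponds to the special case k = 1; this is a swapped-in toy code for illustration, not the erasure code FALCON uses, and the function names are hypothetical.

```python
import zlib
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length fragments."""
    return bytes(x ^ y for x, y in zip(a, b))


def store_checkpoint(checkpoint, m):
    """Compress, split into m data fragments, and append one XOR parity fragment."""
    compressed = zlib.compress(checkpoint)
    frag_len = -(-len(compressed) // m)                 # ceiling division
    padded = compressed.ljust(frag_len * m, b"\0")      # pad so fragments are equal-sized
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(m)]
    return frags + [reduce(xor_bytes, frags)], len(compressed)


def recover_checkpoint(frags, m, compressed_len):
    """Rebuild the checkpoint when at most one of the m+1 fragments is None."""
    missing = [i for i, f in enumerate(frags) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one lost fragment")
    frags = list(frags)
    if missing and missing[0] < m:
        present = [f for f in frags if f is not None]
        frags[missing[0]] = reduce(xor_bytes, present)  # XOR of the rest restores it
    return zlib.decompress(b"".join(frags[:m])[:compressed_len])
```

A production erasure code would tolerate up to k lost fragments out of m+k; the single-parity version only shows where encoding and decoding sit relative to compression.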

Page 64: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Evaluation Setup

2 different applications with 4 input sets
  MCF (SPEC CPU 2006)
  TIGR (BioBench)

System-level checkpoints

Macro benchmark experiment
  Average job makespan

Micro benchmark experiments
  Efficiency of checkpoint and restart
  Efficiency in handling simultaneous clients
  Efficiency in handling multiple failures

Page 65: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Checkpoint Storing & Recovery Overhead

Performance scales with checkpoint sizes
Lower network transfer overhead

[Figure: checkpointing overhead (sec) and recovery overhead (sec), 0 to 180, split into transfer overhead and CPU overhead, for D and F at checkpoint sizes 500MB, 946MB, and 1677MB]

Page 66: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Overall Performance Comparison

Performance improvement between 11% and 44%

[Figure: average makespan time (min), 0 to 120, for benchmark applications mcf and tigr; series: Remote Dedicated Server, Local Dedicated Server, Falcon with Distributed Server]

Page 67: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Summary of Reliable Checkpointing System

Developed a reliable checkpoint-recovery system, FALCON
  Select reliable storage hosts
  Prefer lightly loaded ones
  Compress and encode
  Store and retrieve efficiently

Ran experiments with FALCON in DiaGrid
  Performance improvement between 11% and 44%

Page 68: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Checkpointing in HPC

[Diagram labels: Compute Nodes, Gateway Nodes, Parallel File System; clusters: Hera, Atlas]

Network contention
Contention for shared file system resources
Contention with other clusters for the file system

Page 69: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments


2-D vs N-D Compression


Page 70: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments


Page 71: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Challenge in Extreme-Scale: Increase in Failure-Rate

[Figure: performance axis from 1 Gflop/s to 1 Eflop/s; series: N=1, N=500]

Page 72: Reliable and Scalable Checkpointing Systems for  Distributed  Computing Environments

Towards Online Clustering

Reduce dimension of β

Reduce the number of variables
  Representative data-type
  Number of elements greater than a threshold [Example: 100 double-type variables cover 80% of the data]

Reduce the amount of data
  Sampling: random, chunking, and wavelet
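Two of the reduction steps listed above can be sketched in a few lines. The threshold filter follows the slide directly; the chunking sampler, which keeps the first `chunk_len` elements out of every `stride` elements, is only one plausible reading of "chunking", and all names here are hypothetical.

```python
def filter_variables(variables, min_elements):
    """Keep only variables whose element count is greater than the threshold."""
    return {name: values for name, values in variables.items()
            if len(values) > min_elements}


def chunk_sample(values, chunk_len, stride):
    """Assumed chunking: keep the first chunk_len elements of every stride elements."""
    sample = []
    for start in range(0, len(values), stride):
        sample.extend(values[start:start + chunk_len])
    return sample
```

Filtering shrinks the number of variables that feed the benefit computation, and sampling shrinks the data per variable, so both reduce the cost of building β.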