
Resource Predictors in HEP Applications

John Huth, Harvard
Sebastian Grinstein, Harvard
Peter Hurst, Harvard
Jennifer M. Schopf, ANL/NeSC

The Problem

• Large data sets get recreated, and scientists want to know whether they should
– Fetch a copy of the data, or
– Recreate it locally

• This problem can be considered in the context of a virtual data system that tracks how data is created so recreation is feasible

To make this decision you need

• 1) An estimate of the time to recreate the data
– Requires information about data provenance, machine types, etc.

• 2) An estimate of the data transfer time

• 3) A framework that lets you take advantage of these choices by adapting the workflow accordingly
– OUR AREA OF CONCENTRATION

A minimal sketch of the resulting decision rule follows.
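Given those two estimates, the per-file decision reduces to one comparison. The Python sketch below is illustrative only; the function name and example numbers are assumptions, not part of the framework:

    def plan_for_file(est_regeneration_s, est_transfer_s):
        """Pick the faster way to materialize one file."""
        return "recreate" if est_regeneration_s < est_transfer_s else "transfer"

    # Example: regeneration predicted at 1200 s, transfer at 15 s -> "transfer"
    print(plan_for_file(1200.0, 15.0))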

Regeneration Time Estimates

• Previous work (CHEP 2004, “Resource Predictors in HEP Applications”)

• Estimate the runtime of an ATLAS application
– End-to-end estimation, since no low-level application model is available
– Uses data about input parameters (number of events, versioning, debug on/off, etc.) and benchmark data (from nbench); a simplified version of such a model is sketched below

• Estimates are accurate to within 10% for event generation and reconstruction, and to within 25% for event simulation
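For illustration only, a runtime estimate of this kind can be written as a per-event cost measured on a reference host, rescaled by the ratio of nbench scores. The model form and every number below are assumptions; the fitted model from the CHEP 2004 work is not reproduced here:

    def estimate_runtime_s(n_events, per_event_s_ref, startup_s_ref,
                           nbench_ref, nbench_target):
        """Hypothetical end-to-end model: startup plus per-event cost measured
        on a reference host, rescaled by the ratio of nbench scores."""
        scale = nbench_ref / nbench_target      # slower target -> larger scale
        return scale * (startup_s_ref + per_event_s_ref * n_events)

    # Made-up numbers: 500 events, 2.3 s/event and 40 s startup on the
    # reference host, with the target host ~20% faster on nbench.
    print(estimate_runtime_s(500, 2.3, 40.0, nbench_ref=100.0, nbench_target=120.0))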

[Figure: Regeneration Time Estimate Accuracy]

File Transfer Time Estimates

• Much previous work (e.g., Vazhkudai and Schopf, IJHPCA Vol. 17, No. 3, August 2003)

• We use simple end-to-end history data from GridFTP logs to estimate behavior
– This simple approach works well on our networks/machines
– The average bandwidth is used, with no file-size filtering (see the sketch below)
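A sketch of this estimate, assuming the GridFTP log has already been reduced to (bytes, seconds) pairs; the parsing step is site-specific and not shown, and the helper names are illustrative:

    def average_bandwidth(history):
        """Aggregate end-to-end bandwidth in bytes/s over all logged transfers,
        with no filtering by file size."""
        total_bytes = sum(b for b, _ in history)
        total_secs = sum(t for _, t in history)
        return total_bytes / total_secs

    def predict_transfer_s(file_size_bytes, history):
        """Predicted transfer time for a new file of the given size."""
        return file_size_bytes / average_bandwidth(history)

    # Example with made-up history, then predict a 100 MB transfer.
    past = [(100_000_000, 15.2), (250_000_000, 36.9), (1_000_000_000, 150.0)]
    print(predict_transfer_s(100_000_000, past))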

Testbed

• Files transferred from BNL to Harvard and from CERN to Harvard
– BNL (aftpexp01.bnl.gov): 4x 3 GHz Xeon, Linux 2.4.21-37.ELsmp, 2.0 GB RAM, 1.0 Gbit/s NIC
– Harvard: 2x 3.4 GHz P4, Linux 2.4.20-21.EL.cernsmp, 1.5 GB RAM, 1.0 Gbit/s NIC

• Typical network routes:
– Harvard – NoX – ManLan – ESNet – BNL (typical latency 7.8 ms)
– Harvard – NoX – ManLan – Chicago (Abilene) – CERN (typical latency 148 ms)

• Bottlenecks are in the machines at each end (e.g., disk access)

[Figure: Network Routing]

Transfer Benchmarking

• Transfer files from BNL to Harvard
– 20 files each of 25 MB, 50 MB, 100 MB, 250 MB, 500 MB, and 1 GB

• Average file transfer times are linear with file size (a simple fit is sketched below)

• Initially quiet machines and network
– Transfers of 100 MB files have a variance of ~5%
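That linearity can be captured with a least-squares fit of time against size (time ≈ overhead + size / bandwidth). The measurement values below are placeholders, not the benchmark numbers shown in the following plots:

    import numpy as np

    sizes_mb = np.array([25, 50, 100, 250, 500, 1000], dtype=float)
    avg_time_s = np.array([4.2, 7.9, 15.1, 36.8, 73.0, 145.5])   # hypothetical averages

    slope, intercept = np.polyfit(sizes_mb, avg_time_s, 1)       # s per MB, s
    print(f"overhead ~ {intercept:.1f} s, effective bandwidth ~ {1 / slope:.1f} MB/s")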

[Figure: Time vs. File Size, BNL (quiet network)]

[Figure: Transfer Variance, BNL (100 MB files, quiet network)]

Transfer Benchmarking

• Some data taken during “Service Challenge 3”

• Average file transfer times are still linear with file size, but have larger variance

[Figure: Time vs. File Size, BNL (busy network)]

[Figure: Transfer Variance, BNL (100 MB files, busy network)]

But our concentration was on the framework

• Given ways to estimate application run time and file transfer time, we want to plug them into an existing framework to make better resource management decisions

• Could be implemented as a post-processor that optimizes DAGs produced by Chimera

Workflow Optimization

• A script parses the DAG, looking for I/O files and binaries

• I/O files are indexed in the Replica Location Service (RLS)

• The client queries a database for execution parameters and bandwidths

• The script evaluates execution and transfer times and rewrites the DAG to use the fastest option (sketched below)
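A minimal sketch of this pass, assuming the DAG has already been parsed into per-file nodes and that helpers for the RLS lookup and the two time estimates exist. None of these names are the script's actual API; they stand in for the steps listed above:

    def optimize(dag_nodes, rls_lookup, est_exec_s, est_transfer_s):
        """Rewrite each node as either its compute job or a transfer job,
        whichever is predicted to be faster."""
        rewritten = []
        for node in dag_nodes:
            replicas = rls_lookup(node.output_file)      # sites already holding a copy
            t_exec = est_exec_s(node)                    # predicted regeneration time
            t_xfer = min((est_transfer_s(node.output_file, site) for site in replicas),
                         default=float("inf"))           # no replica -> must compute
            rewritten.append(node.as_transfer_job() if t_xfer < t_exec else node)
        return rewritten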

Our Strawman Application

• ATLAS event reconstruction jobs take ~20 minutes to produce a 100 MB file

• File transfer from Boston to BNL takes ~15 seconds per 100 MB file

• We created simplified jobs whose average execution times equal the file transfer times, to get closer to the situation originally hypothesized

• This regime is likely to become more common as data access becomes more contentious and as machines/calculations speed up

Framework Tests

• Generate “Non-optimized” DAGs: linear chains that use a random mixture of transfers and calculations to instantiate 10, 20, or 40 files

• Run our optimizer over these DAGs to produce “Optimized” DAGs

• Submit both the “Non-optimized” and “Optimized” DAGs and compare processing times

• For our particular strawman, we expect the “Optimized” DAGs to be about 25% faster than the “Non-optimized” ones (an illustrative simulation follows)
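Where such an expectation comes from can be illustrated by simulation: draw per-file compute and transfer times, then compare a random choice (non-optimized) against always taking the minimum (optimized). The distributions below are assumptions with equal means; the 25% figure above comes from the measured strawman times, not from this sketch:

    import random

    def simulated_speedup(n_files=40, trials=10_000):
        """Fractional time saved by always picking the faster of compute/transfer."""
        base_total = opt_total = 0.0
        for _ in range(trials):
            for _ in range(n_files):
                t_calc = random.uniform(10, 20)    # assumed compute time (s)
                t_xfer = random.uniform(10, 20)    # assumed transfer time (s)
                base_total += random.choice((t_calc, t_xfer))   # random mixture
                opt_total += min(t_calc, t_xfer)                # fastest option
        return 1.0 - opt_total / base_total

    print(f"simulated speedup ~ {simulated_speedup():.0%}")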

[Figure: Framework Tests]

[Figure: Comparison of Results]

[Figure: Optimized Results]

Summary

• Implementation works

• A 28% time savings is seen

• Works with crude bandwidth predictions
– More sophisticated predictions for dynamic situations would be helpful

• Most useful when regeneration and transfer times are similar.