An Investigation of the Effects of Error-Correcting Code on GPU-accelerated Molecular Dynamics Simulations

Robin M. Betz, San Diego Supercomputer Center

Ross C. Walker, San Diego Supercomputer Center; Dept of Chemistry & Biochemistry, UC San Diego

What is Molecular Dynamics?

Simulation of dynamic properties of condensed phase physical / biological systems

Enzymes/Proteins

Drug Molecules

Biological Catalysts

Classical energy function

Parameterized force fields (bonds, angles, dihedrals, VDW, charges...)

Integration of Newton's equations of motion

Atoms considered as points, electrons are implicit
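A minimal sketch of what "integration of Newton's equations of motion" with a classical energy function can look like. It uses a velocity Verlet integrator on a single point particle held by one harmonic bond term. The function names, the reduced (unitless) parameters, and the choice of velocity Verlet are assumptions made for illustration; the slides do not say which integrator AMBER uses.

```python
def bond_energy(x, k=100.0, r0=1.0):
    """Harmonic bond energy E = 0.5 * k * (x - r0)^2 (illustrative force-field term)."""
    return 0.5 * k * (x - r0) ** 2


def bond_force(x, k=100.0, r0=1.0):
    """Force is the negative gradient of the bond energy."""
    return -k * (x - r0)


def velocity_verlet(x, v, mass, dt, n_steps, force=bond_force):
    """Advance position x and velocity v by n_steps of size dt."""
    f = force(x)
    for _ in range(n_steps):
        v += 0.5 * dt * f / mass   # half-step velocity update with old force
        x += dt * v                # full-step position update
        f = force(x)               # force at the new position
        v += 0.5 * dt * f / mass   # second half-step velocity update
    return x, v


def total_energy(x, v, mass):
    """Kinetic plus potential energy of the one-particle system."""
    return 0.5 * mass * v * v + bond_energy(x)


# Hypothetical starting point: bond stretched by 0.2, particle at rest.
x0, v0, mass = 1.2, 0.0, 1.0
x1, v1 = velocity_verlet(x0, v0, mass, dt=0.001, n_steps=10000)
energy_drift = abs(total_energy(x1, v1, mass) - total_energy(x0, v0, mass))
```

With a small enough timestep the total energy stays close to its starting value, which is the property the later "conserves energy" claim refers to.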

What can we do with Molecular Dynamics?

Simulation of time dependent properties

Protein domain motions

Small protein folds

Spectroscopic properties

Simulation of ensemble properties

Binding free energies

Reaction pathways

Free energy surfaces

[Figure: Binding of Tamiflu to influenza neuraminidase]

AMBER GPU-Acceleration

Collaboration between NVIDIA and AMBER development team

SPFP (mixed single-precision / fixed-point) model for calculation accuracy and speed

Focus on accuracy

Passes all AMBER development tests

Bitwise identical trajectories for runs with same random seed

Conserves energy comparably to double-precision CPU code
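One reason fixed-point accumulation can give bitwise identical results is that integer addition is associative, while floating-point addition is not. The sketch below is not AMBER's code: the scale factor 2**40, the function names, and the random test data are assumptions chosen only to show the idea.

```python
import random

FIXED_SCALE = 2 ** 40  # assumed scale factor, for illustration only


def to_fixed(value):
    """Convert a float to a scaled integer (fixed-point representation)."""
    return int(round(value * FIXED_SCALE))


def from_fixed(fixed_value):
    """Convert a scaled integer back to a float."""
    return fixed_value / FIXED_SCALE


def accumulate_float(values):
    """Plain floating-point sum; the result can depend on summation order."""
    total = 0.0
    for v in values:
        total += v
    return total


def accumulate_fixed(values):
    """Integer sum of fixed-point terms; the result is independent of order."""
    total = 0
    for v in values:
        total += to_fixed(v)
    return total


rng = random.Random(0)
forces = [rng.uniform(-1e3, 1e3) for _ in range(10000)]
shuffled = forces[:]
rng.shuffle(shuffled)  # mimics threads finishing in a different order

same_bits = accumulate_fixed(forces) == accumulate_fixed(shuffled)
```

Python integers do not overflow, so this sketch ignores the range limits that a 64-bit accumulator on a GPU would have.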

AMBER GPU-Acceleration

JAC (DHFR) benchmark: 23,559 atoms PME

Error Correcting Code (ECC)

Random bit-flip events (ECC errors) occur in memory

ECC uses Hamming codes to check that memory is unaltered

Can detect and correct single bit-flip errors

Can detect double bit-flip errors

Costs about 10% of memory and speed in GPU simulations
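A toy illustration of "correct single, detect double": an extended Hamming (8,4) code protecting a 4-bit word. The function names and the 4-bit word size are assumptions for illustration; real memory ECC runs in hardware on much wider words.

```python
def encode(data):
    """Encode 4 data bits as 8 bits: [overall parity, then Hamming positions 1..7]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    code = [p1, p2, d1, p3, d2, d3, d4]
    p0 = 0
    for bit in code:
        p0 ^= bit              # overall parity over the 7 Hamming bits
    return [p0] + code


def decode(word):
    """Return (status, data). Data is None when a double-bit error is detected."""
    w = list(word)
    syndrome = 0
    for position in range(1, 8):
        if w[position]:
            syndrome ^= position   # XOR of the positions of all set bits
    overall = 0
    for bit in w:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        w[syndrome] ^= 1           # single flip; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return "double-bit error detected", None
    return status, [w[3], w[5], w[6], w[7]]


stored = encode([1, 0, 0, 1])

single_flip = list(stored)
single_flip[6] ^= 1                # one bit flips in memory

double_flip = list(stored)
double_flip[3] ^= 1
double_flip[6] ^= 1                # two bits flip in the same word
```

Here `decode(single_flip)` recovers the original data, while `decode(double_flip)` can only report that the word is corrupt.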

[Diagram: the GPU takes input coordinates and velocities from step n and produces output coordinates and velocities for step n+1]

[Animation, normal operation: the value 1001 is held in GPU memory unchanged as 1001, and the result is 1010]

[Animation, ECC event without correction: the value 1001 is flipped in GPU memory to 1101, and the result is 1110]

[Animation, ECC event with correction: the value 1001 is flipped in GPU memory to 1101, the error correcting code restores it to 1001, and the result is 1010]

How Frequent Are ECC Events?

“DRAM Errors in the Wild” Google study

25,000-70,000 errors per billion device hours per Mbit

~8% of DIMMs affected by errors each year

Majority of errors are “hard errors” => physical defect

Soft errors perhaps 2% of all errors

Unknown error frequency for GPUs
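To get a feel for the scale of a rate quoted "per billion device hours per Mbit", the sketch below converts it to an expected error count for a given memory size and run time. It assumes the rate scales linearly with capacity and time, uses only the lower bound of the quoted range, and picks a hypothetical 1 GiB, 10 hour workload. An average like this can be dominated by a few faulty devices, so it gives a rough scale and not a prediction for any one GPU.

```python
def expected_errors(rate_per_billion_device_hours_per_mbit, capacity_mbit, hours):
    """Naive linear scaling of a per-Mbit, per-billion-device-hour error rate."""
    return rate_per_billion_device_hours_per_mbit / 1e9 * capacity_mbit * hours


LOWER_RATE = 25_000       # lower bound of the range quoted on the slide
SOFT_FRACTION = 0.02      # "soft errors perhaps 2% of all errors"

capacity_mbit = 1024 * 8  # hypothetical 1 GiB of memory, expressed in Mbit
run_hours = 10            # hypothetical run length

all_errors = expected_errors(LOWER_RATE, capacity_mbit, run_hours)
soft_errors = all_errors * SOFT_FRACTION
```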

Simulation

Satellite Tobacco Mosaic Virus (STMV)

1,067,095 atoms in explicit solvent

Uses 48% (2.6 GB) of GPU memory

NPT simulation at 300 K with a 2 fs timestep

Start from equilibrated trajectory

Supercomputer

Keeneland GPU supercomputer at Georgia Tech

240 available nodes for our test

3 M2090 GPUs per node

Run approximately 10 hours on all nodes with ECC on, then 10 hours with ECC off

Results

ECC on:

3 runs failed due to node I/O error

717 runs completed successfully

ECC off:

1 run hung in the middle of the simulation

719 runs completed successfully

No ECC events detected
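Zero detected events still bounds how frequent events could be. The sketch below is an added back-of-envelope estimate, not part of the authors' analysis. It assumes ECC events follow a Poisson process, so the chance of seeing none in an exposure T at rate r is exp(-r T), and it solves for the rate at which that chance drops to 5%. The exposure uses the 717 completed ECC-on runs and the mean ECC-on run time of 9.951 hours reported in the results.

```python
import math


def zero_event_upper_bound(exposure_hours, confidence=0.95):
    """Upper bound on a Poisson rate when zero events were seen in exposure_hours."""
    return -math.log(1.0 - confidence) / exposure_hours


runs_completed_ecc_on = 717
mean_hours_per_run = 9.951   # mean ECC-on completion time from the results

exposure_gpu_hours = runs_completed_ecc_on * mean_hours_per_run
rate_bound = zero_event_upper_bound(exposure_gpu_hours)  # events per GPU-hour
hours_per_event_at_bound = 1.0 / rate_bound
```

Under these assumptions the data are consistent with any rate up to about 4.2e-4 events per GPU-hour, so "very rare" is supported but not tightly quantified.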

Results

Mean time for 0.3 ns of simulation:

ECC on: 9.951 hours

ECC off: 9.096 hours

Standard deviation:

ECC on: 0.079 hours

ECC off: 0.006 hours

[Figure: Simulation Completion Time. Time (hours) for ECC ON and OFF, vertical axis from 8.6 to 10.2]

8.8% speed penalty
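A penalty like this can be defined in more than one way. The sketch below computes two common definitions from the rounded means above. The variable names are illustrative, and the slides do not say which definition or which unrounded means produced the reported figure, which lies between the two values.

```python
mean_hours_on = 9.951    # mean completion time with ECC on
mean_hours_off = 9.096   # mean completion time with ECC off
simulated_ns = 0.3       # simulated time per run

# Extra wall-clock time relative to the ECC-off run.
extra_time_fraction = (mean_hours_on - mean_hours_off) / mean_hours_off

# Throughput lost relative to the ECC-off run (ns simulated per day).
ns_per_day_on = simulated_ns / mean_hours_on * 24
ns_per_day_off = simulated_ns / mean_hours_off * 24
throughput_loss_fraction = 1 - ns_per_day_on / ns_per_day_off

print(f"extra time: {extra_time_fraction:.1%}")
print(f"throughput loss: {throughput_loss_fraction:.1%}")
```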

Conclusion

ECC soft error events are very rare

Increased likelihood of failure with longer wall clock times

Turning ECC off may benefit AMBER users

More research is needed into ECC soft error events

Acknowledgments

Dr. Ross Walker

Keeneland

NVIDIA

NSF