Instrumenting Folding@Work

Badi Abdul-Wahid, RJ Nowling

CSE 60641 Operating Systems

Professor Striegel

Overview

• Problem Description
  – Experimental Structure
  – Folding@Work Workflow
• Benchmarks
• Results
  – Weak Scaling (ns / day)
  – Server Capacity
  – Available Workers Over Time
  – Variability of Computation Time
• Conclusions

Experimental Structure

Folding@Work Workflow

Benchmarks

• Tasks: 1 ns generations (approx. 2 hr on test machine)
• 10 consecutive generations / simulations
• Weak Scaling
  – 10 simulations / 10 workers
  – 100 simulations / 100 workers
  – 1,000 simulations / 1,000 workers
• Condor, later added SGE jobs
• 1 trial of each; took ~2 days to run
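As a rough reference point for the weak-scaling runs, an ideal aggregate throughput can be estimated from the task size and duration quoted above. The sketch below assumes every worker performs like the test machine (1 ns per roughly 2 hr) and ignores all wait and transfer overhead. The function name `ideal_ns_per_day` is illustrative and not part of Folding@Work.

```python
def ideal_ns_per_day(workers, ns_per_task=1.0, hours_per_task=2.0):
    """Upper bound on aggregate throughput in ns/day.

    Assumption: every worker is busy all day and matches the test
    machine's speed; no wait, transfer, or scheduling overhead.
    """
    tasks_per_worker_per_day = 24.0 / hours_per_task
    return workers * tasks_per_worker_per_day * ns_per_task


# Worker counts taken from the weak-scaling benchmark sizes
for workers in (10, 100, 1000):
    print(f"{workers:>5} workers: {ideal_ns_per_day(workers):.0f} ns/day (ideal)")
```

Under these assumptions each worker contributes 12 ns/day, so measured throughput below this bound points to overhead or to workers that were unavailable.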

Weak Scaling of F@W

Server Capacity (Wait Time)

Available Workers over Time

Transfer Times

Variability of Computation Time

Example Execution Timeline

Performance Model

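$$N_{wu} = \frac{\langle t_{exe} \rangle + \langle t_{W,wait} \rangle}{\langle t_{new} \rangle + \langle t_{trans} \rangle + \langle t_{M,wait} \rangle}$$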
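The model above can be evaluated numerically. The sketch below reads it as a ratio, with mean execution time plus mean worker wait time in the numerator and mean new-task, transfer, and master wait times in the denominator. That reading, the interpretation of each symbol, and the function name `n_wu` are assumptions, since the slide does not define them. The input values are hypothetical and are not measurements from the benchmarks.

```python
def n_wu(t_exe, t_w_wait, t_new, t_trans, t_m_wait):
    """Evaluate N_wu = (<t_exe> + <t_W,wait>) / (<t_new> + <t_trans> + <t_M,wait>).

    All arguments are mean times in the same unit (seconds here).
    The ratio form is an assumed reading of the slide.
    """
    master_side = t_new + t_trans + t_m_wait
    if master_side <= 0:
        raise ValueError("denominator (master-side time) must be positive")
    return (t_exe + t_w_wait) / master_side


# Hypothetical inputs: a 2 hr work unit and a few seconds of master-side cost
example = n_wu(t_exe=7200.0, t_w_wait=30.0, t_new=1.0, t_trans=4.0, t_m_wait=1.0)
print(f"N_wu = {example:.0f}")
```

Because the denominator is small relative to the execution time, the result is sensitive to the transfer and master wait terms, which is where shorter work units would be expected to add load.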

Weak Scaling (updated)

Wait Times

Tasks Waiting

Identified Areas of Improvement

• Availability of Resources
  – Benchmarks limited by number of sustained workers available through Condor
  – New feature: WorkQueue Worker Pool can be used to start new workers
• WorkQueue Limits Number of Workers
  – Increasing number of file descriptors allowed up to 2,500 workers to connect
  – Bad behavior occurring in calls to select()
  – Working with WorkQueue developers to switch to poll()
• Long-Running Work Units Delay Completion of Trajectories
  – Some work units not returned / taking very long time
  – Prevents trajectories from finishing
  – Use fast abort feature to re-assign work units that take longer than a specified time
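The re-assignment idea in the last bullet can be illustrated with a small standalone sketch. It does not call WorkQueue's own fast abort feature. It only shows the general logic of moving work units that exceed a time limit back onto a pending queue. The names `find_overdue` and `reassign_overdue`, the data layout, and the 3 hr limit are all hypothetical.

```python
def find_overdue(running, now, max_runtime):
    """Return ids of work units that have run longer than max_runtime seconds.

    running: dict mapping work-unit id -> start time in seconds.
    """
    return [wu for wu, started in running.items() if now - started > max_runtime]


def reassign_overdue(running, pending, now, max_runtime):
    """Move overdue work units from `running` back to the `pending` list."""
    overdue = find_overdue(running, now, max_runtime)
    for wu in overdue:
        del running[wu]      # stop tracking the slow attempt
        pending.append(wu)   # queue the work unit for another worker
    return overdue


# Hypothetical state: three work units started at different times (seconds)
running = {"wu-1": 0.0, "wu-2": 5000.0, "wu-3": 9000.0}
pending = []
moved = reassign_overdue(running, pending, now=12000.0, max_runtime=3 * 3600)
print("reassigned:", moved)
```

A fixed limit is the simplest policy. Choosing it relative to observed run times would avoid re-assigning work units that are slow only because the whole pool is slow.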

Conclusion

• Accomplished
  – Identified key metrics (ns / day, wait time)
  – Developed scaling model
  – Tested model
• Conclusions
  – Real scientific applications scale well
  – Forcing short work units adds load to Master
  – Performance model validated
  – “Self-correcting” behavior