Swarms and Bundles: Bioinformatics and Biostatistics on Biowulf
David Hoover
Scientific Computing Branch, Division of Computer System Services
CIT, NIH
Embarrassingly Parallel Problems
• GWAS, with huge numbers of SNPs
• Sequence analysis, assembly, and mapping
• Testing and validating statistical models
• Protein folding and threading
• Molecular docking and compound screening
• Tomographic reconstruction
Tsai et al., Mol. Biochem. Parasitology, online preprint 2008
Protein folding calculations with Rosetta++: 100,000 CPU hours
Characterization of Surface Protein 3 from Malaria Parasite P. falciparum
How to run multiple independent processes in parallel
16 independent processes, each of the form: input → command → output
Biowulf Cluster Batch System
[Diagram: 16 scripts, each submitted to the batch system on its own, giving job1 through job16 with outputs job1.out through job16.out]
[Diagram: Node 1 to Node 4 running job1 to job4, with outputs job1.out to job4.out]
biowulf% swarm -f file
Swarm
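As a rough illustration of what `file` might contain, the sketch below builds a command file for the 16 independent processes, assuming swarm reads one command per line. The program name `myprog` and the `input<i>.dat` / `output<i>.dat` file names are hypothetical placeholders, not taken from the slides.

```python
def swarm_command_lines(n_jobs, program="myprog"):
    """Return one shell command per independent job.

    `program` and the input/output file names are hypothetical placeholders.
    """
    return [
        f"{program} < input{i}.dat > output{i}.dat"
        for i in range(1, n_jobs + 1)
    ]


def swarm_command_file_text(n_jobs, program="myprog"):
    """Join the commands into command-file text, one command per line."""
    return "\n".join(swarm_command_lines(n_jobs, program)) + "\n"


# Build the text for 16 independent commands and print it.
text = swarm_command_file_text(16)
print(text, end="")
```

Saving the printed text to `file` would give something that could be submitted with `swarm -f file`, assuming the one-command-per-line format holds.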
[Diagram: Node 1 running job1, with output job1.out]
biowulf% swarm -f file -b 4
Bundled Swarm
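The effect of `-b 4` can be pictured with a small model. This is only a sketch of the idea, not swarm's actual implementation: it assumes that bundling groups consecutive commands from the command file into chunks of 4, with each chunk run one command after another. The `bundle` helper and the `myprog` commands are hypothetical.

```python
def bundle(commands, per_bundle):
    """Split commands into consecutive groups of at most per_bundle commands.

    Assumption: each group stands for one bundle whose commands run in sequence.
    """
    if per_bundle < 1:
        raise ValueError("per_bundle must be at least 1")
    return [
        commands[i:i + per_bundle]
        for i in range(0, len(commands), per_bundle)
    ]


# 16 hypothetical independent commands, as in the unbundled example.
commands = [f"myprog < input{i}.dat > output{i}.dat" for i in range(1, 17)]

# Model of "-b 4": four commands per bundle.
bundles = bundle(commands, 4)
for number, group in enumerate(bundles, start=1):
    print(f"bundle {number}: {len(group)} commands")
```

Under this model, 16 commands bundled 4 at a time need only 4 processes instead of 16. When the command count is not a multiple of the bundle size, the last group is simply shorter.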
Swarm Facts
• Written and maintained by Helix Systems Staff
• swarm introduced in late 2000
• 82% of all batch jobs run on the cluster since 2002 are swarm jobs
• ~60% of all wall time spent on swarm jobs
• swarm has been shared with clusters around the world
Swarm World Records
• Largest swarm: 683,445 commands
• Largest bundle: 24,000 commands per CPU
Future Challenges
• How to deal with larger multicore nodes?
[Diagram: Node 1, Node 2, Node 3]