Swarms and Bundles: Bioinformatics and Biostatistics on Biowulf

David Hoover
Scientific Computing Branch, Division of Computer System Services, CIT, NIH



Page 1: (title slide)

Page 2:

Embarrassingly Parallel Problems

• GWAS, with huge numbers of SNPs

• Sequence analysis, assembly, and mapping

• Testing and validating statistical models

• Protein folding and threading

• Molecular docking and compound screening

• Tomographic reconstruction

Page 3:

Characterization of Surface Protein 3 from the Malaria Parasite P. falciparum

Protein folding calculations with Rosetta++: 100,000 CPU hours

Tsai et al., Mol. Biochem. Parasitology, online preprint 2008

Page 4:

How to run multiple independent processes in parallel

[Diagram: 16 independent processes, each reading its own input, running a command, and writing its own output]
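One common way to set this up is a command file with one independent command per line, which swarm then runs in parallel. A minimal sketch of building such a file; `myprog` and the numbered input/output file names are hypothetical placeholders:

```shell
# Build a swarm command file: one independent command per line.
# "myprog" and the input/output names are illustrative assumptions.
rm -f cmdfile
for i in $(seq 1 16); do
    echo "myprog -i input${i}.dat -o output${i}.out" >> cmdfile
done
```

Because each line is self-contained, the 16 commands can run in any order and on any node.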

Page 5:

Biowulf Cluster Batch System

[Diagram: 16 separate batch scripts, each submitting one job (job1 through job16) that writes its own output file (job1.out through job16.out)]
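Without swarm, each of the 16 jobs needs its own batch script, submitted separately. A sketch of how such scripts might be generated, assuming a PBS-style scheduler; the `#PBS` directives and `myprog` are illustrative assumptions, not the cluster's actual configuration:

```shell
# Generate one batch script per job; each script runs a single command
# and writes its own output file.
for i in $(seq 1 16); do
    cat > "job${i}.sh" <<EOF
#!/bin/sh
#PBS -N job${i}
myprog -i input${i}.dat > job${i}.out
EOF
done
# Each script would then be submitted individually, e.g. with qsub job1.sh
```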

Page 6:

Swarm

biowulf% swarm -f file

[Diagram: swarm distributes job1 through job4 across Node 1 through Node 4, each job writing its own output file (job1.out through job4.out)]
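On the cluster, `swarm -f file` turns each line of the file into its own job. The fan-out can be sketched locally with background shell processes; this is a toy model of the behavior, not swarm itself:

```shell
# Toy model of swarm's fan-out: run each line of a command file as an
# independent background process, each with its own output file.
printf 'echo result1\necho result2\necho result3\necho result4\n' > cmdfile
n=0
while IFS= read -r cmd; do
    n=$((n + 1))
    sh -c "$cmd" > "job${n}.out" &   # one "job" per command line
done < cmdfile
wait   # block until every background job has finished
```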

Page 7:

Bundled Swarm

biowulf% swarm -f file -b 4

[Diagram: with -b 4, four commands are bundled into a single job (job1) on Node 1, producing one output file (job1.out)]
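With `-b 4`, swarm packs four commands into each job, and the commands inside a bundle run one after another. A toy sketch of that grouping using `split`; the `echo` commands stand in for real work:

```shell
# Toy model of bundling: group 8 commands into bundles of 4; commands
# inside each bundle run serially, so one job slot covers four commands.
for i in $(seq 1 8); do echo "echo task${i}"; done > cmds
split -l 4 cmds bundle_          # produces bundle_aa and bundle_ab
for b in bundle_aa bundle_ab; do
    sh "$b" > "${b}.out"         # run the bundle's commands back to back
done
```

Bundling trades some parallelism for far fewer scheduler entries, which matters when the command file has thousands of short-running lines.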

Page 8:

Swarm Facts

• Written and maintained by Helix Systems staff

• swarm introduced in late 2000

• 82% of all batch jobs run on the cluster since 2002 are swarm jobs

• ~60% of all wall time spent on swarm jobs

• swarm has been shared with clusters around the world

Page 9:

Swarm World Records

• Largest swarm: 683,445 commands

• Largest bundle: 24,000 commands per CPU

Page 10:

Future Challenges

• How to deal with larger multicore nodes?
