
Page 1: JGI/NERSC New Hardware Training

JGI/NERSC New Hardware Training

Kirsten Fagnan, Seung-Jin Sul
January 10, 2013

Page 2: JGI/NERSC New Hardware Training

Overview

• New hardware structure (# of nodes, cores, cores per socket)

• Exclusive use of a node – what does that mean
• Running serial (single-core) jobs on the exclusive nodes
  – Python
  – TaskFarmerMQ

• Hands-on testing/work

Page 3: JGI/NERSC New Hardware Training

Genepool Components

• 450 SGI commodity nodes – 8 slots, 48 GB of memory
• 222 Appro commodity nodes (new hardware) – 16 physical cores, 120 GB of memory
• 8 × 240 GB nodes, 9 × 500 GB nodes, 3 × 1000 GB nodes, 1 × 2 TB node
• 20 × 120 GB, 8-slot nodes (x4170) – high-priority nodes

Page 4: JGI/NERSC New Hardware Training

New Commodity Node Layout

• 120G of memory
• 16 physical cores (2 sockets - NUMA)
• 16 virtual cores (hyperthreading)
• 1.8 TB of local disk

Page 5: JGI/NERSC New Hardware Training

New High Memory Node Layout

• 5 500G nodes, 2 1000G nodes (why not 512 and 1024??)
• 32 physical cores (4 sockets - NUMA)
• 32 virtual cores (hyperthreading)
• 3.6 TB of local disk

Page 6: JGI/NERSC New Hardware Training

NUMA – Non-Uniform Memory Access

• There is a memory hierarchy on each node, so a thread will not have uniform access time to different blocks of memory (a quick way to inspect the layout is sketched below)

Image from - http://venthusiast.com/numa-non-uniform-memory-access/
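To see how cores and memory are grouped on one of these nodes, something like the following can be run on a compute node. This is only a minimal sketch; it assumes the numactl utility is installed (which may not be the case everywhere), and ./my_app is a placeholder for your own program.

numactl --hardware
# Prints the NUMA topology: how many NUMA nodes (sockets) there are, which CPUs
# and how much memory belong to each, and the relative access distances.

numactl --cpunodebind=0 --membind=0 ./my_app
# Pins the program's CPUs and memory to NUMA node 0 only, avoiding slower
# remote-memory accesses.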

Page 7: JGI/NERSC New Hardware Training

Hyperthreading

• 16 physical cores + 16 virtual cores means that you can run applications with up to 32 threads.

• We have done some experiments with hyperthreading on and off and didn't see any negative effects, but very few codes showed appreciable speed-up. A simple way to test this with your own code is sketched below.
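As a rough sketch (it assumes a threaded application that reads its thread count from OMP_NUM_THREADS; the binary name a.out is just a placeholder), you could time a run at 16 and at 32 threads on one of the new nodes and compare:

#!/bin/bash
# Compare wall time using physical cores only vs. using the hyperthreads too.
export OMP_NUM_THREADS=16      # one thread per physical core
time ./a.out > out.16threads

export OMP_NUM_THREADS=32      # also use the 16 virtual (hyperthreaded) cores
time ./a.out > out.32threads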

Page 8: JGI/NERSC New Hardware Training

How are the old and new systems connected?

Page 9: JGI/NERSC New Hardware Training

NERSC Machine Room

Page 10: JGI/NERSC New Hardware Training

How do I access the new nodes?

Users still specify the following parameters:
• Wallclock limit (-l h_rt=HH:MM:SS)
• # of cores/nodes (-pe ...)
• Amount of memory per core (-l ram.c=16G)

The new hardware has 120 GB of memory; if you request more than 48 GB of memory, your job will be routed to the new hardware.

#!/bin/bash
#$ -l h_rt=12:00:00,ram.c=100GB
#$ -pe pe_slots 16
#$ -N whole_node_serial_test

Can run up to 16 MPI tasks or with 16 threads.

#!/bin/bash
#$ -l h_rt=12:00:00,ram.c=100GB
#$ -N whole_node_mpi_test
#$ -pe pe_1 4   ## requesting 4 whole nodes, can run up to 16*4 MPI tasks
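For reference, a script like the one above is submitted and monitored with the usual SGE commands. This is just a sketch; the file name whole_node_mpi_test.q is a placeholder.

genepool01:$ qsub whole_node_mpi_test.q   # submit the job script
genepool01:$ qstat -u $USER               # check the job's state in the queue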

Page 11: JGI/NERSC New Hardware Training

What about run time?

• There are 50 commodity nodes that can run long jobs (>12 hours); all the high memory nodes can run long jobs

• The remaining nodes can run jobs with up to a 12-hour wallclock

Page 12: JGI/NERSC New Hardware Training

Exclusive use of the node

• I/O from this node will only be done by your job; you don't need to share the 1 Gb Ethernet connection with anyone else

• 16 processors, 16 virtual cores (can test the benefit of hyperthreading with your code)

• You can use up to 120G (more on the highmem nodes). A quick way to confirm what the node gives you is sketched below.
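A minimal sketch for confirming, from inside a running job, that you really have the whole node. It assumes the scheduler sets TMPDIR to the node-local scratch area, which may not hold on every system.

#!/bin/bash
nproc             # number of logical cores visible to the job (32 with hyperthreading on)
free -g           # total and available memory in GB
df -h "$TMPDIR"   # space on the node-local disk, assuming TMPDIR points there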

Page 13: JGI/NERSC New Hardware Training

Want to take advantage of all 16 cores, but how?

[Diagram: Task 1, Task 2, Task 3, …, Task 15, Task 16 laid out across the 16 cores of a node]

Page 14: JGI/NERSC New Hardware Training

Running 16 serial tasks - Python

You can use Python's mpi4py module to launch multiple serial jobs. Below is a sample python script, 'mwrapper.py':

#!/usr/bin/env python
from mpi4py import MPI
from subprocess import call
import sys

exctbl = sys.argv[1]                    # serial executable to run, e.g. a.out
comm = MPI.COMM_WORLD
rank = comm.Get_rank()                  # this task's MPI rank (0..15)
myDir = "dir" + str(rank).zfill(2)      # each rank works in its own directory: dir00, dir01, ...
cmd = "cd " + myDir + " ; " + exctbl + " < infile > outfile"
sts = call(cmd, shell=True)             # run the serial program in that directory
comm.Barrier()                          # wait until every rank has finished

Page 15: JGI/NERSC New Hardware Training

Running 16 serial tasks - Python

Below is a batch script to use it for a serial program, a.out:

#!/bin/bash -l
#$ -l h_rt=12:00:00
#$ -pe pe_slots 16
#$ -l ram.c=7680MB
#$ -cwd

module load python
module load openmpi

aprun -n 16 mwrapper.py a.out
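Since the script loads the openmpi module, an equivalent launch with OpenMPI's own mpirun is sketched here as an alternative; this line is not from the slide and assumes mwrapper.py sits in the working directory:

mpirun -n 16 python ./mwrapper.py a.out   # starts 16 ranks; each runs a.out in its own dirNN directory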

Page 16: JGI/NERSC New Hardware Training

Running 16 Serial tasks - TaskfarmerMQ

[Diagram: the user task list is read by the client ($ tfmq-client -i task.lst), which feeds task_1, task_2, ..., task_t through RabbitMQ to workers tfmq-worker_1, tfmq-worker_2, ..., tfmq-worker_n; each worker fork()s a process (P) to run its task and reports status back. Workers can be added at any time and reused.]

Example task list entries:

/jgi/tools/bin/blastall -b 100 -v 100 -K 100 -p blastn -S 3 -d ./data/hs.m51.D4.diplotigs+fullDepthIsotigs.fa -e 1e-10 -F F -W 41 -i ./data/blast_query_1_160.fna -m 8 -o ./out-blastn/test1.m8.bout:/project/projectdirs/genomes/sulsj/test/2012.10.08-taskfarmer-mq/task_version/out-blastn:test1.m8.bout:0

/jgi/tools/bin/blastall -b 100 -v 100 -K 100 -p blastn -S 3 -d ./data/hs.m51.D4.diplotigs+fullDepthIsotigs.fa -e 1e-10 -F F -W 41 -i ./data/blast_query_1_160.fna -m 8 -o ./out-blastn/test2.m8.bout:/project/projectdirs/genomes/sulsj/test/2012.10.08-taskfarmer-mq/task_version/out-blastn:test1.m8.bout,test2.m8.bout:0

Page 17: JGI/NERSC New Hardware Training

TaskfarmerMQ Client/Worker Usage

tfmq-client -i <user task file> [-q user_specified_queue_name] [-w reuse_workers]

• -i, --tf: user task list file
• -q, --tq: user-specified queue name (*NOTE: if you set your queue name with this option, you SHOULD set the same queue name when you start the worker using "-q/--tq").
• -w, --reuse: worker termination option. If set to "0" (default), all workers will be terminated after completion; if set to "1", all workers will stay running for other tasks.

tfmq-worker [-q,--tq user-specified_queue_name]

The "-q/--tq" option sets a user-defined queue name. If you set a custom queue name when running tfmq-client, you SHOULD set the same name when you run the worker.

ex) User-defined queue name

$ tfmq-client -i task1.lst -q mytaskqueuename1
$ tfmq-worker -q mytaskqueuename1

Page 18: JGI/NERSC New Hardware Training

TaskfarmerMQ Task List Example

Task list format: <user command>:<output directory>:<list of output files>:<done flag>

A sketch for generating a list in this format automatically follows the examples below.

blastall -b 100 -v 100 -K 100 -p blastn -S 3 -d ./data/db.fa -e 1e-10 -F F -W 41 -i ./data/input1.fna -m 8 -o ./out-blastn/test1.m8.bout:./out-blastn:test1.m8.bout:0

blastall -b 100 -v 100 -K 100 -p blastn -S 3 -d ./data/db.fa -e 1e-10 -F F -W 41 -i ./data/input2.fna -m 8 -o ./out-blastn/test2.m8.bout:./out-blastn:test1.m8.bout,test2.m8.bout:0

blastall -b 100 -v 100 -K 100 -p blastn -S 3 -d ./data/db.fa -e 1e-10 -F F -W 41 -i ./data/input3.fna -m 8 -o ./out-blastn/test3.m8.bout:./out-blastn:test4.m8.bout:0
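As a rough sketch, a task list like the one above could be generated with a small shell loop. The file names, the database db.fa, and the out-blastn directory are taken from the examples above; the loop itself is not from the slides.

#!/bin/bash
# Write one TaskFarmerMQ task line per query file: <command>:<outdir>:<outputs>:<done flag>
mkdir -p out-blastn
for f in ./data/input*.fna
do
  base=$(basename "$f" .fna)
  cmd="blastall -b 100 -v 100 -K 100 -p blastn -S 3 -d ./data/db.fa -e 1e-10 -F F -W 41 -i $f -m 8 -o ./out-blastn/${base}.m8.bout"
  echo "${cmd}:./out-blastn:${base}.m8.bout:0"
done > task.lst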

Page 19: JGI/NERSC New Hardware Training

TaskfarmerMQ Task Examples

1. A case where I have a list of tasks that each require 1 core and 7680MB of memory
- Step 1 - Fire up a client with the name of the queue that I want:

tfmq-client -i task7680.lst -q my7680MBqueue

2. A case where I have a list of tasks that each require 1 core and 15G of memory

3. A case where I have a list of tasks that each require 1 core and 30G of memory

Page 20: JGI/NERSC New Hardware Training

Example 1 – my task list is full of jobs that need 7.5 GB (7680 MB) of memory and 1 core each. To run these on Genepool, create a batch script, in this case called submit_16workers.q:

#!/bin/sh
#$ -N taskfarmermq_test
#$ -l h_rt=12:00:00
#$ -pe pe_slots 16
#$ -l ram.c=7680MB
#$ -cwd

for i in {1..16}
do
  tfmq-worker -q my7680MBqueue &
done
wait

Submit the job:

genepool01:$ qsub submit_16workers.q

Note: we only specify memory, slots, and runtime to route our jobs!

Page 21: JGI/NERSC New Hardware Training

Example 1 – my task list is full of jobs that need 7.5 GB (7680 MB) of memory and 1 core each. To run these on Genepool, create a batch script, in this case called submit_16workers.q:

#!/bin/sh
#$ -N taskfarmermq_test
#$ -l h_rt=12:00:00
#$ -pe pe_slots 16
#$ -l ram.c=7680MB
#$ -cwd
## Running on the gpint ## tfmq-client -i task1.lst -q my7680MBqueue

for i in {1..16}
do
  tfmq-worker -q my7680MBqueue &
done
wait

Submit the job:

genepool01:$ qsub submit_16workers.q

The name of the queue for the client and worker needs to be the same.

Page 22: JGI/NERSC New Hardware Training

Example 1 – my task list is full of jobs that need 7.5 GB (7680 MB) of memory and 1 core each. To run these on Genepool, create a batch script, in this case called submit_16workers.q:

#!/bin/sh
#$ -N taskfarmermq_test
#$ -l h_rt=12:00:00
#$ -pe pe_slots 16
#$ -l ram.c=7680MB
#$ -cwd
## Running on the gpint ## tfmq-client -i task1.lst -q my7680MBqueue

for i in {1..16}
do
  tfmq-worker -q my7680MBqueue &
done
wait

Submit the job:

genepool01:$ qsub submit_16workers.q

There are 16 cores on a node, so I can have 16 workers.

Page 23: JGI/NERSC New Hardware Training

TaskfarmerMQ Task Examples

1. A case where I have a list of tasks that each require 1 core and 7.5G of memory

2. A case where I have a list of tasks that each require 1 core and 15G of memory

- Step 1 - Fire up a client with the name of the queue that I want:

tfmq-client -i task15.lst -q my15GBqueue

3. A case where I have a list of tasks that each require 1 core and 30G of memory

Page 24: JGI/NERSC New Hardware Training

Example 2 – my task list is full of jobs that need 15 GB of memory and 1 core each; to run these on Genepool I can only use 8 cores. Create a batch script, in this case called submit_8workers.q:

#!/bin/sh
#$ -N taskfarmermq_test
#$ -l h_rt=12:00:00
#$ -pe pe_slots 8
#$ -l ram.c=15G
#$ -cwd

for i in {1..8}
do
  tfmq-worker -q my15GBqueue &
done
wait

Submit the job:

genepool01:$ qsub -t 1-10 submit_8workers.q

In this case we have only 8 workers (120G/15G = 8).

Page 25: JGI/NERSC New Hardware Training

TaskfarmerMQ Task Examples

1. A case where I have a list of tasks that each require 1 core and 7.5G of memory

2. A case where I have a list of tasks that each require 1 core and 15G of memory

3. A case where I have a list of tasks that each require 1 core and 30G of memory- Step 1 - Fire up a client with the name of the queue that I want:

tfmq-client -i task30.lst -q my30GBqueue

Page 26: JGI/NERSC New Hardware Training

Example 3 – my task list is full of jobs that need 30 GB of memory and 1 core each, so I can only use 4 cores. Create a batch script, in this case called submit_4workers.q:

#!/bin/sh
#$ -N taskfarmermq_test
#$ -pe pe_slots 4
#$ -l ram.c=30G
#$ -cwd

for i in {1..4}
do
  tfmq-worker -q my30GBqueue &
done
wait

Submit the job:

genepool01:$ qsub -t 1-10 submit_4workers.q

Page 27: JGI/NERSC New Hardware Training

Example 3 – my task list is full of jobs that need 30 GB of memory and 1 core each, so I can only use 4 cores. Create a batch script, in this case called submit_4workers.q:

#!/bin/sh
#$ -N taskfarmermq_test
#$ -pe pe_slots 4
#$ -l ram.c=30G
#$ -cwd

for i in {1..4}
do
  tfmq-worker -q my30GBqueue &
done
wait

Submit the job:

genepool01:$ qsub -t 1-10 submit_4workers.q

You can also run with task arrays to increase the number of workers available to a particular queue.
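As a rough sketch of how the task array scales the worker pool (the numbers come from Example 3 above; the gpint prompt is a placeholder for whichever interactive node you use):

## On a gpint: start one client that owns the queue and feeds it the task list.
gpint:$ tfmq-client -i task30.lst -q my30GBqueue

## On Genepool: qsub -t 1-10 launches 10 copies of the batch script, each
## starting 4 workers, so up to 10 x 4 = 40 workers pull from my30GBqueue.
genepool01:$ qsub -t 1-10 submit_4workers.q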

Page 28: JGI/NERSC New Hardware Training

Summary

The JGI now has access to almost 2x the computing power that was available before the break.

To access the new hardware, just request between 48G and 240G of memory and your jobs will be routed to those nodes.

In an effort to keep jobs scheduling efficiently for all users, we are scheduling the new nodes a whole node at a time. This will also make it easier for users to debug workflows and should enable jobs to complete more consistently.

There are tools available (Python, TaskFarmerMQ) that will enable users with serial jobs to take advantage of the new hardware.

Page 29: JGI/NERSC New Hardware Training

Hands-on section