GENE 760: Genomic Methods for Genetic Analysis

Course Organizer: Jim Noonan: [email protected]

TAs: Richard Sarro: [email protected]; Rob Amezquita: [email protected]

GENE 760: Objectives

• Intro to analyses of genomic datasets
• High throughput sequencing applications
  - ChIP-seq
  - RNA-seq
  - Whole exome and whole-genome sequencing
  - Metagenomics
  - Big functional genomics datasets
• You do not need to have prior programming experience
• You will learn how to:
  - Work with massive datasets in a Linux HPC environment
  - Write your own scripts in Perl and R to parse files, run pipelines, do basic statistical analyses
  - Interpret genomics data to gain biological insights

Information we need from you

In a single email to [email protected]:

• Your name
• Your netID
• Your Grad School year
• Are you taking the course for Credit or are you an Auditor?
• Your level of experience with:
  - Working in a UNIX/Linux environment
  - High performance computing
  - Perl scripting
  - R
  - Any other programming language
  - High-throughput sequencing apps or data

Cluster Computing Basics

R D Bjornson, N J Carriero

Accessing Louise: You

• Run a program on your computer (“local”) to log in to louise (“remote”) over a network connection.
• The local computer must be on the Yale network:
  – A computer at Yale
  – Via VPN software
  – Via a login to a computer at Yale that allows external access, then log in from there to louise.
• The login program run on the local computer must support the secure shell protocol.
  – Linux: ssh
  – Mac OS X: Use Terminal or X11/xterm to create a command line session (“shell”), then ssh.
  – Windows: PuTTY (an ssh client) or cygwin (and then proceed as if you are using Linux).
• ssh [email protected]
• On first login, if prompted for a passphrase for an ssh key, just press “enter”. In general, unless you know what you are doing, leave ssh-related files alone (and do not change the permissions on your home directory!).
• Running GUIs involves understanding and using X11. It is built into Linux, distributed but not installed by default with Mac OS X, and a 3rd party add-on for Windows (e.g., cygwin).

Accessing Louise: Your Data

• Use scp or sftp (part of the ssh program suite) to copy files from local to remote and back. (FileZilla offers a GUI.)

• rsync can be useful for keeping a local and remote file hierarchy in sync.

• wget will allow you to retrieve a file via a URL from the command line. Useful for fetching reference files from repository sites (ENSEMBL, NCBI, UCSC).

Don’t Panic!

man: Describes how to use a command, e.g. man man
help: Information about frequently used “shell” commands.
info: New and improved (?) man; may provide more details.
locate: Find the location of a file (in common system areas).
which: Use to determine which version of a program will be used by default.

Note: User interface is hunt-and-peck not point-and-click!

Cluster Organization

Login nodes
  – Virtualized
  – Light use only

Compute nodes
  – Multicore, ~4GB DRAM per core. Parallel or concurrent execution is relatively easy using the cores of one node. More work to use the cores on multiple nodes. But in either case do not assume this will happen automatically.
  – Shared vs dedicated

File systems
  – Cluster wide (default), accessible over network
  – Local to node (direct connection)

Cluster Organization (Louise)

[Diagram components: login node 0/1, manager, switch, cluster-wide storage system, and compute nodes named Compute-XX-YY (e.g. Compute-9-16, Compute-22-2, Compute-28-10). Labels: ssh, qsub. Caption: “Don’t loiter in the lobby!”]

• 300+ users.
• 90 compute nodes for general use.
• Processor cores: 4 to 64 per compute node.

Resource Management

Need to explicitly allocate resources for computing
  – Interactive. For development; using interactive programs such as MATLAB®, python or R; and/or graphic rich tools (X11 forwarding)
  – Batch

Commands
  – qsub registers a request for resources:

    Interactive:
    qsub -I -X -l nodes=1:ppn=8 -q default
    (-I: interactive; -X: X11 forwarding, also use ssh -Y when you log in from your local computer; -l: the resource list)

    Batch:
    qsub FileWithOptionsAndCommands

  – qstat provides information about requests:

    qstat -1 -n -u njc2

Tools

Editor (emacs vs vi/vim)

emacs makes it possible to work directly with files 10s to 100s of MB, explore binary files, capture shell transcripts and review them, interactively navigate the file hierarchy, review file differences, etc.

Binary vs ASCII files.

file: Basic command to determine the kind of file.
od -c: Displays content byte by byte, permitting a detailed examination. Useful especially when dealing with DOS/Unix/Mac OS X end of line conflicts or looking for file corruption. Often used in a “shell pipe” with head.

Btw, do not use a “wysiwyg” editor such as Word or Wordpad for technical work: especially data preparation or code development.
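As a rough Python counterpart to the od -c check for end of line conflicts, the sketch below counts the three common line-ending styles in raw bytes. The function names (line_ending_counts, peek) are made up for illustration; nothing here comes from the course materials.

```python
def line_ending_counts(data: bytes) -> dict:
    """Count DOS (CR LF), Unix (LF) and old Mac (CR) line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    # Every CR LF pair also contains one LF and one CR, so subtract the pairs.
    lf = data.count(b"\n") - crlf
    cr = data.count(b"\r") - crlf
    return {"dos_crlf": crlf, "unix_lf": lf, "mac_cr": cr}


def peek(path: str, nbytes: int = 64) -> str:
    """Return a printable view of the first bytes of a file, escapes included."""
    with open(path, "rb") as handle:   # binary mode: no newline translation
        return repr(handle.read(nbytes))
```

A file that reports more than one non-zero count has mixed line endings, which is the usual sign that it passed through an editor on another operating system.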

Tools

ls , cd , mkdir: List directory contents, change directories, make a new directory. File hierarchy == tree of directories. A “path” is a series of nested directories written this way /dir0/dir1/dir2/file.

head , less , tail: See a couple of lines in an ASCII file. head and tail can be used to extract a small sample, e.g. to see the format of data in the file or to create test input (but this kind of sample is generally not representative). Often used with “pipes”. Use less to browse files (by line number or percentage).

split: One way to cope with large files (but virtual splitting can be more efficient: split will, at least temporarily, double the amount of file space used).

awk: Swiss army knife. Can do head/tail/split and much more:

awk 'NR%1009 == 13{print $0}' fullDataSet > sampleDataSet

python: An excellent interactive general purpose text processing and analysis environment (increasingly popular, but perl has a large lead).
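For comparison, here is a minimal Python sketch of the awk sampling one-liner above (keep every line whose number leaves remainder 13 when divided by 1009). The function names and default arguments are illustrative assumptions.

```python
def sample_lines(lines, step=1009, offset=13):
    """Yield every line whose 1-based number n satisfies n % step == offset."""
    for number, line in enumerate(lines, start=1):  # awk's NR is 1-based
        if number % step == offset:
            yield line


def sample_file(src_path, dst_path, step=1009, offset=13):
    """Stream src_path into dst_path, keeping roughly one line in `step`."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(sample_lines(src, step, offset))
```

Like the awk version, this reads the input once and never holds more than one line in memory, so it works on files far larger than RAM.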

Tools: bash scripting, redirection and pipes

When you log into a computer you are connected to a program. This program accepts the text you type and does “something” with it. If, for example, you type “ls”, the program first determines that “ls” is not something that it directly understands, so it next looks for another program on the computer called “ls”. If it finds it, it runs that program on your behalf and then reports the output. If it does not find it, it reports an error to that effect.

This class of program is generally referred to as “command shells”. It should be clear that the shell plays a critical role in the use of a cluster computer, and yet most users give the shell little or no thought. This generally comes back to haunt them in the form of subtle bugs that they are ill equipped to diagnose and correct, as well as missed opportunities to streamline workflow.
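The lookup described above can be imitated in a few lines of Python. This is only a sketch of the idea, not how bash is implemented: BUILTINS is a deliberately tiny, hypothetical stand-in for the commands a shell handles itself, and resolve is a made-up name.

```python
import os
import shutil

# Hypothetical, deliberately tiny subset of commands a shell handles itself.
BUILTINS = {"cd", "export", "source"}


def resolve(command, path=None):
    """Mimic the lookup: shell builtin first, then a search of PATH."""
    if command in BUILTINS:
        return "builtin"
    found = shutil.which(command, path=path)  # None when nothing matches
    if found is None:
        return f"{command}: command not found"
    return os.path.abspath(found)
```

Calling resolve("ls") on a typical Linux system returns the full path of the ls program, which is the same question the which command answers.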

Tools: bash

Consider a sequence of commands given to the bash shell (the default shell) :

gunzip data.gz
awk '/chr13/{print $0}' data > chr13Records
gzip data
myProgram -i chr13Records -o chr13Filtered
rm chr13Records
sort -k 2,2n < chr13Filtered > chr13Sorted
rm chr13Filtered

Note: stdin, stdout, stderr

Tools: bash

An alternative using bash “pipes”:

gunzip -c data.gz |
  awk '/chr13/{print $0}' |
  myProgram -i - -o - |
  sort -k 2,2n > chr13Sorted

Three advantages:
  – Less file system IO (extremely important in a cluster setting)
  – Less clean up (an issue when this sort of processing is done 100s or 1000s of times)
  – Better use of multicore machines (gunzip, awk, and myProgram can run concurrently).
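The same "no intermediate files" idea can be sketched in Python with a generator. This is an illustration under assumptions: the myProgram step is left out because its behavior is not specified, the pattern is matched as a plain substring, and column 2 is assumed to be numeric and whitespace-separated.

```python
import gzip


def filter_and_sort(lines, pattern="chr13"):
    """Keep lines containing `pattern`, then sort numerically on column 2."""
    # Generator: lines are filtered one at a time as they arrive.
    matches = (line for line in lines if pattern in line)
    # Sorting must see all surviving lines, just as sort(1) does.
    return sorted(matches, key=lambda line: float(line.split()[1]))


def chr13_sorted(gz_path, out_path):
    """Decompress, filter and sort without writing temporary files."""
    with gzip.open(gz_path, "rt") as src, open(out_path, "w") as dst:
        dst.writelines(filter_and_sort(src))
```

Only the filtered lines are held in memory; the uncompressed input never touches the file system.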

Tools: bash

Now suppose we have 100 data sets: dataSet00.gz ... dataSet99.gz.

A few notes about file naming:
• When working with a large number of files, it is easy to lose track of files or accidentally overwrite some, so choose a clear and informative scheme and stick to it. If >> 1000, use additional levels of directories.
• 0- vs 1- based indexing is a subtle point that you need to get comfortable with (you don’t have to use it yourself, but you will run into it sooner or later).
• Padding with leading 0’s compensates for dumb file sorting.

How can we easily process all of these sets?
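A small Python sketch of the padding point: plain string sorting keeps zero-padded names in numeric order but scrambles unpadded ones. The helper name dataset_names is made up for this example.

```python
def dataset_names(count, prefix="dataSet", suffix=".gz"):
    """Build 0-based, zero-padded names: dataSet00.gz ... dataSet99.gz."""
    width = len(str(count - 1))          # digits needed for the largest index
    return [f"{prefix}{i:0{width}d}{suffix}" for i in range(count)]


padded = dataset_names(100)
unpadded = [f"dataSet{i}.gz" for i in range(100)]

# Plain string sorting keeps padded names in numeric order ...
assert sorted(padded) == padded
# ... but puts dataSet10.gz before dataSet2.gz when padding is missing.
assert sorted(unpadded)[:3] == ["dataSet0.gz", "dataSet1.gz", "dataSet10.gz"]
```

The same ordering is what ls and shell wildcards produce, so padded names are processed in the order you expect.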

Tools: bash

for f in dataSet*.gz
do
  gunzip -c "$f" | awk '/chr13/{print $0}' | myProgram -i - -o - | sort -k 2,2n > "chr13Sorted_${f%.gz}"
done

Note: You can use an editor to create a file that contains a complex command or a command sequence and then have bash execute that file: source CommandFile

Parallelism

That may take a while. How can we use multiple processors to do it faster? SimpleQueue!

1. Produce a list of tasks to be executed (essentially the same loop as before, modified to display the commands to be executed rather than actually execute them):

for f in dataSet*.gz
do
  echo "cd $(pwd) && ( gunzip -c $f | awk '/chr13/{print \$0}' | myProgram -i - -o - | sort -k 2,2n > chr13Sorted_${f%.gz} ) > ${f}.out 2> ${f}.err"
done > Tasks

(\$0 is escaped because bash expands $0 inside double quotes, even within the inner single quotes; awk must see a literal $0.)

2. Create a batch script that directs the resource manager to allocate compute nodes and then uses the allocated nodes to work through the list of tasks:

sqPBS.py default 4.6 njc2 dataExtraction Tasks > sqBatchScript
(full path: /usr/local/cluster/software/installation/SimpleQueue/sqPBS.py)

3. Submit (can “|” to qsub too):

qsub sqBatchScript

4. Check output files and status information (SimpleQueue collects a great deal).

SimpleQueue

[Diagram: three Workers and a list of tasks, cd ... && blast ds 00 through cd ... && blast ds 06.]
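To make the worker picture concrete, here is a minimal single-node sketch in Python: a fixed number of workers pull shell command lines from a task list until none remain. It is not SimpleQueue's implementation, which distributes work across allocated cluster nodes; run_task and run_tasks are names invented for this example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_task(command):
    """Run one shell command line; return (command, exit status)."""
    done = subprocess.run(command, shell=True)
    return command, done.returncode


def run_tasks(commands, workers=3):
    """Let `workers` threads pull commands until the list is exhausted."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even though
        # tasks may finish in a different order.
        return list(pool.map(run_task, commands))
```

Given a Tasks file as produced in step 1, something like run_tasks([line.strip() for line in open("Tasks")]) would work through it. Threads are enough here because each task runs as a separate process.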

Aside: Random Number Generation

If you run a code that depends on random numbers, you must take care to ensure it does what you expect when you run it several times, perhaps concurrently on different nodes.

On the one hand, in general you will want each instance to see different random numbers. This may not happen by default.

On the other, you would like to be able to reproduce your results. Different but not too different!
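One common way to get "different but not too different" is to derive each instance's seed from a fixed base seed plus a task identifier. The sketch below uses Python's standard random module; the seed format and the function names are assumptions for illustration, not a course recommendation.

```python
import random


def task_rng(base_seed, task_id):
    """Return a generator whose stream depends on both seed and task id."""
    # random.Random accepts a string seed, so the base seed and the
    # task id can simply be combined into one label.
    return random.Random(f"{base_seed}:{task_id}")


def draws(base_seed, task_id, count=5):
    """First `count` random numbers seen by the given task."""
    rng = task_rng(base_seed, task_id)
    return [rng.random() for _ in range(count)]
```

Rerunning task 7 with the same base seed reproduces its numbers exactly, while tasks 7 and 8 see different streams. Recording the base seed alongside the output is what makes a run reproducible later.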

Parallelism: Pre-packaged

Thread based: Fairly common ("easy"-ish). Thread-based parallelism can only make use of the cores on one node.

Message passing based (MPI, PVM, …): Less common in bioinformatics. A message passing program can make use of the aggregate resources of many nodes.

“make” based: Illumina and one or two others. Limited to the cores of one node.

Parallelism: Pre-packaged

If you are using a 3rd party program, it is important to know which kind of parallelism is used and to invoke the program appropriately.

If threaded:
  I. Run on a dedicated node!
  II. Check docs for a number of threads parameter.

If MP, typically need to set up a special execution environment in order to run the program using the resources allocated. Unfortunately, this tends to vary with MP implementations and so has to be addressed on a case by case basis (ask RDB or NJC).

If “make”, invoke like this:

make -j N MakeTarget > make.out 2> make.err

where N is the number of cores to use.

Do It Yourself: Owner computes

It is possible to write your own parallel programs.

One strategy that RDB and NJC often use:
• Imagine that you run multiple copies of a sequential version.
• At some point, the copies will enter a period of execution in which the work can be split up into independent tasks. Add a check to decide which copy “owns” (and should execute) a given task; all other copies will skip this task.
• Each copy records the tasks it did. When it exits the period of execution that was split up, it exchanges with all other copies the results of the tasks it did. At this point all the copies know all the results and will continue to execute as if they had each done all of the work themselves.

The devil is in the details—especially the mechanisms used to settle ownership and to exchange task results. Ask us for help; just keep in mind that this kind of parallelism is an option and need not be terribly complex.
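A toy simulation of the owner-computes strategy, under stated assumptions: ownership is settled by a round-robin rule (task id modulo number of copies), and the exchange step is a simple dictionary merge. On a cluster each copy would be a separate process and the exchange would use messages or shared files; all names here are illustrative.

```python
def owner(task_id, n_copies):
    """Ownership rule assumed here: tasks are dealt out round-robin."""
    return task_id % n_copies


def run_copy(rank, n_copies, tasks, work):
    """One copy walks all tasks but executes only the ones it owns."""
    return {t: work(t) for t in tasks if owner(t, n_copies) == rank}


def exchange(partials):
    """Stand-in for the exchange step: merge every copy's results."""
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged


def simulate(n_copies, tasks, work):
    """Run the copies one after another and return the merged results."""
    partials = [run_copy(r, n_copies, tasks, work) for r in range(n_copies)]
    return exchange(partials)
```

After the exchange, every copy holds the same results a single sequential run would have produced, which is what lets each copy carry on as if it had done all the work.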

Software as an Experimental System

Start with “small” input sets and/or run parameters and systematically alter these to study how CPU time, memory use and IO activity vary from run to run.

Non-invasive tools:

top: May need a separate login to the allocated node (use intra-cluster ssh).

time command:

/usr/bin/time -v prog a0 a1 a2 > outFile 2> errFile

Output from time will be appended to “errFile”. Note: use the full path. This is an instance where it is important to understand how the shell works: a bare “time” is handled by bash itself (it is a shell keyword), and that version does not accept -v.

Software as an Experimental System

If you are in a position to modify code, you can get much more accurate and detailed information.

Ditto with profiling:
  – Compile time option plus post processing for C, C++, Fortran, …
  – Available as a runtime facility in various scripting systems (python, perl, ruby).
  – Activating profiling often significantly increases run time, placing a premium on the importance of well designed small test cases.
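As one example of a runtime profiling facility, the sketch below wraps a call in Python's standard cProfile module and returns the report as text. slow_sum is a toy workload invented for the example, and profile_call is a hypothetical helper name.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Toy workload: sum of squares computed in a plain Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)  # top 5 entries
    return result, buffer.getvalue()
```

Start with a small n: the profiler adds overhead to every function call, so a test case that takes seconds unprofiled can take much longer with profiling on.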

Scaling Considerations

Consider the time (in arbitrary “operation” units) to process N records, if doing:
• A record by record transform => Time(N)
• An all to all comparison => Time(N^2)
• An exploration of subsets => Time(2^N)
• An exploration of orderings => Time(N!)

One naturally tends to focus on run time, but memory and IO (amount as well as rate) matter too.

Scaling Considerations

What N corresponds to about 1 CPU second?
• Time(N) => 1,000,000,000
• Time(N^2) => 30,000
• Time(2^N) => 30
• Time(N!) => 13

What model applies clearly matters!

Scaling Considerations

It matters when determining how big a problem is feasible. Suppose we double the input size:
• Time(2*1,000,000,000) => ~2 s
• Time((2*30,000)^2) => ~4 s
• Time(2^(2*30)) => 1,000,000,000 s
• Time((2*13)!) => 10^17 s (roughly ten billion years)
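These figures can be checked with a short Python sketch. It assumes a budget of 10^9 operations per second, which is what the Time(N) row implies; the names are illustrative.

```python
import math

OPS_PER_SECOND = 10**9   # assumed budget: one billion operations per second

COST = {
    "N": lambda n: n,
    "N^2": lambda n: n * n,
    "2^N": lambda n: 2**n,
    "N!": lambda n: math.factorial(n),
}


def largest_n(cost, budget=OPS_PER_SECOND):
    """Largest N with cost(N) <= budget (cost must be increasing)."""
    hi = 1
    while cost(hi) <= budget:      # doubling search for an upper bound
        hi *= 2
    lo = hi // 2
    while hi - lo > 1:             # bisect between lo (fits) and hi (too big)
        mid = (lo + hi) // 2
        if cost(mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def seconds_after_doubling(cost, n):
    """Run time in seconds for an input of size 2*n."""
    return cost(2 * n) / OPS_PER_SECOND
```

The exact limits come out as 1,000,000,000; 31,622; 29; and 12. The last two sit just under the rounded 30 and 13 above because 2^30 and 13! slightly exceed 10^9.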

Scaling Considerations

It matters when verifying code behavior. If you have a code that you believe follows a Time(N) model, but empirically behaves like Time(N^2), then you may have a bug.

For example, code that maintains a list of values can easily degenerate to Time(N^2) if one is careless with the operations that maintain the list.
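A typical instance in Python: removing duplicates while testing membership against a list. Both functions below return the same answer, but the first rescans the list for every value. The function names are made up for the example.

```python
def unique_quadratic(values):
    """Keep first occurrences; `in` on a list scans it, so ~N^2 overall."""
    seen = []
    for value in values:
        if value not in seen:      # linear scan of everything kept so far
            seen.append(value)
    return seen


def unique_linear(values):
    """Same output, but membership tests hit a set: ~N overall."""
    seen = set()
    ordered = []
    for value in values:
        if value not in seen:      # average constant-time hash lookup
            seen.add(value)
            ordered.append(value)
    return ordered
```

Timing both on inputs of doubling size is a quick empirical check: run time roughly doubles for the second version and roughly quadruples for the first when most values are distinct.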

Other Performance Considerations

Memory hierarchy: Do as much as you can with one record before moving on to the next.

Physical vs Virtual Memory: When chunking work, size to fit in physical memory.

Local vs remote IO: If you cannot eliminate temporary IO via bash pipes or named pipes, at least write to a local file system (but clean up!).

Bulk IO vs character IO: Mostly done for you, but avoid IO operations that read or write one byte or character at a time.

Data IO vs metadata operations: Metadata operations are much more expensive than normal data IO. Avoid them. E.g., don’t use a series of specially named empty files to indicate progress, write progress updates to a log file instead.
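To illustrate the bulk vs character IO point, the Python sketch below counts newlines in a binary stream two ways. Both give the same answer; the second issues one read call per byte, which is the pattern to avoid. The function names and the 1 MiB chunk size are illustrative choices.

```python
import io


def count_lines_bulk(stream, chunk_size=1 << 20):
    """Count newline bytes using large reads (1 MiB by default)."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:               # empty bytes object means end of file
            break
        total += chunk.count(b"\n")
    return total


def count_lines_bytewise(stream):
    """Same result, one read() call per byte: the pattern to avoid."""
    total = 0
    while True:
        byte = stream.read(1)
        if not byte:
            break
        total += byte == b"\n"
    return total


sample = io.BytesIO(b"a\nbb\n\nccc")
print(count_lines_bulk(sample))
```

With a real file, pass a handle opened in binary mode, e.g. open(path, "rb"). Buffering inside Python and the operating system hides part of the byte-at-a-time cost, but the per-call overhead remains.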