Introduction to Linux and HPC

Presented by: Al Ritacco, Shailender Nagpal


AGENDA

• Introduction to Linux
• How to request an HPC account
• How to log in to HPC
• Basic Linux commands
• Available resources
• How to submit a job to the cluster


What is UNIX

• Unix is an operating system (OS) for computers, just as Microsoft "Windows" is
  – Runs on many computer "servers" and provides a multi-user, multi-tasking environment
  – Orchestrates the various parts of the computer (the processor, the on-board memory, the disk drives, keyboards, video monitors, etc.) to perform useful tasks
• The Unix operating system comprises three parts: the kernel (with commands to interact with it), standard utility programs/services, and system configuration files

What is Linux

• Linux is a free, Unix-like operating system (informally, a “souped-up” Unix) that provides additional user-friendly programs
  – Both a command line interface (CLI) and a graphical user interface (GUI) are available to execute commands
• What exactly does this mean?
  – It means we can install and run scientific software as well as business applications

Why Unix/Linux?

• UNIX is good for automating computer tasks:
  – Performing complex operations with very few keystrokes
  – Operating on a large number of objects, e.g.:
    • Parsing file contents (pattern matching)
    • Manipulating text files containing scientific data
• UNIX is fast
• Linux (≈ UNIX) is free and runs on PCs and Macs, plus specialty hardware for mobile devices
• Much scientific software is freely available on Linux


Getting an account

• To get started on the UMass Linux servers, you need an account. Fill out this form:
    https://ghpcc06.umassrc.org/hpc/index.php
• Your PI has to authorize the request
• To connect to the HPC server from Windows, use the PuTTY client; from a Mac, use SSH
    http://wiki.umassrc.org/wiki/index.php/Connecting_to_the_Cluster

Working on a Linux Computer

• Linux as a personal workstation
• Linux/Unix as a central “server” (multi-user)
  – Three pieces of information are needed to connect: user name, password, and server name or IP address
• “PuTTY” on Windows can be used to connect to UMass Research Computing servers
  – Remote login may not allow for displaying graphics (text mode interaction only)
  – Graphics or "X" can be displayed using special tools (Xming)


Logging into Linux

• Why do we need to log in?
  – Tracking who can log in and what access they have
• Logging in
  – Use SSH client software
  – Log in to a particular server, which has a designated name:
    • Ex: hpcc01.umassmed.edu, ghpcc06.umassrc.org
    • User credentials: user name, password
  – SSH client for Windows: PuTTY
  – SSH client for Mac/Linux: Terminal

Connect to UMass Servers


How do I interact with Linux?

• Using a command line interface (CLI), where we explicitly type commands and have Linux execute them (using a command shell)
• What is a command shell?
  – A program that interprets the commands we wish to have executed by Linux
• Enter “bash”
  – Bourne Again Shell

Logging out of Linux

• Logging out of Linux:
  – To end your session, use the “exit” command from the command prompt:
      [username@hpcc02 ~]$ exit
      Connection to hpcc02 closed
• You can also use the key sequence <Ctrl>+D to close a session


Before we begin learning…

• We will use the terms Linux and UNIX interchangeably
• Many variants of Linux exist: Red Hat, Ubuntu, CentOS, Debian, etc.
• Commands will be exactly (or almost exactly) the same across Linux distributions
• Most of the commands we will be covering are applicable to other *NIX-based operating systems

Files and Linux

• Linux users work with
  – Applications
  – Files
• There are several file types defined for different kinds of usage in Linux
  – Basic files: text or binary files (sequence files, etc.)
  – Executable files (programs), such as bowtie, gate, ls, cp, cd, etc.

Things we need to do on a shell

• Just like with a Windows PC, users need to:
  – Create, edit, move, rename and delete files
  – Organize files into folders and navigate the filesystem
  – Organize users and control permissions for what they can see and do
  – View and manage processes and services
  – Install and run programs and work with their output
• In Linux, you have to learn "commands" to get the above things done, running them on the "shell" or "command line"

Filesystem: Relative and Absolute Path

• The Linux file system is hierarchical and resembles a tree structure

• A user in the “admin” directory can access the “steve” directory by specifying the relative path “steve” or the absolute path “/users/admin/steve”. Similarly, “users” (the parent of “admin”) can be accessed by specifying “..” or “/users”
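The relative/absolute distinction can be tried safely in a scratch area. The sketch below is only an illustration: it rebuilds the users/admin/steve tree under a temporary directory (so the "absolute" paths begin with that temporary location rather than a real "/users"), and the variable names are made up for the example.

```shell
# Build a throwaway tree mirroring the slide's example (illustrative names only).
root=$(mktemp -d)
mkdir -p "$root/users/admin/steve"

cd "$root/users/admin"          # start in the "admin" directory

cd steve                        # relative path: resolved from the current directory
rel_result=$(pwd -P)

cd "$root/users/admin/steve"    # absolute path: resolved from the top of the tree
abs_result=$(pwd -P)

cd "$root/users/admin"
cd ..                           # ".." is the parent directory, here "users"
parent_result=$(pwd -P)

echo "relative: $rel_result"
echo "absolute: $abs_result"
echo "parent:   $parent_result"
```

Both "cd" forms end up in the same directory; the relative form simply depends on where you start.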

Linux Layout

• Linux commands are typically installed under:
  – /bin          Linux commands
  – /sbin         Typical system commands
  – /usr/bin      User-level commands (editors, etc.)
  – /share/bin    Specific cluster software
  – /share/apps   Specific genome-based cluster tools

Basic command structure

• The basic form of a Unix command is:
    command [-options] [arguments]
• Example: ls -l /tmp
  – “ls” is the command. It lists the contents of a directory
  – “-l” is the option (also called a flag), which modifies the default behavior of the command. Try plain “ls” to compare
  – “/tmp” is the argument. The contents of this directory are shown
• Aborting a shell command
  – Most Unix systems let you abort the current command by typing Control-C

Note on Linux and commands

• Linux commands are case sensitive, so:
  – Exit is not the same as exit
  – Bowtie is not the same as BOWTIE
  – Gate is not the same as gatE
• In Linux we use / as the directory separator
  – In Windows we use \ as the directory separator
• Linux file names can be descriptive and do not require a file extension

Basic Linux commands (List 1)

ls       List the contents of a directory
cp       Copy file(s)
rm       Remove file(s)
mv       Move file(s)
cd       Change location to another directory
mkdir    Make a new directory
pwd      Display the path of the current directory
rmdir    Remove a directory
cat      Display the contents of a file

Basic Linux commands (List 1 ..contd)

head     Display the beginning of a file
tail     Display the end of a file
clear    Clear the shell window
vi       Open a file for editing in the vi editor
passwd   Change the password
less     Display the contents of a file with scrolling
more     Display the contents of a file with scrolling
history  Display the history of commands executed

Basic Linux commands (List 1 ..contd)

date     Display the current date and time
who      Display who is currently logged in
whoami   Display your username
last     Display recent login activity
exit     Exit the shell
wc       Count lines, words and characters in a file
grep     Search for a string pattern in a file
man      Display the “manual” page for a chosen command

Determining Present Working Directory with “pwd”

• When users log in, they are placed in their HOME directory, which is usually under the “/home” directory
• The Linux shell account name and the home directory name are usually the same, so “/home/snagpal” would be the home directory for user “snagpal”
• As users navigate the filesystem, they can confirm where they currently are by running the “pwd” command
    [snagpal@u15982204 ~]$ pwd
    /home/snagpal
• In Windows, you can see the same information in the Windows Explorer address bar

Changing directories with “cd”

• Often, users need to go to another directory that is:
  – a sub-directory, reached below the present working directory in the tree hierarchy
  – a super-directory, reached through the parent of the present working directory
• In both cases, absolute and relative paths can be used. Let's say the user is currently in “/home/snagpal” and needs to access:
  – A sub-directory of the home directory
      cd linuxcourse
      cd /home/snagpal/linuxcourse
  – A directory reached through the super-directories (parents) of the home directory
      cd ../../usr/local
      cd /usr/local

Listing files and directories with “ls”

• “ls” lists files and sub-directories in a chosen directory. Windows Explorer offers a rich, graphical equivalent
  – To list files in the current directory
      ls
  – To list files in another directory (absolute path)
      ls /usr/local
  – To change the default view of the output to a long list
      ls -l /usr/local

Making Directories with “mkdir”

• To create new sub-directories in the home folder or elsewhere on the filesystem, use the “mkdir” command
• Absolute or relative paths can be specified
    mkdir linuxcourse
    mkdir /home/snagpal/linuxcourse

Removing Directories with “rmdir”

• To remove directories in the home folder or elsewhere on the filesystem, use the “rmdir” command (it only removes empty directories)
• Absolute or relative paths can be specified
    rmdir linuxcourse
    rmdir /home/snagpal/linuxcourse

Copying, Moving and Removing files

• Users who need to duplicate a file can do so with the “cp” command. It requires the source and destination locations to be specified (absolute or relative path)
    cp /share/training/linux/test.txt /home/snagpal
    cp /share/training/linux/test.txt .
• The dot “.” represents the current working directory. Copying leaves the file in its original source location. Moving with “mv” removes it from the source location, and can also be used to rename files
    mv /share/training/linux/test.txt /home/snagpal/file.txt
    mv /share/training/linux/test.txt file.txt
• To remove a file, use “rm”
    rm test.txt
    rm /home/snagpal/file.txt
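To see the difference between copying, moving and removing without touching real data, the commands can be rehearsed in a temporary directory. This is a minimal sketch; the "source" directory and its test.txt are stand-ins created just for the example, not the training files on the cluster.

```shell
work=$(mktemp -d)
cd "$work"
mkdir source
echo "sample text" > source/test.txt

cp source/test.txt .            # copy into the current directory; the original stays
mv test.txt file.txt            # rename the copy in place
ls source                       # still lists test.txt
rm file.txt                     # delete the renamed copy
remaining=$(ls)                 # only the "source" directory is left
echo "$remaining"
```

Note that "rm" deletes immediately; there is no recycle bin to recover the file from.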

File Naming conventions in Linux

• To name files and directories, use:
  – characters A-Z, a-z
  – numbers 0-9
  – period .
  – dash -
  – underscore _
• Avoid file and directory names containing shell metacharacters, such as: \ / < > ! $ % ^ & * | { } [ ] " ' ` ; ~

Creating and editing files

• Linux has many text editors. The most common is “vi”, but “emacs”, “pico” and “nano” can also be installed
• The most common syntax is:
    vi newfile.txt         # Creates a new file
    vi existingfile.txt    # Opens an existing file
• “vi” checks whether the named file exists. If it does, its contents are displayed. If not, a new, empty file with that name is opened (it is written to disk when you save)
• By default, “vi” opens in command mode. Users can scroll in the file (up, down, page up, page down), move the cursor, delete lines, undo, etc.
• To enter the “write” or “insert” mode for adding text, press the “i” or “a” key. To leave insert mode, press the “Esc” key

The “vi” editor (…contd)

• To exit the “vi” editor and return to the Linux prompt, first return to command mode by pressing the “Esc” key. Then press “:” to enter command-line mode and type one of:
    wq    saves the current changes and exits vi
    w!    saves the current changes but doesn’t exit vi
    q!    exits vi without saving any changes
• There are many more commands available in command mode and command-line mode. A vi tutorial is suggested

Searching for patterns in text with “grep”

• “grep” searches line by line for a specified pattern and outputs every line that matches. The basic syntax is: grep [options] pattern [files]
    cp /share/training/linux/seq.fasta .
    grep ">" seq.fasta
    grep TCGAAGA seq.fasta
• “grep” has many options and can also search using regular expressions (patterns that describe the characteristics of one or more strings), e.g. te?xt, .*omics (use “grep -E” for extended syntax such as “?”)
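A small, self-contained run can make the pattern matching concrete. The FASTA content below is invented for the example (it is not the training file), and the last command uses the "te?xt" pattern mentioned above with "grep -E" so that "?" means "zero or one of the preceding character".

```shell
work=$(mktemp -d)
cd "$work"
# A tiny made-up FASTA file: header lines start with ">"
printf '>seq1 demo\nACGTTCGAAGAT\n>seq2 demo\nGGGCCCATAT\n' > seq.fasta

grep ">" seq.fasta                       # print the header lines
headers=$(grep -c "^>" seq.fasta)        # -c counts matching lines instead of printing them
hits=$(grep -c "TCGAAGA" seq.fasta)      # lines containing the literal pattern

# Extended regular expression: "te?xt" matches both "text" and "txt", but not "test"
regex_hits=$(printf 'text\ntxt\ntest\n' | grep -c -E "te?xt")

echo "headers=$headers hits=$hits regex_hits=$regex_hits"
```

Quoting the pattern is a good habit: characters such as ">" and "*" otherwise have a special meaning to the shell.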

Counting words in file with “wc”

• The “wc” command counts lines, words and characters in a file
    cp /share/training/linux/abstract.txt .
    cat abstract.txt
    wc abstract.txt
    wc -l abstract.txt

Text processing Linux Commands

$ head -2 file_name                     List the first two lines
$ tail -2 file_name                     List the last two lines
$ head -5 file_name | tail -1           List the fifth line
$ cat file_name | head -50 | tail -1    List the 50th line
$ cat file | sort -rn | tail -5         List the last 5 items (sorted in reverse numerical order)
$ sort -rn file | uniq -c               Sort a file, and count the number of occurrences of each line
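The combinations above can be checked on a small file where the answers are easy to verify by eye. This is a sketch with invented data (numbers.txt is created on the spot); it uses the "-n" spelling of the line-count option, which is equivalent to the shorter "head -2" style shown above.

```shell
work=$(mktemp -d)
cd "$work"
printf '3\n1\n2\n3\n1\n3\n' > numbers.txt

first_two=$(head -n 2 numbers.txt)            # first two lines: 3 and 1
third=$(head -n 3 numbers.txt | tail -n 1)    # third line only
counts=$(sort -n numbers.txt | uniq -c)       # sort first so duplicates are adjacent, then count

# Most frequent value: count, order by count (descending), keep the top row's value
top=$(sort -n numbers.txt | uniq -c | sort -rn | head -n 1 | awk '{print $2}')

echo "third line: $third"
echo "most frequent value: $top"
echo "$counts"
```

"uniq" only collapses adjacent duplicate lines, which is why the data is sorted before it is counted.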

Miscellaneous commands

• Displaying the current date and time with "date"
    date
• Clearing the terminal with "clear"
    clear
• Displaying the history of commands with "history"
    history

Getting Help in Unix

• Use the “man” command, followed by the name of the command you need help with
  – Type “man ls” to see the manual page for the "ls" command
      man ls

User convenience features

• Shell tab completion with suggestions
• Shell expansion of wild-cards for specifying multiple arguments
    ls -l *.txt
• Combining options/flags
    ls -la *.txt
• Using long flag names with "--"
• Copying and pasting via the clipboard with left and right mouse clicks

Tying Linux commands together

• All commands are executed left -> right (LR)
  – Output is passed along in the same direction
• Linux pipes ‘|’ connect commands: the output of one command becomes the input of the next
• Ex: determine how many sequences we have
    $ cat sequence.fastq | wc
• There are 4 lines per sequence in a FASTQ file, so how can we determine the number of sequences (x/4)?
    $ wc -l sequence.fastq | awk '{print $1}' | xargs -I{} echo "scale=0; {}/4" | bc -l
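The same read count can be obtained with the shell's built-in integer arithmetic, which avoids the awk/xargs/bc chain. The sketch below assumes a well-formed FASTQ file with exactly four lines per read; the two reads are made up for the example.

```shell
work=$(mktemp -d)
cd "$work"
# Two made-up reads, four lines each: header, bases, separator, qualities
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > sequence.fastq

lines=$(wc -l < sequence.fastq)   # reading from stdin keeps the file name out of the output
reads=$(( lines / 4 ))            # integer arithmetic built into the shell
echo "$reads reads"
```

Feeding the file through "<" means "wc" prints only the number, so no extra step is needed to strip the file name.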

Linux/UNIX Redirection

• What is redirection?
  – Linux uses < and > to redirect input and output, respectively
  – Redirecting with > lets the user save a command's output to a file, for example. In the same way that > redirects output, < redirects input, so that a command reads from a file instead of the keyboard
  – Ex: echo "test" > file1    # writes "test" to file1
  – Ex: cat < file1            # outputs the contents of file1

Redirection (..contd)

• A word on redirection: be careful when redirecting to a file. A single > (redirect stdout to a file) will overwrite the file (or create it), whereas >> (two > signs in a row) will append to the file, preserving its existing contents.

Redirection (…contd)

• If we create two files (file1 and file2) containing Line1 and Line2, respectively
• We can then create a new file using the > redirection operator
    $ cat file1 file2 > file3
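The overwrite-versus-append behavior is easiest to trust after watching it happen. This sketch replays the file1/file2/file3 example in a temporary directory; everything it creates is disposable.

```shell
work=$(mktemp -d)
cd "$work"
echo "Line1" > file1            # ">" creates file1 (or overwrites it if it exists)
echo "Line2" > file2
cat file1 file2 > file3         # concatenate both files into a new file
echo "Line3" >> file3           # ">>" appends, keeping the existing lines
appended=$(wc -l < file3)       # file3 now has three lines

echo "replaced" > file3         # ">" again: the previous contents are lost
overwritten=$(cat < file3)      # "<" feeds the file to cat on standard input
echo "lines after append: $appended; contents after overwrite: $overwritten"
```

When in doubt, prefer ">>" or write to a new file name, since ">" gives no warning before discarding the old contents.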

Redirection (…contd)

• Using bowtie with redirection
  – Ex: analyze FASTQ files to report all alignments per read (-a), with hits guaranteed best stratum and ties broken by quality (--best), reporting end-to-end hits with at most 2 mismatches (-v 2)
• In the bowtie example we redirect the output of the bowtie alignment to a file named ‘output_file’ in your scratch directory
    $ bowtie -a --best -v 2 upstream_mate downstream_mate.fastq > ~/scratch/output_file

Shortcut BASH keystrokes

• Keyboard shortcut timesavers in BASH
  – CTRL + A    Move cursor to start of line
  – CTRL + C    Stop a program
  – CTRL + D    Log out (same as the ‘exit’ command)
  – CTRL + E    Move cursor to end of line
  – CTRL + Z    Suspend a program
  – TAB         Command completion (type part of a command and hit TAB to complete it)
  – TAB TAB     Shows all available completions

Executing Commands

• PATH
  – Commands are found through your shell’s PATH
    • For example: when we type a command such as ‘ls’, it runs because the directory containing it is part of the search PATH
  – An example PATH is
      $ echo $PATH
      /bin:/sbin/:/home/ritaccoa
  – Commands which are not in your PATH will not be found, and therefore not executed, unless you type their full path
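The effect of PATH can be demonstrated with a tiny command of our own. In this sketch, "hello-hpc" is a hypothetical script name invented for the example; it is placed in a temporary directory that is not on the PATH at first, then the directory is prepended to PATH.

```shell
work=$(mktemp -d)
mkdir "$work/bin"
# A hypothetical command that lives outside the default PATH
printf '#!/bin/sh\necho "hello from my own command"\n' > "$work/bin/hello-hpc"
chmod +x "$work/bin/hello-hpc"

# "command -v" reports where the shell would find a command, if anywhere
if command -v hello-hpc >/dev/null 2>&1; then before="found"; else before="not found"; fi

PATH="$work/bin:$PATH"          # prepend the new directory to the search path
after=$(command -v hello-hpc)   # now prints the full path of the command
output=$(hello-hpc)

echo "before: $before"
echo "after:  $after"
echo "$output"
```

The change to PATH lasts only for the current shell session; a new login starts again from the default value.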

Calling external bioinformatics programs

• On our server, several bioinformatics software packages are installed
    $ module avail
• The general method for using a software package is to load its module
    $ module load bowtie/1.0.2
    $ bowtie --help


HPC infrastructure at UMass RC

• Massachusetts Green High Performance Computing Cluster
  – 10264 cores available; each node has 196 - 512 GB RAM; 12 GPU nodes available
  – 400 TB of high-performance EMC Isilon X series storage
  – FDR-based InfiniBand (IB) network and a 10GE network for the storage environment
• Research-related software installed:
  – Physics, Medical Physics, Genomics, Chemistry…

Basic terminology

• What is a node?
• What is a CPU?
• What is a core?
• What is an Operating System?
  – What is a kernel?
• What is a process?
  – Single process OS and processes
  – Concurrent (multi-tasking) OS and processes
  – Multiple cores (SMP) and Linux processes

Basic Terminology

• What is a node?
  – A single computer/blade which contains X number of CPUs and Y number of cores per CPU
• What is a CPU?
  – The central processing unit (CPU) carries out all of the instructions that a computer system needs to execute in order to perform a given task
• What is a core?
  – A core is a processor within a CPU chip (there can be many cores on a given CPU)
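Since a node holds X CPUs with Y cores each, its total core count is simply X times Y. The values below are illustrative only and do not describe any particular node on the cluster.

```shell
cpus_per_node=2        # illustrative values, not a description of a real node
cores_per_cpu=12
total_cores=$(( cpus_per_node * cores_per_cpu ))
echo "total cores per node: $total_cores"
```

This total is the number to keep in mind when deciding how many processes or threads to run on one node.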

Basic Terminology

• What is a process?
  – A process is a program that is executing (ex: iTunes)
• What is a kernel?
  – The kernel is the glue between the hardware and the user. The kernel schedules processes.
  – The kernel can be thought of as a crossing guard directing traffic for optimal performance

Basic Terminology

• Processes and tasks
  – Single process OS and processes
    • Single-processing OSs can run only one user process (a single task) at a given time
    • Each task runs until completion before another task is started
    • MS-DOS is an example of this type of single-user execution OS
• Linux processes and cores
  – A one-to-one relationship is optimal for performance

Basic Terminology

• Processes and tasks, cont.
  – Concurrent (multi-tasking) OS and processes
    • A concurrent OS lets users execute many programs simultaneously
    • Linux lets users run an editor, a music player, and other tasks simultaneously, thus allowing for multi-tasking
  – Multiple cores (SMP) and Linux processes
    • A process can take advantage of more than one core while running. Such processes are typically called multi-threaded.

Short Review

• If a node has four CPUs and two cores per CPU, how many total cores are there?

• In Linux can we execute an editor and a program to search a genome at the same time?

• How many processes should we execute on a node which has two CPUs with 8 cores each?


What is HPC?

• HPC = High Performance Computing
  – Infrastructure where hundreds or thousands of computers are networked together with shared common storage
  – Multiple users can log in and use the infrastructure
  – More than one computer can be used to complete a computing task
  – Special tools/skills are required to leverage the HPC environment: Linux, LSF commands

Definitions

HPC Term: Definition
Node: A single computer available to perform computing tasks
Rack: A cabinet in which multiple nodes can be stacked vertically and/or horizontally, allowing for efficient housing, networking and power management
Cluster: A collection of computer “nodes” that are on the same network for inter-node communication, shared storage and to execute jobs
CPU: The electronic circuitry (microprocessor) within a computer that carries out the instructions of a computer program
Core: An independent processing unit within a CPU that can execute program instructions. A modern CPU can have multiple cores
Head node: In a cluster, one or a few nodes can be designated as head nodes, where users typically log in and create/monitor jobs
Compute node: Compute nodes in a cluster execute jobs created on a head node. Users cannot log in to a compute node
Process: An instance of a computer program that is being executed. It contains the program code and its current activity
Thread: The smallest sequence of programmed instructions that can be managed independently by the scheduler of an OS
Job: A Linux command that is designated to be executed on a compute node rather than the head node
Job array: Identical jobs that have a different iterator variable
Parallel job: A job that breaks a complex computing task into smaller tasks, such that each task is executed on a different node simultaneously
Queue: Designated “lanes” for submitting different types of jobs depending on priority, resources required or expected duration of execution
Scheduler: HPC software that allows for efficient utilization of cluster resources based on submitted job types
Job Management: HPC software that keeps track of jobs submitted
Research computing: One of the departments within UMass Medical School responsible for supporting the HPC infrastructure on campus. Not related to “IT”
Cloud computing: A variant of HPC infrastructure which is not limited to a particular organization, where computing resources are requested on demand
Distributed computing: Buzzword similar to High Performance Computing
Parallel computing: Buzzword similar to High Performance Computing

Why do you need HPC?

• Needs assessment:
  – Use software that’s only available on Linux
    • Install it yourself on your own Linux PC?
    • RC already has it installed?
  – Automate data crunching tasks
    • Routine incoming data that needs to be crunched?
    • Workflow available within RC to handle it?
  – Simulations
    • Molecular dynamics simulations taking too much time?

HPC is not for these!

• Running Windows software with point-and-click interfaces
• Working with office documents: spreadsheets, slides, etc.
• Video games, music or general video
• Web browsing
• Email

Policies for HPC use

• If you have a “need” to use HPC, the RC group can help, but there are expectations:
  – Understanding of the constraints of our HPC implementation: CPUs, memory, local and shared storage, networking, etc.
  – Good knowledge of the tasks/jobs that you are going to run: expected run times, utilization of memory, disk space and network bandwidth
  – Fair share policies

Typical HPC environment

[Figure: typical HPC environment. “The Cluster” consists of a cluster head and many slave nodes linked by internal cluster traffic (Ethernet, 1 Gb/s); a NAS/SAN storage unit is attached via NAS storage links (Ethernet, 1 Gb/s); the cluster head connects to the public network (Ethernet, 100 Mb/s).]

What is a computing “Job”?

• A computing “job” is an instruction to the HPC system to execute a command or script
  – Simple Linux commands that execute within milliseconds would probably not qualify to be submitted as a “job”
  – Any command that is expected to take up a big portion of CPU or memory for more than a few seconds on a node would qualify to be submitted as a “job”. Why? (Hint: multi-user environment)

How to submit a “job”

• The basic syntax is:
    bsub <valid linux command>
• bsub: LSF command for submitting a job
• Let's say a user wants to count the number of lines in a FASTQ file. On a Linux PC, the command is
    wc -l reads.fastq
• To submit a job to do the work, run
    bsub wc -l reads.fastq

Specifying more “job” options

• Jobs can be marked with options for better job tracking and resource management
  – Jobs should be submitted with parameters such as queue name, estimated runtime, job name, memory required, output and error files, etc.
• These can be passed in the bsub command
    bsub -q short -W 1:00 -R "rusage[mem=2048]" -J "Myjob" -o hpc.out -e hpc.err wc -l reads.fastq

Job submission “options”

Option flag or name: Description
-q: Name of the queue to use. On our systems, possible values are “short” (<= 4 hrs execution time), “long” and “interactive”
-W: Allocation of node time. Specify hours and minutes as HH:MM
-J: Job name. E.g. “Myjob”
-o: Output file. E.g. “hpc.out”
-e: Error file. E.g. “hpc.err”
-R: Resources requested from the assigned node. E.g. “-R rusage[mem=1024]”, “-R span[hosts=1]”
-n: Number of cores to use on the assigned node. E.g. “-n 8”
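The same options can be stored in a script as "#BSUB" directive lines instead of being typed on the command line each time. The sketch below only writes such a script and counts its directives, so it can be run anywhere; "count-lines.sh" is a hypothetical file name, and the actual submission (shown in the final comment) needs a cluster where LSF is installed.

```shell
work=$(mktemp -d)
cd "$work"
# Write a job script whose #BSUB lines carry the options from the table above.
cat > count-lines.sh <<'EOF'
#!/bin/bash
#BSUB -q short
#BSUB -W 1:00
#BSUB -n 1
#BSUB -R "rusage[mem=2048]"
#BSUB -J "Myjob"
#BSUB -o hpc.out
#BSUB -e hpc.err
wc -l reads.fastq
EOF

directives=$(grep -c '^#BSUB' count-lines.sh)
echo "$directives scheduler directives written to count-lines.sh"
# On the cluster, the script would be submitted with: bsub < count-lines.sh
```

Keeping the options in the script documents exactly how a job was run, which helps when a result has to be reproduced later.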

Why use the correct queue?

• Match requirements to resources
• Jobs dispatch quicker
• Better for the entire cluster
• Helps GHPCC staff determine when new resources are needed

Demo

Create a script “hello-job-array.sh”:

    #!/bin/bash
    #BSUB -q short
    #BSUB -W 00:10
    #BSUB -n 1
    #BSUB -R "rusage[mem=1024]"
    #BSUB -J "myTask[1-80]"
    #BSUB -o logs/out.%J.%I

    echo "Hello Job $LSB_JOBID Task $LSB_JOBINDEX"

To execute on shell, run: bsub < hello-job-array.sh
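In a job array, each task typically uses its index to pick its own piece of work. The sketch below shows one possible pattern, under the assumption that inputs are named sample_1.fastq through sample_80.fastq (a hypothetical naming scheme). Outside LSF the two variables are unset, so fallback values are supplied to allow a dry run on any machine.

```shell
# Outside LSF these variables are unset, so fall back to sample values for a dry run.
job_id=${LSB_JOBID:-0}
task=${LSB_JOBINDEX:-3}

# Hypothetical naming scheme: one input file per task
input="sample_${task}.fastq"
echo "Job $job_id task $task would process $input"
```

With this pattern a single script serves all 80 tasks, and each task's log ends up in its own file through the %J and %I placeholders.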

Learning to use HPC

• Linux is a prerequisite for using any HPC system
  – Plenty of Linux tutorials on the internet
  – Attend our “Intro to Linux” sessions when offered
• Our website is a good resource for learning to use HPC; visit www.umassrc.org
• Lots of examples provided

Disk usage best practices

• Archive your data
  – Make backups of your data on mid- to long-term storage
• Use local storage if possible
  – Local storage is always faster than network storage
• Don’t use farline for cluster processing

HPC Best practices

• When submitting a large number of jobs, please consider:
  – Single-CPU jobs versus multi-CPU jobs
  – The correct amount of memory for your job
  – Job arrays
  – Job dependencies

HPC Best practices cont.

• The earlier your jobs are submitted, the earlier they will gain the needed LSF resources
• Redirect all LSF output to one directory for convenience
• Add one of the following to your LSF job directives (redirects stdout/stderr); the second form adds the array index %I and is meant for job arrays:
    #BSUB -o $HOME/LSF_jobs_output/LSF_job.%J.out
    #BSUB -o $HOME/LSF_jobs_output/LSF_job.%J.%I.out

HPC Best practices cont.: LSF Queues and policies

• Fair share attempts to equalize CPU (slot) resources for labs and users at job submission
• The priority of a job is calculated in relation to other submitted jobs. Job priorities change as jobs complete and job slots become available
• All labs start with an equal weight
• Each lab member shares in this weight when submitting jobs
• Weights are measured from job submissions per user and per lab
• Weights are based on CPU time used and a decay time

Working with bioinformatics data files: A demo

• Log on to the UMass server using PuTTY on Windows or Terminal on Mac
• Request an interactive shell session on one of the compute nodes for this demo
    $ bsub -q interactive -W 4:00 -Is bash
• Navigate to the training directory or copy the examples to your home directory
    $ cd /share/training/linux-bioinformatics
    $ cp /share/training/linux-bioinformatics/* ~

Working with bioinformatics data files: A demo (…contd)

• We have a file with a genomic sequence, called “sequence.fa”, and a file with NGS reads, “reads.fq”. Confirm that they are present
    $ ls
• We can examine a file's type using this Linux command
    $ file sequence.fa
    sequence.fa: ASCII text
• Let's look at the attributes of the files in this directory
    $ ls -l

Working with bioinformatics data files: A demo (…contd)

• The “cat” command can be used to display the contents of one or more files on the screen
    $ cat sequence.fa
• It may be better to scroll through the file page by page
    $ less sequence.fa
• Display just the first line of the file (header)
    $ head -1 sequence.fa
• Display the last 3 lines of the file
    $ tail -3 sequence.fa

Working with bioinformatics data files: A demo (…contd)

• Determine the number of lines in the FASTQ file
    $ wc -l reads.fq
• Count the number of reads in the FASTQ file
    $ x=`wc -l reads.fq | cut -f 1 -d ' '`
    $ echo "$((x/4)) reads"
• Search for a pattern in the sequence file and count the matching lines
    $ grep -c ACGTCA sequence.fa
• Search for an adapter at the start of reads and count the reads containing it
    $ grep ^ACGTCA reads.fq | wc -l

Innovagene Informatics. All rights reserved

Working with bioinformatics data files: A demo (…contd)

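• Case-insensitive search and count
    $ grep -i ^ACGTCA reads.fq
    $ grep -i ^ACGTCA reads.fq | wc -l
• Display all headers in the sequence file (the quotes stop the shell from treating > as output redirection, which would overwrite the file)
    $ grep "^>" sequence.fa
• Count the number of characters after the header line in a single-sequence FASTA file (an approximate base count, since newline characters are included)
    $ more +2 sequence.fa | wc -m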
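Because "wc -m" also counts the newline at the end of every line, an exact base count needs the newlines removed first. The following is a minimal sketch on an invented FASTA file; it drops header lines with "grep -v", strips newlines with "tr", and counts what is left.

```shell
work=$(mktemp -d)
cd "$work"
# Made-up single-sequence FASTA: one header line, then 6 + 4 bases
printf '>chrDemo made-up sequence\nACGTAC\nGTTA\n' > sequence.fa

# Drop header lines, strip newlines, then count the remaining characters.
bases=$(grep -v "^>" sequence.fa | tr -d '\n' | wc -c)
bases=$(( bases ))      # normalise the padding that some wc versions add
echo "$bases bases"
```

The same pipeline also works for multi-sequence FASTA files, where it reports the combined length of all sequences.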

Working with bioinformatics data files: A demo (…contd)

• Now let's align the reads to the sequence file (chr19)
    module load bowtie2/2-2.1.0
    module load samtools/1.2
• If you still have enough time remaining on this compute node (interactive sessions can be requested for up to 8 hours), build the index and run bowtie2
    bowtie2-build sequence.fa reference
    bowtie2 -p 1 -x reference -U reads.fq -S reads.fq.sam
• You can also submit this alignment as a compute job

Working with bioinformatics data files: A demo (…contd)

• Create a bowtie script, “bowtie-align.sh”, with the following content
    #!/bin/bash
    module load bowtie2/2-2.1.0
    module load samtools/1.2
    bowtie2-build sequence.fa reference
    bowtie2 -p 8 -x reference -U reads.fq -S reads.fq.sam
    samtools view -b -o reads.fq.bam reads.fq.sam

Working with bioinformatics data files: A demo (…contd)

• Now make the script executable and submit it as a compute job (“-n 8” requests the 8 cores that “bowtie2 -p 8” uses, all on one node)
    chmod +x bowtie-align.sh
    bsub -n 8 -W 4:00 -q short -R "rusage[mem=4096] span[hosts=1]" -J "bowtie-job" -o ngs.out -e ngs.err ./bowtie-align.sh
• Another way of writing the script is to include all of the command-line options in the script itself (next slide)
• Then submit the compute job as
    bsub < bowtie-align2.sh

Working with bioinformatics data files: A demo (…contd)

    #!/bin/bash
    #BSUB -J "SeqAlignJob"
    #BSUB -n 8
    #BSUB -R "rusage[mem=4096] span[hosts=1]"
    #BSUB -q short
    #BSUB -W 4:00
    #BSUB -o ngs.out
    #BSUB -e ngs.err
    module load bowtie2/2-2.1.0
    module load samtools/1.2
    bowtie2-build sequence.fa reference
    bowtie2 -p 8 -x reference -U reads.fq -S reads.fq.sam
    samtools view -b -o reads.fq.bam reads.fq.sam