MCB3895-004 Lecture #3
Sept 2/14
Intro to UNIX terminal
Introduction to UNIX
• Nearly all bioinformatics software runs on UNIX and its derivatives (e.g., LINUX and Mac OS)
• Very little bioinformatics software runs on Windows
• Bioinformatics is very strongly tied to the open-source software movement
• Lots of help available on-line
• Most programs are free
• Windows is not very open-source friendly
Windows users:
• Option 1: Do all of your work connected to the Biotechnology Cluster server. Download sshclient (ftp://ftp.uconn.edu/restricted/ssh/)
• Option 2: Install LINUX to run in parallel with Windows (e.g., Biolinux http://nebc.nerc.ac.uk/tools/bio-linux)
Terminal
• The terminal is the primary way to do computational biology
• Mac: Applications/Utilities/Terminal
• Linux: Applications/Accessories/Terminal
• Windows: sshclient
Assignment
• A handy resource to learn the basics of UNIX is the “Unix and Perl Primer for Biologists”, which can be found here: http://korflab.ucdavis.edu/Unix_and_Perl/unix_and_perl_v3.1.1.pdf
• The commands they demonstrate mainly involve creating, removing and moving around files and directories
• Once you learn them, these commands will take you far beyond what you can do with a more familiar GUI like Mac Finder or Windows Explorer
Worthy of special comment
1. Directory trees
2. Using tab to autocomplete
3. Wildcard characters like * to perform the same operation to multiple files (this is insanely useful once you get the hang of it!)
4. Using nano as a very basic text editor. Never, ever, ever use Word for this!
5. Use underscores “_” not spaces in your filenames
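A minimal sketch of points 3 and 5, run in a throwaway directory. The file and directory names (sample_1.txt, text_files, and so on) are made up for illustration and are not part of the assignment.

```shell
# Work in a temporary directory so nothing real is touched
workdir=$(mktemp -d)
cd "$workdir"

# Create a few empty example files (hypothetical names)
touch sample_1.txt sample_2.txt notes.md

# The * wildcard matches any run of characters, so this lists both .txt files
ls *.txt

# Move every .txt file into a new directory with a single command
mkdir text_files
mv *.txt text_files/
ls text_files

# A space splits a name into two arguments unless the name is quoted
touch "my file.txt"     # one file called "my file.txt", only because of the quotes
touch my_file.txt       # an underscore avoids the quoting problem entirely
```

Without the quotes, touch my file.txt would create two separate files, "my" and "file.txt", which is why underscores are the safer habit.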
Directory trees
• All computer files are organized hierarchically
• Each folder has an address
/Users/Jonathan/Laptop_backup/Desktop/e-Books
A quick reference to where you are in UNIX
• “/” - root
• “~” - your user home directory
• “.” - “here”, the directory you are in now
• “../” - one level up in the directory tree
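The shortcuts above can be tried out with cd, pwd and ls. This is a small sketch using a temporary directory tree; the names project and data are hypothetical.

```shell
# Build a small directory tree in a temporary location (hypothetical names)
top=$(mktemp -d)
mkdir -p "$top/project/data"

cd "$top/project/data"
pwd                      # prints the full address of the current directory

cd ..                    # ".." moves one level up, into project/
here=$(basename "$PWD")
echo "$here"             # name of the directory we are in now

ls .                     # "." is the current directory, so this lists project/
echo ~                   # shows what "~" expands to (your home directory)

cd /                     # "/" is the root of the whole tree
pwd
```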
More UNIX tricks
• “>” (greater than) redirects the output of a command into a new file
e.g., ls * > list
• A list of the files in this directory is now stored in the file “list”
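One detail worth trying for yourself: “>” overwrites the target file if it already exists. The sketch below also uses “>>”, which appends instead; that operator is not covered on this slide and is shown only for contrast. File names are made up.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch genome_a.fasta genome_b.fasta   # hypothetical empty files

# ">" creates the file "list", or replaces its contents if it already exists
ls *.fasta > list
cat list

# ">>" adds to the end of the file instead of replacing it
echo "end of list" >> list
cat list
```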
More UNIX tricks
• cat joins multiple files together
e.g., cat file1 file2 > file3
• file3 contains file1 and file2 joined together
• file1 and file2 still exist as they were
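A runnable version of the cat example, assuming two tiny FASTA-style files whose contents are invented for illustration:

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Two small input files with made-up sequences
printf '>seq1\nACGT\n' > file1
printf '>seq2\nTTGA\n' > file2

# Join them in the order given and store the result in file3
cat file1 file2 > file3
cat file3

# The inputs are untouched
ls
```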
More UNIX tricks
• grep extracts all lines containing a particular pattern from a file
e.g., grep "NP_" file1
• Prints every line that contains the pattern “NP_” to the screen
More UNIX tricks
• wc counts the newlines, words and bytes in a file
e.g., wc file1
• Prints an output like this:
   10602    18921   752002 file1
newlines    words    bytes filename
More UNIX tricks
• “|” (pipe) directs the output of one command into another
e.g., grep "NP_" file1 | wc
• Sends the output of the grep command into wc. Because grep extracts lines from a file, this can be used to count the number of lines matching the grep expression
e.g., grep "NP_" file1 | less
• Displays the grep result as a list you can scroll through
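A small sketch of grep, wc and the pipe working together. The file proteins.faa and its accession numbers are invented for this example. The -l flag to wc (count lines only) and the -c flag to grep (count matching lines) are standard options not mentioned on the slides.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A tiny made-up FASTA-style file; the accessions and sequences are invented
cat > proteins.faa <<'EOF'
>NP_000001.1 hypothetical protein
MKTAYIAKQR
>WP_000002.1 hypothetical protein
MSDNELQKLA
>NP_000003.1 hypothetical protein
MTTPLLEQAV
EOF

# Print every line containing "NP_"
grep "NP_" proteins.faa

# Pipe the matches into wc -l to count only the matching lines
n_matches=$(grep "NP_" proteins.faa | wc -l)
echo $n_matches

# grep -c gives the same count without a pipe
grep -c "NP_" proteins.faa
```

Note that straight quotes are needed around the pattern; curly quotes pasted from a slide or from Word will not work in the terminal.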
More UNIX tricks
• gzip/gunzip: single file compression
e.g., gunzip file.txt.gz
• Decompresses file.txt
e.g., gzip file.txt
• Creates compressed file file.txt.gz, removes file.txt
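A round trip through gzip and gunzip, as a sketch with a made-up one-line file. The gunzip -c option (write the decompressed text to the screen and leave the .gz file in place) is a standard flag not covered on the slide.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'ACGTACGTACGT\n' > file.txt   # invented contents

gzip file.txt            # leaves file.txt.gz, removes file.txt
ls

# Look at the contents without decompressing the file on disk
gunzip -c file.txt.gz

gunzip file.txt.gz       # restores file.txt, removes file.txt.gz
ls
```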
More UNIX tricks
• tar: file archive management
e.g., tar -cf all.tar *
• Creates tar archive all.tar containing all files in that directory, individual files unchanged
e.g., tar -xf all.tar
• Extracts all files from tar archive all.tar to the current directory, all.tar not deleted
• tar is very commonly used before gzip - “tarballs”
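A sketch of making and unpacking a tarball in one step each. The -z flag asks tar to run gzip for you, -t lists an archive without extracting it, and -C chooses where to extract; these are standard tar options beyond the two shown above. The directory reads and its files are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A small directory of made-up files to archive
mkdir reads
printf 'ACGT\n' > reads/sample_1.txt
printf 'TTGA\n' > reads/sample_2.txt

# -c create, -z compress with gzip, -f name of the archive
tar -czf reads.tar.gz reads

# -t lists the contents without extracting anything
tar -tzf reads.tar.gz

# Extract into a separate directory; the archive and the originals remain
mkdir restored
tar -xzf reads.tar.gz -C restored
ls restored/reads
```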
Connecting to the Bioinformatics facility server
• UNIX command ssh
• e.g., ssh -l jlklassen bbcsrv3.biotech.uconn.edu
• Will ask for a password
• If this is the first time connecting, it will ask you to confirm the server's RSA key (security feature)
• Your terminal now controls the bioinformatics facility server, not your own machine
• You can have multiple terminals open at the same time
Transferring files to the Bioinformatics facility server
• Method 1: Filezilla (https://filezilla-project.org/)
• Nice GUI
• Works on all platforms
• Install the client, not the server
Transferring files to the Bioinformatics facility server
• Method 2: UNIX command scp
• e.g., scp all.tar jlklassen@bbcsrv3.biotech.uconn.edu:all.tar
• Copy all.tar from my computer to the biotech server
• e.g., scp -r jlklassen@bbcsrv3.biotech.uconn.edu:dir/ .
• Copy the directory “dir” from the biotech server to the current working directory
• “-r” flag indicates “recursive”, needed for directories
Text editors
• Using nano works, but can be cumbersome for complex tasks
• Word is always bad! It adds hidden formatting that you don’t see.
• Mac and LINUX have TextEdit and Gedit, respectively, as default text editors; both work well
• Windows: Notepad and Wordpad are insufficient. I suggest downloading Gedit for Windows (https://wiki.gnome.org/Apps/Gedit)
• Other options exist for all platforms
Assignment
• See instructions posted on the website at http://wp.mcb3895.mcb.uconn.edu
• Part 1: work through Korf manual sections U1-U27 (some commands require external files, ignore these but understand what they do)
• Part 2: log on to the Biotech server, download a genome from NCBI and answer the questions given
• The assignment is due at the start of class 1 week from today
Command line power!
• The simplest way to download these data is to use the terminal command wget
$ wget -r --no-directories --retr-symlinks -P Acaricomes_phytoseiuli/ ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Acaricomes_phytoseiuli/latest_assembly_versions/GCF_000376245.1_ASM37624v1/
• Deconstructed:
• -r: “recursive”, i.e., download everything in this directory
• --no-directories: does not recreate the entire ftp directory structure
• --retr-symlinks: NCBI uses a fancy file structure using something called “symbolic links”, where a file points to another file somewhere else. “--retr-symlinks” gets the actual files, not just the links
• -P Acaricomes_phytoseiuli/: where to put the output