23
1/20/21 1 Introduction to UNIX Text Processing Bo Yoo January 20, 2021 1 Announcements Homework 1 released: cs273a.Stanford.edu (schedule section) Read the instructions carefully! Due 11:59PM PST 02/01/2021 TA office hours: Tuesdays 12PM-2PM PST starting next week (1/26) For this week only: we will have one on Friday 4PM-6PM PST (1/22) Meet via Zoom (link in our course website) 2 2 Stanford UNIX resources Host: cardinal.stanford.edu To connect from Unix/Linux/Mac: Open a terminal (user – your SUID): ssh [email protected] ssh [email protected] To connect from Windows: Download terminal emulator: PuTTy (http://goo.gl/s0itD ) 3 3

1/20/21 Introduction to UNIX Text Processing

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: 1/20/21 Introduction to UNIX Text Processing

1/20/21

1

Introduction to UNIX Text Processing

Bo Yoo January 20, 2021

1

Announcements• Homework 1 released: cs273a.Stanford.edu

(schedule section)– Read the instructions carefully!– Due 11:59PM PST 02/01/2021

• TA office hours:– Tuesdays 12PM-2PM PST starting next week

(1/26)– For this week only: we will have one on Friday

4PM-6PM PST (1/22)–Meet via Zoom (link in our course website)

2

2

Stanford UNIX resources

• Host: cardinal.stanford.edu• To connect from Unix/Linux/Mac:

Open a terminal (user – your SUID):ssh [email protected] [email protected]

• To connect from Windows:– Download terminal emulator: PuTTy

(http://goo.gl/s0itD)

3

3

Page 2: 1/20/21 Introduction to UNIX Text Processing

1/20/21

2

Stanford Resources

• There are many books, web tutorials, and videos out there on materials we will talk about in the tutorial today (e.g., UNIX, Python, Awk etc.)

• Stanford gives you free access to many of these resources for you

4

4

Stanford Resources

• General Stanford search:– https://searchworks.stanford.edu/

• All kinds of e-resources– https://library.stanford.edu/science/collections/mathemat

ics-and-statistics-collection/ebooks-and-digital-projects

• Safari!– https://searchworks.stanford.edu/view/4797413

• You can even ask a specialized librarian for the best resources Stanford currently has access to:– https://library.stanford.edu/subjects/computer-science

5

5

Huge suite of tools

6

6

Page 3: 1/20/21 Introduction to UNIX Text Processing

1/20/21

3

Many useful text processing UNIX commands• awk bzcat cat column cut grephead join sed sort tail tee truniq wc zcat …

• UNIX commands work together via text streams.

• Example usage and others available at http://tldp.org/LDP/abs/html/textproc.htmlhttp://en.wikipedia.org/wiki/Cat_%28Unix%29#Other

7

7

Knowing UNIX commands eliminates having to reinvent the wheel

• In the past homework, to perform a simple file sort, submissions used:– 35 lines of Python– 19 lines of Perl– 73 lines of Java– 1 line of UNIX commands

8

8

Anatomy of a UNIX command

command [options] [FILE1] [FILE2]• options: -n 1 -g -c = -n1 -gc• output is directed to “standard output” (stdout)• if no input file is specified, input comes from

“standard input” (stdin)• To view the usage:

command --help

9

9

Page 4: 1/20/21 Introduction to UNIX Text Processing

1/20/21

4

The real power of UNIX commands comes from combinations through piping (“|”)

• Pipes are used to pass the output of one program (stdout) as the input (stdin) to another

• Pipe character is <Shift>-\

grep “CS273a” grades.txt | sort -k 2,2gr | uniq

10

Find all lines in the file that have “CS273a” in them somewhere

Sort those lines by second column, in numerical order, highest to lowest

Remove duplicates and print to standard output

10

Dual Piping

• Pass the output of two commands to another command:– sort <(cat file1) <(cat file2)– join -1 1 -2 4 <(sort –k1 file1) <(sort –k4 file2)

• This is particularly useful for “join” and “comm” commands

11

11

Output redirection (>, >>)

• Instead of writing everything to standard output, we can write (>)or append (>>) to a file

grep “CS273a” allClasses.txt > CS273aInfo.txt

cat addlInfo.txt >> CS273aInfo.txt

12

12

Page 5: 1/20/21 Introduction to UNIX Text Processing

1/20/21

5

UCSC KENT SOURCE UTILITIEShttp://genomewiki.ucsc.edu/index.php/Kent_source_utilities

13

13

/afs/ir/class/cs273a/bin/

• Many C programs in this directory that do manipulation of sequences or chromosome ranges

• Run programs with no arguments to see help message

overlapSelect [OPTION]… selectFile inFile outFile

Many useful options to alter how overlaps computed

14

Output is all inFileelements that overlap any selectFileelements

selectFile

inFile

outFile

14

Kent Source and Mysql

• Linux + Mac Binaries– http://hgdownload.soe.ucsc.edu/admin/exe/

• Using MySQL on browser– http://genome.ucsc.edu/goldenPath/help/mysql.h

tml

15

15

Page 6: 1/20/21 Introduction to UNIX Text Processing

1/20/21

6

Interacting with UCSC Genome Browser MySQL Tables

• Command line:mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A –Ne “<STMT>“

e.g.mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A –Ne \

“select count(*) from hg18.knownGene“;

+-------+| 66803 |+-------+

http://dev.mysql.com/doc/refman/8.0/en/tutorial.html16

16

BEDTOOLShttps://bedtools.readthedocs.io/en/latest/index.html

17

17

Bedtools

• Bedtools are useful to perform wide-range of genome analysis tasks. They take in multiple commonly used genomic files (e.g. BED, BAM, VCF)

• You can run bedtools on any Stanford machines

• Command line:bedtools <subcommand> [options/input files]

18

18

Page 7: 1/20/21 Introduction to UNIX Text Processing

1/20/21

7

BED Files

• Browser Extensible Data (BED) format is a tab delimited text file format that store genomic regions. It has 3 required fields: chromosome, start, and end (https://genome.ucsc.edu/FAQ/FAQformat.html#format1) • Start is the starting position and is 0th based index (i.e.,

the first base in a chromosome is 0• End is the ending position of the region and this base is

not included in the region19

19

BED Files

• The first 10 bases of chromosome 1:• chr1 0 10• This will span the bases 0-9 (0-9 index inclusively)

20

20

Bedtools intersect

• Return intersecting regions in two or more bed files– TIP: bedtools sometimes have hard time recognizing the

file has a BED file if there are extra fields, so trim to the first 3 columns

• Command linebedtools intersect [options] –a file –b file1,file2,…

21

21

Page 8: 1/20/21 Introduction to UNIX Text Processing

1/20/21

8

Bedtools merge

• Combines overlapping or “close enough” regions into one

• Command linebedtools merge [options] –i file

22

22

SPECIFIC UNIX COMMANDS

23

23

man, whatis, apropos

• UNIX program that invokes the manual written for a particular program

• man sort– Shows all info about the program sort– Hit <space> to scroll down, “q” to exit

• whatis sort– Shows short description of all programs that have

“sort” in their names• apropos sort– Shows all programs that have “sort” in their names or

short descriptions24

24

Page 9: 1/20/21 Introduction to UNIX Text Processing

1/20/21

9

cat• Concatenates files and prints them to standard

output• cat [OPTION] [FILE]…

• Variants for compressed input files:zcat (.gz files)bzcat (.bz2 files)– Also useful if your zipped file is very large and you want to

check the content of the file quickly – zcat file1.gz | less (or | head)

25

ABCD

123

ABCD123

25

head, tail

• head: first ten linestail: last ten lines

• -n option: number of lines– For tail, -n+Kmeans line K to the end.

• head –n5 : first five lines• tail –n3 : last 3 lines• tail –n+2 | head –n 2 : lines 2-4

26

26

cut

• Prints selected parts of lines from each file to standard output

• cut [OPTION]… [FILE]…• -d Choose delimiter between columns

(default TAB)• -f Fields to print-f1,7 : fields 1 and 7-f1-4,7,11-13: fields 1,2,3,4,7,11,12,13

27

27

Page 10: 1/20/21 Introduction to UNIX Text Processing

1/20/21

10

cut example

28

CS 273 aCS.273.aCS 273 a

file.txt

cut –f1,3 file.txtCS aCS.273.aCS

cut –d ‘.’ –f1,3 file.txtCS 273 aCS.aCS 273 a

In general, you should make sure your file columns are all delimited with the same character(s) before

applying cut!

28

wc

• Print line, word, and character (byte) counts for each file, and totals of each if more than one file specified

• wc [OPTION]… [FILE]…• -l Print only line counts

29

29

sort

• Sorts lines in a delimited file (default: tab)• -k m,n sorts by columns m to n (1-based)• -g sorts by general numerical value (can handle

scientific format)• -r sorts in descending order• sort -k1,1gr -k2,3– Sort on field 1 numerically (high to low because of r).– Break ties on field 2 alphabetically.– Break further ties on field 3 alphabetically.

30

30

Page 11: 1/20/21 Introduction to UNIX Text Processing

1/20/21

11

uniq

• Discard all but one of successive identical lines from input and print to standard output

• -d Only print duplicate lines• -i Ignore case in comparison• -u Only print unique lines

31

31

uniq example

32

CS 273aCS 273aTA: Bo YooCS 273a

file.txtuniq file.txt

CS 273aTA: Bo YooCS 273a

uniq –u file.txt TA: Bo YooCS 273a

uniq –d file.txt CS 273a

In general, you probably want to make sure your file is sorted before applying uniq!

uniq example

25

CS 273aCS 273aTA: Bo YooCS 273a

file.txtuniq file.txt

CS 273aTA: Bo YooCS 273a

uniq –u file.txt TA: Bo YooCS 273a

uniq –d file.txt CS 273a

In general, you probably want to make sure your file is sorted before applying uniq!

32

grep

• Search for lines that contain a word or match a regular expression

• grep [options] PATTERN [FILE…]

• -i ignore case• -v Output lines that do not match• -f <FILE>: patterns from a file (1 per line)• -E Extended regex grep (=egrep)• -o print only matching part of the line

33

33

Page 12: 1/20/21 Introduction to UNIX Text Processing

1/20/21

12

grep example

grep -E “^CS[[:space:]]+273$” file

34

Search through “file”

For lines that start with CS

Then have one or more spaces (or tabs)

And end with 273

CS 273aCS273CS 273cs 273CS 273

file

CS 273CS 273

34

sed: stream editor

• Most common use is a string replace.• sed –e “s/SEARCH/REPLACE/g”

35

cat file.txt | sed –e “s/is/EEE/g”

Thisis anExample.

file.txtThEEEEEE anExample.

35

join

• Join lines of two files on a common field• join [OPTION]… FILE1 FILE2• -1 Specify which column of FILE1 to join on• -2 Specify which column of FILE2 to join on• Important: FILE1 and FILE2 must already be

sorted on their join fields!

36

36

Page 13: 1/20/21 Introduction to UNIX Text Processing

1/20/21

13

join example

37

CS273a Comp Tour Hum Gen.CS229 Machine LearningDB210 Devel. Biol.

file2.txt

Bejerano CS273aVilleneuve DB210Batzoglou DB273a

file1.txt

join -1 2 -2 1 file1.txt file2.txt

CS273a Bejerano Comp Tour Hum Gen.DB210 Villeneuve Devel. Biol.

37

SHELL SCRIPTING

38

38

Common shells

• Two common shells: bash and tcsh• Run ps to see which you are using.

39

39

Page 14: 1/20/21 Introduction to UNIX Text Processing

1/20/21

14

Multiple UNIX commands can be combined into a single shell script.

#!/bin/bash

cat $1 $2 > tmp.txt

paste tmp.txt $3 > $4

40

script.sh

Command prompt% ./script.sh file1.txt file2.txt file3.txt out.txt% sh script.sh file1.txt file2.txt file3.txt out.txt

Set scripts to be executable:% chmod u+x script.sh

http://www.faqs.org/docs/bashman/bashref_toc.html

40

for loop

# BASH for loop to print 1,2,3 on separate linesfor i in `seq 1 3`do

echo ${i}done

41

Special quote character, usually left of “1” on keyboard that indicates we should execute the command within the quotes

41

Making the script executable

You need to make the script executable to use dual piping

$sh script.sh

$chmod u+x script.sh$./script.sh

42

42

Page 15: 1/20/21 Introduction to UNIX Text Processing

1/20/21

15

SCRIPTING LANGUAGES

43

43

awk

• A quick-and-easy shell scripting language• https://www.gnu.org/software/gawk/manual/

gawk.html• Treats each line of a file as a record, and splits

fields by whitespace• Fields referenced as $1, $2, $3, … ($0 is entire

line)

44

44

Anatomy of an awk script.

awk ‘BEGIN {…} {…} END {…}’

45

before first line after last lineonce per line

45

Page 16: 1/20/21 Introduction to UNIX Text Processing

1/20/21

16

awk example

• Output the lines where column 3 is less than column 5 in a comma-delimited file. Output a summary line at the end.

46

awk -F',‘'BEGIN{ct=0;}{ if ($3 < $5) { print $0; ct=ct+1; } }

END { print "TOTAL LINES: " ct; }'

46

Useful things from awk

• Make sure fields are delimited with tabs (to be used by cut, sort, join, etc.

awk ‘{print $1 “\t” $2 “\t” $3}’ whiteDelim.txt > tabDelim.txt

• Good string processing using substr, index, length functions

awk ‘{print substr($1, 1, 10)}’ longNames.txt > shortNames.txt

47

String tomanipulate

Startposition

Length

substr(“helloworld”, 4, 3) = “low” index(“helloworld”, “low”) = 4

length(“helloworld”) = 10 index(“helloworld”, “notpresent”) = 0

47

Python

• A scripting language with many useful constructs

• http://wiki.python.org/moin/BeginnersGuide• http://docs.python.org/tutorial/index.html

• Call a python program from the command line:python myProg.py

48

48

Page 17: 1/20/21 Introduction to UNIX Text Processing

1/20/21

17

Number types

• Numbers: int, float>>> f = 4.7>>> i = int(f)>>> j = round(f)>>> i4>>> j5.0>>> i*j20.0>>> 2**i16

49

49

Strings>>> dir(“”)[…, 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith',

'expandtabs', 'find', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'replace', 'rfind', 'rindex', 'rjust', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']

>>> s = “hi how are you?”>>> len(s)15>>> s[5:10]‘w are’>>> s.find(“how”)3>>> s.find(“CS273”)-1>>> s.split(“ “)[‘hi’, ‘how’, ‘are’, ‘you?’]>>> s.startswith(“hi”)True>>> s.replace(“hi”, “hey buddy,”)‘hey buddy, how are you?’>>> “ extraBlanks ”.strip()‘extraBlanks’ 50

50

Lists• A container that holds zero or more objects in

sequential order>>> dir([])[…, 'append', 'count', 'extend', 'index', 'insert', 'pop', 'remove',

'reverse', 'sort']>>> myList = [“hi”, “how”, “are”, “you?”]>>> myList[0]‘hi’>>> len(myList)4>>> for word in myList:

print word[0:2]

hihoaryo

>>> nums = [1,2,3,4]>>> squares = [n*n for n in nums]>>> squares[1, 4, 9, 16] 51

51

Page 18: 1/20/21 Introduction to UNIX Text Processing

1/20/21

18

Dictionaries• A container like a list, except key can be

anything (instead of a non-negative integer)>>> dir({})[…, clear', 'copy', 'fromkeys', 'get', 'has_key', 'items',

'iteritems', 'iterkeys', 'itervalues', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values']

>>> fruits = {“apple”: True, “banana”: True}>>> fruits[“apple”]True>>> fruits.get(“apple”, “Not a fruit!”)True>>> fruits.get(“carrot”, “Not a fruit!”)‘Not a fruit!’

>>> fruits.items()[('apple', True), ('banana', True)]

52

52

Reading from files

>>> openFile = open(“file.txt”, “r”)>>> allLines = openFile.readlines()>>> openFile.close()

>>> allLines[‘Hello, world!\n’, ‘This is a file-reading\n’, ‘\texample.\n’]

53

Hello, world!This is a file-reading

example.

file.txt

53

Writing to files>>> writer = open(“file2.txt”, “w”)

>>> writer.write(“Hello again.\n”)

>>> name = “Cory”

>>> writer.write(“My name is %s, what’s yours?\n” % name)

>>> writer.close()

54

Hello again.My name is Cory, what’s yours?

file2.txt

54

Page 19: 1/20/21 Introduction to UNIX Text Processing

1/20/21

19

Creating functionsdef compareParameters(param1, param2):

if param1 < param2:return -1

elif param1 > param2:return 1

else:return 0

def factorial(n):if n < 0:

return Noneelif n == 0:

return 1else:

retval = 1num = 1while num <= n:

retval = retval*numnum = num + 1

return retval55

55

Example program

#!/usr/bin/env pythonimport sys # Required to read arguments from command line

if len(sys.argv) != 3:print “Wrong number of arguments supplied to Example.py”

sys.exit(1)

inFile = open(sys.argv[1], “r”)allLines = inFile.readlines()inFile.close()

outFile = open(sys.argv[2], “w”)for line in allLines:

outFile.write(line)

outFile.close()56

Example.py

56

Example program

python Example.py file1 file2

sys.argv = [‘Example.py’, ‘file1’, ‘file2’]

57

#!/usr/bin/env pythonimport sys # Required to read arguments from command line

if len(sys.argv) != 3:print “Wrong number of arguments supplied to Example.py”sys.exit(1)

inFile = open(sys.argv[1], “r”)allLines = inFile.readlines()inFile.close()

outFile = open(sys.argv[2], “w”)for line in allLines:

outFile.write(line)

outFile.close()

57

Page 20: 1/20/21 Introduction to UNIX Text Processing

1/20/21

20

DEMOexample files: /afs/ir/class/cs273a/UNIX_primer/

58

58

HOMEWORK DEMO EXECUTABLESexample files: /afs/ir/class/cs273a/hw1/demo

59

59

test_q1_04.orf_finder

Usage: ./test_q1_04.orf_finder [Fasta file]

*only work for Fasta files <=200 bps long

60

60

Page 21: 1/20/21 Introduction to UNIX Text Processing

1/20/21

21

test_q1_04.orf_finder

Usage: ./test_q1_04.orf_finder Example.fa

Example.fa:>

CGTAATGAAATTGCGCGAGTGCTAACGA

61

+/0 CGT AAT GAA ATT GCG CGA GTG CTA ACG A+/1 C GTA ATG AAA TTG CGC GAG TGC TAA CGA+/2 CG TAA TGA AAT TGC GCG AGT GCT AAC GA-/0 TCG TTA GCA CTC GCG CAA TTT CAT TAC G-/1 T CGT TAG CAC TCG CGC AAT TTC ATT ACG-/2 TC GTT AGC ACT CGC GCA ATT TCA TTA CG

61

test_q1_04.orf_finder

Usage: ./test_q1_04.orf_finder Example.fa

Example.fa:>

CGTAATGAAATTGCGCGAGTGCTAACGA

62

62

test_q1_04.orf_finder (verbose)

Usage: ./test_q1_04.orf_finder Example.fa vFor checking only, your code doesn’t need to print this out

Example.fa:>

CGTAATGAAATTGCGCGAGTGCTAACGA

63

63

Page 22: 1/20/21 Introduction to UNIX Text Processing

1/20/21

22

test_q1_04.orf_finder

Usage: ./test_q1_04.orf_finder Example.fa

Example.fa:>

CGTAATGAAATTGCGCGAGTGCATAACGA

64

64

test_alternative_splicing

Usage: ./test_alternative_splicing [Exons] [Splicing instructions]

*only work on examples with <5 codons and total sequence <=50 basepairs

65

65

test_alternative_splicing

Usage: ./test_alternative_splicing [Exons] [Splicing instructions]

Exons: AATTTATG,CGATTTA,ACGGTAATGTAA

Splicing instructions: 101

66

Exons AATTTATG CGATTTA ACGGTAATGTAA

Splicing instructions 1 (keep) 0 (drop) 1 (keep

Sequence used: AATTTATGACGGTAATGTAACoding sequence: AATTT ATG ACG GTA ATG TAA

66

Page 23: 1/20/21 Introduction to UNIX Text Processing

1/20/21

23

Amino Acid Table

67

Coding sequence: AATTT ATG ACG GTA ATG TAATranslated AA: MTVM

67

test_alternative_splicing

./test_alternative_splicingAATTTATG,CGATTTA,ACGGTAATGTAA 101

68

68

test_alternative_splicingExamples:

69

Exons Splicing instructions

Code result

AATGTATG,CGATGA,ACGGTAATGTAA 111AATGATG,CGATGA,ACGGTAATGTAA 111AATGATG,CGATGA,ACGGTAATGTAA 011AATGATG,CGATGA,ACGGTAATGTAA 101

AT,GCAGTTGC,CGTAG 111AT,GCAGTTGC,CGTAG 011

AT,G,CA,GTT,AA 11111AT,G,AGTT,AA 1111

69