Bioinformatics: Exercises on Perl, regular expressions and other amenities (BMR Genomics) - Lesson of 5 September 2014

Bioinformatics Training: Introduction to Perl Programming

Andrea Telatin

Today’s menu

• Exercises!

• More pattern matching

• Exercises!

• Some tricks to better work with SAM files

• I kept past slides here

Warming up: Simple exercises

Simple Exercises

Write a program that…

• Prints the list of the arguments passed by the user, and the number of arguments passed

• Takes as input a list of integers, and prints their sum and average

• Prints a random DNA sequence, as long as the user requests

Write a program that…

• Prints the list of the arguments passed by the user, and the number of arguments passed
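A minimal sketch of this first exercise, not necessarily the solution shown in class. The sub name describe_arguments is hypothetical, and use strict / use warnings are additions that the slides do not use.

```perl
use strict;
use warnings;

# Build the report as a string, so it is easy to print (or to check).
sub describe_arguments {
    my @args = @_;
    my $report = '';
    my $i = 0;
    foreach my $arg (@args) {
        $i++;
        $report .= "Argument $i: $arg\n";
    }
    # An array in scalar context gives its number of elements.
    $report .= "Total arguments: " . scalar(@args) . "\n";
    return $report;
}

# @ARGV holds the arguments typed by the user after the script name.
print describe_arguments(@ARGV);
```

Saved under any name (say args.pl) and run as perl args.pl one two, it lists both arguments and then reports a total of 2.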



Regex to validate input!

Remember how to round it?
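The two hints above fit the sum-and-average exercise. A possible sketch, assuming that "validating" means accepting only optionally signed integers and that rounding is done with sprintf (the course may have used another method); sum_and_average is a hypothetical name.

```perl
use strict;
use warnings;

# Returns (sum, average rounded to two decimals), or an empty list
# when no numbers are given. Dies on anything that is not an integer.
sub sum_and_average {
    my @numbers = @_;
    return () unless @numbers;
    my $sum = 0;
    foreach my $n (@numbers) {
        # Optional sign followed by digits only: this is the validation step.
        die "Not an integer: \"$n\"\n" unless $n =~ /^[+-]?\d+$/;
        $sum += $n;
    }
    # sprintf rounds to the requested number of decimals.
    my $average = sprintf("%.2f", $sum / scalar(@numbers));
    return ($sum, $average);
}

if (@ARGV) {
    my ($sum, $average) = sum_and_average(@ARGV);
    print "Sum: $sum\nAverage: $average\n";
}
```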

• Prints a random DNA sequence with user supplied length
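One way to approach this exercise, shown as a sketch: pick a random index into a four-letter array, once per requested base. The name random_dna is hypothetical and the length check is an added precaution.

```perl
use strict;
use warnings;

# Return a random DNA string of the requested length.
sub random_dna {
    my ($length) = @_;
    die "Length must be a non-negative integer\n"
        unless defined $length && $length =~ /^\d+$/;
    my @bases = ('A', 'C', 'G', 'T');
    my $sequence = '';
    for (1 .. $length) {
        # rand(4) returns a number in [0, 4); int() truncates it to 0..3.
        $sequence .= $bases[ int(rand(4)) ];
    }
    return $sequence;
}

# The length comes from the first command line argument.
print random_dna($ARGV[0]), "\n" if @ARGV;
```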

Reading files: review and exercises

Reading a text file

• Requires the program to know the filename, which in Bash jargon means its path (either relative or absolute)

• In Perl we assign a nickname to the opened file, called a file handle.

• We use the open(FH, filename) function

• The function returns “true” on success.

Reading a File

    if (open(FILE, 'path/to/filename') == 0) {
        die "Can't read filename.\n";
    }

After opening it, we can read it line by line with a while loop:

    while ($currentline = <FILE>) {
        print "$currentline";
    }

Warning: $currentline keeps its trailing newline character; if it is not needed, remember to remove it with chomp($currentline).


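In open(FILE, 'path/to/filename'), FILE is the file handle and 'path/to/filename' is the file name. The while loop over <FILE> is the typical “read file” loop.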
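The same loop is often written with a lexical file handle and the three-argument form of open, which also lets die report the system error stored in $!. A sketch under those assumptions; read_lines is a hypothetical helper, not part of the slides.

```perl
use strict;
use warnings;

# Return the lines of a text file, without their trailing newline.
sub read_lines {
    my ($path) = @_;
    # '<' opens for reading; "or die" plays the role of the "== 0" test.
    open(my $fh, '<', $path) or die "Can't read \"$path\": $!\n";
    my @lines;
    while (my $line = <$fh>) {
        chomp($line);
        push(@lines, $line);
    }
    close($fh);
    return @lines;
}

if (@ARGV) {
    print "$_\n" foreach read_lines($ARGV[0]);
}
```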

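A Perl clone: cat.pl

    ($filename) = @ARGV;    # assumption: the file name comes from the command line
    if (open(I, "$filename") == 0) {
        die "Can't read \"$filename\".\n";
    }
    while ($line = <I>) {
        chomp($line);
        $c++;               # count the lines read
        print "$line\n";
    }
    print STDERR "I read $c lines.\n";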

Do it yourself

Write a program that…

• Reads a text file and prints:

  • the number of lines found

  • the average line length

  • the number of words in the document
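A possible starting point for this exercise, as a sketch. It assumes that a "word" is anything separated by whitespace and that line length is measured without the trailing newline; text_stats is a hypothetical name.

```perl
use strict;
use warnings;

# Returns (number of lines, average line length, number of words).
sub text_stats {
    my @lines = @_;
    my ($n_lines, $n_chars, $n_words) = (0, 0, 0);
    foreach my $line (@lines) {
        chomp($line);
        $n_lines++;
        $n_chars += length($line);
        # split ' ' splits on runs of whitespace and skips leading blanks.
        my @words = split(' ', $line);
        $n_words += scalar(@words);
    }
    my $average = $n_lines ? $n_chars / $n_lines : 0;
    return ($n_lines, $average, $n_words);
}

if (@ARGV) {
    open(my $fh, '<', $ARGV[0]) or die "Can't read \"$ARGV[0]\": $!\n";
    my @lines = <$fh>;
    close($fh);
    my ($n, $avg, $w) = text_stats(@lines);
    printf("Lines: %d\nAverage line length: %.1f\nWords: %d\n", $n, $avg, $w);
}
```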

Do it yourself

Extend the previous program…

• We now want to count how many times each word is present and print the most abundant words

• Decide by yourself how to design the program: which arguments to take and how to format the output :)

HASH

You could save 2 lines here
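The natural structure for a word counter is a hash with the word as key and the number of occurrences as value. A sketch, assuming words are lowercased and split on non-word characters (count_words is a hypothetical name). One common way to save lines in such code is to skip the "does this key exist?" test, since incrementing a missing key starts from zero.

```perl
use strict;
use warnings;

# Returns a hash: word => number of occurrences.
sub count_words {
    my @lines = @_;
    my %count;
    foreach my $line (@lines) {
        # Lowercase, then split on anything that is not a letter, digit or underscore.
        foreach my $word (split(/\W+/, lc($line))) {
            next if $word eq '';    # a leading separator yields an empty first field
            $count{$word}++;        # no initialisation needed: a missing key counts as 0
        }
    }
    return %count;
}

my %count = count_words("The cat and the dog.\n", "The END\n");
foreach my $word (sort keys %count) {
    print "$word\t$count{$word}\n";
}
```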

Arrays and Hashes: and now… something completely different

The sort function

    @names = ('Andrea', 'Robin', 'Giova');
    @sortednames = sort @names;

• Try this code yourself. Then try also this:

    @numbers = (10, 99, 999, -22);
    @sortednum = sort @numbers;

Sorting numbers

    @nums = (10, 99, 999, -22);
    @sortednum = sort {$a <=> $b} @nums;

• Try this code now. And try flipping $a and $b.

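• This finally explains why it is a bad idea to use $a (or $b) as a variable name of your own: sort uses them :)

• Remember that the keys of a hash are a list too (keys %hash), so they can be sorted in the same way.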
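Without a comparison block, sort compares its items as strings; the <=> operator compares them as numbers. The short sketch below uses a hypothetical list, (10, 9, 100, -22), chosen because the two orders differ on it.

```perl
use strict;
use warnings;

my @nums = (10, 9, 100, -22);              # hypothetical values

my @as_strings = sort @nums;               # default: string comparison
my @ascending  = sort { $a <=> $b } @nums; # numeric, smallest first
my @descending = sort { $b <=> $a } @nums; # $a and $b flipped: largest first

print "@as_strings\n";   # -22 10 100 9
print "@ascending\n";    # -22 9 10 100
print "@descending\n";   # 100 10 9 -22
```

As strings, "100" comes before "9" because the comparison goes character by character and "1" sorts before "9".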

Sorting hashes

    %cod = (1 => 'Tim', 2 => 'Andy', 3 => 'PI');
    foreach $id (sort {$a <=> $b} keys %cod) {
        print "$id\t$cod{$id}\n";
    }

• This code sorts by hash key. Sometimes we need to sort by hash value too (cmp compares strings, <=> compares numbers):

    sort {$cod{$a} cmp $cod{$b}} keys %cod;

http://perlmaven.com/how-to-sort-a-hash-in-perl
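Sorting "by value" still sorts the keys; only the comparison looks up the values. A small sketch with a hypothetical hash of word counts:

```perl
use strict;
use warnings;

my %hits = (the => 12, gene => 5, exon => 7);   # hypothetical word counts

# Keys in alphabetical order.
my @by_key = sort { $a cmp $b } keys %hits;

# Keys ordered by their value, largest count first.
my @by_value_desc = sort { $hits{$b} <=> $hits{$a} } keys %hits;

print "@by_key\n";          # exon gene the
print "@by_value_desc\n";   # the exon gene
```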


New version of the program: it now sorts by hits and prints the top x words (the hit parade).

There is a known inverse correlation between word length and word frequency: the most frequent words tend to be short.

Longer words tend to be more informative.

Improve the script to require a minimum word length (new parameter).
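The top-x list and the minimum length can be combined in a single helper. This is a sketch, assuming the counts are already in a hash; top_words and its argument order are hypothetical.

```perl
use strict;
use warnings;

# Returns the $top most frequent words that are at least $min_length long.
sub top_words {
    my ($top, $min_length, %count) = @_;
    # Keep only the words that are long enough.
    my @words = grep { length($_) >= $min_length } keys %count;
    # Most frequent first; ties are broken alphabetically for a stable output.
    @words = sort { $count{$b} <=> $count{$a} or $a cmp $b } @words;
    # Truncate the list to the requested number of entries.
    $#words = $top - 1 if @words > $top;
    return @words;
}

my %count = (of => 20, the => 12, gene => 5, exon => 7);   # hypothetical counts
print join(', ', top_words(2, 3, %count)), "\n";           # the, exon
```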

Parsing a file: SAM as a common example

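SAM format review

• Header: lines starting with @

• Alignments in tabular (tab-separated) format, with 11 mandatory fields: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL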



Two simple examples

• SAM to FASTQ (extracts reads from SAM file)

• Subsample SAM

SAM to FASTQ

1. Print user manual
2. Get arguments from command line via @ARGV
3. Open the SAM file, we’ll use “SAM” as file handle

    while ($line = <SAM>) {
        chomp($line);
        @sam_fields = split(/\t/, $line);
        print "\@$sam_fields[0]\n";
        print "$sam_fields[9]\n+\n";
        print "$sam_fields[10]\n";
    }

Subsample SAM

In the simplest implementation we just want to print one line out of every N lines, e.g. 1 out of 10 (= 10%).

Subsample SAM

1. Print user manual

    ($sam_file, $denom) = @ARGV;

    if (open(SAM, "$sam_file") == 0) {
        die "Unable to read SAM file: \"$sam_file\".\n";
    }

    while ($line = <SAM>) {
        $c++;
        if ($c % $denom == 0) {
            print $line;
        }
    }


Is there something to fix here?
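One possible answer, offered as an assumption about what the question hints at: the loop above also counts and subsamples the header lines, and it never checks the denominator. The sketch below always keeps header lines and counts alignments only; subsample_sam is a hypothetical name.

```perl
use strict;
use warnings;

# Keep every header line and one alignment line out of every $denom.
sub subsample_sam {
    my ($denom, @lines) = @_;
    die "Denominator must be a positive integer\n"
        unless defined $denom && $denom =~ /^[1-9]\d*$/;
    my @kept;
    my $c = 0;
    foreach my $line (@lines) {
        if (substr($line, 0, 1) eq '@') {
            push(@kept, $line);     # header: always kept, never counted
            next;
        }
        $c++;
        push(@kept, $line) if $c % $denom == 0;
    }
    return @kept;
}

if (@ARGV == 2) {
    my ($sam_file, $denom) = @ARGV;
    open(my $sam, '<', $sam_file) or die "Unable to read SAM file: \"$sam_file\".\n";
    print subsample_sam($denom, <$sam>);
    close($sam);
}
```

Paired reads are still sampled independently here, so mates can be separated; keeping pairs together would need a different counting scheme.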

Another example: Adding a little bit of spice

Alignments per chromosome

• We basically want to count how many alignments were found per reference sequence

• We might be interested, also, in normalising the number of alignments by chromosome length



Let’s do it together


1. We need to store the chromosome length. That information is in the header. How can we store it?

2. We need a counter for each chromosome. We don’t know in advance how many there are. Any idea for this?


Global structure

    while ($line = <SAM>) {
        chomp($line);
        if (substr($line, 0, 1) eq '@') {
            # header parsing
        } else {
            # alignments parsing
        }
    }


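Header parsing

    ($field, $content) = split(/\t/, $line, 2);    # limit 2: the rest of the line stays in $content
    if ($field eq '@SQ') {
        # assumes the SN: tag comes first and the LN: tag second
        ($seqname, $len) = split(/\t/, $content);
        $seqname = substr($seqname, 3);            # drop the 3-character "SN:" prefix
        …
        $size{$seqname} = $len;
    }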
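Both questions can be answered with hashes: one from reference name to length, one from reference name to counter. The sketch below puts header and alignment parsing together; count_alignments is a hypothetical name, and returning two hash references is just a convenient way to hand back both hashes.

```perl
use strict;
use warnings;

# Returns (\%size, \%count): reference lengths and alignments per reference.
sub count_alignments {
    my @lines = @_;
    my (%size, %count);
    foreach my $line (@lines) {
        chomp($line);
        my @fields = split(/\t/, $line);
        if (substr($line, 0, 1) eq '@') {
            next unless $fields[0] eq '@SQ';
            my ($name, $len);
            # Look at every tag, so the order of SN: and LN: does not matter.
            foreach my $tag (@fields[1 .. $#fields]) {
                $name = substr($tag, 3) if substr($tag, 0, 3) eq 'SN:';
                $len  = substr($tag, 3) if substr($tag, 0, 3) eq 'LN:';
            }
            $size{$name} = $len if defined $name && defined $len;
        } else {
            my $reference = $fields[2];      # RNAME is the third column
            next if !defined $reference || $reference eq '*';   # unmapped
            $count{$reference}++;
        }
    }
    return (\%size, \%count);
}

if (@ARGV) {
    open(my $sam, '<', $ARGV[0]) or die "Unable to read \"$ARGV[0]\": $!\n";
    my ($size, $count) = count_alignments(<$sam>);
    close($sam);
    foreach my $name (sort keys %$count) {
        my $per_kb = $size->{$name} ? 1000 * $count->{$name} / $size->{$name} : 0;
        printf("%s\t%d\t%.3f\n", $name, $count->{$name}, $per_kb);
    }
}
```

The third printed column is the number of alignments per 1000 bases of reference, one simple way to normalise by chromosome length.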



Parsing the FLAG: Complex yet simple

An example

Bit 0 = Read is paired
Bit 1 = The read is mapped in a pair
Bit 2 = The query sequence is unmapped
Bit 3 = The mate is unmapped
Bit 4 = Strand of query (0=forward 1=reverse)
etc.

For example:

Bit 0 - true - add 2**0 = 1
Bit 1 - true - add 2**1 = 2
Bit 2 - false - add nothing
Bit 3 - true - add 2**3 = 8
Bit 4 - true - add 2**4 = 16

Bit pattern = 11011 = 16+8+2+1 = Flag 27

http://picard.sourceforge.net/explain-flags.html
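The decomposition can be checked with the bitwise AND operator, testing one bit at a time. A minimal sketch; set_bits is a hypothetical helper, and the range 0 to 11 reflects the 12 flag bits defined by the SAM specification.

```perl
use strict;
use warnings;

# Return the list of bit positions that are set in a SAM flag.
sub set_bits {
    my ($flag) = @_;
    my @bits;
    foreach my $bit (0 .. 11) {
        # 1 << $bit is 2**$bit; the AND is non-zero only if that bit is set.
        push(@bits, $bit) if $flag & (1 << $bit);
    }
    return @bits;
}

print join(',', set_bits(27)), "\n";   # 0,1,3,4
print join(',', set_bits(99)), "\n";   # 0,1,5,6
```

For flag 99 (64+32+2+1) bit 4 is not set, so the read is on the forward strand.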


Interested in strand?

Bit 4 = Strand of query (0=forward 1=reverse) etc.

Flag 99: which is the strand?

    if ($flag & 16) {       # 16 = 2**4, i.e. bit 4
        print "Reverse";
    } else {
        print "Forward";
    }

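This fixes our past “sam to fastq”: SAM stores reads aligned to the reverse strand as their reverse complement, so the original read is recovered by reverse-complementing the sequence and reversing the quality string.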
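A sketch of a strand-aware conversion for a single SAM line. The name sam_line_to_fastq is hypothetical; the complement table covers A, C, G and T only (N is left unchanged), and secondary or supplementary alignments are not treated specially.

```perl
use strict;
use warnings;

# Convert one SAM line to a FASTQ record ('' for headers or short lines).
sub sam_line_to_fastq {
    my ($line) = @_;
    chomp($line);
    return '' if substr($line, 0, 1) eq '@';    # skip header lines
    my @f = split(/\t/, $line);
    return '' if @f < 11;                       # not a complete alignment line
    my ($name, $flag, $seq, $qual) = @f[0, 1, 9, 10];
    if ($flag & 16) {
        # Reverse strand: restore the read as it was sequenced.
        $seq = reverse($seq);
        $seq =~ tr/ACGTacgt/TGCAtgca/;
        $qual = reverse($qual);
    }
    return "\@$name\n$seq\n+\n$qual\n";
}

if (@ARGV) {
    open(my $sam, '<', $ARGV[0]) or die "Unable to read \"$ARGV[0]\": $!\n";
    while (my $line = <$sam>) {
        print sam_line_to_fastq($line);
    }
    close($sam);
}
```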

Pattern matching: Brief introduction

What is it?

• The key feature making Perl so powerful in processing text (i.e. parsing)

• A mini language inside the language

• Something like wildcards in Bash (but much more powerful)

• Used to find data and substitute it

The match operator

• Patterns are usually delimited by /

• Returns true if the text contains the pattern

• Example:

    if ($dna =~ /GGATCC/) {
        print "BamHI sensitive sequence\n";
    }

The substitution operator

• Syntax is s/PATTERN/SUBSTITUTION/

• Returns true (the number of substitutions made) if the text contains the pattern

• The modifier “g” at the end substitutes all the occurrences

• Example:

    $dna =~ s/GGATCC/-cut-/g;
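Both operators can be tried on a short test sequence. The sequence below is made up for the purpose and contains two BamHI sites.

```perl
use strict;
use warnings;

my $dna = 'AAGGATCCTTGGATCCAA';     # hypothetical sequence with two GGATCC sites

# Match: true if the pattern occurs anywhere in the string.
my $has_site = ($dna =~ /GGATCC/) ? 1 : 0;

# Substitution with g: replaces every occurrence and returns how many.
my $replaced = ($dna =~ s/GGATCC/-cut-/g);

print "Site found: $has_site\n";    # Site found: 1
print "Replaced:   $replaced\n";    # Replaced:   2
print "Result:     $dna\n";         # Result:     AA-cut-TT-cut-AA
```

Note that the substitution modifies $dna itself; copy the string first if the original is still needed.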

Before starting…

• http://regex101.com as a tester

• You can create your own regex tester :)

• Manual: http://perldoc.perl.org/perlrecharclass.html


Metacharacters

• ^ Requires the pattern to be found at the beginning

• $ Requires the pattern to be found at the end

• . Matches any character (except for \n)

• \s Matches a whitespace (space, tab…)

• \S Matches a non-whitespace (everything else)

Metacharacters (2)

• \d Matches any digit (\D any non-digit)

• \w Matches any alphanumeric character or underscore (\W …)

• The escape “\” before operators - as usual - makes them literal:

    /C. elegans/     matches “Ca elegans” too
    /C\. elegans/    matches “C. elegans” only
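A few of these metacharacters tried on made-up strings; each entry of the hash records whether the pattern matched (1) or not (0).

```perl
use strict;
use warnings;

my %found = (
    at_start    => ('ATGCCC'     =~ /^ATG/          ? 1 : 0),  # 1
    at_end      => ('CCCTAA'     =~ /TAA$/          ? 1 : 0),  # 1
    any_char    => ('Ca elegans' =~ /C. elegans/    ? 1 : 0),  # 1: the dot matches "a"
    literal_dot => ('Ca elegans' =~ /C\. elegans/   ? 1 : 0),  # 0: a real dot is required
    digit       => ('chr12'      =~ /\d/            ? 1 : 0),  # 1
    whitespace  => ('C.elegans'  =~ /\s/            ? 1 : 0),  # 0: no space or tab
    word_chars  => ('seq_01'     =~ /^\w\w\w\w\w\w$/ ? 1 : 0), # 1: \w includes "_"
);

foreach my $name (sort keys %found) {
    print "$name\t$found{$name}\n";
}
```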

Quantifiers

• Added at the end of an entity

• ? zero or one

• * zero or more

• + at least one (one or more)

• {n} exactly n ({n,} n or more)

• {n,m} between n and m
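The quantifiers can be compared side by side on short made-up strings. The patterns are anchored with ^ and $ so that the whole string has to fit.

```perl
use strict;
use warnings;

my %found = (
    optional_yes  => ('color' =~ /^colou?r$/ ? 1 : 0),  # 1: ? allows zero "u"
    star_zero     => ('GT'    =~ /^GA*T$/    ? 1 : 0),  # 1: * allows zero "A"
    plus_zero     => ('GT'    =~ /^GA+T$/    ? 1 : 0),  # 0: + needs at least one "A"
    plus_many     => ('GAAAT' =~ /^GA+T$/    ? 1 : 0),  # 1
    exactly_four  => ('AAAA'  =~ /^A{4}$/    ? 1 : 0),  # 1
    five_not_four => ('AAAAA' =~ /^A{4}$/    ? 1 : 0),  # 0: {4} means exactly 4
    four_or_more  => ('AAAAA' =~ /^A{4,}$/   ? 1 : 0),  # 1
    two_to_three  => ('AAAA'  =~ /^A{2,3}$/  ? 1 : 0),  # 0: 4 is outside 2..3
);

foreach my $name (sort keys %found) {
    print "$name\t$found{$name}\n";
}
```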

Modifiers

• Added at the end of the whole pattern

• i case insensitive

• g global: don’t stop at the first match

• s makes the dot match \n too

• Example: /name/i
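The effect of the three modifiers, shown on made-up strings:

```perl
use strict;
use warnings;

# i: upper and lower case are treated as the same.
my $case_insensitive = ('GGATCC' =~ /ggatcc/i) ? 1 : 0;     # 1

# g in list context: returns every match, not only the first one.
my @all_hits = ('ATATAT' =~ /AT/g);                         # 3 elements

# s: lets the dot match the newline as well.
my $two_lines = "line1\nline2";
my $dot_plain = ($two_lines =~ /line1.line2/)  ? 1 : 0;     # 0
my $dot_s     = ($two_lines =~ /line1.line2/s) ? 1 : 0;     # 1

print "i: $case_insensitive\n";
print "g: ", scalar(@all_hits), " matches\n";
print "s: without=$dot_plain with=$dot_s\n";
```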


Character classes

• A list of characters (and metacharacters) enclosed in [ ]. At the beginning of a class, the ^ means “not”

• [aeiou] any vowel

• [A-Z] any uppercase letter

• [^a-z] any character that is not a lowercase letter

• [ACGTacgt] any DNA letter
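Character classes are handy for validating and counting. A sketch with two hypothetical helpers, is_dna and count_vowels:

```perl
use strict;
use warnings;

# True (1) if the string is made only of DNA letters, in either case.
sub is_dna {
    my ($text) = @_;
    return ($text =~ /^[ACGTacgt]+$/) ? 1 : 0;
}

# Number of lowercase vowels in a word.
sub count_vowels {
    my ($word) = @_;
    my @vowels = ($word =~ /[aeiou]/g);     # g in list context: all matches
    return scalar(@vowels);
}

print is_dna('ACGTacgt'), "\n";                 # 1
print is_dna('ACGTN'), "\n";                    # 0: N is not in the class
print count_vowels('bioinformatics'), "\n";     # 6
print(('abcD' =~ /[^a-z]/) ? 1 : 0, "\n");      # 1: "D" is not a lowercase letter
```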

Alternatives

• (also) in this context the pipe char is the “or”

• /TTC|GTA/ matches either TTC or GTA

• Grouped by parentheses: /my name is (Andrea|Paolo)/

Capturing portions

• The parentheses are used to capture a portion of pattern.

• /my name is \w+ \w+/ should match a name and a surname.

• /my name is (\w+) (\w+)/ will capture the text matched inside each pair of parentheses, saving it to $1 and $2
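Capturing and alternatives work together, because the parentheses that group the alternatives also capture the one that matched. A short sketch on a made-up sentence:

```perl
use strict;
use warnings;

my $sentence = 'my name is Andrea Telatin';

# Two capture groups: first word into $1, second word into $2.
my ($first, $last) = ('', '');
if ($sentence =~ /my name is (\w+) (\w+)/) {
    ($first, $last) = ($1, $2);
}

# Alternatives inside parentheses: $1 holds whichever name matched.
my $known = ($sentence =~ /my name is (Andrea|Paolo)/) ? $1 : 'nobody';

print "First: $first\n";    # First: Andrea
print "Last:  $last\n";     # Last:  Telatin
print "Known: $known\n";    # Known: Andrea
```

Always test that the match succeeded before reading $1 and $2: after a failed match they keep the values of the last successful one.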

CIGAR parsing: Thinking in terms of a regular expression

    while ($cigar =~ /(\d+)([MIDNSHP=X])/g)

Here you are
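A CIGAR string is a run of number and operation pairs, so a global match in a while loop visits one pair per iteration. The sketch below assumes the operation letters of the SAM specification (M, I, D, N, S, H, P, = and X); cigar_totals and reference_span are hypothetical names.

```perl
use strict;
use warnings;

# Returns a hash: operation letter => total length for that operation.
sub cigar_totals {
    my ($cigar) = @_;
    my %total;
    # With g in scalar context, each iteration finds the next pair.
    while ($cigar =~ /(\d+)([MIDNSHP=X])/g) {
        $total{$2} += $1;
    }
    return %total;
}

# Number of reference bases covered: the operations that consume the reference.
sub reference_span {
    my ($cigar) = @_;
    my %total = cigar_totals($cigar);
    my $span = 0;
    foreach my $op ('M', 'D', 'N', '=', 'X') {
        $span += $total{$op} if exists $total{$op};
    }
    return $span;
}

my %total = cigar_totals('10M2I5D20M');
print "M=$total{M} I=$total{I} D=$total{D}\n";       # M=30 I=2 D=5
print "Span: ", reference_span('10M2I5D20M'), "\n";  # Span: 35
```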

Do it yourself

Extend the “word counter” program so that…

• It tells you how many words have a double consonant pattern (two consecutive consonants)

• It tells you how many words start and end with a vowel

• It tells you how many words contain the pattern the user specifies via the command line

• Optional: it prints the matching words, not only the total
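A possible sketch of the three counters, leaving the optional part as an exercise. It assumes that "y" counts as a consonant and that a one-letter vowel word both starts and ends with a vowel; word_pattern_stats is a hypothetical name.

```perl
use strict;
use warnings;

# Returns a hash with the three counters for a list of words.
sub word_pattern_stats {
    my ($pattern, @words) = @_;
    my $user_regex = qr/$pattern/;      # compile the user's pattern once
    my %n = (double_consonant => 0, vowel_ends => 0, user_pattern => 0);
    foreach my $word (@words) {
        my $w = lc($word);
        # Two consecutive letters from the consonant class.
        $n{double_consonant}++ if $w =~ /[b-df-hj-np-tv-z]{2}/;
        # A vowel first and a vowel last (possibly the same single letter).
        $n{vowel_ends}++       if $w =~ /^[aeiou](.*[aeiou])?$/;
        $n{user_pattern}++     if $w =~ $user_regex;
    }
    return %n;
}

my %n = word_pattern_stats('on$', 'area', 'gene', 'intron', 'a', 'exon');
print "Double consonant: $n{double_consonant}\n";   # 1 (intron)
print "Vowel at both ends: $n{vowel_ends}\n";       # 2 (area, a)
print "User pattern: $n{user_pattern}\n";           # 2 (intron, exon)
```

To count only doubled letters such as "tt", the first pattern could use a backreference instead: /([b-df-hj-np-tv-z])\1/.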

Andrea, Diego, Paolo and BMR Genomics staff

Thanks for your attention

