Final lesson of the Bioinformatics Programming mini course. Parsing SAM files: CIGAR and FLAG. Numeric sorting. More regular expressions.
Bioinformatics Training: Introduction to Perl Programming
Andrea Telatin
Today’s menu
• Exercises!
• More pattern matching
• Exercises!
• Some tricks to better work with SAM files
• I kept past slides here
Warming up: Simple exercises
Simple Exercises
Write a program that…
• Prints the list of the arguments passed by the user, and the number of arguments passed
• Takes as input a list of integers, and prints their sum and average
  Hints: use a regex to validate the input! Remember how to round the average?
• Prints a random DNA sequence with user-supplied length
Reading files: review and exercises
Reading a text file
• Requires the program to know the filename, which in Bash jargon means its path (either relative or absolute)
• In Perl we assign a nickname to the opened file, called a file handle.
• We use the open(FH, filename) function
• The function returns "true" on success.
Reading a File
if (open(FILE, 'path/to/filename') == 0) {
    die "Can't read filename.\n";
}
Here FILE is the file handle and 'path/to/filename' is the file name.
After opening it, we can read it line by line with a while loop (the typical "read file" loop):
while ($currentline = <FILE>) {
    print "$currentline";
}
Warning: $currentline keeps its return char; if not needed, remember to remove it with chomp($currentline)!
A Perl clone: cat.pl
if (open(I, "$filename") == 0) {
    die "Can't read \"$filename\".\n";
}
while ($line = <I>) {
    chomp($line);
    $c++;
    print "$line\n";
}
print STDERR "I read $c lines.\n";
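The same loop can be sketched in a stricter style, assuming lexical file handles and the three-argument form of open. The sub name cat_lines and the in-memory demo text are illustrative choices, not part of the original slides.

```perl
use strict;
use warnings;

# Print every line read from the handle $in to the handle $out.
# Returns the number of lines read.
sub cat_lines {
    my ($in, $out) = @_;
    my $count = 0;
    while (my $line = <$in>) {
        chomp($line);
        $count++;
        print {$out} "$line\n";
    }
    return $count;
}

# Demo on an in-memory "file", so the sketch runs without any input file.
my $text = "first line\nsecond line\nthird line\n";
open(my $in, '<', \$text) or die "Can't read the in-memory text: $!\n";
my $n = cat_lines($in, \*STDOUT);
close($in);
print STDERR "I read $n lines.\n";
```

To read a real file, replace \$text with a path taken from @ARGV.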
Do it yourself
Write a program that…
• Reads a text file and prints:
  • the number of lines found
  • the average line length
  • the number of words in the document
Do it yourself
Extend the previous program…
• We now want to count how many times each word is present and print the most abundant words
• Decide by yourself how to design the program: which arguments to take and how to format the output :)
[Slide: code screenshot annotated "HASH" and "You could save 2 lines here"]
Arrays and Hashes: and now… something completely different
The sort function
• @names = ('Andrea', 'Robin', 'Giova');
  @sortednames = sort @names;
• Try this code yourself. Then try also this:
• @numbers = (10, 99, 999, -22);
  @sortednum = sort @numbers;
Sorting numbers
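• @nums = (10, 99, 999, -22);
  @sortednum = sort {$a <=> $b} @nums;
• Try this code now. And try flipping $a and $b.
• This finally explains why it is a bad idea to use $a as a variable name of your own: sort uses $a and $b for the two items being compared :)
• Remember that the keys of a hash (keys %hash) are a list, so they can be sorted too.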
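A minimal sketch of the difference between the default sort and the numeric one. The list of numbers is a hypothetical one, chosen here so that the two orders actually differ.

```perl
use strict;
use warnings;

my @numbers = (10, 9, 100, -22);

my @as_text    = sort @numbers;                # default: compares as strings
my @ascending  = sort { $a <=> $b } @numbers;  # numeric, smallest first
my @descending = sort { $b <=> $a } @numbers;  # numeric, $a and $b flipped

print "as text:    @as_text\n";     # -22 10 100 9
print "ascending:  @ascending\n";   # -22 9 10 100
print "descending: @descending\n";  # 100 10 9 -22
```

Compared as strings, "100" comes before "9" because the comparison goes character by character.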
Sorting hashes
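• %cod = (1 => 'Tim', 2 => 'Andy', 3 => 'PI');
  foreach $id (sort {$a <=> $b} keys %cod) { print "$id: $cod{$id}\n"; }
• This code sorts by hash key. Sometimes we need to sort by hash value too:
• sort {$cod{$a} cmp $cod{$b}} keys %cod;
  (cmp compares strings, as the values here are names; use <=> when the values are numbers)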
http://perlmaven.com/how-to-sort-a-hash-in-perl
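Both ways of sorting a hash can be tried side by side. This is a small sketch using the %cod hash from the slide, with the names quoted and compared with cmp because they are strings.

```perl
use strict;
use warnings;

my %cod = (1 => 'Tim', 2 => 'Andy', 3 => 'PI');

# Keys are numbers: compare them with <=>.
my @by_key = sort { $a <=> $b } keys %cod;

# Values are names: compare them with cmp. The result is still a list of keys.
my @by_value = sort { $cod{$a} cmp $cod{$b} } keys %cod;

foreach my $id (@by_key)   { print "by key:   $id => $cod{$id}\n"; }
foreach my $id (@by_value) { print "by value: $id => $cod{$id}\n"; }
```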
New version of the program: it now sorts the words by hits and prints the top X "hit parade".
There is a known inverse correlation between word length and frequency: short words tend to be the most frequent.
Longer words tend to be more informative.
Improve the script to require a minimum word length (new parameter).
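One possible shape for this version, offered as a sketch rather than the course solution. The sub names count_words and top_words, the way words are split (on non-word characters, lowercased) and the demo text are all assumptions made for the example.

```perl
use strict;
use warnings;

# Count the words of at least $min_len characters read from the handle $in.
sub count_words {
    my ($in, $min_len) = @_;
    my %hits;
    while (my $line = <$in>) {
        foreach my $word (split /\W+/, lc $line) {
            next if $word eq '' or length($word) < $min_len;
            $hits{$word}++;
        }
    }
    return %hits;
}

# Return the $top most frequent words; ties are broken alphabetically.
sub top_words {
    my ($top, %hits) = @_;
    my @sorted = sort { $hits{$b} <=> $hits{$a} or $a cmp $b } keys %hits;
    $top = @sorted if $top > @sorted;
    return @sorted[0 .. $top - 1];
}

# Demo on an in-memory text: top 2 words with at least 4 letters.
my $text = "the gene and the protein and the genome\nthe gene encodes the protein\n";
open(my $in, '<', \$text) or die "Can't read the in-memory text: $!\n";
my %hits = count_words($in, 4);
close($in);
foreach my $word (top_words(2, %hits)) {
    print "$word\t$hits{$word}\n";
}
```

In a command line version, the file name, the number of words to print and the minimum length would come from @ARGV.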
Parsing a file: SAM as a common example
SAM format review
• Header: lines starting with @
• Alignments in tabular format (tab-separated fields, one alignment per line)
Two simple examples
• SAM to FASTQ (extracts reads from SAM file)
• Subsample SAM
SAM to FASTQ
1. Print user manual
2. Get arguments from command line via @ARGV
3. Open the SAM file, we'll use "SAM" as file handle
while ($line = <SAM>) {
    chomp($line);
    @sam_fields = split(/\t/, $line);
    print "\@$sam_fields[0]\n";
    print "$sam_fields[9]\n+\n";
    print "$sam_fields[10]\n";
}
Subsample SAM
In the simplest implementation we just want to print one line out of every batch of lines, e.g. 1 out of 10 (= 10%).
1. Print user manual
($sam_file, $denom) = @ARGV;

if (open(SAM, "$sam_file") == 0) {
    die "Unable to read SAM file: \"$sam_file\".\n";
}

while ($line = <SAM>) {
    $c++;
    if ($c % $denom == 0) {
        print $line;
    }
}
Is there something to fix here?
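One possible answer, as a sketch: the loop above counts header lines like alignments, so most of the header is lost, and a missing or zero $denom makes the modulus fail. The version below assumes we want to keep every header line and count only alignments; the sub name subsample_sam and the demo data are illustrative.

```perl
use strict;
use warnings;

# Copy a SAM stream from $in to $out, keeping every header line and
# one alignment out of every $denom. Returns the number of alignments kept.
sub subsample_sam {
    my ($in, $out, $denom) = @_;
    die "The denominator must be a positive integer.\n"
        unless defined $denom and $denom =~ /^\d+$/ and $denom > 0;
    my ($seen, $kept) = (0, 0);
    while (my $line = <$in>) {
        if (substr($line, 0, 1) eq '@') {   # header: always printed, never counted
            print {$out} $line;
            next;
        }
        $seen++;
        if ($seen % $denom == 0) {
            print {$out} $line;
            $kept++;
        }
    }
    return $kept;
}

# Demo on a tiny in-memory SAM: one header line and four alignments.
my $sam = "\@SQ\tSN:chr1\tLN:1000\n";
foreach my $i (1 .. 4) {
    $sam .= "read$i\t0\tchr1\t$i\t60\t4M\t*\t0\t0\tACGT\tIIII\n";
}
open(my $in, '<', \$sam) or die "Can't read the in-memory SAM: $!\n";
my $kept = subsample_sam($in, \*STDOUT, 2);
close($in);
print STDERR "Kept $kept alignments.\n";
```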
Another example: Adding a little bit of spice
Alignments per chromosome
• We basically want to count how many alignments were found per reference sequence
• We might be interested, also, in normalising the number of alignments by chromosome length
Let's do it together
1. We need to store the chromosome length. That information is in the header. How can we store it?
2. We need a counter for each chromosome. We don't know in advance how many there are. Any idea for this?
Global structure
while ($line = <SAM>) {
    chomp($line);
    if (substr($line, 0, 1) eq '@') {
        # header parsing
    } else {
        # alignments parsing
    }
}
Header parsing
($field, $content) = split(/\t/, $line, 2);    # limit 2: $content keeps the rest of the line
if ($field eq '@SQ') {
    ($seqname, $len) = split(/\t/, $content);
    $seqname = substr($seqname, 3);             # drop the "SN:" prefix
    …
    $size{$seqname} = $len;
}
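Putting the pieces together, a sketch of the whole counter could look like this. It uses two hashes, one for the lengths and one for the counts, and (as a swapped-in technique) reads the SN: and LN: tags with regular expressions instead of substr. The sub name count_per_reference and the demo data are assumptions.

```perl
use strict;
use warnings;

# Read a SAM stream; return references to two hashes:
# reference lengths (from the @SQ header lines) and alignment counts.
sub count_per_reference {
    my ($in) = @_;
    my (%size, %count);
    while (my $line = <$in>) {
        chomp($line);
        if (substr($line, 0, 1) eq '@') {
            # Header parsing: only @SQ lines carry SN:<name> and LN:<length>.
            next unless $line =~ /^\@SQ\t/;
            my ($name) = $line =~ /\tSN:([^\t]+)/;
            my ($len)  = $line =~ /\tLN:(\d+)/;
            $size{$name} = $len if defined $name and defined $len;
        } else {
            # Alignment parsing: the third field is the reference name,
            # or '*' when the read is unmapped.
            my @fields = split(/\t/, $line);
            next if @fields < 3 or $fields[2] eq '*';
            $count{ $fields[2] }++;
        }
    }
    return (\%size, \%count);
}

# Demo on a tiny in-memory SAM file.
my $sam = join("\n",
    "\@SQ\tSN:chr1\tLN:1000",
    "\@SQ\tSN:chr2\tLN:500",
    "r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t50\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr2\t5\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
) . "\n";
open(my $in, '<', \$sam) or die "Can't read the in-memory SAM: $!\n";
my ($size, $count) = count_per_reference($in);
close($in);

# Print name, alignments, and alignments per 1000 bases of reference.
foreach my $name (sort keys %$size) {
    my $n = $count->{$name} // 0;
    printf "%s\t%d\t%.2f\n", $name, $n, 1000 * $n / $size->{$name};
}
```

A hash is a natural fit for both questions because the chromosome names are not known in advance: each new name simply becomes a new key.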
Parsing the FLAG: Complex yet simple
An example
Bit 0 = Read is paired
Bit 1 = The read is mapped in a pair
Bit 2 = The query sequence is unmapped
Bit 3 = The mate is unmapped
Bit 4 = Strand of query (0=forward 1=reverse)
etc.
For example:
Bit 0 - true - add 2**0 = 1
Bit 1 - true - add 2**1 = 2
Bit 2 - false - add nothing
Bit 3 - true - add 2**3 = 8
Bit 4 - true - add 2**4 = 16
Bit pattern = 11011 = 16+8+2+1 = Flag 27
http://picard.sourceforge.net/explain-flags.html
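The arithmetic above can be checked with a few lines of Perl. This sketch only knows the five bits listed on the slide; the sub name set_bits is an illustrative choice.

```perl
use strict;
use warnings;

# Meaning of bits 0 to 4, as listed above.
my @meaning = (
    'read is paired',
    'read is mapped in a pair',
    'query sequence is unmapped',
    'mate is unmapped',
    'query is on the reverse strand',
);

# Return the positions (0 to 4) of the bits that are set in $flag.
sub set_bits {
    my ($flag) = @_;
    return grep { $flag & (1 << $_) } 0 .. $#meaning;
}

my $flag = 27;
printf "Flag %d = binary %05b\n", $flag, $flag;
foreach my $bit (set_bits($flag)) {
    printf "Bit %d is set (adds %2d): %s\n", $bit, 1 << $bit, $meaning[$bit];
}
```

Here 1 << $bit is the same value as 2**$bit, written as a bit shift.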
Interested in strand?
Bit 4 = Strand of query (0=forward 1=reverse) etc.
Flag 99: which is the strand?
if ($flag & 16) {
    print "Reverse";
} else {
    print "Forward";
}
Here 16 = 2**4, the value of bit 4.
This fixes our past "sam to fastq".
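A sketch of how the strand test could be used in the SAM to FASTQ converter: when bit 4 is set, the sequence stored in the SAM line is the reverse complement of the original read, so we turn it back and reverse the quality string too. The sub names revcomp and sam_line_to_fastq and the two demo lines are assumptions.

```perl
use strict;
use warnings;

# Reverse-complement a DNA sequence. Only A, C, G and T are complemented;
# other symbols, such as N, are left as they are.
sub revcomp {
    my ($seq) = @_;
    $seq = reverse $seq;
    $seq =~ tr/ACGTacgt/TGCAtgca/;
    return $seq;
}

# Turn one SAM alignment line into a four-line FASTQ record.
sub sam_line_to_fastq {
    my ($line) = @_;
    chomp($line);
    my @f = split(/\t/, $line);
    my ($name, $flag, $seq, $qual) = @f[0, 1, 9, 10];
    if ($flag & 16) {              # bit 4 set: read aligned to the reverse strand
        $seq  = revcomp($seq);
        $qual = reverse $qual;
    }
    return "\@$name\n$seq\n+\n$qual\n";
}

# Demo: the same stored sequence with a forward flag (99) and a reverse one (147).
print sam_line_to_fastq("r1\t99\tchr1\t10\t60\t4M\t=\t50\t44\tAACG\tIIJK\n");
print sam_line_to_fastq("r2\t147\tchr1\t50\t60\t4M\t=\t10\t-44\tAACG\tIIJK\n");
```

Flag 99 is 64+32+2+1, so bit 4 is not set and the read is printed as stored; flag 147 is 128+16+2+1, so the read is turned back.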
Pattern matching: Brief introduction
What is it?
• The key feature making Perl so powerful in processing text (i.e. parsing)
• A mini language inside the language
• Something like wildcards in Bash, but much more powerful
• Used to find data and substitute it
The match operator
• Patterns are usually delimited by /
• Returns true if the text contains the pattern
• Example:
  if ($dna =~ /GGATCC/) { print "BamHI sensitive sequence\n"; }
The substitution operator
• Syntax is s/PATTERN/SUBSTITUTION/
• Returns true if the text contains the pattern (the value is the number of substitutions made)
• The modifier "g" at the end substitutes all the occurrences
• Example: $dna =~ s/GGATCC/-cut-/g;
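The two operators can be tried together on a short sequence. This is a minimal sketch; the sequence is made up for the example.

```perl
use strict;
use warnings;

my $dna = 'TTGGATCCAAGGATCCTT';

# Match: true if the pattern occurs anywhere in the text.
if ($dna =~ /GGATCC/) {
    print "BamHI sensitive sequence\n";
}

# Substitution with /g: replaces every occurrence and
# returns how many replacements were made.
my $cuts = ($dna =~ s/GGATCC/-cut-/g);
print "$cuts sites replaced: $dna\n";
```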
Before starting…
• http://regex101.com as a tester
• You can create your own regex tester :)
• Manual: http://perldoc.perl.org/perlrecharclass.html
Metacharacters
• ^ Requires the pattern to be found at the beginning
• $ Requires the pattern to be found at the end
• . Matches any character (except for \n)
• \s Matches a whitespace (space, tab…)
• \S Matches a non-whitespace (everything else)
Metacharacters (2)
• \d Matches any digit (\D any non-digit)
• \w Matches any alphanumeric character or underscore (\W …)
• The escape "\" before operators - as usual - makes them literal:
  /C. elegans/ matches "Ca elegans" too
  /C\. elegans/ matches "C. elegans" only
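A small table of patterns and texts makes the metacharacters easy to try. This is only a sketch: the test strings and the helper name matches are invented for the example.

```perl
use strict;
use warnings;

# Return 1 if $text matches the compiled pattern $re, 0 otherwise.
sub matches {
    my ($re, $text) = @_;
    return $text =~ $re ? 1 : 0;
}

# Each entry: [ pattern, text to test ].
my @cases = (
    [ qr/^ATG/,        'ATGCCC'     ],  # ^ : pattern at the beginning
    [ qr/^ATG/,        'CCATG'      ],  # ATG is there, but not at the beginning
    [ qr/TAA$/,        'CCCTAA'     ],  # $ : pattern at the end
    [ qr/C. elegans/,  'Ca elegans' ],  # unescaped dot: any character
    [ qr/C\. elegans/, 'Ca elegans' ],  # escaped dot: a literal "."
    [ qr/C\. elegans/, 'C. elegans' ],
    [ qr/chr\d/,       'chr7'       ],  # \d : a digit
    [ qr/read\sname/,  "read\tname" ],  # \s : whitespace, tab included
);

foreach my $case (@cases) {
    my ($re, $text) = @$case;
    (my $shown = $text) =~ s/\t/\\t/g;   # make the tab visible when printing
    printf "%-24s %-14s %s\n", $re, "'$shown'",
        matches($re, $text) ? 'match' : 'no match';
}
```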
Quantifiers
• Added at the end of an entity
• ? zero or one
• * zero or more
• + at least one (one or more)
• {n} exactly n ({n,} means n or more)
• {n,m} between n and m
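The quantifiers can be compared on a few toy strings. In this sketch every pattern is anchored with ^ and $ so that the whole string must match; the strings and the helper name matches are illustrative.

```perl
use strict;
use warnings;

# Return 1 if $text matches the compiled pattern $re, 0 otherwise.
sub matches {
    my ($re, $text) = @_;
    return $text =~ $re ? 1 : 0;
}

# Each entry: [ pattern, text to test ].
my @cases = (
    [ qr/^AT?G$/,   'AG'    ],  # ? : zero T is fine
    [ qr/^AT?G$/,   'ATTG'  ],  # ? : two T are too many
    [ qr/^AT*G$/,   'ATTTG' ],  # * : any number of T
    [ qr/^AT+G$/,   'AG'    ],  # + : at least one T is required
    [ qr/^A{3}$/,   'AAA'   ],  # {3} : exactly three
    [ qr/^A{3}$/,   'AAAA'  ],
    [ qr/^A{3,}$/,  'AAAA'  ],  # {3,} : three or more
    [ qr/^A{2,3}$/, 'AAAA'  ],  # {2,3} : two or three
);

foreach my $case (@cases) {
    my ($re, $text) = @$case;
    printf "%-20s %-8s %s\n", $re, $text,
        matches($re, $text) ? 'match' : 'no match';
}
```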
Modifiers
• Added at the end of the whole pattern
• i case insensitive
• g global: don't stop at the first match
• s makes the dot match \n too
• Example: /name/i
Character classes
• A list of characters (and metacharacters) enclosed in [ ]
• In this context a leading ^ means "not"
• [aeiou] any vowel
• [A-Z] any upper case letter
• [^a-z] any character that is not a lowercase letter
• [ACGTacgt] any DNA letter
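Character classes are handy to validate a sequence or to count what does not belong in it. A small sketch follows; the sub names is_dna and count_matches and the test sequence are assumptions made for the example.

```perl
use strict;
use warnings;

# True if the whole string is made of DNA letters only.
sub is_dna {
    my ($seq) = @_;
    return $seq =~ /^[ACGTacgt]+$/ ? 1 : 0;
}

# Count how many times the pattern $re matches in $text.
sub count_matches {
    my ($text, $re) = @_;
    my $count = () = $text =~ /$re/g;
    return $count;
}

my $seq = 'ACGTNNacgtRY';
print "Valid DNA? ", (is_dna($seq) ? 'yes' : 'no'), "\n";
print "Non-DNA letters: ", count_matches($seq, qr/[^ACGTacgt]/), "\n";
print "Upper case letters: ", count_matches($seq, qr/[A-Z]/), "\n";
```

The assignment to an empty list, () = ..., is a common Perl idiom that forces the match to return all its hits so they can be counted.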
Alternatives
• In this context (too) the pipe char means "or"
• /TTC|GTA/ matches either TTC or GTA
• Grouped by parentheses: /my name is (Andrea|Paolo)/
Capturing portions
• Parentheses are used to capture a portion of the pattern.
• /my name is \w+ \w+/ should match a name and a surname.
• /my name is (\w+) (\w+)/ also captures the two words in parentheses, saving them to $1 and $2
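A short sketch of capturing and of alternatives in action. The sub names parse_name and is_known_speaker and the sentences are invented for the example.

```perl
use strict;
use warnings;

# Return (name, surname) captured from the sentence, or an empty list.
sub parse_name {
    my ($sentence) = @_;
    if ($sentence =~ /my name is (\w+) (\w+)/) {
        return ($1, $2);
    }
    return;
}

# True if the speaker is one of the two alternatives in the group.
sub is_known_speaker {
    my ($sentence) = @_;
    return $sentence =~ /my name is (Andrea|Paolo)\b/ ? 1 : 0;
}

my ($name, $surname) = parse_name('Hello, my name is Andrea Telatin');
print "Name: $name, surname: $surname\n";
print "Known speaker? ", (is_known_speaker('my name is Diego') ? 'yes' : 'no'), "\n";
```

$1 and $2 are only meaningful right after a successful match, which is why they are read inside the if block.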
CIGAR parsing: Thinking in terms of a regular expression
Here you are:
while ($cigar =~ /(\d+)([MIDNSHP=X])/g)
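A sketch of what the body of that loop could do. Each CIGAR step is a number followed by an operation; in SAM the operations are the letters M, I, D, N, S, H, P plus = and X, so the sketch uses the character class [MIDNSHP=X] for the second capture. The sub names and the example CIGAR string are assumptions.

```perl
use strict;
use warnings;

# Split a CIGAR string into a list of [length, operation] pairs.
sub parse_cigar {
    my ($cigar) = @_;
    my @ops;
    while ($cigar =~ /(\d+)([MIDNSHP=X])/g) {
        push @ops, [ $1, $2 ];
    }
    return @ops;
}

# Number of reference bases covered by the alignment:
# M, D, N, = and X are the operations that consume the reference.
sub reference_length {
    my ($cigar) = @_;
    my $len = 0;
    foreach my $op (parse_cigar($cigar)) {
        my ($n, $type) = @$op;
        $len += $n if $type =~ /[MDN=X]/;
    }
    return $len;
}

my $cigar = '5S10M2I3D20M';
foreach my $op (parse_cigar($cigar)) {
    print "$op->[0] x $op->[1]\n";
}
print "Reference bases covered: ", reference_length($cigar), "\n";
```

With /g in a while loop, each iteration resumes the search where the previous match ended, so the loop walks the CIGAR one step at a time.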
Do it yourself
Extend the "word counter" program so that…
• It tells you how many words have a double consonant pattern (two consecutive consonants)
• It tells you how many words start and end with a vowel
• It tells you how many words contain the pattern the user specifies via the command line
• Optional: it prints the matching words, not only the total
Andrea, Diego, Paolo and BMR Genomics staff
Thanks for your attention