alena-dendy
Perl
Introduction
Why Perl?
• Widely used scripting language
• Powerful text manipulation capabilities
• Relatively easy to use
• Has a wide range of libraries available
• Fast
• Good support for file and process operations
Less suitable for:
• Building large and complex applications
  – Java, C/C++, C#
• Applications with a GUI
  – Java, C/C++, C#
• High performance/memory efficient applications
  – Java, C/C++, C#, Fortran
• Statistics
  – R
Learning to script
Knowledge + Skills
Exercise
Determine the percentage GC-content of the human chromosome 22
open file
read lines
per line:
    skip if header line
    count Cs and Gs
    count all nucleotides
report percentage Cs and Gs
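The steps above can be sketched in Perl roughly as follows. This is only one possible outline, not the course solution: the helper name gc_percentage and the example lines are made up for illustration, and it works on a list of lines instead of a real chromosome file. It also assumes that every character on a sequence line counts as a nucleotide.

```perl
use strict;
use warnings;

# Hypothetical helper: percentage of Cs and Gs in a list of FASTA lines.
sub gc_percentage {
    my @lines = @_;
    my $gc    = 0;
    my $total = 0;
    foreach my $line (@lines) {
        chomp($line);
        next if substr($line, 0, 1) eq ">";   # skip if header line
        $gc    += ($line =~ tr/CGcg//);       # count Cs and Gs
        $total += length($line);              # count all nucleotides
    }
    return 0 if $total == 0;                  # avoid dividing by zero
    return 100 * $gc / $total;
}

# Made-up example data standing in for the lines of a FASTA file
my @example = (">example header\n", "ATGC\n", "GGCC\n");
printf("GC content: %.1f%%\n", gc_percentage(@example));
```

Reading the lines from a file instead of a list is covered in the I/O part of the course.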
Hello World
Hello World….
Simple line of Perl code:
print "Hello World";
Run from a terminal:
perl -e 'print "Hello World";'
Now try this and notice the difference:
perl -e 'print "Hello World\n";'
\n
"backslash-n": newline character ('Enter' key)
\t
"backslash-t": tab character ('Tab' key)
Hello World (cont)
To create a text file with this line of Perl code:
echo 'print "Hello World\n";' > HelloWorld.pl
perl HelloWorld.pl
In the terminal window, type
kate HelloWorld.pl
and then hit the Enter key. Now you can edit the Perl code.
Pythagoras.pl
$a = 3;
$b = 4;
$a2 = $a * $a;
$b2 = $b * $b;
$c2 = $a2 + $b2;
$c = sqrt($c2);
print $c;
$a
a single value or scalar variable starts with a $ followed by its name
perl Pythagoras.pl
5
Perl scripts
Add these lines at the top of each Perl script:
#!/usr/bin/perl
# author:
# description:
use strict;
use warnings;
perl Pythagoras.pl
Global symbol "$a2" requires explicit package name at Pythagoras.pl line 8.
Global symbol "$b2" requires explicit package name at Pythagoras.pl line 9.
Global symbol "$c2" requires explicit package name at Pythagoras.pl line 10.
Global symbol "$a2" requires explicit package name at Pythagoras.pl line 10.
Global symbol "$b2" requires explicit package name at Pythagoras.pl line 10.
Global symbol "$c" requires explicit package name at Pythagoras.pl line 11.
Global symbol "$c2" requires explicit package name at Pythagoras.pl line 11.
Global symbol "$c" requires explicit package name at Pythagoras.pl line 12.
Execution of Pythagoras.pl aborted due to compilation errors.
Pythagoras.pl
my $a = 3;
my $b = 4;
my $a2 = $a * $a;
my $b2 = $b * $b;
my $c2 = $a2 + $b2;
my $c = sqrt($c2);
print $c;
my
The first time a variable appears in the script, it should be declared using 'my'. Only the first time...
Pythagoras.pl
my($a,$b,$c,$a2,$b2,$c2);
$a = 3;
$b = 4;
$a2 = $a * $a;
$b2 = $b * $b;
$c2 = $a2 + $b2;
$c = sqrt($c2);
print $c;
Pythagoras.pl
$a = 3;
$b = 4;
$a2 = $a * $a;
$b2 = $b * $b;
$c2 = $a3 + $b2;
$c = sqrt($c2);
print $c;
4
Pythagoras.pl
my $a = 3;
my $b = 4;
my $a2 = $a * $a;
my $b2 = $b * $b;
my $c2 = $a3 + $b2;
my $c = sqrt($c2);
print $c;
perl Pythagoras.pl
Global symbol "$a3" requires explicit package name at Pythagoras.pl line 10.
Execution of Pythagoras.pl aborted due to compilation errors.
Text or number
Variables can contain text (strings) or numbers
my $var1 = 1;
my $var2 = "2";
my $var3 = "three";
Try these four statements:
print $var1 + $var2;   => 3
print $var2 + $var3;   => 2 ("three" counts as 0 in a calculation; 'use warnings' reports this)
print $var1.$var2;     => 12
print $var2.$var3;     => 2three
Variables can be added, subtracted, multiplied, divided and modulo'd with:
+ - * / %
Variables can be concatenated with:
.
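A small sketch of the difference between the arithmetic operators and the concatenation operator. The variable names $sum, $joined and $rest are chosen for this example only.

```perl
use strict;
use warnings;

my $var1 = 1;
my $var2 = "2";

my $sum    = $var1 + $var2;   # "+" treats both values as numbers
my $joined = $var1 . $var2;   # "." treats both values as text
my $rest   = 7 % 3;           # modulo: remainder after division

print "$sum $joined $rest\n"; # prints: 3 12 1
```

The operator decides how a value is used, not the way the value was written in the script.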
sequence.pl
print "Please type a DNA sequence: ";
#this is a comment line
#Read a line from the standard input (keyboard)
my $DNAseq = <STDIN>;
#Remove the newline (Enter) from the typed text
chomp($DNAseq);
#Get the length of the text (DNA sequence)
my $length = length($DNAseq);
print "It has $length nucleotides\n";
Program flow is top - down
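<STDIN>
Reads the characters that are typed on the keyboard and stops after the Enter key is pressed.
<>
The same here: when no file names are given on the command line, <> reads from STDIN, so the name can be left out. Defaults that can be left out are a recurring and confusing theme in Perl...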
sequence.pl
print "Please type a DNA sequence: ";
#this is a comment line
#Read a line from the standard input (keyboard)
my $DNAseq = <>;
#Remove the newline (Enter) from the typed text
chomp($DNAseq);
#Get the length of the text (DNA sequence)
my $length = length($DNAseq);
print "It has $length nucleotides\n";
$output = function($input)
input and output can be left out
parentheses are optional
$coffee = function($beans,$water)
sequence2.pl
print "Please type a DNA sequence: ";
my $DNAseq = <>;
chomp($DNAseq);
#Get the first three characters of $DNAseq
my $first3bases = substr($DNAseq,0,3);
print "The first 3 bases: $first3bases\n";
$frag = substr($text, $start, $num)
Extract a fragment of string $text starting at $start and with $num characters.
The first letter is at position 0!
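A few substr calls on a made-up sequence, as an illustration of the zero-based start position. The variable names are chosen for this example; leaving out the length and using a negative start are standard substr behaviour described in perldoc -f substr.

```perl
use strict;
use warnings;

my $DNAseq = "ATGGCCTAA";                # made-up example sequence

my $first  = substr($DNAseq, 0, 3);      # positions 0,1,2: "ATG"
my $second = substr($DNAseq, 3, 3);      # positions 3,4,5: "GCC"
my $tail   = substr($DNAseq, 6);         # no length: everything from position 6
my $last3  = substr($DNAseq, -3);        # negative start counts from the end

print "$first $second $tail $last3\n";   # prints: ATG GCC TAA TAA
```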
perldoc
perldoc -f substr
    substr EXPR,OFFSET,LENGTH,REPLACEMENT
    substr EXPR,OFFSET,LENGTH
    substr EXPR,OFFSET
    Extracts a substring out of EXPR and returns it. First character is at offset 0, .....
perldoc -f print
    print FILEHANDLE LIST
    print LIST
    print
    Prints a string or a list of strings.
If you leave out the FILEHANDLE, STDOUT is the destination: your terminal window.
In Perl, items in a list are separated by commas:
print "Hello World","\n";
Is the same as:
print "Hello World\n";
sequence3.pl
print "Please type a DNA sequence: ";
my $DNAseq = <>;
chomp($DNAseq);
#Get the second codon of $DNAseq
my $codon2 = substr($DNAseq,3,3);
print "The second codon: $codon2\n";
if, else, unless
sequence4.pl
print "Please type a DNA sequence: ";
my $DNAseq = <>;
chomp($DNAseq);
#Get the first three characters of $DNAseq
my $codon = substr($DNAseq,0,3);
if($codon eq "ATG") {
print "Found a start codon\n";
}
Conditional execution
if ( condition ) {
    do something
}
if ( condition ) {
    do something
} else {
    do something else
}
Conditional execution
if ( $number > 10 ) {
    print "larger than 10";
} elsif ( $number < 10 ) {
    print "smaller than 10";
} else {
    print "number equals 10";
}
unless ( $door eq "locked" ) {
    openDoor();
}
Conditions are true or false
1 < 10 : true
21 < 10 : false
Comparison operators
Numeric test   String test   Meaning
==             eq            Equal to
!=             ne            Not equal to
>              gt            Greater than
>=             ge            Greater than or equal to
<              lt            Less than
<=             le            Less than or equal to
<=>            cmp           Compare
Examples
if ( 1 == 1 ) { # TRUE
if ( 1 == 2 ) { # FALSE
if ( 1 != 2 ) { # TRUE
if ( -1 > 10 ) { # FALSE
if ( "hi" eq "dag" ) { # FALSE
if ( "hi" gt "dag" ) { # TRUE
if ( "hi" == "dag" ) { # TRUE !!!
The last example may surprise you, as "hi" is not equal to "dag" and therefore should evaluate to FALSE. But for a numerical comparison they are both 0.
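The difference between the string and the numeric tests can be tried out with a short sketch like the one below. The variable names are made up for this example, and the 'no warnings' line is only there because 'use warnings' would otherwise complain about comparing text as numbers.

```perl
use strict;
use warnings;

my $first  = "hi";
my $second = "dag";

my $text_equal = ($first eq $second) ? 1 : 0;     # 0: the strings differ

my $number_equal;
{
    # Both strings count as 0 in a numeric comparison. The warning about
    # this is switched off inside this block only.
    no warnings 'numeric';
    $number_equal = ($first == $second) ? 1 : 0;  # 1: 0 equals 0
}

my $as_numbers = (10 <=> 9);       #  1: 10 is the larger number
my $as_text    = ("10" cmp "9");   # -1: the text "10" sorts before "9"

print "$text_equal $number_equal $as_numbers $as_text\n";  # prints: 0 1 1 -1
```

Use eq, ne, lt, gt for text such as codons and sequence names, and ==, !=, <, > for numbers.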
Numbers as conditions
0 : false
all other numbers : true
if ( 1 ) {
    print "1 is true";
}
if ( 0 ) {
    print "this code will not be reached";
}
if ( $open ) {
    print "open is not zero";
}
repetition
sequence5.pl
print "Please type a DNA sequence: ";
my $DNAseq = <>;
chomp($DNAseq);
#Get all codons of $DNAseq
my $position = 0;
while($position < length($DNAseq)) {
my $codon = substr($DNAseq,$position,3);
print "The next codon: $codon\n";
$position = $position + 3;
}
the while loop
while ( condition ) {
do stuff
}
my $i = 0;
while ($i < 10) {
$i = $i + 1;
}
print $i;
$i = $i + 1
First the part to the right of the assignment operator '=' is calculated, then the result is stored in the variable on the left.
$i += 1
Same result as the previous statement.
$i++
Same result as the previous statement: increments $i by 1.
++$i
Same result as the previous statement, but compare:
print $i++;
print ++$i;
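A minimal sketch of what the comparison shows. The names $post and $pre are made up for this example: $i++ hands over the old value and increments afterwards, while ++$i increments first and hands over the new value.

```perl
use strict;
use warnings;

my $i = 5;

my $post = $i++;   # $post gets the old value 5, then $i becomes 6
my $pre  = ++$i;   # $i becomes 7 first, then $pre gets 7

print "$post $pre $i\n";   # prints: 5 7 7
```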
Exercise: Fibonacci numbers
Write a script that calculates and prints all Fibonacci numbers below one thousand.
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, etc.
F_n = F_{n-1} + F_{n-2}
F_0 = 0, F_1 = 1
sequence5.pl
print "Please type a DNA sequence: ";
my $DNAseq = <>;
chomp($DNAseq);
#Copy the sequence to a new variable
my $asDNAseq = $DNAseq;
#'translate' a->t, c->g, g->c, t->a
$asDNAseq =~ tr/acgt/tgca/;
print "Complementary strand:\n$asDNAseq\n";
$asDNAseq =~ tr/acgt/tgca/;
=~ is a binding operator and means: perform the following action on this variable.
The operation tr/// translates each character from the first set of characters into the corresponding character in the second set:
acgt
||||
tgca
Counting
tr/// can also be used to count characters. If the second part is left empty, no translation takes place.
$numberOfNs = ($DNAseq =~ tr/N//);
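Both uses of tr/// can be wrapped in small helper functions, as in this sketch. The function names count_gc and complement are made up for this example, and handling upper and lower case together is an extra assumption that the script above does not make.

```perl
use strict;
use warnings;

# Hypothetical helper: counts Cs and Gs without changing the sequence.
sub count_gc {
    my ($seq) = @_;
    return ($seq =~ tr/CGcg//);   # empty second part: count only
}

# Hypothetical helper: returns the complementary strand of a sequence.
sub complement {
    my ($seq) = @_;               # works on a copy of the sequence
    $seq =~ tr/acgtACGT/tgcaTGCA/;
    return $seq;
}

print count_gc("GGATCC"), "\n";     # prints: 4
print complement("ggatcc"), "\n";   # prints: cctagg
```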
'automatic' typing
using a pipe "|":
echo ggatcc | perl sequence5.pl
or redirect using "<":
perl sequence5.pl < sequence.txt
Exercise 1.
Create a program that reads a DNA sequence from the keyboard, and reports the sequence length and the G/C content of the sequence (as a fraction)
perltidy
A program that properly formats your Perl script: indentation, spaces, etc.
perltidy yourscript.pl
The result is in:
yourscript.pl.tdy
@months
a list variable or array starts with an @ followed by its name
Arrays
my @fibonacci = (0,1,1,2);
print @fibonacci;
print $fibonacci[3];
$fibonacci[4] = 3;
$fibonacci[5] = 5;
$fibonacci[6] = 8;
@fibonacci
index 0 1 2 3
value 0 1 1 2
Arrays
my @hw = ("Hello ","World","\n");
print @hw;
my @months = ( "January",
"February",
"March");
Arrays
To access a single element of the list use the array name with $ instead of the @ and append the position of the element in: [ ]
print $months[1];
February
$hw[1] = "Wur";
print @hw;
Arrays
To find the index of the last element in the list:
print $#months;
2
To find the number of elements in an array:
print $#months + 1;
or:
print scalar(@months);
Arrays
Note: like many programming languages, the index of the first item in an array is not 1, but 0!
Note: $months ≠ $months[0] !!!
Growing and shrinking arrays
push: add an item to the end of the list
pop: remove an item from the end of the list
shift: remove an item from the start of the list
unshift: add an item to the start of the list
splice: insert/remove one or more items
@out = splice(@array, start, length, @in);
@numbers
index 0 1 2 3 4
value 1 2 3 4 5

$last = pop(@numbers);
index 0 1 2 3
value 1 2 3 4
$last = 5

push(@numbers, 6);
index 0 1 2 3 4
value 1 2 3 4 6

$first = shift(@numbers);
index 0 1 2 3
value 2 3 4 6
$first = 1

unshift(@numbers, 7);
index 0 1 2 3 4
value 7 2 3 4 6

@out = splice(@numbers, 2, 1, 8, 9);
index 0 1 2 3 4 5
value 7 2 8 9 4 6
@out = (3)
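The same sequence of operations as one runnable sketch, so the state of the array can be checked after each step. The comments show the expected contents of @numbers.

```perl
use strict;
use warnings;

my @numbers = (1, 2, 3, 4, 5);

my $last = pop(@numbers);        # @numbers: 1 2 3 4      $last: 5
push(@numbers, 6);               # @numbers: 1 2 3 4 6
my $first = shift(@numbers);     # @numbers: 2 3 4 6      $first: 1
unshift(@numbers, 7);            # @numbers: 7 2 3 4 6

# Remove 1 item starting at index 2 and insert 8 and 9 in its place
my @out = splice(@numbers, 2, 1, 8, 9);

print "@numbers\n";              # prints: 7 2 8 9 4 6
print "@out\n";                  # prints: 3
```

Putting an array inside double quotes, as in "@numbers", prints the items separated by spaces.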
my ($x,$y,$z) = @coordinates;
my @words = split(" ", "Hello World");
$words[0] = "Hello"
$words[1] = "World"
More loops
my @plantList = ("rice", "potato", "tomato");
print $plantList[0];
print $plantList[1];
print $plantList[2];
Or:
foreach my $plant (@plantList) {
print $plant;
}
Loops
foreach variable ( list ) {
    do something with the variable
}
foreach my $i ( @lotto_numbers ) {
    print $i;
}
foreach my $i ( 1 .. 10, 20, 30 ) {
    print $i;
}
Loops
for variable ( list ) {
    do something with the variable
}
for my $i ( 1, 2, 3, 4, 5 ) {
    print $i;
}
for my $i ( 1 .. 10, 20, 30 ) {
    print $i;
}
Loops
while ( condition ) {
do something
}
my $i = 0;
while ($i < 10) {
print "$i < 10\n";
$i++;
}
Loops
for ( init; condition; increment ) {
do something
}
for (my $i = 0; $i < 10; $i++) {
print "$i < 10\n";
}
Loops
my $i = 0;
while ($i < 10) {
print "$i < 10\n";
$i++;
}
for (my $i = 0; $i < 10; $i++) {
print "$i < 10\n";
}
Exercise
Write a script that reverses a DNA sequence using an array
Hint: Splitting on an empty string "" splits after every character.
@sequence = split("",$sequence);
%phonebook
a hash table variable starts with a % followed by its name
Name Box
Crick 3
Franklin 1
Watson 0
Wilkins 2
Hash tables
Also called lookup tables, dictionaries or associative arrays
key/value combinations: keys are text, values can be anything
my %month_days = (
    "January"  => 31,
    "February" => 28,
    "March"    => 31
);
Hash tables
To access a value in the hash table, use the hash table name with $ instead of the % and append the key between { }
$month_days{"February"} = 29;
print $month_days{"January"};
31
Hash tables
The 'keys' function returns a list with the keys of the hash table. There is also a 'values' function.
my @month_list = keys(%month_days);
# ("January", "February", "March"), but not necessarily in this order
Hash tables
my %latin_name = (
    "rice"   => "Oryza sativa",
    "potato" => "Solanum tuberosum"
);
foreach my $common_name (keys(%latin_name)) {
    print "$common_name: ";
    print "$latin_name{$common_name}\n";
}
rice: Oryza sativa
potato: Solanum tuberosum
Hash tables
The keys have to be unique, the values do not.
The order of elements in a hash table is not reliable, first in is not necessarily first out.
You can use 'sort' to get the keys in an alphabetically ordered list:
@sorted = sort(keys(%latin_name));
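A short sketch that combines sort and keys to walk through a hash table in a predictable order. The array @report and the lookup of "wheat" are additions for this example; 'exists' is the standard Perl function that tests whether a key is present.

```perl
use strict;
use warnings;

my %latin_name = (
    "rice"   => "Oryza sativa",
    "potato" => "Solanum tuberosum",
);

# Collect one line per plant, in alphabetical order of the keys
my @report;
foreach my $common_name (sort(keys(%latin_name))) {
    push(@report, "$common_name: $latin_name{$common_name}");
}
print "$_\n" foreach @report;

# Check for a key before using it
my $has_wheat = exists($latin_name{"wheat"}) ? "yes" : "no";
print "wheat in table: $has_wheat\n";   # prints: wheat in table: no
```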
Exercise
Create a hash table with codons as keys and the corresponding amino acids as the values
Hint: search for the standard genetic code in the "genetic code" database at: http://srs.bioinformatics.nl/
Use the three lines for the first, second and third base and the line for the corresponding AA.
I/O: Input and Output
reading and writing files
Reading and writing files
open FASTA, "sequence.fa";
my $firstLine = <FASTA>;
my $secondLine = <FASTA>;
close FASTA;
Reading and writing files
Files need to be opened before use
Reading and writing files
Perl uses so-called “file handles” to attach to files for reading and writing
Opening files
General:
open FileHandle, "mode", "filename"
Open for reading:
open LOG, "<", "/var/log/messages";
open LOG, "/var/log/messages";
Open for writing:
open WRT, ">", "newfile.txt";
Open for appending:
open APP, ">>", "existingfile.txt";
Defensive programming
my $fastaName = "sequence.fa";
open FASTA, $fastaName or
die "cannot open $fastaName\n";
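A variation on the defensive open, shown as a sketch. It differs from the slide in three ways that are choices made for this example: the file handle is stored in a 'my' variable, the mode "<" is written out, and the special variable $! adds the reason given by the operating system. The helper name first_line_of and the file name are made up, and 'eval' is used only to catch the die so the example can keep running.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first line of a file, or dies with a
# message that includes the reason ($!) when the file cannot be opened.
sub first_line_of {
    my ($fileName) = @_;
    open(my $fh, "<", $fileName)
        or die "cannot open $fileName: $!\n";
    my $line = <$fh>;
    close($fh);
    return $line;
}

# A file name that is assumed not to exist, to show the error message
my $result = eval { first_line_of("no_such_file_for_this_example.fa") };
my $error  = $@;                     # message from die, empty if all went well
print "open failed: $error" if $error;
```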
Reading from a file
reading from an open file via the filehandle:
$firstLine = <FASTA>;
$secondLine = <FASTA>;
@otherLines = <FASTA>;
<FASTA>
Reads one line if the result goes into a scalar:
$line = <FASTA>;
Reads all (remaining) lines if the result goes into an array:
@lines = <FASTA>;
File handles 'remember' the position in the file
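The difference between reading into a scalar and reading into an array can be tried without a real file. This sketch opens a file handle on a string (a standard Perl feature: pass a reference to a scalar instead of a file name); the text in $text is made up and stands in for the contents of sequence.fa.

```perl
use strict;
use warnings;

# Made-up contents standing in for a small FASTA file
my $text = ">seq1\nATGC\nGGCC\n";

# Open a file handle that reads from the string instead of from disk
open(my $fh, "<", \$text) or die "cannot open in-memory file: $!\n";

my $firstLine  = <$fh>;   # scalar: reads one line
my @otherLines = <$fh>;   # array: reads all remaining lines
close($fh);

print "first line: $firstLine";
print "remaining lines: ", scalar(@otherLines), "\n";   # prints: remaining lines: 2
```

Because the file handle remembers its position, the array only receives the lines after the header.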
Standard in and standard out
The keyboard and screen also have 'file' handles, remember STDIN and STDOUT
read from the keyboard:
$DNAseq = <STDIN>;
write to the screen:
print STDOUT "Hello World\n";
Reading a file
open FASTA, "sequence.fa" or die;
my $sequence = "";
while (my $line = <FASTA>) {
chomp($line);
$sequence .= $line;
}
close FASTA;
print $sequence,"\n";
(my $line = <FASTA>) is also a condition
true: a line could be read
false: EOF, end of file
Identical?
while (my $line = <FASTA>) {
print $line;
}
for my $line (<FASTA>) {
print $line;
}
Not completely
Read line by line:
while (my $line = <FASTA>) {
    print $line;
}
First read the complete file into computer memory:
for my $line (<FASTA>) {
    print $line;
}
Writing to a file
open RANDOM, ">", "Random.txt";
for(1..50) {
my $random = rand(6);
print RANDOM "$random\n";
}
close RANDOM;
Writing to a file
open RANDOM, ">", "Random.txt";
for(1..50) {
my $rnd = rand(6);
$rnd = sprintf("%d\n",$rnd + 1);
print RANDOM $rnd;
}
close RANDOM;
Closing the file
close filehandle;
close FASTA;
A file is automatically closed if you (re)open a file using the same filehandle, or if the Perl script is finished.
Minimalistic Perl
open FASTA, "sequence.fa" or die;
my $sequence = "";
while (my $line = <FASTA>) {
chomp($line);
$sequence .= $line;
}
close FASTA;
print $sequence,"\n";
Minimalistic Perl
open FASTA, "sequence.fa" or die;
my $sequence = "";
while (<FASTA>) {
chomp;
$sequence .= $_;
}
close FASTA;
print $sequence,"\n";
$_
The default scalar variable, used if no other variable is given. But only in selected cases...
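Two of those selected cases, as a small sketch: a foreach loop without a loop variable puts each item in $_, and chomp without an argument works on $_. The array @lines is made-up example data.

```perl
use strict;
use warnings;

my @lines    = ("ATG\n", "GCC\n");   # made-up example lines
my $sequence = "";

foreach (@lines) {       # no loop variable given: each item lands in $_
    chomp;               # same as chomp($_)
    $sequence .= $_;
}

print "$sequence\n";     # prints: ATGGCC
```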
Minimalistic Perl
open FASTA, "sequence.fa" or die;
my $sequence = "";
while ($_ = <FASTA>) {
chomp($_);
$sequence .= $_;
}
close FASTA;
print $sequence,"\n";
Exercises
2. Adapt the G/C script so multiple sequences in FASTA format are read from a file
3. Modify the script to process a file containing any number of sequences in EMBL format
4. Now let the program generate the reverse complement of the sequence(s), and report sequence length and G/C content
Exercises
5. Use the rand function of Perl to shuffle the nucleotides of the input sequence, while maintaining sequence composition; again report sequence length and G/C content