Perl: Practical Extraction and Reporting Language. An Introduction by Shwen Ho

What is Perl good for?

Designed for text manipulation
Very fast to implement
Allows many different ways to solve the same problem
Runs on many different platforms
– Windows, Mac, Unix, Linux, DOS, etc.


Running Perl

Perl scripts do not need to be compiled

They are interpreted at the point of execution

They do not necessarily have a particular file extension although the .pl file extension is used commonly.


Running Perl

Execute a script from the command line:
    perl script.pl arg1 arg2 ...

Or, on Unix/Linux, add the line "#!/usr/bin/perl" to the start of the script.
– Remember to set the correct file execution permissions before running it:
    chmod +x perlscript.pl
    ./perlscript.pl


Beginning Perl

Every statement ends with a semicolon ";".
Comments start with a hash "#" and run to the end of the line.
Variables are assigned a value using the character "=".
Variables are not statically typed, i.e., you do not have to declare what kind of data you want to hold in them.
Variables are declared the first time you initialise them, and this can be anywhere in the program.
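The rules above can be seen together in a few lines. This is only a minimal sketch: the variable names are made up for illustration, and the "use strict" / "my" declarations are common practice rather than something the slides require.

```perl
use strict;
use warnings;

# A comment runs from the hash to the end of the line.
my $greeting = "Hello";   # assignment uses "="; no type is declared
my $answer   = 42;        # a scalar can hold a number...

$answer = "forty-two";    # ...and later a string, with no redeclaration

print "$greeting, the answer is $answer\n";  # every statement ends with ";"
```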


Scalar Variables

Contains a single piece of data.
The '$' character shows that a variable is scalar.
Scalar variables can store either a number or a string.
A string is a chunk of text surrounded by quotes.

    $name = "paul";
    $year = 1980;
    print "$name is born in $year";

output: paul is born in 1980


Array Variables (List)

Ordered list of data, separated by commas.
The '@' character shows that a variable is an array.

Array of numbers
    @year_of_birth = (1980, 1975, 1999);

Array of strings
    @name = ("Paul", "Jake", "Tom");

Array of both strings and numbers
    @paul_address = (14, "Cleveland St", "NSW", 2030);


Retrieving data from Arrays

Printing arrays
    @name = ("Paul", "Jake", "Tom");
    print "@name";

Accessing individual elements in an array
    @name = ("Paul", "Jake", "Tom");
    print "$name[1]";

What has changed? @name became $name.
– To access individual elements use the syntax $array[index]

Why did $name[1] print the second element?
– Perl, like Java and C, uses index 0 to represent the first element.


Interesting things you can do with Arrays

    @name = ("Paul", "Jake", "Tom");

    print "@name";              # prints: Paul Jake Tom
    print @name;                # prints: PaulJakeTom
    $count = @name;             # $count = 3
    @nameR = reverse(@name);    # @nameR = ("Tom", "Jake", "Paul")
    @nameS = sort(@name);       # @nameS = ("Jake", "Paul", "Tom")


Basic Arithmetic Operators

+     addition
-     subtraction
*     multiplication
/     division
++    adding one to the variable
--    subtracting one from the variable

$a += 2    incrementing the variable by 2
$b *= 3    tripling the value of the variable


Relational Operators

Comparison               Numeric   String
Equals                   ==        eq
Not equal                !=        ne
Less than                <         lt
Greater than             >         gt
Less than or equal       <=        le
Greater than or equal    >=        ge
Comparison               <=>       cmp
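Choosing the numeric or the string column matters because the same data can compare differently. A small sketch with made-up numbers, using the two three-way comparison operators inside sort:

```perl
use strict;
use warnings;

my @numbers = (10, 9, 100);

# String comparison looks at characters, so "10" sorts before "9".
my @as_strings = sort { $a cmp $b } @numbers;   # (10, 100, 9)

# Numeric comparison looks at values.
my @as_numbers = sort { $a <=> $b } @numbers;   # (9, 10, 100)

print "cmp: @as_strings\n";
print "<=>: @as_numbers\n";

# == compares numbers, eq compares strings.
print "1.0 == 1 is true\n"      if 1.0 == 1;
print "'1.0' eq '1' is false\n" unless "1.0" eq "1";
```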


Control Operators - If

    if (expression 1) {
        ...
    }
    elsif (expression 2) {
        ...
    }
    else {
        ...
    }


Iteration Structures

    while (CONDITION) { BLOCK }
    until (CONDITION) { BLOCK }
    do { BLOCK } while (CONDITION);
    for (INITIALIZATION; CONDITION; RE-INITIALIZATION) { BLOCK }
    for VAR (LIST) { BLOCK }
    foreach VAR (LIST) { BLOCK }


Iteration Structures

    $i = 1;
    while ($i <= 5) {
        print "$i\n";
        $i++;
    }

    for ($x = 1; $x <= 5; $x++) {
        print "$x\n";
    }

    # a list is written with parentheses; [ ] would create an array reference
    @array = (1, 2, 3, 4, 5);
    foreach $number (@array) {
        print "$number\n";
    }


String Operations

Strings can be concatenated with the dot operator, or by placing the variables inside double quotes.

    $lastname = "Harrison";
    $firstname = "Paul";
    $name = $firstname . $lastname;
    $name = "$firstname$lastname";

String comparison can be done with the string relational operators.

    $string1 = "hello";
    $string2 = "hello";
    if ($string1 eq $string2) {
        print "they are equal";
    }
    else {
        print "they are different";
    }


String comparison using patterns

The =~ operator returns true if the pattern between the / delimiters is found.

    $string1 = "HELLO";
    $string2 = "Hi there";
    # test if the string contains the pattern EL
    if ($string1 =~ /EL/) {
        print "This string contains the pattern";
    }
    else {
        print "No pattern found";
    }


Functions in Perl

No strict variable type restriction during a function call.
– Java example, for contrast:
    variable_type function (variable_type variable_name)
    public int function1 (int var1, char var2) { … }

Perl provides lots of useful built-in functions to get you started.
– chop - remove the last character of a string
– chomp - remove the trailing newline from the end of a string, if there is one
– push - append one or more elements to an array
– pop - remove the last element of an array and return it
– shift - remove the first element of an array and return it
– s/// - the substitution operator; replace a pattern with a string
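A short sketch of these built-ins in action, with made-up values. Note that chop always removes the last character, whereas chomp only removes a trailing newline.

```perl
use strict;
use warnings;

my $line = "hello\n";
chomp($line);                 # removes the trailing newline -> "hello"

my $word = "hello";
chop($word);                  # removes the last character   -> "hell"

my @queue = ("Paul", "Jake");
push(@queue, "Tom");          # ("Paul", "Jake", "Tom")
my $last  = pop(@queue);      # "Tom";  @queue is ("Paul", "Jake")
my $first = shift(@queue);    # "Paul"; @queue is ("Jake")

my $text = "I like cats";
$text =~ s/cats/dogs/;        # "I like dogs"

print "$line $word $last $first @queue\n";
print "$text\n";
```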


Functions in Perl

The "split" function breaks a given string into individual segments given a delimiter.

split(/pattern/, string) returns a list

    @output = split(/\s/, $string);   # breaks the sentence into words
    @output = split(//, $string);     # breaks the sentence into single characters
    @output = split(/,/, $string);    # breaks the sentence into chunks separated by a comma

join(delimiter, array) returns a string; the delimiter is an ordinary string, not a pattern
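split and join are inverses of a sort, which a small round trip makes visible. A minimal sketch; the input string is made up.

```perl
use strict;
use warnings;

my $string = "Paul,Jake,Tom";

my @names  = split(/,/, $string);    # ("Paul", "Jake", "Tom")
my $joined = join(" and ", @names);  # "Paul and Jake and Tom"
my @chars  = split(//, "abc");       # ("a", "b", "c")

print scalar(@names), " names: $joined\n";
print "chars: @chars\n";
```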


Functions in Perl

A simple Perl function

    sub sayHello {
        print "Hello!!\n";
    }

    sayHello();


Executing functions in Perl

Function arguments are stored automatically in a temporary array called @_ .

    sub sayHelloto {
        @name = @_;
        $count = @_;
        foreach $person (@name) {
            print "Hello $person\n";
        }
        return $count;
    }

    @array = ("Paul", "Jake", "Tom");
    sayHelloto(@array);
    sayHelloto("Mary", "Jane", "Tylor", 1, 2, 3);
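The value passed to return can be captured by the caller. A minimal sketch, assuming a hypothetical helper named total that is not part of the original slides:

```perl
use strict;
use warnings;

# Hypothetical helper: adds up however many numbers it is given.
sub total {
    my @numbers = @_;          # copy the arguments out of @_
    my $sum = 0;
    foreach my $n (@numbers) {
        $sum += $n;
    }
    return $sum;
}

my $result = total(1, 2, 3, 4);
print "total = $result\n";     # total = 10
```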


Input / Output

Perl allows you to read in any input that is automatically sent to your program via standard input by using the handle <STDIN>.

One way of handling inputs via <STDIN> is to use a loop to process every line of input


Input / Output

Count the number of lines from standard input and print the line number together with the 1st word of each line.

    $count = 1;
    foreach $line (<STDIN>) {
        @array = split(/\s/, $line);
        print "$count $array[0]\n";
        $count++;
    }

Other I/O topics include reading and writing to files, Standard Error (STDERR) and Standard Output (STDOUT).
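As a taste of those topics, here is a sketch of reading from a filehandle line by line and writing to both output streams. To keep it self-contained, the "file" is an in-memory string (an assumption for this example); with a real file you would pass a path such as "input.txt" to open instead.

```perl
use strict;
use warnings;

# Stand-in for a real file: Perl can open a filehandle on a string.
my $data = "first line\nsecond line here\nthird\n";

open(my $in, '<', \$data) or die "Cannot open input: $!";

my $count = 0;
my @first_words;
while (my $line = <$in>) {          # reads one line at a time
    chomp($line);
    my @words = split(/\s+/, $line);
    push(@first_words, $words[0]);
    $count++;
}
close($in);

print STDOUT "read $count lines\n";             # normal output
print STDERR "first words: @first_words\n";     # diagnostics go to STDERR
```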


Regular Expression

A regular expression is a set of characters that specifies a pattern.
Used for locating pieces of text in a file.
Regular expression syntax allows the user to do a "wildcard" type search without necessarily specifying each character literally.
Available across OS platforms and programming languages.


Simple Regular Expression

A simple regular expression contains the exact string to match.

    $string = "aaaabbbbccc";
    if ($string =~ /bc/) {
        print "found pattern\n";
    }

output: found pattern


Simple Regular Expression

The variable $& is automatically set to the matched text.

    $string = "aaaabbbbccc";
    if ($string =~ /bc/) {
        print "found pattern : $&\n";
    }

output: found pattern : bc


Simple Regular Expression

What happens when you want to match a generalised pattern, like an "a" followed by some "b"s and a single "c"? A literal pattern is not enough:

    $string = "aaaabbbbccc";
    if ($string =~ /abbc/) {
        print "found pattern : $&\n";
    }
    else {
        print "nothing found\n";
    }

output: nothing found


Regular Expression - Quantifiers

We can specify the number of times we want to see a specific character in a regular expression by adding an operator after the character.

* (asterisk) matches zero or more copies of the preceding character
+ (plus) matches one or more copies of the preceding character


Regular Expression - Quantifiers

    @array = ("ac", "abc", "abbc", "abbbc", "abb", "bbc", "bcf", "abbb", "c");

    foreach $string (@array) {
        if ($string =~ /ab*c/) {
            print "$string ";
        }
    }

output: ac abc abbc abbbc


Regular Expression - Quantifiers

    @array = ("ac", "abc", "abbc", "abbbc", "abb", "bbc", "bcf", "abbb", "c");

Regular Exp    Matched pattern
abc            abc
ab*c           ac abc abbc abbbc
ab+c           abc abbc abbbc


Regular Expression - Anchors

You can place anchors before and after the pattern to restrict where in the string it may match.

^ restricts the match to the beginning of a line

$ restricts the match to the end of a line


Regular Expression - Anchors

    @array = ("ac", "abc", "abbc", "abbbc", "abb", "bbc", "bcf", "abbb", "c");

Regular Exp    Matched pattern
^bc            bcf
^b*c           bbc bcf c
^b*c$          bbc c
b*c$           ac abc abbc abbbc bbc c
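The anchor results can be checked by running each pattern over the same list. This is a sketch rather than part of the original slides; it uses Perl's built-in grep function to keep the strings that match.

```perl
use strict;
use warnings;

my @array = ("ac", "abc", "abbc", "abbbc", "abb", "bbc", "bcf", "abbb", "c");

# Single quotes keep ^ and $ literal until the pattern is used as a regex.
my %hits_for;
foreach my $pattern ('^bc', '^b*c', '^b*c$', 'b*c$') {
    my @hits = grep { /$pattern/ } @array;   # strings that match
    $hits_for{$pattern} = "@hits";
    print "$pattern\t@hits\n";
}
```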


Regular Expression - Range

[…] lists the exact characters that may match at one position.

[0123456789] will match a single numeric character.

[0-9] will also match a single numeric character.

[A-Za-z] will match a single letter of either case.


Regular Expression - Range

Search for a word that
– starts with an uppercase T
– has a lowercase letter as its second letter
– has a lowercase vowel as its third letter
– is 3 letters long, followed by a space

Regular expression: "^T[a-z][aeiou] "

Note: [z-a] is backwards and does not work.
Note: [A-z] does match upper and lowercase letters, but also the 6 additional characters between the upper and lower case letters in the ASCII chart: [ \ ] ^ _ `
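To see the pattern at work, it can be run against a handful of strings. The test strings below are made up, chosen so that each rule is exercised at least once.

```perl
use strict;
use warnings;

# Hypothetical test strings.
my @candidates = ("Tea time", "Too late", "The end",
                  "tea time", "Tree top", "TEa cup");

my @hits;
foreach my $text (@candidates) {
    if ($text =~ /^T[a-z][aeiou] /) {
        push(@hits, $text);
        print "match:    $text\n";
    } else {
        print "no match: $text\n";
    }
}
```

The last three fail for different reasons: a lowercase first letter, a fourth letter where the space should be, and an uppercase second letter.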


Regular Expression - Others

Match a single (non-specific) character with "." (dot)
    a.c = matches any string with "a" followed by one character and then "c"

Specify the number of repetitions with { and }
    [a-z]{4,6} = matches four, five or six lower case letters

Remember patterns with ( ) and \1
    Regular expressions allow you to remember and recall patterns

Note: tools that use basic regular expressions, such as grep and sed, write these with backslashes: \{ \} and \( \). Perl uses the unescaped forms shown here.
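In Perl itself, the repetition braces and grouping parentheses are written without backslashes. A minimal sketch of all three features, using made-up test strings:

```perl
use strict;
use warnings;

# "." matches any single character (except a newline).
print "a.c matches 'abc'\n" if "abc" =~ /a.c/;
print "a.c matches 'a-c'\n" if "a-c" =~ /a.c/;

# {4,6}: four to six lower-case letters, anchored to the whole string.
my @words = ("cat", "perl", "python", "haskell");
my @fits  = grep { /^[a-z]{4,6}$/ } @words;
print "4 to 6 letters: @fits\n";

# ( ) remembers what it matched; \1 recalls it inside the same pattern,
# and $1 holds it after a successful match.
my $doubled = "";
if ("hello" =~ /([a-z])\1/) {
    $doubled = $1;
    print "doubled letter: $doubled\n";
}
```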


RegExp problems and strategies

You tend to match more lines than desired.
    A.*B matches AAB as well as AAAAAAACCCAABBBBAABBB

Strategies:
– Know what you want to match
– Know what you don't want to match
– Write a pattern out to describe what you want to match
– Test the pattern

More info: type "man re_syntax" in a Unix shell, or "perldoc perlre" for Perl's own syntax
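Over-matching is easy to demonstrate on the long string above. The sketch below compares the original pattern with two tighter alternatives; the non-greedy ".*?" form is standard Perl but is an addition here, not something the slides cover.

```perl
use strict;
use warnings;

my $string = "AAAAAAACCCAABBBBAABBB";

# Greedy: .* grabs as much as it can, so the match runs to the last B.
my ($greedy) = $string =~ /(A.*B)/;

# Non-greedy: .*? stops at the first B it can reach.
my ($lazy) = $string =~ /(A.*?B)/;

# More specific: one or more As directly followed by one or more Bs.
my ($exact) = $string =~ /(A+B+)/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "exact:  $exact\n";
```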


Example problem - Background

Biologists are interested in analysing proteins that are from a particular biochemical enzyme class "CDK1, CDK2 or CDK3". In addition, biologists would like to extract those protein sequences that contain the amino acid pattern (motif) that represents a particular virus binding site.

    Serine, Glutamic Acid, (multiple occurrences of) Alanine, Glycine

    Serine = S, Glutamic Acid = E, Alanine = A, Glycine = G


Example Problem - Dataset

The dataset was downloaded from an online phosphorylation protein database.
It contains 16472 protein entries in one file.
There is one entry per line, terminated by a carriage return character.
Entries are comma delimited: field1, field2, field3, field4, …


Example Problem - Dataset fields

1. acc - unique database ID
2. sequence - amino acid sequence for the protein
3. position - position along the sequence that is phosphorylated
4. code - amino acid that is phosphorylated
5. pmid - unique protein ID linked to an international protein database
6. kinase - enzyme class of this protein
7. source - where this protein was found
8. entry_date - date entered into the database



The task

1. Extract those entries that have the string CDK1, CDK2 or CDK3 in the enzyme column.

2. Within our extracted entries, search and match those sequences that contain the virus binding pattern.

3. Print out the database ID of the positively matched entries.


Problem: Divide and conquer

1. Select the enzyme class CDK1, CDK2 or CDK3.

2. Extract those proteins with the pattern

    Serine, Glutamic Acid, (multiple occurrences of) Alanine, Glycine

    Serine = S, Glutamic Acid = E, Alanine = A, Glycine = G
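One possible way to put the pieces together is sketched below. It is not the author's solution: the sample entries are invented, the field order is taken from the dataset field list, and "multiple occurrences" of Alanine is read as "one or more" (A+). If at least two are required, A{2,} would be the pattern to use.

```perl
use strict;
use warnings;

# Made-up sample entries in the field order listed on the slides:
# acc, sequence, position, code, pmid, kinase, source, entry_date
my @entries = (
    "ID001,MKSEAAGLT,3,S,111,CDK1,human,2005-01-01",
    "ID002,MKSEGLTPA,3,S,222,CDK2,human,2005-01-02",
    "ID003,TTSEAGKKL,3,S,333,PKA,mouse,2005-01-03",
    "ID004,GGSEAAAGP,3,S,444,CDK3,mouse,2005-01-04",
);

my @matched_ids;
foreach my $line (@entries) {
    chomp($line);
    my @field = split(/,/, $line);
    my ($acc, $sequence, $kinase) = @field[0, 1, 5];

    # Step 1: keep entries whose enzyme column contains CDK1, CDK2 or CDK3.
    next unless $kinase =~ /CDK[123]/;

    # Step 2: S, then E, then one or more A, then G.
    next unless $sequence =~ /SEA+G/;

    # Step 3: remember the database ID.
    push(@matched_ids, $acc);
}

print "$_\n" foreach @matched_ids;
```

With a real dataset the loop would read lines from a filehandle instead of an in-memory list; the filtering logic stays the same.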


Interesting parts of Perl not covered in this lecture

Hashes
– One unique variable (the key) that is linked to another variable (the value)
    • "Lecture 1002" ---> "Thur 3pm"
    • "Lecture 1002" ---> 25
    • "Lecture 1002" ---> [name1, name2, … ]
    • "Lecture 1002" ---> [{name1}, {name2}, … ]
        {name1} ---> student ID
        {name2} ---> student ID
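For a first impression of hashes, the first and third mappings above might look like this in code. This is a sketch only; the second lecture and the hash names are made up for illustration.

```perl
use strict;
use warnings;

# A hash links unique keys to values; '%' marks a hash variable.
my %lecture_time = (
    "Lecture 1002" => "Thur 3pm",
    "Lecture 2001" => "Mon 10am",    # made-up second entry
);

# A value can also be a reference to a list, e.g. of student names.
my %lecture_students = (
    "Lecture 1002" => ["name1", "name2"],
);

print "Lecture 1002 is on $lecture_time{'Lecture 1002'}\n";

my $count = scalar @{ $lecture_students{"Lecture 1002"} };
print "It has $count students\n";

# keys returns the keys in no particular order, so sort them for display.
foreach my $lecture (sort keys %lecture_time) {
    print "$lecture -> $lecture_time{$lecture}\n";
}
```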


Interesting parts of Perl not covered in this lecture

CGI (Common Gateway Interface)
– Creation of dynamic web pages using Perl
– CGI, PHP, JavaScript, Java Applet, etc.

Object Oriented Perl

Perl books & references to explore at your own curiosity
– http://perldoc.perl.org/
– http://www.oreilly.com/pub/topic/perl
– Book: O'Reilly - Perl Cookbook - This will save you someday
– Book: O'Reilly - Mastering Regular Expressions