Programming in Computational Biology

Role of Programming in Computational

Biology

Atreyi Banerjee

Programming languages

• Self-contained language– Platform-independent– Used to write O/S– C (imperative, procedural)– C++, Java (object-oriented)– Lisp, Haskell, Prolog (functional)

• Scripting language– Closely tied to O/S– Perl, Python, Ruby

• Domain-specific language– R (statistics)– MatLab (numerics)– SQL (databases)• An O/S typically manages…

– Devices (see above)

– Files & directories

– Users & permissions

– Processes & signals

Role of Programming

Reduces time Reduces money on R&D Reduces human effort Streamlines workflow Standard Algorithm unifies data result and

assures reproducability Reduces human error

Applications of Programming Data Mining Genome Annotation Microarray Analysis Website Development Tool Development Statistical Analyses Phylogeny Genome Wide Association Studies (GWAS) Next Generation Sequencing studies

Bioinformatics “pipelines” often involve chaining together multiple tools

Perl is the most-used bioinformatics language

Most popular bioinformatics programming languages

Bioinformatics career survey, 2008

Michael Barton

PERL Practical Extraction & Report Language Interpreted, not compiled

Fast edit-run-revise cycle

• Procedural & imperative Sequence of instructions (“control flow”) Variables, subroutines

• Syntax close to C (the de facto standard minimal language)

Weakly typed (unlike C) Redundant, not minimal (“there’s more than one way to do it”) “Syntactic sugar”

High-level data structures & algorithms– Hashes, arrays

Operating System support (files, processes, signals) String manipulation

Pros and Cons of Perl

• Reasons for Perl’s popularity in bioinformatics (Lincoln Stein)– Perl is remarkably good for slicing, dicing, twisting, wringing, smoothing,

summarizing and otherwise mangling text

– Perl is forgiving

– Perl is component-oriented

– Perl is easy to write and fast to develop in

– Perl is a good prototyping language

– Perl is a good language for Web CGI scripting

• Problems with Perl– Hard to read (“there’s more than one way to do it”, cryptic syntax…)

– Too forgiving (no strong typing, allows sloppy code…)

General principles of programming

Make incremental changes Test everything you do

the edit-run-revise cycle Write so that others can read it

(when possible, write with others) Think before you write Use a good text editor Good debugging style

Regular expressions

Perl provides a pattern-matching engine Patterns are called regular expressions They are extremely powerful

probably Perl's strongest feature, compared to other languages

Often called "regexps" for short

Programming in PERL

Data Types Scalars ($) Arrays (@) Hashes (%)

Conditional Operators AND (&&), OR (||), NOT (!)

Arithmetic Operators (+, -,*, /)

CONDITIONS If else Elsif ladder

LOOPS For While Foreach

Default Variables $_ default variable @_ default array

Finding all sequence lengthsOpen file

Read line

End of file?

Line starts with “>” ?

Remove “\n” newline character at end of line

Sequence name Sequence data

Add length of line to running totalRecord the name

Reset running total of current sequence length

First sequence?Print last sequence length

Stop

noyes

yes

yes

no

no

Start

Print last sequence length

DNA Microarrays

Normalizing microarray data

• Often microarray data are normalized as a precursor to further analysis (e.g. clustering)

• This can eliminate systematic bias; e.g. if every level for a particular gene is elevated, this might

signal a problem with the probe for that gene if every level for a particular experiment is elevated, there

might have been a problem with that experiment, or with the subsequent image analysis

• Normalization is crude (it can eliminate real signal as well as noise), but common

Rescaling an array

For each element of the array:add a, then multiply by b

@array = (1, 3, 5, 7, 9);print "Array before rescaling: @array\n";rescale_array (\@array, -1, 2);print "Array after rescaling: @array\n";

sub rescale_array { my ($arrayRef, $a, $b) = @_; foreach my $x (@$arrayRef) { $x = ($x + $a) * $b; }}

Array before rescaling: 1 3 5 7 9Array after rescaling: 0 4 8 12 16

Array is passed by reference

Microarray expression data A simple format with tab-separated fields First line contains experiment names Subsequent lines contain:

gene name expression levels for each experiment

* EmbryoStage1 EmbryoStage2 EmbryoStage3 ...Cyp12a5 104.556 102.441 55.643 ...MRG15 4590.15 6691.11 9472.22 ...Cop 33.12 56.3 66.21 ...bor 5512.36 3315.12 1044.13 ...Bx42 1045.1 632.7 200.11 ...... ... ... ...

Messages: readFrom(file), writeTo(file), normalizeEachRow, normalizeEachColumn…

Reading a file of expression datasub read_expr { my ($filename) = @_; open EXPR, "<$filename"; my $firstLine = <EXPR>; chomp $firstLine; my @experiment = split /\t/, $firstLine; shift @experiment; my %expr; while (my $line = <EXPR>) { chomp $line; my ($gene, @data) = split /\t/, $line; if (@data+0 != @experiment+0) { warn "Line has wrong number of fields\n"; } $expr{$gene} = \@data; } close EXPR; return (\@experiment, \%expr);}

Note use ofscalar contextto comparearray sizes

Reference to array of experiment namesReference to hash of arrays(hash key is gene name, arrayelements are expression data)

Normalizing by gene A program to normalize expression data from a

set of microarray experiments

Normalizes by gene

($experiment, $expr) = read_expr ("expr.txt");while (($geneName, $lineRef) = each %$expr) { normalize_array ($lineRef);}

sub normalize_array { my ($data) = @_; my ($mean, $sd) = mean_sd (@$data); @$data= map (($_ - $mean) / $sd, @$data);}

NB $datais a referenceto an array

Could also use the following:rescale_array($data,-$mean,1/$sd);

Normalizing by column

Remaps gene arrays to column arrays

($experiment, $expr) = read_expr ("expr.txt");my @genes = sort keys %$expr;for ($i = 0; $i < @$experiment; ++$i) { my @col; foreach $j (0..@genes-1) { $col[$j] = $expr->{$genes[$j]}->[$i]; } normalize_array(\@col); foreach $j (0..@genes-1) { $expr->{$genes[$j]}->[$i] = $col[$j]; }}

Puts columndata in @col

Puts @colback into %expr

Normalizes (note use of reference)

Genome annotations

GFF annotation format• Nine-column tab-delimited format for simple annotations:

• Many of these now obsolete, but name/start/end/strand (and sometimes type) are useful

• Methods: read, write, compareTo(GFF_file), getSeq(FASTA_file)

SEQ1 EMBL atg 103 105 . + 0 group1SEQ1 EMBL exon 103 172 . + 0 group1SEQ1 EMBL splice5 172 173 . + . group1SEQ1 netgene splice5 172 173 0.94 + . group1SEQ1 genie sp5-20 163 182 2.3 + . group1SEQ1 genie sp5-10 168 177 2.1 + . group1SEQ2 grail ATG 17 19 2.1 - 0 group2

Sequencename

Program

Featuretype Start

residue(starts at 1)

Endresidue

(starts at 1)Score

Strand(+ or -)

Codingframe

("." if notapplicable)

Group

ArtemisA tool for genome annotation

Reading a GFF file

• This subroutine reads a GFF file• Each line is made into an array via the split command• The subroutine returns an array of such arrays

sub read_GFF { my ($filename) = @_; open GFF, "<$filename"; my @gff; while (my $line = <GFF>) { chomp $line; my @data = split /\t/, $line, 9; push @gff, \@data; } close GFF; return @gff;}

Splits the line into at most ninefields, separated by tabs ("\t")

Appends a reference to @datato the @gff array

Writing a GFF file

• We should be able to write as well as read all datatypes• Each array is made into a line via the join command• Arguments: filename & reference to array of arrays

sub write_GFF { my ($filename, $gffRef) = @_; open GFF, ">$filename" or die $!; foreach my $gff (@$gffRef) { print GFF join ("\t", @$gff), "\n"; } close GFF or die $!;}

open evaluates FALSE ifthe file failed to open, and$! contains the error message

close evaluates FALSE ifthere was an error with the file

GFF intersect detection

• Let (name1,start1,end1) and (name2,start2,end2) be the co-ordinates of two segments

• If they don't overlap, there are three possibilities:• name1 and name2 are different;•

name1 = name2 but start1 > end2;

• name1 = name2 but start2 > end1;

• Checking every possible pair takes time N2 to run, where N is the number of GFF lines (how can this be improved?)

Self-intersection of a GFF file

sub self_intersect_GFF { my @gff = @_; my @intersect; foreach $igff (@gff) { foreach $jgff (@gff) { if ($igff ne $jgff) { if ($$igff[0] eq $$jgff[0]) { if (!($$igff[3] > $$jgff[4] || $$jgff[3] > $$igff[4])) { push @intersect, $igff; last; } } } } } return @intersect;}

Note: this code is slow.Vast improvements inspeed can be gained ifwe sort the @gff arraybefore checking forintersection.

Fields 0, 3 and 4 of the GFF line are the sequence name, start and end co-ordinates of the feature

Converting GFF to sequence

• Puts together several previously-described subroutines• Namely: read_FASTA read_GFF revcomp print_seq

($gffFile, $seqFile) = @ARGV;@gff = read_GFF ($gffFile);%seq = read_FASTA ($seqFile);foreach $gffLine (@gff) { $seqName = $gffLine->[0]; $seqStart = $gffLine->[3]; $seqEnd = $gffLine->[4]; $seqStrand = $gffLine->[6]; $seqLen = $seqEnd + 1 - $seqStart; $subseq = substr ($seq{$seqName}, $seqStart-1, $seqLen); if ($seqStrand eq "-") { $subseq = revcomp ($subseq); } print_seq ("$seqName/$seqStart-$seqEnd/$seqStrand", $subseq);}

Phylogenetics Analysis of relationships between organisms

through phylogenetic programs like PHYLIP can be automated or run on command line

Packages

Perl allows you to organise your subroutines in packages each with its own namespace

Perl looks for the packages in a list of directories specified by the array @INC

Many packages available at http://www.cpan.org/

use PackageName;PackageName::doSomething();

This line includes a file called"PackageName.pm" in your code

print "INC dirs: @INC\n";

INC dirs: Perl/lib Perl/site/lib .The "." means thedirectory that thescript is saved in

This invokes a subroutine called doSomething()in the package called "PackageName.pm"

Object-oriented programming

Data structures are often associated with code FASTA: read_FASTA print_seq revcomp ... GFF: read_GFF write_GFF ... Expression data: read_expr mean_sd ...

Object-oriented programming makes this association explicit.

A type of data structure, with an associated set of subroutines, is called a class

The subroutines themselves are called methods A particular instance of the class is an object

OOP concepts

• Abstraction– represent the essentials, hide the details

• Encapsulation– storing data and subroutines in a single unit– hiding private data (sometimes all data, via accessors)

• Inheritance– abstract base interfaces– multiple derived classes

• Polymorphism– different derived classes exhibit different behaviors in response

to the same requests

OOP: Analogy

OOP: Analogy

o Messages (the words in the speech balloons, and also perhaps the coffee itself)

o Overloading (Waiter's response to "A coffee", different response to "A black coffee")

o Polymorphism (Waiter and Kitchen implement "A black coffee" differently)

o Encapsulation (Customer doesn't need to know about Kitchen)

o Inheritance (not exactly used here, except implicitly: all types of coffee can be drunk or spilled, all humans can speak basic English and hold cups of coffee, etc.)

o Various OOP Design Patterns: the Waiter is an Adapter and/or a Bridge, the Kitchen is a Factory (and perhaps the Waiter is too), asking for coffee is a Factory Method, etc.

OOP: Advantages

• Often more intuitive– Data has behavior

• Modularity– Interfaces are well-defined– Implementation details are hidden

• Maintainability– Easier to debug, extend

• Framework for code libraries– Graphics & GUIs– BioPerl, BioJava…

OOP: Jargon

• Member, method– A variable/subroutine associated with a particular class

• Overriding– When a derived class implements a method differently from its parent

class

• Constructor, destructor– Methods called when an object is created/destroyed

• Accessor– A method that provides [partial] access to hidden data

• Factory– An [abstract] object that creates other objects

• Singleton– A class which is only ever instantiated once (i.e. there’s only ever one

object of this class)– C.f. static member variables, which occur once per class

Objects in Perl

• An object in Perl is usually a reference to a hash• The method subroutines for an object are found in a

class-specific package– Command bless $x, MyPackage associates variable $x with package MyPackage

• Syntax of method calls– e.g. $x->save();– this is equivalent to PackageName::save($x);– Typical constructor: PackageName->new();– @EXPORT and @EXPORT_OK arrays used to export

method names to user’s namespace

• Many useful Perl objects available at CPAN

Common Gateway Interface

• CGI (Common Gateway Interface)– Page-based web programming paradigm

• Can construct static (HTML) as well as dynamic (CGI) web pages.

• CGI.pm (also by Lincoln Stein)– Perl CGI interface– runs on a webserver– allows you to write a program that runs behind a

webpage

• CGI (static, page-based) is gradually being supplemented by AJAX

GUI

Graphical User Interface (GUI) are standalone modules created to make the work of an end user simpler.

Can be achieved through PERL Tk

BioPerl• Bioperl is a collection of Perl modules that facilitate the

development of Perl scripts for bioinformatics applications.

• A set of Open Source Bioinformatics packages– largely object-oriented

• Implements sequence and alignments manipulation, accessing of sequence databases and parsing of the results of various molecular biology programs including Blast, clustalw, TCoffee, genscan, ESTscan and HMMER.

• Bioperl enables developing scripts that can analyze large quantities of sequence data in ways that are typically difficult or impossible with web based systems.

• Parses BLAST and other programs• Basis for Ensembl

– the human genome annotation project– www.ensembl.org

http://www.ensembl.org/

Basic BioPerl modules

Bio::Perl Bio::Seq Bio::SeqIO Bio::Align Bio::AlignIO Bio::Tools::Run::StandAloneBlast

BLAST

CLUSTALW

References

http://www.bioperl.org http://www.cpan.org http://www.pasteur.fr Learning Perl Beginning Perl for Bioinformatics Mastering Perl for Bioinformatics

http://www.bioperl.org/

http://www.cpan.org/

http://www.pasteur.fr/

Thank You

Documents

Programming in Computational Biology