Short introduction to perl & gff

Marcus Ronninger

The Linnaeus Centre for Bioinformatics

Motivation

• Bioinformatics yields lots of information
• The information has to be mined
• Build or modify text files
• Small changes can take a long time with lots of data
• Example: change every letter to lower case
• With script programming this could be done in less than a second

perl

• Practical Extraction and Report Language
• Scripts
• Object oriented programming
• Graphical web interface, CGI
• Possibilities
• BioPerl

Example

Example of a very simple perl script, to_lower_case.pl

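#!/usr/bin/perl -w
use strict;

my $seqfile = $ARGV[0];    # input sequence file (FASTA)
my $outfile = $ARGV[1];    # file to write the lower-case copy to

open (SEQ, $seqfile) || die "Can't open file: $seqfile";
open (OUTFILE, "> $outfile") || die "Can't write file: $outfile";

while (<SEQ>) {
    if ($_ =~ /^>/) {          # header line: copy it unchanged
        print OUTFILE $_;
    }
    else {                     # sequence line: convert to lower case
        print OUTFILE lc ($_);
    }
}

close (SEQ);
close (OUTFILE);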

Useful tools for parsing files

• Scalar $
• Array @
• Regular expression /.fasta/
• Split, @chars = split //, $word
• Substitute s/old-regex/new-string/
• Upper and lower case: uc, lc
• Escape characters: \n \t \s etc
• sub
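The tools listed above can be tried out in a few lines. The following is only a minimal sketch: the file name gfap.fasta and the helper fasta_header are made up for illustration and are not part of the original slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scalar: holds a single value (here a hypothetical file name).
my $file = "gfap.fasta";

# Regular expression: does the name end in ".fasta"?
# The dot is escaped, because an unescaped "." matches any character.
my $is_fasta = ($file =~ /\.fasta$/) ? 1 : 0;

# Array and split: break a word into single characters.
my $word  = "ACGT";
my @chars = split //, $word;                  # ("A", "C", "G", "T")

# Substitute: copy the name, then replace the extension in the copy.
(my $outfile = $file) =~ s/\.fasta$/.gff/;    # "gfap.gff"

# Upper and lower case.
my $lower = lc $word;                         # "acgt"
my $upper = uc $lower;                        # "ACGT"

# sub: a reusable piece of code; the arguments arrive in @_.
# fasta_header is a hypothetical helper, named here for illustration only.
sub fasta_header {
    my ($name) = @_;
    return ">$name\n";                        # "\n" is the newline escape
}

print "$file is a FASTA file\n" if $is_fasta;
print join("-", @chars), "\n";
print "$outfile $lower $upper\n";
print fasta_header("Gfap");
```

Note that \n and \t can be used inside double-quoted strings, while \s (any whitespace character) is only meaningful inside a regular expression.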

General feature format, gff

• AKA “gene finding format”
• A format for handling output from different feature finding programs
• Processes can be decoupled but the result can still be put together
• Makes it easy to include external algorithms

General feature format

The construction of the format is very simple. The values are tab-delimited.

SEQ1    EMBL    atg     103     105     .       +       0
SEQ1    EMBL    exon    103     172     .       +       0
1.      2.      3.      4.      5.      6.      7.      8.

1. Sequence name

2. Source of the feature

3. Feature type

4. Start

5. End

6. Score - most feature finding programs have some kind of score for the found motif

7. Strand - can either be + or -

8. Frame - 0, 1 or 2, or . when the feature has no frame
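Because the values are tab-delimited, a line can be taken apart with a single split. The sketch below is one possible way to do it; the sub name parse_gff_line and the hash keys are our own choice and do not come from the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Column names in the order of the list above (the key names are an assumption).
my @columns = qw(seqname source feature start end score strand frame);

# Split one tab-delimited GFF line into a hash keyed by column name.
sub parse_gff_line {
    my ($line) = @_;
    chomp $line;
    my @values = split /\t/, $line;
    return if @values < @columns;          # not a complete feature line
    my %record;
    @record{@columns} = @values[0 .. $#columns];
    return \%record;
}

# The second example line from above, with real tabs between the values.
my $rec = parse_gff_line("SEQ1\tEMBL\texon\t103\t172\t.\t+\t0\n");

# Start and end are both included in the feature.
my $length = $rec->{end} - $rec->{start} + 1;
print "$rec->{feature} on $rec->{seqname}: $length bases, strand $rec->{strand}\n";
```

Keeping the values in a hash means the rest of a script can refer to a column by name instead of remembering its position.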

Small example

A small script that transforms known transcription factor binding sites into a .gff file

TFBS    Position    Motif
AP-2    -101        ccccaccccc
NF-1    -116        tgggctgcggccca
Hgcs    -117        ctgggctgcggc

The input file:

#Gfap
#Known TFBS (Besnard et al 1991)
#count backwards from the TSS
#start -14
AP-2: ccccaccccc -101
NF-1: tgggctgcggccca -116
Hgcs: ctgggctgcggc -117

Example

Basically the same procedure as the perl example above

$seqlength = 5000;
$gff = "";

while (<LIT>) {
    if ($_ =~ /^#start/) {
        $rel_start = $';    # the text after "#start", i.e. the offset
    }
    elsif (!($_ =~ /^#/) && ($_ =~ /\w+/)) {    # skip comment lines and empty lines
        make_gff($_, $rel_start, "Literature");
    }
}

Example

sub make_gff {
    my $start;
    my $stop;
    my ($seq, $rs, $type) = @_;
    my @feature = split(/\s+/, $seq);    # now the array has the feature information

    if ($type eq "Literature") {
        # $feature[0] is the name, $feature[1] the motif, $feature[2] the position
        $start = $seqlength + $rs + $feature[2];
        $stop  = $start + length($feature[1]) - 1;
        $sign  = '.';
        $gff  .= "$feature[0]\t$type\t$feature[0]\t$start\t$stop\tundef\t$sign\t$sign\n";
    }

etc.

Example

Output: a file named lit.gff with the following contents

AP-2:   Literature      AP-2:   4886    4895    undef   .       .
NF-1:   Literature      NF-1:   4871    4884    undef   .       .
Hgcs:   Literature      Hgcs:   4870    4881    undef   .       .

This can now be loaded into programs that support the gff format, e.g. Apollo
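The slides do not show how the collected $gff string ends up in lit.gff. One possible way, given here only as a sketch, is to open the file for writing and print the string to it. The $gff value below is a stand-in holding one record copied from the output above; in the real script it would be built by make_gff.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for the string built by make_gff: one tab-delimited record.
my $gff = join("\t", "AP-2:", "Literature", "AP-2:",
               4886, 4895, "undef", ".", ".") . "\n";

# Write the whole string to the output file in one go.
my $outfile = "lit.gff";
open(my $out, '>', $outfile) or die "Can't write file: $outfile: $!";
print $out $gff;
close($out) or die "Can't close file: $outfile: $!";
```

The three-argument form of open with a lexical file handle is used here instead of the bareword handles in the slides; both styles work.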

Apollo

• GFF files are boring to view as they are
• Use graphical software
• Apollo, a sequence annotation editor
• Great for viewing gff files together with the sequence

References

• Tisdall, J.D., “Beginning Perl for Bioinformatics”, O’Reilly, 2001
• http://www.sanger.ac.uk/Software/formats/GFF/
• http://www.fruitfly.org/annot/apollo/