François Fauteux Department of Plant Science McGill University Macdonald campus Seeder: Perl Modules for Cis-regulatory Motif Discovery Bioinformatics Open Source Conference June 28 2009, Stockholm

Fauteux Seeder Bosc2009


Page 1: Fauteux Seeder Bosc2009

François Fauteux

Department of Plant Science

McGill University

Macdonald campus

Seeder: Perl Modules for

Cis-regulatory Motif Discovery

Bioinformatics Open Source Conference

June 28 2009, Stockholm

Page 2: Fauteux Seeder Bosc2009

• Precise control of where, when and at which level transcription occurs

• Synthetic promoter engineering

M. Venter, Trends Plant Sci 12, 118 (2007).

Introduction

Page 3: Fauteux Seeder Bosc2009

Transcription Factor Binding Sites

Page 4: Fauteux Seeder Bosc2009

• Searching for imperfect copies of an unknown pattern

• Sequence-driven approaches: not guaranteed to yield a global optimum

• Enumerative approaches: computationally expensive

• Convergence towards low-complexity motifs

D. GuhaThakurta, Nucleic Acids Res 34, 3585 (2006).

DNA Motif Discovery

W. W. Wasserman, A. Sandelin,

Nat Rev Genet 5, 276 (2004).

Page 5: Fauteux Seeder Bosc2009

• Set B={B1,...,Bm} of background sequences

• Set P={P1,...,Pn} of positive sequences

• Length k of the motif seed

• Length l of the full motif to discover

F. Fauteux, M. Blanchette, M. V. Stromvik, Bioinformatics 24, 2303 (2008).

Seeder Algorithm: Input

Page 6: Fauteux Seeder Bosc2009

• Enumerate all words w of length k over {A, C, G, T}

• SMD (substring minimal distance): smallest Hamming distance (HD) between w and a |w|-length substring of a sequence s

• SMDs between word w and the background sequences → probability distribution g_w(y)
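The definitions on this slide can be illustrated with a short Perl sketch. It is not the code of Seeder::Background: the function names `hamming`, `smd` and `smd_distribution` are hypothetical, only the forward strand is scanned (the `revcom` option is ignored), and the background distribution is estimated as a simple relative frequency, which is an assumption.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hamming distance (HD): number of mismatching positions
# between two strings of equal length.
sub hamming {
    my ($x, $y) = @_;
    die "length mismatch" unless length($x) == length($y);
    my $d = 0;
    for my $i (0 .. length($x) - 1) {
        $d++ if substr($x, $i, 1) ne substr($y, $i, 1);
    }
    return $d;
}

# Substring minimal distance (SMD): smallest HD between word $w
# and any substring of $s that has the same length as $w.
sub smd {
    my ($w, $s) = @_;
    my $k = length $w;
    die "sequence shorter than word" if length($s) < $k;
    return min map { hamming($w, substr($s, $_, $k)) } 0 .. length($s) - $k;
}

# Empirical distribution of SMDs for word $w over a set of background
# sequences (assumed estimator): element y of the returned list is the
# fraction of background sequences whose SMD to $w equals y.
sub smd_distribution {
    my ($w, @background) = @_;
    my $n = @background;
    die "no background sequences" unless $n;
    my @g = (0) x (length($w) + 1);
    $g[ smd($w, $_) ]++ for @background;
    return map { $_ / $n } @g;
}
```

For example, `smd("ACG", "TTACGTT")` is 0 because the word occurs exactly, while `smd("ACG", "ACTT")` is 1 because the closest substring, `ACT`, differs at one position.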

Seeder::Background

F. Fauteux, M. Blanchette, M. V. Stromvik, Bioinformatics 24, 2303 (2008).

Page 7: Fauteux Seeder Bosc2009

• Sum S(w) of SMDs between w and the positive sequences → p-value

• Closest match to word w* (min. q-value) found in each positive sequence → seed PWM

• Matrix is extended to the full motif width, and the sites maximizing the score against the extended weight matrix are selected

• PWM is rebuilt from those sites and the process is iterated
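The slide does not spell out how the sum S(w) is turned into a p-value. One plausible reading, sketched below in Perl, assumes that the SMDs of the n positive sequences are independent draws from the word-specific background distribution, so that the null distribution of their sum is the n-fold convolution of that distribution. The names `convolve`, `sum_distribution` and `sum_pvalue` are hypothetical, and the lower tail is used because a smaller sum means closer matches.

```perl
use strict;
use warnings;

# Convolve two discrete distributions stored as array references,
# where index i holds the probability of the value i.
sub convolve {
    my ($p, $q) = @_;
    my @r = (0) x (@$p + @$q - 1);
    for my $i (0 .. $#$p) {
        for my $j (0 .. $#$q) {
            $r[ $i + $j ] += $p->[$i] * $q->[$j];
        }
    }
    return \@r;
}

# Distribution of the sum of $n independent SMDs, each following
# the background distribution $g (assumption: independence).
sub sum_distribution {
    my ($g, $n) = @_;
    my $dist = [1];    # an empty sum equals 0 with probability 1
    $dist = convolve($dist, $g) for 1 .. $n;
    return $dist;
}

# Lower-tail p-value: probability, under the background model,
# of a sum of SMDs at most as large as the observed one.
sub sum_pvalue {
    my ($g, $n, $observed) = @_;
    my $dist = sum_distribution($g, $n);
    my $p = 0;
    for my $s (0 .. $#$dist) {
        $p += $dist->[$s] if $s <= $observed;
    }
    return $p;
}
```

With a toy background distribution `[0.5, 0.5]` and two positive sequences, the sum is distributed as `[0.25, 0.5, 0.25]`, so an observed sum of 0 gives a p-value of 0.25.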

Seeder::Finder

F. Fauteux, M. Blanchette, M. V. Stromvik, Bioinformatics 24, 2303 (2008).

Page 8: Fauteux Seeder Bosc2009

Seeder::Index

F. Fauteux, M. Blanchette, M. V. Stromvik, Bioinformatics 24, 2303 (2008).

Page 9: Fauteux Seeder Bosc2009

• List of indices corresponding to words of increasing HD

• Efficient lookup of minimally distant subsequence
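The idea of a list ordered by increasing HD can be sketched in Perl as follows. This is an illustration only, not the file layout or code of Seeder::Index: the helper names are hypothetical, words are kept as strings rather than integer indices, and the ordered list is rebuilt on every call instead of being precomputed and stored.

```perl
use strict;
use warnings;

# Hamming distance between two strings of equal length.
sub hamming {
    my ($x, $y) = @_;
    my $d = 0;
    for my $i (0 .. length($x) - 1) {
        $d++ if substr($x, $i, 1) ne substr($y, $i, 1);
    }
    return $d;
}

# All DNA words of length $k.
sub all_words {
    my ($k) = @_;
    my @words = ('');
    for my $step (1 .. $k) {
        my @next;
        for my $prefix (@words) {
            push @next, $prefix . $_ for qw(A C G T);
        }
        @words = @next;
    }
    return @words;
}

# Every word of the same length as $w, paired with its HD to $w
# and sorted by increasing HD (ties broken alphabetically).
sub neighbours_by_hd {
    my ($w) = @_;
    my @pairs = map { [ $_, hamming($w, $_) ] } all_words(length $w);
    return sort { $a->[1] <=> $b->[1] or $a->[0] cmp $b->[0] } @pairs;
}

# Lookup of the minimally distant subsequence: record the k-mers
# present in $s once, then walk the ordered list and stop at the
# first word that occurs in $s. Returns (word, HD).
sub smd_lookup {
    my ($w, $s) = @_;
    my $k = length $w;
    my %present;
    $present{ substr($s, $_, $k) } = 1 for 0 .. length($s) - $k;
    for my $pair (neighbours_by_hd($w)) {
        return @$pair if $present{ $pair->[0] };
    }
    return;    # no k-mer over A, C, G, T found in $s
}
```

Because the list is ordered, the walk stops at the first hit and the HD attached to that word is the SMD. For a seed width of 6 the list holds 4096 words per query word, which is why precomputing it once is attractive.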

Seeder::Index

F. Fauteux, M. Blanchette, M. V. Stromvik, Bioinformatics 24, 2303 (2008).

Page 10: Fauteux Seeder Bosc2009

Seeder Algorithm: Usage

#!/usr/bin/perl

use strict;
use warnings;

use Seeder::Index;
use Seeder::Finder;
use Seeder::Background;

# Build the Hamming distance index for seeds of width 6
my $index = Seeder::Index->new(
    seed_width => "6",
    out_file   => "6.index",
);
$index->get_index;

# Compute the background distributions from the background sequences
my $background = Seeder::Background->new(
    seed_width    => "6",
    strand        => "revcom",
    hd_index_file => "6.index",
    seq_file      => "seqs.fasta",
    out_file      => "seqs.bkgd",
);
$background->get_background;

# Search the positive sequences for one motif of width 12
my $finder = Seeder::Finder->new(
    seed_width    => "6",
    strand        => "revcom",
    motif_width   => "12",
    n_motif       => "1",
    hd_index_file => "6.index",
    seq_file      => "prom.fasta",
    bkgd_file     => "seqs.bkgd",
    out_file      => "prom.finder",
);
$finder->find_motifs;

Page 11: Fauteux Seeder Bosc2009

• Binding site sequences from the Transfac database

G. K. Sandve, O. Abul, V. Walseng, F. Drablos, BMC Bioinformatics 8, 193 (2007).

Benchmark Against Popular Tools

F. Fauteux, M. Blanchette, M. V. Stromvik, Bioinformatics 24, 2303 (2008).

Page 12: Fauteux Seeder Bosc2009

SSP Promoter Motifs

F. Fauteux, M. V. Stromvik, submitted.

Page 13: Fauteux Seeder Bosc2009

http://seeder.agrenv.mcgill.ca

Page 14: Fauteux Seeder Bosc2009

Supervisor: Dr Martina Strömvik

Advisory committee: Dr Mathieu Blanchette, Dr Pierre Dutilleul

Acknowledgements