1 Keith Satterley, Bioinformatics Division, WEHI Bioinformatics Seminar 13/11/07


• Keith Satterley, Bioinformatics Division, WEHI

Bioinformatics Seminar 13/11/07


Summary:

• GABOS = Get A Bit Of Sequence.
• GAFEP = Get A Few Exon Primers.
• Functions and Facilities:
  – Web interface.
  – Command Line Interface.
• Data Management:
  – Genome data.
  – Result data.
• Tools Used:
  – Perl
  – HTML
  – PHP
  – Javascript
• Availability.
• Future Work.


• GABOS version 1 is at http://unix28.alpha.wehi.edu.au/bioinformatics/gabos

• Web page version 1 limitations:
  – Exons, DNA, Transcripts available.
  – Genomes are a hard-coded list of latest version data only.
  – Annotation file is a hard-coded list covering all genomes.
  – Chromosome selection was a list of the common chromosome filenames.

• Data Files Availability
  – All data has been downloaded from UCSC’s download site. It is described at http://hgdownload.cse.ucsc.edu/downloads.html and can be downloaded by ftp from ftp://hgdownload.cse.ucsc.edu/goldenPath/
  – Genome data is stored on the WEHI Disk Server, accessible from:
    • WEHI Unix computers:
      – /home/users/lab0605/Bioinformatics/databases/genomes/UCSC
    • WEHI Windows computers, map a network drive to:
      – \\unix33\bioinformatics
    • WEHI Macintoshes, Connect to Server at:
      – smb://unix33/Bioinformatics


• Genomes at WEHI:
  – Jul 24 01:05 canFam -> canFam2
  – Jul 23 15:16 canFam1
  – Jul 23 15:16 canFam2
  – Jul 22 01:17 danRer -> danRer4
  – Jul 23 10:33 danRer3
  – Jul 23 15:20 danRer4
  – Nov 6 01:10 dm -> dm3
  – Nov 5 16:37 dm3
  – Jul 22 01:17 galGal -> galGal3
  – Jul 20 17:27 galGal2
  – Jul 23 10:11 galGal3
  – Jul 22 01:17 hg -> hg18
  – Jul 23 10:29 hg17
  – Jul 23 10:29 hg18
  – Aug 24 01:10 mm -> mm9
  – Jul 23 10:30 mm7
  – Aug 23 14:50 mm8
  – Aug 23 18:12 mm9
  – Jul 22 01:17 monDom -> monDom4
  – Jul 23 10:32 monDom4
  – Jul 22 01:17 panTro -> panTro2
  – Jul 23 10:32 panTro1
  – Jul 23 10:33 panTro2
  – Jul 22 01:17 rheMac -> rheMac2
  – Jul 25 02:32 rheMac2
  – Jul 22 01:17 rn -> rn4
  – Jul 23 10:33 rn3
  – Jul 23 10:33 rn4

• More can be downloaded as requested.


• Chromosome data files (three sample directory listings):
  – Aug 23 14:09 chr9_random.fa
  – Aug 23 14:09 chrM.fa
  – Aug 23 14:09 chrUn_random.fa
  – Aug 23 14:14 chrX.fa
  – Aug 23 14:14 chrX_random.fa
  – Aug 23 14:14 chrY.fa
  – Aug 23 14:16 chrY_random.fa

  – Jul 23 16:11 chr9.fa
  – Jul 23 16:11 chrM.fa
  – Jul 23 16:13 chrNA_random.fa
  – Jul 23 16:14 chrUn_random.fa
  – Jul 23 16:14 md5sum.txt
  – Jul 23 16:14 README.txt
  – Jul 23 16:16 scaffoldNA_random.fa
  – Jul 23 16:16 scaffoldUn_random.fa

  – Jun 22 04:05 chr2L.fa
  – Jun 22 04:05 chr2LHet.fa
  – Jun 22 04:05 chr2R.fa
  – Jun 22 04:05 chr2RHet.fa
  – Jun 22 04:05 chr3L.fa
  – Jun 22 04:05 chr3LHet.fa

• Annotation Data Files:


• Data Management:
  – Amount of data:
    • How many genomes local? Currently 10 = 96 GB.
      – 19 Vertebrates available + 9 sequence only.
      – 15 Insects, 5 Nematodes + 4 others available.
    • How many versions of each? mm7, mm8, mm9?
      – 2 or 3 of each?
    • Chromosome data: 10-50 per genome.
    • Annotation data: 5-10 per genome version
      – RefSeq, genscan, mgc, xenoRef, uniGene, refFlat,
      – ESTs, mRNAs …
  – Up-to-date data!
    • Tool currently being written to check UCSC nightly.
    • Download, unpack and sort annotation files.


• GABOS Sequence Retrieval Features
  – Specify Search Criteria as either:
    • Gene Name List
      – as in Annotation Files
        » NM_001037759, NM_145692, NM_027033, NM_013715 as in RefSeq.txt
        » Sgk3, 4930418G15Rik, Cops5, Sulf1 as in RefFlat.txt
    • Chromosome Sequence Range specification.
      – Chr10:13,500,000 - 14,550,000
      – This will select all genes in this region that are defined in the annotation file(s) specified.
  – Exons (incl. EST exons), Transcripts of Genes or straight DNA sequence can be retrieved.
    • Specify either strand or both strands.
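The chromosome range search can be sketched roughly as below. This is only an illustration, not the GABOS code (which is Perl): `parse_region` and `genes_in_region` are hypothetical names, the sample rows are invented, and the column order assumed is the UCSC refFlat layout (geneName, name, chrom, strand, txStart, txEnd, ...). Whether GABOS selects genes that overlap the range or only genes fully inside it is not stated in the slides; overlap is assumed here, and the 0-based/1-based coordinate difference is ignored.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_region(text: str) -> Region:
    """Parse a range such as 'Chr10:13,500,000 - 14,550,000'."""
    match = re.fullmatch(
        r"\s*(?:chr)?(\w+)\s*:\s*([\d,]+)\s*-\s*([\d,]+)\s*",
        text,
        flags=re.IGNORECASE,
    )
    if match is None:
        raise ValueError(f"unrecognised range: {text!r}")
    chrom, start, end = match.groups()
    start_pos = int(start.replace(",", ""))
    end_pos = int(end.replace(",", ""))
    if start_pos > end_pos:
        raise ValueError("range start is after range end")
    return Region("chr" + chrom, start_pos, end_pos)


def genes_in_region(refflat_lines, region: Region):
    """Return gene names whose transcript overlaps the region (assumed rule)."""
    names = []
    for line in refflat_lines:
        fields = line.rstrip("\n").split("\t")
        gene_name, chrom = fields[0], fields[2]
        tx_start, tx_end = int(fields[4]), int(fields[5])
        if chrom == region.chrom and tx_start <= region.end and tx_end >= region.start:
            names.append(gene_name)
    return names


# Invented rows in refFlat column order, for illustration only.
SAMPLE_REFFLAT = [
    "GeneA\tNM_000001\tchr10\t+\t13600000\t13650000\t13600100\t13649900\t1\t13600000,\t13650000,",
    "GeneB\tNM_000002\tchr10\t-\t15000000\t15010000\t15000100\t15009900\t1\t15000000,\t15010000,",
    "GeneC\tNM_000003\tchr1\t+\t13600000\t13650000\t13600100\t13649900\t1\t13600000,\t13650000,",
]

region = parse_region("Chr10:13,500,000 - 14,550,000")
selected = genes_in_region(SAMPLE_REFFLAT, region)
```

Only GeneA lies on chr10 inside the requested range, so it is the only name selected.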


• Extra Sequence Parameters
  – Range of bases in data object (e.g. bps in an Exon)
    • 1-e = all, base 1 to the end base (the default)
    • 1-10 = bases 1 to 10
    • 10-e = base 10 to end base in object.
  – Range of objects requested (e.g. a range of Exons)
    • 1-e = all exons (the default)
    • 1-3 = exons 1 to 3.
    • 1 = first exon only
    • e = last exon only
  – Possible Extensions
    • (e-3)-e = last three objects (or bases)
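The range notation above can be read as follows. This is a minimal sketch of one possible interpretation, not the GABOS parser: `resolve_range` is a hypothetical helper, positions are taken to be 1-based and inclusive, and raising an error for out-of-bounds values is an assumption (GABOS may clamp instead). The "(e-3)-e" extension is not handled.

```python
def resolve_range(spec: str, length: int) -> tuple[int, int]:
    """Turn '1-e', '1-10', '10-e', '1' or 'e' into a 1-based inclusive (first, last)."""

    def endpoint(token: str) -> int:
        token = token.strip().lower()
        # 'e' stands for the end of the object (last base or last exon).
        value = length if token == "e" else int(token)
        if not 1 <= value <= length:
            raise ValueError(f"{token!r} is outside 1..{length}")
        return value

    first_text, separator, last_text = spec.partition("-")
    first = endpoint(first_text)
    # A single value such as '1' or 'e' selects just that one position.
    last = endpoint(last_text) if separator else first
    if first > last:
        raise ValueError(f"range {spec!r} runs backwards")
    return first, last


# Example: pick exons 1 to 3 from a list of five exons.
exons = ["exon1", "exon2", "exon3", "exon4", "exon5"]
first, last = resolve_range("1-3", len(exons))
chosen = exons[first - 1:last]
```

The same helper serves both uses on the slide: `length` is the number of bases when selecting bases within an object, and the number of exons when selecting a range of objects.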


• GABOS Extras:
  – Specify the line length of the FASTA output file.
  – Output Sequence Lines ONLY.
  – Output Fasta Description Lines ONLY.
  – Concatenate ALL Sequences.
  – Concatenate ONLY Sequence from a DNA object (each gene’s exons concatenated, for example).
  – String of characters to be inserted BEFORE each DNA object.
  – String of characters to be inserted AFTER each DNA object.
  – Specify flanking bases.
  – Show co-ordinates relative to: Chromosome, Exon, Transcript
  – Uses either RefSeq or Browser gene names in refFlat.txt

• GAFEP (Get A Few Exon Primers)
  – Use output of GABOS to find primers around each exon.
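As an illustration of the FASTA line-length option listed under GABOS Extras, a record can be wrapped like this. `format_fasta` is a hypothetical helper and the default of 60 characters is an assumption; the slides do not say what GABOS uses by default.

```python
def format_fasta(description: str, sequence: str, line_length: int = 60) -> str:
    """Return one FASTA record with the sequence wrapped at line_length characters."""
    if line_length < 1:
        raise ValueError("line_length must be at least 1")
    lines = [">" + description]
    # Cut the sequence into consecutive chunks of line_length bases.
    lines.extend(
        sequence[i:i + line_length] for i in range(0, len(sequence), line_length)
    )
    return "\n".join(lines) + "\n"


record = format_fasta("example_exon", "ACGTACGTAC", line_length=4)
```

With a line length of 4, the ten-base sequence is written as ACGT, ACGT, AC on three lines under the description line.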


GABOS Command Line Version (CLI).

• Same code. Program detects environment and adjusts accordingly.
• CLI use of GABOS caters for programmatic use of the tool as part of other tasks.
  – E.g. collecting 5000 bases before a transcript and 5000 into the transcript, to be used for promoter/regulation searching across thousands of genes.

CLI example:

    gabos -afile refFlat.txt -genome mm9 -seqrange 4,482,560-4,483,185 -chr 1 -pre 420 -post 420 -fastaonly >my_results.fa

Options can be in any order. Output can be redirected to a file as shown. A file of gene names could be used as input instead of a chromosome sequence range.

gabos -help lists all options.
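Programmatic use from another script might look like the sketch below. It only assembles the same options as the CLI example and hands them to the `gabos` program; `build_gabos_command` and `run_gabos` are hypothetical helpers, and the meaning assumed for each option is taken from the example above rather than from the `gabos -help` output.

```python
import shlex
import subprocess


def build_gabos_command(genome, annotation_file, chromosome, seq_range, pre=0, post=0):
    """Assemble the argument list for one gabos run (options as in the CLI example)."""
    return [
        "gabos",
        "-afile", annotation_file,
        "-genome", genome,
        "-seqrange", seq_range,
        "-chr", chromosome,
        "-pre", str(pre),
        "-post", str(post),
        "-fastaonly",
    ]


def run_gabos(command, output_path):
    """Run gabos and write its standard output to output_path (needs gabos on PATH)."""
    with open(output_path, "w") as handle:
        subprocess.run(command, stdout=handle, check=True)


command = build_gabos_command("mm9", "refFlat.txt", "1", "4,482,560-4,483,185", 420, 420)
printable = shlex.join(command)  # the command as it would be typed in a shell
```

Passing the arguments as a list avoids shell quoting problems when the call is repeated for thousands of genes or ranges.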


• CLI additional abilities:
  – Gene lists read from a file or piped in.
  – Debugging options available.
  – Specification of alternate locations (enables use of the program at other sites without modification) for:
    • Annotation files.
    • Genome data files.
  – Checks if data files are the latest version and updates them if not (to be replaced with an upgraded procedure).


GABOS Command Line options:

-addend:s, -addstart:s, -dna:s, -basedir:s, -genome:s, -afile:s, -adir:s, -gdir:s, -check!, -name:s, -namep:s, -namef:s, -chr:s, -seqrange:s, -strand:s, -dataobject:s, -objectrange:s,

-baserange, -seqonly, -fastaonly, -linelength:i, -relative:s, -pre:i, -post:i, -v!, -debug1:i, -debug2:i, -debug3:i, -debug4:i, -debug5:i, -debug6:i, -debugall:i, -h|help|?, -version

All GAFEP programs can also be run at the command line.

In particular: Combine_overlapping_exons, Create_primers1, Create_primers2, Makep3i, P3out2tab.


• Demo of GABOS version 2: http://unix28.alpha.wehi.edu.au/bioinformatics/gabos/testing_index.php
  – Improvements:
    • Automatically reads genomes available.
    • Automatically shows chromosome data for genome selected.
    • Automatically shows Annotation data files for genome selected.
    • Includes ability to read EST data files.
    • Uses alternate gene name in refFlat.txt.
    • Faster processing of large data files using/making presorted versions.
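Why presorting helps can be shown with a small sketch. The slides do not describe how GABOS uses its sorted files, so this is only one common approach: once features are sorted by start position, a binary search finds the first and last candidates without scanning the whole file. `SortedAnnotation` is a hypothetical class.

```python
from bisect import bisect_left, bisect_right


class SortedAnnotation:
    """Features kept sorted by start position so range lookups use binary search."""

    def __init__(self, features):
        # features: iterable of (start, name) pairs; sorting is done once, up front.
        self._features = sorted(features)
        self._starts = [start for start, _ in self._features]

    def starting_within(self, start, end):
        """Names of features whose start lies in start..end (inclusive)."""
        low = bisect_left(self._starts, start)
        high = bisect_right(self._starts, end)
        return [name for _, name in self._features[low:high]]


annotation = SortedAnnotation([(300, "c"), (100, "a"), (200, "b")])
hits = annotation.starting_within(150, 300)
```

Each lookup costs a logarithmic number of comparisons instead of one pass over every annotation line, which matters for large EST files.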


• GAFEP = Get A Few Exon Primers. This is a suite of programs.

1. Combines overlapping exons into one “CExon”.

2. Displays Primer3 options and collects choices.

3. Creates input files for Primer3 in the required format.

4. Runs Primer3, displays output on the web page and reformats the output suitable for pasting into Excel.

5. The same code runs from the web interface or from a Command Line Interface.
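Step 3 can be pictured with the sketch below, which writes one Primer3 input record in the "TAG=value" format terminated by a line holding a single "=". The tag names are those of current Primer3 releases; the release used with GAFEP in 2007 may have used different names, and `primer3_record` is a hypothetical helper, not Makep3i. How the target start index is counted depends on Primer3's PRIMER_FIRST_BASE_INDEX setting.

```python
def primer3_record(seq_id, template, target_start, target_length, product_size_range):
    """Build one Primer3 input record as text (tag names from current Primer3 releases)."""
    smallest, largest = product_size_range
    lines = [
        f"SEQUENCE_ID={seq_id}",
        f"SEQUENCE_TEMPLATE={template}",
        # Region that the primer pair must flank: start,length
        f"SEQUENCE_TARGET={target_start},{target_length}",
        f"PRIMER_PRODUCT_SIZE_RANGE={smallest}-{largest}",
        "=",  # a lone '=' ends the record
    ]
    return "\n".join(lines) + "\n"


record = primer3_record("CExon1", "ACGTACGTACGT", 4, 5, (300, 840))
```

Several records can be concatenated into one input file, one per exon or CExon.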


Combining Exons to reduce number of primers needed.

[Figure: exons numbered 1 to 7; overlapping exons are combined into CExons, and an exon with no overlap stays an Exon.]
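Combining overlapping exons is a standard interval merge. A minimal sketch follows; `combine_overlapping_exons` here is a hypothetical Python stand-in, not the GAFEP program of the same name, and treating exons that merely touch as overlapping is an assumption.

```python
def combine_overlapping_exons(exons):
    """Merge (start, end) exons that overlap into combined exons ("CExons")."""
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous combined exon: extend it if needed.
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged


cexons = combine_overlapping_exons([(400, 500), (100, 200), (150, 300)])
```

Here the exons 100-200 and 150-300 become one CExon 100-300, while 400-500 stays a separate exon, so two primer pairs are needed instead of three.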


Short Exons

[Figure: a 120 bp exon extended in three steps.]

– Pad out short exons to 300 bp: 90 + 120 + 90 = 300.
– Add a 70 bp cushion on each side: 70 + 300 + 70 = 440.
– Add 200 bp flanks on each side: 200 + 440 + 200 = 840.
– Primers in flanks.
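The short-exon arithmetic on this slide can be checked with a few lines. `short_exon_region` is a hypothetical helper; the values 300, 70 and 200 come from the slide, while splitting the padding evenly between the two sides is an assumption based on the 90/90 figure.

```python
def short_exon_region(exon_length, min_length=300, cushion=70, flank=200):
    """Sizes of the region built around a short exon, step by step."""
    padding = max(0, min_length - exon_length)
    left_pad = padding // 2
    right_pad = padding - left_pad
    padded = exon_length + padding          # exon padded out to min_length
    with_cushion = padded + 2 * cushion     # cushion added on both sides
    with_flanks = with_cushion + 2 * flank  # flanks where primers are placed
    return {
        "left_pad": left_pad,
        "right_pad": right_pad,
        "padded": padded,
        "with_cushion": with_cushion,
        "with_flanks": with_flanks,
    }


sizes = short_exon_region(120)
```

For the 120 bp exon on the slide this gives 90 bp of padding per side and region sizes of 300, 440 and 840 bp.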


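Long Exons

[Figure: a 900 bp exon split into two overlapping pieces.]

– Split: two pieces of 485 bp with a 70 bp overlap (485 + 485 - 70 = 900).
– Add a 70 bp cushion on each side of a piece: 70 + 485 + 70 = 625.
– Add 200 bp flanks on each side: 200 + 625 + 200 = 1025.
– Primers in flanks.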
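The long-exon split can be reproduced as below. `split_long_exon` is a hypothetical helper; the 70 bp overlap, 70 bp cushion and 200 bp flanks come from the slide, but how GAFEP decides the number of pieces is not stated, so `pieces` is simply a parameter here.

```python
def split_long_exon(exon_length, pieces=2, overlap=70, cushion=70, flank=200):
    """Split an exon into equal overlapping pieces and size the region around each."""
    if pieces < 1 or overlap < 0:
        raise ValueError("pieces must be >= 1 and overlap >= 0")
    # Each piece must cover its share of the exon plus the shared overlap
    # (ceiling division so the pieces always reach the end of the exon).
    piece_length = -(-(exon_length + (pieces - 1) * overlap) // pieces)
    step = piece_length - overlap
    spans = []
    for index in range(pieces):
        start = index * step
        end = min(start + piece_length, exon_length)
        spans.append((start, end))  # 0-based, end-exclusive offsets within the exon
    with_cushion = piece_length + 2 * cushion
    with_flanks = with_cushion + 2 * flank
    return spans, with_cushion, with_flanks


spans, with_cushion, with_flanks = split_long_exon(900)
```

For the 900 bp exon this yields pieces covering offsets 0-485 and 415-900, each giving a 625 bp region with cushions and 1025 bp with flanks, matching the slide.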


• Demonstration of GAFEP


GAFEP Output


An example application:

Ben Kile’s lab is using GABOS/GAFEP to create primers to search for sequence variations caused by ENU mutations in mice.


Random chemical mutagenesis in the mouse

N-ethyl-N-nitrosourea (ENU)

[Figure: chemical structure of ENU]

Alkylating agent

Point mutagen

Efficiently mutates mouse spermatogonial stem cells

Male mice treated with ENU produce offspring heterozygous for ENU-induced mutations at the rate of 1 mutation per 1.5 megabases


Phenotyping screen: measuring platelet number

Mutant offspring

Blood test

[Figure: platelet counts; y-axis "Platelet count x10^3/uL"]

Plt16 and Plt20 cause dominant thrombocytopenia


Mapping strategy for dominant mutations

[Figure: breeding scheme. Labels: C57BL/6, Balb/c, wild-type, affected, mutation (m), 1st outcross, F1 generation, 2nd outcross, F2 generation, affected and unaffected offspring.]


Mapping strategy for dominant mutations

1. Genome-wide scan with 80-100 microsatellites

20 affected and 20 unaffected animals

Result: mutation assigned to a chromosome

2. Fine mapping

200-1,000 informative meioses, genotyped with SSLPs at increasing density

Result: candidate interval refined to 1-3 Mb

Issues:

Recombination cold spots

Polymorphism deserts

[Figure: SNP density map of mouse chromosome 1 (C57BL/6 v 129Sv)]


Candidate intervals

Chromosome 2: 20-21 Mb

Chromosome 11: 70-71 Mb

Heaven Hell


Candidate gene sequencing

Prioritize candidates for sequencing on the basis of:

Known function

Homology to other genes of known function

Tissue expression pattern

Domain structure

Exhaustive literature searches …


Candidate gene sequencing

1. Automated PCR primer design

2. Genomic PCR

3. Direct amplicon sequencing

4. Capillary electrophoresis

5. Sequence analysis

Robotic liquid handling

In-well template clean-up


• Tools used to develop GABOS/GAFEP

• Perl programming language for all programs.

• Web interface
  – HTML coding
  – PHP: inserted into HTML and processed by the webserver before the resulting HTML is sent to the client.
  – Javascript: processed by the client’s web browser (Mozilla Firefox or Safari, for example)


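WEHI Computing Layout

[Figure: data flow between UCSC, the WEHI servers and the client, summarised below.]

• UCSC: source of the genome data, downloaded by ftp.
• unix33: Genome DATA, shared over nfs.
• Unix Server = unix28, Webserver = apache.
  – Unix28 disk holds GABOS/GAFEP.
  – php processed here.
  – html produced here.
• wan/lan connection between server and client.
• Client = Mac, Windows. Browser = Firefox, IE …
  – html processed here.
  – Javascript acts here in response to user.
  – Display of GABOS/GAFEP here.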


• Web Interface Debugging tools
  – Firefox Error Console
  – Firebug add-on for Firefox


• Future Work:
  – Short term:
    • Finalize GABOS version 2
      – Transcript, DNA working
    • Complete data download maintenance program
    • Automate sorting of annotation files and modify GABOS to be aware of sorted/non-sorted data and act accordingly.
    • Include ability to retrieve RNA data
    • Will run on any unix server, not just unix28.
    • Web Interface available on WEHI’s public server.
    • Source code will be made freely available.
  – Longer term:
    • Retrieve data for UTRs, others?
    • Provide web interface access to annotation files.
    • Remove need for BioPerl to be installed.


Acknowledgements:

• Bioinformatics Division
  – Terry Speed & Gordon Smyth for the opportunity to pursue this project in an excellent environment.
  – All others in Bioinformatics for many and varied help.
• WEHI ITS
  – Nick Tan, Jakub Szarlat for Unix help.
  – Dung Tran, Scott Wood for network help.
  – Tri Le and John Nguyen for MS Windows support.
  – Tony Kyne & others in ITS for many questions answered.
• Molecular Medicine
  – Doug Hilton, Ben Kile for explaining their needs.
• Users for their feedback.
  – Kylie Greig, Adrienne Hilton, Greg Hather, Carolyn de Graaf …